-
Notifications
You must be signed in to change notification settings - Fork 10.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scrapy.selector Enhancement Proposal #906
Comments
+1 from me to a general idea of extracting Selector to a separate library. But I haven't checked the details. See also #395 (comment). Selector can already be used without the rest of scrapy: |
All right! Thank you:) |
+1 I thought of this too a while ago. While using |
+1 the hard part is how to untie |
+1 |
@dangra @kmike I'm interested to do it. I think there are two steps to the PR.
|
I don't like the idea of copy-pasting the selectors code to another |
@Digenis good point. I will test this with temporary selectors repo and see how it goes. |
It may be possible to move just one folder without forking the repo; I haven't tried it, but something like http://gbayer.com/development/moving-files-from-one-git-repository-to-another-preserving-history/ may work. |
I do occasionally play with such things. I hope I don't make it seem complicated I think the deliverables of this milestone should not be just the new repository We should be able on an empty repo to add and fetch scrapy as a remote That way we can see the actions we take to strip-down scrapy to the selectors. |
@Digenis I understand filter-branch will only bring in history to some extent. So the advantage of using your approach over filter-branch is that people get to see where did Selectors repo come from right? If I fork Scrapy and cut it short to Selectors then it answers that question.
Why is that required? Why would someone like to do that if there already exists Selectors repo? |
I probably got to much into details towards the end. I wanted to say that once you fork and start rewriting history, |
I second @Digenis on this, I splitted several repositories in the past and it never was in a single pass so I found the best way is to write a script (bash) to rerun on each debug iteration. Once you wrote the script it is simple to make it public. |
I can create a script. First I will see how it pans out with my commits over Scrapy's history. It's just now that I realized what's actually the hardest part :) |
Just to update you guys, I have spend some time with git-filter-branch and git-subtree and it seems to me git-filter-branch is the way to go. You can see a working script which extracts selector and related modules from scrapy into separate branch. https://gist.github.com/umrashrf/2f090797bdc25857325b It's still work in progress as I have to apply my changes on it to make selectors reusable. I will soon update the PR. |
Great, thanks Umair! |
My pleasure @kmike |
Sorry to find this at such a late stage. Otherwise there is subtree filtering and manual tree grafting, but as noted before it wouldn't be an honest history, missing some files where code was moved to/from in the meantime or renamed. |
@nyov I think they are okay with history provided by git-filter-branch. They need the script for historical reasons. The script will be outdated for sure but at any point in future they can see how was it brought into existence. After writing script, I found that it's indeed better approach to split the repo as you don't need to keep tabs on everything in your head :) |
No, not historical. Each line of code exists to do something specific and hopefully understandable It often complements well, even "self-documenting" code. To explain it in one sentence: "git blame gives you what you couldn't find in the code comments" |
So @Digenis are you happy with my current approach to split it? Can you leave your comments at #1007. I pushed script with patches so you can see umrashrf@dc6565d. |
Hmm, I don't think you got my point. The But if you generate a custom, filtered, history to include into the new project, you rewrote history, thereby invalidating I'm just saying this from my experience from rewriting and grafting a lot of project histories together (from svn imports and what not), I believe the edit: It's also possible to do both approaches. Maybe that'd satisfy everyone: |
I think nyov's made a better suggestion. Still I find your script interesting and something I would refer to next time I 'll be doing complicated filter-branch sequences. |
Guys how this history looks like? 😎 https://github.com/umrashrf/selectors/commits/selectors-20150329 FYI Added two custom references (ROOT for selectors and SCRAPY_HEAD for scrapy commit from where it was detached). There are instructions in root commit as how to bring back full history. Edit 1: Here are the steps I took to create it https://gist.github.com/umrashrf/3538a4394561007a753d |
Makes sense. I'd migrate to markdown later.
I also think atomic commits are better. Can you suggest a commit message? |
Something like "set up Selectors project environment" maybe. |
@nyov done those changes. Now only docs are pending. |
Sorry, I don't understand all this git voodoo :) But in https://github.com/umrashrf/selectors/tree/master it is possible to check the history of each file using github interface - it shows all relevant commits, "blame" works. In "selectors" branch it doesn't work. |
Why should we require users to "set up Selectors project environment"? |
wait, what? |
Sorry, I was unclear: why should we require developers who want to use git blame or git history to set up Selectors project environment? |
@kmike, I don't really like to write this again, it already feels like I was pushing for this choice, which wasn't my intent. I wasn't thinking about github's blame not working with this choice. My reason for suggesting git replace is that I think it keeps the history here clean and honest. In summary, if scrapy.git would be rewritten afterwards to remove selectors history, moving it here would feel right, to me. Since it isn't and scrapy.git is sticking around for the foreseeable future, where you can look it up, duplicating some mangled history (other sha1 sums, commit messages not quite fitting since files are missing from commits) felt not quite right, to me. That is my opinion, which I hoped to explain, nothing else. Feel free to disagree and whatever works best for everyone else is good enough for me. |
@kmike what about creating a separate branch at Selectors repo with full history? I mean the result of git-replace. Call that separate branch history or anything. |
What practical problems are caused by unclean dishonest duplicated history, changed sha1 sums and not-quite-fitting commit messages? Unfitting commit messages may cause some confusion, but IMHO they are better than nothing. Having full history can save some time on tracking the code changes, allow people to use github UI and local git without any setup, it also preserves contribution history.
I think that a separate branch won't help much once we start making more changes to selectors; one can always use Scrapy repository to see old changes. |
Yes we can write a script which can update that separate branch. The advantage is it gives full selectors history in github UI. |
This adds complexity and thus makes maintenance harder. It is better to agree on one of the options, be it "drop history before relocation, but provide instructions on how to get it back" or "provide pseudo-history". Sorry, I still don't get what is clean about dropping the history and why do you like it :) As @Digenis said, "git blame gives you what you couldn't find in the code comments", and pseudo-history gives you a meaningful git blame - it is a practical advantage. There must be something, because it seems so far there is +2 or +3 ( @umrashrf, @nyov and @Digenis?) for dropping pseudo-history vs +1 (me) for providing it. |
I don't like dropping the history and neither do they. git-filter-branch has a good chance to break git-blame usefulness by giving you partial history. I also wondered how and talked this through with @nyov on IRC so I'm just going to paste them here. I was going to paste it before but sorry I got lost in other stuff.
So based on these points what do you think @kmike? |
ah, that stupid irc log. :p @kmike, I disagree on "it also preserves contribution history", at least commit author and dates are overwritten. But count me as +/-0. Both approaches have dis/advantages, I'm not sure which to favor anymore in this case. |
Hi guys, let's draw a hard line stating clear we are not rewriting Scrapy history by any chance :) As for selectors repository history, I +1 to provide pseudo history and make it clear that the project was part of Scrapy and the exact history can only be found there. And please let's move on with the rest of the work to make Selectors and standalone package |
(AN: the same decision is now open for the new |
IMO the pseudo history should retain commit dates and authors |
That's not trivially possible with |
@nyov https://github.com/umrashrf/selectors/commits/master shows right authors and original commit dates ("authored on"), while preserving information about the fact Umair made these commits; https://github.com/umrashrf/selectors/graphs/contributors displays correct contributors and graphs with original ("authored on") commit dates. My git foo is weak, but author/commiter and "authored on"/ "commited on" distinction sounds right, and github supports it well enough. |
Diffing two exerpts from the original log ( Note how for example the last commit by Paul was authored in November but only actually committed (or amended) in January, and then merged by Pablo some time later. $ diff -up original.log rewritten.log
--- original.log 2015-04-06 16:35:53.987149100 +0000
+++ rewritten.log 2015-04-06 16:36:07.083113844 +0000
@@ -1,59 +1,48 @@
-commit 1f184ed7bf5cd32fb2b19bf42d4260d81d4d1cdd
+commit 931c737b760b034ee122ad70db56abcf6c9eff60
Author: Paul Tremberth <paul.tremberth@>
AuthorDate: Wed Jan 15 15:17:18 2014 +0100
-Commit: Paul Tremberth <paul.tremberth@>
-CommitDate: Fri Jan 17 20:33:32 2014 +0100
+Commit: Umair Ashraf <umr.ashrf@>
+CommitDate: Wed Mar 18 15:49:24 2015 +0500
Disable smart strings in lxml XPath evaluations
-commit 71ada5476e5b81d9f42e607d5bc0b7e2baa8ef93
-Merge: b9bb9be 827c0cf
-Author: Pablo Hoffman <pablo@>
-AuthorDate: Thu Jan 16 06:32:05 2014 -0800
-Commit: Pablo Hoffman <pablo@>
-CommitDate: Thu Jan 16 06:32:05 2014 -0800
-
- Merge pull request #472 from redapple/exslt
-
- Register EXSLT namespaces by default (resolves #470)
-
-commit b3be6e210de837c54fc50cf62fc9e34a504c705e
+commit f533dd51f000a10ffe9606c20331499cabee3e54
Author: Daniel Graña <dangra@>
AuthorDate: Fri Jan 3 17:32:40 2014 -0200
-Commit: Daniel Graña <dangra@>
-CommitDate: Thu Jan 16 09:53:59 2014 -0200
+Commit: Umair Ashraf <umr.ashrf@>
+CommitDate: Wed Mar 18 15:49:24 2015 +0500
warn XPathSelector deprecation on subclassing and direct instance
-commit 827c0cf51f47ab87b1de1c517ee7d5095fdb9c32
+commit 1902ec2ab677e12d6f096337df4bd187ef593838
Author: Paul Tremberth <paul.tremberth@>
AuthorDate: Wed Jan 15 15:00:25 2014 +0100
-Commit: Paul Tremberth <paul.tremberth@>
-CommitDate: Wed Jan 15 15:00:25 2014 +0100
+Commit: Umair Ashraf <umr.ashrf@>
+CommitDate: Wed Mar 18 15:49:24 2015 +0500
Rename "regexp" prefix to "re"
-commit a3eba68aca804e67f9a083c4d9c490da9a243b59
+commit 09ff763d410a3825ddfb6e7cf2c63a8d1781484a
Author: Paul Tremberth <paul.tremberth@>
AuthorDate: Wed Jan 15 12:28:25 2014 +0100
-Commit: Paul Tremberth <paul.tremberth@>
-CommitDate: Wed Jan 15 12:28:25 2014 +0100
+Commit: Umair Ashraf <umr.ashrf@>
+CommitDate: Wed Mar 18 15:49:24 2015 +0500
Drop EXSLT strings and math extensions
-commit 5df56a975884ed0680283ff6e301f8a4d32f4712
+commit 754663d3a29a6dfdc2fbd3c982dc68f3192bb703
Author: Paul Tremberth <paul.tremberth@>
AuthorDate: Wed Jan 8 18:48:20 2014 +0100
-Commit: Paul Tremberth <paul.tremberth@>
-CommitDate: Tue Jan 14 12:49:17 2014 +0100
+Commit: Umair Ashraf <umr.ashrf@>
+CommitDate: Wed Mar 18 15:49:24 2015 +0500
Applying suggestions from previous comments
-commit d46534cc547b4da79f9b47b4ccec40a16f1ca3c2
+commit 97e2405a0d61797885fd3cc90b390eb4be630930
Author: Paul Tremberth <paul.tremberth@>
AuthorDate: Sun Nov 24 04:41:14 2013 +0100
-Commit: Paul Tremberth <paul.tremberth@>
-CommitDate: Tue Jan 14 12:49:17 2014 +0100
+Commit: Umair Ashraf <umr.ashrf@>
+CommitDate: Wed Mar 18 15:49:24 2015 +0500
Register EXSLT namespaces by default (resolves #470)
edit: anonymized the emails |
Let me put the lid on this discussion. It seems I sowed dissent by making my viewpoint known here. For that I am sorry, I shall try to wrap it up. @dangra also said to keep commiter stuff (and merges too). That's currently not the case here as I showed, but is possible. That makes part of my argument invalid, so let's do this. It's possible using the index filter, my apologies for saying it wouldn't be that trivial -- I must have lost some gitfoo myself. It's only hard to figure out all the right files to keep. FILES="all the files/to/keep"
git filter-branch -d /tmp/gfb \
--prune-empty \
--tag-name-filter cat \
--index-filter 'git rm --ignore-unmatch --cached -qr -- . && git reset -q $GIT_COMMIT -- '"$FILES" \
-- --all The inverse, eradicating the listed files from history, can be done like this: FILES="all the files/to/kill"
git filter-branch -d /tmp/gfb \
--prune-empty \
--tag-name-filter cat \
--index-filter 'git ls-files -z | grep -z '"$FILES"' | xargs -0 git rm -qr --cached --ignore-unmatch DummyWhichNeverExists' \
-- --all Using both in tandem, and maybe repeatedly, it's possible to whittle down the tree and commits to only the stuff to keep. |
@nyov no no, i'm glad to do it. thanks for teaching me more git! |
Fixed by https://github.com/scrapy/parsel and #1409. Thanks to all people involved! |
Are selector tickets being moved from scrapy to parsel? |
@Digenis some of them should, but there is still scrapy.Selector which wraps parsel.Selector. Do you have specific tickets in mind? |
Yeah, we should revisit the tickets, thanks for the reminder. |
Make scrapy.selector into a separate project.
When we're scraping web pages, the most common task you need to perform is to extract data from the HTML source. There are several libraries available to achieve this: BeautifulSoup, lxml.
And although scrapy selectors are built over the lxml library, it is a better way to extract data from HTML source. In my opinion, people will prefer selector to BeautifulSoup and lxml , if make the scrapy.selectors into a separate project,
The text was updated successfully, but these errors were encountered: