Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scrapy.selector Enhancement Proposal #906

Closed
zwidny opened this issue Sep 26, 2014 · 60 comments
Closed

Scrapy.selector Enhancement Proposal #906

zwidny opened this issue Sep 26, 2014 · 60 comments

Comments

@zwidny
Copy link

zwidny commented Sep 26, 2014

Make scrapy.selector into a separate project.

When we're scraping web pages, the most common task you need to perform is to extract data from the HTML source. There are several libraries available to achieve this: BeautifulSoup, lxml.
And although scrapy selectors are built over the lxml library, it is a better way to extract data from HTML source. In my opinion, people will prefer selector to BeautifulSoup and lxml , if make the scrapy.selectors into a separate project,

@kmike
Copy link
Member

kmike commented Sep 26, 2014

+1 from me to a general idea of extracting Selector to a separate library. But I haven't checked the details. See also #395 (comment). Selector can already be used without the rest of scrapy: Selector(text=html), but I like an idea of making the separation cleaner if it is possible. To move forward this needs a more detailed discussion/proposal, +1s from other Scrapy commiters and a person who will actually implement it :)

@zwidny
Copy link
Author

zwidny commented Sep 26, 2014

All right! Thank you:)

@zwidny zwidny closed this as completed Sep 26, 2014
@umrashrf
Copy link

+1

I thought of this too a while ago. While using requests for on-demand scraping, I installed Scrapy just to use the chained Selectors.

@dangra
Copy link
Member

dangra commented Oct 7, 2014

+1 the hard part is how to untie scrapy.http.Response and the now over-engineered parsing-cache provided by scrapy.selector.lxmldocument.LxmlDocument. I guess accepting Response objects (and its caching) can be implemented inside Scrapy extending from the Selector class provided by the separate library.

@kmike kmike reopened this Oct 7, 2014
@Digenis
Copy link
Member

Digenis commented Oct 8, 2014

+1
I totally agree with dangra. Also, this "hard part", will prevent the base Selector from clutter in the future.
And this doesn't endanger the Selector, it can not just die aside on a separate project.

@umrashrf
Copy link

umrashrf commented Jan 7, 2015

@dangra @kmike I'm interested to do it.

I think there are two steps to the PR.

  • First step is to create a separate package with working selectors in its own repository at github.com/scrapy/selectors
    I have successfully untied Selector code from Scrapy and have created separate package. I made some tests and It is working fine locally. I completely removed response caching from LxmlDocument in hope to implement it at Scrapy side.
  • Second part is replacing Scrapy selectors with new library
    Here I need to dig how to use new selectors with response object efficiently having same kind of caching.

@Digenis
Copy link
Member

Digenis commented Jan 27, 2015

I don't like the idea of copy-pasting the selectors code to another
repository. All the code history disappears. I'd even prefer forking
the scrapy repository it self and start from there, instead of creating
a partial copy of the current state of a subdirectoy.

@umrashrf
Copy link

@Digenis good point. I will test this with temporary selectors repo and see how it goes.

@kmike
Copy link
Member

kmike commented Jan 27, 2015

It may be possible to move just one folder without forking the repo; I haven't tried it, but something like http://gbayer.com/development/moving-files-from-one-git-repository-to-another-preserving-history/ may work.

@Digenis
Copy link
Member

Digenis commented Jan 27, 2015

I do occasionally play with such things.
Of course we will not end up with a "honest" history
and it will take several filter-branch commands to get there.

I hope I don't make it seem complicated
but I am suggesting that since we are rewriting history
we should keep a meta-history of how we rewrote it.

I think the deliverables of this milestone should not be just the new repository
but also a list of git commands or a small shell script that create it (reproducible result).

We should be able on an empty repo to add and fetch scrapy as a remote
and after checking out some specific reference, run this list of commands
to create the individual repo.

That way we can see the actions we take to strip-down scrapy to the selectors.
Otherwise we are left only with the final result of each individual attempt to rewrite,
with no way to frame the changes (we will not have diffs to read)
and have something we can understand and comment on.

@umrashrf
Copy link

@Digenis I understand filter-branch will only bring in history to some extent. So the advantage of using your approach over filter-branch is that people get to see where did Selectors repo come from right? If I fork Scrapy and cut it short to Selectors then it answers that question.

We should be able on an empty repo to add and fetch scrapy as a remote
and after checking out some specific reference, run this list of commands
to create the individual repo.

Why is that required? Why would someone like to do that if there already exists Selectors repo?

@Digenis
Copy link
Member

Digenis commented Jan 28, 2015

I probably got to much into details towards the end.
I only argue about showing what you did to separate the selectors.
Of course people will know where the repo comes from.

I wanted to say that once you fork and start rewriting history,
you will be updating all the references and your work on it
will not be visible as in a typical version control workflow.
I expect this to be interesting enough to be exposed.
eg: by publishing the commands you ran to achieve this.

@dangra
Copy link
Member

dangra commented Jan 28, 2015

I second @Digenis on this, I splitted several repositories in the past and it never was in a single pass so I found the best way is to write a script (bash) to rerun on each debug iteration.

Once you wrote the script it is simple to make it public.

@umrashrf
Copy link

I can create a script. First I will see how it pans out with my commits over Scrapy's history. It's just now that I realized what's actually the hardest part :)

@umrashrf
Copy link

Just to update you guys, I have spend some time with git-filter-branch and git-subtree and it seems to me git-filter-branch is the way to go.

You can see a working script which extracts selector and related modules from scrapy into separate branch. https://gist.github.com/umrashrf/2f090797bdc25857325b

It's still work in progress as I have to apply my changes on it to make selectors reusable. I will soon update the PR.

@kmike
Copy link
Member

kmike commented Feb 12, 2015

Great, thanks Umair!

@umrashrf
Copy link

My pleasure @kmike

@nyov
Copy link
Contributor

nyov commented Feb 27, 2015

Sorry to find this at such a late stage.
As to history filtering, in this instance I don't so much see the necessity to keep the history (duplicated) in a new repository after splitting Selector code.
The old history is still right here!
Put a link or reference in the new root commit to the scrapy commit it was filtered out of, maybe push a git-replace "graftpoint" as well, and be done with it.

Otherwise there is subtree filtering and manual tree grafting, but as noted before it wouldn't be an honest history, missing some files where code was moved to/from in the meantime or renamed.

@umrashrf
Copy link

@nyov I think they are okay with history provided by git-filter-branch. They need the script for historical reasons. The script will be outdated for sure but at any point in future they can see how was it brought into existence.

After writing script, I found that it's indeed better approach to split the repo as you don't need to keep tabs on everything in your head :)

@Digenis
Copy link
Member

Digenis commented Mar 18, 2015

No, not historical.

Each line of code exists to do something specific and hopefully understandable
but the trial and error process that made someone write spawns a longer story.

It often complements well, even "self-documenting" code.
Because one day you will end up reading something counter-intuitive
that can make you waste time figuring out "how"
instead of looking in the history and stopping in the "why".

To explain it in one sentence: "git blame gives you what you couldn't find in the code comments"

@umrashrf
Copy link

So @Digenis are you happy with my current approach to split it? Can you leave your comments at #1007. I pushed script with patches so you can see umrashrf@dc6565d.

@nyov
Copy link
Contributor

nyov commented Mar 18, 2015

Hmm, I don't think you got my point. The git blame history of the project is still right here. With a git-replace ref "symlink" or pointer, the whole history of the newly detached project can be recovered by someone merging both project repositories locally (read the git-replace link I posted earlier).

But if you generate a custom, filtered, history to include into the new project, you rewrote history, thereby invalidating git blame's usefulness. filter-branch doesn't trivially cover merges, meaningful commit information may get lost. Doing it "wrong", you'll also overwrite the real commit author and date information. (Ofc that's actually the right way, since you rewrote history.)
So by creating a new history this way, you instead make it harder for someone to get the correct, canonical history (besides publishing a duplicated but different history).

I'm just saying this from my experience from rewriting and grafting a lot of project histories together (from svn imports and what not), I believe the filter-branch approach is the wrong one here instead of the git-replace approach.
OTOH if everyone else agrees on filter-branch, understanding these points I raised, don't let me stop you from going down that road.

edit: It's also possible to do both approaches. Maybe that'd satisfy everyone:
Publishing this rewritten history, and adding a git-replace ref at the time of the split, someone could override the generated history by pulling in the real one from the scrapy repo.

@umrashrf
Copy link

I think you make good points @nyov. I will play with git-replace over the weekend.

@Digenis @dangra @kmike what do you think about this approach?

@Digenis
Copy link
Member

Digenis commented Mar 19, 2015

I think nyov's made a better suggestion.
I said "history", at some point, then "filter-branch" and didn't think further.
Indeed it would be more convenient if nyov saw this earlier.

Still I find your script interesting and something I would refer to next time I 'll be doing complicated filter-branch sequences.

@umrashrf
Copy link

Guys how this history looks like? 😎 https://github.com/umrashrf/selectors/commits/selectors-20150329

FYI Added two custom references (ROOT for selectors and SCRAPY_HEAD for scrapy commit from where it was detached). There are instructions in root commit as how to bring back full history.

Edit 1: Here are the steps I took to create it https://gist.github.com/umrashrf/3538a4394561007a753d

@umrashrf
Copy link

Now looking at the commits, your new README.md should probably be an .rst file, scrapy projects aren't using markdown anywhere else that I know of.

Makes sense. I'd migrate to markdown later.

And just maybe you can squash some of these commits,

I also think atomic commits are better. Can you suggest a commit message?

@nyov
Copy link
Contributor

nyov commented Mar 30, 2015

Can you suggest a commit message?

Something like "set up Selectors project environment" maybe.

@umrashrf
Copy link

umrashrf commented Apr 2, 2015

@nyov done those changes. Now only docs are pending.

@kmike
Copy link
Member

kmike commented Apr 2, 2015

Sorry, I don't understand all this git voodoo :) But in https://github.com/umrashrf/selectors/tree/master it is possible to check the history of each file using github interface - it shows all relevant commits, "blame" works. In "selectors" branch it doesn't work.

@kmike
Copy link
Member

kmike commented Apr 2, 2015

Why should we require users to "set up Selectors project environment"?

@nyov
Copy link
Contributor

nyov commented Apr 2, 2015

wait, what?
oh no no no, "set up Selectors project environment" was meant as a commit message, not some user instruction. Could also be "shuffling some files around and adding a new README".

@kmike
Copy link
Member

kmike commented Apr 2, 2015

Sorry, I was unclear: why should we require developers who want to use git blame or git history to set up Selectors project environment?

@nyov
Copy link
Contributor

nyov commented Apr 2, 2015

@kmike, I don't really like to write this again, it already feels like I was pushing for this choice, which wasn't my intent. I wasn't thinking about github's blame not working with this choice.

My reason for suggesting git replace is that I think it keeps the history here clean and honest.
If we include a copy of a rewritten history here, it's a partial duplicate to scrapy's and potentially not quite right (from dropping files in the rewrite).

In summary, if scrapy.git would be rewritten afterwards to remove selectors history, moving it here would feel right, to me. Since it isn't and scrapy.git is sticking around for the foreseeable future, where you can look it up, duplicating some mangled history (other sha1 sums, commit messages not quite fitting since files are missing from commits) felt not quite right, to me.

That is my opinion, which I hoped to explain, nothing else. Feel free to disagree and whatever works best for everyone else is good enough for me.

@umrashrf
Copy link

umrashrf commented Apr 4, 2015

Sorry, I don't understand all this git voodoo :) But in https://github.com/umrashrf/selectors/tree/master it is possible to check the history of each file using github interface - it shows all relevant commits, "blame" works. In "selectors" branch it doesn't work.

@kmike what about creating a separate branch at Selectors repo with full history? I mean the result of git-replace. Call that separate branch history or anything.

@kmike
Copy link
Member

kmike commented Apr 4, 2015

My reason for suggesting git replace is that I think it keeps the history here clean and honest.
If we include a copy of a rewritten history here, it's a partial duplicate to scrapy's and potentially not quite right (from dropping files in the rewrite).

In summary, if scrapy.git would be rewritten afterwards to remove selectors history, moving it here would feel right, to me. Since it isn't and scrapy.git is sticking around for the foreseeable future, where you can look it up, duplicating some mangled history (other sha1 sums, commit messages not quite fitting since files are missing from commits) felt not quite right, to me.

What practical problems are caused by unclean dishonest duplicated history, changed sha1 sums and not-quite-fitting commit messages? Unfitting commit messages may cause some confusion, but IMHO they are better than nothing. Having full history can save some time on tracking the code changes, allow people to use github UI and local git without any setup, it also preserves contribution history.

@kmike what about creating a separate branch at Selectors repo with full history? I mean the result of git-replace. Call that separate branch history or anything.

I think that a separate branch won't help much once we start making more changes to selectors; one can always use Scrapy repository to see old changes.

@umrashrf
Copy link

umrashrf commented Apr 4, 2015

I think that a separate branch won't help much once we start making more changes to selectors; one can always use Scrapy repository to see old changes.

Yes we can write a script which can update that separate branch. The advantage is it gives full selectors history in github UI.

@kmike
Copy link
Member

kmike commented Apr 4, 2015

Yes we can write a script which can update that separate branch. The advantage is it gives full selectors history in github UI.

This adds complexity and thus makes maintenance harder. It is better to agree on one of the options, be it "drop history before relocation, but provide instructions on how to get it back" or "provide pseudo-history".

Sorry, I still don't get what is clean about dropping the history and why do you like it :) As @Digenis said, "git blame gives you what you couldn't find in the code comments", and pseudo-history gives you a meaningful git blame - it is a practical advantage. There must be something, because it seems so far there is +2 or +3 ( @umrashrf, @nyov and @Digenis?) for dropping pseudo-history vs +1 (me) for providing it.

@umrashrf
Copy link

umrashrf commented Apr 4, 2015

Sorry, I still don't get what is clean about dropping the history and why do you like it :)

I don't like dropping the history and neither do they. git-filter-branch has a good chance to break git-blame usefulness by giving you partial history. I also wondered how and talked this through with @nyov on IRC so I'm just going to paste them here. I was going to paste it before but sorry I got lost in other stuff.

nyov: well, assume at some point parts of the file were moved there from a different file. if you run filter-branch, and removing that other file, the source is lost
nyov: or if the file was renamed in the past, but you keep only one filename through the rewrite.
nyov: or, well, look at the history here: https://github.com/umrashrf/selectors/commits/master
nyov: all the commits now have you as the Committer, meaning the historical commit date is lost - not much use for git blame, right?
nyov: it's not so bad if the original Author was the commiter, too. but if they were different that information is lost
nyov: anyway. I see your reason for wanting to keep the history there. I'm very much for not losing history, myself.
nyov: It would make more sense if you rewrite scrapy.git at the same time, removing all references to selectors, ergo splitting repositories. But since this is a duplicate history, I just felt it more a waste of space to keep a modified history as well

So based on these points what do you think @kmike?

@nyov
Copy link
Contributor

nyov commented Apr 4, 2015

ah, that stupid irc log. :p

@kmike, I disagree on "it also preserves contribution history", at least commit author and dates are overwritten. But count me as +/-0. Both approaches have dis/advantages, I'm not sure which to favor anymore in this case.

@dangra
Copy link
Member

dangra commented Apr 4, 2015

Hi guys, let's draw a hard line stating clear we are not rewriting Scrapy history by any chance :)

As for selectors repository history, I +1 to provide pseudo history and make it clear that the project was part of Scrapy and the exact history can only be found there.

And please let's move on with the rest of the work to make Selectors and standalone package

@nyov
Copy link
Contributor

nyov commented Apr 4, 2015

(AN: the same decision is now open for the new scrapy-djangoitems project)

@dangra
Copy link
Member

dangra commented Apr 4, 2015

IMO the pseudo history should retain commit dates and authors

@nyov
Copy link
Contributor

nyov commented Apr 4, 2015

That's not trivially possible with git filter-branch (design decision being: it's been changed by someone, it should state as much). But it can certainly be hacked somehow.

@kmike
Copy link
Member

kmike commented Apr 6, 2015

I disagree on "it also preserves contribution history", at least commit author and dates are overwritten.

@nyov https://github.com/umrashrf/selectors/commits/master shows right authors and original commit dates ("authored on"), while preserving information about the fact Umair made these commits; https://github.com/umrashrf/selectors/graphs/contributors displays correct contributors and graphs with original ("authored on") commit dates. My git foo is weak, but author/commiter and "authored on"/ "commited on" distinction sounds right, and github supports it well enough.

@nyov
Copy link
Contributor

nyov commented Apr 6, 2015

Diffing two exerpts from the original log (git log -- scrapy/selector/) and the rewritten repo (git log), this is the information that changed.

Note how for example the last commit by Paul was authored in November but only actually committed (or amended) in January, and then merged by Pablo some time later.
The merge commit is dropped completely, the actual commit date by Paul is now the rewrite date of Umair. The actual author information is the only one staying intact.
This is what I mean by losing commit author and date information. If we don't care about this secondary info here, that's fine by me.

$ diff -up original.log rewritten.log 
--- original.log    2015-04-06 16:35:53.987149100 +0000
+++ rewritten.log   2015-04-06 16:36:07.083113844 +0000
@@ -1,59 +1,48 @@
-commit 1f184ed7bf5cd32fb2b19bf42d4260d81d4d1cdd
+commit 931c737b760b034ee122ad70db56abcf6c9eff60
 Author:     Paul Tremberth <paul.tremberth@>
 AuthorDate: Wed Jan 15 15:17:18 2014 +0100
-Commit:     Paul Tremberth <paul.tremberth@>
-CommitDate: Fri Jan 17 20:33:32 2014 +0100
+Commit:     Umair Ashraf <umr.ashrf@>
+CommitDate: Wed Mar 18 15:49:24 2015 +0500

     Disable smart strings in lxml XPath evaluations

-commit 71ada5476e5b81d9f42e607d5bc0b7e2baa8ef93
-Merge: b9bb9be 827c0cf
-Author:     Pablo Hoffman <pablo@>
-AuthorDate: Thu Jan 16 06:32:05 2014 -0800
-Commit:     Pablo Hoffman <pablo@>
-CommitDate: Thu Jan 16 06:32:05 2014 -0800
-
-    Merge pull request #472 from redapple/exslt
-    
-    Register EXSLT namespaces by default (resolves #470)
-
-commit b3be6e210de837c54fc50cf62fc9e34a504c705e
+commit f533dd51f000a10ffe9606c20331499cabee3e54
 Author:     Daniel Graña <dangra@>
 AuthorDate: Fri Jan 3 17:32:40 2014 -0200
-Commit:     Daniel Graña <dangra@>
-CommitDate: Thu Jan 16 09:53:59 2014 -0200
+Commit:     Umair Ashraf <umr.ashrf@>
+CommitDate: Wed Mar 18 15:49:24 2015 +0500

     warn XPathSelector deprecation on subclassing and direct instance

-commit 827c0cf51f47ab87b1de1c517ee7d5095fdb9c32
+commit 1902ec2ab677e12d6f096337df4bd187ef593838
 Author:     Paul Tremberth <paul.tremberth@>
 AuthorDate: Wed Jan 15 15:00:25 2014 +0100
-Commit:     Paul Tremberth <paul.tremberth@>
-CommitDate: Wed Jan 15 15:00:25 2014 +0100
+Commit:     Umair Ashraf <umr.ashrf@>
+CommitDate: Wed Mar 18 15:49:24 2015 +0500

     Rename "regexp" prefix to "re"

-commit a3eba68aca804e67f9a083c4d9c490da9a243b59
+commit 09ff763d410a3825ddfb6e7cf2c63a8d1781484a
 Author:     Paul Tremberth <paul.tremberth@>
 AuthorDate: Wed Jan 15 12:28:25 2014 +0100
-Commit:     Paul Tremberth <paul.tremberth@>
-CommitDate: Wed Jan 15 12:28:25 2014 +0100
+Commit:     Umair Ashraf <umr.ashrf@>
+CommitDate: Wed Mar 18 15:49:24 2015 +0500

     Drop EXSLT strings and math extensions

-commit 5df56a975884ed0680283ff6e301f8a4d32f4712
+commit 754663d3a29a6dfdc2fbd3c982dc68f3192bb703
 Author:     Paul Tremberth <paul.tremberth@>
 AuthorDate: Wed Jan 8 18:48:20 2014 +0100
-Commit:     Paul Tremberth <paul.tremberth@>
-CommitDate: Tue Jan 14 12:49:17 2014 +0100
+Commit:     Umair Ashraf <umr.ashrf@>
+CommitDate: Wed Mar 18 15:49:24 2015 +0500

     Applying suggestions from previous comments

-commit d46534cc547b4da79f9b47b4ccec40a16f1ca3c2
+commit 97e2405a0d61797885fd3cc90b390eb4be630930
 Author:     Paul Tremberth <paul.tremberth@>
 AuthorDate: Sun Nov 24 04:41:14 2013 +0100
-Commit:     Paul Tremberth <paul.tremberth@>
-CommitDate: Tue Jan 14 12:49:17 2014 +0100
+Commit:     Umair Ashraf <umr.ashrf@>
+CommitDate: Wed Mar 18 15:49:24 2015 +0500

     Register EXSLT namespaces by default (resolves #470)

edit: anonymized the emails

@nyov
Copy link
Contributor

nyov commented Apr 7, 2015

Let me put the lid on this discussion. It seems I sowed dissent by making my viewpoint known here. For that I am sorry, I shall try to wrap it up.
If taken as a precedent case, this currently looks like a blocker for the selectors, scrapyd-client and upcoming scrapy-djangoitem repos and that is bad. @dangra put his foot down, @kmike agrees, I'll follow. I converted @Digenis and @umrashrf ;) I guess, but I assume they'll follow, too?

@dangra also said to keep commiter stuff (and merges too). That's currently not the case here as I showed, but is possible. That makes part of my argument invalid, so let's do this.
scrapy-jsonrpc seems the actual precedent case with included history (I hadn't noticed before), and it looks like dangra handled it then, in a way that kept merges and committer information AFAICS.

It's possible using the index filter, my apologies for saying it wouldn't be that trivial -- I must have lost some gitfoo myself. It's only hard to figure out all the right files to keep.
@umrashrf, I feel bad for all the work you did already, if I should pitch in, let me know!
Here's a quick way to kill everything but the files you want to keep, but it requires listing all the possible paths that those files are on, in history (renames and moves included).

FILES="all the files/to/keep"
git filter-branch -d /tmp/gfb \
    --prune-empty \
    --tag-name-filter cat \
    --index-filter 'git rm --ignore-unmatch --cached -qr -- . && git reset -q $GIT_COMMIT -- '"$FILES" \
    -- --all

The inverse, eradicating the listed files from history, can be done like this:

FILES="all the files/to/kill"
git filter-branch -d /tmp/gfb \
    --prune-empty \
    --tag-name-filter cat \
    --index-filter 'git ls-files -z | grep -z '"$FILES"' | xargs -0 git rm -qr --cached --ignore-unmatch DummyWhichNeverExists' \
    -- --all

Using both in tandem, and maybe repeatedly, it's possible to whittle down the tree and commits to only the stuff to keep.

@umrashrf
Copy link

umrashrf commented Apr 7, 2015

@nyov no no, i'm glad to do it. thanks for teaching me more git!

@kmike
Copy link
Member

kmike commented Aug 12, 2015

Fixed by https://github.com/scrapy/parsel and #1409. Thanks to all people involved!

@kmike kmike closed this as completed Aug 12, 2015
@Digenis
Copy link
Member

Digenis commented Aug 26, 2015

Are selector tickets being moved from scrapy to parsel?

@kmike
Copy link
Member

kmike commented Aug 26, 2015

@Digenis some of them should, but there is still scrapy.Selector which wraps parsel.Selector. Do you have specific tickets in mind?

@Digenis
Copy link
Member

Digenis commented Aug 26, 2015

I don't have specific issues in mind, I am just reminding.
A quick search shows issues #753 #842 #765.
Pull requests will probably require more work than copying the tickets.

@kmike
Copy link
Member

kmike commented Aug 26, 2015

Yeah, we should revisit the tickets, thanks for the reminder.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants