
Question: how to store data in json #137

Closed
SophiaCY opened this issue Feb 10, 2015 · 17 comments

Comments

@SophiaCY

Hi, I recently installed Portia and it's quite good. But since I'm not very familiar with it, I've run into a problem:
In my template I used variants to scrape the website, and it worked well when I clicked "continue browsing" or navigated to a similar page. However, when I ran the Portia spider from the command line, it scraped the data successfully but failed to store all of it in the JSON file: only the data from the page I annotated ended up in the file.
I suspect the cause is this warning, which appeared while the spider was running: "WARNING: Dropped: Duplicate product scraped at http://bbs.chinadaily.com.cn/forum-83-206.html, first one was scraped at http://bbs.chinadaily.com.cn/forum-83-2.html", but I don't know how to solve it.
I hope someone can help me as soon as possible.
Thanks very much!

@SophiaCY SophiaCY changed the title store data in json Question: how to store data in json Feb 10, 2015
@ruairif
Contributor

ruairif commented Feb 10, 2015

For portiacrawl, if you add -o FILENAME.json, items will be saved to the file you specify.
Running portiacrawl -h gives more info.

@SophiaCY
Author

Sorry, maybe I didn't express my problem clearly.
I want to scrape data from chinadaily's forum, so I annotated the page http://bbs.chinadaily.com.cn/forum-83-2.html, and in the crawling settings I set the follow pattern to http://bbs.chinadaily.com.cn/forum-83-\d+.html:

[screenshot]

In my template I use variants to scrape items:

[screenshot]

and it works well when I click "continue browsing" or turn to a similar page:

[screenshot]

So I run the spider in the console:
cd C:\Users\Administrator\portia\portia\slybot\slybot
portiacrawl C:\Users\Administrator\portia\portia\slyd\data\projects\new_project bbs.chinadaily.com.cn -o itemsss.json -t json

Then I can see the data I want in my console, but when the spider finishes, only the items from http://bbs.chinadaily.com.cn/forum-83-2.html are stored:

[screenshot]

I don't know what I'm missing...

Thanks very much!!!;)

@ruairif
Contributor

ruairif commented Feb 11, 2015

What are your start urls?

@SophiaCY
Author

http://bbs.chinadaily.com.cn/forum-83-2.html
and it's also the page I annotated.

[screenshot]

@ruairif
Contributor

ruairif commented Feb 11, 2015

Which links are green when you check the 'Overlay blocked links' box?

@SophiaCY
Author

Just the links shown in the picture:

[screenshot]

@ruairif
Contributor

ruairif commented Feb 11, 2015

Does it work if you create a file slybot/local_slybot_settings.py with the following content?

ITEM_PIPELINES = {'slybot.dupefilter.DupeFilterPipeline': None}

@SophiaCY
Author

Yes, it works when I store the data in a JSON file. Thank you very much!!!;)
However, because the data in the JSON file can't be imported into the database directly, I have to store the data in a CSV file. But when I run the spider with:
portiacrawl C:\Users\Administrator\portia\portia\slyd\data\projects\new_project bbs.chinadaily.com.cn -t csv -o p.csv
the spider runs successfully, but only the names of the items are stored in the CSV file.

[screenshots]

Thanks a lot!
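As a side note, if the JSON export itself is complete, one workaround is to convert it to CSV offline rather than relying on the CSV exporter. This is a minimal sketch, assuming the items were saved as a JSON array; the function name and the field names used below are hypothetical, not part of Portia:

```python
import csv
import json


def json_items_to_csv(json_path, csv_path):
    """Convert a JSON array of scraped items into a flat CSV file.

    Column names are the union of the keys across all items; list
    values (as produced by multi-valued annotations) are joined with
    '; ' so each item fits on one CSV row.
    """
    with open(json_path, encoding="utf-8") as f:
        items = json.load(f)

    # Collect every field name that appears in any item, in order.
    fieldnames = []
    for item in items:
        for key in item:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for item in items:
            writer.writerow({
                key: "; ".join(map(str, value)) if isinstance(value, list) else value
                for key, value in item.items()
            })
```

Items with missing fields simply get empty cells, since `DictWriter` fills absent keys with an empty string by default.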

@ruairif
Contributor

ruairif commented Feb 12, 2015

I'm not sure what's happening with your CSV file, but with your JSON file there are a few other things you could try that might make it more suitable for insertion into the database.

The duplicate filter was being triggered because all of your fields are marked as variant, so only the first item is kept. You could solve this by adding a dummy non-variant field that is different on each page.
Another option would be to use the Split Variants Middleware to create individual items for each entry. I'm not sure if these instructions are exactly correct, but it would be great if you could try them out:

pip install scrapylib

Then in your local_slybot_settings.py file add:

SPIDER_MIDDLEWARES = {'slybot.spiderlets.SpiderletsMiddleware': 998,
                      'scrapylib.splitvariants.SplitVariantsMiddleware': 999}
SPLITVARIANTS_ENABLED = True

Please report back if it goes well or if you need more help.
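Conceptually, splitting variants turns one item whose "variants" key holds a list of alternatives into one flat item per alternative, each merged with the shared top-level fields. This stand-alone sketch illustrates the idea; it is not the actual SplitVariantsMiddleware code:

```python
def split_variants(item):
    """Expand an item's 'variants' list into individual flat items.

    Each variant dict is merged with the item's shared top-level
    fields; items without variants pass through unchanged.
    """
    variants = item.get("variants")
    if not variants:
        return [item]
    shared = {key: value for key, value in item.items() if key != "variants"}
    return [{**shared, **variant} for variant in variants]
```

For example, an item with a URL and two variant titles becomes two items, each carrying the URL and one title.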

@SophiaCY
Author

Thanks a lot!
I tried your second method and it successfully created individual items for each entry.
Here is the result:

before:
[screenshot]

after:
[screenshot]

By the way, I then tried to store the data in a CSV file and it was stored successfully! I think it was your second method that solved that problem too.
It's very nice of you! Thank you very much!

@ruairif
Contributor

ruairif commented Feb 12, 2015

You're very welcome.
Thank you for testing it. I'll add that info to the README the next time I update it.

@ruairif ruairif closed this as completed Feb 12, 2015
@SophiaCY
Author

My pleasure ^_^

@maoouyang

How can I store the data in MySQL? I'm puzzled!!

@maoouyang

I'm using Portia for the first time. I've stored the data to JSON, but I don't know how to store it in MySQL. Thanks.
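The usual approach for database storage is a Scrapy item pipeline that inserts each scraped item into a table. The sketch below is not Portia's own code: it uses the standard-library sqlite3 module to stay self-contained, and the class, table, and column names are hypothetical. For MySQL you would swap the connection for a MySQL client (e.g. MySQLdb) and change the `?` placeholders to `%s`:

```python
import sqlite3


class DatabasePipeline:
    """Minimal Scrapy-style item pipeline that writes each scraped
    item into a database table (sqlite3 here; adapt for MySQL)."""

    def __init__(self, db_path="items.db"):
        self.db_path = db_path

    def open_spider(self, spider):
        # Called once when the spider starts; set up the connection.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (url TEXT, title TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every scraped item; insert and pass it on.
        self.conn.execute(
            "INSERT INTO items (url, title) VALUES (?, ?)",
            (item.get("url"), item.get("title")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

You would then enable the pipeline by adding its import path to the ITEM_PIPELINES dict in local_slybot_settings.py, alongside the other settings discussed above.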

@ghost

ghost commented Apr 8, 2016

Same problem here. Hello maoouyang, have you figured out how to do it?

@AlexTan-b-z

I used the same settings as you, but I always get:
2016-08-25 16:03:37 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Why? And the JSON file is empty; it only contains [.
Can anyone help me? Thank you!

@AlexTan-b-z

Now my question is the same as yours:
the JSON file only stores the first page, and the CSV file only contains the names of the items.
However, even when I use both of the methods above, it doesn't work.
What can I do?
