
download the data #8

Closed
juewang1996 opened this issue Feb 24, 2019 · 36 comments

Comments

@juewang1996

Is the data still open to the public? I clicked the link but can't download the data.

@nabanita-

I am trying to download it as well. The AWS link is showing the following error:
[screenshot of the AWS error, 2019-02-24]

@ushetaks

Hi, is the data still available for download?

@several27
Owner

Apologies for the inconvenience, the download is temporarily down. I'm working on bringing it back.

If anyone has stored a local copy of the dataset, I'd appreciate them sending it over.

@pmacinec

pmacinec commented Mar 2, 2019

I probably have a local copy, so I can upload it somewhere. I also have a subset of only fake and reliable news that someone mentioned in another issue, if needed. I can probably upload it to OneDrive, if it's possible to upload such large files there. Or do you have any idea where to upload it?

@AIRLegend

I've just lost my copy this morning... Maybe you could use the AWS free tier to upload the dataset to a bucket?

@several27
Owner

@pmacinec If you can upload it to OneDrive and send a link over, that'd be amazing (maciej[at]researchably.com). I'll copy it over to a new cloud and share a free public link here. Thanks!

@pmacinec

pmacinec commented Mar 3, 2019

@several27 I have already sent you a download link, so I hope everyone can download it from the new cloud soon.

@gao-xian-peh

@pmacinec, would you mind sending me the link to download it too?

Appreciate it!

@pmacinec

pmacinec commented Mar 4, 2019

Yes, of course, just send me your email. Or should I send it to the email you have public on your profile?

@ushetaks

ushetaks commented Mar 4, 2019 via email

@juewang1996
Author

@pmacinec, could you please send me the link to download it too? I need to use it very urgently. Thank you! My email is wjhappy96@gmail.com

@ushetaks

ushetaks commented Mar 4, 2019 via email

@pmacinec

pmacinec commented Mar 4, 2019

Sent to both of you. Hope @several27 will upload it soon.

@ghost

ghost commented Mar 4, 2019

Hi @pmacinec, can I trouble you to send the link to me as well? Thanks! redcurries@gmail.com

@Kerrah

Kerrah commented Mar 4, 2019

Hey @pmacinec, sorry for bugging you, but could you send me the link as well? Thank you very much :) dawandd@hotmail.com

@gao-xian-peh

@pmacinec I would appreciate it if it could be sent to my email address at pehgaoxian@gmail.com

Thank you once again! :)

@several27
Owner

Hi all, thank you very much for your patience.

Thanks to @pmacinec, the dataset can now be downloaded from: https://storage.googleapis.com/researchably-fake-news-recognition/news_cleaned_2018_02_13.csv.zip

@Ierpier

Ierpier commented Mar 5, 2019

Would it be possible to also upload the subset of only fake and reliable articles that @pmacinec discussed?

@pmacinec

pmacinec commented Mar 5, 2019

If @several27 wants, I can also upload this subset, or do it just for you (or maybe share code to get only those articles from the whole dataset, processed in chunks). Just let me know.

@Ierpier

Ierpier commented Mar 5, 2019

An upload of the subset would be great. I don't have easy access to resources to easily process the entire dataset, which is why such a subset would be very convenient in practice. Let's wait for @several27 to see if he agrees with you sharing this, either publicly or privately. Many thanks in advance!

@several27
Owner

@Ierpier to read the dataset with minimal RAM usage, use the 'chunksize' parameter in pandas.

E.g.: https://cmdlinetips.com/2018/01/how-to-load-a-massive-file-as-small-chunks-in-pandas/
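For illustration, a minimal sketch of that approach (the corpus file name and the 'type' label column are assumed from elsewhere in this thread), which streams the CSV so only one chunk is in memory at a time, here just to count how many articles carry each label:

import pandas as pd

# Stream the corpus in chunks so only `chunksize` rows are held in memory at once.
label_counts = {}
for chunk in pd.read_csv('news_cleaned_2018_02_13.csv', chunksize=100000):
    for label, count in chunk['type'].value_counts().items():
        label_counts[label] = label_counts.get(label, 0) + count
print(label_counts)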

@Ierpier

Ierpier commented Mar 6, 2019

@several27 I tried something like this and it mostly worked, but I ran into some issues after several chunks (not sure why). Would it be okay for @pmacinec to share the subset with only fake and real articles? Either privately with me via a link by email, or publicly with a link here?

@Ierpier

Ierpier commented Mar 9, 2019

@pmacinec @several27 I don't mean to be a bother, but is this still an option (uploading the fake/reliable subset)? I could post an email address that you could share a link to, so you don't have to share it publicly. It might be my code, but when I try to process it myself I run into problems, so that would be a huge help.

@pmacinec

pmacinec commented Mar 9, 2019

Maybe you can first share your code and describe your problems.

To anyone who ever wants only a subset of the data with specific labels: please try the following code to get only the fake and reliable news.

import pandas as pd

chunksize = 200000  # depending on your memory, this can be larger or smaller
for chunk in pd.read_csv('data/data.csv', chunksize=chunksize, encoding='utf-8', engine='python'):
    # keep only the rows labelled 'reliable' or 'fake'
    x = chunk[(chunk['type'] == 'reliable') | (chunk['type'] == 'fake')]
    ...

Hope this will help.

@Ierpier

Ierpier commented Mar 9, 2019

Hm. This code is somewhat different from the one I tried. I'll try it tomorrow/Monday and see if it works. Thanks! :)

@Ierpier

Ierpier commented Mar 10, 2019

@pmacinec I tried running your code, but it just gives me a DataFrame of 129,194 articles from The New York Times. No other sources and no fake articles at all. I also tried reading in the entire file in chunks, which still raised a memory error. Reading in the entire file as-is nearly blew up my PC (as expected, haha). Reading in just some rows using nrows works just fine, though.

@several27 what is the code you used to extract just fake and real? Is it the same as what @pmacinec posted, or did you do something different? What setup did you use to process it? I'm running Python on a local Jupyter server using a Python 3.5 environment.

I would love to work with a (reasonably large, but not complete) subset of the real/fake articles in this dataset since none of the other fake news datasets suit my definitions of 'fake news' as well. However, I think the sheer size of the data is unfortunately causing some issues for me here. A subset of fake and reliable articles would be an absolute lifesaver right now, just so I don't have to process the entire file on my poor laptop. If you could possibly share that with me, I would be eternally grateful :).

(I'm really sorry if I come off as 'pushy'. I'm a bit stressed about this project).

@pmacinec

pmacinec commented Mar 10, 2019

@Ierpier it is probably because you didn't finish the code above. The x variable stores all fake and reliable news of the current chunk. When another chunk is processed, it is overwritten. To not lose the data currently stored in x, you have to add it somewhere (e.g. a new DataFrame) where it will not be overwritten in the loop.

The append function can be helpful here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html

Note: be careful, even just the fake and reliable news needs a lot of memory!
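For illustration, a minimal sketch of that accumulation pattern, reusing the file path and 'type' values from the snippet above (pd.concat is used here; the linked DataFrame.append would also work, but has since been deprecated in newer pandas; the output file name is just an example):

import pandas as pd

# Collect the filtered rows from each chunk in a list, then combine them once at
# the end, so nothing is overwritten between loop iterations.
parts = []
for chunk in pd.read_csv('data/data.csv', chunksize=200000, encoding='utf-8', engine='python'):
    parts.append(chunk[chunk['type'].isin(['reliable', 'fake'])])

subset = pd.concat(parts, ignore_index=True)
subset.to_csv('fake_reliable_subset.csv', index=False)  # example output file name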

Or if you are OK with having only 100,000 news articles (50k reliable, 50k fake), send me your email and I will share it with you. I can share all the reliable and fake news with you approximately next weekend. But please, first try the advice above.

@Ierpier

Ierpier commented Mar 12, 2019

@pmacinec Well, now I feel incredibly dumb, haha. That makes sense. I do remember trying the strategy of appending chunks before, though, and I ran into the problem that at some point everything just started ending up in the wrong columns. I will try it again tomorrow. Hopefully I can figure it out and won't need to bother you any longer.

That said, the subset you are describing sounds very useful, especially since I likely won't be able to work with the entire thing using my own resources anyway. If you could send it to me at ierpier.projects@gmail.com, I would be very grateful.

@impawan

impawan commented Jan 10, 2020

@pmacinec
Hello Peter,
Could you please share the dataset link with me, as the link provided by @several27 is not working now? My email is pawanprasad@outlook.com

@pmacinec

Hello @impawan, I don't have the data available on Google Drive anymore. I probably have a backup of this dataset on my disk, but I don't have it with me. I should be able to upload it in approximately 2 weeks.

@impawan

impawan commented Jan 17, 2020

Thanks @pmacinec for helping me with this. I will wait for an update from you. 👍

@pmacinec

pmacinec commented Jan 23, 2020

Hello @impawan. I have uploaded the dataset again and I can share it with some of you; just write to me or send me your email.

But please, @several27, are you going to upload the data again? Will the data be available for others in the future? I have only created a temporary solution for now (as before).

@lgotsev

lgotsev commented Mar 8, 2021

Hello @pmacinec. You've done a lot to keep the dataset "alive". Would you please share with me the dataset link (as the link provided by @several27 is not working) or a piece of the data, for example 100,000/200,000 news articles (50/100k reliable, 50/100k fake)? My email is: l.gotsev@unibit.bg. I'd appreciate your help. Thank you!

@pmacinec

pmacinec commented Mar 8, 2021

Hello @lgotsev. The link that @several27 provided should be working, because the data are uploaded to GitHub as part of a release (https://github.com/several27/FakeNewsCorpus/releases/tag/v1.0).

Let me know if the dataset is for any reason not working again (fortunately, I still have a copy).

@lgotsev

lgotsev commented Mar 8, 2021

Thank you @pmacinec for your quick answer. I've tried several times to open the files using 7-Zip, but unpacking gives an error after extracting almost 3 GB of data. Perhaps another tool or the command prompt can help. I've noticed a new issue about this problem from July 2020 which is still open. So please advise how to deal with it, or perhaps there is once again an issue with the files. Thank you!

@pmacinec

If there is a problem with unpacking the multiple zip files, maybe it makes sense to divide the data into chunks and zip each chunk separately, then upload each zip here on GitHub.

@several27 what do you think? If needed, I can do something like that and prepare a pull request.
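For reference, a rough sketch of what that could look like (the chunk size and file names here are assumptions, not a final proposal): pandas can write each chunk straight to its own zip archive, so every part stays below GitHub's per-file size limit for release assets.

import pandas as pd

# Split the corpus into fixed-size row chunks and write each one as its own zipped CSV.
rows_per_part = 1000000  # adjust to hit the desired archive size
reader = pd.read_csv('news_cleaned_2018_02_13.csv', chunksize=rows_per_part)
for i, chunk in enumerate(reader):
    chunk.to_csv(f'news_part_{i:03d}.csv.zip', index=False, compression='zip')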
