
download the data #8

Closed
juewang1996 opened this issue Feb 24, 2019 · 36 comments

Comments

@juewang1996

Is the data still open to the public? I clicked the link but can't download the data.

@nabanita-

I am trying to download it as well. The AWS link is showing the following error:
[screenshot of the AWS error, 2019-02-24]

@ushetaks

Hi, is the data still available for download?

@several27
Owner

Apologies for the inconvenience, the download is temporarily down. I'm working on bringing it back.

If anyone has stored a local copy of the dataset, I'd appreciate them sending it over.

@pmacinec

pmacinec commented Mar 2, 2019

I probably have a local copy, so I can upload it somewhere. I also have a subset of only fake and reliable news that someone mentioned in another issue, if needed. I can probably upload it to OneDrive, if it's possible to upload such large files there. Or do you have any idea where to upload it?

@AIRLegend

I've just lost my copy this morning... Maybe you could use the AWS free tier to upload the dataset to a bucket?

@several27
Owner

@pmacinec If you can upload it to OneDrive and send a link over, that'd be amazing (maciej[at]researchably.com). I'll copy it over to a new cloud and share a free public link here. Thanks!

@pmacinec

pmacinec commented Mar 3, 2019

@several27 I have already sent you a download link, so I hope everyone can download it from the new cloud soon.

@gao-xian-peh

@pmacinec, would you mind sending me the link to download it too?

Appreciate it!

@pmacinec

pmacinec commented Mar 4, 2019

Yes, of course, just send me your email. Or should I send it to the email you have public on your profile?

@ushetaks

ushetaks commented Mar 4, 2019 via email

@juewang1996
Author

@pmacinec, could you please send me the link to download it too? I need to use it very urgently. Thank you! My email is wjhappy96@gmail.com

@ushetaks

ushetaks commented Mar 4, 2019 via email

@pmacinec

pmacinec commented Mar 4, 2019

Sent to both of you. Hope @several27 will upload it soon.

@ghost

ghost commented Mar 4, 2019

Hi @pmacinec, can I trouble you to send the link to me as well? Thanks! redcurries@gmail.com

@Kerrah

Kerrah commented Mar 4, 2019

Hey @pmacinec, sorry for bugging you, but could you send me the link as well? Thank you very much :) dawandd@hotmail.com

@gao-xian-peh

@pmacinec I would appreciate it if it could be sent to my email address at pehgaoxian@gmail.com

Thank you once again! :)

@several27
Owner

Hi all, thank you very much for your patience.

Thanks to @pmacinec, the dataset can now be downloaded from: https://storage.googleapis.com/researchably-fake-news-recognition/news_cleaned_2018_02_13.csv.zip

@Ierpier

Ierpier commented Mar 5, 2019

Would it be possible to also upload the subset of only fake and reliable articles that @pmacinec discussed?

@pmacinec

pmacinec commented Mar 5, 2019

If @several27 wants, I can also upload this subset, or do it just for you (or maybe share code to get only those articles from the whole dataset, processed in chunks). Just let me know.

@Ierpier

Ierpier commented Mar 5, 2019

An upload of the subset would be great. I don't have easy access to resources to easily process the entire dataset, which is why such a subset would be very convenient in practice. Let's wait for @several27 to see if he agrees with you sharing this, either publicly or privately. Many thanks in advance!

@several27
Owner

@Ierpier to read the dataset with minimal RAM usage, use the 'chunksize' parameter in pandas.

E.g.: https://cmdlinetips.com/2018/01/how-to-load-a-massive-file-as-small-chunks-in-pandas/
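For illustration, a minimal sketch of that approach (the corpus file name and the 'type' label column are assumed from elsewhere in this thread), which streams the CSV so only one chunk is in memory at a time, here just to count how many articles carry each label:

import pandas as pd

# Stream the corpus in chunks so only `chunksize` rows are held in memory at once.
label_counts = {}
for chunk in pd.read_csv('news_cleaned_2018_02_13.csv', chunksize=100000):
    for label, count in chunk['type'].value_counts().items():
        label_counts[label] = label_counts.get(label, 0) + count
print(label_counts)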

@Ierpier

Ierpier commented Mar 6, 2019

@several27 I tried something like this and it mostly worked, but I ran into some issues after several chunks (not sure why). Would it be okay for @pmacinec to share the subset with only fake and real articles? Either privately with me via a link by email, or publicly with a link here?

@Ierpier

Ierpier commented Mar 9, 2019

@pmacinec @several27 I don't mean to be a bother, but is this still an option (uploading the fake/reliable subset)? I could post an email address that you could share a link to, so you don't have to share it publicly. It might be my code, but when I try to process it myself I run into problems, so that would be a huge help.

@pmacinec

pmacinec commented Mar 9, 2019

Maybe you can first share your code and describe your problems.

To anyone who ever wants only a subset of the data with specific labels: please try the following code to get only the fake and reliable news.

import pandas as pd

chunksize = 200000  # depending on your memory, this can be larger or smaller
for chunk in pd.read_csv('data/data.csv', chunksize=chunksize, encoding='utf-8', engine='python'):
    # keep only the rows labelled 'reliable' or 'fake'
    x = chunk[(chunk['type'] == 'reliable') | (chunk['type'] == 'fake')]
    ...

Hope this will help.

@Ierpier

Ierpier commented Mar 9, 2019

Hm. This code is somewhat different from the one I tried. I'll try it tomorrow/Monday and see if it works. Thanks! :)

@Ierpier

Ierpier commented Mar 10, 2019

@pmacinec I tried running your code, but it just gives me a DataFrame of 129,194 articles from The New York Times. No other sources and no fake articles at all. I also tried reading in the entire file in chunks, which still raised a memory error. Reading in the entire file as-is nearly blew up my PC (as expected, haha). Reading in just some rows using nrows works just fine, though.

@several27 what is the code you used to extract just fake and real? Is it the same as what @pmacinec posted, or did you do something different? What setup did you use to process it? I'm running Python on a local Jupyter server using a Python 3.5 environment.

I would love to work with a (reasonably large, but not complete) subset of the real/fake articles in this dataset since none of the other fake news datasets suit my definitions of 'fake news' as well. However, I think the sheer size of the data is unfortunately causing some issues for me here. A subset of fake and reliable articles would be an absolute lifesaver right now, just so I don't have to process the entire file on my poor laptop. If you could possibly share that with me, I would be eternally grateful :).

(I'm really sorry if I come off as 'pushy'. I'm a bit stressed about this project).

@pmacinec

pmacinec commented Mar 10, 2019

@Ierpier it is probably because you didn't finish the code above. The x variable stores all fake and reliable news of the current chunk. When another chunk is processed, it is overwritten. To not lose the data currently stored in x, you have to add it somewhere (e.g. a new DataFrame) where it will not be overwritten in the loop.

The append function can be helpful here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html

Note: be careful, even just the fake and reliable news needs a lot of memory!
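For illustration, a minimal sketch of that accumulation pattern, reusing the file path and 'type' values from the snippet above (pd.concat is used here; the linked DataFrame.append would also work, but has since been deprecated in newer pandas; the output file name is just an example):

import pandas as pd

# Collect the filtered rows from each chunk in a list, then combine them once at
# the end, so nothing is overwritten between loop iterations.
parts = []
for chunk in pd.read_csv('data/data.csv', chunksize=200000, encoding='utf-8', engine='python'):
    parts.append(chunk[chunk['type'].isin(['reliable', 'fake'])])

subset = pd.concat(parts, ignore_index=True)
subset.to_csv('fake_reliable_subset.csv', index=False)  # example output file name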

Or if you are OK with having only 100,000 news articles (50k reliable, 50k fake), send me your email and I will share it with you. I can share all the reliable and fake news with you approximately next weekend. But please, first try the advice above.

@Ierpier

Ierpier commented Mar 12, 2019

@pmacinec Well, now I feel incredibly dumb, haha. That makes sense. I do remember trying the strategy of appending chunks before, though, and I ran into the problem that at some point everything just started ending up in the wrong columns. I will try it again tomorrow. Hopefully I can figure it out and won't need to bother you any longer.

That said, the subset you are describing sounds very useful, especially since I likely won't be able to work with the entire thing using my own resources anyway. If you could send it to me at ierpier.projects@gmail.com, I would be very grateful.

@impawan

impawan commented Jan 10, 2020

@pmacinec
Hello Peter,
Could you please share the dataset link with me, as the link provided by @several27 is not working now? My email is pawanprasad@outlook.com

@pmacinec

Hello @impawan, I don't have the data available on Google Drive anymore. I probably have a backup of this dataset on my disk, but I don't have it with me. I should be able to upload it in approximately 2 weeks.

@impawan

impawan commented Jan 17, 2020

Thanks @pmacinec for helping me with this. I will wait for an update from you. 👍

@pmacinec

pmacinec commented Jan 23, 2020

Hello @impawan. I have uploaded the dataset again and I can share it with some of you; just write to me or send me your email.

But please, @several27, are you going to upload the data again? Will the data be available for others in the future? I have only created a temporary solution for now (as before).

@lgotsev

lgotsev commented Mar 8, 2021

Hello @pmacinec. You've done a lot to keep the dataset "alive". Would you please share with me the dataset link (as the link provided by @several27 is not working) or a piece of the data, for example 100,000/200,000 news articles (50/100k reliable, 50/100k fake)? My email is: l.gotsev@unibit.bg. I'd appreciate your help. Thank you!

@pmacinec

pmacinec commented Mar 8, 2021

Hello @lgotsev. The link that @several27 provided should be working, because the data are uploaded to GitHub as part of a release (https://github.com/several27/FakeNewsCorpus/releases/tag/v1.0).

Let me know if the dataset is for any reason not working again (fortunately, I still have a copy).

@lgotsev

lgotsev commented Mar 8, 2021

Thank you @pmacinec for your quick answer. I've tried several times to open the files using 7-Zip, but unpacking gives an error after extracting almost 3 GB of data. Perhaps another tool or the command prompt can help. I've noticed a new issue about this problem from July 2020 which is still open. So please advise how to deal with it, or perhaps there is once again an issue with the files. Thank you!

@pmacinec

If there is a problem with unpacking the multiple zip files, maybe it makes sense to divide the data into chunks and zip each chunk separately, then upload each zip here on GitHub.

@several27 what do you think? If needed, I can do something like that and prepare a pull request.
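For reference, a rough sketch of what that could look like (the chunk size and file names here are assumptions, not a final proposal): pandas can write each chunk straight to its own zip archive, so every part stays below GitHub's per-file size limit for release assets.

import pandas as pd

# Split the corpus into fixed-size row chunks and write each one as its own zipped CSV.
rows_per_part = 1000000  # adjust to hit the desired archive size
reader = pd.read_csv('news_cleaned_2018_02_13.csv', chunksize=rows_per_part)
for i, chunk in enumerate(reader):
    chunk.to_csv(f'news_part_{i:03d}.csv.zip', index=False, compression='zip')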
