download the data #8
Hi, is the data still available for downloading? |
Apologies for the inconvenience, the download is temporarily down. I'm working on bringing it back. If anyone had stored a local copy of the dataset, I'd appreciate them sending it over. |
I probably have a local copy, so I can upload it somewhere. I also have a subset of only fake and reliable news that someone mentioned in another issue, if needed. I could probably upload it to OneDrive, if it allows files that large. Or do you have any idea where to upload it? |
I've just lost my copy this morning... Maybe you could use the AWS free tier to upload the dataset to a bucket? |
@pmacinec If you can upload it to OneDrive and send me over a link, that'd be amazing (maciej[at]researchably.com) I'll copy it over to a new cloud and share a public free link here. Thanks! |
@several27 I have already sent you a link to download, so I hope anyone can download it soon from the new cloud. |
@pmacinec , would you mind sending me the link to download it too? Appreciate it! |
Yes, of course, just send me your email. Or should I send it to the email you have public on your profile? |
Could you please send me the link at this email address?
|
@pmacinec, could you please send me the link to download it too? I need it urgently. Thank you! My email is wjhappy96@gmail.com |
Ushemanhuna42@gmail.com is my email address
|
Sent to both of you. Hope @several27 will upload it soon. |
Hi @pmacinec, can I trouble you to send the link to me as well? Thanks! redcurries@gmail.com |
Hey @pmacinec sorry for bugging but could you send me the link as well? Thank you very much :) dawandd@hotmail.com |
@pmacinec I would appreciate it if it could be sent to my email address at pehgaoxian@gmail.com Thank you once again! :) |
Hi all, thank you very much for your patience. Thanks to @pmacinec the dataset can be now downloaded from: https://storage.googleapis.com/researchably-fake-news-recognition/news_cleaned_2018_02_13.csv.zip |
Would it be possible to also upload the subset of only fake and reliable articles that @pmacinec discussed? |
If @several27 wants, I can also upload this subset, or do it just for you (or maybe share code to extract only those articles from the whole dataset, processed in chunks). Just let me know. |
An upload of the subset would be great. I don't have easy access to resources to easily process the entire dataset, which is why such a subset would be very convenient, practically. Let's wait for @pmacinec to see if he agrees with you sharing this, either publicly or privately. Many thanks in advance! |
@lerpier to read the dataset with minimal RAM usage use the ‘chunksize’ parameter in pandas. E.g.: https://cmdlinetips.com/2018/01/how-to-load-a-massive-file-as-small-chunks-in-pandas/ |
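A minimal illustration of the chunked-reading idea from the comment above: counting label frequencies while holding only one chunk in memory at a time. The `type` column name comes from the code shared later in this thread; the function name and path are illustrative, not part of the original discussion.

```python
from collections import Counter

import pandas as pd

def count_labels(path, chunksize=200000):
    """Count 'type' labels without loading the whole CSV into memory."""
    counts = Counter()
    # usecols limits parsing to the label column; chunksize streams the file
    for chunk in pd.read_csv(path, usecols=['type'], chunksize=chunksize):
        counts.update(chunk['type'].value_counts().to_dict())
    return counts
```

Something like `count_labels('data/data.csv')` would report how many articles carry each label before committing to a full extraction pass.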
@several27 I tried something like this and it mostly worked, but ran into some issues after several chunks (not sure why). Would it be okay for @pmacinec to share the subset with only fake and real articles? Either with me privately with a link per email or a link publicly here? |
@pmacinec @several27 I don't mean to be a bother, but is this still an option (uploading the fake/reliable subset)? I could post an email address that you could send a link to, so you don't have to share it publicly. It might be my code, but when I try to process it myself I run into problems, so that would be a huge help. |
Maybe you can first share your code and your problems. To everyone who will ever want only a subset of the data with specific labels, please try the following code to get only fake and reliable news:

import pandas as pd

chunksize = 200000  # can be much bigger or smaller, depending on your memory
for chunk in pd.read_csv('data/data.csv', chunksize=chunksize, encoding='utf-8', engine='python'):
    x = chunk[(chunk['type'] == 'reliable') | (chunk['type'] == 'fake')]
    ...

Hope this will help. |
Hm. This code is somewhat different from the one I tried. I'll try it tomorrow/monday and see if it works. Thanks! :) |
@pmacinec I tried running your code, but it just gives me a dataframe of 129194 articles by The New York Times. No other sources and no fake articles at all. I also tried reading in the entire file in chunks, which still raised a memory error. Reading in the entire file as-is nearly blew up my PC (as expected, haha). Reading in just some rows using nrows works just fine, though.

@several27 what is the code you used to extract just fake and real? Is it the same that @pmacinec posted, or did you do something different? What setup did you use to process it? I'm running Python on a local Jupyter server using a Python 3.5 environment.

I would love to work with a (reasonably large, but not complete) subset of the real/fake articles in this dataset, since none of the other fake news datasets suit my definition of 'fake news' as well. However, I think the sheer size of the data is unfortunately causing some issues for me here. A subset of fake and reliable articles would be an absolute lifesaver right now, just so I don't have to process the entire file on my poor laptop. If you could possibly share that with me, I would be eternally grateful :). (I'm really sorry if I come off as 'pushy'. I'm a bit stressed about this project). |
@Ierpier it is probably because you didn't finish the code above. The x variable holds only the fake and reliable news of the current chunk; when the next chunk is processed, it gets overwritten. To not lose the data currently stored in x, you have to add it somewhere it will not be overwritten in the loop (e.g. a new dataframe). The append function can be helpful: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html

Note: be careful, even just the fake and reliable news need a lot of memory! Or if you are OK with having only 100,000 news articles (50k reliable, 50k fake), write me your email and I will share it with you. I can share all the reliable and fake news approximately next weekend. But please, first try the advice above. |
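The accumulation advice above can be sketched as follows. This is a minimal sketch, not the author's exact code: it collects each chunk's matching rows in a list and concatenates once at the end (cheaper than repeated appends). The path, `chunksize`, and `type` column follow the code shared earlier in the thread; the function name is illustrative.

```python
import pandas as pd

def extract_fake_reliable(path, chunksize=200000):
    """Stream the corpus in chunks, keeping only fake/reliable rows."""
    parts = []
    for chunk in pd.read_csv(path, chunksize=chunksize, encoding='utf-8'):
        # Keep only the rows of this chunk before it is discarded
        parts.append(chunk[chunk['type'].isin(['reliable', 'fake'])])
    # One concat at the end avoids quadratic copying from repeated appends
    return pd.concat(parts, ignore_index=True)
```

For example, `extract_fake_reliable('data/data.csv')` would return a single dataframe of just the fake and reliable articles, which could then be saved with `to_csv` for later reuse.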
@pmacinec Well now I feel incredibly dumb, haha. That makes sense. I actually remember trying the strategy of appending chunks before, though, and I ran into the problem that at some point everything just started ending up in the wrong columns. I will try it again tomorrow. Hopefully I can figure it out and won't need to bother you any longer. That said, the subset you are describing sounds very useful, especially since I likely won't be able to work with the entire thing using my own resources anyway. If you could send it to me at ierpier.projects@gmail.com, I would be very grateful. |
@pmacinec |
Hello @impawan, I don't have the data available on Google Drive anymore. I probably have a backup of this dataset on a disk, but I don't have it with me. I should be able to upload it in approximately 2 weeks. |
Thanks @pmacinec for helping me with this. I will wait for an update from you. 👍 |
Hello @impawan. I have uploaded the dataset again and can share it with some of you, just write me or give me your email. But please, @several27, are you going to upload the data again? Will the data be available for others in the future? I have only created a temporary solution for now (as previously). |
Hello @pmacinec. You've done a lot to keep the dataset "alive". Would you please share with me the link to the dataset, as the link provided by @several27 is not working, or a piece of the data, for example 100,000/200,000 news articles (50/100k reliable, 50/100k fake)? My email is: l.gotsev@unibit.bg. I'll appreciate your help. Thank you! |
Hello @lgotsev. The link that @several27 provided should be working, because the data are uploaded to GitHub as part of a release (https://github.com/several27/FakeNewsCorpus/releases/tag/v1.0). Let me know if for any reason the download is not working again (fortunately, I still have a copy). |
Thank you @pmacinec for your quick answer. I've tried several times to open the files using 7-Zip, but unpacking gives an error after extracting almost 3 GB of data. Perhaps another tool, or using the command prompt, could help. I've noticed a new issue about this problem from July 2020 which is still open. So please advise how to deal with it, or perhaps there is once again a problem with the archive. Thank you! |
If there is a problem with unpacking the multi-part zip files, maybe it makes sense to divide the data into chunks and zip each chunk separately, then upload each zip here on GitHub. @several27, what do you think? If needed, I can do something like that and prepare a pull request. |
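The split-and-zip idea proposed above could look something like this minimal sketch: stream the big CSV in row chunks and write each chunk as its own independently extractable zip. This is not an actual pull request from the thread; the function name, `rows_per_part` value, and file name prefix are all illustrative assumptions.

```python
import zipfile

import pandas as pd

def split_and_zip(path, rows_per_part=1000000, prefix='news_part'):
    """Split a large CSV into row chunks and zip each part separately."""
    for i, chunk in enumerate(pd.read_csv(path, chunksize=rows_per_part)):
        csv_name = f'{prefix}_{i:03d}.csv'
        # Each part is a valid standalone CSV with its own header row
        chunk.to_csv(csv_name, index=False)
        with zipfile.ZipFile(f'{csv_name}.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
            zf.write(csv_name)
```

Independent single-file zips sidestep the multi-part extraction errors mentioned above, since a corrupted or incompletely downloaded part only affects that one chunk.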
Is the data still open to the public? I click the link but can't download the data.