## In this notebook, I split the original dataset into 11 different datasets for separate cleaning and sampling. 

Several approaches were tried before one worked:

- Filtering as a DataFrame: memory in both Colab and on the local machine was insufficient to even open the file, much less perform operations on it.

- Uploading the CSV file into SQL for filtering:
  - A `copy_from` bulk load from CSV into SQL through SQLAlchemy didn't work because there was no way to skip corrupted rows, even though bulk loading is best practice and the fastest route.
  - Inserting row by row was just as slow as writing the rows out to CSV files.

- Finally, I decided to read the original CSV row by row and sort the rows into separate CSV files, updating the files on Colab every time 10,000 rows were processed so as to keep memory usage low.
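
As a rough illustration of the row-by-row idea, here is a minimal sketch that uses only Python's standard `csv` module instead of pandas. It is not the code this notebook ran: the function name `split_csv_by_type`, its parameters, and the rule for what counts as a corrupted row (a field count that differs from the header, or a label outside `labels`) are assumptions made for the example.

```python
import csv
import sys
from pathlib import Path


def split_csv_by_type(src, out_dir, labels, type_column="type"):
    """Stream `src` row by row and write each row to <out_dir>/<label>.csv.

    Hypothetical helper: rows with the wrong number of fields or with a
    label outside `labels` are counted as skipped instead of raising.
    Returns (rows_written, rows_skipped).
    """
    # Long article bodies can exceed the csv module's default field limit.
    csv.field_size_limit(min(sys.maxsize, 2**31 - 1))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    handles, writers = {}, {}
    written = skipped = 0
    try:
        with open(src, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            header = next(reader)
            type_idx = header.index(type_column)
            for row in reader:
                if len(row) != len(header) or row[type_idx] not in labels:
                    skipped += 1
                    continue
                label = row[type_idx]
                if label not in writers:
                    # Open each output file once and start it with the header.
                    handle = open(out_dir / f"{label}.csv", "w",
                                  newline="", encoding="utf-8")
                    handles[label] = handle
                    writers[label] = csv.writer(handle)
                    writers[label].writerow(header)
                writers[label].writerow(row)
                written += 1
    finally:
        for handle in handles.values():
            handle.close()
    return written, skipped
```

Only one row is held in memory at a time, so the size of the source file does not matter; the trade-off is that every value stays a string and no pandas type inference takes place.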

### Important note: This notebook cannot be re-run as is, because the dataset is too large to upload to a free GitHub account; it is preserved in its Google Colab format. It ran for around 10 straight days to process the original CSV file.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
# To allow interactive plots.
from ipywidgets import *
from IPython.display import display


from google.colab import drive
drive.mount('/content/gdrive')


Mounted at /content/gdrive


In [0]:

filename = '/content/gdrive/My Drive/Capstone_DSI/news_cleaned_2018_02_13.csv'

In [0]:
# Read only the first 20 rows (despite the variable name) to get the column names.
first10000=pd.read_csv(filename,nrows=20)

In [0]:
columns=first10000.columns

In [5]:
first10000.head()

Unnamed: 0.1,Unnamed: 0,id,domain,type,url,content,scraped_at,inserted_at,updated_at,title,authors,keywords,meta_keywords,meta_description,tags,summary,source
0,0,2,express.co.uk,rumor,https://www.express.co.uk/news/science/738402/...,"Life is an illusion, at least on a quantum lev...",2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,Is life an ILLUSION? Researchers prove 'realit...,Sean Martin,,[''],THE UNIVERSE ceases to exist when we are not l...,,,
1,1,6,barenakedislam.com,hate,http://barenakedislam.com/category/donald-trum...,"Unfortunately, he hasn’t yet attacked her for ...",2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,Donald Trump,"Linda Rivera, Conrad Calvano, Az Gal, Lincoln ...",,[''],,,,
2,2,7,barenakedislam.com,hate,http://barenakedislam.com/category/donald-trum...,The Los Angeles Police Department has been den...,2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,Donald Trump,"Linda Rivera, Conrad Calvano, Az Gal, Lincoln ...",,[''],,,,
3,3,8,barenakedislam.com,hate,http://barenakedislam.com/2017/12/24/more-winn...,The White House has decided to quietly withdra...,2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,"MORE WINNING! Israeli intelligence source, DEB...","Cleavis Nowell, Cleavisnowell, Clarence J. Fei...",,[''],,,,
4,4,9,barenakedislam.com,hate,http://barenakedislam.com/2017/12/25/oh-trump-...,“The time has come to cut off the tongues of t...,2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,"“Oh, Trump, you coward, you just wait, we will...","F.N. Lehner, Don Spilman, Clarence J. Feinour,...",,[''],,,,


| Type | Tag | Count (so far) | Description|
| ------------- |:-------------:|:-------------:|:-------------:|
| **Fake News** | fake | 928,083 | Sources that entirely fabricate information, disseminate deceptive content, or grossly distort actual news reports |
| **Satire** | satire | 146,080 | Sources that use humor, irony, exaggeration, ridicule, and false information to comment on current events. |
| **Extreme Bias** | bias | 1,300,444 | Sources that come from a particular point of view and may rely on propaganda, decontextualized information, and opinions distorted as facts. |
| **Conspiracy Theory** | conspiracy | 905,981 | Sources that are well-known promoters of kooky conspiracy theories. |
| **State News** | state | 0 | Sources in repressive states operating under government sanction. |
| **Junk Science** | junksci | 144,939 | Sources that promote pseudoscience, metaphysics, naturalistic fallacies, and other scientifically dubious claims. |
| **Hate News** | hate | 117,374 | Sources that actively promote racism, misogyny, homophobia, and other forms of discrimination. |
| **Clickbait** | clickbait | 292,201 | Sources that provide generally credible content, but use exaggerated, misleading, or questionable headlines, social media descriptions, and/or images. |
| **Proceed With Caution** | unreliable | 319,830 | Sources that may be reliable but whose contents require further verification. |
| **Political** | political | 2,435,471 | Sources that provide generally verifiable information in support of certain points of view or political orientations. |
| **Credible** | reliable | 1,920,139 | Sources that circulate news and information in a manner consistent with traditional and ethical practices in journalism (Remember: even credible sources sometimes rely on clickbait-style headlines or occasionally make mistakes. No news organization is perfect, which is why a healthy news diet consists of multiple sources of information). |


In [0]:
newstypes=['fake',
 'conspiracy',
 'political',
 'junksci',
 'unreliable',
 'bias',
 'hate',
 'reliable',
 'satire',
 'clickbait',
 'unknown',
 'rumor']

As the counts show, this dataset is massive. We first extract a separate dataset for each type, so that the data is easier to manage, use, and sample. To do this, we read the CSV row by row and append each row to an in-memory DataFrame for its type. Because memory cannot hold the full result, every 10,000 rows processed we read the stored CSV file for each type into a DataFrame, append the buffered rows, write the revised file back to Drive, and clear the buffers.

In [0]:
fake=pd.DataFrame()
conspiracy=pd.DataFrame()
political=pd.DataFrame()
junksci=pd.DataFrame()
unreliable=pd.DataFrame()
bias=pd.DataFrame()
hate=pd.DataFrame()
reliable=pd.DataFrame()
satire=pd.DataFrame()
clickbait=pd.DataFrame()
unknown=pd.DataFrame()
rumor=pd.DataFrame()


In [0]:
catdict={
    'fake':fake,
 'conspiracy':conspiracy,
 'political':political,
 'junksci':junksci,
 'unreliable':unreliable,
 'bias':bias,
 'hate':hate,
 'reliable':reliable,
 'satire':satire,
 'clickbait':clickbait,
 'unknown':unknown,
 'rumor':rumor
}

In [0]:
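for key in catdict.keys():
    # create an empty CSV (header row only) for each category;
    # the filename suffix must match the one read back in the processing loop
    csvname='/content/gdrive/My Drive/Capstone_DSI/csvs/'+key+'23512.csv'

    tempdf=pd.DataFrame(columns=columns)
    tempdf.to_csv(csvname,index=False)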
   

In [10]:
1180000+570000+410000+100000+90000+350000+700000+1320000+600000+1570000+270000+230000+230000+500000+210000+190000

8520000
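
The total above is the value passed as `skiprows` in the next cell. A small sketch of the same arithmetic with the figures kept in a list; reading each figure as the number of rows handled in one earlier session is an assumption, and `session_counts` and `rows_done` are illustrative names.

```python
# Figures copied from the sum above; treating each one as the row count of
# one earlier session is an assumption, not something the notebook states.
session_counts = [
    1180000, 570000, 410000, 100000, 90000, 350000, 700000, 1320000,
    600000, 1570000, 270000, 230000, 230000, 500000, 210000, 190000,
]
rows_done = sum(session_counts)
print(rows_done)  # 8520000
```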

In [11]:
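exceptions=[]
# Resume after the 8,520,000 lines already handled; names=columns supplies
# the header row that skiprows drops.
for n,gm_chunk in enumerate(pd.read_csv(filename,chunksize=1,names=columns,skiprows=8520000)):
    # each chunk is a single row; column 3 is read as its category label
    typ=gm_chunk.iloc[0,3]
    try:
        # buffer the row in the DataFrame for its category
        catdict[typ]=pd.concat([catdict[typ],gm_chunk],axis=0)
    except Exception as e:
        # rows whose label is not a key of catdict are recorded and skipped
        exceptions.append({n:e})

    # every 10,000 rows, flush the buffers to the CSV files on Drive
    if n%10000==0:
        print(f'{n} rows processed in this run (after the first 8,520,000)')
        skipr=n
        for key in catdict.keys():
            #create csvname
            csvname='/content/gdrive/My Drive/Capstone_DSI/csvs/'+key+'8.csv'
            #read stored csv
            tempdf=pd.read_csv(csvname)
            #update stored csv
            tempdf=pd.concat([tempdf,catdict[key]],axis=0)
            #upload stored csv
            tempdf.to_csv(csvname,index=False)
            #clear buffer DataFrames
            tempdf=pd.DataFrame()
            catdict[key]=pd.DataFrame()

print('DONE?????')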

KeyboardInterrupt: ignored

In [0]:
catdict.values()