In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

%matplotlib inline

The number of usable articles is limited because the data is dominated by huge numbers of articles from three sites: Daily Kos (unreliable), and the National Review and the New York Times (both reliable). After some upsampling of the other sites in the reliable category, a total of 320,000 articles will be used from that category, so the same number will be used from the unreliable category.

In [109]:
names = ['bias', 'conspiracy', 'fake', 'hate', 'pol_bogus']
fnames = ['cleaned/unreliable/{}_cleaned.csv'.format(name) for name in names]
unreliable = pd.concat([pd.read_csv(fname) for fname in fnames], ignore_index=True)

In [110]:
unreliable.head()

Unnamed: 0.1,Unnamed: 0,id,domain,content
0,0,158,wnd.com,number black leaders who were prevented from d...
1,1,841,wnd.com,number black leaders who were prevented from d...
2,2,950,wnd.com,After date of various courts forcing Christian...
3,3,951,wnd.com,conservative organization Rep. person organiza...
4,4,1220,wnd.com,"Next to illegal immigration, feminism is the s..."


In [111]:
len(unreliable)

603482

In [112]:
unreliable.groupby('domain').count().iloc[:,0].sort_values()

domain
usanewsflash.com                     1
uspoln.com                           3
usafirstinformation.com              3
countdowntozerotime.com              3
goneleft.com                         4
viralliberty.com                     5
glaringhypocrisy.com                 5
newslogue.com                        5
dataasylum.com                       6
winningdemocrats.com                 9
thelastgreatstand.com                9
flashnewscorner.com                 10
sonsoflibertyradio.com              11
sheepkillers.com                    12
redcountry.us                       12
enhlive.com                         12
donaldtrumpnews.co                  13
onepoliticalplaza.com               13
readconservatives.news              15
jamesrgrangerjr.com                 15
yesimright.com                      18
nasamoonhoax.com                    20
usadosenews.com                     21
americanpatriotdaily.com            27
newcoldwar.org                      35
platosguns.com    

In [113]:
unreliable = unreliable.loc[:, ['id', 'domain', 'content']]

In [114]:
unreliable = unreliable.set_index('id')

In [115]:
weekly_standard = unreliable[unreliable['domain'].eq('weeklystandard.com')] # this site is credible

In [116]:
unreliable = unreliable[~unreliable['domain'].eq('weeklystandard.com')]

Keep only as many Daily Kos articles as are needed to reach 320,000. Keeping more articles from Daily Kos than from any other site is acceptable, since this site runs articles from a very wide variety of authors.

In [117]:
dailykos = unreliable[unreliable['domain'].eq('dailykos.com')]

In [118]:
needed = 320000 - len(unreliable[~unreliable['domain'].eq('dailykos.com')])
needed

41312

In [119]:
dk_keepers = dailykos.sample(needed, random_state=24)

In [120]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
unreliable = pd.concat([unreliable[~unreliable['domain'].eq('dailykos.com')], dk_keepers])

In [121]:
len(unreliable)

320000
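
The capping steps above can be wrapped in a small helper. This is a minimal sketch on toy data, not the notebook's own code: `cap_domain`, its parameters, and the toy frame are hypothetical names introduced for illustration.

```python
import pandas as pd

def cap_domain(df, domain, target_total, seed=24):
    """Downsample rows from `domain` so the frame has at most `target_total` rows."""
    is_dominant = df['domain'].eq(domain)
    others = df[~is_dominant]
    dominant = df[is_dominant]
    needed = target_total - len(others)
    if needed < 0:
        raise ValueError('the other domains already exceed the target')
    # Never request more rows than the dominant domain actually has
    keepers = dominant.sample(min(needed, len(dominant)), random_state=seed)
    return pd.concat([others, keepers])

# Toy data (assumption): one dominant site and two small ones
toy = pd.DataFrame({
    'domain': ['dailykos.com'] * 8 + ['wnd.com'] * 2 + ['newcoldwar.org'],
    'content': ['article {}'.format(i) for i in range(11)],
})
capped = cap_domain(toy, 'dailykos.com', target_total=6)
```

All rows from the small sites survive, and only the dominant site is sampled down.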

In [6]:
unreliable.to_csv('data_prepared/unreliable.csv')
pickle.dump(unreliable, open( "data_prepared/unreliable.pkl", "wb" ))

In [122]:
reliable = pd.concat([pd.read_csv(fname) for fname in ['cleaned/reliable/pol_cred_cleaned.csv',
                                                      'cleaned/reliable/credible1_cleaned.csv']], ignore_index=True)

In [123]:
reliable = reliable.loc[:, ['id', 'domain', 'content']]
reliable = reliable.set_index('id')
reliable = pd.concat([reliable, weekly_standard])  # DataFrame.append was removed in pandas 2.0

In [124]:
len(reliable)

368917

In [125]:
reliable.groupby('domain').count().iloc[:,0].sort_values()

domain
www.wsj.com                    115
www.theatlantic.com            141
www.buzzfeed.com               251
www.politico.com               592
www.nbcnews.com                625
foreignpolicyjournal.com       806
www.npr.org                    846
www.cbsnews.com               1240
www.usatoday.com              1849
www.bloomberg.com             2121
abcnews.go.com                2300
heritage.org                  2309
www.latimes.com               2348
jacobinmag.com                2390
theintercept.com              2556
www.nytimes.com               2848
www.washingtonpost.com        3001
baptistnews.com               6153
mintpressnews.com             7586
weeklystandard.com           30241
nationalreview.com          298599
Name: content, dtype: int64

In [126]:
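# Escape the dot and anchor the pattern so only a literal leading 'www.' is stripped
reliable['domain'] = reliable['domain'].str.replace(r'^www\.', '', regex=True)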
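
Stripping a `www.` prefix with a regular expression is easy to get subtly wrong: an unescaped dot matches any character, and an unanchored pattern matches anywhere in the string. The sketch below shows the difference on toy data; the domain `swwwap.example` is made up for illustration. The default of the `regex` argument differs between pandas versions, so it is passed explicitly here.

```python
import pandas as pd

# Toy domains (assumption); 'swwwap.example' contains 'www' followed by a non-dot character
domains = pd.Series(['www.nytimes.com', 'swwwap.example', 'npr.org'])

# Unanchored, unescaped pattern: '.' matches any character, so 'wwwa' is removed as well
loose = domains.str.replace('www.', '', regex=True)

# Anchored, escaped pattern: only a literal leading 'www.' is removed
strict = domains.str.replace(r'^www\.', '', regex=True)
```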

In [127]:
scraped = pd.read_csv('cleaned/reliable/scraped_cleaned.csv')
scraped.head()

Unnamed: 0.1,Unnamed: 0,id,domain,content
0,0,100000000,https://www.reuters.com/article/us-venezuela-p...,Venezuelan opposition leader person place on ...
1,1,100000001,https://www.reuters.com/article/us-nigeria-ele...,Nigerian voters returned to the polls on date...
2,2,100000002,https://www.reuters.com/article/us-italy-polit...,place's prime minister said on date tenders f...
3,3,100000003,https://www.reuters.com/article/us-mideast-cri...,The U.S. backed organization paused military ...
4,4,100000004,https://www.reuters.com/article/us-mideast-cri...,The organization refugee agency should have a...


In [128]:
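# Raw string with an escaped dot; the 'www.' prefix is optional so URLs without it still match
scraped['domain'] = scraped['domain'].str.extract(r"https?://(?:www\.)?([\w.-]+)/", expand=False)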
scraped.head(10)

Unnamed: 0.1,Unnamed: 0,id,domain,content
0,0,100000000,reuters.com,Venezuelan opposition leader person place on ...
1,1,100000001,reuters.com,Nigerian voters returned to the polls on date...
2,2,100000002,reuters.com,place's prime minister said on date tenders f...
3,3,100000003,reuters.com,The U.S. backed organization paused military ...
4,4,100000004,reuters.com,The organization refugee agency should have a...
5,5,100000005,reuters.com,place's government said on date it would rele...
6,6,100000006,reuters.com,Saudi oil minister person person said on date...
7,7,100000007,reuters.com,An organization fighter detained in place urg...
8,8,100000008,cbsnews.com,place Rep. person hosted his date quote event ...
9,9,100000009,cbsnews.com,Scientists have discovered that grey seals can...
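
As an alternative to a hand-written regular expression, the host can be pulled out of a URL with the standard library. This is a sketch under assumptions: `domain_of` is a hypothetical helper and the example URLs are made up; it is not what the notebook ran.

```python
from urllib.parse import urlparse

def domain_of(url):
    """Return the host of `url` without a leading 'www.' (empty string if there is no host)."""
    host = urlparse(url).hostname or ''
    return host[4:] if host.startswith('www.') else host

# Hypothetical URLs for illustration
urls = [
    'https://www.reuters.com/article/us-example',
    'https://apnews.com/some-story',
    'http://www.cbsnews.com/news/example/',
]
domains = [domain_of(u) for u in urls]
```

With pandas, the same helper could be applied with `scraped['domain'].apply(domain_of)`.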


In [129]:
scraped = scraped.loc[:, ['id', 'domain', 'content']]
scraped = scraped.set_index('id')
reliable = pd.concat([reliable, scraped])  # DataFrame.append was removed in pandas 2.0

In [130]:
len(reliable)

390567

In [131]:
reliable.groupby('domain').count().iloc[:,0].sort_values()

domain
politico.eu                     26
wsj.com                        115
theatlantic.com                141
buzzfeed.com                   273
buzzfeednews.com               588
foreignpolicyjournal.com       806
reuters.com                    980
csmonitor.com                 1038
aljazeera.com                 1686
bloomberg.com                 2121
npr.org                       2292
abcnews.go.com                2300
heritage.org                  2309
latimes.com                   2348
jacobinmag.com                2390
theintercept.com              2556
apnews.com                    2604
politico.com                  2732
nytimes.com                   2848
washingtonpost.com            3001
usatoday.com                  3060
nbcnews.com                   3755
cbsnews.com                   5190
baptistnews.com               6153
mintpressnews.com             7586
weeklystandard.com           30241
nationalreview.com          299420
Name: content, dtype: int64

In [132]:
nat_review = reliable[reliable['domain'].eq('nationalreview.com')]
nat_review = nat_review.reset_index()

In [133]:
reliable = reliable[~reliable['domain'].eq('nationalreview.com')]
len(reliable)

91147

In [134]:
reliable = reliable.reset_index()

In [135]:
# Oversample (triple) all the remaining sites except weeklystandard.com, which already is well represented
reliable = pd.concat([reliable, reliable[~reliable['domain'].eq('weeklystandard.com')],
                                         reliable[~reliable['domain'].eq('weeklystandard.com')]], ignore_index=True)
len(reliable)

212959

In [136]:
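# Add roughly half again as many articles, split evenly between the National Review and the New York Times.
# This cell adds the National Review half (53,521 sampled articles); the other half is appended from
# credible2_cleaned.csv in the next cell.
reliable = pd.concat([reliable, nat_review.sample(53521, random_state=24)])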
len(reliable)

266480

In [137]:
reliable = pd.concat([reliable, pd.read_csv('cleaned/credible2_cleaned.csv', index_col=0)], ignore_index=True)

In [138]:
len(reliable)

320000
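
The running totals in the cells above can be checked by hand. The sketch below only re-does the arithmetic with the counts printed in the outputs; the variable names are introduced here for illustration.

```python
# Counts copied from the cell outputs above
all_reliable = 390567        # after adding the scraped articles
nat_review_total = 299420    # nationalreview.com articles
weekly_standard = 30241      # weeklystandard.com articles (not oversampled)
nat_review_sample = 53521    # National Review articles sampled back in
target = 320000

without_nat_review = all_reliable - nat_review_total       # 91147
oversampled = without_nat_review - weekly_standard         # 60906 articles that get tripled
after_tripling = without_nat_review + 2 * oversampled      # 212959
after_nat_review = after_tripling + nat_review_sample      # 266480
from_last_file = target - after_nat_review                 # 53520 rows from credible2_cleaned.csv
```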

In [13]:
reliable.to_csv('data_prepared/reliable.csv')
pickle.dump(reliable, open( "data_prepared/reliable.pkl", "wb" ))

In [3]:
import pickle
unreliable = pickle.load(open('data_prepared/unreliable.pkl', 'rb'))
reliable = pickle.load(open('data_prepared/reliable.pkl', 'rb'))
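
The save and load cells open files without closing them explicitly. A minimal sketch of the same round trip with context managers follows; the toy frame and the temporary path are assumptions made for the example.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy frame (assumption) standing in for the prepared data
frame = pd.DataFrame({'domain': ['wnd.com'], 'content': ['toy article']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.pkl')
    # The with-blocks close each file handle as soon as the dump or load finishes
    with open(path, 'wb') as fh:
        pickle.dump(frame, fh)
    with open(path, 'rb') as fh:
        restored = pickle.load(fh)
```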

In [4]:
import spacy
nlp = spacy.load('en')  # spaCy 2.x shortcut; spaCy 3+ needs a full model name such as 'en_core_web_sm'

In [69]:
import itertools  # the function below calls itertools.tee and itertools.islice

def preformat(article):
    '''
    Pre-formats an article (text string) for training by splitting it into sentences, each a list of word strings,
    and removing the first 1-2 sentences and the last 2-3 sentences (depending on the length of the article).
    Promotional or editorial content, when present, most often appears at the beginning or end of an article.
    Articles are truncated to a maximum of 32 sentences (the 3rd through the 34th).

    Output: A list of lists of words, with punctuation, symbols, particles and numbers removed.
    '''
    counter, sents = itertools.tee(nlp(article).sents)
    num_sents = 0
    for item in counter:
        num_sents += 1
        
    if num_sents < 16:
        start = 1
        end = num_sents - 2
    elif num_sents < 20:
        start = 1
        end = num_sents - 3
    elif num_sents < 37:
        start = 2
        end = num_sents - 3
    else:
        start = 2
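        end = 34

    # Guard for articles with fewer than two sentences: islice() rejects a negative stop value
    end = max(end, start)
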
    output = []
    
    for sent in itertools.islice(sents, start, end):
        output.append( [word.text for word in sent if word.pos_ not in ['PUNCT', 'PART', 'SYM', 'NUM']] )

    return output
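
The trimming rule in `preformat` can be isolated from spaCy and checked on its own. The sketch below copies the thresholds from the function above; `trim_window` is a hypothetical name, and the sentence counts are arbitrary examples.

```python
def trim_window(num_sents):
    """Return the (start, end) sentence indices kept for an article; end is exclusive."""
    if num_sents < 16:
        return 1, num_sents - 2
    if num_sents < 20:
        return 1, num_sents - 3
    if num_sents < 37:
        return 2, num_sents - 3
    return 2, 34

# Number of sentences kept for a few example article lengths
kept = {n: trim_window(n)[1] - trim_window(n)[0] for n in (10, 18, 30, 36, 50)}
```

With these thresholds, the largest number of sentences kept is 32, reached for any article of 37 or more sentences.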

In [2]:
def recombine(array):
    '''
    Rejoins the lists of words in an article pre-formatted for training into a single string.

    Returns: A one-element list containing all the words of the pre-formatted article as a single string.
    '''
    return [' '.join(' '.join(sent) for sent in array)]
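
A short usage sketch on a toy input shows the shape of the result. The function body is repeated so the example is self-contained; the sample sentences are made up.

```python
def recombine(array):
    # Same logic as the notebook function: one joined string, wrapped in a list
    return [' '.join(' '.join(sent) for sent in array)]

# Toy pre-formatted article: a list of sentences, each a list of words
sample = [['The', 'protest', 'was', 'part'], ['of', 'a', 'project']]
joined = recombine(sample)
text = joined[0]  # the plain string, if a list wrapper is not wanted
```

Because the result is a list, downstream code that expects plain strings needs to take element `[0]`.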

In [76]:
unreliable['content'] = unreliable['content'].apply(preformat)
unreliable.to_csv('data_final/unreliable_final.csv')
pickle.dump(unreliable, open( "data_final/unreliable_final.pkl", "wb" ))

In [71]:
unreliable.head()

Unnamed: 0,id,domain,content
0,158,wnd.com,"[[The, protest, in, date, was, part, of, a, pr..."
1,841,wnd.com,"[[The, protest, in, date, was, part, of, a, pr..."
2,950,wnd.com,"[[The, appeals, court, found, that, the, ordin..."
3,951,wnd.com,"[[The, organization, primary, in, place, is, d..."
4,1220,wnd.com,"[[Feminism, is, the, hub, of, communism, Marxi..."


In [75]:
unreliable.iloc[0,2]

[['The',
  'protest',
  'in',
  'date',
  'was',
  'part',
  'of',
  'a',
  'project',
  'by',
  'the',
  'organization',
  'for',
  'Bio',
  'Ethical',
  'Reform',
  'or',
  'place'],
 ['that',
  'uses',
  'photo',
  'mural',
  'exhibits',
  'and',
  'literature',
  'quote',
  'quote'],
 ['Sign',
  'for',
  'free',
  'news',
  'alerts',
  'from',
  'website',
  'place',
  'independent',
  'news',
  'network'],
 ['The',
  'parties',
  'filed',
  'a',
  'settlement',
  'date',
  'in',
  'which',
  'the',
  'federal',
  'government',
  'formally',
  'acknowledged',
  'quote',
  'quote',
  'quote'],
 ['The',
  'government',
  'also',
  'agreed',
  'pay',
  'attorney',
  'fees',
  'to',
  'the',
  'nonprofit',
  'American',
  'Freedom',
  'Law',
  'organization',
  'which',
  'represented',
  'the',
  'protesters'],
 ['person',
  'organization',
  'co',
  'founder',
  'and',
  'senior',
  'counsel',
  'said',
  'there',
  'was',
  'quote',
  'quote',
  'quote',
  'quote'],
 ['quote', 'quot

In [78]:
unreliable['content'] = unreliable['content'].apply(recombine)
unreliable.to_csv('data_final/baseline/unreliable_string.csv')
pickle.dump(unreliable, open( "data_final/baseline/unreliable_string.pkl", "wb" ))

In [79]:
unreliable.iloc[0,2]

['The protest in date was part of a project by the organization for Bio Ethical Reform or place that uses photo mural exhibits and literature quote quote Sign for free news alerts from website place independent news network The parties filed a settlement date in which the federal government formally acknowledged quote quote quote The government also agreed pay attorney fees to the nonprofit American Freedom Law organization which represented the protesters person organization co founder and senior counsel said there was quote quote quote quote quote quote quote he said The other plaintiffs in the lawsuit were place and its executive director person who was standing by during the incident organization and person who direct the place project targeting the history museum testified that when they were told they could not stand outside the museum with their sign organization responded that they were on a public sidewalk The senior organization officer warned the number that if they did not 

In [15]:
reliable = pickle.load(open( "data_final/reliable_final.pkl", "rb" ))

In [18]:
reliable.iloc[1,2]

[['Surely',
  'the',
  'work',
  'of',
  'Christians',
  'is',
  'more',
  'than',
  'simply',
  'fueling',
  'the',
  'engine',
  'of',
  'capitalism',
  'meaningful',
  'work',
  'also',
  'participates',
  'in',
  'person',
  'intention',
  'for',
  'the',
  'world'],
 ['Yet',
  'determining',
  'how',
  'person',
  'is',
  'at',
  'work',
  'in',
  'this',
  'world',
  'is',
  'of',
  'the',
  'hardest',
  'theological',
  'challenges'],
 ['Think', 'about', 'the', 'urgent', 'crises', 'confronting', 'us'],
 ['People',
  'of',
  'faith',
  'pray',
  'for',
  'deliverance',
  'trusting',
  'person',
  'hold',
  'the',
  'waters',
  'of',
  'the',
  'sea',
  'or',
  'help',
  'them',
  'elude',
  'their',
  'enemies',
  'pushing',
  'them',
  'over',
  'the',
  'border',
  'in',
  'place',
  'or',
  'rid',
  'them',
  'of',
  'the',
  'malignancy',
  'growing',
  'in',
  'their',
  'bodies',
  'or',
  'quell',
  'the',
  'rising',
  'tide',
  'of',
  'white',
  'supremacy'],
 ['Fervent

In [19]:
reliable['content'] = reliable['content'].apply(recombine)
reliable.to_csv('data_final/baseline/reliable_string.csv')
pickle.dump(reliable, open( "data_final/baseline/reliable_string.pkl", "wb" ))

In [20]:
reliable.iloc[1,2]

['Surely the work of Christians is more than simply fueling the engine of capitalism meaningful work also participates in person intention for the world Yet determining how person is at work in this world is of the hardest theological challenges Think about the urgent crises confronting us People of faith pray for deliverance trusting person hold the waters of the sea or help them elude their enemies pushing them over the border in place or rid them of the malignancy growing in their bodies or quell the rising tide of white supremacy Fervent prayer may not create the conditions for which they pray however many continue trust that person providence will prevail We must ask through what instrumentality Reading narratives of deliverance in place evokes hope for person mighty acts be victorious once again Many preachers and date school teachers have followed the lectionary texts from Exodus in date after place We have noted the trickery of person and organization the resistance of person d