GitHub source (Bias): https://github.com/conversationai/unintended-ml-bias-analysis

In [1]:
import pandas as pd
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
import gensim
import nltk.sentiment

nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('vader_lexicon')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [2]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Data Loading

In [3]:
# Load Dataset from drive
fake_news_data = pd.read_csv('/content/drive/My Drive/MLSpring2020/TheMeanSquares-StockPrediction/Alternus-Vera TheMeanSquares/Iteration 1/Datasets/articles3.csv', low_memory =False)
# Keep only the first n percent of the rows.
n = 20
fake_news_data = fake_news_data.head(int(len(fake_news_data)*(n/100)))
fake_news_data.head() 

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,103459,151908,Alton Sterling’s son: ’Everyone needs to prote...,Guardian,Jessica Glenza,2016-07-13,2016.0,7.0,https://www.theguardian.com/us-news/2016/jul/1...,The son of a Louisiana man whose father was sh...
1,103460,151909,Shakespeare’s first four folios sell at auctio...,Guardian,,2016-05-25,2016.0,5.0,https://www.theguardian.com/culture/2016/may/2...,Copies of William Shakespeare’s first four boo...
2,103461,151910,My grandmother’s death saved me from a life of...,Guardian,Robert Pendry,2016-10-31,2016.0,10.0,https://www.theguardian.com/commentisfree/2016...,"Debt: $20, 000, Source: College, credit cards,..."
3,103462,151911,I feared my life lacked meaning. Cancer pushed...,Guardian,Bradford Frost,2016-11-26,2016.0,11.0,https://www.theguardian.com/commentisfree/2016...,"It was late. I was drunk, nearing my 35th birt..."
4,103463,151912,Texas man serving life sentence innocent of do...,Guardian,,2016-08-20,2016.0,8.0,https://www.theguardian.com/us-news/2016/aug/2...,A central Texas man serving a life sentence fo...


# Preprocessing

In [0]:
# Drop every row that has a missing value in any column (for example, articles without an author).
fake_news_data = fake_news_data.dropna()

In [0]:
import re

# Build these once; constructing them again for every word is needlessly slow.
STOPS = set(nltk.corpus.stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()
STEMMER = nltk.stem.SnowballStemmer('english')

def cleaning(raw_news):
    """Normalize a text: letters only, lower case, no stop words, lemmatized, stemmed."""

    # 1. Replace everything except ASCII letters (digits, punctuation, symbols) with spaces.
    news = re.sub("[^a-zA-Z]", " ", raw_news)

    # 2. Convert to lower case.
    news = news.lower()

    # 3. Tokenize.
    news_words = nltk.word_tokenize(news)

    # 4. Remove stop words.
    words = [w for w in news_words if w not in STOPS]

    # 5. Lemmatize.
    lemmas = [LEMMATIZER.lemmatize(w) for w in words]

    # 6. Stem.
    stems = [STEMMER.stem(w) for w in lemmas]

    # 7. Join the stemmed words back into one space-separated string.
    return " ".join(stems)

In [6]:
import time

t1 = time.time()
fake_news_data['clean_title'] = fake_news_data["title"].apply(cleaning) 
t2 = time.time()
print("\nTime to clean, tokenize and stem title in fake_news_data: \n", len(fake_news_data), "news:", (t2-t1)/60, "min")

# t1 = time.time()
# fake_news_data['clean_thread_title'] = fake_news_data["thread_title"].apply(cleaning) 
# t2 = time.time()
# print("\nTime to clean, tokenize and stem thread_title in fake_news_data: \n", len(fake_news_data), "news:", (t2-t1)/60, "min")

# t1 = time.time()
# fake_news_data['clean_content'] = fake_news_data["content"].apply(cleaning) 
# t2 = time.time()
# print("\nTime to clean, tokenize and stem text in fake_news_data: \n", len(fake_news_data), "news:", (t2-t1)/60, "min")


Time to clean, tokenize and stem title in fake_news_data: 
 7776 news: 0.08797337214152018 min


In [7]:
model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/My Drive/MLSpring2020/TheMeanSquares-StockPrediction/Alternus-Vera TheMeanSquares/Iteration 1/Datasets/GoogleNews-vectors-negative300.bin.gz', binary=True)
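# Vocabulary in the order stored in the model file.
# (index2word is the gensim 3.x attribute name; gensim 4 renamed it to index_to_key.)
words = model.index2word

# Map each word to its position in that list. The spell checker below uses this
# rank as a stand-in for word frequency (lower rank is treated as more frequent).
w_rank = {word: i for i, word in enumerate(words)}

WORDS = w_rank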



In [0]:
import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

def P(word): 
    "Score for `word`: its negated vocabulary rank, so lower-ranked (more frequent) words score higher."
    return - WORDS.get(word, 0)

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
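
The corrector above ranks candidates by vocabulary position rather than by a true probability. The following is a minimal sketch of that idea on a made-up five-word vocabulary; `TOY_WORDS`, `toy_edits1`, and `toy_correction` are hypothetical names used only for illustration, and the sketch looks just one edit away to stay short.

```python
# Hypothetical toy vocabulary; position 0 is treated as the most frequent word.
TOY_WORDS = {w: i for i, w in enumerate(["the", "protest", "trump", "protect", "peace"])}

def toy_edits1(word):
    """All strings one delete, transpose, replace, or insert away from `word`."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def toy_known(words):
    """The subset of `words` found in the toy vocabulary."""
    return {w for w in words if w in TOY_WORDS}

def toy_correction(word):
    """Best-ranked known candidate, or `word` itself if nothing is known."""
    candidates = toy_known([word]) or toy_known(toy_edits1(word)) or [word]
    # Negating the rank makes max() pick the candidate closest to position 0.
    return max(candidates, key=lambda w: -TOY_WORDS.get(w, 0))

print(toy_correction("protest"))  # protest (already known, left unchanged)
print(toy_correction("protext"))  # protest (one edit from "protest" and "protect"; "protest" ranks better)
print(toy_correction("zzzz"))     # zzzz (no known candidate, so the word is returned as-is)
```

Note the last case: a word with no known candidate is returned unchanged, so a check of the form `correction(w) == w` treats it the same as a correctly spelled word.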

# Credibility/Reliability

In [9]:
fake_news_data.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content,clean_title
0,103459,151908,Alton Sterling’s son: ’Everyone needs to prote...,Guardian,Jessica Glenza,2016-07-13,2016.0,7.0,https://www.theguardian.com/us-news/2016/jul/1...,The son of a Louisiana man whose father was sh...,alton sterl son everyon need protest right way...
2,103461,151910,My grandmother’s death saved me from a life of...,Guardian,Robert Pendry,2016-10-31,2016.0,10.0,https://www.theguardian.com/commentisfree/2016...,"Debt: $20, 000, Source: College, credit cards,...",grandmoth death save life debt
3,103462,151911,I feared my life lacked meaning. Cancer pushed...,Guardian,Bradford Frost,2016-11-26,2016.0,11.0,https://www.theguardian.com/commentisfree/2016...,"It was late. I was drunk, nearing my 35th birt...",fear life lack mean cancer push find
5,103464,151914,My dad’s Reagan protests inspire me to stand u...,Guardian,Steven W Thrasher,2016-11-28,2016.0,11.0,https://www.theguardian.com/commentisfree/2016...,I have been battling depression and sleeplessn...,dad reagan protest inspir stand donald trump
6,103465,151915,Flatmates of gay Syrian refugee beheaded in Tu...,Guardian,Patrick Kingsley,2016-08-07,2016.0,8.0,https://www.theguardian.com/world/2016/aug/07/...,Three flatmates of a gay Syrian refugee behead...,flatmat gay syrian refuge behead turkey fear next


In [10]:
from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer()
vector.fit(fake_news_data['clean_title'])
v = vector.transform(fake_news_data['clean_title'])
print(vector)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)


In [11]:
vector.vocabulary_

{'alton': 223,
 'sterl': 7944,
 'son': 7735,
 'everyon': 2751,
 'need': 5519,
 'protest': 6509,
 'right': 6988,
 'way': 9086,
 'peac': 6060,
 'grandmoth': 3483,
 'death': 2024,
 'save': 7227,
 'life': 4704,
 'debt': 2027,
 'fear': 2918,
 'lack': 4532,
 'mean': 5082,
 'cancer': 1180,
 'push': 6563,
 'find': 3005,
 'dad': 1953,
 'reagan': 6711,
 'inspir': 4112,
 'stand': 7893,
 'donald': 2347,
 'trump': 8618,
 'flatmat': 3042,
 'gay': 3306,
 'syrian': 8184,
 'refuge': 6782,
 'behead': 701,
 'turkey': 8643,
 'next': 5575,
 'jaffa': 4230,
 'daredevil': 1983,
 'world': 9258,
 'steepest': 7934,
 'street': 7999,
 'nsa': 5685,
 'contractor': 1722,
 'arrest': 404,
 'alleg': 197,
 'theft': 8357,
 'top': 8474,
 'secret': 7323,
 'classifi': 1476,
 'inform': 4082,
 'dissolv': 2287,
 'charit': 1348,
 'foundat': 3153,
 'mount': 5380,
 'complaint': 1628,
 'serbian': 7376,
 'olymp': 5778,
 'rower': 7096,
 'sink': 7564,
 'feroci': 2952,
 'condit': 1654,
 'rio': 6997,
 'water': 9077,
 'vote': 9003,
 'rac

In [12]:
v.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
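
To make the `vocabulary_` mapping and the count matrix above more concrete, here is a small hand-rolled sketch of the same bag-of-words idea using only the standard library. It is not scikit-learn's implementation; it only mirrors the printed `token_pattern` and the alphabetical column numbering, and the two titles are taken from the `clean_title` column shown earlier.

```python
import re
from collections import Counter

titles = [
    "grandmoth death save life debt",
    "fear life lack mean cancer push find",
]

def tokenize(text):
    """Tokens of two or more word characters, as in the token_pattern printed above."""
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

# Assign each distinct term a column index in alphabetical order.
all_terms = sorted({term for title in titles for term in tokenize(title)})
vocabulary = {term: index for index, term in enumerate(all_terms)}

# One row per title, one column per term; each cell is how often the term occurs.
matrix = []
for title in titles:
    counts = Counter(tokenize(title))
    matrix.append([counts.get(term, 0) for term in all_terms])

print(vocabulary["life"])  # 7
print(matrix[0])           # [0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
```

Most cells are zero because each title contains only a handful of the terms, which is why `v.toarray()` above prints almost nothing but zeros.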

In [0]:
def spell_checker(text):
    """Count the words in `text` that the corrector leaves unchanged vs. the words it would change."""
    all_words = re.findall(r'\w+', text.lower())  # split sentence into words
    correct_number = 0
    incorrect_number = 0
    for word in all_words:
        if correction(word) == word:
            correct_number += 1
        else:
            incorrect_number += 1
    return correct_number, incorrect_number

In [0]:
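def reliability(row):
  # Note: clean_title holds stems (e.g. 'everyon'), which the corrector may count as misspellings.
  correct, incorrect = spell_checker(row['clean_title'])
  total = correct + incorrect
  if total == 0:
    # An empty cleaned title would otherwise raise ZeroDivisionError.
    return "un-reliability"
  value = correct / total
  if value > 0.80:
    return "reliability"
  else:
    return "un-reliability"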
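
Because the reliability ratio is computed on `clean_title`, which contains stems rather than full words, it may be worth checking how stemming alone shifts the ratio. The sketch below is a simplification under stated assumptions: it replaces the spelling corrector with plain membership in a made-up vocabulary (`TOY_VOCAB`), and `known_ratio` is a hypothetical helper, not part of the notebook.

```python
def known_ratio(text, vocabulary):
    """Share of whitespace-separated tokens that appear in `vocabulary`."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in vocabulary for token in tokens) / len(tokens)

# Hypothetical vocabulary of correctly spelled words.
TOY_VOCAB = {"everyone", "need", "needs", "protest", "right", "way"}

before_stemming = "everyone needs protest right way"
after_stemming = "everyon need protest right way"   # 'everyon' is a stem, not a word

print(known_ratio(before_stemming, TOY_VOCAB))  # 1.0
print(known_ratio(after_stemming, TOY_VOCAB))   # 0.8

# The notebook's rule is `value > 0.80`, so a ratio of exactly 0.8 is not enough.
print(known_ratio(after_stemming, TOY_VOCAB) > 0.80)  # False
```

In this toy case a single stem is enough to move a correctly spelled title across the threshold, which suggests running the check on the unstemmed title might give a cleaner signal.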

In [0]:
fake_news_data['reliability'] = fake_news_data.apply(reliability, axis=1)

In [16]:
fake_news_data.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content,clean_title,reliability
0,103459,151908,Alton Sterling’s son: ’Everyone needs to prote...,Guardian,Jessica Glenza,2016-07-13,2016.0,7.0,https://www.theguardian.com/us-news/2016/jul/1...,The son of a Louisiana man whose father was sh...,alton sterl son everyon need protest right way...,un-reliability
2,103461,151910,My grandmother’s death saved me from a life of...,Guardian,Robert Pendry,2016-10-31,2016.0,10.0,https://www.theguardian.com/commentisfree/2016...,"Debt: $20, 000, Source: College, credit cards,...",grandmoth death save life debt,un-reliability
3,103462,151911,I feared my life lacked meaning. Cancer pushed...,Guardian,Bradford Frost,2016-11-26,2016.0,11.0,https://www.theguardian.com/commentisfree/2016...,"It was late. I was drunk, nearing my 35th birt...",fear life lack mean cancer push find,reliability
5,103464,151914,My dad’s Reagan protests inspire me to stand u...,Guardian,Steven W Thrasher,2016-11-28,2016.0,11.0,https://www.theguardian.com/commentisfree/2016...,I have been battling depression and sleeplessn...,dad reagan protest inspir stand donald trump,reliability
6,103465,151915,Flatmates of gay Syrian refugee beheaded in Tu...,Guardian,Patrick Kingsley,2016-08-07,2016.0,8.0,https://www.theguardian.com/world/2016/aug/07/...,Three flatmates of a gay Syrian refugee behead...,flatmat gay syrian refuge behead turkey fear next,reliability


In [17]:
fake_news_data['reliability'].value_counts()

reliability       4823
un-reliability    2953
Name: reliability, dtype: int64

## Website Data

In [0]:
import requests
import pandas as pd
import json

In [19]:
website_data = pd.read_csv('/content/drive/My Drive/MLSpring2020/TheMeanSquares-StockPrediction/Alternus-Vera TheMeanSquares/Iteration 1/Datasets/website.csv', sep=',')
website_data.head()

Unnamed: 0,domain,page_rank_decimal,type
0,100percentfedup.com,3.94,unreliable
1,16wmpo.com,2.98,unreliable
2,21stcenturywire.com,4.76,reliable
3,24wpn.com,3.0,unreliable
4,365usanews.com,2.89,unreliable


In [0]:
domain = 'www.npr.org'

In [21]:
API_ENDPOINT = "https://openpagerank.com/api/v1.0/getPageRank?domains[]=" + domain.strip()
API_KEY = "sw0gsosokcwk0go8cgsokoowgk8gcw8gs4ckkswk"
print(API_ENDPOINT)

https://openpagerank.com/api/v1.0/getPageRank?domains[]=www.npr.org


In [0]:
headers = {'API-OPR':API_KEY}

In [23]:
response = requests.get(url = API_ENDPOINT, headers= headers)
data = json.loads(response.text)
text = data['response'][0]['page_rank_decimal']
print(text)

7.46


In [24]:
print((response.text))
data

{"status_code":200,"response":[{"status_code":200,"error":"","page_rank_integer":7,"page_rank_decimal":7.46,"rank":"148","domain":"npr.org"}],"last_updated":"29th Nov 2019"}


{'last_updated': '29th Nov 2019',
 'response': [{'domain': 'npr.org',
   'error': '',
   'page_rank_decimal': 7.46,
   'page_rank_integer': 7,
   'rank': '148',
   'status_code': 200}],
 'status_code': 200}
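
The lookup above indexes straight into `data['response'][0]`, which fails if the list is empty or the entry reports an error. Below is a minimal sketch of a more defensive parse, run against the exact payload printed above so that no network call is needed. `extract_page_rank` is a hypothetical helper, and the assumption that failures show up as a non-200 `status_code` or a non-empty `error` field is inferred from the field names, not confirmed by the API documentation.

```python
import json

# Payload copied from the response printed above.
SAMPLE = (
    '{"status_code":200,"response":[{"status_code":200,"error":"",'
    '"page_rank_integer":7,"page_rank_decimal":7.46,"rank":"148",'
    '"domain":"npr.org"}],"last_updated":"29th Nov 2019"}'
)

def extract_page_rank(payload_text):
    """Return page_rank_decimal for the first domain, or None if it is unavailable."""
    data = json.loads(payload_text)
    entries = data.get("response") or []
    if not entries:
        return None
    first = entries[0]
    # Assumption: a failed lookup carries a non-200 status or a non-empty error string.
    if first.get("status_code") != 200 or first.get("error"):
        return None
    return first.get("page_rank_decimal")

print(extract_page_rank(SAMPLE))              # 7.46
print(extract_page_rank('{"response": []}'))  # None
```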

In [25]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pandas as pd
import requests
import json
import pickle

class Rel_Cred_Feature():
    def setup(self): 
        # Load the labeled website dataset.
        website_data = pd.read_csv('/content/drive/My Drive/MLSpring2020/TheMeanSquares-StockPrediction/Alternus-Vera TheMeanSquares/Iteration 1/Datasets/website.csv', sep=',')
        y = website_data['type']
        # Drop the label and the domain name so that only numeric columns remain as features.
        website_data = website_data.drop(['type', 'domain'], axis=1)

        # The split is random (no random_state), so the accuracy changes from run to run.
        X_train, X_test, y_train, y_test = train_test_split(website_data, y, test_size=0.2)

        # Despite the attribute name, this is a multinomial Naive Bayes classifier,
        # not a logistic regression pipeline.
        self.logR_pipeline = MultinomialNB()

        self.logR_pipeline.fit(X_train, y_train)
        predicted = self.logR_pipeline.predict(X_test)
        score = metrics.accuracy_score(y_test, predicted)
        print("Bias Score Model Trained - accuracy:   %0.6f" % score)

    def predict(self, domain):
        # Look up the domain's page rank, then classify it with the trained model.
        API_ENDPOINT = "https://openpagerank.com/api/v1.0/getPageRank?domains[]=" + domain.strip()
        API_KEY = "sw0gsosokcwk0go8cgsokoowgk8gcw8gs4ckkswk"
        headers = {'API-OPR':API_KEY}
        response = requests.get(url = API_ENDPOINT, headers= headers)
        data = json.loads(response.text)
        text = data['response'][0]['page_rank_decimal']
        text = pd.DataFrame([text], columns=['page_rank_decimal'])
        predicted = self.logR_pipeline.predict(text)
        # Column 1 of predict_proba belongs to self.logR_pipeline.classes_[1].
        predicted_prob = self.logR_pipeline.predict_proba(text)[:,1]
        return predicted[0], float(predicted_prob[0])
    
relcred = Rel_Cred_Feature()
relcred.setup()
relcred.predict("www.google.com")


Bias Score Model Trained - accuracy:   0.557292


('unreliable', 0.6000000000000001)
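
One possible reading of the result above: if the classifier is trained on `page_rank_decimal` alone, as the `predict` method suggests, a multinomial Naive Bayes model cannot use that value at all. Its per-class feature probabilities are normalized across features, so with a single feature each one equals 1 and the posterior collapses to the class prior. The sketch below re-implements the textbook posterior by hand (it is not scikit-learn's code) with made-up counts to show the effect; `multinomial_nb_posteriors` and all numbers are hypothetical.

```python
import math

def multinomial_nb_posteriors(x, class_counts, feature_totals, alpha=1.0):
    """Class posteriors for a multinomial Naive Bayes model with ONE feature.

    class_counts:   number of training rows per class.
    feature_totals: sum of the single feature over each class's training rows.
    """
    n_features = 1
    n_rows = sum(class_counts.values())
    log_scores = {}
    for label, count in class_counts.items():
        log_prior = math.log(count / n_rows)
        # Smoothed share of this feature among all features of the class.
        # With one feature, numerator and denominator are identical, so theta == 1.
        theta = (feature_totals[label] + alpha) / (feature_totals[label] + alpha * n_features)
        log_scores[label] = log_prior + x * math.log(theta)
    log_norm = math.log(sum(math.exp(s) for s in log_scores.values()))
    return {label: math.exp(s - log_norm) for label, s in log_scores.items()}

counts = {"reliable": 40, "unreliable": 60}          # hypothetical 40/60 class split
totals = {"reliable": 210.0, "unreliable": 180.0}    # hypothetical page-rank sums

for page_rank in (0.5, 3.0, 9.5):
    posterior = multinomial_nb_posteriors(page_rank, counts, totals)
    print(page_rank, round(posterior["unreliable"], 6))  # 0.6 every time
```

If this is what happens in the notebook, every domain receives the majority label regardless of its page rank, and a model that models the feature value directly (for example Gaussian Naive Bayes or logistic regression) would be a more suitable choice for a single continuous feature.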

In [0]:
# Save the trained instance (relcred), not a new, unfitted Rel_Cred_Feature().
with open("/content/drive/My Drive/MLSpring2020/TheMeanSquares-StockPrediction/Alternus-Vera TheMeanSquares/Models/Rel_Cred_Feature.sav", 'wb') as f:
    pickle.dump(relcred, f)

In [27]:
fake_news_data.shape

(7776, 12)

In [28]:
from urllib.parse import urlparse

fake_news_data['credibility'] = 'reliable'

# Query each domain only once and reuse the cached prediction for later rows.
unique_url = {}
for index, row in fake_news_data.iterrows():
  url = urlparse(row['url']).netloc
  if url not in unique_url:
    unique_url[url] = relcred.predict(url)[0]
  fake_news_data.at[index,'credibility'] = unique_url[url]


fake_news_data.head()
#print(urlparse(row['url']).netloc)
#fake_news_data.apply(lambda row: print(urlparse(row['url']).netloc), axis=1)

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content,clean_title,reliability,credibility
0,103459,151908,Alton Sterling’s son: ’Everyone needs to prote...,Guardian,Jessica Glenza,2016-07-13,2016.0,7.0,https://www.theguardian.com/us-news/2016/jul/1...,The son of a Louisiana man whose father was sh...,alton sterl son everyon need protest right way...,un-reliability,unreliable
2,103461,151910,My grandmother’s death saved me from a life of...,Guardian,Robert Pendry,2016-10-31,2016.0,10.0,https://www.theguardian.com/commentisfree/2016...,"Debt: $20, 000, Source: College, credit cards,...",grandmoth death save life debt,un-reliability,unreliable
3,103462,151911,I feared my life lacked meaning. Cancer pushed...,Guardian,Bradford Frost,2016-11-26,2016.0,11.0,https://www.theguardian.com/commentisfree/2016...,"It was late. I was drunk, nearing my 35th birt...",fear life lack mean cancer push find,reliability,unreliable
5,103464,151914,My dad’s Reagan protests inspire me to stand u...,Guardian,Steven W Thrasher,2016-11-28,2016.0,11.0,https://www.theguardian.com/commentisfree/2016...,I have been battling depression and sleeplessn...,dad reagan protest inspir stand donald trump,reliability,unreliable
6,103465,151915,Flatmates of gay Syrian refugee beheaded in Tu...,Guardian,Patrick Kingsley,2016-08-07,2016.0,8.0,https://www.theguardian.com/world/2016/aug/07/...,Three flatmates of a gay Syrian refugee behead...,flatmat gay syrian refuge behead turkey fear next,reliability,unreliable
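
The per-domain lookup can also be expressed with a cache decorator, and it may help to normalize the host name first: the labeled website list stores bare domains such as `100percentfedup.com`, while `urlparse(...).netloc` keeps a leading `www.`. The sketch below is illustrative only. `domain_of` and `cached_label` are hypothetical names, the URLs are placeholders on `example.com`, and the stand-in label replaces the real `relcred.predict(domain)[0]` call so that the example runs offline.

```python
from functools import lru_cache
from urllib.parse import urlparse

def domain_of(url):
    """Lower-cased host name without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

lookups = []  # records every domain that triggers a real lookup

@lru_cache(maxsize=None)
def cached_label(domain):
    # Stand-in for relcred.predict(domain)[0]; called once per distinct domain.
    lookups.append(domain)
    return "unreliable"

urls = [
    "https://www.example.com/us-news/a",
    "https://www.example.com/culture/b",
    "https://example.com/world/c",
]
labels = [cached_label(domain_of(u)) for u in urls]

print(labels)   # ['unreliable', 'unreliable', 'unreliable']
print(lookups)  # ['example.com']
```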


# Data Narrative

## Credibility and Reliability

For credibility and reliability, I discussed with the professor and my team members how best to determine each of them.
The suggestion was to use an article's source to judge how credible and reliable the article is.

Getting the source information was much harder than we originally anticipated. To overcome this, we found a partly labeled dataset of news sources marked as reliable or unreliable. We combined these labels with each website's page rank (the score was retrieved from an API) and trained a model on the rank to predict whether the sites that weren't labeled are reliable or unreliable.
 
This gave us an accuracy of about 56% on the held-out 20% of the labeled sites in the run shown above (0.557292); the split is random, so the figure varies between runs. To improve this accuracy, we would need to increase the number of reliable news sources in our labeled dataset. We could also try other models to see whether they give better accuracy; so far, we have only had time to test one.
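
When judging an accuracy figure like the one reported for this model, a useful reference point is the accuracy of always predicting the most common training label. The following is a minimal sketch with made-up labels; `majority_baseline_accuracy` is a hypothetical helper and the label counts are not taken from the website dataset.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)

# Hypothetical labels with a 60/40 class split.
train = ["unreliable"] * 6 + ["reliable"] * 4
test = ["unreliable"] * 3 + ["reliable"] * 2

print(majority_baseline_accuracy(train, test))  # ('unreliable', 0.6)
```

A model whose accuracy is close to this baseline is adding little beyond the class balance, so comparing candidate models against it is a cheap first check.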
 
Using this model, we ran predictions on our fake news dataset and labeled each article as reliable or unreliable based on its source. I will need to calculate credibility in the next iteration.
