GitHub Source (Bias) - https://github.com/conversationai/unintended-ml-bias-analysis

In [0]:
import pandas as pd
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
import gensim
import nltk.sentiment

nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('vader_lexicon')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [0]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Data Loading

In [0]:
# Load Dataset from drive
fake_news_data = pd.read_csv('/content/drive/My Drive/MLSpring2020/TheMeanSquares-StockPrediction/Alternus-Vera TheMeanSquares/Iteration 1/Datasets/articles3.csv', low_memory =False)
n = 20
fake_news_data = fake_news_data.head(int(len(fake_news_data)*(n/100)))
fake_news_data.head()
fake_news_data.shape

(8514, 10)

# Preprocessing

In [0]:
fake_news_data = fake_news_data.dropna()

In [0]:
import re

def cleaning(raw_news):
    # 1. Remove commas, periods, exclamation marks, and question marks.
    news = re.sub(r"[,.!?]", "", raw_news)

    # 2. Convert to lower case.
    news = news.lower()

    # 3. Tokenize.
    news_words = nltk.word_tokenize(news)

    # 4. Convert the stopwords list to "set" data type for fast membership tests.
    stops = set(nltk.corpus.stopwords.words("english"))

    # 5. Remove stop words.
    words = [w for w in news_words if w not in stops]

    # 6. Lemmatize (create the lemmatizer once, not once per word).
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(w) for w in words]

    # 7. Stem (create the stemmer once, not once per word).
    stemmer = nltk.stem.SnowballStemmer("english")
    stems = [stemmer.stem(w) for w in lemmas]

    # 8. Join the stemmed words back into one string separated by space, and return the result.
    return " ".join(stems)

In [0]:
import time

t1 = time.time()
fake_news_data['clean_content'] = fake_news_data["content"].apply(cleaning) 
t2 = time.time()
print("\nTime to clean, tokenize and stem content in fake_news_data: \n", len(fake_news_data), "news:", (t2-t1)/60, "min")

# t1 = time.time()
# fake_news_data['clean_thread_title'] = fake_news_data["thread_title"].apply(cleaning) 
# t2 = time.time()
# print("\nTime to clean, tokenize and stem thread_title in fake_news_data: \n", len(fake_news_data), "news:", (t2-t1)/60, "min")

# t1 = time.time()
# fake_news_data['clean_content'] = fake_news_data["content"].apply(cleaning) 
# t2 = time.time()
# print("\nTime to clean, tokenize and stem text in fake_news_data: \n", len(fake_news_data), "news:", (t2-t1)/60, "min")


Time to clean, tokenize and stem content in fake_news_data: 
 7776 news: 1.705378524462382 min


# Biased

In [0]:
fake_news_data.shape

(7776, 11)

In [0]:
# Load the LDA model from sk-learn
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.feature_extraction.text import CountVectorizer
import nltk 
from nltk.corpus import wordnet


# Helper function
def print_topics(model, count_vectorizer, n_top_words):
    """For each topic, print how many top words have an antonym among the top words."""
    # get_feature_names() was removed in scikit-learn 1.2; older versions need it here instead.
    words = count_vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(model.components_):
        # Indices of the n_top_words highest-weighted terms, in descending order.
        top_words = [words[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
        num_ant = 0
        for word in top_words:
            # Collect the first antonym of every lemma that has one.
            ant = []
            for synset in wordnet.synsets(word):
                for lemma in synset.lemmas():
                    if lemma.antonyms():
                        ant.append(lemma.antonyms()[0].name())

            # Count the word if any of its antonyms is also a top word.
            if any(a in top_words for a in ant):
                num_ant += 1
        print(num_ant, len(top_words))

# Tweak the two parameters below
number_topics = 1
number_words = 10

for index, row in fake_news_data.iterrows():
  
  count_vectorizer = CountVectorizer(stop_words='english')
  count_data = count_vectorizer.fit_transform([row['clean_content']])

  # Create and fit the LDA model
  lda = LDA(n_components=number_topics, n_jobs=-1)
  lda.fit(count_data)
  # Print the topics found by the LDA model
  # print("Topics found via LDA:")
  print_topics(lda, count_vectorizer, number_words)
  

Streaming output truncated to the last 5000 lines.
2 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
1 10
0 10
0 10
1 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
2 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
0 10
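
Each output line above is the pair printed by `print_topics`: the number of top words that have an antonym among the same top words, followed by the number of top words. The sketch below isolates that counting step. It assumes a small hand-written antonym table in place of the WordNet lookup, so `ANTONYMS` and `count_antonym_hits` are hypothetical names used only for illustration.

```python
# Hypothetical antonym table standing in for the WordNet lookup.
ANTONYMS = {
    "good": {"bad"},
    "bad": {"good"},
    "win": {"lose"},
    "lose": {"win"},
}


def count_antonym_hits(top_words, antonyms=ANTONYMS):
    """Count the words in top_words that have at least one antonym also in top_words."""
    vocabulary = set(top_words)
    return sum(1 for word in top_words if antonyms.get(word, set()) & vocabulary)


top_words = ["good", "bad", "market", "win", "price"]
hits = count_antonym_hits(top_words)
print(hits, len(top_words))  # prints: 2 5
```

Here "good" and "bad" each find their antonym in the list, while "win" does not because "lose" is absent. Since the notebook looks up stemmed words, many stems may not match a WordNet entry at all, which could be one reason most rows report 0.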


In [0]:
#Labeling training dataset
senti = nltk.sentiment.vader.SentimentIntensityAnalyzer()
# fake_news_data = pd.read_csv('/content/drive/My Drive/MLSpring2020/TheMeanSquares-StockPrediction/Alternus-Vera TheMeanSquares/Iteration 1/Datasets/toxic.csv', low_memory =False)

def unbiased_biased_label(row):
    # Assumes the DataFrame has a truthy/falsy 'toxic' column; toxic rows are labeled biased.
    return "biased" if row['toxic'] else "unbiased"

fake_news_data['unbiased_biased_title'] = fake_news_data.apply(unbiased_biased_label, axis=1)
fake_news_data.unbiased_biased_title.value_counts()
fake_news_data.head()
fake_news_data.to_csv("/content/drive/My Drive/MLSpring2020/TheMeanSquares-StockPrediction/Alternus-Vera TheMeanSquares/Iteration 1/Datasets/toxic.csv")

In [0]:
fake_news_data.head()

In [0]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn import metrics
import pickle
class BiasScoreFeature():
    def setup(self):
        # Load the labeled toxic-comment dataset.
        toxic_data = pd.read_csv('/content/drive/My Drive/MLSpring2020/TheMeanSquares-StockPrediction/Alternus-Vera TheMeanSquares/Iteration 1/Datasets/toxic.csv', sep=',')
        y = toxic_data['unbiased_biased_title']

        # Only the comment text is used as a feature, so split that column directly.
        X_train, X_test, y_train, y_test = train_test_split(toxic_data['comment'], y, test_size=0.2)

        # The pipeline fits the vectorizer on the training split only.
        self.nb_pipeline = Pipeline([
            ('NBCV', CountVectorizer()),
            ('nb_clf', MultinomialNB())])

        self.nb_pipeline.fit(X_train, y_train)
        predicted_nb = self.nb_pipeline.predict(X_test)
        score = metrics.accuracy_score(y_test, predicted_nb)
        print("Bias Score Model Trained - accuracy:   %0.6f" % score)

    def predict(self, text):
        predicted = self.nb_pipeline.predict([text])
        # classes_ is sorted, so column 1 is "unbiased"; look up the "biased" column explicitly.
        classes = list(self.nb_pipeline.named_steps['nb_clf'].classes_)
        biased_prob = self.nb_pipeline.predict_proba([text])[0, classes.index("biased")]
        return predicted[0], float(biased_prob)
    
biasscore = BiasScoreFeature()
biasscore.setup()
biasscore.predict("Says the Annies List political group supports third-trimester abortions on demand.")



In [0]:
# pickle.dump(BiasScoreFeature(), open("/content/drive/My Drive/MLSpring2020/TheMeanSquares-StockPrediction/Alternus-Vera TheMeanSquares/Iteration 1/Models/biased_unbiased.sav", 'wb'))

# Data Narrative

## Credibility and Reliability

For credibility and reliability, I discussed with the professor and my team members how best to determine them.
The suggestion was to use each article's sources to judge how credible and reliable the article is.

Getting the source information was much harder than we originally anticipated. To overcome this, we found a partly labeled dataset of news sources marked as reliable or unreliable. We then used these labels, along with each website's rank (a score obtained from an API), to train a model that predicts whether the unlabeled sites are reliable or unreliable.

This gave us an accuracy of 60%. To improve it, we would need to increase the number of reliable news sources in our labeled dataset. We could also try other models to see whether they give better accuracy, since so far we have only had time to test one.

We then ran this model against our fake news dataset to predict whether each article is reliable or not. I will need to calculate credibility in the next iteration.


## Bias

I had initially discussed with the professor using the existing fake-news dataset to build a model that classifies articles as biased or unbiased. The problem was that the data was biased to begin with, and there was no real way to label unbiased data.

I did some investigation and found several Kaggle competitions that approach a similar problem through the toxicity of comment language. I used their dataset and labeled every toxic comment as biased and everything else as unbiased. There is definitely room for improvement here, but for the time being I went with this on the assumption that the more toxic someone is, the more likely they are to be biased. I used CountVectorizer to help my model identify posts and comments that seemed biased or unbiased. I didn’t get a chance to do it, but I also wanted to run sentiment analysis on the headline and the content to see whether they differ. If they do, that suggests a clickbait-like article (the headline says one thing, the content says something else), which would also help indicate whether the article is biased. I could then combine this with the CountVectorizer and naive Bayes classifier to get better results.
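
The headline-versus-content comparison described above was not implemented in this iteration. The sketch below is one possible shape for it, not the method used in the notebook. The function names and the `threshold` value of 0.5 are hypothetical, and the scorer is passed in as a callable so that any sentiment model returning a value in [-1, 1] can be plugged in. `make_vader_scorer` wraps NLTK's VADER compound score and requires the `vader_lexicon` download from the first cell.

```python
def sentiment_gap(headline, content, score_fn):
    """Return the absolute difference between headline and content sentiment.

    score_fn is any callable mapping a text to a score in [-1, 1].
    """
    return abs(score_fn(headline) - score_fn(content))


def is_possible_clickbait(headline, content, score_fn, threshold=0.5):
    """Flag an article whose headline and body sentiment differ by more than threshold."""
    return sentiment_gap(headline, content, score_fn) > threshold


def make_vader_scorer():
    """Build a scorer from NLTK's VADER compound score (needs the vader_lexicon data)."""
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return lambda text: analyzer.polarity_scores(text)["compound"]
```

In the notebook this could be applied per row, for example `is_possible_clickbait(row["title"], row["content"], make_vader_scorer())`, assuming the dataset has a `title` column. The resulting flag or gap could then be added as an extra feature next to the CountVectorizer counts.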

For now, I got 85% accuracy using just CountVectorizer and naive Bayes on the toxic dataset. This seems good, but it may be somewhat overfitted. In the next iteration I hope to improve on this using the techniques discussed above.
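
One way to put the 85% figure in context is to compare it with the accuracy of always predicting the most common label. If the labels are imbalanced, that baseline alone can be high. The sketch below computes it. The label counts are made up for illustration and are not taken from the toxic dataset, and `majority_baseline_accuracy` is a hypothetical helper name.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Hypothetical label distribution, for illustration only.
labels = ["unbiased"] * 9 + ["biased"] * 1
baseline = majority_baseline_accuracy(labels)
print(f"Majority-class baseline: {baseline:.2f}")  # prints: Majority-class baseline: 0.90
```

Running this on `toxic_data['unbiased_biased_title']` would show how much of the reported accuracy comes from the label distribution and how much from the classifier.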
