# EDA for 'debiaser' data product
#### Sagar Setru, September 16th, 2020

## Brief description using CoNVO framework

### Context

Some people are eager to get news from outside of their echo chamber. However, they do not know where to go outside of their echo chambers, and may also have some activation energy when it comes to seeking information from other sources. In the meantime, most newsfeeds only push you content that you agree with. You end up in an echo chamber, but may not have ever wanted to be in one in the first place.

### Need

A way to find news articles from different yet reliable media sources.

### Vision

Debiaser, a data product (maybe Chrome plug-in?) that will recommend news articles similar in topic to the one currently being read, but from several pre-curated and reliable news media organizations across the political spectrum, for example, following the "media bias chart" here https://www.adfontesmedia.com/ or the "media bias ratings" here: https://www.allsides.com/media-bias/media-bias-ratings. The app will determine the main topics of the text of a news article, and then show links to similar articles from other news organizations.

Caveats: Many of these articles may be behind paywalls. News aggregators already basically do this. How different is this than just searching Google using the title of an article?

### Outcome

People who are motivated to engage in content outside of their echo chambers have a tool that enables them to quickly find news similar to what they are currently reading, but from a variety of news organizations.

# EDA

In [35]:
# make sure I'm in the right environment

print('Conda environment:')
print(os.environ['CONDA_DEFAULT_ENV'])

Conda environment:
insight


In [36]:
# import base packages

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

In [324]:
# import text processing and NLP specific packages

# for generating LDA models
import gensim

# for preprocessing documents
from gensim.parsing.preprocessing import preprocess_documents

# to break articles up into sentences (currently not in use)
from nltk import tokenize

# for counting frequency of words
from collections import defaultdict

# for processing lda topic output
from gensim.parsing.preprocessing import preprocess_string
from gensim.parsing.preprocessing import strip_punctuation
from gensim.parsing.preprocessing import strip_numeric

In [262]:
def load_stop_words_csv_to_list(full_file_name):
    """fxn that loads stop words list downloaded from git repo called 'news-stopwords'"""
    
    stop_words = pd.read_csv(full_file_name)

    stop_words = stop_words['term']

    stop_words = [word for word in stop_words]
    
    return stop_words

In [38]:
# get list of manually made text files of articles

full_path = '/Users/sagarsetru/Documents/post PhD positions search/insightDataScience/project/debiaser/article_text_files/'

full_file_names = [
full_path+'ap_hurricane_sally_unleashes_20200916.txt',
full_path+'cnn_big_ten_backtracks_20200916.txt',
full_path+'nyt_on_the_fire_line_20200915.txt',
]

In [357]:
# choose list of stop words

# choose whether 1k, 10k, 100k, or nltk
which_stop_words = '1k'
# which_stop_words = '10k'
# which_stop_words = '100k'
# which_stop_words = 'nltk'

stop_words_path = '/Users/sagarsetru/Documents/post PhD positions search/insightDataScience/project/debiaser/stop_words_db/news-stopwords-master/'


if which_stop_words == '1k':
    
    # doing 1k words list
    stop_words_file_name = 'sw1k.csv'
    
    # make full file name
    stop_words_full_file_name = stop_words_path+stop_words_file_name
    
    # get list of stop words
    stop_words = load_stop_words_csv_to_list(stop_words_full_file_name)
    
elif which_stop_words == '10k':
    
    # doing 10k words list
    stop_words_file_name = 'sw10k.csv'

    # make full file name
    stop_words_full_file_name = stop_words_path+stop_words_file_name
    
    # get list of stop words
    stop_words = load_stop_words_csv_to_list(stop_words_full_file_name)

elif which_stop_words == '100k':
    
    # doing 100k
    stop_words_file_name = 'sw100k.csv'  
    
    # get full file name
    stop_words_full_file_name = stop_words_path+stop_words_file_name
    
    # get list of stop words
    stop_words = load_stop_words_csv_to_list(stop_words_full_file_name)


elif which_stop_words == 'nltk':
    # import from nltk
    from nltk.corpus import stopwords
    
    stop_words = stopwords.words('english')
    
else:
    print('Select proper variable name for "which_stop_words"')
    
# adding custom words
stop_words.append('said')
stop_words.append("you're")

In [356]:
# print(stop_words)

In [392]:
# Create a floor of the frequency of words to remove
word_frequency_threshold = 1

# choose the number of LDA topics
num_lda_topics = 5

# loop through files
for ind, full_file_name in enumerate(full_file_names):
    
    # get the article text as one string, remove new lines
    with open(full_file_name, 'r') as file:
        article_text = file.read().replace('\n', ' ')
        
    # replace weird apostrophes
    article_text = article_text.replace("`","'")
    article_text = article_text.replace("’","'")
    article_text = article_text.replace("'","'")
    
    # get rid of punctuation
    article_text.translate(article_text.maketrans('', '', string.punctuation))

    # break article into sentences
#     article_sentences = tokenize.sent_tokenize(article_text)

    # following https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#sphx-glr-auto-examples-core-run-core-concepts-py
    # Lowercase each document, split it by white space and filter out stopwords
    texts = [[word for word in document.lower().split() if word not in stop_words] 
             for document in [article_text]]

    # Count word frequencies
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1

    # Only keep words that appear more than set frequency, to produce the corpus
    processed_corpus = [[token for token in text if frequency[token] > word_frequency_threshold] for text in texts]

    print('Number of documents: %d' % len(processed_corpus))
    
    # generate a dictionary via gensim
    processed_dictionary = Dictionary(processed_corpus)
    
    # generate bag of words of the corpus
    bow_corpus = [processed_dictionary.doc2bow(text) for text in processed_corpus]
    
    # generate the LDA model
    lda = gensim.models.LdaModel(corpus = bow_corpus,
                                 num_topics = num_lda_topics,
                                 id2word = processed_dictionary,
                                  passes = 1)

    # get the topics from the lda model
    lda_topics = lda.show_topics(formatted=False)

    # initialize empty list for topics
    topics = []
    
#     # generate filter for lda topic output as lambda function
#     filters = [lambda x: x.lower(), strip_punctuation, strip_numeric]

#     # loop through generated topics
#     for topic in lda_topics:
#         word_to_add = preprocess_string(topic[1], filters)
#         if word_to_add not in stop_words:
#             topics.append(word_to_add)

#     all_unique_topics = []
#     for topic_list in topics:
#         for topic in topic_list:
#             if topic not in all_unique_topics:
#                 if topic not in stop_words:
#                     all_unique_topics.append(topic)
    
    
    print(all_unique_topics)
    
#     print(topics)
    
#     if ind == 1:
#         break
        


Number of documents: 1


TypeError: decoding to str: need a bytes-like object, list found

In [398]:
topics_probs_dict = {}
topics = []
for topic in lda_topics:
    word_topics = topic[1]
    for word_topic, prob in word_topics:
        if word_topic not in topics:
            topics.append(word_topic)
        if word_topic not in topics_probs_dict.keys():
            topics_probs_dict[word_topic] = [prob]
        else:
            topics_probs_dict[word_topic].append(prob)
        

In [402]:
print(topics)

['hurricane', 'storm', 'gulf', 'alabama,', 'said.', 'rain', 'winds', 'forecasters', 'pensacola,', 'sally', 'pensacola', 'centimeters)']


In [403]:
print(topics_probs_dict)

{'hurricane': [0.028900892, 0.023605594, 0.056523804, 0.041764483, 0.03936894], 'storm': [0.026775975, 0.024644408, 0.039426956, 0.029284615, 0.032886967], 'gulf': [0.022866178, 0.018373583, 0.02937628, 0.021296008, 0.029017441], 'alabama,': [0.022147043, 0.019578151, 0.022074558, 0.023198724, 0.0257787], 'said.': [0.019591311, 0.018274995, 0.024987014, 0.021416679, 0.022029705], 'rain': [0.019554488, 0.017821308, 0.03733048, 0.021670463, 0.030659337], 'winds': [0.019485176, 0.023764139], 'forecasters': [0.019385332, 0.018169055, 0.019014213, 0.021644952], 'pensacola,': [0.019187111, 0.020984672, 0.029835008, 0.028988319, 0.022981185], 'sally': [0.018017035, 0.019778242, 0.02662366, 0.02455861, 0.034632687], 'pensacola': [0.016891448, 0.019275548, 0.019613955], 'centimeters)': [0.018891351]}


In [304]:
# Create a set of frequent meaningless words to remove
word_frequency_threshold = 2

# following https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#sphx-glr-auto-examples-core-run-core-concepts-py

# Lowercase each document, split it by white space and filter out stopwords
texts = [[word for word in document.lower().split() if word not in stop_words]
         for document in [article_text]]

# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > word_frequency_threshold] for text in texts]

processed_dictionary = Dictionary(processed_corpus)
print(processed_dictionary)

Dictionary(6 unique tokens: ['coaches,', 'college', 'football', 'sports', 'student-athletes']...)


In [305]:
bow_corpus = [processed_dictionary.doc2bow(text) for text in processed_corpus]
lda_2 = gensim.models.LdaModel(corpus=bow_corpus,num_topics=5,id2word=processed_dictionary)

In [306]:
from gensim.parsing.preprocessing import preprocess_string
from gensim.parsing.preprocessing import strip_punctuation
from gensim.parsing.preprocessing import strip_numeric

# # lda.top_topics(corpus=bow_corpus,dictionary=processed_dictionary)
# for i in range(0, lda.num_topics-1):
#     current_topic = lda.print_topic(i)
#     print(current_topic)
    
lda_topics = lda_2.show_topics()

topics = []
filters = [lambda x: x.lower(), strip_punctuation, strip_numeric]

for topic in lda_topics:
#     print(topic)
    topics.append(preprocess_string(topic[1], filters))

all_topic_words = []
for topic in topics:
    print(topic)
    for topic_word in topic:
        print(topic_word)
        all_topic_words.append(topic_word)
    
    print(' ')

print(all_topic_words)

['ten', 'football', 'sports', 'student', 'athletes', 'coaches', 'college']
ten
football
sports
student
athletes
coaches
college
 
['ten', 'football', 'college', 'sports', 'coaches', 'student', 'athletes']
ten
football
college
sports
coaches
student
athletes
 
['ten', 'football', 'college', 'coaches', 'sports', 'student', 'athletes']
ten
football
college
coaches
sports
student
athletes
 
['ten', 'football', 'college', 'coaches', 'student', 'athletes', 'sports']
ten
football
college
coaches
student
athletes
sports
 
['ten', 'football', 'coaches', 'college', 'student', 'athletes', 'sports']
ten
football
coaches
college
student
athletes
sports
 
['ten', 'football', 'sports', 'student', 'athletes', 'coaches', 'college', 'ten', 'football', 'college', 'sports', 'coaches', 'student', 'athletes', 'ten', 'football', 'college', 'coaches', 'sports', 'student', 'athletes', 'ten', 'football', 'college', 'coaches', 'student', 'athletes', 'sports', 'ten', 'football', 'coaches', 'college', 'student', '

In [307]:
print(topics)

[['ten', 'football', 'sports', 'student', 'athletes', 'coaches', 'college'], ['ten', 'football', 'college', 'sports', 'coaches', 'student', 'athletes'], ['ten', 'football', 'college', 'coaches', 'sports', 'student', 'athletes'], ['ten', 'football', 'college', 'coaches', 'student', 'athletes', 'sports'], ['ten', 'football', 'coaches', 'college', 'student', 'athletes', 'sports']]


In [318]:
all_unique_topics = []
for topic_list in topics:
    for topic in topic_list:
        if topic not in all_unique_topics:
            all_unique_topics.append(topic)

print(all_unique_topics)

['ten', 'football', 'sports', 'student', 'athletes', 'coaches', 'college']


In [314]:
all_topics

['coaches', 'college', 'football', 'sports', 'student', 'ten', 'athletes']

In [312]:
len(topics[0])
print(topics[0])

['ten', 'football', 'sports', 'student', 'athletes', 'coaches', 'college']
