# Validation using downloaded document corpus for 'debiaser' data product
#### Sagar Setru, September 21st, 2020

## Brief description using CoNVO framework

### Context

Some people are eager to get news from outside their echo chamber. However, they may not know where to look, and seeking out information from other sources takes some activation energy. Meanwhile, most newsfeeds only push content that the reader already agrees with. Readers can end up in an echo chamber without ever having wanted to be in one.

### Need

A way to find news articles from different yet reliable media sources.

### Vision

Debiaser, a data product (maybe Chrome plug-in?) that will recommend news articles similar in topic to the one currently being read, but from several pre-curated and reliable news media organizations across the political spectrum, for example, following the "media bias chart" here https://www.adfontesmedia.com/ or the "media bias ratings" here: https://www.allsides.com/media-bias/media-bias-ratings. The app will determine the main topics of the text of a news article, and then show links to similar articles from other news organizations.

The product will generate topics for a given document via latent Dirichlet allocation (LDA) and then search news websites for the topic words generated.
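
The search step can be sketched as follows. This is a minimal illustration, not the product's implementation: `build_site_query` is a hypothetical helper, the topic words are example values, and the `site:` prefix is the standard Google operator for restricting results to one domain.

```python
def build_site_query(domain, topic_words):
    """Return a Google-style query restricted to one news domain."""
    # 'site:' limits the search results to the given domain
    return f"site:{domain} " + " ".join(topic_words)


# example topic words (assumed) and a few news domains
topic_words = ["dow", "trading", "rates", "vaccine", "wsj"]
domains = ["nytimes.com", "reuters.com", "wsj.com"]

# one query string per news organization
queries = [build_site_query(domain, topic_words) for domain in domains]
for query in queries:
    print(query)
```

Building one query per curated domain keeps the set of sources explicit, at the cost of one search per news organization.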

Caveats: Many of these articles may be behind paywalls. News aggregators already do something similar. It is also unclear how different this is from simply searching Google for an article's title.

### Outcome

People who are motivated to engage with content outside of their echo chambers have a tool that lets them quickly find news similar to what they are currently reading, but from a variety of news organizations.

### Testing single-document LDA on these articles

In [1]:
# make sure I'm in the right environment (should be 'debiaser')
import os
print('Conda environment:')
print(os.environ['CONDA_DEFAULT_ENV'])

Conda environment:
debiaser


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time

import json

# NLP Packages
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# to break articles up into sentences
from nltk import tokenize

import pyLDAvis
import pyLDAvis.gensim

from text_processing_functions import process_all_articles
from text_processing_functions import remove_stopwords
from text_processing_functions import get_simple_corpus_dictionary_bow
from text_processing_functions import entity_recognizer
from text_processing_functions import get_topic_words_mean_std_prob_frequency
from text_processing_functions import sort_topics_mean_frequency

import pickle
print('DONE')

DONE


In [3]:
all_news_df = pd.read_csv('./all_the_news/all_news_df_processed.csv')
all_news_df.head()



Unnamed: 0,index,index.1,id,title,author,date,content,year,month,publication,category,digital,section,url,article_length
0,0,0,1,Agent Cooper in Twin Peaks is the audience: on...,\nTasha Robinson\n,2017-05-31,And never more so than in Showtime’s new...,2017.0,5.0,Verge,Longform,1.0,,,2121
1,1,1,2,"AI, the humanity!",\nSam Byford\n,2017-05-30,AlphaGo’s victory isn’t a defeat for hum...,2017.0,5.0,Verge,Longform,1.0,,,1948
2,2,2,3,The Viral Machine,\nKaitlyn Tiffany\n,2017-05-25,Super Deluxe built a weird internet empi...,2017.0,5.0,Verge,Longform,1.0,,,3011
3,3,3,4,How Anker is beating Apple and Samsung at thei...,\nNick Statt\n,2017-05-22,Steven Yang quit his job at Google in th...,2017.0,5.0,Verge,Longform,1.0,,,3281
4,4,4,5,Tour Black Panther’s reimagined homeland with ...,\nKwame Opam\n,2017-05-15,Ahead of Black Panther’s 2018 theatrical...,2017.0,5.0,Verge,Longform,1.0,,,239


In [4]:
def load_stop_words_csv_to_list(full_file_name):
    """fxn that loads stop words list downloaded from git repo called 'news-stopwords'"""
    
    stop_words = pd.read_csv(full_file_name)

    stop_words = stop_words['term']

    stop_words = [word for word in stop_words]
    
    return stop_words

In [5]:
def get_lda_top_topic_words(lda_topics,num_topics,do_unique_search_words,n_search_words):
    """
    fxn that returns the top topic words from an lda model
    algo varies based on:
    1) whether there is 1 or more topics, and
    2) whether only unique words are wanted

    if one topic, just take the top n_search_words words of that topic
    else, cycle through the topics rank by rank (top word of each topic,
          then second word of each topic, and so on);
          if do_unique_search_words, skip words that were already selected,
          else, keep each word even if it isn't unique

    parameters
    ----------
    lda_topics - topic output from lda model (lda.show_topics(formatted=False))
    num_topics - how many lda topics were generated
    do_unique_search_words - whether to skip repeated words when choosing search terms
    n_search_words - how many topic words to use as search terms

    outputs
    -------
    string and list of search/topic words
    """
    
    # string is for final search string
    lda_top_topic_words_string = ''

    # list is for checking previous words
    lda_top_topic_words_list = []
    
    # if lda model has only one topic
    if num_topics == 1:

        for topic in lda_topics:

            # get the list of topic words
            topic_words = topic[1]

            # loop through these words and get the top n number
            counter = -1
            for topic_word in topic_words:

                counter += 1

                if counter < n_search_words:

                    lda_top_topic_words_string += ' '+topic_word[0]
                    lda_top_topic_words_list.append(topic_word[0])

    # if lda model has more than one topic
    elif num_topics > 1:
            
        # this ind is to always get list of tuples of (word, prob)
        fixed_ind1 = 1

        # this ind is to always access the word in the tuple (word, prob)
        fixed_ind2 = 0

        # if you're okay with topic words repeating (often happens..)
        if not do_unique_search_words:

            # loop counter
            counter = 0
            
            # index of word within topic
            ind_use = 0
            
            # index of topic
            topic_use = -1
            
            for i in range(n_search_words):
                counter += 1

                if counter > num_topics:
                    ind_use += 1
                    counter = 1

                if topic_use < num_topics-1:
                    topic_use += 1
                else:
                    topic_use = 0

                # access the appropriate topic word
                word = lda_topics[topic_use][fixed_ind1][ind_use][fixed_ind2]

                lda_top_topic_words_string += ' '+word

                lda_top_topic_words_list.append(word)

        # don't reuse a word if it has already been used
        else:

            counter = 0
            ind_use = 0
            topic_use = -1
            
            # do loop over total words across all topics
            total_topic_words = len(lda_topics)*len(lda_topics[0][fixed_ind1])
            for i in range(total_topic_words):
                counter += 1

                if counter > num_topics:
                    ind_use += 1
                    counter = 1

                if topic_use < num_topics-1:
                    topic_use += 1
                else:
                    topic_use = 0

                # access the appropriate topic word
                word = lda_topics[topic_use][fixed_ind1][ind_use][fixed_ind2]

                # only add if it is not currently in the top topic words
                if word not in lda_top_topic_words_list:

                    lda_top_topic_words_string += ' '+word

                    lda_top_topic_words_list.append(word)
                
                # stop once the list of topic words reaches the desired number of search words
                if len(lda_top_topic_words_list) == n_search_words:
                    break

    return lda_top_topic_words_string, lda_top_topic_words_list
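
The unique-word branch above walks through the topics rank by rank and skips repeats. A compact sketch of that same selection order is given below; `round_robin_unique_words` is a hypothetical name, and the topics are made-up values in the `(topic_id, [(word, prob), ...])` shape that the function above indexes into.

```python
def round_robin_unique_words(lda_topics, n_search_words):
    """Pick words rank by rank across topics, skipping repeated words."""
    chosen = []
    # number of ranked words available in every topic
    n_ranks = min(len(words) for _, words in lda_topics)
    for rank in range(n_ranks):
        for _, words in lda_topics:
            word = words[rank][0]
            # only add the word if it has not been selected already
            if word not in chosen:
                chosen.append(word)
            # stop once enough search words are collected
            if len(chosen) == n_search_words:
                return chosen
    return chosen


# made-up topics in the (topic_id, [(word, prob), ...]) shape
lda_topics = [
    (0, [("vaccine", 0.05), ("trial", 0.04), ("dose", 0.03)]),
    (1, [("vaccine", 0.06), ("market", 0.04), ("trial", 0.02)]),
]
selected = round_robin_unique_words(lda_topics, 4)
print(selected)
```

Here "vaccine" leads both topics but is only kept once, so the fourth search word comes from the third rank of the first topic.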

In [6]:
def get_single_topic_word_probs(lda_topics,n_search_words_single_topic_analysis):
    """
    fxn that returns the probabilities of the top words in a single topic lda model

    parameters
    ----------
    lda_topics - topic output from lda model (expected to contain one topic)
    n_search_words_single_topic_analysis - length of the probability vector,
        must be at least the number of words listed per topic

    outputs
    -------
    column vector (n_search_words_single_topic_analysis x 1) of word probabilities,
    ordered as in the topic; entries that are never filled stay nan
    """

    # generate empty vector for probs associated with words in topic
    lda_topic_word_probs = np.zeros((n_search_words_single_topic_analysis,1))

    # set default to nan in case any probs eval to 0..
    lda_topic_word_probs[:] = np.nan

    for topic in lda_topics:

        # get the list of topic words and probs
        topic_words = topic[1]

        # loop through these words and get the associated probabilities
        for ind, topic_word in enumerate(topic_words):

            # add probability to prob vector
            lda_topic_word_probs[ind] = topic_word[1]
            
    return lda_topic_word_probs

In [7]:
def count_word_frequencies(article_processed_whole,n_search_words):
    """
    fxn that does simple counting of word frequency.
    Goal is to have some baseline for how single doc LDA approach 
    compares to just counting most common words.
    """
    
    # dictionary of word counts
    word_dict_count = {}

    for word in article_processed_whole[0]:

        if word in word_dict_count.keys():

            word_dict_count[word] += 1

        else:

            word_dict_count[word] = 1

    # make list for word counts
    word_counts = []

    # loop through dictionary
    for key, value in word_dict_count.items():
        word_counts.append(value)
    
    # get unique values of word counts
#     word_counts = list(set(word_counts))

    # sort counts from high to low 
    word_counts = sorted(word_counts, reverse=True)

    # keep appropriate number of word counts
    word_counts_top = word_counts[0:n_search_words]

    # list for most common words
    most_common_words_list = []
    most_common_words_string = ''

    # loop through dictionary
    for key, value in word_dict_count.items():

        # if value of this word is one of the top ones, add this word for list of common words
        if value in word_counts_top:
            most_common_words_list.append(key)
            most_common_words_string += ' '+key
            
    return word_dict_count, word_counts_top, most_common_words_list, most_common_words_string
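
Because the function above keeps every word whose count appears among the top counts, ties can return more than `n_search_words` words. A minimal sketch of the same idea using `collections.Counter` follows; `most_common_with_ties` is a hypothetical name and the word list is made up.

```python
from collections import Counter


def most_common_with_ties(words, n_top):
    """Return all words whose count is at least the n_top-th highest count."""
    counts = Counter(words)
    if not counts:
        return []
    # counts sorted from high to low, duplicates kept
    ranked = sorted(counts.values(), reverse=True)
    # count of the n_top-th most frequent word (or the last one if fewer words exist)
    threshold = ranked[min(n_top, len(ranked)) - 1]
    # keep every word at or above the threshold, so ties are all included
    return [word for word, count in counts.items() if count >= threshold]


words = ["tax", "tax", "tax", "vote", "vote", "bill", "bill", "senate"]
top_words = most_common_with_ties(words, 2)
print(top_words)
```

With `n_top = 2`, three words come back because "vote" and "bill" are tied for second place.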

In [8]:
def get_jaccard_sim(list1, list2): 
    """
    fxn calculates jaccard sim between two lists of words
    from https://towardsdatascience.com/overview-of-text-similarity-metrics-3397c4601f50
    """
    a = set(list1) 
    b = set(list2)
    
    c = a.intersection(b)
    
    return float(len(c)) / (len(a) + len(b) - len(c))
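
As a worked example of the Jaccard similarity: if all 5 LDA words appear among 7 most common words, the intersection has 5 words and the union has 7, giving 5/7. The sketch below uses made-up words; `jaccard` is a hypothetical variant that returns 0.0 for two empty lists, which is an assumption and not something the function above does.

```python
def jaccard(list1, list2):
    """Jaccard similarity: size of intersection over size of union."""
    a, b = set(list1), set(list2)
    union = a | b
    # assumption: treat two empty sets as having similarity 0.0
    # to avoid dividing by zero
    return len(a & b) / len(union) if union else 0.0


# made-up word lists: 5 LDA words, all contained in 7 common words
lda_words = ["vaccine", "trial", "dose", "market", "rates"]
common_words = lda_words + ["dow", "trading"]

similarity = jaccard(lda_words, common_words)
print(similarity)
```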

In [9]:
def calculate_cosine_similarity(bow_vec1,bow_vec2):
    """
    fxn calculates the cosine similarity between two bag of words vectors (lists of (word id, count) tuples)
    """
    
    # get the words in each vector and their lengths in a dictionary
    vec1_words_dict = {}
    vec2_words_dict = {}
    
    # get just the words
    vec1_words = []
    vec2_words = []
    
    # get just the values
    vec1_vals = np.zeros((len(bow_vec1)))
    vec2_vals = np.zeros((len(bow_vec2)))
    
    # populate dictionary and lists
    for ind, val in enumerate(bow_vec1):

        vec1_words_dict[val[0]] = val[1]
        vec1_words.append(val[0])
        vec1_vals[ind] = val[1]
    
    # populate dictionary and lists
    for ind, val in enumerate(bow_vec2):
        
        vec2_words_dict[val[0]] = val[1]
        vec2_words.append(val[0])
        vec2_vals[ind] = val[1]
        
    # get norms of each vector
    norm_vec1 = np.linalg.norm(vec1_vals)
    norm_vec2 = np.linalg.norm(vec2_vals)
    
    # get the list of all the words
    all_words = list(set().union(vec1_words,vec2_words))
    
    # loop through words, update dictionaries if word is not in original vector
    for word in all_words:
        
        if word not in vec1_words:
            
            vec1_words_dict[word] = 0
            
        if word not in vec2_words:
            
            vec2_words_dict[word] = 0
       
    # initialize float for final dot product
    dot_product = 0.0
    
    # loop through words
    for word in vec1_words_dict.keys():
        
        vec1_val = vec1_words_dict[word]
        vec2_val = vec2_words_dict[word]
        
        dot_product += (vec1_val * vec2_val)
    
    cosine_sim = dot_product/(norm_vec1*norm_vec2)
    
    return cosine_sim
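
The same cosine similarity can be written more compactly with dictionaries. This is a sketch, not a replacement for the function above: `cosine_similarity_bow` is a hypothetical name, the vectors are made up, and returning 0.0 for an empty vector is an assumption.

```python
import math


def cosine_similarity_bow(bow_vec1, bow_vec2):
    """Cosine similarity between two lists of (word id, count) tuples."""
    counts1, counts2 = dict(bow_vec1), dict(bow_vec2)
    # words missing from a vector contribute 0 to the dot product
    dot_product = sum(val * counts2.get(word, 0) for word, val in counts1.items())
    norm1 = math.sqrt(sum(val * val for val in counts1.values()))
    norm2 = math.sqrt(sum(val * val for val in counts2.values()))
    # assumption: similarity with an empty vector is defined as 0.0
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot_product / (norm1 * norm2)


# made-up example: 7 common words and 5 LDA words that are all among them
vec_common = [(word_id, 1) for word_id in range(7)]
vec_lda = [(word_id, 1) for word_id in range(5)]

cosine = cosine_similarity_bow(vec_common, vec_lda)
print(cosine)
```

For these binary vectors the result is 5 / sqrt(5 * 7), which appears consistent with the 0.845 entries in `cosine_sim_all` wherever 7 most common words were found.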

In [10]:
# choose list of stop words

# choose whether 1k, 10k, 100k, or nltk
which_stop_words = '1k'
# which_stop_words = '10k'
# which_stop_words = '100k'
# which_stop_words = 'nltk'

stop_words_path = '/Users/sagarsetru/Documents/post PhD positions search/insightDataScience/project/debiaser/stop_words_db/news-stopwords-master/'


if which_stop_words == '1k':
    
    # doing 1k words list
    stop_words_file_name = 'sw1k.csv'
    
    # make full file name
    stop_words_full_file_name = stop_words_path+stop_words_file_name
    
    # get list of stop words
    stop_words = load_stop_words_csv_to_list(stop_words_full_file_name)
    
elif which_stop_words == '10k':
    
    # doing 10k words list
    stop_words_file_name = 'sw10k.csv'

    # make full file name
    stop_words_full_file_name = stop_words_path+stop_words_file_name
    
    # get list of stop words
    stop_words = load_stop_words_csv_to_list(stop_words_full_file_name)

elif which_stop_words == '100k':
    
    # doing 100k
    stop_words_file_name = 'sw100k.csv'  
    
    # get full file name
    stop_words_full_file_name = stop_words_path+stop_words_file_name
    
    # get list of stop words
    stop_words = load_stop_words_csv_to_list(stop_words_full_file_name)


elif which_stop_words == 'nltk':
    # import from nltk
    from nltk.corpus import stopwords
    
    stop_words = stopwords.words('english')
    
else:
    # stop here, otherwise the appends below fail because stop_words is undefined
    raise ValueError('Select proper variable name for "which_stop_words"')
    
# adding custom words
stop_words.append('said')
stop_words.append('youre')

In [11]:
# get a random sampling of these articles for testing single document lda

# number of times to draw a bootstrap sampling
n_bootstrap_samples = 1000

# number of articles to draw per sampling
n_articles_per_sample = 4

In [12]:
# choose the number of LDA topics
# num_lda_topics = [1,2,3,4,5,6,7,8,9,10]
num_lda_topics = [1,2]

# do by sentences
do_sentences = 1

# to show plots per  run
do_plot = 0

# to print output
do_print = 0

# to print progress of testing
do_progress_print = 1

# number of passes LDA does through corpus (hyperparameter)
n_passes = 6

# whether to use unique topic words or allow repeated words
do_unique_search_words = 1

# number of words to use in search
# (or number of top word frequencies, e.g., use the 5 highest word frequencies)
# (This could be more than 5 words if some words show up equally often and are among the most frequent of all words)
n_search_words = 5

# dummy nlp variable, for now not in use because lemmatization not in use
nlp = []

# empty matrix for perplexity scores
perplexity_scores = np.zeros(( n_bootstrap_samples, n_articles_per_sample, len(num_lda_topics) ))

# empty matrix for coherence scores
coherence_scores = np.zeros(( n_bootstrap_samples, n_articles_per_sample, len(num_lda_topics) ))

# empty matrix for jaccard sim, set to nan in case any have 0 similarity
jaccard_sim_all = np.zeros(( n_bootstrap_samples, n_articles_per_sample, len(num_lda_topics) ))
jaccard_sim_all[:] = np.nan

# empty matrix for cosine sim, set to nan in case any have 0 similarity
cosine_sim_all = np.zeros(( n_bootstrap_samples, n_articles_per_sample, len(num_lda_topics) ))
cosine_sim_all[:] = np.nan

# empty matrix for length of most common words vec
number_most_common_words = np.zeros(( n_bootstrap_samples, n_articles_per_sample, len(num_lda_topics) ))

# empty matrix for probability vs word in topic (doing for n = 1 topic only)
n_search_words_single_topic_analysis = 10
topic_word_probs = np.zeros(( n_bootstrap_samples, n_articles_per_sample, n_search_words_single_topic_analysis ))


# counter for bootstrap sampling
counter_nboot = -1

for i in range(n_bootstrap_samples):
    
    # draw random articles
    articles_df_random_subset = all_news_df.sample(n=n_articles_per_sample)
    
    # get their content
    articles_content = articles_df_random_subset['content']
    
    counter_nboot += 1

    # counter for articles
    counter_article = -1
    
    for i in range(len(articles_content)):
        
        counter_article += 1
        
        for ind_num_topicss, num_topics in enumerate(num_lda_topics):

            # get the article text
            article_text = articles_content.iloc[i]
            
            if do_print:
                print(article_text)
                
            if do_progress_print:
                print(f'Bootstrap sample {counter_nboot+1} out of {n_bootstrap_samples}')
                print(f'Article {counter_article+1} out of {n_articles_per_sample}')
                print(f'LDA topic number {ind_num_topicss+1} out of {len(num_lda_topics)}')
                print(f'N topics: {num_topics}')
                print(' ')

            # for counting word frequencies
            article_processed_whole = process_all_articles([article_text],nlp)

            article_processed_whole = remove_stopwords(article_processed_whole,stop_words)


            if do_sentences:

                # break article into sentences
                article_text = tokenize.sent_tokenize(article_text)

                # process article
                article_processed = process_all_articles(article_text,nlp)

                # remove stopwords
                article_processed = remove_stopwords(article_processed,stop_words)

            else:

                # process article
                article_processed = process_all_articles([article_text],nlp)

                # remove stopwords
                article_processed = remove_stopwords(article_processed,stop_words)


            # get corpus, dictionary, bag of words
            processed_corpus, processed_dictionary, bow_corpus = get_simple_corpus_dictionary_bow(article_processed)

            # generate the LDA model
            lda = LdaModel(corpus = bow_corpus,
                             num_topics = num_topics,
                             id2word = processed_dictionary,
                             passes = n_passes)
            
            # calculate and store perplexity
            perplexity = lda.log_perplexity(bow_corpus)
            perplexity_scores[counter_nboot,counter_article,ind_num_topicss] = perplexity

            # calculate and store coherence
            coherence_model_lda = CoherenceModel(model=lda, texts=article_processed, dictionary=processed_dictionary, coherence='c_v')
            coherence_lda = coherence_model_lda.get_coherence()
            coherence_scores[counter_nboot,counter_article,ind_num_topicss] = coherence_lda
            
            # get the topics from the lda model
            lda_topics = lda.show_topics(formatted=False)

            # get the top topic words
            lda_top_topic_words_string, lda_top_topic_words_list = get_lda_top_topic_words(lda_topics,num_topics,do_unique_search_words,n_search_words)
            
            # for case of only one topic, store matrix of word probs
            if num_topics == 1:

                lda_topic_word_probs = get_single_topic_word_probs(lda_topics,n_search_words_single_topic_analysis)
                        
                # add to matrix of word probs
                topic_word_probs[counter_nboot,counter_article,:] = lda_topic_word_probs[:,0]

            # count word frequencies
            word_dict_count, word_counts_top, most_common_words_list, most_common_words_string = count_word_frequencies(article_processed_whole,n_search_words)
            
            # get jaccard similarity
            jaccard_sim = get_jaccard_sim(lda_top_topic_words_list, most_common_words_list)
            jaccard_sim_all[counter_nboot,counter_article,ind_num_topicss] = jaccard_sim
            
            # get word vectors for common words and lda top words
            vec_most_common_words = processed_dictionary.doc2bow(most_common_words_list,return_missing=False)
            vec_lda_top_topic_words = processed_dictionary.doc2bow(lda_top_topic_words_list,return_missing=False)
            
            # calculate cosine similarity
            cosine_sim = calculate_cosine_similarity(vec_most_common_words,vec_lda_top_topic_words)
            cosine_sim_all[counter_nboot,counter_article,ind_num_topicss] = cosine_sim
            
            # add number of words from counting top most frequent
            number_most_common_words[counter_nboot,counter_article,ind_num_topicss] = len(most_common_words_list)
            
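            if do_plot:
                # NOTE: topics_means, means_sorted, std_sorted, topics_freq, and freq_sorted
                # are not defined in the cells shown here; compute them before enabling do_plot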
                plt.figure(figsize=(15,5));
                plt.bar(topics_means,means_sorted,yerr=std_sorted);
                plt.ylabel('Mean probability');
                sns.set_context('talk', font_scale=1.5);
                plt.xticks(rotation=90);
                plt.show();
                plt.clf();
    #             plt.savefig('./eda_figs/mean_prob_vs_topic_big_ten_resumes.png', dpi=300, bbox_inches='tight')

                plt.figure(figsize=(15,5));
                plt.bar(topics_freq,freq_sorted);
                plt.ylabel('N');
                # plt.xlabel('Topics')
                sns.set_context('talk', font_scale=1.5);
                plt.xticks(rotation=90);
                plt.show();
                plt.clf();
    #             plt.savefig('./edafigs/frequency_vs_topic_big_ten_resumes.pdf')

# save all output to pickle files
pickle.dump( coherence_scores, open( './validation_data/coherence_scores.pkl', 'wb'))
pickle.dump( perplexity_scores, open( './validation_data/perplexity_scores.pkl', 'wb'))
pickle.dump( topic_word_probs, open( './validation_data/topic_word_probs.pkl', 'wb'))
pickle.dump( jaccard_sim_all, open( './validation_data/jaccard_sim_all.pkl', 'wb'))
pickle.dump( cosine_sim_all, open( './validation_data/cosine_sim_all.pkl', 'wb'))
pickle.dump( number_most_common_words, open( './validation_data/number_most_common_words.pkl', 'wb'))

Bootstrap sample 1 out of 3
Article 1 out of 4
LDA topic number 1 out of 2
N topics: 1
 
Bootstrap sample 1 out of 3
Article 1 out of 4
LDA topic number 2 out of 2
N topics: 2
 
Bootstrap sample 1 out of 3
Article 2 out of 4
LDA topic number 1 out of 2
N topics: 1
 
Bootstrap sample 1 out of 3
Article 2 out of 4
LDA topic number 2 out of 2
N topics: 2
 
Bootstrap sample 1 out of 3
Article 3 out of 4
LDA topic number 1 out of 2
N topics: 1
 
Bootstrap sample 1 out of 3
Article 3 out of 4
LDA topic number 2 out of 2
N topics: 2
 
Bootstrap sample 1 out of 3
Article 4 out of 4
LDA topic number 1 out of 2
N topics: 1
 
Bootstrap sample 1 out of 3
Article 4 out of 4
LDA topic number 2 out of 2
N topics: 2
 
Bootstrap sample 2 out of 3
Article 1 out of 4
LDA topic number 1 out of 2
N topics: 1
 
Bootstrap sample 2 out of 3
Article 1 out of 4
LDA topic number 2 out of 2
N topics: 2
 
Bootstrap sample 2 out of 3
Article 2 out of 4
LDA topic number 1 out of 2
N topics: 1
 
Bootstrap sample 2 ou

FileNotFoundError: [Errno 2] No such file or directory: './number_most_common_words/number_most_common_words.pkl'
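
One possible cause of a `FileNotFoundError` when writing a pickle file is that the output directory does not exist yet. The sketch below creates the directory first and closes the file with a context manager; `save_pickle` is a hypothetical helper, and the example writes to a temporary directory so it does not touch the notebook's own data.

```python
import os
import pickle
import tempfile


def save_pickle(obj, directory, file_name):
    """Pickle obj to directory/file_name, creating the directory if needed."""
    # create the output directory (and any parents) if it is missing
    os.makedirs(directory, exist_ok=True)
    full_path = os.path.join(directory, file_name)
    # the context manager closes the file even if pickling fails
    with open(full_path, 'wb') as f:
        pickle.dump(obj, f)
    return full_path


# example: write to a directory that does not exist yet, then read it back
out_dir = os.path.join(tempfile.mkdtemp(), 'validation_data')
saved_path = save_pickle({'example': [1, 2, 3]}, out_dir, 'example.pkl')

with open(saved_path, 'rb') as f:
    loaded = pickle.load(f)
print(loaded)
```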

In [438]:
os.getcwd()

'/Users/sagarsetru/Documents/post PhD positions search/insightDataScience/project/debiaser'

In [328]:
coherence_scores_mean = np.mean(coherence_scores,axis=1)
print(coherence_scores.shape)
print(coherence_scores_mean.shape)

(3, 4, 2)
(3, 2)


In [None]:
# plotting for perplexity and coherence scores

# get means per bootstrap sampling
coherence_scores_mean_per_sampling = np.mean(coherence_scores,axis=1)
perplexity_scores_mean_per_sampling = np.mean(perplexity_scores,axis=1)

# get mean across bootstrap



In [13]:
coherence_scores

array([[[0.51186944, 0.49164086],
        [0.57954709, 0.52093418],
        [0.47166151, 0.490692  ],
        [0.41322083, 0.34657116]],

       [[0.38860847, 0.39986555],
        [0.63278217, 0.59506605],
        [0.59818701, 0.60738289],
        [0.51623939, 0.39482988]],

       [[0.61204301, 0.57760808],
        [0.35694274, 0.40510517],
        [0.44635011, 0.41103484],
        [0.45970607, 0.40096481]]])

In [14]:
perplexity_scores

array([[[-4.02900825, -4.23792879],
        [-5.07480161, -5.29961568],
        [-4.60041331, -4.85357503],
        [-4.96198402, -5.15695703]],

       [[-5.26447149, -5.45155033],
        [-6.24191224, -6.44985396],
        [-6.55337379, -6.74278562],
        [-5.24764633, -5.44106885]],

       [[-5.5982057 , -5.79121929],
        [-5.29544008, -5.41833252],
        [-4.17055247, -4.39684645],
        [-5.40098499, -5.49089704]]])

In [15]:
# plotting for probability vs words, when single topics used
topic_word_probs

array([[[0.08196741, 0.04918041, 0.04098367, 0.03278691, 0.03278691,
         0.02459017, 0.02459017, 0.02459017, 0.02459017, 0.02459017],
        [0.04088081, 0.01886801, 0.01886801, 0.01572333, 0.01572333,
         0.01572333, 0.01572333, 0.01257865, 0.01257865, 0.00943397],
        [0.0314137 , 0.0314137 , 0.0314137 , 0.02617807, 0.02617806,
         0.02094244, 0.02094243, 0.01570682, 0.01570681, 0.01570681],
        [0.04901993, 0.02614394, 0.02287594, 0.02287593, 0.01633993,
         0.01307193, 0.01307193, 0.01307193, 0.01307192, 0.00980393]],

       [[0.02392364, 0.02392363, 0.01674652, 0.01674652, 0.01435415,
         0.01435415, 0.01435414, 0.01196178, 0.01196177, 0.01196177],
        [0.01595784, 0.01595781, 0.00957465, 0.00744693, 0.00638308,
         0.00638308, 0.00531922, 0.00531921, 0.00425536, 0.00425536],
        [0.00767482, 0.00767482, 0.00690732, 0.00613984, 0.00613983,
         0.00613983, 0.00613982, 0.00537234, 0.00460486, 0.00460485],
        [0.02619071, 0.02

In [16]:
# comparing word counting with lda
jaccard_sim_all

array([[[1.        , 0.25      ],
        [0.71428571, 0.71428571],
        [1.        , 0.42857143],
        [1.        , 0.42857143]],

       [[0.71428571, 0.5       ],
        [0.83333333, 0.83333333],
        [0.71428571, 0.71428571],
        [0.5       , 0.5       ]],

       [[0.71428571, 0.2       ],
        [1.        , 0.66666667],
        [0.5       , 0.5       ],
        [0.625     , 0.625     ]]])

In [17]:
cosine_sim_all

array([[[1.        , 0.4       ],
        [0.84515425, 0.84515425],
        [1.        , 0.6       ],
        [1.        , 0.6       ]],

       [[0.84515425, 0.6761234 ],
        [0.91287093, 0.91287093],
        [0.84515425, 0.84515425],
        [0.70710678, 0.70710678]],

       [[0.84515425, 0.3380617 ],
        [1.        , 0.8       ],
        [0.70710678, 0.70710678],
        [0.79056942, 0.79056942]]])

In [18]:
number_most_common_words

array([[[ 5.,  5.],
        [ 7.,  7.],
        [ 5.,  5.],
        [ 5.,  5.]],

       [[ 7.,  7.],
        [ 6.,  6.],
        [ 7.,  7.],
        [10., 10.]],

       [[ 7.,  7.],
        [ 5.,  5.],
        [10., 10.],
        [ 8.,  8.]]])

In [94]:
# lda_top_topic_words
all_sides_domains

0         abcnews.go.com
1          aljazeera.com
2             apnews.com
3                bbc.com
4          bloomberg.com
5          breitbart.com
6       buzzfeednews.com
7                cbn.com
8            cbsnews.com
9          csmonitor.com
10               cnn.com
11     thedailybeast.com
12      democracynow.org
13         factcheck.org
14            forbes.com
15           foxnews.com
16          huffpost.com
17       motherjones.com
18             msnbc.com
19    nationalreview.com
20           nbcnews.com
21            nypost.com
22           nytimes.com
23           newsmax.com
24               npr.org
25          politico.com
26            reason.com
27           reuters.com
28             salon.com
29         spectator.org
30       theatlantic.com
31       theguardian.com
32           thehill.com
33               wsj.com
Name: domain, dtype: object

In [39]:
# load domain names from all sides media csv, write string for google search
all_sides_with_domains = pd.read_csv('./all_sides_media_data/allsides_final_plus_others_with_domains.csv')

all_sides_names = all_sides_with_domains['name']
all_sides_domains = all_sides_with_domains['domain']

all_sides_names_domains = pd.concat([all_sides_names,all_sides_domains],axis=1)

In [161]:
all_sides_names_domains

Unnamed: 0,name,domain,google_query
0,ABC News (Online),abcnews.go.com,site:nytimes.com dow trading rates vaccine wsj
1,Al Jazeera,aljazeera.com,site:bloomberg.com dow trading rates vaccine wsj
2,Associated Press,apnews.com,site:reuters.com dow trading rates vaccine wsj
3,BBC News,bbc.com,site:wsj.com dow trading rates vaccine wsj
4,Bloomberg,bloomberg.com,
5,Breitbart News,breitbart.com,
6,BuzzFeed News,buzzfeednews.com,
7,CBN,cbn.com,
8,CBS News,cbsnews.com,
9,Christian Science Monitor,csmonitor.com,
