# LDA model testing for downloaded article database for 'debiaser' data product
#### Sagar Setru, September 21st, 2020

## Brief description using CoNVO framework

### Context

Some people are eager to get news from outside their echo chamber. However, they may not know where to look, and seeking out unfamiliar sources takes some activation energy. Meanwhile, most newsfeeds only push content the reader already agrees with. Readers end up in an echo chamber without ever having wanted to be in one.

### Need

A way to find news articles from different yet reliable media sources.

### Vision

Debiaser, a data product (possibly a Chrome plug-in) that recommends news articles similar in topic to the one currently being read, but from several pre-curated, reliable news organizations across the political spectrum, chosen for example from the "media bias chart" at https://www.adfontesmedia.com/ or the "media bias ratings" at https://www.allsides.com/media-bias/media-bias-ratings. The app will determine the main topics of a news article's text and then show links to similar articles from other news organizations.

The product will generate topics for a given document via latent Dirichlet allocation (LDA) and then search news websites for the topic words generated.
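
As a rough illustration of the search step, topic words could be turned into a site-restricted web search. The sketch below is an assumption about how that might look, not the product's implementation: the helper name `build_search_url`, the use of Google's `site:` operator, and the example domains are all hypothetical.

```python
from urllib.parse import urlencode

def build_search_url(topic_words, site, max_words=5):
    """Build a site-restricted search URL from the top topic words (hypothetical helper)."""
    # keep only the first few words so the query stays focused
    query = ' '.join(topic_words[:max_words]) + f' site:{site}'
    return 'https://www.google.com/search?' + urlencode({'q': query})

# example topic words and news domains, chosen only for illustration
topic_words = ['hurricane', 'flooding', 'gulf', 'coast', 'alabama', 'storm']
for site in ['apnews.com', 'wsj.com']:
    print(build_search_url(topic_words, site))
```

Capping the number of words is a design choice: long queries built from every topic word tend to over-constrain a search.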

Caveats: Many of these articles may be behind paywalls. News aggregators already do something similar. How different is this from simply searching Google with the title of an article?

### Outcome

People who are motivated to engage with content outside of their echo chambers have a tool that lets them quickly find news similar to what they are currently reading, but from a variety of news organizations.

### Testing LDA on larger document corpus

In [2]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# import user defined functions for text processing
from text_processing_functions import process_all_articles, remove_stopwords

In [4]:
# del process_all_articles

In [5]:
print('Conda environment:')
print(os.environ['CONDA_DEFAULT_ENV'])

Conda environment:
base


In [6]:
# import text processing and NLP specific packages

# for generating LDA models
import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# for preprocessing documents
from gensim.parsing.preprocessing import preprocess_documents

# for counting frequency of words
from collections import defaultdict

import string

# for visualizing LDA models
import pyLDAvis
import pyLDAvis.gensim

# for lemmatization
import spacy

# for saving intermediate results
import pickle

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
print("DONE")

DONE


In [6]:
# for testing on HTML data

import requests
from bs4 import BeautifulSoup
import urllib.request
import sys
import time

In [7]:
def load_stop_words_csv_to_list(full_file_name):
    """Load a stop words list downloaded from the 'news-stopwords' git repo.

    Reads the csv and returns the values of its 'term' column as a list.
    """
    stop_words = pd.read_csv(full_file_name)

    return stop_words['term'].tolist()

In [8]:
def get_simple_corpus_dictionary_bow(texts,word_frequency_threshold):
    """Return the frequency-filtered corpus, its gensim Dictionary, and the bag-of-words corpus."""
    
    # Count word frequencies
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1

    # Only keep words that appear more than set frequency, to produce the corpus
    processed_corpus = [[token for token in text if frequency[token] > word_frequency_threshold] for text in texts]
    
    # generate a dictionary via gensim
    processed_dictionary = Dictionary(processed_corpus)
    
    # generate bag of words of the corpus
    bow_corpus = [processed_dictionary.doc2bow(text) for text in processed_corpus]
    
    return processed_corpus, processed_dictionary, bow_corpus
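
To make the three return values concrete, here is a small standard-library sketch of the same steps on a toy corpus. It mimics what `Dictionary` and `doc2bow` produce (a token-to-id map and per-document `(token_id, count)` pairs) without using gensim. The alphabetical id assignment is an assumption for illustration and need not match gensim's.

```python
from collections import Counter

texts = [['storm', 'coast', 'storm'], ['coast', 'vote'], ['storm', 'rain']]
word_frequency_threshold = 1

# count word frequencies across the whole corpus
frequency = Counter(token for text in texts for token in text)

# keep only words that appear more than the threshold
processed_corpus = [[t for t in text if frequency[t] > word_frequency_threshold]
                    for text in texts]

# map each remaining token to an integer id (alphabetical order, for illustration)
vocabulary = sorted({t for text in processed_corpus for t in text})
token2id = {token: i for i, token in enumerate(vocabulary)}

# bag of words: sorted (token_id, count) pairs per document
bow_corpus = [sorted(Counter(token2id[t] for t in text).items())
              for text in processed_corpus]

print(processed_corpus)
print(bow_corpus)
```

Here `vote` and `rain` each occur once, so they are dropped before the ids are assigned.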

In [9]:
# choose list of stop words

# choose whether 1k, 10k, 100k, or nltk
which_stop_words = '1k'
# which_stop_words = '10k'
# which_stop_words = '100k'
# which_stop_words = 'nltk'

stop_words_path = './stop_words_db/news-stopwords-master/'

# csv file name for each news-stopwords list
stop_words_file_names = {'1k': 'sw1k.csv', '10k': 'sw10k.csv', '100k': 'sw100k.csv'}

if which_stop_words in stop_words_file_names:

    # make full file name
    stop_words_full_file_name = stop_words_path + stop_words_file_names[which_stop_words]

    # get list of stop words
    stop_words = load_stop_words_csv_to_list(stop_words_full_file_name)

elif which_stop_words == 'nltk':
    # import from nltk
    from nltk.corpus import stopwords

    stop_words = stopwords.words('english')

else:
    # fail here rather than with a NameError when appending below
    raise ValueError('Select proper value for "which_stop_words": 1k, 10k, 100k, or nltk')

# adding custom words
stop_words.append('said')
stop_words.append('youre')

In [10]:
# load csv of processed data to pandas dataframe
articles_df = pd.read_csv('./all_the_news/all_news_df_processed.csv')
print("DONE")

DONE


In [11]:
# preview the data
articles_df.head()

Unnamed: 0,index,index.1,id,title,author,date,content,year,month,publication,category,digital,section,url,article_length
0,0,0,1,Agent Cooper in Twin Peaks is the audience: on...,\nTasha Robinson\n,2017-05-31,And never more so than in Showtime’s new...,2017.0,5.0,Verge,Longform,1.0,,,2121
1,1,1,2,"AI, the humanity!",\nSam Byford\n,2017-05-30,AlphaGo’s victory isn’t a defeat for hum...,2017.0,5.0,Verge,Longform,1.0,,,1948
2,2,2,3,The Viral Machine,\nKaitlyn Tiffany\n,2017-05-25,Super Deluxe built a weird internet empi...,2017.0,5.0,Verge,Longform,1.0,,,3011
3,3,3,4,How Anker is beating Apple and Samsung at thei...,\nNick Statt\n,2017-05-22,Steven Yang quit his job at Google in th...,2017.0,5.0,Verge,Longform,1.0,,,3281
4,4,4,5,Tour Black Panther’s reimagined homeland with ...,\nKwame Opam\n,2017-05-15,Ahead of Black Panther’s 2018 theatrical...,2017.0,5.0,Verge,Longform,1.0,,,239


In [12]:
articles_df['article_length'].describe()

count    182636.000000
mean        862.016393
std         864.620185
min          51.000000
25%         397.000000
50%         693.000000
75%        1069.000000
max       50517.000000
Name: article_length, dtype: float64

In [13]:
# use the full corpus for testing; uncomment the sample line to test on a random subset instead
n_sample = 182636

# articles_df_test = articles_df.sample(n=n_sample)
articles_df_test = articles_df

In [14]:
# get just the articles content and titles
articles_content = articles_df_test['content'].astype('str')
articles_titles = articles_df_test['title'].astype('str')

# count missing titles; astype('str') turns them into the string 'nan', which should not be merged into the text
print(articles_df_test['title'].isnull().sum())

1


Note: the steps below follow the tutorial at https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#10removestopwordsmakebigramsandlemmatize

In [15]:
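# merge titles and content
articles_full = []

for content,title in zip(articles_content, articles_titles):
    
    # don't add word 'nan' (a missing title converted by astype('str'))
    if title == 'nan':
        
        print(title)
        
        articles_full.append(content)
        
    else:
        
        # separate title and content with a space so the boundary words are not fused
        articles_full.append(title + ' ' + content)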

nan


In [20]:
# show number of documents
n_documents = len(articles_full)
print(f'There are {n_documents} documents in this corpus.')

There are 182636 documents in this corpus.


In [21]:
# process documents
articles_processed = process_all_articles(articles_full,nlp=[])
print("DONE")

DONE


In [22]:
# save processed articles
with open('./articles_processed.pkl', 'wb') as pickle_file:
    pickle.dump(articles_processed, pickle_file)
print('done')
# articles_processed[0]

MemoryError: 

In [23]:
len(articles_processed)

182636

In [22]:
# build ngram models (bi, tri, quad)
# NOTE: CONSIDER TRYING THIS FOR BETTER NGRAMS:
# https://medium.com/@manjunathhiremath.mh/identifying-bigrams-trigrams-and-four-grams-using-word2vec-dea346130eb

ngram_min_count = 2

bigram_threshold = 25
trigram_threshold = 15
quadgram_threshold = 100

print('running...')
bigram = gensim.models.Phrases(articles_processed, min_count=ngram_min_count, threshold=bigram_threshold) # higher threshold, fewer phrases
print('making bigram')
trigram = gensim.models.Phrases(bigram[articles_processed], threshold=trigram_threshold)
print('making trigram')
quadgram = gensim.models.Phrases(trigram[articles_processed], threshold=quadgram_threshold)
print('making quadgram')

# Phraser is a faster way to apply a trained Phrases model to a sentence
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
quadgram_mod = gensim.models.phrases.Phraser(quadgram)

# save the ngram models
with open('./bigram_mod.pkl', 'wb') as pickle_file:
    pickle.dump(bigram_mod, pickle_file)
with open('./trigram_mod.pkl', 'wb') as pickle_file:
    pickle.dump(trigram_mod, pickle_file)
with open('./quadgram_mod.pkl', 'wb') as pickle_file:
    pickle.dump(quadgram_mod, pickle_file)

print('done')

running...
making bigram
making trigram
making quadgram
done
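
The `threshold` argument controls how readily two adjacent tokens are merged into one phrase. With gensim's default scoring, a pair is scored roughly as `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))` and merged when the score exceeds the threshold. The function below is a standalone sketch of that formula with made-up counts, not gensim's code; the name `phrase_score` is hypothetical.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Default-style phrase score: higher when a pair co-occurs more than its parts suggest."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# made-up counts for illustration only
vocab_size = 10000
# two rare words that almost always appear together
score_strong = phrase_score(count_a=50, count_b=40, count_ab=30,
                            vocab_size=vocab_size, min_count=2)
# two very common words that sometimes appear together
score_weak = phrase_score(count_a=5000, count_b=4000, count_ab=30,
                          vocab_size=vocab_size, min_count=2)

# compare against the three thresholds used in this notebook
for threshold in (15, 25, 100):
    print(threshold, score_strong > threshold, score_weak > threshold)
```

Under these assumed counts the rare pair clears all three thresholds, while the common pair clears none, which is why raising the threshold yields fewer phrases.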


In [24]:
with open('./bigram_mod.pkl', 'rb') as pickle_file:
    bigram_mod = pickle.load(pickle_file)

with open('./trigram_mod.pkl', 'rb') as pickle_file:
    trigram_mod = pickle.load(pickle_file)
    
with open('./quadgram_mod.pkl', 'rb') as pickle_file:
    quadgram_mod = pickle.load(pickle_file)
print('done')

done


In [25]:
# fxns for bi, tri, quadgrams, and lemmatization
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def make_quadgrams(texts):
    return [quadgram_mod[trigram_mod[bigram_mod[doc]]] for doc in texts]

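def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each text, keeping only tokens whose part-of-speech tag is allowed.

    Relies on the global spaCy model `nlp`, which must be loaded before this is called.
    POS tags are described at https://spacy.io/api/annotation
    """
    texts_out = []
    for text in texts:
        doc = nlp(" ".join(text)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out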

In [26]:
# remove stop words
print('running stop words')
articles_processed = remove_stopwords(articles_processed,stop_words)
print('done stop words')

# for up to quad grams
print('running ngrams')
articles_processed_ngrams = make_quadgrams(articles_processed)
print('done')

running stop words
done stop words
running ngrams
done
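
`remove_stopwords` is imported from `text_processing_functions` and its body is not shown in this notebook. A minimal sketch of what such a function might do, assuming each article is a list of tokens (the name `remove_stopwords_sketch` is hypothetical and this is not the actual implementation):

```python
def remove_stopwords_sketch(texts, stop_words):
    """Drop stop words from each tokenized text (hypothetical stand-in, not the real function)."""
    # a set gives constant-time membership tests, which matters for long stop word lists
    stop_set = set(stop_words)
    return [[token for token in text if token not in stop_set] for text in texts]

# toy tokenized articles and stop words, for illustration only
texts = [['the', 'storm', 'said', 'coast'], ['youre', 'voter']]
filtered = remove_stopwords_sketch(texts, ['the', 'said', 'youre'])
print(filtered)
```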


In [29]:
# Initialize spacy 'en_core_web_sm' model
import en_core_web_sm
nlp = en_core_web_sm.load()

In [30]:
# for up to quad grams
# articles_processed_ngrams = make_quadgrams(articles_processed)

In [30]:
# do lemmatization, keeping only nouns, adjectives, verbs, and adverbs
print('lemmatizing')
articles_processed_ngrams_lemmaed = lemmatization(articles_processed_ngrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
print('done')

lemmatizing
done


In [31]:
with open('./articles_processed_ngrams_lemmaed.pkl', 'wb') as pickle_file:
    pickle.dump(articles_processed_ngrams_lemmaed, pickle_file)
print('done')

done


In [35]:
# articles_processed_ngrams_lemmaed[322]

In [None]:
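# make dictionary from the same tokens used to build the corpus, so that
# ngrams and lemmas are not silently dropped by doc2bow
print('making id2word')
id2word = corpora.Dictionary(articles_processed_ngrams_lemmaed)
print('id2word done')

print('making bag of words')
# make bag of words, i.e. (token id, count) pairs, for each article
corpus = [id2word.doc2bow(article) for article in articles_processed_ngrams_lemmaed]
print('bag of words done')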

In [73]:
with open('./id2word.pkl', 'wb') as pickle_file:
    pickle.dump(id2word, pickle_file)
print('done')

done


In [74]:
with open('./corpus.pkl', 'wb') as pickle_file:
    pickle.dump(corpus, pickle_file)
print('done')

done


In [84]:
# set lda model n topics

# lda_n_topics = np.arange(10,110,10)
# lda_n_topics = np.concatenate((np.arange(1,2),lda_n_topics))

# lda_n_topics = np.arange(60,110,10)
lda_n_topics = np.array([100, 200])
print(lda_n_topics)

[100 200]


In [85]:
# set LDA hyperparameters
n_docs_chunksize = 60000

n_training_passes = 100

In [None]:
# loop through number of topics

n_topics_coherence_dict = {}
n_topics_perplexity_dict = {}

for n_topic in lda_n_topics:

    print('train lda model')
    print(f'n_topics: {n_topic}')
    print(' ')
    lda_model = gensim.models.LdaModel(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=n_topic,
                                       random_state=100,
                                       update_every=1,
                                       chunksize=n_docs_chunksize,
                                       passes=n_training_passes,
                                       alpha='auto',
                                       per_word_topics=True)
    
    print('model training done')
    
#     print('determining coherence')
#     # get model coherence
#     coherence_model_lda = CoherenceModel(model=lda_model, texts=articles_processed_ngrams_lemmaed, dictionary=id2word, coherence='c_v')
#     coherence_lda = coherence_model_lda.get_coherence()
#     n_topics_coherence_dict[n_topic] = coherence_lda
    
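    # note: log_perplexity returns the per-word likelihood bound (a negative number),
    # not the perplexity itself; perplexity is 2**(-bound)
    print('calculating perplexity')
    perplexity = lda_model.log_perplexity(corpus)
    n_topics_perplexity_dict[n_topic] = perplexity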
    

#     print('Coherence Score: ', coherence_lda)
    print('Perplexity Score: ',perplexity)

    pkl_file_name = f'./lda_models/lda_model_n_topics_{n_topic}_n_passes_{n_training_passes}_n_docs_chunksize_{n_docs_chunksize}.pkl'
    print(f'saving: {pkl_file_name}')
    with open(pkl_file_name, 'wb') as pickle_file:
        pickle.dump(lda_model, pickle_file)
    print(' ')

train lda model
n_topics: 100
 
model training done
calculating perplexity
Perplexity Score:  -9.95008409124062
saving: ./lda_models/lda_model_n_topics_100_n_passes_100_n_docs_chunksize_60000.pkl
 
train lda model
n_topics: 200
 


In [71]:
# get model coherence
# coherence_model_lda = CoherenceModel(model=lda_model, texts=articles_processed_ngrams_lemmaed, dictionary=id2word, coherence='c_v')
# coherence_lda = coherence_model_lda.get_coherence()
# n_topics_coherence_dict[n_topic] = coherence_lda

# get model perplexity
# print('calculating perplexity')
# perplexity = lda_model.log_perplexity(corpus)
# n_topics_perplexity_dict[n_topic] = perplexity
print(perplexity)
print(n_topics_perplexity_dict)

-8.922071655468258
{1: -8.922071655468258}
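
The values printed as "perplexity" in this notebook are what `log_perplexity` returns: a per-word likelihood bound, not a perplexity. gensim's documentation gives the conversion as perplexity = 2^(-bound). A short worked conversion of the two bounds recorded in this notebook, keyed by the number of topics under which each was printed:

```python
# per-word likelihood bounds printed in this notebook, keyed by n_topics
bounds = {100: -9.95008409124062, 1: -8.922071655468258}

# perplexity = 2^(-bound)
perplexities = {n_topics: 2 ** (-bound) for n_topics, bound in bounds.items()}

for n_topics, perplexity in perplexities.items():
    print(f'n_topics = {n_topics}: bound = {bounds[n_topics]:.3f}, perplexity = {perplexity:.1f}')
```

Both bounds were computed on the training corpus itself, so they say little about how well a model generalizes to unseen articles.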


In [34]:
# compute coherence 
coherence_model_lda = CoherenceModel(model=lda_model, texts=articles_processed_ngrams_lemmaed, dictionary=id2word, coherence='c_v')

coherence_lda = coherence_model_lda.get_coherence()

print('\nCoherence Score: ', coherence_lda)

  m_lr_i = np.log(numerator / denominator)
  return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))



Coherence Score:  nan


In [8]:
lda_model.show_topics(num_topics=5)

NameError: name 'lda_model' is not defined

In [83]:
# get list of manually made text files of articles

full_path = '/Users/sagarsetru/Documents/post PhD positions search/insightDataScience/project/debiaser/article_text_files/'

full_file_names = [
full_path+'ap_hurricane_sally_unleashes_20200916.txt',
full_path+'cnn_big_ten_backtracks_20200916.txt',
full_path+'nyt_on_the_fire_line_20200915.txt',
full_path+'foxnews_snake_face_mask_20200916_v2.txt',
'/Users/sagarsetru/Documents/post PhD positions search/insightDataScience/project/debiaser/article_html_files/U.S. Stocks Lower as Fed Outlook Rattles Investors - WSJ.html'
]

In [155]:
# Create a floor of the frequency of words to remove
word_frequency_threshold = 1

# choose the number of LDA topics
num_lda_topics = 5

# loop through files
for ind, full_file_name in enumerate(full_file_names):
    
    # get extension (for testing copy and pasted html)
    fname, ext = os.path.splitext(full_file_name)

    # get path and file name
    pathstr, fname = os.path.split(full_file_name)
        
    # print file name
    print(fname)
    print(ext)
    
    if ext == '.txt':
        # get the article text as one string, remove new lines
        with open(full_file_name, 'r') as file:
            article_text = file.read().replace('\n', ' ')
            
    elif ext == '.html':
        
        # get the html as one string
        with open(full_file_name, 'r') as file:
            coverpage = file.read()

        # create soup object
        soup = BeautifulSoup(coverpage, 'html.parser')

        # get title
        headline = soup.find('h1').get_text()
        print(headline)
        print(' ')

        # get text from all <p> tags
        p_tags = soup.find_all('p')

        # get text from each p tag and strip whitespace
        p_tags_text = [tag.get_text().strip() for tag in p_tags]
        
        # filter out sentences without periods
        p_tags_text = [sentence for sentence in p_tags_text if '.' in sentence]

        # convert all p_tags_text to a single article text string,
        # separating paragraphs with a space so boundary words are not fused
        article_text = ' '.join(p_tags_text)
    
    
    # note: this is overwritten on every iteration, so after the loop it holds only the last article
    article_text_processed = process_all_articles([article_text])
    
    id2word_test = corpora.Dictionary(article_text_processed)
    

    
#     lda_model.get_document_topics()
    
#     if ind == 0:
#         break
    
    

ap_hurricane_sally_unleashes_20200916.txt
.txt
cnn_big_ten_backtracks_20200916.txt
.txt
nyt_on_the_fire_line_20200915.txt
.txt
foxnews_snake_face_mask_20200916_v2.txt
.txt
U.S. Stocks Lower as Fed Outlook Rattles Investors - WSJ.html
.html
The Wall Street Journal
 


In [156]:
# id2word_test = corpora.Dictionary(articles_text_processed)

# convert to bag of words with the training dictionary, so token ids match those the LDA model was trained on
corpus_test = [id2word.doc2bow(article) for article in article_text_processed]

In [157]:
test = lda_model[corpus_test]

In [139]:
test2 = lda_model.get_document_topics(corpus_test)

In [140]:
test

<gensim.interfaces.TransformedCorpus at 0x14bde42b0>

In [142]:
test2

<gensim.interfaces.TransformedCorpus object at 0x14bd993d0>


In [158]:
test[0]

[(0, 0.26132786),
 (2, 0.14556913),
 (5, 0.20745209),
 (6, 0.077264264),
 (11, 0.01413767),
 (15, 0.097798266),
 (16, 0.012090704),
 (18, 0.09593318),
 (19, 0.0761351)]

In [161]:
test2[0]

[(0, 0.086565785),
 (2, 0.20013271),
 (5, 0.11250849),
 (6, 0.06964701),
 (11, 0.038633205),
 (15, 0.076380216),
 (16, 0.010171399),
 (18, 0.12004942),
 (19, 0.27560484)]
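
Each result is a list of `(topic_id, probability)` pairs, so the dominant topics of a document can be read off by sorting on the probability. A small sketch using the `test[0]` values shown above (the helper name `top_topics` is hypothetical):

```python
def top_topics(doc_topics, n=3):
    """Return the n most probable (topic_id, probability) pairs, most probable first."""
    return sorted(doc_topics, key=lambda pair: pair[1], reverse=True)[:n]

# values copied from the test[0] output
doc_topics = [(0, 0.26132786), (2, 0.14556913), (5, 0.20745209), (6, 0.077264264),
              (11, 0.01413767), (15, 0.097798266), (16, 0.012090704), (18, 0.09593318),
              (19, 0.0761351)]

dominant = top_topics(doc_topics)
print(dominant)
```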

In [159]:
lda_model.print_topic(0)

'0.017*"drop" + 0.015*"oil" + 0.012*"fund" + 0.011*"product" + 0.010*"elect" + 0.010*"russian" + 0.010*"gain" + 0.010*"exchange" + 0.009*"investor" + 0.009*"contract"'

In [162]:
lda_model.print_topic(5)

'0.015*"candidate" + 0.014*"voter" + 0.014*"race" + 0.012*"conservative" + 0.009*"debate" + 0.009*"republican" + 0.008*"democratic" + 0.008*"poll" + 0.008*"speech" + 0.008*"promise"'
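
`print_topic` returns a formatted string, while a search query needs only the words. gensim's `show_topic` returns `(word, probability)` pairs directly. As a dependency-free illustration, the words can also be pulled out of the string above with a regular expression:

```python
import re

# string copied from the print_topic(5) output
topic_str = ('0.015*"candidate" + 0.014*"voter" + 0.014*"race" + 0.012*"conservative" + '
             '0.009*"debate" + 0.009*"republican" + 0.008*"democratic" + 0.008*"poll" + '
             '0.008*"speech" + 0.008*"promise"')

# capture each weight and the quoted word that follows it
matches = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
pairs = [(word, float(weight)) for weight, word in matches]

# keep just the words, in order of decreasing weight as printed
topic_words = [word for word, _ in pairs]
print(topic_words)
```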

In [96]:
article_text_processed

[['hurricane',
  'sally',
  'unleashes',
  'flooding',
  'along',
  'the',
  'gulf',
  'coast',
  'by',
  'jay',
  'reeves',
  'angie',
  'wang',
  'and',
  'jeff',
  'martin',
  'minutes',
  'ago',
  'pensacola',
  'fla',
  'ap',
  'hurricane',
  'sally',
  'lumbered',
  'ashore',
  'near',
  'the',
  'florida',
  'alabama',
  'line',
  'wednesday',
  'with',
  'mph',
  'winds',
  'and',
  'rain',
  'measured',
  'in',
  'feet',
  'not',
  'inches',
  'swamping',
  'homes',
  'and',
  'trapping',
  'people',
  'in',
  'high',
  'water',
  'as',
  'it',
  'crept',
  'inland',
  'for',
  'what',
  'could',
  'be',
  'long',
  'slow',
  'and',
  'disastrous',
  'drenching',
  'across',
  'the',
  'deep',
  'south',
  'moving',
  'at',
  'an',
  'agonizing',
  'mph',
  'or',
  'about',
  'as',
  'fast',
  'as',
  'person',
  'can',
  'walk',
  'the',
  'storm',
  'made',
  'landfall',
  'at',
  'close',
  'to',
  'gulf',
  'shores',
  'alabama',
  'battering',
  'the',
  'metropolitan',
 

In [None]:
# get_document_topics requires a bag-of-words document (or corpus) as input
lda_model.get_document_topics(corpus_test[0])