# LDA model testing for downloaded article database for 'debiaser' data product
#### Sagar Setru, September 21st, 2020

## Brief description using CoNVO framework

### Context

Some people want news from outside their echo chamber but do not know where to look, and seeking out other sources takes effort (an activation energy barrier). Meanwhile, most newsfeeds push only content that the reader already agrees with, so readers can end up in an echo chamber without ever having wanted to be in one.

### Need

A way to find news articles from different yet reliable media sources.

### Vision

Debiaser, a data product (maybe Chrome plug-in?) that will recommend news articles similar in topic to the one currently being read, but from several pre-curated and reliable news media organizations across the political spectrum, for example, following the "media bias chart" here https://www.adfontesmedia.com/ or the "media bias ratings" here: https://www.allsides.com/media-bias/media-bias-ratings. The app will determine the main topics of the text of a news article, and then show links to similar articles from other news organizations.

The product will generate topics for a given document via latent Dirichlet allocation (LDA) and then search news websites for the topic words generated.
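
As a rough sketch of the second step, topic words could be turned into site-restricted search queries. The topic words, the domain list, and the helper `build_search_urls` are all hypothetical, and using Google's `site:` operator is an assumption about how the search might be done, not a settled design.

```python
from urllib.parse import urlencode

# Hypothetical inputs: topic words as an LDA model might return them,
# and an illustrative (not curated) list of news domains.
topic_words = ["judge", "sentence", "stanford", "conviction"]
news_domains = ["apnews.com", "reuters.com", "npr.org"]


def build_search_urls(words, domains, max_words=5):
    """Return one site-restricted search URL per news domain."""
    query_terms = " ".join(words[:max_words])
    urls = {}
    for domain in domains:
        # The `site:` operator restricts results to a single domain.
        params = {"q": f"{query_terms} site:{domain}"}
        urls[domain] = "https://www.google.com/search?" + urlencode(params)
    return urls


search_urls = build_search_urls(topic_words, news_domains)
for domain, url in search_urls.items():
    print(domain, url)
```

Capping the query at a few words (`max_words`) is a guess at a sensible default; long queries tend to return fewer matches.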

Caveats: Many of these articles may be behind paywalls. News aggregators already do much of this. How different is this from simply searching Google with the title of an article?

### Outcome

People who are motivated to engage in content outside of their echo chambers have a tool that enables them to quickly find news similar to what they are currently reading, but from a variety of news organizations.

### Testing LDA on larger document corpus

In [2]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns

In [42]:
# import user defined functions for text processing
from text_processing_functions import process_all_articles, remove_stopwords

In [4]:
# del process_all_articles

In [5]:
print('Conda environment:')
print(os.environ['CONDA_DEFAULT_ENV'])

Conda environment:
insight


In [6]:
# import text processing and NLP specific packages

# for generating LDA models
import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# for preprocessing documents
from gensim.parsing.preprocessing import preprocess_documents
# from gensim.utils import simple_preprocess

# for visualizing LDA models
import pyLDAvis
import pyLDAvis.gensim

# for counting frequency of words
from collections import defaultdict

import string

# for lemmatization
import spacy

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [7]:
def load_stop_words_csv_to_list(full_file_name):
    """Load the stop words list downloaded from the git repo 'news-stopwords'.

    Reads the CSV and returns the values of its 'term' column as a list.
    """
    stop_words = pd.read_csv(full_file_name)

    return stop_words['term'].tolist()

In [8]:
def get_simple_corpus_dictionary_bow(texts, word_frequency_threshold):
    """Return the frequency-filtered corpus, its gensim Dictionary, and the bag-of-words corpus."""
    
    # Count word frequencies
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1

    # Only keep words that appear more than set frequency, to produce the corpus
    processed_corpus = [[token for token in text if frequency[token] > word_frequency_threshold] for text in texts]
    
    # generate a dictionary via gensim
    processed_dictionary = Dictionary(processed_corpus)
    
    # generate bag of words of the corpus
    bow_corpus = [processed_dictionary.doc2bow(text) for text in processed_corpus]
    
    return processed_corpus, processed_dictionary, bow_corpus

In [9]:
# choose list of stop words

# choose whether 1k, 10k, 100k, or nltk
which_stop_words = '1k'
# which_stop_words = '10k'
# which_stop_words = '100k'
# which_stop_words = 'nltk'

stop_words_path = '/Users/sagarsetru/Documents/post PhD positions search/insightDataScience/project/debiaser/stop_words_db/news-stopwords-master/'


if which_stop_words == '1k':
    
    # doing 1k words list
    stop_words_file_name = 'sw1k.csv'
    
    # make full file name
    stop_words_full_file_name = stop_words_path+stop_words_file_name
    
    # get list of stop words
    stop_words = load_stop_words_csv_to_list(stop_words_full_file_name)
    
elif which_stop_words == '10k':
    
    # doing 10k words list
    stop_words_file_name = 'sw10k.csv'

    # make full file name
    stop_words_full_file_name = stop_words_path+stop_words_file_name
    
    # get list of stop words
    stop_words = load_stop_words_csv_to_list(stop_words_full_file_name)

elif which_stop_words == '100k':
    
    # doing 100k
    stop_words_file_name = 'sw100k.csv'  
    
    # get full file name
    stop_words_full_file_name = stop_words_path+stop_words_file_name
    
    # get list of stop words
    stop_words = load_stop_words_csv_to_list(stop_words_full_file_name)


elif which_stop_words == 'nltk':
    # import from nltk
    from nltk.corpus import stopwords
    
    stop_words = stopwords.words('english')
    
else:
    # fail early; otherwise stop_words is undefined in the lines below
    raise ValueError('Select proper variable name for "which_stop_words"')
    
# adding custom words
stop_words.append('said')
stop_words.append('youre')
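
The `remove_stopwords` function imported from `text_processing_functions` is not shown in this notebook. A minimal sketch of what such a filter might look like, assuming it takes a list of token lists and a list of stop words, is below; `remove_stopwords_sketch` is a hypothetical stand-in, not the actual implementation.

```python
def remove_stopwords_sketch(tokenized_docs, stop_words):
    """Drop every token that appears in stop_words (hypothetical helper)."""
    stop_set = set(stop_words)  # set lookups are fast even for the 100k list
    return [[tok for tok in doc if tok not in stop_set] for doc in tokenized_docs]


# Made-up example documents and stop words.
example_docs = [["the", "judge", "said", "no"], ["youre", "welcome"]]
example_stop_words = ["the", "no", "said", "youre"]
filtered_docs = remove_stopwords_sketch(example_docs, example_stop_words)
print(filtered_docs)  # [['judge'], ['welcome']]
```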

In [10]:
# load csv of processed data to pandas dataframe
articles_df = pd.read_csv('./all_the_news/all_news_df_processed.csv')



In [11]:
# preview the data
articles_df.head()

Unnamed: 0,index,index.1,id,title,author,date,content,year,month,publication,category,digital,section,url,article_length
0,0,0,1,Agent Cooper in Twin Peaks is the audience: on...,\nTasha Robinson\n,2017-05-31,And never more so than in Showtime’s new...,2017.0,5.0,Verge,Longform,1.0,,,2121
1,1,1,2,"AI, the humanity!",\nSam Byford\n,2017-05-30,AlphaGo’s victory isn’t a defeat for hum...,2017.0,5.0,Verge,Longform,1.0,,,1948
2,2,2,3,The Viral Machine,\nKaitlyn Tiffany\n,2017-05-25,Super Deluxe built a weird internet empi...,2017.0,5.0,Verge,Longform,1.0,,,3011
3,3,3,4,How Anker is beating Apple and Samsung at thei...,\nNick Statt\n,2017-05-22,Steven Yang quit his job at Google in th...,2017.0,5.0,Verge,Longform,1.0,,,3281
4,4,4,5,Tour Black Panther’s reimagined homeland with ...,\nKwame Opam\n,2017-05-15,Ahead of Black Panther’s 2018 theatrical...,2017.0,5.0,Verge,Longform,1.0,,,239


In [12]:
articles_df['article_length'].describe()

count    182636.000000
mean        862.016393
std         864.620185
min          51.000000
25%         397.000000
50%         693.000000
75%        1069.000000
max       50517.000000
Name: article_length, dtype: float64

In [13]:
# get random n_sample number articles for testing
n_sample = 1000

articles_df_test = articles_df.sample(n=n_sample)

In [14]:
# get just the articles content and titles
articles_content = articles_df_test['content'].astype('str')
articles_titles = articles_df_test['title'].astype('str')

# count missing titles in the full dataset (not only the sample); a missing title must not be added as the word 'nan'
print(articles_df['title'].isnull().sum())

1


Note: the steps below follow the tutorial at https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#10removestopwordsmakebigramsandlemmatize

In [15]:
# merge titles and content
articles_full = []

for content,title in zip(articles_content, articles_titles):
    
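    # astype('str') turns a missing title into the string 'nan'; don't add it
    if title == 'nan':
        
        print(title)
        
        articles_full.append(content)
        
    else:
        
        # separate title and content so their boundary words don't fuse
        articles_full.append(title + ' ' + content)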

In [16]:
# show number of documents
n_documents = len(articles_full)
print(f'There are {n_documents} documents in this corpus.')

There are 1000 documents in this corpus.


In [23]:
# process documents
articles_processed = process_all_articles(articles_full)

In [18]:
# remove stopwords
# articles_processed = remove_stopwords(articles_processed,stop_words)

In [34]:
# len(articles_processed)
# articles_processed[0]

In [35]:
# build ngram models (bi, tri, quad)

ngram_min_count = 2

bigram_threshold = 25
trigram_threshold = 15
quadgram_threshold = 100

bigram = gensim.models.Phrases(articles_processed, min_count=ngram_min_count, threshold=bigram_threshold)  # higher threshold, fewer phrases
trigram = gensim.models.Phrases(bigram[articles_processed], threshold=trigram_threshold)
quadgram = gensim.models.Phrases(trigram[articles_processed], threshold=quadgram_threshold)

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
quadgram_mod = gensim.models.phrases.Phraser(quadgram)

In [39]:
# fxns for bi, tri, quadgrams, and lemmatization
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def make_quadgrams(texts):
    return [quadgram_mod[trigram_mod[bigram_mod[doc]]] for doc in texts]

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens whose POS tag is allowed.

    Uses the global spaCy pipeline `nlp`; tag set: https://spacy.io/api/annotation
    """
    texts_out = []
    for text in texts:
        doc = nlp(" ".join(text))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [49]:
# remove stop words
articles_processed = remove_stopwords(articles_processed,stop_words)

# for up to quad grams
articles_processed_ngrams = make_quadgrams(articles_processed)

# Initialize the spaCy model; requires `python -m spacy download en_core_web_lg`
# parser and NER are disabled for efficiency, since only tagging and lemmas are needed
nlp = spacy.load('en_core_web_lg', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
articles_processed_ngrams_lemmaed = lemmatization(articles_processed_ngrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

# print(articles_processed_ngrams_lemmaed[:1])

OSError: [E050] Can't find model 'en_core_web_lg'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

In [24]:
# Create Dictionary
id2word = corpora.Dictionary(articles_processed)

# Create Corpus
texts = articles_processed

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

In [None]:
articles_full[0]

In [24]:
gensim.utils.simple_preprocess(articles_full[0], deacc=True)

['brock',
 'turner',
 'judge',
 'backs',
 'out',
 'of',
 'child',
 'porn',
 'case',
 'cnn',
 'the',
 'judge',
 'at',
 'the',
 'center',
 'of',
 'controversial',
 'sentence',
 'involving',
 'brock',
 'turner',
 'former',
 'stanford',
 'student',
 'convicted',
 'of',
 'sexual',
 'assault',
 'has',
 'recused',
 'himself',
 'from',
 'making',
 'decision',
 'in',
 'another',
 'sex',
 'case',
 'santa',
 'clara',
 'county',
 'judge',
 'aaron',
 'persky',
 'was',
 'expected',
 'to',
 'decide',
 'on',
 'thursday',
 'whether',
 'to',
 'reduce',
 'plumber',
 'felony',
 'conviction',
 'for',
 'possession',
 'of',
 'child',
 'pornography',
 'to',
 'misdemeanor',
 'but',
 'has',
 'now',
 'decided',
 'to',
 'take',
 'himself',
 'off',
 'the',
 'case',
 'apparently',
 'in',
 'light',
 'of',
 'media',
 'coverage',
 'while',
 'on',
 'vacation',
 'earlier',
 'this',
 'month',
 'my',
 'family',
 'and',
 'were',
 'exposed',
 'to',
 'publicity',
 'surrounding',
 'this',
 'case',
 'persky',
 'said',
 'in',
 