# Making quadgrams using article database
#### Sagar Setru, September 2020

## Brief description using CoNVO framework

### Context

Some people are eager to get news from outside their echo chamber. However, they often do not know where to look, and seeking out other sources takes effort that is easy to put off. Meanwhile, most news feeds mainly push content the reader already agrees with, so people end up in an echo chamber they may never have wanted to be in.

### Need

A way to find news articles from different yet reliable media sources.

### Vision

Debiaser, a Chrome extension that recommends news articles similar in topic to the one currently being read, but from several pre-curated, reliable news organizations across the political spectrum. The sources could be chosen, for example, by following the "media bias chart" at https://www.adfontesmedia.com/ or the "media bias ratings" at https://www.allsides.com/media-bias/media-bias-ratings. The app will determine the main topics of a news article's text and then show links to similar articles from other news organizations.

The product will generate topics for a given document via latent Dirichlet allocation (LDA) and then search news websites for the topic words generated.
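
As a rough illustration of the search step, the sketch below turns a list of topic words into a site-restricted search URL. It is only a sketch under assumptions: the helper name `build_search_url`, the example topic words, the example domain, and the use of a Google search URL with the `site:` operator are all hypothetical and are not taken from the Debiaser code.

```python
from urllib.parse import urlencode

def build_search_url(topic_words, site, base_url="https://www.google.com/search"):
    """Build a search URL restricted to one news site (hypothetical helper)."""
    # drop empty strings and duplicates while keeping the original word order
    unique_words = list(dict.fromkeys(w for w in topic_words if w))
    # the site: operator limits results to a single domain
    query = " ".join(unique_words + [f"site:{site}"])
    return f"{base_url}?{urlencode({'q': query})}"

# example topic words; in the product these would come from the LDA model
url = build_search_url(["election", "senate", "election", "ballot"], "example.com")
print(url)
```

Building one such URL per curated news organization would give one set of candidate articles per source.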

Caveats: Many of these articles may be behind paywalls. News aggregators already do something similar. It is also an open question how different this is from simply searching Google with the title of an article.

### Outcome

People who are motivated to engage with content outside of their echo chambers have a tool that lets them quickly find news similar to what they are currently reading, but from a variety of news organizations.

NOTE: run this notebook on an EC2 instance, because processing the full corpus is computationally intensive.

### Testing LDA on larger document corpus

In [1]:
import numpy as np
import pandas as pd
import os

In [2]:
print('Conda environment:')
print(os.environ['CONDA_DEFAULT_ENV'])

Conda environment:
debiaser


In [3]:
# add parent directory to path 

import sys
sys.path.append(os.path.dirname(os.getcwd()))

In [4]:
# import text processing and NLP specific packages

# for generating ngrams
import gensim

# import functions for text processing
from debiaser.text_processing_functions import process_all_articles

import pickle

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)
print("DONE")

DONE


In [10]:
# load csv of processed data to pandas dataframe
articles_df = pd.read_csv('../all_the_news/all_news_df_processed.csv')
print("DONE")

DONE


In [11]:
# preview the data
articles_df.head()

Unnamed: 0,index,index.1,id,title,author,date,content,year,month,publication,category,digital,section,url,article_length
0,0,0,1,Agent Cooper in Twin Peaks is the audience: on...,\nTasha Robinson\n,2017-05-31,And never more so than in Showtime’s new...,2017.0,5.0,Verge,Longform,1.0,,,2121
1,1,1,2,"AI, the humanity!",\nSam Byford\n,2017-05-30,AlphaGo’s victory isn’t a defeat for hum...,2017.0,5.0,Verge,Longform,1.0,,,1948
2,2,2,3,The Viral Machine,\nKaitlyn Tiffany\n,2017-05-25,Super Deluxe built a weird internet empi...,2017.0,5.0,Verge,Longform,1.0,,,3011
3,3,3,4,How Anker is beating Apple and Samsung at thei...,\nNick Statt\n,2017-05-22,Steven Yang quit his job at Google in th...,2017.0,5.0,Verge,Longform,1.0,,,3281
4,4,4,5,Tour Black Panther’s reimagined homeland with ...,\nKwame Opam\n,2017-05-15,Ahead of Black Panther’s 2018 theatrical...,2017.0,5.0,Verge,Longform,1.0,,,239


In [12]:
articles_df['article_length'].describe()

count    182636.000000
mean        862.016393
std         864.620185
min          51.000000
25%         397.000000
50%         693.000000
75%        1069.000000
max       50517.000000
Name: article_length, dtype: float64

In [13]:
# use the full corpus; to test on a random subset instead,
# set n_sample and uncomment the sampling line below
n_sample = 182636

# articles_df_test = articles_df.sample(n=n_sample)
articles_df_test = articles_df

In [14]:
# get just the articles content and titles
articles_content = articles_df_test['content'].astype('str')
articles_titles = articles_df_test['title'].astype('str')

# count missing titles; astype('str') turns them into the string 'nan',
# which must not be prepended to the article content
print(articles_df['title'].isnull().sum())

1


Note: the steps below follow this tutorial: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#10removestopwordsmakebigramsandlemmatize

In [15]:
# merge titles and content
articles_full = []

for content,title in zip(articles_content, articles_titles):
    
    # don't add word 'nan'
    if title == 'nan':
        
        print(title)
        
        articles_full.append(content)
        
    else:
        
        # join with a space so the last title word does not fuse with the first content word
        articles_full.append(title + ' ' + content)

nan
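
A variant of the merge step that checks for a missing title directly, rather than comparing against the string 'nan', might look like the sketch below. The helper name `merge_title_and_content` and the two example rows are hypothetical. It assumes the raw column values are used (before `astype('str')`), where pandas represents a missing title as a float NaN, and it joins title and content with a space.

```python
import math

def merge_title_and_content(title, content):
    """Return 'title content', or just the content when the title is missing."""
    # pandas represents a missing string as float NaN; None is handled too
    title_missing = title is None or (isinstance(title, float) and math.isnan(title))
    if title_missing:
        return str(content)
    # the space keeps the last title word from fusing with the first content word
    return f"{title} {content}"

# two made-up rows: one with a title, one with a missing title
rows = [("AI, the humanity!", "AlphaGo's victory ..."), (float("nan"), "Body only.")]
merged = [merge_title_and_content(t, c) for t, c in rows]
print(merged)
```

Checking for NaN directly also avoids dropping the title of an article whose title is literally the word "nan".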


In [20]:
# show number of documents
n_documents = len(articles_full)
print(f'There are {n_documents} documents in this corpus.')

There are 182636 documents in this corpus.


In [21]:
# process documents
articles_processed = process_all_articles(articles_full,nlp=[])

# save the processed articles; the context manager closes the file after writing
with open('../articles_processed.pkl', 'wb') as f:
    pickle.dump(articles_processed, f)
print("DONE")

DONE


In [22]:
# build ngram models (bi, tri, quad)
# NOTE: CONSIDER TRYING THIS FOR BETTER NGRAMS:
# https://medium.com/@manjunathhiremath.mh/identifying-bigrams-trigrams-and-four-grams-using-word2vec-dea346130eb

ngram_min_count = 2

bigram_threshold = 25
trigram_threshold = 15
quadgram_threshold = 100

print('running...')

# a higher threshold yields fewer phrases
print('making bigram')
bigram = gensim.models.Phrases(articles_processed, min_count=ngram_min_count, threshold=bigram_threshold)

# NOTE: min_count is not passed below, so these two models use gensim's default (5)
print('making trigram')
trigram = gensim.models.Phrases(bigram[articles_processed], threshold=trigram_threshold)

# train on text that already has bigrams and trigrams joined, applying the models in order
print('making quadgram')
quadgram = gensim.models.Phrases(trigram[bigram[articles_processed]], threshold=quadgram_threshold)

# Phraser objects are a faster, lighter way to apply the trained models;
# apply them in the same order, e.g. quadgram_mod[trigram_mod[bigram_mod[doc]]]
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
quadgram_mod = gensim.models.phrases.Phraser(quadgram)

# save the models; the context managers close each file after writing
with open('./bigram_mod.pkl', 'wb') as f:
    pickle.dump(bigram_mod, f)
with open('./trigram_mod.pkl', 'wb') as f:
    pickle.dump(trigram_mod, f)
with open('./quadgram_mod.pkl', 'wb') as f:
    pickle.dump(quadgram_mod, f)

print('done')

running...
making bigram
making trigram
making quadgram
done
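
To make the `min_count` and `threshold` arguments more concrete, here is a small pure-Python sketch of the default scoring rule that gensim documents for `Phrases`: a candidate pair is scored as `(pair_count - min_count) / count_a / count_b * vocab_size` and is kept when the score exceeds the threshold. The toy corpus and the helper names `score_pair` and `find_phrases` are assumptions for illustration, and the vocabulary size is simplified to the number of unique single words, so the numbers will not match gensim's exactly.

```python
from collections import Counter

def score_pair(count_a, count_b, count_ab, vocab_size, min_count):
    """Phrases-style score: higher means the pair co-occurs unusually often."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

def find_phrases(sentences, min_count, threshold):
    """Return {(word_a, word_b): score} for adjacent pairs scoring above threshold."""
    word_counts = Counter(w for s in sentences for w in s)
    pair_counts = Counter(p for s in sentences for p in zip(s, s[1:]))
    vocab_size = len(word_counts)  # simplification: unique single words only
    phrases = {}
    for (a, b), count_ab in pair_counts.items():
        score = score_pair(word_counts[a], word_counts[b], count_ab, vocab_size, min_count)
        if score > threshold:
            phrases[(a, b)] = score
    return phrases

# toy corpus: "new york" always appears together, "the" appears almost everywhere
toy_sentences = [
    ["the", "new", "york", "senate", "met"],
    ["the", "senate", "in", "new", "york", "voted"],
    ["the", "vote", "in", "new", "york", "passed"],
    ["the", "senate", "passed", "the", "vote"],
]
phrases = find_phrases(toy_sentences, min_count=1, threshold=1.8)
print(phrases)
```

In this toy corpus only the pair ("new", "york") clears the threshold; frequent words such as "the" are penalized by the division by their counts. Raising the threshold prunes more candidates, which is the "higher threshold, fewer phrases" behavior noted in the cell above.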
