# TF-IDF testing for downloaded article database for 'debiaser' data product
#### Sagar Setru, September 21th, 2020

## Brief description using CoNVO framework

### Context

Some people are eager to get news from outside of their echo chamber. However, they do not know where to go outside of their echo chambers, and may also have some activation energy when it comes to seeking information from other sources. In the meantime, most newsfeeds only push you content that you agree with. You end up in an echo chamber, but may not have ever wanted to be in one in the first place.

### Need

A way to find news articles from different yet reliable media sources.

### Vision

Debiaser, a data product (maybe Chrome plug-in?) that will recommend news articles similar in topic to the one currently being read, but from several pre-curated and reliable news media organizations across the political spectrum, for example, following the "media bias chart" here https://www.adfontesmedia.com/ or the "media bias ratings" here: https://www.allsides.com/media-bias/media-bias-ratings. The app will determine the main topics of the text of a news article, and then show links to similar articles from other news organizations.

The product will generate topics and keywords for a given document via LDA and TF-IDF then search news websites for the topic words generated.

Caveats: Many of these articles may be behind paywalls. News aggregators already basically do this. How different is this than just searching Google using the title of an article?

### Outcome

People who are motivated to engage in content outside of their echo chambers have a tool that enables them to quickly find news similar to what they are currently reading, but from a variety of news organizations.

### Testing TFIDF on larger document corpus

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time

import json

# NLP Packages
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.utils import simple_preprocess

# to break articles up into sentences
from nltk import tokenize


import pickle

from debiaser_validation_function import return_suggested_articles2
from debiaser_validation_function import make_bigrams
from debiaser_validation_function import make_trigrams
from debiaser_validation_function import make_quadgrams

from text_processing_functions import process_all_articles
from text_processing_functions import remove_stopwords
from text_processing_functions import get_simple_corpus_dictionary_bow

import requests

from bs4 import BeautifulSoup

import string

import pickle
print('DONE')

DONE


In [3]:
import os
print('Conda environment:')
print(os.environ['CONDA_DEFAULT_ENV'])

Conda environment:
debiaser


In [113]:
# load dictionary used to train model on EC2
id2word_file = '/Users/sagarsetru/Documents/post PhD positions search/insightDataScience/project/debiaser/all_the_news/id2word_ec2.pkl'
with open(id2word_file, 'rb') as pickle_file:
    processed_dictionary = pickle.load(pickle_file)
    
# load the processed bow corpus
corpus_file = '/Users/sagarsetru/Documents/post PhD positions search/insightDataScience/project/debiaser/all_the_news/corpus.pkl'
with open(corpus_file, 'rb') as pickle_file:
    bow_corpus = pickle.load(pickle_file)
print('done')

done


In [114]:
%%time
tfidf = TfidfModel(bow_corpus, id2word=processed_dictionary)

print('done')

done
CPU times: user 8.68 s, sys: 5.04 s, total: 13.7 s
Wall time: 17.2 s


In [129]:
pkl_file_name = '/Users/sagarsetru/Documents/post PhD positions search/insightDataScience/project/debiaser/all_the_news/tfidf_matrix.pkl'
pickle.dump( tfidf, open( pkl_file_name, 'wb'))
print('done')

done


In [122]:
with open(pkl_file_name, 'rb') as pickle_file:
    tfidf = pickle.load(pickle_file)

In [144]:
# url = 'https://www.nytimes.com/2020/09/25/us/politics/rbg-retirement-obama.html'
url = 'https://www.nytimes.com/2020/09/30/health/covid-cruise-ships.html'
url = 'https://www.theguardian.com/us-news/2020/oct/06/vice-presidential-debates-white-house-covid-19'
# url = 'https://www.npr.org/2020/10/06/920684113/michelle-obama-makes-final-pitch-vote-for-joe-biden-like-your-lives-depend-on-it'
# url = 'https://www.foxnews.com/politics/pence-warns-voters-you-wont-be-safe-if-biden-wins'

In [145]:
# dummy nlp variable
nlp=[]

# if lemmatizing into sentences
do_sentences = 0

print_article = 0

do_ngrams = 1
    
# get html content of url
page = requests.get(url)
coverpage = page.content

# create soup object
soup = BeautifulSoup(coverpage, 'html.parser')

# get title
headline = soup.find('h1').get_text()

# get text from all <p> tags
p_tags = soup.find_all('p')

# get text from each p tag and strip whitespace
p_tags_text = [tag.get_text().strip() for tag in p_tags]

# filter out sentences without periods
p_tags_text = [sentence for sentence in p_tags_text if '.' in sentence]

# convert all p_tags_text to single article text string
p_tags_text_1string = ''

for p_tag_text in p_tags_text:
    p_tags_text_1string += p_tag_text

# if print_ptags:
#     print(p_tags)
#     print(' ')
#     print(p_tags_text_1string)

combined_article = headline+'. '+p_tags_text_1string
if print_article:
    print(combined_article)


if do_ngrams:
    bigram_mod_file = '/Users/sagarsetru/Documents/post PhD positions search/insightDataScience/project/debiaser/all_the_news/bigram_mod.pkl'
    with open(bigram_mod_file, 'rb') as pickle_file:
        bigram_mod = pickle.load(pickle_file)

    trigram_mod_file = '/Users/sagarsetru/Documents/post PhD positions search/insightDataScience/project/debiaser/all_the_news/trigram_mod.pkl'
    with open(trigram_mod_file, 'rb') as pickle_file:
        trigram_mod = pickle.load(pickle_file)

    quadgram_mod_file = '/Users/sagarsetru/Documents/post PhD positions search/insightDataScience/project/debiaser/all_the_news/quadgram_mod.pkl'
    with open(quadgram_mod_file, 'rb') as pickle_file:
        quadgram_mod = pickle.load(pickle_file)

    # use nltk stopwords
    # stop_words = list(stopwords.words('english'))
    
# load stop words
stop_words = pd.read_csv('/Users/sagarsetru/Documents/post PhD positions search/insightDataScience/project/debiaser/stop_words_db/news-stopwords-master/sw1k.csv')

stop_words = stop_words['term']
stop_words = [word for word in stop_words]


# load all sides data
all_sides = pd.read_csv('/Users/sagarsetru/Documents/post PhD positions search/insightDataScience/project/debiaser/all_sides_media_data/allsides_final_plus_others_with_domains.csv')

all_sides_names = all_sides['name']
all_sides_domains = all_sides['domain']

# break up into sentences
if do_sentences:
    combined_article = tokenize.sent_tokenize(combined_article)
else:
    combined_article = [combined_article]

if print_article:
    print('TOKENIZED TO SENTENCES')
    print(combined_article)

article_processed = process_all_articles(combined_article,nlp)

# remove stopwords
article_processed = remove_stopwords(article_processed,stop_words)
print('AFTER STOPWORDS')
print(article_processed)


if do_ngrams:

    start = time.process_time()
    article_processed = make_quadgrams(article_processed,bigram_mod,trigram_mod,quadgram_mod)
    print('TIME FOR NGRAMS')
    print(time.process_time() - start)
    print('AFTER NGRAMS')

# generate bag of words of the article
bow_corpus_article = [processed_dictionary.doc2bow(text) for text in article_processed]


AFTER STOPWORDS
[['covid', 'outbreak', 'overshadows', 'vice', 'debate', 'wed', 'oct', 'edt', 'modified', 'wed', 'oct', 'edtafter', 'exclamation', 'recklessness', 'handling', 'coronavirus', 'crisis', 'vice', 'mike', 'pence', 'defend', 'televised', 'debate', 'democratic', 'vice', 'nominee', 'kamala', 'harris', 'pandemic', 'task', 'pence', 'polls', 'indicating', 'majority', 'americans', 'faith', 'ability', 'confront', 'virus', 'blame', 'mishandling', 'counter', 'pence', 'explain', 'virus', 'tear', 'republican', 'donor', 'circles', 'hospitalizing', 'exposing', 'mounting', 'secret', 'personnel', 'covid', 'pence', 'deepest', 'infection', 'zone', 'karen', 'pence', 'virus', 'perhaps', 'daunting', 'pence', 'harris', 'sharp', 'cross', 'examinations', 'settings', 'powerful', 'attorney', 'william', 'barr', 'supreme', 'brett', 'kavanaugh', 'tough', 'mike', 'pence', 'karthik', 'ganapathy', 'progressive', 'strategist', 'mvmt', 'communications', 'outbreak', 'metaphor', 'handling', 'virus', 'writ', 'mi

In [135]:
len(bow_corpus_article[0])

159

In [146]:
%%time
tfidf_vector = tfidf[bow_corpus_article[0]]
print('done')

done
CPU times: user 3.29 ms, sys: 14.2 ms, total: 17.4 ms
Wall time: 111 ms


In [147]:
def getKey(item):
    return item[1]

tfidf_vector_sort = sorted(tfidf_vector,key=getKey,reverse=True)
len(tfidf_vector_sort)

206

In [155]:
n_keywords = 10

top_tfidf_values = [tfidf_vector_sort[i][0] for i in range(0,n_keywords)]
print(top_tfidf_values)

top_words_list = [processed_dictionary[i].replace("_"," ") for i in top_tfidf_values]

top_words_string = ' '
for word in top_words_list:
    
    if word not in top_words_string:
        top_words_string += ' '+word

# top_words = [processed_dictionary[top_tfidf_values[0][0]]]


print('done')

[13993, 36232, 127767, 231756, 12068, 22705, 9138, 3804, 6095, 4073]
done


In [159]:
print(top_words_list)
# print(top_words_string)

top_words_list2 = []
for word in top_words_list:
    if " " in word:
        word_split = word.split()
        
        for new_word in word_split:
            top_words_list2.append(new_word)
            
    else:
        top_words_list2.append(word)
print(top_words_list2)

['harris', 'pence', 'mike pence', 'coronavirus', 'virus', 'shield', 'attorney', 'debate', 'senator', 'debated']
['harris', 'pence', 'mike', 'pence', 'coronavirus', 'virus', 'shield', 'attorney', 'debate', 'senator', 'debated']


In [None]:
# for word in top_words_list:
    

In [128]:
print(top_words_list)
print(top_words_string)

['pence', 'biden', 'coronavirus', 'vice', 'lady', 'spotlighted', 'factset', 'convention', 'fighting', 'quotes']
  pence biden coronavirus vice lady spotlighted factset convention fighting quotes


In [152]:
'mike pence'.split()

['mike', 'pence']