# EDA for 'debiaser' data product
#### Sagar Setru, September 16th, 2020

## Brief description using CoNVO framework

### Context

Some people are eager to get news from outside of their echo chamber. However, they do not know where to go outside of their echo chambers, and may also have some activation energy when it comes to seeking information from other sources. In the meantime, most newsfeeds only push you content that you agree with. You end up in an echo chamber, but may not have ever wanted to be in one in the first place.

### Need

A way to find news articles from different yet reliable media sources.

### Vision

Debiaser, a data product (maybe Chrome plug-in?) that will recommend news articles similar in topic to the one currently being read, but from several pre-curated and reliable news media organizations across the political spectrum, for example, following the "media bias chart" here https://www.adfontesmedia.com/ or the "media bias ratings" here: https://www.allsides.com/media-bias/media-bias-ratings. The app will determine the main topics of the text of a news article, and then show links to similar articles from other news organizations.

Caveats: Many of these articles may be behind paywalls. News aggregators already basically do this. How different is this than just searching Google using the title of an article?

### Outcome

People who are motivated to engage in content outside of their echo chambers have a tool that enables them to quickly find news similar to what they are currently reading, but from a variety of news organizations.

# EDA

In [35]:
# make sure I'm in the right environment

print('Conda environment:')
print(os.environ['CONDA_DEFAULT_ENV'])

Conda environment:
insight


In [36]:
# import base packages

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

In [37]:
# import text processing and NLP specific packages
import gensim
from nltk import tokenize

In [38]:
# get list of manually made text files of articles

full_path = '/Users/sagarsetru/Documents/post PhD positions search/insightDataScience/project/debiaser/article_text_files/'

full_file_names = [
full_path+'ap_hurricane_sally_unleashes_20200916.txt',
full_path+'cnn_big_ten_backtracks_20200916.txt',
full_path+'nyt_on_the_fire_line_20200915.txt',
]

In [173]:
# loop through files

# get full file names
for ind, full_file_name in enumerate(full_file_names):
    
    # get the article text as one string, remove new lines
    with open(full_file_name, 'r') as file:
        article_text = file.read().replace('\n', ' ')

    # break article into sentences
    article_sentences = tokenize.sent_tokenize(article_text)
    
    if ind == 0:
        break
        


In [174]:
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]

# Train the model on the corpus.
lda = gensim.models.LdaModel(common_corpus, num_topics=10)

# article_dictionary = Dictionary(article_text)
# lda = gensim.models.LdaModel([article_text], num_topics=10)

In [185]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sagarsetru/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [186]:
# Create a set of frequent words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

# Lowercase each document, split it by white space and filter out stopwords
texts = [[word for word in document.lower().split() if word not in stop_words]
         for document in [article_text]]

# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]

In [187]:
processed_corpus

[['hurricane',
  'sally',
  'along',
  'gulf',
  'coast',
  'wang',
  'jeff',
  'pensacola,',
  'hurricane',
  'sally',
  'ashore',
  'near',
  'mph',
  'winds',
  'rain',
  'homes',
  'people',
  'water',
  'inland',
  'slow',
  'across',
  'moving',
  '3',
  'mph,',
  'storm',
  'landfall',
  'gulf',
  'alabama,',
  'areas',
  'mobile,',
  'alabama,',
  'pensacola,',
  'florida,',
  'emergency',
  'crews',
  'people',
  'flooded',
  'pensacola,',
  'family',
  'sheriff',
  'david',
  'morgan',
  'said.',
  'thousands',
  'officials',
  'family',
  'service',
  'going',
  'morgan',
  'said.',
  '“it’s',
  'going',
  'storm',
  'three',
  'bridge',
  'across',
  'pensacola',
  'sheriff',
  'crews',
  'bridge',
  'part',
  'officials',
  'runs',
  'parallel',
  'gulf',
  'coast,',
  'areas',
  'alabama',
  '2',
  'feet',
  'rain',
  'centimeters)',
  'near',
  'pensacola,',
  'nearly',
  '3',
  'feet',
  'water',
  'covered',
  'streets',
  'downtown',
  'pensacola,',
  'national',
  's

In [188]:
bow_corpus = [processed_dictionary.doc2bow(text) for text in processed_corpus]
lda_2 = gensim.models.LdaModel(corpus=bow_corpus,num_topics=5,id2word=processed_dictionary)

In [190]:
from gensim.parsing.preprocessing import preprocess_string
from gensim.parsing.preprocessing import strip_punctuation
from gensim.parsing.preprocessing import strip_numeric

# # lda.top_topics(corpus=bow_corpus,dictionary=processed_dictionary)
# for i in range(0, lda.num_topics-1):
#     current_topic = lda.print_topic(i)
#     print(current_topic)
    
lda_topics = lda_2.show_topics()

topics = []
filters = [lambda x: x.lower(), strip_punctuation, strip_numeric]

for topic in lda_topics:
#     print(topic)
    topics.append(preprocess_string(topic[1], filters))

print(topics)


[['hurricane', 'storm', 'gulf', 'pensacola', 'rain', 'said', 'water', 'people', 'said', 'sally'], ['hurricane', 'water', 'said', 'storm', 'gulf', 'people', 'rain', 'sally', 'pensacola', 'alabama'], ['rain', 'said', 'hurricane', 'water', 'storm', 'gulf', 'pensacola', 'pensacola', 'sally', 'people'], ['storm', 'said', 'hurricane', 'alabama', 'said', 'pensacola', 'rain', 'sally', 'gulf', 'like'], ['hurricane', 'water', 'said', 'storm', 'sally', 'rain', 'people', 'pensacola', 'alabama', 'gulf']]


In [164]:
# lda.show_topics()
print(current_topic)

0.022*"120" + 0.020*"52" + 0.019*"58" + 0.017*"13" + 0.017*"131" + 0.016*"130" + 0.016*"138" + 0.015*"86" + 0.014*"116" + 0.014*"89"


In [119]:
processed_dict = processed_dictionary.token2id

for key, value in processed_dict.items():
    print(key)
    print(value)
    print(' ')

2
0
 
3
1
 
about
2
 
above
3
 
across
4
 
after
5
 
against
6
 
alabama
7
 
alabama,
8
 
along
9
 
an
10
 
are
11
 
areas
12
 
as
13
 
ashore
14
 
at
15
 
be
16
 
blew
17
 
bridge
18
 
by
19
 
category
20
 
centimeters)
21
 
coast
22
 
coast,
23
 
coast.
24
 
covered
25
 
crews
26
 
dauphin
27
 
david
28
 
down
29
 
downtown
30
 
emergency
31
 
family
32
 
feet
33
 
five
34
 
flooded
35
 
florida,
36
 
forecaster
37
 
forecasters
38
 
from
39
 
georgia.
40
 
going
41
 
got
42
 
gulf
43
 
has
44
 
have
45
 
he
46
 
her
47
 
home.
48
 
homes
49
 
hotel
50
 
house
51
 
hurricane
52
 
hurricanes
53
 
inches
54
 
inland
55
 
into
56
 
is
57
 
it
58
 
jeff
59
 
just
60
 
kennon
61
 
knocked
62
 
kph).
63
 
lamar-acuff
64
 
landfall
65
 
least
66
 
like
67
 
louisiana
68
 
mayor
69
 
mississippi
70
 
mobile,
71
 
mobile.
72
 
more
73
 
morgan
74
 
moving
75
 
mph
76
 
mph,
77
 
muse
78
 
national
79
 
near
80
 
nearly
81
 
new
82
 
not
83
 
off
84
 
officials
85
 
on
86
 
one
87
 
orange
88


In [85]:
print(bow_corpus)
print(processed_corpus)

[[(0, 3), (1, 2), (2, 2), (3, 3), (4, 2), (5, 2), (6, 2), (7, 3), (8, 5), (9, 2), (10, 2), (11, 3), (12, 2), (13, 11), (14, 2), (15, 6), (16, 3), (17, 3), (18, 3), (19, 6), (20, 3), (21, 3), (22, 2), (23, 2), (24, 2), (25, 2), (26, 2), (27, 2), (28, 2), (29, 2), (30, 3), (31, 3), (32, 3), (33, 2), (34, 2), (35, 2), (36, 2), (37, 2), (38, 4), (39, 7), (40, 2), (41, 3), (42, 3), (43, 6), (44, 2), (45, 3), (46, 3), (47, 4), (48, 2), (49, 3), (50, 2), (51, 2), (52, 11), (53, 2), (54, 2), (55, 2), (56, 3), (57, 3), (58, 9), (59, 2), (60, 5), (61, 2), (62, 2), (63, 2), (64, 2), (65, 2), (66, 2), (67, 4), (68, 2), (69, 2), (70, 2), (71, 2), (72, 2), (73, 7), (74, 2), (75, 3), (76, 3), (77, 2), (78, 2), (79, 2), (80, 2), (81, 2), (82, 2), (83, 2), (84, 2), (85, 2), (86, 9), (87, 4), (88, 2), (89, 7), (90, 4), (91, 2), (92, 2), (93, 2), (94, 4), (95, 6), (96, 6), (97, 3), (98, 4), (99, 2), (100, 2), (101, 7), (102, 2), (103, 2), (104, 2), (105, 2), (106, 8), (107, 5), (108, 6), (109, 2), (110, 

In [46]:
article_dictionary = Dictionary(article_sentences)

TypeError: doc2bow expects an array of unicode tokens on input, not a single string

In [47]:
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [48]:
article_text

"Hurricane Sally unleashes flooding along the Gulf Coast By JAY REEVES, ANGIE WANG and JEFF MARTIN 6 minutes ago  PENSACOLA, Fla. (AP) — Hurricane Sally lumbered ashore near the Florida-Alabama line Wednesday with 105 mph (165) winds and rain measured in feet, not inches, swamping homes and trapping people in high water as it crept inland for what could be a long, slow and disastrous drenching across the Deep South.  Moving at an agonizing 3 mph, or about as fast as a person can walk, the storm made landfall at 4:45 a.m. close to Gulf Shores, Alabama, battering the metropolitan areas of Mobile, Alabama, and Pensacola, Florida, which have a combined population of almost 1 million.  Emergency crews plucked people from flooded homes. In Escambia County, which includes Pensacola, more than 40 were rescued, including a family of four found in a tree, Sheriff David Morgan said.  He estimated thousands more will need to flee rising waters in the coming days. County officials urged residents t