## Comparative Analysis Sentiment

This notebook runs a sentiment analysis on each news source and gathers the results into a single DataFrame. I use the NRC VAD lexicon (valence, arousal, dominance) because it gives me three different scores to compare for each source.

In [1]:
import os
import json
import random
import shutil
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
import calendar
import seaborn as sn
from collections import Counter
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.cluster import hierarchical,KMeans, linkage_tree
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
%run functions.ipynb

First I need to load all the files relevant to this analysis.

In [3]:
cd_corp = json.load(open('../data/text/china_daily/cd_corpus_index.json'))
nyt_corp = json.load(open('../data/text/nyt/nyt_corpus_index.json'))
dt_corp = json.load(open('../data/text/daily_telegraph/dt_corpus_index.json'))
g_corp = json.load(open('../data/text/guardian/guardian_corpus_index.json'))
ht_corp = json.load(open('../data/text/hindustan_times/ht_corpus_index.json'))

In [4]:
for article in cd_corp[:32]:
    filename = article['Filename']
    text = open('../data/text/china_daily/{}'.format(filename)).read()
    article['text'] = text
for article in cd_corp[32:]:
    filename = article['Filename']
    text = open('../data/text/china_daily/{}'.format(filename)).read()
    body_text_start = text.index('Body')+4
    body_text_end = text.find('Load-Date:')
    body_text=text[body_text_start:body_text_end].strip()
    article['text'] = body_text

In [5]:
for article in nyt_corp:
    filename = article['Filename']
    text = open('../data/text/nyt/{}'.format(filename)).read()
    article['text'] = text

In [6]:
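# Build a new list instead of calling remove() while iterating,
# which skips the element after each removed one
dt_corp = [article for article in dt_corp
           if not article['Filename'].startswith('wish-magazine')]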

In [7]:
for article in dt_corp:
    filename = article['Filename']
    text = open('../data/text/daily_telegraph/{}'.format(filename)).read()
    article['text'] = text

In [8]:
for article in g_corp:
    filename = article['Filename']
    text = open('../data/text/guardian/{}'.format(filename)).read()
    article['text'] = text

In [9]:
for article in ht_corp:
    filename = article['Filename']
    text = open('../data/text/hindustan_times/{}'.format(filename)).read()
    article['text'] = text
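
The loading loops for the NYT, Daily Telegraph, Guardian, and Hindustan Times corpora all follow the same pattern, so they could be folded into one helper. This is only a sketch: `load_corpus` is a hypothetical name, the UTF-8 encoding is an assumption about the text files, and it does not cover the China Daily articles that need the `Body` / `Load-Date:` trimming.

```python
import json
import os


def load_corpus(index_path, text_dir, encoding='utf-8'):
    """Load a corpus index and attach each article's raw text under 'text'.

    Hypothetical helper; the encoding default is an assumption.
    """
    # Context managers close the file handles as soon as we are done with them
    with open(index_path, encoding=encoding) as f:
        corpus = json.load(f)
    for article in corpus:
        fpath = os.path.join(text_dir, article['Filename'])
        with open(fpath, encoding=encoding) as f:
            article['text'] = f.read()
    return corpus
```

With this in place, a call such as `load_corpus('../data/text/nyt/nyt_corpus_index.json', '../data/text/nyt')` would stand in for cells 3 and 5 for that source.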

In [10]:
characters_to_remove = '!,.()[]|"'

I use a smaller unit of analysis here: only the articles that contain at least one of the origin terms.

In [11]:
origin_terms = ['laboratory','lab','bioweapon','market','military','cold-chain','conspiracy','army','detrick', 'transparency','origins','wuhan','theory','imported']
origin_txt_cd=[]
for word in origin_terms:
    for article in cd_corp:
        if article in origin_txt_cd:
            continue
        elif article['text'].count(word)>0:
            origin_txt_cd.append(article)

In [12]:
origin_txt_nyt=[]
for word in origin_terms:
    for article in nyt_corp:
        if article in origin_txt_nyt:
            continue
        elif article['text'].count(word)>0:
            origin_txt_nyt.append(article)

In [13]:
origin_txt_dt=[]
for word in origin_terms:
    for article in dt_corp:
        if article in origin_txt_dt:
            continue
        elif article['text'].count(word)>0:
            origin_txt_dt.append(article)

In [14]:
origin_txt_g=[]
for word in origin_terms:
    for article in g_corp:
        if article in origin_txt_g:
            continue
        elif article['text'].count(word)>0:
            origin_txt_g.append(article)

In [15]:
origin_txt_ht=[]
for word in origin_terms:
    for article in ht_corp:
        if article in origin_txt_ht:
            continue
        elif article['text'].count(word)>0:
            origin_txt_ht.append(article)
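
The five filtering cells above repeat the same loop, so the logic could be written once. The sketch below uses a hypothetical `filter_by_terms` helper. Like the `count` calls above, it is a substring match, so `'lab'` also matches words such as "collaborate". The `ignore_case` option is an addition of mine: the loops above match case-sensitively, so an all-lowercase term like `'wuhan'` only matches if the article text is already lowercased.

```python
def filter_by_terms(corpus, terms, ignore_case=True):
    """Return the articles whose text contains at least one term.

    Hypothetical helper. Matching is by substring, as in the loops above.
    Articles are returned in corpus order, each at most once.
    """
    selected = []
    for article in corpus:
        text = article['text'].lower() if ignore_case else article['text']
        if any(term in text for term in terms):
            selected.append(article)
    return selected
```

Calling `filter_by_terms(cd_corp, origin_terms, ignore_case=False)` should select the same articles as cell 11, though in corpus order rather than grouped by the first matching term.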

Time for some VAD analysis!

In [16]:
NRC_VAD_lexicon = open('NRC-VAD-Lexicon.txt').readlines()

In [17]:
NRC_VAD = {}
for line in NRC_VAD_lexicon[1:]:  
    word, V,A,D = line.strip().split('\t')
    NRC_VAD[word] = {'V': float(V), 
                     'A': float(A),
                     'D': float(D)}

In [18]:
def process_article(article):
    toks= tokenize(article['text'], lowercase=True, strip_chars=characters_to_remove)
    article['tokens'] = toks
    
    article['Valence']=0
    article['Dominance']=0
    article['Arousal']=0
    
    article['VAD_toks']=[]
    
    for a in toks:
        if a.lower() in NRC_VAD:
            # Copy the lexicon entry so that adding 'tok' does not mutate
            # the shared NRC_VAD dictionary
            scores = dict(NRC_VAD[a.lower()], tok=a)
            
            article['Valence']+=scores['V']
            article['Arousal']+=scores['A']
            article['Dominance']+=scores['D']
            
            article['VAD_toks'].append(scores)
    
    
    for dimension in ('Valence','Arousal','Dominance'):
        if len(article['VAD_toks'])>0:
            article[dimension] /= len(article['VAD_toks'])
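
To make the averaging concrete, here is a toy version of the same idea: look up each token in the lexicon, then average each dimension over the tokens that were found (not over all tokens). The two-word lexicon and its scores are made up for illustration and are not taken from the NRC VAD file; `mean_vad` is a hypothetical name.

```python
# Made-up scores for illustration only (not real NRC VAD values)
toy_lexicon = {
    'good': {'V': 0.9, 'A': 0.4, 'D': 0.6},
    'fear': {'V': 0.1, 'A': 0.8, 'D': 0.3},
}


def mean_vad(tokens, lexicon):
    """Average V, A and D over the tokens that appear in the lexicon."""
    hits = [lexicon[t.lower()] for t in tokens if t.lower() in lexicon]
    if not hits:
        # No lexicon words found: leave all scores at zero
        return {'V': 0.0, 'A': 0.0, 'D': 0.0}, 0
    means = {dim: sum(h[dim] for h in hits) / len(hits)
             for dim in ('V', 'A', 'D')}
    return means, len(hits)


# 'news' and 'despite' are not in the toy lexicon, so only two tokens count
means, n_hits = mean_vad(['Good', 'news', 'despite', 'fear'], toy_lexicon)
```

Because the denominator is the number of matched tokens, an article with very few lexicon hits can get extreme scores, which is worth keeping in mind when comparing sources.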

Here I am just generating V, A, and D scores for each text in each source.

In [19]:
v_cd = []
a_cd = []
d_cd = []
for article in origin_txt_cd:
    process_article(article)
    v_cd.append(article.get('Valence',''))
    a_cd.append(article.get('Arousal',''))
    d_cd.append(article.get('Dominance',''))

In [20]:
v_nyt = []
a_nyt = []
d_nyt = []
for article in origin_txt_nyt:
    process_article(article)
    v_nyt.append(article.get('Valence',''))
    a_nyt.append(article.get('Arousal',''))
    d_nyt.append(article.get('Dominance',''))

In [21]:
v_dt = []
a_dt = []
d_dt = []
for article in origin_txt_dt:
    process_article(article)
    v_dt.append(article.get('Valence',''))
    a_dt.append(article.get('Arousal',''))
    d_dt.append(article.get('Dominance',''))

In [None]:
v_g = []
a_g = []
d_g = []
for article in origin_txt_g:
    # article['text'] was already read from the same file in cell 8,
    # so the shared helper can be reused here as for the other sources
    process_article(article)
    v_g.append(article.get('Valence',''))
    a_g.append(article.get('Arousal',''))
    d_g.append(article.get('Dominance',''))

In [None]:
v_ht = []
a_ht = []
d_ht = []
for article in origin_txt_ht:
    process_article(article)
    v_ht.append(article.get('Valence',''))
    a_ht.append(article.get('Arousal',''))
    d_ht.append(article.get('Dominance',''))

In [None]:
index = ['Valence','Arousal','Dominance']
vad_data = {'China Daily':[sum(v_cd)/len(v_cd),sum(a_cd)/len(a_cd),sum(d_cd)/len(d_cd)], 
            'NY Times': [sum(v_nyt)/len(v_nyt),sum(a_nyt)/len(a_nyt),sum(d_nyt)/len(d_nyt)],
            'Daily Telegraph': [sum(v_dt)/len(v_dt),sum(a_dt)/len(a_dt),sum(d_dt)/len(d_dt)],
            'The Guardian': [sum(v_g)/len(v_g),sum(a_g)/len(a_g),sum(d_g)/len(d_g)],
            'Hindustan Times':[sum(v_ht)/len(v_ht),sum(a_ht)/len(a_ht),sum(d_ht)/len(d_ht)]}

vad_df = pd.DataFrame(data=vad_data, index=index)

In [None]:
vad_df.to_csv('../final_data_story/visualizations/vad_sentiment.csv')

I exported the results to a CSV file that I can easily load in my final data story. Interestingly, the mean scores differ less across sources than I expected. This is worth discussing in the markdown explanations of my data story.
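
One way to put the small differences in context is to look at the spread of the per-article scores alongside each mean. The sketch below is not part of the analysis above. `summarize_scores` is a hypothetical helper that takes lists like `v_cd` and `v_nyt` and uses only the standard library.

```python
from statistics import mean, stdev


def summarize_scores(scores_by_source):
    """Map each source name to the mean, sample std and count of its scores.

    Hypothetical helper; expects non-empty lists of numeric scores.
    """
    summary = {}
    for source, scores in scores_by_source.items():
        summary[source] = {
            'mean': mean(scores),
            # stdev needs at least two values, so fall back to 0.0 otherwise
            'std': stdev(scores) if len(scores) > 1 else 0.0,
            'n': len(scores),
        }
    return summary
```

For example, `summarize_scores({'China Daily': v_cd, 'NY Times': v_nyt})` would give the valence spread for those two sources, and wrapping the result in `pd.DataFrame(...)` would put it in the same layout as `vad_df`, with one column per source.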