# Reuters Archives - Amazon Comprehend (NLP and Text Analytics)

**Objectives:**
1. Use Amazon Comprehend for Topic Modeling and Sentiment Analysis (getting started: https://docs.aws.amazon.com/comprehend/latest/dg/getting-started.html)
   - Sentiment Analysis: https://docs.aws.amazon.com/comprehend/latest/dg/how-sentiment.html
   - Topic Modeling: https://docs.aws.amazon.com/comprehend/latest/dg/topic-modeling.html

**Dataset:** The Reuters dataset used here, `reuters_data.csv`, was scraped from https://uk.reuters.com/news/archive/GCA-ForeignExchange on Dec 2, 2018. It contains:
- articles from 2010-05-17 to 2018-11-30
- 10,200 articles in total
- the columns `Date`, `Timestamp`, `excerpt`, `link`, `page`, `post`, `title`

In [13]:
import boto3
import botocore

In [179]:
Bucket = "capstoneproject-770851433061"
Key = "reuters_data_with_location.csv"  # name of the file in S3 to download
outPutName = "reuters_data_with_location.csv"  # local file name to save the download under
s3 = boto3.resource('s3')
try:
    s3.Bucket(Bucket).download_file(Key, outPutName)
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise

In [None]:
!python -m pip install --user nltk  # python2

# Objective 1. Topic Modeling and Sentiment Analysis on Article Excerpts

## Load Dataset

In [106]:
import pandas as pd
df = pd.read_csv('./data/reuters_data.csv')

In [22]:
df.excerpt[1]

'Sterling slumped against the dollar and the euro on Tuesday as doubts grew about whether British Prime Minister Theresa May can get a Brexit agreement through a divided parliament.'

In [10]:
#get excerpts
reuters_excerpt = df.excerpt[:]
reuters_excerpt.head()

0    The U.S. dollar gained on Tuesday after Federa...
1    Sterling slumped against the dollar and the eu...
2    Sterling gave up most of its earlier gains and...
3    The dollar tumbled from two-week highs on Wedn...
4    The pound fell towards a two-week low on Thurs...
Name: excerpt, dtype: object

In [26]:
reuters_excerpt.to_csv('./data/reuters_excerpt.csv')

In [35]:
# Write one excerpt per line; the with-block closes the file automatically
with open("reuters_excerpt.txt", "w") as my_output_file:
    for row in reuters_excerpt:
        my_output_file.write(row + '\n')
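
The topic and sentiment results are later matched back to the rows of `df` by position, so each excerpt needs to occupy exactly one line of the file that is uploaded. The sketch below shows one way to enforce that, assuming an excerpt could contain an embedded newline. The helper names `to_single_line` and `write_one_doc_per_line` and the sample strings are illustrative and are not part of the original notebook.

```python
import io

def to_single_line(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return " ".join(str(text).split())

def write_one_doc_per_line(excerpts, handle):
    """Write each excerpt on its own line and return the number of lines written."""
    count = 0
    for excerpt in excerpts:
        handle.write(to_single_line(excerpt) + "\n")
        count += 1
    return count

# Hypothetical sample; in the notebook this would be `reuters_excerpt`.
sample = ["Sterling slumped\nagainst the dollar.", "The pound fell  on Thursday."]
buffer = io.StringIO()
written = write_one_doc_per_line(sample, buffer)
lines = buffer.getvalue().splitlines()
print(written, lines)
```

Comparing the returned count with `len(df)` is a cheap check that the line numbers will line up afterwards.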

In [None]:
#import io
#with open("reuters_excerpt.txt",'r') as f:
#    text = f.read()
# process Unicode text
#with io.open("reuters_excerpt_utf.txt",'w',encoding='utf8') as f:
#    f.write(text)

## Submit to Amazon Comprehend for Topic Modeling and Sentiment
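
The notebook does not show how the jobs were submitted, and the AWS console works just as well. As a rough sketch of the programmatic route, the request for a topic modeling job could be assembled as below and passed to boto3's `start_topics_detection_job`. Every value here (bucket, role ARN, job name, number of topics) is a placeholder, and the helper names are made up for illustration.

```python
def build_topics_job_request(input_s3_uri, output_s3_uri, role_arn,
                             job_name, number_of_topics):
    """Assemble keyword arguments for Comprehend's StartTopicsDetectionJob."""
    return {
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            # One document per line, matching a file with one excerpt per row.
            "InputFormat": "ONE_DOC_PER_LINE",
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        "DataAccessRoleArn": role_arn,
        "JobName": job_name,
        "NumberOfTopics": number_of_topics,
    }

def submit_topics_job(comprehend_client, request):
    """Start the job with the given client and return its JobId."""
    response = comprehend_client.start_topics_detection_job(**request)
    return response["JobId"]

# Placeholder values only; these are not the settings used for this notebook.
request = build_topics_job_request(
    input_s3_uri="s3://example-bucket/input/reuters_excerpt.csv",
    output_s3_uri="s3://example-bucket/output/",
    role_arn="arn:aws:iam::123456789012:role/ExampleComprehendRole",
    job_name="reuters-excerpt-topics",
    number_of_topics=10,
)
# With AWS credentials configured, the job could then be started with:
#   import boto3
#   job_id = submit_topics_job(boto3.client("comprehend"), request)
```

A sentiment job takes a similar request through `start_sentiment_detection_job`, which additionally requires a `LanguageCode`.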
### Grab output files from Topic Modeling

In [97]:
topic_terms = pd.read_csv('./data/excerpt-topic-terms.csv')
doc_topics = pd.read_csv('./data/excerpt-doc-topics.csv')

In [42]:
topic_terms.head()

Unnamed: 0,topic,term,weight
0,0,investor,0.017498
1,0,national,0.000737
2,0,opec,0.000701
3,0,capítulo,0.000558
4,0,swiss,0.000889


In [99]:
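import re
# Strip the "reuters_excerpt.csv:" prefix so that docname holds only the line number
doc_topics['docname'] = doc_topics['docname'].apply(lambda x: re.sub(r'reuters_excerpt\.csv:', '', x))
doc_topics.docname = pd.to_numeric(doc_topics.docname)
# Sort by line number so the rows follow the order of the articles
doc_topics = doc_topics.sort_values('docname')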

In [112]:
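# Renumber the rows 0..n-1 after sorting; drop=True avoids keeping the old index as a column
doc_topics = doc_topics.reset_index(drop=True)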
doc_topics.topic.head()

0    12
1     2
2    19
3    19
4    19
Name: topic, dtype: int64

In [48]:
len(doc_topics.topic.unique())

20

In [178]:
doc_topics.groupby('topic')["proportion"].count()

topic
0      193
1        9
2     1754
3       66
4     1768
5       69
6       72
7        2
8        7
9     1193
10    1529
11       1
12      42
13    1761
14       2
16     425
17       2
19     896
22     373
23      38
Name: proportion, dtype: int64

In [9]:
from collections import defaultdict

In [None]:
#cluster_groups = kmeans.predict(ret2.T)
#set(cluster_groups)
#print(cluster_groups)
#print(list(zip(cluster_groups, ret2.columns)))

In [14]:
topic = topic_terms['topic']
term = topic_terms['term']
#set(list(topic))
#print(list(topic))
print(list(zip(topic, term)))

[(0, 'investor'), (0, 'national'), (0, 'opec'), (0, 'cap\xc3\xadtulo'), (0, 'swiss'), (0, 'member'), (0, 'scrap'), (0, 'revive'), (0, 'fight'), (0, 'snb'), (1, 'friday'), (1, 'wednesday'), (1, 'thursday'), (1, 'tuesday'), (1, 'monday'), (1, 'saturday'), (1, 'sunday'), (1, 'strategist'), (1, 'road'), (1, 'prove'), (2, 'euro'), (2, 'zone'), (2, 'debt'), (2, 'crisis'), (2, 'government'), (2, 'bond'), (2, 'greece'), (2, 'investor'), (2, 'market'), (2, 'european'), (3, 'bank'), (3, 'union'), (3, 'share'), (3, "bank's"), (3, 'company'), (3, 'opec'), (3, 'group'), (3, 'producer'), (3, 'banker'), (3, 'glut'), (4, 'yen'), (4, 'dollar'), (4, 'bank'), (4, 'monetary'), (4, 'japanese'), (4, 'japan'), (4, 'policy'), (4, 'minister'), (4, 'currency'), (4, 'low'), (5, 'u.s'), (5, 'britain'), (5, 'uk'), (5, 'union'), (5, 'export'), (5, 'mine'), (5, 'russian'), (5, 'brexit'), (5, 'energy'), (5, 'metal'), (6, 'dollar'), (6, 'ftse'), (6, 'share'), (6, 'brent'), (6, 'weight'), (6, 'bitcoin'), (6, 'barrel'),

In [11]:
topic_terms.iloc[:,[0,1]].head() #['topic','term']

Unnamed: 0,topic,term
0,0,investor
1,0,national
2,0,opec
3,0,capítulo
4,0,swiss


In [None]:
#similar_by_cluster = defaultdict(list)
#for a, b in zip(cluster_groups, ret2.columns):
#       similar_by_cluster[a].append(b)

In [12]:
# Map each topic number to the list of its terms
similar_by_cluster = defaultdict(list)
for a, b in zip(topic, term):
    similar_by_cluster[a].append(b)
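
The same topic-to-terms mapping can also be built directly in pandas, which makes it easy to order each topic's terms by weight. This is only a sketch on a made-up miniature of `excerpt-topic-terms.csv`: the topic 0 weights are copied from the output above, while the topic 1 weights are invented for illustration.

```python
import pandas as pd

# Hypothetical rows in the shape of excerpt-topic-terms.csv.
topic_terms_demo = pd.DataFrame({
    "topic": [0, 0, 0, 1, 1],
    "term": ["national", "investor", "swiss", "friday", "monday"],
    "weight": [0.000737, 0.017498, 0.000889, 0.5, 0.25],
})

# Sort by topic, then by descending weight, and collect the terms of each topic.
terms_by_topic = (
    topic_terms_demo.sort_values(["topic", "weight"], ascending=[True, False])
    .groupby("topic")["term"]
    .apply(list)
    .to_dict()
)
print(terms_by_topic)
```

The result is a plain `dict`, so looking up a topic number that does not occur raises a `KeyError` instead of silently returning an empty list as a `defaultdict` would.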

In [15]:
similar_by_cluster

defaultdict(list,
            {0: ['investor',
              'national',
              'opec',
              'cap\xc3\xadtulo',
              'swiss',
              'member',
              'scrap',
              'revive',
              'fight',
              'snb'],
             1: ['friday',
              'wednesday',
              'thursday',
              'tuesday',
              'monday',
              'saturday',
              'sunday',
              'strategist',
              'road',
              'prove'],
             2: ['euro',
              'zone',
              'debt',
              'crisis',
              'government',
              'bond',
              'greece',
              'investor',
              'market',
              'european'],
             3: ['bank',
              'union',
              'share',
              "bank's",
              'company',
              'opec',
              'group',
              'producer',
              'banker',
              'glut']

In [113]:
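# Assignment aligns on the index, so this relies on doc_topics having been sorted by docname and re-indexed above
df['topic'] = doc_topics['topic']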
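
Positional assignment is only correct when the topic output has exactly one row per article. The per-topic counts shown earlier add up to 10,202 rows, slightly more than the 10,200 articles, which may mean that a few documents appear more than once. A more defensive variant, sketched here on made-up data, keeps the highest-proportion topic for each `docname` and attaches it by line number. The frame names ending in `_demo` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the doc-topics output: document 1 has two topics.
doc_topics_demo = pd.DataFrame({
    "docname": [0, 1, 1, 2],
    "topic": [12, 2, 19, 19],
    "proportion": [1.0, 0.7, 0.3, 1.0],
})
articles_demo = pd.DataFrame({"excerpt": ["a", "b", "c"]})

# Keep the highest-proportion topic for each document ...
dominant = (
    doc_topics_demo.sort_values("proportion", ascending=False)
    .drop_duplicates("docname")
    .set_index("docname")["topic"]
)
# ... and attach it by line number rather than by row position.
articles_demo["topic"] = dominant.reindex(articles_demo.index)
print(articles_demo)
```

This assumes the numeric `docname` equals the row position of the article in `df`, which holds when the uploaded file has one excerpt per line and no header line.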

In [114]:
df.head()

Unnamed: 0,Date,Timestamp,excerpt,link,page,post,title,topic
0,2018-11-27,02:19:00,The U.S. dollar gained on Tuesday after Federa...,https://uk.reuters.com/article/uk-global-forex...,1,NEW YORK (Reuters) - The U.S. dollar gained on...,Dollar gains as Fed's Clarida backs further ra...,12
1,2018-11-27,10:30:00,Sterling slumped against the dollar and the eu...,https://uk.reuters.com/article/uk-britain-ster...,1,LONDON (Reuters) - Sterling slumped against th...,Sterling slides with UK Brexit vote in doubt,2
2,2018-11-28,09:25:00,Sterling gave up most of its earlier gains and...,https://uk.reuters.com/article/uk-britain-ster...,1,LONDON (Reuters) - Sterling gave up most of it...,Sterling erases earlier gains after central ba...,19
3,2018-11-28,01:50:00,The dollar tumbled from two-week highs on Wedn...,https://uk.reuters.com/article/uk-global-forex...,1,NEW YORK (Reuters) - The dollar tumbled from t...,Dollar drops as Fed's Powell says rates near n...,19
4,2018-11-29,09:52:00,The pound fell towards a two-week low on Thurs...,https://uk.reuters.com/article/uk-britain-ster...,1,LONDON (Reuters) - The pound fell towards a tw...,Sterling heads towards two-week lows as Brexit...,19


In [181]:
# No article is assigned topic 20, so this selection is empty
df[df.topic == 20]

Unnamed: 0,Date,Timestamp,excerpt,link,page,post,title,topic,term


In [48]:
#df.topic
#df = df.drop('term', 1)

In [61]:
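# Scratch check of list.append (named `scratch` to avoid shadowing the built-in `list`)
scratch = []
scratch.append("d")
print(scratch)
scratch[0]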

['d']


'd'

In [115]:
# Look up the term list of each article's topic and store it as a string
# (named `terms` to avoid shadowing the built-in `list`)
terms = []
for i in df.topic:
    terms.append(str(similar_by_cluster[i]))

In [116]:
df["term"] = terms

In [117]:
df.head()

Unnamed: 0,Date,Timestamp,excerpt,link,page,post,title,topic,term
0,2018-11-27,02:19:00,The U.S. dollar gained on Tuesday after Federa...,https://uk.reuters.com/article/uk-global-forex...,1,NEW YORK (Reuters) - The U.S. dollar gained on...,Dollar gains as Fed's Clarida backs further ra...,12,"['european', 'index', 'top', 'janet', 'chair',..."
1,2018-11-27,10:30:00,Sterling slumped against the dollar and the eu...,https://uk.reuters.com/article/uk-britain-ster...,1,LONDON (Reuters) - Sterling slumped against th...,Sterling slides with UK Brexit vote in doubt,2,"['euro', 'zone', 'debt', 'crisis', 'government..."
2,2018-11-28,09:25:00,Sterling gave up most of its earlier gains and...,https://uk.reuters.com/article/uk-britain-ster...,1,LONDON (Reuters) - Sterling gave up most of it...,Sterling erases earlier gains after central ba...,19,"['bank', 'rate', 'interest', 'sterling', 'engl..."
3,2018-11-28,01:50:00,The dollar tumbled from two-week highs on Wedn...,https://uk.reuters.com/article/uk-global-forex...,1,NEW YORK (Reuters) - The dollar tumbled from t...,Dollar drops as Fed's Powell says rates near n...,19,"['bank', 'rate', 'interest', 'sterling', 'engl..."
4,2018-11-29,09:52:00,The pound fell towards a two-week low on Thurs...,https://uk.reuters.com/article/uk-britain-ster...,1,LONDON (Reuters) - The pound fell towards a tw...,Sterling heads towards two-week lows as Brexit...,19,"['bank', 'rate', 'interest', 'sterling', 'engl..."


In [118]:
df.to_csv('./data/reuters_topic_modeling.csv', index=False)
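
Because each term list was converted with `str(...)` before saving, the `term` column comes back from the CSV as text such as `"['euro', 'zone', ...]"`. If the lists are needed again later, `ast.literal_eval` from the standard library can parse them safely. A small sketch, using a shortened example string:

```python
import ast

# What str(list) writes into the CSV cell (shortened example).
stored = "['euro', 'zone', 'debt', 'crisis']"

# Parse the text back into a real Python list.
terms = ast.literal_eval(stored)

# Alternative flat representation that needs no parsing when read back.
joined = ", ".join(terms)
print(terms, joined)
```

With pandas, the same conversion could be applied to the whole column via `df['term'].apply(ast.literal_eval)` after reading the file.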

### Grab output files from Sentiment Analysis

In [121]:
# Read the sentiment output as comma-separated text (note: this reuses the name topic_terms)
topic_terms = pd.read_csv('./data/excerpt-sentiment.txt', header=None)

In [123]:
topic_terms.columns = ['output','line','sentiment','mixed','negative','neutral','positive']

In [166]:
# Drop the bookkeeping columns, which are not needed downstream
topic_terms = topic_terms.drop(['output', 'line'], axis=1)

In [164]:
# Drop the trailing rows beyond the 10,200 articles
topic_terms = topic_terms.drop(range(10200, 10203), axis=0)

In [170]:
import re

# Strip the JSON key prefixes so that only the values remain in each column
topic_terms['sentiment'] = topic_terms['sentiment'].apply(lambda x: re.sub('"Sentiment":', '', x))
topic_terms['mixed'] = topic_terms['mixed'].apply(lambda x: re.sub(r'"SentimentScore": \{"Mixed":', '', x))
topic_terms['negative'] = topic_terms['negative'].apply(lambda x: re.sub('"Negative":', '', x))
topic_terms['neutral'] = topic_terms['neutral'].apply(lambda x: re.sub('"Neutral":', '', x))
topic_terms['positive'] = topic_terms['positive'].apply(lambda x: re.sub('"Positive":', '', x))

In [172]:
topic_terms['sentiment'] = topic_terms['sentiment'].apply(lambda x: re.sub('"', '', x))
topic_terms['positive'] = topic_terms['positive'].apply(lambda x: re.sub('}}', '', x))
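
The patterns stripped above suggest that each line of the sentiment output is a JSON object, in which case it can be parsed with the standard `json` module instead of being split on commas and cleaned with regular expressions. The sketch below assumes one JSON object per line with `Sentiment` and `SentimentScore` keys (as the patterns indicate) plus a `Line` key; the sample line and its rounded scores are illustrative, not copied from the real file.

```python
import json

def parse_sentiment_line(line):
    """Turn one JSON line of sentiment output into a flat dict with numeric scores."""
    record = json.loads(line)
    scores = record["SentimentScore"]
    return {
        "line": record.get("Line"),  # assumed key; None if it is absent
        "sentiment": record["Sentiment"],
        "mixed": scores["Mixed"],
        "negative": scores["Negative"],
        "neutral": scores["Neutral"],
        "positive": scores["Positive"],
    }

# Illustrative line in the assumed format.
sample_line = (
    '{"File": "reuters_excerpt.csv", "Line": 0, "Sentiment": "NEUTRAL", '
    '"SentimentScore": {"Mixed": 0.0035, "Negative": 0.0067, '
    '"Neutral": 0.988, "Positive": 0.0017}}'
)
row = parse_sentiment_line(sample_line)
print(row)
```

A list of such dicts can be passed straight to `pd.DataFrame(...)`, and the scores arrive as floats rather than strings.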

In [151]:
range(10200,10202)

[10200, 10201]

In [182]:
topic_terms.groupby('sentiment').count()

sentiment,mixed,negative,neutral,positive
MIXED,8,8,8,8
NEGATIVE,704,704,704,704
NEUTRAL,9436,9436,9436,9436
POSITIVE,52,52,52,52
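
To put these counts in proportion, the share of each label can be worked out directly. The counts below are copied from the table above.

```python
# Label counts taken from the groupby output above.
counts = {"MIXED": 8, "NEGATIVE": 704, "NEUTRAL": 9436, "POSITIVE": 52}

total = sum(counts.values())
# Percentage of all excerpts carrying each label, rounded to two decimals.
shares = {label: round(100.0 * n / total, 2) for label, n in counts.items()}
print(total, shares)
```

The total matches the 10,200 articles in the dataset, and roughly 92.5% of the excerpts are labelled NEUTRAL.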


In [173]:
topic_terms.head(20)

Unnamed: 0,sentiment,mixed,negative,neutral,positive
0,NEUTRAL,0.00354626448825,0.0067107952199876,0.9880415201187134,0.0017014214536175
1,NEUTRAL,0.0081893661990761,0.3202889561653137,0.666081964969635,0.00543964933604
2,NEUTRAL,0.0080856867134571,0.0642052516341209,0.921276032924652,0.0064329295419156
3,NEUTRAL,0.0006543741328641,0.0048144487664103,0.9928760528564452,0.0016551942098885
4,NEUTRAL,0.0038918505888432,0.0506839826703071,0.942244291305542,0.0031799226999282
5,NEUTRAL,0.0042532058432698,0.1269941627979278,0.8649519681930542,0.0038006799295544
6,NEUTRAL,0.0039678472094237,0.030410561710596,0.9502358436584472,0.0153857255354523
7,NEUTRAL,0.0002372775634285,0.0004061247454956,0.9981526732444764,0.0012039742432534
8,NEUTRAL,0.0005289752734825,0.0032367452513426,0.9956300258636476,0.0006042591412551
9,NEUTRAL,0.0053769932128489,0.0210483744740486,0.9595285058021544,0.01404610555619


In [175]:
test3 = pd.concat([df, topic_terms], axis=1)
test3.head()

Unnamed: 0,Date,Timestamp,excerpt,link,page,post,title,topic,term,sentiment,mixed,negative,neutral,positive
0,2018-11-27,02:19:00,The U.S. dollar gained on Tuesday after Federa...,https://uk.reuters.com/article/uk-global-forex...,1,NEW YORK (Reuters) - The U.S. dollar gained on...,Dollar gains as Fed's Clarida backs further ra...,12,"['european', 'index', 'top', 'janet', 'chair',...",NEUTRAL,0.00354626448825,0.0067107952199876,0.9880415201187134,0.0017014214536175
1,2018-11-27,10:30:00,Sterling slumped against the dollar and the eu...,https://uk.reuters.com/article/uk-britain-ster...,1,LONDON (Reuters) - Sterling slumped against th...,Sterling slides with UK Brexit vote in doubt,2,"['euro', 'zone', 'debt', 'crisis', 'government...",NEUTRAL,0.0081893661990761,0.3202889561653137,0.666081964969635,0.00543964933604
2,2018-11-28,09:25:00,Sterling gave up most of its earlier gains and...,https://uk.reuters.com/article/uk-britain-ster...,1,LONDON (Reuters) - Sterling gave up most of it...,Sterling erases earlier gains after central ba...,19,"['bank', 'rate', 'interest', 'sterling', 'engl...",NEUTRAL,0.0080856867134571,0.0642052516341209,0.921276032924652,0.0064329295419156
3,2018-11-28,01:50:00,The dollar tumbled from two-week highs on Wedn...,https://uk.reuters.com/article/uk-global-forex...,1,NEW YORK (Reuters) - The dollar tumbled from t...,Dollar drops as Fed's Powell says rates near n...,19,"['bank', 'rate', 'interest', 'sterling', 'engl...",NEUTRAL,0.0006543741328641,0.0048144487664103,0.9928760528564452,0.0016551942098885
4,2018-11-29,09:52:00,The pound fell towards a two-week low on Thurs...,https://uk.reuters.com/article/uk-britain-ster...,1,LONDON (Reuters) - The pound fell towards a tw...,Sterling heads towards two-week lows as Brexit...,19,"['bank', 'rate', 'interest', 'sterling', 'engl...",NEUTRAL,0.0038918505888432,0.0506839826703071,0.942244291305542,0.0031799226999282
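
`pd.concat(..., axis=1)` aligns the two frames on their index, so a sentiment table with more rows than `df` would add rows whose article fields are missing. A toy illustration of that behaviour, with made-up frames:

```python
import pandas as pd

# Two articles but three sentiment rows (one stray trailing row).
articles = pd.DataFrame({"excerpt": ["a", "b"]})
sentiment = pd.DataFrame({"sentiment": ["NEUTRAL", "NEGATIVE", "NEUTRAL"]})

combined = pd.concat([articles, sentiment], axis=1)
missing_excerpts = int(combined["excerpt"].isna().sum())
print(combined)
print(missing_excerpts)
```

Checking `len(df) == len(topic_terms)` before concatenating is a simple guard against this.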


In [176]:
test3.to_csv('./data/reuters_excerpt_NLP_full.csv', index=False)

### Add topic modeling of the full posts

In [183]:
df2 = pd.read_csv('./data/reuters_post_topic_modeling.csv')

In [184]:
df2.head()

Unnamed: 0,Date,Timestamp,excerpt,link,page,post,title,post.topic,post.term
0,2018-11-27,02:19:00,The U.S. dollar gained on Tuesday after Federa...,https://uk.reuters.com/article/uk-global-forex...,1,NEW YORK (Reuters) - The U.S. dollar gained on...,Dollar gains as Fed's Clarida backs further ra...,27,"['dollar', 'u.s', 'rate', 'federal', 'reservar..."
1,2018-11-27,10:30:00,Sterling slumped against the dollar and the eu...,https://uk.reuters.com/article/uk-britain-ster...,1,LONDON (Reuters) - Sterling slumped against th...,Sterling slides with UK Brexit vote in doubt,19,"['bank', 'rate', 'sterling', 'england', 'inter..."
2,2018-11-28,09:25:00,Sterling gave up most of its earlier gains and...,https://uk.reuters.com/article/uk-britain-ster...,1,LONDON (Reuters) - Sterling gave up most of it...,Sterling erases earlier gains after central ba...,4,"['euro', 'zone', 'debt', 'crisis', 'european',..."
3,2018-11-28,01:50:00,The dollar tumbled from two-week highs on Wedn...,https://uk.reuters.com/article/uk-global-forex...,1,NEW YORK (Reuters) - The dollar tumbled from t...,Dollar drops as Fed's Powell says rates near n...,27,"['dollar', 'u.s', 'rate', 'federal', 'reservar..."
4,2018-11-29,09:52:00,The pound fell towards a two-week low on Thurs...,https://uk.reuters.com/article/uk-britain-ster...,1,LONDON (Reuters) - The pound fell towards a tw...,Sterling heads towards two-week lows as Brexit...,19,"['bank', 'rate', 'sterling', 'england', 'inter..."


In [185]:
test3['post.topic'] = df2['post.topic']
test3['post.term'] = df2['post.term']
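
Copying the columns across relies on `test3` and `df2` having the same rows in the same order. If that is ever in doubt, a keyed merge is a safer alternative. The sketch below assumes that `link` uniquely identifies an article, which has not been verified for this dataset; the frames and values are made up.

```python
import pandas as pd

# Hypothetical miniatures: the right-hand frame is in a different row order.
left = pd.DataFrame({"link": ["u1", "u2", "u3"], "topic": [12, 2, 19]})
right = pd.DataFrame({"link": ["u3", "u1", "u2"], "post.topic": [4, 27, 19]})

# validate="one_to_one" raises an error if `link` is duplicated on either side.
merged = left.merge(
    right[["link", "post.topic"]],
    on="link",
    how="left",
    validate="one_to_one",
)
print(merged)
```

If the merge raises because of duplicate links, positional assignment as used above remains an option, provided both files were written from the same unsorted frame.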

In [186]:
test3.head()

Unnamed: 0,Date,Timestamp,excerpt,link,page,post,title,topic,term,sentiment,mixed,negative,neutral,positive,post.topic,post.term
0,2018-11-27,02:19:00,The U.S. dollar gained on Tuesday after Federa...,https://uk.reuters.com/article/uk-global-forex...,1,NEW YORK (Reuters) - The U.S. dollar gained on...,Dollar gains as Fed's Clarida backs further ra...,12,"['european', 'index', 'top', 'janet', 'chair',...",NEUTRAL,0.00354626448825,0.0067107952199876,0.9880415201187134,0.0017014214536175,27,"['dollar', 'u.s', 'rate', 'federal', 'reservar..."
1,2018-11-27,10:30:00,Sterling slumped against the dollar and the eu...,https://uk.reuters.com/article/uk-britain-ster...,1,LONDON (Reuters) - Sterling slumped against th...,Sterling slides with UK Brexit vote in doubt,2,"['euro', 'zone', 'debt', 'crisis', 'government...",NEUTRAL,0.0081893661990761,0.3202889561653137,0.666081964969635,0.00543964933604,19,"['bank', 'rate', 'sterling', 'england', 'inter..."
2,2018-11-28,09:25:00,Sterling gave up most of its earlier gains and...,https://uk.reuters.com/article/uk-britain-ster...,1,LONDON (Reuters) - Sterling gave up most of it...,Sterling erases earlier gains after central ba...,19,"['bank', 'rate', 'interest', 'sterling', 'engl...",NEUTRAL,0.0080856867134571,0.0642052516341209,0.921276032924652,0.0064329295419156,4,"['euro', 'zone', 'debt', 'crisis', 'european',..."
3,2018-11-28,01:50:00,The dollar tumbled from two-week highs on Wedn...,https://uk.reuters.com/article/uk-global-forex...,1,NEW YORK (Reuters) - The dollar tumbled from t...,Dollar drops as Fed's Powell says rates near n...,19,"['bank', 'rate', 'interest', 'sterling', 'engl...",NEUTRAL,0.0006543741328641,0.0048144487664103,0.9928760528564452,0.0016551942098885,27,"['dollar', 'u.s', 'rate', 'federal', 'reservar..."
4,2018-11-29,09:52:00,The pound fell towards a two-week low on Thurs...,https://uk.reuters.com/article/uk-britain-ster...,1,LONDON (Reuters) - The pound fell towards a tw...,Sterling heads towards two-week lows as Brexit...,19,"['bank', 'rate', 'interest', 'sterling', 'engl...",NEUTRAL,0.0038918505888432,0.0506839826703071,0.942244291305542,0.0031799226999282,19,"['bank', 'rate', 'sterling', 'england', 'inter..."


In [187]:
test3.to_csv('./data/reuters_excerpt_post_NLP_full.csv', index=False)