Name: Richard

I: Implement an exploratory technique:

 

If your data has some structure, for example it is in categories or has metadata (such as review ratings) attached, you may want to implement a supervised machine learning technique to identify the most distinctive words or features for each of your categories.

If you have shorter, more focused documents, or you only have a few documents (1-5), you might want to do vector space exploration, such as MDS or a dendrogram using a distance measure, or a clustering technique.

If you have longer, more complex documents and you have many of them (10 or more), you might want to implement a topic modeling algorithm.

Or, you can focus on the word level by using word weighting, frequent words, or word distinctiveness scores (such as chi-squared).

If you have specific words – or groups of words – you want to explore, you might choose to implement a word embeddings model.

 

II: Discuss the results. Did you find any themes or clusters that were obvious? Were there any themes or clusters that were not obvious? Were there any that did not make sense? Did this exercise generate any hypotheses about your corpus, or themes you might want to explore further? Why or why not?



### Part 1
- Tokenize
- Get Most Frequent Words
- Cluster, MDS, Dendrogram
- Word embedding

### Part 2
- Interpret 

In [46]:
from re import sub
import pandas as pd
import numpy as np
from nltk.corpus import stopwords

stopwords = set(stopwords.words('english'))

DATATYPES = {'Dow Jones Value': np.float32,
            'Dow Jones Delta': np.float32,
            'Dow Jones Proportion': np.float32,
            'Nasdaq Value': np.float32,
            'Nasdaq Delta': np.float32,
            'Nasdaq Proportion': np.float32,
            'S&P 500 Value': np.float32,
            'S&P 500 Delta': np.float32,
            'S&P 500 Proportion': np.float32 
            }

In [47]:
def join_blog_posts_on_date(df):
    posts = {}
    titles = {}
    for index,row in df.iterrows():
        date = row['Date']
        if date not in posts:
            posts[date] = []
            titles[date] = []
        posts[date].append(row['Body'])
        titles[date].append(row['Title'])

    posts = {date:' '.join(posts[date]) for date in posts}
    titles = {date: ' '.join(titles[date]) for date in titles}

    posts = pd.DataFrame(list(posts.items()))
    posts.columns = ['Date', 'Body']

    titles = pd.DataFrame(list(titles.items()))
    titles.columns = ['Date', 'Title']

    dj_deltas = df[['Date', 'Dow Jones Delta']].drop_duplicates()
    nd_deltas = df[['Date', 'Nasdaq Delta']].drop_duplicates()
    sp_deltas = df[['Date', 'S&P 500 Delta']].drop_duplicates()

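    # The assignments below align on the row index, so each ratio lands on its
    # matching row. De-duplicating the ratios by value could drop a date whose
    # ratio happens to equal an earlier one, leaving NaN after alignment.
    dj_delta_prop = df['Dow Jones Delta'] / df['Dow Jones Value']
    nd_delta_prop = df['Nasdaq Delta'] / df['Nasdaq Value']
    sp_delta_prop = df['S&P 500 Delta'] / df['S&P 500 Value']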

    dj_deltas['Dow Jones Proportion'] = dj_delta_prop
    nd_deltas['Nasdaq Proportion'] = nd_delta_prop
    sp_deltas['S&P 500 Proportion'] = sp_delta_prop
    
    dataset = pd.merge(posts, titles, how='inner', on=['Date'])
    dataset = pd.merge(dataset, dj_deltas, how='inner', on=['Date'])
    dataset = pd.merge(dataset, nd_deltas, how='inner', on=['Date'])
    dataset = pd.merge(dataset, sp_deltas, how='inner', on=['Date'])

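    # Keep letters and spaces only (this also removes digits).
    # regex=True is needed on pandas >= 2.0, where str.replace is literal by default.
    dataset['Body'] = dataset['Body'].str.replace(r'[^a-zA-Z ]', '', regex=True)
    dataset['Title'] = dataset['Title'].str.replace(r'[^a-zA-Z ]', '', regex=True)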
    dataset['Mean Delta'] = (dataset['Dow Jones Delta'] + dataset['Nasdaq Delta'] + dataset['S&P 500 Delta']) / 3
    dataset['Mean Proportion'] = (dataset['Dow Jones Proportion'] + 
                                  dataset['Nasdaq Proportion'] + 
                                  dataset['S&P 500 Proportion']) / 3
    dataset['Label'] = (dataset['Mean Proportion'] >= 0).apply(int)
    dataset['Label'] = 2 * dataset['Label'] - 1
    return dataset


def count_words(text):
    text = sub('[^a-zA-Z ]', '', text)
    counts = {}
    for word in text.lower().split():
        if word in stopwords:
            continue
        if word not in counts:
            counts[word] = 0
        counts[word] += 1
    return counts
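
The counting step can also be written with `collections.Counter`. The following is a minimal sketch, not the code used for the results in this notebook. The names `count_words_counter` and `TOY_STOPWORDS` are made up for illustration, and the tiny hand-written stop list is an assumption so that the example runs without the NLTK stop word data.

```python
import re
from collections import Counter

# Assumption: a tiny hand-written stop list, standing in for the NLTK English list.
TOY_STOPWORDS = {'the', 'a', 'is', 'to', 'and', 'of'}


def count_words_counter(text, stop_words=TOY_STOPWORDS):
    """Count lowercase alphabetic tokens, skipping stop words."""
    # Same cleaning rule as count_words: delete everything except letters and spaces.
    cleaned = re.sub('[^a-zA-Z ]', '', text)
    tokens = (word for word in cleaned.lower().split() if word not in stop_words)
    return Counter(tokens)


sample_counts = count_words_counter("The President said the economy is growing. Growing jobs!")
top_two = sample_counts.most_common(2)  # most frequent tokens first
```

`Counter.most_common` returns the frequency ranking directly, so no separate sort is needed. Because the cleaning rule deletes punctuation rather than replacing it with a space, contractions collapse into tokens such as `thats` and `im`.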

### Explanation of Data Schema

In [26]:
df = pd.read_csv('data/dataset.csv', dtype=DATATYPES).dropna()
dataset = join_blog_posts_on_date(df)
dataset

Unnamed: 0,Date,Body,Title,Dow Jones Delta,Dow Jones Proportion,Nasdaq Delta,Nasdaq Proportion,S&P 500 Delta,S&P 500 Proportion,Mean Delta,Mean Proportion,Label
0,2017-03-03,Its been just six weeks since President Donald...,President Donald J Trump Delivers the Weekly A...,-51.371094,-0.002446,-21.569824,-0.003674,-7.810058,-0.003277,-26.916992,-0.003132,-1
1,2017-03-29,NOTICE CONTINUATION OF THE NATIONAL EMERG...,Notice Regarding the Continuation of the Natio...,69.169922,0.003348,16.790039,0.002847,6.930176,0.002935,30.963379,0.003043,1
2,2017-04-06,NOTICE CONTINUATION OF THE NATIONAL EMERG...,Notice Regarding the Continuation of the Natio...,-6.84961,-0.000331,-1.140136,-0.000194,-1.949951,-0.000827,-3.313232,-0.000451,-1
3,2017-03-08,Today President Donald J Trump met with Congre...,Readout of President Donald J Trumps Meeting w...,2.458984,0.000118,1.260254,0.000216,1.890137,0.0008,1.869792,0.000378,1
4,2017-02-17,The President today declared a major disaster ...,President Trump Approves Nevada Disaster Decla...,118.949219,0.005768,27.370117,0.004688,14.219971,0.006048,53.513103,0.005501,1
5,2017-04-04,NOMINATIONS SENT TO THE SENATESigal Mandelker ...,Two Nominations Delivered to the Senate Today ...,-41.089844,-0.001986,-34.129883,-0.005786,-7.209961,-0.003055,-27.476562,-0.003609,-1
6,2017-03-15,The National Building MuseumWashington DC PM E...,Remarks by the Vice President to the American ...,-15.548828,-0.000742,0.709961,0.00012,-3.880127,-0.001627,-6.239665,-0.00075,-1
7,2017-02-22,Fabick Cat FactoryFenton Missouri PM ESTTHE VI...,Remarks by the Vice President to Fabick Cat Em...,34.720703,0.001671,-25.120117,-0.004286,0.989991,0.000419,3.530192,-0.000732,-1
8,2017-02-16,President Donald J Trump spoke with President ...,Readout of the Presidents Call with President ...,4.28125,0.000208,23.680176,0.004072,3.939941,0.001679,10.633789,0.001986,1
9,2017-04-05,President Donald J Trump spoke today with Prim...,Readout of President Donald J Trumps Call with...,14.798828,0.000717,14.470215,0.002467,4.540039,0.00193,11.269694,0.001705,1
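
The labeling rule behind the `Mean Proportion` and `Label` columns can be checked by hand. The sketch below is a plain-Python restatement of that rule, not the pandas code itself. The helper name `label_day` is made up for illustration, and the input values are copied from row 7 of the table above (2017-02-22).

```python
def label_day(dow_prop, nasdaq_prop, sp_prop):
    """Average the three proportional changes and map the sign to +1 / -1."""
    mean_prop = (dow_prop + nasdaq_prop + sp_prop) / 3
    # A mean of exactly zero counts as positive, matching the >= 0 test.
    label = 1 if mean_prop >= 0 else -1
    return mean_prop, label


# Proportions for 2017-02-22 (row 7): Dow Jones, Nasdaq, S&P 500.
mean_prop, label = label_day(0.001671, -0.004286, 0.000419)
```

Row 7 is a useful case because its `Mean Delta` is positive (3.530192) while its `Mean Proportion` is negative, so the day is labeled -1. The label follows the proportional change, not the raw point change.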


In [50]:
# Append a trailing space so texts stay separated when they are concatenated per label.
dataset['Title'] = dataset['Title'] + ' '
dataset['Body'] = dataset['Body'] + ' '

In [63]:
grouped = dataset.groupby('Label')
joined_body = grouped['Body'].sum()
negative_body = joined_body[-1]
positive_body = joined_body[1]

positive_body_words = count_words(positive_body)
negative_body_words = count_words(negative_body)

In [65]:
positive_words = pd.DataFrame(list(positive_body_words.items()), 
                              columns=['Word', 'Positive']).sort_values(by='Positive',
                                                                     ascending=False).reset_index()
negative_words = pd.DataFrame(list(negative_body_words.items()), 
                              columns=['Word', 'Negative']).sort_values(by='Negative',
                                                                     ascending=False).reset_index()

In [70]:
word_df = pd.merge(positive_words, negative_words, how='inner', on=['Word'])
del word_df['index_x']
del word_df['index_y']
word_df['diff'] = word_df['Positive'] - word_df['Negative']
word_df

Unnamed: 0,Word,Positive,Negative,diff
0,president,2628,2501,127
1,think,1532,1620,-88
2,going,1274,1425,-151
3,spicer,1044,1252,-208
4,people,980,982,-2
5,thats,804,836,-32
6,know,724,721,3
7,im,705,741,-36
8,one,701,773,-72
9,trump,684,505,179
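
The raw `diff` column depends on how much text each label contains, so a larger class can dominate it. One size-aware alternative is a per-word chi-squared score, as mentioned in the assignment prompt. The following is a minimal sketch under stated assumptions: the helper `signed_chi2` is made up for illustration, the counts are toy values and not taken from this corpus, and the sign is added only to show which class a word leans toward.

```python
def signed_chi2(pos_count, neg_count, pos_total, neg_total):
    """2x2 chi-squared statistic for one word, signed by direction.

    Positive values mean the word is relatively more frequent in the
    positive class, negative values the opposite.
    """
    a, b = pos_count, neg_count                  # occurrences of this word
    c, d = pos_total - a, neg_total - b          # occurrences of all other words
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    return chi2 if a * d >= b * c else -chi2


# Toy counts (hypothetical, for illustration only).
positive_counts = {'jobs': 30, 'senate': 10, 'people': 20}
negative_counts = {'jobs': 10, 'senate': 30, 'people': 20}
pos_total = sum(positive_counts.values())
neg_total = sum(negative_counts.values())

# Like the inner merge above, score only words that occur in both classes.
scores = {
    word: signed_chi2(positive_counts[word], negative_counts[word], pos_total, neg_total)
    for word in positive_counts
    if word in negative_counts
}
```

A word used at the same rate in both classes scores 0, however frequent it is. This separates distinctive words from merely common ones such as `people` in the table above.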
