In [None]:
'''
    Try out several different machine learning approaches to sentiment analysis.
    
    1. To start with, use the simplest approach possible: count the negative and positive words to come up
        with an overall positive or negative sentiment.
'''

In [11]:
'''
    This cell contains some preprocessing steps for the "trdata" dataframe.
'''

import pickle as pkl
import pandas as pd
import numpy as np

#first, open up the pickle to begin analysis
with open('./pandas_trdata/trdata.pickle','rb') as f:
    trdata = pkl.load(f)
print(trdata.shape)
print(trdata.head())

#cutoff: a change above +1% --> 1, below -1% --> -1, anything in between --> 0
def round_pct(x):
    if x > 0.01:
        return 1
    elif x < -0.01:
        return -1
    else:
        return 0

#relative open-to-close change, positive when the price rose during the day
r1 = (trdata['Close'] - trdata['Open'])/trdata['Open']
r2 = r1.apply(round_pct)
trdata['rounded_growth'] = r2

print(trdata.head())

(26, 6)
          date                                        list_titles  length  \
0  06 Apr 2016  [Zuck tries Facebook Live ... doesn't go well,...     4.0   
1  07 Apr 2016  [The firm helping Facebook get $$$ in lost rev...     3.0   
2  08 Apr 2016  [Facebook taking shady retailers 'very serious...    28.0   
3  19 Apr 2016  [How one trader plans to make millions on Face...    18.0   
4  08 Apr 2016  [Actually, Netflix did warn about this price h...     9.0   

     ticker    Open   Close  
0  facebook  112.47  113.71  
1  facebook  113.79  113.64  
2  facebook  114.25  110.63  
3  facebook  111.10  112.29  
4   netflix  105.12  103.81  
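
The labelling rule in the cell above can also be written in vectorized form. The following is a minimal sketch, not the notebook's own code: it assumes growth is measured as (Close - Open) / Open, so that a rising price is positive, and the helper name `label_growth` and its `cutoff` parameter are hypothetical. The toy frame copies the Open and Close values from the five rows printed above.

```python
import numpy as np
import pandas as pd


def label_growth(open_price, close_price, cutoff=0.01):
    """Label each day +1, -1 or 0 depending on whether the
    open-to-close change exceeds +cutoff, falls below -cutoff, or neither.

    Assumption: growth = (close - open) / open, so a price rise is positive.
    """
    growth = (close_price - open_price) / open_price
    labels = np.select([growth > cutoff, growth < -cutoff], [1, -1], default=0)
    return pd.Series(labels, index=growth.index, name="rounded_growth")


# Open/Close values copied from the five rows printed above.
sample = pd.DataFrame({
    "Open": [112.47, 113.79, 114.25, 111.10, 105.12],
    "Close": [113.71, 113.64, 110.63, 112.29, 103.81],
})
sample_labels = label_growth(sample["Open"], sample["Close"])
print(sample_labels.tolist())
```

For example, the third row moves from 114.25 to 110.63, a change of about -3.2%, so it is labelled -1, while the second row changes by about -0.1% and is labelled 0.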


In [38]:
'''
    Compute sentiment analysis accuracy with Mohit's original method.
'''

#load the "lmdict" file into this notebook. This is the original method of sentiment analysis: it counts +,- words.
# 'negate': modifiers that negate a positive word (e.g. "ain't good" = bad)
# 'lmdict': the "positive" key maps to a list of positive words, the "negative" key to a list of negative words.
%run -i 'lmdict.py'
print(negate[:10])

def lmd_sentiment_analysis(trdata):
    print("Computing accuracy of sentiment analysis using 'lmdict' method:")
    y = trdata['rounded_growth']
    yhatlist = []  #one decision (+1, -1 or 0) per row, cast to a Series after the loop

    #iterate over the data and do analysis on each data point (using the "lmdict" technique)
    for i in range(trdata.shape[0]):
        t0 = trdata.loc[i,:]
        titles = t0.list_titles
        titles = list(set(titles))

        print("actual number of distinct titles:", len(titles))
        text = ' '.join(titles)
        neg_pos_ct = sentiment(text)  #results = [negwords,poswords,negative_words, positive_words]
        print(neg_pos_ct)

        decision = 0
        if neg_pos_ct[1] > neg_pos_ct[0]:
            decision = 1
        elif neg_pos_ct[1] < neg_pos_ct[0]:
            decision = -1

        yhatlist.append(decision)

    #cast list as series, reusing y's index so the comparison below lines up row by row
    y_hat = pd.Series(yhatlist, index=y.index)
    print(y_hat.head())

    #compute accuracy: fraction of rows where the predicted sign matches the rounded growth
    r1 = (y == y_hat)
    acc = float(r1.mean())
    print("\n******************************\n* Final Accuracy:", acc)


lmd_sentiment_analysis(trdata)

'''
    Problem with the above: often, very few of the words in the positive and negative lists actually show up
    in the article titles.
        TODO: test it on a larger dataset.
    
    Simple solution: "train" this model on the words frequently found on positive and negative days, rather
    than relying on a predefined dictionary.
    
    Longer-term solution: potentially use neural nets and deep learning to learn which words and sentences are
    associated with positive and negative stock price changes.
    (We need to learn all the words that appear most commonly in these articles, not just the words that show
    up in the Harvard and Lasswell categories, which would be restricting ourselves.)
'''

['aint', 'arent', 'cannot', 'cant', 'couldnt', 'darent', 'didnt', 'doesnt', "ain't", "aren't"]
Computing accuracy of sentiment analysis using 'lmdict' method:
actual number of distinct titles: 3
['fight', 'with', 'facebook', 'led', 'nfl', 'to', 'twitter', 'sources', 'zuck', 'tries', 'facebook', 'live', "doesn't", 'go', 'well', 'fight', 'with', 'facebook', 'led', 'to', 'nfl', 'to', 'twitter', 'sources']
[0, 0, [], []]
actual number of distinct titles: 1
['the', 'firm', 'helping', 'facebook', 'get', 'in', 'lost', 'revenue']
[1, 0, ['lost'], []]
actual number of distinct titles: 3
['chatbots', 'may', 'be', 'coming', 'to', 'facebook', 'facebook', 'live', 'video', 'set', 'to', 'beat', 'rivals', 'at', 'own', 'game', 'facebook', 'taking', 'shady', 'retailers', 'very', 'seriously']
[1, 0, ['seriously'], []]
actual number of distinct titles: 2
['how', 'one', 'trader', 'plans', 'to', 'make', 'millions', 'on', 'facebook', 'facebook', 'may', 'soon', 'let', 'users', 'collect', 'on', 'posts']
[0, 0,

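
The "simple solution" noted in the cell above (learning positive and negative words from the labelled days instead of a fixed dictionary) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `train_word_scores` and `predict`, the add-one smoothing, and the toy headlines and labels are all hypothetical and do not come from `trdata`. Each word is scored by the smoothed log-odds of appearing on an up day versus a down day, and a text is classified by the sign of its summed word scores.

```python
from collections import Counter
import math


def train_word_scores(texts, labels, smoothing=1.0):
    """Score each word by the smoothed log-odds of occurring on up days
    (label > 0) versus down days (label < 0). Neutral days are ignored."""
    up, down = Counter(), Counter()
    for text, label in zip(texts, labels):
        words = text.lower().split()
        if label > 0:
            up.update(words)
        elif label < 0:
            down.update(words)
    vocab = set(up) | set(down)
    up_total = sum(up.values()) + smoothing * len(vocab)
    down_total = sum(down.values()) + smoothing * len(vocab)
    return {
        w: math.log((up[w] + smoothing) / up_total)
        - math.log((down[w] + smoothing) / down_total)
        for w in vocab
    }


def predict(text, scores):
    """Return +1, -1 or 0 from the sign of the summed word scores.
    Words never seen in training contribute nothing."""
    total = sum(scores.get(w, 0.0) for w in text.lower().split())
    return 1 if total > 0 else -1 if total < 0 else 0


# Hypothetical toy headlines and labels, for illustration only.
toy_texts = [
    "shares surge after strong earnings",
    "stock plunges on weak outlook",
    "strong growth lifts shares",
    "weak sales sink stock",
]
toy_labels = [1, -1, 1, -1]

scores = train_word_scores(toy_texts, toy_labels)
print(predict("strong quarter", scores))
print(predict("weak quarter", scores))
```

With only 26 rows in `trdata`, scores learned this way would need to be evaluated on held-out days; measuring accuracy on the same days used for training would overstate how well the approach works.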

In [None]:
'''
    sentiment analysis steps:
    ~ Another component: suppose you want to find a company like Apple. How do you know whether the sentiment
        refers to "Apple" the company or not?
        ~ use embeddings to solve this problem?
        ~ use embeddings to find synonyms for companies like "Apple"?
    
    1. It might be a good idea to use the full article contents.
    2. I think we should view each article as a statistical test.
        i. within each article, you will have a sentiment associated with the ticker.
            (you can use embeddings to find this.)
        ii. Each sentiment that the ticker gets associated with can be considered a separate data point.
        iii. Now what we want to determine is a p-value: what is the probability that the observed distribution
            of negative sentiment was generated by a neutral-sentiment article?
            ~ might not have enough training data for this.
                * one way to get training data: find the sentiment distribution for the most frequent words
                that are not in the title?
            ~ I have the feeling that articles are not going to have neutral sentiment very often.
            ~ Think about what kind of statistical test is needed for this...
                * you need to do a statistical test for an entire distribution.
                * maybe a likelihood ratio test? (seems complex...)
            ~ We could consider the articles as separate distributions or as part of a single continuous distribution.
            ~ Play with using sentiment in a single article vs. sentiment in just the headlines.
        
        iv. We can also play with a time-series model. Let's say someone just discovered a glitch in Facebook,
            and the sentiment is far more negative today than it was yesterday. We should probably sell today.
            But what if the news is old? If this negative shift in sentiment has been going on for the last
            five days or so, how is the price impacted?
            ~ what if the news is about a year old and is resurfacing (like the Facebook privacy stuff)?
            
        v. Play with different "sensitivities" of when to buy & sell...
'''
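
One very simple way to make the p-value idea in step iii above concrete is an exact sign test on the word counts of a single article. The sketch below is a stand-in, not the likelihood ratio test mentioned in the notes: it assumes that a "neutral" article is one where each sentiment-bearing word is equally likely to be negative or positive, and the function name `sign_test_p_value` is hypothetical. The inputs correspond to the negative and positive counts that a dictionary-based counter would return.

```python
from math import comb


def sign_test_p_value(n_neg, n_pos):
    """Two-sided exact binomial (sign) test.

    Null hypothesis (an assumption of this sketch): in a neutral article,
    each sentiment word is negative with probability 0.5.
    Returns the probability of a split at least as lopsided as the
    observed one under that null hypothesis.
    """
    n = n_neg + n_pos
    if n == 0:
        return 1.0  # no sentiment words, so no evidence against neutrality
    k = max(n_neg, n_pos)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for the two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: 9 negative and 1 positive word in one article.
print(sign_test_p_value(9, 1))
# A nearly even split gives no evidence against neutrality.
print(sign_test_p_value(3, 2))
```

A small p-value suggests the article is unlikely to be neutral under this assumption. Headlines rarely contain enough dictionary words for the test to have much power, which is one more argument for trying the full article contents.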