# Approach
   In order to investigate the impacts of emotional states on our stock model, we would first need to generate a sentiment scores of an observed company have news published on newspaper as well as stock prices announced on a given day. 

## Stock Classification 
<img width="669" alt="screenshot 2018-11-12 18 41 45" src="https://user-images.githubusercontent.com/30711638/48381597-a745d300-e6aa-11e8-8762-04c34c2f7569.png">

## Data Description
Our final dataset is the combination of historical stock data that reflects the pattern of stock movements, fundamental parameters that indicate the long-term financial health of a company, and the sentiment scores that symbolize the public opinions towards the given company.
### 1. Financial quantitative data
There are 2 sources of financial data that we use in the analysis
-Daily historical stock price that we get from [Yahoo Finance](https://finance.yahoo.com/quote/AAPL/history?p=AAPL)
-The fundamental information of a company’s stocks, which has 198 features including figures such as  debt, equity, book values, etc. We get this data from [GuruFocus](https://www.gurufocus.com/term/Shares+Outstanding/AAPL/Shares-Outstanding-Diluted-Average/Apple-Inc)
### 2. Financial news
We obtained financial news data by scraping articles related to 21 popular companies that have regularly appeared on the news from 01/01/2011 to the present. Although this was supposed to give us approximately 45,000 instances of various news events, there are many days where the observed companies don’t have any news, so after dropping all of the days with no news, we were left with a sample size of approximately 25,000 instances.  

In this format, wi=1 if word i is in the given piece of news. Then, we fed the set of feature vectors through multiple machine learning algorithms to determine the optimal classifier for generating sentiment scores. We refed each news event through this classifier to generate a corresponding sentiment score, which would be treated as a new feature for our stock classification model.

## Generating sentiment scores on financial news
<img width="732" alt="screenshot 2018-11-12 18 43 47" src="https://user-images.githubusercontent.com/30711638/48381638-e8d67e00-e6aa-11e8-8b37-85c49f87ed68.png">

In order to classify a news event with positive or negative sentiment, we used the Bags of Words approach, in which a piece of news is converted to a feature vector (w1,., wn) consisting of words.

## Preprocessing Techniques
To simplify the process and reduce computation time, we treated all news published on any given day as a single piece of news. By doing this, we assumed that all news published in a day about a company has the same sentiment.
<img width="227" alt="screenshot 2018-11-12 22 24 45" src="https://user-images.githubusercontent.com/30711638/48389049-c6a02880-e6c9-11e8-8204-f54155bc4d69.png">

We applied many common NLP techniques to clean our news data

- Remove:
    - Punctuations
    - Stop words
    - Tokenize
    - Any words that don’t appear at least 2 times throughout our dataset because these features aren’t likely to reveal any patterns
- Tokenizing text in each given event date:
    - Convert uppercase to lowercase
    - Lemmatizing: convert words into the corresponding words
    - Word Tagging: determine word types. In this project, We decided to only keep adjectives, adverbs and verbs for our feature vectors because nouns generally don’t reveal much sentiment information
    - Bigram: consider multiple words together
- Ex: “Apple set to expand Siri, taking different route from Amazons Alexa”
    - Vectors of Words = [‘set’, ‘expand’, ‘take’, ‘different’]
We also conducted many pair tests for each preprocessing techniques to determine what preprocessing techniques, when combined, improves the performance of our classification

<img width="602" alt="screenshot 2018-11-12 19 17 03" src="https://user-images.githubusercontent.com/30711638/48382551-9481cd00-e6af-11e8-9fc9-07a8a07d8e54.png">

Based on table 1, which plots the accuracy of the Naive Bayes model, we observe that none of the preprocessing techniques significantly improves the performance of the classification. We also notice that Stemming, which reduces words to roots of words, isn’t as informative as Lemmatizing since the roots of words may not have been used in the actual articles. That’s why we need to be careful applying the different techniques that are commonly used for machine learning problems: the techniques might degrade instead improve the model performance if it’s not suited for our data. It’s also commonly known that bigram might degrade our classification as it increases our features size without adding more useful information.  Table 1 shows that the combination of bigram, lemmatizing, and word tagging works the best. However, Table 1 also shows that the accuracy of most algorithms is very low, almost equivalent to random guessing. We think that this is due to the curse of dimensionality, since we have lots of features with only a sample size of 1254 and a feature size of 3455, worsening the sparseness of the data. Next, we want to investigate how the performance of text classification improves with increasing sample size.

# Experiments
## Building text classifiers for sentiment analysis

In [1]:
# import all the neccessary packages
import pandas as pd 
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
import nltk
import numpy as np
import random
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify import ClassifierI
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.neural_network import MLPClassifier
import pickle 
import pdb
from nltk.stem import WordNetLemmatizer
import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.stem import PorterStemmer
import math
from nltk.metrics.scores import accuracy, precision, recall
import collections
import os, sys
import time

In [26]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /Users/tjhuynh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/tjhuynh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tjhuynh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/tjhuynh/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

-Because we have to process news articles for 21 different companies, it's better to write functions to read files of each company.

In [2]:
# Read news files of a company
def create_news_df(news_file):
    news_df = pd.read_csv(news_file, encoding = "ISO-8859-1")
    news_df['Date'] = pd.to_datetime(news_df['Date'], errors= 'coerce')
    news_df['Text'] = news_df['Body'].astype(str).str.cat(news_df['Title'].astype(str))
    del news_df['Body']
    del news_df['Title']
    news_df.fillna('', inplace=True)
    return news_df

# Create labels for text classfication

<img width="341" alt="screenshot 2018-11-12 22 35 11" src="https://user-images.githubusercontent.com/30711638/48389388-3bc02d80-e6cb-11e8-84ca-80edc5db2a8d.png">

To perform any classification task, I need to have labels for our data. Because I wanted to investigate how the news related to the stock movements, I decided that I should use some indicators of stock as the labels for the sentiment analysis model. One common binary stock indicator is comparing the stock returns with the return of S&P 500 index. This is a considerable assumption since

- The stock price is determined by aggregating the trades of everyone who reacts to the news, as well as the trades of other agents whose decisions are independent of what is in the news.
- The CAMP model states that a stock return can be decomposed into two components:

    - One captures all the risks correlated with the market
    - One captures all the risks that is unique to the stock. In the other words, it represents the well-being of the company itself and is independent of market movements and other assets.
These factors increases the uncertainties about the stock prediction, and in order to remedy this, I seek to find a quantity that is independent of the risk in market movements and other information not related to the news event. I found out that the quanity **Abnormal Return** satisfies the desired characteristics for stock label instead of the S&P 500.


## Method 1
## Label: Abnormal Return Changes
[Abnormal Return](https://www.investopedia.com/terms/a/abnormalreturn.asp) is the difference between the actual return of a security over a period of time and the expected return. The expected rate of return is the estimated return based on an asset pricing model, using a long run historical average or multiple valuation.
The thesis presented by Pablo Daniel Azar gives a much thorough understanding of Abnormal Return in the context of financial analysis.

Mathematically, Abnormal Return can be expressed as:

`Abnormal return = expected return - actual return (1)`

The "actual return" is the return that I obtained from Yahoo Finance and the "expected return" refers to the forecast return calculated by the Capital Asset Pricing Model (CAMP) framework. The basic formular for calculating a stock's expected return under the CAMP is

`Expected return = risk-free rate + beta x (market return - risk-free rate) (2)`

Using the (1) and (2) formulars, I calculate the abnomal return for 6-year time period of 22 companies and save the results in the folder Abnormal Returns. I read these "Abnormal Returns" csv file into dataframe with function `create_stock_df`.

To generate binary label, we create a threshold to convert Abnomral Return from continuous to binary values. 
<img width="504" alt="screenshot 2018-11-12 19 14 37" src="https://user-images.githubusercontent.com/30711638/48382452-36ed8080-e6af-11e8-97a7-aaefc285efd6.png">

- (+): Abnormal Return > 0.01%
- (-): Abnormal Return < 0.01%


In [3]:
# Read Yahoo stock of a company
def create_stock_df(stock_file):
    #  append Target into News dataframe
    stock_df = pd.read_csv(stock_file, encoding = "ISO-8859-1")
    stock_df['Date'] = pd.to_datetime(stock_df['Date'])
    return stock_df

In [4]:
# Combine news data and stock data of a company to prepare for label creation process
def combine_final_df(news_df, stock_df):
    df= stock_df.set_index('Date').join(news_df.set_index('Date'))
    df.fillna(0, inplace = True)
    df['Target'] = np.nan
    print(df.head())
    requirement = 0.00000
    for i in range(len(df)):
        if df['Abnormal Return'].iloc[i] > requirement:
            df['Target'].iloc[i] = 1.0
        elif df['Abnormal Return'].iloc[i] <  -requirement:
            df['Target'].iloc[i] = -1.0
        else:
            df['Target'].iloc[i] = 0.0
    return df

I make an assumption that the return on a firm is directly connected to the news. This is a relatively big assumption because in financial news, the text is written by a reporter who has no control over how the market will react. The classification is assigned by aggregating the trades of everyone who reacts to the news, as well as the trades of other agents whose decisions are independent of what is in the news. However, this assumption is reasonable for our model in which we try to investigate the relationship between the financial news and stock performance.

In [5]:
# We conduct experiments with data of 21 companies, so we have to combine processed data of these companies together to create 1 complete training dataset
def combine_multiple_company(company_list):  
    """
    Concatenate multiple dataframe of companies together to create big lexicon
    """
    company_dfs = []

    for company in company_list:
        print(company)
        news_file_name = 'data/News/'+ company + '_News.csv'
        news = create_news_df(news_file_name)
        stock_file_name = 'data/Abnormal_Returns/' + company + '_AbnormalReturn.csv'
        stock = create_stock_df(stock_file_name)

        final = combine_final_df(news, stock)
        company_dfs.append(final)
    total = pd.concat(company_dfs, ignore_index = True)
    return total

In [6]:
company_list = ["005930.KS", 'AAPL', 'INTC', 'MSFT', 'ORCL', 'SNE',
                'TDC', 'TSLA', 'TXN', 'FB', 'AMZN', 'QCOM', 'GOOG.O',
                'IBM', 'CVX', 'GE','WMT', 'WFC', 'XOM','T','F']

news_df = combine_multiple_company(company_list)
# news_df.to_csv('combined_companies.csv')
# news_df = pd.read_csv('combined_companies.csv')

def tag_label_to_event(news_df):
    data = news_df.values
    document = []
    for row in data: 
        if row[2] == 1.0 and row[1] != 'nannan':
            document.append( (row[1], "pos") )
        elif row[2] == -1.0 and row[1] != 'nannan':
            document.append( (row[1], "neg") )
    return document

005930.KS
            Abnormal Return  \
Date                          
2012-01-03           0.0000   
2012-01-04           0.0000   
2012-01-05          -0.0001   
2012-01-06           0.0001   
2012-01-09          -0.0001   

                                                         Text  Target  
Date                                                                   
2012-01-03  SEOUL South Korea said on Wednesday it had app...     NaN  
2012-01-04  SEOUL, Jan 4 Seoul shares slipped on\rWednesda...     NaN  
2012-01-05  SEOUL Samsung Electronics, the world's top mak...     NaN  
2012-01-06  TAIPEI Taiwanese smartphone maker HTC Corp rec...     NaN  
2012-01-09  LAS VEGAS Samsung Electronics Co unveiled its ...     NaN  


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


AAPL
            Abnormal Return    Text  Target
Date                                       
2011-01-03         0.000000  nannan     NaN
2011-01-04         0.006427  nannan     NaN
2011-01-05         0.003572  nannan     NaN
2011-01-06         0.001145  nannan     NaN
2011-01-07         0.008859  nannan     NaN
INTC
            Abnormal Return    Text  Target
Date                                       
2011-01-03         0.000000  nannan     NaN
2011-01-04         0.015676  nannan     NaN
2011-01-05        -0.014839  nannan     NaN
2011-01-06        -0.006037  nannan     NaN
2011-01-07        -0.003487  nannan     NaN
MSFT
            Abnormal Return    Text  Target
Date                                       
2011-01-03         0.000000  nannan     NaN
2011-01-04         0.005247  nannan     NaN
2011-01-05        -0.008219  nannan     NaN
2011-01-06         0.031412  nannan     NaN
2011-01-07        -0.005786  nannan     NaN
ORCL
            Abnormal Return    Text  Target
Date        

### Stemming and Lemmatizing¶
Normalizing the variations of words with the same meanings into the identical word. There are two ways to normalize words which are stemming (convert to roots of words) and lemmatizing (convert to actual words). For example:

- **Stemming**: The word "positive" and "positively" are derivatives of the same root of word "positiv" (notice "positiv" is non-existent word)
- **Lemmatizing**: The word "cats" is converted to lemmas "cat" which is an actual word In this project, I use lemmatizing because I find it easier to intepret the results.
As mentioned in Bag of Words model explaination, the classifier generally learn better with a smaller number of features in dataset. Stemming and lemmatizing is one method to reduce the number of features, thus improve the performance of the classifier.

In [28]:
def process_news_data(words):
    """
    Cleaning News data
    """
    lst = []
    for s in words:
        exclude = set(string.punctuation)
        s = ''.join(ch for ch in s if ch not in exclude)
        sentence_token = word_tokenize(s.lower())
        nostopword_sentence = []
        for word_token in sentence_token:
            stemmed_word = lemmatizer.lemmatize(word_token)
            # stemmed_word = ps.stem(word_token)
            if stemmed_word not in stopwords.words('english'):
                nostopword_sentence.append(stemmed_word)
            # if word_token not in stopwords.words('english'):
            #     nostopword_sentence.append(word_token)
        lst.append(nostopword_sentence)
    return lst

### Part of Speech Tagging¶
Speech tagging is labeling words in a sentence as nouns, adjectives, verbs...etc. The full list of tags are listed here. In this project, I decided to only keep adjectives, adverbs and verbs for our feature vectors because nouns generally don’t reveal much sentiment information

In [None]:
    def create_lexicon(news_df):
    text = np.array(process_news_data(news_df['Text'].values))
    all_words = []
    allowed_word_types = ['J','V','R']
    for day in list(range(len(text))):
        words = text[day]
        pos = nltk.pos_tag(words)
        for w in pos:
            if w[1][0] in allowed_word_types:
                all_words.append(w[0].lower())
    all_words = nltk.FreqDist(all_words)
    word_features = []
    for i in all_words.keys():
        if all_words.get(i) > 2:
            word_features.append(i)
#     word_features.remove('nannan')
    return word_features

In [69]:
news_df.head()

Unnamed: 0,Abnormal Return,Text,Target
0,0.0,SEOUL South Korea said on Wednesday it had app...,0.0
1,0.0,"SEOUL, Jan 4 Seoul shares slipped on\rWednesda...",0.0
2,-0.0001,"SEOUL Samsung Electronics, the world's top mak...",-1.0
3,0.0001,TAIPEI Taiwanese smartphone maker HTC Corp rec...,1.0
4,-0.0001,LAS VEGAS Samsung Electronics Co unveiled its ...,-1.0


### Collocations and Bigram¶
A collocation is a sequence of words that occur together unusually often (source). In a simplier sense, a collocation is a combination of words that don't carry meanings by themselves alone but is contextually meaningful together. A special case of collocation is bigrams which refer to a list of word pairs extracted from a text. For example

`list(bigrams( ['more', 'is', 'said', 'than', 'done'] ))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]`  
*than* and *done* don't provide much information to the sentence, but the pair of words than done does.
Collocations are bigrams that appear frequently.

I use Chi-square scoring function from *nltk.metrics.BigramAssocMeasures* to evaluate the performance of Bigram that I implement from nltk.metrics.BigramCollocatioinFinder. The BigramCollocationFinder maintains 2 internal Frequency Distribution, one for individual word frequencies, another for bigram frequencies. The scoring function measures the collocation correlation of 2 words, or reveal whether the bigram occurs about as frequently as each individual word.
The Pearson's Chi-squared statistics measures the relationship between observed values and expected values or between two categorical variables. Chi-squared can be used in independence test, comparing 2 variables, or in a more general sense, test the difference in distributions of categorical variables, to see whether they are related.
In this project, I use Chi-squared independence test to check if sequences of words, called collocations, occured together more than they might by chance. These collocations are typically names, idioms, set-phrases and the like in text.

In [29]:
def bigram_word_features(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams_tuple = bigram_finder.nbest(score_fn, n)
    bigrams = [' '.join(each) for each in bigrams_tuple]
    return list([ngram for ngram in itertools.chain(words, bigrams)])

In [79]:
lemmatizer = WordNetLemmatizer()
ps = PorterStemmer()
singular_word_features = create_lexicon(news_df)
word_features = bigram_word_features(singular_word_features)
# word_features = create_lexicon(news_df)

document = tag_label_to_event(news_df)

In [80]:
save_document = open("pickled_algos/documents_22_abnormalReturn.pickle","wb")
pickle.dump(document, save_document)
save_document.close()

save_word_features = open("pickled_algos/word_features_22_abnormalReturn.pickle","wb")
pickle.dump(word_features, save_word_features)
save_word_features.close()

In [35]:
def find_features(line):
    sentence = word_tokenize(line)
#     words = sentence
    words = []
    features = {}
    for each in sentence:
        words.append( ps.stem(each) ) 
    for w in word_features:
        features[w] = (w in words)
    return features

In [36]:
featuresets = [ (find_features(line),category) for (line,category) in document]
random.shuffle(featuresets)
print(len(featuresets))

save_features = open("pickled_algos/featuresets_22_abnormalReturn.pickle","wb")
pickle.dump(featuresets, save_features)
save_features.close()

featuresets_f = open("pickled_algos/featuresets_22_abnormalReturn.pickle", "rb")
featuresets = pickle.load(featuresets_f)
featuresets_f.close()

posfeats = []
negfeats = []
for each in featuresets:
    if each[1] == "pos":
        posfeats.append(each)
    else:
        negfeats.append(each)
        
split = 0.6
num = math.ceil(len(featuresets)* split)
training_set = featuresets[:num]
testing_set = featuresets[num:]

negcutoff = math.ceil(len(negfeats)*split)
poscutoff = math.ceil(len(posfeats)*split)

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

15024


In [70]:
algos_list = [SVC(), 
              LogisticRegression(dual = True), 
              LogisticRegression(class_weight = 'balanced'),
              SGDClassifier(loss='log'),
              LinearSVC(C= 0.01),
              SVC(kernel = 'poly', C= 0.01),
              MLPClassifier(hidden_layer_sizes=(100,100,100,100)),
              MLPClassifier(hidden_layer_sizes=(300,300,300,300,300))]

for algo in algos_list:
    start = time.time()
    classifier = SklearnClassifier(algo).train(trainfeats)
    end = time.time()
    train_time = end-start

    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    accuracy = nltk.classify.accuracy(classifier, testfeats)
    pos_precison = precision(refsets['pos'], testsets['pos'])
    neg_precision = precision(refsets['neg'], testsets['neg'])
    pos_recall = recall(refsets['pos'], testsets['pos'])
    neg_recall = recall(refsets['neg'], testsets['neg'])

    pickle_name = "pickled_algos/" + str(algo) + "_classifier.pickle"
    save_classifier = open(pickle_name,"wb")
    pickle.dump(classifier, save_classifier)
    save_classifier.close()

    algo_name = str(algo)[:-2]
    print(str(algo)+","+str(split)+","+str(train_time)+","
            + str(accuracy)+","+str(pos_precison)+","+str(neg_precision)+","
            +str(pos_recall)+","+str(neg_recall)+"\n")


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False),0.6,96.79088592529297,0.5105674821101681,0.5053153791637137,0.5105674821101681,0.48486909214552876,1.0





OSError: [Errno 63] File name too long: "pickled_algos/LogisticRegression(C=1.0, class_weight=None, dual=True, fit_intercept=True,\n          intercept_scaling=1, max_iter=100, multi_class='warn',\n          n_jobs=None, penalty='l2', random_state=None, solver='warn',\n          tol=0.0001, verbose=0, warm_start=False)_classifier.pickle"

In [64]:
classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Original Naive Bayes Algo accuracy, pos_accuracy, neg_accuracy percent: ", (nltk.classify.accuracy(classifier, testing_set))*100, 
    nltk.classify.accuracy(classifier,posfeats[:poscutoff]), nltk.classify.accuracy(classifier, negfeats[negcutoff:]))
classifier.show_most_informative_features(20)
save_classifier = open("pickled_algos/originalnaivebayes.pickle","wb")
pickle.dump(classifier, save_classifier)
save_classifier.close()

Original Naive Bayes Algo accuracy, pos_accuracy, neg_accuracy percent:  51.22316525212182 0.614233907524932 0.5876792698826597
Most Informative Features
                 upstart = True              pos : neg    =      6.8 : 1.0
                   agent = True              pos : neg    =      6.0 : 1.0
                 layaway = True              pos : neg    =      6.0 : 1.0
                   tesco = True              pos : neg    =      6.0 : 1.0
                 iranian = True              pos : neg    =      6.0 : 1.0
                    mega = True              pos : neg    =      5.3 : 1.0
                 quantum = True              pos : neg    =      5.3 : 1.0
                 murdoch = True              neg : pos    =      5.3 : 1.0
                    thin = True              neg : pos    =      5.3 : 1.0
                  rupert = True              neg : pos    =      4.7 : 1.0
                    mask = True              neg : pos    =      4.7 : 1.0
                   gr

In [47]:
BernoulliNB_classifier = SklearnClassifier(BernoulliNB(alpha=3.0))
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)
save_classifier = open("pickled_algos/BernoulliNB_classifier.pickle","wb")
pickle.dump(BernoulliNB_classifier, save_classifier)
save_classifier.close()

BernoulliNB_classifier accuracy percent: 51.572641038442335


In [71]:
LogisticRegression_classifier = SklearnClassifier(LogisticRegression(dual = True))
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)
save_classifier = open("pickled_algos/LogisticRegression_classifier.pickle","wb")
pickle.dump(LogisticRegression_classifier, save_classifier)
save_classifier.close()


LogisticRegression_classifier accuracy percent: 51.22316525212182


In [72]:
Weighted_LogisticRegression_classifier = SklearnClassifier(LogisticRegression(class_weight = 'balanced'))
Weighted_LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(Weighted_LogisticRegression_classifier, testing_set))*100)
save_classifier = open("pickled_algos/Weighted_LogisticRegression_classifier.pickle","wb")
pickle.dump(Weighted_LogisticRegression_classifier, save_classifier)
save_classifier.close()

LogisticRegression_classifier accuracy percent: 51.30637377267432


In [73]:
SGDClassifier_classifier = SklearnClassifier(SGDClassifier(loss='log'))
SGDClassifier_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)
save_classifier = open("pickled_algos/SGD_classifier.pickle","wb")
pickle.dump(SGDClassifier_classifier, save_classifier)
save_classifier.close()



SGDClassifier_classifier accuracy percent: 50.29122982193377


In [74]:
LinearSVC_classifier = SklearnClassifier(LinearSVC(C= 0.01))
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)
save_classifier = open("pickled_algos/LinearSVC_classifier.pickle","wb")
pickle.dump(LinearSVC_classifier, save_classifier)
save_classifier.close()

LinearSVC_classifier accuracy percent: 50.9069728740223


In [75]:
PolySVC_classifier = SklearnClassifier(SVC(kernel = 'poly', C= 0.01))
PolySVC_classifier.train(training_set)
print("PolySVC_classifier accuracy percent:", (nltk.classify.accuracy(PolySVC_classifier, testing_set))*100)
save_classifier = open("pickled_algos/PolySVC_classifier.pickle","wb")
pickle.dump(PolySVC_classifier, save_classifier)
save_classifier.close()

PolySVC_classifier accuracy percent: 50.22466300549176


In [76]:
MLP_classifier = SklearnClassifier(MLPClassifier(hidden_layer_sizes=(100,100,100,100)))
MLP_classifier.train(training_set)
print("MLP_classifier accuracy percent:", (nltk.classify.accuracy(MLP_classifier, testing_set))*100)
save_classifier = open("pickled_algos/MLP_classifier.pickle","wb")
pickle.dump(MLP_classifier, save_classifier)
save_classifier.close()

MLP_classifier accuracy percent: 50.8404060575803


In [77]:
Big_MLP_classifier = SklearnClassifier(MLPClassifier(hidden_layer_sizes=(300,300,300,300,300)))
Big_MLP_classifier.train(training_set)
print("Big_MLP_classifier accuracy percent:", (nltk.classify.accuracy(Big_MLP_classifier, testing_set))*100)
save_classifier = open("pickled_algos/Big_MLP_classifier.pickle","wb")
pickle.dump(Big_MLP_classifier, save_classifier)
save_classifier.close()

Big_MLP_classifier accuracy percent: 50.67398901647528


## Method 2
## Label: S&P 500

In [82]:
import pandas as pd 
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
# import pandas_datareader.data as web
# import datetime as dt
import nltk
import numpy as np
import random
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify import ClassifierI
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.neural_network import MLPClassifier
import pickle 
import pdb
from nltk.stem import WordNetLemmatizer
import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.stem import PorterStemmer
import pandas_datareader.data as web
import datetime as dt
import math
# Create Word Dict (list with lenth 1644 days)
lemmatizer = WordNetLemmatizer()
# ps = PorterStemmer()

def create_news_df(news_file):
    news_df = pd.read_csv(news_file, encoding = "ISO-8859-1")
    news_df['Date'] = pd.to_datetime(news_df['Date'], errors= 'coerce')
    news_df['Text'] = news_df['Body'].astype(str).str.cat(news_df['Title'].astype(str))
    del news_df['Body']
    del news_df['Title']
    news_df.fillna('', inplace=True)
    return news_df

def create_stock_df(symbol):
    """
    downloads stock which is gonna be the output of prediciton
    """
    name = 'data/News/'+symbol + '_Stocks.csv'
    out =   pd.read_csv(name, encoding = "ISO-8859-1")
    # out['Date'] = pd.to_datetime(out['Date'])
    out['Return'] = out['Adj Close'].pct_change()

    sp500 = pd.read_csv('data/GuruFocus/Yahoo_Index_GSPC.csv')
    sp500['Date'] = pd.to_datetime(sp500['Date'])
    sp500['Sp_Return'] = sp500['Adj Close'].pct_change()

    del sp500['Open']
    del sp500['Close']
    del sp500['High']
    del sp500['Low']
    del sp500['Volume']
    del sp500['Adj Close']

    df= out.set_index('Date').join(sp500.set_index('Date'))
    df = df.dropna()

    df['Difference'] = df['Return'] - df['Sp_Return']
    return df

def combine_final_df(news_df, stock_df):
    df= stock_df.join(news_df.set_index('Date'))
    df.fillna(0, inplace = True)
    df['Target'] = np.nan
    requirement = 0.0
    for i in range(len(df)):
        if df['Difference'].iloc[i] > requirement:
            df['Target'].iloc[i] = 1.0
        elif df['Difference'].iloc[i] <  requirement:
            df['Target'].iloc[i] = -1.0
        else:
            df['Target'].iloc[i] = 0.0
    return df

def combine_multiple_company(company_list):  
    """
    Concatenate multiple dataframe of companies together to create big lexicon
    """
    company_dfs = []

    for company in company_list:
        print(company)
        news_file_name = 'data/News/'+ company + '_News.csv'
        news = create_news_df(news_file_name)
        
        stock = create_stock_df(company)

        final = combine_final_df(news, stock)
        company_dfs.append(final)
    total = pd.concat(company_dfs, ignore_index = True)
    return total

company_list = ["005930.KS", 'AAPL', 'INTC', 'MSFT', 'ORCL', 'SNE',
                'TDC', 'TSLA', 'TXN', 'FB', 'AMZN', 'QCOM', 'GOOG.O',
                'IBM', 'CVX', 'GE', 'VZ','WMT', 'WFC', 'XOM','T','F']

# company_list = ["005930.KS", 'AAPL']              

news_df = combine_multiple_company(company_list)

def tag_label_to_event(news_df):
    data = news_df.values
    document = []
    for row in data: 
        if row[-1] == 1.0 and row[-2] != 'nannan':
            document.append( (row[-2], "pos") )
        elif row[-1] == -1.0 and row[-2] != 'nannan':
            document.append( (row[-2], "neg") )
    return document

def process_news_data(words):
    """
    Cleaning News data
    """
    lst = []
    for s in words:
        exclude = set(string.punctuation)
        s = ''.join(ch for ch in s if ch not in exclude)
        sentence_token = word_tokenize(s.lower())
        nostopword_sentence = []
        for word_token in sentence_token:

            stemmed_word = lemmatizer.lemmatize(word_token)
            # stemmed_word = ps.stem(word_token)
            if stemmed_word not in stopwords.words('english'):
                nostopword_sentence.append(stemmed_word)
            # if word_token not in stopwords.words('english'):
            #     nostopword_sentence.append(word_token)
        lst.append(nostopword_sentence)
    return lst
    
def create_lexicon(news_df):
    text = np.array(process_news_data(news_df['Text'].values))
    all_words = []
    allowed_word_types = ['J','V','R']
    for day in list(range(len(text))):
        words = text[day]
        pos = nltk.pos_tag(words)
        for w in pos:
            if w[1][0] in allowed_word_types:
                all_words.append(w[0].lower())
    all_words = nltk.FreqDist(all_words)
    word_features = []
    for i in all_words.keys():
        if all_words.get(i) > 2:
            word_features.append(i)
    # word_features.remove('nannan')
    return word_features

def bigram_word_features(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams_tuple = bigram_finder.nbest(score_fn, n)
    bigrams = [' '.join(each) for each in bigrams_tuple]
    return list([ngram for ngram in itertools.chain(words, bigrams)])


singular_word_features = create_lexicon(news_df)
word_features = bigram_word_features(singular_word_features)
# word_features = create_lexicon(news_df)
print('number of word features', len(word_features))

document = tag_label_to_event(news_df)

save_document = open("pickled_algos/documents_22_sp500.pickle","wb")
pickle.dump(document, save_document)
save_document.close()

save_word_features = open("pickled_algos/word_features_22_sp500.pickle","wb")
pickle.dump(word_features, save_word_features)
save_word_features.close()

def find_features(line):
    sentence = word_tokenize(line)
    # words = sentence
    words = []
    for each in sentence:
        words.append( lemmatizer.lemmatize(each) ) 
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [ (find_features(line),category) for (line,category) in document]
random.shuffle(featuresets)
print('number of featuresets', len(featuresets))

save_features = open("pickled_algos/featuresets_22_sp500.pickle","wb")
pickle.dump(featuresets, save_features)
save_features.close()

# featuresets_f = open("pickled_algos/featuresets_22_sp500.pickle", "rb")
# featuresets = pickle.load(featuresets_f)
# featuresets_f.close()

num = math.ceil(len(featuresets)*0.80)
training_set = featuresets[:num]
testing_set = featuresets[num:]


classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Original Naive Bayes Algo accuracy percent: ", (nltk.classify.accuracy(classifier, testing_set))*100)
# classifier.show_most_informative_features(20)
save_classifier = open("pickled_algos/sp500_originalnaivebayes.pickle","wb")
pickle.dump(classifier, save_classifier)
save_classifier.close()

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)
save_classifier = open("pickled_algos/sp500_MNB_classifier.pickle","wb")
pickle.dump(MNB_classifier, save_classifier)
save_classifier.close()

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)
save_classifier = open("pickled_algos/sp500_BernoulliNB_classifier.pickle","wb")
pickle.dump(BernoulliNB_classifier, save_classifier)
save_classifier.close()

LogisticRegression_classifier = SklearnClassifier(LogisticRegression(dual = True))
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)
save_classifier = open("pickled_algos/sp500_LogisticRegression_classifier.pickle","wb")
pickle.dump(LogisticRegression_classifier, save_classifier)
save_classifier.close()

Weighted_LogisticRegression_classifier = SklearnClassifier(LogisticRegression(class_weight = 'balanced'))
Weighted_LogisticRegression_classifier.train(training_set)
print("Weighted_LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(Weighted_LogisticRegression_classifier, testing_set))*100)
save_classifier = open("pickled_algos/sp500_Weighted_LogisticRegression_classifier.pickle","wb")
pickle.dump(Weighted_LogisticRegression_classifier, save_classifier)
save_classifier.close()

SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)
save_classifier = open("pickled_algos/sp500_SGD_classifier.pickle","wb")
pickle.dump(SGDClassifier_classifier, save_classifier)
save_classifier.close()

LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)
save_classifier = open("pickled_algos/sp500_LinearSVC_classifier.pickle","wb")
pickle.dump(LinearSVC_classifier, save_classifier)
save_classifier.close()

PolySVC_classifier = SklearnClassifier(SVC(kernel = 'poly', C= 0.01))
PolySVC_classifier.train(training_set)
print("PolySVC_classifier accuracy percent:", (nltk.classify.accuracy(PolySVC_classifier, testing_set))*100)
save_classifier = open("pickled_algos/sp500_PolySVC_classifier.pickle","wb")
pickle.dump(PolySVC_classifier, save_classifier)
save_classifier.close()

RbfSVC_classifier = SklearnClassifier(SVC(C= 0.01))
RbfSVC_classifier.train(training_set)
print("RbfSVC_classifier accuracy percent:", (nltk.classify.accuracy(RbfSVC_classifier, testing_set))*100)
save_classifier = open("pickled_algos/sp500_RbfSVC_classifier.pickle","wb")
pickle.dump(RbfSVC_classifier, save_classifier)
save_classifier.close()

NuSVC_classifier = SklearnClassifier(NuSVC(kernel = 'poly'))
NuSVC_classifier.train(training_set)
print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)
save_classifier = open("pickled_algos/sp500_NuSVC_classifier.pickle","wb")
pickle.dump(NuSVC_classifier, save_classifier)
save_classifier.close()

MLP_classifier = SklearnClassifier(MLPClassifier(hidden_layer_sizes=(100,100,100,100)))
MLP_classifier.train(training_set)
print("MLP_classifier accuracy percent:", (nltk.classify.accuracy(MLP_classifier, testing_set))*100)
save_classifier = open("pickled_algos/sp500_MLP_classifier.pickle","wb")
pickle.dump(MLP_classifier, save_classifier)
save_classifier.close()

Big_MLP_classifier = SklearnClassifier(MLPClassifier(hidden_layer_sizes=(300,300,300,300,300)))
Big_MLP_classifier.train(training_set)
print("Big_MLP_classifier accuracy percent:", (nltk.classify.accuracy(Big_MLP_classifier, testing_set))*100)
save_classifier = open("pickled_algos/sp500_Big_MLP_classifier.pickle","wb")
pickle.dump(Big_MLP_classifier, save_classifier)
save_classifier.close()

005930.KS


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


AAPL
INTC
MSFT
ORCL
SNE
TDC
TSLA
TXN
FB
AMZN
QCOM
GOOG.O
IBM
CVX
GE
VZ
WMT
WFC
XOM
T
F
number of word features 11972
number of featuresets 15701
Original Naive Bayes Algo accuracy percent:  50.955414012738856
MNB_classifier accuracy percent: 50.955414012738856
BernoulliNB_classifier accuracy percent: 50.98726114649682




LogisticRegression_classifier accuracy percent: 50.541401273885356
Weighted_LogisticRegression_classifier accuracy percent: 50.477707006369435




SGDClassifier_classifier accuracy percent: 50.41401273885351
LinearSVC_classifier accuracy percent: 50.15923566878981




PolySVC_classifier accuracy percent: 50.541401273885356
RbfSVC_classifier accuracy percent: 50.541401273885356
NuSVC_classifier accuracy percent: 49.10828025477707
MLP_classifier accuracy percent: 51.33757961783439
Big_MLP_classifier accuracy percent: 50.477707006369435
