# Introduction

This notebook compares the accuracy of the following models: <br>
1. **Logistic Regression**
2. **Multinomial Naive Bayes**
3. **Random Forest**
4. **Gradient Boosting Machines**
5. **Naive Bayes SVM**
6. **Multilayer Perceptron Neural Network(MLP)**
7. **LSTM Neural Network**
8. **Bidirectional LSTM Neural Network**
9. **Convolutional Neural Network**
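
As a rough illustration of what comparing accuracy across models involves, the sketch below fits three of the listed classifiers on a small synthetic, non-negative feature matrix and scores each on a held-out slice. The synthetic data, the `candidates` dictionary, and the hyperparameters are assumptions made for this example; they are not the contents of `TryAroundModels`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
# Synthetic stand-in for a TF-IDF matrix: non-negative features, binary labels.
X_demo = rng.rand(200, 20)
y_demo = rng.randint(0, 2, size=200)
# Ordered split (first 150 rows train, last 50 test), mirroring a date-based split.
X_tr, X_te = X_demo[:150], X_demo[150:]
y_tr, y_te = y_demo[:150], y_demo[150:]

# Hypothetical candidate set, not the notebook's actual configuration.
candidates = {
    "Logistic Regression": LogisticRegression(),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

demo_accuracy = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    demo_accuracy[name] = accuracy_score(y_te, model.predict(X_te))
    print(name, "-- Accuracy: ", demo_accuracy[name])
```

Because the labels here are random, the scores should hover around 0.5; the point is only the fit, predict, and score loop.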

Data <br>
- **x: a TF-IDF matrix built from news headlines on Reddit's WorldNews channel**
- **y: a binary label indicating whether the DJIA rose or fell**
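
To make the two inputs concrete, here is a minimal sketch that builds a TF-IDF matrix and a binary label from a toy table. The `toy` frame, its column names, and the labelling rule (1 when the close is at or above the previous close, otherwise 0) are assumptions for illustration; the notebook itself reads the label from the `Label` column of the CSV.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: one row per trading day.
toy = pd.DataFrame({
    "headline": [
        "markets rally as oil prices rise",
        "oil prices fall after trade talks stall",
        "trade talks resume as markets rally",
    ],
    "prev_close": [100.0, 102.0, 101.0],
    "close": [102.0, 101.0, 101.0],
})

# x: sparse TF-IDF matrix with one row per day and one column per term.
vectorizer = TfidfVectorizer()
x_toy = vectorizer.fit_transform(toy["headline"])

# y: assumed rule, 1 if the index did not fall versus the previous close, else 0.
y_toy = (toy["close"] >= toy["prev_close"]).astype(int).values

print(x_toy.shape)
print(y_toy)
```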

Output: a DataFrame listing each model's accuracy, sorted from highest to lowest
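
The sorted summary can be sketched as follows, using the accuracy values printed further down in this notebook. The column names `model` and `accuracy` are illustrative assumptions; the notebook's own frame uses the default columns `0` and `1`, and how `TryAroundModel` sorts internally is not shown here.

```python
import pandas as pd

# Accuracy values copied from the run recorded in this notebook.
results = [
    ("Logistic Regression", 0.5567282321899736),
    ("Multinomial Naive Bayes", 0.5145118733509235),
    ("Random Forest", 0.5382585751978892),
    ("Gradient Boosting Machine", 0.5065963060686016),
    ("Naive Bayes SVM", 0.525065963060686),
    ("Multilayer Perceptron Neural Network(MLP)", 0.5065963060686016),
]

# Sort from best to worst; a stable sort keeps tied models in their input order.
summary = (
    pd.DataFrame(results, columns=["model", "accuracy"])
    .sort_values("accuracy", ascending=False, kind="mergesort")
    .reset_index(drop=True)
)
print(summary)
```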

# Libraries

In [1]:
%matplotlib inline
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

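# TfidfVectorizer is used below; import it explicitly rather than relying on the star import.
from sklearn.feature_extraction.text import TfidfVectorizer
from TryAroundModels import *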

Using TensorFlow backend.


# Prepare Data

In [2]:
path = "../data/"
DJIA_fn = "DJIA_table.csv"
News_fn = "Combined_News_DJIA.csv"

DJIA_df = pd.read_csv(path + DJIA_fn)
DJIA_df = DJIA_df.sort_values("Date")
DJIA_df.index = range(len(DJIA_df))

News_df = pd.read_csv(path + News_fn)
News_df = News_df.sort_values("Date")
News_df.index = range(len(News_df))


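def RemoveQuote(tmp_str):
    """Strip quote characters and a leading bytes-literal 'b' from the
    first and last five characters of a headline."""
    start_list = []
    end_list = []
    mid_str = tmp_str[5:len(tmp_str)-5]
    for i in range(5):
        try:
            # Leading edge: drop quotes, and drop a 'b' found in the first three characters.
            if tmp_str[i] != "'" and tmp_str[i] != '"':
                if not (i <= 2 and tmp_str[i] == 'b'):
                    start_list.append(tmp_str[i])
            # Trailing edge: drop quotes. This check now runs for every i; the
            # original `continue` skipped it whenever a leading 'b' was found.
            if tmp_str[-5+i] != "'" and tmp_str[-5+i] != '"':
                end_list.append(tmp_str[-5+i])
        except IndexError:
            # Headline shorter than five characters: report it and move on.
            print(tmp_str)

    tmp_str = "".join(start_list) + mid_str + "".join(end_list)
    return tmp_str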

headline_columns = [x for x in News_df.columns if re.match("Top", x)]
for col in headline_columns:
    News_df[col] = News_df[col].apply(lambda x: RemoveQuote(x) if x == x else x)
Comb_df = DJIA_df.merge(News_df, on = "Date", how = "inner")


# Build the date mask from Comb_df itself so it always aligns with the rows being split.
train_index = pd.to_datetime(Comb_df.Date, format = "%Y-%m-%d") < pd.to_datetime("2014-12-31", format = "%Y-%m-%d")
train_data = Comb_df[train_index]
test_data = Comb_df[~train_index]
joint_headlines_train = train_data[headline_columns[0]]
joint_headlines_test = test_data[headline_columns[0]]

for i in range(1, len(headline_columns)):
    joint_headlines_train += (' ' + train_data[headline_columns[i]].apply(lambda x: str(x) if x == x else ""))
    joint_headlines_test += (' ' + test_data[headline_columns[i]].apply(lambda x: str(x) if x == x else ""))

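# Bigram TF-IDF features; bigrams in fewer than 3% or more than 97% of documents are dropped.
basicVectorizer = TfidfVectorizer(min_df=0.03, max_df=0.97, max_features = 200000, ngram_range = (2, 2))
basic_train = basicVectorizer.fit_transform([x for x in joint_headlines_train.values if x == x])
basic_test = basicVectorizer.transform([x for x in joint_headlines_test.values if x == x])
# Series.append was removed in pandas 2.0, so concatenate with pd.concat instead.
# Note: this call refits the vectorizer on the train and test headlines together.
basic_whole = basicVectorizer.fit_transform([x for x in pd.concat([joint_headlines_train, joint_headlines_test]).values if x==x])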

X_train = basic_train
X_test = basic_test

Y_train = train_data.Label.values
Y_test = test_data.Label.values

X_raw_text_train = joint_headlines_train
X_raw_text_test = joint_headlines_test

X = basic_whole
Y = Comb_df.Label.values

In [3]:
# Collect every name in the notebook namespace that starts with "TryAroundModel".
Models = [name for name in list(globals().keys()) if re.match("TryAroundModel", name)]
# Models[:-4] leaves out the last four collected names.
accuracy_list = TryAroundModel(X_train, X_test, Y_train, Y_test, X_raw_text_train, X_raw_text_test, Models[:-4])

Logistic Regression -- Accuracy:  0.5567282321899736
Multinomial Naive Bayes -- Accuracy:  0.5145118733509235
Random Forest -- Accuracy:  0.5382585751978892
Gradient Boosting Machine -- Accuracy:  0.5065963060686016
Naive Bayes SVM -- Accuracy:  0.525065963060686
Multilayer Perceptron Neural Network(MLP) -- Accuracy:  0.5065963060686016


In [4]:
pd.DataFrame(accuracy_list)

|   | 0 | 1 |
|---|---|---|
| 0 | Logistic Regression | 0.556728 |
| 1 | Random Forest | 0.538259 |
| 2 | Naive Bayes SVM | 0.525066 |
| 3 | Multinomial Naive Bayes | 0.514512 |
| 4 | Gradient Boosting Machine | 0.506596 |
| 5 | Multilayer Perceptron Neural Network(MLP) | 0.506596 |
