# Documentation (EMADE Implementation of Stocks Dataset) 

## Note: Run all the code cells below in order to ensure everything runs correctly 

Start by installing the libraries/packages that this dataset and EMADE need. In my case, I needed to install TensorFlow and Keras. To do so, run cmd *AS ADMINISTRATOR* and type the two `pip install` commands shown (commented out) in the cell below:

In [2]:
# Only run this code if you do not have these libraries already installed.

# pip install tensorflow
# pip install keras
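
Before installing anything, it can help to check what is already present. The following is a minimal sketch, assuming the two package names from the cell above; the `required` list and the `missing` variable are illustrative names, not part of EMADE.

```python
import importlib.util

# Packages mentioned in the install step above; adjust the list as needed.
required = ["tensorflow", "keras"]

# find_spec returns None when a top-level package cannot be found.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are already installed.")
```

If the list comes back empty, the `pip install` step can be skipped.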

Import all necessary libraries/packages:

In [4]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import shutil
import gzip

import os
import re
import math

This line silences pandas' chained-assignment warning (`SettingWithCopyWarning`), which the label-editing step later on can otherwise trigger.

In [2]:
pd.options.mode.chained_assignment = None  # default='warn'
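
An alternative to silencing the warning globally is to take an explicit copy before modifying a column subset. This is only a sketch on three rows taken from this dataset; `sample` and `sample_labels` are hypothetical names, and the 92.35 threshold is arbitrary.

```python
import pandas as pd

# First three MMM rows of the dataset used in this notebook.
sample = pd.DataFrame({
    "Close": [92.40, 92.30, 92.54],
    "Volume": [2075391.0, 1843476.0, 1983395.0],
})

# .copy() makes sample_labels an independent DataFrame, so assigning
# into it cannot affect (or warn about) the original frame.
sample_labels = sample[["Close"]].copy()
sample_labels.loc[sample_labels.Close > 92.35, "Close"] = 1.0

print(sample["Close"].tolist())         # original prices are unchanged
print(sample_labels["Close"].tolist())  # modified copy
```

With an explicit copy, the original frame keeps its prices and only the copy is edited.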

**Open the all_stocks_5yr excel/csv file and look at the data. It will help you understand and visualize the following steps.**

Read your data (excel, csv, etc.) and select the specific rows you want to work with. In this case, I have multiple stocks, but will only be showing you the first one (MMM). MMM takes up the first 1258 rows.

In [3]:
data = pd.read_csv("all_stocks_5yr.csv")

# Print first 5 examples
print(data.head())
print(data.shape)

# Keep only the first 1258 rows (MMM) by dropping every row after them
data = data.drop(data.index[1258:])
print(data.shape)
print(data)

        Date   Open   High    Low  Close     Volume Name
0  8/13/2012  92.29  92.59  91.74  92.40  2075391.0  MMM
1  8/14/2012  92.36  92.50  92.01  92.30  1843476.0  MMM
2  8/15/2012  92.00  92.74  91.94  92.54  1983395.0  MMM
3  8/16/2012  92.75  93.87  92.21  93.74  3395145.0  MMM
4  8/17/2012  93.93  94.30  93.59  94.24  3069513.0  MMM
(606801, 7)
(1258, 7)
           Date    Open    High     Low   Close     Volume Name
0     8/13/2012   92.29   92.59   91.74   92.40  2075391.0  MMM
1     8/14/2012   92.36   92.50   92.01   92.30  1843476.0  MMM
2     8/15/2012   92.00   92.74   91.94   92.54  1983395.0  MMM
3     8/16/2012   92.75   93.87   92.21   93.74  3395145.0  MMM
4     8/17/2012   93.93   94.30   93.59   94.24  3069513.0  MMM
5     8/20/2012   94.00   94.17   93.55   93.89  1640008.0  MMM
6     8/21/2012   93.98   94.10   92.99   93.21  2302988.0  MMM
7     8/22/2012   92.56   93.36   92.43   92.68  2463908.0  MMM
8     8/23/2012   92.65   92.68   91.79   91.98  1823757.0  
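
Selecting rows by position works here because MMM happens to occupy the first 1258 rows. A possible alternative, sketched below, is to filter on the `Name` column so the selection does not depend on row order. The three MMM rows come from the output above; the ticker "XYZ" and its values are made up purely for illustration.

```python
import io
import pandas as pd

# Three real MMM rows plus two rows for a hypothetical ticker "XYZ"
# (made-up values) so the filter has something to exclude.
csv_text = """Date,Open,High,Low,Close,Volume,Name
8/13/2012,92.29,92.59,91.74,92.40,2075391.0,MMM
8/14/2012,92.36,92.50,92.01,92.30,1843476.0,MMM
8/15/2012,92.00,92.74,91.94,92.54,1983395.0,MMM
8/13/2012,10.00,10.50,9.90,10.20,1000.0,XYZ
8/14/2012,10.20,10.40,10.00,10.10,1200.0,XYZ
"""
sample = pd.read_csv(io.StringIO(csv_text))

# A boolean mask keeps only the rows for one ticker, wherever they sit in the file.
mmm = sample[sample["Name"] == "MMM"].reset_index(drop=True)
print(mmm.shape)
```

The same mask applied to the full file would select a stock without knowing how many rows it spans.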

Assign your labels variable:

In [4]:
labels = data[["Close"]]
print("----------------------------LABELS-------------------------------")
print(labels.shape)
print(labels)
print("----------------------------LABELS-------------------------------")

----------------------------LABELS-------------------------------
(1258, 1)
       Close
0      92.40
1      92.30
2      92.54
3      93.74
4      94.24
5      93.89
6      93.21
7      92.68
8      91.98
9      92.83
10     92.59
11     92.30
12     92.43
13     91.76
14     92.60
15     91.68
16     91.75
17     93.28
18     92.82
19     90.67
20     91.17
21     90.81
22     92.06
23     93.98
24     93.78
25     93.43
26     93.63
27     93.58
28     93.21
29     93.73
...      ...
1228  208.19
1229  209.83
1230  209.76
1231  208.02
1232  209.59
1233  210.49
1234  209.66
1235  211.30
1236  211.09
1237  211.77
1238  211.68
1239  211.31
1240  212.10
1241  212.45
1242  211.16
1243  210.00
1244  199.39
1245  199.03
1246  200.05
1247  199.72
1248  201.17
1249  203.18
1250  205.41
1251  207.62
1252  207.65
1253  207.44
1254  206.43
1255  206.48
1256  206.23
1257  205.98

[1258 rows x 1 columns]
----------------------------LABELS-------------------------------


Replace the label values with 0 or 1. Here, we check whether each day's 'Close' price is higher or lower than the previous day's 'Close' price.

If the price is greater than or equal to the day before, we put a 1. If it is less, we put a 0. The first row has no previous day, so it also gets a 0.

In [5]:
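# Mark days whose Close is at or above the previous day's Close with 1
labels.loc[labels.Close >= labels.Close.shift(), 'Close'] = 1
# Everything else (including the first row, which has no previous day) becomes 0
labels.loc[labels.Close != 1, 'Close'] = 0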


print(labels.shape)
print(labels)

(1258, 1)
      Close
0       0.0
1       0.0
2       1.0
3       1.0
4       1.0
5       0.0
6       0.0
7       0.0
8       0.0
9       1.0
10      0.0
11      0.0
12      1.0
13      0.0
14      1.0
15      0.0
16      1.0
17      1.0
18      0.0
19      0.0
20      1.0
21      0.0
22      1.0
23      1.0
24      0.0
25      0.0
26      1.0
27      0.0
28      0.0
29      1.0
...     ...
1228    1.0
1229    1.0
1230    0.0
1231    0.0
1232    1.0
1233    1.0
1234    0.0
1235    1.0
1236    0.0
1237    1.0
1238    0.0
1239    0.0
1240    1.0
1241    1.0
1242    0.0
1243    0.0
1244    0.0
1245    0.0
1246    1.0
1247    0.0
1248    1.0
1249    1.0
1250    1.0
1251    1.0
1252    1.0
1253    0.0
1254    0.0
1255    1.0
1256    0.0
1257    0.0

[1258 rows x 1 columns]
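
The same labeling rule can also be written as a single comparison. This is a sketch on the first five closing prices from this dataset, not the cell used in this notebook; `closes` and `up` are illustrative names.

```python
import pandas as pd

# First five MMM closing prices from the output above.
closes = pd.Series([92.40, 92.30, 92.54, 93.74, 94.24], name="Close")

# shift() moves every value down one row, so each price is compared with
# the previous day's price. The first row compares against NaN, which is
# False, so it becomes 0.
up = (closes >= closes.shift()).astype(float)
print(up.tolist())  # [0.0, 0.0, 1.0, 1.0, 1.0]
```

These values match the first five rows of the labels printed above. Writing it this way also avoids relying on no real price being exactly 1.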


Drop columns that you don't need:

In [6]:
# Drop the three unused price columns in a single call
data = data.drop(["Open", "High", "Low"], axis=1)

print(data)

           Date   Close     Volume Name
0     8/13/2012   92.40  2075391.0  MMM
1     8/14/2012   92.30  1843476.0  MMM
2     8/15/2012   92.54  1983395.0  MMM
3     8/16/2012   93.74  3395145.0  MMM
4     8/17/2012   94.24  3069513.0  MMM
5     8/20/2012   93.89  1640008.0  MMM
6     8/21/2012   93.21  2302988.0  MMM
7     8/22/2012   92.68  2463908.0  MMM
8     8/23/2012   91.98  1823757.0  MMM
9     8/24/2012   92.83  1945796.0  MMM
10    8/27/2012   92.59  1879969.0  MMM
11    8/28/2012   92.30  1913066.0  MMM
12    8/29/2012   92.43  1735933.0  MMM
13    8/30/2012   91.76  1729576.0  MMM
14    8/31/2012   92.60  1917265.0  MMM
15     9/4/2012   91.68  2532261.0  MMM
16     9/5/2012   91.75  2985428.0  MMM
17     9/6/2012   93.28  3223309.0  MMM
18     9/7/2012   92.82  3215341.0  MMM
19    9/10/2012   90.67  6356086.0  MMM
20    9/11/2012   91.17  2403421.0  MMM
21    9/12/2012   90.81  2409574.0  MMM
22    9/13/2012   92.06  2734203.0  MMM
23    9/14/2012   93.98  4993059.0  MMM


Put the text (the stock 'Name' column) in a new variable and remove it from the original dataframe:

In [7]:
text = data[["Name"]].values
data = data.drop("Name", axis=1)

# Flatten the (n, 1) array of names into a plain Python list
text_list = [row[0] for row in text]

print(text_list)
print(data)

['MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM', 'MMM'

Turn the text and data values into numpy arrays:

In [8]:
text_array = np.array(text_list)
data_array = data.values

print(text_array)
print(data_array)

['MMM' 'MMM' 'MMM' ..., 'MMM' 'MMM' 'MMM']
[['8/13/2012' 92.4 2075391.0]
 ['8/14/2012' 92.3 1843476.0]
 ['8/15/2012' 92.54 1983395.0]
 ..., 
 ['8/9/2017' 206.48 1622213.0]
 ['8/10/2017' 206.23 1571545.0]
 ['8/11/2017' 205.98 1452811.0]]
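
Because the `Date` column holds strings, the array printed above mixes text and numbers. A small sketch of what that means in practice, using three rows from this dataset (`sample`, `arr`, and `numeric` are illustrative names):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Date": ["8/13/2012", "8/14/2012", "8/15/2012"],
    "Close": [92.40, 92.30, 92.54],
    "Volume": [2075391.0, 1843476.0, 1983395.0],
})

arr = sample.values
print(arr.dtype)  # mixed string/float columns fall back to a generic dtype

# Slice off the date column and cast the rest when numeric operations are needed.
numeric = arr[:, 1:].astype(float)
print(numeric.mean(axis=0))
```

Whether the date column should be kept as text, converted, or dropped depends on what the later steps expect, which this sketch does not assume.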


Now we split the data into training and test sets, then concatenate the input data and labels to match EMADE's format.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(data_array, labels.values, test_size=0.33, shuffle=False)

data_train = np.concatenate((X_train, y_train), axis=1)
data_test = np.concatenate((X_test, y_test), axis=1)

print("----------------------------TRAIN-------------------------------")
print(data_train.shape)
print(data_train)
print("----------------------------TRAIN-------------------------------")

print()
print()
print()

print("----------------------------TEST-------------------------------")
print(data_test.shape)
print(data_test)
print("----------------------------TEST-------------------------------")

----------------------------TRAIN-------------------------------
(842, 4)
[['8/13/2012' 92.4 2075391.0 0.0]
 ['8/14/2012' 92.3 1843476.0 0.0]
 ['8/15/2012' 92.54 1983395.0 1.0]
 ..., 
 ['12/14/2015' 157.63 3448686.0 1.0]
 ['12/15/2015' 148.13 8645905.0 0.0]
 ['12/16/2015' 149.95 4774134.0 1.0]]
----------------------------TRAIN-------------------------------



----------------------------TEST-------------------------------
(416, 4)
[['12/17/2015' 148.85 3053902.0 0.0]
 ['12/18/2015' 146.92 5736432.0 0.0]
 ['12/21/2015' 147.48 2284411.0 1.0]
 ..., 
 ['8/9/2017' 206.48 1622213.0 1.0]
 ['8/10/2017' 206.23 1571545.0 0.0]
 ['8/11/2017' 205.98 1452811.0 0.0]]
----------------------------TEST-------------------------------


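We split the data with 67% as training and 33% as testing data. Because `shuffle=False`, the rows stay in chronological order: the training set ends on 12/16/2015 and the test set starts on 12/17/2015. Then we append the label column to the end of the input data columns to put the data into the format EMADE expects.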
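
To see where the 842 and 416 row counts come from, here is a small sketch of the same split on stand-in arrays of the same length. The feature and label values are placeholders, and the short variable names are chosen so they do not overwrite the notebook's own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1258
X = np.arange(n_rows).reshape(-1, 1)         # stand-in features: the row index
y = (np.arange(n_rows) % 2).reshape(-1, 1)   # stand-in 0/1 labels

# With shuffle=False the first rows go to training and the last rows to testing.
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33, shuffle=False)

# Append the label column as the last column, as done above.
train = np.concatenate((Xtr, ytr), axis=1)
test = np.concatenate((Xte, yte), axis=1)

print(train.shape, test.shape)
print(Xtr[-1, 0], Xte[0, 0])  # last training row index, first test row index
```

The test size is 0.33 × 1258 = 415.14 rounded up to 416 rows, which leaves 842 rows for training.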