<font size = 6><b>CSE 255 Take Home Final: Analysis of stock prices</b></font>


In this take-home final you are to analyze the daily changes in stock prices using PCA and to measure the intrinsic dimension of stock sequences. Later you will also use xgboost to predict stock category from the eigenvectors of the data.

## Notebook 1: Preparing Data

In [None]:
!pwd  

## should be /home/jovyan/work/Final if you're using docker. If you're not using docker,
## you should still work inside the `Final` folder before proceeding forward.

In [None]:
## importing some useful python libraries

import sys,os
import numpy as np
from numpy.linalg import norm
import matplotlib.pyplot as plt
%matplotlib inline

from time import time
import math
import pandas as pd
from glob import glob
import pickle


### Download Data

We start by downloading data and pre-processing it to make it ready for analysis using Spark.

The data is a directory with .csv files, one for each stock. This directory has been tarred and uploaded to S3, at: https://mas-dse-open.s3.amazonaws.com/Stocks/spdata_csv.tgz

Download and untar the file to create a directory called `spdata_csv` inside the `data/` subdirectory of the current directory.

In [None]:
!pwd

## This should be /home/jovyan/work/Final if you're using docker.
## If not using docker, `cd` to the `Final` directory before proceeding.

In [None]:
## creating the necessary directory structure and downloading/extracting data 

%mkdir -p data/
%cd data
!rm -f spdata_csv.tgz && rm -rf spdata_csv ## Deleting any old copy present
!wget https://mas-dse-open.s3.amazonaws.com/Stocks/spdata_csv.tgz ## Downloading data
!tar -xf spdata_csv.tgz ## Extracting data
## Going back to `Final` directory to keep it as our working directory
%cd ../ 
%ls -al data/

In [None]:
## How is the data structured?

files=!ls -1 data/spdata_csv/train/
files[:5]

### Stock Info
The file `data/TickerInfo.tsv` contains information relating companies to sectors.

In [None]:
## How many train and test stocks?

## `ls -1 | wc -l` counts one line per file (`ls -l` would add a "total" line)
!ls -1 data/spdata_csv/train/ | wc -l
!ls -1 data/spdata_csv/test/ | wc -l

## Read Data and create a single table

Your task in this notebook is to read the stock-information `.csv` files, extract from them the column 
`Adj. Open` and combine them into a single `.csv` file containing all of the information that is relevant for later analysis.

Below we suggest a particular sequence of steps. You can either follow these steps or do this in your own way. The end result should be a file called `SP500.csv` which stores the information described below.

### Step 1: files into pandas dataframes

In this step we read all of the relevant information into a large dictionary we call `Tables`.

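The key to this dictionary is the stock's "ticker" prefixed by the subdirectory the file came from (for example `train/IBM`). In other words, it is the file path relative to `spdata_csv`, excluding the `.csv` extension. We read in every `.csv` file in the subdirectories of `spdata_csv`.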
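As a minimal sketch of how a file path maps to a dictionary key, the extension can be stripped with `os.path.splitext`. The paths below are illustrative only, and `ticker_key` is a hypothetical helper that is not part of the notebook.

```python
import os

def ticker_key(path):
    """Return the path without its extension, e.g. 'train/IBM.csv' -> 'train/IBM'."""
    root, _ext = os.path.splitext(path)
    return root

# Illustrative relative paths, assumed to look like the ones glob('*/*.csv') returns.
example_paths = ['train/IBM.csv', 'test/8.csv']
keys = [ticker_key(p) for p in example_paths]
print(keys)
```

Unlike slicing off the last four characters, `os.path.splitext` does not depend on the length of the extension.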

In [None]:
cur_dir = !pwd
print("The current working directory: ", cur_dir)
data_dir_rel_path = 'data/spdata_csv/'

In [None]:
%cd $data_dir_rel_path
Tables={}
for filename in glob('*/*.csv'):
    print('\r',filename, end=' ')
    # The key is the path relative to `spdata_csv` without the `.csv` extension, e.g. 'train/IBM'
    code = filename[:-4]
    tbl=pd.read_csv(filename,index_col='Date',parse_dates=True)
    # Keep only files with the expected 12 data columns
    if(np.shape(tbl)[1]==12):
        Tables[code]=tbl.sort_index()
    else:
        print("This file does not have the correct number of columns.")
        print(filename,np.shape(tbl))
        
%cd ../../
## `cur_dir` was captured before the `%cd`, so query the actual directory instead
print("The current working directory: ", os.getcwd())

In [None]:
# Example of an entry in `Tables`
print(len(Tables))
Tables['train/IBM'].head()

### Step 2: Computing diffs and combining into a single table

The next step is to extract the relevant prices from each table, compute an additional quantity we call `diff` and create a single combined pandas dataframe called `Diffs` containing info about all stocks.

The price we will use is the **Adjusted Open Price** which is the price when the stock exchange opens in the morning. We use the **adjusted** price which eliminates technical adjustments such as stock splits.

It is more meaningful to predict *changes* in prices than prices themselves. We therefore compute, for each stock, a `diff` sequence $d(t)=\log \frac{p(t+1)}{p(t)}$, where $p(t)$ is the price on day $t$ and $d(t)$ is the log of the ratio between consecutive prices, which we call the price diff.

If a price sequence has length $T$, then its diff sequence has length $T-1$. To make the price sequence and the diff sequence the same length, we drop the last day's price from each sequence after calculating the `diff` for that stock.
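The following sketch illustrates the diff computation and the length alignment on a toy sequence. The prices are made-up numbers, not real stock data.

```python
import numpy as np

# Toy price sequence of length T = 5 (made-up numbers).
prices = np.array([100.0, 110.0, 99.0, 99.0, 108.9])

# d(t) = log(p(t+1) / p(t)) for t = 0 .. T-2, so there are T-1 values.
diff = np.log(prices[1:] / prices[:-1])

# Drop the last price so that prices and diffs line up one-to-one.
aligned_prices = prices[:-1]

print(len(prices), len(diff), len(aligned_prices))
print(np.round(diff, 4))
```

A convenient property of log ratios is that they add up: the sum of all $d(t)$ equals $\log \frac{p(T-1)}{p(0)}$, the log ratio between the last and the first price.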

Your task in this step is to compute the diff sequence for each stock, `join` the results by date, and create one large pandas DataFrame called `Diffs` whose row index is the date and which has two columns for each key in `Tables`. For example, for the key `train/IBM` there would be two columns, `train/IBM_P` and `train/IBM_D`. The first holds the prices of the IBM stock, $p(t)$, and the second holds the price diffs, $d(t)$. In total, the resulting DataFrame should have 962 columns (i.e. 481*2).
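To illustrate the joining step, here is a small sketch with two made-up stocks that start trading on different dates. The tickers `AAA` and `BBB` and the helper `diffs_for` are hypothetical. This is one possible way to build the columns, not necessarily the intended solution.

```python
import numpy as np
import pandas as pd

def diffs_for(code, series):
    """Build the two columns <code>_P and <code>_D for one price series."""
    prices = series.to_numpy(dtype=float)
    diff = np.log(prices[1:] / prices[:-1])
    # The last date is dropped so that prices and diffs have the same length.
    return pd.DataFrame({code + '_P': prices[:-1], code + '_D': diff},
                        index=series.index[:-1])

# Two made-up stocks; BBB has one day less of history than AAA.
a = pd.Series([10.0, 11.0, 12.1, 12.1],
              index=pd.to_datetime(['2000-01-03', '2000-01-04',
                                    '2000-01-05', '2000-01-06']))
b = pd.Series([50.0, 45.0, 45.0],
              index=pd.to_datetime(['2000-01-04', '2000-01-05', '2000-01-06']))

# An outer join keeps every date; entries missing for a stock become NaN.
toy = diffs_for('AAA', a).join(diffs_for('BBB', b), how='outer')
print(toy)
```

Because the join is an outer join, dates on which a stock was not yet traded show up as `NaN` in that stock's columns rather than being dropped.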

In [None]:
def construct_df_with_diffs_and_prices(Tables):
    ## This is the df you will use to store the required info.
    ## You may keep `joining` calculated data for each stock in this DF.
    Diffs=pd.DataFrame()
    
    Indices=set(Tables.keys())
    print(len(Indices))
    
    for code in Indices:

        #BEGIN SOLUTION
        print('\r',code, end=' ')
        tbl=Tables[code]
        S=tbl['Adj. Open']
        prices=np.array(S)  # All T prices for this stock
        diff=np.log(prices[1:]/prices[:-1])  # T-1 log price ratios
        I=S.index[:-1]  # Drop the last date so prices and diffs align
        Sdiff=pd.DataFrame({code+'_D':diff,code+'_P':prices[:-1]},index=I)
        Diffs=Diffs.join(Sdiff,how='outer')  # Outer join keeps every date
        #END SOLUTION
    
    return Diffs

In [None]:
Diffs = construct_df_with_diffs_and_prices(Tables)
Diffs.head()

In [None]:
assert 'train/IBM_P' in Diffs, "Please check your implementation."
assert 'train/IBM_D' in Diffs, "Please check your implementation."

In [None]:
assert len(Diffs['train/IBM_P'])==len(Diffs['train/IBM_D'])==Diffs.shape[0], "The number of rows across stocks should remain fixed."
assert len(Diffs.columns) == 962, "The number of columns is not correct. Please check your implementation."

In [None]:
## `.iloc[0]` selects by position; plain `[0]` on a date-indexed Series is deprecated in recent pandas
assert isinstance(Diffs['train/IBM_P'], pd.Series), "Every column should be a pandas series"
assert isinstance(Diffs['train/IBM_P'].iloc[0], np.float64), "Every data point in a series should be np.float64"
assert isinstance(Diffs['train/IBM_D'], pd.Series), "Every column should be a pandas series"
assert isinstance(Diffs['train/IBM_D'].iloc[0], np.float64), "Every data point in a series should be np.float64"

In [None]:
#Hidden tests here
### BEGIN HIDDEN TESTS 
assert Diffs.shape==(13422, 962), "incorrect shape"   
### END HIDDEN TESTS

In [None]:
#Hidden tests here
### BEGIN HIDDEN TESTS 
assert np.round(Diffs['train/AMZN_P'].iloc[10001],3) == np.round(7.59,2), "incorrect value"
assert np.round(Diffs['train/AMZN_D'].iloc[10000],3) == np.round(0.038957243253193226, 3), "incorrect value"   
### END HIDDEN TESTS

In [None]:
#Hidden tests here
### BEGIN HIDDEN TESTS 
assert np.round(np.sum(Diffs['train/AMZN_D']),3) == np.round(5.272647773904442,3), "incorrect value"
assert np.round(np.sum(Diffs['train/AMZN_P']),3) == np.round(457411.17666666675,3), "incorrect value"   
### END HIDDEN TESTS

In [None]:
# plot some stocks

Diffs[['train/AAPL_P','train/MSFT_P','train/IBM_P','test/8_P']].plot(figsize=(14,10));
plt.grid()

### Black Monday

One of the biggest crashes in the US stock market happened on **Black Monday**, Oct 19, 1987.

We will look at stock prices around that date.

In [None]:
#Focus on "Black Monday:" the stock crash of Oct 19 1987

import datetime
date_format = "%b-%d-%Y"  # renamed from `format` to avoid shadowing the built-in

_from = datetime.datetime.strptime('Sep-1-1987', date_format)
_to = datetime.datetime.strptime('Nov-30-1987', date_format)

Diffs.loc[_from:_to,['train/AAPL_P','train/MSFT_P','train/IBM_P']].plot(figsize=(14,10));
plt.grid()

**Why does it seem that the price of IBM fell much more than those of Apple and Microsoft?**

Because IBM's price started so much higher, a similar relative drop shows up as a much larger change in dollars. As explained above, it is more informative to consider $\log \frac{p(t+1)}{p(t)}$.
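A small worked example with made-up prices shows the effect: two stocks that both lose 20% in one day have very different drops in dollars but identical log ratios.

```python
import numpy as np

# Made-up prices: both stocks lose 20% in one day, from very different levels.
high_before, high_after = 150.0, 120.0
low_before, low_after = 10.0, 8.0

# Drop measured in dollars: dominated by the higher-priced stock.
abs_drop_high = high_before - high_after
abs_drop_low = low_before - low_after

# Drop measured as a log ratio: independent of the price level.
log_ratio_high = np.log(high_after / high_before)
log_ratio_low = np.log(low_after / low_before)

print(abs_drop_high, abs_drop_low)
print(log_ratio_high, log_ratio_low)
```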

In [None]:
Diffs.loc[_from:_to,['train/AAPL_D','train/MSFT_D','train/IBM_D']].plot(figsize=(14,10));
plt.grid()

In [None]:
!pwd  

## should be /home/jovyan/work/Final if you're using docker. If you're not using docker,
## you should still work inside the `Final` folder before proceeding forward.

In [None]:
## Saving the data to appropriate location for use in next notebooks
Diffs.to_csv('data/SP500.csv')

### Note

To make sure that errors in constructing the data do not propagate to the other notebooks of the final, you may run the cell below, which downloads the instructor's version of `SP500.csv`. For the next notebooks, you may use either your own version or the one provided by us. Ideally, both should have the same contents.
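If you want to check whether two versions of the table agree, one possible approach is sketched below. It assumes both files have already been loaded into DataFrames (for instance with `pd.read_csv(..., index_col=0, parse_dates=True)`). The helper `same_contents` and the toy frames are hypothetical and only illustrate the comparison.

```python
import numpy as np
import pandas as pd

def same_contents(a, b, rtol=1e-9):
    """True if both frames have the same dates, the same columns (in any order)
    and numerically close values, treating NaN as equal to NaN."""
    if set(a.columns) != set(b.columns) or not a.index.equals(b.index):
        return False
    b = b[a.columns]  # align column order before comparing values
    return bool(np.allclose(a.to_numpy(dtype=float), b.to_numpy(dtype=float),
                            rtol=rtol, equal_nan=True))

# Toy frames standing in for the two versions of the table.
idx = pd.to_datetime(['2000-01-03', '2000-01-04'])
mine = pd.DataFrame({'X_P': [1.0, np.nan], 'X_D': [0.1, 0.2]}, index=idx)
theirs = mine[['X_D', 'X_P']].copy()  # same data, different column order
print(same_contents(mine, theirs))
```

Comparing with a tolerance rather than exact equality avoids false alarms from floating-point round-off when values are written to and read back from a `.csv` file.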

In [None]:
%mkdir -p data/
%cd data
!rm -f data.tgz && rm -rf data ## Instructor's version of the output from this notebook
!wget https://mas-dse-open.s3.amazonaws.com/Stocks/data.tgz
!tar -xf data.tgz ## Extracting data
## Going back to `Final` directory to keep it as our working directory
%cd ../ 
%ls -al data/