# Stock Price Predictor - Exploratory Data Analysis

## Notebook Overview
1. [Introduction](#introduction)
2. [Load Data](#load-data)
3. [Understand Data](#understand-data)
4. [Explore Data](#explore-data)
    - Describe
    - Visualise

<a id="introduction"></a>
# 1. Introduction

An ideal real-world application of machine learning is in the world of trading and investing. It all revolves around predicting what will happen to the price of a stock in the next few minutes or the next few years. Hence, the aim is to **predict**, and there is a **large amount of historical data** available to assist in making a prediction, which lays the foundation for applying machine learning algorithms.

I decided to take on this problem due to my interest in finance and some internship experience at a start-up that focuses on machine learning applications in intraday trading. I believe working on this project will give me a better understanding of the financial markets (and finance in general) and of time-series machine learning problems. As I am new to the world of machine learning and finance, I have decided to carry out technical analysis using the historical data of stocks.

## Problem Statement

**The aim of this project is to predict the long-term price trend of two indices and one stock with at least 90% accuracy.<sup>1</sup>** The stocks and indices will be chosen based on countries of interest and personal preference, and are clearly defined in the [Datasets and Inputs section](#Datasets-and-Inputs). In the end, the ability of the model will be tested by predicting one full year's worth of data. Furthermore, the model's ability to predict how the stocks performed in 2020 (the year of the pandemic) will be observed.

<sup>*1*</sup> *90% accuracy can be taken as 10% mean absolute percentage error.*

## Evaluation Metrics

I will be using two evaluation metrics to understand the model's performance.

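1. **Mean Absolute Percentage Error (MAPE)**: the mean of the absolute errors of the predictions, each expressed as a percentage of the actual value. It is calculated as follows ('MEAN ABSOLUTE PERCENTAGE ERROR (MAPE)', 2006; Glen, 2011):

    $$\text{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|$$

    where $A_t$ is the actual value, $F_t$ is the predicted value, and $n$ is the number of points.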

2. **Percentage Points Correctly Predicted**: the percentage of actual points that lie within the 30-70 percentile range of the predictions (a narrower band than the inter-quartile range).

![Example Graph](images/metric-example-graph.png)

In the example graph above, 3 out of 5 points fall within the 30-70 range. Hence,

$$\text{Percentage Points Correctly Predicted} = \frac{3}{5} \times 100\% = 60\%$$

I came up with this metric to address the problem of predicting over longer intervals. I intend to use it to understand whether the model can make accurate predictions of long-term trends. However, a weakness of the metric is that it can be misleading for predictions with high variability (standard deviation). With a high standard deviation, the predictions cover a larger area, so the probability that the actual value lands within the 30-70 range is higher. Yet a high standard deviation means the model is not following any specific trend (up or down) but is spreading in both directions, so the metric gives an inaccurate picture of what is actually happening.

Hence, a combination of MAPE and Percentage Points will give a better understanding of how the model is performing.
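As a rough illustration of the two metrics, here is a minimal sketch. It assumes that the model produces several predicted values per point in time (an array of shape `(n_samples, n_points)`), so that a 30-70 percentile band can be computed; this input shape, the function names `mape` and `percentage_points_correct`, and the illustration data are all assumptions made for the example, not part of the project code.

```python
import numpy as np


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)


def percentage_points_correct(actual, prediction_samples, lower=30, upper=70):
    """Percentage of actual points inside the lower-upper percentile band.

    prediction_samples is assumed to have shape (n_samples, n_points):
    several predicted values for each point in time.
    """
    actual = np.asarray(actual, dtype=float)
    samples = np.asarray(prediction_samples, dtype=float)
    low = np.percentile(samples, lower, axis=0)    # lower edge of the band
    high = np.percentile(samples, upper, axis=0)   # upper edge of the band
    inside = (actual >= low) & (actual <= high)
    return float(np.mean(inside) * 100)


# Made-up illustration data, not results from this project.
centres = np.array([100.0, 102.0, 104.0, 106.0, 108.0])
spread = np.linspace(-50, 50, 101)       # 101 samples around each centre
samples = centres + spread[:, None]      # shape (101, 5)
actual = centres + np.array([0.0, 10.0, -15.0, 30.0, -40.0])

mape_value = mape(actual, centres)
points_value = percentage_points_correct(actual, samples)
```

In this made-up data, three of the five actual points fall inside the band, mirroring the example graph. Widening `spread` can only keep or raise that score, which is the weakness described above and the reason for reading it alongside MAPE.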

<a id="load-data"></a>
# 2. Load Modules and Data
All the required modules will be loaded here, along with the data from the `CSV` files in the `data` directory.

> **Citation for data**: _Yahoo Finance – stock market live, quotes, business & finance news_ (no date). Available at: https://in.finance.yahoo.com/ (Accessed: 2 October 2020).

In [1]:
import pandas as pd
import numpy as np
import os

To easily access the data for a particular stock or index, a Python dictionary will be created with the `Ticker` names as `keys` and, as `values`, `pandas` DataFrames holding the columns of the corresponding `CSV` file.

In [6]:
# Load stocks data
stock_names = ['^GSPC', '^BSESN', 'AAPL']

data_dir = 'data'
data = {}

for stock in stock_names:
    data[stock] = pd.read_csv(os.path.join(data_dir, stock + '.csv'),
                              parse_dates=True, index_col=['Date'])

In [7]:
data['AAPL'].head()

| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 1980-12-12 | 0.128348 | 0.128906 | 0.128348 | 0.128348 | 0.101261 | 469033600.0 |
| 1980-12-15 | 0.12221 | 0.12221 | 0.121652 | 0.121652 | 0.095978 | 175884800.0 |
| 1980-12-16 | 0.113281 | 0.113281 | 0.112723 | 0.112723 | 0.088934 | 105728000.0 |
| 1980-12-17 | 0.115513 | 0.116071 | 0.115513 | 0.115513 | 0.091135 | 86441600.0 |
| 1980-12-18 | 0.118862 | 0.11942 | 0.118862 | 0.118862 | 0.093777 | 73449600.0 |


<a id="understand-data"></a>
# 3. Understand Data
There are `six` columns (excluding the index column) that define a stock's position on a particular date.

1. `Open`: cash value of the first transaction when the market **opened** on the respective date.
2. `High`: highest value of the stock price on the specific date.
3. `Low`: lowest value of the stock price on the specific date.
4. `Close`: stock price when the market **closed** on the respective date, that is, the cash value of the last transacted price before the market closes.
5. `Adj Close`: the closing price amended to account for corporate actions such as stock splits, dividends, and rights offerings ([Investopedia](https://www.investopedia.com/terms/a/adjusted_closing_price.asp)).
6. `Volume`: the number of shares traded on a specific date.
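The column definitions imply two simple relationships that can be checked on the data: each day's `Open` and `Close` should lie between `Low` and `High`, and `Adj Close` should differ from `Close` by a factor that reflects corporate actions. The sketch below checks both on the first rows of the AAPL data shown earlier; the helper name `ohlc_is_consistent` is hypothetical and only for illustration.

```python
import pandas as pd

# First rows of the AAPL data shown earlier in this notebook.
sample = pd.DataFrame(
    {
        "Open": [0.128348, 0.122210, 0.113281],
        "High": [0.128906, 0.122210, 0.113281],
        "Low": [0.128348, 0.121652, 0.112723],
        "Close": [0.128348, 0.121652, 0.112723],
        "Adj Close": [0.101261, 0.095978, 0.088934],
    },
    index=pd.to_datetime(["1980-12-12", "1980-12-15", "1980-12-16"]),
)


def ohlc_is_consistent(df):
    """Return a boolean Series: True where Low <= Open, Close <= High."""
    return (
        (df["Low"] <= df["Open"]) & (df["Open"] <= df["High"])
        & (df["Low"] <= df["Close"]) & (df["Close"] <= df["High"])
    )


consistent = ohlc_is_consistent(sample)

# Ratio of adjusted to raw close; a value other than 1 indicates that
# the closing price has been amended for that date.
adjustment_factor = sample["Adj Close"] / sample["Close"]
```

In the notebook itself, the same helper could be applied to each frame in `data`, for example `ohlc_is_consistent(data['AAPL'])`, keeping in mind that rows with missing values compare as `False`.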


In [5]:
data['AAPL'].info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10037 entries, 1980-12-12 to 2020-10-01
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       10036 non-null  float64
 1   High       10036 non-null  float64
 2   Low        10036 non-null  float64
 3   Close      10036 non-null  float64
 4   Adj Close  10036 non-null  float64
 5   Volume     10036 non-null  float64
dtypes: float64(6)
memory usage: 548.9 KB
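The `info()` output above lists 10037 entries but only 10036 non-null values in every column, so at least one row contains missing values. A minimal sketch for locating such rows follows; it uses a small made-up frame in place of `data['AAPL']`, and the helper name `rows_with_missing_values` is hypothetical.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for data['AAPL']; the second row is empty.
sample = pd.DataFrame(
    {"Open": [1.0, np.nan, 1.2], "Close": [1.1, np.nan, 1.3]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
)


def rows_with_missing_values(df):
    """Return the rows of df that contain at least one NaN."""
    return df[df.isna().any(axis=1)]


missing = rows_with_missing_values(sample)
```

Calling `rows_with_missing_values(data['AAPL'])` in the notebook would show which date is affected; whether to drop or fill that row is a separate decision not made in this section.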


In [4]:
data['AAPL'].describe()

|  | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| count | 10036.0 | 10036.0 | 10036.0 | 10036.0 | 10036.0 | 10036.0 |
| mean | 9.235847 | 9.333947 | 9.136442 | 9.239224 | 8.699618 | 340923500.0 |
| std | 17.51578 | 17.715709 | 17.313096 | 17.525737 | 17.14675 | 342423300.0 |
| min | 0.049665 | 0.049665 | 0.049107 | 0.049107 | 0.038743 | 1388800.0 |
| 25% | 0.27029 | 0.276741 | 0.264509 | 0.270424 | 0.230265 | 131528700.0 |
| 50% | 0.437589 | 0.446429 | 0.430804 | 0.439018 | 0.369728 | 228256000.0 |
| 75% | 10.730357 | 10.788839 | 10.648214 | 10.7225 | 9.265733 | 424843300.0 |
| max | 137.589996 | 137.979996 | 130.529999 | 134.179993 | 134.179993 | 7421641000.0 |

