<img style="float: right;" width="120" src="../Images/supplier-logo.png">
<img style="float: left; margin-top: 0" width="80" src="../Images/client-logo.png">
<br><br><br>

# Overview

The objective of this notebook is to show a relatively novice Python data scientist how easy it is to perform complex analysis on market data.

The piece of analysis that will be performed is:

1.	Correlate the daily returns of PXD stock against the daily returns of a small basket of securities (SPY, Gold, Natural Gas and Oil)
2.	Perform a regression analysis of PXD against the basket
3.	Extend this to perform a regression analysis of an arbitrary security against the same basket
4.	Export the data to an Excel spreadsheet.

This analysis will touch on the following data science topics:
- The pandas, numpy and matplotlib Python packages
- Importing data from Excel spreadsheets into a pandas DataFrame
- The rows, columns and index of a DataFrame
- Accessing data from a DataFrame along its rows and columns
- Slicing data
- Time series
- Merging multiple DataFrames into a single larger DataFrame
- Exporting DataFrames into an Excel spreadsheet


Users will see how to execute the following operations:
1.	Calculate the daily returns of a security
2.	Correlate the daily returns of securities
3.	Perform a linear regression analysis of the securities' returns


## Import the libraries

To perform almost anything useful in a Python program, the user/programmer/data scientist will need to use a pre-written bundle of Python code, known as a Python package.

Users must `import` these packages into their Python program.

Importing packages into a Python program is analogous to extending any piece of software by adding add-ons/apps to it.

e.g.
- using an Excel Add-in
- adding extensions to a Chrome browser


Among the most commonly used Python packages for data analysis are `pandas`, `numpy` and `matplotlib`, although there are many hundreds of other packages in everyday use.

**Aliases**<br>
It is often the case that users will give a package an alias when it is imported. Users can decide on the names of any aliases they use, but the conventional aliases for the packages we will use are given below.

**Magics**<br>
The line `%matplotlib inline` is an IPython `magic` command.
This magic makes the notebook embed any graphs produced by matplotlib in the notebook itself, rather than opening them in a separate window.

In [None]:
# Load in libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


## Load the data into a DataFrame

Here I am using `pandas.read_excel(...)` to read an Excel spreadsheet into a pandas DataFrame called `df_Basket`.

`io`
> the file being read in
    
`index_col`
> the column of the worksheet we want to use as the index of our DataFrame

`parse_dates`
> instruct the `read_excel` function to convert index values that look like dates into datetime values

`sheet_name`
> the name of the worksheet in the spreadsheet containing the data

In [None]:
# Load in df_Basket
df_Basket = pd.read_excel(io ='../Data/regression.xlsx',
                          parse_dates=True,
                          index_col='Date',
                          sheet_name='Basket')
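
For readers who do not have the spreadsheet to hand, the effect of `index_col` and `parse_dates` can be sketched with a small in-memory table. This sketch uses `pd.read_csv` on made-up text as a stand-in for the worksheet (an assumption for illustration only; the notebook itself reads Excel), because both readers accept these two parameters. The values and the name `df_demo` are hypothetical.

```python
import io
import pandas as pd

# Made-up prices standing in for the 'Basket' worksheet (illustration only)
csv_text = """Date,SPY,Gold
2017-01-03,100.0,50.0
2017-01-04,101.0,51.0
2017-01-05,102.0,52.0
"""

# index_col picks the column to use as the index;
# parse_dates=True asks pandas to parse that index as dates
df_demo = pd.read_csv(io.StringIO(csv_text), index_col='Date', parse_dates=True)

print(type(df_demo.index).__name__)
print(list(df_demo.columns))
```

The `Date` column is no longer an ordinary column: it has become the index, which is what makes the date-based row selection later in the notebook possible.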

## Examine the Data

Usually users will quickly inspect the data they have loaded in, to make sure they have imported it correctly. Inspecting the data is also extremely useful to get a mental image of the size and shape of the data that has been loaded in.

Here are some of the most common operations 

In [None]:
# Enter some functions to get the size and shape of the data here

df_Basket.size
df_Basket.shape
df_Basket.head()
df_Basket.tail()
df_Basket.describe()
df_Basket.describe().transpose()
df_Basket.hist()

## Time Series

Note that the index to the `df_Basket` DataFrame is a date time.
This allows a user to retrieve rows of data using dates and ranges of dates.
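
As a minimal sketch of this kind of date-based selection, the example below builds a small synthetic daily series (the name `df_demo` and its values are hypothetical) and selects rows by year and by month with `.loc`.

```python
import numpy as np
import pandas as pd

# Synthetic daily data running from 2016-12-30 to 2017-02-02 (illustration only)
dates = pd.date_range('2016-12-30', periods=35, freq='D')
df_demo = pd.DataFrame({'price': np.arange(35.0)}, index=dates)

rows_2017 = df_demo.loc['2017']      # every row dated in 2017
rows_jan = df_demo.loc['2017-01']    # every row dated in January 2017

print(len(rows_2017), len(rows_jan))  # 33 31
```

A partial date string such as `'2017'` or `'2017-01'` matches every row that falls inside that period, so there is no need to spell out the first and last day.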


In [None]:
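# for 2017
# (.loc is used because pandas 2.0 and later no longer accept df_Basket['2017'] for row selection)
df_Basket.loc['2017']


# for Jan 2017
df_Basket.loc['2017-Jan']
df_Basket.loc['2017-01']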


## Slicing 
Slicing is a way to extract a subsection of a collection of data.

The same slicing notation is used in Python to get a subsection of many other types of data, e.g. lists, tables and time series.

Because of this, slicing is an important topic.

In Python, collections of data almost always call the first position 0 (ZERO), i.e. NOT 1.

The general syntax for slicing is:

`collection[start : stop : step]`

- `start` is the position to start the slice from
- `stop` is the position AFTER the end of the slice
- `step` is how many positions to advance between items (steps of 2, 3, 4, etc.)

If `step` is left out then a value of 1 is assumed.

In some circles, `step` is referred to as `stride`.

In [None]:
roman_numerals = ['I', 'II', 'III', 'IV', 'V', 'VI', 'VII', 'VIII', "IX", 'X']

# Enter some slices here
print(roman_numerals[3:8])

print(roman_numerals[:8])

print(roman_numerals[3:])

print(roman_numerals[3:8:2])

print(roman_numerals[:8:2])

print(roman_numerals[3::2])

print(roman_numerals[::])

# All but the last one
print(roman_numerals[:-1])

# The last one
print(roman_numerals[-1:])

## Slicing Time Series

Users can slice DataFrames that have a time series index using the same syntax as used with lists, but there is one caveat.
- When using the `numeric` positions of elements in a collection, slices are **UP TO, BUT NOT INCLUDING** the stop position
- When using the `index` labels of elements (e.g. a time series index), slices are **UP TO, AND INCLUDING** the stop label
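
The difference between the two rules can be sketched on a tiny synthetic series (the name `s` and its values are made up for illustration). Position 2 and the label `2017-01-03` refer to the same element, yet only the label-based slice includes it.

```python
import pandas as pd

dates = pd.date_range('2017-01-01', periods=5, freq='D')
s = pd.Series([10, 20, 30, 40, 50], index=dates)

# Positional slice: stops BEFORE position 2
by_position = s.iloc[0:2]

# Label slice: the stop label 2017-01-03 (position 2) IS included
by_label = s.loc['2017-01-01':'2017-01-03']

print(len(by_position))  # 2
print(len(by_label))     # 3
```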


In [None]:
# Enter some time slices here
df_Basket['2017':'2018']
df_Basket['2017-Jan':'2018']
df_Basket['2017-Jan-31':'2018-Oct']

df_Basket['2017-01-12':]
df_Basket[:'2018-Feb']

df_Basket['2017':'2018':30]
df_Basket[::365]


## Correlation

A pandas DataFrame has a `corr()` function that will produce a new DataFrame containing the correlation matrix.
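
To get a feel for what the correlation matrix contains, here is a small sketch on synthetic data (the column names and random values are assumptions for illustration, not the basket data). Columns built as exact multiples of one another should show correlations of +1 or -1, and every column correlates perfectly with itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)

df_demo = pd.DataFrame({
    'x': x,
    'twice_x': 2 * x,                 # moves exactly with x
    'minus_x': -x,                    # moves exactly against x
    'noise': rng.normal(size=200),    # generated independently of x
})

# Pairwise Pearson correlation of every column against every other column
corr = df_demo.corr()
print(corr.round(2))
```

The result is a square, symmetric DataFrame with one row and one column per input column, and 1.0 along the diagonal.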

The correlation of prices for the basket DataFrame is:

In [None]:
# Correlate the basket here
df_Basket.corr()


## Scatter Plots

A scatter matrix can give a user a better feel for a correlation than the numbers alone.

Note that a user can plot an entire DataFrame or they can slice into it.

The plots along the diagonal show the distribution of values in each column.


In [None]:
from pandas.plotting import scatter_matrix

# the entire date range
scatter_matrix(df_Basket, alpha=0.9, figsize=(18,6))

# only for the year 2018 (.loc is needed for row selection by year in pandas 2.0 and later)
scatter_matrix(df_Basket.loc['2018'], alpha=0.9, figsize=(18,6))

## Calculating a daily percentage change

Correlating actual prices usually does not make any sense in this type of analysis. 

Correlating percentage changes in prices does make sense.

A pandas DataFrame has a `pct_change()` function which calculates precisely this for a column or columns of data.
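
As a quick sketch of what `pct_change()` computes, the example below applies it to three made-up prices and compares the result with the manual formula (today's price divided by yesterday's price, minus 1). The names `prices`, `returns` and `manual` are illustrative only.

```python
import pandas as pd

# Made-up prices: up 10% on day 2, then down 10% on day 3
prices = pd.Series([100.0, 110.0, 99.0], name='price')

returns = prices.pct_change()

# The same calculation written out by hand
manual = prices / prices.shift(1) - 1

print(returns)
```

The first value is `NaN` because there is no previous price to compare against; this is why the first row of any returns DataFrame is empty.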

Correlating the percentage change of the basket is extremely easy.

Notice the percentage change gives a stronger correlation than the actual prices.


In [None]:
# Correlate the percentage change of the basket here
df_Basket_Returns = df_Basket.pct_change()

df_Basket_Returns.corr()

scatter_matrix(df_Basket_Returns, alpha=0.9, figsize=(18,6))


## Correlating PXD against the basket

This is also easy to achieve:

- Read the PXD data into a DataFrame.
- Calculate the percentage change for this data.
- Concatenate this DataFrame to the basket and then produce a single correlation.
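
The concatenation step can be sketched on two tiny synthetic DataFrames (all names and values here are made up). With `axis=1`, `pd.concat` places the frames side by side and aligns them on the index, so a date missing from one frame shows up as `NaN` in that frame's columns.

```python
import pandas as pd

idx_left = pd.date_range('2017-01-02', periods=3, freq='D')    # Jan 2 to Jan 4
idx_right = pd.date_range('2017-01-03', periods=3, freq='D')   # Jan 3 to Jan 5

df_left = pd.DataFrame({'PXD': [0.01, -0.02, 0.03]}, index=idx_left)
df_right = pd.DataFrame({'SPY': [0.00, 0.01, -0.01],
                         'Oil': [0.02, -0.01, 0.00]}, index=idx_right)

# Side-by-side concatenation, aligned on the date index
df_both = pd.concat([df_left, df_right], axis=1)

print(df_both.columns.tolist())  # ['PXD', 'SPY', 'Oil']
print(df_both.shape)             # (4, 3)
```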

As expected, the correlation between PXD performance and SPY is much stronger than with the commodities in the basket.

In [None]:
# Load in the PXD data
df_PXD = pd.read_excel(io ='../Data/regression.xlsx',
                       index_col='Date',
                       parse_dates=True,
                       sheet_name = 'PXD')

# Calculate its returns
df_PXD_Returns = df_PXD.pct_change()

# Concatenate the PXD returns to the Basket returns
df_returns = pd.concat([df_PXD_Returns, df_Basket_Returns], axis=1)

# Display the correlation between PXD and the Basket
df_returns.corr()

## Regression Analysis

Regression is an econometric method that lets a user estimate which variables drive another variable. Regression analysis calculates a number of important values.

**Beta, Significance & P-Values**

A **beta** is the expected change in the dependent variable for a 1-unit increase in a specific explanatory variable, holding the other variables constant.
For example, a beta of 0.5 means that if a variable is increased by 1, the dependent variable would be expected to go up by 0.5.
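
To make the idea of a beta concrete, here is a minimal sketch that generates synthetic data with a known beta of 0.5 and then recovers it by ordinary least squares. It uses NumPy's `np.linalg.lstsq` rather than statsmodels, purely to keep the example dependency-light; the data, seed and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

x = rng.normal(size=n)
noise = rng.normal(scale=0.1, size=n)
y = 0.5 * x + noise          # true beta is 0.5 by construction

# Design matrix: a column of ones (the intercept) next to x
X = np.column_stack([np.ones(n), x])

# Ordinary least squares fit
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, beta = coef

print(round(float(beta), 2))
```

The estimated beta lands very close to 0.5 and the intercept close to 0, which is what the definition above predicts for data built this way.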

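The **significance level** is the threshold a user chooses for how much risk of a false positive they will accept: it is the probability of rejecting the null hypothesis (that a variable has no effect) when the null hypothesis is in fact true.
For example, a significance level of 0.05 accepts a 5% chance of flagging an effect that is really just random noise, 0.01 accepts a 1% chance, and so on.

A **p-value** is the probability of seeing an estimate at least as extreme as the one observed if the variable truly had no effect. Equivalently, it is the smallest level of significance at which the null hypothesis would be rejected.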

Beta, significance and p-values are important because they tell users which relationships are meaningful.

For example, we might expect that gold prices affect all stocks. However, when we control for the market (by using SPY) we notice that the reason gold is correlated with all these stocks is that it is also correlated with the market. This gives us better precision, since we see which firms are really affected by gold and which are just affected by the overall market environment.

The regression analysis is performed using the statsmodels `ols` function (ordinary least squares). The `formula` parameter sets the dependent variable and the set of factors. <br><br>
> `PXD` **~** `SPY + Oil + Gold + NG`<br>

In the above expression<br>
> `PXD` is the dependent variable <br>
> `SPY + Oil + Gold + NG` are the set of features in the model.


In [None]:
# Create a regression model
import statsmodels.formula.api as sm
model = sm.ols(formula="PXD ~ SPY + Oil + Gold + NG", data=df_returns)

# fit the model
result = model.fit()

# Produce the regression report
result.summary()

**Note the following**<br>
1) 2 factors have a p-value  (**P>|t|**) < 0.05 : SPY and Oil<br>
Given that PXD is an oil exploration company, its daily returns closely follow the returns of both the S&P 500 index and the price of oil.

2) The PXD beta for stock market returns is relatively high at 0.77, and Oil has a beta of 0.27. This means that for every 1% change in the price of oil, users can expect a 0.27% change in the price of the PXD stock. 

3) Another way of thinking of this, ignoring the intercept and the error term, is 
> PXD ≈ `(0.772 X SPY) + (0.269 X Oil) + (0.046 X Gold) + (0.045 X NG)`

We can turn the above statements into a function.

In [None]:
# Create a regress function here
def regress(stock, df):
    formula = " ~ SPY + Oil + Gold + NG"
    return sm.ols(formula=stock+formula, data=df).fit()



## Extract Data from the report

Use the `regress()` function to
- display the report
- display only the p-values
- display only the p-values less than 0.05

And most importantly, use the `params` attribute of the OLS result to display the betas of the `features` whose p-value is less than 0.05.
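
The filtering technique itself is plain pandas boolean indexing, sketched below on two hand-written Series. The numbers are hypothetical placeholders, not output from the PXD regression; only the index labels mirror the factors used in this notebook.

```python
import pandas as pd

factors = ['Intercept', 'SPY', 'Oil', 'Gold', 'NG']

# Hypothetical values for illustration only (not real regression output)
params = pd.Series([0.0, 0.8, 0.3, 0.05, 0.04], index=factors)
pvalues = pd.Series([0.60, 0.001, 0.01, 0.40, 0.20], index=factors)

# A Series of True/False values, one per factor
mask = pvalues < 0.05

# Keep only the betas whose p-value passed the test
significant = params[mask]

print(significant.index.tolist())  # ['SPY', 'Oil']
```

Because `params` and `pvalues` share the same index, the True/False mask built from one can be used directly to select rows from the other.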



In [None]:
# run the report
report = regress("PXD", df_returns)

# produce the summary
report.summary()

# display all p-values
report.pvalues 

# display all p-values < 0.05
report.pvalues < .05

# display the betas of the `features` whose p-value is less than 0.05
truthSeries = report.pvalues < .05
report.params[truthSeries]


## Re-write the function

To display only those factors with a p-value less than some arbitrary value

In [None]:
# Re-Factor the regress function
def regress(stock, df, pval):
    result = sm.ols(formula=stock+" ~ SPY + Oil + Gold + NG", data=df).fit()
    return pd.DataFrame(result.params[result.pvalues < pval],
                        columns=[stock])
 
# Execute the function
df_regress = regress(stock = "PXD", df = df_returns, pval = 0.05)


# Display the report
df_regress

# Exercise

## Perform this analysis for any arbitrary stock

Rather than importing data for a single stock, we can easily import stock data for all companies in the S&P 500 index. 
We will follow the same “pattern”:
- Load a DataFrame that represents the basket.
- Load a DataFrame of all SP500 stocks.
- Calculate the daily returns and concatenate both DataFrames.
- Calculate the Betas for ALL stocks when regressions are performed against the basket.
- Save the results to a Spreadsheet.

The first 3 steps are:

In [None]:
# Load in df_Basket
df_Basket = pd.read_excel(io ='../Data/regression.xlsx',
                          parse_dates=True,
                          index_col='Date',
                          sheet_name = 'Basket')

# Load in the Stock Prices
df_Stocks = pd.read_excel(io = '../Data/regression.xlsx',
                          parse_dates=True,
                          index_col='Date',
                          sheet_name = 'StockPrices')

# Calculate the daily returns for the Basket and for the Stocks
df_Basket_Returns = df_Basket.pct_change()
df_Stock_Returns = df_Stocks.pct_change()

# Concatenate both DataFrames
df_returns = pd.concat([df_Stock_Returns,df_Basket_Returns],axis=1)

## Calculate Betas for ALL stocks with p-value < 0.05


A very simple loop achieves this.<br>

**Note**<br>
1) The slice `df_returns.columns[:-4]` returns all BUT the last 4 columns, as the last 4 columns are our basket.

2) We are progressively concatenating the results of the regression into a single DataFrame containing all of the results.
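
How the concatenated result takes shape can be sketched with made-up per-stock results (the tickers `AAA`, `BBB`, `CCC` and all values are hypothetical). This sketch collects the individual frames in a list and concatenates once at the end, which is a common alternative to concatenating inside the loop; it is not the approach used in the cell below, and either gives the same layout.

```python
import pandas as pd

# Hypothetical significant betas per stock (illustration only)
results = {
    'AAA': pd.Series({'SPY': 0.9, 'Oil': 0.2}),
    'BBB': pd.Series({'SPY': 1.1}),
    'CCC': pd.Series({'SPY': 0.7, 'Gold': 0.1}),
}

# One single-column DataFrame per stock, named after the stock
frames = [betas.to_frame(name=symbol) for symbol, betas in results.items()]

# Side-by-side concatenation; factors missing for a stock become NaN
df_all = pd.concat(frames, axis=1, sort=False)

print(df_all.shape)  # (3, 3)
```

Each stock becomes a column and each factor a row, so a `NaN` simply means that factor was not significant for that stock.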


In [None]:
# Create an empty DataFrame
df_all_Betas = pd.DataFrame()

# Loop over every stock's returns and calculate the betas from a regression 
for symbol in df_returns.columns[:-4]:
    df_regress = regress(stock = symbol, df = df_returns, pval = 0.05)
    df_all_Betas = pd.concat([df_all_Betas,df_regress], sort=False, axis=1)

# Display the shape
df_all_Betas.shape

## Export the DataFrame to a Spreadsheet


In [None]:
# Create a writer; the with-block closes it and writes the file on exit
# (ExcelWriter.save() was removed in pandas 2.0)
with pd.ExcelWriter('../Output/Sp500_Betas.xlsx') as writer:
    # Write all of the betas to a single worksheet named 'SP500'
    df_all_Betas.to_excel(writer, sheet_name='SP500')