# Introduction

The goal of this tutorial is to walk you through some of the core modules used in Python for data analysis.  We're going to run through a simple example to get everyone on the same page.

## Jupyter

This document is a Jupyter Notebook, a tool for interactively running code interspersed with text and output.  You can create notebooks in a number of different programming languages like Python, R, or Julia.  Let's take a basic example using Python:

In [None]:
print("hello world")

You can modify the print statement above and rerun the corresponding "cell" to have it print whatever you want.

Jupyter notebooks are great for classes like this because you can run the examples on your own machine while following along live.

Before we get started, we're going to need to import a series of modules or libraries that will be used throughout the rest of the tutorial.  Modules are collections of pre-defined Python functions (and other objects) which you can use in your scripts.  Libraries are larger collections of modules.  We'll mostly be dealing with libraries here, though the distinction isn't important.

We import libraries as follows:

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

It is also possible to import particular functions from a given module, for instance:

In [None]:
from numpy.linalg import norm

X = np.random.normal(size=(10, 10))
norm(X)
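
For a matrix, `norm` defaults to the Frobenius norm, i.e. the square root of the sum of squared entries.  Here is a small sketch that checks this by hand (the seed and the names `frob` and `manual` are only for illustration):

```python
import numpy as np
from numpy.linalg import norm

rng = np.random.default_rng(0)        # seeded so the result is reproducible
X = rng.normal(size=(10, 10))

frob = norm(X)                        # default for a 2-D array: Frobenius norm
manual = np.sqrt(np.sum(X ** 2))      # same quantity computed directly

print(frob, manual)
```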

# Pandas

We're going to spend most of our time using pandas, a data analysis library that contains many of the tools you'll want for working with data in Python.  Let's start by working with some real data:

## DataFrames

A dataframe is one of the core objects used in pandas.  It is essentially a matrix with additional metadata associated with the rows and columns.  For instance, the dataframe might have an index which corresponds to days of the week, while each column corresponds to a different asset's returns.

One great thing about pandas is how easy it is to load data into Python.  You can even load data from the web.  For instance, let's use the following command to load a dataframe I've posted to GitHub:

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/lbybee/pytutorial/master/49_ind_portfolios.csv")
df

What I've done is import a CSV table from the internet into a dataframe!  You can do a lot more than this with pandas.  It can read and write many file formats, and I'd encourage you to check out all the IO options here:

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Now, what do we have in terms of data?

These are daily returns for a series of industry portfolios taken from Kenneth French's website:

https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html

The first column is the date and the remaining columns are each portfolio's returns.

How do I work with this dataframe?

I can access each column as follows:

In [None]:
df["Agric"]

This returns a __series__, which is just a column from a dataframe.  I can also look at multiple columns at once:

In [None]:
df[["Agric", "Food"]]

Note that I specified these columns with a list as opposed to a single string.  This returned a sub-dataframe (a sub-matrix of the original) instead of a series.
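
The difference between a string and a list shows up even with one column.  A minimal sketch on made-up data (the `toy` dataframe and its values are hypothetical, not the industry portfolios):

```python
import pandas as pd

# Hypothetical toy data standing in for the industry returns
toy = pd.DataFrame({"Agric": [0.1, -0.2, 0.3], "Food": [0.0, 0.5, -0.1]})

one_col = toy["Agric"]        # single string -> Series
sub_df = toy[["Agric"]]       # list with one name -> DataFrame with one column

print(type(one_col).__name__, one_col.shape)   # Series (3,)
print(type(sub_df).__name__, sub_df.shape)     # DataFrame (3, 1)
```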

## Numpy

Pandas is built on top of numpy, a module I introduced in our previous TA session.

numpy is the core library used for linear algebra in Python and has many tools you'll end up using throughout the class.

The dataframe above is really just a wrapper around numpy, and I can access the raw values if I want:

In [None]:
df["Agric"].values

I can do many linear algebra operations on the dataframe itself, and these will behave as you would expect:

In [None]:
df[["Agric", "Food", "Soda", "Beer"]].T.dot(df["Agric"])

What I've done here is taken the dot product between a series of columns, "Agric", "Food", "Soda", and "Beer", with the column "Agric".
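
To see that pandas is doing ordinary linear algebra here, this sketch compares the dataframe result with the same product on the raw numpy arrays.  The `toy` values are invented so the answer is easy to check by hand:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Agric": [1.0, 2.0, 3.0], "Food": [0.5, -1.0, 2.0]})

# DataFrame version: one dot product per column, labelled by column name
via_pandas = toy[["Agric", "Food"]].T.dot(toy["Agric"])

# The same numbers from the raw numpy arrays
via_numpy = toy[["Agric", "Food"]].values.T @ toy["Agric"].values

print(via_pandas)   # Agric: 14.0, Food: 4.5
```

The only difference is that the pandas result keeps the column labels, which is usually what you want.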

## Data Operations

I can do many standard transformations to my dataframe and they'll behave as you expect:

In [None]:
np.square(df[["Agric", "Food"]])

In some cases the operations are contained within the dataframe:

In [None]:
df[["Agric"]].cumsum()

## Dates

Our date column behaves differently from the other columns.  How should we treat it?

First, we can convert the date column to a datetime type, which lets us perform date-aware operations (we'll see more on these later):

In [None]:
df["date"] = pd.to_datetime(df["date"], format="%Y%m%d")
df["date"]

I can now access various datetime information from our date variable:

In [None]:
df["date"].dt.year

In [None]:
df["date"].dt.dayofweek
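
Here is the same conversion on a few made-up dates, so you can check the accessors against a calendar.  I'm assuming the dates are stored as `YYYYMMDD` integers, which is what the `format="%Y%m%d"` above suggests:

```python
import pandas as pd

# Hypothetical integer dates in YYYYMMDD form
raw = pd.Series([20200102, 20200103, 20200106])

# Converting to strings first makes the format string unambiguous
dates = pd.to_datetime(raw.astype(str), format="%Y%m%d")

years = dates.dt.year            # calendar year of each date
weekdays = dates.dt.dayofweek    # Monday=0, ..., Sunday=6

print(weekdays.tolist())         # [3, 4, 0]: Thursday, Friday, Monday
```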

You can explore more on this here:

https://pandas.pydata.org/docs/user_guide/timeseries.html

I'd recommend checking this out if you intend to use Python long-term.

## The Index

If I print my dataframe again, I will see a "column" on the left corresponding to a series of integers:

In [None]:
df

This corresponds to the index for my dataframe.  I'm not going to spend too much time on indexes here, though they allow you to do some cool stuff.  One thing we may want to do is set our date as the index:

In [None]:
dfdt = df.set_index("date")
dfdt

Now, all my columns should have the same type, and I can perform some dataframe wide operations:

In [None]:
dfdt.T.dot(dfdt)

Pandas can often handle much of the indexing and date work when you initially load the data:

In [None]:
dfdt = pd.read_csv("https://raw.githubusercontent.com/lbybee/pytutorial/master/49_ind_portfolios.csv",
                   index_col="date", parse_dates=True)
dfdt
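
When you let `read_csv` do the parsing, it's worth checking that the index really came back as dates.  A sketch that runs offline on a made-up CSV string (the two rows and the ISO date format are my assumptions, not the real file):

```python
import io
import pandas as pd

# Hypothetical CSV text with the same layout: a date column, then returns
csv_text = """date,Agric,Food
2020-01-02,0.5,-0.1
2020-01-03,-0.3,0.2
"""

toy = pd.read_csv(io.StringIO(csv_text), index_col="date", parse_dates=True)

print(toy.index.dtype)   # a datetime64 dtype rather than object or int
print(toy.dtypes)        # the remaining columns are all floats
```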

## Indexing

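Dataframes allow for many ways to access subsets of the data.  For instance, let's say I want to look only at returns in January.  The filter below keeps every January in the sample; to restrict it to a single year such as 2020, I would also add a condition on `dfdt.index.year`: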

In [None]:
dfdt[dfdt.index.month == 1]

Alternatively, maybe I want to access all the rows where "Agric" returns are positive:

In [None]:
dfdt[dfdt["Agric"] > 0]

This is referred to as boolean indexing, because the `dfdt["Agric"] > 0` term yields a series of truth values:

In [None]:
dfdt["Agric"] > 0

I can also do this using multiple columns:

In [None]:
dfdt[(dfdt["Agric"] > 0) & (dfdt["Food"] > 0)]

This returns all the rows where both "Agric" and "Food" returns are positive.  The `&` is used for "and", and if I want to do "or", I can use `|`.
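
A small sketch of both operators on invented data.  Note the parentheses around each condition: `&` and `|` bind more tightly than `>`, so leaving them out gives an error or the wrong answer.  The same pattern works for date conditions on the index:

```python
import pandas as pd

idx = pd.to_datetime(["2019-12-31", "2020-01-02", "2020-01-03", "2020-02-03"])
toy = pd.DataFrame({"Agric": [0.4, -0.2, 0.3, 0.1],
                    "Food": [0.1, 0.5, -0.1, 0.2]}, index=idx)

both_up = toy[(toy["Agric"] > 0) & (toy["Food"] > 0)]      # "and"
either_up = toy[(toy["Agric"] > 0) | (toy["Food"] > 0)]    # "or"

# Date conditions combine the same way: January 2020 only
jan_2020 = toy[(toy.index.year == 2020) & (toy.index.month == 1)]

print(len(both_up), len(either_up), len(jan_2020))   # 2 4 2
```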

I can also index as we did with numpy:

In [None]:
dfdt[:10]

In [None]:
dfdt[9:12]

If you're interested in checking out more about indexing, I'd recommend the pandas user guide page:

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

## Summaries and Aggregates

Now that we've explored how to manipulate the dataframe, let's put together some summaries of the data.

A basic command that you can run to get a good sense of your data is `describe`:

In [None]:
dfdt.describe()

I can also run many other standard operations here:

In [None]:
dfdt[["Agric"]].mean()

In [None]:
dfdt[["Agric"]].std()

In [None]:
dfdt[["Agric", "Food"]].aggregate(["mean", "std"])

## Groupby

A very useful tool to understand in pandas is groupby.  Groupby is a way to apply operations to subsets of your data in a systematic way.  For instance, what if I want to get the mean return for each asset for each month?

In [None]:
dfdt.groupby(pd.Grouper(freq="M")).mean()

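Since we are using a datetime index, I can group by this `pd.Grouper(freq="M")` object.  The `freq="M"` specifies what date frequency I want to use (in this case months), but I could specify other options, e.g. `freq="Y"` for years.  Note that pandas 2.2 and later prefer the aliases `"ME"`, `"QE"`, and `"YE"` for month-end, quarter-end, and year-end, so you may see a deprecation warning with the older spellings used here.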
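
The frequency aliases accepted by `pd.Grouper` have shifted across pandas releases, so one alternative (a swapped-in technique, not the call used above) is to group on monthly periods built from the index.  A sketch on made-up daily data where the answer is obvious:

```python
import numpy as np
import pandas as pd

# Hypothetical daily data: January 2020 is all 1.0, February 2020 is all 3.0
idx = pd.date_range("2020-01-01", "2020-02-29", freq="D")
toy = pd.DataFrame({"Agric": np.where(idx.month == 1, 1.0, 3.0)}, index=idx)

# Group by calendar month via monthly periods instead of pd.Grouper
monthly_mean = toy.groupby(toy.index.to_period("M")).mean()

print(monthly_mean)
```

The result is indexed by month (a period such as `2020-01`) rather than by a month-end date, which is the main visible difference from the `pd.Grouper` version.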

I can also group by columns.  Let's assume we have another column indicating whether or not there is an FOMC meeting that day (simulated at random here, purely for illustration):

In [None]:
dfdt["FOMC"] = np.random.randint(0, 2, dfdt.shape[0])
dfdt.groupby("FOMC").mean()

I can also group by both values:

In [None]:
dfdt.groupby([pd.Grouper(freq="Q"), "FOMC"]).mean()

This gives me the mean for each quarter/FOMC meeting pair.

Sometimes I may want to perform a groupby operation and update the original dataframe.  For instance, perhaps I want to subtract the monthly mean from each return series.  I can do this with the `transform` operation:

In [None]:
dfdt.groupby(pd.Grouper(freq="M")).transform(lambda x: x - x.mean())

This returns a dataframe of the same shape as the input.  The `lambda x: x - x.mean()` is a way to quickly define a function inline, in this case a function that demeans each group.
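
A quick way to convince yourself that `transform` did the right thing is to recompute the group means afterwards; they should all be zero up to floating point error.  A sketch on simulated data (the seed, dates, and the monthly-period grouping key are my choices for illustration):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", "2020-02-29", freq="D")
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Agric": rng.normal(size=len(idx))}, index=idx)

months = toy.index.to_period("M")     # grouping key: one label per calendar month

demeaned = toy.groupby(months).transform(lambda x: x - x.mean())

# Same shape as the input, and each month's mean is now (numerically) zero
check = demeaned.groupby(months).mean()
print(demeaned.shape, check["Agric"].abs().max())
```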

Groupby is extremely powerful and I've only scratched the surface here.  I'd encourage you to check out the corresponding user guide for more:

https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

## Merging

I may at some point have multiple dataframes loaded and want to merge them.  For instance, perhaps I have a number of possible predictor variables for returns:

In [None]:
ffdf = pd.read_csv("https://raw.githubusercontent.com/lbybee/pytutorial/master/FF3.csv",
                   index_col="date", parse_dates=True)
dfdt = dfdt.drop(["FOMC"], axis=1, errors="ignore")
ffdf = dfdt.merge(ffdf, right_index=True, left_index=True)
print(ffdf.columns)
ffdf[["MktmRF", "SMB", "HML"]]

Merging can get complex.  Here it is simple because of how the indices are defined, but I'd recommend reading the documentation and verifying that the merge does what you expect (by examining the data) when you start doing this:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
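
One simple check is to compare row counts before and after.  By default `merge` does an inner join, so dates that appear in only one dataframe are silently dropped.  A sketch with two tiny made-up dataframes:

```python
import pandas as pd

left = pd.DataFrame({"Agric": [0.1, 0.2, 0.3]},
                    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]))
right = pd.DataFrame({"MktmRF": [1.0, 2.0]},
                     index=pd.to_datetime(["2020-01-03", "2020-01-06"]))

# Default how="inner": only dates present in both frames survive
inner = left.merge(right, left_index=True, right_index=True)

# how="left" keeps every row of `left` and fills missing matches with NaN
outer_left = left.merge(right, left_index=True, right_index=True, how="left")

print(len(left), len(inner), len(outer_left))   # 3 2 3
```

If the inner merge is shorter than you expected, the dates in the two sources don't line up and it's worth finding out why.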

# Matplotlib

So far we've put together some basic summaries of our data and explored how to manipulate dataframes.  However, often the best way to get a sense for a new data set is to draw some plots.

To do this we'll use the matplotlib library imported above:

https://matplotlib.org/

There are many other cool libraries available that I'd encourage you to check out as well, e.g. seaborn:

https://seaborn.pydata.org/

## Time Series Plots

Let's start by generating some basic time series plots to see how our returns behave.  I can start by plotting the cumulative returns for `"Agric"`:

In [None]:
plt.plot(ffdf["Agric"].cumsum())
plt.xlabel("Date")
plt.ylabel("Cumulative Return")

What if I want to plot multiple return series alongside each other?

In [None]:
plt.plot(ffdf["Agric"].cumsum(), label="Agric")
plt.plot(ffdf["Food"].cumsum(), label="Food")
plt.plot(ffdf["Autos"].cumsum(), label="Autos")
plt.plot(ffdf["Banks"].cumsum(), label="Banks")
plt.xlabel("Date")
plt.ylabel("Cumulative Return")
plt.legend()

We only need to specify the column because our index is a date.  Alternatively, I could tell matplotlib the `x` and `y` values separately:

In [None]:
plt.plot(ffdf.index, ffdf["Agric"].cumsum().values)
plt.xlabel("Date")
plt.ylabel("Cumulative Return")

## Scatter Plots

Does "Agric" have any market beta?  Let's look at a scatter plot to get a sense of the correlation:

In [None]:
plt.scatter(ffdf["MktmRF"], ffdf["Agric"])
plt.xlabel("MktmRF")
plt.ylabel("Agric")
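
To put a number on what the scatter plot shows, beta is the covariance with the market divided by the market's variance, which is also the slope of a straight line fitted through the points.  A sketch on simulated returns (the `true_beta` of 0.8 and the noise level are assumptions for the simulation, not estimates from our data):

```python
import numpy as np

rng = np.random.default_rng(0)
true_beta = 0.8                                   # assumed value for the simulation
mkt = rng.normal(size=2000)                       # simulated market excess returns
asset = true_beta * mkt + rng.normal(scale=0.5, size=2000)

# Beta as covariance over variance ...
beta_cov = np.cov(mkt, asset)[0, 1] / np.var(mkt, ddof=1)

# ... which matches the slope of a fitted line through the scatter
slope, intercept = np.polyfit(mkt, asset, deg=1)

print(beta_cov, slope)
```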

## Built-In Pandas Plotting

Pandas can also create a number of different plots on its own.

In [None]:
ffdf[["Agric", "Food", "Autos", "Banks"]].boxplot()

In [None]:
ffdf[["Agric", "Food", "Autos", "Banks"]].hist()

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(ffdf[["Agric", "Food", "Autos", "Banks", "MktmRF"]], diagonal="kde")

A full list of the pandas built-in plotting options is available here:


https://pandas.pydata.org/docs/user_guide/visualization.html

## Heatmaps

Matplotlib is a fairly established library with many useful tools and tricks.  You can spend a considerable amount of time refining your plots to get exactly what you want.  Let me show you one last cool example before we move forward:

In [None]:
cov = dfdt.cov()
fig, ax = plt.subplots(figsize=(10, 10))
# Symmetric color limits so that zero covariance maps to the middle of the colormap
vlim = np.abs(cov.values).max()
heatmap = ax.pcolor(cov.values, cmap=plt.cm.seismic, vmin=-vlim, vmax=vlim)
# pcolor draws cell i between i and i + 1, so center the ticks at i + 0.5
ax.set_xticks(np.arange(cov.shape[1]) + 0.5)
ax.set_xticklabels(cov.columns, rotation=90, fontsize="small")
ax.set_yticks(np.arange(cov.shape[0]) + 0.5)
ax.set_yticklabels(cov.index, fontsize="small")
plt.colorbar(heatmap)
plt.show()

# Statsmodels

While plots and descriptive summaries are nice, we often want fuller statistical models to understand asset prices.  There are a number of useful statistical libraries available in Python.  I'm just going to touch on two here and introduce you to methods for accessing more.

The first of these is statsmodels:

https://www.statsmodels.org/stable/index.html

## Regression

Let's fit a regression of one of our return series on our predictor variables:

In [None]:
import statsmodels.formula.api as smf

mod = smf.ols("Agric ~ MktmRF + HML + SMB", data=ffdf)
fit = mod.fit()
fit.summary()
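
The formula `Agric ~ MktmRF + HML + SMB` adds an intercept by default and then solves a least squares problem.  To show just the point estimates without statsmodels, here is a sketch using plain numpy on simulated data (the coefficients and noise level are assumptions for the simulation; statsmodels additionally gives you standard errors, t-statistics, and so on):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))                    # stand-ins for MktmRF, HML, SMB
true_coef = np.array([1.0, 0.5, -0.3])         # assumed values for the simulation
y = 0.2 + X @ true_coef + rng.normal(scale=0.1, size=n)

# Add a column of ones so the first coefficient is the intercept
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

print(coef)   # intercept first, then one slope per predictor
```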

# Sklearn

Statsmodels has many rigorous statistical methods.  For more "machine learning" applications, however, I'd recommend sklearn as a first stop:

https://scikit-learn.org/stable/

## Lasso/Penalized Models

The lasso is a useful tool for high-dimensional data sets.  If I have a large number of possible predictors, many of which have no association with my outcome variable, I can run a lasso to perform selection.  I won't go into the details of how this works but just show an example with our data above:

In [None]:
from sklearn import linear_model

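# Give the noise columns string names: recent scikit-learn versions reject a mix of int and str column names
noise = pd.DataFrame(np.random.normal(scale=0.1, size=(ffdf.shape[0], 100)), index=ffdf.index,
                     columns=[f"noise{i}" for i in range(100)])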
ldf = ffdf[["Agric", "MktmRF", "HML", "SMB"]].merge(noise, right_index=True, left_index=True)
mod = linear_model.Lasso(alpha=0.00075)
mod.fit(ldf.drop(["Agric"], axis=1), ldf["Agric"])
coef = pd.Series(mod.coef_, index=[c for c in ldf.columns if c != "Agric"])
coef[coef != 0]
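
If you're curious why the lasso sets coefficients to exactly zero, the key step is "soft-thresholding": any coefficient whose signal is smaller than the penalty gets snapped to zero.  Below is a bare-bones sketch of that idea using coordinate descent in numpy.  It is not sklearn's implementation (there is no intercept, no convergence check, and the function names are mine), and the data are simulated so that only the first column matters:

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return exactly 0 inside the band."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_cd(X, y, alpha, n_iter=100):
    """Coordinate descent for (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 (no intercept)."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back
            partial = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial / n
            z = X[:, j] @ X[:, j] / n
            w[j] = soft_threshold(rho, alpha) / z
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 21))                     # column 0 matters, the rest are noise
y = 1.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

coef = lasso_cd(X, y, alpha=0.1)
print(np.flatnonzero(coef))                        # indices of the surviving predictors
```

Notice that the surviving coefficient is shrunk below its true value of 1; that bias is the price of the selection.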

# Other Libraries and Beyond

I've only touched on the very basics of what's possible in Python.  There are many other specialized libraries out there containing various useful functions.  As you explore these you'll want to get familiar with your package manager -- a tool for installing modules/libraries.

If you used Anaconda, you should be able to install packages by running:

`conda install <package>`

where `<package>` is the name of the library.  Otherwise, the standard package manager for Python is pip:

`pip install <package>`

In either case you can check out the documentation for more on this here:

https://docs.anaconda.com/anaconda/user-guide/tasks/install-packages/

https://pip.pypa.io/en/stable/reference/pip_install/
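
After installing, you can check from inside Python which version you ended up with.  A small sketch using the standard library (the helper name `installed_version` is mine; note that sklearn is distributed under the name `scikit-learn`):

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string for package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

for name in ["numpy", "pandas", "matplotlib", "statsmodels", "scikit-learn"]:
    print(name, installed_version(name))
```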