**Fin 585R**  
**Diether**  
**Python/Pandas Introduction**<br><br>

**Instructions**

+ Please read through my notes, and run each of the code cells.<br><br> 

+ You can run a cell of code by pressing SHIFT and ENTER at the same time.<br><br>


**I. Python/Pandas in Empirical Finance**

**A. Role of Python/Pandas in this Course**

+ Goal: Develop the Python/Pandas skills and tools necessary to engage in empirical research in Finance.<br><br>  

+ More specifically: use Python/Pandas to<br><br>

  - Test economic models<br><br>
  
  - Construct portfolios (container for financial assets)<br><br>
  
  - Create and backtest trading strategies.<br><br>
  
  - Estimate regressions: time series and panel regressions.<br><br>
  
+ I will focus on the most important features and programming constructs in Python/Pandas needed to accomplish this goal.<br><br>


**B. Example: Portfolio Construction and Trading Strategies**

+ A core quant finance and academic skill is portfolio construction and backtesting.<br><br>

+ All trading strategies are implemented as portfolios (container for financial assets).<br><br>

+ Portfolio construction and backtesting can be broken into five general steps:<br>

  1. Data preparation.<br><br>

  2. Creation of the portfolio formation variable.<br><br>

  3. Binning the stock return data based on the formation variable.<br><br>

  4. Portfolio creation.<br><br>

  5. Estimating historical performance of the strategy.<br><br>

+ Need to learn enough Python/Pandas so you can tackle each step for portfolio strategies you're interested in testing.<br><br>
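
+ The five steps above can be sketched on a tiny made-up dataset. Everything here is an assumption for illustration: the tickers, the numbers, the choice of book-to-market as the formation variable, and the two-bin median split. A real backtest would also make sure the formation variable is measured before the return period.

```python
import pandas as pd

# Step 1: data preparation (hypothetical returns and fundamentals for four made-up stocks).
data = pd.DataFrame({
    'date':   ['2020-01'] * 4 + ['2020-02'] * 4,
    'ticker': ['A', 'B', 'C', 'D'] * 2,
    'ret':    [0.02, -0.01, 0.03, 0.00, 0.01, 0.04, -0.02, 0.03],
    'book':   [10.0, 20.0, 30.0, 40.0, 40.0, 30.0, 20.0, 10.0],
    'mktcap': [100.0] * 8,
})

# Step 2: formation variable (book-to-market, an illustrative choice).
data['bm'] = data['book'] / data['mktcap']

# Step 3: bin stocks within each date: above that date's median is 'high', otherwise 'low'.
median_bm = data.groupby('date')['bm'].transform('median')
data['bin'] = (data['bm'] > median_bm).map({True: 'high', False: 'low'})

# Step 4: portfolio creation: equal-weighted average return of each bin on each date.
port = data.groupby(['date', 'bin'])['ret'].mean().unstack()

# Step 5: historical performance: average return of the high-minus-low portfolio.
spread = port['high'] - port['low']
avg_spread = spread.mean()
print(port)
print(avg_spread)
```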


**II. Why Python/Pandas?**

+ Why Python/Pandas? Why not Stata, R, SAS, or something else?<br><br>

+ All of those languages are used in empirical finance research.<br><br>

+ Python/Pandas has some important advantages:<br><br>

  - Very popular in finance world now.<br><br>

  - Well designed and popular general purpose programming language.<br><br>
  
  - Free<br><br>
  
  - Relatively easy to learn. <br><br>
  
  - Used in lots of different domains; it's not narrowly confined to the domain of quantitative finance or even scientific computing or data science.<br><br>
  

**III. Overview of Basic Concepts and Features**

+ Main purpose of this `notebook` is to introduce the `Pandas` `library`.<br><br>

+ `Pandas` is the main library for this course.<br><br>

+ Will overview core concepts and features of `pandas` for quant and academic finance.<br><br>

+ Will cover the concepts and features with more detail as we move forward.<br><br>


**A. Accessing the Pandas Library**

+ To use the Pandas library we have to tell Python that we want access to it.<br><br>

+ You make `pandas` accessible by using the `import` command.<br><br>

+ When importing the pandas library, you also associate the library with a namespace: **use pd**<br><br>

  - Just convention<br><br>

  - Given the `pd` namespace $\rightarrow$ Pandas functions look like `pd.function`.<br><br>
  
  - For example, `pd.read_csv`. <br><br>

  - `pd` namespace is not required, but is a strong convention.<br><br>

  - Namespaces make it clear which library a given function or command comes from, as long as each library you use has its own namespace.<br><br>


+ **Code to import pandas:**

In [1]:
import pandas as pd

<br>**B. Pandas core Data Structures: Dataframes and Series**

+ Core data structure/object: the **dataframe**.<br><br>

+ Dataframe: container for holding a rectangular array of mixed-type data.<br><br>

  - Columns: represent different variables (e.g., the stock price or earnings of Google).<br><br>

  - Rows: represent a given observation for those variables (e.g., January 2009 for Google).<br><br>

+ Dataframe: programming equivalent of a spreadsheet.<br><br> 

+ Each column can be of a different type: integers, floating point numbers, complex numbers, or strings. <br><br>


**Dataframes: Store Data and Provide Useful Functions**

+ Pandas provides programmers with many ways to create new data, transform and combine data, aggregate data, or display data. <br><br>

  - Example: Pandas has a built-in operator (`/`) that allows you to divide one column by another column element by element.<br><br>

  - Example: Built-in mean function that computes the sample average of each column.<br><br>
  
  - Higher-level functions that allow us to easily create graphs or plots of data.<br><br>
  
+ Many of these functions are built into the dataframe.<br><br>

+ Built-in functions are called `methods`.<br><br>

+ Dataframe is an object that provides data storage and useful functions.<br><br>
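
+ A minimal sketch of the two examples above (the division operator and the `mean` method). The dataframe `toy` and its numbers are hypothetical and only there so the code runs on its own:

```python
import pandas as pd

# Hypothetical numbers, chosen only for illustration.
toy = pd.DataFrame({'ebit': [10.0, 30.0], 'revenue': [100.0, 200.0]})

# The / operator divides one column by the other, element by element.
ratio = toy['ebit'] / toy['revenue']

# mean is a method: a function built into the dataframe object.
col_means = toy.mean()

print(ratio)
print(col_means)
```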


**Series**

+ `Series` in pandas is the name for a single column of data.<br><br>
  
+ If you grab one column of a dataframe, you're grabbing a series.<br><br>

+ `Dataframes` and `Series` behave very similarly; for our purposes, the difference is mostly just a technical distinction between a one-dimensional and a two-dimensional array.<br><br>


**C. Importing Data and Creating a DataFrame:**

+ Getting data into a `Pandas` dataframe is usually straightforward and easy. <br><br>

+ Pandas easily reads many different data formats: csv files, Excel files, SAS data files, Stata data files, Feather data files, etc.<br><br>

+ In this class, we will primarily use csv files.<br><br>

+ I will highlight other methods.<br><br>

**Example: Reading in Amazon and Hormel Data**

+ Let's read in some data, and create a `dataframe` object.<br><br> 

+ The data we will read into a `dataframe` are annual balance sheet data for Amazon and Hormel.<br><br>

+ The data are in a csv file so we can read in the data using the `read_csv` function.<br><br>

+ The `read_csv` function will automatically create a `dataframe` object containing the data contained in the csv file.<br><br>

+ The `read_csv` function has a lot of options and flexibility (take a look at the [help page for it in Pandas' documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)), but often you don't use any of them (particularly for well formed or non-messy csv files). <br><br>

+ The `read_csv` function can, of course, read files stored on your local machine, but it also has no trouble reading a file stored remotely on a webserver; you just need to provide a URL. <br><br>

+ The code below calls pandas' `read_csv` function and then reads the csv located at the URL in quotes. After reading the file it creates a dataframe and assigns the dataframe to `df`.<br><br>

In [2]:
df = pd.read_csv('https://diether.org/prephd/01-intro.csv')

+ To read in data from non-csv formats you generally invoke a command very similar to `read_csv`. For example, you can read in a `Stata` datafile using the following:
```python
df = pd.read_stata('filename.dta')
```

+ Many other ways to create dataframes. For example, you can convert core Python data structures (like `lists` or `dictionaries`) into `dataframes`.<br><br> 
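
+ A short sketch of converting core Python data structures into a dataframe. The tickers and numbers are made up for illustration:

```python
import pandas as pd

# From a dictionary: keys become column names, values become the columns.
records = {'tick': ['AAA', 'BBB'], 'year': [2020, 2020], 'revenue': [1.5, 2.5]}
from_dict = pd.DataFrame(records)

# From a list of rows: column names are supplied separately.
rows = [['AAA', 2020, 1.5], ['BBB', 2020, 2.5]]
from_list = pd.DataFrame(rows, columns=['tick', 'year', 'revenue'])

print(from_dict)
print(from_list)
```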


**Displaying or Printing out the Data in a Dataframe**

+ The **Jupyter notebook** is a special environment where if you type the name of a dataframe (or other datatypes), it will display the default view of that object (e.g., if the dataframe is small it will display all the data in the dataframe, and if it's large only a truncated view of the data will be displayed). <br><br>

+ If you just write a python program and run it outside of the jupyter notebook environment, then you need to use the `print` function to see any output.

In [9]:
df.head(4)

Unnamed: 0,tick,year,revenue,ebit,capx,debt,mktcap
0,HRL,2000,3675.132,262.607,100.125,145.928,2612.286625
1,HRL,2001,4124.112,300.306,77.129,462.407,3727.24518
2,HRL,2002,3910.314,310.509,64.465,409.648,3229.43192
3,HRL,2003,4200.328,307.024,67.104,395.273,3579.15013


In [11]:
df.round(1)

Unnamed: 0,tick,year,revenue,ebit,capx,debt,mktcap
0,HRL,2000,3675.1,262.6,100.1,145.9,2612.3
1,HRL,2001,4124.1,300.3,77.1,462.4,3727.2
2,HRL,2002,3910.3,310.5,64.5,409.6,3229.4
3,HRL,2003,4200.3,307.0,67.1,395.3,3579.2
4,HRL,2004,4779.9,357.0,80.4,361.5,4323.7
5,HRL,2005,5414.0,418.6,107.1,350.4,4510.2
6,HRL,2006,5745.5,448.9,141.5,350.1,5139.0
7,HRL,2007,6193.0,478.6,125.8,350.0,5485.6
8,HRL,2008,6754.9,509.4,125.9,350.0,4181.1
9,HRL,2009,6533.7,531.8,97.0,350.0,5138.0


**Print function**

+ You can also explicitly print a dataframe out using python's print function.<br><br>

In [None]:
print(df)

**Dataframes and Series**

+ Our `dataframe` is called `df`.<br><br>

+ If we select a column from the `dataframe` it will be of type `Series`.<br><br>

+ We select a single column of a dataframe (a Series) by placing the column's name, in quotes, inside square brackets. Passing a list of column names instead returns a dataframe.<br><br>


In [7]:
df['revenue'].head(3)

0    3675.132
1    4124.112
2    3910.314
Name: revenue, dtype: float64

In [8]:
df[['year','revenue']].head(3)

Unnamed: 0,year,revenue
0,2000,3675.132
1,2001,4124.112
2,2002,3910.314


+ You typically must wrap the column's name in ' ' because most column names are stored as strings.<br><br>

+ You will need to reference columns this way as long as the variable names you use aren't entirely numeric (e.g., an integer). <br><br>

+ In Python, strings can be delimited by either single (' ') or double quotes (" "). <br><br>  
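
+ A small sketch of the points above, using hypothetical dataframes (`toy` and `by_int` are made-up names): single versus double quotes, a single name versus a list of names, and a column whose label is an integer rather than a string.

```python
import pandas as pd

toy = pd.DataFrame({'year': [2000, 2001], 'revenue': [3675.1, 4124.1]})

one_col = toy['revenue']      # single name in brackets: a Series
same_col = toy["revenue"]     # double quotes select the same column
two_dim = toy[['revenue']]    # a list of names: a dataframe

# If a column label is an integer, it is referenced without quotes.
by_int = pd.DataFrame({0: [1, 2], 1: [3, 4]})
first = by_int[0]

print(type(one_col), type(two_dim))
```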

**Checking the Data Type**

+ In Python, there is a `type` function that returns the type of a variable or object. <br><br>

In [13]:
type(df)

pandas.core.frame.DataFrame

In [14]:
type(df['revenue'])

pandas.core.series.Series

<br>**D. Data creation** 

+ A new column in a dataframe is typically created using the assignment operator.<br><br>

+ Like most programming languages, the assignment operator is just the equal sign (`=`) in Python.<br><br>

+ For example, suppose I want to create a new column that measures profit margin. Profit margin is defined as the following (note, ebit is earnings before interest and taxes):

$$
\text{Profit Margin} = \frac{ebit}{revenue}
$$

+ Python/Pandas code for creating profit margin column in the dataframe.<br><br>

In [17]:
df['profit_margin'] = df['ebit'] / df['revenue']
df.head(2)

Unnamed: 0,tick,year,revenue,ebit,capx,debt,mktcap,profit_margin
0,HRL,2000,3675.132,262.607,100.125,145.928,2612.286625,0.071455
1,HRL,2001,4124.112,300.306,77.129,462.407,3727.24518,0.072817


+ Mathematical operations such as addition (+), subtraction (-), multiplication (*), or division (/) are all element-by-element operations between the dataframe columns that are addressed by the code.<br><br>
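
+ A brief sketch of the other operators on a hypothetical dataframe (the numbers are made up). The last line multiplies by a single number, which pandas applies to every row of the column:

```python
import pandas as pd

toy = pd.DataFrame({'revenue': [100.0, 200.0],
                    'ebit': [10.0, 30.0],
                    'capx': [4.0, 5.0]})

# Each operation pairs up the two columns row by row.
toy['ebit_plus_capx'] = toy['ebit'] + toy['capx']
toy['ebit_less_capx'] = toy['ebit'] - toy['capx']

# A single number is applied to every element of the column.
toy['revenue_x1000'] = toy['revenue'] * 1000

print(toy)
```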


**E. If/then/else logic in Pandas:**

+ `If/then/else` logic is important in all types of programming.<br><br>

+ In `Python/Pandas`, you rarely will write code that looks like classic `if/then/else` statements.<br><br>

+ For example, many `Pandas` logical functions or statements are actually `if/then` statements with an implicit else.<br><br>

+ Data selection often involves if/then/else logic $\leftarrow$ in Pandas jargon it's often called Boolean indexing.<br><br>

+ For example, we can use if/then/else logic to create a new variable that is `True` if the year is greater than 2010 and `False` otherwise. The logical statement looks like the following:

```
if (year is greater than 2010) then
   True
else
   False
```

+ In Python/Pandas, the code below implements the preceding logic. It automatically creates a `Series` of `True` and `False` values based on the logical condition that the year is greater than 2010:<br><br>

In [19]:
(df['year'] > 2010).head(2)

0    False
1    False
Name: year, dtype: bool

+ We can also assign this new `True`/`False` variable to the dataframe: 

In [21]:
df['gt_2010'] = df['year'] > 2010
df.tail(3)

Unnamed: 0,tick,year,revenue,ebit,capx,debt,mktcap,profit_margin,gt_2010
41,AMZN,2019,280522.0,14177.0,16861.0,63205.0,920224.3,0.050538,True
42,AMZN,2020,386064.0,22315.0,40140.0,87789.0,1638236.0,0.057801,True
43,AMZN,2021,469822.0,24429.0,61053.0,122595.0,1697179.0,0.051996,True
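
+ When the `then` and `else` branches should produce values other than `True` and `False`, one option is NumPy's `where` function. This is a sketch using a technique these notes have not introduced; the dataframe `toy` and the labels `'late'` and `'early'` are hypothetical:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'year': [2009, 2010, 2011, 2012]})

# Comparison alone: True if the condition holds, False otherwise (the implicit else).
toy['gt_2010'] = toy['year'] > 2010

# np.where(condition, value_if_true, value_if_false) spells out both branches.
toy['period'] = np.where(toy['year'] > 2010, 'late', 'early')

print(toy)
```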


**F. Data selection** 

+ Based on if/then/else logic `Pandas` allows you to select only the rows or columns of a `dataframe` that you want.<br><br>

+ Suppose you only want observations where the year is greater than 2010. Pandas allows us to index a dataframe's rows based on a logical condition or True/False values.<br><br>


In [None]:
df[df['gt_2010'] == True]

In [None]:
df[df['gt_2010']]

In [None]:
df[df['year'] > 2010]

**Creating a Sub Dataframe**

+ We can assign the smaller dataframe to a new dataframe with the following:

In [None]:
sub = df[df['year'] > 2010]
sub
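
+ Logical conditions can also be combined. The sketch below uses a hypothetical dataframe with made-up revenue numbers; note that each condition is wrapped in parentheses and joined with `&` (and) rather than the Python keyword `and`:

```python
import pandas as pd

toy = pd.DataFrame({'tick': ['HRL', 'HRL', 'AMZN', 'AMZN'],
                    'year': [2009, 2015, 2009, 2015],
                    'revenue': [1.0, 2.0, 3.0, 4.0]})

# One condition: keep rows where the year is greater than 2010.
late = toy[toy['year'] > 2010]

# Two conditions: year greater than 2010 AND ticker equal to AMZN.
late_amzn = toy[(toy['year'] > 2010) & (toy['tick'] == 'AMZN')]

print(late)
print(late_amzn)
```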

**G. Deleting a Variable/Column of Data**

+ Very common to delete or remove columns.<br><br>

+ Typically rely on the `drop` function.<br><br>

+ For example, suppose I want to drop the `capx` column from the dataframe.<br><br>


In [None]:
df.drop('capx',axis='columns')

+ The preceding command created a new `dataframe` with the `capx` column removed. <br><br>

+ Most `pandas` commands operate by creating a new dataframe.<br><br>

+ To modify the original `dataframe` (df) we have to assign the `dataframe` created by the drop command to `df`.<br><br>

In [None]:
df = df.drop('capx',axis='columns')
df
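
+ A small self-contained check of this behavior on a hypothetical dataframe: `drop` leaves the original untouched until the result is assigned back.

```python
import pandas as pd

toy = pd.DataFrame({'revenue': [1.0, 2.0], 'capx': [0.1, 0.2]})

# drop returns a new dataframe; toy itself is unchanged.
smaller = toy.drop('capx', axis='columns')
still_there = 'capx' in toy.columns

# Assigning the result back replaces the original.
toy = toy.drop('capx', axis='columns')

print(still_there)
print(list(toy.columns))
```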

<br>**IV. More Advanced Concepts and Features**

**A. The groupby/apply construct:**

+ The most important **programming idiom** or construct for this class is the `groupby/apply` construct.<br><br>

+ Allows us to loop through the data and group observations in a `dataframe` together, and then apply a function or data transformation to each group.<br><br>

+ For example, we often use it to group observations by date or to group all the observations of the same stock together. We then typically apply a function to the data that aggregates it or transforms it within these groups.<br><br>

+ The `groupby/apply` construct allows us to accomplish the following with just one (or a few lines) of code:

  1. Logically **group** observations together based on some attribute of the data: for example, we could group stock data based on whether the company was big or small.<br><br>

  2. **Apply** a function to the different groups. For example, we could compute the average number of analysts covering big versus small stocks.<br><br>

+ The groupby/apply construct does a whole bunch of work for us behind the scenes. It loops through all the observations, categorizes them into groups, and then applies the function separately to each group.<br><br>
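
+ A minimal sketch of the big versus small example above. The column names `size` and `analysts` and all the numbers are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({'size': ['big', 'big', 'small', 'small'],
                    'analysts': [20, 30, 2, 4]})

# Group the rows by size, then apply the mean to each group's analyst counts.
avg_analysts = toy.groupby('size')['analysts'].mean()

# The same result, passing the function through apply.
avg_via_apply = toy.groupby('size')['analysts'].apply(lambda x: x.mean())

print(avg_analysts)
```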


**B. User-written functions:**

+ You will write your own custom (i.e., user written) functions to extend the functionality of the `groupby/apply` construct.<br><br>

+ For example, writing a custom function is sometimes an important part of implementing a portfolio formation criterion for a trading strategy.<br><br>
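
+ A sketch of a user-written function combined with groupby/apply. The function name `value_weighted_return`, the column names, and the data are all assumptions for illustration; it computes a market-cap-weighted average return within each date.

```python
import pandas as pd

def value_weighted_return(group):
    """Average of ret within one group, weighted by market cap."""
    weights = group['mktcap'] / group['mktcap'].sum()
    return (weights * group['ret']).sum()

toy = pd.DataFrame({'date': ['d1', 'd1', 'd2', 'd2'],
                    'ret': [0.10, 0.00, 0.02, 0.04],
                    'mktcap': [300.0, 100.0, 100.0, 100.0]})

# Group by date, then apply the custom function to each date's rows.
vw = toy.groupby('date')[['ret', 'mktcap']].apply(value_weighted_return)

print(vw)
```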


**C. Merging data:** 

+ Merging data is a core part of the data preparation step for most empirical work or backtesting of strategies.<br><br> 

+ You will learn how to merge dataframes together based on a single key or multiple keys.<br><br> 
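
+ A brief sketch of both cases using `pd.merge`. The three dataframes, their column names, and their contents are hypothetical:

```python
import pandas as pd

prices = pd.DataFrame({'tick': ['AAA', 'AAA', 'BBB'],
                       'year': [2020, 2021, 2020],
                       'price': [10.0, 12.0, 5.0]})
names = pd.DataFrame({'tick': ['AAA', 'BBB'],
                      'name': ['Alpha Co', 'Beta Co']})
earnings = pd.DataFrame({'tick': ['AAA', 'AAA', 'BBB'],
                         'year': [2020, 2021, 2020],
                         'eps': [1.0, 1.5, 0.5]})

# Single key: attach the company name to every price row with the same ticker.
single = pd.merge(prices, names, on='tick', how='left')

# Multiple keys: rows match only when both ticker and year agree.
multi = pd.merge(prices, earnings, on=['tick', 'year'], how='inner')

print(single)
print(multi)
```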
