---
title: 'Essential Tools: Pandas'
jupyter: python3
---

[![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tools4ds/DS701-Course-Notes/blob/main/ds701_book/jupyter_notebooks/02B-Pandas.ipynb)

In this chapter we discuss the first of two important Python packages, Pandas.

# Pandas

Pandas is a Python library for data manipulation and analysis. It can be used to produce high-quality plots, and it integrates nicely with other libraries that expect NumPy arrays. Knowing how to use Pandas is essential for a data scientist.

The most important data structure provided by Pandas is the dataframe, implemented in the [DataFrame](https://pandas.pydata.org/docs/reference/frame.html) class. A dataframe is a table (2-D array) in which each row and each column has a label.
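
To make the idea of row and column labels concrete, here is a minimal sketch that builds a tiny dataframe by hand. The name `toy` and the numbers in it are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: three days of made-up prices.
toy = pd.DataFrame(
    {"open": [10.0, 10.5, 11.0], "close": [10.4, 10.2, 11.6]},
    index=["mon", "tue", "wed"],
)

print(toy.index.tolist())    # row labels
print(toy.columns.tolist())  # column labels
print(toy.shape)             # (number of rows, number of columns)
```

Each value in the table can be addressed by a (row label, column label) pair, which is what distinguishes a dataframe from a plain 2-D array.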

Make it a habit that when you're given a dataset, load it into a dataframe.

## Fetching, storing and retrieving your data

For demonstration purposes, we'll use a utility library that fetches data from standard online sources, such as Yahoo! Finance.

In [None]:
# Import pandas and yfinance
import pandas as pd
import yfinance as yf

yahoo_stocks = pd.DataFrame(yf.download('YELP',start='2015-01-01',end='2015-12-31', progress = False))

yahoo_stocks.head()

This is a typical example of a dataframe.  

Notice how each row has a label and each column has a label.

A DataFrame is an object that has associated methods that help you explore and manipulate the data.

Here is a simple method: `.info()`

In [None]:
yahoo_stocks.info()

## Writing to and reading from a ``.csv`` file

Continuing to explore methods, let's write the dataframe out to a ``.csv`` file:

In [None]:
yahoo_stocks.to_csv('yahoo_data.csv')

In [None]:
!head yahoo_data.csv

And of course we can likewise read a ``.csv`` file into a dataframe.  This is probably the most common way you will get data into Pandas.

In [None]:
df = pd.read_csv('yahoo_data.csv')
df.head()
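
`read_csv` also accepts options that do some of the cleanup at load time. The sketch below is one possible variation, assuming a CSV with a `Date` column; the text in `csv_text` is made up and stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV text with made-up numbers, standing in for a file on disk.
csv_text = """Date,Open,Close
2015-01-02,55.46,55.15
2015-01-05,54.54,52.53
2015-01-06,52.55,52.44
"""

# parse_dates converts the named column from strings to datetimes;
# index_col uses that column as the row labels.
prices = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")

print(prices.index.dtype)
print(prices.head())
```

In the rest of this chapter we load the file without these options so that we can do the same conversions step by step.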

## Working with data columns

We'll typically describe the rows in the dataframe as **items** and the columns as **features**.

In [None]:
df.columns

Pandas allows you to use standard Python __indexing__ with square brackets to refer to columns (i.e., features) in your dataframe:

In [None]:
df['Open']

Pandas also allows you to refer to columns using an object attribute syntax.

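Note that this syntax only works when the column name is a valid Python identifier (so it cannot include a space) and does not clash with an existing DataFrame attribute or method.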

In [None]:
df.Open
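
The following sketch illustrates when attribute syntax can and cannot be used. The frame `toy` and its column names are hypothetical, chosen only to show the two problem cases.

```python
import pandas as pd

# Hypothetical frame: one name with a space, one that clashes with a method.
toy = pd.DataFrame({"Open": [1.0, 2.0], "Adj Close": [1.5, 2.5], "count": [3, 4]})

print(toy.Open.tolist())          # attribute access works for 'Open'
print(toy["Adj Close"].tolist())  # brackets are needed for a name with a space

# 'count' is also the name of a DataFrame method, so toy.count is the
# method, not the column. Brackets always return the column.
print(callable(toy.count))
print(toy["count"].tolist())
```

Square brackets work for every column name, so they are the safer choice in code that has to handle arbitrary data.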

You can select a list of columns:

In [None]:
df[['Open', 'Close']].head()

Putting things together -- make sure this syntax is clear to you:

In [None]:
df.Date.head(10)

In [None]:
df.Date.tail(10)

Changing column names is as simple as assigning to the `.columns` property.

Let's adjust the column names to use lowercase letters and to replace spaces with underscores.

In [None]:
new_column_names = [x.lower().replace(' ', '_') for x in df.columns]
df.columns = new_column_names
df.info()

Observe that we first created a list of column names without spaces using a __list comprehension__. This is the Pythonic way to generate a new list.
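
As an alternative sketch, the same renaming can be written with the string methods of the column index, and individual columns can be renamed with `rename()`. The frame `toy` and the new name `adjusted_close` are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with a column name that contains a space.
toy = pd.DataFrame({"Open": [1.0], "Adj Close": [1.5]})

# Vectorized string methods applied to all column names at once.
toy.columns = toy.columns.str.lower().str.replace(" ", "_")
print(toy.columns.tolist())

# rename() changes only the columns named in the mapping
# and returns a new dataframe.
renamed = toy.rename(columns={"adj_close": "adjusted_close"})
print(renamed.columns.tolist())
```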

Now **all** columns can be accessed using the **dot** notation:

In [None]:
df.adj_close.head()

## A sampling of DataFrame methods

There are many useful methods in the DataFrame object. It is important to familiarize yourself with these methods.

The following methods calculate the mean, standard deviation, and median of the specified numeric columns.

In [None]:
df[['high', 'low', 'open', 'close', 'volume', 'adj_close']].mean()

In [None]:
df[['high', 'low', 'open', 'close', 'volume', 'adj_close']].std()

In [None]:
df[['high', 'low', 'open', 'close', 'volume', 'adj_close']].median()

In [None]:
df.open.mean()

In [None]:
df.high.mean()
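
Several statistics can also be computed in one call. Here is a short sketch using `agg()` and `describe()` on a made-up frame (`toy` is hypothetical, not the stock data).

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"high": [2.0, 4.0, 6.0], "low": [1.0, 2.0, 6.0]})

# One row per statistic, one column per feature.
summary = toy.agg(["mean", "std", "median"])
print(summary)

# describe() reports count, mean, std, min, quartiles, and max.
print(toy.describe())
```

Note that `std()` in Pandas computes the sample standard deviation (it divides by $n-1$) by default.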

## Plotting methods

Pandas also implements a variety of easy-to-use plotting functions.

These are your "first look" functions and useful in exploratory data analysis.

Later, we will use more specialized graphics packages to create more sophisticated visualizations.

In [None]:
import matplotlib.pyplot as plt

df.high.plot(label='High')
df.low.plot(label='Low')
plt.title('YELP Stock Price')
plt.ylabel('Dollars')
plt.legend(loc='best')
plt.show()

In [None]:
df.adj_close.hist()
plt.xlabel('Adjusted Closing Price')
plt.ylabel('Number of days')
plt.title('YELP')
plt.show()

## Bulk Operations

Methods like ``sum()`` and ``std()`` work on entire columns. 

We can run our own functions across all values in a column (or row) using ``apply()``.

As an example, let's go back to this plot:

In [None]:
df.high.plot(label='High')
df.low.plot(label='Low')
plt.title('YELP Stock Price')
plt.ylabel('Dollars')
plt.legend(loc='best')
plt.show()

It's __almost__ perfect.  The only problem is the $x$-axis: it should show time.

To fix this, we need to make the dataframe __index__ -- that is, the __row labels__ -- into dates.

The "dates" in our data are currently __strings__. We need to convert them to an appropriate data type so that Pandas understands that they are actually dates.

In [None]:
df.date.head()

To convert each string in the `date` column to an actual date we will use `.apply()`:

In [None]:
from datetime import datetime

new_df = df.copy()
new_df.date = df.date.apply(lambda d: datetime.strptime(d.split()[0], "%Y-%m-%d"))
new_df.date.head()
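
Pandas also provides `pd.to_datetime()`, which converts a whole column at once. The sketch below assumes the strings are plain `YYYY-MM-DD` dates; the frame `toy` is made up for illustration.

```python
import pandas as pd

# Hypothetical frame whose dates are stored as strings.
toy = pd.DataFrame({"date": ["2015-01-02", "2015-01-05"], "close": [1.0, 2.0]})

# Convert the entire column in one vectorized call.
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%d")
print(toy["date"].dtype)
```

For a simple conversion like this, `pd.to_datetime()` is usually preferable to `.apply()`, but `.apply()` remains the general tool when you need to run an arbitrary function of your own on each value.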

Each row in a DataFrame is associated with an index, which is a label that uniquely identifies that row.

The row indices so far have been auto-generated by Pandas. They are integers starting from 0. 

To use the dates as row labels instead, set the `index` property of the DataFrame to the date column.

In [None]:
new_df.index = new_df.date
new_df.head()

Now that we have made an index based on a real date, we can drop the original `date` column.

In [None]:
new_df = new_df.drop(['date'], axis=1)
new_df.head()
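
The two steps above (assigning the index, then dropping the column) can also be done in one call with `set_index()`. This is a sketch on a made-up frame named `toy`.

```python
import pandas as pd

# Hypothetical frame with a datetime column.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-02", "2015-01-05"]),
    "close": [1.0, 2.0],
})

# set_index() moves the column into the index and returns a new dataframe.
indexed = toy.set_index("date")
print(indexed)

# reset_index() reverses the operation.
restored = indexed.reset_index()
print(restored.columns.tolist())
```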

Now we can see that Pandas handles these dates quite nicely.

In [None]:
new_df.high.plot(label='High')
new_df.low.plot(label='Low')
plt.title('YELP Stock Price')
plt.ylabel('Dollars')
plt.legend(loc='best')
plt.show()

## Accessing rows of the DataFrame

So far we've seen how to access a column of the DataFrame. To access a row we use different syntax.

To access a row by its index label, use the **`.loc`** indexer ('location'). Note that it is used with square brackets, not parentheses.

In [None]:
new_df.loc[datetime(2015, 1, 23, 0, 0)]

To access a row by its index number (i.e., like an array index), use the **`.iloc`** indexer ('integer location'), again with square brackets.

In [None]:
new_df.iloc[0, :]
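
The difference between the two indexers is easiest to see side by side. The following sketch uses a made-up frame (`toy`) with a date index.

```python
import pandas as pd

# Hypothetical frame indexed by date.
toy = pd.DataFrame(
    {"open": [1.0, 2.0, 3.0], "close": [1.5, 1.8, 3.3]},
    index=pd.to_datetime(["2015-01-02", "2015-01-05", "2015-01-06"]),
)

by_label = toy.loc[pd.Timestamp("2015-01-05")]  # the row labeled with that date
by_position = toy.iloc[1]                        # the second row, whatever its label
print(by_label.equals(by_position))

# Label slices include both endpoints; positional slices exclude the stop.
print(len(toy.loc["2015-01-02":"2015-01-05"]))
print(len(toy.iloc[0:1]))
```

The inclusive endpoint of `.loc` slices is a common source of off-by-one surprises for people used to ordinary Python slicing.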

To iterate over the rows you can use **`.iterrows()`**.

In [None]:
num_positive_days = 0
for idx, row in df.iterrows():
    if row.close > row.open:
        num_positive_days += 1
        
print("The total number of positive-gain days is {}.".format(num_positive_days))
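
Explicit loops over rows are easy to read but slow on large dataframes. As a sketch of the usual alternative, the same count can be obtained by comparing whole columns; the frame `toy` is made up.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"open": [1.0, 2.0, 3.0], "close": [1.5, 1.8, 3.3]})

# The comparison yields one Boolean per row; summing counts the True values.
num_positive_days = int((toy["close"] > toy["open"]).sum())
print("The total number of positive-gain days is {}.".format(num_positive_days))
```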

## Filtering

It is easy to select rows from the data.  

Filtering uses a Boolean mask: a Series holding one `True` or `False` value per row. Indexing a DataFrame with such a mask returns a new DataFrame, which itself can be treated the same way as all DataFrames we have seen so far.

In [None]:
tmp_high = new_df.high > 55
tmp_high.head()

Summing a Boolean array is the same as counting the number of `True` values.

In [None]:
sum(tmp_high)

Now, let's select only the rows of `new_df` for which `tmp_high` is `True`.

In [None]:
new_df[tmp_high]

Putting it all together, we have the following commonly-used patterns.

In [None]:
positive_days = new_df[new_df.close > new_df.open]
positive_days.head()

In [None]:
very_positive_days = new_df[(new_df.close - new_df.open) > 4]
very_positive_days.head()
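
Conditions can be combined with `&` (and), `|` (or), and `~` (not). Here is a minimal sketch on made-up data; the `volume` threshold is arbitrary.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "open":   [10.0, 12.0, 11.0, 15.0],
    "close":  [11.0, 11.5, 16.0, 15.5],
    "volume": [100, 300, 250, 50],
})

# Each condition needs its own parentheses, because & and | bind
# more tightly than comparison operators in Python.
up_and_busy = toy[(toy["close"] > toy["open"]) & (toy["volume"] > 200)]
down_or_quiet = toy[(toy["close"] < toy["open"]) | (toy["volume"] < 100)]

print(up_and_busy)
print(down_or_quiet)
```

Python's `and` and `or` keywords do not work here, since they try to reduce an entire Series to a single truth value.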

## Creating new columns

To create a new column, simply assign values to it. The column name is similar to a key in a dictionary.

In [None]:
new_df['profit'] = (new_df.open < new_df.close)
new_df.head()

Let's give each row a `gain` value as a categorical variable.

In [None]:
for idx, row in new_df.iterrows():
    if row.open > row.close:
        new_df.loc[idx,'gain']='negative'
    elif (row.close - row.open) < 1:
        new_df.loc[idx,'gain']='small_gain'
    elif (row.close - row.open) < 6:
        new_df.loc[idx,'gain']='medium_gain'
    else:
        new_df.loc[idx,'gain']='large_gain'
new_df.head()

Here is another, more "functional", way to accomplish the same thing.

Define a function that classifies rows, and `apply` it to each row.

In [None]:
def namerow(row):
    if row.open > row.close:
        return 'negative'
    elif (row.close - row.open) < 1:
        return 'small_gain'
    elif (row.close - row.open) < 6:
        return 'medium_gain'
    else:
        return 'large_gain'

new_df['test_column'] = new_df.apply(namerow, axis=1)

In [None]:
new_df.head()
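
A third option is to build the column from whole-column conditions, with no per-row Python function at all. The sketch below uses NumPy's `np.select()` with the same thresholds as above; the frame `toy` is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per category.
toy = pd.DataFrame({
    "open":  [10.0, 10.0, 10.0, 10.0],
    "close": [9.0, 10.5, 13.0, 17.0],
})
change = toy["close"] - toy["open"]

# Conditions are checked in order; the first one that holds wins,
# just like the if/elif chain.
conditions = [change < 0, change < 1, change < 6]
labels = ["negative", "small_gain", "medium_gain"]
toy["gain"] = np.select(conditions, labels, default="large_gain")
print(toy)
```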

OK, point made, let's get rid of the extraneous `test_column`.

In [None]:
# drop() returns a new DataFrame, so assign the result back
new_df = new_df.drop('test_column', axis=1)

## Grouping

A powerful DataFrame method is `groupby()`. 

This is analogous to `GROUP BY` in SQL.

It will group the rows of a DataFrame by the values in one (or more) columns and let you iterate through each group.

Here we will look at the average closing value within each of the gain categories (negative, small, medium, and large) that we defined above and stored in the column `gain`.

In [None]:
gain_groups = new_df.groupby('gain')

Essentially, `gain_groups` behaves like a dictionary:
* the keys are the unique values found in the `gain` column, and 
* the values are DataFrames that contain only the rows having the corresponding unique values.

In [None]:
for gain, gain_data in gain_groups:
    print(gain)
    print(gain_data.head())
    print('=============================')

In [None]:
for gain, gain_data in new_df.groupby("gain"):
    print('The average closing value for the {} group is {}'.format(gain,
                                                           gain_data.close.mean()))
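
In practice you rarely need to loop over the groups: an aggregation method can be applied to the grouped object directly. A short sketch on made-up data follows (`toy` is hypothetical).

```python
import pandas as pd

# Hypothetical toy data with two gain categories.
toy = pd.DataFrame({
    "gain":  ["negative", "small_gain", "negative", "small_gain"],
    "close": [10.0, 20.0, 14.0, 22.0],
})

# One mean per group, indexed by the group key.
mean_close = toy.groupby("gain")["close"].mean()
print(mean_close)

# Several statistics per group at once.
stats = toy.groupby("gain")["close"].agg(["mean", "count"])
print(stats)
```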

## Other Pandas Classes

A DataFrame is essentially an annotated 2-D array.

Pandas also has annotated versions of 1-D and 3-D arrays.

A 1-D array in Pandas is called a [Series](https://pandas.pydata.org/docs/reference/series.html). You can think of a DataFrame as a dictionary of Series.
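
The dictionary analogy can be taken quite literally, as this sketch on made-up data shows (the names `opens`, `closes`, and `frame` are hypothetical).

```python
import pandas as pd

# Two hypothetical Series that share the same index.
opens = pd.Series([1.0, 2.0], index=["a", "b"], name="open")
closes = pd.Series([1.5, 1.8], index=["a", "b"], name="close")

# A DataFrame can be built from a dictionary of Series...
frame = pd.DataFrame({"open": opens, "close": closes})

# ...selecting one column gives back a Series...
print(type(frame["close"]).__name__)

# ...and items() iterates over (column name, Series) pairs.
for name, series in frame.items():
    print(name, series.tolist())
```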

A 3-D array in Pandas is created using a [MultiIndex](https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html#).

For more information read the documentation.

## Comparing multiple stocks

As a last task, we will use the experience we have gained so far -- and learn some new things -- in order to compare the performance of different stocks obtained from Yahoo! Finance.

In [None]:
stocks = ['ORCL', 'TSLA', 'IBM','YELP', 'MSFT']
stock_df = pd.DataFrame()
for s in stocks:
    stock_df[s] = pd.DataFrame(yf.download(s, 
                                           start='2014-01-01', 
                                           end='2014-12-31', 
                                           progress=False))['Close']
stock_df.head()

In [None]:
stock_df.plot()
plt.show()

Next, we calculate the returns over a period of length $T$. The returns are defined as

$$ r(t) = \frac{f(t)-f(t-T)}{f(t-T)}. $$

The returns can be computed with a simple DataFrame method `pct_change()`.  Note that for the first $T$ timesteps, this value is not defined.

In [None]:
rets = stock_df.pct_change(30)
rets.iloc[25:35]
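
To see exactly what `pct_change()` computes, here is a sketch that compares it with a by-hand calculation in which the change in price is divided by the earlier price $f(t-T)$. The price series is made up (it grows by 10% per step) and $T=2$ is arbitrary.

```python
import pandas as pd

# Hypothetical prices that grow by 10% per step.
prices = pd.Series([100.0, 110.0, 121.0, 133.1, 146.41])
T = 2

by_method = prices.pct_change(T)

# shift(T) moves every value T steps later, so position t holds f(t - T).
by_hand = (prices - prices.shift(T)) / prices.shift(T)

print(by_method)
print(by_hand)
```

The first `T` entries are `NaN` in both results because there is no earlier price to compare against.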

Now we'll plot the timeseries of the returns of the different stocks.

Notice that the `NaN` values are dropped by the plotting function.

In [None]:
rets.plot()
plt.show()

In [None]:
plt.scatter(rets.TSLA, rets.YELP)
plt.xlabel('TSLA 30-day returns')
plt.ylabel('YELP 30-day returns')
plt.show()

There appears to be some (fairly strong) correlation between the movement of TSLA and YELP stocks.  Let's measure this.

The correlation coefficient between variables $X$ and $Y$ is defined as follows

$$ \text{Corr}(X,Y) = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X\sigma_Y}. $$

Pandas provides a DataFrame method called `corr()` that computes the correlation coefficient of all pairs of columns.

In [None]:
rets.corr()
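
As a check on the definition, the sketch below computes the correlation coefficient of two made-up columns directly from the formula and compares it with the value reported by `corr()`. The frame `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 1.0, 4.0, 3.0, 5.0],
})
x, y = toy["x"], toy["y"]

# Average product of deviations, divided by the product of the
# standard deviations (ddof=0 matches the plain average used above).
cov_xy = ((x - x.mean()) * (y - y.mean())).mean()
by_hand = cov_xy / (x.std(ddof=0) * y.std(ddof=0))

by_pandas = toy.corr().loc["x", "y"]
print(by_hand, by_pandas)
```

The result does not depend on whether we divide by $n$ or $n-1$, as long as the same choice is made in the numerator and the denominator, because the factor cancels.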

It takes a bit of time to examine that table and draw conclusions.  

To speed that process up let's visualize the table.


In [None]:
import seaborn as sns

sns.heatmap(rets.corr(), annot=True)
plt.show()

Finally, it is important to know that the plotting performed by Pandas is just a layer on top of `matplotlib` (i.e., the `plt` package).  

So Pandas plots can (and often should) be replaced or improved by using additional functions from `matplotlib`.

For example, suppose we want to know both the returns as well as the standard deviation of the returns of a stock (i.e., its risk).  

Here is a visualization of the result of such an analysis. We construct the plot using only functions from `matplotlib`.

In [None]:
plt.scatter(rets.mean(), rets.std())
plt.xlabel('Expected returns')
plt.ylabel('Standard Deviation (Risk)')
plt.xlim([-.05,.1])
plt.ylim([0,.3])
for label, x, y in zip(rets.columns, rets.mean(), rets.std()):
    plt.annotate(
        label, 
        xy = (x, y), xytext = (30, -30),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
plt.show()

To understand what these functions are doing (especially the [annotate](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.annotate.html) function), you will need to consult the online documentation for [matplotlib](https://matplotlib.org/stable/api/index.html).
