# Pandas Tutorial

`pandas` (a name derived from "panel data") is a Python library that provides R-like dataframes.  Many of its functions closely mimic R's `data.frame`, which was the inspiration for `pandas`.

In this short tutorial, we'll go through some of the basic functionality of pandas.  **The library is very expansive, and you are encouraged to explore the documentation on your own**. I'm discovering new things about the library every day! At the end of this tutorial, you'll be able to work with data in pandas at a very basic level.

# Reading Data

Pandas offers lots of ways to read data into Python. The most ubiquitous is likely `read_csv`, which does what it says it does.

You can also read data from:

* text files
* SQL databases 
* JSON files
* Google BigQuery

and more.

Check out the documentation on I/O for more on reading and writing data.  See here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

In [None]:
import pandas as pd
import numpy as np
from IPython.display import display
import matplotlib.pyplot as plt

df = pd.read_csv('2018_data.csv', parse_dates=['created_at'])

#Display the top 5 rows of the dataframe in HTML
display(df.head())
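
To see what `parse_dates` is doing, here is a small self-contained sketch that reads an in-memory CSV instead of `2018_data.csv`. The column names match the ones used in this tutorial, but the rows are made up for illustration.

```python
import io
import pandas as pd

# Made-up rows standing in for the real file; only the column names match.
csv_text = """created_at,wr,apparentTemperature
2018-01-05 08:30:00,12,-3.5
2018-01-05 17:00:00,41,-1.0
2018-02-10 12:15:00,25,2.2
"""

# Without parse_dates, created_at is read as plain strings.
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the column is converted to datetime values.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['created_at'])

print(raw.dtypes)
print(parsed.dtypes)
```

Parsing the dates up front is what makes the `.dt` accessor used later in this tutorial available on `created_at`.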

# The Basics of Data Manipulation

Dataframes are like numpy arrays but with more functionality.  If we want to access a column, we can use the `.iloc` indexer (selection by integer position) or the `.loc` indexer (selection by label).

For more, see here: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

In [None]:
#Get the first column
df.iloc[:,0]


#Get the first row
df.iloc[0,:]

#Get first column by name
df.loc[:, 'created_at']

#Get multiple columns by name
df.loc[:, ['created_at','wr']]
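
The difference between position and label is easiest to see on a frame whose index is not the default 0, 1, 2. The sketch below uses a made-up toy frame (the `temp` column and the `a`/`b`/`c` index are hypothetical, not part of the tutorial data).

```python
import pandas as pd

# Toy data (made up) with a non-default index so position and label differ.
toy = pd.DataFrame(
    {'wr': [12, 41, 25], 'temp': [-3.5, -1.0, 2.2]},
    index=['a', 'b', 'c'],
)

# .iloc selects by integer position: first row, first column.
by_position = toy.iloc[0, 0]

# .loc selects by label: row 'a', column 'wr'.
by_label = toy.loc['a', 'wr']

# Both point at the same cell here.
print(by_position, by_label)

# .iloc slices exclude the end point; .loc label slices include it.
pos_slice = toy.iloc[0:1]        # only the row at position 0
label_slice = toy.loc['a':'b']   # rows labelled 'a' and 'b'
print(len(pos_slice), len(label_slice))
```

The inclusive end point of `.loc` slices is a common source of off-by-one surprises, so it is worth checking which indexer you are using when slicing.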

We can also access columns as attributes.  So to access the `wr` column, we can write `df.wr`.

In [None]:
#Get the wr column as a Series by accessing the column name as an attribute
df.wr
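
Attribute access is convenient, but it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. A minimal sketch, using a made-up frame with a hypothetical `count` column chosen to trigger the clash:

```python
import pandas as pd

# Made-up frame; 'count' is chosen because it clashes with a DataFrame method.
toy = pd.DataFrame({'wr': [12, 41, 25], 'count': [1, 2, 3]})

# For an ordinary name, both forms return the same Series.
print(toy.wr.equals(toy['wr']))

# toy.count is the DataFrame.count method, not the column,
# so bracket access is needed here.
print(callable(toy.count))
print(toy['count'].tolist())
```

Bracket access (`df['wr']`) always works, so it is the safer choice in code that has to handle arbitrary column names.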

# Summarizing Data

Dataframes and Series have their own methods for summary.  Let's take a look!

Lots of summary methods exist, including but not limited to:

* Mean
* Median
* Min/Max
* Quantile

There are also a multitude of functions to compute between-column statistics.  See here: https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#computation



In [None]:
#Take the mean of the wr column
df.wr.mean()

#We can also pass a column to a NumPy function
np.mean(df.wr)


#How well correlated are apparentTemperature and wr?

df.loc[:, ['apparentTemperature','wr']].corr()
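
The other summaries in the list above work the same way as `mean`. Here is a short sketch on made-up values (the column names follow the tutorial data, the numbers do not):

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    'wr': [10, 20, 30, 40, 50],
    'apparentTemperature': [1.0, 2.0, 3.0, 4.0, 5.0],
})

print(toy.wr.median())             # middle value
print(toy.wr.min(), toy.wr.max())  # smallest and largest values
print(toy.wr.quantile(0.25))       # first quartile

# describe() reports count, mean, std, min, quartiles and max in one call.
print(toy.describe())

# Pearson correlation between two columns, returned as a single number.
r = toy.wr.corr(toy.apparentTemperature)
print(r)
```

`describe()` is often a good first call on a new dataset, since it gives most of these numbers for every numeric column at once.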


# Other Methods on Non-Numeric Data

The `created_at` column contains dates.  We can extract different parts of those dates with some other methods!

Let's create columns for the month and time the data was observed.

You can read more on timedeltas and timestamps here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

In [None]:
#Create a new column for time
df['time'] = df.created_at.dt.hour


#Create a new column for month
df['month'] = df.created_at.dt.month_name()
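
If you want to try the `.dt` accessor without the data file, the sketch below applies it to a couple of made-up timestamps. `day_name()` is not used elsewhere in this tutorial; it is included only to show that the accessor exposes more than hours and months.

```python
import pandas as pd

# Made-up timestamps for illustration.
stamps = pd.Series(pd.to_datetime([
    '2018-01-05 08:30:00',
    '2018-02-10 17:45:00',
]))

hours = stamps.dt.hour.tolist()           # hour of day as integers
months = stamps.dt.month_name().tolist()  # full month names
days = stamps.dt.day_name().tolist()      # weekday names

print(hours)
print(months)
print(days)
```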


# Aggregation

Sometimes, you want to know a summary of the data per group.  The average number of people in the weight room changes by month.  Let's use `groupby` to summarize by month.

You can learn more about aggregation here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html


In [None]:
#Aggregate by month to find mean number of people in weight room

df.groupby('month').wr.mean()
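
A group can be summarized with several statistics at once by using `agg`. The following is a minimal sketch on made-up data, with the same `month` and `wr` column names as above:

```python
import pandas as pd

# Made-up observations for illustration.
toy = pd.DataFrame({
    'month': ['January', 'January', 'February', 'February'],
    'wr': [10, 30, 20, 40],
})

# agg accepts a list of statistic names and returns one column per statistic.
summary = toy.groupby('month').wr.agg(['mean', 'max', 'count'])
print(summary)
```

The result is a dataframe indexed by month, so individual values can be looked up with `.loc`, for example `summary.loc['January', 'mean']`.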


# Don't Loop With Pandas, Always Vectorize

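Looping through dataframes element by element is rarely the most elegant or efficient way to do things.  Most computations in pandas can be vectorized, so prefer that whenever you can!  The function below counts the rows per month with an explicit loop, which is the pattern to avoid.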

In [None]:
def count_months(df):
    months = {}
    for m in df.month:
        if m in months:
            months[m]+=1
        else:
            months[m]=1
            
    return months


count_months(df)
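
One vectorized way to get the same tally is `value_counts`. The sketch below uses a made-up `month` column so it runs on its own; on the tutorial data the equivalent call would be `df.month.value_counts()`.

```python
import pandas as pd

# Made-up month labels for illustration.
toy = pd.DataFrame({'month': ['January', 'February', 'January', 'March', 'January']})

# value_counts tallies each distinct value without an explicit Python loop.
counts = toy.month.value_counts()
print(counts)

# Convert to a plain dict to match the return type of count_months.
counts_dict = counts.to_dict()
print(counts_dict)
```

Besides being shorter, the vectorized version does its counting inside pandas rather than in Python bytecode, which usually matters more as the data grows.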

# Working With Categorical Data

Inspired by `factors` in R, `pandas` uses `pd.Categorical` to turn categorical variables into a `CategoricalDtype` column.

You can read more about categorical data here: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html

In [None]:
#Turn months into an ORDERED categorical variable
months = ['January','February','March','April','May','June','July','August','September','October','November','December']
df['month'] = pd.Categorical(df.month, ordered = True, categories=months)
#Watch what happens when we group by an ordered categorical variable
df.groupby('month').wr.mean()
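
The reason the grouped output changes is that an ordered categorical sorts by its category order rather than alphabetically. A small sketch with three made-up month labels shows the effect directly:

```python
import pandas as pd

names = pd.Series(['March', 'January', 'February'])

# Plain strings sort alphabetically.
alphabetical = names.sort_values().tolist()
print(alphabetical)

# An ordered categorical sorts by the order given in `categories`.
order = ['January', 'February', 'March']
as_cat = pd.Series(pd.Categorical(names, categories=order, ordered=True))
calendar = as_cat.sort_values().tolist()
print(calendar)

# Ordered categoricals also support comparisons against a category.
after_january = (as_cat > 'January').tolist()
print(after_january)
```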

Categorical columns become really important when you are transforming data for machine learning.  Many analyses will have you one-hot encode a categorical variable.  You can do this in pandas with `pd.get_dummies`, which works on categorical columns (and on plain string columns too).

In [None]:
#One Hot Encode Months
pd.get_dummies(df.month)
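
One practical benefit of encoding a categorical column is that every declared category gets its own indicator column, even if it never appears in the data. A minimal sketch with made-up values:

```python
import pandas as pd

order = ['January', 'February', 'March']

# Made-up observations; 'February' is a declared category but never occurs.
months = pd.Series(pd.Categorical(['March', 'January', 'March'], categories=order))

# One column per category, in category order.
dummies = pd.get_dummies(months)
print(dummies)
```

This keeps the set of columns stable across different subsets of the data, which is useful when a model expects a fixed set of features.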


# Plotting

Though you can pass columns to `matplotlib` just like you would arrays, pandas offers its own API for plotting.

For more, see here: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html

In [None]:
#A simple histogram
fig, ax = plt.subplots(dpi = 120)

df.wr.hist(ax = ax, edgecolor = 'white', bins = 25)
ax.grid(False)
ax.set_xlabel('WR Numbers')
ax.set_ylabel('Frequency')
ax.set_title('Histogram of WR numbers')

#A simple scatter plot
fig2, ax2 = plt.subplots(dpi = 120)
df['time'] = df.created_at.dt.hour + df.created_at.dt.minute/60
df.plot.scatter(x = 'time', y = 'wr', alpha = 0.05, ax = ax2)

#You can combine multiple plots as well!
fig3, ax3 = plt.subplots(dpi = 120)

month_group = df.groupby('month')
monthly_mean = month_group.wr.mean()
monthly_sd = month_group.wr.std()

monthly_mean.plot(ax = ax3, marker = 'o', linestyle = 'None', color = 'k')
monthly_mean.plot.bar(ax = ax3, yerr = monthly_sd, capsize = 5)
