<a href="https://github.com/theonaunheim">
    <img style="border-radius: 100%; float: right;" src="static/strawberry_thief_square.png" width=10% alt="Theo Naunheim's Github">
</a>

<br style="clear: both">

<hr>

<div style="display: table; width: 100%">
    <div style="display: table-row; width: 100%;">
        <div style="display: table-cell; width: 50%; vertical-align: middle;">
                <img style="display: inline; overflow: hidden; width: 50%" src="static/pandas_logo.png">
            <br>
            <br>
            <ul style="display: inline-block">
                <li>
                    <a href="http://pandas.pydata.org/">pandas Home</a>
                </li>
                <li>
                    <a href="http://pandas.pydata.org/pandas-docs/stable/api.html">pandas API Reference</a>
                </li>
                <li>
                    <a href="https://en.wikipedia.org/wiki/Pandas_(software)">pandas Wikipedia</a>
                </li>
            </ul>
        </div>
        <div style="display: table-cell; width: 10%">
        </div>
        <div style="display: table-cell; width: 40%; vertical-align: middle;">
            <blockquote>
                <p style="font-style: italic;">Torture numbers, and they'll confess to anything.</p>
                <br>
                <p>-Gregg Easterbrook</p>
            </blockquote>
        </div>
    </div>
</div>

<hr>

## What is Pandas?

Pandas is a Python library for data manipulation and numerical analysis. Although it was originally designed as an econometrics platform, it has proven suitable for a wide variety of domains. It will be our primary interface going forward.

There are two important concepts we should discuss before we begin.

---

## Dataframe

Pandas borrowed the concept of the [pd.DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) from [R](https://en.wikipedia.org/wiki/R_(programming_language)).

In [None]:
# Import pandas under the name pd
import pandas as pd
import numpy as np
import matplotlib

%matplotlib inline
matplotlib.style.use('fivethirtyeight')

# Create a dataframe from a CSV file
df = pd.read_csv('data/cfpb.csv')

# A dataframe on the last line of a cell is rendered as an HTML table
# head() limits the output to the first 5 rows
df.head()

A dataframe can be thought of as a relational database table or an Excel sheet. It has rows and columns. Each row corresponds to an individual item or entity, and each column corresponds to a feature of that entity. In the output above, the column labels and the row index are in navy. The actual data falls in the middle of the table.
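
To make the row and column picture concrete, here is a minimal sketch that builds a tiny dataframe by hand. The records are invented for illustration and do not come from `cfpb.csv`.

```python
import pandas as pd

# Hypothetical records, invented for illustration (not from cfpb.csv)
toy = pd.DataFrame({
    'Product': ['Mortgage', 'Credit card', 'Mortgage'],
    'State': ['CA', 'NY', 'TX'],
    'Restitution': [120.0, 35.5, 0.0],
})

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column labels: one per feature
print(list(toy.index))    # row labels: by default 0, 1, 2, ...
```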

## Series

DataFrames can be thought of as a group of columns composed of [Series](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) objects. They can also be thought of as a group of rows composed of Series. Unlike a dataframe, which has both a row index and column labels, a Series has only a single index and one-dimensional data.

In [None]:
# You can select a column series of a dataframe with this notation:
# dataframe[column_name]
# OR if there are no spaces in the name
# df.Product
# head() limits the output to the first 5 elements
df['Product'].head()
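
The sketch below shows both views of a dataframe as Series, using a small hand-made frame (hypothetical data, with the row labels `'a'` and `'b'` chosen only for illustration). A column Series is indexed by the row labels, while a row Series is indexed by the column labels.

```python
import pandas as pd

# Hypothetical two-row frame with explicit row labels
toy = pd.DataFrame(
    {'Product': ['Mortgage', 'Credit card'], 'State': ['CA', 'NY']},
    index=['a', 'b'],
)

column = toy['Product']   # a column Series, indexed by the row labels
row = toy.loc['a']        # a row Series, indexed by the column labels

print(type(column).__name__, list(column.index))
print(type(row).__name__, list(row.index))
```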

## What Can Pandas Do?

**Warning**: If you are a Python person, some of the syntax you're about to see may frighten and confuse you. This is a perfectly normal reaction. The only way to get R-like syntax in Python was to abuse the `__getitem__()` magic method, which is what Python calls when you write `obj[key]`. It's kludgy, but it works beautifully.
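
To illustrate the idea, here is a toy class that dispatches on the type of the key passed to the square brackets. `TinyFrame` is a hypothetical name made up for this sketch; it only mimics the flavor of the syntax and is not how pandas is actually implemented.

```python
class TinyFrame:
    """Hypothetical toy class; not how pandas is actually implemented."""

    def __init__(self, columns):
        self.columns = columns  # dict mapping column name -> list of values

    def __getitem__(self, key):
        if isinstance(key, str):
            # frame['a'] -> the values of one column
            return self.columns[key]
        if isinstance(key, list) and all(isinstance(k, bool) for k in key):
            # frame[[True, False, ...]] -> keep rows where the mask is True
            return TinyFrame({name: [v for v, keep in zip(vals, key) if keep]
                              for name, vals in self.columns.items()})
        if isinstance(key, list):
            # frame[['a', 'b']] -> a smaller frame with only those columns
            return TinyFrame({name: self.columns[name] for name in key})
        raise KeyError(key)


frame = TinyFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
print(frame['a'])
print(frame[['b']].columns)
print(frame[[True, False, True]].columns)
```

The same brackets do three different jobs depending on what goes inside them, which is roughly the pattern the cells below rely on.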

### Selecting and Filtering

In [None]:
# We can cut down the dataframe as needed
# You can also pass a list of columns to get back a subframe
# dataframe[list_of_column_names]
# http://pandas.pydata.org/pandas-docs/stable/indexing.html
tdf = df[['Product', 'Company', 'State']]
tdf.head()

In [None]:
# We can also cut down rows by selecting them by position
# iloc is for integer-position indexing; loc is for label-based indexing
tdf.iloc[:5]
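
The difference between `iloc` and `loc` is easiest to see when the index labels are not simply 0, 1, 2. A small sketch with a hypothetical Series whose labels are 10, 20, and 30:

```python
import pandas as pd

# Hypothetical Series whose labels differ from its positions
s = pd.Series(['x', 'y', 'z'], index=[10, 20, 30])

print(s.iloc[0])            # element at position 0
print(s.loc[10])            # element whose label is 10
print(list(s.iloc[0:1]))    # positional slice: the end position is excluded
print(list(s.loc[10:20]))   # label slice: the end label is included
```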

In [None]:
# Or by content
# This works because of boolean indexing, which we will get to.
tdf[tdf['Company'] == 'CITIBANK, N.A.'].head(5)
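
As a preview of boolean indexing: the comparison inside the brackets produces a Series of True/False values, and the dataframe keeps only the rows where it is True. The sketch below uses hypothetical company names on a toy frame.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    'Company': ['ACME BANK', 'OTHER BANK', 'ACME BANK'],
    'State': ['CA', 'NY', 'TX'],
})

mask = toy['Company'] == 'ACME BANK'   # Series of True/False, one per row
print(mask.tolist())

# Masks combine with & (and), | (or), and ~ (not).
# Each comparison needs its own parentheses.
both = toy[(toy['Company'] == 'ACME BANK') & (toy['State'] != 'TX')]
print(both)
```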

In [None]:
# We can filter and then analyze columns.
filtered_df = tdf[tdf['Company'] == 'AMERICAN EXPRESS COMPANY']

# And we can use Series methods if we want to examine columns
filtered_df['State'].value_counts()[:5]

### Examine Columns

In [None]:
# Or get unique values
filtered_df['Product'].unique()[:5]

In [None]:
# We can get column datatypes
filtered_df.dtypes

## Groupby Calculations

In [None]:
# We can group the data and view it in aggregate
# http://pandas.pydata.org/pandas-docs/stable/groupby.html
gb = df.groupby(['Product', 'Sub-product'])

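# mean() gives the per-group average of each numeric column.
# numeric_only=True skips text columns, which newer pandas versions refuse to average.
size = gb.mean(numeric_only=True).head(15)
pd.DataFrame(size['Restitution'])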

In [None]:
# Pandas also has simplified plotting
size['Restitution'].sort_values().plot.barh()

In [None]:
# We can reshape data as necessary.
size.unstack().fillna(' ')
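
What `unstack()` does is easier to see on a small example. In this sketch (hypothetical products and sub-products), the inner level of the grouped index becomes the columns, and combinations that never occur show up as missing values.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    'Product': ['Mortgage', 'Mortgage', 'Credit card'],
    'Sub-product': ['Type A', 'Type B', 'Type A'],
    'Restitution': [100.0, 300.0, 10.0],
})

# Long form: one value per (Product, Sub-product) pair
means = toy.groupby(['Product', 'Sub-product'])['Restitution'].mean()

# Wide form: the inner index level 'Sub-product' becomes the columns
wide = means.unstack()
print(wide)
```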

In [None]:
# Or compute several summary statistics per group at once.
# String names keep the resulting column labels stable across pandas versions.
output = gb['Restitution'].agg(['mean', 'median', 'max', 'min']).head(10)
output['range'] = output['max'] - output['min']
output
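
If you would rather choose the output column names yourself, named aggregation is one option. This is a minimal sketch on hypothetical data; the names `low` and `high` are arbitrary labels picked for the example.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    'Product': ['Mortgage', 'Mortgage', 'Credit card', 'Credit card'],
    'Restitution': [100.0, 300.0, 10.0, 30.0],
})

# keyword = output column name, value = aggregation to apply
summary = toy.groupby('Product')['Restitution'].agg(
    mean='mean',
    low='min',
    high='max',
)
summary['range'] = summary['high'] - summary['low']
print(summary)
```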

## Custom Row by Row Functions

In [None]:
# We can use apply with axis=1 to call a function on every row.
# Note that this runs row by row in Python; it is convenient, not vectorized.
def transmogrify(row):
    complaint_id = row['Complaint ID']  # avoid shadowing the builtin id()
    product = row['Product']
    state = row['State']
    return 'Complaint {} is a {} complaint from the state of {}.'.format(complaint_id, product, state)

output = df.apply(transmogrify, axis=1)[:5]
output.iloc[0]
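
For comparison, the same kind of sentence can often be built with whole-column operations, which tends to be faster on large frames. A sketch on hypothetical data, showing that the two approaches give the same strings:

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    'Complaint ID': [101, 102],
    'Product': ['Mortgage', 'Credit card'],
    'State': ['CA', 'NY'],
})

# Row by row: the function is called once per row
by_row = toy.apply(
    lambda row: 'Complaint {} is a {} complaint from {}.'.format(
        row['Complaint ID'], row['Product'], row['State']),
    axis=1,
)

# Column-wise: each + operates on entire columns at once
by_column = ('Complaint ' + toy['Complaint ID'].astype(str)
             + ' is a ' + toy['Product']
             + ' complaint from ' + toy['State'] + '.')

print(by_column.tolist())
```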

In [None]:
# We can do database style merges, joins, and concatenations
# http://pandas.pydata.org/pandas-docs/stable/merging.html
df2 = pd.read_csv('data/simple.csv')

df2.head(5)

## Merges and joins

In [None]:
# Here we nonsensically merge arbitrary numbers from simple.csv onto the CFPB dataset
tdf = df.merge(df2, how='inner', left_on='Date received', right_on='Date')

tdf[['Date received', 'Product', 'Count']].head(5)
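
The `how` argument controls which rows survive a merge. The sketch below uses two small hypothetical tables to contrast an inner merge with a left merge, and uses `indicator=True` to show where each row came from.

```python
import pandas as pd

# Hypothetical tables for illustration
complaints = pd.DataFrame({
    'Date received': ['2017-01-01', '2017-01-02', '2017-01-03'],
    'Product': ['Mortgage', 'Credit card', 'Mortgage'],
})
counts = pd.DataFrame({
    'Date': ['2017-01-01', '2017-01-02', '2017-01-09'],
    'Count': [5, 7, 9],
})

# inner: keep only rows whose key appears in both tables
inner = complaints.merge(counts, how='inner',
                         left_on='Date received', right_on='Date')

# left: keep every row of the left table; unmatched rows get missing values
left = complaints.merge(counts, how='left',
                        left_on='Date received', right_on='Date',
                        indicator=True)

print(len(inner))
print(left[['Date received', 'Count', '_merge']])
```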

In [None]:
# We can write the data to disk in a single line
output.to_csv('data/transmogrify_output.csv')

## Time series analysis

In [None]:
# Pandas has built in support for datetime objects, too.
df['Date received'] = pd.to_datetime(df['Date received'])

# Here we simply groupby day of week
gb = df.groupby([df['Date received'].dt.weekday])
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# And get our counts.
data = gb.size()
data.index = days_of_week

# Then plot.
data.plot.barh()
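
One caveat with assigning the day names directly: it assumes all seven weekdays appear in the data. A slightly more defensive sketch, on hypothetical dates, reindexes to 0 through 6 first so that missing days get a count of zero.

```python
import pandas as pd

# Hypothetical dates: two on a Monday, one on a Wednesday
dates = pd.Series(pd.to_datetime(['2017-05-01', '2017-05-01', '2017-05-03']))

# dt.weekday numbers the days from Monday = 0 to Sunday = 6
counts = dates.groupby(dates.dt.weekday).size()

# reindex adds the weekdays that never occurred, with a count of 0
counts = counts.reindex(range(7), fill_value=0)
counts.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                'Friday', 'Saturday', 'Sunday']
print(counts)
```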

In [None]:
# Which lets you get all sorts of fancy.
tix_df = df.set_index('Date received').sort_index()

# We can group Restitution by Product
gb = tix_df.groupby('Product')['Restitution']

# And then check out the rolling 7-day mean
roll = gb.rolling(window='7D')
count = roll.mean()

# Cleanup
deduped = count.groupby(['Product', 'Date received']).max()
final = deduped.unstack(0).ffill().fillna(0)

# And voila ...
date_range = ('2015-05-01', '2017-05-01')
# Use a list of column labels; a tuple would be read as a single (multi-level) key
products = ['Credit card', 'Consumer Loan', 'Credit reporting']
final.loc[date_range[0]: date_range[1], products].plot(title='7-day rolling mean of Restitution by Product')
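
A time-based window such as `'7D'` looks back over a span of time rather than a fixed number of rows, so it copes with irregularly spaced dates. A minimal sketch on a hypothetical Series with a datetime index:

```python
import pandas as pd

# Hypothetical, irregularly spaced observations
s = pd.Series(
    [10.0, 20.0, 30.0, 40.0],
    index=pd.to_datetime(['2017-01-01', '2017-01-02',
                          '2017-01-08', '2017-01-20']),
)

# Each window covers the 7 days ending at (and including) the current timestamp.
# 2017-01-08: the window holds Jan 2 and Jan 8 (Jan 1 has just dropped out).
# 2017-01-20: the window holds only Jan 20.
rolled = s.rolling(window='7D').mean()
print(rolled)
```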

## Text mining

In [None]:
# And support for string methods
# na=False treats missing narratives as "no match" so the result is a clean boolean mask
contains_lawyer = df['Consumer complaint narrative'].str.contains('lawyer', na=False)

# Get all items containing lawyer
data = df[contains_lawyer]['Consumer complaint narrative']

# Get text of first item
data.iloc[0]

In [None]:
# This includes regexes for text mining ... https://en.wikipedia.org/wiki/Regular_expression
regex_string = r'([Ll]awyer[\S\s]*?\.|[Aa]ttorney[\S\s]*?\.)'

# Extract the first match in each narrative (str.extractall would return every match)
lawyer_to_sentence_end = df['Consumer complaint narrative'].str.extract(regex_string,
                                                                        expand=True)
lawyer_to_sentence_end.dropna().head(5)
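
The difference between `str.extract` and `str.extractall` can be sketched with the same regex on two made-up narratives: `extract` returns one row per input string holding the first match, while `extractall` returns one row per match.

```python
import pandas as pd

# Hypothetical narratives for illustration
notes = pd.Series([
    'My lawyer called. Then their attorney wrote back.',
    'No legal help was involved.',
])
pattern = r'([Ll]awyer[\S\s]*?\.|[Aa]ttorney[\S\s]*?\.)'

first = notes.str.extract(pattern, expand=True)   # one row per input string
every = notes.str.extractall(pattern)             # one row per match

print(first)
print(every)
```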

---

## Additional Learning Resources

* ### [Pandas From The Ground Up](https://www.youtube.com/watch?v=5JnMutdy6Fw) / [Slides](https://github.com/brandon-rhodes/pycon-pandas-tutorial) <- this video changed my life
* ### [10 Minutes to Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html)
* ### [Visual Pandas](https://www.youtube.com/watch?v=9d5-Ti6onew)

---

# Next Up: [Scikit-Learn](8_scikit_learn.ipynb)

<img style="margin-left: 0; width: 40%;" src="static/sklearn_logo.png">

---