# Pandas: Tabular Data in Python

## Objectives

* Create `Series` and `DataFrame` objects from Python data types. 
* Create `DataFrame` objects from files.
* Index and slice `pandas` objects.
* Aggregate data in `DataFrame`s.
* Join multiple `DataFrame` objects.

## What is Pandas?

Pandas is a Python library that provides data structures and data analysis tools for tabular data of many types. Think of a `DataFrame` as a table in SQL.

## Benefits

  * Efficient storage and processing of data.
  * Includes many built-in functions for data transformation, aggregations, and plotting.
  * Great for exploratory work.

## Not so greats

  * Does not scale well to datasets that are too large to fit in memory.

## Documentation:

The documentation for pandas is here:

  * http://pandas.pydata.org/pandas-docs/stable/index.html
  
Particularly important reads (eventually) are:

  * [Indexing and Selecting](https://pandas.pydata.org/pandas-docs/stable/indexing.html)
  * [Advanced Indexing](http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-mi-slicers)
  * [Group-by](https://pandas.pydata.org/pandas-docs/stable/groupby.html)

## Standard Imports

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('ggplot')

## Numpy: A Quick Primer

`pandas` is built out of data types from `numpy`, a lower level library.

The basic object in `numpy` is an `array`.

In [None]:
x = np.array([0, 1, 2, 3, 4, 5])
x

Arrays can be processed very efficiently.

In [None]:
x.sum()  # <-- Runs in compiled code, so it is fast even on large arrays.

Arrays can be multi-dimensional.  A **two-dimensional array** is called a **matrix**.

In [None]:
M = np.array([
    [0, 1, 2],
    [1, 2, 3],
    [2, 3, 4],
    [5, 6, 7]
])

M

In [None]:
print(x.shape)
print(M.shape)

### Drawbacks of Numpy as a General Data Analysis Tool

#### Numpy arrays can only store homogeneous data

In [None]:
x = np.array([
    [2.0, 3.4, "Jack"],
    [1.0, 0.4, "Matt"],
    [5.0, 9.4, "Miles"]
])

That seemed to work...

In [None]:
x.dtype

What?

Numpy has chosen to store our array as **unicode strings**.  **Even the numbers are now strings!**

In [None]:
x[0, 0]  # <- It is a string! It is a string!

Arithmetic operations that should work do not.  Here is an attempt at a column sum!

In [None]:
# Column sum (axis=0 sums down each column; axis=1 would sum across each row).
x.sum(axis=0)

This happens because numpy arrays are **homogeneous**.  All the data in an array (even in different columns) must be of the same datatype!

#### Numpy Arrays only Accept Integer Indexes

You cannot assign column or row names to numpy arrays. This can make it harder to program.
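
As a preview of how pandas addresses both drawbacks, here is a minimal sketch that stores the same mixed data in a `DataFrame`.  The column names (`score`, `ratio`, `name`) and row labels are made up for illustration, and the `.loc` indexer used at the end is covered later in this notebook.

```python
import pandas as pd

# Same values as the numpy example above, stored column by column.
# Column names and row labels are hypothetical.
people = pd.DataFrame(
    {
        "score": [2.0, 1.0, 5.0],
        "ratio": [3.4, 0.4, 9.4],
        "name": ["Jack", "Matt", "Miles"],
    },
    index=["first", "second", "third"],
)

# Each column keeps its own dtype, so the numbers stay numbers.
print(people.dtypes)

# A column sum over the numeric columns now works.
column_sums = people[["score", "ratio"]].sum(axis=0)
print(column_sums)

# Rows and columns can be addressed by name.
print(people.loc["second", "name"])
```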

## Getting Data into Pandas

### Creating DataFrames from Python Objects

You can think of DataFrames as labeled (columns) and indexed (rows) matrices. 

We can create DataFrames from numpy arrays and lists of lists, with labels and indices that we provide. The `columns=` parameter specifies the names for the columns; the `index=` parameter specifies the names for the rows.

In [None]:
pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]], 
    columns=['a', 'b', 'c'], 
    index=['foo', 'bar'])

Alternatively, you can think of DataFrames as a combination of column vectors, so we can create DataFrames from a dictionary of column vectors.  The keys are the column labels, and the values are the vectors.

In [None]:
frame_dict = {'column_1': [1, 2, 3], 'column_2': [10, 11, 12]}
pd.DataFrame(frame_dict, index=['3', '2', '1'])

#### Exercise:

Create a data frame with two columns, `increasing` and `decreasing`, that hold the numbers 1-100 in increasing and decreasing order respectively.

### Series

If DataFrames are labeled and indexed matrices, then Series are labeled and indexed vectors.

In [None]:
s = pd.Series([1, '2', 3], index=['a', 'b', 'c'], name='Numbers')
s

If you create a Series using a dictionary, the keys are treated as indices instead.  Note that in older versions of Python (before 3.7) and pandas, the order of the elements might not be the same as the order in the dictionary.

In [None]:
pd.Series({'Star': 'Wars', 'Is': 'Boring', 'Please': 'Stop'})

You can take out a Series from a DataFrame.

In [None]:
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]], 
    columns=['a', 'b', 'c'], 
    index=['foo', 'bar'])

print("Data Frame")
print(df)
print()
print("Column 'a'")
print(df['a'])

In [None]:
type(df['c'])

... or put a Series into a DataFrame, as long as it has a matching index.

In [None]:
df['d'] = pd.Series([4, 5], index=['foo', 'bar'])
df

The elements in the assignment above are matched **by index**, which is a common pattern in Pandas.

In [None]:
# Index flipped from previous example.
#                           v
df['d'] = pd.Series([4, 5], index=['bar', 'foo'])
df

Where an index label in the DataFrame has no match in the Series, a missing value is filled in.  Labels that appear only in the Series are dropped.

In [None]:
df['d'] = pd.Series([4, 5], index=['bar', 'baz'])
df
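
The same matching by index happens in arithmetic between two Series.  A small sketch with made-up labels and values:

```python
import pandas as pd

# Two Series whose indices overlap only partly and are ordered differently.
left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["c", "b", "d"])

# Values are paired up by label, not by position.
# Labels present in only one Series ("a" and "d") produce missing values.
total = left + right
print(total)
```

Here `b` pairs 2 with 20 and `c` pairs 3 with 10, regardless of where those labels sit in each Series.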

We can also put a list/vector into a DataFrame, and here there is no index, so the column is inserted in order.

In [None]:
df['e'] = [1, 2]
df

#### Exercise:

Create a data frame that has two columns `increasing` and `evens`.  The `increasing` column contains the numbers 1-100 in increasing order, and the `evens` column has the even numbers in increasing order at the same locations as in `increasing`, but with missing values in the other locations.

### Load data from csv

A csv (comma separated values) is a file format used to store data separated by a **delimiter**.

A delimiter is a **single character** that marks the boundaries between data elements in a file.  A comma is a traditional choice of delimiter but a relatively poor one, because commas are often part of the elements themselves.  Better choices are pipe (`|`) and tab (`\t`).

In [None]:
# Pipe separated file.
!head 'playgolf.csv'

In a bizarre twist of history, comma separated files are often separated by different characters than commas.  There is no consistent convention of using a different file extension, but some people use `.psv` or `.tsv`.

Pandas has a `read_csv` function that loads a delimited file into a `DataFrame`.  The resulting object **must fit in memory**.

In [None]:
golf_df = pd.read_csv('playgolf.csv', delimiter='|')

`DataFrame.head` can be used to view a portion of our new dataframe.

In [None]:
golf_df.head()
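
If `playgolf.csv` is not at hand, the same idea can be tried on an in-memory string.  The sketch below uses invented data (the column names and values are not from the golf file) and shows why a pipe is a convenient delimiter when the fields themselves contain commas.  In `read_csv`, `sep=` and `delimiter=` are two names for the same option.

```python
import io

import pandas as pd

# Hypothetical pipe separated data; note the commas inside the Description field.
raw_text = (
    "City|Description|Temperature\n"
    "Seattle|Cloudy, with rain|55\n"
    "Denver|Sunny, dry|72\n"
)

# read_csv accepts any file-like object, not just a file name.
weather_df = pd.read_csv(io.StringIO(raw_text), sep="|")
print(weather_df)
```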

## Extracting information from DataFrames

### Basic Row and Column Indexing

As we have seen, individual columns may be extracted from a `DataFrame` as a `Series` using the usual `__getitem__` style indexing using the name of the column.  

This is similar to how we index a dictionary.

In [None]:
golf_df['Temperature']

We can extract individual values by taking the series out of the data frame, then indexing it like a list.  (With the default integer index, the label `0` is also the first position.)

In [None]:
golf_df['Temperature'][0]

We can extract multiple columns at once.

In [None]:
golf_df[['Temperature', 'Humidity']]

If you try to index with a slice, however, it will only operate on the rows.

In [None]:
short_df = golf_df[0:5]
short_df

### Boolean / Logical Indexing

We can also index into a `DataFrame` using a list of **booleans** (i.e. `True` and `False` values). This will also operate on the rows.

In [None]:
# Takes rows 0, 2, and 4.
short_df[[True, False, True, False, True]]

Which doesn't seem that useful...except we can create a boolean `Series` by using comparisons on a `Series`.

In [None]:
# A series of booleans.
golf_df['Temperature'] > 70

And then use the result to grab rows of the dataframe.

In [None]:
golf_df[golf_df['Temperature'] > 70][["Date", "Windy"]]

This is essentially applying a logical condition to select rows from a `DataFrame`.  This is one of the most common patterns in Pandas.

#### Exercise

Select all of the rainy days in which the humidity is larger than 90 from this data frame.

To review: if you index a `DataFrame` with a **single value** or a **list of values**, it selects the **columns**.

If you use a **slice** or **sequence of booleans**, it selects the **rows**. 
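
The four cases in this review can be checked side by side.  This is a small sketch on a made-up frame (the values are invented, not taken from the golf data):

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"Temperature": [85, 64, 72, 69], "Humidity": [85, 65, 95, 70]}
)

# Single value -> one column, returned as a Series.
one_column = toy_df["Temperature"]

# List of values -> several columns, returned as a DataFrame.
two_columns = toy_df[["Temperature", "Humidity"]]

# Slice -> rows (here the first two).
first_rows = toy_df[0:2]

# Sequence of booleans -> rows where the value is True.
warm_rows = toy_df[toy_df["Temperature"] > 70]

print(type(one_column).__name__, two_columns.shape, first_rows.shape, warm_rows.shape)
```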

### Double Indexing

Suppose we want to set the value of the `Windy` column where `Temperature > 70` to True (because, um, science).

In [None]:
golf_df[golf_df['Temperature'] > 70]["Windy"] = True

What?

In [None]:
golf_df[golf_df['Temperature'] > 70]["Windy"]

Apparently that warning actually meant something.

This pattern is called double indexing (pandas calls it chained indexing), and it is an antipattern!  Pandas cannot guarantee that assignments will hold when you index twice, because the first indexing step may return a copy of the data.

To fix these issues, we need to study the other indexing options that Pandas provides.

### Other Indexers: .loc and .iloc

There are two other indexing objects in pandas, both of which take a value to choose rows and a value to choose columns.

  - `df.iloc` is **positionally based**.  This indexer accepts integers and integer slices, and essentially treats the data frame as if it were a simple matrix.
  - `df.loc` is **label based**.  This indexer works with row and column indices / labels.
  
There used to be another one, and you will encounter it sometimes.

  - `df.ix` is **mixed**, it works with row numbers (integers) and column labels (names).
  
**The `ix` indexer was deprecated in pandas 0.20 and removed in pandas 1.0.  Older versions give a warning if you use it, and current versions raise an error.  Don't write code that uses ix!**

In [None]:
df = pd.DataFrame({
    'some_integers': [0, 0, 1, 1, 2, 2],
    'some_strings': ['x', 'y', 'z', 'x', 'y', 'z'],
    'some_booleans': [0, 0, 1, 0, 1, 1]},
    index=['a', 'b', 'c', 'd', 'e', 'f']
)
df

In [None]:
df.iloc[2:4, 0:2]

In [None]:
df.loc['b':'e', ['some_integers', 'some_booleans']]
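
One difference between the two indexers is easy to miss: a label slice with `.loc` includes its end point, while a position slice with `.iloc` excludes it, like ordinary Python slicing.  A minimal sketch on a made-up frame:

```python
import pandas as pd

letters_df = pd.DataFrame(
    {"value": [10, 20, 30, 40, 50, 60]},
    index=["a", "b", "c", "d", "e", "f"],
)

# Label based: both end points are included, so this is rows b, c, d, e.
by_label = letters_df.loc["b":"e"]

# Position based: the end point is excluded, so this is positions 1, 2, 3.
by_position = letters_df.iloc[1:4]

print(list(by_label.index))
print(list(by_position.index))
```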

**Deprecation warning!  In pandas 1.0 and later this raises an `AttributeError` instead.**

In [None]:
df.ix[2:4, ['some_integers', 'some_booleans']]

### Mixed Indexing

So what do we do if we want to get the rows by position, and get the columns by label?  I.e. if we have a use for **mixed indexing**.

In [None]:
# Mixed indexing with iloc: will not work.
df.iloc[2:4, ['some_integers', 'some_booleans']]

Doing mixed indexing in modern pandas is more explicit and less magical.  You need to use the `df.index` and `df.columns` attributes to explicitly turn positions into labels.

In [None]:
df = pd.DataFrame({
    'some_integers': [0, 0, 1, 1, 2, 2],
    'some_strings': ['x', 'y', 'z', 'x', 'y', 'z'],
    'some_booleans': [0, 0, 1, 0, 1, 1]},
    index=['a', 'b', 'c', 'd', 'e', 'f']
)
df

#### Rows by position, Columns by name

In [None]:
df.index[2:4]

In [None]:
df.loc[df.index[2:4], ['some_integers', 'some_booleans']]

#### Rows by name, Columns by position

In [None]:
df.columns[[0, 2]]

In [None]:
df.loc[['c', 'd'], df.columns[[0, 2]]]
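
With `.loc` in hand, the double indexing problem from earlier has a straightforward fix: select the rows and the column in a single indexing operation, so the assignment acts on the original frame.  A sketch on a made-up frame that borrows the `Temperature` and `Windy` column names (the values are invented):

```python
import pandas as pd

toy_golf = pd.DataFrame(
    {
        "Temperature": [85, 64, 72, 69],
        "Windy": [False, True, False, False],
    }
)

# One indexing step: a boolean row selector and a column label together.
toy_golf.loc[toy_golf["Temperature"] > 70, "Windy"] = True

print(toy_golf)
```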

### Transforming data

Arithmetic operations apply to `Series` element by element.

In [None]:
# Yes, this makes no sense.
golf_df["TempHumid"] = golf_df['Temperature'] + golf_df['Humidity']

In [None]:
golf_df.head()

In [None]:
# More Usefully

# Heat index formula taken from wikipedia: 
#    https://en.wikipedia.org/wiki/Heat_index
temp = golf_df['Temperature']
humid = golf_df['Humidity']
golf_df['HeatIndex'] = (-42.37 + 2.05*temp + 10.14*humid
                        - 0.225*temp*humid
                        - 6.84e-3*temp**2 
                        - 5.482e-2*humid**2
                        + 1.23e-3*temp**2*humid
                        + 8.53e-4*temp*humid**2
                        - 1.99e-6*temp**2*humid**2
)
golf_df[['Temperature', 'Humidity', 'HeatIndex']]

We can create a new Series by applying functions to an existing Series.

In [None]:
# Create an indicator variable out of a column.
golf_df['Result'].apply(lambda x: 1 if x == 'Play' else 0)

Though the previous result is better executed as

In [None]:
(golf_df['Result'] == 'Play').astype(int)

We can check that these give the same things

In [None]:
golf_df['Result'].apply(lambda x: 1 if x == 'Play' else 0) == (golf_df['Result'] == 'Play').astype(int)

Or, to get a single answer

In [None]:
np.all(
    golf_df['Result'].apply(lambda x: 1 if x == 'Play' else 0) 
    == (golf_df['Result'] == 'Play').astype(int))

We can also apply a function to each row of the DataFrame by passing `axis=1`, though this is rarely the best choice because it is usually more efficient to use arithmetic operations on whole columns.

In [None]:
golf_df.apply(lambda x: x['Temperature'] + x['Humidity'], axis=1)

In general, `.apply` is useful for mapping complex functions across your data.  You should be wary of using it in simple cases like this; there is probably a better way.
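
To make the comparison concrete without the golf file, here is a small sketch on invented values showing that the row-wise `.apply` and the column arithmetic produce the same numbers.  The arithmetic version works on whole columns at once rather than calling a Python function per row.

```python
import pandas as pd

toy_df = pd.DataFrame({"Temperature": [85, 64, 72], "Humidity": [85, 65, 95]})

# Row by row: the lambda is called once for every row.
row_wise = toy_df.apply(lambda row: row["Temperature"] + row["Humidity"], axis=1)

# Column arithmetic: one operation over the whole columns.
vectorized = toy_df["Temperature"] + toy_df["Humidity"]

# Element by element comparison, reduced to a single answer.
same_values = bool((row_wise == vectorized).all())
print(same_values)
```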

### Aggregating data

We can do something like the group by statement in SQL.

In [None]:
groups = golf_df.groupby('Outlook')

Iterating over the result of `groupby` yields a tuple for each value of `Outlook`: the group name and a DataFrame containing only that group's rows.

In [None]:
for group in groups:
    print('Group Name: ', group[0])
    print('Group Data:\n', group[1])
    print('\n')

We can then apply some sort of aggregation to each subset of the data.

In [None]:
groups.count()

In [None]:
groups.sum()

In [None]:
groups.mean(numeric_only=True)  # Recent pandas versions raise an error on text columns otherwise.

You can apply your own custom aggregation functions with `aggregate`.

In [None]:
groups.aggregate(min)

In [None]:
# Get the minimum Temperature within each group.
# Note: This is an awful way to accomplish this, it's just for illustration.
groups['Temperature'].aggregate(lambda temps: sorted(temps)[0])

You should investigate a better way to accomplish the task in the previous example.

Note that groupby is a big topic; more documentation is at http://pandas.pydata.org/pandas-docs/stable/groupby.html
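
As one possible direction, built-in aggregations can be applied to a single column of the groups, and `agg` accepts a dictionary that maps column names to one or more aggregation names.  This is a sketch on a made-up frame that reuses the `Outlook`, `Temperature`, and `Humidity` column names (the values are invented):

```python
import pandas as pd

toy_golf = pd.DataFrame(
    {
        "Outlook": ["sunny", "sunny", "rainy", "rainy", "overcast"],
        "Temperature": [85, 80, 65, 70, 72],
        "Humidity": [85, 90, 70, 96, 90],
    }
)
toy_groups = toy_golf.groupby("Outlook")

# Select one column from the groups, then use a built-in aggregation.
min_temperature = toy_groups["Temperature"].min()
print(min_temperature)

# Different aggregations per column; the result has two-level column labels.
summary = toy_groups.agg({"Temperature": ["min", "max"], "Humidity": "mean"})
print(summary)
```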



### Joining DataFrames

We can join DataFrames in much the same way that we join tables in SQL.  In fact, left, right, outer, and inner joins work the same way here.

Let's create a fake DataFrame to join with first.

In [None]:
mood_df = pd.DataFrame([['overcast', 'sad'], ['rainy', 'sad'], ['sunny', 'happy']],
                       columns=['Weather', 'Mood'])

mood_df

We can do joins using the merge command.

In [None]:
golf_df.merge(mood_df, how='inner', left_on='Outlook', right_on='Weather')

There are, of course, other options besides `inner`, which you can find in the documentation.
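
To see how the options differ, here is a sketch on two small made-up frames in which one `Outlook` value (`foggy`) has no match and one `Weather` value (`overcast`) is unused.  The `indicator=True` argument of `merge` adds a `_merge` column saying where each row came from.

```python
import pandas as pd

days_df = pd.DataFrame(
    {"Outlook": ["sunny", "rainy", "foggy"], "Temperature": [85, 65, 60]}
)
toy_mood_df = pd.DataFrame(
    {"Weather": ["overcast", "rainy", "sunny"], "Mood": ["sad", "sad", "happy"]}
)

# Inner: only rows whose keys appear in both frames.
inner = days_df.merge(toy_mood_df, how="inner", left_on="Outlook", right_on="Weather")

# Left: every row of days_df; unmatched rows get missing values for Mood.
left = days_df.merge(
    toy_mood_df, how="left", left_on="Outlook", right_on="Weather", indicator=True
)

# Outer: every row of both frames.
outer = days_df.merge(toy_mood_df, how="outer", left_on="Outlook", right_on="Weather")

print(len(inner), len(left), len(outer))
print(left)
```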

## Concatenating dataframes

This is the equivalent of Unions in SQL, but a little more flexible.

In [None]:
df1 = pd.DataFrame(
    {'Col3': range(5), 'Col2': range(5), 'Col1': range(5)},
    index=range(0, 5))
df2 = pd.DataFrame(
    {'Col1': range(5), 'Col2': range(5), 'Col4': range(5)},
    index=range(3, 8))

In [None]:
df1

In [None]:
df2

#### Vertically

This is like a Union All. The `sort` parameter controls the order of the columns in the output.

In [None]:
pd.concat([df1, df2], axis=0, join='outer', sort=True)

An `inner` value limits the columns to those in all the inputs.

In [None]:
pd.concat([df1, df2], axis=0, join='inner', sort=True)

#### Horizontally

This is pretty much a simple join on indices.  While `concat` is capable of doing joins, it is far less flexible.

In [None]:
pd.concat([df1, df2], axis=1)

**Question:** why do some numbers show up as floats? Why do some numbers not?

For more on joining DataFrames, read https://pandas.pydata.org/pandas-docs/stable/merging.html

## Some Extra, Useful Stuff

### Various Summaries

The `info` method is useful for checking column types and quickly seeing if you have NaN in the data.

In [None]:
golf_df.info()

The `describe` method will give you a quick sense of the quartiles and distribution.

In [None]:
golf_df.describe()

### Frequency Tables

The `crosstab` function will allow us to quickly take a look at the frequency count between two columns.

In [None]:
pd.crosstab(golf_df['Outlook'], golf_df['Result'])

### DateTimes

We can turn strings of dates into datetime types by using Pandas' `to_datetime` function.

In [None]:
golf_df['DateTime'] = pd.to_datetime(golf_df['Date'])

In [None]:
golf_df.head()

In [None]:
golf_df.info()

In [None]:
golf_df['DateTime'].describe()
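
Real date columns are often messy.  As a hedged sketch on invented strings (the format of the `Date` column in the golf file is not assumed here), `to_datetime` can be given an explicit `format=` and told to turn unparseable entries into missing values with `errors='coerce'`:

```python
import pandas as pd

date_strings = pd.Series(["2014-07-01", "2014-07-02", "not a date"])

# Entries that do not match the format become NaT (the datetime missing value)
# instead of raising an error.
parsed = pd.to_datetime(date_strings, format="%Y-%m-%d", errors="coerce")
print(parsed)

# Datetime columns expose their components through the .dt accessor.
print(parsed.dt.month)
```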

### Creating a New Row Index

We can also set the index to be an existing column(s).

In [None]:
date_df = golf_df.set_index('DateTime')
date_df

In [None]:
date_df.index

If we have an index of datetime types, we can use the `resample` method to quickly look at time based aggregations.

In [None]:
# Weekly means of the numeric columns.
date_df[['Temperature', 'Humidity']].resample('W').mean()

This will be especially useful when we work with time series.
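
For a version that runs without the golf file, here is a sketch on fourteen days of made-up daily values.  With the `'W'` rule, each bin ends on a Sunday and is labeled with that Sunday's date.

```python
import numpy as np
import pandas as pd

# Fourteen consecutive days starting on Monday 2024-01-01, with values 0.0 to 13.0.
days = pd.date_range("2024-01-01", periods=14, freq="D")
daily = pd.DataFrame({"value": np.arange(14.0)}, index=days)

# Two full Monday-to-Sunday weeks, so two weekly means.
weekly = daily.resample("W").mean()
print(weekly)
```

The first week averages the values 0 through 6 and the second averages 7 through 13.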

## Writing Data

We can write data into a csv file.

In [None]:
golf_df.to_csv('new_playgolf.csv', index=False)

In [None]:
!cat new_playgolf.csv
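
Writing and reading can also be checked as a round trip without touching the disk.  This sketch uses a made-up frame and an in-memory buffer; `to_csv` takes a `sep=` argument, so the pipe delimiter used for reading works for writing too.

```python
import io

import pandas as pd

original = pd.DataFrame({"Outlook": ["sunny", "rainy"], "Temperature": [85, 65]})

# Write to an in-memory text buffer instead of a file.
buffer = io.StringIO()
original.to_csv(buffer, sep="|", index=False)
print(buffer.getvalue())

# Rewind the buffer and read the data back in.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="|")
print(restored)
```

Passing `index=False` keeps the row index out of the file; without it, the index would come back as an extra unnamed column.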