#Pandas

##What is Pandas?
A Python library providing data structures and data analysis tools.

##Why
- Alternative to Excel or R
- Based on DataFrames (think of a table) and Series (a single column of a table, or a time series)

##Learning Pandas
* Almost anything you want to do is already a built-in function in Pandas.
* Before you write your own function to operate on a Pandas object, scour the Pandas docs and StackOverflow.
* http://pandas.pydata.org/pandas-docs/stable/index.html

#Objectives

- Create/Understand Series objects
- Create/Understand DataFrame objects
- Create and destroy new columns, apply functions to rows and columns
- Join/Merge DataFrames
- Use DataFrame grouping and aggregation
- Perform high-level EDA using Pandas

### Standard Imports

In [None]:
# By convention import pandas like:
import pandas as pd
import numpy as np

# For fake data.
from numpy.random import randn

#Series

Think of a Pandas Series as a _labeled_ one-dimensional vector. It need not be a numeric vector; it can contain arbitrary Python objects.

Integer valued series:

Real valued series:

String valued series:
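
The three kinds of series named above can be sketched roughly as follows. The values are made up for illustration, and each series gets the default integer index.

```python
import numpy as np
import pandas as pd

# Integer-valued series with the default 0..n-1 index.
int_series = pd.Series(range(5))

# Real-valued series built from random normal draws (fake data).
real_series = pd.Series(np.random.randn(5))

# String-valued series; the reported dtype is object or a string dtype,
# depending on the pandas version.
str_series = pd.Series(['a', 'b', 'c', 'd', 'e'])

print(int_series.dtype, real_series.dtype, str_series.dtype)
```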

#Indexes

Notice how each series has an index (in this case a relatively meaningless default index). Pandas can make great use of informative indexes. Indexes work similarly to dictionary keys, allowing fast lookups of the data associated with a label, which helps optimize many operations.

In [None]:
# Sample index - each data point is labelled with a state.
index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']
index2 = ['Washington', 'Alabama', 'Montana', 'Indiana', 'New York']

Labelled numeric series:

The index is used to line up arithmetic operations.
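
A minimal sketch of index alignment, reusing the two state lists from the cell above. The numeric values are invented; the point is that addition matches labels, not positions, and labels present in only one series come out as `NaN`.

```python
import pandas as pd

index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']
index2 = ['Washington', 'Alabama', 'Montana', 'Indiana', 'New York']

# Two labelled numeric series (values are made up).
s1 = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=index1)
s2 = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=index2)

# Addition lines up on the state labels; unmatched labels become NaN.
total = s1 + s2
print(total)
```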

Aggregation by index labels is easy (and optimized)
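
One way to aggregate by index label is `groupby(level=0)`. This is a sketch on invented data with repeated labels, not necessarily the approach used in the original lecture.

```python
import pandas as pd

# A series whose index contains repeated state labels (made-up values).
s = pd.Series([1, 2, 3, 4],
              index=['Alabama', 'Montana', 'Alabama', 'Montana'])

# level=0 groups on the index itself rather than on a column.
by_state = s.groupby(level=0).sum()
print(by_state)
```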

Create a series indexed by month

Closer look at the index values

Resample by week
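
The three steps above might look like the sketch below. The start date, the values, and the choice to forward-fill when going from monthly to weekly are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# 'MS' means month start; the dates themselves are made up.
months = pd.date_range('2015-01-01', periods=3, freq='MS')
monthly = pd.Series(np.arange(3, dtype=float), index=months)

# Closer look at the index values: a DatetimeIndex, not plain strings.
print(monthly.index)

# Upsample to weekly frequency, carrying each monthly value forward.
weekly = monthly.resample('W').ffill()
print(weekly.head())
```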

#DataFrames
Data frames extend the concept of Series to table-like data.

From a dictionary of series
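
A small sketch of building a DataFrame from a dictionary of series, using the state lists defined earlier. The column names `one` and `two` are placeholders. The row index is the union of the two series' labels, with `NaN` where a series has no value.

```python
import pandas as pd

index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']
index2 = ['Washington', 'Alabama', 'Montana', 'Indiana', 'New York']

# Each dictionary key becomes a column; rows are aligned on the labels.
df = pd.DataFrame({
    'one': pd.Series(range(5), index=index1),
    'two': pd.Series(range(5, 10), index=index2),
})
print(df)
```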

From a numpy array
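
A comparable sketch for a 2-D numpy array. The row and column labels here are arbitrary; without them Pandas falls back to default integer labels.

```python
import numpy as np
import pandas as pd

# 5 rows x 3 columns of fake data.
data = np.random.randn(5, 3)

# Row and column labels are supplied separately from the values.
df = pd.DataFrame(data,
                  index=['r1', 'r2', 'r3', 'r4', 'r5'],
                  columns=['a', 'b', 'c'])
print(df)
```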

DataFrames can be indexed (selected) by label, by integer position (avoid if possible), and by boolean mask.
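
The three selection styles can be sketched on a tiny made-up frame: `.loc` for labels, `.iloc` for integer positions, and a boolean Series as a row mask.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]},
                  index=['x', 'y', 'z'])

by_label = df.loc['y', 'b']      # row label 'y', column label 'b'
by_position = df.iloc[1, 1]      # second row, second column
by_mask = df[df['a'] > 1]        # keep rows where the condition is True

print(by_label, by_position)
print(by_mask)
```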

In [None]:
# Each column is a series


In [None]:
# So are the rows.


In [None]:
#The columns all have the same index:


In [None]:
#What's the index for the rows?
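
The four cells above could be filled in along these lines. This is a sketch on a small invented frame, not the lecture's own `df`.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
                  index=['x', 'y', 'z'])

# Each column is a Series indexed by the row labels.
col = df['a']

# Each row is also a Series, indexed by the column labels.
row = df.loc['x']

print(type(col), type(row))
print(col.index)   # same as df.index, shared by every column
print(row.index)   # same as df.columns
```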


#DataFrame basics

In [None]:
df

In [None]:
#New column


In [None]:
#Delete a column
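
A minimal sketch of both operations on an invented frame. `del df['b']` is an in-place alternative to `drop`.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# New column: assign to a column label that does not exist yet.
df['c'] = df['a'] + df['b']

# Delete a column: drop returns a new frame without it.
df = df.drop(columns=['b'])
print(df)
```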


#Applying functions

In [None]:
# Mean of each column


In [None]:
# Mean of each row
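
For reference, the `axis` argument controls the direction. The sketch below uses made-up numbers; `axis=0` (the default) works down each column and `axis=1` works across each row.

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})

col_means = df.mean()          # one value per column
row_means = df.mean(axis=1)    # one value per row

# apply runs an arbitrary function on each column (or row, with axis=1).
col_ranges = df.apply(lambda col: col.max() - col.min())

print(col_means, row_means, col_ranges, sep='\n')
```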


#Load some data from disk

In [None]:
df = pd.read_csv('data/playgolf.csv', sep='|')
df

In [None]:
# Describe


In [None]:
# Let's use date as the index


In [None]:
# Look at some subsets


In [None]:
# What are the averages of the numeric variables?


In [None]:
# Apply an arbitrary function to each column


In [None]:
# Or each row
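
The steps in the cells above might look like this. Because the real file is not reproduced here, the sketch builds a small stand-in frame; the column names (`Date`, `Outlook`, `Temperature`, `Humidity`, `Windy`) and all values are assumptions, not the actual contents of `playgolf.csv`.

```python
import pandas as pd

# Stand-in for the golf data (hypothetical columns and values).
golf = pd.DataFrame({
    'Date': pd.to_datetime(['2014-07-01', '2014-07-02',
                            '2014-07-03', '2014-07-04']),
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain'],
    'Temperature': [85, 80, 83, 70],
    'Humidity': [85, 90, 78, 96],
    'Windy': [False, True, False, False],
})

summary = golf.describe()              # summary statistics
golf = golf.set_index('Date')          # use the date as the index
hot_days = golf[golf['Temperature'] > 80]   # a subset of rows

numeric = golf[['Temperature', 'Humidity']]
numeric_means = numeric.mean()         # averages of the numeric variables
col_ranges = numeric.apply(lambda col: col.max() - col.min())
row_diffs = numeric.apply(
    lambda row: row['Temperature'] - row['Humidity'], axis=1)
```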


#split-apply-combine

In [None]:
# Get averages for each outlook


In [None]:
# Or using the index


In [None]:
# Initialize a groupby object, then iterate through the groupings


In [None]:
# One row for each group.


In [None]:
# Same shape as the original


In [None]:
# Different index than I started with.
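
A sketch of the split-apply-combine ideas in the cells above, on a stand-in frame with assumed column names `Outlook` and `Temperature`. An aggregation returns one row per group, indexed by the group key, while `transform` returns a result with the same shape and index as the original.

```python
import pandas as pd

golf = pd.DataFrame({
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain'],
    'Temperature': [85, 80, 83, 70],
})

grouped = golf.groupby('Outlook')

# Iterate through the groupings: each item is (key, sub-frame).
for name, group in grouped:
    print(name, len(group))

# One row for each group; the index is now the Outlook values.
group_means = grouped['Temperature'].mean()

# Same shape as the original: each row gets its group's mean.
broadcast = grouped['Temperature'].transform('mean')
```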


#ReadyChef
[Data](https://www.dropbox.com/sh/5sm9nvnh6b4m8d0/AABQyediVavAdsjnoEUBEyYCa?dl=0)

Download, unzip and place the readychef directory in pandas-tutorial/data

In [None]:
meals = pd.read_csv('data/readychef/meals.csv')
events = pd.read_csv('data/readychef/events.csv')
referrals = pd.read_csv('data/readychef/referrals.csv')
users = pd.read_csv('data/readychef/users.csv')
visits = pd.read_csv('data/readychef/visits.csv')

Select statements
===================

1. To get an understanding of the data, run a [SELECT](http://www.postgresqltutorial.com/postgresql-select/) statement on each table. Keep all the columns and limit the number of rows to 10.

2. Write a `SELECT` statement that would get just the userids.

3. Maybe you're just interested in what the campaign ids are. Use `SELECT DISTINCT` to figure out all the possible values of that column.

    *Note:*  Pinterest=PI, Facebook=FB, Twitter=TW, and Reddit=RE

In [None]:
#3
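
If you are answering these in Pandas rather than SQL, rough equivalents of the three statements are sketched below. The frame is a tiny stand-in for the `users` table with invented rows; only the column names `userid`, `dt`, and `campaign_id` come from the exercise text.

```python
import pandas as pd

# Stand-in for users.csv (rows are made up).
users = pd.DataFrame({
    'userid': [1, 2, 3, 4],
    'dt': ['2013-01-01', '2013-01-01', '2013-01-02', '2013-01-02'],
    'campaign_id': ['RE', 'FB', 'FB', 'TW'],
})

first_rows = users.head(10)                 # SELECT * ... LIMIT 10
userids = users['userid']                   # SELECT userid
campaigns = users['campaign_id'].unique()   # SELECT DISTINCT campaign_id
print(campaigns)
```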


Where Clauses / Filtering
========================================

Now that we have the lay of the land, we're interested in the subset of users that came from Facebook (FB). If you're unfamiliar with SQL syntax, the [WHERE](http://www.postgresqltutorial.com/postgresql-where/) clause can be used to add a conditional to `SELECT` statements. This has the effect of only returning rows where the conditional evaluates to `TRUE`. 

*Note: Make sure you put string literals in single quotes, like `campaign_id='TW'`.*

1. Using the `WHERE` clause, write a new `SELECT` statement that returns all rows where `campaign_id` is equal to `FB`.

2. We don't need the campaign id in the result since they are all the same, so only include the other two columns.

    Your output should be something like this:

    ```
     userid |     dt
    --------+------------
          3 | 2013-01-01
          4 | 2013-01-01
          5 | 2013-01-01
          6 | 2013-01-01
          8 | 2013-01-01
    ...
    ```
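
In Pandas, a `WHERE` clause corresponds to a boolean mask, and choosing columns corresponds to the second argument of `.loc`. A sketch on an invented stand-in for the `users` table:

```python
import pandas as pd

# Stand-in for users.csv (rows are made up).
users = pd.DataFrame({
    'userid': [1, 3, 4, 5],
    'dt': ['2013-01-01', '2013-01-01', '2013-01-01', '2013-01-01'],
    'campaign_id': ['TW', 'FB', 'FB', 'RE'],
})

# SELECT userid, dt FROM users WHERE campaign_id = 'FB'
fb_users = users.loc[users['campaign_id'] == 'FB', ['userid', 'dt']]
print(fb_users)
```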


Aggregation Functions
=======================



6. Now get the average price, the min price and the max price for each meal type. Don't forget the group by statement!

    Your output should look like this:

    ```
        type    |         avg         | min | max
    ------------+---------------------+-----+-----
     mexican    |  9.6975945017182131 |   6 |  13
     french     | 11.5420000000000000 |   7 |  16
     japanese   |  9.3804878048780488 |   6 |  13
     italian    | 11.2926136363636364 |   7 |  16
     chinese    |  9.5187165775401070 |   6 |  13
     vietnamese |  9.2830188679245283 |   6 |  13
    (6 rows)
    ```
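
A possible Pandas counterpart of this `GROUP BY` query is `groupby` followed by `agg`. The stand-in `meals` frame below has invented rows, and the column name `price` is an assumption (the expected output only shows `type`, `avg`, `min`, and `max`).

```python
import pandas as pd

# Stand-in for meals.csv (rows and the 'price' column name are assumed).
meals = pd.DataFrame({
    'type': ['mexican', 'mexican', 'french'],
    'price': [6, 13, 7],
})

# SELECT type, AVG(price), MIN(price), MAX(price) FROM meals GROUP BY type
price_stats = meals.groupby('type')['price'].agg(['mean', 'min', 'max'])
print(price_stats)
```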



Joins
=========================

Now we are ready to do operations on multiple tables. A [JOIN](http://www.tutorialspoint.com/postgresql/postgresql_using_joins.htm) allows us to combine multiple tables.

1. Write a query to get one table that joins the `events` table with the `users` table (on `userid`) to create the following table.

    ```
     userid | campaign_id | meal_id | event
    --------+-------------+---------+--------
          3 | FB          |      18 | bought
          7 | PI          |       1 | like
         10 | TW          |      29 | bought
         11 | RE          |      19 | share
         15 | RE          |      33 | like
    ...
    ```
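
In Pandas the join can be sketched with `merge`. Both frames below are small stand-ins with invented rows; the column names follow the table shown above.

```python
import pandas as pd

# Stand-ins for events.csv and users.csv (rows are made up).
events = pd.DataFrame({
    'userid': [3, 7, 10],
    'meal_id': [18, 1, 29],
    'event': ['bought', 'like', 'bought'],
})
users = pd.DataFrame({
    'userid': [3, 7, 10, 11],
    'campaign_id': ['FB', 'PI', 'TW', 'RE'],
})

# Inner join on userid, then pick the columns in the desired order.
joined = events.merge(users, on='userid', how='inner')
joined = joined[['userid', 'campaign_id', 'meal_id', 'event']]
print(joined)
```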



Extra Credit
========================
1. Answer the question, _"What user from each campaign bought the most items?"_

    It will be helpful to create a temporary table that contains the counts of the number of items each user bought. You can create a table like this: `CREATE TABLE mytable AS SELECT...`

#Exploratory Data Analysis with Pandas

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_csv('data/playgolf.csv', delimiter='|' )
print(df.head())

#Describe the continuous variables
##This treats the Boolean Windy variable as a series of 0's and 1's

In [None]:
df.describe()

We can see the general pattern of Temperature and Humidity. The mean of a Boolean column is the proportion of True values.

##We can use df.plot() to produce simple graphs; it calls on the more adjustable Matplotlib library

In [None]:
# Side-by-side histograms


In [None]:
# Box plot
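
The two plotting cells above could be filled in roughly as follows. The frame is a stand-in with assumed column names `Temperature` and `Humidity`, and the sketch assumes Matplotlib is installed.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the golf data (hypothetical columns and values).
golf = pd.DataFrame({
    'Temperature': [85, 80, 83, 70],
    'Humidity': [85, 90, 78, 96],
})

# Side-by-side histograms: one subplot per column.
hist_axes = golf[['Temperature', 'Humidity']].hist()

# Box plot of the same two columns on a single set of axes.
box_ax = golf[['Temperature', 'Humidity']].plot(kind='box')
```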


###Scatterplots for examining bivariate relationships (kind=scatter)

###If we want to color the scatterplots according to a category, it requires a bit of matplotlib...ugh!

#What about the categorical variables? Frequency tables and relative frequency tables

###Calling value_counts() on a column gets you the frequencies

###Using apply will get you the value counts for multiple columns at once
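
A sketch of both ideas on a stand-in frame. The column names `Outlook` and `Result` and their values are assumptions made for illustration. Passing `normalize=True` gives relative frequencies.

```python
import pandas as pd

# Stand-in for the categorical columns (hypothetical names and values).
golf = pd.DataFrame({
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain'],
    'Result': ['Play', 'Play', "Don't Play", 'Play'],
})

# Frequencies and relative frequencies for one column.
outlook_counts = golf['Outlook'].value_counts()
outlook_rel = golf['Outlook'].value_counts(normalize=True)

# value_counts for several columns at once; values that do not
# occur in a given column show up as NaN.
all_counts = golf[['Outlook', 'Result']].apply(pd.Series.value_counts)
print(all_counts)
```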

###Contingency Tables for looking at bivariate relationships between two categorical variables (pd.crosstab)

###Often we want the row percentages

###Or the column percentages
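
The contingency table and both kinds of percentages can be sketched with `pd.crosstab` and its `normalize` argument. The frame is a stand-in with assumed column names `Outlook` and `Windy`.

```python
import pandas as pd

# Stand-in for the golf data (hypothetical columns and values).
golf = pd.DataFrame({
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain'],
    'Windy': [False, True, False, False],
})

# Raw counts for each Outlook / Windy combination.
counts = pd.crosstab(golf['Outlook'], golf['Windy'])

# Row percentages: each row sums to 1.
row_pct = pd.crosstab(golf['Outlook'], golf['Windy'], normalize='index')

# Column percentages: each column sums to 1.
col_pct = pd.crosstab(golf['Outlook'], golf['Windy'], normalize='columns')

print(counts, row_pct, col_pct, sep='\n\n')
```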

#Enough...let's get to the pair sprint

https://www.youtube.com/watch?v=yGf6LNWY9AI