# Pandas 

(If you're using the code files, please open pandas_lessons.py)

## The importance of data preprocessing

Data preprocessing (also called data wrangling, cleaning, scrubbing, etc.) is the most important thing you will do with your data because it sets the stage for the analysis part of your workflow. The preprocessing you do largely depends on what kind of data you have, what sort of analysis you'll be doing with your data, and what you intend to do with the results.

Preprocessing is also a process for getting to know your data, and can answer questions such as these (and more): 

- What kind of data are you working with? 
- Is it categorical, continuous, or a mix of both? 
- What's the distribution of features in your dataset? 
- What sort of wrangling do you have to do?
- Do you have any missing data? 
- Do you need to remove missing data?
- Do you need only a subset of your data?
- Do you need more data?
- Or less?

The questions you'll have to answer are, again, dependent upon the data that you're working with, and preprocessing can be a way to figure that out.

## What is Pandas?

Pandas is by far my favorite preprocessing tool. It's a data wrangling/modeling/analysis tool that is similar to R and Excel; in fact, the `DataFrame` data structure in Pandas was named after the data frame in R. Pandas comes with several easy-to-use data structures, two of which (the `Series` and the `DataFrame`) I'll be covering here.

I'll also be covering a bunch of different wrangling tools, as well as a couple of analysis tools.

## Why Pandas?

So, why would you want to use Python, as opposed to tools like R and Excel? I like to use it because I like to keep everything in Python, from start to finish. It just makes it easier if I don't have to switch back and forth between other tools. Also, if I have to build in preprocessing as part of a production system, which I've had to do at my job, it makes sense to just do it in Python from the beginning. 

Pandas is great for preprocessing, as we'll see, and it can be easily combined with other modules from the scientific Python stack.

## Pandas data structures

Pandas has several different data structures, but we're going to talk about the `Series` and the `DataFrame`.

### The Series

The `Series` is a one-dimensional array that can hold a variety of data types, including a mix of those types. The row labels in a `Series` are collectively called the index. You can create a `Series` in a few different ways. Here's how you'd create a `Series` from a list.

In [2]:
import pandas as pd

some_numbers = [2, 5, 7, 3, 8]

series_1 = pd.Series(some_numbers)
series_1

0    2
1    5
2    7
3    3
4    8
dtype: int64

To specify an index, you can also pass in a list.

In [3]:
ind = ['a', 'b', 'c', 'd', 'e']

series_2 = pd.Series(some_numbers, index=ind)
series_2

a    2
b    5
c    7
d    3
e    8
dtype: int64

We can pull the underlying values back out again, too, with the `.values` attribute (the index is available the same way, through `.index`).

In [4]:
series_2.values

array([2, 5, 7, 3, 8])

You can also create a `Series` with a dictionary. The keys of the dictionary will be used as the index, and the values will be used as the `Series` array.

In [5]:
more_numbers = {'a': 9, 'b': 'eight', 'c': 7.5, 'd': 6}

series_3 = pd.Series(more_numbers)
series_3

a        9
b    eight
c      7.5
d        6
dtype: object

Notice how, in that previous example, I created a `Series` with integers, a float, and a string.
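
To make the points above concrete, here's a minimal sketch. The variable names `mixed` and `numbers` are made up for illustration and aren't part of the lesson files. It shows that a `Series` holding mixed values falls back to the `object` dtype, and that index labels can be used to look up values.

```python
import pandas as pd

# Hypothetical example data, not from the lesson files
mixed = pd.Series({'a': 9, 'b': 'eight', 'c': 7.5, 'd': 6})
numbers = pd.Series([2, 5, 7, 3, 8], index=['a', 'b', 'c', 'd', 'e'])

print(mixed.dtype)           # object, because the values are of mixed types
print(numbers.dtype)         # an integer dtype
print(list(numbers.index))   # the labels we passed in
print(numbers['c'])          # look up a value by its index label
```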

### The DataFrame

The `DataFrame` is Pandas' most used data structure. It's a two-dimensional structure of rows and columns that can also hold a variety of mixed data types. It's similar to a spreadsheet in Excel or a SQL table. You can create a `DataFrame` with a few different methods. First, let's look at how to create a `DataFrame` from multiple `Series` objects.

In [6]:
combine_series = pd.DataFrame([series_2, series_3])
combine_series

Unnamed: 0,a,b,c,d,e
0,2,5,7.0,3,8.0
1,9,eight,7.5,6,


Notice how in column `b`, we have two kinds of data. If a column in a `DataFrame` contains multiple types of data, the data type (or `dtype`) of the column will be chosen to accommodate all of the data. We can look at the data types of different columns with the `.dtypes` attribute. `object` is the most general, which is what has been chosen for column `b`.

In [7]:
combine_series.dtypes

a      int64
b     object
c    float64
d      int64
e    float64
dtype: object
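
Here's a small sketch of how the `dtype` gets chosen, using a made-up `DataFrame` called `frame` (the column names are just for illustration). A missing value in an otherwise integer column pushes it to `float64`, and a column with no common numeric type ends up as `object`.

```python
import numpy as np
import pandas as pd

# Hypothetical columns chosen to trigger different dtypes
frame = pd.DataFrame({
    'all_ints': [1, 2, 3],
    'ints_with_gap': [1, np.nan, 3],   # NaN is a float, so the column becomes float64
    'mixed': [1, 'two', 3.0],          # no common numeric type, so object
})

print(frame.dtypes)
```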

Another way to create a `DataFrame` is with a dictionary of lists. This is pretty straightforward:

In [8]:
data = {'col1': ['i', 'love', 'pandas', 'so', 'much'],
        'col2': ['so', 'will', 'you', 'i', 'promise']}

df = pd.DataFrame(data)
df

Unnamed: 0,col1,col2
0,i,so
1,love,will
2,pandas,you
3,so,i
4,much,promise


## File I/O

It's really easy to read data into Pandas from a file. Pandas will read your file directly into a `DataFrame`. There are multiple ways to read in files, but they all work in the same way. Here's how you read in a CSV file:

In [2]:
import pandas as pd
week4 = pd.read_csv('../data/fanduel_week4.csv')
week4.head()

Unnamed: 0,Id,Position,First Name,Last Name,FPPG,Played,Salary,Game,Team,Opponent,Injury Indicator,Injury Details,Unnamed: 12,Unnamed: 13
0,14190,WR,Julio,Jones,22.7,4,9200,WAS@ATL,ATL,WAS,,,,
1,6894,QB,Aaron,Rodgers,23.6,4,9200,STL@GB,GB,STL,,,,
2,6728,RB,Jamaal,Charles,20.4,4,9100,CHI@KC,KC,CHI,,,,
3,28181,RB,Le'Veon,Bell,23.6,2,9000,PIT@SD,PIT,SD,,,,
4,31360,WR,Odell,Beckham Jr.,13.7,4,9000,SF@NYG,NYG,SF,,,,


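Reading in a file that uses a different separator is just as easy. Pass the separator to the `delimiter` parameter: `'\t'` for a tab-separated text file or, as in this semicolon-separated file, `';'`.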

In [4]:
week17 = pd.read_csv('../data/fanduel_data/fanduel2014week17.csv', delimiter=';')
week17.tail()

Unnamed: 0,Week,Year,GID,Name,Pos,Team,h/a,Oppt,FD points,FD salary
450,17,2014,7002,Atlanta Defense,Def,atl,h,car,2,4700
451,17,2014,7001,Arizona Defense,Def,ari,a,sfo,2,5200
452,17,2014,7030,Tennessee Defense,Def,ten,h,ind,1,4600
453,17,2014,7016,Miami Defense,Def,mia,h,nyj,-1,4900
454,17,2014,7031,Washington Defense,Def,was,h,dal,-1,4600
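
If you don't have the data files handy, you can try the same idea on an in-memory string. This is only a sketch: the text in `raw` is made up, and `io.StringIO` stands in for a file on disk. Note that `sep` is the documented parameter name, and `delimiter` is accepted as an alias for it.

```python
import io
import pandas as pd

# Made-up, semicolon-separated text standing in for a file on disk
raw = "Name;Pos;FD points\nDecker, Eric;WR;33.1\nFloyd, Michael;WR;31.3\n"

# The commas inside the names are left alone because the separator is ';'
players = pd.read_csv(io.StringIO(raw), sep=';')

print(players.shape)
print(list(players.columns))
```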


## Exploring the data

Here are some different ways to explore the data we have. Let's first take a look at some of the basic characteristics of the `week17` dataset. You can easily find the number of rows and the number of columns a `DataFrame` has using the `.shape` attribute.

In [5]:
week17.shape

(455, 10)

You've already seen the `head()` function, which returns the first five lines in the dataset. To grab the last 5 lines, you can use the `tail()` function:

In [10]:
week4.tail()

Unnamed: 0,Id,Position,First Name,Last Name,FPPG,Played,Salary,Game,Team,Opponent,Injury Indicator,Injury Details,Unnamed: 12,Unnamed: 13
510,12532,D,Detroit,Lions,10.8,4,4200,ARI@DET,DET,ARI,,,,
511,12530,D,Dallas,Cowboys,5.3,4,4200,NE@DAL,DAL,NE,,,,
512,12549,D,San Francisco,49ers,4.0,4,4100,SF@NYG,SF,NYG,,,,
513,12534,D,Tennessee,Titans,7.7,3,4100,BUF@TEN,TEN,BUF,,,,
514,12551,D,Tampa Bay,Buccaneers,5.0,4,4000,JAC@TB,TB,JAC,,,,


Getting column names from a `DataFrame` is also easy and can be done using the `.columns` attribute.

In [6]:
week17.columns

Index(['Week', 'Year', 'GID', 'Name', 'Pos', 'Team', 'h/a', 'Oppt', 'FD points', 'FD salary'], dtype='object')

Another useful thing you can do is generate some summary statistics using the `describe()` function. The `describe()` function calculates descriptive statistics like the mean, standard deviation, and quartile values for continuous and integer data that exist in your dataset. Don't worry, Pandas won't try to calculate the standard deviation of your categorical values!

In [12]:
week4.describe()

Unnamed: 0,Id,FPPG,Played,Salary,Unnamed: 12,Unnamed: 13
count,515.0,515.0,515.0,515.0,0.0,0.0
mean,22941.300971,5.408738,3.545631,5229.320388,,
std,14430.606108,5.704482,2.692285,1051.40934,,
min,6387.0,-1.0,0.0,4000.0,,
25%,11508.5,0.2,2.0,4500.0,,
50%,22123.0,3.7,4.0,4800.0,,
75%,30331.0,9.1,4.0,5550.0,,
max,65962.0,26.2,16.0,9200.0,,


Another useful thing you can do to explore your data is to sort it. Let's say we wanted to sort our `week17` `DataFrame` by FD points. This is very easy as well with `sort_values()` (older versions of Pandas used `sort()`, which has since been removed):

In [11]:
week17.sort_values(by='FD points').tail()

Unnamed: 0,Week,Year,GID,Name,Pos,Team,h/a,Oppt,FD points,FD salary
42,17,2014,2924,"Anderson, C.J.",RB,den,h,oak,29.7,8300
165,17,2014,5323,"Beckham Jr., Odell",WR,nyg,h,phi,30.5,9200
423,17,2014,7005,Carolina Defense,Def,car,a,atl,31.0,4900
164,17,2014,5148,"Floyd, Michael",WR,ari,a,sfo,31.3,6100
163,17,2014,3902,"Decker, Eric",WR,nyj,a,mia,33.1,5900
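
Sorting has a couple of options worth knowing about. The sketch below uses a tiny made-up `DataFrame` called `scores` (not the lesson data) to show descending order and sorting on more than one column with `sort_values()`.

```python
import pandas as pd

# Hypothetical mini-table, loosely modeled on the columns above
scores = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Pos': ['WR', 'RB', 'WR', 'RB'],
    'FD points': [31.3, 29.7, 33.1, 12.0],
})

# Highest scores first
top = scores.sort_values(by='FD points', ascending=False)
print(top.head(2))

# Sort by position, then by score (descending) within each position
by_pos = scores.sort_values(by=['Pos', 'FD points'], ascending=[True, False])
print(by_pos)
```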


## Lesson: let's see what's going on in our data!

This dataset is data on credit approvals. The column names and data were changed to protect the confidentiality of the data.

In [20]:
# How do you read in that file?
f = '../data/credit_approval.csv'
credit = pd.read_csv(f)

# Can you grab just the column names?
credit.columns

Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P'], dtype='object')

In [21]:
# How many rows and columns does the dataframe have?
credit.shape

(690, 16)

In [22]:
# Now, look at the first 5 lines
credit.head()

Unnamed: 0,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [24]:
# Now, look at the last 5 lines
credit.tail()

Unnamed: 0,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P
685,b,21.08,10.085,y,p,e,h,1.25,f,f,0,f,g,260,0,-
686,a,22.67,0.75,u,g,c,v,2.0,f,t,2,t,g,200,394,-
687,a,25.25,13.5,y,p,ff,ff,2.0,f,t,1,t,g,200,1,-
688,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,f,g,280,750,-
689,b,35.0,3.375,u,g,c,h,8.29,f,f,0,t,g,0,0,-


In [25]:
# Can you describe() the data? (Notice how Pandas only "describes" the numerical data!)
credit.describe()

Unnamed: 0,C,H,K,O
count,690.0,690.0,690.0,690.0
mean,4.758725,2.223406,2.4,1017.385507
std,4.978163,3.346513,4.86294,5210.102598
min,0.0,0.0,0.0,0.0
25%,1.0,0.165,0.0,0.0
50%,2.75,1.0,0.0,5.0
75%,7.2075,2.625,3.0,395.5
max,28.0,28.5,67.0,100000.0


In [30]:
# Let's sort on column H (sort() was removed from Pandas; use sort_values())
credit.sort_values("H")

Unnamed: 0,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P
351,b,22.17,0.585,y,p,ff,ff,0.000,f,f,0,f,g,100,0,-
633,b,32.42,2.165,y,p,k,ff,0.000,f,f,0,f,g,120,0,-
456,b,34.58,0.000,?,?,?,?,0.000,f,f,0,f,p,?,0,-
585,b,73.42,17.750,u,g,ff,ff,0.000,t,f,0,t,g,0,0,+
584,a,28.08,15.000,y,p,e,z,0.000,t,f,0,f,g,0,13212,+
345,b,62.75,7.000,u,g,e,z,0.000,f,f,0,f,g,0,12,-
53,b,34.92,2.500,u,g,w,v,0.000,t,f,0,t,g,239,200,+
261,a,52.17,0.000,y,p,ff,ff,0.000,f,f,0,f,g,0,0,-
350,a,26.17,2.000,u,g,j,j,0.000,f,f,0,t,g,276,1,-
448,b,31.25,1.125,u,g,ff,ff,0.000,f,t,1,f,g,96,19,-


## Working with dataframes

Pandas has a ton of functionality for manipulating and wrangling the data. Let's look at a bunch of different ways to select and subset our data.

### Selecting columns and rows

There are multiple ways to select by both rows and columns. From index to slicing to label to position, there are a variety of methods to suit your data wrangling needs.

Let's select just the mpg column from the `auto_mpg` `DataFrame`. (The examples that use `auto_mpg` assume it has already been read into a `DataFrame`.) This works similarly to how you would access values from a dictionary:

In [31]:
auto_mpg['mpg']

0      18
1      15
2      18
3      16
4      17
5      15
6      14
7      14
8      14
9      15
10     15
11     14
12     15
13     14
14     24
15     22
16     18
17     21
18     27
19     26
20     25
21     24
22     25
23     26
24     21
25     10
26     10
27     11
28      9
29     27
       ..
368    27
369    34
370    31
371    29
372    27
373    24
374    23
375    36
376    37
377    31
378    38
379    36
380    36
381    36
382    34
383    38
384    32
385    38
386    25
387    38
388    26
389    22
390    32
391    36
392    27
393    27
394    44
395    32
396    28
397    31
Name: mpg, dtype: float64

You can do exactly the same thing by using mpg as an attribute:

In [32]:
auto_mpg.mpg

0      18
1      15
2      18
3      16
4      17
5      15
6      14
7      14
8      14
9      15
10     15
11     14
12     15
13     14
14     24
15     22
16     18
17     21
18     27
19     26
20     25
21     24
22     25
23     26
24     21
25     10
26     10
27     11
28      9
29     27
       ..
368    27
369    34
370    31
371    29
372    27
373    24
374    23
375    36
376    37
377    31
378    38
379    36
380    36
381    36
382    34
383    38
384    32
385    38
386    25
387    38
388    26
389    22
390    32
391    36
392    27
393    27
394    44
395    32
396    28
397    31
Name: mpg, dtype: float64

To extract rows from a `DataFrame`, you can use slice notation, similar to how you would slice a list. Here's how we would grab rows 7-13 from the `week4` `DataFrame`:

In [34]:
week4[7:14]

Unnamed: 0,Id,Position,First Name,Last Name,FPPG,Played,Salary,Game,Team,Opponent,Injury Indicator,Injury Details,Unnamed: 12,Unnamed: 13
7,9371,RB,Marshawn,Lynch,8.5,3,8600,SEA@CIN,SEA,CIN,Q,Hamstring,,
8,11612,WR,Antonio,Brown,19.7,4,8600,PIT@SD,PIT,SD,,,,
9,11460,TE,Rob,Gronkowski,20.9,3,8400,NE@DAL,NE,DAL,,,,
10,6899,RB,Matt,Forte,15.1,4,8400,CHI@KC,CHI,KC,,,,
11,6616,QB,Matt,Ryan,18.1,4,8300,WAS@ATL,ATL,WAS,,,,
12,22015,QB,Russell,Wilson,17.7,4,8200,SEA@CIN,SEA,CIN,,,,
13,14187,WR,A.J.,Green,18.1,4,8200,SEA@CIN,CIN,SEA,,,,


Pandas also has tools for purely label-based selection of rows and columns using the `.loc` indexer. The `.loc` indexer takes input as `[row, column]`. 

For example, let's say we wanted to select the FPPG value for the row labeled 8 in our `week4` `DataFrame`:

In [36]:
week4.loc[8,'FPPG']

19.699999999999999

We can also use `.loc` to grab slices. It's important to note that `.loc` interprets the index as a *label*. This means that, if we select a range, it will include the last item in the range, unlike slicing in a list. The index is the label for the rows. Let's grab the FPPG for rows 8 to 11 from the `week4` `DataFrame`.

In [37]:
week4.loc[8:11, 'FPPG']

8     19.7
9     20.9
10    15.1
11    18.1
Name: FPPG, dtype: float64

And, as you might expect, we can select multiple columns by passing in a list of column names. Let's grab FPPG, Last Name, and Position for rows 8 to 11.

In [38]:
week4.loc[8:11, ['FPPG', 'Last Name', 'Position']]

Unnamed: 0,FPPG,Last Name,Position
8,19.7,Brown,WR
9,20.9,Gronkowski,TE
10,15.1,Forte,RB
11,18.1,Ryan,QB


Finally, let's grab all rows, along with every column from First Name onward.

In [41]:
week4.loc[:, "First Name":]

Unnamed: 0,First Name,Last Name,FPPG,Played,Salary,Game,Team,Opponent,Injury Indicator,Injury Details,Unnamed: 12,Unnamed: 13
0,Julio,Jones,22.7,4,9200,WAS@ATL,ATL,WAS,,,,
1,Aaron,Rodgers,23.6,4,9200,STL@GB,GB,STL,,,,
2,Jamaal,Charles,20.4,4,9100,CHI@KC,KC,CHI,,,,
3,Le'Veon,Bell,23.6,2,9000,PIT@SD,PIT,SD,,,,
4,Odell,Beckham Jr.,13.7,4,9000,SF@NYG,NYG,SF,,,,
5,Tom,Brady,26.2,3,8800,NE@DAL,NE,DAL,,,,
6,Demaryius,Thomas,14.2,4,8600,DEN@OAK,DEN,OAK,,,,
7,Marshawn,Lynch,8.5,3,8600,SEA@CIN,SEA,CIN,Q,Hamstring,,
8,Antonio,Brown,19.7,4,8600,PIT@SD,PIT,SD,,,,
9,Rob,Gronkowski,20.9,3,8400,NE@DAL,NE,DAL,,,,


So, `.loc` provides functionality for a very specific and precise selection method.

Pandas has tools for purely position-based selection of rows and columns using the `.iloc` indexer, which works exactly how slicing a list works. The `.iloc` indexer also takes input as `[row, column]`, but takes only integer input. If we wanted to access the model value in the row at position 60 (the 61st row, since integer indexing is 0-based) from `auto_mpg`, it would look like this:

In [42]:
auto_mpg.iloc[60, 6]

72

To grab rows 60-63 and the last three columns from the `auto_mpg DataFrame`, we would need to do the following:

In [43]:
auto_mpg.iloc[60:64, 6:9]

Unnamed: 0,model,origin,car_name
60,72,1,chevrolet vega
61,72,1,ford pinto runabout
62,72,1,chevrolet impala
63,72,1,pontiac catalina


`.iloc` again works like slicing a list, based on position, so it does not grab the last item, like `.loc` does.
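
The difference between the two indexers is easiest to see when the index labels don't match the row positions. Here's a minimal sketch with a made-up `DataFrame` called `df` whose index starts at 5:

```python
import pandas as pd

# Hypothetical frame whose index labels deliberately differ from positions
df = pd.DataFrame({'value': [10, 20, 30, 40, 50]}, index=[5, 6, 7, 8, 9])

by_label = df.loc[6:8, 'value']    # labels 6, 7, and 8: end label included
by_position = df.iloc[1:3, 0]      # positions 1 and 2: end position excluded

print(list(by_label))      # [20, 30, 40]
print(list(by_position))   # [20, 30]
```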

To grab all values and those last three columns from the `auto_mpg DataFrame`:

In [None]:
auto_mpg.iloc[:, 6:9]

One of my favorite methods for selecting data is through boolean indexing. Boolean indexing is similar to the WHERE clause in SQL in that it allows you to filter out data based on certain criteria. Let's see how this works.

Let's select from the `week4` `DataFrame` where FPPG is exactly 21.1.

In [45]:
week4[week4['FPPG'] == 21.1]

Unnamed: 0,Id,Position,First Name,Last Name,FPPG,Played,Salary,Game,Team,Opponent,Injury Indicator,Injury Details,Unnamed: 12,Unnamed: 13
18,6748,QB,Carson,Palmer,21.1,4,8100,ARI@DET,ARI,DET,,,,


This works with any comparison operator, like >, <, >=, !=, and so on. For example, we can select everything from the `week4` `DataFrame` where the value in the FPPG column is greater than or equal to 20.

In [46]:
week4[week4['FPPG'] >= 20]

Unnamed: 0,Id,Position,First Name,Last Name,FPPG,Played,Salary,Game,Team,Opponent,Injury Indicator,Injury Details,Unnamed: 12,Unnamed: 13
0,14190,WR,Julio,Jones,22.7,4,9200,WAS@ATL,ATL,WAS,,,,
1,6894,QB,Aaron,Rodgers,23.6,4,9200,STL@GB,GB,STL,,,,
2,6728,RB,Jamaal,Charles,20.4,4,9100,CHI@KC,KC,CHI,,,,
3,28181,RB,Le'Veon,Bell,23.6,2,9000,PIT@SD,PIT,SD,,,,
5,6498,QB,Tom,Brady,26.2,3,8800,NE@DAL,NE,DAL,,,,
9,11460,TE,Rob,Gronkowski,20.9,3,8400,NE@DAL,NE,DAL,,,,
18,6748,QB,Carson,Palmer,21.1,4,8100,ARI@DET,ARI,DET,,,,
27,24920,RB,Devonta,Freeman,23.8,4,7600,WAS@ATL,ATL,WAS,,,,
31,14377,QB,Tyrod,Taylor,21.2,4,7500,BUF@TEN,BUF,TEN,,,,
33,29358,QB,Marcus,Mariota,20.6,3,7400,BUF@TEN,TEN,BUF,,,,


You can also say 'not' with the tilde: ~

Let's select from the wine `DataFrame` where magnesium is NOT less than 100, which is equivalent to saying greater than or equal to.

In [None]:
wine[~(wine['magnesium'] < 100)]

It's also possible to combine these boolean indexers. Make sure you enclose them in parentheses. This is something I usually forget.

Let's select from `week4` where FPPG is greater than 15 and Salary is less than 7000.

In [47]:
week4[(week4['FPPG'] > 15) & (week4['Salary'] < 7000)]

Unnamed: 0,Id,Position,First Name,Last Name,FPPG,Played,Salary,Game,Team,Opponent,Injury Indicator,Injury Details,Unnamed: 12,Unnamed: 13
47,8005,QB,Alex,Smith,16.5,4,6900,CHI@KC,KC,CHI,,,,
48,34308,QB,Blake,Bortles,17.5,4,6900,JAC@TB,JAC,TB,,,,
50,14339,RB,Dion,Lewis,16.7,3,6900,NE@DAL,NE,DAL,,,,
56,45859,QB,Derek,Carr,16.5,4,6700,DEN@OAK,OAK,DEN,,,,
58,6735,WR,Steve,Smith,16.0,4,6700,CLE@BAL,BAL,CLE,O,Back,,
71,7222,WR,James,Jones,16.1,4,6400,STL@GB,GB,STL,,,,
74,24849,QB,Jameis,Winston,16.2,4,6400,JAC@TB,TB,JAC,,,,
82,22035,WR,Travis,Benjamin,17.2,4,6200,CLE@BAL,CLE,BAL,,,,
89,6687,QB,Mark,Sanchez,16.7,9,6100,NO@PHI,PHI,NO,,,,
185,25410,QB,Mike,Glennon,15.9,6,5100,JAC@TB,TB,JAC,,,,
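
Here's a small sketch of why the parentheses matter, using a made-up `players` `DataFrame` (not the lesson data). In Python, `&`, `|`, and `~` bind more tightly than comparisons like `>` and `==`, so each condition needs its own parentheses before you combine them.

```python
import pandas as pd

# Hypothetical players, not taken from the lesson data
players = pd.DataFrame({
    'Position': ['QB', 'RB', 'WR', 'QB'],
    'FPPG': [16.5, 20.4, 13.7, 23.6],
    'Salary': [6900, 9100, 9000, 9200],
})

# and: both conditions must hold
cheap_and_good = players[(players['FPPG'] > 15) & (players['Salary'] < 7000)]

# or: either condition may hold
qb_or_rb = players[(players['Position'] == 'QB') | (players['Position'] == 'RB')]

# not: flip the condition
not_qb = players[~(players['Position'] == 'QB')]

print(len(cheap_and_good), len(qb_or_rb), len(not_qb))   # 1 3 2
```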


If you wanted to, you could just keep on chaining the booleans together. Here's an example with three conditions on the wine `DataFrame`: magnesium less than 100, wine type 1, and abv greater than 14.

In [None]:
wine[(wine['magnesium'] < 100) & (wine['wine_type'] == 1) & (wine['abv'] > 14)]

Another method of selecting data is using the `isin()` function. If you pass in a list to `isin()`, it will return a `DataFrame` of booleans. True means that the value at that index is in the list you passed into `isin()`.

Let's take the first five rows of the `auto_mpg DataFrame` and check for certain values existing in the `DataFrame`.

In [None]:
auto_mpg_5 = auto_mpg.head()

vals = [8, 150, 12.0, 'ford torino']
auto_mpg_5.isin(vals)

If it says `True`, it means that one of the values from the `vals` list occurs there.
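
In practice, I find `isin()` most useful on a single column, where the resulting boolean `Series` can be used to filter rows. This is a sketch with made-up data; the `players` frame and its values are only for illustration.

```python
import pandas as pd

# Hypothetical data for illustration
players = pd.DataFrame({
    'Name': ['Jones', 'Rodgers', 'Charles', 'Gronkowski'],
    'Position': ['WR', 'QB', 'RB', 'TE'],
})

mask = players['Position'].isin(['WR', 'TE'])   # boolean Series, one value per row
receivers = players[mask]

print(list(receivers['Name']))   # ['Jones', 'Gronkowski']
```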

## Lesson: let's try some of these on some data!

In [None]:
# Extract column C from the credit_approval dataframe we read in above


In [None]:
# Slice rows 5-10 from the credit_approval dataframe


In [None]:
# How would you look up the value for the 13th row in column C by label (loc)?


In [None]:
# How would you look up the same thing by position (iloc)?


In [None]:
# What if I wanted to select all data from credit_approval based on column C being greater than 5?


In [None]:
# What if I wanted to select data based on column C being greater than 5 and column F being equal to 'w'?


In [None]:
# What if I wanted to look at a boolean DataFrame of where values are in ['t', 's', 100, 0] in credit_approval?


## Groupby

`groupby()` is just like SQL's 'group by' clause. What groupby does is a three-step process:

- Split the data
- Apply a function to the split groups
- Recombine the data

In the apply step, you can do things like apply a statistical function, filter out data, or transform the data.
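
To see the three steps spelled out, here's a sketch that does the split and apply by hand and then compares the result with the one-line `groupby()` version. The `df` frame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    'team': ['a', 'b', 'a', 'b', 'a'],
    'points': [10, 20, 30, 40, 80],
})

# Split and apply by hand: one mean per group
manual = {}
for team, group in df.groupby('team'):
    manual[team] = group['points'].mean()

# Recombine into a single Series, then compare with the one-line version
manual = pd.Series(manual)
one_liner = df.groupby('team')['points'].mean()

print(manual.to_dict())      # {'a': 40.0, 'b': 30.0}
print(one_liner.to_dict())   # {'a': 40.0, 'b': 30.0}
```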

Let's `groupby()` the wine_type in our wine `DataFrame`! Let's start with just `groupby()`, and then build it from there. This will produce a `DataFrameGroupBy` object.

In [None]:
wine.groupby('wine_type')

Not so interesting yet. This object has some attributes you can access. We can get lists of which rows are in which group by using the `.groups` attribute:

In [None]:
wine.groupby('wine_type').groups

The dataset was in order by `wine_type` to begin with, so that makes sense. To get just the keys, add the `.keys()` function to the end of that line.

In [None]:
wine.groupby('wine_type').groups.keys()

Let's group our `auto_mpg` dataset by cylinders, just for contrast.

In [None]:
auto_mpg.groupby('cylinders').groups

You can see we have four observations with three cylinders, many more with four, and so on.

Going back to the wine example, let's apply an aggregate function. Let's generate the mean of all the other values and group them by `wine_type`.

In [None]:
wine.groupby('wine_type').mean()

So, the mean `abv` for wine with type 1 is 13.74, type 2 is 12.27, type 3 is 13.15. The mean `malic_acid` for wine with type 1 is 2.01, and so on. So, with one line of code, we're able to apply a function to the entire dataset and see what's going on within different groups.

Selecting from a `groupby DataFrame` works the same way as selecting from any other `DataFrame`. Let's select the abv where `wine_type` is 2.

In [None]:
wine_type_mean = wine.groupby('wine_type').mean()

wine_type_mean.loc[2, 'abv']

It's also possible to apply multiple functions to the entire `DataFrame` using the `agg()` function. Let's get not only the mean, but the count and the standard deviation as well for each value in the `DataFrame`, still grouping by `wine_type`.

In [None]:
wine.groupby('wine_type').agg(['mean', 'count', 'std'])

It's also possible to run different functions on different columns. Let's get the mean for abv, the standard deviation for ash, and the sum of the values for hue. To do this, you'll need to create a dictionary with these functions, with the column names as the dictionary keys.

In [None]:
multiple_funcs = {'abv': 'mean', 'ash': 'std', 'hue': 'sum'}

wine.groupby('wine_type').agg(multiple_funcs)
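
An alternative to the dictionary is named aggregation, where you choose the output column names yourself. The sketch below runs on a small invented stand-in for the wine data (`wine_like`, with made-up values), so the numbers are not the real ones.

```python
import pandas as pd

# Hypothetical stand-in for the wine data; the values are invented
wine_like = pd.DataFrame({
    'wine_type': [1, 1, 2, 2],
    'abv': [14.0, 13.0, 12.0, 12.5],
    'hue': [1.0, 1.1, 0.9, 1.0],
})

# Each keyword is an output column: (input column, function to apply)
summary = wine_like.groupby('wine_type').agg(
    abv_mean=('abv', 'mean'),
    hue_sum=('hue', 'sum'),
)
print(summary)
```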

## Lesson: Groupby galore

Let's take this one step at a time.

In [None]:
# Let's group credit_approval by column G.


In [None]:
# Can you generate a list of all of the groups in the groupby object we just made?


In [None]:
# Let's use mean() on credit_approval_group to get the mean of our numeric values.


In [None]:
# Let's see both the standard deviation and the sum of everything in credit_approval_group


In [None]:
# Let's see the count on column H, the sum on column C, and the mean on column O.


## Merge/join; or, how Pandas can be like SQL

In Pandas, it's possible to combine `DataFrames` and `Series` much like you would in SQL. For the examples in this section, we'll work with smaller `DataFrames` rather than our datasets. It's easier to provide proof of concept this way, as well as explain what's going on.

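Let's start by appending a row to a `DataFrame`. Older versions of Pandas had a `DataFrame.append()` method for this, but it was removed in Pandas 2.0. Instead, wrap the new row in its own one-row `DataFrame` and combine the two with `pd.concat()`, setting `ignore_index` equal to `True` so the rows are renumbered.

In [None]:
data = pd.DataFrame({'col1': ['i', 'love', 'pandas', 'so', 'much'],
        'col2': ['so', 'will', 'you', 'i', 'promise']})
new_row = pd.DataFrame([{'col1': 'dude', 'col2': 'dude'}])
pd.concat([data, new_row], ignore_index=True)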

Appending a column is also easy. You can do that by setting a new column name equal to a list or a `Series`.

In [None]:
data['col3'] = ['how', 'do', 'you', 'like', 'oscon']
data

However, this will not work if your new column is a different length than the original `DataFrame`. Pandas raises a `ValueError` instead.

In [None]:
data['col4'] = ['I', 'am', 'too', 'short']
data
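
Here's a sketch of what that failure looks like, along with one way around it. It rebuilds a small `data` frame so it can run on its own, and the `col5` name is made up. Assigning a `Series` instead of a list aligns on the index, so the missing row is filled with `NaN` rather than raising an error.

```python
import pandas as pd

data = pd.DataFrame({'col1': ['i', 'love', 'pandas', 'so', 'much']})

try:
    data['col4'] = ['I', 'am', 'too', 'short']   # 4 values for 5 rows
    failed = False
except ValueError as err:
    failed = True
    print('pandas refused the assignment:', err)

# A Series is matched up by index label, so the last row simply gets NaN
data['col5'] = pd.Series(['I', 'am', 'too', 'short'])
print(data)
```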

### Merge

You can `merge()` in different ways, just like joining in SQL. Let's look at an imaginary taco dataset:

In [48]:
tacos = pd.read_csv('../data/tacos.csv')
tacos

Unnamed: 0,name,restaurant,number_of_tacos,score
0,Sarah,Taco Party,4,3.6
1,Georgi,Taco Mania,6,2.5
2,Sammy,Paradise Tacos,10,5.0
3,Peter,Taco Party,8,4.3
4,Rob,Taco Mania,8,3.4


Let's also look at an imaginary taco toppings dataset:

In [49]:
taco_toppings = pd.read_csv('../data/taco_toppings.csv')
taco_toppings

Unnamed: 0,name,favorite_topping,least_favorite_topping,corn_or_flour
0,Sammy,bacon,slime,corn
1,Peter,avocado,dirt,flour
2,Georgi,jalapeno,dirt,flour
3,Sarah,cheese,celery,corn
4,Rob,salsa,slime,flour


Notice that we have a unique identifier in each dataset: the name column. We have the same five people. Let's merge these `DataFrames` together. You don't even need to pass the key to merge; by default, `merge()` joins on the columns that the two `DataFrames` have in common, which here is just name.

In [51]:
tacos.merge(taco_toppings)

Unnamed: 0,name,restaurant,number_of_tacos,score,favorite_topping,least_favorite_topping,corn_or_flour
0,Sarah,Taco Party,4,3.6,cheese,celery,corn
1,Georgi,Taco Mania,6,2.5,jalapeno,dirt,flour
2,Sammy,Paradise Tacos,10,5.0,bacon,slime,corn
3,Peter,Taco Party,8,4.3,avocado,dirt,flour
4,Rob,Taco Mania,8,3.4,salsa,slime,flour


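By default, `merge()` performs an inner join, which keeps only the rows whose key appears in both `DataFrames`. Passing `how='left'` performs a left outer join instead: it takes the keys from the "left" `DataFrame` (the one `merge()` is called on, or the first argument to `pd.merge()`) and matches the right to it.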

Generally speaking, full outer joins will join everything as a union, meaning that everything will be joined even if there are missing values; inner joins will join everything as an intersection, meaning that if a value does not appear in a row in a `DataFrame`, that row will be left out.
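
The different kinds of joins are easier to compare side by side. This sketch uses two made-up frames, `left` and `right`, where 'Dan' appears only on the left and 'Eve' only on the right. The `indicator=True` option adds a `_merge` column that says where each row came from.

```python
import pandas as pd

# Hypothetical frames: 'Dan' only on the left, 'Eve' only on the right
left = pd.DataFrame({'name': ['Sarah', 'Rob', 'Dan'], 'score': [3.6, 3.4, 3.8]})
right = pd.DataFrame({'name': ['Sarah', 'Rob', 'Eve'],
                      'topping': ['cheese', 'salsa', 'bacon']})

# Try each join type and see which names survive
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left, right, on='name', how=how)
    print(how, sorted(merged['name']))

# indicator=True records whether a row matched in both frames or only one
audit = pd.merge(left, right, on='name', how='outer', indicator=True)
print(audit[['name', '_merge']])
```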

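Let's look at a couple of other ways of merging. First, let's add a row to our tacos `DataFrame` (using `pd.concat()`, since `DataFrame.append()` was removed in Pandas 2.0).

In [52]:
new_row = pd.DataFrame([{'name': 'Dan', 'restaurant': 'Tres Carnes', 'number_of_tacos': 7, 'score': 3.8}])
tacos = pd.concat([tacos, new_row], ignore_index=True)
tacos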
tacos

Unnamed: 0,name,restaurant,number_of_tacos,score
0,Sarah,Taco Party,4,3.6
1,Georgi,Taco Mania,6,2.5
2,Sammy,Paradise Tacos,10,5.0
3,Peter,Taco Party,8,4.3
4,Rob,Taco Mania,8,3.4
5,Dan,Tres Carnes,7,3.8


Now, let's do a full outer merge.

In [53]:
pd.merge(tacos, taco_toppings, how='outer')

Unnamed: 0,name,restaurant,number_of_tacos,score,favorite_topping,least_favorite_topping,corn_or_flour
0,Sarah,Taco Party,4,3.6,cheese,celery,corn
1,Georgi,Taco Mania,6,2.5,jalapeno,dirt,flour
2,Sammy,Paradise Tacos,10,5.0,bacon,slime,corn
3,Peter,Taco Party,8,4.3,avocado,dirt,flour
4,Rob,Taco Mania,8,3.4,salsa,slime,flour
5,Dan,Tres Carnes,7,3.8,,,


You can see that the entire tacos `DataFrame` has been merged, even though 'Dan' does not exist in the `taco_toppings DataFrame`.

However, if we do the same thing and use a right outer join, we'll only use the keys from the `taco_toppings DataFrame` and Dan will be left out.

In [54]:
pd.merge(tacos, taco_toppings, how='right')

Unnamed: 0,name,restaurant,number_of_tacos,score,favorite_topping,least_favorite_topping,corn_or_flour
0,Sarah,Taco Party,4,3.6,cheese,celery,corn
1,Georgi,Taco Mania,6,2.5,jalapeno,dirt,flour
2,Sammy,Paradise Tacos,10,5.0,bacon,slime,corn
3,Peter,Taco Party,8,4.3,avocado,dirt,flour
4,Rob,Taco Mania,8,3.4,salsa,slime,flour


### Join

The `join()` function gives you a way to combine `DataFrames` on their index, without needing a key column. `taco_extra`, which contains data about chips and spiciness level, has no name column.

In [55]:
taco_extra = pd.read_csv('../data/taco_extra.csv')
taco_extra

Unnamed: 0,chips,spiciness
0,yes,hot
1,no,mild
2,no,medium
3,yes,hot
4,yes,hot


It's easy to join this to our taco `DataFrame`.

In [56]:
tacos.join(taco_extra)

Unnamed: 0,name,restaurant,number_of_tacos,score,chips,spiciness
0,Sarah,Taco Party,4,3.6,yes,hot
1,Georgi,Taco Mania,6,2.5,no,mild
2,Sammy,Paradise Tacos,10,5.0,no,medium
3,Peter,Taco Party,8,4.3,yes,hot
4,Rob,Taco Mania,8,3.4,yes,hot
5,Dan,Tres Carnes,7,3.8,,


You can also specify how to join. The default is a left join, which keeps every row of the `DataFrame` you call `join()` on, but we can change it to inner and Dan will be left out again.

In [57]:
tacos.join(taco_extra, how='inner')

Unnamed: 0,name,restaurant,number_of_tacos,score,chips,spiciness
0,Sarah,Taco Party,4,3.6,yes,hot
1,Georgi,Taco Mania,6,2.5,no,mild
2,Sammy,Paradise Tacos,10,5.0,no,medium
3,Peter,Taco Party,8,4.3,yes,hot
4,Rob,Taco Mania,8,3.4,yes,hot


It's possible to join more than two `DataFrames` at a time. Let's slice off the name column from taco_toppings.

In [58]:
taco_toppings_noname = taco_toppings.iloc[:, 1:]

taco_toppings_noname

Unnamed: 0,favorite_topping,least_favorite_topping,corn_or_flour
0,bacon,slime,corn
1,avocado,dirt,flour
2,jalapeno,dirt,flour
3,cheese,celery,corn
4,salsa,slime,flour


Joining this frame with tacos and taco_extra is as easy as chaining two joins together. Again, these are left joins on the index, so even though there's no toppings or extra data for Dan, he's still included in the `DataFrame`.

In [59]:
tacos.join(taco_toppings_noname).join(taco_extra)

Unnamed: 0,name,restaurant,number_of_tacos,score,favorite_topping,least_favorite_topping,corn_or_flour,chips,spiciness
0,Sarah,Taco Party,4,3.6,bacon,slime,corn,yes,hot
1,Georgi,Taco Mania,6,2.5,avocado,dirt,flour,no,mild
2,Sammy,Paradise Tacos,10,5.0,jalapeno,dirt,flour,no,medium
3,Peter,Taco Party,8,4.3,cheese,celery,corn,yes,hot
4,Rob,Taco Mania,8,3.4,salsa,slime,flour,yes,hot
5,Dan,Tres Carnes,7,3.8,,,,,
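
Because `join()` matches rows by index label rather than by any name column, it's worth checking that the indexes of your frames really line up. Here's a minimal sketch with made-up frames (`people` and `extra`), where the second frame's rows are in a different order and one label is missing:

```python
import pandas as pd

# Hypothetical frames; 'extra' has its rows in a different index order
people = pd.DataFrame({'name': ['Sarah', 'Georgi', 'Sammy']}, index=[0, 1, 2])
extra = pd.DataFrame({'chips': ['no', 'yes']}, index=[2, 0])

# Rows are matched on index labels, not on the order they appear in
joined = people.join(extra)
print(joined)
```

Row 0 picks up 'yes', row 2 picks up 'no', and row 1 gets `NaN` because `extra` has no label 1.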


## Lesson: Let's merge some dataframes!

In [60]:
# Can you merge the following DataFrames together?
pizza = pd.read_csv('../data/pizza.csv')
pizza_toppings = pd.read_csv('../data/pizza_toppings.csv')

# Merge them here


In [None]:
# Let's inner merge those DataFrames



In [None]:
# Let's join pizza to another dataset, pizza_extra
pizza_extra = pd.read_csv('../data/pizza_extra.csv')


In [None]:
# Let's only join them together where all the data is present


In [None]:
# Can you join all three dataframes together, first by merging pizza and pizza_toppings, then joining that to pizza_extra?


## Pivoting

You can pivot in Pandas just like you would in Excel. We'll pass four things to `pivot_table()`: the `DataFrame`, the column to use for the index, the column to use for the columns, and the column to use for the values. `pivot_table()` also has an `aggfunc` parameter that defaults to the mean of the values, but you can pass in other functions, just as we did in the `agg()` function before.

Let's look at the mean weight per model number and number of cylinders combination.

In [61]:
pd.pivot_table(auto_mpg, values='weight', index='model', columns='cylinders')

cylinders,3,4,5,6,8
model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
70,,2292.571429,,2710.5,3940.055556
71,,2056.384615,,3171.875,4537.714286
72,2330.0,2382.642857,,,4228.384615
73,2124.0,2338.090909,,2917.125,4279.05
74,,2151.466667,,3320.0,4438.4
75,,2489.25,,3398.333333,4108.833333
76,,2306.6,,3349.6,4064.666667
77,2720.0,2205.071429,,3383.0,4177.5
78,,2296.764706,2830.0,3314.166667,3563.333333
79,,2357.583333,3530.0,3025.833333,3862.9


If a cell contains NaN, it means that that combination doesn't exist within the `DataFrame`.
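
One way to think about a pivot table is as a `groupby()` on two columns, with one of them then moved from the rows to the columns. The sketch below shows both routes on a tiny made-up `cars` frame (the numbers are invented, not from `auto_mpg`).

```python
import pandas as pd

# Hypothetical cars; the numbers are invented
cars = pd.DataFrame({
    'model': [70, 70, 70, 71],
    'cylinders': [4, 4, 8, 4],
    'weight': [2000, 2400, 4000, 2100],
})

# Mean weight per model/cylinders combination, as a pivot table
pivoted = pd.pivot_table(cars, values='weight', index='model', columns='cylinders')

# The same numbers via groupby, with 'cylinders' unstacked into columns
grouped = cars.groupby(['model', 'cylinders'])['weight'].mean().unstack()

print(pivoted)
print(grouped)
```

There is no 8-cylinder car for model 71 in this made-up data, so that cell comes out as `NaN` in both tables.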

We can pass in multiple column names to the `index` and `columns` parameters. This creates a `MultiIndex`.

If we add the origin column to our pivot table, we can look at the average weight of all of the model/origin combinations against the number of cylinders the cars have.

In [62]:
pd.pivot_table(auto_mpg, values='weight', index=['model', 'origin'], columns='cylinders')

Unnamed: 0_level_0,cylinders,3,4,5,6,8
model,origin,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
70,1,,,,2710.5,3940.055556
70,2,,2309.2,,,
70,3,,2251.0,,,
71,1,,2178.6,,3171.875,4537.714286
71,2,,2024.0,,,
71,3,,1936.0,,,
72,1,,2263.8,,,4228.384615
72,2,,2573.2,,,
72,3,2330.0,2293.0,,,
73,1,,2355.5,,2932.857143,4279.05


You can apply different aggregate functions to a pivot table. Let's look at the total weight per model/cylinder combination.

In [63]:
pd.pivot_table(auto_mpg, values='weight', index='model', columns='cylinders', aggfunc='sum')

cylinders,3,4,5,6,8
model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
70,,16048,,10842.0,70921.0
71,,26733,,25375.0,31764.0
72,2330.0,33357,,,54969.0
73,2124.0,25719,,23337.0,85581.0
74,,32272,,23240.0,22192.0
75,,29871,,40780.0,24653.0
76,,34599,,33496.0,36582.0
77,2720.0,30871,,16915.0,33420.0
78,,39045,2830.0,39770.0,21380.0
79,,28291,3530.0,18155.0,38629.0


## Lesson: let's pivot!

In [None]:
# Create a pivot_table for credit_approval with column A as the index, column J as the columns, and column H as the values.


In [None]:
# Now, change the aggfunc to the standard deviation.


In [None]:
# Finally, can you come up with your own pivot_table?


# For those using IPython Notebook/Wakari/NBViewer: Go to the [data_analysis](data_analysis.ipynb) notebook!

# For those using code files, go to data_analysis.py!