# Lesson 4 Class Exercises: Pandas Part 2

With these class exercises we learn a few new things.  When new knowledge is introduced you'll see the icon shown on the right: 
<span style="float:right; margin-left:10px; clear:both;">![Task](../media/new_knowledge.png)</span>
## Get Started
Import the NumPy and pandas packages.

In [1]:
import pandas as pd
import numpy as np

## Exercise 1: Review of Pandas Part 1
### Task 1: Explore the data
Import the data from the [Lectures in Quantitative Economics](https://github.com/QuantEcon/lecture-source-py) regarding minimum wages in countries around the world in US Dollars.  You can view the data [here](https://github.com/QuantEcon/lecture-source-py/blob/master/source/_static/lecture_specific/pandas_panel/realwage.csv) and you can access the data file here: https://raw.githubusercontent.com/QuantEcon/lecture-source-py/master/source/_static/lecture_specific/pandas_panel/realwage.csv.  Then perform the following tasks.

Import the data into a variable named `minwages` and print the first 5 lines of data to explore what is there.

In [2]:
minwages = pd.read_csv('https://raw.githubusercontent.com/QuantEcon/lecture-source-py/master/source/_static/lecture_specific/pandas_panel/realwage.csv')
minwages.head()

Unnamed: 0.1,Unnamed: 0,Time,Country,Series,Pay period,value
0,0,2006-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17132.443
1,1,2007-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18100.918
2,2,2008-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17747.406
3,3,2009-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18580.139
4,4,2010-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18755.832


Find the shape of the data.

In [3]:
minwages.shape

(1408, 6)

List the column names.

In [4]:
minwages.columns

Index(['Unnamed: 0', 'Time', 'Country', 'Series', 'Pay period', 'value'], dtype='object')

Identify the data types. Do they match what you would expect?

In [5]:
minwages.dtypes

Unnamed: 0      int64
Time           object
Country        object
Series         object
Pay period     object
value         float64
dtype: object

Identify columns with missing values. 

In [6]:
minwages.isna().sum()

Unnamed: 0     0
Time           0
Country        0
Series         0
Pay period     0
value         68
dtype: int64

Identify if there are duplicated entries.

In [7]:
minwages.duplicated().sum()

0

How many unique values are there per column?  Do these look reasonable for the data type and what you know about what is stored in the column?

In [8]:
minwages.nunique()

Unnamed: 0    1408
Time            11
Country         32
Series           2
Pay period       2
value         1289
dtype: int64

### Task 2: Explore More

Retrieve descriptive statistics for the data.

In [9]:
minwages.describe()

Unnamed: 0.1,Unnamed: 0,value
count,1408.0,1340.0
mean,703.5,5697.843084
std,406.598901,7475.920784
min,0.0,0.234
25%,351.75,4.388742
50%,703.5,290.606495
75%,1055.25,10501.7305
max,1407.0,25713.797


Identify all of the countries listed in the data.

In [10]:
minwages['Country'].unique()

array(['Ireland', 'Spain', 'Australia', 'Turkey', 'Luxembourg',
       'New Zealand', 'United Kingdom', 'Mexico', 'Greece',
       'Slovak Republic', 'Portugal', 'France', 'United States', 'Japan',
       'Netherlands', 'Estonia', 'Hungary', 'Poland', 'Czech Republic',
       'Canada', 'Korea', 'Slovenia', 'Chile', 'Israel', 'Belgium',
       'Germany', 'Brazil', 'Russian Federation', 'Lithuania', 'Latvia',
       'Colombia', 'Costa Rica'], dtype=object)

Convert the time column to a datetime object. Use the `pd.to_datetime()` function.

In [11]:
minwages['Time'] = pd.to_datetime(minwages['Time'])

List the time points that were used for data collection. How many years of data collection were there? What time of year were the data collected?

In [12]:
minwages['Time'].unique()

array(['2006-01-01T00:00:00.000000000', '2007-01-01T00:00:00.000000000',
       '2008-01-01T00:00:00.000000000', '2009-01-01T00:00:00.000000000',
       '2010-01-01T00:00:00.000000000', '2011-01-01T00:00:00.000000000',
       '2012-01-01T00:00:00.000000000', '2013-01-01T00:00:00.000000000',
       '2014-01-01T00:00:00.000000000', '2015-01-01T00:00:00.000000000',
       '2016-01-01T00:00:00.000000000'], dtype='datetime64[ns]')

Because we only have one data point collected per year per country, simplify this by adding a new column with just the year.  Print the first 5 rows to confirm the column was added.

In [13]:
minwages['Year'] = minwages['Time'].dt.year
minwages.head()

Unnamed: 0.1,Unnamed: 0,Time,Country,Series,Pay period,value,Year
0,0,2006-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17132.443,2006
1,1,2007-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18100.918,2007
2,2,2008-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17747.406,2008
3,3,2009-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18580.139,2009
4,4,2010-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18755.832,2010
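The conversion and the year extraction above can be tried in isolation. The following is a minimal sketch on a hypothetical toy dataframe named `toy` (not the real wage file), showing that `pd.to_datetime()` changes the column type and that the `.dt` accessor exposes components such as the year and month.

```python
import pandas as pd

# Hypothetical toy data, not the real minimum-wage file.
toy = pd.DataFrame({'Time': ['2006-01-01', '2007-01-01', '2008-01-01']})

# Dates read from text are plain strings until they are converted.
toy['Time'] = pd.to_datetime(toy['Time'])

# The .dt accessor exposes datetime components of each value.
toy['Year'] = toy['Time'].dt.year
toy['Month'] = toy['Time'].dt.month
print(toy)
```

Because every date in this toy example falls on January 1st, the `Month` column is constant, which mirrors the question about what time of year the data were collected.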


There are two pay periods.  Retrieve them in a list of just the two strings.

In [14]:
minwages['Pay period'].unique()

array(['Annual', 'Hourly'], dtype=object)

### Task 3: Clean the data
We have no duplicates in this data so we do not need to consider removing those, but we do have missing values in the `value` column. Let's remove those. Review the documentation for the [dropna()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) function. Check the dimensions (shape) afterwards to make sure the rows with missing values are gone.

In [21]:
minwages.dropna(inplace=True)
minwages.shape

(1340, 7)
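By default `dropna()` removes a row when any column is missing. When more than one column has gaps, the `subset` parameter restricts the check to the columns you name. Here is a small sketch on a hypothetical dataframe `toy` (the `note` column is invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two different columns.
toy = pd.DataFrame({
    'Country': ['Ireland', 'Spain', 'Chile', 'Korea'],
    'note': ['a', None, 'c', 'd'],
    'value': [17132.4, 9000.0, np.nan, 12000.0],
})

# Default behavior: drop a row if ANY column is missing.
dropped_any = toy.dropna()

# subset: only look for missing entries in the listed columns.
dropped_value = toy.dropna(subset=['value'])

print(dropped_any.shape, dropped_value.shape)
```

Both calls return a new dataframe and leave `toy` unchanged, which is an alternative to `inplace=True` when you want to keep the original around.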

Remove the "Unnamed: 0" column as it's not needed. Review the documentation for the [drop()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) function.

In [22]:
minwages.drop(['Unnamed: 0'], axis=1, inplace=True)

### Task 4:  Indexing
Use boolean indexing to retrieve the rows of annual salary in the United States.

In [23]:
minwages[(minwages['Country'] == "United States") & (minwages['Pay period'] == 'Annual')]

Unnamed: 0_level_0,Time,Country,Series,Pay period,value,Year
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
United States,2006-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,12594.397,2006
United States,2007-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,12974.395,2007
United States,2008-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,14097.556,2008
United States,2009-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,15756.423,2009
United States,2010-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,16391.313,2010
United States,2011-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,15889.705,2011
United States,2012-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,15567.554,2012
United States,2013-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,15342.814,2013
United States,2014-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,15097.89,2014
United States,2015-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,15080.0,2015


Do we have enough data to calculate descriptive statistics for annual salary in the United States in 2016? 

In [24]:
minwages[(minwages['Country'] == "United States") & (minwages['Pay period'] == 'Annual') & (minwages['Year'] == 2016)]

Unnamed: 0_level_0,Time,Country,Series,Pay period,value,Year
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
United States,2016-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,14892.122,2016
United States,2016-01-01,United States,In 2015 constant prices at 2015 USD exchange r...,Annual,14892.122,2016


Use `loc` to calculate descriptive statistics for the hourly salary in the United States and then again separately for Ireland. Hint: `loc` selects rows by index label, so you will first have to set the row index to the country names.

In [25]:
minwages.index = minwages['Country']

In [26]:
minwages[minwages['Pay period'] == 'Hourly'].loc['United States'].describe()

Unnamed: 0,value,Year
count,22.0,22.0
mean,7.154966,2011.0
std,0.560776,3.236694
min,6.055,2006.0
25%,6.87587,2008.25
50%,7.2588,2011.0
75%,7.555215,2013.75
max,7.88044,2016.0


In [27]:
minwages[minwages['Pay period'] == 'Hourly'].loc['Ireland'].describe()

Unnamed: 0,value,Year
count,22.0,22.0
mean,9.202719,2011.0
std,0.569146,3.236694
min,8.23675,2006.0
25%,8.65874,2008.25
50%,9.14255,2011.0
75%,9.62225,2013.75
max,10.148,2016.0


Now do the same for Annual salary

In [28]:
minwages[minwages['Pay period'] == 'Annual'].loc['Ireland'].describe()

Unnamed: 0,value,Year
count,22.0,22.0
mean,19141.667591,2011.0
std,1183.797682,3.236694
min,17132.443,2006.0
25%,18010.18675,2008.25
50%,19016.6495,2011.0
75%,20014.7555,2013.75
max,21107.758,2016.0


In [29]:
minwages[minwages['Pay period'] == 'Annual'].loc['United States'].describe()

Unnamed: 0,value,Year
count,22.0,22.0
mean,14880.379,2011.0
std,1167.568383,3.236694
min,12594.397,2006.0
25%,14296.1975,2008.25
50%,15097.89,2011.0
75%,15709.20575,2013.75
max,16391.313,2016.0
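The cells above assign the `Country` column to `minwages.index` directly. As a sketch of an alternative, `set_index` does the same job but returns a new dataframe and moves the column into the index. The dataframe `toy` and its values below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'Country': ['Ireland', 'Ireland', 'Spain', 'Spain'],
    'Pay period': ['Annual', 'Hourly', 'Annual', 'Hourly'],
    'value': [18000.0, 9.0, 12000.0, 6.0],
})

# set_index returns a new dataframe; toy itself is unchanged.
by_country = toy.set_index('Country')

# loc selects rows by index label; a repeated label returns every match.
ireland = by_country.loc['Ireland']
print(ireland)
```

Note that after `set_index`, `Country` is no longer a data column of `by_country`, whereas assigning to `.index` keeps the column as well.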


## Exercise 2: Occurrences
First, reset the indexes back to numeric values. Print the first 5 lines to confirm.

In [30]:
minwages.reset_index(drop=True, inplace=True)
minwages.head()

Unnamed: 0,Time,Country,Series,Pay period,value,Year
0,2006-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17132.443,2006
1,2007-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18100.918,2007
2,2008-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17747.406,2008
3,2009-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18580.139,2009
4,2010-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18755.832,2010


Get the count of how many rows there are per year.  Review the documentation for the [value_counts()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html) function.

In [31]:
minwages['Year'].value_counts()

2015    128
2016    128
2014    124
2006    120
2007    120
2008    120
2009    120
2010    120
2011    120
2012    120
2013    120
Name: Year, dtype: int64
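The output above is ordered by count, largest first, which is the default for `value_counts()`. The sketch below, on a hypothetical Series named `years`, shows two common variations: sorting by the label instead, and returning fractions instead of counts.

```python
import pandas as pd

# Hypothetical toy data.
years = pd.Series([2015, 2016, 2015, 2014, 2016, 2015], name='Year')

# Default: sorted by count, largest first.
counts = years.value_counts()

# Sorted by the year label instead of by count.
by_year = years.value_counts().sort_index()

# Fractions of the total instead of raw counts.
shares = years.value_counts(normalize=True)

print(by_year)
```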

## Exercise 3: Grouping
### Task 1: Aggregation
Calculate the average "annual" salary for each country across all years.

In [32]:
mwannual = minwages[(minwages['Pay period'] == 'Annual')]
mwannual[['Country','value']].groupby('Country').mean().head()

Unnamed: 0_level_0,value
Country,Unnamed: 1_level_1
Australia,22950.927364
Belgium,21146.370318
Brazil,3364.827682
Canada,15875.013182
Chile,4865.907691


Calculate the average salary and hourly wage for each country across all years. Save the resulting dataframe containing the means into a new variable named `mwmean`.

In [33]:
mwgroup = minwages[['Country', 'Pay period', 'value']].groupby(['Country', 'Pay period'])
mwmean = mwgroup.mean()
mwmean.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,value
Country,Pay period,Unnamed: 2_level_1
Australia,Annual,22950.927364
Australia,Hourly,11.616901
Belgium,Annual,21146.370318
Belgium,Hourly,10.138833
Brazil,Annual,3364.827682


In [34]:
mwgroup

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000281FC040430>

Above we saw how to aggregate using built-in functions of the `DataFrameGroupBy` object. For example, we called the `mean` function directly. These handy functions help with writing succinct code. However, you can also use the `aggregate` function to do more! You can learn more on the [aggregate description page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.aggregate.html).

With `aggregate` we can perform operations across rows and columns, and we can perform more than one operation at a time.  Explore the online documentation for the function and see how you would calculate the mean, min, and max for each country and pay period type, as well as the total number of records per country and pay period:


In [35]:
mwgroup.aggregate(['mean', 'min', 'max', 'count'])

Unnamed: 0_level_0,Unnamed: 1_level_0,value,value,value,value
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,min,max,count
Country,Pay period,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Australia,Annual,22950.927364,20410.65200,25643.72900,22
Australia,Hourly,11.616901,10.33073,12.98100,22
Belgium,Annual,21146.370318,20228.74200,22140.19100,22
Belgium,Hourly,10.138833,9.69900,10.61538,22
Brazil,Annual,3364.827682,2032.87300,4753.59910,22
...,...,...,...,...,...
Turkey,Hourly,3.432194,2.22200,5.78927,22
United Kingdom,Annual,18808.980409,16510.78300,21352.73000,22
United Kingdom,Hourly,9.043460,7.93763,10.26200,22
United States,Annual,14880.379000,12594.39700,16391.31300,22


You can also use `aggregate` on a single column of the grouped object. For example:

```python
mwgroup = minwages[['Country', 'Pay period', 'value']].groupby(['Country', 'Pay period'])
mwgroup['value'].aggregate(['mean'])
```
Redo the aggregate function in the previous cell but this time apply it to a single column.

In [36]:
mwgroup['value'].aggregate(['mean', 'min', 'max', 'count'])

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,min,max,count
Country,Pay period,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Australia,Annual,22950.927364,20410.65200,25643.72900,22
Australia,Hourly,11.616901,10.33073,12.98100,22
Belgium,Annual,21146.370318,20228.74200,22140.19100,22
Belgium,Hourly,10.138833,9.69900,10.61538,22
Brazil,Annual,3364.827682,2032.87300,4753.59910,22
...,...,...,...,...,...
Turkey,Hourly,3.432194,2.22200,5.78927,22
United Kingdom,Annual,18808.980409,16510.78300,21352.73000,22
United Kingdom,Hourly,9.043460,7.93763,10.26200,22
United States,Annual,14880.379000,12594.39700,16391.31300,22
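The `aggregate` calls above name the result columns after the functions. If you would rather choose the output column names yourself, pandas also supports "named aggregation", where each keyword is a new column name and each value is a `(column, function)` pair. This is a sketch on a hypothetical dataframe `toy`; the output names `mean_value`, `max_value`, and `n_records` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'Country': ['Ireland', 'Ireland', 'Spain', 'Spain', 'Spain'],
    'value': [10.0, 20.0, 1.0, 2.0, 6.0],
})

# keyword = (source column, aggregation function)
summary = toy.groupby('Country').agg(
    mean_value=('value', 'mean'),
    max_value=('value', 'max'),
    n_records=('value', 'count'),
)
print(summary)
```

The result has ordinary single-level column names, which can be easier to work with than the two-level column header produced when a list of functions is applied to the whole grouped dataframe.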


### Task 2: Slicing/Indexing
In the following code the resulting dataframe should contain only one data column: the mean values. It does, however, have two levels of indexes: Country and Pay period.  For example:

```python
mwgroup = minwages[['Country', 'Pay period', 'value']].groupby(['Country', 'Pay period'])
mwmean = mwgroup.mean()
mwmean
```

Try it out:

In [37]:
mwgroup = minwages[['Country', 'Pay period', 'value']].groupby(['Country', 'Pay period'])
mwmean = mwgroup.mean()
mwmean

Unnamed: 0_level_0,Unnamed: 1_level_0,value
Country,Pay period,Unnamed: 2_level_1
Australia,Annual,22950.927364
Australia,Hourly,11.616901
Belgium,Annual,21146.370318
Belgium,Hourly,10.138833
Brazil,Annual,3364.827682
...,...,...
Turkey,Hourly,3.432194
United Kingdom,Annual,18808.980409
United Kingdom,Hourly,9.043460
United States,Annual,14880.379000


Notice in the output above there are two levels of indexes. This is called MultiIndexing.  In reality, there is only one data column and two index levels.  So, you can do this:

```python
mwmean['value']
```

But you can't do this:

```python
mwmean['Pay period']
```

Why not? Try it:


In [38]:
mwmean['value']

Country         Pay period
Australia       Annual        22950.927364
                Hourly           11.616901
Belgium         Annual        21146.370318
                Hourly           10.138833
Brazil          Annual         3364.827682
                                  ...     
Turkey          Hourly            3.432194
United Kingdom  Annual        18808.980409
                Hourly            9.043460
United States   Annual        14880.379000
                Hourly            7.154966
Name: value, Length: 64, dtype: float64

In [39]:
mwmean['Pay period']

KeyError: 'Pay period'

The reason we cannot execute `mwmean['Pay period']` is that `Pay period` is not a data column. It's an index level.  Let's learn how to use MultiIndexes to retrieve data. You can learn more about it on the [MultiIndex/advanced indexing page](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-indexing-with-hierarchical-index).

First, let's take a look at the indexes using the `index` attribute. 

```python
mwmean.index
```

Try it:

In [40]:
mwmean.index

MultiIndex([(         'Australia', 'Annual'),
            (         'Australia', 'Hourly'),
            (           'Belgium', 'Annual'),
            (           'Belgium', 'Hourly'),
            (            'Brazil', 'Annual'),
            (            'Brazil', 'Hourly'),
            (            'Canada', 'Annual'),
            (            'Canada', 'Hourly'),
            (             'Chile', 'Annual'),
            (             'Chile', 'Hourly'),
            (          'Colombia', 'Annual'),
            (          'Colombia', 'Hourly'),
            (        'Costa Rica', 'Annual'),
            (        'Costa Rica', 'Hourly'),
            (    'Czech Republic', 'Annual'),
            (    'Czech Republic', 'Hourly'),
            (           'Estonia', 'Annual'),
            (           'Estonia', 'Hourly'),
            (            'France', 'Annual'),
            (            'France', 'Hourly'),
            (           'Germany', 'Annual'),
            (           'Germany',

Notice that each index entry is a tuple with two levels: the first is the country name and the second is the pay period. Remember, we can use `loc` to slice a dataframe using indexes, and this works with a MultiIndexed dataframe as well. For example, to extract all elements with the index named 'Australia':

```python
mwmean.loc[('Australia')]
```

Try it yourself:

In [41]:
mwmean.loc[('Australia')]

Unnamed: 0_level_0,value
Pay period,Unnamed: 1_level_1
Annual,22950.927364
Hourly,11.616901


You can specify both indexes to pull out a single row. For example, to find the average hourly salary in Australia:

```python
mwmean.loc[('Australia','Hourly')]
```
Try it yourself:

In [42]:
mwmean.loc[('Australia','Hourly')]

value    11.616901
Name: (Australia, Hourly), dtype: float64
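The index levels used with `loc` above can also be inspected by name. The following sketch builds a small hypothetical dataframe `toy`, groups it the same way as `mwmean`, and then looks at the level names, the labels of one level, and the data columns.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'Country': ['Australia', 'Australia', 'Belgium', 'Belgium'],
    'Pay period': ['Annual', 'Hourly', 'Annual', 'Hourly'],
    'value': [22000.0, 11.0, 21000.0, 10.0],
})
toy_mean = toy.groupby(['Country', 'Pay period']).mean()

# Names of the index levels created by groupby.
print(toy_mean.index.names)

# Labels stored in one level, one entry per row.
print(toy_mean.index.get_level_values('Pay period'))

# Only 'value' remains as a data column.
print(list(toy_mean.columns))
```

`get_level_values` is one way to get at the pay period labels even though `Pay period` cannot be selected like a column.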

Suppose you wanted to retrieve all of the mean "Hourly" wages. There are multiple ways to slice a MultiIndex, and some are not entirely intuitive or flexible enough.  Perhaps the easiest is to use the `pd.IndexSlice` object.  It lets you specify the index in a format similar to the slicing you've already learned.  For example:

```python
idx = pd.IndexSlice
mwmean.loc[idx[:,'Hourly'],:]
```

In the code above the `idx[:, 'Hourly']` portion is used in the "row" indexer position of `loc`. It indicates that we want all possible first-level indexes (specified with the `:`) and we want second-level indexes to be restricted to "Hourly".  
Try it out yourself:

In [43]:
idx = pd.IndexSlice
mwmean.loc[idx[:,'Hourly'],:].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,value
Country,Pay period,Unnamed: 2_level_1
Australia,Hourly,11.616901
Belgium,Hourly,10.138833
Brazil,Hourly,1.438636
Canada,Hourly,7.632323
Chile,Hourly,2.073636
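`pd.IndexSlice` is only one of the ways to select on the second level. As a hedged sketch of another, the `xs` (cross-section) method selects one label from a named level. The dataframe `toy` below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data, grouped into a two-level index.
toy = pd.DataFrame({
    'Country': ['Australia', 'Australia', 'Belgium', 'Belgium'],
    'Pay period': ['Annual', 'Hourly', 'Annual', 'Hourly'],
    'value': [22000.0, 11.0, 21000.0, 10.0],
})
toy_mean = toy.groupby(['Country', 'Pay period']).mean()

# Select the 'Hourly' rows; the selected level is dropped by default.
hourly = toy_mean.xs('Hourly', level='Pay period')

# Keep both index levels in the result.
hourly_kept = toy_mean.xs('Hourly', level='Pay period', drop_level=False)

print(hourly)
```

With the default `drop_level=True` the result is indexed by country only, whereas the `IndexSlice` form keeps both levels.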


Using what you've learned above about slicing the MultiIndexed dataframe, find out which country has had the highest average annual salary.

In [44]:
meanmwa = mwmean.loc[idx[:, 'Annual'], 'value']
meanmwa[meanmwa == meanmwa.max()]

Country     Pay period
Luxembourg  Annual        23680.787909
Name: value, dtype: float64

You can move the indexes into the dataframe and return to a traditional single-level numeric index by resetting the indexes:
```python
mwmean.reset_index()
```

Try it yourself:

### Task 3: Filtering the original data.
Another way we might want to filter is to find the records in the dataset whose group, after grouping, meets some criterion. For example, what if we wanted to find the records for all countries whose average annual salary was greater than $22K?

To do this, we can use the `filter` function of the `DataFrameGroupBy` object. The filter function must take a function as an argument (this is new and may seem weird).  

```python
annualwages = minwages[minwages['Pay period'] == 'Annual']
annualwages.groupby(['Country']).filter(
    lambda x : x['value'].mean() > 22000
)
```
Try it:

In [45]:
annualwages = minwages[minwages['Pay period'] == 'Annual']
annualwages.groupby(['Country']).filter(
    lambda x : x['value'].mean() > 22000
)


Unnamed: 0,Time,Country,Series,Pay period,value,Year
88,2006-01-01,Australia,In 2015 constant prices at 2015 USD PPPs,Annual,20410.652,2006
89,2007-01-01,Australia,In 2015 constant prices at 2015 USD PPPs,Annual,21087.568,2007
90,2008-01-01,Australia,In 2015 constant prices at 2015 USD PPPs,Annual,20718.238,2008
91,2009-01-01,Australia,In 2015 constant prices at 2015 USD PPPs,Annual,20984.768,2009
92,2010-01-01,Australia,In 2015 constant prices at 2015 USD PPPs,Annual,20879.332,2010
93,2011-01-01,Australia,In 2015 constant prices at 2015 USD PPPs,Annual,21037.328,2011
94,2012-01-01,Australia,In 2015 constant prices at 2015 USD PPPs,Annual,21323.83,2012
95,2013-01-01,Australia,In 2015 constant prices at 2015 USD PPPs,Annual,21387.027,2013
96,2014-01-01,Australia,In 2015 constant prices at 2015 USD PPPs,Annual,21453.828,2014
97,2015-01-01,Australia,In 2015 constant prices at 2015 USD PPPs,Annual,21715.529,2015
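The function passed to `filter` is called once per group and should return `True` to keep the whole group or `False` to drop it. A lambda is not required; a regular named function works too. This sketch uses a hypothetical dataframe `toy` and a hypothetical threshold of 15.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'C', 'C'],
    'value': [10.0, 30.0, 1.0, 3.0, 50.0, 70.0],
})

def high_mean(group):
    """Return True when the group's mean value exceeds the threshold."""
    return group['value'].mean() > 15

# Rows of every group for which high_mean returns True are kept.
kept = toy.groupby('Country').filter(high_mean)
print(kept)
```

Notice that the row for country `A` with a value of 10 is kept even though 10 is below the threshold: the test is applied to the group mean, not to individual rows. The same effect explains why the Australia rows in the output above include values below 22000.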


### Task 4: Reset the index
If you do not want to use MultiIndexes and you prefer to return any MultiIndex dataset back to a traditional 1-level index dataframe, you can use the `reset_index` function.

Try it out on the `mwmean` dataframe:

In [None]:
mwmean.reset_index().head()
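`reset_index` can also move just one level back into the columns by using its `level` parameter. The sketch below shows both forms on a hypothetical dataframe `toy`.

```python
import pandas as pd

# Hypothetical toy data, grouped into a two-level index.
toy = pd.DataFrame({
    'Country': ['Australia', 'Australia', 'Belgium', 'Belgium'],
    'Pay period': ['Annual', 'Hourly', 'Annual', 'Hourly'],
    'value': [22000.0, 11.0, 21000.0, 10.0],
})
toy_mean = toy.groupby(['Country', 'Pay period']).mean()

# Move every index level back into ordinary columns.
flat = toy_mean.reset_index()

# Move only one level; 'Country' stays as the index.
partial = toy_mean.reset_index(level='Pay period')

print(flat)
print(partial)
```

After a full reset, `flat['Pay period']` works like any other column selection.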

## Exercise 4:  
Load the iris dataset. 

In the Iris dataset:
+ Create a new column with the label "region" in the iris data frame. This column indicates the geographic region of the US where each measurement was taken. Values should include:  'Southeast', 'Northeast', 'Midwest', 'Southwest', 'Northwest'. Assign these randomly.
+ Use `groupby` to get a new data frame of means for each species in each region.
+ Add a `dev_stage` column by randomly selecting from the values "early" and "late".
+ Use `groupby` to get a new data frame of means for each species, in each region and each development stage.
+ Use the `count` function (just like you used the `mean` function) to identify how many rows in the table belong to each combination of species + region + developmental stage.

In [46]:
iris = pd.read_csv('../data/iris.csv')
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [47]:
iris['region'] = np.random.choice(['Southeast', 'Northeast', 'Midwest', 'Southwest', 'Northwest'], 150)
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,region
0,5.1,3.5,1.4,0.2,setosa,Northwest
1,4.9,3.0,1.4,0.2,setosa,Northeast
2,4.7,3.2,1.3,0.2,setosa,Northwest
3,4.6,3.1,1.5,0.2,setosa,Southwest
4,5.0,3.6,1.4,0.2,setosa,Midwest


In [48]:
iris.groupby(['species', 'region']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,sepal_length,sepal_width,petal_length,petal_width
species,region,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
setosa,Midwest,4.922222,3.377778,1.5,0.288889
setosa,Northeast,4.975,3.375,1.483333,0.258333
setosa,Northwest,5.021429,3.435714,1.471429,0.221429
setosa,Southeast,5.1875,3.55,1.4125,0.2125
setosa,Southwest,4.928571,3.357143,1.428571,0.242857
versicolor,Midwest,6.064286,2.714286,4.264286,1.292857
versicolor,Northeast,5.9,2.94,4.3,1.34
versicolor,Northwest,6.063636,2.8,4.381818,1.381818
versicolor,Southeast,5.727273,2.818182,4.036364,1.309091
versicolor,Southwest,5.855556,2.666667,4.355556,1.322222


In [49]:
iris['dev_stage'] = np.random.choice(['early', 'late'], 150)
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,region,dev_stage
0,5.1,3.5,1.4,0.2,setosa,Northwest,early
1,4.9,3.0,1.4,0.2,setosa,Northeast,late
2,4.7,3.2,1.3,0.2,setosa,Northwest,late
3,4.6,3.1,1.5,0.2,setosa,Southwest,early
4,5.0,3.6,1.4,0.2,setosa,Midwest,early


In [50]:
iris.groupby(['species', 'region', 'dev_stage']).count().head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,sepal_length,sepal_width,petal_length,petal_width
species,region,dev_stage,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
setosa,Midwest,early,5,5,5,5
setosa,Midwest,late,4,4,4,4
setosa,Northeast,early,4,4,4,4
setosa,Northeast,late,8,8,8,8
setosa,Northwest,early,6,6,6,6
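Because the regions and development stages are drawn at random, the tables above will differ each time the notebook is run. If you want repeatable results, one option is a seeded NumPy generator. This is a sketch on a hypothetical dataframe `toy`; the seed value 42 is arbitrary. It also uses `size()`, which returns a single count per group rather than one count per column.

```python
import numpy as np
import pandas as pd

# A seeded generator gives the same "random" draws on every run.
rng = np.random.default_rng(42)  # seed chosen arbitrarily

# Hypothetical toy data standing in for the iris table.
toy = pd.DataFrame({'species': ['setosa'] * 6 + ['versicolor'] * 6})
toy['region'] = rng.choice(['Southeast', 'Northeast', 'Midwest'], size=len(toy))
toy['dev_stage'] = rng.choice(['early', 'late'], size=len(toy))

# One count per observed combination of species, region and stage.
sizes = toy.groupby(['species', 'region', 'dev_stage']).size()
print(sizes)
```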


## Exercise 5: Kaggle Titanic Dataset
A dataset of Titanic passengers and their fates is provided by the online machine learning competition server [Kaggle](https://www.kaggle.com/). See the [Titanic project](https://www.kaggle.com/c/titanic) page for more details. 

Let's practice all we have learned thus far to explore and perhaps clean this dataset.  You have been provided with the dataset named `Titanic_train.csv`.  

### Task 1: Explore the data
First import the data and print the first 5 lines.

In [52]:
titanic = pd.read_csv('../data/Titanic_train.csv')
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Find the shape of the data.

In [53]:
titanic.shape

(891, 12)

List the column names.

In [54]:
titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Identify the data types. Do they match what you would expect?

In [55]:
titanic.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Identify columns with missing values. 

In [56]:
titanic.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Identify if there are duplicated entries.

In [57]:
titanic.duplicated().sum()

0

How many unique values are there per column?  Do these look reasonable for the data type and what you know about what is stored in the column?

In [58]:
titanic.nunique()

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64

### Task 2: Clean the data
Do missing values need to be removed? If so, remove them.

Do duplicates need to be removed?  If so remove them.

### Task 3: Find Interesting Facts
Count the number of passengers who survived and died in each passenger class.

In [None]:
scounts = titanic.groupby(['Pclass','Survived']).count()
scounts['PassengerId']

Were men or women more likely to survive?

In [None]:
# Survived is coded 0/1, so its mean per sex is the fraction that survived.
titanic.groupby('Sex')['Survived'].mean()

What were the average, minimum, and maximum ticket prices per passenger class?
Hint:  look at the help page for the [agg](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html) function to help simplify this.

In [None]:
# Select Fare before aggregating so rows with a missing Age or Cabin are not discarded.
scounts = titanic.groupby('Pclass')['Fare'].agg(['mean', 'min', 'max'])
scounts

Give descriptive statistics about the age of the passengers who survived.

In [None]:
titanic[(titanic['Survived'] == 1)]['Age'].dropna().describe()
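To compare groups side by side, `describe()` can also be called on a grouped column. The sketch below uses a small hypothetical dataframe `toy` with made-up values, not the Titanic file. Missing ages are skipped by `describe()`, so an explicit `dropna()` is not needed here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age.
toy = pd.DataFrame({
    'Survived': [1, 1, 0, 0, 1, 0],
    'Age': [22.0, 38.0, np.nan, 35.0, 4.0, 54.0],
})

# One row of statistics per value of Survived.
age_stats = toy.groupby('Survived')['Age'].describe()
print(age_stats)
```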