# Lesson 4 Class Exercises: Pandas Part 2

With these class exercises we learn a few new things.  When new knowledge is introduced you'll see the icon shown on the right: 
<span style="float:right; margin-left:10px; clear:both;">![Task](../media/new_knowledge.png)</span>
## Get Started
Import the NumPy and pandas packages.

In [1]:
import numpy as np
import pandas as pd

## Exercise 1: Review of Pandas Part 1
### Task 1: Explore the data
Import the data from the [Lectures in Quantitative Economics](https://github.com/QuantEcon/lecture-source-py) regarding minimum wages in countries around the world in US Dollars.  You can view the data [here](https://github.com/QuantEcon/lecture-source-py/blob/master/source/_static/lecture_specific/pandas_panel/realwage.csv) and you can access the data file here: https://raw.githubusercontent.com/QuantEcon/lecture-source-py/master/source/_static/lecture_specific/pandas_panel/realwage.csv.  Then perform the following:

Import the data into a variable named `minwages` and print the first 5 lines of data to explore what is there.

In [11]:
minwages = pd.read_csv("https://raw.githubusercontent.com/QuantEcon/lecture-source-py/master/source/_static/lecture_specific/pandas_panel/realwage.csv")
minwages.head()

,Unnamed: 0,Time,Country,Series,Pay period,value
0,0,2006-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17132.443
1,1,2007-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18100.918
2,2,2008-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17747.406
3,3,2009-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18580.139
4,4,2010-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18755.832


Find the shape of the data.

In [12]:
minwages.shape

(1408, 6)

List the column names.

In [10]:
minwages.columns

Identify the data types. Do they match what you would expect?

In [None]:
minwages.dtypes

Unnamed: 0      int64
Time           object
Country        object
Series         object
Pay period     object
value         float64
dtype: object

Identify columns with missing values. 

In [14]:
minwages.isna().sum()

Unnamed: 0     0
Time           0
Country        0
Series         0
Pay period     0
value         68
dtype: int64

Identify if there are duplicate entries.

In [16]:
minwages.duplicated().sum()

0

How many unique values per column are there?  Do these look reasonable for the data type and what you know about what is stored in the column?

In [17]:
minwages.nunique()

Unnamed: 0    1408
Time            11
Country         32
Series           2
Pay period       2
value         1289
dtype: int64

In [20]:
minwages['Pay period'].unique() # unique() returns the distinct values; nunique() returns how many there are

array(['Annual', 'Hourly'], dtype=object)

In [21]:
minwages['Series'].unique()

array(['In 2015 constant prices at 2015 USD PPPs',
       'In 2015 constant prices at 2015 USD exchange rates'], dtype=object)

### Task 2: Explore More

Retrieve descriptive statistics for the data.

In [22]:
minwages.describe()

,Unnamed: 0,value
count,1408.0,1340.0
mean,703.5,5697.843084
std,406.598901,7475.920784
min,0.0,0.234
25%,351.75,4.388742
50%,703.5,290.606495
75%,1055.25,10501.7305
max,1407.0,25713.797


Identify all of the countries listed in the data.

In [25]:
minwages['Country'].unique()

array(['Ireland', 'Spain', 'Australia', 'Turkey', 'Luxembourg',
       'New Zealand', 'United Kingdom', 'Mexico', 'Greece',
       'Slovak Republic', 'Portugal', 'France', 'United States', 'Japan',
       'Netherlands', 'Estonia', 'Hungary', 'Poland', 'Czech Republic',
       'Canada', 'Korea', 'Slovenia', 'Chile', 'Israel', 'Belgium',
       'Germany', 'Brazil', 'Russian Federation', 'Lithuania', 'Latvia',
       'Colombia', 'Costa Rica'], dtype=object)

Convert the `Time` column to a datetime object. Use the `pd.to_datetime()` function.

In [28]:
minwages['Date'] = pd.to_datetime(minwages['Time'])
minwages.dtypes

Unnamed: 0             int64
Time                  object
Country               object
Series                object
Pay period            object
value                float64
Date          datetime64[ns]
dtype: object

List the time points that were used for data collection. How many years of data collection were there? What time of year were the data collected?

In [31]:
minwages['Time'].sort_values().unique()

array(['2006-01-01', '2007-01-01', '2008-01-01', '2009-01-01',
       '2010-01-01', '2011-01-01', '2012-01-01', '2013-01-01',
       '2014-01-01', '2015-01-01', '2016-01-01'], dtype=object)

Because we only have one data point collected per year per country, simplify this by adding a new column with just the year.  Print the first 5 rows to confirm the column was added.

In [40]:
minwages["Years"] = minwages["Time"].str.split("-").str[0].astype(int) # the .str accessor only works on string columns such as Time
minwages

,Unnamed: 0,Time,Country,Series,Pay period,value,Date,Years
0,0,2006-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17132.443,2006-01-01,2006
1,1,2007-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18100.918,2007-01-01,2007
2,2,2008-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17747.406,2008-01-01,2008
3,3,2009-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18580.139,2009-01-01,2009
4,4,2010-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18755.832,2010-01-01,2010
...,...,...,...,...,...,...,...,...
1403,1403,2012-01-01,Costa Rica,In 2015 constant prices at 2015 USD exchange r...,Hourly,,2012-01-01,2012
1404,1404,2013-01-01,Costa Rica,In 2015 constant prices at 2015 USD exchange r...,Hourly,,2013-01-01,2013
1405,1405,2014-01-01,Costa Rica,In 2015 constant prices at 2015 USD exchange r...,Hourly,2.410,2014-01-01,2014
1406,1406,2015-01-01,Costa Rica,In 2015 constant prices at 2015 USD exchange r...,Hourly,2.560,2015-01-01,2015
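
As an alternative to splitting the string, the year can also be read from a datetime column through the `.dt` accessor. The sketch below assumes a small made-up frame named `toy` (not the wage data) so that it runs on its own:

```python
import pandas as pd

# Hypothetical stand-in for the wage data: dates stored as strings.
toy = pd.DataFrame({"Time": ["2006-01-01", "2007-01-01", "2008-01-01"]})

# Parse the strings into datetime64 values.
toy["Date"] = pd.to_datetime(toy["Time"])

# The .dt accessor exposes datetime parts; .dt.year gives the year as an integer.
toy["Years"] = toy["Date"].dt.year

print(toy.head())
```

Unlike the string split, this approach only works once the column holds datetime values, which is why `pd.to_datetime()` comes first.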


There are two pay periods.  Retrieve them in a list of just the two strings.
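
One possible approach, sketched on a made-up Series (the `periods` and `period_list` names are illustrative assumptions): take the distinct values and convert them to a plain Python list.

```python
import pandas as pd

# Hypothetical column with repeated labels, standing in for 'Pay period'.
periods = pd.Series(["Annual", "Hourly", "Annual", "Hourly"], name="Pay period")

# unique() returns the distinct values in order of first appearance;
# list() turns that array into a plain Python list.
period_list = list(periods.unique())

print(period_list)
```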

### Task 3: Clean the data
We have no duplicates in this data so we do not need to consider removing those, but we do have missing values in the `value` column. Let's remove those. Review the documentation for the [dropna()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) function. Check the dimensions (shape) afterwards to make sure the rows with missing values are gone.

In [53]:
minwages = minwages.dropna(axis = 0) # to drop rows
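
Called without arguments, `dropna()` drops a row if any of its columns is missing. When only one column matters, the `subset` parameter restricts the check to that column. A minimal sketch on made-up data (the `toy` frame and its `note` column are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing 'value' and one missing 'note'.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "value": [1.5, np.nan, 3.0],
    "note": ["x", "y", None],
})

# Only rows whose 'value' is missing are dropped; the missing 'note' is ignored.
cleaned = toy.dropna(subset=["value"])

# Compare the shapes before and after to confirm one row was removed.
print(toy.shape, cleaned.shape)
```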


Remove the "Unnamed: 0" column as it's not needed. Review the documentation for the [drop()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) function.

In [54]:
minwages = minwages.drop(['Unnamed: 0'], axis = 1)


,Time,Country,Series,Pay period,value,Date,Years
0,2006-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17132.4430,2006-01-01,2006
1,2007-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18100.9180,2007-01-01,2007
2,2008-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17747.4060,2008-01-01,2008
3,2009-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18580.1390,2009-01-01,2009
4,2010-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18755.8320,2010-01-01,2010
...,...,...,...,...,...,...,...
1395,2015-01-01,Costa Rica,In 2015 constant prices at 2015 USD exchange r...,Annual,7467.4902,2015-01-01,2015
1396,2016-01-01,Costa Rica,In 2015 constant prices at 2015 USD exchange r...,Annual,7678.3379,2016-01-01,2016
1405,2014-01-01,Costa Rica,In 2015 constant prices at 2015 USD exchange r...,Hourly,2.4100,2014-01-01,2014
1406,2015-01-01,Costa Rica,In 2015 constant prices at 2015 USD exchange r...,Hourly,2.5600,2015-01-01,2015


### Task 4:  Indexing
Use boolean indexing to retrieve the rows of annual salary in the United States.

In [58]:
minwages[(minwages['Country'] == "United States") & 
         (minwages['Pay period'] == 'Annual')]

,Time,Country,Series,Pay period,value,Date,Years
528,2006-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,12594.397,2006-01-01,2006
529,2007-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,12974.395,2007-01-01,2007
530,2008-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,14097.556,2008-01-01,2008
531,2009-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,15756.423,2009-01-01,2009
532,2010-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,16391.313,2010-01-01,2010
533,2011-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,15889.705,2011-01-01,2011
534,2012-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,15567.554,2012-01-01,2012
535,2013-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,15342.814,2013-01-01,2013
536,2014-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,15097.89,2014-01-01,2014
537,2015-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,15080.0,2015-01-01,2015


Do we have enough data to calculate descriptive statistics for annual salary in the United States in 2016?

In [61]:
minwages[(minwages['Country'] == "United States") & 
         (minwages['Pay period'] == 'Annual') &
         (minwages['Years'] == 2016)] # no, we only have 2 values and they are from different series

,Time,Country,Series,Pay period,value,Date,Years
538,2016-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Annual,14892.122,2016-01-01,2016
560,2016-01-01,United States,In 2015 constant prices at 2015 USD exchange r...,Annual,14892.122,2016-01-01,2016


Use `loc` to calculate descriptive statistics for the hourly salary in the United States and then again separately for Ireland. Hint: you will have to set row indexes.  Hint: you should reset the index before using `loc`

In [62]:
minwages.index = minwages['Country']


In [71]:
minwages[minwages['Pay period'] == 'Hourly'].loc[['United States','Ireland']]

Country,Time,Country,Series,Pay period,value,Date,Years
United States,2006-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Hourly,6.055,2006-01-01,2006
United States,2007-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Hourly,6.24143,2007-01-01,2007
United States,2008-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Hourly,6.78127,2008-01-01,2008
United States,2009-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Hourly,7.57882,2009-01-01,2009
United States,2010-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Hourly,7.88044,2010-01-01,2010
United States,2011-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Hourly,7.63928,2011-01-01,2011
United States,2012-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Hourly,7.4844,2012-01-01,2012
United States,2013-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Hourly,7.37635,2013-01-01,2013
United States,2014-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Hourly,7.2586,2014-01-01,2014
United States,2015-01-01,United States,In 2015 constant prices at 2015 USD PPPs,Hourly,7.25,2015-01-01,2015
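
To go from the selected rows to descriptive statistics, `describe()` can be chained after `loc`. The sketch below uses made-up hourly values (the `toy` frame is an assumption) and `set_index`, which moves the `Country` column into the index instead of copying it:

```python
import pandas as pd

# Hypothetical hourly wages for two countries.
toy = pd.DataFrame({
    "Country": ["United States", "United States", "Ireland", "Ireland"],
    "value": [7.0, 8.0, 9.0, 11.0],
})

# Use the country names as row labels so that loc can select by country.
by_country = toy.set_index("Country")

# loc[label, column] returns every row carrying that label; describe() summarizes them.
us_stats = by_country.loc["United States", "value"].describe()
ie_stats = by_country.loc["Ireland", "value"].describe()

print(us_stats)
print(ie_stats)
```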


Now do the same for Annual salary.

## Exercise 2: Occurrences
First, reset the indexes back to numeric values. Print the first 10 lines to confirm.

Get the count of how many rows there are per year.  Review the documentation for the [value_counts()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html) function.
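
As a rough illustration of the idea, `value_counts()` is also available on a single column (a Series). The `years` Series below is made up for the example:

```python
import pandas as pd

# Hypothetical 'Years' column.
years = pd.Series([2006, 2006, 2007, 2008, 2008, 2008], name="Years")

# value_counts() counts rows per distinct value, most frequent first;
# sort_index() reorders the result chronologically.
rows_per_year = years.value_counts().sort_index()

print(rows_per_year)
```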

## Exercise 3: Grouping
### Task 1: Aggregation
Calculate the average "annual" salary for each country across all years.

In [79]:
minwages.reset_index(drop = True, inplace = True)

Calculate the average salary and hourly wage for each country across all years. Save the resulting dataframe containing the means into a new variable named `mwmean`.

In [82]:
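# numeric_only=True keeps the text and date columns out of the mean
mwmean = minwages.groupby(['Country', 'Pay period']).mean(numeric_only=True)
mwmean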

Country,Pay period,value,Years
Australia,Annual,22950.927364,2011.0
Australia,Hourly,11.616901,2011.0
Belgium,Annual,21146.370318,2011.0
Belgium,Hourly,10.138833,2011.0
Brazil,Annual,3364.827682,2011.0
...,...,...,...
Turkey,Hourly,3.432194,2011.0
United Kingdom,Annual,18808.980409,2011.0
United Kingdom,Hourly,9.043460,2011.0
United States,Annual,14880.379000,2011.0


Above we saw how to aggregate using built-in functions of the `DataFrameGroupBy` object. For example, we called the `mean` function directly. These handy functions help with writing succinct code. However, you can also use the `aggregate` function to do more! You can learn more on the [aggregate description page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.aggregate.html).

With `aggregate` we can perform operations across rows and columns, and we can perform more than one operation at a time.  Explore the online documentation for the function and see how you would calculate the mean, min, and max for each country and pay period type, as well as the total number of records per country and pay period:
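
A minimal sketch of passing several function names at once, assuming a made-up `toy` frame with the same three column names as the wage data:

```python
import pandas as pd

# Hypothetical wage-like data.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "Pay period": ["Annual", "Annual", "Hourly", "Annual", "Annual"],
    "value": [100.0, 200.0, 5.0, 300.0, 500.0],
})

grouped = toy.groupby(["Country", "Pay period"])

# A list of function names yields one output column per function,
# nested under the name of each data column.
summary = grouped.aggregate(["mean", "min", "max", "count"])

print(summary)
```

Because the functions are applied to the whole grouped frame, the result has two levels of column labels: the data column (`value`) and the statistic.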

You can also use `aggregate` on a single column of the grouped object. For example:

```python
mwgroup = minwages[['Country', 'Pay period', 'value']].groupby(['Country', 'Pay period'])
mwgroup['value'].aggregate(['mean'])
```
Redo the aggregate function in the previous cell but this time apply it to a single column.

### Task 2: Slicing/Indexing
In the following code the resulting dataframe should contain only one data column: the mean values. It does, however, have two levels of indexes: Country and Pay period.  For example:

```python
mwgroup = minwages[['Country', 'Pay period', 'value']].groupby(['Country', 'Pay period'])
mwmean = mwgroup.mean()
mwmean
```

Try it out:

Notice in the output above there are two levels of indexes. This is called MultiIndexing.  In reality, there is only one data column and two index levels.  So, you can do this:

```python
mwmean['value']
```

But you can't do this:

```python
mwmean['Pay period']
```

Why not? Try it:


The reason we cannot execute `mwmean['Pay period']` is that `Pay period` is not a data column. It's an index.  Let's learn how to use MultiIndexes to retrieve data. You can learn more about it on the [MultiIndex/advanced indexing page](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-indexing-with-hierarchical-index).

First, let's take a look at the indexes using the `index` attribute. 

```python
mwmean.index
```

Try it:

Notice that each index entry is actually a tuple with two levels. The first is the country name and the second is the pay period. Remember, we can use the `loc` indexer to slice a dataframe using indexes.  We can do so with a MultiIndexed dataframe as well. For example, to extract all elements with the index named 'Australia':

```python
mwmean.loc[('Australia')]
```

Try it yourself:

You can specify both indexes to pull out a single row. For example, to find the average hourly salary in Australia:

```python
mwmean.loc[('Australia','Hourly')]
```
Try it yourself:

Suppose you wanted to retrieve all of the mean "Hourly" wages. There are multiple ways to slice a MultiIndex, and some are not entirely intuitive or flexible.  Perhaps the easiest is to use the `pd.IndexSlice` object.  It lets you write the index in a form that matches the slicing you've already learned.  For example:

```python
idx = pd.IndexSlice
mwmean.loc[idx[:,'Hourly'],:]
```

In the code above, the `idx[:, 'Hourly']` portion is used in the "row" indexer position of `loc`. It indicates that we want all possible first-level indexes (specified with the `:`) and we want second-level indexes to be restricted to "Hourly".  
Try it out yourself:

Using what you've learned above about slicing the MultiIndexed dataframe, find out which country has had the highest average annual salary.
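
One way to combine the slice with a lookup of the largest value is `idxmax()`, which returns the index label of the maximum. The sketch below runs on a made-up two-country frame (`toy` and `toy_mean` are illustrative names, not the wage data):

```python
import pandas as pd

# Hypothetical wage-like data with one annual and one hourly value per country.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Pay period": ["Annual", "Hourly", "Annual", "Hourly"],
    "value": [20000.0, 10.0, 30000.0, 15.0],
})
toy_mean = toy.groupby(["Country", "Pay period"]).mean()

idx = pd.IndexSlice
# Keep every country, but only the 'Annual' entries of the second index level.
annual = toy_mean.loc[idx[:, "Annual"], :]

# idxmax() returns the index label of the largest value,
# which here is a (Country, Pay period) tuple.
top = annual["value"].idxmax()

print(top)
```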

You can move the indexes into the dataframe and reset the index to a traditional single-level numeric index by resetting the indexes:
```python
mwmean.reset_index()
```

Try it yourself:

### Task 3: Filtering the original data.
Another way we might want to filter is to find records in the dataset that, after grouping, meet some criterion. For example, what if we wanted to find the records for all countries whose average annual salary was greater than $22K?

To do this, we can use the `filter` function of the `DataFrameGroupBy` object. The filter function must take a function as an argument (this is new and may seem weird). That function receives the rows of one group at a time and returns `True` to keep the group or `False` to drop it.  

```python
annualwages = minwages[minwages['Pay period'] == 'Annual']
annualwages.groupby(['Country']).filter(
    lambda x : x['value'].mean() > 22000
)
```
Try it:
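
The same pattern can be written with a named function instead of a `lambda`, which may make it easier to see what is passed in. This is a sketch on made-up data; `toy`, `mean_above_20`, and the threshold of 20 are assumptions chosen for the example:

```python
import pandas as pd

# Hypothetical data: country A averages 15, country B averages 40.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "value": [10.0, 20.0, 30.0, 50.0],
})

def mean_above_20(group):
    # Called once per country with that country's rows; returns True or False.
    return group["value"].mean() > 20

# filter() keeps the original rows of every group for which the function returns True.
kept = toy.groupby("Country").filter(mean_above_20)

print(kept)
```

Note that the result contains the original, ungrouped rows, not one summary row per group.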

### Task 4: Reset the index
If you do not want to use MultiIndexes and you prefer to return any MultiIndex dataset back to a traditional 1-level index dataframe, you can use the `reset_index` function. 

Try it out on the `mwmean` dataframe:

## Exercise 4:  
Load the iris dataset. 

In the Iris dataset:
+ Create a new column with the label "region" in the iris data frame. This column will indicate the geographic regions of the US where measurements were taken. Values should include:  'Southeast', 'Northeast', 'Midwest', 'Southwest', 'Northwest'. Assign these randomly.
+ Use `groupby` to get a new data frame of means for each species in each region.
+ Add a `dev_stage` column by randomly selecting from the values "early" and "late".
+ Use `groupby` to get a new data frame of means for each species, in each region and each development stage.
+ Use the `count` function (just like you used the `mean` function) to identify how many rows in the table belong to each combination of species + region + developmental stage.
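
The steps above can be sketched on a small stand-in frame. Everything here is an assumption made for illustration: the `toy` data, the `species` and `petal_length` column names, and the fixed seed. The real iris data will have more rows and columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the iris data: two species, one measurement.
toy = pd.DataFrame({
    "species": ["setosa"] * 4 + ["virginica"] * 4,
    "petal_length": [1.4, 1.3, 1.5, 1.4, 6.0, 5.1, 5.9, 5.6],
})

# A seeded generator makes the random assignment repeatable.
rng = np.random.default_rng(seed=0)
regions = ["Southeast", "Northeast", "Midwest", "Southwest", "Northwest"]

# choice() draws one label per row, with replacement.
toy["region"] = rng.choice(regions, size=len(toy))
toy["dev_stage"] = rng.choice(["early", "late"], size=len(toy))

# Mean measurement and row count for each combination that actually occurs.
keys = ["species", "region", "dev_stage"]
means = toy.groupby(keys).mean()
counts = toy.groupby(keys).count()

print(means)
print(counts)
```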

## Exercise 5: Kaggle Titanic Dataset
A dataset of Titanic passengers and their fates is provided by the online machine learning competition server [Kaggle](https://www.kaggle.com/). See the [Titanic project](https://www.kaggle.com/c/titanic) page for more details. 

Let's practice all we have learned thus far to explore and perhaps clean this dataset.  You have been provided with the dataset named `Titanic_train.csv`.  

### Task 1: Explore the data
First import the data and print the first 10 lines.

Find the shape of the data.

List the column names.

Identify the data types. Do they match what you would expect?

Identify columns with missing values. 

Identify if there are duplicate entries.

How many unique values per column are there?  Do these look reasonable for the data type and what you know about what is stored in the column?

### Task 2: Clean the data
Do missing values need to be removed? If so, remove them.

Do duplicates need to be removed?  If so, remove them.

### Task 3: Find Interesting Facts
Count the number of passengers that survived and died in each passenger class.
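
One possible shape for this count is a group size per class and outcome. The sketch assumes columns named `Pclass` and `Survived` (with 1 meaning survived); check the column names in `Titanic_train.csv` before reusing it, since the `toy` frame below is made up:

```python
import pandas as pd

# Hypothetical passenger records; column names are assumptions.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# size() counts the rows in each (class, outcome) group;
# unstack() pivots the outcome level into columns for easier reading.
outcome_counts = toy.groupby(["Pclass", "Survived"]).size().unstack(fill_value=0)

print(outcome_counts)
```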

Were men or women more likely to survive?

What were the average, min and max ticket prices per passenger class?
Hint:  look at the help page for the [agg](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html) function to help simplify this.

Give descriptive statistics about the survival age.