# Lesson 4 Class Exercises: Pandas Part 2

With these class exercises we learn a few new things.  When new knowledge is introduced you'll see the icon shown on the right: 
<span style="float:right; margin-left:10px; clear:both;">![Task](../media/new_knowledge.png)</span>
## Get Started
Import the NumPy and pandas packages.

## Exercise 1: Review of Pandas Part 1
### Task 1: Explore the data
Import the data from the [Lectures in Quantitative Economics](https://github.com/QuantEcon/lecture-source-py) regarding minimum wages in countries around the world in US dollars.  You can view the data [here](https://github.com/QuantEcon/lecture-source-py/blob/master/source/_static/lecture_specific/pandas_panel/realwage.csv) and you can access the data file here: https://raw.githubusercontent.com/QuantEcon/lecture-source-py/master/source/_static/lecture_specific/pandas_panel/realwage.csv.  Then perform the following tasks.

Import the data into a variable named `minwages` and print the first 5 lines of data to explore what is there.

Find the shape of the data.

List the column names.

Identify the data types. Do they match what you would expect?

Identify columns with missing values. 

Identify if there are duplicated entries.

How many unique values are there per column?  Do these look reasonable for the data type and what you know about what is stored in the column?

### Task 2: Explore More

Retrieve descriptive statistics for the data.

Identify all of the countries listed in the data.

Convert the time column to a datetime object.

List the time points that were used for data collection. How many years of data collection were there? What time of year were the data collected?

Because we only have one data point collected per year per country, simplify this by adding a new column with just the year.  Print the first 5 rows to confirm the column was added.

There are two pay periods.  Retrieve them as a list containing just the two strings.

### Task 3: Clean the data
We have no duplicates in this data so we do not need to consider removing those, but we do have missing values in the `value` column. Let's remove those.  Check the dimensions afterwards to make sure the rows with missing values are gone.

Remove the "Unnamed: 0" column as it's not needed.

### Task 4:  Indexing
Use boolean indexing to retrieve the rows of annual salary in the United States.

Do we have enough data to calculate descriptive statistics for annual salary in the United States in 2016?

Use `loc` to calculate descriptive statistics for the hourly salary in the United States and then again separately for Ireland. Hint: you will have to set row indexes.  Hint: you should reset the index before using `loc`.
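
One possible reading of the first hint, sketched on made-up data rather than the wage data: move the label columns into the row index with `set_index`, then select rows by label with `loc`. The `toy` frame and its values are invented for illustration, and sorting the index is an extra step added here to keep label lookups well behaved.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the wage data.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Pay period": ["Hourly", "Hourly", "Annual", "Hourly", "Hourly", "Annual"],
    "value": [10.0, 12.0, 20000.0, 8.0, 9.0, 15000.0],
})

# Move the two label columns into the row index, then sort the index.
indexed = toy.set_index(["Country", "Pay period"]).sort_index()

# Select every row labelled ("A", "Hourly") and summarize the value column.
hourly_a = indexed.loc[("A", "Hourly"), "value"]
print(hourly_a.describe())
```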

Now do the same for annual salary.

## Exercise 2: Occurrences
First, reset the indexes back to numeric values. Print the first 10 lines to confirm.

Get the count of how many rows there are per year.

## Exercise 3: Grouping
### Task 1: Aggregation
Calculate the average salary for each country across all years.

Calculate the average salary and hourly wage for each country across all years. Save the resulting dataframe containing the means into a new variable named `mwmean`.

<span style="float:right; margin-left:10px; clear:both;">![Task](../media/new_knowledge.png)</span>

Above we saw how to aggregate using built-in functions of the `DataFrameGroupBy` object. For example, we called the `mean` function directly. These handy functions help with writing succinct code. However, you can also use the `aggregate` function to do more! You can learn more on the [aggregate description page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.aggregate.html).
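
As a rough sketch of what "more" can look like, the snippet below passes a list of function names to `aggregate` on a small made-up table. The `toy` frame and its `shop`, `item`, and `price` columns are invented for illustration and are not part of the wage data.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "shop": ["north", "north", "south", "south", "south"],
    "item": ["tea", "tea", "tea", "coffee", "coffee"],
    "price": [2.0, 4.0, 3.0, 5.0, 7.0],
})

grouped = toy.groupby(["shop", "item"])

# A list of function names produces one output column per function
# for every remaining data column (here only "price").
summary = grouped.aggregate(["mean", "min", "max", "count"])
print(summary)
```

The result has two levels of column labels: the original column name on top and the function name underneath.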

With `aggregate` we can perform operations across rows and columns, and we can perform more than one operation at a time.  Explore the online documentation for the function and see how you would calculate the mean, min, and max for each country and pay period type, as well as the total number of records per country and pay period:


You can also use `aggregate` on a single column of the grouped object. For example:

```python
mwgroup = minwages[['Country', 'Pay period', 'value']].groupby(['Country', 'Pay period'])
mwgroup['value'].aggregate(['mean'])
```
Redo the aggregate function in the previous cell but this time apply it to a single column.

### Task 2: Slicing/Indexing
<span style="float:right; margin-left:10px; clear:both;">![Task](../media/new_knowledge.png)</span>

In the following code the resulting dataframe should contain only one data column: the mean values. It does, however, have two levels of indexes: Country and Pay period.  For example:

```python
mwgroup = minwages[['Country', 'Pay period', 'value']].groupby(['Country', 'Pay period'])
mwmean = mwgroup.mean()
mwmean
```

Try it out:

Notice in the output above there are two levels of indexes. This is called MultiIndexing.  In reality, there is only one data column and two index levels.  So, you can do this:

```python
mwmean['value']
```

But you can't do this:

```python
mwmean['Pay period']
```

Why not? Try it:


The reason we cannot execute `mwmean['Pay period']` is that `Pay period` is not a data column. It's an index level.  Let's learn how to use MultiIndexes to retrieve data. You can learn more about it on the [MultiIndex/advanced indexing page](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-indexing-with-hierarchical-index).
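
To make the distinction between data columns and index levels concrete, here is a minimal sketch on made-up data. The `toy` frame only mimics the shape of the wage data; its values are invented.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Pay period": ["Annual", "Hourly", "Annual", "Hourly"],
    "value": [20000.0, 10.0, 15000.0, 8.0],
})
toy_mean = toy.groupby(["Country", "Pay period"]).mean()

print(list(toy_mean.columns))      # names of the data columns
print(list(toy_mean.index.names))  # names of the index levels

# Index levels can still be read, just not with column syntax.
pay_periods = toy_mean.index.get_level_values("Pay period")
print(list(pay_periods))
```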

First, let's take a look at the indexes using the `index` attribute. 

```python
mwmean.index
```

Try it:

Notice that each index entry is actually a tuple with two levels. The first is the country name and the second is the pay period. Remember, we can use the `loc` function to slice a dataframe using indexes.  We can do so with a MultiIndexed dataframe as well. For example, to extract all elements with the index named 'Australia':

```python
mwmean.loc[('Australia')]
```

Try it yourself:

You can specify both indexes to pull out a single row. For example, to find the average hourly salary in Australia:

```python
mwmean.loc[('Australia','Hourly')]
```
Try it yourself:

Suppose you wanted to retrieve all of the mean "Hourly" wages. There are multiple ways to slice a MultiIndex, and some are not entirely intuitive or flexible enough.  Perhaps the easiest is to use the `pd.IndexSlice` object.  It allows you to specify an index format that is intuitive to the way you've already learned to slice.  For example:

```python
idx = pd.IndexSlice
mwmean.loc[idx[:,'Hourly'],:]
```

In the code above the `idx[:, 'Hourly']` portion is used in the "row" indexer position of the `loc` function. It indicates that we want all possible first-level indexes (specified with the `:`) and we want second-level indexes to be restricted to "Hourly".

Try it out yourself:
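
If the slicing syntax is hard to picture, the following sketch on a small made-up frame may help. It compares `pd.IndexSlice` with the `xs` method, which is one of the other ways to take the same cross-section. The `toy` data is invented for illustration.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Pay period": ["Annual", "Hourly", "Annual", "Hourly"],
    "value": [20000.0, 10.0, 15000.0, 8.0],
})
toy_mean = toy.groupby(["Country", "Pay period"]).mean()

idx = pd.IndexSlice
# Keep every first-level label, restrict the second level to "Hourly".
by_slice = toy_mean.loc[idx[:, "Hourly"], :]

# xs takes a cross-section of one level; drop_level=False keeps both
# index levels so the row labels match the slice above.
by_xs = toy_mean.xs("Hourly", level="Pay period", drop_level=False)

print(by_slice)
print(by_xs)
```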

Using what you've learned above about slicing the MultiIndexed dataframe, find out which country has had the highest average annual salary.

You can move the indexes into the dataframe and reset the index to a traditional single-level numeric index by resetting the indexes:
```python
mwmean.reset_index()
```

Try it yourself:

### Task 3: Filtering the original data.
<span style="float:right; margin-left:10px; clear:both;">![Task](../media/new_knowledge.png)</span>

Another way we might want to filter is to find records in the dataset that, after grouping, meet some criteria. For example, what if we wanted to find the records for all countries whose average annual salary was greater than $22K?

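To do this, we can use the `filter` function of the `DataFrameGroupBy` object. The `filter` function must take a function as an argument (this is new and may seem weird). That function receives each group as a dataframe and returns `True` to keep the group's rows or `False` to drop them.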

```python
annualwages = minwages[minwages['Pay period'] == 'Annual']
annualwages.groupby(['Country']).filter(
    lambda x : x['value'].mean() > 22000
)
```
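
A `lambda` is just a compact way to write a small function. As a minimal sketch on made-up data, the snippet below performs the same kind of filtering with an ordinary named function; the `toy` frame and the `mean_score_above_five` helper are hypothetical names used only for illustration.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "team": ["x", "x", "y", "y", "z", "z"],
    "score": [1.0, 3.0, 10.0, 20.0, 5.0, 7.0],
})

def mean_score_above_five(group):
    """Return True when the group's mean score exceeds 5."""
    return group["score"].mean() > 5

# Rows of every group for which the function returns True are kept.
kept = toy.groupby("team").filter(mean_score_above_five)

# The equivalent call written with a lambda.
same = toy.groupby("team").filter(lambda g: g["score"].mean() > 5)
print(kept)
```

Note that the result contains the original rows of the surviving groups, not one aggregated row per group.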
Try it:

### Task 4: Reset the index
If you do not want to use MultiIndexes and you prefer to return any MultiIndex dataset back to a traditional 1-level index dataframe, you can use the `reset_index` function.

Try it out on the `mwmean` dataframe:

## Exercise 4:  Task 6d from the practice notebook
Load the iris dataset. 

In the Iris dataset:
+ Create a new column with the label "region" in the iris data frame. This column will indicate geographic regions of the US where measurements were taken. Values should include:  'Southeast', 'Northeast', 'Midwest', 'Southwest', 'Northwest'. Assign these values randomly.
+ Use `groupby` to get a new data frame of means for each species in each region.
+ Add a `dev_stage` column by randomly selecting from the values "early" and "late".
+ Use `groupby` to get a new data frame of means for each species, in each region and each development stage.
+ Use the `count` function (just like you used the `mean` function) to identify how many rows in the table belong to each combination of species + region + developmental stage.
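
One way to assign labels at random, sketched on a stand-in table rather than the iris data, is NumPy's random generator. The `toy` frame, the `zone` column, and the two labels are made up for illustration, and the seed is only there to make the draw repeatable.

```python
import numpy as np
import pandas as pd

# Stand-in table; the real exercise uses the iris data instead.
toy = pd.DataFrame({
    "species": ["a", "a", "b", "b", "c", "c"],
    "length": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# A seeded generator makes the random draw repeatable between runs.
rng = np.random.default_rng(seed=0)
labels = ["north", "south"]

# Draw one label per row, with replacement, and store it as a new column.
toy["zone"] = rng.choice(labels, size=len(toy))

# Group on both the original and the new column.
means = toy.groupby(["species", "zone"])["length"].mean()
print(means)
```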

## Exercise 5: Kaggle Titanic Dataset
A dataset of Titanic passengers and their fates is provided by the online machine learning competition server [Kaggle](https://www.kaggle.com/). See the [Titanic project](https://www.kaggle.com/c/titanic) page for more details. 

Let's practice all we have learned thus far to explore and perhaps clean this dataset.  You have been provided with the dataset named `Titanic_train.csv`.  

### Task 1: Explore the data
First import the data and print the first 10 lines.

Find the shape of the data.

List the column names.

Identify the data types. Do they match what you would expect?

Identify columns with missing values. 

Identify if there are duplicated entries.

How many unique values are there per column?  Do these look reasonable for the data type and what you know about what is stored in the column?

### Task 2: Clean the data
Do missing values need to be removed? If so, remove them.

Do duplicates need to be removed?  If so remove them.

### Task 3: Find Interesting Facts
Count the number of passengers that survived and died in each passenger class.

Were men or women more likely to survive?

What were the average, min, and max ticket prices per passenger class?
Hint:  look at the help page for the [agg](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html) function to help simplify this.

Give descriptive statistics about the age of the passengers who survived.