# Pandas P1 Continued

![more_pandas](https://media.giphy.com/media/KyBX9ektgXWve/giphy.gif)

In [1]:
# You will get very used to these imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


from src.student_list import student_first_names
from src.student_caller import three_random_students, one_random_student

%load_ext autoreload
%autoreload 2

Learning Goals:

1. Learn to interact with and manipulate dataframe columns
2. Learn to interact with and manipulate dataframe row indices
3. Identify and deal with N/A values
4. Visualize data using built-in dataframe methods, matplotlib (MPL), and seaborn

There are several well-worn datasets you will come to know: the iris dataset, the Boston housing dataset, the heart dataset. In this notebook, we will look at the Titanic dataset. As a teaching tool it is a bit macabre (predicting survival on the ill-fated ship), but it is still very useful.

![leo_titanic](https://media.giphy.com/media/XOY5y7YXjTD7q/giphy.gif)

In [None]:
# The data is in the csv file called titanic.csv
# create a dataframe object using it, and look at the head to start getting familiar with its structure

df = pd.read_csv('titanic.csv', index_col='PassengerId')

# 1. Learn to interact with and manipulate dataframe columns

Let's take a look at the head of the data frame and the shape, just to get a quick overview.

In [None]:
df.head()

In [None]:
df.shape

### Quick knowledge check
We always want to be aware of what a row represents. 

What does each row in the dataframe represent? 

In [None]:
# Type answer here

As with most things in code, there are several ways to view columns.

The first way is to look at the columns attribute of the dataframe.

In [None]:
# We are getting familiar with dataframe attributes: .shape and now .columns
df.columns

In [None]:
# We can confirm that the number of columns matches the second element of the shape tuple

len(df.columns) == df.shape[1]

A second way to see the columns is to pass the dataframe to the built-in list() function:

In [None]:
list(df)

Consider the situation where you want to rename a column in the dataframe. Let's say you are getting tired of remembering that SibSp refers to siblings and spouses. We can rename it like so:

In [None]:
df.rename({'SibSp':'siblings_and_spouses'}, axis=1) # Axis tells the rename method to look for SibSp along the columns axis

Great. Now print out the head of the df

In [None]:
df.head()

It looks like the change did not register: the column is still named SibSp.
By default, rename returns a new dataframe and leaves the original untouched.
To change the original object itself, we need to set the inplace parameter to True (or reassign the result to df).

In [None]:
df.rename({'SibSp':'siblings_and_spouses'}, axis=1, inplace=True)

In [None]:
df.head()
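
To make the difference concrete, here is a minimal sketch on a made-up two-row frame (the name `toy` and its values are assumptions for illustration, not part of the Titanic data). It shows that rename without inplace leaves the original alone, and that reassigning the result is an alternative to inplace=True.

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up)
toy = pd.DataFrame({'SibSp': [1, 0], 'Fare': [7.25, 71.28]})

# Without inplace=True, rename returns a new dataframe and leaves toy alone
renamed = toy.rename({'SibSp': 'siblings_and_spouses'}, axis=1)
original_columns = list(toy)
print(original_columns)  # ['SibSp', 'Fare']
print(list(renamed))     # ['siblings_and_spouses', 'Fare']

# Reassigning the result has the same effect as inplace=True
toy = toy.rename({'SibSp': 'siblings_and_spouses'}, axis=1)
print(list(toy))         # ['siblings_and_spouses', 'Fare']
```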

We can also change multiple columns at once with a dictionary:

In [None]:
df.rename(columns = {'Parch': 'parent_child_ratio', 'Pclass': 'ticket_class'}, inplace=True)

In [None]:
df.head()

We can also interact directly with the .columns attribute


In [None]:
df_columns = df.columns # saved for pair programming

df.columns = list('ABCDEFGHIJK')
# What will the columns of our dataframe look like now?

If we find a column is not useful, we can drop columns with the drop method.



In [None]:
df.drop('A', axis=1)
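
Like rename, drop returns a new dataframe unless told otherwise. The sketch below uses a made-up frame (`toy` is a hypothetical name) to show that the original keeps its columns, and that the columns= keyword is an alternative to axis=1 that also accepts a list.

```python
import pandas as pd

# Made-up frame for illustration only
toy = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# drop returns a new dataframe; toy still has all three columns
without_a = toy.drop('A', axis=1)

# columns= is an alternative to axis=1 and accepts a list of names
without_ab = toy.drop(columns=['A', 'B'])

print(list(toy))         # ['A', 'B', 'C']
print(list(without_a))   # ['B', 'C']
print(list(without_ab))  # ['C']
```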

# Pair Program 1:

Take 5 minutes with a partner to perform this activity.

We just renamed our columns to a useless series of letters. Luckily, we saved our column names in the variable df_columns. Let's restore our column names using the columns attribute. To make things neater, we want the column names to all be lowercase. You can perform this in any way you prefer, but a list comprehension can do it in one line.

Remember, list comprehensions look like this:
> [function(variable) for variable in iterable]
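
As a quick illustration of that pattern on something other than column names, here is a small sketch using a made-up list of strings (the list and variable names are assumptions for illustration):

```python
# Made-up strings with stray whitespace
words = ['  Iris ', 'Boston', ' Heart']

# Apply a string method to every element
cleaned = [word.strip() for word in words]

# Apply a built-in function to every element
lengths = [len(word) for word in cleaned]

print(cleaned)  # ['Iris', 'Boston', 'Heart']
print(lengths)  # [4, 6, 5]
```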

In [None]:
# your answer here

## 2. Learn to interact with and manipulate dataframe row indices


Row indices are an attribute of a dataframe just as columns are.

In [None]:
# Because we read the csv with index_col='PassengerId', this is an integer Index named PassengerId (not a RangeIndex); it can be iterated over
df.index

The index can be set in the same way as columns:

In [None]:
# Note the new index must be the same length as the dataframe
df.index = range(1000, 1891)
df.index
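
Assigning to .index only works when the number of new labels matches the number of rows. A minimal sketch on a made-up three-row frame (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up three-row frame
toy = pd.DataFrame({'Fare': [7.25, 71.28, 7.92]})

# Three rows, three labels: the assignment succeeds
toy.index = range(1000, 1003)
print(list(toy.index))  # [1000, 1001, 1002]

# Three rows, two labels: pandas raises a ValueError
try:
    toy.index = range(2)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)  # True
```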

We can also reset the index:

In [None]:
df.reset_index(inplace=True, drop=True)

In [None]:
df.head()
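
The drop=True argument passed to reset_index decides what happens to the old index. The sketch below, on a made-up frame (`toy` is a hypothetical name), compares the two behaviours: without drop=True the old labels are kept as a new column, and with it they are discarded.

```python
import pandas as pd

# Made-up frame with a custom, unnamed index
toy = pd.DataFrame({'Fare': [7.25, 71.28]}, index=[1000, 1001])

# Default: the old index is kept as a column (named 'index' when the index has no name)
kept = toy.reset_index()

# drop=True: the old index is thrown away
dropped = toy.reset_index(drop=True)

print(list(kept))           # ['index', 'Fare']
print(list(dropped))        # ['Fare']
print(list(dropped.index))  # [0, 1]
```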

### Round Robin
Isolate the indices of those passengers who survived.
Store each in the variables below.

Then, use numpy.random.choice on those indices to create a dataframe of a random subset of 30 surviving passengers
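
One detail worth knowing before you start: np.random.choice samples with replacement by default, so the same label can be drawn more than once. The sketch below uses a made-up five-row frame (`toy`, `score`, and the index values are assumptions for illustration) and passes replace=False to get distinct labels.

```python
import numpy as np
import pandas as pd

# Made-up frame with non-default index labels
toy = pd.DataFrame({'score': [10, 20, 30, 40, 50]},
                   index=[100, 101, 102, 103, 104])

# replace=False guarantees that no label is drawn twice
picked = np.random.choice(toy.index, size=3, replace=False)

# .loc selects rows by index label
subset = toy.loc[picked]
print(subset)
```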

In [None]:
three_random_students(student_first_names)

In [None]:
# Isolate passengers who survived in a variable named survived

In [None]:
three_random_students(student_first_names)

In [None]:
# Using the index of the above variable, use np.random.choice to choose 30 random indices

In [None]:
# Your code here

In [None]:
three_random_students(student_first_names)

In [None]:
# Use those indices to subset the dataframe

In [None]:
# Your answer here

## 3. Identify and deal with N/A values

NA (not available) values are a constant annoyance. They can mess up our code and our analysis. One of the first steps of EDA you will perform is checking whether your data has NA's.

Apropos of the event it describes, the Titanic dataset has many NA values.

We can see that in a few ways, first using info().

In [None]:
df.info()

## Knowledge Check: From the above info() output, which columns have na's? How can you tell?


Your answer here  


Another way to see na's is with the **isna()** method

In [None]:
df.isna()

More usefully, we can sum the values which are na:

In [None]:
df.isna().sum()
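
Counts are easier to judge next to the size of the data. Because isna() returns booleans, taking the mean instead of the sum gives the share of missing values per column. A minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up frame with missing values in two of three columns
toy = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, np.nan],
    'cabin': [None, None, None, 'C85'],
    'fare': [7.25, 71.28, 8.05, 53.10],
})

# True counts as 1, so sum() is the number of NA's per column
na_counts = toy.isna().sum()

# mean() of booleans is the fraction of rows that are NA
na_share = toy.isna().mean()

print(na_counts)
print(na_share)
```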

## Dealing with na's


One way to deal with na's is by dropping rows that have them:


In [None]:
df.dropna()

Let's explore what happened there. Since we didn't include inplace=True, we can run the same code with some additions to see the difference:

In [None]:
df.dropna().info()

# Knowledge check
How did dropna affect the dataframe?  Why did it remove so many rows?

In [None]:
one_random_student(student_first_names)

In [None]:
# your answer here

Calling dropna without parameters reduced our data significantly, which is a very bad thing. Our model performance, when we get to modeling, will heavily rely on having enough data.

Let's add a parameter to dropna:

In [None]:
list(df)

In [None]:
df.dropna(subset=['embarked'], inplace=True)

In [None]:
# Now there are only two columns with na values
df.info()
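
To see why the subset parameter matters, here is a sketch on a made-up three-row frame (`toy` and its values are assumptions for illustration). Every row has an NA somewhere, so a bare dropna removes all of them, while restricting the check to one column removes only the row missing that value.

```python
import numpy as np
import pandas as pd

# Made-up frame in which every row is missing something
toy = pd.DataFrame({
    'age': [22.0, np.nan, 35.0],
    'cabin': [None, 'C85', None],
    'embarked': ['S', 'C', None],
})

# No parameters: a row is dropped if ANY column is NA
all_complete = toy.dropna()

# subset: only NA's in the listed columns cause a row to be dropped
embarked_known = toy.dropna(subset=['embarked'])

print(len(all_complete))    # 0
print(len(embarked_known))  # 2
```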

You will find that data preprocessing presents you with many paths to follow.  You have many choices you can make as to how to preprocess. 

For now let's make the choice to drop cabin, since it has so many nulls:

In [None]:
df.drop('cabin', axis=1, inplace=True)

With age, let's be a bit more creative, and impute the mean. This is a common method.

##  Short Exercise: Turn off your camera and take 3 minutes:

Using the fillna() method, write code below to fill the na's in age with the mean of age.

In [None]:
# Your code here

In [None]:
# Run df.info() to check that you have no more na's.
df.info()

# 4. Exercises: visualize data using built in dataframe methods, plt, and sns

Dataframes have some built in methods for visualization, which you can call directly from the dataframe.    

Note: By call, we mean call a function.  For example, on a string, "TEXT", I call the .lower() function on it like so:

```python
"TEXT".lower()
```


## Hist

For example, a very useful one is hist(), 
which will display histograms of each numeric field.  

#### Exercise 1
Call hist() on the dataframe object to plot a grid of histograms of all numeric fields in the Titanic dataset.
Add figsize=() to make the figure bigger


In [None]:
# your code here

#### Exercise 2
Using seaborn, plot two histograms, one on top of the other, of the fares of people who perished and people who survived.

In [None]:
# Your code here

In [None]:
# Analyse the plot above

#### Exercise 3 
Use matplotlib to plot a bar chart of the mean age of the following four categories: survived/male, survived/female, perished/male, perished/female

In [None]:
# Your code here

# Boxplot

Another very useful method is boxplot.  One use of boxplot is to quickly see whether there are outliers.

In [None]:
df.boxplot()
plt.xticks(rotation=45)
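
By default, matplotlib draws boxplot whiskers out to the furthest points within 1.5 times the interquartile range (IQR) of the box, and plots anything beyond that as an individual outlier point. The same rule can be checked numerically. This is a minimal sketch on a made-up series of fares (`fares` and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up fares with one extreme value
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33])

# The box spans the first to the third quartile
q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = fares[(fares < lower) | (fares > upper)]

print(outliers.tolist())  # [512.33]
```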

#### Exercise 4 
That is a bit small. We also may want to plot histograms alongside the boxplots.
Let's use seaborn to plot individual boxplots of age and fare above histograms of age and fare in a 2 x 2 grid using fig, ax notation.

In [None]:
# Your code here

#### Exercise 5

Seaborn's pairplot function is very useful for initial EDA.

Run a pairplot on the entire titanic dataframe.


In [None]:
# Your code here

Do you see any correlations between variables in the pairplot?

#### Exercise 6
Sometimes it's easier to see correlations numerically.
We can use df.corr() to display those numbers.
A correlation near 1 means a strong positive correlation; a correlation near -1 means a strong negative correlation; a correlation near 0 means little linear relationship.

In [None]:
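# numeric_only=True (available in pandas 1.5+) skips the text columns; pandas 2.0+ raises an error on them otherwise
df.corr(numeric_only=True)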
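
To build some intuition for the values at either extreme, here is a sketch on a made-up frame (`toy` and its column names are assumptions for illustration) where one column rises in lockstep with another and a third falls in lockstep.

```python
import pandas as pd

# Made-up columns with perfectly linear relationships
toy = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'doubled': [2, 4, 6, 8, 10],   # rises exactly as x rises
    'reversed': [5, 4, 3, 2, 1],   # falls exactly as x rises
})

corr = toy.corr()

print(round(corr.loc['x', 'doubled'], 6))   # 1.0
print(round(corr.loc['x', 'reversed'], 6))  # -1.0
```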

Now, pass the correlation matrix returned by df.corr() as the data argument to seaborn's heatmap function to produce another way of visualizing correlation.




In [None]:
# your answer here