# Exercise 1: Exploring the Titanic Dataset 

We'll practice our data exploration and visualization skills using the [Titanic dataset](https://www.kaggle.com/c/titanic/data). 

### Importing Dependencies 

To explore our data, we'll need `pandas`, `seaborn`, and `matplotlib`.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### Loading the Data

We'll load our Titanic dataset, which is stored in the cloud.

In [None]:
data = pd.read_csv("https://s3.us-east-2.amazonaws.com/explore.datasets/rbi/titanic_train.csv")

Let's start by looking at the size of our dataset. How many columns and rows are we dealing with?

In [None]:
data.___

Now, let's sample 4 rows from our dataset.

In [None]:
data.___(n=__)
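As a quick illustration of inspecting a dataframe's size and drawing random rows, here is a minimal sketch on a toy dataframe. The `toy` frame and its values are made up for illustration and are not taken from the Titanic data.

```python
import pandas as pd

# Toy data with made-up values (an assumption for illustration only).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0],
})

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = toy.shape

# sample draws random rows; random_state makes the draw repeatable.
subset = toy.sample(n=2, random_state=0)

print(n_rows, n_cols)
print(subset)
```

Without `random_state`, each call to `sample` returns a different random selection, which is usually what you want when browsing a dataset.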

What's the average age of passengers? What about the youngest and oldest?

In [None]:
mean_age = data['Age'].___
min_age = data['Age'].____
max_age = data['Age'].____
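If a numeric column contains missing values, pandas skips them by default when computing summary statistics. The sketch below shows this on a made-up `ages` series (hypothetical values, not from the dataset).

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value (an assumption for illustration).
ages = pd.Series([22.0, 38.0, np.nan, 4.0, 60.0], name="Age")

mean_age = ages.mean()          # NaN is skipped: (22 + 38 + 4 + 60) / 4
min_age = ages.min()
max_age = ages.max()
n_missing = ages.isna().sum()   # how many values were skipped

print(mean_age, min_age, max_age, n_missing)
```

Checking `isna().sum()` alongside the statistics is a cheap way to see how many rows the averages are actually based on.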

What is the passenger class distribution in our dataset? Class is represented by `Pclass`. A value of 1 represents first class, 2 is second class, and 3 is third class. 
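One possible way to look at a categorical distribution is `value_counts`, sketched here on a toy `Pclass` column with made-up values. Treat it as one option under those assumptions rather than the intended answer.

```python
import pandas as pd

# Made-up class labels (an assumption for illustration only).
toy = pd.DataFrame({"Pclass": [3, 1, 3, 2, 3, 1]})

# Number of rows per class, ordered by class label.
class_counts = toy["Pclass"].value_counts().sort_index()

# Same breakdown as fractions of the total.
class_shares = toy["Pclass"].value_counts(normalize=True).sort_index()

print(class_counts)
print(class_shares)
```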

Let's look at ticket fares now. We can plot the distribution of fares using Seaborn's `distplot` (deprecated since seaborn 0.11 in favor of `histplot` and `displot`).

What's the mean ticket fare? 

In [None]:
mean_fare = data['Fare'].____

print(f"The mean fare is {mean_fare} for passengers on the Titanic")

Are tickets more expensive for passengers in first class (`Pclass == 1`) than for those in third class (`Pclass == 3`)? We can check by grouping the data by `Pclass` using pandas' [groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) method.

In [None]:
data.groupby('___')['Fare'].mean()
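To make the split-then-aggregate pattern concrete, here is a minimal sketch on a toy dataframe. The fares are invented for illustration and say nothing about the real prices.

```python
import pandas as pd

# Made-up fares per class (an assumption for illustration only).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 10.0],
})

# Split rows by Pclass, then average Fare within each group.
mean_fare_by_class = toy.groupby("Pclass")["Fare"].mean()

print(mean_fare_by_class)
```

The result is a Series indexed by the grouping column, so a single group's value can be read with `mean_fare_by_class.loc[1]`.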

We can also plot this using Seaborn's `barplot`.

In [None]:
sns.barplot(x='Pclass', y='____', data=data)

Now, let's take a look at gender. What's the ratio of males to females in our dataset?

In [None]:
data['Sex'].___

Are there more females in a particular passenger class? We can group by two columns, `Pclass` and `Sex`, and count the `PassengerId` values that fall under each combination.

In [None]:
data.groupby(['____', '____'])['PassengerId'].count()
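Grouping by two columns returns a Series with a two-level index, which can be awkward to read. The sketch below, on made-up rows, shows how `unstack` might be used to pivot the inner level into columns.

```python
import pandas as pd

# Made-up passengers (an assumption for illustration only).
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Sex": ["female", "male", "female", "male", "male", "female"],
})

# One count per (Pclass, Sex) combination that occurs in the data.
counts = toy.groupby(["Pclass", "Sex"])["PassengerId"].count()

# Pivot Sex into columns; combinations that never occur become 0.
table = counts.unstack(fill_value=0)

print(table)
```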

Are ticket prices higher for males or females? Let's look at the mean ticket price for each gender.

In [None]:
data.groupby('___')['____'].mean()

Were tickets more expensive for females than males in each passenger class? We can plot this using Seaborn's barplot. We can assign `Sex` to the `hue` parameter.

In [None]:
sns.barplot(x='Pclass', y='Fare', hue='____', data=data)

Were females more likely to survive the Titanic than males? We can look at the `Survived` column to figure this out. 

In [None]:
gender_survival = data.groupby('Sex')['_____'].agg(['sum', 'count']).reset_index()
gender_survival

With our `gender_survival` dataframe, we can get the proportion of females and males who survived by dividing the `sum` by the `count`.

In [None]:
gender_survival['proportion'] = gender_survival['___']/gender_survival['___']
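Because `Survived` is coded as 0 or 1, dividing the group sum by the group count gives the same number as the group mean. The sketch below checks this on a toy dataframe with invented outcomes.

```python
import pandas as pd

# Made-up outcomes (an assumption for illustration only).
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# sum = number of survivors, count = number of passengers per group.
summary = toy.groupby("Sex")["Survived"].agg(["sum", "count"]).reset_index()
summary["proportion"] = summary["sum"] / summary["count"]

# For a 0/1 column, the mean is the same proportion in one step.
direct = toy.groupby("Sex")["Survived"].mean()

print(summary)
print(direct)
```

Keeping `sum` and `count` as separate columns is still useful, since a proportion based on very few passengers deserves less trust.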

We can plot this using Seaborn's barplot. 

In [None]:
sns.barplot(x='_____', y='Survived', data=data)
plt.title("Survival rates of Males vs. Females on the Titanic")

We have two columns that indicate whether the passenger had any family members who were also aboard the Titanic. These columns are:

- `Parch`: number of parents/children aboard the Titanic
- `SibSp`: number of siblings / spouses aboard the Titanic

Let's create a new column called `num_family_members` by getting the sum of these columns.

In [None]:
data['num_family_members'] = data['___'] + data['___']

Let's plot the distribution of family member counts using Seaborn's [countplot](https://seaborn.pydata.org/generated/seaborn.countplot.html).

In [None]:
sns._______(x='num_family_members', data=data)

It looks like there's a large proportion of passengers who came aboard without any family members. Let's create a new boolean column that indicates whether a passenger came alone or had at least one other family member aboard. 

In [None]:
data['alone'] = (data['_____'] == 0)
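Both derived columns come from element-wise operations on existing columns. Here is a minimal sketch on a toy dataframe; the `Parch` and `SibSp` values are made up for illustration.

```python
import pandas as pd

# Made-up family counts (an assumption for illustration only).
toy = pd.DataFrame({
    "Parch": [0, 2, 0, 1],
    "SibSp": [0, 1, 1, 0],
})

# Adding two columns sums them row by row.
toy["num_family_members"] = toy["Parch"] + toy["SibSp"]

# Comparing a column to a scalar yields a boolean column.
toy["alone"] = toy["num_family_members"] == 0

print(toy)
```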

Is there any trend in passengers coming alone and staying in a certain passenger class (i.e., first vs. third class)?

In [None]:
# the mean of a boolean column is the share of True values in each group
data.groupby(['_____'])['alone'].mean()
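When summarizing a boolean column per group, the choice of aggregation matters: `count` returns the number of non-missing rows regardless of their value, while `sum` counts the `True` values and `mean` gives their share. The sketch below compares the three on made-up data.

```python
import pandas as pd

# Made-up flags (an assumption for illustration only).
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3, 3],
    "alone": [True, False, True, True, True, False],
})

# count: rows per class; sum: solo passengers; mean: share of solo passengers.
summary = toy.groupby("Pclass")["alone"].agg(["count", "sum", "mean"])

print(summary)
```

Comparing shares rather than raw counts avoids mistaking a large class for a class with many solo travelers.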

Are passengers who came alone more likely to survive the Titanic?

In [None]:
solo_passenger_survival = data.groupby('___')['Survived'].agg(['sum', 'count']).reset_index()

With our `solo_passenger_survival` dataframe, we can get the proportion of solo vs. non-solo passengers who survived by dividing the `sum` by the `count`. Note: We did this earlier with `gender_survival`.

In [None]:
solo_passenger_survival['proportion'] = solo_passenger_survival['___']/solo_passenger_survival['___']