# Case Study - Titanic


## Table of Contents


[**Step 1: Business Understanding**](#Step-1:-Business-Understanding)

[**Step 2: Data Understanding**](#Step-2:-Data-Understanding)

- [**Load Data**](#Load-Data)
- [**Check Data Quality**](#Check-Data-Quality)
- [**Exploratory Data Analysis - EDA**](#Exploratory-Data-Analysis---EDA)
 
[**Step 3: Data Preparation**](#Step-3:-Data-Preparation)
- [**Deal with Missing Data**](#Deal-with-Missing-Data)
- [**Feature Engineering**](#Feature-Engineering)

[**Step 4: Modeling**](#Step-4:-Modeling)

[**Step 5: Submit**](#Step-5:-Submit)



[Back to Top](#Table-of-Contents)

## Step 1: Business Understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives.
#### Titanic Story
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class passengers.

#### Objective
In this challenge, we will complete the analysis of what sorts of people were likely to survive. 

In addition, we will build a regression model to predict ticket price (Fare).



[Back to Top](#Table-of-Contents)

## Step 2: Data Understanding
The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information. This step is often mixed with the next step, Data Preparation.

### Data Dictionary
The data is in the CSV file titanic.csv.

| Variable | Definition | Key |
| --- | --- | --- |
| Survived | Survival | 0 = No, 1 = Yes |
| Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| Sex | Sex | male/female |
| Age | Age | in years |
| SibSp | # of siblings / spouses aboard the Titanic | |
| Parch | # of parents / children aboard the Titanic | |
| Ticket | Ticket number | |
| Fare | Passenger fare | |
| Cabin | Cabin number | |
| Embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

**Variable Notes**
- Pclass: A proxy for socio-economic status (SES)
    - 1st = Upper
    - 2nd = Middle
    - 3rd = Lower

- Age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.

- SibSp: The dataset defines family relations in this way:
    - Sibling = brother, sister, stepbrother, stepsister
    - Spouse = husband, wife (mistresses and fiancés were ignored)

- Parch: The dataset defines family relations in this way:
    - Parent = mother, father
    - Child = daughter, son, stepdaughter, stepson
    - Some children travelled only with a nanny, therefore Parch=0 for them.



### Load Data

This dataset is in titanic.csv. Make sure the file is in the current folder.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [None]:
df_titanic = pd.read_csv('titanic.csv')
df_titanic.head()

### Check Data Quality
The most common data quality check is for missing values. We can also do some basic data cleaning, such as tidying up currency fields.
- Check null values
- The currency field needs to be converted to a float type, and should be purged of non-numeric characters like currency symbols ('$'), commas (',') and parentheses ('()') since parentheses are sometimes used to indicate negative values.
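
As a minimal sketch of the currency clean-up described above, the snippet below works on a small made-up Series rather than on titanic.csv. The sample strings and the variable names `raw`, `is_negative` and `cleaned` are assumptions for illustration only.

```python
import pandas as pd

# Made-up currency strings for illustration only; not taken from titanic.csv
raw = pd.Series(['$7.25', '$1,071.28', '($5.00)'])

# Parentheses conventionally mark a negative amount, so record that first
is_negative = raw.str.startswith('(')

# Remove every character that is not a digit or a decimal point,
# then convert the remaining text to float
cleaned = raw.str.replace(r'[^0-9.]', '', regex=True).astype(float)

# Restore the sign for the values that were written in parentheses
cleaned = cleaned.where(~is_negative, -cleaned)

print(cleaned.tolist())
```

If a column only ever contains a leading "$", a simpler `str.replace('$', '', regex=False)` followed by `astype(float)` would likely be enough.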


##### Task1: Check out Basic Dataframe Info

Hint: info() function.

Discuss missing values in the dataframe.

##### Task2: Clean up Fare, Convert to Float
Strip "$" from Fare and convert the column's datatype to float.

##### Task3: Check out the Statistics for the Numeric Columns

Hint: describe() function.

Discuss:
- Age, SibSp, Parch, Fare statistics
- What does the mean value for Survived mean?

### Exploratory Data Analysis - EDA
EDA is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

#### Types Of Features
##### Categorical Features:
A categorical variable is one that has two or more categories and each value in that feature can be categorised by them. For example, gender is a categorical variable having two categories (male and female). If we cannot sort or give any ordering to such variables then they are also known as Nominal Variables.

Categorical Features in the dataset: Sex, Embarked.

##### Continuous Features:
A feature is said to be continuous if it can take any value between two points, that is, between the minimum and maximum values in the feature's column.

Continuous Features in the dataset: Fare

### Categorical Features
We will perform a univariate analysis of Survived. Then we will analyze the relationship between Sex and Survived, and Embarked and Survived.

#### How Many Survived?
We can answer this question by creating a bar chart of the Survived column. There are multiple ways to create the bar chart. We will demonstrate two ways here: seaborn countplot, and pandas series bar.

##### Task4: Plot a Bar Chart for Perished vs. Survived
Plot a bar chart for the Survived column. Survived=0 means the passenger perished, Survived=1 means the passenger survived.
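
The two approaches mentioned above can be sketched side by side as follows. This is only an illustration on a made-up five-row DataFrame called `toy` (an assumption, not the real data); with the real dataset you would pass `df_titanic` instead.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up outcomes for illustration only; the real values come from titanic.csv
toy = pd.DataFrame({'Survived': [0, 1, 0, 0, 1]})

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Way 1: seaborn counts the rows in each category and draws the bars
sns.countplot(x='Survived', data=toy, ax=ax_left)
ax_left.set_title('seaborn countplot')

# Way 2: count with pandas first, then draw the counts as bars
counts = toy['Survived'].value_counts().sort_index()
counts.plot.bar(ax=ax_right, title='pandas Series bar')
ax_right.set_xticklabels(['Perished', 'Survived'], rotation=0)
```

The pandas route makes the counts available as numbers before plotting, which can be handy when you also want to report them in text.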

#### Relationship Between Sex and Survived


##### Task5: Plot a Bar Chart for the Number of Male and Female Passengers
Hint: Use seaborn countplot.

##### Task6: Group by Sex to Find the Survival Rate of Male and Female Passengers
Calculate the average of the Survived column for Male and Female.
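
Because Survived is coded as 0 and 1, its mean within a group is the share of that group that survived. A minimal sketch on made-up rows (the DataFrame `toy` and its values are assumptions for illustration, not results from the real data):

```python
import pandas as pd

# Made-up rows for illustration only; not the real Titanic passengers
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'female', 'male'],
    'Survived': [0, 1, 0, 1, 1],
})

# The mean of a 0/1 column is the proportion of 1s, i.e. the survival rate
rate_by_sex = toy.groupby('Sex')['Survived'].mean()
print(rate_by_sex)
```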

##### Task7: Plot Perished vs. Survived Bar for Male and Female
We will use seaborn countplot() again, but set the argument `hue` to 'Survived'.

The number of men on the ship is a lot larger than the number of women. Still, the number of females that survived is almost twice the number of males that survived. Thus, the majority of women survived while the vast majority of men perished.

#### Pclass and Survival
##### Task 8: List the Survival Rate for Each Pclass 
Hint: The average of Survived is the survival rate.

##### Task 9: Plot Perished vs. Survived for Each Pclass
Hint: Use seaborn countplot, set `hue` to Survived.

### Continuous Features


#### Univariate Distribution Plot
There are multiple ways to create histograms. I will demonstrate three:
- ax.hist(): cannot handle NaN values
- seaborn.distplot(): cannot handle NaN values. Draws a KDE (kernel density estimate) by default. Deprecated since seaborn 0.11.
- pd.Series.hist(): the simplest option; handles NaN values by default

##### Task 10: Plot a Histogram for Age
Use the pandas Series hist() function which handles missing values.

##### Task11: Stack Age Histogram of Survived On Top of Overall Age Histogram
Plot a histogram for Age, then filter out the survived passengers and plot a histogram for Age on the same axis. Set different colors and labels for the two histograms.
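
One possible way to overlay the two histograms is sketched below. The DataFrame `toy`, its ages, the bin edges and the colors are all made up for illustration; sharing the same `bins` in both calls keeps the bars aligned.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up ages and outcomes for illustration only (note the missing age)
toy = pd.DataFrame({
    'Age': [2, 4, 22, 25, np.nan, 30, 35, 40, 58, 63],
    'Survived': [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
})

fig, ax = plt.subplots()
bins = list(range(0, 71, 10))  # shared bin edges: 0, 10, ..., 70

# Overall age distribution; Series.hist() drops the NaN value itself
toy['Age'].hist(ax=ax, bins=bins, color='lightgray', label='All passengers')

# Ages of the survivors only, drawn on the same axis
survived_age = toy.loc[toy['Survived'] == 1, 'Age']
survived_age.hist(ax=ax, bins=bins, color='seagreen', label='Survived')

ax.set_xlabel('Age')
ax.set_ylabel('Number of passengers')
ax.legend()
```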

Children have a higher survival rate.

[Back to Top](#Table-of-Contents)

## Step 3: Data Preparation
Create new features through feature engineering. Deal with missing values. Clean up data by stripping extra white spaces from string values. We will focus on dealing with missing data in this phase.

In [None]:
# Count the missing values in each column
df_titanic.isnull().sum()

### Deal with Missing Data
We will demonstrate filling in missing values with mean/mode, as well as with estimates from other columns.

#### Fill with Mean/Mode
Embarked has only two observations with missing values. Because there is no obvious way to estimate them, we will simply fill them in with the modal value of the column, 'S'.

##### Task12: Fill in Missing Values of the Embarked Column with the Modal Value
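
A minimal sketch of filling with the mode, shown on a made-up column rather than the real data (the DataFrame `toy` and the name `most_common` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up ports for illustration only; two values are missing
toy = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'S', 'Q', np.nan]})

# mode() returns a Series because ties are possible; take the first entry
most_common = toy['Embarked'].mode()[0]

# Replace the missing values with the most frequent port
toy['Embarked'] = toy['Embarked'].fillna(most_common)
print(toy['Embarked'].tolist())
```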

#### Fill with Estimated Value

A title is a word used with a person's name in certain contexts. It may signify veneration, an official position, or a professional or academic qualification. It is also a good indication of age. For example, "Mr" is for adult men while "Master" is for young boys.

If we look at the names of the Titanic passengers, we can see that each name is in the format Last, Title. First. We can use this information to estimate missing ages.

- First, we will use a regular expression to extract the title from the name.
- Then we will convert the title to upper case.
- Finally, we will fill in the missing values in the Age column with the mean age for that title.

In [1]:
# Extract the title (the first word that ends with a period) from each name
df_titanic['Title'] = df_titanic.Name.str.extract(r'([A-Za-z]+\.)', expand=False)
df_titanic.head()

##### Task13: Convert Title to Upper Case
To ensure we get an accurate mean age for each title, we convert the values in the Title column to upper case.

##### Task14: Fill in the Missing Age with the Mean Age of the Title
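
The three steps (extract the title, convert it to upper case, fill the missing ages with the per-title mean) can be sketched end to end on made-up rows. The names and ages in `toy` are hypothetical, and using `groupby(...).transform('mean')` is one possible approach, not necessarily the intended solution.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers for illustration only; not rows from titanic.csv
toy = pd.DataFrame({
    'Name': ['Doe, Mr. John', 'Roe, mr. Richard', 'Poe, Mr. Peter',
             'Doe, Master. Tim', 'Roe, Master. Sam', 'Loe, Miss. Ann'],
    'Age': [30.0, 40.0, np.nan, 4.0, np.nan, 20.0],
})

# Step 1: extract the title (the first word that ends with a period)
toy['Title'] = toy['Name'].str.extract(r'([A-Za-z]+\.)', expand=False)

# Step 2: upper-case the title so 'Mr.' and 'mr.' fall into the same group
toy['Title'] = toy['Title'].str.upper()

# Step 3: compute the mean age per title, aligned row by row,
# and use it only where Age is missing
mean_age_by_title = toy.groupby('Title')['Age'].transform('mean')
toy['Age'] = toy['Age'].fillna(mean_age_by_title)

print(toy[['Title', 'Age']])
```

`transform('mean')` returns a Series with the same index as the original rows, which is what allows it to be passed straight to `fillna`.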


### Feature Engineering
We'll create a new column, FamilySize. There are two columns related to family size. Parch is the number of parents or children aboard, and SibSp is the number of siblings or spouses aboard.

Take one name, 'Asplund', as an example. We can see that the total family size is 7 (Parch + SibSp + 1), and each family member has the same value for Fare, which means the Fare is for the whole group. So family size is likely to be an important feature for predicting Fare. There are only 4 Asplunds out of 7 in the dataset because the dataset is only a subset of all passengers.

In [None]:
df_titanic[df_titanic.Name.str.contains('Asplund')]

##### Task15: Create the column 'FamilySize'
FamilySize = Parch + SibSp + 1
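
As a small illustration of the formula, here it is applied to three made-up rows (the DataFrame `toy` and its values are assumptions, not real passengers):

```python
import pandas as pd

# Made-up counts for illustration only
toy = pd.DataFrame({'SibSp': [4, 0, 1], 'Parch': [2, 0, 0]})

# The +1 counts the passenger themselves
toy['FamilySize'] = toy['Parch'] + toy['SibSp'] + 1
print(toy)
```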

[Back to Top](#Table-of-Contents)

## Step 4: Modeling

Now we have a relatively clean dataset (the Cabin column is an exception because it has many missing values). We could create a classification model for Survived to predict whether a passenger would survive the disaster. We can also use regression to create a model that predicts the ticket price (Fare). This dataset is not well suited to regression, but since this course does not cover classification, we will perform a linear regression analysis on Fare in this exercise.

##### Task16: Use Regression to Create a Model for Fare
Construct a regression model with statsmodels.

Pick Pclass, Embarked, and FamilySize as independent variables.

Discuss the regression results.

In [None]:
import statsmodels.formula.api as smf
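
A hedged sketch of how such a model could be specified with the statsmodels formula API is shown below. The twelve rows in `toy` are invented purely so the snippet runs on its own, so the fitted numbers mean nothing. Wrapping Pclass and Embarked in `C()` to treat them as categorical is an assumption about the modelling choice, not a requirement of the task.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented rows for illustration only; coefficients fitted on them are meaningless
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'Embarked': ['S', 'C', 'Q', 'S', 'C', 'Q', 'S', 'C', 'Q', 'S', 'C', 'Q'],
    'FamilySize': [1, 2, 3, 4, 1, 2, 3, 4, 2, 1, 5, 3],
    'Fare': [80.0, 95.5, 120.0, 150.0, 26.0, 30.5, 41.0, 52.0,
             15.5, 7.9, 31.0, 20.0],
})

# C(...) asks the formula parser to dummy-code the variable
# (assumption: Pclass and Embarked are treated as categories)
model = smf.ols('Fare ~ C(Pclass) + C(Embarked) + FamilySize', data=toy).fit()

# summary() reports coefficients, p-values and R-squared for discussion
print(model.summary())
```

With the real data, replace `toy` with `df_titanic` once Fare has been converted to float and FamilySize has been created.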

[Back to Top](#Table-of-Contents)

## Step 5: Submit

##### Task17: Submit

Rename the notebook to CaseStudy_Titanic_yournetid. For example, if your netid is zz, rename the notebook to CaseStudy_Titanic_zz.  
Download as HTML.  
Submit the **HTML** File.