# **The Challenge in Python**

# Background

The [RMS Titanic](https://www.britannica.com/topic/Titanic) was a luxury steamship that sank on 15 April 1912 off the coast of Newfoundland in the North Atlantic, after colliding with an iceberg while en route from Southampton, England, to New York City. A recorded 2,240 passengers and crew were on board for the voyage, and 1,504 of them lost their lives.  
  
This project was initially written as a submission for the "[Titanic: Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic)" competition. The challenge asks participants to predict whether a passenger on the Titanic would survive, based on passenger data from the event. The Titanic dataset provides a range of information about passengers, such as socio-economic status, gender, age and survival.  

  This notebook walks through the full process of building a machine learning model: data exploration, data cleaning, and analysis with several classification methods.
  
  **Classification methods used**
* Logistic Regression
* Random Forest
* Decision Tree
* K-Nearest Neighbors

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # data visualization
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

#packages for Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

## Importing Data 
  
  The data referenced below and used throughout this project is sourced directly from ["*Titanic: Machine Learning from Disaster*"](https://www.kaggle.com/competitions/titanic/data).

In [None]:
train_df = pd.read_csv('../input/titanic/train.csv')
test_df = pd.read_csv('../input/titanic/test.csv')
combine = [train_df, test_df]

# Data Exploration 
  
I started the data exploration process by trying to answer the following questions to become more familiar with the data types and quantity of data.
1. Which features are listed in the dataset?
2. Which features are categorical or numerical?
3. Which features include mixed data types?
4. Which features may contain errors, typos or missing data?
5. Which features could contribute to a high survival rate?

### The Features  
  
  1. **Which features are listed in the data set?**  
    
    PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin and Embarked. The cell below prints the column names.

In [None]:
print(train_df.columns.values)

2. **Which features are categorical or numerical?**  
  
  Previewing the data shows several kinds of data across the features, defined as follows. [Categorical data](http://www.stat.yale.edu/Courses/1997-98/101/catdat.htm) is a variable that can take one of a limited, usually fixed, number of possible values. [Ordinal data](https://www.scribbr.com/statistics/ordinal-data/) is categorical data whose values can be ranked in a natural order. [Continuous data](https://www.isixsigma.com/dictionary/continuous-data/) is numerical data that can be measured on an infinite scale. [Discrete data](https://www.thedrum.com/profile/whatagraph/news/discrete-vs-continuous-data-whats-the-difference) is numerical data that takes a limited number of possible values, typically integers.  
    
    * Categorical: Survived, Sex and Embarked
    * Ordinal: Pclass
    * Continuous: Age, Fare
    * Discrete: SibSp, Parch
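
As a rough cross-check of the grouping above, pandas can report which columns it stores as numbers and which it stores as text. The sketch below uses a few made-up rows with the same column names as `train.csv` (the values are placeholders, not taken from the dataset), so it runs without the Kaggle files.

```python
import pandas as pd

# Hypothetical stand-in rows that reuse the column names from train.csv
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925],
    "Embarked": ["S", "C", "S"],
})

# Columns stored as numbers versus everything else (strings here)
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("text:", text_cols)
```

Note that the storage type is not the same as the statistical type: Survived and Pclass are stored as integers even though they are categorical and ordinal respectively.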

In [None]:
#Previewing the data to determine data types and column headers.
#head() returns the first five rows of the DataFrame.
train_df.head()

3. **Which features include mixed data types?**  
  
  Ticket and Cabin have mixed data types  
    
4. **Which features may contain errors, typos or missing data?**  
  
  This will be determined throughout the data cleaning and exploration process, as it is more difficult to determine from a preliminary look. 

In [None]:
#Previewing data continued
#tail() returns the last five rows of the DataFrame.
train_df.tail()

In [None]:
#show each column's dtype and non-null count for the train and test sets
train_df.info()
print('_'*10)
test_df.info()

In [None]:
# determine the number of null or missing values in each column for train set
train_df.isnull().sum()

In [None]:
#determine the number of null or missing value in each column for test set
test_df.isnull().sum()
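
Raw counts are easier to judge next to the share of rows they represent. Below is a minimal sketch of a helper that reports both; `missing_summary` is a name I made up for illustration, and the demo frame holds placeholder values rather than the real data.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values per column (hypothetical helper)."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only the columns that have gaps, largest gap first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Placeholder rows, not taken from the Titanic dataset
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})

report = missing_summary(demo)
print(report)
```

The same function could be called as `missing_summary(train_df)` and `missing_summary(test_df)` in this notebook.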

5. **Which features could contribute to a high survival rate?**  

  Based on observations and assumptions from the preliminary data exploration, the features I will focus on correlating with survival are Pclass (socio-economic status), Sex and Age.  
    Displayed below are the results of the Pclass and Sex analysis. The survival rate by age will be determined after the data cleaning process, as Age has missing values.
* **Pclass**: first-class passengers (Pclass 1) had the highest survival rate. The rate falls as the class number rises, so second- and third-class passengers survived less often.
* **Sex**: there is a significant correlation between Sex and Survived. Female passengers had a survival rate of about 74%, compared with about 19% for male passengers.
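
The two observations above look at one feature at a time. A pivot table can show both together; the following is a small sketch on made-up rows (the `demo` frame is a placeholder, not the real data), assuming the same column names as the training set.

```python
import pandas as pd

# Placeholder rows that reuse the Titanic column names
demo = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# Mean of Survived per (Pclass, Sex) cell is the survival rate of that group
rates = demo.pivot_table(index="Pclass", columns="Sex", values="Survived", aggfunc="mean")
print(rates)
```

Running the same `pivot_table` call on `train_df` would give the survival rate for each class and sex combination.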

In [None]:
#survival rate of pclass
train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
#visual representation of survival rate per Pclass
sns.barplot(x='Pclass',y='Survived',data=train_df)

In [None]:
#survival rate based on sex (values are still the strings 'female' and 'male' at this point)
train_df[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
#Visual representation of survival rate
sns.barplot(x='Sex',y='Survived',data=train_df)

# Cleaning Data
Now that I have an idea of what types of data are in the set and what is missing, the process of cleaning, or wrangling, the data begins. The focus of this process is to transform and unify the data for easy access and analysis. I will do this by dropping features, converting data types and completing numerical continuous features. 


## Unifying Categorical Features
In order to unify certain aspects of the data I converted features that contain strings to numerical values. Numerical values are required by most models and create consistency across the features.  
  
  The feature Sex will be converted to a discrete data type where female = 1 and male = 0.

In [None]:
for dataset in combine:
        dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
        
train_df.head()

## Completing Continuous Feature
Through the exploration of the data above, I found that 177 Age values were missing from the training set. To correct this, I generate random ages from the mean and standard deviation of the observed ages and use them to fill the missing values.

In [None]:
#compute mean and standard dev of Age
age_mean = train_df['Age'].mean()
age_std = train_df['Age'].std()

#number of missing (NaN) Age values
num_na = train_df['Age'].isna().sum()

#generate random ages from mean and standard dev, clipped at 0 so no age is negative
random_vals = np.clip(age_mean + age_std * np.random.randn(num_na), 0, None)

#replace missing values with random_vals
train_df.loc[train_df['Age'].isna(), 'Age'] = random_vals

# convert to whole numbers
train_df['Age'] = train_df['Age'].astype(np.int64)

#view data to check work
train_df.tail()

In [None]:
#Verify that missing values for age have been replaced.
train_df.isnull().sum()

In [None]:
#compute mean and standard dev of Age
age_mean = test_df['Age'].mean()
age_std = test_df['Age'].std()

#number of missing (NaN) Age values
num_na = test_df['Age'].isna().sum()

#generate random ages from mean and standard dev, clipped at 0 so no age is negative
random_vals = np.clip(age_mean + age_std * np.random.randn(num_na), 0, None)

#replace missing values with random_vals
test_df.loc[test_df['Age'].isna(), 'Age'] = random_vals

# convert to whole numbers
test_df['Age'] = test_df['Age'].astype(np.int64)

#view the test data to check work
test_df.tail()
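
The two cells above repeat the same steps for each set, and the result changes on every run because the random draws are not seeded. One possible variation is sketched below: a single seeded helper that can be applied to either frame. The name `fill_age_randomly` and the `seed` argument are my own for illustration, and the demo values are placeholders.

```python
import numpy as np
import pandas as pd

def fill_age_randomly(df, seed=0):
    """Fill missing Age with draws from N(mean, std), clipped at 0 (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # seeded so reruns give the same ages
    out = df.copy()                    # leave the caller's frame untouched
    missing = out["Age"].isna()
    draws = rng.normal(out["Age"].mean(), out["Age"].std(), size=int(missing.sum()))
    out.loc[missing, "Age"] = np.clip(draws, 0, None)  # no negative ages
    out["Age"] = out["Age"].astype(int)                # whole years
    return out

# Placeholder ages, not taken from the Titanic dataset
demo = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan, 30.0]})
filled = fill_age_randomly(demo, seed=42)
print(filled["Age"].tolist())
```

Because the helper returns a copy, the original frame keeps its gaps and the imputation can be rerun with a different seed to see how sensitive later results are to the random draws.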

In [None]:
#FacetGrid's `size` argument was renamed to `height` in seaborn 0.9
grid = sns.FacetGrid(train_df, row='Pclass', col='Sex', height=2.2, aspect=2.6)
grid.map(plt.hist, 'Age', alpha= .5, bins=10, color= 'orange')
plt.ylim((0,80))
grid.add_legend()

In [None]:
train_df.info()
print('_'*10)

## Completing a Categorical Feature
During data exploration I found that some values of the Embarked feature were missing. To correct this I fill those gaps with the most common port before converting the categorical feature to numeric.

In [None]:
#discover the most frequently used port
port = train_df.Embarked.dropna().mode()[0]
port

In [None]:
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(port)
    
train_df[['Embarked', 'Survived']].groupby(['Embarked'], 
                                as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
train_df.isnull().sum()

In [None]:
#convert categorical embarked feature to numeric
#this creates a unifying data type for analysis

for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

train_df.head()
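
Mapping the ports to 0, 1 and 2 gives them an order that the ports themselves do not have, which can matter for distance-based or linear models. One common alternative is one-hot encoding. The sketch below shows it on a placeholder frame; it is not what this notebook uses in the analysis that follows.

```python
import pandas as pd

# Placeholder values using the three port codes from the dataset
demo = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per port, so no port is "larger" than another
one_hot = pd.get_dummies(demo, columns=["Embarked"], dtype=int)
print(one_hot)
```

Each row has exactly one indicator set to 1, so the encoding carries the same information as the single mapped column without implying a ranking.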

In [None]:
#fill any missing Fare with 0, then convert Fare from float to int
data = [train_df,test_df]
for dataset in data:
    dataset['Fare'] = dataset['Fare'].fillna(0)
    dataset['Fare'] = dataset['Fare'].astype(int)

## Correction by dropping features
Another tool used in data cleaning is dropping data or features in order to improve the overall quality and efficiency of the data. At this stage of the cleaning process I dropped the features Ticket, Cabin and Name from both sets, as I do not intend to use them in this analysis. PassengerId is dropped from the training set only, so it remains available in the test set. 

In [None]:
train_df = train_df.drop(['PassengerId'], axis=1)
train_df = train_df.drop(['Ticket', 'Cabin', 'Name',], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin', 'Name',], axis=1)
combine = [train_df, test_df]
train_df.shape,test_df.shape

# Data Visualization
Data visualization is a useful tool for data cleaning as it can assist with detecting outliers, missing values, implicit boundaries and much more. In this case I will be using it to check the effectiveness of my data cleaning methods as well as correlating various related features. 

### Age vs Survival
 
Throughout the cleaning process I noticed a few relationships between age and survival. Most passengers in this data set are in the 15-35 age range. The oldest surviving passenger was 80 years old, and children under the age of 4 had a high survival rate.

In [None]:
age_hist = sns.FacetGrid(train_df, col= 'Survived')
age_hist.map(plt.hist, 'Age', bins = 20, color = "Orange")

###  Survival Classified by Age and Passenger Class


In [None]:
#FacetGrid's `size` argument was renamed to `height` in seaborn 0.9
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha = .5, bins = 20, color = "Orange")
grid.add_legend();

### Survival by Sex, Passenger Class and Embarking Port

Next I explored the relationship between sex, passenger class, embarking port and survival.  
  I found that female passengers had a higher survival rate across all groups. However, males who embarked at port C were more likely to survive than males who embarked at ports S and Q.
  

In [None]:
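#`size` was renamed to `height` in seaborn 0.9
grid = sns.FacetGrid(train_df, row='Embarked', height=2.2, aspect=1.6)
#pass order and hue_order explicitly so every facet uses the same categories
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep', order=[1, 2, 3], hue_order=[0, 1])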
grid.add_legend()

# Predictive Modeling 

[Predictive modeling](https://www.gartner.com/en/information-technology/glossary/predictive-modeling) is a commonly used statistical technique that predicts future behavior by analyzing historical and current data and generating a model to help predict outcomes.  
  The challenge asks for the relationship between survival and other variables, so I chose a selection of classification models to best answer this question.
* Logistic Regression
* Random Forest
* k-Nearest Neighbors
* Decision Tree

In [None]:
train_df.info()

In [None]:
test_df.info()

In [None]:
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test  = test_df.drop("PassengerId", axis=1).copy()


### Logistic Regression
Logistic regression is a statistical model used to handle classification problems. [Logistic regression](https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc) models the probability of a discrete outcome given the input variables. In other words, it measures the relationship between the categorical dependent feature and one or more independent features.

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
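
To make "modeling the probability of a discrete outcome" concrete, the sketch below rebuilds the predicted probability by hand: a linear score passed through the sigmoid function. It uses synthetic data from `make_classification` as a stand-in (an assumption for illustration, not the Titanic features) and checks the result against scikit-learn's own `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, not the Titanic set
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Linear score for each row, then the sigmoid maps it to P(y = 1)
z = X @ model.coef_.ravel() + model.intercept_[0]
manual_prob = 1.0 / (1.0 + np.exp(-z))

sklearn_prob = model.predict_proba(X)[:, 1]
print(np.allclose(manual_prob, sklearn_prob))
```

A row is predicted as class 1 when this probability is above 0.5, which is the same as the linear score being positive.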

### Decision Tree
[Decision Trees](https://scikit-learn.org/stable/modules/tree.html) are a non-parametric supervised learning method used for classification and regression. The goal is to build a tree-like model that predicts the target by learning simple decision rules from the features.

In [None]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree
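
The rules a tree has learned can be printed as text, which also shows why training accuracy needs care with this model. The sketch below uses synthetic data and made-up feature names (`f0` to `f3`) as assumptions; it compares a depth-limited tree with an unrestricted one.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data, not the Titanic set
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
deep = DecisionTreeClassifier(random_state=0).fit(X, y)  # grows until leaves are pure

# Human-readable if/else rules of the small tree
print(export_text(shallow, feature_names=["f0", "f1", "f2", "f3"]))

shallow_acc = shallow.score(X, y)
deep_acc = deep.score(X, y)
print("shallow training accuracy:", shallow_acc)
print("deep training accuracy:", deep_acc)
```

An unrestricted tree can keep splitting until it memorizes the training rows, so a very high training score for a decision tree does not by itself indicate good predictions on new data.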

### Random Forest


The [random forest](https://towardsdatascience.com/understanding-random-forest-58381e0602d2) is a classification algorithm made up of many decision trees. Each tree is trained on a bootstrap sample of the data (bagging) and considers a random subset of features at each split, and the trees' predictions are combined into a single result.

In [None]:
# Random Forest

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest
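
To illustrate how the individual trees are combined, the sketch below fits a small forest on synthetic stand-in data, averages the class probabilities from each tree by hand, and compares the result with the forest's own prediction. The data and the choice of 25 trees are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the Titanic set
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the class probabilities predicted by every tree in the forest
tree_probs = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)
manual_pred = forest.classes_[np.argmax(tree_probs, axis=1)]

print(len(forest.estimators_))
print(np.array_equal(manual_pred, forest.predict(X)))
```

In scikit-learn the forest's prediction is the class with the highest average probability across its trees, which is what the manual calculation reproduces.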

### K-Nearest Neighbor

The [K-nearest Neighbor algorithm](https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761) is a classification method that assigns a data point to a group based on which group the data points nearest to it belong to.

In [None]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn
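
The idea behind the classifier fits in a few lines of NumPy. Below is a minimal from-scratch sketch, assuming Euclidean distance and a simple majority vote; `knn_predict` and the toy points are made up for illustration and are not how scikit-learn implements it internally.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to each row
    nearest = np.argsort(distances)[:k]                  # indices of the k closest rows
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters with labels 0 and 1
X_demo = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_demo, y_demo, np.array([1.1, 1.0]))
pred_b = knn_predict(X_demo, y_demo, np.array([5.1, 5.0]))
print(pred_a, pred_b)
```

Because the method relies on distances, features on larger scales (Fare, for example) can dominate the vote unless the features are scaled first.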

# Model Evaluation
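The table below ranks the models by the accuracy scores computed above. Each score is the model's accuracy on the training data, so it measures how well the model fits data it has already seen rather than how well it will predict new passengers. 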


In [None]:
models = pd.DataFrame({
    'Model': ['KNN', 'Logistic Regression', 'Random Forest', 'Decision Tree'],
    'Score': [acc_knn, acc_log, acc_random_forest, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
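
Since the Kaggle test set has no Survived column, one way to estimate accuracy on unseen data is cross-validation on the training set. The sketch below shows the gap between the two kinds of score on synthetic, deliberately noisy stand-in data (an assumption for illustration); the same `cross_val_score` call could be pointed at `X_train` and `Y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 20% label noise, not the Titanic set
X, y = make_classification(n_samples=400, n_features=6, flip_y=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0)

# Accuracy on the same rows the model was fit on
train_acc = tree.fit(X, y).score(X, y)

# Mean accuracy over 5 folds, each scored on rows held out from fitting
cv_acc = cross_val_score(tree, X, y, cv=5).mean()

print(f"training accuracy: {train_acc:.3f}")
print(f"5-fold CV accuracy: {cv_acc:.3f}")
```

Ranking the four models by cross-validated accuracy instead of training accuracy would likely change the order, since tree-based models tend to score close to 100% on data they have already seen.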