## First Steps on the ML Universe - Titanic Disaster!

## First Steps on Kaggle and ML!

During the past few months, I've been learning the basics of Machine Learning. This notebook is an attempt to share the insights I've come across so far with others like me. Hope you find it useful!

### Table of Contents

* [Exploratory Data Analysis](#eda)
    * [Correlations, Distributions](#corr)
* [Pre Processing](#prepro)
    * [Some Feature Engineering](#fea)
    * [Normalization - Scaling](#norm)
    * [Imputation](#knn)
        * [Observations](#obs)
    * [+ Feature Engineering](#fea2)
    * [Outliers](#out)
        * [Observations](#obs2)
* [Modeling](#mod)
* [Final Thoughts](#final)


### Understanding your problem and your data

The first step on your journey is to understand the problem you are trying to solve (**who survived the Titanic disaster?**) and the data you have to solve it with.

Let's import your datasets and some of the libraries you will be using and see what the dataset looks like:

In [None]:
import numpy as np #Numpy is used for array manipulation and processing
import pandas as pd #Pandas is used for table manipulation and processing
import matplotlib.pyplot as plt
import seaborn as sns # A very useful library other than matplotlib for plotting
from sklearn.impute import KNNImputer 
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier # Simple classification model
from sklearn.model_selection import train_test_split # Splitting into train and test
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV # Hyperparameter tuning
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, ConfusionMatrixDisplay # Classification metrics; ConfusionMatrixDisplay replaces plot_confusion_matrix, which was removed in scikit-learn 1.2

# These lines list the dataset files attached to the notebook under /kaggle/input
# You can also attach them via the Add data button on the right of your Kaggle notebook.

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/titanic/train.csv') #Import our data
df.columns # This attribute holds the names of the dataset's columns

<a id = "eda"></a>
## Exploratory Data Analysis (EDA)

When you are presented with a new problem, it may come with vocabulary specific to its topic, so it's really important to understand what the variables (columns, in this case) mean.

In this case, Kaggle provides information about it in the **Data** tab in the Titanic's competition page.

But you might not have this information (as with most real-world problems), so you will have to gather it yourself. An expert on the topic is usually a great place to start. **Research** is a big part of solving a problem!

### The columns

Many of them are self-explanatory, but some aren't:

* Embarked: Name of the port of embarkation
* Parch: Number of parents/children aboard
* SibSp: Number of siblings/spouses aboard

### How do we continue?

Now that you understand what the variables are telling you, you will try to figure out how each one relates to the prediction target (whether the person survived or not). We will do some plotting and some further analysis.

Some useful questions to guide your analysis may be:

* What type of distribution do the variables have?
* Are there any outliers?
* Are there any missing values? 
* Which variables do you think will be of interest? Are there any that don't give much information about the problem?
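
Several of the questions above can be answered with a few pandas calls. The sketch below is only an illustration: the `toy` table is made up to mimic a couple of Titanic columns, and `quick_overview` is a hypothetical helper name, not something from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Titanic table; in the notebook you would pass df instead.
toy = pd.DataFrame({
    "Age":  [22.0, 38.0, np.nan, 35.0, 2.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 21.07, 512.33],
    "Sex":  ["male", "female", "female", "female", "male", "male"],
})

def quick_overview(frame):
    """Return one row per column: dtype, missing count, and skewness (numeric only)."""
    numeric = frame.select_dtypes("number")
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "skew": numeric.skew(),  # stays NaN for non-numeric columns
    })

overview = quick_overview(toy)
print(overview)
```

A clearly positive skew (like the made-up Fare column here) hints at a long right tail, which is worth remembering when you get to outliers.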

In [None]:
sns.pairplot(data = df, hue = 'Survived') #Plotting histograms and distributions of each variable.
# hue parameter sets color by another variable, in this case, "Survived"

In [None]:
sns.kdeplot(data = df, x = 'SibSp', hue = 'Survived') # One density curve of SibSp per value of Survived

In [None]:
df_test = pd.read_csv('/kaggle/input/titanic/test.csv') # Import the test data

sns.pairplot(df_test)

In [None]:
print(df_test.head(), df_test.shape)

In [None]:
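sns.heatmap(df_test.select_dtypes('number').corr(), annot = True) # Correlations only make sense for numeric columns
plt.title('Correlations between Features in Test')

In [None]:
sns.heatmap(df.select_dtypes('number').corr(), annot = True)
plt.title('Correlations between Features in Train')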

In [None]:
print('Missing values in Train: ')
df.isna().sum() # We count the number of missing values on each column

In [None]:
print('Missing values in Test: ')
df_test.isna().sum()

In [None]:
print('Shape of Train: ', df.shape)
df.head()

In [None]:
print('Shape of Test: ', df_test.shape)

In [None]:
sns.barplot(x = df.Sex, y = df.Survived) #easy Barplot with seaborn 

<a id = "corr"></a>
### Correlations

Take some time to understand what's happening in the pairplots and heatmaps. There's plenty of information in those few lines. Parch and SibSp are **positively correlated** with each other, which gives you the chance to do some feature engineering.
In the Train heatmap (which has Survived as a column) we see that Pclass is **negatively correlated** with Survived, because first-class passengers had a better chance of surviving.
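
If the heatmap feels crowded, you can also read a single column of the correlation matrix and sort it. This is a minimal sketch on made-up rows (the `toy` table is not real passenger data), just to show the pattern:

```python
import pandas as pd

# Made-up rows that only mimic the Titanic columns; not real passenger data.
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0, 0, 1],
    "Pclass":   [1, 1, 3, 3, 2, 3, 2, 1],
    "SibSp":    [1, 0, 0, 3, 1, 0, 4, 0],
    "Parch":    [0, 0, 0, 2, 1, 0, 2, 0],
    "Name":     list("abcdefgh"),  # non-numeric column, excluded below
})

# Keep numeric columns only, then read one column of the correlation matrix.
corr = toy.select_dtypes("number").corr()
target_corr = corr["Survived"].drop("Survived").sort_values()
print(target_corr)
```

On the real data you would replace `toy` with `df`; the sign tells you the direction of the relation and the absolute value how strong the linear part of it is.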

### Distributions

In the KDEs you can see that, for some values of SibSp, **survivors outnumber fatalities**, and, from my point of view, that's a very good reason to consider doing some Feature Engineering with it. Something similar happens with **Fare**.

There are probably many more patterns going on that I'm missing. You can dig in as much as you want and find others. This part of the process is crucial for getting a better-performing model.

### Feature Importance

Some columns don't give you any useful information: drop them. Keep the ones that matter.




<a id = "prepro"></a>
## Data Preprocessing

During this step we want to leave the data ready for modeling. You will:

* Impute missing values: Most models can't handle missing values, so you have to do something with them.

* Handle your Outliers: values that are too large or too small compared to the rest.

* Normalize and scale: These steps are not always necessary; it depends on your Machine Learning model. Some models are based on distances between points (like K-Nearest Neighbors), so normalizing and scaling are indispensable for keeping your features equally weighted. Other models, like Decision Trees, do not rely on distances and are largely unaffected. Still, it is good practice, so in this notebook you'll learn the basics of how to do it.
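
To see why distance-based models care about scale, here is a small sketch with three made-up passengers described only by (fare, sex). The numbers are purely illustrative and `euclidean` is a helper written for this example:

```python
import numpy as np

# Three made-up passengers described by (fare, sex); values are illustrative only.
a = np.array([10.0, 0.0])
b = np.array([12.0, 1.0])   # similar fare, different sex
c = np.array([80.0, 0.0])   # very different fare, same sex

def euclidean(u, v):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# Raw units: the fare column dominates the distance.
raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)

# Rescale fare to [0, 1] using the min and max of these three points.
points = np.vstack([a, b, c])
fare_min, fare_max = points[:, 0].min(), points[:, 0].max()
scaled = points.copy()
scaled[:, 0] = (points[:, 0] - fare_min) / (fare_max - fare_min)
sc_ab = euclidean(scaled[0], scaled[1])
sc_ac = euclidean(scaled[0], scaled[2])

print(raw_ab, raw_ac)   # fare differences swamp the sex difference
print(sc_ab, sc_ac)     # both features now contribute on a similar scale
```

Before scaling, `c` looks far away from `a` only because fares are measured in bigger numbers; after scaling, a difference in sex weighs about as much as the largest difference in fare.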


### Some Feature Engineering

**Label Encoding**: We encode the Sex column as 0/1 and build binary columns for Expensive Fare and Small Family (see the KDEs). All of these decisions are based on the barplot, the KDEs and the heatmap.

I'll change Pclass from a descending rank to an ascending one. I don't know if this changes anything.
 

In [None]:
df.Sex = df.Sex.map(dict(female=1, male=0)) #The map method uses a dictionary to transform values on a Series object.
df_test.Sex = df_test.Sex.map(dict(female=1, male=0))

In [None]:
df['Pclass'] = df['Pclass'].replace([3,1],[1,3]) #Another method to replace values on column
df_test['Pclass'] = df_test['Pclass'].replace([3,1],[1,3])
print(df.head(), df_test.head())
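
If you are wondering whether `replace([3,1],[1,3])` really swaps the two classes instead of turning everything into the same value, you can check it on a tiny made-up Series. As far as I understand, pandas applies the two lists as a single mapping:

```python
import pandas as pd

pclass = pd.Series([1, 2, 3, 3, 1])  # made-up class values

# The two lists act as one mapping (3 -> 1 and 1 -> 3), not as two sequential steps.
swapped = pclass.replace([3, 1], [1, 3])

# An equivalent arithmetic way to flip a 1..3 rank.
flipped = 4 - pclass

print(swapped.tolist())
print(flipped.tolist())
```

Both lines should print the same list, with class 2 untouched.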

In [None]:
df['Small Family'] = np.where((df['SibSp'] <= 2) & (df['SibSp'] != 0), 1, 0)
df_test['Small Family'] = np.where((df_test['SibSp'] <= 2) & (df_test['SibSp'] != 0), 1, 0)
df_test['Lonely Child'] = np.where(df_test['Parch'] == 1, 1, 0)
df['Lonely Child'] = np.where(df['Parch'] == 1, 1, 0)

df['Family'] = df['SibSp'] + df['Parch']
df_test['Family'] = df_test['SibSp'] + df_test['Parch']
df.head()

In [None]:
X = df[['Age', 'Pclass', 'Fare', 'Family','Lonely Child', 'Sex', 'Small Family','PassengerId']]#Select features
Xt = df_test[['Age', 'Pclass', 'Fare', 'Family','Lonely Child', 'Sex', 'Small Family', 'PassengerId']]
y = df.Survived #Select target

<a id = "norm"></a>
### Normalization - Scaling

Scaling data means (in a non-exhaustive, non-academic way) **bringing all your feature values to a common scale while maintaining the relationships between the values within each feature**, that is, bringing each column of your dataframe to, say, values between 0 and 1. This makes sense: if one column has values between 1000 and 100000 and another between 0 and 1, some algorithms may incorrectly decide that the first feature is much more important than the second one.
Normalizing, in very simple terms, is **re-centering and re-scaling your distributions so they are comparable to a standard normal one** (zero mean, unit standard deviation). Strictly speaking, subtracting the mean and dividing by the standard deviation does not change the shape of a distribution, only its location and spread. Binary **encoded features** like "Sex" are not normalized; note that the loop below runs over the first four columns, which does include "Pclass".
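
Here is a minimal numeric sketch of the two operations, on a made-up `fare` array. Min-max scaling is shown as one common way to reach the 0 to 1 range (the notebook's own loop divides by the maximum instead), followed by the mean/standard-deviation step:

```python
import numpy as np

fare = np.array([5.0, 10.0, 15.0, 70.0])  # made-up values

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
min_max = (fare - fare.min()) / (fare.max() - fare.min())

# Standardization (z-score): zero mean, unit standard deviation.
# Note: np.std uses ddof=0 by default, while pandas' .std() uses ddof=1,
# so the numbers differ slightly from what a pandas column would give.
z = (fare - fare.mean()) / fare.std()

print(min_max)
print(z.mean(), z.std())

# Both transforms are linear, so the ordering of the values is unchanged.
same_order = np.array_equal(np.argsort(fare), np.argsort(z))
```

Because both steps are linear, a long-tailed column stays long-tailed afterwards; only the units change.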
 

#### Side Notes

The truth is, you should run some statistical tests on the relationships between the variables before relying on them for the imputation. We are simplifying things a lot here. If you want more details, just let me know.

I'm also learning, so any feedback or suggestions for improvement are very, very welcome :)

In [None]:
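X = pd.concat([X, Xt], ignore_index = True) # Stack train and test rows (DataFrame.append was removed in pandas 2.0)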

for col in X.columns[:4]: # First four columns: Age, Pclass, Fare, Family
    X[col] = X[col] / X[col].max() # Scale each column dividing by maximum
    mean = X[col].mean() # Calculate column mean
    std = X[col].std() # Calculate column STD
    X[col] = (X[col] - mean)/std # Standardize column (zero mean, unit STD)
print(X.shape, X.head())

<a id = "knn"></a>
### KNN Imputer

This kind of imputation uses K-Nearest Neighbors logic: it fills each missing value based on the values of the nearest points. It has its own hyperparameters, but in this notebook I won't focus on those aspects. I don't want you to get too bored too quickly. Kaggle and StackOverflow are great places to look for more content.
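
To make the idea concrete, here is a tiny sketch of `KNNImputer` on a made-up matrix. `n_neighbors` is probably the hyperparameter you will touch first: it sets how many nearby rows are averaged to fill a gap.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Tiny made-up matrix with two missing entries.
data = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each gap is filled with the mean of that column over the 2 nearest rows,
# where "nearest" is measured on the columns both rows have available.
imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(data)
print(filled)
```

The known values are left untouched; only the two `nan` cells change.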

In [None]:
sns.kdeplot(Xt.Age) # plot first kde
plt.title('Age Distribution without imputation')

In [None]:
imputer = KNNImputer() # Create the imputer object
X = pd.DataFrame(imputer.fit_transform(X),columns = X.columns) # Transform the data

In [None]:
sns.kdeplot(X.Age) # plot the KDE again, now on the imputed data
plt.title('Age Distribution with imputation')


<a id = "obs"></a>
### Observations

You can see that the Age distribution looks close enough with and without imputation. **This is crucial**: you want your data after imputation to look similar to how it looked before.
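
Besides eyeballing the two KDEs, you can compare a few summary statistics before and after imputing. This is only a sketch on synthetic data (the `age` and `fare` columns below are randomly generated, not the Titanic ones):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Synthetic data: "age" depends loosely on "fare", then about 20% of ages are hidden.
fare = rng.uniform(5, 100, size=300)
age = 20 + 0.3 * fare + rng.normal(0, 5, size=300)
frame = pd.DataFrame({"age": age, "fare": fare})
frame.loc[rng.random(300) < 0.2, "age"] = np.nan

before = frame["age"].describe()  # describe() skips missing values
filled = pd.DataFrame(KNNImputer().fit_transform(frame), columns=frame.columns)
after = filled["age"].describe()

comparison = pd.DataFrame({"before": before, "after": after})
print(comparison.loc[["mean", "std", "25%", "50%", "75%"]])
```

If the mean, spread and quartiles move a lot between the two columns, the imputation is reshaping your data more than you probably want.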

<a id = "fea2"></a>
### Feature Engineering (Child and Expensive Columns)

You can see in the graphs above that Age has a two-peaked (bimodal) distribution, which we can use to create a new feature. Let's add a new column named Child, with 1 if True and 0 if False. We'll do the same for Fare with a column named Expensive.

In [None]:
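X['Child'] = np.where(X['Age']< -1.5, 1, 0) # Age is standardized here, so -1.5 means 1.5 standard deviations below the mean
X['Expensive'] = np.where(X['Fare'] <= 0.14, 0, 1) # Fare is standardized too, so 0.14 is a threshold on that scale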

X.head()

<a id = "out"></a>
### Outliers

Without scaling and normalization, you would tend to flag too many instances (each **row** of your dataset) as outliers. Now that you have everything set up nicely, let's see what these features look like, this time with **boxplots**.

In [None]:
sns.boxplot(x = 'variable', y = 'value', data=pd.melt(X[['Age', 'Pclass', 'Fare', 'Family', 'Sex', 'Small Family']]))
plt.title('Distribution of the features')

<a id = "obs2"></a>
### Observations

As you might expect, there is no such thing as an outlier in an encoded column like Sex or Pclass, and in **all** the other features we can see a long-tailed distribution. **This must be kept in mind for the outlier detection**.

For an **excellent in-depth article about outliers** visit: 
https://www.kaggle.com/aimack/how-to-handle-outliers/notebook
by Akash Dey. 

The following steps follow his logic. Some of the code was borrowed as well.
Since there are not too many instances (one per passenger), fewer than 1000 in the training set, the **Capping method** will be used.

In [None]:
for i in [0, 2, 3]: # Age, Fare, Family
    q1 = X.iloc[:,i].quantile(0.25) # First quartile
    q3 = X.iloc[:,i].quantile(0.75) # Third quartile

    IQR = q3 - q1 # Interquartile range

    # Defining max and min limits
    max_limit = q3 + (1.5 * IQR)
    min_limit = q1 - (1.5 * IQR)

    # Capping: values outside the limits are replaced by the nearest limit
    X.iloc[:,i] = X.iloc[:,i].clip(lower = min_limit, upper = max_limit)
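
The capping rule used above is easier to follow on a handful of numbers. In this sketch `cap_iqr` is a hypothetical helper name and the values are made up, with one obvious outlier:

```python
import pandas as pd

def cap_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those limits."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up, one obvious outlier
capped = cap_iqr(values)

# Q1 = 2, Q3 = 4, IQR = 2, so the limits are -1 and 7: only 100 gets capped.
print(capped.tolist())
```

Notice that capping keeps the row: the extreme value is pulled back to the limit instead of being dropped, which is handy when you don't have many rows to spare.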

In [None]:
sns.boxplot(x = 'variable', y = 'value', data=pd.melt(X[['Age', 'Pclass', 'Fare', 'Family', 'Sex', 'Small Family']]))
plt.title('Distribution of the features')

<a id = "mod"></a>
### Modeling

Now that you have both train and test sets preprocessed, let's train a Random Forest and a Support Vector Machine. My hypothesis is that SVC might perform better, since it can benefit from the scaling and normalizing.

In [None]:
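n_train = len(df) # Number of training rows; the remaining rows belong to the test set
Xtrain = X.iloc[:n_train, :].copy()
Xtest = X.iloc[n_train:, :].copy()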

In [None]:
X_train = Xtrain[['Age', 'Pclass', 'Fare', 'Family','Lonely Child', 'Sex', 'Small Family', 'Expensive', 'Child']]
X_test = Xtest[['Age', 'Pclass', 'Fare', 'Family','Lonely Child', 'Sex', 'Small Family', 'Expensive', 'Child']]

In [None]:
Xtrain2, Xtest2, ytrain, ytest = train_test_split(X_train, y, test_size = 0.3, random_state = 42) # Fixed seed so the split is reproducible

In [None]:
rf = RandomForestClassifier()
grid_list = {"n_estimators": [100, 200, 300, 500, 1000],
             'max_depth': [100, 30, 150, None],
             'max_features':['sqrt', 'log2']} # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3


grid_search = GridSearchCV(rf, param_grid = grid_list, n_jobs = 4, cv = 5, scoring = 'accuracy') 
grid_search.fit(Xtrain2, ytrain) 
print(grid_search.best_params_, grid_search.best_score_)

In [None]:
svc = SVC()
grid_list = {'C': [0.1,1, 10, 100], 
             'gamma': [1,0.1,0.01,0.001],
             'kernel': ['rbf', 'poly', 'sigmoid']}


grid_search = GridSearchCV(svc, param_grid = grid_list, n_jobs = 4, cv = 5, scoring = 'accuracy') 
grid_search.fit(Xtrain2, ytrain) 
print(grid_search.best_params_, grid_search.best_score_)
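
Once a grid search has finished, you don't need to copy the winning parameters by hand: with the default `refit=True`, the fitted search object keeps a tuned model in `best_estimator_`. A minimal sketch on synthetic data (generated with `make_classification`, so not the Titanic features, and with a smaller grid to keep it quick):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_fit, y_fit)

# refit=True (the default) retrains the best combination on all of X_fit,
# so the tuned model is available without retyping its parameters.
best_model = search.best_estimator_
holdout_accuracy = best_model.score(X_hold, y_hold)
print(search.best_params_, round(search.best_score_, 3), round(holdout_accuracy, 3))
```

Comparing `best_score_` (cross-validated) with the accuracy on rows the search never saw is a quick way to spot an overly optimistic tuning result.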

<a id = "final"></a>
### Final Thoughts

If you have any comment, suggestion, or idea, please let me know. Hope you enjoyed it =).

drK~


In [None]:
svc = SVC(C = 100, gamma = 0.01, kernel = 'rbf')
svc.fit(Xtrain2, ytrain)
predictions = svc.predict(Xtest2)
accuracy_score(ytest, predictions)

In [None]:
rf = RandomForestClassifier(n_estimators = 300)
rf.fit(Xtrain2,ytrain)
predictions = rf.predict(Xtest2)
accuracy_score(ytest,predictions)
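
Accuracy is a single number, so it can hide which kind of mistakes a model makes. The `confusion_matrix` function imported at the top of the notebook breaks them down; here is a sketch on made-up labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 1, 1, 0, 1]  # made-up true labels
y_pred = [0, 1, 0, 0, 1]  # made-up predictions

cm = confusion_matrix(y_true, y_pred)
# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
tn, fp, fn, tp = cm.ravel()
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
```

In the notebook you would pass `ytest` and `predictions` instead of the toy lists.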

In [None]:
svc = SVC(C = 100, gamma = 0.01, kernel = 'rbf')
svc.fit(X_train, y)
predictions = svc.predict(X_test)

In [None]:
# PassengerId became a float during imputation, so cast it back to int for the submission file
output = pd.DataFrame({'PassengerId': Xtest['PassengerId'].astype(int).values, 'Survived': predictions})
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")