## 1. Introduction

> This is my first machine learning project on Kaggle. The notebook is written in Python.
In this kernel I will go through the whole process of creating a machine learning model on the famous Titanic dataset, which is used by many people all over the world.

> Feel free to fork this kernel to play around with the code and test it for yourself. If you plan to use any part of this code, please reference this kernel! I will be glad to answer any questions you may have in the comments. Thank you!


In [None]:
#Import Libraries
import numpy as np 
import pandas as pd 

import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix



## Gathering Data

> We downloaded two files (train.csv and test.csv), and now we have to read them.

In [None]:
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')
all_data = [train,test]

## 2. Assessing the Data

In [None]:
train.head()



We note that we have:
1. A mix of numeric and alphanumeric data types in the Ticket column
2. Categorical data in the Embarked and Sex columns


In [None]:
train.info()

We have a new problem here: there are missing values in the Age, Cabin, and Embarked columns.
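
The missing values reported by `train.info()` can also be counted directly per column. The sketch below uses a small made-up frame (`toy`, not the real train.csv) purely to illustrate the idea; the same two lines should work on `train`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
    "Pclass": [3, 1, 3, 2],
})

# Count missing entries per column and keep only the columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Columns without gaps (here `Pclass`) drop out of the result, which makes the problem columns easy to spot.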

In [None]:
train.describe()


## Exploration



## 1. Sex

In [None]:
print (train[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean())


This means that females survived at a much higher rate than males.


## 2. Pclass

In [None]:
print (train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean())


In [None]:
# `height` replaces the `size` argument used by older seaborn releases
grid = sns.FacetGrid(train, col='Survived', row='Pclass', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();


### Another graph for representation

In [None]:
print(pd.crosstab(train.Survived, train.Pclass))

plt.figure(figsize=(12,5))

sns.countplot(x="Pclass", data=train, hue="Survived",palette="hls")
plt.xlabel('PClass',fontsize=17)
plt.ylabel('Count', fontsize=17)
plt.title('Class Distribution by Survived or Not', fontsize=20)

plt.show()


Looking at the graphs, it is clear that 3rd-class passengers had a high probability of not surviving. The Embarked plot later in this section suggests the same for passengers who embarked at Southampton.
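
Raw counts can hide the survival rate when the classes differ in size. As a rough sketch on made-up data (the `toy` frame below is an assumption, not the real dataset), `pd.crosstab` with `normalize="index"` turns each row into proportions:

```python
import pandas as pd

# Made-up passengers, only to show the shape of the output.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# normalize="index" divides each row by its total, so every row sums to 1.
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(rates)
```

On the real data, `pd.crosstab(train.Pclass, train.Survived, normalize="index")` should give the per-class survival proportions directly.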



### 3. SibSp and Parch

Here we can create a new feature, FamilySize, that combines SibSp (siblings/spouses) and Parch (parents/children), plus one for the passenger themselves.

In [None]:
for dataset in all_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
print (train[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean())

In [None]:
print(pd.crosstab(train.FamilySize, train.Survived))
# `catplot` and `height` replace `factorplot` and `size` from older seaborn releases
sns.catplot(x="FamilySize", y="Survived", data=train, kind="bar", height=6, aspect=1.6)
plt.show()


Families of four people had the highest survival rate.

Let us drop the Parch and SibSp features in favor of FamilySize.



In [None]:
train = train.drop(['Parch', 'SibSp'], axis=1)
test = test.drop(['Parch', 'SibSp'], axis=1)
all_data = [train, test]

train.head()
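
The cell above rebuilds `all_data` after dropping columns, and the reason may not be obvious: `drop` returns a new DataFrame, so a list created earlier still points at the old objects. A minimal sketch with hypothetical names (`train_toy`, `frames`) shows the effect:

```python
import pandas as pd

# Hypothetical stand-in for `train`; `frames` plays the role of `all_data`.
train_toy = pd.DataFrame({"SibSp": [1, 0], "Parch": [0, 2], "FamilySize": [2, 3]})
frames = [train_toy]

# drop() returns a new DataFrame; the list still references the old one.
train_toy = train_toy.drop(["Parch", "SibSp"], axis=1)
stale = "SibSp" in frames[0].columns

# Rebuilding the list makes later loops operate on the new frame.
frames = [train_toy]
fresh = "SibSp" in frames[0].columns

print(stale, fresh)
```

Without the rebuild, later `for dataset in all_data:` loops would modify frames that `train` and `test` no longer refer to.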


### 4. Age

In [None]:
g = sns.FacetGrid(train, col='Survived')
g.map(plt.hist, 'Age', bins=25)


### Above we can see that: 
1. Infants (Age <=4) had high survival rate.
2. Oldest passengers (Age = 80) survived.
3. Large number of 15-25 year olds did not survive.
4. Most passengers are in 15-35 age range.

## 5. Sex

In [None]:
print(pd.crosstab(train.Survived,train.Sex))

plt.figure(figsize=(12,5))
sns.countplot(x="Sex", data=train, hue="Survived",palette="hls")
plt.title('Sex Distribution by Survived or Not', fontsize=20)
plt.xlabel('Sex Distribution',fontsize=17)
plt.ylabel('Count', fontsize=17)

plt.show()


This shows that far more men than women died.


## 6. Embarked

In [None]:
print(pd.crosstab(train.Survived, train.Embarked))

plt.figure(figsize=(12,5))

sns.countplot(x="Embarked", data=train, hue="Survived",palette="hls")
plt.title('Embarked Distribution by Survived or Not',fontsize=20)
plt.xlabel('Embarked',fontsize=17)
plt.ylabel('Count', fontsize=17)

plt.show()


## 3. Cleaning

## Problems:
 - Missing Value in (Age, Cabin, Embarked)
 
 - Categorical data in columns (Sex, Embarked)
 
 - Mix of numeric and alphanumeric data types in Ticket column
 
 - Name feature may contain errors or typos as there are several ways used to describe a name including titles, round brackets, and quotes used for alternative or short names.
 
 

### Prop1: Cabin and Ticket Data
* I think that the Cabin and Ticket features have little impact on survival, so I decided to drop them.

In [None]:
print('Shape Before drop: ', train.shape)
train = train.drop(['Ticket', 'Cabin'], axis=1)
test = test.drop(['Ticket', 'Cabin'], axis=1)
all_data = [train, test]
print('Shape After drop: ',train.shape)


### Prop2: categorical feature (Sex)

Let us start by converting the Sex feature to numeric values in place, where female=0 and male=1.



In [None]:
for dataset in all_data:
    # Mapping Sex
    dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
train.head()

 ### Prop3: Missing value in Age
 
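We have plenty of missing values in this feature.
One option is to generate random numbers between (mean - std) and (mean + std). Here we instead fill each gap with the median age of passengers with the same Sex and Pclass, then split Age into 5 ranges.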



In [None]:
guess_ages = np.zeros((2,3))
guess_ages


In [None]:
for dataset in all_data:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_data = dataset[(dataset['Sex'] == i) & \
                                  (dataset['Pclass'] == j+1)]['Age'].dropna()

            # Alternative (not used): draw a random age around the group mean
            # age_mean = guess_data.mean()
            # age_std = guess_data.std()
            # age_guess = np.random.uniform(age_mean - age_std, age_mean + age_std)

            age_guess = guess_data.median()

            # Round the median age to the nearest 0.5
            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5
            
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),\
                    'Age'] = guess_ages[i,j]

    dataset['Age'] = dataset['Age'].astype(int)

train.head()
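
The nested loops above can probably be expressed more compactly with `groupby(...).transform`. This is a sketch of an alternative, not the code this notebook runs; the `toy` frame is made up, and unlike the loop it casts the median straight to `int` without first rounding to the nearest 0.5.

```python
import numpy as np
import pandas as pd

# Made-up rows: two (Sex, Pclass) groups, each with one missing Age.
toy = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# transform("median") returns one value per row: the median Age of that row's group.
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Fill only the gaps, then cast to int as the notebook does.
toy["Age"] = toy["Age"].fillna(group_median).astype(int)
print(toy)
```

The median ignores missing values, so each gap receives the median of the known ages in its own group.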


In [None]:
train['Age_group'] = pd.cut(train['Age'], 5)
train[['Age_group', 'Survived']].groupby(['Age_group'], as_index=False).mean().sort_values(by='Age_group', ascending=True)


#### Let us replace Age with ordinals based on these groups.



In [None]:
for dataset in all_data:    
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
train.head()
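
The chain of `.loc` assignments can also be written as a single `pd.cut` call with explicit bin edges. The sketch below assumes the same edges (16, 32, 48, 64) and uses a handful of made-up ages; `labels=False` asks `pd.cut` for integer bin codes instead of interval labels.

```python
import numpy as np
import pandas as pd

# Made-up ages chosen to sit on and around the bin edges.
ages = pd.Series([4, 16, 17, 32, 40, 64, 80])

# Right-closed bins: (-1, 16], (16, 32], (32, 48], (48, 64], (64, inf)
bins = [-1, 16, 32, 48, 64, np.inf]
codes = pd.cut(ages, bins=bins, labels=False)
print(codes.tolist())
```

Because every age lands in exactly one bin, this form avoids the risk of a forgotten assignment in one of the branches.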


#### Remove the Age_group column.

In [None]:
train = train.drop(['Age_group'], axis=1)
all_data = [train, test]
train.head()


### Prop4: Missing value in Embarked:
The Embarked column in our training dataset has two missing values. We simply fill these with the most common value, 'S'.



In [None]:
for dataset in all_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
print (train[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean())
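
Rather than hard-coding `'S'`, the most common port can be looked up with `mode()`. A small sketch on made-up values (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Made-up Embarked values with two gaps.
toy = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan, "Q", "S", np.nan]})

# mode() ignores NaN and returns a Series; take its first entry.
most_common = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common)
print(most_common, toy["Embarked"].tolist())
```

When applying this to both frames, one reasonable choice is to compute the mode on the training set and reuse it for the test set.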



### Prop5: categorical feature (Embarked)

We can now convert the filled Embarked feature to numeric values.

In [None]:
for dataset in all_data:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

train.head()
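
The mapping S=0, C=1, Q=2 implies an order between ports that does not really exist. One alternative, shown here only as a sketch and not used in the rest of this notebook, is one-hot encoding with `pd.get_dummies`; the `toy` frame is made up.

```python
import pandas as pd

# Made-up Embarked values, before any numeric mapping.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per port; each row has exactly one column set.
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")
print(dummies)
```

Tree-based models usually cope well with the ordinal codes, so this matters mostly for linear models such as logistic regression.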


### Prop6: Fare
I think the Fare column is not important, so I decided to drop it.

In [None]:
train = train.drop(['Fare'], axis=1)
test = test.drop(['Fare'], axis=1)
all_data = [train,test]
train.head()

### Prop7: Name
- Most titles band Age groups accurately. For example: Master title has Age mean of 5 years.
- Survival among Title Age bands varies slightly.
- Certain titles mostly survived (Mme, Lady, Sir) or did not (Don, Rev, Jonkheer).

In [None]:
# Extract the Title (the word ending with a period) from each Name
# and retain it as a new feature for model training.
for dataset in all_data:
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

pd.crosstab(train['Title'], train['Sex'])
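
To see what the regular expression captures, here is a small sketch on three sample name strings in the dataset's "Surname, Title. Given names" layout. The pattern looks for a space, then a run of letters, then a literal period, and keeps only the letters.

```python
import pandas as pd

# Sample strings in the "Surname, Title. Given names" layout.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# The capture group keeps the letters between the space and the period.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

With `expand=False` and a single capture group, `str.extract` returns a Series rather than a one-column DataFrame.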


We can replace many titles with a more common name or group them as Rare.

In [None]:
for dataset in all_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col',\
 	'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    
train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()


### Show Titles names in a graph

In [None]:
plt.figure(figsize=(12,5))

#Plotting the result
sns.countplot(x='Title', data=train, palette="hls")
plt.xlabel("Title", fontsize=16) # Setting the x-axis label and size
plt.ylabel("Count", fontsize=16) # Setting the y-axis label and size
plt.title("Title Name Count", fontsize=20) 
plt.xticks(rotation=45)
plt.show()


We can convert the categorical titles to ordinal.



In [None]:
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in all_data:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

train.head()


- We can drop the Name feature now from our data.
- We also don't need the PassengerId column in the training dataset.

In [None]:
train = train.drop(['Name', 'PassengerId'], axis=1)
test = test.drop(['Name'], axis=1)
all_data = [train, test]


In [None]:
train.head()


In [None]:
test.head()

## 4. Applying ML Models:

In [None]:
X_train = train.drop("Survived", axis=1)
Y_train = train["Survived"]
X_test  = test.drop("PassengerId", axis=1).copy()


In [None]:
#Apply RandomForestClassifier
random_forest= RandomForestClassifier(n_estimators=100,
                             max_features='sqrt',  # 'auto' meant 'sqrt' for classifiers and was removed in newer scikit-learn
                             criterion='entropy',
                             max_depth=10)
random_forest.fit(X_train, Y_train)

Y_prediction = random_forest.predict(X_test)

# Accuracy on the training data (not a held-out estimate)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
print(round(acc_random_forest,2,), "%")




In [None]:
#Apply GradientBoostingClassifier

from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                 max_depth=1, random_state=0).fit(X_train, Y_train)
y_prediction= clf.predict(X_test)
# Accuracy on the training data (not a held-out estimate)
acc_clf = round(clf.score(X_train, Y_train) * 100, 2)
print(round(acc_clf,2,), "%")


In [None]:
#Apply LGBMClassifier

from lightgbm import LGBMClassifier
model = LGBMClassifier().fit(X_train, Y_train)
y_predict= model.predict(X_test)
# Accuracy on the training data (not a held-out estimate)
acc_model = round(model.score(X_train, Y_train) * 100, 2)
print(round(acc_model,2,), "%")



In [None]:
#Apply Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)

Y_pred = logreg.predict(X_test)

acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
print(round(acc_log,2,), "%")


In [None]:
# Apply Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)

Y_pred = decision_tree.predict(X_test)

acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
print(round(acc_decision_tree,2,), "%")


In [None]:
from xgboost import XGBClassifier

params_xgb = {'colsample_bylevel': 0.7, 'learning_rate': 0.03, 'max_depth': 3, 
              'n_estimators': 400, 'reg_lambda': 15, 'subsample': 0.5}
xgb = XGBClassifier(**params_xgb)
y_preds = xgb.fit(X_train, Y_train).predict(X_test)
acc_xgb = round(xgb.score(X_train, Y_train) * 100, 2)
print(round(acc_xgb,2,), "%")



### The Best Model?


In [None]:
results = pd.DataFrame({
    'Model': ['LGBMClassifier', 'Logistic Regression', 
              'Random Forest', 'Boosting', 
              'Decision Tree','xgb'],
    'Score': [ acc_model,acc_log,
              acc_random_forest, acc_clf,
              acc_decision_tree,acc_xgb]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
result_df.head(7)
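
The scores in this table are accuracies on the training data, so a flexible model such as an unconstrained decision tree can look best simply by memorizing the rows. A fairer comparison, sketched here on synthetic data from `make_classification` (an assumption, not the Titanic features), is to report cross-validated accuracy next to the training accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

cv_means = {}
for name, model in models.items():
    # 5-fold cross-validation: each fold is scored on rows the model did not see.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    # Training accuracy, computed the same way as in the table above.
    train_acc = model.fit(X, y).score(X, y)
    print(f"{name}: train={train_acc:.3f}  cv={cv_means[name]:.3f}")
```

Swapping `X, y` for `X_train, Y_train` should give a ranking that is less biased toward models that overfit.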


In [None]:
from sklearn.model_selection import cross_val_score
rf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(rf, X_train, Y_train, cv=10, scoring = "accuracy")
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())


## Feature Importance

In [None]:
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(random_forest.feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False).set_index('feature')
importances.head(15)



In [None]:
importances.plot.bar()


According to the random forest's feature importances, Title had the most impact on survival!
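
Impurity-based importances from a random forest are computed on the training data and can favor some features over others. As a cross-check, scikit-learn also offers `permutation_importance`, which measures how much the score drops when one column is shuffled. The sketch below runs on synthetic data (an assumption, not the Titanic features) and is not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared training features.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the validation set 5 times and record the accuracy drop.
result = permutation_importance(forest, X_val, y_val, n_repeats=5, random_state=0)

# Print features from most to least important.
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f}")
```

If both methods rank the same feature first, the conclusion is on firmer ground.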

### Test

In [None]:
params_xgb = {'colsample_bylevel': 0.7, 'learning_rate': 0.03, 'max_depth': 3, 
              'n_estimators': 400, 'reg_lambda': 15, 'subsample': 0.5}
xgb = XGBClassifier(**params_xgb)

y_preds = xgb.fit(X_train, Y_train).predict(X_test)
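# Call score() to get the training accuracy and report it as a percentage
print("Score: ", round(xgb.score(X_train, Y_train) * 100, 2), "%")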





### Submission


In [None]:
submission = pd.DataFrame({
        "PassengerId": test['PassengerId'],
        "Survived":  y_preds
    })

submission.to_csv('submission.csv', index=False)
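
Before uploading, it can help to sanity-check the submission frame. The helper below, `check_submission`, is a hypothetical function written for this sketch (it is not part of pandas or Kaggle), and the IDs and predictions are made up.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, test_ids: pd.Series) -> list:
    """Return a list of problems found; an empty list means the frame looks fine."""
    problems = []
    if list(submission.columns) != ["PassengerId", "Survived"]:
        problems.append("unexpected columns")
        return problems
    if len(submission) != len(test_ids):
        problems.append("row count differs from test set")
    if not submission["Survived"].isin([0, 1]).all():
        problems.append("Survived contains values other than 0/1")
    return problems

# Made-up IDs and predictions for illustration.
ids = pd.Series([892, 893, 894])
good = pd.DataFrame({"PassengerId": ids, "Survived": [0, 1, 0]})
bad = pd.DataFrame({"PassengerId": ids, "Survived": [0, 2, 1]})

good_problems = check_submission(good, ids)
bad_problems = check_submission(bad, ids)
print(good_problems)
print(bad_problems)
```

In this notebook the call would be `check_submission(submission, test['PassengerId'])`.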
