## Introduction:

This is my first post on kaggle.com, written as part of the "Titanic: Machine Learning from Disaster" competition. The Titanic dataset lets us practise supervised learning, more precisely a classification problem, and it is easy to understand, which is why many analysts around the world use it. I would like to introduce myself to the Kaggle community as a data analyst through this dataset, sharing my experience of building a classification model that predicts a binary variable. In this analysis we use logistic regression, a classification algorithm for binary targets coded as 1 (Survived, Yes, Success, etc.) and 0 (Not Survived, No, Failure, etc.). In other words, a logistic regression model predicts P(Y=1) as a function of X.
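To make "P(Y=1) as a function of X" concrete, here is a minimal sketch of the logistic (sigmoid) function applied to a single feature. The intercept and coefficient are hypothetical values chosen only for illustration; they are not fitted on the Titanic data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters, NOT estimated from the Titanic data
intercept = -1.0
coef = np.array([2.5])  # weight of one feature, e.g. Sex (0 = male, 1 = female)

x_male = np.array([0.0])
x_female = np.array([1.0])

# P(Y=1 | X) = sigmoid(intercept + coef . X)
p_male = sigmoid(intercept + coef @ x_male)
p_female = sigmoid(intercept + coef @ x_female)
print(round(p_male, 3), round(p_female, 3))
```

A larger linear score always gives a larger probability, and a score of 0 corresponds to a probability of exactly 0.5.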

## Objectives: 

1. Look at the big picture
2. Get the data (loading)
3. Exploratory data analysis to find hidden patterns in the data
4. Visualization to gain insight into the relationships between dependent and independent variables and to spot correlations and dependencies
5. Build the relationship pattern between the dependent variable (Survival) and the independent variables
6. Feature engineering
7. Prepare the machine learning algorithm
8. Select the best model
9. Tune the selected model to increase the accuracy

### Look at the Big Picture

The first question to ask your business is what exactly is the business objective; building a model is probably
not the end goal. How does the company expect to use and benefit from this model? This is important
because it will determine how you frame the problem, what algorithms you will select, what performance
measure you will use to evaluate your model, and how much effort you should spend tweaking it.

### Import the Libraries 

In [1]:
# Linear algebra
import numpy as np 

# Data processing
import pandas as pd 


# Data Visualization
#from PIL import  Image
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns


### Get the dataset (training)

In [3]:
train_df = pd.read_csv("c:\\Data Science\\Titanic_kaggle\\train.csv")
train_df.head(6)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q


### Exploratory Data Analysis

#### Data definition

#### Data info
The `info()` method reports the number of rows and columns, the data type of each column, and whether null values exist in them.
The training set has 891 observations and 11 features plus the target variable (Survived). Of the 12 columns, 2 are floats, 5 are integers and 5 are objects.

In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


#### Data Shape

In [5]:
train_df.shape

(891, 12)

In [6]:
train_df.nunique()

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64

The `nunique()` method helps us understand what each column means. The Survived and Sex columns have only 2 unique values each, which usually indicates categorical columns: here 0 or 1 for Survived and male or female for Sex.

We can also spot other categorical columns such as Embarked and Pclass. We can't really tell yet what Pclass stands for, so let's explore further.

### Summary Statistics

In [7]:
train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


From the summary statistics we find that the survival rate was about 38%, so the classes are not severely imbalanced.
Passenger ages range from 0.42 to 80, with an average of about 29.7.
We also detect that some features have missing values, e.g. Age has only 714 non-null entries out of 891.

In [8]:
data = [train_df]
titles = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

for dataset in data:
    # extract titles
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
    # replace titles with a more common title or as Rare
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr',\
                                            'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    # convert titles into numbers
    dataset['Title'] = dataset['Title'].map(titles)
    # fill any title that is not in the mapping with 0, to be safe
    dataset['Title'] = dataset['Title'].fillna(0)
###train_df = train_df.drop(['Name'], axis=1)
train_df.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,3
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,2
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,3
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1


### Create Group Bucket for Age attributes

In [None]:
# Age to categorical column
def age_group(train_df) :
    
    if train_df["Age"] <= 10 :
        return "1_10"
    elif (train_df["Age"] > 10) & (train_df["Age"] <= 20 ):
        return "11_20"
    elif (train_df["Age"] > 20) & (train_df["Age"] <= 30) :
           return "21_30"
    elif (train_df["Age"] > 30) & (train_df["Age"] <= 40) :
           return "31_40"
    elif (train_df["Age"] > 40) & (train_df["Age"] <= 50) :
           return "41_50"
           
    elif (train_df["Age"] > 50) & (train_df["Age"] <= 60) :
           return "51_60"
    elif train_df["Age"] > 60 :
           return "Over_60"
    
train_df["Age_group"] = train_df.apply(lambda train_df:age_group(train_df), axis = 1)

train_df.head(5)
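As an alternative to the row-wise `apply`, the same buckets can be sketched with `pd.cut`. This is only an illustration on a small made-up series of ages, not the notebook's data. Note that the lowest bin here starts just above 0, whereas the function above also accepts an age of exactly 0.

```python
import numpy as np
import pandas as pd

# Made-up ages, including a missing value
ages = pd.Series([0.42, 10, 15, 28, 35, 47, 60, 80, np.nan], name="Age")

bins = [0, 10, 20, 30, 40, 50, 60, np.inf]
labels = ["1_10", "11_20", "21_30", "31_40", "41_50", "51_60", "Over_60"]

# right=True (the default) gives intervals of the form (lower, upper]
age_group = pd.cut(ages, bins=bins, labels=labels)
print(age_group.tolist())
```

Missing ages stay missing, so they can still be filled with the most frequent group in the preprocessing step.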

### Create Group Bucket for Fare attributes

In [None]:
train_df.describe()

In [None]:
# Fare to categorical column
def Fare_group(train_df) :
    
    if train_df["Fare"] <= 20 :
        return "3"
    elif (train_df["Fare"] > 20) & (train_df["Fare"] <= 50 ):
        return "2"
    elif (train_df["Fare"] > 50) :
           return "1"
    
    
train_df["Fare_group"] = train_df.apply(lambda train_df:Fare_group(train_df), axis = 1)

train_df.head(5)

In [None]:
train_df.info()

In [None]:
train_df.Fare_group.nunique()

In [None]:

train_df.groupby(['Age_group','Survived']).count()
#pd.crosstab([df.sex,df.survived],df.pclass,margins=True).style.background_gradient(cmap='summer_r')
#pd.crosstab([train_df.Sex,train_df.Survived]).style.background_gradient(cmap='summer_r')

### Outliers Detection

I analysed each numeric variable individually to check whether any outlier values are present. Outliers can be defined as values outside the range [(Q1 - 1.5 * IQR), (Q3 + 1.5 * IQR)] (the Tukey method). The following "detect_outliers" function applies this rule to each selected variable and returns the rows that are outliers on more than n variables; those rows are then dropped from the training set.

In [None]:
from collections import Counter

def detect_outliers(df, n, features):
    """
    Takes a dataframe df of features and returns a list of the indices
    corresponding to the observations containing more than n outliers according
    to the Tukey method.
    """
    outlier_indices = []

    # iterate over features (columns)
    for col in features:
        # 1st and 3rd quartiles, ignoring missing values (Age contains NaN)
        Q1 = np.nanpercentile(df[col], 25)
        Q3 = np.nanpercentile(df[col], 75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1

        # outlier step
        outlier_step = 1.5 * IQR

        # Determine the indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)].index
        # append the found outlier indices for col to the list of outlier indices
        outlier_indices.extend(outlier_list_col)

    # after the loop: select observations flagged as outliers in more than n features
    outlier_counts = Counter(outlier_indices)
    multiple_outliers = [k for k, v in outlier_counts.items() if v > n]

    return multiple_outliers

# detect outliers from Age, SibSp, Parch and Fare
Outliers_to_drop = detect_outliers(train_df, 2, ["Age", "SibSp", "Parch", "Fare"])
train_df.loc[Outliers_to_drop]  # Show the outlier rows
# Drop outliers
train_df = train_df.drop(Outliers_to_drop, axis=0).reset_index(drop=True)
train_df.info()
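The Tukey rule itself can be checked on a handful of numbers. The sketch below uses made-up fare-like values (not the full Fare column) to show how the fences are computed and which values fall outside them.

```python
import numpy as np

# Made-up values for illustration only
fares = np.array([7.25, 7.9, 8.05, 13.0, 14.45, 26.0, 31.0, 53.1, 512.33])

q1, q3 = np.percentile(fares, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values below the lower fence or above the upper fence are flagged
outliers = fares[(fares < lower) | (fares > upper)]
print(lower, upper, outliers)
```

Only the extreme value is flagged here; a moderately high value such as 53.1 stays inside the fences.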
    

In [None]:
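# Quartiles and IQR of the numeric columns only
num_df = train_df.select_dtypes(include="number")
Q1 = num_df.quantile(0.25)
Q3 = num_df.quantile(0.75)
IQR = Q3 - Q1
IQR

In [None]:
# Rows with no numeric value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR];
# stored separately so that train_df keeps its rows for the later steps
is_outlier = (num_df < (Q1 - 1.5 * IQR)) | (num_df > (Q3 + 1.5 * IQR))
train_df_out = train_df[~is_outlier.any(axis=1)]

In [None]:
train_df_out.info()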

In [None]:
plt.figure(figsize=(12,4))
sns.boxplot(x='Pclass',y='Age',hue = 'Survived', data=train_df)

In [None]:
plt.figure(figsize=(12,4))
sns.boxplot(x="Embarked", y="Fare", hue="Survived", data=train_df)

In [None]:
plt.figure(figsize=(12,4))
sns.boxplot(x='Age', data=train_df)
##sns.boxplot(x='Pclass',y='Age',hue = 'Survived', data=train_df)

### Missing Value

In [None]:
#axis = 0 means vertically
train_df.apply(lambda x: sum(x.isnull()), axis=0) 

In [None]:
# Get more details about missing value
total = train_df.isnull().sum().sort_values(ascending=False)
percent_1 = train_df.isnull().sum()/train_df.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])
missing_data.head(4)

### Visualize Target variable - Survival Distribution

In [None]:
# Visualization: Survival Rate
f,ax=plt.subplots(1,2,figsize=(10,4))
train_df['Survived'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Survived')
ax[0].set_ylabel('')
sns.countplot(x='Survived',data=train_df,ax=ax[1])
ax[1].set_title('Survived')
plt.show()

#### The overall survival rate was 38.4%; next we look at how those survivors are distributed across passenger attributes

In [None]:
# Visualization : Survival Vs Sex
f,ax=plt.subplots(1,2,figsize=(10,4))
train_df[['Sex','Survived']].groupby(['Sex']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Survived vs sex')
sns.countplot(x='Sex',hue='Survived',data=train_df,ax=ax[1])
ax[1].set_title('Survived vs Dead')
plt.show()

In [None]:
# Visualization : Port of Embarkation and Passenger Class vs Survival
f,ax=plt.subplots(1,2,figsize=(12,4))
sns.countplot(x='Embarked',hue='Survived',data=train_df,ax=ax[0])
ax[0].set_title('Embarked:Survived')

sns.countplot(x='Pclass',hue='Survived',data=train_df,ax=ax[1])
ax[1].set_title('Pclass: Survived')
plt.show()

In [None]:
plt.rcParams["figure.figsize"]= (14, 4)
sns.countplot(x="Age_group", hue="Survived", data=train_df, palette="Set2")
plt.show()

In [None]:
genders = {"male": 0, "female": 1}
data = [train_df]

for dataset in data:
    dataset['Sex'] = dataset['Sex'].map(genders)

In [None]:
train_df

In [None]:
train_df.info()

### Data Preprocessing

In [None]:
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer
imputer = SimpleImputer(strategy="median")

tr_df_cat = train_df[["Embarked", "Pclass", "Age_group"]]
# Fare_group holds the strings "1"/"2"/"3", so convert every column to a numeric dtype first
tr_df_num = train_df[["Survived","Sex","Title","Age","Fare","Fare_group","SibSp","Parch"]].apply(pd.to_numeric)

# for numerical missing values
imputer.fit(tr_df_num)
X = imputer.transform(tr_df_num)
df_tr_num = pd.DataFrame(X, columns=tr_df_num.columns, index=tr_df_num.index)

# for categorical missing values: fill with the most frequent category
df_tr_cat = tr_df_cat.apply(lambda x: x.fillna(x.value_counts().index[0]))

cat_cols = ["Embarked", "Pclass", "Age_group"]
# encode the filled frame (df_tr_cat), not the raw one that still has missing values
cat_processed = pd.get_dummies(df_tr_cat, prefix_sep="_", columns=cat_cols)

tr_df_prepared = pd.concat([df_tr_num, cat_processed], axis=1)

In [None]:
tr_df_prepared.head(5)

### Features Selection

In [None]:
# Correlation matrix
# Feature selection: the matrix scores the correlation between each pair of variables,
# i.e. how the variables are interrelated with each other
# Variables whose correlation with the target is high are good candidates for prediction

sns.heatmap(tr_df_prepared.corr(),annot=True,cmap='RdYlGn',linewidths=0.4) #data.corr()-->correlation matrix
fig=plt.gcf()
fig.set_size_inches(20,12)
plt.show()

In [None]:
corr_matrix = tr_df_prepared.corr()
corr_matrix["Survived"].sort_values(ascending=False)

### Important features

In this case, not all features are necessary to build the classification model. We only need to consider the independent variables that are most strongly related to the Survived target variable, and we select them based on the strength of their correlation with it.

According to the correlation matrix we choose five features that are related to Survived (i.e. Sex, Title, Pclass_1, Embarked_C and Age_group_1_10).
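Ranking features by the strength of their correlation with the target can be sketched as follows. The small data frame is made up for illustration and only borrows a few column names; the values are not taken from the Titanic data.

```python
import pandas as pd

# Tiny made-up frame, for illustration only
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 1, 0],
    "Sex":      [0, 1, 1, 0, 1, 0, 1, 1],
    "Pclass_1": [0, 1, 0, 0, 1, 0, 1, 0],
    "SibSp":    [1, 1, 0, 1, 0, 0, 0, 1],
})

# Correlation of every feature with the target, ranked by absolute value
corr_with_target = toy.corr()["Survived"].drop("Survived")
ranked = corr_with_target.abs().sort_values(ascending=False)
top_features = ranked.head(2).index.tolist()
print(ranked)
print(top_features)
```

Ranking by absolute value keeps strongly negative correlations in view as well, since they can be just as informative as positive ones.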


### Splitting dataset into training (80%) and testing (20%)

In [None]:
tr_df_prepared.head(3)

In [None]:
# Target 
y = tr_df_prepared['Survived']
# Independent variable
X = tr_df_prepared[['Sex','Title', 'Pclass_1','Embarked_C', 'Age_group_1_10']]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)

In [None]:
print(len(X_train), len(y_train))

In [None]:
print(len(X_test), len(y_test))

### Scaling the Input (X_train)

In [None]:
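from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
# apply the scaling learned on the training data to the test data as well,
# otherwise the model is evaluated on differently scaled inputs
X_test = scaler.transform(X_test)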
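One way to keep scaling and model fitting consistent is to wrap both in a scikit-learn pipeline, so the scaler is fitted on the training data only and reused automatically at prediction time. The sketch below uses synthetic data (the names `X_demo` and `y_demo` are placeholders), not the Titanic features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, roughly linearly separable data for illustration
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline scales with training statistics, then fits the classifier
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(Xtr, ytr)
test_accuracy = model.score(Xte, yte)  # test data is scaled with the same statistics
print(test_accuracy)
```

The same pipeline object can be passed to `cross_val_score`, in which case the scaler is refitted inside each fold.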

### Model - Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score

Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).

In [None]:
# train logistic regression model
log_reg = LogisticRegression()
log_reg = log_reg.fit(X_train, y_train)

Train_model_score = round(log_reg.score(X_train,y_train)*100,2)
print("Train Model Score : " ,Train_model_score)

# cross validation check
validation_score = cross_val_score(log_reg, X_train, y_train, cv=5, scoring ="accuracy")
print("Cross Validation Score: ", validation_score)
print("Mean validation score: ", validation_score.mean())

# Prediction
y_pred = log_reg.predict(X_test)

# scikit-learn metrics expect the true labels first, then the predictions
print("Test Model Score: ", accuracy_score(y_test, y_pred))

In [None]:
Actual_Predicted= pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})  
Actual_Predicted.head(5)

In [None]:
#axis = 0 means vertically
tr_df_prepared.apply(lambda x: sum(x.isnull()), axis=0) 

In [None]:
tr_df_prepared.corr()

### Random Forest

In [None]:
# Target 
y = tr_df_prepared['Survived']
# Independent variable
X = tr_df_prepared[['Sex','Title', 'Pclass_1','Fare','Embarked_C','Age_group_1_10','Age_group_11_20','Pclass_2','Pclass_3','Embarked_S' ]]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestClassifier
# max_features='sqrt' is what the removed 'auto' option meant for classifiers
random_forest = RandomForestClassifier(n_estimators=300, max_depth=4, bootstrap=True, max_features='sqrt', random_state=123)
random_forest.fit(X_train, y_train)

Y_prediction = random_forest.predict(X_test)

# cross validation check
validation_score = cross_val_score(random_forest, X_train, y_train, cv=5, scoring ="accuracy")
print("Cross Validation Score: ", validation_score)
print("Mean validation score: ", validation_score.mean())

acc_random_forest = round(random_forest.score(X_train, y_train) * 100, 2)
print("Train Model Score: ",acc_random_forest)
print("Test Model Score: ", accuracy_score(y_test, Y_prediction))

In [None]:
from sklearn.model_selection import GridSearchCV
RF = RandomForestClassifier(random_state=123)
param_grid = [{'n_estimators':  [4, 5, 10, 20, 50,200]}]

grid_search_RF = GridSearchCV(RF, param_grid, cv=10,scoring='roc_auc')
grid_search_RF.fit(X_train, y_train)

Y_prediction1 = grid_search_RF.predict(X_test)
print("Test Model Score: ", accuracy_score(y_test, Y_prediction1))

In [None]:
cvres_RF = grid_search_RF.cv_results_

for mean_score, params in zip(cvres_RF["mean_test_score"], cvres_RF["params"]):
    print(mean_score, params)

In [None]:
grid_search_RF.best_params_

### Test Data

In [None]:
test_df = pd.read_csv(r"c:\Data Science\Kaggle\Titanic\Test.csv")
test_df.head(5)

In [None]:
data1 = [test_df]
titles = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

for dataset in data1:
    # extract titles
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
    # replace titles with a more common title or as Rare
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr',\
                                            'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    # convert titles into numbers
    dataset['Title'] = dataset['Title'].map(titles)
    # fill any title that is not in the mapping with 0, to be safe
    dataset['Title'] = dataset['Title'].fillna(0)
###train_df = train_df.drop(['Name'], axis=1)
test_df.head(5)

In [None]:
# Age to categorical column
def age_group(test_df) :
    
    if test_df["Age"] <= 10 :
        return "1_10"
    elif (test_df["Age"] > 10) & (test_df["Age"] <= 20 ):
        return "11_20"
    elif (test_df["Age"] > 20) & (test_df["Age"] <= 30) :
           return "21_30"
    elif (test_df["Age"] > 30) & (test_df["Age"] <= 40) :
           return "31_40"
    elif (test_df["Age"] > 40) & (test_df["Age"] <= 50) :
           return "41_50"
           
    elif (test_df["Age"] > 50) & (test_df["Age"] <= 60) :
           return "51_60"
    elif test_df["Age"] > 60 :
           return "Over_60"
    
test_df["Age_group"] = test_df.apply(lambda test_df:age_group(test_df), axis = 1)

genders = {"male": 0, "female": 1}
data = [test_df]

for dataset in data:
    dataset['Sex'] = dataset['Sex'].map(genders)

test_df.head(5)

In [None]:
# Fare to categorical column
def Fare_group(test_df) :
    
    if test_df["Fare"] <= 20 :
        return "3"
    elif (test_df["Fare"] > 20) & (test_df["Fare"] <= 50 ):
        return "2"
    elif (test_df["Fare"] > 50) :
           return "1"
    
    
test_df["Fare_group"] = test_df.apply(lambda test_df:Fare_group(test_df), axis = 1)

test_df.head(5)

In [None]:
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer
imputer = SimpleImputer(strategy="median")

te_df_cat = test_df[["Embarked", "Pclass", "Age_group"]]
# Fare_group holds the strings "1"/"2"/"3", so convert every column to a numeric dtype first
te_df_num = test_df[["Title","Age","Sex","SibSp","Parch","Fare","Fare_group"]].apply(pd.to_numeric)

# for numerical missing values
imputer.fit(te_df_num)
X = imputer.transform(te_df_num)
df_te_num = pd.DataFrame(X, columns=te_df_num.columns, index=te_df_num.index)

# for categorical missing values: fill with the most frequent category
df_te_cat = te_df_cat.apply(lambda x: x.fillna(x.value_counts().index[0]))

cat_cols = ["Embarked", "Pclass", "Age_group"]
# encode the filled frame (df_te_cat), not the raw one that still has missing values
cat_processed = pd.get_dummies(df_te_cat, prefix_sep="_", columns=cat_cols)

te_df_prepared = pd.concat([df_te_num, cat_processed], axis=1)
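Because `get_dummies` only creates columns for the categories it actually sees, the encoded test frame can end up with different columns from the training frame. A possible safeguard, sketched here on made-up data with placeholder names (`train_cat`, `test_cat`), is to reindex the test dummies to the training columns.

```python
import pandas as pd

# Made-up frames: the toy test set contains no "Q" port
train_cat = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test_cat = pd.DataFrame({"Embarked": ["S", "C", "S"]})

train_dummies = pd.get_dummies(train_cat, columns=["Embarked"])
test_dummies = pd.get_dummies(test_cat, columns=["Embarked"])

# Add any column missing from the test set (filled with 0) and match the order
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned.columns.tolist())
```

After this step the test frame has exactly the columns the model was trained on, in the same order.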

In [None]:
test_df.info()

#### Model Evaluation : Confusion Matrix

In [None]:
# same columns, in the same order, as the features the random forest was trained on
final_test = te_df_prepared[['Sex','Title', 'Pclass_1','Fare','Embarked_C','Age_group_1_10','Age_group_11_20' ,'Pclass_2', 'Pclass_3','Embarked_S']]
final_test.head(5)

In [None]:
#axis = 0 means vertically

final_test.apply(lambda x: sum(x.isnull()), axis=0)

In [None]:
final_prediction = random_forest.predict(final_test)

In [None]:
submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        # the imputed target is float, so cast the predictions back to 0/1 integers
        "Survived": final_prediction.astype(int)
    })
submission.to_csv(r"c:\Data Science\Kaggle\Titanic\zaa6.csv", index=False)

In [None]:
submission

A much better way to evaluate the performance of a classifier is to look at the confusion matrix. The
general idea is to count the number of times instances of class A are classified as class B. For example, to
know the number of times the classifier confused images of 5s with 3s, you would look in the 5th row and
3rd column of the confusion matrix.

To compute the confusion matrix, you first need to have a set of predictions, so they can be compared to
the actual targets. You could make predictions on the test set, but let’s keep it untouched for now
(remember that you want to use the test set only at the very end of your project, once you have a classifier
that you are ready to launch). Instead, you can use the cross_val_predict() function:

from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

Just like the cross_val_score() function, cross_val_predict() performs K-fold cross-validation,
but instead of returning the evaluation scores, it returns the predictions made on each test fold. This means
that you get a clean prediction for each instance in the training set (“clean” meaning that the prediction is
made by a model that never saw the data during training). Now you are ready to get the confusion matrix using the
confusion_matrix() function. Just pass it the target classes (y_train_5) and the predicted classes (y_train_pred):

In [None]:
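from sklearn.metrics import confusion_matrix

# out-of-fold predictions for the training set, so they line up with y_train
y_pred = cross_val_predict(log_reg, X_train, y_train, cv=3)

# use a new name so that the confusion_matrix function is not overwritten
conf_mat = confusion_matrix(y_train, y_pred)
conf_mat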

In [None]:
Actual_Predicted= pd.DataFrame({'Actual': y_train, 'Predicted': y_pred})  
Actual_Predicted.head(5)

In [None]:
from sklearn import metrics 
# labels must be passed by keyword in current scikit-learn; [1, 0] puts the positive class first
tree_cm = metrics.confusion_matrix(y_train, y_pred, labels=[1, 0])

sns.heatmap(tree_cm, annot=True,  fmt='.2f', xticklabels = ["Positive", "Negative"] , yticklabels = ["Positive", "Negative"] )
plt.ylabel('Actual')
plt.xlabel('Predicted')
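The combination of `cross_val_predict` and `confusion_matrix` can be tried end to end on synthetic data. The sketch below is self-contained and uses placeholder names (`X_demo`, `y_demo`); it is not the notebook's model or data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic binary problem with some label noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 2))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=150) > 0).astype(int)

# One out-of-fold prediction per training instance
y_demo_pred = cross_val_predict(LogisticRegression(), X_demo, y_demo, cv=3)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_demo, y_demo_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
```

With the default label order (0, then 1), `ravel()` returns the counts as true negatives, false positives, false negatives and true positives.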

#### Precision and Recall

Accuracy = (TP (True Positive) + TN (True Negative)) / Total (total number of predictions)

Recall (True Positive Rate): When it's actually yes, how often does it predict yes?
Recall = TP/(TP+FN) = 618/(618+840) = 42%

Precision: When it predicts yes, how often is it correct?
Precision = TP/(TP+FP) = 618/(618+359) = 63%
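The arithmetic above can be reproduced directly from the three counts quoted in the text:

```python
# Counts taken from the worked example above
tp, fn, fp = 618, 840, 359

recall = tp / (tp + fn)      # share of actual positives that were found
precision = tp / (tp + fp)   # share of predicted positives that were correct
print(f"Recall: {recall:.1%}, Precision: {precision:.1%}")
```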

In [None]:
from sklearn.metrics import precision_score, recall_score
precision_score(y_train, y_pred)  

In [None]:
recall_score(y_train, y_pred) 

### F1 Score

It is often convenient to combine precision and recall into a single metric called the F1
score, in particular if you need a simple way to compare two classifiers. The
F1 score is the harmonic mean of precision and recall. Whereas the regular mean treats all values equally, the harmonic
mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if
both recall and precision are high.
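A short sketch of why the harmonic mean punishes a low value: the two hypothetical classifiers below have the same arithmetic mean of precision and recall (0.6), but very different F1 scores. The precision and recall values are illustrative, not results from this notebook.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

balanced = f1(0.6, 0.6)   # precision and recall are equal
lopsided = f1(1.0, 0.2)   # same arithmetic mean, but recall is very low
print(balanced, lopsided)
```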

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_train,y_pred))

### ROC Curve

The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers.
It is very similar to the precision/recall curve, but instead of plotting precision versus recall, the ROC
curve plots the true positive rate (another name for recall) against the false positive rate.

To plot the ROC curve, you first need to compute the TPR and FPR for various threshold values, using the
roc_curve() function:

In [None]:
from sklearn.metrics import roc_curve, auc
#fpr, tpr, thresholds = roc_curve(y_test, log_reg.predict_proba(X_train)[:,1])  
## fpr, tpr, thresholds = roc_curve( X_train, y_score)
fpr, tpr, thresholds = roc_curve(y_test, log_reg.predict_proba(X_test)[:,1])

# Calculate the AUC 
roc_auc = auc(fpr, tpr) 
print('ROC AUC: %0.2f' % roc_auc) 
   
# Plot of a ROC curve for a specific class 
plt.figure() 
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc) 
plt.plot([0, 1], [0, 1], 'k--') 
plt.xlim([0.0, 1.0]) 
plt.ylim([0.0, 1.05]) 
plt.xlabel('False Positive Rate') 
plt.ylabel('True Positive Rate') 
plt.title('ROC Curve') 
plt.legend(loc="lower right") 
plt.show()

Once again there is a tradeoff: the higher the recall (TPR), the more false positives (FPR) the classifier
produces. The dotted line represents the ROC curve of a purely random classifier; a good classifier stays
as far away from that line as possible (toward the top-left corner).
One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier will
have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5.

ROC is a probability curve and AUC represents the degree or measure of separability. It tells us how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the better the model is at distinguishing between patients with a disease and patients without it.
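The interpretation of AUC as separability can be checked on a tiny made-up example: the AUC equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])

# Every positive is scored above every negative
perfect_scores = np.array([0.1, 0.2, 0.8, 0.9])
# One negative (0.8) is scored above one positive (0.2)
mixed_scores = np.array([0.1, 0.8, 0.2, 0.9])

auc_perfect = roc_auc_score(y_true, perfect_scores)
auc_mixed = roc_auc_score(y_true, mixed_scores)
print(auc_perfect, auc_mixed)
```

In the second case three of the four positive/negative pairs are ordered correctly, which gives an AUC of 0.75.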

### Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier
#! pip install graphviz
#! pip install pydotplus
from sklearn.tree import export_graphviz
from sklearn import tree
from graphviz import Source
from IPython.display import SVG,display


In [None]:
clf = DecisionTreeClassifier(max_depth = 3,
                                           splitter  = "best",
                                           criterion = "gini")
clf = clf.fit(X_train, y_train)

#Predict the response for test dataset
y_test_pred = clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_test_pred))

In [None]:
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six was removed from scikit-learn
from IPython.display import Image  
import pydotplus
features = list(X.columns)
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True, feature_names = features, class_names=['Not Survived','Survived'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('titanic_tree.png')
Image(graph.create_png())

In [None]:
from sklearn import metrics 
# labels must be passed by keyword in current scikit-learn; [1, 0] puts the positive class first
dt_cm = metrics.confusion_matrix(y_test, y_test_pred, labels=[1, 0])

sns.heatmap(dt_cm, annot=True,  fmt='.2f', xticklabels = ["Positive", "Negative"] , yticklabels = ["Positive", "Negative"] )
plt.ylabel('Actual')
plt.xlabel('Predicted')

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_pred))

### Random Forest Tree

In [None]:
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(max_depth = 3, min_samples_split=2, n_estimators =100, random_state = 1)

RF= forest.fit(X_train,y_train)
RF_Pred = RF.predict(X_test)

Model_forest = round(RF.score(X_train,y_train)*100,2)
print("Training Model Score : " , Model_forest)
Acc_forest = metrics.accuracy_score( y_test, RF_Pred)
print("Acc_Score : ", Acc_forest)

# Confusion matrix


cnf_metrix = (metrics.confusion_matrix(y_test,RF_Pred))
cmap = sns.cubehelix_palette(50, hue=0.5, rot=0, light=0.9, dark=0, as_cmap=True)
sns.heatmap(cnf_metrix,cmap = cmap,xticklabels=['0','1'],yticklabels=['0','1'],annot=True, fmt="d",)
plt.xlabel('Predicted')
plt.ylabel('Actual')

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, RF_Pred))

In [None]:
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             roc_auc_score, f1_score, cohen_kappa_score)
# plotly is assumed here because ff.create_table and py.iplot are used below
import plotly.figure_factory as ff
import plotly.offline as py

#gives model report in dataframe
def model_report(model,X_train,X_test,y_train,y_test,name) :
    model.fit(X_train,y_train)
    predictions  = model.predict(X_test)
    accuracy     = accuracy_score(y_test,predictions)
    recallscore  = recall_score(y_test,predictions)
    precision    = precision_score(y_test,predictions)
    roc_auc      = roc_auc_score(y_test,predictions)
    f1score      = f1_score(y_test,predictions) 
    #kappa_metric = cohen_kappa_score(y_test,predictions)
    
    df = pd.DataFrame({"Model"           : [name],
                       "Accuracy_score"  : [accuracy],
                       "Recall_score"    : [recallscore],
                       "Precision"       : [precision],
                       "f1_score"        : [f1score],
                       "Area_under_curve": [roc_auc],
                     #  "Kappa_metric"    : [kappa_metric],
                      })
    return df

#outputs for every model, evaluated against the held-out test labels (y_test)
model1 = model_report(log_reg,X_train,X_test,y_train,y_test,
                      "Logistic Regression(Baseline_model)")

decision_tree = DecisionTreeClassifier(max_depth = 9,
                                       random_state = 123,
                                       splitter  = "best",
                                       criterion = "gini",
                                      )
model4 = model_report(decision_tree,X_train,X_test,y_train,y_test,
                      "Decision Tree")

rfc = RandomForestClassifier(n_estimators = 1000,
                             random_state = 123,
                             max_depth = 9,
                             criterion = "gini")
# the model itself (rfc) must be passed as the first argument
model6 = model_report(rfc,X_train,X_test,y_train,y_test,
                      "Random Forest Classifier")

#concat all models
model_performances = pd.concat([model1,
                                model4,model6,],axis = 0).reset_index(drop=True)

table  = ff.create_table(model_performances.round(4))

py.iplot(table)


## Fine Tune Random Forest

In [None]:
from sklearn.model_selection import GridSearchCV
RF = RandomForestClassifier(random_state=123)
param_grid = [{'n_estimators':  [4, 5, 10, 20, 50]}]

grid_search_RF = GridSearchCV(RF, param_grid, cv=5 ,scoring='roc_auc')
grid_search_RF.fit(X_train, y_train)

In [None]:
grid_search_RF.best_params_

In [None]:
cvres_RF = grid_search_RF.cv_results_

for mean_score, params in zip(cvres_RF["mean_test_score"], cvres_RF["params"]):
    print(mean_score, params)