<center><h1>New Approach to Titanic <sup><b>with PyCaret</b></sup></h1></center>

<img src="https://www.usnews.com/dims4/USNEWS/4f3cd50/2147483647/thumbnail/970x647/quality/85/?url=http%3A%2F%2Fmedia.beam.usnews.com%2F0e%2Fe187dd2f8f1fe5be9058fa8eef419e%2F7018FE_DA_080929titanic.jpg" width=1200 />

In [None]:
#importing libraries

import numpy as np 
import pandas as pd 
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Loading data files

In [None]:
train=pd.read_csv("/kaggle/input/titanic/train.csv")
test=pd.read_csv("/kaggle/input/titanic/test.csv")
result=pd.read_csv("/kaggle/input/titanic/gender_submission.csv")

# Understanding the Data

In [None]:
train.shape #shape of the data

In [None]:
train.info()

In [None]:
train.describe()

## Unique values


In [None]:
for col in train.columns:
    print(col,":",len(train[col].unique()))
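
One detail worth knowing about the loop above: `unique()` counts a missing value (`NaN`) as a value of its own, so columns with gaps report one more than their number of distinct non-missing values. The sketch below shows the difference on a small made-up frame (the `toy` data is an assumption for illustration, not the Titanic data).

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; each column has one missing entry.
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", np.nan],
    "Age": [22.0, np.nan, 22.0, 35.0],
})

for col in toy.columns:
    with_nan = len(toy[col].unique())   # NaN is counted as its own value
    without_nan = toy[col].nunique()    # NaN is dropped by default
    print(col, ":", with_nan, "including NaN,", without_nan, "excluding NaN")
```

If only the non-missing categories matter, `nunique()` is the more direct call.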

In [None]:
!pip install pandas-profiling

## pandas_profiling for simple and fast exploratory data analysis

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

In [None]:
%%capture
import pandas_profiling as pp

# Titanic Data Report

In [None]:
pp.ProfileReport(train)

## Quick Introduction to PyCaret - An open source low-code ML library

<img src="https://pycaret.org/wp-content/uploads/2020/04/thumbnail.png"/>

The PyCaret website and documentation are available at https://pycaret.org

PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within seconds in your choice of notebook environment.

PyCaret being a low-code library makes you more productive. With less time spent coding, you and your team can now focus on business problems.

PyCaret is a simple, easy-to-use machine learning library that helps you run end-to-end ML experiments with fewer lines of code.

PyCaret is a business-ready solution. It lets you prototype quickly and efficiently from your choice of notebook environment.

## Let's install PyCaret!


In [None]:
!pip install pycaret

## Import Whole Classification

In [None]:
from pycaret.classification import *

# Setting up the Environment

**setup()** - This function initializes the environment in PyCaret and creates the transformation pipeline that prepares the data for modeling and deployment. setup() must be called before executing any other function in PyCaret. It takes two mandatory parameters: the data (a dataframe; array-like or sparse matrix) and the name of the target column. All other parameters are optional.

In [None]:
clf = setup(
    data=train,
    target='Survived',
    train_size=0.7,
    numeric_imputation='mean',
    categorical_imputation='mode',
    feature_selection=True,
)

**Parameters:**


data:  dataframe
array-like, sparse matrix, shape (n_samples, n_features) where n_samples is the number of samples and n_features is the number of features.

target: string
Name of the target column to be passed in as a string. The target variable could be binary or multiclass. In case of a multiclass target, all estimators are wrapped with a OneVsRest classifier.

train_size: float, default = 0.7
Size of the training set. By default, 70% of the data will be used for training and validation. The remaining data will be used for a test / hold-out set.


categorical_features: string, default = None
If the inferred data types are not correct, categorical_features can be used to overwrite the inferred type. If when running setup the type of ‘column1’ is inferred as numeric instead of categorical, then this parameter can be used to overwrite the type by passing categorical_features = [‘column1’].

categorical_imputation: string, default = ‘constant’
If missing values are found in categorical features, they will be imputed with a constant ‘not_available’ value. The other available option is ‘mode’, which imputes the missing value using the most frequent value in the training dataset.


numeric_imputation: string, default = ‘mean’
If missing values are found in numeric features, they will be imputed with the mean value of the feature. The other available option is ‘median’ which imputes the value using the median value in the training dataset.


feature_selection: bool, default = False
When set to True, a subset of features is selected using a combination of various permutation importance techniques including Random Forest, Adaboost and Linear correlation with target variable. The size of the subset is dependent on the feature_selection_threshold param. Generally, this is used to constrain the feature space in order to improve efficiency in modeling. When polynomial_features and feature_interaction are used, it is highly recommended to define the feature_selection_threshold param with a lower value.
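
To make the `train_size`, `numeric_imputation` and `categorical_imputation` settings more concrete, here is a minimal pandas sketch of the idea: split off 70% of the rows, learn the fill values from that training part only, and apply them to both parts. It is an illustration under stated assumptions, not PyCaret's internal code; the `toy` frame and its values are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for two Titanic columns; not real passenger data.
toy = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan, 50.0, 60.0, 25.0, 35.0, 45.0],
    "Embarked": ["S", "S", "C", None, "S", "Q", "S", "C", None, "S"],
})

# train_size=0.7: 70% of the rows for training, the rest held out.
train_part = toy.sample(frac=0.7, random_state=0)
holdout_part = toy.drop(train_part.index)

# Learn the fill values from the training rows only.
age_mean = train_part["Age"].mean()               # like numeric_imputation='mean'
embarked_mode = train_part["Embarked"].mode()[0]  # like categorical_imputation='mode'

# Apply the same fill values to both parts.
fill_values = {"Age": age_mean, "Embarked": embarked_mode}
train_filled = train_part.fillna(fill_values)
holdout_filled = holdout_part.fillna(fill_values)

print(len(train_filled), "training rows,", len(holdout_filled), "hold-out rows")
print("missing values left:", int(train_filled.isna().sum().sum() + holdout_filled.isna().sum().sum()))
```

Learning the statistics from the training rows alone keeps information from the hold-out set out of the preprocessing step.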


# Compare the Models

In [None]:
compare_models()

In this run, the Gradient Boosting Classifier achieved the highest accuracy among the compared models, so it is the model used in the following steps.

# Let's create a Gradient Boosting Model

This function creates a model and scores it using Stratified Cross Validation. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1, Kappa and MCC by fold (default = 10 folds). This function returns a trained model object.

In [None]:
gradientboost_model=create_model('gbc')
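
For readers who want to see what "scoring by fold with stratified cross-validation" amounts to, the sketch below reproduces the idea with scikit-learn directly. It assumes synthetic data from `make_classification` in place of the prepared Titanic features and reports accuracy only; it is not what `create_model` runs internally.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Synthetic binary data standing in for the prepared Titanic features (assumption).
X, y = make_classification(n_samples=300, n_features=8, weights=[0.6, 0.4], random_state=0)

# Stratified folds keep the class ratio roughly equal in every fold.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

fold_scores = []
for train_idx, valid_idx in skf.split(X, y):
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(accuracy_score(y[valid_idx], model.predict(X[valid_idx])))

for i, score in enumerate(fold_scores):
    print(f"fold {i}: accuracy = {score:.3f}")
print(f"mean accuracy = {sum(fold_scores) / len(fold_scores):.3f}")
```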

## Let's tune it!

This function tunes the hyperparameters of a model and scores it using Stratified Cross Validation. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC by fold (default = 10 folds). This function returns a trained model object.

In [None]:
tuned_model=tune_model(gradientboost_model)
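
Hyperparameter tuning of this kind is often done with a randomized search over a parameter grid. The sketch below uses scikit-learn's `RandomizedSearchCV` as a stand-in to show the idea; the search space, the synthetic data and the number of iterations are assumptions for illustration and do not reflect the grid that `tune_model` uses.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic data standing in for the prepared Titanic features (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical search space, chosen for illustration only.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                 # number of random parameter combinations to try
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best mean CV accuracy:", round(search.best_score_, 3))
```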

# Ensemble Model

This function ensembles the trained base estimator using the method defined in the ‘method’ param (default = ‘Bagging’). The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1 and Kappa by fold (default = 10 folds).

In [None]:
# ensemble tuned Gradient Boost model 
ensembled_gbm = ensemble_model(tuned_model,method='Boosting')
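
The difference between the two ensembling methods can be sketched with scikit-learn: bagging trains copies of the base estimator on bootstrap samples and combines their predictions, while boosting trains them in sequence and reweights the examples that earlier rounds got wrong. The example below is a rough illustration under assumptions: it uses a shallow decision tree as the base estimator (not the tuned gradient boosting model) and synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the prepared Titanic features (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Shallow tree as the base estimator (an illustrative choice).
base = DecisionTreeClassifier(max_depth=2, random_state=0)

# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions.
ensembles = {
    "Bagging": BaggingClassifier(base, n_estimators=10, random_state=0),
    "Boosting": AdaBoostClassifier(base, n_estimators=10, random_state=0),
}

scores = {}
for name, ensemble in ensembles.items():
    scores[name] = cross_val_score(ensemble, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean accuracy = {scores[name]:.3f}")
```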

# Learning Curve

In [None]:
plot_model(ensembled_gbm,plot='learning')

# Confusion Matrix

In [None]:
plot_model(ensembled_gbm,plot='confusion_matrix')
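
The numbers behind a confusion matrix plot are simple counts of (actual, predicted) pairs. A minimal sketch with made-up labels (not model output) shows how the four cells are obtained:

```python
from collections import Counter

# Made-up labels for illustration: 1 = survived, 0 = did not survive.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# Count each (actual, predicted) combination.
counts = Counter(zip(y_true, y_pred))
tn, fp = counts[(0, 0)], counts[(0, 1)]
fn, tp = counts[(1, 0)], counts[(1, 1)]

print("            pred 0  pred 1")
print(f"actual 0    {tn:6d}  {fp:6d}")
print(f"actual 1    {fn:6d}  {tp:6d}")
```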

# Classification Report

In [None]:
plot_model(ensembled_gbm,plot='class_report')
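
A classification report typically lists precision, recall and F1 per class. The sketch below computes these three values by hand for made-up labels; the helper name `class_report` is hypothetical and only serves this illustration.

```python
def class_report(y_true, y_pred, positive):
    """Return (precision, recall, f1), treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels for illustration: 1 = survived, 0 = did not survive.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

report = {label: class_report(y_true, y_pred, label) for label in (0, 1)}
for label, (precision, recall, f1) in report.items():
    print(f"class {label}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```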

# Decision Boundary

In [None]:
plot_model(ensembled_gbm,plot='boundary')

# AUC Curve

In [None]:
plot_model(ensembled_gbm,plot='auc')
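
The area under the ROC curve has a useful interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count as half). The sketch below computes it that way for made-up scores; the function name `roc_auc` is hypothetical, and the pairwise loop is meant for small inputs only.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for class 1.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.5]

auc = roc_auc(y_true, scores)
print(f"AUC = {auc:.3f}")
```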

# Class Prediction Error

In [None]:
plot_model(ensembled_gbm,plot='error')

# Validation Curve

In [None]:
plot_model(ensembled_gbm,plot='vc')

# Evaluate Model


This function displays a user interface for all of the available plots for a given estimator. It internally uses the plot_model() function.

In [None]:
evaluate_model(ensembled_gbm)

# Let's Predict it

In [None]:
predict_model(ensembled_gbm,data=test)

In [None]:
predictions = predict_model(ensembled_gbm, data=test)
predictions.head()

In [None]:
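# Use the predicted class label directly; rounding 'Score' is only valid
# when 'Score' is the probability of class 1, which depends on the PyCaret version.
result['Survived'] = predictions['Label'].astype(int)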
result.to_csv('Submission.csv',index=False)
result.head()
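
Before uploading, it can help to check that the file has the shape Kaggle expects for this competition: one row per test passenger, with the columns `PassengerId` and `Survived`. The sketch below builds a submission from the ids and labels directly and runs a few sanity checks; `toy_test` and `toy_labels` are made-up stand-ins for the `test` frame and the predicted labels.

```python
import pandas as pd

# Made-up stand-ins for the `test` frame and the predicted labels.
toy_test = pd.DataFrame({"PassengerId": [892, 893, 894]})
toy_labels = pd.Series([0, 1, 0], name="Label")

# Pair each passenger id with its predicted label by position.
submission = pd.DataFrame({
    "PassengerId": toy_test["PassengerId"].to_numpy(),
    "Survived": toy_labels.astype(int).to_numpy(),
})

# Basic sanity checks before writing the file.
assert list(submission.columns) == ["PassengerId", "Survived"]
assert submission["Survived"].isin([0, 1]).all()
assert len(submission) == len(toy_test)

print(submission.to_csv(index=False))
```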