# 1 - Introducing data science workflows


***From [https://github.com/ivanovitchm/EEC1509_MachineLearning](https://github.com/ivanovitchm/EEC1509_MachineLearning/tree/master/Lesson%20%2311%20-%20Kaggle%20Fundamentals).***

In this guided project, we're going to put together all that we've learned in this course and create a data science workflow.

Defining a workflow gives you a framework that makes iterating on ideas quicker and easier, so you can work more efficiently.

In this mission, we're going to explore a workflow to make competing in the Kaggle Titanic competition easier, using a pipeline of functions to reduce the number of dimensions you need to focus on.
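
As a rough illustration of what a "pipeline of functions" can look like, the sketch below chains two small transformation steps with `DataFrame.pipe`. The function names `add_family_size` and `add_is_alone` and the toy data are assumptions made for this example; they are not the functions defined later in this project.

```python
import pandas as pd

def add_family_size(df):
    # Hypothetical step: total relatives onboard (siblings/spouses + parents/children)
    df = df.copy()
    df["familysize"] = df["SibSp"] + df["Parch"]
    return df

def add_is_alone(df):
    # Hypothetical step: flag passengers travelling without family
    df = df.copy()
    df["isalone"] = (df["familysize"] == 0).astype(int)
    return df

# Toy data standing in for the Kaggle files
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Each step takes a DataFrame and returns a new one, so steps can be chained
processed = toy.pipe(add_family_size).pipe(add_is_alone)
print(processed)
```

Because every step has the same shape (DataFrame in, DataFrame out), steps can be added, removed, or reordered without touching the others.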

To get started, we'll read in the original **train.csv** and **test.csv** files from Kaggle.



In [0]:
#!mkdir data predictions
#!mv holdout_modified.csv train_modified.csv train.csv test.csv data

In [0]:
import pandas as pd

train = pd.read_csv("./data/train.csv")
holdout = pd.read_csv("./data/test.csv")

In [0]:
train.columns

In [0]:
holdout.columns

In [0]:
survived = train["Survived"]
train = train.drop("Survived",axis=1)

In [0]:
holdout.shape

In [0]:
train.shape

In [0]:
# concatenate train and holdout so that both end up with the same columns
all_data = pd.concat([train,holdout],axis=0)

In [0]:
all_data.shape

# 2 - Exploring the Data




In the first three missions of this course, we have done a variety of activities, mostly in isolation: **Exploring the data**, **creating features**, **selecting features**, **selecting and tuning different models**.

The Kaggle workflow we are going to build will combine all of these into a process.

<img width="400" alt="creating a repo" src="https://drive.google.com/uc?export=view&id=1swb6PxXUJuDvv83ylqh9eUh992lXTu47">

- **Data exploration**, to find patterns in the data
- **Feature engineering**, to create new features from those patterns or through pure experimentation
- **Feature selection**, to select the best subset of our current set of features
- **Model selection/tuning**, training a number of models with different hyperparameters to find the best performer.

We can continue to repeat this cycle as we work to optimize our predictions. At the end of any cycle we wish, we can also use our model to make predictions on the holdout set and then **Submit to Kaggle** to get a leaderboard score.

While the first two steps of our workflow are relatively freeform, later in this project we'll create some functions that will help automate the complexity of the latter two steps so we can move faster.

For now, let's practice the first stage, exploring the data. We're going to examine the two columns that contain information about the family members each passenger had onboard: **SibSp** and **Parch**.

# 3 - Preprocessing the Data

In [0]:
def process_ticket(df):
    # see https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling
    Ticket = []
    for i in list(df.Ticket):
        if not i.isdigit():
            #Take prefix
            Ticket.append(i.replace(".","").replace("/","").strip().split(' ')[0]) 
        else:
            Ticket.append("X")
    df["Ticket"] = Ticket
    return df

def process_missing(df):
    """Handle various missing values from the data set

    Usage
    ------

    holdout = process_missing(holdout)
    """
    df["Fare"] = df["Fare"].fillna(df["Fare"].mean())
    df["Embarked"] = df["Embarked"].fillna("S")
    return df

def process_age(df):
    """Process the Age column into pre-defined 'bins' 

    Usage
    ------

    train = process_age(train)
    """
    df["Age"] = df["Age"].fillna(-0.5)
    cut_points = [-1,0,5,12,18,35,60,100]
    label_names = ["Missing","Infant","Child","Teenager","Young Adult","Adult","Senior"]
    df["Age_categories"] = pd.cut(df["Age"],cut_points,labels=label_names)
    
    #df = df.drop("Age",axis=1)
    
    return df

def process_fare(df):
    """Process the Fare column into pre-defined 'bins' 

    Usage
    ------

    train = process_fare(train)
    """
    cut_points = [-1,12,50,100,1000]
    label_names = ["0-12","12-50","50-100","100+"]
    df["Fare_categories"] = pd.cut(df["Fare"],cut_points,labels=label_names)
    
    df = df.drop("Fare",axis=1)
    
    return df

def process_cabin(df):
    """Process the Cabin column into pre-defined 'bins' 

    Usage
    ------

    train = process_cabin(train)
    """
    df["Cabin_type"] = df["Cabin"].str[0]
    df["Cabin_type"] = df["Cabin_type"].fillna("Unknown")
    df = df.drop('Cabin',axis=1)
    return df

def process_titles(df):
    """Extract and categorize the title from the name column 

    Usage
    ------

    train = process_titles(train)
    """
    titles = {
        "Mr" :         "Mr",
        "Mme":         "Mrs",
        "Ms":          "Mrs",
        "Mrs" :        "Mrs",
        "Master" :     "Master",
        "Mlle":        "Miss",
        "Miss" :       "Miss",
        "Capt":        "Officer",
        "Col":         "Officer",
        "Major":       "Officer",
        "Dr":          "Officer",
        "Rev":         "Officer",
        "Jonkheer":    "Royalty",
        "Don":         "Royalty",
        "Sir" :        "Royalty",
        "Countess":    "Royalty",
        "Dona":        "Royalty",
        "Lady" :       "Royalty"
    }
    # raw string so that "\." is passed to the regex engine unchanged
    extracted_titles = df["Name"].str.extract(r' ([A-Za-z]+)\.',expand=False)
    df["Title"] = extracted_titles.map(titles)
    return df

def create_dummies(df,column_name):
    """Create Dummy Columns (One Hot Encoding) from a single Column

    Usage
    ------

    train = create_dummies(train,"Age")
    """
    dummies = pd.get_dummies(df[column_name],prefix=column_name)
    df = pd.concat([df,dummies],axis=1)
    return df

In [0]:
def pre_process(df):
    df = process_ticket(df)
    df = process_missing(df)
    df = process_age(df)
    df = process_fare(df)
    df = process_titles(df)
    df = process_cabin(df)

    for col in ["Age_categories","Fare_categories",
                "Title","Cabin_type","Sex","Ticket","Pclass"]:
        df = create_dummies(df,col)
    
    #df = df.drop(["Age_categories","Fare_categories",
                #"Title","Cabin_type","Sex","Ticket"],axis=1)
    
    return df

all_data = pre_process(all_data)

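# split the combined frame back into its two parts; .copy() avoids
# SettingWithCopyWarning when new columns are added later
n_train = len(survived)
train = all_data.iloc[:n_train].copy()
train = pd.concat([train,survived],axis=1)
holdout = all_data.iloc[n_train:].copy()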


# 4 - Exploring Data



In [0]:
explore_cols = ["SibSp","Parch","Survived"]
explore = train[explore_cols].copy()
explore.info()

In [0]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

explore.drop("Survived",axis=1).plot.hist(alpha=0.5,bins=8)
plt.show()

In [0]:
explore["familysize"] = explore[["SibSp","Parch"]].sum(axis=1)
explore.drop("Survived",axis=1).plot.hist(alpha=0.5,bins=10)
plt.xticks(range(11))
plt.show()

In [0]:
import numpy as np
plt.clf()
for col in explore.columns.drop("Survived"):
    pivot = explore.pivot_table(index=col,values="Survived")
    pivot.plot.bar(ylim=(0,1),yticks=np.arange(0,1,.1))
    plt.show()

The SibSp column shows the number of siblings and/or spouses each passenger had onboard, while the Parch column shows the number of parents or children each passenger had onboard. Neither column has any missing values.

The distribution of values in both columns is skewed right, with the majority of values being zero.

You can sum these two columns to explore the total number of family members each passenger had onboard. The shape of the distribution is similar, but there are fewer values at zero, and the counts taper off less rapidly as the values increase.

Looking at the survival rates by combined family size, you can see that only about 30% of the more than 500 passengers with no family members onboard survived, while passengers with a few family members onboard survived at higher rates.
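
To make the family-size calculation concrete, here is a minimal sketch on made-up data (six invented passengers, not the Kaggle file). It sums the two columns and uses a pivot table to get the survival rate for each family size, the same pattern as in the cells above.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "SibSp":    [0, 0, 1, 1, 0, 2],
    "Parch":    [0, 0, 0, 2, 0, 1],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Total number of family members onboard for each passenger
toy["familysize"] = toy[["SibSp", "Parch"]].sum(axis=1)

# The mean of a 0/1 column is the share of passengers who survived
pivot = toy.pivot_table(index="familysize", values="Survived")
print(pivot)
```

In this toy table, one of the three passengers travelling alone survives, so the rate for `familysize == 0` is one third.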

# 5 - Engineering New Features


In [0]:
def process_isalone(df):
    df["familysize"] = df[["SibSp","Parch"]].sum(axis=1)
    df["isalone"] = 0
    df.loc[(df["familysize"] == 0),"isalone"] = 1
    #df = df.drop("familysize",axis=1)
    return df

train = process_isalone(train)
holdout = process_isalone(holdout)

# 6 - Selecting the Best-Performing Features


In [0]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

def select_features(df,index):
    
    # index
    # 0 - random forest
    # 1 - logistic regression
    
    # Remove non-numeric columns, columns that have null values
    df = df.select_dtypes([np.number]).dropna(axis=1)
    all_X = df.drop(["Survived","PassengerId"],axis=1)
    all_y = df["Survived"]
    
    clf_rf = RandomForestClassifier(random_state=1, n_estimators=100)
    clf_lr = LogisticRegression()
    clfs = [clf_rf,clf_lr]
    
    selector = RFECV(clfs[index],cv=10,n_jobs=-1)
    selector.fit(all_X,all_y)
    
    best_columns = list(all_X.columns[selector.support_])
    print("Best Columns \n"+"-"*12+"\n{}\n".format(best_columns))
    
    return best_columns

cols_rf = select_features(train,0)
cols_lr = select_features(train,1)

In [0]:
print(len(cols_rf), cols_rf)
print(len(cols_lr), cols_lr)

# 7 - Selecting and Tuning Different Algorithms


In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
import numpy as np

def select_model(df,features):
    
    all_X = df[features]
    all_y = df["Survived"]

    # List of dictionaries, each containing a model name,
    # its estimator and a dict of hyperparameters
    models = [
        {
            "name": "LogisticRegression",
            "estimator": LogisticRegression(),
            "hyperparameters":
                {
                    "solver": ["newton-cg", "lbfgs", "liblinear"]
                }
        },
        {
            "name": "KNeighborsClassifier",
            "estimator": KNeighborsClassifier(),
            "hyperparameters":
                {
                    "n_neighbors": range(1,20,2),
                    "weights": ["distance", "uniform"],
                    "algorithm": ["ball_tree", "kd_tree", "brute"],
                    "p": [1,2]
                }
        },
        {
            "name": "RandomForestClassifier",
            "estimator": RandomForestClassifier(random_state=1),
            "hyperparameters":
                {
                    "n_estimators": [200],
                    "criterion": ["entropy", "gini"],
                    "max_depth": [10,20],
                    "max_features": ["log2", "sqrt"],
                    "min_samples_leaf": [1],
                    "min_samples_split": [2]
                }
        },
        {
            "name":"SVC",
            "estimator":SVC(),
            "hyperparameters":
                {
                  "kernel": ['rbf'],  
                  "C": [0.001, 0.01, 0.1, 1, 10],
                  "gamma": [0.001, 0.01, 0.1, 1]
                }
        },
        {
            # reference
            # https://github.com/UltravioletAnalytics/kaggle-titanic/blob/master/sgdclassifier.py
            "name":"SGDC",
            "estimator": SGDClassifier(),
            "hyperparameters":
            {
                # newer scikit-learn releases (1.1+) call this loss "log_loss"
                "loss": ["log"],
                "alpha": [0.001],
                "penalty": ["elasticnet"],
                "l1_ratio": [0.8],
                "shuffle": [True],
                "learning_rate": ['optimal'],
                "max_iter":[1000]
            }
        }
    ]

    for model in models:
        print(model['name'])
        print('-'*len(model['name']))

        grid = GridSearchCV(model["estimator"],
                            param_grid=model["hyperparameters"],
                            cv=10)
        grid.fit(all_X,all_y)
        model["best_params"] = grid.best_params_
        model["best_score"] = grid.best_score_
        model["best_model"] = grid.best_estimator_

        print("Best Score: {}".format(model["best_score"]))
        print("Best Parameters: {}\n".format(model["best_params"]))

    return models

In [0]:
result_a = select_model(train,cols_rf)

In [0]:
result_b = select_model(train,cols_lr)

# 8 - Making a Submission to Kaggle


In [0]:
def save_submission_file(model,cols,filename):
    holdout_data = holdout[cols]
    predictions = model.predict(holdout_data)
    
    holdout_ids = holdout["PassengerId"]
    submission_df = {"PassengerId": holdout_ids,
                 "Survived": predictions}
    submission = pd.DataFrame(submission_df)

    submission.to_csv(filename,index=False)

In [0]:
# index 3 of the list returned by select_model() is the SVC entry
best_model = result_b[3]["best_model"]
save_submission_file(best_model,cols_lr,"submission_PreprocessingIvanovitch.csv")

# 9 - Next Steps




We encourage you to continue working on this Kaggle competition. Here are some suggestions of next steps:

- Continue to explore the data and create new features, following the workflow and using the functions we created.
- Read more about the Titanic and this Kaggle competition to get ideas for new features.
- Use some different algorithms in the select_model() function, like [stochastic gradient descent](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) or [perceptron linear models](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html).
- Experiment with [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) instead of **GridSearchCV** to speed up your **select_model()** function.
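
The last suggestion can be sketched as follows. This is a minimal example on synthetic data, assuming a random forest and two hypothetical search ranges; it is not a drop-in replacement for the function above. Instead of trying every combination, `RandomizedSearchCV` draws `n_iter` parameter settings from the given distributions.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the Titanic features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Distributions to sample from (ranges chosen arbitrarily for this sketch)
param_dist = {
    "max_depth": randint(2, 10),        # integers from 2 to 9
    "min_samples_leaf": randint(1, 5),  # integers from 1 to 4
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=1),
    param_distributions=param_dist,
    n_iter=5,          # only five sampled settings are evaluated
    cv=3,
    random_state=1,
)
search.fit(X, y)
print(search.best_params_)
```

The cost of the search is controlled by `n_iter` rather than by the size of the grid, which is where the speed-up comes from.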

Lastly, while the Titanic competition is great for learning about how to approach your first Kaggle competition, we recommend against spending many hours focused on trying to get to the top of the leaderboard. With such a small data set, there is a limit to how good your predictions can be, and your time would be better spent moving onto more complex competitions.

Once you feel you have a good understanding of the Kaggle workflow, look at some other competitions. A great next competition is the [House Prices Competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques). You can find a starting point [here](https://www.dataquest.io/blog/kaggle-getting-started/).

# 10 - Loading the modified data

In [0]:
import pandas as pd

train_mod = pd.read_csv("./data/train_modified.csv")
holdout_mod = pd.read_csv("./data/holdout_modified.csv")

In [0]:
print(train_mod.info())
train_mod.head(5)

In [0]:
print(holdout_mod.info())
holdout_mod.head(5)

In [0]:
X_mod = train_mod.drop(["Survived","PassengerId"],axis=1)
y_mod = train_mod["Survived"]

# 11 - Using the AdaBoostClassifier

An AdaBoost [1] classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.

This class implements the algorithm known as AdaBoost-SAMME [2].

`class sklearn.ensemble.AdaBoostClassifier(base_estimator=None, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)`
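
The reweighting idea described above can be sketched in plain Python. This is a simplified illustration of classic discrete AdaBoost for two classes coded as -1/+1, using one-threshold "stumps" on toy 1-D data; it is not scikit-learn's SAMME or SAMME.R implementation, and all names in it are made up for the example.

```python
import math

# Toy 1-D data with labels in {-1, +1}; no single threshold separates it
X = [1, 2, 3, 4, 5, 6]
y = [1, 1, -1, -1, 1, 1]

def stump_predict(x, threshold, sign):
    # Predict `sign` to the left of the threshold and `-sign` to the right
    return sign if x < threshold else -sign

def best_stump(X, y, w):
    # Exhaustively pick the stump with the lowest weighted error
    best = None
    for threshold in [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]:
        for sign in (1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if stump_predict(xi, threshold, sign) != yi)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best

w = [1 / len(X)] * len(X)   # start with uniform sample weights
ensemble = []               # list of (alpha, threshold, sign)

for _ in range(3):
    err, threshold, sign = best_stump(X, y, w)
    alpha = 0.5 * math.log((1 - err) / err)   # vote weight of this weak learner
    ensemble.append((alpha, threshold, sign))
    # Raise the weights of misclassified samples, lower the others
    w = [wi * math.exp(-alpha * yi * stump_predict(xi, threshold, sign))
         for xi, yi, wi in zip(X, y, w)]
    total = sum(w)
    w = [wi / total for wi in w]

def predict(x):
    # Weighted vote of all stumps
    score = sum(a * stump_predict(x, t, s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1

print([predict(x) for x in X])
```

No single stump classifies this data correctly, but after three rounds the weighted vote does, because each new stump concentrates on the samples the previous ones got wrong.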


In [0]:
# zip is a builtin in Python 3, so the old sklearn.externals.six shim is not needed

import matplotlib.pyplot as plt

from sklearn.datasets import make_gaussian_quantiles
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_mod, y_mod, test_size=0.3, random_state=42)


bdt_real = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=15),
    n_estimators=300,
    learning_rate=0.1)

bdt_discrete = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=15),
    n_estimators=300,
    learning_rate=0.1,
    algorithm="SAMME")

bdt_real.fit(X_train, y_train)
bdt_discrete.fit(X_train, y_train)

real_test_errors = []
discrete_test_errors = []

for real_test_predict, discrete_test_predict in zip(
        bdt_real.staged_predict(X_test), bdt_discrete.staged_predict(X_test)):
    # accuracy_score expects (y_true, y_pred)
    real_test_errors.append(
        1. - accuracy_score(y_test, real_test_predict))
    discrete_test_errors.append(
        1. - accuracy_score(y_test, discrete_test_predict))

n_trees_discrete = len(bdt_discrete)
n_trees_real = len(bdt_real)

# Boosting might terminate early, but the following arrays are always
# n_estimators long. We crop them to the actual number of trees here:
discrete_estimator_errors = bdt_discrete.estimator_errors_[:n_trees_discrete]
real_estimator_errors = bdt_real.estimator_errors_[:n_trees_real]
discrete_estimator_weights = bdt_discrete.estimator_weights_[:n_trees_discrete]

plt.figure(figsize=(15, 5))

plt.subplot(131)
plt.plot(range(1, n_trees_discrete + 1),
         discrete_test_errors, c='black', label='SAMME')
plt.plot(range(1, n_trees_real + 1),
         real_test_errors, c='black',
         linestyle='dashed', label='SAMME.R')
plt.legend()
plt.ylim(0.18, 0.62)
plt.ylabel('Test Error')
plt.xlabel('Number of Trees')

plt.subplot(132)
plt.plot(range(1, n_trees_discrete + 1), discrete_estimator_errors,
         "b", label='SAMME', alpha=.5)
plt.plot(range(1, n_trees_real + 1), real_estimator_errors,
         "r", label='SAMME.R', alpha=.5)
plt.legend()
plt.ylabel('Error')
plt.xlabel('Number of Trees')
plt.ylim((.2,
         max(real_estimator_errors.max(),
             discrete_estimator_errors.max()) * 1.2))
plt.xlim((-20, len(bdt_discrete) + 20))

plt.subplot(133)
plt.plot(range(1, n_trees_discrete + 1), discrete_estimator_weights,
         "b", label='SAMME')
plt.legend()
plt.ylabel('Weight')
plt.xlabel('Number of Trees')
plt.ylim((0, discrete_estimator_weights.max() * 1.2))
plt.xlim((-20, n_trees_discrete + 20))

# prevent overlapping y-axis labels
plt.subplots_adjust(wspace=0.25)
plt.show()

In [0]:
bdt_real

In [0]:
print(real_test_errors)
print(discrete_test_errors)

In [0]:
print(bdt_real.score(X_test, y_test))
print(bdt_discrete.score(X_test, y_test))

In [0]:
# make the prediction using the resulting model
holdout_pred = holdout_mod.drop(["PassengerId"],axis=1)

predictions = bdt_real.predict(holdout_pred)

print("class = ", predictions)

holdout_ids = holdout_mod["PassengerId"]
submission_df = {"PassengerId": holdout_ids, "Survived": predictions}
submission = pd.DataFrame(submission_df)
submission.to_csv("./predictions/submission_AdaBoostClassifier_real_0.csv",index=False)


In [0]:
# make the prediction using the resulting model
holdout_pred = holdout_mod.drop(["PassengerId"],axis=1)

predictions = bdt_discrete.predict(holdout_pred)

print("class = ", predictions)

holdout_ids = holdout_mod["PassengerId"]
submission_df = {"PassengerId": holdout_ids, "Survived": predictions}
submission = pd.DataFrame(submission_df)
submission.to_csv("./predictions/submission_AdaBoostClassifier_discrete_0.csv",index=False)

## 11.1 Plot feature importance of AdaBoostClassifier Discrete

In [0]:

import numpy as np
import matplotlib.pyplot as plt


# #############################################################################
# Plot feature importance
feature_importance = bdt_discrete.feature_importances_
# make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
#plt.subplot(1, 2, 2)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X_mod.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()

In [0]:
features = pd.DataFrame()
features['feature'] = X_mod.columns
features['importance'] = bdt_discrete.feature_importances_
features.sort_values(by=['importance'], ascending=True, inplace=True)
features.set_index('feature', inplace=True)

features.plot(kind='barh', figsize=(25, 25))


In [0]:
features = pd.DataFrame()
features['feature'] = X_mod.columns
features['importance'] = bdt_real.feature_importances_
features.sort_values(by=['importance'], ascending=True, inplace=True)
features.set_index('feature', inplace=True)

features.plot(kind='barh', figsize=(25, 25))


## 11.2 Tuning the AdaBoostClassifier with RandomizedSearchCV




In [0]:

import numpy as np

from time import time
from scipy.stats import randint as sp_randint
from scipy.stats import uniform
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# estimator whose hyperparameters will be searched
clf = AdaBoostClassifier()

# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

#AdaBoostClassifier(algorithm='SAMME.R',
#          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=15,
#            max_features=None, max_leaf_nodes=None,
#            min_impurity_decrease=0.0, min_impurity_split=None,
#            min_samples_leaf=1, min_samples_split=2,
#            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
#            splitter='best'),
#          learning_rate=0.1, n_estimators=300, random_state=None)            

# specify parameters and distributions to sample from
param_dist = {#"iterations": sp_randint(15,50),
              #"max_depth" : sp_randint(5,16),
              # tree settings other than max_depth are left at their defaults;
              # "base_estimator" is called "estimator" from scikit-learn 1.2 on
              "base_estimator" : [RandomForestClassifier(max_depth=10),
                                  RandomForestClassifier(max_depth=15),
                                  DecisionTreeClassifier(max_depth=15)],
              # scipy's uniform(loc, scale) samples from [loc, loc + scale]
              "learning_rate" : uniform(0.1, 1),
              "n_estimators" : sp_randint(10, 100),
              "algorithm" : ['SAMME', 'SAMME.R']}

# run randomized search 

#ValueError: Invalid parameter iterations for estimator AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
#          learning_rate=1.0, n_estimators=50, random_state=None). Check the list of available parameters with `estimator.get_params().keys()`.
  
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search, cv=5)

start = time()
X_train, X_test, y_train, y_test = train_test_split(X_mod, y_mod, test_size=0.1, random_state=42)

random_search.fit(X_train, y_train)
print("RandomizedSearchCV took %.2f seconds for %d candidate"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)



In [0]:
random_search.best_params_

In [0]:
random_search

In [0]:
print(random_search.score(X_mod, y_mod))
print(random_search.score(X_train, y_train))
print(random_search.score(X_test, y_test))

In [0]:
holdout_mod = pd.read_csv("./data/holdout_modified.csv")

# predict on the holdout set with the best estimator found by the search
predictions = random_search.predict(holdout_mod.drop(["PassengerId"], axis=1))

holdout_ids = holdout_mod["PassengerId"]
submission_df = {"PassengerId": holdout_ids, "Survived": predictions}
submission = pd.DataFrame(submission_df)
submission.to_csv("./predictions/submission_AdaBoostClassifierRandomizedSearchCV0.csv",index=False)

**Score of 0.73684**

# 12 - Using the CatBoost: Score of 0.80382


Here we implement a solution using the CatBoost algorithm.

CatBoost is a machine learning algorithm that uses gradient boosting on decision trees:

https://tech.yandex.com/catboost/doc/dg/concepts/python-quickstart-docpage/
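
To give a feel for what "gradient boosting on decision trees" means, here is a deliberately small sketch in plain Python. It boosts one-split regression stumps under squared error on made-up 1-D data. It is only an illustration of the general idea and is not how CatBoost itself is implemented; every name in it is hypothetical.

```python
# Toy 1-D regression data (made up for illustration)
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 3.1, 2.9, 5.2, 4.8]

def fit_stump(X, residuals):
    # Find the split that minimises the squared error of the residuals
    best = None
    for threshold in [1.5, 2.5, 3.5, 4.5, 5.5]:
        left = [r for x, r in zip(X, residuals) if x < threshold]
        right = [r for x, r in zip(X, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def stump_value(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

learning_rate = 0.5
base = sum(y) / len(y)          # initial prediction: the mean of y
predictions = [base] * len(X)
stumps = []
mse_history = []

for _ in range(20):
    # For squared error, the negative gradient is simply the residual
    residuals = [yi - pi for yi, pi in zip(y, predictions)]
    stump = fit_stump(X, residuals)
    stumps.append(stump)
    # Move the predictions a fraction of the way towards the stump's output
    predictions = [pi + learning_rate * stump_value(stump, xi)
                   for xi, pi in zip(X, predictions)]
    mse = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)
    mse_history.append(mse)

print(round(mse_history[0], 4), round(mse_history[-1], 4))
```

Each new tree is fitted to what the current ensemble still gets wrong, and the learning rate controls how much of that correction is applied per iteration.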

## 12.1 - Installing the CatBoost

In [0]:
!pip install catboost


## 12.2 - Loading the modified data and executing the CatBoost

In [0]:
import pandas as pd

train_mod = pd.read_csv("./data/train_modified.csv")
holdout_mod = pd.read_csv("./data/holdout_modified.csv")
X_mod = train_mod.drop(["Survived","PassengerId"],axis=1)
y_mod = train_mod["Survived"]

In [0]:
import numpy as np
from catboost import CatBoostClassifier

# specify the training parameters 
model = CatBoostClassifier(iterations=46, depth=10, learning_rate=0.1, loss_function='Logloss', logging_level='Verbose')

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_mod, y_mod, test_size=0.3, random_state=42)

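# train the model on the full training set; to get a true held-out estimate,
# fit on the split instead: model.fit(X_train, y_train)
model.fit(X_mod, y_mod)

# note: X_train and X_test are both subsets of the data the model was fitted on,
# so the two accuracies printed below are not held-out scores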

score_train = model.score(X_train, y_train)
print("Accuracy of Train Data", score_train)

score_test = model.score(X_test, y_test)
print("Accuracy of Test Data", score_test)

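# make the prediction using the resulting model
# (PassengerId is dropped so the columns match those used for training)
predictions = model.predict(holdout_mod.drop(["PassengerId"], axis=1)).astype(int)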

print("class = ", predictions)

#preds_proba = model.predict_proba(test_data)
#print("proba = ", preds_proba)




In [0]:

holdout_ids = holdout_mod["PassengerId"]
submission_df = {"PassengerId": holdout_ids, "Survived": predictions}
submission = pd.DataFrame(submission_df)
submission.to_csv("./predictions/submission_CatBoostClassifier_1.csv",index=False)


**Score of 0.80382**



![Rank on the Kaggle](https://drive.google.com/uc?export=view&id=1N8GM-_16JgWArc_-v9vc5nmghRKaplLw)




### 12.2.1 - Best result with CatBoostClassifier

`CatBoostClassifier(iterations=20, depth=10, learning_rate=0.1, loss_function='Logloss', logging_level='Verbose')`



## 12.3 - Using a Pool of CatBoost with Cross-validation

In [0]:
from catboost import Pool, cv

pool = Pool(X_mod, y_mod)

params = {'iterations': 100, 
          'depth': 3,        
          'loss_function': 'Logloss', 
          'verbose': False, 
          'roc_file': 'roc-file'}

scores = cv(pool,params)

In [0]:
scores


## 12.4 Tuning the CatBoost with RandomizedSearchCV: Bad Score of 0.74162


In [0]:
#from scipy.stats import randint as sp_randint

np.linspace(0,1.5,16)

In [0]:

import numpy as np

from time import time
from scipy.stats import randint as sp_randint
from scipy.stats import uniform
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from catboost import CatBoostClassifier

# specify the training parameters 
clf = CatBoostClassifier(loss_function='Logloss', logging_level='Verbose')


# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")


# specify parameters and distributions to sample from
param_dist = {"iterations": sp_randint(15,50),
              "depth" : sp_randint(5,16),
              # skip the first grid point: a learning rate of 0 is not valid
              "learning_rate" : np.linspace(0,1,16)[1:]}

# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search, cv=5)

start = time()
random_search.fit(X_mod, y_mod)
print("RandomizedSearchCV took %.2f seconds for %d candidate"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)

# use a full grid over all parameters
param_grid =  {"iterations": sp_randint(15,50),
              "depth" : sp_randint(5,16),
              "learning_rate" : uniform(0,1)}

clf_grid = CatBoostClassifier(learning_rate=0.1, loss_function='Logloss', logging_level='Verbose')

# run grid search
#grid_search = GridSearchCV(clf_grid, param_grid=param_grid, cv=5)
#start = time()
#grid_search.fit(X_mod, y_mod)

#print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
#      % (time() - start, len(grid_search.cv_results_['params'])))
#report(grid_search.cv_results_)


**The learning rate must be less than 1.**
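
When sampling learning rates with `scipy.stats.uniform`, it helps to remember that its two arguments are `loc` and `scale`, so the sampled range is `[loc, loc + scale]`. The short sketch below shows the difference; the bounds 0.01 and 0.99 are assumptions chosen for this example, not values from the search above.

```python
from scipy.stats import uniform

# uniform(loc, scale) covers [loc, loc + scale], not [loc, scale]
too_wide = uniform(0.1, 1)      # covers 0.1 to 1.1, so values above 1 can occur
bounded = uniform(0.01, 0.98)   # covers 0.01 to 0.99 (assumed bounds)

samples = bounded.rvs(size=1000, random_state=0)
print(samples.min(), samples.max())

# ppf(1.0) is the upper end of the distribution: loc + scale
print(too_wide.ppf(1.0))
```

Passing a frozen distribution such as `bounded` as a value in `param_distributions` keeps every sampled learning rate strictly between 0 and 1.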


In [0]:
print("Random_search best_score:", random_search.best_score_)
#print("Grid_search best_score:", grid_search.best_score_,"\n")

print("Random_search best_params:", random_search.best_params_)
#print("Grid_search best_params:", grid_search.best_params_,"\n")

In [0]:
# drop PassengerId so the columns match those used for training
predictions = random_search.predict(holdout_mod.drop(["PassengerId"], axis=1)).astype(int)

holdout_ids = holdout_mod["PassengerId"]
submission_df = {"PassengerId": holdout_ids, "Survived": predictions}
submission = pd.DataFrame(submission_df)
submission.to_csv("./predictions/submission_CatBoostClassifier_Random_Search.csv",index=False)


![alt text](https://drive.google.com/uc?export=view&id=1n7LSwZz7UIHOiXAxsVfitxCX6VNJSyMS)


**BAD! Overfit with:**

**Random_search best_params: {'depth': 10, 'iterations': 46, 'learning_rate': 0.5}**

# 13 - Applying the Voting Strategy for Ensemble
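
As a minimal sketch of what a voting ensemble does, the example below combines the 0/1 predictions of three hypothetical classifiers by simple majority ("hard" voting). The model names and predictions are invented for illustration; scikit-learn also offers `sklearn.ensemble.VotingClassifier` for the same purpose.

```python
from collections import Counter

# Hypothetical 0/1 predictions from three classifiers for five passengers
predictions = {
    "model_a": [1, 0, 1, 0, 1],
    "model_b": [1, 1, 0, 0, 1],
    "model_c": [0, 0, 1, 0, 1],
}

def majority_vote(predictions):
    # zip(*...) regroups the predictions per passenger across models
    votes_per_sample = zip(*predictions.values())
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_sample]

ensemble = majority_vote(predictions)
print(ensemble)
```

With an odd number of binary classifiers there are no ties, so each passenger receives the label that at least two of the three models agree on.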

In [0]:
import pandas as pd

train_mod = pd.read_csv("./data/train_modified.csv")
holdout_mod = pd.read_csv("./data/holdout_modified.csv")
X_mod = train_mod.drop(["Survived","PassengerId"],axis=1)
y_mod = train_mod["Survived"]

In [0]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.metrics import make_scorer, roc_auc_score
from scipy import stats

auc = make_scorer(roc_auc_score)
rand_list = {"C": stats.uniform(2, 10),
                        "gamma": stats.uniform(0.1, 1)}


# X_train0, X_test0, y_train0, y_test0 = train_test_split(X_mod, y_mod, test_size=0.1, random_state=42)
#X = train[0::, 1::]
#y = train[0::, 0]
X = X_mod
y = y_mod
clf = SVC()


rand_search = RandomizedSearchCV(clf, param_distributions = rand_list, n_iter = 20, n_jobs = 4, cv = 3, random_state = 2017, scoring = auc) 
rand_search.fit(X,y)

In [0]:
rand_search.best_estimator_

In [0]:
  
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

from catboost import CatBoostClassifier


from sklearn.model_selection import train_test_split

X_train0, X_test0, y_train0, y_test0 = train_test_split(X_mod, y_mod, test_size=0.1, random_state=42)

 #{'AdaBoostClassifier': 0.7950617283950618,
 #'CatBoostClassifier': 0.8185185185185185,
 #'DecisionTreeClassifier': 0.774074074074074,
 #'GaussianNB': 0.7358024691358025,
 #'GradientBoostingClassifier': 0.8160493827160493,
 #'KNeighborsClassifier': 0.7777777777777777,
 #'LinearDiscriminantAnalysis': 0.8135802469135802,
 #'LogisticRegression': 0.8209876543209876,
 #'RandomForestClassifier': 0.8246913580246913,
 #'SGDClassifier': 0.8185185185185185,
 #'SVC': 0.8160493827160493}

  
 #{'algorithm': 'SAMME',
 # 'base_estimator': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
 #            max_depth=10, max_features='auto', max_leaf_nodes=None,
  #           min_impurity_decrease=0.0, min_impurity_split=None,
   #          min_samples_leaf=1, min_samples_split=2,
   #          min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
    #         oob_score=False, random_state=None, verbose=0,
     #        warm_start=False),
 #'learning_rate': 0.802656626129159,
 #'n_estimators': 36} 


classifiers = [
    #KNeighborsClassifier(algorithm='brute', n_neighbors= 3, p= 1, weights='uniform'),
    #SVC(probability=True, C= 1, gamma = 0.1, kernel= 'rbf'),
    SVC(C=9.062264858625882, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma=0.5565289675398223, kernel='rbf', max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001, verbose=False),
    #DecisionTreeClassifier(),
    RandomForestClassifier(criterion= 'gini', max_depth=10, max_features= 'log2', min_samples_leaf= 1, min_samples_split= 2, n_estimators= 200),
    SGDClassifier(alpha= 0.001, l1_ratio= 0.8, learning_rate='optimal', loss='log', max_iter= 1000, penalty= 'elasticnet', shuffle= True),
 	  AdaBoostClassifier(RandomForestClassifier(max_depth=10,  criterion='gini', min_samples_leaf=1, min_samples_split=2), n_estimators=36, learning_rate=0.802656626129159, algorithm="SAMME"),
    GradientBoostingClassifier(),
    #GaussianNB(),
    LinearDiscriminantAnalysis(),
    #QuadraticDiscriminantAnalysis(),
    LogisticRegression(solver='newton-cg'),
    CatBoostClassifier(iterations=20, depth=10, learning_rate=0.1, loss_function='Logloss', logging_level='Verbose')]

log_cols = ["Classifier", "Accuracy"]
log 	 = pd.DataFrame(columns=log_cols)

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

X = X_mod
y = y_mod

acc_dict = {}

for train_index, test_index in sss.split(X_mod, y_mod):
	X_train, X_test = X.iloc[train_index], X.iloc[test_index]
	y_train, y_test = y.iloc[train_index], y.iloc[test_index]
	
	for clf in classifiers:
		name = clf.__class__.__name__
		clf.fit(X_train, y_train)
		train_predictions = clf.predict(X_test)
		acc = accuracy_score(y_test, train_predictions)
		if name in acc_dict:
			acc_dict[name] += acc
		else:
			acc_dict[name] = acc

for clf in acc_dict:
	acc_dict[clf] = acc_dict[clf] / 10.0
	log_entry = pd.DataFrame([[clf, acc_dict[clf]]], columns=log_cols)
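	# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
	log = pd.concat([log, log_entry], ignore_index=True)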

plt.xlabel('Accuracy')
plt.title('Classifier Accuracy')

sns.set_color_codes("muted")
sns.barplot(x='Accuracy', y='Classifier', data=log, color="b")
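
The loop above computes, for each classifier, the mean accuracy over ten stratified shuffle splits. scikit-learn's `cross_val_score` accepts the same splitter through its `cv` argument, which gives the same kind of estimate with less bookkeeping. The sketch below runs on synthetic data: the dataset from `make_classification` and the choice of `LogisticRegression` are stand-ins, not the notebook's features or tuned models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Synthetic stand-in for the Titanic feature matrix (assumption for this demo).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Same splitter configuration as in the notebook cell.
sss_demo = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo,
                         cv=sss_demo, scoring="accuracy")

print("mean accuracy over", len(scores), "splits:", np.mean(scores))
```

Because the number of splits is read from the splitter, there is no hard-coded divisor to keep in sync with `n_splits`.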

In [0]:
 acc_dict
  

In [0]:
classifiers

In [0]:
predictions = []

for clf in classifiers:
  predictions.append(clf.predict_proba(X_train0)[:,1])
    
combined = np.sum(predictions, axis=0) / len(classifiers)     
rounded = np.round(combined)
 
acc = accuracy_score(y_train0, rounded)
print("Predictions of X_train", acc)

In [0]:
predictions = []

for clf in classifiers:
  predictions.append(clf.predict_proba(X_test0)[:,1])
    
combined = np.sum(predictions, axis=0) / len(classifiers)     
rounded = np.round(combined)
 
acc = accuracy_score(y_test0, rounded)
print("Predictions of X_test", acc)

In [0]:
import pandas as pd

test = pd.read_csv("./data/test.csv")

hold = holdout_mod.drop(["PassengerId"],axis=1)

predictions = []

for clf in classifiers:
  predictions.append(clf.predict_proba(hold)[:,1])
    
combined = np.sum(predictions, axis=0) / len(classifiers)     
rounded = np.round(combined).astype(int)
 
holdout_ids = holdout["PassengerId"]
submission_df = {"PassengerId": holdout_ids, "Survived": rounded}
submission = pd.DataFrame(submission_df)
submission.to_csv("./predictions/submission_Ensemble_Voting2.csv",index=False)

**Score of 0.78947**
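
The ensemble above averages each classifier's predicted probability for the positive class and rounds the result. One detail worth knowing: `np.round` rounds exact halves to the nearest even number, so an average of exactly 0.5 becomes 0. An explicit threshold makes the tie rule visible. The sketch below uses made-up probabilities, and `average_vote` is a hypothetical helper name, not part of the notebook.

```python
import numpy as np

def average_vote(probas, threshold=0.5):
    """Average positive-class probabilities across models and threshold them.

    probas: array-like of shape (n_models, n_samples).
    Returns an int array of 0/1 labels; ties at the threshold count as 1.
    """
    combined = np.mean(np.asarray(probas, dtype=float), axis=0)
    return (combined >= threshold).astype(int)

# Hypothetical probabilities from three models for four passengers.
probas = [
    [0.9, 0.2, 0.5, 0.4],
    [0.8, 0.1, 0.5, 0.7],
    [0.7, 0.3, 0.5, 0.1],
]
labels = average_vote(probas)
print(labels)
```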

# 14 - VotingClassifier from Sklearn: Bad Score of 0.77990

In [0]:
from sklearn.ensemble import VotingClassifier

classifiers = [
    #KNeighborsClassifier(algorithm='brute', n_neighbors= 3, p= 1, weights='uniform'),
    #SVC(probability=True, C= 1, gamma = 0.1, kernel= 'rbf'),
    ('svm', SVC(C=9.062264858625882, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma=0.5565289675398223, kernel='rbf', max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001, verbose=False)),
    #(DecisionTreeClassifier(),
    ('rf', RandomForestClassifier(criterion= 'gini', max_depth=10, max_features= 'log2', min_samples_leaf= 1, min_samples_split= 2, n_estimators= 200)),
    ('sg',SGDClassifier(alpha= 0.001, l1_ratio= 0.8, learning_rate='optimal', loss='log', max_iter= 1000, penalty= 'elasticnet', shuffle= True)),
 	  #AdaBoostClassifier(RandomForestClassifier(max_depth=15), n_estimators=200, learning_rate=0.1, algorithm="SAMME"),
    ('gb',GradientBoostingClassifier()),
    #GaussianNB()),
    ('ld', LinearDiscriminantAnalysis()),
    #QuadraticDiscriminantAnalysis()),
    ('logr', LogisticRegression(solver='newton-cg')),
    ('catb', CatBoostClassifier(iterations=20, depth=10, learning_rate=0.1, loss_function='Logloss', logging_level='Verbose'))]


eclf1 = VotingClassifier(classifiers, voting='soft', flatten_transform=True)


eclf1 = eclf1.fit(X_mod, y_mod)

In [0]:
eclf1

In [0]:
print(eclf1.score(X_mod, y_mod))

In [0]:
hold = holdout_mod.drop(["PassengerId"],axis=1)

predictions = eclf1.predict(hold)
 
holdout_ids = holdout["PassengerId"]
submission_df = {"PassengerId": holdout_ids, "Survived": predictions}
submission = pd.DataFrame(submission_df)
submission.to_csv("./predictions/submission_VotingClassifier0.csv",index=False)

![alt text](https://drive.google.com/uc?export=view&id=1CpmCiebxXoQ-_B1eqNa9-GhYG2hqhUoe)
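
`VotingClassifier` supports two schemes: `voting='hard'` takes the majority of the predicted labels, while `voting='soft'` (used above) averages the predicted probabilities. The sketch below shows, on made-up numbers, a case where the two disagree. It uses plain NumPy rather than scikit-learn, and the helper names `hard_vote` and `soft_vote` are hypothetical.

```python
import numpy as np

# Hypothetical positive-class probabilities: 3 models x 2 samples.
probas = np.array([
    [0.55, 0.10],
    [0.60, 0.45],
    [0.05, 0.40],
])

def hard_vote(probas):
    # Each model votes with its own 0/1 label; a strict majority wins.
    votes = (probas >= 0.5).astype(int)
    return (votes.sum(axis=0) * 2 > probas.shape[0]).astype(int)

def soft_vote(probas):
    # Average the probabilities first, then threshold once.
    return (probas.mean(axis=0) >= 0.5).astype(int)

hard = hard_vote(probas)
soft = soft_vote(probas)
print("hard:", hard, "soft:", soft)
```

For the first sample, two lukewarm "yes" votes win under hard voting, while one confident "no" pulls the average below 0.5 under soft voting.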


# 15 - New Pre-processing with VotingClassifier from Sklearn: Bad Score of 0.77990

In [0]:
import numpy as np
import pandas as pd
import re

train = pd.read_csv('./data/train.csv', header=0, dtype={'Age': np.float64})
test = pd.read_csv('./data/test.csv', header=0, dtype={'Age': np.float64})
full_data = [train, test]

### PRE-PROCESSING

def get_title(name):
    # Raw string so that \. is passed to the regex engine unchanged
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
    dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
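    # Cast the bounds to int explicitly; draw one random age per missing value
    age_null_random_list = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size=age_null_count)
    # Use .loc instead of chained indexing so the assignment reaches the DataFrame
    dataset.loc[dataset['Age'].isnull(), 'Age'] = age_null_random_list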
    dataset['Age'] = dataset['Age'].astype(int)
    dataset['Title'] = dataset['Name'].apply(get_title)
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', \
                                                 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    # Mapping Sex
    dataset['Sex'] = dataset['Sex'].map({'female': 0, 'male': 1}).astype(int)

    # Mapping titles
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

    # Mapping Embarked
    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

    # Mapping Fare
    dataset.loc[dataset['Fare'] <= 10, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 10) & (dataset['Fare'] <= 20), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 20) & (dataset['Fare'] <= 30), 'Fare'] = 2
    dataset.loc[dataset['Fare'] > 30, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

    # Mapping Age
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[dataset['Age'] > 64, 'Age'] = 4

train['CategoricalFare'] = pd.cut(train['Fare'], 4)
train['CategoricalAge'] = pd.cut(train['Age'], 5)

# Feature Selection
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp',\
                 'Parch', 'FamilySize']
train = train.drop(drop_elements, axis = 1)
train = train.drop(['CategoricalAge', 'CategoricalFare'], axis = 1)
test = test.drop(drop_elements, axis = 1)
#train = train.values
#test = test.values
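
The `get_title` helper relies on a regular expression that captures the first run of letters that is preceded by a space and followed by a period. Below is a small self-contained sketch of that pattern on a few strings in the Titanic `Name` format; the sample list is only an illustrative input.

```python
import re

TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")

def get_title(name):
    """Return the title (e.g. 'Mr', 'Miss') found in a passenger name, or ''."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else ""

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "No title here",
]
titles = [get_title(n) for n in names]
print(titles)
```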

In [0]:
from sklearn.ensemble import VotingClassifier

classifiers = [
    #KNeighborsClassifier(algorithm='brute', n_neighbors= 3, p= 1, weights='uniform'),
    #SVC(probability=True, C= 1, gamma = 0.1, kernel= 'rbf'),
    ('svm', SVC(C=9.062264858625882, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma=0.5565289675398223, kernel='rbf', max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001, verbose=False)),
    #(DecisionTreeClassifier(),
    ('rf', RandomForestClassifier(criterion= 'gini', max_depth=10, max_features= 'log2', min_samples_leaf= 1, min_samples_split= 2, n_estimators= 200)),
    ('sg',SGDClassifier(alpha= 0.001, l1_ratio= 0.8, learning_rate='optimal', loss='log', max_iter= 1000, penalty= 'elasticnet', shuffle= True)),
 	  #AdaBoostClassifier(RandomForestClassifier(max_depth=15), n_estimators=200, learning_rate=0.1, algorithm="SAMME"),
    ('gb',GradientBoostingClassifier()),
    #GaussianNB()),
    ('ld', LinearDiscriminantAnalysis()),
    #QuadraticDiscriminantAnalysis()),
    ('logr', LogisticRegression(solver='newton-cg')),
    ('catb', CatBoostClassifier(iterations=20, depth=10, learning_rate=0.1, loss_function='Logloss', logging_level='Verbose'))]


eclf1 = VotingClassifier(classifiers, voting='soft', flatten_transform=True)


from sklearn.model_selection import train_test_split

#X_train0, X_test0, y_train0, y_test0 = train_test_split(X_mod, y_mod, test_size=0.1, random_state=42)

eclf1 = eclf1.fit(train.drop(["Survived"], axis=1), train["Survived"])

In [0]:
print(eclf1.score(train.drop(["Survived"], axis=1), train["Survived"]))

In [0]:
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

from catboost import CatBoostClassifier


from sklearn.model_selection import train_test_split

#X_train0, X_test0, y_train0, y_test0 = train_test_split(train.drop(["Survived"], axis=1), train["Survived"], test_size=0.1, random_state=42)

#{'CatBoostClassifier': 0.8233333333333335,
# 'GradientBoostingClassifier': 0.8188888888888888,
# 'LinearDiscriminantAnalysis': 0.7933333333333332,
# 'LogisticRegression': 0.7944444444444443,
# 'RandomForestClassifier': 0.8088888888888889,
# 'SGDClassifier': 0.7944444444444444,
# 'SVC': 0.8122222222222222}

classifiers = [
    #KNeighborsClassifier(algorithm='brute', n_neighbors= 3, p= 1, weights='uniform'),
    #SVC(probability=True, C= 1, gamma = 0.1, kernel= 'rbf'),
    SVC(C=9.062264858625882, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma=0.5565289675398223, kernel='rbf', max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001, verbose=False),
    #DecisionTreeClassifier(),
    RandomForestClassifier(criterion= 'gini', max_depth=10, max_features= 'log2', min_samples_leaf= 1, min_samples_split= 2, n_estimators= 200),
    #SGDClassifier(alpha= 0.001, l1_ratio= 0.8, learning_rate='optimal', loss='log', max_iter= 1000, penalty= 'elasticnet', shuffle= True),
 	  AdaBoostClassifier(RandomForestClassifier(max_depth=10), n_estimators=48, learning_rate=1.0888, algorithm="SAMME.R"),
    GradientBoostingClassifier(),
    #GaussianNB(),
    #LinearDiscriminantAnalysis(),
    #QuadraticDiscriminantAnalysis(),
    #LogisticRegression(solver='newton-cg'),
    CatBoostClassifier(iterations=20, depth=10, learning_rate=0.1, loss_function='Logloss', logging_level='Verbose')]

log_cols = ["Classifier", "Accuracy"]
log 	 = pd.DataFrame(columns=log_cols)

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

X = train.drop(["Survived"], axis=1)
y = train["Survived"]

acc_dict = {}

for train_index, test_index in sss.split(X, y):
	X_train, X_test = X.iloc[train_index], X.iloc[test_index]
	y_train, y_test = y.iloc[train_index], y.iloc[test_index]
	
	for clf in classifiers:
		name = clf.__class__.__name__
		clf.fit(X_train, y_train)
		train_predictions = clf.predict(X_test)
		acc = accuracy_score(y_test, train_predictions)
		if name in acc_dict:
			acc_dict[name] += acc
		else:
			acc_dict[name] = acc

for clf in acc_dict:
	acc_dict[clf] = acc_dict[clf] / 10.0
	log_entry = pd.DataFrame([[clf, acc_dict[clf]]], columns=log_cols)
	log = pd.concat([log, log_entry], ignore_index=True)

plt.xlabel('Accuracy')
plt.title('Classifier Accuracy')

sns.set_color_codes("muted")
sns.barplot(x='Accuracy', y='Classifier', data=log, color="b")

In [0]:
predictions = []

for clf in classifiers:
  predictions.append(clf.predict_proba(X)[:,1])
    
combined = np.sum(predictions, axis=0) / len(classifiers)     
rounded = np.round(combined)
 
acc = accuracy_score(y, rounded)
print("Predictions of X_train", acc)

In [0]:
acc_dict

In [0]:
import pandas as pd


hold = test

predictions = []

for clf in classifiers:
  predictions.append(clf.predict_proba(hold)[:,1])
    
combined = np.sum(predictions, axis=0) / len(classifiers)     
rounded = np.round(combined).astype(int)
 
holdout_ids = holdout["PassengerId"]
submission_df = {"PassengerId": holdout_ids, "Survived": rounded}
submission = pd.DataFrame(submission_df)
submission.to_csv("./predictions/submission_Ensemble_Voting0_section15.csv",index=False)

![alt text](https://drive.google.com/uc?export=view&id=1jUT65euStJbzpF7eKdTHUzRcI_OyphxU)


# 16 -  Another Preprocessing Try-all with KNN: Best Score of 0.83253


Based on https://www.kaggle.com/konstantinmasich/titanic-0-82-0-83


In [0]:
# NumPy
import numpy as np

# Dataframe operations
import pandas as pd

# Data visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Scalers
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle

# Models
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn.linear_model import Perceptron
from sklearn import svm #support vector Machine
from sklearn.ensemble import RandomForestClassifier #Random Forest
from sklearn.neighbors import KNeighborsClassifier #KNN
from sklearn.naive_bayes import GaussianNB #Naive bayes
from sklearn.tree import DecisionTreeClassifier #Decision Tree
from sklearn.model_selection import train_test_split #training and testing data split
from sklearn import metrics #accuracy measure
from sklearn.metrics import confusion_matrix #for confusion matrix
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

# Cross-validation
from sklearn.model_selection import KFold #for K-fold cross validation
from sklearn.model_selection import cross_val_score #score evaluation
from sklearn.model_selection import cross_val_predict #prediction
from sklearn.model_selection import cross_validate

# GridSearchCV
from sklearn.model_selection import GridSearchCV

#Common Model Algorithms
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process

#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics

#Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.plotting import scatter_matrix

Loading datasets


In [0]:
train_df = pd.read_csv("./data/train.csv")
test_df = pd.read_csv("./data/test.csv")
data_df = pd.concat([train_df, test_df]) # The entire data: train + test (original row labels are kept).
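
Combining the two frames without resetting the index keeps each row's original label. That is what lets the later cells slice the combined frame by position (the first 891 rows are the training set) and assign the pieces back to `train_df` and `test_df`: pandas aligns the assignment on index labels. Below is a toy sketch of that mechanism; the three-row and two-row frames are made up, and `.iloc` is used to make the positional slicing explicit.

```python
import pandas as pd

train_toy = pd.DataFrame({"Age": [22.0, None, 30.0]})   # index 0, 1, 2
test_toy = pd.DataFrame({"Age": [None, 40.0]})          # index 0, 1

# No ignore_index: the combined frame has labels 0, 1, 2, 0, 1.
data_toy = pd.concat([train_toy, test_toy])
data_toy["Age"] = data_toy["Age"].fillna(99.0)

n_train = len(train_toy)
# Positional slices, then label-aligned assignment back to the originals.
train_toy["Age"] = data_toy["Age"].iloc[:n_train]
test_toy["Age"] = data_toy["Age"].iloc[n_train:]

print(train_toy["Age"].tolist(), test_toy["Age"].tolist())
```

Had the index been reset, the test slice would carry labels 3 and 4 and would no longer line up with the test frame's labels 0 and 1.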

Engineering features


In [0]:
# Extracting Title from the name: the letters right before the first period.
# One vectorized call is enough; no loop over the names is needed.
data_df['Title'] = data_df['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)

# Replacing rare titles with more common ones
mapping = {'Mlle': 'Miss', 'Major': 'Mr', 'Col': 'Mr', 'Sir': 'Mr', 'Don': 'Mr', 'Mme': 'Miss',
          'Jonkheer': 'Mr', 'Lady': 'Mrs', 'Capt': 'Mr', 'Countess': 'Mrs', 'Ms': 'Miss', 'Dona': 'Mrs'}
data_df.replace({'Title': mapping}, inplace=True)
titles = ['Dr', 'Master', 'Miss', 'Mr', 'Mrs', 'Rev']
for title in titles:
    # Look the median up by title label rather than by position in the sorted groups
    age_to_impute = data_df.groupby('Title')['Age'].median()[title]
    data_df.loc[(data_df['Age'].isnull()) & (data_df['Title'] == title), 'Age'] = age_to_impute
    
# Substituting Age values in TRAIN_DF and TEST_DF:
train_df['Age'] = data_df['Age'][:891]
test_df['Age'] = data_df['Age'][891:]

# Dropping Title feature
data_df.drop('Title', axis = 1, inplace = True)


data_df['Family_Size'] = data_df['Parch'] + data_df['SibSp']

# Substituting Family_Size values in TRAIN_DF and TEST_DF:
train_df['Family_Size'] = data_df['Family_Size'][:891]
test_df['Family_Size'] = data_df['Family_Size'][891:]


data_df['Last_Name'] = data_df['Name'].apply(lambda x: str.split(x, ",")[0])
# Assign the result back instead of calling fillna(inplace=True) on a column selection
data_df['Fare'] = data_df['Fare'].fillna(data_df['Fare'].mean())

DEFAULT_SURVIVAL_VALUE = 0.5
data_df['Family_Survival'] = DEFAULT_SURVIVAL_VALUE

for grp, grp_df in data_df[['Survived','Name', 'Last_Name', 'Fare', 'Ticket', 'PassengerId',
                           'SibSp', 'Parch', 'Age', 'Cabin']].groupby(['Last_Name', 'Fare']):
    
    if (len(grp_df) != 1):
        # A Family group is found.
        for ind, row in grp_df.iterrows():
            smax = grp_df.drop(ind)['Survived'].max()
            smin = grp_df.drop(ind)['Survived'].min()
            passID = row['PassengerId']
            if (smax == 1.0):
                data_df.loc[data_df['PassengerId'] == passID, 'Family_Survival'] = 1
            elif (smin==0.0):
                data_df.loc[data_df['PassengerId'] == passID, 'Family_Survival'] = 0

print("Number of passengers with family survival information:", 
      data_df.loc[data_df['Family_Survival']!=0.5].shape[0])


for _, grp_df in data_df.groupby('Ticket'):
    if (len(grp_df) != 1):
        for ind, row in grp_df.iterrows():
            if (row['Family_Survival'] == 0) | (row['Family_Survival']== 0.5):
                smax = grp_df.drop(ind)['Survived'].max()
                smin = grp_df.drop(ind)['Survived'].min()
                passID = row['PassengerId']
                if (smax == 1.0):
                    data_df.loc[data_df['PassengerId'] == passID, 'Family_Survival'] = 1
                elif (smin==0.0):
                    data_df.loc[data_df['PassengerId'] == passID, 'Family_Survival'] = 0
                        
print("Number of passengers with family/group survival information: " 
      +str(data_df[data_df['Family_Survival']!=0.5].shape[0]))

# # Family_Survival in TRAIN_DF and TEST_DF:
train_df['Family_Survival'] = data_df['Family_Survival'][:891]
test_df['Family_Survival'] = data_df['Family_Survival'][891:]


# Fare was already filled with the mean earlier in this cell, so this is a no-op safeguard
data_df['Fare'] = data_df['Fare'].fillna(data_df['Fare'].median())

# Making Bins
data_df['FareBin'] = pd.qcut(data_df['Fare'], 5)

label = LabelEncoder()
data_df['FareBin_Code'] = label.fit_transform(data_df['FareBin'])

train_df['FareBin_Code'] = data_df['FareBin_Code'][:891]
test_df['FareBin_Code'] = data_df['FareBin_Code'][891:]

train_df.drop(['Fare'], axis=1, inplace=True)
test_df.drop(['Fare'], axis=1, inplace=True)


data_df['AgeBin'] = pd.qcut(data_df['Age'], 4)

label = LabelEncoder()
data_df['AgeBin_Code'] = label.fit_transform(data_df['AgeBin'])

train_df['AgeBin_Code'] = data_df['AgeBin_Code'][:891]
test_df['AgeBin_Code'] = data_df['AgeBin_Code'][891:]

train_df.drop(['Age'], axis=1, inplace=True)
test_df.drop(['Age'], axis=1, inplace=True)


train_df['Sex'] = train_df['Sex'].map({'male': 0, 'female': 1})
test_df['Sex'] = test_df['Sex'].map({'male': 0, 'female': 1})

train_df.drop(['Name', 'PassengerId', 'SibSp', 'Parch', 'Ticket', 'Cabin',
               'Embarked'], axis = 1, inplace = True)
test_df.drop(['Name','PassengerId', 'SibSp', 'Parch', 'Ticket', 'Cabin',
              'Embarked'], axis = 1, inplace = True)


X = train_df.drop('Survived', axis=1)
y = train_df['Survived']
holdout = test_df.copy()



std_scaler = StandardScaler()
X = std_scaler.fit_transform(X)
holdout = std_scaler.transform(holdout)
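
The age-imputation loop earlier in this cell fills each missing `Age` with the median age of the passengers who share the same title. The same idea can be written without an explicit loop using `groupby(...).transform('median')`. The sketch below runs on a made-up frame, so the column values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Master"],
    "Age":   [30.0, 40.0, None, 20.0, None, 4.0],
})

# Median age per title, broadcast back to every row of that title.
median_by_title = toy.groupby("Title")["Age"].transform("median")

# Only rows with a missing Age take the group median.
toy["Age"] = toy["Age"].fillna(median_by_title)
print(toy)
```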



## 16.1 - Best Result of KNN: Score of 0.83253

In [0]:
n_neighbors = [6,7,8,9,10,11,12,14,16,18,20,22]
algorithm = ['auto']
weights = ['uniform', 'distance']
leaf_size = list(range(1,50,5))
hyperparams = {'algorithm': algorithm, 'weights': weights, 'leaf_size': leaf_size, 
               'n_neighbors': n_neighbors}
gd=GridSearchCV(estimator = KNeighborsClassifier(), param_grid = hyperparams, verbose=True, 
                cv=10, scoring = "roc_auc")
gd.fit(X, y)
print(gd.best_score_)
print(gd.best_estimator_)
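
With the default `refit=True`, `GridSearchCV` retrains the best parameter combination on all of the data passed to `fit`, so `best_estimator_` can predict without a further call to `fit`. Below is a small sketch on synthetic data; the dataset and the grid are illustrative and much smaller than the grid above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption for this demo).
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=0)

grid = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X_demo, y_demo)

# best_estimator_ is already fitted on all of X_demo, y_demo.
preds = grid.best_estimator_.predict(X_demo[:5])
print(grid.best_params_, preds)
```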

In [0]:
gd.best_estimator_.fit(X, y)
y_pred = gd.best_estimator_.predict(holdout)

In [0]:
knn = KNeighborsClassifier(algorithm='auto', leaf_size=26, metric='minkowski', 
                           metric_params=None, n_jobs=1, n_neighbors=6, p=2, 
                           weights='uniform')
knn.fit(X, y)


In [0]:
y_pred = knn.predict(holdout)

temp = pd.DataFrame(pd.read_csv("./data/test.csv")['PassengerId'])
temp['Survived'] = y_pred
temp.to_csv("./predictions/submission.csv", index = False)

### 16.1.1 - Submission on Kaggle with Best Result of KNN: Score of 0.83253

![alt text](https://drive.google.com/uc?export=view&id=1RZhyZPZAXsXh7Wx0sX-MTfmtIFr2KFv6)

## 16.2 - Another Preprocessing with VotingEnsemble

In [0]:
!pip install catboost


In [0]:
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

from catboost import CatBoostClassifier


from sklearn.model_selection import train_test_split

#X_train0, X_test0, y_train0, y_test0 = train_test_split(train.drop(["Survived"], axis=1), train["Survived"], test_size=0.1, random_state=42)

#{'CatBoostClassifier': 0.8233333333333335,
# 'GradientBoostingClassifier': 0.8188888888888888,
# 'LinearDiscriminantAnalysis': 0.7933333333333332,
#'AdaBoostClassifier': 0.8377777777777776,
# 'CatBoostClassifier': 0.8422222222222222,
# 'GaussianNB': 0.7777777777777777,
# 'KNeighborsClassifier': 0.8422222222222221,
# 'SVC': 0.8133333333333332}
# 'LogisticRegression': 0.7944444444444443,
# 'RandomForestClassifier': 0.8088888888888889,
# 'SGDClassifier': 0.7944444444444444,
# 'SVC': 0.8122222222222222}

classifiers = [
    #KNeighborsClassifier(algorithm='brute', n_neighbors= 3, p= 1, weights='uniform'),
    #SVC(probability=True, C= 1, gamma = 0.1, kernel= 'rbf'),
    SVC(C=9.062264858625882, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma=0.5565289675398223, kernel='rbf', max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001, verbose=False),
    #DecisionTreeClassifier(),
    #RandomForestClassifier(criterion= 'gini', max_depth=10, max_features= 'log2', min_samples_leaf= 1, min_samples_split= 2, n_estimators= 200),
    #SGDClassifier(alpha= 0.001, l1_ratio= 0.8, learning_rate='optimal', loss='log', max_iter= 1000, penalty= 'elasticnet', shuffle= True),
 	  AdaBoostClassifier(RandomForestClassifier(max_depth=6), n_estimators=48, learning_rate=1.0888, algorithm="SAMME.R"),
    #GradientBoostingClassifier(),
    #GaussianNB(),
    #LinearDiscriminantAnalysis(),
    #QuadraticDiscriminantAnalysis(),
    LogisticRegression(solver='newton-cg'),
    CatBoostClassifier(iterations=46, depth=6, learning_rate=0.1, loss_function='Logloss', logging_level='Verbose'),
    KNeighborsClassifier(algorithm='auto', leaf_size=26, metric='minkowski', 
                           metric_params=None, n_jobs=1, n_neighbors=6, p=2, 
                           weights='uniform')]

log_cols = ["Classifier", "Accuracy"]
log 	 = pd.DataFrame(columns=log_cols)

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

#X = train.drop(["Survived"], axis=1)
#y = train["Survived"]

acc_dict = {}

for train_index, test_index in sss.split(X, y):
	# X is a NumPy array after scaling; y is still a pandas Series, so index it by position
	X_train, X_test = X[train_index], X[test_index]
	y_train, y_test = y.iloc[train_index], y.iloc[test_index]
	
	for clf in classifiers:
		name = clf.__class__.__name__
		clf.fit(X_train, y_train)
		train_predictions = clf.predict(X_test)
		acc = accuracy_score(y_test, train_predictions)
		if name in acc_dict:
			acc_dict[name] += acc
		else:
			acc_dict[name] = acc

for clf in acc_dict:
	acc_dict[clf] = acc_dict[clf] / 10.0
	log_entry = pd.DataFrame([[clf, acc_dict[clf]]], columns=log_cols)
	log = pd.concat([log, log_entry], ignore_index=True)

plt.xlabel('Accuracy')
plt.title('Classifier Accuracy')

sns.set_color_codes("muted")
sns.barplot(x='Accuracy', y='Classifier', data=log, color="b")

In [0]:
acc_dict

In [0]:
predictions = []

for clf in classifiers:
  predictions.append(clf.predict_proba(X)[:,1])
    
combined = np.sum(predictions, axis=0) / len(classifiers)     
rounded = np.round(combined)
 
acc = accuracy_score(y, rounded)
print("Predictions of X_train", acc)

In [0]:
import pandas as pd

predictions = []

for clf in classifiers:
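  # The classifiers were fit on the standardized X, so predict on the standardized
  # holdout; the original cell passed the unscaled test_df here, which the
  # score recorded below may reflect.
  predictions.append(clf.predict_proba(holdout)[:,1])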
    
combined = np.sum(predictions, axis=0) / len(classifiers)     
rounded = np.round(combined).astype(int)
 
temp = pd.read_csv("./data/test.csv")

submission_df = {"PassengerId": temp['PassengerId'], "Survived": rounded}
submission = pd.DataFrame(submission_df)
submission.to_csv("./predictions/submission_Ensemble_Voting1_section16.csv",index=False)

**Score of 0.76076**

## 16.2 - With XGBoost: Score of 0.80861


Based on https://www.kaggle.com/stuarthallows/using-xgboost-with-scikit-learn

In [0]:
!pip install xgboost

In [0]:
import numpy as np

from scipy.stats import uniform, randint


from sklearn.metrics import auc, accuracy_score, confusion_matrix, mean_squared_error
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, RandomizedSearchCV, train_test_split

import xgboost as xgb

In [0]:
def display_scores(scores):
    print("Scores: {0}\nMean: {1:.3f}\nStd: {2:.3f}".format(scores, np.mean(scores), np.std(scores)))

In [0]:
def report_best_scores(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

In [0]:
xgb_model = xgb.XGBClassifier(objective="binary:logistic", random_state=42)
xgb_model.fit(X, y)

y_pred = xgb_model.predict(X)

print(confusion_matrix(y, y_pred))

In [0]:
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []

for train_index, test_index in kfold.split(X):
    X_train0, X_test0 = X[train_index], X[test_index]
    y_train0, y_test0 = y.iloc[train_index], y.iloc[test_index]

    xgb_model = xgb.XGBClassifier(objective="binary:logistic")
    xgb_model.fit(X_train0, y_train0)

    y_pred0 = xgb_model.predict(X_test0)

    scores.append(accuracy_score(y_test0, y_pred0))

# Accuracies are reported as-is (a square root only makes sense for MSE -> RMSE)
display_scores(scores)


**Early stopping**

In [0]:

# If more than one evaluation metric is given, the last one is used for early stopping
xgb_model = xgb.XGBClassifier(objective="binary:logistic", random_state=42, eval_metric="auc")

X_train0, X_test0, y_train0, y_test0 = train_test_split(X, y, random_state=42)

xgb_model.fit(X_train0, y_train0, early_stopping_rounds=5, eval_set=[(X_test0, y_test0)])

y_pred0 = xgb_model.predict(X_test0)

accuracy_score(y_test0, y_pred0)
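
Early stopping keeps adding boosting rounds only while the validation metric keeps improving: training halts once the metric has failed to improve for a set number of consecutive rounds (5 in the cell above). The library handles this internally. The pure-Python sketch below only illustrates the stopping rule on a made-up list of validation AUC values, and `best_round_with_patience` is a hypothetical helper, not an XGBoost function.

```python
def best_round_with_patience(val_scores, patience):
    """Return (best_round, stop_round) for a higher-is-better metric.

    Scanning stops once `patience` consecutive rounds fail to beat the best
    score seen so far; if that never happens, stop_round is the last round.
    """
    best_score = float("-inf")
    best_round = -1
    for rnd, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_round = score, rnd
        elif rnd - best_round >= patience:
            return best_round, rnd
    return best_round, len(val_scores) - 1

# Hypothetical validation AUC per boosting round.
val_auc = [0.80, 0.83, 0.85, 0.84, 0.85, 0.84, 0.83, 0.82, 0.90]
best_round, stop_round = best_round_with_patience(val_auc, patience=5)
print(best_round, stop_round)
```

The late jump to 0.90 in the last round is never reached, which is the trade-off early stopping makes: less training time and less overfitting risk in exchange for possibly missing a late improvement.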

**Hyperparameter Searching**

In [0]:
xgb_model = xgb.XGBClassifier()

params = {
    "colsample_bytree": uniform(0.7, 0.3),
    "gamma": uniform(0, 0.5),
    "learning_rate": uniform(0.03, 0.3), # default 0.1 
    "max_depth": randint(2, 6), # default 3
    "n_estimators": randint(100, 150), # default 100
    "subsample": uniform(0.6, 0.4)
}

search = RandomizedSearchCV(xgb_model, param_distributions=params, random_state=42, n_iter=200, cv=3, verbose=1, n_jobs=1, return_train_score=True)

search.fit(X, y)

report_best_scores(search.cv_results_, 1)
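
Note how the search distributions are parameterized: `scipy.stats.uniform(loc, scale)` samples from the interval `[loc, loc + scale]`, not `[loc, scale]`, and `scipy.stats.randint(low, high)` excludes `high`. So `uniform(0.7, 0.3)` covers 0.7 to 1.0 and `randint(2, 6)` yields the integers 2 to 5. A quick check, with an arbitrary sample size and seed:

```python
from scipy.stats import randint, uniform

# Draw samples from two of the distributions used in the search above.
colsample = uniform(0.7, 0.3).rvs(size=1000, random_state=0)
depth = randint(2, 6).rvs(size=1000, random_state=0)

print("colsample_bytree range:", colsample.min(), colsample.max())
print("max_depth values:", sorted(set(depth.tolist())))
```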

In [0]:
# RandomizedSearchCV fits clones, so xgb_model itself is never fitted here;
# report the results stored on the search object instead
print("best CV score: {0}, best params: {1}".format(search.best_score_, search.best_params_))

In [0]:
search

In [0]:
search.score(X_test0, y_test0)

In [0]:
y_pred = search.predict(holdout)

temp = pd.DataFrame(pd.read_csv("./data/test.csv")['PassengerId'])
temp['Survived'] = y_pred
temp.to_csv("./predictions/submission_XGBoost_RandomSearch0.csv", index = False)

**Score of 0.80861**

### 16.2.1 - Submission on Kaggle with Best Result of XGBoost: Score of 0.80861



![alt text](https://drive.google.com/uc?export=view&id=1KHk2xDOt8BvN8EHXu1nu6-sWop1ZvqSs)



## 16.3  - Best Results: KNN, CATBoost and XGBoost

In [0]:
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

from catboost import CatBoostClassifier
import xgboost as xgb

from sklearn.model_selection import train_test_split

#X_train0, X_test0, y_train0, y_test0 = train_test_split(train.drop(["Survived"], axis=1), train["Survived"], test_size=0.1, random_state=42)


classifiers = [
    xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
    CatBoostClassifier(iterations=46, depth=6, learning_rate=0.1, loss_function='Logloss', logging_level='Verbose'),
    KNeighborsClassifier(algorithm='auto', leaf_size=26, metric='minkowski', 
                           metric_params=None, n_jobs=1, n_neighbors=6, p=2, 
                           weights='uniform')]

log_cols = ["Classifier", "Accuracy"]
log 	 = pd.DataFrame(columns=log_cols)

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

#X = train.drop(["Survived"], axis=1)
#y = train["Survived"]

acc_dict = {}

for train_index, test_index in sss.split(X, y):
	# X is a NumPy array after scaling; y is still a pandas Series, so index it by position
	X_train, X_test = X[train_index], X[test_index]
	y_train, y_test = y.iloc[train_index], y.iloc[test_index]
	
	for clf in classifiers:
		name = clf.__class__.__name__
		clf.fit(X_train, y_train)
		train_predictions = clf.predict(X_test)
		acc = accuracy_score(y_test, train_predictions)
		if name in acc_dict:
			acc_dict[name] += acc
		else:
			acc_dict[name] = acc

for clf in acc_dict:
	acc_dict[clf] = acc_dict[clf] / 10.0
	log_entry = pd.DataFrame([[clf, acc_dict[clf]]], columns=log_cols)
	log = pd.concat([log, log_entry], ignore_index=True)

plt.xlabel('Accuracy')
plt.title('Classifier Accuracy')

sns.set_color_codes("muted")
sns.barplot(x='Accuracy', y='Classifier', data=log, color="b")

In [0]:
acc_dict

In [0]:
predictions = []

for clf in classifiers:
  predictions.append(clf.predict_proba(X)[:,1])
    
combined = np.sum(predictions, axis=0) / len(classifiers)     
rounded = np.round(combined)
 
acc = accuracy_score(y, rounded)
print("Predictions of X_train", acc)

In [0]:
import pandas as pd

predictions = []

for clf in classifiers:
  predictions.append(clf.predict_proba(holdout)[:,1])
    
combined = np.sum(predictions, axis=0) / len(classifiers)     
rounded = np.round(combined).astype(int)
 
temp = pd.read_csv("./data/test.csv")

submission_df = {"PassengerId": temp['PassengerId'], "Survived": rounded}
submission = pd.DataFrame(submission_df)
submission.to_csv("./predictions/submission_Voting_CatBoost_KNN_XGB.csv",index=False)

### 16.3.1 - Submission on Kaggle with Voting: Score of 0.79425


![alt text](https://drive.google.com/uc?export=view&id=1FD9RVt-RM-lfAR0ynVfOmcMO9mCzX4rk)

