# Hello wanderer!

Welcome to the new version of my notebook, previously known as "Reach > 0.80 as a noob" and now named "Reach > 0.79 as a beginner". You may have noticed a drastic reduction in the score displayed in the title. I have to admit I'm a bit sad about this: 0.79+ is not nearly as cool as 0.80+, especially since I ascended from complete noob to humble beginner in the meantime. You would have thought I had learnt how to score better. Well, no and yes, actually. Let's jump in!

Early note to potential readers: this notebook is loooooooooong, 'cause I talk a looooooooot, but if you're a real beginner like me, I hope I'll make the read worth your while (at least for the code part, you can skip the markdown stuff). Also, the code you're gonna see is not super clean, not because I don't care, but because I still have to get better at it. Sorry in advance if some of it is hard to watch (at least it works). Last thing: I don't use EDA or advanced ML stuff in this notebook, you can find plenty of that in other notebooks. I'd rather focus on simple things that are sometimes overlooked:
- Transformers and Pipelines in pre-processing
- Ensembling
- Common mistakes of beginners, learnt the hard way

*Once again, no seaborn or matplotlib here. Don't get me wrong, those are essential, super useful even, and I would definitely use them in another notebook to increase my score, but not in this one. This one goes straight to the point.*

# **Why the downgrade in score?**

This new notebook, although it achieves a lower score than my previous one (which was 0.815!), is actually future-proof! To understand what this means: Kaggle has a neat system where the test set of the Titanic competition is often changed. It's really good against cheaters at the top of the leaderboard (the shameless people with a score above 0.95), but it can also drastically reduce the score of other contestants who submit a model that generalizes very poorly... yeah, you get it now, my precious 0.815 score went down to 0.77 overnight. ML adepts will know what kind of disease my model carried, and you might have heard of it too: **OVERFITTING**.

If you have read anything about Machine Learning, you have read about overfitting at least once, generally in the training phase. The thing I learnt with this competition, though, is that you can also overfit in the "submission phase", meaning overfitting the test data! In my case, this overfitting happened because I submitted too many times, tweaking my model bit by bit in order to increase the final score. I didn't realize it back then, but that's overfitting, and in a real-world situation it is dangerous. My new objective with this notebook: reach a decent score in one submission!! Let's do this!
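
To make the "submission phase" overfitting a bit more concrete, here is a tiny simulation. Everything in it is made up for illustration (the set sizes, the 0.78 accuracy, the 50 submissions are my assumptions, not Kaggle's real numbers): every fake "model" is equally good, yet picking the one with the best public score still makes it look better than it really is.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and accuracy, chosen only for the demonstration
n_public, n_private, n_submissions = 200, 5000, 50
true_accuracy = 0.78  # every simulated "model" is exactly this good

# Each row says, passenger by passenger, whether one submission was right
public_hits = rng.random((n_submissions, n_public)) < true_accuracy
private_hits = rng.random((n_submissions, n_private)) < true_accuracy

public_scores = public_hits.mean(axis=1)
private_scores = private_hits.mean(axis=1)

# Keep the submission that looks best on the public set...
best = public_scores.argmax()
print(f"best public score      : {public_scores[best]:.3f}")
# ...and check how the very same submission does on unseen data
print(f"same model, unseen data: {private_scores[best]:.3f}")
```

The gap between the two printed numbers is pure luck from choosing among many tries, which is roughly what happened to my old 0.815.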


In [None]:
import numpy as np 
import pandas as pd 
import warnings
warnings.filterwarnings('ignore')
pd.options.display.max_columns = 40

In [None]:
# Let's load the data
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

In [None]:
# We concatenate the two datasets so missing values are handled on both at once
full_dataset = pd.concat([train_data, test_data])
# Let's then have a look at the features
full_dataset.head()

# Pre-processing - Part One

In order to practice, I wanted to try custom transformers. It looked kinda neat to be able to create my own pipelines with custom transformers. I thought it would take me a long time to learn, and surprise, it's actually rather easy to write simple transformers! I definitely invite you to try.

Since they were interesting to make, I decided to create a lot of transformers. The idea was simple: at the end, I wanted a super-mega-giga pipeline that applies a bunch of changes to the dataset. To my surprise (another one!), not only does this work really well, it is also extremely fast: changes are applied in a few seconds, it's awesome.

I'll explain every transformer with in-code comments, but since there are a lot of them, I chose to display only 2 of them. Click the code button if you want to see more. The idea behind this Part One is to handle missing data and create meaningful features for every initial feature except Age (see Part Two). Part Three will focus on one last feature that I couldn't put into a transformer 'cause I'm still a beginner.
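
Before the real ones, here is the bare skeleton of a custom transformer as I understand it. `MedianFillTransformer` is a hypothetical example, not one of the transformers of this notebook: it learns a value in `fit`, reuses it in `transform`, and works on a copy so the original DataFrame stays untouched.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MedianFillTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical example: fill the missing values of one column with its median."""

    def __init__(self, column="Fare"):
        self.column = column

    def fit(self, X, y=None):
        # Learn the statistic once, from the data given to fit
        self.median_ = X[self.column].median()
        return self

    def transform(self, X):
        X = X.copy()  # work on a copy so the caller's DataFrame is not modified
        X[self.column] = X[self.column].fillna(self.median_)
        return X


toy = pd.DataFrame({"Fare": [7.25, None, 71.28]})
filled = MedianFillTransformer("Fare").fit_transform(toy)
print(filled)
```

`TransformerMixin` is what provides `fit_transform` for free, and `BaseEstimator` gives `get_params` / `set_params`, which is why those two base classes show up in every transformer of this notebook.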


In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class KnownCabinTransformer(BaseEstimator, TransformerMixin):
    """
    This first transformer is rather easy, goal is to create
    a new column, 'KnownCabin' where value is 0 when we don't
    know the cabin location, and 1 when we do.
    For this notebook, this is the only thing I do with cabin,
    but I'm sure there is plenty more to do with this information.
    """
    def __init__(self):
        print("- KnownCabin transformer initiated -")
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # Assign the result back instead of relying on a chained inplace fillna
        X["Cabin"] = X["Cabin"].fillna(0)
        X["KnownCabin"] = 0
        X.loc[X.Cabin != 0, 'KnownCabin'] = 1
        
        print("- KnownCabin transformer applied -")
             
        return X
            

In [None]:
class TitleTransformer(BaseEstimator, TransformerMixin):
    """ 
    A way more interesting Transformer, where I extract the Title 
    of every passenger thanks to a regular expression, and create
    dummies out of them. Those Titles are also going to be useful
    for other features later on.
    """
    def __init__(self):
        print("- Title transformer initiated -")
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # Raw string so that "\." is read as a literal dot by the regex engine
        X['Title'] = X.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
        
        #Different variations of nobility/professions, all called 'Rare'
        X['Title'] = X['Title'].replace(['Lady', 'Countess','Capt', 'Col',
                                         'Don', 'Dr', 'Major', 'Rev', 
                                         'Sir', 'Jonkheer', 'Dona'], 
                                         'Rare')
        
        #Different (French) variations of Miss and Mrs
        X['Title'] = X['Title'].replace(['Mlle', 'Ms'], 'Miss')
        X['Title'] = X['Title'].replace('Mme', 'Mrs')
        
        #If a title is not treated, I count it as rare
        X['Title'] = X['Title'].fillna('Rare')
        
        title_dummies = pd.get_dummies(X['Title'], prefix="Title")
        
        X = pd.concat([X, title_dummies], axis=1)
        
        # Drop of Title column for redundancy
        X = X.drop('Title', axis=1)
        
        
        print("- Title transformer applied -")
             
        return X
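
To see what the regular expression actually catches, here is a small standalone sketch on three names written in the dataset's "Surname, Title. First names" style. The list of rare titles is shortened to one entry for the demo.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# A space, then letters, then a literal dot: the first such match is the title
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Shortened version of the 'Rare' grouping, just for this example
titles = titles.replace(["Countess"], "Rare")

print(titles.tolist())
print(pd.get_dummies(titles, prefix="Title"))
```

Each distinct title ends up as its own `Title_...` column, which is what the later transformers rely on (`Title_Master`, `Title_Miss` and so on).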

From now on, the rest of my transformers are in one big chunk of code just below. Feel free to look at them too by clicking on the "code" button, or go directly to Part Two.

In [None]:
class MissingFareTransformer(BaseEstimator, TransformerMixin):
    """
    Simple transformer to handle missing values in Fare. It looks
    at the class of the passenger before giving the median fare 
    associated with this class.
    It's not optimal as you would also like to look at the number 
    of people under the same ticket, but it's a beginning.    
    """
    def __init__(self):
        print("- MissingFare transformer initiated -")
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X.loc[(X.Pclass == 1) & (X.Fare.isnull()), 'Fare'] = X.loc[X.Pclass == 1]["Fare"].median()
        X.loc[(X.Pclass == 2) & (X.Fare.isnull()), 'Fare'] = X.loc[X.Pclass == 2]["Fare"].median()
        X.loc[(X.Pclass == 3) & (X.Fare.isnull()), 'Fare'] = X.loc[X.Pclass == 3]["Fare"].median()
        
        print("- MissingFare transformer applied -")
        
        return X
    

class EmbarkedTransformer(BaseEstimator, TransformerMixin):
    """
    Another simple transformer, filling the missing values in
    Embarked with the most common value, S in that case. We also use
    this transformer to create dummies of Embarked.
    """
    def __init__(self):
        print("- MissingEmbarked transformer initiated -")
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        self.most_frequent = X["Embarked"].value_counts().index[0]
        X["Embarked"] = X["Embarked"].fillna(self.most_frequent)
        
        embarked_dummies = pd.get_dummies(X['Embarked'], prefix="Embarked")
        
        X = pd.concat([X, embarked_dummies], axis=1)
        
        # Drop of Embarked column for redundancy
        X = X.drop('Embarked', axis=1)
        
        print("- MissingEmbarked transformer applied -")
        
        return X
    

class HypotheticalMissingsTransformer(BaseEstimator, TransformerMixin):
    """
    This one is actually useless, but future-proof !! If the test set
    changes again, this transformer will deal with new missing values 
    in features without missing values at the time of this notebook.
    The only column I don't deal with is Ticket, as it is a tad too
    difficult to generate a random Ticket or to find families that would
    share the same number.
    """
    def __init__(self):
        print("- HypotheticalMissings transformer initiated -")
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        #Pclass is handled by frequency
        self.most_frequent_class = X["Pclass"].value_counts().index[0]
        
        self.unknown_name = "unknown"
        
        # Assuming people with NaN in SibSp and Parch got no family with them
        self.horizontal_family = 0
        self.vertical_family = 0     
        
        X["Pclass"] = X["Pclass"].fillna(self.most_frequent_class)
        X["Name"] = X["Name"].fillna(self.unknown_name)
        
        X["SibSp"] = X["SibSp"].fillna(self.horizontal_family)
        X["Parch"] = X["Parch"].fillna(self.vertical_family)
        
        # We go a little further with Sex, kinda important in our models
        # Master, Mr and Rare are treated as male, Miss and Mrs as female
        # (assignments use "=", a "==" here would only compare and change nothing)
        X.loc[(X.Sex.isnull()) & (X.Title_Master == 1), "Sex"] = "male"
        X.loc[(X.Sex.isnull()) & (X.Title_Mr == 1), "Sex"] = "male"
        # We assume 'rare' title holders are more likely to be male than female
        X.loc[(X.Sex.isnull()) & (X.Title_Rare == 1), "Sex"] = "male"
        
        X.loc[(X.Sex.isnull()) & (X.Title_Miss == 1), "Sex"] = "female"
        X.loc[(X.Sex.isnull()) & (X.Title_Mrs == 1), "Sex"] = "female"
        
        print("- HypotheticalMissings transformer applied -")
        
        return X

    
class GenderTransformer(BaseEstimator, TransformerMixin):
    """
    This transformer could have been a simple use of dummies,
    but at this point I was full-on in the use of transformers,
    transformers are cool, Michael Bay approves !
    """
    def __init__(self):
        print("- Gender transformer initiated -")
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X['Sex'] = X['Sex'].map({'male': 0, 'female': 1})
        
        print("- Gender transformer applied -")
        
        return X


class PclassTransformer(BaseEstimator, TransformerMixin):
    """
    See the GenderTransformer comments, transformers rule!
    """
    def __init__(self):
        print("- Pclass transformer initiated -")
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        pclass_dummies = pd.get_dummies(X['Pclass'], prefix='Pclass')
        
        X = pd.concat([X, pclass_dummies], axis=1)
        
        X = X.drop('Pclass', axis=1)
        
        print("- Pclass transformer applied -")
        
        return X
    

class FamilySizeTransformer(BaseEstimator, TransformerMixin):
    """
    A simple feature created thanks to other existing features.
    FamilySize is used by many other notebooks and yields some
    decent results.
    """
    def __init__(self):
        print("- FamilySize transformer initiated -")
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X["FamilySize"] = X["SibSp"] + X["Parch"] + 1
        
        print("- FamilySize transformer applied -")
        
        return X
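
A side note on the per-class median used for Fare: the same idea can be written for any number of classes with a `groupby`. This is only a sketch on made-up fares, not a replacement for the transformer above.

```python
import numpy as np
import pandas as pd

# Made-up fares, with one missing value in class 1 and one in class 3
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Fare":   [80.0, 60.0, np.nan, 8.0, np.nan, 7.0],
})

# For every row, the median Fare of that row's own class (NaN values are skipped)
class_median = toy.groupby("Pclass")["Fare"].transform("median")

# Only the missing fares are replaced, the known ones stay as they are
toy["Fare"] = toy["Fare"].fillna(class_median)
print(toy)
```

The nice part is that no class number is hard-coded, so a new class value would be handled without touching the code.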

# **Pre-processing - Part Two**

I decided to split the handling of the missing "Age" values into a second part. I saw some notebooks using a mean() or median() method, but I wanted to go a little further and use some basic deduction. The idea was essentially to recognize children among the passengers with a missing Age, children being more likely to survive ("Women and children first" is a popular memory of tragedies such as the Titanic). A simple example: every passenger with the Title Master (or Title_Master == 1 in my case) is a child. By the same logic, a lot of Miss 'might' be children too, whereas Mr. and Mrs. are definitely not. The following transformers therefore focus directly or indirectly on the age of the passengers.

In [None]:
class AgeStatusTransformer(BaseEstimator, TransformerMixin):
    """
    I wanted to distinguish people for whom we know the age for sure,
    those for whom the age is estimated (floating value over 1 yo), and
    those for whom we guessed the age (missing values at the start of the
    exercise)
    """
    def __init__(self):
        print("- AgeStatus transformer initiated -")
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X["AgeStatus"] = "known"
        # Toddlers under 1 have their age displayed as a float, it's not an estimation
        X.loc[(X["Age"] > 1) & ((X["Age"]*2)%2 != 0), "AgeStatus"] = "estimated"
        X.loc[X["Age"].isnull(), "AgeStatus"] = "guessed"
        
        age_status_dummies = pd.get_dummies(X['AgeStatus'], prefix="AgeStatus")
        
        X = pd.concat([X, age_status_dummies], axis=1)
        
        # Drop of AgeStatus column for redundancy
        X = X.drop('AgeStatus', axis=1)
        
        print("- AgeStatus transformer applied -")
             
        return X
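
Here is the same known / estimated / guessed logic on four made-up ages, so you can check the rule by hand. I use `np.select` here only to keep the example short, the transformer above does it with `.loc`.

```python
import numpy as np
import pandas as pd

# 22 is a plain known age, 28.5 is an estimation (xx.5),
# 0.75 is a baby (fraction of a year, not an estimation), NaN is missing
ages = pd.Series([22.0, 28.5, 0.75, np.nan])

is_guessed = ages.isna()
is_estimated = (ages > 1) & ((ages * 2) % 2 != 0)

# Conditions are checked in order, anything left over is "known"
status = np.select([is_guessed, is_estimated], ["guessed", "estimated"], default="known")
print(status.tolist())
```

Multiplying by 2 turns an "xx.5" age into an odd number, which is how the estimated ages are spotted.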
            

**Ticket_grouping**

This transformer does roughly the same job as the FamilySize one, but I found it more accurate to work with for some other transformers later on (Age and Tragedy). The idea is that a decent number of people travel under the same ticket without being family members, and this column can show it.

It is far from ideal, and someone who wants better results might use a more precise transformer, looking at both the name and the ticket for example, to really show the different groups travelling together. The one I used is just the easy way.

In [None]:
import category_encoders as ce

class TicketGroupingTransformer(BaseEstimator, TransformerMixin):
    """
    I apply a category encoder on the Ticket, if 6 people share the
    same ticket, the value of the new column will be 6.
    """
    def __init__(self):
        print("- TicketGrouping transformer initiated -")
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        cat_feature = ['Ticket']
        count_enc = ce.CountEncoder(cols=cat_feature)
        count_enc.fit(X[cat_feature])
        grouping_col = count_enc.transform(X[cat_feature]).add_suffix("_grouping")
        X = pd.concat([X, grouping_col], axis=1)
        
        print("- TicketGrouping transformer applied -")
        
        return X
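
If you don't have `category_encoders` at hand, I believe the same count can be obtained with plain pandas. A minimal sketch on made-up ticket numbers:

```python
import pandas as pd

toy = pd.DataFrame({"Ticket": ["A1", "A1", "B2", "A1", "C3", "B2"]})

# For every row, the number of rows sharing the same Ticket value
toy["Ticket_grouping"] = toy.groupby("Ticket")["Ticket"].transform("count")
print(toy)
```

Three passengers on ticket "A1" each get a 3, the lonely "C3" gets a 1, which is the behaviour described in the docstring above.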

In [None]:
class AgeGuessingTransformer(BaseEstimator, TransformerMixin):
    """
    The transformer that guesses the ages more precisely than a global
    mean or median, building on the columns created by the previous
    transformers. Every rule is explained, but they are rather intuitive.
    
    I never apply hard-coded values of course, I just use the median of
    a more precise subgroup.
    """
    def __init__(self):
        print("- AgeGuessing transformer initiated -")
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # Every 'Master' is a child, we apply the median of all Master (4yo)
        X.loc[(X.Age.isnull() == True) & (X.Title_Master == 1), "Age"] = X.loc[X.Title_Master == 1]["Age"].median()
        
        # Every Miss with at least a sibling (no spouse obviously), is more 
        # likely to be a child too (it's not sure, just more likely) (13yo)
        X.loc[(X.Age.isnull() == True) & (X.SibSp > 0) & (X.Title_Miss == 1), "Age"] = X.loc[(X.Title_Miss == 1) & (X.SibSp > 0)]["Age"].median()
        
        # People alone are way more likely to be adult (28yo)
        X.loc[(X.Age.isnull() == True) & (X.Ticket_grouping == 1), "Age"] = X.loc[(X.Ticket_grouping == 1)]["Age"].median()
        
        # People with more than 1 sibling/spouse are more likely to be children (13yo)
        X.loc[(X.Age.isnull() == True) & (X.SibSp > 1), "Age"] = X.loc[(X.Age.isnull() == False) & (X.SibSp > 1)]["Age"].median()
        
        # The rest is more likely to be adults (27yo)
        X.loc[(X.Age.isnull()), "Age"] = X.loc[(X.Age.isnull() == False) & (X.Ticket_grouping > 1)]["Age"].median()
        
        
        print("- AgeGuessing transformer applied -")
        
        return X

In [None]:
class AgeBinningTransformer(BaseEstimator, TransformerMixin):
    """
    I use a binning technique to create 6 categories of age,
    a better solution would be to look at the distribution of
    age before applying the bins, I was just lazy.
    """
    def __init__(self):
        print("- AgeBinning transformer initiated -")
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        self.bins = [0, 2, 16, 25, 40, 50, np.inf]
        # Labels follow the bin edges above (40 to 50 for the fifth bin)
        self.age_cat = ["0_2", "2_16", "16_25", "25_40", "40_50", "50+"]
        
        X["AgeBins"] = pd.cut(X["Age"], bins=self.bins, labels=self.age_cat)
        
        age_bins_dummies = pd.get_dummies(X['AgeBins'], prefix="AgeBins")
        
        X = pd.concat([X, age_bins_dummies], axis=1)  
        
        print("- AgeBinning transformer applied -")
        
        return X
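
A quick check of how `pd.cut` treats the bin edges, on a few made-up ages. By default each bin includes its right edge, so a 16-year-old lands in "2_16" and not in "16_25".

```python
import numpy as np
import pandas as pd

bins = [0, 2, 16, 25, 40, 50, np.inf]
labels = ["0_2", "2_16", "16_25", "25_40", "40_50", "50+"]

ages = pd.Series([2, 16, 30, 45, 60])

# Intervals are (0, 2], (2, 16], (16, 25], ... with the default right=True
binned = pd.cut(ages, bins=bins, labels=labels)
print(binned.astype(str).tolist())
```

One thing to keep in mind: with these edges an age of exactly 0 would fall outside the first bin and come back as NaN.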

# Pre-processing - Pipeline joy

This is my biggest joy of this notebook, and it was a great occasion for me to practice transformers. It all comes down to this: a badass pipeline that transforms the initial dataset into a way more meaningful one in just a few seconds. It almost got me to cry a little, of joy.

In [None]:
from sklearn.pipeline import Pipeline

custom_pipeline = Pipeline([
        ("cabin_trans", KnownCabinTransformer()),
        ("title_trans", TitleTransformer()),
        ("hypo_missing_trans", HypotheticalMissingsTransformer()),
        ("missing_fare_trans", MissingFareTransformer()),
        ("embarked_trans", EmbarkedTransformer()),
        ("gender_trans", GenderTransformer()),
        ("class_trans", PclassTransformer()),
        ("family_size_trans", FamilySizeTransformer()),
        ("ticket_grp_trans", TicketGroupingTransformer()),
        ("age_status_trans", AgeStatusTransformer()),
        ("age_guessing_trans", AgeGuessingTransformer()),
        ("age_binning_trans", AgeBinningTransformer()),
    ])

full_dataset = custom_pipeline.fit_transform(full_dataset)
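
The order of the steps matters, because a transformer can only use columns created by the ones before it (AgeGuessing needs the Title and Ticket_grouping columns, for example). Here is a two-step toy pipeline showing that dependency. `IsAlone` is a hypothetical feature I made up for the demo, and I use `FunctionTransformer` just to keep it short.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_family_size(df):
    df = df.copy()
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    return df


def add_is_alone(df):
    # Needs FamilySize, so this step has to come after add_family_size
    df = df.copy()
    df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
    return df


toy_pipeline = Pipeline([
    ("family_size", FunctionTransformer(add_family_size)),
    ("is_alone", FunctionTransformer(add_is_alone)),
])

toy = pd.DataFrame({"SibSp": [1, 0], "Parch": [2, 0]})
result = toy_pipeline.fit_transform(toy)
print(result)
```

Swapping the two steps would raise a `KeyError` on `FamilySize`, which is the kind of bug I ran into a few times while building the big pipeline.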

In [None]:
# LOOK AT THIS, SUCH A BEAUTY !!!
full_dataset.head(3)

# Pre-processing - Part Three

It needed to be a trilogy. Why? Because I still have some trouble with transformers, that's why. The following pre-modeling change is huge to me, but putting it in a transformer caused me many bugs I couldn't explain, so here it is as a stupid-ass loop, sooooo sad.

The idea behind this last transformation is to create a column called "TS_Tragedy", for Ticket Sharing Tragedy. More specifically, I want to know whether at least 2/3 of the passengers of a given group (sharing the same ticket) died. It took me a long time of reflection to come up with this. At first I just wanted to create a feature that looked at "vertical survivability": did both parents of a child die (making the child more likely to not survive either)? And did every child of a parent die (making the parent more likely to not have survived)?

I settled for "TS_Tragedy", although it is more general, because it gave me good enough results for children (traditional models of this competition tend to overestimate the survivability of children). Around 300 passengers are concerned, and a hundred got "yes" as a value, meaning at least 2/3 of the known members of their group are reported dead.

In [None]:
# I set up the column as null, staying this way for people in group of 2 or less
full_dataset["TS_Tragedy"] = "null"
i = 0
unique_tickets_subset = full_dataset.loc[full_dataset["Ticket_grouping"] > 2]["Ticket"].unique()
while i < len(unique_tickets_subset):
    subset = full_dataset.loc[full_dataset.Ticket == unique_tickets_subset[i]]
    try:
        # I chose to have the number of people in training set as denominator rather than total on ticket
        ratio = len(subset.loc[subset["Survived"] == 0]) / len(subset.loc[subset.Survived.isnull() == False])
        if ratio > 0.65 :
            full_dataset.loc[full_dataset.Ticket == unique_tickets_subset[i], "TS_Tragedy"] = "yes"
        else:
            full_dataset.loc[full_dataset.Ticket == unique_tickets_subset[i], "TS_Tragedy"] = "no"
        i += 1
    
    #ZeroDivisionError caused by a full group in test set, I pass on it
    except ZeroDivisionError:
        i += 1
        
TST_dummies = pd.get_dummies(full_dataset['TS_Tragedy'], prefix="TST")
        
full_dataset = pd.concat([full_dataset, TST_dummies], axis=1)
        
full_dataset = full_dataset.drop('TS_Tragedy', axis=1)

full_dataset.head(3)
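
For the curious, here is how I think the same rule could be written without a loop, using `groupby`. It is a sketch on made-up tickets, not the code that produced my score: it keeps the same 0.65 threshold and the same choice of counting only passengers whose `Survived` value is known.

```python
import numpy as np
import pandas as pd

# Made-up groups: A and D have known outcomes, B is entirely in the test set,
# C is a group of 2 (too small to be considered)
df = pd.DataFrame({
    "Ticket":   ["A", "A", "A", "B", "B", "B", "C", "C", "D", "D", "D"],
    "Survived": [0, 0, 1, np.nan, np.nan, np.nan, 0, 0, 1, 1, 0],
})

group_size = df.groupby("Ticket")["Ticket"].transform("count")

# 1.0 if the passenger died, 0.0 if survived, NaN if unknown (test set)
died = df["Survived"].eq(0).astype(float).where(df["Survived"].notna())

# Share of deaths among the known members of each ticket group
death_ratio = died.groupby(df["Ticket"]).transform("mean")

tragedy = pd.Series("null", index=df.index)
eligible = (group_size > 2) & death_ratio.notna()
tragedy[eligible & (death_ratio > 0.65)] = "yes"
tragedy[eligible & (death_ratio <= 0.65)] = "no"
print(tragedy.tolist())
```

Group B stays "null" because nobody in it has a known outcome, which is the case the `ZeroDivisionError` handles in the loop.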


In [None]:
# We reindex the full dataset, before dropping the PassengerId col later on
full_dataset.set_index(full_dataset.PassengerId, verify_integrity = True, inplace=True)
full_dataset.tail()

# MODELLING - INTRO

Let's move on to the fun part: creating the model. Many other changes could have been made in pre-processing, but I have enough for this notebook. A good practitioner would also do some neat EDA at this point, good-looking graphs and stuff, but I'm not a good practitioner (kidding aside, plenty of other notebooks provide amazing graphs and figures, give them a look!)

SPOILER ALERT: I use an ensemble model that actually doesn't perform that great (hence the under-0.8 score). I just wanted to practice ensembling and it was amazingly fun. I strongly invite you to give it a try if you're a beginner too.

First things first, we split the data back into training and test sets, and we drop some unused features.

In [None]:
# split back into train and test

X_train = full_dataset.loc[full_dataset.Survived.isnull() == False]
y_train = X_train["Survived"].astype(int)
X_test = full_dataset.loc[full_dataset.Survived.isnull()]

### Dropping non used features (and target Survived)
useless_features = ["PassengerId", "Survived", "Name", "Age", "Ticket", "Fare", "Cabin", "AgeBins"]

# Reassign instead of dropping in place: X_train and X_test are slices of full_dataset
X_train = X_train.drop(useless_features, axis=1)
X_test = X_test.drop(useless_features, axis=1)

Because I'm using an SVM model in my ensemble, I need to apply standard scaling to my data. The following code uses a scikit-learn transformer designed for this purpose and applies it to the data.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train[["SibSp","Parch","FamilySize","Ticket_grouping"]]), columns=["sc_SibSp","sc_Parch","sc_FamilySize","sc_Ticket_grouping"], index=X_train.index)
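# The test set reuses the mean and standard deviation learned on the training set (transform, not fit_transform)
X_test_scaled = pd.DataFrame(scaler.transform(X_test[["SibSp","Parch","FamilySize","Ticket_grouping"]]), columns=["sc_SibSp","sc_Parch","sc_FamilySize","sc_Ticket_grouping"], index=X_test.index)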

X_train = pd.concat([X_train, X_train_scaled], axis=1)
X_train.drop(["SibSp","Parch","FamilySize","Ticket_grouping"], axis=1, inplace=True)
X_test = pd.concat([X_test, X_test_scaled], axis=1)
X_test.drop(["SibSp","Parch","FamilySize","Ticket_grouping"], axis=1, inplace=True)

X_train.head(3)
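
A tiny illustration of scaling the test data with statistics learned on the training data, on made-up numbers. Another classic beginner mistake is refitting the scaler on the test set, which puts the two sets on different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single feature: training mean is 1, training standard deviation is 1
train = np.array([[0.0], [2.0]])
test = np.array([[3.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses them, learns nothing new

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel(), test_scaled.ravel())
```

The test value 3 becomes (3 - 1) / 1 = 2, expressed in "training units", which is what the models fitted on the training set expect.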

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)

# MODELLING - MODELS PARAMETERS

The following part is the heart of the ensemble model: I fit every model after using a GridSearchCV to find some good parameters. There are a lot of them, so I chose to display only the KNeighbors example but, as for the transformers, you can expand the code by clicking on the "code" button.

In [None]:
# Various imports
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import xgboost as xgb
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

In [None]:
# KNeighbors example

param_grid = [
    {'n_neighbors':[2,3,4,5,6,7,8], 'weights':['uniform', 'distance'], 'n_jobs':[-1] }
]

neigh = KNeighborsClassifier()

grid_search = GridSearchCV(neigh, param_grid, cv=5, scoring='accuracy')

grid_search.fit(X_train, y_train)

grid_search.best_params_
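
Small tip I learnt later: instead of copying the best parameters by hand, the fitted search object already holds a model refitted with them. A sketch on synthetic data (the data and the reduced grid are assumptions for the demo, not the Titanic features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

# With the default refit=True, best_estimator_ is already trained on all of X_toy
best_knn = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated accuracy of the winning combination, so it is directly comparable to the `cross_val_score` means printed in this notebook (up to the number of folds).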

In [None]:
neigh = KNeighborsClassifier(n_neighbors = 5, weights = 'uniform')
neigh.fit(X_train, y_train)
scores = cross_val_score(neigh, X_train, y_train, scoring="accuracy", cv=10)
print(scores.mean())

0.82, not bad at all for a KNeighbors model on such a complicated dataset.

Just below I apply the same technique to the rest of the models, nothing much to say. Funnily enough, it's the SVC that yields the best results (even better than the ensemble, in fact; I should improve on this, but it's fine enough for a start, and I'm happy I practiced ensembling).

I put the GridSearchCV code in comments. You can delete it or run it, but some searches take an awful amount of time, be warned.

In [None]:
"""
# Logistic Regression

param_grid = [
    {'penalty':['l1','l2','elasticnet','none'], 'C':[0.001, 0.01, 0.1,1, 10, 100, 1000], 'n_jobs':[-1] }
]

log_reg = LogisticRegression()

grid_search = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy')

grid_search.fit(X_train, y_train)

grid_search.best_params_
"""
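# No regularization, so C would have no effect and is left out. Newer scikit-learn
# releases reject the string 'none'; penalty=None is the replacement there.
log_reg = LogisticRegression(penalty=None)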
log_reg.fit(X_train, y_train)
scores = cross_val_score(log_reg, X_train, y_train, scoring="accuracy", cv=10)
print("Logistic Regression mean score : ", scores.mean())

"""
# Random Forest

param_grid = [
    {'n_estimators':[100,200,300,500,700,1000], 'max_depth':[2,3,4,5,6,7,8],
     'min_samples_split':[2,3,4,5], 'min_samples_leaf':[1,2,3,4] ,'n_jobs':[-1] }
]

forest_clf = RandomForestClassifier()

grid_search = GridSearchCV(forest_clf, param_grid, cv=5, scoring='accuracy')

grid_search.fit(X_train, y_train)

grid_search.best_params_
"""

forest_clf = RandomForestClassifier(max_depth = 6, min_samples_leaf=1, min_samples_split=2, n_estimators=100)
forest_clf.fit(X_train, y_train)
scores = cross_val_score(forest_clf, X_train, y_train, scoring="accuracy", cv=10)
print("Random Forest mean score : ", scores.mean())

"""
# SVC

param_grid = [
    {'kernel':['linear','poly','rbf','sigmoid'], 'C':[0.001, 0.01, 0.1,1, 10, 100, 1000], 'degree':[2,3,4], 'gamma':['auto']}
]

svc = SVC()

grid_search = GridSearchCV(svc, param_grid, cv=5, scoring='accuracy')

grid_search.fit(X_train, y_train)

grid_search.best_params_
"""

# Despite this 0.847 score, real submission is 0.794, not good enough
svc = SVC(C = 10, kernel='rbf', gamma='auto', probability=True)
svc.fit(X_train, y_train)
scores = cross_val_score(svc, X_train, y_train, scoring="accuracy", cv=10)
print("SVC mean score: ", scores.mean())

# GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=500, max_depth=3, learning_rate=0.01)
gbc.fit(X_train, y_train)
scores = cross_val_score(gbc, X_train, y_train, scoring="accuracy", cv=10)
print("Gradient Boosting mean score : ", scores.mean())

# XGB
xgb_clf = xgb.XGBClassifier(n_estimators=500, max_depth=3, learning_rate=0.01)
xgb_clf.fit(X_train, y_train)
scores = cross_val_score(xgb_clf, X_train, y_train, scoring="accuracy", cv=10)
print("XGB mean score : ", scores.mean())


**Voting ensembling**

I discovered this method really recently and found it really cool, so I gave it a go. I chose a 'hard' voting system purely arbitrarily; you can get almost the same results with 'soft' voting.
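
To show the difference between the two voting modes, here is a hand-made example with three imaginary models and two passengers. The probabilities are invented for the demo. Hard voting counts the predicted labels, soft voting averages the predicted probabilities.

```python
import numpy as np

# Hypothetical P(survived) given by three models, one row per passenger
probas = np.array([
    [0.90, 0.40, 0.40],  # one very confident "yes", two hesitant "no"
    [0.55, 0.60, 0.10],  # two mild "yes", one very confident "no"
])

# Hard voting: each model casts a 0/1 vote, the majority wins
votes = (probas > 0.5).astype(int)
hard = (votes.sum(axis=1) > probas.shape[1] / 2).astype(int)

# Soft voting: average the probabilities, then threshold
soft = (probas.mean(axis=1) > 0.5).astype(int)

print("hard:", hard.tolist())
print("soft:", soft.tolist())
```

The two modes disagree on both passengers here because I picked extreme numbers on purpose. On real predictions the models usually agree much more, which fits with what I observed on my own results.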

In [None]:
# Voting ensembling

voting_clf = VotingClassifier(estimators=[('gbc', gbc), ('xgb', xgb_clf), ('forest', forest_clf), ('svc', svc), ('lrc', log_reg), ('knc', neigh)],
                             voting='hard')

voting_clf.fit(X_train, y_train)

scores = cross_val_score(voting_clf, X_train, y_train, scoring="accuracy", cv=10)

print(scores.mean())

0.838, good enough for me, I'll wrap this up and make the predictions out of it !

In [None]:
preds = voting_clf.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': preds})
output.to_csv("submission.csv", index=False)
output.head()

# Tadaaaa!

Unfortunately, 0.80 will not be broken with this notebook (you should get something like 0.795).

But here's a lesson from my journey: I got over 0.79 in a single try! It took me something like 50 submissions to get a 0.80 a few months ago, only to painfully realize what overfitting really meant. I can confidently say this notebook will give me around 0.79/0.80 for EVERY change in the test set, and for me that's awesome. I started ML a few months ago, and those simple exercises (especially coming back to them with more practice) can really show you how much progress you have made.

There are plenty of other things to improve in this notebook, and plenty of other things to try in order to reach a great score. But, for me, that will be enough for now. I sincerely hope you have learnt something from my mistakes. Don't hesitate to leave a comment if you want to know anything about how I chose to code my notebook, I'm always happy to talk with the Kaggle community.

Good day to all of you !