# KAGGLE-LIKE CHALLENGE
Here you can put everything you have learned about supervised machine learning to the test by building a predictive model on the provided data, in the style of a Kaggle competition.

**How a Kaggle challenge works**
- Kaggle always provides two datasets:
  - a data_train.csv file containing the explanatory variables X and the label Y to predict. Use this file to train your models as usual.
  - a data_test.csv file containing the X data in the same format as data_train.csv, but with the labels hidden. Your goal is to make predictions on these data and send them back to Kaggle, which evaluates your model independently.
- Kaggle compares your predictions with the true labels and publishes a leaderboard (teams ranked by score).
- Kaggle announces in advance which metric will be used to evaluate the models: make sure you use the same metric to assess the performance of your own models.

**Conversion prediction**

Here, the aim is to build the best possible model for predicting conversions from several explanatory variables. Your models will be evaluated with the f1-score.

*Use the template below as a guide for reading the files, the structure to follow, and writing the final predictions.*

In [45]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import f1_score, confusion_matrix

import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

# Read file with labels

In [100]:
data = pd.read_csv('conversion_data_train.csv')
print('Set with labels (our train+test) :', data.shape)

Set with labels (our train+test) : (284580, 6)


In [101]:
data.head()

Unnamed: 0,country,age,new_user,source,total_pages_visited,converted
0,China,22,1,Direct,2,0
1,UK,21,1,Ads,3,0
2,Germany,20,0,Seo,14,1
3,US,23,1,Seo,3,0
4,US,28,1,Direct,3,0


# Explore dataset

In [102]:
# Don't forget to compute statistics and visualize your data

In [103]:
data.describe(include = 'all')

Unnamed: 0,country,age,new_user,source,total_pages_visited,converted
count,284580,284580.0,284580.0,284580,284580.0,284580.0
unique,4,,,3,,
top,US,,,Seo,,
freq,160124,,,139477,,
mean,,30.564203,0.685452,,4.873252,0.032258
std,,8.266789,0.464336,,3.341995,0.176685
min,,17.0,0.0,,1.0,0.0
25%,,24.0,0.0,,2.0,0.0
50%,,30.0,1.0,,4.0,0.0
75%,,36.0,1.0,,7.0,0.0


In [104]:
print('Dropping outliers in age...')
to_keep = data['age'] < data['age'].mean() + 2*data['age'].std()
data = data.loc[to_keep,:]
print('Done. Number of lines remaining : ', data.shape[0])
print()


Dropping outliers in age...
Done. Number of lines remaining :  275778
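
The filter above is one-sided: it only removes ages above the mean plus two standard deviations. As a quick check, the cutoff can be recomputed from the `describe` output shown earlier. The values below are copied from that table; treating ages as whole numbers is an assumption based on the rows displayed by `data.head()`.

```python
import math

# Summary statistics of 'age', copied from the describe() output above
age_mean = 30.564203
age_std = 8.266789

# Rows are kept only when age < mean + 2 * std
cutoff = age_mean + 2 * age_std
print(f"cutoff = {cutoff:.6f}")

# Assuming ages are whole numbers, this is the largest age that survives the filter
max_age_kept = math.ceil(cutoff) - 1
print("largest age kept:", max_age_kept)
```

Under that assumption, every session with an age of 48 or more is dropped, and nothing is removed at the low end of the distribution.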



In [105]:
data['converted'].value_counts()/len(data)

0    0.96693
1    0.03307
Name: converted, dtype: float64
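
With only about 3.3% of positives, accuracy says little about a model. A minimal sketch, using made-up counts chosen to mimic that ratio (1,000 sessions with 33 conversions, not the real data), shows that a model predicting 0 everywhere reaches a high accuracy while its f1-score is 0:

```python
# Hypothetical labels mimicking the ~3.3% conversion rate (not the real dataset)
y_true = [1] * 33 + [0] * 967
# A trivial "model" that never predicts a conversion
y_pred = [0] * len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# f1 = 2*TP / (2*TP + FP + FN); taken as 0 when there are no true positives
f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0

print(f"accuracy = {accuracy:.3f}, f1 = {f1:.3f}")
```

This is one reason the f1-score, rather than accuracy, is a sensible metric for this challenge.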

# Make your model (as always)

## Choose variables to use in the model, and create train and test sets

In [68]:
features_list = ['total_pages_visited','age','new_user','source','country']
numeric_indices = [0,1]
categorical_features = [2,3, 4]
target_variable = 'converted'

In [69]:
X = data.loc[:, features_list]

In [70]:
Y = data.loc[:, target_variable]

print('Explanatory variables : ', X.columns)
print()

Explanatory variables :  Index(['total_pages_visited', 'age', 'new_user', 'source', 'country'], dtype='object')



In [71]:
# Split the dataset into a train set and a test set
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print("...Done.")
print()

Dividing into train and test sets...
...Done.
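
Because conversions are rare, a purely random split can leave the train and test sets with slightly different positive rates. One option, not used in the template above, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic labels (3% positives) rather than the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 30 positives (3%), one dummy feature
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([1] * 30 + [0] * 970)

# stratify=y_demo keeps the class proportions identical in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print("positive rate in train:", y_tr.mean())
print("positive rate in test :", y_te.mean())
```

In the notebook this would amount to adding `stratify=Y` to the existing call, assuming `Y` still holds the labels at that point.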



In [72]:
# Convert pandas DataFrames to numpy arrays before using scikit-learn
print("Convert pandas DataFrames to numpy arrays...")
X_train = X_train.values
X_test = X_test.values
Y_train = Y_train.values
Y_test = Y_test.values
print("...Done")

print(X_train[0:5,:])
print(X_test[0:2,:])
print()
print(Y_train[0:5])
print(Y_test[0:2])

Convert pandas DataFrames to numpy arrays...
...Done
[[4 21 1 'Ads' 'China']
 [3 21 1 'Seo' 'China']
 [6 40 0 'Ads' 'China']
 [2 31 1 'Ads' 'China']
 [4 22 1 'Seo' 'US']]
[[3 38 1 'Ads' 'US']
 [6 30 1 'Direct' 'US']]

[0 0 0 0 0]
[0 0]


## Training pipeline

In [73]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import  OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer

In [74]:
# Create pipeline for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')), # missing values will be replaced by columns' median
    ('scaler', StandardScaler())
])

In [75]:
# Create pipeline for categoric features
categorical_transformer = Pipeline(
    steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), # missing values will be replaced by most frequent value
    ('encoder', OneHotEncoder(drop='first')) # the first category of each feature is dropped to avoid perfectly collinear dummy columns
    ])

In [76]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_indices),
        ('cat', categorical_transformer, categorical_features)
    ])
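
`ColumnTransformer` also accepts column names when it is given a pandas DataFrame, which avoids keeping integer positions in sync with `features_list`. The sketch below is an alternative to the template, not the approach used in this notebook: it chains the preprocessing and the classifier into a single `Pipeline` and fits it on a tiny made-up DataFrame (the rows, and the names `toy`, `preprocessor_by_name` and `model`, are invented for illustration).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented rows with the same columns as the training file
toy = pd.DataFrame({
    'total_pages_visited': [2, 3, 14, 3, 16, 1, 12, 4],
    'age': [22, 21, 20, 23, 28, 35, 30, 41],
    'new_user': [1, 1, 0, 1, 0, 1, 0, 1],
    'source': ['Direct', 'Ads', 'Seo', 'Seo', 'Ads', 'Direct', 'Seo', 'Ads'],
    'country': ['China', 'UK', 'Germany', 'US', 'UK', 'China', 'US', 'Germany'],
    'converted': [0, 0, 1, 0, 1, 0, 1, 0],
})

numeric_cols = ['total_pages_visited', 'age']
categorical_cols = ['new_user', 'source', 'country']

# Same transformers as in the template, but selected by column name
preprocessor_by_name = ColumnTransformer(transformers=[
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), numeric_cols),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('encoder', OneHotEncoder(drop='first'))]), categorical_cols),
])

# Preprocessing and model in one object: fit and predict take raw DataFrames
model = Pipeline([
    ('preprocessor', preprocessor_by_name),
    ('classifier', RandomForestClassifier(n_estimators=10, random_state=0)),
])

X_toy = toy[numeric_cols + categorical_cols]
model.fit(X_toy, toy['converted'])
print(model.predict(X_toy))
```

With this layout, the same fitted object can be applied to any new DataFrame that has the same columns, so the preprocessing cannot be forgotten or applied in a different order.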

In [77]:
# Preprocessings on train set
print("Performing preprocessings on train set...")
X_train = preprocessor.fit_transform(X_train)
print('...Done.')
print(X_train[0:5,:])
print()

# Preprocessings on test set
print("Performing preprocessings on test set...")
X_test = preprocessor.transform(X_test)
print('...Done.')
print(X_test[0:5,:])
print()

Performing preprocessings on train set...
...Done.
[[-0.26262182 -1.19098172  1.          0.          0.          0.
   0.          0.        ]
 [-0.56088573 -1.19098172  1.          0.          1.          0.
   0.          0.        ]
 [ 0.333906    1.35037383  0.          0.          0.          0.
   0.          0.        ]
 [-0.85914964  0.14657383  1.          0.          0.          0.
   0.          0.        ]
 [-0.26262182 -1.05722616  1.          0.          1.          0.
   0.          1.        ]]

Performing preprocessings on test set...
...Done.
[[-0.56088573  1.08286272  1.          0.          0.          0.
   0.          1.        ]
 [ 0.333906    0.01281828  1.          1.          0.          0.
   0.          1.        ]
 [-0.56088573  1.08286272  1.          0.          1.          0.
   1.          0.        ]
 [ 0.63216992 -1.19098172  1.          0.          1.          0.
   0.          1.        ]
 [-1.15741356  2.28666271  0.          1.          0.       

In [78]:
# Train model
from sklearn.ensemble import RandomForestClassifier
print("Train model...")
classifier = RandomForestClassifier(n_estimators=50, max_depth=10, min_samples_leaf=5, random_state=0) # random forest; max_depth and min_samples_leaf limit tree complexity to reduce overfitting
classifier.fit(X_train, Y_train)
print("...Done.")

Train model...
...Done.


In [79]:
# Predictions on training set
print("Predictions on training set...")
Y_train_pred = classifier.predict(X_train)
print("...Done.")
print(Y_train_pred)
print()

Predictions on training set...
...Done.
[0 0 0 ... 0 0 0]



In [96]:
classifier.feature_importances_

array([0.88944667, 0.03829291, 0.03940282, 0.00151512, 0.0015793 ,
       0.00814682, 0.00995983, 0.01165652])
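
The eight importances follow the column order produced by the preprocessor: the two scaled numeric features first, then the one-hot columns. The labels below are inferred from the encoded rows printed earlier (for example, `'Seo'` and `'US'` set the fifth and eighth columns), assuming `OneHotEncoder(drop='first')` drops the alphabetically first category of each feature; the notebook itself does not print them.

```python
# Importances copied from the output above
importances = [0.88944667, 0.03829291, 0.03940282, 0.00151512,
               0.0015793, 0.00814682, 0.00995983, 0.01165652]

# Inferred column labels (assumption: dropped categories are 0, 'Ads' and 'China')
labels = ['total_pages_visited', 'age', 'new_user=1',
          'source=Direct', 'source=Seo',
          'country=Germany', 'country=UK', 'country=US']

# Sort features from most to least important
ranked = sorted(zip(labels, importances), key=lambda pair: pair[1], reverse=True)
for label, value in ranked:
    print(f"{label:<22}{value:.4f}")
```

Under these assumptions, `total_pages_visited` accounts for about 89% of the total importance, far ahead of every other feature.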

## Test pipeline

In [80]:
# Predictions on test set
print("Predictions on test set...")
Y_test_pred = classifier.predict(X_test)
print("...Done.")
print(Y_test_pred)
print()

Predictions on test set...
...Done.
[0 0 0 ... 0 0 0]



## Performance assessment

In [81]:
# WARNING : Use the same score as the one that will be used by Kaggle !
# Here, the f1-score will be used to assess the performances on the leaderboard
print("f1-score on train set : ", f1_score(Y_train, Y_train_pred))
print("f1-score on test set : ", f1_score(Y_test, Y_test_pred))

f1-score on train set :  0.7746031746031744
f1-score on test set :  0.7626076260762608


In [94]:
from sklearn.model_selection import cross_val_score
# 5-fold cross-validated f1-score on the (already preprocessed) training set
scores = cross_val_score(classifier, X_train, Y_train, cv=5, scoring='f1')

In [95]:
print(scores.mean())
print(scores.std())

0.7571714768810769
0.007786868096699849
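
One way to read these numbers is to compare the hold-out score with the spread of the cross-validated scores. The values below are copied from the outputs above; the comparison is only a rough sanity check, not a statistical test.

```python
# Values copied from the outputs above
cv_mean = 0.7571714768810769
cv_std = 0.007786868096699849
test_f1 = 0.7626076260762608

# Distance between the hold-out score and the CV mean, in units of CV std
gap_in_std = (test_f1 - cv_mean) / cv_std
print(f"gap = {gap_in_std:.2f} standard deviations")
```

The hold-out f1-score sits within one standard deviation of the cross-validated mean, which suggests the single train/test split was not unusually lucky or unlucky.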


In [85]:
# You can also check more performance metrics to better understand what your model is doing
print("Confusion matrix on train set : ")
print(confusion_matrix(Y_train, Y_train_pred))
print()
print("Confusion matrix on test set : ")
print(confusion_matrix(Y_test, Y_test_pred))
print()

Confusion matrix on train set : 
[[212516    790]
 [  2192   5124]]

Confusion matrix on test set : 
[[53144   208]
 [  564  1240]]
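
The f1-scores reported earlier can be recovered directly from these matrices. scikit-learn's `confusion_matrix` puts true labels in rows and predicted labels in columns, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The sketch below redoes the arithmetic with the counts printed above (`f1_from_confusion` is a helper written for this illustration):

```python
def f1_from_confusion(matrix):
    """Return (precision, recall, f1) for the positive class of a 2x2 matrix."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

# Counts copied from the confusion matrices above
train_matrix = [[212516, 790], [2192, 5124]]
test_matrix = [[53144, 208], [564, 1240]]

for name, matrix in [("train", train_matrix), ("test", test_matrix)]:
    precision, recall, f1 = f1_from_confusion(matrix)
    print(f"{name}: precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

On both sets, recall (about 0.69 to 0.70) is lower than precision (about 0.86 to 0.87), so missed conversions weigh more on the f1-score than false alarms.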



In [92]:
from sklearn.metrics import classification_report
print(classification_report(Y_test, classifier.predict(X_test)))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99     53352
           1       0.88      0.70      0.78      1804

    accuracy                           0.99     55156
   macro avg       0.93      0.85      0.88     55156
weighted avg       0.99      0.99      0.99     55156



In [93]:
from sklearn.metrics import classification_report
print(classification_report(Y_train, classifier.predict(X_train)))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99    213306
           1       0.87      0.70      0.77      7316

    accuracy                           0.99    220622
   macro avg       0.93      0.85      0.88    220622
weighted avg       0.99      0.99      0.99    220622



# Train best classifier on all data and use it to make predictions on X_without_labels

In [86]:
# Concatenate our train and test set to train your best classifier on all data with labels
X = np.append(X_train,X_test,axis=0)
Y = np.append(Y_train,Y_test)

classifier.fit(X,Y)

RandomForestClassifier(max_depth=10, min_samples_leaf=5, n_estimators=50,
                       random_state=0)
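
A note on `np.append` as used above: with `axis=0` it stacks the rows of two 2-D arrays, and without `axis` it flattens its inputs, which is harmless for the 1-D label vectors. A small check on toy arrays:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

# axis=0 stacks rows: the result has shape (3, 2)
stacked = np.append(a, b, axis=0)
print(stacked.shape)

# Without axis, inputs are flattened first: fine for 1-D labels, wrong for 2-D features
flattened = np.append(a, b)
print(flattened.shape)
```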

In [87]:
# Read data without labels
data_without_labels = pd.read_csv('conversion_data_test.csv')
print('Prediction set (without labels) :', data_without_labels.shape)

# Warning : check consistency of features_list (must be the same as the features
# used by your best classifier)
features_list = ['total_pages_visited','age','new_user','source','country']
X_without_labels = data_without_labels.loc[:, features_list]

# Convert pandas DataFrames to numpy arrays before using scikit-learn
print("Convert pandas DataFrames to numpy arrays...")
X_without_labels = X_without_labels.values
print("...Done")

print(X_without_labels[0:5,:])

Prediction set (without labels) : (31620, 5)
Convert pandas DataFrames to numpy arrays...
...Done
[[16 28 0 'Seo' 'UK']
 [5 22 1 'Direct' 'UK']
 [1 32 1 'Seo' 'China']
 [6 32 1 'Ads' 'US']
 [3 25 0 'Seo' 'China']]


In [88]:
# WARNING : PUT HERE THE SAME PREPROCESSING AS FOR YOUR TEST SET
# CHECK YOU ARE USING X_without_labels
print("Encoding categorical features and standardizing numerical features...")

X_without_labels = preprocessor.transform(X_without_labels)
print("...Done")
print(X_without_labels[0:5,:])

Encoding categorical features and standardizing numerical features...
...Done
[[ 3.31654513 -0.25469283  0.          0.          1.          0.
   1.          0.        ]
 [ 0.03564209 -1.05722616  1.          1.          0.          0.
   1.          0.        ]
 [-1.15741356  0.28032939  1.          0.          1.          0.
   0.          0.        ]
 [ 0.333906    0.28032939  1.          0.          0.          0.
   0.          1.        ]
 [-0.56088573 -0.6559595   0.          0.          1.          0.
   0.          0.        ]]
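
Calling `transform` rather than `fit_transform` matters here: the unlabeled data must be scaled with the means and standard deviations learned on the training data, otherwise the model would see features on a different scale from the one it was trained on. A minimal sketch with `StandardScaler` on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for one numeric feature
train_values = np.array([[1.0], [3.0], [5.0], [7.0]])
new_values = np.array([[16.0]])

scaler = StandardScaler()
scaler.fit(train_values)  # learns the mean and std from the training values only

# transform reuses the training statistics: (16 - mean) / std
scaled_new = scaler.transform(new_values)
print(scaler.mean_, scaler.scale_, scaled_new)
```

The same logic applies to the one-hot encoder: the categories and the dropped column are fixed when the preprocessor is fitted.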


In [89]:
# Make predictions and dump to file
# WARNING : MAKE SURE THE FILE IS A CSV WITH ONE COLUMN NAMED 'converted' AND NO INDEX !
# WARNING : FILE NAME MUST HAVE FORMAT 'conversion_data_test_predictions_[name].csv'
# where [name] is the name of your team/model separated by a '-'
# For example : [name] = AURELIE-model1
# Use a dedicated name so the training DataFrame `data` is not overwritten
predictions = {
    'converted': classifier.predict(X_without_labels)
}

Y_predictions = pd.DataFrame(columns=['converted'], data=predictions)
Y_predictions.to_csv('conversion_data_test_predictions_LesSportifs_RFOPT.csv', index=False)
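
Before submitting, it can be worth reading the file back to confirm that it has a single `converted` column and no index column. The sketch below writes to an in-memory buffer instead of a real file, and the predictions are placeholders:

```python
import io
import pandas as pd

# Placeholder predictions standing in for classifier.predict(X_without_labels)
fake_predictions = [0, 1, 0, 0, 1]

submission = pd.DataFrame({'converted': fake_predictions})

# Write without the index, as required for the submission file
buffer = io.StringIO()
submission.to_csv(buffer, index=False)

# Read it back and check the layout
buffer.seek(0)
check = pd.read_csv(buffer)
print(list(check.columns), check.shape)
```

For the real file, the same check would call `pd.read_csv` on the written path and compare the number of rows with the size of the prediction set.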
