<a href="https://colab.research.google.com/github/rafaborneo/kaggle-competition/blob/main/Rafael_kaggle.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Diplodatos Kaggle Competition

We present this piece of code as a baseline for the competition and as an example of how to approach this kind of problem. The main goals are for you to:

1. Explore the data and learn from it
1. Try different models and see which one best fits the given data
1. Get a higher score than the one in the current baseline example
1. Try to get the highest score in the class :)

In [None]:
# Import the required packages
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Read Data

In [None]:
train_df = pd.read_csv("travel_insurance_prediction_train.csv")
test_df = pd.read_csv("travel_insurance_prediction_test.csv")

## Explore the Data

Your task is to explore the data, analyze it, and use the insights you gain to pick a better model.

In [None]:
train_df.head()

Unnamed: 0,Customer,Age,Employment Type,GraduateOrNot,AnnualIncome,FamilyMembers,ChronicDiseases,FrequentFlyer,EverTravelledAbroad,TravelInsurance
0,1,33,Private Sector/Self Employed,Yes,550000,6,0,No,No,1
1,2,28,Private Sector/Self Employed,Yes,800000,7,0,Yes,No,0
2,3,31,Private Sector/Self Employed,Yes,1250000,4,0,No,No,0
3,4,31,Government Sector,No,300000,7,0,No,No,0
4,5,28,Private Sector/Self Employed,Yes,1250000,3,0,No,No,0


In [None]:
test_df.head()

Unnamed: 0,Customer,Age,Employment Type,GraduateOrNot,AnnualIncome,FamilyMembers,ChronicDiseases,FrequentFlyer,EverTravelledAbroad
0,1491,29,Private Sector/Self Employed,Yes,1100000,4,0,No,No
1,1492,28,Private Sector/Self Employed,Yes,750000,5,1,Yes,No
2,1493,31,Government Sector,Yes,1500000,4,0,Yes,Yes
3,1494,28,Private Sector/Self Employed,Yes,1400000,3,0,No,Yes
4,1495,33,Private Sector/Self Employed,Yes,1500000,4,0,Yes,Yes


**TravelInsurance** is the column that we should predict. That column is not present in the test set.

In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1490 entries, 0 to 1489
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Customer             1490 non-null   int64 
 1   Age                  1490 non-null   int64 
 2   Employment Type      1490 non-null   object
 3   GraduateOrNot        1490 non-null   object
 4   AnnualIncome         1490 non-null   int64 
 5   FamilyMembers        1490 non-null   int64 
 6   ChronicDiseases      1490 non-null   int64 
 7   FrequentFlyer        1490 non-null   object
 8   EverTravelledAbroad  1490 non-null   object
 9   TravelInsurance      1490 non-null   int64 
dtypes: int64(6), object(4)
memory usage: 116.5+ KB


In [None]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 497 entries, 0 to 496
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Customer             497 non-null    int64 
 1   Age                  497 non-null    int64 
 2   Employment Type      497 non-null    object
 3   GraduateOrNot        497 non-null    object
 4   AnnualIncome         497 non-null    int64 
 5   FamilyMembers        497 non-null    int64 
 6   ChronicDiseases      497 non-null    int64 
 7   FrequentFlyer        497 non-null    object
 8   EverTravelledAbroad  497 non-null    object
dtypes: int64(5), object(4)
memory usage: 35.1+ KB


In [None]:
train_df.describe()

Unnamed: 0,Customer,Age,AnnualIncome,FamilyMembers,ChronicDiseases,TravelInsurance
count,1490.0,1490.0,1490.0,1490.0,1490.0,1490.0
mean,745.5,29.667114,927818.8,4.777181,0.275839,0.357047
std,430.270264,2.880994,381171.5,1.640248,0.447086,0.47929
min,1.0,25.0,300000.0,2.0,0.0,0.0
25%,373.25,28.0,600000.0,4.0,0.0,0.0
50%,745.5,29.0,900000.0,5.0,0.0,0.0
75%,1117.75,32.0,1250000.0,6.0,1.0,1.0
max,1490.0,35.0,1800000.0,9.0,1.0,1.0


In [None]:
test_df.describe()

Unnamed: 0,Customer,Age,AnnualIncome,FamilyMembers,ChronicDiseases
count,497.0,497.0,497.0,497.0,497.0
mean,1739.0,29.599598,947585.5,4.68008,0.283702
std,143.615807,3.010506,363581.8,1.51347,0.451248
min,1491.0,25.0,300000.0,2.0,0.0
25%,1615.0,28.0,650000.0,4.0,0.0
50%,1739.0,29.0,950000.0,4.0,0.0
75%,1863.0,32.0,1250000.0,6.0,1.0
max,1987.0,35.0,1750000.0,9.0,1.0
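
Beyond summary statistics, one simple way to look for insights is to compare the purchase rate across the values of a categorical column. The sketch below uses a small made-up frame (`toy_df` and its rows are hypothetical; only the column names come from the competition data), so the rates it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy rows that reuse the competition column names (not real data).
toy_df = pd.DataFrame({
    "FrequentFlyer": ["No", "Yes", "No", "Yes", "No", "Yes"],
    "EverTravelledAbroad": ["No", "No", "No", "Yes", "No", "Yes"],
    "TravelInsurance": [0, 0, 0, 1, 1, 1],
})

# Mean of a 0/1 target per group = share of customers who bought insurance.
rate_by_flyer = toy_df.groupby("FrequentFlyer")["TravelInsurance"].mean()
rate_by_abroad = toy_df.groupby("EverTravelledAbroad")["TravelInsurance"].mean()

print(rate_by_flyer)
print(rate_by_abroad)
```

Running the same `groupby` on `train_df` would show which categorical columns separate the two classes most clearly.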


## Baseline

In this section we present a baseline based on a decision tree classifier.

Many of the attributes are binary and a couple are numeric. We might one-hot encode some of them (e.g. family members) or even discretize others (age and annual income); which option fits best will become clearer after the EDA.

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

### Transform the columns into features

First we need to transform the columns into features. The type of features we use has a direct impact on the final result. In this example we discretize some numeric features and one-hot encode others. The number of bins, the choice of columns to one-hot encode, and so on are all up to you to try out.
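
To make the two encodings concrete, here is a minimal sketch on made-up values. The arrays `ages` and `family` are hypothetical, and 3 bins are used only to keep the output small.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

# Hypothetical ages; quantile binning puts roughly equal counts in each bin.
ages = np.array([[25], [27], [29], [31], [33], [35]])
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
age_bins = binner.fit_transform(ages).ravel()
print(age_bins)  # one ordinal bin index per row

# Hypothetical family sizes; each distinct value becomes its own 0/1 column.
family = np.array([[2], [4], [4], [9]])
encoder = OneHotEncoder(dtype="int", handle_unknown="ignore")
family_onehot = encoder.fit_transform(family).toarray()
print(family_onehot)

# A value not seen during fit is encoded as all zeros with handle_unknown="ignore".
unseen = encoder.transform(np.array([[7]])).toarray()
print(unseen)
```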

In [None]:
transformer = make_column_transformer(
    (KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"), ["Age", "AnnualIncome"]),
    (OneHotEncoder(categories="auto", dtype="int", handle_unknown="ignore"),
     ["Employment Type", "GraduateOrNot", "FamilyMembers", "FrequentFlyer", "EverTravelledAbroad"]),
    remainder="passthrough")

We transform the train and test data. To avoid overfitting, it is better to remove the `Customer` column, which is only an identifier, and we don't want the `TravelInsurance` column among the attributes either, since it is the target.

In [None]:
# The data for training the model
X_train = transformer.fit_transform(train_df.drop(columns=["Customer", "TravelInsurance"]))
y_train = train_df["TravelInsurance"].values

# The test data is only for generating the submission
X_test = transformer.transform(test_df.drop(columns=["Customer"]))
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)

(1490, 19)
(1490,)
(497, 19)
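
The cell above fits the transformer once on the full training set. One alternative, sketched here as an assumption and not as part of the baseline, is to put the transformer and the classifier in a single scikit-learn pipeline so that cross-validation refits the bins and categories on each training fold only. The frame `toy_df`, the target `toy_y`, and the reduced set of columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data with two of the competition column names.
rng = np.random.default_rng(0)
n = 200
toy_df = pd.DataFrame({
    "Age": rng.integers(25, 36, size=n),
    "FrequentFlyer": rng.choice(["No", "Yes"], size=n),
})
toy_y = ((toy_df["FrequentFlyer"] == "Yes") & (toy_df["Age"] > 29)).astype(int)

# Preprocessing and model in one estimator; steps are named after their classes.
pipe = make_pipeline(
    make_column_transformer(
        (KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile"), ["Age"]),
        (OneHotEncoder(handle_unknown="ignore"), ["FrequentFlyer"]),
    ),
    DecisionTreeClassifier(random_state=42),
)

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"decisiontreeclassifier__max_depth": [3, 6]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(toy_df, toy_y)
print(search.best_params_)
```

With this layout the raw DataFrame is passed to `fit` and `predict`, and the same fitted preprocessing is applied to the test set automatically.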


In [None]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the training rows as a validation set
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, random_state=0, train_size=0.8)

print(X_train.shape)
print(y_train.shape)

print(X_valid.shape)
print(y_valid.shape)

(1192, 19)
(1192,)
(298, 19)
(298,)


In [None]:
from sklearn.tree import DecisionTreeClassifier
search_params_decis_tree = {
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [1, 2, 5],
    'max_depth': [3, 6, 10, 15, 20]
}
model_decis_tree = DecisionTreeClassifier(random_state=42)



In [None]:
grid_decis_tree = GridSearchCV(model_decis_tree, search_params_decis_tree, cv=5, scoring='f1', n_jobs=-1)
grid_decis_tree.fit(X_train, y_train)

best_decis_tree = grid_decis_tree.best_estimator_
best_decis_tree

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=15, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')

In [None]:
print('DecisionTreeClassifier\n')
print(classification_report(y_train, best_decis_tree.predict(X_train)))

DecisionTreeClassifier

              precision    recall  f1-score   support

           0       0.85      0.93      0.89       766
           1       0.85      0.69      0.76       426

    accuracy                           0.85      1192
   macro avg       0.85      0.81      0.83      1192
weighted avg       0.85      0.85      0.84      1192



In [None]:
print('DecisionTreeClassifier: validation\n')
print(classification_report(y_valid, best_decis_tree.predict(X_valid)))

DecisionTreeClassifier: validation

              precision    recall  f1-score   support

           0       0.82      0.90      0.86       192
           1       0.78      0.65      0.71       106

    accuracy                           0.81       298
   macro avg       0.80      0.77      0.78       298
weighted avg       0.81      0.81      0.80       298



# **NEW MODEL**

```
ADABOOST
```



In [None]:
from sklearn.ensemble import AdaBoostClassifier

search_params_AdaBoost = {
    'n_estimators':[5,10,20,30],
    'learning_rate':[1.0,5.0,10.0,20.0]
}

model_AdaBoost = AdaBoostClassifier(random_state=42)
grid_AdaBoost = GridSearchCV(model_AdaBoost, search_params_AdaBoost, cv=5, scoring='f1', n_jobs=-1)

grid_AdaBoost.fit(X_train, y_train)
best_AdaBoost = grid_AdaBoost.best_estimator_
best_AdaBoost


AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=20, random_state=42)

In [None]:
print('AdaBoostClassifier\n')
print(classification_report(y_train, grid_AdaBoost.predict(X_train)))

AdaBoostClassifier

              precision    recall  f1-score   support

           0       0.77      0.93      0.84       766
           1       0.79      0.51      0.62       426

    accuracy                           0.78      1192
   macro avg       0.78      0.72      0.73      1192
weighted avg       0.78      0.78      0.76      1192
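
The report above covers the training split only. A minimal sketch of scoring the same kind of model on a held-out split is shown below; it uses synthetic data from `make_classification` (an assumption, so that the block runs on its own), and the names `X_toy`, `y_toy`, and `ada` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the transformed features and labels.
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, train_size=0.8, random_state=0)

# Fit on the training part only, then score both parts.
ada = AdaBoostClassifier(n_estimators=20, learning_rate=1.0, random_state=42)
ada.fit(X_tr, y_tr)

train_f1 = f1_score(y_tr, ada.predict(X_tr))
valid_f1 = f1_score(y_va, ada.predict(X_va))
print(f"train F1={train_f1:.3f}  validation F1={valid_f1:.3f}")
```

In the notebook, the equivalent check would be `classification_report(y_valid, best_AdaBoost.predict(X_valid))`. A large gap between the two scores suggests overfitting.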



# **LOGISTIC REGRESSION**

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
cLoR = LogisticRegression(solver = 'liblinear', random_state = 0).fit(X_train, y_train)

In [None]:
print('Logistic regression\n')
print(classification_report(y_train, cLoR.predict(X_train)))

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Candidate regularization strengths and penalties (l1 = lasso, l2 = ridge).
# Note: the default solver (lbfgs) rejects the l1 penalty, so every l1 candidate
# fails to fit and only the l2 candidates are actually compared. Passing
# solver="liblinear" to LogisticRegression would let both penalties be searched.
grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000], "penalty": ["l1", "l2"]}
logreg = LogisticRegression()
logreg_cv = GridSearchCV(logreg, grid, cv=10)
logreg_cv.fit(X_train, y_train)

print("tuned hyperparameters (best parameters):", logreg_cv.best_params_)
print("accuracy:", logreg_cv.best_score_)

ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

tuned hyperparameters (best parameters): {'C': 0.01, 'penalty': 'l2'}
accuracy: 0.7802030812324929
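
The `ValueError` messages come from combining the default `lbfgs` solver with the `l1` penalty. A minimal sketch of a search in which both penalties can be fitted is shown below, assuming the `liblinear` solver that the first logistic regression cell already uses. The data is synthetic (`make_classification`), and the names `X_toy`, `y_toy`, and `search` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the transformed training data.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# liblinear supports both the l1 and the l2 penalty.
grid_l1_l2 = {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]}
search = GridSearchCV(
    LogisticRegression(solver="liblinear", random_state=0),
    grid_l1_l2,
    cv=5,
)
search.fit(X_toy, y_toy)

# Every candidate should now have a score (no failed fits).
scores = search.cv_results_["mean_test_score"]
print(search.best_params_, np.isnan(scores).any())
```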


In [None]:
logreg2=LogisticRegression(C=0.01,penalty="l2")
logreg2.fit(X_train,y_train)
print("score",logreg2.score(X_train,y_train))

score 0.7785234899328859


In [None]:
# logreg itself was never fitted (GridSearchCV fits clones), so report on the fitted logreg2
print(classification_report(y_train, logreg2.predict(X_train)))

              precision    recall  f1-score   support

           0       0.77      0.93      0.84       766
           1       0.81      0.50      0.62       426

    accuracy                           0.78      1192
   macro avg       0.79      0.72      0.73      1192
weighted avg       0.78      0.78      0.76      1192



In [None]:
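# Note: this model is fitted and scored on the validation split itself, so the
# score below is not a held-out estimate; logreg2.score(X_valid, y_valid) would be.
logreg3=LogisticRegression(C=0.01,penalty="l2")
logreg3.fit(X_valid,y_valid)
print("score",logreg3.score(X_valid,y_valid))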

score 0.7281879194630873


In [None]:
print(classification_report(y_valid, logreg3.predict(X_valid)))

              precision    recall  f1-score   support

           0       0.71      0.98      0.82       192
           1       0.90      0.26      0.41       106

    accuracy                           0.73       298
   macro avg       0.81      0.62      0.62       298
weighted avg       0.78      0.73      0.68       298



## Generate the output

The last step is to generate the file to be *submitted* on Kaggle.

In [None]:
test_id = test_df["Customer"]
test_pred = best_AdaBoost.predict(X_test)



In [None]:
submission = pd.DataFrame({"Customer": test_id, "TravelInsurance": test_pred})
submission.to_csv("travel_insurance_submission.csv", header=True, index=False)
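
Before uploading, it can help to reload the file and check its basic shape. The sketch below does this on a made-up three-row submission written to an in-memory buffer; `toy_ids`, `toy_pred`, and the specific checks are assumptions, not requirements stated by the competition.

```python
import io

import pandas as pd

# Hypothetical IDs and predictions standing in for test_id and test_pred.
toy_ids = pd.Series([1491, 1492, 1493], name="Customer")
toy_pred = [0, 1, 0]
toy_submission = pd.DataFrame({"Customer": toy_ids, "TravelInsurance": toy_pred})

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
toy_submission.to_csv(buffer, header=True, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Basic sanity checks on the reloaded file.
columns_ok = list(reloaded.columns) == ["Customer", "TravelInsurance"]
ids_unique = reloaded["Customer"].is_unique
labels_ok = bool(reloaded["TravelInsurance"].isin([0, 1]).all())
print(columns_ok, ids_unique, labels_ok)
```

For the real file, the same checks can be run on `pd.read_csv("travel_insurance_submission.csv")`, together with a check that the row count matches `len(test_df)`.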