## **Objective :** predict whether a customer made a claim on an insurance policy.


<img src="https://www.claimsmadesimple.org/wp-content/uploads/2020/01/claims.jpg" width="500"></img>

## **Content :**

- Import Libraries
- Load Data
- EDA :
     - Dealing With Missing Values
     - Outliers
- Pre-Processing
     - Feature Scaling
     - PCA
- Modeling
- Submission

<p style="background-color:pink; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 10px 100px;"><b>Importing Libraries & Data</b></p> 

### Import Libraries :

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
import xgboost

# use the full option names so pandas does not have to guess them from a partial match
pd.set_option('display.max_columns', 120)
pd.set_option('display.max_rows', 30)

### Read The Data :

In [None]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-sep-2021/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-sep-2021/test.csv')
sub = pd.read_csv('/kaggle/input/tabular-playground-series-sep-2021/sample_solution.csv')

In [None]:
train.head()

<p style="background-color:pink; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 10px 100px;"><b>EDA</b></p> 

In [None]:
train.shape

In [None]:
train.info()

In [None]:
train.dtypes.value_counts()

In [None]:
train.describe()

Here we notice that the features have different ranges, so later we should apply feature scaling, which refers to the methods used to normalize the range of the independent variables.
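As a small illustration of what standardization does, here is a minimal sketch on a made-up two-column frame (the column names and values are hypothetical, not taken from the competition data). After scaling, each column should have mean 0 and unit variance.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
toy = pd.DataFrame({
    "small_range": [0.1, 0.2, 0.3, 0.4],
    "large_range": [1000.0, 2000.0, 3000.0, 4000.0],
})

# StandardScaler subtracts each column's mean and divides by its
# population standard deviation (ddof=0).
scaled = StandardScaler().fit_transform(toy)

# The same transformation written out by hand, for comparison.
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Both columns end up on the same scale even though their raw ranges differ by four orders of magnitude.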

In [None]:
train.isna().sum() / train.shape[0] *100

In [None]:
test.isna().sum() / test.shape[0] *100

Almost all features have between 1% and 2% NaN values.

In [None]:
train.nunique()[train.nunique()<1000] 

We might think that f97 is a categorical column, but let's check first ;)

In [None]:
train.f97.value_counts() 

In [None]:
print('minimal value in f97 column is : ', train.f97.min())
print('maximal value in f97 column is : ', train.f97.max())

It appears that f97 is not a categorical column; its values simply vary within a small range.

In [None]:
dup_rows = train[train.duplicated()]
# shape[0] is the row count; printing the whole shape would show a (rows, columns) tuple
print('Number of duplicated rows is : ', dup_rows.shape[0])

In [None]:
train.claim.value_counts() / train.shape[0] *100 

The two classes of the target variable are balanced.

### Variables Correlations :

In [None]:
corr = train.corr()
highest_corr = corr.index[abs(corr["claim"])>0.01]

In [None]:
highest_corr = train[highest_corr].corr()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(14 , 14))

#mask = np.zeros_like(highest_corr, dtype=np.bool)
#mask[np.triu_indices_from(mask)] = True

sns.heatmap(highest_corr, ax=ax,
            
        square=True, center=0, linewidth=1,
        cmap=sns.diverging_palette(240, 10, as_cmap=True),
        #cmap = 'Greens',
        annot = True,
        fmt = '.3f',
        cbar_kws={"shrink": .6},    
        mask=np.triu(highest_corr)
       ) 

ax.set_title(f'Correlation', loc='left', fontweight='bold')     

plt.show()

In [None]:
highest_corr.claim.to_frame().T.sort_values(by = 'claim', ascending = True)

There are only weak correlations between the independent variables and the dependent variable.

### Outliers :

Let's check for outliers. We will train the model without modifying the outliers, and then again after capping them.

In [None]:
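def check_outliers(col) :
    # IQR-based limits: values more than 1.5 * IQR beyond the quartiles count as outliers
    Q1 = col.quantile(.25)
    Q3 = col.quantile(.75)
    IQR = Q3 - Q1
    lowerLimit = Q1 - 1.5*IQR
    higherLimit = Q3 + 1.5*IQR

    # boolean mask instead of a Python loop (much faster on ~1M rows);
    # NaN compares False on both sides, so missing values are never flagged
    outliers = col[(col < lowerLimit) | (col > higherLimit)]

    return outliers.to_numpy(), lowerLimit, higherLimit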

    

These values may be the result of human error or system failure. We cannot simply accept them, and we cannot drop the affected rows either, since we would then lose the data in their other features. Instead, we can use the IQR.

The IQR, or interquartile range, is a measure of variability based on dividing the dataset into quartiles: it is the difference between the third quartile (Q3) and the first quartile (Q1).

We compute a lower limit and an upper limit from the quartiles, then replace values below the lower limit with the lower limit and values above the upper limit with the upper limit. This also works with left-skewed or right-skewed data.
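The capping rule described above can be sketched on a single column with `Series.clip`. The helper name `cap_outliers_iqr` and the toy values are hypothetical and only for illustration; this is not the function used in the rest of the notebook.

```python
import pandas as pd

def cap_outliers_iqr(col: pd.Series) -> pd.Series:
    """Clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] to those limits."""
    q1 = col.quantile(0.25)
    q3 = col.quantile(0.75)
    iqr = q3 - q1
    # clip leaves NaN values untouched, so imputation can still happen later
    return col.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Hypothetical toy column with one obvious outlier.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
capped = cap_outliers_iqr(toy)
print(capped.tolist())
```

Here Q1 is 2.0 and Q3 is 4.0, so the limits are -1.0 and 7.0, and the outlier 100.0 is pulled down to 7.0 while the other values stay as they are.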

In [None]:
def change_outliers(data) :
    for col in data.columns :
        arr,lowerLimit,higherLimit = check_outliers(data[col])
        #print(col, len(arr))

        data[col] = np.where(data[col]>higherLimit,higherLimit,data[col])
        data[col] = np.where(data[col] <lowerLimit,lowerLimit,data[col])
        
change_outliers(train)
change_outliers(test)

In [None]:
train.describe().T[['min', 'max']].sort_values(by='max')

Even after capping the outliers, the features still have different ranges, so we still need feature scaling.

**EDA Conclusions :**
- Number of rows and columns : (957919, 120)
- Types of Features : all features are numerical
- No categorical variables
- Almost all features have 1% to 2% NaN values in both the train and test sets
- Features are in different ranges --> Need some Feature Scaling
- No duplicated rows
- Target variable is balanced
- Weak correlations between the features and the target
- There are many outliers in our data, so we cap their values using the IQR limits

<p style="background-color:pink; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 10px 100px;"><b>Pre-Processing</b></p> 

In [None]:
# let's first set the target 
y  = train.claim

### Data Imputation :

In [None]:
for col in train.columns :
    train[col] = train[col].fillna(train[col].median())
    if col != 'claim' :
        test[col] = test[col].fillna(train[col].median())

### Feature Scaling :

In [None]:
scaler = StandardScaler()
train = scaler.fit_transform(train.drop(['id', 'claim'], axis = 1))

test = scaler.transform(test.drop('id', axis = 1))

### PCA :

Dimensionality reduction matters because it lets us compress the dataset by removing redundancy and retaining only useful information. Too many input variables can lead to the curse of dimensionality: the model also learns from noise in the training dataset, overfits, and then performs poorly.

Principal Component Analysis (PCA) is a powerful technique for dimensionality reduction, increasing interpretability while at the same time minimizing information loss.

In [None]:
# let's try first to determine the appropriate number of components
plt.figure(figsize=(14,7))
pca = PCA().fit(train)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')

We plot the cumulative explained variance ratio and choose a number of components that captures at least 95% of the variance; 110 is a good choice in our case.
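Instead of reading the threshold off the plot, the component count can also be computed directly. The sketch below uses random data as a hypothetical stand-in for the scaled training matrix, so the number it prints says nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the scaled training matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# First position where the cumulative ratio reaches 0.95, converted to a count.
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components_95)

# scikit-learn can make the same selection itself when n_components
# is a float in (0, 1); this requires the full SVD solver.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X)
print(pca_95.n_components_)
```

Passing a float to `n_components` saves the manual step, while the explicit `cumsum` version makes it easier to see how close neighbouring choices are to the threshold.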

In [None]:
pca = PCA(n_components=110)
pca.fit(train)
X_train_pca = pca.transform(train)
X_test_pca = pca.transform(test)
principalDf = pd.DataFrame(data = X_train_pca)
principalDf.head(10)

<p style="background-color:pink; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 10px 100px;"><b>Modeling and Submission</b></p> 

### Modeling :

In [None]:
model=xgboost.XGBClassifier( tree_method="gpu_hist",
        gpu_id=1,
        predictor="gpu_predictor")

In [None]:
## Hyper Parameter Optimization

params={
 "learning_rate"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ] ,
 "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15],
 "min_child_weight" : [ 1, 3, 5, 7 ],
 "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
 "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ]
    
}

In [None]:
random_search = RandomizedSearchCV(
    model,
    param_distributions=params,
    n_iter=5,            # number of random parameter combinations to try
    scoring='roc_auc',
    n_jobs=-1,
    cv=5,
    verbose=3,
)

In [None]:
random_search.fit(X_train_pca,y)

In [None]:
random_search.best_params_

In [None]:
random_search.best_estimator_

In [None]:
model=xgboost.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.7, gamma=0.1, gpu_id=1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.25, max_delta_step=0, max_depth=4,
              min_child_weight=5, missing=np.nan, monotone_constraints='()',
              n_estimators=100, n_jobs=2, num_parallel_tree=1,
              predictor='gpu_predictor', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='gpu_hist', validate_parameters=1, verbosity=None)

### Submission :

In [None]:
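model.fit(X_train_pca,y)
# the search above is scored with ROC AUC, so submit the predicted probability
# of the positive class rather than hard 0/1 labels
test_preds = model.predict_proba(X_test_pca)[:, 1]
sub.claim = test_preds
sub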

In [None]:
sub.to_csv('submission.csv', index=False)

## If you find this notebook useful, please don't forget to upvote it!