<img src = "https://storage.googleapis.com/kaggle-forum-message-attachments/543450/13399/Untitled.jpg" width = "400"></img>

# Introduction

# <a id='0'>Content</a>

- <a href='#1'>1. Read the data</a>
- <a href='#2'>2. Data Understanding</a>
- <a href='#3'>3. Data Exploration</a>
 - <a href='#7'>3.1 Distribution of Y variable</a>
 - <a href='#8'>3.2 Distribution of X variables</a>
 - <a href='#9'>3.3 Correlation</a>
- <a href='#4'>4. Magic Feature</a>
- <a href='#5'>5. Model (LR)</a>
 - <a href='#10'>5.1 Model w/o Magic Feature</a>
 - <a href='#11'>5.2 Model with Magic Feature</a>
- <a href='#6'>6. Model (QDA)</a>

## <a id='1'>1. Read the data</a>

In [None]:
# Import necessary libraries

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import roc_auc_score

In [None]:
# Input path

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train = pd.read_csv('/kaggle/input/instant-gratification/train.csv')
test  = pd.read_csv('/kaggle/input/instant-gratification/test.csv')

## <a id='2'>2. Data Understanding</a>

In [None]:
train.head()

In [None]:
train.info()

In [None]:
train.describe()

## <a id='3'>3. Data Exploration</a>

### <a id='7'>3.1 Distribution of Y variable</a>

In [None]:
# Pass the column via the x keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x=train['target'])

### Conclusion: 
The target variable is roughly balanced: both classes occur with about the same frequency.
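
The count plot can be backed by numbers. The following is a minimal sketch on a small hypothetical stand-in frame (`demo` is not the competition data); on the real data, the same call would be applied to `train['target']`.

```python
import pandas as pd

# Hypothetical stand-in for `train`; only the 'target' column matters here.
demo = pd.DataFrame({"target": [0, 1, 1, 0, 1, 0, 0, 1]})

# Share of each class; values near 0.5 indicate a balanced binary target.
class_share = demo["target"].value_counts(normalize=True).sort_index()
print(class_share)
```

A balanced target is also what makes plain AUC and an unweighted classifier reasonable choices in the modelling sections.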

### <a id='8'>3.2 Distribution of X variables</a>

In [None]:
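def plot_feature_distplot(df, features):
    # One histogram with a KDE overlay per feature, laid out on a 4x4 grid
    fig, axes = plt.subplots(4, 4, figsize=(14, 14))

    for ax, feature in zip(axes.ravel(), features):
        # histplot replaces the deprecated sns.distplot
        sns.histplot(df[feature], kde=True, stat="density", ax=ax)
        ax.set_xlabel(feature, fontsize=9)
    plt.show()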

In [None]:
cols = [col for col in train.columns if col not in ["id", "target"]]

# distribution plot for first 16 variables
plot_feature_distplot(train, cols[0:16])

### Conclusion: 
All of the plotted 'X' variables appear to be approximately normally distributed.
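
A quick numeric check can complement the plots. This is a minimal sketch on synthetic Gaussian columns (the `demo` frame and its column names are assumptions for illustration, not the competition data). It relies on the fact that a normal distribution has skewness 0 and excess kurtosis 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for train[cols]: four standard-normal columns.
demo = pd.DataFrame(rng.normal(size=(10_000, 4)), columns=["f0", "f1", "f2", "f3"])

# pandas reports sample skewness and excess kurtosis per column.
shape_stats = pd.DataFrame({"skew": demo.skew(), "excess_kurtosis": demo.kurt()})
print(shape_stats.round(3))
```

Values far from 0 in either column would be a hint that a feature is not well described by a single Gaussian.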

### <a id='9'>3.3 Correlation</a>

In [None]:
plt.figure(figsize=[16,9])
sns.heatmap(train[cols].corr())

### Conclusion: 
The heatmap shows no notable pairwise correlation between the 'X' variables.
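
With hundreds of columns a heatmap is hard to read precisely, so it can help to reduce it to a single number. The sketch below uses synthetic independent columns (the `demo` frame is a hypothetical stand-in) and reports the largest absolute off-diagonal correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for train[cols]: six independent Gaussian columns.
demo = pd.DataFrame(rng.normal(size=(5_000, 6)), columns=[f"f{i}" for i in range(6)])

corr = demo.corr().to_numpy()

# Mask the diagonal (always 1) and look at the strongest remaining pair.
off_diag = corr[~np.eye(len(corr), dtype=bool)]
max_abs_corr = np.abs(off_diag).max()
print(round(float(max_abs_corr), 3))
```

Note that this measures correlation over the pooled rows only; it says nothing about structure that might exist inside subsets of the data.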

## <a id='4'>4. Magic Feature</a>

In [None]:
train.info()

In [None]:
train.dtypes[train.dtypes == np.int64]

### There are two 'int' type columns in train dataset
1. wheezy-copper-turtle-magic
2. target

In [None]:
train['wheezy-copper-turtle-magic'].value_counts()

#### Column 'wheezy-copper-turtle-magic' can be treated as either numeric or categorical.

### Here, we are going to build the model under both options
1. Treat the column 'wheezy-copper-turtle-magic' as numeric
2. Treat the column 'wheezy-copper-turtle-magic' as categorical

In [None]:
train['wheezy-copper-turtle-magic'].nunique()

In [None]:
print('Train set')
print('Minimum value of wheezy-copper-turtle-magic:',train['wheezy-copper-turtle-magic'].min())
print('Maximum value of wheezy-copper-turtle-magic:',train['wheezy-copper-turtle-magic'].max())

## <a id='5'>5. Model (LR)</a>

### <a id='10'>5.1 Model w/o Magic Feature</a>

### Here, we consider the column 'wheezy-copper-turtle-magic' as numeric and build the model

In [None]:
cols = [c for c in train.columns if c not in ['id', 'target']]
oof = np.zeros(len(train))

# Stratified K-fold
skf = StratifiedKFold(n_splits=5)
 
for train_idx, val_idx in skf.split(train[cols], train['target']):
    
    # LR model
    clf = LogisticRegression()
    clf.fit(train.loc[train_idx][cols], train.loc[train_idx]['target'])
    oof[val_idx] = clf.predict_proba(train.loc[val_idx][cols])[:,1]

auc = roc_auc_score(train['target'],oof)
print('LR CV score w/o Magic feature =',round(auc,4))

### <a id='11'>5.2 Model with Magic Feature</a>

### Here, we treat the column 'wheezy-copper-turtle-magic' as categorical and build a separate model for each of its 512 values

In [None]:
cols = [c for c in train.columns if c not in ['id', 'target']]

cols.remove('wheezy-copper-turtle-magic')
oof = np.zeros(len(train))

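# Build a separate model for each of the 512 magic values
for i in range(512):
    
    # rows of the training data with this value of 'wheezy-copper-turtle-magic'
    train1 = train[train['wheezy-copper-turtle-magic']==i]     
    
    # keep the original row labels so OOF predictions can be written back,
    # then reindex a copy (avoids an in-place change on a slice of `train`)
    idx1 = train1.index
    train1 = train1.reset_index(drop = True)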
    
    # Stratified K-fold
    skf = StratifiedKFold(n_splits = 5)     
    
    for train_idx, val_idx in skf.split(train1[cols], train1['target']):
        
        # LR model 
        clf = LogisticRegression(solver = 'liblinear', penalty = 'l1', C = 0.05)
        clf.fit(train1.loc[train_idx][cols], train1.loc[train_idx]['target'])
        oof[idx1[val_idx]] = clf.predict_proba(train1.loc[val_idx][cols])[:,1]
    
auc = roc_auc_score(train['target'],oof)       

In [None]:
print('LR CV score with Magic feature =',round(auc,4)) 

### Conclusion: 
Building a separate model per value of the Magic feature makes a huge difference to the CV score:
1. LR, CV score without Magic feature: 0.53
2. LR, CV score with Magic feature:    0.79
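
Why can splitting by a categorical column help so much? The toy sketch below is an illustration under stated assumptions, not a description of the competition data: it invents two groups in which the same feature predicts the target with opposite sign. A single logistic regression over the pooled rows sees the two effects cancel, while one model per group recovers the signal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 2_000

# Toy stand-in for the magic column: two groups instead of 512.
group = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 2))

# Assumption for illustration: feature 0 drives the target,
# but with opposite sign in the two groups.
sign = np.where(group == 0, 1.0, -1.0)
y = ((sign * X[:, 0] + 0.5 * rng.normal(size=n)) > 0).astype(int)

# Option 1: one model, group id used as a plain numeric feature.
X_pooled = np.column_stack([X, group])
pooled_oof = cross_val_predict(
    LogisticRegression(), X_pooled, y, cv=5, method="predict_proba"
)[:, 1]
pooled_auc = roc_auc_score(y, pooled_oof)

# Option 2: one model per group, out-of-fold predictions written back.
oof = np.zeros(n)
for g in (0, 1):
    mask = group == g
    oof[mask] = cross_val_predict(
        LogisticRegression(), X[mask], y[mask], cv=5, method="predict_proba"
    )[:, 1]
grouped_auc = roc_auc_score(y, oof)

print("pooled AUC :", round(pooled_auc, 3))
print("grouped AUC:", round(grouped_auc, 3))
```

The per-group loop mirrors the structure of the notebook cell above: subset, cross-validate, and write the out-of-fold predictions back into one array before scoring.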

## <a id='6'>6. Model (QDA)</a>

In [None]:
cols = [c for c in train.columns if c not in ['id', 'target']]

cols.remove('wheezy-copper-turtle-magic')
oof = np.zeros(len(train))

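# Build a separate model for each of the 512 magic values
for i in range(512):
    
    # rows of the training data with this value of 'wheezy-copper-turtle-magic'
    train1 = train[train['wheezy-copper-turtle-magic']==i]     
    
    # keep the original row labels so OOF predictions can be written back,
    # then reindex a copy (avoids an in-place change on a slice of `train`)
    idx1 = train1.index
    train1 = train1.reset_index(drop = True)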
    
    # Dropping low-variance features (fit and transform)
    sel = VarianceThreshold(threshold = 1.5).fit(train1[cols])
    train2 = sel.transform(train1[cols])
    
    # Stratified K-fold
    skf = StratifiedKFold(n_splits = 5)     
    
    for train_idx, val_idx in skf.split(train2, train1['target']):
        
        # QDA model 
        clf = QuadraticDiscriminantAnalysis(reg_param=0.5)
        clf.fit(train2[train_idx,:], train1.loc[train_idx]['target'])
        oof[idx1[val_idx]] = clf.predict_proba(train2[val_idx,:])[:,1]
    
auc = roc_auc_score(train['target'],oof)  

In [None]:
print('QDA, CV score =',round(auc,4))
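
The QDA cell filters features with `VarianceThreshold(threshold = 1.5)` before fitting. As a minimal sketch of what that step does, the toy matrix below (an assumption for illustration, not the competition data) mixes columns with variance near 1 and columns with variance near 9; only the latter survive.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)

# Toy block: columns 0 and 2 have std 1 (variance ~1),
# columns 1 and 3 have std 3 (variance ~9).
X = rng.normal(size=(2_000, 4)) * np.array([1.0, 3.0, 1.0, 3.0])

# Keep only columns whose variance exceeds the threshold.
sel = VarianceThreshold(threshold=1.5).fit(X)
kept = sel.get_support(indices=True)
X_reduced = sel.transform(X)

print("kept columns:", kept)
print("shape after filtering:", X_reduced.shape)
```

Because the selector only looks at feature variances and never at the target, fitting it on all rows of a magic-value subset before cross-validation does not leak label information.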

### Summary

QDA outperforms LR, and other models as well, on this dataset.

The dataset was most likely produced by `sklearn.datasets.make_classification`. This function generates clusters of Gaussians with non-diagonal covariance matrices and assigns them to classes. QDA matches this data structure exactly: it learns a normal distribution with a full n-dimensional covariance matrix for each class.

QDA works by fitting one multivariate Gaussian distribution to the target=1 samples and another to the target=0 samples. The surfaces of constant density of a multivariate Gaussian are hyper-ellipsoids in p-dimensional space, where p is the number of variables.
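
To make this concrete, here is a small sketch on synthetic data (drawn directly with NumPy as an assumption for illustration, not with `make_classification` and not from the competition). Both classes share the same mean and differ only in the orientation of their covariance, so no linear boundary can separate them, while QDA can use the per-class covariance.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 1_000  # samples per class

# Same mean, opposite correlation: the classes differ only in covariance.
cov0 = [[1.0, 0.8], [0.8, 1.0]]
cov1 = [[1.0, -0.8], [-0.8, 1.0]]
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], cov0, size=n),
    rng.multivariate_normal([0.0, 0.0], cov1, size=n),
])
y = np.repeat([0, 1], n)


def oof_auc(model):
    """AUC of 5-fold out-of-fold probabilities for the positive class."""
    proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)


lr_auc = oof_auc(LogisticRegression())
qda_auc = oof_auc(QuadraticDiscriminantAnalysis())

print("LR  AUC:", round(lr_auc, 3))
print("QDA AUC:", round(qda_auc, 3))
```

In this setup LR stays close to chance because the class means coincide, whereas QDA separates the classes through the difference in their fitted ellipsoids.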

For more information, please refer to:
https://www.kaggle.com/c/instant-gratification/discussion/93843