# **TPS OCT'21**

---

- First we will simply implement XGBoost without any hyperparameter tuning and check the results.
- Then we will select the important features and try to improve our prediction.

## **Basic XGBoost Prediction -**

#### **1. Import Libraries and Dataset -**

In [None]:
# List every file available under the read-only Kaggle input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [None]:
train_df = pd.read_csv("../input/tabular-playground-series-oct-2021/train.csv")
test_df = pd.read_csv("../input/tabular-playground-series-oct-2021/test.csv")

#### **2. Reduce Memory Usage -**
* For every column, we downcast to a smaller datatype when all values in the column lie within the range of that smaller type.

* For example, the target variable (stored as int64) takes only 2 values (0/1), so it fits easily in int8 (range -128 to 127), which saves a lot of memory.
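
The saving described above can be checked on a single column. The following is a minimal sketch, not the notebook's own routine: `smallest_int_dtype` is a hypothetical helper and the `target` series is synthetic data standing in for the real column. It uses `np.iinfo` to read the exact bounds of each integer type.

```python
import numpy as np
import pandas as pd

def smallest_int_dtype(series):
    """Return the narrowest signed integer dtype whose range holds every value."""
    lo, hi = series.min(), series.max()
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        # Inclusive bounds: a value equal to the type's min or max still fits
        if info.min <= lo and hi <= info.max:
            return dtype
    return np.int64

# Synthetic stand-in for a 0/1 target column with one million rows (assumption)
target = pd.Series(np.tile([0, 1], 500_000), dtype="int64")
small = target.astype(smallest_int_dtype(target))

bytes_before = target.memory_usage(index=False)
bytes_after = small.memory_usage(index=False)
print(small.dtype, bytes_before, bytes_after)
```

Each value shrinks from 8 bytes to 1 byte, so this column needs one eighth of the original memory.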

In [None]:
train_df.info()
test_df.info()

In [None]:
def reduce_memory_usage(df, verbose=True):
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                # Use inclusive bounds so values equal to a type's min/max still fit
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min >= np.iinfo(np.int64).min and c_max <= np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Caution: float16 keeps only about 3 significant decimal digits, so this cast is lossy
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df

In [None]:
train_df = reduce_memory_usage(train_df)
test_df = reduce_memory_usage(test_df)

#### **3. Understanding Data -**

In [None]:
train_df.head()

In [None]:
train_df.info()

In [None]:
print('Missing values in train dataset =',train_df.isna().sum().sum())
print('Missing values in test dataset =',test_df.isna().sum().sum())

In [None]:
train_df['target'].value_counts()

**Data Summary -**
* 1,000,000 datapoints and 287 columns in total (285 features plus id and target)
* 240 features are of float type
* 45 features (other than id and target) are of int type
* No missing datapoints in either of the datasets
* The two target classes are almost evenly balanced
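
These checks can be collected into a few variables rather than read off separate cell outputs. A minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration, with one missing value planted on purpose):

```python
import pandas as pd

# Toy stand-in for train_df (assumption): three columns, one missing value
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "f0": [0.1, None, 0.9, 0.3],
    "target": [0, 1, 1, 0],
})

n_rows, n_cols = toy.shape
# Total number of missing cells across the whole frame
n_missing = int(toy.isna().sum().sum())
# Share of each class; values near 0.5 indicate a balanced binary target
class_share = toy["target"].value_counts(normalize=True).sort_index()
print(n_rows, n_cols, n_missing)
print(class_share)
```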

#### **4. XGBoost Modelling and Prediction -**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

In [None]:
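# Work on a 10,000-row sample to keep the baseline fast; fix the seed so the sample is reproducible
df = train_df.sample(n=10000, random_state=0)
X = df.drop(['target','id'],axis=1)
y = df['target']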

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [None]:
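# Note: this is scikit-learn's gradient boosting implementation, not the xgboost library;
# the variable name XGB is kept for consistency with the rest of the notebook.
from sklearn.ensemble import GradientBoostingClassifier
XGB = GradientBoostingClassifier()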

In [None]:
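XGB.fit(X_train,y_train)
# ROC AUC is a ranking metric, so score the positive-class probabilities rather than hard 0/1 labels
XGB_predict = XGB.predict_proba(X_test)[:, 1]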

In [None]:
roc_auc_score(y_test, XGB_predict)

* The model gives a good roc_auc score above 0.75 on the held-out 30% split of the 10,000-row training sample.
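
ROC AUC measures how well scores rank positives above negatives, so it is usually computed from predicted probabilities; collapsing them to 0/1 labels discards ordering information and can change the score. The sketch below illustrates this on made-up labels and probabilities. `pairwise_auc` is a hypothetical helper written for this example (not a scikit-learn function) that uses the pairwise-ranking definition of AUC.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score against every negative score
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Made-up example data (assumption)
y_true = np.array([0, 0, 1, 1, 0, 1])
proba = np.array([0.10, 0.40, 0.35, 0.80, 0.55, 0.60])
labels = (proba >= 0.5).astype(int)  # hard predictions at a 0.5 cut-off

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, labels)
print(auc_from_proba, auc_from_labels)
```

On this toy data the probabilities rank 7 of 9 pairs correctly, while the thresholded labels score 6 of 9 because several pairs become ties.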

In [None]:
del df, X_train, X_test, y_train, y_test, XGB_predict

## **Feature Selection -**

#### **1. Feature Importance -**

* Categorize features into discrete (categorical) and continuous types.
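
One way to make this split without depending on the exact downcast dtypes is to test for integer and float dtypes in general. A minimal sketch on a made-up frame (`toy` and its column names are assumptions for illustration):

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Toy stand-in for train_df (assumption)
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "f0": [0.25, 0.50, 0.75],
    "f1": [1, 0, 1],
    "target": [0, 1, 0],
})

# Leave out the identifier and the label before splitting by dtype
feature_cols = [c for c in toy.columns if c not in ("id", "target")]
cat_cols = [c for c in feature_cols if is_integer_dtype(toy[c])]
cont_cols = [c for c in feature_cols if is_float_dtype(toy[c])]
print(cat_cols, cont_cols)
```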

In [None]:
# These dtype checks rely on the downcasting done by reduce_memory_usage above
cat_features = [feature for feature in train_df.columns if train_df[feature].dtype == 'int8' and feature != 'target']
cont_features = [feature for feature in train_df.columns if train_df[feature].dtype == 'float16']

print('Number of Categorical features excluding id and target: ',len(cat_features),
      '\nNumber of Continuous features: ',len(cont_features))

* Let us check the correlation between all features and the target variable.

In [None]:
corr=pd.DataFrame()
corr['target'] = train_df[cat_features].corrwith(train_df['target'])
plt.subplots(figsize=(3,15))
df=corr.sort_values(by='target', ascending=False)
heatmap = sns.heatmap(df,annot=True,cmap='mako',linewidth=0.5,xticklabels=df.columns,yticklabels=df.index)

heatmap.set_title('Correlation - Categorical Features with target', fontdict={'fontsize':16}, pad=16)
plt.show()

In [None]:
corr=pd.DataFrame()
corr['target'] = train_df[cont_features].corrwith(train_df['target'])
plt.subplots(figsize=(3,50))
df=corr.sort_values(by='target', ascending=False)
heatmap = sns.heatmap(df,annot=True,cmap='mako',linewidth=0.5,xticklabels=df.columns,yticklabels=df.index)

heatmap.set_title('Correlation - Continuous Features with target', fontdict={'fontsize':16}, pad=16)
plt.show()

#### **2. Manual Feature Selection -**
* Manually select only the features whose absolute correlation with the target variable exceeds a **threshold (say |r| > 0.025)**.
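
The same threshold rule can also be applied programmatically instead of reading feature names off the heatmaps. A minimal sketch on synthetic data: `select_by_correlation`, the `demo` frame, and its column names are hypothetical and only illustrate the rule; they are not the notebook's feature list.

```python
import numpy as np
import pandas as pd

def select_by_correlation(df, target_col, threshold):
    """Return the feature names whose absolute correlation with the target exceeds threshold."""
    corr = df.drop(columns=[target_col]).corrwith(df[target_col])
    return sorted(corr[corr.abs() > threshold].index)

# Synthetic data (assumption): one feature that tracks the target, two pure-noise features
rng = np.random.default_rng(0)
n = 200_000
target = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "informative": target + rng.normal(0.0, 1.0, size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
    "target": target,
})

selected = select_by_correlation(demo, "target", threshold=0.025)
print(selected)
```

Taking the absolute value keeps strongly negative correlations as well as positive ones, which matches the |r| form of the threshold.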

In [None]:
imp_cat_features = {'f247', 'f43','f22'}
imp_cont_features = {'f179', 'f69','f58','f14','f78','f8','f200','f134','f56','f192','f112','f72','f1','f201',
                     'f150','f92','f95','f3','f77','f136','f156'}

In [None]:
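imp_features = imp_cont_features.union(imp_cat_features)
# pandas needs a list-like column indexer, not a set; sorting keeps the column order deterministic
X = train_df[sorted(imp_features)]
y = train_df['target']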

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [None]:
XGB = GradientBoostingClassifier()
XGB.fit(X_train,y_train)
# Score positive-class probabilities, since ROC AUC is a ranking metric
XGB_predict = XGB.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, XGB_predict)

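* There is a slight improvement in the roc_auc score after feature selection. Note that the baseline was trained on a 10,000-row sample while this model uses the full training set, so the two scores are not a like-for-like comparison.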

In [None]:
del df, X_train, X_test, y_train, y_test, XGB_predict

* **Final Prediction -**

In [None]:
df1 = train_df
X = df1.drop(['target','id'],axis=1)
y = df1['target']

XGB.fit(X,y)
df2 = test_df.drop('id',axis=1).copy()
XGB_target = XGB.predict(df2)

In [None]:
XGB_target

In [None]:
# Use double brackets to keep a DataFrame, so that a 'target' column can be added
sub = test_df[['id']].copy()
sub['target'] = XGB_target
sub.to_csv("XGBsubmission.csv", index=False)