# <U> <B> An Introductory Notebook for XGBClassifier </B> </U>

### XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. The XGBoost model for classification is known as XGBClassifier.

### <U> <B> Topics Covered: </B></U>

### 1. Importing Libraries
### 2. Loading Data
### 3. Viewing Data
### 4. Analysis
### 5. Correlation Analysis
### 6. Bivariate Analysis
### 7. Data Preparation
### 8. Using XGBClassifier
### 9. Predicting
### 10. Making Submission File

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Importing Libraries

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

# Loading Data

In [None]:
train = pd.read_csv("/kaggle/input/tabular-playground-series-may-2021/train.csv")
test = pd.read_csv("/kaggle/input/tabular-playground-series-may-2021/test.csv")
submission = pd.read_csv("/kaggle/input/tabular-playground-series-may-2021/sample_submission.csv")

# Viewing Data

In [None]:
train.head()

In [None]:
test.head()

# Analysis

In [None]:
train.columns

### The `id` column is only a row identifier and should have no effect on the target, so we drop it from both train and test.

In [None]:
# drop() returns a new DataFrame, so assign the result back
train = train.drop(columns=['id'])
test = test.drop(columns=['id'])
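
By default, `DataFrame.drop` returns a new DataFrame instead of modifying the original, so the result has to be assigned back for the column to actually disappear. A minimal sketch on a made-up toy frame (the values are illustrative, not taken from the competition data):

```python
import pandas as pd

# Toy frame standing in for the competition data (values are made up)
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "feature_0": [0, 3, 1],
    "target": ["Class_2", "Class_1", "Class_2"],
})

# Without assignment the original frame keeps all its columns
toy.drop(columns=["id"])
unchanged_columns = list(toy.columns)

# Assigning the result back removes the column
toy = toy.drop(columns=["id"])
dropped_columns = list(toy.columns)

print(unchanged_columns)
print(dropped_columns)
```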

In [None]:
train['target'].value_counts(normalize=True).plot(kind='bar')

### Most of the data (about 60%) belongs to class 2.

In [None]:
train['target'].value_counts()

# Correlation Analysis

In [None]:
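# Restrict to numeric columns: the string-valued 'target' column cannot be correlated
sns.heatmap(train.select_dtypes('number').corr(method='pearson'), cmap="YlGnBu", vmin=-1)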

In [None]:
# To dig deeper, zoom in on a few features and annotate the individual correlation values
feature_list = ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4','feature_5']
fig_dims = (15,15)
fig, ax = plt.subplots(figsize=fig_dims)
sns.heatmap(train[feature_list].corr(method='pearson'),cmap="YlGnBu",vmin=-1,annot=True,fmt='.1f',ax=ax)

### There is almost no correlation between any of the features.
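
A heatmap is hard to read when there are many features. One way to back up the statement above numerically is to list the feature pairs with the largest absolute correlation. The sketch below does this on random toy data; in the notebook you would pass the numeric feature columns of `train` instead (the names `toy`, `pairs` and `strongest` are made up for this illustration):

```python
import numpy as np
import pandas as pd

# Random toy features standing in for the real feature columns (made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.integers(0, 5, size=(200, 4)),
    columns=[f"feature_{i}" for i in range(4)],
)

corr = toy.corr(method="pearson")

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Sort the pairs by absolute correlation, strongest first
strongest = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(strongest.head())
```

If the largest absolute value in this list is close to zero, the features are essentially uncorrelated in the linear (Pearson) sense.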

# Bivariate Analysis

In [None]:
SS = pd.crosstab(train['target'],train['feature_0'])
SS.div(SS.sum(1),axis=0).plot(kind='bar',stacked=True)

### A similar graph appears for almost all features, because '0' is by far the most frequent value. The distribution of values is nearly the same across all 4 classes.
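
The row-wise normalisation done with `SS.div(SS.sum(1), axis=0)` can also be requested directly from `pd.crosstab` through its `normalize='index'` argument. A small sketch on made-up data showing that both routes give the same per-class proportions:

```python
import pandas as pd

# Toy data standing in for train[['target', 'feature_0']] (values are made up)
toy = pd.DataFrame({
    "target": ["Class_1", "Class_1", "Class_2", "Class_2", "Class_2", "Class_2"],
    "feature_0": [0, 1, 0, 0, 0, 2],
})

# Manual normalisation: divide each row of counts by its row total
counts = pd.crosstab(toy["target"], toy["feature_0"])
manual = counts.div(counts.sum(axis=1), axis=0)

# Built-in normalisation over each row (index)
builtin = pd.crosstab(toy["target"], toy["feature_0"], normalize="index")
print(builtin)
```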

# Data Preparation

In [None]:
# Independent variables: every column except the last one ('target')
X = train.iloc[:, :-1]
# Dependent variable
y = train['target']

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_valid.shape)
print(y_valid.shape)
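
Since the classes are imbalanced, one option (not used in the cells above) is to pass `stratify=y` to `train_test_split`, so that the training and validation sets keep the same class proportions. A minimal sketch on made-up labels:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels standing in for train['target'] (made up)
toy_y = pd.Series(["Class_2"] * 60 + ["Class_1"] * 20 + ["Class_3"] * 10 + ["Class_4"] * 10)
toy_X = pd.DataFrame({"feature_0": range(len(toy_y))})

# stratify keeps the class proportions identical in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=1, stratify=toy_y
)

print(y_tr.value_counts(normalize=True).sort_index())
print(y_va.value_counts(normalize=True).sort_index())
```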

# Using XGBClassifier

In [None]:
from xgboost import XGBClassifier
# Note: use_label_encoder is deprecated in newer XGBoost releases and may trigger a warning there
model = XGBClassifier(random_state=1, use_label_encoder=True, n_estimators=128)
model.fit(X_train, y_train)

### model.fit is used to fit the XGBClassifier to our training dataset.
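
Recent XGBoost releases expect integer class labels (0, 1, 2, ...) rather than strings such as `Class_1`, so depending on the installed version the target may need to be encoded before calling `model.fit`. Whether this step is needed here is an assumption that depends on your XGBoost version. A minimal sketch using scikit-learn's `LabelEncoder` on made-up labels:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for train['target'] (made up)
toy_y = pd.Series(["Class_2", "Class_1", "Class_4", "Class_2", "Class_3"])

encoder = LabelEncoder()
# Classes are sorted alphabetically, so Class_1 -> 0, ..., Class_4 -> 3
y_encoded = encoder.fit_transform(toy_y)

# inverse_transform maps integer predictions back to the original names
decoded = encoder.inverse_transform(y_encoded)

print(list(y_encoded))
print(list(encoder.classes_))
```

Because the encoded order matches `Class_1` to `Class_4`, the probability columns returned by `predict_proba` would stay in the same order as the submission columns.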

In [None]:
# Optional: uncomment to print the importance score of each feature
# importance = model.feature_importances_
# for i, v in enumerate(importance):
#     print('Feature: %0d, Score: %.5f' % (i, v))
# The next cell plots the feature importances

In [None]:
plt.bar(range(len(model.feature_importances_)), model.feature_importances_)
plt.show()

### Most features seem to be about equally important, except for two that are considerably more important than the rest.
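
The bar chart shows positions rather than feature names. To see which features stand out, the importances can be paired with the column names and sorted. The sketch below uses hypothetical scores; in the notebook the inputs would be `model.feature_importances_` and `X_train.columns`:

```python
import numpy as np
import pandas as pd

# Hypothetical importance scores and column names (made up for illustration)
toy_importances = np.array([0.02, 0.30, 0.02, 0.25, 0.03])
toy_columns = [f"feature_{i}" for i in range(len(toy_importances))]

# Pair each score with its feature name and sort, most important first
ranked = pd.Series(toy_importances, index=toy_columns).sort_values(ascending=False)
print(ranked.head(2))
```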

In [None]:
pred = model.predict(X_valid)

In [None]:
from sklearn.metrics import accuracy_score
# accuracy_score expects the true labels first, then the predictions
accuracy = accuracy_score(y_valid, pred)
print(accuracy)
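
Accuracy only looks at the most likely class, while the submission built further down contains class probabilities. A probability-based metric such as multi-class log loss can therefore be a useful complement. The sketch below computes it with scikit-learn on made-up probabilities; in the notebook you would pass `y_valid` and `model.predict_proba(X_valid)` instead, which is an assumption about how you would wire it up:

```python
import numpy as np
from sklearn.metrics import log_loss

labels = ["Class_1", "Class_2", "Class_3", "Class_4"]

# Made-up true labels and predicted probabilities (one row per sample,
# one column per class, in the same order as `labels`)
y_true = ["Class_2", "Class_1", "Class_2"]
proba = np.array([
    [0.10, 0.70, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])

# Average negative log of the probability assigned to the true class
score = log_loss(y_true, proba, labels=labels)
print(score)
```

Lower is better: a confident correct prediction adds little to the loss, while a confident wrong one is penalised heavily.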

# Predicting

In [None]:
y_pred = model.predict_proba(test)
print(y_pred)

# Making submission file

In [None]:
submission_xgb = pd.DataFrame(y_pred,columns=['Class_1','Class_2','Class_3','Class_4'])
submission_xgb['id'] = submission['id']

In [None]:
submission_xgb = submission_xgb[['id','Class_1','Class_2','Class_3','Class_4']]

In [None]:
submission_xgb

In [None]:
# Save with a .csv extension so the file is recognised as a CSV submission
submission_xgb.to_csv('submission_xgb.csv', index=False)
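
Before uploading, a quick sanity check on the submission frame can catch formatting mistakes. The sketch below runs on a made-up two-row frame (the `id` values and probabilities are invented); the same checks could be applied to `submission_xgb`:

```python
import numpy as np
import pandas as pd

class_columns = ["Class_1", "Class_2", "Class_3", "Class_4"]

# Made-up probabilities and ids standing in for the real submission
toy_proba = np.array([[0.1, 0.6, 0.2, 0.1], [0.3, 0.3, 0.2, 0.2]])
toy_submission = pd.DataFrame(toy_proba, columns=class_columns)
toy_submission.insert(0, "id", [100000, 100001])

# Each row of class probabilities should sum to 1
rows_sum_to_one = np.allclose(toy_submission[class_columns].sum(axis=1), 1.0)
# No missing values anywhere in the frame
no_missing = not toy_submission.isna().any().any()

print(rows_sum_to_one, no_missing)
```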

## Scope of Improvement
<B>1. The data is imbalanced. You can use resampling techniques such as SMOTE, undersampling, or oversampling to correct it. </B>

<B>2. You might want to try dimensionality reduction techniques such as PCA. </B>
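
As an illustration of the first idea, here is a minimal sketch of plain random oversampling with pandas. This is the simplest of the resampling options, not SMOTE, and it runs on a made-up toy frame. Every class is resampled with replacement up to the size of the largest class:

```python
import pandas as pd

# Toy imbalanced frame standing in for the training data (values are made up)
toy = pd.DataFrame({
    "feature_0": range(10),
    "target": ["Class_2"] * 6 + ["Class_1"] * 3 + ["Class_3"] * 1,
})

# Size of the largest class
largest = toy["target"].value_counts().max()

# Draw `largest` rows from every class, with replacement
balanced = toy.groupby("target").sample(n=largest, replace=True, random_state=1)
print(balanced["target"].value_counts())
```

Resampling should be applied to the training split only, so that the validation set still reflects the original class distribution.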


## If you have any questions, post them below. I would love to hear your suggestions and ideas for improving the notebook.