**Importing the libraries used for exploration, plotting, and modelling**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.display.max_columns = None

**Importing the dataset**

In [None]:
data = pd.read_csv("../input/tabular-playground-series-may-2021/train.csv")

**Running some initial analysis. The data has 52 columns:**

* 1 column for the ID
* 50 feature columns (the slice `1:51` used below selects 50 columns)
* 1 target column with 4 classes

**We can conclude that this is a supervised, multi-class classification problem.**

In [None]:
data.head()

**Splitting the dataset into independent and dependent variables**

In [None]:
x = data.iloc[:,1:51]
y = data.iloc[:,-1]

x_test_data = pd.read_csv("../input/tabular-playground-series-may-2021/test.csv")
x_test_new = x_test_data.iloc[:,1:51]


**Preliminary analysis gives the following observations:**

* We have 50 feature columns
* No missing data is present in the dataset (no NA or null values)
* Maximum values are positive, though the dataset also contains negative values
* The data type is the same for all columns: integer
* A large number of zeros is present for all the classes

In [None]:
x.head()

In [None]:
x.describe()

In [None]:
x.isna().sum()

In [None]:
pd.crosstab(x.iloc[:,0],y)
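
The observations listed above can also be checked in one place. The following is a minimal sketch, assuming a hypothetical helper named `summarize_features` and a small toy frame in place of the real data; in this notebook it would be called on `x` before scaling.

```python
import pandas as pd

def summarize_features(df):
    """Collect a few dataset-level checks in a dict (hypothetical helper)."""
    return {
        "n_features": df.shape[1],
        "n_missing": int(df.isna().sum().sum()),
        "dtypes": sorted(str(t) for t in df.dtypes.unique()),
        "has_negative": bool((df < 0).any().any()),
        "zero_fraction": float((df == 0).to_numpy().mean()),
    }

# Toy stand-in for the feature frame, not the competition data
toy = pd.DataFrame(
    {"feature_0": [0, 0, 3, -1], "feature_1": [0, 2, 0, 0]},
    dtype="int64",
)
summary = summarize_features(toy)
print(summary)
```

A single `zero_fraction` value is only a coarse summary; the per-class picture still needs a crosstab like the one above.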

**Using StandardScaler to bring the features onto a similar scale**

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x = sc.fit_transform(x)
# Reuse the mean and standard deviation learned from the training data
x_test_new = sc.transform(x_test_new)
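
Scaling statistics are normally learned from the training data and then reused for the test data, so that both sets pass through the same transformation. Here is a small sketch on synthetic arrays (stand-ins for `x` and `x_test_new`, not the competition data) showing what changes when the scaler is refitted on the test set instead.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # synthetic training features
test = rng.normal(loc=8.0, scale=2.0, size=(50, 3))    # synthetic test features, shifted upward

scaler = StandardScaler().fit(train)   # statistics come from the training rows only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # same mean and std applied to the test rows

# Refitting on the test rows forces their mean to zero and hides the shift
test_refit = StandardScaler().fit_transform(test)

print("test mean, training statistics:", test_scaled.mean(axis=0))
print("test mean, refitted scaler:    ", test_refit.mean(axis=0))
```

With the training statistics the shift between the two sets stays visible to the model. With a refitted scaler the test features are placed on a different scale from the one the model was trained on.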

**Plotting cumulative explained variance against the number of components**

This shows how many principal components are needed to retain most of the variance in the features.
The threshold is set at 95%.

In [None]:
from sklearn.decomposition import PCA
pca = PCA().fit(x)

plt.rcParams["figure.figsize"] = (12,6)

fig, ax = plt.subplots()
y1 = np.cumsum(pca.explained_variance_ratio_)
xi = np.arange(1, len(y1) + 1)  # 1-based component counts

plt.ylim(0.0,1.1)
plt.plot(xi, y1, marker='o', linestyle='--', color='b')

plt.xlabel('Number of Components')
plt.xticks(xi)  # one tick per component count
plt.ylabel('Cumulative explained variance (fraction)')
plt.title('The number of components needed to explain variance')

plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.85, '95% cut-off threshold', color = 'red', fontsize=16)

ax.grid(axis='x')
plt.show()
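
The number of components can also be read off numerically instead of from the plot. This is a minimal sketch on synthetic data (10 features, half of them near-duplicates of the other half), not the competition data. It finds the first component count whose cumulative ratio reaches 0.95 and compares it with scikit-learn's own selection when `n_components` is given as a fraction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: the last 5 columns are noisy copies of the first 5
base = rng.normal(size=(500, 5))
features = np.hstack([base, base + 0.01 * rng.normal(size=(500, 5))])

cumulative = np.cumsum(PCA().fit(features).explained_variance_ratio_)
# First index where the cumulative ratio reaches 0.95, turned into a 1-based count
n_needed = int(np.argmax(cumulative >= 0.95)) + 1

# A float in (0, 1) asks PCA to keep enough components to explain that share of variance
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(features)
print(n_needed, pca_95.n_components_)
```

Passing the fraction directly avoids hard-coding a component count that would need updating if the features change.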

**Applying Principal Component Analysis (PCA)**

This reduces the data to the number of components chosen from the plot above.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 47)
x = pca.fit_transform(x)
x_test_new = pca.transform(x_test_new)

**Applying XGBoost Classifier to predict the output probabilities**

In [None]:
from xgboost import XGBClassifier
classifier = XGBClassifier(n_estimators=10000, learning_rate=0.01, max_depth=29, max_leaves=31,
                           tree_method='gpu_hist', predictor='gpu_predictor',
                           eval_metric='mlogloss', verbosity=3)

In [None]:
classifier.fit(x,y)
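
Fitting on every training row gives no estimate of the score before submitting. One option is to hold out part of the training data and compute the multi-class log loss on it, the same metric passed as `eval_metric` above. The sketch below uses synthetic 4-class data and a `LogisticRegression` stand-in so that it runs quickly without a GPU; both are assumptions for illustration, and the `XGBClassifier` could be dropped in through the same `fit` / `predict_proba` calls.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic 4-class data standing in for the scaled competition features
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=8,
    n_classes=4, class_sep=2.0, random_state=0,
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Stand-in model, chosen only to keep the example light
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
val_proba = model.predict_proba(X_val)

# Multi-class log loss on the held-out rows (lower is better)
val_loss = log_loss(y_val, val_proba, labels=model.classes_)
# Reference point: predicting 0.25 for every class
baseline = log_loss(y_val, np.full_like(val_proba, 0.25), labels=model.classes_)
print(f"validation log loss: {val_loss:.4f} (uniform baseline: {baseline:.4f})")
```

Comparing against the uniform baseline gives a quick sense of how much the model has learned, and the held-out loss can be tracked while changing the scaling, PCA, or model settings.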

**Predicting class probabilities for the test data**

In [None]:
y_pred_proba = classifier.predict_proba(x_test_new)

In [None]:
output = pd.read_csv("../input/tabular-playground-series-may-2021/sample_submission.csv")

In [None]:
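# Take the class column names from the sample submission so the header matches the expected format
# (assumes its first column is the id and the rest are the class columns)
output_df = pd.DataFrame(y_pred_proba, columns=output.columns[1:])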

In [None]:
output_df = pd.concat([x_test_data.id,output_df],axis = 1)

In [None]:
output_df.head()

In [None]:
output_df.to_csv('OutputXGB.csv', index=False)
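
A few checks on the submission frame can catch layout problems before uploading. This is a rough sketch using a hypothetical helper, `check_submission`, and toy frames in place of `sample_submission.csv` and `output_df`; the class column names in the toy data are assumed, not read from the real file.

```python
import numpy as np
import pandas as pd

def check_submission(submission, sample):
    """Raise AssertionError if the frame does not match the sample layout (hypothetical helper)."""
    assert list(submission.columns) == list(sample.columns), "column names or order differ"
    assert len(submission) == len(sample), "row count differs"
    assert not submission.isna().any().any(), "missing values present"
    proba = submission.drop(columns="id").to_numpy()
    assert np.allclose(proba.sum(axis=1), 1.0, atol=1e-4), "rows do not sum to 1"

# Toy stand-in for the sample submission file (column names are assumed)
sample = pd.DataFrame({
    "id": [100000, 100001],
    "Class_1": 0.25, "Class_2": 0.25, "Class_3": 0.25, "Class_4": 0.25,
})

# Toy stand-in for the predicted probabilities
proba = np.array([[0.1, 0.6, 0.2, 0.1], [0.7, 0.1, 0.1, 0.1]])
submission = pd.DataFrame(proba, columns=sample.columns[1:])
submission.insert(0, "id", sample["id"])

check_submission(submission, sample)
print("submission layout looks consistent")
```

In this notebook the equivalent call would be `check_submission(output_df, output)`, since `output` already holds the sample submission.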

**Open for advice and suggestions on how to improve my score and to learn new techniques.**