#### If you like my work, please leave an upvote: it is really appreciated and motivates me to offer more content to the Kaggle community! :)

#### We will go through the following steps:
* Data Loading
* Data Analysis
* Data Preprocessing
* Data Visualization
* Data Modelling
* Prediction

### Names used in the notebook:

##### train - train data from Kaggle
##### test - test data from Kaggle
##### cols - column names of train data
##### train_norm - normalized train data
##### test_norm - normalized test data
##### x - independent variables (columns) from train data (without 'id')
##### y - dependent variable (column) from train data
##### x_norm - normalized independent variables (columns) from train data (without 'id')
##### y_norm - normalized dependent variable (column) from train data
##### test_id - holds 'id' of test data
##### model_xgbc - XGBOOST model object
##### model_xgbc_norm - XGBOOST model with normalized data
##### y_predict_xgbc - holds predictions for test data
##### y_norm_predict_xgbc - holds predictions for scaled model test data
##### result - submission data (holds 'id' and predictions)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

> ## Loading the data

In [None]:
train = pd.read_csv('../input/tabular-playground-series-may-2022/train.csv')
test = pd.read_csv('../input/tabular-playground-series-may-2022/test.csv')

> ## Data Analysis

In [None]:
train.head()

In [None]:
train.shape, test.shape

In [None]:
train.isnull().sum().sum(), test.isnull().sum().sum()

#### There are no NULL values. That's a relief.

In [None]:
train.dtypes, test.dtypes

#### All columns are numerical except `f_27`. That's a relief again.

##### Let's drop it.

In [None]:
train = train.drop(['f_27'], axis=1)
test = test.drop(['f_27'], axis=1)
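
The dtype check above can also be done programmatically rather than by reading the output. A minimal sketch, assuming a small hypothetical frame `toy` that stands in for `train` (the values are made up); `select_dtypes` is the standard pandas call for this:

```python
import pandas as pd

# Hypothetical toy frame standing in for `train`:
# two numeric columns and one text column
toy = pd.DataFrame({
    'f_00': [0.1, 0.2, 0.3],
    'f_07': [1, 2, 3],
    'f_27': ['ABC', 'DEF', 'GHI'],
})

# Names of all columns whose dtype is not numeric
non_numeric_cols = toy.select_dtypes(exclude='number').columns.tolist()

# Drop them in one call instead of naming each column by hand
toy_numeric = toy.drop(columns=non_numeric_cols)
```

The same two lines applied to `train` and `test` would avoid hard-coding the column name, at the cost of silently dropping any other non-numeric column.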

> ## Data Visualization

In [None]:
# Plot only the first 10 rows; the Axes object returned by lineplot cannot be sliced
sns.lineplot(x='f_00', y='f_01', data=train.head(10), color='red')

In [None]:
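# distplot is deprecated in recent seaborn versions; histplot with kde=True is the replacement
sns.histplot(train['f_02'], kde=True, color='pink')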

In [None]:
plt.figure(figsize=(10, 6))
plt.title('Target distribution')
ax = sns.countplot(x='target', data=train)

> ## Train-test-split

In [None]:
x = train.drop(['id', 'target'], axis=1)
y = train['target']

test_id = test['id']
test = test.drop(['id'], axis=1)


In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

> ## Data Modelling

#### Let's prepare a table for all models and their scores

In [None]:
model_name = []
score = []

### XGBOOST

In [None]:
model_xgbc = XGBClassifier()

model_xgbc.fit(x_train, y_train, verbose=1)

In [None]:
y_predict_xgbc = model_xgbc.predict(test)

In [None]:
model_name.append("model_xgbc")
score.append(model_xgbc.score(x_test, y_test))
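
For a scikit-learn style classifier such as `XGBClassifier`, `score` returns the mean accuracy on the data it is given. A minimal sketch of that computation with NumPy, using made-up labels and predictions (`y_true_toy` and `y_pred_toy` are hypothetical, not values from this notebook):

```python
import numpy as np

# Hypothetical true labels and predicted labels
y_true_toy = np.array([0, 1, 1, 0, 1])
y_pred_toy = np.array([0, 1, 0, 0, 1])

# Mean accuracy: fraction of positions where prediction equals the label
accuracy = float(np.mean(y_true_toy == y_pred_toy))
```

Here four of the five predictions match, so `accuracy` is 0.8.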

### CATBOOST

In [None]:
# NOTE: iterations=1 builds a single tree; raise it for a meaningful comparison with XGBoost
catboost_model = CatBoostClassifier(iterations=1, learning_rate=0.1)

catboost_model.fit(x_train, y_train, verbose=False)

In [None]:
y_predict_catb = catboost_model.predict(test)

In [None]:
model_name.append("model_catboost")
score.append(catboost_model.score(x_test, y_test))

> ## Best model

In [None]:
# Pick the name of the model with the highest score on the held-out split
best_score_index = score.index(max(score))
best_model = model_name[best_score_index]
best_model

In [None]:
models_score = pd.DataFrame()
models_score['model_name'] = model_name
models_score['score'] = score
models_score
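
Once the scores sit in a DataFrame, the best row can also be read from it directly. A sketch with placeholder model names and scores (`toy_scores` and its numbers are made up for illustration, not results from this notebook):

```python
import pandas as pd

# Hypothetical score table with the same two columns as `models_score`
toy_scores = pd.DataFrame({
    'model_name': ['model_a', 'model_b'],
    'score': [0.90, 0.75],  # placeholder values
})

# Row with the highest score, and the name stored in it
best_row = toy_scores.loc[toy_scores['score'].idxmax()]
best_name = best_row['model_name']

# Full ranking, best model first
ranked = toy_scores.sort_values('score', ascending=False).reset_index(drop=True)
```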

In [None]:
# Build the submission from the XGBoost predictions
result = pd.DataFrame()
result['id'] = test_id
result['target'] = y_predict_xgbc
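
The submission is built from `y_predict_xgbc` regardless of which model scored best. One possible way to tie the two together is a lookup from model name to predictions. A sketch with made-up prediction lists (`predictions_by_model` and `best_model_name` are hypothetical names):

```python
# Hypothetical stand-ins for the per-model test predictions
predictions_by_model = {
    'model_xgbc': [0, 1, 1],
    'model_catboost': [0, 0, 1],
}

# Placeholder; in the notebook this would be the value of `best_model`
best_model_name = 'model_catboost'

# Predictions of whichever model was selected
chosen_predictions = predictions_by_model[best_model_name]
```

This only works if the keys match the strings appended to `model_name`.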

In [None]:
result.head()

In [None]:
result.to_csv('submission.csv', index=False)

### Work in progress!!