**Basic Starting Procedures**

1. Import all the basic model libraries

2. Load the input data into the more flexible pandas DataFrame format

In [None]:
import numpy as np
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_roc_curve

In [None]:
train = pd.read_csv('../input/tabular-playground-series-oct-2021/train.csv')

**Basic Data Analysis**

First, we take a look at what the training data looks like.

In [None]:
train.info()
train.head()

Next, since there are only two types of data (continuous and binary), we can easily create a list of the columns containing the binary features.

Since there are only 2 dtypes present, we can conveniently assume the columns with float64 dtypes are the continuous features, while the columns with int64 dtypes are the binary features.

Of course, we have to remember to drop the id and target columns before creating our list.

In [None]:
train_new = train.drop('id', axis = 1)
train_new = train_new.drop('target', axis = 1)

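# dtype of every remaining feature column, in column order
dtypes=train_new.dtypes
dlist = dtypes.tolist()

# Collect the positional index (not the name) of each int64 column
binary_columns=[]
for i, j in enumerate(dlist):
    if j == "int64":
        binary_columns.append(i)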
        
print(binary_columns)
len(binary_columns)
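As a possible alternative to the loop above, pandas can pick columns by dtype directly with select_dtypes. The sketch below uses a small made-up frame (the column names and values are placeholders, not the competition data) and also checks that the int64 columns really contain only 0 and 1, which is an extra assumption worth verifying.

```python
import pandas as pd

# Toy frame standing in for train_new (hypothetical names and values)
toy = pd.DataFrame({
    "f0": [0.12, 0.57, 0.33],
    "f1": [1, 0, 1],
    "f2": [0.91, 0.08, 0.44],
    "f3": [0, 0, 1],
}).astype({"f1": "int64", "f3": "int64"})

# Column names grouped by dtype, instead of looping over positions
binary_cols = toy.select_dtypes(include="int64").columns.tolist()
continuous_cols = toy.select_dtypes(include="float64").columns.tolist()

# Sanity check: every "binary" column holds only 0 and 1
only_zero_one = bool(toy[binary_cols].isin([0, 1]).all().all())

print(binary_cols)
print(continuous_cols)
print(only_zero_one)
```

This gives column names rather than positions, so the result would be used with plain indexing (toy[binary_cols]) instead of iloc.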

Now, we check if there are missing values that we have to fix.

In [None]:
train.isna().any(axis = 1).sum()
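The cell above counts the rows that contain at least one missing value. If that count were not zero, a per-column breakdown would help to locate the problem. A minimal sketch on made-up data (the frame below is a placeholder, not the competition data):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (hypothetical data)
toy = pd.DataFrame({"f0": [0.1, np.nan, 0.3], "f1": [1, 0, 1]})

# Same check as in the notebook: number of rows with any missing value
rows_with_missing = int(toy.isna().any(axis=1).sum())

# Per-column count, then keep only the columns that have missing values
missing_per_column = toy.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

print(rows_with_missing)
print(columns_with_missing)
```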

Luckily, there were none, so we move on to slice out the target column and free up some memory, as the training data is quite large.

In [None]:
# Select the target by name rather than by a hard-coded column position
target = train['target']

#improve memory usage
import gc
del train
gc.collect()

We then take a look at a summary of the continuous and binary features to spot any interesting patterns.

In [None]:
train_binary = train_new.iloc[:,binary_columns]
train_continuous = train_new.drop(columns = train_binary.columns)
print('Binary features summary statistics')
train_binary.describe().T.style.bar(subset = ['mean'], color = 'grey').bar(subset = ['std'],color = 'grey').background_gradient(cmap = 'GnBu')

In [None]:
print('Continuous features summary statistics')
train_continuous.describe(include = 'all').T.style.bar(subset = ['mean'], color = 'grey').bar(subset = ['std'],color = 'grey').background_gradient(cmap = 'GnBu')

Nothing too special, so we move on.

The next step reduces memory usage, as Kaggle has a limit.

In [None]:
#improve memory usage
import gc
del train_binary, train_continuous
gc.collect()
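Besides deleting frames we no longer need, another common way to reduce memory usage is to store the columns in smaller dtypes. This is not something the notebook does; it is a sketch under the assumptions that float32 precision is enough for the continuous features and that the int64 columns only hold 0 and 1 (so they fit in int8). The toy frame is made up.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training features (hypothetical data)
toy = pd.DataFrame({
    "f0": np.linspace(0.0, 1.0, 1000),
    "f1": np.tile([0, 1], 500),
}).astype({"f0": "float64", "f1": "int64"})

before = int(toy.memory_usage(deep=True).sum())

# Map every float64 column to float32 and every int64 column to int8
float_cols = toy.select_dtypes(include="float64").columns
int_cols = toy.select_dtypes(include="int64").columns
dtype_map = {col: "float32" for col in float_cols}
dtype_map.update({col: "int8" for col in int_cols})

toy_small = toy.astype(dtype_map)
after = int(toy_small.memory_usage(deep=True).sum())

print(before, after)
```

Whether the loss of precision from float32 matters for a given model is something to check on the validation score, not something this sketch can answer.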

After having a clearer picture of the different features, we can start preparing the data to be fitted to the model.

**Train-Test Split**

Holding out part of the data is an essential step, so that each model is evaluated on rows it has not seen during training.

In [None]:
x = train_new
y = target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=1)
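If the target classes turned out to be imbalanced, the split could also be stratified so that both parts keep the same class ratio. This is an optional variation, not what the notebook does; the sketch uses made-up data with a 70/30 class ratio.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and a 70/30 imbalanced target (hypothetical data)
rng = np.random.default_rng(1)
x_toy = pd.DataFrame(rng.random((100, 3)), columns=["f0", "f1", "f2"])
y_toy = pd.Series([0] * 70 + [1] * 30, name="target")

# stratify=y_toy keeps the class ratio roughly equal in both parts
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.33, random_state=1, stratify=y_toy
)

print(len(x_tr), len(x_te))
print(y_tr.mean(), y_te.mean())
```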

In [None]:
#Memory garbage collection again
del x, y
gc.collect()

**Trying Various Modelling Techniques**

Only default models are used here, with no hyperparameter tuning for now. The models are compared by their ROC curves and the area under the curve (AUC) shown in each plot.

1. Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
log_reg=LogisticRegression(solver='liblinear').fit(x_train,y_train)

In [None]:
plot_roc_curve(log_reg, x_test, y_test, name = 'Logistic Regression')

2. CatBoost

In [None]:
from catboost import CatBoostClassifier
catboost = CatBoostClassifier()
catboost.fit(x_train, y_train)

In [None]:
plot_roc_curve(catboost, x_test, y_test, name = 'CatBoost')

3. Random Forest Classifier
*This one takes a while to train, so I cheated a little by capping the number of trees (n_estimators = 10) to get the notebook to run a little faster (the model is not the most accurate with more trees anyway).*

In [None]:
from sklearn.ensemble import RandomForestClassifier
randomforest = RandomForestClassifier(n_jobs = -1, random_state = 1, n_estimators = 10)
randomforest.fit(x_train, y_train)

In [None]:
plot_roc_curve(randomforest, x_test, y_test, name = 'Random Forest Classifier')

As we can see (with a little help), the default CatBoost model seems to be the most accurate model so far. Therefore, we will use that model's prediction as our submission.
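Reading the AUC off separate plots works, but the comparison can also be made numerically with roc_auc_score. The sketch below does this for two scikit-learn models on made-up data; CatBoost is left out only to keep the example dependency-free, and the scores it prints say nothing about the competition data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy binary classification data (hypothetical)
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(size=400) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=1)

models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(n_estimators=10, random_state=1),
}

# Fit each model and score the positive-class probabilities on the held-out part
auc_scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    auc_scores[name] = roc_auc_score(y_te, proba)

# Print the models from best to worst AUC
for name, auc in sorted(auc_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {auc:.3f}")
```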

In [None]:
#Let's ensure that the test data is in the format we want
test = pd.read_csv('../input/tabular-playground-series-oct-2021/test.csv')
test.info()
test.head()

In [None]:
# Predict class probabilities from every feature column (everything except id)
prediction = catboost.predict_proba(test.drop('id', axis = 1))

# Build the submission directly so id keeps its integer dtype;
# column 1 of predict_proba is the probability of the positive class
submission = pd.DataFrame({'id': test['id'], 'target': prediction[:, 1]})
submission.to_csv('submission.csv', index=False)

submission

**Work in progress**

Correlation graphs between features and target
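As a possible starting point for this step, the correlation of each feature with the target can be computed with corrwith before any plotting. This is only a sketch on made-up data (three placeholder features, with the target built from f0), not a result for the competition data.

```python
import numpy as np
import pandas as pd

# Toy features and a target that depends on f0 (hypothetical data)
rng = np.random.default_rng(1)
features = pd.DataFrame(rng.normal(size=(500, 3)), columns=["f0", "f1", "f2"])
noise = 0.3 * rng.normal(size=500)
target = pd.Series((features["f0"] + noise > 0).astype(int), name="target")

# Pearson correlation of every feature with the target,
# sorted by absolute value so the strongest relationships come first
corr_with_target = features.corrwith(target).sort_values(key=abs, ascending=False)
print(corr_with_target)
```

The resulting Series could then be passed to a bar plot (for example corr_with_target.plot.bar()) to get the correlation graph.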