# Tabular Playground Series - Feb 2022

This dataset is meant for practising ML skills on genomic analysis data that has been compressed, with some loss of information.

For example, the sequence ATATGGCCTT is represented as A2T4G2C2 (the count of each base) instead of the sequence itself.
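To make the compression concrete, here is a minimal sketch of the idea. The function name `compress_sequence` and the fixed base order `"ATGC"` are my own assumptions for illustration; the competition data may have been generated differently.

```python
from collections import Counter

def compress_sequence(seq: str, order: str = "ATGC") -> str:
    """Collapse a DNA string into per-base counts (hypothetical helper)."""
    counts = Counter(seq)
    # Emit every base in a fixed order so equal compositions give equal strings
    return "".join(f"{base}{counts[base]}" for base in order)

print(compress_sequence("ATATGGCCTT"))
# The order of the bases is lost: a different sequence with the same
# composition produces exactly the same compressed string
print(compress_sequence("TTCCGGTATA"))
```

Both calls print the same string, which shows where the information loss comes from.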

Let's start by checking the data set and look for some directions for the analysis.

PS. This is my first notebook for a competition and I find it a lot of fun. I hope you will enjoy reading it, and please feel free to leave a comment for further discussion. What I have done here was learnt from other experts on Kaggle and from videos. I think it is better to write a practical notebook and ask others for suggestions than to keep watching videos. Thanks for reading & have fun!

## Next Step & Discussion:

UMAP may not be a good technique for this data set since no pattern or cluster could be found. It is worth previewing the scatter plot first since it tells a lot about whether UMAP will be helpful.
GridSearchCV is useful as well since it helps to tune the hyperparameters. It can be used again for further tuning, but that costs a lot of time and computational power. Next, other classification models can be tried, since the accuracy of XGBoost is hard to improve further and it is hard to find the perfect settings. Let's try again in the future.

#### Result of the submission

17/02/2022 First trial (XGBoost): Score = 0.90366, Rank = 701

18/02/2022 Second trial (XGBoost): Score = 0.91611, Rank = 720 (Accuracy was increased)

19/02/2022 Third trial (XGBoost with higher lambda & alpha): Score = 0.90005, Rank = NA

19/02/2022 Fourth trial (XGBoost with 100 lambda & alpha): Score = 0.82967, Rank = NA

19/02/2022 Fifth trial (XGBoost with no gamma + Second trial regularization): Score = 0.91781, Rank = 724

21/02/2022 Sixth trial (XGBoost with 10-fold & depth adjusted): Score = 0.94448, Rank = 679

22/02/2022 Seventh trial (XGBoost with 5-fold & PCA): Score = 0.86250, Rank = NA

22/02/2022 Eighth trial (XGBoost with 5-fold & features amount PCA): Score = 0.87796, Rank = NA 

26/02/2022 Ninth trial (XGBoost with GridSearchCV): Score = 0.93042, Rank = NA

26/02/2022 Tenth trial (XGBoost with UMAP): Score = 0.78098, Rank = NA

In [None]:
# libraries set up
import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# read the data
train_path = '../input/tabular-playground-series-feb-2022/train.csv'
test_path = '../input/tabular-playground-series-feb-2022/test.csv'

train_set = pd.read_csv(train_path, index_col = 0)
test_set = pd.read_csv(test_path, index_col = 0)

In [None]:
# preview for the train data
df_train = train_set.copy()
df_train.head()

In [None]:
#preview for the test data
df_test = test_set.copy()
df_test.head()

In [None]:
# check the shape for train set
df_train.shape

In [None]:
# check the shape for test set
df_test.shape

The training set contains 200000 samples.

The 100000 test samples are used for prediction, and the performance is checked by uploading the result to Kaggle.

Now, let's check the whole data set for duplicates, non-numeric data and missing values, and see whether any columns need feature engineering.

#### Missing value

Let's check whether there are any missing values in the training and test sets.

In [None]:
# count the missing values in both data sets
# (isna().sum() counts per column, the second sum() adds the columns up)
train_missing_count = df_train.isna().sum().sum()
test_missing_count = df_test.isna().sum().sum()

print('Missing value for train set: {0}'.format(train_missing_count))
print('Missing value for test set: {0}'.format(test_missing_count))

There are no missing values in either data set, so nothing has to be filled in and all the information is retained.

Next, we have to deal with the duplicates.

#### Duplicate

Let's count the duplicate samples.

In [None]:
# print the sample size for duplicate in both train and test set
print(f'Duplicate sample in train set: {df_train.duplicated().sum()}')
print(f'Duplicate sample in test set: {df_test.duplicated().sum()}')

Let's remove the duplicates from the training set since they may cause overfitting.

In [None]:
# remove the duplicate and replace training set
df_train.drop_duplicates(subset = None, keep = 'first', inplace = True)
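As a small illustration of what the two cells above do, here is a sketch on a toy frame. The frame `toy` and its values are made up for the example. Note that `duplicated()` compares the column values only, not the index, so rows with different `row_id` index values can still count as duplicates.

```python
import pandas as pd

# Hypothetical toy data: the row at index 2 repeats the row at index 0
toy = pd.DataFrame({
    "f1": [0.1, 0.2, 0.1, 0.3],
    "f2": [1.0, 2.0, 1.0, 3.0],
    "target": ["a", "b", "a", "c"],
})

# Count rows that repeat an earlier row (the first occurrence is not counted)
n_dupes = int(toy.duplicated().sum())

# Keep the first occurrence of each row and drop the repeats
deduped = toy.drop_duplicates(keep="first")

print(n_dupes, len(deduped))
```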

#### Object type encoding/ mapping

All the features should be numeric except the target column, but we can still check the whole data set.

In [None]:
# check for the object column in training set
object_list = []
for i in df_train.columns:
    if df_train[i].dtypes == 'object':
        object_list.append(i)
print(object_list)

We can see that only the target column is of object type, so it is the only column that needs encoding. We can check the test set as well.

In [None]:
# check for the object column in testing set
object_list = []
for i in df_test.columns:
    if df_test[i].dtypes == 'object':
        object_list.append(i)
print(object_list)

All the features are numeric, so they can be fed to the model without any transformation for now.

Let's work on the target column in the training set and change it to numeric values that the model can read.

There are two ways to handle this column: (1) Ordinal Encoding (2) One-Hot Encoding. According to the description of the data set, there are ten types of bacteria as our target. Ordinal Encoding is the better choice here since this is the target column and we don't want a separate column for each type of bacteria.

In [None]:
# let's check about what type of bacteria
list(df_train['target'].unique())

# let's assign a number for each type
ordinal_target = dict(enumerate(df_train['target'].unique()))
ordinal_target = {y:x for x,y in ordinal_target.items()}
ordinal_target

In [None]:
reverse_target = {y:x for x,y in ordinal_target.items()}
reverse_target

In [None]:
# make another copy to prevent data change
label_df_train = df_train.copy()

# let's map with the dict that we create in the data set
label_df_train['target'] = label_df_train['target'].map(ordinal_target)

# let's check the data set after mapping/ ordinal encoding
label_df_train.head()
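The same kind of mapping can also be sketched with scikit-learn's `LabelEncoder`, which keeps the forward and reverse mapping in one object. The label names below are hypothetical placeholders, not the real bacteria names. One difference from the dictionary above: `LabelEncoder` numbers the classes in sorted order, while the dictionary numbers them in order of first appearance.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical class names standing in for the bacteria in the target column
labels = np.array(["species_b", "species_a", "species_c", "species_a"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)       # strings -> integers
decoded = encoder.inverse_transform(encoded)  # integers -> original strings

print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
print(encoded, decoded)
```

The `inverse_transform` call plays the role of `reverse_target` when building a submission.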

After the mapping, the data is ready for training a model, and we can try it in different ways.

Let's build a simple XGBoost model and check the result first.

### XGBoost

The XGBoost library provides both XGBRegressor and XGBClassifier. This is a classification problem, so XGBClassifier will be used as the model.

In [None]:
# set up for machine learning libraries
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Splitting the data set into features and target by slicing

In [None]:
# slice the dataset for features and target
X = label_df_train.iloc[:,:-1].values
# keep the target as a 1-D array, the shape that scikit-learn expects for labels
y = label_df_train.iloc[:,-1].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2)
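The split above is random on every run. A reproducible and class-balanced variant could look like the sketch below. The `random_state = 42` value and the `stratify` argument are my own additions, not what was used for the submissions, and the toy arrays stand in for the real features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data: 1000 samples, 5 features, ten balanced classes
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 5))
y_toy = np.repeat(np.arange(10), 100)

# stratify keeps the class proportions equal in both parts,
# random_state makes the split identical on every run
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# number of validation samples per class
print(np.bincount(y_val))
```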

##### Ninth trial with GridSearchCV

The model did not perform well with PCA in the last trial. GridSearchCV is a great technique for searching for the best hyperparameters instead of trying them one by one. Let's try a few combinations and search for the best accuracy. The search focuses on n_estimators, max_depth & learning_rate. With the grid below (2 x 2 x 3 = 12 combinations) and cv = 2, it runs 24 fits plus a final refit on the whole training split.

To save time, the search uses only two folds, and the K-fold loop from the earlier trials is not used, since it would cost even more time.
After the best parameters are found, the model is trained again and used to predict the final test set to check the accuracy.

PS. The exact best model may not be reproducible since I forgot to set random_state, but the accuracy should be close between runs.

In [None]:
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV

# make a dictionary of hyperparameter to search
search_space = {
    "n_estimators" : [500, 1000],
    "max_depth" : [8 , 10],
    "learning_rate" : [0.01, 0.1, 1]
}
# build a xgb classifier model without the hyperparameter that we want to check
model = XGBClassifier(objective = 'multi:softmax', booster = 'gbtree',
                        eval_metric = 'auc', tree_method = 'hist', use_label_encoder = False)
# build a GridSearchCV model to start for the searching of hyperparameter
GS = GridSearchCV(estimator = model, # target model
                  param_grid = search_space, # dictionary of the parameters
                  cv = 2, # 2 cv to minimize the time for training
                  verbose = 3 # message for the searching progress
)
GS.fit(X_train, y_train)

In [None]:
print(GS.best_estimator_) # best model

In [None]:
print(GS.best_params_) # best hyperparameter

In [None]:
print(GS.best_score_) # best score for the model
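The cost of a grid search can be estimated before running it. The sketch below counts the candidates for the same `search_space` with scikit-learn's `ParameterGrid` and multiplies by the number of folds; `cv_folds` is just a local name for the `cv = 2` setting used above.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
search_space = {
    "n_estimators": [500, 1000],
    "max_depth": [8, 10],
    "learning_rate": [0.01, 0.1, 1],
}

n_candidates = len(ParameterGrid(search_space))  # every combination of the lists
cv_folds = 2                                     # matches cv = 2 in GridSearchCV
total_fits = n_candidates * cv_folds             # the final refit is not included

print(n_candidates, total_fits)
```

Adding one more value to any list, or one more fold, multiplies the total, which is why a wider search quickly becomes expensive.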

In [None]:
# slice the dataset for features and target
X = label_df_train.iloc[:,:-1]
y = label_df_train.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2)

best_model = XGBClassifier(objective = 'multi:softmax', booster = 'gbtree',
                        eval_metric = 'auc', tree_method = 'hist', use_label_encoder = False,
                          learning_rate = 0.1, max_depth = 8, n_estimators = 1000)
best_model.fit(X_train, y_train)

y_pred = best_model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc}")

In [None]:
predictions = best_model.predict(df_test)

In [None]:
# build the submission file: map the predicted numbers back to the bacteria names
xgb_result = pd.DataFrame()
xgb_result['row_id'] = df_test.index
xgb_result['target'] = predictions
xgb_result['target'] = xgb_result['target'].map(reverse_target)
xgb_result.to_csv("xgb_GSCV_submission.csv", index = False)

With the best parameters from GridSearchCV, the submission scored 0.93042. That is not better than the sixth submission, but we can improve it by using K-fold as well.
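Here is a minimal sketch of how K-fold training with a majority vote over the fold models could look. It is not the code of the sixth trial: a small `DecisionTreeClassifier` on toy data stands in for XGBoost so that the example runs quickly, and all variable names are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data: 500 labelled samples and 100 "test" samples
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=0
)
X_fit, y_fit = X_toy[:500], y_toy[:500]
X_new = X_toy[500:]

# Train one model per fold and let each model predict the unseen samples
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_preds = []
for train_idx, _ in skf.split(X_fit, y_fit):
    model = DecisionTreeClassifier(max_depth=5, random_state=0)
    model.fit(X_fit[train_idx], y_fit[train_idx])
    fold_preds.append(model.predict(X_new))

fold_preds = np.vstack(fold_preds)  # shape: (n_folds, n_samples)
n_classes = len(np.unique(y_fit))

# Majority vote: count the votes per class for each sample, keep the top class
votes = np.apply_along_axis(np.bincount, 0, fold_preds, minlength=n_classes)
majority = votes.argmax(axis=0)

print(majority[:10])
```

Ties are resolved in favour of the lowest class number because `argmax` returns the first maximum.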

##### 10th submission: UMAP + previous model

Previously, PCA was used for dimension reduction, but it did not work well since the clusters were very close to each other, so a sample could easily be predicted as the wrong target.

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_scaled = ss.fit_transform(X)
pca = PCA(n_components = None)
L = pca.fit_transform(X_scaled)

Let's recap with a scatter plot of the first two components.

In [None]:
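def pca_scatter(pca, standardised_values, classifs):
    # project the standardised data onto the principal components
    components = pca.transform(standardised_values)
    # keep the first two components together with the class of each sample
    plot_df = pd.DataFrame(zip(components[:,0], components[:,1], classifs), columns = ["PC1","PC2","Class"])
    # scatter plot coloured by class, without a regression line
    sns.lmplot(data = plot_df, x = "PC1", y = "PC2", hue = "Class", fit_reg = False)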
    
pca_scatter(pca, X_scaled, y)
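Besides the scatter plot, the explained variance ratio tells how much of the information the first components carry. The sketch below uses random toy data, and the 95% threshold is an arbitrary choice for illustration, not a value used in the trials.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy stand-in data with one strongly correlated pair of columns
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 6))
X_toy[:, 1] = X_toy[:, 0] + 0.1 * rng.normal(size=200)

X_toy_scaled = StandardScaler().fit_transform(X_toy)
pca_toy = PCA(n_components=None).fit(X_toy_scaled)

# Running total of the variance explained by the first k components
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)

print(cumulative.round(3), n_keep)
```

If the first two components explain only a small share of the variance, a 2-D scatter plot can hide structure that exists in the remaining dimensions.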

Besides PCA and t-SNE, UMAP is another great technique for reducing dimensions and forming clusters. It may help to classify our samples.

In [None]:
from umap import UMAP
import warnings
warnings.filterwarnings("ignore")

# create the UMAP model
reducer = UMAP(n_components = 2, n_neighbors = 10)

# fit and transform all the data with training data set
embedding = reducer.fit_transform(X_train)

Let's plot the embedding and check the clusters against the target class.

In [None]:
# plot a scatter plot and check with the class
plt.scatter(embedding[:,0], embedding[:,1], s = 5, c = y_train, cmap = 'Spectral')
# set the axis and limit equal
plt.gca().set_aspect('equal', 'datalim')
# set the title
plt.title('Visualizing data with UMAP', fontsize= 24)

It seems that there is no pattern in the UMAP embedding, so it may not be a good way to reach a high accuracy. Let's try it with XGBoost anyway and check the accuracy.

In [None]:
from sklearn.pipeline import Pipeline

# create the model for dimension reduction & xgboost
_umap = UMAP(n_components = 2, n_neighbors = 10)
_xgboost = XGBClassifier(objective = 'multi:softmax', booster = 'gbtree',
                        eval_metric = 'auc', tree_method = 'hist', use_label_encoder = False,
                          learning_rate = 0.1, max_depth = 8, n_estimators = 1000)
# create the pipeline for the progress
XGB_UMAP_model = Pipeline([
    ('umap', _umap),
    ('xgb', _xgboost)
])
# fit the training set into the pipeline for training
XGB_UMAP_model.fit(X_train, y_train)
# predict the validation set
y_pred = XGB_UMAP_model.predict(X_test)
# calculate the accuracy for the model
acc = accuracy_score(y_test, y_pred)
print(f'Accuracy: {acc}')

According to the validation accuracy, UMAP does not seem helpful for dimension reduction here. Maybe dimension reduction in general is not a good technique for this data set. Let's predict the final test set and check the accuracy.

In [None]:
predictions = XGB_UMAP_model.predict(df_test)

In [None]:
# build the submission file: map the predicted numbers back to the bacteria names
xgb_result = pd.DataFrame()
xgb_result['row_id'] = df_test.index
xgb_result['target'] = predictions
xgb_result['target'] = xgb_result['target'].map(reverse_target)
xgb_result.to_csv("xgb_UMAP_submission.csv", index = False)

The accuracy with UMAP (0.78098) was terrible. It is good to know that previewing the UMAP scatter plot really does help to anticipate the accuracy.
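A rough numeric complement to looking at the scatter plot is to check how well a simple k-nearest-neighbours classifier separates the classes in the 2-D embedding. This is only a sketch of the idea and was not part of the trials: PCA on toy blobs stands in for the UMAP embedding so that the example needs nothing but scikit-learn, and `n_neighbors = 10` is an arbitrary choice.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in data: four well separated groups in eight dimensions
X_toy, y_toy = make_blobs(n_samples=600, centers=4, n_features=8, random_state=0)

# Stand-in for the 2-D embedding (UMAP would take the place of PCA here)
embedding_2d = PCA(n_components=2, random_state=0).fit_transform(X_toy)

# Cross-validated accuracy of k-NN on the two embedding coordinates only
knn = KNeighborsClassifier(n_neighbors=10)
scores = cross_val_score(knn, embedding_2d, y_toy, cv=5)

print(scores.mean())
```

A score close to random guessing would suggest that the classes overlap in the embedding, which is the situation the UMAP scatter plot showed for this data set.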