# The task

The task is to classify 10 bacteria species using data from a genomic analysis technique that involves some data compression and data loss. In this technique, 10-mer snippets of DNA are sampled and analyzed to give a histogram of base counts. In other words, the DNA segment ATATGGCCTT becomes A2T4G2C2. For each bacteria sample, the dataset contains 285 such 10-mer histograms and their frequencies. The aim is to use this lossy information to accurately predict the bacteria species.
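
To make the compression step concrete, here is a minimal sketch of how a snippet collapses into its base-count label. The function name `base_histogram` is hypothetical and not part of the competition data or the paper; the base order A, T, G, C follows the column names used in the dataset.

```python
from collections import Counter


def base_histogram(snippet: str) -> str:
    """Collapse a DNA snippet into its base-count label, e.g. 'A2T4G2C2'."""
    counts = Counter(snippet.upper())
    # The labels list the bases in the order A, T, G, C; a missing base counts as 0.
    return "".join(f"{base}{counts[base]}" for base in "ATGC")


print(base_histogram("ATATGGCCTT"))
# A different ordering of the same bases gives the same label,
# which is exactly the information lost by this technique.
print(base_histogram("TTCCGGATAT"))
```

Both calls return the same label, which illustrates why the representation is lossy: the order of the bases within the snippet is discarded.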

The bacteria species which need to be identified: 
* Streptococcus_pyogenes, 
* Salmonella_enterica, 
* Enterococcus_hirae, 
* Escherichia_coli, 
* Campylobacter_jejuni, 
* Streptococcus_pneumoniae, 
* Staphylococcus_aureus, 
* Escherichia_fergusonii, 
* Bacteroides_fragilis, 
* Klebsiella_pneumoniae

The idea for this competition came from the following paper:

@ARTICLE{10.3389/fmicb.2020.00257,
AUTHOR={Wood, Ryan L. and Jensen, Tanner and Wadsworth, Cindi and Clement, Mark and Nagpal, Prashant and Pitt, William G.},   
TITLE={Analysis of Identification Method for Bacterial Species and Antibiotic Resistance Genes Using Optical Data From DNA Oligomers},      
JOURNAL={Frontiers in Microbiology},      
VOLUME={11},      
YEAR={2020},      
URL={https://www.frontiersin.org/article/10.3389/fmicb.2020.00257},       
DOI={10.3389/fmicb.2020.00257},      
ISSN={1664-302X}}


## 1. Loading and inspecting the data

In [None]:
%%time
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report,accuracy_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV


In [None]:
%%time
# Reading input data
train = pd.read_csv('../input/tabular-playground-series-feb-2022/train.csv', dtype={'target':'category'})
train.drop(columns=['row_id'], inplace= True)

# Checking basic info about the dataset
display(train.head())
train.info()  # info() prints its report itself and returns None, so it is not wrapped in display()
display(f'Duplicated rows: {train.duplicated().sum()}')
display(f'Missing data: {train.isna().sum().sum()}')

The train dataset includes 76k duplicated rows. These could either be dropped to reduce the memory usage of the dataset, or be kept on the assumption that a row appearing more than once is most likely a correct (not lossy) sample of the genomic sequence, which would give more confidence in that data. In this notebook I leave the duplicated rows in to see whether they improve the prediction.
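
A third option, not used in this notebook, would be to collapse the duplicates into a single row each and keep the number of occurrences as a weight. The sketch below shows the idea on a toy frame; the column names, values and the `sample_weight` column are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `train`: two feature columns and a target, with repeated rows.
toy = pd.DataFrame({
    "A0T0G0C10": [0.1, 0.1, 0.2, 0.1],
    "A0T0G1C9": [0.3, 0.3, 0.4, 0.3],
    "target": ["Escherichia_coli", "Escherichia_coli",
               "Salmonella_enterica", "Escherichia_coli"],
})

# Count how often each distinct row occurs and keep that count as a weight.
deduplicated = (
    toy.groupby(list(toy.columns), sort=False)
       .size()
       .reset_index(name="sample_weight")
)
print(deduplicated)
```

The resulting weights could then be passed through the `sample_weight` argument that the `fit` method of scikit-learn's tree ensembles accepts, so repeated observations still count more without being stored several times.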

In [None]:
%%time
# Reducing the dataset's memory usage by downcasting the numeric columns

display(f'Initial memory usage: {train.memory_usage().sum()/1024**2:.2f} MB')

# The memory-reducing function below is based on the function created by Daniil Kaprov in this notebook: (https://www.kaggle.com/vanguarde/tps-feb22-deep-eda-catboost-submission)

def reduce_memory_usage(df):
    start_mem = df.memory_usage().sum()/1024**2
    datatypes = ['float16', 'float32', 'float64']

    # Only numeric columns are downcast, so the categorical target is skipped
    # and the function also works on the test set, which has no target column
    for col in df.select_dtypes(include='number').columns:
        for dtp in datatypes:
            # Pick the smallest float type whose range covers the column
            # (note that float16 keeps only about 3 significant digits)
            if df[col].abs().max() <= np.finfo(dtp).max:
                df[col] = df[col].astype(dtp)
                break

    end_mem = df.memory_usage().sum()/1024**2
    reduction = (start_mem - end_mem)*100/start_mem
    print(f'Mem. usage decreased by {reduction:.2f}% to {end_mem:.2f} MB')
    return df


train = reduce_memory_usage(train)

### Visual EDA

In [None]:
# Checking target bacteria species
display(train.target.unique())

In [None]:
# Checking whether the dataset is balanced
ax = sns.countplot(data=train, y='target')
ax.set_title('Distribution of the target bacteria species')
plt.xticks(rotation=90)
sns.despine()
plt.xlabel('Nr of observations')
plt.ylabel('')
plt.show()

Luckily, the dataset is balanced: there is approximately the same number of observations for each of the 10 bacteria species.

In [None]:
# Checking cardinality: how distinct the values of the features are
feature_distinct_values = train.iloc[:,:-1].nunique(axis=0, dropna=True).sort_values()
feature_distinct_values = pd.DataFrame(feature_distinct_values, columns=['distincts'])
feature_distinct_values = feature_distinct_values.reset_index().rename(columns={'index':'sequence'})

# Plotting cardinality of the features
chart = sns.barplot(data=feature_distinct_values, y='sequence', x='distincts')
plt.yticks([]) # hiding ticks
plt.title('Cardinality: distinct values of features')
plt.xlabel('No. of distinct values')
plt.ylabel('Features')
sns.despine()


It seems that some sequences have a very low cardinality (below 50 distinct values), while others take more than 6000 different values.
Let's check out the low cardinality sequences: 

In [None]:
print('Low cardinality features:')
display(feature_distinct_values[feature_distinct_values.distincts <50])


The segments above have a very low cardinality: each takes fewer than 30 distinct values among the 200k observations.

Idea: it might be worth trying to prepare categorical data out of these at a later stage. 
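
One way this idea could be tried is sketched below on a toy frame: columns with few distinct values are replaced by ordered integer codes. The column names and the `max_distinct` threshold are hypothetical (the text above uses 50), and this step is not part of the model built later in this notebook.

```python
import pandas as pd

# Toy stand-in for the feature columns of `train`.
toy = pd.DataFrame({
    "low_card": [0.0, 1e-6, 0.0, 2e-6, 1e-6, 0.0],
    "high_card": [0.11, 0.12, 0.13, 0.14, 0.15, 0.16],
})

max_distinct = 3  # hypothetical threshold for this toy example
n_distinct = toy.nunique()
low_card_cols = n_distinct[n_distinct <= max_distinct].index.tolist()

encoded = toy.copy()
for col in low_card_cols:
    # Ordered categories keep the numeric ordering of the original values;
    # .codes maps each value to the integer position of its category.
    encoded[col] = pd.Categorical(encoded[col], ordered=True).codes

print(low_card_cols)
print(encoded)
```

Whether such an encoding helps depends on the model; tree-based models split on thresholds either way, so the benefit may be small.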

### Checking correlation of features

If features are highly correlated, multicollinearity can degrade a model and lead to misleading results. Also, there are a lot of features (285 segments), so it would help model performance if we could easily eliminate features that add no further information.


In [None]:
%%time
# Preparing correlation matrix
features = train.iloc[:, :-1]
features_corr = features.corr()

# Preparing heatmap
# Creating a mask to hide the upper triangle (incl. diagonal) of the heatmap to prevent duplication
mask = np.triu(np.ones_like(features_corr, dtype=bool), k=0)
mask


In [None]:
%%time 
# Plotting correlation matrix on a heatmap

plt.figure(figsize =(12,8))
ax = sns.heatmap(
    features_corr, 
    mask=mask,
    cmap='coolwarm')

ax.set_title('Correlation of features')
plt.xlabel('DNA segments')
plt.ylabel('DNA segments')
plt.xticks([]) # removing the ticks and labels as there are too many features to display
plt.yticks([])

The heatmap suggests that no pair of features has a correlation close to 100%, but there may be some features which are highly correlated. Let's search for the most highly correlated features.

In [None]:
# Sorting the correlation values from negative linear to positive linear (-1 to +1)
# The same mask as above selects the lower triangle of the matrix (without the diagonal) to avoid listing each pair twice
sorted_correlation = features_corr.mask(mask).stack().sort_values()
sorted_correlation

The correlation ranges from -0.7 to +0.84, so no pair of features is perfectly linearly correlated (positively or negatively). However, there are 4 pairs whose correlation is strong, between 0.8 and 0.9.

These pairs are:
* A3T0G5C2 - A3T0G3C4 (0.808503)
* A1T2G7C0 - A1T2G6C1 (0.826917)
* A4T1G4C1 - A4T0G1C5 (0.836726)
* A3T1G3C3 - A3T0G3C4 (0.841112)

It seems we cannot significantly reduce dimensionality by eliminating highly correlated features.
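
The search for strongly correlated pairs can also be wrapped in a small helper, sketched here on synthetic data. The function name `correlated_pairs` and the 0.8 threshold are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation reaches the threshold."""
    corr = df.corr()
    # Keep only the strict lower triangle so each pair appears exactly once.
    lower = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    pairs = corr.where(lower).stack().dropna()
    return pairs[pairs.abs() >= threshold].sort_values(ascending=False)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),  # nearly a copy of "a"
    "c": rng.normal(size=500),                  # independent of "a" and "b"
})

strong_pairs = correlated_pairs(toy)
print(strong_pairs)
```

On the real data, `correlated_pairs(features)` would be expected to list the four pairs above; dropping one feature from each pair would remove at most four columns, which supports the conclusion that little dimensionality can be saved this way.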


## 2. Building model

### Separating train, target and validation data

In [None]:
%%time
# Defining X and y for model
X = train.drop(columns=['target'])
y = train.target

X = np.array(X)
y = np.array(y)

# Splitting observations to train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25)

In [None]:
%%time

# Standardizing train data and validation data. I scale only after splitting to train and validation set to avoid data leakage
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_valid_scaled = scaler.transform(X_valid)

In [None]:
%%time
le = LabelEncoder() # I make the label encoding only after splitting the dataset for training and validation
le.fit(y_train)
y_train = le.transform(y_train)
y_valid = le.transform(y_valid)
display(y_train)
display(y_valid)

### Selecting and optimizing model

In earlier notebooks, I experimented with RandomForestClassifier, SGDClassifier and KNeighborsClassifier, but only KNeighborsClassifier gave good results (around 90% accuracy with hyperparameter tuning); the other models reached around 69-77% even after optimization. However, while the KNeighborsClassifier scores were promising, the model was very slow to run. Interestingly, although RandomForest did not perform well, ExtraTrees (short for Extremely Randomized Trees) produced much better results.
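
A comparison of this kind can be run with cross-validation, as in the minimal sketch below. It uses a small synthetic dataset from `make_classification` rather than the competition data, so the scores it prints say nothing about the results quoted above; the estimator settings are also only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real features and labels.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=0
)

candidates = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

# Mean accuracy over 3 folds for each candidate model.
cv_scores = {
    name: cross_val_score(estimator, X_toy, y_toy, cv=3, scoring="accuracy").mean()
    for name, estimator in candidates.items()
}

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

Both ensembles average many decision trees; ExtraTrees additionally draws its split thresholds at random instead of searching for the best one, which is the main difference between the two.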

In [None]:
%%time
model = ExtraTreesClassifier(n_estimators = 1400, random_state=0)
display(model)

In [None]:
%%time 

# This part below was used for grid search 

# Hyperparameter space
param = {   
        'n_estimators': [900, 1000, 1200, 1400]
}

search = GridSearchCV(model, param, cv=5, scoring='accuracy', verbose=3, refit=True)
display(search)

#search.fit(X_train_scaled, y_train)
#display(search.best_estimator_)
#display(search.best_params_) 
#display(search.best_score_)

# using the best estimator found by grid search as our model

#model = search.best_estimator_ 
#display(model)



The results of earlier grid searches:

* ExtraTreesClassifier, param_grid={'n_estimators': [900, 1000, 1200, 1400]}: best estimator (n_estimators=1400, random_state=0), score: 0.973
* ExtraTreesClassifier, param_grid={'n_estimators': [200, 400, 600, 800, 1000]}: best estimator (n_estimators=1000, random_state=0)
* KNeighborsClassifier, param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6]}: best estimator (n_neighbors=1), best parameters {'n_neighbors': 1}
* SGDClassifier: best estimator SGDClassifier(alpha=0.001, max_iter=10000), best parameters {'alpha': 0.001}
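
Rather than copying results by hand, the outcome of a fitted grid search can be tabulated from its `cv_results_` attribute. The sketch below does this on a small synthetic dataset with a deliberately tiny grid; the names `toy_search` and `results` and the grid values are only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

toy_search = GridSearchCV(
    ExtraTreesClassifier(random_state=0),
    {"n_estimators": [10, 20, 40]},
    cv=3,
    scoring="accuracy",
)
toy_search.fit(X_toy, y_toy)

# One row per parameter combination, best-ranked first.
results = (
    pd.DataFrame(toy_search.cv_results_)[
        ["param_n_estimators", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
)
print(results)
```

Keeping the standard deviation next to the mean score helps judge whether the difference between two settings is larger than the fold-to-fold noise.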

In [None]:
%%time

model.fit(X_train_scaled, y_train)

In [None]:
%%time
y_pred = model.predict(X_valid_scaled)
display(accuracy_score(y_valid, y_pred))
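
Accuracy is a single number and does not show which species get confused with each other. A per-class view can be obtained with `confusion_matrix` and `classification_report`, sketched here on synthetic data; the toy variable names are hypothetical, and on the real data the calls would take `y_valid` and `y_pred` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and encoded labels.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

toy_model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
y_va_pred = toy_model.predict(X_va)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_va, y_va_pred)
print(cm)

# Precision, recall and F1 score for every class.
print(classification_report(y_va, y_va_pred))
```

Large off-diagonal entries in the matrix point to pairs of classes that the model mixes up.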


In [None]:
%%time
#display(accuracy_score(y_train, model.predict(X_train_scaled))) #this was used to check overfitting

### Fitting the model for all the observations

In [None]:
%%time

# Refitting the scaler on all the training data and standardizing the values
scaler.fit(X)
X_scaled = scaler.transform(X)

y = le.transform(y)


In [None]:
%%time
# Fitting the model on all the observations so that it has the most data to work with
model.fit(X_scaled, y) 


## 3. Making predictions and submission csv

In [None]:
%%time
# Reading test data
test = pd.read_csv('../input/tabular-playground-series-feb-2022/test.csv')
X_test = test.drop(columns=['row_id'])
X_test = reduce_memory_usage(X_test)



X_test_scaled = scaler.transform(X_test.to_numpy())  # the scaler was fitted on arrays, so an array is passed here too
display(X_test_scaled)

In [None]:
%%time
# Predicting bacteria species
prediction_encoded = model.predict(X_test_scaled)
prediction = le.inverse_transform(prediction_encoded)
display(prediction)

In [None]:
%%time
# Putting predicted results in a dataframe
submission = pd.DataFrame({
  'row_id':test.row_id,
  'target': prediction
})

display(submission.head())  # a bare expression in the middle of a cell is not shown

# Generating submission file
submission.to_csv('submission.csv', index=False)