Problem statement

For the February 2022 Tabular Playground Series competition, your task is to classify 10 different bacteria species using data from a genomic analysis technique that involves some data compression and data loss. In this technique, 10-mer snippets of DNA are sampled and analyzed to give a histogram of base counts. In other words, the DNA segment ATATGGCCTT becomes A2T4G2C2. Can you use this lossy information to accurately predict bacteria species?
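
To make the lossy encoding concrete, here is a minimal sketch of how a 10-mer collapses into a histogram of base counts. The helper name `base_histogram`, the fixed A, T, G, C output order, and the sample segments are assumptions made for illustration; they are not part of the competition code.

```python
from collections import Counter

def base_histogram(segment: str) -> str:
    """Collapse a DNA segment into a base-count string (hypothetical helper)."""
    counts = Counter(segment)
    # Assumed output order: A, T, G, C. Bases that never occur get a count of 0.
    return "".join(f"{base}{counts[base]}" for base in "ATGC")

first = base_histogram("ATATGGCCTT")
second = base_histogram("TTCCGGATAT")  # same bases, different order

print(first)
print(first == second)
```

Both segments map to the same string, which is the information loss the problem statement refers to: the order of the bases is discarded and only the counts survive.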

Import

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

Load

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Read

In [None]:
train = pd.read_csv("/kaggle/input/tabular-playground-series-feb-2022/train.csv")
test = pd.read_csv("/kaggle/input/tabular-playground-series-feb-2022/test.csv")
submission = pd.read_csv("/kaggle/input/tabular-playground-series-feb-2022/sample_submission.csv")

In [None]:
train

In [None]:
test

In [None]:
submission

Analyse target

In [None]:
sns.displot(train['target'])

Define target

In [None]:
target = train['target']
target

Combine train and test

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
combi = pd.concat([train.drop(['target'], axis=1), test])
combi

Drop row_id

In [None]:
combi = combi.drop(['row_id'], axis=1)
combi

Analyse combi

In [None]:
combi.info()

In [None]:
combi.describe()

In [None]:
combi.isnull().sum().sum()

Normalise combi

In [None]:
combi = (combi - combi.min()) / (combi.max() - combi.min())
combi
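
The min-max formula above divides by `max - min`, which is zero for any column that holds a single constant value and would produce NaN. The sketch below shows one possible guard on a toy frame; the frame `toy` and the name `safe_range` are illustrative assumptions, and whether the competition data contains constant columns is not established here.

```python
import pandas as pd

# Toy data: column "b" is constant, so its range is zero.
toy = pd.DataFrame({"a": [0.0, 5.0, 10.0], "b": [3.0, 3.0, 3.0]})

col_range = toy.max() - toy.min()
# Replace zero ranges with 1 so constant columns scale to 0 instead of NaN.
safe_range = col_range.replace(0, 1)
scaled = (toy - toy.min()) / safe_range

print(scaled)
```

Column `a` is mapped onto the interval from 0 to 1, while the constant column `b` becomes all zeros rather than NaN.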

Encode label

In [None]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
target = le.fit_transform(target)
target

Define X and y

In [None]:
y = target
X = combi[: len(train)]
X_test = combi[len(train) :]

SelectKBest

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

skb = SelectKBest(chi2, k=90)

X = skb.fit_transform(X, y)
X_test = skb.transform(X_test)

X.shape, y.shape, X_test.shape

Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=1)
X_train.shape, X_val.shape, y_train.shape, y_val.shape, X_test.shape

Select model

In [None]:
from sklearn.linear_model import LogisticRegression

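# Note: multi_class is deprecated in scikit-learn 1.5 and later; it is kept here to match the original run
model = LogisticRegression(
    C=100, solver='newton-cg', penalty='l2', multi_class='multinomial', random_state=42
).fit(X_train, y_train)
print(model.score(X_train, y_train))  # accuracy on the training split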

Predict on validation set

In [None]:
y_pred = model.predict(X_val)
print(model.score(X_val, y_val))

Confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_val, y_pred))
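
A raw confusion matrix can be hard to read with 10 classes, so it may help to reduce it to per-class recall. The sketch below uses a made-up 3-class matrix (the numbers are hypothetical, not results from this notebook) and follows the scikit-learn convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true labels, columns are predictions.
cm = np.array([[8, 1, 1],
               [0, 9, 1],
               [2, 0, 8]])

# Recall per class: correct predictions divided by the number of true samples in that class.
recall = cm.diagonal() / cm.sum(axis=1)
# Overall accuracy: all correct predictions divided by all samples.
accuracy = cm.diagonal().sum() / cm.sum()

print(recall)
print(accuracy)
```

The same two lines should work on the array returned by `confusion_matrix(y_val, y_pred)`, and a class with low recall points to the species the model confuses most often.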

Predict on test set

In [None]:
pred = model.predict(X_test)
pred

In [None]:
pred = le.inverse_transform(pred)
pred

Prepare for submission

In [None]:
submission['target'] = pred
submission.to_csv("submission.csv", index=False)
submission = pd.read_csv("submission.csv")
submission