## Introduction

# Bacteria and Viruses

[<img style="float:right; width:150px;height:300px;" src="https://media4.giphy.com/media/l0O9xXtMtjz050twY/giphy.gif?cid=ecf05e47a7muzlffbso29382q1q4qocad2cvet8lb7urll6m&rid=giphy.gif&ct=g">](http:google.com.au/)
Bacteria are single-celled, microscopic organisms. Most have a cell wall, and all lack membrane-bound organelles, including a nucleus. The bacterial genetic material is typically a single, circular DNA molecule (the bacterial chromosome) that sits in the cytoplasm rather than inside a nucleus. Bacteria come in several shapes (e.g., rod-shaped, filamentous, spiral-shaped). Many bacteria cause disease by producing toxins. Some bacterial infections that cause human illness can be prevented by vaccines, and many can be treated with antibiotics.

# Challenge Description

For this challenge, you will predict bacteria species from repeated lossy measurements of DNA snippets. Snippets of length 10 are analyzed using Raman spectroscopy, which yields the histogram of bases in the snippet. In other words, the DNA segment $ATATGGCCTT$ becomes $A_2T_4G_2C_2$.

Each row of data contains a spectrum of histograms generated by repeated measurements of a sample. It holds the output for all 286 possible histograms (from $A_0T_0G_0C_{10}$ to $A_{10}T_0G_0C_0$), from which a bias spectrum (that of totally random $ATGC$) has been subtracted.
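To make the encoding concrete, here is a minimal sketch that counts the bases in a snippet and enumerates every possible histogram for length-10 snippets. The helper names (`base_histogram`, `all_histograms`, `uniform_bias`) are hypothetical, and `uniform_bias` assumes that the "bias spectrum" is the multinomial probability of each histogram when all four bases are equally likely; the competition data may define it differently.

```python
from collections import Counter
from itertools import product
from math import factorial

SNIPPET_LENGTH = 10
BASES = "ATGC"

def base_histogram(snippet):
    """Count how many times each base (A, T, G, C) occurs in a DNA snippet."""
    counts = Counter(snippet)
    return tuple(counts.get(base, 0) for base in BASES)

def all_histograms(length=SNIPPET_LENGTH):
    """Enumerate every (A, T, G, C) count tuple whose counts sum to `length`."""
    return [
        combo
        for combo in product(range(length + 1), repeat=len(BASES))
        if sum(combo) == length
    ]

def uniform_bias(hist):
    """Probability of a histogram under uniformly random bases (assumed bias model)."""
    total = sum(hist)
    ways = factorial(total)
    for count in hist:
        ways //= factorial(count)  # multinomial coefficient, exact at every step
    return ways / len(BASES) ** total

print(base_histogram("ATATGGCCTT"))  # (2, 4, 2, 2)
print(len(all_histograms()))         # 286
```

The count of 286 matches the number of ways to split 10 positions among 4 bases, which is why each row has 286 feature columns.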

# Preprocessing

In [None]:
# !pip -q install pandas --upgrade



import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from statistics import mean
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
import matplotlib.gridspec as gridspec
from sklearn.svm import LinearSVC
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.linear_model import SGDClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt

import torch

# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

#Additional Info when using cuda
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')

In [None]:
train_data = pd.read_csv("/kaggle/input/tabular-playground-series-feb-2022/train.csv")
test_data = pd.read_csv("/kaggle/input/tabular-playground-series-feb-2022/test.csv")

row_id = test_data["row_id"]

del test_data["row_id"]
del train_data["row_id"]

In [None]:
print(f"Nb samples in train: {train_data.shape[0]}\nNb columns in train: {train_data.shape[1]}\nNb samples in test: {test_data.shape[0]}\nNb columns in test: {test_data.shape[1]}\n")

In [None]:
def reduce_mem_usage(df, verbose=True):
    """Downcast numeric columns to the smallest dtype that holds their value range."""
    numerics = ['int8','int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2

    for col in df.columns:
        col_type = df[col].dtypes

        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()

            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2

    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
 
    return df

print("Train data")
train_data = reduce_mem_usage(train_data)
print("\nTest data")
test_data = reduce_mem_usage(test_data)

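Remove duplicated rows. Repeated samples are over-weighted during training, and copies that land in both the training and validation splits can make validation scores look better than they are.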

In [None]:
print(f"Total number of duplicated rows: {train_data.duplicated().sum()} out of {train_data.shape[0]} ({train_data.duplicated().sum()/train_data.shape[0]*100:.2f}%)")
train_data = train_data.drop_duplicates()
print(f"Total number of rows after removal: {train_data.shape[0]}")

# Data Exploration

In [None]:
train_data.groupby('target').describe()

In [None]:
# Class proportions; rename_axis + reset_index(name=...) gives the same column names on pandas 1.x and 2.x
bacteria = train_data.target.value_counts(normalize=True).rename_axis('Name').reset_index(name='target')

gs = gridspec.GridSpec(1, 2, width_ratios=[2.5, 2.5])

# Set the coordinates limits
upperLimit = 100
lowerLimit = 30

# Largest class proportion in the dataset (named max_value to avoid shadowing the built-in max)
max_value = bacteria['target'].max()

# Let's compute heights: they are a linear conversion of each item value into the new coordinates
# 0 in the dataset will be converted to the lowerLimit (30)
# The maximum will be converted to the upperLimit (100)
slope = (upperLimit - lowerLimit) / max_value
heights = slope * bacteria.target + lowerLimit

# Compute the width of each bar. In total we have 2*Pi = 360°
width = 2*np.pi / len(bacteria.index)

# Compute the angle each bar is centered on:
indexes = list(range(1, len(bacteria.index)+1))
angles = [element * width for element in indexes]

COLORMAP = ["#5F4690", "#1D6996", "#38A6A5", "#0F8554", "#73AF48", "#EDAD08", "#E17C05", "#CC503E", "#94346E", "#666666"]

# initialize the figure (a single figure, so fig.tight_layout() below acts on the one being drawn)
fig = plt.figure(figsize=(30,7))
ax = plt.subplot(gs[0,0], polar=True)
plt.axis('off')


# Draw bars
bars = ax.bar(
    x=angles, 
    height=heights, 
    width=width, 
    bottom=lowerLimit,
    linewidth=2, 
    edgecolor="white",
    color=COLORMAP,
)

# little space between the bar and the label
labelPadding = 4



# Add labels
for bar, angle, height, label,value in zip(bars,angles, heights, bacteria["Name"], bacteria["target"]):

    # Labels are rotated. Rotation must be specified in degrees :(
    rotation = np.rad2deg(angle)

    # Flip some labels upside down
    alignment = ""
    if angle >= np.pi/2 and angle < 3*np.pi/2:
        alignment = "right"
        rotation = rotation + 180
    else: 
        alignment = "left"

    # Finally add the labels
    ax.text(
        x=angle, 
        y=lowerLimit + bar.get_height() + labelPadding, 
        s=label + "  " +str(round(value*100,1)) + "%", 
        ha=alignment, 
        va='center', 
        rotation=rotation, 
        rotation_mode="anchor") 
    

ax1 = plt.subplot(gs[0,1])


ax1.axis('off')
ax1.axis('tight')


ax1.table(cellText=bacteria.values, colLabels=bacteria.columns, loc='center')

fig.tight_layout()

plt.show()

## Check Distribution of variables

In [None]:
from statsmodels.graphics.gofplots import qqplot
from matplotlib import pyplot
# take a single feature column (index 20) as the univariate sample to inspect
data = np.array(train_data.iloc[:, 20])
# q-q plot
qqplot(data, line='s')
pyplot.show()

# The Q-Q plot suggests this feature is not normally distributed, so non-parametric models are a reasonable choice
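A Q-Q plot covers one column at a time. As a rough numeric complement, sample skewness and excess kurtosis can be computed per column; values far from 0 point away from a normal distribution. The sketch below uses synthetic columns and a hypothetical helper `shape_stats`, not the competition data.

```python
import numpy as np

def shape_stats(values):
    """Return (skewness, excess kurtosis) of a 1-D sample; both are near 0 for normal data."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    std = centered.std()
    skewness = np.mean(centered**3) / std**3
    excess_kurtosis = np.mean(centered**4) / std**4 - 3.0
    return float(skewness), float(excess_kurtosis)

rng = np.random.default_rng(0)
# Synthetic stand-ins for feature columns (assumed data, for illustration only)
gaussian_like = rng.normal(size=5000)
skewed = rng.exponential(size=5000)

for name, column in [("gaussian_like", gaussian_like), ("skewed", skewed)]:
    skewness, excess_kurtosis = shape_stats(column)
    print(f"{name}: skew={skewness:.2f}, excess kurtosis={excess_kurtosis:.2f}")
```

Applied to the training frame, the same helper could be mapped over every feature column to see how many depart from normality, rather than relying on a single plot.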

## PCA 

PCA works best on standardized input, so the pipeline first applies `StandardScaler`. PCA then computes the covariance matrix of the features and projects the data onto the directions of largest variance.

Don’t choose the number of components manually. Instead, pass the fraction of input variance that the components must explain (here `n_components=0.99`), and PCA keeps as many components as that requires.
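As a small illustration of selecting components by explained variance, the sketch below runs `StandardScaler` followed by `PCA(n_components=0.99)` on synthetic data. The data, the variable names, and the choice of 3 latent signals are assumptions made for the example only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 3 latent signals spread over 20 correlated features, plus small noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 20))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(500, 20))

# A float n_components in (0, 1) asks PCA to keep enough components to explain that share of variance
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.99))
X_reduced = reducer.fit_transform(X_demo)

pca = reducer.named_steps["pca"]
print("components kept:", pca.n_components_)
print("variance explained:", pca.explained_variance_ratio_.sum())
```

Because the 20 features here are driven by only 3 signals, PCA keeps a handful of components instead of all 20. On the real data the number kept depends on how correlated the 286 histogram columns are.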

In [None]:
X = train_data.iloc[: , :-1].values
y = train_data.iloc[: ,-1: ] # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

model = make_pipeline(StandardScaler(),PCA(n_components=0.99), ExtraTreesClassifier(class_weight='balanced', n_estimators=1111, random_state=21)).fit(X_train,  y_train.values.ravel())

In [None]:
y_pred = model.predict(X_test)



print("Accuracy: ", accuracy_score(y_test, y_pred))

In [None]:
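# the pipeline was fitted on a NumPy array, so pass the test features as an array too
preds = model.predict(test_data.values)

# build the submission frame first: to_csv returns None when given a path
submission = pd.DataFrame({"row_id": row_id, "target": preds}, index=test_data.index)
submission.to_csv("submission.csv", index=False)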