# Simple Visualization and SVC

**Please UPVOTE if you find this useful and I appreciate your comments!**

I thought I would try 2 things:

1. Plot the 10-mer histogram distributions for each bacteria class
2. Try Support Vector Machines to distinguish between the different classes, with each 10-mer histogram bin becoming a dimension in the vector space

Learnings:
1. There seem to be visually distinguishable differences in the distributions, except for a couple of Streptococcus strains
2. At first the sklearn SVM implementation seemed to struggle with the number of dimensions, so I tried zeroing out small "vector" components. However, this actually seemed to worsen both classification performance and training time, and the small-valued 10-mers seem to have good predictive power, so I gave up on that idea
3. I first used LinearSVC but later switched to SVC with a radial basis function kernel because of better results and convergence time
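
For reference, the "zero out small components" idea from item 2 can be sketched as below. This is only an illustration on made-up numbers: the helper name `zero_small_components` and the threshold value are assumptions, not the code that was actually tried.

```python
import numpy as np

def zero_small_components(X, threshold):
    """Return a copy of X where entries with magnitude below `threshold` are set to 0."""
    X = np.asarray(X, dtype=float)
    return np.where(np.abs(X) < threshold, 0.0, X)

# Toy rows standing in for 10-mer histogram values (not competition data).
X_demo = np.array([[1e-6, 0.02, 0.5],
                   [0.03, 1e-7, 0.0]])

# Hypothetical cutoff; any real choice would need tuning.
X_sparse = zero_small_components(X_demo, threshold=1e-3)
print(X_sparse)
```

Since the small-valued 10-mers appear to carry signal, a cutoff like this throws information away, which is consistent with the drop in performance noted above.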

*Note: this is a work in progress!*

# Imports

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV 
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn import svm

# Parameters

In [None]:
# Iteration cap; defined here but not passed to any estimator below.
MAX_ITER = 1000

# Load Data

In [None]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/test.csv')

In [None]:
train.set_index('row_id', inplace=True)
test.set_index('row_id', inplace=True)

# Show histogram distributions

In [None]:
grps = train.groupby('target')

In [None]:
# for name, grp in grps:
#   stacked = grp.melt(id_vars=['target'])
#   plt.figure(figsize=(12,6))
#   sns.histplot(stacked,
#                x='variable',
#                y='value')
#   plt.xticks(rotation=90)
#   plt.title(name)
#   plt.show()
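
A lighter alternative to plotting every row is to average each feature per class first and then reshape to long format. The sketch below uses a tiny made-up frame; the column names `kmer_a` and `kmer_b` and the class labels are placeholders, not the real dataset columns.

```python
import pandas as pd

# Toy frame with two hypothetical feature columns and two classes.
toy = pd.DataFrame({
    'kmer_a': [0.1, 0.3, 0.2, 0.4],
    'kmer_b': [1.0, 3.0, 2.0, 4.0],
    'target': ['x', 'x', 'y', 'y'],
})

# Mean value of every feature within each class.
class_means = toy.groupby('target').mean()

# Long format: one row per (class, feature) pair, convenient for seaborn.
long_means = class_means.reset_index().melt(id_vars=['target'])
print(long_means)
```

The resulting `long_means` frame has `target`, `variable` and `value` columns, so it could be passed to a seaborn bar or line plot with `x='variable'`, `y='value'` and `hue='target'`.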

# Support Vector Machine Model

## Train / Test Split

In [None]:
X = train.drop(['target'],
               axis = 1)
y = train['target']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

## Standardization

In [None]:
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
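
The scaler is fit on the training split only, so no statistics leak from the held-out rows. If the model is tuned with cross-validation, one way to keep the same discipline inside every fold is to wrap the scaler and the classifier in a pipeline. The following is a minimal sketch on synthetic data from `make_classification`, not the competition data, and the grid values are just examples.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the histogram features.
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

# The scaler is re-fit on the training part of each CV fold.
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

# Step parameters are addressed as <step name>__<parameter>.
search = GridSearchCV(pipe, {'svc__C': [0.1, 10]}, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)
```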

## Encode Labels

In [None]:
enc = LabelEncoder()
y_train = enc.fit_transform(y_train)
# Reuse the mapping learned on the training labels; do not refit on the test labels.
y_test = enc.transform(y_test)
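
A small illustration of why the encoder should be fit once and then reused: refitting on a subset of labels can silently change the integer codes. The label strings here are made up.

```python
from sklearn.preprocessing import LabelEncoder

enc_demo = LabelEncoder()

# Fit on the "training" labels; classes are sorted, so a->0, b->1, c->2.
train_codes = enc_demo.fit_transform(['b', 'a', 'c', 'a'])

# Reusing the fitted encoder keeps the mapping consistent.
test_codes = enc_demo.transform(['c', 'a'])

# Refitting on labels that lack 'b' shifts 'c' from 2 to 1.
refit_codes = LabelEncoder().fit_transform(['c', 'a'])

print(train_codes, test_codes, refit_codes)
print(enc_demo.inverse_transform(test_codes))
```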

## Train

In [None]:
tuned_parameters= [
  {'C': [0.1, 10], 
   #'gamma': [0.1, 1, 10], 
   'kernel': ['rbf']}
 ]

In [None]:
scores = ["precision", "recall"]

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(svm.SVC(), 
                       tuned_parameters, 
                       scoring="%s_macro" % score, 
                       verbose=True,
                       n_jobs=-1)
    # Fit the grid search; the classification report is printed once, further down.
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_["mean_test_score"]
    stds = clf.cv_results_["std_test_score"]
    for mean, std, params in zip(means, stds, clf.cv_results_["params"]):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()
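
The loop above runs a separate grid search per metric, and whichever search ran last is the one left in `clf`. As a possible alternative, `GridSearchCV` accepts several scorers at once, with `refit` naming the metric used to pick the final model. This sketch runs on synthetic data, and the choice of `recall_macro` for `refit` is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training features and encoded labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

metrics = ['precision_macro', 'recall_macro']
multi_search = GridSearchCV(SVC(kernel='rbf'),
                            {'C': [0.1, 10]},
                            scoring=metrics,
                            refit='recall_macro',  # metric that selects best_estimator_
                            cv=3)
multi_search.fit(X_demo, y_demo)

# One mean score per parameter setting, for each metric.
for name in metrics:
    print(name, multi_search.cv_results_['mean_test_' + name])
print(multi_search.best_params_)
```

This fits each parameter setting once per fold instead of once per metric, which matters when a single SVC fit is slow.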

# Submit Predictions

In [None]:
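# Scale the competition test set with the scaler fit on the training split.
# A separate name avoids overwriting the held-out X_test from the split above.
X_submit = scaler.transform(test)

In [None]:
# clf is the last search from the loop above (the one tuned for recall).
y_predict = clf.predict(X_submit)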

In [None]:
submission = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/sample_submission.csv')
submission['row_id'] = test.index
submission['target'] = enc.inverse_transform(y_predict)
submission.head()

In [None]:
submission.to_csv('submission.csv', index=False)
print("Submission saved.")

# UNDER CONSTRUCTION...