<a href="https://colab.research.google.com/github/w4bo/teaching-handsondatapipelines/blob/main/materials/17-Ecoli.solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The `Ecoli` challenge

### Goal

Your task is to predict the `SITE` label.

### TODO

You are allowed to use the `numpy`, `pandas`, `matplotlib`, `seaborn`, and `scikit-learn` Python libraries. You can import any model from `scikit-learn`.

Complete the following steps, and remember to write your insights on the dataset in the cell below.

1. Feature pre-processing (e.g., remove useless features, impute missing values, encode some features)
2. Verify the distribution of `SITE` values
3. Check pairwise correlations among features
4. Split the data into training and test sets; the test set should contain 30% of the data.
5. Plot the training dataset in 2D. Are the classes separated?
6. Train at least two ML classification models; submissions are evaluated using the accuracy score.
7. Perform hyperparameter optimization for at least one model

In [None]:
# Briefly explain *HERE* the overall steps of your solution (e.g., what did you do and why).
# Briefly write the extracted outcome/insights of each of the previous points here.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import random
from scipy.stats import randint
import seaborn as sns
from sklearn import metrics

# SEED all random generators
seed = 42
random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
np.random.seed(seed)

df = pd.read_csv("https://raw.githubusercontent.com/w4bo/handsOnDataPipelines/main/materials/datasets/ecoli.csv")

In [None]:
df

In [None]:
# SEQUENCE_NAME identifies each record rather than describing it, so drop it
df = df.drop(columns=['SEQUENCE_NAME'])
df.head()

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
sns.countplot(data=df, x='SITE', hue='SITE')
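
The count plot shows the class balance visually; a numeric summary makes any imbalance explicit. The following is a minimal sketch, assuming a label column such as `SITE`. The helper name `class_distribution` and the toy labels are hypothetical and are not taken from the dataset.

```python
import pandas as pd


def class_distribution(labels: pd.Series) -> pd.DataFrame:
    """Return the absolute count and relative share of each class, most frequent first."""
    counts = labels.value_counts()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Toy labels standing in for df["SITE"] (hypothetical values)
toy_labels = pd.Series(["cp", "cp", "cp", "im", "im", "pp"], name="SITE")
summary = class_distribution(toy_labels)
print(summary)
```

On the real data, `class_distribution(df["SITE"])` would show whether some classes have only a handful of rows, which matters when splitting and cross-validating.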

In [None]:
g = sns.pairplot(df, hue='SITE')
plt.show()

In [None]:
# Encode each SITE class as an integer, in order of first appearance
scale_mapper = {x: idx for idx, x in enumerate(df["SITE"].unique())}
df["SITE"] = df["SITE"].map(scale_mapper)
df["SITE"]

Check pairwise correlations among variables
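
One possible way to carry out this check is to compute the Pearson correlation matrix of the feature columns and list the most strongly correlated pairs. This is a sketch under assumptions: the helper name `top_correlations` and the synthetic columns `a`, `b`, `c` are hypothetical stand-ins for the real features.

```python
import numpy as np
import pandas as pd


def top_correlations(features: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = features.corr(numeric_only=True)
    # Keep only the strict upper triangle so that each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Synthetic stand-in for the feature columns: "b" is almost a multiple of "a"
rng = np.random.default_rng(42)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
top = top_correlations(toy)
print(top)
```

With the real data, `top_correlations(df.drop(columns=["SITE"]))` would list candidate redundant features, and `sns.heatmap(df.drop(columns=["SITE"]).corr(), annot=True)` would draw the full matrix. The next cell then splits the data, holding out 30% for testing.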

In [None]:
from sklearn.model_selection import train_test_split # to split the data into two parts
X = df.drop(columns=["SITE"])
y = df["SITE"]
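# Hold out 30% of the rows for testing; a fixed random_state keeps the split stable across re-runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=seed)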
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

Plot the training data in 2D

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
result = pca.fit_transform(X_train)

plt.scatter(
    x=result[:,0],
    y=result[:,1],
    c=y_train,
    cmap='viridis'
)
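
PCA is sensitive to feature scale, and a 2D scatter is only informative if the two components retain a reasonable share of the variance. The sketch below standardizes the features first and reports the explained variance ratio. The helper name `project_2d` and the synthetic matrix are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def project_2d(X):
    """Standardize X, project it onto two principal components,
    and return the 2D coordinates with the explained variance ratio."""
    pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
    coords = pipe.fit_transform(X)
    return coords, pipe.named_steps["pca"].explained_variance_ratio_


# Synthetic data with very different column scales (stand-in for X_train)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 5)) * np.array([1, 10, 100, 1, 1])
coords, ratio = project_2d(X_toy)
print(coords.shape, ratio, ratio.sum())
```

If the two ratios sum to a small fraction, overlapping classes in the scatter plot may simply reflect the information lost in the projection rather than classes that are inseparable in the full feature space.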

In [None]:
from sklearn.manifold import TSNE

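# t-SNE is slow on large datasets, so embed at most the first 2000 training rows
n_rows = 2000
tsne = TSNE(n_components=2, random_state=seed)
result = tsne.fit_transform(X_train.head(n_rows))

plt.scatter(
    x=result[:,0],
    y=result[:,1],
    c=y_train.head(n_rows),  # keep the labels aligned with the embedded rows
    cmap='viridis'
)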

In [None]:
from sklearn.neighbors import KNeighborsClassifier
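def fit_knn(X_train, y_train, X_test, y_test):
    """Fit k-NN for k = 1..29, plot test accuracy per k, and return the predictions of the best k."""
    k_range = list(range(1, 30))
    scores = []
    predictions = []
    for k in k_range:
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        y_pred = knn.predict(X_test)
        predictions.append(y_pred)
        scores.append(metrics.accuracy_score(y_test, y_pred))
    best = int(np.argmax(scores))  # index of the first k with the highest accuracy
    print(f"best k = {k_range[best]}, accuracy = {scores[best]}")
    plt.plot(k_range, scores)
    plt.xticks(k_range)
    plt.xlabel('Value of k for KNN')
    plt.ylabel('Accuracy Score')
    plt.title('k-Nearest-Neighbors')
    plt.show()
    return predictions[best]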

fit_knn(X_train, y_train, X_test, y_test)
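
Choosing `k` by accuracy on the test set tends to give an optimistic estimate, because the test data then influences model selection. A common alternative is to select `k` with cross-validation on the training split only and to scale the features inside a pipeline. The sketch below illustrates this on synthetic data from `make_classification`; the toy dataset and the variable names are assumptions, not part of the original solution.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic multi-class data standing in for the Ecoli features and labels
X_toy, y_toy = make_classification(
    n_samples=300, n_features=7, n_informative=5, n_redundant=0,
    n_classes=3, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
search = GridSearchCV(
    pipe, {"knn__n_neighbors": list(range(1, 30))}, cv=5, scoring="accuracy"
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["knn__n_neighbors"]
test_accuracy = search.score(X_te, y_te)  # the test set is used once, after k is chosen
print(best_k, search.best_score_, test_accuracy)
```

This also satisfies the hyperparameter optimization step for k-NN, since `GridSearchCV` tries every value of `k` in the range.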

In [None]:
from sklearn.ensemble import RandomForestClassifier # import the model

def run_forest(n_estimators, max_features):
    # initialize the model (i.e., the estimator)
    forest = RandomForestClassifier(n_estimators=n_estimators, max_features=max_features, random_state=42)
    forest.fit(X_train, y_train) # train it
    y_pred = forest.predict(X_test) # predict the SITE label for the test set
    print(metrics.accuracy_score(y_test, y_pred))
    return y_pred
# "sqrt" is the classifier's default; the former "auto" alias was removed in scikit-learn 1.3
y_pred = run_forest(100, "sqrt")

In [None]:
from sklearn.metrics import accuracy_score, classification_report
print("Accuracy : {}%".format(accuracy_score(y_test, y_pred) * 100))
print("Classification Report: \n",classification_report(y_test, y_pred))

In [None]:
from sklearn.model_selection import RandomizedSearchCV # for hyperparameter tuning
model = RandomForestClassifier(random_state=seed)

param_grid = {
    'n_estimators': [10, 50, 100],
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 6, 9, None],
    'min_samples_split': range(2, 11),
    'bootstrap': [True, False],
}

# Helper that runs a randomized search with 5-fold cross-validation
def randomized_search_cv(model, param_grid, X_train, y_train):
    clf = RandomizedSearchCV(model, param_grid, cv=5, scoring="accuracy", n_jobs=2, random_state=seed)
    clf.fit(X_train, y_train)
    print("The best estimator is " + str(clf.best_estimator_))
    print("The best score is " + str(clf.best_score_))

# Tune on the training split only, so that the test set stays unseen
randomized_search_cv(model, param_grid, X_train, y_train)
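
The cross-validated score printed by the search is not the final result: since submissions are evaluated on accuracy, the tuned model should also be scored once on the held-out test set. The sketch below shows one way to do this, sampling integer hyperparameters with `scipy.stats.randint`. The synthetic dataset and the chosen ranges are assumptions for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic multi-class data standing in for the Ecoli features and labels
X_toy, y_toy = make_classification(
    n_samples=300, n_features=7, n_informative=5, n_redundant=0,
    n_classes=3, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

# Distributions are sampled; lists are drawn from uniformly
param_distributions = {
    "n_estimators": randint(10, 101),      # integers in [10, 100]
    "max_depth": [3, 6, 9, None],
    "min_samples_split": randint(2, 11),   # integers in [2, 10]
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=10, cv=5, scoring="accuracy", random_state=42,
)
search.fit(X_tr, y_tr)  # tuning sees the training split only

# The best estimator is refitted on the whole training split by default
y_pred_tuned = search.predict(X_te)
test_accuracy = accuracy_score(y_te, y_pred_tuned)
print(search.best_params_, search.best_score_, test_accuracy)
```

Comparing `search.best_score_` with `test_accuracy` gives a rough sense of how well the cross-validated estimate carries over to unseen data.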