### Baseline Iris dataset random classifier & Logistic classifier

This section establishes a baseline for classification performance on the unmodified Iris dataset. Later, we will randomly remove values from this dataset at varying percentages and apply several imputation strategies to compare their effect on performance.

In [1]:
from sklearn.experimental import enable_iterative_imputer

ModuleNotFoundError: No module named 'sklearn.experimental'

In [None]:
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd

In [None]:
# import the iris dataset
iris = datasets.load_iris()
iris

### Loading that raw data into a `raw_df`

In [None]:
# One column per feature, plus the integer class label
raw_df = pd.DataFrame(iris["data"], columns=iris["feature_names"])
raw_df["class"] = iris["target"]
raw_df

### Baseline feature scaling

In [None]:
import random
X_raw, y_raw = raw_df.iloc[:,0:-1], raw_df.iloc[:,-1]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=random.randint(1,101)
)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train) # Scaler was fit on the training data only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

### Dummy & Logistic Regression $F_1$ scores with `100.0%` of data

In [None]:
from sklearn.dummy import DummyClassifier
random_baseline = DummyClassifier()
random_baseline.fit(X_train, y_train)
y_test_hat = random_baseline.predict(X_test)
from sklearn.metrics import f1_score
scores = pd.DataFrame(data=[f1_score(y_test, y_test_hat, average='weighted')],
                      index=[100.0],
                      columns=["dummy"])
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(solver="lbfgs", multi_class="auto")
clf.fit(X_train, y_train)
y_test_hat = clf.predict(X_test)
scores['logisticreg'] = pd.Series([f1_score(y_test, y_test_hat, average='weighted')],
                                  index=[100.0])
scores

As we continue through the notebook, we will fill this dataframe with the results for varying levels of random data loss. The index is the percentage of data retained (`80.0%` indicates that we sample uniformly at random over all elements of `X`, keeping `80%` of the values and discarding `20%`).
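To make the keep-percentage convention concrete, here is a minimal sketch. It assumes a keep fraction of `0.8` and uses a hypothetical helper, `keep_fraction_mask`, which is not part of this notebook. It draws one uniform number per cell and blanks every cell whose draw falls at or above the keep fraction, so roughly 20% of the values become `NaN`.

```python
import numpy as np

def keep_fraction_mask(X, keep, rng):
    """Return a copy of X in which each entry is kept with probability `keep`.

    Hypothetical helper, for illustration only.
    """
    X = np.asarray(X, dtype=float)
    draws = rng.random(X.shape)          # one uniform draw in [0, 1) per cell
    return np.where(draws < keep, X, np.nan)

rng = np.random.default_rng(0)           # fixed seed so the sketch is repeatable
X_demo = np.ones((1000, 4))              # toy data standing in for the Iris features
X_lossy = keep_fraction_mask(X_demo, keep=0.8, rng=rng)
kept = 1.0 - np.isnan(X_lossy).mean()    # observed fraction of surviving values
print(f"kept {kept:.1%} of values")
```

Because each cell is decided independently, the observed fraction only approximates the target, and the gap narrows as the array grows.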

### Dummy & Logistic Regression $F_1$ scores with `99.0%` of data

In [None]:
import itertools
import numpy as np
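def drop_percentage(df, p):
    """Return a copy of `df` in which each entry is kept with probability `p`."""
    new_df = df.copy()
    for i, j in itertools.product(*map(range, df.shape)):
        # Discard the value with probability 1 - p, so that `p` is the fraction kept
        if np.random.rand() >= p: new_df[i,j] = np.nan
    return new_df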
def percentages(): return np.arange(0.75,1,0.01)
X_train_percentage = {
    percentage:{"dirty":drop_percentage(X_train, percentage)} for percentage in percentages()
}
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer
for percentage in percentages():
    dirty = X_train_percentage[percentage]["dirty"]
    # Fit each imputer on the lossy training data, then fill in its missing values
    imp_mean = SimpleImputer().fit(dirty)
    imp_iter = IterativeImputer().fit(dirty)
    X_train_percentage[percentage]["clean"] = {
        "mean":imp_mean.transform(dirty),
        "iter":imp_iter.transform(dirty)
    }