# Tabular Playground Series February 2022

Credit: This EDA takes inspiration from [AmbrosM](https://www.kaggle.com/ambrosm/tpsfeb22-01-eda-which-makes-sense); his notebooks are amazing!

I am still learning! If there are any mistakes, or you have any tips for me, please let me know!

## Project Task

**Task:**

For the February 2022 Tabular Playground Series competition, your task is to classify 10 different bacteria species using data from a genomic analysis technique that has some data compression and data loss. In this technique, 10-mer snippets of DNA are sampled and analyzed to give the histogram of base counts (snippets of length 10 are analyzed using Raman spectroscopy that calculates the histogram of bases in the snippet). In other words, the DNA segment $ATATGGCCTT$ becomes $A_{2}T_{4}G_{2}C_{2}$. Can you use this lossy information to accurately predict bacteria species?
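
To make the lossy encoding concrete, here is a minimal sketch of how a snippet collapses into its base-count label. The helper name `base_count_label` is hypothetical and not part of the competition code; the A, T, G, C ordering is assumed from the example above.

```python
from collections import Counter

def base_count_label(snippet: str) -> str:
    """Collapse a DNA snippet into its base-count label, e.g. 'A2T4G2C2'."""
    counts = Counter(snippet)
    # Fixed A, T, G, C order, as in the example label above.
    return "".join(f"{base}{counts.get(base, 0)}" for base in "ATGC")

print(base_count_label("ATATGGCCTT"))  # A2T4G2C2
print(base_count_label("TTATGCGCTA"))  # A2T4G2C2 (different sequence, same label)
```

Two different sequences map to the same label, which is exactly the information loss the task description refers to.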

**Evaluation Metric:**

Accuracy = $\frac{\text{Correct predictions}}{\text{Total predictions}}$
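
As a quick illustration of the metric, the sketch below computes plain accuracy on a handful of made-up labels. The function name and the `species_*` labels are placeholders, not taken from the dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = ["species_a", "species_b", "species_a", "species_c"]
y_pred = ["species_a", "species_a", "species_a", "species_c"]
print(accuracy(y_true, y_pred))  # 0.75
```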

**Info**

Each row of data contains a spectrum of histograms generated by repeated measurements of a sample, each row containing the output of all 286 histogram possibilities (e.g. $A_{0}T_{0}G_{0}C_{10}$ to $A_{10}T_{0}G_{0}C_{0}$), which then has a bias spectrum (of totally random ATGC) subtracted from the results.
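
The figure of 286 can be checked by enumerating every way of splitting 10 bases across A, T, G and C. This is only a sketch; the generated names are assumed to follow the `A..T..G..C..` pattern shown above.

```python
from itertools import product

K = 10  # snippet length

# Every (A, T, G, C) count tuple whose entries sum to K is one histogram bin.
compositions = [
    (a, t, g, K - a - t - g)
    for a, t, g in product(range(K + 1), repeat=3)
    if a + t + g <= K
]
column_names = [f"A{a}T{t}G{g}C{c}" for a, t, g, c in compositions]

print(len(column_names))                    # 286
print(column_names[0], column_names[-1])    # A0T0G0C10 A10T0G0C0
```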

## Preliminaries

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

sns.set_style('darkgrid')

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/train.csv')
test_df = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/test.csv')

In [None]:
def reduce_memory_usage(df, verbose=True):
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df

## Data Description

In [None]:
print("Train Dimensions:", train_df.shape)
print("Test Dimensions:", test_df.shape)

In [None]:
train_df.info()

In [None]:
train_df.head()

In [None]:
train_df.describe()

### Missing values

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    There are no missing values in the dataset

    
</div>

In [None]:
print("Number of missing values in train set: ", train_df.isna().sum().sum())
print("Number of missing values in test set: ", test_df.isna().sum().sum())

### Classes

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    The classes are balanced

    
</div>

In [None]:
targets = train_df['target']
bacteria_counts = targets.value_counts()
# rename_axis + reset_index(name=...) gives the same column names regardless of pandas version
bacteria_counts = bacteria_counts.rename_axis("BacteriaSpecies").reset_index(name="Count")
bacteria_counts.set_index("BacteriaSpecies")

In [None]:
f, ax = plt.subplots(figsize=(15, 6))
sns.barplot(x = "BacteriaSpecies", y="Count", data = bacteria_counts)
plt.xticks(rotation=30);

## EDA

In the original paper, the authors apply three important steps to the data that we should consider:

1. They introduce random errors to simulate gene mutations and experimental errors. Some rows receive more errors than others; the fraction of bases with an introduced error is described by m.
2. They subtract a bias: the spectrum from a random sequence (the bias) is subtracted from the simulated experimental spectrum.
3. When taking their measurements they use different numbers of reads (which can be thought of as the number of measurements/scans made). The more reads made, the more precise the resulting spectrum (row) and the better it matches the bacterium.

Quotes from the paper:

> To test the robustness of identification of species or genes by deviation spectra in the presence of real experimental noise and random mutations in the DNA sequence, random errors are introduced into the FBC spectra (see section “Simulating Gene Mutations and Experimental Errors” for details) and then the resulting noisy data are divided by r and the bias spectrum is subtracted to obtain a simulated FBC deviation spectrum (with noise) for the given plasmid or genome. 

> For both the gDNA and pDNA, the FBC deviation spectra were created with the following parameters: k = 10; r = [10^2; 10^3; 10^4; 10^5; 10^6]; m = [0, 0.01, 0.05, 0.1, 0.25, 0.33, 0.5, 0.75, 0.9, 1]; and s = 1000; where k is the size of the k-mer, r is the number of pyramid tips for generating the sample FBC spectra, m is the fractional error rate, and s is the number of FBC deviation spectra created per DNA sequence (genome or plasmid).

> We define an error rate m (where 0 ≤ m ≤ 1) to be the fraction of bases in the reference sequence that are expected to contain an error. 

> The robustness of each classification model was studied by measuring the predictive accuracy as a function of the parameters of the simulated BOC data. For each optical sequencing read number (r) and each error rate (m)


In the paper they perform:

$\text{Data} = \frac{\text{originalData}}{r} - \text{bias}$

where Data is the data that we currently have. To recover the original data:

$\text{originalData} = (\text{Data} + \text{bias}) \times r$

where r could be 100, 1,000, 10,000, 100,000 or 1,000,000.
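
A small worked example of this round trip, assuming the bias of a bin is the multinomial probability of that composition in a uniformly random 10-mer. The count of 37 and the value of r are made up for illustration.

```python
import math

def multinomial_bias(a, t, g, c):
    """Probability of observing counts (a, t, g, c) in a uniformly random k-mer."""
    k = a + t + g + c
    ways = math.factorial(k) // (
        math.factorial(a) * math.factorial(t) * math.factorial(g) * math.factorial(c)
    )
    return ways / 4 ** k

r = 1000                 # assumed number of reads for this toy row
original_count = 37      # hypothetical raw count in the A2T4G2C2 bin
b = multinomial_bias(2, 4, 2, 2)

data = original_count / r - b        # forward transform, as in the paper
recovered = round((data + b) * r)    # inverse transform
print(b, data, recovered)
```

Rounding after the inverse transform absorbs floating-point noise, so the integer count comes back exactly.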

Obtaining the original data:

In [None]:
train_cols = list(train_df.columns.drop(['row_id','target']))

In [None]:
import math

def bias(w,x,y,z):
    # math.factorial is used because np.math is not available in newer NumPy releases
    b = 1/4**10 * (math.factorial(10)/(math.factorial(w)*math.factorial(x)*math.factorial(y)*math.factorial(z)))
    return b

In [None]:
def calc_bias(s):
    w = int(s[1:s.index('T')])
    x = int(s[s.index('T')+1:s.index('G')])
    y = int(s[s.index('G')+1:s.index('C')])
    z = int(s[s.index('C')+1:])
    b = bias(w,x,y,z)
    return b

We can tell which r value was used for each row by multiplying the values by the largest possible r (10^6) and taking the greatest common divisor (GCD) of the row: a GCD of 1 means r = 10^6, a GCD of 10 means r = 10^5, and so on.
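
The idea can be sketched on a toy row. The counts and the helper `infer_reads` are hypothetical; the sketch also assumes the raw counts themselves share no common factor, otherwise the inferred r would be too small.

```python
import numpy as np

def infer_reads(scaled_row, max_reads=10**6):
    """Guess r from a row that was rescaled to integer counts at max_reads."""
    gcd = np.gcd.reduce(scaled_row)
    return max_reads // gcd, gcd

raw_counts = np.array([37, 41, 22, 0])               # hypothetical histogram from r = 100 reads
scaled = raw_counts * (10**6 // raw_counts.sum())    # equivalent to (Data + bias) * 10^6

r, gcd = infer_reads(scaled)
print(gcd, r)  # 10000 100
```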

In [None]:
train_df_bias = pd.DataFrame({col: (((train_df[col] + calc_bias(col))*10**6).round().astype(int)) for col in train_cols})
test_df_bias = pd.DataFrame({col: (((test_df[col] + calc_bias(col))*10**6).round().astype(int)) for col in train_cols})
train_df_bias.head()

### Calculating the r value used

In [None]:
def gcd_of_all(df_i):
    gcd = df_i[train_cols[0]]
    for col in train_cols[1:]:
        gcd = np.gcd(gcd, df_i[col])
    return gcd

train_df_bias['gcd'] = gcd_of_all(train_df_bias)
train_df['gcd'] = gcd_of_all(train_df_bias)
test_df['gcd'] = gcd_of_all(test_df_bias)

In [None]:
# Build the count table by index so it does not depend on pandas' value_counts column naming
gcd_counts = train_df["gcd"].value_counts().sort_index().rename_axis("gcd").to_frame("train_count")
gcd_counts["test_count"] = test_df["gcd"].value_counts()
gcd_counts["train_perc"] = 100 * gcd_counts["train_count"]/gcd_counts["train_count"].sum()
gcd_counts["test_perc"] = 100 * gcd_counts["test_count"]/gcd_counts["test_count"].sum()
gcd_counts

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
The GCD counts are relatively balanced, although there may be a slightly higher percentage of GCD 1 in the test set.


The greatest common divisors are 1, 10, 1000 and 10000, corresponding to r (read) values of 1,000,000, 100,000, 1,000 and 100 respectively; r = 10,000 appears to be absent from this dataset.
</div>

In [None]:
#Copying the targets over
train_df_bias["target"] = train_df['target']

Let's visualise the data for different GCD values.

Train data:

In [None]:
plt.figure(figsize=(20, 20))

for i,gcd in enumerate([1,10,1000,10000]):
    pca = PCA(n_components = 2, random_state = 10, whiten = True)
    pca.fit(train_df_bias[train_df_bias["gcd"] == gcd][train_cols])

    X_PCA = pca.transform(train_df_bias[train_df_bias["gcd"] == gcd][train_cols])

    # Percentage of variance explained for each components
    print( "GCD: ", gcd, " explained variance ratio (first two components): ", pca.explained_variance_ratio_)

    PCA_df = pd.DataFrame({"PCA_1" : X_PCA[:,0], "PCA_2" : X_PCA[:,1], "LABEL":train_df_bias[train_df_bias["gcd"] == gcd]["target"]})
    
    ax = plt.subplot(2, 2, i + 1)
    plt.title("GCD: " + str(gcd))
    sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "LABEL", hue_order = train_df_bias["target"].unique())

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    <ol>
        <li>Classification will be easier for lower GCD values.</li>
        <li>We can see that there are 8 clusters for each GCD value (only noticeable for GCD 1 and 10). These correspond to the different simulated error rates (m).</li>
    </ol>
</div>

Test data:

In [None]:
plt.figure(figsize=(20, 20))

for i,gcd in enumerate([1,10,1000,10000]):
    pca = PCA(n_components = 2, random_state = 10, whiten = True)
    pca.fit(test_df[test_df["gcd"] == gcd][train_cols])

    X_PCA = pca.transform(test_df[test_df["gcd"] == gcd][train_cols])

    # Percentage of variance explained for each components
    print( "GCD: ", gcd, " explained variance ratio (first two components): ", pca.explained_variance_ratio_)

    PCA_df = pd.DataFrame({"PCA_1" : X_PCA[:,0], "PCA_2" : X_PCA[:,1]})
    
    ax = plt.subplot(2, 2, i + 1)
    plt.title("GCD: " + str(gcd))
    sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2")

### Simulated Error Classes

Let's get a better view of the different simulated error (m) values and try to classify these clusters.

In [None]:
plt.figure(figsize=(20, 30))

for i,bacteria in enumerate(train_df_bias["target"].unique()):
    cluster_input_data = train_df_bias[(train_df_bias["gcd"] == 1) & (train_df_bias["target"] == bacteria)][train_cols]
    cluster_input_data = StandardScaler().fit_transform(cluster_input_data)

    clustering_gmm_m = GaussianMixture(n_components=8, covariance_type = 'diag', n_init=5, random_state=1).fit(cluster_input_data)
    clustering_gmm = clustering_gmm_m.predict(cluster_input_data)

    pca.fit(cluster_input_data)

    X_PCA = pca.transform(cluster_input_data)

    PCA_df = pd.DataFrame({"PCA_1" : X_PCA[:,0], "PCA_2" : X_PCA[:,1],"Cluster":clustering_gmm})    
    
    mvals = ["m00", "m05", "m10", "m25", "m33", "m50", "m75", "m90"]
    PCA_1 = PCA_df.groupby(by=["Cluster"])["PCA_1"].mean().reset_index().sort_values("PCA_1").reset_index(drop=True)
    PCA_df = PCA_df.replace({'Cluster': {PCA_1.iloc[n]["Cluster"]: mvals[n] for n in range(0,8)}})
    
    if i == 0:
        # rename_axis/to_frame avoids relying on pandas' value_counts column naming
        cluster_counts = PCA_df['Cluster'].value_counts().rename_axis("mval").to_frame(bacteria + " count").sort_index()
    else:
        cluster_counts[bacteria + " count"] = PCA_df['Cluster'].value_counts()
    
    ax = plt.subplot(5, 2, i + 1)
    plt.title("GMM - Bacteria: " + str(bacteria))
    sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue="Cluster", hue_order=mvals, palette = sns.color_palette("husl", 8))
cluster_counts

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;"><ol>
    <li>These different groups correspond to the simulated error added. Out of the possible m = [0, 0.01, 0.05, 0.1, 0.25, 0.33, 0.5, 0.75, 0.9, 1], 8 have been selected for the version of the data we are using. These are likely [m = 0 OR 0.01, 0.05, 0.1, 0.25, 0.33, 0.5, 0.75 and 0.9] based on the distances between clusters.</li>
    <li>We notice that the m=0 cluster has twice the number of points we expect. This is likely because there are actually 9 clusters, with m=0.01 and m=0 both currently represented by the m00 cluster; the difference between them is just too small to pick up.</li>
    <li>Some of the bacteria seem to have more distinct clusters than others.</li>
    <li>We can accurately use clustering to predict which m value was used to create the data (for GCD = 1 and if the bacterium is known).</li>
</ol>
</div>
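
The relabelling step, ordering clusters by their mean position along the first principal component and naming them from lowest to highest m, can be sketched in isolation. The cluster ids, coordinates and the helper `relabel_by_mean` are made up, and the sketch assumes, as the notebook does, that position along PCA_1 tracks m.

```python
import numpy as np

def relabel_by_mean(cluster_ids, coordinate, names):
    """Map arbitrary cluster ids to names, ordered by mean coordinate."""
    ids = np.unique(cluster_ids)
    means = np.array([coordinate[cluster_ids == i].mean() for i in ids])
    order = ids[np.argsort(means)]  # cluster ids from lowest to highest mean
    mapping = {int(cid): name for cid, name in zip(order, names)}
    return np.array([mapping[int(c)] for c in cluster_ids]), mapping

cluster_ids = np.array([2, 2, 0, 0, 1, 1])
pca_1 = np.array([-3.1, -2.9, 0.1, -0.1, 4.0, 4.2])
labels, mapping = relabel_by_mean(cluster_ids, pca_1, ["m00", "m05", "m10"])
print(mapping)  # {2: 'm00', 0: 'm05', 1: 'm10'}
print(labels)
```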

### Unique Values

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    <ol>
    <li>Although the values in the DNA segment columns are floating point and there are 200,000 rows in the train set, each column takes only a limited number of distinct values.</li>
    <li>The number of unique values varies significantly between DNA segment columns. This makes sense as there are more possible permutations of, e.g., $A_{3}T_{3}G_{2}C_{2}$ than of $A_{10}T_{0}G_{0}C_{0}$.</li>
    </ol>
</div>
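
To put numbers on the permutation argument, the sketch below counts how many distinct 10-mers share a given composition. The function name is hypothetical, and the balanced composition is chosen only as an example.

```python
import math

def n_orderings(a, t, g, c):
    """Number of distinct sequences that share the base counts (a, t, g, c)."""
    total = math.factorial(a + t + g + c)
    return total // (
        math.factorial(a) * math.factorial(t) * math.factorial(g) * math.factorial(c)
    )

print(n_orderings(10, 0, 0, 0))  # 1
print(n_orderings(3, 3, 2, 2))   # 25200
```

A bin reachable by only one sequence is hit rarely and so takes few distinct values, while a balanced bin is hit often and takes many.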

Counting the unique values in each column:

In [None]:
# Note: train_df now also contains the `gcd` column, so from here on train_cols includes it
train_cols = list(train_df.columns.drop(['row_id','target']))

In [None]:
def calculate_nuniques(df, cols):
    nuniques = []
    for i in cols:
        nuniques.append(df[i].nunique())
    return nuniques

train_uniques = calculate_nuniques(train_df, train_cols)
nunique_df = pd.DataFrame({'Cols' : train_cols, 'nunique': train_uniques})
nunique_df.T

In [None]:
test_uniques = calculate_nuniques(test_df, train_cols)
nunique_test_df = pd.DataFrame({'Cols' : train_cols, 'nunique': test_uniques})
nunique_test_df.T

In [None]:
f, ax = plt.subplots(figsize=(20, 7))
ax = sns.histplot(data = nunique_df, x = "nunique", bins=50)
ax.set_xlim(0);
ax.set_xlabel("Number of unique values")
ax.set_title("Visualising the number of unique values for different DNA segments (features)");

## Feature Creation and experimentation

The ultimate goal would be to determine the value of m just from patterns in the training columns, without also knowing the target bacterium. This is probably not possible. However, useful features can still be created.

If we could determine the value of m without knowledge of the bacterium, we would avoid confusing one bacterium at a given m value with a different bacterium at a different m value, which could otherwise have similar feature values.

Number of non-zero entries per row:

In [None]:
non_zero = train_df_bias.drop(columns=["gcd","target"]).astype(bool).sum(axis=1).values
non_zero_test = test_df_bias.astype(bool).sum(axis=1).values
train_df["non_zero_row"] = non_zero
test_df["non_zero_row"] = non_zero_test

f, ax = plt.subplots(figsize=(20, 7))
ax = sns.histplot(data = train_df, x = "non_zero_row", bins=200, hue = "gcd",  palette = sns.color_palette("husl", 4))
ax.set_xlabel("Number of non-zero values")
ax.set_title("Number of non-zero entries in each row of the training data");

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    <ol>
    <li>As we would expect, the lower the GCD (the more reads), the more distinct DNA observations are obtained and the more non-zero columns there are.</li>
    <li>The counts look roughly normally distributed, as we would expect, but they do seem to be a little off. This could be a result of the different bacteria, but it could also be a result of the different m values. Let's take a look.</li>
    </ol>
</div>
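
The link between reads and non-zero columns can be sketched analytically. The example below assumes a uniformly random sequence and independent reads, which is a simplification of real genomes, and computes the expected number of bins hit at least once for several read counts.

```python
import math
from itertools import product

K = 10
# Probability of each of the 286 bins for a uniformly random sequence.
probs = [
    math.factorial(K)
    / (math.factorial(a) * math.factorial(t) * math.factorial(g) * math.factorial(K - a - t - g))
    / 4 ** K
    for a, t, g in product(range(K + 1), repeat=3)
    if a + t + g <= K
]

def expected_non_zero(r):
    """Expected number of bins hit at least once after r independent reads."""
    return sum(1 - (1 - p) ** r for p in probs)

for r in (100, 1_000, 100_000, 1_000_000):
    print(r, round(expected_non_zero(r), 1))
```

The expected count rises with r and can never exceed 286, which matches the ordering of the GCD groups in the histogram.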

In [None]:
f, ax = plt.subplots(figsize=(20, 30))
for i, gcd in enumerate([1,10,1000,10000]):
    plt.subplot(4, 1, i+1)
    plt.title("GCD " + str(gcd))
    ax = sns.histplot(data = train_df[train_df["gcd"] == gcd], x = "non_zero_row", bins=30, hue = "target", hue_order = train_df["target"].unique(), palette = sns.color_palette("husl", 10))
    ax.set_xlabel("Number of non-zero values")

In [None]:
std_dev = train_df_bias.drop(columns=["gcd","target"]).std(axis=1)
std_dev_test = test_df_bias.std(axis=1)
train_df["std_row"] = std_dev.values
test_df["std_row"] = std_dev_test.values

f, ax = plt.subplots(figsize=(20, 7))
ax = sns.histplot(data = train_df, x = "std_row", bins=200, hue = "gcd",  palette = sns.color_palette("husl", 4))
ax.set_xlabel("Standard Deviation");

In [None]:
f, ax = plt.subplots(figsize=(20, 30))
for i, gcd in enumerate([1,10,1000,10000]):
    plt.subplot(4, 1, i+1)
    plt.title("GCD " + str(gcd))
    ax = sns.histplot(data = train_df[train_df["gcd"] == gcd], x = "std_row", bins=200, hue = "target", hue_order = train_df["target"].unique(), palette = sns.color_palette("husl", 10))
    ax.set_xlabel("Standard Deviation");

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    <ol>
    <li>The standard deviation of each row seems to be a very good predictor of the bacterium.</li>
    <li>We can see different peaks for the same bacterium (for low GCD/high reads). This is the effect of the different m values (see example below).</li>
    </ol>
</div>

In [None]:
f, ax = plt.subplots(figsize=(20, 7))
ax = sns.histplot(data = train_df[(train_df["gcd"] == 1) & (train_df["target"] == 'Campylobacter_jejuni')], x = "std_row", bins=200)

Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.

In [None]:
kurtosis = train_df_bias.drop(columns=["gcd","target"]).kurtosis(axis=1)
kurtosis_test = test_df_bias.kurtosis(axis=1)

train_df["kurtosis_row"] = kurtosis.values
test_df["kurtosis_row"] = kurtosis_test.values

f, ax = plt.subplots(figsize=(20, 7))
ax = sns.histplot(data = train_df, x = "kurtosis_row", bins=200, hue = "gcd",  palette = sns.color_palette("husl", 4))
ax.set_xlabel("Kurtosis");

In [None]:
f, ax = plt.subplots(figsize=(20, 30))
for i, gcd in enumerate([1,10,1000,10000]):
    plt.subplot(4, 1, i+1)
    plt.title("GCD " + str(gcd))
    ax = sns.histplot(data = train_df[train_df["gcd"] == gcd], x = "kurtosis_row", bins=200, hue = "target", hue_order = train_df["target"].unique(), palette = sns.color_palette("husl", 10))
    ax.set_xlabel("Kurtosis");

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
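
As a small illustration of the two row-wise shape statistics used here (kurtosis and skewness), the sketch below applies the same pandas calls to two made-up rows: one symmetric and one dominated by a single bin.

```python
import pandas as pd

rows = pd.DataFrame(
    [
        [1, 2, 3, 4, 5],      # symmetric row
        [0, 0, 0, 1, 20],     # one dominant bin: right-skewed with a heavy tail
    ],
    columns=list("abcde"),
)

features = pd.DataFrame(
    {
        "skew_row": rows.skew(axis=1),
        "kurtosis_row": rows.kurtosis(axis=1),
    }
)
print(features)
```

The symmetric row has zero skew, while the row with one dominant bin has positive skew and a larger kurtosis.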

In [None]:
skew =  train_df_bias.drop(columns=["gcd","target"]).skew(axis=1)
skew_test = test_df_bias.skew(axis=1)

train_df["skew_row"] = skew.values
test_df["skew_row"] = skew_test.values

f, ax = plt.subplots(figsize=(20, 7))
ax = sns.histplot(data = train_df, x = "skew_row", bins=200, hue = "gcd",  palette = sns.color_palette("husl", 4))
ax.set_xlabel("Skew");

In [None]:
f, ax = plt.subplots(figsize=(20, 30))
for i, gcd in enumerate([1,10,1000,10000]):
    plt.subplot(4, 1, i+1)
    plt.title("GCD " + str(gcd))
    ax = sns.histplot(data = train_df[train_df["gcd"] == gcd], x = "skew_row", bins=200, hue = "target", hue_order = train_df["target"].unique(), palette = sns.color_palette("husl", 10))
    ax.set_xlabel("Skew")

In [None]:
f, ax = plt.subplots(figsize=(20, 7))
ax = sns.histplot(data = train_df[(train_df["gcd"] == 1) & (train_df["target"] == 'Campylobacter_jejuni')], x = "skew_row", bins=200)
ax.set_xlabel("Skew");

In [None]:
nunique_x = train_df_bias.drop(columns=["gcd","target"]).nunique(axis=1)
nunique_x_test = test_df_bias.nunique(axis=1)

train_df["nunique"] = nunique_x.values
test_df["nunique"] = nunique_x_test.values

In [None]:
def set_quantiles(qlist):
    for q in qlist:
        quantile_x = train_df_bias.drop(columns=["gcd","target"]).quantile(q=q/100, axis=1)
        quantile_x_test = test_df_bias.quantile(q=q/100, axis=1)

        train_df["quantile_"+str(q)] = quantile_x.values
        test_df["quantile_"+str(q)] = quantile_x_test.values
set_quantiles([5,10,20,30,40,50,60,70,80,90, 92, 93, 94, 95])

In [None]:
f, ax = plt.subplots(figsize=(20, 30))
for i, gcd in enumerate([1,10,1000,10000]):
    plt.subplot(4, 1, i+1)
    plt.title("GCD " + str(gcd))
    ax = sns.histplot(data = train_df[train_df["gcd"] == gcd], x = "quantile_92", bins=200, hue = "target", hue_order = train_df["target"].unique(), palette = sns.color_palette("husl", 10))
    ax.set_xlabel("quantile_92")

## Comparing Train and Test

In [None]:
plt.figure(figsize=(20, 20))

for i,gcd in enumerate([1,10,1000,10000]):
    pca = PCA(n_components = 2, random_state = 10, whiten = True)
    pca.fit(train_df[train_df["gcd"] == gcd][train_cols])
    Xtr_PCA = pca.transform(train_df[train_df["gcd"] == gcd][train_cols])
    
    #pca.fit(test_df[test_df["gcd"] == gcd][train_cols])
    Xte_PCA = pca.transform(test_df[test_df["gcd"] == gcd][train_cols])

    PCA_train_df = pd.DataFrame({"PCA_1" : Xtr_PCA[:,0], "PCA_2" : Xtr_PCA[:,1], "Label":["Train"]*len(Xtr_PCA[:,0])})
    PCA_test_df = pd.DataFrame({"PCA_1" : Xte_PCA[:,0], "PCA_2" : Xte_PCA[:,1], "Label":["Test"]*len(Xte_PCA[:,0])})
    PCA_df = pd.concat([PCA_train_df, PCA_test_df])
    
    ax = plt.subplot(2, 2, i + 1)
    plt.title("GCD: " + str(gcd))
    sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "Label")

<div class="alert alert-block alert-warning"  style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    The train data and the test data do not seem to match perfectly. This is most obvious for GCD 1, but it can be seen for GCD 10 as well. Could the train and test data be from different distributions? However, this difference is only noticeable for some of the bacteria; others seem to match up relatively well.
</div>

## Duplicates

In [None]:
#FUNCTION adapted from https://www.kaggle.com/ambrosm/tpsfeb22-01-eda-which-makes-sense
def plot_duplicates_per_gcd(df, title):
    plt.figure(figsize=(16, 3))
    plt.tight_layout()
    for i, gcd in enumerate(np.unique(df.gcd)):
        plt.subplot(1, 5, i+1)
        duplicates = df[df.gcd == gcd][train_cols].duplicated().sum()
        non_duplicates = len(df[df.gcd == gcd]) - duplicates
        plt.pie([non_duplicates, duplicates],
                labels=['not duplicate', 'duplicate'],
                colors=['gray', 'r'],
                startangle=90)
        plt.title("GCD = " + str(gcd))
    
    plt.subplot(1, 5, 5)
    duplicates = df[train_cols].duplicated().sum()
    print("In total there are", duplicates, title, "\n")
    non_duplicates = len(df) - duplicates
    plt.pie([non_duplicates, duplicates],
                labels=['not duplicate', 'duplicate'],
                colors=['gray', 'r'],
                startangle=90)
    plt.title("All Data")
    
    plt.subplots_adjust(wspace=0.8)
    plt.suptitle(title)
    plt.show()
        
plot_duplicates_per_gcd(train_df, title="Duplicates in Training")
plot_duplicates_per_gcd(test_df, title="Duplicates in Test")

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;"> 
  <ol>
   <li>GCD values of 1000 and 10000 have significantly more duplicates.</li>
    <li>The test data has fewer duplicates.</li>
  </ol>
</div>

<div class="alert alert-block alert-warning"  style="font-size:14px; font-family:verdana; line-height: 1.7em;">This is odd:

1. We would expect more duplicates for fewer reads (higher GCD), since a more detailed read should give more possible outcomes. However, why do GCD 1 and GCD 10 have the same number of duplicates, and why do GCD 1000 and GCD 10000 have the same number of duplicates?
2. Why would the test data have a lower percentage of duplicates than the training data? This could actually make sense, as the training data has 200,000 rows and the test data only 100,000. Perhaps after 100,000 samples each additional sample is more likely to duplicate an existing entry.
   </div>

As seen above, there are a lot of duplicate rows in the training data. Let's remove them, but record how many copies there are of each row. We can then use these counts as sample weights when training the model, so that highly duplicated rows have a higher importance in training.

Why is this important?
- Speed up training times
- We don't want duplicates in the validation set when doing cross-fold validation as this will invalidate the CV results. 
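
A minimal sketch of turning duplicate rows into sample weights on a toy frame. The column names and values are made up; the point is that the counts must line up with the rows that `drop_duplicates()` keeps.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "f1": [0.2, 0.1, 0.2, 0.2, 0.1],
        "f2": [1.0, 3.0, 1.0, 1.0, 3.0],
        "target": ["b", "a", "b", "b", "a"],
    }
)
cols = df.columns.to_list()

# sort=False keeps groups in order of first appearance, which is the same
# order in which drop_duplicates() keeps the first copy of each row.
counts = df.groupby(cols, sort=False).size().values
deduped = df.drop_duplicates().copy()
deduped["duplicates"] = counts
print(deduped)
```

With the default `sort=True`, the groups would come back in sorted key order and the counts would be attached to the wrong rows in this example.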

In [None]:
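train_df.drop(columns=["row_id"], inplace=True)
n_rows_before = len(train_df)
# sort=False keeps groups in order of first appearance, so the counts line up
# with the rows that drop_duplicates() keeps (the first copy of each row)
a = train_df.groupby(by=train_df.columns.to_list(), sort=False).size().values # Use groupby to count duplicated rows
train_df = train_df.drop_duplicates().copy()
train_df["duplicates"] = a
print("Dropped:", n_rows_before - len(train_df), "rows")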

In [None]:
s1 = pd.merge(train_df, test_df, how='inner', on=train_cols)
print("There are", len(s1), "rows in the test data that are also in the training data")
print("Of these", len(s1), "test data rows,", s1["row_id"].nunique(), "are unique in the test data")
print("Of these", len(s1), "test data rows,", len(s1[s1["gcd"]==10000]), "have a gcd of 10000")
plt.pie([len(test_df)-len(s1), len(s1)],
                labels=['not duplicate', 'duplicate'],
                colors=['gray', 'r'],
                startangle=90)
plt.title("Test data duplicated in the train data")
plt.show()

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;"> <ol>
   <li>We can ensure that we classify all 486 test rows that are also in the training data correctly.</li>
     <li>All of these duplicated rows have a GCD of 10,000, which are the hardest for the model to predict.</li>
    </ol>
</div>
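
One way to use this overlap is to copy the known training label onto any test row whose features match exactly, and fall back to the model otherwise. The sketch below uses toy frames and placeholder predictions; it assumes identical feature rows never carry conflicting labels in the training data, which would need checking first.

```python
import pandas as pd

train = pd.DataFrame({"f1": [1, 2, 3], "f2": [0, 0, 1], "target": ["a", "b", "c"]})
test = pd.DataFrame({"row_id": [10, 11, 12], "f1": [2, 9, 3], "f2": [0, 9, 1]})
model_preds = pd.Series(["a", "a", "a"], index=test.index)  # stand-in predictions

feature_cols = ["f1", "f2"]
# One label per distinct feature row (assumes no conflicting labels).
lookup = train.drop_duplicates(subset=feature_cols)[feature_cols + ["target"]]
matched = test.merge(lookup, on=feature_cols, how="left")["target"]

# Keep the model prediction unless the row was seen in training.
final = matched.where(matched.notna(), model_preds)
print(final.tolist())  # ['b', 'a', 'c']
```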

<div class="alert alert-block alert-warning"  style="font-size:14px; font-family:verdana; line-height: 1.7em;">However, this is a little strange: we should expect far more duplicates, based on the number of duplicates within the train and test data. This could indicate that the test data comes from a different distribution OR that most of the duplicates have been manually removed.
    </div>

## Modelling

In [None]:
from lightgbm import LGBMClassifier
from lightgbm import plot_importance
from lightgbm import early_stopping
from lightgbm import log_evaluation

from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.ensemble import ExtraTreesClassifier

In [None]:
train_df = reduce_memory_usage(train_df)
test_df = reduce_memory_usage(test_df)

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
sample_weights = train_df["duplicates"]
X_train = train_df.drop(columns=[ 'target', 'duplicates']) # row_id should already be dropped - could drop GCD too
X_test = test_df.drop(columns=['row_id'])

le = LabelEncoder() 
le.fit(train_df['target'])
y_train = le.transform(train_df['target'])

In [None]:
N_SPLITS = 10

kf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=42)

### **ExtraTreesClassifier**

In [None]:
# Currently chosen randomly

et_params = {
    'n_estimators': 1000,
    'n_jobs': -1,
    'random_state': 1
}


In [None]:
%%time
pred_validation_all_et = []
validation_all = []
val_ids_all = []

y_pred_test_et = []
y_pred_test_prob_et = []

importances_et = []
accs_et = []

for fold, (trn_idx, val_idx) in enumerate(kf.split(X=X_train, y = y_train)):
    val_ids_all.append(val_idx)
    print("===== Fold", fold," =====")
    X_tr = X_train.iloc[trn_idx]
    y_tr = y_train[trn_idx]
    X_val = X_train.iloc[val_idx]
    y_val = y_train[val_idx]
    sample_weight_tr = sample_weights.iloc[trn_idx].values
    sample_weight_val = sample_weights.iloc[val_idx].values
    
    model_et = ExtraTreesClassifier(**et_params)
    
    model_et.fit(
        X_tr,
        y_tr,
        sample_weight_tr)
        
    importances_et.append(model_et.feature_importances_)
    
    pred_val_et = model_et.predict(X_val)
    pred_validation_all_et.append(pred_val_et)
    validation_all.append(y_val)
    
    acc_et = accuracy_score(y_true = y_val, y_pred = pred_val_et, sample_weight=sample_weight_val)
    accs_et.append(acc_et)
    
    print("FOLD", fold, "ETC Accuracy:", acc_et)
    
    # Test data predictions
    y_pred_test_et.append(model_et.predict(X_test))
    y_pred_test_prob_et.append(model_et.predict_proba(X_test))
    
print("======================================")
print("Mean Accuracy (all folds) - ETC:", np.mean(accs_et))

In [None]:
def confusion_matrix_mod(pred_val_all_mod, modelName):
    preds_val_full = [item for sublist in pred_val_all_mod for item in sublist]
    true_val_full = [item for sublist in validation_all for item in sublist]
    ids_full = [item for sublist in val_ids_all for item in sublist]
    gcd_full = train_df.iloc[ids_full]["gcd"].values
    sample_weight_val = train_df["duplicates"].iloc[ids_full].values
    
    cm_df = pd.DataFrame({"Preds":preds_val_full, "True":true_val_full, "GCD":gcd_full, "Sample_weights":sample_weight_val})
    plt.figure(figsize=(17, 17))
    
    for i,gcd in enumerate([1,10,1000,10000]):
        plot_data = cm_df[cm_df["GCD"] == gcd]
        cm = confusion_matrix(plot_data["True"], plot_data["Preds"], sample_weight = plot_data["Sample_weights"])
        disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=le.classes_)
        ax = plt.subplot(2, 2, i + 1)
        plt.title(modelName + "- GCD: " + str(gcd))
        disp.plot(ax=ax, xticks_rotation = 30);
        plt.grid(False)

In [None]:
def fold_feature_importances(model_importances, model_name):
    importances_et_df = pd.DataFrame({"feature_cols": X_train.columns, "importances_0": model_importances[0]})
    for i in range(1,N_SPLITS):
        # Use the importances passed in, not the global ExtraTrees list
        importances_et_df["importances_"+str(i)] = model_importances[i]
    importances_et_df["importances_median"] = importances_et_df.drop(columns=["feature_cols"]).median(axis=1)
    importances_et_df = importances_et_df.sort_values(by="importances_median", ascending=False)
    f, ax = plt.subplots(figsize=(10, 15))
    ax = sns.barplot(data = importances_et_df.iloc[0:80], x = "importances_median", y="feature_cols")
    plt.title(model_name)
    ax.set_xlabel("Feature importance median across all folds");

We plot the confusion matrix for each GCD value using the combined validation-set predictions from all folds.

In [None]:
confusion_matrix_mod(pred_validation_all_et, "ETC") 
# Something may be off with the sample weights not showing up as intended for GCD 1000,10000

We plot the median feature importances from the model across all folds.

In [None]:
fold_feature_importances(importances_et, "ETC")

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">Looks like our created features do well!</div>

### **LGBM Classifier**

In [None]:
#Currently chosen randomly

lgbm_params = {
    'objective' : 'multiclass',
    'n_estimators': 300,
    'random_state': 43,
    'learning_rate': 0.1,
    'n_jobs' : -1
}

In [None]:
%%time
pred_validation_all_lgbm = []
validation_all = []
val_ids_all = []

y_pred_test_lgbm = []
y_pred_test_prob_lgbm = []

importances_lgbm = []
accs_lgbm = []

for fold, (trn_idx, val_idx) in enumerate(kf.split(X=X_train, y = y_train)):
    val_ids_all.append(val_idx)
    print("===== Fold", fold," =====")
    X_tr = X_train.iloc[trn_idx]
    y_tr = y_train[trn_idx]
    X_val = X_train.iloc[val_idx]
    y_val = y_train[val_idx]
    sample_weight_tr = sample_weights.iloc[trn_idx].values
    sample_weight_val = sample_weights.iloc[val_idx].values
    
    model_lgbm = LGBMClassifier(**lgbm_params)
    
    model_lgbm.fit(
        X_tr, 
        y_tr,
        sample_weight = sample_weight_tr,
        eval_sample_weight = [sample_weight_val],
        eval_set=[(X_val, y_val)],
        eval_metric = ['multi_logloss', 'multi_error'],
        callbacks = [early_stopping(30), log_evaluation(period=50)]
    )
    
    importances_lgbm.append(model_lgbm.feature_importances_)
    
    pred_val_lgbm = model_lgbm.predict(X_val)
    
    pred_validation_all_lgbm.append(pred_val_lgbm)
    validation_all.append(y_val)
    
    acc_lgbm = accuracy_score(y_true = y_val, y_pred = pred_val_lgbm, sample_weight=sample_weight_val)
    accs_lgbm.append(acc_lgbm)
    
    print("FOLD", fold, "LGBM Accuracy:", acc_lgbm)

    # Test data predictions
    y_pred_test_lgbm.append(model_lgbm.predict(X_test))
    y_pred_test_prob_lgbm.append(model_lgbm.predict_proba(X_test))
    
print("======================================")
print("Mean Accuracy - LGBM:", np.mean(accs_lgbm))

In [None]:
confusion_matrix_mod(pred_validation_all_lgbm, "LGBM")

In [None]:
fold_feature_importances(importances_lgbm, "LGBM")

## Ensemble and Post-processing

In [None]:
print("Each model trained on: ", 100*(1-(1/N_SPLITS)), "% of the training data.")

### ExtraTreesClassifier

We sum the prediction probabilities of each class from each of the models in the ensemble. We then select the class with the highest summed probability for each row in the test data.
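
As a minimal sketch (not the function used in this notebook), the same sum-then-argmax idea can be written without explicit loops. The name `ensemble_argmax` and the toy probabilities are made up for illustration.

```python
import numpy as np

def ensemble_argmax(pred_probs, class_bias):
    """Sum per-fold class probabilities, add a per-class bias, take the argmax."""
    # Stack to shape (n_folds, n_rows, n_classes), then sum over the folds.
    summed = np.sum(np.stack(pred_probs, axis=0), axis=0)
    return np.argmax(summed + class_bias, axis=1)

# Two hypothetical folds, two test rows, three classes.
fold_probs = [
    np.array([[0.6, 0.4, 0.0], [0.2, 0.5, 0.3]]),
    np.array([[0.3, 0.6, 0.1], [0.1, 0.3, 0.6]]),
]

preds = ensemble_argmax(fold_probs, np.zeros(3))
print(preds)
```

The first row is predicted as class 1 even though fold 0 alone would pick class 0, because the summed probability for class 1 is higher.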

In [None]:
def gen_preds_from_ensemble_proba(pred_probs, class_weights):
    """We sum the prediction probabilities of each class from each of the models in the ensemble.
    We then select the class for each row in the test data based on the highest probability"""
    y_pred_probs_added = []
    for row in range(len(X_test)):
    
        summed_prob = class_weights
        for i in range(N_SPLITS):
            summed_prob = np.add(summed_prob,pred_probs[i][row])
    
        y_pred_probs_added.append(summed_prob)
    y_test_preds = [np.argmax(array) for array in y_pred_probs_added]
    return y_test_preds

In [None]:
class_weight = np.zeros(10)
y_test_pred_et = gen_preds_from_ensemble_proba(pred_probs = y_pred_test_prob_et, class_weights=class_weight)
y_test_pred_et = le.inverse_transform(y_test_pred_et)

In [None]:
def bacteriaCount(model_predictions):
    bacteria_counts = train_df['target'].value_counts()
    bacteria_counts = bacteria_counts.reset_index().rename(columns={"index":"BacteriaSpecies", "target":"train_count"})
    bacteria_counts = bacteria_counts.set_index("BacteriaSpecies")
    bacteria_counts["test_count"] = pd.Series(model_predictions).value_counts()
    bacteria_counts["train_perc"] = 100 * bacteria_counts["train_count"]/bacteria_counts["train_count"].sum()
    bacteria_counts["test_perc"] = 100 * bacteria_counts["test_count"]/bacteria_counts["test_count"].sum()
    bacteria_counts = bacteria_counts.sort_index()
    return bacteria_counts

In [None]:
def bacteriaPlotFull(model_predictions):
    test_df["preds"] = model_predictions
    bacteria_counts_full = test_df.groupby(["gcd","preds"])["row_id"].count().reset_index().rename(columns={"row_id":"test_count"})
    bacteria_counts_full["train_count"] = train_df.groupby(["gcd","target"])["gcd"].count().values
    plt.figure(figsize=(15, 15))
    for i,gcd in enumerate([1,10,1000,10000]):
        
        plot_b_count = bacteria_counts_full[bacteria_counts_full["gcd"] == gcd]
        ax = plt.subplot(2, 2, i + 1)
        plt.title("GCD: " + str(gcd))
        ax = sns.barplot(data = plot_b_count, x = "preds", y = "test_count", hue_order = test_df["preds"].unique())
        plt.xticks(rotation=30);
        ax.set_ylim(1500,3250)
    return 

In [None]:
bacteria_count_et = bacteriaCount(y_test_pred_et)
bacteria_count_et

In [None]:
bacteriaPlotFull(y_test_pred_et)

There is likely some issue in our test predictions for GCD 1000 and GCD 10000 as a result of the test data distribution being different to the train data distribution.

In particular: 
- *E. coli* is often underpredicted in favour of *E. fergusonii*.
- *E. hirae* is often underpredicted.

Why these mistakes:
- *E. fergusonii* and *E. coli* are from the same genus, so only a small deviation between the train and test distributions will cause them to be misclassified.
- *E. hirae* has one of the largest train/test distribution changes (as can be seen by the train/test PCA).

We also know that 486 test data rows for GCD 10,000 (r = 100) also appear in the training data. Let's have a look at how well our model predicts these:

In [None]:
test_df["preds"] = y_test_pred_et
s1 = pd.merge(train_df, test_df, how='inner', on=train_cols)
print("There are", len(s1[s1["target"] != s1["preds"]]), "probable mistakes that we can easily fix")

We can try to fix this by matching the percentage of predictions in each class of the test data to the percentage of rows in each class of the train data (each class should be about the same size). To do this, we add a bias to the predicted probabilities of each class that receives fewer predictions than expected. This added bias makes it more likely that the class will be predicted.

Credit to [AmbrosM](https://www.kaggle.com/ambrosm/tpsfeb22-02-postprocessing-against-the-mutants) for this idea.
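
To make the effect of the bias concrete, here is a small sketch on invented numbers. The summed probabilities, the bias value and the three-class setup are assumptions for illustration only, not values from this notebook.

```python
import numpy as np

# Hypothetical probabilities summed over 2 folds, for 4 test rows and 3 classes.
summed = np.array([
    [1.20, 0.70, 0.10],
    [1.02, 0.95, 0.03],   # borderline between class 0 and class 1
    [0.30, 1.60, 0.10],
    [0.10, 0.20, 1.70],
])
n_folds = 2

# Favour class 1; scale by the number of folds because the probabilities are summed, not averaged.
bias = np.array([0.0, 0.05, 0.0]) * n_folds

before = np.argmax(summed, axis=1)
after = np.argmax(summed + bias, axis=1)

# Share of predictions per class, before and after adding the bias.
share_before = np.bincount(before, minlength=3) / len(before)
share_after = np.bincount(after, minlength=3) / len(after)
print(share_before, share_after)
```

Only the borderline row changes class. Confident predictions are unaffected unless the bias is made much larger.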

In [None]:
class_weight_et = np.array([0, 0, 0.025, 0.035, 0, 0, 0, 0, 0, 0])*N_SPLITS

y_test_pred_et = gen_preds_from_ensemble_proba(pred_probs = y_pred_test_prob_et, class_weights=class_weight_et)
y_test_pred_et = le.inverse_transform(y_test_pred_et)

bacteria_count_et["test_count"] = pd.Series(y_test_pred_et).value_counts()
bacteria_count_et["test_perc"] = 100 * bacteria_count_et["test_count"]/bacteria_count_et["test_count"].sum()

Let's see how that changed the predictions:

In [None]:
bacteria_count_et.sort_index()

In [None]:
bacteriaPlotFull(y_test_pred_et)

Let's have a look at how good our predictions are:

In [None]:
plt.figure(figsize=(20, 20))
test_df["preds"] = y_test_pred_et

for i,gcd in enumerate([1,10,1000,10000]):
    pca = PCA(n_components = 2, random_state = 10, whiten = True)
    pca.fit(test_df[test_df["gcd"] == gcd][train_cols])

    X_PCA = pca.transform(test_df[test_df["gcd"] == gcd][train_cols])

    PCA_df = pd.DataFrame({"PCA_1" : X_PCA[:,0], "PCA_2" : X_PCA[:,1],  "LABEL":test_df[test_df["gcd"] == gcd]["preds"]})
    
    ax = plt.subplot(2, 2, i + 1)
    plt.title("GCD: " + str(gcd))
    sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "LABEL", hue_order = test_df["preds"].unique())

Let's have another look at our 486 test data rows:

In [None]:
test_df["preds"] = y_test_pred_et
s1 = pd.merge(train_df, test_df, how='inner', on=train_cols)
print("There are", len(s1[s1["target"] != s1["preds"]]), "probable mistakes that we can easily fix")

Looks like our model gets them all right anyway, so there is no need to fix anything.
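
The check above relies on an inner merge over the feature columns. A minimal sketch of the idea with toy frames (the column names `f1`, `f2` and all values are invented):

```python
import pandas as pd

feature_cols = ["f1", "f2"]

# Hypothetical train rows with known targets and test rows with model predictions.
train = pd.DataFrame({"f1": [1, 2, 3], "f2": [10, 20, 30], "target": ["a", "b", "c"]})
test = pd.DataFrame({"f1": [2, 3, 4], "f2": [20, 30, 40], "preds": ["b", "a", "c"]})

# Keep only the rows whose feature values appear in both frames.
shared = pd.merge(train, test, how="inner", on=feature_cols)

# For shared rows the train target is known, so any disagreement is a likely mistake.
mismatches = shared[shared["target"] != shared["preds"]]
print(len(shared), "shared rows,", len(mismatches), "mismatch(es)")
```

Note that the merge only matches rows whose feature values are exactly equal in both frames.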

### LGBM

In [None]:
class_weight = np.zeros(10)
y_test_pred_lgbm = gen_preds_from_ensemble_proba(pred_probs = y_pred_test_prob_lgbm, class_weights=class_weight)
y_test_pred_lgbm = le.inverse_transform(y_test_pred_lgbm)

In [None]:
bacteria_count_lgbm = bacteriaCount(y_test_pred_lgbm)
bacteria_count_lgbm

In [None]:
bacteriaPlotFull(y_test_pred_lgbm)

In [None]:
# We also know that 486 test data rows for GCD 10,000 (r = 100) also appear in the training data, so let's have a look at how well our model predicts these:
test_df["preds"] = y_test_pred_lgbm
s1 = pd.merge(train_df, test_df, how='inner', on=train_cols)
print("There are", len(s1[s1["target"] != s1["preds"]]), " definite misclassifications in the test data for GCD 10,000")

In [None]:
class_weight_lgbm = np.array([0, 0, 0.1, 0.2, 0, 0, 0, 0, 0, 0])*N_SPLITS

y_test_pred_lgbm = gen_preds_from_ensemble_proba(pred_probs = y_pred_test_prob_lgbm, class_weights=class_weight_lgbm)
y_test_pred_lgbm = le.inverse_transform(y_test_pred_lgbm)


bacteria_count_lgbm["test_count"] = pd.Series(y_test_pred_lgbm).value_counts()
bacteria_count_lgbm["test_perc"] = 100 * bacteria_count_lgbm["test_count"]/bacteria_count_lgbm["test_count"].sum()

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
LGBM seems very confident in its predictions and doesn't change them unless a massive bias is added. This makes post-processing on LGBM very hard, and it just makes the performance worse. This is interesting and supports the hypothesis that the test data and train data have different distributions.

I'm not sure why ETC doesn't also have this problem. Could LGBM be fitting more to the training/validation data, making its predictions almost too good?

<div class="alert alert-block alert-warning"  style="font-size:14px; font-family:verdana; line-height: 1.7em;">
What does this mean in terms of CV scores? If the train and test distributions are so different then how are we going to choose the optimal hyperparameters for the models? This is a rare case where we SHOULD actually trust the public leaderboard over the CV scores.

In [None]:
bacteria_count_lgbm.sort_index()

In [None]:
bacteriaPlotFull(y_test_pred_lgbm)

In [None]:
test_df["preds"] = y_test_pred_lgbm
s1 = pd.merge(train_df, test_df, how='inner', on=train_cols)
print("There are", len(s1[s1["target"] != s1["preds"]]), "definite misclassifications in the test data for GCD 10,000 that we can easily fix")

We have actually just made more errors in our attempt to reduce them in post-processing with LGBM.

Visualising the predictions:

In [None]:
plt.figure(figsize=(20, 20))
test_df["preds"] = y_test_pred_lgbm

for i,gcd in enumerate([1,10,1000,10000]):
    pca = PCA(n_components = 2, random_state = 10, whiten = True)
    pca.fit(test_df[test_df["gcd"] == gcd][train_cols])

    X_PCA = pca.transform(test_df[test_df["gcd"] == gcd][train_cols])

    PCA_df = pd.DataFrame({"PCA_1" : X_PCA[:,0], "PCA_2" : X_PCA[:,1],  "LABEL":test_df[test_df["gcd"] == gcd]["preds"]})
    
    ax = plt.subplot(2, 2, i + 1)
    plt.title("GCD: " + str(gcd))
    sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "LABEL", hue_order = test_df["preds"].unique())

### How many points are in each m cluster for the test data?

Assuming that we have classified all entries with GCD = 1 correctly and that all m clusters follow the same pattern as in GCD = 1.

In [None]:
plt.figure(figsize=(20, 30))

for i,bacteria in enumerate(test_df["preds"].unique()):
    cluster_input_data = test_df[(test_df["gcd"] == 1) & (test_df["preds"] == bacteria)][train_cols]
    cluster_input_data = StandardScaler().fit_transform(cluster_input_data)

    clustering_gmm_m = GaussianMixture(n_components=8, covariance_type = 'diag', n_init=10, random_state=0).fit(cluster_input_data)
    clustering_gmm = clustering_gmm_m.predict(cluster_input_data)

    pca.fit(cluster_input_data)

    X_PCA = pca.transform(cluster_input_data)

    PCA_df = pd.DataFrame({"PCA_1" : X_PCA[:,0], "PCA_2" : X_PCA[:,1],"Cluster":clustering_gmm})    
    
    mvals = ["m00", "m05", "m10", "m25", "m33", "m50", "m75", "m90"]
    PCA_1 = PCA_df.groupby(by=["Cluster"])["PCA_1"].mean().reset_index().sort_values("PCA_1").reset_index(drop=True)
    PCA_df = PCA_df.replace({'Cluster': {PCA_1.iloc[n]["Cluster"]: mvals[n] for n in range(0,8)}})
    
    if i == 0:
        cluster_counts = PCA_df['Cluster'].value_counts().reset_index().rename(columns={"index":"mval", "Cluster":bacteria + " count"}).set_index("mval").sort_index()
    else:
        cluster_counts[bacteria + " count"] = PCA_df['Cluster'].value_counts()
    
    ax = plt.subplot(5, 2, i + 1)
    plt.title("GMM - Bacteria: " + str(bacteria))
    sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue="Cluster", hue_order=mvals, palette = sns.color_palette("husl", 8))
cluster_counts

No surprises here. It follows the same pattern as in the train data.
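
Mixture-model cluster ids come out in an arbitrary order, which is why the clusters are renamed after sorting them by their mean position along the first PCA axis. A minimal sketch of that renaming on 1-D toy data (the centres, the names `low`/`mid`/`high` and the sample sizes are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Three well-separated toy groups, deliberately generated out of order.
centres = [4.0, 0.0, 8.0]
x = np.concatenate([rng.normal(c, 0.2, size=50) for c in centres]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, n_init=5, random_state=0).fit(x)
df = pd.DataFrame({"PCA_1": x[:, 0], "Cluster": gmm.predict(x)})

# Sort the cluster ids by their mean position, then give them ordered names.
names = ["low", "mid", "high"]
order = df.groupby("Cluster")["PCA_1"].mean().sort_values().index
mapping = {cluster_id: names[rank] for rank, cluster_id in enumerate(order)}
df["Cluster_name"] = df["Cluster"].map(mapping)

print(df["Cluster_name"].value_counts())
```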

## Improvements with clustering

Credit: This was inspired by [AmbrosM](https://www.kaggle.com/ambrosm/tpsfeb22-03-clustering-improves-the-predictions).

Perhaps clustering can be used to improve a few of the predictions when GCD=10. 

* There are not any obvious mistakes for GCD = 1.
* There are a few obvious mistakes for GCD = 10 which are easily fixable.
* Clustering will be difficult for GCD = 1000 and GCD = 10000 but it may be possible.

### GCD 10

First, let's get a better view of the model's mistakes:

In [None]:
test_df["preds"] = y_test_pred_et # Set to y_test_pred_lgbm to view lgbm predictions

In [None]:
pca = PCA(n_components = 2, random_state = 10, whiten = True)
pca.fit(test_df[test_df["gcd"] == 10][train_cols])

X_PCA = pca.transform(test_df[test_df["gcd"] == 10][train_cols])

PCA_df = pd.DataFrame({"PCA_1" : X_PCA[:,0], "PCA_2" : X_PCA[:,1],  "LABEL":test_df[test_df["gcd"] == 10]["preds"]})

In [None]:
f, ax = plt.subplots(figsize=(20,40))

ax = plt.subplot(5, 2, 1)
sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "LABEL", hue_order = test_df["preds"].unique());

ax = plt.subplot(5, 2, 2)
sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "LABEL", hue_order = test_df["preds"].unique());
plt.ylim(-1,0.5)
plt.xlim(-1, 0.4)

ax = plt.subplot(5, 2, 3)
sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "LABEL", hue_order = test_df["preds"].unique());
plt.ylim(-1,0.2)
plt.xlim(-1, -0.45)

ax = plt.subplot(5, 2, 4)
sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "LABEL", hue_order = test_df["preds"].unique());
plt.ylim(-1,0.2)
plt.xlim(-0.8, -0.7)

ax = plt.subplot(5, 2, 5)
sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "LABEL", hue_order = test_df["preds"].unique());
plt.ylim(-0.17,0.1)
plt.xlim(-0.82, -0.72)

ax = plt.subplot(5, 2, 6)
sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "LABEL", hue_order = test_df["preds"].unique());
plt.ylim(-0.4,-0.19)
plt.xlim(-0.785, -0.72)

ax = plt.subplot(5, 2, 7)
sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "LABEL", hue_order = test_df["preds"].unique());
plt.ylim(-0.85,-0.55)
plt.xlim(-0.77, -0.72);

ax = plt.subplot(5, 2, 8)
sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "LABEL", hue_order = test_df["preds"].unique());
plt.ylim(-0.85,-0.75)
plt.xlim(-0.752, -0.725);

ax = plt.subplot(5, 2, 9)
sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "LABEL", hue_order = test_df["preds"].unique());
plt.ylim(-1,-0.8)
plt.xlim(-0.6, -0.5);

ax = plt.subplot(5, 2, 10)
sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "LABEL", hue_order = test_df["preds"].unique());
plt.ylim(-1,-0.5)
plt.xlim(-1, -0.5);

The main issue for GCD 10 is distinguishing between Escherichia_coli and Escherichia_fergusonii (blue and light blue). Let's grab the points in some of the images and recluster them.
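
The relabelling step used when reclustering can be sketched on toy data: cluster the points, then give every cluster the label that most of its members were predicted as. The blobs, the short labels and the two deliberately wrong predictions are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two well-separated toy blobs standing in for two species.
X = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 2)),
    rng.normal(5.0, 0.1, size=(20, 2)),
])
labels = np.array(["coli"] * 20 + ["fergusonii"] * 20, dtype=object)
labels[:2] = "fergusonii"  # pretend the model mislabelled two points in the first blob

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
df = pd.DataFrame({"LABEL": labels, "CLUSTER": clusters})

# Cluster ids are arbitrary, so map each id to the most common predicted label inside it.
majority = df.groupby("CLUSTER")["LABEL"].agg(lambda s: s.value_counts().index[0])
df["LABEL_CLUSTER"] = df["CLUSTER"].map(majority)

n_changed = int((df["LABEL"] != df["LABEL_CLUSTER"]).sum())
print(n_changed, "predictions would be changed")
```

This only helps when the majority of each cluster is already predicted correctly; if the model is wrong about most of a cluster, the majority vote makes things worse.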

In [None]:
def recluster(xlim_low, xlim_high, ylim_low, ylim_high, ncomponents):
    # Copy the selection so that adding columns below does not trigger a SettingWithCopyWarning
    cluster_data = PCA_df[(PCA_df["PCA_1"]>= xlim_low) & (PCA_df["PCA_1"]<= xlim_high) & (PCA_df["PCA_2"] >= ylim_low) & (PCA_df["PCA_2"] <= ylim_high)].copy()
    cluster_data_X = X_test.iloc[cluster_data.index][train_cols].drop(columns=["gcd"])
    
    #clustering_gmm_m = GaussianMixture(n_components=ncomponents, covariance_type = 'diag', n_init=20, random_state=2).fit(cluster_data_X)
    #clustering_gmm = clustering_gmm_m.predict(cluster_data_X)
    
    kmeans = KMeans(n_clusters=ncomponents, n_init=40, random_state=0).fit(cluster_data_X)
    
    #cluster_data["LABEL_CLUSTER"] = clustering_gmm
    cluster_data["LABEL_CLUSTER"] = kmeans.labels_
    
    # Labels from clustering are either 0 or 1 at random - changing 0 or 1's to match the correct labels (done by matching most points)
    for i in range(ncomponents):
        most = cluster_data[cluster_data["LABEL_CLUSTER"]==i]["LABEL"].value_counts().index[0]
        cluster_data = cluster_data.replace(i,most)
    
    f, ax = plt.subplots(figsize=(20,10))
    ax = plt.subplot(1, 2, 1)
    sns.scatterplot(data = cluster_data, x = "PCA_1", y = "PCA_2", hue = "LABEL");
    plt.title("Original Predictions")
    plt.ylim(ylim_low,ylim_high)
    plt.xlim(xlim_low,xlim_high)
                    
    ax = plt.subplot(1, 2, 2)
    sns.scatterplot(data = cluster_data, x = "PCA_1", y = "PCA_2", hue = "LABEL_CLUSTER");
    plt.title("Cluster Predictions")
    plt.ylim(ylim_low, ylim_high)
    plt.xlim(xlim_low,xlim_high)
    
    return cluster_data

In [None]:
def update_predictions(cluster_data, y_test_pred_model):
    change_preds = cluster_data[cluster_data["LABEL"] != cluster_data["LABEL_CLUSTER"]]
    print(len(change_preds), "predictions attempting to be changed")
    index_preds = change_preds.index
    #Test if we are definitely changing the right predictions
    if np.array_equal(y_test_pred_model[index_preds], change_preds["LABEL"].values):
        print("Check Passed: Changing correct entries")
        y_test_pred_model[index_preds] = change_preds["LABEL_CLUSTER"]
    return

In [None]:
recluster(xlim_low = -0.82, xlim_high = -0.72, ylim_low = -0.17, ylim_high = 0.1, ncomponents=2)
#Decided not to update these predictions

In [None]:
recluster(xlim_low = -0.785, xlim_high = -0.72, ylim_low = -0.375, ylim_high = -0.19, ncomponents=2)
#Decided not to update these predictions

In [None]:
clust_data = recluster(xlim_low = -0.77, xlim_high = -0.72, ylim_low = -0.75, ylim_high = -0.55, ncomponents=3)

In [None]:
update_predictions(cluster_data = clust_data, y_test_pred_model = y_test_pred_et)

In [None]:
clust_data = recluster(xlim_low = -0.752, xlim_high = -0.725, ylim_low = -0.85, ylim_high = -0.75, ncomponents=2)

In [None]:
update_predictions(cluster_data = clust_data, y_test_pred_model = y_test_pred_et)

### GCD 1000

In [None]:
pca = PCA(n_components = 2, random_state = 10, whiten = True)
pca.fit(test_df[test_df["gcd"] == 1000][train_cols])

X_PCA = pca.transform(test_df[test_df["gcd"] == 1000][train_cols])

PCA_df = pd.DataFrame({"PCA_1" : X_PCA[:,0], "PCA_2" : X_PCA[:,1],  "LABEL":test_df[test_df["gcd"] == 1000]["preds"]})

In [None]:
f, ax = plt.subplots(figsize=(20,20))

ax = plt.subplot(3, 2, 1)
sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "LABEL", hue_order = test_df["preds"].unique());

ax = plt.subplot(3, 2, 2)
sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "LABEL", hue_order = test_df["preds"].unique());
plt.ylim(-0.5,2)
plt.xlim(1.6, 2.6)

ax = plt.subplot(3, 2, 3)
sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "LABEL", hue_order = test_df["preds"].unique());
plt.ylim(0,2)
plt.xlim(-1.5, -0.3)

ax = plt.subplot(3, 2, 4)
sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "LABEL", hue_order = test_df["preds"].unique());
plt.ylim(-1.8,-1)
plt.xlim(-1, 0)

ax = plt.subplot(3, 2, 5)
sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "LABEL", hue_order = test_df["preds"].unique());
plt.ylim(-0.5,1.2)
plt.xlim(-0.5, 1.6)

ax = plt.subplot(3, 2, 6)
sns.scatterplot(data = PCA_df, x = "PCA_1", y = "PCA_2", hue = "LABEL", hue_order = test_df["preds"].unique());
plt.ylim(-0.4,-0.19)
plt.xlim(-0.785, -0.72);


In [None]:
recluster(xlim_low = 1.6, xlim_high = 2.6, ylim_low = -0.5, ylim_high = 2, ncomponents=2)

Nothing to change here.

### Saving our predictions

<div class="alert alert-block alert-warning"  style="font-size:14px; font-family:verdana; line-height: 1.7em;">
What does this mean in terms of CV scores? If the train and test distributions are so different then how are we going to choose the optimal hyperparameters for the models? This is a rare case where we SHOULD actually trust the public leaderboard over the CV scores.

For example, I have noticed that LGBM scores much worse on the test set (~0.95) than ETC (~0.99), despite the two having very similar CV scores.

In [None]:
submission = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/sample_submission.csv')
submission['target'] = y_test_pred_et

In [None]:
submission.head(10).T

In [None]:
submission.to_csv('submission.csv', index=False)