# Summary

By analyzing the results of a logistic regression, we find it most likely that 25% of the train and test events were mislabeled. This would set an upper bound of AUC = 0.75 for this competition, which is consistent with the current public leaderboard.

To improve training, we suggest either dropping the mislabeled events from the training set or flipping their target values.

Inspired by the ROC curve from [notebook](https://www.kaggle.com/hamzaghanmi/make-it-simple)

# Import libraries and data

In [None]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve,roc_auc_score,accuracy_score

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-nov-2021/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-nov-2021/test.csv')
sub = pd.read_csv('/kaggle/input/tabular-playground-series-nov-2021/sample_submission.csv')
y = train['target']

cols = ['f'+str(i) for i in range(100)] #feature columns

# Fit logistic regression and check its quality

Apply standard scaler to the data for faster training

In [None]:
scaler = StandardScaler()
train[cols] = scaler.fit_transform(train[cols])
test[cols] = scaler.transform(test[cols])

Fit logistic regression to the data - use all the data as we are not interested in CV here

In [None]:
model = LogisticRegression(solver='liblinear')
model.fit(train[cols],y);

Quality of the model

In [None]:
y_pred_proba = model.predict_proba(train[cols])[:, 1]
auc = roc_auc_score(y, y_pred_proba)

acc = accuracy_score(y, model.predict(train[cols]))

print(f"accuracy: {round(acc*100,3)} , auc: {round(auc*100,3)}")

Plot of the ROC curve

In [None]:
fpr, tpr, _ = roc_curve(y,  y_pred_proba)
plt.plot(fpr,tpr,label="data, auc = "+str(round(auc*100,2)))
plt.legend(loc=4)
plt.show()

# Digging deeper into what is going on

In the ROC curve plot above we can notice a strange linear behavior on both ends... Let's dig a bit more into what is going on! We start by extracting the intercept and coefficients of the linear combination that the logistic regression found (let's label it LRLC):

In [None]:
c0 = model.intercept_[0]
ci = model.coef_[0]

And calculate it for each of our training examples

In [None]:
# LRLC = intercept + sum_i coef_i * f_i, computed for all rows at once
train['LRLC'] = c0 + train[cols].to_numpy() @ ci
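
As a sanity check, the linear combination computed above should match what scikit-learn returns from `decision_function`. Here is a minimal sketch on synthetic data; the names `X_demo`, `y_demo` and `demo_model` are made up for this illustration and are not part of the competition data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic toy data (assumption: stands in for the scaled competition features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 1.5])
y_demo = (X_demo @ true_w + rng.normal(size=500) > 0).astype(int)

demo_model = LogisticRegression(solver='liblinear').fit(X_demo, y_demo)

# Manual linear combination: intercept + sum_i coef_i * x_i
manual = demo_model.intercept_[0] + X_demo @ demo_model.coef_[0]

# Same quantity as computed by scikit-learn
builtin = demo_model.decision_function(X_demo)

print(np.allclose(manual, builtin))
```

Positive values of this quantity correspond to a predicted probability above 0.5 for target = 1, which is why the histograms below are split at LRLC = 0.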

Let's define samples of emails marked as spam / ham for future convenience

In [None]:
y0 = train[train['target'] == 0]
y1 = train[train['target'] == 1]

And plot the distributions of LRLC for both

In [None]:
plt.hist(y0['LRLC'], bins = np.arange(-4, 4, 0.1), alpha = 0.3);
plt.title('target = 0');
plt.xlabel('LRLC');
plt.axvline(0, color = 'k');

plt.show()

plt.hist(y1['LRLC'], bins = np.arange(-4, 4, 0.1), alpha = 0.3);
plt.title('target = 1');
plt.xlabel('LRLC');
plt.axvline(0, color = 'k');

Ok, this definitely looks super suspicious! Clearly there are two populations:

1) one where the logistic regression variable LRLC does a pretty good job separating events with target = 0 and target = 1

2) one where the value of LRLC does not seem to matter at all

We believe this is because, after the data set was created, **the target value for part of the emails was intentionally mislabeled**. This scenario is not that far-fetched, given that an almost-perfect model was submitted within minutes of the competition start and the organizers had to act quickly.

Let's look at the percentage of mislabeled emails by comparing the corresponding histograms:

In [None]:
h0 = np.histogram(y0['LRLC'], bins = np.arange(-2, 2, 0.1))
h1 = np.histogram(y1['LRLC'], bins = np.arange(-2, 2, 0.1))

lrlc = (h0[1][1:] + h0[1][:-1])/2. # bin centers
fraction_of_0 = h0[0]/(h0[0]+h1[0])

plt.plot(lrlc, fraction_of_0, marker = 'o')
plt.axhline(0.75, c = 'gray', ls = '--')
plt.axhline(0.25, c = 'gray', ls = '--');
plt.xlabel('LRLC')
plt.ylabel('Fraction with target = 0');

This seems to confirm our expectations. Close to LRLC = 0 the logistic regression does not do a 100% perfect job separating the two categories, but further away from zero the fraction of samples with target = 0 plateaus at about 75% (negative LRLC) and **25%** (positive LRLC). This looks like a nice, round number, which again agrees with our assertion that there is intentional mislabelling going on.
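
To see that random label flipping alone can produce such plateaus, here is a small simulation sketch. It assumes (this is our hypothesis, not a known fact about the data) that clean labels are almost perfectly separable by a score and that 25% of them are then flipped at random; `score` is a hypothetical stand-in for LRLC.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Hypothetical stand-in for LRLC
score = rng.normal(scale=2.0, size=n)

# Clean labels: steep sigmoid, so classes are nearly separable away from zero
p1 = 1.0 / (1.0 + np.exp(-10.0 * score))
clean = (rng.random(n) < p1).astype(int)

# Assumed mislabelling: flip 25% of the labels at random
flip = rng.random(n) < 0.25
noisy = np.where(flip, 1 - clean, clean)

# Fraction with target = 0 far from zero on each side
frac0_left = np.mean(noisy[score < -1] == 0)
frac0_right = np.mean(noisy[score > 1] == 0)
print(round(frac0_left, 3), round(frac0_right, 3))
```

In this toy setup the two fractions should come out close to 0.75 and 0.25, mirroring the plateaus in the plot above.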

Zoom in on the edges

In [None]:
plt.plot(lrlc, fraction_of_0, marker = 'o')
plt.axhline(0.75, c = 'gray', ls = '--')
plt.axhline(0.25, c = 'gray', ls = '--');
plt.xlabel('LRLC')
plt.ylabel('Fraction with target = 0');
plt.ylim([0.73,0.77]);

plt.show()

plt.plot(lrlc, fraction_of_0, marker = 'o')
plt.axhline(0.75, c = 'gray', ls = '--')
plt.axhline(0.25, c = 'gray', ls = '--');
plt.xlabel('LRLC')
plt.ylabel('Fraction with target = 0');
plt.ylim([0.23,0.27]);

# What does it mean for maximal AUC

Let's assume we have a perfect classifier of the events before the mislabelling. Here we estimate what maximal AUC such a classifier would achieve. Some [background reading about AUC](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc) for those who need a refresher.

Before the mislabelling, our classifier would correctly order the samples - 50% of the samples that are red (target = 0) to the left side, 50% of the samples that are green (target = 1) to the right. There is no mixing and we have AUC = 1.

After the mislabelling, the left side is 75% red and 25% green, while the right side is 75% green and 25% red. AUC is then the probability of drawing a green "marble" to the right of a red "marble" when drawing one green and one red marble at random.

We have the following options:

1) green marble from the right, red marble from the left (75% * 75% of the time)

2) green marble from the left, red marble from the right (25% * 25% of the time)

3) green marble from the right, red marble from the right (75% * 25% of the time)

4) green marble from the left, red marble from the left (25% * 75% of the time)

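The green "marble" is to the right of the red "marble" in all of case 1) and never in case 2). In cases 3) and 4) both marbles come from the same side, where the flipped labels carry no information about the ordering, so the green one is to the right half of the time.

Together this gives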

$$
AUC_\mathrm{max} = 
\frac{3}{4}\times\frac{3}{4} 
+ \frac{1}{2}\times\frac{3}{4}\times\frac{1}{4} 
+ \frac{1}{2}\times\frac{1}{4}\times\frac{3}{4} 
= 
\frac{3}{4}
$$

This number seems to be in agreement with the public leaderboard, where no one has been able to push through this barrier. Some people might get CV over 0.75, but if our hypothesis is correct, this is just overfitting.
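
The marble argument can also be checked numerically. The sketch below assumes balanced classes, a classifier that is perfect on the clean labels, and labels flipped independently at random; the helper `auc_after_flips` is a made-up name for this illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_after_flips(flip_rate, n=200_000, seed=0):
    """AUC of a clean-label-perfect score, evaluated against randomly flipped labels."""
    rng = np.random.default_rng(seed)
    clean = rng.integers(0, 2, size=n)       # balanced clean labels
    # Perfect score: every clean 1 lies to the right of every clean 0
    score = clean + 0.5 * rng.random(n)
    flip = rng.random(n) < flip_rate         # assumed random mislabelling
    noisy = np.where(flip, 1 - clean, clean)
    return roc_auc_score(noisy, score)

auc_clean = auc_after_flips(0.0)
auc_noisy = auc_after_flips(0.25)
print(f"no flips: {auc_clean:.3f}, 25% flips: {auc_noisy:.3f}")
```

Under these assumptions the same counting argument for a general flip rate $p$ gives $(1-p)^2 + p(1-p) = 1 - p$, so the 25% case is just one instance of that pattern.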