In [None]:
# https://stackoverflow.com/questions/56283294/importerror-cannot-import-name-factorial
!pip install statsmodels==0.10.0rc2

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

I'm gonna need to familiarize myself with basic snippets like stuff for loading data; these I'll just be copy-pasting from other kernels.

In [None]:
train_data = pd.read_csv('../input/train_V2.csv')
test_data = pd.read_csv('../input/test_V2.csv')

In [None]:
train_data.info()
train_data.head()

Okay, let's tackle the problem. Here's a couple starting points:

- I took notes at PyCon US at a talk
  called _"When not to use neural networks and what to do instead"_:
  - I'll be checking that here:
    https://github.com/underyx/conference-notes/blob/master/pycon-us-2019/when-not-to-use-deep-learning.md
  - I don't quite grasp the underlying concepts
    but at least know what terms to Google
- The first time machine learning ever really clicked for me
  was at Google I/O, during an intro talk
  - The talk is available online here: https://www.youtube.com/watch?v=_RPHiqF2bSs
    (I didn't take notes there)
  - I'll probably try to adapt the code from the talk to this problem
  - But I also suspect it would take an unreasonably long time to train a machine learning model
    so I'll try this approach after the more basic ones from the above PyCon talk
- One thing I'm guessing now is that more `damageDealt` means a better finishing position,
  so to keep things simple, I'll start out by trying to make this work as the sole predicting feature
  - The very simplest thing I'll try right now is **fitting a linear regression model**.
    - For help with data transformation I found https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf
    - For help with linear regression I'll follow https://devarea.com/linear-regression-with-numpy/
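
To make "fitting a linear regression model" concrete, here is a minimal sketch with NumPy. The numbers are made up for illustration and are not taken from `train_V2.csv`; `predict` is a hypothetical helper name. With a single feature, least squares has a closed form, and `np.polyfit` with degree 1 should give the same line.

```python
import numpy as np

# Made-up numbers purely for illustration; not taken from train_V2.csv.
damage = np.array([0.0, 50.0, 100.0, 200.0, 400.0])
placement = np.array([0.10, 0.30, 0.40, 0.60, 0.90])

# Ordinary least squares with one feature has a closed form:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope = np.cov(damage, placement, bias=True)[0, 1] / np.var(damage)
intercept = placement.mean() - slope * damage.mean()

# np.polyfit with degree 1 solves the same least-squares problem.
poly_slope, poly_intercept = np.polyfit(damage, placement, deg=1)


def predict(x):
    """Predict placement from damage using the fitted line (hypothetical helper)."""
    return intercept + slope * x


print(slope, intercept)
print(predict(150.0))
```

Note that nothing stops `predict` from returning values below 0 or above 1 for extreme inputs, since a straight line is unbounded.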

In [None]:
train_data.head(10)[["damageDealt", "winPlacePerc"]]

So let's try to visualize this,
to see if a linear relationship even seems to be visible.
The article I linked above does this with `matplotlib`
but I remember seeing somewhere how much nicer `seaborn` looked
so I'll try to use that first; it can't be that hard.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

data_to_plot = (
    train_data.sample(n=200)[["damageDealt", "winPlacePerc"]]
)

sns.relplot(x="damageDealt", y="winPlacePerc", data=data_to_plot)

That's surprisingly all over the place.
I'm guessing a lot of this might be
due to less active players piggybacking on their teammates' high `damageDealt`.
For the sake of running any kind of prediction ASAP,
let me filter to solo players for now.

In [None]:
data_to_plot = (
    train_data[train_data.matchType.isin({"solo", "solo-fpp"})].sample(n=200)[["damageDealt", "winPlacePerc"]]
)

sns.relplot(x="damageDealt", y="winPlacePerc", data=data_to_plot)

Alright, that's something.
Let me see what I can do if I try to predict with linear regression on this.
Hopefully I'll get really low accuracy so I'll get to see lots of improvement once moving on to proper models.

In [None]:
from scipy import stats
print("pure random\n", stats.linregress(np.random.random(200), np.random.random(200)).rvalue ** 2)

solo_train_data = train_data[train_data.matchType.isin({"solo", "solo-fpp"})]
print("20 samples\n", stats.linregress(solo_train_data.sample(n=20)[["damageDealt", "winPlacePerc"]]).rvalue ** 2)
print("200 samples\n", stats.linregress(solo_train_data.sample(n=200)[["damageDealt", "winPlacePerc"]]).rvalue ** 2)
print("2000 samples\n", stats.linregress(solo_train_data.sample(n=2000)[["damageDealt", "winPlacePerc"]]).rvalue ** 2)
print("20000 samples\n", stats.linregress(solo_train_data.sample(n=20000)[["damageDealt", "winPlacePerc"]]).rvalue ** 2)

I was checking R-squared values here
cause people were writing how that's a useful metric for determining level of correlation.
And since that number's higher than when trying to regress on random number series,
I guess we did something, yay!
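
For reference, here is a small sketch of what that R-squared number measures, using synthetic data (a seeded noisy trend, not the competition data). For a one-feature linear fit, the share of variance explained by the line equals the squared Pearson correlation, which is what `rvalue ** 2` gives.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Synthetic stand-in data: a noisy upward trend, not the competition data.
x = rng.uniform(0, 500, size=200)
y = 0.2 + 0.001 * x + rng.normal(0, 0.15, size=200)

# Fit a straight line and measure how much of the variance it explains.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)
r_squared = 1 - residuals.var() / y.var()

# For a one-feature linear fit this equals the squared Pearson correlation.
r_squared_from_corr = np.corrcoef(x, y)[0, 1] ** 2

# Two unrelated random series should land close to zero.
noise_r_squared = np.corrcoef(rng.random(200), rng.random(200))[0, 1] ** 2

print(r_squared, r_squared_from_corr, noise_r_squared)
```

So a value well above the random-series baseline says the line explains a real share of the spread, though not how good the individual predictions are.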

When trying to look into how to visualize the results,
I found out that seaborn has some cool tooling for calculating regression models on its own.

This tutorial is super useful for exploring my options:
https://seaborn.pydata.org/tutorial/regression.html#regression-tutorial

In [None]:
sns.regplot(x="damageDealt", y="winPlacePerc", x_jitter=5, data=solo_train_data.sample(n=500)[["damageDealt", "winPlacePerc"]])

One thing that bothers me is that this model doesn't seem to understand
that we're bound to results between 0.0 and 1.0.

I have no idea what LOWESS/"locally weighted scatterplot smoothing" is,
but based on the above tutorial,
it seems to return much more sensible results for datasets such as ours.

Let me try it first, and if it truly looks good on our dataset,
I'll read up on it.



In [None]:
sns.regplot(x="damageDealt", y="winPlacePerc", x_jitter=5, lowess=True, data=solo_train_data.sample(n=200)[["damageDealt", "winPlacePerc"]])

Damn, that looks much better.
Upon some reading, what I seem to have done with enabling `lowess`
is that I stopped telling the computer that I expect a linear relationship here.
That's actually good: I have no reason to assume linear,
or pretty much any kind of relationship.

Normally I'd be worried about overfitting
if I let the computer decide what kind of relationship to look for.
But looking at this regression line,
it seems like what we have is still fairly smooth and loose-fitting
(there are no random spikes near outlier datapoints)
so I'll go with it for now.
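
As far as I understand it, the core idea of LOWESS is to fit many small weighted lines, one around each point of interest, instead of one global line. The sketch below is a hand-rolled simplification in NumPy (tricube weights, no robustness iterations), not what seaborn or statsmodels actually run. `local_linear_predict` and its `frac` argument are hypothetical names, and the data is synthetic.

```python
import numpy as np


def local_linear_predict(x_train, y_train, x_new, frac=0.5):
    """Predict at x_new from a weighted line fitted to the nearest points.

    Simplified illustration: tricube weights, no robustness iterations.
    """
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    k = max(2, int(np.ceil(frac * len(x_train))))  # neighbours per local fit
    predictions = []
    for x0 in np.atleast_1d(x_new):
        dist = np.abs(x_train - x0)
        bandwidth = np.sort(dist)[k - 1]  # distance to the k-th neighbour
        # Tricube kernel: weight 1 at x0, falling to 0 at the bandwidth.
        w = np.clip(1 - (dist / max(bandwidth, 1e-12)) ** 3, 0, None) ** 3
        # np.polyfit minimises sum((w_i * residual_i) ** 2), so pass sqrt(w).
        slope, intercept = np.polyfit(x_train, y_train, deg=1, w=np.sqrt(w))
        predictions.append(intercept + slope * x0)
    return np.array(predictions)


# Synthetic square-root-shaped data with noise, clipped to [0, 1].
rng = np.random.default_rng(1)
x = rng.uniform(0, 1000, size=300)
y = np.clip(np.sqrt(x / 1000) + rng.normal(0, 0.1, size=300), 0, 1)

smoothed = local_linear_predict(x, y, np.array([50.0, 300.0, 800.0]))
print(smoothed)
```

Because each prediction comes from its own local fit, the curve can bend wherever the data bends, and it can be evaluated at x values that are not in the training sample.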

In [None]:
import statsmodels.api as sm
sample = solo_train_data.sample(n=10)
print(sample[["winPlacePerc", "damageDealt"]])
print(sm.nonparametric.lowess(sample["winPlacePerc"], sample["damageDealt"]))

Okay, this function seems to be smoothing existing data.
That's not what we want.
There's probably a way to use it to improve our models but…
I don't want to go down too deep of a rabbit hole here.

Let's just abandon this for now,
cause I noticed that the LOWESS function fitted something
that looks like graphing `y = sqrt(x)` back in high school.

So I looked for some overview of regression functions
and seems like polynomial is what I need here:
https://towardsdatascience.com/machine-learning-polynomial-regression-with-python-5328e4e8a386

I checked Seaborn's docs and the function I have above actually has an `order` parameter.
Let's try it!

In [None]:
sns.regplot(x="damageDealt", y="winPlacePerc", x_jitter=5, order=2, data=solo_train_data.sample(n=200)[["damageDealt", "winPlacePerc"]])

Well, that looks nice.
But I'm worried about the downward curve at the end.
That makes no logical sense.
How do I get rid of it?
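
The downturn is built into the model rather than the data: a degree-2 fit is a parabola, and when its leading coefficient is negative it has to come back down past its peak. A minimal sketch on a synthetic saturating curve (a stand-in shape, not the real data) shows this.

```python
import numpy as np

# Synthetic saturating curve standing in for the damage/placement shape.
x = np.linspace(0, 1000, 200)
y = 1 - np.exp(-x / 200)

# Degree-2 least-squares fit: y is approximated by a*x^2 + b*x + c
a, b, c = np.polyfit(x, y, deg=2)

# With a < 0 the parabola peaks at x = -b / (2a) and declines afterwards.
peak_x = -b / (2 * a)
at_peak = np.polyval([a, b, c], peak_x)
beyond_peak = np.polyval([a, b, c], peak_x + 500)

print(a, peak_x, at_peak, beyond_peak)
```

Any curve that rises and then flattens out will tend to push a quadratic fit into this shape, so the fix has to be a different model family, not different parameters.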

Reading through the other parameters of `seaborn.regplot`,
there was one called `logistic`,
and the Wikipedia article on logistic regression looks very promising.

Finally I found something that doesn't just look right,
but makes logical sense, too.

In [None]:
sns.regplot(x="damageDealt", y="winPlacePerc", x_jitter=5, logistic=True, data=solo_train_data.sample(n=2000)[["damageDealt", "winPlacePerc"]])

The only thing that doesn't make perfect logical sense is that
apparently the inputs on `y` should be just 0s and 1s.

I can sort of justify passing in `winPlacePerc`
as it correlates with win probability from 0.0 to 1.0.
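
The reason the logistic curve respects the 0.0 to 1.0 bounds is the function at its core, which squashes any real number into that range. A minimal sketch, with coefficients picked by hand for illustration (they are not fitted to anything):

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical coefficients chosen by hand for illustration, not fitted.
intercept, slope = -1.5, 0.01

damage = np.array([0.0, 100.0, 300.0, 1000.0])
predicted = sigmoid(intercept + slope * damage)
print(predicted)
```

Whatever the coefficients are, the output rises monotonically with `damage` (for a positive slope) and flattens out instead of overshooting 1.0 or turning back down.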

It's really time to run some prediction now
so I won't try finding even cooler models for now.
Let's use https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [None]:
from sklearn.model_selection import train_test_split

clean_train_data = train_data[["damageDealt", "winPlacePerc"]].dropna()

X_train, X_test, y_train, y_test = train_test_split(clean_train_data, clean_train_data.winPlacePerc, test_size=0.2)

model = sm.Logit(y_train, X_train.damageDealt)
result = model.fit()

In [None]:
from sklearn.metrics import mean_absolute_error

predictions = result.predict(X_test.damageDealt)
sns.scatterplot(x=y_test[:200], y=predictions[:200])

print(mean_absolute_error(y_test, np.random.random(len(y_test))))
print(mean_absolute_error(y_test, predictions))

Alright, we've got something that's better than random.
Ain't gonna win any awards, but I'm glad it seemed to work out.
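
Random guessing is a fairly weak baseline, though. A sketch of a slightly tougher one, always predicting the median, is below. It assumes targets spread roughly uniformly over 0.0 to 1.0, which is only a synthetic stand-in for `winPlacePerc`, and `mae` is a hypothetical helper mirroring what `mean_absolute_error` computes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic targets standing in for winPlacePerc: uniform on [0, 1].
y_true = rng.random(10_000)


def mae(actual, predicted):
    """Mean absolute error between two equally sized arrays."""
    return np.mean(np.abs(actual - predicted))


# Baseline 1: a fresh uniform random guess for every row.
random_mae = mae(y_true, rng.random(len(y_true)))

# Baseline 2: the same constant (the median) for every row.
constant_mae = mae(y_true, np.full(len(y_true), np.median(y_true)))

print(random_mae, constant_mae)
```

Under that assumption the constant guess scores noticeably better than the random one, so it is the fairer number to compare a model against.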

It's a bit suspicious that 0.5 seems to be a hard lower bound for our predictions.
Just for fun, let me check what happens
if we expand the range of predictions by brute force.

In [None]:
garbled_predictions = (predictions - 0.5) * 2.0

print(mean_absolute_error(y_test, garbled_predictions))
sns.scatterplot(x=y_test[:200], y=garbled_predictions[:200])

Worse than random.
Unsurprising, but still, lesson learned.
Don't mess with the models.
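
A likely explanation for the 0.5 floor: `sm.Logit` does not add an intercept on its own, so the fitted model is just the logistic function applied to `coef * damageDealt`. Assuming `damageDealt` is never negative and the coefficient is positive, the lowest possible prediction is the logistic function at 0, which is 0.5. The sketch below uses made-up values for `coef` and `intercept`, not fitted ones.

```python
import numpy as np


def sigmoid(z):
    """Logistic function used by logistic regression."""
    return 1.0 / (1.0 + np.exp(-z))


damage = np.array([0.0, 50.0, 200.0, 800.0])

# Without an intercept the model is sigmoid(coef * damage).
# The coefficient is a made-up positive value, not a fitted one.
coef = 0.004
no_intercept = sigmoid(coef * damage)

# With an intercept (hypothetical value) the curve can start below 0.5.
intercept = -2.0
with_intercept = sigmoid(intercept + coef * damage)

print(no_intercept.min(), with_intercept.min())
```

In statsmodels the usual way to get an intercept is to pass the features through `sm.add_constant` before fitting, and again before predicting.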

In [None]:
submission_predictions = result.predict(test_data.damageDealt)
submission = pd.DataFrame({"Id": test_data.Id, "winPlacePerc": submission_predictions})
submission.to_csv("submission.csv", index=False)

In [None]:
submission.head()