The notebook https://www.kaggle.com/alijs1/target-variable-some-interesting-insights and discussion https://www.kaggle.com/c/avito-demand-prediction/discussion/58948 inspired me to look more into the target variable. Below are my findings.

In [1]:
import numpy  as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams["figure.figsize"] = (20, 7)

## Some data preparation

In [2]:
df = pd.read_csv("../input/train.csv")

Let's replace missing `param_*` values with a blank string. By default `groupby` drops rows whose key is `NaN`, so this keeps them in the counts.

In [3]:
params = ["param_1", "param_2", "param_3"]

df[params] = df[params].fillna("")
df[params].isnull().any()
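To see why the fill matters, here is a minimal sketch on a made-up frame (the `toy` data is hypothetical, not taken from `train.csv`). It compares the default `groupby` behaviour with the blank-string fill and with `dropna=False`, which I believe is available in pandas 1.1 and later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two rows with a param, two rows without one.
toy = pd.DataFrame({
    "param_1": ["a", "a", np.nan, np.nan],
    "deal_probability": [0.0, 0.5, 0.0, 1.0],
})

# Default behaviour: rows whose group key is NaN are silently dropped.
rows_default = toy.groupby("param_1").size().sum()

# Keeping NaN as its own group (pandas >= 1.1).
rows_keepna = toy.groupby("param_1", dropna=False).size().sum()

# The approach used in this notebook: fill with a blank string first.
rows_filled = toy.fillna({"param_1": ""}).groupby("param_1").size().sum()

print(rows_default, rows_keepna, rows_filled)
```

Filling with `""` has the side benefit that later filters such as `param_2 != ""` work without special NaN handling.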

Next, let's do the group by counts. Note the `item_types` variable here as it will be something we'll use later.

In [4]:
item_types = ["parent_category_name", "category_name"] + params

dfg = df.groupby(item_types + ["deal_probability"]) \
        .size() \
        .reset_index(name="count")
dfg.head()

## Verifying services `param_2` binning

The notebook mentioned above gave us a quick look at the pattern. In this section, let's take a closer look.

In [5]:
services = dfg[dfg["parent_category_name"] == "Услуги"].copy()

First let's calculate the diffs in deal probabilities by item types. This way we can see the pattern without doing mental subtractions.

In [6]:
services["deal_probability_diff"] = services.groupby(item_types)["deal_probability"].diff()
services[services["param_2"] != ""].head(n=15)

For each item type, let's take the:
* `dp_diff` - mean of `deal_probability_diff`. Within a group, every `deal_probability_diff` should equal this mean, ignoring `np.nan` and rounding artifacts
* `N` - number of unique deal probabilities, i.e. the number of bins alijs was talking about

In [7]:
dpdiff_stats = services.groupby(item_types)["deal_probability_diff"].agg(["mean", "size"]).reset_index()
dpdiff_stats.rename(columns={"size": "N", "mean": "dp_diff"}, inplace=True)
dpdiff_stats[dpdiff_stats["param_2"] != ""]

We see that for the most part `dp_diff = 1 / (N - 1)` which is what we expect.
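As a quick sanity check of that relationship, here is a small sketch assuming bins are evenly spaced on [0, 1] and rounded to 5 decimals, as the data appears to be. The helper name `bin_edges` is mine, not from the dataset.

```python
def bin_edges(n_bins, decimals=5):
    """Evenly spaced values from 0 to 1, rounded like deal_probability seems to be."""
    return [round(k / (n_bins - 1), decimals) for k in range(n_bins)]

bins = bin_edges(8)

# Consecutive differences; rounding makes them wobble in the last digit.
diffs = [round(b - a, 5) for a, b in zip(bins, bins[1:])]
mean_diff = sum(diffs) / len(diffs)

print(bins)
print(diffs)
print(mean_diff, 1 / (8 - 1))
```

When every bin from 0 to 1 is present, the diffs telescope to 1, so their mean is 1 / (N - 1) up to rounding.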

There are cases where N is lower than expected. Below is an example where the true N is 8 (`(1 / 0.14286) + 1`).

In [8]:
dpdiff_stats[dpdiff_stats["param_2"] == "Ремонт часов"]

I wonder why `N=8` was chosen when not all bins are used. Maybe it makes sense when also counting the items in `test.csv` (or even the active files)?

In [9]:
dfg[dfg["param_2"] == "Ремонт часов"]

## What about the service items with missing param_2?

So we saw that `deal_probability` in services were binned by `param_2`. However, there are service items with blank `param_2`.

In this section, we'll see how those are binned.

In [10]:
dpdiff_stats[dpdiff_stats["param_2"] == ""]

The answer is they are binned by `param_1`. And if that is absent, then they are binned by `category_name` instead. This is consistent with binning by `param_2` when `param_3` is absent: the bins seem to follow the most specific level that is filled in.
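One way to express this fallback is to pick, for each row, the most specific non-blank level. The sketch below is only my reading of the observation, on a hypothetical toy frame with placeholder values. The `binning_level` column and the `most_specific` helper are illustrative names, not something present in the data.

```python
import pandas as pd

# Ordered from least to most specific.
levels = ["category_name", "param_1", "param_2", "param_3"]

# Hypothetical rows with placeholder values.
toy = pd.DataFrame({
    "category_name": ["A", "A", "A"],
    "param_1":       ["x", "x", ""],
    "param_2":       ["y", "",  ""],
    "param_3":       ["",  "",  ""],
})

def most_specific(row):
    """Return the value of the most specific level that is not blank."""
    for col in reversed(levels):
        if row[col] != "":
            return row[col]
    return ""

toy["binning_level"] = toy.apply(most_specific, axis=1)
print(toy)
```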

The next question is: "Does this hold for other parent categories?"

## Looking at the whole training data

So far we've only been looking at the services parent category. Time to check the rest.


### deal_probability_diff

In [11]:
dfg["deal_probability_diff"] = dfg.groupby(item_types)["deal_probability"].diff()
dfg.head()

That doesn't look promising.

### dp_diff_stats

In [12]:
dp_diff_stats = dfg.groupby(item_types)["deal_probability_diff"].agg(["mean", "size"]).reset_index()
dp_diff_stats.rename(columns={"size": "N", "mean": "dp_diff"}, inplace=True)
dp_diff_stats.head(n=15)

Two observations:
* There are many more bins here.
* `dp_diff` and `N` don't seem to match

It could be that what we observed for services doesn't apply to other parent categories. Or, as we've seen before, we're not seeing a lot of the bins and the N we counted is less than the true N. Let's proceed as if the pattern in services does apply to other parent categories.

### expected_N

Let's calculate:
* `expected_df_diff` - the `dp_diff` that we expect given the `N` that we see
* `expected_N` - the `N` that we expect given the `dp_diff` that we see

In [20]:
dp_diff_stats.dropna(inplace=True)
dp_diff_stats["expected_df_diff"] = 1 / (dp_diff_stats["N"]  - 1)
dp_diff_stats["expected_N"] = (1 / dp_diff_stats["dp_diff"]) + 1
dp_diff_stats.head(n=15)

Our `expected_N` is always greater than or equal to `N`.

In [21]:
(dp_diff_stats["expected_N"] >= dp_diff_stats["N"]).all()

This means that a lot of the time not all bins are used and `N` should be larger than we counted.

## Calculating true N for an example

To explain further what I was trying to say in the previous section, let's look at an example.

In [22]:
example = dfg[(dfg["category_name"] == "Игры, приставки и программы") & (dfg["param_1"] == "")]
example

We see that the diffs are inconsistent and we don't even have a `deal_probability` equal to 1!

We only see 5 bins here. Clearly, that's not the true N. Let's try to work out the true N. We can do this by trying out different N values and calculating their bins until we find something that matches the bins we already see.

### Calculating bins from N

So, first let's have a way to calculate bins given N:

In [23]:
def calculate_bins(N):
    return [
        round(n / float(N - 1), 5)
        for n in range(N)
    ]

def print_bins(N):
    print(f"For N={N}: {calculate_bins(N)}")

print_bins(5)

### Try N=7 and N=8

Okay, so what N values do we try? Let's start with the `expected_N` in `dp_diff_stats`.

In [24]:
dp_diff_stats[
    (dp_diff_stats["category_name"] == "Игры, приставки и программы") &
    (dp_diff_stats["param_1"]       == "")
]

In [25]:
print_bins(7)
print_bins(8)

Not very helpful. Those aren't enough bins. We need to be smarter than this.

### GCD / LCM

What we're trying to achieve feels like a GCD / LCM problem. Let's try to frame the problem this way.

Say we're getting multiples of 25 from 0 to 100. That would be [0, 25, 50, 75, 100]. Those multiples are like bins for N = 5 and diff = 25.

What if we're given a series of multiples but with missing values, say [12, 20, 48, 84, 96, 100]? We don't even know how many values are missing or where they sit in the series. How do we go about finding them?

To start with, we know the series always starts at 0 and ends at 100, so now we have [0, 12, 20, 48, 84, 96, 100].

Ah! We just have to find the gcd of the whole list!

In [213]:
import math

def gcd_arr(arr):
    diff = arr[0]
    for num in arr[1:]:
        diff = math.gcd(diff, num)
    return diff

diff = gcd_arr([0, 12, 20, 48, 84, 96, 100])
diff

From diff, we can now calculate N and generate the whole series.

In [222]:
%pprint
# N counts the bins, both endpoints included, so it is one more than the number of steps
print(f"N = {100 // diff + 1}")
print(f"Series: {list(range(0, 101, diff))}")

Okay, so now how do we apply this to our original puzzle?

### Going back

This will be hard as `deal_probability` is a float and has most probably been rounded. We can't easily find the GCD for rounded floats.
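To illustrate the difficulty, here is a sketch that assumes 5-decimal rounding and tries the naive route: scale the probabilities to integers and take the GCD. It uses synthetic bins for N = 8, not real data.

```python
import math
from functools import reduce

# Synthetic bins for N = 8, rounded to 5 decimals.
true_bins = [round(k / 7, 5) for k in range(8)]

# Scale to integers so math.gcd can be applied.
scaled = [round(p * 100_000) for p in true_bins]
gcd_scaled = reduce(math.gcd, scaled)

print(scaled)
print(gcd_scaled)
```

The exact step would be 100000 / 7, which is not an integer, so after rounding the scaled values share no useful common divisor and the GCD collapses to 1.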

Actually, scratch that. I hope you enjoyed the last subsection's math lesson because I won't be using it and instead I'll just brute force for N:

In [46]:
from tqdm import tqdm

def brute_force_N(probs):
    for N in tqdm(range(2, 100000)):
        # a set makes each membership test O(1) instead of scanning a list
        bins = set(calculate_bins(N))

        # this `in` comparison could fail because of rounding issues
        # use math.isclose instead?
        if all(prob in bins for prob in probs):
            return N

brute_force_N([0, 0.06322, 0.18389, 0.34615, 0.55800, 0.76786])

That's a large N.
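The rounding concern noted in the comment of `brute_force_N` could be handled with a tolerance-based check. The following is a sketch under the assumption that values are rounded to 5 decimals. `fits` and `smallest_N` are hypothetical helper names, and I have not run this on the competition data. Instead of building every bin, it snaps each probability to its nearest bin index and checks whether that bin rounds back to the observed value.

```python
import math

def fits(prob, N, decimals=5):
    """True if prob matches some bin k / (N - 1) after rounding."""
    k = round(prob * (N - 1))                 # index of the nearest candidate bin
    candidate = round(k / (N - 1), decimals)  # that bin, rounded like the data
    return math.isclose(candidate, prob, abs_tol=10 ** -(decimals + 1))

def smallest_N(probs, max_N=100_000):
    """Smallest N in [2, max_N) whose bins contain every value in probs, else None."""
    for N in range(2, max_N):
        if all(fits(p, N) for p in probs):
            return N
    return None

# Synthetic check: a few of the N = 8 bins, with the rest missing.
print(smallest_N([0, 0.14286, 0.42857, 1.0]))
```

Each candidate N costs one check per probability rather than generating N bins, so the search stays cheap even for a large upper bound.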

So there are 2 possibilities we must consider:
1. The binning pattern we saw in services does not apply to other parent categories
2. The binning pattern we saw in services does apply but other parent categories just have large N and we are not seeing most of the bins

I'm not sure which one is correct.

## Other observations

For extra points

In [40]:
aggs = []
for col in [
    "region",
    "city",
    "parent_category_name",
    "category_name",
    "param_1"
]:
    agg = df.groupby(col)["deal_probability"].agg(["nunique", "count"]).reset_index(drop=True)
    agg["column"] = col
    aggs.append(agg)

aggs = pd.concat(aggs)
aggs.head()

In [41]:
sns.lmplot(
    data=aggs,
    x="count", y="nunique", col="column",
    fit_reg=False, sharex=False, sharey=False
);

For location-based columns, there is a visible relationship between the number of items and the number of unique `deal_probability` values. The more items, the more likely a city/region is going to have item types with more bins I guess.

No relation is seen for the columns describing the item's type. From what we know, the number of unique `deal_probability` values is related to the number of subcategories it has.



Next we look at region's item count vs unique `deal_probability` count scatterplot again but this time we separate for each `parent_category_name`.

In [57]:
agg = df.groupby(["region", "parent_category_name"])["deal_probability"].agg(["nunique", "count"]).reset_index()
agg.head()

In [61]:
sns.lmplot(
    data=agg,
    x="count", y="nunique", col="parent_category_name",
    fit_reg=False, sharex=False, sharey=False
);

Trending up is expected but what's with that 2nd to the last graph?