# Wilson Confidence Intervals with Amazon Toys

Note: this notebook was inspired by the very influential blog post [**"How Not To Sort By Average Rating"**](https://www.evanmiller.org/how-not-to-sort-by-average-rating.html).

In a previous notebook I described the [Wald confidence interval](https://www.kaggle.com/residentmario/wald-confidence-intervals-with-iowa-liquor-sales/), the simplest and most commonly used estimator for the confidence interval associated with boolean data. In this notebook I will cover an implementation of a much more complicated estimator, the Wilson CI.

The Wilson CI is obnoxiously complicated:

![](https://www.evanmiller.org/images/average-rating/equation.png)

Mathematically, all of this jazz is basically just a correction for certain incorrect assumptions injected by the simple Wald test. These changes particularly make a difference when the number of samples is small, resulting in a more accurate interval overall. However, the derivation is complex.
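
In case the image above does not load, here is the same expression typeset in LaTeX, following the form given in the linked blog post. The symbols are my reading of it: $\hat{p}$ is the observed fraction of positive ratings, $n$ is the total number of ratings, and $z_{\alpha/2}$ is the $(1 - \alpha/2)$ quantile of the standard normal distribution.

$$
\frac{\hat{p} + \dfrac{z_{\alpha/2}^2}{2n} \pm z_{\alpha/2} \sqrt{\dfrac{\hat{p}(1 - \hat{p}) + \dfrac{z_{\alpha/2}^2}{4n}}{n}}}{1 + \dfrac{z_{\alpha/2}^2}{n}}
$$

Note that the division by $1 + z_{\alpha/2}^2 / n$ applies to the whole numerator, so the interval is centered on a value pulled slightly toward 0.5 rather than on $\hat{p}$ itself.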

A good reference on the difference between the Wald and Wilson CIs is [this paper](http://www.ucl.ac.uk/english-usage/staff/sean/resources/binomialpoisson.pdf). In practical data science, you will almost always be fine with just using the Wald test.
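
To make the small-sample difference concrete, here is a minimal sketch comparing the two intervals. It is not taken from either notebook: the function names `wald_interval` and `wilson_interval` and the example counts are made up for illustration, and it uses the standard library's `statistics.NormalDist` instead of SciPy so that it stands alone.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(pos, n, confidence=0.95):
    """Wald interval: p_hat plus or minus z standard errors evaluated at p_hat."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = pos / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - margin, p_hat + margin)


def wilson_interval(pos, n, confidence=0.95):
    """Wilson interval for pos positive outcomes out of n trials."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = pos / n
    denominator = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denominator
    margin = z * sqrt((p_hat * (1 - p_hat) + z * z / (4 * n)) / n) / denominator
    return (center - margin, center + margin)


# Hypothetical (positive, total) counts: tiny unanimous sample,
# large unanimous sample, large evenly split sample.
for pos, n in [(3, 3), (300, 300), (150, 300)]:
    print(pos, n, wald_interval(pos, n), wilson_interval(pos, n))
```

With three positive reviews out of three, the Wald interval collapses to the single point 1.0, claiming certainty from three observations, while the Wilson interval still stretches well below 0.5. For the evenly split sample of 300, the two intervals are nearly identical, which is why the Wald interval is usually good enough once the sample is reasonably large.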

However, the Wilson confidence interval has an interesting application, one most popularly described [here](https://www.evanmiller.org/how-not-to-sort-by-average-rating.html). It turns out that the Wilson CI (in particular, its lower bound) is the ideal formulation of the confidence interval to use for ranking items based on their review scores. It compensates for the "small number of elements" problem: e.g. the fact it's quite likely for a bad product with few reviews to have only good reviews simply by statistical chance.

For this reason the "percentage good reviews" isn't statistically meaningful on its own. In the past, I've gotten around this problem by mandating that the items being reviewed have at least X reviews, where X is an arbitrary threshold that trades off how much statistical noise I accept against how many records I get to use for the algorithm. This is the approach I used in e.g. [Recommending Chess Openings](https://www.kaggle.com/residentmario/recommending-chess-openings/notebook).

This is a problem for a lot of websites as well, as it is essentially the question "how do you go about ranking items in a search result?". The Wilson Confidence Interval is a better way of approaching this problem than simpler but noisier alternatives, like taking `Upvotes - Downvotes` or `Upvotes / All Votes`.
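
As a rough illustration of how the three scores can disagree, here is a small sketch. The three products and their vote counts are invented for the example, and `wilson_lower_bound` is a hypothetical helper name (it uses the standard library's `statistics.NormalDist` rather than SciPy so that it stands alone).

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower_bound(pos, n, confidence=0.95):
    """Lower bound of the Wilson interval for pos positive votes out of n."""
    if n == 0:
        return 0.0  # assumption for this sketch: no votes ranks last
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = pos / n
    center = p_hat + z * z / (2 * n)
    margin = z * sqrt((p_hat * (1 - p_hat) + z * z / (4 * n)) / n)
    return (center - margin) / (1 + z * z / n)


# Hypothetical products: (name, upvotes, downvotes)
products = [("A", 2, 0), ("B", 90, 10), ("C", 600, 400)]

for name, up, down in products:
    n = up + down
    print(
        name,
        "difference:", up - down,
        "ratio:", round(up / n, 3),
        "wilson lower bound:", round(wilson_lower_bound(up, n), 3),
    )
```

`Upvotes - Downvotes` puts C first just because it has the most votes, and `Upvotes / All Votes` puts A first on the strength of two reviews. The Wilson lower bound ranks B first, then C, then A, which matches the intuition that 90 positive votes out of 100 is the strongest evidence of a good product.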

In this notebook I apply the Wilson CI to reviews of Amazon toy products. I am *not* doing the Wilson CI derivation; that's worth a whole other notebook...

## Data munging

In [None]:
import pandas as pd
toys = pd.read_csv("../input/amazon_co-ecommerce_sample.csv")
toys.head()

In [None]:
import numpy as np

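def get_scores(l):
    # Products without reviews come through as NaN instead of a list.
    if not isinstance(l, list):
        return []
    # The score is every fourth "//"-separated field, starting at the second.
    ret = [str(s).strip() for s in l[1::4]]
    try:
        return [float(s) for s in ret]
    except ValueError:
        # Discard all of a product's scores if any field fails to parse.
        return []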
    
review_scores = toys.customer_reviews.str.split("//").map(get_scores).values

In [None]:
review_scores[:5].tolist()

Notice that to apply the Wilson confidence interval we need to simplify this five-star rating system down to a "positive reviews" versus "negative reviews" one. In general, doing this sort of thing will cause you to lose information; however, in an online ratings environment the loss is insubstantial. We can set 3.0 as the cutoff between good and bad reviews, counting only ratings above it (4 stars or more) as positive, based on the advice from [this relevant XKCD comic](https://xkcd.com/1098/).

In [None]:
pos_neg_review_scores = list(
    map(
        lambda sc: [s >= 4 for s in sc], review_scores.tolist()
    )
)

## Implementation

Here is the implementation of the Wilson CI, using the huge formula we stated above:

In [None]:
import scipy.stats as st
import numpy as np

def wilson_confidence_interval(X, c):
    n = len(X)
    if n == 0:
        # No reviews: the interval is undefined.
        return (np.nan, np.nan)

    z_score = st.norm.ppf(1 - ((1 - c) / 2))
    p_hat = np.array(X).astype(int).sum() / n

    # Both the center and the margin are divided by 1 + z^2/n,
    # so the interval is not centered on p_hat itself.
    denominator = 1 + z_score * z_score / n
    center = (p_hat + z_score * z_score / (2 * n)) / denominator
    margin = z_score * np.sqrt((p_hat * (1 - p_hat) + z_score * z_score / (4 * n)) / n) / denominator

    return (center - margin, center + margin)
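
One way to sanity-check an implementation of this formula is to use the fact that the two Wilson bounds are exactly the values of p that sit z standard errors away from the observed proportion when the standard error is evaluated at p itself, i.e. the roots of (p_hat - p)^2 = z^2 p (1 - p) / n. The sketch below is a standalone check and not part of the notebook's pipeline; `wilson_bounds`, `score_residual`, and the 7-out-of-10 example are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wilson_bounds(pos, n, confidence=0.95):
    """Return (lower, upper, z) for the Wilson interval of pos out of n."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = pos / n
    denominator = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denominator
    margin = z * sqrt((p_hat * (1 - p_hat) + z * z / (4 * n)) / n) / denominator
    return center - margin, center + margin, z


def score_residual(p, p_hat, n, z):
    """Zero when p lies exactly z standard errors (evaluated at p) from p_hat."""
    return (p_hat - p) ** 2 - z * z * p * (1 - p) / n


# Hypothetical sample: 7 positive reviews out of 10.
pos, n = 7, 10
lower, upper, z = wilson_bounds(pos, n)
print(lower, upper)
print(score_residual(lower, pos / n, n, z), score_residual(upper, pos / n, n, z))
```

Both residuals should come out as zero up to floating-point error. An implementation that applies the denominator to only part of the expression fails this check.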

## Analysis

In [None]:
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

When applied to a product with an underlying "recommendation ratio" of 0.5, both bounds of the Wilson Confidence Interval converge to 0.5 as the number of reviews grows. This isn't so surprising, as it would be a lousy estimator otherwise.

In [None]:
cis = [wilson_confidence_interval([True]*n + [False]*n, 0.95) for n in range(1, 1000)]

In [None]:
plt.plot(range(1, 1000), np.array(cis)[:, 0])
plt.plot(range(1, 1000), np.array(cis)[:, 1])

The useful part of the Wilson confidence interval is the lower bound, which serves as a good *pessimistic* approximator of how actually recommendable a product is. Per our problem statement, it gets used in particular to account for extremely small-sample products:

In [None]:
plt.plot(range(1, 51), np.array(cis)[:50, 0])
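
The small-sample penalty is easiest to see in a special case that is simple to work out by hand (my own aside, not something the plot shows): a product whose $n$ reviews are all positive, so that $\hat{p} = 1$. Substituting into the Wilson formula, the lower bound simplifies to

$$
\frac{1 + \dfrac{z^2}{2n} - z\sqrt{\dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}
= \frac{1}{1 + \dfrac{z^2}{n}}
= \frac{n}{n + z^2}
$$

At 95% confidence $z \approx 1.96$, so $z^2 \approx 3.84$. A product with a single positive review gets a lower bound of about 0.21, one with five positive reviews about 0.57, and one with fifty positive reviews about 0.93. A perfect record only earns a high score once there are enough reviews to back it up.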

In [None]:
toy_recommendability_cis = np.array(
    [wilson_confidence_interval(pos_neg_review_scores[n], 0.95) for n in range(len(pos_neg_review_scores))]
)

Because so many products have very few reviews, the difference between the upper and lower bounds is generally huge.

In [None]:
pd.Series(toy_recommendability_cis[:, 1] - toy_recommendability_cis[:, 0]).plot.hist(bins=20)

Actually, it looks like this isn't such a good test set after all, as I had neglected to notice that the reviews are only a sample of all of the ones the product received. For example, the following product has 36 reviews listed, but only 8 appear in the reviews included in the dataset. Whoops!

In [None]:
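# Row index of the product with the highest Wilson lower bound.
# (nanargmax over the whole 2-D array would return a flattened index instead.)
best = np.nanargmax(toy_recommendability_cis[:, 0])
toys.iloc[best].number_of_reviews

In [None]:
(len(toys.iloc[best].customer_reviews.split("//")) - 1) / 4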

Nevertheless the point still stands. For a less rushed introduction, read "[How Not to Sort by Average Rating](https://www.evanmiller.org/how-not-to-sort-by-average-rating.html)"!