# Recommendation: Item-based Collaborative Filtering and Wilson score

This algorithm takes historical view, cart, and purchase data as input and outputs the “related” products for each item, each with a corresponding score.

## I. Wilson score
First, let's look at what the Wilson score is and why we use it.

#### PROBLEM: 
You have a wedding invitation, and you decide to go to the Zalora website to buy a new dress. Unfortunately, your computer cannot display the images of the dresses, so you have to rely on other customers' ratings to decide which dress to buy.

**SOLUTION #1:** Score = (Positive ratings) - (Negative ratings)

Dress A (new arrival): 10 positive votes, 0 negative votes => score = 10. <br />
Dress B: 60 positive votes, 49 negative votes => score = 11. <br />
Should you choose dress B over dress A because B's score of 11 beats A's score of 10? What about the 49 negative votes on dress B? <br />
*The problem with this score is that it depends too much on the number of votes.* <br />

Something feels wrong with score = (Positive ratings) - (Negative ratings), so you change the way you calculate the score.

**SOLUTION #2:** Score = Average rating = (Positive ratings) / (Total ratings)

Dress C (new arrival): 1 positive vote, 0 negative votes => score = 100%. <br />
Dress D: 9 positive votes, 1 negative vote => score = 90%. <br />
Should you choose dress C over dress D because C's score of 100% beats D's score of 90%? <br />
*The problem with this score is that an item with only a few votes can easily jump to the top.* <br />
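
The two naive scores can be reproduced in a few lines. This is only an illustrative sketch: the function names `difference_score` and `average_score` and the `dresses` dictionary are introduced here and are not part of the original notebook.

```python
# (positive votes, negative votes) for the four dresses in the examples
dresses = {"A": (10, 0), "B": (60, 49), "C": (1, 0), "D": (9, 1)}

def difference_score(pos, neg):
    """Solution #1: positive ratings minus negative ratings."""
    return pos - neg

def average_score(pos, neg):
    """Solution #2: fraction of positive ratings (0.0 when there are no votes)."""
    total = pos + neg
    return pos / total if total else 0.0

for name, (pos, neg) in dresses.items():
    print(name, difference_score(pos, neg), average_score(pos, neg))
```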

After trying both scores, you realize you want a reliable, trustworthy score that overcomes the weaknesses of the previous two. <br />
So you come up with the question:<br /> 
**"Given the ratings I have, there is a really high chance (~95% - 100%) that the "real" fraction of positive ratings is at least what?"**
And you find that the Wilson score gives you exactly what you want ;)

In [26]:
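# Lower bound of the Wilson score interval at 95% confidence (z = 1.96).
import math
def wilson95(p, total):
    # p: number of positive ratings, total: number of all ratings
    if p <= 0 or total <= 0:
        return 0
    p = float(p)
    total = float(total)
    # 1.9208 = z^2 / 2, 0.9604 = z^2 / 4, 3.8416 = z^2
    return (p + 1.9208 - 1.96 * math.sqrt(p - p * p / total + 0.9604)) / (3.8416 + total)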
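
The constants in `wilson95` come from the lower bound of the Wilson score interval with $z = 1.96$ (95% confidence). Writing $n$ for the total number of ratings, $pos$ for the number of positive ratings, and $\hat{p} = pos / n$ for the observed positive fraction (symbols introduced here for illustration), the bound is:

$$
W = \frac{\hat{p} + \frac{z^2}{2n} - z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}
$$

Multiplying the numerator and denominator by $n$ gives the form used in the code, where $z^2/2 = 1.9208$, $z^2/4 = 0.9604$ and $z^2 = 3.8416$:

$$
W = \frac{pos + \frac{z^2}{2} - z\sqrt{pos - \frac{pos^2}{n} + \frac{z^2}{4}}}{n + z^2}
$$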

Let's calculate the scores of dresses A, B, C, and D mentioned above. <br />
Dress A: positive = 10, negative = 0 <br />
Dress B: positive = 60, negative = 49 <br />
Dress C: positive = 1, negative = 0 <br />
Dress D: positive = 9, negative = 1 <br />

In [27]:
print("Dress score of dress A is ", wilson95(10, 10 + 0))
print("Dress score of dress B is ", wilson95(60, 60 + 49))
print("Dress score of dress C is ", wilson95(1, 1 + 0))
print("Dress score of dress D is ", wilson95(9, 9 + 1))

Dress score of dress A is  0.7224598312333834
Dress score of dress B is  0.45694046930826504
Dress score of dress C is  0.20654329147389294
Dress score of dress D is  0.595843614502428


So, based on the Wilson score, I recommend dress A: it has the highest lower bound (about 0.72).
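
`wilson95` hard-codes z = 1.96. As a minimal sketch, assuming you also want other confidence levels, the same lower bound can be written with z derived from the standard library's `statistics.NormalDist` (Python 3.8+). The name `wilson_lower_bound`, the `confidence` parameter, and the `dresses` dictionary are illustrative and not part of the original notebook.

```python
import math
from statistics import NormalDist

def wilson_lower_bound(positive, total, confidence=0.95):
    """Lower bound of the Wilson score interval for a given confidence level."""
    if positive <= 0 or total <= 0:
        return 0.0
    # Two-sided interval: z is about 1.96 for confidence = 0.95
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = positive / total
    centre = p_hat + z * z / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total))
    return (centre - margin) / (1 + z * z / total)

# (positive votes, negative votes) for the four dresses
dresses = {"A": (10, 0), "B": (60, 49), "C": (1, 0), "D": (9, 1)}

# Rank dresses from the highest to the lowest lower bound
ranking = sorted(
    dresses,
    key=lambda name: wilson_lower_bound(dresses[name][0], sum(dresses[name])),
    reverse=True,
)
print(ranking)  # ['A', 'D', 'B', 'C']
```

A higher confidence level gives a more conservative (lower) score for the same votes, so items with few ratings are penalized more strongly.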

## II. Item-based Collaborative Filtering
Let pur_sku1 be the set of users who bought product sku1 and pur_sku2 be the set of users who bought product sku2. The score for this strategy is:

<img src="img/rec_Jaccard.png">

where W is the Wilson score.
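
For readers who cannot load the image: judging from the `score` function defined later in this notebook, the formula appears to apply the Wilson score to Jaccard-style counts, that is, the number of co-buyers out of all buyers of either product. A LaTeX rendering under that assumption:

$$
\mathrm{score}(\mathrm{sku1}, \mathrm{sku2}) = W\left(\,\lvert \mathrm{pur\_sku1} \cap \mathrm{pur\_sku2} \rvert,\ \lvert \mathrm{pur\_sku1} \cup \mathrm{pur\_sku2} \rvert\,\right)
$$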

This formula focuses on historical purchase data and gives us a product-association recommendation (i.e. "customers who bought this item also bought these"). We do the same for co-view and co-add-to-cart, then combine the three scores according to the weight assigned to each. This example focuses on the co-buy score.
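
The notebook does not say how the three scores are combined. One plausible reading is a weighted sum, sketched below. The weights, the `combine_scores` helper, and the example values are all hypothetical placeholders, not the values used in production.

```python
# Hypothetical weights: the notebook does not give the real values.
WEIGHTS = {"view": 0.2, "cart": 0.3, "purchase": 0.5}

def combine_scores(scores, weights=WEIGHTS):
    """Weighted sum of per-signal scores; a missing signal counts as 0."""
    return sum(weights[signal] * scores.get(signal, 0.0) for signal in weights)

# Made-up co-view, co-add-to-cart and co-buy scores for one pair of products
example = {"view": 0.30, "cart": 0.10, "purchase": 0.06}
combined = combine_scores(example)
print(combined)
```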

In [28]:
pur_sku1 = set(['user1', 'user2', 'user3', 'user4', 'user5'])
pur_sku2 = set(['user1', 'user2', 'user6', 'user7', 'user8', 'user9'])
pur_sku3 = set(['user9', 'user10', 'user11', 'user12', 'user13'])
pur_sku4 = set(['user11', 'user12', 'user13', 'user4', 'user5'])

As the data shows:
- sku1 and sku2 have 2 co-buyers (user1 and user2) out of 9 distinct buyers, so the score is wilson95(2, 9)
- sku1 and sku3 have 0 co-buyers out of 10 distinct buyers, so the score is wilson95(0, 10)
- ...

In [39]:
def score(A, B):
    # co-buyers (intersection) out of all buyers of either product (union)
    return wilson95(len(A & B), len(A | B))
print('score sku1-sku2 is : ', score(pur_sku1, pur_sku2)) 
print('score sku1-sku3 is : ', score(pur_sku1, pur_sku3)) 
print('score sku1-sku4 is : ', score(pur_sku1, pur_sku4)) 
print('score sku2-sku3 is : ', score(pur_sku2, pur_sku3))
print('score sku2-sku4 is : ', score(pur_sku2, pur_sku4))
print('score sku3-sku4 is : ', score(pur_sku3, pur_sku4)) 

score sku1-sku2 is :  0.06322376231222718
score sku1-sku3 is :  0
score sku1-sku4 is :  0.07147768885802765
score sku2-sku3 is :  0.017875749515721136
score sku2-sku4 is :  0
score sku3-sku4 is :  0.15821692226262676
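
To get from pairwise scores to the "related products for each item" mentioned in the introduction, the pairs can be grouped per SKU and sorted by score. The following is a minimal sketch under that assumption. The `related_products` helper, the `purchases` dictionary, and the `top_n` parameter are illustrative names, not part of the original notebook.

```python
import math
from itertools import combinations

def wilson95(p, total):
    """Lower bound of the 95% Wilson score interval (same formula as above)."""
    if p <= 0 or total <= 0:
        return 0.0
    return (p + 1.9208 - 1.96 * math.sqrt(p - p * p / total + 0.9604)) / (3.8416 + total)

# Purchase sets from the example above
purchases = {
    "sku1": {"user1", "user2", "user3", "user4", "user5"},
    "sku2": {"user1", "user2", "user6", "user7", "user8", "user9"},
    "sku3": {"user9", "user10", "user11", "user12", "user13"},
    "sku4": {"user11", "user12", "user13", "user4", "user5"},
}

def related_products(purchases, top_n=2):
    """Map each SKU to its top_n co-bought SKUs, best score first."""
    related = {sku: [] for sku in purchases}
    for a, b in combinations(purchases, 2):
        s = wilson95(len(purchases[a] & purchases[b]), len(purchases[a] | purchases[b]))
        if s > 0:  # skip pairs with no co-buyers
            related[a].append((b, s))
            related[b].append((a, s))
    return {
        sku: sorted(pairs, key=lambda pair: pair[1], reverse=True)[:top_n]
        for sku, pairs in related.items()
    }

related = related_products(purchases)
for sku, pairs in related.items():
    print(sku, [(other, round(s, 4)) for other, s in pairs])
```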
