# Building the Recommender System (Jaccard Similarity)

### Two Classical Recommendation Methods

- **Similar people** (**Collaborative Filtering**)
    - If you like the same 5 movies as someone else, you'll likely enjoy other movies they like.
    - There are two main types: (a) Find users who are similar and recommend what they like (**user-based**), or (b) recommend items that are similar to already-liked items (**item-based**).
   

- **Similar items** (**Content-based Filtering**)
    - If you enjoy certain characteristics of movies (e.g. certain actors, genre, etc.), you'll enjoy other movies with those characteristics.
    - This can be done with standard machine learning methods. Each movie is decomposed into features, and a model is fit for each user; the model can be a binary classifier (e.g. "LIKE"/"DISLIKE") or a regressor (e.g. predicting a star rating).
    
### Types of Similarity
1. **PEARSON CORRELATION:** This is the most commonly used measure. It captures the linear correlation between two rating vectors. The Pearson correlation coefficient (PCC) takes a value between -1 and +1: -1 represents a perfect negative correlation, +1 a perfect positive correlation, and 0 no linear relationship. For the user-based algorithm, Pearson correlation is given in Table 1.


2. **CONSTRAINED PEARSON CORRELATION:** Constrained Pearson correlation uses the median of the rating scale instead of each user's average rating over the co-rated items. Here the median of the scale is 3. For the user-based algorithm, Constrained Pearson correlation is given in Table 1.


3. **COSINE SIMILARITY:** This is another widely used measure in collaborative filtering. Cosine similarity quantifies how closely two vectors are related by the cosine of the angle between them. For the user-based algorithm, Cosine similarity is given in Table 1. Its major drawback is that it treats null (missing) preferences as negative preferences.


4. **ADJUSTED COSINE SIMILARITY:** Plain cosine similarity ignores the fact that different users use the rating scale differently. Adjusted cosine similarity corrects for this by subtracting each user's average rating. It differs slightly from Pearson correlation: Pearson subtracts the user's average rating over the co-rated items only, while adjusted cosine subtracts the user's average rating over all the items that user has rated. For the user-based algorithm, Adjusted cosine similarity is given in Table 1.


5. **JACCARD SIMILARITY:** Jaccard similarity takes into account the number of preferences two users have in common. It ignores the absolute ratings and considers only which items were rated, so two users are more similar the more rated items they share. For the user-based algorithm, Jaccard similarity is given in Table 1. Jaccard takes only a limited number of distinct values, which makes it hard to distinguish between users.


6. **MEAN SQUARED DIFFERENCES:** MSD considers the absolute ratings rather than the number of common ratings. For the user-based algorithm, MSD similarity is given in Table 1. Several measures have been proposed that combine other measures with Jaccard similarity, such as JMSD and JPSS. JMSD combines Jaccard and MSD similarity; for the user-based algorithm, JMSD similarity is given in Table 1.


7. **PIP SIMILARITY:** PIP stands for Proximity, Impact, Popularity. The Proximity factor computes the arithmetic difference between two ratings and also accounts for whether the ratings agree, penalizing ratings in disagreement. Two ratings are in agreement if they lie on the same side of the median of the rating scale. The Impact factor represents how strongly an item is preferred or disliked: if two users both rate an item 1, that signals a stronger dislike than if they both rate it 3. The Popularity factor is computed relative to the item's average rating across all users, and gives a higher similarity when the average of the two ratings is far from that item average. If the two users' average rating differs greatly from the average of all users' ratings, the two ratings carry more information about the similarity of the two users. For the user-based algorithm, PIP similarity is given in Table 1.

    PIP was developed from a domain-specific interpretation of user ratings on products, to overcome the weakness of traditional similarity and distance measures under new-user cold-start conditions, where it performed well. However, PIP penalizes both Proximity and Impact when ratings disagree, which can sometimes give a misleading picture of the similarity between similar users and between dissimilar users.

    To overcome this problem, an improved measure based on the sigmoid function, called PSS, was developed. PSS stands for Proximity, Significance, Singularity. Proximity is computed as in PIP. Significance is based on the median of the rating scale: ratings farther from the median are considered more significant. Singularity captures how different the two ratings are from the other ratings. For the user-based algorithm, PSS similarity is given in Table 1. PSS can be combined with the Jaccard measure to give JPSS, which uses an improved version of Jaccard similarity; for the user-based algorithm, JPSS is given in Table 1.


8. **NEW HEURISTIC SIMILARITY MODEL (NHSM):** This method takes PIP as its initial heuristic. NHSM combines the JPSS and User Rating Preference similarity measures, where User Rating Preference is based on the mean and variance of each user's ratings. For the user-based algorithm, the NHSM similarity measure is given in Table 1. NHSM produces values between 0 and 1. By modelling each user's preference through the mean and spread of their ratings, it accounts for the fact that different users use the rating scale differently.


9. **SPEARMAN RANK CORRELATION:** Spearman rank correlation uses ranks instead of ratings to calculate similarity. For the user-based algorithm, Spearman correlation is given in Table 1. It does not work well for weak (partial) orderings, which occur whenever there are at least two items in the ranking such that neither is preferred over the other. When the system ranks items that the user rated equally at different levels, the Spearman correlation is penalized for every such pair, even though the user should not care how the system orders items they rated at the same level. Kendall’s Tau suffers from the same problem.


10. **KENDALL’S TAU CORRELATION:** This is also a rank-based correlation measure. Kendall’s correlation uses relative ranks instead of ratings to calculate similarity, so the resulting correlation of the rankings is independent of the actual rating values. It suffers from the same problem with tied ratings as Spearman correlation.
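
To make the difference between Pearson, constrained Pearson and adjusted cosine (items 1, 2 and 4 above) concrete, here is a minimal sketch in plain Python. It is not the notebook's recommender: the dictionaries `alice` and `bob` are made-up toy ratings, the helper names are mine, and the median of 3 assumes a 1 to 5 rating scale. All three measures share the same formula and differ only in what gets subtracted from each rating.

```python
import math


def _centered_similarity(u, v, center_u, center_v):
    """Correlation-style similarity over co-rated items for given per-user centres."""
    common = sorted(set(u) & set(v))
    du = [u[i] - center_u for i in common]
    dv = [v[i] - center_v for i in common]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
    return num / den if den else 0.0


def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def pearson(u, v):
    """Centre each user on their mean rating over the co-rated items only."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    return _centered_similarity(u, v,
                                _mean(u[i] for i in common),
                                _mean(v[i] for i in common))


def constrained_pearson(u, v, median=3):
    """Centre both users on the median of the rating scale (assumed 1 to 5)."""
    return _centered_similarity(u, v, median, median)


def adjusted_cosine(u, v):
    """Centre each user on their mean rating over all items they rated."""
    return _centered_similarity(u, v, _mean(u.values()), _mean(v.values()))


# Toy ratings (item -> stars); items A, B and C are co-rated
alice = {'A': 5, 'B': 3, 'C': 4, 'D': 1}
bob = {'A': 4, 'B': 2, 'C': 3, 'E': 5}

for name, func in [('pearson', pearson),
                   ('constrained pearson', constrained_pearson),
                   ('adjusted cosine', adjusted_cosine)]:
    print(name, round(func(alice, bob), 3))
```

On these toy ratings the two users move in lockstep on the co-rated items, so Pearson is close to 1, while the other two centring choices give noticeably lower values.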

Source: https://pdfs.semanticscholar.org/943a/e455fafc3d36ae4ce68f1a60ae4f85623e2a.pdf
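
As a rough illustration of how Jaccard and MSD (items 5 and 6 above) complement each other, the sketch below multiplies the two to get a JMSD-style score. The exact formulas are in Table 1 of the source; here I assume one common formulation in which rating differences are normalized by the rating range (taken to be 1 to 5) and MSD similarity is one minus the mean squared difference. The function names and toy ratings are my own.

```python
def jaccard_users(u, v):
    """Share of rated items that the two users have in common (ratings ignored)."""
    union = set(u) | set(v)
    return len(set(u) & set(v)) / len(union) if union else 0.0


def msd_similarity(u, v, r_min=1, r_max=5):
    """1 - mean squared difference of normalized ratings over co-rated items."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    span = r_max - r_min
    msd = sum(((u[i] - v[i]) / span) ** 2 for i in common) / len(common)
    return 1.0 - msd


def jmsd(u, v, r_min=1, r_max=5):
    """Jaccard (how many items are shared) times MSD (how close the ratings are)."""
    return jaccard_users(u, v) * msd_similarity(u, v, r_min, r_max)


# Toy ratings (item -> stars); 3 co-rated items out of 5 distinct items
alice = {'A': 5, 'B': 3, 'C': 4, 'D': 1}
bob = {'A': 4, 'B': 2, 'C': 3, 'E': 5}

print(jaccard_users(alice, bob))
print(msd_similarity(alice, bob))
print(jmsd(alice, bob))
```

Jaccard alone only sees that 3 of 5 items are shared, and MSD alone only sees that the shared ratings differ by one star each; the product reflects both.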

In [2]:
# Import standard libraries
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import sqlite3
from collections import Counter
from time import sleep

sns.set_style('whitegrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

## 1. User Based Collaborative Filtering (Jaccard Similarity) THIS IS BAD

This recommender uses the products each customer has reviewed to perform user-based collaborative filtering with Jaccard similarity. The workflow is as follows:

1. Take a sample from the dataset and build a small recommender system.
2. Scale that recommender system to the whole dataset.

This would fall under **other users also reviewed**.


## 1.1. Sample Recommender

In [3]:
# Read in Data
sqlite_db = 'amzn_vg_clean.db'
conn = sqlite3.connect(sqlite_db) 

query = '''
SELECT "customer_id", "product_title" 
FROM video_games
LIMIT 2000
'''

amzn_rec_1 = pd.read_sql(query, con=conn)

In [4]:
# Put each product for a customer into a set
amzn_sample = amzn_rec_1.groupby('customer_id')['product_title'].apply(lambda x: set(x))
amzn_sample.head()

customer_id
165189     {Sly Cooper And The Thievius Raccoonus - PlayS...
758930     {Karaoke Revolution Glee: Volume 3, The Witche...
918349     {PlayStation 3 Dualshock 3 Wireless Controller...
1037887    {Nintendo Super Smash Bros. White Classic Game...
1085641    {The Witcher 3: Wild Hunt, Fantasy Zone, J Sta...
Name: product_title, dtype: object

In [5]:
# Count how many times each product appears in the entire dataset
prod_freq_sample = Counter([prod for prod in amzn_rec_1['product_title'].values])
sorted(prod_freq_sample.items(), key=lambda x: x[1], reverse=True)[:10]

[('DISNEY INFINITY Figure', 27),
 ('Disney INFINITY Disney Infinity: Marvel Super Heroes (2.0 Edition) Characters',
  19),
 ('The Last of Us', 8),
 ('Super Smash Brothers', 7),
 ('Grand Theft Auto IV', 7),
 ('Until Dawn  - PlayStation 4 [Digital Code]', 7),
 ('Batman Arkham Origins', 7),
 ('Bloodborne', 6),
 ('Battlefield 4', 6),
 ('Call of Duty: Black Ops II', 6)]

**Recommender Functions**

In [6]:
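def jaccard(set1, set2):
    '''
    Jaccard similarity measures how closely related 2 sets are.
    Function takes in 2 sets and returns the jaccard similarity.
    Jaccard similarity = (size of intersection of set1 and set2) / (size of union of set1 and set2)
    '''
    # The ratio is undefined when both sets are empty (the original `Inf` was an
    # undefined name); return 0.0 so two empty sets are never ranked as similar
    if (len(set1) == 0) and (len(set2) == 0):
        return 0.0
    
    return len(set1 & set2) / len(set1 | set2)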
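
A quick worked example of the ratio, using made-up customers rather than rows from the dataset. The compact `jaccard` below is redefined only so the snippet runs on its own; treating two empty sets as 0.0 is my assumption.

```python
def jaccard(set1, set2):
    """Size of the intersection divided by the size of the union."""
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0


# Hypothetical customers and the products they reviewed
alice = {'Destiny', 'Bloodborne', 'Dying Light'}
bob = {'Destiny', 'Bloodborne', 'Tomb Raider'}
carol = {'Zumba Fitness'}

sim_ab = jaccard(alice, bob)    # 2 shared products out of 4 distinct -> 0.5
sim_ac = jaccard(alice, carol)  # nothing shared -> 0.0
print(sim_ab, sim_ac)
```

Because most customers review only a handful of products, the score can only take a few distinct values (0, 1/4, 1/3, 1/2, ...), which is the weakness of Jaccard noted in the list of similarity measures.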

In [7]:
def user_cf_jacc_recs(product_set, df, prod_freq):
    '''
    Function takes in set of products and returns top recommendations.
    '''
    # Sort the other customers' product sets by jaccard similarity to product_set
    # (ascending, so the most similar customer ends up last in the list)
    similarity_list = sorted(df.items(), 
                             key=lambda x: jaccard(product_set, x[1]))
    
    # Dictionary to house recommendations and their weighted scores
    recommended_products = {}
    
    # Stop once there are at least 5 candidate products, or when no customers are left
    # (the extra check avoids popping from an empty list)
    while len(recommended_products) < 5 and similarity_list:
        
        # .pop() removes and returns the last item, i.e. the remaining customer with the
        # highest jaccard similarity, so it will not be repeated in the next iteration
        highest_jacc = similarity_list.pop()[1]
        
        for product in (highest_jacc - product_set):
            
            # If key hasn't been added yet, set to 0
            recommended_products.setdefault(product, 0)
            
            # Increment by the inverse product frequency
            # We want products that are not so popular that everyone has reviewed them
            # The higher the inverse product frequency, the more unique the product
            recommended_products[product] += 1 / prod_freq[product] 
    
#     # Print to see the whole list of recommendations
#     print()
#     for item in sorted(recommended_products.items(), key=lambda kv: kv[1]):
#         print(item)
#     print()
    
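    # Note: this sorts in ascending order, so the lowest-scoring products come first;
    # pass reverse=True to rank the most unique products first
    return sorted(recommended_products.items(), key=lambda x: x[1])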
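
For comparison, here is a small non-interactive variant of the same idea that can be run without the database. It is a sketch, not the function above: it uses a fixed number of neighbours `k` instead of collecting neighbours until five candidates exist, and it sorts scores in descending order. The `recommend` name, its parameters and the `customers` data are all made up for illustration.

```python
from collections import Counter


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(product_set, customer_sets, k=3, top_n=3):
    """Score unseen products from the k most similar customers by inverse frequency."""
    # How many customers reviewed each product
    prod_freq = Counter(p for products in customer_sets.values() for p in products)

    # The k customers whose product sets are most similar to product_set
    neighbours = sorted(customer_sets.values(),
                        key=lambda s: jaccard(product_set, s),
                        reverse=True)[:k]

    scores = {}
    for products in neighbours:
        for product in products - product_set:
            scores[product] = scores.get(product, 0) + 1 / prod_freq[product]

    # Highest score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


customers = {
    101: {'Destiny', 'Halo 3', 'Crysis 3'},
    102: {'Destiny', 'Halo 3', 'Evolve'},
    103: {'Zumba Fitness', 'Evolve'},
    104: {'Destiny', 'Crysis 3'},
}

recs = recommend({'Destiny', 'Halo 3'}, customers)
print(recs)
```

In this toy data 'Crysis 3' is suggested by two of the three nearest customers and 'Evolve' by one, so 'Crysis 3' ranks first.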
    

In [8]:
def user_cf_jacc_prompt(df, prod_freq):
    '''
    Function that prompts the user for products to select.
    Products will be put into a set which will be used to get recommendations based on jaccard similarity of users.
    '''
    
    print('Please choose one or more of the following:\n')
    sleep(1)
    
    # Available games for user
    top_suggestions = [i[0] for i in sorted(prod_freq.items(), key=lambda x: x[1], reverse=True)][:20]
    
    # Print available games for user
    for ind, prod in enumerate(top_suggestions):
        print(str(ind+1)+'. ', prod)
    
    # Product set
    product_set = set()
    
    # Stop prompt when user inputs 'Done'
    print('\nIf you are done, type "Done".')
    while True:
        prompt = input('Type in your selection here: ')
        if prompt == 'Done':
            break
        product_set.add(prompt)
    
    # Get recommendations
    recs = user_cf_jacc_recs(product_set, df, prod_freq)
    
    # Print user's selections
    print('\nBased on your selections:')
    for ind, prod in enumerate(product_set):
        print(str(ind+1)+'. ', prod)
    
    # Print recommendations
    print('\nWe suggest:')
    top5 = [rec[0] for rec in recs[:5]]
    for ind, rec in enumerate(top5):
        print(str(ind+1)+'. ', rec)
    
    return recs

In [9]:
user_cf_sample = user_cf_jacc_prompt(amzn_sample, prod_freq_sample)

Please choose one or more of the following:

1.  DISNEY INFINITY Figure
2.  Disney INFINITY Disney Infinity: Marvel Super Heroes (2.0 Edition) Characters
3.  The Last of Us
4.  Super Smash Brothers
5.  Grand Theft Auto IV
6.  Until Dawn  - PlayStation 4 [Digital Code]
7.  Batman Arkham Origins
8.  Bloodborne
9.  Battlefield 4
10.  Call of Duty: Black Ops II
11.  Dying Light
12.  Call of Duty: Ghosts
13.  Assassin's Creed 4
14.  Tomb Raider
15.  The Elder Scrolls Online
16.  The Witcher 3: Wild Hunt
17.  Batman: Arkham Knight
18.  Disney INFINITY 2.0 Edition
19.  Grand Theft Auto V
20.  Infinity 3.0 Starter Pack

If you are done, type "Done".
Type in your selection here: The Elder Scrolls Online
Type in your selection here: Done

Based on your selections:
1.  The Elder Scrolls Online

We suggest:
1.  The Last of Us
2.  Until Dawn  - PlayStation 4 [Digital Code]
3.  South Park:  The Stick of Truth
4.  XCOM: Enemy Within
5.  Sleeping Dogs: Definitive Edition


In [10]:
user_cf_sample

[('The Last of Us', 0.125),
 ('Until Dawn  - PlayStation 4 [Digital Code]', 0.14285714285714285),
 ('South Park:  The Stick of Truth', 0.25),
 ('XCOM: Enemy Within', 0.5),
 ('Sleeping Dogs: Definitive Edition', 0.5),
 ('Sony PlayStation 4 PS4 Dualshock 4 Wireless Control', 1.0),
 ('SmaAcc Cooling Fan with Dual Charging Station for PS4 Controllers/DualShock 4 Controllers',
  1.0),
 ('Turtle Beach - Ear Force P11 Amplified Stereo Gaming Headset', 1.0),
 ('Hitman: Absolution', 1.0)]

## 1.2. Scale to Whole Dataset

In [46]:
# Read in Data
sqlite_db = 'amzn_vg_clean.db'
conn = sqlite3.connect(sqlite_db) 

query = '''
SELECT "customer_id", "product_title" 
FROM video_games
'''

amzn_rec_2 = pd.read_sql(query, con=conn)

In [47]:
# Change sample of data to whole dataset
whole = amzn_rec_2.groupby('customer_id')['product_title'].apply(lambda x: set(x))
whole.head()


customer_id
10206    {Sonic Generations, Cabela's Big Game Hunter H...
11049    {Assassin's Creed III: Liberation, Killzone: S...
12605    {Skylanders Trap Team: Traps, Super Mario Gala...
13087    {Gears of War: Ultimate Edition - Xbox One, Ba...
14476    {Nintendo NES RF Adapter, Punch Out - Nintendo...
Name: product_title, dtype: object

In [49]:
# Count how many times each product appears in the entire dataset
prod_freq_whole = Counter([prod for prod in amzn_rec_2['product_title'].values])
sorted(prod_freq_whole.items(), key=lambda x: x[1], reverse=True)[:10]

[('Call of Duty: Ghosts', 1763),
 ('Grand Theft Auto V', 1681),
 ('PlayStation 4 500GB Console [Old Model]', 1569),
 ("Assassin's Creed 4", 1354),
 ('The Last of Us', 1325),
 ('Elder Scrolls V: Skyrim', 1122),
 ('Battlefield 3', 1079),
 ('Destiny', 1051),
 ("Assassin's Creed III", 979),
 ('Battlefield 4', 972)]

In [53]:
user_cf_whole = user_cf_jacc_prompt(whole, prod_freq_whole)


Please choose one or more of the following:

1.  Call of Duty: Ghosts
2.  Grand Theft Auto V
3.  PlayStation 4 500GB Console [Old Model]
4.  Assassin's Creed 4
5.  The Last of Us
6.  Elder Scrolls V: Skyrim
7.  Battlefield 3
8.  Destiny
9.  Assassin's Creed III
10.  Battlefield 4
11.  Call of Duty: Black Ops II
12.  Grand Theft Auto IV
13.  Watch Dogs
14.  DISNEY INFINITY Figure
15.  Mass Effect 3
16.  Nintendo Amiibo
17.  Call of Duty: Modern Warfare 2
18.  Call of Duty 4: Modern Warfare
19.  PlayStation 3 Dualshock 3 Wireless Controller (Black)
20.  Call of Duty: Advanced Warfare

If you are done, type "Done".
Type in your selection here: Destiny
Type in your selection here: Call of Duty: Black Ops II
Type in your selection here: Call of Duty: Modern Warfare 2
Type in your selection here: Done

Based on your selections:
1.  Call of Duty: Modern Warfare 2
2.  Destiny
3.  Call of Duty: Black Ops II

We suggest:
1.  Watch Dogs
2.  Red Dead Redemption
3.  Red Dead Redemption Game of the 

## 2. Item-Item Collaborative Filtering (Jaccard Similarity)

This recommender uses the set of customers who reviewed each product to perform item-based collaborative filtering with Jaccard similarity. The workflow is as follows:

1. Take a sample from the dataset.
2. Build the co-occurrence matrix.
3. Build a small recommender system.
4. Scale that recommender system to the whole dataset.

This would fall under **similar items**.


## 2.1. Sample Recommender

In [11]:
# Utility Matrix
utilmat = amzn_rec_1.groupby(['customer_id','product_title']).size().unstack().fillna(0)
utilmat.head()

product_title,.hack// fragment [Japan Import],007 Legends,007 Racing,007 The World is Not Enough,10 Pak Clear Cartridge Cases For DS Games,100 All-Time Favorites - Nintendo DS,1001 Touch Games - Nintendo DS,16GB PlayStation Vita Memory Card,2013 R4 SDHC 3DS DSi Micro SD Memory Storage Adapter Card for All 3DS/DSi/XL/NDS/DS SDHC,24 the Game - PlayStation 2,...,Zuma - PC,Zumba Fitness,Zumba Fitness Rush - Xbox 360,[Upgraded Version]Ultra-comfort White & Red Wireless Bluetooth Ps3 Controller for the Playstation 3 Console By Avalid,dreamGEAR Rainbow Stylus Pack 5 Precision Styluses Perfect for New 3DS XL - Nintendo 3DS,dreamGEAR TriMount,eForCity 6-Pack Anti Scratch Reusable Screen Protector Compatible With Nintendo Wii U Gamepad Remote Controller,eXuby® Killer Red Direct Bluetooth Wireless Controller Compatible with Sony PS3 and Playstation 3 (6-Axis and Twin-Shock),inFAMOUS Second Son - PlayStation 4,niceEshop For Ps2 to Ps3/pc USB Game Controller Converter Adapter
customer_id
165189,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
758930,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
918349,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1037887,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1085641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
# Users per product title in a set
prod_customers = amzn_rec_1.groupby('product_title')['customer_id'].apply(lambda x: set(x))
prod_customers.head()

product_title
.hack// fragment [Japan Import]                        {12918195}
007 Legends                                            {11300578}
007 Racing                                             {39984534}
007 The World is Not Enough                  {10589108, 39984534}
10 Pak Clear Cartridge Cases For DS Games              {10434531}
Name: customer_id, dtype: object

In [13]:
# Co-occurrence matrix
coocc_sample = utilmat.T.dot(utilmat)
np.fill_diagonal(coocc_sample.values, 0)

# Convert values in co-occurrence matrix to jaccard similarities between products
for col in coocc_sample.columns:
    new_col = []
    for ind in coocc_sample.index:
        set1 = prod_customers[col]
        set2 = prod_customers[ind]
        jacc_sim = jaccard(set1, set2)
        new_col.append(jacc_sim)
    coocc_sample[col] = new_col

coocc_sample.head()

product_title,.hack// fragment [Japan Import],007 Legends,007 Racing,007 The World is Not Enough,10 Pak Clear Cartridge Cases For DS Games,100 All-Time Favorites - Nintendo DS,1001 Touch Games - Nintendo DS,16GB PlayStation Vita Memory Card,2013 R4 SDHC 3DS DSi Micro SD Memory Storage Adapter Card for All 3DS/DSi/XL/NDS/DS SDHC,24 the Game - PlayStation 2,...,Zuma - PC,Zumba Fitness,Zumba Fitness Rush - Xbox 360,[Upgraded Version]Ultra-comfort White & Red Wireless Bluetooth Ps3 Controller for the Playstation 3 Console By Avalid,dreamGEAR Rainbow Stylus Pack 5 Precision Styluses Perfect for New 3DS XL - Nintendo 3DS,dreamGEAR TriMount,eForCity 6-Pack Anti Scratch Reusable Screen Protector Compatible With Nintendo Wii U Gamepad Remote Controller,eXuby® Killer Red Direct Bluetooth Wireless Controller Compatible with Sony PS3 and Playstation 3 (6-Axis and Twin-Shock),inFAMOUS Second Son - PlayStation 4,niceEshop For Ps2 to Ps3/pc USB Game Controller Converter Adapter
product_title
.hack// fragment [Japan Import],1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
007 Legends,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
007 Racing,0.0,0.0,1.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
007 The World is Not Enough,0.0,0.0,0.5,1.0,0.0,0.0,0.0,0.0,0.0,0.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10 Pak Clear Cartridge Cases For DS Games,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
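
The nested loop above compares every product with every other product, although most pairs share no customers at all. One way to avoid that, sketched below in plain Python, is to count co-occurrences per customer and then apply the identity J(a, b) = both / (n_a + n_b - both). This is an alternative I am suggesting, not what the notebook does; `pairwise_jaccard` and the toy `customers` dictionary are hypothetical, and pairs that never co-occur are simply absent from the result (their similarity is 0).

```python
from collections import Counter
from itertools import combinations


def pairwise_jaccard(customer_sets):
    """Jaccard similarity for every pair of products that share at least one customer."""
    n_customers = Counter()  # product -> number of customers who reviewed it
    co_counts = Counter()    # (product_a, product_b) -> customers who reviewed both

    for products in customer_sets.values():
        n_customers.update(products)
        # Sorting gives each pair a single canonical key (a < b)
        co_counts.update(combinations(sorted(products), 2))

    return {
        (a, b): both / (n_customers[a] + n_customers[b] - both)
        for (a, b), both in co_counts.items()
    }


customers = {
    101: {'Destiny', 'Halo 3', 'Crysis 3'},
    102: {'Destiny', 'Halo 3', 'Evolve'},
    103: {'Zumba Fitness', 'Evolve'},
    104: {'Destiny', 'Crysis 3'},
}

sims = pairwise_jaccard(customers)
for pair, value in sorted(sims.items()):
    print(pair, round(value, 3))
```

The input has the same shape as `amzn_sample.to_dict()` would give (customer id mapped to a set of product titles), so the work grows with the number of products each customer reviewed rather than with the square of the catalogue size.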


**Recommender Functions**

In [66]:
def item_cf_jacc_recs(product_set, coocc):
    '''
    Function that takes in a list of selected products and the co-occurrence (jaccard) matrix.
    Returns each product's mean jaccard similarity to the selected products, sorted in descending order
    '''
    recs = coocc[product_set].apply(lambda x: sum(x)/len(product_set), axis=1).sort_values(ascending=False)
        
    return recs
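
To see what the averaging does, here is the same scoring rule on a tiny hand-made similarity table, written with plain dictionaries so it runs without pandas. The similarity values and the `mean_similarity_scores` helper are invented for illustration only.

```python
def mean_similarity_scores(selected, sim):
    """Rank unselected products by their mean similarity to the selected products.

    sim maps product -> {other product -> similarity}; missing pairs count as 0.
    """
    scores = {}
    for candidate in sim:
        if candidate in selected:
            continue
        total = sum(sim[candidate].get(s, 0.0) for s in selected)
        scores[candidate] = total / len(selected)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy symmetric similarity table
sim = {
    'Destiny':  {'Halo 3': 0.5, 'Crysis 3': 0.25},
    'Halo 3':   {'Destiny': 0.5, 'Crysis 3': 0.75},
    'Crysis 3': {'Destiny': 0.25, 'Halo 3': 0.75},
    'Evolve':   {'Halo 3': 0.25},
}

ranked = mean_similarity_scores(['Destiny', 'Halo 3'], sim)
print(ranked)
```

A product that is similar to only one of the selections (here 'Evolve') has its score diluted by the zero from the other selection, which is why scores drop as more games are selected.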

In [67]:
def item_cf_jacc_prompt(coocc, prod_freq):
    '''
    Function that prompts the user for products to select.
    Products will be put into a set which will be used to get recommendations based on jaccard similarity of products.
    '''
    print('Please choose one or more of the following:\n')
    sleep(1)
    
    # Available games for user
    top_suggestions = [i[0] for i in sorted(prod_freq.items(), key=lambda x: x[1], reverse=True)][:20]    
    for ind, prod in enumerate(top_suggestions):
        print(str(ind+1)+'. ', prod)
    
    # Product set - Stop prompt when user inputs 'Done'
    product_set = []
    print('\nIf you are done, type "Done".')
    while True:
        prompt = input('Type in your selection here: ')
        if prompt == 'Done':
            break
        product_set.append(prompt)
    
    # Get recommendations
    recs = item_cf_jacc_recs(product_set, coocc).drop(index=product_set)
    
    # Print user's selections
    print('\nBased on your selections:')
    for ind, prod in enumerate(product_set):
        print(str(ind+1)+'. ', prod)
    
    # Print top recommendations
    print('\nWe suggest:')
    top5 = [rec for rec in recs[0:5].index]
    for ind, rec in enumerate(top5):
        print(str(ind+1)+'. ', rec)
    
    return recs

In [76]:
# One game
one_prod = item_cf_jacc_prompt(coocc_sample, prod_freq_sample)

Please choose one or more of the following:

1.  DISNEY INFINITY Figure
2.  Disney INFINITY Disney Infinity: Marvel Super Heroes (2.0 Edition) Characters
3.  The Last of Us
4.  Super Smash Brothers
5.  Grand Theft Auto IV
6.  Until Dawn  - PlayStation 4 [Digital Code]
7.  Batman Arkham Origins
8.  Bloodborne
9.  Battlefield 4
10.  Call of Duty: Black Ops II
11.  Dying Light
12.  Call of Duty: Ghosts
13.  Assassin's Creed 4
14.  Tomb Raider
15.  The Elder Scrolls Online
16.  The Witcher 3: Wild Hunt
17.  Batman: Arkham Knight
18.  Disney INFINITY 2.0 Edition
19.  Grand Theft Auto V
20.  Infinity 3.0 Starter Pack

If you are done, type "Done".
Type in your selection here: Call of Duty: Ghosts
Type in your selection here: Done

Based on your selections:
1.  Call of Duty: Ghosts

We suggest:
1.  Call of Duty: Black Ops II
2.  Xbox LIVE 12 Month Gold Membership Card
3.  Battlefield 3
4.  Crysis 3
5.  Halo 3


In [77]:
one_prod.head(15)

product_title
Call of Duty: Black Ops II                                                       0.333333
Xbox LIVE 12 Month Gold Membership Card                                          0.333333
Battlefield 3                                                                    0.333333
Crysis 3                                                                         0.285714
Halo 3                                                                           0.285714
Evolve                                                                           0.285714
Wolfenstein: The New Order                                                       0.250000
Gears of War 3                                                                   0.250000
Minecraft - Xbox 360                                                             0.222222
Dying Light                                                                      0.222222
Battlefield Hardline                                                             0.222

In [74]:
# Five games
five_games = item_cf_jacc_prompt(coocc_sample, prod_freq_sample)

Please choose one or more of the following:

1.  DISNEY INFINITY Figure
2.  Disney INFINITY Disney Infinity: Marvel Super Heroes (2.0 Edition) Characters
3.  The Last of Us
4.  Super Smash Brothers
5.  Grand Theft Auto IV
6.  Until Dawn  - PlayStation 4 [Digital Code]
7.  Batman Arkham Origins
8.  Bloodborne
9.  Battlefield 4
10.  Call of Duty: Black Ops II
11.  Dying Light
12.  Call of Duty: Ghosts
13.  Assassin's Creed 4
14.  Tomb Raider
15.  The Elder Scrolls Online
16.  The Witcher 3: Wild Hunt
17.  Batman: Arkham Knight
18.  Disney INFINITY 2.0 Edition
19.  Grand Theft Auto V
20.  Infinity 3.0 Starter Pack

If you are done, type "Done".
Type in your selection here: The Last of Us
Type in your selection here: Grand Theft Auto IV
Type in your selection here: Batman Arkham Origins
Type in your selection here: Assassin's Creed 4
Type in your selection here: The Witcher 3: Wild Hunt
Type in your selection here: Done

Based on your selections:
1.  The Last of Us
2.  Grand Theft Auto IV


In [75]:
five_games.head(15)

product_title
Until Dawn  - PlayStation 4 [Digital Code]              0.198718
New Super Mario Bros. U                                 0.169444
Watch Dogs                                              0.125253
Dying Light                                             0.122626
Ratchet and Clank: Into the Nexus - PS3                 0.120238
Halo 2                                                  0.120238
Battlefield Hardline                                    0.118182
Donkey Kong Country Tropical Freeze - Nintendo Wii U    0.116667
BioShock Infinite                                       0.110000
Saints Row the Third                                    0.107143
Assassin's Creed Unity                                  0.104365
Kingdom Hearts HD 1.5 Remix                             0.104365
PlayStation 4 500GB Console [Old Model]                 0.100000
Halo 3                                                  0.097222
Crysis 3                                                0.097222
dtype: floa

## 2.2. Scale to Whole Dataset

In [75]:
# Utility Matrix
utilmat = amzn_rec_2.groupby(['customer_id','product_title']).size().unstack().fillna(0)
utilmat.head()

product_title,"! Aikatsu Cinderella lessons (inclusion benefits: First shipped with original card ""Carddass and phase data!"") [Japan Import]","$100 Money PS3 Controller Shell Full Assembly Housing Hydrodipped (start, select, d-pad, square, triangle, circle, x, thumbsticks, r1/r2/l1/l2 buttons mod kit)","$100,000 Pyramid - PC",(25) Empty Standard XBOX 360 Translucent Green Replacement Games Boxes / Cases - VGBR14XBOX,(25) Standard Black Nintendo DS Empty Replacement Game Cases Boxes VGBR14DSBK,(4) Wii Fit Balance Board Replacement Foot Leg Extensions,(5 PACK) ULTIMATE Xbox 360 Rapid Fire Mod Kit (4-Mode) COD MW2- All games,(5) Empty Standard BLUE 10MM Replacement Boxes / Cases with out logo for Playstation Vita Games - VGBR10VIBL,(5) Empty Standard XBOX 360 Translucent Green Replacement Games Boxes / Cases #DVBR14XBOX,(5) White Nintendo 3DS Standard Replacement Game Boxes,...,はじめの一歩 THE FIGHTING!,ゴッドイーター2 レイジバースト アクセサリーセット for PlayStation (R) Vita,ボクと魔王,ローゼンメイデン ヴェヘゼルン ジー ヴェルト アップ (限定版) (シリアルナンバー入りオリジナル懐中時計 同梱),得点王 NCD 【NEOGEO】,痛快GANGAN行進曲 NCD 【NEOGEO】,蝶の毒 華の鎖~大正艶恋異聞~,電脳大戦 ~DroneZ~,餓狼伝説2 NCD 【NEOGEO】,麻雀狂列伝 NCD 【NEOGEO】
customer_id
10206,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11049,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12605,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13087,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14476,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Users per product title in a set
prod_customers = amzn_rec_2.groupby('product_title')['customer_id'].apply(lambda x: set(x))
prod_customers.head()

In [None]:
# Co-occurrence matrix
coocc_whole = utilmat.T.dot(utilmat)
np.fill_diagonal(coocc_whole.values, 0)

# Convert values in co-occurrence matrix to jaccard similarities between products
for col in coocc_whole.columns:
    new_col = []
    for ind in coocc_whole.index:
        set1 = prod_customers[col]
        set2 = prod_customers[ind]
        jacc_sim = jaccard(set1, set2)
        new_col.append(jacc_sim)
    coocc_whole[col] = new_col

coocc_whole.head()

In [None]:
coocc_whole.shape

In [None]:
item_cf_whole = item_cf_jacc_prompt(coocc_whole, prod_freq_whole)

In [None]:
item_cf_whole

## Other Ideas


## Content Based Filtering 

This second recommender will compare products to each other and compute a cosine similarity between them. Customers tend to like certain aspects of a game or gaming console, such as the graphics, gameplay, replayability, storytelling, etc. Through natural language processing, I will attempt to extract these features from the reviews in the data.
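
As a first rough sketch of that idea (not the final approach), the snippet below counts mentions of a few aspect keywords in each product's reviews and compares products by the cosine of the resulting vectors. Everything here is an assumption for illustration: the `ASPECTS` keyword list, the prefix matching, and the three made-up products and review texts. A real version would need proper text processing.

```python
import math
import re
from collections import Counter

# Hypothetical aspect keywords; a word counts if it starts with one of these
ASPECTS = ['graphics', 'gameplay', 'story', 'replay']


def aspect_vector(reviews):
    """Count how often each aspect keyword is mentioned across a product's reviews."""
    words = Counter(re.findall(r'[a-z]+', ' '.join(reviews).lower()))
    return [sum(count for word, count in words.items() if word.startswith(aspect))
            for aspect in ASPECTS]


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up review snippets, for illustration only
reviews = {
    'Game A': ['Great graphics and a gripping story.', 'The story kept me hooked.'],
    'Game B': ['Stunning graphics, strong story.'],
    'Game C': ['Fun gameplay with lots of replay value.'],
}

vectors = {title: aspect_vector(texts) for title, texts in reviews.items()}
print(vectors)
print(round(cosine(vectors['Game A'], vectors['Game B']), 3))
print(cosine(vectors['Game A'], vectors['Game C']))
```

Products whose reviews talk about the same aspects end up with a high cosine similarity, while products with no aspects in common score 0.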

In [None]:
interrupt

In [None]:
# Sample of the data
sample = amzn_rec_1[:1000]
sample = pd.get_dummies(sample)

# Replace 'product_title' in column names
sample.columns = [i.replace('product_title_','') for i in sample.columns]

# Groupby customer id to get the products reviewed per customer
sample_clean = sample.groupby('customer_id').sum()

# Replace zeros with NaN first, so that we can mean center the data
sample_clean = sample_clean.mask(sample_clean == 0)

# Mean center the data then fill with 0s
sample_clean = sample_clean.apply(lambda x: x - np.mean(x), axis=1).fillna(0)
print(sample_clean.shape)
sample_clean.head()

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity matrix
cosims = cosine_similarity(sample_clean)
cosims.shape

In [None]:
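def get_recommendations():
    df = sample_clean.reset_index()
    # DataFrame.append was removed in pandas 2.0, so add the new customer with pd.concat
    new_customer = pd.DataFrame([{'customer_id': 1, '007 Racing': 2, '1001 Touch Games - Nintendo DS': 1}])
    df = pd.concat([df, new_customer], ignore_index=True)
    df.set_index('customer_id', inplace=True)
    df = df.fillna(0)
    
    return df

get_recommendations()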
