# Popularity model

This model does not recommend to a user similar cars which that user interacted with - it simply recommends the most popular cars tthat the user has not previously interacted with. As the popularity accounts for the "wisdom of the crowds", it usually provides good recommendations, generally interesting for most people.

Reference article: https://www.kaggle.com/gspmoreira/recommender-systems-in-python-101

In [214]:
import numpy as np
import pandas as pd
import math
import random
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

In [215]:
cars = pd.read_csv('CarIds.csv')
users = pd.read_csv('users.csv')
cars.drop('Unnamed: 0', axis=1, inplace=True)
users['car_id'] = users['carId']
users.drop('carId',axis=1, inplace=True)

In [216]:
cars.drop_duplicates(subset = ['car_id'], inplace=True)

In [217]:
cars

Unnamed: 0,Brend,Cena,Godiste,Gorivo,Karoserija,Kilometraza,Kubikaza,Model,Snaga,car_id
0,ALFA ROMEO,2150,2007,Dizel,Hecbek,215000,1910,147,120,147_2007
1,ALFA ROMEO,2850,2006,Dizel,Hecbek,222000,1910,147,150,147_2006
2,ALFA ROMEO,1850,2004,Dizel,Limuzina,178000,1910,147,116,147_2004
4,ALFA ROMEO,1700,2002,Dizel,Hecbek,272000,1900,147,116,147_2002
5,ALFA ROMEO,2000,2005,Dizel,Kupe,189500,1910,147,150,147_2005
...,...,...,...,...,...,...,...,...,...,...
34819,VOLKSWAGEN,3100,2003,Dizel,Monovolumen (MiniVan),225000,1896,TOURAN,101,TOURAN_2003
34833,VOLKSWAGEN,5490,2010,Benzin,Monovolumen (MiniVan),184921,1398,TOURAN,140,TOURAN_2010
34846,VOLKSWAGEN,9700,2015,Metan CNG,Monovolumen (MiniVan),237000,1400,TOURAN,150,TOURAN_2015
34862,VOLKSWAGEN,10000,2014,Dizel,Monovolumen (MiniVan),225000,1596,TOURAN,105,TOURAN_2014


In [218]:
cars['car_id'].value_counts()

PRIUS +_2014    1
407_2006        1
120_2010        1
A3_2003         1
530_2005        1
               ..
QASHQAI_2013    1
TWINGO_2011     1
PRIUS +_2016    1
LAGUNA_2001     1
CIVIC_2010      1
Name: car_id, Length: 1266, dtype: int64

In [219]:
interactions = pd.merge(cars, users[['car_id','user_id']], how = 'inner', on = 'car_id')

In [220]:
interactions['eventStrength'] = 1

In [221]:
interactions

Unnamed: 0,Brend,Cena,Godiste,Gorivo,Karoserija,Kilometraza,Kubikaza,Model,Snaga,car_id,user_id,eventStrength
0,ALFA ROMEO,2150,2007,Dizel,Hecbek,215000,1910,147,120,147_2007,283,1
1,ALFA ROMEO,2150,2007,Dizel,Hecbek,215000,1910,147,120,147_2007,325,1
2,ALFA ROMEO,2150,2007,Dizel,Hecbek,215000,1910,147,120,147_2007,326,1
3,ALFA ROMEO,2150,2007,Dizel,Hecbek,215000,1910,147,120,147_2007,411,1
4,ALFA ROMEO,2850,2006,Dizel,Hecbek,222000,1910,147,150,147_2006,281,1
...,...,...,...,...,...,...,...,...,...,...,...,...
18970,VOLKSWAGEN,13200,2017,Dizel,Monovolumen (MiniVan),149341,1598,TOURAN,116,TOURAN_2017,2513,1
18971,VOLKSWAGEN,13200,2017,Dizel,Monovolumen (MiniVan),149341,1598,TOURAN,116,TOURAN_2017,2514,1
18972,VOLKSWAGEN,13200,2017,Dizel,Monovolumen (MiniVan),149341,1598,TOURAN,116,TOURAN_2017,2577,1
18973,VOLKSWAGEN,13200,2017,Dizel,Monovolumen (MiniVan),149341,1598,TOURAN,116,TOURAN_2017,2579,1


Grpupby users and cars by their Ids

In [222]:
users_interactions_count_df = interactions.groupby(['user_id', 'car_id','Model','Godiste','Cena']).size().groupby('user_id').size()
print('# users: %d' % len(users_interactions_count_df))
users_with_enough_interactions_df = users_interactions_count_df[users_interactions_count_df >= 5].reset_index()[['user_id']]
print('# users with at least 5 interactions: %d' % len(users_with_enough_interactions_df))

# users: 4278
# users with at least 5 interactions: 2007


Dealing with cold-start problem in recommender systems, in which is hard to provide personalized recommendations for user with none or a very few number of interactions. 

In [223]:
print('# of interactions: %d' % len(interactions))
interactions_from_selected_users_df = interactions.merge(users_with_enough_interactions_df, 
               how = 'right',
               left_on = 'user_id',
               right_on = 'user_id')
print('# of interactions from users with at least 5 interactions: %d' % len(interactions_from_selected_users_df))

# of interactions: 18975
# of interactions from users with at least 5 interactions: 10976


If we had different types of events this function provides a weighted sum of interaction for event strength and apply a log transformation to smooth the distribution. Here it doesn't do anything significant due to having only one type of interaction.

In [224]:
def smooth_user_preference(x):
    return math.log(1+x, 2)
    
interactions_full_df = interactions_from_selected_users_df \
                    .groupby(['user_id', 'car_id','Model','Godiste'])['eventStrength'].sum() \
                    .apply(smooth_user_preference).reset_index()
print('# of unique user/item interactions: %d' % len(interactions_full_df))
interactions_full_df['car_id'] = interactions_full_df['car_id'].apply(lambda x: int(str(hash(x))[0:5]))
interactions_full_df.head(10)

# of unique user/item interactions: 10976


Unnamed: 0,user_id,car_id,Model,Godiste,eventStrength
0,2,-3508,156,2001,1.0
1,2,90646,GOLF 4,2002,1.0
2,2,48140,PUNTO,2002,1.0
3,2,-7113,XSARA,2001,1.0
4,2,-7969,YPSILON,2002,1.0
5,6,20909,206,2000,1.0
6,6,10629,206,2001,1.0
7,6,94354,FOCUS,2002,1.0
8,6,48140,PUNTO,2002,1.0
9,6,-8915,TWINGO,2002,1.0


## Evaluation

We are using here a simple cross-validation approach named holdout, in which a random data sample (30% in this case) are kept aside in the training process, and exclusively used for evaluation. All evaluation metrics reported here are computed using the test set.

In [225]:
interactions_train_df, interactions_test_df = train_test_split(interactions_full_df,
                                   stratify=interactions_full_df['user_id'], 
                                   test_size=0.3,
                                   random_state=42)

print('# interactions on Train set: %d' % len(interactions_train_df))
print('# interactions on Test set: %d' % len(interactions_test_df))

# interactions on Train set: 7683
# interactions on Test set: 3293


We are using here **Top-N accuracy metrics**, which evaluates the accuracy of the top recommendations provided to a user, comparing to the items the user has actually interacted in test set. 

In [226]:
#Indexing by personId to speed up the searches during evaluation
interactions_full_indexed_df = interactions_full_df.set_index('user_id')
interactions_train_indexed_df = interactions_train_df.set_index('user_id')
interactions_test_indexed_df = interactions_test_df.set_index('user_id')

In [227]:
def get_items_interacted(person_id, interactions_df):
    # Get the user's data and merge in the movie information.
    interacted_items = interactions_df.loc[person_id]['car_id']
    return set(interacted_items if type(interacted_items) == pd.Series else [interacted_items])

In [228]:
cars['car_id'] = cars['car_id'].apply(lambda x: int(str(hash(x))[0:5]))

This evaluation method works as follows:

- For each user
    - For each item the user has interacted in test set
        - Sample 100 other items the user has never interacted.   
            Ps. Here we naively assume those non interacted items are not relevant to the user, which might not be true, as the user may simply not be aware of those not interacted items. But let's keep this assumption.
        - Ask the recommender model to produce a ranked list of recommended items, from a set composed one interacted item and the 100 non-interacted ("non-relevant!) items
        - Compute the Top-N accuracy metrics for this user and interacted item from the recommendations ranked list
    - Aggregate the global Top-N accuracy metrics

In [229]:
#Top-N accuracy metrics consts
EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS = 100

class ModelEvaluator:


    def get_not_interacted_items_sample(self, person_id, sample_size, seed=42):
        interacted_items = get_items_interacted(person_id, interactions_full_indexed_df)
        all_items = set(cars['car_id'])
        non_interacted_items = all_items - interacted_items

        random.seed(seed)
        non_interacted_items_sample = random.sample(non_interacted_items, sample_size)
        return set(non_interacted_items_sample)

    def _verify_hit_top_n(self, item_id, recommended_items, topn):        
            try:
                index = next(i for i, c in enumerate(recommended_items) if c == item_id)
            except:
                index = -1
            hit = int(index in range(0, topn))
            return hit, index

    def evaluate_model_for_user(self, model, person_id):
        #Getting the items in test set
        interacted_values_testset = interactions_test_indexed_df.loc[person_id]
        if type(interacted_values_testset['car_id']) == pd.Series:
            person_interacted_items_testset = set(interacted_values_testset['car_id'])
        else:
            person_interacted_items_testset = set([int(interacted_values_testset['car_id'])])  
        interacted_items_count_testset = len(person_interacted_items_testset) 

        #Getting a ranked recommendation list from a model for a given user
        person_recs_df = model.recommend_items(person_id, 
                                               items_to_ignore=get_items_interacted(person_id, 
                                                                                    interactions_train_indexed_df), 
                                               topn=10000000000)

        hits_at_5_count = 0
        hits_at_10_count = 0
        #For each item the user has interacted in test set
        for item_id in person_interacted_items_testset:
            #Getting a random sample (100) items the user has not interacted 
            #(to represent items that are assumed to be no relevant to the user)
            non_interacted_items_sample = self.get_not_interacted_items_sample(person_id, 
                                                                          sample_size=EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS, 
                                                                          seed=item_id%(2**32))

            #Combining the current interacted item with the 100 random items
            items_to_filter_recs = non_interacted_items_sample.union(set([item_id]))

            #Filtering only recommendations that are either the interacted item or from a random sample of 100 non-interacted items
            valid_recs_df = person_recs_df[person_recs_df['car_id'].isin(items_to_filter_recs)]                    
            valid_recs = valid_recs_df['car_id'].values
            #Verifying if the current interacted item is among the Top-N recommended items
            hit_at_5, index_at_5 = self._verify_hit_top_n(item_id, valid_recs, 5)
            hits_at_5_count += hit_at_5
            hit_at_10, index_at_10 = self._verify_hit_top_n(item_id, valid_recs, 10)
            hits_at_10_count += hit_at_10

        #Recall is the rate of the interacted items that are ranked among the Top-N recommended items, 
        #when mixed with a set of non-relevant items
        recall_at_5 = hits_at_5_count / float(interacted_items_count_testset)
        recall_at_10 = hits_at_10_count / float(interacted_items_count_testset)

        person_metrics = {'hits@5_count':hits_at_5_count, 
                          'hits@10_count':hits_at_10_count, 
                          'interacted_count': interacted_items_count_testset,
                          'recall@5': recall_at_5,
                          'recall@10': recall_at_10}
        return person_metrics

    def evaluate_model(self, model):
        #print('Running evaluation for users')
        people_metrics = []
        for idx, person_id in enumerate(list(interactions_test_indexed_df.index.unique().values)):
            #if idx % 100 == 0 and idx > 0:
            #    print('%d users processed' % idx)
            person_metrics = self.evaluate_model_for_user(model, person_id)  
            person_metrics['_person_id'] = person_id
            people_metrics.append(person_metrics)
        print('%d users processed' % idx)

        detailed_results_df = pd.DataFrame(people_metrics) \
                            .sort_values('interacted_count', ascending=False)
        
        global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        
        global_metrics = {'modelName': model.get_model_name(),
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10}    
        return global_metrics, detailed_results_df
    
model_evaluator = ModelEvaluator()   

In [230]:
interactions_full_df

Unnamed: 0,user_id,car_id,Model,Godiste,eventStrength
0,2,-3508,156,2001,1.0
1,2,90646,GOLF 4,2002,1.0
2,2,48140,PUNTO,2002,1.0
3,2,-7113,XSARA,2001,1.0
4,2,-7969,YPSILON,2002,1.0
...,...,...,...,...,...
10971,4264,70704,AVENSIS,2017,1.0
10972,4264,69772,GOLF 7,2017,1.0
10973,4264,-5519,INSIGNIA,2017,1.0
10974,4264,51006,PRIUS +,2017,1.0


In [231]:
#Computes the most popular items
item_popularity_df = interactions_full_df.groupby(['car_id','Model','Godiste'])['eventStrength'].sum().sort_values(ascending=False).reset_index()
item_popularity_df.head(10)

Unnamed: 0,car_id,Model,Godiste,eventStrength
0,-5519,INSIGNIA,2017,81.0
1,-8006,A6,2013,80.0
2,-7053,PASSAT B8,2015,72.0
3,-7805,520,2012,72.0
4,80434,A4,2008,72.0
5,90802,A6,2012,70.0
6,14822,PASSAT B8,2017,67.0
7,69772,GOLF 7,2017,66.0
8,-8240,PASSAT B8,2016,62.0
9,67727,GOLF 7,2014,60.0


### Popularity model class

In [232]:
class PopularityRecommender:
    
    MODEL_NAME = 'Popularity'
    
    def __init__(self, popularity_df, items_df=None):
        self.popularity_df = popularity_df
        self.items_df = items_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def recommend_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
        # Recommend the more popular items that the user hasn't seen yet.
        recommendations_df = self.popularity_df[~self.popularity_df['car_id'].isin(items_to_ignore)] \
                               .sort_values('eventStrength', ascending = False) \
                               .head(topn)

        if verbose:
            if self.items_df is None:
                raise Exception('"items_df" is required in verbose mode')

            recommendations_df = recommendations_df.merge(self.items_df, how = 'left', 
                                                          left_on = 'car_id', 
                                                          right_on = 'car_id')[['eventStrength', 'car_id']]


        return recommendations_df
    
popularity_model = PopularityRecommender(item_popularity_df, interactions_full_df)

In [233]:
print('Evaluating Popularity recommendation model...')
pop_global_metrics, pop_detailed_results_df = model_evaluator.evaluate_model(popularity_model)
print('\nGlobal metrics:\n%s' % pop_global_metrics)
pop_detailed_results_df.head(10)

Evaluating Popularity recommendation model...
2006 users processed

Global metrics:
{'modelName': 'Popularity', 'recall@5': 0.24263589432128757, 'recall@10': 0.415122988156696}


Unnamed: 0,hits@5_count,hits@10_count,interacted_count,recall@5,recall@10,_person_id
1003,0,0,2,0.0,0.0,2565
1731,1,2,2,0.5,1.0,2566
1001,1,2,2,0.5,1.0,3861
998,1,1,2,0.5,0.5,2413
997,0,1,2,0.0,0.5,3977
996,1,2,2,0.5,1.0,1655
995,0,0,2,0.0,0.0,2044
994,1,1,2,0.5,0.5,1064
993,0,0,2,0.0,0.0,2629
992,0,1,2,0.0,0.5,4251


Here we have evaluation metrics for Popularity model. It achived the **Recall@5** of 0.2426, which means that about **24%** of interacted items in test set were ranked by Popularity model among the top-5 items (for lists with 100 random cars). And **Recall@10** was higher (**41%**), as expected.

In [241]:
popularity_model.recommend_items(69,topn=10)

Unnamed: 0,car_id,Model,Godiste,eventStrength
0,-5519,INSIGNIA,2017,81.0
1,-8006,A6,2013,80.0
2,-7053,PASSAT B8,2015,72.0
3,-7805,520,2012,72.0
4,80434,A4,2008,72.0
5,90802,A6,2012,70.0
6,14822,PASSAT B8,2017,67.0
7,69772,GOLF 7,2017,66.0
8,-8240,PASSAT B8,2016,62.0
9,67727,GOLF 7,2014,60.0
