# Item-Item Similarity Recommender Using Pandas

### Recommender System
A recommendation engine, or recommender system, is a tool that predicts what a user may or may not like among a given list of items. Ratings for items across multiple users are analysed algorithmically and then used to recommend items that the user has not yet seen.

---

### Pandas
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. 

pandas is well suited for many different kinds of data, including:
* Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
* Ordered and unordered (not necessarily fixed-frequency) time series data
* Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
* Any other form of observational / statistical data sets


### MovieLens Dataset
MovieLens data has been compiled by the GroupLens Research group at the University of Minnesota. This MovieLens dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100004 ratings and 1296 tag applications across 9125 movies. These data were created by 671 users between January 09, 1995 and October 16, 2016. This dataset was generated on October 17, 2016.

This and other GroupLens data sets are publicly available for download at http://grouplens.org/datasets/.


## Let’s Get Started

This is an implementation of a recommender system based on the popular item-item collaborative filtering approach. It uses the centered cosine similarity metric, obtained by imputing each item's missing ratings with the mean of the ratings that item has received and then taking a simple Pearson correlation between items.
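To make the link between centered cosine similarity and Pearson correlation concrete, here is a minimal sketch on a made-up item × user matrix. It assumes item-mean imputation, which is what the `fit` method in this tutorial does when it fills each item row; the data and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy item x user matrix (rows: items, columns: users); NaN marks a missing rating.
toy = pd.DataFrame(
    [[5.0, 4.0, np.nan, 1.0],
     [4.0, np.nan, 2.0, 1.0],
     [1.0, 2.0, 5.0, np.nan]],
    index=["item_a", "item_b", "item_c"],
    columns=["u1", "u2", "u3", "u4"],
)
item_means = toy.mean(axis=1)

# Route 1: centre each item on its own mean, treat missing entries as 0, take the cosine.
centered = toy.sub(item_means, axis=0).fillna(0.0)
norms = np.sqrt((centered ** 2).sum(axis=1))
centered_cosine = (centered @ centered.T) / np.outer(norms, norms)

# Route 2: impute each item's missing ratings with its mean, then take Pearson correlation.
imputed = toy.T.fillna(item_means).T
pearson = imputed.T.corr()

# Both routes should give the same item-item similarity matrix.
print(np.allclose(centered_cosine.values, pearson.values))
```

Imputing with the item's own mean leaves that mean unchanged and contributes zero to the centered vector, which is why the two routes agree.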

The input for training the model is a dataframe with columns that represent *UserID, ItemID, and Rating*. The names of these three columns must be passed to `fit` so the model knows which is which.
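As a rough illustration of that input format, the sketch below builds a tiny long-format dataframe and reshapes it into the item × user layout that the model works with internally. The column names (`user`, `movie`, `stars`) and values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, item, rating) triple.
long_df = pd.DataFrame({
    "user": [1, 1, 2, 2, 3],
    "movie": [10, 20, 10, 30, 20],
    "stars": [4.0, 5.0, 3.0, 2.0, 1.0],
})

# One row per item and one column per user; unrated pairs become NaN.
item_by_user = long_df.pivot(index="user", columns="movie", values="stars").transpose()
print(item_by_user)
```

With column names like these, the corresponding training call would be `fit(long_df, user_id='user', item_id='movie', ratings='stars')`.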

In [1]:
import numpy as np
import pandas as pd
import json

class recommender():
    def __init__(self):
        self.item_sim=None
        self.popular_items=None
        self.ratings_df=None
    
    def fit(self, ratings_df, user_id, item_id, ratings):
        assert isinstance(ratings_df, pd.DataFrame)
        # all three named columns must exist (<= rather than < so a 3-column dataframe passes)
        assert {user_id, item_id, ratings} <= set(ratings_df.columns)
        self.ratings_df=ratings_df.copy()
        self.ratings_df=self.ratings_df.rename(columns={user_id: 'user_id', item_id: 'item_id', ratings: 'ratings'})
        ratings_pivot=self.ratings_df.pivot(index='user_id', columns='item_id', values='ratings').transpose()
        
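        # Impute each item's missing ratings with that item's mean rating.
        # (fillna(inplace=True) on a .loc row slice can act on a copy and leave the NaNs in place.)
        item_means=ratings_pivot.mean(axis=1)
        ratings_pivot=ratings_pivot.transpose().fillna(item_means).transpose()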
        
        self.item_frequency=self.ratings_df.dropna()['item_id'].value_counts()
        self.ratings_pivot=ratings_pivot.copy() # remove later...not used
        self.item_sim=ratings_pivot.transpose().corr().copy()
        
        self.min_rating=self.ratings_df['ratings'].min()
        self.max_rating=self.ratings_df['ratings'].max()
        self.find_popular_items()
        
    def find_popular_items(self):
        self.popular_items=self.ratings_df.groupby(['item_id'])['ratings'].mean().sort_values(ascending=False)
    
    def score(self,user_id, item_id, Nmax=20):
        assert Nmax > 1
        
        items_rated_by_user=self.ratings_df[self.ratings_df['user_id']==user_id].dropna()

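        if items_rated_by_user.empty:
            # No rating history for this user: fall back to the item's mean rating,
            # so that score() always returns a rating rather than an item id.
            return self.popular_items[item_id]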
        
        
        item_sim_ratings=pd.DataFrame(self.item_sim.loc[item_id]).reset_index()
        item_sim_ratings.columns=['item_id', 'sim']
        
        df_temp=items_rated_by_user.merge(item_sim_ratings).sort_values('sim', ascending=False).iloc[0:Nmax]
        #retval= np.average(df_temp['ratings'], weights=df_temp['sim'])
        
        #dividing by the sum of absolute similarities compensates for pathological cases where negative correlations dominate
        ret_num = (df_temp['ratings'] * df_temp['sim']).sum()
        ret_den = df_temp['sim'].abs().sum()
        retval= ret_num/(1.0*ret_den)
        
        return np.clip(retval, self.min_rating, self.max_rating)
    
    def items_to_search(self, user_id, k=50):
        items_rated_by_user=self.ratings_df[self.ratings_df['user_id']==user_id].dropna()['item_id']
        items_not_rated_by_user=set(self.ratings_df['item_id'])-set(items_rated_by_user)
        data=[self.item_frequency[i] for i in items_not_rated_by_user]
        topk=pd.Series(data=data, index=items_not_rated_by_user).nlargest(k).index
        
        #return list(items_not_rated_by_user)
        return list(topk)
        
    
    def calculate_all_item_suggestions(self, user_id, max_suggestions=30):
        item_search_list=self.items_to_search(user_id, k=max_suggestions)
        scores={}
        for item_id in item_search_list:
            s= self.score(user_id,item_id, 30) #Nmax=30
            scores[item_id]=s
        return pd.Series(scores)
    
    def reco_topk_items_for_user(self, user_id, k=10, ret_json=False):
        """
        inputs:
            user_id - id of the user for whom recommendations are requested
            k - number of suggestions to return
            ret_json - if True, return the result as a JSON string
        outputs:
            pandas Series (or JSON string) mapping item_id to predicted rating
            for the top k recommended items; -1 if an error occurs
        """
        try:
            retval=self.calculate_all_item_suggestions(user_id).nlargest(k)
            if ret_json:
                return retval.to_json()
            else:
                return retval
        except Exception as exc:
            print('an error has occurred:', exc)
            return -1
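
The `score` method above predicts a rating as a similarity-weighted average of the user's own ratings, dividing by the sum of absolute similarities. The sketch below uses made-up neighbour ratings and similarities to show why this matters when some similarities are negative; the numbers and the 1 to 5 rating range are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical neighbours: items the user has rated, with their similarity to the target item.
neighbours = pd.DataFrame({
    "ratings": [5.0, 4.0, 1.0],
    "sim": [0.8, 0.5, -0.3],
})

numerator = (neighbours["ratings"] * neighbours["sim"]).sum()
signed_denominator = neighbours["sim"].sum()        # what np.average(weights=...) would divide by
abs_denominator = neighbours["sim"].abs().sum()     # what score() divides by

plain_weighted = numerator / signed_denominator
abs_weighted = numerator / abs_denominator

# Keep the prediction inside the assumed rating scale, as score() does with np.clip.
clipped = np.clip(abs_weighted, 1.0, 5.0)
print(plain_weighted, abs_weighted, clipped)
```

With these numbers the signed version comes out at about 5.7, above a 5-star scale, while the absolute-value version gives about 3.56.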

### Step 1
Let’s read in the inputs for training the recommender system, i.e. the MovieLens files. What we really need are the ratings that each user has given to each item (a movie, in this case). This can be expressed as a 3-column table: userid, itemid, rating.

In the MovieLens 100K dataset (an older GroupLens release than the ml-latest-small dataset described above), we are given three files: u.user (user information), u.item (movie information) and u.data (ratings).

We can use pandas to read these files (using pandas.read_csv) and then join the tables to get what we need.
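
To show the read-and-join pattern in isolation, here is a small self-contained sketch that parses in-memory text instead of files. The rows are made-up stand-ins for the tab-separated and pipe-separated layouts, not real MovieLens records, and the `toy_` names are hypothetical.

```python
import io
import pandas as pd

# Made-up stand-ins for a tab-separated ratings file and a pipe-separated items file.
ratings_txt = "1\t10\t5\t881250949\n2\t10\t3\t891717742\n1\t20\t4\t878887116\n"
items_txt = "10|Movie A (1995)\n20|Movie B (1996)\n"

# The files have no header row, so column names are supplied through `names`.
toy_ratings = pd.read_csv(io.StringIO(ratings_txt), sep="\t",
                          names=["user_id", "movie_id", "rating", "unix_timestamp"])
toy_items = pd.read_csv(io.StringIO(items_txt), sep="|",
                        names=["movie_id", "movie title"])

# With no `on=` argument, merge joins on the columns both frames share (here, movie_id).
toy_merged = pd.merge(toy_ratings, toy_items)[["user_id", "movie_id", "rating", "movie title"]]
print(toy_merged)
```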

In [2]:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('/data/u.user', sep='|', names=u_cols,  encoding='latin-1')

#Reading ratings file:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('/data/u.data', sep='\t', names=r_cols,  encoding='latin-1')

#Reading items file:
i_cols = ['movie_id', 'movie title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
items = pd.read_csv('/data/u.item', sep='|', names=i_cols,  encoding='latin-1')

movies100k = pd.merge(pd.merge(ratings, users), items)[['user_id', 'movie_id', 'rating']]

In [5]:
movies100k.shape #there are 100K movie ratings!

(100000, 3)

### Step 2
The recommender object uses the `fit` method to fit a recommendation model. It has a very simple interface: it takes a dataframe (`movies100k` in this case) as input, along with the names of the columns that hold the userid, itemid, and rating.

In [6]:
reco100k=recommender()
reco100k.fit(movies100k, user_id='user_id', item_id='movie_id', ratings='rating')

### Step 3
Now that we have a model, we can use it to make recommendations for any user ID. Note that we are working with “existing users”, i.e. users who have some rating history in the system. The approach can be extended in various ways to serve new users with no rating history, users with sparse history, and so on, but we will not deal with that in this tutorial.

The statement below makes top-k (in this case, the best 10) recommendations for user ID 25. `reco_topk_items_for_user` returns the 10 best items along with their predicted ratings as a pandas Series indexed by item ID (or as a JSON string when `ret_json=True`); the cell then looks up the corresponding movie titles in `items`.

In [9]:
temp=reco100k.reco_topk_items_for_user(user_id=25)
recos=items[items['movie_id'].isin(temp.index)]
recos[['movie title']]

| | movie title |
|---|---|
| 14 | Mr. Holland's Opus (1995) |
| 21 | Braveheart (1995) |
| 116 | Rock, The (1996) |
| 171 | Empire Strikes Back, The (1980) |
| 236 | Jerry Maguire (1996) |
| 287 | Scream (1996) |
| 293 | Liar Liar (1997) |
| 312 | Titanic (1997) |
| 327 | Conspiracy Theory (1997) |
| 747 | Saint, The (1997) |
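
Note that filtering with `isin` returns the titles in the order they appear in `items`, not in order of predicted rating. If the ranking matters, one possible approach is sketched below on made-up data; the Series mimics the shape returned by `reco_topk_items_for_user` (index = movie ID, values = predicted rating), and all names and values are hypothetical.

```python
import pandas as pd

# Hypothetical predictions, already sorted from best to worst.
predicted = pd.Series({30: 4.9, 10: 4.7, 20: 4.2}, name="predicted_rating")
toy_items = pd.DataFrame({
    "movie_id": [10, 20, 30, 40],
    "movie title": ["Movie A", "Movie B", "Movie C", "Movie D"],
})

# isin() keeps the catalogue order (10, 20, 30) and drops the scores...
by_catalogue = toy_items[toy_items["movie_id"].isin(predicted.index)]

# ...whereas indexing by movie_id and selecting in the Series' order keeps the ranking.
ranked = (
    toy_items.set_index("movie_id")
    .loc[predicted.index, ["movie title"]]
    .assign(predicted_rating=predicted)
)
print(ranked)
```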
