<h1>User-Based Collaborative Filtering</h1>
![Recommendation Systems Approaches](https://raw.githubusercontent.com/ziababar/recommender/master/images/Recommendation%20Systems.jpg)
This notebook builds a recommender based on the user-based collaborative filtering approach.

The recommender system will perform the following steps:
1. Retrieve item and activity data
2. Build the recommendation engine
3. Generate recommendations
![Collaborative Filtering](https://raw.githubusercontent.com/ziababar/recommender/master/images/collaborative-filtering.png)

<h2>Step 1 - Retrieve Data</h2>
The first step is always to gather the data and load it into the programming environment.
For our use case, we download the MovieLens dataset, which contains three sets of data:

- Movie data containing a certain movie's information, such as movieID, release date, URL, genre details, and so on
- User data containing the user information, such as userID, age, gender, occupation, ZIP code, and so on
- Ratings data containing userID, itemID (the movieID), rating, and timestamp
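
As a rough illustration of the ratings layout described above, the sketch below parses a few tab-separated rows from an in-memory string instead of the real file. The sample rows, the names `sample_u_data`, `sample_cols` and `df_sample`, and the reading of the timestamp as Unix seconds are assumptions made for this example.

```python
import io
import pandas as pd

# Illustrative rows in the assumed layout: user id, movie id, rating, timestamp
sample_u_data = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n22\t377\t1\t878887116\n"

sample_cols = ['user id', 'movie id', 'rating', 'timestamp']
df_sample = pd.read_csv(io.StringIO(sample_u_data), sep='\t', header=None, names=sample_cols)

# Assumption: the timestamp counts seconds since the Unix epoch
df_sample['rated at'] = pd.to_datetime(df_sample['timestamp'], unit='s')
print(df_sample)
```

Converting the timestamp is optional here; the rest of the notebook only uses the user, movie, and rating columns.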

In [None]:
# Import the libraries that are going to be used here
import pandas as pd
import numpy as np
import scipy
import sklearn

In [None]:
# Column headers for the dataset
data_cols = ['user id','movie id','rating','timestamp']
item_cols = ['movie id','movie title','release date','video release date','IMDb URL','unknown','Action','Adventure','Animation','Childrens','Comedy','Crime','Documentary','Drama','Fantasy','Film-Noir','Horror','Musical','Mystery','Romance','Sci-Fi','Thriller','War','Western']
user_cols = ['user id','age','gender','occupation', 'zip code']

In [None]:
# User activity data
df_u_data = pd.read_csv('/home/nbuser/library/dataset/u.data', header=None, sep='\t', names=data_cols, encoding='latin-1')
df_u_data = df_u_data.sort_values('user id', ascending=True)
df_u_data.columns
df_u_data.head(5)

In [None]:
# List of movie items
df_u_item = pd.read_csv('/home/nbuser/library/dataset/u.item', header=None, sep='|', names=item_cols, encoding='latin-1')
df_u_item = df_u_item.sort_values('movie id', ascending=True)
df_u_item.columns
df_u_item.head(5)

<h2>Step 2 - Determine Similar Users</h2>

![User-Based Collaborative Filtering](https://raw.githubusercontent.com/ziababar/recommender/master/images/user-based.png)

There are multiple ways to determine similarity between users or items. Some common approaches used in recommendation systems include:
 - Neighbourhood-based techniques
   - Euclidean distance
   - Cosine similarity
   - Jaccard similarity
   - Pearson correlation coefficient
 - Clustering techniques
   - K-means clustering
   
In this example, we'll use the Pearson correlation coefficient to determine similar items.
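
To make the Pearson correlation coefficient concrete, here is a minimal sketch that computes it by hand for two movies and compares the result with pandas. The ratings and the names `movie_a`, `movie_b`, `r_manual` and `r_pandas` are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for two movies by five users; NaN marks a missing rating
movie_a = pd.Series([5.0, 3.0, 4.0, np.nan, 1.0])
movie_b = pd.Series([4.0, 2.0, 5.0, 3.0, np.nan])

# Keep only the users who rated both movies
both = movie_a.notna() & movie_b.notna()
a, b = movie_a[both], movie_b[both]

# Pearson r: sum of products of centred values over the product of their norms
a_c, b_c = a - a.mean(), b - b.mean()
r_manual = (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

# Series.corr also ignores pairs where either value is missing
r_pandas = movie_a.corr(movie_b)
print(r_manual, r_pandas)
```

Only the three users who rated both movies contribute, which is why pairs with few common raters can produce unreliable scores.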

Merge the two dataframes into a single dataframe, so that every rating row also carries the details of the movie it refers to. Having all the transactional activity in one place simplifies the analysis that follows.

In [None]:
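# Join on the shared 'movie id' column (the only column the two dataframes have in common)
ratings = pd.merge(df_u_item, df_u_data, on='movie id')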
ratings.head(5)

Now we'll pivot this table to construct a matrix with one row per user and one column per movie. NaN indicates missing data, that is, movies that a given user did not rate:

In [None]:
userRatings = ratings.pivot_table(index=['user id'], columns=['movie title'], values='rating')
userRatings.head(5)
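
The pivot step can be sketched on a tiny table to show where the NaN values come from. The data and the names `toy`, `toy_matrix` and `sparsity` are hypothetical.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair
toy = pd.DataFrame({
    'user id': [1, 1, 2, 3, 3],
    'movie title': ['Alpha', 'Beta', 'Alpha', 'Beta', 'Gamma'],
    'rating': [5, 3, 4, 2, 1],
})

# One row per user, one column per movie; unrated combinations become NaN
toy_matrix = toy.pivot_table(index='user id', columns='movie title', values='rating')
print(toy_matrix)

# Fraction of user/movie cells that have no rating
sparsity = toy_matrix.isna().to_numpy().mean()
print(sparsity)
```

Real rating matrices are typically far sparser than this toy one, since most users rate only a small share of the catalogue.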

From the user rating dataframe above, create a correlation matrix. Because `corr()` correlates columns, and the columns here are movies, this gives a correlation score between every pair of movies. A pair that was not rated together by enough users to compute a correlation shows up as NaN.

In [None]:
# Use pandas built-in corr() method that will compute a correlation score for every column pair in the matrix
corrMatrix = userRatings.corr()
corrMatrix.head()

Eliminate spurious results that come from just a handful of users who happened to rate the same pair of movies. This restricts our results to movies that many people rated together, and also gives us more popular results that are more easily recognizable.

In [None]:
# Use the min_periods argument to throw out results where fewer than 100 users rated a given movie pair
corrMatrix = userRatings.corr(method='pearson', min_periods=100)
corrMatrix.head()
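
The effect of `min_periods` is easier to see on a small matrix. In the sketch below (hypothetical data; the threshold of 3 is chosen only to suit the toy size), two movie pairs share just two raters and look perfectly correlated until the threshold removes them.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-movie matrix; NaN means the user did not rate the movie
toy_ratings = pd.DataFrame({
    'Alpha': [5, 4, 1, 2, np.nan],
    'Beta':  [4, 5, 2, 1, np.nan],
    'Gamma': [np.nan, np.nan, 3, 5, 4],
})

# Alpha/Beta share four raters; Alpha/Gamma and Beta/Gamma share only two
loose = toy_ratings.corr(method='pearson')
strict = toy_ratings.corr(method='pearson', min_periods=3)

print(loose)
print(strict)
```

With only two common raters, any pair that is not constant scores exactly +1 or -1, which is the kind of spurious result the threshold is meant to discard.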

<h2>Step 3 - Generate Recommendations</h2>

Produce movie recommendations for a test user ID. Generally the input data would be divided into training and testing segments, with the recommendations being produced for users that were not part of the training data. This would permit the evaluation of the recommendation model. However, here we keep things simple and just reuse one of the actual users from the training data to see how the recommendation model works.
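
For reference, a split that holds out whole users could look roughly like the sketch below. It is not used in the rest of this notebook; the data, the names `toy_log`, `test_users`, `train_part` and `test_part`, and the choice of one held-out user are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical rating log for three users
toy_log = pd.DataFrame({
    'user id':  [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    'movie id': [10, 11, 12, 13, 10, 11, 14, 15, 12, 15],
    'rating':   [5, 3, 4, 1, 4, 2, 5, 3, 2, 4],
})

# Pick one user at random to hold out; random_state makes the choice repeatable
users = toy_log['user id'].drop_duplicates()
test_users = users.sample(n=1, random_state=42)

# All rows of the held-out user go to the test set, the rest to training
is_test = toy_log['user id'].isin(test_users)
train_part = toy_log[~is_test]
test_part = toy_log[is_test]
print(len(train_part), len(test_part))
```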

In [None]:
# Retrieve ratings made by user 1
myRatings = userRatings.loc[1].dropna()
myRatings

Let's go through the movies the user rated, one at a time, and build up a list of possible recommendations from the movies similar to each of them. This is done as follows:
 - For each movie rated, retrieve the list of similar movies from the correlation matrix
 - Scale those correlation scores by the rating the user gave to the original movie
 - Thus movies similar to ones the user likes count more than movies similar to ones the user hates
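
The weighting described in the list above can be sketched with made-up numbers. The correlation scores, the ratings of 5 and 1, and the names `sims_to_liked`, `sims_to_disliked`, `scaled` and `totals` are hypothetical.

```python
import pandas as pd

# Hypothetical correlation scores of candidate movies against two rated movies
sims_to_liked = pd.Series({'Gamma': 0.8, 'Delta': 0.5})       # source movie rated 5
sims_to_disliked = pd.Series({'Gamma': 0.2, 'Epsilon': 0.9})  # source movie rated 1

# Scale each similarity by the rating the user gave the source movie
scaled = pd.concat([sims_to_liked * 5, sims_to_disliked * 1])

# Sum the contributions for candidates that appear more than once
totals = scaled.groupby(scaled.index).sum().sort_values(ascending=False)
print(totals)
```

Epsilon has the highest raw correlation (0.9), but because it resembles a movie the user rated poorly it ends up below Delta.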

In [None]:
# Collect one Series of scaled scores per rated movie
# (Series.append was removed in pandas 2.0, so the parts are combined with pd.concat)
simParts = []
for title, rating in myRatings.items():
    print("Adding sims for " + title + "...")
    # Retrieve similar movies to this one that I rated
    sims = corrMatrix[title].dropna()
    # Now scale its similarity by how well I rated this movie
    simParts.append(sims * rating)
# Combine the scores into one list of similarity candidates
simCandidates = pd.concat(simParts)

In [None]:
# Glance at the similar movie candidates so far
simCandidates.sort_values(inplace = True, ascending = False)
simCandidates.head(5)

Some of the same movies came up more than once, because they were similar to more than one movie originally rated by the user. Use groupby() to add together the scores from movies that show up more than once, so they'll count more:

In [None]:
simCandidates = simCandidates.groupby(simCandidates.index).sum()
simCandidates.sort_values(inplace = True, ascending = False)
simCandidates.head(10)
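
The ranked list can still contain movies the user has already rated. One possible follow-up, sketched here with hypothetical data and names (`toy_candidates`, `toy_rated`, `unseen`, `top`), is to drop those titles before presenting the top results.

```python
import pandas as pd

# Hypothetical aggregated scores and the titles the user has already rated
toy_candidates = pd.Series({'Alpha': 9.1, 'Beta': 7.4, 'Gamma': 4.2, 'Delta': 2.5})
toy_rated = pd.Series({'Alpha': 5, 'Beta': 3})

# Drop already-rated titles; errors='ignore' skips labels that are not present
unseen = toy_candidates.drop(toy_rated.index, errors='ignore')
top = unseen.sort_values(ascending=False).head(2)
print(top)
```

In the notebook's terms this would correspond to dropping `myRatings.index` from `simCandidates`, assuming both are indexed by movie title as above.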