## Introduction to Data Science

### Data Science Tasks: Recommender Systems

Based on blog posts from [DataCamp](https://www.datacamp.com/community/tutorials/recommender-systems-python), [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2016/06/quick-guide-build-recommendation-engine-python/) and [Data-Mania](http://www.data-mania.com/blog/recommendation-system-python/).  
The full version of the data is available from [GroupLens](https://grouplens.org/datasets/movielens/).

In [21]:
import os
import sys
import re
import math
import time
import string
import datetime
from zipfile import ZipFile
from io import StringIO

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.neighbors import NearestNeighbors

Specifying the path to the files:

In [2]:
outputs = "../outputs/"

#### Recommendation engines  

A recommendation engine is essentially an automated version of a shop assistant. You ask the assistant for a product, and they show you not only that product but also related ones you might buy. Shop assistants are well trained in cross-selling and up-selling, and so are recommendation engines.

The value of these engines lies in their ability to recommend personalized content based on past behavior. This improves the customer experience and gives users a reason to keep returning to the website.

#### Types of Recommendation Engines


a) Recommend the most popular items

A simple approach is to recommend the items liked by the largest number of users. This is fast but crude, and its major drawback is that it involves no personalization.

The most popular items are the same for every user, because popularity is computed over the entire user pool, so everybody sees the same results. It is as if a website recommended a microwave to you just because other users liked it, without caring whether you are interested in buying one.

Surprisingly, this approach still works in places such as news portals. Whenever you visit, say, BBC News, you see a “Popular News” column that is subdivided into sections, with the most-read articles of each section displayed. The approach works in this case because:

+ The news is divided into sections, so users can look at the section that interests them.
+ At any given time there are only a few hot topics, and there is a high chance that a user wants to read the news most other people are reading.
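
A minimal sketch of the popularity approach described above. The `toy_ratings` table, its column names and the `most_popular` helper are hypothetical and made up for illustration; they are not part of the MovieLens data used later.

```python
import pandas as pd

# Hypothetical toy data: one row per (user, item) interaction, liked = 1 or 0.
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 3, 4],
    "item":    ["A", "B", "A", "C", "A", "B", "C", "B"],
    "liked":   [1, 1, 1, 0, 1, 1, 0, 0],
})

def most_popular(ratings, n=2):
    """Return the n items with the most likes; the list is the same for every user."""
    likes = ratings.groupby("item")["liked"].sum()
    return likes.sort_values(ascending=False).head(n).index.tolist()

top_items = most_popular(toy_ratings)
print(top_items)
```

Because the ranking ignores `user_id` entirely, every user receives the same list, which is exactly the lack of personalization noted above.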

 
b) Using a classifier to make recommendation

We already know many classification algorithms, so let's see how the same technique can be used to make recommendations. A classifier takes features of the user and of the item as input and outputs 1 if the user likes the item and 0 otherwise. This can work in some cases because of the following advantages:

+ It incorporates personalization.
+ It can work even if the user's past history is short or not available.

However, it also has major drawbacks, which is why it is not used much in practice:

+ The features might not be available, or, even if they are, they may not be sufficient to build a good classifier.
+ As the number of users and items grows, building a good classifier becomes increasingly difficult.
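
As a rough sketch of the classifier idea, the example below trains a logistic regression on made-up (user, item) pairs. The feature names `genre_match` and `user_activity`, the data, and the choice of scikit-learn's `LogisticRegression` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features for each (user, item) pair:
#   column 0: genre_match   (1 if the item's genre is one the user liked before)
#   column 1: user_activity (made-up activity score between 0 and 1)
pairs = np.array([
    [1, 0.2], [1, 0.8], [0, 0.2], [0, 0.8],
    [1, 0.2], [1, 0.8], [0, 0.2], [0, 0.8],
])
liked = np.array([1, 1, 0, 0, 1, 1, 0, 0])  # 1 = user liked the item, 0 = did not

clf = LogisticRegression().fit(pairs, liked)

# Score two candidate items for the same user; a higher probability ranks first.
candidates = np.array([[1, 0.5], [0, 0.5]])
like_proba = clf.predict_proba(candidates)[:, 1]
print(like_proba)
```

Items would then be recommended in decreasing order of `like_proba`. The quality of such a model depends entirely on how informative the chosen features are, which is the first drawback listed above.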

 
c) Recommendation Algorithms

Now let's turn to the class of algorithms that are tailor-made for the recommendation problem. There are two main types, content-based and collaborative filtering, plus hybrids that combine them. The blog posts linked at the top explain them in full; here is a short recap.

+ Content-based algorithms: if you like an item, you will probably also like a “similar” item, where similarity is computed from the properties of the items. This generally works well when it is easy to determine the context or properties of each item, for instance when all recommended items are of the same kind, as in movie or song recommendation.
+ Collaborative filtering algorithms: if person A likes items 1, 2 and 3, and person B likes items 2, 3 and 4, then they have similar interests, so A should like item 4 and B should like item 1. This approach relies entirely on past behavior and not on context, which makes it one of the most commonly used algorithms because it does not depend on any additional information. Examples include product recommendations by e-commerce players such as Amazon and merchant recommendations by banks such as American Express.
+ Hybrid recommendation systems: these combine the collaborative and content-based approaches and help improve recommendations derived from sparse datasets. Netflix is a prime example of a hybrid recommender.
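
The collaborative filtering example with persons A and B can be sketched in a few lines. Jaccard similarity is used here as one possible similarity measure (an assumption; the text does not prescribe one), and the helper names `jaccard` and `recommend` are hypothetical.

```python
# Items liked by each person, as in the example above.
likes = {"A": {1, 2, 3}, "B": {2, 3, 4}}

def jaccard(a, b):
    """Share of items two users have in common, between 0 and 1."""
    return len(a & b) / len(a | b)

def recommend(user, likes):
    """Suggest items liked by the most similar other user that `user` lacks."""
    others = [u for u in likes if u != user]
    nearest = max(others, key=lambda u: jaccard(likes[user], likes[u]))
    return sorted(likes[nearest] - likes[user])

print(jaccard(likes["A"], likes["B"]))  # 2 shared items out of 4 distinct ones
print(recommend("A", likes))
print(recommend("B", likes))
```

Note that nothing about the items themselves is used, only who liked what.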

Further, there are several types of collaborative filtering algorithms:
+ User-user collaborative filtering: here we find look-alike customers (based on similarity) and offer products that a customer's look-alikes have chosen in the past. This algorithm is very effective but takes a lot of time and resources, because it requires computing similarity information for every pair of customers. For platforms with a large user base, it is therefore hard to implement without a strongly parallelized system.
+ Item-item collaborative filtering: this is quite similar to the previous algorithm, but instead of finding look-alike customers we find look-alike items. Once we have the item look-alike matrix, we can easily recommend similar items to a customer who has purchased any item from the store. This algorithm consumes far fewer resources than user-user collaborative filtering: for a new customer it takes far less time, because we do not need the similarity scores between all customers. And with a fixed number of products, the product-product look-alike matrix is fixed over time.
+ Other simpler algorithms: there are other approaches, such as market basket analysis, which generally have less predictive power than the algorithms described above.
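
A minimal sketch of the item look-alike matrix used in item-item collaborative filtering. The rating matrix `R` is made up, cosine similarity is one common choice of similarity (an assumption here), and treating unrated items as 0 is a simplification.

```python
import numpy as np

# Hypothetical user-item ratings (rows: users, columns: items); 0 = not rated.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def item_similarity(R):
    """Cosine similarity between every pair of item columns."""
    norms = np.linalg.norm(R, axis=0)
    return (R.T @ R) / np.outer(norms, norms)

def similar_items(item, sim, n=1):
    """Indices of the n items most similar to `item`, excluding itself."""
    order = np.argsort(-sim[item])
    return [int(j) for j in order if j != item][:n]

sim = item_similarity(R)
print(np.round(sim, 2))
print(similar_items(0, sim))
print(similar_items(2, sim))
```

The matrix `sim` only needs to be recomputed when the ratings change, so serving a recommendation for a customer is a lookup rather than a comparison against all other customers.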

#### First Example: a quick and dirty similarity system

Collaborative systems often use a nearest-neighbor method or an item-based collaborative filtering system, a simple system that makes recommendations using simple regression or a weighted-sum approach. The end goal of collaborative systems is to make recommendations based on customers' behavior, purchasing patterns, and preferences, as well as product attributes, price ranges, and product categories. Content-based systems can use methods as simple as averaging, or more advanced machine learning approaches such as Naive Bayes classifiers, clustering algorithms, or artificial neural networks.

First let's create a dataset called X, with 6 records and 2 features each.

In [3]:
X = np.array([[-1, 2], [4, -4], [-2, 1], [-1, 3], [-3, 2], [-1, 4]])
print(X)

[[-1  2]
 [ 4 -4]
 [-2  1]
 [-1  3]
 [-3  2]
 [-1  4]]


Next we will instantiate a nearest neighbor object, and call it nbrs. Then we will fit it to dataset X.

In [4]:
nbrs = NearestNeighbors(n_neighbors=3, algorithm='ball_tree').fit(X)

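Let's find the k nearest neighbors of each point in X. To do that, we call the kneighbors() method of nbrs and pass it X. Because the query points are the fitted points themselves, the first neighbor of each point is the point itself, at distance 0.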

In [5]:
distances, indices = nbrs.kneighbors(X)

In [6]:
print(indices)

[[0 3 2]
 [1 2 0]
 [2 4 0]
 [3 5 0]
 [4 2 0]
 [5 3 0]]


In [7]:
print(distances)

[[0.         1.         1.41421356]
 [0.         7.81024968 7.81024968]
 [0.         1.41421356 1.41421356]
 [0.         1.         1.        ]
 [0.         1.41421356 2.        ]
 [0.         1.         2.        ]]
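
The distances above are Euclidean, which is the default metric of `NearestNeighbors`. As a quick sanity check, the sketch below recomputes them directly from the coordinates with NumPy; the variable name `manual` is introduced here for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[-1, 2], [4, -4], [-2, 1], [-1, 3], [-3, 2], [-1, 4]])
nbrs = NearestNeighbors(n_neighbors=3, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)

# X[:, None, :] has shape (6, 1, 2) and X[indices] has shape (6, 3, 2),
# so the difference holds, for each point, the offset to each of its neighbors.
manual = np.linalg.norm(X[:, None, :] - X[indices], axis=2)

print(np.allclose(manual, distances))
```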


Imagine you have a new incoming data point with the values -2 and 4. To search X for the most similar records, all you need to do is call the kneighbors() function on the new incoming data point.

In [8]:
dist, idx = nbrs.kneighbors([[-2, 4]])
print('The closest are {}'.format(idx))
print('The distances are {}'.format(dist))

The closest are [[5 3 0]]
The distances are [[1.         1.41421356 2.23606798]]


The results show that the records with indices 5, 3 and 0 are the three nearest neighbors of the new data point, listed in order of increasing distance. The closest is record 5, the last record in X: [-1, 4], at a distance of 1. A quick glance confirms that this record is indeed the one most similar to the new data point [-2, 4]. In this way, you can use kNN to quickly classify new incoming data points and then make recommendations, all based on similarity.
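
To illustrate the classification step, the sketch below attaches a made-up label to each record in X and assigns the new point the most common label among its three nearest neighbors. The `labels` array and the `classify` helper are hypothetical and exist only for this example.

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[-1, 2], [4, -4], [-2, 1], [-1, 3], [-3, 2], [-1, 4]])
# Hypothetical label per record, e.g. the genre that customer prefers.
labels = np.array(['comedy', 'action', 'comedy', 'drama', 'comedy', 'drama'])

nbrs = NearestNeighbors(n_neighbors=3, algorithm='ball_tree').fit(X)

def classify(point):
    """Return the most common label among the point's 3 nearest neighbors."""
    _, idx = nbrs.kneighbors([point])
    votes = Counter(labels[idx[0]].tolist())
    return votes.most_common(1)[0][0]

predicted = classify([-2, 4])
print(predicted)
```

A recommender could then suggest items associated with the predicted label.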

#### Second Example: MovieLens Dataset

The MovieLens dataset was collected by the GroupLens Research Project at the University of Minnesota. The MovieLens 100K dataset consists of:
+ 100,000 ratings (1-5) from 943 users on 1682 movies.
+ Each user has rated at least 20 movies.
+ Simple demographic info for the users (age, gender, occupation, zip)
+ Genre information of movies

In [10]:
with ZipFile('../datasets/CSVs/ml-100k.zip') as z:
    for info in z.infolist():
        # Skip directory entries inside the archive
        # (os.path.isdir() would check the local filesystem instead)
        if not info.is_dir():
            print(info.filename)

ml-100k/allbut.pl
ml-100k/mku.sh
ml-100k/README
ml-100k/u.data
ml-100k/u.genre
ml-100k/u.info
ml-100k/u.item
ml-100k/u.occupation
ml-100k/u.user
ml-100k/u1.base
ml-100k/u1.test
ml-100k/u2.base
ml-100k/u2.test
ml-100k/u3.base
ml-100k/u3.test
ml-100k/u4.base
ml-100k/u4.test
ml-100k/u5.base
ml-100k/u5.test
ml-100k/ua.base
ml-100k/ua.test
ml-100k/ub.base
ml-100k/ub.test


In [28]:
#Reading users file:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']

with ZipFile('../datasets/CSVs/ml-100k.zip') as z:
    myfile = StringIO(z.read('ml-100k/u.user').decode('latin-1'))
    users = pd.read_csv(myfile, sep='|', names=u_cols, encoding='latin-1')
users.head()

| | user_id | age | sex | occupation | zip_code |
|---|---|---|---|---|---|
| 0 | 1 | 24 | M | technician | 85711 |
| 1 | 2 | 53 | F | other | 94043 |
| 2 | 3 | 23 | M | writer | 32067 |
| 3 | 4 | 24 | M | technician | 43537 |
| 4 | 5 | 33 | F | other | 15213 |


In [27]:
#Reading ratings file:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']

with ZipFile('../datasets/CSVs/ml-100k.zip') as z:
    myfile = StringIO(z.read('ml-100k/u.data').decode('latin-1'))
    ratings = pd.read_csv(myfile, sep='\t', names=r_cols, encoding='latin-1')
ratings.head()

| | user_id | movie_id | rating | unix_timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |
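
The `unix_timestamp` column stores the rating time as seconds since 1970-01-01 UTC. A small sketch of converting it to readable dates, using only the five rows shown above; the `sample` frame and the `rated_at` column name are introduced here for illustration.

```python
import pandas as pd

# The five rows shown in the ratings.head() output above.
sample = pd.DataFrame({
    'user_id': [196, 186, 22, 244, 166],
    'movie_id': [242, 302, 377, 51, 346],
    'rating': [3, 3, 1, 2, 1],
    'unix_timestamp': [881250949, 891717742, 878887116, 880606923, 886397596],
})

# Interpret the integers as seconds since the Unix epoch.
sample['rated_at'] = pd.to_datetime(sample['unix_timestamp'], unit='s')
print(sample[['user_id', 'movie_id', 'rating', 'rated_at']])
```

The same call should work on the full `ratings` frame, assuming it was loaded as in the cell above.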


In [26]:
#Reading items file:
i_cols = ['movie id', 
          'movie title' ,
          'release date',
          'video release date', 
          'IMDb URL', 
          'unknown', 
          'Action', 
          'Adventure',
          'Animation', 
          'Children\'s', 
          'Comedy', 
          'Crime', 
          'Documentary', 
          'Drama', 
          'Fantasy',
          'Film-Noir', 
          'Horror', 
          'Musical', 
          'Mystery', 
          'Romance', 
          'Sci-Fi', 
          'Thriller', 
          'War', 
          'Western']

with ZipFile('../datasets/CSVs/ml-100k.zip') as z:
    myfile = StringIO(z.read('ml-100k/u.item').decode('latin-1'))
    items = pd.read_csv(myfile, sep='|', names=i_cols, encoding='latin-1')
items.head()

| | movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | GoldenEye (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?GoldenEye%20(... | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | Four Rooms (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Four%20Rooms%... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 4 | Get Shorty (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Get%20Shorty%... | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | Copycat (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
