# Amazon Recommender System

## Objective:

- Using the supplied dataset, build a simple recommendation engine that, given a user ID (`userId`), returns a collection (list) of recommended movies.
- The recommendation engine can be written in any language (Ruby, Python, PHP, JavaScript, Bash, etc.) and built using any technology/framework, but you are required to state why you chose that language and technology stack.
- Bonus points for a solution integrated with a web interface. Please comment your code.

## Work Plan:

1. Data Collection and Preparation
2. Model Selection and Training
    - Decide which algorithm to use
    - Train algorithm and evaluate performance
3. Make predictions
4. Query Engine

## Notebook Usage

The notebook contains all the code used to produce a final dictionary with the recommended movies for each user. This dictionary is saved in a pickle file.

To get the list of movies for a specific user, use the query function in section 4.

## Input Parameters:

In [42]:
TOP_N_MOVIES = 10  # number of top movies to retrieve
FILE_PATH = 'movies.txt' # path to the dataset

## 1) Data Collection and Preparation:

It is important to keep only the data that is useful for our purpose. Each record in our data contains the following information:
1. productId
2. userId
3. profileName
4. helpfulness
5. score
6. time
7. summary
8. review

For our purpose, we can keep just the productId, userId and score. This information is enough to recommend a movie with the techniques we plan to use later on. Let's proceed to read the `.txt` file.

In [53]:
import pandas as pd
import pickle

In [46]:
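def read_dataset(file_name, df, n_products):
    '''
    Parse the reviews file and append (productId, userId, score) rows to df.
    Note: n_products caps the number of lines read, not the number of products.
    '''
    rows = []         # collect rows in a list; DataFrame.append was removed in pandas 2.0
    new_product = {}  # record currently being parsed
    with open(file_name, errors='ignore') as reviews_file:
        for counter, line in enumerate(reviews_file, start=1):
            if counter == n_products:
                break
            if 'product/productId:' in line:
                # a productId line starts a new record
                new_product = {'productId': line.split(':')[1].strip()}
            elif 'review/userId:' in line:
                new_product['userId'] = line.split(':')[1].strip()
            elif 'review/score:' in line:
                # store the score as a number; the record is complete at this point
                new_product['score'] = float(line.split(':')[1].strip())
                rows.append(new_product)
    new_rows = pd.DataFrame(rows, columns=df.columns)
    return new_rows if df.empty else pd.concat([df, new_rows], ignore_index=True)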

In [47]:
# create pandas dataframe to store reviews
reviews = pd.DataFrame(columns=['productId', 'userId', 'score'])

# read data file and append info to our dataframe
reviews = read_dataset(FILE_PATH, reviews, 500000)
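
The parsing step above can be checked on a small in-memory sample before running it on the full file. The sketch below is only an illustration: the sample records, the IDs, and the helper name `parse_reviews` are made up, and the exact field prefixes other than `product/productId`, `review/userId` and `review/score` are assumptions about the file layout.

```python
import io

# Hypothetical sample in the assumed "key: value" layout, records separated by blank lines.
SAMPLE = """\
product/productId: B00000TEST
review/userId: A1HYPOTHETICAL
review/profileName: Jane Doe
review/helpfulness: 1/2
review/score: 4.0
review/time: 1182729600
review/summary: Nice
review/text: A made-up review.

product/productId: B00000TEST
review/userId: A2HYPOTHETICAL
review/score: 2.0
"""

def parse_reviews(lines):
    """Return a list of {'productId', 'userId', 'score'} dicts, one per review."""
    rows, current = [], {}
    for line in lines:
        key, _, value = line.partition(':')
        value = value.strip()
        if key == 'product/productId':
            current = {'productId': value}   # a productId line starts a new record
        elif key == 'review/userId':
            current['userId'] = value
        elif key == 'review/score':
            current['score'] = float(value)  # the score completes the fields we keep
            rows.append(current)
    return rows

rows = parse_reviews(io.StringIO(SAMPLE))
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame` to obtain the three-column table used in the rest of the notebook.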

## 2) Model Selection and Training:

In this part, we will select a model from [Surprise](http://surpriselib.com/), a Python library specialized for recommender systems. This library saves much of the work that would be required if we were to implement the algorithms from scratch.

During the famous [Netflix Prize Competition](https://en.wikipedia.org/wiki/Netflix_Prize), matrix factorization proved to be one of the most effective algorithms for recommender systems, performing better than the neighborhood-based collaborative filtering algorithms (user- and item-based) that were commonly used until then. For this reason, we will use Surprise's matrix factorization implementation, `SVD`, for our own recommender system.
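
To give an intuition for what matrix factorization does, here is a minimal pure-Python sketch, not the Surprise implementation. It assumes the common biased model in which a rating is predicted as the global mean plus a user bias, an item bias, and the dot product of a user and an item factor vector, all fitted by stochastic gradient descent. The toy ratings, the function names and the hyperparameter values are made up for illustration.

```python
import random

def factorize(ratings, n_users, n_items, n_factors=2, n_epochs=200,
              lr=0.01, reg=0.02, seed=0):
    """Fit biases and latent factors to (user_index, item_index, rating) triples."""
    rnd = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu = [0.0] * n_users                               # user biases
    bi = [0.0] * n_items                               # item biases
    p = [[rnd.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    q = [[rnd.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(n_epochs):
        for u, i, r in ratings:
            pred = mu + bu[u] + bi[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))
            err = r - pred
            # Gradient steps with L2 regularization.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return mu, bu, bi, p, q

def predict(model, u, i):
    mu, bu, bi, p, q = model
    return mu + bu[u] + bi[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))

def rmse(ratings, predict_fn):
    return (sum((r - predict_fn(u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5

# Made-up ratings: (user index, item index, rating)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
model = factorize(ratings, n_users=3, n_items=3)

baseline_rmse = rmse(ratings, lambda u, i: model[0])  # always predict the global mean
model_rmse = rmse(ratings, lambda u, i: predict(model, u, i))
print('baseline RMSE:', baseline_rmse, 'model RMSE:', model_rmse)
```

The same `predict` function can score user/item pairs that never appeared in the ratings, which is what makes this kind of model usable for recommendations.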

In [48]:
from collections import defaultdict
from surprise import SVD
from surprise import Dataset
from surprise import Reader
# cross_validate replaces evaluate/print_perf, which were removed from newer Surprise releases
from surprise.model_selection import cross_validate

In [62]:
def transformData(df):
    '''
    Function to read data from dataframe
    into Surprise data format
    '''
    # column order (user, item, rating) is imposed by Surprise
    df = df[['userId', 'productId', 'score']].astype({'score': float})

    reader = Reader(rating_scale=(1, 5))
    data = Dataset.load_from_df(df, reader)

    return data

def trainAlgo(data, algorithm):
    '''
    Function to evaluate the requested algorithm with
    3-fold cross-validation, then train it on the full dataset
    '''
    algo = algorithm

    # Evaluate performance of our algorithm on the dataset (prints MAE per fold).
    cross_validate(algo, data, measures=['MAE'], cv=3, verbose=True)

    # Refit on all ratings; otherwise the model would only have seen the last fold's training split.
    trainset = data.build_full_trainset()
    algo.fit(trainset)

    return algo, trainset


def get_top_n(n, trainset, algo):
    '''
    Function to return a dictionary with top elements
    per user
    '''
    testset = trainset.build_anti_testset()
    predictions = algo.test(testset)
    
    # Build dictionary per user.
    # Each user contains a list of tuples (itemId, score)
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
        
    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [63]:
# 1. get data into Surprise format
data = transformData(reviews)

# 2. train algorithm, defined as the second parameter
algo, trainset = trainAlgo(data, SVD())

        Fold 1  Fold 2  Fold 3  Mean    
MAE     0.8689  0.8557  0.8655  0.8634  
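
For reference, MAE (mean absolute error) is the average absolute difference between the true and the predicted ratings, so lower is better. A minimal sketch with made-up numbers, where the function name is chosen only for this illustration:

```python
def mean_absolute_error(true_ratings, predicted_ratings):
    """Average of |true - predicted| over all rating pairs."""
    errors = [abs(t - p) for t, p in zip(true_ratings, predicted_ratings)]
    return sum(errors) / len(errors)

# Hypothetical ratings, not taken from the dataset.
true_ratings = [5.0, 3.0, 4.0]
predicted_ratings = [4.5, 3.5, 2.0]

mae = mean_absolute_error(true_ratings, predicted_ratings)  # (0.5 + 0.5 + 2.0) / 3
print(mae)
```

Read this way, the mean MAE reported above says that predictions are off by a bit less than one star on average.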


## 3) Make Predictions

We proceed to make predictions with the algorithm trained above:

In [51]:
# 1. get a dictionary with all predictions
top_n = get_top_n(TOP_N_MOVIES, trainset, algo)

In [54]:
# save the recommendations; the with-block makes sure the file is closed
with open("recommendations.p", "wb") as f:
    pickle.dump(top_n, f)
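
The top-N selection step can be illustrated without Surprise. The sketch below runs on hand-made `(user, item, estimate)` tuples and uses `heapq.nlargest` instead of a full sort, which is a swapped-in alternative to the sorting used in this notebook. The function name and the toy predictions are hypothetical.

```python
from collections import defaultdict
import heapq

def top_n_per_user(predictions, n):
    """Group (user, item, estimate) tuples by user and keep the n best items per user."""
    per_user = defaultdict(list)
    for user_id, item_id, estimate in predictions:
        per_user[user_id].append((item_id, estimate))
    # nlargest returns the n highest-scoring pairs, already in descending order.
    return {user_id: heapq.nlargest(n, items, key=lambda pair: pair[1])
            for user_id, items in per_user.items()}

# Made-up predictions for two users.
predictions = [
    ('u1', 'm1', 4.5), ('u1', 'm2', 3.0), ('u1', 'm3', 4.9),
    ('u2', 'm1', 2.0), ('u2', 'm4', 4.1),
]
top_2 = top_n_per_user(predictions, 2)
print(top_2)
```

Returning a plain `dict` rather than a `defaultdict` means that looking up an unknown user later raises a `KeyError` or can be handled with `.get`, instead of silently creating an empty entry.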

## 4) Query engine

In [3]:
import pickle

def searchMovies(user, recommendations):
    print('User {} would be delighted to watch the following movies:\n'.format(user))
    return recommendations[user]

In [4]:
# TYPE USER
USER = 'A141HP4LYPWMSR' # user to look for

# load dictionary from pickle file
with open("recommendations.p", "rb") as f:
    recommendations = pickle.load(f)

searchMovies(USER, recommendations)

User A141HP4LYPWMSR would be delighted to watch the following movies:



[('B005LAJ22Q', 5),
 ('B00008WJDK', 4.835127692328798),
 ('B00004WCM9', 4.8065915057153745),
 ('B002UIGMYS', 4.787991941890827),
 ('B000063W82', 4.785120125760392),
 ('B0087ZG7RK', 4.776743153138881),
 ('B00004RFIE', 4.742585255506958),
 ('6300147967', 4.737080058199667),
 ('B000O599VC', 4.732156899585864),
 ('B000T4SWXO', 4.729831824023032)]
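
A query can also be made for a user that does not appear in the recommendations. One possible way to handle that case is sketched below; the function name `search_movies_safe` and the sample data are hypothetical and not part of the notebook.

```python
def search_movies_safe(user, recommendations):
    """Return only the recommended product IDs, or an empty list for unknown users."""
    # .get avoids a KeyError on a plain dict and avoids inserting a new key in a defaultdict.
    return [item_id for item_id, _ in recommendations.get(user, [])]

# Made-up recommendations in the same shape as the pickled dictionary:
# user ID -> list of (product ID, estimated score) tuples.
recommendations = {
    'A1HYPOTHETICAL': [('B00000TEST', 4.8), ('B00001TEST', 4.2)],
}

known = search_movies_safe('A1HYPOTHETICAL', recommendations)
unknown = search_movies_safe('NOT_A_USER', recommendations)
print(known, unknown)
```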

## Conclusion

**Some interesting points worth mentioning:**

- For the purpose of this exercise, little effort went into comparing different algorithms or tuning their parameters. Both could be important if we want to achieve better performance.
- Furthermore, only the first 500,000 lines of the dataset were read; training on more data would likely improve the model.