# Amazon Recommender System

## Objective:

- Using the supplied dataset build a a simple recommendation engine that given a user ID ( userId ) returns a collection (list) of recommended movies. 
- The recommendation engine can be written in any language (ruby, python, php, javascript, bash, etc) and built using any technology/framework but you are required to state why you chosen that language and the technology stack.
- Bonus points who builds the solution integrated with a web interface. Please comment your code.

## Work Plan:

1. Data Collection and Preparation:
    - Analyse Data Format
    - Descriptive Data Analysis
    - Clean data if needed
2. Model Selection and Training
    - Decide which algorithm to use
    - Train algorithm
3. Make predictions

## Input Parameters:

In [1]:
TOP_N_MOVIES = 10; # number of top movies to retrieve
USER = '' # user to look for
FILE_PATH = 'movies.txt' # path to the dataset

## 1) Data Collection and Preparation:

It's very important to filter the data that is indeed useful for our purpose. Our data per product contains the following information: 
1. productId
2. userId
3. profileName
4. helpfulness
5. score
6. time
7. summary
8. review

For our purpose, we can simply filter the productID, userID and score. This information will be enough to recommender a movie based on the tecnhiques we plan to use later on. Let's proceed to read the `.txt` file.

In [2]:
import pandas as pd

In [3]:
def read_dataset(file_name, df):
    counter = 0
    with open(file_name, errors = 'ignore') as openfileobject:
        for line in openfileobject:
            counter = counter + 1
            if counter == 100000:
                break
            if '/productId' in line:
                new_product = {'productId': line.split(':')[1].strip()}
            elif '/userId' in line:
                new_product['userId'] = line.split(':')[1].strip()
            elif '/score' in line:
                new_product['score'] = line.split(':')[1].strip()
                df = df.append(new_product, ignore_index=True)
    return df

In [4]:
# create pandas dataframe to store reviews
reviews = pd.DataFrame(columns=['productId', 'userId', 'score'])

# read data file and append info to our dataframe
reviews = read_dataset(FILE_PATH, reviews)

## 2) Model Selection and Training:

In this part, we will select some models from the library [Surprise](http://surpriselib.com/), a full-fledged python library, specialized for recommender systems. This library will save much of the work that would be required if we were to implement the algorthms from scratch.

After the famous [Netflix Prize Competition](https://en.wikipedia.org/wiki/Netflix_Prize) the Matrix Factorization proved to be one of the most efficient algorithms for recommender systems, performing better than the so far very commonly used collaborative filtering algorithms (user and item based). For this reason, we will use this implementation for our own recommender system.  

In [26]:
from collections import defaultdict
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import evaluate, print_perf

In [27]:
def transformData(df):
    '''
    Function to read data from dataframe
    into Surprise data format
    '''
    df = df[['userId', 'productId', 'score']] # order imposed by surprise

    reader = Reader(rating_scale=(1, 5))
    data = Dataset.load_from_df(df[['userId', 'productId', 'score']], reader)
    
    return data

def trainAlgo(data, algorithm):
    '''
    Function to train requested algorithm
    on full dataset
    '''
    data.split(n_folds=3)
    algo = algorithm

    # Evaluate performances of our algorithm on the dataset.
    perf = evaluate(algo, data, measures=['RMSE', 'MAE'])

    print_perf(perf)

    return algo, data.build_full_trainset()


def get_top_n(n, trainset, algo):
    '''
    Function to return a dictionary with top elements
    per user
    '''
    testset = trainset.build_anti_testset()
    predictions = algo.test(testset)
    
    # Build dictionary per user.
    # Each user contains a list of tuples (itemId, score)
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
        
    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [28]:
# 1. get data into Surprise format
data = transformData(reviews)

# 2. traing algorithm, defined as the second parameter
algo, trainset = trainAlgo(data, SVD())

Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 1.1434
MAE:  0.8790
------------
Fold 2
RMSE: 1.1477
MAE:  0.8828
------------
Fold 3
RMSE: 1.1606
MAE:  0.8903
------------
------------
Mean RMSE: 1.1506
Mean MAE : 0.8840
------------
------------
        Fold 1  Fold 2  Fold 3  Mean    
RMSE    1.1434  1.1477  1.1606  1.1506  
MAE     0.8790  0.8828  0.8903  0.8840  


## 3) Make Predictions

We proceed to make predictions with the algorithm trained above:

In [29]:
# 1. get a dictionary with all predictions
top_n = get_top_n(TOP_N_MOVIES, trainset, algo)

In [32]:
# 2. filter predictions per requested user
result = top_n[USER]

In [33]:
top_n

defaultdict(list,
            {'A34PAZQ73SL163': [('B000063W82', 5),
              ('6300147967', 4.874190090421687),
              ('B001AQT0VI', 4.870688381520292),
              ('B001DBQAZY', 4.835122986946605),
              ('B001LMU1A0', 4.833953073106017),
              ('0790747324', 4.762216380871755),
              ('B00004RQB1', 4.756883634628638),
              ('B000NVL49W', 4.74470495269077),
              ('B000T4SWXO', 4.742732515070597),
              ('5556167281', 4.713599046041906)],
             'AZFCS75RSV25W': [('B00004RQB1', 4.935784124262281),
              ('B000063W82', 4.900930733307887),
              ('B000NVL49W', 4.8978640552732395),
              ('B001AQT0VI', 4.890142827665087),
              ('6300147967', 4.871870990934239),
              ('B001LMU1A0', 4.839889572241409),
              ('5556167281', 4.819035644708247),
              ('B004EPYZQM', 4.8074201376214045),
              ('0790747324', 4.789716684311298),
              ('6304179499', 4

## Conclusion

**Some interestings points worth mentioning:**