***Authors: Hamza Masood, Jarod Carroll, Mihir Bhagat, Sam Thurman***

# **MovieLens Recommendation**

**Data Source: https://grouplens.org/datasets/movielens/latest/**

## Methodology
1. Data Acquisition
    - Download Small dataset
2. Data Preparation
    - ...
3. Simple Model
    - ...
4. ALS Model
    - ...
5. Model Evaluation
    - ...

#### Import Necessary Packages

In [1]:
import pandas as pd
import os, sys
module_path = os.path.abspath(os.path.join(os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
    
from src.rank_metrics import *
from src.helpers import *
from src.table_encoder import *
from src.metrics import *
from src.cosine_helpers import *

from pyspark.sql import SparkSession
from pyspark import SparkContext
import pyspark.sql

## Data Acquisition

The small dataset was downloaded from the GroupLens website. It was then unzipped, and the CSV files were placed in the data folder of the repo.

## Data Preparation

The data was loaded into data frames. The movie data's genre column was vectorized for use in later metrics. The ratings data and movie data were merged, and each user's movie ratings were turned into a vector.

In [2]:
ratings_df, movies_df, encoded_movies_df, tags_df, encoded_tags_df = load_format_data('../data/csv/')

In [3]:
user_frame = rating_vectorizer(ratings_df)

25 % of the way done
50 % of the way done
75 % of the way done
100 % of the way done
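
As a rough illustration of the two transformations described above, the sketch below multi-hot encodes a pipe-separated `genres` column and pivots ratings into one vector per user. The toy frames, the variable names, and the choice to fill unrated movies with 0 are assumptions made for illustration; the repo's `load_format_data` and `rating_vectorizer` helpers may work differently.

```python
import pandas as pd

# Toy stand-ins for the MovieLens movies and ratings tables (values are made up).
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Action|Comedy", "Comedy", "Drama"],
})
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [1, 2, 2, 3],
    "rating": [4.0, 3.5, 5.0, 2.0],
})

# Multi-hot encode genres: one 0/1 column per genre.
genre_flags = movies["genres"].str.get_dummies(sep="|")
encoded_movies = pd.concat([movies[["movieId", "title"]], genre_flags], axis=1)

# One row per user, one column per movie.
# Unrated movies are filled with 0 (an assumption of this sketch).
user_vectors = ratings.pivot_table(
    index="userId", columns="movieId", values="rating", fill_value=0
)
print(user_vectors)
```

Filling with 0 keeps the vectors dense and easy to compare, at the cost of treating "not rated" the same as a very low score.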


## Simple Model

For a simple model, users were compared to each other using cosine similarity. The movies that the closest user liked, and that the target user had not yet seen, were used as recommendations.

In [4]:
get_top_five(1, user_frame, movies_df)

['Superstar (1999)',
 'Terminator 2: Judgment Day (1991)',
 'Shakespeare in Love (1998)',
 'Fly, The (1986)',
 'Snatch (2000)']
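
The nearest-user idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the repo's `get_top_five`: the function names are hypothetical, a rating of 0 is assumed to mean "not rated", and any positive rating from the neighbour counts as a candidate.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm else 0.0

def recommend_from_nearest(user, vectors, top_n=5):
    """Return indices of movies the most similar user rated that `user` has not."""
    target = vectors[user]
    # Score every other user against the target user.
    others = [u for u in range(len(vectors)) if u != user]
    nearest = max(others, key=lambda u: cosine_similarity(target, vectors[u]))
    # Candidate movies: rated by the neighbour, unrated (0) by the target.
    unseen = np.where((target == 0) & (vectors[nearest] > 0))[0]
    # Highest neighbour rating first.
    ranked = sorted(unseen, key=lambda m: vectors[nearest][m], reverse=True)
    return [int(m) for m in ranked[:top_n]]

# Rows are users, columns are movies, 0 means "not rated" (toy data).
vectors = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 3.0, 4.5],
    [0.0, 1.0, 5.0, 0.0],
])
print(recommend_from_nearest(0, vectors))  # → [3, 2]
```

Here user 1 is closest to user 0, so user 1's two remaining movies are returned in order of user 1's rating.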

## ALS Model

To build a better model, an ALS (alternating least squares) model was fit using Spark. First the data was put into a Spark DataFrame, and then it was split into training and test sets.

In [5]:
ratings_df = ratings_df.drop('timestamp', axis=1)
ratings = spark.createDataFrame(ratings_df)
(training, test) = ratings.randomSplit([0.8, 0.2])

An ALS model was then made using the training data.

In [8]:
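from pyspark.ml.recommendation import ALS

# coldStartStrategy="drop" removes NaN predictions for users/movies unseen in training.
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(training)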
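
To give a feel for what ALS does, here is a small NumPy sketch of alternating least squares on a toy rating matrix. It is an illustration only: the function name, the toy data, and the hyperparameters are assumptions, and Spark's distributed implementation differs in its details. The idea is to approximate the rating matrix as `U @ V.T`, alternately fixing the movie factors to solve for the user factors and vice versa, using only the observed ratings.

```python
import numpy as np

def als_factorize(R, observed, rank=2, reg=0.1, iters=20, seed=0):
    """Factor R ~ U @ V.T using only the entries where `observed` is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)
    for _ in range(iters):
        # Hold V fixed: each user's factors solve a small ridge regression.
        for u in range(n_users):
            seen = observed[u]
            if seen.any():
                Vs = V[seen]
                U[u] = np.linalg.solve(Vs.T @ Vs + ridge, Vs.T @ R[u, seen])
        # Hold U fixed: the same for each movie.
        for i in range(n_items):
            seen = observed[:, i]
            if seen.any():
                Us = U[seen]
                V[i] = np.linalg.solve(Us.T @ Us + ridge, Us.T @ R[seen, i])
    return U, V

# Toy ratings; 0 marks a missing rating (an assumption of this sketch).
R = np.array([
    [5, 4, 0, 1],
    [4, 0, 1, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)
observed = R > 0

U, V = als_factorize(R, observed)
residuals = (R - U @ V.T)[observed]
train_rmse = float(np.sqrt(np.mean(residuals ** 2)))
print("training RMSE:", train_rmse)
```

The entries of `U @ V.T` at the unobserved positions are the model's predicted ratings for movies a user has not rated.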

## Evaluation

Here we load our saved recommender model and use it to predict ratings for the test set. We chose RMSE (root-mean-square error) as the metric to evaluate our model's performance. It measures the typical size of the prediction errors, and equals the standard deviation of the residuals when their mean is zero, so it gives us an overall measure of how the model is performing.

In [12]:
from pyspark.ml.evaluation import RegressionEvaluator

model = load_model('als.model')
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("RMSE = " + str(rmse))

RMSE = 0.6553466869306045


This score means that our predicted ratings typically differ from the actual ratings by about 0.655 stars, on MovieLens's 0.5 to 5 star rating scale.
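
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The sketch below computes it by hand on made-up numbers; the function name and the data are illustrative assumptions and are unrelated to the result reported above.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up ratings and predictions for four user-movie pairs.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.5, 4.0, 2.0]
print(rmse(actual, predicted))  # sqrt(0.375), about 0.612
```

Because the errors are squared before averaging, a single prediction that is off by a full star contributes as much as four predictions that are each off by half a star.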