Nice read on "How to set up Anaconda and Jupyter Notebook the right way": https://towardsdatascience.com/how-to-set-up-anaconda-and-jupyter-notebook-the-right-way-de3b7623ea4a

In [None]:
import json
import pandas as pd
import gzip

# Familiarize Yourself with the Dataset
In the lab sessions, we will work with the "All Beauty" category of the Amazon Review Data, and we will use the 5-core subset. You can download the dataset and find information about it here: https://nijianmo.github.io/amazon/index.html

## Exercise 1
Download and import the 5-core dataset.

In [None]:
def parse(path):
    # Yield one review (as a dict) per line of the gzipped JSON-lines file
    with gzip.open(path, 'rb') as g:
        for l in g:
            yield json.loads(l)

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

df = getDF('All_Beauty_5.json.gz')

# If you decompress the file manually, you can instead run the following command:
# df = pd.read_json("All_Beauty_5.json", lines=True)

## Exercise 2

### 2.1 
Sort the dataset entries by user id (`reviewerID`), product id (`asin`), and rating timestamp (`unixReviewTime`). Then remove any entries with missing ratings and, where the same user has rated the same item multiple times, keep only the last entry. How many observations does the cleaned dataset have?
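
As a rough illustration on made-up data (the toy frame and its values below are assumptions, not taken from the dataset), sorting and then keeping the last row per user-item pair could look like this:

```python
import pandas as pd

# Toy data (hypothetical values), using the dataset's column names
toy = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u2", "u2", "u3"],
    "asin": ["a", "a", "a", "b", "b"],
    "overall": [3.0, 5.0, 4.0, None, 2.0],
    "unixReviewTime": [200, 100, 150, 160, 170],
})

# Sort so that, within each (user, item) pair, the newest rating comes last
toy_sorted = toy.sort_values(["reviewerID", "asin", "unixReviewTime"])

# Drop rows without a rating, then keep only the last rating per pair
toy_cleaned = (
    toy_sorted
    .dropna(subset=["overall"])
    .drop_duplicates(subset=["reviewerID", "asin"], keep="last")
)

print(len(toy_cleaned))
```

Because the rows are sorted by timestamp first, `keep="last"` retains the most recent rating of each pair.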

In [None]:
def data_cleaning(df):
    
    # Write your code here
    
    return df_cleaned

df = data_cleaning(df)

### 2.2
Create a test set by extracting the latest positively rated item (rating $\geq 4$) of each user. Remove from the test set any users that do not appear in the training set. How many observations do the training and test sets have?
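
The following is a minimal sketch of one possible reading of this split, on made-up data: the held-out row per user is the positive rating with the largest `unixReviewTime`, and all remaining rows form the training set. The toy frame and the names `toy_train` and `toy_test` are assumptions for illustration only.

```python
import pandas as pd

# Toy data (hypothetical values), using the dataset's column names
toy = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u1", "u2", "u3", "u3"],
    "asin": ["a", "b", "c", "a", "b", "c"],
    "overall": [5.0, 4.0, 2.0, 5.0, 3.0, 4.0],
    "unixReviewTime": [100, 200, 300, 150, 120, 180],
})

positive = toy[toy["overall"] >= 4]

# Row label of each user's most recent positive rating
latest_idx = positive.groupby("reviewerID")["unixReviewTime"].idxmax()
toy_test = toy.loc[latest_idx]

# Everything that is not held out stays in the training set
toy_train = toy.loc[~toy.index.isin(latest_idx)]

# Keep only test users who still have at least one training rating
toy_test = toy_test[toy_test["reviewerID"].isin(toy_train["reviewerID"])]

print(len(toy_train), len(toy_test))
```

In this toy example, user `u2` has a single rating, so holding it out leaves nothing in the training set and the user is dropped from the test set.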

In [None]:
def data_split(df):
    df_positive = df[df['overall']>=4]
    
    # Write your code here
    
    return training, test

In [None]:
df_train, df_test = data_split(df)

print("Observations in training set:", len(df_train))
print("Observations in test set:", len(df_test))

## Exercise 3

Compute the number of ratings per item in the training set. What does a bar plot of the number of ratings per item, ordered by decreasing frequency, look like?

Reflect on how the predictions of a recommender system are affected if only a small fraction of the items is rated frequently.
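
As a small sketch on made-up data (the item ids and counts are assumptions), the number of ratings per item can be obtained by counting how often each `asin` occurs:

```python
import pandas as pd

# Toy training set (hypothetical): one row per rating, only the item id shown
toy_train = pd.DataFrame({
    "asin": ["a", "a", "a", "a", "a", "a", "b", "b", "c", "d"],
})

# value_counts() returns the counts sorted in decreasing order by default
ratings_per_item = toy_train["asin"].value_counts()

# Share of all ratings that belong to the single most rated item
top_share = ratings_per_item.iloc[0] / ratings_per_item.sum()

print(ratings_per_item.to_dict())
print(top_share)
```

With matplotlib installed, `ratings_per_item.plot.bar()` would draw the sorted counts as bars. In this toy example one item collects more than half of all ratings, which is the kind of skew the reflection question asks about.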

In [None]:
import matplotlib.pyplot as plt 

# Write your code here

plt.show()

# Collaborative Filtering Recommender System

## Exercise 1
In this exercise, we are going to predict the rating of a single user-item pair using a neighborhood-based method.

### 1.1
- Represent the ratings from the training set in a user-item matrix where the rows represent users and the columns represent items.
- Fill unobserved ratings with $0$.
- Compute the cosine similarities between the user with `reviewerID`='A25C2M3QF9G7OQ' and all users that have rated the item with `asin`='B00EYZY6LQ'.
- What are the similarities and what are the ratings given by these users on item 'B00EYZY6LQ'?
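
To make the similarity measure concrete, here is a minimal sketch of cosine similarity on a made-up user-item matrix, written with NumPy only. The ratings and the helper name `cosine` are assumptions for illustration; `cosine_similarity` from scikit-learn computes the same normalized dot product for whole matrices.

```python
import numpy as np

# Toy user-item matrix (hypothetical ratings); 0 marks an unobserved rating
ratings = np.array([
    [5.0, 0.0, 3.0],   # target user
    [4.0, 0.0, 3.0],   # neighbor 1
    [0.0, 2.0, 0.0],   # neighbor 2
])

def cosine(u, v):
    # Dot product divided by the product of the Euclidean norms
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target = ratings[0]
sims = [cosine(target, other) for other in ratings[1:]]

print(sims)
```

Since unobserved ratings are filled with $0$, two users who share no rated item get a similarity of $0$, as neighbor 2 does here.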

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [None]:
# Loading and preparing data
user_item_matrix = df_train.pivot_table(index='reviewerID',columns='asin',values='overall')
user_item_matrix = user_item_matrix.fillna(0)
target_user = user_item_matrix.loc[['A25C2M3QF9G7OQ']]
input_users = user_item_matrix[user_item_matrix['B00EYZY6LQ']>0]
users = pd.concat([target_user, input_users])

# Write your code here to compute cosine similarity and report results.
# Note: the cell in 1.2 expects them in a DataFrame named `target_item`.



### 1.2
Predict the rating for user 'A25C2M3QF9G7OQ' on item 'B00EYZY6LQ' based on the ratings from the $3$ most similar users, using a weighted (by similarity) average. What is the prediction?
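
A similarity-weighted average is commonly written as $\hat{r}_{u,i} = \frac{\sum_{v} s_{u,v}\, r_{v,i}}{\sum_{v} s_{u,v}}$, where $v$ runs over the $k$ most similar users. The sketch below applies this to made-up numbers; the column names `similarity` and `rating` and all values are assumptions, and normalizing by the sum of similarities is the usual choice rather than something this exercise prescribes.

```python
import pandas as pd

# Hypothetical neighbors: similarity to the target user and rating on the item
neighbors = pd.DataFrame({
    "similarity": [0.9, 0.1, 0.5, 0.6],
    "rating": [5.0, 1.0, 3.0, 4.0],
})

k = 3
top_k = neighbors.sort_values(by="similarity", ascending=False)[:k]

# Weighted average: each rating counts in proportion to its similarity
prediction = (
    (top_k["similarity"] * top_k["rating"]).sum() / top_k["similarity"].sum()
)

print(prediction)
```

The least similar neighbor (similarity 0.1) falls outside the top 3 and does not influence the result.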

In [None]:
k = 3
target_item_k = target_item.sort_values(by=#Complete, 
                                        ascending=False)[:k]

prediction_KNN = # Write your code

print('Predicted rating:', prediction_KNN)

In [None]:
# Recommended: save the dataframes so they can be loaded in the next session

df_train.to_pickle("train_dataframe.pkl")
df_test.to_pickle("test_dataframe.pkl")