In [None]:
version = "v2.2.033020"

# Assignment 3: Mining Vectors and Matrices (Part I)

Welcome to Assignment 3! In this assignment, you will be working with vectors and matrices. As with itemsets and other representations of data, we care about *similarity* and *patterns*, the two basic outputs of data mining and the building blocks of more advanced functionality. Through this assignment, you will get hands-on experience deriving both similarity measurements and patterns with various vector and matrix operations.

We have curated another real-world dataset for you: restaurant ratings on Yelp. The original dataset can be found [here](https://www.yelp.com/dataset). To make this assignment more tractable, we have preprocessed the data and kept only the restaurants in Montréal, Canada. As with all real-world problems, be sure to sanity-check the quality of the data and be ready to handle "unexpected" scenarios of the "Wild Thing."

In this assignment, you will: 
* Represent the dataset as a matrix and check the row and column vectors.
* Implement various vector similarity/distance measures. 
* Find the top similar vectors to a given vector.
* Compute the SVD transformation of the matrix/vectors and analyze the vectors with reduced dimensionality.

In Part I of the assignment, we will load the dataset, transform it into vector/matrix format, and perform the necessary sanity checks. You only need to write very little code in this part. Please run all the code blocks in order and read the descriptions carefully.

First, let's import the packages and dependencies that will be used later.

In [None]:
import pandas as pd
import numpy as np

## Data Preprocessing
Let's start by loading the data and previewing the first few lines. We will use two data files. The `montreal_business.csv` file stores the attributes of the restaurants, and the `montreal_user.csv` file stores the user ratings of the restaurants. Both businesses and users are assigned randomized string identifiers (`business_id` and `user_id`), a common way to preserve the privacy of customers.

In [None]:
business_df = pd.read_csv('assets/montreal_business.csv')
business_df.head()

In [None]:
business_df.set_index('business_id', inplace=True)
business_df.head()

Every row is a restaurant, and the "stars" field stores the aggregated rating of all its customers.  Other fields should be self-explanatory. 

In [None]:
review_df = pd.read_csv('assets/montreal_user.csv')
review_df.head()

Every row in this user-rating data file corresponds to a restaurant-user pair, and the "stars" column stores the individual (unaggregated) score that the user gave to that restaurant.

### Constructing restaurant-user matrix:

The first step is to transform the user-restaurant ratings into a matrix. Each row vector represents the ratings of one restaurant, and each dimension of the row vector (a.k.a. a column) represents a user. Therefore, the $j$-th dimension of the $i$-th row vector represents the rating of user $j$ on restaurant $i$. We can pivot the `review_df` dataframe to generate such a matrix, and then view the first few row vectors.

In [None]:
rating_df = review_df.pivot_table(index=['business_id'], columns=['user_id'], values='stars')
rating_df.head(5)
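
To make the pivoting step concrete, here is a minimal sketch on a made-up toy table (the IDs and ratings below are hypothetical, not taken from the Yelp data). Note that `pivot_table` aggregates with the mean by default, so if a restaurant-user pair appeared more than once, its ratings would be averaged.

```python
import pandas as pd

# Hypothetical toy ratings; IDs and values are made up for illustration.
toy_reviews = pd.DataFrame({
    "business_id": ["b1", "b1", "b2"],
    "user_id": ["u1", "u2", "u2"],
    "stars": [5, 3, 4],
})

# Rows become restaurants, columns become users, cells hold the rating.
# Pairs with no rating (here: b2 rated by u1) show up as NaN.
toy_matrix = toy_reviews.pivot_table(
    index="business_id", columns="user_id", values="stars"
)
print(toy_matrix)
```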

However, directly using this matrix in production might not be good practice. The matrix can be huge and yet very sparse, as not every user has rated every restaurant. In other words, most of the cells of the matrix have missing values (NaN).

Let's briefly examine how sparse the matrix is.

In [None]:
n_entry = review_df.shape[0]
n_business = review_df.business_id.unique().size
n_user = review_df.user_id.unique().size
print("total entry:", n_entry)
print("# business:", n_business)
print("# user:", n_user)
print(f"density:{n_entry / (n_business * n_user):.4f}")
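
The density above is computed from the number of rows in `review_df`. As a cross-check, the same quantity can also be estimated directly from a pivoted matrix by counting its non-missing cells. The sketch below uses a hypothetical toy matrix; the two estimates agree only when every restaurant-user pair appears once and carries a valid rating.

```python
import numpy as np
import pandas as pd

# Hypothetical 2 x 3 toy matrix with two observed ratings.
toy_matrix = pd.DataFrame(
    [[5.0, np.nan, np.nan], [np.nan, 4.0, np.nan]],
    index=["b1", "b2"],
    columns=["u1", "u2", "u3"],
)

# Count the cells that actually hold a rating, then divide by all cells.
n_filled = int(toy_matrix.notna().sum().sum())
density = n_filled / toy_matrix.size
print(f"density: {density:.4f}")
```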

Before any analysis, we also need to perform sanity checks. For matrix data, the first thing we always do is check the dimensionality.

### Exercise 1. Check the dimensionality (5 pts)
Complete the `row_col_count` function below to return the numbers of rows and columns of the rating matrix.

In [None]:
def row_col_count(rating_df):
    # YOUR CODE HERE
    raise NotImplementedError()
    return n_row, n_col

In [None]:
n_row, n_col = row_col_count(rating_df)
assert n_row == 2770
assert n_col == 11937

In [None]:
rating_df.shape

Comparing the numbers with the earlier cells, we have 2779 businesses in `review_df`, but only 2770 rows in `rating_df`. What happened? Let's further examine the difference.

In [None]:
set(business_df.index) - set(rating_df.index)

We see that the difference of 9 (2779 - 2770) comes solely from 9 businesses missing from `rating_df`. Do they exist in `review_df`?

In [None]:
missing_business_id = set(business_df.index) - set(rating_df.index)
review_df[review_df.business_id.isin(missing_business_id)]
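
One way a business can be present in the review table yet vanish after pivoting is if all of its ratings are missing: with its default `dropna=True`, `pivot_table` discards entries that are entirely NaN. The sketch below reproduces this on hypothetical toy data; whether this is what happened to the 9 businesses here is an assumption you should verify against the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: b3 appears in the table, but its only rating is NaN.
toy_reviews = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "user_id": ["u1", "u1", "u2"],
    "stars": [4.0, 5.0, np.nan],
})

toy_matrix = toy_reviews.pivot_table(
    index="business_id", columns="user_id", values="stars"
)

# Compare the number of distinct businesses with the number of matrix rows.
print("businesses in reviews:", toy_reviews.business_id.nunique())
print("rows after pivoting:", toy_matrix.shape[0])
print("dropped:", set(toy_reviews.business_id) - set(toy_matrix.index))
```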

It is now clear that the 9 missing businesses do not have any valid entries in the `review_df`, so they do not have corresponding rows in `rating_df` after pivoting. To be consistent, we could create 9 additional rows in `rating_df` with all NaN values. Or, we can simply remove the businesses from `business_df` to make the two dataframes coherent.

In [None]:
business_df.drop(missing_business_id, inplace=True)
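
For comparison, the other option mentioned above (keeping every business and adding all-NaN rows to the rating matrix) could be sketched with `reindex`. The frames below are hypothetical toy stand-ins for `business_df` and `rating_df`; this is an illustration of the alternative, not the approach used in the rest of the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for business_df: three businesses.
toy_business = pd.DataFrame(
    {"name": ["A", "B", "C"]},
    index=pd.Index(["b1", "b2", "b3"], name="business_id"),
)

# Hypothetical stand-in for rating_df: b3 has no row after pivoting.
toy_matrix = pd.DataFrame(
    {"u1": [4.0, np.nan], "u2": [np.nan, 5.0]},
    index=pd.Index(["b1", "b2"], name="business_id"),
)

# reindex aligns the matrix to the business index; labels that were
# absent (b3) get a new row filled with NaN.
aligned = toy_matrix.reindex(toy_business.index)
print(aligned)
```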

There are several ways of dealing with missing values in a matrix (the NaN cells in the pivoted dataframe). For now, let's simply fill the NaN cells with 0. This may not be good practice in reality. Not only will we be dealing with a much larger (denser) dataset, which requires a lot more computing resources, but more importantly, we are implicitly assuming that the users rated those restaurants with 0 stars. Do they really dislike those restaurants, or have they simply not been to them? Can you think of a more reasonable method to fill in the NaNs?

In [None]:
rating_df.fillna(0, inplace=True)
rating_df.head(5)
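
As one possible answer to the question above, a missing cell could be filled with the mean of the ratings that restaurant did receive, instead of 0. The sketch below is only an illustration on a hypothetical toy matrix; it is not what the assignment does, and other choices (such as a user's mean) are equally plausible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy restaurant x user matrix with missing ratings.
toy_matrix = pd.DataFrame(
    {"u1": [4.0, np.nan], "u2": [2.0, 5.0], "u3": [np.nan, np.nan]},
    index=["b1", "b2"],
)

# Fill each restaurant's missing cells with that restaurant's own mean
# rating (mean() skips NaN values by default).
filled = toy_matrix.apply(lambda row: row.fillna(row.mean()), axis=1)
print(filled)
```

Unlike filling with 0, this does not pull a restaurant's row toward "disliked by everyone", but it still produces a dense matrix.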

We should now have a restaurant x user matrix fully filled with ratings. This concludes the first part of the assignment.