In [None]:
import csv
import numpy as np
import pandas as pd
import pickle

# Reading and Merging Data

Our data is split between multiple files, which we will peek into using this script. Based on the information in those files, I'd like to populate the following dictionaries. Each dictionary name is formatted as `key_value` to help us remember which values are being mapped to one another. For example, `movieId_movieName` is a dictionary whose keys store movie ID numbers, and values store movie titles. Each dictionary is commented with its contents below.

In [None]:
movieId_movieName = {} # dict[movieId] = 'movie title'
userId_rating     = {} # dict[userId] = [(movieId, rating)]
movieId_isRated   = {} # dict[movieId] = 1 if rated by 1+ users, else 0

movieId_movieCol  = {} # dict[movieId] = column index in the ratings matrix
userId_userRow    = {} # dict[userId] = row index in the ratings matrix

We'll start with the movies.csv file. The cells below read the CSV into a dataframe, then print its first few rows and its shape. We see that it stores one row per movie, including a `movieId` and a `title` column.

In [None]:
filename = 'movies.csv'
df = pd.read_csv('data/{}'.format(filename))

In [None]:
df.head()

In [None]:
df.shape

Now we can start populating the relevant dictionaries we declared at the start of the script. We'll iterate through the dataframe, taking the movieId as the unique index. From this, we can populate the `movieId_movieName` dictionary. Since this is a comprehensive list of movies in our dataset, we'll give every movie a default `isRated` value of 0 in `movieId_isRated`, meaning that it has not yet been rated by any user.

In [None]:
for index, row in df.iterrows():
    # movieId is the unique key for each movie
    movieId = row['movieId']
    movieId_movieName[movieId] = row['title']

    # Default isRated flag of 0: no ratings seen for this movie yet
    movieId_isRated[movieId] = 0
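
Row-by-row iteration can be slow on large files. The sketch below shows one possible vectorized way to build the same two dictionaries, assuming the same `movieId` and `title` columns. The two-row dataframe is made-up sample data, not taken from movies.csv.

```python
import pandas as pd

# Hypothetical sample standing in for movies.csv
df = pd.DataFrame({'movieId': [1, 2], 'title': ['Movie A', 'Movie B']})

# Pair each movieId with its title in one pass
movieId_movieName = dict(zip(df['movieId'], df['title']))

# Every movie starts with an isRated flag of 0
movieId_isRated = dict.fromkeys(movieId_movieName, 0)

print(movieId_movieName)
print(movieId_isRated)
```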

Next we can open the ratings csv file in the same manner. This file contains user ratings for movies, keyed by movieId. We will use this information to fill in the `userId_rating` dictionary. As we work through the user ratings, we'll update a movie's `isRated` status to 1 once it receives a rating. After iterating through the dataset, we'll remove from `movieId_movieName` any movie that has not received a rating.

In [None]:
filename = 'ratings.csv'
df = pd.read_csv('data/{}'.format(filename))

In [None]:
df.head()

In [None]:
df.shape

In [None]:
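for index, row in df.iterrows():
    # iterrows() returns each row as a single-dtype Series, so integer IDs
    # may come back as floats; cast them back to int
    userId = int(row['userId'])
    movieId = int(row['movieId'])
    rating = row['rating']

    # Append this (movieId, rating) pair to the user's list, creating it if needed
    userId_rating.setdefault(userId, []).append((movieId, rating))

    # Mark the movie as rated
    movieId_isRated[movieId] = 1

# Drop movies that never received a rating
for movieId, isRated in movieId_isRated.items():
    if isRated == 0:
        del movieId_movieName[movieId]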
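
The same group-then-prune logic can be sketched without pandas. This is an illustration on made-up data, not the notebook's method: the `ratings` list and the `rated` set are hypothetical names standing in for ratings.csv and the `movieId_isRated` flags.

```python
from collections import defaultdict

# Hypothetical toy data standing in for movies.csv and ratings.csv
movieId_movieName = {1: 'Movie A', 2: 'Movie B', 3: 'Movie C'}
ratings = [(10, 1, 4.0), (10, 3, 2.5), (11, 1, 5.0)]  # (userId, movieId, rating)

userId_rating = defaultdict(list)  # missing users start with an empty list
rated = set()                      # movieIds that received at least one rating

for userId, movieId, rating in ratings:
    userId_rating[userId].append((movieId, rating))
    rated.add(movieId)

# Keep only the movies that were rated (Movie B is dropped here)
movieId_movieName = {movieId: name
                     for movieId, name in movieId_movieName.items()
                     if movieId in rated}

print(dict(userId_rating))
print(movieId_movieName)
```

Tracking rated movies in a set avoids a second pass over a flag dictionary, at the cost of no longer recording the unrated IDs explicitly.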

Now we will start creating a matrix of ratings in which each row represents a user and each column represents a movie, so cell (i,j) holds the rating that the user in row i gave to the movie in column j. To start creating this matrix, we will first fill in the dictionaries `movieId_movieCol` and `userId_userRow`, which map each movieId to its column index and each userId to its row index.

In [None]:
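# Assign each movie a column index, in ascending movieId order
for i, movieId in enumerate(sorted(movieId_movieName)):
    movieId_movieCol[movieId] = i

# Assign each user a row index, in ascending userId order
for i, userId in enumerate(sorted(userId_rating)):
    userId_userRow[userId] = i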
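
To see what these index maps look like, here is a small sketch on made-up user IDs. The inverse map `userRow_userId` is a hypothetical addition that is not part of this notebook; it could be handy later for turning a matrix row back into a userId.

```python
# Hypothetical toy ratings keyed by non-contiguous userIds
userId_rating = {42: [(7, 3.0)], 7: [(9, 4.5)], 19: [(7, 1.0)]}

# Sorted userIds get consecutive row indices starting at 0
userId_userRow = {userId: row for row, userId in enumerate(sorted(userId_rating))}

# Inverse lookup: row index back to userId
userRow_userId = {row: userId for userId, row in userId_userRow.items()}

print(userId_userRow)
print(userRow_userId)
```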

Now we'll create the matrix, starting by initializing a zero-filled matrix with the correct number of rows (total users) and columns (total movies). Next, we'll update the cells (i,j) for which a user rating exists, so a value of 0 means that the user has not rated that movie.

In [None]:
m = len(userId_userRow)   # number of users (rows)
n = len(movieId_movieCol) # number of movies (columns)

# Create a zero-filled matrix; 0 means "no rating"
A = np.zeros((m, n))

print(A.shape)

# Iterate through each user's ratings
for userId, ratings in userId_rating.items():
    for movieId, score in ratings:
        # Skip any rating whose user or movie has no row/column assigned
        if userId in userId_userRow and movieId in movieId_movieCol:
            i = userId_userRow[userId]
            j = movieId_movieCol[movieId]
            A[i, j] = score

# Count the cells that hold a rating (non-zero entries)
ratingCount = np.count_nonzero(A)

print('Number of ratings = {}'.format(ratingCount))
print('Total entries = {}'.format(m*n))
# This is the percentage of filled cells (the density); sparsity is 100 minus it
print('Density = {:.2f}%'.format(ratingCount*100/(m*n)))
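
As a small worked example of the fill and the fill-rate arithmetic, the sketch below uses a made-up 3-user by 4-movie matrix with five ratings. The `triples` list is hypothetical, and the single indexed assignment is one possible alternative to the nested loops, not the method used above.

```python
import numpy as np

# Hypothetical toy ratings: (row index, column index, rating)
triples = [(0, 0, 4.0), (0, 2, 2.5), (1, 1, 5.0), (2, 0, 3.0), (2, 3, 1.5)]
m, n = 3, 4  # 3 users, 4 movies

rows, cols, scores = zip(*triples)
A = np.zeros((m, n))
A[list(rows), list(cols)] = scores  # fill all rated cells at once

ratingCount = np.count_nonzero(A)
density = 100 * ratingCount / (m * n)  # % of cells that hold a rating
sparsity = 100 - density               # % of cells left at 0

print(ratingCount, round(density, 2), round(sparsity, 2))
```

Both this sketch and the loop version rely on 0 never being a valid rating, since a zero cell is treated as "not rated".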

Now we'll store our matrix and dictionaries in pickle format. Pickle files are binary files that hold Python objects, so other Python scripts can easily read them back in to recover our data.

In [None]:
d = {'movieId_movieName': movieId_movieName,
     'movieId_movieCol' : movieId_movieCol,
     'userId_userRow'   : userId_userRow,
     'userId_rating'    : userId_rating }
# Use context managers so the files are closed once written
with open('data/data_matrix.p', 'wb') as f:
    pickle.dump(A, f)
with open('data/data_dicts.p', 'wb') as f:
    pickle.dump(d, f)
print(A.shape)
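
Another script could recover the saved objects with `pickle.load`. The round trip below is a minimal sketch on a made-up dictionary, written to a temporary directory so it does not touch the `data/` folder; the contents of `d` here are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the dictionaries saved above
d = {'movieId_movieCol': {1: 0, 3: 1}, 'userId_userRow': {10: 0, 11: 1}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data_dicts.p')

    # Write the object out in binary mode
    with open(path, 'wb') as f:
        pickle.dump(d, f)

    # Read it back in binary mode
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == d)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.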