# Preprocess the MovieLens Data

1. [EDA](#eda)
2. [Preprocess](#preprocess)
3. [Shrink the dataframe](#shrink)
4. [To dictionary](#dict)


## <a id='eda'>1. Exploratory Data Analysis </a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
path = "/Users/rubyjiang/Desktop/machine_learning_examples/large_files/movielens-20m-dataset/rating.csv"
df = pd.read_csv(path)

In [3]:
df.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 2 | 3.5 | 2005-04-02 23:53:47 |
| 1 | 1 | 29 | 3.5 | 2005-04-02 23:31:16 |
| 2 | 1 | 32 | 3.5 | 2005-04-02 23:33:39 |
| 3 | 1 | 47 | 3.5 | 2005-04-02 23:32:07 |
| 4 | 1 | 50 | 3.5 | 2005-04-02 23:29:40 |


In [4]:
len(df.userId.unique())

138493

In [5]:
len(df.movieId.unique())

26744

In [6]:
df.isnull().any()

userId       False
movieId      False
rating       False
timestamp    False
dtype: bool

In [7]:
def get_missing_ids(column: str) -> None:
    """Print how many integers in 1..max(id) never appear in df[column]."""
    ids = df[column]
    reference_ids = np.arange(1, ids.max() + 1)
    missing_ids = set(reference_ids) - set(ids)
    print(column + ": ", len(missing_ids))

In [8]:
get_missing_ids('userId')
get_missing_ids('movieId')

userId:  0
movieId:  104518


**Summary:**
- User ids run sequentially from 1 to 138493 with no missing numbers.
- Movie ids are integers between 1 and 131262, but not all of them appear: 104518 are missing, leaving only 26744 distinct movie ids.
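The contiguity check above can be sketched on toy data. This is a minimal illustration, not the notebook's exact function: `count_missing_ids` and the two toy Series are hypothetical names, and the helper returns the count instead of printing it.

```python
import numpy as np
import pandas as pd

def count_missing_ids(ids: pd.Series) -> int:
    """Count the integers in 1..max(ids) that never appear in ids."""
    reference_ids = np.arange(1, ids.max() + 1)
    return len(np.setdiff1d(reference_ids, ids.to_numpy()))

# Hypothetical toy data: user ids are contiguous, movie ids are sparse.
toy_user_ids = pd.Series([1, 2, 3, 4, 4])
toy_movie_ids = pd.Series([1, 2, 5, 9])

print(count_missing_ids(toy_user_ids))   # no gaps in 1..4
print(count_missing_ids(toy_movie_ids))  # 3, 4, 6, 7, 8 are missing

# Consistency check on the figures reported in this notebook:
# (largest movie id) - (missing ids) should equal the number of distinct ids.
assert 131262 - 104518 == 26744
```

The final assertion ties the three reported numbers together: the largest movie id minus the number of missing ids gives the count of distinct movie ids.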


## <a id='preprocess'>2. Preprocess </a>
Check out **preprocess.py** for a cleaner version.

In [9]:
# make the user id go from 0 to N-1
df['userId'] = df['userId'] - 1

In [10]:
# create a mapping for movie ids that maps each old id to a new contiguous index
unique_movie_ids = set(df['movieId'].values)
movie2indx = {movie_id: index for index, movie_id in enumerate(unique_movie_ids)}
# add the new index to df (Series.map is vectorized, unlike a row-wise apply)
df['movie_idx'] = df['movieId'].map(movie2indx)

In [11]:
df = df.drop(columns=['timestamp'])
save_path = "/Users/rubyjiang/Desktop/machine_learning_examples/large_files/movielens-20m-dataset/edited_rating.csv"
df.to_csv(save_path, index = False)

In [12]:
df.head()

| | userId | movieId | rating | movie_idx |
|---|---|---|---|---|
| 0 | 0 | 2 | 3.5 | 2 |
| 1 | 0 | 29 | 3.5 | 29 |
| 2 | 0 | 32 | 3.5 | 32 |
| 3 | 0 | 47 | 3.5 | 47 |
| 4 | 0 | 50 | 3.5 | 50 |
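As an alternative to building the id mapping by hand, pandas can assign contiguous codes directly. The sketch below uses `pd.factorize` on a hypothetical toy frame; it is not what this notebook does, and the resulting indices follow order of first appearance, so they will generally differ from the `movie_idx` values shown above.

```python
import pandas as pd

# Hypothetical toy data with sparse movie ids.
toy = pd.DataFrame({"movieId": [2, 29, 32, 2, 50, 29]})

# factorize returns integer codes 0..K-1 plus the unique values,
# numbered in order of first appearance.
codes, uniques = pd.factorize(toy["movieId"])
toy["movie_idx"] = codes

print(toy)
print(list(uniques))  # old id stored at position k is the movie with new index k
```

Keeping `uniques` around gives the reverse mapping for free: `uniques[k]` is the original id of the movie with new index `k`.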


## <a id='shrink'>3. Shrink the data </a>
- Full dataset is too large to perform an O(N^2) algorithm (User-User CF)
- If you are an expert at big data (e.g., Spark), you can write a distributed job
- Check out the **preprocess_shrink.py** file for a cleaner version
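To get a feel for why an O(N^2) algorithm is a problem here, the sketch below counts the user pairs a User-User method would have to compare. The user counts come from this notebook (138493 in the full data, 10000 kept after shrinking); `n_user_pairs` is a hypothetical helper, and the count ignores the per-pair cost of comparing ratings.

```python
def n_user_pairs(n_users: int) -> int:
    """Number of unordered user pairs: N * (N - 1) / 2."""
    return n_users * (n_users - 1) // 2

full_pairs = n_user_pairs(138_493)   # all users in the full dataset
small_pairs = n_user_pairs(10_000)   # users kept after shrinking

print("pairs in full data:  ", full_pairs)
print("pairs in shrunk data:", small_pairs)
print("reduction factor:     %.1f" % (full_pairs / small_pairs))
```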

Idea:
- Select a subset of users and movies
- Users who rated the most movies
- Movies that have been rated by the most users
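The selection idea above can be sketched on a small frame. The data and the names `toy`, `n_keep_users`, and `n_keep_movies` are hypothetical; the approach (count with `Counter`, keep the most common, filter with `isin`) mirrors the one used in this section.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy ratings: one row per (user, movie) rating.
toy = pd.DataFrame({
    "userId":    [0, 0, 0, 1, 1, 2, 3, 3, 3],
    "movie_idx": [0, 1, 2, 0, 1, 0, 0, 1, 3],
})

n_keep_users, n_keep_movies = 2, 2

# Users with the most ratings and movies with the most ratings.
top_users = [u for u, _ in Counter(toy["userId"]).most_common(n_keep_users)]
top_movies = [mv for mv, _ in Counter(toy["movie_idx"]).most_common(n_keep_movies)]

# Keep only rows where both the user and the movie survived the cut.
toy_small = toy[toy["userId"].isin(top_users) & toy["movie_idx"].isin(top_movies)].copy()
print(toy_small)
```

Note that the filter is an intersection: a prolific user's rating is still dropped if the movie is not among the most-rated ones, which is why the shrunk frame has fewer rows than the kept users rated in total.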

In [13]:
import pickle
import numpy as np
import pandas as pd
from collections import Counter

In [14]:
print("original dataframe size: ", len(df))

original dataframe size:  20000263


In [19]:
# number of users
N = len(set(df['userId']))
# number of movies
M = df['movie_idx'].max() + 1

# count the occurrences of each user id and movie id
user_ids_count = Counter(df['userId'])
movie_ids_count = Counter(df['movie_idx'])

In [23]:
# number of users and movies we would like to keep
n = int(1e4)
m = int(2e3)

user_ids = [u for u, c in user_ids_count.most_common(n)]
# use a loop variable that does not shadow m, the number of movies to keep
movie_ids = [movie_id for movie_id, c in movie_ids_count.most_common(m)]

In [25]:
# make an explicit copy so the ids can be overwritten below without a SettingWithCopyWarning
df_small = df[df.userId.isin(user_ids) & df.movie_idx.isin(movie_ids)].copy()

In [26]:
len(df_small)

5392025

Need to remake user ids and movie ids since they are no longer sequential

In [33]:
len(movie_ids)

2000

In [None]:
# map each kept id to its position in the "most common" lists
new_user_id_map = {old_id: index for index, old_id in enumerate(user_ids)}
print('user map done')

new_movie_id_map = {old_id: index for index, old_id in enumerate(movie_ids)}
print('movie map done')

print('Setting new id')
# Series.map is vectorized and much faster than a row-wise apply
df_small.loc[:, 'userId'] = df_small['userId'].map(new_user_id_map)
df_small.loc[:, 'movie_idx'] = df_small['movie_idx'].map(new_movie_id_map)


In [37]:
len(df_small)

5392025

In [36]:
print("max user id: ", df_small['userId'].max())
print("max movie id: ", df_small['movie_idx'].max())

max user id:  9999
max movie id:  1999


In [39]:
save_path = "/Users/rubyjiang/Desktop/machine_learning_examples/large_files/movielens-20m-dataset/very_small_rating.csv"
df_small.to_csv(save_path, index=False)

## <a id='dict'>4. Preprocess to Dictionary </a>

#### Table to Dictionary
In code, we want to ask questions like:
- Given user i, which movies j did this user rate?
- Given movie j, which users i have rated it?
- Given user i and movie j, what is the rating?

Therefore, we build three dictionaries:
- user2movie: user ID -> list of movie IDs
- movie2user: movie ID -> list of user IDs
- usermovie2rating: (user ID, movie ID) -> rating

Why dict? Lookups are efficient: O(1) on average.
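A minimal sketch of the three lookups on toy data. The `toy_ratings` triples are hypothetical, and a plain loop with `setdefault` stands in for the row-wise `apply` used further down.

```python
# Hypothetical toy ratings as (user index, movie index, rating) triples.
toy_ratings = [(0, 0, 4.0), (0, 2, 3.5), (1, 0, 5.0)]

user2movie = {}        # user i -> list of movies j rated by i
movie2user = {}        # movie j -> list of users i who rated j
usermovie2rating = {}  # (i, j) -> rating

for i, j, rating in toy_ratings:
    user2movie.setdefault(i, []).append(j)
    movie2user.setdefault(j, []).append(i)
    usermovie2rating[(i, j)] = rating

print(user2movie[0])             # movies rated by user 0
print(movie2user[0])             # users who rated movie 0
print(usermovie2rating[(1, 0)])  # rating user 1 gave movie 0
```

Each of the three questions becomes a single dictionary lookup, with no scan over the rating table.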

In [41]:
import pickle
import matplotlib.pyplot as plt
from sklearn.utils import shuffle

In [42]:
path = save_path = "/Users/rubyjiang/Desktop/machine_learning_examples/large_files/movielens-20m-dataset/very_small_rating.csv"
df = pd.read_csv(path)

In [43]:
df.head()

| | userId | movieId | rating | movie_idx |
|---|---|---|---|---|
| 0 | 7307 | 1 | 4.5 | 10 |
| 1 | 7307 | 10 | 2.5 | 68 |
| 2 | 7307 | 19 | 3.5 | 143 |
| 3 | 7307 | 32 | 5.0 | 19 |
| 4 | 7307 | 39 | 4.5 | 85 |


In [44]:
N = df['userId'].max() + 1
M = df['movie_idx'].max() + 1

In [45]:
# split into train and test
df = shuffle(df)
cutoff = int(0.8*len(df))
df_train = df.iloc[:cutoff]
df_test = df.iloc[cutoff:]
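The split above is not seeded, so each run produces a different train/test partition. If reproducibility matters, one option is to fix the seed. The sketch below swaps in pandas' own `DataFrame.sample` with `random_state` in place of `sklearn.utils.shuffle`; the toy frame and the seed value are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for the ratings table.
toy = pd.DataFrame({"userId": range(10), "rating": [3.0] * 10})

# frac=1 returns all rows in random order; random_state fixes that order.
shuffled = toy.sample(frac=1, random_state=0)

cutoff = int(0.8 * len(shuffled))
toy_train = shuffled.iloc[:cutoff]
toy_test = shuffled.iloc[cutoff:]

print(len(toy_train), len(toy_test))
```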

In [47]:
user2movie = dict()
movie2user = dict()
usermovie2rating = dict()

count = 0
def update_dictionaries(row):
    global count
    count += 1
    if count % 100000 == 0:
        print("processed: %.3f" % (float(count)/cutoff))
    
    i = int(row['userId'])
    j = int(row['movie_idx'])
    
    if i not in user2movie:
        user2movie[i] = [j]
    else:
        user2movie[i].append(j)
        
    if j not in movie2user:
        movie2user[j] = [i]
    else:
        movie2user[j].append(i)
    
    usermovie2rating[(i,j)] = row['rating']
    
df_train.apply(update_dictionaries, axis=1)

processed: 0.023
processed: 0.046
processed: 0.070
processed: 0.093
processed: 0.116
processed: 0.139
processed: 0.162
processed: 0.185
processed: 0.209
processed: 0.232
processed: 0.255
processed: 0.278
processed: 0.301
processed: 0.325
processed: 0.348
processed: 0.371
processed: 0.394
processed: 0.417
processed: 0.440
processed: 0.464
processed: 0.487
processed: 0.510
processed: 0.533
processed: 0.556
processed: 0.580
processed: 0.603
processed: 0.626
processed: 0.649
processed: 0.672
processed: 0.695
processed: 0.719
processed: 0.742
processed: 0.765
processed: 0.788
processed: 0.811
processed: 0.835
processed: 0.858
processed: 0.881
processed: 0.904
processed: 0.927
processed: 0.950
processed: 0.974
processed: 0.997


1525932    None
2953601    None
345992     None
2636347    None
2267471    None
           ... 
3666631    None
1127385    None
3836101    None
1123273    None
4837118    None
Length: 4313620, dtype: object

In [53]:
# test ratings dictionary
usermovie2rating_test = {}
print("Calling: update_usermovie2rating_test")
count = 0
def update_usermovie2rating_test(row):
    global count
    count += 1
    if count % 100000 == 0:
        print("processed: %.3f" % (float(count)/len(df_test)))
    
    i = int(row.userId)
    j = int(row.movie_idx)
    usermovie2rating_test[(i,j)] = row.rating
df_test.apply(update_usermovie2rating_test, axis=1)

Calling: update_usermovie2rating_test
processed: 0.093
processed: 0.185
processed: 0.278
processed: 0.371
processed: 0.464
processed: 0.556
processed: 0.649
processed: 0.742
processed: 0.835
processed: 0.927


3784363    None
716855     None
2136765    None
4596869    None
3316734    None
           ... 
3844701    None
1382044    None
1672868    None
2571096    None
4472094    None
Length: 1078405, dtype: object

In [54]:
# note: these are not really JSON files but binary pickle files,
# because JSON keys have to be strings, but ours are ints and (int, int) tuples
with open('user2movie.json', 'wb') as f:
  pickle.dump(user2movie, f)

with open('movie2user.json', 'wb') as f:
  pickle.dump(movie2user, f)

with open('usermovie2rating.json', 'wb') as f:
  pickle.dump(usermovie2rating, f)

with open('usermovie2rating_test.json', 'wb') as f:
  pickle.dump(usermovie2rating_test, f)
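A short sketch of why pickle is used here and how the files can be read back. The toy dictionary and the temporary file path are hypothetical. With tuple keys, `json.dumps` raises a `TypeError`, while pickle round-trips the dictionary unchanged.

```python
import json
import os
import pickle
import tempfile

# Hypothetical toy dictionary with (user, movie) tuple keys.
toy_usermovie2rating = {(0, 3): 4.5, (1, 7): 2.0}

# JSON cannot serialize tuple keys.
try:
    json.dumps(toy_usermovie2rating)
    json_ok = True
except TypeError:
    json_ok = False
print("JSON can store tuple keys:", json_ok)

# Pickle preserves keys and values exactly.
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "usermovie2rating.pkl")
    with open(file_path, "wb") as f:
        pickle.dump(toy_usermovie2rating, f)
    with open(file_path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_usermovie2rating)
```

The files written above can be loaded the same way with `open(..., 'rb')` and `pickle.load`. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.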
