# Preprocessing MovieLens data

The [MovieLens 1M dataset](http://grouplens.org/datasets/movielens/) was released in February 2003. It contains about 1 million ratings from roughly 6,000 users on roughly 4,000 movies.

I have downloaded the data to the `data/raw/ml-1m` directory (see `scripts/fetch.py`).

In [1]:
import os

print(os.listdir("../data/raw/ml-1m"))

['README', 'ratings.dat', 'movies.dat', 'users.dat']


`users.dat` contains personal information about each user:
- Gender
- Age
- Occupation (occupation code from a list of 20 occupations; see README)
- Zip-code

In [2]:
import pandas as pd

column_names = ["UserID", "Gender", "Age", "Occupation", "ZipCode"]
users = pd.read_csv("../data/raw/ml-1m/users.dat", sep="::", header=None, names=column_names, engine="python")

users.head()

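|   | UserID | Gender | Age | Occupation | ZipCode |
|---|--------|--------|-----|------------|---------|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |
| 3 | 4 | M | 45 | 7 | 2460 |
| 4 | 5 | M | 25 | 20 | 55455 |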
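The fourth row above shows a ZipCode of `2460`, which looks like a five-digit code that lost its leading zero somewhere along the way. The following is a minimal sketch of how that can happen when pandas infers a numeric column, and how forcing a string dtype avoids it. The sample text is made up to mirror the layout of `users.dat`, and the value `02460` is an assumption about the raw file.

```python
import io
import pandas as pd

# Made-up sample in the same "::"-separated layout as users.dat.
sample = "3::M::25::15::55117\n4::M::45::7::02460\n"
column_names = ["UserID", "Gender", "Age", "Occupation", "ZipCode"]

# With default type inference, an all-numeric column is parsed as integers,
# so the leading zero is dropped.
as_int = pd.read_csv(io.StringIO(sample), sep="::", header=None,
                     names=column_names, engine="python")

# Forcing a string dtype keeps the zip code exactly as written.
as_str = pd.read_csv(io.StringIO(sample), sep="::", header=None,
                     names=column_names, engine="python",
                     dtype={"ZipCode": str})

print(as_int.ZipCode.tolist())
print(as_str.ZipCode.tolist())
```

Since the zip code is only an identifier and never used in arithmetic, keeping it as a string is usually the safer choice.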


`movies.dat` contains basic information about each movie:
- Title
- Genres

In [3]:
import pandas as pd

column_names = ["MovieID", "Title", "Genres"]
movies = pd.read_csv("../data/raw/ml-1m/movies.dat", sep="::", header=None, names=column_names, engine="python")

movies.head()

|   | MovieID | Title | Genres |
|---|---------|-------|--------|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
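The `Genres` column packs several genres into one pipe-separated string. This notebook does not use the genres further, but if they are needed later, one possible approach is to expand them into one indicator column per genre. The sketch below uses a small hand-built frame (`movies_sample` is a hypothetical name) rather than the real table.

```python
import pandas as pd

# Two rows copied from the output above.
movies_sample = pd.DataFrame({
    "MovieID": [1, 3],
    "Title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
    "Genres": ["Animation|Children's|Comedy", "Comedy|Romance"],
})

# Split on "|" and build one 0/1 column per distinct genre.
genre_flags = movies_sample["Genres"].str.get_dummies(sep="|")
print(genre_flags)
```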


Finally, `ratings.dat` contains each movie rating.
- UserIDs range between 1 and 6040 
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings

In [4]:
import pandas as pd

column_names = ["UserID", "MovieID", "Rating", "Timestamp"]
ratings = pd.read_csv("../data/raw/ml-1m/ratings.dat", sep="::", header=None, names=column_names, engine="python")

ratings.head()

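|   | UserID | MovieID | Rating | Timestamp |
|---|--------|---------|--------|-----------|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |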
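The `Timestamp` values are plain integers, so they are hard to read at a glance. As a rough sketch, they can be converted to datetimes with `pd.to_datetime(..., unit="s")`, and the rating scale can be checked at the same time. The frame below (`ratings_sample`, a hypothetical name) holds three rows copied from the output above, and the timestamps are assumed to be in UTC.

```python
import pandas as pd

ratings_sample = pd.DataFrame({
    "UserID": [1, 1, 1],
    "MovieID": [1193, 661, 914],
    "Rating": [5, 3, 3],
    "Timestamp": [978300760, 978302109, 978301968],
})

# Interpret the integers as seconds since the Unix epoch.
ratings_sample["RatedAt"] = pd.to_datetime(ratings_sample["Timestamp"], unit="s")

# Ratings are whole stars on a 5-star scale.
assert ratings_sample["Rating"].between(1, 5).all()

# Number of ratings per user; on the full table this could be compared
# against the documented minimum of 20.
ratings_per_user = ratings_sample.groupby("UserID").size()

print(ratings_sample[["Timestamp", "RatedAt"]])
print(ratings_per_user)
```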


## Standardizing the IDs
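Neither the `MovieID` nor the `UserID` in the data can be used directly as a continuous, zero-based index: UserIDs start at 1, and MovieIDs have gaps (3,883 movies with IDs up to 3952). So I'll add continuous indexes to the movie and user tables and copy them to the matching rows of the ratings table.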

In [5]:
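# Use each table's default 0-based row index as the continuous ID.
movies["ContinuousMovieID"] = movies.index
merged = movies.merge(ratings, on="MovieID").drop(["Timestamp", "Title", "Genres"], axis=1)

users["ContinuousUserID"] = users.index
# Attach the continuous user ID to every rating and drop the demographic columns.
merged = users.merge(merged, on="UserID").drop(["Gender", "Age", "Occupation", "ZipCode"], axis=1)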

merged.head()

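|   | UserID | ContinuousUserID | MovieID | ContinuousMovieID | Rating |
|---|--------|------------------|---------|-------------------|--------|
| 0 | 1 | 0 | 1 | 0 | 5 |
| 1 | 1 | 0 | 48 | 47 | 5 |
| 2 | 1 | 0 | 150 | 148 | 5 |
| 3 | 1 | 0 | 260 | 257 | 4 |
| 4 | 1 | 0 | 527 | 523 | 5 |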
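To make the ID mapping easier to follow, here is the same idea on a toy example with deliberately gapped IDs. It uses `Series.map` instead of a merge, which is an alternative and not what the cell above does. All names here (`toy_movies`, `toy_ratings`, `id_to_index`, `index_to_id`) are hypothetical.

```python
import pandas as pd

# Toy movie table whose IDs have gaps, like MovieID in the real data.
toy_movies = pd.DataFrame({"MovieID": [1, 2, 5, 9]})
toy_ratings = pd.DataFrame({
    "UserID": [7, 7, 8],
    "MovieID": [9, 1, 5],
    "Rating": [4, 5, 3],
})

# The default row index is already 0..n-1, so it serves as the continuous ID.
toy_movies["ContinuousMovieID"] = toy_movies.index

# Lookup from original ID to continuous index, applied to every rating.
id_to_index = toy_movies.set_index("MovieID")["ContinuousMovieID"]
toy_ratings["ContinuousMovieID"] = toy_ratings["MovieID"].map(id_to_index)

# Reverse lookup: position i holds the original MovieID of matrix column i.
index_to_id = toy_movies["MovieID"].to_numpy()

print(toy_ratings)
```

Keeping the reverse lookup around makes it possible to translate matrix columns back into original MovieIDs later.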


## Converting the table to a matrix

Next, convert the `merged` table to a sparse matrix of |Users| x |Movies| dimensions.

In [6]:
import scipy.sparse as sparse

data = merged.Rating
col = merged.ContinuousMovieID
row = merged.ContinuousUserID

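# Pass the shape explicitly so trailing users or movies without ratings still get a row or column.
R = sparse.coo_matrix((data, (row, col)), shape=(len(users), len(movies))).tocsr()
print('{0}x{1} user by movie matrix'.format(*R.shape))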

6040x3883 user by movie matrix
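The construction above can be hard to picture at full size, so here is a small sketch of the same COO-to-CSR conversion on toy data, followed by two quick checks that are often useful on a ratings matrix. The values and the name `R_toy` are made up for illustration.

```python
import numpy as np
import scipy.sparse as sparse

# Three ratings: (user 0, movie 1) = 5, (user 0, movie 3) = 3, (user 2, movie 0) = 4.
data = np.array([5, 3, 4])
row = np.array([0, 0, 2])
col = np.array([1, 3, 0])

# An explicit shape keeps the empty row for user 1 and the empty column for movie 2.
R_toy = sparse.coo_matrix((data, (row, col)), shape=(3, 4)).tocsr()

# Number of stored ratings per user (per row).
ratings_per_user = R_toy.getnnz(axis=1)

# Fraction of user/movie pairs that have a rating.
density = R_toy.nnz / (R_toy.shape[0] * R_toy.shape[1])

print(R_toy.toarray())
print(ratings_per_user, density)
```

Unrated entries are stored implicitly as zero, which is distinguishable from a real rating only because the rating scale starts at 1.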


## Saving to disk

In [7]:
import scipy.io

scipy.io.mmwrite("../data/intermediate/user_movie_ratings.mtx", R)
movies.to_csv("../data/intermediate/movies.csv")
users.to_csv("../data/intermediate/users.csv")
merged.to_csv("../data/intermediate/ratings.csv")
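For later notebooks, the saved files need to be read back. The sketch below shows one way to do a round trip on toy data written to a temporary directory, so the paths and the names `R_toy` and `table` are hypothetical. Since `to_csv` also writes the row index by default, reading with `index_col=0` avoids an extra `Unnamed: 0` column.

```python
import os
import tempfile

import pandas as pd
import scipy.io
import scipy.sparse as sparse

R_toy = sparse.coo_matrix(([5, 3], ([0, 1], [1, 0])), shape=(2, 2)).tocsr()
table = pd.DataFrame({"UserID": [1, 2], "ContinuousUserID": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    mtx_path = os.path.join(tmp, "ratings.mtx")
    csv_path = os.path.join(tmp, "users.csv")

    # Write in the same formats as the cell above.
    scipy.io.mmwrite(mtx_path, R_toy)
    table.to_csv(csv_path)

    # mmread returns the matrix in COO format; convert back to CSR.
    R_loaded = sparse.csr_matrix(scipy.io.mmread(mtx_path))
    # The first CSV column is the saved row index.
    table_loaded = pd.read_csv(csv_path, index_col=0)

print(R_loaded.toarray())
print(table_loaded)
```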