Let's assume we have collected some interesting datasets and we want to make predictions and recommendations based on them. The datasets are simply CSV files.

There are multiple ways of working with CSV files in Python. We could just `open(file)` and parse its content manually

In [3]:
open('dataset/ratings.csv')

<_io.TextIOWrapper name='dataset/ratings.csv' mode='r' encoding='UTF-8'>

but that would be reinventing the wheel, as there is already a built-in `csv` module for that:

In [4]:
import csv
reader = csv.reader(open('dataset/ratings.csv'))
next(reader)

['userId', 'movieId', 'rating', 'timestamp']

In [5]:
next(reader)

['1', '31', '2.5', '1260759144']

Much better. Now you can iterate through the dataset and do your analysis, although you still have to write that code yourself. We can do better with `pandas`, which provides a convenient data analysis API.

In [6]:
import pandas

Let's start by importing our dataset:

In [7]:
df = pandas.read_csv('dataset/ratings.csv')

The `read_csv` function loads the CSV file from the provided path and transforms it into a `DataFrame` object, which is the primary data structure used when working with Pandas.

In [8]:
type(df)

pandas.core.frame.DataFrame

Whenever you want to quickly preview the result of your operations, simply print your `DataFrame`. It implements a `__str__` method which renders nicely formatted output, so you can easily see all the columns and how many rows are returned.

In [9]:
df

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
5,1,1263,2.0,1260759151
6,1,1287,2.0,1260759187
7,1,1293,2.0,1260759148
8,1,1339,3.5,1260759125
9,1,1343,2.0,1260759131


Doing simple statistics is trivial

In [10]:
df.rating.mean()

3.5436082556697732

In [11]:
df.rating.max()

5.0

In [12]:
df.rating.min()

0.5

In [13]:
df.rating.std()

1.0580641091070389
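
The four statistics above can also be gathered in a single call. The following is a minimal sketch using `Series.describe()` on a small made-up frame; the `toy` data is illustrative only and is not taken from the dataset.

```python
import pandas

# Toy ratings, made up for illustration only.
toy = pandas.DataFrame({
    'userId': [1, 1, 2, 2],
    'movieId': [31, 1029, 31, 1061],
    'rating': [2.5, 3.0, 4.0, 0.5],
})

# describe() returns count, mean, std, min, quartiles and max as a Series.
summary = toy.rating.describe()
print(summary[['mean', 'min', 'max', 'std']])
```

The same call on `df.rating` should report the values computed cell by cell above.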

The real power of Pandas comes with queries and aggregations

In [14]:
df.query('movieId==1')

Unnamed: 0,userId,movieId,rating,timestamp
495,7,1,3.0,851866703
699,9,1,4.0,938629179
889,13,1,5.0,1331380058
962,15,1,2.0,997938310
3105,19,1,3.0,855190091
3528,20,1,3.5,1238729767
4008,23,1,3.0,1148729853
4781,26,1,5.0,1360087980
5048,30,1,4.0,944943070
6625,37,1,4.0,981308121


In [15]:
pandas.read_csv('dataset/movies.csv').query('movieId==1')

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
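
Since aggregations are mentioned but not shown, here is a hedged sketch of one: grouping by `movieId` to get the mean rating and the number of ratings per movie. It also shows that `query` selects the same rows as boolean-mask indexing. The `toy` frame is made up for illustration.

```python
import pandas

# Toy ratings, made up for illustration only.
toy = pandas.DataFrame({
    'userId': [7, 9, 13, 7, 9],
    'movieId': [1, 1, 1, 2, 2],
    'rating': [3.0, 4.0, 5.0, 2.0, 3.0],
})

# Two equivalent ways of selecting all ratings of movie 1.
by_query = toy.query('movieId == 1')
by_mask = toy[toy.movieId == 1]

# Aggregation: mean rating and number of ratings for every movie.
per_movie = toy.groupby('movieId').rating.agg(['mean', 'count'])
print(per_movie)
```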


The main goal of a recommender system is to show you other content you might also want to see. For example, when you are watching a YouTube video, you also see suggestions for other videos. One approach to implementing such functionality is to find similar movies based on user ratings.

A user rates a movie on a scale from 0.5 to 5. We then try to find other users who rated that movie similarly and see what other movies they liked. Those movies might also be interesting to our user, provided they have not seen them yet.

In [16]:
pivot_table = pandas.pivot_table(df, index='movieId', columns=['userId'], values='rating')
pivot_table

movieId \ userId,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
1,,,,,,,3.0,,4.0,,...,,4.0,3.5,,,,,,4.0,5.0
2,,,,,,,,,,,...,5.0,,,3.0,,,,,,
3,,,,,4.0,,,,,,...,,,,3.0,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,3.0,,,,,,
6,,,,,,,,,,,...,,,4.0,,5.0,4.0,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,
10,,4.0,,4.0,,,3.0,,,,...,3.0,,,,3.0,,,,,
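
To see what the pivot does on a small scale, the sketch below reshapes three made-up ratings into a movie-by-user table. Any user-movie pair without a rating shows up as `NaN`; the `toy` data and variable names are illustrative.

```python
import pandas

# Three made-up ratings in "long" format: one row per rating.
toy = pandas.DataFrame({
    'userId': [1, 1, 2],
    'movieId': [10, 20, 10],
    'rating': [4.0, 3.0, 5.0],
})

# "Wide" format: one row per movie, one column per user.
wide = pandas.pivot_table(toy, index='movieId', columns='userId', values='rating')

# User 2 never rated movie 20, so exactly one cell is missing.
missing = int(wide.isna().sum().sum())
print(wide)
print(missing)
```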


As you can see, our dataset is really sparse, meaning we have very few ratings filled in. We can check the ratio of filled ratings to the number of all user-movie pairs. The number of ratings is `len(df)`, because `df.size` counts every cell of the ratings table (rows times columns) rather than rows.

In [17]:
len(df)

100004

In [18]:
pivot_table.size

6083286

In [19]:
round(1 - (len(df) / pivot_table.size), 4)

0.9836

About 98% of user-movie pairs are not filled in, so it would be a huge waste of space to keep the entire matrix in memory.

We want to efficiently compute [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between movies, so that for a given movie we can find similar ones. We can use `sklearn` and its pairwise [cosine similarity](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) function for that, but first we will optimize our data structure a little by using a Compressed Sparse Row (CSR) matrix, since only a small fraction of the possible ratings is filled in.
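
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand in plain Python for made-up rating vectors, treating a missing rating as 0. The helper `cosine` is hypothetical and is not part of `sklearn`.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (illustrative helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a movie without ratings: define similarity as 0
    return dot / (norm_u * norm_v)

# Each vector holds one movie's ratings by three users (0 means "not rated").
movie_a = [3.0, 0.0, 4.0]
movie_b = [3.0, 0.0, 4.0]
movie_c = [0.0, 5.0, 0.0]

same = cosine(movie_a, movie_b)      # identical rating vectors
disjoint = cosine(movie_a, movie_c)  # no user rated both movies
print(same, disjoint)
```

Identical vectors score 1 and movies with no raters in common score 0, which is the range to expect for non-negative ratings.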

In [40]:
from scipy.sparse import csr_matrix

In order to use a CSR matrix, we need to prepare our data in such a way that it can be passed to the constructor. Its arguments are `csr_matrix((data, (rows, columns)))`.

Each of `data`, `rows` and `columns` is a 1-dimensional array, so to initialize a 1x1 matrix we pass `csr_matrix(([value], ([0], [0])))`.

Thanks to that, we don't have to build the entire huge matrix in memory first; we only use the data we really need.
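
As a small illustration of that constructor form, the sketch below builds a 2x3 matrix from three made-up ratings given as (value, row, column) triplets. The variable names are illustrative.

```python
from scipy.sparse import csr_matrix

# Three non-zero ratings in a 2x3 (movies x users) matrix, made up for illustration.
data = [4.0, 5.0, 3.0]  # the rating values
rows = [0, 0, 1]        # movie index of each rating
cols = [0, 2, 1]        # user index of each rating

small = csr_matrix((data, (rows, cols)), shape=(2, 3))

print(small.nnz)        # number of stored elements
print(small.toarray())  # dense view; entries that were never set become 0
```

Passing `shape` explicitly guarantees the dimensions even when the last row or column has no entries.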

In [84]:
unique_movie_ids = df.movieId.unique()
unique_user_ids = df.userId.unique()

# Map every id to its position in the corresponding array of unique ids.
movieIdToIndexMap = {int(movie_id): index for index, movie_id in enumerate(unique_movie_ids)}
userIdToIndexMap = {int(user_id): index for index, user_id in enumerate(unique_user_ids)}
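
Because movie ids are not contiguous, these maps translate ids into row positions. The pure-Python sketch below shows the idea on three ids from the preview above, together with an inverse lookup that is handy when turning matrix rows back into movies. The names `id_to_index` and `index_to_movie_id` are hypothetical.

```python
# Ids in order of first appearance, as unique() would return them.
unique_ids = [31, 1029, 1061]

# id -> row position in the matrix
id_to_index = {movie_id: index for index, movie_id in enumerate(unique_ids)}

# row position -> id (inverse lookup)
index_to_movie_id = {index: movie_id for movie_id, index in id_to_index.items()}

print(id_to_index[1029])
print(index_to_movie_id[2])
```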

In [85]:
# Translate the ids of every rating into matrix indices in one vectorized pass
# instead of looping over rows with iterrows(), which is slow on large frames.
ratings = df.rating.values
movieIdIndices = df.movieId.map(movieIdToIndexMap).values
userIdIndices = df.userId.map(userIdToIndexMap).values


In [41]:
rows_num = len(df.movieId.unique())
cols_num = len(df.userId.unique())

In [48]:
matrix = csr_matrix((ratings, (movieIdIndices, userIdIndices)), shape=(rows_num, cols_num))

In [49]:
matrix

<9066x671 sparse matrix of type '<class 'numpy.float64'>'
	with 100004 stored elements in Compressed Sparse Row format>

In [50]:
matrix.toarray()

array([[ 2.5,  0. ,  0. , ...,  0. ,  0. ,  0. ],
       [ 3. ,  0. ,  0. , ...,  0. ,  0. ,  0. ],
       [ 3. ,  0. ,  0. , ...,  0. ,  0. ,  0. ],
       ..., 
       [ 0. ,  0. ,  0. , ...,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0. , ...,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0. , ...,  0. ,  0. ,  0. ]])

In [51]:
matrix[1, :]

<1x671 sparse matrix of type '<class 'numpy.float64'>'
	with 42 stored elements in Compressed Sparse Row format>

In [36]:
from sklearn.metrics.pairwise import cosine_similarity

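Let's try comparing a movie to itself; we would expect the similarity to be 100%. Note that `matrix[1, :]` is the row at index 1, that is the second movie in `unique_movie_ids`, not the movie with id=1.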

In [54]:
cosine_similarity(matrix[1, :], matrix[1, :])

array([[ 1.]])

The result is as expected. Now we can try a more realistic example and compare against another movie. Choosing totally different genres (a horror movie vs. a children's movie) should give us low similarity.

In [56]:
pandas.read_csv('dataset/movies.csv').query('movieId==159858')

Unnamed: 0,movieId,title,genres
9097,159858,The Conjuring 2 (2016),Horror


In [57]:
pandas.read_csv('dataset/movies.csv').query('movieId==1')

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [58]:
cosine_similarity(
    matrix[movieIdToIndexMap[159858], :],
    matrix[movieIdToIndexMap[1], :]
)

array([[ 0.07652809]])

Great! That's clearly low similarity. Let's see if it gives the expected results when movies are actually similar to each other, like two installments of Jurassic Park.

In [79]:
pandas.read_csv('dataset/movies.csv').query('movieId==480 | movieId==4638')

Unnamed: 0,movieId,title,genres
427,480,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller
3641,4638,Jurassic Park III (2001),Action|Adventure|Sci-Fi|Thriller


In [87]:
cosine_similarity(
    matrix[movieIdToIndexMap[480], :],
    matrix[movieIdToIndexMap[4638], :]
)

array([[ 0.24505641]])

As you can see, the similarity is higher. If we compared all movies to each other, we could create a similarity matrix, which would let us quickly look up similar movies.

In [117]:
selected_movie = matrix[movieIdToIndexMap[1], :]
similar_movies = [] 
for index, movie_id in enumerate(unique_movie_ids):
    similarity = cosine_similarity(
        selected_movie,
        matrix[index, :]
    )[0][0]
    similar_movies.append((movie_id, similarity))

In [118]:
similar_movies.sort(key=lambda x: x[1], reverse=True)

This is an intuitive way of computing similarities, but it is quite slow. It's better to pass the entire matrix to the `cosine_similarity` function: we get the result much faster, and it is computed for all movies at once.

In [122]:
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(matrix)
similarity_matrix

array([[ 1.        ,  0.10058858,  0.19164838, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.10058858,  1.        ,  0.17538512, ...,  0.16238844,
         0.16238844,  0.        ],
       [ 0.19164838,  0.17538512,  1.        , ...,  0.14400461,
         0.14400461,  0.        ],
       ..., 
       [ 0.        ,  0.16238844,  0.14400461, ...,  1.        ,
         1.        ,  0.        ],
       [ 0.        ,  0.16238844,  0.14400461, ...,  1.        ,
         1.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  1.        ]])
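
Once the full similarity matrix exists, the most similar movies for one row can be read off by sorting that row. The following is a minimal sketch with `numpy` on a made-up 4x4 matrix. The helper `top_similar` is hypothetical and is not part of `sklearn`.

```python
import numpy as np

def top_similar(similarity, row, n):
    """Return the indices of the n rows most similar to `row`, excluding itself."""
    order = np.argsort(similarity[row])[::-1]  # indices sorted by descending similarity
    return [int(i) for i in order if i != row][:n]

# Made-up symmetric similarity matrix for four movies.
toy_similarity = np.array([
    [1.0, 0.1, 0.6, 0.3],
    [0.1, 1.0, 0.2, 0.0],
    [0.6, 0.2, 1.0, 0.5],
    [0.3, 0.0, 0.5, 1.0],
])

closest = top_similar(toy_similarity, 0, 2)
print(closest)
```

With the real data, each returned index would be translated back to a movie id through `unique_movie_ids[index]`, since the matrix rows follow the order of that array.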

In [119]:
movies_db = pandas.read_csv('dataset/movies.csv', index_col='movieId')
for movie_id, similarity in similar_movies[:10]:
    print(movies_db.loc[movie_id].title, movies_db.loc[movie_id].genres, similarity)

Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 1.0
Toy Story 2 (1999) Adventure|Animation|Children|Comedy|Fantasy 0.594709812032
Star Wars: Episode IV - A New Hope (1977) Action|Adventure|Sci-Fi 0.576187845977
Forrest Gump (1994) Comedy|Drama|Romance|War 0.564533861453
Independence Day (a.k.a. ID4) (1996) Action|Adventure|Sci-Fi|Thriller 0.56294560026
Groundhog Day (1993) Comedy|Fantasy|Romance 0.548023021399
Back to the Future (1985) Adventure|Comedy|Sci-Fi 0.536700289574
Jurassic Park (1993) Action|Adventure|Sci-Fi|Thriller 0.535197077837
Shrek (2001) Adventure|Animation|Children|Comedy|Fantasy|Romance 0.532685098741
Star Wars: Episode VI - Return of the Jedi (1983) Action|Adventure|Sci-Fi 0.529334044309
