# CAB420 Assignment 1C Question 1: Utils Demo
Simon Denman (s.denman@qut.edu.au)

## Overview

This notebook provides a quick demo and overview of the provided utility functions to help with Assignment 1C, Question 1.

In [2]:
from cab420_a1c_q1_utils import *

### Data Loading

We'll start by simply loading the two main tables: movies and ratings.

Note the content of each and their size:
* We have 9,742 movies, for each of these we have an ID, a title, and a set of genres
* We have 100,836 ratings, each of which has a userId, a movieId, a rating and a timestamp. Essentially, this is a list of movies people watched, what they thought of them, and when they rated them.

In [3]:
movies, ratings = load_data('.')
print(movies.shape)
print(ratings.shape)
print(movies.head())
print(ratings.head())

(9742, 3)
(100836, 4)
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


### Average Rating Per Film

There are lots of ways to manipulate this data. One simple thing is to get an average rating per film. Here, we can group the ratings table over the movieId column and average the ratings within each group. Note that the output has 9,724 rows rather than 9,742, which suggests that 18 movies in the database have not been rated by anyone.

In [4]:
average_rating_per_film = get_average_rating_per_film(ratings)
print(average_rating_per_film.shape)
average_rating_per_film.head()

(9724, 1)


movieId,rating
1,3.92093
2,3.431818
3,3.259615
4,2.357143
5,3.071429
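
As a rough illustration of the grouping described above, the following is a minimal sketch on toy data. It is not the code inside `get_average_rating_per_film`; the tiny tables and their values are made up for the example, and only the column names match the real data. It also shows one way to list movies that have no ratings at all.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values, real column names).
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["A (1995)", "B (1995)", "C (1995)"],
    "genres": ["Comedy", "Drama", "Comedy|Drama"],
})
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 2, 1],
    "rating": [4.0, 3.0, 5.0],
    "timestamp": [10, 20, 30],
})

# Group by movieId and average the rating column; the result is indexed by movieId.
avg = ratings.groupby("movieId")[["rating"]].mean()

# Movies that never appear in the ratings table get no row in the average table.
unrated = movies[~movies["movieId"].isin(ratings["movieId"])]

print(avg)
print(unrated["title"].tolist())
```

On the real tables, the same `isin` check should pick out the movies that account for the gap between 9,742 and 9,724.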


### Pulling out Genres

A useful way to characterise a movie is by its genres. The default genre representation is a bit limited (see the output of the movie table above). Instead, we can split this out so that each genre has its own column, and we then have a flag in that column to indicate if the movie belongs to that genre or not.

Here, we're just manipulating the movies table, so we end up with the same number of rows, but many more columns.

In [5]:
movies_with_genres, genres = expand_genres(movies)
print(movies_with_genres.shape)
movies_with_genres.head()

(9742, 22)


,movieId,title,Romance,Comedy,Crime,Thriller,War,Mystery,Children,Horror,...,Drama,(no genres listed),Animation,Western,IMAX,Action,Musical,Sci-Fi,Film-Noir,Documentary
0,1,Toy Story (1995),,1.0,,,,,1.0,,...,,,1.0,,,,,,,
1,2,Jumanji (1995),,,,,,,1.0,,...,,,,,,,,,,
2,3,Grumpier Old Men (1995),1.0,1.0,,,,,,,...,,,,,,,,,,
3,4,Waiting to Exhale (1995),1.0,1.0,,,,,,,...,1.0,,,,,,,,,
4,5,Father of the Bride Part II (1995),,1.0,,,,,,,...,,,,,,,,,,
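
For readers who want to see how a pipe-separated genre string can be turned into flag columns, here is a minimal sketch using pandas' `Series.str.get_dummies`. This is an assumption about one possible approach, not the implementation of `expand_genres`; in particular, the column order it produces (alphabetical) differs from the output above. Zeros are converted to NaN to mimic the 1.0/blank convention shown in the table.

```python
import pandas as pd

# Two rows copied from the movies table above.
movies = pd.DataFrame({
    "movieId": [1, 3],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Comedy|Romance"],
})

# One 0/1 column per genre, split on the '|' separator.
flags = movies["genres"].str.get_dummies(sep="|")

# Keep 1 where the movie has the genre, NaN elsewhere (result becomes float).
flags = flags.where(flags == 1)

genre_names = list(flags.columns)
expanded = pd.concat([movies[["movieId", "title"]], flags], axis=1)

print(genre_names)
print(expanded)
```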


### Detailed Data for Users

At this point, we can start to combine data across our two tables to work out what users think of particular genres. We're going to do this in two stages - the output here is not that useful on its own (but you may wish to modify it to do something else with it).

What we get is a table with one row per rating: the userId, plus one column per genre that holds the user's rating wherever the rated movie belongs to that genre (and is empty otherwise).

This has the same number of rows as the original 'ratings' table.

In [6]:
user_movies = movies_per_user(ratings, movies_with_genres, genres)
print(user_movies.shape)
user_movies.head()

(100836, 21)


,userId,Romance,Comedy,Crime,Thriller,War,Mystery,Children,Horror,Fantasy,...,Drama,(no genres listed),Animation,Western,IMAX,Action,Musical,Sci-Fi,Film-Noir,Documentary
0,1,,4.0,,,,,4.0,,4.0,...,,,4.0,,,,,,,
1,1,4.0,4.0,,,,,,,,...,,,,,,,,,,
2,1,,,4.0,4.0,,,,,,...,,,,,,4.0,,,,
3,1,,,,5.0,,5.0,,,,...,,,,,,,,,,
4,1,,,5.0,5.0,,5.0,,,,...,,,,,,,,,,


In [7]:
user_movies.columns

Index(['userId', 'Romance', 'Comedy', 'Crime', 'Thriller', 'War', 'Mystery',
       'Children', 'Horror', 'Fantasy', 'Adventure', 'Drama',
       '(no genres listed)', 'Animation', 'Western', 'IMAX', 'Action',
       'Musical', 'Sci-Fi', 'Film-Noir', 'Documentary'],
      dtype='object')
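
One way to arrive at a table of this shape is to join each rating to its movie's genre flags and then multiply the flags by the rating. The sketch below shows that idea on toy data; it is an assumption about how such a table could be built, not the code inside `movies_per_user`, and the small tables and the `movie_flags` name are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy ratings and genre-flag tables (hypothetical values).
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 5.0, 2.0],
})
movie_flags = pd.DataFrame({
    "movieId": [10, 20],
    "Comedy": [1.0, np.nan],
    "Drama": [np.nan, 1.0],
})
genre_names = ["Comedy", "Drama"]

# Attach the genre flags of the rated movie to every rating row.
merged = ratings.merge(movie_flags, on="movieId", how="left")

# Flag (1.0 or NaN) times rating: the rating where the genre applies, NaN elsewhere.
per_rating = merged[genre_names].mul(merged["rating"], axis=0)
per_rating.insert(0, "userId", merged["userId"])

print(per_rating)
```

Because the merge is a left join on the ratings, the result keeps one row per rating, which matches the row count reported above.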

Finally, we can group the table above by userId, which gives each user's average rating for every genre. This can be seen as a way to capture a user's preferences: do they like Children's Film-Noir movies? Or perhaps Action-Romance-Musicals are more their thing (are there any movies that can actually be said to be either of those things?).

Looking at the size of this, we now have one row per user.

In [8]:
user_genre_ratings = average_per_user(user_movies)
print(user_genre_ratings.shape)
user_genre_ratings.head()

(610, 20)


userId,Romance,Comedy,Crime,Thriller,War,Mystery,Children,Horror,Fantasy,Adventure,Drama,(no genres listed),Animation,Western,IMAX,Action,Musical,Sci-Fi,Film-Noir,Documentary
1,4.307692,4.277108,4.355556,4.145455,4.5,4.166667,4.547619,3.470588,4.297872,4.388235,4.529412,,4.689655,4.285714,,4.322222,4.681818,4.225,5.0,
2,4.5,4.0,3.8,3.7,4.5,4.0,,3.0,,4.166667,3.882353,,,3.5,3.75,3.954545,,3.875,,4.333333
3,0.5,1.0,0.5,4.142857,0.5,5.0,0.5,4.6875,3.375,2.727273,0.75,,0.5,,,3.571429,0.5,4.2,,
4,3.37931,3.509615,3.814815,3.552632,3.571429,3.478261,3.8,4.25,3.684211,3.655172,3.483333,,4.0,3.8,3.0,3.32,4.0,2.833333,4.0,4.0
5,3.090909,3.466667,3.833333,3.555556,3.333333,4.0,4.111111,3.0,4.142857,3.25,3.8,,4.333333,3.0,3.666667,3.111111,4.4,2.5,,
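
The per-user averaging can be sketched as a plain `groupby` followed by `mean`. Again, this is a toy illustration under assumed data rather than the body of `average_per_user`. The point to notice is that pandas skips NaN when averaging, so a genre's average only uses the movies in that genre, and a user who has rated nothing in a genre ends up with NaN for it.

```python
import numpy as np
import pandas as pd

# Toy per-rating table in the same layout as user_movies (hypothetical values).
user_movies = pd.DataFrame({
    "userId": [1, 1, 2],
    "Comedy": [4.0, 3.0, 2.0],
    "Drama": [np.nan, 5.0, np.nan],
})

# Mean of each genre column per user; NaN entries are ignored by default.
prefs = user_movies.groupby("userId").mean()

print(prefs)
```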


This is one suggested dataset to cluster, which would allow you to characterise a user's taste in films, much like the practical question that looked at the TripAdvisor data.
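
Note that the table above contains missing values (genres a user has never rated), and many clustering methods cannot take NaN as input. The sketch below shows one possible way to deal with this, filling each gap with that user's own mean across genres. This is purely an illustrative assumption and not a recommended or required approach; other choices (dropping sparse genres, filling with a global value, and so on) are equally possible, and whichever you pick needs its own justification.

```python
import numpy as np
import pandas as pd

# Toy per-user genre averages with gaps (hypothetical values).
prefs = pd.DataFrame(
    {
        "Comedy": [4.0, 2.0],
        "Drama": [5.0, np.nan],
        "Horror": [np.nan, 3.0],
    },
    index=pd.Index([1, 2], name="userId"),
)

# Each user's mean over the genres they have rated (NaN is skipped).
row_means = prefs.mean(axis=1)

# Fill every column's gaps with the matching user's mean (aligned on userId).
filled = prefs.apply(lambda col: col.fillna(row_means))

print(filled)
```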

### What if I want to do something else?

That's fine. The above functions hopefully give you an idea of how to manipulate this data. This is not the only way to approach this question, so if you wish to explore please do - just be sure to explain (and justify) what you're doing. Even if you use the approach here, you should provide some justification for why - and *it was in the sample code* is not an acceptable justification.