In [18]:
from fastai.collab import *
from fastai.tabular import *

## Collaborative filtering example

In [19]:
user,item,title = 'userId','movieId','title'

Untar the data at url `URLs.ML_SAMPLE`.

In [20]:
path = untar_data(URLs.ML_SAMPLE)

Read the csv at `path/'ratings.csv'` to variable `ratings`, and show the first few rows.

In [21]:
ratings = pd.read_csv(path/'ratings.csv')
ratings.head()  # show the first few rows, as the prompt asks

Create a `CollabDataBunch` from the `ratings` dataframe with a random seed of 42.

In [22]:
data = CollabDataBunch.from_df(ratings, seed=42)

Set `y_range` to the two-element list `[0, 5.5]`.

In [23]:
y_range = [0, 5.5]

Create a `collab_learner` from the data bunch, with 50 factors and the y_range created above.

In [24]:
learn = collab_learner(data, n_factors=50, y_range=y_range)
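
As a rough illustration of what `n_factors` and `y_range` control, the sketch below scores user/item pairs with a dot product of factor vectors plus biases, then squashes the score into `y_range` with a sigmoid. It assumes the learner is a dot-product model of this kind. The arrays are random stand-ins rather than learned weights, and `predict`, `user_factors`, `item_factors`, `user_bias`, and `item_bias` are hypothetical names, not fastai API.

```python
import numpy as np

rng = np.random.default_rng(42)
n_users, n_items, n_factors = 4, 6, 50
y_range = [0, 5.5]

# Hypothetical stand-ins for learned parameters:
# one factor vector and one bias per user and per item.
user_factors = rng.normal(scale=0.1, size=(n_users, n_factors))
item_factors = rng.normal(scale=0.1, size=(n_items, n_factors))
user_bias = np.zeros(n_users)
item_bias = np.zeros(n_items)

def predict(user_idx, item_idx):
    """Score user/item pairs and squash the scores into y_range."""
    raw = (user_factors[user_idx] * item_factors[item_idx]).sum(axis=1)
    raw = raw + user_bias[user_idx] + item_bias[item_idx]
    lo, hi = y_range
    return 1 / (1 + np.exp(-raw)) * (hi - lo) + lo

preds = predict(np.array([0, 1, 2]), np.array([3, 4, 5]))
print(preds)
```

A sigmoid only approaches its bounds, which is one plausible reason to set the upper limit slightly above the highest rating.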

Fit this learner for one cycle with a learning rate of 5e-3.

In [25]:
learn.fit_one_cycle(1, 5e-3)

epoch,train_loss,valid_loss,time
0,1.522016,1.356994,00:01


## Movielens 100k

Create a new variable `path` set to the result of `Config.data_path()/'ml-100k'`.

In [35]:
# ! wget http://files.grouplens.org/datasets/movielens/ml-100k.zip -P {Config.data_path()} && unzip {Config.data_path()}/'ml-100k.zip' -d {Config.data_path()}
path = Config.data_path()/'ml-100k'  # point `path` at the extracted dataset, as the prompt asks

--2019-05-07 00:21:00--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.34.235
Connecting to files.grouplens.org (files.grouplens.org)|128.101.34.235|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘/home/paperspace/.fastai/data/ml-100k.zip’


2019-05-07 00:21:00 (20.4 MB/s) - ‘/home/paperspace/.fastai/data/ml-100k.zip’ saved [4924029/4924029]

Archive:  /home/paperspace/.fastai/data/ml-100k.zip
   creating: /home/paperspace/.fastai/data/ml-100k/
  inflating: /home/paperspace/.fastai/data/ml-100k/allbut.pl  
  inflating: /home/paperspace/.fastai/data/ml-100k/mku.sh  
  inflating: /home/paperspace/.fastai/data/ml-100k/README  
  inflating: /home/paperspace/.fastai/data/ml-100k/u.data  
  inflating: /home/paperspace/.fastai/data/ml-100k/u.genre  
  inflating: /home/paperspace/.fastai/data/ml-100k/u.info  
  inflating: /home/paperspace/.fastai/data

In [36]:
(Config.data_path()/'ml-100k').exists()

True

Create another ratings dataframe, but this time:
- the path should be `path/'u.data'`
- the delimiter should be `\t`
- there should be no header
- the column names should be `userId`, `movieId`, `rating`, and `timestamp`
    
Why do you need these extra specifications, as compared to the original dataframe?

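`u.data` is tab-separated and has no header row. With the defaults, pandas would split on commas and treat the first data row as column names, so we set the delimiter, disable the header, and supply the column names ourselves.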
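
To see what those options change, here is a small sketch that parses two rows in the same tab-separated, headerless layout from an in-memory string. The real file is not read, and the names `raw`, `naive`, and `parsed` are made up for this illustration.

```python
import io
import pandas as pd

# Two rows in the same layout as the file: tab-separated, no header line.
raw = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"
cols = ['userId', 'movieId', 'rating', 'timestamp']

# Defaults: comma delimiter, first line consumed as the header.
naive = pd.read_csv(io.StringIO(raw))

# Explicit options: split on tabs, keep every line as data, supply names.
parsed = pd.read_csv(io.StringIO(raw), delimiter='\t', header=None, names=cols)

print(naive.shape)
print(parsed.shape)
print(parsed)
```

With the defaults, the first row is lost to the header and everything lands in a single column. With the explicit options, both rows survive and each field gets its own named column.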

In [38]:
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None, names=['userId', 'movieId', 'rating', 'timestamp'])
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


Create a `movies` dataframe by reading a CSV with:
- path = `path/'u.item'`
- delimiter = `'|'`
- encoding = `'latin-1'`
- no header
- names = `'movieId'`, `'title'`, `'date'`, `'N'`, `'url'`, followed by `'g0'` through `'g18'`

Show the first few rows of this dataframe.

In [40]:
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1', header=None, names=['movieId', 'title', 'date', 'N', 'url', *['g%d' % i for i in range(19)]])

In [41]:
movies.head()

Unnamed: 0,movieId,title,date,N,url,g0,g1,g2,g3,g4,...,g9,g10,g11,g12,g13,g14,g15,g16,g17,g18
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


Take a look at the length of the ratings dataframe.

In [42]:
ratings.shape[0]

100000

Merge the ratings and the movie dataframes into one big one that has columns:
- userId
- movieId
- rating
- timestamp
- title

Print out the head of this dataframe.
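
The merge step can be tried on toy data first. The sketch below is one way to do it with `DataFrame.merge`, joining on `movieId` and taking only the title column from the movies side. The `*_toy` frames and their values are invented for illustration and are not the MovieLens data.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (values are illustrative only).
ratings_toy = pd.DataFrame({
    'userId': [196, 186, 22],
    'movieId': [1, 2, 1],
    'rating': [3, 3, 1],
    'timestamp': [881250949, 891717742, 878887116],
})
movies_toy = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['Toy Story (1995)', 'GoldenEye (1995)'],
    'date': ['01-Jan-1995', '01-Jan-1995'],
})

# Select only the key and the title so the merge adds a single column.
rating_movie_toy = ratings_toy.merge(movies_toy[['movieId', 'title']], on='movieId')
print(rating_movie_toy.head())
```

Selecting `['movieId', 'title']` before merging keeps the date, URL, and genre columns out of the result.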

Create a `CollabDataBunch` from this dataframe, with random seed 42, valid_pct 0.1, item_name 'movieId'.

Show a batch from this dataframe.

Set `y_range` to [0, 5.5] once again.

Create a collab_learner from the data bunch above, with 40 factors, using the y_range from above, with weight decay 1e-1.

Find an appropriate learning rate.

If everything is correct so far, your learning rate should be somewhere around 5e-3.

Fit a 5-epoch cycle with your learning rate.

The best losses should be in the low 0.8s.

Save the model under the name `dotprod`.

## Interpretation

### Setup

Load the `dotprod` model from above.

Group the merged ratings-and-movies dataframe (`rating_movie`) by title, and count the `rating` field. Sort the counts in descending order, and return the titles of the 1000 most-rated movies.
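
A minimal sketch of the group, count, sort, and slice pattern on a toy frame, keeping the top 2 titles instead of the top 1000. The frame `rating_movie_toy` and the names `counts` and `top_titles` are assumptions for this example.

```python
import pandas as pd

# Toy frame: title 'A' has three ratings, 'B' has two, 'C' has one.
rating_movie_toy = pd.DataFrame({
    'title': ['A', 'B', 'A', 'C', 'A', 'B'],
    'rating': [4, 3, 5, 2, 4, 1],
})

# Number of ratings per title.
counts = rating_movie_toy.groupby('title')['rating'].count()

# Most-rated titles first, then keep the first two index labels.
top_titles = counts.sort_values(ascending=False).index.values[:2]
print(top_titles)
```

The ranking here is by how many ratings a title received, not by how high those ratings are.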

### Movie bias

Get the biases for the top movies using `learn.bias`. Hint: what should `is_item` be if we want movie biases? What does the `item` variable from the beginning of this notebook tell us?

Get the average movie rating by title. Then put together a list of 3-tuples, each with the bias of the movie, the title of the movie, and the mean rating of the movie, for each of the top movies. Call this `movie_ratings`.

Create a function `item0` that returns `o[0]` for an item `o`.

Sort the movie ratings by this function and print out the top 15 to get the worst movies.

Sort the movie ratings by the reverse of this function to get the best movies.
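
The sorting steps can be sketched with plain Python on invented data. The triples below are made-up (bias, title, mean rating) values, and `reverse=True` is one way to read "the reverse of this function". Sorting with a key that negates `o[0]` would give the same order here.

```python
# Made-up (bias, title, mean rating) triples for illustration.
movie_ratings_toy = [
    (0.31, 'Film A', 4.2),
    (-0.27, 'Film B', 2.1),
    (0.05, 'Film C', 3.4),
]

def item0(o):
    """Return the first element of a tuple (here, the bias)."""
    return o[0]

# Lowest bias first: the "worst" movies.
worst = sorted(movie_ratings_toy, key=item0)[:2]

# Highest bias first: the "best" movies.
best = sorted(movie_ratings_toy, key=item0, reverse=True)[:2]

print(worst)
print(best)
```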

### Movie weights

Get the movie weights using `learn.weight`. Call this `movie_w`. Print out the shape.

Get the first three principal components of `movie_w`. Call this `movie_pca`. Print out the shape.
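
For intuition about the shapes involved, here is a NumPy sketch of taking three principal components from a 1000 x 40 matrix via SVD. It is a stand-in for whatever PCA helper the notebook's solution uses, and `movie_w_toy` is random data, not learned movie weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in: 1000 movies, 40 factors each.
movie_w_toy = rng.normal(size=(1000, 40))

# PCA works on mean-centered data.
centered = movie_w_toy - movie_w_toy.mean(axis=0)

# Rows of vt are principal directions, ordered by decreasing singular value.
_, s, vt = np.linalg.svd(centered, full_matrices=False)

# Project every movie onto the first three directions.
movie_pca_toy = centered @ vt[:3].T
print(movie_pca_toy.shape)
```

Each row of `movie_pca_toy` is one movie described by three numbers instead of 40, with the first column capturing the most variance.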

Set the first three factors of `movie_pca` to variables `fac0`, `fac1`, `fac2`. Create a list of 2-tuples pairing each `fac0` value with its movie title. Call this `movie_comp`.

Sort the factor/title tuples by factor value in descending order and print out the top 10.

Do the same in ascending order, and print out the first 10.

Repeat the process, both descending and ascending, for `fac1`.

Get 50 random movies. Set X and Y to the `fac0` and `fac2` values of those movies. Create a pyplot figure with a scatter plot of them. Label each point with its title and give each a different color.