In [1]:
import sys
import os
import logging
import numpy as np
import pandas as pd
import pickle
import string
import re
from time import time

logging.basicConfig(level=logging.WARN, format='%(message)s')

# Recommendation System on Resource Poor Machines

### Have your ever crashed your sever trying to generate features for your model? 

### Too much data to fit in memory?

### Let's solve this here!

If you have ever use Netflix, YouTube, Hulu or any othe streaming service, after you finish watching a video you had most likely seen a set of recommendations, showing you other similar videos. This is done by their recommendation system and is a core part of their platform because it allow their users to continue enjoying the service, one video after the next, without having to previously decide exactly what they want to watch. In short Recommendation Systems are important.

In this tutorial I'll show you how to create a "content-based" movie recommendation system, even if you only have a single machine with limited memory RAM available.

The focus of this tutorial would be on how to batch and serialize information to allow the model to be trained, rather than creating a highly accurate recommendation system. But you can use these same concepts in your more complex systems.

##### Why do I want to learn this?

1. Recommendation Systems are a really popular part of todays applications.
2. You will be able to work with huge datasets just using your own laptop.

##### What is the problem?
When working with a content-based recommendation system the are two main memory intensive operations.
    1. Calculating a huge number of comparisons
        Why? 
        - movielens20m (popular movie recommendations dataset) contains 27,000 unique movies
        - You compare each movie against each other: over 700 MM comparisons
        - A comparison involves tens of operations depending how many features you are using to describe each movie.
    2. Doing big merges in memory. In general highly accurate content-based recommendation systems you want to use several fields in order to improve your knowledge about the movie. For instance you want to include:
        - Title
        - Author
        - Cast
        - Plot
        - Etc.

##### What are we suggesting?
You can "divide and conquer" in order to solve this problem.

Let's get started.

Our theory is that we are using too much memory at once. Thus we will create a function to easily let us know how much memory we are consuming at any given time.

In [2]:
import psutil
def get_memory_usage(pid = None, log=False):
    ''' Logs how many gigabytes of RAM this process is using'''
    if pid is None:
        pid = os.getpid()
    process = psutil.Process(pid)
    memory = (process.memory_info().rss / (1024 ** 3))
    if log:
        logging.warning("GBs Used: %f " % (memory))
        
    return memory

memory = get_memory_usage(log=True)

GBs Used: 0.088036 


### Data
We will be using the openly available and widely used movielens 20m dataset. So let's make a function that will read that dataset and give us one of its fields.

In [3]:
def get_data(field, filepath='movies.csv', id_field='movieId'):
    # Read data from CSV but only keep the field requested 
    cursor = pd.read_csv(filepath)[[id_field, field]].fillna('')
    
    # Ignore all the punctuation characters
    regex = '[%s]' % re.escape(string.punctuation)
    cursor[field] = cursor[field].apply(lambda x: re.sub(regex,' ', x))
    
    # Remove duplicated spaces
    regex_spaces = '\s+' 
    cursor[field] = cursor[field].apply(lambda x: re.sub(regex_spaces,' ', x).strip())
    
    # Return data as numpy array
    return cursor.as_matrix()

Our data looks like this 

In [4]:
get_data('genres')

array([[1, 'Adventure Animation Children Comedy Fantasy'],
       [2, 'Adventure Children Fantasy'],
       [3, 'Comedy Romance'],
       ..., 
       [131258, 'Adventure'],
       [131260, 'no genres listed'],
       [131262, 'Adventure Fantasy Horror']], dtype=object)

### Features 
Now that we can get data. We will use it to generate features. 

For that we need 2 things: Indexing and Scoring.


1 - INDEXING: which is where we calculate the importance of words in each movie.

    Ex. 
        How relevant is the term 'Magic' in 'Harry Potter'

 In this particular problem we will use TFIDF to calculate the importance of each word. But you can use any vectorizer you prefer.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
def index(matrix):
    # Save the IDs for later
    ids = matrix[:, 0].astype(int)
    
    # Extract the values as a list
    data = matrix[:, 1]
    
    # TODO: Explain parameters values / magic numbers
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
    
    # Train the model
    t0 = time()
    tfidf_matrix = vectorizer.fit_transform(data)
    logging.debug("Indexing Duration: %f" % (time() - t0))
    logging.debug("n_samples: %d, n_features: %d" % tfidf_matrix.shape)
    
    return tfidf_matrix, ids #, vectorizer

 2 - SCORING, which is where we need to calculate how similar two movies are based on the previously generated index. For this system we will use cosine similarity. But you can feel free to use other distance metrics.

    Ex.
        Is 'Magic' extremely important in both 'Harry Potter' and 'Twilight'?
  

In [6]:
from scipy.sparse import csr_matrix
def cosine_similarity(matrix):
    '''
     Given a Matrix NxM, where N is the number of items and M is the number of features, 
     return the cosine similarity between each pair of items.
     
    :param matrix: Scipy Sparse matrix
    :return: Scipy Sparse matrix NxN where every cell is the similarity of its indexes
    '''

    t0 = time()
    
    product = matrix @ matrix.T
    
    matrix_square = matrix.copy()
    matrix_square.data **= 2
    
    square_sum = matrix_square.sum(axis=1)
    square_sum = np.sqrt(square_sum)
    
    norm = square_sum @ square_sum.T
    score = product / norm
    
    duration = time() - t0
    
    logging.debug("Scoring Duration: %f" % duration)
    logging.debug("n_samples: %d, n_related_samples: %d" % score.shape)

    return csr_matrix(score)


The previous function will use the cosine similarity score to rank for each movie, the other 27,000 movies. For instance, it will tell us how similar is each of the 27,000 movie in the dataset to "Harry Potter".

But do we really need to save the 27,000 recommendations of a movies if at the end of the day, we will most likely only recommend 5 of them. Probably not. So let's create a function to truncate this data.

Let's just keep the top K elements that are most similar for each of the movies. And at the same time we will convert the data from a matrix to a DataFrame where each row is: id1, id2, score.

In [7]:
def get_top_k(matrix, movie_ids, k, movie_ids_row=None):
    
    # Extract the data from the sparse matrix
    rows, cols = matrix.nonzero()
    scores = matrix.data

    # Parse from matrix index to movielend_id
    if movie_ids_row is None:
        print('warning')
        movie_ids_row = movie_ids
    
    rows_movielens = movie_ids_row[rows]
    cols_movielens = movie_ids[cols]

    # Filter Out movie recommending themselves
    # Ex. Harry Potter most similar movie will always be Harry Potter
    mask = rows_movielens != cols_movielens
    rows_movielens = rows_movielens[mask]
    cols_movielens = cols_movielens[mask]
    scores = scores[mask]

    # Concat columns as a single panda dataframe
    pre_frame = np.rec.fromarrays((rows_movielens, cols_movielens, scores), names=('id1','id2','score'))
    result = pd.DataFrame(pre_frame)
    
    # Get only the top K elements for each movieid
    # We are assuming that a really small similarity in a field won't 
    # cause a significant difference in the last result.
    result = result.set_index('id2').groupby('id1')['score'].nlargest(k)
    result = result.reset_index()
    result = result.set_index(['id1', 'id2'])
    logging.info('  %s\t %d/%d saved/found', 'TFIDF', result.shape[0], len(scores))
        
    return result.reset_index()


Now if we want to calculate the similarity of any two movies for a particular field, such as 'genres'. We just need to:
1. Get the data
```
data = get_data('genres')
```
2. Index that data
```
tfidf_matrix, ids = index(data)
```
3. Calculate the similarity
```
similarity_scores = similarity(tfidf_matrix)
```
4. Get the best K recommendations
```
recommendations = get_top_k(similarity_scores, movie_ids, 100)
```

Let's try this.

In [8]:
field = 'genres'

# 1. Get the data
data = get_data(field)

# 2. Index that data
tfidf_matrix, movie_ids = index(data)

# 3. Calculate the similarity
similarity_scores = cosine_similarity(tfidf_matrix)

# 4. Get the best K recommendations
recommendations = get_top_k(similarity_scores, movie_ids, 100)

MemoryError: 

Yes! We got a Memory Error. 

Right now we are trying to do all 700 MM comparisons at the same time and our RAM it's like NO!!!!!! I know RAM is too much for you. We need to chill. We are doing too much computation at the same time and also most of those values are useless. Let's be honest we don't need to know that 'Harry Potter' and 'Toy Story' have a 0.00001 cosine similarity in order to obtain the 'Harry Potter' top 5 similar movies.

What we do?

1. TRIMMING: Remove really small values

2. BATCHING: Only ompare a subset of movies at the same time. 

    - PRO: A smaller subset will allow us to run our program in machines with smaller RAMs 
    - CON: Using smaller subsets will make our program slower. 
    Thus we want to tune this parameter to use as much memory as you have available but not go over board.

You can consider both of this number hyperparameter, and tune it based on your particular scenario. 
For the subset size, you would want to start with a really big number and progressively decresing.

#### Trimming

Make a function that calcualte the cosine similarity between two matrixes, but removes all the values which are less than a specified threshold.

In [9]:
from scipy.spatial.distance import cdist, squareform
from scipy.sparse import csr_matrix

def distance_batch(m1,
                   m2,
                   cap):
    
    # Calculate Distance
    t0 = time()
    
    product = m1 @ m2.T
    
    # First Matrix
    matrix_square = m1.copy()
    matrix_square.data **= 2
    
    square_sum = matrix_square.sum(axis=1)
    square_sum1 = np.sqrt(square_sum)
    
    # Second Matrix
    matrix_square = m2.copy()
    matrix_square.data **= 2
    
    square_sum = matrix_square.sum(axis=1)
    square_sum2 = np.sqrt(square_sum)
    
    # Get Denominator
    norm = square_sum1 @ square_sum2.T
    
    score = product / norm
    
    
    duration = time() - t0
    
    result = np.array(score)
    
    # Remove super small values
    result[result < cap] = 0

    # Make the matrix sparse - Uses less RAM
    return csr_matrix(result)
    

#### Batching

This will be our magical function, it will take the big tfidf matrix, it will work with only batches of "batch_size" at the time. 

It will also call our trimming function, so that we remove all the values which are smalle than "cap".

Finally it will only return our top K elements.

In [10]:
def batch_cosine_similarity(tfidf_matrix, movie_ids, batch_size=1000, k=100, cap=0.5):
    '''
     Given a Matrix NxM, where N is the number of items and M is the number of features, 
     return the cosine similarity between each pair of items.
     
    :param index: Numpy matrix
    :return: Sparse matrix NxN where every cell is the similarity of its indexes
    '''
    
    t0 = time()
    matrix = tfidf_matrix
    
    # Calculate Similarity
    for i in range(0, matrix.shape[0], batch_size):
        limit = min(i+batch_size, matrix.shape[0])
        m1 = matrix[i:limit,:]
        
        # Calculate Distance
        dist_matrix = distance_batch(m1, matrix, cap)
        
        # Extract TOP K result
        p = get_top_k(dist_matrix, movie_ids, k, movie_ids_row=movie_ids[i:limit])
        
        # Temporarily save to a local file
        p.to_pickle('distance_%i.tmp' % (i))
    
    # Append All Similarities
    frames = []
    for i in range(0, matrix.shape[0], batch_size):
        frames.append(pd.read_pickle('distance_%i.tmp' % (i)))
    result = pd.concat(frames, axis=0)
    
    logging.debug('Duration: %f', time()-t0)
    return result

Here we go again. But this time we will use our `batch_cosine_similarity` instead of our preview naive `cosine_similarity` function.

In the result the `score` will tell us how simlar it's. For the 'genres' case: a 0 value means they dont share any genre if they have a value of 1 it means all their genres are exactly the same.

In [11]:
def run(q, field='genres', batch_size=500, k=100, cap=0.1):

    # 1. Get the data
    data = get_data(field)

    # 2. Index that data
    tfidf_matrix, movie_ids = index(data)

    # 3. Calculate the similarity
    recommendations = batch_cosine_similarity(tfidf_matrix, movie_ids, batch_size, k, cap)

    return recommendations

genres_recommendation = run('genres')

print('\nSample of the Data\n')
print(genres_recommendation.head())
del genres_recommendation


Sample of the Data

   id1   id2  score
0    1  2294    1.0
1    1  3114    1.0
2    1  3754    1.0
3    1  4016    1.0
4    1  4886    1.0


In [12]:
# Clear Memory
%reset -f array
m = get_memory_usage(log=True)

GBs Used: 3.610821 


### Application Use

A single field is two simple. What if we combine both the similarity of the title and the genre?

True our memory won't be happy about keeping all that in memory. What we will do here is temporarily save the results to the disk.For this we will use pickle because it allows to save things to disk in a easy, fast and lean way (2/3 smaller than csv).

Once we calculate the recommendations for each field, we read the recommendations from disk, join them together by giving an weight to each field and return our model.

How to assign weights vary, some systems use an heuristic approach using trial and error and manually analyzing the results obtained with each set of weights. Other systems use a machine learning approach, trying to learn this weight based on explicit user feedback. For the sake of simplicity we will use the same weight for each of our features.

In [13]:
def join_recommendations(fields, weights, k = 5):
    
    for field in fields: 
        logging.debug('Writting FIELD: %s', field)
        get_memory_usage(log=True)
        recommendation = run(field, batch_size = 1000)
        if recommendation is not None:
            recommendation.to_pickle('%s_rec.tmp' % field)
            del recommendation

    recs = pd.DataFrame(columns=['id1', 'id2'])
    for field in fields: 
        logging.debug('Reading FIELD: %s', field)
        get_memory_usage(log=True)
        rec = pd.read_pickle('%s_rec.tmp' % field)
        rec.columns = ['id1', 'id2', field]
        recs = recs.merge(rec, how='outer', on=['id1', 'id2'])

    recs = recs.fillna(0)    
        
    recs['score'] = 0
    
    for field, weight in zip(fields, weights): 
        recs['score'] += weight * recs[field]
        
    ml = recs.set_index('id2').groupby('id1')['genres'].nlargest(k)
    ml = ml.reset_index()
    return ml
    
rec = join_recommendations(['title', 'genres'], [0.5,0.5])

GBs Used: 3.610821 
GBs Used: 4.344379 
GBs Used: 4.262871 
GBs Used: 4.345444 


Let's translate those numbers into words. And find recommendations for a movie.

In [14]:
def get_movie_id(title, movie_titles):
    return movie_titles[movie_titles[:,1] == title][0][0]

def get_movie_titles(ids, movie_titles):
    return movie_titles[np.in1d(movie_titles[:,0],ids)]

def get_recommendations(title, recs, movie_titles):
    m_id = get_movie_id(title, movie_titles)
    rec_ids = recs[recs.id1 == m_id].id2.values
    
    return get_movie_titles(rec_ids, movie_titles)

movie_titles = get_data('title')    

What are the recommendations for `Toy Story`?

In [15]:
get_recommendations('Toy Story 1995', rec, movie_titles)

array([[2294, 'Antz 1998'],
       [3114, 'Toy Story 2 1999'],
       [3754, 'Adventures of Rocky and Bullwinkle The 2000'],
       [4016, 'Emperor s New Groove The 2000'],
       [4886, 'Monsters Inc 2001']], dtype=object)

What about for the famous `Journey ot the Center of the Earth`?

In [16]:
get_recommendations('Journey to the Center of the Earth 1959', rec, movie_titles)   

array([[2046, 'Flight of the Navigator 1986'],
       [7743, 'Explorers 1985'],
       [9004, 'D A R Y L 1985'],
       [31389, 'Dr Who and the Daleks 1965'],
       [51698, 'Last Mimzy The 2007']], dtype=object)

Now that you dont have to worry about crashing your machine with memory limitation. You can generate so many more features.

You can generate features for:

    Different fields:
        - Title
        - Genres
        - Cast
        - Etc.
    Using different algorithms:
        - Simple Term Frecuency
        - TFIDF
        - BM25F
        - Etc.
    Using different distances:
        - Jaccard
        - Cosine
        - Etc.

Merge all this features together and then play around to see which combination works best for your scenario.

You can even implement a supervised learning model, interviewing users and asking them to explicitly tell you if two movies are similar.

We wanted to keep this tutorial KISS (Keep It Short and Simple) so we will end here. But now that you have the tools, let your imagination run wild, try lots of combinations and have fun regardless of what your memory RAM wants to say. ;)

#DFTBA