# Recommendation engine
Let's be real, recommendation engines are a bit complicated.  
You have content-based, collaborative filtering (personalized), deep learning, etc.  

I'm not going to show and explain how all of it works for every use case out there. Instead, I'll show the fastest, "simplest", personalized, good-quality, production-ready solution.  
Adjusting it to your use case should be relatively "straightforward", as much as these things normally are.  
Check out [implicit](https://github.com/benfred/implicit), it's great.

We will only use the explicit user, item, and rating columns.

In [5]:
import vaex

df = vaex.open('data/imdb.parquet')
df.head(2)

#,userId,movieId,rating,timestamp,name,title,genres,year,url
0,1,2,3.5,20050402T235347,Fausto Orms,Jumanji,"[""Adventure"",""Children"",""Fantasy""]",1995,'http://image.tmdb.org/t/p/w500/vzmL6fP7aPKNKPRT...
1,5,2,3.0,19961225T152609,Antony Maguire,Jumanji,"[""Adventure"",""Children"",""Fantasy""]",1995,'http://image.tmdb.org/t/p/w500/vzmL6fP7aPKNKPRT...


We remove rarely rated movies (those with 100 ratings or fewer).  
We also build a map from item ID to title for communication purposes.

In [6]:
userid = 'userId'
itemid = 'movieId'
title = 'title'

counts = df[itemid].value_counts()
counts = counts[counts > 100] # Remove rare movies
df = df[df[itemid].isin(counts.index)] 
unique_movies = df.groupby([itemid, title]).agg({'count': 'count'})
titles = {movie: name for movie, name in 
          zip(unique_movies[itemid].tolist(), unique_movies[title].tolist())}

min_rating = 4.4
df = df[min_rating < df['rating']]  # We want to learn from and recommend only movies people liked
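
To make the cleaning step concrete, here is a minimal plain-Python sketch of the same three operations on toy data: drop rarely rated items, build an ID-to-title map, and keep only high ratings. The toy rows, the `min_count=1` threshold, and the names `clean`, `toy_liked`, and `toy_titles` are assumptions for illustration only; the notebook itself does this with vaex on the full dataset.

```python
from collections import Counter

# Toy (userId, movieId, rating, title) rows; values are made up for illustration.
rows = [
    (1, 10, 5.0, "Movie A"),
    (2, 10, 4.5, "Movie A"),
    (3, 10, 3.0, "Movie A"),
    (1, 20, 5.0, "Movie B"),
    (2, 30, 5.0, "Movie C"),
    (3, 30, 2.0, "Movie C"),
]


def clean(rows, min_count, min_rating):
    """Mirror the notebook's steps: popularity filter, title map, then 'liked' filter."""
    counts = Counter(movie for _, movie, _, _ in rows)
    popular = {movie for movie, n in counts.items() if n > min_count}
    kept = [r for r in rows if r[1] in popular]
    # Build the title map before the rating filter so every popular movie keeps a title.
    titles = {movie: title for _, movie, _, title in kept}
    liked = [r for r in kept if r[2] > min_rating]
    return liked, titles


toy_liked, toy_titles = clean(rows, min_count=1, min_rating=4.4)
print(toy_liked)   # Movie B is gone (too rare); low ratings are gone too
print(toy_titles)  # {10: 'Movie A', 30: 'Movie C'}
```

The order matters a little: the rating filter runs last, so the title map still covers every movie that survived the popularity filter.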

Let's build a matrix-factorization model.
* [reference](https://www.benfrederickson.com/matrix-factorization/)

In [7]:
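import os

# implicit recommends a single OpenBLAS thread. OpenBLAS reads this setting when it is
# first loaded, so ideally set it at the very top of the notebook, before NumPy is imported.
os.environ['OPENBLAS_NUM_THREADS'] = '1'

import numpy as np
from scipy.sparse import csr_matrix
from implicit.als import AlternatingLeastSquares

# Binary (items x users) matrix: 1 wherever a user liked a movie.
# This orientation matches the implicit API used in this notebook; newer implicit
# releases changed `fit` to expect (users x items), so check your installed version.
ratings = csr_matrix((np.ones(len(df)), (df[itemid].values, df[userid].values)))

als = AlternatingLeastSquares(factors=32)
als.fit(ratings)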

100%|██████████| 15/15 [01:54<00:00,  7.62s/it]
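
If you are curious what the model is doing under the hood, here is a minimal NumPy sketch of alternating least squares for implicit feedback on a tiny dense matrix. It is not implicit's implementation (which works on sparse matrices with optimized solvers); it only illustrates the alternating idea, assuming the usual confidence-weighted objective. The function name `als_implicit`, the toy matrix, and the hyperparameters `reg` and `alpha` are all assumptions made for this example.

```python
import numpy as np


def als_implicit(R, factors=2, reg=0.1, alpha=10.0, iterations=15, seed=0):
    """Toy implicit-feedback ALS on a dense (users x items) 0/1 matrix R."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    X = rng.normal(scale=0.1, size=(n_users, factors))  # user factors
    Y = rng.normal(scale=0.1, size=(n_items, factors))  # item factors
    C = 1.0 + alpha * R        # confidence: higher where we observed a "like"
    P = (R > 0).astype(float)  # preference: 1 if liked, else 0
    ridge = reg * np.eye(factors)

    def solve_side(fixed, conf, pref):
        # For each row u, solve (F^T C_u F + reg * I) x_u = F^T C_u p_u
        out = np.empty((conf.shape[0], factors))
        for u in range(conf.shape[0]):
            weighted = fixed * conf[u][:, None]
            out[u] = np.linalg.solve(weighted.T @ fixed + ridge, weighted.T @ pref[u])
        return out

    for _ in range(iterations):
        X = solve_side(Y, C, P)      # fix item factors, solve for users
        Y = solve_side(X, C.T, P.T)  # fix user factors, solve for items
    return X, Y


# Two taste groups; user 4 has liked only item 0 so far.
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
], dtype=float)

X, Y = als_implicit(R)
scores = X @ Y.T
print(np.round(scores[4], 2))  # user 4's predicted preference for each item
```

Among the items user 4 has not seen, item 1 should score highest, because the users who liked item 0 also liked item 1. That is the whole trick, just at a much smaller scale than the 32-factor model above.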


To serve recommendations, we "extend" vaex with a recommendation function.  
* This is a lazy call to the model, which is great for testing
* It is easy to extend and add logic

If you need a super-fast response time, consider persisting the recommendations per user in a key-value database instead.  
When the model gets more complicated, including item and user attributes and context, you should probably use a nice cloud instance for it.

In [8]:
import pyarrow as pa
user_items = ratings.T.tocsr()

@vaex.register_function()
def recommend_als(ar, topk=5, filter_already_liked_items=True):
    ret = []
    for user in ar.tolist():
        recommendations = als.recommend(user, user_items, N=topk,
                                        filter_already_liked_items=filter_already_liked_items)        
        ret.append([recommendation[0] for recommendation in recommendations ])
    return pa.array(ret)
df.add_function('recommend_als', recommend_als)
df['recommendations_ids'] = df.userId.recommend_als() 
df['recommendations'] = df['recommendations_ids'].apply(lambda recommendations: [titles.get(item) for item in recommendations])
df.head(2)

#,userId,movieId,rating,timestamp,name,title,genres,year,url,recommendations_ids,recommendations
0,156,2,5,20021226T212049,Mike Gallup,Jumanji,"[""Adventure"",""Children"",""Fantasy""]",1995,'http://image.tmdb.org/t/p/w500/vzmL6fP7aPKNKPRT...,"[1291, 1250, 1527, 924, 1262]","""array(['Indiana Jones and the Last Crusade',\n ..."
1,249,2,5,19960706T080206,Imogene Hallett,Jumanji,"[""Adventure"",""Children"",""Fantasy""]",1995,'http://image.tmdb.org/t/p/w500/vzmL6fP7aPKNKPRT...,"[480, 150, 36, 1617, 377]","""array(['Jurassic Park', 'Apollo 13', 'Dead Man ..."
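
For the "persist recommendations per user" option mentioned earlier, here is a minimal sketch of the idea. It uses an in-memory SQLite table from the standard library as a stand-in for a real key-value database, and `fake_recommend` is a hypothetical placeholder for the real model call. The function names `precompute` and `lookup` and the table layout are assumptions for illustration.

```python
import json
import sqlite3


def precompute(conn, user_ids, recommend, topk=5):
    """Store each user's top-k recommendations as a JSON string keyed by user id."""
    conn.execute("CREATE TABLE IF NOT EXISTS recs (user_id INTEGER PRIMARY KEY, items TEXT)")
    rows = [(user, json.dumps(recommend(user)[:topk])) for user in user_ids]
    conn.executemany("INSERT OR REPLACE INTO recs VALUES (?, ?)", rows)
    conn.commit()


def lookup(conn, user_id):
    """Serve from the store: a single key lookup, no model call."""
    row = conn.execute("SELECT items FROM recs WHERE user_id = ?", (user_id,)).fetchone()
    return json.loads(row[0]) if row else []


# Hypothetical stand-in for the real model (e.g. a wrapper around `als.recommend`).
def fake_recommend(user):
    return [user * 10 + i for i in range(10)]


conn = sqlite3.connect(":memory:")
precompute(conn, user_ids=[1, 2, 3], recommend=fake_recommend, topk=5)
print(lookup(conn, 2))   # [20, 21, 22, 23, 24]
print(lookup(conn, 99))  # [] for a user we never precomputed
```

The trade-off is freshness: lookups are fast, but recommendations only change when you rerun the precompute job, and unknown users need a fallback such as popular items.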


Let's add an explanation for the recommendations, so users know how we figured out what they like, and that we didn't do it by reading their private Facebook messages.

In [10]:
@vaex.register_function(on_expression=False)
def explain(users, recs, k=3):
    ret = []
    for user, user_recs in zip(users.tolist(), recs.tolist()):
        user_explanations = {}
        for item in user_recs:  # `item` avoids shadowing the global `itemid` column name
            rec_title = titles.get(item)
            # contributions: (liked item, score) pairs, strongest contributor first
            score_explained, contributions, W = als.explain(user, user_items, itemid=item)
            user_explanations[rec_title] = [titles.get(i) for i, _ in contributions[:k]]
        ret.append(user_explanations)
    return pa.array(ret)
df.add_function('explain', explain)
df['response'] = df.func.explain(df[userid], df['recommendations_ids'])
df.to_records(1)

{'userId': 249,
 'movieId': 2,
 'rating': 5.0,
 'timestamp': '19960706T080206',
 'name': 'Imogene Hallett',
 'title': 'Jumanji',
 'genres': '["Adventure","Children","Fantasy"]',
 'year': 1995,
 'url': 'http://image.tmdb.org/t/p/w500/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg',
 'recommendations_ids': array([ 480,  150,   36, 1617,  377]),
 'recommendations': ['Jurassic Park',
  'Apollo 13',
  'Dead Man Walking',
  'L.A. Confidential',
  'Speed'],
 'response': {'Apollo 13': array(['Fugitive, The', 'Babe', 'Pulp Fiction'], dtype=object),
  'Dead Man Walking': array(['Fargo', "Schindler's List", 'Babe'], dtype=object),
  'Jurassic Park': array(['Fugitive, The', 'Independence Day (a.k.a. ID4)', 'Fargo'],
        dtype=object),
  'L.A. Confidential': array(['Usual Suspects, The', 'Fargo', 'Silence of the Lambs, The'],
        dtype=object),
  'Speed': array(['Fugitive, The', 'Independence Day (a.k.a. ID4)', 'Jumanji'],
        dtype=object)}}
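
Where do these explanations come from? As a rough sketch of the idea (my reading of the standard implicit-feedback ALS formulation, not implicit's actual code): with item factors held fixed, a user's score for a recommended item can be split into one additive contribution per item the user liked. The function name `explain_score`, the random toy factors, and the `reg` value below are assumptions for illustration.

```python
import numpy as np


def explain_score(item_factors, liked, confidence, target, reg=0.01):
    """Split the score of `target` into one contribution per liked item.

    item_factors: (n_items x factors) array; liked: indices of liked items;
    confidence: weight of each liked item (same length as `liked`).
    """
    Y = item_factors
    Yl = Y[liked]
    c = np.asarray(confidence, dtype=float)
    # A_u = Y^T C_u Y + reg * I, where C_u is 1 everywhere except c on liked items
    A = Y.T @ Y + (Yl * (c - 1.0)[:, None]).T @ Yl + reg * np.eye(Y.shape[1])
    weighted_target = np.linalg.solve(A, Y[target])
    contributions = (Yl @ weighted_target) * c  # one number per liked item
    order = np.argsort(-contributions)          # strongest contributor first
    return [(liked[int(i)], float(contributions[i])) for i in order], A


rng = np.random.default_rng(0)
toy_factors = rng.normal(size=(6, 3))  # made-up item factors
liked = [0, 2, 5]
contribs, A = explain_score(toy_factors, liked, confidence=[1.0, 1.0, 1.0], target=3)

# The user's factors given these likes, and the resulting score for the target item.
user_factors = np.linalg.solve(A, toy_factors[liked].T @ np.ones(3))
total = sum(score for _, score in contribs)
print(contribs)
print(total, toy_factors[3] @ user_factors)  # the contributions add up to the score
```

Taking the top `k` entries of that sorted list is what lets us say "we recommended this because you liked these".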

Let's move this thing into production.

In [11]:
from goldilox import Pipeline
pipeline = Pipeline.from_vaex(df)
assert pipeline.validate()

In [None]:
pipeline.save('pipeline.pkl')
print('Go to http://127.0.0.1:5000/docs, and test the inference with only the "userId" key, and "response" in the columns params')
!gl serve pipeline.pkl

In [17]:
# results
[
  {
    "response": {
      "2001: A Space Odyssey": [
        "Blade Runner",
        "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb",
        "Clockwork Orange, A"
      ],
      "Bridge on the River Kwai, The": [
        "Saving Private Ryan",
        "Casablanca",
        "Butch Cassidy and the Sundance Kid"
      ],
      "Fifth Element, The": [
        "Twelve Monkeys (a.k.a. 12 Monkeys)",
        "Blade Runner",
        "Matrix, The"
      ],
      "Great Escape, The": [
        "Saving Private Ryan",
        "Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark)",
        "Sting, The"
      ],
      "Indiana Jones and the Last Crusade": [
        "Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark)",
        "Die Hard",
        "Sixth Sense, The"
      ]
    }
  }
]

