# Semantic model-based recommender system

### Context

This project was carried out as part of a CentraleSupelec course on the Semantic Web.
The goal was to build a model-based recommender system for movies, extracting the features of the movies from an RDF graph.

### Principle

Movies and other resources such as directors, actors or genres are described by a feature vector of size l+m+n+p, where l is the number of movies, m the number of actors, n the number of directors, and p the number of genres. Each feature vector starts as a one-hot vector whose only non-zero entry is the component corresponding to the resource itself. We then look at the distance between the resource and the other resources in the RDF graph: if resource r1 is at distance d from resource r2 in the graph, the vector describing r1 gets the value 1/d in the component corresponding to r2. Since beyond a certain point a long distance between two resources is no longer relevant, we can set a limit on how far the graph is explored.
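
The construction described above can be sketched with a breadth-first search. This is a minimal illustration, not the project's actual code: the `feature_vectors` function, the adjacency-dict representation and the toy graph are all assumptions made for the example, and the RDF graph is treated as an undirected graph of resource ids.

```python
from collections import deque

import numpy as np


def feature_vectors(adjacency, max_distance):
    """Return a matrix whose row i is the feature vector of resource i.

    `adjacency` maps each resource id (0..n-1) to its neighbours.
    Hypothetical helper, written only to illustrate the 1/d weighting.
    """
    n = len(adjacency)
    features = np.zeros((n, n))
    for source in range(n):
        features[source, source] = 1.0  # one-hot entry for the resource itself
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            if distance[node] == max_distance:
                continue  # do not explore beyond the distance limit
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    # A resource at distance d contributes 1/d
                    features[source, neighbour] = 1.0 / distance[neighbour]
                    queue.append(neighbour)
    return features


# Toy chain (assumed data): movie 0 - actor 1 - movie 2 - genre 3
toy_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
toy_features = feature_vectors(toy_graph, max_distance=2)
print(toy_features[0])  # entries for resources 0..3: 1, 1, 0.5, 0
```

With a limit of 2, resource 3 is too far from resource 0 to contribute, so its component stays at 0.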

User vectors have 1 on resources liked by the user, -1 on resources disliked by the user, and 0 on all other resources. To recommend a resource, we compute the cosine similarity between the user vector and each resource vector, and recommend the resource with the highest similarity.
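
As a rough sketch of this scoring step, assuming the resource vectors are stacked as the rows of a NumPy matrix: the `cosine_similarity` and `best_resource` helpers, the `exclude` parameter and the small matrix below are hypothetical and only serve the illustration.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors (0.0 if either is all zeros)."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / norm) if norm else 0.0


def best_resource(user, features, exclude=()):
    """Index of the resource vector most similar to the user vector.

    `exclude` lists resources to skip, e.g. those the user already rated.
    """
    scores = np.array([cosine_similarity(user, row) for row in features])
    scores[list(exclude)] = -np.inf
    return int(np.argmax(scores))


# Assumed toy data: resources 0 and 1 are close in the graph, resource 2 is isolated
features = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
user = np.array([1.0, 0.0, 0.0])  # the user liked resource 0
recommended = best_resource(user, features, exclude=[0])
print(recommended)  # 1
```

Excluding already-rated resources is a design choice of this sketch; without it, the liked resource itself would usually score highest.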

### Data

To test our recommender system, we used ratings from MovieLens, mapped to DBpedia URIs.
Unfortunately, the DBpedia RDF base does not provide information such as movie genres. Consequently, we took our data from the LinkedMDB base. You can browse the content of this base through its SPARQL endpoint: http://www.linkedmdb.org/sparql.
To keep the MovieLens - DBpedia - LinkedMDB mapping, we had to restrict our recommender system to the 252 movies for which we had MovieLens ratings, DBpedia URIs and LinkedMDB URIs.

### Test

We use gzip and pickle to load all the available movies.

In [43]:
import gzip
import pickle
import urllib.parse

import numpy as np  # needed to build the user vector later on
import pandas as pd

In [38]:
# Load the pickled movie URI mapping; the file is closed automatically
with gzip.open("./data/mapping.gz", "rb") as f:
    uris = pickle.load(f)

We use pandas to represent the movies in a nice table view.

In [46]:
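# One row per movie, with a single 'uri' column
a = pd.DataFrame.from_dict(uris)
a.columns = ['uri']
pd.set_option('display.max_rows', 252)  # display all 252 movies

def extract_title(row):
    # Decode the percent-encoded URI and drop the 28-character
    # "http://dbpedia.org/resource/" prefix to keep only the title
    row['title'] = urllib.parse.unquote(row['uri'])[28:]
    return row

a.apply(extract_title, axis=1)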

| id | uri | title |
|---|---|---|
| 0 | http://dbpedia.org/resource/Stargate_%28film%29 | Stargate_(film) |
| 1 | http://dbpedia.org/resource/The_Rock_%28film%29 | The_Rock_(film) |
| 2 | http://dbpedia.org/resource/JFK_%28film%29 | JFK_(film) |
| 3 | http://dbpedia.org/resource/Lawrence_of_Arabia... | Lawrence_of_Arabia_(film) |
| 4 | http://dbpedia.org/resource/Grand_Hotel_%28fil... | Grand_Hotel_(film) |
| 5 | http://dbpedia.org/resource/Airplane%21 | Airplane! |
| 6 | http://dbpedia.org/resource/X-Men_%28film%29 | X-Men_(film) |
| 7 | http://dbpedia.org/resource/Sleepy_Hollow_%28f... | Sleepy_Hollow_(film) |
| 8 | http://dbpedia.org/resource/Casablanca_%28film%29 | Casablanca_(film) |
| 9 | http://dbpedia.org/resource/The_Good_Earth_%28... | The_Good_Earth_(film) |


From this point, if you want to test it for yourself, select the ids of the movies you liked or disliked to build your user vector.
In this example, I build a vector for a user who liked the movie X-Men (id 6):

In [40]:
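# Column vector with one entry per movie, all neutral (0) by default
user_vector = np.zeros((252, 1))
user_vector[6] = 1  # 1 = liked; id 6 is X-Men in the table above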
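
If you rated several movies, a small helper can build the vector for you. The `build_user_vector` function below is a hypothetical convenience, not part of the project; it assumes the same (252, 1) column shape as the cell above and the 1 / -1 / 0 encoding described in the Principle section.

```python
import numpy as np


def build_user_vector(liked, disliked, size=252):
    """Column vector with 1 for liked ids, -1 for disliked ids, 0 elsewhere."""
    overlap = set(liked) & set(disliked)
    if overlap:
        raise ValueError(f"ids both liked and disliked: {sorted(overlap)}")
    vector = np.zeros((size, 1))
    vector[list(liked)] = 1
    vector[list(disliked)] = -1
    return vector


# Example ratings (assumed): liked X-Men (6) and Stargate (0), disliked Casablanca (8)
example_vector = build_user_vector(liked=[6, 0], disliked=[8])
```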

Use the `recommend_best_movie` function, then look up the returned id in the table to find the title of the recommended movie.

The first parameter of this function sets the limit of distance exploration in the RDF graph. If it is set to 1, the recommender system only considers direct connections between resources as relevant. For 2, it considers connections of length <= 2, for 3 connections of length <= 4, for 4 connections of length <= 8, and so on: the maximum length doubles each time the parameter increases by one.
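
To make the arithmetic explicit, the values listed above (1, 2, 4, 8) match 2^(k-1) for a parameter k. The snippet below assumes the doubling pattern continues beyond 4; `max_path_length` is a hypothetical name, not a function of the project.

```python
def max_path_length(limit):
    """Longest connection considered for a given limit, assuming it doubles at each step."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return 2 ** (limit - 1)


lengths = {limit: max_path_length(limit) for limit in range(1, 5)}
print(lengths)  # {1: 1, 2: 2, 3: 4, 4: 8}
```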

In [36]:
from ratings import recommend_best_movie
recommend_best_movie(3, user_vector)

18

In our example, the recommender system returned the id 18, which corresponds to the movie Dune. It has the same genres as the movie X-Men.