# A VISUAL APPROACH TO CONTENT-BASED RECOMMENDER SYSTEMS USING DOC2VEC

In a previous project we saw how to use scraping techniques to extract movie summaries and build a dataset. Now it is time to use that data to introduce the world of recommender systems.

In this exercise we will build a deep-learning, content-based recommender system that finds the most similar movies based on their summaries. For this purpose we use Doc2Vec, a model that transforms text into a numeric embedding that a computer can work with.

The first step is to import all the libraries we will use:

In [1]:
import plotly
import plotly.graph_objs as go
import pandas as pd 
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.manifold import TSNE
from tqdm import tqdm
import spacy
import json
import re

The second step is to load the data and build a dataset:

In [2]:
data = []
with open("./cbrecommenderFiles/movies_house_memories.txt", 'r', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))

movies_dataset = pd.DataFrame.from_records(data)   

# Cut each summary at the "Additional Film" marker, when it is present
lista_resumen = []
for i in range(0,len(movies_dataset)):
    detecta_additional = movies_dataset["summary"][i].find("Additional Film")
    if detecta_additional != -1:
        # keep only the text before the marker and drop line breaks
        summary = movies_dataset["summary"][i][:detecta_additional]
        lista_resumen.append(summary.replace("\n",""))

    else:
        lista_resumen.append(movies_dataset["summary"][i])

# Replace the raw "summary" column with the cleaned version (placed as the last column)
movies_dataset = movies_dataset.drop(columns="summary")
movies_dataset.insert(len(movies_dataset.columns),"summary",lista_resumen)
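
To make the loading step concrete, here is a minimal, self-contained sketch of the same idea. It assumes the input file holds one JSON object per line, as the loop above implies. The two sample records and the `truncate_summary` helper are hypothetical and only serve as an illustration.

```python
import io
import json

# Hypothetical sample standing in for the scraped file: one JSON object per line.
sample_file = io.StringIO(
    '{"Title": "Movie A (2000)", "summary": "A pilot crosses the galaxy.\\nAdditional Film Information"}\n'
    '{"Title": "Movie B (2010)", "summary": "A family hides in a mansion."}\n'
)

MARKER = "Additional Film"

def truncate_summary(summary, marker=MARKER):
    """Keep the text before the marker and drop newlines; leave other summaries untouched."""
    position = summary.find(marker)
    if position == -1:
        return summary
    return summary[:position].replace("\n", "")

# Parse each line as a separate JSON record, then clean its summary.
records = [json.loads(line) for line in sample_file]
summaries = [truncate_summary(record["summary"]) for record in records]
print(summaries)
```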

Now, let's load the core NLP model from spaCy. Larger models are available, but in this case we will use the small one:

In [3]:
nlp=spacy.load('en_core_web_sm')

The next step is to transform the text by removing stop words (words that carry little context on their own: prepositions, determiners...) and punctuation marks. Once they are removed, it is time to split the text into words (tokenization) in order to create the vocabulary used to train our model later.

In [4]:
def normalize_document(doc):
    # remove punctuation, lowercase and strip surrounding whitespace
    # (flags must be passed by keyword: the fourth positional argument of re.sub is count)
    doc = re.sub(r'[^\w\s]', '', doc, flags=re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    # split into tokens
    tokens = nlp(doc)
    # filter out stop words and single digits
    filtered_tokens = [t.text for t in tokens if not t.is_stop and t.text not in ["1","2","3","4","5","6","7","8","9"]]
    # join the tokens back into a single string
    doc = ' '.join(filtered_tokens)
    return doc

print("Summaries cleaning/transformation:")
lista_token = []
for i in tqdm(range(0,len(movies_dataset))):
    lista_token.append(normalize_document(movies_dataset["summary"][i]))

movies_dataset.insert(len(movies_dataset.columns),"summaryFormated",lista_token)
movies_dataset = movies_dataset.dropna(axis = 0, subset = ["summaryFormated"])
movies_dataset = movies_dataset.drop_duplicates()

# shuffle the rows and rebuild a clean 0..n-1 index
movies_dataset = movies_dataset.sample(frac=1).reset_index(drop=True)

print("Split cleaning/transformation:")
listado_split = []
for doc in tqdm(range(0,len(movies_dataset["summaryFormated"]))):
    # split each cleaned summary into a list of words, which is what
    # TaggedDocument expects (a plain string would be read character by character)
    text = movies_dataset["summaryFormated"][doc]
    listado_split.append(text.split())


Summaries cleaning/transformation:
100%|██████████| 876/876 [02:31<00:00,  4.05it/s]

Split cleaning/transformation:
100%|██████████| 870/870 [00:00<00:00, 66788.28it/s]
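
As a rough illustration of what this cleaning step does to a single summary, the sketch below applies the same three operations (punctuation removal, lowercasing, stop-word filtering) using only the standard library. The tiny `STOP_WORDS` set and the `normalize_simple` helper are assumptions made for the example. spaCy's stop-word list and tokenizer are considerably more complete.

```python
import re

# Hypothetical, deliberately tiny stop-word list (spaCy ships a much larger one).
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "and", "is", "his"}

def normalize_simple(text):
    """Strip punctuation, lowercase, split on whitespace and drop stop words."""
    text = re.sub(r"[^\w\s]", "", text, flags=re.A)
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]

tokens = normalize_simple("The pilot, lost in space, returns to his planet!")
print(tokens)  # ['pilot', 'lost', 'space', 'returns', 'planet']
```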





Then it is time to train the Doc2Vec model, which is available in gensim, a very useful library for NLP. Doc2Vec is a paragraph-embedding model: it converts whole texts into numeric vectors. In this case we will create a 100-component embedding, training with a 40-word window for 1000 epochs:

In [5]:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(listado_split)]
model = Doc2Vec(vector_size=100,alpha=0.00025,min_count=1, window = 40)
model.build_vocab(documents)
model.train(documents,total_examples=model.corpus_count,epochs=1000)
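
To give an intuition for the `window` parameter, the sketch below lists the context words that fall within a given distance of each target word. This is a simplified illustration, not gensim's actual training loop: the `context_windows` helper is hypothetical, and a window of 2 is used instead of 40 to keep the output readable.

```python
def context_windows(tokens, window):
    """Yield (target, context) pairs; context holds the words within `window` positions of the target."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

tokens = ["pilot", "lost", "space", "returns", "planet"]
pairs = list(context_windows(tokens, window=2))
for target, context in pairs:
    print(target, "->", context)
```

With a window as large as 40, the context of each word can span most of a short summary.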

Once the model has been trained, we extract all the vectors (one for each summary):

In [6]:
lista_vectores = []
for vector in range(0,len(model.docvecs)):
    lista_vectores.append(model.docvecs[vector])

We can add these new vectors to the dataset and save it as a CSV file:

In [7]:
dataset_vectors = pd.DataFrame(np.vstack(lista_vectores))

movies_dataset_joined = movies_dataset.join(dataset_vectors)
movies_dataset_joined.to_csv("./cbrecommenderFiles/dataset_house_movies_joined_title.csv",index=None)
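
The stored vectors can then be compared directly. A common choice for content-based recommendation is cosine similarity, and the sketch below ranks a few toy vectors against a query. The titles, the 3-component vectors and the `rank_similar` helper are invented for the example and do not come from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_similar(query_title, vectors, top_n=2):
    """Return the top_n (title, score) pairs closest to query_title, best first."""
    query = vectors[query_title]
    scores = [
        (title, cosine_similarity(query, vector))
        for title, vector in vectors.items()
        if title != query_title
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_n]

# Toy embeddings (hypothetical values, 3 components instead of 100).
toy_vectors = {
    "Space Movie A": [0.9, 0.1, 0.0],
    "Space Movie B": [0.8, 0.2, 0.1],
    "Horror Movie C": [0.0, 0.1, 0.9],
    "Racing Movie D": [0.5, 0.5, 0.1],
}

ranking = rank_similar("Space Movie A", toy_vectors)
print([title for title, _ in ranking])
```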

We can also look at which movies are most similar to a given one (in this case, a Star Wars movie). Let's visualize it, but first we have to find a way to reduce the 100-dimensional feature vectors to at most 3 dimensions (otherwise it would be impossible to represent them graphically).

For this purpose, we are going to use t-SNE, a manifold learning technique:

In [8]:
tsne = TSNE(n_components=3) 
vectors_tsne = tsne.fit_transform(movies_dataset_joined.iloc[:,4:])

dataset_vectors = pd.DataFrame(np.vstack(vectors_tsne))
movies_dataset_joined_3d = movies_dataset.join(dataset_vectors)
movies_dataset_joined_3d.to_csv("./cbrecommenderFiles/dataset_house_movies_joined_title_3d.csv",index=None)

### RECOMMENDER SYSTEM FEATURE SPACE VISUALIZATION

In [12]:
movies_dataset_joined = pd.read_csv('./cbrecommenderFiles/dataset_house_movies_joined_title.csv')

In [13]:
movies_dataset_joined = movies_dataset_joined.dropna(subset=["summaryFormated"])
pelicula = "Star Wars: Episode IX"
# titles highlighted as the movies closest to the selected one
similar_titles = ["Ad Astra (2019)","Saturn 3 (1980)","Invasion of Astro-Monster (1965)", "Speed Racer (2008)", "Akira (1988)","Ready or Not (2019) Summary"]

is_selected = movies_dataset_joined['Title'].str.contains(pelicula)
is_similar = movies_dataset_joined['Title'].isin(similar_titles)

# note: despite the names, `not_selected` holds the selected movie
# and `selected` holds every other movie
not_selected = movies_dataset_joined[is_selected]
closest_movies = movies_dataset_joined[is_similar]
selected = movies_dataset_joined[~is_selected & ~is_similar]

In [14]:
scatter1 = dict(
    mode = "markers",
    name = "Other movies",
    type = "scatter3d",    
    x = selected.values[:,10], y = selected.values[:,11], z = selected.values[:,12],
    marker = dict( size=2, color='green'),
    text= selected["Title"]

)


scatter2 = dict(
    mode = "markers",
    name = "Selected movie",
    type = "scatter3d",    
    x = not_selected.values[:,10], y = not_selected.values[:,11], z = not_selected.values[:,12],
    marker = dict( size=5, color='red'),
    text= not_selected["Title"]
)

scatter3 = dict(
    mode = "markers",
    name = "Similar movies",
    type = "scatter3d",    
    x = closest_movies.values[:,10], y = closest_movies.values[:,11], z = closest_movies.values[:,12],
    marker = dict( size=5, color='blue'),
    text= closest_movies["Title"]
)



layout = dict(
    title = 'MOVIES DISTRIBUTION',
    scene = dict(
        xaxis = dict( zeroline=True ),
        yaxis = dict( zeroline=True ),
        zaxis = dict( zeroline=True ),
    )
)
    
fig = dict( data=[scatter1, scatter2,scatter3], layout=layout)

plotly.offline.iplot(fig, filename='mesh3d_sample')

