# SISTEMAS DE RECOMENDACIÓN

Los sistemas de recomendación son modelos cuyo objetivo es proporcionar, como su nombre indica, una recomendación de un producto basada en la experiencia el consumidor o considerando en conjunto la experiencia del consumidor junto con otros.

Tradicionalmente, los sistemas de recomendación se han basado en la creación de rankings, de modo que se recomienda aquello que más ha votado una comunidad de usuarios.

Gracias a la evolución de los algoritmos de inteligencia artificial, se plantea la posiblidad de aplicar métodos más complejos para la obtención de recomendaciones más precisas y personalizadas.

Así surgen, entre otros, los llamados sistemas colaborativos, que intentan obtener un perfilado o tratar de asignar recomendaciones basadas en perfiles similares.

Entre estos últimos podemos destacar los algoritmos basados en distancias y las descomposiciones matriciales. 

En el ejemplo que se va a desarrollar se va a mostrar un ejemplo tanto de de sistema basado en un ranking genérico y un ejemplo más detallado con un sistema de distancias.

# ANÁLISIS DEL DATASET

Vamos a utilizar un dataset de la plataforma Kaggle que consiste en un conjunto de películas y usuarios los cuales han puntuado entre 0 y 5 las películas que han visto.

Los datos se componen de dos archivos:
- ratings_small.csv con la relación usuario - identificador de película - puntuación
- movies_medatada.csv donde hay información asociada de la película como su nombre y género.

In [34]:
### Carga de librerías genéricas

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

plt.style.use("ggplot")
warnings.filterwarnings("ignore")

In [35]:
##Direcciones de los archivos

camino="D:\\J\\Big data\\DATAHACK\\PYTHON\\006-_Caso_segmentacion\\Recomendador peliculas\\"
archivo="ratings_small.csv"
archivo_pelis="movies_metadata.csv"

In [36]:
#Lectura del fichero con los ratings
df=pd.read_csv(camino+archivo)

In [37]:
# Información asociada a las películas
df_info_pelis=pd.read_csv(camino+archivo_pelis)

## ANÁLISIS DE LOS DATAFRAMES

Procedemos a analizar los dos daframes y su estructura, comenzando por el de los rankings

In [38]:
df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


El usuaro se identifica con un id, así como la película- El rating es un número, veremos entre qué se mueve- El timestamp no lo vamos a utilizar, con lo que podemos eliminarlo en primer lugar

In [39]:
df.drop(columns=["timestamp"],inplace=True)

In [41]:
df["rating"].value_counts().sort_index(ascending=False)

5.0    15095
4.5     7723
4.0    28750
3.5    10538
3.0    20064
2.5     4449
2.0     7271
1.5     1687
1.0     3326
0.5     1101
Name: rating, dtype: int64

En efecto, se ve que los ratings van desde 0.5 hasta 5. Veamos cuántos usuarios distintos hay, así como películas:

In [43]:
df["userId"].unique().shape

(671,)

Hay 671 usuarios diferentes

In [45]:
### ESCRIBE TU CÓDIGO AQUÍ PARA OBTENER EL NÚMERO DE PELÍCULAS DISTINTAS

df["movieId"].unique().shape

(9066,)

Hay 9066 películas diferentes. Veamos ahora el archivo de las películas:

In [46]:
df_info_pelis.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [47]:
df_info_pelis.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

De todas las columnas nos interesan la que tiene el género de la película, el id para poder cruzar, el título y la media de votos. Filtraomos el df por estos campos.

In [48]:
###TU CÓDIGO PARA QUEDARNOS SOLO CON LAS COLUMNAS:["genres","id","title","vote_average"]
### MODIFICA EL DF df_info_pelis 

df_info_pelis=df_info_pelis[["genres","id","title","vote_average"]]
df_info_pelis.head()

Unnamed: 0,genres,id,title,vote_average
0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,Toy Story,7.7
1,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,Jumanji,6.9
2,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,Grumpier Old Men,6.5
3,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",31357,Waiting to Exhale,6.1
4,"[{'id': 35, 'name': 'Comedy'}]",11862,Father of the Bride Part II,5.7


La columna "genres" tiene una estructura un tanto complicada que habría que limpiar

In [52]:
df_info_pelis["genres"][0]

"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

In [49]:
df["movieId"].value_counts()

356       341
296       324
318       311
593       304
260       291
         ... 
98604       1
103659      1
104419      1
115927      1
6425        1
Name: movieId, Length: 9066, dtype: int64

In [50]:
df["userId"].value_counts()

547    2391
564    1868
624    1735
15     1700
73     1610
       ... 
296      20
289      20
249      20
221      20
1        20
Name: userId, Length: 671, dtype: int64

In [10]:
df_entrenar=df.pivot_table(values="rating",index="userId",columns="movieId")


In [11]:
df_entrenar.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,4.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,4.0,...,,,,,,,,,,
5,,,4.0,,,,,,,,...,,,,,,,,,,


In [12]:
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

In [13]:
df_entrenar2=df_entrenar.fillna(0)

In [14]:
df_entrenar2.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
distancias=cosine_similarity(df_entrenar2)

In [16]:
distancias[0]

array([1.        , 0.        , 0.        , 0.07448245, 0.01681799,
       0.        , 0.08388416, 0.        , 0.01284289, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.05938727,
       0.        , 0.02107287, 0.        , 0.08563118, 0.01656352,
       0.05733056, 0.0436822 , 0.07024269, 0.        , 0.01491541,
       0.00262104, 0.03501505, 0.        , 0.        , 0.06130103,
       0.02379292, 0.03170691, 0.00821267, 0.08459736, 0.13058496,
       0.01652849, 0.        , 0.03282542, 0.0203909 , 0.        ,
       0.06833225, 0.        , 0.04105418, 0.        , 0.        ,
       0.        , 0.        , 0.02081219, 0.07333877, 0.        ,
       0.        , 0.        , 0.        , 0.03782836, 0.        ,
       0.03844173, 0.06430548, 0.        , 0.01381982, 0.        ,
       0.01399113, 0.        , 0.        , 0.        , 0.        ,
       0.04528371, 0.        , 0.00203034, 0.        , 0.03191959,
       0.        , 0.        , 0.08815485, 0.        , 0.04607

In [17]:
distancias=np.where(distancias==1,-1,distancias)

In [18]:
np.argpartition(distancias[0],-4)[-5:]

array([484, 309, 633, 324, 340], dtype=int64)

In [19]:
def devuelve_similares(indice,total_similares=5,distancias=distancias):
    return np.argpartition(distancias[indice],-total_similares)[-total_similares:]

Metamos, por ejemplo, al consumidor número 50:

In [20]:
similares_50=devuelve_similares(indice=50)

In [21]:
similares_50

array([655, 475, 625, 668, 188], dtype=int64)

Veamos ahora las películas que le gustan a nuestro 50:

In [22]:
df.head()

Unnamed: 0,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


In [23]:
df_puntuaciones_50=df[df["userId"]==50][["movieId","rating"]]
df_puntuaciones_50

Unnamed: 0,movieId,rating
7987,10,4.0
7988,21,3.0
7989,39,2.0
7990,47,3.0
7991,95,3.0
7992,110,4.0
7993,150,3.0
7994,160,3.0
7995,161,4.0
7996,165,4.0


In [24]:
df_info_pelis[df_info_pelis["id"].isin(df[df["userId"]==50]["movieId"].values)]

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count


In [25]:
df_info_pelis["id"].dtype

dtype('O')

In [26]:
df["movieId"].dtype

dtype('int64')

In [27]:
df_info_pelis["id"].unique()

array(['862', '8844', '15602', ..., '67758', '227506', '461257'],
      dtype=object)

In [28]:
df["movieId"]=df["movieId"].astype("str")

In [29]:
df_pelis_50=df_info_pelis[df_info_pelis["id"].isin(df[df["userId"]==50]["movieId"].values)]
df_pelis_50

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
286,False,"{'id': 300546, 'name': 'Once were Warriors Col...",0,"[{'id': 18, 'name': 'Drama'}]",,527,tt0110729,en,Once Were Warriors,A drama about a Maori family lving in Auckland...,...,1994-09-02,2201126.0,99.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,"A family in crisis, a life in chaos... Nothing...",Once Were Warriors,False,7.6,106.0
302,False,"{'id': 131, 'name': 'Three Colors Collection',...",0,"[{'id': 18, 'name': 'Drama'}, {'id': 9648, 'na...",,110,tt0111495,fr,Trois couleurs : Rouge,Red This is the third film from the trilogy by...,...,1994-05-27,0.0,99.0,"[{'iso_639_1': 'fr', 'name': 'Français'}]",Released,,Three Colors: Red,False,7.8,246.0
385,False,,45000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,315,tt0059170,en,"Faster, Pussycat! Kill! Kill!",Three strippers seeking thrills encounter a yo...,...,1965-08-06,0.0,83.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Go-Go For a Wild Ride With the ACTION GIRLS!,"Faster, Pussycat! Kill! Kill!",False,6.5,59.0
1029,False,,14500000,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",http://www.romeoandjuliet.com/,454,tt0117509,en,Romeo + Juliet,In director Baz Luhrmann's contemporary take o...,...,1996-10-31,147298800.0,120.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,My only love sprung from my only hate.,Romeo + Juliet,False,6.7,1406.0
1163,False,,2200000,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...",,185,tt0066921,en,A Clockwork Orange,Demonic gang-leader Alex goes on the spree of ...,...,1971-12-18,26589000.0,136.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Being the adventures of a young man whose prin...,A Clockwork Orange,False,8.0,3432.0
1176,False,"{'id': 119674, 'name': 'Psycho Collection', 'p...",806948,"[{'id': 18, 'name': 'Drama'}, {'id': 27, 'name...",,539,tt0054215,en,Psycho,When larcenous real estate clerk Marion Crane ...,...,1960-06-16,32000000.0,109.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The master of suspense moves his cameras into ...,Psycho,False,8.3,2405.0
1234,False,,3500000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,339,tt0102536,en,Night on Earth,An anthology of 5 different cab drivers in 5 A...,...,1991-10-03,2015810.0,129.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Five Taxis. Five Cities. One Night.,Night on Earth,False,7.5,165.0
1299,False,"{'id': 8581, 'name': 'A Nightmare on Elm Stree...",1800000,"[{'id': 27, 'name': 'Horror'}]",,377,tt0087800,en,A Nightmare on Elm Street,Teenagers in a small town are dropping like fl...,...,1984-11-14,25504510.0,91.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"If Nancy Doesn't Wake Up Screaming, She Won't ...",A Nightmare on Elm Street,False,7.2,1212.0
1639,False,,200000000,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",http://www.titanicmovie.com,597,tt0120338,en,Titanic,"84 years later, a 101-year-old woman named Ros...",...,1997-11-18,1845034000.0,194.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Nothing on Earth could come between them.,Titanic,False,7.5,7770.0
1808,False,,140000000,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",,95,tt0120591,en,Armageddon,When an asteroid threatens to collide with Ear...,...,1998-07-01,553799600.0,151.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,The Earth's Darkest Day Will Be Man's Finest Hour,Armageddon,False,6.5,2540.0


LLevamos las puntuaciones al df anterior:

In [30]:
df_pelis_50=df_pelis_50.merge(df_puntuaciones_50,left_on="id", right_on="movieId")[["title","id","rating"]]
df_pelis_50

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

Metemos todos los pasos anteriores en funciones:

In [32]:

##Función que obtiene las películas para un índice concreto:
def filtro_por_Id(indice,df_info_pelis=df_info_pelis,df=df):
    return df_info_pelis[df_info_pelis["id"].isin(df[df["userId"]==indice]["movieId"].values)][["title","id"]]

##Función para añadir al filtrado las puntuaciones. El valor de limitando indica
## si hay que limitar puntuaciones. Esta función internamnete llamará a la de arriba:

def anade_puntuaciones(indice,df=df,limitando = 0):
    df_salida=df[df["userId"]==50][["movieId","rating"]]
    df_salida=filtro_por_Id(indice=indice).merge(df_salida,left_on="id", right_on="movieId")[["title","id","rating"]]
    df_salida=df_salida[df_salida["rating"]>=limitando]
    df_salida["indice"]=indice
    return df_salida

##Replicamos aquí la función que obtiene los más cercanos:

def devuelve_similares1(indice,total_similares=5,distancias=distancias):
    return np.argpartition(distancias[indice],-total_similares)[-total_similares:]

###Aquí aplicamos ahora el proceso completo que irá llamando a las funciones de arriba:

def saca_recomendaciones(indice,df=df,total_similares =5,max_rating=4,filtro_solo_rec=False):
    ###Paso 1: Sacamos los argumentos asociados
    indices_aux=devuelve_similares1(indice,total_similares)
    ###Paso 2_ Sacamos las puntuaciens y películas del índicie
    pelis_salida=anade_puntuaciones(indice,limitando=0)
    ###Paso 3_ Añadimos las películas asociadas a los índices:
    lista_dfs=[anade_puntuaciones(x, limitando=max_rating) for x in indices_aux]
    for comentarista in lista_dfs:
        try:
            saca_indice=comentarista["indice"].values[0]
            pelis_salida=pelis_salida.merge(comentarista, on="id",suffixes=(None,saca_indice),how="outer")
        except:
            pass
    ##El siguiente filtro se quedará solo con las recomendaciones:
    if filtro_solo_rec:
        pelis_salida=pelis_salida[pelis_salida["title"].isnull()]
    #pelis_salida=pd.concat([pelis_salida]+lista_dfs,axis=1)
    return pelis_salida


In [33]:
saca_recomendaciones(475,filtro_solo_rec=False)

Unnamed: 0,title,id,rating,indice,title188,rating188,indice188,title189,rating189,indice189,title625,rating625,indice625,title385,rating385,indice385
0,Three Colors: Red,110,4.0,475.0,Three Colors: Red,4.0,188.0,,,,,,,Three Colors: Red,4.0,385.0
1,Monsoon Wedding,480,4.0,475.0,Monsoon Wedding,4.0,188.0,,,,,,,Monsoon Wedding,4.0,385.0
2,Terminator 3: Rise of the Machines,296,4.0,475.0,Terminator 3: Rise of the Machines,4.0,188.0,,,,,,,Terminator 3: Rise of the Machines,4.0,385.0
3,Grill Point,316,3.0,475.0,,,,,,,,,,,,
4,,527,,,Once Were Warriors,4.0,188.0,,,,Once Were Warriors,4.0,625.0,Once Were Warriors,4.0,385.0
5,,339,,,Night on Earth,4.0,188.0,,,,,,,Night on Earth,4.0,385.0
6,,165,,,Back to the Future Part II,4.0,188.0,,,,,,,Back to the Future Part II,4.0,385.0
7,,780,,,The Passion of Joan of Arc,4.0,188.0,The Passion of Joan of Arc,4.0,189.0,,,,,,
8,,292,,,Dave Chappelle's Block Party,4.0,188.0,,,,,,,Dave Chappelle's Block Party,4.0,385.0
9,,161,,,,,,,,,,,,Ocean's Eleven,4.0,385.0


In [123]:
devuelve_similares1(475)

array([188, 189, 625, 385, 540], dtype=int64)

Añadimos las películas de la recomendación:

In [126]:
anade_puntuaciones(188)

Unnamed: 0,title,id,rating,indice
0,Once Were Warriors,527,4.0,188
1,Three Colors: Red,110,4.0,188
2,Romeo + Juliet,454,3.0,188
3,Psycho,539,3.0,188
4,Night on Earth,339,4.0,188
5,A Nightmare on Elm Street,377,3.0,188
6,Titanic,597,3.0,188
7,Rain Man,380,3.0,188
8,Back to the Future Part II,165,4.0,188
9,Notting Hill,509,3.0,188


In [48]:
 modelo = NMF(n_components=2, init='random', random_state=0)

In [12]:
df_entrenar2=np.where(df_entrenar2==-1,1,df_entrenar2)

In [None]:
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

In [13]:
modelo.fit(df_entrenar2)

NMF(init='random', n_components=2, random_state=0)