# Association rules for movies

We import the libraries used throughout the notebook. We had trouble getting the apriori implementation from the efficient_apriori library to work, so we use a better-documented alternative, mlxtend.

In [1]:
import numpy as np
from itertools import combinations, groupby
from collections import Counter
import pandas as pd
import sys
from IPython.display import display

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

### Analysis and curation

We take the ratings and movies datasets and join them on movieId. This gives us the movies rated by each user. We discard the rating, the timestamp and the genres.

In [2]:
ratings = pd.read_csv('./ml-20m/ratings.csv')

print('ratings -- dimensions: {}'.format(ratings.shape))
display(ratings.head())
items_names = pd.read_csv('../ml-20m/movies.csv')

display(items_names.head())

# Decode the movie ids into titles
pelis_df = pd.merge(ratings[['userId','movieId']], items_names[['movieId','title']] ,on='movieId', how= "inner")

display(pelis_df.head())
# Sort by user so that each user's rows are contiguous (the grouping step relies on this).
pelis_df = pelis_df.sort_values(by='userId')
# Keep the first 2,000,000 rows and only the userId and title columns.
pelis = pelis_df.values[:2000000, [0, 2]]
print(pelis)

ratings -- dimensions: (20000263, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Unnamed: 0,userId,movieId,title
0,1,2,Jumanji (1995)
1,5,2,Jumanji (1995)
2,13,2,Jumanji (1995)
3,29,2,Jumanji (1995)
4,34,2,Jumanji (1995)


[[1 'Jumanji (1995)']
 [1 'Blade Runner (1982)']
 [1 "Monty Python's The Meaning of Life (1983)"]
 ...
 [13567 'National Treasure (2004)']
 [13567 'Prince of Egypt, The (1998)']
 [13567 'Changeling (2008)']]
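As a minimal sketch of the join described above, the following toy example reproduces the same `pd.merge` call on two small hand-made frames. The frames `ratings_toy` and `movies_toy` and their values are hypothetical stand-ins for `ratings.csv` and `movies.csv`, not data from MovieLens.

```python
import pandas as pd

# Hypothetical stand-ins for ratings.csv and movies.csv (values are made up).
ratings_toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [2, 29, 2],
    "rating": [3.5, 3.5, 4.0],
})
movies_toy = pd.DataFrame({
    "movieId": [2, 29, 50],
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Adventure", "Drama", "Crime"],
})

# Select only the join key plus the columns we want to keep,
# then inner-join on movieId so every rating row gets its title.
joined = pd.merge(
    ratings_toy[["userId", "movieId"]],
    movies_toy[["movieId", "title"]],
    on="movieId",
    how="inner",
)
print(joined)
```

With an inner join, a movie that nobody rated (movieId 50 here) does not appear in the result, and the rating and genres columns are dropped because they were never selected.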


We generate the transactions: one list of movie titles per user.

In [3]:
transactions=[]
for orders_id, order_object in groupby(pelis, lambda x: x[0]):
    transactions.append([item[1] for item in order_object])


In [4]:
len(transactions)

13567
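The grouping above depends on the rows being sorted by user, because `itertools.groupby` only groups consecutive items with the same key. The sketch below illustrates this on hypothetical `(userId, title)` pairs; the helper `to_transactions` is an illustrative name, not part of the notebook.

```python
from itertools import groupby

# Hypothetical (userId, title) pairs; user 1 appears in two separate runs.
rows_unsorted = [(1, "A"), (2, "B"), (1, "C")]


def to_transactions(rows):
    """Collect the titles of each run of consecutive rows sharing a user id."""
    return [
        [title for _, title in group]
        for _, group in groupby(rows, key=lambda row: row[0])
    ]


# Without sorting, user 1 is split into two separate transactions.
split = to_transactions(rows_unsorted)

# After sorting by user id, each user yields exactly one transaction.
merged = to_transactions(sorted(rows_unsorted, key=lambda row: row[0]))

print(split)
print(merged)
```

This is why the number of transactions only matches the number of distinct users when the input has been sorted by userId first.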

The data must be formatted the way mlxtend requires. For this I use the TransactionEncoder class that mlxtend provides.
It turns every movie title into a column and every transaction (one per user) into a row, indexed by the transaction's position in the list. Each cell holds a boolean indicating whether that user rated that movie.

In [5]:
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)
df

Unnamed: 0,"""Great Performances"" Cats (1998)",$ (Dollars) (1971),$5 a Day (2008),$9.99 (2008),'71 (2014),'Hellboy': The Seeds of Creation (2004),'Neath the Arizona Skies (1934),'R Xmas (2001),'Round Midnight (1986),'Salem's Lot (2004),...,[REC]³ 3 Génesis (2012),a/k/a Tommy Chong (2005),eXistenZ (1999),loudQUIETloud: A Film About the Pixies (2006),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À l'aventure (2008),À nos amours (1983),À nous la liberté (Freedom for Us) (1931)
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13562,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
13563,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False
13564,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,True,False,False,False
13565,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
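To make the encoding concrete, here is a hand-rolled sketch of the same boolean layout on three hypothetical transactions. It is not mlxtend's implementation, only an equivalent construction in plain Python; the column order is sorted here on the assumption that it mirrors the ordering seen in the output above.

```python
# Hypothetical transactions: one list of titles per user.
transactions_toy = [
    ["Movie A", "Movie B"],
    ["Movie B"],
    ["Movie A", "Movie C"],
]

# One column per distinct title (sorted for a stable, readable order).
columns = sorted({title for t in transactions_toy for title in t})

# One row per transaction; True where the transaction contains the title.
matrix = [[title in t for title in columns] for t in transactions_toy]

print(columns)
for row in matrix:
    print(row)
```

The resulting matrix has as many rows as transactions and as many columns as distinct titles, which is why the real DataFrame is very wide and mostly False.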


## Apriori

I run the **apriori** algorithm from mlxtend. We noticed a marked difference in performance compared with efficient_apriori, along with better memory handling.

After running it, I store the results.

In [6]:
frequent_itemsets = apriori(df, min_support=0.1, use_colnames=True, verbose=False, low_memory=True)

frequent_itemsets.to_csv("frequent_itemsets.csv")
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.186113,(2001: A Space Odyssey (1968))
1,0.129505,"(Abyss, The (1989))"
2,0.280386,(Ace Ventura: Pet Detective (1994))
3,0.151470,(Ace Ventura: When Nature Calls (1995))
4,0.132896,(Addams Family Values (1993))
...,...,...
78476,0.100096,"(Star Wars: Episode IV - A New Hope (1977), St..."
78477,0.100243,"(Star Wars: Episode IV - A New Hope (1977), St..."
78478,0.105034,"(Star Wars: Episode IV - A New Hope (1977), St..."
78479,0.101275,"(Star Wars: Episode IV - A New Hope (1977), Me..."
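To illustrate what `min_support` filters on, the sketch below counts supports by brute force on a hypothetical four-transaction dataset. It enumerates every itemset, so it is not the Apriori algorithm itself (which prunes candidates level by level); it only shows which itemsets would count as frequent for a given threshold. The names `transactions_toy`, `support` and `frequent` are illustrative.

```python
from itertools import combinations

# Hypothetical transactions, each a set of items.
transactions_toy = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B"},
]
min_support = 0.5


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


items = sorted(set().union(*transactions_toy))

# Brute force: test every non-empty combination of items.
frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        s = support(candidate, transactions_toy)
        if s >= min_support:
            frequent[candidate] = s

for itemset, s in frequent.items():
    print(itemset, s)
```

In the notebook, `min_support=0.1` plays the same role: an itemset is kept only if at least 10% of the users rated all of its movies.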


## Conclusions

Using the association_rules function we can obtain the metrics needed to evaluate the algorithm's results. In this case we filter the rules by confidence, with a minimum threshold of 0.3.

In [7]:
df_association_rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.3)

In [14]:
df_association_rules.sort_values("confidence", ascending=False).iloc[:100]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
1421850,"(Star Wars: Episode IV - A New Hope (1977), Ba...",(Star Wars: Episode V - The Empire Strikes Bac...,0.109015,0.335815,0.108425,0.994591,2.961724,0.071816,122.791231
1501615,"(Star Wars: Episode IV - A New Hope (1977), Ba...",(Star Wars: Episode V - The Empire Strikes Bac...,0.104813,0.335815,0.104223,0.994374,2.961079,0.069026,118.058911
1501364,"(Star Wars: Episode IV - A New Hope (1977), Ba...",(Star Wars: Episode V - The Empire Strikes Bac...,0.101644,0.335815,0.100907,0.992748,2.956237,0.066773,91.591133
1421230,"(Star Wars: Episode IV - A New Hope (1977), Ba...",(Star Wars: Episode V - The Empire Strikes Bac...,0.112553,0.335815,0.111668,0.992141,2.954430,0.073871,84.517561
1408764,"(Star Wars: Episode IV - A New Hope (1977), Ba...",(Star Wars: Episode V - The Empire Strikes Bac...,0.100907,0.335815,0.100096,0.991965,2.953904,0.066210,82.660862
...,...,...,...,...,...,...,...,...,...
1418763,"(Star Wars: Episode IV - A New Hope (1977), Ba...",(Star Wars: Episode V - The Empire Strikes Bac...,0.110268,0.335815,0.108793,0.986631,2.938021,0.071764,49.681050
1340442,"(Star Wars: Episode IV - A New Hope (1977), Te...",(Star Wars: Episode V - The Empire Strikes Bac...,0.104518,0.335815,0.103118,0.986601,2.937931,0.068019,49.569187
1494849,"(Star Wars: Episode IV - A New Hope (1977), St...",(Lord of the Rings: The Fellowship of the Ring...,0.109825,0.274342,0.108351,0.986577,3.596156,0.078221,54.061510
1340560,"(Terminator, The (1984), Star Wars: Episode VI...",(Star Wars: Episode V - The Empire Strikes Bac...,0.103781,0.335815,0.102381,0.986506,2.937648,0.067529,49.219616
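The metric columns in the table can be recomputed from the three support values alone. The sketch below uses the standard definitions of confidence, lift, leverage and conviction and applies them to the rounded supports of the first row above; `rule_metrics` is a hypothetical helper, not an mlxtend function, and small differences from the table are expected because the inputs are rounded to six decimals.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Compute rule metrics for A -> C from support(A), support(C), support(A and C)."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # Conviction is unbounded when the rule never fails (confidence of 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Supports taken from the first row of the table above (rule 1421850).
top = rule_metrics(
    antecedent_support=0.109015,
    consequent_support=0.335815,
    support=0.108425,
)
for name, value in top.items():
    print(f"{name}: {value:.6f}")
```

A lift close to 3 means users who rated the antecedent movies rated the consequent about three times more often than users overall, which puts the very high confidence in context: the consequent is itself a popular movie, with a support of about 0.34.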


In [None]:
df_association_rules