# Association rules in grocery purchases

The goal of this notebook is to apply the Apriori algorithm to discover association rules between grocery products in a shopping basket.

![](./pictures/AffinityAnalysis.png)

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import nbformat

## Loading the data

In [4]:
df = pd.read_csv("./data/Groceries_dataset.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38765 entries, 0 to 38764
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Member_number    38765 non-null  int64 
 1   Date             38765 non-null  object
 2   itemDescription  38765 non-null  object
dtypes: int64(1), object(2)
memory usage: 908.7+ KB


In [5]:
df["itemDescription"].count()

38765

In [6]:
# Convert the Date column to datetime (dates are written day-first)
df.Date = pd.to_datetime(df.Date, dayfirst=True)
df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38765 entries, 0 to 38764
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Member_number    38765 non-null  int64         
 1   Date             38765 non-null  datetime64[ns]
 2   itemDescription  38765 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 908.7+ KB


In [7]:
nan_values = df.isna().sum()
nan_values

Member_number      0
Date               0
itemDescription    0
dtype: int64

## Data visualization

Top 50 most frequently purchased grocery items

In [8]:

# Count how often each item appears, most frequent first
frequency_of_items = (
    df.groupby('itemDescription').size()
      .reset_index(name='count')
      .sort_values('count', ascending=False)
)
fi_50 = frequency_of_items.head(50)  # keep the 50 most frequent items
fig = px.treemap(fi_50, path=['itemDescription'], values='count')
fig.update_layout(title_text='Frequency of the Top 50 Items Sold',
                  title_x=0.5, title_font=dict(size=18)
                  )
fig.update_traces(textinfo="label+value")
fig.show()


## Applying Apriori
We apply association rule mining with the Apriori algorithm. The goal is to find relationships between the grocery products that people buy on each visit to the market.
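
To make the idea behind Apriori concrete before using the library, here is a minimal pure-Python sketch of its level-wise search. It is not mlxtend's implementation; the function name `apriori_sketch` and the toy baskets are assumptions made for illustration only. Support is the fraction of baskets that contain an itemset, and an itemset of size k is only considered if all of its subsets of size k-1 are already frequent.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: frequent single items.
    items = {item for basket in baskets for item in basket}
    current = {frozenset([item]) for item in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Candidates of size k: unions of two frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical toy baskets, not taken from the dataset.
toy = [
    ["whole milk", "yogurt", "sausage"],
    ["whole milk", "yogurt"],
    ["whole milk", "soda"],
    ["soda", "sausage"],
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With these toy baskets, the only pair that reaches a support of 0.5 is whole milk together with yogurt, so the search stops after the second level.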

In [9]:
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

### Data preparation

In [10]:
# One transaction = all items bought by the same member on the same date
all_transactions = df.groupby(['Member_number', 'Date'])['itemDescription'].apply(list).tolist()
all_transactions[0:10]

[['whole milk', 'pastry', 'salty snack'],
 ['sausage', 'whole milk', 'semi-finished bread', 'yogurt'],
 ['soda', 'pickled vegetables'],
 ['canned beer', 'misc. beverages'],
 ['sausage', 'hygiene articles'],
 ['sausage', 'whole milk', 'rolls/buns'],
 ['whole milk', 'soda'],
 ['frankfurter', 'soda', 'whipped/sour cream'],
 ['beef', 'white bread'],
 ['frankfurter', 'curd']]

In [11]:
te = TransactionEncoder()
te_ary = te.fit(all_transactions).transform(all_transactions)
trans_encoder_matrix = pd.DataFrame(te_ary, columns=te.columns_)
trans_encoder_matrix.head()



| | Instant food products | UHT-milk | abrasive cleaner | artif. sweetener | baby cosmetics | bags | baking powder | bathroom cleaner | beef | berries | ... | turkey | vinegar | waffles | whipped/sour cream | whisky | white bread | white wine | whole milk | yogurt | zwieback |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | True | True | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
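
The encoding step can be pictured with a small pure-Python sketch: one boolean column per distinct item, one row per transaction. This only illustrates the shape of the result and is not mlxtend's code; the helper name `one_hot_encode` and the toy transactions are assumptions.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted item names and one boolean row per transaction."""
    # One column per distinct item, in sorted order.
    columns = sorted({item for t in transactions for item in t})
    rows = []
    for t in transactions:
        present = set(t)
        # True where the transaction contains the column's item.
        rows.append([col in present for col in columns])
    return columns, rows


# Hypothetical toy transactions, not taken from the dataset.
toy = [["whole milk", "pastry"], ["soda"], ["whole milk", "soda"]]
columns, rows = one_hot_encode(toy)
print(columns)
for row in rows:
    print(row)
```

Each row has as many entries as there are distinct items, which is why the real matrix above is wide and mostly `False`.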


### Apriori

In [12]:
# min_support=0.001 keeps itemsets present in at least 0.1% of transactions
frequent_itemsets = apriori(trans_encoder_matrix, min_support=0.001, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets

| | support | itemsets | length |
|---|---|---|---|
| 0 | 0.004010 | (Instant food products) | 1 |
| 1 | 0.021386 | (UHT-milk) | 1 |
| 2 | 0.001470 | (abrasive cleaner) | 1 |
| 3 | 0.001938 | (artif. sweetener) | 1 |
| 4 | 0.008087 | (baking powder) | 1 |
| ... | ... | ... | ... |
| 745 | 0.001136 | (sausage, rolls/buns, whole milk) | 3 |
| 746 | 0.001002 | (soda, rolls/buns, whole milk) | 3 |
| 747 | 0.001337 | (yogurt, rolls/buns, whole milk) | 3 |
| 748 | 0.001069 | (sausage, soda, whole milk) | 3 |


Itemsets with the highest support that contain more than one product

In [13]:
frequent_itemsets[ (frequent_itemsets['length'] >= 2) &
                   (frequent_itemsets['support'] >= 0.0065) ]

| | support | itemsets | length |
|---|---|---|---|
| 223 | 0.007151 | (bottled beer, whole milk) | 2 |
| 254 | 0.007151 | (bottled water, whole milk) | 2 |
| 390 | 0.007151 | (citrus fruit, whole milk) | 2 |
| 609 | 0.010559 | (other vegetables, rolls/buns) | 2 |
| 616 | 0.009691 | (soda, other vegetables) | 2 |
| 625 | 0.014837 | (other vegetables, whole milk) | 2 |
| 626 | 0.008087 | (yogurt, other vegetables) | 2 |
| 649 | 0.006616 | (pip fruit, whole milk) | 2 |
| 669 | 0.008087 | (soda, rolls/buns) | 2 |
| 677 | 0.013968 | (rolls/buns, whole milk) | 2 |


### Association rules (lift ≥ 1), sorted by confidence

In [14]:
rules=association_rules(frequent_itemsets, metric="lift", min_threshold=1).sort_values('confidence', ascending=False)
rules

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 234 | (sausage, yogurt) | (whole milk) | 0.005748 | 0.157923 | 0.001470 | 0.255814 | 1.619866 | 0.000563 | 1.131541 | 0.384877 |
| 216 | (sausage, rolls/buns) | (whole milk) | 0.005347 | 0.157923 | 0.001136 | 0.212500 | 1.345594 | 0.000292 | 1.069304 | 0.258214 |
| 228 | (sausage, soda) | (whole milk) | 0.005948 | 0.157923 | 0.001069 | 0.179775 | 1.138374 | 0.000130 | 1.026642 | 0.122281 |
| 202 | (semi-finished bread) | (whole milk) | 0.009490 | 0.157923 | 0.001671 | 0.176056 | 1.114825 | 0.000172 | 1.022008 | 0.103985 |
| 222 | (yogurt, rolls/buns) | (whole milk) | 0.007819 | 0.157923 | 0.001337 | 0.170940 | 1.082428 | 0.000102 | 1.015701 | 0.076751 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 112 | (whole milk) | (detergent) | 0.157923 | 0.008621 | 0.001403 | 0.008887 | 1.030824 | 0.000042 | 1.000268 | 0.035510 |
| 227 | (whole milk) | (yogurt, rolls/buns) | 0.157923 | 0.007819 | 0.001337 | 0.008464 | 1.082428 | 0.000102 | 1.000650 | 0.090433 |
| 174 | (other vegetables) | (pot plants) | 0.122101 | 0.007819 | 0.001002 | 0.008210 | 1.049991 | 0.000048 | 1.000394 | 0.054233 |
| 221 | (whole milk) | (sausage, rolls/buns) | 0.157923 | 0.005347 | 0.001136 | 0.007194 | 1.345594 | 0.000292 | 1.001861 | 0.305000 |


In [15]:
# Build a readable "antecedents -> consequents" label for each rule
rules["rule"] = rules["antecedents"].apply(', '.join) + " -> " + rules["consequents"].apply(', '.join)
rules_ = rules[["rule", "support", "confidence", "lift"]]



### Top 20 strongest implications in a shopping basket (CONFIDENCE)
Confidence measures the probability that a basket containing the antecedent set of products A also contains the consequent set of products B.

$$ \mathrm{Confidence}(A \Rightarrow B) = \frac{P(A \cap B)}{P(A)} = P(B \mid A) $$

In [16]:
rules_.sort_values("confidence", ascending=False).head(20)

| | rule | support | confidence | lift |
|---|---|---|---|---|
| 234 | sausage, yogurt -> whole milk | 0.00147 | 0.255814 | 1.619866 |
| 216 | sausage, rolls/buns -> whole milk | 0.001136 | 0.2125 | 1.345594 |
| 228 | sausage, soda -> whole milk | 0.001069 | 0.179775 | 1.138374 |
| 202 | semi-finished bread -> whole milk | 0.001671 | 0.176056 | 1.114825 |
| 222 | yogurt, rolls/buns -> whole milk | 0.001337 | 0.17094 | 1.082428 |
| 236 | sausage, whole milk -> yogurt | 0.00147 | 0.164179 | 1.91176 |
| 113 | detergent -> whole milk | 0.001403 | 0.162791 | 1.030824 |
| 146 | ham -> whole milk | 0.00274 | 0.160156 | 1.014142 |
| 180 | processed cheese -> rolls/buns | 0.00147 | 0.144737 | 1.315734 |
| 177 | packaged fruit/vegetables -> rolls/buns | 0.001203 | 0.141732 | 1.288421 |
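
As a quick sanity check on the confidence column, the top rule can be recomputed by hand as the support of the whole rule divided by the support of its antecedent. The sketch below assumes the rounded values printed in the rules output for `sausage, yogurt -> whole milk`, so it should only agree with the reported 0.255814 to about three decimals.

```python
# Values copied (already rounded) from the rules output for
# "sausage, yogurt -> whole milk".
support_rule = 0.001470        # support of {sausage, yogurt, whole milk}
antecedent_support = 0.005748  # support of {sausage, yogurt}

# Confidence(A => B) = P(A and B) / P(A)
confidence = support_rule / antecedent_support
print(round(confidence, 3))
```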


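### Top 20 strongest relationships in a shopping basket (LIFT)
Lift measures the strength of the relationship between two itemsets: values above 1 mean they appear together more often than would be expected if they were independent.

$$ \mathrm{Lift}(A \Rightarrow B) = \frac{P(A \cap B)}{P(A)\,P(B)} $$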

In [17]:
# Lift is symmetric in antecedents and consequents, so each rule and its mirror
# share the same lift; keep every other row to drop the mirrored duplicates.
rules_.sort_values("lift", ascending=False).iloc[::2, :].head(20)

| | rule | support | confidence | lift |
|---|---|---|---|---|
| 238 | sausage -> yogurt, whole milk | 0.00147 | 0.024363 | 2.182917 |
| 236 | sausage, whole milk -> yogurt | 0.00147 | 0.164179 | 1.91176 |
| 87 | citrus fruit -> specialty chocolate | 0.001403 | 0.026415 | 1.653762 |
| 234 | sausage, yogurt -> whole milk | 0.00147 | 0.255814 | 1.619866 |
| 122 | tropical fruit -> flour | 0.001069 | 0.015779 | 1.617141 |
| 21 | beverages -> sausage | 0.001537 | 0.092742 | 1.536764 |
| 231 | sausage -> soda, whole milk | 0.001069 | 0.017719 | 1.523708 |
| 169 | pastry -> napkins | 0.001738 | 0.033592 | 1.518529 |
| 182 | processed cheese -> root vegetables | 0.001069 | 0.105263 | 1.513019 |
| 148 | pip fruit -> hard cheese | 0.001069 | 0.021798 | 1.482586 |
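
Keeping every other row relies on each rule and its mirror ending up next to each other after sorting, which is not guaranteed when different rules tie on lift. A possible alternative, sketched here in plain Python rather than with the notebook's DataFrame, is to key each rule by the unordered pair of its two sides and keep the first occurrence. The helper name `drop_mirrored` is hypothetical; the four rules and their lift values are copied from the rules output shown earlier.

```python
# (antecedents, consequents, lift), copied from the rules output.
rules_sample = [
    (("sausage", "rolls/buns"), ("whole milk",), 1.345594),
    (("whole milk",), ("sausage", "rolls/buns"), 1.345594),
    (("yogurt", "rolls/buns"), ("whole milk",), 1.082428),
    (("whole milk",), ("yogurt", "rolls/buns"), 1.082428),
]


def drop_mirrored(rules):
    """Keep one rule per unordered {antecedents, consequents} pair, highest lift first."""
    seen = set()
    kept = []
    # sorted() is stable, so rules with equal lift keep their input order.
    for antecedents, consequents, lift in sorted(rules, key=lambda r: r[2], reverse=True):
        # A rule and its mirror produce the same unordered key.
        key = frozenset([frozenset(antecedents), frozenset(consequents)])
        if key not in seen:
            seen.add(key)
            kept.append((antecedents, consequents, lift))
    return kept


unique_rules = drop_mirrored(rules_sample)
for antecedents, consequents, lift in unique_rules:
    print(", ".join(antecedents), "->", ", ".join(consequents), lift)
```

The same key could be built as an extra column in `rules` and passed to a duplicate-dropping step, which would avoid depending on row order.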


The confidence of the association rules found was not very high for this dataset. That is, no strong dependence was found between any set of consequent products and any set of antecedent products. The rule with the highest confidence (about 0.256) is "sausage, yogurt -> whole milk": a customer who buys sausage and yogurt also buys whole milk in roughly 26% of those transactions, compared with a support of about 0.158 for whole milk overall.

Lift values above 1 do indicate some positive association between several products in the shopping basket. The strongest is "sausage -> yogurt, whole milk", with a lift of about 2.18.