## Exercise for Chapter 5


Test your knowledge of association rules in a practical exercise. Learn association rules from a [supermarket data set](https://data-science-crashkurs.de/exercises/data/store_data.csv). You can use the following source code to read the data.

In [1]:
# each line of the file is one transaction: a comma-separated list of items
with open('data/store_data.csv') as f:
    records = []
    for line in f:
        records.append(line.strip().split(','))

### Finding Frequent Itemsets

Use the Apriori algorithm to find frequent itemsets. To do this, you also have to determine a suitable threshold for the support. Justify your choice.

In [2]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.preprocessing import TransactionEncoder

# we first need to create a one-hot encoding of our transactions
te = TransactionEncoder()
te_ary = te.fit_transform(records)
data_df = pd.DataFrame(te_ary, columns=te.columns_)

# use a support of 0.005 - such a low threshold may include too many candidates,
# so the rules must be selected carefully based on other metrics;
# this means that we use a higher confidence threshold for the rules
frequent_itemsets = apriori(data_df, min_support=0.005, use_colnames=True)
frequent_itemsets

index,support,itemsets
0,0.020397,(almonds)
1,0.008932,(antioxydant juice)
2,0.033329,(avocado)
3,0.008666,(bacon)
4,0.010799,(barbecue sauce)
...,...,...
720,0.007466,"(mineral water, spaghetti, soup)"
721,0.009332,"(mineral water, tomatoes, spaghetti)"
722,0.006399,"(mineral water, turkey, spaghetti)"
723,0.006266,"(mineral water, whole wheat rice, spaghetti)"
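
The effect of the support threshold can be illustrated on a small scale. The following is a minimal sketch, assuming a handful of hypothetical toy transactions (not taken from `store_data.csv`) and a brute-force count instead of the Apriori algorithm. The helper `frequent_itemsets_bruteforce` is made up for this illustration. It shows that lowering the threshold quickly increases the number of itemsets that count as frequent.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions for illustration, not from store_data.csv.
transactions = [
    {"spaghetti", "ground beef", "mineral water"},
    {"spaghetti", "mineral water"},
    {"milk", "mineral water"},
    {"spaghetti", "ground beef"},
    {"milk", "eggs"},
    {"mineral water"},
    {"spaghetti", "olive oil", "mineral water"},
    {"eggs", "mineral water"},
]


def frequent_itemsets_bruteforce(transactions, min_support, max_len=3):
    """Count every itemset with up to max_len items and keep the frequent ones.

    This is a brute-force enumeration, not Apriori: it does not prune
    candidates and is only feasible for tiny data sets.
    """
    counts = Counter()
    for transaction in transactions:
        for size in range(1, max_len + 1):
            for combo in combinations(sorted(transaction), size):
                counts[frozenset(combo)] += 1
    n = len(transactions)
    return {items: c / n for items, c in counts.items() if c / n >= min_support}


# A relative support threshold corresponds to a minimum absolute count.
for threshold in (0.5, 0.25, 0.125):
    found = frequent_itemsets_bruteforce(transactions, threshold)
    min_count = threshold * len(transactions)
    print(f"min_support={threshold}: at least {min_count:g} transactions, "
          f"{len(found)} frequent itemsets")
```

Translating the relative threshold into an absolute number of transactions, as in the loop above, is one way to justify a choice: it shows how many purchases actually back each itemset.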


### Creating Rules

Create rules from the frequent itemsets. Use lift and confidence to determine which rules are good.

In [3]:
from mlxtend.frequent_patterns import association_rules

# use a high confidence to counterbalance the low support threshold
# order by lift to have the "best" rules at the top
association_rules(frequent_itemsets, metric="confidence",
                  min_threshold=0.5).sort_values('lift', ascending=False)

index,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
13,"(shrimp, ground beef)",(spaghetti),0.011465,0.17411,0.005999,0.523256,3.005315,0.004003,1.732354
6,"(ground beef, frozen vegetables)",(spaghetti),0.016931,0.17411,0.008666,0.511811,2.939582,0.005718,1.691742
9,"(olive oil, frozen vegetables)",(spaghetti),0.011332,0.17411,0.005733,0.505882,2.905531,0.00376,1.671444
8,"(frozen vegetables, soup)",(mineral water),0.007999,0.238368,0.005066,0.633333,2.656954,0.003159,2.077178
17,"(olive oil, soup)",(mineral water),0.008932,0.238368,0.005199,0.58209,2.441976,0.00307,1.822476
7,"(olive oil, frozen vegetables)",(mineral water),0.011332,0.238368,0.006532,0.576471,2.418404,0.003831,1.798297
15,"(milk, soup)",(mineral water),0.015198,0.238368,0.008532,0.561404,2.355194,0.004909,1.73652
2,"(chocolate, soup)",(mineral water),0.010132,0.238368,0.005599,0.552632,2.318395,0.003184,1.702471
3,"(cooking oil, eggs)",(mineral water),0.011732,0.238368,0.006399,0.545455,2.288286,0.003603,1.67559
5,"(ground beef, frozen vegetables)",(mineral water),0.016931,0.238368,0.009199,0.543307,2.279277,0.005163,1.667711


At first glance these rules make sense, but they are not helpful because there are only two consequents: spaghetti and mineral water. The first three rules point to different pasta sauces whose ingredients are bought together with the pasta itself. This may be true, but pasta is also often bought to stock up. A lift of about three means that the antecedent and the consequent occur together about three times as often as we would expect if they were bought independently of each other.
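
The metrics in the table can be recomputed from the three support columns. The following sketch uses the rounded values from the first row of the table, (shrimp, ground beef) -> (spaghetti), and the usual definitions of confidence, lift, leverage, and conviction. Because the inputs are rounded, the results should only agree with the table up to small rounding differences.

```python
# Rounded values copied from the first row of the rule table:
# (shrimp, ground beef) -> (spaghetti)
antecedent_support = 0.011465  # share of transactions with shrimp and ground beef
consequent_support = 0.17411   # share of transactions with spaghetti
support = 0.005999             # share of transactions with all three items

# confidence: how often the consequent occurs when the antecedent is present
confidence = support / antecedent_support

# lift: confidence relative to the base rate of the consequent;
# a value of 1 corresponds to independence
lift = confidence / consequent_support

# leverage: observed joint support minus the support expected under independence
leverage = support - antecedent_support * consequent_support

# conviction: how much more often the rule would be wrong under independence
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.3f} lift={lift:.3f} "
      f"leverage={leverage:.4f} conviction={conviction:.3f}")
```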

The other rules, in contrast, can probably be explained by chance, because mineral water is bought very often:

In [4]:
# the mean of a one-hot encoded column is the fraction of transactions that contain the item
data_df['mineral water'].mean()

0.23836821757099053

### Validating the Rules

Randomly split the data into two data sets, each with 50% of the transactions. Apply the Apriori algorithm to both data sets to determine rules. Compare the rules you find with each other and with the rules you found on all data. What differences are there? What do the differences mean?

In [5]:
from sklearn.model_selection import train_test_split

# split the data into two sets with 50% of the data
X_train, X_test = train_test_split(data_df, test_size=0.5, random_state=42)

# create frequent itemsets for both halves
frequent_itemsets_train = apriori(X_train, min_support=0.005, use_colnames=True)
frequent_itemsets_test = apriori(X_test, min_support=0.005, use_colnames=True)

# rules for first set
association_rules(frequent_itemsets_train, metric="confidence",
                  min_threshold=0.5).sort_values('lift', ascending=False)

index,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
27,"(whole wheat pasta, spaghetti)",(milk),0.010133,0.135733,0.005067,0.5,3.683694,0.003691,1.728533
5,"(grated cheese, eggs)",(spaghetti),0.0096,0.181333,0.005867,0.611111,3.370098,0.004126,2.105143
15,"(frozen vegetables, soup)",(spaghetti),0.008533,0.181333,0.005067,0.59375,3.274357,0.003519,2.015179
10,"(ground beef, frozen vegetables)",(spaghetti),0.0168,0.181333,0.0096,0.571429,3.151261,0.006554,1.910222
21,"(shrimp, ground beef)",(spaghetti),0.012267,0.181333,0.006933,0.565217,3.117008,0.004709,1.882933
33,"(mineral water, milk, frozen vegetables)",(spaghetti),0.0112,0.181333,0.006133,0.547619,3.019958,0.004102,1.809684
14,"(olive oil, frozen vegetables)",(spaghetti),0.0112,0.181333,0.005867,0.52381,2.888655,0.003836,1.7192
0,"(cake, chocolate)",(spaghetti),0.013333,0.181333,0.006933,0.52,2.867647,0.004516,1.705556
6,"(herb & pepper, eggs)",(spaghetti),0.0128,0.181333,0.0064,0.5,2.757353,0.004079,1.637333
32,"(eggs, milk, spaghetti)",(mineral water),0.009067,0.2496,0.006133,0.676471,2.710219,0.00387,2.319418


In [6]:
# rules for second set
association_rules(frequent_itemsets_test, metric="confidence",
                  min_threshold=0.5).sort_values('lift', ascending=False)

index,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
20,"(olive oil, pancakes)",(spaghetti),0.01093,0.166889,0.005599,0.512195,3.06908,0.003774,1.707878
16,"(olive oil, tomatoes)",(mineral water),0.007465,0.227139,0.005065,0.678571,2.987466,0.00337,2.404455
4,"(chocolate, soup)",(mineral water),0.008798,0.227139,0.005865,0.666667,2.935055,0.003867,2.318582
15,"(olive oil, soup)",(mineral water),0.008531,0.227139,0.005332,0.625,2.751614,0.003394,2.060962
13,"(turkey, milk)",(mineral water),0.011464,0.227139,0.006931,0.604651,2.662026,0.004328,1.954883
14,"(shrimp, olive oil)",(mineral water),0.009064,0.227139,0.005332,0.588235,2.589754,0.003273,1.876947
12,"(milk, soup)",(mineral water),0.01413,0.227139,0.008264,0.584906,2.575095,0.005055,1.861891
9,"(olive oil, frozen vegetables)",(mineral water),0.011464,0.227139,0.006665,0.581395,2.559641,0.004061,1.846278
1,"(chocolate, chicken)",(mineral water),0.012797,0.227139,0.007198,0.5625,2.476452,0.004291,1.766538
5,"(cooking oil, eggs)",(mineral water),0.01093,0.227139,0.006132,0.560976,2.469741,0.003649,1.760405


The "mineral water rules" are found consistently in both subsets. The "spaghetti rules" appear rather randomly. Such rules exist in both subsets, but they are not identical. This indicates that they could also be a random artifact and that there is no strong relationship between the antecedent and the consequent.
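
Comparing rule sets by eye is error-prone, so the comparison can also be done with set operations. The following is a minimal sketch, assuming each rule is represented as a pair of frozensets. The helpers `rule` and `compare_rule_sets` are hypothetical names introduced for this illustration, and the example only uses a few rules copied from the tables above, not the complete rule sets.

```python
def rule(antecedent, consequent):
    """Build a hashable (antecedent, consequent) pair from comma-separated item names."""
    def to_items(text):
        return frozenset(item.strip() for item in text.split(","))
    return (to_items(antecedent), to_items(consequent))


def compare_rule_sets(rules_a, rules_b):
    """Return the rules found in both sets, only in the first, and only in the second."""
    a, b = set(rules_a), set(rules_b)
    return a & b, a - b, b - a


# A few rules copied from the tables above for illustration only.
rules_all = [
    rule("shrimp, ground beef", "spaghetti"),
    rule("olive oil, frozen vegetables", "spaghetti"),
    rule("olive oil, frozen vegetables", "mineral water"),
    rule("milk, soup", "mineral water"),
]
rules_first = [
    rule("shrimp, ground beef", "spaghetti"),
    rule("olive oil, frozen vegetables", "spaghetti"),
    rule("eggs, milk, spaghetti", "mineral water"),
]
rules_second = [
    rule("olive oil, frozen vegetables", "mineral water"),
    rule("milk, soup", "mineral water"),
    rule("olive oil, pancakes", "spaghetti"),
]

shared_halves, only_first, only_second = compare_rule_sets(rules_first, rules_second)
shared_all_first, _, _ = compare_rule_sets(rules_all, rules_first)
shared_all_second, _, _ = compare_rule_sets(rules_all, rules_second)

print("shared between both halves:", len(shared_halves))
print("shared between all data and first half:", len(shared_all_first))
print("shared between all data and second half:", len(shared_all_second))
```

For the rule tables returned by `association_rules`, the pairs could be built with `zip(rules["antecedents"], rules["consequents"])`, since both columns should contain frozensets. Because this sample contains only a few rules, its result says nothing about the overlap of the complete rule sets.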