# Association Rule Mining
## 1. Online Retail Market Basket Analysis

In [6]:
import pandas as pd
pd.set_option("max_colwidth", 150)

# Load online retail dataset
f = "https://github.com/cs6220/cs6220.spring2019/raw/master/data/Online%20Retail.xlsx"
df = pd.read_excel(f)

# Transform transactions into baskets of items
basket = (df[df["Country"] == "United Kingdom"]
          .groupby(["InvoiceNo", "Description"])["Quantity"]
          .sum().unstack().reset_index().fillna(0)
          .set_index("InvoiceNo"))
# Convert summed quantities to 0/1 flags (1 if the item appears on the invoice)
basket_sets = (basket >= 1).astype(int)
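
As a library-free illustration of what the transformation above produces, the sketch below builds the same kind of invoice-by-item 0/1 table from a few made-up rows. The invoice numbers, item names, and the `to_basket_sets` helper are hypothetical and are not taken from the retail dataset.

```python
from collections import defaultdict

# Hypothetical (invoice, description, quantity) rows, invented for illustration.
rows = [
    ("A1", "TEACUP", 2),
    ("A1", "SAUCER", 1),
    ("A1", "TEACUP", 1),   # same item twice on one invoice: quantities add up
    ("A2", "SAUCER", 4),
    ("A3", "TEACUP", -1),  # negative total quantity: encoded as 0
]

def to_basket_sets(rows):
    """Sum quantities per (invoice, item), then encode quantity >= 1 as 1, else 0."""
    totals = defaultdict(int)
    for invoice, item, qty in rows:
        totals[(invoice, item)] += qty
    invoices = sorted({inv for inv, _ in totals})
    items = sorted({item for _, item in totals})
    return {
        inv: {item: int(totals.get((inv, item), 0) >= 1) for item in items}
        for inv in invoices
    }

basket_sets_toy = to_basket_sets(rows)
for invoice, encoded in basket_sets_toy.items():
    print(invoice, encoded)
```

Each invoice becomes one row and each item one column, which is the layout that `apriori` expects.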

## 1.1 Frequent Itemset Generation
### What are the top 5 1-itemsets with the highest support?

In [39]:
from mlxtend.frequent_patterns import apriori

frequent_itemsets = apriori(basket_sets, min_support=0.025, use_colnames=True)
# Record the number of items in each itemset
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)

# Keep the 1-itemsets and show the five with the highest support
frequent_one_itemsets = frequent_itemsets[frequent_itemsets['length'] == 1]
top_5_one_itemsets = frequent_one_itemsets.sort_values(by='support', ascending=False).head(5)
top_5_one_itemsets

Unnamed: 0,support,itemsets,length
123,0.098276,(WHITE HANGING HEART T-LIGHT HOLDER),1
54,0.087931,(JUMBO BAG RED RETROSPOT),1
99,0.076452,(REGENCY CAKESTAND 3 TIER),1
87,0.072323,(PARTY BUNTING),1
72,0.063158,(LUNCH BAG RED RETROSPOT),1
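
For reference, the support reported in the table is the fraction of invoices that contain every item in the itemset. A minimal sketch of that definition, using hypothetical baskets and a hypothetical `support` helper rather than the real data:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Hypothetical baskets, invented for illustration.
transactions = [
    frozenset({"TEACUP", "SAUCER"}),
    frozenset({"TEACUP"}),
    frozenset({"SAUCER", "CAKESTAND"}),
    frozenset({"TEACUP", "SAUCER", "CAKESTAND"}),
]

print(support({"TEACUP"}, transactions))            # 3 of 4 baskets
print(support({"TEACUP", "SAUCER"}, transactions))  # 2 of 4 baskets
```

Adding an item to an itemset can only keep its support the same or lower it, which is why the 2-itemset supports are smaller than the 1-itemset supports.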


### What are the top 5 2-itemsets with the highest support?

In [40]:
# Keep the 2-itemsets and show the five with the highest support
frequent_two_itemsets = frequent_itemsets[frequent_itemsets['length'] == 2]
top_5_two_itemsets = frequent_two_itemsets.sort_values(by='support', ascending=False).head(5)
top_5_two_itemsets

Unnamed: 0,support,itemsets,length
132,0.035617,"(JUMBO BAG RED RETROSPOT, JUMBO BAG PINK POLKADOT)",2
130,0.031806,"(ROSES REGENCY TEACUP AND SAUCER , GREEN REGENCY TEACUP AND SAUCER)",2
134,0.03167,"(JUMBO BAG RED RETROSPOT, JUMBO STORAGE BAG SUKI)",2
133,0.029809,"(JUMBO SHOPPER VINTAGE RED PAISLEY, JUMBO BAG RED RETROSPOT)",2
135,0.027541,"(LUNCH BAG BLACK SKULL., LUNCH BAG RED RETROSPOT)",2


### What is the highest support value for the 1-itemsets?

In [41]:
top_5_one_itemsets[:1]

Unnamed: 0,support,itemsets,length
123,0.098276,(WHITE HANGING HEART T-LIGHT HOLDER),1



### What is the highest support value for the 2-itemsets?


In [42]:
top_5_two_itemsets[:1]

Unnamed: 0,support,itemsets,length
132,0.035617,"(JUMBO BAG RED RETROSPOT, JUMBO BAG PINK POLKADOT)",2


## 1.2 Association Rule Generation
### What are the top 5 association rules?

In [50]:
from mlxtend.frequent_patterns import association_rules

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
rules = rules.sort_values(by='confidence', ascending=False)
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
2,(PINK REGENCY TEACUP AND SAUCER),(GREEN REGENCY TEACUP AND SAUCER),0.031897,0.042377,0.02618,0.820768,19.368019,0.024828,5.342926
5,(GREEN REGENCY TEACUP AND SAUCER),(ROSES REGENCY TEACUP AND SAUCER ),0.042377,0.043421,0.031806,0.750535,17.285056,0.029966,3.834527
4,(ROSES REGENCY TEACUP AND SAUCER ),(GREEN REGENCY TEACUP AND SAUCER),0.043421,0.042377,0.031806,0.732497,17.285056,0.029966,3.579862
7,(JUMBO BAG PINK POLKADOT),(JUMBO BAG RED RETROSPOT),0.052586,0.087931,0.035617,0.677308,7.702719,0.030993,2.826438
0,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE RED ),0.039746,0.042196,0.025544,0.642694,15.231158,0.023867,2.680627
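
The metric columns in the table can be reproduced from the three support values alone. The sketch below uses the usual textbook definitions, which I believe match what mlxtend reports. `rule_metrics` is a hypothetical helper, and the inputs are the rounded supports from the first row, so the results agree with the table only up to rounding.

```python
import math

def rule_metrics(support_a, support_c, support_ac):
    """Rule metrics from the supports of antecedent A, consequent C, and A and C together."""
    confidence = support_ac / support_a
    lift = confidence / support_c
    leverage = support_ac - support_a * support_c
    # Conviction is infinite when the rule never fails (confidence == 1).
    conviction = math.inf if confidence == 1 else (1 - support_c) / (1 - confidence)
    return {"confidence": confidence, "lift": lift,
            "leverage": leverage, "conviction": conviction}

# Rounded supports copied from the PINK -> GREEN teacup row above.
metrics = rule_metrics(support_a=0.031897, support_c=0.042377, support_ac=0.02618)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```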


### What items make up one of the top association rules? Search online for the items (or at least items with the same name). Do you think they are likely to be bought together?

- One of the top association rules is (GREEN REGENCY TEACUP AND SAUCER) -> (ROSES REGENCY TEACUP AND SAUCER).
- I found the items online: [green](https://www.rexlondon.com/green-regency-teacup-and-saucer) and [roses](https://www.rexlondon.com/roses-regency-teacup-and-saucer).
- I think these items are likely to be purchased together because people who enjoy tea enough to buy teacups and saucers are likely to buy multiple sets for a collection. In addition, the earthy green set may be thought to complement the roses set.


## 2. Association Rule Mining - U.S. Census Data

In [54]:
import numpy as np

# Load adult dataset
path = "https://raw.githubusercontent.com/cs6220/cs6220.spring2019/master/data/adult/"

names = pd.read_csv(path + "adult.names", sep="\n", header=None)
parse_cols = lambda x: x.str.split(":", expand=True).iloc[:, 0]
columns = np.roll(parse_cols(names.iloc[92:108, 0]), shift=-1)

df_adult = pd.read_csv(path + "adult.data", sep=",", header=None, index_col=False)
df_adult.columns = columns

## 2.1 Association Rule Mining

### Transform the raw dataset into a format appropriate for association rule mining by dropping all continuous columns and one-hot encoding the remaining columns. The values for each resulting column should be binary, represented by a 1 or 0.

In [117]:
# Drop continuous (numeric) columns
numeric_cols = df_adult.select_dtypes(include=[np.number]).columns
df_discrete = df_adult.drop(columns=numeric_cols)
df_discrete

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country,">50K, <=50K."
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States,<=50K
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,<=50K
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States,<=50K
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States,<=50K
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...
32556,Private,Assoc-acdm,Married-civ-spouse,Tech-support,Wife,White,Female,United-States,<=50K
32557,Private,HS-grad,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,United-States,>50K
32558,Private,HS-grad,Widowed,Adm-clerical,Unmarried,White,Female,United-States,<=50K
32559,Private,HS-grad,Never-married,Adm-clerical,Own-child,White,Male,United-States,<=50K


In [119]:
# One-hot encode the remaining categorical columns with pandas
df_encoded = pd.get_dummies(df_discrete)
df_encoded

Unnamed: 0,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,education_ 10th,...,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia,">50K, <=50K._ <=50K",">50K, <=50K._ >50K"
0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,1,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,1,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
32557,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
32558,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
32559,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0


### Use confidence for the rule interestingness (metric="confidence") and generate rules up to a depth of at least 3 (max_len=3) or higher. Generate rules and find at least 5 rules that you find interesting. Comment on your findings and try to reason about these association rules. Decide yourself on the levels of support and confidence used in this analysis.

In [172]:
# Get frequent itemsets (max len = 3, minsup = 25%)
freq_itemsets = apriori(df_encoded, min_support=0.25, use_colnames=True, max_len=3)
freq_itemsets

Unnamed: 0,support,itemsets
0,0.697030,(workclass_ Private)
1,0.322502,(education_ HS-grad)
2,0.459937,(marital-status_ Married-civ-spouse)
3,0.328092,(marital-status_ Never-married)
4,0.405178,(relationship_ Husband)
...,...,...
64,0.542152,"(native-country_ United-States, sex_ Male, race_ White)"
65,0.401861,"(>50K, <=50K._ <=50K, sex_ Male, race_ White)"
66,0.580971,"(>50K, <=50K._ <=50K, native-country_ United-States, race_ White)"
67,0.264427,"(native-country_ United-States, >50K, <=50K._ <=50K, sex_ Female)"


In [173]:
# Generate association rules based on confidence
census_confidence_rules = association_rules(freq_itemsets, metric="confidence", min_threshold=0.8)
census_confidence_rules.sort_values(by='confidence', ascending=False)[:50]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
14,(relationship_ Husband),(sex_ Male),0.405178,0.669205,0.405147,0.999924,1.494196,0.134,4364.171954
43,"(relationship_ Husband, marital-status_ Married-civ-spouse)",(sex_ Male),0.404902,0.669205,0.404871,0.999924,1.494196,0.133909,4361.194804
66,"(relationship_ Husband, race_ White)",(sex_ Male),0.366696,0.669205,0.366666,0.999916,1.494184,0.12127,3949.686435
71,"(relationship_ Husband, native-country_ United-States)",(sex_ Male),0.36427,0.669205,0.364239,0.999916,1.494183,0.120468,3923.553669
30,"(relationship_ Husband, workclass_ Private)",(sex_ Male),0.26326,0.669205,0.263229,0.999883,1.494135,0.087054,2835.570529
5,(relationship_ Husband),(marital-status_ Married-civ-spouse),0.405178,0.459937,0.404902,0.999318,2.172729,0.218545,791.672741
44,"(relationship_ Husband, sex_ Male)",(marital-status_ Married-civ-spouse),0.405147,0.459937,0.404871,0.999318,2.172729,0.218529,791.612734
25,"(relationship_ Husband, workclass_ Private)",(marital-status_ Married-civ-spouse),0.26326,0.459937,0.263075,0.9993,2.17269,0.141993,771.570386
40,"(relationship_ Husband, race_ White)",(marital-status_ Married-civ-spouse),0.366696,0.459937,0.36642,0.999246,2.172573,0.197763,716.483933
46,(relationship_ Husband),"(marital-status_ Married-civ-spouse, sex_ Male)",0.405178,0.409048,0.404871,0.999242,2.44285,0.239134,779.643457


#### 5 interesting rules & comments/reasoning
- (education_HS-grad, race_White) -> (native-country_United-States) (support 0.26, confidence 0.94)
- (education_HS-grad) -> (native-country_United-States) (support 0.3, confidence 0.92)

I find the two rules above interesting. The first co-occurrence made me wonder whether there are more white high school graduates in the US than in other countries with a significant white population. I also wondered about the confidence for other races with the US and with other countries as native countries. I notice, however, in the second rule that the consequent (native-country_United-States) has a support of nearly 90%. While over 40 native countries are represented, the overwhelming majority of records are from the US. Therefore, it is difficult to conclude anything significant from a rule that contains native-country_United-States.

- (relationship_Husband) -> (race_White) (support 0.37, confidence 0.91)

As with native-country_United-States, I notice that the consequent support here is over 85%. There are 5 possible races, but most people in the dataset are white. So while husbands appear to co-occur strongly with being white, it is difficult to say so with confidence when the sample sizes for the other races are much smaller.

- (sex_Female) -> (<=50K) (support 0.29, confidence 0.89)
- (marital-status_Never-married) -> (<=50K) (support 0.31, confidence 0.95)

I found some of the rules about whether an individual earns more than 50K or at most 50K interesting. There seems to be a fairly strong co-occurrence between being female and earning <= 50K, and between never having been married and earning <= 50K. It is important to note that the consequent support is over 75%, meaning that many features will co-occur with earnings <= 50K.
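
The concern raised above, that a very common consequent inflates confidence, can be checked by comparing a rule's confidence with the consequent's own support (their ratio is the lift). The sketch below does this for the Husband -> White rule, using rounded supports read from output tables in this notebook. The variable names are my own.

```python
# Supports read from the notebook's output tables (rounded as printed).
support_husband = 0.405178        # relationship_ Husband
support_white = 0.854274          # race_ White
support_husband_white = 0.366696  # relationship_ Husband AND race_ White

confidence = support_husband_white / support_husband
lift = confidence / support_white

print(f"confidence(Husband -> White) = {confidence:.3f}")
print(f"baseline support(White)      = {support_white:.3f}")
print(f"lift                         = {lift:.3f}")
```

A confidence of about 0.91 looks strong in isolation, but it is only about 6% above the share of white individuals in the whole dataset.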


### Use lift for the rule interestingness (metric="lift") and generate rules up to a depth of at least 3 (max_len=3) or higher. Generate rules and find at least 5 rules that you find interesting. Comment on your findings and try to reason about these association rules. Decide yourself on the levels of support and confidence used in this analysis.

In [179]:
# Generate association rules based on lift
# (min_threshold defaults to 0.8, so rules with lift below 1 are kept)
census_lift_rules = association_rules(freq_itemsets, metric="lift")
# Lowest 30 lift
census_lift_rules.sort_values(by='lift')[:30]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
205,"(>50K, <=50K._ <=50K)","(sex_ Male, race_ White)",0.75919,0.588864,0.401861,0.529328,0.898898,-0.045199,0.873509
204,"(sex_ Male, race_ White)","(>50K, <=50K._ <=50K)",0.588864,0.75919,0.401861,0.682435,0.898898,-0.045199,0.758299
113,"(>50K, <=50K._ <=50K, workclass_ Private)",(sex_ Male),0.544609,0.669205,0.328829,0.60379,0.902248,-0.035626,0.834896
116,(sex_ Male),"(>50K, <=50K._ <=50K, workclass_ Private)",0.669205,0.544609,0.328829,0.491372,0.902248,-0.035626,0.895333
223,"(>50K, <=50K._ <=50K)","(native-country_ United-States, sex_ Male)",0.75919,0.598507,0.411197,0.541626,0.904962,-0.043184,0.875907
222,"(native-country_ United-States, sex_ Male)","(>50K, <=50K._ <=50K)",0.598507,0.75919,0.411197,0.687038,0.904962,-0.043184,0.769453
225,(sex_ Male),"(>50K, <=50K._ <=50K, native-country_ United-States)",0.669205,0.675624,0.411197,0.614456,0.909464,-0.040934,0.841346
220,"(>50K, <=50K._ <=50K, native-country_ United-States)",(sex_ Male),0.675624,0.669205,0.411197,0.608619,0.909464,-0.040934,0.845197
55,(sex_ Male),"(>50K, <=50K._ <=50K)",0.669205,0.75919,0.464605,0.694263,0.914479,-0.04345,0.787637
54,"(>50K, <=50K._ <=50K)",(sex_ Male),0.75919,0.669205,0.464605,0.611974,0.914479,-0.04345,0.852506


In [180]:
# Highest 30 lift
census_lift_rules.sort_values(by='lift', ascending=False)[:30]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
138,"(marital-status_ Married-civ-spouse, sex_ Male)",(relationship_ Husband),0.409048,0.405178,0.404871,0.989789,2.44285,0.239134,58.253195
139,(relationship_ Husband),"(marital-status_ Married-civ-spouse, sex_ Male)",0.405178,0.409048,0.404871,0.999242,2.44285,0.239134,779.643457
132,"(marital-status_ Married-civ-spouse, race_ White)",(relationship_ Husband),0.411842,0.405178,0.36642,0.889709,2.195848,0.19955,5.393214
133,(relationship_ Husband),"(marital-status_ Married-civ-spouse, race_ White)",0.405178,0.411842,0.36642,0.904343,2.195848,0.19955,6.148624
144,"(native-country_ United-States, marital-status_ Married-civ-spouse)",(relationship_ Husband),0.410553,0.405178,0.363994,0.886595,2.188162,0.197647,5.245106
145,(relationship_ Husband),"(native-country_ United-States, marital-status_ Married-civ-spouse)",0.405178,0.410553,0.363994,0.898355,2.188162,0.197647,5.799091
21,(marital-status_ Married-civ-spouse),(relationship_ Husband),0.459937,0.405178,0.404902,0.880342,2.172729,0.218545,4.971013
20,(relationship_ Husband),(marital-status_ Married-civ-spouse),0.405178,0.459937,0.404902,0.999318,2.172729,0.218545,791.672741
137,"(relationship_ Husband, sex_ Male)",(marital-status_ Married-civ-spouse),0.405147,0.459937,0.404871,0.999318,2.172729,0.218529,791.612734
140,(marital-status_ Married-civ-spouse),"(relationship_ Husband, sex_ Male)",0.459937,0.405147,0.404871,0.880275,2.172729,0.218529,4.968497


#### 5 interesting rules & comments/reasoning
- (marital-status_ Married-civ-spouse, sex_ Male) -> (relationship_ Husband): Lift = 2.44
- (relationship_ Husband) -> (marital-status_ Married-civ-spouse, race_ White): Lift = 2.2

Above are the rules with the highest and fourth-highest lifts. They are two positively correlated rules that stand out to me as cases to be mindful of, and possibly to filter out for better and more efficient analysis. The first is the "obvious rule": one does not need data analysis to learn that married males are husbands. Catching such rules may simply come down to mindfulness.
However, I wonder whether filtering rules in situations like the coffee-tea example is possible and acceptable: dropping a rule if the consequent support is greater than the rule confidence, or if the consequent or antecedent support is greater than some well-reasoned percentage. This case reminds me of the interesting confidence-based rules from the previous prompt. Some consequent supports (such as US and white) were so large that I wondered how valuable the rules were.
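
One way the filtering idea above could look in practice is a cutoff on consequent support. This is only a sketch: `filter_rules` and the 0.8 cutoff are hypothetical choices for illustration, and the rule values are copied from output tables in this notebook.

```python
def filter_rules(rules, max_consequent_support=0.8):
    """Keep rules whose consequent is not already present in most records.

    `max_consequent_support` is a hypothetical cutoff chosen for illustration.
    """
    return [r for r in rules if r["consequent_support"] <= max_consequent_support]

# A few rules with values copied from the output tables in this notebook.
rules = [
    {"rule": "Husband -> Male", "consequent_support": 0.669205, "confidence": 0.999924},
    {"rule": "White -> United-States", "consequent_support": 0.895857, "confidence": 0.921089},
    {"rule": "United-States, Male -> White", "consequent_support": 0.854274, "confidence": 0.905839},
]

for r in filter_rules(rules):
    print(r["rule"])
```

Such a filter removes rules whose consequents are US or white, but the "obvious" Husband -> Male rule survives, so it addresses only the high-base-rate case.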

- (>50K, <=50K._ <=50K) -> (sex_ Male, race_ White): Lift = 0.898898
- (>50K, <=50K._ <=50K, workclass_ Private) -> (sex_ Male): Lift = 0.902248
- (>50K, <=50K._ <=50K) -> (native-country_ United-States, sex_ Male): Lift = 0.904962


The rules with the lowest lift (all below 1, indicating negative correlation) all involve income, and I find them interesting. Even though earning <= 50K has about 76% support, the most negatively correlated rules all include it. They suggest the following:
1. Those earning <= 50K are less likely than average to be white males
2. Those earning <= 50K are less likely than average to be US-native males
3. Those earning <= 50K in the private sector are less likely than average to be male
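
The lowest-lift table lists each rule in both directions with the same lift, because lift is symmetric in the antecedent and consequent. The sketch below checks this with the rounded supports from the first two rows of that table. The `lift` helper and variable names are my own.

```python
def lift(support_a, support_c, support_ac):
    """Lift of A -> C; below 1 means A and C co-occur less often than if independent."""
    return support_ac / (support_a * support_c)

# Supports copied from the first two rows of the lowest-lift table above.
support_low_income = 0.75919   # >50K, <=50K._ <=50K
support_white_male = 0.588864  # sex_ Male AND race_ White
support_both = 0.401861

forward = lift(support_low_income, support_white_male, support_both)
backward = lift(support_white_male, support_low_income, support_both)
expected_if_independent = support_low_income * support_white_male

print(f"lift in both directions: {forward:.6f}, {backward:.6f}")
print(f"observed {support_both:.3f} vs expected {expected_if_independent:.3f}")
```

The observed joint support falls below what independence would predict, which is also why the leverage column is negative for these rows.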

### Compare the top rules using the two interestingness measures for the same levels of support (use at least two different levels of support) and comment on your findings.

In [181]:
# Change minsup to 10%
freq_itemsets_new_support = apriori(df_encoded, min_support=0.1, use_colnames=True, max_len=3)
top_confidence_new_freq_itemsets = (
    association_rules(freq_itemsets_new_support, metric="confidence")
    .sort_values(by='confidence', ascending=False)
    .head(1)
)
top_confidence_new_freq_itemsets

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
212,"(relationship_ Husband, >50K, <=50K._ >50K)",(sex_ Male),0.181751,0.669205,0.181751,1.0,1.494309,0.060122,inf


In [184]:
top_lift_new_freq_itemsets = (
    association_rules(freq_itemsets_new_support, metric="lift")
    .sort_values(by='lift', ascending=False)
    .head(1)
)
top_lift_new_freq_itemsets

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
670,(relationship_ Own-child),"(>50K, <=50K._ <=50K, marital-status_ Never-married)",0.155646,0.313012,0.136697,0.878256,2.805817,0.087978,5.642873


When comparing the top rules under the confidence and lift interestingness measures with a minimum support of 10%, both top rules resemble the "obvious rule" case that I noted above. There is 100% confidence that husbands who make more than 50K are male. And the highest positive correlation is that children (if the Own-child feature descriptor simply means child) earn <= 50K and have never been married.

In [186]:
# Change minsup to 50%
freq_itemsets_new_support2 = apriori(df_encoded, min_support=0.5, use_colnames=True, max_len=3)
top_confidence_new_freq_itemsets2 = (
    association_rules(freq_itemsets_new_support2, metric="confidence")
    .sort_values(by='confidence', ascending=False)
    .head(1)
)
top_confidence_new_freq_itemsets2

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
4,(race_ White),(native-country_ United-States),0.854274,0.895857,0.786862,0.921089,1.028165,0.021555,1.319746


In [187]:
top_lift_new_freq_itemsets2 = (
    association_rules(freq_itemsets_new_support2, metric="lift")
    .sort_values(by='lift', ascending=False)
    .head(1)
)
top_lift_new_freq_itemsets2

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
22,"(native-country_ United-States, sex_ Male)",(race_ White),0.598507,0.854274,0.542152,0.905839,1.060362,0.030863,1.547639


When comparing the top rules under the confidence and lift interestingness measures with a higher minimum support (50%) than before, the top rules are more interesting. There is a strong co-occurrence between being white and having the US as one's native country (92% confidence). Yet it is still important to note that both the antecedent support and the consequent support (85% and nearly 90%, respectively) are quite high, nearly as high as the confidence. Similarly, the top rule by lift shows a positive, though weak (lift 1.06), correlation between being a male whose native country is the US and being white.

I find it interesting that the two rules have similar itemsets even when using different interestingness measures. Even the rules with a minsup of 10% had at least one common item in the rule itemsets (the feature related to earnings).
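
To illustrate why the set of candidate rules changes so much between support levels, the sketch below counts frequent itemsets on a toy dataset at three thresholds. It enumerates itemsets by brute force rather than with Apriori's pruned search, and the records and the `frequent_itemsets` helper are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_len=3):
    """Brute-force enumeration (not Apriori's pruned search) of frequent itemsets."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for k in range(1, max_len + 1):
        for candidate in combinations(items, k):
            s = sum(1 for t in transactions if set(candidate) <= t) / n
            if s >= min_support:
                result[candidate] = s
    return result

# Hypothetical records, invented for illustration.
transactions = [
    {"Private", "Male", "White"},
    {"Private", "Male", "White"},
    {"Private", "Female", "White"},
    {"State-gov", "Male", "White"},
    {"Private", "Female", "Black"},
]

for min_support in (0.2, 0.5, 0.8):
    found = frequent_itemsets(transactions, min_support)
    print(min_support, len(found))
```

Raising the threshold leaves only itemsets built from the most common values, which is consistent with the 50% rules above being dominated by United-States, White, and Male.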