## Using mlxtend for Association Rule Mining

Sixian Chen

11/20/2020

LinkedIn: http://linkedin.com/in/seashane-sixian-chen

Inspired by http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/.

------
### About the Dataset

The census dataset is provided as a CSV file and consists of the attributes age, sex, education, native-country, race, marital-status, workclass, occupation, hours-per-week, income, capital-gain, and capital-loss. The file census.csv contains exactly 30162 rows, and each row contains exactly 12 comma-separated values of the form attribute=value.
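To make the row layout concrete, here is a minimal sketch that parses one row of the form described above. The sample line reproduces the first record of the dataset; the helper name `parse_row` is hypothetical, and the sketch assumes there are no spaces after the commas.

```python
import csv
import io

# One row in the attribute=value layout (the first record of the dataset).
sample_line = (
    "age=Middle-aged,sex=Male,education=Bachelors,native-country=United-States,"
    "race=White,marital-status=Never-married,workclass=State-gov,"
    "occupation=Adm-clerical,hours-per-week=Full-time,income=Small,"
    "capital-gain=Low,capital-loss=None"
)

def parse_row(line):
    """Return the list of items in a row and an attribute -> value mapping."""
    items = next(csv.reader(io.StringIO(line)))
    # Split on the first '=' only, so a value containing '=' would stay intact.
    attributes = dict(item.split("=", 1) for item in items)
    return items, attributes

items, attributes = parse_row(sample_line)
print(len(items))
print(attributes["income"], attributes["capital-gain"])
```

For rule mining, the full `attribute=value` strings are kept as items, so that `capital-gain=None` and `capital-loss=None` remain distinct.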

----


In [1]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Do not truncate long cell contents (itemsets can be wide)
pd.set_option('display.max_colwidth', 0)

# The file has no header row; every row holds 12 attribute=value items
census_data = pd.read_csv("census.csv", header=None)

# Convert each row into a list of strings (one transaction per record)
records = census_data.astype(str).values.tolist()

In [2]:
print("This dataset has", len(records), "entries. \n") 

This dataset has 30162 entries. 



In [3]:
dfRecords = pd.DataFrame(records)

print("Let's preview the dataset.")
dfRecords.head(5)

Let's preview the dataset.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,age=Middle-aged,sex=Male,education=Bachelors,native-country=United-States,race=White,marital-status=Never-married,workclass=State-gov,occupation=Adm-clerical,hours-per-week=Full-time,income=Small,capital-gain=Low,capital-loss=None
1,age=Senior,sex=Male,education=Bachelors,native-country=United-States,race=White,marital-status=Married-civ-spouse,workclass=Self-emp-not-inc,occupation=Exec-managerial,hours-per-week=Part-time,income=Small,capital-gain=None,capital-loss=None
2,age=Middle-aged,sex=Male,education=HS-grad,native-country=United-States,race=White,marital-status=Divorced,workclass=Private,occupation=Handlers-cleaners,hours-per-week=Full-time,income=Small,capital-gain=None,capital-loss=None
3,age=Senior,sex=Male,education=11th,native-country=United-States,race=Black,marital-status=Married-civ-spouse,workclass=Private,occupation=Handlers-cleaners,hours-per-week=Full-time,income=Small,capital-gain=None,capital-loss=None
4,age=Middle-aged,sex=Female,education=Bachelors,native-country=Cuba,race=Black,marital-status=Married-civ-spouse,workclass=Private,occupation=Prof-specialty,hours-per-week=Full-time,income=Small,capital-gain=None,capital-loss=None
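Each previewed row is one transaction of `attribute=value` items. Frequent-pattern miners typically work on a boolean matrix with one column per distinct item, which is what the `TransactionEncoder` step in the next cell produces. The following is a pure-Python sketch of that idea, not mlxtend's implementation; `one_hot_encode` is a hypothetical helper and the transactions are shortened versions of rows 0, 2, and 4 above.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted distinct items and one boolean row per transaction."""
    columns = sorted({item for transaction in transactions for item in transaction})
    rows = []
    for transaction in transactions:
        present = set(transaction)
        rows.append([column in present for column in columns])
    return columns, rows

# Shortened transactions built from items in the preview.
transactions = [
    ["sex=Male", "income=Small", "workclass=State-gov"],
    ["sex=Male", "income=Small", "workclass=Private"],
    ["sex=Female", "income=Small", "workclass=Private"],
]

columns, rows = one_hot_encode(transactions)
print(columns)
for row in rows:
    print(row)
```

The support of an itemset is then the fraction of rows in which all of its columns are `True`.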


In [4]:
from mlxtend.frequent_patterns import association_rules

# One-hot encode the transactions: one boolean column per attribute=value item
te = TransactionEncoder()
te_ary = te.fit(records).transform(records)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Mine the itemsets that occur in at least 30% of the records
frequent_comb = fpgrowth(df, min_support=0.3, use_colnames=True)

# Derive rules from the frequent itemsets and sort them by confidence
aRule = association_rules(frequent_comb, metric="confidence")
aRule = aRule.sort_values('confidence', ascending=False)
# Antecedent count + 1: equals the total rule length only for single-item consequents
aRule['length'] = aRule['antecedents'].apply(len) + 1

print("The following dataframe shows the statistics of the association rules, sorted by confidence.")
aRule.head(7)

The following dataframe shows the statistics of the association rules, sorted by confidence.


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,length
440,"(income=Small, workclass=Private, hours-per-week=Full-time)",(capital-loss=None),0.370101,0.952689,0.360221,0.973305,1.02164,0.00763,1.772264,4
456,"(income=Small, workclass=Private, native-country=United-States, hours-per-week=Full-time)",(capital-loss=None),0.326702,0.952689,0.317651,0.972296,1.02058,0.006406,1.707709,5
451,"(income=Small, workclass=Private, capital-gain=None, hours-per-week=Full-time)",(capital-loss=None),0.356044,0.952689,0.346164,0.972251,1.020533,0.006965,1.704949,5
768,"(income=Small, workclass=Private)",(capital-loss=None),0.577216,0.952689,0.560772,0.971511,1.019757,0.010864,1.660661,3
465,"(native-country=United-States, workclass=Private, hours-per-week=Full-time, income=Small, capital-gain=None)",(capital-loss=None),0.314004,0.952689,0.304953,0.971175,1.019404,0.005805,1.641334,6
582,(marital-status=Never-married),(capital-loss=None),0.322459,0.952689,0.31301,0.970697,1.018903,0.005807,1.614556,2
314,"(income=Small, hours-per-week=Full-time)",(capital-loss=None),0.469001,0.952689,0.455109,0.97038,1.01857,0.008297,1.597289,3
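The metric columns in the table above follow from three supports per rule. As a sanity check, the sketch below recomputes them for the first row (index 440) using the standard definitions of confidence, lift, leverage, and conviction; `rule_metrics` is a hypothetical helper, and because the inputs are the rounded values printed in the table, the results agree only to about four decimal places.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Compute the usual rule metrics from the three support values."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # Conviction is infinite when the rule never fails (confidence == 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }

# Values copied from row 440 of the table above.
metrics = rule_metrics(
    antecedent_support=0.370101,
    consequent_support=0.952689,
    support=0.360221,
)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that `capital-loss=None` already has a support of about 0.95 on its own, which is why a confidence of 0.97 corresponds to a lift only slightly above 1.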


In [5]:
def itemset_to_str(itemset):
    # str(frozenset({...})) looks like "frozenset({...})"; keep only the "{...}" part.
    # Strings pass through unchanged so the cell can be re-run safely.
    if isinstance(itemset, str):
        return itemset
    return str(itemset)[len("frozenset("):-1]

aRule['antecedents'] = aRule['antecedents'].map(itemset_to_str)
aRule['consequents'] = aRule['consequents'].map(itemset_to_str)

# Readable "antecedents=>consequents" label for each rule
aRule['association'] = aRule['antecedents'] + "=>" + aRule['consequents']

aRuleAsso = aRule[['association', 'confidence', 'support', 'lift', 'length']]

print("The 10 association rules with the highest confidence:")
aRuleAsso.head(10)

The 10 association rules with the highest confidence:


Unnamed: 0,association,confidence,support,lift,length
440,"{'income=Small', 'workclass=Private', 'hours-per-week=Full-time'}=>{'capital-loss=None'}",0.973305,0.360221,1.02164,4
456,"{'income=Small', 'workclass=Private', 'native-country=United-States', 'hours-per-week=Full-time'}=>{'capital-loss=None'}",0.972296,0.317651,1.02058,5
451,"{'income=Small', 'workclass=Private', 'capital-gain=None', 'hours-per-week=Full-time'}=>{'capital-loss=None'}",0.972251,0.346164,1.020533,5
768,"{'income=Small', 'workclass=Private'}=>{'capital-loss=None'}",0.971511,0.560772,1.019757,3
465,"{'native-country=United-States', 'workclass=Private', 'hours-per-week=Full-time', 'income=Small', 'capital-gain=None'}=>{'capital-loss=None'}",0.971175,0.304953,1.019404,6
582,{'marital-status=Never-married'}=>{'capital-loss=None'},0.970697,0.31301,1.018903,2
314,"{'income=Small', 'hours-per-week=Full-time'}=>{'capital-loss=None'}",0.97038,0.455109,1.01857,3
777,"{'income=Small', 'workclass=Private', 'native-country=United-States'}=>{'capital-loss=None'}",0.970373,0.501691,1.018563,4
773,"{'income=Small', 'workclass=Private', 'capital-gain=None'}=>{'capital-loss=None'}",0.970346,0.538094,1.018534,4
792,"{'income=Small', 'workclass=Private', 'race=White'}=>{'capital-loss=None'}",0.970211,0.471885,1.018393,4
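The rules listed above come from two steps: finding itemsets whose support reaches a minimum threshold, then keeping the rules whose confidence reaches a second threshold. The sketch below walks through both steps on a toy dataset by brute-force enumeration. It is not FP-Growth (which avoids enumerating candidates by using a tree structure) and not mlxtend's code; the function names, the toy transactions, and the thresholds of 0.5 and 0.8 are assumptions chosen for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute force: map every itemset with support >= min_support to its support."""
    sets = [frozenset(t) for t in transactions]
    items = sorted(set().union(*sets))
    found = {}
    for size in range(1, len(items) + 1):
        level = {}
        for candidate in combinations(items, size):
            candidate = frozenset(candidate)
            support = sum(candidate <= t for t in sets) / len(sets)
            if support >= min_support:
                level[candidate] = support
        if not level:
            break  # no frequent itemset of this size, so none of any larger size
        found.update(level)
    return found

def rules_from_itemsets(itemsets, min_confidence):
    """Split each frequent itemset into antecedent => consequent and filter by confidence."""
    rules = []
    for itemset, support in itemsets.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                antecedent = frozenset(antecedent)
                # Every subset of a frequent itemset is frequent, so the lookup succeeds.
                confidence = support / itemsets[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, support, confidence))
    return sorted(rules, key=lambda rule: rule[3], reverse=True)

transactions = [
    ["income=Small", "workclass=Private", "capital-loss=None"],
    ["income=Small", "workclass=Private", "capital-loss=None"],
    ["income=Small", "workclass=State-gov", "capital-loss=None"],
    ["income=Large", "workclass=Private", "capital-loss=High"],
]

itemsets = frequent_itemsets(transactions, min_support=0.5)
rules = rules_from_itemsets(itemsets, min_confidence=0.8)
for antecedent, consequent, support, confidence in rules:
    print(sorted(antecedent), "=>", sorted(consequent), support, confidence)
```

Brute force is exponential in the number of distinct items, so it is only practical for tiny examples like this one.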


In [6]:
print("The association rule", aRuleAsso.head(1)["association"].values, "has the highest confidence.\n")

The association rule ["{'income=Small', 'workclass=Private', 'hours-per-week=Full-time'}=>{'capital-loss=None'}"] has the highest confidence.



-----
### Which antecedents are associated with a high capital gain?
Let's focus only on the records that have a high capital gain.

In [7]:
HighCapGain = dfRecords[dfRecords[10]=='capital-gain=High']
print("We have", len(HighCapGain), "records with high capital gain.\n")

We have 1090 records with high capital gain.



In [8]:
from mlxtend.frequent_patterns import association_rules

# Turn the filtered DataFrame back into a list of transactions
HighCapGain = HighCapGain.values.tolist()

# One-hot encode only the high-capital-gain records
te = TransactionEncoder()
te_ary = te.fit(HighCapGain).transform(HighCapGain)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Mine the itemsets that occur in at least 30% of this subset
frequent_comb = fpgrowth(df, min_support=0.3, use_colnames=True)

# Derive rules from the frequent itemsets and sort them by confidence
aRule = association_rules(frequent_comb, metric="confidence")
aRule = aRule.sort_values('confidence', ascending=False)
# Antecedent count + 1: equals the total rule length only for single-item consequents
aRule['length'] = aRule['antecedents'].apply(len) + 1

print("The following dataframe shows the statistics of the association rules, sorted by confidence.")

aRule.head(10)

The following dataframe shows the statistics of the association rules, sorted by confidence.


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,length
0,(capital-gain=High),(capital-loss=None),1.0,1.0,1.0,1.0,1.0,0.0,inf,2
2511,"(income=Large, native-country=United-States, sex=Male, hours-per-week=Over-time)",(capital-loss=None),0.397248,1.0,0.397248,1.0,1.0,0.0,inf,5
2315,"(workclass=Private, native-country=United-States, hours-per-week=Over-time)","(capital-gain=High, capital-loss=None)",0.3,1.0,0.3,1.0,1.0,0.0,inf,4
2313,"(capital-gain=High, workclass=Private, native-country=United-States, hours-per-week=Over-time)",(capital-loss=None),0.3,1.0,0.3,1.0,1.0,0.0,inf,5
2312,"(workclass=Private, native-country=United-States, capital-loss=None, hours-per-week=Over-time)",(capital-gain=High),0.3,1.0,0.3,1.0,1.0,0.0,inf,5
2309,"(workclass=Private, native-country=United-States, hours-per-week=Over-time)",(capital-loss=None),0.3,1.0,0.3,1.0,1.0,0.0,inf,4
2307,"(workclass=Private, native-country=United-States, hours-per-week=Over-time)",(capital-gain=High),0.3,1.0,0.3,1.0,1.0,0.0,inf,4
2304,"(workclass=Private, hours-per-week=Over-time, income=Large)","(capital-gain=High, capital-loss=None)",0.317431,1.0,0.317431,1.0,1.0,0.0,inf,4
2301,"(capital-gain=High, workclass=Private, hours-per-week=Over-time, income=Large)",(capital-loss=None),0.317431,1.0,0.317431,1.0,1.0,0.0,inf,5
2300,"(income=Large, workclass=Private, hours-per-week=Over-time, capital-loss=None)",(capital-gain=High),0.317431,1.0,0.317431,1.0,1.0,0.0,inf,5
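Every record in this subset contains `capital-gain=High`, and the table shows that `capital-loss=None` also has a support of 1.0. Any rule with one of these consequents therefore has confidence 1.0 and lift 1.0 by construction, and support is the only column that distinguishes such rules. The toy sketch below illustrates the effect; the records and the helper `support` are made up for this illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

# Toy subset in which every record contains capital-gain=High,
# mirroring the filtered data used in this section.
subset = [
    ["capital-gain=High", "income=Large", "sex=Male"],
    ["capital-gain=High", "income=Large", "sex=Female"],
    ["capital-gain=High", "income=Small", "sex=Male"],
    ["capital-gain=High", "income=Large", "sex=Male"],
]

consequent = ["capital-gain=High"]
results = []
for antecedent in (["income=Large"], ["sex=Male"], ["income=Large", "sex=Male"]):
    rule_support = support(subset, antecedent + consequent)
    confidence = rule_support / support(subset, antecedent)
    lift = confidence / support(subset, consequent)
    results.append((antecedent, rule_support, confidence, lift))
    print(antecedent, "=>", consequent, rule_support, confidence, lift)
```

In this setting, a rule's support is simply the share of high-capital-gain records that match the antecedent.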


In [9]:
# Renumber the rules 0..n-1 in their confidence-sorted order
aRule = aRule.reset_index(drop=True)

def itemset_to_str(itemset):
    # str(frozenset({...})) looks like "frozenset({...})"; keep only the "{...}" part.
    # Strings pass through unchanged so the cell can be re-run safely.
    if isinstance(itemset, str):
        return itemset
    return str(itemset)[len("frozenset("):-1]

aRule['antecedents'] = aRule['antecedents'].map(itemset_to_str)
aRule['consequents'] = aRule['consequents'].map(itemset_to_str)

# Readable "antecedents=>consequents" label for each rule
aRule['association'] = aRule['antecedents'] + "=>" + aRule['consequents']

aRuleAsso = aRule[['association', 'confidence', 'support', 'lift', 'length']]

print("The 10 association rules with the highest confidence:")
aRuleAsso.head(10)

The 10 association rules with the highest confidence:


Unnamed: 0,association,confidence,support,lift,length
0,{'capital-gain=High'}=>{'capital-loss=None'},1.0,1.0,1.0,2
1,"{'income=Large', 'native-country=United-States', 'sex=Male', 'hours-per-week=Over-time'}=>{'capital-loss=None'}",1.0,0.397248,1.0,5
2,"{'workclass=Private', 'native-country=United-States', 'hours-per-week=Over-time'}=>{'capital-gain=High', 'capital-loss=None'}",1.0,0.3,1.0,4
3,"{'capital-gain=High', 'workclass=Private', 'native-country=United-States', 'hours-per-week=Over-time'}=>{'capital-loss=None'}",1.0,0.3,1.0,5
4,"{'workclass=Private', 'native-country=United-States', 'capital-loss=None', 'hours-per-week=Over-time'}=>{'capital-gain=High'}",1.0,0.3,1.0,5
5,"{'workclass=Private', 'native-country=United-States', 'hours-per-week=Over-time'}=>{'capital-loss=None'}",1.0,0.3,1.0,4
6,"{'workclass=Private', 'native-country=United-States', 'hours-per-week=Over-time'}=>{'capital-gain=High'}",1.0,0.3,1.0,4
7,"{'workclass=Private', 'hours-per-week=Over-time', 'income=Large'}=>{'capital-gain=High', 'capital-loss=None'}",1.0,0.317431,1.0,4
8,"{'capital-gain=High', 'workclass=Private', 'hours-per-week=Over-time', 'income=Large'}=>{'capital-loss=None'}",1.0,0.317431,1.0,5
9,"{'income=Large', 'workclass=Private', 'hours-per-week=Over-time', 'capital-loss=None'}=>{'capital-gain=High'}",1.0,0.317431,1.0,5
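The `length` column is computed as the number of antecedent items plus one, so it matches the total number of items only when the consequent holds a single item. Row 2 above, for example, has three antecedents and two consequents but a length of 4. A minimal sketch that counts both sides is shown below; `rule_length` is a hypothetical helper, and in pandas the same idea would need to run on the frozenset columns before they are converted to strings.

```python
def rule_length(antecedents, consequents):
    """Total number of items in a rule, counting both sides."""
    return len(antecedents) + len(consequents)

# Row 2 of the table above: three antecedents, two consequents.
antecedents = frozenset(
    {"workclass=Private", "native-country=United-States", "hours-per-week=Over-time"}
)
consequents = frozenset({"capital-gain=High", "capital-loss=None"})

total = rule_length(antecedents, consequents)
print(total)
```

The later filtering keeps only rules whose consequent is exactly `capital-gain=High`, so for those rules both definitions give the same length.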


In [10]:
value = int(input("Enter the total length of the association rule: "))

Enter the total length of the association rule: 3


In [11]:
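# Keep the rules of the requested length whose only consequent is capital-gain=High
aRule = aRule[aRule['length'] == value]
aRule = aRule[aRule['consequents'] == "{'capital-gain=High'}"]
# All of these rules have confidence 1.0 in this subset, so rank them by support
aRule = aRule.sort_values(by=['support'], ascending=False)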

In [12]:
aRuleAsso = aRule[['association','confidence','support','lift','length']]
print("The top 10 association rules by support:")
aRuleAsso.head(10)

The top 10 association rules by support:


Unnamed: 0,association,confidence,support,lift,length
681,"{'income=Large', 'capital-loss=None'}=>{'capital-gain=High'}",1.0,0.983486,1.0,3
479,"{'native-country=United-States', 'capital-loss=None'}=>{'capital-gain=High'}",1.0,0.937615,1.0,3
511,"{'income=Large', 'native-country=United-States'}=>{'capital-gain=High'}",1.0,0.922018,1.0,3
641,"{'race=White', 'capital-loss=None'}=>{'capital-gain=High'}",1.0,0.9,1.0,3
572,"{'income=Large', 'race=White'}=>{'capital-gain=High'}",1.0,0.884404,1.0,3
591,"{'race=White', 'native-country=United-States'}=>{'capital-gain=High'}",1.0,0.863303,1.0,3
725,"{'capital-loss=None', 'sex=Male'}=>{'capital-gain=High'}",1.0,0.814679,1.0,3
717,"{'income=Large', 'sex=Male'}=>{'capital-gain=High'}",1.0,0.804587,1.0,3
711,"{'native-country=United-States', 'sex=Male'}=>{'capital-gain=High'}",1.0,0.761468,1.0,3
855,"{'race=White', 'sex=Male'}=>{'capital-gain=High'}",1.0,0.741284,1.0,3


In [13]:
print("\nFrom the association rule table above, we can see that people with a high capital gain usually have the following top 5 (sets of) characteristics, sorted by rule support, when the length of the association rule is", value, ".\n")
print(*aRule['antecedents'].head(5).values, sep='\n')


From the association rule table above, we can see that people with a high capital gain usually have the following top 5 (sets of) characteristics, sorted by rule support, when the length of the association rule is 3 .

{'income=Large', 'capital-loss=None'}
{'native-country=United-States', 'capital-loss=None'}
{'income=Large', 'native-country=United-States'}
{'race=White', 'capital-loss=None'}
{'income=Large', 'race=White'}
