## Using mlxtend for Association Rule Mining

Sixian Chen

LinkedIn: http://linkedin.com/in/seashane-sixian-chen

Inspired by http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/.

------
### About the Dataset
The census dataset, provided as a CSV file, has the attributes age, sex, education, native-country, race, marital-status, workclass, occupation, hours-per-week, income, capital-gain, and capital-loss. The file census.csv contains exactly 30162 rows, and each row contains exactly 12 comma-separated values of the form attribute=value.

Task: Rearrange the given set of rules X->Y in descending order of confidence. It is guaranteed that no two rules have the same confidence. Also, the support of each of the attribute sets X and Y in every rule is greater than or equal to 0.3.
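
As a reminder of the two quantities the task relies on: the support of an itemset is the fraction of rows that contain it, and the confidence of X->Y is the support of X and Y together divided by the support of X. Here is a minimal sketch in plain Python. The four rows are made up (they are not taken from census.csv), and `support` and `confidence` are illustrative helper names, not mlxtend functions.

```python
# Toy transactions in the same attribute=value style as census.csv (made-up rows).
transactions = [
    {"sex=Male", "income=Small", "capital-loss=None"},
    {"sex=Male", "income=Large", "capital-loss=None"},
    {"sex=Female", "income=Small", "capital-loss=None"},
    {"sex=Male", "income=Small", "capital-loss=High"},
]

def support(itemset, rows):
    """Fraction of rows that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= row for row in rows) / len(rows)

def confidence(x, y, rows):
    """Confidence of the rule X -> Y: support(X and Y together) / support(X)."""
    return support(set(x) | set(y), rows) / support(x, rows)

sup_x = support({"income=Small"}, transactions)
conf_xy = confidence({"income=Small"}, {"capital-loss=None"}, transactions)
print(sup_x, conf_xy)
```

Here three of the four rows contain income=Small, and two of those three also contain capital-loss=None.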


In [1913]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpmax, fpgrowth
pd.set_option('display.max_colwidth',0)

census_data = pd.read_csv("C:/Users/jason/Desktop/census.csv", header = None)
num_records = len(census_data)

# Each row becomes one transaction: a list of its 12 attribute=value strings
records = census_data.astype(str).values.tolist()

In [1914]:
print("This dataset has", len(records), "entries. \n") 

This dataset has 30162 entries. 



In [1915]:
dfRecords = pd.DataFrame(records)

print("Let's preview the dataset.")
dfRecords.head(5)

Let's preview the dataset.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,age=Middle-aged,sex=Male,education=Bachelors,native-country=United-States,race=White,marital-status=Never-married,workclass=State-gov,occupation=Adm-clerical,hours-per-week=Full-time,income=Small,capital-gain=Low,capital-loss=None
1,age=Senior,sex=Male,education=Bachelors,native-country=United-States,race=White,marital-status=Married-civ-spouse,workclass=Self-emp-not-inc,occupation=Exec-managerial,hours-per-week=Part-time,income=Small,capital-gain=None,capital-loss=None
2,age=Middle-aged,sex=Male,education=HS-grad,native-country=United-States,race=White,marital-status=Divorced,workclass=Private,occupation=Handlers-cleaners,hours-per-week=Full-time,income=Small,capital-gain=None,capital-loss=None
3,age=Senior,sex=Male,education=11th,native-country=United-States,race=Black,marital-status=Married-civ-spouse,workclass=Private,occupation=Handlers-cleaners,hours-per-week=Full-time,income=Small,capital-gain=None,capital-loss=None
4,age=Middle-aged,sex=Female,education=Bachelors,native-country=Cuba,race=Black,marital-status=Married-civ-spouse,workclass=Private,occupation=Prof-specialty,hours-per-week=Full-time,income=Small,capital-gain=None,capital-loss=None
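
Each previewed row is a list of attribute=value items. Frequent-pattern miners expect a boolean table with one column per distinct item, which is what a transaction encoder produces. The sketch below imitates that one-hot step in plain Python on two made-up rows. The names `toy_records`, `columns`, and `encoded` are illustrative, and this is not mlxtend's actual implementation.

```python
# Two made-up transactions in attribute=value form (not rows from census.csv).
toy_records = [
    ["age=Senior", "sex=Male", "income=Small"],
    ["age=Middle-aged", "sex=Male", "income=Large"],
]

# Column order: every distinct item, sorted alphabetically.
columns = sorted({item for record in toy_records for item in record})

# One boolean row per transaction: True where the item is present.
encoded = [[col in set(record) for col in columns] for record in toy_records]

print(columns)
for row in encoded:
    print(row)
```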


In [1916]:
te = TransactionEncoder()
te_ary = te.fit(records).transform(records)
df = pd.DataFrame(te_ary, columns=te.columns_)

frequent_comb = fpgrowth(df, min_support=0.3, use_colnames=True)

from mlxtend.frequent_patterns import association_rules

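aRule = association_rules(frequent_comb, metric="confidence")  # default min_threshold is 0.8
aRule.sort_values('confidence', ascending=False, inplace=True)  # sort rules by confidence, highest first
# Rule length counted as antecedent items plus one, i.e. assuming a single-item consequent
aRule['length'] = aRule['antecedents'].apply(len) + 1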


print("\nThe following dataframe shows the statistics of the association rules, sorted by confidence.")
aRule.head(7)


The following dataframe shows the statistics of the association rules, sorted by confidence.


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,length
441,"(hours-per-week=Full-time, workclass=Private, income=Small)",(capital-loss=None),0.370101,0.952689,0.360221,0.973305,1.02164,0.00763,1.772264,4
455,"(hours-per-week=Full-time, workclass=Private, native-country=United-States, income=Small)",(capital-loss=None),0.326702,0.952689,0.317651,0.972296,1.02058,0.006406,1.707709,5
449,"(hours-per-week=Full-time, capital-gain=None, workclass=Private, income=Small)",(capital-loss=None),0.356044,0.952689,0.346164,0.972251,1.020533,0.006965,1.704949,5
768,"(income=Small, workclass=Private)",(capital-loss=None),0.577216,0.952689,0.560772,0.971511,1.019757,0.010864,1.660661,3
463,"(hours-per-week=Full-time, capital-gain=None, workclass=Private, income=Small, native-country=United-States)",(capital-loss=None),0.314004,0.952689,0.304953,0.971175,1.019404,0.005805,1.641334,6
582,(marital-status=Never-married),(capital-loss=None),0.322459,0.952689,0.31301,0.970697,1.018903,0.005807,1.614556,2
315,"(hours-per-week=Full-time, income=Small)",(capital-loss=None),0.469001,0.952689,0.455109,0.97038,1.01857,0.008297,1.597289,3
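
The metric columns in this table all derive from three numbers: the antecedent support, the consequent support, and the support of the whole rule. As a sanity check, the sketch below recomputes confidence, lift, leverage, and conviction for the first row (rule 441) using the standard definitions. The inputs are the rounded values printed above, so the results agree with the table only up to rounding.

```python
# Values copied from the first row of the table above (rule 441).
antecedent_support = 0.370101
consequent_support = 0.952689
rule_support = 0.360221  # support of antecedent and consequent together

confidence = rule_support / antecedent_support
lift = confidence / consequent_support
leverage = rule_support - antecedent_support * consequent_support
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 4), round(lift, 4), round(leverage, 4), round(conviction, 2))
```

A lift this close to 1 indicates that the antecedent barely changes how likely capital-loss=None is. The confidence is high mainly because capital-loss=None already appears in about 95% of all rows.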


In [1917]:
# Render each frozenset as a readable string such as "{'a', 'b'}"
to_text = lambda items: "{" + ", ".join(map(repr, items)) + "}"
aRule['antecedents'] = aRule['antecedents'].map(to_text)
aRule['consequents'] = aRule['consequents'].map(to_text)

aRule['association'] = aRule['antecedents']+"=>"+aRule['consequents']

aRuleAsso = aRule[['association','confidence','support','lift','length']]

print("The top 10 association rules that are most likely to happen:")
aRuleAsso.head(10)

The top 10 association rules that are most likely to happen:


Unnamed: 0,association,confidence,support,lift,length
441,"{'hours-per-week=Full-time', 'workclass=Private', 'income=Small'}=>{'capital-loss=None'}",0.973305,0.360221,1.02164,4
455,"{'hours-per-week=Full-time', 'workclass=Private', 'native-country=United-States', 'income=Small'}=>{'capital-loss=None'}",0.972296,0.317651,1.02058,5
449,"{'hours-per-week=Full-time', 'capital-gain=None', 'workclass=Private', 'income=Small'}=>{'capital-loss=None'}",0.972251,0.346164,1.020533,5
768,"{'income=Small', 'workclass=Private'}=>{'capital-loss=None'}",0.971511,0.560772,1.019757,3
463,"{'hours-per-week=Full-time', 'capital-gain=None', 'workclass=Private', 'income=Small', 'native-country=United-States'}=>{'capital-loss=None'}",0.971175,0.304953,1.019404,6
582,{'marital-status=Never-married'}=>{'capital-loss=None'},0.970697,0.31301,1.018903,2
315,"{'hours-per-week=Full-time', 'income=Small'}=>{'capital-loss=None'}",0.97038,0.455109,1.01857,3
778,"{'income=Small', 'workclass=Private', 'native-country=United-States'}=>{'capital-loss=None'}",0.970373,0.501691,1.018563,4
773,"{'income=Small', 'capital-gain=None', 'workclass=Private'}=>{'capital-loss=None'}",0.970346,0.538094,1.018534,4
792,"{'race=White', 'workclass=Private', 'income=Small'}=>{'capital-loss=None'}",0.970211,0.471885,1.018393,4


In [1918]:
best_rule = aRuleAsso.head(1)["association"].values
print("The association rule", best_rule, "has the highest confidence.\n")

The association rule ["{'hours-per-week=Full-time', 'workclass=Private', 'income=Small'}=>{'capital-loss=None'}"] has the highest confidence.



### Which antecedents are associated with a high capital gain?
Let's focus only on the records that have a high capital gain.

In [1919]:
HighCapGain = dfRecords[dfRecords[10]=='capital-gain=High']
print("We have", len(HighCapGain), "records with a high capital gain.\n")

We have 1090 records with a high capital gain.



In [1920]:
# Convert the filtered rows back into a list of transactions
HighCapGain = HighCapGain.values.tolist()

te = TransactionEncoder()
te_ary = te.fit(HighCapGain).transform(HighCapGain)
df = pd.DataFrame(te_ary, columns=te.columns_)

frequent_comb = fpgrowth(df, min_support=0.3, use_colnames=True)

from mlxtend.frequent_patterns import association_rules

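aRule = association_rules(frequent_comb, metric="confidence")  # default min_threshold is 0.8
aRule.sort_values('confidence', ascending=False, inplace=True)  # sort rules by confidence, highest first
# Rule length counted as antecedent items plus one, i.e. assuming a single-item consequent
aRule['length'] = aRule['antecedents'].apply(len) + 1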

print("\nThe following dataframe shows the statistics of the association rules, sorted by confidence.")

aRule.head(10)
#aRule.dtypes


The following dataframe shows the statistics of the association rules, sorted by confidence.


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,length
0,(capital-loss=None),(capital-gain=High),1.0,1.0,1.0,1.0,1.0,0.0,inf,2
2509,"(income=Large, hours-per-week=Over-time, native-country=United-States, sex=Male)",(capital-loss=None),0.397248,1.0,0.397248,1.0,1.0,0.0,inf,5
2316,"(hours-per-week=Over-time, native-country=United-States, workclass=Private)","(capital-loss=None, capital-gain=High)",0.3,1.0,0.3,1.0,1.0,0.0,inf,4
2314,"(capital-loss=None, hours-per-week=Over-time, native-country=United-States, workclass=Private)",(capital-gain=High),0.3,1.0,0.3,1.0,1.0,0.0,inf,5
2312,"(hours-per-week=Over-time, native-country=United-States, workclass=Private, capital-gain=High)",(capital-loss=None),0.3,1.0,0.3,1.0,1.0,0.0,inf,5
2310,"(hours-per-week=Over-time, native-country=United-States, workclass=Private)",(capital-loss=None),0.3,1.0,0.3,1.0,1.0,0.0,inf,4
2306,"(hours-per-week=Over-time, native-country=United-States, workclass=Private)",(capital-gain=High),0.3,1.0,0.3,1.0,1.0,0.0,inf,4
2303,"(income=Large, hours-per-week=Over-time, workclass=Private)","(capital-loss=None, capital-gain=High)",0.317431,1.0,0.317431,1.0,1.0,0.0,inf,4
2301,"(capital-loss=None, income=Large, hours-per-week=Over-time, workclass=Private)",(capital-gain=High),0.317431,1.0,0.317431,1.0,1.0,0.0,inf,5
2299,"(income=Large, hours-per-week=Over-time, workclass=Private, capital-gain=High)",(capital-loss=None),0.317431,1.0,0.317431,1.0,1.0,0.0,inf,5


In [1921]:
aRule = aRule.reset_index()
# Render each frozenset as a readable string such as "{'a', 'b'}"
to_text = lambda items: "{" + ", ".join(map(repr, items)) + "}"
aRule['antecedents'] = aRule['antecedents'].map(to_text)
aRule['consequents'] = aRule['consequents'].map(to_text)

aRule['association'] = aRule['antecedents']+"=>"+aRule['consequents']

aRuleAsso = aRule[['association','confidence','support','lift','length']]

print("\nThe top 10 association rules that are most likely to happen:")
aRuleAsso.head(10)

#aRule.dtypes


The top 10 association rules that are most likely to happen:


Unnamed: 0,association,confidence,support,lift,length
0,{'capital-loss=None'}=>{'capital-gain=High'},1.0,1.0,1.0,2
1,"{'income=Large', 'hours-per-week=Over-time', 'native-country=United-States', 'sex=Male'}=>{'capital-loss=None'}",1.0,0.397248,1.0,5
2,"{'hours-per-week=Over-time', 'native-country=United-States', 'workclass=Private'}=>{'capital-loss=None', 'capital-gain=High'}",1.0,0.3,1.0,4
3,"{'capital-loss=None', 'hours-per-week=Over-time', 'native-country=United-States', 'workclass=Private'}=>{'capital-gain=High'}",1.0,0.3,1.0,5
4,"{'hours-per-week=Over-time', 'native-country=United-States', 'workclass=Private', 'capital-gain=High'}=>{'capital-loss=None'}",1.0,0.3,1.0,5
5,"{'hours-per-week=Over-time', 'native-country=United-States', 'workclass=Private'}=>{'capital-loss=None'}",1.0,0.3,1.0,4
6,"{'hours-per-week=Over-time', 'native-country=United-States', 'workclass=Private'}=>{'capital-gain=High'}",1.0,0.3,1.0,4
7,"{'income=Large', 'hours-per-week=Over-time', 'workclass=Private'}=>{'capital-loss=None', 'capital-gain=High'}",1.0,0.317431,1.0,4
8,"{'capital-loss=None', 'income=Large', 'hours-per-week=Over-time', 'workclass=Private'}=>{'capital-gain=High'}",1.0,0.317431,1.0,5
9,"{'income=Large', 'hours-per-week=Over-time', 'workclass=Private', 'capital-gain=High'}=>{'capital-loss=None'}",1.0,0.317431,1.0,5
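
Every rule in this table has confidence 1.0 and lift 1.0. That follows from the filtering step: all 1090 rows contain capital-gain=High (and, as the first rule's antecedent support of 1.0 shows, capital-loss=None), so any rule with those consequents holds in every row by construction. Within this subset, support is the informative column. The sketch below illustrates the effect on five made-up rows (not taken from census.csv), with illustrative helper names `support` and `confidence`.

```python
# Made-up rows for illustration (not taken from census.csv).
rows = [
    {"capital-gain=High", "income=Large"},
    {"capital-gain=High", "income=Small"},
    {"capital-gain=None", "income=Large"},
    {"capital-gain=None", "income=Small"},
    {"capital-gain=None", "income=Small"},
]

def support(itemset, data):
    """Fraction of rows in `data` that contain every item in `itemset`."""
    return sum(itemset <= row for row in data) / len(data)

def confidence(x, y, data):
    """Confidence of X -> Y: support(X and Y together) / support(X)."""
    return support(x | y, data) / support(x, data)

x, y = {"income=Large"}, {"capital-gain=High"}

# On all rows the rule carries information: half of the Large earners have a high gain.
conf_all = confidence(x, y, rows)

# After keeping only the high-gain rows, the consequent is in every row,
# so the confidence of any rule X -> capital-gain=High is forced to 1.0.
high_only = [row for row in rows if "capital-gain=High" in row]
conf_subset = confidence(x, y, high_only)

print(conf_all, conf_subset)
```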


In [1922]:
value = int(input("Enter the total length of the association rule: "))

Enter the total length of the association rule: 3


In [1923]:
aRule = aRule[aRule['length']==value]
aRule = aRule[aRule['consequents']=="{'capital-gain=High'}"]
aRule = aRule.sort_values(by=['support'])
#aRule

In [1924]:
aRule['association'] = aRule['antecedents']+"=>"+aRule['consequents']
aRuleAsso = aRule[['association','confidence','support','lift','length']]
print("\nThe first 10 rules of the selected length, sorted by ascending support:")
aRuleAsso.head(10)


The first 10 rules of the selected length, sorted by ascending support:


Unnamed: 0,association,confidence,support,lift,length
642,"{'occupation=Exec-managerial', 'income=Large'}=>{'capital-gain=High'}",1.0,0.302752,1.0,3
725,"{'hours-per-week=Full-time', 'sex=Male'}=>{'capital-gain=High'}",1.0,0.304587,1.0,3
561,"{'occupation=Exec-managerial', 'capital-loss=None'}=>{'capital-gain=High'}",1.0,0.304587,1.0,3
502,"{'income=Large', 'education=Bachelors'}=>{'capital-gain=High'}",1.0,0.312844,1.0,3
773,"{'income=Large', 'occupation=Prof-specialty'}=>{'capital-gain=High'}",1.0,0.313761,1.0,3
453,"{'education=Bachelors', 'capital-loss=None'}=>{'capital-gain=High'}",1.0,0.313761,1.0,3
811,"{'capital-loss=None', 'occupation=Prof-specialty'}=>{'capital-gain=High'}",1.0,0.315596,1.0,3
16,"{'workclass=Private', 'hours-per-week=Over-time'}=>{'capital-gain=High'}",1.0,0.318349,1.0,3
435,"{'marital-status=Married-civ-spouse', 'age=Middle-aged'}=>{'capital-gain=High'}",1.0,0.338532,1.0,3
580,"{'marital-status=Married-civ-spouse', 'age=Senior'}=>{'capital-gain=High'}",1.0,0.341284,1.0,3


From the association rule table above, we can see that people with a high capital gain often share the following sets of characteristics. These are the first five rows of the table, which is sorted by ascending support:

In [1925]:
print(*aRule['antecedents'].head(5).values, sep='\n')

{'occupation=Exec-managerial', 'income=Large'}
{'hours-per-week=Full-time', 'sex=Male'}
{'occupation=Exec-managerial', 'capital-loss=None'}
{'income=Large', 'education=Bachelors'}
{'income=Large', 'occupation=Prof-specialty'}
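
To list the most frequent sets first instead, the rows can be ranked by descending support. Here is a minimal sketch that uses only the ten rows displayed in the last table (the full result may contain more). `shown_rules` is an illustrative name.

```python
# (antecedents, support) pairs copied from the ten rows displayed in the last table.
shown_rules = [
    ("occupation=Exec-managerial, income=Large", 0.302752),
    ("hours-per-week=Full-time, sex=Male", 0.304587),
    ("occupation=Exec-managerial, capital-loss=None", 0.304587),
    ("income=Large, education=Bachelors", 0.312844),
    ("income=Large, occupation=Prof-specialty", 0.313761),
    ("education=Bachelors, capital-loss=None", 0.313761),
    ("capital-loss=None, occupation=Prof-specialty", 0.315596),
    ("workclass=Private, hours-per-week=Over-time", 0.318349),
    ("marital-status=Married-civ-spouse, age=Middle-aged", 0.338532),
    ("marital-status=Married-civ-spouse, age=Senior", 0.341284),
]

# Highest support first; sorted() is stable, so ties keep their table order.
top5 = sorted(shown_rules, key=lambda rule: rule[1], reverse=True)[:5]
for antecedents, support in top5:
    print(f"{support:.6f}  {antecedents}")
```

In the notebook itself, passing `ascending=False` to `sort_values` would give the same ordering.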
