# Association mining

<h1>Some Common terms</h1>


<h2>Antecedent and consequent</h2>
<p>
Association rules describe which items tend to occur together in transactions, in the form of if-then statements. These rules are computed from the data and, unlike the if-then rules of logic, are probabilistic in nature.
In addition to the antecedent (the "if" part) and the consequent (the "then" part), an association rule has two numbers that express the degree of uncertainty about the rule: support and confidence. In association analysis, the antecedent and consequent are sets of items (called itemsets) that are disjoint (they have no items in common).</p>

<h2>Support</h2>
<p>The first number is called the support for the rule. The support is simply the number of transactions that include all items in the antecedent and consequent parts of the rule. The support is sometimes expressed as a percentage of the total number of records in the database.</p>

<h2>Confidence</h2>
<p>The other number is known as the confidence of the rule. Confidence is the ratio of the number of transactions that include all items in both the antecedent and the consequent (the support) to the number of transactions that include all items in the antecedent.</p>
<h2>Lift</h2>
<p>Lift is one more measure of interest in association analysis. Lift is the ratio of confidence to expected confidence. For a rule "if A and B, then C", expected confidence means "the confidence, if buying A and B does not enhance the probability of buying C". It is the number of transactions that include the consequent divided by the total number of transactions.</p>
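The three measures defined above can be computed directly from a list of transactions. The following is a minimal sketch in plain Python, not the implementation used by any library; the function names and the toy baskets are assumptions made purely for illustration.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))


def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    if antecedent & consequent:
        raise ValueError("antecedent and consequent must be disjoint")

    n = len(transactions)
    both = support_count(transactions, antecedent | consequent)
    ante = support_count(transactions, antecedent)
    cons = support_count(transactions, consequent)
    if n == 0 or ante == 0 or cons == 0:
        raise ValueError("metrics are undefined when a count is zero")

    confidence = both / ante            # P(consequent | antecedent)
    expected_confidence = cons / n      # P(consequent) under independence
    return {
        "support": both / n,
        "confidence": confidence,
        "lift": confidence / expected_confidence,
    }


# Toy baskets (hypothetical data, not taken from the notebook's dataset)
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C"},
    {"C"},
    {"A"},
    {"B", "D"},
    {"D"},
    {"A", "B", "D"},
]

metrics = rule_metrics(baskets, {"A", "B"}, {"C"})
print(metrics)
```

Here 4 of the 8 baskets contain both A and B, and 2 of those also contain C, so the confidence is 0.5, while C appears in 3 of 8 baskets overall.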

<h2>Example</h2>
<p>For example, if a supermarket database has 100,000 point-of-sale transactions, out of which 2,000 include both items A and B, and 800 of these include item C, the association rule "If A and B  are purchased, then C is purchased on the same trip," has a support of 800 transactions (alternatively 0.8% = 800/100,000), and a confidence of 40% (=800/2,000). One way to think of support is that it is the probability that a randomly selected transaction from the database will contain all items in the antecedent and the consequent, whereas the confidence is the conditional probability that a randomly selected transaction will include all the items in the consequent, given that the transaction includes all the items in the antecedent.</p>
<p>Continuing the example, the expected confidence is the number of transactions that include the consequent C divided by the total number of transactions. Suppose C appears in 5,000 transactions. The expected confidence is then 5,000/100,000 = 5%, and the lift is confidence / expected confidence = 40% / 5% = 8. Lift therefore tells us how much more likely the consequent (the "then" part) becomes given the antecedent (the "if" part).</p>
<p><strong>A lift ratio larger than 1.0 implies that the antecedent and the consequent occur together more often than would be expected if the two sets were independent. The larger the lift ratio, the stronger the association.</strong></p>
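Written out as formulas, the supermarket figures work out as follows. The notation is introduced here only for convenience: $N$ is the total number of transactions and $\operatorname{count}(\cdot)$ is the number of transactions that contain all the listed items.

```latex
\begin{aligned}
\text{support} &= \frac{\operatorname{count}(A, B, C)}{N}
                = \frac{800}{100{,}000} = 0.8\% \\[4pt]
\text{confidence} &= \frac{\operatorname{count}(A, B, C)}{\operatorname{count}(A, B)}
                   = \frac{800}{2{,}000} = 40\% \\[4pt]
\text{expected confidence} &= \frac{\operatorname{count}(C)}{N}
                            = \frac{5{,}000}{100{,}000} = 5\% \\[4pt]
\text{lift} &= \frac{\text{confidence}}{\text{expected confidence}}
             = \frac{40\%}{5\%} = 8
\end{aligned}
```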

# Randomly trying out algorithms

In [2]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [3]:
from apyori import apriori

In [4]:
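# error_bad_lines was deprecated in pandas 1.3 and removed in 2.0;
# on_bad_lines='warn' skips malformed rows and reports each one
data = pd.read_csv('zzz_user_tx_summary_product_crossselling1.txt', sep='|', on_bad_lines='warn')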

b'Skipping line 19440: expected 8 fields, saw 9\nSkipping line 28516: expected 8 fields, saw 9\nSkipping line 39732: expected 8 fields, saw 9\nSkipping line 59354: expected 8 fields, saw 10\nSkipping line 60335: expected 8 fields, saw 9\nSkipping line 61550: expected 8 fields, saw 9\nSkipping line 62286: expected 8 fields, saw 9\n'
[output truncated: the parser reported many more skipped lines, each with 9 or 10 fields instead of the expected 8]

In [None]:
data.to_excel('Raw_data.xlsx')

In [5]:
# Exclude the three top-up products
topups = ['Ncell Topup', 'NT Postpaid Topup', 'NT Prepaid Topup']
cond = data.loc[~data.product_name.isin(topups)]

In [9]:
data.head()

Unnamed: 0,payer_account_id,created_day,product_id,product_name,amount_sum,revenue_sum,fullname_X,user_name_X
0,265501.0,2018-08-06,7,Send Money,1024.24,0.0,255495,255495.0
1,369755.0,2018-08-06,200,NT Prepaid Topup,100.0,0.8,359354,359354.0
2,439680.0,2018-08-06,200,NT Prepaid Topup,10.0,0.08,429302,429302.0
3,412923.0,2018-08-06,15,Ncell Topup,20.0,0.16,402598,402598.0
4,494207.0,2018-08-06,200,NT Prepaid Topup,100.0,0.8,483671,483671.0


In [6]:
inp = cond[['fullname_X','product_name']]

In [7]:
prep = inp.groupby('fullname_X').product_name.apply(list)

In [12]:
prep.head()

fullname_X
13    [Subisu Cablenet , Subisu Cablenet , Siddharth...
14    [Nabil Bank Credit Card, Dish Home Topup, Net ...
15    [NT Landline Payment, Nepal Electricity Author...
22    [Bigmovies, NT Landline Payment, QFX Chhaya Ce...
28    [QFX Chhaya Centre, Sim TV, Sim TV, QFX Cinema...
Name: product_name, dtype: object

In [13]:
prep = inp.groupby('fullname_X').product_name.apply(set)

In [14]:
prep.head()

fullname_X
13    {Send Money, Siddhartha Bank Credit Card, Nepa...
14    {Nabil Bank Credit Card, Net TV, Send Money, D...
15    {Nepal Electricity Authority, Send Money, NT L...
22    {QFX Cinemas, Bigmovies, NT CDMA Prepaid Topup...
28             {QFX Cinemas, Sim TV, QFX Chhaya Centre}
Name: product_name, dtype: object
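The difference between the two groupings above is that a list keeps every repeated purchase, while a set keeps each product once per user, which is what itemset mining expects. The sketch below reproduces the idea with plain Python on a handful of made-up rows; the user ids and product names are hypothetical, and `collections.defaultdict` stands in for the pandas `groupby(...).apply(...)` call.

```python
from collections import defaultdict

# Hypothetical (user, product) rows, standing in for the fullname_X and
# product_name columns of the real data
rows = [
    ("u1", "Cinema"),
    ("u1", "TV"),
    ("u1", "TV"),
    ("u1", "Cinema"),
    ("u2", "Internet"),
    ("u2", "Internet"),
    ("u2", "Send Money"),
]

as_lists = defaultdict(list)  # analogous to .apply(list): duplicates kept
as_sets = defaultdict(set)    # analogous to .apply(set): duplicates collapsed

for user, product in rows:
    as_lists[user].append(product)
    as_sets[user].add(product)

for user in as_lists:
    print(user, len(as_lists[user]), "purchases,", len(as_sets[user]), "distinct products")
```

With sets, a user who bought the same product many times contributes that product only once to their "transaction".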

In [15]:
listed = list(prep)

In [19]:
listed

[{'BUDDHA AIR',
  'Bhatbhateni Online',
  'Cash Out',
  'Epay Donation',
  'Nepal Electricity Authority',
  'Send Money',
  'Siddhartha Bank Credit Card',
  'Subisu Cablenet '},
 {'Dish Home Topup', 'Nabil Bank Credit Card', 'Net TV', 'Send Money'},
 {'NT Landline Payment',
  'Nepal Electricity Authority',
  'Send Money',
  'Sim TV'},
 {'BSR Movies',
  'Bigmovies',
  'NT CDMA Prepaid Topup',
  'NT Landline Payment',
  'Oho Domain',
  'QFX Chhaya Centre',
  'QFX Cinemas'},
 {'QFX Chhaya Centre', 'QFX Cinemas', 'Sim TV'},
 {'BUDDHA AIR',
  'Dhangadi Premier League 2018',
  'Dish Home Topup',
  'NT ADSL Topup - Unlimited',
  'NT Landline Payment'},
 {'Epay Donation',
  'FCube Cinemas',
  'Global IME Visa Credit Cards',
  'Inception Collection Center',
  'NCELL PRO',
  'NCELL TOPUP',
  'NT Landline Payment',
  'Nabil Bank Credit Card',
  'SHREE AIRLINES',
  'Send Money',
  'Sharesansar'},
 {'BUDDHA AIR',
  'CAN',
  'NCELL TOPUP',
  'Nabil Bank Credit Card',
  'QFX Cinemas',
  'Send Money',

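# `dummy` is not defined anywhere above; `listed` (the per-user product sets)
# is presumably the intended input
rules = apriori(listed, min_support=0.003, min_confidence=0.2, min_lift=3, min_length=2)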

In [20]:
import fim

ImportError: DLL load failed: The specified module could not be found.

In [None]:
pd.DataFrame(prep).to_excel('association_prepared.xlsx')

In [None]:
#result = fim.apriori(listed,supp= 1,target = 'a',report='l',)

In [None]:
#result

In [None]:
result = fim.eclat(listed,supp= 4,target = 'r',report='e',zmin = 2,conf = 50,eval = 'l')

In [None]:
for i in result[3][1]:
    print (i)

In [None]:
result_frame = pd.DataFrame(result)

In [None]:
result_frame

# DRAWBACKS

<ul>
  <li>The dataset is too large and contains too many infrequent itemsets.</li>
  <li>Because of this, the support of even the frequent items becomes very small.</li>
  <li>As a result, no interesting rules can be learned.</li>
</ul>

In [None]:
inp = cond[['fullname_X','product_name']]
inp.head()

In [None]:
# Keep only products purchased more than 1000 and fewer than 3000 times
counts = inp.product_name.value_counts()
keep = counts[(counts > 1000) & (counts < 3000)].index.tolist()

In [None]:
#inp.loc[list((inp.product_name.value_counts()<100).index.values)]

In [None]:
#counts = inp.groupby('fullname_X').product_name.apply()

In [None]:
inp = inp.loc[inp.product_name.isin(keep)]

In [None]:
prep = inp.groupby('fullname_X').product_name.apply(set)
listed = list(prep)

In [None]:
result = fim.eclat(listed,supp= 4,target = 'r',report='e',zmin = 2,conf = 1,eval = 'l')

In [None]:
result_frame = pd.DataFrame(result)

In [None]:
result_frame

# Tackling the drawbacks

# Removing top 3 highest purchased products, taking users who have made 50-200 transactions

In [None]:
import seaborn as sb

In [None]:
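# distplot has been deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
sb.histplot(data.fullname_X.value_counts(), kde=True)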

<h2>Why do this?</h2>
<ul>
    <li>In this model, each user is treated as one transaction. In a conventional setting, each item list holds the items bought by a single person in a single purchase.</li>
    <li>As a result, there are users who have bought 500 items as well as users who have bought only 1 item.</li>
    <li>In an actual one-time transaction, say in a supermarket, such a transaction is very rare.</li>
    <li>Since we are trying to emulate transactions with users, we have to take a slightly different approach.</li>
    <li>If we take all transactions, the support for items that were purchased only a few times by very few users is very low. In fact, the support for the frequent items is also very low, because of the large number of transactions that contain very few items.</li>
    <li>That is why we remove the top 3 most purchased products. If a product appears in almost all transactions, we can recommend it to customers without needing a rule. However, there is a drawback: this approach removes the product not only from the consequent but also from the antecedent, and so removes potentially interesting rules as well.</li>
    <li>I took only the transactions (users in this case) that had 50-200 items. The number of items per transaction ranges from 1 to 500. If we assume the counts follow a roughly Gaussian distribution, we can expect the most frequent items to occur within this range.</li>
</ul>
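The dilution effect described in the list can be illustrated with a small sketch. The numbers are invented for illustration and do not come from the dataset: a few active users share a product pair, and a large group of one-item users is then added to the same transaction list.

```python
def relative_support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


# Hypothetical "users as transactions": 10 active users, 6 of whom
# bought both X and Y
active_users = [{"X", "Y", "Z"}] * 6 + [{"X", "W"}] * 4

# 90 hypothetical users who each bought a single, unrelated product
one_item_users = [{"Q"}] * 90

support_active_only = relative_support(active_users, {"X", "Y"})
support_all_users = relative_support(active_users + one_item_users, {"X", "Y"})

print(support_active_only)  # 6 of 10 users
print(support_all_users)    # 6 of 100 users
```

The pair {X, Y} is just as common among the active users in both cases, but its support drops by a factor of ten once the one-item users are counted, which is why a minimum-support threshold that works on the restricted set finds nothing on the full set.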

In [None]:
# Flag users with more than 50 and fewer than 200 rows in the data
user_counts = data.fullname_X.value_counts()
ranged = (user_counts > 50) & (user_counts < 200)


In [None]:
ranged_frame=pd.DataFrame(ranged)

In [None]:
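# Index the boolean Series by itself; the column name in ranged_frame depends
# on the pandas version ('fullname_X' before 2.0, 'count' from 2.0 on)
select = ranged[ranged].index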

In [None]:
selected_set = data.loc[data.fullname_X.isin(select)]

In [None]:
# Remove the top 3 most purchased products
topups = ['Ncell Topup', 'NT Postpaid Topup', 'NT Prepaid Topup']
selected_set = selected_set.loc[~selected_set.product_name.isin(topups)]

In [None]:
prep = selected_set.groupby('fullname_X').product_name.apply(set)
listed = list(prep)

In [None]:
import fim

In [None]:
result = fim.eclat(listed,supp= 2,target = 'r',report='e',zmin = 1,conf = 70,eval = 'l')
result_frame = pd.DataFrame(result)

In [None]:
result_frame.head()

In [None]:
result_frame.to_excel('rulesAssociation.xlsx')