
# What Are Association Rules?
Put simply, association rules, or affinity analysis, constitute a study of “what goes
with what.” Association rules are heavily used in retail
for learning about items that are purchased together, but they are also useful
in other fields. For example, a medical researcher might want to learn what
symptoms appear together. In law, word combinations that appear too often
might indicate plagiarism. <br> <br>
Association rules are commonly used in online recommendation systems
(or recommender systems) as well, where customers examining an item or items for
possible purchase are shown other items that are often purchased in conjunction
with the first item(s). The display from Amazon.com’s online shopping system
illustrates the application of rules like this under “Frequently bought together.”

# The Apriori Algorithm
Several algorithms have been proposed for generating frequent itemsets, but the
classic algorithm is the Apriori algorithm of Agrawal et al. (1993). The key idea
of the algorithm is to begin by generating frequent itemsets with just one item
(one-itemsets) and to recursively generate frequent itemsets with two items, then
with three items, and so on, until we have generated frequent itemsets of all sizes.

To define association rules, we first need to understand the following terms:


*   Antecedent
*   Consequent
*   Itemset
*   Frequent itemsets
*   Support
*   Confidence
*   Lift ratio





Let's use the following example to understand these terms:

# Color Association Example
A store that sells accessories for cellular phones would like to know whether there is any association between the different colors of a certain product. The table below shows 10 transactions of this product:

![color table](https://drive.google.com/uc?export=view&id=1b9kf8qqDOlbFNMw3YNs5k8_REN6uDMw-)

### Antecedent and Consequent
The idea behind association rules is to examine all possible rules between items
in an if–then format, and select only those that are most likely to indicate a true dependence. We use the term antecedent to describe the IF part, and
consequent to describe the THEN part. Considering the first transaction, one example of a possible
rule is “if red, then white,” meaning that if a red item is purchased, a
white one is, too. Here the antecedent is red and the consequent is white. The
antecedent and consequent each contain a single item in this case. Another possible
rule is “if red and white, then green.” Here the antecedent includes the
itemset {red, white} and the consequent is {green}.
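
One way to make the if–then format concrete is to store a rule as a pair of sets. The sketch below is only an illustration: the names `Rule` and `holds_in` are hypothetical helpers, not part of any library, and the first transaction is taken from the Faceplate data used in this notebook.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    antecedent: frozenset  # the IF part
    consequent: frozenset  # the THEN part

def holds_in(rule, transaction):
    """True if the transaction contains every item of both the IF and the THEN part."""
    return (rule.antecedent | rule.consequent) <= transaction

first_transaction = {"red", "white", "green"}
rule_1 = Rule(frozenset({"red"}), frozenset({"white"}))           # if red, then white
rule_2 = Rule(frozenset({"red", "white"}), frozenset({"green"}))  # if red and white, then green

print(holds_in(rule_1, first_transaction))
print(holds_in(rule_2, first_transaction))
```

Both rules are consistent with the first transaction, which is why it suggests them as candidates; whether they are worth keeping depends on the measures defined below.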

### Itemset
In association analysis, the antecedent
and consequent are sets of items (called itemsets) that are disjoint (do not have
any items in common). Note that itemsets are not records of what people buy;
they are simply possible combinations of items. Looking at the color example, the itemsets include:


*   {red}
*   {white}
*   ...
*   {yellow}
*   {red, white}
*   {red, blue}
*   ...
*   {green, yellow}
*   ...
*   {red, white, blue, orange, green, yellow}
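
The full list can be enumerated mechanically. The following is a small sketch using the standard library's `itertools.combinations`; the variable names are illustrative.

```python
from itertools import combinations

colors = ["red", "white", "blue", "orange", "green", "yellow"]

# All non-empty itemsets, from single items up to the full set of six colors
all_itemsets = [
    set(combo)
    for k in range(1, len(colors) + 1)
    for combo in combinations(colors, k)
]

print(len(all_itemsets))  # 2**6 - 1 = 63
print(all_itemsets[:3])
```

With p items there are 2^p − 1 non-empty itemsets, so the list doubles in size with every item added.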



### Frequent Itemsets
The first step in association rules is to generate all the rules that would be
candidates for indicating associations between items. Ideally, we might want
to look at all possible combinations of items in a database with p distinct items
(in the color example, p = 6). This means finding all combinations
of single items, pairs of items, triplets of items, and so on. However, generating all these combinations requires a long computation
time that grows exponentially in p. A practical solution is to consider only
combinations that occur with higher frequency in the database. These are called
frequent itemsets.
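
The level-wise idea behind Apriori can be sketched in plain Python. This is a simplified illustration, not the mlxtend implementation: the function name `frequent_itemsets` and the `min_count` parameter are assumptions made for this example, and the transactions are transcribed from the Faceplate data used in this notebook.

```python
from itertools import combinations

transactions = [
    {"red", "white", "green"}, {"white", "orange"}, {"white", "blue"},
    {"red", "white", "orange"}, {"red", "blue"}, {"white", "blue"},
    {"red", "blue"}, {"red", "white", "blue", "green"},
    {"red", "white", "blue"}, {"yellow"},
]

def frequent_itemsets(transactions, min_count):
    """Level-wise search: size-(k+1) candidates are built only from frequent size-k itemsets."""
    def count(itemset):
        return sum(itemset <= t for t in transactions)

    items = sorted(set().union(*transactions))
    current = [frozenset([i]) for i in items if count(frozenset([i])) >= min_count]
    result = {}
    while current:
        for itemset in current:
            result[itemset] = count(itemset)
        k = len(current[0])
        # Join step: unions of two frequent k-itemsets that contain exactly k+1 items
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k + 1}
        # Prune step: every k-item subset of a candidate must itself be frequent
        frequent_k = set(current)
        candidates = [c for c in candidates
                      if all(frozenset(sub) in frequent_k for sub in combinations(c, k))]
        current = [c for c in candidates if count(c) >= min_count]
    return result

freq = frequent_itemsets(transactions, min_count=2)  # 2 of 10 transactions = 20%
for itemset, n in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

Because an itemset can only be frequent if all of its subsets are frequent, most of the 63 possible combinations are never counted at all.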

### Support
Determining what qualifies as a frequent itemset is related to the concept of
support. The support of a rule is simply the number of transactions that include
both the antecedent and consequent itemsets. It is called a support because it
measures the degree to which the data “support” the validity of the rule. The
support is sometimes expressed as a percentage of the total number of records in
the database. For example, the support for the itemset {red, white} in the color example is 4 (4 out of 10 transactions), or 40% (100 × 4/10).
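
As a quick check of this number, support can be counted directly. The helper name `support_count` is made up for this sketch, and the transactions are transcribed from the Faceplate data used in this notebook.

```python
transactions = [
    {"red", "white", "green"}, {"white", "orange"}, {"white", "blue"},
    {"red", "white", "orange"}, {"red", "blue"}, {"white", "blue"},
    {"red", "blue"}, {"red", "white", "blue", "green"},
    {"red", "white", "blue"}, {"yellow"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

count = support_count({"red", "white"}, transactions)
print(count)                            # 4
print(100 * count / len(transactions))  # 40.0
```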

### Confidence
In addition to support, which we described
earlier, there is another measure that expresses the degree of uncertainty about
the if–then rule. This is known as the confidence of the rule. This measure
compares the co-occurrence of the antecedent and consequent itemsets in the
database to the occurrence of the antecedent itemsets. Confidence is defined as
the ratio of the number of transactions that include all antecedent and consequent
itemsets (namely, the support) to the number of transactions that include all the
antecedent itemsets: <br><br>
confidence = no. records with both antecedent and consequent itemsets/no. records with antecedent itemset
<br><br>
For example, the confidence of the rule “if red, then white” is 4/6 or 67%: of the 6 transactions that include red, 4 also include white. In other words, if red is purchased, then with 67% confidence white will also be
purchased.
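
The same calculation in code, as a minimal sketch with hypothetical helper names (`count`, `confidence`) and the transactions transcribed from the Faceplate data:

```python
transactions = [
    {"red", "white", "green"}, {"white", "orange"}, {"white", "blue"},
    {"red", "white", "orange"}, {"red", "blue"}, {"white", "blue"},
    {"red", "blue"}, {"red", "white", "blue", "green"},
    {"red", "white", "blue"}, {"yellow"},
]

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

def confidence(antecedent, consequent):
    """Share of the antecedent's transactions that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

print(round(confidence({"red"}, {"white"}), 3))  # 0.667
print(round(confidence({"white"}, {"red"}), 3))  # 0.571
```

Note that confidence is not symmetric: swapping antecedent and consequent changes the denominator, so “if white, then red” has a lower confidence (4/7) than “if red, then white” (4/6).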

### Lift Ratio
A high value of confidence suggests a strong association rule (in which we
are highly confident). However, this can be deceptive because if the antecedent
and/or the consequent has a high level of support, we can have a high value
for confidence even when the antecedent and consequent are independent! For
example, if nearly all customers buy bananas and nearly all customers buy ice
cream, the confidence level of a rule such as “IF bananas THEN ice-cream”
will be high regardless of whether there is an association between the items. <br><br>
A better way to judge the strength of an association rule is to
compare the confidence of the rule with a benchmark value, where we assume
that the occurrence of the consequent itemset in a record is independent of
the occurrence of the antecedent for each rule. Therefore:

*   lift ratio = confidence/benchmark confidence
*   benchmark confidence = no. records with consequent itemset/no. transactions in database

A lift ratio greater than 1.0 suggests that there is some usefulness to the rule. The larger
the lift ratio, the greater the strength of the association. For example, for the rule “if red and white, then green” the confidence is 2/4 or 50%, the benchmark confidence is 2/10 or 20%, and the lift ratio is 50%/20% or 2.5.
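
The two formulas above translate directly into code. This is a sketch with illustrative helper names (`count`, `lift`), using the transactions transcribed from the Faceplate data:

```python
transactions = [
    {"red", "white", "green"}, {"white", "orange"}, {"white", "blue"},
    {"red", "white", "orange"}, {"red", "blue"}, {"white", "blue"},
    {"red", "blue"}, {"red", "white", "blue", "green"},
    {"red", "white", "blue"}, {"yellow"},
]

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

def lift(antecedent, consequent):
    confidence = count(antecedent | consequent) / count(antecedent)
    benchmark = count(consequent) / len(transactions)  # benchmark confidence
    return confidence / benchmark

print(round(lift({"red", "white"}, {"green"}), 3))  # 2.5
print(round(lift({"red"}, {"white"}), 3))           # 0.952
```

The second rule shows why confidence alone can mislead: “if red, then white” has 67% confidence, yet its lift is below 1 because white appears in 70% of all transactions anyway.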


# Code
We will use the mlxtend library (Machine Learning extensions) to code the association rules.

In [None]:
# import the libraries

import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [None]:
# Load the data set and drop the transaction ID column,
# leaving one 0/1 column per color
df = pd.read_csv('Faceplate.csv')
df = df.drop(columns='Transaction')
df



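| | Red | White | Blue | Orange | Green | Yellow |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 1 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 1 | 0 | 0 |
| 4 | 1 | 0 | 1 | 0 | 0 | 0 |
| 5 | 0 | 1 | 1 | 0 | 0 | 0 |
| 6 | 1 | 0 | 1 | 0 | 0 | 0 |
| 7 | 1 | 1 | 1 | 0 | 1 | 0 |
| 8 | 1 | 1 | 1 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 1 |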


In [None]:
# create frequent itemsets: keep items and itemsets with at least 20% support
itemsets = apriori(df, min_support=0.2, use_colnames=True)
itemsets



| | support | itemsets |
|---|---|---|
| 0 | 0.6 | (Red) |
| 1 | 0.7 | (White) |
| 2 | 0.6 | (Blue) |
| 3 | 0.2 | (Orange) |
| 4 | 0.2 | (Green) |
| 5 | 0.4 | (White, Red) |
| 6 | 0.4 | (Blue, Red) |
| 7 | 0.2 | (Green, Red) |
| 8 | 0.4 | (White, Blue) |
| 9 | 0.2 | (Orange, White) |


In [None]:
# convert the frequent itemsets into rules with at least 50% confidence
rules = association_rules(itemsets, metric='confidence', min_threshold=0.5)
# drop the columns we do not need here
rules = rules.drop(columns=['antecedent support', 'consequent support',
                            'leverage', 'conviction', 'zhangs_metric'])
rules



| | antecedents | consequents | support | confidence | lift |
|---|---|---|---|---|---|
| 0 | (White) | (Red) | 0.4 | 0.571429 | 0.952381 |
| 1 | (Red) | (White) | 0.4 | 0.666667 | 0.952381 |
| 2 | (Blue) | (Red) | 0.4 | 0.666667 | 1.111111 |
| 3 | (Red) | (Blue) | 0.4 | 0.666667 | 1.111111 |
| 4 | (Green) | (Red) | 0.2 | 1.0 | 1.666667 |
| 5 | (White) | (Blue) | 0.4 | 0.571429 | 0.952381 |
| 6 | (Blue) | (White) | 0.4 | 0.666667 | 0.952381 |
| 7 | (Orange) | (White) | 0.2 | 1.0 | 1.428571 |
| 8 | (Green) | (White) | 0.2 | 1.0 | 1.428571 |
| 9 | (White, Blue) | (Red) | 0.2 | 0.5 | 0.833333 |
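
To see where rows like these come from, the following plain-Python sketch splits one frequent itemset into every antecedent/consequent pair and keeps the rules that reach the confidence threshold. It is an illustration of the idea only, not how mlxtend implements `association_rules`; the helper names `count` and `rules_from` are hypothetical.

```python
from itertools import combinations

transactions = [
    {"red", "white", "green"}, {"white", "orange"}, {"white", "blue"},
    {"red", "white", "orange"}, {"red", "blue"}, {"white", "blue"},
    {"red", "blue"}, {"red", "white", "blue", "green"},
    {"red", "white", "blue"}, {"yellow"},
]

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

def rules_from(itemset, min_confidence):
    """All rules antecedent -> consequent that split `itemset` into two non-empty parts."""
    itemset = frozenset(itemset)
    found = []
    for k in range(1, len(itemset)):
        for items in combinations(sorted(itemset), k):
            antecedent = frozenset(items)
            consequent = itemset - antecedent
            confidence = count(itemset) / count(antecedent)
            if confidence >= min_confidence:
                lift = confidence / (count(consequent) / len(transactions))
                found.append((sorted(antecedent), sorted(consequent), confidence, lift))
    return found

rules_rwg = rules_from({"red", "white", "green"}, min_confidence=0.5)
for antecedent, consequent, confidence, lift in rules_rwg:
    print(antecedent, "->", consequent, round(confidence, 3), round(lift, 3))
```

Repeating this for every frequent itemset with two or more items yields the complete rule table.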


In [None]:
# sort the rules by lift in descending order
rules.sort_values(by='lift', ascending=False)



| | antecedents | consequents | support | confidence | lift |
|---|---|---|---|---|---|
| 14 | (White, Red) | (Green) | 0.2 | 0.5 | 2.5 |
| 15 | (Green) | (White, Red) | 0.2 | 1.0 | 2.5 |
| 4 | (Green) | (Red) | 0.2 | 1.0 | 1.666667 |
| 12 | (Green, White) | (Red) | 0.2 | 1.0 | 1.666667 |
| 7 | (Orange) | (White) | 0.2 | 1.0 | 1.428571 |
| 8 | (Green) | (White) | 0.2 | 1.0 | 1.428571 |
| 13 | (Green, Red) | (White) | 0.2 | 1.0 | 1.428571 |
| 2 | (Blue) | (Red) | 0.4 | 0.666667 | 1.111111 |
| 3 | (Red) | (Blue) | 0.4 | 0.666667 | 1.111111 |
| 0 | (White) | (Red) | 0.4 | 0.571429 | 0.952381 |


# Book Purchases Exercise
Using the CharlesBookClub.csv dataset, we want to examine associations among transactions involving various types of books. There are 11 different types of
books in this dataset.

a) Import the dataset.<br>
b) Drop the unnecessary columns and keep the columns displaying the various types of books only such as ChildBks, CookBks, etc.<br>
c) The columns for the different types of books record the number of purchases of each type. The association technique we intend to use only needs to know whether a type was purchased, not how many times, so the counts must be converted to binary indicators: set any value greater than 0 to 1 before proceeding to the next step. <br>
d) Conduct an association analysis using a minimum support of 5% (200 out of 4000 transactions) and a minimum confidence of 50%.<br>
e) Sort the values based on lift ratio in descending order. <br>
f) How do you interpret the first rule from the sorted rules?
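
As a hint for step (c), the conversion from counts to 0/1 indicators can be tried on a toy table first. The values below are made up for illustration; only the column names `ChildBks` and `CookBks` come from the exercise, and the real data must be loaded from CharlesBookClub.csv.

```python
import pandas as pd

# Hypothetical purchase counts for three customers (not the real data)
counts = pd.DataFrame({"ChildBks": [0, 2, 1], "CookBks": [3, 0, 0]})

# True where at least one book of that type was purchased
purchased = counts > 0

# The same information as 0/1 values
print(purchased.astype(int))
```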