# **Machine Learning - Unsupervised Learning - 1**



## Installing required packages

You need the following packages installed before you can do association rule mining:

* mlxtend
* PyARMViz

To install these packages, open the Anaconda Prompt and run the following commands:

* pip install mlxtend


* pip install --index-url https://test.pypi.org/simple/ PyARMViz


<div class="alert alert-block alert-warning">
<h5>Currently only Testing </h5>
<p> </p> 
<p>
The PyARMViz package for visualizing rules has limited capabilities here: there are some issues with the original package, so we will install a test package with limited features.
</p>
</div> 

## Session 1: Association Rule Discovery



Let us load all the required packages

In [5]:
!pip install mlxtend
!pip install --index-url https://test.pypi.org/simple/ PyARMViz

In [None]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder

Load the data and do some preprocessing

**Note**: Some of the code and data are borrowed from the book *Data Mining for Business Analytics*

In [None]:
fp_df = pd.read_csv("./data/Faceplate.csv")
fp_df.set_index('Transaction', inplace=True)

In [None]:
fp_df.head()

In [None]:
fp_df.shape

### Frequent items with Apriori

In [None]:
frequent_items = apriori(fp_df, min_support=0.02, use_colnames=True)
frequent_items
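
To make the role of `min_support` concrete, here is a minimal brute-force sketch in plain Python. It is not how mlxtend implements `apriori`; the toy transactions and the helper name `frequent_itemsets` are made up for illustration. It keeps every itemset whose support (the fraction of transactions containing it) meets the threshold, and stops growing itemsets once a size yields no frequent set.

```python
from itertools import combinations

# Hypothetical toy transactions (NOT the Faceplate data)
transactions = [
    {"Red", "White", "Green"},
    {"White", "Orange"},
    {"White", "Blue"},
    {"Red", "White", "Orange"},
    {"Red", "Blue"},
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            # Support = share of transactions that contain the whole itemset
            count = sum(itemset <= t for t in transactions)
            support = count / n
            if support >= min_support:
                result[itemset] = support
                found_any = True
        if not found_any:
            # If no itemset of this size is frequent, no larger one can be
            break
    return result


freq = frequent_itemsets(transactions, min_support=0.4)
for itemset, support in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

Raising `min_support` shrinks the result; lowering it admits rarer combinations at the cost of more candidates to check.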

#### You can sort the frequent item sets

In [None]:
frequent_items.sort_values(['support'], ascending=False)

In [None]:
frequent_items_indexed = frequent_items.set_index('itemsets')
frequent_items_indexed

#### Plotting frequent items

In [None]:
frequent_items_indexed.plot(kind='bar')

### Association Rules from frequent sets

You can select rules based on minimum **confidence**
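
As a rough illustration of what these metrics mean, the sketch below computes support, confidence and lift for a single rule by hand. The toy transactions and the helper `support` are assumptions made for this example, not part of mlxtend or the Faceplate data.

```python
# Hypothetical toy transactions (NOT the Faceplate data)
transactions = [
    {"Red", "White", "Green"},
    {"White", "Orange"},
    {"White", "Blue"},
    {"Red", "White", "Orange"},
    {"Red", "Blue"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


# Rule under test: {Red} -> {White}
antecedent = frozenset({"Red"})
consequent = frozenset({"White"})

# support(A -> C) = support(A and C together)
rule_support = support(antecedent | consequent, transactions)
# confidence(A -> C) = support(A and C) / support(A)
confidence = rule_support / support(antecedent, transactions)
# lift(A -> C) = confidence(A -> C) / support(C)
lift = confidence / support(consequent, transactions)

print(f"support={rule_support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift below 1 means the consequent appears less often with the antecedent than it does overall, so a rule can have decent confidence and still be uninformative.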



In [None]:
rules = association_rules(frequent_items, metric="confidence", min_threshold=0.4, num_itemsets = 5)
rules

### Plotting support vs. confidence

In [None]:
rules.plot.scatter(x='support',y='confidence')

<div class="alert alert-block alert-info">
<h5>Better plots ahead </h5>
<p> </p> 
<p>
The plot above uses the built-in matplotlib-based plotting. The PyARMViz package, which we will see below, provides better visualizations of the rules.
</p>
</div> 

You can also select rules based on minimum **lift** 


In [None]:
rules = association_rules(frequent_items, metric="lift", min_threshold=2, num_itemsets = 5)
rules = rules.copy()
rules

In [None]:
rules

<div class="alert alert-block alert-info">
<h5>Manipulations and filtering </h5>
</div> 

**NOTE**: You can manipulate and filter the rules based on

* Lift, confidence, and support thresholds
* Antecedents
* Any combination of these factors. 
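
A minimal sketch of such filtering with pandas boolean masks follows. The table `toy_rules` is hypothetical: it only borrows the column names that `association_rules` returns, and its values are made up. The antecedents and consequents are stored as frozensets, so membership is tested with `in` inside `apply`.

```python
import pandas as pd

# Hypothetical rules table (made-up values) using the column names
# found in the DataFrame returned by association_rules
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({"Red"}), frozenset({"White"}), frozenset({"Red", "White"})],
    "consequents": [frozenset({"White"}), frozenset({"Red"}), frozenset({"Green"})],
    "support": [0.4, 0.4, 0.2],
    "confidence": [0.67, 0.50, 0.50],
    "lift": [0.83, 0.83, 2.50],
})

# Filter on thresholds: combine conditions with & and parentheses
strong = toy_rules[(toy_rules["confidence"] >= 0.5) & (toy_rules["lift"] >= 1.0)]

# Filter on antecedents: each cell is a frozenset, so test membership per row
has_red = toy_rules["antecedents"].apply(lambda items: "Red" in items)

# Combine both kinds of condition
red_rules = toy_rules[has_red & (toy_rules["confidence"] >= 0.6)]

print(strong)
print(red_rules)
```

The same masks can be applied to the `rules` DataFrame in this notebook, since it has the same column names.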

## Visualization of the rules

<div class="alert alert-block alert-warning">
<h5>Package in testing mode </h5>
</div> 

**NOTE**: As explained above, the PyARMViz package that we have installed is a test build, so it may not be fully polished. It still lets us visualize the rules. If you plan to do a project based on association rules, you can install and experiment with the original PyARMViz package. The package description is available [here](https://pypi.org/project/pyarmviz/)

In [None]:
from PyARMViz import datasets
from PyARMViz import PyARMViz
from PyARMViz import Rule

In [None]:
rules

### Convert rules DataFrame to Rules list for PyARMViz package

In [None]:
def convert_rules_df_viz_rules(rules_df, num_trans):
    """
    Convert the rules data frame returned by association_rules to a
    rules list that can be visualized with the PyARMViz package
    
    Parameters
    ----------
    
    rules_df: DataFrame
        The rules dataframe from association_rules method
    num_trans: int
        Number of transactions used to build association rules
    """
    rules_list = []
    # Iterate over the DataFrame passed in, not the global `rules` variable
    for index, row in rules_df.iterrows():
        # Build a fresh dict for each rule so that rules never share state.
        # Supports are fractions; multiplying by num_trans recovers counts.
        rule_dict = {
            'lhs': tuple(row['antecedents']),
            'rhs': list(row['consequents']),
            'count_full': row['support']*num_trans,
            'count_lhs': row['antecedent support']*num_trans,
            'count_rhs': row['consequent support']*num_trans,
            'num_transactions': num_trans,
        }
        rules_list.append(Rule.generate_rule_from_dict(rule_dict))

    return rules_list

In [None]:
# Number of transactions from the original dataset
num_trans = fp_df.shape[0] 
rules_list = convert_rules_df_viz_rules(rules, num_trans)

In [None]:
rules_list

In [None]:
fig = PyARMViz.generate_rule_strength_plot(rules_list, allow_compound_flag=True)

In [None]:
rules

In [None]:
fig = PyARMViz.generate_rule_start_end_plot(rules_list)

<div class="alert alert-block alert-info">
<h5>Changing the strength plot</h5>
</div> 

**NOTE**: 

* In the **strength plot** the size of the circle is based on the confidence. 
* I couldn't find an easy way to base the size on support or lift instead. However, I looked at the PyARMViz.py file to see how the package generates its plots, and the modified function below sizes the circles by lift.

In [None]:
# Modified based on PyARMViz.py file from the package
from typing import List
import plotly.graph_objects as go
def generate_rule_start_end_plot_lift(rules:List, notebook_flag:bool = False, dot_size:int = 10):
    unique_values = set()
    x_axis = []
    y_axis = []
    strength = []
    for index, rule in enumerate(rules):
        x_axis.append(str(rule.rhs))
        y_axis.append(str(rule.lhs))
        strength.append(dot_size*rule.lift)
        unique_values.add(str(rule.rhs))
        unique_values.add(str(rule.lhs))
        
    #Generate distance matrix view
    fig = go.Figure()

    fig.add_trace(go.Scatter(
        x=x_axis,
        y=y_axis,
        mode="markers",
        marker = {'size':strength},
        name='Association rules',
    ))
    
    fig.show()
    return fig

In [None]:
rules

In [None]:
fig = generate_rule_start_end_plot_lift(rules_list)

### Visualizing more complex rules from PyARMViz package

In [None]:
rules_data = datasets.load_shopping_rules()
rules_data

In [None]:
fig = PyARMViz.generate_rule_strength_plot(rules_data)

In [None]:
fig = PyARMViz.generate_rule_start_end_plot(rules_data)

In [None]:
fig = generate_rule_start_end_plot_lift(rules_data, dot_size = 2)

## Activity #1

In the Faceplate dataset we have used, filter the rules based on the following criteria

1. All rules with minimum confidence of 0.6
2. All rules with minimum confidence of 0.4 and with 'Red' in the antecedent. 
3. All rules with minimum lift of 0.8 and with 'White' in the antecedent and 'Green' in consequent

In [None]:
rules.head()

## Activity #2

Load the CharlesBookClub.csv dataset, experiment on frequent items, association rules and visualize them

In [None]:
book_df = pd.read_csv("./data/CharlesBookClub.csv")
book_df.head()

In [None]:
ignore = ['Seq#', 'ID#', 'Gender', 'M','R','F','FirstPurch','Related Purchase',
         'Mcode','Rcode','Fcode','Yes_Florence','No_Florence']
# Removing columns not used in the association rules
count_books = book_df.drop(columns=ignore)
count_books.head()

# Converting the data to a binary incidence matrix
count_books[count_books>0] = 1
count_books.head()
