[Lecture](https://youtu.be/2qOL45sp75c)


[View in Colaboratory](https://colab.research.google.com/github/schwaaweb/aimlds1_08-UnsupervisedLearning/blob/master/W08_CC--DJ--Coding_Challenge_Association_Models.ipynb)

**Coding Challenge**

In this coding challenge, you will apply the concepts of Association Rule Learning:

1) Utilize the Apriori Algorithm to uncover frequent itemsets 

2) Discover the strongest association rules that have high lift and high confidence

3) Create a Directed Graph to surface the association rules identified in Step 2 above
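
As a reminder of what "high lift and high confidence" in Step 2 means, the three standard rule metrics can be sketched in plain Python. The baskets below are made-up toy data, and the helper names `support`, `confidence`, and `lift` are illustrative, not part of any library.

```python
# Toy baskets (made-up data, not taken from the Online Retail set).
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter", "jam"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence relative to the consequent's baseline support."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))

print(support({"bread", "butter"}, transactions))
print(confidence({"bread"}, {"butter"}, transactions))
print(lift({"bread"}, {"butter"}, transactions))
```

A lift above 1 suggests the antecedent and consequent appear together more often than they would if they were independent; a lift near 1 suggests little association.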

**Resources**:

- The Wikipedia articles for [association rule learning](https://en.wikipedia.org/wiki/Association_rule_learning) and the [Apriori algorithm](https://en.wikipedia.org/wiki/Apriori_algorithm) provide a good introduction and overview of the topic.
- The `mlxtend` package provides an [implementation of Apriori](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/) ([source code](https://github.com/rasbt/mlxtend/blob/master/mlxtend/frequent_patterns/apriori.py)) - it is encouraged to read and understand the documentation and code, so you can be a savvy user of it.
- The `networkx` package has a [tutorial for drawing a directed graph](https://networkx.github.io/documentation/networkx-1.10/tutorial/index.html).
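
Before reading the `mlxtend` source, it may help to see the level-wise idea behind Apriori in a few lines. The following is a simplified sketch for intuition only, not the `mlxtend` implementation; the function name `apriori_sketch` and the toy baskets are assumptions made for illustration.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def keep_frequent(candidates):
        # Count each candidate and keep those meeting the support threshold.
        kept = {}
        for candidate in candidates:
            count = sum(candidate <= basket for basket in baskets)
            if count / n >= min_support:
                kept[candidate] = count / n
        return kept

    # Level 1: frequent single items.
    items = {item for basket in baskets for item in basket}
    current = keep_frequent({frozenset([item]) for item in items})
    result = dict(current)

    size = 2
    while current:
        # Join step: combine frequent (size-1)-itemsets into size-itemsets.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == size}
        # Prune step: every (size-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, size - 1))}
        current = keep_frequent(candidates)
        result.update(current)
        size += 1
    return result

toy = [{"bread", "butter"}, {"bread", "butter", "jam"},
       {"bread"}, {"butter", "jam"}, {"milk"}]
for itemset, sup in sorted(apriori_sketch(toy, 0.4).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is what makes Apriori cheaper than brute force: an itemset is only counted if all of its smaller subsets already met the support threshold.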


**Dataset:** The data comes from the **UCI Machine Learning Repository** and records transactions from a UK retailer from 2010 through 2011. The data set represents sales to wholesalers, so it differs slightly from data on consumer purchase patterns, but it is still useful for the purposes of this exercise.

The data set can be downloaded from: http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx

**Guidance for coding challenge**:

1) Focus on the data for the following 2 countries: 1) **Germany** and 2) **Belgium**. Go through the 3 steps highlighted above for each of the 2 countries. This analysis will enable you to compare and contrast the frequent itemsets sold in Germany with those sold in Belgium

2) You will need to prepare the data. For example, some of the invoices contain 'C' in front of the invoice number; these invoices should be removed from the dataset

3) Rows whose product description is 'POSTAGE' will need to be treated, since they could skew the results

4) You will have to transpose the data set so that you get a proper representation of the underlying data that can be fed into the Apriori Algorithm
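
The preparation steps in items 2 to 4 can be sketched on a tiny, made-up frame that reuses the `InvoiceNo`, `Description`, and `Quantity` column names from the data set. Dropping the 'POSTAGE' rows is only one possible treatment, chosen here as an assumption.

```python
import pandas as pd

# Made-up rows; only the column names mirror the Online Retail sheet.
toy = pd.DataFrame({
    "InvoiceNo":   ["1001", "1001", "1001", "1002", "1002", "C1003"],
    "Description": ["MUG", "PLATE", "POSTAGE", "MUG", "MUG", "PLATE"],
    "Quantity":    [6, 12, 1, 3, 2, -12],
})

# Item 2: drop invoices whose number starts with 'C'.
toy = toy[~toy["InvoiceNo"].astype(str).str.startswith("C")]

# Item 3: one possible treatment of POSTAGE is to drop those rows.
toy = toy[toy["Description"] != "POSTAGE"]

# Item 4: one row per invoice, one column per product, summed quantities.
basket = toy.pivot_table(index="InvoiceNo", columns="Description",
                         values="Quantity", aggfunc="sum", fill_value=0)

# Apriori only needs to know whether an item was bought, not how many.
basket = basket > 0
print(basket)
```

The resulting boolean table has one row per transaction and one column per item, which is the layout a frequent-itemset routine expects.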

*Hint:* Good Reference while exploring your data set - https://medium.com/@msalmon00/helpful-python-code-snippets-for-data-exploration-in-pandas-b7c5aed5ecb9

**Communicate your findings**:

One of the common tasks for Data Scientists in the real world is to communicate their findings. Summarize your analysis and make recommendations on how you could potentially improve or bolster sales for the Online Retailer.

In [None]:
#%%bash
#conda install -c conda-forge mlxtend

In [None]:
#!wget -c http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx
#!mv Online\ Retail.xlsx Online_Retail.xlsx
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from mlxtend.frequent_patterns import apriori, association_rules
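# OnehotTransactions is the deprecated older name for TransactionEncoder,
# so only the current class is imported here.
from mlxtend.preprocessing import TransactionEncoder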



In [None]:
%%time
df = pd.read_excel('Online_Retail.xlsx')

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
# Any nulls
df.isnull().sum()

In [None]:
# drop the nulls
df.dropna(subset=['Description','CustomerID'], how='any', inplace=True)

In [None]:
#df.drop('POSTAGE', inplace=True, axis=1)

In [None]:
# nulls are gone
df.isnull().sum()

In [None]:
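# How many cancelled invoices (InvoiceNo starts with 'C') are in the dataframe?
# InvoiceNo mixes integers and strings, so cast to str first;
# .sum() counts the True values, whereas .count() counts non-null entries.
df['InvoiceNo'].astype(str).str.startswith('C').sum()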
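
A note on counting matches in pandas: `Series.count()` returns the number of non-null entries, while `Series.sum()` on a boolean Series returns the number of `True` values. The invoice numbers below are made up to show the difference.

```python
import pandas as pd

# Made-up invoice numbers mixing integers and strings.
invoice_no = pd.Series([536365, 536366, "C536379", "C536383"])

# Cast to str so every entry can be tested with a string method.
flags = invoice_no.astype(str).str.startswith("C")

print(flags.count())  # non-null entries: every row
print(flags.sum())    # True values: rows starting with 'C'
```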

In [None]:
df.dtypes

In [None]:
# remove cancelled invoices (InvoiceNo starts with 'C')
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
# .copy() avoids a SettingWithCopyWarning on the assignment below
df = df[~df['InvoiceNo'].str.startswith('C')].copy()
df['InvoiceNo'] = df['InvoiceNo'].astype('int')
print(df.dtypes)

In [None]:
df.shape

### German Analysis

In [None]:
transactions_per_row = df[df['Country'] == 'Germany'].pivot_table(
    index='InvoiceNo', columns='Description', values='Quantity',
    aggfunc='sum', fill_value=0)

In [None]:
transactions_per_row.head()

In [None]:
# Now Drop 'POSTAGE'
transactions_per_row.drop('POSTAGE', inplace=True, axis=1)

In [None]:
%%time
purchase_sets = transactions_per_row.applymap(lambda quantity: 1 if quantity >= 1 else 0)


In [None]:
purchase_sets.head()

In [None]:
frequent_itemsets = apriori(purchase_sets, min_support=0.05, use_colnames=True)
association_rules(frequent_itemsets, metric='confidence', min_threshold=0.08)

In [None]:
association_rules(frequent_itemsets, metric='lift', min_threshold=2)
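
The challenge asks for rules with both high lift and high confidence, while each call above thresholds a single metric. One possible approach is to filter the returned rules table on both columns afterwards. The table below is a hypothetical stand-in with made-up values; it assumes the column names used by recent `mlxtend` releases (older releases spell the first column `antecedants`), and the cutoffs 4 and 0.6 are arbitrary.

```python
import pandas as pd

# Hypothetical rules table with made-up values.
rules = pd.DataFrame({
    "antecedents": [frozenset({"A"}), frozenset({"B"}), frozenset({"C"})],
    "consequents": [frozenset({"B"}), frozenset({"C"}), frozenset({"A"})],
    "support":     [0.10, 0.06, 0.07],
    "confidence":  [0.80, 0.30, 0.65],
    "lift":        [5.0, 6.0, 1.2],
})

# Keep rules that clear both thresholds, strongest first.
strong = rules[(rules["lift"] >= 4) & (rules["confidence"] >= 0.6)]
strong = strong.sort_values(["lift", "confidence"], ascending=False)
print(strong)
```

Here the second rule has the highest lift but low confidence, and the third has decent confidence but a lift close to 1, so only the first rule survives both filters.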

In [None]:
def trans_per_row(feature, value_of_feature, min_support, metric, min_threshold):
    """Return association rules for the rows of df where feature == value_of_feature."""
    transactions_per_row = df [
                                df[ feature ]== value_of_feature
                              ].pivot_table(
                                            index='InvoiceNo',
                                            columns='Description', 
                                            values='Quantity',
                                            aggfunc='sum',
                                            fill_value=0
                                            )
    transactions_per_row.drop('POSTAGE', inplace=True, axis=1)
    purchase_sets = transactions_per_row.applymap(lambda quantity: 1 if quantity >= 1 else 0)
    frequent_itemsets = apriori(purchase_sets, min_support=min_support, use_colnames=True)
    return association_rules(frequent_itemsets, metric=metric, min_threshold=min_threshold)
    

german_tpr = trans_per_row('Country','Germany',0.05,'confidence',0.08)
german_tpr




In [None]:
german_tpr = trans_per_row('Country','Germany',0.05,'lift',4); german_tpr

In [None]:
belgium_tpr = trans_per_row('Country','Belgium',0.05,'lift',18); belgium_tpr

In [None]:
belgium_tpr = trans_per_row('Country','Belgium',0.05,'confidence',0.08)

In [None]:
belgium_tpr

In [None]:
import networkx as netx
import matplotlib.pyplot as plt

def draw_association_graph(rules, rules_to_show):
  """Draw the first `rules_to_show` association rules as a directed graph."""
  DiGraph = netx.DiGraph()
  
  colors = np.random.rand(rules_to_show)
  rule_nodes = set()
  
  for i in range(rules_to_show):
    rule_node = 'R' + str(i)
    DiGraph.add_nodes_from([rule_node])
    rule_nodes.add(rule_node)
    lift = rules.iloc[i]['lift']
    
    # For each antecedent, the arrow goes from the antecedent to the rule node.
    # Recent mlxtend releases name this column 'antecedents'; older ones
    # used the misspelling 'antecedants'.
    for antecedent in rules.iloc[i]['antecedents']:
      DiGraph.add_nodes_from([antecedent])
      DiGraph.add_edge(antecedent, rule_node, color=colors[i], weight=lift)
    
    # For each consequent, the arrow goes from the rule node to the consequent
    for consequent in rules.iloc[i]['consequents']:
      DiGraph.add_edge(rule_node, consequent, color=colors[i], weight=lift)
  
  # Set the color of rule nodes to orange, non-rule nodes to green
  node_colors = ['orange' if node in rule_nodes else 'green'
                 for node in DiGraph]
  
  # Print/draw results!
  edges = DiGraph.edges()
  nodes = DiGraph.nodes()
  print('Nodes: ' + str(nodes))
  print('Edges: ' + str(edges))
  
  edge_colors = [DiGraph[src][dest]['color'] for src, dest in edges]
  edge_weights =  [DiGraph[src][dest]['weight'] for src, dest in edges]
  
  positions = netx.spring_layout(DiGraph, k=15, scale=1)
  print("Node positions: " + str(positions))
  # The keyword for the list of edges to draw is `edgelist`, not `edges`
  netx.draw_networkx(DiGraph, positions, edgelist=list(edges),
                    node_color=node_colors, edge_color=edge_colors,
                    width=edge_weights, font_size=16, with_labels=False)
  
  # Add labels under nodes
  for position in positions:
    positions[position][1] -= 0.15
  netx.draw_networkx_labels(DiGraph, positions)
  plt.show()
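
To see the graph structure without plotting, the same layout (items pointing into a rule node, the rule node pointing to its consequents) can be built from a few hand-written rules and inspected directly. The rules below are hypothetical placeholders, and `networkx` is imported as `nx` here purely by convention.

```python
import networkx as nx

# Hypothetical rules: (antecedents, consequents, lift).
rules = [
    ({"A", "B"}, {"C"}, 5.0),
    ({"C"}, {"D"}, 4.2),
]

graph = nx.DiGraph()
for i, (antecedents, consequents, lift) in enumerate(rules):
    rule_node = f"R{i}"
    graph.add_node(rule_node, kind="rule")
    # Antecedent items point into the rule node.
    for item in antecedents:
        graph.add_edge(item, rule_node, weight=lift)
    # The rule node points to each consequent item.
    for item in consequents:
        graph.add_edge(rule_node, item, weight=lift)

print(sorted(graph.nodes()))
print(graph.number_of_edges())
print(graph.has_edge("A", "R0"), graph.has_edge("R0", "A"))
```

Because item `C` is a consequent of one rule and an antecedent of another, the two rules end up chained in the graph, which is the kind of pattern the drawing is meant to surface.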

In [None]:
def analyze_country(country, min_support=0.10, lift_threshold=4):
  """Run association rule analysis for a given country."""
  transactions_per_row = df[df['Country'] == country].pivot_table(
      index='InvoiceNo', columns='Description', values='Quantity',
      aggfunc='sum', fill_value=0)
  # Let's get rid of postage, it's not really a product
  transactions_per_row.drop('POSTAGE', inplace=True, axis=1)
  # Last bit of cleaning - we only care if a transaction includes an item or not
  purchase_sets = transactions_per_row.applymap(
      lambda quantity: 1 if quantity >= 1 else 0)
  # Time for Apriori!
  # Calculate support (indication of how frequently itemset appears in data)
  # and confidence (indication of how often a rule is true)
  frequent_itemsets = apriori(purchase_sets, min_support=min_support,
                              use_colnames=True)
  associationrules = association_rules(frequent_itemsets, metric='lift',
                                       min_threshold=lift_threshold)
  draw_association_graph(associationrules, associationrules.shape[0])



In [None]:
analyze_country('Belgium', min_support=0.08)

In [None]:
analyze_country('Germany', min_support=0.05)
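
To compare the two countries, one option is to reduce each rule to an (antecedents, consequents) pair and use set operations. The sketch below assumes the rules have already been collected as such pairs; the helper name `rule_keys` and the item names are placeholders, not results from the data.

```python
def rule_keys(rules):
    """Turn (antecedents, consequents) pairs into hashable keys."""
    return {(frozenset(a), frozenset(c)) for a, c in rules}

# Placeholder rules, not actual findings.
germany_rules = [({"ITEM A"}, {"ITEM B"}), ({"ITEM C"}, {"ITEM D"})]
belgium_rules = [({"ITEM A"}, {"ITEM B"}), ({"ITEM E"}, {"ITEM F"})]

germany = rule_keys(germany_rules)
belgium = rule_keys(belgium_rules)

shared = germany & belgium        # rules found in both countries
only_germany = germany - belgium  # rules specific to Germany
only_belgium = belgium - germany  # rules specific to Belgium

print(len(shared), len(only_germany), len(only_belgium))
```

Rules shared by both countries may point to general product affinities, while country-specific rules may suggest targeted bundles or promotions.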

In [None]:
!git status && git add . && git status

In [None]:
!git commit -am "integrated the functions from the lecture, so I have the Germany and Belgium graphed, now I need to try to make sense of the relationships and tweak the thresholds as well as the ?attributes? I manipulate"

In [None]:
!git status && git push && git status && date