
# NetworkX Tutorial

Link to [original tutorial](https://networkx.github.io/documentation/networkx-1.10/tutorial/tutorial.html)

In [2]:
# Import libraries
import numpy as np
import pandas as pd
from tqdm import tqdm

import networkx as nx

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns 
# sns.set_style('whitegrid')
color = 'rebeccapurple'
%matplotlib inline

# Display settings
from IPython.display import display
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

In [7]:
# Read data
rules = pd.read_csv('data/interim/rules.csv')

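# Transform first two columns to list format
# Note: str.strip() removes any of the listed characters from both ends
# (it does not match the literal text 'frozenset({'). That is enough here
# as long as every item is wrapped in quotes.
for col in ['antecedents', 'consequents']:
    rules[col] = (rules[col]
                  .str.strip('frozenset({})')
                  .str.replace("'", "")
                  .str.replace(" ", "")
                  .str.split(','))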
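A slightly more explicit way to parse the same cells is to pull out the quoted items with a regular expression. This is only a sketch: it assumes the raw CSV cells look like `frozenset({'5344305', '5344501'})`, and `parse_itemset` is a hypothetical helper name, not part of the original notebook.

```python
import re

def parse_itemset(text):
    """Return the quoted items found in a frozenset string, in order of appearance."""
    # Assumption: every item is wrapped in single quotes, e.g. frozenset({'a', 'b'})
    return re.findall(r"'([^']*)'", text)

example = "frozenset({'5344305', '5344501'})"
print(parse_itemset(example))  # ['5344305', '5344501']
```

With pandas, this could be applied as `rules[col].apply(parse_itemset)` in place of the chained string methods.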

In [8]:
# Check data
print(rules.info())
display(rules.sample(5))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 972 entries, 0 to 971
Data columns (total 9 columns):
antecedents           972 non-null object
consequents           972 non-null object
antecedent support    972 non-null float64
consequent support    972 non-null float64
support               972 non-null float64
confidence            972 non-null float64
lift                  972 non-null float64
leverage              972 non-null float64
conviction            972 non-null float64
dtypes: float64(7), object(2)
memory usage: 68.4+ KB
None


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
299,[6215287],[6215187],0.002917,0.002742,0.001216,0.416867,152.05722,0.001208,1.710175
332,[6327932],[6327911],0.002404,0.002271,0.001483,0.616959,271.723671,0.001478,2.604759
547,"[5344305, 5344501]",[5344584],0.001701,0.001961,0.001701,1.0,509.88172,0.001698,inf
856,[6327901],"[6327932, 6327922, 6327911]",0.002812,0.001286,0.001202,0.4275,332.32168,0.001198,1.744478
479,[9149281],[9149290],0.002875,0.003121,0.001188,0.413203,132.389662,0.001179,1.698848


## Create the graph

Documentation for [basic graph types](https://networkx.github.io/documentation/networkx-1.10/reference/classes.html)

In [51]:
# Initialize empty directed graph
G = nx.DiGraph()

### Create the nodes

**DECISION**: Every unique occurrence of an antecedent or a consequent will be a node, so some nodes will represent groups of multiple products. I will try to visualize this with different colors.

In [29]:
# Count the antecedents that never appear as consequents
# Lists are unhashable, so compare their string representations instead

s1 = set(rules['antecedents'].apply(lambda x: str(x)))
s2 = set(rules['consequents'].apply(lambda x: str(x)))
diff = s1.difference(s2)
print(len(diff))

125


In [59]:
# Make a list of all antecedents and consequents without duplicates
# Lists are unhashable, so deduplicate on their string representation
# Note: some duplicates may remain if two lists hold the same items in a different order

nodes_df = pd.DataFrame(pd.concat([rules['antecedents'], rules['consequents']], 
                                  ignore_index=True, sort=True))
assert len(nodes_df) == 2*len(rules)
nodes_df['temp'] = nodes_df[0].apply(lambda x: str(x)) # need this for identifying duplicates
nodes_df.drop_duplicates(subset='temp', inplace=True)
print("Unique nodes for our graph:", len(nodes_df))

# Add column with number of items in every cell (will be used as node property)
nodes_df['len'] = nodes_df[0].apply(len)

Unique nodes for our graph: 489
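The caveat about item order can be avoided by deduplicating on an order-insensitive key instead of `str(x)`. The following is a minimal sketch, not what the notebook does; `canonical_key` is a hypothetical helper, and the two lists are just reorderings of an item set from the sample above.

```python
def canonical_key(items):
    """Order-insensitive key: the same items in any order give the same tuple."""
    return tuple(sorted(items))

a = ['6327932', '6327922', '6327911']
b = ['6327911', '6327932', '6327922']

print(str(a) == str(b))                      # False: string form depends on order
print(canonical_key(a) == canonical_key(b))  # True: sorted tuple does not
```

Using such a key for the `temp` column would mean node names are tuples rather than strings, so downstream lookups would need to use the same key.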


In [60]:
# Check the results
nodes_df.sample(5)

Unnamed: 0,0,temp,len
355,[6600212],['6600212'],1
944,"[8535212, 8535211, 8535215, 8535214]","['8535212', '8535211', '8535215', '8535214']",4
454,[8601101],['8601101'],1
403,[8535215],['8535215'],1
360,[6603161],['6603161'],1


In [100]:
# Create a list of node tuples (with attribute length)
nodes = list(zip(nodes_df['temp'], nodes_df['len']))

G.clear()

# Add nodes to graph
for n in nodes:
    G.add_node(n[0], n_items=n[1])

In [105]:
# Check results: look at the last 5 nodes of the graph
list(G.nodes(data=True))[-5:]

[("['8600103']", {'n_items': 1}),
 ("['8609001']", {'n_items': 1}),
 ("['8633101']", {'n_items': 1}),
 ("['8641501']", {'n_items': 1}),
 ("['6115231', '6115230', '6113401']", {'n_items': 3})]
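The `n_items` attribute is what could drive the coloring mentioned in the decision above. As a rough sketch on a two-node toy graph (the names `G_demo`, `palette`, and `default_color` are assumptions for illustration, and the color choice is arbitrary), a color list can be built in node order:

```python
import networkx as nx

# Toy graph with the same attribute layout as the notebook's nodes
G_demo = nx.DiGraph()
G_demo.add_node("['8600103']", n_items=1)
G_demo.add_node("['6115231', '6115230', '6113401']", n_items=3)

# Hypothetical mapping: single products get one color, groups get another
palette = {1: 'rebeccapurple'}
default_color = 'orange'

# Colors are collected in the same order in which the graph iterates its nodes
node_colors = [palette.get(attrs['n_items'], default_color)
               for _, attrs in G_demo.nodes(data=True)]
print(node_colors)  # ['rebeccapurple', 'orange']
```

The resulting list can then be passed as `node_color=node_colors` to `nx.draw`, which expects one color per node in node order.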

### Create the edges

In [67]:
nodes_df['temp'].values

array(["['0061855']", "['2119623']", "['2119622']", "['2119624']",
       "['2310724']", "['2310723']", "['2310725']", "['2310726']",
       "['2310727']", "['2310728']", "['2310729']", "['2453802']",
       "['2453804']", "['2456111']", "['2456105']", "['2480107']",
       "['2480196']", "['2806480']", "['2806409']", "['2809116']",
       "['4227900']", "['4227970']", "['4227908']", "['4227909']",
       "['4228008']", "['4227912']", "['4228009']", "['4228109']",
       "['4228070']", "['4227979']", "['4228079']", "['4228179']",
       "['4700102']", "['4752383']", "['4700104']", "['4752280']",
       "['4752281']", "['4752320']", "['4752351']", "['4752311']",
       "['4755111']", "['4752391']", "['4759102']", "['4752393']",
       "['4759211']", "['4807404']", "['4807498']", "['4807499']",
       "['4891621']", "['4891001']", "['4905011']", "['4905001']",
       "['4905099']", "['4905501']", "['4905502']", "['4910111']",
       "['4910112']", "['5045965']", "['5045970']", "['5045975

In [15]:
# Keep only pairwise rules (one antecedent, one consequent); drop all other rows

def drop_non_pairs(df, cols):
    for col in cols:
        df = df.loc[df[col].apply(len) == 1]
    
    return df

pairs = drop_non_pairs(rules, ['antecedents', 'consequents'])

In [17]:
pairs['consequents'].apply(len).value_counts()

1    480
Name: consequents, dtype: int64

In [None]:
# Initialize an empty undirected graph
G = nx.Graph()
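The notebook stops before any edges are added. One possible way to continue, shown here only as a sketch on made-up toy rules (the item names `A`, `B`, `C` and all metric values are invented for illustration), is to add one edge per pairwise rule and store the rule metrics as edge attributes. It also shows how the choice between `nx.DiGraph` and `nx.Graph` changes the result.

```python
import networkx as nx

# Toy pairwise rules; values are illustrative, not taken from rules.csv
pair_rules = [
    {'antecedent': 'A', 'consequent': 'B', 'lift': 2.0, 'confidence': 0.5},
    {'antecedent': 'B', 'consequent': 'A', 'lift': 2.0, 'confidence': 0.4},
    {'antecedent': 'A', 'consequent': 'C', 'lift': 1.5, 'confidence': 0.3},
]

def build_graph(rules, graph_type):
    """Add one edge per rule, keeping lift and confidence as edge attributes."""
    G = graph_type()
    for rule in rules:
        G.add_edge(rule['antecedent'], rule['consequent'],
                   lift=rule['lift'], confidence=rule['confidence'])
    return G

G_directed = build_graph(pair_rules, nx.DiGraph)
G_undirected = build_graph(pair_rules, nx.Graph)

print(G_directed.number_of_edges())    # 3: A->B and B->A stay separate
print(G_undirected.number_of_edges())  # 2: A-B is a single edge
print(G_undirected['A']['B']['confidence'])  # 0.4: the later rule overwrites the earlier one
```

In an undirected graph, a rule and its reverse collapse into one edge and the attributes of the rule added last win, so a directed graph may be the safer choice when confidence matters, since confidence is not symmetric.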