# Association analysis

This notebook focuses on creating and analysing association rules found in posts.

We first load all the libraries we may use throughout the project.

In [1]:
#Import graphing utilities
%matplotlib inline
import matplotlib.pyplot as plt

# Import useful mathematical libraries
import numpy as np
import pandas as pd

# Import useful Machine learning libraries
import gensim
from sklearn.cluster import KMeans

# Import utility files
from utils import save_object, load_object, make_post_clusters, make_clustering_objects

from orangecontrib.associate.fpgrowth import *

### Set model name

Before beginning the rest of this project, we select a name for our model. This name is used to save and load the files for this model.

In [2]:
# Set the model we are going to be analyzing
model_name = "example_model"

### Prepare data

We now load and process the data we will need for the rest of this project.

In [3]:
df = load_object('objects/', model_name + '-df')

scores = list(df['score'])
num_comments_list = list(df['num_comments'])

# Load our saved matrices
PostsByWords = load_object('matricies/', model_name + "-PostsByWords")
WordsByFeatures = load_object('matricies/', model_name + "-WordsByFeatures")

# Generate the posts by Features matrix through matrix multiplication
PostsByFeatures = PostsByWords.dot(WordsByFeatures)
PostsByFeatures = np.matrix(PostsByFeatures)
len(PostsByFeatures)
model = gensim.models.Word2Vec.load('models/' + model_name + '.model')

vocab_list = sorted(list(model.wv.vocab))

# Initialize a word clustering to use
num_word_clusters = 100
kmeans =  load_object('clusters/', model_name + '-words-cluster_model-' + str(num_word_clusters))

clusters = make_clustering_objects(model, kmeans, vocab_list, WordsByFeatures)

clusterWords = list(map(lambda x: list(map(lambda y: y[0] , x["word_list"])), clusters))

from sklearn.feature_extraction.text import CountVectorizer
# Each "document" is already a list of words, so the analyzer passes the tokens through unchanged
countvec = CountVectorizer(vocabulary=vocab_list, analyzer=lambda tokens: list(tokens), min_df=0)

# Make Clusters By Words Matrix
ClustersByWords = countvec.fit_transform(clusterWords)

# Ensure consistency: both matrices must cover the same number of words
assert len(WordsByFeatures) == ClustersByWords.shape[1]

# Take the transpose of ClustersByWords
WordsByCluster = ClustersByWords.transpose()

# Multiply PostsByWords by WordsByCluster to get PostsByClusters
PostsByClusters = PostsByWords.dot(WordsByCluster)

In [4]:
PostsByClusters

<33069x100 sparse matrix of type '<class 'numpy.int64'>'
	with 860875 stored elements in Compressed Sparse Row format>
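
The matrix product used to build `PostsByClusters` can be checked on a toy example. The following is a minimal sketch with made-up dense NumPy arrays (the names `posts_by_words` and `clusters_by_words` and all values are illustrative assumptions, not the notebook's data). Multiplying a posts-by-words count matrix by the transpose of a clusters-by-words membership matrix gives, for each post, how many of its words fall in each cluster.

```python
import numpy as np

# Hypothetical toy data: 3 posts, 4 vocabulary words, 2 word clusters.
posts_by_words = np.array([
    [2, 0, 1, 0],   # post 0 uses word 0 twice and word 2 once
    [0, 3, 0, 0],   # post 1 uses word 1 three times
    [1, 1, 0, 4],   # post 2 uses words 0, 1 and 3
])

# Row c marks which words belong to cluster c (each word is in exactly one cluster).
clusters_by_words = np.array([
    [1, 1, 0, 0],   # cluster 0 contains words 0 and 1
    [0, 0, 1, 1],   # cluster 1 contains words 2 and 3
])

words_by_clusters = clusters_by_words.T

# The inner dimensions (number of words) must agree before multiplying.
assert posts_by_words.shape[1] == words_by_clusters.shape[0]

# Entry (p, c) counts the words of post p that belong to cluster c.
posts_by_clusters = posts_by_words.dot(words_by_clusters)

# Thresholding at zero turns the counts into boolean "transactions":
# a post contains item c if it uses at least one word from cluster c.
transactions = posts_by_clusters > 0

print(posts_by_clusters)
print(transactions)
```

The boolean matrix at the end has the same role as `PostsByClusters > 0` in the cells that follow: each row is one post, each column one cluster.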

### Create Association rules

Now that our data has been prepared, we mine frequent itemsets (sets of word clusters that appear together in at least 40% of posts) and then analyze them by deriving association rules.
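
To make the two steps concrete, here is a minimal brute-force sketch on made-up transactions. It is not the FP-growth algorithm that `orangecontrib.associate.fpgrowth` uses, and the helper names `frequent_itemsets_bruteforce` and `rules_from_itemsets` are hypothetical; it only illustrates what support and confidence mean.

```python
from itertools import combinations

# Hypothetical toy transactions: each set lists the clusters present in one post.
transactions = [{0, 1, 2}, {0, 1}, {0, 2}, {1, 2}, {0, 1, 2}]


def frequent_itemsets_bruteforce(transactions, min_support):
    """Return {itemset: absolute support} for itemsets meeting min_support (a fraction)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            # Support = number of transactions that contain every item of the candidate.
            count = sum(1 for t in transactions if cset <= t)
            if count / n >= min_support:
                result[cset] = count
    return result


def rules_from_itemsets(itemsets, min_confidence):
    """Return (P, Q, support, confidence) for every rule P -> Q meeting min_confidence."""
    rules = []
    for itemset, supp in itemsets.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                p = frozenset(antecedent)
                q = itemset - p
                # Confidence = supp(P and Q) / supp(P); P is frequent because
                # every subset of a frequent itemset is itself frequent.
                conf = supp / itemsets[p]
                if conf >= min_confidence:
                    rules.append((p, q, supp, conf))
    return rules


toy_itemsets = frequent_itemsets_bruteforce(transactions, 0.4)
toy_rules = rules_from_itemsets(toy_itemsets, 0.75)
for p, q, supp, conf in toy_rules:
    print(sorted(p), "-->", sorted(q), "support:", supp, "confidence:", conf)
```

Brute-force enumeration grows exponentially with the number of items, which is why a dedicated algorithm such as FP-growth is used on the real data.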

In [5]:
itemsets = dict(frequent_itemsets(PostsByClusters > 0, .40))

In [6]:
assoc_rules = association_rules(itemsets,0.8)

In [31]:
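# The last element divides the confidence by the relative support of the antecedent P.
# Note: the conventional lift divides by the relative support of the consequent Q instead.
rules = [(P, Q, supp, conf, conf/(itemsets[P]/PostsByClusters.shape[0]))
         for P, Q, supp, conf in association_rules(itemsets, .95)]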

In [32]:
for lhs, rhs, support, confidence,lift in rules:
    print(", ".join([str(i) for i in lhs]), "-->",", ".join([str(i) for i in rhs]), "support: ",
          support, " confidence: ",confidence, "lift: ", lift)

99, 67, 7, 80, 21, 22 --> 98 support:  13591  confidence:  0.984284472769409 lift:  2.357278623262716
80, 99, 21, 22, 7 --> 98, 67 support:  13591  confidence:  0.9789670820427862 lift:  2.3318780116741986
80, 99, 22, 7 --> 98, 67, 21 support:  13591  confidence:  0.9786146313364056 lift:  2.330199254296054
80, 99, 22 --> 98, 67, 21, 7 support:  13591  confidence:  0.9781920253346769 lift:  2.3281871373105245
80, 99, 21, 22 --> 98, 67, 7 support:  13591  confidence:  0.9786146313364056 lift:  2.330199254296054
80, 67, 99, 21, 22 --> 98, 7 support:  13591  confidence:  0.9841419261404779 lift:  2.3565958983011925
80, 67, 99, 22, 7 --> 98, 21 support:  13591  confidence:  0.9842131942935767 lift:  2.35693722370152
80, 99, 67, 22 --> 98, 21, 7 support:  13591  confidence:  0.9839994207935129 lift:  2.35591346989724
98, 99, 7, 80, 21, 22 --> 67 support:  13591  confidence:  0.9961885215861614 lift:  2.414641810476638
80, 98, 99, 21, 22 --> 67, 7 support:  13591  confidence:  0.996115508648
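
In the usual definition, the lift of a rule P → Q is conf(P → Q) / supp(Q), where supp(Q) is the fraction of transactions containing the consequent. The cell that builds `rules` divides by the relative support of the antecedent P instead, so the "lift" column printed above is a different quantity. The sketch below contrasts the two on made-up counts; the function names are hypothetical helpers, not part of any library.

```python
def lift(support_pq, support_p, support_q, n_transactions):
    """Conventional lift: confidence divided by the relative support of the consequent."""
    confidence = support_pq / support_p
    return confidence / (support_q / n_transactions)


def confidence_over_antecedent_support(support_pq, support_p, n_transactions):
    """Quantity computed in the notebook cell: confidence / relative support of P."""
    confidence = support_pq / support_p
    return confidence / (support_p / n_transactions)


# Hypothetical counts: 100 transactions, P occurs in 50, Q in 25, both together in 25.
n = 100
supp_p, supp_q, supp_pq = 50, 25, 25

print("conventional lift:", lift(supp_pq, supp_p, supp_q, n))
print("notebook quantity:", confidence_over_antecedent_support(supp_pq, supp_p, n))
```

In the notebook, the conventional value could presumably be obtained as `conf / (itemsets[Q] / PostsByClusters.shape[0])`, assuming `itemsets` contains every consequent `Q` (which should hold when all subsets of a frequent itemset are returned).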

In [33]:
len(rules)

2878

In [50]:
# Collect every cluster index that appears on either side of at least one rule
rule_clusters = []
for i in range(100):
    for lhs, rhs, support, confidence, lift in rules:
        if (i in lhs) or (i in rhs):
            rule_clusters.append(i)
            break

In [51]:
rule_clusters

[3, 5, 7, 8, 16, 22, 23, 37, 39, 40, 48, 57, 66, 68, 81, 99, 100]

In [52]:
len(rule_clusters)

17
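
The same collection of cluster indices can be written without the nested loop by taking the union of both sides of every rule. This is a minimal sketch on made-up rules in the same `(lhs, rhs, support, confidence, lift)` layout; the values are illustrative only.

```python
# Hypothetical rules in the (lhs, rhs, support, confidence, lift) layout used above.
toy_rules = [
    (frozenset({1, 2}), frozenset({3}), 10, 0.90, 1.2),
    (frozenset({2}), frozenset({5}), 8, 0.95, 1.1),
]

# Union the antecedent and consequent of every rule, then sort for a stable listing.
toy_rule_clusters = sorted(set().union(*(lhs | rhs for lhs, rhs, *_ in toy_rules)))

print(toy_rule_clusters)
print(len(toy_rule_clusters))
```

Unlike the loop over `range(100)`, this form does not need to know the number of clusters in advance.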