# Association analysis

This notebook focuses on creating and analysing association rules found in posts.

We first load all the libraries we may use throughout the project.

In [None]:
# Import graphing utilities
%matplotlib inline
import matplotlib.pyplot as plt

# Import useful mathematical libraries
import numpy as np
import pandas as pd

# Import useful Machine learning libraries
import gensim
from sklearn.cluster import KMeans

# Import utility files
from utils import save_object, load_object, make_post_clusters, make_clustering_objects

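# Import the frequent itemset and association rule functions used below
from orangecontrib.associate.fpgrowth import frequent_itemsets, association_rules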

### Set model name

Before beginning the rest of this project, we select a name for our model. This name is used to build the file names from which this model's saved objects are loaded.

In [None]:
# Set the model we are going to be analyzing
model_name = "example_model"
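
The `save_object` and `load_object` helpers live in `utils` and are not shown in this notebook. As a rough sketch of the naming convention only, and assuming they are thin wrappers around `pickle`, something like the following would behave the same way. The `_sketch` function names, the `.pkl` extension and the argument order of the save helper are assumptions, not the actual `utils` code.

```python
import os
import pickle
import tempfile


def save_object_sketch(obj, directory, name):
    """Pickle obj to <directory>/<name>.pkl (hypothetical file layout)."""
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, name + ".pkl"), "wb") as f:
        pickle.dump(obj, f)


def load_object_sketch(directory, name):
    """Load the object previously pickled under <directory>/<name>.pkl."""
    with open(os.path.join(directory, name + ".pkl"), "rb") as f:
        return pickle.load(f)


model_name = "example_model"

# Use a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    objects_dir = os.path.join(tmp, "objects")
    # The model name prefixes every saved artifact, e.g. "example_model-df"
    save_object_sketch({"score": [1, 2, 3]}, objects_dir, model_name + "-df")
    restored = load_object_sketch(objects_dir, model_name + "-df")

print(restored)
```

Prefixing every artifact with `model_name` keeps the files of different models apart inside the same directories.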

### Prepare data

We now load and process the data we will need for the rest of this project.

In [None]:
df = load_object('objects/', model_name + '-df')

scores = list(df['score'])
num_comments_list = list(df['num_comments'])

# Load our saved matrices
PostsByWords = load_object('matricies/', model_name + "-PostsByWords")
WordsByFeatures = load_object('matricies/', model_name + "-WordsByFeatures")

# Generate the posts by Features matrix through matrix multiplication
PostsByFeatures = PostsByWords.dot(WordsByFeatures)
PostsByFeatures = np.matrix(PostsByFeatures)
len(PostsByFeatures)
model = gensim.models.Word2Vec.load('models/' + model_name + '.model')

vocab_list = sorted(list(model.wv.vocab))

# Initialize a word clustering to use
num_word_clusters = 100
kmeans =  load_object('clusters/', model_name + '-words-cluster_model-' + str(num_word_clusters))

clusters = make_clustering_objects(model, kmeans, vocab_list, WordsByFeatures)

clusterWords = list(map(lambda x: list(map(lambda y: y[0] , x["word_list"])), clusters))

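from sklearn.feature_extraction.text import CountVectorizer
# Each "document" is already a list of words, so the analyzer returns it unchanged.
# min_df is omitted because it is ignored when a fixed vocabulary is supplied.
countvec = CountVectorizer(vocabulary=vocab_list, analyzer=lambda lst: lst)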

# Make Clusters By Words Matrix
ClustersByWords = countvec.fit_transform(clusterWords)

# Ensure consistency: both matrices must cover the same number of words
assert len(WordsByFeatures) == ClustersByWords.shape[1]

# Take the transpose of ClustersByWords
WordsByCluster = ClustersByWords.transpose()

# Multiply PostsByWords by WordsByCluster to get PostsByClusters
PostsByClusters = PostsByWords.dot(WordsByCluster)

In [None]:
PostsByClusters
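
To make the matrix product above concrete, here is a minimal sketch on made-up data. The toy matrices (3 posts, 4 words, 2 word clusters) and the lowercase variable names are illustrative assumptions; dense NumPy arrays are used for readability even though the real matrices may be sparse.

```python
import numpy as np

# Toy data: 3 posts over a 4-word vocabulary (counts are made up)
posts_by_words = np.array([
    [2, 0, 1, 0],  # post 0 uses word 0 twice and word 2 once
    [0, 1, 0, 3],  # post 1 uses word 1 once and word 3 three times
    [1, 1, 0, 0],  # post 2 uses word 0 and word 1 once each
])

# clusters_by_words[c, w] is 1 when word w belongs to cluster c
clusters_by_words = np.array([
    [1, 0, 1, 0],  # cluster 0 = {word 0, word 2}
    [0, 1, 0, 1],  # cluster 1 = {word 1, word 3}
])

# Each entry counts how many word occurrences of a post fall in a cluster
posts_by_clusters = posts_by_words.dot(clusters_by_words.transpose())
print(posts_by_clusters)

# Thresholding at zero turns every post into a set of clusters it touches
transactions = posts_by_clusters > 0
print(transactions)
```

Each row of `transactions` plays the role of one basket in the association analysis that follows: a post "contains" a cluster if it uses at least one word from it.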

### Create Association rules

Now that our data has been prepared, we create our itemsets, and then analyze them by creating association rules.

In [None]:
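# A post contains a cluster if it uses at least one of its words;
# keep the itemsets that appear in at least 40% of posts
itemsets = dict(frequent_itemsets(PostsByClusters > 0, 0.40))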

In [None]:
# association_rules returns a generator; keep only rules with confidence >= 0.8
assoc_rules = association_rules(itemsets, 0.8)

In [None]:
# Materialize the rules as a list of (antecedent, consequent, support, confidence) tuples
rules = list(association_rules(itemsets, 0.8))

In [None]:
for lhs, rhs, support, confidence in rules:
    print(lhs, "-->", rhs, "support:", support, "confidence:", confidence)
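
To make the support and confidence values printed above concrete, the following sketch computes both by hand on a few made-up transactions. The toy data and the helper names `count`, `relative_support` and `rule_confidence` are assumptions for illustration; they are not part of the Orange API. Note that, as far as we can tell, Orange reports the support of a rule as an absolute count of posts rather than a fraction.

```python
# Each transaction is the set of cluster ids present in one post (made-up data)
transactions = [
    {0, 1},
    {0, 1, 2},
    {0, 2},
    {1},
    {0, 1},
]


def count(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def relative_support(itemset):
    """Fraction of transactions that contain itemset."""
    return count(itemset) / len(transactions)


def rule_confidence(lhs, rhs):
    """Share of the transactions containing lhs that also contain rhs."""
    return count(lhs | rhs) / count(lhs)


supp_01 = relative_support({0, 1})   # 3 of 5 transactions
conf_0_1 = rule_confidence({0}, {1})  # 3 of the 4 transactions with cluster 0

print("support of {0, 1}:", supp_01)
print("confidence of {0} --> {1}:", conf_0_1)
```

With the thresholds used in this notebook, the itemset `{0, 1}` would pass the 0.40 minimum support, but the rule `{0} --> {1}` would be dropped because its confidence is below 0.8.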