[View in Colaboratory](https://colab.research.google.com/github/schwaaweb/aimlds1_08-UnsupervisedLearning/blob/master/F08_SC0--DJ--Association_and_Collaborative_Filtering_Sprint_Challenge.ipynb)

## **Association Sprint Challenge**


The journey into Association, a form of Unsupervised Machine Learning, exposed you to the concepts of:

a) Market Basket Analysis 

b) Collaborative Filtering (including Matrix Factorization)

The purpose of this Sprint Challenge is to solidify your understanding of the Association Rule Learning topics covered this week by providing additional practice.

In this Sprint Challenge, we are going to use a few different data sets:

**Store Transactions data set**: https://www.dropbox.com/s/v3wdo3nzl41vxcd/Store_Transactions.csv?raw=1

**Movies data set**: https://www.dropbox.com/s/qo7v9k5rcwt7wgh/movies.csv?raw=1

**User Movie Ratings data set**: https://www.dropbox.com/s/piypmzeucyz160l/ratings_small.csv?raw=1


**Smaller Dataset to Start**: https://www.dropbox.com/s/4ec9l887mth6rep/movie_ratings.csv?raw=1


**Some Tips**:

1) You *may* need to prepare the data.

2) You will have to transpose (pivot) the data set so that each row represents one transaction and each column one product; this is the representation that can be fed into the Apriori Algorithm.
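The reshaping described in tip 2 can be sketched as follows. This is only a minimal illustration on made-up rows, assuming a long-format table with the column names `Transaction_id`, `Product_id` and `Quantity` used by the Store Transactions file; the toy values and the names `long_df` and `basket` are hypothetical.

```python
import pandas as pd

# Hypothetical long-format rows: one row per (transaction, product) pair.
long_df = pd.DataFrame({
    "Transaction_id": [1, 1, 2, 3, 3, 3],
    "Product_id":     [10, 20, 10, 10, 20, 30],
    "Quantity":       [2, 1, 4, 1, 1, 3],
})

# Pivot so that each row is a transaction and each column a product,
# then keep only presence/absence, which is what Apriori works on.
basket = long_df.pivot_table(
    index="Transaction_id",
    columns="Product_id",
    values="Quantity",
    aggfunc="sum",
    fill_value=0,
) > 0

print(basket)
```

Each `True` cell means the product appeared in that transaction at least once; the purchased quantity is deliberately discarded.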

Create solutions for the following code blocks. This exercise should take ~ 1.5 - 2 hours.

Share with mlsubmissions@lambdaschool.com when finished.


In [2]:
# LAMBDA SCHOOL
#
# MACHINE LEARNING
#
# MIT LICENSE

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas import Series, DataFrame

In [3]:
#Install the MLxtend package

#!pip install MLxtend

### This is what I use for Anaconda
#%%bash
#conda install -c conda-forge mlxtend

In [4]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
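# OnehotTransactions was deprecated in favor of TransactionEncoder and is
# not available in newer mlxtend releases, so only TransactionEncoder is imported.
from mlxtend.preprocessing import TransactionEncoder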

### Identifying Frequent Itemsets and Association Rules

**Dataset:**

Store Transactions: https://www.dropbox.com/s/v3wdo3nzl41vxcd/Store_Transactions.csv?raw=1

In [5]:
#!mkdir sc_data && cd sc_data && wget -c https://www.dropbox.com/s/v3wdo3nzl41vxcd/Store_Transactions.csv?raw=1 && mv Store_Transactions.csv?raw=1 Store_Transactions.csv && ls -lh 
    

--2018-05-25 11:50:09--  https://www.dropbox.com/s/v3wdo3nzl41vxcd/Store_Transactions.csv?raw=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.5.1
Connecting to www.dropbox.com (www.dropbox.com)|162.125.5.1|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://dl.dropboxusercontent.com/content_link/jOqcwd1gxRb7JRB9PjpaZ0r4oQ14sP11fPNvs7h9VOVKMmCcWXQPL4bUXY3n7fMC/file [following]
--2018-05-25 11:50:10--  https://dl.dropboxusercontent.com/content_link/jOqcwd1gxRb7JRB9PjpaZ0r4oQ14sP11fPNvs7h9VOVKMmCcWXQPL4bUXY3n7fMC/file
Resolving dl.dropboxusercontent.com (dl.dropboxusercontent.com)... 162.125.5.6
Connecting to dl.dropboxusercontent.com (dl.dropboxusercontent.com)|162.125.5.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34386 (34K) [text/csv]
Saving to: ‘Store_Transactions.csv?raw=1’


2018-05-25 11:50:10 (3.43 MB/s) - ‘Store_Transactions.csv?raw=1’ saved [34386/34386]

total 72
-rw-r--r--@ 1 darwinm  staff    34K May 

**1:** Utilize the Apriori Algorithm to uncover frequent itemsets

In [9]:
df_store = pd.read_csv('sc_data/Store_Transactions.csv')
df_store.shape
df_store.head()

|   | Rowid | Transaction_id | Product_id | Quantity |
|---|---|---|---|---|
| 0 | 1 | 370 | 154 | 3 |
| 1 | 2 | 41 | 40 | 3 |
| 2 | 3 | 109 | 173 | 3 |
| 3 | 4 | 556 | 11 | 4 |
| 4 | 5 | 143 | 72 | 1 |


In [23]:
print(df_store.dtypes,'\n')
print(df_store.isnull().sum(),'\n')
#print(df_store.info)

df_store.shape

Rowid             int64
Transaction_id    int64
Product_id        int64
Quantity          int64
dtype: object 

Rowid             0
Transaction_id    0
Product_id        0
Quantity          0
dtype: int64 



(2328, 4)

In [25]:
df_store.Rowid.nunique()

2328

In [66]:
transactions_per_row = df_store.pivot_table(index='Transaction_id', columns='Product_id', values='Quantity', aggfunc='sum', fill_value=0)

In [67]:
transactions_per_row.head()

| Transaction_id / Product_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [68]:
# Reduce quantities to a 1/0 purchase flag (vectorized equivalent of applymap)
purchase_sets = (transactions_per_row >= 1).astype(int)

In [77]:
#purchase_sets

In [76]:
frequent_itemsets = apriori(purchase_sets, min_support=0.005, use_colnames=True)
association_rules(frequent_itemsets, metric='confidence', min_threshold=0.26)

|   | antecedants | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (15) | (136) | 0.018739 | 0.028961 | 0.005111 | 0.272727 | 9.417112 | 0.004568 | 1.335179 |
| 1 | (18) | (151) | 0.025554 | 0.028961 | 0.006814 | 0.266667 | 9.207843 | 0.006074 | 1.324144 |
| 2 | (35) | (176) | 0.018739 | 0.027257 | 0.005111 | 0.272727 | 10.005682 | 0.0046 | 1.337521 |
| 3 | (36) | (85) | 0.017036 | 0.017036 | 0.005111 | 0.3 | 17.61 | 0.004821 | 1.404235 |
| 4 | (85) | (36) | 0.017036 | 0.017036 | 0.005111 | 0.3 | 17.61 | 0.004821 | 1.404235 |
| 5 | (69) | (84) | 0.018739 | 0.028961 | 0.005111 | 0.272727 | 9.417112 | 0.004568 | 1.335179 |
| 6 | (168) | (146) | 0.018739 | 0.022147 | 0.005111 | 0.272727 | 12.314685 | 0.004696 | 1.344549 |


In [87]:
association_rules(frequent_itemsets, metric='lift', min_threshold=10.25)

|   | antecedants | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (8) | (154) | 0.020443 | 0.02385 | 0.005111 | 0.25 | 10.482143 | 0.004623 | 1.301533 |
| 1 | (154) | (8) | 0.02385 | 0.020443 | 0.005111 | 0.214286 | 10.482143 | 0.004623 | 1.246709 |
| 2 | (88) | (30) | 0.022147 | 0.022147 | 0.005111 | 0.230769 | 10.420118 | 0.00462 | 1.27121 |
| 3 | (30) | (88) | 0.022147 | 0.022147 | 0.005111 | 0.230769 | 10.420118 | 0.00462 | 1.27121 |
| 4 | (33) | (148) | 0.022147 | 0.022147 | 0.005111 | 0.230769 | 10.420118 | 0.00462 | 1.27121 |
| 5 | (148) | (33) | 0.022147 | 0.022147 | 0.005111 | 0.230769 | 10.420118 | 0.00462 | 1.27121 |
| 6 | (36) | (85) | 0.017036 | 0.017036 | 0.005111 | 0.3 | 17.61 | 0.004821 | 1.404235 |
| 7 | (85) | (36) | 0.017036 | 0.017036 | 0.005111 | 0.3 | 17.61 | 0.004821 | 1.404235 |
| 8 | (112) | (91) | 0.02385 | 0.020443 | 0.005111 | 0.214286 | 10.482143 | 0.004623 | 1.246709 |
| 9 | (91) | (112) | 0.020443 | 0.02385 | 0.005111 | 0.25 | 10.482143 | 0.004623 | 1.301533 |


**2:** Discover the strongest association rules that have high lift and high confidence
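One possible way to approach this ask is to require both metrics to clear a threshold at the same time. The sketch below recomputes confidence and lift from the support values of a few rules copied from the output above, then keeps the rules that pass both cut-offs. The thresholds `MIN_CONFIDENCE` and `MIN_LIFT` and the helper `score` are illustrative assumptions, not values or names prescribed by the challenge.

```python
# Rows copied from the rule tables above:
# (antecedent, consequent, antecedent support, consequent support, support)
rules = [
    (15, 136, 0.018739, 0.028961, 0.005111),
    (36, 85, 0.017036, 0.017036, 0.005111),
    (168, 146, 0.018739, 0.022147, 0.005111),
    (8, 154, 0.020443, 0.023850, 0.005111),
    (154, 8, 0.023850, 0.020443, 0.005111),
]

def score(rule):
    """Recompute confidence and lift from the three support values."""
    antecedent, consequent, a_sup, c_sup, sup = rule
    confidence = sup / a_sup     # share of antecedent baskets that also hold the consequent
    lift = confidence / c_sup    # > 1: items co-occur more often than if independent
    return antecedent, consequent, confidence, lift

# Assumed, illustrative thresholds.
MIN_CONFIDENCE, MIN_LIFT = 0.25, 10.0

strong = sorted(
    (s for s in map(score, rules) if s[2] >= MIN_CONFIDENCE and s[3] >= MIN_LIFT),
    key=lambda s: s[3],
    reverse=True,
)
for antecedent, consequent, confidence, lift in strong:
    print(f"{antecedent} -> {consequent}: confidence={confidence:.3f}, lift={lift:.2f}")
```

Filtering on both metrics matters because a rule can have high lift with modest confidence (for example 154 -> 8 above), in which case the consequent is still absent from most baskets containing the antecedent.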

**3:**  Create a Summary Table or Directed Graph to surface the association rules identified in **Ask 2** above
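As a rough sketch of what such a summary could look like, the snippet below stores a handful of rules (values copied from the output above) as a directed graph, represented here simply as an adjacency mapping from antecedent to consequents, and prints a plain-text summary table. The choice of rules and the variable names are assumptions for illustration; a plotting library could draw the same edges.

```python
from collections import defaultdict

# (antecedent, consequent, confidence, lift), copied from the rule output above.
rules = [
    (36, 85, 0.300000, 17.610000),
    (85, 36, 0.300000, 17.610000),
    (168, 146, 0.272727, 12.314685),
    (8, 154, 0.250000, 10.482143),
    (91, 112, 0.250000, 10.482143),
]

# Directed graph as an adjacency mapping: antecedent -> [(consequent, lift), ...]
graph = defaultdict(list)
for antecedent, consequent, confidence, lift in rules:
    graph[antecedent].append((consequent, lift))

# Plain-text summary table, strongest lift first.
print(f"{'rule':<14}{'confidence':>12}{'lift':>8}")
for antecedent, consequent, confidence, lift in sorted(rules, key=lambda r: -r[3]):
    label = f"{antecedent} -> {consequent}"
    print(f"{label:<14}{confidence:>12.3f}{lift:>8.2f}")
```

Rules that appear in both directions, such as 36 -> 85 and 85 -> 36, show up as a pair of opposite edges in the graph.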

### Collaborative Filtering

Dataset:


*Movies*: https://www.dropbox.com/s/qo7v9k5rcwt7wgh/movies.csv?raw=1

*User Movie Ratings*: https://www.dropbox.com/s/piypmzeucyz160l/ratings_small.csv?raw=1

**1:** Utilize Matrix Factorization to arrive at the 2 matrices, i.e. a) User Ratings (across attributes) and b) Movie Ratings (across attributes). Once you have the 2 matrices, compute the dot product of the 2 matrices to obtain an estimate/prediction for the missing user ratings.

For this ask, you could leverage the Matrix Factorization step that was discussed in the lecture **OR**

Here is another implementation for your reference that could be leveraged (may need to be adapted): 

https://lazyprogrammer.me/tutorial-on-collaborative-filtering-and-matrix-factorization-in-python/
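To make the idea concrete, here is a minimal sketch of matrix factorization by gradient descent on a tiny, made-up ratings matrix. It is neither the lecture's implementation nor the one from the linked tutorial; the function name `factorize`, the hyperparameters, and the convention that `0` marks a missing rating are all assumptions for illustration.

```python
import numpy as np

def factorize(R, k=2, steps=5000, lr=0.005, reg=0.02, seed=0):
    """Approximate R by U @ V.T, fitting only the observed (non-zero) entries."""
    rng = np.random.default_rng(seed)
    mask = R > 0                          # True where a rating exists
    U = rng.random((R.shape[0], k))       # user x attribute
    V = rng.random((R.shape[1], k))       # movie x attribute
    for _ in range(steps):
        err = mask * (R - U @ V.T)        # error on observed ratings only
        U_step = err @ V - reg * U        # descent direction with L2 shrinkage
        V_step = err.T @ U - reg * V
        U += lr * U_step
        V += lr * V_step
    return U, V

# Hypothetical toy data: 5 users x 4 movies, 0 = not rated.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

U, V = factorize(R)
predicted = U @ V.T                       # dense estimate, including the missing cells
print(np.round(predicted, 2))
```

The cells that were `0` in `R` now hold predicted ratings, while the observed cells should be reproduced closely. On the real ratings file the same idea applies, but the matrix is far larger and sparser, so the hyperparameters would need tuning.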


**2:** Pick 2 user IDs from the underlying data set and surface recommendations for each of the users you chose
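A simple way to surface recommendations, sketched below under assumed data, is to take a user's row of predicted ratings, hide the movies the user has already rated, and return the highest-scoring remaining ones. The arrays `ratings` and `predicted` and the helper `recommend` are hypothetical stand-ins for the real user-movie matrix and its factorization-based estimate.

```python
import numpy as np

# Hypothetical data: rows are users, columns are movies, 0 marks "not rated".
ratings = np.array([[5.0, 0.0, 3.0, 0.0],
                    [0.0, 4.0, 0.0, 1.0]])
# Hypothetical predicted ratings, e.g. the dot product of the two factor matrices.
predicted = np.array([[4.8, 2.1, 3.2, 4.0],
                      [3.9, 4.1, 2.5, 1.2]])

def recommend(user_index, ratings, predicted, top_n=2):
    """Return column indices of the top_n unseen movies, best prediction first."""
    scores = predicted[user_index].copy()
    seen = ratings[user_index] > 0
    scores[seen] = -np.inf                      # never recommend already-rated movies
    n_unseen = int((~seen).sum())
    order = np.argsort(scores)[::-1]            # highest predicted rating first
    return order[:min(top_n, n_unseen)].tolist()

for user in (0, 1):
    print(user, recommend(user, ratings, predicted))
```

The returned values are column positions; with the real data they would be mapped back to movie IDs and then to titles via the Movies data set.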



**3:** Pick 2 movies and find movies that are similar to the movies you have picked
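One common way to find similar movies, shown here only as a sketch, is to compare the movies' latent attribute vectors with cosine similarity. The matrix `movie_factors` below is made up for illustration; in practice it would be the movie-side matrix produced by the factorization step, and `most_similar` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical movie-factor matrix: one row per movie, one column per latent attribute.
movie_factors = np.array([[1.0, 0.1],
                          [0.9, 0.2],
                          [0.1, 1.0],
                          [0.2, 0.8]])

def most_similar(movie_index, factors, top_n=1):
    """Return indices of the top_n movies closest to movie_index by cosine similarity."""
    unit = factors / np.linalg.norm(factors, axis=1, keepdims=True)
    sims = unit @ unit[movie_index]       # cosine similarity to every movie
    sims[movie_index] = -np.inf           # exclude the movie itself
    return np.argsort(sims)[::-1][:top_n].tolist()

for movie in (0, 2):
    print(movie, most_similar(movie, movie_factors))
```

Cosine similarity ignores vector length, so two movies count as similar when they load on the same attributes in the same proportions, regardless of how strongly.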

**Stretch goal 1**:  Measure Recommendation Accuracy - compute the RMSE to ascertain the difference between the users' actual movie ratings and the ratings that were predicted for the same movies.

*Hint*: You will need to split the underlying data set. 70% of the 'User Movie Ratings' data set will constitute the *Training* data set and 30% will constitute the *Testing* data set.
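The split and the metric from the hint can be sketched as follows. This is an illustrative outline only: the helper names `split_indices` and `rmse`, the fixed seed, and the toy numbers are assumptions, and the indices would normally address rows of the ratings table.

```python
import numpy as np

def split_indices(n_ratings, test_fraction=0.3, seed=0):
    """Shuffle row positions and return (train_idx, test_idx)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_ratings)
    n_test = int(round(n_ratings * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def rmse(actual, predicted):
    """Root mean squared error between two equally long rating sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))

# Hypothetical held-out ratings versus model predictions for the same movies.
print(rmse([4.0, 3.0, 5.0], [3.0, 3.0, 4.0]))
```

The model is fitted on the training rows only, and the RMSE is then computed on the test rows, so the score reflects ratings the model has not seen.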

**Stretch goal 2**: Publish the Collaborative Filtering process that you undertook as a blog post. Some suggested topics to consider:


1) Matrix Factorization step - what is the purpose of performing Matrix Factorization?

2) Recommendation Accuracy - summarize your findings

3) The actual recommendations that were surfaced

*Include a link to your blog post in your submission*

