In [1]:
version = "v2.2.033020"

# Assignment 2: Mining Itemsets (Part II)

## Finding Frequent Itemsets with Apriori

In Part I of this assignment, the summary statistics gave us a brief overview of the data. Now for the main task: using the *Apriori* algorithm to find the frequent itemsets.
First, let's import the packages and dependencies used in this part of the assignment.

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

from mlxtend.frequent_patterns import apriori

**<span style="color:red">NOTE: These are all the imports we need to make for this assignment. You should not make other imports in your submitted notebook. You will receive 0 points for the exercises if your solution includes additional imports.</span>**

Before you practice Apriori on the tasty emojis, let's use another dataset as an example to get familiar with the algorithm. As shown below, it is a dataset of shopping history: every row is a shopping basket and every column is a product item.

In [3]:
market_df = pd.read_csv('assets/shopping_basket.csv')
market_df.head()

Unnamed: 0,asparagus,almonds,antioxydant_juice,avocado,babies_food,bacon,barbecue_sauce,black_tea,blueberries,body_spray,...,turkey,vegetables_mix,water_spray,white_wine,whole_weat_flour,whole_wheat_pasta,whole_wheat_rice,yams,yogurt_cake,zucchini
0,False,True,True,True,False,False,False,False,False,False,...,False,True,False,False,True,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,True,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False


In [4]:
print(market_df.shape)

(7501, 119)


We can now call the Apriori API and specify the minimum support we want. You can learn more about this API from its [documentation](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/).

In [5]:
market_frequent_itemsets = apriori(market_df, min_support=0.005, use_colnames=True)
market_frequent_itemsets.head()

Unnamed: 0,support,itemsets
0,0.020397,(almonds)
1,0.008932,(antioxydant_juice)
2,0.033329,(avocado)
3,0.008666,(bacon)
4,0.010799,(barbecue_sauce)


With the above command, we find all itemsets with a minimum support of 0.005, that is, itemsets that appear in at least 0.5% of the shopping baskets (at least 38 of the 7,501 baskets). We can now use the following command to extract the frequent itemsets that contain two or more items.

In [6]:
market_frequent_itemsets[market_frequent_itemsets['itemsets'].apply(len) > 1].head()

Unnamed: 0,support,itemsets
101,0.005199,"(almonds, burgers)"
102,0.005999,"(chocolate, almonds)"
103,0.006532,"(almonds, eggs)"
104,0.005066,"(almonds, green_tea)"
105,0.005199,"(almonds, milk)"
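
To make the support values in these tables concrete, here is a minimal sketch of how the support of an itemset can be computed by hand. The baskets are toy data invented for illustration (they are not taken from `shopping_basket.csv`), and the `support` helper is a hypothetical name, not part of mlxtend. It uses only Python built-ins.

```python
# Toy baskets (hypothetical data, not taken from shopping_basket.csv).
baskets = [
    {"almonds", "eggs", "milk"},
    {"almonds", "eggs"},
    {"milk"},
    {"eggs", "milk"},
]


def support(itemset, baskets):
    """Return the fraction of baskets that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    # `itemset <= basket` is True when itemset is a subset of the basket.
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


print(support({"almonds"}, baskets))                  # 0.5  (2 of 4 baskets)
print(support({"eggs", "milk"}, baskets))             # 0.5  (2 of 4 baskets)
print(support({"almonds", "eggs", "milk"}, baskets))  # 0.25 (1 of 4 baskets)
```

With a threshold of `min_support=0.3` on this toy data, the first two itemsets would count as frequent and the third would not.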


**Now, it is time to apply the Apriori algorithm to the emoji dataset.**

Since we have already shown you how to transform the tweets into emoji itemsets, we have combined the data preprocessing code into a single block. Please run the following code block to load and preprocess the data.

In [7]:
tweets_df = pd.read_csv("assets/food_drink_emoji_tweets.txt", sep="\t", header=None)
tweets_df.columns = ['text']

emoji_list = "🍇🍈🍉🍊🍋🍌🍍🥭🍎🍏🍐🍑🍒🍓🥝🍅🥥🥑🍆🥔🥕🌽🌶🥒🥬🥦🍄🥜🌰🍞🥐🥖🥨🥯🥞🧀🍖🍗🥩🥓🍔🍟🍕🌭🥪🌮🌯🥙🥚🍳🥘🍲🥣🥗🍿🧂🥫🍱🍘🍙🍚🍛🍜🍝🍠🍢🍣🍤🍥🥮🍡🥟🥠🥡🦀🦞🦐🦑🍦🍧🍨🍩🍪🎂🍰🧁🥧🍫🍬🍭🍮🍯🍼🥛☕🍵🍶🍾🍷🍸🍹🍺🍻🥂🥃"
emoji_set = set(emoji_list)

# Keep the distinct food/drink emojis in each tweet (`ch` avoids shadowing the built-in `chr`).
tweets_df['emojis'] = tweets_df.text.apply(lambda text: np.unique([ch for ch in text if ch in emoji_set]))

mlb = MultiLabelBinarizer()
emoji_matrix = pd.DataFrame(data=mlb.fit_transform(tweets_df.emojis), index=tweets_df.index, columns=mlb.classes_)

In [8]:
emoji_matrix.head()

Unnamed: 0,☕,🌭,🌮,🌯,🌰,🌶,🌽,🍄,🍅,🍆,...,🥭,🥮,🥯,🦀,🦐,🦑,🦞,🧀,🧁,🧂
0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
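
The matrix above has one row per tweet and one 0/1 column per emoji. As a rough illustration of that encoding, the sketch below builds the same kind of indicator matrix with built-ins only. It is not how `MultiLabelBinarizer` is implemented, and the three toy itemsets are invented for the example.

```python
# Toy itemsets (hypothetical; the real ones come from tweets_df.emojis).
itemsets = [["🍕", "🍺"], ["🍺"], ["🍔", "🍕"]]

# Column labels: every distinct item, in sorted order.
classes = sorted({item for itemset in itemsets for item in itemset})

# One row per itemset, one 0/1 indicator per column label.
matrix = [[int(label in itemset) for label in classes] for itemset in itemsets]

print(classes)  # ['🍔', '🍕', '🍺']
for row in matrix:
    print(row)
# [0, 1, 1]
# [0, 0, 1]
# [1, 1, 0]
```

Each row sums to the number of distinct emojis in that itemset, which is why a column mean of this matrix equals the support of the corresponding single emoji.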


### Exercise 2.  (20 pts)
Complete the following `emoji_frequent_itemsets` function to find all the **frequent *k*-itemsets** with a minimum support of **`min_support`** in the emoji dataset.

Your function should return a Pandas DataFrame object similar to the `market_frequent_itemsets` object above. The DataFrame object should have two columns: 
- The first one is named `support` and stores the support of the frequent itemsets. 
- The second column is named `itemsets` and stores the frequent itemset as a *frozenset* (the default return type of the apriori API).

Make sure that you are only returning the frequent itemsets that have the specified number of emojis.
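
Since each entry in the `itemsets` column is a frozenset, a few of its properties are worth keeping in mind when filtering by size. The following is a small sketch using built-ins only; the variable names are illustrative, and the support value is copied from no particular source beyond being a plausible number.

```python
a = frozenset(["🍉", "🍇", "🍊"])
b = frozenset(["🍊", "🍉", "🍇"])

# Element order is irrelevant: both describe the same itemset.
print(a == b)  # True

# len() gives the number of items, i.e. the k of a k-itemset.
print(len(a))  # 3

# Frozensets are immutable and hashable, so they can serve as dict keys.
supports = {a: 0.007833}
print(supports[b])  # 0.007833

# Subset tests use the <= operator.
print(frozenset(["🍉", "🍇"]) <= a)  # True
```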

In [9]:
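def emoji_frequent_itemsets(emoji_matrix, min_support=0.005, k=3):
    # Mine all frequent itemsets (of any size) that meet the support threshold.
    frequent_itemsets = apriori(emoji_matrix, min_support=min_support, use_colnames=True)
    # Keep only the itemsets that contain exactly k emojis.
    return frequent_itemsets[frequent_itemsets['itemsets'].apply(len) == k]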


If you implemented this function correctly, we can obtain all frequent 3-itemsets with a minimum support of 0.005 by running the following command.

In [10]:
emoji_frequent_3itemsets = emoji_frequent_itemsets(emoji_matrix, min_support=0.005, k=3)
# The following line displays the obtained frequent itemsets.
emoji_frequent_3itemsets

Unnamed: 0,support,itemsets
206,0.007833,"(🍉, 🍇, 🍊)"
207,0.006327,"(🍍, 🍉, 🍇)"
208,0.006025,"(🍉, 🍇, 🥝)"
209,0.005824,"(🍍, 🍇, 🍊)"
210,0.005925,"(🍇, 🍊, 🥝)"
211,0.005322,"(🍍, 🍇, 🥝)"
212,0.006728,"(🍍, 🍉, 🍊)"
213,0.006025,"(🍉, 🍊, 🥝)"
214,0.005624,"(🍍, 🍉, 🥝)"
215,0.005523,"(🍍, 🍊, 🥝)"


In [11]:
# This cell tests whether the `emoji_frequent_itemsets` function is implemented correctly.
# We hide some tests, so passing all the displayed assertions does not guarantee full points.

emoji_frequent_3itemsets = emoji_frequent_itemsets(emoji_matrix, min_support=0.005, k=3)
for row in emoji_frequent_3itemsets.itertuples():
    assert row.support >= 0.005, f"[Exercise 2] The support of the itemset {row.itemsets} is below the threshold."
    assert len(row.itemsets) == 3, f"[Exercise 2] The itemset {row.itemsets} is not a 3-itemset."


If you are interested, you may also examine what the frequent 4-itemsets look like. Does the result make sense to you? (This part will not be graded.)

In [12]:
emoji_frequent_4itemsets = emoji_frequent_itemsets(emoji_matrix, min_support=0.005, k=4)
emoji_frequent_4itemsets

Unnamed: 0,support,itemsets
255,0.005523,"(🍍, 🍉, 🍇, 🍊)"
256,0.005423,"(🍉, 🍇, 🍊, 🥝)"
257,0.005021,"(🍍, 🍉, 🍇, 🥝)"
258,0.005021,"(🍍, 🍇, 🍊, 🥝)"
259,0.005222,"(🍍, 🍉, 🍊, 🥝)"
260,0.005724,"(🍺, 🍷, 🍸, 🍹)"
261,0.006126,"(🍷, 🍻, 🍸, 🍹)"
262,0.005624,"(🥂, 🍷, 🍸, 🍹)"
263,0.005021,"(🍺, 🍷, 🍻, 🍸)"
264,0.005122,"(🍺, 🍻, 🍸, 🍹)"
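
One way to judge whether these results make sense is the property that Apriori relies on: every subset of a frequent itemset must itself be frequent, with support at least as high. The sketch below checks this for the first 4-itemset, using support values copied from the 3-itemset and 4-itemset outputs above. The helper name `violations` is hypothetical, and only built-ins are used.

```python
# Supports copied from the frequent 3-itemset output above.
support_3 = {
    frozenset(["🍉", "🍇", "🍊"]): 0.007833,
    frozenset(["🍍", "🍉", "🍇"]): 0.006327,
    frozenset(["🍍", "🍇", "🍊"]): 0.005824,
    frozenset(["🍍", "🍉", "🍊"]): 0.006728,
}

# First row of the frequent 4-itemset output above.
four_itemset = frozenset(["🍍", "🍉", "🍇", "🍊"])
support_4 = 0.005523


def violations(itemset, itemset_support, subset_supports):
    """Return the (k-1)-subsets that are missing or less frequent than `itemset`."""
    bad = []
    for item in itemset:
        subset = itemset - {item}  # drop one item to get a (k-1)-subset
        if subset_supports.get(subset, 0.0) < itemset_support:
            bad.append(subset)
    return bad


# An empty list means every 3-item subset is at least as frequent as the 4-itemset.
print(violations(four_itemset, support_4, support_3))  # []
```

The same check could be run on the full DataFrames returned by `emoji_frequent_itemsets` for `k=3` and `k=4`, for example by building the dictionary from their `itemsets` and `support` columns.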
