# Workshop 5: Frequent Itemset Mining and Recommender Systems

## 1. Frequent Itemset Mining

Association mining, also known as association rule mining, uncovers relationships between items in large datasets. It identifies patterns, dependencies, and correlations, commonly used in market basket analysis, web usage mining, and more. Key concepts include itemsets, support, confidence, and association rules. The process involves data collection, preprocessing, generating itemsets, calculating support, rule generation, and rule evaluation. Applications range from optimizing product placement in retail to healthcare analytics and fraud detection.

## Case Study: Retail Market Basket Analysis


### Objective:
A retail store aims to improve its sales strategy by understanding customer purchasing patterns and optimizing product placement. The goal is to identify associations between products and generate actionable insights for cross-selling and promotion strategies.

### Data:
The dataset consists of transaction records, each representing items purchased by a customer during a single visit.

### Steps:

### Data Collection: 
Gather transaction data, recording items purchased by customers.

### Data Preprocessing:
Clean the data, handle missing values, and organize it into transactional format.

### Generate Itemsets:
Identify frequent itemsets – combinations of products frequently bought together.

### Calculate Support:
Measure support for each itemset, indicating how often they occur in transactions.

### Generate Association Rules:
Create rules based on user-defined thresholds for support and confidence.

### Evaluate Rules:
Analyze the generated rules using metrics like confidence and lift.

#### Example Rules:

**Rule 1:** {Bread} ➔ {Butter} (Support: 10%, Confidence: 60%)
If a customer buys bread, there is a 60% chance they will also buy butter.

**Rule 2:** {Milk, Eggs} ➔ {Bread} (Support: 8%, Confidence: 70%)
If a customer buys milk and eggs, there is a 70% chance they will also buy bread.
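
The support and confidence figures quoted in these rules, and the lift metric used to evaluate them, follow the standard definitions below for a rule $A \Rightarrow B$. The symbols $A$, $B$, $t$ (a transaction) and $N$ (the number of transactions) are notation introduced for this note only. The last line is a small worked consequence: Rule 1's figures imply that bread itself appears in roughly 16.7% of transactions.

$$
\begin{aligned}
\operatorname{support}(A \Rightarrow B) &= \frac{|\{t : A \cup B \subseteq t\}|}{N} \\
\operatorname{confidence}(A \Rightarrow B) &= \frac{\operatorname{support}(A \cup B)}{\operatorname{support}(A)} \\
\operatorname{lift}(A \Rightarrow B) &= \frac{\operatorname{confidence}(A \Rightarrow B)}{\operatorname{support}(B)} \\
\text{Rule 1: } \operatorname{support}(\{\text{Bread}\}) &= \frac{0.10}{0.60} \approx 0.167
\end{aligned}
$$

A lift above 1 suggests the two sides occur together more often than independence would predict; a lift below 1 suggests the opposite.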


### Insights:
- Customers frequently buy Bread and Butter together, suggesting a bundling opportunity.
- Milk and Eggs purchasers are likely to buy Bread, indicating potential cross-selling.
- Use these insights to optimize product placement, create targeted promotions, and improve the overall customer experience.


### Benefits:
- Increased sales through strategic product bundling.
- Improved customer satisfaction by offering relevant product recommendations.
- Enhanced inventory management through better understanding of product associations.

This case study demonstrates how association mining can provide actionable insights for retailers to optimize their sales strategy, enhance customer experience, and boost overall profitability.


Review the code below to understand how it works. You will use this code later in Activity 1.




In [None]:
# install mlxtend if needed
# !pip install mlxtend

In [8]:
# Import necessary libraries
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.frequent_patterns import fpmax, fpgrowth
import pandas as pd

# Sample transaction data
transactions = [
    ['Bread', 'Milk', 'Eggs'],
    ['Bread', 'Diapers', 'Beer', 'Eggs'],
    ['Milk', 'Diapers', 'Beer', 'Cola'],
    ['Bread', 'Milk', 'Diapers', 'Beer'],
    ['Bread', 'Milk', 'Cola']
]

# Convert the transaction data to a one-hot encoded format
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Generate frequent itemsets using Apriori algorithm
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)

# Generate association rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

# Display the frequent itemsets and association rules
print("Frequent Itemsets:")
print(frequent_itemsets)

print("\nAssociation Rules:")
print(rules)


Frequent Itemsets:
   support         itemsets
0      0.6           (Beer)
1      0.8          (Bread)
2      0.6        (Diapers)
3      0.8           (Milk)
4      0.6  (Beer, Diapers)
5      0.6    (Milk, Bread)

Association Rules:
  antecedents consequents  antecedent support  consequent support  support  \
0      (Beer)   (Diapers)                 0.6                 0.6      0.6   
1   (Diapers)      (Beer)                 0.6                 0.6      0.6   
2      (Milk)     (Bread)                 0.8                 0.8      0.6   
3     (Bread)      (Milk)                 0.8                 0.8      0.6   

   confidence      lift  leverage  conviction  zhangs_metric  
0        1.00  1.666667      0.24         inf           1.00  
1        1.00  1.666667      0.24         inf           1.00  
2        0.75  0.937500     -0.04         0.8          -0.25  
3        0.75  0.937500     -0.04         0.8          -0.25  


In this example:

- **transactions** represent individual shopping baskets.
- The data is converted to a **one-hot encoded** format using the TransactionEncoder.
- The **Apriori algorithm** is used to find frequent itemsets.
- Association rules are generated based on a confidence threshold.
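
To see where the numbers in the rules table come from, the sketch below recomputes support, confidence and lift by hand for two of the rules, using plain Python and the same five sample baskets. The helper names `support` and `rule_metrics` and the variable `baskets` are illustrative only; they are not part of mlxtend.

```python
# The same five sample baskets used in the example above
baskets = [
    ['Bread', 'Milk', 'Eggs'],
    ['Bread', 'Diapers', 'Beer', 'Eggs'],
    ['Milk', 'Diapers', 'Beer', 'Cola'],
    ['Bread', 'Milk', 'Diapers', 'Beer'],
    ['Bread', 'Milk', 'Cola'],
]


def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(basket) for basket in baskets) / len(baskets)


def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    joint = support(set(antecedent) | set(consequent))
    confidence = joint / support(antecedent)
    lift = confidence / support(consequent)
    return joint, confidence, lift


for antecedent, consequent in [({'Beer'}, {'Diapers'}), ({'Milk'}, {'Bread'})]:
    s, c, l = rule_metrics(antecedent, consequent)
    print(f"{antecedent} -> {consequent}: support={s:.2f}, confidence={c:.2f}, lift={l:.4f}")
```

The values should agree with rows 0 and 2 of the rules table: Beer and Diapers always appear together (lift above 1), whereas Milk and Bread co-occur slightly less often than independence would predict (lift below 1).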



### Activity 1: Apply the code to a new dataset

Please reuse the above code as needed to process the CSV dataset.

In [7]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Read the data, treating every value as a string
data = pd.read_csv('Market_Basket_Optimisation.csv', header=None, dtype=str)

# Fill missing values (note: the placeholder '0' is then counted as an item)
data.fillna('0', inplace=True)

# Convert to a list of lists
transactions = data.values.tolist()

# Convert to one-hot encoded format
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Generate frequent itemsets
frequent_itemsets = apriori(df, min_support=0.05, use_colnames=True)

if len(frequent_itemsets) == 0:
    print("frequent_itemsets is empty!")
else:
    # Generate association rules
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)

    # Display the frequent itemsets and association rules
    print("Frequent Itemsets:")
    print(frequent_itemsets)

    print("\nAssociation Rules:")
    print(rules)


Frequent Itemsets:
     support                       itemsets
0   0.999867                            (0)
1   0.087188                      (burgers)
2   0.081056                         (cake)
3   0.059992                      (chicken)
4   0.163845                    (chocolate)
5   0.080389                      (cookies)
6   0.051060                  (cooking oil)
7   0.179709                         (eggs)
8   0.079323                     (escalope)
9   0.170911                 (french fries)
10  0.063325              (frozen smoothie)
11  0.095321            (frozen vegetables)
12  0.052393                (grated cheese)
13  0.132116                    (green tea)
14  0.098254                  (ground beef)
15  0.076523               (low fat yogurt)
16  0.129583                         (milk)
17  0.238368                (mineral water)
18  0.065858                    (olive oil)
19  0.095054                     (pancakes)
20  0.071457                       (shrimp)
21  0.050527 

In [13]:
# Load the data
data = pd.read_csv('Market_Basket_Optimisation.csv', header=None, dtype=str)

# Fill missing values
data.fillna('0', inplace=True)

'''
    Convert the data to a list of lists, one sub-list per transaction
    (hint: use a list called transactions). Each sub-list should contain
    the items from one line of the file, e.g.
    [['turkey', 'burgers', 'mineral water', 'eggs', 'cooking oil'], [...], ...]
'''

# Store the list of lists in transactions

transactions = data.values.tolist()

In [14]:
# Convert the transaction data to a one-hot encoded format
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Generate frequent itemsets using Apriori algorithm
frequent_itemsets = apriori(df, min_support=0.05, use_colnames=True)

if len(frequent_itemsets) == 0:
    print("frequent_itemsets is empty!")
else:
    # Generate association rules
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)

    # Display the frequent itemsets and association rules
    print("Frequent Itemsets:")
    print(frequent_itemsets)

    print("\nAssociation Rules:")
    print(rules)

## Note: for metric definitions refer to: https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/


Frequent Itemsets:
     support                       itemsets
0   0.999867                            (0)
1   0.087188                      (burgers)
2   0.081056                         (cake)
3   0.059992                      (chicken)
4   0.163845                    (chocolate)
5   0.080389                      (cookies)
6   0.051060                  (cooking oil)
7   0.179709                         (eggs)
8   0.079323                     (escalope)
9   0.170911                 (french fries)
10  0.063325              (frozen smoothie)
11  0.095321            (frozen vegetables)
12  0.052393                (grated cheese)
13  0.132116                    (green tea)
14  0.098254                  (ground beef)
15  0.076523               (low fat yogurt)
16  0.129583                         (milk)
17  0.238368                (mineral water)
18  0.065858                    (olive oil)
19  0.095054                     (pancakes)
20  0.071457                       (shrimp)
21  0.050527 
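
In the output above, the itemset `(0)` with support 0.999867 is not a product: it is the `'0'` placeholder written by `fillna`, which the encoder then treats as an item in almost every basket. One possible way around this, sketched below on a small made-up table standing in for the CSV, is to drop the missing cells when building each transaction instead of filling them. The item names and the variable `raw` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the ragged CSV: pandas pads short rows with missing values
raw = pd.DataFrame([
    ["shrimp", "almonds", "avocado"],
    ["burgers", "meatballs", None],
    ["chutney", None, None],
])

# Keep only the real items in each row rather than filling gaps with '0'
transactions = [
    [item for item in row if pd.notna(item)]
    for row in raw.values.tolist()
]

print(transactions)
```

With the placeholder gone, every frequent itemset and rule refers to real products only, and itemsets that merely pair a product with `'0'` disappear from the results.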

### Activity 2: Using FP-Growth for Association Pattern Mining 

Use the mlxtend library to generate association rules with FP-Growth for the same dataset and parameters as in Activity 1.

Compare the results. Then vary the minimum support and test Apriori against FP-Growth. Plot the timing results for both algorithms and discuss the differences.

In [15]:
frequent_itemsets = apriori(df, min_support=0.05, use_colnames=True)
print(frequent_itemsets)

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
print(rules)

     support                       itemsets
0   0.999867                            (0)
1   0.087188                      (burgers)
2   0.081056                         (cake)
3   0.059992                      (chicken)
4   0.163845                    (chocolate)
5   0.080389                      (cookies)
6   0.051060                  (cooking oil)
7   0.179709                         (eggs)
8   0.079323                     (escalope)
9   0.170911                 (french fries)
10  0.063325              (frozen smoothie)
11  0.095321            (frozen vegetables)
12  0.052393                (grated cheese)
13  0.132116                    (green tea)
14  0.098254                  (ground beef)
15  0.076523               (low fat yogurt)
16  0.129583                         (milk)
17  0.238368                (mineral water)
18  0.065858                    (olive oil)
19  0.095054                     (pancakes)
20  0.071457                       (shrimp)
21  0.050527                    

## 2. Recommender Systems

Recommendation systems, often referred to as recommender systems, are crucial components in today's digital landscape, powering personalized content delivery in diverse platforms such as e-commerce, streaming services, and social media. The primary goal of recommendation systems is to predict and suggest items or content that users are likely to be interested in based on their historical preferences, behaviors, and interactions. These systems employ various algorithms, including collaborative filtering, content-based filtering, and hybrid approaches, to analyze user data and generate accurate and relevant recommendations. By enhancing user experience through tailored suggestions, recommendation systems contribute significantly to user engagement, customer satisfaction, and business success in the competitive online environment.

## Case Study: Movie Recommendation System

### Objective:
Netflix aims to improve user satisfaction and retention by implementing an advanced movie recommendation system. The goal is to provide personalized movie suggestions to users based on their viewing history and preferences.

### Data:
The dataset includes user interactions with the platform, such as watched movies, ratings given, and genres liked.

### Steps:

### Data Collection:
Gather user interaction data, including watched movies, ratings, and genre preferences.

### Data Preprocessing:
Clean the data, handle missing values, and organize it into a format suitable for recommendation algorithms.

### User Profiling:
Create user profiles based on historical data, considering factors like preferred genres, average ratings, and watch history.

### Recommendation Algorithm:
Implement collaborative filtering and content-based filtering algorithms to generate personalized movie recommendations.

### Evaluation:
Assess the system's performance using metrics such as precision, recall, and mean absolute error to ensure accurate and relevant recommendations.

### Example Scenario:
A user who frequently watches science fiction movies and has given high ratings to several sci-fi films might receive recommendations for newly released sci-fi titles.

### Benefits:
- Improved user engagement and satisfaction through personalized recommendations.
- Increased user retention as users discover content aligned with their preferences.
- Enhanced platform competitiveness in the streaming industry.

This case study illustrates how a movie recommendation system, by leveraging user data and advanced algorithms, can significantly enhance the streaming experience, leading to increased user satisfaction and platform success.

# Recommender System implementation

The skeleton code below is created so that you can have a go at writing your own implementation of collaborative filtering.

Collaborative filtering is a popular technique used in recommendation systems to personalise and improve user experiences. The concept behind collaborative filtering is to analyse the behaviour and preferences of a group of users to recommend items or content to another user based on their similarities with the group. This technique can be applied to various types of data, such as movies, music, books, and products. Collaborative filtering works by building a model that identifies patterns and similarities in user behaviour and then uses these patterns to predict what items a user is likely to enjoy. By leveraging the collective intelligence of a group, collaborative filtering algorithms can generate highly accurate recommendations, making it a powerful tool for e-commerce, content-based websites, and other recommendation-based systems. In this way, collaborative filtering enables businesses to offer personalised experiences to their users, which can lead to increased engagement, loyalty, and revenue.

The algorithm can be summarised in the following steps:

1. Collect data on user preferences for a set of items.
2. Represent the user preferences as a matrix, with each row representing a user and each column representing an item.
3. Compute the similarity between each pair of users using a similarity metric, such as cosine similarity or Pearson correlation.
4. For a target user, identify the top N most similar users based on the similarity metric.
5. For each item the target user has not rated, predict the rating by computing the weighted average of the ratings given by the most similar users, where the weights are the similarities between the users and the target user.
6. Recommend the top N items with the highest predicted ratings.
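
The six steps above can be sketched in a few lines of NumPy. This is one possible reading of the steps rather than the reference solution for the activity: the function names `cosine_similarity` and `recommend`, the parameters `n_neighbours` and `n_items`, and the convention that a rating of 0 means "not rated" are assumptions made for this illustration. The matrix is the small sample rating table used in Activity 3.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity of two rating vectors; 0.0 if either is all zeros."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / norm) if norm > 0 else 0.0


def recommend(ratings, target, n_neighbours=2, n_items=3):
    """Return (item_index, predicted_rating) pairs for the target's unrated items."""
    # Step 3: similarity between the target user and every user
    sims = np.array([cosine_similarity(ratings[target], row) for row in ratings])
    sims[target] = -1.0  # keep the target out of their own neighbourhood

    # Step 4: indices of the N most similar users
    neighbours = np.argsort(sims)[::-1][:n_neighbours]

    predictions = []
    for item in np.where(ratings[target] == 0)[0]:
        # Step 5: similarity-weighted average over neighbours who rated the item
        rated = [u for u in neighbours if ratings[u, item] != 0]
        weight = sum(sims[u] for u in rated)
        if weight > 0:
            score = sum(sims[u] * ratings[u, item] for u in rated) / weight
            predictions.append((int(item), float(score)))

    # Step 6: highest predicted ratings first
    predictions.sort(key=lambda pair: pair[1], reverse=True)
    return predictions[:n_items]


# Steps 1 and 2: users in rows, items in columns, 0 = not rated
ratings = np.array([
    [3, 0, 0, 5],
    [0, 4, 0, 3],
    [1, 0, 2, 4],
    [5, 0, 3, 0],
    [0, 2, 4, 0],
])

recommendations = recommend(ratings, target=3)
print("Recommended (item, predicted rating):", recommendations)
```

For target user 3 the two nearest neighbours are users 4 and 0, so item 3 (rated 5 by user 0) ranks ahead of item 1 (rated 2 by user 4). Dividing by the sum of the similarities keeps each prediction on the original rating scale.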

### Activity 3: Item based Collaborative Filtering Implementation

Complete the code below to implement a CF recommender. Debug the code and get it working using the small rating table provided.

When the code is working properly, use the provided Netflix movie rating files to obtain recommendations for a target user (just a user number). 

The "movies.csv" file contains movie titles, so you can optionally replace movie IDs with titles from that file.

HAVE A GO! 

In [16]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def similarity(user1, user2):
    # Calculate the dot product of the two user vectors
    dot_product = np.dot(user1, user2)
    
    # Calculate the magnitude of the two user vectors
    magnitude_user1 = np.linalg.norm(user1)
    magnitude_user2 = np.linalg.norm(user2)
    
    # A user with no ratings has zero magnitude; treat them as dissimilar
    # rather than dividing by zero
    if magnitude_user1 == 0 or magnitude_user2 == 0:
        return 0.0
    
    # Calculate the cosine similarity between the two users
    return dot_product / (magnitude_user1 * magnitude_user2)

In [21]:
def predict_rating(user_ratings, target_user, movie_index):
    # Ratings given to this movie by every user (0 means "not rated")
    movie_ratings = user_ratings[:, movie_index]
    
    # Find the indices of the users who rated the movie
    rated_indices = np.where(movie_ratings != 0)[0]
    if len(rated_indices) == 0:
        return 0.0
    
    # Get the ratings of the movie by the rated users
    ratings = movie_ratings[rated_indices]
    
    # Get the user vectors of the rated users
    rated_users = user_ratings[rated_indices]
    
    # Calculate the similarities between the rated users and the target user
    target_vector = user_ratings[target_user]
    similarities = np.array([similarity(target_vector, rated_user) for rated_user in rated_users])
    
    # Calculate the weighted average of the ratings, using the similarities as weights
    total_similarity = np.sum(similarities)
    if total_similarity == 0:
        return 0.0
    
    return np.sum(ratings * similarities) / total_similarity


In [22]:
def recommend_movies(user_ratings, target_user):
    # Get the number of users and movies
    num_users, num_movies = user_ratings.shape
    
    # Find the indices of the unwatched movies by the target user (where the rating == 0)
    unwatched_indices = np.where(user_ratings[target_user] == 0)[0]
    
    # Predict the ratings for the unwatched movies
    predicted_ratings = [predict_rating(user_ratings, target_user, movie_index) for movie_index in unwatched_indices]
    
    # Sort the movies by the predicted rating in descending order
    <your code>
    
    # Get the top 3 recommended movies
    <your code>
    
    return recommended_movies

# Create a sample ratings matrix
ratings = np.array([[3, 0, 0, 5], [0, 4, 0, 3], [1, 0, 2, 4], [5, 0, 3, 0], [0, 2, 4, 0]])

# Make movie recommendations for the target user
# Choose a target user to make recommendations for
target_user = 3
recommended_movies = recommend_movies(ratings, target_user)

# Print the recommended movies
print("Recommended movies:", recommended_movies)

IndexError: index 4 is out of bounds for axis 0 with size 4

In [None]:
import pandas as pd
import numpy as np

movies = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")

# Use pandas pivot to convert from long format
#   userId, movieId, rating
# to a rating table with one row per user and one column per movie.
### Note that the example frame below is printed with movies in rows and users
### in columns, so it needs transposing to match that layout:
# <bound method NDFrame.head of userId   1    2    3    4    5    6    7    8    9    10   ...  601  602  603  \
# movieId                                                    ...                  
# 1        4.0  0.0  0.0  0.0  4.0  0.0  4.5  0.0  0.0  0.0  ...  4.0  0.0  4.0   
# 2        0.0  0.0  0.0  0.0  0.0  4.0  0.0  4.0  0.0  0.0  ...  0.0  4.0  0.0   
# 3        4.0  0.0  0.0  0.0  0.0  5.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
# 4        0.0  0.0  0.0  0.0  0.0  3.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
# 5        0.0  0.0  0.0  0.0  0.0  5.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
# ...      ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...   

<Your code>

'''
In the real world, ratings are very sparse, and most data points come
from very popular movies and highly engaged users.
We therefore reduce the noise by filtering which movies and users
qualify for the final dataset.

To qualify, a movie must have been rated by at least 10 users.
To qualify, a user must have rated at least 50 movies.

Implement a filter as specified above.
'''
<your code>

# Finally, convert the rating table into a NumPy array to get
#               mvi1, mvi2, ...
#        user1  [4. ,   0. , 0. , ..., 4. , 2.5, 5. ],
#        user2  [0. ,   0. , 4. , ..., 0. , 2. , 0. ],
#        ...,

rating  = <your code>
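
The pivot, filter and conversion steps described in this cell could look roughly like the sketch below. It runs on a tiny made-up table rather than `ratings.csv`, and uses thresholds of 3 instead of 10 and 50 so that the filter has a visible effect. The column names `userId`, `movieId` and `rating` are assumed from the frame shown in the comments above, and `toy_ratings`, `filter_sparse` and `rating_matrix` are hypothetical names.

```python
import pandas as pd

# Toy long-format ratings (made-up values) with the assumed column names
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 10, 20, 30],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.5, 5.0, 1.0, 3.5],
})

# One row per user, one column per movie, 0 where no rating exists
table = toy_ratings.pivot(index="userId", columns="movieId", values="rating").fillna(0)


def filter_sparse(table, min_users_per_movie, min_movies_per_user):
    """Keep movies rated by enough users and users who rated enough movies."""
    rated = table > 0
    keep_movies = rated.sum(axis=0) >= min_users_per_movie
    keep_users = rated.sum(axis=1) >= min_movies_per_user
    return table.loc[keep_users, keep_movies]


filtered = filter_sparse(table, min_users_per_movie=3, min_movies_per_user=3)

# Plain NumPy matrix: users in rows, movies in columns
rating_matrix = filtered.to_numpy()
print(filtered)
```

Here both counts are taken on the unfiltered table. Filtering movies first and then recounting per user is an equally reasonable design and can give a different result, so it is worth stating which order you chose.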

In [None]:
## You can test it for different target users
## (target_user is a row index into the rating matrix built above)
target_user = 200
recommended_movies = recommend_movies(rating, target_user)

# Print the recommended movies
print("Recommended movies:", recommended_movies)

### Activity 4: TikTok Case Study

Read the following article, discuss it with your classmates, and answer the questions below.

Article: "Why TikTok's algorithm is so addictive?" https://www.popsci.com/technology/tiktok-algorithm/

The idea behind these questions is to prompt you to think more holistically about the industry you are studying. 

Technical skills can be taught, but it is crucial to consider the impact of the work you are undertaking, including the threats compared to the benefits.

Questions: 

1) How does TikTok's recommendation algorithm leverage user interactions, such as likes, comments, watch time, and shares, to personalize the content feed?

2) Do you think the level of personalization described in the article enhances or limits the user experience on TikTok?

3) How do TikTok's human content moderators complement the work of the algorithm?

4) What challenges and benefits might arise from the collaboration between automated algorithms and human moderation in content platforms?

5) The article suggests that rapid growth can pose challenges for platforms like TikTok. In what ways might fast-paced growth impact a platform's ability to address and mitigate potential harms?

6) The article mentions concerns about TikTok users encountering harmful content as their streams become more niche. What measures could platforms take to address this issue and protect users, especially considering TikTok's younger user base?