# Introduction into Data Science - Assignment Part II

This is the second part of the assignment in IDS 2023/2024.

This part of the assignment consists of five questions — each of these questions is contained in a separate Jupyter notebook:
- [Question 1: Data Preprocessing](Q1_Preprocessing_Visualization.ipynb)
- [Question 2: Association Rules](Q2_Frequent_Itemsets_Association_Rules.ipynb)
- [Question 3: Process Mining](Q3_Process_Mining.ipynb)
- [Question 4: Text Mining](Q4_Text_Mining.ipynb)
- [Question 5: Big Data](Q5_Big_Data.ipynb)

Additional required files are in two folders.
- [datasets](datasets/)
- [scripts](scripts/)

Please use the provided notebook to work on the questions. When you are done, upload your version of each of the notebooks to Moodle. Your submission will, therefore, consist of five jupyter notebook and _no_ additional file. Any additionally provided files will not be considered in grading.
Enter your commented Python code and answers in the corresponding cells. Make sure to answer all questions in a clear and explicit manner and discuss your outputs. _Please do not change the general structure of this notebook_. You can, however, add additional markdown or code cells if necessary. Please **DO NOT CLEAR THE OUTPUT** of the notebook you are submitting! Additionally, please ensure that the code in the notebook runs if placed in the same folder as all of the provided files, delivering the same outputs as the ones you submit in the notebook. This includes being runnable in the bundled conda environment.

*Please make sure to include the names and matriculation numbers of all group members in the provided slots in each of the notebooks.* If a name or a student id is missing, the student will not receive any points.

Hint 1: **Plan your time wisely.** A few parts of this assignment may take some time to run. It might be necessary to consider time management when you plan your group work. Also, do not attempt to upload your assignment at the last minute before the deadline. This often does not work, and you will miss the deadline. Late submissions will not be considered.

Hint 2: RWTHMoodle allows multiple submissions, with every new submission overwriting the previous one. **Partial submissions are possible and encouraged.** This might be helpful in case of technical issues with RWTHMoodle, which may occur close to the deadline.

Hint 3: As a technical note. Some IDEs such as DataSpell may automatically strip jupyter notebook cell metadata. If you are able, please re-add it from the source notebooks before submission. This is necessary for our grading.

Enter your group number and members with matriculation numbers below.

In [None]:
GROUP_NO = 123 # group number
GROUP_MEMBERS = {
    123456: "firstname lastname", # mat. no. : name,
    234567: "firstname lastname",
    345678: "firstname lastname",
}

---

In [2]:
# required imports
# please do not edit!
import csv
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd
from mlxtend.frequent_patterns import apriori
import datetime
from mlxtend.frequent_patterns import association_rules as arule
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth

# Question 2: Frequent Item Sets and Association Rules (13 points)

In this question, you work with transaction data of the customer's visits to the store.

### a) 
Load the transactions from the csv-file called **q2_store_transactions.csv** into a variable called `groceries`. The variable should be a list and each row in the csv-file should be represented as a list within this list.

In [3]:
import csv

with open('datasets/q2_store_transactions.csv', 'r') as file:
    groceries = [line.strip().split(',') for line in file]
display(groceries)

[['shrimp',
  'almonds',
  'avocado',
  'vegetables mix',
  'green grapes',
  'whole weat flour',
  'yams',
  'cottage cheese',
  'energy drink',
  'tomato juice',
  'low fat yogurt',
  'green tea',
  'honey',
  'salad',
  'mineral water',
  'salmon',
  'antioxydant juice',
  'frozen smoothie',
  'spinach',
  'olive oil'],
 ['burgers', 'meatballs', 'eggs'],
 ['chutney'],
 ['turkey', 'avocado'],
 ['mineral water', 'milk', 'energy bar', 'whole wheat rice', 'green tea'],
 ['low fat yogurt'],
 ['whole wheat pasta', 'french fries'],
 ['soup', 'light cream', 'shallot'],
 ['frozen vegetables', 'spaghetti', 'green tea'],
 ['french fries'],
 ['eggs', 'pet food'],
 ['cookies'],
 ['turkey', 'burgers', 'mineral water', 'eggs', 'cooking oil'],
 ['spaghetti', 'champagne', 'cookies'],
 ['mineral water', 'salmon'],
 ['mineral water'],
 ['shrimp',
  'chocolate',
  'chicken',
  'honey',
  'oil',
  'cooking oil',
  'low fat yogurt'],
 ['turkey', 'eggs'],
 ['turkey',
  'fresh tuna',
  'tomatoes',
  'spagh

In [None]:
# Please leave this cell empty - used for grading.

### b) 
Transform the entries from the list to a binary matrix using an object of *TransactionEncoder* as introduced in the exercise. Name the resulting dataframe `itemset_matrix` and display the first 20 rows.

In [4]:
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd

encoder = TransactionEncoder()
itemset_matrix = encoder.fit(groceries).transform(groceries)
itemset_matrix = pd.DataFrame(itemset_matrix, columns=encoder.columns_)
display(itemset_matrix.head(20))


Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,False,True,True,False,True,False,False,False,False,False,...,False,True,False,False,True,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [None]:
# Please leave this cell empty - used for grading.

In [None]:
# Please leave this cell empty - used for grading.

### c) 
Find all frequent itemsets with a **support of at least 0.03** using the Apriori algorithm and save them in a variable called `frequent_itemsets`. Display the resulting itemsets and the processing time (in milliseconds) required to detect them. 

In [5]:
start_time = datetime.datetime.now()
frequent_itemsets = apriori(itemset_matrix, min_support=0.03, use_colnames=True)
end_time = datetime.datetime.now()

processing_time_ms = (end_time - start_time).total_seconds() * 1000
print(f"Processing Time: {processing_time_ms:.2f} milliseconds")
display('Frequent Itemsets:', frequent_itemsets)

Processing Time: 80.03 milliseconds


'Frequent Itemsets:'

Unnamed: 0,support,itemsets
0,0.033329,(avocado)
1,0.033729,(brownies)
2,0.087188,(burgers)
3,0.030129,(butter)
4,0.081056,(cake)
5,0.046794,(champagne)
6,0.059992,(chicken)
7,0.163845,(chocolate)
8,0.080389,(cookies)
9,0.05106,(cooking oil)


In [None]:
# Please leave this cell empty - used for grading.

### d)
Find the most frequent itemsets containing **more than one product** and a **support of more than 0.04** using the Apriori algorithm. Store them in a variable called `frequent_itemsets_filtered` and show the sets in your output.

In [6]:
frequent_itemsets_filtered = apriori(itemset_matrix, min_support=0.04, use_colnames=True)
frequent_itemsets_filtered = frequent_itemsets_filtered[frequent_itemsets_filtered['itemsets'].apply(lambda x: len(x) > 1)]

# Display the resulting frequent itemsets
print("Frequent Itemsets (More than one product and support > 0.04):")
print(frequent_itemsets_filtered)

Frequent Itemsets (More than one product and support > 0.04):
     support                      itemsets
30  0.052660    (mineral water, chocolate)
31  0.050927         (mineral water, eggs)
32  0.040928  (ground beef, mineral water)
33  0.047994         (milk, mineral water)
34  0.059725    (mineral water, spaghetti)


In [None]:
# Please leave this cell empty - used for grading.

In [None]:
# Please leave this cell empty - used for grading.

In [None]:
# Please leave this cell empty - used for grading.

### e)
Find all association rules in the data that have a **confidence of at least 0.3** and a **minimum lift of 1.2**. Create and show a dataframe `association_rules` listing the antecedents, consequents, support, confidence, and lift of each of these discovered rules. How do you interpret the quality of the discovered rules?

In [7]:
from mlxtend.frequent_patterns import association_rules as arule
import pandas as pd

frequent_itemsets = apriori(itemset_matrix, min_support=0.04, use_colnames=True)
rules = arule(frequent_itemsets, metric="confidence", min_threshold=0.3)
filtered_rules = rules[rules['lift'] > 1.2]

print("Association Rules")
display(filtered_rules)

Association Rules


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(chocolate),(mineral water),0.163845,0.238368,0.05266,0.3214,1.348332,0.013604,1.122357,0.308965
1,(ground beef),(mineral water),0.098254,0.238368,0.040928,0.416554,1.747522,0.017507,1.305401,0.474369
2,(milk),(mineral water),0.129583,0.238368,0.047994,0.37037,1.553774,0.017105,1.20965,0.409465
3,(spaghetti),(mineral water),0.17411,0.238368,0.059725,0.343032,1.439085,0.018223,1.159314,0.369437


In [None]:
# Please leave this cell empty - used for grading.

__Student Answer:__ The higher lift (greater than 1) indicates a psitive correlation between the antecedents and consequents. However on average the support and confidence levels were quite low.

### f) 
Find all frequent itemsets with a **support of at least 0.03** using **FP-Growth** and save them in a variable called `fp_frequent_itemsets`. Display the resulting itemsets and the processing time (in milliseconds) required to detect them. 

In [29]:
from mlxtend.frequent_patterns import fpgrowth

end_time = datetime.datetime.now()
fp_frequent_itemsets = fpgrowth(itemset_matrix, min_support=0.03)
end_time = datetime.datetime.now()
fp_frequent_itemsets['itemsets'] = fp_frequent_itemsets['itemsets'].apply(lambda x: [itemset_matrix.columns[int(val)] for val in x])
print("FP-Growth Frequent Itemsets:")
display(fp_frequent_itemsets)

# Calculate and display the processing time in milliseconds
processing_time = (end_time - start_time).total_seconds() * 1000
print(f"Processing Time: {processing_time:.2f} milliseconds")

FP-Growth Frequent Itemsets:


Unnamed: 0,support,itemsets
0,0.238368,[mineral water]
1,0.132116,[green tea]
2,0.076523,[low fat yogurt]
3,0.071457,[shrimp]
4,0.065858,[olive oil]
5,0.063325,[frozen smoothie]
6,0.04746,[honey]
7,0.042528,[salmon]
8,0.033329,[avocado]
9,0.031862,[cottage cheese]


Processing Time: 10043.52 milliseconds


In [None]:
# Please leave this cell empty - used for grading.

### g)
Using the itemsets identified by **FP-Growth**: Find all association rules in the data that have a **confidence of at least 0.3** and a **minimum lift of 1.2**. Create and show a dataframe `fp_association_rules` listing the antecedents, consequents, support, confidence, and lift of each of these discovered rules.

In [32]:
start_time = datetime.datetime.now()

# Find frequent itemsets using FP-Growth algorithm
fp_rules = arule(fp_frequent_itemsets, metric="confidence", min_threshold=0.3)
fp_filtered_rules = fp_rules[fp_rules['lift'] > 1.2]

end_time = datetime.datetime.now()

print("FP-Growth Association Rules:")
display(fp_filtered_rules)

processing_time_ms = (end_time - start_time).total_seconds() * 1000
print(f"Processing Time: {processing_time_ms:.2f} milliseconds")
display(filtered_rules)

FP-Growth Association Rules:


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(milk),(mineral water),0.129583,0.238368,0.047994,0.37037,1.553774,0.017105,1.20965,0.409465
1,(spaghetti),(mineral water),0.17411,0.238368,0.059725,0.343032,1.439085,0.018223,1.159314,0.369437
2,(frozen vegetables),(mineral water),0.095321,0.238368,0.035729,0.374825,1.572463,0.013007,1.21827,0.402413
3,(chocolate),(mineral water),0.163845,0.238368,0.05266,0.3214,1.348332,0.013604,1.122357,0.308965
4,(pancakes),(mineral water),0.095054,0.238368,0.033729,0.354839,1.488616,0.011071,1.180529,0.362712
5,(ground beef),(mineral water),0.098254,0.238368,0.040928,0.416554,1.747522,0.017507,1.305401,0.474369
6,(ground beef),(spaghetti),0.098254,0.17411,0.039195,0.398915,2.291162,0.022088,1.373997,0.624943


Processing Time: 11.55 milliseconds


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(chocolate),(mineral water),0.163845,0.238368,0.05266,0.3214,1.348332,0.013604,1.122357,0.308965
1,(ground beef),(mineral water),0.098254,0.238368,0.040928,0.416554,1.747522,0.017507,1.305401,0.474369
2,(milk),(mineral water),0.129583,0.238368,0.047994,0.37037,1.553774,0.017105,1.20965,0.409465
3,(spaghetti),(mineral water),0.17411,0.238368,0.059725,0.343032,1.439085,0.018223,1.159314,0.369437


In [None]:
# Please leave this cell empty - used for grading.

### h) 
You would like to compare the apriori algorithms and FP-Growth.

i) Both algorithms use the same data (transaction data) as an input and provide association rules as an output. How do the algorithms differ in the way they identify association rules?


__Student Answer:__ Apriori relies on a candidate generation-and-test strategy, iteratively generating and testing itemsets of increasing lengths. It scans the dataset multiple times to count the support of candidate itemsets and prune those that do not meet the minimum support threshold. In contrast, FP-Growth constructs an FP-tree data structure, representing the dataset efficiently without explicit candidate generation. It recursively builds the tree and mines frequent itemsets directly, requiring only a single pass over the data. FP-Growth tends to be more memory-efficient and faster than Apriori, especially for large datasets

ii) Consider your results of the previous tasks. Do the two algorithms provide the same association rules? Is this always the case?

__Student Answer:__ With the exception of 3 additional rules provided by FP-Growth, they do provide similar association rules. With small and simple datasets, you may observe more consistency between Apriori and FP-Growth. Lowering the minimum support threshold may result in more frequent itemsets, increasing the likelihood of differences between algorithms.

iii) Compare the processing time for finding the frequent itemsets tasks using the apriori algorithm and FG-Growth. What do you notice? Is this the result you expected? Briefly explain your answers.


__Student Answer:__ With 80 vs 10043.52 milliseconds, the Apriori algorithm is much faster. The result aligns with the general expectation that Apriori is often faster than FP-Growth for small and dense datasets. Apriori's performance is influenced by the number of candidate itemsets generated, and it may perform well when the number of frequent itemsets is relatively low. FP-Growth, while efficient for large and sparse datasets, can incur additional overhead due to the construction of the FP-tree and may require more memory.