### **Data Mining Using Python**

<font color="red">File access required:</font> In Colab this notebook requires first uploading files **Shop.csv** and **Movies.csv** using the *Files* feature in the left toolbar. If running the notebook on a local computer, simply ensure these files are in the same workspace as the notebook.

In [None]:
# Set-up
import csv

**Look at CSV files:** TID,item pairs

In [None]:
# Read shopping dataset from CSV file
# Create dictionary "Sitems" with key = item and value = set of transactions
# Also set variable Snumtrans = number of transactions
Sitems = {}
trans = []  # list of transactions used to set Snumtrans
with open('Shop.csv') as f:
    rows = csv.DictReader(f)
    for r in rows:
        if r['item'] not in Sitems:
            Sitems[r['item']] = {r['TID']}
        else:
            Sitems[r['item']].add(r['TID'])
        if r['TID'] not in trans:
            trans.append(r['TID'])
Snumtrans = len(trans)
print('Number of transactions:', Snumtrans)
print('Number of distinct items:', len(Sitems))
print('Item dictionary:')
Sitems
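
The loading loop above builds an inverted index: each item maps to the set of transaction IDs that contain it. As a minimal sketch of the same pattern, the following runs on a small in-memory CSV so it can be tried without the data files. The sample rows and the names `sample_csv`, `items`, and `tids` are made up for illustration. It also tracks transaction IDs in a set rather than a list, which is an alternative to the cell above, not a change to it.

```python
import csv
import io

# Hypothetical sample data in the same TID,item layout as Shop.csv
sample_csv = """TID,item
1,milk
1,eggs
2,milk
3,eggs
3,juice
"""

items = {}    # key = item, value = set of TIDs containing that item
tids = set()  # distinct transaction IDs seen so far
for r in csv.DictReader(io.StringIO(sample_csv)):
    items.setdefault(r['item'], set()).add(r['TID'])
    tids.add(r['TID'])

numtrans = len(tids)
print('Number of transactions:', numtrans)
print('Number of distinct items:', len(items))
print('Transactions containing milk:', sorted(items['milk']))
```

With a set, the membership check and the insert are a single `add` call, so no explicit `if ... not in` test is needed.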

In [None]:
# Read movies dataset from CSV file
# Create dictionary "Mitems" with key = item and value = set of transactions
# Also set variable Mnumtrans = number of transactions
Mitems = {}
trans = []  # list of transactions used to set Mnumtrans
with open('Movies.csv') as f:
    rows = csv.DictReader(f)
    for r in rows:
        if r['item'] not in Mitems:
            Mitems[r['item']] = {r['TID']}
        else:
            Mitems[r['item']].add(r['TID'])
        if r['TID'] not in trans:
            trans.append(r['TID'])
Mnumtrans = len(trans)
print('Number of transactions (users):', Mnumtrans)
print('Number of distinct items (movies):', len(Mitems))
print('Item dictionary:')
Mitems.items()

### Some new Python features

In [None]:
# Iterating through dictionaries
for i in Sitems:
    print(i)
    print(Sitems[i])

In [None]:
# Intersecting sets
# How many transactions contain both eggs and milk?
set1 = Sitems['eggs']
print('Transactions containing eggs:', set1)
set2 = Sitems['milk']
print('Transactions containing milk:', set2)
set3 = set1 & set2
print('Transactions containing both:', set3)
# print('Number of transactions containing both:', len(set3))

## Shopping dataset - frequent item-sets

### Frequent item-sets of two

#### Print all pairs of items and the number of transactions they occur together in (see what's wrong and fix it)

In [None]:
for i1 in Sitems:
    for i2 in Sitems:
        common = len(Sitems[i1] & Sitems[i2])
        print([i1, i2, common])

#### Now only print pairs that meet support threshold

In [None]:
support = .3
for i1 in Sitems:
    for i2 in Sitems:
        if i1 < i2:
            common = len(Sitems[i1] & Sitems[i2])
            if common/Snumtrans > support:
                print(i1, '|', i2)

### Frequent item-sets of three

In [None]:
support = .1
for i1 in Sitems:
    for i2 in Sitems:
        for i3 in Sitems:
            if i1 < i2 and i2 < i3:
                common = len(Sitems[i1] & Sitems[i2] & Sitems[i3])
                if common/Snumtrans > support:
                    print(i1, '|', i2, '|', i3)
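
The nested loops with `i1 < i2 < i3` visit each unordered group of items exactly once. The same idea can be written for any group size with `itertools.combinations`. This is only a sketch: `frequent_itemsets` is a hypothetical helper name, and the toy dictionary stands in for `Sitems`.

```python
from itertools import combinations

def frequent_itemsets(items, numtrans, k, support):
    """Return (itemset, count) pairs for k-item sets whose support exceeds the threshold."""
    result = []
    # combinations() yields each unordered group of k distinct items once
    for combo in combinations(sorted(items), k):
        common = set.intersection(*(items[i] for i in combo))
        if len(common) / numtrans > support:
            result.append((combo, len(common)))
    return result

# Hypothetical toy data: item -> set of transaction IDs (4 transactions in total)
toy = {
    'eggs': {'1', '2', '3'},
    'milk': {'1', '2', '4'},
    'juice': {'2', '3', '4'},
}
print(frequent_itemsets(toy, 4, 2, .3))
print(frequent_itemsets(toy, 4, 3, .2))
```

On the real data the call would look like `frequent_itemsets(Sitems, Snumtrans, 3, .1)`, assuming `Sitems` and `Snumtrans` were set by the loading cell.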

### <font color = 'green'>**Your Turn - Movies dataset frequent item-sets**</font>

In [None]:
print(Mnumtrans, 'transactions (users)')
print(len(Mitems), 'distinct items (movies)')

#### Mine for frequent item-sets of three and four items in the Movies dataset. Find a single support threshold where the number of frequent item-sets of three items is more than 10 but less than 20, and the number of frequent item-sets of four items is more than 0.
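
One way to approach this search is to count frequent item-sets at several candidate thresholds and keep the thresholds that satisfy both conditions. The sketch below is hedged: `count_frequent` and `find_thresholds` are hypothetical helper names, the toy data is made up, and the bounds are parameters so that the toy run can use small values. For the exercise the bounds would be 10 and 20, applied to `Mitems` and `Mnumtrans`.

```python
from itertools import combinations

def count_frequent(items, numtrans, k, support):
    """Count k-item sets whose support is strictly above the threshold."""
    count = 0
    for combo in combinations(items, k):
        common = set.intersection(*(items[i] for i in combo))
        if len(common) / numtrans > support:
            count += 1
    return count

def find_thresholds(items, numtrans, candidates, low, high):
    """Return thresholds with low < #3-item sets < high and at least one 4-item set."""
    good = []
    for s in candidates:
        n3 = count_frequent(items, numtrans, 3, s)
        n4 = count_frequent(items, numtrans, 4, s)
        if low < n3 < high and n4 > 0:
            good.append(s)
    return good

# Hypothetical toy data: item -> set of transaction IDs (4 transactions in total)
toy = {
    'a': {'1', '2', '3', '4'},
    'b': {'1', '2', '3'},
    'c': {'1', '2', '4'},
    'd': {'1', '3', '4'},
}
print(find_thresholds(toy, 4, [0.1, 0.3, 0.6], low=1, high=5))
```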

In [11]:
import csv

# Read movies dataset from CSV file
Mitems = {}
trans = []  # list of transactions used to set Mnumtrans
with open('Movies.csv') as f:
    rows = csv.DictReader(f)
    for r in rows:
        if r['item'] not in Mitems:
            Mitems[r['item']] = {r['TID']}
        else:
            Mitems[r['item']].add(r['TID'])
        if r['TID'] not in trans:
            trans.append(r['TID'])
Mnumtrans = len(trans)
print('Number of transactions (users):', Mnumtrans)
print('Number of distinct items (movies):', len(Mitems))

# Check support of individual items
print("\nTop 10 most popular movies (by support):")
item_supports = []
for item, tids in Mitems.items():
    support = len(tids) / Mnumtrans
    item_supports.append((item, support, len(tids)))

item_supports.sort(key=lambda x: x[1], reverse=True)
for i, (item, support, count) in enumerate(item_supports[:10]):
    print(f"{i+1}. {item}: {count}/{Mnumtrans} = {support:.3f}")

# Check support of some pairs
print("\nChecking support of some pairs:")
items_list = list(Mitems.keys())
top_items = [item for item, _, _ in item_supports[:5]]

for i in range(len(top_items)):
    for j in range(i+1, len(top_items)):
        i1, i2 = top_items[i], top_items[j]
        common = len(Mitems[i1] & Mitems[i2])
        support = common / Mnumtrans
        print(f"{i1} & {i2}: {common}/{Mnumtrans} = {support:.3f}")

Number of transactions (users): 1382
Number of distinct items (movies): 123

Top 10 most popular movies (by support):
1. The Imitation Game: 677/1382 = 0.490
2. Gone Girl: 583/1382 = 0.422
3. Inside Out: 450/1382 = 0.326
4. Big Hero 6: 387/1382 = 0.280
5. Boyhood: 228/1382 = 0.165
6. Fury: 138/1382 = 0.100
7. The Fault in Our Stars: 111/1382 = 0.080
8. Louis C.K.: Live at The Comedy Store: 90/1382 = 0.065
9. Transformers: Age of Extinction: 69/1382 = 0.050
10. Wild Tales: 68/1382 = 0.049

Checking support of some pairs:
The Imitation Game & Gone Girl: 319/1382 = 0.231
The Imitation Game & Inside Out: 196/1382 = 0.142
The Imitation Game & Big Hero 6: 204/1382 = 0.148
The Imitation Game & Boyhood: 126/1382 = 0.091
Gone Girl & Inside Out: 161/1382 = 0.116
Gone Girl & Big Hero 6: 165/1382 = 0.119
Gone Girl & Boyhood: 154/1382 = 0.111
Inside Out & Big Hero 6: 188/1382 = 0.136
Inside Out & Boyhood: 72/1382 = 0.052
Big Hero 6 & Boyhood: 73/1382 = 0.053


In [12]:
# Check support for top 3 items together
top_3 = ["The Imitation Game", "Gone Girl", "Inside Out"]
common = len(Mitems[top_3[0]] & Mitems[top_3[1]] & Mitems[top_3[2]])
support = common / Mnumtrans
print(f"Support for {top_3[0]}, {top_3[1]}, {top_3[2]}: {common}/{Mnumtrans} = {support:.3f}")

# Check support for top 4 items together
top_4 = ["The Imitation Game", "Gone Girl", "Inside Out", "Big Hero 6"]
common = len(Mitems[top_4[0]] & Mitems[top_4[1]] & Mitems[top_4[2]] & Mitems[top_4[3]])
support = common / Mnumtrans
print(f"Support for {top_4[0]}, {top_4[1]}, {top_4[2]}, {top_4[3]}: {common}/{Mnumtrans} = {support:.3f}")

Support for The Imitation Game, Gone Girl, Inside Out: 103/1382 = 0.075
Support for The Imitation Game, Gone Girl, Inside Out, Big Hero 6: 59/1382 = 0.043


In [22]:
# Frequent item-sets of three
support = 0.03
frequent_3 = []
items_list = list(Mitems.keys())

for i in range(len(items_list)):
    for j in range(i+1, len(items_list)):
        for k in range(j+1, len(items_list)):
            i1, i2, i3 = items_list[i], items_list[j], items_list[k]
            common = len(Mitems[i1] & Mitems[i2] & Mitems[i3])
            if common / Mnumtrans > support:
                frequent_3.append((i1, i2, i3, common))

print(f"Support threshold: {support}")
print(f"Number of frequent 3-itemsets: {len(frequent_3)}")
print("Top 10 frequent 3-itemsets:")
frequent_3.sort(key=lambda x: x[3], reverse=True)
for i, itemset in enumerate(frequent_3[:10]):
    print(f"{i+1}. {itemset[0]} | {itemset[1]} | {itemset[2]} (support: {itemset[3]}/{Mnumtrans} = {itemset[3]/Mnumtrans:.3f})")

Support threshold: 0.03
Number of frequent 3-itemsets: 14
Top 10 frequent 3-itemsets:
1. Big Hero 6 | The Imitation Game | Gone Girl (support: 119/1382 = 0.086)
2. The Imitation Game | Inside Out | Gone Girl (support: 103/1382 = 0.075)
3. Big Hero 6 | The Imitation Game | Inside Out (support: 102/1382 = 0.074)
4. Boyhood | The Imitation Game | Gone Girl (support: 95/1382 = 0.069)
5. Big Hero 6 | Inside Out | Gone Girl (support: 85/1382 = 0.062)
6. The Imitation Game | Gone Girl | Fury (support: 70/1382 = 0.051)
7. Boyhood | Big Hero 6 | The Imitation Game (support: 57/1382 = 0.041)
8. Boyhood | Big Hero 6 | Gone Girl (support: 56/1382 = 0.041)
9. Boyhood | Inside Out | Gone Girl (support: 54/1382 = 0.039)
10. Boyhood | The Imitation Game | Inside Out (support: 52/1382 = 0.038)


In [24]:
# Frequent item-sets of four
support = 0.03  # same threshold as for the 3-item sets above
frequent_4 = []
items_list = list(Mitems.keys())

for i in range(len(items_list)):
    for j in range(i+1, len(items_list)):
        for k in range(j+1, len(items_list)):
            for l in range(k+1, len(items_list)):
                i1, i2, i3, i4 = items_list[i], items_list[j], items_list[k], items_list[l]
                common = len(Mitems[i1] & Mitems[i2] & Mitems[i3] & Mitems[i4])
                if common / Mnumtrans > support:
                    frequent_4.append((i1, i2, i3, i4, common))

print(f"Support threshold: {support}")
print(f"Number of frequent 4-itemsets: {len(frequent_4)}")
if frequent_4:
    print("All frequent 4-itemsets:")
    frequent_4.sort(key=lambda x: x[4], reverse=True)
    for i, itemset in enumerate(frequent_4):
        print(f"{i+1}. {itemset[0]} | {itemset[1]} | {itemset[2]} | {itemset[3]} (support: {itemset[4]}/{Mnumtrans} = {itemset[4]/Mnumtrans:.3f})")
else:
    print("No frequent 4-itemsets found.")

Support threshold: 0.03
Number of frequent 4-itemsets: 3
All frequent 4-itemsets:
1. Big Hero 6 | The Imitation Game | Inside Out | Gone Girl (support: 59/1382 = 0.043)
2. Boyhood | Big Hero 6 | The Imitation Game | Gone Girl (support: 46/1382 = 0.033)
3. Boyhood | The Imitation Game | Inside Out | Gone Girl (support: 42/1382 = 0.030)


## Shopping dataset - association rules

### Association rules with one item on the left-hand side

#### First compute frequent item-sets of one item, as candidate left-hand sides of association rules. Include the number of transactions the items occur in.

In [None]:
support = .5
frequentLHS = []
for i in Sitems:
    if len(Sitems[i])/Snumtrans > support:
        frequentLHS.append([i,len(Sitems[i])])
print(frequentLHS)

#### Now find right-hand side items with sufficient confidence (see what's wrong and fix it)

In [None]:
# S -> i

confidence = .5
for lhs in frequentLHS:
    for i in Sitems:
        common = len(Sitems[lhs[0]] & Sitems[i])
        if common/lhs[1] > confidence:
            print(lhs[0], '->', i)

### Association rules with two items on the left-hand side

#### First compute frequent item-sets of two items, as candidate left-hand sides of association rules. Include the number of transactions the items occur in.

In [None]:
# S = [JUICE, MILK]
support = .5
frequentLHS = []
for i1 in Sitems:
    for i2 in Sitems:
        if i1 < i2:
            common = len(Sitems[i1] & Sitems[i2])
            if common/Snumtrans > support:
                frequentLHS.append([i1,i2,common])
print(frequentLHS)

#### Now find right-hand side items with sufficient confidence

In [None]:
confidence = .5
for lhs in frequentLHS:
    for i in Sitems:
        if i not in lhs:
            common = len(Sitems[lhs[0]] & Sitems[lhs[1]] & Sitems[i])
            if common/lhs[2] > confidence:
                print(lhs[0], '|', lhs[1], '->', i)
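
The confidence of a rule is the fraction of transactions containing every left-hand-side item that also contain the right-hand-side item, which is what `common/lhs[2]` computes above. A small sketch with made-up data follows; `confidence` is a hypothetical helper, not part of the notebook.

```python
def confidence(items, lhs, rhs):
    """Confidence of lhs -> rhs: share of LHS transactions that also contain rhs."""
    lhs_trans = set.intersection(*(items[i] for i in lhs))
    if not lhs_trans:
        return 0.0  # the LHS never occurs, so there is no evidence for the rule
    return len(lhs_trans & items[rhs]) / len(lhs_trans)

# Hypothetical toy data: item -> set of transaction IDs
toy = {
    'juice': {'1', '2', '3', '4'},
    'milk': {'1', '2', '3'},
    'eggs': {'1', '2', '5'},
}
print(confidence(toy, ['juice', 'milk'], 'eggs'))  # 2 of the 3 juice+milk transactions
print(confidence(toy, ['milk'], 'juice'))          # every milk transaction has juice
```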

## Shopping dataset - association rules with lift instead of confidence

### Association rules with one item on the left-hand side

#### First compute frequent item-sets of one item, as candidate left-hand sides of association rules. Include the number of transactions the items occur in.

In [None]:
support = .5
frequentLHS = []
for i in Sitems:
    if len(Sitems[i])/Snumtrans > support:
        frequentLHS.append([i,len(Sitems[i])])
print(frequentLHS)

#### Now find right-hand side items with sufficient lift

In [None]:
liftthresh = 1
for lhs in frequentLHS:
    for i in Sitems:
        if i not in lhs:
            common = len(Sitems[lhs[0]] & Sitems[i])
            lift = (common/lhs[1]) / (len(Sitems[i])/Snumtrans)
            if lift > liftthresh:
                print(lhs[0], '->', i, ' lift:', lift)

### Association rules with two items on the left-hand side

#### First compute frequent item-sets of two items, as candidate left-hand sides of association rules. Include the number of transactions the items occur in.

In [None]:
support = .5
frequentLHS = []
for i1 in Sitems:
    for i2 in Sitems:
        if i1 < i2:
            common = len(Sitems[i1] & Sitems[i2])
            if common/Snumtrans > support:
                frequentLHS.append([i1,i2,common])
print(frequentLHS)

#### Now find right-hand side items with sufficient lift

In [None]:
liftthresh = 1
for lhs in frequentLHS:
    for i in Sitems:
        if i not in lhs:
            common = len(Sitems[lhs[0]] & Sitems[lhs[1]] & Sitems[i])
            lift = (common/lhs[2]) / (len(Sitems[i])/Snumtrans)
            if lift > liftthresh:
                print(lhs[0], '|', lhs[1], '->', i, ' lift:', lift)
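
Lift divides a rule's confidence by the overall frequency of the right-hand-side item, as in the `lift = ...` lines above. A value above 1 means the item appears more often alongside the left-hand side than in the dataset as a whole; a value below 1 means it appears less often. The sketch below uses made-up data, and `lift` is a hypothetical helper name.

```python
def lift(items, numtrans, lhs, rhs):
    """Lift of lhs -> rhs: confidence divided by the overall support of rhs."""
    lhs_trans = set.intersection(*(items[i] for i in lhs))
    if not lhs_trans or not items[rhs]:
        return 0.0  # avoid dividing by zero when either side never occurs
    conf = len(lhs_trans & items[rhs]) / len(lhs_trans)
    return conf / (len(items[rhs]) / numtrans)

# Hypothetical toy data with 5 transactions in total (IDs '1' through '5')
toy = {
    'juice': {'1', '2', '3', '4'},
    'milk': {'1', '2', '3'},
    'eggs': {'1', '2', '5'},
}
print(lift(toy, 5, ['juice', 'milk'], 'eggs'))  # above 1: positive association
print(lift(toy, 5, ['eggs'], 'juice'))          # below 1: negative association
```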

### <font color = 'green'>**Your Turn - Movies dataset association rules**</font>

#### Mine for association rules in the Movies dataset with three items on the left-hand side. Find support and confidence thresholds (need not be the same) so the number of association rules is more than 10 but less than 20.

In [26]:
# Association rules with three items on the left-hand side
# Hint: Make sure to include the code from the separate cells above that
#   together implement the two steps of association rule mining

# Step 1: Find frequent 3-itemsets as candidate LHS
support = 0.04  # minimum support for a 3-item left-hand side
confidence = 0.4  # minimum confidence for a rule

# Find frequent 3-itemsets
frequentLHS = []
items_list = list(Mitems.keys())

for i in range(len(items_list)):
    for j in range(i+1, len(items_list)):
        for k in range(j+1, len(items_list)):
            i1, i2, i3 = items_list[i], items_list[j], items_list[k]
            common = len(Mitems[i1] & Mitems[i2] & Mitems[i3])
            if common / Mnumtrans > support:
                frequentLHS.append([i1, i2, i3, common])

print(f"Number of frequent 3-itemsets (LHS candidates): {len(frequentLHS)}")

# Step 2: Generate association rules with sufficient confidence
association_rules = []
for lhs in frequentLHS:
    i1, i2, i3, lhs_support_count = lhs[0], lhs[1], lhs[2], lhs[3]
    
    # For each possible RHS item not in LHS
    for item in items_list:
        if item not in (i1, i2, i3):
            # Support of LHS ∪ RHS
            union_support = len(Mitems[i1] & Mitems[i2] & Mitems[i3] & Mitems[item])
            
            # Confidence = P(RHS|LHS) = support(LHS ∪ RHS) / support(LHS)
            if lhs_support_count > 0:
                conf = union_support / lhs_support_count
                if conf > confidence:
                    rule_support = union_support / Mnumtrans
                    association_rules.append((i1, i2, i3, item, union_support, conf, rule_support))

print(f"\nSupport threshold: {support}")
print(f"Confidence threshold: {confidence}")
print(f"Number of association rules found: {len(association_rules)}")

if len(association_rules) > 0:
    print("\nAssociation rules (sorted by confidence):")
    association_rules.sort(key=lambda x: x[5], reverse=True)  # Sort by confidence
    
    # Show all rules if ≤ 20, otherwise show top 20
    display_count = min(len(association_rules), 20)
    for i in range(display_count):
        rule = association_rules[i]
        print(f"{i+1}. {rule[0]} | {rule[1]} | {rule[2]} -> {rule[3]}")
        print(f"   Support: {rule[4]}/{Mnumtrans} = {rule[6]:.3f}, Confidence: {rule[5]:.3f}")
else:
    print("No association rules found with current thresholds.")
    
# Hint at how to adjust thresholds to get more than 10 but fewer than 20 rules
if len(association_rules) <= 10:
    print("\nToo few rules. Try lowering confidence or support.")
elif len(association_rules) >= 20:
    print(f"\nToo many rules ({len(association_rules)}). Try increasing confidence or support.")


Number of frequent 3-itemsets (LHS candidates): 8

Support threshold: 0.04
Confidence threshold: 0.4
Number of association rules found: 17

Association rules (sorted by confidence):
1. Boyhood | Big Hero 6 | Gone Girl -> The Imitation Game
   Support: 46/1382 = 0.033, Confidence: 0.821
2. Boyhood | Big Hero 6 | The Imitation Game -> Gone Girl
   Support: 46/1382 = 0.033, Confidence: 0.807
3. Big Hero 6 | Inside Out | Gone Girl -> The Imitation Game
   Support: 59/1382 = 0.043, Confidence: 0.694
4. Boyhood | Big Hero 6 | The Imitation Game -> Inside Out
   Support: 35/1382 = 0.025, Confidence: 0.614
5. Boyhood | Big Hero 6 | Gone Girl -> Inside Out
   Support: 34/1382 = 0.025, Confidence: 0.607
6. Big Hero 6 | The Imitation Game | Inside Out -> Gone Girl
   Support: 59/1382 = 0.043, Confidence: 0.578
7. The Imitation Game | Inside Out | Gone Girl -> Big Hero 6
   Support: 59/1382 = 0.043, Confidence: 0.573
8. The Imitation Game | Gone Girl | Fury -> Boyhood
   Support: 39/1382 = 0.028, 

#### Mine for association rules in the Movies dataset with three items on the left-hand side. Find support and lift thresholds so the number of association rules is more than 10 but less than 20. Only consider lift thresholds > 1.


In [94]:
# Association rules with three items on the left-hand side using LIFT
# Step 1: Find frequent 3-itemsets as candidate LHS

support = 0.08  # minimum support for a 3-item left-hand side
lift_thresh = 5  # minimum lift for a rule (must be > 1)

print("=" * 60)
print(f"Using support={support:.3f}, lift={lift_thresh:.2f}")
print("=" * 60)

# Find frequent 3-itemsets
frequentLHS = []
items_list = list(Mitems.keys())

for i in range(len(items_list)):
    for j in range(i+1, len(items_list)):
        for k in range(j+1, len(items_list)):
            i1, i2, i3 = items_list[i], items_list[j], items_list[k]
            common = len(Mitems[i1] & Mitems[i2] & Mitems[i3])
            if common / Mnumtrans > support:
                frequentLHS.append([i1, i2, i3, common])

print(f"Number of frequent 3-itemsets: {len(frequentLHS)}")

# Generate association rules
association_rules = []
for lhs in frequentLHS:
    i1, i2, i3, lhs_support_count = lhs[0], lhs[1], lhs[2], lhs[3]
    
    for item in items_list:
        if item not in (i1, i2, i3):
            union_support = len(Mitems[i1] & Mitems[i2] & Mitems[i3] & Mitems[item])
            
            if union_support > 0:
                conf = union_support / lhs_support_count
                rhs_support = len(Mitems[item])
                lift = conf / (rhs_support / Mnumtrans) if rhs_support > 0 else 0
                
                if lift > lift_thresh:
                    rule_support = union_support / Mnumtrans
                    association_rules.append((i1, i2, i3, item, union_support, conf, lift, rule_support))

print(f"Number of association rules: {len(association_rules)}")

if len(association_rules) > 0:
    print("\nAll association rules (sorted by lift):")
    association_rules.sort(key=lambda x: x[6], reverse=True)
    
    for i, rule in enumerate(association_rules):
        print(f"{i+1}. {rule[0]} | {rule[1]} | {rule[2]} -> {rule[3]}")
        print(f"   Support: {rule[4]}/{Mnumtrans} = {rule[7]:.3f}, Confidence: {rule[5]:.3f}, Lift: {rule[6]:.3f}")

Using support=0.080, lift=5.00
Number of frequent 3-itemsets: 1
Number of association rules: 17

All association rules (sorted by lift):
1. Big Hero 6 | The Imitation Game | Gone Girl -> Flowers in the Attic
   Support: 1/1382 = 0.001, Confidence: 0.008, Lift: 11.613
2. Big Hero 6 | The Imitation Game | Gone Girl -> Whitey: United States of America v. James J. Bulger
   Support: 1/1382 = 0.001, Confidence: 0.008, Lift: 11.613
3. Big Hero 6 | The Imitation Game | Gone Girl -> The Wonders
   Support: 1/1382 = 0.001, Confidence: 0.008, Lift: 11.613
4. Big Hero 6 | The Imitation Game | Gone Girl -> Action Jackson
   Support: 1/1382 = 0.001, Confidence: 0.008, Lift: 11.613
5. Big Hero 6 | The Imitation Game | Gone Girl -> Breathe
   Support: 1/1382 = 0.001, Confidence: 0.008, Lift: 11.613
6. Big Hero 6 | The Imitation Game | Gone Girl -> Court
   Support: 1/1382 = 0.001, Confidence: 0.008, Lift: 11.613
7. Big Hero 6 | The Imitation Game | Gone Girl -> The Humbling
   Support: 2/1382 = 0.001
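
In the output above, the top rules each rest on a single transaction (1/1382) yet reach a lift above 11, because the right-hand-side movie is itself rare. One possible refinement, which the exercise does not ask for, is to also require a minimum number of transactions for the whole rule. This is a hedged sketch: `rules_by_lift` and its `min_count` parameter are hypothetical, and the toy data is made up.

```python
from itertools import combinations

def rules_by_lift(items, numtrans, k, support, lift_thresh, min_count=1):
    """Rules with k LHS items, lift above lift_thresh, and at least min_count transactions."""
    rules = []
    for lhs in combinations(items, k):
        lhs_trans = set.intersection(*(items[i] for i in lhs))
        if len(lhs_trans) / numtrans <= support:
            continue  # left-hand side is not frequent enough
        for rhs in items:
            if rhs in lhs:
                continue
            both = len(lhs_trans & items[rhs])
            if both == 0 or both < min_count:
                continue  # rule is backed by too few transactions
            lift = (both / len(lhs_trans)) / (len(items[rhs]) / numtrans)
            if lift > lift_thresh:
                rules.append((lhs, rhs, both, lift))
    return rules

# Hypothetical toy data with 10 transactions in total
toy = {
    'a': {'1', '2', '3', '4', '5', '6'},
    'b': {'1', '2', '3', '4', '5'},
    'c': {'1', '2', '3', '7'},
    'rare': {'1'},
}
for lhs, rhs, count, lift in rules_by_lift(toy, 10, 2, .4, 1):
    print(lhs, '->', rhs, f'count: {count}, lift: {lift:.3f}')
print(len(rules_by_lift(toy, 10, 2, .4, 1, min_count=2)), 'rule(s) with min_count=2')
```

In the toy run the rule pointing at `rare` has the highest lift but is backed by one transaction, and raising `min_count` to 2 removes it.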