### **Data Mining Using Python**

<font color="red">File access required:</font> In Colab this notebook requires first uploading files **Shop.csv** and **Movies.csv** using the *Files* feature in the left toolbar. If running the notebook on a local computer, simply ensure these files are in the same workspace as the notebook.

In [1]:
# Set-up
import csv

**Look at CSV files:** TID,item pairs

In [2]:
# Read shopping dataset from CSV file
# Create dictionary "Sitems" with key = item and value = set of transactions
# Also set variable Snumtrans = number of transactions
Sitems = {}
trans = []  # list of transactions used to set Snumtrans
with open('Shop.csv') as f:
    rows = csv.DictReader(f)
    for r in rows:
        if r['item'] not in Sitems:
            Sitems[r['item']] = {r['TID']}
        else:
            Sitems[r['item']].add(r['TID'])
        if r['TID'] not in trans:
            trans.append(r['TID'])
Snumtrans = len(trans)
print('Number of transactions:', Snumtrans)
print('Number of distinct items:', len(Sitems))
print('Item dictionary:')
Sitems

Number of transactions: 5
Number of distinct items: 5
Item dictionary:


{'milk': {'1', '2', '4', '5'},
 'eggs': {'1', '3', '4'},
 'juice': {'1', '2', '5'},
 'cookies': {'2', '5'},
 'chips': {'3', '5'}}
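
The loading cell above checks `r['TID'] not in trans` against a list, which rescans the list for every row. A minimal sketch of the same idea with `collections.defaultdict` and a set of transaction IDs follows. The helper name `load_items` and the in-memory `sample_csv` are illustrative assumptions: the sample rows are reconstructed from the item dictionary printed above, not copied from the real **Shop.csv**.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for Shop.csv, rebuilt from the printed item dictionary
# (assumed columns: TID,item).
sample_csv = """TID,item
1,milk
1,eggs
1,juice
2,milk
2,juice
2,cookies
3,eggs
3,chips
4,milk
4,eggs
5,milk
5,juice
5,cookies
5,chips
"""

def load_items(f):
    """Return (items, numtrans): item -> set of TIDs, and the count of distinct TIDs."""
    items = defaultdict(set)  # a missing key starts as an empty set
    tids = set()              # set membership avoids rescanning a list
    for r in csv.DictReader(f):
        items[r['item']].add(r['TID'])
        tids.add(r['TID'])
    return dict(items), len(tids)

demo_items, demo_numtrans = load_items(io.StringIO(sample_csv))
print('Number of transactions:', demo_numtrans)
print('Number of distinct items:', len(demo_items))
```

With a real file, the same helper could be called as `load_items(open('Shop.csv'))` inside a `with` block, as in the cell above.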

In [3]:
# Read movies dataset from CSV file
# Create dictionary "Mitems" with key = item and value = set of transactions
# Also set variable Mnumtrans = number of transactions
Mitems = {}
trans = []  # list of transactions used to set Mnumtrans
with open('Movies.csv') as f:
    rows = csv.DictReader(f)
    for r in rows:
        if r['item'] not in Mitems:
            Mitems[r['item']] = {r['TID']}
        else:
            Mitems[r['item']].add(r['TID'])
        if r['TID'] not in trans:
            trans.append(r['TID'])
Mnumtrans = len(trans)
print('Number of transactions (users):', Mnumtrans)
print('Number of distinct items (movies):', len(Mitems))
print('Item dictionary:')
Mitems.items()

Number of transactions (users): 1382
Number of distinct items (movies): 123
Item dictionary:


dict_items([('The Fault in Our Stars', {'215987', '101420', '163775', '144712', '115516', '174077', '186450', '244116', '153533', '140380', '124830', '114602', '176', '127474', '121408', '89200', '12924', '165260', '68551', '229131', '31198', '121896', '18805', '232320', '167831', '173452', '171575', '105996', '204050', '15530', '87907', '25851', '96530', '39827', '6802', '128828', '102901', '152646', '158208', '221712', '14642', '102371', '120662', '29139', '240694', '4208', '41453', '71352', '234322', '50601', '150676', '78715', '176880', '241916', '47173', '241263', '174147', '158048', '218669', '206129', '237123', '210968', '214778', '15590', '6573', '171673', '100316', '7353', '3508', '62076', '184307', '105787', '233542', '239816', '71268', '38153', '126898', '166205', '36040', '145755', '200931', '134268', '69391', '87127', '151368', '191042', '69759', '72265', '28304', '55096', '56323', '34280', '46770', '218424', '206146', '152735', '92094', '98509', '43142', '61803', '111288'

### Some new Python features

In [4]:
# Iterating through dictionaries
for i in Sitems:
    print(i)
    print(Sitems[i])

milk
{'4', '1', '2', '5'}
eggs
{'4', '1', '3'}
juice
{'5', '1', '2'}
cookies
{'5', '2'}
chips
{'5', '3'}


In [5]:
# Intersecting sets
# How many transactions contain both eggs and milk?
set1 = Sitems['eggs']
print('Transactions containing eggs:', set1)
set2 = Sitems['milk']
print('Transactions containing milk:', set2)
set3 = set1 & set2
print('Transactions containing both:', set3)
# print('Number of transactions containing both:', len(set3))

Transactions containing eggs: {'4', '1', '3'}
Transactions containing milk: {'4', '1', '2', '5'}
Transactions containing both: {'4', '1'}


## Shopping dataset - frequent item-sets

### Frequent item-sets of two

#### Print all pairs of items and the number of transactions they occur together in (see what's wrong and fix it)

In [6]:
for i1 in Sitems:
    for i2 in Sitems:
        common = len(Sitems[i1] & Sitems[i2])
        print([i1, i2, common])

['milk', 'milk', 4]
['milk', 'eggs', 2]
['milk', 'juice', 3]
['milk', 'cookies', 2]
['milk', 'chips', 1]
['eggs', 'milk', 2]
['eggs', 'eggs', 3]
['eggs', 'juice', 1]
['eggs', 'cookies', 0]
['eggs', 'chips', 1]
['juice', 'milk', 3]
['juice', 'eggs', 1]
['juice', 'juice', 3]
['juice', 'cookies', 2]
['juice', 'chips', 1]
['cookies', 'milk', 2]
['cookies', 'eggs', 0]
['cookies', 'juice', 2]
['cookies', 'cookies', 2]
['cookies', 'chips', 1]
['chips', 'milk', 1]
['chips', 'eggs', 1]
['chips', 'juice', 1]
['chips', 'cookies', 1]
['chips', 'chips', 2]


#### Now only print pairs that meet support threshold

In [7]:
support = .3
for i1 in Sitems:
    for i2 in Sitems:
        if i1 < i2:
            common = len(Sitems[i1] & Sitems[i2])
            if common/Snumtrans > support:
                print(i1, '|', i2)

eggs | milk
juice | milk
cookies | milk
cookies | juice


### Frequent item-sets of three

In [8]:
support = .1
for i1 in Sitems:
    for i2 in Sitems:
        for i3 in Sitems:
            if i1 < i2 and i2 < i3:
                common = len(Sitems[i1] & Sitems[i2] & Sitems[i3])
                if common/Snumtrans > support:
                    print(i1, '|', i2, '|', i3)

eggs | juice | milk
cookies | juice | milk
chips | juice | milk
chips | cookies | milk
chips | cookies | juice
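
The nested loops with `i1 < i2 and i2 < i3` visit every ordered combination and then discard most of them. One possible generalization, sketched here under the assumption that item names are unique strings, uses `itertools.combinations` so that each item-set of size `k` is generated exactly once. The function name `frequent_itemsets` and the `shop_items` dictionary (copied from the output shown earlier) are illustrative, not part of the original notebook.

```python
from itertools import combinations

# Item dictionary copied from the Shop.csv output shown earlier
shop_items = {
    'milk': {'1', '2', '4', '5'},
    'eggs': {'1', '3', '4'},
    'juice': {'1', '2', '5'},
    'cookies': {'2', '5'},
    'chips': {'3', '5'},
}
shop_numtrans = 5

def frequent_itemsets(items, numtrans, k, support):
    """Return tuples (item_1, ..., item_k, count) whose support exceeds the threshold."""
    result = []
    # combinations() over the sorted names yields each k-item-set once,
    # which plays the role of the i1 < i2 < ... filter in the nested loops
    for combo in combinations(sorted(items), k):
        common = len(set.intersection(*(items[i] for i in combo)))
        if common / numtrans > support:
            result.append(combo + (common,))
    return result

for row in frequent_itemsets(shop_items, shop_numtrans, 3, 0.1):
    print(' | '.join(row[:-1]))
```

On the shopping data this finds the same five item-sets of three as the cell above, listed in alphabetical order rather than dictionary order.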


### <font color = 'green'>**Your Turn - Movies dataset frequent item-sets**</font>

In [9]:
print(Mnumtrans, 'transactions (users)')
print(len(Mitems), 'distinct items (movies)')

1382 transactions (users)
123 distinct items (movies)


#### Mine for frequent item-sets of three and four items in the Movies dataset. Find a single support threshold where the number of frequent item-sets of three items is more than 10 but less than 20, and the number of frequent item-sets of four items is more than 0.

**The support threshold is 0.03** (14 frequent item-sets of three and 3 frequent item-sets of four)

In [39]:
# Frequent item-sets of three
support_values = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]

for s in support_values:
    count = 0
    sets = []  # frequent item-sets of three found at this threshold

    for m1 in Mitems:
        for m2 in Mitems:
            for m3 in Mitems:
                if m1 < m2 and m2 < m3:
                    common = len(Mitems[m1] & Mitems[m2] & Mitems[m3])
                    if common/Mnumtrans > s:
                        count += 1
                        sets.append((m1, m2, m3, common))

    print(f"Support {s}: {count} frequent 3-itemsets")

    if 10 < count < 20:
        print(f"\nFound target support: {s}")
        print(f"\nFrequent 3-itemsets:")
        for m1, m2, m3, common in sets:
            print(f"{m1} | {m2} | {m3} ")
        break


Support 0.01: 81 frequent 3-itemsets
Support 0.02: 26 frequent 3-itemsets
Support 0.03: 14 frequent 3-itemsets

Found target support: 0.03

Frequent 3-itemsets:
Boyhood | Inside Out | The Imitation Game 
Boyhood | Gone Girl | The Imitation Game 
Boyhood | Gone Girl | Inside Out 
Boyhood | Fury | The Imitation Game 
Boyhood | Fury | Gone Girl 
Big Hero 6 | Boyhood | The Imitation Game 
Big Hero 6 | Boyhood | Gone Girl 
Big Hero 6 | Inside Out | The Imitation Game 
Big Hero 6 | Gone Girl | The Imitation Game 
Big Hero 6 | Gone Girl | Inside Out 
Big Hero 6 | Fury | The Imitation Game 
Big Hero 6 | Fury | Gone Girl 
Gone Girl | Inside Out | The Imitation Game 
Fury | Gone Girl | The Imitation Game 


**Support thresholds that yield at least one frequent item-set of four: 0.01, 0.02, 0.03, 0.04**

In [38]:
# Frequent item-sets of four
support_values = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]

for s in support_values:
    count = 0
    sets = []  # frequent item-sets of four found at this threshold
    for m1 in Mitems:
        for m2 in Mitems:
            for m3 in Mitems:
                for m4 in Mitems:
                    if m1 < m2 and m2 < m3 and m3 < m4:
                        common = len(Mitems[m1] & Mitems[m2] & Mitems[m3] & Mitems[m4])
                        if common/Mnumtrans > s:
                            count += 1
                            sets.append((m1, m2, m3, m4, common))
    print(f"Support {s}: {count} frequent 4-itemsets")

    if count > 0:
        print(f"\nFound target support: {s}")
        print(f"\nFrequent 4-itemsets:")
        for m1, m2, m3, m4, common in sets:
            print(f"{m1} | {m2} | {m3} | {m4} (appears in {common} transactions)")
            break  # show only the first item-set as an example; the outer loop continues



Support 0.01: 44 frequent 4-itemsets

Found target support: 0.01

Frequent 4-itemsets:
Boyhood | Inside Out | The Fault in Our Stars | The Imitation Game (appears in 19 transactions)
Support 0.02: 11 frequent 4-itemsets

Found target support: 0.02

Frequent 4-itemsets:
Boyhood | Gone Girl | Inside Out | The Imitation Game (appears in 42 transactions)
Support 0.03: 3 frequent 4-itemsets

Found target support: 0.03

Frequent 4-itemsets:
Boyhood | Gone Girl | Inside Out | The Imitation Game (appears in 42 transactions)
Support 0.04: 1 frequent 4-itemsets

Found target support: 0.04

Frequent 4-itemsets:
Big Hero 6 | Gone Girl | Inside Out | The Imitation Game (appears in 59 transactions)
Support 0.05: 0 frequent 4-itemsets
Support 0.06: 0 frequent 4-itemsets
Support 0.07: 0 frequent 4-itemsets
Support 0.08: 0 frequent 4-itemsets
Support 0.09: 0 frequent 4-itemsets
Support 0.1: 0 frequent 4-itemsets


## Shopping dataset - association rules

### Association rules with one item on the left-hand side

#### First compute frequent item-sets of one item, as candidate left-hand sides of association rules. Include the number of transactions the items occur in.

In [40]:
support = .5
frequentLHS = []
for i in Sitems:
    if len(Sitems[i])/Snumtrans > support:
        frequentLHS.append([i,len(Sitems[i])])
print(frequentLHS)

[['milk', 4], ['eggs', 3], ['juice', 3]]


#### Now find right-hand side items with sufficient confidence (see what's wrong and fix it)

In [41]:
# S -> i

confidence = .5
for lhs in frequentLHS:
    for i in Sitems:
        common = len(Sitems[lhs[0]] & Sitems[i])
        if common/lhs[1] > confidence:
            print(lhs[0], '->', i)

milk -> milk
milk -> juice
eggs -> milk
eggs -> eggs
juice -> milk
juice -> juice
juice -> cookies


### Association rules with two items on the left-hand side

#### First compute frequent item-sets of two items, as candidate left-hand sides of association rules. Include the number of transactions the items occur in.

In [42]:
# S = [JUICE, MILK]
support = .5
frequentLHS = []
for i1 in Sitems:
    for i2 in Sitems:
        if i1 < i2:
            common = len(Sitems[i1] & Sitems[i2])
            if common/Snumtrans > support:
                frequentLHS.append([i1,i2,common])
print(frequentLHS)

[['juice', 'milk', 3]]


#### Now find right-hand side items with sufficient confidence

In [43]:
confidence = .5
for lhs in frequentLHS:
    for i in Sitems:
        if i not in lhs:
            common = len(Sitems[lhs[0]] & Sitems[lhs[1]] & Sitems[i])
            if common/lhs[2] > confidence:
                print(lhs[0], '|', lhs[1], '->', i)

juice | milk -> cookies
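
The confidence test in the cells above divides the number of transactions containing the left-hand side and the candidate item by the number containing the left-hand side alone. A small sketch that isolates this calculation follows. The helper names `tids_of` and `confidence` are hypothetical, and `shop_items` is copied from the output shown earlier.

```python
# Item dictionary copied from the Shop.csv output shown earlier
shop_items = {
    'milk': {'1', '2', '4', '5'},
    'eggs': {'1', '3', '4'},
    'juice': {'1', '2', '5'},
    'cookies': {'2', '5'},
    'chips': {'3', '5'},
}

def tids_of(items, itemset):
    """Transactions that contain every item in itemset."""
    return set.intersection(*(items[i] for i in itemset))

def confidence(items, lhs, rhs):
    """Fraction of transactions containing all of lhs that also contain rhs."""
    lhs_tids = tids_of(items, lhs)
    if not lhs_tids:
        return 0.0  # assumption: treat an unsupported left-hand side as confidence 0
    return len(lhs_tids & items[rhs]) / len(lhs_tids)

print('juice | milk -> cookies  confidence:', confidence(shop_items, ['juice', 'milk'], 'cookies'))
print('milk -> juice  confidence:', confidence(shop_items, ['milk'], 'juice'))
```

For `juice | milk -> cookies`, three transactions contain both juice and milk and two of those also contain cookies, so the confidence is 2/3, which exceeds the 0.5 threshold used above.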


## Shopping dataset - association rules with lift instead of confidence

### Association rules with one item on the left-hand side

#### First compute frequent item-sets of one item, as candidate left-hand sides of association rules. Include the number of transactions the items occur in.

In [44]:
support = .5
frequentLHS = []
for i in Sitems:
    if len(Sitems[i])/Snumtrans > support:
        frequentLHS.append([i,len(Sitems[i])])
print(frequentLHS)

[['milk', 4], ['eggs', 3], ['juice', 3]]


#### Now find right-hand side items with sufficient lift

In [45]:
liftthresh = 1
for lhs in frequentLHS:
    for i in Sitems:
        if i not in lhs:
            common = len(Sitems[lhs[0]] & Sitems[i])
            lift = (common/lhs[1]) / (len(Sitems[i])/Snumtrans)
            if lift > liftthresh:
                print(lhs[0], '->', i, ' lift:', lift)

milk -> juice  lift: 1.25
milk -> cookies  lift: 1.25
juice -> milk  lift: 1.25
juice -> cookies  lift: 1.6666666666666665


### Association rules with two items on the left-hand side

#### First compute frequent item-sets of two items, as candidate left-hand sides of association rules. Include the number of transactions the items occur in.

In [46]:
support = .5
frequentLHS = []
for i1 in Sitems:
    for i2 in Sitems:
        if i1 < i2:
            common = len(Sitems[i1] & Sitems[i2])
            if common/Snumtrans > support:
                frequentLHS.append([i1,i2,common])
print(frequentLHS)

[['juice', 'milk', 3]]


#### Now find right-hand side items with sufficient lift

In [19]:
liftthresh = 1
for lhs in frequentLHS:
    for i in Sitems:
        if i not in lhs:
            common = len(Sitems[lhs[0]] & Sitems[lhs[1]] & Sitems[i])
            lift = (common/lhs[2]) / (len(Sitems[i])/Snumtrans)
            if lift > liftthresh:
                print(lhs[0], '|', lhs[1], '->', i, ' lift:', lift)

juice | milk -> cookies  lift: 1.6666666666666665
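
Lift, as computed in the cells above, is the rule's confidence divided by the overall fraction of transactions that contain the right-hand side item. A value above 1 means the item appears more often alongside the left-hand side than it does in general. The sketch below restates that calculation as a function. The name `lift` and the `shop_items` and `shop_numtrans` variables are illustrative, with the data copied from the output shown earlier.

```python
# Item dictionary copied from the Shop.csv output shown earlier
shop_items = {
    'milk': {'1', '2', '4', '5'},
    'eggs': {'1', '3', '4'},
    'juice': {'1', '2', '5'},
    'cookies': {'2', '5'},
    'chips': {'3', '5'},
}
shop_numtrans = 5

def lift(items, numtrans, lhs, rhs):
    """Confidence of lhs -> rhs divided by the overall frequency of rhs."""
    lhs_tids = set.intersection(*(items[i] for i in lhs))
    conf = len(lhs_tids & items[rhs]) / len(lhs_tids)  # assumes lhs occurs at least once
    rhs_freq = len(items[rhs]) / numtrans
    return conf / rhs_freq

print('juice | milk -> cookies  lift:', lift(shop_items, shop_numtrans, ['juice', 'milk'], 'cookies'))
print('eggs -> milk  lift:', lift(shop_items, shop_numtrans, ['eggs'], 'milk'))
```

The rule `eggs -> milk` has confidence 2/3, but milk appears in 4 of 5 transactions, so its lift is below 1 and it is filtered out by `liftthresh = 1` even though it passed the confidence test earlier.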


### <font color = 'green'>**Your Turn - Movies dataset association rules**</font>

#### Mine for association rules in the Movies dataset with three items on the left-hand side. Find support and confidence thresholds (need not be the same) so the number of association rules is more than 10 but less than 20.

In [59]:
# Association rules with three items on the left-hand side
# Hint: Make sure to include the code from the separate cells above that
#   together implement the two steps of association rule mining
support_values = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]
frequentLHS = []

for s in support_values:
    count = 0
    temp_freq = [] # for storing temporary frequentLHS of the found threshold

    for m1 in Mitems:
        for m2 in Mitems:
            for m3 in Mitems:
                if m1 < m2 and m2 < m3:
                    common = len(Mitems[m1] & Mitems[m2] & Mitems[m3])
                    if common/Mnumtrans > s:
                        count += 1
                        temp_freq.append([m1, m2, m3, common])

    print(f"Support {s}: {count} frequent 3-itemsets")

    if 10 < count < 20:
        print(f"\nFound target support: {s}")
        frequentLHS = temp_freq
        break

print(frequentLHS)

Support 0.01: 81 frequent 3-itemsets
Support 0.02: 26 frequent 3-itemsets
Support 0.03: 14 frequent 3-itemsets

Found target support: 0.03
[['Boyhood', 'Inside Out', 'The Imitation Game', 52], ['Boyhood', 'Gone Girl', 'The Imitation Game', 95], ['Boyhood', 'Gone Girl', 'Inside Out', 54], ['Boyhood', 'Fury', 'The Imitation Game', 43], ['Boyhood', 'Fury', 'Gone Girl', 43], ['Big Hero 6', 'Boyhood', 'The Imitation Game', 57], ['Big Hero 6', 'Boyhood', 'Gone Girl', 56], ['Big Hero 6', 'Inside Out', 'The Imitation Game', 102], ['Big Hero 6', 'Gone Girl', 'The Imitation Game', 119], ['Big Hero 6', 'Gone Girl', 'Inside Out', 85], ['Big Hero 6', 'Fury', 'The Imitation Game', 44], ['Big Hero 6', 'Fury', 'Gone Girl', 43], ['Gone Girl', 'Inside Out', 'The Imitation Game', 103], ['Fury', 'Gone Girl', 'The Imitation Game', 70]]


In [68]:
confidence_values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

for conf in confidence_values:
    counts = []  # rules that pass this confidence threshold

    for lhs in frequentLHS:
        m1, m2, m3, lhscount = lhs
        lhsitems = Mitems[m1] & Mitems[m2] & Mitems[m3]

        for rhs in Mitems:
            if rhs != m1 and rhs != m2 and rhs != m3:
                rhsitems = Mitems[rhs]
                common = len(lhsitems & rhsitems)

                if common/lhscount > conf:
                    counts.append([m1, m2, m3, rhs, common, lhscount])

    print(f"\nConfidence {conf}: {len(counts)} association rules")

    if 10 < len(counts) < 20:
        print(f"\nTarget confidence: {conf}")
        print(f"\nAssociation Rules (Support={0.03}, Confidence={conf}):")
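        # Each entry is [m1, m2, m3, rhs, common, lhscount]; the value printed as
        # "support" is the LHS transaction count, not a fraction of all transactions
        for count in counts:
            print(f"{count[0]}, {count[1]}, {count[2]} => {count[3]} (support: {count[5]}, confidence: {count[4]/count[5]:.3f})")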
        break


Confidence 0.1: 180 association rules

Confidence 0.2: 76 association rules

Confidence 0.3: 51 association rules

Confidence 0.4: 35 association rules

Confidence 0.5: 26 association rules

Confidence 0.6: 19 association rules

Target confidence: 0.6

Association Rules (Support=0.03, Confidence=0.6):
Boyhood, Inside Out, The Imitation Game => Big Hero 6 (support: 52, confidence: 0.673)
Boyhood, Inside Out, The Imitation Game => Gone Girl (support: 52, confidence: 0.808)
Boyhood, Gone Girl, Inside Out => Big Hero 6 (support: 54, confidence: 0.630)
Boyhood, Gone Girl, Inside Out => The Imitation Game (support: 54, confidence: 0.778)
Boyhood, Fury, The Imitation Game => Big Hero 6 (support: 43, confidence: 0.651)
Boyhood, Fury, The Imitation Game => Gone Girl (support: 43, confidence: 0.907)
Boyhood, Fury, Gone Girl => Big Hero 6 (support: 43, confidence: 0.628)
Boyhood, Fury, Gone Girl => The Imitation Game (support: 43, confidence: 0.907)
Big Hero 6, Boyhood, The Imitation Game => Ins

#### Mine for association rules in the Movies dataset with three items on the left-hand side. Find support and lift thresholds so the number of association rules is more than 10 but less than 20. Only consider lift thresholds > 1.


In [98]:
# Association rules with three items on the left-hand side


support_values = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]
frequentLHS = []
target_support = 0

for s in support_values:
    count = 0
    temp_freq = []

    for m1 in Mitems:
        for m2 in Mitems:
            for m3 in Mitems:
                if m1 < m2 and m2 < m3:
                    common = len(Mitems[m1] & Mitems[m2] & Mitems[m3])
                    if common/Mnumtrans > s:
                        count += 1
                        temp_freq.append([m1, m2, m3, common])

    print(f"Support {s}: {count} frequent 3-itemsets")

    if 10 < count < 20:
        print(f"\nFound target support: {s}")
        frequentLHS = temp_freq
        target_support = s
        break

print(f"\nFrequent LHS: {frequentLHS}\n")


lift_values = [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.5, 3.0, 3.5, 4.0]

for liftthresh in lift_values:
    count = 0
    rules = []

    for lhs in frequentLHS:
        for i in Mitems:
            if i not in [lhs[0], lhs[1], lhs[2]]:
                common = len(Mitems[lhs[0]] & Mitems[lhs[1]] & Mitems[lhs[2]] & Mitems[i])
                lift = (common/lhs[3]) / (len(Mitems[i])/Mnumtrans)
                if lift > liftthresh:
                    count += 1
                    rules.append((lhs[0], lhs[1], lhs[2], i, lift))

    print(f"Lift > {liftthresh}: {count} rules")

    if 10 < count < 20:
        print(f"\nFound target: Support={target_support}, Lift>{liftthresh}, Rules={count}\n")
        for rule in rules:
            print(f"{rule[0]} | {rule[1]} | {rule[2]} -> {rule[3]}  lift: {rule[4]:.3f}")
        break

Support 0.01: 81 frequent 3-itemsets
Support 0.02: 26 frequent 3-itemsets
Support 0.03: 14 frequent 3-itemsets

Found target support: 0.03

Frequent LHS: [['Boyhood', 'Inside Out', 'The Imitation Game', 52], ['Boyhood', 'Gone Girl', 'The Imitation Game', 95], ['Boyhood', 'Gone Girl', 'Inside Out', 54], ['Boyhood', 'Fury', 'The Imitation Game', 43], ['Boyhood', 'Fury', 'Gone Girl', 43], ['Big Hero 6', 'Boyhood', 'The Imitation Game', 57], ['Big Hero 6', 'Boyhood', 'Gone Girl', 56], ['Big Hero 6', 'Inside Out', 'The Imitation Game', 102], ['Big Hero 6', 'Gone Girl', 'The Imitation Game', 119], ['Big Hero 6', 'Gone Girl', 'Inside Out', 85], ['Big Hero 6', 'Fury', 'The Imitation Game', 44], ['Big Hero 6', 'Fury', 'Gone Girl', 43], ['Gone Girl', 'Inside Out', 'The Imitation Game', 103], ['Fury', 'Gone Girl', 'The Imitation Game', 70]]

Lift > 1.1: 778 rules
Lift > 1.2: 776 rules
Lift > 1.3: 774 rules
Lift > 1.4: 769 rules
Lift > 1.5: 765 rules
Lift > 1.6: 762 rules
Lift > 1.7: 756 rules
Lif