## Frequent Itemsets Generation Using Apriori

Apriori is a classic algorithm for frequent itemset mining and association rule learning. It's often used in market basket analysis, but it can also be applied to text mining—for example, identifying frequently co-occurring words in a bag-of-words dataset.



### About the dataset:

The Bag of Words Data Set in the UCI Machine Learning Repository contains five text collections. We will be working with 3 text collections from the dataset, namely:
- **Enron emails**
- **NIPS full papers**
- **KOS blog entries**

For each collection, we have 2 files:
1. **vocab:** Lists all the words occurring in the given collection

2. **docword:** Lists out the number of times each word in vocab occurs in each document. 

The first 3 lines of docword are:
- D: number of documents in the collection
- W: number of words in vocab of the collection (size of vocab file)
- N: number of (docID, wordID) pairs with a nonzero count in the collection, which is also the number of data lines in docword (size of docword file - 3)

This is followed by N lines of the form 'docID wordID count' where *count* is the number of times the word with id *wordID* appears in the document with id *docID*.
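To make the file layout concrete, here is a minimal sketch that parses a docword-style file held in memory. The sample contents and the helper name `parse_docword` are made up for illustration; only the three-line header followed by 'docID wordID count' lines comes from the description above.

```python
import io

def parse_docword(stream):
    """Parse a docword-style stream into (D, W, N, triples)."""
    D = int(stream.readline())  # number of documents
    W = int(stream.readline())  # vocabulary size
    N = int(stream.readline())  # number of data lines that follow
    # Each remaining non-blank line is 'docID wordID count'
    triples = [tuple(map(int, line.split())) for line in stream if line.strip()]
    return D, W, N, triples

# Hypothetical toy file: 2 documents, 4 vocabulary words, 3 data lines
sample = io.StringIO("2\n4\n3\n1 1 2\n1 3 1\n2 4 5\n")
D, W, N, triples = parse_docword(sample)
print(D, W, N)
print(triples)
```

In this toy file, document 1 contains word 1 twice and word 3 once, and document 2 contains word 4 five times.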

### Process Overview

1. Loading the Dataset  
   - We read the docword file, which contains document-word mappings.
   - Each document is treated as a "transaction" with words as "items."

2. Building the Transaction Dataset
   - We construct a dictionary where keys are document IDs and values are lists of word IDs appearing in the document.

3. Generating Frequent Wordsets with Apriori 
   - F1 (1-itemsets): We count how many documents each word appears in and keep the words that meet the minimum support threshold.
   - F2, F3, ... (Higher-order itemsets): We use candidate generation and pruning to form larger word combinations.
   - The process continues up to K-itemsets.

4. Mapping Word IDs to Actual Words 
   - Using the vocab file, we translate word IDs into meaningful words.
   - We display frequent word combinations at different levels.
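As a small illustration of the "documents as transactions" idea, the sketch below counts support on four made-up documents. The data, the threshold, and the variable names are assumptions for illustration. Pairs are enumerated by brute force here, which is only practical on tiny data; Apriori avoids that enumeration through candidate generation and pruning.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions: document ID -> word IDs in that document
toy_transactions = {1: [1, 2, 3], 2: [1, 2], 3: [2, 3], 4: [1, 2, 3]}
min_supp = 3  # an itemset is frequent if it appears in at least 3 documents

# Support of a word = number of documents that contain it
item_support = Counter(w for words in toy_transactions.values() for w in set(words))
frequent_items = sorted(w for w, c in item_support.items() if c >= min_supp)

# Support of a pair = number of documents that contain both words
pair_support = Counter(
    pair
    for words in toy_transactions.values()
    for pair in combinations(sorted(set(words)), 2)
)
frequent_pairs = sorted(p for p, c in pair_support.items() if c >= min_supp)

print("Frequent items:", frequent_items)
print("Frequent pairs:", frequent_pairs)
```

Here the pair (1, 3) occurs in only two documents, so it is dropped even though both of its words are frequent on their own.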

### Creating a dictionary for each collection that maps each document ID to the word IDs it contains

KOS

In [1]:
kos_transactions={}
for i in range (1,3431):
    kos_transactions[i]=list()
print(kos_transactions) # kos_transactions is a dictionary with its values initialised as lists
print(len(kos_transactions)) 

{1: [], 2: [], 3: [], 4: [], 5: [], 6: [], 7: [], 8: [], 9: [], 10: [], 11: [], 12: [], 13: [], 14: [], 15: [], 16: [], 17: [], 18: [], 19: [], 20: [], 21: [], 22: [], 23: [], 24: [], 25: [], 26: [], 27: [], 28: [], 29: [], 30: [], 31: [], 32: [], 33: [], 34: [], 35: [], 36: [], 37: [], 38: [], 39: [], 40: [], 41: [], 42: [], 43: [], 44: [], 45: [], 46: [], 47: [], 48: [], 49: [], 50: [], 51: [], 52: [], 53: [], 54: [], 55: [], 56: [], 57: [], 58: [], 59: [], 60: [], 61: [], 62: [], 63: [], 64: [], 65: [], 66: [], 67: [], 68: [], 69: [], 70: [], 71: [], 72: [], 73: [], 74: [], 75: [], 76: [], 77: [], 78: [], 79: [], 80: [], 81: [], 82: [], 83: [], 84: [], 85: [], 86: [], 87: [], 88: [], 89: [], 90: [], 91: [], 92: [], 93: [], 94: [], 95: [], 96: [], 97: [], 98: [], 99: [], 100: [], 101: [], 102: [], 103: [], 104: [], 105: [], 106: [], 107: [], 108: [], 109: [], 110: [], 111: [], 112: [], 113: [], 114: [], 115: [], 116: [], 117: [], 118: [], 119: [], 120: [], 121: [], 122: [], 123: [], 

In [2]:
with open("docword.kos.txt","r") as f:    
    for line in f.readlines()[3:]:  # Skip first 3 header lines
        docID, wordID, _ = map(int, line.split())  # Extract document ID and word ID
        kos_transactions[docID].append(wordID)  # Add word to corresponding document
print(list(kos_transactions.items())[:5])  # Print first 5 transactions


[(1, [61, 76, 89, 211, 296, 335, 404, 441, 454, 463, 555, 593, 779, 841, 913, 983, 1116, 1140, 1206, 1219, 1263, 1266, 1267, 1297, 1298, 1316, 1323, 1434, 1534, 1535, 1683, 1715, 1837, 1901, 1919, 2033, 2101, 2111, 2275, 2403, 2640, 2701, 2742, 2953, 3005, 3007, 3112, 3117, 3142, 3219, 3238, 3282, 3310, 3350, 3399, 3420, 3452, 3516, 3534, 3581, 3708, 3745, 3806, 3873, 3929, 3973, 4113, 4143, 4196, 4301, 4347, 4489, 4497, 4560, 4565, 4712, 4735, 4861, 5004, 5017, 5114, 5156, 5185, 5189, 5241, 5262, 5287, 5408, 5517, 5561, 5728, 6021, 6429, 6613, 6622, 6659, 6662, 6689, 6724, 6732, 6815, 6867, 6869, 6905]), (2, [232, 463, 689, 707, 714, 841, 873, 879, 1063, 1116, 1187, 1497, 1510, 1535, 1587, 1940, 2111, 2494, 2640, 2758, 3070, 3187, 3350, 3420, 3422, 3874, 3913, 3968, 3991, 4232, 4235, 4281, 4290, 4293, 4579, 4599, 4604, 4639, 5190, 5229, 5524, 5582, 5810, 5843, 5894, 6134, 6299, 6434, 6511, 6606, 6621, 6655, 6689, 6707, 6715, 6867]), (3, [21, 40, 88, 116, 225, 264, 391, 434, 445, 463, 

NIPS

In [3]:
nips_transactions={}
for i in range (1,1501):
    nips_transactions[i]=list()
print(nips_transactions) # nips_transactions is a dictionary with its values initialised as lists
print(len(nips_transactions)) 

{1: [], 2: [], 3: [], 4: [], 5: [], 6: [], 7: [], 8: [], 9: [], 10: [], 11: [], 12: [], 13: [], 14: [], 15: [], 16: [], 17: [], 18: [], 19: [], 20: [], 21: [], 22: [], 23: [], 24: [], 25: [], 26: [], 27: [], 28: [], 29: [], 30: [], 31: [], 32: [], 33: [], 34: [], 35: [], 36: [], 37: [], 38: [], 39: [], 40: [], 41: [], 42: [], 43: [], 44: [], 45: [], 46: [], 47: [], 48: [], 49: [], 50: [], 51: [], 52: [], 53: [], 54: [], 55: [], 56: [], 57: [], 58: [], 59: [], 60: [], 61: [], 62: [], 63: [], 64: [], 65: [], 66: [], 67: [], 68: [], 69: [], 70: [], 71: [], 72: [], 73: [], 74: [], 75: [], 76: [], 77: [], 78: [], 79: [], 80: [], 81: [], 82: [], 83: [], 84: [], 85: [], 86: [], 87: [], 88: [], 89: [], 90: [], 91: [], 92: [], 93: [], 94: [], 95: [], 96: [], 97: [], 98: [], 99: [], 100: [], 101: [], 102: [], 103: [], 104: [], 105: [], 106: [], 107: [], 108: [], 109: [], 110: [], 111: [], 112: [], 113: [], 114: [], 115: [], 116: [], 117: [], 118: [], 119: [], 120: [], 121: [], 122: [], 123: [], 

In [4]:
with open("docword.nips.txt","r") as f:    
    for line in f.readlines()[3:]:  # Skip first 3 header lines
        docID, wordID, _ = map(int, line.split())  # Extract document ID and word ID
        nips_transactions[docID].append(wordID)  # Add word to corresponding document
print(list(nips_transactions.items())[:5])  # Print first 5 transactions


[(1, [2, 39, 42, 77, 95, 96, 105, 108, 133, 137, 140, 149, 155, 158, 169, 172, 316, 365, 389, 426, 428, 433, 437, 478, 518, 523, 532, 533, 540, 542, 550, 552, 574, 579, 639, 653, 654, 673, 675, 676, 695, 697, 698, 786, 822, 904, 937, 941, 954, 986, 987, 990, 1056, 1087, 1103, 1135, 1172, 1188, 1213, 1222, 1270, 1282, 1393, 1395, 1398, 1418, 1426, 1482, 1483, 1493, 1497, 1498, 1499, 1500, 1501, 1595, 1614, 1616, 1681, 1759, 1768, 1769, 1772, 1783, 1814, 1824, 1826, 1872, 1873, 1927, 1940, 1943, 1951, 1976, 1978, 2005, 2015, 2042, 2050, 2052, 2054, 2055, 2056, 2059, 2083, 2084, 2087, 2094, 2095, 2111, 2118, 2144, 2149, 2153, 2162, 2175, 2176, 2177, 2179, 2182, 2183, 2185, 2203, 2204, 2235, 2267, 2341, 2346, 2350, 2404, 2452, 2455, 2510, 2512, 2529, 2561, 2572, 2574, 2675, 2676, 2678, 2680, 2726, 2756, 2759, 2777, 2796, 2797, 2801, 2806, 2811, 2813, 2820, 2821, 2834, 2836, 2837, 2840, 2846, 2847, 2878, 2902, 2905, 2907, 2911, 2920, 2925, 2943, 2944, 2960, 2996, 3009, 3020, 3023, 3030, 305

ENRON

In [5]:
enron_transactions={}
for i in range (1,39862):
    enron_transactions[i]=list()
print(enron_transactions) # enron_transactions is a dictionary with its values initialised as lists
print(len(enron_transactions))

{1: [], 2: [], 3: [], 4: [], 5: [], 6: [], 7: [], 8: [], 9: [], 10: [], 11: [], 12: [], 13: [], 14: [], 15: [], 16: [], 17: [], 18: [], 19: [], 20: [], 21: [], 22: [], 23: [], 24: [], 25: [], 26: [], 27: [], 28: [], 29: [], 30: [], 31: [], 32: [], 33: [], 34: [], 35: [], 36: [], 37: [], 38: [], 39: [], 40: [], 41: [], 42: [], 43: [], 44: [], 45: [], 46: [], 47: [], 48: [], 49: [], 50: [], 51: [], 52: [], 53: [], 54: [], 55: [], 56: [], 57: [], 58: [], 59: [], 60: [], 61: [], 62: [], 63: [], 64: [], 65: [], 66: [], 67: [], 68: [], 69: [], 70: [], 71: [], 72: [], 73: [], 74: [], 75: [], 76: [], 77: [], 78: [], 79: [], 80: [], 81: [], 82: [], 83: [], 84: [], 85: [], 86: [], 87: [], 88: [], 89: [], 90: [], 91: [], 92: [], 93: [], 94: [], 95: [], 96: [], 97: [], 98: [], 99: [], 100: [], 101: [], 102: [], 103: [], 104: [], 105: [], 106: [], 107: [], 108: [], 109: [], 110: [], 111: [], 112: [], 113: [], 114: [], 115: [], 116: [], 117: [], 118: [], 119: [], 120: [], 121: [], 122: [], 123: [], 

In [6]:
with open("docword.enron.txt","r") as f:    
    for line in f.readlines()[3:]:  # Skip first 3 header lines
        docID, wordID, _ = map(int, line.split())  # Extract document ID and word ID
        enron_transactions[docID].append(wordID)  # Add word to corresponding document
print(list(enron_transactions.items())[:5])  # Print first 5 transactions


[(1, [118, 285, 1229, 1688, 2068, 5299, 6941, 7223, 8904, 9358, 9667, 9784, 11099, 11763, 12224, 12669, 13631, 14814, 14816, 17208, 17872, 18139, 19190, 20240, 23028, 23481, 23893, 25611, 27283, 27359]), (2, [46, 325, 536, 821, 1197, 1290, 1669, 1889, 2053, 2068, 2233, 3182, 3183, 4398, 4516, 4777, 4802, 4843, 4860, 4943, 5006, 5066, 5511, 5602, 6993, 7457, 8655, 8930, 9289, 9458, 9707, 9732, 9755, 9832, 10344, 11585, 11896, 11897, 12405, 12818, 13631, 14024, 14075, 14338, 14649, 15187, 15231, 16673, 17423, 17486, 17728, 17955, 18505, 18522, 19168, 19484, 19568, 19589, 19613, 19651, 19675, 19724, 19725, 20290, 20366, 20371, 20374, 20520, 20956, 20958, 21077, 21290, 21913, 22310, 22435, 22934, 23028, 23557, 24174, 24436, 24445, 24521, 25469, 25539, 26076, 26297, 26324]), (3, [232, 432, 534, 536, 734, 1229, 1583, 1686, 1976, 2226, 2284, 2368, 2504, 2557, 2969, 3286, 3328, 3450, 3854, 4255, 4848, 4849, 5247, 5299, 5693, 6696, 6993, 7270, 7392, 7608, 7867, 8159, 8243, 8940, 9707, 10730, 10
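
The three loading cells differ only in the file name and the hard-coded number of documents. One possible way to avoid the repetition is sketched below, assuming the document count can be read from the first header line instead of being typed in. The function name `load_transactions` and the temporary demo file are hypothetical.

```python
import os
import tempfile

def load_transactions(path):
    """Read a docword file into {docID: [wordID, ...]}, sized from the header."""
    with open(path, "r") as f:
        num_docs = int(next(f))  # D: number of documents
        next(f)                  # W: vocabulary size (not needed here)
        next(f)                  # N: number of data lines (not needed here)
        transactions = {doc_id: [] for doc_id in range(1, num_docs + 1)}
        for line in f:
            if not line.strip():
                continue  # tolerate a trailing blank line
            doc_id, word_id, _count = map(int, line.split())
            transactions[doc_id].append(word_id)
    return transactions

# Demo on a made-up file with 3 documents, one of which is empty
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("3\n5\n4\n1 2 1\n1 5 3\n3 1 2\n3 2 1\n")
    demo_path = tmp.name

demo_transactions = load_transactions(demo_path)
os.remove(demo_path)
print(demo_transactions)
```

With the real files, a call such as `load_transactions("docword.kos.txt")` would be expected to produce the same structure as `kos_transactions`.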

### Defining the candidate generation function

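Generates candidate itemsets (Ck) from the previous level's frequent itemsets (Fk-1) in two steps: a join of itemsets that share their first k-2 items, followed by a prune that discards any candidate with a (k-1)-subset that is not frequent.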


In [7]:
from itertools import combinations
from itertools import product


def candidate_gen(Fk_minus_1): #Fk_minus_1 is a list of frequent K-1 item sets that will be used to generate candidate K item sets
    Ck = set()  # Set to store candidate k-itemsets 
    
    for f1,f2 in product(Fk_minus_1,Fk_minus_1):            
        if f1[:-1] == f2[:-1] and f1[-1] < f2[-1]: # Ensure the first k-2 elements match and join only if the last element of f1 is smaller than f2's
            c = tuple(sorted(f1 + [f2[-1]])) # Merge and sort the new itemset
            Ck.add(c)
    
    # Prune invalid candidates by checking if all (k-1)-subsets exist in Fk-1
    valid_Ck = set()
    Fk_minus_1_set = {tuple(f) for f in Fk_minus_1} # Convert Fk-1 to a set for fast lookup
    
    for c in Ck:
        # Keep the candidate only if every one of its (k-1)-subsets exists in Fk-1
        if all(subset in Fk_minus_1_set for subset in combinations(c, len(c)-1)):
            valid_Ck.add(c)
    
    
    return valid_Ck
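
To see the join and prune steps on concrete numbers, here is a compact, self-contained sketch. It restates the same idea in a standalone helper (`join_and_prune` is an illustrative name, not the notebook's function) so it can be run on its own with a made-up F2.

```python
from itertools import combinations

def join_and_prune(prev_frequent):
    """Compact sketch of the join and prune steps on sorted tuples."""
    prev = sorted(tuple(sorted(f)) for f in prev_frequent)
    prev_set = set(prev)
    candidates = set()
    for a, b in combinations(prev, 2):  # a comes before b lexicographically
        if a[:-1] == b[:-1]:            # join: same first k-2 items
            c = a + (b[-1],)
            # prune: every (k-1)-subset of c must itself be frequent
            if all(s in prev_set for s in combinations(c, len(c) - 1)):
                candidates.add(c)
    return candidates

# Hypothetical frequent 2-itemsets
F2 = [[1, 2], [1, 3], [2, 3], [2, 4]]
C3 = join_and_prune(F2)
print(C3)
```

The join produces (1, 2, 3) and (2, 3, 4). The second is pruned because its subset (3, 4) is not in F2, so only (1, 2, 3) goes on to support counting.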


### Defining the Frequency Set Generation Function

Generates the frequent itemsets (Fk) from the candidate itemsets (C) by keeping those whose support, the number of transactions that contain the itemset, is at least minSupp.

In [8]:
def freqSetGen(C,minSupp,transactions): 
    Fk=[] # List to store frequent itemsets
    candidate_counts={} # Dictionary to count occurrences of each candidate itemset

    for transaction,wordSet in transactions.items():
        words = set(wordSet) # Build the set once per transaction instead of once per candidate
        for tuples in C:
            if words.issuperset(tuples): # Check if candidate itemset is present in the transaction
                candidate_counts[tuples] = candidate_counts.get(tuples, 0) + 1

    # Filter candidates that meet the minimum support threshold
    for ele in C:
        if(candidate_counts.get(ele,0)>=minSupp):
            Fk.append(list(ele))
    
    return Fk
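
The support-counting step can be checked by hand on a toy input. The sketch below is a standalone illustration with made-up data and an illustrative helper name (`count_support`). It converts each transaction to a set once and then counts, for each candidate, the transactions that contain it.

```python
def count_support(candidates, transactions):
    """Return {candidate: number of transactions containing every item in it}."""
    transaction_sets = [set(words) for words in transactions.values()]
    return {
        c: sum(1 for t in transaction_sets if t.issuperset(c))
        for c in candidates
    }

# Hypothetical toy data
toy_transactions = {1: [1, 2, 3], 2: [1, 2], 3: [2, 3], 4: [1, 2, 3]}
toy_candidates = {(1, 2), (1, 3), (2, 3)}

support = count_support(toy_candidates, toy_transactions)
frequent = sorted(c for c, n in support.items() if n >= 3)
print(support)
print(frequent)
```

With a threshold of 3, the candidate (1, 3) is rejected because it appears in only two of the four transactions.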

### Defining the Apriori Function

Implements the Apriori algorithm to generate frequent itemsets of sizes 1 through K from the transactions, where minSupp is an absolute number of transactions.

In [9]:
def apriori(transactions,minSupp,K):
    
    C1={} # Count occurrences of individual words (1-itemsets) across all transactions
    for transaction in list(transactions.items()):
        for wordID in set(transaction[1]):
            if wordID not in C1:
                C1[wordID]=1
            else:
                C1[wordID]+=1

    F1=[] # Filter 1-itemsets that meet the minimum support threshold
    for wordID,count in C1.items():
        if count>=minSupp:
            F1.append([wordID])

    print("F1",F1)
    F=[F1] # F is a list containing the frequent itemset lists, appended level by level
    
    k=2
    while k<=K:
        Ck=candidate_gen(F[k-2]) # Generate candidate k-itemsets from (k-1)-itemsets
        Fk=freqSetGen(Ck,minSupp,transactions) # Filter candidates based on support threshold
        F.append(Fk)
        print(f"F{k}",Fk)
        k=k+1
    
    return F
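
Because `minSupp` is an absolute document count, the same number means very different things for collections of different sizes. The helpers below are a hypothetical convenience, not part of the notebook, for converting between a fraction of the collection and an absolute count. The collection sizes are the ones used when building the transaction dictionaries above.

```python
import math

def to_absolute_support(fraction, num_documents):
    """Smallest document count that is at least `fraction` of the collection."""
    # round() guards against floating-point noise pushing ceil() one too high
    return math.ceil(round(fraction * num_documents, 9))

def to_relative_support(min_supp, num_documents):
    """Fraction of the collection that `min_supp` documents represent."""
    return min_supp / num_documents

collection_sizes = {"kos": 3430, "nips": 1500, "enron": 39861}

for name, size in collection_sizes.items():
    print(name, to_absolute_support(0.5, size))

print(round(to_relative_support(1500, collection_sizes["kos"]), 3))
```

For example, a threshold of 1500 documents is roughly 44% of the KOS collection but under 4% of the Enron collection.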


### WordMaps

In [10]:
kosWordMap={}

with open("vocab.kos.txt","r") as file:
    for wordID,word in enumerate(file):
        kosWordMap[wordID+1]=word.strip()
print(kosWordMap)
print(len(kosWordMap))
    

6906


In [11]:

nipsWordMap={}

with open("vocab.nips.txt","r") as file:
    for wordID,word in enumerate(file):
        nipsWordMap[wordID+1]=word.strip()
print(nipsWordMap)
print(len(nipsWordMap))
    

{1: 'a2i', 2: 'aaa', 3: 'aaai', 4: 'aapo', 5: 'aat', 6: 'aazhang', 7: 'abandonment', 8: 'abbott', 9: 'abbreviated', 10: 'abcde', 11: 'abe', 12: 'abeles', 13: 'abi', 14: 'abilistic', 15: 'abilities', 16: 'ability', 17: 'abl', 18: 'able', 19: 'ables', 20: 'ablex', 21: 'ably', 22: 'abnormal', 23: 'abort', 24: 'abound', 25: 'abramowicz', 26: 'abrash', 27: 'abrupt', 28: 'abruptly', 29: 'abscissa', 30: 'absence', 31: 'absent', 32: 'absolute', 33: 'absolutely', 34: 'absorb', 35: 'absorbed', 36: 'absorbing', 37: 'absorption', 38: 'abstr', 39: 'abstract', 40: 'abstracted', 41: 'abstraction', 42: 'abu', 43: 'abundances', 44: 'aca', 45: 'acad', 46: 'academic', 47: 'academy', 48: 'acc', 49: 'accelerate', 50: 'accelerated', 51: 'accelerating', 52: 'acceleration', 53: 'accelerator', 54: 'accent', 55: 'accept', 56: 'acceptable', 57: 'acceptably', 58: 'acceptance', 59: 'accepted', 60: 'accepting', 61: 'acceptor', 62: 'access', 63: 'accessed', 64: 'accessible', 65: 'accessing', 66: 'accommodate', 67: '

In [12]:

enronWordMap={}

with open("vocab.enron.txt","r") as file:
    for wordID,word in enumerate(file):
        enronWordMap[wordID+1]=word.strip()
print(enronWordMap)
print(len(enronWordMap))
    

28102


### Generating Frequent Item Sets for Our Collections

KOS

In [13]:
def printingKosResults(kos_result):   
    print("length of F1 item set", len(kos_result[0]))
    print("length of F2 item set",len(kos_result[1]))
    print("length of F3 item set",len(kos_result[2]))
    if len(kos_result)==4:
        print("length of F4 item set",len(kos_result[3]))
    for i, level in enumerate(kos_result, start=1):
        level_list = [[kosWordMap.get(wordID, f"Unknown({wordID})") for wordID in ele] for ele in level]
        print(f"Frequent {i}-itemsets:", level_list)
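
The translation from word IDs to words does not depend on the collection, so it can be sketched as a single helper that takes the word map as an argument. The name `itemsets_to_words` and the toy data are assumptions for illustration.

```python
def itemsets_to_words(levels, word_map):
    """Translate every itemset at every level from word IDs to words."""
    return [
        [[word_map.get(word_id, f"Unknown({word_id})") for word_id in itemset]
         for itemset in level]
        for level in levels
    ]

# Hypothetical word map and Apriori result (F1, F2, F3)
toy_word_map = {1: "bush", 2: "kerry", 3: "poll"}
toy_result = [[[1], [2], [3]], [[1, 2]], []]

translated = itemsets_to_words(toy_result, toy_word_map)
for k, level in enumerate(translated, start=1):
    print(f"Frequent {k}-itemsets ({len(level)}):", level)
```

A helper like this would also handle any number of levels, so the separate check for a fourth level would not be needed.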

In [14]:
kos_result1=apriori(kos_transactions,1500,4)

F1 [[2640], [841], [3420]]
F2 []
F3 []
F4 []


In [15]:
printingKosResults(kos_result1)

length of F1 item set 3
length of F2 item set 0
length of F3 item set 0
length of F4 item set 0
Frequent 1-itemsets: [['general'], ['bush'], ['kerry']]
Frequent 2-itemsets: []
Frequent 3-itemsets: []
Frequent 4-itemsets: []


In [16]:
kos_result2=apriori(kos_transactions,1000,4)

F1 [[6689], [2640], [841], [3420], [3005], [4632], [5186], [1664], [1666], [3858], [2030], [6296]]
F2 [[2640, 3420], [841, 2640], [841, 3420]]
F3 []
F4 []


In [17]:
printingKosResults(kos_result2)

length of F1 item set 12
length of F2 item set 3
length of F3 item set 0
length of F4 item set 0
Frequent 1-itemsets: [['war'], ['general'], ['bush'], ['kerry'], ['house'], ['poll'], ['republicans'], ['democratic'], ['democrats'], ['media'], ['election'], ['time']]
Frequent 2-itemsets: [['general', 'kerry'], ['bush', 'general'], ['bush', 'kerry']]
Frequent 3-itemsets: []
Frequent 4-itemsets: []


In [18]:
kos_result3=apriori(kos_transactions,500,4)

F1 [[6659], [6662], [6689], [4143], [5185], [2640], [89], [4196], [4735], [3282], [3350], [841], [3420], [3005], [3534], [879], [4632], [4635], [5186], [1664], [6785], [1666], [4761], [2719], [3858], [4941], [4942], [5552], [2000], [2030], [6296], [2031], [6880], [238], [847], [893], [4494], [5386], [4205], [894], [6881], [5891], [5896], [4442], [1433], [4247], [2703], [4627], [4093], [1580]]
F2 [[841, 5185], [1664, 6659], [841, 4627], [841, 2030], [2640, 3005], [3858, 4632], [1664, 1666], [841, 5552], [841, 3858], [841, 4761], [841, 4635], [4632, 5552], [3858, 6689], [3350, 3420], [4632, 4761], [4632, 4635], [841, 847], [3005, 5186], [1666, 3005], [841, 6296], [841, 5896], [1666, 6689], [879, 3420], [2640, 5186], [841, 3420], [841, 5891], [5186, 5552], [841, 6659], [3005, 4632], [4635, 4761], [2030, 3420], [3420, 4735], [2640, 4632], [3005, 6689], [2640, 4735], [3420, 6689], [841, 1666], [841, 3005], [2640, 6689], [1664, 3005], [1666, 5186], [1664, 6689], [841, 879], [2030, 3005], [84

In [19]:
printingKosResults(kos_result3)

length of F1 item set 50
length of F2 item set 101
length of F3 item set 47
length of F4 item set 5
Frequent 1-itemsets: [['vote'], ['voters'], ['war'], ['news'], ['republican'], ['general'], ['administration'], ['november'], ['president'], ['iraq'], ['john'], ['bush'], ['kerry'], ['house'], ['lead'], ['campaign'], ['poll'], ['polls'], ['republicans'], ['democratic'], ['win'], ['democrats'], ['primary'], ['governor'], ['media'], ['race'], ['races'], ['senate'], ['economy'], ['election'], ['time'], ['elections'], ['year'], ['american'], ['bushs'], ['candidate'], ['people'], ['running'], ['numbers'], ['candidates'], ['years'], ['state'], ['states'], ['party'], ['country'], ['oct'], ['gop'], ['political'], ['national'], ['dean']]
Frequent 2-itemsets: [['bush', 'republican'], ['democratic', 'vote'], ['bush', 'political'], ['bush', 'election'], ['general', 'house'], ['media', 'poll'], ['democratic', 'democrats'], ['bush', 'senate'], ['bush', 'media'], ['bush', 'primary'], ['bush', 'polls'],

In [20]:
kos_result4=apriori(kos_transactions,300,3)

F1 [[6659], [6662], [6689], [4143], [5185], [2640], [89], [4196], [4735], [3282], [2275], [3310], [3350], [841], [3929], [3420], [2403], [4497], [5017], [3005], [3534], [463], [6299], [2758], [3422], [4579], [879], [3070], [6660], [6661], [6664], [4116], [2582], [4632], [4635], [40], [2093], [4149], [6714], [5186], [5187], [4686], [3665], [3671], [88], [4194], [116], [1655], [1664], [6785], [1666], [6284], [4761], [2719], [2227], [4285], [5840], [735], [225], [737], [738], [4340], [5890], [2318], [3858], [5914], [3889], [1334], [1850], [3903], [5955], [5960], [3916], [4941], [4942], [848], [4945], [866], [4451], [869], [5487], [3458], [391], [6543], [3472], [2965], [5528], [1951], [6562], [4004], [5545], [5552], [6067], [6584], [5049], [1467], [4546], [2000], [6618], [987], [989], [990], [992], [1516], [2030], [2032], [5167], [6296], [4874], [6423], [4926], [4452], [2031], [2082], [6705], [6745], [4736], [4258], [6333], [6880], [238], [3344], [847], [893], [4494], [500], [5244], [5386]

In [21]:
printingKosResults(kos_result4)

length of F1 item set 186
length of F2 item set 3226
length of F3 item set 73625
Frequent 1-itemsets: [['vote'], ['voters'], ['war'], ['news'], ['republican'], ['general'], ['administration'], ['november'], ['president'], ['iraq'], ['fact'], ['ive'], ['john'], ['bush'], ['military'], ['kerry'], ['final'], ['percent'], ['real'], ['house'], ['lead'], ['aug'], ['times'], ['group'], ['kerrys'], ['place'], ['campaign'], ['ill'], ['voted'], ['voter'], ['voting'], ['needed'], ['function'], ['poll'], ['polls'], ['account'], ['endspan'], ['newwindow'], ['watchers'], ['republicans'], ['republicansforkerry'], ['powered'], ['locations'], ['login'], ['admin'], ['nov'], ['advertising'], ['dem'], ['democratic'], ['win'], ['democrats'], ['ticket'], ['primary'], ['governor'], ['experience'], ['openhttpwwwedwardsforprezcomdailykoshtml'], ['split'], ['boxblogroll'], ['altsite'], ['boxfeed_listing'], ['boxrdf_feeds'], ['ourcongressorg'], ['startspan'], ['faq'], ['media'], ['steal'], ['menu'], ['contact'],
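
The counts above hint at why lowering the threshold gets expensive. As a rough back-of-the-envelope sketch (an estimate, not a measurement): `freqSetGen` tests every candidate against every transaction, and every frequent 3-itemset must first have been a candidate. The variable names below are illustrative.

```python
# Figures taken from the run above (KOS, minSupp=300)
num_documents = 3430    # documents in the KOS collection
num_frequent_3 = 73625  # frequent 3-itemsets found

# Lower bound on the subset checks performed at level 3:
# at least one check per (candidate, document) pair.
min_subset_checks = num_frequent_3 * num_documents
print(f"{min_subset_checks:,}")  # 252,533,750
```

The true number is higher, since candidates that turn out not to be frequent are checked as well.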

NIPS

In [22]:
def printingNipsResults(nips_result):   
    print("length of F1 item set",len(nips_result[0]))
    print("length of F2 item set",len(nips_result[1]))
    print("length of F3 item set",len(nips_result[2]))
    if len(nips_result)==4:
        print("length of F4 item set",len(nips_result[3]))
    for i, level in enumerate(nips_result, start=1):
        level_list = [[nipsWordMap.get(wordID, f"Unknown({wordID})") for wordID in ele] for ele in level]
        print(f"Frequent {i}-itemsets:", level_list)

In [23]:
nips_result1=apriori(nips_transactions,1400,4)

F1 [[39], [9134], [9402]]
F2 [[39, 9134]]
F3 []
F4 []


In [24]:
printingNipsResults(nips_result1)

length of F1 item set 3
length of F2 item set 1
length of F3 item set 0
length of F4 item set 0
Frequent 1-itemsets: [['abstract'], ['references'], ['result']]
Frequent 2-itemsets: [['abstract', 'references']]
Frequent 3-itemsets: []
Frequent 4-itemsets: []


In [25]:
nips_result2=apriori(nips_transactions,1200,4)

F1 [[39], [4270], [8628], [11020], [7011], [9134], [9402], [7358], [7365], [5399], [7579], [5554], [10003], [5344]]
F2 [[39, 10003], [39, 9402], [5554, 7365], [7579, 9134], [7011, 9134], [5554, 9134], [4270, 11020], [7358, 7365], [7358, 9134], [39, 11020], [39, 7365], [39, 7011], [4270, 10003], [4270, 9402], [39, 9134], [39, 7358], [7365, 11020], [39, 4270], [4270, 7365], [7365, 9402], [7365, 10003], [10003, 11020], [4270, 9134], [4270, 7358], [4270, 5554], [4270, 7579], [39, 5554], [7365, 9134], [39, 7579], [9134, 9402], [9402, 10003], [9134, 11020], [7579, 9402], [7011, 9402], [5554, 9402], [9134, 10003], [9402, 11020], [7358, 9402]]
F3 [[39, 7579, 9402], [39, 5554, 9402], [4270, 7365, 9402], [4270, 9134, 11020], [39, 7358, 9402], [7358, 7365, 9402], [4270, 9134, 10003], [39, 9134, 10003], [39, 9134, 9402], [39, 7579, 9134], [39, 9402, 10003], [39, 7011, 9134], [39, 5554, 9134], [39, 7358, 7365], [39, 9134, 11020], [39, 7358, 9134], [7358, 7365, 9134], [39, 9402, 11020], [5554, 9134,

In [26]:
printingNipsResults(nips_result2)

length of F1 item set 14
length of F2 item set 38
length of F3 item set 36
length of F4 item set 7
Frequent 1-itemsets: [['abstract'], ['function'], ['problem'], ['system'], ['model'], ['references'], ['result'], ['network'], ['neural'], ['input'], ['number'], ['introduction'], ['set'], ['information']]
Frequent 2-itemsets: [['abstract', 'set'], ['abstract', 'result'], ['introduction', 'neural'], ['number', 'references'], ['model', 'references'], ['introduction', 'references'], ['function', 'system'], ['network', 'neural'], ['network', 'references'], ['abstract', 'system'], ['abstract', 'neural'], ['abstract', 'model'], ['function', 'set'], ['function', 'result'], ['abstract', 'references'], ['abstract', 'network'], ['neural', 'system'], ['abstract', 'function'], ['function', 'neural'], ['neural', 'result'], ['neural', 'set'], ['set', 'system'], ['function', 'references'], ['function', 'network'], ['function', 'introduction'], ['function', 'number'], ['abstract', 'introduction'], ['neu

In [27]:
nips_result3=apriori(nips_transactions,1000,4)

F1 [[39], [4146], [10305], [4270], [8370], [316], [8628], [2574], [11020], [7011], [11161], [9134], [9402], [7358], [7365], [5399], [7579], [5554], [1482], [11735], [7787], [7874], [7949], [10003], [7969], [6056], [10181], [6120], [6742], [6836], [5344], [11857], [10217], [8640]]
F2 [[6120, 7365], [7969, 9402], [2574, 7365], [5554, 11020], [8640, 9134], [39, 11857], [5399, 11020], [4270, 11020], [5554, 7358], [8628, 11020], [4270, 7969], [9134, 10217], [7358, 7365], [2574, 5554], [39, 10181], [5399, 9402], [4270, 9402], [39, 9134], [4146, 9134], [39, 2574], [316, 9402], [39, 11735], [5344, 11020], [10003, 11020], [5344, 7358], [6120, 7579], [39, 4146], [8370, 9402], [4270, 5344], [5554, 8628], [316, 10003], [8640, 9402], [316, 4270], [7358, 7579], [4270, 8628], [9134, 9402], [9134, 10305], [39, 6742], [2574, 11020], [4270, 7949], [6056, 9134], [39, 6836], [6120, 9402], [5554, 6120], [7365, 7874], [7579, 9402], [2574, 9402], [5344, 8628], [4270, 6120], [9134, 10003], [7358, 9402], [39, 

In [28]:
printingNipsResults(nips_result3)

length of F1 item set 34
length of F2 item set 179
length of F3 item set 354
length of F4 item set 362
Frequent 1-itemsets: [['abstract'], ['form'], ['small'], ['function'], ['point'], ['algorithm'], ['problem'], ['data'], ['system'], ['model'], ['term'], ['references'], ['result'], ['network'], ['neural'], ['input'], ['number'], ['introduction'], ['case'], ['university'], ['order'], ['output'], ['paper'], ['set'], ['parameter'], ['large'], ['simple'], ['learning'], ['mean'], ['method'], ['information'], ['values'], ['single'], ['processing']]
Frequent 2-itemsets: [['learning', 'neural'], ['parameter', 'result'], ['data', 'neural'], ['introduction', 'system'], ['processing', 'references'], ['abstract', 'values'], ['input', 'system'], ['function', 'system'], ['introduction', 'network'], ['problem', 'system'], ['function', 'parameter'], ['references', 'single'], ['network', 'neural'], ['data', 'introduction'], ['abstract', 'simple'], ['input', 'result'], ['function', 'result'], ['abstrac

In [66]:
nips_result4=apriori(nips_transactions,500,3)

F1 [[2052], [2054], [2055], [2056], [2059], [10254], [39], [2087], [2095], [4146], [10305], [4180], [6244], [6245], [105], [2175], [2179], [10368], [2182], [2204], [172], [4270], [10416], [8370], [10454], [4372], [2341], [316], [10593], [8550], [428], [8623], [6579], [8628], [8632], [8637], [10721], [10734], [8706], [2574], [532], [533], [540], [8737], [574], [10861], [2675], [2676], [639], [10904], [697], [4803], [2756], [2759], [4810], [8940], [2820], [11020], [2836], [2837], [2847], [9010], [822], [11078], [7011], [9059], [2943], [2944], [11161], [941], [9134], [11183], [5045], [11203], [3023], [11215], [990], [3057], [3080], [11282], [11297], [5177], [9296], [9303], [9306], [9310], [9336], [3215], [5263], [11417], [11418], [7349], [9402], [7358], [7365], [5352], [3308], [5370], [5399], [7484], [11603], [1395], [7579], [5554], [9649], [7621], [7623], [11719], [1482], [1483], [11735], [3555], [3563], [9718], [3603], [3628], [7766], [7787], [9840], [11901], [3741], [11933], [3750], [3

KeyboardInterrupt: 

In [None]:
printingNipsResults(nips_result4)

ENRON

In [29]:
def printingEnronResults(enron_result):   
    print("length of F1 item set",len(enron_result[0]))
    print("length of F2 item set",len(enron_result[1]))
    print("length of F3 item set",len(enron_result[2]))
    if len(enron_result)==4:
        print("length of F4 item set",len(enron_result[3]))
    for i, level in enumerate(enron_result, start=1):
        level_list = [[enronWordMap.get(wordID, f"Unknown({wordID})") for wordID in ele] for ele in level]
        print(f"Frequent {i}-itemsets:", level_list)

In [30]:
enron_result1=apriori(enron_transactions,8000,4)

F1 []
F2 []
F3 []
F4 []


In [31]:
printingEnronResults(enron_result1)

length of F1 item set 0
length of F2 item set 0
length of F3 item set 0
length of F4 item set 0
Frequent 1-itemsets: []
Frequent 2-itemsets: []
Frequent 3-itemsets: []
Frequent 4-itemsets: []


In [32]:
enron_result2=apriori(enron_transactions,4000,4)

F1 [[5299], [19190], [19484], [17486], [11897], [4802], [15231], [5511], [18978], [1583], [8243], [17481], [11468], [14559], [24860], [3450], [16267], [15617], [19592], [24463], [13091], [21289], [25181], [18848], [10027], [10636], [14735], [8201], [9970], [21011], [4865], [17292], [20481], [22459], [9867], [16213], [6182], [10933], [17761], [6106], [9460], [15738], [22535], [19793], [10543], [20480], [3284], [5921]]
F2 []
F3 []
F4 []


In [33]:
printingEnronResults(enron_result2)

length of F1 item set 48
length of F2 item set 0
length of F3 item set 0
length of F4 item set 0
Frequent 1-itemsets: [['contract'], ['power'], ['price'], ['office'], ['houston'], ['comment'], ['market'], ['cost'], ['point'], ['attached'], ['energy'], ['offer'], ['help'], ['list'], ['team'], ['california'], ['month'], ['meeting'], ['process'], ['sure'], ['issues'], ['review'], ['think'], ['plan'], ['friday'], ['going'], ['look'], ['end'], ['free'], ['report'], ['company'], ['number'], ['received'], ['send'], ['forward'], ['monday'], ['deal'], ['group'], ['order'], ['date'], ['find'], ['message'], ['service'], ['provide'], ['give'], ['receive'], ['business'], ['customer']]
Frequent 2-itemsets: []
Frequent 3-itemsets: []
Frequent 4-itemsets: []


In [34]:
enron_result3=apriori(enron_transactions,3000,4)

F1 [[285], [5299], [17208], [118], [19190], [2053], [536], [19484], [46], [17486], [11897], [4802], [11585], [6993], [24445], [15231], [5511], [5066], [18978], [1583], [8243], [27710], [17481], [19580], [25221], [11468], [14559], [15616], [24860], [13087], [3450], [23419], [16267], [19959], [19497], [9404], [25180], [15617], [19592], [24463], [13091], [21289], [25181], [18848], [1445], [10027], [11823], [19117], [21200], [10636], [14735], [8201], [9970], [21011], [27294], [4865], [25901], [22547], [11885], [20734], [7526], [18830], [13271], [4025], [17292], [20481], [23837], [22459], [10312], [9867], [12179], [16213], [6182], [21043], [12379], [10933], [5897], [18224], [17761], [10190], [9236], [25282], [18677], [6099], [6106], [22541], [9460], [15738], [22535], [12382], [289], [19793], [15656], [10543], [20480], [6650], [5463], [3284], [5921], [2426]]
F2 []
F3 []
F4 []


In [35]:
printingEnronResults(enron_result3)

length of F1 item set 100
length of F2 item set 0
length of F3 item set 0
length of F4 item set 0
Frequent 1-itemsets: [['additional'], ['contract'], ['note'], ['access'], ['power'], ['based'], ['agreement'], ['price'], ['able'], ['office'], ['houston'], ['comment'], ['high'], ['discuss'], ['support'], ['market'], ['cost'], ['conference'], ['point'], ['attached'], ['energy'], ['working'], ['offer'], ['problem'], ['thought'], ['help'], ['list'], ['meet'], ['team'], ['issue'], ['california'], ['soon'], ['month'], ['put'], ['prices'], ['file'], ['thing'], ['meeting'], ['process'], ['sure'], ['issues'], ['review'], ['think'], ['plan'], ['asked'], ['friday'], ['hope'], ['position'], ['result'], ['going'], ['look'], ['end'], ['free'], ['report'], ['wednesday'], ['company'], ['tuesday'], ['set'], ['hour'], ['regarding'], ['due'], ['place'], ['jeff'], ['change'], ['number'], ['received'], ['start'], ['send'], ['gas'], ['forward'], ['ill'], ['monday'], ['deal'], ['request'], ['include'], ['grou

In [36]:
enron_result4=apriori(enron_transactions,2000,3)

F1 [[2068], [285], [5299], [17208], [8904], [23028], [118], [19190], [14338], [9732], [2053], [536], [19484], [46], [9289], [17486], [11897], [2233], [4802], [19675], [11585], [20290], [6993], [24436], [24445], [25469], [15231], [5511], [5006], [25539], [5066], [18978], [15911], [1583], [8243], [5693], [27710], [17481], [19580], [25221], [24727], [11468], [14559], [7392], [232], [15616], [24860], [13087], [26433], [17735], [23380], [13690], [3450], [23419], [16267], [19860], [23956], [22432], [14254], [14777], [21467], [19959], [21151], [19497], [6446], [9404], [25180], [25447], [14325], [15617], [23809], [17667], [19592], [24463], [284], [13091], [21289], [22830], [1215], [18250], [12380], [16221], [25181], [15084], [15344], [18848], [1445], [10027], [2734], [11823], [17710], [19117], [21200], [15088], [15227], [10636], [14735], [23543], [21787], [8201], [9970], [25870], [21011], [25494], [27294], [5680], [11456], [22122], [4865], [25901], [19006], [16385], [22547], [11885], [24719], 

KeyboardInterrupt: 

In [None]:
printingEnronResults(enron_result4)