### 2.1 Finding Co-occurrence Patterns of Points of Interest (POI)

Co-occurrence: different POI categories appearing together in the same grid cell.

Idea: 

Given grid cells (x, y), each recording **what** POIs it contains and **how many**, we can mine co-occurrence, which answers the question:

"Which combinations of **different** POIs commonly appear together in one grid cell?"


### Illustrating with an example: 

Grid cell (x1, y1) has categories: restaurant, park, gym

Grid cell (x2, y2) has categories: restaurant, park

Grid cell (x3, y3) has categories: restaurant, cafe


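Running Apriori with a minimum support count of 2 (the combination must appear in at least 2 grid cells) gives:

Frequent itemset = {restaurant, park}, because restaurant and park appear together in two grid cells ((x1, y1) and (x2, y2)). The single items {restaurant} and {park} are frequent as well, but no other combination of two or more categories reaches the threshold.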
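The example can be checked by counting directly. The sketch below is not the Apriori implementation used later in this notebook; it only counts every pair of categories across the three example cells and keeps the pairs that reach a support count of 2. The names `grid_cells`, `pair_counts` and `frequent_pairs` are illustrative.

```python
from collections import Counter
from itertools import combinations

# The three example grid cells, each treated as one transaction
grid_cells = [
    {"restaurant", "park", "gym"},
    {"restaurant", "park"},
    {"restaurant", "cafe"},
]
min_support_count = 2

# Count in how many cells each pair of categories appears together
pair_counts = Counter()
for cell in grid_cells:
    for pair in combinations(sorted(cell), 2):
        pair_counts[pair] += 1

# Keep only the pairs seen in at least min_support_count cells
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_support_count}
print(frequent_pairs)
```

Counting all pairs like this is fine for three cells; Apriori avoids the blow-up on real data by only extending itemsets that are already frequent.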


### Terminologies 

- Transaction / Basket = an entry in the database (here, one grid cell)

- Itemset I = a set of items that appear together

- Support of itemset I = the number of transactions that contain I (the code below divides this count by the total number of transactions)

- Frequent itemset = an itemset I with support(I) >= min_sup

From a frequent itemset such as {A,B,C,D}, we can generate association rules: 

- Confidence of an association rule I -> j = P(j | I) = support(I ∪ {j}) / support(I)
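As a minimal sketch of these definitions, the helpers below compute relative support (a fraction of transactions, as the notebook's code does) and confidence on the three grid cells from the example. The function names `support` and `confidence` are illustrative and are not the ones used in the implementation further down.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) = support(A | C) / support(A)."""
    joint = frozenset(antecedent) | frozenset(consequent)
    return support(joint, transactions) / support(antecedent, transactions)

# The three grid cells from the example, one transaction each
transactions = [
    frozenset({"restaurant", "park", "gym"}),
    frozenset({"restaurant", "park"}),
    frozenset({"restaurant", "cafe"}),
]

print(support({"restaurant", "park"}, transactions))       # 2 of 3 cells
print(confidence({"park"}, {"restaurant"}, transactions))  # every park cell has a restaurant
print(confidence({"restaurant"}, {"park"}, transactions))  # 2 of 3 restaurant cells have a park
```

Note that confidence is not symmetric: park -> restaurant always holds in this toy data, while restaurant -> park holds in only two of the three restaurant cells.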


### Determining Frequent Itemsets 
```pseudo
if support(I) >= min_sup:
    I is a frequent itemset
else:
    I is not frequent

```
### Determining Association Rules  

```pseudo
for each frequent itemset {A, B, ...}:
    generate a candidate rule A -> B

    conf(A -> B) = P(AB) / P(A) = support({A, B}) / support({A})

    if conf(A -> B) >= min_conf:
        A -> B is an association rule

    repeat for the other implications (A -> C, A -> D, ..., AB -> C, ...)
```
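The rule-generation loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used below: `rules_from_itemset` is a hypothetical helper, and the `support` table holds the relative supports from the three-cell example (restaurant in 3 of 3 cells, park in 2 of 3, both together in 2 of 3).

```python
from itertools import combinations

# Relative supports taken from the three-cell example
support = {
    frozenset({"restaurant"}): 1.0,
    frozenset({"park"}): 2 / 3,
    frozenset({"restaurant", "park"}): 2 / 3,
}

def rules_from_itemset(itemset, support, min_conf):
    """Yield (antecedent, consequent, confidence) for each qualifying split."""
    itemset = frozenset(itemset)
    # Every non-empty proper subset can serve as the antecedent
    for size in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), size)):
            conf = support[itemset] / support[antecedent]
            if conf >= min_conf:
                yield antecedent, itemset - antecedent, conf

rules = list(rules_from_itemset({"restaurant", "park"}, support, min_conf=0.6))
for antecedent, consequent, conf in rules:
    print(set(antecedent), "->", set(consequent), round(conf, 3))
```

The lookup `support[antecedent]` relies on every subset of a frequent itemset being frequent itself, which is what lets Apriori reuse the counts it has already collected.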



In [7]:
from itertools import chain, combinations
from collections import defaultdict
import pandas as pd
import csv

### Reformatting CSV 

- Treat each unique grid cell as a transaction / basket 

- Collect all POIs that fall in that grid cell into one row => this row becomes a single transaction 
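To make the grouping concrete, here is a small standard-library sketch of the same idea. The `raw_csv` text is made-up sample data in the `x, y, category` layout that `preprocess_data` reads, and `cells` is an illustrative name; the notebook itself does this step with a pandas `groupby`.

```python
import csv
import io
from collections import defaultdict

# Made-up sample input: one row per POI, located by its grid cell (x, y)
raw_csv = """x,y,category
1,1,restaurant
1,1,park
1,1,gym
2,5,restaurant
2,5,cafe
"""

# Group categories by grid cell; each cell becomes one transaction
cells = defaultdict(list)
for row in csv.DictReader(io.StringIO(raw_csv)):
    cells[(int(row["x"]), int(row["y"]))].append(row["category"])

for cell, categories in sorted(cells.items()):
    print(cell, categories)
```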

In [8]:
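def preprocess_data(input_file, output_file):
    """Collapse a POI table with columns x, y, category into one row per grid cell."""
    data = pd.read_csv(input_file)
    # One row per grid cell (x, y), holding the list of categories found there
    grouped = data.groupby(['x', 'y'])['category'].apply(list).reset_index()
    # Pad every list to the length of the longest one so the rows form a rectangular table
    max_pois = grouped['category'].apply(len).max()
    poi_columns = grouped['category'].apply(lambda pois: [str(poi) for poi in pois] + [None]*(max_pois - len(pois)))
    poi_df = pd.DataFrame(poi_columns.tolist(), columns=[f'poi{i+1}' for i in range(max_pois)])
    result = pd.concat([grouped[['x', 'y']], poi_df], axis=1)
    # Padding values are written as empty fields
    result.to_csv(output_file, index=False, na_rep='')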

In [None]:
from itertools import chain, combinations
from collections import defaultdict

def generate_subsets(item_list):
    """Returns non-empty subsets of item_list."""
    return chain(*[combinations(item_list, i + 1) for i, item in enumerate(item_list)])

def filter_items_by_support(candidate_itemsets, transactions, min_support, item_support_counts):
    """Calculates the support for items in candidate_itemsets and returns a subset
    that meets the minimum support threshold."""
    qualified_itemsets = set()
    local_item_counts = defaultdict(int)

    for itemset in candidate_itemsets:
        for transaction in transactions:
            if itemset.issubset(transaction):
                item_support_counts[itemset] += 1
                local_item_counts[itemset] += 1

    for itemset, count in local_item_counts.items():
        support = float(count) / len(transactions)
        if support >= min_support:
            qualified_itemsets.add(itemset)

    return qualified_itemsets

def merge_itemsets(itemsets, length):
    """Joins a set with itself and returns the merged n-element itemsets."""
    return set(
        [i.union(j) for i in itemsets for j in itemsets if len(i.union(j)) == length]
    )

def get_initial_itemsets_and_transactions(data_iterator):
    transactions = []
    initial_itemsets = set()
    for record in data_iterator:
        transaction = frozenset(record)
        transactions.append(transaction)
        for item in transaction:
            initial_itemsets.add(frozenset([item]))  # Generate 1-itemsets
    return initial_itemsets, transactions

def apriori_algorithm(data_iter, min_support, min_confidence):
    """
    Executes the Apriori algorithm. data_iter is a record iterator.
    Returns:
     - frequent_itemsets (tuple, support)
     - association_rules ((antecedent, consequent), confidence)
    """
    initial_itemsets, transactions = get_initial_itemsets_and_transactions(data_iter)

    item_support_counts = defaultdict(int)
    frequent_itemsets_by_length = dict()

    # Level 1: frequent single items
    one_itemsets = filter_items_by_support(initial_itemsets, transactions, min_support, item_support_counts)

    current_itemsets = one_itemsets
    k = 2
    while current_itemsets:
        # Record the frequent (k-1)-itemsets, then join them into k-item candidates
        frequent_itemsets_by_length[k - 1] = current_itemsets
        candidate_itemsets = merge_itemsets(current_itemsets, k)
        # Keep only the candidates that meet the minimum support
        current_itemsets = filter_items_by_support(
            candidate_itemsets, transactions, min_support, item_support_counts
        )
        k += 1

    def calculate_support(itemset):
        """Local function to return the support of an itemset."""
        return float(item_support_counts[itemset]) / len(transactions)

    frequent_itemsets = []
    for itemset_length, itemsets in frequent_itemsets_by_length.items():
        frequent_itemsets.extend([(tuple(itemset), calculate_support(itemset)) for itemset in itemsets])

    association_rules = []
    for itemset_length, itemsets in list(frequent_itemsets_by_length.items())[1:]:
        for itemset in itemsets:
            itemset_subsets = map(frozenset, [x for x in generate_subsets(itemset)])
            for subset in itemset_subsets:
                remaining_items = itemset.difference(subset)
                if remaining_items:
                    confidence = calculate_support(itemset) / calculate_support(subset)
                    if confidence >= min_confidence:
                        association_rules.append(((tuple(subset), tuple(remaining_items)), confidence))
    return frequent_itemsets, association_rules

def display_results(frequent_itemsets, association_rules):
    """Displays the frequent itemsets sorted by support and the association rules sorted by confidence."""
    for itemset, support in sorted(frequent_itemsets, key=lambda x: x[1]):
        print("Itemset: %s , %.3f" % (str(itemset), support))
    print("\n------------------------ ASSOCIATION RULES:")
    for rule, confidence in sorted(association_rules, key=lambda x: x[1]):
        antecedent, consequent = rule
        print("Rule: %s ==> %s , %.3f" % (str(antecedent), str(consequent), confidence))

def format_results_to_strings(frequent_itemsets, association_rules):
    """Formats the frequent itemsets and association rules into strings sorted by support and confidence."""
    itemset_strings, rule_strings = [], []
    for itemset, support in sorted(frequent_itemsets, key=lambda x: x[1]):
        itemset_str = "Itemset: %s , %.3f" % (str(itemset), support)
        itemset_strings.append(itemset_str)

    for rule, confidence in sorted(association_rules, key=lambda x: x[1]):
        antecedent, consequent = rule
        rule_str = "Rule: %s ==> %s , %.3f" % (str(antecedent), str(consequent), confidence)
        rule_strings.append(rule_str)

    return itemset_strings, rule_strings

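### Preparing the CSV for Apriori

The code above expects a CSV in which each line lists the items of one transaction:

Transaction 1 items

Transaction 2 items 
... 

The transformed CSV is rewritten into this form by dropping the header row and the `x`, `y` columns.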

In [10]:
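def prepare_data_for_apriori(input_file, output_file):
    """Write one comma-separated line of POIs per grid cell, dropping x and y."""
    with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
        next(infile)  # skip the header row

        for line in infile:
            parts = line.strip().split(',')

            # Columns 0 and 1 are x and y; empty fields are padding
            pois = [poi.strip('"') for poi in parts[2:] if poi]
            pois_line = ','.join(pois)
            outfile.write(f"{pois_line}\n")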

In [None]:
def dataFromFile(fname):
    """Function which reads from the file and yields a generator"""
    with open(fname, "r") as file_iter: 
        for line in file_iter:
            line = line.strip().rstrip(",")  
            record = frozenset(line.split(","))  
            yield record

def run_apriori_and_save_results(input_file, min_support, min_confidence, dataset_label):
    inFile = dataFromFile(input_file)
    items, rules = apriori_algorithm(inFile, min_support, min_confidence)
    

    with open(f'output/itemsets_{dataset_label}.csv', 'w', newline='') as items_file:
        writer = csv.writer(items_file)
        writer.writerow(['Itemset', 'Support'])
        for item, support in items:
            writer.writerow([', '.join(item), support])

    with open(f'output/rules_{dataset_label}.csv', 'w', newline='') as rules_file:
        writer = csv.writer(rules_file)
        writer.writerow(['Rule', 'Confidence'])
        for (pre, post), confidence in rules:
            rule = f"{', '.join(pre)} => {', '.join(post)}"
            writer.writerow([rule, confidence])

In [None]:
def main(input_files, min_support, min_confidence):
    for input_file in input_files:
        dataset_label = input_file.split('_')[-1].replace('.csv', '')  # Extract 'cityA', 'cityB', 'cityC', or 'cityD'
        
        transformed_file = f'data/transformed_{dataset_label}.csv'
        preprocess_data(input_file, transformed_file)
        
        prepared_file = f'data/prepared_data_{dataset_label}.csv'
        prepare_data_for_apriori(transformed_file, prepared_file)
        
        run_apriori_and_save_results(prepared_file, min_support, min_confidence, dataset_label)

input_files = [
    'data/POIdata_cityA.csv',
    'data/POIdata_cityB.csv',
    'data/POIdata_cityC.csv',
    'data/POIdata_cityD.csv'
]
min_support = 0.10
min_confidence = 0.6

main(input_files, min_support, min_confidence)


From the mined rules, observe that Cities B and D yield about 40 rules, while Cities A and C yield about 20k rules. 

To pick the best association rules, select the top K association rules by confidence.
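Because `apriori_algorithm` returns rules as `((antecedent, consequent), confidence)` tuples, the top K could also be taken in memory before anything is written to disk. The sketch below uses `heapq.nlargest` for this; the `rules` list is made-up placeholder data, and `top_k_rules` is a hypothetical helper that is not part of the notebook.

```python
import heapq

# Made-up rules in the ((antecedent, consequent), confidence) shape
rules = [
    ((("park",), ("restaurant",)), 1.00),
    ((("restaurant",), ("park",)), 0.67),
    ((("gym",), ("park",)), 0.80),
    ((("cafe",), ("restaurant",)), 0.90),
]

def top_k_rules(rules, k):
    """Return the k rules with the highest confidence, best first."""
    return heapq.nlargest(k, rules, key=lambda rule: rule[1])

for (antecedent, consequent), conf in top_k_rules(rules, 2):
    print(antecedent, "=>", consequent, conf)
```

With around 20k rules per city either approach is cheap; `nlargest` simply avoids sorting the full list when K is small.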

In [None]:
import pandas as pd
import os

input_files = [
    'output/rules_cityA.csv',
    'output/rules_cityB.csv',
    'output/rules_cityC.csv',
    'output/rules_cityD.csv'
]

output_dir = 'top_output'
os.makedirs(output_dir, exist_ok=True) 

for file_path in input_files:
   
    df = pd.read_csv(file_path)
    
    # Keep the K = 20 rules with the highest confidence
    top_k_rules = df.sort_values(by="Confidence", ascending=False).head(20)
    
    city_name = os.path.basename(file_path).replace("rules_", "top_rules")  # e.g. 'rules_cityA.csv' -> 'top_rulescityA.csv'
    output_file = os.path.join(output_dir, city_name)

    top_k_rules.to_csv(output_file, index=False)
    
    print(f"Saved top 10 rules for {file_path} to {output_file}")

Saved top 10 rules for output/rules_cityA.csv to top_output\top_rulescityA.csv
Saved top 10 rules for output/rules_cityB.csv to top_output\top_rulescityB.csv
Saved top 10 rules for output/rules_cityC.csv to top_output\top_rulescityC.csv
Saved top 10 rules for output/rules_cityD.csv to top_output\top_rulescityD.csv
