# Frequent Pattern Mining

## Introduction
Frequent Pattern Mining (FPM) finds items that occur together often in a dataset; here, the items are words. For this project, we applied FPM to a collection of tweets to uncover recurring themes related to emergencies.

## Preprocessing
Before running the FPM algorithm, we cleaned and prepared the dataset:
1. **Cleaned the Text**: Removed URLs, non-letter characters (punctuation and digits), and English stopwords, and converted the text to lowercase to focus on meaningful words.
2. **Handled Missing Data**: Replaced missing values in the `keyword` and `location` columns with empty strings.
3. **Created Transactions**: Tokenized each tweet into words and appended its `keyword` and `location` values to form one transaction per tweet.

In [31]:
! pip install nltk
! pip install mlxtend==0.23.1

In [32]:
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

In [33]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [34]:
# Function to clean text
def clean_text(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Keep only letters and whitespace (drops punctuation and digits)
    text = text.lower()  # Convert to lowercase
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text

# Load the dataset
file_path = 'tweetsv2.csv'  # Adjust path to your file
data = pd.read_csv(file_path)

# Replace missing 'keyword' and 'location' values with empty strings
data['keyword'] = data['keyword'].fillna('').astype(str)
data['location'] = data['location'].fillna('').astype(str)

# Apply text cleaning to the 'text' column
data['cleaned_text'] = data['text'].apply(clean_text)

# Tokenize and prepare transactions
data['tokens'] = data['cleaned_text'].apply(lambda x: x.split())
data['transactions'] = data.apply(lambda row: row['tokens'] + [row['keyword'], row['location']], axis=1)
data['transactions'] = data['transactions'].apply(lambda x: [item for item in x if item and item.lower() != 'nan'])

# Remove rows with empty transactions
data = data[data['transactions'].apply(len) > 0]
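
To make the transaction format concrete, here is a minimal sketch that applies the same steps to one made-up tweet. The tweet, the helper name `clean_tokens_demo`, and the tiny hard-coded stopword set (a stand-in for NLTK's English list) are assumptions for illustration, not part of the project data.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption for this demo)
DEMO_STOP_WORDS = {"the", "is", "in", "a", "of"}

def clean_tokens_demo(text):
    """Strip URLs and non-letters, lowercase, and drop stopwords."""
    text = re.sub(r"http\S+", "", text)       # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", "", text)   # keep letters and whitespace only
    return [w for w in text.lower().split() if w not in DEMO_STOP_WORDS]

# Hypothetical row with a missing location (already replaced by an empty string)
tweet = {"text": "Fire in the hills! http://t.co/abc", "keyword": "wildfire", "location": ""}

tokens = clean_tokens_demo(tweet["text"])
# Append keyword and location, then drop empty entries
transaction = [item for item in tokens + [tweet["keyword"], tweet["location"]] if item]
print(transaction)
```

The empty location is filtered out, so the transaction holds only the cleaned words and the keyword.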


## Frequent Itemsets
### Summary of the Frequent Itemset Extraction Process
Using the Apriori algorithm, we identified frequent itemsets in the transactions (the preprocessed tweets) with a minimum support threshold of 0.5%. A frequent itemset is a single item or a combination of items (words, keywords, or locations) that occurs in at least 0.5% of all transactions. The goal of this step is to discover significant co-occurrences that form the basis for extracting meaningful patterns and relationships in the data.

### Why We Used the Apriori Algorithm
The Apriori algorithm is a widely used method for frequent pattern mining because:
- It identifies itemsets (groups of words) that frequently occur together in a dataset, building them up one item at a time.
- It systematically prunes infrequent itemsets to reduce computation: an itemset that contains an infrequent subset cannot itself be frequent, so only candidates that can still meet the minimum support threshold are counted.
- It lays the groundwork for generating association rules, which reveal how words relate to each other.
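
The pruning idea can be illustrated with a minimal pure-Python sketch. This is not how `mlxtend` implements `apriori`; the function name `apriori_sketch` and the toy transactions are assumptions made for illustration only.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Level-wise frequent itemset search (illustrative, not mlxtend's code)."""
    n = len(transactions)
    sets = [set(t) for t in transactions]

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset
        return sum(itemset <= t for t in sets) / n

    # Level 1: frequent single items
    items = sorted({item for t in sets for item in t})
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates that have any infrequent (k-1)-subset
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Made-up transactions for the demo
toy = [
    ["fire", "rescue", "crew"],
    ["fire", "rescue"],
    ["fire", "smoke"],
    ["flood", "rescue"],
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, sup in result.items():
    print(sorted(itemset), sup)
```

In this toy run, `crew`, `smoke`, and `flood` fail the threshold at level 1, so no pair containing them is ever counted.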
### The Process in Detail
#### Data Preprocessing
- The dataset was cleaned by:
  - Removing URLs and special characters.
  - Converting text to lowercase.
  - Removing English stopwords to focus on meaningful words.
- Each transaction was constructed from tokenized tweet text, combining relevant columns (`text`, `keyword`, and `location`).
#### One-Hot Encoding
- Using the `TransactionEncoder`, the transactions were transformed into a binary matrix:
  - Rows represent transactions (tweets).
  - Columns represent unique items (words, keywords, and locations).
  - Each cell indicates whether an item appears (`True`) or does not appear (`False`) in the transaction.
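
The encoding described above can be sketched in plain Python. This shows the shape of the result conceptually and is not `TransactionEncoder`'s implementation; `one_hot_encode` and the toy transactions are hypothetical.

```python
def one_hot_encode(transactions):
    """Return (sorted item columns, boolean rows) for a list of transactions."""
    columns = sorted({item for t in transactions for item in t})
    rows = []
    for t in transactions:
        present = set(t)  # repeated words in one tweet count once
        rows.append([col in present for col in columns])
    return columns, rows

# Made-up transactions for the demo
toy = [["fire", "rescue"], ["flood"], ["fire", "flood", "fire"]]
columns, rows = one_hot_encode(toy)
for row in rows:
    print(dict(zip(columns, row)))
```

A word repeated within one tweet still yields a single `True`, so the matrix records presence, not frequency.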
#### Applying the Apriori Algorithm
- A minimum support threshold of 0.5% (`min_support = 0.005`) was used to identify frequent itemsets.
- Support measures how often an item or combination of items appears, as a fraction of the total number of transactions.
- For example, a support value of 0.01 means the itemset appears in 1% of all transactions; a minimum support of 0.01 would keep only itemsets that appear in at least 1% of them.
- The output consists of itemsets: individual items or combinations of items that co-occur frequently enough to meet the support threshold.
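
Written as a formula, with $T$ the set of transactions, $X$ an itemset, and $s_{\min}$ the `min_support` value (the symbols are notation introduced here, not taken from the notebook):

$$
\mathrm{support}(X) = \frac{\left|\{\, t \in T : X \subseteq t \,\}\right|}{|T|},
\qquad
X \text{ is frequent} \iff \mathrm{support}(X) \ge s_{\min} = 0.005
$$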
#### Significance of Frequent Itemsets
- Frequent itemsets help uncover patterns in the data, such as:
  - Words that often appear together in tweets (e.g., "fire" and "rescue").
  - Potential associations between keywords and topics.
- These patterns form the basis for association rules, which provide deeper insights into relationships between words.

In [35]:
# One-hot encode the transactions
transactions = data['transactions'].tolist()
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Apply Apriori algorithm
min_support = 0.005  # Adjust threshold as needed
frequent_itemsets = apriori(df, min_support=min_support, use_colnames=True)
print("Frequent Itemsets:")
print(frequent_itemsets)

Frequent Itemsets:
      support           itemsets
0    0.007300        (Australia)
1    0.006508            (India)
2    0.006069           (London)
3    0.007124  (London, England)
4    0.006772               (UK)
..        ...                ...
360  0.018821            (years)
361  0.005541              (yes)
362  0.006596              (yet)
363  0.009411            (youre)
364  0.008091    (taal, volcano)

[365 rows x 2 columns]


## Association Rules
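Association rules were generated from the frequent itemsets to find relationships between co-occurring words. A rule A → B says that transactions containing A tend to also contain B. Confidence is the fraction of transactions containing A that also contain B, and lift compares that fraction with how often B occurs overall (a lift above 1 means A and B co-occur more often than they would if independent). The minimum confidence of 0.01 is deliberately permissive, so it filters out very little; both rules in the output below come from the single two-item itemset (taal, volcano).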

In [36]:
# Generate association rules
min_confidence = 0.01  # Adjust confidence threshold as needed
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_confidence)

# Display rules
print("\nAssociation Rules:")
print(rules[['antecedents', 'consequents', 'confidence', 'lift']])


Association Rules:
  antecedents consequents  confidence       lift
0      (taal)   (volcano)    0.741935  59.828415
1   (volcano)      (taal)    0.652482  59.828415


## Analysis and Limitations
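- Frequent Itemsets: The itemset (taal, volcano), with a support of about 0.8%, is the one multi-word pattern that produced association rules; it reflects tweets about the Taal volcano.
- Sparse Data: Tweets are short, so nearly all of the 365 frequent itemsets are single items rather than multi-word combinations.
- Item Granularity: Keyword and location values enter each transaction unchanged, so an entry such as (London, England) is a single location item rather than a pair of words, and a capitalized location such as `London` is a different item from the lowercased tweet word `london`.
- Association Rules: Only two rules were generated, taal → volcano (confidence 0.74) and volcano → taal (confidence 0.65), both with a lift of about 59.8; most transactions lacked strong co-occurrences.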

## Future Work
- N-Grams: Use bigrams or trigrams to extract richer context from tweets.
- Lower Thresholds: Experiment with lower support thresholds to uncover less frequent patterns (the confidence threshold of 0.01 is already permissive).
- Alternative Methods: Explore clustering or classification to complement frequent pattern analysis.
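
As one possible starting point for the n-gram idea, the sketch below adds adjacent-word bigrams to a token list so that they can act as extra items in a transaction. The helper `add_bigrams`, the underscore separator, and the sample tokens are assumptions, not part of the current pipeline.

```python
def add_bigrams(tokens):
    """Return the tokens plus adjacent-word bigrams joined with an underscore."""
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

# Hypothetical token list, as it might look after cleaning
sample = ["taal", "volcano", "erupts"]
with_bigrams = add_bigrams(sample)
print(with_bigrams)
```

Because the tokens are taken after stopword removal, a bigram here pairs words that were adjacent in the cleaned text, which is not always adjacent in the original tweet.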