# Documentation

## Purpose
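This script performs the following tasks:
1. Reads text previously extracted from PDF files (stored as `.txt` files in the `txt` folder).
2. Matches each PDF title, derived from its filename, to a CSV record by token-level title similarity.
3. Concatenates the CSV title, description, and matched PDF text, then normalizes, tokenizes, and lemmatizes the result.
4. Saves the processed data to a gzip-compressed Parquet file.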

## Key Components
### Libraries Used
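- **os**: Lists the `txt` folder and builds file paths.
- **re**: Regular-expression substitutions used during text normalization.
- **pandas**: Reads the CSV file and writes the Parquet output.
- **nltk**: Tokenization, POS tagging, stop-word removal, lemmatization, and n-gram generation.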

### Functions
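- **get_wordnet_pos()**: Converts Penn Treebank POS tags (as returned by NLTK) to WordNet POS tags, defaulting to noun.
- **normalize_text()**: Removes email addresses ending in `.com` or `.in`, lowercases the text, strips punctuation, and collapses extra whitespace.
- **remove_stop_words()**: Filters out common English stop words.
- **lemmatize_tokens()**: POS-tags the tokens and reduces each word to its base form using WordNet lemmatization.
- **extract_pdf_title()**: Extracts the title from a filename of the form `PWC_DATE_Title.txt`.
- **is_similar()**: Returns `True` when the number of shared lemmatized tokens, divided by the size of the larger token set, is at least the threshold (default 0.5).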
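
The title extraction and similarity measure described above can be sketched without NLTK. This is a simplified illustration, not the script's exact logic: it assumes lowercase whitespace tokens instead of `word_tokenize` plus lemmatization, and the names `extract_title`, `overlap_similarity`, and the sample filename are hypothetical.

```python
def extract_title(filename: str) -> str:
    """Drop the extension and the first two '_'-separated fields (e.g. PWC, DATE)."""
    base = filename.rsplit(".", 1)[0]
    parts = base.split("_", 2)
    title = parts[2] if len(parts) >= 3 else base
    return title.replace("_", " ").strip()


def overlap_similarity(title1: str, title2: str) -> float:
    """Shared tokens divided by the size of the larger token set."""
    # Assumption: lowercase whitespace tokens stand in for lemmatized tokens.
    tokens1 = set(title1.lower().split())
    tokens2 = set(title2.lower().split())
    if not tokens1 or not tokens2:
        return 0.0
    return len(tokens1 & tokens2) / max(len(tokens1), len(tokens2))


# Hypothetical filename, invented for illustration.
pdf_title = extract_title("PWC_2025-02-27_The_retail_reinvention_paradigm.txt")
score = overlap_similarity(pdf_title, "The retail reinvention paradigm")
print(pdf_title, score)
```

Dividing by the larger set keeps a short title from matching a long one just because all of its few words appear there; with a threshold of 0.5, at least half of the longer title's distinct tokens must be shared.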

### Workflow
1. **Load CSV Data:** Reads a CSV file containing titles, descriptions, dates, and links.
2. **Load PDF Text and Titles:** Reads each `.txt` file in the `txt` folder and derives its title from the filename.
3. **Match and Concatenate Text:** For each CSV row, takes the first text file whose title passes the similarity threshold and concatenates the CSV title, description, and matched PDF text.
4. **Normalize and Tokenize Text:**
   - Normalizes the concatenated text (removes emails and punctuation, lowercases, trims whitespace).
   - Builds bigrams from the normalized tokens, with stop words retained.
   - Separately removes stop words from the normalized tokens and lemmatizes the remainder.
5. **Save to Parquet:** Saves the processed data to a gzip-compressed Parquet file.
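
Step 4 of the workflow above can be illustrated with a standard-library sketch. The regular expressions mirror those in `normalize_text()`, but the whitespace tokenizer, the small `STOP_WORDS` set, and the sample sentence are assumptions made for illustration; the script itself uses NLTK's tokenizer and English stop-word list, and also lemmatizes.

```python
import re

# Assumed stop-word subset for illustration; the script uses NLTK's English list.
STOP_WORDS = {"the", "to", "a", "of", "and", "for"}


def normalize(text: str) -> str:
    """Remove emails ending in .com/.in, lowercase, strip punctuation, trim spaces."""
    text = re.sub(r"\S+@\S+\.(com|in)\b", "", text, flags=re.IGNORECASE)
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def bigrams(words):
    """Join each pair of adjacent tokens with an underscore."""
    return ["_".join(pair) for pair in zip(words, words[1:])]


# Hypothetical sample text and email address.
sample = "The mutual funds route to Viksit Bharat! Contact: info@example.com"
words = normalize(sample).split()
print(bigrams(words))                               # stop words retained
print([w for w in words if w not in STOP_WORDS])    # stop words removed
```

Because the bigrams are built before stop-word removal, pairs such as `route_to` keep their function words, while the filtered list drops them.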

## Output
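- A gzip-compressed Parquet file, `pwc_final_concatenated_insights_gzip.parquet`, with the columns `Date`, `Title`, `Description`, `Link`, `pdf_content`, `concatenated_text`, `tokenized_text`, and `normalized_concatenated_text`.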



# Code

In [None]:
import os
import re
import pandas as pd
import nltk
from nltk.corpus import wordnet, stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

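# Note: newer NLTK releases may also require the 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' resources.
nltk.download('punkt')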
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def get_wordnet_pos(treebank_tag):
    """Convert Treebank tags to WordNet POS tags."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def normalize_text(text):
    """
    Normalize text by:
      - Removing email addresses (any string containing '@' and ending with .com or .in)
      - Lowercasing the text
      - Removing punctuation
      - Trimming extra whitespace
    """
    text = re.sub(r'\S+@\S+\.(com|in)\b', '', text, flags=re.IGNORECASE)
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def remove_stop_words(tokens):
    """Remove stop words from a list of tokens."""
    return [word for word in tokens if word not in stop_words]

def lemmatize_tokens(tokens):
    """Lemmatize tokens using POS tags."""
    pos_tokens = nltk.pos_tag(tokens)
    return [lemmatizer.lemmatize(word, get_wordnet_pos(pos)) for word, pos in pos_tokens]

def extract_pdf_title(filename):
    """
    Extract the publication title from a filename.
    Expected format: "PWC_DATE_Title.txt"
    Drops the extension, splits on '_' at most twice to discard the first two
    fields, and replaces the remaining underscores with spaces.
    """
    base = os.path.splitext(filename)[0]
    parts = base.split('_', 2)
    if len(parts) >= 3:
        return parts[2].replace('_', ' ').strip()
    # Fallback: the filename does not follow the expected pattern.
    return base.replace('_', ' ').strip()

def is_similar(title1, title2, threshold=0.5):
    """
    Compare two titles based on token-level matching.
    Similarity is the number of shared lemmatized tokens divided by the size
    of the larger token set; returns True if it is at least the threshold.
    """
    tokens1 = set(lemmatize_tokens(word_tokenize(title1)))
    tokens2 = set(lemmatize_tokens(word_tokenize(title2)))
    if not tokens1 or not tokens2:
        return False
    common = tokens1.intersection(tokens2)
    similarity = len(common) / max(len(tokens1), len(tokens2))
    return similarity >= threshold

# --------------------------
# Step 1: Load the CSV file containing insights details
# --------------------------
csv_file = "insights-details-pwc.csv"
df = pd.read_csv(csv_file)

# --------------------------
# Step 2: Build a list of text file info from the "txt" folder
# --------------------------
txt_folder = "txt"
pdf_text_files = []
for filename in os.listdir(txt_folder):
    if filename.lower().endswith('.txt'):
        pdf_title = extract_pdf_title(filename)
        file_path = os.path.join(txt_folder, filename)
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()
        pdf_text_files.append({
            "filename": filename,
            "pdf_title": pdf_title,
            "pdf_content": content
        })

# --------------------------
# Step 3: Process each CSV row to match PDF text and perform concatenation
# --------------------------
new_rows = []
for idx, row in df.iterrows():
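    # Treat missing (NaN) cells as empty strings instead of the literal "nan".
    title_value = row.get("Title", "")
    description_value = row.get("Description", "")
    csv_title = "" if pd.isna(title_value) else str(title_value).strip()
    csv_description = "" if pd.isna(description_value) else str(description_value).strip()
    matched_pdf_content = ""

    # Use the first text file whose title meets the similarity threshold.
    for pdf_file in pdf_text_files:
        if is_similar(csv_title, pdf_file["pdf_title"], threshold=0.5):
            matched_pdf_content = pdf_file["pdf_content"]
            break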

    concatenated = csv_title + " " + csv_description + " " + matched_pdf_content

    normalized = normalize_text(concatenated)
    
    # Bigrams are built from the normalized text with stop words retained.
    words = word_tokenize(normalized)
    two_grams = ['_'.join(gram) for gram in ngrams(words, 2)]
    tokenized_text = ' '.join(two_grams)

    # Reuse the same tokens for stop-word removal and lemmatization.
    filtered_tokens = remove_stop_words(words)
    lemmatized_tokens = lemmatize_tokens(filtered_tokens)
    normalized_concatenated_text = ' '.join(lemmatized_tokens)
    
    new_row = {
        "Date": row.get("Date", ""),
        "Title": csv_title,
        "Description": csv_description,
        "Link": row.get("Link", ""),
        "pdf_content": matched_pdf_content,
        "concatenated_text": concatenated,
        "tokenized_text": tokenized_text,
        "normalized_concatenated_text": normalized_concatenated_text
    }
    new_rows.append(new_row)

# --------------------------
# Step 4: Save the processed data to a gzip-compressed Parquet file
# --------------------------
final_df = pd.DataFrame(new_rows)
output_file = "pwc_final_concatenated_insights_gzip.parquet"
final_df.to_parquet(output_file, compression='gzip')
print(f"Final Parquet file saved as {output_file}")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Final Parquet file saved as pwc_final_concatenated_insights_gzip.parquet


In [7]:
final_df.head()

Unnamed: 0,Date,Title,Description,Link,pdf_content,concatenated_text,tokenized_text,normalized_concatenated_text
0,07/03/25,Quality measures and standards for transitioni...,A roadmap to facilitate the transition to VBHC.,https://www.pwc.in/ghost-templates/quality-mea...,Quality measures and standards \nMarch 2025\nf...,Quality measures and standards for transitioni...,quality_measures measures_and and_standards st...,quality measure standard transition valuebased...
1,05/03/25,The mutual funds route to Viksit Bharat @2047,A comprehensive roadmap for the evolution of t...,https://www.pwc.in/ghost-templates/the-mutual-...,The mutual funds route \nto Viksit Bharat @204...,The mutual funds route to Viksit Bharat @2047 ...,the_mutual mutual_funds funds_route route_to t...,mutual fund route viksit bharat 2047 comprehen...
2,04/03/25,Financial health: Transcending from access to ...,Explore India’s financial inclusion journey an...,https://www.pwc.in/ghost-templates/financial-h...,TM\nMarch 2025\nFinancial health: \nTranscendi...,Financial health: Transcending from access to ...,financial_health health_transcending transcend...,financial health transcend access impact explo...
3,04/03/25,Towards a climate-resilient future: Strategies...,Explore the comprehensive climate-resilient ac...,https://www.pwc.in/ghost-templates/towards-a-c...,\nTowards a climate-resilient \nfuture: Strat...,Towards a climate-resilient future: Strategies...,towards_a a_climateresilient climateresilient_...,towards climateresilient future strategy andam...
4,27/02/25,The retail reinvention paradigm,How brands could up their game,https://www.pwc.in/ghost-templates/retail-rein...,The retail reinvention \nparadigm\nHow brands ...,The retail reinvention paradigm How brands cou...,the_retail retail_reinvention reinvention_para...,retail reinvention paradigm brand could game r...
