
In [18]:
import re

def clean_text(text):
    if not isinstance(text, str): # Handle potential non-string values
        return text
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text) # Keep only lowercase letters and whitespace; digits and punctuation are dropped, so hyphenated words merge (e.g. 'FUTURE-AI' becomes 'futureai')
    text = re.sub(r'\s+', ' ', text) # Replace multiple spaces with a single space
    text = text.strip() # Remove leading/trailing spaces
    return text

# Apply the function to the 'titles' column (the result is stored as 'cleaned_summaries')
df['cleaned_summaries'] = df['titles'].apply(clean_text)

print("Original summaries (first 5 rows):")
print(df['titles'].head())
print("\nCleaned summaries (first 5 rows):")
print(df['cleaned_summaries'].head())

Original summaries (first 5 rows):
0    Survey on Semantic Stereo Matching / Semantic ...
1    FUTURE-AI: Guiding Principles and Consensus Re...
2    Enforcing Mutual Consistency of Hard Regions f...
3    Parameter Decoupling Strategy for Semi-supervi...
4    Background-Foreground Segmentation for Interio...
Name: titles, dtype: object

Cleaned summaries (first 5 rows):
0    survey on semantic stereo matching semantic de...
1    futureai guiding principles and consensus reco...
2    enforcing mutual consistency of hard regions f...
3    parameter decoupling strategy for semisupervis...
4    backgroundforeground segmentation for interior...
Name: cleaned_summaries, dtype: object
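The output above shows that deleting punctuation outright fuses hyphenated words ("FUTURE-AI" becomes "futureai", "Semi-supervised" becomes "semisupervised"). If separate tokens are preferred, one option is to replace non-letters with a space instead. The sketch below is an illustrative variant, not part of the original notebook; the name `clean_text_keep_boundaries` is hypothetical.

```python
import re

def clean_text_keep_boundaries(text):
    """Variant of clean_text that turns non-letters into spaces instead of deleting them."""
    if not isinstance(text, str):  # Leave non-string values (e.g. NaN) untouched
        return text
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)  # Non-letters become word boundaries
    text = re.sub(r'\s+', ' ', text)       # Collapse runs of whitespace
    return text.strip()

print(clean_text_keep_boundaries("FUTURE-AI: Guiding Principles"))
# prints: future ai guiding principles
```

Whether "future ai" or "futureai" is the better token depends on the downstream task, so this is a trade-off rather than a fix.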


## Word Tokenization (NLTK)

### Subtask:
Apply NLTK's `word_tokenize` to the 'cleaned_summaries' column to break the text into individual words, storing the tokenized lists in a new 'tokenized_summaries' column.


In [19]:
import nltk
from nltk.tokenize import word_tokenize

# Download the 'punkt' tokenizer data if not already downloaded
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:  # nltk.data.find raises LookupError when the resource is missing
    nltk.download('punkt')

# Apply word tokenization
df['tokenized_summaries'] = df['cleaned_summaries'].apply(word_tokenize)

print("Tokenized summaries (first 5 rows):")
print(df['tokenized_summaries'].head())

Tokenized summaries (first 5 rows):
0    [survey, on, semantic, stereo, matching, seman...
1    [futureai, guiding, principles, and, consensus...
2    [enforcing, mutual, consistency, of, hard, reg...
3    [parameter, decoupling, strategy, for, semisup...
4    [backgroundforeground, segmentation, for, inte...
Name: tokenized_summaries, dtype: object
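The "find, then download on failure" guard recurs for every NLTK resource in this notebook, so it can be factored into one helper. The sketch below assumes that `nltk.data.find` raises `LookupError` for a missing resource and that recent NLTK releases expect a `punkt_tab` package for `word_tokenize` in addition to (or instead of) `punkt`. The helper name `ensure_nltk_resources` and the `find`/`download` parameters are hypothetical; the parameters exist only so the logic can be exercised without NLTK installed.

```python
def ensure_nltk_resources(resources, find=None, download=None):
    """Download each (lookup_path, package_name) pair only if the lookup fails.

    Returns the list of package names for which a download was triggered.
    """
    if find is None or download is None:
        import nltk  # Imported lazily so stand-in callables can be used instead
        find = find or nltk.data.find
        download = download or nltk.download
    fetched = []
    for lookup_path, package_name in resources:
        try:
            find(lookup_path)
        except LookupError:
            download(package_name)
            fetched.append(package_name)
    return fetched

# Resource paths used in this notebook, plus punkt_tab (an assumption for newer NLTK versions)
NLTK_RESOURCES = [
    ("tokenizers/punkt", "punkt"),
    ("tokenizers/punkt_tab", "punkt_tab"),
    ("corpora/stopwords", "stopwords"),
    ("corpora/wordnet", "wordnet"),
]

# Usage in the notebook (requires nltk and network access):
# ensure_nltk_resources(NLTK_RESOURCES)
```

A lookup can fail even for an installed package (the `[nltk_data] ... already up-to-date!` lines later in this notebook suggest that happens for `corpora/wordnet`). In that case the download call is a cheap no-op.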


## Stopword Removal (NLTK)

### Subtask:
Filter out common English stopwords from the 'tokenized_summaries' using NLTK's stopwords list, storing the cleaned lists in a new 'filtered_summaries' column.


In [20]:
import nltk
from nltk.corpus import stopwords

# Download the 'stopwords' data if not already downloaded
try:
    nltk.data.find('corpora/stopwords')
except LookupError:  # nltk.data.find raises LookupError when the resource is missing
    nltk.download('stopwords')

# Get English stopwords
stop_words = set(stopwords.words('english'))

# Define a function to remove stopwords
def remove_stopwords(tokens):
    return [word for word in tokens if word not in stop_words]

# Apply the function to the 'tokenized_summaries' column
df['filtered_summaries'] = df['tokenized_summaries'].apply(remove_stopwords)

print("Filtered summaries (first 5 rows):")
print(df['filtered_summaries'].head())

Filtered summaries (first 5 rows):
0    [survey, semantic, stereo, matching, semantic,...
1    [futureai, guiding, principles, consensus, rec...
2    [enforcing, mutual, consistency, hard, regions...
3    [parameter, decoupling, strategy, semisupervis...
4    [backgroundforeground, segmentation, interior,...
Name: filtered_summaries, dtype: object


## Lemmatization (NLTK)

### Subtask:
Apply NLTK's `WordNetLemmatizer` to reduce words in the 'filtered_summaries' to their base or dictionary form, creating a new 'lemmatized_summaries' column.


In [25]:
import nltk
from nltk.stem import WordNetLemmatizer

# Download the 'wordnet' data if not already downloaded
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')

# Initialize WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Define a function to lemmatize tokens
def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(word) for word in tokens]

# Apply the function to the 'filtered_summaries' column
df['lemmatized_summaries'] = df['filtered_summaries'].apply(lemmatize_tokens)

print("Lemmatized summaries (first 5 rows):")
print(df['lemmatized_summaries'].head())

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Lemmatized summaries (first 5 rows):
0    [survey, semantic, stereo, matching, semantic,...
1    [futureai, guiding, principle, consensus, reco...
2    [enforcing, mutual, consistency, hard, region,...
3    [parameter, decoupling, strategy, semisupervis...
4    [backgroundforeground, segmentation, interior,...
Name: lemmatized_summaries, dtype: object
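In the output above, plurals are reduced ("principles" to "principle") but verb forms such as "guiding" and "enforcing" are unchanged. This is consistent with `WordNetLemmatizer.lemmatize` treating every word as a noun unless a `pos` argument is given. The sketch below shows one way to pass a part of speech, by mapping Penn Treebank tags to WordNet's single-letter codes. The function names are hypothetical, and `nltk.pos_tag` needs additional tagger data that this notebook does not download.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to the single-letter POS code WordNet uses."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default

def lemmatize_with_pos(tokens, lemmatizer, pos_tag):
    """Lemmatize tokens, passing each token's mapped POS to the lemmatizer.

    `lemmatizer` is expected to expose lemmatize(word, pos) and `pos_tag`
    to return (word, tag) pairs, as WordNetLemmatizer and nltk.pos_tag do.
    """
    return [lemmatizer.lemmatize(word, penn_to_wordnet(tag))
            for word, tag in pos_tag(tokens)]

# Usage in the notebook (requires nltk plus its POS tagger data):
# import nltk
# from nltk.stem import WordNetLemmatizer
# df['filtered_summaries'].apply(
#     lambda tokens: lemmatize_with_pos(tokens, WordNetLemmatizer(), nltk.pos_tag))
```

Tagging lowercased text with stopwords already removed gives the tagger less context than full sentences, so the tags, and therefore the lemmas, may be less reliable than tagging the original titles.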


In [26]:
print('Lemmatization subtask completed successfully.')

Lemmatization subtask completed successfully.


## Rejoining Words

### Subtask:
Rejoin the lemmatized words from the 'lemmatized_summaries' column back into a single string for each text entry, storing the result in a new 'rejoined_summaries' column.


In [27]:
def rejoin_words(word_list):
    return ' '.join(word_list)

df['rejoined_summaries'] = df['lemmatized_summaries'].apply(rejoin_words)

print("Rejoined summaries (first 5 rows):")
print(df['rejoined_summaries'].head())

Rejoined summaries (first 5 rows):
0    survey semantic stereo matching semantic depth...
1    futureai guiding principle consensus recommend...
2    enforcing mutual consistency hard region semis...
3    parameter decoupling strategy semisupervised l...
4    backgroundforeground segmentation interior sen...
Name: rejoined_summaries, dtype: object


## Unified NLTK Preprocessing Pipeline Function

### Subtask:
Develop a single Python function that encapsulates all NLTK-based preprocessing steps (tokenization, stopword removal, and lemmatization), then apply it to the 'cleaned_summaries' column to create a 'unified_processed_summaries' column.


In [28]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK data if not already present
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')

# Initialize WordNetLemmatizer and get English stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Define the unified preprocessing pipeline function
def unified_preprocessing_pipeline(text):
    # 1. Tokenization
    tokens = word_tokenize(text)

    # 2. Stopword Removal
    filtered_tokens = [word for word in tokens if word not in stop_words]

    # 3. Lemmatization
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

    return lemmatized_tokens

# Apply the unified function to the 'cleaned_summaries' column
df['unified_processed_summaries'] = df['cleaned_summaries'].apply(unified_preprocessing_pipeline)

# Print the first 5 rows of the new column
print("Unified Processed Summaries (first 5 rows):")
print(df['unified_processed_summaries'].head())

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unified Processed Summaries (first 5 rows):
0    [survey, semantic, stereo, matching, semantic,...
1    [futureai, guiding, principle, consensus, reco...
2    [enforcing, mutual, consistency, hard, region,...
3    [parameter, decoupling, strategy, semisupervis...
4    [backgroundforeground, segmentation, interior,...
Name: unified_processed_summaries, dtype: object


In [29]:
print('Unified NLTK preprocessing pipeline subtask completed successfully.')

Unified NLTK preprocessing pipeline subtask completed successfully.


## Load spaCy Language Model

### Subtask:
Load the appropriate English language model from spaCy (e.g., 'en_core_web_sm') to enable tokenization, stopword detection, and lemmatization.


In [30]:
import spacy
import sys
import subprocess

def load_spacy_model(model_name):
    try:
        nlp = spacy.load(model_name)
        print(f"SpaCy model '{model_name}' loaded successfully.")
        return nlp
    except OSError:
        print(f"SpaCy model '{model_name}' not found. Attempting to download...")
        try:
            # Use sys.executable so the download runs under the same Python interpreter as the notebook
            subprocess.check_call([sys.executable, "-m", "spacy", "download", model_name])
            nlp = spacy.load(model_name)
            print(f"SpaCy model '{model_name}' downloaded and loaded successfully.")
            return nlp
        except Exception as e:
            # Raise instead of calling sys.exit(1) so the failure surfaces as a normal traceback in the notebook
            raise RuntimeError(f"Error downloading or loading spaCy model '{model_name}': {e}") from e

# Load the 'en_core_web_sm' model
nlp = load_spacy_model('en_core_web_sm')

print(f"NLP object type: {type(nlp)}")

SpaCy model 'en_core_web_sm' loaded successfully.
NLP object type: <class 'spacy.lang.en.English'>


## spaCy Tokenization, Stopword Removal, and Lemmatization

### Subtask:
Process the 'cleaned_summaries' column using the loaded spaCy model. From the resulting Doc objects, extract the raw tokens for 'spacy_tokenized_summaries', filter out stopwords and punctuation for 'spacy_filtered_summaries', and obtain lemmas for 'spacy_lemmatized_summaries'.


In [31]:
import spacy

# 1. Apply the loaded nlp object to create spaCy Doc objects
df['spacy_docs'] = df['cleaned_summaries'].apply(nlp)

# 2. Extract raw tokens
df['spacy_tokenized_summaries'] = df['spacy_docs'].apply(lambda doc: [token.text for token in doc])

# 3. Filter out stopwords and punctuation
df['spacy_filtered_summaries'] = df['spacy_docs'].apply(lambda doc: [token.text for token in doc if not token.is_stop and not token.is_punct])

# 4. Extract lemmas, filtering out stopwords and punctuation
df['spacy_lemmatized_summaries'] = df['spacy_docs'].apply(lambda doc: [token.lemma_ for token in doc if not token.is_stop and not token.is_punct])

# 5. Print the head of the DataFrame to display the newly created columns
print("SpaCy Tokenized Summaries (first 5 rows):")
print(df['spacy_tokenized_summaries'].head())
print("\nSpaCy Filtered Summaries (first 5 rows):")
print(df['spacy_filtered_summaries'].head())
print("\nSpaCy Lemmatized Summaries (first 5 rows):")
print(df['spacy_lemmatized_summaries'].head())

SpaCy Tokenized Summaries (first 5 rows):
0    [survey, on, semantic, stereo, matching, seman...
1    [futureai, guiding, principles, and, consensus...
2    [enforcing, mutual, consistency, of, hard, reg...
3    [parameter, decoupling, strategy, for, semisup...
4    [backgroundforeground, segmentation, for, inte...
Name: spacy_tokenized_summaries, dtype: object

SpaCy Filtered Summaries (first 5 rows):
0    [survey, semantic, stereo, matching, semantic,...
1    [futureai, guiding, principles, consensus, rec...
2    [enforcing, mutual, consistency, hard, regions...
3    [parameter, decoupling, strategy, semisupervis...
4    [backgroundforeground, segmentation, interior,...
Name: spacy_filtered_summaries, dtype: object

SpaCy Lemmatized Summaries (first 5 rows):
0    [survey, semantic, stereo, match, semantic, de...
1    [futureai, guide, principle, consensus, recomm...
2    [enforce, mutual, consistency, hard, region, s...
3    [parameter, decouple, strategy, semisupervised...
4    [backgroundforeground, segmentation, interior,...
Name: spacy_lemmatized_summaries, dtype: object


## Rejoining spaCy Lemmatized Words

### Subtask:
Rejoin the lemmatized words from the 'spacy_lemmatized_summaries' column back into a single string for each text entry, storing the result in a new 'spacy_rejoined_summaries' column.


In [32]:
def rejoin_words(word_list):
    return ' '.join(word_list)

df['spacy_rejoined_summaries'] = df['spacy_lemmatized_summaries'].apply(rejoin_words)

print("SpaCy Rejoined Summaries (first 5 rows):")
print(df['spacy_rejoined_summaries'].head())

SpaCy Rejoined Summaries (first 5 rows):
0    survey semantic stereo match semantic depth es...
1    futureai guide principle consensus recommendat...
2    enforce mutual consistency hard region semisup...
3    parameter decouple strategy semisupervised d l...
4    backgroundforeground segmentation interior sen...
Name: spacy_rejoined_summaries, dtype: object
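The NLTK and spaCy results differ where the lemmatizers disagree: "guiding" versus "guide" and "enforcing" versus "enforce" in the outputs shown in this notebook. A rough way to quantify the difference per row is the Jaccard overlap of the two token sets. This is a minimal sketch; `jaccard_overlap` is a hypothetical helper, and the sample tokens are taken from the visible part of row 1.

```python
def jaccard_overlap(tokens_a, tokens_b):
    """Return |A intersect B| / |A union B| for two token lists (1.0 if both are empty)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

nltk_tokens = "futureai guiding principle consensus".split()
spacy_tokens = "futureai guide principle consensus".split()
print(jaccard_overlap(nltk_tokens, spacy_tokens))
# prints: 0.6

# Usage in the notebook (assumes both list columns exist on df):
# overlap = [jaccard_overlap(a, b) for a, b in
#            zip(df['lemmatized_summaries'], df['spacy_lemmatized_summaries'])]
```

Rows with low overlap are the ones worth inspecting by hand to decide which lemmatizer suits the downstream task better.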


## Unified spaCy Preprocessing Pipeline Function

### Subtask:
Develop a single Python function that encapsulates all spaCy-based preprocessing steps (tokenization, stopword removal, and lemmatization), then apply it to the 'cleaned_summaries' column to create a 'unified_spacy_processed_summaries' column. The function returns a list of lemmatized, non-stopword tokens.


In [33]:
def unified_spacy_pipeline(text):
    doc = nlp(text)
    # Filter out stopwords and punctuation, and lemmatize
    lemmatized_tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return lemmatized_tokens

# Apply the unified function to the 'cleaned_summaries' column
df['unified_spacy_processed_summaries'] = df['cleaned_summaries'].apply(unified_spacy_pipeline)

# Print the first 5 rows of the new column
print("Unified spaCy Processed Summaries (first 5 rows):")
print(df['unified_spacy_processed_summaries'].head())

Unified spaCy Processed Summaries (first 5 rows):
0    [survey, semantic, stereo, match, semantic, de...
1    [futureai, guide, principle, consensus, recomm...
2    [enforce, mutual, consistency, hard, region, s...
3    [parameter, decouple, strategy, semisupervised...
4    [backgroundforeground, segmentation, interior,...
Name: unified_spacy_processed_summaries, dtype: object
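Calling `nlp` once per row through `apply` works, but on larger DataFrames it is usually slower than letting spaCy batch the texts with `nlp.pipe`. The sketch below is a batched variant of the same filter-and-lemmatize step. The function name `spacy_lemmas_batched` and the `batch_size` value are assumptions made for illustration.

```python
def spacy_lemmas_batched(texts, nlp, batch_size=256):
    """Return one list of lemmas per text, skipping stopwords and punctuation.

    `nlp` is a loaded spaCy pipeline; nlp.pipe processes the texts in batches
    and yields Doc objects in the same order as the input.
    """
    results = []
    for doc in nlp.pipe(texts, batch_size=batch_size):
        results.append([token.lemma_ for token in doc
                        if not token.is_stop and not token.is_punct])
    return results

# Usage in the notebook (assumes `nlp` and `df` from the cells above):
# df['unified_spacy_processed_summaries'] = spacy_lemmas_batched(
#     df['cleaned_summaries'].tolist(), nlp)
```

Because only lemmas and stopword flags are needed here, loading the model with `spacy.load('en_core_web_sm', disable=['parser', 'ner'])` may speed things up further; that option is not used in the original notebook.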
