<a href="https://colab.research.google.com/github/vvrgit/NLP-LAB/blob/main/Assignment5_3_NLTK.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Dataset: https://www.kaggle.com/datasets/spsayakpaul/arxiv-paper-abstracts?resource=download



# **Load dataset from arxiv_data.csv**

In [13]:
import pandas as pd
df = pd.read_csv('arxiv_data.csv', engine='python', nrows=1000)
display(df.head())

Unnamed: 0,titles,summaries,terms
0,Survey on Semantic Stereo Matching / Semantic ...,Stereo matching is one of the widely used tech...,"['cs.CV', 'cs.LG']"
1,FUTURE-AI: Guiding Principles and Consensus Re...,The recent advancements in artificial intellig...,"['cs.CV', 'cs.AI', 'cs.LG']"
2,Enforcing Mutual Consistency of Hard Regions f...,"In this paper, we proposed a novel mutual cons...","['cs.CV', 'cs.AI']"
3,Parameter Decoupling Strategy for Semi-supervi...,Consistency training has proven to be an advan...,['cs.CV']
4,Background-Foreground Segmentation for Interio...,"To ensure safety in automated driving, the cor...","['cs.CV', 'cs.LG']"


# **Text Pre-Processing**

In [16]:
import re

def preprocess_text(text):
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove mentions
    text = re.sub(r'@\w+', '', text)
    # Remove hashtags
    text = re.sub(r'#\w+', '', text)

    text = text.lower()  # Convert to lowercase

    # Remove emojis (a basic approach for common emoji ranges)
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "\U00002702-\U000027B0"  # dingbats
        "\U000024C2-\U0001F251"  # very broad range: also covers CJK and other non-emoji characters
        "]+", flags=re.UNICODE
    )
    text = emoji_pattern.sub(r'', text)

    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Remove special characters
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text
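
The last range in the emoji pattern, `\U000024C2-\U0001F251`, spans far more than emoji. The sketch below reuses the same ranges on a hypothetical sample string to show that CJK text is removed along with the emoji. In this notebook the effect is hidden, because the later special-character step strips all non-ASCII characters anyway, but it matters if the pattern is reused on multilingual text.

```python
import re

# Same character ranges as the emoji pattern in preprocess_text.
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"
    "\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF"
    "\U0001F1E0-\U0001F1FF"
    "\U00002702-\U000027B0"
    "\U000024C2-\U0001F251"
    "]+"
)

# Hypothetical sample: English text, one emoji (U+1F600), and four CJK characters.
cjk = "\u6df1\u5ea6\u5b66\u4e60"
sample = "deep learning \U0001F600 " + cjk

stripped = emoji_pattern.sub("", sample)
print(repr(stripped))

# Every CJK character falls inside the broad U+24C2..U+1F251 range.
print(all(0x24C2 <= ord(ch) <= 0x1F251 for ch in cjk))
```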

In [17]:
df['processed_summaries'] = df['summaries'].apply(preprocess_text)
print(df[['summaries', 'processed_summaries']].head())

                                           summaries  \
0  Stereo matching is one of the widely used tech...   
1  The recent advancements in artificial intellig...   
2  In this paper, we proposed a novel mutual cons...   
3  Consistency training has proven to be an advan...   
4  To ensure safety in automated driving, the cor...   

                                 processed_summaries  
0  stereo matching is one of the widely used tech...  
1  the recent advancements in artificial intellig...  
2  in this paper we proposed a novel mutual consi...  
3  consistency training has proven to be an advan...  
4  to ensure safety in automated driving the corr...  
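
To see what each cleaning step does, here is a minimal sketch that applies the same regex steps, in the same order, to a hypothetical sample string. The `clean` function is a condensed copy of `preprocess_text` with the emoji step omitted, and the sample text is made up for illustration.

```python
import re

def clean(text):
    """Condensed copy of the regex steps in preprocess_text (emoji step omitted)."""
    text = re.sub(r'http\S+|www\S+', '', text)   # URLs (https is covered by http\S+)
    text = re.sub(r'<.*?>', '', text)            # HTML tags
    text = re.sub(r'@\w+', '', text)             # mentions
    text = re.sub(r'#\w+', '', text)             # hashtags
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)   # punctuation and other symbols
    return re.sub(r'\s+', ' ', text).strip()     # collapse whitespace

# Hypothetical input containing one example of each pattern.
sample = ("See <b>our</b> code at https://example.org/repo, by @alice #NLP: "
          "Semi-supervised F1-score = 0.93!")

cleaned = clean(sample)
print(cleaned)
# see our code at by semisupervised f1score 093
```

Note two side effects of dropping every non-alphanumeric character: hyphenated terms are fused (`semi-supervised` becomes `semisupervised`), and decimal numbers lose their point (`0.93` becomes `093`).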


# **Word Tokenization**

In [23]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')  # Tokenizer data required by word_tokenize

df['tokenized_summaries'] = df['processed_summaries'].apply(word_tokenize)
print(df[['processed_summaries', 'tokenized_summaries']].head())

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


                                 processed_summaries  \
0  stereo matching is one of the widely used tech...   
1  the recent advancements in artificial intellig...   
2  in this paper we proposed a novel mutual consi...   
3  consistency training has proven to be an advan...   
4  to ensure safety in automated driving the corr...   

                                 tokenized_summaries  
0  [stereo, matching, is, one, of, the, widely, u...  
1  [the, recent, advancements, in, artificial, in...  
2  [in, this, paper, we, proposed, a, novel, mutu...  
3  [consistency, training, has, proven, to, be, a...  
4  [to, ensure, safety, in, automated, driving, t...  
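
Because `processed_summaries` already contains only lowercase letters, digits, and single spaces, tokenization at this point is close to a plain whitespace split. The sketch below uses `str.split()` as a dependency-free stand-in. The function name and sample sentence are hypothetical. NLTK's `word_tokenize` additionally applies Treebank-style rules, for example splitting `cannot` into `can` and `not`, so the two are not guaranteed to agree on every input.

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace (a stand-in, not NLTK's tokenizer)."""
    return text.split()

# Hypothetical cleaned sentence in the same shape as processed_summaries.
sample = "in this paper we cannot ignore label noise"
tokens = whitespace_tokenize(sample)
print(tokens)
```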


# **Stop Word Removal**

In [25]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    return [word for word in tokens if word not in stop_words]

df['filtered_summaries'] = df['tokenized_summaries'].apply(remove_stopwords)
print(df[['tokenized_summaries', 'filtered_summaries']].head())

                                 tokenized_summaries  \
0  [stereo, matching, is, one, of, the, widely, u...   
1  [the, recent, advancements, in, artificial, in...   
2  [in, this, paper, we, proposed, a, novel, mutu...   
3  [consistency, training, has, proven, to, be, a...   
4  [to, ensure, safety, in, automated, driving, t...   

                                  filtered_summaries  
0  [stereo, matching, one, widely, used, techniqu...  
1  [recent, advancements, artificial, intelligenc...  
2  [paper, proposed, novel, mutual, consistency, ...  
3  [consistency, training, proven, advanced, semi...  
4  [ensure, safety, automated, driving, correct, ...  


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
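
One interaction worth keeping in mind: the cleaning step strips apostrophes before stop words are filtered, so a contraction such as `don't` arrives as `dont` and no longer matches a stop-word entry spelled with the apostrophe. The sketch below illustrates this with a small, hypothetical stop-word set standing in for NLTK's much longer English list. The token lists are made up for illustration.

```python
# Hypothetical miniature stop-word set; NLTK's English list is much longer.
STOP_WORDS = {"is", "of", "the", "we", "do", "not", "don't"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Keep only tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

# The same sentence with and without its apostrophe.
with_apostrophe = ["we", "don't", "use", "the", "labels"]
without_apostrophe = ["we", "dont", "use", "the", "labels"]

kept_a = remove_stopwords(with_apostrophe)
kept_b = remove_stopwords(without_apostrophe)
print(kept_a)  # ['use', 'labels']
print(kept_b)  # ['dont', 'use', 'labels']
```

Using a `set` for the stop words, as the notebook does, keeps each membership test constant-time on average.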


# **Lemmatization**

In [27]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens):
    # lemmatize() defaults to pos='n', so each token is treated as a noun
    return [lemmatizer.lemmatize(word) for word in tokens]

df['lemmatized_summaries'] = df['filtered_summaries'].apply(lemmatize_tokens)
print(df[['filtered_summaries', 'lemmatized_summaries']].head())

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


                                  filtered_summaries  \
0  [stereo, matching, one, widely, used, techniqu...   
1  [recent, advancements, artificial, intelligenc...   
2  [paper, proposed, novel, mutual, consistency, ...   
3  [consistency, training, proven, advanced, semi...   
4  [ensure, safety, automated, driving, correct, ...   

                                lemmatized_summaries  
0  [stereo, matching, one, widely, used, techniqu...  
1  [recent, advancement, artificial, intelligence...  
2  [paper, proposed, novel, mutual, consistency, ...  
3  [consistency, training, proven, advanced, semi...  
4  [ensure, safety, automated, driving, correct, ...  
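
In the output above, `advancements` becomes `advancement` while `proposed` and `used` are unchanged. This is because `lemmatize` treats every token as a noun unless a `pos` argument is given. One common way to reduce verb and adjective forms as well is to tag tokens first and pass a matching part of speech. The helper below is a sketch of that mapping from Penn Treebank tags to the single-letter codes `lemmatize` accepts. The name `penn_to_wordnet` and the tagged sample are hypothetical, and the tagging itself (for example with `nltk.pos_tag`, which needs an additional tagger resource) is not run here.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (default: noun)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # nouns and everything else

# Hypothetical (word, tag) pairs in the shape a POS tagger returns.
tagged = [("proposed", "VBD"), ("novel", "JJ"), ("networks", "NNS"), ("widely", "RB")]
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)

# With NLTK available, each pair could then be lemmatized as:
#   lemmatizer.lemmatize(word, pos=pos)
```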


# **Re-Joining**

In [28]:
df['clean_summaries'] = df['lemmatized_summaries'].apply(lambda x: ' '.join(x))
print(df[['summaries', 'clean_summaries']].head())

                                           summaries  \
0  Stereo matching is one of the widely used tech...   
1  The recent advancements in artificial intellig...   
2  In this paper, we proposed a novel mutual cons...   
3  Consistency training has proven to be an advan...   
4  To ensure safety in automated driving, the cor...   

                                     clean_summaries  
0  stereo matching one widely used technique infe...  
1  recent advancement artificial intelligence ai ...  
2  paper proposed novel mutual consistency networ...  
3  consistency training proven advanced semisuper...  
4  ensure safety automated driving correct percep...  


# **NLTK Text Preprocessing Pipeline**

In [32]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize NLTK tools once, rather than on every call
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def nltk_preprocessing_pipeline(text):

    # 1. Preprocess text (from previous steps)
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove mentions
    text = re.sub(r'@\w+', '', text)
    # Remove hashtags
    text = re.sub(r'#\w+', '', text)

    text = text.lower()  # Convert to lowercase

    # Remove emojis
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "\U00002702-\U000027B0"  # dingbats
        "\U000024C2-\U0001F251"  # very broad range: also covers CJK and other non-emoji characters
        "]+", flags=re.UNICODE
    )
    text = emoji_pattern.sub(r'', text)

    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Remove special characters
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces

    # 2. Word Tokenization
    tokenized_words = word_tokenize(text)

    # 3. Stopword Removal
    filtered_words = [word for word in tokenized_words if word not in stop_words]

    # 4. Lemmatization
    lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]

    # 5. Rejoin words
    clean_summary = ' '.join(lemmatized_words)

    return clean_summary

print("NLTK preprocessing pipeline function created successfully!")

NLTK preprocessing pipeline function created successfully!


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [35]:
df['clean_summaries_pipeline'] = df['summaries'].apply(nltk_preprocessing_pipeline)
print("\nclean_summaries_pipeline (first 5 rows):")
print(df[['clean_summaries_pipeline']].head())


clean_summaries_pipeline (first 5 rows):
                            clean_summaries_pipeline
0  stereo matching one widely used technique infe...
1  recent advancement artificial intelligence ai ...
2  paper proposed novel mutual consistency networ...
3  consistency training proven advanced semisuper...
4  ensure safety automated driving correct percep...
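
The first five rows printed here match the `clean_summaries` output from the step-by-step cells, but the notebook does not compare the full columns. A minimal sketch of such a check follows. The helper name `first_mismatches` is hypothetical, and the two short lists stand in for `df['clean_summaries']` and `df['clean_summaries_pipeline']`, which could be passed in directly since the helper only iterates over values.

```python
def first_mismatches(left, right, limit=5):
    """Return up to `limit` (position, left, right) triples where the values differ."""
    diffs = []
    for i, (a, b) in enumerate(zip(left, right)):
        if a != b:
            diffs.append((i, a, b))
            if len(diffs) == limit:
                break
    return diffs

# Hypothetical stand-ins for the two DataFrame columns.
stepwise = ["stereo matching one", "recent advancement", "paper proposed"]
pipeline = ["stereo matching one", "recent advancement", "paper proposed"]

diffs = first_mismatches(stepwise, pipeline)
print(diffs)  # an empty list means the two sequences agree
```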


# **Notebook Summary: Text Preprocessing Pipeline**

This notebook builds a text preprocessing pipeline with regular expressions and the NLTK library, applied to the `summaries` column of the `arxiv_data.csv` dataset.

Key Steps Performed:

1.  **Data Loading**: The first 1000 rows of `arxiv_data.csv` were loaded into a pandas DataFrame using `pd.read_csv('arxiv_data.csv', engine='python', nrows=1000)`. The `nrows=1000` argument limits the sample size, and `engine='python'` selects pandas' slower but more feature-complete Python parser to handle potential parsing issues in the large CSV.

2.  **Initial Text Preprocessing (Custom Function with `re` module)**:
    A custom function `preprocess_text` was defined and applied to the 'summaries' column. This function performs the following cleaning operations:
    *   Removal of URLs (http, https, www patterns).
    *   Removal of HTML tags (anything matching `<.*?>`).
    *   Removal of social media mentions (@username).
    *   Removal of hashtags (#hashtag).
    *   Conversion of all text to lowercase.
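    *   Removal of emojis using a basic regex over common emoji code-point ranges.
    *   Removal of any remaining special characters (keeping only alphanumeric characters and whitespace). Because hyphens are dropped, hyphenated terms are fused, as in the `semisupervised` token visible in the output.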
    *   Normalization of whitespace (reducing multiple spaces to single spaces and stripping leading/trailing spaces).
    The output of this step was stored in the `processed_summaries` column.

3.  **Word Tokenization (NLTK)**:
    The `processed_summaries` were then tokenized into individual words using NLTK's `word_tokenize` function. The necessary 'punkt_tab' resource was downloaded if not already present. This resulted in the `tokenized_summaries` column, where each entry is a list of words.

4.  **Stopword Removal (NLTK)**:
    Common English stopwords were removed from the `tokenized_summaries` to filter out words that carry little semantic meaning. NLTK's 'stopwords' corpus was downloaded as needed. The resulting lists of words were stored in the `filtered_summaries` column.

5.  **Lemmatization (NLTK)**:
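    Lemmatization was applied to the `filtered_summaries` using NLTK's `WordNetLemmatizer`, which reduces words to their base or dictionary form. Because `lemmatize` is called without a part-of-speech argument, every token is treated as a noun: plural nouns are reduced (e.g., 'advancements' to 'advancement'), while verb forms such as 'proposed' are left unchanged. The 'wordnet' corpus was downloaded to support this. The lemmatized words were stored in the `lemmatized_summaries` column.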

6.  **Rejoining Words**: The lemmatized words were then rejoined into a single string for each summary, creating a clean, preprocessed text. This final output is available in the `clean_summaries` column.

7.  **Unified NLTK Preprocessing Pipeline Function**: The individual preprocessing steps (steps 2 to 6) were consolidated into a single Python function named `nltk_preprocessing_pipeline`. This function takes raw text as input and performs the entire sequence of cleaning, tokenization, stopword removal, lemmatization, and rejoining, returning the final clean summary string. The required NLTK resources (punkt_tab, stopwords, wordnet) are downloaded in the same cell, before the function is defined. The function was then applied to the original 'summaries' column to create `clean_summaries_pipeline`; its first five rows match those of `clean_summaries` from the step-by-step approach.

The resulting function can be reused to prepare text for downstream natural language processing tasks.