<a href="https://colab.research.google.com/github/vvrgit/NLP-LAB/blob/main/Assignment5_3_Spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Dataset: https://www.kaggle.com/datasets/spsayakpaul/arxiv-paper-abstracts?resource=download

# **Load Dataset**

In [4]:
import pandas as pd
df = pd.read_csv('arxiv_data.csv', engine='python', nrows=1000)
display(df.head())

Unnamed: 0,titles,summaries,terms
0,Survey on Semantic Stereo Matching / Semantic ...,Stereo matching is one of the widely used tech...,"['cs.CV', 'cs.LG']"
1,FUTURE-AI: Guiding Principles and Consensus Re...,The recent advancements in artificial intellig...,"['cs.CV', 'cs.AI', 'cs.LG']"
2,Enforcing Mutual Consistency of Hard Regions f...,"In this paper, we proposed a novel mutual cons...","['cs.CV', 'cs.AI']"
3,Parameter Decoupling Strategy for Semi-supervi...,Consistency training has proven to be an advan...,['cs.CV']
4,Background-Foreground Segmentation for Interio...,"To ensure safety in automated driving, the cor...","['cs.CV', 'cs.LG']"
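
The `terms` values in the preview are printed with brackets and quotes, which suggests the column is read as a string such as `"['cs.CV', 'cs.LG']"` rather than as a Python list. A minimal sketch, assuming that is the case, of converting one such value with `ast.literal_eval` (the helper name `parse_terms` is hypothetical and not part of the notebook):

```python
import ast

def parse_terms(raw):
    """Turn a string like "['cs.CV', 'cs.LG']" into a real list of labels."""
    return ast.literal_eval(raw)

sample = "['cs.CV', 'cs.LG']"
labels = parse_terms(sample)
print(labels, len(labels))
```

If the column does hold strings, the same helper could be applied with `df['terms'].apply(parse_terms)`.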


# **Text Pre-Processing Using Regular Expression**

In [5]:
import re

def preprocess_text(text):
    # Remove URLs (http://, https://, or www. prefixes)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove mentions
    text = re.sub(r'@\w+', '', text)
    # Remove hashtags
    text = re.sub(r'#\w+', '', text)

    text = text.lower()  # Convert to lowercase

    # Remove emojis (a basic approach for common emoji ranges)
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "\U00002702-\U000027B0"  # dingbats
        "\U000024C2-\U0001F251"  # very broad range: also spans CJK and other non-emoji characters
        "]+", flags=re.UNICODE
    )
    text = emoji_pattern.sub(r'', text)

    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Remove special characters
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

In [6]:
df['processed_summaries'] = df['summaries'].apply(preprocess_text)
print(df[['summaries', 'processed_summaries']].head())

                                           summaries  \
0  Stereo matching is one of the widely used tech...   
1  The recent advancements in artificial intellig...   
2  In this paper, we proposed a novel mutual cons...   
3  Consistency training has proven to be an advan...   
4  To ensure safety in automated driving, the cor...   

                                 processed_summaries  
0  stereo matching is one of the widely used tech...  
1  the recent advancements in artificial intellig...  
2  in this paper we proposed a novel mutual consi...  
3  consistency training has proven to be an advan...  
4  to ensure safety in automated driving the corr...  
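
The special-character step deletes punctuation outright, so hyphenated words are fused (later outputs show tokens such as `semisupervise`). The sketch below isolates that step to compare deleting punctuation with replacing it by a space. The function `strip_specials`, its `replacement` parameter, and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

def strip_specials(text, replacement=""):
    """Lowercase, replace non-alphanumeric characters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", replacement, text)
    return re.sub(r"\s+", " ", text).strip()

sample = "Semi-supervised 3D U-Net reaches 92.5% Dice."
fused = strip_specials(sample)        # punctuation deleted, as in the notebook
split = strip_specials(sample, " ")   # punctuation replaced by a space
print(fused)   # semisupervised 3d unet reaches 925 dice
print(split)   # semi supervised 3d u net reaches 92 5 dice
```

Neither variant is clearly better: deleting keeps `3d` and `unet` as single tokens but merges `92.5` into `925`, while spacing splits hyphenated terms into separate words.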


# **Word Tokenization Using spaCy**

In [7]:
import spacy

# Load the English language model. You might need to run `python -m spacy download en_core_web_sm` once if it's not already installed.
nlp = spacy.load('en_core_web_sm')

def tokenize_text_spacy(text):
    doc = nlp(text)
    return [token.text for token in doc]

df['tokenized_summaries'] = df['processed_summaries'].apply(tokenize_text_spacy)

print(df[['processed_summaries', 'tokenized_summaries']].head())

                                 processed_summaries  \
0  stereo matching is one of the widely used tech...   
1  the recent advancements in artificial intellig...   
2  in this paper we proposed a novel mutual consi...   
3  consistency training has proven to be an advan...   
4  to ensure safety in automated driving the corr...   

                                 tokenized_summaries  
0  [stereo, matching, is, one, of, the, widely, u...  
1  [the, recent, advancements, in, artificial, in...  
2  [in, this, paper, we, proposed, a, novel, mutu...  
3  [consistency, training, has, proven, to, be, a...  
4  [to, ensure, safety, in, automated, driving, t...  


# **Stopword Removal Using spaCy**

In [8]:
def remove_stopwords_spacy(tokens):
    return [token.text for token in nlp(' '.join(tokens)) if not token.is_stop]

df['summaries_no_stopwords'] = df['tokenized_summaries'].apply(remove_stopwords_spacy)

print(df[['tokenized_summaries', 'summaries_no_stopwords']].head())

                                 tokenized_summaries  \
0  [stereo, matching, is, one, of, the, widely, u...   
1  [the, recent, advancements, in, artificial, in...   
2  [in, this, paper, we, proposed, a, novel, mutu...   
3  [consistency, training, has, proven, to, be, a...   
4  [to, ensure, safety, in, automated, driving, t...   

                              summaries_no_stopwords  
0  [stereo, matching, widely, techniques, inferri...  
1  [recent, advancements, artificial, intelligenc...  
2  [paper, proposed, novel, mutual, consistency, ...  
3  [consistency, training, proven, advanced, semi...  
4  [ensure, safety, automated, driving, correct, ...  


# **Lemmatization**

In [9]:
def lemmatize_text_spacy(tokens):
    doc = nlp(' '.join(tokens))
    return [token.lemma_ for token in doc]

df['lemmatized_summaries'] = df['summaries_no_stopwords'].apply(lemmatize_text_spacy)

print(df[['summaries_no_stopwords', 'lemmatized_summaries']].head())

                              summaries_no_stopwords  \
0  [stereo, matching, widely, techniques, inferri...   
1  [recent, advancements, artificial, intelligenc...   
2  [paper, proposed, novel, mutual, consistency, ...   
3  [consistency, training, proven, advanced, semi...   
4  [ensure, safety, automated, driving, correct, ...   

                                lemmatized_summaries  
0  [stereo, matching, widely, technique, infer, d...  
1  [recent, advancement, artificial, intelligence...  
2  [paper, propose, novel, mutual, consistency, n...  
3  [consistency, training, prove, advanced, semis...  
4  [ensure, safety, automate, drive, correct, per...  


# **Processed Abstracts**

In [10]:
def rejoin_summaries(lemmas):
    return ' '.join(lemmas)

df['final_summaries'] = df['lemmatized_summaries'].apply(rejoin_summaries)

print(df[['lemmatized_summaries', 'final_summaries']].head())

                                lemmatized_summaries  \
0  [stereo, matching, widely, technique, infer, d...   
1  [recent, advancement, artificial, intelligence...   
2  [paper, propose, novel, mutual, consistency, n...   
3  [consistency, training, prove, advanced, semis...   
4  [ensure, safety, automate, drive, correct, per...   

                                     final_summaries  
0  stereo matching widely technique infer depth s...  
1  recent advancement artificial intelligence ai ...  
2  paper propose novel mutual consistency network...  
3  consistency training prove advanced semisuperv...  
4  ensure safety automate drive correct perceptio...  


# Task
Extract and display the most frequent noun phrases from the 'final_summaries' column of the dataframe.

## Identify Noun Phrases

### Subtask:
Extract noun phrases from the 'final_summaries' column using spaCy's noun_chunks.


**Reasoning**:
To extract noun phrases, I will define a function that utilizes the pre-loaded spaCy 'nlp' object to process text and extract lemmatized noun chunks. This function will then be applied to the 'final_summaries' column to create a new 'noun_phrases' column.



In [11]:
def extract_noun_phrases(text):
    doc = nlp(text)
    return [chunk.lemma_ for chunk in doc.noun_chunks]

df['noun_phrases'] = df['final_summaries'].apply(extract_noun_phrases)

print(df[['final_summaries', 'noun_phrases']].head())

                                     final_summaries  \
0  stereo matching widely technique infer depth s...   
1  recent advancement artificial intelligence ai ...   
2  paper propose novel mutual consistency network...   
3  consistency training prove advanced semisuperv...   
4  ensure safety automate drive correct perceptio...   

                                        noun_phrases  
0  [stereo matching, widely technique infer depth...  
1  [recent advancement artificial intelligence, e...  
2  [paper, novel mutual consistency network mcnet...  
3  [consistency training, advanced semisupervise ...  
4  [safety automate drive correct perception situ...  


## Count Noun Phrase Frequencies

### Subtask:
Count the occurrences of each unique noun phrase identified.


**Reasoning**:
To count the occurrences of each unique noun phrase, I need to import the `Counter` class, flatten the list of lists in the 'noun_phrases' column into a single list, and then use `Counter` to get the frequencies.



In [12]:
from collections import Counter

all_noun_phrases = []
for phrases_list in df['noun_phrases']:
    all_noun_phrases.extend(phrases_list)

noun_phrase_counts = Counter(all_noun_phrases)

print("Top 10 most frequent noun phrases:")
for noun_phrase, count in noun_phrase_counts.most_common(10):
    print(f"'{noun_phrase}': {count}")

Top 10 most frequent noun phrases:
'method': 55
'algorithm': 51
'deep learning': 38
'paper': 37
'propose method': 37
'deep neural network': 30
'image segmentation': 30
'image': 27
'approach': 26
'performance': 25
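
The flatten-then-count step can also be written without an explicit loop. This is a small sketch using `itertools.chain.from_iterable`; the list `phrase_lists` is made-up sample data standing in for the `noun_phrases` column.

```python
from collections import Counter
from itertools import chain

# Made-up sample standing in for df['noun_phrases'] (one list per abstract).
phrase_lists = [
    ["deep learning", "method"],
    ["method", "image segmentation"],
    [],
]

# chain.from_iterable lazily flattens the list of lists for Counter.
counts = Counter(chain.from_iterable(phrase_lists))
print(counts.most_common(1))
```

The result is the same as extending a list in a loop; the difference is only that no intermediate flat list is built.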


## Display Frequent Noun Phrases

### Subtask:
Present the most frequent noun phrases to the user.


## Summary:

### Data Analysis Key Findings

*   Noun phrases were extracted with `noun_chunks`, lemmatized, and stored in a new column named 'noun\_phrases'. For instance, the first summary ("stereo matching widely technique infer depth s...") yielded "stereo matching" followed by a longer chunk beginning "widely technique infer depth". Because stop words and punctuation had already been removed from 'final\_summaries', some chunks run several original phrases together.
*   The frequency of each extracted noun phrase was calculated, and the top 10 most common phrases were identified.
*   The most frequent noun phrases observed were:
    *   'method': 55 occurrences
    *   'algorithm': 51 occurrences
    *   'deep learning': 38 occurrences
    *   'paper': 37 occurrences
    *   'propose method': 37 occurrences
    *   'deep neural network': 30 occurrences
    *   'image segmentation': 30 occurrences
    *   'image': 27 occurrences
    *   'approach': 26 occurrences
    *   'performance': 25 occurrences

### Insights or Next Steps

*   The high frequency of terms like 'method', 'algorithm', 'deep learning', 'deep neural network', and 'image segmentation' suggests that the 'final\_summaries' predominantly discuss research or technical papers related to artificial intelligence, machine learning, and computer vision.
*   Further analysis could involve exploring the context in which these frequent noun phrases appear or performing topic modeling to group summaries based on shared themes indicated by these phrases.
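
One simple way to explore the context of a frequent phrase is a keyword-in-context listing. The following is a minimal sketch on whitespace-separated text; the function name `keyword_in_context`, the `window` parameter, and the sample sentence are assumptions for illustration.

```python
def keyword_in_context(text, phrase, window=3):
    """Return (left, phrase, right) tuples for each occurrence of phrase."""
    tokens = text.split()
    target = phrase.split()
    n = len(target)
    hits = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + n:i + n + window]
            hits.append((" ".join(left), phrase, " ".join(right)))
    return hits

sample = ("paper propose deep learning method medical image "
          "segmentation deep learning improve result")
hits = keyword_in_context(sample, "deep learning", window=2)
for left, phrase, right in hits:
    print(f"{left} [{phrase}] {right}")
```

Applied to each row of `final_summaries`, this would list the neighbouring words for every occurrence of a chosen phrase.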


# Task
Extract named entities from the 'final_summaries' column using spaCy's Named Entity Recognition (NER) capabilities.

## Extract Named Entities

### Subtask:
Extract named entities from the 'final_summaries' column using spaCy's Named Entity Recognition (NER) capabilities.


**Reasoning**:
To extract named entities, I will define a function that utilizes the pre-loaded spaCy 'nlp' object to process text and extract named entities, then apply this function to the 'final_summaries' column to create a new 'named_entities' column, and finally display the relevant columns to verify the extraction.



In [13]:
def extract_named_entities(text):
    doc = nlp(text)
    return [ent.text for ent in doc.ents]

df['named_entities'] = df['final_summaries'].apply(extract_named_entities)

print(df[['final_summaries', 'named_entities']].head())

                                     final_summaries          named_entities
0  stereo matching widely technique infer depth s...                    [3d]
1  recent advancement artificial intelligence ai ...              [european]
2  paper propose novel mutual consistency network...  [misclassifie, second]
3  consistency training prove advanced semisuperv...                      []
4  ensure safety automate drive correct perceptio...      [recent year, gmm]


**Reasoning**:
The previous step successfully extracted named entities. Now, I need to count the occurrences of each unique named entity to identify the most frequent ones.



In [14]:
from collections import Counter

all_named_entities = []
for entities_list in df['named_entities']:
    all_named_entities.extend(entities_list)

named_entity_counts = Counter(all_named_entities)

print("Top 10 most frequent named entities:")
for entity, count in named_entity_counts.most_common(10):
    print(f"'{entity}': {count}")

Top 10 most frequent named entities:
'3d': 304
'cnn': 206
'2d': 104
'second': 78
'1': 77
'2': 77
'3': 51
'recent year': 40
'scan': 31
'one': 30


## Display Frequent Named Entities

### Subtask:
Present the most frequent named entities and their counts to the user.


## Summary:

### Data Analysis Key Findings

*   Named entities were successfully extracted from the `final_summaries` column, populating a new column named `named_entities` with lists of identified entities.
*   The most frequent named entity identified was '3d', appearing 304 times.
*   'cnn' was the second most frequent entity with 206 occurrences.
*   '2d' ranked as the third most frequent entity, observed 104 times.

### Insights or Next Steps

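*   The prevalence of '3d' and '2d' suggests that the summaries frequently discuss two- and three-dimensional data, which fits the computer vision category (`cs.CV`) seen in the dataset preview. In this context 'cnn' most likely stands for convolutional neural network rather than the news organization. Most of the remaining top entries ('second', '1', '2', '3', 'one', 'recent year') are numbers, ordinals, or dates rather than informative entities, probably because the model was run on lowercased text with punctuation and stop words removed.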
*   Future analysis could involve categorizing these frequent entities (e.g., by type like organizations, technologies, locations) to understand the dominant categories within the summaries, or exploring the context in which these entities appear to uncover specific trends or relationships.
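
Categorizing entities by type needs the label as well as the text. In spaCy each entity exposes both `ent.text` and `ent.label_`, so the extraction function could return pairs. The sketch below only shows the grouping step on hypothetical pairs; the labels assigned here are illustrative and were not produced by the notebook.

```python
from collections import Counter, defaultdict

# Hypothetical (text, label) pairs; in the notebook these would come from
# [(ent.text, ent.label_) for ent in doc.ents]. Labels are illustrative only.
entity_pairs = [
    ("3d", "CARDINAL"), ("3d", "CARDINAL"), ("second", "ORDINAL"),
    ("recent year", "DATE"), ("european", "NORP"), ("second", "ORDINAL"),
]

# One Counter per entity label.
counts_by_label = defaultdict(Counter)
for text, label in entity_pairs:
    counts_by_label[label][text] += 1

for label, counter in sorted(counts_by_label.items()):
    print(label, counter.most_common(3))
```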


# Task
Define spaCy Matcher patterns to identify technical terms in the 'final_summaries' column, apply these patterns to extract the terms, and then analyze and display the most common extracted terms.

## Define spaCy Matcher patterns

### Subtask:
Set up spaCy's Matcher and define linguistic patterns to identify technical terms. This may involve using Part-of-Speech (POS) tags, lemmas, or specific token attributes.


**Reasoning**:
To set up spaCy's Matcher, I will import the `Matcher` class, initialize it with the `nlp.vocab`, define a few example patterns for technical terms, and then add these patterns to the matcher as requested by the instructions.



In [18]:
from spacy.matcher import Matcher

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Define patterns for technical terms
pattern1 = [{'LOWER': 'deep'}, {'LOWER': 'learning'}]
pattern2 = [{'LOWER': 'neural'}, {'LOWER': 'network'}]
pattern3 = [{'LOWER': 'computer'}, {'LOWER': 'vision'}]
pattern4 = [{'POS': 'ADJ', 'OP': '?'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]

# Add patterns to the matcher
matcher.add('TECHNICAL_TERM', [pattern1, pattern2, pattern3, pattern4])

print("spaCy Matcher initialized and patterns added.")

spaCy Matcher initialized and patterns added.


## Apply Matcher to text

### Subtask:
Apply the defined patterns using the spaCy Matcher to the 'final_summaries' column to extract technical terms.


**Reasoning**:
To apply the defined spaCy Matcher patterns, I will create a function that processes text using the 'nlp' object, applies the 'matcher' to find all technical term matches, and returns a list of these extracted terms. This function will then be applied to the 'final_summaries' column to populate a new 'technical_terms' column.



In [16]:
def extract_technical_terms(text):
    doc = nlp(text)
    matches = matcher(doc)
    extracted_terms = []
    for match_id, start, end in matches:
        extracted_terms.append(doc[start:end].text)
    return extracted_terms

df['technical_terms'] = df['final_summaries'].apply(extract_technical_terms)

print(df[['final_summaries', 'technical_terms']].head())

                                     final_summaries  \
0  stereo matching widely technique infer depth s...   
1  recent advancement artificial intelligence ai ...   
2  paper propose novel mutual consistency network...   
3  consistency training prove advanced semisuperv...   
4  ensure safety automate drive correct perceptio...   

                                     technical_terms  
0  [matching, technique, technique infer, infer, ...  
1  [recent advancement, recent advancement artifi...  
2  [paper, mutual consistency, mutual consistency...  
3  [consistency, consistency training, training, ...  
4  [safety, automate drive, drive, correct percep...  


**Reasoning**:
Now that the technical terms have been extracted, the next step is to count the occurrences of each unique term. This will involve flattening the list of lists in the 'technical_terms' column and then using `collections.Counter` to get the frequencies. Finally, the top most frequent terms will be displayed.



In [17]:
from collections import Counter

all_technical_terms = []
for terms_list in df['technical_terms']:
    all_technical_terms.extend(terms_list)

technical_term_counts = Counter(all_technical_terms)

print("Top 10 most frequent technical terms:")
for term, count in technical_term_counts.most_common(10):
    print(f"'{term}': {count}")

Top 10 most frequent technical terms:
'segmentation': 3328
'image': 3255
'method': 1383
'network': 1288
'image segmentation': 1209
'model': 1113
'approach': 680
'performance': 658
'task': 639
'result': 620


## Analyze and display matched terms

### Subtask:
Present the most frequent technical terms and their counts to the user, and summarize the findings.


## Summary:

### Data Analysis Key Findings

*   The spaCy `Matcher` was successfully initialized with four patterns: three specific multi-word terms (`deep learning`, `neural network`, `computer vision`) and one general Part-of-Speech-based pattern that identifies noun phrases (allowing for an optional adjective and an optional second noun).
*   A new column named `technical_terms` was added to the DataFrame, containing lists of extracted technical phrases from the `final_summaries` column for each entry.
*   Analysis of the extracted terms revealed that 'segmentation' was the most frequent technical term, appearing 3328 times.
*   Other highly frequent terms included 'image' with 3255 occurrences and 'method' with 1383 occurrences.

### Insights or Next Steps

*   The current patterns extract both the specific multi-word terms and general noun sequences, but the matches overlap: the sample output contains 'technique', 'technique infer', and 'infer' from the same stretch of text, so an occurrence of 'image segmentation' can also add to the counts for 'image' and 'segmentation'. Keeping only the longest match per span, and filtering out generic words such as 'method', would focus the list on domain-specific terms.
*   The extracted technical terms can serve as a valuable input for subsequent analyses, such as building a domain-specific lexicon, performing topic modeling, or generating keyword clouds to understand the predominant themes within the summaries.
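
As a rough illustration of filtering generic words, the sketch below removes a hand-picked set of terms from the top-10 counts reported above. The set `GENERIC_TERMS` is a hypothetical choice and would need to be adjusted for the domain.

```python
from collections import Counter

# Counts copied from the top-10 technical terms output above.
term_counts = Counter({
    "segmentation": 3328, "image": 3255, "method": 1383, "network": 1288,
    "image segmentation": 1209, "model": 1113, "approach": 680,
    "performance": 658, "task": 639, "result": 620,
})

# Hypothetical list of generic research words to exclude.
GENERIC_TERMS = {"method", "model", "approach", "performance", "task", "result"}

filtered = Counter({t: c for t, c in term_counts.items() if t not in GENERIC_TERMS})
print(filtered.most_common())
```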
