<h1 style="text-align: center; font-weight: bold;">TP2: Basic Text Processing</h1>
<h3 style="font-weight: bold;">Exercise</h3>
<p>Using the NLTK (Natural Language Toolkit) package, we want to enrich the text (text.html) with a set of annotations such as the number of occurrences in the text, the lemma, the root, the grammatical category, etc. The result should be stored in a CSV file. Each line of the result file should contain the word and all the labels (see annotated_text.csv).</p>
<p>Tasks to be completed:</p>
<ul>
  <li>Using regular expressions, filter the "text.html" file to remove tags and non-important information. Keep only the text.</li>
  <li>Segment the resulting text into sentences.</li>
  <li>Segment each sentence into words.</li>
  <li>For each word, add the necessary labels according to the result file (see the "annotated_text.csv" file).</li>
  <li>Repeat the same process, this time using the spaCy package.</li>
</ul>

### **Import Libraries**

In [119]:
import re
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.corpus import wordnet
from nltk.corpus import stopwords
import spacy

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

### **Preprocess Text**

In [120]:
def read_file(filename):
    """
    Reads and processes a text file.

    Args:
        filename (str): The name of the file to be read.

    Returns:
        str : File content
    """
    # Read the file and return a list of lines
    with open(filename, 'r') as f:
        return f.read()

In [121]:
text = read_file('./text.html')
text

'<!DOCTYPE html>\n<html>\n\t<head>\n\t\t<meta charset="utf-8"/>\n\t\t<h1>\n\t\t\t<b> The Providence Journal</b>\n\t\t</h1>\n\t</head>\n\t<body>\n\t\t<text>\n\t\t\t<p>East Providence should organize its civil defense setup and begin by appointing a full-time director, Raymond H. Hawksley, the present city CD head, believes. Mr. Hawksley said yesterday he would be willing to go before the city council `` or anyone else locally \'\' to outline his proposal at the earliest possible time. East Providence now has no civil defense program. Mr. Hawksley, the state\'s general treasurer, has been a part-time CD director in the city for the last nine years. He is not interested in being named a full-time director. </p>\n\t\t\t<p>Noting that President Kennedy has handed the Defense Department the major responsibility for the nation\'s civil defense program, Mr. Hawksley said the federal government would pay half the salary of a full-time local director. He expressed the opinion the city could hire

In [122]:
def remove_html_tags(text):
    """Remove HTML tags from a given text.

    Args:
        text (str): The input text containing HTML tags.

    Returns:
        str: The text with HTML tags removed.
    """
    if text is None:
        return ""

    if not isinstance(text, str):
        text = str(text)
        
    # Define a regular expression to match HTML tags
    html_tags_pattern = re.compile(r'<.*?>')
    
    # Use the sub() function to replace HTML tags with an empty string
    text_without_html = re.sub(html_tags_pattern, '', text)
    
    return text_without_html


In [123]:
text_without_tags = remove_html_tags(text=text)
text_without_tags

"\n\n\t\n\t\t\n\t\t\n\t\t\t The Providence Journal\n\t\t\n\t\n\t\n\t\t\n\t\t\tEast Providence should organize its civil defense setup and begin by appointing a full-time director, Raymond H. Hawksley, the present city CD head, believes. Mr. Hawksley said yesterday he would be willing to go before the city council `` or anyone else locally '' to outline his proposal at the earliest possible time. East Providence now has no civil defense program. Mr. Hawksley, the state's general treasurer, has been a part-time CD director in the city for the last nine years. He is not interested in being named a full-time director. \n\t\t\tNoting that President Kennedy has handed the Defense Department the major responsibility for the nation's civil defense program, Mr. Hawksley said the federal government would pay half the salary of a full-time local director. He expressed the opinion the city could hire a CD director for about $3,500 a year and would only have to put up half that amount on a matching

In [124]:
def decompose_text_into_lines(text):
    """
    Decompose a text into lines, removing empty lines and leading/trailing whitespace.

    Parameters:
    text (str): The input text to be decomposed into lines.

    Returns:
    list of str: A list of lines from the input text.
    """
    
    # Split the text into lines using newline as the delimiter
    sentences = text.split('\n')
    
    # Remove leading/trailing whitespace and empty lines
    sentences = [s.replace('\t', '').strip() for s in sentences if s.strip() != '']
    
    return sentences

In [125]:
sentences = decompose_text_into_lines(text_without_tags)
sentences

['The Providence Journal',
 "East Providence should organize its civil defense setup and begin by appointing a full-time director, Raymond H. Hawksley, the present city CD head, believes. Mr. Hawksley said yesterday he would be willing to go before the city council `` or anyone else locally '' to outline his proposal at the earliest possible time. East Providence now has no civil defense program. Mr. Hawksley, the state's general treasurer, has been a part-time CD director in the city for the last nine years. He is not interested in being named a full-time director.",
 "Noting that President Kennedy has handed the Defense Department the major responsibility for the nation's civil defense program, Mr. Hawksley said the federal government would pay half the salary of a full-time local director. He expressed the opinion the city could hire a CD director for about $3,500 a year and would only have to put up half that amount on a matching fund basis to defray the salary costs.",
 'Mr. Hawks

### **NLTK vs SpaCy**

#### **1. NLTK**

In [126]:
def get_wordnet_pos(tag):
    tag = tag[0].upper()
    tag_dict = {"N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV, "J": wordnet.ADJ}
    return tag_dict.get(tag, wordnet.NOUN)

In [127]:
def annotate_text_with_nltk(sentences):
    """
    Annotate text using NLTK for information like word, occurrences, lemma, stem, POS tag, and stopword status.

    Parameters:
    sentences (list): A list of sentences or text to be annotated.

    Returns:
    pandas.DataFrame: A DataFrame containing annotations, and also saves the annotations to an Excel file.
    """

    # Initialize NLTK tools
    lemmatizer = WordNetLemmatizer()
    stemmer = SnowballStemmer('english')
    stop_words = set(stopwords.words('english'))

    # Initialize dictionaries to store annotations
    annotations = {'word': [], 'occurrences': [], 'lemma': [], 'stem': [], 'pos_tag': [], 'is_stopword': []}
    word_occurrence = {}

    for sentence in sentences:
        # Tokenize the sentence
        tokens = word_tokenize(sentence)

        for token in tokens:
            # Preprocess the token
            token = token.lower()
            token = ''.join(char for char in token if char.isalnum())

            # Count word occurrences
            word_occurrence[token] = word_occurrence.get(token, 1) + 1

            if token == '' or token in annotations['word']:
                continue

            annotations['word'].append(token)
            annotations['is_stopword'].append(token in stop_words)
            pos_tag = nltk.pos_tag([token])[0][1]
            annotations['pos_tag'].append(pos_tag)
            annotations['lemma'].append(lemmatizer.lemmatize(token, get_wordnet_pos(pos_tag)))
            annotations['stem'].append(stemmer.stem(token))

    # Calculate word occurrences
    for word in annotations['word']:
        annotations['occurrences'].append(word_occurrence[word])

    # Create a DataFrame from the annotations
    annotations_df = pd.DataFrame(annotations)

    # Save the annotations to an Excel file
    annotations_df.to_excel('text_annotations_using_nltk.xlsx')

    return annotations_df

In [128]:
annotate_text_with_nltk(sentences)

Unnamed: 0,word,occurrences,lemma,stem,pos_tag,is_stopword
0,the,31,the,the,DT,True
1,providence,6,providence,provid,NN,False
2,journal,2,journal,journal,NN,False
3,east,5,east,east,NN,False
4,should,3,should,should,MD,True
...,...,...,...,...,...,...
176,where,2,where,where,WRB,True
177,event,2,event,event,NN,False
178,an,2,an,an,DT,True
179,enemy,2,enemy,enemi,NN,False


#### **2. SpaCy**

In [129]:
def annotate_text_with_spacy(sentences):
    """
    Annotate text using spaCy for information like word, occurrences, lemma, POS tag, and stopword status.

    Parameters:
    sentences (list): A list of sentences or text to be annotated.

    Returns:
    pandas.DataFrame: A DataFrame containing annotations, and also saves the annotations to an Excel file.
    """
    
    # Load the English language model
    nlp = spacy.load("en_core_web_sm")

    # Initialize dictionaries to store annotations
    annotations = {'word': [], 'occurrences': [], 'lemma': [], 'pos_tag': [], 'is_stopword': []}
    word_occurrence = {}

    for sentence in sentences:
        # Process the text using spaCy
        tokens = nlp(sentence)
        for token in tokens:
            word_occurrence[str(token).lower()] = word_occurrence.get(str(token).lower(), 1) + 1

            if token.text == '' or token.text in annotations['word']:
                continue

            annotations['word'].append(str(token).lower())
            annotations['is_stopword'].append(token.is_stop)
            annotations['pos_tag'].append(token.pos_)
            annotations['lemma'].append(token.lemma_)

    # Calculate word occurrences
    for word in annotations['word']:
        annotations['occurrences'].append(word_occurrence[word])

    # Create a DataFrame from the annotations
    annotations_df = pd.DataFrame(annotations)

    # Save the annotations to an Excel file
    annotations_df.to_excel('text_annotations_using_spacy.xlsx')

    return annotations_df

In [130]:
annotate_text_with_spacy(sentences)

Unnamed: 0,word,occurrences,lemma,pos_tag,is_stopword
0,the,31,the,DET,True
1,providence,6,Providence,PROPN,False
2,journal,2,Journal,PROPN,False
3,east,5,East,PROPN,False
4,providence,6,Providence,PROPN,False
...,...,...,...,...,...
215,where,2,where,SCONJ,True
216,event,2,event,NOUN,False
217,an,2,an,DET,True
218,enemy,2,enemy,NOUN,False


#### **3. Difference between NLTK and SpaCy**

<table border="1">
  <tr>
    <th>Feature</th>
    <th>NLTK</th>
    <th>spaCy</th>
  </tr>
  <tr>
    <td>Tokenization</td>
    <td>Word and sentence tokenization functions are available.</td>
    <td>Advanced tokenization with word and sentence boundary detection.</td>
  </tr>
  <tr>
    <td>Lemmatization</td>
    <td>Basic lemmatization is available but may require additional custom rules.</td>
    <td>Built-in lemmatization with WordNet integration and support for various languages.</td>
  </tr>
  <tr>
    <td>Stemming</td>
    <td>Provides stemming algorithms for English and other languages.</td>
    <td>Lemmatization is provided, but stemming is not available in spaCy.</td>
  </tr>
  <tr>
    <td>Part-of-Speech (POS) Tagging</td>
    <td>Provides POS tagging but may require additional data for more languages.</td>
    <td>Built-in POS tagging for multiple languages with pre-trained models.</td>
  </tr>
  <tr>
    <td>Stopword Removal</td>
    <td>Stopword lists are available, and removal can be customized.</td>
    <td>Built-in stopword removal with support for multiple languages.</td>
  </tr>
  <tr>
    <td>Named Entity Recognition (NER)</td>
    <td>NER functionality is available but may require additional training for specific domains.</td>
    <td>Built-in NER with support for various entity types and languages.</td>
  </tr>
  <tr>
    <td>Dependency Parsing</td>
    <td>Basic dependency parsing with customizable grammar rules.</td>
    <td>Advanced dependency parsing with pre-trained models and efficient parsing algorithms.</td>
  </tr>
  <tr>
    <td>Customization</td>
    <td>NLTK allows you to define custom rules for various NLP tasks.</td>
    <td>spaCy provides more advanced customization options, including training new models.</td>
  </tr>
  <tr>
    <td>Community and Documentation</td>
    <td>Has an active community and extensive documentation.</td>
    <td>Has a growing community and comprehensive documentation.</td>
  </tr>
</table>