# Basics of Natural Language Processing (NLP) Take Home Exercise #



Use the following link to find open-source data sets for completing the take-home exercises.

[Data Sets](https://opendatascience.com/20-open-datasets-for-natural-language-processing/)

Or you can try the Assignment 1 data set to get a head start on the work!

# Run this code at the beginning to limit the output size of the cells

In [8]:
from IPython.display import display, Javascript

def resize_colab_cell():
    # Change the maxHeight value to change the max height of the output
    display(Javascript('google.colab.output.setIframeHeight(0, true, {maxHeight: 400})'))

# Change output size for the entire notebook (call the function before each cell runs)
get_ipython().events.register('pre_run_cell', resize_colab_cell)

### 1. Input Text

Write a function that collects text data for the analysis from user input, e.g. from a text box.



In [9]:
import ipywidgets as widgets
from IPython.display import display

def collect_text_data():
    text_box = widgets.Textarea(
        placeholder='Enter your text here',
        description='Text:',
        disabled=False
    )
    display(text_box)

    button = widgets.Button(description="Submit")
    display(button)

    def handle_submit(sender):
        user_input = text_box.value.strip()
        print("Collected Text Data:", user_input)

    button.on_click(handle_submit)  # Run handle_submit when the button is clicked

collect_text_data()

Textarea(value='', description='Text:', placeholder='Enter your text here')

Button(description='Submit', style=ButtonStyle())
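
Outside a notebook, `ipywidgets` may not be available. The following is a minimal sketch of a console fallback for the same task. The helper name `collect_text_lines` and its `read_line` parameter are hypothetical, not part of the original exercise. It reads lines until an empty line is entered.

```python
def collect_text_lines(read_line=input):
    """Read lines until an empty line is entered and return them as one string."""
    lines = []
    while True:
        line = read_line().strip()
        if not line:  # An empty line ends the input
            break
        lines.append(line)
    return "\n".join(lines)


# Usage with canned lines instead of a live prompt (illustrative data only)
canned = iter(["First line", "Second line", ""])
text = collect_text_lines(read_line=lambda: next(canned))
print(text)
```

Passing the reader as a parameter keeps the function usable with `input` interactively while letting it be exercised with fixed data.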

### 2. Basic Analysis

Perform basic text analysis on the collected text using the spaCy ([spacy.io](http://spacy.io)) library.

In [10]:
import spacy

nlp = spacy.load("en_core_web_sm")

def analyze_text(text):
    doc = nlp(text)

    print("Tokens:")
    for token in doc:
        print(token.text, token.pos_, token.dep_)

    print("\nSentences:")
    for sent in doc.sents:
        print(sent.text)

    print("\nNamed Entities:")
    for ent in doc.ents:
        print(ent.text, ent.label_)


user_input = '''
36118 Applied Natural Language Processing

Subject description
This subject introduces you to the complexities of analysing human language data and the use of Natural Language Processing (NLP) and text mining techniques. You'll develop both technical and communicative skills to process and interpret unstructured textual data. Covering core NLP concepts and extraction techniques, the course equips you with … For more content click the Read More button below.

Learning outcomes
Upon completion of this subject, graduates will be able to:
1. Understand core concepts of Natural Language Processing (NLP) and computational linguistics including its limitations (CILO 2.2, 2.3)

2.Evaluate complex challenges for problem solving and build practical NLP applications (CILO 2.3, 4.2)

3. Apply text mining techniques on unstructured data sets using advanced NLP programming packages (CILOs 1.2, 2.2)

4. Interpret, extract value and effectively communicate insights from text analysis and create real-world applications suitable to a range of audiences (CILOs 2.4, 3.2, 4.2)

5. Articulate the strengths, weaknesses and underlying assumptions of NLP and text analysis to apply ethical practices (CILO 5.1, 5.2)

Learning and teaching activities
Blend of online and face to face activities: The subject is offered through a series of teaching sessions which blend online and face-to-face learning. Students learn through interactive lectures and classroom activities making use of the subject materials on canvas. They also engage in individual and collaborative learning activities to … For more content click the Read More button below. Authentic problem based learning: This subject offers a range of authentic data science problems to solve that will help develop students’ text analysis skills. They work on real world data analysis problems for broad areas of interest using unstructured data and contemporary techniques. Collaborative work: Group activities will enable students to leverage peer-learning and demonstrate effective team participation, as well as learning to work in professional teams with an appreciation of diverse perspectives on data science and innovation. Future-oriented strategies: Students will be exposed to contemporary learning models using speculative thinking, ethical and human-centered approaches as well as reflection. Electronic portfolios will be used to curate, consolidate and provide evidence of learning and development of course outcomes, graduate attributes and professional evolution. Formative feedback will be offered with all assessment activities for successful engagement.

Authentic problem based learning: This subject offers a range of authentic data science problems to solve that will help develop students’ text analysis skills. They work on real world data analysis problems for broad areas of interest using unstructured data and contemporary techniques.

Collaborative work: Group activities will enable students to leverage peer-learning and demonstrate effective team participation, as well as learning to work in professional teams with an appreciation of diverse perspectives on data science and innovation.

Future-oriented strategies: Students will be exposed to contemporary learning models using speculative thinking, ethical and human-centered approaches as well as reflection. Electronic portfolios will be used to curate, consolidate and provide evidence of learning and development of course outcomes, graduate attributes and professional evolution. Formative feedback will be offered with all assessment activities for successful engagement.


'''

analyze_text(user_input)


Tokens:

 SPACE dep
36118 NUM nummod
Applied PROPN compound
Natural PROPN compound
Language PROPN compound
Processing PROPN compound


 SPACE dep
Subject PROPN compound
description NOUN nsubj

 SPACE dep
This DET det
subject NOUN nsubj
introduces VERB ROOT
you PRON dobj
to ADP prep
the DET det
complexities NOUN pobj
of ADP prep
analysing VERB pcomp
human ADJ amod
language NOUN compound
data NOUN dobj
and CCONJ cc
the DET det
use NOUN conj
of ADP prep
Natural PROPN compound
Language PROPN nmod
Processing PROPN pobj
( PUNCT punct
NLP PROPN appos
) PUNCT punct
and CCONJ cc
text NOUN conj
mining NOUN compound
techniques NOUN dobj
. PUNCT punct
You PRON nsubj
'll AUX aux
develop VERB ROOT
both CCONJ det
technical ADJ amod
and CCONJ cc
communicative ADJ conj
skills NOUN dobj
to PART aux
process VERB relcl
and CCONJ cc
interpret VERB conj
unstructured ADJ amod
textual ADJ amod
data NOUN dobj
. PUNCT punct
Covering VERB advcl
core NOUN compound
NLP PROPN compound
concepts NOUN dobj
and CCONJ c

### 3. Tokenizer
Create a custom tokenizer in Python that handles:
*   Splits contractions (e.g., "don't" → "do n't")
*   Keeps punctuation as separate tokens
*   Splits hyphenated words (e.g., "state-of-the-art" → "state of the art")

Compare its results with NLTK's word_tokenize on any sample paragraph and the following examples:
"New York-based company", "It's a beautiful day!", "https://www.example.com"

What differences do you see? What are the advantages and limitations of each approach?

In [11]:
import re

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt_tab')  # Tokenizer data required by word_tokenize

# URLs are matched first so the punctuation rules below do not split them
URL_PATTERN = r"((?:https?://|www\.)\S*[^\s.,!?;:])"


def custom_tokenize(text):
    tokens = []
    # re.split with a capturing group returns the URLs at the odd indices
    for i, part in enumerate(re.split(URL_PATTERN, text)):
        if i % 2 == 1:
            tokens.append(part)  # Keep each URL as a single token
            continue

        # Handle contractions: "don't" -> "do n't", "It's" -> "It 's"
        part = re.sub(r"(\w)n't\b", r"\1 n't", part)
        part = re.sub(r"(\w)'(?!t\b)(\w)", r"\1 '\2", part)

        # Keep punctuation as separate tokens
        part = re.sub(r"([.,!?;:])", r" \1 ", part)

        # Split hyphenated words: "state-of-the-art" -> "state of the art"
        part = re.sub(r"(?<=[a-zA-Z])-(?=[a-zA-Z])", " ", part)

        tokens.extend(part.split())

    return tokens

# Sample paragraph
paragraph = "It's a beautiful day! New York-based company announced state-of-the-art technology. Let's go to https://www.example.com."

# Tokenize the same text with both tokenizers so the results are comparable
custom_tokens = custom_tokenize(user_input)
nltk_tokens = word_tokenize(user_input)

# Example sentences, plus the sample paragraph
examples = [paragraph, "New York-based company", "It's a beautiful day!", "https://www.example.com"]

print("Custom Tokenizer:")
print(custom_tokens)
print("\nNLTK Tokenizer:")
print(nltk_tokens)
print("\n")

for example in examples:
    print("Example:", example)
    print("Custom Tokenizer:", custom_tokenize(example))
    print("NLTK Tokenizer:", word_tokenize(example))
    print("\n")

Custom Tokenizer:
['36118', 'Applied', 'Natural', 'Language', 'Processing', 'Subject', 'description', 'This', 'subject', 'introduces', 'you', 'to', 'the', 'complexities', 'of', 'analysing', 'human', 'language', 'data', 'and', 'the', 'use', 'of', 'Natural', 'Language', 'Processing', '(NLP)', 'and', 'text', 'mining', 'techniques', '.', 'You', "'ll", 'develop', 'both', 'technical', 'and', 'communicative', 'skills', 'to', 'process', 'and', 'interpret', 'unstructured', 'textual', 'data', '.', 'Covering', 'core', 'NLP', 'concepts', 'and', 'extraction', 'techniques', ',', 'the', 'course', 'equips', 'you', 'with', '…', 'For', 'more', 'content', 'click', 'the', 'Read', 'More', 'button', 'below', '.', 'Learning', 'outcomes', 'Upon', 'completion', 'of', 'this', 'subject', ',', 'graduates', 'will', 'be', 'able', 'to', ':', '1', '.', 'Understand', 'core', 'concepts', 'of', 'Natural', 'Language', 'Processing', '(NLP)', 'and', 'computational', 'linguistics', 'including', 'its', 'limitations', '(CILO'

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
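
One way to answer "What differences do you see?" is to list the tokens that only one tokenizer produces. The following is a minimal sketch using only the standard library. The helper `diff_tokens` is hypothetical, and the two token lists are hand-written stand-ins, not recorded tokenizer output.

```python
from collections import Counter


def diff_tokens(tokens_a, tokens_b):
    """Return the tokens that appear only in one of the two tokenizations."""
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    # Counter subtraction keeps only positive counts
    only_a = sorted((count_a - count_b).elements())
    only_b = sorted((count_b - count_a).elements())
    return only_a, only_b


# Hand-written token lists standing in for real tokenizer output
hyphen_split = ["New", "York", "based", "company"]
hyphen_kept = ["New", "York-based", "company"]

only_first, only_second = diff_tokens(hyphen_split, hyphen_kept)
print("Only in first:", only_first)
print("Only in second:", only_second)
```

In the notebook, the same helper could be called as `diff_tokens(custom_tokenize(example), word_tokenize(example))` for each example sentence.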


### 4. Regex

Try writing your own RegEx that captures citations in text, e.g. (Horning, 2022).

In [12]:


import re

def extract_citations(text):
    # Regular expression to match citations in the format (Author, Year)
    citation_pattern = r"\(([A-Z][a-z]+(?:[\s-][A-Z][a-z]+)?),\s(\d{4})\)"
    citations = re.findall(citation_pattern, text)
    return citations

# Example usage (using the user_input from the previous code block)
citations = extract_citations(user_input)
print("Citations found:")
for citation in citations:
    print(citation)


# Example with additional test cases
test_cases = [
    "This is a sentence with a citation (Horning, 2022).",
    "Another citation (Smith-Jones, 2023) is here.",
    "No citations here.",
    "(Doe, 1999) and (Jane Doe, 2000)."
]

for text in test_cases:
    citations = extract_citations(text)
    print(f"\nText: {text}")
    print(f"Citations: {citations}")


Citations found:

Text: This is a sentence with a citation (Horning, 2022).
Citations: [('Horning', '2022')]

Text: Another citation (Smith-Jones, 2023) is here.
Citations: [('Smith-Jones', '2023')]

Text: No citations here.
Citations: []

Text: (Doe, 1999) and (Jane Doe, 2000).
Citations: [('Doe', '1999'), ('Jane Doe', '2000')]
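
The pattern above covers single-author citations. As a hedged sketch, it could be extended to two-author and "et al." forms. The citation shapes listed in the comment are assumptions about the target style, and `extract_citations_extended` is a hypothetical name.

```python
import re

# Assumed citation shapes: (Author, Year), (Author & Author, Year),
# (Author and Author, Year), (Author et al., Year)
NAME = r"[A-Z][a-z]+(?:[\s-][A-Z][a-z]+)*"
CITATION = re.compile(
    rf"\(({NAME}(?:\s(?:&|and)\s{NAME}|\set\sal\.)?),\s(\d{{4}})\)"
)


def extract_citations_extended(text):
    """Return (authors, year) tuples for every citation found in text."""
    return CITATION.findall(text)


sample = "See (Horning, 2022), (Smith & Jones, 2021) and (Doe et al., 2019)."
for authors, year in extract_citations_extended(sample):
    print(authors, year)
```

Citations with page numbers or multiple sources in one pair of parentheses would still need further alternatives in the pattern.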


Extract URLs that follow a given format (starting with `www.`, `http://`, or `https://`).

In [13]:
import re

def extract_urls(text):
    url_pattern = r"(https?://\S+|www\.\S+)"
    urls = re.findall(url_pattern, text)
    return urls

# Example usage (using the user_input from the previous code block)
urls = extract_urls(user_input)
print("URLs found:")
for url in urls:
    print(url)


URLs found:
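
A pattern based on `\S+` also captures sentence punctuation that directly follows a URL. The following is a minimal sketch of one way to trim it. The helper name `extract_clean_urls` and the set of stripped characters are assumptions, and the sample text is illustrative.

```python
import re

URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def extract_clean_urls(text):
    """Find URLs and drop trailing punctuation that belongs to the sentence."""
    return [match.rstrip(".,;:!?)\"'") for match in URL_PATTERN.findall(text)]


sample = "Visit https://www.example.com. Or (see www.example.org/docs)."
for url in extract_clean_urls(sample):
    print(url)
```

The trade-off is that a URL which legitimately ends in one of the stripped characters, such as a closing parenthesis, would be shortened.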


### 5. Word Frequency

Find the list of words that occur more than 10 times in a selected corpus.

Try using different forms of setup: no stopwords, custom stopwords, not removing punctuation, etc. and see what difference in results they produce.


In [15]:
import re
from collections import Counter

def word_frequency(text, min_frequency=10, remove_stopwords=True, custom_stopwords=None, remove_punctuation=True):

    # 1. Tokenization (similar to the custom_tokenize function)
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)  # "don't" -> "do n't"
    text = re.sub(r"(\w)'(?!t\b)(\w)", r"\1 '\2", text)  # "it's" -> "it 's"
    if remove_punctuation:
        text = re.sub(r"[^\w\s']", "", text) #remove punctuation
    tokens = text.lower().split()

    # 2. Stopword Removal
    if remove_stopwords:
        stopwords = set(['the', 'a', 'an', 'and', 'or', 'in', 'to', 'of', 'for', 'with', 'on', 'at', 'by', 'from', 'up', 'down', 'out', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'should', 'can', 'could', 'may', 'might', 'must', 'as', 'if', 'so', 'that', 'this', 'these', 'those', 'it', 'its', 'they', 'their', 'them', 'he', 'him', 'his', 'she', 'her', 'hers', 'we', 'us', 'our', 'ours', 'you', 'your', 'yours', 'my', 'mine', 'i', 'me'])
        if custom_stopwords:
            stopwords.update(custom_stopwords)  # Add custom stopwords if provided
        tokens = [token for token in tokens if token not in stopwords]

    # 3. Count word frequencies
    word_counts = Counter(tokens)

    # 4. Return words with frequency > min_frequency
    frequent_words = [(word, count) for word, count in word_counts.items() if count > min_frequency]
    return frequent_words

# Example usage with the provided user_input
user_input = '''
In a selected corpus, we can find the list of words that occur more than 10 times.
For example, in this dummy text, we will try to use different forms of setup: no stopwords, custom stopwords, not removing punctuation, etc., and see what difference in results they produce.
 The word ‘corpus’ appears multiple times in this text. The word ‘text’ also appears multiple times. We will also include some common stopwords like ‘the’, ‘and’, ‘in’, ‘of’, etc., to see how they affect the results.
 Additionally, we will add some punctuation marks like commas, periods, and exclamation marks! Let’s see how this dummy text helps in understanding the concept of corpus and word frequency analysis. corpus corpus corpus corpus corpus corpus corpus corpus corpus corpus corpus text text text text text text text text text text text text text text text stopwords stopwords stopwords stopwords stopwords stopwords stopwords stopwords stopwords stopwords stopwords stopwords stopwords stopwords punctuation punctuation punctuation punctuation punctuation punctuation punctuation punctuation punctuation punctuation punctuation punctuation punctuation punctuation punctuation analysis analysis analysis analysis analysis analysis analysis analysis analysis analysis analysis analysis
'''


# Different setups
print("No Stopwords, No Punctuation Removal:")
print(word_frequency(user_input, min_frequency=2, remove_stopwords=False, remove_punctuation=False))


print("\nDefault Settings (remove stopwords, remove punctuation):")
print(word_frequency(user_input, min_frequency=2))


print("\nWith Custom Stopwords and Punctuation:")
custom_stops = {'subject', 'learning'}
print(word_frequency(user_input, min_frequency=2, custom_stopwords=custom_stops, remove_punctuation=True))


No Stopwords, No Punctuation Removal:
[('in', 5), ('we', 4), ('the', 5), ('of', 3), ('this', 3), ('will', 3), ('and', 3), ('see', 3), ('word', 3), ('stopwords', 15), ('punctuation', 16), ('text', 16), ('corpus', 12), ('analysis', 12)]

Default Settings (remove stopwords, remove punctuation):
[('corpus', 14), ('times', 3), ('text', 19), ('stopwords', 17), ('punctuation', 17), ('see', 3), ('word', 3), ('analysis', 13)]

With Custom Stopwords and Punctuation:
[('corpus', 14), ('times', 3), ('text', 19), ('stopwords', 17), ('punctuation', 17), ('see', 3), ('word', 3), ('analysis', 13)]
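
To make the effect of each setup easier to compare, the results can be sorted by count. The following is a compact sketch that uses `Counter.most_common` for this. The function name `frequent_words`, the simplified tokenization, and the toy sample text are all assumptions for illustration.

```python
import re
from collections import Counter


def frequent_words(text, threshold=10, stopwords=frozenset()):
    """Return (word, count) pairs occurring more than threshold times, most frequent first."""
    # Simplified tokenization: lowercase runs of letters and apostrophes
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(token for token in tokens if token not in stopwords)
    # most_common() sorts by descending count
    return [(word, count) for word, count in counts.most_common() if count > threshold]


# Toy corpus: "the" occurs 15 times, "cat" 12 times, the other words 3 times
sample = "the cat " * 12 + "sat on the mat " * 3
print(frequent_words(sample))
print(frequent_words(sample, stopwords={"the"}))
```

Running the two calls side by side shows how a stopword list removes high-frequency function words from the result while leaving the content words untouched.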
