<a href="https://colab.research.google.com/github/thendralbala/UL_MSc_AI_and_ML/blob/main/NLP_Etivity1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Task 1.1

**Approach:**
The eircodeValidator function checks whether a string follows the Eircode pattern. It first strips leading and trailing whitespace from the input. It then matches the result against a regular expression that captures the Routing Key (the first three characters) and the Unique Identifier (the last four characters), optionally separated by a hyphen or a space. If the string matches, the function prints both parts and looks up the Routing Key in KeyRouteDict: if the key is found, it prints the associated descriptor; otherwise, it reports that the Routing Key is unassigned. If the string does not match, the function reports an invalid pattern and returns None.

**Dictionary vs. Tuple for Search**

The Routing Key search uses a dictionary (KeyRouteDict) rather than the original list of tuples (KeyRouteList) for efficiency. Python dictionaries are implemented as hash tables, so looking up a value by its key takes constant time on average, regardless of the dictionary's size. In contrast, finding a Routing Key in a list of tuples such as KeyRouteList requires iterating through the tuples and comparing the first element of each (the Routing Key) with the target key, so the search time grows linearly with the number of tuples.

The dictionary therefore gives faster lookups than the list of tuples, and the advantage grows as the number of Routing Keys and the number of lookups increase.
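The two search strategies can be compared in a minimal sketch. The three sample rows are taken from the notebook's own output; the helper names find_in_list and find_in_dict are hypothetical and are not part of the notebook.

```python
# Sample rows in the same (routing key, descriptor) shape as KeyRouteList.
sample_list = [("A41", "BALLYBOUGHAL"), ("A42", "GARRISTOWN"), ("V94", "LIMERICK")]
sample_dict = dict(sample_list)


def find_in_list(pairs, routing_key):
    """Linear search: compare the first element of each tuple in turn."""
    for key, descriptor in pairs:
        if key == routing_key:
            return descriptor
    return None


def find_in_dict(mapping, routing_key):
    """Hash lookup: constant time on average; returns None if the key is absent."""
    return mapping.get(routing_key)


print(find_in_list(sample_list, "V94"))  # LIMERICK
print(find_in_dict(sample_dict, "V94"))  # LIMERICK
print(find_in_dict(sample_dict, "111"))  # None
```

Both helpers return the same answers; only the amount of work per lookup differs.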

In [1]:
import re   # import re library for regex
import csv  # import csv library to handle the Eircode CSV file


# Use the Linux command wget to download the CSV file
!wget https://gist.githubusercontent.com/ajoorabchi/eac194a79dd26de8864f9206b7842ff1/raw/8ea1d8d5f74b5b2724e378b43d4df6094990c7db/Eircode%2520Routing%2520Key%2520Boundaries.csv
filePath = "/content/Eircode Routing Key Boundaries.csv" # set the path for the downloaded CSV file


with open(filePath, 'r') as f:
    reader = csv.reader(f)
    # map(tuple, reader) turns each CSV row into a tuple; list() collects the
    # tuples into KeyRouteList: [(k0, d0), ..., (kx, dx)]
    KeyRouteList = list(map(tuple, reader))
print(KeyRouteList)




--2025-01-08 12:27:55--  https://gist.githubusercontent.com/ajoorabchi/eac194a79dd26de8864f9206b7842ff1/raw/8ea1d8d5f74b5b2724e378b43d4df6094990c7db/Eircode%2520Routing%2520Key%2520Boundaries.csv
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1934 (1.9K) [text/plain]
Saving to: ‘Eircode Routing Key Boundaries.csv’


2025-01-08 12:27:56 (16.6 MB/s) - ‘Eircode Routing Key Boundaries.csv’ saved [1934/1934]

[('ROUTING KEY', 'DESCRIPTOR'), ('A41', 'BALLYBOUGHAL'), ('A42', 'GARRISTOWN'), ('A45', 'OLDTOWN'), ('A63', 'GREYSTONES'), ('A67', 'WICKLOW'), ('A75', 'CASTLEBLAYNEY'), ('A81', 'CARRICKMACROSS'), ('A82', 'KELLS'), ('A83', 'ENFIELD'), ('A84', 'ASHBOURNE'), ('A85', 'DUNSHAUGHLIN'), ('A86', 'DUNBOYNE'), ('A91', 'DUNDALK'), ('A92', 'DROGHEDA'), ('A94', 'BLA

In [63]:
# Build a Routing Key -> descriptor lookup; KeyRouteList[0] is the CSV header row, so skip it
KeyRouteDict = {key: value for key, value in KeyRouteList[1:]}

In [64]:
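def eircodeValidator(eircode):
  """Print whether eircode matches the Eircode pattern and, if so, its destination."""
  eircode = eircode.strip()  # ignore leading and trailing whitespace
  # Group 1: 3-character Routing Key; optional hyphen or space;
  # group 2: 4-character Unique Identifier
  eircode_regex = r"([A-Z0-9]{3})[- ]?([A-Z0-9]{4})"

  match = re.fullmatch(eircode_regex, eircode)  # the whole string must match
  print(f"------------------------------------------------\nEIR Code: {eircode}")

  if match:
    r_key = match.group(1)
    u_id = match.group(2)
    print(f"Valid Eircode pattern, Routing Key= {r_key}, Unique Identifier = {u_id}")

    if r_key in KeyRouteDict:
      print(f"Destination = {KeyRouteDict[r_key]}")
    else:
      print(f"{r_key} is a valid but unassigned routing key")
  else:
    print("Invalid Eircode pattern")

  # The function only prints its findings; it returns None in every case
  return None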






In [65]:
eircodeValidator("111-T9PX")
eircodeValidator("V94-T9PX")
eircodeValidator("V94 T9PX")
eircodeValidator("V94T9PX")
eircodeValidator("   V94-T9PX")
eircodeValidator("V94-T9PX   ")
eircodeValidator("v94 T9PX")
eircodeValidator("V94T9PXV")

------------------------------------------------
EIR Code: 111-T9PX
Valid Eircode pattern, Routing Key= 111, Unique Identifier = T9PX
111 is a valid but unassigned routing key
------------------------------------------------
EIR Code: V94-T9PX
Valid Eircode pattern, Routing Key= V94, Unique Identifier = T9PX
Destination = LIMERICK
------------------------------------------------
EIR Code: V94 T9PX
Valid Eircode pattern, Routing Key= V94, Unique Identifier = T9PX
Destination = LIMERICK
------------------------------------------------
EIR Code: V94T9PX
Valid Eircode pattern, Routing Key= V94, Unique Identifier = T9PX
Destination = LIMERICK
------------------------------------------------
EIR Code: V94-T9PX
Valid Eircode pattern, Routing Key= V94, Unique Identifier = T9PX
Destination = LIMERICK
------------------------------------------------
EIR Code: V94-T9PX
Valid Eircode pattern, Routing Key= V94, Unique Identifier = T9PX
Destination = LIMERICK
----------------------------------------

#Task 1.2

**Approach:**

The contactsExtractor function extracts contact details from the text of a web page. It defines three regular expressions: one for email addresses in the ul.ie domain, one for 13-digit phone numbers that follow a tel: prefix, and one for website URLs beginning with http:// or https://. Each pattern is applied to the input text with re.findall, and the function prints the number of matches and the list of matches for each type. Keeping the three patterns in a single function keeps the extraction logic in one place, so a pattern can be adjusted or a new contact type added without changing the rest of the notebook.

In [66]:
import re   # import re library for regex
!pip install html2text
import html2text # import html2text library to convert and extract the text content of the HTML file


!wget https://www.ul.ie/contact-information # Use the Linux command wget to download the webpage
filePath = "/content/contact-information" # Set the path for the downloaded HTML file
with open(filePath, "r") as contact_information_file:  # the with block closes the file automatically
    contact_information_html = contact_information_file.read()
contact_information_text = html2text.html2text(contact_information_html)
print(contact_information_text) # print the text content of the page





Collecting html2text
  Downloading html2text-2024.2.26.tar.gz (56 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 kB 2.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: html2text
  Building wheel for html2text (setup.py) ... done
  Created wheel for html2text: filename=html2text-2024.2.26-py3-none-any.whl size=33111 sha256=5496f3b29ba6341cd0015feb054a7626cb4a6e93a2fa128233af6ac58ce2faf9
  Stored in directory: /root/.cache/pip/wheels/f3/96/6d/a7eba8f80d31cbd188a2787b81514d82fc5ae6943c44777659
Successfully built html2text
Installing collected packages: html2text
Successfully installed html2text-2024.2.26
--2025-01-08 13:24:28--  https://www.ul.ie/contact-information
Resolving www.ul.ie (www.ul.ie)... 151.101.194.216, 151.101.130.216, 151.101.66.2

In [74]:
def contactsExtractor(contact_information_text):

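  # Email addresses in the ul.ie domain (the dot is escaped so it matches literally)
  email_regex = r"[A-Za-z0-9]+@ul\.ie"

  # 13-digit phone numbers after a "tel:" prefix; the group captures only the digits
  phone_regex = r"tel:([0-9]{13})"

  # URLs starting with http:// or https://. Note that [$-_] is a character range that
  # includes ")", which is why some matches end with the ")" of a Markdown link.
  website_regex = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"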

  emails = re.findall(email_regex, contact_information_text)
  phones = re.findall(phone_regex, contact_information_text)
  websites = re.findall(website_regex, contact_information_text)

  print(f"{len(emails)} Emails = {emails}")
  print(f"{len(phones)} Phone: {phones}")
  print(f"{len(websites)} Websites = {websites}")




In [75]:
contactsExtractor(contact_information_text)

12 Emails = ['reception@ul.ie', 'reception@ul.ie', 'international@ul.ie', 'international@ul.ie', 'DeanFAHSS@ul.ie', 'DeanFAHSS@ul.ie', 'ehs@ul.ie', 'ehs@ul.ie', 'scieng@ul.ie', 'scieng@ul.ie', 'kbsdean@ul.ie', 'kbsdean@ul.ie']
5 Phone: ['0035361202700', '0035361203154', '0035361202286', '0035361202688', '0035361202116']
36 Websites = ['https://www.facebook.com/universityoflimerick)', 'https://twitter.com/ul)', 'https://www.instagram.com/universityoflimerick/)', 'https://www.linkedin.com/company/university-of-limerick)', 'https://www.youtube.com/user/UniversityofLimerick)', 'https://www.tiktok.com/@universityoflimerick)', 'https://www.ul.ie/library)', 'https://ul.workvivo.com/directory/people)', 'https://www.ul.ie/academic-registry/prospective-students/pathways-ul)', 'https://www.youvisit.com/tour/139046/143553?pl=v&from=embed-js-failed)', 'https://www.ul.ie/artsoc/law', 'https://www.ul.ie/artsoc/seic)', 'https://www.ul.ie/soedu)', 'https://pure.ul.ie/', 'https://www.ul.ie/cecd/graduate
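The output above contains each email address twice and several URLs with a trailing ")". A minimal sketch of one way to clean this up is shown here. It assumes the page text uses Markdown-style links of the form [label](target), which is how html2text normally renders anchors; the sample text, the simplified website pattern and the helper unique_in_order are illustrative and are not part of the notebook.

```python
import re

# Hypothetical snippet in the Markdown link style produced by html2text.
sample_text = (
    "[reception@ul.ie](mailto:reception@ul.ie) "
    "[Facebook](https://www.facebook.com/universityoflimerick) "
    "[Library](https://www.ul.ie/library)"
)

email_regex = r"[A-Za-z0-9]+@ul\.ie"
# Stop at whitespace or a closing parenthesis so the Markdown ")" is not captured.
website_regex = r"https?://[^\s)]+"


def unique_in_order(items):
    """Drop duplicates while keeping the first-seen order."""
    return list(dict.fromkeys(items))


emails = unique_in_order(re.findall(email_regex, sample_text))
websites = unique_in_order(re.findall(website_regex, sample_text))

print(emails)    # ['reception@ul.ie']
print(websites)  # ['https://www.facebook.com/universityoflimerick', 'https://www.ul.ie/library']
```

The trade-off of stopping at ")" is that a URL that legitimately contains a closing parenthesis would be cut short.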

#Task 2.1

Use the !wget command to download the Complete Works of William Shakespeare from Project Gutenberg, then open the downloaded text file and print out its first 50 lines.

The Python string method splitlines() splits a string into a list in which each line is one item.

In [76]:
!wget http://www.gutenberg.org/files/100/100-0.txt
filePath = "/content/100-0.txt"

with open(filePath, "r") as ShakespeareFile:  # the with block closes the file automatically
    ShakespeareContent = ShakespeareFile.read()
# The splitlines() method splits a string into a list.
# The splitting is done at line breaks.
ShakespeareContent = ShakespeareContent.splitlines()
# Uncomment to print the first 50 lines, one per line:
#print("\n".join(ShakespeareContent[0:50]))

--2025-01-08 13:48:35--  http://www.gutenberg.org/files/100/100-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.gutenberg.org/files/100/100-0.txt [following]
--2025-01-08 13:48:36--  https://www.gutenberg.org/files/100/100-0.txt
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5618733 (5.4M) [text/plain]
Saving to: ‘100-0.txt’


2025-01-08 13:48:38 (3.40 MB/s) - ‘100-0.txt’ saved [5618733/5618733]



#Task 2.2

This code uses the Keras Tokenizer to analyze the vocabulary of the Shakespeare corpus (ShakespeareContent). It creates a Tokenizer object and fits it to the corpus, which builds a vocabulary of unique words (types) together with their frequencies. With its default settings, the Tokenizer lowercases the text and strips most punctuation before splitting on whitespace. The total number of words (tokens) is the sum of the frequencies of all words, and the total number of unique words (types) is the number of entries in the tokenizer's word index. Finally, the code prints the 10 most frequent types with their ranks and frequencies. These counts are the baseline for the stemming and lemmatization comparison in Task 2.3.
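The difference between tokens and types can be illustrated on a toy corpus with the standard library alone. This is only a sketch: lowercasing and whitespace splitting stand in for the Keras Tokenizer, and the two sample lines are made up for the example.

```python
from collections import Counter

# Toy corpus; lowercasing and whitespace splitting approximate the tokenizer.
toy_lines = ["To be or not to be", "that is the question"]

counts = Counter(word for line in toy_lines for word in line.lower().split())

total_tokens = sum(counts.values())  # every occurrence counts
total_types = len(counts)            # each distinct word counts once

print(f"tokens = {total_tokens}, types = {total_types}")  # tokens = 10, types = 8
print(counts.most_common(2))  # the two most frequent types with their counts
```

Here "to" and "be" each occur twice, so the corpus has 10 tokens but only 8 types.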

In [77]:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
#Tokenizing the Shakespeare corpus
tokenizer.fit_on_texts(ShakespeareContent)

#Total number of tokens in the corpus
# The number of tokens is the sum of the occurrences of every word
total_tokens = sum(tokenizer.word_counts.values())
print(f"Total number of tokens in the corpus: {total_tokens}")

#Total number of types in the corpus
total_types = len(tokenizer.word_index)
print(f"Total number of types in the corpus: {total_types}")

#top 10 most frequent Types in the corpus along with their ranks and frequencies
print("\nTop 10 most frequent Types in the corpus along with their ranks and frequencies: ")
# word_index is ordered from most to least frequent, so its first 10 entries are the top 10
for word, index in list(tokenizer.word_index.items())[:10]:
    print(f"    \"{word}\" is ranked #{index}, with a frequency of {tokenizer.word_counts[word]}")

Total number of tokens in the corpus: 969435
Total number of types in the corpus: 29851

Top 10 most frequent Types in the corpus along with their ranks and frequencies: 
    "the" is ranked #1, with a frequency of 30309
    "and" is ranked #2, with a frequency of 28430
    "i" is ranked #3, with a frequency of 21694
    "to" is ranked #4, with a frequency of 20944
    "of" is ranked #5, with a frequency of 18742
    "a" is ranked #6, with a frequency of 16371
    "you" is ranked #7, with a frequency of 14332
    "my" is ranked #8, with a frequency of 13162
    "in" is ranked #9, with a frequency of 12412
    "that" is ranked #10, with a frequency of 11779


#Task 2.3

In [78]:
from nltk.stem import PorterStemmer
import nltk
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [79]:
# Stem all the types in the Shakespeare corpus, and print out the total number of types after stemming.
stemmed_words = [ps.stem(word) for word in tokenizer.word_index]
total_number_of_stemmed_types = len(set(stemmed_words))
print(f"Total number of types after stemming: {total_number_of_stemmed_types}")

Total number of types after stemming: 20744


In [80]:
# Lemmatize all the types in the Shakespeare corpus, and print out the total number of types after lemmatization.
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokenizer.word_index]
total_number_of_lemmatized_types = len(set(lemmatized_words))
print(f"Total number of types after lemmatization: {total_number_of_lemmatized_types}")

Total number of types after lemmatization: 26575


In [81]:
# Confirm that this inequality holds:
# total_number_of_types > total_number_of_lemmatized_types > total_number_of_stemmed_types
print(f"Total number of types: {total_types}")
print(f"Total number of types after stemming: {total_number_of_stemmed_types}")
print(f"Total number of types after lemmatization: {total_number_of_lemmatized_types}")
# Check whether the inequality holds
print(f"\nTotal number of types > total_number_of_lemmatized_types > total_number_of_stemmed_types: {total_types > total_number_of_lemmatized_types > total_number_of_stemmed_types}")

Total number of types: 29851
Total number of types after stemming: 20744
Total number of types after lemmatization: 26575

Total number of types > total_number_of_lemmatized_types > total_number_of_stemmed_types: True


#Task 2.4

This code uses the spaCy library to segment the last 100 lines of the Shakespeare corpus into sentences. It first loads spaCy's small English model (en_core_web_sm). It then joins the last 100 lines of the corpus with newline characters (\n) and runs the combined text through the spaCy pipeline, which marks the sentence boundaries. The text of each detected sentence is stored in a list called sentences. Finally, the code prints the total number of sentences detected, followed by each sentence with its number.

In [83]:
import spacy
nlp = spacy.load('en_core_web_sm')

# Segmenting the last 100 lines of the Shakespeare corpus
doc = nlp("\n".join(ShakespeareContent[-100:]))
sentences = [sent.text for sent in doc.sents]

print(f"Total number of sentences: {len(sentences)}")

print("\nSegmented sentences:\n")

# enumerate numbers the sentences starting from 1
for i, sentence in enumerate(sentences, start=1):
  print(f"Sentence #{i}. {sentence}")

Total number of sentences: 27

Segmented sentences:

Sentence #1. But by a kiss thought to persuade him there;
  And nuzzling in his flank, the loving swine
  Sheath’d unaware the tusk in his soft groin.      
Sentence #2. 1116

“Had I been tooth’d like him
Sentence #3. , I must confess,
With kissing him
Sentence #4. I should have kill’d him first;

Sentence #5. But he is dead, and never did he bless
My youth with his; the more am I accurst.”          
Sentence #6. 1120
  With this she falleth in the place she stood,
  And stains her face with his congealed blood.


Sentence #7. She looks upon his lips, and they are pale;
She takes him by the hand, and that is cold,        1124
She whispers in his ears a heavy tale,

Sentence #8. As if they heard the woeful words she told;
She lifts the coffer-lids that close his eyes,
Where lo, two lamps burnt out in darkness lies.


Sentence #9. Two glasses where herself herself beheld            1129
A thousand times, and now no more reflect;
Their 