# **LAB 3: Language Representation**

**Language Representation**, also known as Text Representation, is the process of converting unstructured text data into a structured, machine-readable format. It involves converting words, phrases, or entire documents into numerical or symbolic representations while preserving meaning and context.

It comprises preprocessing the text data and then selecting a suitable representation scheme, such as Bag-of-Words or TF-IDF, to capture the key features of the text in a numerical form that machine learning algorithms can process.



# **Objectives:**
This lab introduces students to fundamental techniques for representing text in a machine-readable format. These techniques form the foundation for various NLP applications, enabling machines to understand and process human language efficiently.

By the end of this lab, students will be familiar with several key Language Representation tasks, which include:

1. Text Preprocessing
    * Remove Punctuation
    * Remove URLs
    * Lowercasing
    * Tokenization
    * Remove Stop Words
    * Stemming
    * Lemmatization
2. Character Encoding
    * ASCII
    * UTF-8
3. Text Representation
    * Bag-of-Words (BoW)
    * Term Frequency - Inverse Document Frequency (TF-IDF)

# 1.  **Text Preprocessing**
Raw text data is often messy and unstructured. Text Preprocessing cleans and organizes it so that later analysis and predictions work on consistent input. The steps covered here are:


* Remove Punctuation
* Remove URLs
* Lowercasing
* Tokenization
* Remove Stop Words
* Stemming
* Lemmatization



### **Remove Punctuation**

In [None]:
import string

text = "Hello, world! How's it going?"
text_no_punct = text.translate(str.maketrans("", "", string.punctuation))
print(text_no_punct)  # Output: Hello world Hows it going

Hello world Hows it going


### **Remove URLs**

In [None]:
import re

text = "Check this out: https://example.com for more details."
text_no_urls = re.sub(r'http\S+|www\S+', '', text)
print(text_no_urls)  # Output: Check this out:  for more details.

Check this out:  for more details.


### **Lowercasing**

In [None]:
text = "HELLO World! This is an Example."
text_lower = text.lower()
print(text_lower)  # Output: hello world! this is an example.

hello world! this is an example.


### **Tokenization**

In [None]:
import nltk
nltk.download("stopwords")
nltk.download("punkt_tab")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [None]:
text = "this is an example sentence. to demonstrate stop word removal."
words = word_tokenize(text)
print(words)  # Output: ['this', 'is', 'an', 'example', 'sentence', '.', 'to', 'demonstrate', 'stop', 'word', 'removal', '.']

['this', 'is', 'an', 'example', 'sentence', '.', 'to', 'demonstrate', 'stop', 'word', 'removal', '.']


### **Remove Stop Words**

In [None]:
# Using the `words` from the previous block
stop_words = set(stopwords.words("english"))  # build the set once for fast lookups
filtered_text = [word for word in words if word not in stop_words]
print(filtered_text)  # Output: ['example', 'sentence', '.', 'demonstrate', 'stop', 'word', 'removal', '.']

['example', 'sentence', '.', 'demonstrate', 'stop', 'word', 'removal', '.']


### **Stemming**

In [None]:
from nltk.stem import PorterStemmer

In [None]:
stemmer = PorterStemmer()
words = ["running", "flies", "better", "happily", "jumping", "countries"]
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)
# Output: ['run', 'fli', 'better', 'happili', 'jump', 'countri']

['run', 'fli', 'better', 'happili', 'jump', 'countri']


### **Lemmatization**

In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
lemmatizer = WordNetLemmatizer()
words = ["running", "flies", "better", "happily", "jumping", "countries"]
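lemmatized_words = [lemmatizer.lemmatize(word, pos="n") for word in words]  # 'n' for noun

# Treated as nouns, verb forms such as "running" and "jumping" are left unchanged
print(lemmatized_words)  # Output: ['running', 'fly', 'better', 'happily', 'jumping', 'country']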

['running', 'fly', 'better', 'happily', 'jumping', 'country']


NOTE: Valid options for `pos` in `.lemmatize()` are `"n"` for nouns (the default), `"v"` for verbs, `"a"` for adjectives, `"r"` for adverbs and `"s"` for satellite adjectives.



### **Task: Perform the Preprocessing Steps sequentially, on the provided example, by following the hints in comments**

In [None]:
# Import Libraries
import re
import string
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download necessary NLTK resources
nltk.download("punkt_tab")
nltk.download("stopwords")
nltk.download("wordnet")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
# Importing Libraries
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download necessary resources
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Text to be used for preprocessing
text = "Hello, world! NLP is amazing. Let's learn it at https://example.com."

# 1. Remove Punctuation from Provided Text
text_no_punct = re.sub(r'[^\w\s]', '', text)
print("\nStep 1: Text without Punctuation:")
print(text_no_punct)

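# 2. Remove URLs from Output of Step 1
# NOTE: Step 1 has already stripped ":", "/" and ".", so this pattern no longer matches
# and the URL survives as the token "httpsexamplecom" in the output.
text_no_urls = re.sub(r'https?://\S+', '', text_no_punct)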
print("\nStep 2: Text without URLs:")
print(text_no_urls)

# 3. Perform Lowercasing on Output of Step 2
text_lower = text_no_urls.lower()
print("\nStep 3: Lowercased Text:")
print(text_lower)

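# 4. Perform Word and Sentence Tokenization, individually on Output of Step 3
# NOTE: the sentence-ending punctuation was removed in Step 1, so sent_tokenize returns a single sentence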
words = word_tokenize(text_lower)
sentences = sent_tokenize(text_lower)
print("\nStep 4: Word Tokens:")
print(words)
print("\nStep 4: Sentence Tokens:")
print(sentences)

# 5. Remove Stop Words from Output of Step 4 (word tokenize output)
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]
print("\nStep 5: Text without Stop Words:")
print(filtered_words)

# 6. Perform Stemming on Output of Step 5
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]
print("\nStep 6: Stemmed Words:")
print(stemmed_words)

# 7. Perform Lemmatization on Output of Step 5, making sure POS tag is set for Verb
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in filtered_words]
print("\nStep 7: Lemmatized Words:")
print(lemmatized_words)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...



Step 1: Text without Punctuation:
Hello world NLP is amazing Lets learn it at httpsexamplecom

Step 2: Text without URLs:
Hello world NLP is amazing Lets learn it at httpsexamplecom

Step 3: Lowercased Text:
hello world nlp is amazing lets learn it at httpsexamplecom

Step 4: Word Tokens:
['hello', 'world', 'nlp', 'is', 'amazing', 'lets', 'learn', 'it', 'at', 'httpsexamplecom']

Step 4: Sentence Tokens:
['hello world nlp is amazing lets learn it at httpsexamplecom']

Step 5: Text without Stop Words:
['hello', 'world', 'nlp', 'amazing', 'lets', 'learn', 'httpsexamplecom']

Step 6: Stemmed Words:
['hello', 'world', 'nlp', 'amaz', 'let', 'learn', 'httpsexamplecom']

Step 7: Lemmatized Words:
['hello', 'world', 'nlp', 'amaze', 'let', 'learn', 'httpsexamplecom']



# 2.  **Character Encoding**
Computers store text as bytes, so every character has to be mapped to a number. A character encoding defines this mapping. This lab covers two encodings:


* American Standard Code for Information Interchange (ASCII)
* Unicode Transformation Format 8 (UTF-8)

### **American Standard Code for Information Interchange (ASCII)**

ASCII is a character encoding standard that represents each character as a 7-bit number, giving 128 characters in total. It is used in computers, telecommunications, and other devices.

We can perform ASCII encoding and decoding using the `.encode()` and `.decode()` methods, with the encoding set to `"ascii"`.

Because of its limited multilingual support, ASCII cannot encode non-ASCII characters. By default `.encode("ascii")` raises a `UnicodeEncodeError` when it meets one, so in the example below we pass `errors="ignore"` to drop such characters instead.

In [None]:
text = "Hello, دنیا"

# Encode using ASCII (ignoring non-ASCII characters)
encoded_text = text.encode("ascii", errors="ignore")
print("Encoded Text: ", encoded_text)  # Output: b'Hello, '

# Decode the ASCII bytes back into a string
decoded_text = encoded_text.decode("ascii")
print("Decoded Text: ", decoded_text)  # Output: "Hello, " (the non-ASCII characters are lost)

Encoded Text:  b'Hello, '
Decoded Text:  Hello, 
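
Ignoring characters is only one of the standard error handlers that `.encode()` accepts. The sketch below compares the default behaviour (`"strict"`) with `"ignore"` and `"replace"`. The sample string is an assumption chosen for illustration and is not part of the lab data.

```python
text = "caf\u00e9"  # "café"; \u00e9 (é) is outside the ASCII range

# The default handler ("strict") raises an exception on non-ASCII characters
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

ignored = text.encode("ascii", errors="ignore")    # drops the character
replaced = text.encode("ascii", errors="replace")  # substitutes "?"

print(strict_failed)  # True
print(ignored)        # b'caf'
print(replaced)       # b'caf?'
```

Both `"ignore"` and `"replace"` lose information, so decoding the result never restores the original text.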


### **Unicode Transformation Format 8 (UTF-8)**

UTF-8 is a character encoding standard which leverages variable-width encoding, meaning that each character is represented by one to four bytes. It is the most common encoding for the World Wide Web.

As with ASCII, we can perform UTF-8 encoding and decoding using the `.encode()` and `.decode()` methods, with the encoding set to `"utf-8"`.

Because of its broad multilingual support, it can encode non-English characters too.
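
The variable-width property can be checked directly by counting the bytes each character produces. The following is a small sketch. The four sample characters are assumptions picked to cover each width and are written as escape sequences so the code does not depend on the file's own encoding.

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, é, €, 😀

# Number of bytes each character occupies once encoded as UTF-8
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(byte_lengths)  # [1, 2, 3, 4]

# Decoding with the same encoding restores the original characters
round_trip = [ch.encode("utf-8").decode("utf-8") for ch in samples]
print(round_trip == samples)  # True
```

Plain ASCII characters still take a single byte, which is why ASCII text is also valid UTF-8.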



#### **Task: Perform UTF-8 Encoding and Decoding on the provided example, using the hints in comments**

In [None]:
text = "Hello, دنیا"

# Encode using UTF-8 (non-ASCII characters are kept as multi-byte sequences)
encoded_text = text.encode("utf-8")
print("Encoded Text: ", encoded_text)

# Decode the UTF-8 bytes back into a string
decoded_text = encoded_text.decode("utf-8")
print("Decoded Text: ", decoded_text)  # Output: Hello, دنیا


# 3. **Text Representation**

Text representation is the process of converting unstructured text data into a structured, machine-readable format that can be used for natural language processing tasks.

It involves selecting a suitable representation scheme, such as bag-of-words, TF-IDF, word embeddings, or topic models, to capture the key features and characteristics of the text data in a numerical form that can be processed by machine learning algorithms.


a) **Bag-of-Words (BoW) Representation:**

It represents text as a vector of word frequencies, ignoring grammar and word order, based on a corpus-wide vocabulary.


b) **Term Frequency - Inverse Document Frequency (TF-IDF) Representation:**

It is a statistical measure that evaluates a word's importance in a document relative to a collection of documents by combining its frequency in the document (TF) and its rarity across the corpus (IDF).

Words that appear frequently across many documents (common words) have lower importance.

#### **Example:**

**Input Text 1:** "I love NLP."

**Input Text 2:** "NLP is good."

a) **Bag-of-Words (BoW) Representation:**

In the two sentences above, "NLP" appears in both, while every other word appears in only one. Since each word occurs once per sentence, the BoW vectors assign "NLP" the same weight as the other words.

b) **Term Frequency - Inverse Document Frequency (TF-IDF) Representation:**

For the same two sentences, the TF-IDF vectors assign "NLP" a lower weight than the other words, because it appears in both sentences and is therefore less distinctive.
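
The comparison above can be worked through by hand. The sketch below uses only the standard library and assumes the textbook weighting `tf * log(N / df)`, along with a deliberately simple tokenizer (lowercase, strip the full stop, split on whitespace). scikit-learn's `TfidfVectorizer` applies a smoothed IDF and normalizes each row by default, so its numbers differ, but "NLP" is still weighted lower than the other words.

```python
import math
from collections import Counter

docs = ["I love NLP.", "NLP is good."]

# Simplified tokenization: lowercase, strip the full stop, split on whitespace
tokenized = [doc.lower().replace(".", "").split() for doc in docs]
vocab = sorted(set(word for tokens in tokenized for word in tokens))

# Bag-of-Words: raw count of each vocabulary word per document
bow = [[Counter(tokens)[word] for word in vocab] for tokens in tokenized]

# Document frequency: number of documents that contain each word
df = {word: sum(word in tokens for tokens in tokenized) for word in vocab}

# Textbook TF-IDF: term count multiplied by log(N / df)
n_docs = len(docs)
tfidf = [
    [count * math.log(n_docs / df[word]) for word, count in zip(vocab, row)]
    for row in bow
]

print(vocab)  # ['good', 'i', 'is', 'love', 'nlp']
print(bow)    # [[0, 1, 0, 1, 1], [1, 0, 1, 0, 1]]
print([[round(w, 3) for w in row] for row in tfidf])
# [[0.0, 0.693, 0.0, 0.693, 0.0], [0.693, 0.0, 0.693, 0.0, 0.0]]
```

In the BoW rows "nlp" has the same count as "love" or "good", whereas under this weighting its TF-IDF value drops to zero because it occurs in every document.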

Now we will implement BoW and TF-IDF in this lab.

### **BoW**

In [None]:
# Import libraries
from sklearn.feature_extraction.text import CountVectorizer # for BoW

In [None]:
# Input texts
text1 = "I love NLP."
text2 = "NLP is an interesting subject."

# Bag of Words (BoW)

# Initialize the CountVectorizer, which converts text into a matrix of token counts
bow_vectorizer = CountVectorizer()
# Fit and transform the input texts into a BoW matrix
bow_matrix = bow_vectorizer.fit_transform([text1, text2])

# Feature names and BoW representation
print("Bag of Words (BoW):")
print("Feature Names:", bow_vectorizer.get_feature_names_out())
print("BoW Matrix:\n", bow_matrix.toarray())

Bag of Words (BoW):
Feature Names: ['an' 'interesting' 'is' 'love' 'nlp' 'subject']
BoW Matrix:
 [[0 0 0 1 1 0]
 [1 1 1 0 1 1]]


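NOTE: `vectorizer.fit_transform()` builds a unique vocabulary by
  * Lowercasing
  * Applying Tokenization (by default only tokens of two or more alphanumeric characters are kept, which is why "I" is missing from the feature names above)
  * Removing Duplicates
  * Stop Word Removal, but only when the `stop_words` parameter is set (it is off by default, which is why "is" and "an" remain in the feature names above)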

### **TF-IDF**

#### **Task: Perform TF-IDF on the above example (used in BoW), using the hints in comments**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer # for TF-IDF
