# HTML Tag Removal

In this notebook, we will explore different methods for removing HTML tags from text. HTML is often used for web content, and when extracting or processing text, it's common to remove the tags and keep only the plain text. We'll look at two main approaches:

1. **Using Regular Expressions**  
2. **Using the `BeautifulSoup` library**  

After exploring these approaches, you'll find a small exercise to practice what you've learned.

## Why Remove HTML Tags?

- When extracting data from web pages, we often obtain strings with HTML tags (e.g., `<p>`, `<div>`, etc.).
- If we're performing text analytics (such as sentiment analysis, topic modeling, or keyword extraction), those tags become noise.
- Removing HTML tags helps us get cleaner input data for NLP tasks or general text processing.


In [2]:
# Import necessary libraries
import re
from bs4 import BeautifulSoup

# Example text
html_text = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <p>This is a <strong>sample</strong> paragraph.</p>
    <div class="content">Another <a href='#'>link</a> here.</div>
  </body>
</html>
"""

print("Original HTML Text:")
print(html_text)

Original HTML Text:

<html>
  <head><title>Sample Page</title></head>
  <body>
    <p>This is a <strong>sample</strong> paragraph.</p>
    <div class="content">Another <a href='#'>link</a> here.</div>
  </body>
</html>



## Approach 1: Using Regular Expressions

A simple (though sometimes brittle) approach to removing HTML tags is to use a regular expression that matches anything in angle brackets (`< >`) and replaces it with an empty string. Keep in mind that HTML can get very complex, and regex might not always capture every edge case, but it can be sufficient for simple scenarios.


In [6]:
# Removing tags using regular expressions

def remove_html_tags_regex(text):
    # Non-greedy pattern: matches from '<' up to the nearest following '>' on the same line
    clean_text = re.sub(r'<.*?>', '', text)
    return clean_text

regex_cleaned_text = remove_html_tags_regex(html_text)
print("Cleaned text (Regex):")
print(regex_cleaned_text)

Cleaned text (Regex):


  Sample Page
  
    This is a sample paragraph.
    Another link here.
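
The regex output above still carries the indentation and blank lines of the original markup. A minimal sketch of one way to tidy it, assuming that collapsing every run of whitespace into a single space is acceptable for the downstream task (`normalize_whitespace` is a hypothetical helper, not part of the notebook):

```python
def normalize_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops empty pieces.
    return " ".join(text.split())

# Tag-stripped text similar to the regex output shown above
raw = "\n\n  Sample Page\n  \n    This is a sample paragraph.\n    Another link here.\n  \n\n"
cleaned = normalize_whitespace(raw)
print(cleaned)
```

This discards line structure entirely, so it may not suit tasks that rely on paragraph boundaries.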
  




## Approach 2: Using BeautifulSoup

`BeautifulSoup` is a popular Python library for parsing HTML and XML documents. It allows us to easily extract the text content without dealing directly with raw string operations.

In [7]:
# Removing tags using BeautifulSoup

def remove_html_tags_bs(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

bs_cleaned_text = remove_html_tags_bs(html_text)
print("Cleaned text (BeautifulSoup):")
print(bs_cleaned_text)

Cleaned text (BeautifulSoup):


Sample Page

This is a sample paragraph.
Another link here.
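
The output above keeps the line breaks that sat between tags. As a hedged sketch, `get_text` also accepts `separator` and `strip` arguments, which can produce a single compact line; the choice of a space as separator is an assumption made for this illustration:

```python
from bs4 import BeautifulSoup

html_text = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <p>This is a <strong>sample</strong> paragraph.</p>
    <div class="content">Another <a href='#'>link</a> here.</div>
  </body>
</html>
"""

soup = BeautifulSoup(html_text, "html.parser")
# strip=True trims each text fragment and skips whitespace-only ones;
# separator=" " joins the remaining fragments with a single space.
compact_text = soup.get_text(separator=" ", strip=True)
print(compact_text)
```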





## Comparison of Methods

- **Regex-based approach**:
  - Pros: Quick, minimal dependencies.  
  - Cons: Can fail with nested or malformed HTML; not guaranteed to handle all real-world HTML complexities.

- **BeautifulSoup approach**:
  - Pros: Specifically designed for parsing HTML; robust for many HTML structures.  
  - Cons: Requires installing and importing an external library.  

Choose the method that best fits your use case.
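
To make the regex caveats concrete, the sketch below runs the same `<.*?>` pattern on three inputs made up for illustration: an attribute value containing `>`, a tag split across two lines, and an HTML entity. `strip_tags` is a hypothetical helper name, and `html.unescape` comes from the standard library rather than from this notebook.

```python
import html
import re

TAG_PATTERN = re.compile(r"<.*?>")

def strip_tags(text):
    return TAG_PATTERN.sub("", text)

# 1. A '>' inside an attribute value ends the match too early.
attr_case = strip_tags('<a title="a > b">link</a>')
print(repr(attr_case))

# 2. '.' does not match newlines by default, so a tag split across lines survives.
multiline_case = strip_tags('<p\nclass="x">hi</p>')
print(repr(multiline_case))

# 3. Entities are not tags; decoding them is a separate step.
entity_case = html.unescape(strip_tags("<p>Fish &amp; chips</p>"))
print(entity_case)
```

In the first two cases fragments of markup remain in the result, which is the kind of failure a parser-based approach is meant to avoid.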

## Exercise

In [8]:
# Exercise Starter Code

exercise_html = """
<h1 style="color:red;">Hello World!</h1>
<p>This is an <em>HTML</em> example with a <a href='http://example.com'>link</a>.</p>
<span>Some malformed <tag> text</span
"""

# TODO 1: Create your own HTML string or use the above.
# TODO 2: Use the remove_html_tags_regex and remove_html_tags_bs functions on your string.
# TODO 3: Print and compare the outputs. Are there any edge cases?

# Example (uncomment below and replace exercise_html with your own):
# print("Regex approach:\n", remove_html_tags_regex(exercise_html))
# print("\nBeautifulSoup approach:\n", remove_html_tags_bs(exercise_html))


# Emoji Removal or Replacement with the `emoji` Library

In this notebook, we'll learn how to remove or replace emojis in text using the `emoji` library. This library provides convenient functions to identify and handle emojis in Unicode strings.

We'll focus on:
1. **Removing emojis** (i.e., deleting them from the text).
2. **Replacing emojis** with a placeholder token (e.g., `<EMOJI>`).

Let's get started!

In [None]:
# Install the emoji library (skip this line if it is already installed):
!pip install emoji

import emoji

# Sample text containing emojis
text_with_emojis = "Hello world! 😊 I love Python 🐍❤️."

print("Original Text:")
print(text_with_emojis)

Collecting emoji
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.1-py3-none-any.whl (590 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.1
Original Text:
Hello world! 😊 I love Python 🐍❤️.


## Removing Emojis

We can use the function `replace_emoji` from the `emoji` library to remove emojis by replacing them with an empty string.


In [None]:
def remove_emojis(text: str) -> str:
    """
    Remove all emojis from the provided text by replacing them with an empty string.
    """
    return emoji.replace_emoji(text, "")

removed_emojis_text = remove_emojis(text_with_emojis)
print("Text After Removing Emojis:")
print(removed_emojis_text)

Text After Removing Emojis:
Hello world!  I love Python .


## Replacing Emojis with a Placeholder

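Instead of deleting emojis, it is sometimes useful to keep track of where they appear, for example for analysis or token replacement. The function below uses `emoji.demojize`, which replaces each emoji with its textual name (such as `:snake:`) rather than with a single fixed placeholder token.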

In [None]:
def replace_emoji(text: str) -> str:
    """
    Replace each emoji with its textual name (for example, ':snake:').
    """
    return emoji.demojize(text)

replaced_emojis_text = replace_emoji(text_with_emojis)
print("Text After Replacing Emojis:")
print(replaced_emojis_text)

Text After Replacing Emojis:
Hello world! :smiling_face_with_smiling_eyes: I love Python :snake::red_heart:.


# Basic Text Preprocessing & Language Translation

In this notebook, we will cover:
1. **Stop Word Removal** (using `nltk`).
2. **Stemming** and **Lemmatization** (using `nltk`).
3. **Removing Digits** from text (using regular expressions).
4. **Lowercasing** text.

Let's begin by importing the necessary libraries and loading some sample text.

In [None]:
import nltk
import re

# Download NLTK data (if you haven't before)
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

# Sample text
sample_text = """
This is a sample TEXT!
It contains numbers like 1234 and 56.
We'll remove STOPWORDS, digits, and then try some stemming/lemmatization.
Finally, let's translate this text into French!
"""

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Stop Word Removal

Stop words are commonly used words (e.g., "the", "is", "in") that often don't add significant meaning to a text.
We'll use NLTK's built-in list of English stop words and remove them from our text.

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


def remove_stopwords(text):
    # Tokenize the text
    tokens = word_tokenize(text.lower())
    # Get the English stop words
    stop_words = set(stopwords.words('english'))
    # Filter out stop words and non-alphabetic tokens (punctuation, digits)
    filtered_tokens = [word for word in tokens if word not in stop_words and word.isalpha()]
    # Reconstruct the string
    return " ".join(filtered_tokens)

text_no_stopwords = remove_stopwords(sample_text)
print("Text after Stop Word Removal:\n")
print(text_no_stopwords)



Text after Stop Word Removal:

sample text contains numbers like remove stopwords digits try finally let translate text french


## Stemming / Lemmatization

- **Stemming**: Reduces words to their word stem, which may not be a proper word (e.g., "studies" -> "studi").
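- **Lemmatization**: Reduces words to a valid dictionary base form (lemma), using the word's part of speech when it is provided (e.g., "studies" -> "study").

We will demonstrate both using NLTK’s `PorterStemmer` and `WordNetLemmatizer`. Note that `WordNetLemmatizer.lemmatize` treats every token as a noun unless a `pos` argument is passed, which is why a verb form such as "contains" is left unchanged in the output below.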

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def stem_text(text):
    tokens = word_tokenize(text)
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return " ".join(stemmed_tokens)

def lemmatize_text(text):
    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return " ".join(lemmatized_tokens)

stemmed_text = stem_text(text_no_stopwords)
lemmatized_text = lemmatize_text(text_no_stopwords)

print("Sample Text:\n", sample_text)
print("Stemmed Text:\n", stemmed_text)
print("\nLemmatized Text:\n", lemmatized_text)

Sample Text:
 
This is a sample TEXT! 
It contains numbers like 1234 and 56. 
We'll remove STOPWORDS, digits, and then try some stemming/lemmatization. 
Finally, let's translate this text into French!

Stemmed Text:
 sampl text contain number like remov stopword digit tri final let translat text french

Lemmatized Text:
 sample text contains number like remove stopwords digit try finally let translate text french


## Exercise

In [None]:
# Exercise: remove digits and lowercase the text. Try it yourself!

# POS Tagging with displaCy Visualization

In this example, we use [spaCy](https://spacy.io/) to perform Part-of-Speech (POS) tagging on a sample sentence. We then render the syntactic dependency parse (which includes POS information) using **displaCy** directly within a Jupyter environment.

**Steps**:
1. **Import spaCy** and load the English model (`en_core_web_sm`).
2. **Create a Doc object** by processing a text string with `nlp(...)`.
3. **Visualize** the parse (dependencies and POS tags) using `displacy.render`.

In [None]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
displacy.render(doc, style="dep")

# One-Hot Encoding of Text

In this notebook, we demonstrate:
1. Creating and preprocessing a small corpus.
2. Building a vocabulary of unique words.
3. Generating one-hot vectors for words in any given string.

**Why One-Hot Encoding?**  
One-hot encoding converts each word into a vector of zeros with a single '1' indicating the position of that word in the vocabulary. This is a simple way to represent text numerically.

In [None]:
documents = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
processed_docs = [doc.lower().replace(".","") for doc in documents]
processed_docs

['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']

In [None]:
# Build the vocabulary: map each unique word to a 1-based index
vocab = {}
count = 0
for doc in processed_docs:
    for word in doc.split():
        if word not in vocab:
            count += 1
            vocab[word] = count
print(vocab)

{'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}


In [None]:
# Get the one-hot representation of any string based on this vocabulary.
# If a word exists in the vocabulary, its one-hot vector is returned.
# If not, a vector of all zeros is returned for that word.
def get_onehot_vector(somestring):
    onehot_encoded = []
    for word in somestring.split():
        temp = [0] * len(vocab)
        if word in vocab:
            # Vocabulary indices start at 1, list indices at 0
            temp[vocab[word] - 1] = 1
        onehot_encoded.append(temp)
    return onehot_encoded

In [None]:
print(processed_docs[1])
get_onehot_vector(processed_docs[1])  # One-hot representation for a text from our corpus

man bites dog


[[0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]]
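
The same encoding can be written more compactly. The sketch below is one possible variant, not the notebook's own method: it uses 0-based indices so no `-1` adjustment is needed, and the names `word_to_index` and `one_hot` are chosen for this illustration.

```python
documents = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
processed_docs = [doc.lower().replace(".", "") for doc in documents]

# dict.fromkeys keeps first-seen order and drops duplicate words
unique_words = dict.fromkeys(word for doc in processed_docs for word in doc.split())
word_to_index = {word: i for i, word in enumerate(unique_words)}

def one_hot(word):
    # Unknown words map to an all-zero vector, as in the function above
    vector = [0] * len(word_to_index)
    if word in word_to_index:
        vector[word_to_index[word]] = 1
    return vector

encoded = [one_hot(word) for word in "man bites dog".split()]
print(encoded)  # [[0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]]
```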