# **Natural Language Processing - NLP basics**

NLP is a field of Artificial Intelligence that deals with processing natural language using different methods, so that a computer can understand written and spoken human language and perform tasks accordingly.

Some important NLP techniques:

## **1. Text Normalization**

Text normalization is a method of converting raw text into a general or standardized form so that it can be processed by NLP models. Standardizing natural language removes irregularities from the text that may harm the performance of text models for tasks such as classification and summarization.

Some text normalization methods are:


In [None]:
!pip install nltk



In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet as wn

In [None]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

## **1.1 Tokenization**

Tokenization is the method of breaking text data into chunks called tokens. The idea is to segment the input text into meaningful units so that NLP models can process it unit by unit and better capture the structure of the language. It is a crucial step in many NLP tasks, such as sentence classification, segmentation, generation, and translation.

In [None]:
text = "She is running quickly in the Jurassic park. Suddenly she heard a T-rex roar."

# Sentence Tokenization
sentences = sent_tokenize(text)
print(sentences)

['She is running quickly in the Jurassic park.', 'Suddenly she heard a T-rex roar.']


### **1.1.1 Top-down Tokenization**

This method tokenizes text in a top-down hierarchy, recursively segmenting each level of natural language. For example, suppose we have a large document containing multiple paragraphs.

First, the document is split into paragraphs, then each paragraph is split into sentences, and then each sentence is split into words, called tokens, and so on.
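The three levels described above can be sketched with the standard library alone. This is a minimal illustration, not how NLTK segments text: the blank-line paragraph rule, the punctuation-based sentence rule, and the function name `top_down_tokenize` are all simplifying assumptions.

```python
import re

def top_down_tokenize(document):
    """Split a document into paragraphs, then sentences, then word tokens."""
    hierarchy = []
    # Level 1: paragraphs are assumed to be separated by blank lines.
    for paragraph in re.split(r"\n\s*\n", document.strip()):
        sentences = []
        # Level 2: naive sentence split after '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            # Level 3: words (hyphens and apostrophes allowed inside) or single punctuation marks.
            tokens = re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)
            sentences.append(tokens)
        hierarchy.append(sentences)
    return hierarchy

document = (
    "She is running quickly in the Jurassic park. Suddenly she heard a T-rex roar.\n\n"
    "She called her boyfriend."
)
for paragraph in top_down_tokenize(document):
    print(paragraph)
```

The result is a nested list (paragraphs, then sentences, then tokens), which mirrors the hierarchy directly. The naive sentence rule would break on abbreviations such as "Dr.", which is one reason to prefer a trained sentence tokenizer in practice.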

### **1.1.2 Bottom-up Tokenization**

Bottom-up tokenization is the reverse of top-down: we first convert the whole text corpus into its smallest possible units and gradually work up to larger structures. For example, we convert a corpus into characters and then combine characters into subwords or words based on how the characters are distributed across words and sentences.

* #### 1.1.2.1 **Subword Tokenization** : Subword tokenization breaks words into smaller units called subwords. Each subword carries some of the word's meaning and context; subwords typically include prefixes and suffixes. For example: **consider the words low, new, and newer. We don't need a separate token for lower, because it can be built from low + er, and er is already available from segmenting newer into new + er.**
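The low + er example can be made concrete with a greedy longest-match segmenter. This is only a sketch: the vocabulary is hand-written rather than learned, and `segment_into_subwords` is a hypothetical helper, not part of any library.

```python
def segment_into_subwords(word, vocabulary):
    """Greedily split `word` into the longest subwords found in `vocabulary`."""
    subwords = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocabulary:
                subwords.append(piece)
                start = end
                break
        else:
            # No known subword starts here: fall back to a single character.
            subwords.append(word[start])
            start += 1
    return subwords

# Hypothetical vocabulary, as if learned from the words low, new and newer.
vocabulary = {"low", "new", "er"}

print(segment_into_subwords("lower", vocabulary))  # ['low', 'er']
print(segment_into_subwords("newer", vocabulary))  # ['new', 'er']
```

The unseen word "lower" is covered without adding a new token, which is the main appeal of subword vocabularies.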


In [None]:
# Word Tokenization Example
words = word_tokenize(text)
print(words)

['She', 'is', 'running', 'quickly', 'in', 'the', 'Jurassic', 'park', '.', 'Suddenly', 'she', 'heard', 'a', 'T-rex', 'roar', '.']


#### **1.1.2.1.1 Byte Pair Encoding (BPE)**

BPE is a popular tokenization algorithm that iteratively merges the most frequent pair of adjacent symbols (initially characters) in a corpus to build a subword vocabulary.
It was introduced as a compression algorithm that shortens long strings by replacing frequent pairs with a single symbol, and it later became widely used for tokenization in NLP models.
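Before looking at a full implementation, it may help to see a single merge step in isolation. The sketch below counts adjacent symbol pairs in a toy corpus and merges the most frequent one; the corpus and the helper names `count_adjacent_pairs` and `merge_pair` are assumptions chosen for illustration.

```python
from collections import Counter

def count_adjacent_pairs(words):
    """Count how often each pair of neighbouring symbols occurs across all words."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_words = []
    for symbols in words:
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        merged_words.append(merged)
    return merged_words

# Toy corpus (an assumption for illustration), split into characters.
words = [list(word) for word in "low new newer newer wider lower".split()]

pairs = count_adjacent_pairs(words)
best_pair, best_count = pairs.most_common(1)[0]
print(best_pair, best_count)          # ('e', 'r') 4
print(merge_pair(words, best_pair))
```

Here the pair ('e', 'r') occurs four times, so it becomes the new symbol 'er'. Repeating the count-and-merge step a fixed number of times yields the subword vocabulary.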

In [None]:
!pip install tokenizers



In [None]:
from collections import defaultdict

def byte_pair_encoding(corpus, num_merges):
    # Tokenize the corpus into words and then into characters (initial tokenization)
    words = [list(word) for word in corpus.split()]

    # Create a dictionary of all possible pairs and their frequencies
    pairs = defaultdict(int)
    for word in words:
        # Create pairs of adjacent subwords (characters initially)
        for i in range(len(word) - 1):
            pair = (word[i], word[i + 1])
            pairs[pair] += 1

    # Perform the specified number of merges
    for _ in range(num_merges):
        if not pairs:
            break

        # Find the most frequent pair
        most_frequent_pair = max(pairs, key=pairs.get)

        # Replace all occurrences of the most frequent pair with a merged subword
        new_words = []
        for word in words:
            new_word = []
            i = 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == most_frequent_pair:
                    # Merge the pair and skip the next character
                    new_word.append(word[i] + word[i + 1])
                    i += 2
                else:
                    # Add the character as is
                    new_word.append(word[i])
                    i += 1
            new_words.append(new_word)

        # Update the words with merged pairs
        words = new_words

        # Update the pairs dictionary
        pairs = defaultdict(int)
        for word in words:
            for i in range(len(word) - 1):
                pair = (word[i], word[i + 1])
                pairs[pair] += 1

    # Create the final subword vocabulary from the words
    vocabulary = set()
    for word in words:
        vocabulary.update(word)

    # Return the final vocabulary and the tokenized words
    tokenized_words = [' '.join(word) for word in words]
    return vocabulary, tokenized_words

# Example usage:

with open('/content/drive/MyDrive/Natural-Language-Processing/internet_archive_scifi_v3.txt', 'r') as f:
    corpus = f.read()[:50000]

num_merges = 100

vocabulary, tokenized_words = byte_pair_encoding(corpus, num_merges)

print("Vocabulary:", vocabulary)
print("Tokenized words:", tokenized_words)

Vocabulary: {'ing', "'t", 'Y', 'irk', 'ha', 'A', 'ir', 'to', 're', 'E', '?"', 'mo', 'O', 'for', 'one', 'se', 'hi', 'ur', 'Cor', '"', 'C', 'ad', 'so', 'z', "'s", 'and', 'ie', 'all', 'T', 'ld', 'x', 'M', 'was', 'B', 'ter', '?', 's', 'Q', 'the', 'id', 'ut', 'fe', 'i', 'ay', 'you', 'q', 'H', 'h', 'm', '--', 'li', 'ver', 'do', 'gh', "'", '!', 'w', 'e', 'on', 'I', 'as', 'be', 'si', '-', 'st', 'c', 'R', 'al', 'ar', 'nd', 'out', 'in', ')', 'J', 'ly', 'ke', 'v', 'it', 'man', 'p', 'a', 'K', '."', 'de', 'D', 'F', 'that', '.', ':', 'G', 'Corde', 'ri', 'N', 'at', 'th', 'g', 'her', 'b', 'P', 'Kirk', ',', 'S', 'y', 'l', 'we', 'lo', 'et', 'go', 'o', 'j', 'k', 'ne', 'll', 'U', 'of', 'he', 'ma', 'ce', '(', 'an', 'ti', 'le', 'd', 'ent', 'no', 'V', 'u', '#', ';', 'ac', 'or', 'ic', 'ab', 'said', 'ou', 'L', 'es', 'me', 'W', 'en', 'ho', 'The', 'is', 'his', 'ant', 'f', 'X', 'r', 'hat', 've', 'Z', 'ould', 'wi', 'Cordell', 'sa', 'un', 'ow', 'ro', 'ed', 'co', 'er', 't', 'ght', 'n'}
Tokenized words: ['M A R C H',

## **1.2 Part-of-Speech (POS) Tagging**

POS tagging, also called grammatical tagging, labels each word in a sentence with a category such as noun, adjective, adverb, or verb. It is a crucial step in NLP, used to help determine the meaning of a sentence and the context in which each word appears.



In [None]:
# POS tagging

pos_tags = pos_tag(words)

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('V'):
        return wn.VERB
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    else:
        return None
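To show what tagged tokens look like and how the first letter of a tag maps to a coarse category, here is a small standalone sketch. The tags are hand-assigned Penn Treebank tags, not output from `pos_tag`, and the `COARSE_CATEGORY` table is a hypothetical stand-in for the WordNet constants used above.

```python
# Hand-assigned Penn Treebank tags for illustration; this is not output from pos_tag.
tagged_words = [
    ("She", "PRP"),      # personal pronoun
    ("is", "VBZ"),       # verb, 3rd person singular present
    ("running", "VBG"),  # verb, gerund or present participle
    ("quickly", "RB"),   # adverb
]

# Coarse category keyed by the first letter of the tag.
COARSE_CATEGORY = {"J": "adjective", "V": "verb", "N": "noun", "R": "adverb"}

for word, tag in tagged_words:
    category = COARSE_CATEGORY.get(tag[0], "other")
    print(f"{word:<8} {tag:<4} {category}")
```

Tags outside the four categories, such as the pronoun tag here, fall through to a default, just as `get_wordnet_pos` returns `None` for them.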


## **1.3 Lemmatization**

Lemmatization converts each word to its base or dictionary form, called a **lemma**. It is often used to collapse inflected forms that would otherwise be harder for NLP models to handle. It is an essential step for reducing the complexity of language, which involves many tenses and word forms.

For lemmatization to work well, we need to provide the part-of-speech tag of each word so that the lemmatizer knows which grammatical role the word plays.
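Why the tag matters can be seen with a toy lookup lemmatizer. The table and the function `toy_lemmatize` are hypothetical; a real lemmatizer such as `WordNetLemmatizer` consults a full lexical database, and it treats a word as a noun when no POS is supplied.

```python
# Hypothetical lookup table keyed by (word, part of speech); a real lemmatizer
# consults a full lexical database instead of a hand-written dictionary.
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("is", "v"): "be",
    ("heard", "v"): "hear",
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word unchanged if it is unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("saw", "v"))      # see
print(toy_lemmatize("saw", "n"))      # saw
print(toy_lemmatize("running", "v"))  # run
print(toy_lemmatize("running", "n"))  # running
```

The same surface form maps to different lemmas depending on its tag, so a word tagged with the wrong part of speech is often left unchanged.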

In [None]:
lemmatizer = WordNetLemmatizer()
lemmatized_words = []

for word, tag in pos_tags:
    wordnet_pos = get_wordnet_pos(tag)
    if wordnet_pos:
        lemma = lemmatizer.lemmatize(word, wordnet_pos)
    else:
        lemma = lemmatizer.lemmatize(word)
    lemmatized_words.append(lemma)

print(lemmatized_words)

['She', 'be', 'run', 'quickly', 'in', 'the', 'Jurassic', 'park', '.', 'Suddenly', 'she', 'hear', 'a', 'T-rex', 'roar', '.']


## **1.4 Stemming**

Stemming is similar to lemmatization but simpler. It reduces a word to its stem by truncating affixes (prefixes or suffixes attached to a word that modify its meaning). Stemming is easier to implement than lemmatization because it applies mechanical truncation rules and needs no dictionary or part-of-speech information; as a result, the stem it produces is not always a valid word.

Types of Stemming:

**PorterStemmer**: PorterStemmer applies a comparatively conservative set of suffix-stripping rules, so more of the original word is preserved and its meaning usually remains recognizable.

**LancasterStemmer**: LancasterStemmer is more aggressive in reducing words to their stems. It strips affixes more readily, which sometimes removes an important part of the word's meaning. For example, if a stemmer had an unconditional rule to remove 'ing', it would reduce 'sing' to 's'.
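The risk of over-stemming mentioned above can be reproduced with a deliberately naive suffix stripper. This is not the Porter or Lancaster algorithm; `toy_stem` and its suffix list are assumptions for illustration only.

```python
def toy_stem(word, suffixes=("ing", "ly", "ed", "s")):
    """Strip the first matching suffix; a deliberately naive rule set."""
    lowered = word.lower()
    for suffix in suffixes:
        if lowered.endswith(suffix):
            return lowered[: -len(suffix)]
    return lowered

for word in ["calling", "quickly", "called", "sing"]:
    print(word, "->", toy_stem(word))
# calling -> call
# quickly -> quick
# called -> call
# sing -> s
```

Real stemmers guard their rules with conditions on what remains after stripping (for example, requiring the remaining stem to contain a vowel), which is how they avoid results like 's'.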

In [None]:
from nltk.stem import PorterStemmer, LancasterStemmer

porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()

text = "She is running quickly and she heard a T-rex roar. She got frightned and called her boyfriend"

words = word_tokenize(text)

porter_stems = [porter_stemmer.stem(word) for word in words]
lancaster_stems = [lancaster_stemmer.stem(word) for word in words]

print(porter_stems)
print(lancaster_stems)

['she', 'is', 'run', 'quickli', 'and', 'she', 'heard', 'a', 't-rex', 'roar', '.', 'she', 'got', 'frightn', 'and', 'call', 'her', 'boyfriend']
['she', 'is', 'run', 'quick', 'and', 'she', 'heard', 'a', 't-rex', 'roar', '.', 'she', 'got', 'frightn', 'and', 'cal', 'her', 'boyfriend']


## **2 Minimum Edit Distance (MED)**

Minimum Edit Distance measures how similar two words or sentences are. It counts the minimum number of edits (insertions, deletions, and substitutions) needed to turn one string into the other; the fewer edits required, the more similar the two strings are.
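One common way to define this count is as a recurrence over prefixes of the two strings. The formulation below assumes unit cost for every edit (the Levenshtein distance); the symbols $a$, $b$, $D$, and $c$ are notation introduced here. $D(i, j)$ is the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$:

$$
D(i, 0) = i, \qquad D(0, j) = j
$$

$$
D(i, j) = \min
\begin{cases}
D(i-1, j) + 1 & \text{(deletion)} \\
D(i, j-1) + 1 & \text{(insertion)} \\
D(i-1, j-1) + c(i, j) & \text{(substitution or match)}
\end{cases}
\qquad
c(i, j) =
\begin{cases}
0 & \text{if } a_i = b_j \\
1 & \text{otherwise}
\end{cases}
$$

The distance between the full strings is $D(|a|, |b|)$. For instance, turning "Sunday" into "Saturday" takes three edits: insert 'a', insert 't', and substitute 'n' with 'r'.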

In [None]:
!pip install python-Levenshtein

Collecting python-Levenshtein
  Downloading python_Levenshtein-0.25.1-py3-none-any.whl (9.4 kB)
Collecting Levenshtein==0.25.1 (from python-Levenshtein)
  Downloading Levenshtein-0.25.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (177 kB)
Collecting rapidfuzz<4.0.0,>=3.8.0 (from Levenshtein==0.25.1->python-Levenshtein)
  Downloading rapidfuzz-3.8.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.4 MB)
Installing collected packages: rapidfuzz, Levenshtein, python-Levenshtein
Successfully installed Levenshtein-0.25.1 python-Levenshtein-0.25.1 rapidfuzz-3.8.1


In [None]:
import Levenshtein

s1 = "Sunday"
s2 = "Saturday"

edit_distance = Levenshtein.distance(s1, s2)
print(edit_distance)

3


### **2.2 Scratch Implementation of MED**

In [None]:
def minimum_edit_distance(str1, str2):
    # Initialize the matrix
    len1 = len(str1)
    len2 = len(str2)
    matrix = [[0] * (len2 + 1) for _ in range(len1 + 1)]

    # Initialize the first row and column
    for i in range(len1 + 1):
        matrix[i][0] = i
    for j in range(len2 + 1):
        matrix[0][j] = j

    # Calculate the minimum edit distance
    for i in range(1, len1 + 1):
        for j in range(1, len2 + 1):
            if str1[i - 1] == str2[j - 1]:
                cost = 0
            else:
                cost = 1

            matrix[i][j] = min(
                matrix[i - 1][j] + 1,  # Deletion
                matrix[i][j - 1] + 1,  # Insertion
                matrix[i - 1][j - 1] + cost  # Substitution (or match when cost is 0)
            )

    # Return the minimum edit distance
    return matrix[len1][len2]

# Example usage
str1 = "Benny"
str2 = "Dennis"

# Calculate the minimum edit distance between the two strings
edit_distance = minimum_edit_distance(str1, str2)

# Print the minimum edit distance
print(f"Minimum edit distance between '{str1}' and '{str2}' is: {edit_distance}")


Minimum edit distance between 'Benny' and 'Dennis' is: 3
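The distance alone does not say which edits were made. As a sketch, the same table can be walked backwards from the bottom-right cell to recover one minimal sequence of operations. The function name `edit_operations` is an assumption, and when several minimal sequences exist this returns just one of them.

```python
def edit_operations(source, target):
    """Return one minimal list of edits that turns `source` into `target`."""
    rows, cols = len(source) + 1, len(target) + 1
    # Fill the same dynamic-programming table as the distance computation.
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i
    for j in range(cols):
        table[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,
                table[i][j - 1] + 1,
                table[i - 1][j - 1] + cost,
            )

    # Walk back from the bottom-right cell, recording each edit on the way.
    operations = []
    i, j = len(source), len(target)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if source[i - 1] == target[j - 1] else 1
            if table[i][j] == table[i - 1][j - 1] + cost:
                if cost:
                    operations.append(f"substitute '{source[i - 1]}' with '{target[j - 1]}'")
                i, j = i - 1, j - 1
                continue
        if i > 0 and table[i][j] == table[i - 1][j] + 1:
            operations.append(f"delete '{source[i - 1]}'")
            i -= 1
        else:
            operations.append(f"insert '{target[j - 1]}'")
            j -= 1
    # Edits were collected from the end of the strings, so reverse them.
    return operations[::-1]

for operation in edit_operations("Benny", "Dennis"):
    print(operation)
```

The number of operations returned equals the minimum edit distance, so for 'Benny' and 'Dennis' the list has three entries.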
