# NLP Use Cases: Canonicalisation, Phonetic Hashing, Edit Distance, Spell Corrector, and Pointwise Mutual Information

## 1. Canonicalisation

Canonicalisation maps different surface forms of the same text (case, punctuation, contractions, inflections) to one standard form, so that later steps in an NLP pipeline treat them as equal. It is a routine part of text pre-processing.

### Standardizing text representations

In [1]:
# Example 1: Lowercasing
text = "Natural Language Processing is Amazing!"
lowercased_text = text.lower()
print("Lowercased Text:", lowercased_text)


Lowercased Text: natural language processing is amazing!


In [2]:
import string

# Example 2: Removing Punctuation
text = "Hello, World! NLP is fun."
text_no_punc = text.translate(str.maketrans('', '', string.punctuation))
print("Text without Punctuation:", text_no_punc)


Text without Punctuation: Hello World NLP is fun
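The two steps shown so far are usually chained into a single helper. The following is a minimal sketch, assuming ASCII punctuation only; the function name `canonicalise` and the extra whitespace-collapsing step are illustrative choices, not part of the original notebook.

```python
import re
import string

def canonicalise(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    # Same translate-based punctuation removal as in Example 2
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Removing punctuation can leave double spaces behind, so collapse them
    return re.sub(r'\s+', ' ', text).strip()

print(canonicalise("Hello,   World! NLP is fun."))
```

Applying lowercasing before punctuation removal or after gives the same result here; the order only starts to matter once steps such as contraction expansion depend on apostrophes being present.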


In [3]:
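# Example 3: Expanding Contractions
# The lookup matches whole whitespace-separated tokens only, so "can't," (with a comma) would not be expanded
contractions = {"can't": "cannot", "won't": "will not", "I'm": "I am"}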
text = "I'm learning NLP and I can't stop!"
expanded = ' '.join([contractions.get(word, word) for word in text.split()])
print("Expanded Text:", expanded)


Expanded Text: I am learning NLP and I cannot stop!
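Splitting on whitespace only expands a contraction when the token matches a dictionary key exactly. A regex-based variant is one way to cope with trailing punctuation and different casing. This is a sketch under assumptions: the names `CONTRACTIONS`, `PATTERN` and `expand_contractions` are hypothetical, the keys are stored in lowercase, and the replacement does not try to preserve the original capitalisation.

```python
import re

# Keys are lowercase; matching is case-insensitive
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "I am"}

# One alternation over all keys, anchored on word boundaries
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text: str) -> str:
    """Replace every known contraction, even when followed by punctuation."""
    return PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I can't, and I won't!"))
```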


In [4]:
from nltk.stem import WordNetLemmatizer, PorterStemmer
import nltk
nltk.download('wordnet')

# Example 4: Lemmatization and Stemming
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
words = ['running', 'jumps', 'easily', 'fairly']

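# Lemmatization (dictionary-based; without a pos argument every word is treated as a noun,
# which is why 'running' comes back unchanged below)
lemmatized = [lemmatizer.lemmatize(word) for word in words]

# Stemming (rule-based suffix stripping; results need not be real words)
stemmed = [stemmer.stem(word) for word in words]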

print("Lemmatized Words:", lemmatized)
print("Stemmed Words:", stemmed)




Lemmatized Words: ['running', 'jump', 'easily', 'fairly']
Stemmed Words: ['run', 'jump', 'easili', 'fairli']


## 2. Phonetic Hashing
Phonetic hashing maps a word to a short code based on how it sounds, so that differently spelled words with similar pronunciation share the same code. This helps with fuzzy matching and with correcting misspellings.

In [6]:
!pip install phonetics




In [7]:
import phonetics

# Example 1: Soundex
print('Soundex (Smith):', phonetics.soundex('Smith'))
print('Soundex (Smyth):', phonetics.soundex('Smyth'))


Soundex (Smith): S5030
Soundex (Smyth): S5030
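To make the idea of a phonetic code concrete, here is a minimal pure-Python sketch of the classic four-character American Soundex. It is not the `phonetics` library's implementation, and its output format differs from the five-character codes printed above; the names `CODES`, `soundex_classic` and the sample name list are illustrative.

```python
from collections import defaultdict

# Consonant groups of classic Soundex; vowels, h, w and y carry no digit
CODES = {
    letter: digit
    for digit, letters in {
        '1': 'bfpv', '2': 'cgjkqsxz', '3': 'dt',
        '4': 'l', '5': 'mn', '6': 'r',
    }.items()
    for letter in letters
}

def soundex_classic(word: str) -> str:
    """Return the first letter plus three digits, padded with zeros."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ''
    digits = []
    prev = CODES.get(letters[0], '')
    for c in letters[1:]:
        if c in 'hw':
            continue  # h and w do not separate letters with the same code
        code = CODES.get(c, '')
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (letters[0].upper() + ''.join(digits) + '000')[:4]

# Group names by code, as a fuzzy-matching index would
groups = defaultdict(list)
for name in ['Smith', 'Smyth', 'Robert', 'Rupert']:
    groups[soundex_classic(name)].append(name)
print(dict(groups))
```

Names that land in the same bucket are candidates for a match; a stricter comparison such as edit distance can then be applied within each bucket.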


### Metaphone
Metaphone generates phonetic representations for English words.

In [8]:
print('Metaphone (Knight):', phonetics.metaphone('Knight'))
print('Metaphone (Night):', phonetics.metaphone('Night'))


Metaphone (Knight): NT
Metaphone (Night): NT


### Metaphone with jellyfish
`jellyfish.metaphone` implements the original Metaphone algorithm rather than Double Metaphone, so the labels below say Metaphone.

In [9]:
import jellyfish

# Example 3: Metaphone via jellyfish
print('Metaphone (Bier):', jellyfish.metaphone('Bier'))
print('Metaphone (Beer):', jellyfish.metaphone('Beer'))


Metaphone (Bier): BR
Metaphone (Beer): BR


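### NYSIIS
NYSIIS (New York State Identification and Intelligence System) is another phonetic code, designed mainly for matching names.

In [10]: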
print('NYSIIS (Christopher):', jellyfish.nysiis('Christopher'))
print('NYSIIS (Kristopher):', jellyfish.nysiis('Kristopher'))


NYSIIS (Christopher): CRASTAFAR
NYSIIS (Kristopher): CRASTAFAR


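## 3. Edit Distance
Edit distance measures how different two strings are as the minimum number of single-character operations (insert, delete, replace) needed to turn one into the other. A smaller distance means more similar strings.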

In [13]:
! pip install Levenshtein



In [28]:
import Levenshtein as lev

# Example 1: Levenshtein Distance
print('Levenshtein (kitten, sitting):', lev.distance('kitten', 'sitting'))


Levenshtein (kitten, sitting): 3
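The distance above can be reproduced with the standard dynamic-programming recurrence. This is a minimal sketch for illustration, not the `Levenshtein` package's implementation; the function name `levenshtein` is an assumption, and only two rows of the table are kept in memory.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a gives the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if the characters match
            ))
        prev = curr
    return prev[-1]

print('Levenshtein (kitten, sitting):', levenshtein('kitten', 'sitting'))
```

For kitten and sitting the three edits are k→s, e→i, and the insertion of a final g.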


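### Damerau-Levenshtein Distance
Extends Levenshtein distance with a fourth operation, the transposition of two adjacent characters, which counts as a single edit. This suits typing errors such as `hlelo` for `hello`.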

In [16]:
!pip install jellyfish




In [17]:
import jellyfish

# Damerau-Levenshtein distance examples
print('Damerau-Levenshtein (cat, cast):', jellyfish.damerau_levenshtein_distance('cat', 'cast'))
print('Damerau-Levenshtein (hello, hlelo):', jellyfish.damerau_levenshtein_distance('hello', 'hlelo'))
print('Damerau-Levenshtein (apple, applf):', jellyfish.damerau_levenshtein_distance('apple', 'applf'))
print('Damerau-Levenshtein (data, adta):', jellyfish.damerau_levenshtein_distance('data', 'adta'))


Damerau-Levenshtein (cat, cast): 1
Damerau-Levenshtein (hello, hlelo): 1
Damerau-Levenshtein (apple, applf): 1
Damerau-Levenshtein (data, adta): 1
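The effect of the transposition rule can be sketched with the optimal string alignment (OSA) variant, which adds one extra case to the Levenshtein recurrence. This is an illustrative sketch, not the `jellyfish` implementation, and OSA can differ from unrestricted Damerau-Levenshtein on some inputs (for example `ca` versus `abc`); the name `osa_distance` is an assumption.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with insert, delete, substitute and adjacent transposition (OSA variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Two adjacent characters swapped: count the swap as one edit
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

for pair in [('cat', 'cast'), ('hello', 'hlelo'), ('data', 'adta')]:
    print(pair, osa_distance(*pair))
```

Plain Levenshtein distance would score `hello` versus `hlelo` as 2 (two substitutions), whereas the transposition case brings it down to 1.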


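### Jaro-Winkler Similarity
Jaro-Winkler is a similarity score between 0 and 1 that gives extra weight to a shared prefix. Unlike the distances above, higher means more similar, and 1 means identical.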


In [18]:
print('Jaro-Winkler (hello, helo):', lev.jaro_winkler('hello', 'helo'))


Jaro-Winkler (hello, helo): 0.9533333333333333
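The score above can be traced by hand with a small pure-Python version. This sketch assumes the common parameter choices (prefix scaling factor 0.1, prefix capped at four characters); the function names `jaro` and `jaro_winkler` are illustrative and this is not the `Levenshtein` package's code.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score by the length of the common prefix (at most 4)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * p * (1 - score)

print('Jaro-Winkler (hello, helo):', jaro_winkler('hello', 'helo'))
```

For `hello` and `helo`, four characters match with no transpositions, giving a Jaro score of (4/5 + 4/4 + 4/4) / 3 ≈ 0.933; the shared prefix `hel` then adds 3 × 0.1 × (1 − 0.933) ≈ 0.02.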


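### Hamming Distance
Measures the difference between two equal-length strings; it is undefined when the lengths differ.

In [19]:
def hamming(s1, s2):
    # zip() would silently truncate to the shorter string, so reject unequal lengths
    if len(s1) != len(s2):
        raise ValueError('Hamming distance requires equal-length strings')
    return sum(el1 != el2 for el1, el2 in zip(s1, s2))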

print('Hamming (karolin, kathrin):', hamming('karolin', 'kathrin'))


Hamming (karolin, kathrin): 3


Counts the number of mismatched characters.
Used in error detection.

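## 4. Spell Corrector
A spell corrector replaces a misspelled word with the most likely intended word, typically by looking for dictionary words within a small edit distance and ranking them by frequency.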

In [21]:
!pip install pyspellchecker



In [33]:
from spellchecker import SpellChecker

# Initialize the spell checker
spell = SpellChecker()

# Example: Correcting a misspelled word
print('Correction for speling:', spell.correction('speling'))



Correction for speling: spelling


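### Sentence Correction
Corrects a sentence one word at a time. Each word is corrected without regard to its neighbours, so some results are wrong: in the output below, `luv` becomes `lug` and `NLP` becomes `nap`.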

In [23]:
sentence = 'I luv NLP and machin lerning'
corrected = ' '.join([spell.correction(word) or word for word in sentence.split()])
print('Corrected Sentence:', corrected)


Corrected Sentence: I lug nap and machine learning
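Why a context-free corrector can turn `luv` into `lug` is easier to see in a small sketch in the spirit of Peter Norvig's well-known corrector: generate every string one edit away, keep those found in the vocabulary, and pick the most frequent. The word counts in `WORD_COUNTS` are made up for illustration, the function names are hypothetical, and this is not the `pyspellchecker` implementation.

```python
import string
from collections import Counter

# Toy vocabulary with invented frequencies (assumption, for illustration only)
WORD_COUNTS = Counter({
    'spelling': 5, 'machine': 4, 'learning': 4,
    'love': 3, 'nlp': 2, 'lug': 1,
})

def edits1(word: str) -> set:
    """All strings one delete, transpose, replace or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word: str) -> str:
    """Return the word itself if known, else the most frequent known word one edit away."""
    word = word.lower()
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    return max(candidates, key=WORD_COUNTS.get) if candidates else word

print([correct(w) for w in 'I luv NLP and machin lerning'.split()])
```

In this toy vocabulary `lug` is one edit from `luv` while `love` is two edits away, so the one-edit search never considers `love`. Adding domain words to the vocabulary, as `nlp` is here, keeps them from being rewritten.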


### Custom Dictionary
Adds domain-specific words to the checker's vocabulary so that they are treated as known words and left unchanged.

In [40]:
spell.word_frequency.load_words(['NLP', 'machine', 'learning'])
print('Correction for nlp:', spell.correction('nlp'))


Correction for nlp: nlp


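### Using TextBlob
TextBlob offers a `correct()` method on whole strings. It also works word by word, so context-dependent mistakes remain possible: below, `hav` is corrected to `had` rather than `have`.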

In [39]:
from textblob import TextBlob

text = TextBlob('I hav a speling eror')
print('TextBlob Correction:', text.correct())


TextBlob Correction: I had a spelling error


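## 5. Pointwise Mutual Information (PMI)
PMI measures how strongly two words are associated by comparing how often they occur together with how often they would co-occur if they were independent: PMI(x, y) = log( p(x, y) / (p(x) p(y)) ). Positive values mean the pair occurs together more often than chance.

Note that the cell below scores single words rather than pairs: it computes log( p(w) × V ), the log-ratio of each word's relative frequency to a uniform distribution over the V distinct words. This is a frequency score, not PMI in the strict sense.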

In [26]:
from collections import Counter
import math

corpus = ['NLP is amazing', 'I love NLP', 'NLP uses Python']
tokens = [word for sentence in corpus for word in sentence.split()]
freq = Counter(tokens)

pmi = {word: math.log((freq[word] / len(tokens)) / (1 / len(freq))) for word in freq}
print("PMI:", pmi)


PMI: {'NLP': 0.8472978603872037, 'is': -0.25131442828090605, 'amazing': -0.25131442828090605, 'I': -0.25131442828090605, 'love': -0.25131442828090605, 'uses': -0.25131442828090605, 'Python': -0.25131442828090605}
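PMI is defined over pairs of words, so a pairwise computation on the same toy corpus may help. The sketch below assumes that "occurring together" means being adjacent within a sentence (a bigram) and estimates probabilities by simple relative frequency; the helper name `pmi` and these estimation choices are assumptions for illustration.

```python
import math
from collections import Counter

corpus = ['NLP is amazing', 'I love NLP', 'NLP uses Python']
sentences = [sentence.split() for sentence in corpus]

unigrams = Counter(word for s in sentences for word in s)
# Adjacent word pairs, never crossing a sentence boundary
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
n_unigrams = sum(unigrams.values())
n_bigrams = sum(bigrams.values())

def pmi(x: str, y: str) -> float:
    """log( p(x, y) / (p(x) * p(y)) ) with relative-frequency estimates."""
    p_xy = bigrams[(x, y)] / n_bigrams
    if p_xy == 0:
        return float('-inf')  # the pair never occurs as a bigram
    p_x = unigrams[x] / n_unigrams
    p_y = unigrams[y] / n_unigrams
    return math.log(p_xy / (p_x * p_y))

for pair in bigrams:
    print(pair, round(pmi(*pair), 3))
```

Pairs made of rare words, such as (`I`, `love`), score higher than pairs containing the frequent word `NLP`, because a frequent word is expected to co-occur with many others by chance. On a corpus this small the estimates are not meaningful beyond illustrating the formula.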


In [None]:
# Canonicalisation Examples
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
nltk.download('stopwords')
nltk.download('wordnet')

# Example 1: Lowercasing
text = 'Hello World! This is NLP 101.'
print('Lowercased:', text.lower())

# Example 2: Removing Punctuation
text_no_punc = text.translate(str.maketrans('', '', string.punctuation))
print('Without Punctuation:', text_no_punc)

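# Example 3: Expanding Contractions
# (uses its own sample sentence, since the text above contains no contractions)
contractions = {"can't": "cannot", "won't": "will not", "I'm": "I am"}
contraction_text = "I'm sure we can't stop now"
text_with_expansion = ' '.join([contractions.get(word, word) for word in contraction_text.split()])
print('Expanded Contractions:', text_with_expansion)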

# Example 4: Lemmatization and Stemming
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
words = ['running', 'jumps', 'easily', 'fairly']
print('Lemmatization:', [lemmatizer.lemmatize(word) for word in words])
print('Stemming:', [stemmer.stem(word) for word in words])

## 2. Phonetic Hashing
### Generating phonetic representations of words

In [None]:
# Phonetic Hashing Examples
from phonetics import soundex, metaphone
import jellyfish

# Example 1: Soundex
print('Soundex (Smith):', soundex('Smith'))
print('Soundex (Smyth):', soundex('Smyth'))

# Example 2: Metaphone
print('Metaphone (Smith):', metaphone('Smith'))
print('Metaphone (Smyth):', metaphone('Smyth'))

# Example 3: Metaphone via jellyfish (jellyfish.metaphone is Metaphone, not Double Metaphone)
print('Metaphone, jellyfish (Smith):', jellyfish.metaphone('Smith'))
print('Metaphone, jellyfish (Smyth):', jellyfish.metaphone('Smyth'))

# Example 4: NYSIIS (New York State Identification and Intelligence System)
print('NYSIIS (Smith):', jellyfish.nysiis('Smith'))
print('NYSIIS (Smyth):', jellyfish.nysiis('Smyth'))

## 3. Edit Distance
### Measuring similarity between strings

In [None]:
# Edit Distance Examples
import Levenshtein as lev

# Example 1: Levenshtein Distance
print('Levenshtein (kitten, sitting):', lev.distance('kitten', 'sitting'))

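# Example 2: Damerau-Levenshtein Distance
# jellyfish is used here, as in the executed cell earlier; the Levenshtein package
# does not appear to expose a Damerau-Levenshtein function
import jellyfish
print('Damerau-Levenshtein (cat, cast):', jellyfish.damerau_levenshtein_distance('cat', 'cast'))

# Example 3: Jaro-Winkler Similarity (1.0 means identical)
print('Jaro-Winkler (hello, helo):', lev.jaro_winkler('hello', 'helo'))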

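# Example 4: Hamming Distance (defined only for equal-length strings)
def hamming(s1, s2):
    if len(s1) != len(s2):
        raise ValueError('Hamming distance requires equal-length strings')
    return sum(el1 != el2 for el1, el2 in zip(s1, s2))
print('Hamming (karolin, kathrin):', hamming('karolin', 'kathrin'))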

## 4. Spell Corrector
### Correcting misspelled words

In [None]:
# Spell Corrector Examples
from spellchecker import SpellChecker

spell = SpellChecker()

# Example 1: Single word correction
print('Correction for speling:', spell.correction('speling'))

# Example 2: Sentence correction
sentence = 'I luv NLP and machin lerning'
corrected_sentence = ' '.join([spell.correction(word) or word for word in sentence.split()])
print('Corrected Sentence:', corrected_sentence)

# Example 3: Custom dictionary-based correction
spell.word_frequency.load_words(['nlp', 'machine', 'learning'])
print('Correction for nlp:', spell.correction('nlp'))

# Example 4: Using NLP libraries for spell correction (TextBlob)
from textblob import TextBlob
text = TextBlob('I hav a speling eror')
print('TextBlob Correction:', text.correct())

## 5. Pointwise Mutual Information (PMI)
### Measuring word associations

In [None]:
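# PMI Examples
from collections import Counter
import math

corpus = ['I love NLP', 'I love Python', 'Python is great']
sentences = [sentence.split() for sentence in corpus]
tokens = [word for sentence in sentences for word in sentence]
freq = Counter(tokens)
total = sum(freq.values())

# Adjacent word pairs within each sentence
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
total_bigrams = sum(bigrams.values())

def pmi(x, y):
    # PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )
    p_xy = bigrams[(x, y)] / total_bigrams
    if p_xy == 0:
        return float('-inf')  # the pair never occurs as a bigram
    return math.log(p_xy / ((freq[x] / total) * (freq[y] / total)))

# Example 1: PMI for every bigram in the corpus
print('PMI:', {pair: pmi(*pair) for pair in bigrams})

# Example 2: PMI for selected word pairs (pairs that are never adjacent score -inf)
word_pairs = [('NLP', 'Python'), ('Python', 'great'), ('I', 'love')]
pmi_pairs = {pair: pmi(*pair) for pair in word_pairs}
print('PMI Pairs:', pmi_pairs)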

In [30]:
! pip install spacy


In [31]:
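import spacy

# Requires the small English model. Install it once with
#   python -m spacy download en_core_web_sm
# otherwise spacy.load raises OSError [E050].
nlp = spacy.load("en_core_web_sm")
doc = nlp("I enjoy learning new things and exploring places. What about you?")
for token in doc:
    print(f'{token} - {token.pos_}')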


OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.