# Basic NLP Concepts
***
## Table of Contents
1. Introduction to NLP
    - Types of NLP
    - NLP Tasks
    - Popular NLP Libraries
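2. Regular Expressions
    - Common Regex Elements
    - Example Use Cases
3. Text Preprocessing
    - Text Cleaning
    - Text Normalisation
    - Tokenisation
    - Stop Word Removal
    - Stemming
    - Lemmatisation
    - Part-of-Speech (POS) Tagging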
***

In [1]:
import pandas as pd

## 1. Introduction to NLP

Natural Language Processing (NLP) is a multidisciplinary field that combines computer science, linguistics, and artificial intelligence to enable computers to interpret, process, and generate human language. NLP bridges the gap between human communication and computer understanding, allowing machines to analyse, understand, and respond to text and speech.


### Types of NLP
There are several types and approaches within NLP, each with its own focus and methodology:

- **Symbolic NLP**: Relies on hand-crafted rules and linguistic knowledge to process language. This traditional approach uses grammar rules and dictionaries to interpret text.

- **Statistical NLP**: Uses statistical methods and machine learning to analyze large volumes of language data, identifying patterns and making predictions based on probabilities.

- **Neural NLP**: Employs deep learning and neural networks to model and understand language, enabling advanced applications like language generation and large language models.


### NLP Tasks
NLP can be divided into several overlapping subfields and tasks, including:

- **Natural Language Understanding (NLU)**: Focuses on interpreting and extracting meaning from human language (semantics and syntax).
- **Natural Language Generation (NLG)**: Generates human-like text or speech from structured data or other input.
- **Speech Recognition**: Converts spoken language into text.
- **Text Classification**: Assigns categories or labels to text data (e.g., spam detection, topic classification).
- **Named Entity Recognition (NER)**: Identifies and classifies entities such as names, locations, and organisations in text.
- **Sentiment Analysis**: Determines the emotional tone behind a body of text.
- **Machine Translation**: Automatically translates text or speech from one language to another.
- **Part-of-Speech Tagging**: Labels words with their grammatical roles (noun, verb, etc.).

### Popular NLP Libraries
The main Python libraries used in NLP are:
- **NLTK (Natural Language Toolkit)**: One of the oldest and most comprehensive libraries for NLP tasks such as tokenisation, stemming, tagging, parsing, and semantic reasoning. Widely used for teaching, research, and foundational NLP projects, though it may be slower for large-scale production.
- **spaCy**: Designed for fast, efficient, and production-ready NLP applications. It offers pre-trained models for multiple languages, supports tokenisation, part-of-speech tagging, named entity recognition, and dependency parsing, and integrates well with deep learning frameworks.
- **Gensim**: Specialises in topic modelling (e.g., LDA), document similarity analysis, and word embeddings (e.g., Word2Vec, FastText). It's optimised for processing large text corpora efficiently and is popular for unsupervised NLP tasks.
- **TextBlob**: Built on top of NLTK and Pattern, TextBlob provides a simple API for common NLP tasks like sentiment analysis, part-of-speech tagging, and noun phrase extraction. It's user-friendly and great for beginners or rapid prototyping.
- **Pattern**: Offers tools for text processing, web mining, machine learning, and network analysis. Known for its ease of use, it is suitable for tasks like sentiment analysis, part-of-speech tagging, and web scraping.
- **PyNLPl (Pineapple)**: A versatile library for both basic and advanced NLP tasks, including n-gram analysis, frequency lists, and linguistic annotation. It supports various file formats and is useful for more specialised NLP workflows.
- **Stanza**: Developed by Stanford, Stanza provides deep learning-based models for tasks such as named entity recognition and part-of-speech tagging, supporting over 70 languages and integrating well with other libraries (e.g., spaCy, Hugging Face Transformers).
- **Polyglot**: Known for its extensive multilingual support. Polyglot offers tokenisation, sentiment analysis, named entity recognition, and word embeddings across 130+ languages.
- **CoreNLP**: A robust Java-based library from Stanford, accessible in Python via wrappers, used for tasks such as named entity recognition and coreference resolution. Often integrated with other Python NLP libraries.
- **Hugging Face Transformers**: While primarily aimed at large language models, this library is widely used in modern NLP for tasks such as text classification, question answering, and text generation with transformer-based models.

## 2. Regular Expressions
Regular expressions (also called 'regex' or 'regexp') are patterns used to match, search, and manipulate text based on specific sequences of characters. They are extremely useful for extracting information, validating input, finding specific text, and replacing or splitting strings in tasks such as data cleaning, web scraping, and natural language processing.

A regular expression is essentially a sequence of characters that defines a search pattern. This pattern can be made up of literal characters or special symbols (metacharacters) that represent sets, repetitions, or positions in the text.

For example:
- `/cat/` matches the exact sequence 'cat'.
- `/c.t/` matches 'cat', 'cot', 'cut', etc. (the dot `.` matches any single character).
- `/\d+/` matches one or more digits (`\d` means any digit and `+` means 'one or more').
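
The three patterns above can be tried with Python's built-in `re` module. This is a minimal sketch: the sample sentence is made up for illustration, and the surrounding slashes are dropped because Python patterns are written as plain (raw) strings.

```python
import re

# Made-up sample sentence for illustration
text = 'The cat sat on a cot with 42 cups and 7 hats.'

# Literal match: the exact sequence 'cat'
literal = re.findall(r'cat', text)

# Dot: 'c', then any single character, then 't'
dotted = re.findall(r'c.t', text)

# One or more consecutive digits
numbers = re.findall(r'\d+', text)

print(literal)   # ['cat']
print(dotted)    # ['cat', 'cot']
print(numbers)   # ['42', '7']
```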


### Common Regex Elements
- **Literal Characters**: Match themselves (e.g., `a`, `1`, `@`).
- **Metacharacters**:
    - `.` (dot): Any character except newline.
    - `\d`: Any digit (0-9).
    - `\w`: Any word character (letters, digits, underscore).
    - `\s`: Any whitespace character (space, tab, newline).
    - `*`: Zero or more of the preceding element.
    - `+`: One or more of the preceding element.
    - `?`: Zero or one of the preceding element.
    - `[]`: A set or range of characters (e.g., `[a-z]`).
    - `^`: Start of a string.
    - `$`: End of a string.
    - `|`: OR operator (e.g., `cat|dog` matches 'cat' or 'dog').
    - `()`: Grouping for subpatterns.
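
A few of these elements in action, as a small sketch with made-up input strings. Note that `re.findall` returns the content of the group when the pattern contains exactly one capturing group.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string
starts_with_hello = bool(re.search(r'^hello', 'hello world'))
ends_with_hello = bool(re.search(r'hello$', 'hello world'))

# ? makes the preceding element optional, so both spellings match
spellings = re.findall(r'colou?r', 'color and colour')

# | is alternation and () groups the alternatives; s? allows an optional plural
pets = re.findall(r'(cat|dog)s?', 'cats and dogs and a bird')

# \w+ matches runs of word characters (letters, digits, underscore)
words = re.findall(r'\w+', 'a_b c-d')

print(starts_with_hello, ends_with_hello)  # True False
print(spellings)                           # ['color', 'colour']
print(pets)                                # ['cat', 'dog']
print(words)                               # ['a_b', 'c', 'd']
```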

### Example Use Cases
- **Remove punctuation**: `r'[^\w\s]'` matches any character that is not a word character or whitespace.
- **Find email addresses**: `r'\b[\w.-]+@[\w.-]+\.\w+\b'`
- **Validate phone numbers**: Patterns like `r'^\d{3}-\d{3}-\d{4}$'`
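
The three use cases can be sketched with `re.sub`, `re.findall`, and `re.search`. The sample strings and addresses below are invented for illustration, and the phone pattern only covers the `ddd-ddd-dddd` layout shown above.

```python
import re

# Remove punctuation: drop everything that is not a word character or whitespace
no_punct = re.sub(r'[^\w\s]', '', 'Hello, world! (test)')

# Find email addresses in free text (made-up addresses)
text = 'Contact us: info@example.com, or sales.team@example.co.uk!'
emails = re.findall(r'\b[\w.-]+@[\w.-]+\.\w+\b', text)

# Validate phone numbers: the whole string must match the pattern
phone_pattern = r'^\d{3}-\d{3}-\d{4}$'
phone_ok = bool(re.search(phone_pattern, '555-123-4567'))
phone_bad = bool(re.search(phone_pattern, '5551234567'))

print(no_punct)             # Hello world test
print(emails)               # ['info@example.com', 'sales.team@example.co.uk']
print(phone_ok, phone_bad)  # True False
```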

## 3. Text Preprocessing
Text preprocessing is the foundation of any NLP project. It involves cleaning and transforming raw text into a structured format suitable for analysis.

A typical text preprocessing pipeline is:
1. **Text Cleaning**: Removes unnecessary characters, HTML, emojis, etc.
2. **Text Normalisation**: Lowercases, expands contractions, removes punctuation, standardises spelling.
3. **Tokenisation**: Splits text into words, sentences, or subwords.
4. **Stop Word Removal**: Eliminates very common words that carry little information.
5. **Stemming / Lemmatisation**: Reduces words to their root form or dictionary form.
6. **Part-of-Speech (POS) Tagging**: Assigns grammatical tags (noun, verb, adjective, etc.) to each token.

Dataset retrieved from [Tweets Dataset](https://www.kaggle.com/datasets/mmmarchetti/tweets-dataset?select=tweets.csv)

In [2]:
df = pd.read_csv('_datasets/tweets.csv')
df = df.drop(columns=['author', 'country', 'date_time', 'id', 'language',
             'latitude', 'longitude', 'number_of_likes', 'number_of_shares'])
df.head()

Unnamed: 0,content
0,Is history repeating itself...?#DONTNORMALIZEH...
1,@barackobama Thank you for your incredible gra...
2,Life goals. https://t.co/XIn1qKMKQl
3,Me right now 🙏🏻 https://t.co/gW55C1wrwd
4,SISTERS ARE DOIN' IT FOR THEMSELVES! 🙌🏻💪🏻❤️ ht...


### Text Cleaning
Text cleaning is a crucial first step in preparing raw text data for NLP or machine learning. It removes noise and irrelevant elements, making the data more structured and suitable for analysis.

#### Removing HTML Tags
HTML tags can appear in text scraped from web pages. Use regex to remove anything between `<` and `>`.

In [3]:
df['clean_text'] = df['content'].str.replace(r'<.*?>', '', regex=True)
df.head()

Unnamed: 0,content,clean_text
0,Is history repeating itself...?#DONTNORMALIZEH...,Is history repeating itself...?#DONTNORMALIZEH...
1,@barackobama Thank you for your incredible gra...,@barackobama Thank you for your incredible gra...
2,Life goals. https://t.co/XIn1qKMKQl,Life goals. https://t.co/XIn1qKMKQl
3,Me right now 🙏🏻 https://t.co/gW55C1wrwd,Me right now 🙏🏻 https://t.co/gW55C1wrwd
4,SISTERS ARE DOIN' IT FOR THEMSELVES! 🙌🏻💪🏻❤️ ht...,SISTERS ARE DOIN' IT FOR THEMSELVES! 🙌🏻💪🏻❤️ ht...


#### Removing URLs
URLs are often irrelevant for text analysis.

In [4]:
df['clean_text'] = df['clean_text'].str.replace(
    r'http\S+|www\.\S+', '', regex=True)
df.head()

Unnamed: 0,content,clean_text
0,Is history repeating itself...?#DONTNORMALIZEH...,Is history repeating itself...?#DONTNORMALIZEH...
1,@barackobama Thank you for your incredible gra...,@barackobama Thank you for your incredible gra...
2,Life goals. https://t.co/XIn1qKMKQl,Life goals.
3,Me right now 🙏🏻 https://t.co/gW55C1wrwd,Me right now 🙏🏻
4,SISTERS ARE DOIN' IT FOR THEMSELVES! 🙌🏻💪🏻❤️ ht...,SISTERS ARE DOIN' IT FOR THEMSELVES! 🙌🏻💪🏻❤️


#### Removing Emojis and Non-ASCII Characters
Emojis and non-ASCII symbols can be noise for many NLP tasks.

In [5]:
df['clean_text'] = df['clean_text'].str.replace(
    r'[^\x00-\x7F]+', '', regex=True)
df.head()

Unnamed: 0,content,clean_text
0,Is history repeating itself...?#DONTNORMALIZEH...,Is history repeating itself...?#DONTNORMALIZEH...
1,@barackobama Thank you for your incredible gra...,@barackobama Thank you for your incredible gra...
2,Life goals. https://t.co/XIn1qKMKQl,Life goals.
3,Me right now 🙏🏻 https://t.co/gW55C1wrwd,Me right now
4,SISTERS ARE DOIN' IT FOR THEMSELVES! 🙌🏻💪🏻❤️ ht...,SISTERS ARE DOIN' IT FOR THEMSELVES!
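
One side effect worth knowing: the pattern above also deletes accented letters, so 'Café' becomes 'Caf'. If that matters for the data at hand, one possible alternative is to decompose characters with the standard `unicodedata` module before dropping non-ASCII bytes. The helper name `to_ascii` is hypothetical and this is only a sketch, not part of the pipeline used in this notebook.

```python
import re
import unicodedata


def to_ascii(text):
    # NFKD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    # Dropping non-ASCII bytes removes the marks and emojis but keeps base letters
    return decomposed.encode('ascii', 'ignore').decode('ascii')


sample = 'Café déjà vu 🙏'
regex_result = re.sub(r'[^\x00-\x7F]+', '', sample)
ascii_result = to_ascii(sample)

print(repr(regex_result))  # 'Caf dj vu '
print(repr(ascii_result))  # 'Cafe deja vu '
```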


#### Removing Mentions (For Twitter data)
Removing mentions (usernames starting with `@`) is a common step in cleaning Twitter data, as these are often not useful for NLP tasks.

In [6]:
df['clean_text'] = df['clean_text'].str.replace(r'@\w+', '', regex=True)
df.head()

Unnamed: 0,content,clean_text
0,Is history repeating itself...?#DONTNORMALIZEH...,Is history repeating itself...?#DONTNORMALIZEH...
1,@barackobama Thank you for your incredible gra...,Thank you for your incredible grace in leader...
2,Life goals. https://t.co/XIn1qKMKQl,Life goals.
3,Me right now 🙏🏻 https://t.co/gW55C1wrwd,Me right now
4,SISTERS ARE DOIN' IT FOR THEMSELVES! 🙌🏻💪🏻❤️ ht...,SISTERS ARE DOIN' IT FOR THEMSELVES!


#### Removing Hashtags (For Twitter data, Optional)
Whether we should remove hashtags (words starting with `#`) from Twitter data depends on our analysis goals:

- If hashtags are not useful for our NLP task (e.g., general sentiment analysis, topic modeling where hashtags add noise or redundancy), we can remove them just like mentions and URLs.

- If hashtags carry important information (e.g., event detection, trend analysis, or if hashtags are meaningful keywords), we might want to keep them, or extract and analyse them separately.
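
If we want to keep hashtags for separate analysis, one option is to extract them before stripping them from the text. The sketch below uses plain Python lists and invented tweets; with pandas, `Series.str.findall(r'#\w+')` would collect the same matches per row.

```python
import re

# Invented example tweets
tweets = ['Great match today #football #win', 'No tags here']

# Collect the hashtags of each tweet first ...
hashtags = [re.findall(r'#\w+', t) for t in tweets]
# ... then remove them from the text
stripped = [re.sub(r'#\w+', '', t).strip() for t in tweets]

print(hashtags)  # [['#football', '#win'], []]
print(stripped)  # ['Great match today', 'No tags here']
```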

In [7]:
df['clean_text'] = df['clean_text'].str.replace(r'#\w+', '', regex=True)
df.head()

Unnamed: 0,content,clean_text
0,Is history repeating itself...?#DONTNORMALIZEH...,Is history repeating itself...?
1,@barackobama Thank you for your incredible gra...,Thank you for your incredible grace in leader...
2,Life goals. https://t.co/XIn1qKMKQl,Life goals.
3,Me right now 🙏🏻 https://t.co/gW55C1wrwd,Me right now
4,SISTERS ARE DOIN' IT FOR THEMSELVES! 🙌🏻💪🏻❤️ ht...,SISTERS ARE DOIN' IT FOR THEMSELVES!


#### Removing Extra Whitespace
Normalise whitespace to a single space and trim leading/trailing spaces.

In [8]:
df['clean_text'] = df['clean_text'].str.replace(
    r'\s+', ' ', regex=True).str.strip()
df.head()

Unnamed: 0,content,clean_text
0,Is history repeating itself...?#DONTNORMALIZEH...,Is history repeating itself...?
1,@barackobama Thank you for your incredible gra...,Thank you for your incredible grace in leaders...
2,Life goals. https://t.co/XIn1qKMKQl,Life goals.
3,Me right now 🙏🏻 https://t.co/gW55C1wrwd,Me right now
4,SISTERS ARE DOIN' IT FOR THEMSELVES! 🙌🏻💪🏻❤️ ht...,SISTERS ARE DOIN' IT FOR THEMSELVES!


#### Removing Numbers (Optional)
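If numbers are not relevant to our analysis, we can remove them. The code below replaces each run of digits with a space, so some extra spaces may remain in the result.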

In [9]:
df['clean_text'] = df['clean_text'].str.replace(
    r'\d+', ' ', regex=True).str.strip()
df.iloc[5:11]

Unnamed: 0,content,clean_text
5,happy 96th gma #fourmoreyears! 🎈 @ LACMA Los A...,happy th gma ! @ LACMA Los Angeles County Mus...
6,"Kyoto, Japan \r\n1. 5. 17. https://t.co/o28M0v...","Kyoto, Japan . . ."
7,🇯🇵 @ Sanrio Puroland https://t.co/eXVev5UMBx,@ Sanrio Puroland
8,2017 resolution: to embody authenticity!,resolution: to embody authenticity!
9,sisters. https://t.co/5ZE21x2aNk,sisters.
10,Happy Holidays! Sending love and light to ever...,Happy Holidays! Sending love and light to ever...


#### Removing Line Breaks (Optional)
If the data contains line breaks (`\r` or `\n`), replace them with spaces.

In [10]:
df['clean_text'] = df['clean_text'].str.replace(r'[\r\n]+', ' ', regex=True)
df.head()

Unnamed: 0,content,clean_text
0,Is history repeating itself...?#DONTNORMALIZEH...,Is history repeating itself...?
1,@barackobama Thank you for your incredible gra...,Thank you for your incredible grace in leaders...
2,Life goals. https://t.co/XIn1qKMKQl,Life goals.
3,Me right now 🙏🏻 https://t.co/gW55C1wrwd,Me right now
4,SISTERS ARE DOIN' IT FOR THEMSELVES! 🙌🏻💪🏻❤️ ht...,SISTERS ARE DOIN' IT FOR THEMSELVES!


### Text Normalisation
Text normalisation is the process of converting text data into a consistent, standardised, and canonical form, making it easier for computers to process and analyze.

#### Spelling Standardisation
Misspelled or inconsistently spelled words (e.g., 'colour' vs 'color', 'real-time' vs 'realtime', or 'hellooo' vs 'hello') can increase vocabulary size and reduce model accuracy. Spelling standardisation (also called spelling correction) helps models treat all variants as the same word, improving downstream analysis and learning. Common options include:

- `pyspellchecker` is a context-free spell checker. It chooses the most common word within a certain edit distance, regardless of sentence meaning.
- `TextBlob` uses a probabilistic model for spelling correction.
- `Transformers` models provide more advanced, context-aware corrections.

Popular Transformer Spell Checker Models & Toolkits:

| Model/Toolkit                               | Description                                                                                 |
|----------------------------------------------|---------------------------------------------------------------------------------------------|
| oliverguhr/spelling-correction-english-base | Hugging Face model for English spelling and punctuation correction.                         |
| ai-forever/T5-large-spell                   | T5-based model for standard English spelling correction.                                    |
| NeuSpell                                    | Toolkit with several neural spell checkers, including BERT-based models.                   |
| xfspell                                     | Transformer-based English spell checker trained on millions of parallel sentences.          |
| Custom T5/BERT models                       | Many researchers fine-tune T5 or BERT for spelling correction as a text-to-text task.       |

In [11]:
# Using 'pyspellchecker' library
from spellchecker import SpellChecker

# Initialise SpellChecker class
spell = SpellChecker()

# Example list of words
words = ['let', 'us', 'wlak', 'on', 'the', 'grous']

# Identify misspelled words
misspelled = spell.unknown(words)
for word in misspelled:
    # Most probable correction
    print(f'Correction for "{word}": {spell.correction(word)}')
    # All possible candidates
    print(f'Candidates for "{word}": {spell.candidates(word)}')

Correction for "grous": group
Candidates for "grous": {'group', 'grus', 'grouse', 'grows', 'gros', 'grots', 'groups', 'gross', 'grouts', 'grout'}
Correction for "wlak": walk
Candidates for "wlak": {'weak', 'flak', 'walk'}
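
To see why a context-free checker picks 'walk' and 'group', here is a minimal sketch of the general idea: generate every string one edit away from the input and keep the most frequent known word. It is not the actual `pyspellchecker` implementation, and the word counts in `WORD_FREQ` are invented for illustration.

```python
from collections import Counter

# Toy word-frequency table (hypothetical counts, for illustration only)
WORD_FREQ = Counter({'walk': 50, 'weak': 30, 'flak': 2, 'group': 80, 'grouse': 1})
LETTERS = 'abcdefghijklmnopqrstuvwxyz'


def edits1(word):
    """Return all strings one delete, transpose, replace, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    # Known words are returned unchanged
    if word in WORD_FREQ:
        return word
    candidates = [w for w in edits1(word) if w in WORD_FREQ]
    # Pick the most frequent known candidate; fall back to the input
    return max(candidates, key=WORD_FREQ.get) if candidates else word


print(correct('wlak'))   # walk
print(correct('grous'))  # group
```

Because only word frequency is consulted, the same correction is returned whatever the surrounding sentence says.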


In [12]:
# Using 'TextBlob' library
from textblob import TextBlob

text = 'I havv goood speling!'
corrected_text = str(TextBlob(text).correct())
print(corrected_text)

I have good spelling!


In [13]:
# Using Transformers (oliverguhr/spelling-correction-english-base)
from transformers import pipeline

corrector = pipeline('text2text-generation',
                     model='oliverguhr/spelling-correction-english-base')
sentence = 'She did no go to teh marcket.'
result = corrector(sentence)
print('Corrected:', result[0]['generated_text'])

Device set to use mps:0


Corrected: She did not go to the market.


Create a function that applies the transformer-based corrector to each value of a DataFrame column.

In [14]:
def correct_spelling(text):
    # Handle empty or missing text
    if not isinstance(text, str) or not text.strip():
        return text
    result = corrector(text)
    return result[0]['generated_text']

In [15]:
# ! Execution time is very long ...
# df['clean_text'] = df['clean_text'].apply(correct_spelling)
# df.head()
df_transformers_top10 = df['clean_text'].iloc[:10].apply(correct_spelling)
df_transformers_top10

0                         Is history repeating itself?
1    Thank you for your incredible grace in leaders...
2                                          Life goals.
3                                        Me right now.
4                 SISTERS ARE DOIN' IT FOR THEMSELVES!
5    Happy  the gma ! at LACMA. Los Angeles County ...
6                                 Kyoto, Japan    .  .
7                                  At Sanrio Puroland.
8                   resolution to embody authenticity!
9                                             Sisters.
Name: clean_text, dtype: object

#### Lowercasing
Convert all text to lowercase to ensure uniformity.

In [16]:
df['clean_text'] = df['clean_text'].str.lower()
df.head()

Unnamed: 0,content,clean_text
0,Is history repeating itself...?#DONTNORMALIZEH...,is history repeating itself...?
1,@barackobama Thank you for your incredible gra...,thank you for your incredible grace in leaders...
2,Life goals. https://t.co/XIn1qKMKQl,life goals.
3,Me right now 🙏🏻 https://t.co/gW55C1wrwd,me right now
4,SISTERS ARE DOIN' IT FOR THEMSELVES! 🙌🏻💪🏻❤️ ht...,sisters are doin' it for themselves!


#### Expanding Contractions
Replace contractions with their expanded forms for clarity and consistency. For the code below:
- `contraction` is the string to be replaced (e.g., *doin'*), and `expanded` is what it should become (e.g., *doing*).
- `re.escape(contraction)` ensures that any special characters in the contraction string (such as apostrophes) are treated as literal characters, not as regex operators.
- `re.compile(..., re.IGNORECASE)` creates a regex pattern object that will match the contraction in a case-insensitive way (so it matches *Doin'*, *DOIN'*, etc.).

In [17]:
import re


def expand_contractions_only(text):
    contractions = {
        "can't": "cannot", "won't": "will not", "it's": "it is",
        "I'm": "I am", "you're": "you are", "they're": "they are",
        "he's": "he is", "she's": "she is", "we're": "we are",
        "that's": "that is", "what's": "what is", "where's": "where is",
        "who's": "who is", "how's": "how is", "let's": "let us",
        "doin'": "doing", "goin'": "going", "tryin'": "trying"
    }
    for contraction, expanded in contractions.items():
        # Pattern matches contraction anywhere, case-insensitive
        pattern = re.compile(re.escape(contraction), re.IGNORECASE)
        text = pattern.sub(expanded, text)
    return text


# Apply only contraction expansion to the 'clean_text' column
df['clean_text'] = df['clean_text'].apply(expand_contractions_only)
df.head()

Unnamed: 0,content,clean_text
0,Is history repeating itself...?#DONTNORMALIZEH...,is history repeating itself...?
1,@barackobama Thank you for your incredible gra...,thank you for your incredible grace in leaders...
2,Life goals. https://t.co/XIn1qKMKQl,life goals.
3,Me right now 🙏🏻 https://t.co/gW55C1wrwd,me right now
4,SISTERS ARE DOIN' IT FOR THEMSELVES! 🙌🏻💪🏻❤️ ht...,sisters are doing it for themselves!
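
Because the pattern above matches a contraction anywhere, it can also fire inside longer words (for example, the `it's` inside `split's`). A possible variant, sketched below under the assumption that a short lowercase-keyed mapping is enough, builds a single alternation pattern with a word boundary on the left and a "not followed by a word character" check on the right. The names `CONTRACTIONS`, `PATTERN`, and `expand` are hypothetical.

```python
import re

# Small illustrative mapping with lowercase keys (not an exhaustive list)
CONTRACTIONS = {"can't": "cannot", "it's": "it is", "i'm": "I am", "doin'": "doing"}

# \b on the left and (?!\w) on the right stop matches inside longer words;
# (?!\w) is used instead of \b because some keys end with an apostrophe
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")(?!\w)",
    re.IGNORECASE,
)


def expand(text):
    # Look up the lowercase form of whatever was matched
    return PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand("It's what I'm doin'"))  # it is what I am doing
print(expand("the split's clean"))    # the split's clean
```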


#### Removing Punctuation and Special Characters
Strip out punctuation, symbols, and special characters to reduce noise. Using `string.punctuation` makes the task easy and efficient.

In [18]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

`'[{}]'.format(string.punctuation)` inserts all these punctuation characters inside square brackets to form a regex character class, resulting in a string like `` '[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]' ``. Setting `regex=True` tells pandas to interpret the pattern as a regular expression. Because `string.punctuation` itself contains regex-special characters (such as `[`, `]`, `\`, and `^`), it's safer to escape them with `re.escape` before building the pattern.

In [19]:
import re
df['clean_text'] = df['clean_text'].str.replace(
    '[{}]'.format(re.escape(string.punctuation)), '', regex=True)
df.iloc[5:11]

Unnamed: 0,content,clean_text
5,happy 96th gma #fourmoreyears! 🎈 @ LACMA Los A...,happy th gma lacma los angeles county museu...
6,"Kyoto, Japan \r\n1. 5. 17. https://t.co/o28M0v...",kyoto japan
7,🇯🇵 @ Sanrio Puroland https://t.co/eXVev5UMBx,sanrio puroland
8,2017 resolution: to embody authenticity!,resolution to embody authenticity
9,sisters. https://t.co/5ZE21x2aNk,sisters
10,Happy Holidays! Sending love and light to ever...,happy holidays sending love and light to every...


Alternatively, without using regex:

In [20]:
df['clean_text'] = df['clean_text'].str.translate(
    str.maketrans('', '', string.punctuation))
df.iloc[5:11]

Unnamed: 0,content,clean_text
5,happy 96th gma #fourmoreyears! 🎈 @ LACMA Los A...,happy th gma lacma los angeles county museu...
6,"Kyoto, Japan \r\n1. 5. 17. https://t.co/o28M0v...",kyoto japan
7,🇯🇵 @ Sanrio Puroland https://t.co/eXVev5UMBx,sanrio puroland
8,2017 resolution: to embody authenticity!,resolution to embody authenticity
9,sisters. https://t.co/5ZE21x2aNk,sisters
10,Happy Holidays! Sending love and light to ever...,happy holidays sending love and light to every...


### Tokenisation
Tokenisation is the process of splitting text into smaller units called tokens. In NLP, tokens are typically words, subwords, or sentences. It is a crucial first step in most NLP pipelines for further analysis such as part-of-speech tagging, lemmatisation, and sentiment analysis.

Tokenisation is useful to:
- **Prepare text for analysis**: Models and algorithms work with tokens, not raw text.
- **Enable feature extraction**: Tokens are used to create frequency lists, embeddings, etc.
- **Handle language structure**: Helps separate words, punctuation, and sentences for more accurate processing.
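
As a small illustration of the feature-extraction point, a frequency list can be built from tokens with `collections.Counter`. The sentence is made up, and a plain whitespace split stands in for a real tokeniser here.

```python
from collections import Counter

text = 'the cat sat on the mat and the dog sat too'

# Simplest possible tokeniser: split on whitespace
tokens = text.split()

# Count how often each token occurs
freq = Counter(tokens)

print(freq.most_common(2))  # [('the', 3), ('sat', 2)]
```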

#### Word Tokenisation
Splitting text into individual words or word-like units. Most common for English and languages with clear word boundaries.

In [32]:
import re
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/tsu76i/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [33]:
text = 'The quick brown fox jumps over the lazy dog.'

# Using Python's `split()` method
split_tokens = text.split(' ')
print(f'With .split(): {split_tokens}')

# Using regex
regex_tokens = re.split(r'\W+', text)  # Split on non-word characters
print(f'With regex: {regex_tokens}')

# Using NLTK library
nltk_tokens = word_tokenize(text)
print(f'With NLTK: {nltk_tokens}')

With .split(): ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']
With regex: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '']
With NLTK: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']


#### Sentence Tokenisation
Splitting text into sentences, often based on punctuation and language rules. This method is useful for summarisation, translation, or sentence-level analysis.

In [22]:
from nltk.tokenize import sent_tokenize

text = 'The quick brown fox jumps over the lazy dog. It was a sunny day.'
sentences = sent_tokenize(text)
print(sentences)

['The quick brown fox jumps over the lazy dog.', 'It was a sunny day.']


#### N-gram Tokenisation
Splitting text into overlapping sequences of N words (bi-grams, tri-grams, etc.). N-grams are a simple way to extract features for language models or text classification.

In [23]:
from nltk.util import ngrams
text = 'The quick brown fox jumps'
tokens = word_tokenize(text)
bigrams = list(ngrams(tokens, 2))
print(bigrams)

[('The', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]
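
The sliding-window idea behind n-grams can also be sketched without NLTK by zipping staggered copies of the token list. The helper name `make_ngrams` is hypothetical, and a whitespace split is assumed to be an adequate tokeniser for this sentence.

```python
def make_ngrams(tokens, n):
    # Zip n staggered views of the token list to form sliding windows
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = 'The quick brown fox jumps'.split()
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(trigrams)
# [('The', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps')]
```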


#### Character Tokenisation
Splitting text into individual characters. Useful for languages without clear word boundaries, or for spelling correction.

In [24]:
text = 'Sample text'
char_tokens = list(text)
print(char_tokens)

['S', 'a', 'm', 'p', 'l', 'e', ' ', 't', 'e', 'x', 't']


#### Subword Tokenisation
Splitting words into smaller units (subwords), useful for handling rare words and out-of-vocabulary terms. Modern NLP models (BERT, GPT, etc.) use subword tokenisers such as Byte-Pair Encoding (BPE), WordPiece, or SentencePiece.

In [25]:
from transformers import AutoTokenizer
tokeniser = AutoTokenizer.from_pretrained('bert-base-uncased')
text = 'Chatbots are amazing!'
tokens = tokeniser.tokenize(text)
print(tokens)

['chat', '##bots', 'are', 'amazing', '!']
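
To give a feel for how a subword vocabulary is learned, here is a heavily simplified sketch of the BPE merge step: repeatedly find the most frequent pair of adjacent symbols and fuse it into one symbol. Real implementations weight pairs by word frequency and learn thousands of merges; the three-word corpus and the helper names are assumptions made for illustration.

```python
from collections import Counter


def most_frequent_pair(words):
    # Count adjacent symbol pairs across all words
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged symbol
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged


# Start from single characters and apply two merge steps
words = [list('low'), list('lower'), list('lowest')]
for _ in range(2):
    words = merge_pair(words, most_frequent_pair(words))

print(words)  # [['low'], ['low', 'e', 'r'], ['low', 'e', 's', 't']]
```

After two merges the shared prefix 'low' has become a single subword, while the rarer endings are still split into characters.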


### Stop Word Removal
Stop words are common words (like 'the', 'and', 'is') that usually carry little meaningful information for NLP tasks. Removing them helps reduce noise and the dimensionality of the data.

In [26]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tsu76i/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [27]:
# List all the stop words in English
', '.join(stopwords.words('english'))

"a, about, above, after, again, against, ain, all, am, an, and, any, are, aren, aren't, as, at, be, because, been, before, being, below, between, both, but, by, can, couldn, couldn't, d, did, didn, didn't, do, does, doesn, doesn't, doing, don, don't, down, during, each, few, for, from, further, had, hadn, hadn't, has, hasn, hasn't, have, haven, haven't, having, he, he'd, he'll, her, here, hers, herself, he's, him, himself, his, how, i, i'd, if, i'll, i'm, in, into, is, isn, isn't, it, it'd, it'll, it's, its, itself, i've, just, ll, m, ma, me, mightn, mightn't, more, most, mustn, mustn't, my, myself, needn, needn't, no, nor, not, now, o, of, off, on, once, only, or, other, our, ours, ourselves, out, over, own, re, s, same, shan, shan't, she, she'd, she'll, she's, should, shouldn, shouldn't, should've, so, some, such, t, than, that, that'll, the, their, theirs, them, themselves, then, there, these, they, they'd, they'll, they're, they've, this, those, through, to, too, under, until, up, 

In [28]:
text = 'This is an example sentence demonstrating stop word removal.'
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in text.split() if word.lower() not in stop_words]
print(filtered_words)

['example', 'sentence', 'demonstrating', 'stop', 'word', 'removal.']
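
Note that `'removal.'` keeps its full stop because `text.split()` only splits on whitespace. One way around this is to tokenise with a regex before filtering, as sketched below. The tiny `STOP_WORDS` set is hand-picked for illustration and is much shorter than NLTK's list.

```python
import re

# Tiny hand-picked stop-word set for illustration; NLTK's list is far longer
STOP_WORDS = {'this', 'is', 'an', 'a', 'the', 'and'}

text = 'This is an example sentence demonstrating stop word removal.'

# Lowercase, then keep only runs of word characters (drops punctuation)
tokens = re.findall(r'\w+', text.lower())
filtered = [t for t in tokens if t not in STOP_WORDS]

print(filtered)
# ['example', 'sentence', 'demonstrating', 'stop', 'word', 'removal']
```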


### Stemming
Stemming reduces words to their root form by chopping off suffixes. For example, 'running' and 'runs' are reduced to 'run', but 'ran' remains 'ran' because stemming is rule-based and generally does not handle irregular verbs. Moreover, stemming does not always produce a real word (e.g., 'easily' becomes 'easili').

In [29]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['running', 'runner', 'runs', 'ran', 'easily', 'fairly', 'accordingly']
stemmed_words = [stemmer.stem(word) for word in words]
print(f'Before stemming: {words}')
print(f'After stemming: {stemmed_words}')

Before stemming: ['running', 'runner', 'runs', 'ran', 'easily', 'fairly', 'accordingly']
After stemming: ['run', 'runner', 'run', 'ran', 'easili', 'fairli', 'accordingli']
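
The rule-based nature of stemming can be illustrated with a toy suffix stripper. This is not the Porter algorithm, which applies many ordered rules with extra conditions; `toy_stem` and its three suffixes are assumptions chosen to keep the sketch short.

```python
def toy_stem(word):
    # Strip the first matching suffix, keeping at least three characters
    for suffix in ('ing', 'ly', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. 'runn' -> 'run'
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    # No rule applies, so irregular forms such as 'ran' pass through unchanged
    return word


stems = [toy_stem(w) for w in ['running', 'runs', 'ran', 'easily']]
print(stems)  # ['run', 'run', 'ran', 'easi']
```

No suffix rule can map 'ran' to 'run', and 'easi' shows how stripping can leave a non-word.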


### Lemmatisation
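Lemmatisation reduces words to their base or dictionary form (lemma), taking the part of speech into account. Unlike stemming, lemmatisation returns real dictionary words. NLTK's `WordNetLemmatizer` treats every word as a noun unless a `pos` argument is given, which is why the example below passes `pos='v'`.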

In [30]:
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/tsu76i/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [31]:
lemmatiser = WordNetLemmatizer()
words = ['running', 'runner', 'runs', 'ran', 'easily', 'fairly', 'accordingly']
lemmatised_words = [lemmatiser.lemmatize(word, pos='v') for word in words] # v = verbs
print(f'Before lemmatisation: {words}')
print(f'After lemmatisation: {lemmatised_words}')

Before lemmatisation: ['running', 'runner', 'runs', 'ran', 'easily', 'fairly', 'accordingly']
After lemmatisation: ['run', 'runner', 'run', 'run', 'easily', 'fairly', 'accordingly']


### Part-of-Speech (POS) Tagging
Part-of-Speech (POS) Tagging, also known as grammatical tagging, is the process of assigning each word in a text a label that indicates its grammatical role, such as noun, verb, adjective, adverb, etc. This process provides essential information about sentence structure and the relationships between words.

Importance of POS tagging:
- **Foundation of NLP Tasks**: POS tags are used in higher-level NLP applications such as named entity recognition, sentiment analysis, and machine translation.
- **Disambiguation**: Many words can have multiple grammatical roles. For instance, 'book' can be a noun ('*hand me that book*') or a verb ('*book a flight*'). POS tagging helps resolve such ambiguities by considering context.
- **Improving Text Understanding**: Knowing the part of speech helps models understand sentence structure and meaning, which is crucial for accurate language processing.

Techniques:
- **Rule-Based**: Early systems relied on hand-crafted rules.
- **Stochastic/Statistical**: Modern systems often use probabilistic models like Hidden Markov Models (HMMs) or neural networks, which learn from large annotated corpora.
- **Hybrid**: Some systems combine rules and statistical methods.


| POS Tag | Meaning                                   |
|---------|-------------------------------------------|
| CC      | Coordinating conjunction                  |
| CD      | Cardinal number                           |
| DT      | Determiner                                |
| EX      | Existential there                         |
| FW      | Foreign word                              |
| IN      | Preposition or subordinating conjunction  |
| JJ      | Adjective                                 |
| JJR     | Adjective, comparative                    |
| JJS     | Adjective, superlative                    |
| LS      | List item marker                          |
| MD      | Modal                                     |
| NN      | Noun, singular or mass                    |
| NNS     | Noun, plural                              |
| NNP     | Proper noun, singular                     |
| NNPS    | Proper noun, plural                       |
| PDT     | Predeterminer                             |
| POS     | Possessive ending                         |
| PRP     | Personal pronoun                          |
| PRP$    | Possessive pronoun                        |
| RB      | Adverb                                    |
| RBR     | Adverb, comparative                       |
| RBS     | Adverb, superlative                       |
| RP      | Particle                                  |
| SYM     | Symbol                                    |
| TO      | to                                        |
| UH      | Interjection                              |
| VB      | Verb, base form                           |
| VBD     | Verb, past tense                          |
| VBG     | Verb, gerund or present participle        |
| VBN     | Verb, past participle                     |
| VBP     | Verb, non-3rd person singular present     |
| VBZ     | Verb, 3rd person singular present         |
| WDT     | Wh-determiner                             |
| WP      | Wh-pronoun                                |
| WP$     | Possessive wh-pronoun                     |
| WRB     | Wh-adverb                                 |

In [35]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/tsu76i/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [None]:
text = 'The quick brown fox jumps over the lazy dog.'
tokens = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(tokens)

print(f'Tokens: {tokens}')
print(f'POS Tags: {pos_tags}')

Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
POS Tags: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]


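| Word  | POS Tag | Meaning                                  |
|-------|---------|------------------------------------------|
| The   | DT      | Determiner                               |
| quick | JJ      | Adjective                                |
| brown | NN      | Noun, singular or mass                   |
| fox   | NN      | Noun, singular or mass                   |
| jumps | VBZ     | Verb, 3rd person singular present        |
| over  | IN      | Preposition or subordinating conjunction |
| the   | DT      | Determiner                               |
| lazy  | JJ      | Adjective                                |
| dog   | NN      | Noun, singular or mass                   |
| .     | .       | Punctuation                              |

Note that the tagger labels 'brown' as a noun (NN) in the output above, although it functions as an adjective in this sentence. Statistical taggers are not always correct.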
