# Data Processing - Tokenization

Tokenization is the process of dividing text into smaller units called tokens. In natural language processing, tokens are typically words or subwords.
In practice, a string is split into a list of tokens. The unit depends on the level of analysis: a word is a token within a sentence, and a sentence is a token within a paragraph.

# Types of Tokenization
Tokenization can be classified into several types based on how the text is segmented. Here are some types of tokenization:

Word Tokenization:
Word tokenization divides the text into individual words. Many NLP tasks use this approach, in which words are treated as the basic units of meaning.

Example:

Input: "Tokenization is an important NLP task."
Output: ["Tokenization", "is", "an", "important", "NLP", "task", "."]
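
The example above can be reproduced with a minimal sketch that uses only Python's `re` module instead of a full NLP library. The helper name `simple_word_tokenize` and the regular expression are illustrative assumptions; production tokenizers treat contractions, abbreviations, and Unicode with far more care.

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation tokens (toy rule)."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("Tokenization is an important NLP task.")
print(tokens)
```

Punctuation becomes a token of its own here, which is also how the final "." ends up as a separate item in the output listed above.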

Sentence Tokenization:
Sentence tokenization segments the text into sentences. This is useful for tasks that analyse or process each sentence individually.

Example:

Input: "Tokenization is an important NLP task. It helps break down text into smaller units."
Output: ["Tokenization is an important NLP task.", "It helps break down text into smaller units."]
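
A rough sketch of sentence splitting, assuming a sentence ends at ".", "!" or "?" followed by whitespace. The helper `simple_sent_tokenize` is hypothetical and the rule is deliberately naive.

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (toy rule)."""
    # The lookbehind keeps the punctuation attached to the sentence it ends
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sentences = simple_sent_tokenize(
    "Tokenization is an important NLP task. It helps break down text into smaller units."
)
print(sentences)
```

This rule also splits after abbreviations such as "Dr.", which is one reason trained sentence tokenizers are usually preferred over a single regular expression.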

Subword Tokenization:
Subword tokenization entails breaking down words into smaller units, which can be especially useful when dealing with morphologically rich languages or rare words.

Example:

Input: "tokenization"
Output: ["token", "ization"]
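
One simple way to obtain such a split is greedy longest-match segmentation against a fixed vocabulary. The sketch below assumes a tiny hand-made vocabulary (`toy_vocab`) and a hypothetical helper `greedy_subword_tokenize`; real subword tokenizers such as BPE or WordPiece learn their vocabulary from a corpus, so this is not their training procedure.

```python
def greedy_subword_tokenize(word, vocab):
    """Segment a word by repeatedly taking the longest vocabulary entry that matches."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it is found in the vocabulary
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

toy_vocab = {"token", "ization", "ize", "s"}  # hypothetical vocabulary
print(greedy_subword_tokenize("tokenization", toy_vocab))
```

Because unknown material falls back to single characters, any word can be segmented, which is why subword schemes cope well with rare words.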

Character Tokenization:
Character tokenization divides the text into individual characters. This is useful for character-level language modelling.

Example:

Input: "Tokenization"
Output: ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n"]
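
In Python, a string is already a sequence of characters, so character tokenization needs no library. The sketch below also maps each character to an integer id, which is how a character-level model would typically consume the tokens; the id assignment by sorted order is an arbitrary choice made for illustration.

```python
text = "Tokenization"

# Character tokenization: one token per character
chars = list(text)

# Assign an integer id to each distinct character (sorted order, arbitrary choice)
char_to_id = {ch: i for i, ch in enumerate(sorted(set(chars)))}
id_to_char = {i: ch for ch, i in char_to_id.items()}

ids = [char_to_id[ch] for ch in chars]
decoded = "".join(id_to_char[i] for i in ids)

print(chars)
print(ids)
print(decoded)
```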

# Need of Tokenization
Tokenization is a crucial step in text processing and natural language processing (NLP) for several reasons.

Effective Text Processing: Tokenization breaks raw text into small, discrete units that are easier to process and analyse.

Feature Extraction: Tokens serve as features in machine learning models, which allows text data to be represented numerically.

Language Modelling: Tokenization in NLP facilitates the creation of organized representations of language, which is useful for tasks like text generation and language modelling.

Information Retrieval: Tokenization is essential for indexing and searching in systems that store and retrieve information efficiently based on words or phrases.

Text Analysis: Tokenization is used in many NLP tasks, including sentiment analysis and named entity recognition, to determine the function and context of individual words in a sentence.

Vocabulary Management: By generating a list of distinct tokens that stand in for words in the dataset, tokenization helps manage a corpus’s vocabulary.

Task-Specific Adaptation: Tokenization can be customized to the needs of a particular NLP task, such as summarization or machine translation, so that the resulting tokens suit that application.

Preprocessing Step: Tokenization transforms raw text into a format suitable for further statistical and computational analysis.
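
Two of the points above, vocabulary management and feature extraction, can be illustrated with a small bag-of-words sketch. The documents are toy, already-tokenized lists, and `bag_of_words` is a hypothetical helper rather than a library function.

```python
from collections import Counter

# Toy, already-tokenized documents (made up for illustration)
docs = [
    ["a", "wonderful", "little", "production"],
    ["a", "wonderful", "way", "to", "spend", "time"],
]

# Vocabulary management: one index per distinct token, in sorted order
vocab = {tok: i for i, tok in enumerate(sorted({t for doc in docs for t in doc}))}

def bag_of_words(tokens, vocab):
    """Return a count vector with one slot per vocabulary entry."""
    counts = Counter(tokens)
    # dict iteration follows insertion order, which matches the index order here
    return [counts.get(tok, 0) for tok in vocab]

vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each document becomes a fixed-length numeric vector, which is the form most classical machine learning models expect.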

In [3]:
import nltk
import pandas as pd

In [7]:
# Importing the dataset excel sheet
df = pd.read_excel('Cleaned new Dataset.xlsx')
df

Unnamed: 0,cleaned_data,Sentiment,Sarcasm
0,One of the other reviewers has mentioned that ...,positive,not sarcastic
1,A wonderful little production. The filming tec...,positive,not sarcastic
2,This movie was a groundbreaking experience! Iv...,positive,sarcastic
3,I thought this was a wonderful way to spend ti...,positive,not sarcastic
4,Basically theres a family where a little boy J...,negative,sarcastic
...,...,...,...
6492,This movies idea of character development is m...,negative,sarcastic
6493,I guess they ran out of budget for a decent sc...,negative,sarcastic
6494,Who needs a plot when you have explosions ever...,negative,sarcastic
6495,Is there an award for most generic action movi...,negative,sarcastic


# Word tokenization

In [8]:
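from nltk.tokenize import word_tokenize

# word_tokenize relies on NLTK's Punkt tokenizer data, which must be downloaded once with nltk.download
df['word_token'] = df['cleaned_data'].apply(word_tokenize)
df[['cleaned_data', 'word_token']]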

Unnamed: 0,cleaned_data,word_token
0,One of the other reviewers has mentioned that ...,"[One, of, the, other, reviewers, has, mentione..."
1,A wonderful little production. The filming tec...,"[A, wonderful, little, production, ., The, fil..."
2,This movie was a groundbreaking experience! Iv...,"[This, movie, was, a, groundbreaking, experien..."
3,I thought this was a wonderful way to spend ti...,"[I, thought, this, was, a, wonderful, way, to,..."
4,Basically theres a family where a little boy J...,"[Basically, theres, a, family, where, a, littl..."
...,...,...
6492,This movies idea of character development is m...,"[This, movies, idea, of, character, developmen..."
6493,I guess they ran out of budget for a decent sc...,"[I, guess, they, ran, out, of, budget, for, a,..."
6494,Who needs a plot when you have explosions ever...,"[Who, needs, a, plot, when, you, have, explosi..."
6495,Is there an award for most generic action movi...,"[Is, there, an, award, for, most, generic, act..."


# Sentence tokenization

In [9]:
from nltk.tokenize import sent_tokenize
df['sent_token'] = df['cleaned_data'].apply(sent_tokenize)
df[['cleaned_data', 'sent_token']]

Unnamed: 0,cleaned_data,sent_token
0,One of the other reviewers has mentioned that ...,[One of the other reviewers has mentioned that...
1,A wonderful little production. The filming tec...,"[A wonderful little production., The filming t..."
2,This movie was a groundbreaking experience! Iv...,"[This movie was a groundbreaking experience!, ..."
3,I thought this was a wonderful way to spend ti...,[I thought this was a wonderful way to spend t...
4,Basically theres a family where a little boy J...,[Basically theres a family where a little boy ...
...,...,...
6492,This movies idea of character development is m...,[This movies idea of character development is ...
6493,I guess they ran out of budget for a decent sc...,[I guess they ran out of budget for a decent s...
6494,Who needs a plot when you have explosions ever...,[Who needs a plot when you have explosions eve...
6495,Is there an award for most generic action movi...,[Is there an award for most generic action mov...


# Removing stopwords from the tokens

In [13]:
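from nltk.corpus import stopwords  # requires the NLTK 'stopwords' corpus: nltk.download('stopwords')

# Build the stop-word set once instead of on every call
stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    # Compare in lowercase so that capitalized stop words such as "The" are also dropped
    return [word for word in tokens if word.lower() not in stop_words]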

def remove_stopwords_sentences(sentences):
    return [' '.join(remove_stopwords(word_tokenize(sent))) for sent in sentences]

df['word_token'] = df['word_token'].apply(remove_stopwords)

df['sent_token'] = df['sent_token'].apply(remove_stopwords_sentences)

df[['cleaned_data', 'word_token', 'sent_token']]

Unnamed: 0,cleaned_data,word_token,sent_token
0,One of the other reviewers has mentioned that ...,"[One, reviewers, mentioned, watching, 1, Oz, e...",[One reviewers mentioned watching 1 Oz episode...
1,A wonderful little production. The filming tec...,"[wonderful, little, production, ., filming, te...","[wonderful little production ., filming techni..."
2,This movie was a groundbreaking experience! Iv...,"[movie, groundbreaking, experience, !, Ive, ne...","[movie groundbreaking experience !, Ive never ..."
3,I thought this was a wonderful way to spend ti...,"[thought, wonderful, way, spend, time, hot, su...",[thought wonderful way spend time hot summer w...
4,Basically theres a family where a little boy J...,"[Basically, theres, family, little, boy, Jake,...",[Basically theres family little boy Jake think...
...,...,...,...
6492,This movies idea of character development is m...,"[movies, idea, character, development, muscles...",[movies idea character development muscles les...
6493,I guess they ran out of budget for a decent sc...,"[guess, ran, budget, decent, script, .]",[guess ran budget decent script .]
6494,Who needs a plot when you have explosions ever...,"[needs, plot, explosions, every, five, minutes...",[needs plot explosions every five minutes ?]
6495,Is there an award for most generic action movi...,"[award, generic, action, movie, ever, made, ?]",[award generic action movie ever made ?]
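
The filtering step can be tried without NLTK using a hand-written stop list. The set `TOY_STOP_WORDS` and the helper `drop_stopwords` below are hypothetical; NLTK's English list is much longer.

```python
# Hypothetical, tiny stop-word list for illustration only
TOY_STOP_WORDS = frozenset({"this", "is", "an", "a", "the"})

def drop_stopwords(tokens, stop_words=TOY_STOP_WORDS):
    """Keep tokens whose lowercase form is not in the stop-word set."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["This", "is", "an", "important", "NLP", "task", "."]
filtered = drop_stopwords(tokens)
print(filtered)
```

Lowercasing before the lookup is what removes "This" even though the stop list holds only lowercase entries. Punctuation is not a stop word, so "." survives, just as it does in the table above.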


# Lemmatization

What Is Lemmatization?
Lemmatization is a text pre-processing technique used in natural language processing (NLP) to reduce a word to its dictionary form, or lemma, so that different inflected forms of the same word can be treated alike. For example, a lemmatization algorithm would reduce the word "better" to its lemma, "good".

Stemming and lemmatization both reduce words to a base form. Stemming strips affixes (usually suffixes) by rule, so the resulting stem is not always an actual word: "running" is stemmed to "run", but the Porter stemmer reduces "studies" to "studi". In contrast, lemmatization considers the word's meaning and part of speech, ensuring that the result (the lemma) is a valid dictionary word. Stemming is simpler and faster, while lemmatization is generally more accurate, which makes it preferable in many applications.

Stemming can sometimes be too aggressive, leading to over-stemming or under-stemming, where words are incorrectly reduced or not reduced enough. On the other hand, lemmatization is more complex and requires access to a dictionary or vocabulary to determine the lemma of a word. This additional step makes lemmatization slower than stemming but results in more accurate word reductions. Overall, the choice between stemming and lemmatization depends on the specific requirements of the text processing task and the desired balance between speed and accuracy.
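
The trade-off can be seen in a toy comparison. Both functions below are illustrative assumptions: `toy_stem` is a crude suffix stripper (not the Porter algorithm), and `toy_lemmatize` is a three-entry lookup table standing in for a real dictionary.

```python
def toy_stem(word):
    """Strip the first matching suffix; no dictionary, so results may not be words."""
    for suffix in ("ing", "ies", "es", "ed", "s"):
        # Require at least three remaining characters so short words stay intact
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table; a real lemmatizer consults a full lexicon
TOY_LEMMAS = {"running": "run", "studies": "study", "better": "good"}

def toy_lemmatize(word):
    """Look the word up in the table; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

for w in ("running", "studies", "better"):
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer is fast and needs no data, but it produces "runn" and "stud" and cannot relate "better" to "good". The lookup gets all three right, yet only because every form was listed in advance.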

In [19]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag

lemmatizer = WordNetLemmatizer()
# Simpler alternative, kept for reference but not used: without a POS argument,
# lemmatize() treats every token as a noun, so verbs such as "mentioned" are left unchanged.
#
# def lemmatize_tokens(tokens):
#     return [lemmatizer.lemmatize(word) for word in tokens]
#
# def lemmatize_sentences(sentences):
#     return [' '.join(lemmatize_tokens(sentence.split())) for sentence in sentences]


# Function to convert POS tag to a format recognized by the lemmatizer
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def lemmatize_tokens(tokens):
    pos_tags = pos_tag(tokens)
    return [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags]

def lemmatize_sentences(sentences):
    return [' '.join(lemmatize_tokens(word_tokenize(sentence))) for sentence in sentences]

df['word_token'] = df['word_token'].apply(lemmatize_tokens)
df['sent_token'] = df['sent_token'].apply(lemmatize_sentences)

df[['cleaned_data','word_token', 'sent_token']]

Unnamed: 0,cleaned_data,word_token,sent_token
0,One of the other reviewers has mentioned that ...,"[One, reviewer, mention, watch, 1, Oz, episode...",[One reviewer mention watch 1 Oz episode youll...
1,A wonderful little production. The filming tec...,"[wonderful, little, production, ., film, techn...","[wonderful little production ., film technique..."
2,This movie was a groundbreaking experience! Iv...,"[movie, groundbreaking, experience, !, Ive, ne...","[movie groundbreaking experience !, Ive never ..."
3,I thought this was a wonderful way to spend ti...,"[think, wonderful, way, spend, time, hot, summ...",[think wonderful way spend time hot summer wee...
4,Basically theres a family where a little boy J...,"[Basically, there, family, little, boy, Jake, ...",[Basically there family little boy Jake think ...
...,...,...,...
6492,This movies idea of character development is m...,"[movie, idea, character, development, muscle, ...",[movie idea character development muscle le br...
6493,I guess they ran out of budget for a decent sc...,"[guess, run, budget, decent, script, .]",[guess run budget decent script .]
6494,Who needs a plot when you have explosions ever...,"[need, plot, explosion, every, five, minute, ?]",[need plot explosion every five minute ?]
6495,Is there an award for most generic action movi...,"[award, generic, action, movie, ever, make, ?]",[award generic action movie ever make ?]
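
The reason for passing a part-of-speech tag can be shown without NLTK. The sketch below maps Penn Treebank tags to WordNet's one-letter POS codes ("a", "v", "n", "r") and then uses a hypothetical lookup table, `POS_LEMMAS`, in place of the real WordNet lookup.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to WordNet's one-letter POS code (noun by default)."""
    return {"J": "a", "V": "v", "N": "n", "R": "r"}.get(tag[:1], "n")

# Hypothetical table keyed by (word, pos); a real lemmatizer consults WordNet instead
POS_LEMMAS = {("thought", "v"): "think", ("thought", "n"): "thought"}

def lemmatize_with_pos(word, tag):
    """Return the lemma for the word under the given Penn Treebank tag."""
    pos = penn_to_wordnet(tag)
    return POS_LEMMAS.get((word, pos), word)

print(lemmatize_with_pos("thought", "VBD"))  # tagged as a past-tense verb
print(lemmatize_with_pos("thought", "NN"))   # tagged as a noun
```

The same surface form yields different lemmas depending on the tag, which is why the notebook tags the tokens before lemmatizing them.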
