# Data Processing

## Definition
- **Purpose**: Refers to the conversion of raw data into meaningful information through a series of operations and transformations.
- **Importance**: Enables the extraction of insights and knowledge from data for decision-making and problem-solving.

## Data Processing Techniques in NLP

## Lemmatization
- **Purpose**: Reduces words to their base or root form to identify similarities.
- **Impact**: Enhances text analysis by standardizing words for accurate processing.

## Tokenization
- **Purpose**: Breaks text into tokens (words, phrases, symbols) for analysis.
- **Impact**: Enables efficient processing and analysis of text data.

## Encoding Techniques
- **Purpose**: Converts categorical data into numerical form for analysis.
- **Impact**: Facilitates the use of categorical data in machine learning models.
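
As a minimal sketch of what encoding can look like for the `Sentiment` and `Sarcasm` labels in this dataset, the snippet below maps each category to an integer and also builds a one-hot vector. The mapping dictionaries and the helper `one_hot` are illustrative assumptions and are not part of the original notebook.

```python
# Hypothetical label mappings; the integer codes are arbitrary choices.
SENTIMENT_CODES = {"negative": 0, "positive": 1}
SARCASM_CODES = {"not sarcastic": 0, "sarcastic": 1}

def one_hot(label, codes):
    """Return a one-hot list with a 1 at the position assigned to `label`."""
    vector = [0] * len(codes)
    vector[codes[label]] = 1
    return vector

sentiments = ["positive", "negative", "positive"]
sarcasm = ["not sarcastic", "sarcastic", "sarcastic"]

sentiment_ids = [SENTIMENT_CODES[s] for s in sentiments]
sarcasm_ids = [SARCASM_CODES[s] for s in sarcasm]
sentiment_one_hot = [one_hot(s, SENTIMENT_CODES) for s in sentiments]

print(sentiment_ids)      # [1, 0, 1]
print(sarcasm_ids)        # [0, 1, 1]
print(sentiment_one_hot)  # [[0, 1], [1, 0], [0, 1]]
```

Integer codes are compact, while one-hot vectors avoid implying an order between categories.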


In [1]:
import nltk
import pandas as pd

In [2]:
# Importing the dataset excel sheet
df = pd.read_excel('Cleaned new Dataset.xlsx')
df

Unnamed: 0,cleaned_data,Sentiment,Sarcasm
0,One of the other reviewers has mentioned that ...,positive,not sarcastic
1,A wonderful little production. The filming tec...,positive,not sarcastic
2,This movie was a groundbreaking experience! Iv...,positive,sarcastic
3,I thought this was a wonderful way to spend ti...,positive,not sarcastic
4,Basically theres a family where a little boy J...,negative,sarcastic
...,...,...,...
6492,This movies idea of character development is m...,negative,sarcastic
6493,I guess they ran out of budget for a decent sc...,negative,sarcastic
6494,Who needs a plot when you have explosions ever...,negative,sarcastic
6495,Is there an award for most generic action movi...,negative,sarcastic


# Step 1: Removing stopwords from the text

In [3]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word.lower() not in stop_words])

df['cleaned_data'] = df['cleaned_data'].apply(remove_stopwords)

df.cleaned_data

0       One reviewers mentioned watching 1 Oz episode ...
1       wonderful little production. filming technique...
2       movie groundbreaking experience! Ive never see...
3       thought wonderful way spend time hot summer we...
4       Basically theres family little boy Jake thinks...
                              ...                        
6492    movies idea character development muscles less...
6493                      guess ran budget decent script.
6494            needs plot explosions every five minutes?
6495                award generic action movie ever made?
6496                Two hours nonstop mindnumbing action.
Name: cleaned_data, Length: 6497, dtype: object
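
To see what the filter does on a single review, here is a small self-contained sketch. It uses a tiny hand-written stop-word set (`TOY_STOP_WORDS`, an assumption made so the example runs without downloading NLTK data) instead of NLTK's much longer English list.

```python
# A tiny, hypothetical stop-word list; NLTK's English list is much longer.
TOY_STOP_WORDS = {"who", "a", "when", "you", "have", "is", "the", "for"}

def remove_stopwords_toy(text, stop_words=TOY_STOP_WORDS):
    """Drop whitespace-separated words whose lowercase form is a stop word."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)

review = "Who needs a plot when you have explosions every five minutes?"
print(remove_stopwords_toy(review))
# needs plot explosions every five minutes?
```

Because the text is split on whitespace only, punctuation stays attached to words (`minutes?`), so a stop word directly followed by punctuation would not be matched.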

# Step 2: Lemmatization in NLP

## Definition
- **Text Pre-processing Technique**: Reduces words to their base or root form to identify similarities.

## Comparison with Stemming
- **Stemming**: Simple and fast but may not result in actual words.
- **Lemmatization**: Considers word meaning and context, ensuring valid word roots, but is slower.

## Example
- **Word Reduction**: "Better" is lemmatized to "good" when it is treated as an adjective, because "good" is its dictionary form.

## Precision vs. Speed
- **Stemming**: Faster but less precise.
- **Lemmatization**: More accurate but slower due to dictionary lookup.

## Application
- **NLP Models**: Used to standardize words for analysis, improving accuracy in tasks like sentiment analysis and information retrieval.
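
The contrast between the two approaches can be sketched with toy stand-ins. Both functions below are deliberately simplified assumptions (`toy_stem` strips a few suffixes, `toy_lemmatize` looks words up in a four-entry dictionary); they are not the algorithms NLTK implements, but they show why stemming can yield non-words while lemmatization depends on a dictionary and a part of speech.

```python
# Toy suffix-stripping stemmer: fast, rule-based, no dictionary.
def toy_stem(word):
    for suffix in ("ing", "ies", "es", "ed", "s"):
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary keyed by (word, part of speech); entries are illustrative.
TOY_LEMMAS = {
    ("better", "a"): "good",
    ("ran", "v"): "run",
    ("movies", "n"): "movie",
    ("studies", "n"): "study",
}

def toy_lemmatize(word, pos):
    """Look the word up for the given part of speech; fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_stem("studies"), toy_lemmatize("studies", "n"))  # stud study
print(toy_stem("better"), toy_lemmatize("better", "a"))    # better good
```

The stemmer returns `stud`, which is not a word, and cannot relate `better` to `good`; the dictionary lookup handles both but only for entries it knows.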


In [4]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag

lemmatizer = WordNetLemmatizer()

# Function to convert POS tag to a format recognized by the lemmatizer
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def lemmatization(sentence):
    words = sentence.split()
    pos_tags = pos_tag(words)
    return ' '.join([lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags])


df['lemmatized_data'] = df['cleaned_data'].apply(lemmatization)

df[['cleaned_data','lemmatized_data']]

Unnamed: 0,cleaned_data,lemmatized_data
0,One reviewers mentioned watching 1 Oz episode ...,One reviewer mention watch 1 Oz episode youll ...
1,wonderful little production. filming technique...,wonderful little production. filming technique...
2,movie groundbreaking experience! Ive never see...,movie groundbreaking experience! Ive never see...
3,thought wonderful way spend time hot summer we...,think wonderful way spend time hot summer week...
4,Basically theres family little boy Jake thinks...,Basically there family little boy Jake think t...
...,...,...
6492,movies idea character development muscles less...,movie idea character development muscle less b...
6493,guess ran budget decent script.,guess run budget decent script.
6494,needs plot explosions every five minutes?,need plot explosion every five minutes?
6495,award generic action movie ever made?,award generic action movie ever made?


# Importance of Tokenization in NLP

Tokenization is the process of dividing a text into smaller units known as tokens. In natural language processing, tokens are typically words or sub-words.
The process splits a string of text into a list of tokens. What counts as a token depends on the level of analysis: a word is a token within a sentence, and a sentence is a token within a paragraph.

## Effective Text Processing
- **Simplifies Text Handling**: Breaks raw text into small, uniform units that are easier to process and analyze.

## Feature Extraction
- **Numerical Representation**: Converts text into tokens, which can then be counted or mapped to IDs and used as features in machine learning models.
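
One common way to turn tokens into features is a bag-of-words count vector. The sketch below assumes a small fixed `vocabulary` chosen for illustration; the helper name `bag_of_words` is hypothetical.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary term occurs in the token list."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

# Illustrative vocabulary; a real one would be built from the whole corpus.
vocabulary = ["action", "budget", "movie", "plot", "script"]
tokens = ["award", "generic", "action", "movie", "ever", "made", "?"]

print(bag_of_words(tokens, vocabulary))  # [1, 0, 1, 0, 0]
```

Tokens outside the vocabulary (`award`, `generic`, ...) are simply ignored in this representation.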

## Language Modelling
- **Organized Representations**: Facilitates the creation of structured representations of language for tasks like text generation and language modelling.

## Information Retrieval
- **Efficient Indexing and Searching**: Essential for systems that store and retrieve information based on words or phrases.
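
A minimal sketch of token-based indexing is an inverted index that maps each token to the documents containing it. The function name `build_inverted_index` and the three sample documents are assumptions for illustration.

```python
from collections import defaultdict

def build_inverted_index(tokenized_docs):
    """Map each lowercased token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(tokenized_docs):
        for token in tokens:
            index[token.lower()].add(doc_id)
    return index

docs = [
    ["guess", "run", "budget", "decent", "script", "."],
    ["need", "plot", "explosion", "every", "five", "minutes", "?"],
    ["award", "generic", "action", "movie", "ever", "made", "?"],
]
index = build_inverted_index(docs)

print(sorted(index["?"]))       # [1, 2]
print(sorted(index["budget"]))  # [0]
```

A search for a word then becomes a dictionary lookup instead of a scan over every document.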

## Text Analysis
- **NLP Tasks**: Used in tasks such as sentiment analysis and named entity recognition to determine the function and context of individual words.

## Vocabulary Management
- **Corpus Vocabulary Management**: Generates a list of distinct tokens representing words in the dataset.
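
As a rough sketch, a vocabulary can be built by counting tokens across the corpus and assigning an ID to each token that is frequent enough. The `<unk>` placeholder and the `min_frequency` parameter of the hypothetical helper `build_vocabulary` are assumptions for this example.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, min_frequency=1):
    """Assign an integer id to every token seen at least `min_frequency` times."""
    counts = Counter(token for tokens in tokenized_docs for token in tokens)
    vocab = {"<unk>": 0}  # id 0 is reserved for unknown or rare tokens
    for token, freq in sorted(counts.items()):
        if freq >= min_frequency:
            vocab[token] = len(vocab)
    return vocab

docs = [["movie", "plot", "movie"], ["plot", "script"]]
vocab = build_vocabulary(docs, min_frequency=2)
print(vocab)  # {'<unk>': 0, 'movie': 1, 'plot': 2}

# Tokens missing from the vocabulary fall back to the <unk> id.
ids = [vocab.get(token, vocab["<unk>"]) for token in ["movie", "script"]]
print(ids)  # [1, 0]
```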

## Task-Specific Adaptation
- **Customization**: Can be tailored to specific NLP tasks, enhancing applications like summarization and machine translation.

## Preprocessing Step
- **Essential Transformation**: Transforms raw text into a format suitable for further statistical and computational analysis.


# Step 3: Applying different tokenization techniques

# 1) Word tokenization

Word tokenization divides the text into individual words. Many NLP tasks use this approach, in which words are treated as the basic units of meaning.

Example:

Input: "Tokenization is an important NLP task."

Output: ["Tokenization", "is", "an", "important", "NLP", "task", "."]
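
The example above can be reproduced with a short regular-expression sketch. This is a simplified stand-in, not the algorithm behind NLTK's `word_tokenize`; the function name `simple_word_tokenize` is hypothetical.

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither of those nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize("Tokenization is an important NLP task."))
# ['Tokenization', 'is', 'an', 'important', 'NLP', 'task', '.']
```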


In [5]:
from nltk.tokenize import word_tokenize

# Split each lemmatized review into word and punctuation tokens
df['word_token'] = df['lemmatized_data'].apply(word_tokenize)
df[['cleaned_data','lemmatized_data','word_token']]

Unnamed: 0,cleaned_data,lemmatized_data,word_token
0,One reviewers mentioned watching 1 Oz episode ...,One reviewer mention watch 1 Oz episode youll ...,"[One, reviewer, mention, watch, 1, Oz, episode..."
1,wonderful little production. filming technique...,wonderful little production. filming technique...,"[wonderful, little, production, ., filming, te..."
2,movie groundbreaking experience! Ive never see...,movie groundbreaking experience! Ive never see...,"[movie, groundbreaking, experience, !, Ive, ne..."
3,thought wonderful way spend time hot summer we...,think wonderful way spend time hot summer week...,"[think, wonderful, way, spend, time, hot, summ..."
4,Basically theres family little boy Jake thinks...,Basically there family little boy Jake think t...,"[Basically, there, family, little, boy, Jake, ..."
...,...,...,...
6492,movies idea character development muscles less...,movie idea character development muscle less b...,"[movie, idea, character, development, muscle, ..."
6493,guess ran budget decent script.,guess run budget decent script.,"[guess, run, budget, decent, script, .]"
6494,needs plot explosions every five minutes?,need plot explosion every five minutes?,"[need, plot, explosion, every, five, minutes, ?]"
6495,award generic action movie ever made?,award generic action movie ever made?,"[award, generic, action, movie, ever, made, ?]"


# 2) Sentence tokenization

Sentence tokenization segments the text into sentences. This is useful for tasks that analyze or process each sentence individually.

Example:

Input: "Tokenization is an important NLP task. It helps break down text into smaller units."

Output: ["Tokenization is an important NLP task.", "It helps break down text into smaller units."]
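
A naive version of this split can be written with a regular expression that breaks after sentence-ending punctuation. It is only a sketch: unlike NLTK's `sent_tokenize`, it has no handling for abbreviations such as "Dr." and would split after them. The name `simple_sent_tokenize` is hypothetical.

```python
import re

def simple_sent_tokenize(text):
    """Split text at whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

text = "Tokenization is an important NLP task. It helps break down text into smaller units."
print(simple_sent_tokenize(text))
# ['Tokenization is an important NLP task.', 'It helps break down text into smaller units.']
```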

In [6]:
from nltk.tokenize import sent_tokenize

# Split each lemmatized review into a list of sentences
df['sent_token'] = df['lemmatized_data'].apply(sent_tokenize)
df[['cleaned_data','lemmatized_data','word_token','sent_token']]

Unnamed: 0,cleaned_data,lemmatized_data,word_token,sent_token
0,One reviewers mentioned watching 1 Oz episode ...,One reviewer mention watch 1 Oz episode youll ...,"[One, reviewer, mention, watch, 1, Oz, episode...",[One reviewer mention watch 1 Oz episode youll...
1,wonderful little production. filming technique...,wonderful little production. filming technique...,"[wonderful, little, production, ., filming, te...","[wonderful little production., filming techniq..."
2,movie groundbreaking experience! Ive never see...,movie groundbreaking experience! Ive never see...,"[movie, groundbreaking, experience, !, Ive, ne...","[movie groundbreaking experience!, Ive never s..."
3,thought wonderful way spend time hot summer we...,think wonderful way spend time hot summer week...,"[think, wonderful, way, spend, time, hot, summ...",[think wonderful way spend time hot summer wee...
4,Basically theres family little boy Jake thinks...,Basically there family little boy Jake think t...,"[Basically, there, family, little, boy, Jake, ...",[Basically there family little boy Jake think ...
...,...,...,...,...
6492,movies idea character development muscles less...,movie idea character development muscle less b...,"[movie, idea, character, development, muscle, ...",[movie idea character development muscle less ...
6493,guess ran budget decent script.,guess run budget decent script.,"[guess, run, budget, decent, script, .]",[guess run budget decent script.]
6494,needs plot explosions every five minutes?,need plot explosion every five minutes?,"[need, plot, explosion, every, five, minutes, ?]",[need plot explosion every five minutes?]
6495,award generic action movie ever made?,award generic action movie ever made?,"[award, generic, action, movie, ever, made, ?]",[award generic action movie ever made?]


# 3) Sub-word tokenization

Subword tokenization entails breaking down words into smaller units, which can be especially useful when dealing with morphologically rich languages or rare words.

Example:

Input: "tokenization"

Output: ["token", "ization"]
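
How such splits arise can be sketched with a toy byte-pair encoding (BPE) learner: start from single characters and repeatedly merge the most frequent adjacent pair. This is a simplified character-level illustration under stated assumptions (the function names and the five-word corpus are invented here); it is not the byte-level implementation in the `tokenizers` library, and the exact pieces depend on the training corpus.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    corpus = Counter(tuple(word) for word in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are resolved by comparing the pairs.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def apply_bpe(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

toy_words = ["token", "tokens", "tokenize", "tokenization", "organization"]
merges = learn_bpe(toy_words, num_merges=8)
print(apply_bpe("tokenization", merges))
```

Frequent character sequences end up as single tokens, while rare words are covered by several smaller pieces.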


In [7]:
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(df['lemmatized_data'].tolist(), vocab_size=5000, min_frequency=2)

def bpe_tokenize_reviews(reviews):
    return reviews.apply(lambda x: tokenizer.encode(x).tokens)

df['subword_token'] = bpe_tokenize_reviews(df['lemmatized_data'])

df[['cleaned_data','lemmatized_data','word_token','sent_token','subword_token']]


Unnamed: 0,cleaned_data,lemmatized_data,word_token,sent_token,subword_token
0,One reviewers mentioned watching 1 Oz episode ...,One reviewer mention watch 1 Oz episode youll ...,"[One, reviewer, mention, watch, 1, Oz, episode...",[One reviewer mention watch 1 Oz episode youll...,"[One, Ġreviewer, Ġmention, Ġwatch, Ġ1, ĠO, z, ..."
1,wonderful little production. filming technique...,wonderful little production. filming technique...,"[wonderful, little, production, ., filming, te...","[wonderful little production., filming techniq...","[w, onder, ful, Ġlittle, Ġproduction, ., Ġfilm..."
2,movie groundbreaking experience! Ive never see...,movie groundbreaking experience! Ive never see...,"[movie, groundbreaking, experience, !, Ive, ne...","[movie groundbreaking experience!, Ive never s...","[movie, Ġground, breaking, Ġexperience, !, ĠIv..."
3,thought wonderful way spend time hot summer we...,think wonderful way spend time hot summer week...,"[think, wonderful, way, spend, time, hot, summ...",[think wonderful way spend time hot summer wee...,"[think, Ġwonderful, Ġway, Ġspend, Ġtime, Ġhot,..."
4,Basically theres family little boy Jake thinks...,Basically there family little boy Jake think t...,"[Basically, there, family, little, boy, Jake, ...",[Basically there family little boy Jake think ...,"[B, as, ically, Ġthere, Ġfamily, Ġlittle, Ġboy..."
...,...,...,...,...,...
6492,movies idea character development muscles less...,movie idea character development muscle less b...,"[movie, idea, character, development, muscle, ...",[movie idea character development muscle less ...,"[movie, Ġidea, Ġcharacter, Ġdevelopment, Ġmus,..."
6493,guess ran budget decent script.,guess run budget decent script.,"[guess, run, budget, decent, script, .]",[guess run budget decent script.],"[gu, ess, Ġrun, Ġbudget, Ġdecent, Ġscript, .]"
6494,needs plot explosions every five minutes?,need plot explosion every five minutes?,"[need, plot, explosion, every, five, minutes, ?]",[need plot explosion every five minutes?],"[need, Ġplot, Ġexplosion, Ġevery, Ġfive, Ġminu..."
6495,award generic action movie ever made?,award generic action movie ever made?,"[award, generic, action, movie, ever, made, ?]",[award generic action movie ever made?],"[aw, ard, Ġgeneric, Ġaction, Ġmovie, Ġever, Ġm..."
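
In the `subword_token` column, the `Ġ` character marks a token that was preceded by a space in the original text. A small sketch of reading the tokens back, using one row from the table above; the helper name `detokenize_byte_level` is hypothetical, and the simple replacement shown here is only reliable for plain ASCII text.

```python
def detokenize_byte_level(tokens):
    """Join byte-level BPE tokens, turning the 'Ġ' marker back into a space."""
    return "".join(tokens).replace("Ġ", " ")

tokens = ["gu", "ess", "Ġrun", "Ġbudget", "Ġdecent", "Ġscript", "."]
print(detokenize_byte_level(tokens))  # guess run budget decent script.
```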


# 4) Character tokenization

Character tokenization divides the text into individual characters. This can be useful for character-level language modelling.

Example:

Input: "Tokenization"

Output: ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n"]

In [8]:
def char_tokenizer(text):
    return list(text)

df['char_data']= df['lemmatized_data'].apply(char_tokenizer)

df[['cleaned_data','lemmatized_data','word_token','sent_token','subword_token','char_data']]

Unnamed: 0,cleaned_data,lemmatized_data,word_token,sent_token,subword_token,char_data
0,One reviewers mentioned watching 1 Oz episode ...,One reviewer mention watch 1 Oz episode youll ...,"[One, reviewer, mention, watch, 1, Oz, episode...",[One reviewer mention watch 1 Oz episode youll...,"[One, Ġreviewer, Ġmention, Ġwatch, Ġ1, ĠO, z, ...","[O, n, e, , r, e, v, i, e, w, e, r, , m, e, ..."
1,wonderful little production. filming technique...,wonderful little production. filming technique...,"[wonderful, little, production, ., filming, te...","[wonderful little production., filming techniq...","[w, onder, ful, Ġlittle, Ġproduction, ., Ġfilm...","[w, o, n, d, e, r, f, u, l, , l, i, t, t, l, ..."
2,movie groundbreaking experience! Ive never see...,movie groundbreaking experience! Ive never see...,"[movie, groundbreaking, experience, !, Ive, ne...","[movie groundbreaking experience!, Ive never s...","[movie, Ġground, breaking, Ġexperience, !, ĠIv...","[m, o, v, i, e, , g, r, o, u, n, d, b, r, e, ..."
3,thought wonderful way spend time hot summer we...,think wonderful way spend time hot summer week...,"[think, wonderful, way, spend, time, hot, summ...",[think wonderful way spend time hot summer wee...,"[think, Ġwonderful, Ġway, Ġspend, Ġtime, Ġhot,...","[t, h, i, n, k, , w, o, n, d, e, r, f, u, l, ..."
4,Basically theres family little boy Jake thinks...,Basically there family little boy Jake think t...,"[Basically, there, family, little, boy, Jake, ...",[Basically there family little boy Jake think ...,"[B, as, ically, Ġthere, Ġfamily, Ġlittle, Ġboy...","[B, a, s, i, c, a, l, l, y, , t, h, e, r, e, ..."
...,...,...,...,...,...,...
6492,movies idea character development muscles less...,movie idea character development muscle less b...,"[movie, idea, character, development, muscle, ...",[movie idea character development muscle less ...,"[movie, Ġidea, Ġcharacter, Ġdevelopment, Ġmus,...","[m, o, v, i, e, , i, d, e, a, , c, h, a, r, ..."
6493,guess ran budget decent script.,guess run budget decent script.,"[guess, run, budget, decent, script, .]",[guess run budget decent script.],"[gu, ess, Ġrun, Ġbudget, Ġdecent, Ġscript, .]","[g, u, e, s, s, , r, u, n, , b, u, d, g, e, ..."
6494,needs plot explosions every five minutes?,need plot explosion every five minutes?,"[need, plot, explosion, every, five, minutes, ?]",[need plot explosion every five minutes?],"[need, Ġplot, Ġexplosion, Ġevery, Ġfive, Ġminu...","[n, e, e, d, , p, l, o, t, , e, x, p, l, o, ..."
6495,award generic action movie ever made?,award generic action movie ever made?,"[award, generic, action, movie, ever, made, ?]",[award generic action movie ever made?],"[aw, ard, Ġgeneric, Ġaction, Ġmovie, Ġever, Ġm...","[a, w, a, r, d, , g, e, n, e, r, i, c, , a, ..."


# Step 5: Saving the tokenized dataset

In [11]:
# Build the output frame directly from the processed columns,
# so the cell does not depend on the target file already existing
df_new = pd.DataFrame({
    'cleaned_data': df['lemmatized_data'],  # lemmatized text is saved as cleaned_data
    'word_token': df['word_token'],
    'sent_token': df['sent_token'],
    'subword_token': df['subword_token'],
    'char_data': df['char_data'],
    'Sentiment': df['Sentiment'],
    'Sarcasm': df['Sarcasm'],
})
df_new.to_excel('tokenized Dataset.xlsx', index=False)
df_new

Unnamed: 0,cleaned_data,word_token,sent_token,subword_token,char_data,Sentiment,Sarcasm
0,One reviewer mention watch 1 Oz episode youll ...,"[One, reviewer, mention, watch, 1, Oz, episode...",[One reviewer mention watch 1 Oz episode youll...,"[One, Ġreviewer, Ġmention, Ġwatch, Ġ1, ĠO, z, ...","[O, n, e, , r, e, v, i, e, w, e, r, , m, e, ...",positive,not sarcastic
1,wonderful little production. filming technique...,"[wonderful, little, production, ., filming, te...","[wonderful little production., filming techniq...","[w, onder, ful, Ġlittle, Ġproduction, ., Ġfilm...","[w, o, n, d, e, r, f, u, l, , l, i, t, t, l, ...",positive,not sarcastic
2,movie groundbreaking experience! Ive never see...,"[movie, groundbreaking, experience, !, Ive, ne...","[movie groundbreaking experience!, Ive never s...","[movie, Ġground, breaking, Ġexperience, !, ĠIv...","[m, o, v, i, e, , g, r, o, u, n, d, b, r, e, ...",positive,sarcastic
3,think wonderful way spend time hot summer week...,"[think, wonderful, way, spend, time, hot, summ...",[think wonderful way spend time hot summer wee...,"[think, Ġwonderful, Ġway, Ġspend, Ġtime, Ġhot,...","[t, h, i, n, k, , w, o, n, d, e, r, f, u, l, ...",positive,not sarcastic
4,Basically there family little boy Jake think t...,"[Basically, there, family, little, boy, Jake, ...",[Basically there family little boy Jake think ...,"[B, as, ically, Ġthere, Ġfamily, Ġlittle, Ġboy...","[B, a, s, i, c, a, l, l, y, , t, h, e, r, e, ...",negative,sarcastic
...,...,...,...,...,...,...,...
6492,movie idea character development muscle less b...,"[movie, idea, character, development, muscle, ...",[movie idea character development muscle less ...,"[movie, Ġidea, Ġcharacter, Ġdevelopment, Ġmus,...","[m, o, v, i, e, , i, d, e, a, , c, h, a, r, ...",negative,sarcastic
6493,guess run budget decent script.,"[guess, run, budget, decent, script, .]",[guess run budget decent script.],"[gu, ess, Ġrun, Ġbudget, Ġdecent, Ġscript, .]","[g, u, e, s, s, , r, u, n, , b, u, d, g, e, ...",negative,sarcastic
6494,need plot explosion every five minutes?,"[need, plot, explosion, every, five, minutes, ?]",[need plot explosion every five minutes?],"[need, Ġplot, Ġexplosion, Ġevery, Ġfive, Ġminu...","[n, e, e, d, , p, l, o, t, , e, x, p, l, o, ...",negative,sarcastic
6495,award generic action movie ever made?,"[award, generic, action, movie, ever, made, ?]",[award generic action movie ever made?],"[aw, ard, Ġgeneric, Ġaction, Ġmovie, Ġever, Ġm...","[a, w, a, r, d, , g, e, n, e, r, i, c, , a, ...",negative,sarcastic
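
The token columns hold Python lists, which a spreadsheet cell cannot store as such. Assuming they come back as their string representation when the file is reloaded (an assumption worth checking against the saved file), they can be parsed into lists again. The helper name `parse_token_list` is hypothetical.

```python
import ast

def parse_token_list(cell):
    """Turn a string such as "['guess', 'run']" back into a Python list."""
    if isinstance(cell, list):
        return cell  # already a list, nothing to parse
    return ast.literal_eval(cell)

stored = "['guess', 'run', 'budget', 'decent', 'script', '.']"
print(parse_token_list(stored))
# ['guess', 'run', 'budget', 'decent', 'script', '.']
```

On a reloaded frame this could be applied per column, for example `df_new['word_token'].apply(parse_token_list)`.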


# Step 4: Analysis of different tokenizations

The tokens from each method, together with the sarcasm labels, were used to train a random forest model.
The analysis follows.
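
The training code is not shown in this notebook. The sketch below is one plausible setup under loudly stated assumptions: bag-of-words counts from scikit-learn's `CountVectorizer`, a `RandomForestClassifier` with default-like settings, and a 20% test split (assumed because 1,300 evaluation rows out of 6,497 is about 20%). The ten token lists are toy data, and none of this is confirmed as the method that produced the results below.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Toy stand-in for one token column and the Sarcasm labels.
token_lists = [
    ["guess", "run", "budget", "decent", "script", "."],
    ["need", "plot", "explosion", "every", "five", "minutes", "?"],
    ["award", "generic", "action", "movie", "ever", "made", "?"],
    ["movie", "idea", "character", "development", "muscle"],
    ["two", "hour", "nonstop", "mindnumbing", "action", "."],
    ["wonderful", "little", "production", "."],
    ["think", "wonderful", "way", "spend", "time"],
    ["one", "reviewer", "mention", "watch", "episode"],
    ["great", "acting", "strong", "story", "."],
    ["enjoy", "film", "beautiful", "music", "."],
]
labels = ["sarcastic"] * 5 + ["not sarcastic"] * 5

train_tokens, test_tokens, train_labels, test_labels = train_test_split(
    token_lists, labels, test_size=0.2, random_state=42, stratify=labels
)

# The documents are already tokenized, so the analyzer passes them through unchanged.
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
X_train = vectorizer.fit_transform(train_tokens)  # vocabulary learned on training data only
X_test = vectorizer.transform(test_tokens)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, train_labels)
predictions = model.predict(X_test)

print(accuracy_score(test_labels, predictions))
print(confusion_matrix(test_labels, predictions, labels=["not sarcastic", "sarcastic"]))
```

Swapping `token_lists` for another token column would repeat the experiment for a different tokenization, with everything else held fixed.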

# 1) Word tokenization

Evaluation for word_token:

Accuracy: 0.83
Confusion Matrix:
[[545  59]
 [160 536]]
Classification Report:
               precision    recall  f1-score   support

not sarcastic       0.77      0.90      0.83       604
    sarcastic       0.90      0.77      0.83       696

     accuracy                           0.83      1300
    macro avg       0.84      0.84      0.83      1300
 weighted avg       0.84      0.83      0.83      1300
 
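The word_token model achieved an accuracy of 83% in classifying text as sarcastic or not sarcastic. The F1-scores are the same for both classes (0.83), but precision and recall mirror each other: the model recovers 90% of the 'not sarcastic' reviews at a precision of 0.77, and 77% of the 'sarcastic' reviews at a precision of 0.90. In the confusion matrix, this corresponds to 160 of the 696 sarcastic reviews being missed, while only 59 of the 604 non-sarcastic reviews are mislabeled as sarcastic. Overall, the model performs well on the 1,300-review evaluation set.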
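
The figures in the classification report follow directly from the confusion matrix, where rows are true labels and columns are predicted labels. A short sketch of the arithmetic for the word_token matrix; the helper name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(matrix, class_index):
    """Precision, recall, and F1 for one class (rows = true, columns = predicted)."""
    true_positive = matrix[class_index][class_index]
    predicted = sum(row[class_index] for row in matrix)  # column sum
    actual = sum(matrix[class_index])                    # row sum
    precision = true_positive / predicted
    recall = true_positive / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

word_token_matrix = [[545, 59], [160, 536]]
total = sum(sum(row) for row in word_token_matrix)
accuracy = (word_token_matrix[0][0] + word_token_matrix[1][1]) / total
print(round(accuracy, 2))  # 0.83

for index, name in enumerate(["not sarcastic", "sarcastic"]):
    precision, recall, f1 = per_class_metrics(word_token_matrix, index)
    print(name, round(precision, 2), round(recall, 2), round(f1, 2))
# not sarcastic 0.77 0.9 0.83
# sarcastic 0.9 0.77 0.83
```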

# 2) Sentence tokenization

Evaluation for sent_token:

Accuracy: 0.83
Confusion Matrix:
[[539  65]
 [157 539]]
Classification Report:
               precision    recall  f1-score   support

not sarcastic       0.77      0.89      0.83       604
    sarcastic       0.89      0.77      0.83       696

     accuracy                           0.83      1300
    macro avg       0.83      0.83      0.83      1300
 weighted avg       0.84      0.83      0.83      1300

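The sent_token model achieved an accuracy of 83% in classifying text as sarcastic or not sarcastic. As with the word_token model, the F1-scores are the same for both classes (0.83), while precision and recall mirror each other: recall is 0.89 for 'not sarcastic' and 0.77 for 'sarcastic', with precision of 0.77 and 0.89 respectively. The model misses 157 of the 696 sarcastic reviews and mislabels 65 of the 604 non-sarcastic reviews, so its performance on the evaluation set is nearly identical to that of the word_token model.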

# 3) Sub-word tokenization

Evaluation for subword_token:

Accuracy: 0.83
Confusion Matrix:
[[545  59]
 [162 534]]
Classification Report:
               precision    recall  f1-score   support

not sarcastic       0.77      0.90      0.83       604
    sarcastic       0.90      0.77      0.83       696

     accuracy                           0.83      1300
    macro avg       0.84      0.83      0.83      1300
 weighted avg       0.84      0.83      0.83      1300

The subword_token model achieved an accuracy of 83% in classifying text as sarcastic or not sarcastic. The F1-scores are again the same for both classes (0.83), with recall of 0.90 for 'not sarcastic' and 0.77 for 'sarcastic', and precision of 0.77 and 0.90 respectively. The model misses 162 of the 696 sarcastic reviews and mislabels 59 of the 604 non-sarcastic reviews, which puts it on par with the word_token and sent_token models.

# 4) Character tokenization

Evaluation for char_data:

Accuracy: 0.54
Confusion Matrix:
[[  1 603]
 [  0 696]]
Classification Report:
               precision    recall  f1-score   support

not sarcastic       1.00      0.00      0.00       604
    sarcastic       0.54      1.00      0.70       696

     accuracy                           0.54      1300
    macro avg       0.77      0.50      0.35      1300
 weighted avg       0.75      0.54      0.38      1300

The char_data model achieved an accuracy of only 54%. The confusion matrix shows that it predicted 'sarcastic' for all but one of the 1,300 reviews. For the 'sarcastic' class this gives a recall of 1.00 with a precision of 0.54 and an F1-score of 0.70. For the 'not sarcastic' class, only 1 of the 604 reviews was classified correctly: the precision of 1.00 rests on that single prediction, while recall and F1-score round to 0.00. The accuracy is therefore roughly equal to the share of sarcastic reviews in the evaluation set (696 of 1,300), which indicates that the character-level features gave the model almost no ability to separate the two classes.
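
One way to put the 54% in context is to compare it with a baseline that always predicts the most frequent class. The sketch below uses the support counts and confusion matrix reported for char_data; the variable names are illustrative.

```python
support = {"not sarcastic": 604, "sarcastic": 696}
char_matrix = [[1, 603], [0, 696]]  # rows = true labels, columns = predicted labels

total = sum(support.values())
baseline_accuracy = max(support.values()) / total  # always predict the majority class
model_accuracy = (char_matrix[0][0] + char_matrix[1][1]) / total

print(round(baseline_accuracy, 4))  # 0.5354
print(round(model_accuracy, 4))     # 0.5362
```

The model beats the majority-class baseline by a single correctly classified review.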

# Conclusion
The word_token model stands out as a suitable choice for classifying text as sarcastic or not sarcastic. Word, sentence, and sub-word tokenization all reached 83% accuracy, whereas character tokenization dropped to 54%, so the choice among the first three rests on interpretability rather than accuracy. Word-level tokens are easier to interpret and analyze than the output of the other tokenization methods, are more intuitive for users, and provide better insight into the model's classification decisions.