# Different Tokenization Methods

Tokenization is the process of breaking text into smaller units called tokens. These tokens can be words, subwords, characters, or even entire sentences, depending on the granularity that a particular task or model requires.

### Word Tokenization
Word tokenization breaks text into individual words or terms. It suits most NLP tasks that treat words as the basic unit, such as text classification or sentiment analysis.
### Sentence Tokenization
Sentence tokenization splits text into individual sentences. It is essential for tasks that analyze text at the sentence level, such as machine translation or text summarization.
### Character Tokenization
Character tokenization breaks text down into individual characters. It is useful where character-level analysis matters, such as handwriting recognition or certain types of text generation.
### Subword Tokenization
Subword tokenization divides text into smaller meaningful units, or subwords. It is particularly helpful for handling unknown words and morphologically rich languages in tasks like machine translation or named entity recognition.
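
To make the first three granularities concrete, here is a minimal sketch that applies them to one short string using only the Python standard library. The sample text and the regular expression are assumptions chosen for illustration; they are deliberately naive stand-ins for the library tokenizers used in the cells that follow.

```python
import re

text = "Tokenization is fun. It splits text!"

# Word level: a naive whitespace split (punctuation stays glued to the words).
words = text.split()

# Sentence level: split on whitespace that follows '.', '!' or '?'.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Character level: every character, spaces included, becomes a token.
characters = list(text)

print(words)       # ['Tokenization', 'is', 'fun.', 'It', 'splits', 'text!']
print(sentences)   # ['Tokenization is fun.', 'It splits text!']
print(characters[:5])
```

Note how the whitespace split leaves `fun.` and `text!` with punctuation attached, which is one reason dedicated tokenizers are usually preferred.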

In [74]:
import nltk
import pandas as pd

In [75]:
df = pd.read_csv('Clean_Dataset.csv')
df.head()

Unnamed: 0,Label,Tweets
0,sarcastic,I loovee when people text back unamused_face
1,sarcastic,Don't you love it when your parents are Pissed...
2,sarcastic,"So many useless classes , great to be student"
3,sarcastic,Oh how I love getting home from work at am and...
4,sarcastic,I just love having grungy ass hair expressionl...


## Word Tokenization

In [76]:
from nltk.tokenize import TweetTokenizer

# Initialize the TweetTokenizer
tokenizer = TweetTokenizer()

def tokenize_text(Tweets):
    return tokenizer.tokenize(Tweets)

df = pd.read_csv('Clean_Dataset.csv')
df['tokens'] = df['Tweets'].apply(tokenize_text)

print(df.head(15))
df.to_csv('tokenized_dataset.csv', index=False)

        Label                                             Tweets  \
0   sarcastic     I loovee when people text back  unamused_face    
1   sarcastic  Don't you love it when your parents are Pissed...   
2   sarcastic      So many useless classes , great to be student   
3   sarcastic  Oh how I love getting home from work at am and...   
4   sarcastic  I just love having grungy ass hair expressionl...   
5   sarcastic  Thank you , random guy , for sneaking up behin...   
6   sarcastic  Being half spanish and not being able to speak...   
7   sarcastic  I know no one will remember it broken_heart   ...   
8   sarcastic  Anyone in Chem that has to do OWLs knows my pa...   
9   sarcastic                          Holy crap I look great      
10  sarcastic      I love doing 20 sprints in 10 degree weather    
11  sarcastic  feeling like a million bucks after that chem 2...   
12  sarcastic  I love working in Sydney river It makes me wan...   
13  sarcastic  I love hearing things about me th
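
To illustrate why a tweet-aware tokenizer is used here rather than a plain whitespace split, the sketch below implements a much simplified regex tokenizer. `TWEET_TOKEN` and `simple_tweet_tokenize` are hypothetical names, the sample tweet is invented, and the pattern is an assumption for illustration only; it is not how `TweetTokenizer` is implemented.

```python
import re

# Simplified, hypothetical pattern; alternatives are tried left to right.
TWEET_TOKEN = re.compile(
    r"[@#]\w+"            # mentions and hashtags stay whole
    r"|[:;=]-?[()DPp]"    # a few common emoticons such as :) ;-) :D
    r"|\w+(?:'\w+)*"      # words, including contractions like don't
    r"|[^\w\s]"           # any other single punctuation mark
)

def simple_tweet_tokenize(text):
    """Return the tweet-aware tokens found in text, in order."""
    return TWEET_TOKEN.findall(text)

tweet = "Don't you love it when @boss texts at 5am? :) #blessed"
tokens = simple_tweet_tokenize(tweet)
print(tokens)
# ["Don't", 'you', 'love', 'it', 'when', '@boss', 'texts', 'at', '5am', '?', ':)', '#blessed']
print(tweet.split()[8])  # '5am?' (whitespace splitting keeps the question mark attached)
```

The mention, the emoticon, and the hashtag each survive as a single token, while the question mark is separated from `5am`.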

## Sentence Tokenization

In [77]:
from nltk.tokenize import TweetTokenizer, sent_tokenize

# Initialize the TweetTokenizer
tokenizer = TweetTokenizer()

def tokenize_text(text):
    return tokenizer.tokenize(text)
    
def tokenize_sentences(text):
    # sent_tokenize needs NLTK's Punkt sentence model, downloaded once beforehand
    return sent_tokenize(text)

df = pd.read_csv('Clean_Dataset.csv')

# Tokenize tweets into words
df['word_tokens'] = df['Tweets'].apply(tokenize_text)

# Tokenize tweets into sentences
df['sentence_tokens'] = df['Tweets'].apply(tokenize_sentences)
print(df.head(15))
df.to_csv('tokenized_dataset.csv', index=False)


        Label                                             Tweets  \
0   sarcastic     I loovee when people text back  unamused_face    
1   sarcastic  Don't you love it when your parents are Pissed...   
2   sarcastic      So many useless classes , great to be student   
3   sarcastic  Oh how I love getting home from work at am and...   
4   sarcastic  I just love having grungy ass hair expressionl...   
5   sarcastic  Thank you , random guy , for sneaking up behin...   
6   sarcastic  Being half spanish and not being able to speak...   
7   sarcastic  I know no one will remember it broken_heart   ...   
8   sarcastic  Anyone in Chem that has to do OWLs knows my pa...   
9   sarcastic                          Holy crap I look great      
10  sarcastic      I love doing 20 sprints in 10 degree weather    
11  sarcastic  feeling like a million bucks after that chem 2...   
12  sarcastic  I love working in Sydney river It makes me wan...   
13  sarcastic  I love hearing things about me th

## Character Tokenization

In [78]:
# Function to tokenize text into words using TweetTokenizer
def tokenize_text(text):
    return tokenizer.tokenize(text)

# sent_tokenize
def tokenize_sentences(text):
    return sent_tokenize(text)

# characters
def tokenize_characters(text):
    return list(text) 

df = pd.read_csv('Clean_Dataset.csv')

df['word_tokens'] = df['Tweets'].apply(tokenize_text)

df['sentence_tokens'] = df['Tweets'].apply(tokenize_sentences)

df['char_tokens'] = df['Tweets'].apply(tokenize_characters)

print(df.head(15))

df.to_csv('C:\\Users\\HP\\OneDrive\\Desktop\\tokenized_dataset.csv', index=False)


        Label                                             Tweets  \
0   sarcastic     I loovee when people text back  unamused_face    
1   sarcastic  Don't you love it when your parents are Pissed...   
2   sarcastic      So many useless classes , great to be student   
3   sarcastic  Oh how I love getting home from work at am and...   
4   sarcastic  I just love having grungy ass hair expressionl...   
5   sarcastic  Thank you , random guy , for sneaking up behin...   
6   sarcastic  Being half spanish and not being able to speak...   
7   sarcastic  I know no one will remember it broken_heart   ...   
8   sarcastic  Anyone in Chem that has to do OWLs knows my pa...   
9   sarcastic                          Holy crap I look great      
10  sarcastic      I love doing 20 sprints in 10 degree weather    
11  sarcastic  feeling like a million bucks after that chem 2...   
12  sarcastic  I love working in Sydney river It makes me wan...   
13  sarcastic  I love hearing things about me th
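
Character tokens are usually mapped to integer ids before they reach a model. The following sketch shows one possible way to build such a mapping; `build_char_vocab` and `encode_chars` are hypothetical helper names, the two-line corpus is a shortened assumption, and reserving id 0 for unseen characters is a design choice made for this example.

```python
def build_char_vocab(texts):
    """Map every distinct character to an integer id, starting at 1."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: idx for idx, ch in enumerate(chars, start=1)}

def encode_chars(text, vocab, unk_id=0):
    """Replace each character by its id; unseen characters get unk_id."""
    return [vocab.get(ch, unk_id) for ch in text]

corpus = ["Holy crap I look great", "So many useless classes"]
vocab = build_char_vocab(corpus)

ids = encode_chars("Holy cow!", vocab)
print(ids)  # 'w' and '!' never occur in the corpus, so the last two ids are 0
```

Because the vocabulary is just the set of characters seen, it stays very small, at the cost of much longer token sequences than word-level tokenization.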

## Subword Tokenization

In [79]:
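from transformers import BertTokenizer

# Use a separate name so the TweetTokenizer bound to `tokenizer`
# (which tokenize_text relies on) is not overwritten
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_subwords(text):
    return bert_tokenizer.tokenize(text)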

df = pd.read_csv('Clean_Dataset.csv')

df['word_tokens'] = df['Tweets'].apply(tokenize_text)

df['char_tokens'] = df['Tweets'].apply(tokenize_characters)

df['sentence_tokens'] = df['Tweets'].apply(tokenize_sentences)

df['subword_tokens'] = df['Tweets'].apply(tokenize_subwords)

print(df.head(15))

df.to_csv('tokenized_dataset.csv', index=False)

        Label                                             Tweets  \
0   sarcastic     I loovee when people text back  unamused_face    
1   sarcastic  Don't you love it when your parents are Pissed...   
2   sarcastic      So many useless classes , great to be student   
3   sarcastic  Oh how I love getting home from work at am and...   
4   sarcastic  I just love having grungy ass hair expressionl...   
5   sarcastic  Thank you , random guy , for sneaking up behin...   
6   sarcastic  Being half spanish and not being able to speak...   
7   sarcastic  I know no one will remember it broken_heart   ...   
8   sarcastic  Anyone in Chem that has to do OWLs knows my pa...   
9   sarcastic                          Holy crap I look great      
10  sarcastic      I love doing 20 sprints in 10 degree weather    
11  sarcastic  feeling like a million bucks after that chem 2...   
12  sarcastic  I love working in Sydney river It makes me wan...   
13  sarcastic  I love hearing things about me th
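
BERT's tokenizer is based on WordPiece, which splits a word greedily into the longest pieces found in its vocabulary and marks continuation pieces with `##`. The sketch below reproduces that greedy longest-match idea on a tiny, made-up vocabulary. `wordpiece_tokenize` and `toy_vocab` are hypothetical, and the real `bert-base-uncased` vocabulary would produce different pieces.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry '##'
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, assumed purely for illustration.
toy_vocab = {"love", "lo", "##ove", "##e", "un", "##amused"}

print(wordpiece_tokenize("loovee", toy_vocab))    # ['lo', '##ove', '##e']
print(wordpiece_tokenize("unamused", toy_vocab))  # ['un', '##amused']
print(wordpiece_tokenize("xyz", toy_vocab))       # ['[UNK]']
```

This is why a misspelled word such as "loovee" can still be represented: it falls back to smaller known pieces instead of a single unknown token.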

# Lemmatization

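Lemmatization is a process in natural language processing (NLP) that reduces words to their dictionary base form, known as the lemma. The goal is to normalize words so that different inflected forms of the same word are treated as the same token.

Example, for the word "running":

Stemming: "running" -> "run"

Lemmatization: "running" -> "run"

The two agree here, but they work differently: stemming strips suffixes by rule and can produce strings that are not real words, while lemmatization uses a vocabulary and the word's part of speech to return a valid word (for instance, "better" -> "good" when treated as an adjective).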

In NLP tasks such as text classification, sentiment analysis, and information retrieval, lemmatization reduces the size of the vocabulary and can improve model accuracy by normalizing words.
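
The contrast between rule-based stemming and dictionary-based lemmatization can be sketched with a toy example. Everything here is an assumption for illustration: `toy_stem` is a crude suffix stripper (real stemmers such as Porter's apply more careful rules and do return "run" for "running"), and `TOY_LEMMAS` is a hypothetical four-entry dictionary standing in for a lexical resource such as WordNet.

```python
def toy_stem(word):
    """Crude suffix stripping, in the spirit of a rule-based stemmer."""
    for suffix in ("ing", "ies", "es", "s"):
        # Only strip when a stem of at least three letters remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary; a real lemmatizer consults a full lexicon.
TOY_LEMMAS = {"running": "run", "studies": "study", "better": "good", "was": "be"}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return TOY_LEMMAS.get(word, word)

for w in ["running", "studies", "better", "was"]:
    print(f"{w:8} stem={toy_stem(w):7} lemma={toy_lemmatize(w)}")
```

The stemmer yields fragments such as "runn" and "stud" and cannot relate "better" to "good", whereas the lookup returns valid base forms for every entry it knows.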

In [80]:
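from nltk.stem import WordNetLemmatizer

# Download NLTK resources if not already downloaded
#nltk.download('wordnet')
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')  # needed by nltk.pos_tag; newer NLTK releases name it 'averaged_perceptron_tagger_eng'

tokenizer = nltk.tokenize.TweetTokenizer()

lemmatizer = WordNetLemmatizer()

# Map the first letter of a Penn Treebank tag to a WordNet part of speech
# (Penn adjective tags start with 'J', not 'A')
PENN_TO_WORDNET = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

# Function to lemmatize words
def lemmatize_words(tokens):
    lemma_tokens = []
    # Tag the whole tweet at once so the tagger can use the surrounding context
    for token, tag in nltk.pos_tag(tokens):
        pos = PENN_TO_WORDNET.get(tag[0], 'n')  # fall back to noun for other tags
        lemma_tokens.append(lemmatizer.lemmatize(token, pos=pos))

    return lemma_tokens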

df = pd.read_csv('Clean_Dataset.csv')

df['word_tokens'] = df['Tweets'].apply(tokenizer.tokenize)

df['lemmatized_tokens'] = df['word_tokens'].apply(lemmatize_words)

print(df.head(5))

df.to_csv('lemmatized_dataset.csv', index=False)


       Label                                             Tweets  \
0  sarcastic     I loovee when people text back  unamused_face    
1  sarcastic  Don't you love it when your parents are Pissed...   
2  sarcastic      So many useless classes , great to be student   
3  sarcastic  Oh how I love getting home from work at am and...   
4  sarcastic  I just love having grungy ass hair expressionl...   

                                         word_tokens  \
0  [I, loovee, when, people, text, back, unamused...   
1  [Don't, you, love, it, when, your, parents, ar...   
2  [So, many, useless, classes, ,, great, to, be,...   
3  [Oh, how, I, love, getting, home, from, work, ...   
4  [I, just, love, having, grungy, ass, hair, exp...   

                                   lemmatized_tokens  
0  [I, loovee, when, people, text, back, unamused...  
1  [Don't, you, love, it, when, your, parent, be,...  
2  [So, many, useless, class, ,, great, to, be, s...  
3  [Oh, how, I, love, get, home, from, w