NLP Pipeline
What is an NLP Pipeline?
An NLP pipeline is the set of steps followed to build end-to-end NLP software. It consists of the following steps:

- Data Acquisition

- Text Preparation
  - Text Cleanup
  - Basic Preprocessing
  - Advanced Preprocessing

- Feature Engineering

- Modelling
  - Model Building
  - Evaluation

- Deployment
  - Deployment
  - Monitoring
  - Model Update

These steps are essential for creating effective NLP applications, from acquiring and preparing data to engineering features, building and evaluating models, and finally deploying and maintaining the models in a production environment.

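A few points to keep in mind:

- It's not universal: the exact steps depend on the task and the data.
- The pipeline is non-linear: earlier steps are often revisited, for example going back to preprocessing after evaluation.
- This is an ML-based pipeline: a deep learning workflow may look different.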



Detailed explanation of each point in an NLP pipeline:

- Data Acquisition
Data Acquisition is the process of collecting text data for NLP tasks. This can include:

  - Web Scraping: Extracting text data from websites.
  - APIs: Using APIs to gather data from various platforms like Twitter, Reddit, etc.
  - Databases: Retrieving text data from structured databases.
  - Manual Collection: Hand-collecting data, including surveys and interviews.

- Text Preparation
Text Preparation involves cleaning and preprocessing the raw text data to make it suitable for analysis.

  - Text Cleanup
    - Remove Noise: Eliminate irrelevant data such as HTML tags, special characters, and extra spaces.
    - Case Normalization: Convert all text to lowercase or uppercase for consistency.
    - Spelling Correction: Correct common spelling errors to ensure uniformity.

  - Basic Preprocessing
    - Tokenization: Splitting text into words, sentences, or phrases.
    - Stop Words Removal: Removing common words that do not contribute much meaning (e.g., "and", "the").
    - Punctuation Removal: Eliminating punctuation marks to focus on the words.

  - Advanced Preprocessing
    - Lemmatization: Reducing words to their base or dictionary form (e.g., "running" to "run").
    - Stemming: Reducing words to their root form (e.g., "fishing" to "fish").
    - POS Tagging: Identifying parts of speech (nouns, verbs, adjectives, etc.) for each word.
    - Named Entity Recognition (NER): Identifying and classifying named entities (e.g., names of people, organizations, locations).

- Feature Engineering
Feature Engineering involves creating features from text data that can be used for modeling:

  - Bag of Words (BoW): Representing text as a collection of its words.
  - TF-IDF: Weighing the importance of words based on their frequency and uniqueness.
  - Word Embeddings: Representing words as dense vectors (e.g., Word2Vec, GloVe).
  - N-grams: Extracting contiguous sequences of n tokens.
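
To make the feature types above more concrete, here is a minimal sketch using only the Python standard library. The toy documents and the helper names (`bag_of_words`, `ngrams`, `tf_idf`) are made up for illustration, and the TF-IDF formula shown (term count divided by document length, times `log(N / df)`) is only one simple variant; libraries such as scikit-learn use smoothed versions that give different numbers.

```python
import math
from collections import Counter

# Toy corpus (hypothetical example documents, already lower-cased)
docs = [
    "the movie was good",
    "the movie was bad",
    "good acting and good story",
]
tokenized = [doc.split() for doc in docs]

def bag_of_words(tokens):
    """Count how often each token occurs in one document."""
    return Counter(tokens)

def ngrams(tokens, n):
    """Return the contiguous n-token sequences of a document."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(tokenized_docs):
    """Simple variant: tf = count / doc length, idf = log(N / doc frequency)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    scores = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

bow = bag_of_words(tokenized[2])
bigrams = ngrams(tokenized[0], 2)
weights = tf_idf(tokenized)

print(bow)
print(bigrams)
print(weights[1])
```

In this toy corpus "bad" occurs in only one document, so it gets a higher weight in the second document than "the", which occurs in two.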

- Modelling
Modelling involves building and evaluating machine learning models to perform NLP tasks.

  - Model Building
    - Selecting Algorithms: Choosing appropriate algorithms (e.g., Naive Bayes, SVM, neural networks).
    - Training Models: Feeding the processed text data into the algorithms to train the models.
    - Hyperparameter Tuning: Adjusting model parameters to improve performance.

  - Evaluation
    - Metrics: Using metrics like accuracy, precision, recall, F1-score, etc., to evaluate model performance.
    - Cross-Validation: Using techniques like k-fold cross-validation to assess model reliability and robustness.
    - Intrinsic vs. Extrinsic Evaluation: https://ai.plainenglish.io/nlp-evaluation-intrinsic-vs-extrinsic-assessment-ff1401505631
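
The evaluation metrics listed above can be computed by hand for a binary sentiment task. The following is a rough sketch rather than a replacement for a library implementation; the label lists and the function names are invented for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive="positive"):
    """Precision, recall and F1 for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical true labels and model predictions
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]

acc = accuracy(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(acc, precision, recall, f1)
```

Here three of five predictions are correct (accuracy 0.6), while precision and recall for the positive class are both 2/3 because there is one false positive and one false negative.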

- Deployment
Deployment involves integrating the trained NLP model into a production environment and ensuring it functions correctly.

  - Deployment
    - Integration: Embedding the model into applications or services where it will be used.
    - API Creation: Developing APIs to allow external systems to interact with the model.

  - Monitoring
    - Performance Tracking: Continuously monitoring model performance to detect issues.
    - Error Analysis: Analyzing errors and making necessary adjustments to improve accuracy.

  - Model Update
    - Retraining: Periodically retraining the model with new data to maintain its effectiveness.
    - Versioning: Keeping track of model versions to manage updates and changes efficiently.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
df = pd.read_csv('/content/drive/MyDrive/datasets/IMDB Dataset.csv')

In [None]:
df.head(3)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive


In [None]:
df.shape

(50000, 2)

In [None]:
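# Keep the first 30,000 reviews; .copy() avoids SettingWithCopyWarning on later column assignments
df = df.head(30000).copy()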

In [None]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [None]:
df.shape

(30000, 2)

In [None]:
df.head(3)

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive


In [None]:
df['review'][0]

'one of the other reviewers has mentioned that after watching just 1 oz episode youll be hooked they are right as this is exactly what happened with methe first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the wordit is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to manyaryans muslims gangstas latinos christians italians irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayi would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare forget pretty pictur

In [None]:
import re

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)
x

<re.Match object; span=(0, 17), match='The rain in Spain'>

### Text Preprocessing
Text preprocessing in NLP is the process of cleaning and preparing raw text data for analysis. It includes steps like removing noise (e.g., HTML tags, special characters), normalizing case, correcting spelling errors, tokenizing text into words or sentences, removing stop words, stripping punctuation, and performing lemmatization or stemming to reduce words to their base forms. Advanced preprocessing may involve POS tagging, named entity recognition, and feature extraction techniques such as TF-IDF or word embeddings. This process enhances the quality and performance of NLP models.
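
As a rough illustration of how several of these cleanup steps chain together, the sketch below applies them in a fixed order using only the standard library. The step functions, the `STEPS` list and the sample sentence are assumptions made for this example. The order matters: URLs are removed before punctuation, because stripping punctuation first would turn a URL into a run of letters that the URL pattern no longer matches.

```python
import re
import string

TAG_RE = re.compile(r"<.*?>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_tags(text):
    return TAG_RE.sub("", text)

def strip_urls(text):
    return URL_RE.sub("", text)

def strip_punctuation(text):
    return text.translate(PUNCT_TABLE)

def collapse_spaces(text):
    return " ".join(text.split())

# Steps run in list order: lowercase, tags, URLs, punctuation, whitespace
STEPS = [str.lower, strip_tags, strip_urls, strip_punctuation, collapse_spaces]

def preprocess(text, steps=STEPS):
    for step in steps:
        text = step(text)
    return text

sample = "A <b>GREAT</b> film!  See https://example.com for more."
cleaned = preprocess(sample)
print(cleaned)
```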

### Lowercasing
Lowercasing refers to the process of converting all characters in a text to lowercase. This standardization helps in reducing the complexity of text data by treating words with different cases (e.g., "Apple" and "apple") as the same word, thereby improving the efficiency and accuracy of subsequent text processing and analysis steps. Lowercasing is particularly useful in ensuring uniformity and consistency in the dataset.

In [None]:
df['review'] = df['review'].str.lower()

In [None]:
df['review'][0]

'one of the other reviewers has mentioned that after watching just 1 oz episode youll be hooked they are right as this is exactly what happened with methe first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the wordit is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to manyaryans muslims gangstas latinos christians italians irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayi would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare forget pretty pictur

#### Removing HTML tags using regular expressions
We remove HTML tags from text for several key reasons:

1. Clean Text: HTML tags don't contribute to the actual content, only to its structure and presentation.
2. Normalization: Removing tags helps standardize the text, making it easier to process uniformly.
3. Preprocessing: Tags can interfere with tokenization and other text processing steps.
4. Accuracy: Clean text improves the performance of NLP models by focusing on meaningful content.
5. Consistency: Ensures uniformity across different text sources, simplifying downstream tasks.

In [None]:
import re
def remove_html_tags(text):
    pattern = re.compile("<.*?>")
    return pattern.sub(r"", text)

In [None]:
text = 'this is my <story> and I will complete this <day>'
remove_html_tags(text)

'this is my  and I will complete this '
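
One side effect of replacing tags with an empty string is that words on either side of a tag get glued together, which is likely why fused tokens such as "methe" and "wordit" show up in the cleaned review text. A possible variant, sketched below under that assumption, replaces each tag with a space and also decodes HTML entities with the standard library's `html.unescape`. The function name `strip_html` and the sample string are made up for illustration.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")

def strip_html(text):
    # Replace each tag with a space so neighbouring words stay separated
    no_tags = TAG_RE.sub(" ", text)
    # Decode entities such as &amp; and collapse repeated whitespace
    return " ".join(html.unescape(no_tags).split())

sample = "Tom &amp; Jerry.<br /><br />A classic!"
result = strip_html(sample)
print(result)
```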

In [None]:
df['review']=df['review'].apply(remove_html_tags)

In [None]:
df['review'][0]

'one of the other reviewers has mentioned that after watching just 1 oz episode youll be hooked they are right as this is exactly what happened with methe first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the wordit is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to manyaryans muslims gangstas latinos christians italians irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayi would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare forget pretty pictur

#### Removing URLs
In NLP, removing URLs from text is important for several reasons:

1. Noise Reduction: URLs are often irrelevant to the text's main content and can introduce noise, affecting the quality of text analysis.

2. Normalization: Like HTML tags, URLs can disrupt the uniform processing of text, complicating tokenization and other preprocessing steps.

3. Improved Model Performance: Clean text without URLs helps NLP models focus on meaningful content, leading to better performance.

4. Consistency: Removing URLs ensures a consistent text format across different sources, simplifying text processing and analysis.

5. Privacy and Security: URLs can contain sensitive information or lead to security risks, so removing them helps in maintaining privacy and security.

Overall, removing URLs is a standard preprocessing step to ensure cleaner, more consistent, and useful text for NLP tasks.

In [None]:
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

In [None]:
pattern = re.compile(r'https?://\S+|www\.\S+')

In [None]:
text2 = 'Check out my Instagram https://www.instagram.com/'
text3 = 'Google search here www.google.com'

print(remove_url(text2))
print(remove_url(text3))

Check out my Instagram 
Google search here 


In [None]:
df['review'] = df['review'].apply(remove_url)

In [None]:
df['review'][0]

'one of the other reviewers has mentioned that after watching just 1 oz episode youll be hooked they are right as this is exactly what happened with methe first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the wordit is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to manyaryans muslims gangstas latinos christians italians irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayi would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare forget pretty pictur

#### Removing punctuation (!"#$%&'()*+,-./:;<=>?@[]^_`{|}~)
In NLP, removing punctuation helps:

- Simplify Text: Reduces complexity for processing.
- Normalize Data: Ensures uniform text format.
- Improve Tokenization: Prevents punctuation from affecting word splits.
- Enhance Model Performance: Focuses on meaningful content for better results.
- Reduce Size: Punctuation adds extra characters and tokens, so removing it makes the data smaller.

In [None]:
import string
marks = string.punctuation
marks

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
def remove_punctuation(text):
  for i in marks:
    text = text.replace(i,'')
  return text

In [None]:
text4 = 'this is my favourite work. and $ '
remove_punctuation(text4)

'this is my favourite work and  '
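
The loop above scans the text once per punctuation character. A commonly used alternative is `str.translate` with a deletion table, which removes all of them in a single pass and tends to be noticeably faster on large datasets. The name `remove_punctuation_fast` is only used here to keep it apart from the function above.

```python
import string

# Translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_fast(text):
    return text.translate(PUNCT_TABLE)

result = remove_punctuation_fast('this is my favourite work. and $ ')
print(repr(result))
```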

In [None]:
df['review'] = df['review'].apply(remove_punctuation)

In [None]:
df['review'][0]

'one of the other reviewers has mentioned that after watching just 1 oz episode youll be hooked they are right as this is exactly what happened with methe first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the wordit is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to manyaryans muslims gangstas latinos christians italians irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayi would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare forget pretty pictur

### Common chat abbreviations and slang
Handling chat words in NLP is crucial for several reasons:

1. Improved Understanding: Expanding chat abbreviations helps models better understand the content.
2. Contextual Accuracy: Many chat words affect sentiment, tone, or intent (e.g., "LOL" vs. "Laughing Out Loud").
3. Data Normalization: Ensures uniformity and consistency in text data, simplifying processing and analysis.
4. Enhanced Model Training: Models trained on expanded forms of chat words perform more accurately.
5. Sentiment Analysis: Properly handling chat words ensures more accurate sentiment detection (e.g., "LMAO" indicates strong amusement).
6. Readability: Expanded chat words are clearer for both humans and NLP tasks like summarization or translation.

In [None]:
chat_word = {
    'AFAIK': 'As Far As I Know',
    'AFK': 'Away From Keyboard',
    'ASAP': 'As Soon As Possible',
    'ATK': 'At The Keyboard',
    'ATM': 'At The Moment',
    'A3': 'Anytime, Anywhere, Anyplace',
    'BAK': 'Back At Keyboard',
    'BBL': 'Be Back Later',
    'BBS': 'Be Back Soon',
    'BFN': 'Bye For Now',
    'B4N': 'Bye For Now',
    'BRB': 'Be Right Back',
    'BRT': 'Be Right There',
    'BTW': 'By The Way',
    'B4': 'Before',
    'CU': 'See You',
    'CUL8R': 'See You Later',
    'CYA': 'See You',
    'FAQ': 'Frequently Asked Questions',
    'FC': 'Fingers Crossed',
    'FWIW': "For What It's Worth",
    'FYI': 'For Your Information',
    'GAL': 'Get A Life',
    'GG': 'Good Game',
    'GN': 'Good Night',
    'GMTA': 'Great Minds Think Alike',
    'GR8': 'Great!',
    'G9': 'Genius',
    'IC': 'I See',
    'ICQ': 'I Seek you (also a chat program)',
    'ILU': 'I Love You',
    'IMHO': 'In My Honest/Humble Opinion',
    'IMO': 'In My Opinion',
    'IOW': 'In Other Words',
    'IRL': 'In Real Life',
    'KISS': 'Keep It Simple, Stupid',
    'LDR': 'Long Distance Relationship',
    'LMAO': 'Laugh My A.. Off',
    'LOL': 'Laughing Out Loud',
    'LTNS': 'Long Time No See',
    'L8R': 'Later',
    'MTE': 'My Thoughts Exactly',
    'M8': 'Mate',
    'NRN': 'No Reply Necessary',
    'OIC': 'Oh I See',
    'PITA': 'Pain In The A..',
    'PRT': 'Party',
    'PRW': 'Parents Are Watching',
    'QPSA?': 'Que Pasa?',
    'ROFL': 'Rolling On The Floor Laughing',
    'ROFLOL': 'Rolling On The Floor Laughing Out Loud',
    'ROTFLMAO': 'Rolling On The Floor Laughing My A.. Off',
    'SK8': 'Skate',
    'STATS': 'Your sex and age',
    'ASL': 'Age, Sex, Location',
    'THX': 'Thank You',
    'TTFN': 'Ta-Ta For Now!',
    'TTYL': 'Talk To You Later',
    'U': 'You',
    'U2': 'You Too',
    'U4E': 'Yours For Ever',
    'WB': 'Welcome Back',
    'WTF': 'What The F...',
    'WTG': 'Way To Go!',
    'WUF': 'Where Are You From?',
    'W8': 'Wait...',
    '7K': 'Sick:-D Laugher',
    'TFW': 'That feeling when',
    'MFW': 'My face when',
    'MRW': 'My reaction when',
    'IFYP': 'I feel your pain',
    'TNTL': 'Trying not to laugh',
    'JK': 'Just kidding',
    'IDC': "I don't care",
    'ILY': 'I love you',
    'IMU': 'I miss you',
    'ADIH': 'Another day in hell',
    'ZZZ': 'Sleeping, bored, tired',
    'WYWH': 'Wish you were here',
    'TIME': 'Tears in my eyes',
    'BAE': 'Before anyone else',
    'FIMH': 'Forever in my heart',
    'BSAAW': 'Big smile and a wink',
    'BWL': 'Bursting with laughter',
    'BFF': 'Best friends forever',
    'CSL': "Can't stop laughing"
}

In [None]:

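tsting = 'this is my BFF and CSL'

def convert_chatword(text):
    # Replace whole words only: str.replace would also rewrite substrings
    # (for example the 'U' inside 'USA'). The keys are upper-case, so text
    # that has already been lower-cased is left unchanged.
    return " ".join(chat_word.get(word, word) for word in text.split())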

print(convert_chatword(tsting))
print(convert_chatword("LOL I will BRB"))

this is my Best friends forever and Can't stop laughing
Laughing Out Loud I will Be Right Back


In [None]:
df['review']=df['review'].apply(convert_chatword)

In [None]:
df['review'][0]

'one of the other reviewers has mentioned that after watching just 1 oz episode youll be hooked they are right as this is exactly what happened with methe first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the wordit is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to manyaryans muslims gangstas latinos christians italians irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayi would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare forget pretty pictur

In [None]:
# Named `sentence` rather than `string` so the imported string module is not shadowed
sentence = 'this is yes and this is no'

mapp = {'yes': '1', 'no': '0'}

# Applying the mapping using a loop and replace
for key, value in mapp.items():
    sentence = sentence.replace(key, value)

print(sentence)


this is 1 and this is 0


In [None]:
import pandas as pd

mapp = {'yes':1, 'no':0}
data = pd.DataFrame({'name':['ritik', 'tanuj'], 'status':['yes', 'no']})

# Applying the mapping using replace
data['status'] = data['status'].replace(mapp)
print(data)



mapp = {'yes':1, 'no':0}
data = pd.DataFrame({'name':['ritik', 'tanuj'], 'status':['yes', 'no']})

# Applying the mapping
data['status'] = data['status'].map(mapp)
print(data)


    name  status
0  ritik       1
1  tanuj       0
    name  status
0  ritik       1
1  tanuj       0


### Spelling correction
Spelling correction in NLP is done to improve text quality and ensure accurate analysis. Correcting spelling errors helps in:

- Enhanced Understanding: Ensures that words are recognized correctly by NLP models.
- Data Consistency: Maintains uniformity in text data.
- Improved Model Performance: Reduces noise, leading to better model training and predictions.
- Accurate Results: Improves the accuracy of tasks like sentiment analysis, information retrieval, and machine translation.

In [None]:
from textblob import TextBlob
incorrect_text = "Ths is an exmple of a sentnce with sevral speling erors."

textblb = TextBlob(incorrect_text)
textblb.correct().string

'The is an example of a sentence with several spelling errors.'

In [None]:
def spelling_correct(text):
  text = TextBlob(text)
  text = text.correct().string
  return text

In [None]:

incorrect = 'This is my example wih erors and mistoke bot corect '
spelling_correct(incorrect)

'His is my example with errors and mistake not correct '

### Removing StopWords
Removing stop words in NLP text processing is like cleaning up unnecessary words like "the", "is", and "and" from sentences. These words appear frequently in language but don't add much meaning. By getting rid of them, we focus more on the important words that carry the actual message, making our analysis faster and more accurate. It's like decluttering a room so you can see and understand the important things better.

In [None]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
print(stopwords.words('english'))


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
words = stopwords.words('english')

In [None]:

# This version is very slow: stopwords.words('english') is reloaded for every single word.
# def remove_stopwords(text):
#     new_text=[]
#     for word in text.split():
#         if word in stopwords.words('english'):
#             new_text.append('')
#         else:
#             new_text.append(word)

#     x = new_text[:]  # Create a copy of new_text
#     new_text.clear()  # Clear the original new_text list
#     return " ".join(x)  # Join the copied list x into a single string separated by spaces and return it


In [None]:
from nltk.corpus import stopwords

# Build the set once; membership checks on a set are fast
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    new_text = [word for word in text.split() if word not in stop_words]
    return " ".join(new_text)

# Example usage:
# text = "This is a sample text with some stopwords."
# clean_text = remove_stopwords(text)
# print(clean_text)
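
One caveat for sentiment data: the NLTK English list printed above contains negations such as "no", "nor" and "not", so removing every stop word can flip the apparent meaning of a review. The sketch below shows one possible way to keep negations. The small `STOP_WORDS` set is a hypothetical stand-in for the full NLTK list, and `NEGATIONS` is an assumption about which words are worth keeping.

```python
# Small hypothetical stop-word list; NLTK's English list is much longer
STOP_WORDS = {"this", "is", "a", "the", "was", "and", "no", "nor", "not"}
# Words to keep even though they appear in the stop-word list
NEGATIONS = {"no", "nor", "not", "never"}

def remove_stopwords_keep_negations(text, stop_words=STOP_WORDS, keep=NEGATIONS):
    effective = stop_words - keep
    return " ".join(w for w in text.split() if w not in effective)

review = "the movie was not good"
plain = " ".join(w for w in review.split() if w not in STOP_WORDS)
kept = remove_stopwords_keep_negations(review)

print(plain)  # negation is lost
print(kept)   # negation is preserved
```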


In [None]:
df['review']=df['review'].apply(remove_stopwords)

In [None]:
df['review'][0]

'one reviewers mentioned watching 1 oz episode youll hooked right exactly happened methe first thing struck oz brutality unflinching scenes violence set right word go trust show faint hearted timid show pulls punches regards drugs sex violence hardcore classic use wordit called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home manyaryans muslims gangstas latinos christians italians irish moreso scuffles death stares dodgy dealings shady agreements never far awayi would say main appeal show due fact goes shows wouldnt dare forget pretty pictures painted mainstream audiences forget charm forget romanceoz doesnt mess around first episode ever saw struck nasty surreal couldnt say ready watched developed taste oz got accustomed high levels graphic violence violence injustice crooked guards wholl sold nickel inmates wholl kill order get away well mannered middle 

In [None]:
df.shape

(30000, 2)

### Tokenization
What is Tokenization?
Tokenization is the process of breaking down text into smaller pieces called tokens. These tokens can be words, phrases, or even individual characters, depending on the application. Think of it like cutting a paragraph into smaller, manageable parts.

Example:
Imagine you have this sentence:

I love eating pizza!

After tokenization, it becomes:

["I", "love", "eating", "pizza", "!"]

Each word and punctuation mark becomes a separate token.

Why Do We Use Tokenization in NLP?
- Easier Analysis: Breaking text into tokens makes it easier to analyze. It's like reading a book one word at a time instead of trying to understand it all at once.

- Understanding Context: It helps in understanding the context of each word in a sentence. For example, knowing that "love" is followed by "eating" gives a clear picture of the meaning.

- Efficient Processing: Computers can process and analyze tokens more efficiently than long strings of text. It speeds up tasks like searching for specific words or understanding the structure of sentences.

- Building Blocks for NLP Tasks: Tokenization is the first step for many NLP tasks like sentiment analysis, translation, and text summarization. It prepares the text for more complex processing.

In short, tokenization helps break down text into smaller, understandable parts, making it easier for computers to analyze and work with.

1. Split function

In [None]:
# word tokenization
sent1 = 'I am from mumbai'
sent1.split()

['I', 'am', 'from', 'mumbai']

In [None]:
# sentence tokenization
sent2 = 'I am going to delhi. I will stay there for 3 days. Let\'s hope the trip to be great'
sent2.split('.')

['I am going to delhi',
 ' I will stay there for 3 days',
 " Let's hope the trip to be great"]

In [None]:
# Problems with split function
sent3 = 'I am going to delhi!!!!'
sent3.split()

['I', 'am', 'going', 'to', 'delhi!!!!']

2. Regular Expression

In [None]:
import re
sent3 = 'I am going to delhi!'
# Raw string so that \w is passed to the regex engine without an invalid-escape warning
tokens = re.findall(r"[\w']+", sent3)
tokens

['I', 'am', 'going', 'to', 'delhi']
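
The pattern above drops punctuation entirely. If punctuation should be kept as separate tokens, as in the "I love eating pizza!" example, a slightly different pattern can be used. The sketch below is one possible approach with the standard library only; the function names are made up, and the sentence splitter is deliberately naive (abbreviations such as "Dr." would be split incorrectly).

```python
import re

def word_tokens(text):
    # \w+(?:'\w+)? keeps contractions like "Let's" together;
    # [^\w\s] emits each punctuation mark as its own token
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def sentence_tokens(text):
    # Split after ., ! or ? when followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(word_tokens("I love eating pizza!"))
print(sentence_tokens("I am going to delhi. I will stay there for 3 days. Let's hope the trip to be great"))
```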

3. NLTK

In [None]:
import nltk
from nltk.tokenize import word_tokenize

# Download the tokenizer data (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

# Sentence to tokenize
sent1 = 'I am going to visit Delhi!'

# Tokenize the sentence
tokens = word_tokenize(sent1)
print(tokens)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['I', 'am', 'going', 'to', 'visit', 'Delhi', '!']


### Stemming
What is Stemming?
Stemming is the process of reducing words to their base or root form. It's like finding the "stem" of a word, which can help us understand different variations of the same word.

Example:
Imagine you have these words:

running, runner, runs, ran

Why Do We Use Stemming in NLP?
- Simplifies Text: Stemming simplifies words to their root form, which makes it easier to analyze text. For instance, "running" and "runs" are different forms of the same word, and stemming helps treat them as one.

- Reduces Complexity: By converting different forms of a word to a common base, stemming reduces the number of unique words in a text. This makes the analysis more manageable and less complex.

- Improves Search Results: In tasks like search engines or information retrieval, stemming helps find relevant documents by matching different word forms. For example, searching for "run" will also return results for "running" and "runs".

- Consistent Analysis: It ensures that variations of a word are consistently analyzed together, improving the accuracy of tasks like text classification, sentiment analysis, and topic modeling.

With the Porter stemmer, "running" and "runs" are both reduced to the common stem "run", so your program understands that these words refer to the same activity. Irregular forms such as "ran" are not handled by suffix stripping and stay unchanged, and "runner" also keeps its form.

Stemming helps simplify and standardize words in text, making it easier for computers to analyze and understand different forms of words as part of the same concept.

What is a Stemmer?

A stemmer is a tool in NLP that reduces words to their root form or base form. This helps in simplifying and standardizing words for easier analysis.

- PorterStemmer:
  - Developed by Martin Porter in 1980.
  - One of the oldest and most widely used stemming algorithms.
  - Uses a set of rules to iteratively strip suffixes from words.
  - Known for its simplicity and efficiency.

In [None]:
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [None]:
sample = "running run runs runned"
stem_words(sample)

'run run run run'

Disadvantages of Stemming
1. Over-Simplification: Stemming can sometimes be too aggressive, reducing words to forms that are not real words (e.g., the Porter stemmer turns "exactly" into "exactli" and "violence" into "violenc").

2. Loss of Meaning: Important nuances and meanings might be lost when different words are reduced to the same stem (e.g., "university" and "universe" both becoming "univers").

3. Inconsistency: Different stemming algorithms might produce different results for the same word, leading to inconsistency in text analysis.

4. Language Limitations: Some stemmers are designed for specific languages and might not work well with others.

While stemming helps in simplifying text, it can sometimes go too far, losing important details and creating inconsistencies.

### Lemmatization
What is Lemmatization?
Lemmatization is the process of reducing words to their base or dictionary form, known as the lemma. Unlike stemming, which cuts off word endings, lemmatization considers the context and converts words to their actual root form as found in the dictionary.

Example:
The words "running" and "ran" both have the lemma "run", and "better" has the lemma "good".

Why Do We Use Lemmatization in NLP?
1. Accurate Base Forms: It provides accurate base forms of words, maintaining the meaning. For example, "better" becomes "good," which is its true lemma.

2. Improves Understanding: Helps in understanding the text better by converting words to their proper form, making it easier for NLP models to analyze.

3. Consistent Analysis: Ensures consistency in text analysis by using standardized forms of words.

What is a Lemma?
A lemma is the base or dictionary form of a word. For instance, the lemma of "running" and "ran" is "run."

Lemmatization is like looking up the correct word form in the dictionary. It helps computers understand and process text more accurately by converting words to their true base form. This way, words like "running" and "ran" are understood to be the same action, "run".
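
To contrast the two ideas, here is a toy sketch: a crude suffix stripper standing in for a stemmer, and a dictionary lookup standing in for a lemmatizer. Both functions and the tiny `LEMMAS` table are hypothetical and only illustrate the difference; in practice a library lemmatizer such as NLTK's `WordNetLemmatizer` would be used with a full lexicon.

```python
# Hypothetical mini lemma dictionary; real lemmatizers use a full lexicon
LEMMAS = {
    "running": "run",
    "runs": "run",
    "ran": "run",
    "better": "good",
    "mice": "mouse",
}

def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the lemma table, falling back to the word itself."""
    return LEMMAS.get(word, word)

for word in ["running", "runs", "ran", "better", "mice"]:
    print(word, naive_stem(word), toy_lemmatize(word))
```

The suffix stripper produces the non-word "runn" and cannot relate "ran" or "better" to their base forms, while the lookup returns real dictionary words at the cost of needing a lexicon.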

In [None]:
df.head(3)

In [None]:
# stemming
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
def stem(text):
  y = []
  for i in text.split():
    y.append(ps.stem(i))
  return " ".join(y)

In [None]:
# from nltk.stem.porter import PorterStemmer

# ps = PorterStemmer()

# def stem(text):
#     return " ".join([ps.stem(word) for word in text.split()])

In [None]:
df['review'] = df['review'].apply(stem)

In [None]:
df['review'][0]

'one review mention watch 1 oz episod youll hook right exactli happen meth first thing struck oz brutal unflinch scene violenc set right word go trust show faint heart timid show pull punch regard drug sex violenc hardcor classic use wordit call oz nicknam given oswald maximum secur state penitentari focus mainli emerald citi experiment section prison cell glass front face inward privaci high agenda em citi home manyaryan muslim gangsta latino christian italian irish moreso scuffl death stare dodgi deal shadi agreement never far awayi would say main appeal show due fact goe show wouldnt dare forget pretti pictur paint mainstream audienc forget charm forget romanceoz doesnt mess around first episod ever saw struck nasti surreal couldnt say readi watch develop tast oz got accustom high level graphic violenc violenc injustic crook guard wholl sold nickel inmat wholl kill order get away well manner middl class inmat turn prison bitch due lack street skill prison experi watch oz may becom c

In [None]:
df.head(4)

Unnamed: 0,review,sentiment
0,one review mention watch 1 oz episod youll hoo...,positive
1,wonder littl product film techniqu unassum old...,positive
2,thought wonder way spend time hot summer weeke...,positive
3,basic there famili littl boy jake think there ...,negative


### Vectorization

In [None]:
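# Total number of characters in the data: len() on a string counts characters,
# not words (a word count would use len(i.split()) instead).
total = []

for i in df['review']:
  total.append(len(i))
sum(total)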

22301384

### Append all the data into corpus

In [None]:
# One cleaned, stemmed review per list entry
corpus = df['review'].tolist()


In [None]:
corpus[0]

'one review mention watch 1 oz episod youll hook right exactli happen meth first thing struck oz brutal unflinch scene violenc set right word go trust show faint heart timid show pull punch regard drug sex violenc hardcor classic use wordit call oz nicknam given oswald maximum secur state penitentari focus mainli emerald citi experiment section prison cell glass front face inward privaci high agenda em citi home manyaryan muslim gangsta latino christian italian irish moreso scuffl death stare dodgi deal shadi agreement never far awayi would say main appeal show due fact goe show wouldnt dare forget pretti pictur paint mainstream audienc forget charm forget romanceoz doesnt mess around first episod ever saw struck nasti surreal couldnt say readi watch develop tast oz got accustom high level graphic violenc violenc injustic crook guard wholl sold nickel inmat wholl kill order get away well manner middl class inmat turn prison bitch due lack street skill prison experi watch oz may becom c

In [None]:
len(corpus)

30000

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=10000)

In [None]:
vectors = cv.fit_transform(corpus).toarray()


In [None]:
vectors.shape

(30000, 10000)

In [None]:
vectors[0:10]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [None]:
cv.get_feature_names_out()

array(['007', '010', '10', ..., 'zu', 'zucco', 'zucker'], dtype=object)
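
To make the shape `(30000, 10000)` less abstract, here is a plain-Python sketch of what a bag-of-words vectorizer computes. It only approximates `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters, `max_features` keeping the most frequent terms) and is not scikit-learn's implementation; the function names and the toy corpus are made up for illustration.

```python
import re
from collections import Counter

# Approximates the default tokenization: lowercase, then keep runs of
# two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(docs, max_features=None):
    """Keep the most frequent terms in the corpus, indexed alphabetically."""
    totals = Counter()
    for doc in docs:
        totals.update(tokenize(doc))
    terms = [term for term, _ in totals.most_common(max_features)]
    return {term: idx for idx, term in enumerate(sorted(terms))}

def transform(docs, vocabulary):
    """Turn each document into a row of term counts (one column per term)."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(term, 0) for term in vocabulary])
    return rows

toy_corpus = ["good movi good act", "bad movi", "good stori bad end"]
vocab = fit_vocabulary(toy_corpus, max_features=3)
matrix = transform(toy_corpus, vocab)
print(vocab)   # {'bad': 0, 'good': 1, 'movi': 2}
print(matrix)  # [[0, 2, 1], [1, 0, 1], [1, 1, 0]]
```

Each row is one review and each column one vocabulary term, so words outside the kept vocabulary ("act", "stori", "end") simply disappear from the representation.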

In [None]:
df.head(2)

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(df['sentiment'])
y

array([1, 1, 1, ..., 0, 1, 0])

In [None]:
y.shape

(30000,)
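
The integer codes follow from the fact that `LabelEncoder` sorts the distinct labels before numbering them. A minimal plain-Python sketch of that mapping, using a made-up label list:

```python
# Hypothetical labels, for illustration only.
labels = ["positive", "negative", "positive", "negative", "negative"]

# Sorted unique labels play the role of LabelEncoder's classes_ attribute.
classes = sorted(set(labels))
to_code = {label: code for code, label in enumerate(classes)}

encoded = [to_code[label] for label in labels]
decoded = [classes[code] for code in encoded]

print(to_code)  # {'negative': 0, 'positive': 1}
print(encoded)  # [1, 0, 1, 0, 0]
```

So in this notebook 0 stands for a negative review and 1 for a positive one.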

In [None]:
df1 = df.copy()

In [None]:
df1['sentiment'] = y

In [None]:
df1.head(2)

In [None]:
df1.shape

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(vectors,y,test_size=0.2,random_state=44)
x_train.shape,x_test.shape,y_train.shape,y_test.shape

((24000, 10000), (6000, 10000), (24000,), (6000,))

### Naive Bayes Model

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

In [None]:
mnb = MultinomialNB()
mnb.fit(x_train,y_train)
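
For intuition about what the fit step estimates, the following is a from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing on a tiny hand-made count matrix. It is an illustration under simplifying assumptions, not scikit-learn's code; the function names and toy data are hypothetical.

```python
import math

def fit_multinomial_nb(rows, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log term probabilities."""
    classes = sorted(set(labels))
    n_features = len(rows[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_rows = [row for row, label in zip(rows, labels) if label == c]
        log_prior[c] = math.log(len(class_rows) / len(rows))
        # Count of each term within the class, plus alpha for smoothing.
        term_totals = [sum(row[j] for row in class_rows) + alpha
                       for j in range(n_features)]
        denom = sum(term_totals)
        log_likelihood[c] = [math.log(t / denom) for t in term_totals]
    return log_prior, log_likelihood

def predict(row, log_prior, log_likelihood):
    """Pick the class with the highest log posterior score."""
    scores = {
        c: log_prior[c] + sum(x * lp for x, lp in zip(row, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Columns: counts of "bad", "good", "movi"; labels: 1 = positive, 0 = negative.
rows = [[0, 2, 1], [1, 0, 1], [2, 0, 0], [0, 1, 1]]
labels = [1, 0, 0, 1]
prior, likelihood = fit_multinomial_nb(rows, labels)
print(predict([0, 1, 0], prior, likelihood))  # 1
print(predict([1, 0, 0], prior, likelihood))  # 0
```

Smoothing matters here because "good" never appears in a negative training row; without `alpha`, its probability in that class would be zero and the logarithm undefined.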

### Test model using Test Data

In [None]:
pred = mnb.predict(x_test)

### Check accuracy score, confusion matrix, and classification report

In [None]:
print(accuracy_score(y_test,pred))
print(confusion_matrix(y_test,pred))
print(classification_report(y_test,pred))

0.8465
[[2579  435]
 [ 486 2500]]
              precision    recall  f1-score   support

           0       0.84      0.86      0.85      3014
           1       0.85      0.84      0.84      2986

    accuracy                           0.85      6000
   macro avg       0.85      0.85      0.85      6000
weighted avg       0.85      0.85      0.85      6000
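
The numbers in the report can be re-derived by hand from the confusion matrix. The sketch below copies the matrix from the output above and assumes scikit-learn's layout (rows are actual classes, columns are predicted classes).

```python
# Confusion matrix copied from the output above:
# rows = actual class, columns = predicted class.
cm = [[2579, 435],
      [486, 2500]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def precision(cls):
    """Of everything predicted as cls, how much really was cls."""
    predicted_as_cls = cm[0][cls] + cm[1][cls]
    return cm[cls][cls] / predicted_as_cls

def recall(cls):
    """Of everything that really was cls, how much was found."""
    return cm[cls][cls] / sum(cm[cls])

def f1(cls):
    p, r = precision(cls), recall(cls)
    return 2 * p * r / (p + r)

print(round(accuracy, 4))  # 0.8465
for cls in (0, 1):
    print(cls, round(precision(cls), 2), round(recall(cls), 2), round(f1(cls), 2))
```

For class 0 this gives precision 0.84, recall 0.86, and F1 0.85; for class 1 it gives 0.85, 0.84, and 0.84, matching the report.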



### Actual and Predicted values

In [None]:
data = pd.DataFrame(np.c_[y_test,pred],columns=['Actual','Predicted'])

In [None]:
data

Unnamed: 0,Actual,Predicted
0,0,0
1,0,0
2,1,1
3,0,0
4,1,1
...,...,...
5995,1,1
5996,0,0
5997,0,0
5998,0,0


### Save the model

In [None]:
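import pickle
# Save the fitted vectorizer and model; "with" closes each file after writing.
with open('count_vectorizer.pkl','wb') as f:
  pickle.dump(cv,f)
with open('movies_review_classification.pkl','wb') as f:
  pickle.dump(mnb,f)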

### Load the model

In [None]:
import pickle

In [None]:
with open('count_vectorizer.pkl','rb') as f:
  save_cv = pickle.load(f)
with open('movies_review_classification.pkl','rb') as f:
  model = pickle.load(f)

### Test model

In [None]:
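def test_model(sentence):
  # The sentence should first get the same cleaning and stemming as the
  # training reviews; otherwise words may miss the stemmed vocabulary.
  sen = save_cv.transform([sentence]).toarray()
  res = model.predict(sen)[0]
  # LabelEncoder sorts classes alphabetically: 0 = negative, 1 = positive.
  label = 'Positive' if res==1 else 'Negative'
  print(label)
  return label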

### Testing the review

In [None]:
sent = 'I did not like last part but overall movie was decent'
res = test_model(sent)

Negative
