Steps for Model Building
1. Data Ingestion
2. Data Preprocessing
2.1 Cleaning
Lowercasing
Remove HTML
Remove URLs
Remove stopwords
Stemming/Lemmatization (finding root word)
Tokenization
Remove special characters
Remove white spaces
Emoji removal
2.2 Part of Speech (POS) Tagging and Chunking
POS tagging
Chunking
2.3 Encoding
One-Hot Encoding
Label Encoding
TF-IDF
2.4 Embedding
Word2Vec
3. Model Building
4. Model Evaluation

In [7]:
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv')

In [8]:
data.tail()

Unnamed: 0,review,sentiment
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative
49999,No one expects the Star Trek movies to be high...,negative


In [9]:
data.sample(5)

Unnamed: 0,review,sentiment
22686,The China Syndrome is a perfectly paced thrill...,positive
20830,Jennifer Jason Leigh and Mare Winningham are a...,negative
27969,Nowadays it is sort of a trend to look upon al...,negative
25063,"I am not a fan of musicals, but I am a huge fa...",positive
7256,"For Romance's sake, as a married man. The foll...",positive


In [14]:
data["review"][3].lower()

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.<br /><br />ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

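The lowercased review above still contains `<br />` tags. The following is a minimal sketch, using only the standard library, of stripping tags and URLs while the angle brackets, colons, and slashes they rely on are still present (that is, before any punctuation removal). The helper name `strip_markup` and the sample string are illustrative assumptions, not part of the notebook.

```python
import re

# Patterns are deliberately simple; real HTML may need a proper parser.
TAG_RE = re.compile(r"<[^>]+>")                 # e.g. <br />
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # http(s) links and bare www. links

def strip_markup(text):
    """Replace HTML tags and URLs with a space, then collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample modelled on the review shown above
sample = "time.<br /><br />this movie is slower, see www.example.com for more"
print(strip_markup(sample))
# time. this movie is slower, see for more
```

Replacing each match with a space rather than an empty string keeps the words on either side of a tag from being glued together.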
In [None]:
# Expand chat abbreviations (see normalize_chat_slang in the next cell)


In [None]:
import re
import string
import emoji

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
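import nltk

# Download the NLTK resources used below: tokenizer models, stopword list, WordNet.
# Recent NLTK releases may additionally require nltk.download('punkt_tab') for word_tokenize.
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')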

lemmatizer = WordNetLemmatizer()

# Lemmatization maps each token to its dictionary form (lemma).
# It is usually more accurate than stemming, but slower.
def lemmatize_text(text):
    tokens = word_tokenize(text)
    lemmatized_text = [lemmatizer.lemmatize(token) for token in tokens]
    return " ".join(lemmatized_text)

# Lookup table of chat abbreviations (keys must be upper case)
chat_words = {
    "AFAIK": "As Far As I Know",
    "AFK": "Away From Keyboard",
    "ASAP": "As Soon As Possible",
    "BTW": "By The Way",
    "B4": "Before",
    "LMAO": "Laugh My A.. Off",
    "FYI": "For Your Information"
}

def normalize_chat_slang(text):
    new_text=[]
    for w in text.split():
        if w.upper() in chat_words:
            new_text.append(chat_words[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word.lower() not in stop_words]
    return " ".join(filtered_text)

def clean_up(text):
    # Strip markup first: these patterns need '<', '>', ':' and '/' to still be present
    text = re.sub(r'<.*?>', ' ', text)  # Remove HTML tags such as <br />
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)  # Remove URLs
    text = emoji.demojize(text, delimiters=(' ', ' '))  # Replace emojis with their text names
    text = normalize_chat_slang(text)  # Expand chat abbreviations
    text = text.lower()  # Lowercase, including the expanded abbreviations
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    text = lemmatize_text(text)  # Lemmatize the text
    text = remove_stopwords(text)  # Remove stopwords
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse extra whitespace
    return text

# Apply the clean_up function to the "review" column
data["review"] = data["review"].apply(lambda x: clean_up(str(x)))
print(data["review"])


0        one reviewer mentioned watching 1 oz episode y...
1        wonderful little production filming technique ...
2        thought wonderful way spend time hot summer we...
3        basically family little boy jake think zombie ...
4        petter matteis love time money visually stunni...
                               ...                        
49995    thought movie right good job wasnt creative or...
49996    bad plot bad dialogue bad acting idiotic direc...
49997    catholic taught parochial elementary school nu...
49998    im going disagree previous comment side maltin...
49999    one expects star trek movie high art fan expec...
Name: review, Length: 50000, dtype: object


In [54]:
# Automatic spelling correction with TextBlob
from textblob import TextBlob

def correct_spellings(text):
    # correct() returns a new TextBlob; convert it back to a plain string
    return str(TextBlob(text).correct())

print(correct_spellings("my namee is david"))

my name is david


In [None]:
# Remove stopwords (standalone version of the step already used in clean_up)

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word.lower() not in stop_words]
    return " ".join(filtered_text)

data["review"] = data["review"].apply(lambda x: remove_stopwords(str(x)))


In [None]:
# Tokenize using spaCy

import spacy

# Download the model if not already installed
!python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def tokenize_text(text):
    doc = nlp(text)
    tokens = [token.text for token in doc]
    return tokens

#Example usage
text = "Hello, how are you?"
tokens = tokenize_text(text)
print(tokens)

#['Hello', ',', 'how', 'are', 'you', '?']

['Hello', ',', 'how', 'are', 'you', '?']
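For comparison, a rough dependency-free approximation of the tokenization above can be sketched with a single regular expression. This is not how spaCy tokenizes; it only splits runs of word characters from individual punctuation marks, and the name `simple_tokenize` is a hypothetical helper introduced here for illustration.

```python
import re

# One token per run of word characters, or per single non-space punctuation mark
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens (no language model involved)."""
    return TOKEN_RE.findall(text)

print(simple_tokenize("Hello, how are you?"))
# ['Hello', ',', 'how', 'are', 'you', '?']
```

On this sentence the result matches the spaCy output, but the two will generally diverge on contractions, abbreviations, URLs, and similar cases where spaCy applies language-specific rules.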


In [None]:
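# Lemmatization vs. stemming using NLTK

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Lemmatization returns a dictionary word (lemma) and is usually the more accurate option.
# Without a POS tag, WordNetLemmatizer treats every token as a noun,
# which is why "running" and "ran" are left unchanged in the example below.
def lemmatize_text(text):
    tokens = word_tokenize(text)
    lemmatized_text = [lemmatizer.lemmatize(token) for token in tokens]
    return " ".join(lemmatized_text)

# Stemming strips suffixes by rule: it is faster, but the result may not be a real word
# and irregular forms such as "ran" are not reduced.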

def stem_text(text):
    tokens = word_tokenize(text)
    stemmed_text = [stemmer.stem(token) for token in tokens]
    return " ".join(stemmed_text)

# Example usage
text = "running ran runs"
lemmatized = lemmatize_text(text)
stemmed = stem_text(text)

print("Lemmatized:", lemmatized)
print("Stemmed:", stemmed)

# Lemmatized: running ran run
# Stemmed: run ran run

Lemmatized: running ran run
Stemmed: run ran run
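
Step 2.3 (Encoding) of the outline is not implemented in the cells above. As a minimal sketch of what label encoding and TF-IDF compute, the example below uses only the standard library and a tiny made-up corpus; `docs`, `labels`, and `tf_idf` are illustrative assumptions, not the notebook's data or code. It uses the plain textbook weighting tf × log(N / df); library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus standing in for cleaned reviews (hypothetical data)
docs = ["bad plot bad acting", "good plot good acting", "wonderful little production"]
labels = ["negative", "positive", "positive"]

# Label encoding: map each class name to an integer id (alphabetical order)
label_to_id = {label: i for i, label in enumerate(sorted(set(labels)))}
y = [label_to_id[label] for label in labels]

def tf_idf(corpus):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

weights = tf_idf(docs)
print(y)           # [0, 1, 1]
print(weights[0])  # "bad" outweighs "plot", which also occurs in another document
```

Terms that occur in many documents receive a low weight, while terms concentrated in a few documents receive a high one, which is the property a sentiment classifier can exploit.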
