
# Exploring Diverse Methods for Text Classification



## Objective:

The main objective of this project is to explore several methods for text classification and understand their strengths and weaknesses. We classify IMDB movie reviews as positive or negative using bag-of-words, n-gram, TF-IDF, and Word2Vec representations with Gaussian Naive Bayes and Random Forest classifiers.

In [None]:
#loading libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [None]:
#loading dataset
imdb_df = pd.read_csv("https://raw.githubusercontent.com/SK7here/Movie-Review-Sentiment-Analysis/master/IMDB-Dataset.csv")

In [None]:
# viewing the first 5 rows
imdb_df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
#shape of the dataset
imdb_df.shape

(50000, 2)

In [None]:
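# keeping only the first 10,000 reviews for analysis
# .copy() makes df an independent DataFrame, so later in-place edits do not trigger SettingWithCopyWarning
df = imdb_df.iloc[:10000].copy()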

In [None]:
#shape of dataset
df.shape

(10000, 2)

In [None]:
#info of dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     10000 non-null  object
 1   sentiment  10000 non-null  object
dtypes: object(2)
memory usage: 156.4+ KB


#### About the dataset

- The `review` column contains the text of each movie review
- The `sentiment` column contains the label of the review: `positive` or `negative`



In [None]:
# sentiment value counts
df['sentiment'].value_counts()

positive    5028
negative    4972
Name: sentiment, dtype: int64

In [None]:
# checking null values
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [None]:
#checking duplicated values
df.duplicated().sum()

17

In [None]:
# dropping duplicate rows
df.drop_duplicates(inplace=True)

In [None]:
#viewing review
df['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

# Text preprocessing

In [None]:
# function to remove tags
import re
def remove_tags(raw_text):
    cleaned_text = re.sub(re.compile('<.*?>'), '', raw_text)
    return cleaned_text

In [None]:
df['review'] = df['review'].apply(remove_tags)
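As a quick check of the tag-removal step, the sketch below applies the same regular expression to a made-up review string (the sample text and the `replacement` parameter are assumptions for illustration, not part of the notebook). It shows that replacing tags with an empty string glues together the words on either side of `<br /><br />`, while substituting a space keeps them apart.

```python
import re

TAG_PATTERN = re.compile(r"<.*?>")

def remove_tags(raw_text, replacement=""):
    """Strip HTML-like tags; `replacement` is what each tag is swapped for."""
    return TAG_PATTERN.sub(replacement, raw_text)

# Made-up sample in the style of the IMDB reviews
sample = "Great film.<br /><br />Loved it."

glued = remove_tags(sample)         # tags replaced by nothing
spaced = remove_tags(sample, " ")   # tags replaced by a space

print(glued)
print(spaced.split())
```

Whether the glued tokens matter depends on the tokenizer used later; with whitespace splitting, `film.Loved` would be treated as a single token.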

In [None]:
#lowercasing the text
df['review'] = df['review'].apply(lambda x:x.lower())

In [None]:
# removal of stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# a set gives fast membership tests
sw_list = set(stopwords.words('english'))

df['review'] = df['review'].apply(lambda x: " ".join(item for item in x.split() if item not in sw_list))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
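One side effect of stop-word removal is worth keeping in mind for sentiment tasks: NLTK's English list includes negations such as "not" and "no", so they are dropped along with the filler words. The sketch below uses a tiny hand-written stop list (hypothetical, much shorter than NLTK's) and a hypothetical `keep` parameter to show the difference between removing and preserving negations.

```python
# Hypothetical miniature stop list for illustration; NLTK's real list is much longer.
stop_words = {"the", "was", "this", "is", "a", "not"}

def drop_stopwords(text, stop_words, keep=frozenset()):
    """Remove stop words from whitespace-tokenized text, except those in `keep`."""
    return " ".join(w for w in text.split() if w not in stop_words or w in keep)

review = "this movie was not good"

default = drop_stopwords(review, stop_words)                   # negation is lost
keep_negation = drop_stopwords(review, stop_words, keep={"not"})  # negation is kept

print(default)
print(keep_negation)
```

This notebook removes the full NLTK list; keeping negations is only one possible variation and was not evaluated here.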


In [None]:
# checking df after preprocessing
df.head()

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,positive
1,wonderful little production. filming technique...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically there's family little boy (jake) thi...,negative
4,"petter mattei's ""love time money"" visually stu...",positive


In [None]:
# separating features (X) and target (y)
X = df.iloc[:,0:1]
y = df['sentiment']

In [None]:
print(f"{X.shape=}")
print(f"{y.shape=}")

X.shape=(9983, 1)
y.shape=(9983,)


In [None]:
# encoding sentiment column
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

y = encoder.fit_transform(y)

In [None]:
# to check the encoding values
decoded_values = encoder.inverse_transform(y)

print("Encoded Values:", y)
print("Decoded Values:", decoded_values)

Encoded Values: [1 1 1 ... 0 0 1]
Decoded Values: ['positive' 'positive' 'positive' ... 'negative' 'negative' 'positive']
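To see which integer stands for which label, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses a three-item toy list (an assumption, not the notebook's data); `LabelEncoder` sorts the classes, so `negative` maps to 0 and `positive` to 1, which matches the output above.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration
labels = ["positive", "negative", "positive"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so the position of each class is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}

print(mapping)
print(list(encoded))
```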


## Train test split

In [None]:
# train - test split

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

In [None]:
print(f"{X_train.shape=}")
print(f"{y_train.shape=}")
print(f"{X_test.shape=}")
print(f"{y_test.shape=}")

X_train.shape=(7986, 1)
y_train.shape=(7986,)
X_test.shape=(1997, 1)
y_test.shape=(1997,)


In [None]:
# importing Count vectoriser
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [None]:
# transforming X_train, X_test
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

In [None]:
print(f"{X_train_bow.shape=}")

print(f"{X_test_bow.shape=}")

X_train_bow.shape=(7986, 48282)
X_test_bow.shape=(1997, 48282)


## Building model

In [None]:
# building model with GaussianNB
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

gnb.fit(X_train_bow,y_train)

In [None]:
#predicting from model
y_pred = gnb.predict(X_test_bow)

In [None]:
from sklearn.metrics import accuracy_score,confusion_matrix
accuracy_score(y_test,y_pred)

0.6324486730095142

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.59      0.75      0.66       952
           1       0.70      0.52      0.60      1045

    accuracy                           0.63      1997
   macro avg       0.64      0.64      0.63      1997
weighted avg       0.65      0.63      0.63      1997
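The report shows a recall of 0.52 for class 1, meaning roughly half of the positive reviews are predicted as negative. `confusion_matrix` is imported above but never called; the sketch below shows how it relates to recall, using small made-up label lists rather than the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (assumed for illustration): 0 = negative, 1 = positive
y_true = [0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 1, 0, 0, 1, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

# Recall for the positive class: share of true positives that were found
recall_pos = tp / (tp + fn)

print(cm)
print("recall for class 1:", recall_pos)
```

Running the same call on `y_test` and `y_pred` would give the counts behind the report above.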



In [None]:
## building model with Randomforest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.8457686529794692

In [None]:
# trying to improve the model by limiting the vocabulary to the 3000 most frequent terms
cv = CountVectorizer(max_features=3000)

# Convert the text reviews to bag-of-words representation
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

# Initialize a RandomForestClassifier
rf = RandomForestClassifier()

# Fit the classifier on the training data
rf.fit(X_train_bow, y_train)

# Predict the labels for the test data
y_pred = rf.predict(X_test_bow)

# Calculate and print the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8362543815723585

In [None]:
# Create a CountVectorizer with a specified n-gram range (unigrams and bigrams) and maximum number of features
cv = CountVectorizer(ngram_range=(1, 2), max_features=5000)

# Convert the text reviews to bag-of-words representation
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

# Initialize a RandomForestClassifier
rf = RandomForestClassifier()

# Fit the classifier on the training data
rf.fit(X_train_bow, y_train)

# Predict the labels for the test data
y_pred = rf.predict(X_test_bow)

# Calculate and print the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.842764146219329


In [None]:
# Import the TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF vectorizer
tfidf = TfidfVectorizer()

# Convert the text reviews to TF-IDF representations
X_train_tfidf = tfidf.fit_transform(X_train['review']).toarray()
X_test_tfidf = tfidf.transform(X_test['review'])

# Initialize a RandomForestClassifier
rf = RandomForestClassifier()

# Fit the classifier on the training data
rf.fit(X_train_tfidf, y_train)

# Predict the labels for the test data
y_pred = rf.predict(X_test_tfidf)

# Calculate and print the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8362543815723585
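The weighting that `TfidfVectorizer` applies can be reproduced by hand. With scikit-learn's default settings, each count is multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and every row is then scaled to unit L2 norm. The sketch below checks this on a toy corpus (the corpus is an assumption for illustration).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_reviews = ["good movie", "bad movie", "good good plot"]

# Raw term counts, one row per document
counts = CountVectorizer().fit_transform(toy_reviews).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each term
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF, matching the default smooth_idf=True
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then L2-normalize each row (default norm="l2")
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(toy_reviews).toarray()
matches = np.allclose(manual, reference)

print(matches)
```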


# Using Word2vec

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# import the necessary libraries for Word2Vec modeling and preprocessing
import gensim
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

# Initialize an empty list to store preprocessed sentences
story = []

# Iterate through each document in the DataFrame's 'review' column
for doc in df['review']:
    # Tokenize the document into sentences
    raw_sent = sent_tokenize(doc)
    # Preprocess each sentence and add it to the story list
    for sentence in raw_sent:
        story.append(simple_preprocess(sentence))

# Initialize a Word2Vec model with specified parameters
model = gensim.models.Word2Vec(
    window=10,
    min_count=2
)

# Build vocabulary using the preprocessed sentences
model.build_vocab(story)

# Train the Word2Vec model on the preprocessed sentences
model.train(story, total_examples=model.corpus_count, epochs=model.epochs)

# Print the length of the vocabulary
print("Vocabulary size:", len(model.wv.index_to_key))

Vocabulary size: 31845


In [None]:
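# Define a function to compute the document vector
def document_vector(doc):
    # Keep only in-vocabulary words (key_to_index is a dict, so lookups are fast)
    words = [word for word in doc.split() if word in model.wv.key_to_index]
    # Fall back to a zero vector if no word is in the vocabulary
    if not words:
        return np.zeros(model.wv.vector_size)
    # Compute the mean vector of the document's word vectors
    return np.mean(model.wv[words], axis=0)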
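The averaging step itself does not depend on gensim. The sketch below reproduces it with a hand-written dictionary of 3-dimensional vectors (hypothetical values standing in for a trained model) and includes a zero-vector fallback for a document in which no word is in the vocabulary. The function name `mean_vector` and its `dim` parameter are assumptions for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; a trained Word2Vec model would supply these.
embeddings = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([3.0, 2.0, 0.0]),
}

def mean_vector(doc, embeddings, dim=3):
    """Average the vectors of in-vocabulary words; zeros if none are found."""
    vectors = [embeddings[w] for w in doc.split() if w in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

vec = mean_vector("good movie unknownword", embeddings)  # unknown word is skipped
empty = mean_vector("zzz", embeddings)                   # nothing in vocabulary

print(vec)
print(empty)
```

Averaging discards word order, so two reviews with the same words in a different order get the same vector.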

In [None]:
# Compute the document vector for the first review in the DataFrame
document_vector(df['review'].values[0])

array([-0.17291921,  0.5019949 ,  0.18420105,  0.24283421, -0.12687327,
       -0.5808702 ,  0.21783249,  0.9364792 , -0.38609356, -0.29662418,
       -0.24351087, -0.40953594,  0.1367519 ,  0.0959124 ,  0.16773675,
       -0.10853798,  0.00189087, -0.33038345, -0.02186571, -0.62209576,
        0.05045998,  0.24931745,  0.07555286, -0.31679195, -0.33045274,
       -0.03454106, -0.27244443,  0.06135172, -0.31572965,  0.07248805,
        0.36164626,  0.01511672,  0.18708989, -0.26675293, -0.13015848,
        0.44420147,  0.10630554, -0.30188674, -0.27787647, -0.8319339 ,
        0.13751996, -0.16920884, -0.01154598, -0.12692456,  0.49500546,
       -0.12795532, -0.25376827, -0.02428494,  0.07118721,  0.33063453,
        0.06181994, -0.3904192 , -0.44261256, -0.1725567 , -0.09714159,
        0.21523477,  0.1744858 ,  0.05454216, -0.29906672,  0.05450452,
        0.03564481,  0.13158031,  0.05539799, -0.06575982, -0.46464714,
        0.3090137 ,  0.05305072,  0.18044986, -0.36362776,  0.33

In [None]:
# Import the tqdm library for progress tracking
from tqdm import tqdm

In [None]:
# Initialize an empty list to store document vectors
X = []

# Iterate through each document in the 'review' column and compute its vector
for doc in tqdm(df['review'].values):
    # Compute the document vector using the defined function and append to X
    X.append(document_vector(doc))

100%|██████████| 9983/9983 [08:42<00:00, 19.12it/s]


In [None]:
# Convert the list of vectors into a numpy array
X = np.array(X)

# Print the document vector of the first review
print("Document Vector for the first review:\n", X[0])

Document Vector for the first review:
 [-0.17291921  0.5019949   0.18420105  0.24283421 -0.12687327 -0.5808702
  0.21783249  0.9364792  -0.38609356 -0.29662418 -0.24351087 -0.40953594
  0.1367519   0.0959124   0.16773675 -0.10853798  0.00189087 -0.33038345
 -0.02186571 -0.62209576  0.05045998  0.24931745  0.07555286 -0.31679195
 -0.33045274 -0.03454106 -0.27244443  0.06135172 -0.31572965  0.07248805
  0.36164626  0.01511672  0.18708989 -0.26675293 -0.13015848  0.44420147
  0.10630554 -0.30188674 -0.27787647 -0.8319339   0.13751996 -0.16920884
 -0.01154598 -0.12692456  0.49500546 -0.12795532 -0.25376827 -0.02428494
  0.07118721  0.33063453  0.06181994 -0.3904192  -0.44261256 -0.1725567
 -0.09714159  0.21523477  0.1744858   0.05454216 -0.29906672  0.05450452
  0.03564481  0.13158031  0.05539799 -0.06575982 -0.46464714  0.3090137
  0.05305072  0.18044986 -0.36362776  0.33557653 -0.32790306  0.1258823
  0.5741596  -0.06761195  0.4214679   0.08097376 -0.07749937 -0.1912764
 -0.54745376  0.1

In [None]:
# Import the LabelEncoder class from sklearn.preprocessing
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder
encoder = LabelEncoder()

# y already holds the encoded sentiment (0/1) from the earlier cell;
# re-fitting the encoder on these values leaves them unchanged
y = encoder.fit_transform(y)

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

In [None]:
print(f"{X_train.shape=}")
print(f"{y_train.shape=}")
print(f"{X_test.shape=}")
print(f"{y_test.shape=}")

# When using Word2Vec embeddings for text data, each document (text sample) is represented by a dense vector in a continuous vector space.
# These vectors capture semantic relationships between words and can be used as features for machine learning tasks such as text classification.

X_train.shape=(7986, 100)
y_train.shape=(7986,)
X_test.shape=(1997, 100)
y_test.shape=(1997,)


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier()
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
accuracy_score(y_test,y_pred)

0.7766649974962444

## Conclusion:

In conclusion, this notebook compared several ways of representing text for sentiment classification on 9,983 IMDB reviews, using Gaussian Naive Bayes and Random Forest classifiers. With the full bag-of-words vocabulary (48,282 terms), Gaussian Naive Bayes reached an accuracy of about 0.63 and Random Forest about 0.85 on the 1,997-review test set.

For feature engineering with Count Vectorization, we first limited the vocabulary to the 3,000 most frequent terms and then combined unigrams and bigrams with a cap of 5,000 features. Both variants scored about 0.84 with Random Forest, close to the full-vocabulary result, so in these runs the smaller feature sets reduced dimensionality with little change in accuracy.

We then tried TF-IDF (Term Frequency-Inverse Document Frequency), which reweights each term by how rare it is across the corpus. With Random Forest it scored about 0.84, so it did not improve on plain counts in this experiment.

Lastly, we used Word2Vec to represent words in a continuous vector space and averaged the word vectors of each review into a 100-dimensional document vector. This representation captures semantic relationships between words, but the Random Forest trained on the averaged vectors reached about 0.78, below the bag-of-words variants.

Taken together, these experiments give a practical view of the trade-offs: sparse count-based features performed best here with Random Forest, while averaged Word2Vec embeddings were far more compact but less accurate. Because the Random Forest models were trained without a fixed random seed, small differences between runs should be read with caution.

## Thank you for reading till the end.

### - Raviteja

https://www.linkedin.com/in/raviteja-padala/