# 🎯 Project Objective (With NLP)
The objective of this project is to perform sentiment analysis on textual data using Natural Language Processing (NLP) techniques. The project involves preprocessing raw text data, extracting meaningful features, and building machine learning models such as Logistic Regression, Multinomial Naive Bayes, and Support Vector Machine to classify sentiments into positive, negative, or neutral categories. The performance of these models is evaluated to identify the most effective approach for sentiment prediction.

# 📘 Explanation of the Objective (NLP-Based)
## 1. Use of Natural Language Processing (NLP)
NLP is used to enable machines to understand and process human language. In this project, NLP techniques help convert unstructured text into a structured format that machine learning models can understand.

## 2. Text Preprocessing Using NLP Techniques
NLP preprocessing steps include:

- Tokenization
- Lowercasing
- Stopword removal
- Stemming or Lemmatization
- Removing punctuation and noise
- Feature extraction using Bag of Words or TF-IDF

These steps improve text quality and model performance.
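
The steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the notebook's actual pipeline: the `STOP_WORDS` set is a tiny hand-picked stand-in for a full stopword list, the function name `preprocess` is hypothetical, and stemming or lemmatization is left out.

```python
import re

# Tiny illustrative stopword set (an assumption; a real list is much longer).
STOP_WORDS = {"i", "the", "a", "an", "is", "in", "you", "will", "here"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()                       # lowercasing
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace punctuation/noise with spaces
    tokens = text.split()                     # whitespace tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(preprocess("Sooo SAD I will miss you here in San Diego!!!"))
# ['sooo', 'sad', 'miss', 'san', 'diego']
```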

## 3. Feature Extraction
NLP techniques are applied to extract numerical features from text, allowing machine learning algorithms to learn patterns related to sentiment.
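
To make the idea of "numerical features" concrete, here is a small hand-rolled TF-IDF calculation. It is a sketch under stated assumptions: the corpus is a hypothetical toy example, and the weighting uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents as its default, without the final L2 normalization step.

```python
import math
from collections import Counter

# Toy corpus (hypothetical; already preprocessed and tokenized).
docs = [
    ["sad", "miss", "san", "diego"],
    ["boss", "bullying"],
    ["love", "san", "diego"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return {term: tf * idf} using a smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    counts = Counter(doc)
    return {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }

weights = tfidf(docs[0])
# "sad" appears in one document, "san" in two, so "sad" gets the larger weight.
print(weights["sad"] > weights["san"])  # True
```

Rarer terms receive larger weights, which is why TF-IDF tends to emphasize words that distinguish one text from another.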

## 4. Machine Learning Model Training
The extracted features are used to train:

- Logistic Regression
- Multinomial Naive Bayes
- Support Vector Machine (SVM)

Each model learns to classify text into positive, negative, or neutral sentiment categories.
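
A minimal sketch of training the three listed models on the same TF-IDF features might look like this. The texts and labels are hypothetical toy data, `LinearSVC` is used as one possible SVM variant, and `max_iter=1000` is an illustrative setting rather than a tuned value.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical toy data, only to make the sketch runnable.
texts = [
    "love this so much", "great day happy", "awesome fun time",
    "sad miss you", "boss bullying me", "hate this awful day",
    "going to work", "schools out", "just had lunch",
]
labels = ["positive"] * 3 + ["negative"] * 3 + ["neutral"] * 3

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)  # sparse TF-IDF matrix

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(),
}

predictions = {}
for name, model in models.items():
    model.fit(X, labels)
    # Predict on one unseen sentence, vectorized with the already-fitted vocabulary.
    predictions[name] = model.predict(vectorizer.transform(["happy fun day"]))[0]
    print(name, "->", predictions[name])
```

Fitting the vectorizer once and reusing it for every model keeps the comparison fair, since all models see identical features.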

## 5. Model Evaluation
Models are evaluated using metrics such as:

- Accuracy
- Precision
- Recall
- F1-score

This ensures reliable and fair comparison between different NLP-based models.
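
The arithmetic behind these metrics can be shown without any library. The sketch below computes one-vs-rest precision, recall, and F1 for a single class; the label lists are hypothetical and the helper name `precision_recall_f1` is an assumption for illustration.

```python
def precision_recall_f1(y_true, y_pred, label):
    """One-vs-rest precision, recall, and F1 for a single class label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels, chosen only to illustrate the arithmetic.
y_true = ["positive", "positive", "negative", "neutral", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive", "positive"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
p, r, f1 = precision_recall_f1(y_true, y_pred, "positive")
print(round(accuracy, 3), round(p, 3), round(r, 3), round(f1, 3))
# 0.5 0.5 0.667 0.571
```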

# Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the Dataset

In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("abhi8923shriv/sentiment-analysis-dataset")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/abhi8923shriv/sentiment-analysis-dataset?dataset_version_number=9...
100%|██████████| 54.4M/54.4M [00:02<00:00, 21.7MB/s]
Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/abhi8923shriv/sentiment-analysis-dataset/versions/9


In [3]:
df = pd.read_csv(f"{path}/train.csv", encoding='latin1')

In [4]:
# Take an explicit copy so later edits do not trigger SettingWithCopyWarning
data = df.iloc[:8000].copy()

# Display top 5 records

In [5]:
data.head()

Unnamed: 0,textID,text,selected_text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,morning,0-20,Afghanistan,38928346,652860.0,60
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,noon,21-30,Albania,2877797,27400.0,105
2,088c60f138,my boss is bullying me...,bullying me,negative,night,31-45,Algeria,43851044,2381740.0,18
3,9642c003ef,what interview! leave me alone,leave me alone,negative,morning,46-60,Andorra,77265,470.0,164
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,noon,60-70,Angola,32866272,1246700.0,26


# Display last 5 records

In [6]:
data.tail()

Unnamed: 0,textID,text,selected_text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
7995,7107ef22f7,Rain. One more reason to stay snuggled beneath...,Rain. One more reason to stay snuggled beneath...,neutral,morning,46-60,Brunei,437479,5270.0,83
7996,1d07bd0bc2,Found out that a schoolmate died of an heart a...,miss,negative,noon,60-70,Bulgaria,6948445,108560.0,64
7997,ef0a7a1c39,I just added a butt to your name in my phone ...,e I love you!,positive,night,70-100,Burkina Faso,20903273,273600.0,76
7998,08c6ce4251,"Schools out, but works in","Schools out, but works in",neutral,morning,0-20,Burundi,11890784,25680.0,463
7999,1ba21ec741,"So sorry if i`ve been typing wrongly. usually,...",So sorry,negative,noon,21-30,Côte d'Ivoire,26378274,318000.0,83


# Display all the Datatypes

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   textID            8000 non-null   object 
 1   text              7999 non-null   object 
 2   selected_text     7999 non-null   object 
 3   sentiment         8000 non-null   object 
 4   Time of Tweet     8000 non-null   object 
 5   Age of User       8000 non-null   object 
 6   Country           8000 non-null   object 
 7   Population -2020  8000 non-null   int64  
 8   Land Area (Km²)   8000 non-null   float64
 9   Density (P/Km²)   8000 non-null   int64  
dtypes: float64(1), int64(2), object(7)
memory usage: 625.1+ KB


# Five-Number Summary of the Data

In [8]:
data.describe()

Unnamed: 0,Population -2020,Land Area (Km²),Density (P/Km²)
count,8000.0,8000.0,8000.0
mean,40115730.0,662690.2,357.646
std,150141600.0,1808348.0,2012.203079
min,801.0,0.0,2.0
25%,1968001.0,22810.0,35.0
50%,8655535.0,111890.0,89.0
75%,28435940.0,527970.0,214.0
max,1439324000.0,16376870.0,26337.0


# Check null values

In [9]:
data.isnull().sum()

Unnamed: 0,0
textID,0
text,1
selected_text,1
sentiment,0
Time of Tweet,0
Age of User,0
Country,0
Population -2020,0
Land Area (Km²),0
Density (P/Km²),0


# Drop null values

In [10]:
data = data.dropna()  # reassign instead of modifying a slice in place


# Check duplicate values

In [11]:
data.duplicated().sum()

np.int64(0)

# Drop duplicate values

In [12]:
data = data.drop_duplicates()  # reassign instead of modifying a slice in place
print(data.duplicated().sum())

0


# Drop the columns that are not useful

In [13]:
# Keep only 'text' and 'sentiment'; reassign instead of modifying a slice in place
data = data.drop(columns=['textID', 'selected_text', 'Time of Tweet', 'Age of User', 'Country', 'Population -2020', 'Land Area (Km²)', 'Density (P/Km²)'])


# Value Count

In [14]:
data['sentiment'].value_counts()

sentiment,count
neutral,3228
positive,2561
negative,2210
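
The counts above show a moderately imbalanced label distribution. A quick calculation, using only the numbers printed by `value_counts()`, gives the share of each class and the accuracy a trivial "always predict the most frequent class" baseline would reach on this sample. The held-out split may differ slightly, so treat the baseline as approximate.

```python
# Class counts copied from the value_counts() output above.
counts = {"neutral": 3228, "positive": 2561, "negative": 2210}
total = sum(counts.values())  # 7999 rows after dropping the null row

shares = {label: round(n / total, 3) for label, n in counts.items()}
print(shares)
# {'neutral': 0.404, 'positive': 0.32, 'negative': 0.276}

# Accuracy of always predicting the most frequent class on the full sample.
majority_label = max(counts, key=counts.get)
print(majority_label, shares[majority_label])
# neutral 0.404
```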


# Remove the HTML Tags

In [15]:
import re

def remove_tags(raw_text):
    # Strip anything that looks like an HTML tag, e.g. <br> or <b>...</b>
    cleaned_text = re.sub(r'<.*?>', '', raw_text)
    return cleaned_text
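
A tag-stripping pattern of the form `<.*?>` relies on non-greedy matching. The short sketch below, on a hypothetical input string, contrasts it with the greedy form to show why the `?` matters.

```python
import re

TAG_PATTERN = re.compile(r"<.*?>")  # non-greedy: stops at the first closing '>'

sample = "so <b>happy</b> today<br>"  # hypothetical input
print(TAG_PATTERN.sub("", sample))
# so happy today

# A greedy pattern swallows everything between the first '<' and the last '>'.
print(repr(re.sub(r"<.*>", "", sample)))
# 'so '
```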

# Convert all text in the 'text' column to lowercase

In [16]:
# assign() returns a new DataFrame, so no slice is modified in place
data = data.assign(text=data['text'].str.lower())


# Remove Backticks, Asterisks, and Exclamation Marks

In [45]:
import re

def remove_star_exclam(text):
    # Strip backticks, asterisks, and exclamation marks; leave non-strings untouched
    if isinstance(text, str):
        return re.sub(r'[`*!]+', '', text)
    return text

# assign() returns a new DataFrame, so no slice is modified in place
data = data.assign(text=data['text'].apply(remove_star_exclam))


# Remove Stopwords

In [18]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    return " ".join(
        word for word in text.split() if word not in stop_words
    )

data.loc[:, 'text'] = data['text'].apply(remove_stopwords)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
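
One caveat for sentiment work: NLTK's English stopword list includes negation words such as "not" and "no", and removing them can invert the meaning of a sentence. The sketch below shows one way to keep them. The small `stop_words` set is an illustrative stand-in so the example runs without downloads, and the `negations` set is a hypothetical choice.

```python
# Small illustrative stopword set (an assumption standing in for a full list).
stop_words = {"i", "am", "the", "this", "is", "not", "no", "at", "all"}

# Hypothetical set of negation words to keep.
negations = {"not", "no", "never", "nor"}
sentiment_stop_words = stop_words - negations

def remove_stopwords(text, stop_set):
    return " ".join(word for word in text.split() if word not in stop_set)

text = "i am not happy"
print(remove_stopwords(text, stop_words))            # happy
print(remove_stopwords(text, sentiment_stop_words))  # not happy
```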


In [19]:
data

Unnamed: 0,text,sentiment
0,"i`d responded, going",neutral
1,sooo sad miss san diego,negative
2,boss bullying me...,negative
3,interview leave alone,negative
4,"sons , couldn`t put releases already bought",negative
...,...,...
7995,rain. one reason stay snuggled beneath duvet,neutral
7996,found schoolmate died heart attack morning. ba...,negative
7997,added butt name phone made go home cold. love,positive
7998,"schools out, works",neutral


# Define X,y and perform train_test_split

In [20]:
X = data.iloc[:,0:1]
y = data['sentiment']

In [21]:
X

Unnamed: 0,text
0,"i`d responded, going"
1,sooo sad miss san diego
2,boss bullying me...
3,interview leave alone
4,"sons , couldn`t put releases already bought"
...,...
7995,rain. one reason stay snuggled beneath duvet
7996,found schoolmate died heart attack morning. ba...
7997,added butt name phone made go home cold. love
7998,"schools out, works"


In [22]:
y

Unnamed: 0,sentiment
0,neutral
1,negative
2,negative
3,negative
4,negative
...,...
7995,neutral
7996,negative
7997,positive
7998,neutral


# Model Training and Testing

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
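
Because the classes are not equally frequent, a stratified split is one option for keeping class proportions similar in the train and test sets. This is a sketch on hypothetical toy data; `stratify` is a standard `train_test_split` parameter, but whether it changes results on this dataset is not something the notebook measures.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 15 neutral, 9 positive, 6 negative.
texts = [f"sample text {i}" for i in range(30)]
labels = ["neutral"] * 15 + ["positive"] * 9 + ["negative"] * 6

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Each class should appear in both splits in roughly its original proportion.
print(Counter(y_tr))
print(Counter(y_te))
```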

In [24]:
X_train.shape, X_test.shape

((6399, 1), (1600, 1))

In [25]:
y_train.shape, y_test.shape

((6399,), (1600,))

# Applying Bag of Words

In [26]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [27]:
X_train_bow = cv.fit_transform(X_train['text']).toarray()
X_test_bow = cv.transform(X_test['text']).toarray()

In [28]:
X_train_bow.shape

(6399, 10628)

In [29]:
X_test_bow.shape

(1600, 10628)
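
Calling `.toarray()` turns the sparse count matrix into a dense one, which is required by `GaussianNB` but costly at this size. The sketch below estimates the dense footprint from the shape reported above and shows, on hypothetical toy data, that `MultinomialNB` can be trained on the sparse matrix directly. The names `cv_sparse` and `mnb` are illustrative.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Dense size of the training matrix reported above, at 8 bytes per int64 count.
dense_bytes = 6399 * 10628 * 8
print(f"{dense_bytes / 1e6:.0f} MB")  # 544 MB

# Hypothetical toy data to show the sparse route end to end.
train_texts = ["sad miss you", "love this day", "going to work"]
train_labels = ["negative", "positive", "neutral"]

cv_sparse = CountVectorizer()
X_sparse = cv_sparse.fit_transform(train_texts)  # no .toarray(): stays sparse
print(issparse(X_sparse))  # True

# MultinomialNB accepts sparse count matrices directly.
mnb = MultinomialNB().fit(X_sparse, train_labels)
print(mnb.predict(cv_sparse.transform(["love you"])))
```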

# Applying Gaussian Naive Bayes

In [30]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

In [31]:
gnb.fit(X_train_bow,y_train)

In [32]:
y_pred = gnb.predict(X_test_bow)

In [33]:
accuracy_score(y_test, y_pred)

0.35875

In [34]:
confusion_matrix(y_test, y_pred)

array([[285,  80,  67],
       [381, 110, 162],
       [252,  84, 179]])
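
The confusion matrix can be unpacked further with a few lines of NumPy. The sketch assumes rows are true classes and columns are predicted classes (the scikit-learn convention) and that the label order is the sorted one: negative, neutral, positive.

```python
import numpy as np

# Confusion matrix from the output above. Rows are true classes, columns are
# predicted classes; the label order below is an assumption (sorted labels).
cm = np.array([[285, 80, 67],
               [381, 110, 162],
               [252, 84, 179]])
labels = ["negative", "neutral", "positive"]

accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 0.35875

recall = np.diag(cm) / cm.sum(axis=1)     # correct / true count per class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / predicted count per class
for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
```

The first column sums to 918, so under the assumed label order the model assigns more than half of the 1600 test samples to the first class, which helps explain the low overall accuracy.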

In [35]:
from sklearn.ensemble import RandomForestClassifier

# Applying Random Forest

In [37]:
rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)

In [38]:
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.66875
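
A random forest is randomized, so scores can vary between runs unless a seed is fixed. The sketch below, on hypothetical toy data, shows that passing `random_state` makes two independently trained forests agree exactly. The value 42 and `n_estimators=50` are arbitrary illustrative choices.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data to keep the sketch self-contained.
texts = ["sad miss you", "love this day", "going to work",
         "hate my boss", "great fun time", "schools out"]
labels = ["negative", "positive", "neutral",
          "negative", "positive", "neutral"]

X = CountVectorizer().fit_transform(texts)

# With a fixed seed, two separately trained forests make identical predictions.
rf_a = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, labels)
rf_b = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, labels)
same = (rf_a.predict_proba(X) == rf_b.predict_proba(X)).all()
print(same)  # True
```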

In [47]:
data

Unnamed: 0,text,sentiment
0,i d responded going,neutral
1,sooo sad miss san diego,negative
2,boss bullying me...,negative
3,interview leave alone,negative
4,sons couldn t put releases already bought,negative
...,...,...
7995,rain. one reason stay snuggled beneath duvet,neutral
7996,found schoolmate died heart attack morning. ba...,negative
7997,added butt name phone made go home cold. love,positive
7998,schools out works,neutral


In [50]:
def predict_review_sentiment(text):
    # Apply preprocessing steps
    cleaned_text = remove_tags(str(text)) # Ensure text is string for preprocessing
    cleaned_text = cleaned_text.lower()
    cleaned_text = remove_stopwords(cleaned_text)
    cleaned_text = remove_star_exclam(cleaned_text)

    # Transform the text using the fitted CountVectorizer
    text_bow = cv.transform([cleaned_text]) # cv.transform expects an iterable

    # Predict the sentiment using the trained RandomForestClassifier
    prediction = rf.predict(text_bow)

    return prediction[0] # Return the single predicted sentiment

test_text = input("Enter the text for the review: ")

# Call the predict_review_sentiment function
prediction_result = predict_review_sentiment(
    text=test_text
)

print(f'Result :   {prediction_result}')


Enter the text for the review: rain. one reason stay snuggled beneath duvet
Result :   neutral
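
One way to guarantee that new text goes through exactly the same preprocessing as the training data is to bundle the steps into a single scikit-learn pipeline. This is a sketch, not the notebook's method: `clean` is a hypothetical stand-in for the preprocessing functions defined earlier, and the training data is a toy example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

def clean(text):
    # Hypothetical stand-in for the notebook's preprocessing functions.
    return str(text).lower().replace("!", "").replace("*", "")

# Hypothetical toy training data.
texts = ["Sooo SAD, miss you!!!", "Love this day", "Going to work",
         "Hate my boss", "Great fun time!", "Schools out"]
labels = ["negative", "positive", "neutral",
          "negative", "positive", "neutral"]

# The vectorizer calls clean() on every document at both fit and predict time,
# so training and inference cannot drift apart.
model = make_pipeline(
    CountVectorizer(preprocessor=clean),
    RandomForestClassifier(random_state=42),
)
model.fit(texts, labels)

result = model.predict(["what a GREAT day!"])[0]
print(result)
```

With this layout, prediction takes raw strings and there is no separate list of cleaning calls to keep in sync with the training cells.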


# Final Outcome
The project delivers an NLP-based sentiment analysis system capable of automatically predicting sentiment from new text data, useful for analyzing reviews, feedback, and social media content.