# Sentiment Analysis using Natural Language Processing (NLP)

**Internship:** CODTECH  
**Task:** Task-4  
**Objective:** Perform sentiment analysis on textual data using NLP techniques

In [1]:
# ===============================
# Importing Required Libraries
# ===============================

# Data manipulation libraries
import pandas as pd
import numpy as np

# Regular expression library for text cleaning
import re

# Natural Language Processing library
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

# Machine Learning libraries
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
# ===============================
# Loading the Dataset
# ===============================

# Load sentiment dataset from CSV file
df = pd.read_csv("sentiment_large_dataset.csv")

# Display first 5 rows of the dataset
df.head()

Unnamed: 0,text,sentiment
0,This is the worst experience I have ever had (...,negative
1,The service was okay and acceptable (review 2),neutral
2,Absolutely fantastic quality and fast delivery...,positive
3,I am very disappointed with the product (revie...,negative
4,It is average and nothing special (review 5),neutral


In [6]:
# ===============================
# Dataset Information
# ===============================

# Check dataset structure and data types
df.info()
# Check for missing values
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       150 non-null    object
 1   sentiment  150 non-null    object
dtypes: object(2)
memory usage: 2.5+ KB


Unnamed: 0,0
text,0
sentiment,0


In [7]:
# ===============================
# Sentiment Distribution
# ===============================

# Count number of samples for each sentiment
df['sentiment'].value_counts()

sentiment,count
negative,50
neutral,50
positive,50


In [8]:
# ===============================
# Text Preprocessing
# ===============================

# Load English stopwords
stop_words = set(stopwords.words('english'))

# Function to clean text data
def clean_text(text):
    """
    Clean a raw review string.

    - Converts text to lowercase
    - Removes every character other than a-z and whitespace
      (punctuation, digits, and non-ASCII letters)
    - Removes English stopwords
    """

    # Convert text to lowercase
    text = text.lower()

    # Remove special characters and numbers
    text = re.sub(r'[^a-z\s]', '', text)

    # Split sentence into words
    words = text.split()

    # Remove stopwords
    words = [word for word in words if word not in stop_words]

    # Join words back into a sentence
    return " ".join(words)

# Apply text cleaning to dataset
df['clean_text'] = df['text'].apply(clean_text)

# Display cleaned text
df.head()

Unnamed: 0,text,sentiment,clean_text
0,This is the worst experience I have ever had (...,negative,worst experience ever review
1,The service was okay and acceptable (review 2),neutral,service okay acceptable review
2,Absolutely fantastic quality and fast delivery...,positive,absolutely fantastic quality fast delivery review
3,I am very disappointed with the product (revie...,negative,disappointed product review
4,It is average and nothing special (review 5),neutral,average nothing special review
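
The cleaning steps can be checked on individual sentences without downloading NLTK data. The sketch below is illustrative only: `DEMO_STOPWORDS` is a small hand-picked stand-in for NLTK's English stopword list, and `clean_text_demo` is a hypothetical name, not part of the notebook. The two sample sentences are the untruncated rows from the table above.

```python
import re

# Assumption: a tiny stand-in for stopwords.words('english'), for illustration only.
DEMO_STOPWORDS = {"the", "was", "and", "it", "is"}

def clean_text_demo(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, keep only a-z and whitespace, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    words = [w for w in text.split() if w not in stop_words]
    return " ".join(words)

samples = [
    "The service was okay and acceptable (review 2)",
    "It is average and nothing special (review 5)",
]
cleaned = [clean_text_demo(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(f"{raw!r} -> {out!r}")
```

Both results end in the token `review`: the digits and parentheses of the `(review N)` suffix are stripped, but the word itself is kept. It carries no sentiment information, so adding it to the stopword set may be worth trying.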


In [9]:
# ===============================
# Feature Extraction using TF-IDF
# ===============================

# Define input features and output labels
X = df['clean_text']
y = df['sentiment']

# Convert text into numerical features using TF-IDF
tfidf = TfidfVectorizer(max_features=5000)

X_tfidf = tfidf.fit_transform(X)

In [10]:
# ===============================
# Splitting Data into Training and Testing Sets
# ===============================

X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf,
    y,
    test_size=0.2,
    random_state=42
)
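
Two details of this split are worth noting: the TF-IDF vectorizer was fit on all rows before splitting, so test-set term statistics leak into the features, and the split is not stratified, so class counts in the test set can be uneven. One possible alternative is sketched below, assuming a scikit-learn `Pipeline` and a stratified split. The twelve toy sentences are made up for illustration and are not the notebook's dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: four made-up sentences per class.
texts = [
    "worst experience ever", "very disappointed with the product",
    "terrible quality and slow delivery", "awful service never again",
    "service was okay and acceptable", "average and nothing special",
    "it was fine nothing more", "acceptable but ordinary",
    "absolutely fantastic quality", "fast delivery and great service",
    "excellent product highly recommended", "wonderful experience loved it",
]
labels = ["negative"] * 4 + ["neutral"] * 4 + ["positive"] * 4

# Split the raw text first; stratify keeps class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the classifier.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

preds = pipeline.predict(X_test)
print(list(zip(X_test, preds)))
```

Because the vectorizer lives inside the pipeline, calling `pipeline.predict` on new raw text applies the same training-time vocabulary automatically.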

In [11]:
# ===============================
# Logistic Regression Model
# ===============================

# Initialize the Logistic Regression model
model = LogisticRegression()

# Train the model using training data
model.fit(X_train, y_train)

In [12]:
# ===============================
# Model Evaluation
# ===============================

# Predict sentiments for test dataset
y_pred = model.predict(X_test)

# Calculate model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

Model Accuracy: 0.13333333333333333


In [13]:
# Per-class precision, recall, and F1-score on the test set
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    negative       0.00      0.00      0.00        11
     neutral       0.17      0.20      0.18        10
    positive       0.14      0.22      0.17         9

    accuracy                           0.13        30
   macro avg       0.10      0.14      0.12        30
weighted avg       0.10      0.13      0.11        30
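
As a sanity check, the reported accuracy can be recomputed from the per-class recall and support in the report. This is a back-of-the-envelope sketch: the correct-prediction counts are inferred from rounded recall values, not read from a confusion matrix, and the `report` dictionary is typed in by hand.

```python
# Per-class recall and support copied from the classification report above.
report = {
    "negative": {"recall": 0.00, "support": 11},
    "neutral": {"recall": 0.20, "support": 10},
    "positive": {"recall": 0.22, "support": 9},
}

# recall = correct / support, so correct is approximately recall * support.
correct = {label: round(r["recall"] * r["support"]) for label, r in report.items()}
total = sum(r["support"] for r in report.values())

accuracy = sum(correct.values()) / total
chance = 1 / len(report)  # expected accuracy of uniform random guessing over 3 classes

print(correct)
print(f"accuracy={accuracy:.3f}, chance={chance:.3f}")
```

The inferred counts (0, 2, and 2 correct out of 30) reproduce the reported 0.13 accuracy, which is well below the roughly 0.33 expected from guessing uniformly at random among three classes.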



## 📊 Insights
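- Text was lowercased, stripped of punctuation and digits, and filtered against NLTK's English stopword list. The token `review` survives in every cleaned row shown, although it carries no sentiment information.
- TF-IDF converted the cleaned text into numerical features (capped at 5,000 terms). The vectorizer was fit on the full dataset before the train/test split, so test-set term statistics leak into the features.
- Logistic Regression reached an accuracy of 0.13 on the 30-sample test set, below the roughly 0.33 expected from random guessing among three balanced classes, and it classified no negative review correctly. The results are not reliable as they stand.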

## ✅ Conclusion
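This project demonstrates how Natural Language Processing techniques (text cleaning,
TF-IDF, and Logistic Regression) can be applied to classify sentiment in textual data.
In its current form the model performs below chance on the held-out test set, so the
pipeline needs to be debugged before it is extended to real-world data such as
social media posts and customer reviews.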