# Sentiment Analysis – Text Classification

## Objective
The objective of this task is to classify short social media posts into sentiment categories. The text is cleaned, converted to TF-IDF features, and used to train and compare two classical models: Multinomial Naive Bayes and Logistic Regression.


In [2]:
import pandas as pd
import numpy as np
import re
import string

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report




In [3]:
df = pd.read_csv(r"E:\SUMED_DATA\CODVEDA - INTERNSHIP\3) Sentiment dataset.csv")
df.head()


| | Unnamed: 0.1 | Unnamed: 0 | Text | Sentiment | Timestamp | User | Platform | Hashtags | Retweets | Likes | Country | Year | Month | Day | Hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | Enjoying a beautiful day at the park! ... | Positive | 2023-01-15 12:30:00 | User123 | Twitter | #Nature #Park | 15.0 | 30.0 | USA | 2023 | 1 | 15 | 12 |
| 1 | 1 | 1 | Traffic was terrible this morning. ... | Negative | 2023-01-15 08:45:00 | CommuterX | Twitter | #Traffic #Morning | 5.0 | 10.0 | Canada | 2023 | 1 | 15 | 8 |
| 2 | 2 | 2 | Just finished an amazing workout! 💪 ... | Positive | 2023-01-15 15:45:00 | FitnessFan | Instagram | #Fitness #Workout | 20.0 | 40.0 | USA | 2023 | 1 | 15 | 15 |
| 3 | 3 | 3 | Excited about the upcoming weekend getaway! ... | Positive | 2023-01-15 18:20:00 | AdventureX | Facebook | #Travel #Adventure | 8.0 | 15.0 | UK | 2023 | 1 | 15 | 18 |
| 4 | 4 | 4 | Trying out a new recipe for dinner tonight. ... | Neutral | 2023-01-15 19:55:00 | ChefCook | Instagram | #Cooking #Food | 12.0 | 25.0 | Australia | 2023 | 1 | 15 | 19 |


## Dataset Overview
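Each row is a social media post (`Text`) with a `Sentiment` label and metadata such as timestamp, user, platform, hashtags, engagement counts, and country. Only `Text` and `Sentiment` are used here, which makes this a supervised text classification problem.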


In [4]:
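# Note: without parentheses, `df.info` returns the bound method, whose repr prints the frame itself.
# Call `df.info()` instead to get the dtype and non-null summary.
df.info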

<bound method DataFrame.info of      Unnamed: 0.1  Unnamed: 0  \
0               0           0   
1               1           1   
2               2           2   
3               3           3   
4               4           4   
..            ...         ...   
727           728         732   
728           729         733   
729           730         734   
730           731         735   
731           732         736   

                                                  Text    Sentiment  \
0     Enjoying a beautiful day at the park!        ...   Positive     
1     Traffic was terrible this morning.           ...   Negative     
2     Just finished an amazing workout! 💪          ...   Positive     
3     Excited about the upcoming weekend getaway!  ...   Positive     
4     Trying out a new recipe for dinner tonight.  ...   Neutral      
..                                                 ...          ...   
727  Collaborating on a science project that receiv...       Happy    


In [6]:
df.columns

Index(['Unnamed: 0.1', 'Unnamed: 0', 'Text', 'Sentiment', 'Timestamp', 'User',
       'Platform', 'Hashtags', 'Retweets', 'Likes', 'Country', 'Year', 'Month',
       'Day', 'Hour'],
      dtype='object')

In [None]:
# Clean column names 
df.columns = df.columns.str.lower().str.strip()
print(df.columns)


Index(['unnamed: 0.1', 'unnamed: 0', 'text', 'sentiment', 'timestamp', 'user',
       'platform', 'hashtags', 'retweets', 'likes', 'country', 'year', 'month',
       'day', 'hour'],
      dtype='object')


In [12]:
df = df.drop(columns=['unnamed: 0', 'unnamed: 0.1'])


In [17]:
df['sentiment'].value_counts()



sentiment
Positive               44
Joy                    42
Excitement             32
Happy                  14
Neutral                14
                       ..
Vibrancy                1
Culinary Adventure      1
Mesmerizing             1
Thrilling Journey       1
Winter Magic            1
Name: count, Length: 279, dtype: int64

### Initial Observations
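- The dataset includes raw text and corresponding sentiment labels.
- The label distribution is highly imbalanced: `value_counts()` reports 279 distinct labels, the most frequent one (`Positive`) appears only 44 times, and the labels at the tail of the list occur just once.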
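
One way to put numbers on that imbalance is to summarise the label counts directly. The sketch below is illustrative only: `summarize_labels` is a hypothetical helper, and the toy labels are made up rather than taken from the dataset. It also shows how stray whitespace or casing can split one label into several.

```python
import pandas as pd

def summarize_labels(labels: pd.Series) -> dict:
    """Summarise how unevenly the labels are distributed (hypothetical helper)."""
    counts = labels.value_counts()  # sorted from most to least frequent
    return {
        "n_classes": int(counts.size),
        "n_singletons": int((counts == 1).sum()),
        "largest_class_share": float(counts.iloc[0] / counts.sum()),
    }

# Toy labels for illustration only; note the trailing space in "Joy ".
toy = pd.Series(["Positive", "Positive", "Joy", "Joy ", "Neutral", "Winter Magic"])

raw_summary = summarize_labels(toy)
normalized_summary = summarize_labels(toy.str.strip().str.lower())

print(raw_summary)
print(normalized_summary)
```

On the toy data, normalizing the labels merges `"Joy"` and `"Joy "` into one class, so the class count drops from 5 to 4.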


In [8]:
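def clean_text(text):
    """Lowercase the text and strip digits, ASCII punctuation, and surrounding whitespace."""
    text = text.lower()
    # Remove every run of digits
    text = re.sub(r'\d+', '', text)
    # Remove ASCII punctuation only; emoji and other symbols are kept
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Trim leading and trailing whitespace (internal spacing is unchanged)
    text = text.strip()
    return text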


In [13]:
df['clean_text'] = df['text'].apply(clean_text)


In [14]:
df['clean_text'] = df['text'].apply(clean_text)
df[['text', 'clean_text']].head()


| | text | clean_text |
|---|---|---|
| 0 | Enjoying a beautiful day at the park! ... | enjoying a beautiful day at the park |
| 1 | Traffic was terrible this morning. ... | traffic was terrible this morning |
| 2 | Just finished an amazing workout! 💪 ... | just finished an amazing workout 💪 |
| 3 | Excited about the upcoming weekend getaway! ... | excited about the upcoming weekend getaway |
| 4 | Trying out a new recipe for dinner tonight. ... | trying out a new recipe for dinner tonight |


### Text Preprocessing
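Text was normalized by converting it to lowercase, removing digits and ASCII punctuation, and trimming leading and trailing whitespace. Emoji and other non-ASCII symbols are left in place, as the third row of the preview shows.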
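
If stricter normalization is wanted, the same steps can be extended. The sketch below is an assumption, not part of the original pipeline: `clean_text_strict` is a hypothetical variant that also drops non-ASCII characters (including emoji) and collapses internal runs of whitespace. Whether removing emoji helps is an open question, since they can carry sentiment.

```python
import re
import string

def clean_text_strict(text: str) -> str:
    """Hypothetical variant of clean_text with two extra normalization steps."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Extra step 1: drop non-ASCII characters such as emoji.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Extra step 2: collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# "\U0001F4AA" is the flexed-biceps emoji seen in the dataset preview.
example = clean_text_strict("Just finished an   amazing workout! \U0001F4AA  #Day2")
print(example)  # just finished an amazing workout day
```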


In [15]:
X = df['clean_text']
y = df['sentiment']


In [19]:
# Clean sentiment labels properly
df['sentiment'] = df['sentiment'].str.strip()
df['sentiment'] = df['sentiment'].str.lower()

df['sentiment'].value_counts().head(20)


sentiment
positive       45
joy            44
excitement     37
contentment    19
neutral        18
gratitude      18
curiosity      16
serenity       15
happy          14
despair        11
nostalgia      11
hopeful         9
loneliness      9
awe             9
grief           9
sad             9
embarrassed     8
confusion       8
acceptance      8
euphoria        7
Name: count, dtype: int64

In [21]:
# Remove rare classes (frequency < 2)
class_counts = df['sentiment'].value_counts()
valid_classes = class_counts[class_counts >= 2].index

df = df[df['sentiment'].isin(valid_classes)]

df['sentiment'].value_counts().describe()


count    112.000000
mean       5.830357
std        7.028755
min        2.000000
25%        2.000000
50%        4.000000
75%        6.000000
max       45.000000
Name: count, dtype: float64
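
The filter above matters for the stratified split used later: `train_test_split` with `stratify` raises a `ValueError` when any class has a single example. The sketch below shows this on made-up toy data (the column names `text` and `label` are illustrative only).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration only: class "c" has a single example.
toy = pd.DataFrame({
    "text": [f"sample {i}" for i in range(11)],
    "label": ["a"] * 6 + ["b"] * 4 + ["c"],
})

# Stratifying on the unfiltered labels fails because of the singleton class.
stratify_failed = False
try:
    train_test_split(toy["text"], toy["label"], test_size=0.2,
                     random_state=42, stratify=toy["label"])
except ValueError as err:
    stratify_failed = True
    print("Stratified split failed:", err)

# Keep only classes with at least two examples, then split again.
counts = toy["label"].value_counts()
kept = toy[toy["label"].isin(counts[counts >= 2].index)]

X_tr, X_te, y_tr, y_te = train_test_split(
    kept["text"], kept["label"],
    test_size=0.2, random_state=42, stratify=kept["label"],
)
print(sorted(y_tr.unique()), sorted(y_te.unique()))
```

A threshold of two is the minimum that lets the split run. A class with only two examples can still land entirely in the training split, which is consistent with the 112 training classes versus 83 test classes reported further down.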

In [22]:
X = df['clean_text']
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)


In [24]:
tfidf = TfidfVectorizer(
    max_features=5000,
    stop_words='english'
)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)


In [32]:
print("Train class distribution:")
print(y_train.value_counts())

print("\nTest class distribution:")
print(y_test.value_counts())


Train class distribution:
sentiment
positive       36
joy            35
excitement     29
contentment    15
neutral        14
               ..
enjoyment       2
devastated      2
resentment      2
bitter          2
creativity      2
Name: count, Length: 112, dtype: int64

Test class distribution:
sentiment
positive       9
joy            9
excitement     8
neutral        4
contentment    4
              ..
love           1
proud          1
bitterness     1
empowerment    1
happiness      1
Name: count, Length: 83, dtype: int64


### Text Vectorization
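TF-IDF was used to convert the cleaned text into numerical feature vectors. The vectorizer drops English stop words, keeps at most 5,000 terms, and down-weights words that appear in many documents. It is fitted on the training split only and then applied to the test split, so no test vocabulary leaks into training.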
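
The fit-on-train, transform-on-test pattern can be seen on a toy corpus. This is a minimal sketch with made-up sentences, not the notebook's data. It shows that stop words are dropped, that a word appearing in more documents gets a lower IDF weight, and that terms unseen during fitting are ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents for illustration only.
train_docs = ["great day at the park", "terrible traffic this morning", "great workout"]
test_docs = ["great traffic", "unseen words only"]

vec = TfidfVectorizer(stop_words="english")
X_train_toy = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
X_test_toy = vec.transform(test_docs)        # reuses them; unseen terms are ignored

vocab = vec.vocabulary_
print(sorted(vocab))
print(X_train_toy.shape, X_test_toy.shape)

# "great" occurs in two training documents, "park" in one, so "great" has the lower IDF.
print(vec.idf_[vocab["great"]], vec.idf_[vocab["park"]])

# The second test document shares no vocabulary with the training set.
print(X_test_toy[1].nnz)
```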


In [25]:
nb = MultinomialNB()
nb.fit(X_train_tfidf, y_train)

y_pred_nb = nb.predict(X_test_tfidf)


In [26]:
accuracy_nb = accuracy_score(y_test, y_pred_nb)
f1_nb = f1_score(y_test, y_pred_nb, average='weighted')

accuracy_nb, f1_nb


(0.17557251908396945, 0.09588458052077725)

In [27]:
lr = LogisticRegression(max_iter=300)
lr.fit(X_train_tfidf, y_train)

y_pred_lr = lr.predict(X_test_tfidf)


In [28]:
accuracy_lr = accuracy_score(y_test, y_pred_lr)
f1_lr = f1_score(y_test, y_pred_lr, average='weighted')

accuracy_lr, f1_lr


(0.26717557251908397, 0.20288801792773173)

In [29]:
print("Naive Bayes Classification Report")
print(classification_report(y_test, y_pred_nb))

print("Logistic Regression Classification Report")
print(classification_report(y_test, y_pred_lr))


Naive Bayes Classification Report
                precision    recall  f1-score   support

    acceptance       0.00      0.00      0.00         2
accomplishment       0.00      0.00      0.00         1
    admiration       0.00      0.00      0.00         1
     adventure       0.00      0.00      0.00         1
   ambivalence       0.00      0.00      0.00         1
     amusement       0.00      0.00      0.00         1
  anticipation       0.00      0.00      0.00         1
       arousal       0.00      0.00      0.00         1
           awe       0.00      0.00      0.00         2
           bad       0.00      0.00      0.00         1
      betrayal       0.00      0.00      0.00         1
        bitter       0.00      0.00      0.00         1
    bitterness       0.00      0.00      0.00         1
       boredom       0.00      0.00      0.00         1
      calmness       0.00      0.00      0.00         1
    compassion       0.00      0.00      0.00         1
 compassionat

  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


In [30]:
comparison = pd.DataFrame({
    'Model': ['Naive Bayes', 'Logistic Regression'],
    'Accuracy': [accuracy_nb, accuracy_lr],
    'F1 Score': [f1_nb, f1_lr]
})

comparison


| | Model | Accuracy | F1 Score |
|---|---|---|---|
| 0 | Naive Bayes | 0.175573 | 0.095885 |
| 1 | Logistic Regression | 0.267176 | 0.202888 |


## Final Summary
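- Text data was cleaned and vectorized using TF-IDF.
- After dropping labels with fewer than two examples, 112 sentiment classes remained, with a median of 4 examples per class.
- Multinomial Naive Bayes served as the baseline, reaching an accuracy of 0.176 and a weighted F1 of 0.096.
- Logistic Regression performed better, with an accuracy of 0.267 and a weighted F1 of 0.203, although both scores are low in absolute terms.
- Weighted F1 was reported alongside accuracy because the classes are heavily imbalanced.

This NLP pipeline demonstrates an end-to-end text classification workflow. The low scores likely reflect how few examples are available for most of the 112 classes.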
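
To put accuracy and F1 values like those in the comparison table in context, it can help to compare them against a majority-class baseline and to look at macro F1 as well as weighted F1. The sketch below uses scikit-learn's `DummyClassifier` on made-up toy labels; it is not part of the original notebook, and the numbers it prints say nothing about the real dataset.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Toy labels for illustration only: one frequent class and several rare ones.
y_train_toy = ["positive"] * 6 + ["joy"] * 2 + ["grief", "awe"]
y_test_toy = ["positive"] * 3 + ["joy", "grief"]
X_train_toy = [[0]] * len(y_train_toy)  # features are ignored by the dummy model
X_test_toy = [[0]] * len(y_test_toy)

# Always predicts the most frequent training label ("positive").
baseline = DummyClassifier(strategy="most_frequent").fit(X_train_toy, y_train_toy)
y_pred_toy = baseline.predict(X_test_toy)

acc = accuracy_score(y_test_toy, y_pred_toy)
# zero_division=0 scores classes that are never predicted as 0 without emitting warnings.
f1_weighted = f1_score(y_test_toy, y_pred_toy, average="weighted", zero_division=0)
f1_macro = f1_score(y_test_toy, y_pred_toy, average="macro", zero_division=0)

print(acc, f1_weighted, f1_macro)
```

On this toy data the baseline reaches 0.6 accuracy, 0.45 weighted F1, and 0.25 macro F1. Macro F1 weights every class equally, so it drops fastest when rare classes are never predicted.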
