Sentiment Analysis

In [62]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import re
import nltk
from nltk.tokenize import word_tokenize

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [36]:
df = pd.read_csv('./train.csv' , encoding_errors='ignore')
df.drop(['textID' , 'Population -2020' , 'Land Area (Km)' , 'Density (P/Km)'], axis=1 , inplace=True)
df.ffill(axis=0, inplace=True)
df.dropna(inplace=True , axis=0)

df[-10:-5].head()


Unnamed: 0,text,selected_text,sentiment,Time of Tweet,Age of User,Country
27471,"i`m defying gravity. and nobody in alll of oz,...","i`m defying gravity. and nobody in alll of oz,...",neutral,morning,46-60,France
27472,http://twitpic.com/663vr - Wanted to visit the...,were too late,negative,noon,60-70,Gabon
27473,in spoke to you yesterday and u didnt respond...,in spoke to you yesterday and u didnt respond ...,neutral,night,70-100,Gambia
27474,So I get up early and I feel good about the da...,I feel good ab,positive,morning,0-20,Georgia
27475,enjoy ur night,enjoy,positive,noon,21-30,Germany


Observation: negative sentiment is slightly more common at night (2618 tweets) than at noon (2602) or in the morning (2561), but the gap is small.

In [56]:
df.groupby(['sentiment'])['Time of Tweet'].value_counts().sort_values(ascending=False)

sentiment  Time of Tweet
neutral    morning          3763
           night            3680
           noon             3675
positive   noon             2883
           night            2862
           morning          2837
negative   night            2618
           noon             2602
           morning          2561
Name: count, dtype: int64
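
The counts above are close across times of day, so one way to compare them is to convert each into a share of all tweets posted at that time. The sketch below hard-codes the counts printed above rather than reading train.csv; the dictionary layout and variable names are illustrative assumptions, not part of the original notebook.

```python
# Counts copied from the value_counts() output above: counts[sentiment][time_of_tweet]
counts = {
    "neutral":  {"morning": 3763, "night": 3680, "noon": 3675},
    "positive": {"morning": 2837, "night": 2862, "noon": 2883},
    "negative": {"morning": 2561, "night": 2618, "noon": 2602},
}

times = ["morning", "noon", "night"]

# Total number of tweets posted at each time of day
totals = {t: sum(by_time[t] for by_time in counts.values()) for t in times}

# Fraction of the tweets at each time of day that are negative
negative_share = {t: counts["negative"][t] / totals[t] for t in times}

for t in times:
    print(f"{t:8s} total={totals[t]:5d} negative share={negative_share[t]:.3f}")
```

With these counts the negative share is about 0.280 in the morning, 0.284 at noon, and 0.286 at night, a spread of well under one percentage point. On the DataFrame itself, pd.crosstab(df['sentiment'], df['Time of Tweet'], normalize='columns') should give the same shares.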

In [44]:
nltk.download('stopwords')
nltk.download('punkt_tab')

stop_words = set(nltk.corpus.stopwords.words('english'))

def preprocess_text(text):
    if not isinstance(text, str):  # Guard against None/NaN values
        return ""
    # Remove URLs, mentions, hashtags, and any non-letter characters (digits, punctuation)
    text = re.sub(r"http\S+|@\S+|#\S+|[^a-zA-Z\s]", "", text)
    text = text.lower()  # Convert to lowercase
    tokens = word_tokenize(text)  # Tokenize text
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return " ".join(tokens)

df['text'] = df['text'].apply(preprocess_text)
df.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/sahildev/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/sahildev/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Unnamed: 0,text,selected_text,sentiment,Time of Tweet,Age of User,Country
0,id responded going,"I`d have responded, if I were going",1,morning,0-20,Afghanistan
1,sooo sad miss san diego,Sooo SAD,0,noon,21-30,Albania
2,boss bullying,bullying me,0,night,31-45,Algeria
3,interview leave alone,leave me alone,0,morning,46-60,Andorra
4,sons couldnt put releases already bought,"Sons of ****,",0,noon,60-70,Angola
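
To see what the regex in preprocess_text does on its own, the sketch below applies only the substitution and lowercasing steps. It deliberately skips NLTK: plain str.split() stands in for word_tokenize and no stopwords are removed, so the result differs from the table above. The helper name strip_noise and the "@bob #fun" suffix in the second input are made up for illustration.

```python
import re

# Same pattern as in preprocess_text: URLs, @mentions, #hashtags, then any non-letter character
PATTERN = r"http\S+|@\S+|#\S+|[^a-zA-Z\s]"

def strip_noise(text):
    """Apply only the regex and lowercasing steps (no NLTK tokenizing, no stopword removal)."""
    cleaned = re.sub(PATTERN, "", text).lower()
    # Collapse leftover runs of whitespace; split() is a stand-in for word_tokenize
    return " ".join(cleaned.split())

first = strip_noise("I`d have responded, if I were going")
second = strip_noise("http://twitpic.com/663vr - Wanted to visit @bob #fun")
print(first)
print(second)
```

The first call prints "id have responded if i were going": the backtick is deleted, so the contraction I`d collapses into the token "id", which is not an English stopword and therefore survives in row 0 of the table above. The second call prints "wanted to visit", since the URL, mention, hashtag, and dash are all removed.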


In [45]:
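# Use a separate encoder per column so each mapping can be inverted later
sentiment_le = LabelEncoder()
time_le = LabelEncoder()
df['sentiment'] = sentiment_le.fit_transform(df['sentiment'])
df['Time of Tweet'] = time_le.fit_transform(df['Time of Tweet'])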
df.head()

Unnamed: 0,text,selected_text,sentiment,Time of Tweet,Age of User,Country
0,id responded going,"I`d have responded, if I were going",1,0,0-20,Afghanistan
1,sooo sad miss san diego,Sooo SAD,0,2,21-30,Albania
2,boss bullying,bullying me,0,1,31-45,Algeria
3,interview leave alone,leave me alone,0,0,46-60,Andorra
4,sons couldnt put releases already bought,"Sons of ****,",0,2,60-70,Angola
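
The integer codes in the table above follow from how LabelEncoder numbers classes: it sorts the distinct labels and assigns 0, 1, 2, ... in that order. The sketch below reproduces that rule in plain Python so the mapping can be read off directly; label_mapping is a hypothetical helper, not a scikit-learn function.

```python
def label_mapping(values):
    """Map each distinct label to an integer by sorted order, mirroring LabelEncoder."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}

sentiment_codes = label_mapping(["neutral", "negative", "positive"])
time_codes = label_mapping(["morning", "noon", "night"])

print(sentiment_codes)
print(time_codes)
```

This gives negative=0, neutral=1, positive=2 and morning=0, night=1, noon=2, which agrees with the rows shown above. On a fitted LabelEncoder, the classes_ attribute holds the labels in the same sorted order.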


- Text Model Building

In [70]:
X = df['text']
y = df['sentiment']

# Hold out 20% of the rows for evaluation
X_train , X_test , y_train , y_test = train_test_split(X , y , test_size=0.2 , random_state=42)

# Bag-of-words counts over the 1000 most frequent terms; fit on the training split only
tf = CountVectorizer( max_features=1000)
X_train = tf.fit_transform(X_train)
X_test = tf.transform(X_test)

model = LogisticRegression()
model.fit(X_train , y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test , y_pred))


              precision    recall  f1-score   support

           0       0.72      0.57      0.64      1562
           1       0.63      0.75      0.69      2230
           2       0.77      0.72      0.74      1705

    accuracy                           0.69      5497
   macro avg       0.71      0.68      0.69      5497
weighted avg       0.70      0.69      0.69      5497
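
The last two rows of the report can be checked by hand: the macro average is the plain mean of the per-class values, while the weighted average weights each class by its support. The sketch below recomputes both from the rounded numbers printed above, so it can only be expected to match to two decimals; the function names are illustrative.

```python
# Per-class values copied from the classification report above
precision = {0: 0.72, 1: 0.63, 2: 0.77}
recall = {0: 0.57, 1: 0.75, 2: 0.72}
support = {0: 1562, 1: 2230, 2: 1705}

def macro_avg(metric):
    """Unweighted mean over classes."""
    return sum(metric.values()) / len(metric)

def weighted_avg(metric):
    """Mean over classes, weighted by the number of test rows in each class."""
    total = sum(support.values())
    return sum(metric[c] * support[c] for c in metric) / total

print("rows in test set:", sum(support.values()))
print("precision:", round(macro_avg(precision), 2), round(weighted_avg(precision), 2))
print("recall:   ", round(macro_avg(recall), 2), round(weighted_avg(recall), 2))
```

The supports add up to 5497, and the recomputed averages (0.71 and 0.70 for precision, 0.68 and 0.69 for recall) agree with the report. The weighted recall equals the overall accuracy of 0.69, as expected for single-label classification.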





Precision: the ratio of true positives to all predicted positives (true positives plus false positives). It measures how often the model is right when it predicts a given class. It is calculated as TP/(TP+FP).

Recall: the ratio of true positives to all actual positives (true positives plus false negatives). It measures how many of the actual instances of a class the model finds. It is calculated as TP/(TP+FN).

F1 score: the harmonic mean of precision and recall. It is high only when both precision and recall are high. It is calculated as 2\*(precision\*recall)/(precision+recall).
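
As a worked example of the formulas above, the sketch below computes the three metrics from raw counts and then checks the F1 formula against two rows of the report. The counts tp=8, fp=2, fn=4 are made-up numbers for illustration, and the helpers do not guard against division by zero.

```python
def f1_from(precision, recall):
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f1_from(precision, recall)

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 2), round(r, 2), round(f1, 2))

# Rounded precision and recall from the report, classes 0 and 2
f1_class0 = round(f1_from(0.72, 0.57), 2)
f1_class2 = round(f1_from(0.77, 0.72), 2)
print(f1_class0, f1_class2)
```

The hypothetical counts give precision 0.8, recall 0.67, and F1 0.73. For classes 0 and 2 the formula reproduces the reported F1 scores of 0.64 and 0.74; small differences are possible for other rows because the report's precision and recall are already rounded.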