Twitter Sentiment Analysis is the process of automatically identifying the emotions or opinions expressed in tweets. By analyzing the text, we can classify tweets as positive, negative or neutral. This helps businesses and researchers track public mood, brand reputation or reactions to events in real time. Python libraries such as TextBlob, Tweepy and NLTK make it easy to collect tweets, process the text and perform sentiment analysis; this walkthrough uses pandas and scikit-learn on a pre-collected dataset.

**How is Twitter Sentiment Analysis Useful?**

Twitter Sentiment Analysis is important because it helps people and businesses understand what the public thinks in real time.

Millions of tweets are posted every day, sharing opinions about brands, products, events or social issues. By analyzing this huge stream of data, companies can measure customer satisfaction, spot trends early, handle negative feedback quickly and make better decisions based on how people actually feel.

It’s also useful for researchers and governments to monitor public mood during elections, crises or big events as it turns raw tweets into valuable insights.

# **Step by Step Implementation**

**Step 1: Import Necessary Libraries**

This block imports the required libraries. It uses pandas to load and handle the data, TfidfVectorizer to turn text into numbers and scikit-learn to split the data, train the models and evaluate them.

In [1]:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

**Step 2: Load Dataset**

Here we load the Sentiment140 dataset from a zipped CSV file, which you can download from Kaggle. The file has no header row, so we pass header=None and refer to columns by position.

We keep only the polarity (column 0) and tweet text (column 5) columns, rename them for clarity and print the first few rows to check the data.

In [2]:
df = pd.read_csv('/content/sample_data/training.1600000.processed.noemoticon.csv.zip.zip', encoding='latin-1', header=None)
df = df[[0, 5]]
df.columns = ['polarity', 'text']
print(df.head())

   polarity                                               text
0         0  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1         0  is upset that he can't update his Facebook by ...
2         0  @Kenichan I dived many times for the ball. Man...
3         0    my whole body feels itchy and like its on fire 
4         0  @nationwideclass no, it's not behaving at all....
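
The columns can also be named at load time, which avoids positional indices such as 0 and 5. The following is a minimal sketch, assuming the usual six-column layout of Sentiment140 (polarity, id, date, query flag, user, text). The `COLUMNS` list and the two sample rows are illustrative stand-ins for the real file, not part of the dataset.

```python
import io
import pandas as pd

# Assumed column order of the Sentiment140 CSV (the file has no header row).
COLUMNS = ["polarity", "id", "date", "query", "user", "text"]

# Two made-up rows standing in for the real file.
sample_csv = io.StringIO(
    "0,1,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,user_a,so tired today\n"
    "4,2,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,user_b,what a great morning\n"
)

sample_df = pd.read_csv(
    sample_csv,
    header=None,                   # no header row in the data
    names=COLUMNS,                 # assign readable names instead
    usecols=["polarity", "text"],  # load only the columns the tutorial uses
)
print(sample_df)
```

For the real dataset, the same call would take the file path and `encoding='latin-1'` as in the cell above.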


**Step 3: Keep Only Positive and Negative Sentiments**

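Here we remove any neutral tweets (polarity 2) and remap the labels so that 0 stays negative and 4 becomes 1 for positive. The training file contains only polarities 0 and 4, so the filter acts as a safeguard rather than dropping any rows.

Then we print how many positive and negative tweets are left in the data.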

In [3]:
df = df[df.polarity != 2].copy()  # copy to avoid pandas' SettingWithCopyWarning on the next assignment

df['polarity'] = df['polarity'].map({0: 0, 4: 1})

print(df['polarity'].value_counts())

polarity
0    800000
1    800000
Name: count, dtype: int64
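
One thing worth checking after a dictionary-based `map` is that no label was left unmapped, because `map` returns NaN for any value missing from the dictionary. The sketch below illustrates this on a toy Series; the values are made up for the example.

```python
import pandas as pd

labels = pd.Series([0, 4, 4, 0, 2])   # toy polarity values, including one neutral
kept = labels[labels != 2]            # drop the neutral row
mapped = kept.map({0: 0, 4: 1})       # 0 -> negative (0), 4 -> positive (1)

# map() yields NaN for values absent from the dictionary, so a NaN count
# of zero confirms that every remaining label was recognised.
print(mapped.tolist())
print("Unmapped labels:", mapped.isna().sum())
```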


**Step 4: Clean the Tweets**

Here we define a simple function that converts text to lowercase for consistency and apply it to every tweet in the dataset. This is the only cleaning step: mentions, URLs and punctuation are left in place.

Then we show the original and cleaned versions of the first few tweets.

In [4]:
def clean_text(text):
    return text.lower()

df['clean_text'] = df['text'].apply(clean_text)

print(df[['text', 'clean_text']].head())

                                                text  \
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...   
1  is upset that he can't update his Facebook by ...   
2  @Kenichan I dived many times for the ball. Man...   
3    my whole body feels itchy and like its on fire    
4  @nationwideclass no, it's not behaving at all....   

                                          clean_text  
0  @switchfoot http://twitpic.com/2y1zl - awww, t...  
1  is upset that he can't update his facebook by ...  
2  @kenichan i dived many times for the ball. man...  
3    my whole body feels itchy and like its on fire   
4  @nationwideclass no, it's not behaving at all....  
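
Lowercasing leaves handles and links in the text, as the output above shows. A slightly more thorough cleaner could be sketched as follows. This is an optional extension, not the function the tutorial trains on; `clean_tweet` and the two regular expressions are illustrative names, and whether this changes the reported accuracy is not tested here.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # http(s) links and bare www links
MENTION_RE = re.compile(r"@\w+")                # @user handles

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions and extra whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)       # remove links
    text = MENTION_RE.sub(" ", text)   # remove user handles
    text = re.sub(r"\s+", " ", text)   # collapse runs of whitespace
    return text.strip()

print(clean_tweet("@Kenichan I dived many times for the ball. http://example.com"))
```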


**Step 5: Train Test Split**

This code splits the clean_text and polarity columns into training and testing sets using an 80/20 split.

random_state=42 ensures reproducibility.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    df['clean_text'],
    df['polarity'],
    test_size=0.2,
    random_state=42
)

print("Train size:", len(X_train))
print("Test size:", len(X_test))

Train size: 1280000
Test size: 320000
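
A plain random split does not guarantee that both classes appear in the test set in exactly the same proportion as in the full data. If that matters, `train_test_split` accepts a `stratify` argument. The sketch below shows its effect on a small balanced toy dataset; the texts and labels are made up for the example.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

toy_texts = [f"tweet {i}" for i in range(100)]
toy_labels = [0] * 50 + [1] * 50   # balanced toy labels

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.2,
    random_state=42,
    stratify=toy_labels,   # keep the class ratio identical in both splits
)

print("Train size:", len(X_tr), "Test size:", len(X_te))
print("Test class counts:", Counter(y_te))
```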


**Step 6: Perform Vectorization**

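This code creates a TF-IDF vectorizer that converts text into numerical features. With ngram_range=(1,2) it uses both unigrams and bigrams, and max_features=5000 keeps only the 5000 terms that occur most often across the training tweets.

The vectorizer is fitted on the training data only and then applied to the test data, so the vocabulary and IDF weights are never learned from test tweets. Finally, the code prints the shapes of the resulting TF-IDF matrices.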

In [6]:
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1,2))

X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print("TF-IDF shape (train):", X_train_tfidf.shape)
print("TF-IDF shape (test):", X_test_tfidf.shape)

TF-IDF shape (train): (1280000, 5000)
TF-IDF shape (test): (320000, 5000)
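
To make the weighting concrete, here is a hand-rolled sketch of TF-IDF over unigrams and bigrams in plain Python. It follows the smoothed formula scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. It is a simplification: it splits on whitespace (so single-character words such as "i" are kept, unlike the vectorizer's default tokenizer) and does not apply a `max_features` cut-off. The helper names and the three toy documents are illustrative.

```python
import math
from collections import Counter

docs = ["i love this", "i hate this", "love love it"]

def ngrams(text, n_min=1, n_max=2):
    """Return the word n-grams of a text for every n in [n_min, n_max]."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def tfidf(docs):
    """Smoothed, L2-normalised TF-IDF: one {term: weight} dict per document."""
    counts = [Counter(ngrams(d)) for d in docs]
    n_docs = len(docs)
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for c in counts:
        raw = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

weights = tfidf(docs)
print(ngrams("i love this"))
print(round(weights[0]["love this"], 3), round(weights[0]["i"], 3))
```

In the first document, the bigram "love this" appears in only one document while "i" appears in two, so "love this" receives the larger weight.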


**Step 7: Train Bernoulli Naive Bayes model**

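Here we train a Bernoulli Naive Bayes classifier on the TF-IDF features from the training data. Note that BernoulliNB models binary features: with its default binarize=0.0, every non-zero TF-IDF weight is treated as 1, so the classifier only sees whether a term is present, not how strongly it is weighted.

The model then predicts sentiments for the test data, and we print the accuracy and a detailed classification report.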

In [7]:
bnb = BernoulliNB()
bnb.fit(X_train_tfidf, y_train)

bnb_pred = bnb.predict(X_test_tfidf)

print("Bernoulli Naive Bayes Accuracy:", accuracy_score(y_test, bnb_pred))
print("\nBernoulliNB Classification Report:\n", classification_report(y_test, bnb_pred))

Bernoulli Naive Bayes Accuracy: 0.766478125

BernoulliNB Classification Report:
               precision    recall  f1-score   support

           0       0.77      0.75      0.76    159494
           1       0.76      0.78      0.77    160506

    accuracy                           0.77    320000
   macro avg       0.77      0.77      0.77    320000
weighted avg       0.77      0.77      0.77    320000
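
The columns of the classification report can be reproduced by hand from the counts of true positives, false positives and false negatives. The sketch below does this for a single class on ten made-up labels; `binary_report` is an illustrative helper, not a scikit-learn function, and it assumes the chosen class occurs at least once in both the true and the predicted labels.

```python
def binary_report(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, F1 and support for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correctly flagged
    fp = sum(t != positive and p == positive for t, p in pairs)  # wrongly flagged
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp)   # of everything predicted positive, how much was right
    recall = tp / (tp + fn)      # of everything actually positive, how much was found
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "support": tp + fn,      # number of true samples of this class
    }

y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(binary_report(y_true_toy, y_pred_toy, positive=1))
```

Calling the helper once with `positive=0` and once with `positive=1` gives the two per-class rows of the report; the macro average is the plain mean of those rows, and the weighted average weights each row by its support.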



**Step 8: Train Support Vector Machine (SVM) model**

This code trains a linear Support Vector Machine (SVM) on the TF-IDF features using scikit-learn's LinearSVC, with the solver capped at 1000 iterations.

It predicts the test labels, then prints the accuracy and a detailed classification report showing how well the SVM performed.

In [8]:
svm = LinearSVC(max_iter=1000)
svm.fit(X_train_tfidf, y_train)

svm_pred = svm.predict(X_test_tfidf)

print("SVM Accuracy:", accuracy_score(y_test, svm_pred))
print("\nSVM Classification Report:\n", classification_report(y_test, svm_pred))

SVM Accuracy: 0.79528125

SVM Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.78      0.79    159494
           1       0.79      0.81      0.80    160506

    accuracy                           0.80    320000
   macro avg       0.80      0.80      0.80    320000
weighted avg       0.80      0.80      0.80    320000
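
Besides hard labels, `LinearSVC` exposes `decision_function`, which returns a signed score per sample: the sign decides the class and a larger magnitude means the sample lies further from the decision boundary. The sketch below shows this on a six-tweet toy corpus; the texts, labels and `toy_` names are made up for illustration, and the scores say nothing about the full model above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

toy_texts = ["i love this", "what a great day", "so happy today",
             "i hate this", "what a terrible day", "so sad today"]
toy_labels = [1, 1, 1, 0, 0, 0]   # 1 = positive, 0 = negative

toy_vec = TfidfVectorizer()
toy_X = toy_vec.fit_transform(toy_texts)
toy_svm = LinearSVC(max_iter=1000).fit(toy_X, toy_labels)

new_X = toy_vec.transform(["love this great day", "hate this terrible day"])
scores = toy_svm.decision_function(new_X)   # one signed score per tweet
preds = toy_svm.predict(new_X)              # class 1 where the score is positive

for score, pred in zip(scores, preds):
    print(f"score={score:+.3f} -> predicted label {pred}")
```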



**Step 9: Train Logistic Regression model**

This code trains a Logistic Regression model on the TF-IDF features. max_iter=100 caps the solver at 100 iterations, which is also scikit-learn's default.

It predicts sentiment labels for the test data and prints the accuracy and a detailed classification report for model evaluation.

In [9]:
logreg = LogisticRegression(max_iter=100)
logreg.fit(X_train_tfidf, y_train)

logreg_pred = logreg.predict(X_test_tfidf)

print("Logistic Regression Accuracy:", accuracy_score(y_test, logreg_pred))
print("\nLogistic Regression Classification Report:\n", classification_report(y_test, logreg_pred))

Logistic Regression Accuracy: 0.79539375

Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.78      0.79    159494
           1       0.79      0.81      0.80    160506

    accuracy                           0.80    320000
   macro avg       0.80      0.80      0.80    320000
weighted avg       0.80      0.80      0.80    320000
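
Logistic Regression can also report how confident it is in each prediction through `predict_proba`. The sketch below illustrates the call on a six-tweet toy corpus; the texts, labels and `toy_` names are made up for the example, so the probabilities are not those of the model trained above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

toy_texts = ["i love this", "what a great day", "so happy today",
             "i hate this", "what a terrible day", "so sad today"]
toy_labels = [1, 1, 1, 0, 0, 0]   # 1 = positive, 0 = negative

toy_vec = TfidfVectorizer()
toy_X = toy_vec.fit_transform(toy_texts)
toy_logreg = LogisticRegression(max_iter=100).fit(toy_X, toy_labels)

new_tweets = ["love this great day", "hate this terrible day"]
proba = toy_logreg.predict_proba(toy_vec.transform(new_tweets))

# Columns follow toy_logreg.classes_, i.e. [P(negative), P(positive)].
for tweet, (p_neg, p_pos) in zip(new_tweets, proba):
    print(f"{tweet!r}: negative={p_neg:.2f}, positive={p_pos:.2f}")
```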



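**Step 10: Make Predictions on Sample Tweets**

This code takes three sample tweets and transforms them into TF-IDF features using the same fitted vectorizer.

It then predicts their sentiment using the trained BernoulliNB, SVM and Logistic Regression models and prints the results for each classifier.

In the output, 1 stands for positive and 0 for negative. Because the models were trained on only these two classes, the lukewarm third tweet must still be assigned one of them; here all three classifiers label it positive.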

In [10]:
sample_tweets = ["I love this!", "I hate that!", "It was okay, not great."]
sample_vec = vectorizer.transform(sample_tweets)

print("\nSample Predictions:")
print("BernoulliNB:", bnb.predict(sample_vec))
print("SVM:", svm.predict(sample_vec))
print("Logistic Regression:", logreg.predict(sample_vec))


Sample Predictions:
BernoulliNB: [1 0 1]
SVM: [1 0 1]
Logistic Regression: [1 0 1]
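
The numeric predictions are easier to read once mapped back to names. A minimal sketch, using the prediction values printed above; `LABEL_NAMES` is an illustrative helper name.

```python
LABEL_NAMES = {0: "Negative", 1: "Positive"}

sample_tweets = ["I love this!", "I hate that!", "It was okay, not great."]
predictions = [1, 0, 1]   # values copied from the output above

# Pair each tweet with the readable name of its predicted class.
named = [(tweet, LABEL_NAMES[label]) for tweet, label in zip(sample_tweets, predictions)]
for tweet, name in named:
    print(f"{tweet!r} -> {name}")
```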
