
Sentiment analysis on Twitter classifies the text of each tweet as positive or negative. We’ll use the Sentiment140 dataset (from Kaggle), which contains labeled tweets. Begin by installing the needed packages and importing the libraries used below.

In [5]:
!pip install pandas scikit-learn
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report




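Load data: Download the Sentiment140 CSV (it’s often zipped as training.1600000.processed.noemoticon.csv.zip; here it is saved as /content/archive.zip). pandas can read a zip archive directly when it contains a single CSV file. The file has no header row, so load it with header=None and keep only column 0 (polarity) and column 5 (text).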

In [6]:
df = pd.read_csv('/content/archive.zip',
                 encoding='latin-1', header=None)
df = df[[0, 5]]
df.columns = ['polarity', 'text']
print(df.head())


   polarity                                               text
0         0  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1         0  is upset that he can't update his Facebook by ...
2         0  @Kenichan I dived many times for the ball. Man...
3         0    my whole body feels itchy and like its on fire 
4         0  @nationwideclass no, it's not behaving at all....
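
As an alternative to selecting columns by position, the columns can be named at load time. The sketch below assumes the usual six-field Sentiment140 layout (polarity, id, date, query, user, text); the column names are our own labels, and the two rows are made-up stand-ins for the real file.

```python
import io
import pandas as pd

# Assumed column order of the Sentiment140 CSV (the file itself has no header row).
SENTIMENT140_COLUMNS = ["polarity", "id", "date", "query", "user", "text"]

# Two made-up rows in the same six-column layout, standing in for the real file.
sample_csv = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","feeling sad today"\n'
    '"4","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b","what a great day"\n'
)

sample_df = pd.read_csv(
    sample_csv,
    header=None,                     # no header row in the data
    names=SENTIMENT140_COLUMNS,      # assign readable names up front
    usecols=["polarity", "text"],    # parse only the two columns we need
)
print(sample_df)
```

Reading only the needed columns with usecols may also reduce memory use on the full 1.6 million rows.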


Filter labels: Remove neutral tweets (polarity=2) and map labels 0→0 (negative), 4→1 (positive):

In [7]:
df = df[df.polarity != 2].copy()  # explicit copy avoids pandas' SettingWithCopyWarning on the next line
df['polarity'] = df['polarity'].map({0: 0, 4: 1})
print(df['polarity'].value_counts())

polarity
0    800000
1    800000
Name: count, dtype: int64
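
Series.map returns NaN for any value missing from the mapping, so it can be worth checking that every label was mapped. A minimal sketch on a toy DataFrame (the toy data and the label column name are illustrative, not part of the dataset):

```python
import pandas as pd

# Toy data with one neutral row (polarity 2) that should be dropped.
toy = pd.DataFrame({"polarity": [0, 4, 2, 4], "text": ["a", "b", "c", "d"]})

toy = toy[toy["polarity"] != 2].copy()          # drop neutral rows
toy["label"] = toy["polarity"].map({0: 0, 4: 1})  # 0 = negative, 1 = positive

# Any polarity value not covered by the mapping would show up as NaN here.
unmapped = toy["label"].isna().sum()
print("Unmapped labels:", unmapped)
```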


Text preprocessing: A simple first step is to lowercase the text. The result is stored in a new clean_text column.

In [8]:
df['clean_text'] = df['text'].str.lower()
print(df[['text','clean_text']].head())

                                                text  \
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...   
1  is upset that he can't update his Facebook by ...   
2  @Kenichan I dived many times for the ball. Man...   
3    my whole body feels itchy and like its on fire    
4  @nationwideclass no, it's not behaving at all....   

                                          clean_text  
0  @switchfoot http://twitpic.com/2y1zl - awww, t...  
1  is upset that he can't update his facebook by ...  
2  @kenichan i dived many times for the ball. man...  
3    my whole body feels itchy and like its on fire   
4  @nationwideclass no, it's not behaving at all....  
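
Lowercasing leaves @mentions and URLs in the text, as the output above shows. A possible further cleaning step is sketched below; the clean_tweet helper and its regular expressions are hypothetical additions, not part of this notebook's pipeline, and whether they improve accuracy would need to be checked.

```python
import re

# Hypothetical patterns for this sketch; adjust to taste.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet, drop URLs and @mentions, and collapse whitespace."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    return WHITESPACE_PATTERN.sub(" ", text).strip()


# Made-up example tweet.
cleaned = clean_tweet("@someuser Check http://example.com/page NOW")
print(cleaned)  # check now
```

It could be applied with df['clean_text'] = df['text'].map(clean_tweet) in place of the lowercase-only step.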


Train-test split: Split the data into training and test sets (80/20). With 1,600,000 tweets, this yields 1,280,000 training and 320,000 test tweets.

In [9]:
X = df['clean_text']
y = df['polarity']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print("Train size:", len(X_train), "Test size:", len(X_test))

Train size: 1280000 Test size: 320000
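
The split above is purely random. Because the classes are balanced, that is probably fine here, but train_test_split also accepts a stratify argument that keeps the class proportions identical in both parts. A small sketch on toy data (all names are illustrative):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 10 negative and 10 positive examples.
toy_texts = [f"tweet {i}" for i in range(20)]
toy_labels = [0] * 10 + [1] * 10

tr_X, te_X, tr_y, te_y = train_test_split(
    toy_texts, toy_labels,
    test_size=0.2,
    random_state=42,
    stratify=toy_labels,  # preserve the 50/50 class ratio in both splits
)
print("Train:", Counter(tr_y), "Test:", Counter(te_y))
```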


Vectorization: Convert text to numeric features using TF-IDF. For speed, limit the vocabulary to the 5,000 most frequent unigrams and bigrams:

In [10]:
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1,2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
print("TF-IDF shape (train):", X_train_tfidf.shape)

TF-IDF shape (train): (1280000, 5000)


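Model training: Train several classifiers and compare their test accuracy: Bernoulli Naive Bayes, a linear SVM, and logistic regression. Note that BernoulliNB binarizes its input by default (binarize=0.0), so it only sees whether a term is present, not its TF-IDF weight.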

In [11]:
bnb = BernoulliNB()
bnb.fit(X_train_tfidf, y_train)
bnb_pred = bnb.predict(X_test_tfidf)
print("BernoulliNB Accuracy:", accuracy_score(y_test, bnb_pred))

svm = LinearSVC(max_iter=1000)
svm.fit(X_train_tfidf, y_train)
svm_pred = svm.predict(X_test_tfidf)
print("SVM Accuracy:", accuracy_score(y_test, svm_pred))

logreg = LogisticRegression(max_iter=100)
logreg.fit(X_train_tfidf, y_train)
log_pred = logreg.predict(X_test_tfidf)
print("Logistic Regression Accuracy:", accuracy_score(y_test, log_pred))

BernoulliNB Accuracy: 0.766478125
SVM Accuracy: 0.79528125
Logistic Regression Accuracy: 0.79539375
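
Accuracy alone hides per-class behavior. classification_report (imported at the top but not used so far) gives precision, recall, and F1 for each class. A sketch on made-up labels, so the numbers here say nothing about the models above:

```python
from sklearn.metrics import classification_report

# Made-up true labels and predictions for illustration only.
toy_true = [0, 0, 1, 1]
toy_pred = [0, 1, 1, 1]

# Human-readable table.
print(classification_report(toy_true, toy_pred,
                            target_names=["negative", "positive"]))

# The same metrics as a nested dict, handy for programmatic comparison.
report = classification_report(toy_true, toy_pred,
                               target_names=["negative", "positive"],
                               output_dict=True)
print("Negative recall:", report["negative"]["recall"])
```

In this notebook the equivalent call would be classification_report(y_test, log_pred), and likewise for the other two models.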


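Sample predictions: Try the models on new example tweets (1=positive, 0=negative). In this run, all three models output the same labels. Note that the mixed third tweet is classified as positive, since the models only distinguish two classes.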

In [12]:
sample_tweets = ["I love this!", "I hate that!", "It was okay, not great."]
sample_vec = vectorizer.transform(sample_tweets)
print("BernoulliNB:", bnb.predict(sample_vec))
print("SVM:", svm.predict(sample_vec))
print("Logistic:", logreg.predict(sample_vec))

BernoulliNB: [1 0 1]
SVM: [1 0 1]
Logistic: [1 0 1]
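
Hard labels do not show how confident a model is, which matters for borderline tweets like the third one. LogisticRegression offers predict_proba for class probabilities (LinearSVC has no predict_proba, but provides decision_function scores instead). A self-contained sketch on a tiny made-up training set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set, for illustration only.
toy_train = ["i love this", "love it so much", "i hate that", "hate it so much"]
toy_y = [1, 1, 0, 0]

toy_vec = TfidfVectorizer()
toy_clf = LogisticRegression().fit(toy_vec.fit_transform(toy_train), toy_y)

# Each row holds [P(negative), P(positive)] for one input text.
toy_proba = toy_clf.predict_proba(
    toy_vec.transform(["love love love", "hate hate hate"])
)
print(toy_proba)
```

With the models trained above, logreg.predict_proba(sample_vec) would give the corresponding probabilities for the three sample tweets.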
