This is the Sentiment140 dataset.
It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 2 = neutral, 4 = positive) and can be used to detect sentiment.
It contains the following 6 fields:

target: the polarity of the tweet (0 = negative, 4 = positive)
ids: the id of the tweet (2087)
date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
flag: the query (lyx). If there is no query, then this value is NO_QUERY.
user: the user that tweeted (robotickilldozr)
text: the text of the tweet (Lyx is cool)

In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


In [2]:
data = pd.read_csv("tweet_sentiments.csv")
data.columns = ['label','time','date','query','username','text']
data.head()

Unnamed: 0,label,time,date,query,username,text
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [3]:
data.tail()

Unnamed: 0,label,time,date,query,username,text
1599994,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599995,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599996,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599997,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...
1599998,4,2193602129,Tue Jun 16 08:40:50 PDT 2009,NO_QUERY,RyanTrevMorris,happy #charitytuesday @theNSPCC @SparksCharity...


In [4]:
print(data.columns)
print('length of data is ', len(data))
print('shape of data is ',data.shape)


Index(['label', 'time', 'date', 'query', 'username', 'text'], dtype='object')
length of data is  1599999
shape of data is  (1599999, 6)
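
The row count above is one short of the 1,600,000 tweets mentioned in the introduction. One possible explanation, which is an assumption and not something this notebook confirms, is that the CSV has no header row, so `pd.read_csv` used the first tweet as column names. A minimal sketch of reading such a file with `header=None`, using a two-row in-memory sample (the sample rows and the names `sample_csv` and `sample` are illustrative):

```python
import io
import pandas as pd

# Two rows in the Sentiment140 layout, with no header line (illustrative sample)
sample_csv = io.StringIO(
    "0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset\n"
    "4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up\n"
)

columns = ['label', 'time', 'date', 'query', 'username', 'text']

# header=None keeps the first line as data; names= supplies the column labels
sample = pd.read_csv(sample_csv, header=None, names=columns)
print(len(sample))
print(sample['label'].tolist())
```

If the real file does have a header row, the original `read_csv` call is already correct and this sketch does not apply.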


In [5]:
print(data.dtypes)
print(np.sum(data.isnull().any(axis=1)))

label        int64
time         int64
date        object
query       object
username    object
text        object
dtype: object
0


We only care about the text and label columns

In [6]:
data = data[['text','label']].copy()
data.loc[data['label'] == 4, 'label'] = 1  # assign 1 to positive sentiment and leave 0 as negative sentiment
data.tail()

Unnamed: 0,text,label
1599994,Just woke up. Having no school is the best fee...,1
1599995,TheWDB.com - Very cool to hear old Walt interv...,1
1599996,Are you ready for your MoJo Makeover? Ask me f...,1
1599997,Happy 38th Birthday to my boo of alll time!!! ...,1
1599998,happy #charitytuesday @theNSPCC @SparksCharity...,1


In [7]:
data_pos = data[data['label']==1]
data_neg = data[data['label']==0]
# Take a subset of the data so it runs comfortably on a modest machine
data_pos = data_pos.iloc[:250000]
data_neg = data_neg.iloc[:250000]

In [8]:
data = pd.concat([data_pos,data_neg])
print(data.shape)
# Convert the tweet text to lower case
data['text'] = data['text'].str.lower()



(500000, 2)


In [9]:
def clean_text(text):
    return re.sub(r'[^a-zA-Z\s]', '', text)  # Keeps only letters and whitespace

def cleaning_repeating_char(text):
    # Collapses every run of a repeated character to a single character,
    # so legitimate double letters shrink too (e.g. "coffee" becomes "cofe")
    return re.sub(r'(.)\1+', r'\1', text)

data['text'] = data['text'].apply(clean_text)
data['text'] = data['text'].apply(cleaning_repeating_char)

data.tail()

Unnamed: 0,text,label
249995,bah to much water in cofe again,0
249996,lunch with analt then swiming maybe blah heada...,0
249997,blazinsquadnews oh no im sory for you i hope y...,0
249998,a want ice cream,0
249999,tamarzipan haha i couldnt my dad is watching tv,0
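
The output above shows words such as "cofe", "swiming" and "to much", because every run of a repeated character is reduced to one character. A gentler variant, sketched here as an alternative and not as what this notebook ran, caps runs at two characters so that ordinary double letters survive. The function name `cap_repeating_char` is hypothetical:

```python
import re

def cap_repeating_char(text):
    # Collapse runs of three or more identical characters down to two,
    # leaving legitimate double letters untouched
    return re.sub(r'(.)\1{2,}', r'\1\1', text)

for word in ["coffee", "coooool", "swimming", "soooo"]:
    print(word, "->", cap_repeating_char(word))
```

Elongated words are still not fully normalised ("soooo" becomes "soo"), so this trades some noise for keeping real spellings intact.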


In [10]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['text'])  # Convert text to numerical features
y = data['label']  # Labels (0 = Negative, 1 = Positive)


In [11]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Train Multinomial Naive Bayes
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Multinomial Model: {accuracy:.5f}")


Accuracy of Multinomial Model: 0.77443
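
For intuition about what the multinomial model computes, the sketch below rebuilds its decision rule by hand on a toy count matrix and compares it with scikit-learn. It assumes Laplace smoothing with `alpha=1.0` (the scikit-learn default); all `toy_` values are made up for illustration:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: 4 documents, 3 vocabulary words (illustrative values)
toy_counts = np.array([[2, 0, 1],
                       [1, 0, 0],
                       [0, 3, 1],
                       [0, 1, 2]])
toy_labels = np.array([1, 1, 0, 0])

toy_model = MultinomialNB(alpha=1.0).fit(toy_counts, toy_labels)

alpha = 1.0
classes = np.unique(toy_labels)
# Class prior: fraction of training documents in each class
log_prior = np.log([np.mean(toy_labels == c) for c in classes])
# Smoothed word probabilities: (count + alpha) / (class total + alpha * vocab size)
word_totals = np.array([toy_counts[toy_labels == c].sum(axis=0) for c in classes])
log_likelihood = np.log(
    (word_totals + alpha)
    / (word_totals.sum(axis=1, keepdims=True) + alpha * toy_counts.shape[1])
)

# Score = log prior + sum over words of count * log P(word | class)
toy_new = np.array([[1, 0, 0], [0, 2, 0]])
scores = toy_new @ log_likelihood.T + log_prior
manual_pred = classes[scores.argmax(axis=1)]

print(manual_pred, toy_model.predict(toy_new))
```

The hand-computed predictions should match the fitted model, which shows that the classifier is a linear score over word counts in log space.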


In [12]:
new_tweets = ["I am naive", "Turtles are overrated", "I am at peace", 'I\'m a machine']
new_X = vectorizer.transform(new_tweets)  # Convert text to numerical features

predictions = model.predict(new_X)
print(predictions)  # Output: (1=Positive, 0=Negative)


[0 1 1 0]
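
Two details of the cells above are worth noting: the vectorizer was fitted on all tweets before the train/test split, and the new tweets were not passed through the same cleaning as the training text. One way to keep these steps consistent is a scikit-learn `Pipeline`, sketched below on a tiny made-up training set. The `preprocess` helper and the example sentences are illustrative, and the repeated-character step is left out for brevity:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def preprocess(text):
    # Lowercase, then keep only letters and whitespace
    return re.sub(r'[^a-zA-Z\s]', '', text.lower())

# Tiny illustrative training set (not taken from the dataset)
train_texts = ["I love this!", "so happy today", "I hate this", "so sad today"]
train_labels = [1, 1, 0, 0]

# The pipeline applies preprocess() at both fit and predict time,
# and the vocabulary is learned only from the data passed to fit()
pipeline = make_pipeline(CountVectorizer(preprocessor=preprocess), MultinomialNB())
pipeline.fit(train_texts, train_labels)

print(pipeline.predict(["I LOVE it!!!", "I hate it"]))
```

In the notebook's setting, the pipeline would be fitted on the training split of the raw text and evaluated on the held-out split.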


In [13]:
# Load dataset
data = pd.read_csv("tweets.csv")
data.columns = ['label', 'time', 'date', 'query', 'username', 'text']
data = data[['text', 'label']]

# Convert positive sentiment from 4 to 1
data['label'] = data['label'].replace(4, 1)

# Split positive and negative tweets
data_pos = data[data['label'] == 1].iloc[:10000]  # Adjust size as needed
data_neg = data[data['label'] == 0].iloc[:10000]
data = pd.concat([data_pos,data_neg])

# Clean text function
def clean_text(text):
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)  # Remove URLs
    text = re.sub(r'@\w+', '', text)  # Remove mentions
    text = re.sub(r'#\w+', '', text)  # Remove hashtags
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters
    return text

# Apply text cleaning
data['text'] = data['text'].apply(clean_text)

Now run the model with a Gaussian Naive Bayes approach

In [14]:
import pandas as pd
import numpy as np
import re
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score


def clean_text(text):
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)  # Remove URLs
    text = re.sub(r'@\w+', '', text)  # Remove mentions
    text = re.sub(r'#\w+', '', text)  # Remove hashtags
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters
    return text


stop_words = {
    'i', 'the', 'to', 'a', 'you', 'and', 'my', 'for', 'it', 'is', 'in', 'of', 
    'on', 'im', 'me', 'that', 'with', 'so', 'have', 'just', 'be', 'its', 
    'at', 'but','was', 'your', 'are', 'this', 'now'
}

def make_Dictionary_tweets(tweets):
    """Return the 3000 most common (word, count) pairs, excluding stop words."""
    dictionary = Counter()

    total_tweets = len(tweets)
    print(f"Total tweets to process: {total_tweets}")
    progress_step = max(1, total_tweets // 10)  # avoid modulo by zero on tiny inputs

    for idx, tweet in enumerate(tweets):
        # Count this tweet's words, skipping stop words, non-alphabetic tokens
        # and single-character tokens (filtering per tweet avoids re-scanning
        # the whole word list on every iteration)
        dictionary.update(
            word for word in tweet.split()
            if word not in stop_words and word.isalpha() and len(word) > 1
        )

        # Print progress every 10% of the dataset
        if (idx + 1) % progress_step == 0:
            percent_complete = (idx + 1) / total_tweets * 100
            print(f"Processed {idx + 1} tweets ({percent_complete:.1f}%)")

    return dictionary.most_common(3000)

In [15]:
dictionary = make_Dictionary_tweets(data['text'].tolist())

Total tweets to process: 20000
Processed 2000 tweets (10.0%)
Processed 4000 tweets (20.0%)
Processed 6000 tweets (30.0%)
Processed 8000 tweets (40.0%)
Processed 10000 tweets (50.0%)
Processed 12000 tweets (60.0%)
Processed 14000 tweets (70.0%)
Processed 16000 tweets (80.0%)
Processed 18000 tweets (90.0%)
Processed 20000 tweets (100.0%)


In [16]:
# Extract features
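def extract_features_tweets(tweets, dictionary):
    # Map each dictionary word to its column index (dictionary holds (word, count) pairs)
    word_index = {word: i for i, (word, _) in enumerate(dictionary)}
    features_matrix = np.zeros((len(tweets), len(dictionary)))
    for docID, tweet in enumerate(tweets):
        for word in tweet.split():
            col = word_index.get(word)
            if col is not None:
                features_matrix[docID, col] += 1
    # Labels (1 for positive, 0 for negative) are read from the global `data`,
    # whose rows must be in the same order as `tweets`
    labels = data['label'].to_numpy(dtype=float)[:len(tweets)]
    return features_matrix, labels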
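
The feature matrix built by this function is a dense float64 array. For 20,000 tweets and up to 3,000 dictionary words that is 20,000 × 3,000 × 8 bytes, roughly 480 MB, most of it zeros. As a sketch of an alternative, `CountVectorizer` accepts a fixed `vocabulary` and returns the same counts as a sparse matrix. The `toy_` names below are stand-ins and not the notebook's data:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for the (word, count) pairs returned by make_Dictionary_tweets (illustrative)
toy_dictionary = [("cool", 3), ("happy", 2), ("sad", 1)]
toy_tweets = ["cool cool day", "sad but cool", "nothing here"]

# Fix the vocabulary so column i corresponds to toy_dictionary[i]
fixed_vectorizer = CountVectorizer(vocabulary=[word for word, _ in toy_dictionary])
sparse_features = fixed_vectorizer.transform(toy_tweets)  # SciPy sparse matrix

# GaussianNB needs a dense array, so convert only when required
dense_features = sparse_features.toarray()
print(dense_features)
```

The sparse form can be passed directly to `MultinomialNB` or `BernoulliNB`; only `GaussianNB` requires the dense conversion.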

In [17]:
features_matrix, labels = extract_features_tweets(data['text'].tolist(), dictionary)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features_matrix, labels, test_size=0.2, random_state=42)

# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict and evaluate
y_pred = gnb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Gaussian Model: {accuracy:.4f}")


Accuracy of Gaussian Model: 0.6460


In [23]:
import numpy as np

# New tweets to classify
new_tweets = ["Lizards are ugly", "I am at peace", "I hate certain food"]

# Preprocess the new tweets (use the same clean_text function as before)
cleaned_tweets = [clean_text(tweet) for tweet in new_tweets]

# Function to convert new tweets into feature vectors using the dictionary
def extract_features_new_tweets(tweets, dictionary):
    features_matrix = np.zeros((len(tweets), len(dictionary)))
    for docID, tweet in enumerate(tweets):
        words = tweet.split()
        for i, word in enumerate(dictionary):
            features_matrix[docID, i] = words.count(word[0])  # word[0] is the word in the dictionary
    return features_matrix

# Convert new tweets into feature vectors
new_X = extract_features_new_tweets(cleaned_tweets, dictionary)

# Make predictions
predictions = gnb.predict(new_X)
# Print predictions (1 = positive, 0 = negative)
print(predictions)

[0. 1. 1.]


Now we train a Bernoulli Naive Bayes model

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, classification_report

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['label'], test_size=0.2, random_state=42)

# Convert text to binary feature vectors
vectorizer = CountVectorizer(binary=True)  # Use binary=True for BernoulliNB
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train Bernoulli Naive Bayes model
bnb = BernoulliNB()
bnb.fit(X_train_vec, y_train)

# Make predictions
y_pred = bnb.predict(X_test_vec)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.7628

Classification Report:
              precision    recall  f1-score   support

           0       0.76      0.77      0.76      1981
           1       0.77      0.76      0.76      2019

    accuracy                           0.76      4000
   macro avg       0.76      0.76      0.76      4000
weighted avg       0.76      0.76      0.76      4000



In [29]:
new_tweets = ["I am busy", "Turtles are ugly", "I am at peace", "I hate hate"]
cleaned_tweets = [clean_text(tweet) for tweet in new_tweets]
new_X = vectorizer.transform(cleaned_tweets)
predictions = bnb.predict(new_X)
print(predictions)  # 1 = positive, 0 = negative

[0 1 1 0]
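
The Bernoulli model only looks at whether a word is present, so the repeated "hate" in the last tweet counts once. A small sketch of the difference between count and binary features, using a made-up sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_tweets = ["i hate hate mondays"]  # illustrative example

count_features = CountVectorizer().fit_transform(toy_tweets)
binary_features = CountVectorizer(binary=True).fit_transform(toy_tweets)

# Vocabulary is ['hate', 'mondays'] in both cases ("i" is dropped as a one-letter token)
print(count_features.toarray())   # raw counts
print(binary_features.toarray())  # presence/absence only
```

`BernoulliNB` also binarizes its input by default (`binarize=0.0`), so `binary=True` mainly makes the intent explicit.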
