This is the Sentiment140 dataset.
It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 2 = neutral, 4 = positive) and can be used to detect sentiment.
It contains the following 6 fields:

target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
ids: the id of the tweet (2087)
date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
flag: the query (lyx). If there is no query, then this value is NO_QUERY.
user: the user that tweeted (robotickilldozr)
text: the text of the tweet (Lyx is cool)
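
As a minimal sketch of how the six fields above map onto a DataFrame, the example below reads two rows from an in-memory string. It assumes the raw file has no header row, which is why `header=None` and `names=` are passed. The second sample row and the derived `label` column are made up for illustration.

```python
import io
import pandas as pd

# Two sample rows in the six-field layout described above (no header row).
# The second row is invented purely for illustration.
sample_csv = io.StringIO(
    "4,2087,Sat May 16 23:58:44 UTC 2009,lyx,robotickilldozr,Lyx is cool\n"
    "0,2088,Sat May 16 23:59:01 UTC 2009,NO_QUERY,example_user,Not a good day\n"
)

columns = ["target", "ids", "date", "flag", "user", "text"]

# header=None stops pandas from treating the first tweet as column names.
df = pd.read_csv(sample_csv, header=None, names=columns)

# Collapse the 0/4 polarity codes into a binary label (0 = negative, 1 = positive).
# Any neutral rows (target == 2) would become NaN under this mapping.
df["label"] = df["target"].map({0: 0, 4: 1})
print(df[["target", "label", "text"]])
```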

In [81]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


In [97]:
data = pd.read_csv("tweet_sentiments.csv")
# Note: 'time' holds the tweet id (the 'ids' field above), not a timestamp
data.columns = ['label','time','date','query','username','text']
data.head()

Unnamed: 0,label,time,date,query,username,text
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [83]:
data.tail()

Unnamed: 0,label,time,date,query,username,text
777438,4,1985234208,Sun May 31 16:42:08 PDT 2009,NO_QUERY,Stewie_tunes,"Am off to bed nw,tweet in the morning night"
777439,4,1985234226,Sun May 31 16:42:08 PDT 2009,NO_QUERY,failingwords,@emily0418 hello loove
777440,4,1985234335,Sun May 31 16:42:09 PDT 2009,NO_QUERY,rainy_tori,my internet's back !
777441,4,1985234345,Sun May 31 16:42:09 PDT 2009,NO_QUERY,Ste1987,Off to bed to watch some TV. Adios!!
777442,4,1985234386,Sun May 31 16:42:09 PDT 2009,NO_QUERY,jessicahh7,is very anxious for xc... even though there is...


In [84]:
print(data.columns)
print('length of data is ', len(data))
print('shape of data is ',data.shape)


Index(['label', 'time', 'date', 'query', 'username', 'text'], dtype='object')
length of data is  777443
shape of data is  (777443, 6)


In [85]:
print(data.dtypes)
print(np.sum(data.isnull().any(axis=1)))

label        int64
time         int64
date        object
query       object
username    object
text        object
dtype: object
0


We only care about the text and label columns.

In [86]:
data = data[['text','label']].copy()
# Assign 1 to positive sentiment (4) in a single .loc step and leave 0 as negative sentiment
data.loc[data['label'] == 4, 'label'] = 1
data.tail()


Unnamed: 0,text,label
777438,"Am off to bed nw,tweet in the morning night",1
777439,@emily0418 hello loove,1
777440,my internet's back !,1
777441,Off to bed to watch some TV. Adios!!,1
777442,is very anxious for xc... even though there is...,1


In [87]:
data_pos = data[data['label']==1]
data_neg = data[data['label']==0]
# Take a subset of the data so it runs comfortably on a local machine
data_pos = data_pos.iloc[:250000]
data_neg = data_neg.iloc[:250000]
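
Slicing with `iloc[:250000]` keeps the first rows of each class in file order, and the head and tail outputs above suggest the file is ordered by date. One possible alternative, shown here only as a sketch on a toy frame, is to draw the same number of rows per label at random. The frame `toy` and the size `n_per_class` are hypothetical stand-ins.

```python
import pandas as pd

# Toy frame standing in for the real data: 6 positive and 4 negative rows.
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(10)],
    "label": [1] * 6 + [0] * 4,
})

n_per_class = 3  # hypothetical size; the notebook keeps 250000 per class

# Draw the same number of rows from each label at random, reproducibly.
balanced = toy.groupby("label").sample(n=n_per_class, random_state=42)
print(balanced["label"].value_counts())
```

Random sampling changes which tweets are kept, so the accuracy it leads to would not necessarily match the figure reported in this notebook.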

In [88]:
data = pd.concat([data_pos,data_neg])
print(data.shape)
# Convert the tweet text to lower case
data['text'] = data['text'].str.lower()



(500000, 2)


In [89]:
def clean_text(text):
    return re.sub(r'[^a-zA-Z\s]', '', text)  # Keeps only letters and whitespace

def cleaning_repeating_char(text):
    # Collapses every run of a repeated character to a single character.
    # Note: this also shortens ordinary double letters ("feel" -> "fel", "good" -> "god").
    return re.sub(r'(.)\1+', r'\1', text)

data['text'] = data['text'].apply(clean_text)
data['text'] = data['text'].apply(cleaning_repeating_char)

data.tail()

Unnamed: 0,text,label
249995,im bored i fel like driving around but have no...,0
249996,greythinking actualy i cant completely copy bc...,0
249997,not a fan of chinese fod anymore making rororo...,0
249998,wel ghostlands and other realms are ofline i j...,0
249999,so no one has any god music ideas to share ok ...,0
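
The output above shows words such as "fel", "fod" and "god", because the pattern `(.)\1+` also collapses ordinary double letters. A gentler variant, sketched below as an assumption rather than what this notebook runs, shortens only runs of three or more identical characters down to two. The name `collapse_long_runs` is hypothetical.

```python
import re

def collapse_long_runs(text):
    """Shorten runs of 3+ identical characters to 2, leaving ordinary doubles alone."""
    return re.sub(r'(.)\1{2,}', r'\1\1', text)

print(collapse_long_runs("i feel sooooo good"))  # prints: i feel soo good
```

Swapping this in would change the vocabulary, so the accuracy reported in this notebook would no longer apply as-is.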


In [90]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['text'])  # Convert text to numerical features
y = data['label']  # Labels (0 = Negative, 1 = Positive)
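
To make the bag-of-words step concrete, here is a standard-library sketch of roughly what `CountVectorizer` does with its default settings: lower-case the text, keep tokens of two or more word characters, sort the vocabulary, and count occurrences per document. It is an approximation for illustration, not scikit-learn's implementation, and the helper `tokenize` is hypothetical.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lower-case the text and keep tokens with at least two word characters."""
    return TOKEN_PATTERN.findall(text.lower())

docs = ["my internets back", "of to bed to watch some tv"]

# Vocabulary: every distinct token, sorted so each term gets a fixed column.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, one column per vocabulary term, cell = term count.
rows = [[Counter(tokenize(doc))[term] for term in vocab] for doc in docs]

print(vocab)
for row in rows:
    print(row)
```

Single-character tokens such as "i" or "a" never reach the vocabulary under this pattern.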


In [91]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Train Multinomial Naive Bayes
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.5f}")


Accuracy: 0.79049
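
For intuition about what the classifier computes, the following is a from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing on a made-up four-document corpus. It is not scikit-learn's code. The helpers `train_nb` and `predict_nb` and the toy data are hypothetical, and `alpha=1.0` mirrors the default smoothing value of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit log priors and smoothed log likelihoods from tokenised documents."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label, n_docs in class_docs.items():
        # Denominator: all word occurrences in the class plus alpha per vocabulary term.
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = (
            math.log(n_docs / len(labels)),
            {w: math.log((word_counts[label][w] + alpha) / total) for w in vocab},
        )
    return model

def predict_nb(model, tokens):
    """Return the label with the highest log posterior; unseen words are skipped."""
    def score(label):
        log_prior, log_lik = model[label]
        return log_prior + sum(log_lik[w] for w in tokens if w in log_lik)
    return max(model, key=score)

# Toy corpus: 1 = positive, 0 = negative.
docs = [["love", "this"], ["great", "day"], ["hate", "this"], ["bad", "day"]]
labels = [1, 1, 0, 0]
nb = train_nb(docs, labels)

print(predict_nb(nb, ["love", "day"]))  # prints: 1
print(predict_nb(nb, ["hate", "day"]))  # prints: 0
```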


In [101]:
new_tweets = ["I am naive", "Turtles are overrated", "I am at peace", "I'm a machine"]
new_X = vectorizer.transform(new_tweets)  # Convert text to numerical features

predictions = model.predict(new_X)
print(predictions)  # One prediction per tweet (1 = Positive, 0 = Negative)


[0 1 1 0]
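
The training text was lower-cased, stripped of non-letters and had repeated characters collapsed, while the new tweets above go to the vectorizer as typed. A small sketch of applying the same cleaning to new input follows. The wrapper `preprocess` is a hypothetical helper, and the two cleaning functions are repeated here only so the block runs on its own.

```python
import re

def clean_text(text):
    return re.sub(r'[^a-zA-Z\s]', '', text)  # Keeps only letters and whitespace

def cleaning_repeating_char(text):
    return re.sub(r'(.)\1+', r'\1', text)  # Collapses repeated characters

def preprocess(text):
    """Apply the same steps used on the training text, in the same order."""
    return cleaning_repeating_char(clean_text(text.lower()))

new_tweets = ["I am naive", "Turtles are overrated", "I am at peace", "I'm a machine"]
prepared = [preprocess(t) for t in new_tweets]
print(prepared)
```

Passing `prepared` to `vectorizer.transform` instead of the raw strings keeps new input consistent with training. For instance, "I'm" becomes "im", a token form the training text also contains. The predictions may then differ from the ones printed above.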
