

```
# About Dataset
# Context
This is the Sentiment140 dataset. It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 4 = positive) and can be used to detect sentiment.

# Content
It contains the following 6 fields:

target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)

ids: the id of the tweet (2087)

date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)

flag: The query (lyx). If there is no query, then this value is NO_QUERY.

user: the user that tweeted (robotickilldozr)

text: the text of the tweet (Lyx is cool)

# Acknowledgements
The official link regarding the dataset with resources about how it was generated is here
The official paper detailing the approach is here

Citation: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.
```
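The six fields listed above can be read without any third-party library. The following is a minimal sketch, not the notebook's own loading code: the names `FIELDS`, `POLARITY`, and `parse_rows` are hypothetical, and the sample row is assembled from the example values in the field list, with the target value 4 assumed for illustration.

```python
import csv
import io

# Column order as given in the dataset description.
FIELDS = ["target", "ids", "date", "flag", "user", "text"]
# Polarity codes as given in the dataset description.
POLARITY = {0: "negative", 2: "neutral", 4: "positive"}

# One illustrative row built from the example values in the field list;
# the target value 4 is an assumption made for this sketch.
sample = (
    '"4","2087","Sat May 16 23:58:44 UTC 2009",'
    '"lyx","robotickilldozr","Lyx is cool"\n'
)

def parse_rows(handle):
    """Yield one dict per CSV row, with the numeric target mapped to a label."""
    for row in csv.DictReader(handle, fieldnames=FIELDS):
        row["target"] = POLARITY[int(row["target"])]
        yield row

rows = list(parse_rows(io.StringIO(sample)))
print(rows[0]["target"], "|", rows[0]["user"], "|", rows[0]["text"])
```

The same column names are passed to `pd.read_csv` later in the notebook, because the CSV file has no header row.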



In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy
from nltk.corpus import stopwords
nltk.download('punkt_tab')
nltk.download('stopwords')


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
import kagglehub
# Download latest version
path = kagglehub.dataset_download("kazanova/sentiment140")
print("Path to dataset files:", path)
!ls -l /root/.cache/kagglehub/datasets/kazanova/sentiment140/versions/2

Path to dataset files: /root/.cache/kagglehub/datasets/kazanova/sentiment140/versions/2
total 233208
-rw-r--r-- 1 root root 238803811 Jan 14 08:41 training.1600000.processed.noemoticon.csv


In [3]:
!ls /root/.cache/kagglehub/datasets/kazanova/sentiment140/versions/2

training.1600000.processed.noemoticon.csv


In [4]:
import os
# Build the CSV path portably instead of concatenating strings by hand.
filepath = os.path.join(path, "training.1600000.processed.noemoticon.csv")
print(filepath)

/root/.cache/kagglehub/datasets/kazanova/sentiment140/versions/2/training.1600000.processed.noemoticon.csv


In [5]:
df = pd.read_csv(filepath, encoding='ISO-8859-1', header=None,
                 names=["target", "ids", "date", "flag", "user", "text"])
df = df.drop(columns=["ids", "date", "flag", "user"])

In [6]:
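def extract_features(words):
  """Turn a list of tokens into the {token: True} dict that NLTK classifiers expect."""
  features = {}
  for word in words:
    features[word] = True  # presence only; repeated tokens collapse to one key
  return features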
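As a small illustration of what this kind of feature extraction produces, the sketch below uses a hypothetical equivalent named `presence_features` (a dict comprehension with the same behavior) on an invented token list. Note that word counts are discarded: a token that appears twice yields a single key.

```python
def presence_features(words):
    """Map each distinct token to True (bag-of-words presence features)."""
    return {word: True for word in words}

# Invented token list for illustration; "good" appears twice.
tokens = ["good", "day", "good", "mood"]
features = presence_features(tokens)
print(features)
```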

In [7]:
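# Map the numeric polarity to a readable label. The dataset description says the
# tweets are annotated 0 or 4; any value missing from the mapping would become NaN.
mapitem = {0: 'negative', 4: 'positive'}
df['target'] = df['target'].map(mapitem)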

In [8]:
df.head()

   target    text
0  negative  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1  negative  is upset that he can't update his Facebook by ...
2  negative  @Kenichan I dived many times for the ball. Man...
3  negative  my whole body feels itchy and like its on fire
4  negative  @nationwideclass no, it's not behaving at all....


In [9]:
df = df  # no-op: the DataFrame is used as-is, without sampling

In [10]:
# Build (tokens, label) pairs: tokenize, lowercase, and drop English stop words.
stop_words = set(stopwords.words("english"))  # load once; avoids re-reading the list per word
documents = []
for sentence, target in zip(df['text'], df['target']):
  words = [word.lower() for word in nltk.word_tokenize(sentence)]
  words = [word for word in words if word not in stop_words]
  documents.append((words, target))

In [12]:
import random
random.shuffle(documents)
featureset = []
for (d,c) in documents:
  features = extract_features(d)
  featureset.append((features,c))

In [13]:
# Train on the first 10,000 shuffled examples; test on all remaining examples.
train_set, test_set = featureset[:10000], featureset[10000:]

In [14]:
classifier = NaiveBayesClassifier.train(train_set)
accuracy = nltk_accuracy(classifier,test_set)
print("Accuracy : {}".format(accuracy*100))

Accuracy : 72.28421383647799
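The accuracy reported above is the fraction of test examples whose predicted label matches the annotated label. A minimal sketch of that computation follows; `toy_classify` is a hypothetical stand-in for a trained classifier, and the four labeled examples are invented for illustration.

```python
def accuracy(classify, gold):
    """Fraction of (features, label) pairs whose predicted label equals the gold label."""
    correct = sum(1 for features, label in gold if classify(features) == label)
    return correct / len(gold)

# Hypothetical stand-in classifier: predicts "negative" only when "sad" is present.
def toy_classify(features):
    return "negative" if features.get("sad") else "positive"

# Invented labeled examples in the same (features, label) shape as the test set.
gold = [
    ({"sad": True, "day": True}, "negative"),
    ({"great": True, "day": True}, "positive"),
    ({"missing": True, "you": True}, "negative"),  # misclassified by toy_classify
    ({"love": True}, "positive"),
]

score = accuracy(toy_classify, gold)
print("Accuracy : {}".format(score * 100))
```

With three of the four toy examples classified correctly, the sketch prints `Accuracy : 75.0`.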


In [15]:
classifier.show_most_informative_features(3)

Most Informative Features
                 missing = True           negati : positi =     20.5 : 1.0
                   sadly = True           negati : positi =     13.5 : 1.0
                     sad = True           negati : positi =     13.1 : 1.0
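Each row above is a likelihood ratio: `missing = True` is about 20.5 times as likely under the negative label as under the positive label (the label names are truncated to six characters in the printout). The sketch below shows how such a ratio can be derived from document counts. The counts are invented, and the add-0.5 smoothing with two bins is an assumption for this sketch, not a claim about NLTK's exact internals.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Smoothed estimate of P(feature present | label); gamma and bins are assumed."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """How many times more likely the feature is under label A than under label B."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Invented counts: the word occurs in 40 of 5,000 negative tweets
# and in 2 of 5,000 positive tweets.
ratio = likelihood_ratio(40, 5000, 2, 5000)
print(f"negative : positive = {ratio:.1f} : 1.0")
```

Words with a large ratio are strong evidence for one label, which is why terms such as "sad" and "sadly" rank near the top.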


In [16]:
STOP_WORDS = set(stopwords.words("english"))  # load once instead of once per word

def sentiment_find(text):
  """Classify raw text using the same preprocessing applied to the training data."""
  words = [word.lower() for word in nltk.word_tokenize(text)]
  words = [word for word in words if word not in STOP_WORDS]
  return classifier.classify(extract_features(words))

In [18]:
print(sentiment_find("I hate this thing"))
print(sentiment_find("I love this thing"))
print(sentiment_find("What a horrible thing this is "))
print(sentiment_find("I love this thing"))


negative
positive
negative
positive
