The dataset is available at https://www.kaggle.com/datasets/kazanova/sentiment140.
You can download and process it manually, but here we'll use `haggle` to manage Kaggle datasets.
To follow along, `pip install haggle`.
Either way, you'll need a [Kaggle API key](https://www.kaggle.com/docs/api).


The data comes as a (zipped) CSV file with no header row.
The dataset source page describes the columns as follows:
- target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- ids: the id of the tweet (2087)
- date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- flag: The query (lyx). If there is no query, then this value is NO_QUERY.
- user: the user that tweeted (robotickilldozr)
- text: the text of the tweet (Lyx is cool)

One note: we'll use `id_` instead of `ids`. This drops the plural (which no other column name has)
while avoiding a clash with the Python built-in name `id`.
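
As a minimal sketch of the header issue, independent of `haggle` and `Dacc`: when a CSV has no header row, the reader has to be given the column names. The snippet below uses only the standard library (`csv.DictReader` with `fieldnames`) on a single row built from values in the raw data; the names `COLUMNS` and `SAMPLE` and the quoting of the sample line are illustrative assumptions.

```python
import csv
import io

# Column names as used in this notebook (`id_` rather than `ids`).
COLUMNS = ["target", "id_", "date", "flag", "user", "text"]

# One headerless line, formatted here for illustration.
SAMPLE = (
    '"0","1467811184","Mon Apr 06 22:19:57 PDT 2009",'
    '"NO_QUERY","ElleCTF","my whole body feels itchy and like its on fire"\n'
)

# Passing fieldnames tells DictReader that the first line is data, not a header.
reader = csv.DictReader(io.StringIO(SAMPLE), fieldnames=COLUMNS)
rows = list(reader)

print(len(rows))
print(rows[0]["user"])
```

With pandas, the equivalent is `pd.read_csv(path, header=None, names=COLUMNS)`.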

# The Dacc

In [5]:
from imbed_data_prep.twitter_sentiment import Dacc 

dacc = Dacc()

In [9]:
dacc.raw_data

Unnamed: 0,target,id_,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
...,...,...,...,...,...,...
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...


target
4    800000
0    799999
0         1
Name: count, dtype: int64

In [12]:
dacc.column_info

{'target': 'the polarity of the tweet (e.g. 0 = negative, 2 = neutral, 4 = positive)',
 'id_': 'The id of the tweet (e.g. 2087)',
 'date': 'the date of the tweet (e.g. Sat May 16 23:58:44 UTC 2009)',
 'flag': 'The query (lyx). If there is no query, then this value is NO_QUERY.',
 'user': 'the user that tweeted (e.g. robotickilldozr)',
 'text': 'the text of the tweet (e.g. Lyx is cool)'}
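
The column descriptions suggest two small normalization steps: decoding `target` into a label and parsing `date`. The sketch below is one way to do this with the standard library. `POLARITY`, `parse_date`, and the choice to split off the timezone abbreviation before parsing are assumptions made for illustration, not something `Dacc` is known to do.

```python
from datetime import datetime

# Polarity codes as described by the dataset source page.
POLARITY = {0: "negative", 2: "neutral", 4: "positive"}


def parse_date(raw):
    """Split e.g. 'Mon Apr 06 22:19:45 PDT 2009' into (naive datetime, tz abbreviation)."""
    parts = raw.split()
    tz = parts.pop(4)  # timezone abbreviation such as 'PDT' or 'UTC'
    return datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y"), tz


when, tz = parse_date("Mon Apr 06 22:19:45 PDT 2009")
print(POLARITY[0], when.isoformat(), tz)
```

The abbreviation is kept separate because `strptime`'s `%Z` directive only recognizes a limited set of timezone names, so parsing values like `PDT` directly may fail depending on the platform.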

# Scrap: Twitter sentiment data prep bits


In [None]:
column_info = {
    "target": "the polarity of the tweet (e.g. 0 = negative, 2 = neutral, 4 = positive)",
    "id_": "The id of the tweet (e.g. 2087)",
    "date": "the date of the tweet (e.g. Sat May 16 23:58:44 UTC 2009)",
    "flag": "The query (lyx). If there is no query, then this value is NO_QUERY.",
    "user": "the user that tweeted (e.g. robotickilldozr)",
    "text": "the text of the tweet (e.g. Lyx is cool)"
}

In [1]:
# Often used imports

import pandas as pd

In [2]:
from haggle import KaggleDatasets

kaggle = KaggleDatasets()

list(kaggle)

['andrewmvd/sp-500-stocks',
 'analystmasters/world-soccer-live-data-feed',
 'uciml/human-activity-recognition-with-smartphones',
 'rtatman/english-word-frequency',
 'sitsawek/phonetics-articles-on-plos',
 'kazanova/sentiment140']

In [4]:
from imbed import extension_based_wrap
from tabled import auto_decode_bytes

raw_data_store = extension_based_wrap(kaggle['kazanova/sentiment140'])
list(raw_data_store)

['training.1600000.processed.noemoticon.csv']

In [7]:
df = raw_data_store['training.1600000.processed.noemoticon.csv']

# The CSV file has no header, so the first data row was read as the column names.
# Take the intended column names from column_info (defined above, extracted from
# https://www.kaggle.com/datasets/kazanova/sentiment140).
columns = list(column_info.keys())
# Move the misread header back into the data as the first row,
# then apply the intended column names.
# Caveat: values recovered from the header may be strings (e.g. '0' rather than 0).
df = pd.DataFrame([df.columns] + df.values.tolist(), columns=columns)

print(f"{df.shape=}")
df.iloc[0]

df.shape=(1600000, 6)


target                                                    0
id_                                              1467810369
date                           Mon Apr 06 22:19:45 PDT 2009
flag                                               NO_QUERY
user                                        _TheSpecialOne_
text      @switchfoot http://twitpic.com/2y1zl - Awww, t...
Name: 0, dtype: object
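
One side effect of promoting the header to a data row is worth checking: values taken from `df.columns` are likely strings, while the remaining rows were parsed as numbers. That would explain why the `target` counts shown earlier list `0` twice (799999 and 1). The sketch below reproduces the effect on a made-up toy list with `collections.Counter` and shows one possible fix.

```python
from collections import Counter

# Toy stand-in for the `target` column: the first value came from the header (a string).
targets = ["0", 0, 0, 4, 4]

# '0' and 0 are different keys, so they are counted separately.
print(Counter(targets))

# Converting every value to int merges the two keys.
normalized = [int(t) for t in targets]
print(Counter(normalized))
```

On the DataFrame itself, something like `df['target'] = df['target'].astype(int)` should apply the same conversion, assuming every value in the column is an integer or an integer-like string.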

In [15]:
import oa

list(oa.compute_price.model_information_dict)

['text-embedding-3-small',
 'text-embedding-3-large',
 'text-embedding-ada-002',
 'batch__text-embedding-3-small',
 'batch__text-embedding-3-large',
 'batch__text-embedding-ada-002',
 'gpt-4',
 'gpt-4-32k',
 'gpt-4-turbo',
 'gpt-3.5-turbo',
 'o1-preview',
 'o1-mini',
 'gpt-4o',
 'gpt-4o-mini']

In [17]:
import oa

token_count = oa.num_tokens('\n###\n'.join(df['text'].values))
print(f"{token_count=}")

oa.compute_price(oa.DFLT_EMBEDDINGS_MODEL, token_count)

token_count=35031137


0.70062274
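
The price above is simple arithmetic on the token count. As a sanity check, the sketch below reproduces it under the assumption that the default embeddings model is billed at 0.02 USD per million tokens. That rate is inferred from the output above rather than read from `oa`, and `embedding_cost` is a hypothetical helper, not part of `oa`.

```python
def embedding_cost(token_count, usd_per_million_tokens):
    """Return the cost in USD of embedding `token_count` tokens."""
    return token_count / 1_000_000 * usd_per_million_tokens


ASSUMED_RATE = 0.02  # assumed USD per 1M tokens, inferred from the output above

cost = embedding_cost(35_031_137, ASSUMED_RATE)
print(round(cost, 8))  # 0.70062274
```

At that rate, embedding all 1.6 million tweets would cost roughly 0.70 USD.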