# GPT-Based Tweet Sentiment Analysis Classification

A couple of notes:
* Do not run this on Freddie Mac Wi-Fi, or the NLTK tweet corpus will not download correctly.
* Make sure you have pip-installed every package imported in the first code cell.
* I used ChatGPT to write some of the code, illustrating its power!
* Set your OpenAI API key as an environment variable (the `openai` package reads `OPENAI_API_KEY`) rather than pasting it into the notebook.
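
A minimal sketch of reading the key from the environment at the top of the notebook. `OPENAI_API_KEY` is the variable the `openai` package looks for by default; the helper name `load_api_key` is hypothetical and not part of the original notebook.

```python
import os


def load_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise a readable error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting Jupyter")
    return key


# Usage (left commented out so the sketch runs without a key):
# import openai
# openai.api_key = load_api_key()
```

Failing early with a clear message is easier to debug than an authentication error halfway through a fine-tuning run.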

Current pretrained models available:
* Ada, normal learning rate multiplier, 1 epoch: `ada:ft-personal-2023-06-15-16-12-17`
* Ada, 0.02 learning rate multiplier, 1 epoch: `ada:ft-personal-2023-07-11-16-04-16`
* Davinci, 0.02 learning rate multiplier, 1 epoch: `davinci:ft-personal-2023-07-11-23-22-38`
* Davinci, 0.02 learning rate multiplier, 4 epochs: `davinci:ft-innovatonlab-2023-07-14-00-17-52`


## Data Preprocessing

Import necessary packages.

In [3]:
import nltk
from nltk.corpus import twitter_samples
import pandas as pd
import numpy as np
import openai
import tweepy
import json

Download the Twitter samples from the NLTK corpus and load the tweet text from the JSON files as lists of strings.

In [2]:
nltk.download('twitter_samples', quiet = True)
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

[nltk_data] Error loading twitter_samples: <urlopen error [WinError
[nltk_data]     10054] An existing connection was forcibly closed by
[nltk_data]     the remote host>


Append positive or negative label and concatenate the two datasets, labeling the columns. Then save to a jsonl file that will be used for fine-tuning.

In [3]:
# Build one labeled frame per class, then combine them.
pd_plus = pd.DataFrame({"prompt": all_positive_tweets, "completion": "positive"})
pd_neg = pd.DataFrame({"prompt": all_negative_tweets, "completion": "negative"})
all_tweets = pd.concat([pd_plus, pd_neg])
# Shuffle the rows so the two classes are interleaved.
all_tweets = all_tweets.sample(frac=1)
# Write one JSON object per line (JSONL).
all_tweets.to_json("tweets.jsonl", orient='records', lines=True)
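
For reference, each line of `tweets.jsonl` is a standalone JSON object with a `prompt` and a `completion` key. The sketch below builds the same shape with the standard library only; the two example tweets are made up for illustration.

```python
import json

# Illustrative records only; the real file holds the shuffled NLTK tweets.
records = [
    {"prompt": "What a great day :)", "completion": "positive"},
    {"prompt": "Missed my train again :(", "completion": "negative"},
]

# JSONL: one JSON object per line, with no enclosing list.
jsonl_text = "\n".join(json.dumps(r) for r in records) + "\n"

# Reading it back is a line-by-line json.loads.
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
```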

Remove non-ASCII characters from every string field and collect the characters that were removed in the set `non_ascii_characters`.

In [4]:
input_file = 'tweets.jsonl'
output_file = 'output.jsonl'
non_ascii_characters = set()  # every character dropped across the whole file

with open(input_file, 'r') as file, open(output_file, 'w') as outfile:
    for line in file:
        json_data = json.loads(line)
        ascii_data = {}
        for key, value in json_data.items():
            if isinstance(value, str):
                # Drop anything that cannot be encoded as ASCII.
                ascii_value = value.encode('ascii', 'ignore').decode('ascii')
                # Accumulate the removed characters instead of overwriting them.
                non_ascii_characters |= set(value) - set(ascii_value)
                ascii_data[key] = ascii_value
            else:
                ascii_data[key] = value
        outfile.write(json.dumps(ascii_data) + '\n')

## Fine-Tuning

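We can now run the OpenAI CLI data preparation tool, which suggests a few improvements to the dataset before fine-tuning. The `-q` flag auto-accepts all suggestions. The prepared data is written to the `output_prepared_train.jsonl` and `output_prepared_valid.jsonl` files used in the fine-tune command below.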

In [7]:
!openai tools fine_tunes.prepare_data -f output.jsonl -q

Analyzing...

- Your file contains 10000 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 4 duplicated prompt-completion sets. These are rows: [2799, 4045, 5382, 6952]
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x["prompt"] += suffix
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x["completion"] = x["completion"].apply(
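
The two suggestions about separators and leading whitespace amount to a small transformation of each record. The sketch below shows that transformation by hand, assuming the `\n\n###\n\n` separator that the completion calls later in this notebook append to each prompt. `prepare_record` is a hypothetical helper, not part of the OpenAI tooling; with `-q` the real tool also acts on its other suggestions, such as removing duplicated rows and splitting off a validation set.

```python
SEPARATOR = "\n\n###\n\n"  # assumed: matches the suffix used in the completion calls


def prepare_record(record, separator=SEPARATOR):
    """Append the separator to the prompt and prefix the completion with a space."""
    prompt = record["prompt"]
    completion = record["completion"]
    if not prompt.endswith(separator):
        prompt += separator
    if not completion.startswith(" "):
        completion = " " + completion
    return {"prompt": prompt, "completion": completion}


example = prepare_record({"prompt": "Love this!", "completion": "positive"})
```

This is also why the positive class is passed as `" positive"` (with the leading space) when computing classification metrics.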


Note: the following command MUST be run from a command line. It does not always work in VS Code (messages are suppressed).

In [None]:
# openai api fine_tunes.create -t "output_prepared_train.jsonl" -v "output_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " positive" -m davinci --n_epochs 4 --learning_rate_multiplier 0.02

Once the fine-tune is created, run this periodically to check on its status.

In [1]:
!openai api fine_tunes.follow -i ft-D7duIy3Kk9WZgD1IRszabUZU

[2023-07-13 15:23:56] Created fine-tune: ft-D7duIy3Kk9WZgD1IRszabUZU
[2023-07-13 19:18:08] Fine-tune costs $28.68
[2023-07-13 19:18:08] Fine-tune enqueued. Queue number: 8
[2023-07-13 19:18:40] Fine-tune is in the queue. Queue number: 7
[2023-07-13 19:19:27] Fine-tune is in the queue. Queue number: 6
[2023-07-13 19:20:58] Fine-tune is in the queue. Queue number: 5
[2023-07-13 19:21:14] Fine-tune is in the queue. Queue number: 4
[2023-07-13 19:21:37] Fine-tune is in the queue. Queue number: 3
[2023-07-13 19:22:54] Fine-tune is in the queue. Queue number: 2
[2023-07-13 19:22:56] Fine-tune is in the queue. Queue number: 1
[2023-07-13 19:25:29] Fine-tune is in the queue. Queue number: 0
[2023-07-13 19:26:35] Fine-tune started
[2023-07-13 19:40:41] Completed epoch 1/4
[2023-07-13 20:04:32] Completed epoch 3/4
[2023-07-13 20:17:53] Uploaded model: davinci:ft-innovatonlab-2023-07-14-00-17-52
[2023-07-13 20:17:54] Uploaded result file: file-OdGlOWc7m9Oo7HFfnjVLOzET
[2023-07-13 20:17:55] Fine-t

Edit the line below so that the ID after the `-i` flag is your own fine-tune job ID.
You can also change the output filename.
The command writes the training, validation, and classification metrics for each step to that file.

In [2]:
!openai api fine_tunes.results -i ft-D7duIy3Kk9WZgD1IRszabUZU > result3.csv

## Analysis

Set the model you're going to analyze. MUST RUN.

In [3]:
ft_model = 'davinci:ft-innovatonlab-2023-07-14-00-17-52'

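Read the results from the file you saved. Make sure the filename matches the one you redirected `fine_tunes.results` to (the command above writes to `result3.csv`, so adjust `result.csv` accordingly).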

In [4]:
results = pd.read_csv('result.csv')
results[results['classification/accuracy'].notnull()].tail(1)

Unnamed: 0,step,elapsed_tokens,elapsed_examples,training_loss,training_sequence_accuracy,training_token_accuracy,validation_loss,validation_sequence_accuracy,validation_token_accuracy,classification/accuracy,classification/precision,classification/recall,classification/auroc,classification/auprc,classification/f1.0
563,564,484672,9024,0.017856,1.0,1.0,,,,1.0,1.0,1.0,1.0,1.0,1.0


In [1]:
!openai api fine_tunes.list

{
  "object": "list",
  "data": [
    {
      "object": "fine-tune",
      "id": "ft-oGkjdTyJN7GeJd5s1C9Flb1K",
      "hyperparams": {
        "n_epochs": 4,
        "batch_size": 2,
        "prompt_loss_weight": 0.01,
        "learning_rate_multiplier": 0.1,
        "classification_positive_class": " positive",
        "compute_classification_metrics": true
      },
      "organization_id": "org-uVFaeHyVotBNyU65K31v45SX",
      "model": "ada",
      "training_files": [
        {
          "object": "file",
          "id": "file-ua3FDEKjEztRg9r5DIXzCYQH",
          "purpose": "fine-tune",
          "filename": "tweets_prepared_train_1000.jsonl",
          "bytes": 120224,
          "created_at": 1686585804,
          "status": "processed",
          "status_details": null
        }
      ],
      "validation_files": [
        {
          "object": "file",
          "id": "file-3eIdPbHJeWEOv4vxSflu8f8E",
          "purpose": "fine-tune",
          "filename": "tweets_prepared_valid_100.

### Tweets Retrieval and Completions (can't validate)

Function to retrieve tweets for chosen accounts using method built by ChatGPT.

In [15]:
# Twitter credentials are read from environment variables so they are never stored in the notebook.
# The variable names are placeholders; use whichever names you export.
import os
consumer_key = os.environ["TWITTER_CONSUMER_KEY"]
consumer_secret = os.environ["TWITTER_CONSUMER_SECRET"]
access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_token_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]

def retrieve_and_save_tweets(username, count):
    # Authenticate to Twitter API
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    output_file = username + ".jsonl"
    # Create API object
    api = tweepy.API(auth)

    # Retrieve tweets from user timeline using tweepy.Cursor
    tweets = tweepy.Cursor(api.user_timeline, screen_name=username, tweet_mode="extended").items(count)
    
    # Save tweets to JSONL file
    with open(output_file, "w") as f:
        for tweet in tweets:
            ascii_tweet = tweet.full_text.encode("ascii", errors="ignore").decode("ascii")
            json.dump({"tweet": ascii_tweet}, f)
            f.write("\n")   

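Using the model you just created, read the tweets and predict whether each one is positive or negative. The predictions are appended to the dataset as a `completions` column, and the dataset is returned. Make sure to pass the fine-tuned model's name (for example `davinci:ft-innovatonlab-2023-07-14-00-17-52`), not the fine-tune job ID (for example `ft-D7duIy3Kk9WZgD1IRszabUZU`)!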

In [16]:
def read_tweets(ft_model, username, tweetcount):
    retrieve_and_save_tweets(username, tweetcount)
    twitter_user = pd.read_json(username + ".jsonl", lines=True)
    completions = []
    for i in range(twitter_user.shape[0]):
        res = openai.Completion.create(model=ft_model, prompt=twitter_user["tweet"][i] + '\n\n###\n\n',max_tokens=1, temperature=0,logprobs=2)
        completions.append(res['choices'][0]['text'])
    twitter_user["completions"] = completions
    return twitter_user

elon_musk_completions = read_tweets(ft_model, "elonmusk", 200)
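
The call above requests `logprobs=2` but only keeps the predicted text. A hedged sketch of turning the returned log-probabilities into probabilities follows. It assumes the legacy Completions response layout, where `res['choices'][0]['logprobs']['top_logprobs'][0]` is a dict mapping candidate tokens to log-probabilities. The numbers in the example are invented, and `token_probabilities` is a hypothetical helper.

```python
import math


def token_probabilities(top_logprobs):
    """Convert a {token: logprob} dict into {token: probability}."""
    return {token: math.exp(lp) for token, lp in top_logprobs.items()}


# Invented values, shaped like one entry of `top_logprobs`.
example_top_logprobs = {" positive": math.log(0.9), " negative": math.log(0.1)}
probs = token_probabilities(example_top_logprobs)
best_token = max(probs, key=probs.get)
```

A low top probability is a simple way to flag tweets the model is unsure about.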

### Validation Tweets Testing

Feed the validation tweets back through your model of choice and append the predicted completions. Then create another column, `match`, that is True when the predicted completion equals the expected one and False otherwise, and print the counts of True versus False (from which the accuracy follows).

In [23]:
test = pd.read_json("output_prepared_valid.jsonl", lines=True)
match = []
predicted = []
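for i in range(test.shape[0]):
    # The prepared prompts already end with the separator, so none is appended here.
    res = openai.Completion.create(model=ft_model, prompt=test["prompt"][i], max_tokens=1, temperature=0, logprobs=2)
    predicted.append(res['choices'][0]['text'])
    # Compare the newest prediction with the expected completion.
    match.append(predicted[-1] == test["completion"][i])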
    
test["match"] = match
test["predicted"] = predicted
test["match"].value_counts()
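
`value_counts()` gives the raw number of matches and mismatches; accuracy is the share of `True` values. A small sketch with made-up labels (not results from this notebook) that also tallies each expected/predicted pair:

```python
from collections import Counter

# Made-up example data; in the notebook these come from test["completion"]
# and the model's predictions.
expected = [" positive", " negative", " positive", " negative"]
predicted = [" positive", " negative", " negative", " negative"]

match = [p == e for p, e in zip(predicted, expected)]
accuracy = sum(match) / len(match)  # True counts as 1

# Count each (expected, predicted) pair, i.e. a flat confusion matrix.
confusion = Counter(zip(expected, predicted))
```

The pair counts show which class the errors fall in, which a single accuracy number hides.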

### Other Labeled Tweets Testing

Load a pre-existing labeled tweet dataset from Kaggle into pandas, map the numeric labels (0, 2, 4) to the completion strings the model emits, and shuffle the rows.

In [17]:
labeled = pd.read_csv("labeled.csv", header = None)
labeled[0] = labeled[0].replace(0," negative").replace(2, " neutral").replace(4, " positive")
labeled.head()
labeled = labeled.sample(frac = 1)


In [22]:
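# Parentheses are required; without them (as in the output recorded below) pandas prints the bound method instead of the counts.
labeled[0].value_counts()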

<bound method IndexOpsMixin.value_counts of 382875      negative
468968      negative
852074      positive
945517      positive
124618      negative
             ...    
828621      positive
390878      negative
42623       negative
1582754     positive
1343827     positive
Name: 0, Length: 1600000, dtype: object>

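Set the model name, create the completions for 400 tweets, and count how many predictions match the labels. Note that `labeled[5][i]` and `labeled[0][i]` look rows up by index label, and `sample` keeps the original index, so this loop scores the rows labeled 0 to 399 in the original file rather than the first 400 shuffled rows. The prediction counts below (394 negative, 6 positive) are consistent with those rows being almost all negative, so the match rate here should not be read as a balanced accuracy estimate.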

In [18]:
ft_model = 'ada:ft-personal-2023-06-15-16-12-17'
# ft_model = 'ada:ft-personal-2023-07-11-16-04-16'
# ft_model = 'davinci:ft-personal-2023-07-11-23-22-38'
# ft_model = 'davinci:ft-innovatonlab-2023-07-14-00-17-52'

match = []
predicted = []
for i in range(400):
    res = openai.Completion.create(model=ft_model, prompt=labeled[5][i] + '\n\n###\n\n',max_tokens=1, temperature=0,logprobs=2)
    predicted.append(res['choices'][0]['text'])
    match.append(predicted[len(predicted) - 1] == labeled[0][i])
    
pd.DataFrame(predicted).value_counts()
pd.DataFrame(match).value_counts()

True     394
False      6
Name: count, dtype: int64

In [19]:
pd.DataFrame(predicted).value_counts()

 negative    394
 positive      6
Name: count, dtype: int64
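
A small sketch of the difference between label-based and positional lookups after a shuffle, which matters for evaluation loops like the one above that index with `range(...)`. The toy frame and its values are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the labeled dataset (values are illustrative).
frame = pd.DataFrame({"text": ["a", "b", "c", "d", "e"]})
shuffled = frame.sample(frac=1, random_state=0)  # index labels travel with the rows

# Label-based: series[i] on an integer index returns the row whose label is i.
by_label = [shuffled["text"][i] for i in range(3)]

# Position-based: .iloc[i] returns the i-th row in the shuffled order.
by_position = [shuffled["text"].iloc[i] for i in range(3)]

# Alternatively, drop the old labels so both styles agree.
reset = shuffled.reset_index(drop=True)
by_label_after_reset = [reset["text"][i] for i in range(3)]
```

Here `by_label` always comes back in the original order, whatever the shuffle did. Using `.iloc` or calling `reset_index(drop=True)` after `sample` makes the loop evaluate a genuinely random subset.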