 # Step 2A1: Synthetic Text Generation



 In this step, we generate synthetic tweets to create a balanced dataset for training our classifier. The process includes:



 1. Data Generation:

    - Using an OpenAI GPT model (`gpt-4.1-mini` in the code below) to generate synthetic tweets

    - Creating both literal and sarcastic tweets

    - Using original tweets as reference for style and topic

    - Maintaining similar length and natural language patterns



 2. Data Processing:

    - Cleaning and normalizing text

    - Adding metadata (word count, character count, hashtag/mention presence)

    - Filtering tweets (minimum 10 words)

    - Computing English word statistics



 3. Quality Control:

    - Requesting the same number of tweets for the literal and sarcastic classes (the 10-word filter can shift the final balance)

    - Maintaining realistic tweet properties

    - Tracking original tweet references

    - Recording a confidence score for each sample (fixed at 1.0 for synthetic tweets)



 Note: This step takes about 1.5 hours and costs ~$0.50 to run for 3000 samples per class.

 The generated data is saved to `synthetic_tweets.csv` and can be reused to skip this step.
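
 A minimal sketch of how the saved file could be reused to skip generation. The helper name `load_or_generate` and its `generate_fn` argument are hypothetical and not part of this notebook; the sketch only assumes the CSV was written with `index=False`, as in the save cell later on.

```python
import os
import pandas as pd

def load_or_generate(csv_path, generate_fn):
    """Return cached data if csv_path exists; otherwise generate it and save it."""
    if os.path.exists(csv_path):
        # Reuse the earlier run instead of calling the API again
        return pd.read_csv(csv_path)
    data = generate_fn()  # hypothetical callable that returns a DataFrame
    data.to_csv(csv_path, index=False)
    return data
```

 For example, `load_or_generate('synthetic_tweets.csv', lambda: generate_synthetic_dataset(client, original_df, num_samples=3000))` would only call the API when the file is missing. Note that this notebook saves the file after preprocessing, so a cached copy already contains the cleaned and filtered columns.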



 The synthetic dataset will be used to train our initial classifier, which will help us identify and handle label inconsistencies in the original dataset.

 ## Step 2.1: Create Synthetic Data and Train Naive Classifier



 We'll use GPT to generate synthetic tweets for both literal and sarcastic classes to help train our initial classifier.

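 Before you begin, create `openai_api_key.json` with your API key in the project root. (The setup cell below reads the key from a Colab secret named `OpenAI_API_Key` instead, so in Colab add the key there.)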
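
 If the key is kept in `openai_api_key.json`, it could be loaded along these lines. This is a sketch under assumptions: the field name `api_key` and the helper `load_api_key` are hypothetical, since the file's layout is not specified here.

```python
import json

def load_api_key(path="openai_api_key.json", field="api_key"):
    """Read the API key from a JSON file shaped like {"api_key": "..."} (assumed layout)."""
    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f)
    key = config.get(field)
    if not key:
        raise KeyError(f"'{field}' is missing or empty in {path}")
    return key
```

 The result can then be passed to the client constructor, e.g. `openai.OpenAI(api_key=load_api_key())`.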

In [4]:
# Import necessary libraries
import os
import json
import pandas as pd
import openai
from tqdm import tqdm
import time
from typing import List, Dict, Tuple
import random
from nltk.tokenize import word_tokenize
import re
import unicodedata
import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)


True

In [7]:
from google.colab import userdata # Import userdata to access secrets

# Load OpenAI API key from Colab secrets
openai.api_key = userdata.get('OpenAI_API_Key')

# Initialize OpenAI client
client = openai.OpenAI(api_key=openai.api_key)

In [11]:
# Load the original dataset
columns = ['target', 'id', 'date', 'flag', 'user', 'text']
original_df = pd.read_csv('processed_sentiment140.csv',
                         encoding='latin-1',
                         names=columns)




In [12]:
def clean_text(text: str) -> str:
    """
    Clean and normalize text.

    Args:
        text (str): Input text to clean

    Returns:
        str: Cleaned text
    """
    # Convert to lowercase
    text = text.lower()

    # Normalize unicode characters
    text = unicodedata.normalize('NFKD', text)

    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)

    # Remove user mentions and hashtags
    text = re.sub(r'@\w+|\#\w+', '', text)

    # Remove special characters
    text = re.sub(r'[^\w\s]', ' ', text)

    # Remove extra whitespace
    text = ' '.join(text.split())

    return text
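
One side effect of replacing punctuation with spaces is that contractions are split: "can't" becomes "can t", which is visible in the cleaned samples printed at the end of this notebook. If that matters for the classifier, a variant could drop apostrophes before the punctuation step. The function below is a hypothetical alternative for illustration, not the cleaning used in this notebook.

```python
import re
import unicodedata

def clean_text_keep_contractions(text: str) -> str:
    """Like clean_text, but removes apostrophes so contractions stay one token."""
    text = unicodedata.normalize('NFKD', text.lower())
    text = re.sub(r'http\S+|www\S+', '', text)      # URLs
    text = re.sub(r'@\w+|#\w+', '', text)           # mentions and hashtags
    text = re.sub(r"['\u2019]", '', text)           # straight and curly apostrophes
    text = re.sub(r'[^\w\s]', ' ', text)            # remaining punctuation
    return ' '.join(text.split())

print(clean_text_keep_contractions("Seeing Jordan head straight into trouble, can't wait!"))
# seeing jordan head straight into trouble cant wait
```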


In [13]:
def is_english_word(word: str) -> bool:
    """
    Check if a word is English.

    Args:
        word (str): Word to check

    Returns:
        bool: True if word is English
    """
    return bool(re.match(r'^[a-zA-Z]+$', word))

def count_english_words(text: str) -> int:
    """
    Count the number of English words in text.

    Args:
        text (str): Input text

    Returns:
        int: Number of English words
    """
    words = word_tokenize(text)
    return sum(1 for word in words if is_english_word(word))
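
Note that `is_english_word` accepts any token made only of ASCII letters, so it counts alphabetic tokens rather than checking a dictionary ("bonjour" passes). A ratio can be easier to compare across tweets than a raw count. The sketch below is an assumption-laden illustration: `english_word_ratio` is a hypothetical helper, and it splits on whitespace instead of `word_tokenize` to stay dependency-free.

```python
import re

def is_english_word(word: str) -> bool:
    # Same check as above: ASCII letters only
    return bool(re.match(r'^[a-zA-Z]+$', word))

def english_word_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that are purely alphabetic."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(is_english_word(t) for t in tokens) / len(tokens)
```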


In [14]:
def get_tweet_properties(tweet: str) -> Dict:
    """
    Analyze properties of a tweet.

    Args:
        tweet: The tweet text

    Returns:
        Dictionary containing tweet properties
    """
    words = word_tokenize(tweet)
    return {
        'word_count': len(words),
        'char_count': len(tweet),
        'has_hashtag': int('#' in tweet),
        'has_mention': int('@' in tweet)
    }


In [15]:
from typing import Optional

def generate_synthetic_tweet(client: openai.OpenAI,
                           original_tweet: str,
                           class_type: str,
                           max_retries: int = 3) -> Optional[str]:
    """
    Generate a synthetic tweet based on an original tweet.

    Args:
        client: OpenAI client instance
        original_tweet: The original tweet to use as reference
        class_type: Either 'literal' or 'sarcastic'
        max_retries: Maximum number of retries for failed API calls

    Returns:
        Generated tweet text, or None if every attempt fails or returns empty text
    """
    prompt = f"""Given this tweet: "{original_tweet}"

    Generate a new tweet that:
    1. Is about the same subject/topic
    2. Has a similar length and style
    3. Is {class_type} in tone
    4. Sounds natural and realistic

    Only return the new tweet text, nothing else."""

    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4.1-mini",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant that generates realistic tweets."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.7,
                max_tokens=500
            )
            tweet = response.choices[0].message.content.strip()
            if tweet:
                return tweet
        except Exception as e:
            if attempt == max_retries - 1:
                print(f"Failed to generate tweet after {max_retries} attempts: {e}")
        # Wait before retrying (also after an empty response); skip the wait after the final attempt
        if attempt < max_retries - 1:
            time.sleep(0.5)

    return None
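
The retry loop above waits a fixed 0.5 seconds between attempts. For long runs that hit rate limits, an exponential backoff is a common alternative. The following is a generic sketch, not the notebook's method: `call_with_backoff` and its parameters are hypothetical, and `fn` stands for any zero-argument callable that wraps the API request.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def call_with_backoff(fn: Callable[[], T],
                      max_retries: int = 3,
                      base_delay: float = 0.5) -> Optional[T]:
    """Call fn until it returns a truthy value, doubling the wait after each failure."""
    for attempt in range(max_retries):
        try:
            result = fn()
            if result:
                return result
        except Exception as e:
            if attempt == max_retries - 1:
                print(f"Giving up after {max_retries} attempts: {e}")
        if attempt < max_retries - 1:
            # Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts
            time.sleep(base_delay * 2 ** attempt)
    return None
```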


In [16]:
def generate_synthetic_dataset(client: openai.OpenAI,
                             original_df: pd.DataFrame,
                             num_samples: int = 10,
                             max_retries: int = 3) -> pd.DataFrame:
    """
    Generate synthetic dataset based on original tweets.

    Args:
        client: OpenAI client instance
        original_df: Original dataset DataFrame
        num_samples: Number of samples to generate per class
        max_retries: Maximum number of retries for failed API calls

    Returns:
        DataFrame containing synthetic tweets
    """
    synthetic_tweets = []
    synthetic_labels = []
    original_tweets = []

    # Sample random tweets from original dataset
    sample_tweets = original_df.sample(n=num_samples*2)['text'].tolist()

    # Generate literal tweets
    print("Generating literal tweets...")
    for original_tweet in tqdm(sample_tweets[:num_samples]):
        # Generate synthetic tweets (this may take a while and cost tokens)
        synthetic_tweet = generate_synthetic_tweet(client, original_tweet, 'literal', max_retries)
        if synthetic_tweet:
            synthetic_tweets.append(synthetic_tweet)
            synthetic_labels.append('literal')
            original_tweets.append(original_tweet)

    # Generate sarcastic tweets
    print("Generating sarcastic tweets...")
    for original_tweet in tqdm(sample_tweets[num_samples:]):
        # Generate synthetic tweets (this may take a while and cost tokens)
        synthetic_tweet = generate_synthetic_tweet(client, original_tweet, 'sarcastic', max_retries)
        if synthetic_tweet:
            synthetic_tweets.append(synthetic_tweet)
            synthetic_labels.append('sarcastic')
            original_tweets.append(original_tweet)

    # Create DataFrame
    synthetic_data = pd.DataFrame({
        'text': synthetic_tweets,
        'label': synthetic_labels,
        'original_tweet': original_tweets,
        'source': ['synthetic'] * len(synthetic_tweets),
        'confidence': [1.0] * len(synthetic_tweets)
    })

    # Add metadata columns
    synthetic_data['word_count'] = synthetic_data['text'].str.split().str.len()
    synthetic_data['char_count'] = synthetic_data['text'].str.len()
    synthetic_data['has_hashtag'] = synthetic_data['text'].str.contains('#').astype(int)
    synthetic_data['has_mention'] = synthetic_data['text'].str.contains('@').astype(int)

    return synthetic_data


In [None]:
# Generate synthetic dataset
print("Generating synthetic dataset...")
synthetic_data = generate_synthetic_dataset(client, original_df, num_samples=3000)


In [21]:
# Preprocess the synthetic dataset
print("\nPreprocessing synthetic dataset...")
print("Starting synthetic dataset preprocessing...")

# Clean text
print("Cleaning text...")
synthetic_data['cleaned_text'] = synthetic_data['text'].apply(clean_text)

# Add word count information
print("Adding word count information...")
synthetic_data['word_count'] = synthetic_data['cleaned_text'].apply(lambda x: len(word_tokenize(x)))
synthetic_data['english_word_count'] = synthetic_data['cleaned_text'].apply(count_english_words)

# Filter out tweets with fewer than 10 words
original_size = len(synthetic_data)
synthetic_data = synthetic_data[synthetic_data['word_count'] >= 10]
filtered_size = len(synthetic_data)

# Print statistics
print(f"Original synthetic dataset size: {original_size}")
print(f"Filtered synthetic dataset size: {filtered_size}")
print(f"Removed {original_size - filtered_size} tweets with fewer than 10 words")
print(f"Average word count: {synthetic_data['word_count'].mean():.2f}")
print(f"Average English word count: {synthetic_data['english_word_count'].mean():.2f}")



Preprocessing synthetic dataset...
Starting synthetic dataset preprocessing...
Cleaning text...
Adding word count information...
Original synthetic dataset size: 6000
Filtered synthetic dataset size: 5003
Removed 997 tweets with fewer than 10 words
Average word count: 18.42
Average English word count: 18.24


In [22]:
# Save synthetic data
synthetic_data.to_csv('synthetic_tweets.csv', index=False)
print("Synthetic data saved to 'synthetic_tweets.csv'")


Synthetic data saved to 'synthetic_tweets.csv'


In [23]:
# Display statistics
print("\nSynthetic Data Statistics:")
print(f"Total samples: {len(synthetic_data)}")
print("\nLabel distribution:")
print(synthetic_data['label'].value_counts())

print("\nText length statistics by label:")
print(synthetic_data.groupby('label')['word_count'].describe())

print("\nHashtag and mention usage by label:")
print(synthetic_data.groupby('label')[['has_hashtag', 'has_mention']].mean())



Synthetic Data Statistics:
Total samples: 5003

Label distribution:
label
sarcastic    2754
literal      2249
Name: count, dtype: int64

Text length statistics by label:
            count       mean       std   min   25%   50%   75%   max
label                                                               
literal    2249.0  17.824366  5.688833  10.0  13.0  17.0  22.0  36.0
sarcastic  2754.0  18.915033  6.187649  10.0  14.0  18.0  23.0  55.0

Hashtag and mention usage by label:
           has_hashtag  has_mention
label                              
literal       0.020898     0.448199
sarcastic     0.074074     0.387073
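
After filtering, the classes are no longer equal in size (2754 sarcastic vs. 2249 literal). If a strictly balanced training set is wanted, one option is to downsample the larger class. The sketch below is an optional step, not something this notebook performs; `downsample_to_balance` and the seed value are hypothetical.

```python
import pandas as pd

def downsample_to_balance(df: pd.DataFrame,
                          label_col: str = "label",
                          seed: int = 42) -> pd.DataFrame:
    """Randomly keep the same number of rows per label (the size of the smallest class)."""
    min_count = int(df[label_col].value_counts().min())
    balanced = df.groupby(label_col).sample(n=min_count, random_state=seed)
    # Shuffle so the labels are not grouped together
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)
```

Applied to `synthetic_data`, this would keep 2249 tweets per class at the cost of discarding some sarcastic samples.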


In [24]:
# Display sample pairs of original and synthetic tweets
print("\nSample pairs of original and synthetic tweets:")
sample_pairs = synthetic_data[['original_tweet', 'text', 'cleaned_text', 'label']].head(5)
for _, row in sample_pairs.iterrows():
    print(f"\nOriginal tweet: {row['original_tweet']}")
    print(f"Synthetic tweet ({row['label']}): {row['text']}")
    print(f"Cleaned synthetic tweet: {row['cleaned_text']}")
    print("-" * 80)


Sample pairs of original and synthetic tweets:

Original tweet: Watching jordon walk into his doom mwhahaha 
Synthetic tweet (literal): Seeing Jordan head straight into trouble, can’t wait to see what happens next.
Cleaned synthetic tweet: seeing jordan head straight into trouble can t wait to see what happens next
--------------------------------------------------------------------------------

Original tweet: Soo hungry !!, but why am I still in bed?  Got cookie monster!?  
Synthetic tweet (literal): Starving right now, but why can't I get out of bed? Where’s my cookie monster?
Cleaned synthetic tweet: starving right now but why can t i get out of bed where s my cookie monster
--------------------------------------------------------------------------------

Original tweet: Gotta go now.  Parent alert again. I currently have 160k and proud about climbing my way up from 80k to 160 in just 4hours. &lt;33 Nighty! :*
Synthetic tweet (literal): Time to head out, parents just checked in. S