 # Step 1: Data Acquisition and Preprocessing



 This step focuses on acquiring and preprocessing the [Sentiment140 dataset](https://www.kaggle.com/datasets/kazanova/sentiment140), which contains 1.6 million tweets labeled for sentiment analysis. The preprocessing steps include:



 1. Text cleaning:

    - Converting to lowercase

    - Normalizing unicode characters

    - Removing URLs (http, https, www)

    - Removing user mentions (@username) and hashtags (#hashtag)

    - Replacing special characters with spaces (keeping only letters, digits, underscores, and whitespace)

    - Normalizing whitespace (removing extra spaces)



 2. Text filtering and analysis:

    - Removing tweets with fewer than 10 words (counted after cleaning)

    - Counting total words and English words (tokens made up only of ASCII letters)

    - Adding word count statistics


The processed dataset will be saved as `data/processed_sentiment140.csv` for further analysis and model training.

Before you begin, download `kaggle.json` from your Kaggle account settings and place it in the project root.
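
A quick sanity check on the credentials file can save a failed download later. The sketch below assumes the standard Kaggle API token layout, a JSON object with `username` and `key` fields. The helper name `check_kaggle_credentials` is hypothetical and is not part of the Kaggle CLI.

```python
import json
from pathlib import Path


def check_kaggle_credentials(path: str = "kaggle.json") -> bool:
    """Return True if the file exists and holds non-empty 'username' and 'key' strings."""
    cred_path = Path(path)
    if not cred_path.is_file():
        return False
    try:
        creds = json.loads(cred_path.read_text(encoding="utf-8"))
    except ValueError:
        # Covers malformed JSON as well as undecodable bytes.
        return False
    if not isinstance(creds, dict):
        return False
    # Assumption: a valid token has exactly these two string fields populated.
    return all(
        isinstance(creds.get(field), str) and bool(creds.get(field))
        for field in ("username", "key")
    )


if not check_kaggle_credentials():
    print("kaggle.json is missing or incomplete; download a new token from your Kaggle account settings.")
```

Running this before the setup cell gives a clearer message than a failed `kaggle datasets download` call.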

In [1]:
# Install required packages
!pip install -q kaggle


In [1]:
# Import all required libraries
import re, os, json, unicodedata, nltk
import pandas as pd
from nltk.tokenize import word_tokenize
from typing import List, Dict
from tqdm import tqdm


In [3]:
# Set up Kaggle: get your kaggle.json API token from kaggle.com and place it in the same directory as this notebook
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json


In [4]:
# Download the dataset
!kaggle datasets download -d kazanova/sentiment140 --unzip -p ./data


Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other


In [2]:
# Download required NLTK resources
print("Downloading NLTK resources...")
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)


Downloading NLTK resources...


True

In [3]:
def clean_text(text: str) -> str:
    """
    Clean and normalize text.

    Args:
        text (str): Input text to clean

    Returns:
        str: Cleaned text
    """
    # Convert to lowercase
    text = text.lower()

    # Normalize unicode characters
    text = unicodedata.normalize('NFKD', text)

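    # Remove URLs. `http\S+` already covers https links. Note that the unanchored
    # `www\S+` branch also matches inside words, e.g. "awww," is reduced to "a".
    text = re.sub(r'http\S+|www\S+', '', text)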

    # Remove user mentions and hashtags
    text = re.sub(r'@\w+|\#\w+', '', text)

    # Replace special characters with spaces (\w keeps letters, digits, and underscores)
    text = re.sub(r'[^\w\s]', ' ', text)
    # Optional: also strip digits
    # text = re.sub(r'\d+', '', text)

    # Remove extra whitespace
    text = ' '.join(text.split())

    return text
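
The URL pattern in `clean_text` has a side effect worth knowing about: its `www\S+` branch is not anchored, so it also fires inside ordinary words. This is why the sample output later in this notebook shows "Awww," cleaned to "a". The sketch below compares a pattern that behaves the same as the one above with a stricter alternative. `STRICT_URL_RE` and `strip_urls` are hypothetical names, and the stricter pattern is an assumption about what is intended, not the notebook's method.

```python
import re

# Behaves the same as the URL pattern used in clean_text.
ORIGINAL_URL_RE = re.compile(r'http\S+|www\S+')

# Hypothetical stricter alternative: require a scheme, or a "www."
# that starts at a word boundary.
STRICT_URL_RE = re.compile(r'https?://\S+|\bwww\.\S+')


def strip_urls(pattern: re.Pattern, text: str) -> str:
    """Remove pattern matches and collapse the leftover whitespace."""
    return ' '.join(pattern.sub('', text).split())


sample = "awww, see www.example.com or http://twitpic.com/2y1zl"

print(strip_urls(ORIGINAL_URL_RE, sample))  # a see or
print(strip_urls(STRICT_URL_RE, sample))    # awww, see or
```

Swapping the pattern would change the cleaned text, so the statistics reported in this notebook would need to be regenerated.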


In [4]:
def is_english_word(word: str) -> bool:
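    """
    Heuristically check if a word is English.

    The check is character-based: it returns True for any token made up
    only of ASCII letters, so it does not consult a dictionary.

    Args:
        word (str): Word to check

    Returns:
        bool: True if word contains only ASCII letters
    """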
    return bool(re.match(r'^[a-zA-Z]+$', word))

def count_english_words(text: str) -> int:
    """
    Count the number of English words in text.

    Args:
        text (str): Input text

    Returns:
        int: Number of English words
    """
    words = word_tokenize(text)
    return sum(1 for word in words if is_english_word(word))
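
Because `is_english_word` only tests for ASCII letters, it accepts any alphabetic token regardless of language and rejects tokens containing digits or accented characters. A small self-contained sketch of that behavior follows. `is_ascii_alpha` is a hypothetical stand-in that uses the same regular expression, so it runs without NLTK.

```python
import re

# Same regular expression as is_english_word.
ASCII_WORD_RE = re.compile(r'^[a-zA-Z]+$')


def is_ascii_alpha(word: str) -> bool:
    """True if the token consists only of ASCII letters."""
    return bool(ASCII_WORD_RE.match(word))


for token in ["facebook", "bonjour", "café", "2day", "lol"]:
    print(f"{token!r}: {is_ascii_alpha(token)}")
```

Here "bonjour" and "lol" pass while "café" and "2day" fail, so `english_word_count` is better read as a count of purely alphabetic tokens.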


In [5]:
def preprocess_dataset(df: pd.DataFrame, min_words: int = 10) -> pd.DataFrame:
    """
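    Preprocess the entire dataset.

    Adds 'cleaned_text', 'word_count', and 'english_word_count' columns
    (the columns are added to the input DataFrame in place), then drops
    rows whose cleaned text has fewer than min_words tokens.

    Args:
        df (pd.DataFrame): Input DataFrame with 'text' column
        min_words (int): Minimum number of words required in a tweet

    Returns:
        pd.DataFrame: Filtered DataFrame with the new columns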
    """
    print("Starting dataset preprocessing...")

    # Clean text
    print("Cleaning text...")
    df['cleaned_text'] = df['text'].apply(clean_text)

    # Add word count information
    print("Adding word count information...")
    df['word_count'] = df['cleaned_text'].apply(lambda x: len(word_tokenize(x)))
    df['english_word_count'] = df['cleaned_text'].apply(count_english_words)

    # Filter out tweets with fewer than min_words
    original_size = len(df)
    df = df[df['word_count'] >= min_words]
    filtered_size = len(df)

    # Print statistics
    print(f"Original dataset size: {original_size}")
    print(f"Filtered dataset size: {filtered_size}")
    print(f"Removed {original_size - filtered_size} tweets with fewer than {min_words} words")
    print(f"Average word count: {df['word_count'].mean():.2f}")
    print(f"Average English word count: {df['english_word_count'].mean():.2f}")

    return df
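
The `min_words` filter can be illustrated on a small in-memory list. This sketch uses whitespace splitting as a stand-in for `word_tokenize`, which is an assumption: NLTK may split some tokens further, so its counts can differ. `filter_by_word_count` is a hypothetical helper, not part of the notebook.

```python
from typing import List, Tuple


def filter_by_word_count(texts: List[str], min_words: int = 10) -> Tuple[List[str], int]:
    """Keep texts with at least min_words whitespace-separated tokens.

    Returns the kept texts and the number of texts removed.
    """
    kept = [t for t in texts if len(t.split()) >= min_words]
    return kept, len(texts) - len(kept)


cleaned = [
    "my whole body feels itchy and like its on fire",  # 10 tokens: kept
    "no it s not behaving at all",                      # 7 tokens: dropped
]

kept, removed = filter_by_word_count(cleaned, min_words=10)
print(f"Kept {len(kept)} text(s), removed {removed}")
```

The threshold is inclusive, matching the `>=` comparison in `preprocess_dataset`: a text with exactly 10 tokens is kept.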


In [6]:
# Load the dataset
columns = ['target', 'id', 'date', 'flag', 'user', 'text']
df = pd.read_csv('./data/training.1600000.processed.noemoticon.csv',
                 encoding='latin-1',
                 names=columns)


In [7]:
# Display basic information about the dataset
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
df.head()


Dataset shape: (1600000, 6)

First few rows:


,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [8]:
# Basic statistics
# target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
print("\nTarget distribution:")
df['target'].value_counts()



Target distribution:


target,count
0,800000
4,800000
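
Only the codes 0 and 4 appear in this file, so the data is effectively binary. A common convention, assumed here rather than taken from this notebook, is to remap the codes to 0 and 1 before training. The sketch below uses toy values and the hypothetical helper `to_binary_label`.

```python
from collections import Counter

# Polarity codes as described in the comment above.
POLARITY_NAMES = {0: "negative", 2: "neutral", 4: "positive"}


def to_binary_label(target: int) -> int:
    """Map polarity 0 -> 0 (negative) and 4 -> 1 (positive); reject other codes."""
    if target == 0:
        return 0
    if target == 4:
        return 1
    raise ValueError(f"Unexpected polarity code: {target}")


toy_targets = [0, 4, 4, 0, 4]  # toy values, not taken from the dataset

print(Counter(POLARITY_NAMES[t] for t in toy_targets))
print([to_binary_label(t) for t in toy_targets])
```

Raising on unexpected codes makes a stray neutral row (code 2) visible instead of silently mislabeling it.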


In [9]:
# Analyze text lengths
df['text_length'] = df['text'].str.len()
print("Maximum text length:", df['text_length'].max())
print("\nText length statistics:")
df['text_length'].describe()


Maximum text length: 374

Text length statistics:


,text_length
count,1600000.0
mean,74.09011
std,36.44114
min,6.0
25%,44.0
50%,69.0
75%,104.0
max,374.0
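
The maximum of 374 characters is well above the 140-character limit tweets had in 2009. One possible contributor, stated here as an assumption rather than something verified in this notebook, is HTML-escaped text such as `&amp;` or `&quot;`, which takes several characters per symbol. The hypothetical helper below flags strings that change under `html.unescape`, as a starting point for inspecting long rows.

```python
import html


def looks_html_escaped(text: str) -> bool:
    """True if unescaping HTML entities changes the text."""
    return html.unescape(text) != text


samples = [
    "I &lt;3 this &amp; that",      # contains HTML entities
    "plain text, nothing escaped",  # unchanged by unescaping
]

for s in samples:
    unescaped = html.unescape(s)
    print(looks_html_escaped(s), len(s), len(unescaped), repr(unescaped))
```

If many long rows are flagged, unescaping before `clean_text` would keep entity names such as "quot" and "amp" from being counted as words.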


In [10]:
# Preprocess the dataset
processed_df = preprocess_dataset(df, min_words=10)


Starting dataset preprocessing...
Cleaning text...
Adding word count information...
Original dataset size: 1600000
Filtered dataset size: 999789
Removed 600211 tweets with fewer than 10 words
Average word count: 17.58
Average English word count: 17.27


In [11]:
# Display basic information about the processed dataset
print("Processed Dataset Information:")
print("\nShape:", processed_df.shape)
print("\nColumns:", processed_df.columns.tolist())
print("\nSample of processed text:")
processed_df[['text', 'cleaned_text', 'word_count', 'english_word_count']].head()


Processed Dataset Information:

Shape: (999789, 10)

Columns: ['target', 'id', 'date', 'flag', 'user', 'text', 'text_length', 'cleaned_text', 'word_count', 'english_word_count']

Sample of processed text:


,text,cleaned_text,word_count,english_word_count
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",a that s a bummer you shoulda got david carr o...,17,17
1,is upset that he can't update his Facebook by ...,is upset that he can t update his facebook by ...,22,22
2,@Kenichan I dived many times for the ball. Man...,i dived many times for the ball managed to sav...,17,16
3,my whole body feels itchy and like its on fire,my whole body feels itchy and like its on fire,10,10
4,"@nationwideclass no, it's not behaving at all....",no it s not behaving at all i m mad why am i h...,23,23


In [12]:
# Save processed dataset
print("\nSaving processed dataset...")
os.makedirs('./data', exist_ok=True)
processed_df.to_csv('./data/processed_sentiment140.csv', index=False)
print("Preprocessing complete. Processed dataset saved.")


Saving processed dataset...
Preprocessing complete. Processed dataset saved.
