# Computational Text Analysis

This notebook, created for SICSS 2025 by Björn Ross and based on earlier notebooks by Björn Ross and Steve Wilson, illustrates some common use cases for computational text analysis:
- Unsupervised learning, using Latent Dirichlet Allocation
- Supervised learning where a custom classifier (SVM or Random Forest) is trained from scratch
- Zero-shot learning, where we prompt an LLM (Gemini) to return classification results

## Installing and importing packages

In [4]:
!pip install gensim
!pip install pyLDAvis



In [1]:
import csv
import numpy as np
import pandas as pd
import gensim

import re
import string

from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel

from collections import Counter
from scipy import sparse

from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

import pyLDAvis.gensim_models

from tqdm import tqdm
import time

from google.genai import errors

At this point I sometimes get an error message that can be fixed by restarting the session (Runtime - Restart session).

In [2]:
import nltk
from nltk.tokenize import word_tokenize

# Download tokenizer data (run once)
nltk.download('punkt_tab')

from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer()

from nltk.corpus import stopwords

nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Opening the data

In [3]:
# check out the data (use ! for command line operation, as opposed to Python code)
!cat Tweets.14cat.train | head -5

45029314109075046	Furniture for - so cute! gotta show my #granddog mama the last one especially :) http://t.co/F69aT71TVQ http://t.co/YQVK09pZzB	Pets & Animals
45033090867215155	"#Sunday aww"": Mr Peebles	Pets & Animals
45036625162627481	CATS ART http://t.co/cJre1jn2Bl #creative #feline #art #love #cat #cats #kittens #housecat #domestic #alley #tomcat	Pets & Animals
45086603513077350	RT @Masala_chaai: Keep Calm & Hug your Dog ! #PetLovers cc @MyICETag @pooja330 @huftindia @PranitaBalar @BarknBond http://t.co/JJHSvf…	Pets & Animals
45138968053405286	RT @TheSoulfulEMU: RETWEET if you love your dog!! http://t.co/QWvjFFnfiP via @earthposts @LUKIKA 	Pets & Animals


In [4]:
# Load the data into pandas
training_data = pd.read_csv(
    'Tweets.14cat.train',
    sep='\t',
    quoting = csv.QUOTE_NONE,
    names = ["ID", "Text", "Category"])
training_data

Unnamed: 0,ID,Text,Category
0,45029314109075046,Furniture for - so cute! gotta show my #grandd...,Pets & Animals
1,45033090867215155,"""#Sunday aww"""": Mr Peebles",Pets & Animals
2,45036625162627481,CATS ART http://t.co/cJre1jn2Bl #creative #fel...,Pets & Animals
3,45086603513077350,RT @Masala_chaai: Keep Calm & Hug your Dog ! #...,Pets & Animals
4,45138968053405286,RT @TheSoulfulEMU: RETWEET if you love your do...,Pets & Animals
...,...,...,...
2498,551069446257655808,Series of Car Window Breakages Under Investiga...,News & Politics
2499,551156371031199744,night Maltese club offer Evans contract: Conv...,News & Politics
2500,551166096598781952,All of today's news headlines in one place. ht...,News & Politics
2501,551617864805781504,ISBPL: Affordable housing scheme for EPFO subs...,News & Politics


In [5]:
test_data = pd.read_csv(
    'Tweets.14cat.test',
    sep='\t',
    quoting = csv.QUOTE_NONE,
    names = ["ID", "Text", "Category"])
test_data

Unnamed: 0,ID,Text,Category
0,45108992991784550,21 Photos of Babies and Pets Being the Cutest ...,Pets & Animals
1,45034300968799027,#Trance #House #Electro Masoud feat. Alexandra...,Music
2,45077420496768205,Murder trial begins in Fulton County: The tria...,News & Politics
3,45056046598311116,Carcraft announces management buyout: CARCRAFT...,Autos & Vehicles
4,45134816305232691,RT @bill_nizzle: Check out these videos about ...,Science & Technology
...,...,...,...
620,547624232331395072,French feminists place warnings on toys deemed...,News & Politics
621,548619949116121088,A Few Thoughts on Reducing Unforced Errors: Th...,News & Politics
622,549374461208956928,RT @SriLankaTweet: Tamil National Alliance #TN...,News & Politics
623,549823135705362432,Michael Grimm to resign from Congress: sources...,News & Politics


## Latent Dirichlet Allocation for Topic Modelling

In [None]:
# Tokenize texts
texts = []
for text in training_data["Text"]:
  tokens = tweet_tokenizer.tokenize(text.lower())

  stop_words = set(stopwords.words('english'))
  tokens = [token for token in tokens if token not in stop_words and len(token) > 2]

  texts.append(tokens)

In [None]:
word_tokenize(training_data["Text"][1])

['``', '#', 'Sunday', 'aww', "''", "''", ':', 'Mr', 'Peebles']

In [None]:
tweet_tokenizer.tokenize(training_data["Text"][0])

['Furniture', 'for', '-', 'so', 'cute', '!', 'gotta', 'show', 'my', '#granddog', 'mama', 'the', 'last', 'one', 'especially', ':)', 'http://t.co/F69aT71TVQ', 'http://t.co/YQVK09pZzB']

In [None]:
texts[0]

['furniture', 'cute', 'gotta', 'show', '#granddog', 'mama', 'last', 'one', 'especially', 'http://t.co/f69at71tvq', 'http://t.co/yqvk09pzzb']

In [None]:
# Create a corpus from a list of texts
common_dictionary = Dictionary(texts)
common_corpus = [common_dictionary.doc2bow(text) for text in texts]

# Train the model on the corpus.

lda = LdaModel(common_corpus,
               num_topics = 10,
               passes = 10,
               iterations = 1000)



In [None]:
lda.get_document_topics(bow = common_corpus)

<gensim.interfaces.TransformedCorpus object at 0x799b0bba5b10>

In [None]:
lda.print_topics()

[(0, '0.012*"71" + 0.009*"5766" + 0.008*"234" + 0.006*"46" + 0.005*"1122" + 0.004*"5623" + 0.004*"593" + 0.004*"5263" + 0.004*"1915" + 0.003*"1127"'), (1, '0.010*"71" + 0.006*"234" + 0.004*"1328" + 0.004*"8603" + 0.004*"62" + 0.003*"740" + 0.003*"1585" + 0.003*"692" + 0.003*"5056" + 0.003*"1401"'), (2, '0.022*"71" + 0.006*"5438" + 0.005*"413" + 0.005*"9" + 0.005*"8513" + 0.004*"1122" + 0.004*"181" + 0.004*"461" + 0.004*"839" + 0.003*"11360"'), (3, '0.020*"71" + 0.006*"646" + 0.005*"593" + 0.005*"1122" + 0.005*"938" + 0.004*"436" + 0.004*"1466" + 0.003*"11099" + 0.003*"815" + 0.003*"234"'), (4, '0.022*"71" + 0.015*"234" + 0.012*"1485" + 0.012*"1122" + 0.007*"1484" + 0.007*"2199" + 0.004*"4576" + 0.004*"1029" + 0.004*"1907" + 0.003*"23"'), (5, '0.006*"46" + 0.005*"815" + 0.005*"1371" + 0.004*"71" + 0.003*"813" + 0.003*"334" + 0.003*"1122" + 0.003*"7336" + 0.003*"9" + 0.003*"953"'), (6, '0.007*"1122" + 0.006*"5097" + 0.004*"62" + 0.004*"577" + 0.003*"46" + 0.003*"329" + 0.003*"7382" + 0.0

In [None]:
lda[common_corpus]


<gensim.interfaces.TransformedCorpus object at 0x799b0bba0cd0>

In [None]:
pyLDAvis.enable_notebook()


In [None]:
panel = pyLDAvis.gensim_models.prepare(lda, corpus = common_corpus, dictionary = common_dictionary)
panel

## Supervised learning-based text classification

### Preprocessing the training data for classification

In [11]:
# convert to list of lists: documents containing tokens
# and return the list of categories
# also get the vocabulary
def preprocess_data(data):

    chars_to_remove = re.compile(f'[{string.punctuation}]')

    documents = []
    categories = []
    vocab = set([])


    for index, row in data.iterrows():
      words = tweet_tokenizer.tokenize(row["Text"].lower())
      for word in words:
        vocab.add(word)

      # add the list of words to the documents list
      documents.append(words)

      # add the category to the categories list
      categories.append(row["Category"])

    return documents, categories, vocab

In [12]:
%time
# ^ see how long this takes
# preprocess the data
preprocessed_training_data, training_categories, train_vocab = preprocess_data(training_data)
preprocessed_test_data, test_categories, test_vocab = preprocess_data(test_data)

print(f"Training Data has {len(preprocessed_training_data)} " +
      f"documents and vocab size of {len(train_vocab)}")
print(f"Test Data has {len(preprocessed_test_data)} " +
      f"documents and vocab size of {len(test_vocab)}")
print(f"There were {len(set(training_categories))} " +
      f"categories in the training data and {len(set(test_categories))} in the test.")

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 23.1 µs
Training Data has 2503 documents and vocab size of 13763
Test Data has 625 documents and vocab size of 4687
There were 14 categories in the training data and 14 in the test.


In [173]:
training_categories

['Pets & Animals',
 'Pets & Animals',
 'Pets & Animals',
 'Pets & Animals',
 'Pets & Animals',
 'Pets & Animals',
 'Pets & Animals',
 'Pets & Animals',
 'Comedy',
 'Autos & Vehicles',
 'Science & Technology',
 'News & Politics',
 'News & Politics',
 'Gaming',
 'Autos & Vehicles',
 'Nonprofits & Activism',
 'Autos & Vehicles',
 'Music',
 'Gaming',
 'Science & Technology',
 'Science & Technology',
 'Autos & Vehicles',
 'Comedy',
 'Gaming',
 'Autos & Vehicles',
 'Autos & Vehicles',
 'Autos & Vehicles',
 'Gaming',
 'News & Politics',
 'News & Politics',
 'News & Politics',
 'Comedy',
 'Comedy',
 'Gaming',
 'Film & Animation',
 'Film & Animation',
 'Film & Animation',
 'Film & Animation',
 'Film & Animation',
 'Film & Animation',
 'Film & Animation',
 'Film & Animation',
 'Film & Animation',
 'Film & Animation',
 'Comedy',
 'Gaming',
 'Science & Technology',
 'Education',
 'Education',
 'Education',
 'Education',
 'Education',
 'Education',
 'Education',
 'Education',
 'Pets & Animals',
 'C

In [174]:
list(training_data["Category"])

['Pets & Animals',
 'Pets & Animals',
 'Pets & Animals',
 'Pets & Animals',
 'Pets & Animals',
 'Pets & Animals',
 'Pets & Animals',
 'Pets & Animals',
 'Comedy',
 'Autos & Vehicles',
 'Science & Technology',
 'News & Politics',
 'News & Politics',
 'Gaming',
 'Autos & Vehicles',
 'Nonprofits & Activism',
 'Autos & Vehicles',
 'Music',
 'Gaming',
 'Science & Technology',
 'Science & Technology',
 'Autos & Vehicles',
 'Comedy',
 'Gaming',
 'Autos & Vehicles',
 'Autos & Vehicles',
 'Autos & Vehicles',
 'Gaming',
 'News & Politics',
 'News & Politics',
 'News & Politics',
 'Comedy',
 'Comedy',
 'Gaming',
 'Film & Animation',
 'Film & Animation',
 'Film & Animation',
 'Film & Animation',
 'Film & Animation',
 'Film & Animation',
 'Film & Animation',
 'Film & Animation',
 'Film & Animation',
 'Film & Animation',
 'Comedy',
 'Gaming',
 'Science & Technology',
 'Education',
 'Education',
 'Education',
 'Education',
 'Education',
 'Education',
 'Education',
 'Education',
 'Pets & Animals',
 'C

In [175]:
# check the most common categories in the training data
print(Counter(training_categories).most_common())

[('Gaming', 220), ('Autos & Vehicles', 210), ('Howto & Style', 207), ('Sports', 203), ('Travel & Events', 196), ('Science & Technology', 189), ('Film & Animation', 178), ('Pets & Animals', 177), ('News & Politics', 168), ('Music', 160), ('Entertainment', 159), ('Comedy', 153), ('Education', 142), ('Nonprofits & Activism', 141)]


### Set up mappings for word and category IDs

In [176]:
# convert the vocab to a word id lookup dictionary
# anything not in this will be considered "out of vocabulary" OOV
word2id = {}
for word_id,word in enumerate(train_vocab):
    word2id[word] = word_id

# and do the same for the categories
cat2id = {}
for cat_id,cat in enumerate(set(training_categories)):
    cat2id[cat] = cat_id

print("The word id for dog is",word2id['dog'])
print("The category id for Pets & Animals is",cat2id['Pets & Animals'])

The word id for dog is 5658
The category id for Pets & Animals is 4


In [177]:
cat2id

{'Science & Technology': 0,
 'News & Politics': 1,
 'Entertainment': 2,
 'Film & Animation': 3,
 'Pets & Animals': 4,
 'Nonprofits & Activism': 5,
 'Travel & Events': 6,
 'Music': 7,
 'Comedy': 8,
 'Sports': 9,
 'Howto & Style': 10,
 'Education': 11,
 'Autos & Vehicles': 12,
 'Gaming': 13}

### Convert data to bag-of-words format

In [178]:
# build a BOW representation of the files: use the scipy
# data is the preprocessed_data
# word2id maps words to their ids
def convert_to_bow_matrix(preprocessed_data, word2id):

    # matrix size is number of docs x vocab size + 1 (for OOV)
    matrix_size = (len(preprocessed_data),len(word2id)+1)
    oov_index = len(word2id)
    # matrix indexed by [doc_id, token_id]
    X = sparse.dok_matrix(matrix_size)

    # iterate through all documents in the dataset
    for doc_id,doc in enumerate(preprocessed_data):
        for word in doc:
            # default is 0, so just add to the count for this word in this doc
            # if the word is oov, increment the oov_index
            X[doc_id,word2id.get(word,oov_index)] += 1

    return X

In [179]:
%%time
X_train = convert_to_bow_matrix(preprocessed_training_data, word2id)

CPU times: user 787 ms, sys: 5.32 ms, total: 792 ms
Wall time: 790 ms


In [180]:
# check some docs
print("First 3 documents are:",X_train[:3])

First 3 documents are:   (0, 1672)	1.0
  (0, 11695)	1.0
  (0, 9080)	1.0
  (0, 4998)	1.0
  (0, 10131)	1.0
  (0, 12237)	1.0
  (0, 2514)	1.0
  (0, 10594)	1.0
  (0, 8593)	1.0
  (0, 8684)	1.0
  (0, 13363)	1.0
  (0, 6420)	1.0
  (0, 1384)	1.0
  (0, 4111)	1.0
  (0, 764)	1.0
  (0, 7387)	1.0
  (0, 11272)	1.0
  (0, 5408)	1.0
  (1, 8574)	3.0
  (1, 3738)	1.0
  (1, 10412)	1.0
  (1, 4817)	1.0
  (1, 7006)	1.0
  (1, 12058)	1.0
  (2, 8929)	1.0
  (2, 5424)	1.0
  (2, 11514)	1.0
  (2, 8108)	1.0
  (2, 2775)	1.0
  (2, 2776)	1.0
  (2, 5020)	1.0
  (2, 11305)	1.0
  (2, 2735)	1.0
  (2, 12257)	1.0
  (2, 8330)	1.0
  (2, 2190)	1.0
  (2, 6528)	1.0
  (2, 10097)	1.0


In [181]:
y_train = [cat2id[cat] for cat in training_categories]

In [182]:
# check the first 3 categories
print(y_train[:3])

[4, 4, 4]


In [183]:
X_train

<2503x13764 sparse matrix of type '<class 'numpy.float64'>'
	with 38369 stored elements in Dictionary Of Keys format>

### Train an SVM model

In [184]:
# Let's train a model: now that the setup is done, it's a piece of cake!
%time
# instantiate an SVM classification model
# https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
# you can set various model hyperparamters here
model = svm.SVC(C=1000, kernel ="linear")
# then train the model!
model.fit(X_train,y_train)

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 7.15 µs


In [185]:
# make a prediction
sample_text = ['retweet','if','you','are','a','dog','person']
# create just a single vector as input (as a 1 x V matrix)
sample_x_in = sparse.dok_matrix((1,len(word2id)+1))
for word in sample_text:
    sample_x_in[0,word2id[word]] += 1

# what does the example document look like?
print(sample_x_in)
prediction = model.predict(sample_x_in)
# what category was predicted?
print("Prediction was:",prediction[0])
# what category was that?
print(cat2id)

  (0, 2246)	1.0
  (0, 11441)	1.0
  (0, 2902)	1.0
  (0, 1424)	1.0
  (0, 3898)	1.0
  (0, 5658)	1.0
  (0, 10822)	1.0
Prediction was: 4
{'Science & Technology': 0, 'News & Politics': 1, 'Entertainment': 2, 'Film & Animation': 3, 'Pets & Animals': 4, 'Nonprofits & Activism': 5, 'Travel & Events': 6, 'Music': 7, 'Comedy': 8, 'Sports': 9, 'Howto & Style': 10, 'Education': 11, 'Autos & Vehicles': 12, 'Gaming': 13}


### Evaluating the model

In [186]:
# evaluate on training data: how well did we fit to the data we trained on?
y_train_predictions = model.predict(X_train)

# now can compute any metrics we care about. Let's quickly do accuracy
def compute_accuracy(predictions, true_values):
    num_correct = 0
    num_total = len(predictions)
    for predicted,true in zip(predictions,true_values):
        if predicted==true:
            num_correct += 1
    return num_correct / num_total

accuracy = compute_accuracy(y_train_predictions,y_train)
print("Accuracy:",accuracy)
# how did we do?

Accuracy: 1.0


Is that a good score? The score can be informative, but it isn't hard to do well on the training data.

In [187]:
# prepare test data in the same was as training data
X_test = convert_to_bow_matrix(preprocessed_test_data, word2id)
y_test = [cat2id[cat] for cat in test_categories]

In [188]:
# now evaluate on test data: data the model has NOT seen during training time
# make sure you do NOT update the model, only get predictions from it
y_test_predictions = model.predict(X_test)
y_test_predictions

array([ 4,  0,  1, 12,  8,  1, 13,  5,  3,  0,  6,  1,  4,  6, 12,  4,  2,
        0, 11,  4, 11, 12,  5,  1,  6,  7,  1,  2, 12,  6,  3,  9, 13,  6,
       10,  6,  7,  8,  8, 12,  6,  4,  1, 11,  0,  5,  0,  9,  4,  0,  6,
        1,  6,  4, 10, 13,  2,  2,  7, 10, 10, 10, 10, 10,  8,  4,  5,  7,
        7,  2,  2,  2,  2, 13,  4,  6,  7,  3,  2,  1,  4,  4,  8,  5, 12,
        9, 11,  2,  2,  8, 11,  1,  8,  5, 12, 10, 13,  6, 13,  9, 11,  6,
        8,  3, 10,  1,  4,  5,  6,  9,  8,  9,  8,  6,  2, 12, 12, 12,  8,
        2,  6,  8, 12,  6,  1,  1,  9, 13,  2,  2,  2,  2,  1, 12, 10,  2,
       12, 12,  0, 11, 12, 11,  6,  6,  7, 11, 13, 11, 11,  1,  4, 12,  4,
        3, 12,  3,  2, 10,  6,  3,  2, 13, 13, 13,  1, 12,  9,  9,  9,  1,
       11,  2,  6,  2,  2, 11, 11,  8,  8,  4, 13,  3,  9,  3,  2,  2, 12,
       13, 12,  6,  0,  2,  3,  7,  6,  0,  1,  2,  0,  6, 13,  7,  6,  2,
        1,  7,  7, 12,  2,  3,  4,  6,  0,  1, 12,  2,  5,  2,  7, 12,  0,
       13, 11,  7,  7,  1

In [189]:
cat_names = []
for cat,cid in sorted(cat2id.items(),key=lambda x:x[1]):
    cat_names.append(cat)
print(classification_report(y_test, y_test_predictions, target_names=cat_names))

                       precision    recall  f1-score   support

 Science & Technology       0.36      0.40      0.38        43
      News & Politics       0.36      0.57      0.44        37
        Entertainment       0.69      0.67      0.68        49
     Film & Animation       0.41      0.37      0.39        46
       Pets & Animals       0.70      0.71      0.70        45
Nonprofits & Activism       0.40      0.37      0.38        38
      Travel & Events       0.42      0.52      0.46        54
                Music       0.55      0.45      0.49        40
               Comedy       0.62      0.61      0.61        38
               Sports       0.51      0.47      0.49        53
        Howto & Style       0.97      0.75      0.85        40
            Education       0.62      0.44      0.51        41
     Autos & Vehicles       0.80      0.78      0.79        51
               Gaming       0.58      0.62      0.60        50

             accuracy                           0.56 

In [190]:
# what would a simple baseline be? How about most common category from before (Gaming)?
# we should *definitely* be doing better than this! Otherwise the model is not helping at all
baseline_predictions = [cat2id['Gaming']] * len(y_test)
baseline_accuracy = compute_accuracy(baseline_predictions,y_train)
print("Accuracy:",baseline_accuracy)

Accuracy: 0.0848


In [191]:
# trying a different model...
# how about a random forest classifier?
%time
model = RandomForestClassifier()
model.fit(X_train,y_train)

y_train_predictions = model.predict(X_train)
print("Train accuracy was:",compute_accuracy(y_train_predictions,y_train))
y_test_predictions = model.predict(X_test)
print("Test accuracy was:",compute_accuracy(y_test_predictions,y_test))

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 10.3 µs
Train accuracy was: 1.0
Test accuracy was: 0.6224


In [192]:
cat_names = []
for cat,cid in sorted(cat2id.items(),key=lambda x:x[1]):
    cat_names.append(cat)
print(classification_report(y_test, y_test_predictions, target_names=cat_names))

                       precision    recall  f1-score   support

 Science & Technology       0.36      0.37      0.37        43
      News & Politics       0.25      0.49      0.33        37
        Entertainment       0.83      0.71      0.77        49
     Film & Animation       0.54      0.61      0.57        46
       Pets & Animals       0.83      0.87      0.85        45
Nonprofits & Activism       0.62      0.39      0.48        38
      Travel & Events       0.55      0.48      0.51        54
                Music       0.72      0.57      0.64        40
               Comedy       0.64      0.55      0.59        38
               Sports       0.63      0.51      0.56        53
        Howto & Style       0.70      0.82      0.76        40
            Education       0.73      0.66      0.69        41
     Autos & Vehicles       0.79      0.86      0.82        51
               Gaming       0.74      0.74      0.74        50

             accuracy                           0.62 

## Zero-shot classification using Gemini models

Before you run the following cell, store your API key in a Colab Secret named `GOOGLE_API_KEY`. If you don't already have an API key or you aren't sure how to create a Colab Secret, see [Authentication](https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Authentication.ipynb) for an example. Only the steps "Create an API Key" and "Add your key to Colab Secrets" from that page are needed here.

In [6]:
from google.colab import userdata

GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')

In [7]:
from google import genai
from google.genai import types

client = genai.Client(api_key=GOOGLE_API_KEY)

In [8]:
MODEL_ID = "gemini-2.0-flash-lite" # @param ["gemini-2.5-flash-preview-05-20", "gemini-2.5-pro-preview-05-06", "gemini-2.0-flash-lite"] {"allow-input":true, isTemplate: true}

In [9]:
from IPython.display import Markdown

response = client.models.generate_content(
    model=MODEL_ID,
    contents="What's the largest planet in our solar system?"
)

Markdown(response.text)

The largest planet in our solar system is **Jupiter**.


In [13]:
set(training_categories)

{'Autos & Vehicles',
 'Comedy',
 'Education',
 'Entertainment',
 'Film & Animation',
 'Gaming',
 'Howto & Style',
 'Music',
 'News & Politics',
 'Nonprofits & Activism',
 'Pets & Animals',
 'Science & Technology',
 'Sports',
 'Travel & Events'}

In [14]:
response = client.models.generate_content(
    model=MODEL_ID,
    contents="""Classify the following text as one of the following categories. Output only the category, with no Markdown formatting. Categories: {'Music', 'Entertainment', 'Autos & Vehicles', 'Travel & Events', 'Nonprofits & Activism', 'Howto & Style', 'Comedy', 'Pets & Animals', 'Science & Technology', 'Gaming', 'Sports', 'Film & Animation', 'News & Politics', 'Education'}
    Text: I like my dog
"""
)

In [15]:
Markdown(response.text)

Pets & Animals


In [16]:
response.text

'Pets & Animals\n'

When making request to the Gemini API, we need to be aware of API limits. Gemini API limits can be found here: https://ai.google.dev/gemini-api/docs/rate-limits

### Attempt no. 1: Make one request for each example

Let's try making one request per example.

In [None]:
from tqdm import tqdm
import time

In [None]:
gemini_classifications = []

for test_example in tqdm(test_data["Text"]):
  response = client.models.generate_content(
      model=MODEL_ID,
      contents="""Classify the following text as one of the following categories. Output only the category, with no Markdown formatting. Categories: {'Music', 'Entertainment', 'Autos & Vehicles', 'Travel & Events', 'Nonprofits & Activism', 'Howto & Style', 'Comedy', 'Pets & Animals', 'Science & Technology', 'Gaming', 'Sports', 'Film & Animation', 'News & Politics', 'Education'}
      Text: """ + test_example
      )
  time.sleep(5)
  gemini_classifications.append(response.text)

 11%|█         | 68/625 [07:05<58:02,  6.25s/it]


KeyboardInterrupt: 

When running the above without time.sleep(5), I hit the rate limit. (Error 429)

When running the above with time.sleep(5), the requests works but takes more than an hour!

Let's try something else.

### Attempt no. 2: Requesting all examples in one go, with JSON schema in the text prompt

In [18]:
import json

In [26]:
prompt_parts = [
        "You are a tweet classifier. Classify the following tweets into one of these categories: " + ", ".join(test_categories) + ".\n",
        "Return the classifications as a JSON array of objects, where each object has 'ID' and 'category'.\n",
        "Example:\n",
        json.dumps([
            {"ID": "T1", "text": "I love my dog!", "category": "Pets & Animals"},
            {"ID": "T2", "text": "What kind of murderer has moral fibre? A cereal killer.", "category": "Comedy"}
        ]),
        "\n\nClassify the following:\n",
        test_data[["ID","Text"]].to_json(orient = 'records')
    ]

generation_config = {
    "response_mime_type": "application/json",
    }

response_json = client.models.generate_content(
      model=MODEL_ID,
      contents=prompt_parts,
      config = generation_config
      )

In [27]:
json.loads(response_json.text)

JSONDecodeError: Expecting ',' delimiter: line 831 column 15 (char 14511)

I got a JSONDecodeError, indicating that the model response could not be parsed as valid JSON. It looks like simply asking the model politely to output valid JSON does not work that well!

### Attempt no. 3: Requesting all examples in one go, with constrained generation

Let's try something else. This time we'll define a JSON schema. This should force the model to output data that conforms to this schema. Read more on this approach here:

https://ai.google.dev/gemini-api/docs/structured-output

In [133]:
from pydantic import BaseModel

class TweetCategories(BaseModel):
    ID: str
    category: str

prompt_parts = [
        "You are a tweet classifier. Classify the following tweets into one of these categories: " + ", ".join(test_categories) + ".\n",
        "Classify the following:\n",
        test_data[["ID","Text"]].to_json(orient = 'records')
    ]

generation_config = {
    "response_mime_type": "application/json",
    "response_schema": list[TweetCategories],
    }

response_constrained_json = client.models.generate_content(
      model=MODEL_ID,
      contents=prompt_parts,
      config=generation_config
      )

KeyboardInterrupt: 

In [30]:
gemini_categories: list[TweetCategories] = response_constrained_json.parsed

In [43]:
response_constrained_json.parsed is None

True

Unfortunately, for me this did not work either. The call to .parsed returns None instead of a list of TweetCategories, indicating that the output could not be parsed as valid JSON. Inspecting the output using response_constrained_json.text shows that the output is cut off after about 200 examples. Apparently we requested too much data in one go.

### Attempt no. 4: Requesting classification for 100 tweets at a time

In this next attempt, we'll use the same approach as before, with structured generation, but only request 100 examples at a time.

In [159]:
json_request_batches = []

for start in range(0, len(test_data), 100):
  json_request_batches.append(
      test_data.loc[start:start+100,["ID","Text"]].to_json(orient = 'records')
      )

In [160]:
len(json_request_batches)

7

We've now split our data into 7 batches. (6 of 100 examples each and one batch of 30)

In [147]:
generation_config = {
    "response_mime_type": "application/json",
    "response_schema": list[TweetCategories],
    }

json_batch_results = []

for json_batch in tqdm(json_request_batches):
  prompt = [
        "You are a tweet classifier. Classify the following tweets into one of these categories: " + ", ".join(test_categories) + ".\n",
        "Classify the following:\n",
        json_batch
    ]

  response_constrained_json = client.models.generate_content(
      model=MODEL_ID,
      contents=prompt,
      config=generation_config
      )
  json_batch_results.append(response_constrained_json)
  time.sleep(5)


 71%|███████▏  | 5/7 [02:04<00:49, 24.95s/it]


ServerError: 503 UNAVAILABLE. {'error': {'code': 503, 'message': 'The model is overloaded. Please try again later.', 'status': 'UNAVAILABLE'}}

When running this I get an error, ServerError: 503 UNAVAILABLE. The model is overloaded. Please try again later.
Clearly not every request succeeds. Let's try again with proper exception handling, waiting five seconds if the request fails.

In [161]:
generation_config = {
    "response_mime_type": "application/json",
    "response_schema": list[TweetCategories],
    }

json_batch_results = []

for json_batch in tqdm(json_request_batches):
  prompt = [
        "You are a tweet classifier. Classify the following tweets into one of these categories: " + ", ".join(test_categories) + ".\n",
        "Classify the following:\n",
        json_batch
    ]

  retries = 3 # Number of retries

  for attempt in range(retries):
      try:
          response_constrained_json = client.models.generate_content(
              model=MODEL_ID,
              contents=prompt,
              config=generation_config
              )
          json_batch_results.append(response_constrained_json)
          break # Success, exit retry loop
      except errors.ServerError as e:
        time.sleep(5) # Wait for 5 seconds
      except Exception as e:
          raise e # Re-raise other exceptions

  time.sleep(5) # Keep a small delay between batches as well

100%|██████████| 7/7 [02:56<00:00, 25.28s/it]


In [162]:
parsed_results = [result_batch.parsed for result_batch in json_batch_results]

In [163]:
for p in parsed_results:
  print(p is None)

False
False
False
False
False
False
False


In [164]:
json_batch_results[0].text

'[\n  {\n    "ID": "45108992991784550",\n    "category": "Pets & Animals"\n  },\n  {\n    "ID": "45034300968799027",\n    "category": "Music"\n  },\n  {\n    "ID": "45077420496768205",\n    "category": "News & Politics"\n  },\n  {\n    "ID": "45056046598311116",\n    "category": "Autos & Vehicles"\n  },\n  {\n    "ID": "45134816305232691",\n    "category": "Science & Technology"\n  },\n  {\n    "ID": "45072223200464896",\n    "category": "Autos & Vehicles"\n  },\n  {\n    "ID": "45073984987243724",\n    "category": "Gaming"\n  },\n  {\n    "ID": "45083134973446553",\n    "category": "Film & Animation"\n  },\n  {\n    "ID": "45101698759068467",\n    "category": "Film & Animation"\n  },\n  {\n    "ID": "45142325339580006",\n    "category": "Film & Animation"\n  },\n  {\n    "ID": "45188110355372441",\n    "category": "Comedy"\n  },\n  {\n    "ID": "45102066684672000",\n    "category": "Education"\n  },\n  {\n    "ID": "45142569757194649",\n    "category": "Education"\n  },\n  {\n    "ID"

In [165]:
for p in parsed_results:
  print(len(p))

101
101
101
100
101
101
25


In [166]:
for batch in json_request_batches:
  print(len(p))

25
25
25
25
25
25
25


Interestingly, the length of each result batch (in parsed_results) does not match exactly the length of the request batches (in json_request_batches).

We could investigate this further: it looks like the model may sometimes include classifications for tweets that weren't in the original request (hallucinations), or sometimes omit tweets from the original request.

Now let's look up each tweet ID in our results, implementing some fallback handling for tweets in the request that weren't found in the results.

In [167]:
# look up the tweet ID in the results
# as a fallback, if the tweet ID is not found, return the most common category
# note that we could return None instead, in this case, the call to classification_report would fail (it cannot handle missing data)
# replacing with the most common value is a common form of data imputation to handle missing values
def find_label(tweet_id):
  for result_batch in parsed_results:
    for result in result_batch:
      if result.ID == str(tweet_id):
        return result.category
  return 'Gaming'

In [168]:
test_predictions_gemini = []

for tweet_id in test_data["ID"].tolist():
  test_predictions_gemini.append(find_label(tweet_id))

In [169]:
len(test_predictions_gemini)

625

In [118]:
len(test_categories)

625

In [170]:
print(classification_report(test_categories, test_predictions_gemini))

                       precision    recall  f1-score   support

     Autos & Vehicles       0.92      0.96      0.94        51
               Comedy       0.73      0.84      0.78        38
            Education       0.80      0.90      0.85        41
        Entertainment       0.79      0.63      0.70        49
     Film & Animation       0.80      0.80      0.80        46
               Gaming       0.92      0.90      0.91        50
        Howto & Style       0.97      0.80      0.88        40
                Music       0.82      0.82      0.82        40
      News & Politics       0.54      0.84      0.66        37
Nonprofits & Activism       0.88      0.74      0.80        38
       Pets & Animals       0.91      0.96      0.93        45
             Religion       0.00      0.00      0.00         0
 Science & Technology       0.81      0.81      0.81        43
               Sports       0.93      0.75      0.83        53
      Travel & Events       0.92      0.89      0.91  

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


The results are much better than the Random Forest classifier, which got an F1 score of 0.62.

But something odd is happening here: I am getting an error message, "UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 in labels with no true samples" This seems to be about the label "Religion"

In [196]:
# Let's look at the true category labels from the test data
set(test_categories)

{'Autos & Vehicles',
 'Comedy',
 'Education',
 'Entertainment',
 'Film & Animation',
 'Gaming',
 'Howto & Style',
 'Music',
 'News & Politics',
 'Nonprofits & Activism',
 'Pets & Animals',
 'Science & Technology',
 'Sports',
 'Travel & Events'}

In [197]:
# Let's look at the predicted category labels from the test data
set(test_predictions_gemini)

{'Autos & Vehicles',
 'Comedy',
 'Education',
 'Entertainment',
 'Film & Animation',
 'Gaming',
 'Howto & Style',
 'Music',
 'News & Politics',
 'Nonprofits & Activism',
 'Pets & Animals',
 'Religion',
 'Science & Technology',
 'Sports',
 'Travel & Events'}

My output to the last call includes a 15th category, Religion. Gemini has made up this category! Let's print the tweet.

In [233]:
for tweet_id in test_data["ID"].tolist():
  cat = find_label(tweet_id)
  if cat == "Religion":
    print(test_data.loc[test_data.ID == tweet_id, 'Text'].iloc[0])

Join us on http://t.co/da82RRNXWA! #Christian #Single #Jesus #Bible #Prayer #Discussions #Easter #Movies #Football 


Fair enough, this tweet is about religion, or at least it uses hashtags about religion. Maybe you can prevent Gemini from inventing categories in the reponse by changing the prompt? Try and see for yourself!