# Interpreting the `TfidfLogisticRegression` Model

`TfidfLogisticRegression` has two interpretable components:

* `TfidfVectorizer`
* `LogisticRegression`

🧠 Interpreting them can help us understand the dataset better.

🧠 It can also help us spot potential issues in the dataset before we apply more complex models.

💪 Start by inspecting the `TfidfVectorizer`.

In [1]:
from typing import List

from sklearn.preprocessing import LabelEncoder

from text_classification import defs
from text_classification.data import Samples
from text_classification.models import TfidfLogisticRegression

Get source assets.

In [2]:
model: TfidfLogisticRegression = defs.load_asset_value("tfidf_logistic_regression_model")  # type: ignore

2023-02-12 23:33:25 +0000 - dagster - DEBUG - system - Loading file from: /Users/thomelane/Projects/text_classification/data/storage/tfidf_logistic_regression_model


## Interpret `TfidfVectorizer`

❓ Which tokens appear in the most documents? (The lower a token's IDF, the more documents contain it.)

In [3]:
vocab_token_to_id = model.tfidf_vectorizer.vocabulary_
vocab_id_to_token = {v: k for k, v in vocab_token_to_id.items()}

# Warning: relies on the global `model` and `vocab_id_to_token`.
def most_common_tokens(top_k: int) -> None:
    """Print the `top_k` tokens with the lowest IDF (highest document frequency)."""
    idf = model.tfidf_vectorizer.idf_
    token_ids = idf.argsort()[:top_k]  # ascending: lowest IDF first
    for token_id in token_ids:
        print(f"{idf[token_id]:.2}: {vocab_id_to_token[token_id]}")

In [4]:
most_common_tokens(10)

3.2: trump
3.5: new
3.7: photos
3.9: just
3.9: like
4.0: people
4.0: time
4.0: year
4.1: day
4.1: said


🧠 "Donald Trump" seems to be one of the most common topics in the dataset.
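🧠 The IDF values printed above can be mapped back to a rough share of documents. The following is a minimal sketch that assumes the vectorizer uses scikit-learn's default smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`. The notebook does not show how the `TfidfVectorizer` was configured, so treat the results as estimates. `approx_doc_fraction` is a hypothetical helper, not part of the project.

```python
import math


def approx_doc_fraction(idf: float) -> float:
    """Approximate share of documents that contain a token, given its IDF.

    Assumes scikit-learn's default smoothed IDF:
        idf = ln((1 + n) / (1 + df)) + 1
    which rearranges to (1 + df) / (1 + n) = exp(1 - idf).
    For a large corpus this is close to df / n.
    """
    return math.exp(1.0 - idf)


# IDF values copied from the output above (already rounded to two digits).
for token, idf in [("trump", 3.2), ("new", 3.5), ("said", 4.1)]:
    print(f"{token}: ~{approx_doc_fraction(idf):.1%} of documents")
```

Under that assumption, an IDF of 3.2 for `trump` corresponds to roughly 11% of documents, which supports reading it as a dominant topic.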

🧠 There could be a US Politics category with a lot of samples in the dataset. Maybe category G?

🧠 Given the training set, we can compute its TF-IDF matrix (one row per sample, one column per token, similar to an embedding matrix) and inspect which tokens carry the most weight in each category.

In [5]:
train_set: Samples = defs.load_asset_value("train_set")  # type: ignore

2023-02-12 23:33:25 +0000 - dagster - DEBUG - system - Loading file from: /Users/thomelane/Projects/text_classification/data/storage/train_set


In [6]:
# Warning: relies on the global `model`, `train_set` and `vocab_id_to_token`.
def highest_weighted_tokens(category: str, top_k: int = 10) -> None:
    """Print the `top_k` tokens with the largest summed TF-IDF weight in `category`."""
    samples = [s for s in train_set if s["category"] == category]
    tfidf_matrix = model.embed(samples)  # rows: samples, columns: tokens
    token_weights = tfidf_matrix.sum(axis=0)  # total weight per token
    token_ids = token_weights.argsort()[-top_k:][::-1]  # largest first
    for token_id in token_ids:
        print(vocab_id_to_token[token_id])

In [7]:
label_transform: LabelEncoder = defs.load_asset_value("label_transform")  # type: ignore

2023-02-12 23:33:25 +0000 - dagster - DEBUG - system - Loading file from: /Users/thomelane/Projects/text_classification/data/storage/label_transform


In [8]:
categories: List[str] = label_transform.label_encoder.classes_  # type: ignore
for category in categories:
    print(f"\n### Category: {category}")
    highest_weighted_tokens(category, top_k=10)


### Category: A
art
artist
new
artists
photos
world
book
imageblog
exhibition
women

### Category: B
climate
change
animal
dog
week
california
world
oil
water
energy

### Category: C
police
man
shooting
suspect
allegedly
cops
killed
shot
accused
old

### Category: D
gay
black
new
people
lgbt
queer
trans
transgender
lgbtq
community

### Category: E
wedding
divorce
marriage
weddings
married
love
couples
day
ex
divorced

### Category: F
photos
style
fashion
home
look
check
week
new
beauty
pinterest

### Category: G
trump
donald
president
clinton
gop
obama
hillary
house
says
new

### Category: H
world
korea
isis
north
war
people
government
president
iran
country

### Category: I
man
watch
woman
dog
just
cat
weird
people
police
cops

### Category: J
kids
parents
children
mom
baby
child
year
time
parenting
day


🧠 The categories map to quite distinct topics.

👍 TF-IDF seems to give sensible features for the `LogisticRegression` to use.

❓ Is the `LogisticRegression` model using these features as we would expect?

## Interpret `LogisticRegression`

In [9]:
# Warning: relies on the global `model`, `label_transform` and `vocab_id_to_token`.
def most_influential_tokens(category: str, top_k: int = 10) -> None:
    """Print the `top_k` tokens with the largest positive coefficients for `category`."""
    class_idx = label_transform.label_encoder.transform([category])[0]
    class_coefs = model.logistic_regression.coef_[class_idx]
    token_ids = class_coefs.argsort()[-top_k:][::-1]  # largest first
    for token_id in token_ids:
        print(vocab_id_to_token[token_id])

In [10]:
for category in categories:
    print(f"\n### Category: {category}")
    most_influential_tokens(category, top_k=10)


### Category: A
art
artist
artists
imageblog
arts
theatre
book
exhibition
theater
nighter

### Category: B
climate
animal
dog
oil
animals
nature
california
environmental
earth
coal

### Category: C
shooting
police
allegedly
murder
cops
accused
man
shooter
prison
arrested

### Category: D
gay
queer
black
lgbtq
lgbt
trans
transgender
latino
lesbian
latinos

### Category: E
divorce
wedding
marriage
divorced
weddings
married
ex
bride
proposal
single

### Category: F
photos
fashion
style
beauty
home
hair
makeup
photo
model
kate

### Category: G
trump
gop
obama
democrats
republicans
senate
clinton
congress
bush
republican

### Category: H
isis
korea
greece
israeli
government
india
saudi
migrants
china
iran

### Category: I
man
weird
cat
watch
ufo
cops
dog
weirdest
shark
woman

### Category: J
kids
parenting
children
mom
baby
parents
daughter
breastfeeding
babies
moms


👍 Yes, `LogisticRegression` is using the TF-IDF vectors in a sensible way.

🧠 We get a very clear view of which tokens contribute to a prediction for each category.

🧠 The view is clear enough that we can attempt to label the categories manually.

In [11]:
class_labels = {
    "A": "Art",
    "B": "Environment",
    "C": "Crime",
    "D": "Diversity",
    "E": "Relationship",
    "F": "Fashion",
    "G": "US Politics",
    "H": "Foreign Affairs",
    "I": "Bizarre",
    "J": "Parenting"
}

💡 We can give the model some test inputs to get an intuitive feel for its performance.

In [12]:
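# Warning: relies on the global `model`, `categories` and `class_labels`.
def predict(headline: str, short_description: str) -> None:
    """Print the manually assigned label of the predicted category."""
    sample = {"headline": headline, "short_description": short_description}
    y_pred = model.predict([sample])[0]  # index into `categories`
    print(class_labels[categories[y_pred]])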

In [13]:
predict(
    headline="Renowned Artist Announces New Collection Inspired by Nature",
    short_description="The latest works by this celebrated artist explore the beauty of the natural world."
)

Art


In [14]:
predict(
    headline="World Shocked as Giant Hamster Takes Over as Ruler of Tiny European Country",
    short_description="Giant hamster becomes surprise ruler of small European country, leaving the world in disbelief and sparking debates on unconventional leadership."
)

US Politics


🧠 These examples sit at the border between categories, to explore the decision boundary. The second headline has no obvious category, and the model assigns it to US Politics.
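
🧠 For a linear model, a single prediction can be broken down token by token: each token adds its TF-IDF weight times the class coefficient to that class's score, and tokens outside the vocabulary add nothing. The sketch below shows the arithmetic in plain Python. All numbers are hypothetical and do not come from the trained model, and `token_contributions` is an illustrative helper, not part of the project.

```python
def token_contributions(tfidf_weights: dict, class_coefs: dict) -> list:
    """Per-token share of one class's score: TF-IDF weight times coefficient.

    Tokens missing from `class_coefs` are treated as out of vocabulary and
    skipped, mirroring how a fitted vectorizer ignores unseen tokens.
    Returns (token, contribution) pairs, largest absolute value first.
    """
    contributions = {
        token: weight * class_coefs[token]
        for token, weight in tfidf_weights.items()
        if token in class_coefs
    }
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical numbers, made up for illustration only.
headline_weights = {"hamster": 0.6, "ruler": 0.5, "country": 0.4, "world": 0.3}
politics_coefs = {"ruler": 0.8, "country": 1.5, "world": -0.5}

for token, value in token_contributions(headline_weights, politics_coefs):
    print(f"{value:+.2f}  {token}")
```

With the real model, the inputs could presumably be built from `model.tfidf_vectorizer` and `model.logistic_regression.coef_`, but how the headline and short description are combined into one text is not shown here, so that step is left out.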

💪 Should look at failure cases. Will do that in a model evaluation report.