# Defining Artist "Eras" Through Lyrical Analysis

If you have a Taylor Swift fan in your life like I do, you've probably heard about the **Eras Tour** by now. This [record-breaking](https://www.newsweek.com/taylor-swift-eras-tour-broke-18-records-1996679#:~:text=The%20Eras%20Tour%20broke%20the,concert%20tour%20of%20all%20time.) worldwide tour captivated "Swifties" around the globe, featuring songs from every Taylor Swift album, dating all the way back to her 2006 debut.

As Taylor has grown, so has her music. Her life experiences are reflected in her lyrics, inspiring fans to define distinct musical "eras" based on the themes and emotional undertones in each of her albums. Each era tells a different story, shaped by the time in which it was created.

This leads us to some interesting questions:

- Do other artists also have recognizable eras?
- Can we identify these eras just by analyzing lyrics?
- How do eras evolve? Do they shift sharply from album to album, or more gradually over time?

In this notebook, we’ll explore those questions using AI-powered text analysis. We'll cluster songs based on lyrical similarity, characterize the themes in each cluster, and compare those groupings to traditional ways of organizing music (like albums or release dates) to see how well they align.


## Setup

Review datasets and install required packages, including the Gemini API Python SDK.

In [None]:
import numpy as np
import pandas as pd
import os

In [None]:
!pip uninstall -qqy jupyterlab kfp &>/dev/null  # Remove unused conflicting packages
!pip install -U -q "google-genai==1.7.0" "chromadb==0.6.3" &>/dev/null
!pip install scikit-learn &>/dev/null

In [None]:
from google import genai
from google.genai import types

genai.__version__

### Set up your API key

To run the following cell, your API key must be stored in a [Kaggle secret](https://www.kaggle.com/discussions/product-feedback/114053) named `GOOGLE_API_KEY`.

If you don't already have an API key, you can grab one from [AI Studio](https://aistudio.google.com/app/apikey). You can find [detailed instructions in the docs](https://ai.google.dev/gemini-api/docs/api-key).

To make the key available through Kaggle secrets, choose `Secrets` from the `Add-ons` menu and follow the instructions to add your key or enable it for this notebook.

In [None]:
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")

client = genai.Client(api_key=GOOGLE_API_KEY)

If you received an error response along the lines of `No user secrets exist for kernel id ...`, then you need to add your API key via `Add-ons`, `Secrets` **and** enable it.

![Screenshot of the checkbox to enable GOOGLE_API_KEY secret](https://storage.googleapis.com/kaggle-media/Images/5gdai_sc_3.png)

## Data

Three datasets are loaded into this notebook by default. Each one contains the lyrics for a particular artist's discography.

The artists include:

- Taylor Swift
- Bob Dylan
- Kendrick Lamar

To ensure the datasets are correctly loaded, we can check the `/kaggle/input` directory:

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

If you see filepaths for `.csv` files, you're good to go. 

Each of the datasets contains the data that we need, but they all have slightly different ways of storing it.

Below we have a code block for each artist. 
The code block creates a DataFrame using the dataset for that artist and it defines column names used in that dataset. 

You should pick one artist to move forward with in this notebook.
To use that artist, simply uncomment their corresponding code block before continuing.

By default, **Taylor Swift** is uncommented.

You may always re-run the notebook with a different artist later.

First, define the common columns that are used no matter which dataset is selected:

In [None]:
# Common columns - Leave these uncommented
cluster_col = 'cluster'
year_col = 'release_year'
embeddings_col = 'embeddings'
vibe_col = 'vibe'

### Taylor Swift

In [None]:
df=pd.read_csv('/kaggle/input/taylor-swift-released-song-discography-genius/ts_discography_released.csv')
title_col='song_title'
album_col='album_title'
lyrics_col='song_lyrics'

# Exclude special-edition albums that would result in repeated lyrics
exclude = ['1989 (Deluxe)', 'Fearless (Platinum Edition)', 'Red (Deluxe Version)', 
           'Speak Now (Deluxe)', 'The Taylor Swift Holiday Collection - EP']
df = df[~df[album_col].isin(exclude)]

# Create a release year column
df['release_date_dt'] = pd.to_datetime(df['song_release_date'])
df[year_col] = df['release_date_dt'].dt.year

### Bob Dylan

In [None]:
# df=pd.read_csv('/kaggle/input/bob-dylan-songs/clear.csv')
# title_col='title'
# album_col='album'
# lyrics_col='lyrics'

### Kendrick Lamar

In [None]:
# df=pd.read_csv('/kaggle/input/kendrick-lamar-albumslyrics-dataset/discog_data.csv')
# title_col='track_name'
# album_col='album'
# lyrics_col='lyrics'

# # Create a release year column
# df['release_date_dt'] = pd.to_datetime(df['release_date'])
# df[year_col] = df['release_date_dt'].dt.year

Sometimes these datasets will contain minor releases that only contain a few songs.

To get the best results possible, remove these songs from the dataframe before starting.

In [None]:
# Filter out any albums with fewer than 5 songs
album_counts = df[album_col].value_counts()
valid_albums = album_counts[album_counts >= 5].index
df = df[df[album_col].isin(valid_albums)]

Now, let's look at the data we're given. Feel free to explore the dataframe more before continuing!

In [None]:
df.head()

## Create Embeddings

**Embeddings** are numerical representations of data that are designed to capture the underlying meaning of that data.

Below, we'll create a function to generate embeddings for our song lyrics using the Gemini API.

**Note**: Not all embeddings are created equal. There are specific types of embeddings for different use cases. Notice that we use the `clustering` type in our example because we want to identify groups of lyrics that are semantically similar.

For more information on embeddings and their types, see [here](https://ai.google.dev/gemini-api/docs/embeddings).
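
To make "capturing meaning" a little more concrete, here is a minimal sketch that compares toy vectors using cosine similarity, a common way to measure how closely two embeddings point in the same direction. The three-dimensional vectors and their names are hypothetical, written by hand for illustration; real embeddings come from the model and have far more dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" (made up for this example)
heartbreak_song = [0.9, 0.1, 0.2]
breakup_ballad = [0.8, 0.2, 0.3]
party_anthem = [0.1, 0.9, 0.4]

sim_close = cosine_similarity(heartbreak_song, breakup_ballad)
sim_far = cosine_similarity(heartbreak_song, party_anthem)
print(f"heartbreak vs. breakup: {sim_close:.3f}")
print(f"heartbreak vs. party:   {sim_far:.3f}")
```

The two vectors that were made up to be "about the same thing" score much closer to 1.0 than the unrelated pair, which is the property that clustering relies on.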

In [None]:
from google.api_core import retry
import tqdm
from tqdm.rich import tqdm as tqdmr
import warnings

tqdmr.pandas()

# Filter the experimental warning
warnings.filterwarnings("ignore", category=tqdm.TqdmExperimentalWarning)

# Define a helper to retry when per-minute quota is reached.
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

@retry.Retry(predicate=is_retriable, timeout=300.0)
def embed_lyrics(lyrics: list[str]) -> list[list[float]]:
    # Helper to chunk the lyrics (Max 100 per request)
    def chunks(lst, n):
        for i in range(0, len(lst), n):
            yield lst[i:i + n]

    all_embeddings = []
    for batch in chunks(lyrics, 100):
        response = client.models.embed_content(
            model="models/text-embedding-004",
            contents=batch,
            config=types.EmbedContentConfig(
                task_type="clustering",
            ),
        )

        # Extract the embedding vector from the responses
        batch_embeddings = [embedding.values for embedding in response.embeddings]

        # Add to the total
        all_embeddings.extend(batch_embeddings)

    return all_embeddings
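
To see how the batching helper inside `embed_lyrics` behaves, here is a small standalone sketch. It copies the `chunks` generator from the cell above and runs it on a hypothetical list of 250 placeholder strings, so it needs no API key.

```python
def chunks(lst, n):
    """Yield successive slices of at most n items from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

# Hypothetical stand-in for the lyrics of 250 songs
fake_lyrics = [f"lyrics {i}" for i in range(250)]

# Each batch holds at most 100 items; the last one holds the remainder
batch_sizes = [len(batch) for batch in chunks(fake_lyrics, 100)]
print(batch_sizes)
```

Every song ends up in exactly one batch, and only the final batch can be smaller than the limit.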

Let's create the embeddings and see what they actually look like! 

**Note**: It's normal for this step to take a couple of seconds.

In [None]:
# Create the embeddings and add to our DataFrame
embeddings = embed_lyrics(df[lyrics_col].tolist())
df[embeddings_col] = embeddings

# View the embeddings
df.loc[:,[title_col, album_col, embeddings_col]].head()

We can see that each embedding is a vector of numbers. We can't tell much from the numbers alone, so it would be helpful if we could visualize this data.

Let's see how many dimensions each of these vectors contains.

In [None]:
# Use .iloc so this works even if the row labeled 0 was filtered out earlier
len(df[embeddings_col].iloc[0])

Alright, too many for us to visualize.

Thankfully, we can use **Dimensionality Reduction** to reduce the vectors to 2 dimensions.

## Dimensionality Reduction

To perform dimensionality reduction, we'll be using something called [**T-distributed Stochastic Neighbor Embedding**](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) (TSNE).

TSNE is going to preserve the relationship between our data points as much as possible while reducing their dimensionality. 

In [None]:
from sklearn.manifold import TSNE

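# Convert the embeddings column to a 2-D NumPy array (one row per song)
embeddings_list = np.array(df[embeddings_col].to_list(), dtype=np.float32)

# Create the TSNE model, which projects the data to 2 dimensions by default.
# The iteration count is left at its default (1000) because the `n_iter`
# argument was renamed to `max_iter` in newer scikit-learn releases.
tsne = TSNE(random_state=0)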
tsne_results = tsne.fit_transform(embeddings_list)

# Add TSNE results to the DataFrame
df['TSNE1'] = tsne_results[:, 0]
df['TSNE2'] = tsne_results[:, 1]

df[[title_col, 'TSNE1', 'TSNE2']].head()

Great! Now each embedding is represented using two dimensions (TSNE1, TSNE2). 

From here, we can plot these points on a graph to actually visualize the embeddings we've created.

We'll color the points by album for now to see if we can spot any trends.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the 2-dimensional data points using Seaborn
fig, ax = plt.subplots(figsize=(8,6))
sns.set_style('darkgrid', {"grid.color": ".6", "grid.linestyle": ":"})
sns.scatterplot(data=df, x='TSNE1', y='TSNE2', hue=album_col, palette='hls')
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
plt.title('Song Lyric Embeddings by Album');
plt.xlabel('TSNE1');
plt.ylabel('TSNE2');
plt.axis('equal')

See if you can draw any conclusions from the graph. Are there any clusters of dots with the same color?

It's alright if you don't see any meaningful trends. This may give us a hint that the artist we're looking
at uses similar lyrics for each of their albums. 

In the next step, we'll attempt to define our own groups of similar data points to see if they could be
worthy of an "era".

## Clustering Embeddings

Recall that when we initially created these embeddings the embedding type we chose was `clustering`. 

That means that we should be able to cluster our data points together based on semantic similarity.

We will use [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) to do this. 

KMeans is an algorithm that will calculate "clusters" from our embeddings. These "clusters" represent semantic groupings of the original data we embedded.

In [None]:
from sklearn.cluster import KMeans

# The number of clusters to create
# This value is great for experimentation:
#   - a low value will provide more concrete and distinct groupings
#   - a high value will provide more nuanced groupings that may have some overlap
num_clusters=3

# Call KMeans on the full-dimensional embeddings (not the 2-D TSNE projection).
# fit_predict fits the model and returns the labels in a single pass.
kmeans_model = KMeans(n_clusters=num_clusters, random_state=1, n_init='auto')
labels = kmeans_model.fit_predict(embeddings_list)

# Propagate the results to the cluster column in our DataFrame
df[cluster_col] = labels
df.head()
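
For intuition about what KMeans does, the following is a minimal pure-Python sketch of its two alternating steps (assign each point to the nearest centroid, then move each centroid to the mean of its members). This is not scikit-learn's implementation: the `kmeans` function, the naive "first k points" initialization, and the toy 2-D points are all assumptions made for illustration, whereas scikit-learn uses a smarter `k-means++` initialization by default.

```python
import math

def kmeans(points, k, iterations=10):
    """Tiny k-means: returns (labels, centroids) for a list of 2-D points."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical points forming two obvious groups, near (0, 0) and near (5, 5)
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.3), (4.9, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The same idea applies to our lyric embeddings, only with many more dimensions than two.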

In [None]:
# Plot the Clusters using Seaborn
fig, ax = plt.subplots(figsize=(8,6)) # Set figsize
sns.set_style('darkgrid', {"grid.color": ".6", "grid.linestyle": ":"})
sns.scatterplot(data=df, x='TSNE1', y='TSNE2', hue=cluster_col, palette='magma')
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
plt.title('Song Lyric Embeddings by Cluster');
plt.xlabel('TSNE1');
plt.ylabel('TSNE2');
plt.axis('equal')

We should be seeing more visible groupings now thanks to KMeans.

So we've verified that we can create clusters of semantically similar songs.

However, we still don't know what these clusters really represent, or whether they can constitute an "era" in the artist's discography.

We'll attempt to define them more concretely in the next section.

## Defining the Clusters using AI

Unless you're a diehard fan, you're not going to be able to look at all the songs in a particular cluster and identify what they have in common.

You would have to read through and analyze all of the lyrics in a particular cluster in a tedious and painstaking process.

Thankfully, we can have an AI model do this for us.

Below, we create a [few-shot prompt](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/prompts/few-shot-examples) to have an AI model read the lyrics and identify what we're looking for.

In [None]:
few_shot_prompt = """
You are an expert on music and lyrics who specializes in identifying commonalities between
groups of song lyrics. 

These commonalities can range from concrete elements like word choice, tone, and frequency to
more abstract ideas such as overarching themes, emotional undertones, or the general “feel” of the lyrics.

You will be provided with a group of song lyrics that were clustered together using an AI model based on their embeddings.

Your task is to:
1) Read the lyrics thoroughly.
2) Infer and explain what characteristics these lyrics share that may have contributed to their grouping.

The lyrics for each song will be separated by a short line of dashes: "---"

The lyrics for any given song may contain other line breaks. These line breaks do not indicate a new
song has started. Only look for the line of 3 dashes.

Your response should be returned as a valid JSON containing 3 fields:

description: A thoughtful and concise summary of the shared characteristics of the lyrics and why
             you think that they were originally grouped together.
vibe: A word or short phrase that captures the overall mood, theme, or vibe of the lyrics.
      Be creative and descriptive with this field.
line: A line or short group of lines from the lyrics that exemplifies the overall vibe that you
      determined for the group of lyrics

EXAMPLE:
Cluster Lyrics:
Little darling, the smile’s returning to their faces.
Here comes the sun, and I say, it’s all right.

---

But you gotta keep your head up, oh oh
And you can let your hair down, eh eh

---

The dog days are over
The dog days are done

---

Yeah, there's always been a rainbow
Hangin' over your head

JSON RESPONSE:
{
"description": "The lyrics share a common emotional theme of the transition from struggle or darkness into relief, light, or freedom. They're not just hopeful — they're reaffirming, as if acknowledging that something difficult has been endured and now there is a release, a breath, a return of joy or peace.",
"vibe": "Optimistic and Hopeful",
"line": "you gotta keep your head up"
}

EXAMPLE:
Cluster Lyrics: 
I hurt myself today
To see if I still feel

---

Hello darkness, my old friend
I've come to talk with you again

---

I’m just a little bit caught in the middle
Life is a maze and love is a riddle

---

Sometimes I wish it would rain
So no one could see me cry

JSON RESPONSE:
{
  "description": "These lyrics reflect introspection, emotional vulnerability, and a deep sense of isolation or sadness. The themes often revolve around pain, confusion, and quiet suffering, conveyed through stark and honest expressions of emotional struggle.",
  "vibe": "Melancholic and Reflective",
  "line": "Hello darkness, my old friend"
}

ACTUAL:
Cluster Lyrics:
"""

You might notice that we ask for our response in `json` format. We do this to make it easier to process and re-use the data from the response later in the notebook.

We can further enforce this output structure by defining a `TypedDict` schema in Python and passing it to the model when we generate our response, as seen below.

In [None]:
import typing_extensions as typing
from IPython.display import display, Markdown  # Used by print_cluster_summary below

# The object used to parse the model output
class ClusterSummary(typing.TypedDict):
    description: str
    vibe: str
    line: str

# Prints a ClusterSummary
def print_cluster_summary(summary: ClusterSummary) -> None:
    # Note: The "number" field will be added after the response is generated
    md_output = f"""
**Cluster**: {summary["number"]}

- **Description**: {summary['description']}
- **Vibe**: {summary['vibe']}
- **Line**: {summary['line']}
"""
    display(Markdown(md_output))
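
As a rough illustration of why a fixed schema is convenient, the sketch below parses a JSON string and checks that the three expected fields are present before the data is used. The `parse_cluster_summary` helper and the sample response are hypothetical and written by hand; in the notebook itself the SDK handles parsing for us.

```python
import json

REQUIRED_FIELDS = ("description", "vibe", "line")

def parse_cluster_summary(raw_json: str) -> dict:
    """Parse a JSON string and check that it has the three expected string fields."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field in REQUIRED_FIELDS:
        if not isinstance(data.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    return data

# Hypothetical model output, written by hand for illustration
sample_response = (
    '{"description": "Songs about leaving home.", '
    '"vibe": "Wistful", "line": "I left at dawn"}'
)
summary_example = parse_cluster_summary(sample_response)
print(summary_example["vibe"])
```

Because every response has the same keys, later cells can read `summary["vibe"]` or `summary["description"]` without special-casing.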

Now, create the function that calls the model using the prompt and lyrics from the cluster.

In [None]:
import random, json, enum
from IPython.display import display, Markdown

# Summarizes a cluster into a ClusterSummary
def summarize_cluster(df, cluster_num, max_lyrics=40):
    lyrics_list = df[df[cluster_col] == cluster_num][lyrics_col].tolist()
    cluster_lyrics = "\n\n---\n\n".join(random.sample(lyrics_list, min(len(lyrics_list), max_lyrics)))

    structured_output_config = types.GenerateContentConfig(
        response_mime_type="application/json",  # Generate JSON output
        response_schema=ClusterSummary,         # Use our defined schema
    )

    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[few_shot_prompt, cluster_lyrics],
        config=structured_output_config,
    )
    
    return response.parsed
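
The sampling and joining step inside `summarize_cluster` can be tried on its own. The sketch below mirrors that logic with a hypothetical `build_cluster_text` helper and toy lyrics; the optional `seed` argument is an addition for reproducibility and is not part of the notebook's function.

```python
import random

SEPARATOR = "\n\n---\n\n"

def build_cluster_text(lyrics_list, max_lyrics, seed=None):
    """Sample at most max_lyrics songs and join them with the dash separator."""
    rng = random.Random(seed)
    sample = rng.sample(lyrics_list, min(len(lyrics_list), max_lyrics))
    return SEPARATOR.join(sample)

# Toy lyrics; the first one contains an ordinary line break inside a song
toy_lyrics = ["first song\nline two", "second song", "third song", "fourth song"]
cluster_text = build_cluster_text(toy_lyrics, max_lyrics=3, seed=0)

# Splitting on the separator recovers the individual songs
songs_back = cluster_text.split(SEPARATOR)
print(len(songs_back))
```

Capping the sample keeps the prompt size bounded for large clusters, at the cost of the summary being based on a random subset, so repeated runs may describe the same cluster slightly differently.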

Call the function on each of our clusters.

In [None]:
# Save and print the cluster summaries
cluster_summaries = []
for cluster_num in range(num_clusters):
    # Summarize the cluster
    summary = summarize_cluster(df, cluster_num)
    summary["number"] = str(cluster_num) # Add the cluster number to the ClusterSummary
    
    # Print the ClusterSummary
    print_cluster_summary(summary)

    # Add to the list for future use
    cluster_summaries.append(summary)

We have now defined each of our clusters! 

Depending on the artist you selected and the number of clusters you created, your results in these descriptions may vary. You're encouraged to try different combinations of parameters to achieve the best result possible!

## Visualizing Cluster Relationships

Now that we have defined the clusters we can start to explore the relationships these clusters have with other parts of the data.

More specifically, we'll be looking at how clusters relate to albums and how cluster usage tends to change over time.

### Album-Cluster Relationship

In this section, we'll explore how an artist's albums relate to their lyrical clusters.

We'll aim to answer questions like:

- Do albums tend to contain songs from the same cluster, or are they more evenly spread across multiple clusters?
- Alternatively, do clusters tend to correspond to a specific album, or are they more mixed?
- Are there any broader patterns we can identify, such as a group of albums sharing a similar cluster makeup?

By analyzing these relationships, we can begin to understand whether artists group similar lyrical themes within albums, if those themes evolve more gradually over time, or if those themes have no meaningful change through the discography at all.


Define functions to visualize the album-cluster relationship:

In [None]:
# Add the "vibe" of each cluster as a column - Used as an axis when visualizing
df[vibe_col] = [cluster_summaries[i]["vibe"] for i in df[cluster_col]]

# Function to plot the composition of an album or cluster using a pie chart
def composition_pie_chart(wholeCol, piecesCol):
    items = df[wholeCol].unique()
    
    for item in items:
        item_df = df[df[wholeCol] == item]
        counts = item_df[piecesCol].value_counts().sort_index()

        plt.figure(figsize=(5, 5))
        plt.pie(counts, labels=counts.index, autopct="%1.1f%%", startangle=140)
        plt.title(f"{item}")
        plt.axis("equal")
        plt.show()


# Function to plot the shared relationship using a pivot table
def composition_pivot_table():
    pivot = df.pivot_table(index=album_col, columns=vibe_col, aggfunc="size", fill_value=0)
    
    plt.figure(figsize=(10, 6))
    sns.heatmap(pivot, annot=True, fmt="d", cmap="Blues")
    plt.title("Number of Songs per Album per Cluster")
    plt.xlabel("Cluster Vibe")
    plt.ylabel("Album")
    plt.show()

Call the functions we defined above to visualize cluster relationships.

Uncomment lines to see different visualizations.

In [None]:
# Try them all!

# Which clusters make up an album?
# composition_pie_chart(album_col, vibe_col) 

# Which albums make up a cluster?
# composition_pie_chart(vibe_col, album_col)

# Combination
composition_pivot_table()

### Time-Cluster Relationship

In this section, we'll explore how an artist's usage of lyrics belonging to a particular cluster changes over the course of their career. 

We'll aim to answer questions like:

- Does the artist have distinct "eras" in which they shift their lyrical themes noticeably over time?
- How quickly do these lyrical shifts take place? Are the changes sudden or more gradual?

In [None]:
# Count songs per cluster per year
counts = df.groupby([year_col, vibe_col]).size().reset_index(name='count')

# Pivot to have years as rows and clusters as columns
pivot = counts.pivot(index=year_col, columns=vibe_col, values='count').fillna(0)

# Normalize to get proportions per year
proportions = pivot.div(pivot.sum(axis=1), axis=0)

# Plot
proportions.plot.area(colormap='tab10', figsize=(10, 6))
plt.title("Proportion of Clusters in Songs Over Time")
plt.ylabel("Proportion of Songs")
plt.xlabel("Year")
plt.legend(title="Cluster", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
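
The normalization step above (dividing each year's counts by that year's total) can be checked by hand on a tiny example. This sketch uses plain Python and hypothetical `(year, vibe)` pairs rather than the notebook's DataFrame.

```python
from collections import Counter, defaultdict

# Hypothetical (year, vibe) pairs, one per song
songs = [
    (2006, "Wistful"), (2006, "Wistful"), (2006, "Defiant"),
    (2008, "Defiant"), (2008, "Defiant"), (2008, "Wistful"), (2008, "Defiant"),
]

# Count songs per vibe within each year
counts_by_year = defaultdict(Counter)
for year, vibe in songs:
    counts_by_year[year][vibe] += 1

# Divide each count by the year's total so every year sums to 1
proportions_by_year = {
    year: {vibe: n / sum(counter.values()) for vibe, n in counter.items()}
    for year, counter in counts_by_year.items()
}
print(proportions_by_year[2008])
```

Using proportions rather than raw counts keeps years with many releases from visually dominating years with few.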

## Evaluating Cluster Definitions

We've been using these cluster summaries to find relationships and draw conclusions about trends in lyrics.

Now, we'll evaluate these summaries to see how effectively they capture the theme of the cluster.

To do this, we'll create new embeddings. However, this time we'll:

- Create `retrieval_query` embeddings for lyrics
- Create `retrieval_document` embeddings for cluster summaries

We'll store the `retrieval_document` embeddings in a [Chroma](https://www.trychroma.com/) vector database. We can then query the database using our `retrieval_query` embeddings. 

This means that for any given query, the vector database will return the most semantically similar document that it contains.

In our case, we'll be querying using the lyrics from a particular song, and the database will return the most appropriate cluster summary for those lyrics.

We can compare the cluster summary that is returned from the database to the cluster summary that the lyrics were first assigned. This will give us an idea of how effective these summaries are at capturing the overall sentiment of the lyrics for that particular cluster.
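
Conceptually, the lookup described here is a nearest-neighbor search. The sketch below ranks a few documents by similarity to a query vector. The 2-D vectors, the `rank_documents` helper, and the choice of cosine similarity are all assumptions for illustration; the vector database may use a different distance metric internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_documents(query_vec, doc_vecs):
    """Return document ids ordered from most to least similar to the query."""
    return sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )

# Hypothetical 2-D embeddings for three cluster descriptions, keyed by cluster number
description_vectors = {"0": [1.0, 0.0], "1": [0.0, 1.0], "2": [0.7, 0.7]}

# Hypothetical embedding for one song's lyrics
song_vector = [0.9, 0.3]

ranking = rank_documents(song_vector, description_vectors)
print(ranking)
```

The first id in the ranking plays the role of the "most appropriate cluster summary" for the song.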


Create a new function for generating embeddings. Unlike our earlier embedding function, this one is written as a class so that it supports Chroma's [`EmbeddingFunction` protocol](https://docs.trychroma.com/docs/embeddings/embedding-functions#custom-embedding-functions).

In [None]:
from chromadb import Documents, EmbeddingFunction, Embeddings

class GeminiEmbeddingFunction(EmbeddingFunction):
    # Specify whether to generate embeddings for documents or queries
    document_mode = True

    @retry.Retry(predicate=is_retriable, timeout=300.0)
    def __call__(self, input: Documents) -> Embeddings:
        if self.document_mode:
            embedding_task = "retrieval_document"
        else:
            embedding_task = "retrieval_query"

        response = client.models.embed_content(
            model="models/text-embedding-004",
            contents=input,
            config=types.EmbedContentConfig(
                task_type=embedding_task,
            ),
        )
        return [e.values for e in response.embeddings]

Use the function to create document embeddings for our cluster summary descriptions.

In [None]:
import chromadb 

embed_fn = GeminiEmbeddingFunction()

# Use document embeddings for cluster descriptions
embed_fn.document_mode = True 

# Collect the summary descriptions as our documents
documents = []
for summary in cluster_summaries:
    documents.append(summary["description"])

chroma_client = chromadb.Client()
db = chroma_client.get_or_create_collection(name="cluster_descriptions", embedding_function=embed_fn)

# Add the descriptions to the database
db.add(documents=documents, ids=[str(i) for i in range(len(documents))])

Alright, our cluster summaries are in the database. Let's try out some queries!

Change the `query` variable below with your own text, maybe some sample lyrics. Run the cell and see which cluster summary is most closely associated with that text. Is it what you expected?

In [None]:
query = "Somebody once told me / The world is gonna roll me..."

result = db.query(query_texts=[query], n_results=1)
print(result["documents"][0])

Now we'll run queries like this on each of our lyrics to see which cluster is returned. 

We'd like to keep track of the cluster __number__ that is returned as well, so create a map to access this easily.

In [None]:
# Create a map from description to cluster number
desc_map = {}
for summary in cluster_summaries:
    desc_map[summary["description"]] = summary["number"]

Create and run the function to query the database for each set of lyrics.

**Note**: Since we're creating the query embeddings serially, this step may take some time. If the progress is stuck, please wait a few minutes, as we may have hit our request quota and need to cool down before trying again.
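
To illustrate the "cool down and try again" behavior, here is a simplified sketch of retrying with exponential backoff. It is not the implementation of the `retry.Retry` decorator used in this notebook: `QuotaError`, `call_with_backoff`, and the delay values are hypothetical, and the sleep function is injected so the example runs instantly.

```python
import time

class QuotaError(Exception):
    """Hypothetical stand-in for a 429 'quota exceeded' API error."""

def call_with_backoff(fn, max_attempts=5, initial_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponentially growing waits on QuotaError."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except QuotaError:
            if attempt == max_attempts:
                raise  # Give up after the final attempt
            sleep(delay)
            delay *= 2  # Wait twice as long before the next attempt

# Simulate a call that fails twice before succeeding
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise QuotaError("quota exceeded")
    return "ok"

waits = []
result = call_with_backoff(flaky_call, sleep=waits.append)  # Record waits instead of sleeping
print(result, waits)
```

This is why a stalled progress bar is usually not a crash: the call is waiting out the quota window before trying again.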

In [None]:
doc_cluster_col = "document_cluster"

# Use query embeddings for lyrics
embed_fn.document_mode = False

# Returns a list of cluster numbers (as strings) ordered from most to least semantically similar
@retry.Retry(predicate=is_retriable, timeout=300.0)
def get_db_clusters(lyrics: str) -> list[str]:
    # Query the db using the lyrics - Return num_clusters results to see the full similarity order
    result = db.query(query_texts=[lyrics], n_results=num_clusters)

    # Use the desc_map to find the cluster number for the returned description
    return [desc_map[doc] for doc in result["documents"][0]]

df[doc_cluster_col] = df[lyrics_col].progress_apply(get_db_clusters)

We’ve now added a column that contains a list of cluster numbers ordered by decreasing semantic similarity to the song's lyrics. 

The most similar cluster based on description appears first in the list, the second most similar appears second, and so on.

Let's look at this column:

In [None]:
df.loc[:,[cluster_col, doc_cluster_col]].head()

Next, we’ll create a new column that measures how well the cluster summaries reflect the original clustering. 

Specifically, we’ll calculate the index of the original cluster within the ranked list of matches:

In [None]:
df['accuracy_index'] = df.apply(lambda row: row[doc_cluster_col].index(str(row[cluster_col])), axis=1)

This index tells us how close the original cluster summary was to being the best semantic match for the song. 

An `accuracy_index` of 0 means the original cluster summary was the top match, indicating that the summary is accurate. Higher values suggest the original summary didn’t align as closely with the lyrics.
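
Here is a small worked example of the index calculation, using hypothetical cluster numbers and ranked lists in place of the DataFrame columns. It also computes the share of songs whose original cluster was ranked first, which is one simple way to summarize the result.

```python
# Hypothetical data: each song's original cluster and its ranked list of matches
original_clusters = ["0", "1", "2", "1"]
ranked_matches = [
    ["0", "2", "1"],  # original cluster ranked first  -> index 0
    ["0", "1", "2"],  # original cluster ranked second -> index 1
    ["2", "0", "1"],  # original cluster ranked first  -> index 0
    ["0", "2", "1"],  # original cluster ranked last   -> index 2
]

# Position of the original cluster within each ranked list
accuracy_indices = [
    ranked.index(original)
    for original, ranked in zip(original_clusters, ranked_matches)
]

# Share of songs whose original cluster was the top match
top1_rate = accuracy_indices.count(0) / len(accuracy_indices)
print(accuracy_indices, top1_rate)
```

As a baseline for comparison, a purely random ordering of k summaries would place the original cluster first about 1/k of the time.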

Now let’s visualize the distribution of these indices to get a sense of how well our summaries performed:

In [None]:
# Ignore warnings
warnings.filterwarnings("ignore", category=FutureWarning, message=".*use_inf_as_na.*")

sns.histplot(df['accuracy_index'], bins=range(df['accuracy_index'].max() + 2), discrete=True)
plt.xlabel('Index of Cluster Number in Ranked List')
plt.ylabel('Count')
plt.title('Distribution of Cluster Indices')
plt.show()

## Conclusion

In this notebook, we attempted to identify whether artists have distinctive creative "eras" with their lyrics, similar to what Taylor Swift fans have seen with her discography.

We clustered song lyrics into semantically similar groupings and identified the general sentiment of the lyrics in each group. 

We explored how these clusters related to albums and release years and we even performed an evaluation on the cluster summaries to see how effectively they conveyed the message of the lyrics. 

We leveraged several AI capabilities in order to accomplish this, including:

- Embeddings (`clustering`, `retrieval_query`, and `retrieval_document`)
- Few-shot Prompting
- Structured output with JSON
- Vector databases and querying

So, back to our original question:

**Could every artist have an Eras Tour of their own?**

The results seem to indicate that while artists tend to have a handful of common themes in their lyrics, they're not always as distinct and well-defined as we may believe. 

Clusters generally varied across albums and through different time periods, with a few exceptions that are worth noting. In many cases, even after "leaving" a particular era, an artist would typically find their way back to those lyrical themes at some point. 

It's important to note that our analysis has its limitations:

- A small sample size of only a couple artists (and only one at a time)
- Analysis was done using lyrics only, which cannot capture the entire essence of a song
- AI models still have room for improvement, especially with potentially abstract inputs like song lyrics

We can draw certain conclusions from the data we have, but ultimately these limitations prevent us from making any broad claims about musicians and their lyrics in general.

In the future, we could address these limitations to make this analysis more robust. We could even extend the project to include things like:

- Comparison of eras between artists
- Contextualization of eras by including information about the musical and societal landscape at the time of release
- Comparison of AI-defined eras with fan-defined eras

Thanks for reading!

## Citations

Many of the code snippets in this notebook were taken or derived from those found in the [Google 5-Day Gen AI Intensive Course CodeLabs](https://www.kaggle.com/learn-guide/5-day-genai).

Some code snippets were also taken from the Gemini documentation and [tutorials](https://github.com/google/generative-ai-docs/tree/main/site/en/gemini-api/tutorials).