<a href="https://colab.research.google.com/github/sj7272/DataFire/blob/master/tutorials/image_classification_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Active Learning for a Drifting Image Classification Model</h1>

Imagine you're in charge of maintaining a model that classifies the actions of people in photographs. Your model initially performs well in production, but its performance gradually degrades over time.

Phoenix helps you surface the reason for this regression by analyzing the embeddings representing each image. Your model was trained on crisp and high-resolution images, but as you'll discover, it's encountering blurred and noisy images in production that it can't correctly classify.

In this tutorial, you will:

- Download curated datasets of embeddings and predictions
- Define a schema to describe the format of your data
- Launch Phoenix to visually explore your embeddings
- Investigate problematic clusters
- Export problematic production data for labeling and fine-tuning

Let's get started!

## Install Dependencies and Import Libraries

Install Phoenix.

In [1]:
!pip install arize-phoenix


Import libraries.

In [2]:
import pandas as pd
import phoenix as px
from IPython.display import HTML, display

## Download and Inspect the Data

Download production and training image data containing photographs of people performing various actions (sleeping, eating, running, etc.).

In [3]:
train_df = pd.read_parquet(
    "http://storage.googleapis.com/arize-phoenix-assets/datasets/unstructured/cv/human-actions/human_actions_training.parquet"
)
prod_df = pd.read_parquet(
    "http://storage.googleapis.com/arize-phoenix-assets/datasets/unstructured/cv/human-actions/human_actions_production.parquet"
)

View a few training data points.

In [4]:
train_df.head()

|   | prediction_id | prediction_ts | url | image_vector | actual_action | predicted_action |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 595d87df-5d50-4d60-bc5f-3ad1cc483190 | 1655757000.0 | https://storage.googleapis.com/arize-assets/fi... | [0.26720312, 0.02652928, 0.0, 0.028591828, 0.0... | drinking | drinking |
| 1 | 37596b85-c007-4e4f-901d-b87e5297d4b8 | 1655757000.0 | https://storage.googleapis.com/arize-assets/fi... | [0.08745878, 0.0, 0.16057675, 0.036570743, 0.0... | fighting | fighting |
| 2 | b048d389-539a-4ffb-be61-2f4daa52e700 | 1655757000.0 | https://storage.googleapis.com/arize-assets/fi... | [0.9822482, 0.0, 0.037284207, 0.017358225, 0.2... | clapping | clapping |
| 3 | 3e00c023-49b4-49c2-9922-7ecbf1349c04 | 1655757000.0 | https://storage.googleapis.com/arize-assets/fi... | [0.028404092, 0.063946, 1.0448836, 0.65191674,... | fighting | fighting |
| 4 | fb38b050-fb12-43af-b27d-629653b5df86 | 1655758000.0 | https://storage.googleapis.com/arize-assets/fi... | [0.06121698, 0.5172761, 0.50730985, 0.5771937,... | sitting | sitting |


The columns of the dataframe are:
- **prediction_id:** a unique identifier for each data point
- **prediction_ts:** the Unix timestamps of your predictions
- **url:** a link to the image data
- **image_vector:** the embedding vectors representing each image
- **actual_action:** the ground truth for each image
- **predicted_action:** the predicted class for the image
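
As a quick sanity check on these columns, the sketch below converts the Unix timestamps into readable datetimes and confirms that every embedding has the same length. It runs on a tiny hand-made dataframe (`toy_df` is a stand-in, not the tutorial data), and the helper name `summarize_columns` is hypothetical. The same call should work on `train_df`, assuming its columns match the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_df that reuses the same column names.
toy_df = pd.DataFrame(
    {
        "prediction_ts": [1655757000.0, 1655757000.0, 1655758000.0],
        "image_vector": [
            np.array([0.1, 0.0, 0.3]),
            np.array([0.9, 0.2, 0.0]),
            np.array([0.0, 0.5, 0.5]),
        ],
        "actual_action": ["drinking", "fighting", "sitting"],
        "predicted_action": ["drinking", "fighting", "sitting"],
    }
)


def summarize_columns(df):
    """Return the time range, embedding dimensions, and accuracy (if actuals exist)."""
    # Interpret the float timestamps as seconds since the Unix epoch.
    timestamps = pd.to_datetime(df["prediction_ts"], unit="s", utc=True)
    # Collect the distinct vector lengths; a single value means consistent embeddings.
    dims = sorted({len(vec) for vec in df["image_vector"]})
    summary = {
        "start": timestamps.min(),
        "end": timestamps.max(),
        "embedding_dims": dims,
    }
    # Production data has no ground truth, so accuracy is optional.
    if "actual_action" in df.columns:
        matches = df["actual_action"] == df["predicted_action"]
        summary["accuracy"] = float(matches.mean())
    return summary


summary = summarize_columns(toy_df)
print(summary)
```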

View a few production data points.

In [5]:
prod_df.head()

|   | prediction_id | prediction_ts | url | image_vector | predicted_action |
| --- | --- | --- | --- | --- | --- |
| 0 | 8fa8d06a-3dba-46c4-b134-74b7f3eb479b | 1657053000.0 | https://storage.googleapis.com/arize-assets/fi... | [0.38830394, 0.13084425, 0.026343096, 0.426129... | hugging |
| 1 | 80138725-1dbd-46cf-9754-5de495b2d5fc | 1657053000.0 | https://storage.googleapis.com/arize-assets/fi... | [0.38679752, 0.33045158, 0.032496776, 0.001283... | laughing |
| 2 | 0d2d4bb7-ff80-46c5-8134-e5191ad56c73 | 1657053000.0 | https://storage.googleapis.com/arize-assets/fi... | [0.041905474, 0.057079148, 0.0, 0.24986057, 0.... | drinking |
| 3 | 050fe2b2-bb72-4092-8294-cff9f8d07d10 | 1657053000.0 | https://storage.googleapis.com/arize-assets/fi... | [0.14649533, 0.18736616, 0.043569583, 1.226385... | sleeping |
| 4 | ada433c5-2251-49d3-9cd7-33718f814034 | 1657053000.0 | https://storage.googleapis.com/arize-assets/fi... | [0.7338474, 0.09456189, 0.83416396, 0.09127828... | fighting |


Notice that the production data is missing ground truth, i.e., has no "actual_action" column.
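
You can also confirm the difference programmatically rather than by eye. The sketch below compares column names on two empty stand-in dataframes; `toy_train`, `toy_prod`, and `missing_columns` are hypothetical names, and on the real data you would pass `train_df` and `prod_df` instead.

```python
import pandas as pd

# Hypothetical stand-ins; only the column names matter for this check.
toy_train = pd.DataFrame(
    columns=[
        "prediction_id",
        "prediction_ts",
        "url",
        "image_vector",
        "actual_action",
        "predicted_action",
    ]
)
toy_prod = pd.DataFrame(
    columns=["prediction_id", "prediction_ts", "url", "image_vector", "predicted_action"]
)


def missing_columns(reference_df, primary_df):
    """List columns present in reference_df but absent from primary_df, in order."""
    return [col for col in reference_df.columns if col not in primary_df.columns]


missing = missing_columns(toy_train, toy_prod)
print(missing)  # prints ['actual_action']
```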

Display a few images alongside their predicted and actual labels.

In [6]:
def display_examples(df):
    """
    Displays each image alongside the actual and predicted classes.
    """
    sample_df = df.reindex(columns=["actual_action", "predicted_action", "url"]).rename(
        columns={"url": "image"}
    )
    html = sample_df.to_html(
        escape=False, index=False, formatters={"image": lambda url: f'<img src="{url}">'}
    )
    display(HTML(html))


display_examples(train_df.head())

| actual_action | predicted_action | image |
| --- | --- | --- |
| drinking | drinking | |
| fighting | fighting | |
| clapping | clapping | |
| fighting | fighting | |
| sitting | sitting | |


## Launch Phoenix

Define a schema to tell Phoenix what the columns of your training dataframe represent (features, predictions, actuals, tags, embeddings, etc.). See the [docs](https://docs.arize.com/phoenix/) for guides on defining your own schema and for the API reference on `phoenix.Schema` and `phoenix.EmbeddingColumnNames`.

In [7]:
train_schema = px.Schema(
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="predicted_action",
    actual_label_column_name="actual_action",
    embedding_feature_column_names={
        "image_embedding": px.EmbeddingColumnNames(
            vector_column_name="image_vector",
            link_to_data_column_name="url",
        ),
    },
)

The schema for your production data is the same, except it does not have an actual label column.

In [8]:
prod_schema = px.Schema(
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="predicted_action",
    embedding_feature_column_names={
        "image_embedding": px.EmbeddingColumnNames(
            vector_column_name="image_vector",
            link_to_data_column_name="url",
        ),
    },
)

Create Phoenix datasets that wrap your dataframes with schemas that describe them.

In [9]:
prod_ds = px.Dataset(dataframe=prod_df, schema=prod_schema, name="production")
train_ds = px.Dataset(dataframe=train_df, schema=train_schema, name="training")

Launch Phoenix. Follow the instructions in the cell output to open the Phoenix UI.

In [10]:
session = px.launch_app(primary=prod_ds, reference=train_ds)

🌍 To view the Phoenix app in your browser, visit https://yk5ozyafp7l1-496ff2e9c6d22116-6006-colab.googleusercontent.com/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


## Find and Export Problematic Clusters

Click on "image_embedding" in the "Embeddings" section.

![click on image embedding](http://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/image-classification/click_on_image_embedding.png)

Select a period of high drift in the Euclidean distance graph at the top.

![select period of high drift](http://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/image-classification/select_period_of_high_drift.png)
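
To build intuition for what the drift graph plots, here is a minimal sketch that assumes drift is measured as the Euclidean distance between the centroid (mean vector) of the production embeddings and the centroid of the training embeddings. That definition is an assumption made for illustration, so check the Phoenix docs for the exact computation. The function name `centroid_distance` and the toy arrays are hypothetical.

```python
import numpy as np


def centroid_distance(primary_vectors, reference_vectors):
    """Euclidean distance between the mean vectors of two sets of embeddings."""
    primary_centroid = np.asarray(primary_vectors, dtype=float).mean(axis=0)
    reference_centroid = np.asarray(reference_vectors, dtype=float).mean(axis=0)
    return float(np.linalg.norm(primary_centroid - reference_centroid))


# Toy 2-D embeddings standing in for the real image vectors.
reference = np.array([[0.0, 0.0], [2.0, 0.0]])  # centroid is (1, 0)
primary = np.array([[4.0, 4.0], [4.0, 4.0]])  # centroid is (4, 4)

drift = centroid_distance(primary, reference)
print(drift)  # prints 5.0
```

On the tutorial data, you could stack the `image_vector` column into a 2-D array first, for example with `np.stack(prod_df["image_vector"].to_numpy())`, and compute the distance per time window to approximate the shape of the graph.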

Click on the top cluster in the panel on the left. Phoenix has identified this cluster as problematic because it consists entirely or almost entirely of production data, meaning that your model is making production inferences on data unlike anything it saw during training.

![select top cluster](http://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/image-classification/select_top_cluster.png)
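
The idea of a cluster made up "almost entirely of production data" can be expressed as a simple ratio. The sketch below is a rough proxy for why such a cluster stands out, not Phoenix's actual ranking logic. The cluster labels, the `source` column, and the helper `production_share` are all hypothetical.

```python
import pandas as pd

# Hypothetical cluster assignments for eight points from two datasets.
points = pd.DataFrame(
    {
        "cluster": [0, 0, 0, 0, 1, 1, 1, 1],
        "source": [
            "production",
            "production",
            "production",
            "production",
            "production",
            "training",
            "training",
            "training",
        ],
    }
)


def production_share(df):
    """Fraction of production points in each cluster, highest first."""
    is_production = df["source"].eq("production")
    return is_production.groupby(df["cluster"]).mean().sort_values(ascending=False)


shares = production_share(points)
print(shares)
```

A share near 1.0 means the cluster has few or no training neighbors, which is the situation described above.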

Use the panel at the bottom to examine the data points in this cluster. What do you notice about these data points that is different from the training data points you saw earlier?

![inspect points in cluster](http://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/image-classification/inspect_points_in_cluster.png)

The data points in the cluster above are grainy and noisy. Click on the "Export" button to save your cluster for relabeling and fine-tuning.

![export cluster](http://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/image-classification/export_cluster.png)

## Load and View Exported Data

View the exported cluster as a dataframe in your notebook.

In [11]:
export_df = session.exports[-1]
export_df.head()

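If no cluster has been exported yet, this cell fails with `IndexError: list index out of range`. Export a cluster from the Phoenix UI, then re-run the cell.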
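
The `IndexError` arises because indexing an empty list with `[-1]` always fails. A small guard avoids the traceback; the helper name `latest_export` is hypothetical, and the sketch only assumes that `session.exports` behaves like a list, as the error message suggests.

```python
def latest_export(exports):
    """Return the most recent export, or None when nothing has been exported."""
    return exports[-1] if len(exports) > 0 else None


# Demonstrated on plain lists standing in for session.exports.
print(latest_export([]))  # prints None
print(latest_export(["first", "second"]))  # prints second
```

In the notebook, you could call `export_df = latest_export(session.exports)` and check for `None` before displaying anything.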

Display a few examples from your exported data.

In [12]:
display_examples(export_df.head())

If the previous cell failed, this cell raises `NameError: name 'export_df' is not defined`. Re-run both cells after exporting a cluster.

Congrats! You've pinpointed the blurry or noisy images that are hurting your model's performance in production. As an actionable next step, you can label your exported production data and fine-tune your model to improve performance.
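
One possible way to hand the exported data to labelers is sketched below. It assumes the export contains `prediction_id`, `url`, and `predicted_action` columns like the production dataframe; that is an assumption, so inspect `export_df.columns` first. The dataframe `toy_export`, its example URLs, and the helper `make_labeling_sheet` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the exported cluster.
toy_export = pd.DataFrame(
    {
        "prediction_id": ["a1", "b2"],
        "url": ["https://example.com/a1.png", "https://example.com/b2.png"],
        "predicted_action": ["sleeping", "hugging"],
    }
)


def make_labeling_sheet(export_df):
    """Keep the columns a labeler needs and add an empty ground-truth column."""
    sheet = export_df[["prediction_id", "url", "predicted_action"]].copy()
    sheet["actual_action"] = pd.NA  # to be filled in by a human labeler
    return sheet


sheet = make_labeling_sheet(toy_export)
csv_text = sheet.to_csv(index=False)
print(csv_text)
```

Once the `actual_action` column is filled in, the labeled rows can be added to your training data for fine-tuning.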