# Classify
This notebook demonstrates the `tailwiz.classify` function, which classifies unlabeled data into two or more classes. The number of classes is determined by the classes present in the labeled data you pass to the function. Labeled data is optional: when you provide it, the AI uses it as reference examples for assigning classes to the unlabeled data. If no labeled data is passed in, your data will be separated into two classes simply by text similarity.

In [None]:
# Import required packages.
import tailwiz
import pandas as pd

## 1. Data prep
First, we read in our example data from a .csv file using the `pandas` library.

In [None]:
labeled_data = pd.read_csv('data/tweets-labeled.csv')
unlabeled_data = pd.read_csv('data/tweets-unlabeled.csv')

Our example data is Twitter data. It consists of tweets (`text`), the tweet sentiment (`sentiment`, either positive or negative), and an excerpt that identifies the sentiment of the tweet (`selected_text`). We have 200 labeled examples and ~3K unlabeled examples. We will focus on the tweet sentiments in this notebook: our goal will be to use `tailwiz` to label the 3K unlabeled examples using our 200 labeled examples as references.
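
Before labeling, it can help to check how many classes the labeled data contains and how balanced they are, since those classes determine what the unlabeled data can be assigned to. The snippet below is a minimal sketch using plain `pandas`; `example_labeled` is a small hypothetical stand-in for `labeled_data`, not the real tweet file.

```python
import pandas as pd

# Hypothetical stand-in for labeled_data; the notebook loads the real data from a CSV.
example_labeled = pd.DataFrame({
    'text': ['love this', 'so sad today', 'great day', 'awful service'],
    'sentiment': ['positive', 'negative', 'positive', 'negative'],
})

# Count how many examples each class has.
class_counts = example_labeled['sentiment'].value_counts()
print(class_counts)

# Number of distinct classes present in the labeled data.
num_classes = example_labeled['sentiment'].nunique()
print(num_classes)
```

Running the same two calls on `labeled_data['sentiment']` gives the class balance of the 200 labeled tweets.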

Note that providing more prelabeled examples will generally improve performance.

Below is a preview of our data.

In [None]:
# View first 5 rows of labeled data.
labeled_data.head()

In [None]:
# View first 5 rows of unlabeled data.
unlabeled_data.head()

In [None]:
# Before calling tailwiz.classify, we must rename our columns to match what the function expects.
# The text column must be named 'text' (it already is).
# The label column must be named 'label' (it is currently named 'sentiment').
labeled_data = labeled_data.rename(columns={'sentiment': 'label'})
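
A quick check of the column names before the call can catch a missed rename early. The helper below, `check_classify_columns`, is a hypothetical convenience function written for this notebook and is not part of `tailwiz`; it only verifies the column names mentioned above, using small stand-in DataFrames.

```python
import pandas as pd

def check_classify_columns(prelabeled, to_label):
    """Raise ValueError if an expected column name is missing."""
    missing = [col for col in ('text', 'label') if col not in prelabeled.columns]
    if missing:
        raise ValueError(f'prelabeled data is missing columns: {missing}')
    if 'text' not in to_label.columns:
        raise ValueError("data to label is missing the 'text' column")

# Hypothetical stand-ins for labeled_data and unlabeled_data.
example_labeled = pd.DataFrame({'text': ['love this'], 'sentiment': ['positive']})
example_unlabeled = pd.DataFrame({'text': ['what a day']})

# Rename first, as in the cell above, then validate.
example_labeled = example_labeled.rename(columns={'sentiment': 'label'})
check_classify_columns(example_labeled, example_unlabeled)
```

With the real data, the equivalent call would be `check_classify_columns(labeled_data, unlabeled_data)`.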

## 2. Call `classify` function
The next step is to call `tailwiz.classify`! We set `output_metrics` to `True` to also output an estimate of the performance of our classification job.

Depending on the complexity of your data, this may take a few minutes. If this is your first time running `tailwiz.classify`, you might see some extra downloads.

In [None]:
results, performance_estimate = tailwiz.classify(
    text_to_label=unlabeled_data[['text']],
    prelabeled_text=labeled_data[['text', 'label']],
    output_metrics=True,
)

## 3. Inspect and save results
After classifying our unlabeled data, we can inspect and save results.

First, let's inspect the first five rows to do a quick sanity check. Note the new column, `label_from_tailwiz`.

In [None]:
results.head()
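
Beyond the first five rows, the share of each predicted class is another useful sanity check: a split that looks very different from what you expect may be worth a closer look. This is a minimal sketch with a hypothetical stand-in for `results`; only the column name `label_from_tailwiz` is taken from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the results DataFrame returned by tailwiz.classify.
example_results = pd.DataFrame({
    'text': ['tweet a', 'tweet b', 'tweet c', 'tweet d'],
    'label_from_tailwiz': ['positive', 'positive', 'positive', 'negative'],
})

# Fraction of rows assigned to each class.
label_share = example_results['label_from_tailwiz'].value_counts(normalize=True)
print(label_share)
```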

We can also print out our performance estimate to gain some additional insight into our labels.

In [None]:
performance_estimate

Note that this is only an estimate based on your labeled data. We will not know for certain how the classification actually performed on the unlabeled data.
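
One way to complement the estimate is a manual spot check: draw a small random sample of the newly labeled rows, label them by hand, and see how often you agree with `tailwiz`. The sketch below assumes a stand-in `example_results` DataFrame, and the `hand_label` values are simulated purely for illustration; in practice you would fill them in yourself.

```python
import pandas as pd

# Hypothetical stand-in for the results DataFrame.
example_results = pd.DataFrame({
    'text': [f'tweet {i}' for i in range(20)],
    'label_from_tailwiz': ['positive', 'negative'] * 10,
})

# Draw a reproducible random sample to review by hand.
spot_check = example_results.sample(n=5, random_state=0).copy()

# Simulated review: pretend the reviewer disagreed with the first sampled row only.
hand_labels = spot_check['label_from_tailwiz'].tolist()
hand_labels[0] = 'negative' if hand_labels[0] == 'positive' else 'positive'
spot_check['hand_label'] = hand_labels

# Fraction of sampled rows where the hand label matches the predicted label.
agreement = (spot_check['hand_label'] == spot_check['label_from_tailwiz']).mean()
print(agreement)
```

A sample this small gives only a rough signal; a larger sample narrows the uncertainty.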

Finally, we can save these results:

In [None]:
# We set index to False to avoid saving the index column added by pandas.
results.to_csv('data/tweets-unlabeled-with-classify-results-from-tailwiz.csv', index=False)
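
To confirm the file was written as intended, you can read it back and compare it with the DataFrame in memory. The sketch below uses a temporary directory and a hypothetical stand-in for `results`, so it does not touch the `data/` folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the results DataFrame.
example_results = pd.DataFrame({
    'text': ['tweet a', 'tweet b'],
    'label_from_tailwiz': ['positive', 'negative'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'results.csv')
    example_results.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# With index=False, no extra index column appears after reloading.
same_columns = list(reloaded.columns) == list(example_results.columns)
same_shape = reloaded.shape == example_results.shape
print(same_columns, same_shape)
```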