# Classify
This notebook demonstrates the `tailwiz.classify` function, which classifies unlabeled data into two or more classes. The number of classes is determined by the classes present in the labeled data you pass to the function. Labeled data is optional: if you pass it in, the AI uses it as reference examples when assigning classes to the unlabeled data. If you pass none, your data is separated into two classes by text similarity alone.

To quickly use this notebook for your own data, replace the values of the variables in the first cell with your own paths and column names. Then, run all remaining cells.

In [None]:
###################################################################
#######             START - Edit variables here.            #######

# Path of data (csv) to be classified by tailwiz.
to_classify = 'data/tweets.csv'
# Column name of the text to be classified by tailwiz.
to_classify_text_col = 'text'

# Path of labeled data (csv) that tailwiz learns from.
labeled_examples = 'data/tweets-with-labels.csv'
# Column name of the text to be learned by tailwiz.
labeled_examples_text_col = 'text'
# Column name of the label to be learned by tailwiz.
labeled_examples_label_col = 'sentiment'

# Path to where you want to save your results.
save_csv = 'data/tweets-with-tailwiz-labels.csv'

##################################################################
#######   END - Leave unedited to run with example data.   #######

The example data consists of tweets (`text`), the tweet sentiment (`sentiment`, positive or negative), and an excerpt that identifies the sentiment of the tweet (`selected_text`). We have 200 labeled examples and ~3K unlabeled examples. Our goal will be to use `tailwiz` to label the 3K unlabeled examples using our 200 labeled examples as references. Providing more labeled examples will generally improve performance.

## 1. Install `tailwiz`

In [None]:
!python -m pip install --upgrade tailwiz

In [None]:
# Import required packages.
import tailwiz
import pandas as pd

## 2. Data prep
We read our labeled and unlabeled data from .csv files using the `pandas` library.

In [None]:
labeled_examples = pd.read_csv(labeled_examples)
to_classify = pd.read_csv(to_classify)

In [None]:
# View first 5 rows of labeled data.
labeled_examples.head()

In [None]:
# View first 5 rows of unlabeled data to be classified by tailwiz.
to_classify.head()
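
Before classifying, it can help to check the text column for missing or blank entries. The cell below is a minimal sketch on a small stand-in DataFrame (`demo` is hypothetical, not the example data). Whether `tailwiz` handles such rows on its own is not covered in this notebook, so treat the filtering step as an optional precaution.

```python
import pandas as pd

# Hypothetical stand-in for the unlabeled data; replace with your own DataFrame.
demo = pd.DataFrame({'text': ['great day', None, '   ', 'awful service']})

# Treat missing values and whitespace-only strings as unusable text.
is_blank = demo['text'].fillna('').str.strip().eq('')
print(f"Rows with missing or blank text: {is_blank.sum()}")

# Keep only the rows that contain usable text.
cleaned = demo[~is_blank].reset_index(drop=True)
```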

In [None]:
# Before calling tailwiz.classify, we must rename our columns to match what the function expects:
# the text column must be named 'text' and the label column must be named 'label'.
to_classify = to_classify.rename(columns={to_classify_text_col: 'text'})
labeled_examples = labeled_examples.rename(columns={labeled_examples_text_col: 'text', labeled_examples_label_col: 'label'})
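
A quick way to confirm the renaming worked is to check that the expected columns are present. The helper below, `check_columns`, is a hypothetical convenience function and not part of `tailwiz`; the two small DataFrames are stand-ins for the real data.

```python
import pandas as pd

def check_columns(df, required, name):
    """Raise a ValueError if any required column is missing from df."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"{name} is missing required column(s): {missing}")

# Hypothetical stand-ins for the renamed DataFrames.
to_classify_demo = pd.DataFrame({'text': ['great day', 'awful service']})
labeled_demo = pd.DataFrame({
    'text': ['love it', 'hate it'],
    'label': ['positive', 'negative'],
})

# Both calls pass silently when the required columns exist.
check_columns(to_classify_demo, ['text'], 'to_classify')
check_columns(labeled_demo, ['text', 'label'], 'labeled_examples')
```

In the notebook itself, the same calls could be made on `to_classify` and `labeled_examples` right after the rename.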

## 3. Call `classify` function
The next step is to call `tailwiz.classify`! We set `output_metrics` to `True` to also output an estimate of the performance of our classification job.

Depending on the complexity of your data, this may take a few minutes. If this is your first time running `tailwiz.classify`, you might see some extra downloads.

In [None]:
results, performance_estimate = tailwiz.classify(
    to_classify,
    labeled_examples,
    output_metrics=True,
)

## 4. Inspect and save results
After classifying our unlabeled data, we can inspect and save results.

First, let's inspect the first five rows to do a quick sanity check. A new column, `tailwiz_label`, contains the newly generated labels.

In [None]:
results.head()
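
Beyond the first five rows, counting how often each label was assigned is another quick sanity check. This sketch uses a small hypothetical `results_demo` DataFrame that mimics the `tailwiz_label` column described above; on real output, the same call would apply to `results`.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by tailwiz.classify.
results_demo = pd.DataFrame({
    'text': ['great day', 'awful service', 'so happy'],
    'tailwiz_label': ['positive', 'negative', 'positive'],
})

# Count how many rows received each label.
label_counts = results_demo['tailwiz_label'].value_counts()
print(label_counts)
```

A distribution that is heavily skewed toward one label, when the labeled examples are not, may be worth a closer look.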

We can also print out our performance estimate to gain some additional insight into our labels.

In [None]:
performance_estimate

This is only an estimate based on your labeled data. We cannot know for certain how the classification actually performed on the unlabeled data.
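
Because the estimate comes from the labeled data only, one complementary check is to read a small random sample of the newly labeled rows by hand. The sketch below draws such a sample from a hypothetical `results_demo` DataFrame; the sample size and `random_state` are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by tailwiz.classify.
results_demo = pd.DataFrame({
    'text': ['great day', 'awful service', 'so happy',
             'never again', 'pretty good', 'total mess'],
    'tailwiz_label': ['positive', 'negative', 'positive',
                      'negative', 'positive', 'negative'],
})

# Draw a reproducible random sample to read and judge manually.
spot_check = results_demo.sample(n=3, random_state=0)
print(spot_check)
```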

Finally, we can save these results:

In [None]:
results.to_csv(save_csv, index=False)  # We set index to False to avoid saving the index column added by pandas.