# Parse
This notebook demonstrates the `tailwiz.parse` function, which extracts a snippet of text from a `context` given a `prompt`. You can optionally pass in labeled data as references that the AI will use to parse the unlabeled data. If no labeled data is passed in, your data's `context` will still be parsed according to the `prompt`, but the results may be less predictable.

The main difference between `tailwiz.parse` and `tailwiz.generate` is that, with `tailwiz.parse`, the labels must be extracted directly from the text. By contrast, `tailwiz.generate` is able to generate labels simply given a prompt.
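Because parsed labels must come directly from the text, it can help to confirm that every label in your labeled data appears verbatim in its context before using it. The following is a minimal sketch of such a check on made-up rows; the toy data and the `label_in_text` column are illustrative assumptions, and whether `tailwiz` itself rejects non-matching labels is not stated here.

```python
import pandas as pd

# Toy labeled examples (made up for illustration, not the notebook's data).
toy_labeled = pd.DataFrame({
    'text': ['I love this sunny weather', 'My flight got cancelled again'],
    'selected_text': ['love', 'flight was cancelled'],
})

# A label can only be "extracted" if it appears verbatim in its context.
is_verbatim = [
    label in context
    for context, label in zip(toy_labeled['text'], toy_labeled['selected_text'])
]
toy_labeled['label_in_text'] = is_verbatim

# Rows whose label is not a substring of the text may need fixing or removal.
bad_rows = toy_labeled[~toy_labeled['label_in_text']]
print(bad_rows)
```

In this toy data the second row is flagged, because its label paraphrases the tweet instead of quoting it.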

In [None]:
###################################################################
#######            START - Edit variables here.             #######

# Instructions or question describing your task.
prompt = 'Extract the most important phrase in determining the sentiment of the text.'

# Path of data (csv) to be parsed by tailwiz.
to_parse = 'data/tweets.csv'
# Column name of the text to be parsed by tailwiz.
to_parse_context_col = 'text'

# Path of labeled data (csv) that tailwiz learns from.
labeled_examples = 'data/tweets-with-labels.csv'
# Column name of the context to be learned by tailwiz.
labeled_examples_context_col = 'text'
# Column name of the label to be learned by tailwiz.
labeled_examples_label_col = 'selected_text'

# Path to where you want to save your results.
save_csv = 'data/tweets-with-tailwiz-labels.csv'

##################################################################
#######   END - Leave unedited to run with example data.   #######

The example data consists of tweets (`text`), the tweet sentiment (`sentiment`, positive or negative), and an excerpt that identifies the sentiment of the tweet (`selected_text`). We have 200 labeled examples and ~3K unlabeled examples. Our goal will be to use `tailwiz` to label the 3K unlabeled examples using our 200 labeled examples as references. Providing more labeled examples will generally improve performance.

## 1. Install `tailwiz`

In [None]:
!python -m pip install --upgrade tailwiz

In [None]:
# Import required packages.
import tailwiz
import pandas as pd

## 2. Data prep
First, we read in our example data from a .csv file using the `pandas` library.

In [None]:
df_labeled_examples = pd.read_csv(labeled_examples)
df_to_parse = pd.read_csv(to_parse)

In [None]:
# View first 5 rows of labeled data.
df_labeled_examples.head()

In [None]:
# View first 5 rows of unlabeled data to be parsed by tailwiz.
df_to_parse.head()

In [None]:
# Before calling tailwiz.parse with our data, we must rename our columns in accordance with `tailwiz.parse` standards.
# The text column must be named 'context' and the label column must be named 'label'.
df_to_parse = df_to_parse.rename(columns={to_parse_context_col: 'context'})
df_labeled_examples = df_labeled_examples.rename(columns={labeled_examples_context_col: 'context', labeled_examples_label_col: 'label'})

We must create a prompt column. `tailwiz` will attempt to follow the prompt to extract the desired phrases from your context.

In [None]:
# We give all examples the same prompt.
df_labeled_examples['prompt'] = prompt
df_to_parse['prompt'] = prompt

## 3. Call `parse` function
The next step is to call `tailwiz.parse`! We set `output_metrics` to `True` to also output an estimate of the performance of our parsing job.

This may take 5-15 minutes. If this is your first time running `tailwiz.parse`, you might see some extra downloads.
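Since the call can take several minutes, a quick check of the input frames beforehand can catch simple mistakes early. The sketch below assumes the column names used in this notebook (`context`, `prompt`, `label`); the helper `find_problems` is hypothetical and not part of `tailwiz`.

```python
import pandas as pd

def find_problems(df, required_cols):
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    for col in required_cols:
        if col not in df.columns:
            problems.append(f"missing column '{col}'")
        elif df[col].isna().any():
            problems.append(f"column '{col}' has missing values")
    return problems

# Toy frames (made up): one has a missing context, the other lacks a label column.
toy_to_parse = pd.DataFrame({'context': ['great day', None], 'prompt': ['p', 'p']})
toy_labeled = pd.DataFrame({'context': ['great day'], 'prompt': ['p']})

to_parse_problems = find_problems(toy_to_parse, ['context', 'prompt'])
labeled_problems = find_problems(toy_labeled, ['context', 'prompt', 'label'])
print(to_parse_problems)
print(labeled_problems)
```

With the notebook's own data, the same helper could be applied to `df_to_parse` and `df_labeled_examples` before the call.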

In [None]:
results, performance_estimate = tailwiz.parse(
    to_parse=df_to_parse[['context', 'prompt']],
    labeled_examples=df_labeled_examples[['context', 'prompt', 'label']],
    output_metrics=True,
)

## 4. Inspect and save results
After parsing our unlabeled data, we can inspect and save the results.

First, let's inspect the first five rows to do a quick sanity check. The new column, `tailwiz_label`, contains the parsed results.

In [None]:
results.head()

We can also print out our performance estimate to gain some additional insight into our labels.

In [None]:
performance_estimate

This is only an estimate based on your labeled data. We will not know for certain how the parsing job actually performed on the unlabeled data.
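One way to get a feel for quality yourself is to hand-label a small sample and compare it with the parsed output. The sketch below shows two simple comparison scores on made-up pairs: exact match and word-level Jaccard overlap. These helpers are illustrative assumptions and are not necessarily the metrics that `tailwiz` reports.

```python
def exact_match(predicted, expected):
    """True when both strings are equal, ignoring case and surrounding whitespace."""
    return predicted.strip().lower() == expected.strip().lower()

def word_overlap(predicted, expected):
    """Jaccard similarity between the sets of lowercase words in each string."""
    a, b = set(predicted.lower().split()), set(expected.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# (parsed label, hand-written label) pairs, made up for illustration.
pairs = [
    ('so happy', 'so happy'),
    ('really sad today', 'sad'),
]

exact_rate = sum(exact_match(p, e) for p, e in pairs) / len(pairs)
mean_overlap = sum(word_overlap(p, e) for p, e in pairs) / len(pairs)
print(exact_rate, mean_overlap)
```

Exact match is strict, while the overlap score gives partial credit when the parsed span is longer or shorter than the hand-written one.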

Finally, we can save these results:

In [None]:
results.to_csv(save_csv, index=False)  # We set index to False to avoid saving the index column added by pandas.
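To see what `index=False` changes, the sketch below writes a made-up frame to an in-memory buffer both ways and reads it back. The toy frame and buffer names are assumptions for illustration; only standard `pandas` calls are used.

```python
import io
import pandas as pd

# Toy results frame (made up), shaped like the parsed output.
toy_results = pd.DataFrame({'context': ['great day'], 'tailwiz_label': ['great']})

# Default behavior: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy_results.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# With index=False, only the data columns are written.
without_index = io.StringIO()
toy_results.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```

When the index is written, reloading the file yields an extra `Unnamed: 0` column, which is why the save step above drops it.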