# Instruction Tuning with GPT-4

This notebook produces the verb-noun pie chart (HTML file and figure) shown in the GPT-4-LLM paper. It analyzes the outputs that GPT-4 generated when following the instructions.

```
"Instruction Tuning with GPT-4" (https://arxiv.org/abs/2304.03277)
Baolin Peng*, Chunyuan Li*, Pengcheng He*, Michel Galley, Jianfeng Gao (*Equal Contribution)
```

- Project: https://instruction-tuning-with-gpt-4.github.io/
- GitHub Repo: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM

If you have any questions, please open an issue in the GitHub repo.



*Note: The original script comes from the [self-instruct repo](https://github.com/yizhongw/self-instruct/blob/main/self_instruct/instruction_visualize.ipynb). It uses the Berkeley Neural Parser to parse the generated instructions and Plotly to visualize the results. Make sure to install benepar by following the [installation instructions](https://github.com/nikitakit/self-attentive-parser#installation).*

In [7]:
import benepar, spacy

# Download the spaCy model. `!` runs a shell command from the notebook;
# `%python` is not a valid line magic.
!python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')
doc = nlp("The time for action is now. It's never too late to do something.")

import matplotlib.pyplot as plt


## 1. Generate Verb-Noun Pairs for GPT4 Output

:warning: Warning: Running the entire pre-processing and saving the result to a CSV file takes about 20 minutes. Consider skipping this step and loading our pre-processed verb-noun CSV file in the next section.
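The skip-or-recompute choice described in the warning can be sketched as a small cache check. This is only an illustration: `load_or_build_pairs`, `fake_build`, and the temporary path are hypothetical names that are not part of the notebook, and the standard `csv` module stands in for pandas to keep the sketch dependency-free.

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["verb", "noun", "instruction_output"]


def load_or_build_pairs(csv_path, build_rows):
    """Return the rows cached in csv_path if it exists; otherwise build and cache them."""
    path = Path(csv_path)
    if path.exists():
        with path.open(newline="", encoding="utf-8") as fin:
            return list(csv.DictReader(fin))
    rows = build_rows()  # the slow parsing step would go here
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as fout:
        writer = csv.DictWriter(fout, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return rows


# Demo with a stand-in for the slow step (hypothetical data).
build_calls = []


def fake_build():
    build_calls.append(1)
    return [{"verb": "write", "noun": "letter", "instruction_output": "Write a letter."}]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "data" / "verb_noun.csv"
    first = load_or_build_pairs(cache, fake_build)   # builds and writes the CSV
    second = load_or_build_pairs(cache, fake_build)  # reads the cached CSV

print(len(build_calls))  # the slow step ran only once
```

Note that values read back from a CSV are always strings, so a missing noun would come back as an empty string rather than `None` in this sketch.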

In [None]:
def find_root_verb_and_its_dobj(tree_root):
    # first check if the current node and its children satisfy the condition
    if tree_root.pos_ == "VERB":
        for child in tree_root.children:
            if child.dep_ == "dobj" and child.pos_ == "NOUN":
                return tree_root.lemma_, child.lemma_
        return tree_root.lemma_, None
    # if not, recurse into the first child only (the loop returns on its first iteration)
    for child in tree_root.children:
        return find_root_verb_and_its_dobj(child)
    # reached only when the node has no children
    return None, None


def find_root_verb_and_its_dobj_in_string(s):
    doc = nlp(s)
    first_sent = list(doc.sents)[0]
    return find_root_verb_and_its_dobj(first_sent.root)


find_root_verb_and_its_dobj_in_string("Write me a story about education.")

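To make the traversal easier to follow without loading a spaCy model, here is a minimal sketch on a hand-built tree. `FakeToken` is a hypothetical stand-in that carries only the attributes the function reads (`lemma_`, `pos_`, `dep_`, `children`). `first_child_search` mirrors the notebook's logic; `all_children_search` is a variation, not the authors' method, that keeps looking through later children when the first one yields nothing.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy token (not a real spaCy class)."""
    lemma_: str
    pos_: str
    dep_: str = ""
    children: List["FakeToken"] = field(default_factory=list)


def first_child_search(node):
    """Same logic as the notebook: descend into the first child only."""
    if node.pos_ == "VERB":
        for child in node.children:
            if child.dep_ == "dobj" and child.pos_ == "NOUN":
                return node.lemma_, child.lemma_
        return node.lemma_, None
    for child in node.children:
        return first_child_search(child)
    return None, None


def all_children_search(node):
    """Variation: try every child until one subtree contains a verb."""
    if node.pos_ == "VERB":
        for child in node.children:
            if child.dep_ == "dobj" and child.pos_ == "NOUN":
                return node.lemma_, child.lemma_
        return node.lemma_, None
    for child in node.children:
        verb, noun = all_children_search(child)
        if verb is not None:
            return verb, noun
    return None, None


# A non-verb root whose second child is a verb with a direct object.
root = FakeToken(
    "story", "NOUN", "ROOT",
    [
        FakeToken("a", "DET", "det"),
        FakeToken("write", "VERB", "relcl", [FakeToken("letter", "NOUN", "dobj")]),
    ],
)

print(first_child_search(root))   # (None, None)
print(all_children_search(root))  # ('write', 'letter')
```

Switching to the variation would change the extracted pairs, so the counts reported later in this notebook apply only to the first-child behavior.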

In [12]:
import pandas as pd
import json
import tqdm


generated_data_path = "data/gpt4_alpaca_data_0329.json"

with open(generated_data_path, "r") as fin:
    gpt4_machine_generated_tasks = json.load(fin)

# print(gpt4_machine_generated_tasks[0])

# Deduplicate the GPT-4 outputs. To study the instructions instead,
# change the task key (e.g. from "output" to "instruction").
instruction_outputs = {task["output"] for task in gpt4_machine_generated_tasks}
print(len(instruction_outputs))

raw_phrases = []
for out in tqdm.tqdm(instruction_outputs):
    try:
        verb, noun = find_root_verb_and_its_dobj_in_string(out)
        raw_phrases.append({"verb": verb, "noun": noun, "instruction_output": out})
    except Exception as e:
        print(e)
        print(out)

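The deduplication step can also be written so that the order of the unique texts is reproducible, since iterating over a `set` of strings may give a different order from run to run. The sketch below is an assumption-laden illustration: `unique_field_values` is a hypothetical helper, and the sample records assume Alpaca-style keys (`"instruction"`, `"input"`, `"output"`).

```python
def unique_field_values(tasks, key="output"):
    """Collect the distinct non-empty values of `key`, keeping first-seen order."""
    seen = {}  # dicts preserve insertion order, unlike sets
    for task in tasks:
        value = task.get(key)
        if value:
            seen.setdefault(value, None)
    return list(seen)


# Hypothetical records in the assumed Alpaca-style layout.
sample_tasks = [
    {"instruction": "Write a haiku.", "input": "", "output": "An old silent pond"},
    {"instruction": "Name a color.", "input": "", "output": "Blue"},
    {"instruction": "Name one color.", "input": "", "output": "Blue"},
]

outputs = unique_field_values(sample_tasks, key="output")
instructions = unique_field_values(sample_tasks, key="instruction")
print(outputs)            # ['An old silent pond', 'Blue']
print(len(instructions))  # 3
```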

In [None]:
raw_phrases = pd.DataFrame(raw_phrases)
raw_phrases.to_csv(r"data/gpt4_alpaca_verb_noun_output.csv")
len(raw_phrases)  # last expression in the cell, so the notebook displays it

## 2. Pie Chart Creation on Verb-Noun

Load our pre-processed verb-noun CSV file and create the HTML file with Plotly.

In [None]:
import pandas as pd  # needed when Section 1 was skipped

raw_phrases = pd.read_csv(r"data/gpt4_alpaca_verb_noun_output.csv")
# drop rows with a missing value (e.g. no verb or no direct-object noun was found)
phrases = raw_phrases.dropna()
count_list = (
    phrases[["verb", "noun"]]
    .groupby(["verb", "noun"])
    .size()
    .sort_values(ascending=False)
)

In [None]:
len(count_list)
# count_list[:25]

# count_list[:25].plot.barh()
# plt.ylabel('verb, noun')
# plt.xlabel('frequency')
# plt.show()

5229

In [None]:
top_verbs = phrases[["verb"]].groupby(["verb"]).size().nlargest(20).reset_index()

# keep only pairs whose verb is among the 20 most frequent verbs
df = phrases[phrases["verb"].isin(top_verbs["verb"].tolist())]
# df = df[~df["noun"].isin(["I", "what"])]
# df = phrases
# df[~df["verb"].isin(top_verbs["verb"].tolist())]["verb"] = "other"
# df[~df["verb"].isin(top_verbs["verb"].tolist())]["noun"] = "other"
df = df.groupby(["verb", "noun"]).size().reset_index(name="count")
# df = df[df["count"] > 10]
# keep the 4 most frequent nouns per verb, ordered by verb, then by count;
# sort + head avoids groupby.apply, whose handling of the grouping column
# differs across pandas versions
df = (
    df.sort_values(["verb", "count"], ascending=[True, False])
    .groupby("verb")
    .head(4)
    .reset_index(drop=True)
)
df

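|     | verb    | noun     | count |
|-----|---------|----------|-------|
| 0   | bring   | change   | 13    |
| 1   | bring   | benefit  | 7     |
| 2   | bring   | joy      | 3     |
| 3   | bring   | hope     | 2     |
| 4   | contain | error    | 14    |
| ... | ...     | ...      | ...   |
| 75  | use     | function | 11    |
| 76  | write   | letter   | 22    |
| 77  | write   | novel    | 8     |
| 78  | write   | book     | 7     |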

In [None]:
import os
import plotly.express as px

# df["blank"] = "ROOT"
# df = phrases.groupby(["verb", "noun"]).size().sort_values(ascending=False).head(5).reset_index().rename(columns={0: "count"})

df = df[df["count"] > 10]  # keep only pairs that occur more than 10 times
fig = px.sunburst(df, path=["verb", "noun"], values="count")
# fig.update_layout(uniformtext=dict(minsize=10, mode='hide'))
fig.update_layout(
    margin=dict(l=0, r=0, t=0, b=0),
    font_family="Times New Roman",
)
# fig.show()
os.makedirs("output", exist_ok=True)  # make sure the target directory exists
fig.write_html("output/gpt4_alpaca_verb_noun_output.html")
# fig.write_image("output/verb_noun.pdf")  # static export; requires the kaleido package

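For readers unfamiliar with sunburst charts, the two-level hierarchy that `path=["verb", "noun"]` describes can be sketched in plain Python: each verb becomes an inner-ring node sized by the sum of its noun counts, and each verb-noun pair becomes an outer-ring node. `sunburst_nodes` is a hypothetical helper and the counts are toy values. The four lists it returns have the shape that `plotly.graph_objects.Sunburst` accepts as `ids`, `labels`, `parents`, and `values` (with `branchvalues="total"`).

```python
from collections import defaultdict

# Toy counts (illustrative values only, not the paper's data).
pair_counts = {
    ("write", "letter"): 3,
    ("write", "book"): 2,
    ("use", "function"): 4,
}


def sunburst_nodes(pair_counts):
    """Build ids/labels/parents/values for a verb -> noun hierarchy."""
    verb_totals = defaultdict(int)
    for (verb, _noun), count in pair_counts.items():
        verb_totals[verb] += count

    ids, labels, parents, values = [], [], [], []
    # Inner ring: one node per verb, sized by the sum of its noun counts.
    for verb, total in verb_totals.items():
        ids.append(verb)
        labels.append(verb)
        parents.append("")  # empty parent marks a top-level node
        values.append(total)
    # Outer ring: one node per verb-noun pair, attached to its verb.
    for (verb, noun), count in pair_counts.items():
        ids.append(f"{verb}/{noun}")
        labels.append(noun)
        parents.append(verb)
        values.append(count)
    return ids, labels, parents, values


ids, labels, parents, values = sunburst_nodes(pair_counts)
for row in zip(ids, parents, values):
    print(row)
```

Using `verb/noun` as the id keeps nodes unique when the same noun appears under several verbs.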
