- Prepare the dataset: process gorilla's instruction + code example pairs
    - Instruction: the task description
    - Function: accepts the end-to-end task
    - Test function
    - Test dataset

In [41]:
import json
import pprint

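def load_jsonl_data(path):
    """Load a JSON Lines file: one JSON object per non-empty line."""
    data = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines instead of failing in json.loads
                data.append(json.loads(line))
    return data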

hf_api_data = load_jsonl_data("gorilla/data/api/huggingface_api.jsonl")
len(hf_api_data), hf_api_data[0]

(936,
 {'domain': 'Natural Language Processing Feature Extraction',
  'framework': 'Hugging Face Transformers',
  'functionality': 'Feature Extraction',
  'api_name': 'YituTech/conv-bert-base',
  'api_call': "AutoModel.from_pretrained('YituTech/conv-bert-base')",
  'api_arguments': 'N/A',
  'python_environment_requirements': 'transformers',
  'example_code': 'N/A',
  'performance': {'dataset': 'N/A', 'accuracy': 'N/A'},
  'description': 'A pre-trained ConvBERT model for feature extraction provided by YituTech, based on the Hugging Face Transformers library.'})

In [28]:
from collections import Counter

# Count how many API entries fall into each domain
domain_counter_dict = Counter(d['domain'] for d in hf_api_data)

pprint.pp(domain_counter_dict.most_common())

[('Natural Language Processing Text2Text Generation', 41),
 ('Natural Language Processing Text Generation', 39),
 ('Natural Language Processing Sentence Similarity', 33),
 ('Computer Vision Image Classification', 33),
 ('Natural Language Processing Token Classification', 33),
 ('Natural Language Processing Zero-Shot Classification', 33),
 ('Natural Language Processing Text Classification', 32),
 ('Audio Automatic Speech Recognition', 31),
 ('Natural Language Processing Table Question Answering', 31),
 ('Computer Vision Video Classification', 30),
 ('Multimodal Text-to-Image', 30),
 ('Multimodal Image-to-Text', 30),
 ('Computer Vision Object Detection', 30),
 ('Computer Vision Image Segmentation', 30),
 ('Natural Language Processing Fill-Mask', 30),
 ('Natural Language Processing Question Answering', 29),
 ('Multimodal Document Question Answer', 29),
 ('Computer Vision Depth Estimation', 29),
 ('Computer Vision Unconditional Image Generation', 29),
 ('Audio Text-to-Speech', 29),
 ('Audi

In [42]:
hf_train_data = load_jsonl_data("gorilla/data/apibench/huggingface_train.json")
len(hf_train_data), hf_train_data[0]

(8191,
 {'code': "###Instruction: Write an API implementation that takes customer reviews as input and extracts features to analyze customer sentiment.\n###Output: <<<domain>>>: Natural Language Processing Feature Extraction\n<<<api_call>>>: AutoModel.from_pretrained('YituTech/conv-bert-base')\n<<<api_provider>>>: Hugging Face Transformers\n<<<explanation>>>: 1. We import the necessary classes from the transformers package. This includes AutoTokenizer and AutoModel for tokenizing and processing customer review text.\n2. We use the from_pretrained method of the AutoModel class to load the pre-trained model 'YituTech/conv-bert-base'. This model is based on ConvBERT and is suitable for feature extraction in text data.\n3. We load the customer review text, tokenize it, and use the model to extract features from the review. These features can then be used to analyze customer sentiment.\n<<<code>>>: from transformers import AutoTokenizer, AutoModel\ntokenizer = AutoTokenizer.from_pretrained(

In [44]:
hf_eval_data = load_jsonl_data("gorilla/data/apibench/huggingface_eval.json")
len(hf_eval_data), hf_eval_data[1]

(911,
 {'code': "###Instruction: Design a feature for a social media website to recommend articles to users based on how similar the articles are to their previously liked articles.\n###Output: <<<domain>>>: Natural Language Processing Sentence Similarity\n<<<api_call>>>: AutoModel.from_pretrained('princeton-nlp/unsup-simcse-roberta-base')\n<<<api_provider>>>: Hugging Face Transformers\n<<<explanation>>>:1. We first import the necessary classes and modules from the transformers package. This includes AutoTokenizer and AutoModel for loading the pre-trained models from Hugging Face.\n2. We use the AutoModel.from_pretrained() method to load the 'princeton-nlp/unsup-simcse-roberta-base' model, which is specially designed for calculating sentence similarity.\n3. To build the recommendation feature, we process the text of previously liked articles and compute sentence embeddings. For each new article, we compute its sentence embedding and compare it to the embeddings of previously liked arti

In [57]:
hf_eval_data[2]

{'code': "###Instruction: As a journalist, I am curious about speech sentiment analysis in a group of people in a crowd. I want to extract features from the audio to run sentiment analysis.\n###Output: <<<domain>>>: Multimodal Feature Extraction\n<<<api_call>>>: HubertModel.from_pretrained('facebook/hubert-large-ll60k')\n<<<api_provider>>>: Hugging Face Transformers\n<<<explanation>>>: 1. Import the necessary libraries, which include the 'HubertModel' from transformers.\n2. Load the pretrained model 'facebook/hubert-large-ll60k', which is a self-supervised speech representation learning model, capable of dealing with unique problems in speech representation learning and extracting useful features from audio data.\n3. Process the crowd audio data and convert it into an acceptable input format for the Hubert model.\n4. Pass the preprocessed audio data through the Hubert model to extract features that can be used for further sentiment analysis.\n<<<code>>>: from transformers import Hubert

# 1. Instruction

In [52]:
# Instruction source: apibench/{lib}_train.json, 'code' field, instruction part

import re

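def get_code_parts_from_apibench_data(data):
    """Split one apibench record's 'code' string into its labelled parts."""
    text = data['code']
    # Everything before "###Output" is the instruction (the "###Instruction: " prefix is kept).
    # partition() avoids the unpacking error that split() raises when the marker is missing or repeated.
    instruction = text.partition("\n###Output")[0]

    # Extract domain, api_call, api_provider, and code using regular expressions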
    domain_pattern = r'<<<domain>>>: (.+?)\n'
    api_call_pattern = r'<<<api_call>>>: (.+?)\n'
    api_provider_pattern = r'<<<api_provider>>>: (.+?)\n'
    code_pattern = r'<<<code>>>: (.+)'

    domain = re.search(domain_pattern, text).group(1)
    api_call = re.search(api_call_pattern, text).group(1)
    api_provider = re.search(api_provider_pattern, text).group(1)
    code = re.search(code_pattern, text, re.DOTALL).group(1).strip()

    return {
        'instruction': instruction, 
        'domain': domain, 
        'api_call': api_call, 
        'api_provider': api_provider, 
        'code': code
    }

d = hf_eval_data[0]
code_parts = get_code_parts_from_apibench_data(d)
code_parts, code_parts['instruction']

({'instruction': '###Instruction: Design a feature for a social media website to recommend articles to users based on how similar the articles are to their previously liked articles.',
  'domain': 'Natural Language Processing Sentence Similarity',
  'api_call': "AutoModel.from_pretrained('princeton-nlp/unsup-simcse-roberta-base')",
  'api_provider': 'Hugging Face Transformers',
  'code': "from transformers import AutoTokenizer, AutoModel\ntokenizer = AutoTokenizer.from_pretrained('princeton-nlp/unsup-simcse-roberta-base')\nmodel = AutoModel.from_pretrained('princeton-nlp/unsup-simcse-roberta-base')"},
 '###Instruction: Design a feature for a social media website to recommend articles to users based on how similar the articles are to their previously liked articles.')
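The extracted `instruction` still carries the `###Instruction:` marker, as the output above shows. If only the task description is needed, a small helper can drop it. This is a minimal sketch; `strip_instruction_prefix` and `INSTRUCTION_PREFIX` are hypothetical names, not part of gorilla.

```python
INSTRUCTION_PREFIX = "###Instruction:"  # marker seen in the apibench 'code' field


def strip_instruction_prefix(instruction):
    """Return the task description without the leading '###Instruction:' marker."""
    text = instruction.strip()
    if text.startswith(INSTRUCTION_PREFIX):
        text = text[len(INSTRUCTION_PREFIX):]
    return text.strip()


sample = "###Instruction: Design a feature for a social media website."
print(strip_instruction_prefix(sample))
```

Text without the marker is returned unchanged apart from surrounding whitespace.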

In [55]:
for d in hf_eval_data:
    code_parts = get_code_parts_from_apibench_data(d)
    print(code_parts['instruction'])
    pprint.pp(d)
    break

###Instruction: Design a feature for a social media website to recommend articles to users based on how similar the articles are to their previously liked articles.
{'code': '###Instruction: Design a feature for a social media website to '
         'recommend articles to users based on how similar the articles are to '
         'their previously liked articles.\n'
         '###Output: <<<domain>>>: Natural Language Processing Sentence '
         'Similarity\n'
         '<<<api_call>>>: '
         "AutoModel.from_pretrained('princeton-nlp/unsup-simcse-roberta-base')\n"
         '<<<api_provider>>>: Hugging Face Transformers\n'
         '<<<explanation>>>:1. We first import the necessary classes and '
         'modules from the transformers package. This includes AutoTokenizer '
         'and AutoModel for loading the pre-trained models from Hugging Face.\n'
         '2. We use the AutoModel.from_pretrained() method to load the '
         "'princeton-nlp/unsup-simcse-roberta-base' model,

# 2. Function / Test Function
- code part -> gpt -> function
- Dataset issue: the prompt handles part of it for now; a test dataset can only be matched once it is mapped to a Hugging Face dataset name
- prompt:
    Generate the following code based on the above information:
    1. function with:
    - detailed comments
    - function description
    2. test function with:
    - test dataset
    - assert statements in the test function
    - do not compare numbers strictly
    - if a dataset is provided in performance - dataset, load the dataset, then select several samples from it; otherwise, use an online source; do not leave it blank
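The "code part -> gpt -> function" step could be prepared roughly as follows: put the parsed record first, then append the prompt text. This is only a sketch under assumptions. `build_function_prompt` is a hypothetical helper, the layout of the information block is a guess, the sample `performance` value is a placeholder, and the call to the model itself is left out.

```python
import json


def build_function_prompt(code_parts, api_entry, task_prompt):
    """Assemble the model input: record information first, task prompt last."""
    info_lines = [
        code_parts["instruction"],
        f"domain: {code_parts['domain']}",
        f"api_call: {code_parts['api_call']}",
        f"api_provider: {code_parts['api_provider']}",
        "code:",
        code_parts["code"],
    ]
    if api_entry is not None:
        # 'performance' carries the dataset name the prompt calls "performance - dataset"
        info_lines.append("performance: " + json.dumps(api_entry.get("performance", {})))
    return "\n".join(info_lines) + "\n\n" + task_prompt


# Sample values copied from the outputs above; the performance entry is a placeholder.
sample_parts = {
    "instruction": "###Instruction: Design a feature for a social media website.",
    "domain": "Natural Language Processing Sentence Similarity",
    "api_call": "AutoModel.from_pretrained('princeton-nlp/unsup-simcse-roberta-base')",
    "api_provider": "Hugging Face Transformers",
    "code": "from transformers import AutoTokenizer, AutoModel",
}
sample_api_entry = {"performance": {"dataset": "N/A", "accuracy": "N/A"}}
sample_task = "Generate the following code based on the above information: ..."

prompt = build_function_prompt(sample_parts, sample_api_entry, sample_task)
print(prompt)
```

Keeping the task prompt as a parameter makes it easy to iterate on the prompt wording without touching the record formatting.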


# 3. Dataset
- search from {lib}
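
One possible starting point for the search is the `performance` -> `dataset` field already present in the API records: index it by `api_call` and skip the `'N/A'` entries. A minimal sketch, assuming every record has the shape shown for `hf_api_data[0]`. `build_dataset_index` is a hypothetical helper, the second sample record is invented for illustration, and the stored names are free text that may still need mapping to real Hugging Face dataset names.

```python
def build_dataset_index(api_data):
    """Map api_call -> dataset name, skipping records without a usable name."""
    index = {}
    for entry in api_data:
        perf = entry.get("performance") or {}
        dataset = perf.get("dataset") if isinstance(perf, dict) else None
        if dataset and dataset != "N/A":
            index[entry["api_call"]] = dataset
    return index


sample_api_data = [
    {   # shape taken from hf_api_data[0] above
        "api_call": "AutoModel.from_pretrained('YituTech/conv-bert-base')",
        "performance": {"dataset": "N/A", "accuracy": "N/A"},
    },
    {   # hypothetical record, for illustration only
        "api_call": "hypothetical_call()",
        "performance": {"dataset": "example_dataset", "accuracy": "N/A"},
    },
]
dataset_index = build_dataset_index(sample_api_data)
print(dataset_index)
```

Records missing from the index are the ones for which the prompt falls back to an online source.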