## Installation

In [1]:
%%time

! pip install huggify-data

Collecting huggify-data
  Downloading huggify_data-0.4.1-py3-none-any.whl (12 kB)
Collecting accelerate<0.21.0,>=0.20.3 (from huggify-data)
  Downloading accelerate-0.20.3-py3-none-any.whl (227 kB)
Collecting bitsandbytes==0.40.2 (from huggify-data)
  Downloading bitsandbytes-0.40.2-py3-none-any.whl (92.5 MB)
Collecting datasets==2.20.0 (from huggify-data)
  Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
Collecting matplotlib==3.9.0 (from huggify-data)
  Downloading matplotlib-3.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.3 MB)

CPU times: user 559 ms, sys: 129 ms, total: 688 ms
Wall time: 1min 46s


## Data

Load a dataset from the Hugging Face Hub. This demo uses [StorySeed](https://huggingface.co/datasets/Aiwensile2/StorySeed), which has a `prompt` column and a `response` column.

In [None]:
from datasets import load_dataset

ds = load_dataset("Aiwensile2/StorySeed")

Downloading readme:   0%|          | 0.00/2.33k [00:00<?, ?B/s]

Resolving data files:   0%|          | 0/18 [00:00<?, ?it/s]

Downloading data:   0%|          | 0/18 [00:00<?, ?files/s]

Generating train split:   0%|          | 0/4376 [00:00<?, ? examples/s]

In [None]:
ds

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response'],
        num_rows: 4376
    })
})

In [None]:
ds['train']['prompt'][0]

'Please help me create a picture book with the theme of\nA curious kitten embarks on a whimsical journey through a magical forest, encountering friendly creatures along the way.\nThe format should include a title, followed by a cover page with an illustration only, and the remaining pages should each have both text and illustrations.'

In [None]:
ds['train']['response'][0]

'### Title: Whisker\'s Magical Forest Adventure\n\n---\n\n### Cover Page Illustration:\nA small, fluffy, gray kitten with wide, curious eyes standing at the edge of a lush, vibrant forest filled with colorful flowers and twinkling lights.\n\n---\n\n### Page 1 Illustration:\nWhiskers peeking out from behind a large oak tree, gazing at the path ahead.\n\n#### Text:\nIn the heart of a sun-dappled garden, little Whiskers the kitten found a hidden path leading into a mysterious forest. With a twitch of his nose and a flick of his tail, he decided it was the perfect day for an adventure.\n\n---\n\n### Page 2 Illustration:\nWhiskers meeting a cheerful squirrel with a bushy tail, sitting on a tree branch.\n\n#### Text:\nNot far into the forest, Whiskers met Squeaky the squirrel. "Where are you off to, little kitten?" Squeaky chirped. "I\'m exploring this magical forest," Whiskers replied with a purr. Squeaky nodded, "Follow me! I\'ll show you the way to the Whispering Willows."\n\n---\n\n### P

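Each training example is flattened into a single string that marks the two speakers with `### Human:` and `### Assistant:`, for example:

```text
### Human: What does YSA stand for? ### Assistant: YSA stands for Youth Spirit Artworks.
```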

In [None]:
from tqdm import tqdm

In [None]:
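questions = []
answers = []
texts = []

# Use only the first 100 examples to keep the demo small.
for i in tqdm(range(100)):
    row = ds['train'][i]  # fetch one row instead of materializing a whole column per iteration
    ques = row['prompt']
    # Replace every '### ' heading marker in the response with a single space.
    answ = ' '.join(row['response'].split('### '))
    text = "### Human: " + ques + " ### Assistant: " + answ

    questions.append(ques)
    answers.append(answ)
    texts.append(text)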

100%|██████████| 100/100 [00:02<00:00, 44.47it/s]
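
The formatting step in the loop above can also be written as a small reusable function. The following is a minimal sketch, not part of huggify-data: `format_example` and `build_rows` are hypothetical helper names, and the sample records are stand-ins for dataset rows (dicts with `prompt` and `response` keys).

```python
def format_example(prompt: str, response: str) -> dict:
    """Build one row in the '### Human: ... ### Assistant: ...' layout."""
    # Same cleanup as the loop above: replace each '### ' marker with a space.
    answer = " ".join(response.split("### "))
    text = "### Human: " + prompt + " ### Assistant: " + answer
    return {"questions": prompt, "answers": answer, "text": text}


def build_rows(records, limit=100):
    """Format the first `limit` records (dicts with 'prompt' and 'response')."""
    rows = []
    for i, record in enumerate(records):
        if i >= limit:
            break
        rows.append(format_example(record["prompt"], record["response"]))
    return rows


# Stand-in records (assumption); in the notebook you could pass ds["train"].
sample = [
    {"prompt": "What does YSA stand for?",
     "response": "YSA stands for Youth Spirit Artworks."},
    {"prompt": "Write a story.", "response": "### Title: A Story"},
]
rows = build_rows(sample)
print(rows[0]["text"])
```

Note that a response starting with `### ` ends up with a leading space after the cleanup, so the combined `text` has two spaces after `### Assistant:`. Call `.strip()` on the answer if that matters for your use case.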


In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame({'questions': questions, 'answers': answers, 'text': texts})
df.head()

| | questions | answers | text |
|---|---|---|---|
| 0 | Please help me create a picture book with the ... | Title: Whisker's Magical Forest Adventure\n\n... | ### Human: Please help me create a picture boo... |
| 1 | Please help me create a picture book with the ... | Title: **Mia and Mico’s Treasure Quest**\n\n ... | ### Human: Please help me create a picture boo... |
| 2 | Please help me create a picture book with the ... | Title: The Enchanted Doorway\n\n---\n\n# Cove... | ### Human: Please help me create a picture boo... |
| 3 | Please help me create a picture book with the ... | Title: "The Cave of Dragon Friends"\n\n---\n\... | ### Human: Please help me create a picture boo... |
| 4 | Please help me create a picture book with the ... | Title: Pip's Quest for the Golden Cheese\n\n ... | ### Human: Please help me create a picture boo... |


In [None]:
df.to_csv('data.csv', index=False)
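
The upload step in the Training section logs that it verifies the `questions` and `answers` columns, so it can be worth checking the saved file before uploading. Below is a minimal sketch using only the standard library. `missing_columns` is a hypothetical helper, not a huggify-data function, and the demo writes a throwaway CSV instead of reading `data.csv`.

```python
import csv
import os
import tempfile

# Column names taken from the uploader's log message in this notebook.
REQUIRED_COLUMNS = ("questions", "answers")


def missing_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return [name for name in required if name not in header]


# Demo on a temporary file; in the notebook, pass 'data.csv' instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["questions", "answers", "text"])
        writer.writerow(["Q?", "A.", "### Human: Q? ### Assistant: A."])
    demo_missing = missing_columns(demo_path)

print(demo_missing)  # an empty list means both columns are present
```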

## Training

We use **huggify-data** to push the prepared dataset to the Hugging Face Hub and then fine-tune a `Llama2` model on it.

In [None]:
from google.colab import userdata

In [None]:
from huggify_data.push_modules import DataFrameUploader

# Example usage:
df = pd.read_csv('/content/data.csv')
uploader = DataFrameUploader(
    df,
    hf_token=userdata.get('HF_TOKEN'),
    repo_name='storySeed-v3',  # name for the new dataset repo (your choice)
    username='eagle0504'  # replace with your own Hugging Face username
)
uploader.process_data()
uploader.push_to_hub()

Dataframe verified: columns 'questions' and 'answers' are present.
Data processed into DatasetDict.
Repository created: eagle0504/storySeed-v3


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Dataset pushed to Hugging Face Hub.


In [None]:
from huggify_data.train_modules import LlamaTrainer

In [None]:
# Usage
model_name = "NousResearch/Llama-2-7b-chat-hf"
dataset_name = "eagle0504/storySeed-v3"
new_model = "storySeed-v3-llama2"
huggingface_token = userdata.get('HF_TOKEN')

In [None]:
%%time

trainer = LlamaTrainer(model_name, dataset_name, new_model, huggingface_token)
peft_config = trainer.configure_lora()
training_args = trainer.configure_training_arguments(num_train_epochs=1)
trainer.train_model(training_args, peft_config)

## Inference

Once training has finished, feel free to shut down the Colab session. When you come back, rerun the **Installation** section and then jump straight to the code in this **Inference** section.

In [1]:
from huggify_data.train_modules import LlamaTrainer
from google.colab import userdata

In [2]:
%%time

# Usage
model_name = "NousResearch/Llama-2-7b-chat-hf"
dataset_name = "eagle0504/storySeed-v3"
new_model = "storySeed-v3-llama2"
huggingface_token = userdata.get('HF_TOKEN')
trainer = LlamaTrainer(model_name, dataset_name, new_model, huggingface_token)



tokenizer_config.json:   0%|          | 0.00/746 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]

Starting model loading...


config.json:   0%|          | 0.00/583 [00:00<?, ?B/s]



model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/200 [00:00<?, ?B/s]

Model loading completed.
CPU times: user 17.1 s, sys: 18.5 s, total: 35.6 s
Wall time: 55.9 s


In [3]:
some_model, some_tokenizer = trainer.load_model_and_tokenizer(
    base_model_path="NousResearch/Llama-2-7b-chat-hf",
    new_model_path="eagle0504/storySeed-v3-llama2",
)

adapter_config.json:   0%|          | 0.00/453 [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

adapter_model.bin:   0%|          | 0.00/134M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/434 [00:00<?, ?B/s]

In [7]:
%%time

# Example usage
prompt = "### Human: Please help me create a picture book with the theme of a mechanic Jensen Huang that embarks on a journey to become a blacksmith. ### Assistant:"
response = trainer.generate_response(some_model, some_tokenizer, prompt, 300)

CPU times: user 11min 59s, sys: 2.18 s, total: 12min 1s
Wall time: 3min


In [8]:
print(response.split("### Assistant: ")[1])

I'd be happy to help you create a picture book with the theme of a mechanic, Jensen Huang. Here's a rough outline and draft of the story:

Title: The Mechanic's Journey to Blacksmithing

Plot Summary: Jensen Huang is a talented mechanic who is content with fixing things around town, but he has always felt like something is missing in his life. One day, while tinkering with an old engine, he discovers an old blacksmithing book in an antique shop. Intrigued, he buys the book and becomes captivated by the art of shaping metal. Determined to learn more, Jensen decides to take a leave of absence from work and embark on a journey to discover the world of blacksmithing.

Jensen sets off on his journey and meets various characters along the way, each with their own unique insights into the craft. He meets a kind-hearted old blacksmith who teaches him the basics of shaping metal and forms a lasting bond with him. Along the way, he also learns important lessons about perseverance, patience, and 
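
The `response.split("### Assistant: ")[1]` call above raises an `IndexError` if the generated text does not contain that exact marker (for example, when the trailing space is missing). A slightly more forgiving way to pull out the answer is sketched below; `extract_answer` is a hypothetical helper, not part of huggify-data.

```python
ASSISTANT_MARKER = "### Assistant:"


def extract_answer(generated: str, marker: str = ASSISTANT_MARKER) -> str:
    """Return the text after the first marker, or the whole string if absent."""
    _before, found, after = generated.partition(marker)
    if not found:
        # No marker in the output: fall back to the full text.
        return generated.strip()
    return after.strip()


print(extract_answer("### Human: Hi ### Assistant: Hello there"))
```

In the notebook this would be called as `extract_answer(response)` in place of the `split(...)[1]` expression.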

Or, let's try using `pipeline`.