# Introduction

The code that follows leverages language transformer models to perform a sequence of two tasks that enables the generation of a short story script in the english language
1. **Semantic search** through a transformer fine-tuned to the sentence classification problem to extract the most relevant philosophical quote macthing an input key word from a larger corpus
2. **Text generation** by fine tuning a transformer model to the sentence completion problem by leveraging the words generated from the earlier stage to start a new sentence

The huggingface model hub has been used to load the required datasets and fine-tune the transformer models for accomplishing the task described above. I would advise you to take a course offered by huggingfaces linked [here](https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt) if you are interested in exploring this further or checking out other language models

This code has been built and tested in the kaggle ecosystem. Hence, if a similar notebook environment is utilized by a user, it is advisable to make all installations associated with libraries involving the huggingface models

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
!pip install datasets
!pip install -U git+https://github.com/huggingface/transformers.git
!pip install -U git+https://github.com/huggingface/accelerate.git
!pip install faiss-cpu

/kaggle/input/inputs/tiny-book-corpus-validation.jsonl
/kaggle/input/inputs/tiny-book-corpus-train.jsonl
[0mCollecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-4hldwzfh
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-4hldwzfh
  Resolved https://github.com/huggingface/transformers.git to commit 5bb4430edc7df9f9950d412d98bbe505cc4d328b
  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... [?25ldone
[?25h  Created wheel for transformers: filename=transformers-4.31.0.dev0-py3-none-any.whl size=7352645 sha256=4afb0f29e60526dae4e3fc5170c1504d7fa02360e2f7121524203019fb343f43
  Stored in directory: /tmp/pip-ephem-

# Sentence Completion using a Text generation transformer model

The second section of the problem will be tackled first to build a huggingfaces pipeline that is capable of generating a short story script when provided with one or more sentence starters

1. The bookcorpus dataset available on the huggingface model hub can be used to fine tune a GPT-based text generation model for making it suitable to the task of story script generation. However, the dataset is too large in a resource crunched environment to train within a reasonable amount of time. So for the purpose of this modeling exercise, around 60,000 samples are extracted from the entire dataset and used for fine tuning purpose. The following code which is commented out serves to extract the 60,000 random text samples. This could be uncommented and run only once when the notebook is executed for the first time. After the first run, the condensed dataset of 60,000 samples will be stored due to which we can again comment out the code section for the subsequent runs. In this step, the training dataset of 60,000 samples can be further split into training and validation sets by selecting a suitable split ratio

In [2]:
#from datasets import load_dataset
#import datasets
#raw_story_scripts_train = load_dataset('bookcorpus',split='train[:20%]+train[-20%:]')
#condensed_size = 60000
#raw_story_scripts={'train':raw_story_scripts_train}
#raw_story_scripts = datasets.DatasetDict(raw_story_scripts)
#raw_story_scripts_condensed = raw_story_scripts['train'].shuffle(seed=42).select(range(condensed_size))
#split_dataset = raw_story_scripts_condensed.train_test_split(test_size=0.2,seed=42)
#split_dataset['validation'] = split_dataset.pop('test')
#for split,dataset in split_dataset.items():
#    dataset.to_json(f'tiny-book-corpus-{split}.jsonl')

2. The stored 60,000 samples can be loaded by providing the path of the local directory where the samples are stored. It should be ensured that the path string **/kaggle/input/inputs/** is replaced by the correct local directory path

In [3]:
from datasets import load_dataset
data_files = {'train':'/kaggle/input/inputs/tiny-book-corpus-train.jsonl','test':'/kaggle/input/inputs/tiny-book-corpus-validation.jsonl'}
split_dataset = load_dataset('json',data_files=data_files)
split_dataset['validation'] = split_dataset.pop('test')

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-f976537655f3a925/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-f976537655f3a925/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [4]:
split_dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 48000
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 12000
    })
})

3. The training corpus dataset is sequentially subjected to tokenization and encoding following which it is used to fine-tune (just train the model head) of a text generation model. The model used here is a gpt2 like model with the checkpoint of 'distilgpt2'. This checkpoint is chosen because the underlying model is a lightweight version of the larger gpt2 model. The huggingface model hub provides the flexibility to load pretrained transformer models along with the tokenizers that are compatible with the particular model

In [5]:
from transformers import AutoTokenizer, GPT2LMHeadModel
model_checkpoint = 'distilgpt2'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = GPT2LMHeadModel.from_pretrained(model_checkpoint)

caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']


Downloading (…)lve/main/config.json:   0%|          | 0.00/762 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/353M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

4. Training the model head on short sentences is not likely to be an optimal way of learning the story generation approach. Hence, the next step is to identify the word length corresponding to each text input and filtering out sentences that are less than 15 words in length

In [6]:
def count_str_length(el):
    return {'word_length':[len(sample.split(' ')) for sample in el['text']]}


split_dataset = split_dataset.map(count_str_length,batched=True)
split_dataset.set_format('pandas')
split_dataset.reset_format()
split_dataset = split_dataset.filter(lambda x: x['word_length']>=15)
split_dataset

  0%|          | 0/48 [00:00<?, ?ba/s]

  0%|          | 0/12 [00:00<?, ?ba/s]

  0%|          | 0/48 [00:00<?, ?ba/s]

  0%|          | 0/12 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'word_length'],
        num_rows: 16994
    })
    validation: Dataset({
        features: ['text', 'word_length'],
        num_rows: 4275
    })
})

5. The input strings are tokenized and encoded next through the tokenizer object to be stored with key of 'input_ids' in the training data set dictionary post execution of this step. It shuould be noted here that the tokens are truncated to a length of 512 and the truncated tokens are also populated as the subsequent entry in the encoded output

In [7]:
context_length=512
def tokenize_scripts(el):
    tok_output = tokenizer(el['text'], truncation=True, max_length=context_length,
                    return_overflowing_tokens=True,return_length=True)
    return {'input_ids':tok_output['input_ids']}

tokenized_dataset = split_dataset.map(tokenize_scripts,batched=True,
                                      remove_columns=split_dataset['train'].column_names)
tokenized_dataset

  0%|          | 0/17 [00:00<?, ?ba/s]

  0%|          | 0/5 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 16994
    })
    validation: Dataset({
        features: ['input_ids'],
        num_rows: 4275
    })
})

6. Next a data collator with the DataCollatorForLanguageModeling class is instantiated to translate the encoded tokens into batches for fine-tuning the model (training the model head)

In [8]:
from transformers import DataCollatorForLanguageModeling
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,mlm=False)
out = data_collator([tokenized_dataset['train'][i] for i in range(3)])
for key in out:
    print(f'{key} shape:{out[key].shape}')

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


input_ids shape:torch.Size([3, 46])
attention_mask shape:torch.Size([3, 46])
labels shape:torch.Size([3, 46])


In [9]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

7. The hugging face libraries have wrappers to abstract the process of learner instantiation and configuration with the classes named TrainingArguments and Trainer. These classes can be used to set thee model hyperparameters like number of epochs, leanrning rate, weight decay along with identification of the encoded training and validation tokens. The fine-tuned model is stored with the name of 'tiny-random-GPT2LMHeadModel-finetuned-corpus' on the hugging face model hub so that it can be invoked through a pipeline later on and used for the story script generation

In [11]:
from transformers import TrainingArguments, Trainer
args = TrainingArguments(
    f'tiny-random-GPT2LMHeadModel-finetuned-corpus',
    overwrite_output_dir=True,
    num_train_epochs=3,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    learning_rate=5e-4,
    weight_decay = 0.01,
    fp16=True,
    push_to_hub=True
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset = tokenized_dataset['train'],
    eval_dataset = tokenized_dataset['validation'],
    tokenizer = tokenizer,
    data_collator = data_collator
)

Cloning https://huggingface.co/san94/tiny-random-GPT2LMHeadModel-finetuned-corpus into local empty directory.


Download file pytorch_model.bin:   0%|          | 8.00k/312M [00:00<?, ?B/s]

Download file runs/Jul04_12-41-46_20b74ae67198/events.out.tfevents.1688474526.20b74ae67198.28.0: 100%|########…

Download file runs/Jul04_18-14-35_8ce4b887e41a/events.out.tfevents.1688494601.8ce4b887e41a.28.0: 100%|########…

Download file training_args.bin: 100%|##########| 3.93k/3.93k [00:00<?, ?B/s]

Clean file runs/Jul04_12-41-46_20b74ae67198/events.out.tfevents.1688474526.20b74ae67198.28.0:  16%|#5        |…

Clean file runs/Jul04_18-14-35_8ce4b887e41a/events.out.tfevents.1688494601.8ce4b887e41a.28.0:  16%|#5        |…

Clean file training_args.bin:  25%|##5       | 1.00k/3.93k [00:00<?, ?B/s]

Clean file pytorch_model.bin:   0%|          | 1.00k/312M [00:00<?, ?B/s]

8. The train() method of the Trainer object can be used to commence the fine-tuning process which will take some time post which the model will be pushed to the hugging face model hub

In [12]:
trainer.train()
trainer.push_to_hub()

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········································


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc




Epoch,Training Loss,Validation Loss
1,4.4433,4.278949
2,3.7013,4.251234
3,3.0412,4.449711


Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


Upload file runs/Jul15_14-20-23_bf33ad69f349/events.out.tfevents.1689430948.bf33ad69f349.29.0:   0%|          …

To https://huggingface.co/san94/tiny-random-GPT2LMHeadModel-finetuned-corpus
   90f81cf..e4d7c0f  main -> main

To https://huggingface.co/san94/tiny-random-GPT2LMHeadModel-finetuned-corpus
   e4d7c0f..bcf9fb1  main -> main



'https://huggingface.co/san94/tiny-random-GPT2LMHeadModel-finetuned-corpus/commit/e4d7c0ff9284c5fbe28fbf1c4b392a02e7acb81a'

In [None]:
#from transformers import pipeline
#story_teller = pipeline('text-generation',model='tiny-random-GPT2LMHeadModel-finetuned-corpus')
#story_teller('Once upon')

In [None]:
#story_teller('You are')[0]['generated_text']

In [None]:
#story_teller('Roses are red')[0]['generated_text']

In [None]:
#normal_generator = pipeline('text-generation')
#normal_generator('Roses are red')[0]['generated_text']

In [None]:
#normal_generator('Once upon')[0]['generated_text']

In [None]:
#story_teller('My love')[0]['generated_text']

# Semantic search for seeding phrases through a Sentence Classification Transformer model

Now that the story content generation model is ready, it is time to build the other model that can return a philosophical quote generating the best match with an input word representing a story genre. The matching logic is implemented with the help of semantic search which returns the quotes whose token embeddings generate the best cosine scores with the embedding of the input story genre

1. For achieving the semantic search logic, the corpus of philosophical quotes available on the hugging face model hub is leveraged. The training set is limited to quotes with word length in between 5 and 50 because the subsequent text generation model should get sufficient leeway to generate script words for the story to be meaningful

In [13]:
from datasets import load_dataset
raw_quote_dataset = load_dataset('mertbozkurt/quotes_philosophers',encoding="latin-1")
raw_quote_dataset = raw_quote_dataset.map(count_str_length,batched=True)
raw_quote_dataset = raw_quote_dataset.filter(lambda x: (x['word_length']<=50)&(x['word_length']>=5))
raw_quote_dataset

Downloading and preparing dataset text/mertbozkurt--quotes_philosophers to /root/.cache/huggingface/datasets/text/mertbozkurt--quotes_philosophers-da87b9e9f82c0fa9/0.0.0/4b86d314f7236db91f0a0f5cda32d4375445e64c5eda2692655dd99c2dac68e8...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/40.3k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/66.3k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/30.2k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/22.0k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/52.2k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/44.0k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/7.77k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/64.5k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/20.3k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/mertbozkurt--quotes_philosophers-da87b9e9f82c0fa9/0.0.0/4b86d314f7236db91f0a0f5cda32d4375445e64c5eda2692655dd99c2dac68e8. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?ba/s]

  0%|          | 0/3 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'word_length'],
        num_rows: 2160
    })
})

In [None]:
#len_dist = raw_quote_dataset['train']['word_length'].value_counts().to_frame().reset_index().rename(columns={'index':'word_length','word_length':'len_count'}).sort_values('word_length',ascending=False)
#len_dist

In [None]:
#sum(len_dist['len_count'])

In [None]:
#sum(len_dist[(len_dist.word_length<50) & (len_dist.word_length>=5)]['len_count'])

In [None]:
#raw_quote_dataset.reset_format()
#raw_quote_dataset = raw_quote_dataset.filter(lambda x: (x['word_length']<=50)&(x['word_length']>=5))
#raw_quote_dataset

2. The tokenizer and architecture for the sentence classification model required here correspond to the checkpoint of 'multi-qa-mpnet-base-dot-v1'. The model and the compatible tokenizer are loaded with the suitable hugging face routines

In [14]:
from transformers import AutoTokenizer, AutoModel
import torch
ss_checkpoint = 'sentence-transformers/multi-qa-mpnet-base-dot-v1'
ss_tokenizer = AutoTokenizer.from_pretrained(ss_checkpoint)
ss_model = AutoModel.from_pretrained(ss_checkpoint)

device = torch.device('cuda') if torch.device('cuda') is not None else torch.device('cpu')
ss_model.to(device)

Downloading (…)okenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

3. Semantic search works by generating embedding vectors for the search string as well as the philosophical quote corpus and then computing the dot product similarity between the embedding vectors to find the closest matching quote from the corpus. The method that follows generates an embedding vector through a method known as cls pooling by taking text phrase as input

In [15]:
def cls_pooling(model_out):
    return model_out.last_hidden_state[:,0]

def generate_embeddings(el):
    tkzd_inputs = ss_tokenizer(el,padding=True,return_tensors='pt')
    tkzd_inputs = {k:v.to(device) for k,v in tkzd_inputs.items()}
    outputs = ss_model(**tkzd_inputs)
    return cls_pooling(outputs)

4. The corpus of quotes is then subjected to the generate_embeddings() function to create an embedding vector for each quote in the corpus. For computing dot product similarity on the embedding vectors, the faiss (Facebook AI Similarity Search) algorithm is used 

In [16]:
raw_quote_dataset = raw_quote_dataset['train']
embedding_space = raw_quote_dataset.map(lambda x:{'embedding':generate_embeddings(x['text']).detach().cpu().numpy()[0]})
embedding_space.add_faiss_index(column='embedding')

  0%|          | 0/2160 [00:00<?, ?ex/s]

  0%|          | 0/3 [00:00<?, ?it/s]

Dataset({
    features: ['text', 'word_length', 'embedding'],
    num_rows: 2160
})

5. The search string is also then translated into an embedding vector and then compared with the embedding vectors of the corpus through the dot product similarity metric by leveraging the faiss technique. The semantic_search() function returns the top 5 semantically similar quotes from the corpus corresponding to the search string that is passed as argument

In [17]:
def semantic_search(question):
    question_embedding = generate_embeddings([question]).detach().cpu().numpy()[0]
    scores,samples = embedding_space.get_nearest_examples('embedding',question_embedding,k=5)
    sample_df = pd.DataFrame.from_dict(samples)
    sample_df['scores'] = scores
    sample_df.sort_values('scores',ascending=False,inplace=True)
    return sample_df

6. The quotes from the corpus that generate the highest similarity score with the question input (corresponding to movie or genre) are collected and kept ready for the script generation process

In [18]:
story_starters = []
search_results = semantic_search('drama')
for _,row in search_results.iterrows():
    story_starters.append(row.text)
story_starters

['The life of every individual is really always a tragedy, but gone through in detail, it has the character of a comedy.',
 'Comedy aims at representing men as worse, Tragedy as better than in actual life.',
 'Enjoy life. This is not a dress rehearsal.',
 'PLOT is CHARACTER revealed by ACTION.',
 'In a true tragedy, both parties must be right.']

7. A text generation pipeline is created from the gpt2-like model fine-tuned and saved earlier on the huggingface model hub.The search results gathered from the previous step are then passed to the pipeline for obtaining the corresponding story scripts

In [31]:
from transformers import pipeline
story_teller = pipeline('text-generation',model='tiny-random-GPT2LMHeadModel-finetuned-corpus',max_length=250)

8. Next gradio interfaces will be created and uploaded to huggingface Spaces to ensure that a user is able to witness a demonstration of the model

In [None]:
!pip install gradio
import gradio as gr

In [33]:
def create_script(search):
    seed_quote = semantic_search(search)
    best_quote_result = seed_quote.reset_index().loc[0,'text']
    return story_teller('"'+best_quote_result+'"')[0]['generated_text']

textbox = gr.Textbox(label='Enter emotion/genre',lines=2,placeholder='Suspense')
desc = 'The application generates a story scipt of approximately 250 word length when an emotion or genre is provided as input'
gr.Interface(fn=create_script,inputs=textbox,outputs='text',
            title='Story Script Generator',description=desc).launch()

Kaggle notebooks require sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Running on local URL:  http://127.0.0.1:7863
Running on public URL: https://e0bd8530aec96b92c3.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


