# Introduction

The code that follows uses transformer language models to perform two tasks in sequence, which together generate a short story script in English:
1. **Semantic search** with a transformer fine-tuned to produce sentence embeddings, used to retrieve the quotes that best match a query from a corpus of philosophical quotes
2. **Text generation** with a transformer fine-tuned for sentence completion, using the quotes retrieved in the first stage as sentence starters

The Hugging Face Hub is used to load the required datasets and to fine-tune the transformer models for the tasks described above. If you want to explore this further or try other language models, I recommend the Hugging Face NLP course linked [here](https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt).

This code was built and tested on Kaggle. If you use a similar notebook environment, install the Hugging Face libraries listed in the first code cell before running the rest of the notebook.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
!pip install datasets
!pip install -U git+https://github.com/huggingface/transformers.git
!pip install -U git+https://github.com/huggingface/accelerate.git
!pip install faiss-cpu

/kaggle/input/inputs/tiny-book-corpus-validation.jsonl
/kaggle/input/inputs/tiny-book-corpus-train.jsonl


# Sentence Completion using a Text generation transformer model

The second stage is tackled first: we build a Hugging Face pipeline that can generate a short story script from one or more sentence starters.

1. The bookcorpus dataset on the Hugging Face Hub can be used to fine-tune a GPT-based text generation model for story script generation. However, the full dataset is too large to train on within a reasonable time in a resource-constrained environment. For this exercise, 60,000 random samples are therefore extracted from the dataset and used for fine-tuning. The commented-out code below performs that extraction and splits the 60,000 samples into training and validation sets (an 80/20 split here; any suitable ratio can be chosen). Uncomment and run it only the first time the notebook is executed. It stores the condensed dataset on disk, so the code can be commented out again for subsequent runs.

In [5]:
#from datasets import load_dataset
#import datasets
#raw_story_scripts_train = load_dataset('bookcorpus',split='train[:20%]+train[-20%:]')
#condensed_size = 60000
#raw_story_scripts={'train':raw_story_scripts_train}
#raw_story_scripts = datasets.DatasetDict(raw_story_scripts)
#raw_story_scripts_condensed = raw_story_scripts['train'].shuffle(seed=42).select(range(condensed_size))
#split_dataset = raw_story_scripts_condensed.train_test_split(test_size=0.2,seed=42)
#split_dataset['validation'] = split_dataset.pop('test')
#for split,dataset in split_dataset.items():
#    dataset.to_json(f'tiny-book-corpus-{split}.jsonl')

Downloading and preparing dataset bookcorpus/plain_text (download: 1.10 GiB, generated: 4.52 GiB, post-processed: Unknown size, total: 5.62 GiB) to /root/.cache/huggingface/datasets/bookcorpus/plain_text/1.0.0/44662c4a114441c35200992bea923b170e6f13f2f0beb7c14e43759cec498700...



Dataset bookcorpus downloaded and prepared to /root/.cache/huggingface/datasets/bookcorpus/plain_text/1.0.0/44662c4a114441c35200992bea923b170e6f13f2f0beb7c14e43759cec498700. Subsequent calls will reuse this data.


Dataset({
    features: ['text'],
    num_rows: 29601692
})
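The sampling and splitting logic in the commented-out cell can be illustrated without downloading anything. The sketch below is a standard-library stand-in, not the `datasets` implementation: `condense_and_split` is a hypothetical helper that shuffles with a fixed seed, keeps the first `condensed_size` items, and holds out a fraction for validation.

```python
import random

def condense_and_split(samples, condensed_size, test_size=0.2, seed=42):
    """Shuffle, keep `condensed_size` samples, then split into train/validation.

    Hypothetical helper for illustration; it mimics the shape of
    shuffle(...).select(...).train_test_split(...) but not its exact ordering.
    """
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    condensed = shuffled[:condensed_size]
    n_validation = int(round(len(condensed) * test_size))
    return {
        "train": condensed[n_validation:],
        "validation": condensed[:n_validation],
    }

# Toy corpus standing in for the book corpus sentences.
corpus = [f"sentence {i}" for i in range(1000)]
splits = condense_and_split(corpus, condensed_size=100)
print(len(splits["train"]), len(splits["validation"]))
```

With `condensed_size=60000` and `test_size=0.2`, the same arithmetic gives 48,000 training and 12,000 validation samples.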

2. The stored 60,000 samples can be loaded by providing the path of the local directory where they are stored. Make sure to replace the path string **/kaggle/input/inputs/** with the correct local directory path.

In [23]:
from datasets import load_dataset
data_files = {'train':'/kaggle/input/inputs/tiny-book-corpus-train.jsonl','test':'/kaggle/input/inputs/tiny-book-corpus-validation.jsonl'}
split_dataset = load_dataset('json',data_files=data_files)
split_dataset['validation'] = split_dataset.pop('test')

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-47c4e9a206c111e4/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...



Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-47c4e9a206c111e4/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.



In [24]:
split_dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 48000
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 12000
    })
})

In [25]:
from transformers import AutoTokenizer, GPT2LMHeadModel
model_checkpoint = 'distilgpt2'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = GPT2LMHeadModel.from_pretrained(model_checkpoint)


In [5]:
def count_str_length(el):
    return {'word_length':[len(sample.split(' ')) for sample in el['text']]}

In [26]:

split_dataset = split_dataset.map(count_str_length,batched=True)
split_dataset.set_format('pandas')


In [27]:
split_dataset.reset_format()
split_dataset = split_dataset.filter(lambda x: x['word_length']>=15)
split_dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'word_length'],
        num_rows: 16994
    })
    validation: Dataset({
        features: ['text', 'word_length'],
        num_rows: 4275
    })
})
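The filter above keeps only samples with at least 15 space-separated tokens. A small standard-library sketch of the same batched counting function, run on made-up sentences, shows what the `word_length` column contains. One detail to be aware of: `split(' ')` counts the empty strings between consecutive spaces, whereas `split()` does not.

```python
def count_str_length(batch):
    # Same logic as the notebook cell: one count per text in the batch.
    return {"word_length": [len(sample.split(" ")) for sample in batch["text"]]}

# Made-up batch, in the dict-of-lists layout that batched map functions receive.
batch = {"text": ["once upon a time", "double  space here", ""]}
lengths = count_str_length(batch)["word_length"]
print(lengths)

# Whitespace-agnostic alternative for comparison.
alternative = [len(sample.split()) for sample in batch["text"]]
print(alternative)
```

For ordinary single-spaced text the two counts agree, so the choice only matters for samples with irregular whitespace.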

In [29]:
context_length=512
def tokenize_scripts(el):
    tok_output = tokenizer(el['text'], truncation=True, max_length=context_length,
                    return_overflowing_tokens=True,return_length=True)
    return {'input_ids':tok_output['input_ids']}

In [30]:
tokenized_dataset = split_dataset.map(tokenize_scripts,batched=True,
                                      remove_columns=split_dataset['train'].column_names)
tokenized_dataset


DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 16994
    })
    validation: Dataset({
        features: ['input_ids'],
        num_rows: 4275
    })
})
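With `truncation=True` and `return_overflowing_tokens=True`, a text longer than `context_length` tokens is returned as several chunks rather than being cut off. The sketch below imitates that chunking on plain integer lists. `chunk_token_ids` is a hypothetical helper, and it assumes no overlap between chunks (the tokenizer's `stride` left at its default). In this notebook the row counts before and after tokenization are identical (16994 and 4275), which suggests that no sample exceeded 512 tokens.

```python
def chunk_token_ids(token_ids, context_length):
    """Split one token-id sequence into consecutive windows of at most
    `context_length` ids (hypothetical stand-in for overflow handling)."""
    return [
        token_ids[start:start + context_length]
        for start in range(0, len(token_ids), context_length)
    ]

short_sample = list(range(3))   # fits in one window
long_sample = list(range(10))   # needs three windows of size 4
short_chunks = chunk_token_ids(short_sample, context_length=4)
long_chunks = chunk_token_ids(long_sample, context_length=4)
print(short_chunks)
print(long_chunks)
```

Because one input row can turn into several output rows, the original columns are dropped with `remove_columns` so that all columns keep the same length.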

In [31]:
from transformers import DataCollatorForLanguageModeling
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,mlm=False)
out = data_collator([tokenized_dataset['train'][i] for i in range(3)])
for key in out:
    print(f'{key} shape:{out[key].shape}')

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


input_ids shape:torch.Size([3, 46])
attention_mask shape:torch.Size([3, 46])
labels shape:torch.Size([3, 46])
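For causal language modeling (`mlm=False`), the collator pads every sequence in a batch to the same length and builds `labels` from `input_ids`, with padded positions set to -100 so that the loss ignores them. The following is a simplified pure-Python sketch of that behavior, not the library code; `collate_causal_lm` and the toy token ids are assumptions made for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def collate_causal_lm(sequences, pad_token_id):
    """Right-pad token-id lists and derive attention masks and labels."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Labels copy the inputs; padded positions are excluded from the loss.
        batch["labels"].append(seq + [IGNORE_INDEX] * n_pad)
    return batch

toy_batch = collate_causal_lm([[11, 12, 13], [21, 22]], pad_token_id=50256)
for key, rows in toy_batch.items():
    print(key, rows)
```

The sketch masks by position. The library collator masks by token id instead, so with `pad_token = eos_token` genuine end-of-text tokens are likely excluded from the loss as well.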


In [32]:
from huggingface_hub import notebook_login
notebook_login()


In [33]:
from transformers import TrainingArguments, Trainer
args = TrainingArguments(
    'tiny-random-GPT2LMHeadModel-finetuned-corpus',
    overwrite_output_dir=True,
    num_train_epochs=3,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    learning_rate=5e-4,
    weight_decay = 0.01,
    fp16=True,
    push_to_hub=True
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset = tokenized_dataset['train'],
    eval_dataset = tokenized_dataset['validation'],
    tokenizer = tokenizer,
    data_collator = data_collator
)

Cloning https://huggingface.co/san94/tiny-random-GPT2LMHeadModel-finetuned-corpus into local empty directory.



In [34]:
trainer.train()
trainer.push_to_hub()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc




| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 4.4433 | 4.278949 |
| 2 | 3.7013 | 4.251234 |
| 3 | 3.0412 | 4.449711 |
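
A common way to read these losses is as perplexity, the exponential of the cross-entropy loss. The sketch below converts the validation losses from the training log and picks the epoch with the lowest one. The numbers are copied from the log above; nothing else is assumed.

```python
import math

# (epoch, training loss, validation loss) copied from the training log.
history = [
    (1, 4.4433, 4.278949),
    (2, 3.7013, 4.251234),
    (3, 3.0412, 4.449711),
]

for epoch, train_loss, val_loss in history:
    print(f"epoch {epoch}: validation perplexity = {math.exp(val_loss):.1f}")

best_epoch = min(history, key=lambda row: row[2])[0]
print("lowest validation loss at epoch", best_epoch)
```

Validation loss is lowest after epoch 2 and rises in epoch 3 while training loss keeps falling, which is a typical sign of overfitting. Stopping after two epochs or lowering the learning rate may be worth trying.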


Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


Upload file runs/Jul04_18-14-35_8ce4b887e41a/events.out.tfevents.1688494601.8ce4b887e41a.28.0:   0%|          …

To https://huggingface.co/san94/tiny-random-GPT2LMHeadModel-finetuned-corpus
   cbc59d0..fe6c3bf  main -> main



'https://huggingface.co/san94/tiny-random-GPT2LMHeadModel-finetuned-corpus/commit/fe6c3bfcb75852ab6265ecd4c77a373ed552cc36'

In [15]:
from transformers import pipeline
story_teller = pipeline('text-generation',model='tiny-random-GPT2LMHeadModel-finetuned-corpus')
story_teller('Once upon')

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Once upon a success, your potential has disappeared a century or so before it became necessary for you to find me. '' dante turned to gautier and gion. '' gautier opened his mouth again, and gus nodded. ``"}]

In [17]:
story_teller('You are')[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"You are the ones who've been through all this, '' i said, and julian took the phone from him and placed it on the front of his head. '' `` please give me a call. '' i stepped back, letting the"

In [19]:
story_teller('Roses are red')[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"Roses are red, blood red, gold, and so forth just as blood is stil not healing. '' `` i don't want any of these things dying people, '' he says, `` but you do need to know that a very"

In [20]:
normal_generator = pipeline('text-generation')
normal_generator('Roses are red')[0]['generated_text']

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Roses are red, not green (depending on the color). (Photo: Michael Moore / Los Angeles Public Media)\n\nIt was part of the plan of the Los Angeles Police Department to provide a safe nightbed for all law enforcement officers in'

In [21]:
normal_generator('Once upon')[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Once upon a return, he made repeated trips to a car dealership and sold his new Bentley. But one morning in September, 2003, when an acquaintance he knew was working at a dealership in Atlanta, Ga., approached him to meet at his house,'

In [22]:
story_teller('My love')[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"My love for you, i can not accept you being tortured until you wish to see it directly upon your own. '' she cried tears again, still clutching at me for a moment. '' she tried to imagine what her eyes had been forced back into"

In [3]:
# Identify story-starter quotes that match a supplied query string
from datasets import load_dataset
raw_quote_dataset = load_dataset('mertbozkurt/quotes_philosophers',encoding="latin-1")
raw_quote_dataset

Downloading and preparing dataset text/mertbozkurt--quotes_philosophers to /root/.cache/huggingface/datasets/text/mertbozkurt--quotes_philosophers-da87b9e9f82c0fa9/0.0.0/4b86d314f7236db91f0a0f5cda32d4375445e64c5eda2692655dd99c2dac68e8...



Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/mertbozkurt--quotes_philosophers-da87b9e9f82c0fa9/0.0.0/4b86d314f7236db91f0a0f5cda32d4375445e64c5eda2692655dd99c2dac68e8. Subsequent calls will reuse this data.



DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2458
    })
})

In [5]:
raw_quote_dataset['train'][:3]['text']

["Be a free thinker and don't accept everything you hear as truth. Be critical and evaluate what you believe in.",
 'Excellence is never an accident. It is always the result of high intention, sincere effort, and intelligent execution; it represents the wise choice of many alternatives - choice, not chance, determines your destiny.',
 'To appreciate the beauty of a snow flake, it is necessary to stand out in the cold.']

In [6]:
raw_quote_dataset = raw_quote_dataset.map(count_str_length,batched=True)
raw_quote_dataset.set_format('pandas')


In [7]:
len_dist = raw_quote_dataset['train']['word_length'].value_counts().to_frame().reset_index().rename(columns={'index':'word_length','word_length':'len_count'}).sort_values('word_length',ascending=False)
len_dist

| | word_length | len_count |
|---|---|---|
| 99 | 109 | 1 |
| 102 | 108 | 1 |
| 94 | 107 | 2 |
| 96 | 106 | 1 |
| 89 | 105 | 2 |
| ... | ... | ... |
| 13 | 6 | 64 |
| 22 | 5 | 36 |
| 33 | 4 | 23 |
| 54 | 3 | 8 |


In [20]:
sum(len_dist['len_count'])

2458

In [8]:
sum(len_dist[(len_dist.word_length<50) & (len_dist.word_length>=5)]['len_count'])

2143

In [9]:
raw_quote_dataset.reset_format()
raw_quote_dataset = raw_quote_dataset.filter(lambda x: 5 <= x['word_length'] <= 50)
raw_quote_dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'word_length'],
        num_rows: 2160
    })
})
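Note that the count in cell `In [8]` uses `word_length < 50` while the filter uses `word_length <= 50`, which is why the two numbers differ (2143 versus 2160): the 17 extra rows are quotes of exactly 50 words. A toy illustration of the two bounds on made-up lengths:

```python
word_lengths = [3, 5, 20, 49, 50, 51]  # made-up lengths

# Upper bound excluded, as in the counting cell.
exclusive = [n for n in word_lengths if 5 <= n < 50]
# Upper bound included, as in the filter.
inclusive = [n for n in word_lengths if 5 <= n <= 50]

print(exclusive)
print(inclusive)
print(len(inclusive) - len(exclusive))
```

Either bound is defensible; what matters is using the same one in the exploratory count and in the filter so the numbers can be compared directly.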

In [10]:
from transformers import AutoTokenizer, AutoModel
import torch
ss_checkpoint = 'sentence-transformers/multi-qa-mpnet-base-dot-v1'
ss_tokenizer = AutoTokenizer.from_pretrained(ss_checkpoint)
ss_model = AutoModel.from_pretrained(ss_checkpoint)

# torch.device('cuda') is never None, so check GPU availability explicitly
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
ss_model.to(device)


MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

In [16]:
def cls_pooling(model_out):
    return model_out.last_hidden_state[:,0]

def generate_embeddings(el):
    tkzd_inputs = ss_tokenizer(el,padding=True,return_tensors='pt')
    tkzd_inputs = {k:v.to(device) for k,v in tkzd_inputs.items()}
    outputs = ss_model(**tkzd_inputs)
    return cls_pooling(outputs)
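
`cls_pooling` keeps only the hidden state of the first token of each sequence and uses it as the sentence embedding. On a tensor of shape `(batch, tokens, hidden)`, the slice `[:, 0]` yields shape `(batch, hidden)`. The sketch below shows the same selection on nested Python lists with made-up numbers, so it runs without `torch`; `cls_pooling_lists` is a hypothetical name used only here.

```python
def cls_pooling_lists(last_hidden_state):
    """Return the first token vector of every sequence.

    `last_hidden_state` is a nested list shaped (batch, tokens, hidden);
    this mirrors the tensor slice last_hidden_state[:, 0].
    """
    return [sequence[0] for sequence in last_hidden_state]

# Two sequences, three tokens each, hidden size two (made-up values).
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]],
]
pooled = cls_pooling_lists(hidden)
print(pooled)
```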

In [17]:
raw_quote_dataset = raw_quote_dataset['train']
embedding_space = raw_quote_dataset.map(lambda x:{'embedding':generate_embeddings(x['text']).detach().cpu().numpy()[0]})
embedding_space.add_faiss_index(column='embedding')


Dataset({
    features: ['text', 'word_length', 'embedding'],
    num_rows: 2160
})

In [20]:
def semantic_search(question):
    question_embedding = generate_embeddings([question]).detach().cpu().numpy()[0]
    scores,samples = embedding_space.get_nearest_examples('embedding',question_embedding,k=5)
    sample_df = pd.DataFrame.from_dict(samples)
    sample_df['scores'] = scores
    sample_df.sort_values('scores',ascending=False,inplace=True)
    return sample_df
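
Conceptually, `get_nearest_examples` compares the query embedding with every stored embedding and returns the `k` closest rows. The brute-force sketch below does this with squared L2 distance on made-up two-dimensional vectors. It is an illustration only: `nearest_examples` is a hypothetical helper, and the choice of L2 distance is an assumption about the index's default metric that is worth checking for the installed `datasets` and `faiss` versions.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_examples(query, embeddings, texts, k=2):
    """Return (scores, texts) of the k embeddings closest to `query`."""
    scored = sorted(
        (squared_l2(query, emb), text) for emb, text in zip(embeddings, texts)
    )
    top = scored[:k]
    return [score for score, _ in top], [text for _, text in top]

texts = ["quote A", "quote B", "quote C"]
embeddings = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
scores, matches = nearest_examples([1.0, 0.0], embeddings, texts, k=2)
print(scores, matches)
```

If the scores are distances, smaller means more similar, so sorting by score in descending order would list the least similar of the five hits first.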

In [58]:
story_starters = []
search_results = semantic_search('what is life?')
for _,row in search_results.iterrows():
    story_starters.append(row.text)
story_starters

['Life is nothing until it is lived; but it is yours to make sense of, and the of it is nothing other than the sense you choose.',
 'Life is the faculty of spontaneous activity, the awareness that we have powers.',
 'Life is the will to power; our natural desire to dominate and reshape the world to fit our own preferences and assert our personal strength to the fullest degree.',
 'Life in the true sense is perceiving or thinking.',
 'Life is a constant process of dying.']

In [59]:
from transformers import pipeline
story_teller = pipeline('text-generation',model='tiny-random-GPT2LMHeadModel-finetuned-corpus')
story_teller(story_starters)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[[{'generated_text': "Life is nothing until it is lived; but it is yours to make sense of, and the of it is nothing other than the sense you choose. '' `` look at me, she says. '' mrs. durnik, the words one"}],
 [{'generated_text': "Life is the faculty of spontaneous activity, the awareness that we have powers. '' `` think about this. '' he sighed. `` i am sure you understand. '' he repeated the words. '' he began to repeat himself again. '' a hand on"}],
 [{'generated_text': "Life is the will to power; our natural desire to dominate and reshape the world to fit our own preferences and assert our personal strength to the fullest degree.'''''''''''''''she continued."}],
 [{'generated_text': "Life in the true sense is perceiving or thinking. less worrying because someone is spending their time in temporary flevance.'' '' '' i'm not going to waste a second too long, so i go back to sleep.'m"}],
 [{'generated_text': "Life is a constant process of dying.'''''''''-she's been dead for about s