# Introduction

The code that follows uses transformer language models to perform two tasks in sequence, which together generate a short story script in English:
1. **Semantic search** with a transformer trained to produce sentence embeddings, which extracts from a larger corpus the philosophical quotes most relevant to an input keyword.
2. **Text generation** with a transformer fine-tuned for sentence completion, which uses the quotes retrieved in the first stage as the start of a new passage.

The Hugging Face Hub is used to load the required datasets and pretrained transformer models, and the Hugging Face libraries are used to fine-tune them for the tasks described above. If you would like to explore this further or try other language models, I recommend the Hugging Face NLP course linked [here](https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt).

This code was built and tested in the Kaggle ecosystem. If you use a similar notebook environment, it is advisable to first install the libraries that the Hugging Face models depend on.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
!pip install datasets
!pip install -U git+https://github.com/huggingface/transformers.git
!pip install -U git+https://github.com/huggingface/accelerate.git
!pip install faiss-cpu

# Sentence Completion using a Text generation transformer model

The second stage of the problem is tackled first: building a Hugging Face pipeline capable of generating a short story script when given one or more sentence starters.

1. The bookcorpus dataset available on the Hugging Face Hub can be used to fine-tune a GPT-based text generation model for story script generation. The full dataset, however, is too large to train on within a reasonable time in a resource-constrained environment, so for this exercise 60,000 samples are drawn at random (from the first and last 20% of the train split) and used for fine-tuning. The commented-out code below extracts these samples and splits them into training and validation sets (an 80/20 split here; any suitable ratio can be chosen). Uncomment and run it once, when the notebook is executed for the first time. The condensed dataset is then stored on disk, so the code can be commented out again for subsequent runs.

In [None]:
#from datasets import load_dataset
#import datasets
#raw_story_scripts_train = load_dataset('bookcorpus',split='train[:20%]+train[-20%:]')
#condensed_size = 60000
#raw_story_scripts={'train':raw_story_scripts_train}
#raw_story_scripts = datasets.DatasetDict(raw_story_scripts)
#raw_story_scripts_condensed = raw_story_scripts['train'].shuffle(seed=42).select(range(condensed_size))
#split_dataset = raw_story_scripts_condensed.train_test_split(test_size=0.2,seed=42)
#split_dataset['validation'] = split_dataset.pop('test')
#for split,dataset in split_dataset.items():
#    dataset.to_json(f'tiny-book-corpus-{split}.jsonl')
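The shuffle, select and split logic of the cell above can be sketched with the standard library alone. This is only an illustration of the idea, not the `datasets` implementation: the function name `sample_and_split` and the toy corpus are hypothetical.

```python
import random

def sample_and_split(records, sample_size, test_fraction, seed=42):
    """Shuffle, keep `sample_size` records, then split into train/validation."""
    rng = random.Random(seed)            # seeded so the subset is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    subset = shuffled[:sample_size]
    n_valid = int(len(subset) * test_fraction)
    return {"train": subset[n_valid:], "validation": subset[:n_valid]}

corpus = [f"sentence {i}" for i in range(1000)]   # stand-in for the real corpus
splits = sample_and_split(corpus, sample_size=100, test_fraction=0.2)
print({name: len(rows) for name, rows in splits.items()})
```

With the notebook's settings (60,000 samples, `test_size=0.2`) the same arithmetic gives 48,000 training and 12,000 validation samples.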

2. The stored 60,000 samples can be loaded by providing the path to the local directory in which they are saved. Make sure to replace the path string **/kaggle/input/inputs/** with the correct local directory path.

In [None]:
from datasets import load_dataset
data_files = {'train':'/kaggle/input/inputs/tiny-book-corpus-train.jsonl','test':'/kaggle/input/inputs/tiny-book-corpus-validation.jsonl'}
split_dataset = load_dataset('json',data_files=data_files)
split_dataset['validation'] = split_dataset.pop('test')

In [None]:
split_dataset
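The `.jsonl` files loaded above are in JSON Lines format: one JSON object per line, each holding one sample. A minimal round trip with the standard library shows what such a file looks like. The file name and the two rows are made up for illustration; only the `text` key mirrors the column used in this notebook.

```python
import json
import os
import tempfile

rows = [{"text": "the first sample sentence"}, {"text": "the second sample sentence"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tiny-example.jsonl")   # hypothetical file name
    # Write one JSON object per line
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    # Read the file back, parsing each non-empty line independently
    with open(path, encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == rows)
```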

3. The training corpus is tokenized and encoded, and then used to fine-tune a text generation model (the code in this notebook does not freeze any layers, so all model weights are updated). The model used here is a GPT-2-style model loaded from the 'distilgpt2' checkpoint, chosen because it is a lightweight version of the larger GPT-2 model. The Hugging Face Hub makes it possible to load pretrained transformer models together with the tokenizers that are compatible with them.

In [None]:
from transformers import AutoTokenizer, GPT2LMHeadModel
model_checkpoint = 'distilgpt2'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = GPT2LMHeadModel.from_pretrained(model_checkpoint)

4. Training on very short sentences is unlikely to be an effective way of learning story generation. The next step therefore computes the word count of each text sample and filters out samples that are fewer than 15 words long.

In [None]:
def count_str_length(el):
    return {'word_length':[len(sample.split(' ')) for sample in el['text']]}


split_dataset = split_dataset.map(count_str_length,batched=True)
split_dataset.set_format('pandas')
split_dataset.reset_format()
split_dataset = split_dataset.filter(lambda x: x['word_length']>=15)
split_dataset
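One detail of the word count above is worth knowing: `split(' ')` splits on every single space, so doubled or trailing spaces produce empty strings that are counted as words. The sketch below compares it with `split()`, which splits on runs of whitespace. The helper names and the sample string are illustrative only.

```python
def word_count_by_space(text):
    # Same rule as count_str_length(): split on each individual space
    return len(text.split(' '))

def word_count_by_whitespace(text):
    # Alternative: split on runs of whitespace, ignoring leading/trailing space
    return len(text.split())

sample = "it was  a dark and stormy night "   # note the double and trailing spaces
print(word_count_by_space(sample))        # 9
print(word_count_by_whitespace(sample))   # 7
```

For a threshold of 15 words the difference rarely matters, but it can let a few slightly shorter samples through the filter.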

5. The input strings are then tokenized and encoded by the tokenizer object, and the result is stored under the key 'input_ids' in the dataset. Note that the tokens are truncated to a length of 512, and any tokens cut off by the truncation are kept as additional entries in the encoded output rather than discarded.

In [None]:
context_length=512
def tokenize_scripts(el):
    tok_output = tokenizer(el['text'], truncation=True, max_length=context_length,
                    return_overflowing_tokens=True,return_length=True)
    return {'input_ids':tok_output['input_ids']}

tokenized_dataset = split_dataset.map(tokenize_scripts,batched=True,
                                      remove_columns=split_dataset['train'].column_names)
tokenized_dataset
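The effect of truncation with overflow can be pictured as slicing a long list of token ids into consecutive windows. The sketch below assumes no overlap between windows and no special tokens added by the tokenizer; `split_into_chunks` is a hypothetical helper, not part of the tokenizer API.

```python
def split_into_chunks(token_ids, max_length):
    """Cut a token id list into consecutive windows of at most max_length."""
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), max_length)]

token_ids = list(range(1200))              # stand-in for one long encoded text
chunks = split_into_chunks(token_ids, max_length=512)
print([len(chunk) for chunk in chunks])    # [512, 512, 176]
```

This is also why the tokenized dataset can contain more rows than the input: one long text may yield several entries.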

6. Next, a data collator of the DataCollatorForLanguageModeling class is instantiated to assemble the encoded tokens into batches for fine-tuning the model.

In [None]:
from transformers import DataCollatorForLanguageModeling
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,mlm=False)
out = data_collator([tokenized_dataset['train'][i] for i in range(3)])
for key in out:
    print(f'{key} shape:{out[key].shape}')
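As a rough picture of what the collator produces when `mlm=False`, the sketch below pads a batch to a common length and builds the labels as a copy of the input ids, with padding positions set to -100 so that the loss ignores them. It is a plain-Python approximation under those assumptions, not the library code; `collate_for_causal_lm` is a hypothetical name.

```python
IGNORE_INDEX = -100   # label value that the cross-entropy loss skips

def collate_for_causal_lm(sequences, pad_token_id):
    """Right-pad sequences to equal length and derive attention mask and labels."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded = seq + [pad_token_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Labels mirror the inputs; every pad-id position is masked out
        labels.append([IGNORE_INDEX if tok == pad_token_id else tok for tok in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

batch = collate_for_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["labels"])   # [[5, 6, 7], [8, 9, -100]]
```

Because the pad token is set to the end-of-sequence token in this notebook, masking by pad id also hides any genuine end-of-sequence tokens from the loss.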

In [None]:
from huggingface_hub import notebook_login
notebook_login()

7. The Hugging Face libraries provide the TrainingArguments and Trainer classes, which abstract away the setup and configuration of the training loop. They are used to set the training hyperparameters, such as the number of epochs, learning rate and weight decay, and to specify the encoded training and validation sets. The fine-tuned model is stored under the name 'tiny-random-GPT2LMHeadModel-finetuned-corpus' on the Hugging Face Hub so that it can later be loaded through a pipeline and used for story script generation.

In [None]:
from transformers import TrainingArguments, Trainer
args = TrainingArguments(
    'tiny-random-GPT2LMHeadModel-finetuned-corpus',
    overwrite_output_dir=True,
    num_train_epochs=3,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    learning_rate=5e-4,
    weight_decay = 0.01,
    fp16=True,
    push_to_hub=True
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset = tokenized_dataset['train'],
    eval_dataset = tokenized_dataset['validation'],
    tokenizer = tokenizer,
    data_collator = data_collator
)
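To get a feel for what these hyperparameters imply, the sketch below estimates the number of optimizer steps and the learning rate at a given step. It assumes a batch size of 8 on a single device, no gradient accumulation, and a linear decay to zero without warmup. I believe these match the Trainer defaults, but treat them as assumptions. The sample count of 1,000 is made up.

```python
import math

def total_training_steps(num_samples, batch_size, num_epochs):
    """Optimizer steps when every batch triggers exactly one update."""
    return math.ceil(num_samples / batch_size) * num_epochs

def linear_decay_lr(step, total_steps, peak_lr):
    """Learning rate that falls linearly from peak_lr to zero."""
    return peak_lr * max(0.0, (total_steps - step) / total_steps)

steps = total_training_steps(num_samples=1000, batch_size=8, num_epochs=3)
print(steps)                                    # 375
print(linear_decay_lr(0, steps, 5e-4))          # learning rate at the first step
print(linear_decay_lr(steps, steps, 5e-4))      # learning rate at the last step
```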

8. The train() method of the Trainer object starts the fine-tuning process, which will take some time. Once it completes, the model is pushed to the Hugging Face Hub.

In [None]:
trainer.train()
trainer.push_to_hub()

In [None]:
#from transformers import pipeline
#story_teller = pipeline('text-generation',model='tiny-random-GPT2LMHeadModel-finetuned-corpus')
#story_teller('Once upon')

In [None]:
#story_teller('You are')[0]['generated_text']

In [None]:
#story_teller('Roses are red')[0]['generated_text']

In [None]:
#normal_generator = pipeline('text-generation')
#normal_generator('Roses are red')[0]['generated_text']

In [None]:
#normal_generator('Once upon')[0]['generated_text']

In [None]:
#story_teller('My love')[0]['generated_text']

# Semantic search for seeding phrases through a Sentence Embedding Transformer model

Now that the story generation model is ready, it is time to build the other component, which returns the philosophical quotes that best match an input word or phrase representing a story genre. The matching logic is implemented with semantic search, which returns the quotes whose embeddings are most similar to the embedding of the input.

1. The semantic search uses the corpus of philosophical quotes available on the Hugging Face Hub. The corpus is limited to quotes between 5 and 50 words long, so that the subsequent text generation model has enough room to generate further words and the story remains meaningful.

In [None]:
from datasets import load_dataset
raw_quote_dataset = load_dataset('mertbozkurt/quotes_philosophers',encoding="latin-1")
raw_quote_dataset = raw_quote_dataset.map(count_str_length,batched=True)
raw_quote_dataset = raw_quote_dataset.filter(lambda x: (x['word_length']<=50)&(x['word_length']>=5))
raw_quote_dataset

In [None]:
#len_dist = raw_quote_dataset['train']['word_length'].value_counts().to_frame().reset_index().rename(columns={'index':'word_length','word_length':'len_count'}).sort_values('word_length',ascending=False)
#len_dist

In [None]:
#sum(len_dist['len_count'])

In [None]:
#sum(len_dist[(len_dist.word_length<50) & (len_dist.word_length>=5)]['len_count'])

In [None]:
#raw_quote_dataset.reset_format()
#raw_quote_dataset = raw_quote_dataset.filter(lambda x: (x['word_length']<=50)&(x['word_length']>=5))
#raw_quote_dataset

2. The tokenizer and architecture for the sentence embedding model used here correspond to the 'multi-qa-mpnet-base-dot-v1' checkpoint. The model and its compatible tokenizer are loaded with the corresponding Hugging Face classes.

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
ss_checkpoint = 'sentence-transformers/multi-qa-mpnet-base-dot-v1'
ss_tokenizer = AutoTokenizer.from_pretrained(ss_checkpoint)
ss_model = AutoModel.from_pretrained(ss_checkpoint)

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
ss_model.to(device)

3. Semantic search works by generating embedding vectors for both the search string and the philosophical quote corpus, and then computing the dot-product similarity between them to find the closest matching quotes in the corpus. The function that follows takes a text phrase as input and generates its embedding vector through CLS pooling, that is, by taking the last hidden state of the first token of each sequence.

In [None]:
def cls_pooling(model_out):
    return model_out.last_hidden_state[:,0]

def generate_embeddings(el):
    tkzd_inputs = ss_tokenizer(el,padding=True,return_tensors='pt')
    tkzd_inputs = {k:v.to(device) for k,v in tkzd_inputs.items()}
    outputs = ss_model(**tkzd_inputs)
    return cls_pooling(outputs)
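The indexing `[:, 0]` in `cls_pooling()` can be read as "for every sequence in the batch, keep the vector of its first token". The same operation on plain nested lists, with made-up numbers and a hypothetical helper name, looks like this:

```python
def cls_pooling_lists(last_hidden_state):
    """last_hidden_state is a nested list shaped [batch][sequence][hidden]."""
    return [sequence[0] for sequence in last_hidden_state]

# Toy batch: 2 sequences, 3 tokens each, hidden size 2
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
]
pooled = cls_pooling_lists(hidden)
print(pooled)   # [[0.1, 0.2], [0.7, 0.8]]
```

Each input text therefore ends up as a single vector whose length equals the hidden size of the model, regardless of how many tokens the text had.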

4. The generate_embeddings() function is then applied to the corpus of quotes to create an embedding vector for each quote. For computing dot-product similarity on the embedding vectors, the faiss (Facebook AI Similarity Search) library is used.

In [None]:
raw_quote_dataset = raw_quote_dataset['train']
embedding_space = raw_quote_dataset.map(lambda x:{'embedding':generate_embeddings(x['text']).detach().cpu().numpy()[0]})
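import faiss
# Request an inner-product (dot-product) index explicitly; without metric_type
# the flat faiss index is expected to fall back to L2 distance instead
embedding_space.add_faiss_index(column='embedding', metric_type=faiss.METRIC_INNER_PRODUCT)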
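Conceptually, an exact dot-product search scores the query against every corpus vector and keeps the highest scores. The brute-force sketch below illustrates that idea with plain lists. It is not how faiss is implemented, and the vectors and function names are made up.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_k_by_dot_product(query, vectors, k):
    """Return (score, index) pairs for the k vectors with the largest dot product."""
    scored = [(dot(query, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

corpus_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [1.0, 0.2]
results = top_k_by_dot_product(query, corpus_vectors, k=2)
print([idx for _, idx in results])   # indices of the two best matches
```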

5. The search string is likewise converted into an embedding vector and compared with the embedding vectors of the corpus through the dot-product similarity metric, using the faiss index. The semantic_search() function returns the 5 quotes from the corpus that best match the search string passed as its argument.

In [None]:
def semantic_search(question):
    question_embedding = generate_embeddings([question]).detach().cpu().numpy()[0]
    scores,samples = embedding_space.get_nearest_examples('embedding',question_embedding,k=5)
    sample_df = pd.DataFrame.from_dict(samples)
    sample_df['scores'] = scores
    sample_df.sort_values('scores',ascending=False,inplace=True)
    return sample_df
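The descending sort in `semantic_search()` treats larger scores as better, which is the right reading for dot-product scores. If an index reports L2 distances instead, smaller is better, and the two metrics can even disagree on the best match when vectors are not normalized. Which metric applies depends on how the faiss index was built. The toy vectors below are made up to show the disagreement.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def squared_l2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

query = [1.0, 0.0]
candidates = {"short_aligned": [0.9, 0.0], "long_aligned": [3.0, 0.0]}

# Dot product: larger is better, so take the maximum
best_by_dot = max(candidates, key=lambda name: dot(query, candidates[name]))
# L2 distance: smaller is better, so take the minimum
best_by_l2 = min(candidates, key=lambda name: squared_l2(query, candidates[name]))

print(best_by_dot)   # long_aligned
print(best_by_l2)    # short_aligned
```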

In [None]:
story_starters = []
search_results = semantic_search('what is life?')
for _,row in search_results.iterrows():
    story_starters.append(row.text)
story_starters

In [None]:
from transformers import pipeline
story_teller = pipeline('text-generation',model='tiny-random-GPT2LMHeadModel-finetuned-corpus')
story_teller(story_starters)
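When the pipeline is called with a list of prompts, I would expect it to return one list of candidates per prompt, each candidate being a dictionary with a 'generated_text' key. Under that assumption, the stories can be collected as follows. The `mock_outputs` value is hand-written for illustration and is not real model output.

```python
# Hand-written stand-in for the value returned by story_teller(story_starters)
mock_outputs = [
    [{"generated_text": "Life is short, and so the traveller set out at dawn."}],
    [{"generated_text": "To live is to suffer, she thought, closing the door."}],
]

def extract_texts(outputs):
    """Flatten the per-prompt candidate lists into a plain list of strings."""
    return [candidate["generated_text"] for per_prompt in outputs for candidate in per_prompt]

stories = extract_texts(mock_outputs)
for story in stories:
    print(story)
```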