# Making your dataset
This notebook contains logic for converting a text file of conversations into the format required by PersonaChat.
The text file should contain conversations made up of lines of dialogue that alternate between two speakers.  
Each utterance should be on its own line.
Conversations should be separated by a single blank line (that is, a double newline, '\n\n').  
Each conversation should begin with speaker 1, the customer, followed by speaker 2, the bot.

Example conversation format:
```
hours of operation  
All BotBank locations are open 7am to 4pm monday through friday! What else can I help with?  
that's all  
Thank you for using BotBank.  

what are the hours?  
All BotBank locations are open 7am to 4pm monday through friday! What else can I help with?  
thats all I needed  
Glad I could help. Thanks for choosing BotBank. Have a nice day.  
```
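
Before converting a file, it can help to confirm that it follows this layout. The sketch below is not part of the original notebook; `check_conversations` is a hypothetical helper, and it assumes every customer line should be paired with a bot reply, so a conversation with an odd number of lines is flagged.

```python
def check_conversations(text):
    """Return (index, problem) pairs for conversations that break the expected layout."""
    problems = []
    # Conversations are separated by a blank line, i.e. a double newline.
    for idx, convo in enumerate(text.strip().split('\n\n')):
        lines = [line.strip() for line in convo.split('\n')]
        if any(not line for line in lines):
            problems.append((idx, 'empty utterance'))
        # Customer and bot alternate, so a complete conversation has an even line count.
        if len(lines) % 2 != 0:
            problems.append((idx, 'odd number of utterances'))
    return problems

sample = (
    "hours of operation\n"
    "All BotBank locations are open 7am to 4pm monday through friday!\n"
    "that's all\n"
    "Thank you for using BotBank.\n"
    "\n"
    "what are the hours?\n"
    "All BotBank locations are open 7am to 4pm monday through friday!\n"
    "thats all I needed\n"
)
print(check_conversations(sample))
# [(1, 'odd number of utterances')]
```

The second conversation in `sample` is deliberately missing its final bot reply, which is why it is reported.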



In [41]:
import json
from random import choice

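Read the conversations from a text file.   
The file must follow the format described above; the parser relies on a double newline ('\n\n') to divide conversations.  
Each conversation is lowercased, stripped of commas and apostrophes, and has its remaining punctuation split off as separate tokens before being parsed into a list of lists.  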

In [44]:
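with open('conversations.txt') as txt:
    # Strip surrounding whitespace so a final newline does not create an empty utterance.
    data = txt.read().strip()

# Conversations are separated by a blank line.
convos = data.split('\n\n')
convo_list = []

for convo in convos:
    fixed = convo.lower()
    # Split '.', '?' and '!' into separate tokens; drop commas and apostrophes.
    fixed = fixed.replace('.', ' .').replace(',', '').replace('?', ' ?').replace('!', ' !').replace("'", "")
    convo_list.append(fixed.split('\n'))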

convo_list[1]

['hi i want to close my account .',
 'you wish to close an existing account . is that correct ?',
 'thats right .',
 'i can log a request to speed up the process . please give me your first and last name .',
 'my name is <f_name> <l_name>',
 'your first and last name is <f_name> <l_name> . is that correct ?',
 'correct .',
 'a request has been logged . is there anything else i can help with ?',
 'thats all .',
 'thank you for using botbank .']
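
To see what the normalization does to a single line, the same chain of replacements can be wrapped in a small function. This is only an illustrative sketch; the name `normalize` is an assumption and does not appear in the notebook.

```python
def normalize(utterance):
    """Apply the notebook's text cleanup to one utterance."""
    text = utterance.lower()
    return (
        text.replace('.', ' .')   # split sentence-final periods into their own token
            .replace(',', '')     # drop commas
            .replace('?', ' ?')   # split question marks
            .replace('!', ' !')   # split exclamation marks
            .replace("'", "")     # drop apostrophes
    )

print(normalize("That's all, thanks!"))
# thats all thanks !
print(normalize("Thank you for using BotBank."))
# thank you for using botbank .
```

Because the replacement is purely textual, a period inside a number is split as well: `normalize("3.50")` gives `3 .50`, which may matter if the conversations mention amounts.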

Additional text is required to fill the 'candidates' section of the training data.  
The next cell reads replies from a text file drawn from a financial question-and-answer dataset.   
Only lines longer than 40 and shorter than 100 characters are kept; all others are filtered out.  

In [46]:
distractors = []
with open('financial_responses.txt') as e_file:
    for reply in e_file:
        # Keep only lines of moderate length (the count includes the trailing newline).
        if 40 < len(reply) < 100:
            distractors.append(reply.replace('\n', ''))
distractors[:5]

['but it depends on the circumstances and what it is you want to deduct',
 'expenses must be reasonable and appropriate deductions for extravagant expenses are not allowable',
 'more information is available in publication 463 travel entertainment gift and car expenses',
 'more discussion of the rules and limitations can be found in publication 463',
 'edit for meal expenses amount of standard meal allowance']
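
These replies are later drawn at random to serve as wrong answers next to the bot's real reply. The notebook draws them with `choice`, which can return the same reply twice. One possible refinement, sketched here as an assumption rather than as the notebook's method, is to sample without replacement and to exclude the true reply. `pick_distractors` is a hypothetical helper name.

```python
from random import Random

def pick_distractors(pool, gold_reply, k=5, rng=None):
    """Draw k distinct distractors from pool, never returning the true reply."""
    rng = rng or Random()
    # dict.fromkeys removes duplicate replies while keeping their order.
    eligible = [reply for reply in dict.fromkeys(pool) if reply != gold_reply]
    # Random.sample draws without replacement and raises ValueError if k is too large.
    return rng.sample(eligible, k)

toy_pool = ['reply %d' % n for n in range(10)] + ['thank you for using botbank .']
picked = pick_distractors(toy_pool, 'thank you for using botbank .', k=5, rng=Random(0))
print(picked)
```

Passing a seeded `Random` instance makes the draw reproducible, which is convenient when regenerating the dataset.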

The chatbot created by Hugging Face conditions its replies on a persona, a short list of sentences that gives the bot some context.  
The following cell establishes the personality of the chatbot.   

In [48]:
personality = [
    "i am here to help you with your questions and requests .",
    "i am a customer support helper for botbank .",
    "botbank is a globally trusted financial institution .",
    "botbank offers a wide variety of financial services .",
    "i am a customer support engine .",
    "i cannot do some things that my human counterparts can but i can still help ."
]

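Divide the data into training and validation sets (an 80/20 split), attach the persona, and break each conversation into utterance entries.  
Each entry holds the dialogue history up to a customer line and a list of candidates: five randomly chosen distractors followed by the bot's actual reply.  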

In [49]:
def build_dialogues(convos):
    """Convert parsed conversations into PersonaChat-style dialogue entries."""
    dialogues = []
    for convo in convos:
        utts = []
        # Step through the customer lines; each one is followed by the bot's reply.
        for turn in range(0, len(convo) - 1, 2):
            utterance = {}
            # Five random distractors, with the true reply in the last position.
            utterance['candidates'] = [choice(distractors) for _ in range(5)]
            utterance['candidates'].append(convo[turn + 1])
            utterance['history'] = convo[:turn + 1]
            utts.append(utterance)
        dialogues.append({'personality': personality, 'utterances': utts})
    return dialogues

# Use the first 80% of the conversations for training and the rest for validation.
train_length = int(len(convo_list) * 0.8)

train = build_dialogues(convo_list[:train_length])
validate = build_dialogues(convo_list[train_length:])

train_data = {}
train_data['train'] = train
train_data['validate'] = validate
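
A quick structural check can catch mistakes before the data is written out. The sketch below is an addition, not part of the original notebook: `check_entry` is a hypothetical helper, and the toy entry only mirrors the layout produced by the cell above (a persona list plus utterances that each hold a `history` and six `candidates`).

```python
def check_entry(entry, n_candidates=6):
    """Raise AssertionError if a dialogue entry does not match the expected layout."""
    assert isinstance(entry['personality'], list)
    for utt in entry['utterances']:
        # History always ends on a customer line, so its length is odd.
        assert len(utt['history']) % 2 == 1
        # Five distractors plus the true reply in the last position.
        assert len(utt['candidates']) == n_candidates
    return True

toy_entry = {
    'personality': ['i am a customer support engine .'],
    'utterances': [
        {
            'candidates': ['d1', 'd2', 'd3', 'd4', 'd5', 'we are open 7am to 4pm .'],
            'history': ['hours of operation'],
        },
        {
            'candidates': ['d1', 'd2', 'd3', 'd4', 'd5', 'thank you for using botbank .'],
            'history': ['hours of operation', 'we are open 7am to 4pm .', 'thats all'],
        },
    ],
}
print(check_entry(toy_entry))
# True
```

In the notebook, the same check could be run over every element of `train_data['train']` and `train_data['validate']`.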


Save the training data to a JSON file.  

In [50]:
with open('cs_training_data.json', 'w') as file:
    json.dump(train_data, file)
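
After saving, the file can be read back to confirm that it parses and that the split sizes look right. This is a minimal sketch under stated assumptions: `summarize` is a hypothetical helper, and the demonstration writes a throwaway file so that it runs on its own. In the notebook, the path would be 'cs_training_data.json'.

```python
import json
import os
import tempfile

def summarize(path):
    """Load a saved dataset and return the number of dialogues in each split."""
    with open(path) as f:
        data = json.load(f)
    return {split: len(dialogues) for split, dialogues in data.items()}

# Toy data with the same top-level keys as the notebook's output.
toy = {
    'train': [{'personality': [], 'utterances': []} for _ in range(4)],
    'validate': [{'personality': [], 'utterances': []}],
}
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, 'toy.json')
    with open(toy_path, 'w') as f:
        json.dump(toy, f)
    summary = summarize(toy_path)

print(summary)
# {'train': 4, 'validate': 1}
```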