## This code was generated on Ubuntu 16.10 using python3. 

### Table of Contents

* [Venv and Imports](#chapter1)
* [Transformers - Questions Generation](#chapter3)
    * [Improvements](#chapter3.1)
* [Speech2Text](#chapter4)

# Venv and Imports <a class="anchor" id="chapter1"></a>
<h1> Create a virtual environment </h1>
<p> Run this Jupyter notebook in the virtual environment. </p>

In [1]:
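!python3 -m venv linc_sayali
# Each `!` command runs in its own subshell, so this activation does not persist for later cells.
# To actually use the environment, start Jupyter from the activated venv or register it as a kernel.
! . linc_sayali/bin/activate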

Install the following packages in the virtual environment if they are not already available.

In [2]:
!pip3 install --upgrade transformers
!pip3 install torch
!pip3 install nlp
!pip3 install nltk
!pip3 install SpeechRecognition
!pip3 install sentencepiece
!pip3 install tensorflow_hub

Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Defaulting to user installation because normal site-packages is not writeable

If the following imports don't work, try restarting the notebook kernel. Newly installed packages are sometimes not visible to the notebook until the kernel restarts.

In [3]:
import json
import time
from scipy.spatial.distance import cosine

import pandas as pd
import numpy as np

from transformers import AutoModelWithLMHead, AutoTokenizer, AutoModelForSeq2SeqLM
import matplotlib
import nltk
import nlp
import torch
import transformers 
import speech_recognition as sr
import tensorflow as tf
import tensorflow_hub as hub
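
If an import above fails, it can help to check which modules the current kernel actually sees before restarting it. The following is a minimal sketch using only the standard library; the helper name `find_missing_modules` is hypothetical, and the list holds the import names used in the cell above.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the names from module_names that the current kernel cannot import."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


# Import names used in this notebook (the SpeechRecognition package
# is imported as speech_recognition)
required = ["scipy", "pandas", "numpy", "transformers", "nltk", "nlp",
            "torch", "speech_recognition", "tensorflow", "tensorflow_hub"]
print("Missing modules:", find_missing_modules(required))
```

An empty list means every module is visible to the kernel; any name that is printed still needs to be installed in the environment the kernel runs in.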

# Transformers - Questions Generation <a class="anchor" id="chapter3"></a>

In [4]:
class AutoQAGeneration:
    
    def __init__(self, data_path):
        self.data_path = data_path
        
    def read_data(self):
        """
        Read the catalog and rename its columns to a human-readable format; the renamed columns are used as inputs to the transformer.
        """
        rename_columns = {
            'product_id': 'product id',
            'category_name': 'category',
            'price': 'price',
            'rating': 'rating',
            'name': 'name',
            'attribute.fittext': 'fit measurements',
            'attribute.colour': 'color',
            'color_group': 'color group',
            'attribute.size': 'size',
            'attribute.gender': 'gender',
            'attribute.style': 'style',
            'attribute.itemtype': 'type',
            'attribute.material': 'material',
            'google_product_category': 'google product category',
            'condition': 'condition',
            'shop_info': 'shop information',
            'age_group': 'age group',
            'sku_code': 'code',
            'department': 'department',
            'fitname': 'fit name',
            'fit': 'fit',
            'attribute.description': 'description'
        }
        data = pd.read_csv(self.data_path, sep='\t')
        data = data.rename(rename_columns, axis=1)
        return data
        
    def get_question(self, answer, context, tokenizer, model, max_length=128):
        """
        This function takes an answer and a context as inputs and outputs a question generated from the context and
        oriented toward the answer.
        :param answer: answer used to generate a focussed question
        :param context: text from which the question is derived
        :param tokenizer: pretrained tokenizer matching the model
        :param model: pretrained T5 question-generation model
        :param max_length: max length of the generated question, in tokens
        :return: generated question
        """
        input_text = "answer: %s  context: %s </s>" % (answer, context) # Formatting input for T5
        features = tokenizer([input_text], return_tensors='pt') # Creating inputs and attention mask using pretrained tokenizer

        output = model.generate(input_ids=features['input_ids'], 
                   attention_mask=features['attention_mask'],
                   max_length=max_length)
        return tokenizer.decode(output[0])
    
    def print_to_file(self, PATH, dict_object, keys, values):
        # Build one JSON object per dictionary entry; sets are converted to lists for JSON serialization
        records = [{keys: k, values: list(v)} for k, v in dict_object.items()]
        # Open in write mode: appending a second array on a re-run would make the file invalid JSON
        with open(PATH, "w") as f:
            json.dump(records, f, indent=4)
    
    def create_answers(self, write_to_file=False):
        aq_dict = {}
        qa_dict = {}
        
        data = self.read_data()
        
        """
        Answers are generated as the value of every column in each row. For that answer, create a context and generate a question. 
        This process will repeat for each value in the dataframe. 
        Time complexity = O(n*m)
        n = num of rows
        m = num of columns
        """
        # Using the T5 transformer model for text generation: a pretrained model generates questions given the context
        # and answer.

        # AutoTokenizer tokenizes the inputs for the T5 model. AutoModelForSeq2SeqLM loads the encoder-decoder model
        # with its language-modeling head, which generates the question. It replaces AutoModelWithLMHead, which is
        # deprecated in recent Transformers releases.
        tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-question-generation-ap")
        model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-question-generation-ap")
        
        for num in range(len(data)):
            # Keep each row as a pandas Series, so that the column names can be used when forming the context
            curr_data = data.iloc[num]
            if num%5==0:
                print("loop # ", num, "/", str(len(data)))
            for i in data.columns:
                # create context and answer
                # keeping the context short, so that questions are particular and focussed to that answer 
                context = str(curr_data['gender']) + " " + str(curr_data['type']) + " " + i + " " + str(curr_data[i])
                answer = str(curr_data[i])

                # generate question using T5 model
                question = self.get_question(answer, context, tokenizer, model).replace("<pad> question: ", "").replace('</s>', '')

                # create answer-question dictionary to put in file
                # key = answer, value = questions
                if answer in aq_dict:
                    aq_dict[answer].add(question)
                else:
                    aq_dict[answer] = {question}

                # create question-answer dictionary to put it in file, later used for question matching
                # key = question, value = answers
                if question in qa_dict:
                    qa_dict[question].add(answer)
                else:
                    qa_dict[question] = {answer}
                      
        if write_to_file:
            # autogenerated_QA.txt was created earlier and is used for the next task
            # The filenames below (with the _1 suffix) are used to test writing the output from code to file
            PATH = "autogenerated_QA_1.txt"
            self.print_to_file(PATH, dict_object=aq_dict, keys='answer', values='questions')
            
            PATH = "autogenerated_questions_1.txt"
            self.print_to_file(PATH, dict_object=qa_dict, keys='question', values='answers')
                      
        return qa_dict

In [5]:
qa_gen = AutoQAGeneration('product_catalog.tsv')
qa_dict = qa_gen.create_answers(write_to_file=True)

The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.


loop #  0 / 78


Token indices sequence length is longer than the specified maximum sequence length for this model (743 > 512). Running this sequence through the model will result in indexing errors


loop #  5 / 78
loop #  10 / 78
loop #  15 / 78
loop #  20 / 78
loop #  25 / 78
loop #  30 / 78
loop #  35 / 78
loop #  40 / 78
loop #  45 / 78
loop #  50 / 78
loop #  55 / 78
loop #  60 / 78
loop #  65 / 78
loop #  70 / 78
loop #  75 / 78


## Accuracy
The generated data was manually checked to confirm that the questions and answers are consistent with the catalog.

## Improvements <a class="anchor" id="chapter3.1"></a>

The question generation task can be improved with the methods below:

- Training the T5 model for question generation and fine-tuning it to our needs. I didn't train the model because our dataset uses generic words on which T5 models are already pretrained.

- Creating carefully curated contexts to generate a wide variety of questions. This can be done with the method used in my code, where I combine different columns along with the column names to build a short and concise context for the question to be generated. With a short context, generating different types of questions for one given answer is still a challenge, since the same question can be phrased in many ways in English.

- Automatic answer generation, e.g. with the code below. It takes in the context and generates answers automatically without any given questions. This is useful for generating the questions. When I used this method, it didn't provide the desired answers. It could be improved further by training the model on our dataset to generate the answers, which could then be used to generate questions.

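```
# The QA model needs its own tokenizer; `tokenizer` in the class above is local to create_answers
tokenizer_qa = AutoTokenizer.from_pretrained("valhalla/t5-small-qa-qg-hl")
model_qa = AutoModelForSeq2SeqLM.from_pretrained("valhalla/t5-small-qa-qg-hl")

def get_question_answer(context, max_length=128):
    input_text = "context: %s </s>" % (context)
    # Create input ids and attention mask for the QA model
    features = tokenizer_qa([input_text], return_tensors='pt')

    output = model_qa.generate(input_ids=features['input_ids'],
                               attention_mask=features['attention_mask'],
                               max_length=max_length)

    return tokenizer_qa.decode(output[0])
```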

- Using similar words to increase the question set. E.g. the jeans category is bottoms, which could also be called pants; men could also be called males, etc.

- Building a text generator that automatically generates a query from a question in order to find a better answer. E.g. for "Are there jeans available in women's department below \$50?", it should generate an SQL-like query that can answer that question.
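
The synonym idea from the list above could be sketched roughly as follows. This is an illustration only: the `SYNONYMS` table and the helper `expand_with_synonyms` are hypothetical, and the word pairs are just the examples mentioned above, not a curated vocabulary. Each variant could then be passed as a context to the question generator.

```python
# Hypothetical synonym table built from the examples mentioned above
SYNONYMS = {
    "bottoms": ["pants"],
    "men": ["males"],
}


def expand_with_synonyms(text, synonyms=SYNONYMS):
    """Return the original text plus one variant per single-word synonym substitution."""
    words = text.split()
    variants = [text]
    for idx, word in enumerate(words):
        # Look up the lowercased word so that "Men" and "men" are treated alike
        for alt in synonyms.get(word.lower(), []):
            variant = words[:idx] + [alt] + words[idx + 1:]
            variants.append(" ".join(variant))
    return variants


print(expand_with_synonyms("men bottoms color Black"))
```

Only one word is substituted per variant, which keeps the number of generated contexts linear in the number of synonyms rather than combinatorial.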

## Reference links
### Code
- https://github.com/huggingface/transformers

### Theory
- https://www.coursera.org/learn/attention-models-in-nlp/home/welcome


# Speech2Text <a class="anchor" id="chapter4"></a>

## Steps followed
- First, the utterances are recorded in .flac format. 
- They are then converted into strings. 
- A pretrained model is used to create sentence embeddings for the utterances and for the questions in the database. 
- The utterance and the questions are matched based on their similarity score (cosine distance between embeddings). 
- The top 3 answers are returned according to the utterance matching. 
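
The final ranking step can be sketched with the standard library alone. This is a simplified illustration rather than the class that follows: it assumes the distances have already been computed, and the helper name `top_matches` and the sample scores are made up. Keeping (distance, question) pairs in a list means two questions with the same distance are both retained.

```python
import heapq


def top_matches(scored_questions, threshold=0.5, k=3):
    """Return up to k questions whose distance is below threshold, closest first.

    scored_questions: iterable of (distance, question) pairs.
    """
    # Discard questions that are too far from the utterance
    candidates = [(dist, q) for dist, q in scored_questions if dist < threshold]
    # Pick the k smallest distances
    best = heapq.nsmallest(k, candidates, key=lambda pair: pair[0])
    return [q for _, q in best]


# Made-up distances for illustration
scored = [
    (0.62, "What is the material of the jeans?"),
    (0.18, "What is the price of men's jeans?"),
    (0.35, "How much do the jeans cost?"),
    (0.41, "What is the rating of the jeans?"),
    (0.47, "What is the size of the jeans?"),
]
print(top_matches(scored))
```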

In [6]:
class Speech2Text:
    
    def __init__(self):
        pass
        
    def get_audio_files(self, directory, audio_format=".flac"):
        """
        This function reads audio files from the mentioned directory. 
        :param directory: Directory of the audio files
        :param audio_format: Audio files to look for in the directory
        :return: List of audio file paths
        """

        import os

        audio_files = []
        for entry in os.scandir(directory):
            if entry.path.endswith(audio_format) and entry.is_file():
                audio_files.append(entry.path)
        return audio_files

    def read_audio_files(self, file_path):
        """
        This function reads the audio file and returns the utterance as string.
        :param file_path: Audio file path
        :return: Utterance as string
        """
        # Set speech recognizer
        r = sr.Recognizer()
        
        intro = sr.AudioFile(file_path)
        with intro as source:
            audio = r.record(source)
        return r.recognize_google(audio)

    def match_questions(self, utterance, sentences, sentence_embeddings, query_vec):
        """
        This function matches the utterance with the questions using cosine distance and returns the top 
        3 matched questions. 
        :param utterance: Input from audio file
        :param sentences: List of questions as string
        :param sentence_embeddings: Questions in embeddings format
        :param query_vec: Embedding of the utterance
        :return: Top 3 questions matched with utterance, or None if nothing matches
        """
        d = {}
        
        for sent, sent_emb in zip(sentences, sentence_embeddings):
            # scipy's cosine() returns the cosine distance (1 - cosine similarity): lower means more similar
            dist = cosine(query_vec, sent_emb)
            # Only consider matches whose distance is below 0.5
            if dist < 0.5:
                # Note: questions with exactly the same distance overwrite each other
                d[dist] = sent

        if not d:
            return None
        # Sort by distance (closest first); slicing returns fewer than 3 items when fewer matched
        return [d[dist] for dist in sorted(d)][:3]
        
    def get_utterances(self, directory, print_output=False):
        """
        This function takes the directory and returns utterances in string format. 
        :param directory: Directory where audio files reside
        :param print_output: Print output of utterances as a proof
        :return: Utterances as list of string
        """
        # get all audio files
        audio_files = self.get_audio_files(directory)

        # reading all audio files and getting their utterances in string format
        utterances = []
        for file in audio_files:
            utterances.append(self.read_audio_files(file))
        if print_output:
            print(utterances)
            
        return utterances
    
    def get_answers(self, utterances, sentences, sentence_embeddings, qa_dict, model_emb):
        """
        This function takes in the utterances from audio files as queries and matches them with questions from the database. 
        model_emb is used to create an embedding for every utterance. For every list of matched questions, print the
        top 3 answers. If no question matches, print NA. 
        :param utterances: audio file inputs
        :param sentences: list of questions
        :param sentence_embeddings: embeddings of questions
        :param qa_dict: dictionary mapping each question to its answers
        :param model_emb: model for sentence embeddings
        """

        # Matching questions for each utterance and returning answers
        for utterance in utterances:
            # Start tracking time
            tic = time.time()

            print('utterance =', utterance)
            query_vec = model_emb([utterance])[0]

            # Get matching questions for the utterance
            questions = self.match_questions(utterance, sentences, sentence_embeddings, query_vec)
            
            if questions:
                # Get top 3 answers for the utterance
                answers = []
                for q in questions:
                    # find the answers from the dictionary created
                    answers = answers + list(qa_dict[q])
                if len(answers) >= 3:
                    print("Top 3 answers to the utterances are\n", answers[0:3])
                else:
                    print("Top 3 answers to the utterances are\n", answers)
            else:
                print("NA")

            # End tracking time
            tac = time.time()
            print("Time taken for matching one utterance is", (tac-tic)*1000, "ms")
            print('\n')

In [7]:
with open('autogenerated_questions.txt') as json_file:
    datafile = json.load(json_file)

In [8]:
qa_input = {}
for i in datafile:
    key = i['question']
    value = i['answers']
    if key in qa_input:
        qa_input[key] = qa_input[key] + value
    else:
        qa_input[key] = value

In [9]:
# Load the pretrained TF Hub model (Universal Sentence Encoder) for sentence embeddings
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" 
model_emb = hub.load(module_url)

# List questions generated in the previous task
sentences = np.array(list(qa_input.keys()))

# Get embeddings for questions using above model
sentence_embeddings = np.array(model_emb(sentences))

s2t = Speech2Text()
utterances = s2t.get_utterances(directory="./audio/")

s2t.get_answers(utterances, sentences, sentence_embeddings, qa_input, model_emb)

INFO:absl:Using /tmp/tfhub_modules to cache modules.


utterance = what is the price of men's jeans
Top 3 answers to the utterances are
 ['89.5', '69.5', '34.98']
Time taken for matching one utterance is 24.939537048339844 ms


utterance = what is the color of women's jeans
Top 3 answers to the utterances are
 ['Black Squared', 'Medium Wash', 'Lonesome Road']
Time taken for matching one utterance is 15.981435775756836 ms


utterance = what are women's jeans measurement
Top 3 answers to the utterances are
 ['Skinny, Super Skinny', 'Mid rise: Sits at waist, Relaxed through hip and thigh, Straight leg', '23x30']
Time taken for matching one utterance is 8.982658386230469 ms


utterance = what is the condition of men's jeans
Top 3 answers to the utterances are
 ['NEW', 'NEW', '100% Cotton, 5-pocket styling, Button Fly, Imported, Read our definitive guide to making jeans last longer <strong><a href="https://www.levi.com/US/en_US/blog/article/the-definitive-guide-to-denim/" target="_blank">here</a></strong>, Wash your jeans once every 10 wears at

## Is the question a good match for the utterance?

Above are 12 examples of QA. 1 of them is irrelevant and 11 of them are relevant to the catalog. 

To match an utterance with questions, I used the cosine distance between sentence embeddings. The method is fast and returns similar questions. For the utterance 'hello my name is Riley', no similar question was found and NA was returned. The score is SciPy's cosine distance, i.e. 1 minus the cosine similarity, so values closer to 0 indicate more similar vectors and larger values indicate more different vectors. A threshold of 0.5 is used to find similar questions, i.e. a question is accepted for the utterance when its score is below 0.5. 
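
To make the score concrete, here is a small pure-Python sketch of the quantity that `scipy.spatial.distance.cosine` computes, namely one minus the cosine similarity. The function `cosine_distance` and the toy vectors are illustrative assumptions; the notebook itself calls SciPy on the Universal Sentence Encoder embeddings.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Vectors pointing the same way give a distance of 0, orthogonal vectors give 1, and opposite vectors give 2, so a threshold of 0.5 keeps only fairly close matches.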

In one recording, 'men's jeans' was mistakenly transcribed as 'mainstream', giving the utterance 'what is the color of mainstream'. Even so, the similarity matching returned the color of the jeans. 

Reference
- https://realpython.com/python-speech-recognition/#picking-a-python-speech-recognition-package
- https://tfhub.dev/google/universal-sentence-encoder/4
