
In [None]:
!pip install tensorflow-hub
!pip install transformers

# 0. Question Answering

Write a function ```def question_answer(question, reference):``` that finds a snippet of text within a reference document to answer a question:

    question is a string containing the question to answer
    reference is a string containing the reference document from which to find the answer
    Returns: a string containing the answer
    If no answer is found, return None
    Your function should use the bert-uncased-tf2-qa model from the tensorflow-hub library
    Your function should use the pre-trained BertTokenizer, bert-large-uncased-whole-word-masking-finetuned-squad, from the transformers library
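
The model takes three parallel sequences: token ids, an attention mask, and segment (type) ids. As a minimal sketch of that layout, the hypothetical helper `build_inputs` below works on token strings only, so it runs without the tokenizer or the model. The sample tokens are hand-written for illustration and are not actual `BertTokenizer` output.

```python
def build_inputs(question_tokens, reference_tokens):
    """Lay out one question/reference pair as [CLS] q [SEP] ref [SEP]."""
    tokens = (['[CLS]'] + question_tokens + ['[SEP]']
              + reference_tokens + ['[SEP]'])
    # No padding is added, so every position is a real token.
    input_mask = [1] * len(tokens)
    # Segment 0 covers [CLS] + question + [SEP]; segment 1 covers the rest.
    input_type_ids = ([0] * (len(question_tokens) + 2)
                      + [1] * (len(reference_tokens) + 1))
    return tokens, input_mask, input_type_ids


# Hand-written tokens (an assumption, not real tokenizer output)
tokens, mask, type_ids = build_inputs(['when', 'are', 'they', '?'],
                                      ['they', 'are', 'mandatory'])
print(tokens)
print(type_ids)
```

All three lists have the same length, which is what lets them be stacked into tensors with a shared shape.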


In [2]:
import tensorflow as tf
import tensorflow_hub as hub
from transformers import BertTokenizer

In [None]:
model = hub.load("https://tfhub.dev/see--/bert-uncased-tf2-qa/1")
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

In [4]:
def question_answer(question, reference):
    # Tokenize both texts and join them as [CLS] question [SEP] reference [SEP]
    question_tokens = tokenizer.tokenize(question)
    reference_tokens = tokenizer.tokenize(reference)
    tokens = ['[CLS]'] + question_tokens + ['[SEP]']\
        + reference_tokens + ['[SEP]']
    input_word_ids = tokenizer.convert_tokens_to_ids(tokens)
    # No padding is used, so every position is attended to
    input_mask = [1] * len(input_word_ids)
    # Segment 0 marks the question, segment 1 marks the reference
    input_type_ids = [0] * (1 + len(
        question_tokens) + 1) + [1] * (len(reference_tokens) + 1)

    # Convert each list to an int32 tensor with a batch dimension of 1
    input_word_ids, input_mask, input_type_ids = map(
        lambda t: tf.expand_dims(tf.convert_to_tensor(t, dtype=tf.int32), 0),
        (input_word_ids, input_mask, input_type_ids))
    outputs = model([input_word_ids, input_mask, input_type_ids])

    # Highest-scoring start and end positions, skipping [CLS] at index 0
    short_start = tf.argmax(outputs[0][0][1:]) + 1
    short_end = tf.argmax(outputs[1][0][1:]) + 1
    # The slice is empty when the predicted end comes before the start
    answer_tokens = tokens[short_start: short_end + 1]
    answer = tokenizer.convert_tokens_to_string(
        answer_tokens) if len(answer_tokens) > 0 else None

    return answer

In [5]:
with open('/content/drive/MyDrive/ZendeskArticles/PeerLearningDays.md') as f:
    reference = f.read()

print(question_answer('When are PLDs?', reference))
print(question_answer('What does PLD stand for?', reference))
print(question_answer('What are Mock Interviews?', reference))

on - site days from 9 : 00 am to 3 : 00 pm
peer learning days
None
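
`question_answer` feeds the whole reference to the model in one pass. Standard BERT checkpoints accept at most 512 positions; the notebook does not state the limit for this model, so treat that number as an assumption. Under that assumption, a long reference could be split into overlapping windows and each window queried separately. The helper `split_reference` and its `overlap` parameter are hypothetical names for this sketch.

```python
def split_reference(reference_tokens, question_len, max_len=512, overlap=128):
    """Yield (offset, window) pairs that fit in max_len with the question."""
    # Reserve room for [CLS], the question, and the two [SEP] tokens
    window_size = max_len - question_len - 3
    if window_size <= 0:
        raise ValueError('question is too long for the model')
    # Consecutive windows share `overlap` tokens
    step = max(1, window_size - overlap)
    offset = 0
    while True:
        yield offset, reference_tokens[offset:offset + window_size]
        if offset + window_size >= len(reference_tokens):
            break
        offset += step


# Small numbers so the windows are easy to read
windows = list(split_reference(list(range(10)), question_len=2,
                               max_len=10, overlap=2))
print(windows)
```

The returned offset lets a caller map a span found inside one window back to its position in the full reference.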


# 1. Create the loop

Create a script that takes in input from the user with the prompt ```Q:``` and prints ```A:``` as a response. If the user inputs ```exit```, ```quit```, ```goodbye```, or ```bye```, case insensitive, print ```A: Goodbye``` and exit.
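
The exit check can be isolated into a small predicate so it is testable without calling `input`. This is a sketch; the names `EXIT_COMMANDS` and `is_exit` are hypothetical, and stripping surrounding whitespace is an extra not required by the task.

```python
EXIT_COMMANDS = {'exit', 'quit', 'goodbye', 'bye'}


def is_exit(text):
    """True when text is an exit word, ignoring case and outer whitespace."""
    return text.strip().casefold() in EXIT_COMMANDS


print(is_exit('BYE'))
print(is_exit('Hello'))
```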

In [7]:
def loop():
    exit_cmds = ['exit', 'quit', 'goodbye', 'bye']
    prompt = 'Q: '

    # Prompt for question
    q = input(prompt).lower()

    while q not in exit_cmds:

        # Find answer
        answer = ''

        # Print answer
        print('A:', answer)

        # Prompt for next question
        q = input(prompt).lower()

    # End loop
    print('A: Goodbye')

if __name__ == '__main__':
    loop()

Q: Hello
A: 
Q: How are you?
A: 
Q: BYE
A: Goodbye


# 2. Answer Questions

Based on the previous tasks, write a function ```def answer_loop(reference):``` that answers questions from a reference text:

    reference is the reference text
    If the answer cannot be found in the reference text, respond with ```Sorry, I do not understand your question.```
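
The fallback rule can be kept apart from the prompt loop. In this sketch, `respond` and `FALLBACK` are hypothetical names, and `answer_fn` stands for any function with the `(question, reference)` signature from task 0; a stub replaces the BERT model so the example runs on its own.

```python
FALLBACK = 'Sorry, I do not understand your question.'


def respond(question, reference, answer_fn):
    """Return the answer text, or the fallback when nothing is found."""
    answer = answer_fn(question, reference)
    # Treat None and empty or whitespace-only strings as "no answer"
    if answer is None or not answer.strip():
        return FALLBACK
    return answer


def stub_answer(question, reference):
    """Stand-in for the model: answers only if the question is in the text."""
    return reference if question in reference else None


print('A:', respond('pld', 'pld means peer learning day', stub_answer))
print('A:', respond('mock', 'pld means peer learning day', stub_answer))
```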


In [8]:
def answer_loop(reference):
    exit_cmds = ['exit', 'quit', 'goodbye', 'bye']
    prompt = 'Q: '
    reference_tokens = tokenizer.tokenize(reference)

    # Prompt for question
    q = input(prompt).lower()

    while q not in exit_cmds:

        # Find answer
        question_tokens = tokenizer.tokenize(q)
        tokens = ['[CLS]'] + question_tokens + ['[SEP]']\
            + reference_tokens + ['[SEP]']
        input_word_ids = tokenizer.convert_tokens_to_ids(tokens)
        input_mask = [1] * len(input_word_ids)
        input_type_ids = [0] * (1 + len(
            question_tokens) + 1) + [1] * (len(reference_tokens) + 1)

        input_word_ids, input_mask, input_type_ids = map(
            lambda t: tf.expand_dims(
                tf.convert_to_tensor(t, dtype=tf.int32), 0),
            (input_word_ids, input_mask, input_type_ids))
        outputs = model([input_word_ids, input_mask, input_type_ids])

        short_start = tf.argmax(outputs[0][0][1:]) + 1
        short_end = tf.argmax(outputs[1][0][1:]) + 1
        answer_tokens = tokens[short_start: short_end + 1]

        if len(answer_tokens) > 0:
            answer = tokenizer.convert_tokens_to_string(answer_tokens)

        else:
            answer = 'Sorry, I do not understand your question.'

        # Print answer
        print('A:', answer)

        # Prompt for next question
        q = input(prompt).lower()

    # End loop
    print('A: Goodbye')

In [9]:
with open('/content/drive/MyDrive/ZendeskArticles/PeerLearningDays.md') as f:
    reference = f.read()

answer_loop(reference)

Q: When are PLDs
A: mandatory on - site days from 9 : 00 am to 3 : 00 pm
Q: What does PLD stand for?
A: peer learning days
Q: What are mock interviews?
A: Sorry, I do not understand your question.
Q: GOoDBye
A: Goodbye


# 3. Semantic Search

Write a function ```def semantic_search(corpus_path, sentence):``` that performs semantic search on a corpus of documents:

    corpus_path is the path to the corpus of reference documents on which to perform semantic search
    sentence is the sentence from which to perform semantic search
    Returns: the reference text of the document most similar to sentence
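
"Most similar" is measured here by cosine similarity between embedding vectors. The sketch below shows that arithmetic in plain Python on toy two-dimensional vectors; the notebook itself uses `sklearn`'s `cosine_similarity` on Universal Sentence Encoder embeddings, and the helper names `cosine` and `most_similar` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Define the similarity as 0 when either vector is all zeros
    return dot / norm if norm else 0.0


def most_similar(query_vec, doc_vecs):
    """Index of the document vector closest in direction to the query."""
    scores = [cosine(query_vec, vec) for vec in doc_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vectors standing in for real sentence embeddings
docs = [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]]
print(most_similar([1.0, 0.0], docs))
```

Cosine similarity ignores vector length, so a long document and a short question can still score highly if their embeddings point the same way.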


In [10]:
import glob
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [16]:
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

In [14]:
def semantic_search(corpus_path, sentence):
    corpus = [sentence]
    for filename in glob.glob(corpus_path + '/*'):
        with open(filename, 'r') as file:
            corpus.append(file.read())

    # Row 0 is the query sentence; rows 1..n are the documents
    embeddings = np.array(embed(corpus))

    # Compare the query against every document in one call
    cos_sims = cosine_similarity(embeddings[:1], embeddings[1:])[0]
    # Shift by 1 because corpus[0] is the query itself
    max_index = cos_sims.argmax() + 1

    return corpus[max_index]

In [19]:
print(semantic_search('/content/drive/MyDrive/ZendeskArticles/', 'When are PLDs?'))
print(semantic_search('/content/drive/MyDrive/ZendeskArticles/', 'What is a Mock Interview'))

PLD Overview
Peer Learning Days (PLDs) are a time for you and your peers to ensure that each of you understands the concepts you've encountered in your projects, as well as a time for everyone to collectively grow in technical, professional, and soft skills. During PLD, you will collaboratively review prior projects with a group of cohort peers.
PLD Basics
PLDs are mandatory on-site days from 9:00 AM to 3:00 PM. If you cannot be present or on time, you must use a PTO. 
No laptops, tablets, or screens are allowed until all tasks have been whiteboarded and understood by the entirety of your group. This time is for whiteboarding, dialogue, and active peer collaboration. After this, you may return to computers with each other to pair or group program. 
Peer Learning Days are not about sharing solutions. This doesn't empower peers with the ability to solve problems themselves! Peer learning is when you share your thought process, whether through conversation, whiteboarding, debugging, or li

# 4. Multi-reference Question Answering

Based on the previous tasks, write a function ```def question_answer(corpus_path):``` that answers questions from multiple reference texts:

    corpus_path is the path to the corpus of reference documents
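
Calling `semantic_search` for every question re-reads and re-embeds the whole corpus each time. A possible refinement, sketched here under assumptions, is to load and embed the documents once and embed only the question per query. `load_corpus` and `CorpusIndex` are hypothetical names; `embed_fn` and `similarity_fn` are placeholders for the sentence encoder and cosine similarity, replaced by trivial stubs so the example runs without downloading a model.

```python
import glob
import os


def load_corpus(corpus_path):
    """Read every regular file directly under corpus_path, sorted by name."""
    documents = []
    for filename in sorted(glob.glob(os.path.join(corpus_path, '*'))):
        if os.path.isfile(filename):
            with open(filename, 'r', encoding='utf-8') as file:
                documents.append(file.read())
    return documents


class CorpusIndex:
    """Keep document embeddings so each query embeds only the question."""

    def __init__(self, documents, embed_fn):
        self.documents = documents
        self.embed_fn = embed_fn
        self.doc_vectors = embed_fn(documents)  # computed once

    def search(self, sentence, similarity_fn):
        query = self.embed_fn([sentence])[0]
        scores = [similarity_fn(query, vec) for vec in self.doc_vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.documents[best]


# Stubs: "embed" a text as its length, and score by closeness of lengths
def stub_embed(texts):
    return [[len(text)] for text in texts]


def stub_similarity(u, v):
    return -abs(u[0] - v[0])


index = CorpusIndex(['aa', 'bbbbb'], stub_embed)
print(index.search('cccc', stub_similarity))
```

With the real encoder, `embed_fn` would wrap the loaded hub module and `similarity_fn` would be a cosine similarity.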


In [20]:
def semantic_search(corpus_path, sentence):
    corpus = [sentence]
    for filename in glob.glob(corpus_path + '/*'):
        with open(filename, 'r') as file:
            corpus.append(file.read())

    # Row 0 is the query sentence; rows 1..n are the documents
    embeddings = np.array(embed(corpus))

    # Compare the query against every document in one call
    cos_sims = cosine_similarity(embeddings[:1], embeddings[1:])[0]
    # Shift by 1 because corpus[0] is the query itself
    max_index = cos_sims.argmax() + 1

    return corpus[max_index]

def question_answer(corpus_path):
    exit_cmds = ['exit', 'quit', 'goodbye', 'bye']
    prompt = 'Q: '

    # Prompt for question
    q = input(prompt).lower()

    while q not in exit_cmds:

        # Find reference
        reference = semantic_search(corpus_path, q)

        # Find answer
        question_tokens = tokenizer.tokenize(q)
        reference_tokens = tokenizer.tokenize(reference)
        tokens = ['[CLS]'] + question_tokens + ['[SEP]']\
            + reference_tokens + ['[SEP]']
        input_word_ids = tokenizer.convert_tokens_to_ids(tokens)
        input_mask = [1] * len(input_word_ids)
        input_type_ids = [0] * (1 + len(
            question_tokens) + 1) + [1] * (len(reference_tokens) + 1)

        input_word_ids, input_mask, input_type_ids = map(
            lambda t: tf.expand_dims(
                tf.convert_to_tensor(t, dtype=tf.int32), 0),
            (input_word_ids, input_mask, input_type_ids))
        outputs = model([input_word_ids, input_mask, input_type_ids])

        short_start = tf.argmax(outputs[0][0][1:]) + 1
        short_end = tf.argmax(outputs[1][0][1:]) + 1
        answer_tokens = tokens[short_start: short_end + 1]

        if len(answer_tokens) > 0:
            answer = tokenizer.convert_tokens_to_string(answer_tokens)

        else:
            answer = 'Sorry, I do not understand your question.'

        # Print answer
        print('A:', answer)

        # Prompt for next question
        q = input(prompt).lower()

    # End loop
    print('A: Goodbye')

In [23]:
question_answer('/content/drive/MyDrive/ZendeskArticles/')

Q: When are PLDs?
A: on - site days from 9 : 00 am to 3 : 00 pm
Q: What are Mock Interviews?
A: help you train for technical interviews
Q: What does PLD stand for?
A: peer learning days
Q: What is Holberton?
A: a supportive learning space with clear expectations
Q: What is the framework?
A: building foundations
Q: answer this
A: Sorry, I do not understand your question.
Q: bye
A: Goodbye
