## Package import
Make sure to set the kernel as Python 3.8 - AzureML before starting your work. You can ignore the warning messages about tensorflow when running the import cell.

In [1]:
import re, os, json, pickle, ast, time, random, requests
import pandas as pd
import numpy as np
import spacy
import scipy
import scipy.sparse as sp

import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer 
from nltk.corpus import wordnet
nltk.download("stopwords", quiet = True)
nltk.download("wordnet", quiet = True)
nltk.download("averaged_perceptron_tagger", quiet = True)
nltk.download("punkt", quiet = True)

from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

import torch
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

from transformers import BertTokenizerFast, BertForQuestionAnswering

from tqdm import tqdm
tqdm.pandas()

2023-07-28 01:37:29.210156: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-28 01:37:38.505707: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-07-28 01:37:38.505798: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory


In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

GLOBAL_SEED = 1
 
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

GLOBAL_WORKER_ID = None
def _init_fn(worker_id):
    global GLOBAL_WORKER_ID
    GLOBAL_WORKER_ID = worker_id
    set_seed(GLOBAL_SEED + worker_id)

set_seed(GLOBAL_SEED)

In [3]:
def check_equal(actual, expected):
    assert actual == expected, actual

def check_approx(actual, expected):
    assert np.allclose(actual, expected), actual

## Part A: Project Overview
We will train our NLP models on the [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/), a reading comprehension dataset with more than 100,000 questions. SQuAD was one of the first such datasets with a public leaderboard, which helped it attract a large amount of research and publicity. Questions in the dataset can be answered from the context that accompanies them, without requiring any domain-specific knowledge; thus they belong to the class of *single-hop* question answering problems.

Here's an example of what our model will do: given a *question*,
```
When did Beyonce start becoming popular?
```
and a block of text containing the answer, called the *context*,
```
Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
```
our model will be able to extract the answer from this context, which is
```
in the late 1990s
```

We will start by evaluating some simple models that only identify the sentence which contains the answer (i.e., the *answer sentence*) within the context. Then we'll move to a state-of-the-art model called BERT that can also identify the exact answer.

First we load the dataset -- note that the following cell should have the tag `excluded_from_script`, since the autograder will use a different dataset.

In [4]:
df_squad = pd.read_csv(
    "cleaned_squad_data.csv",
    dtype = { 
        "question" : str,
        "context_paragraph" : str,
        "answer" : str,
        "answer_start" : int,
        "answer_end" : int,
        "answer_sent_index" : int,
        "tokenized_context" : str
    },
    converters = {"context_sentences" : ast.literal_eval}
)

df_squad.head(5)

Unnamed: 0,question,context_paragraph,answer,answer_start,answer_end,context_sentences,answer_sent_index
0,When did Beyonce start becoming popular?,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,in the late 1990s,269,286,[Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ ...,1
1,What areas did Beyonce compete in when she was...,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,singing and dancing,207,226,[Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ ...,1
2,When did Beyonce leave Destiny's Child and bec...,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,2003,526,530,[Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ ...,3
3,In what city and state did Beyonce grow up?,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,"Houston, Texas",166,180,[Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ ...,1
4,In which decade did Beyonce become famous?,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,late 1990s,276,286,[Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ ...,1


Since the cell contents are truncated, let's print out and examine one row in detail:

In [5]:
df_squad.iloc[3].to_dict()

{'question': 'In what city and state did Beyonce  grow up? ',
 'context_paragraph': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
 'answer': 'Houston, Texas',
 'answer_start': 166,
 'answer_end': 180,
 'context_sentences': ['Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, re

Now we can get a better understanding of what each column means:
* `question` is the question text.
* `context_paragraph` is the paragraph of text that contains the answer, which our model extracts from.
* `answer` is the ground-truth answer to the given question.
* `answer_start` and `answer_end` are the character offsets at which `answer` starts and ends within `context_paragraph`. In other words, `context_paragraph[answer_start:answer_end]` yields `answer`. Note that `answer_end` is exclusive: it is the index just past the last character of `answer`.
* `context_sentences` is the list of sentences in `context_paragraph`.
* `answer_sent_index` is the ground-truth index of the answer sentence within `context_sentences` (indexing starts from 0).
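
As a quick illustration of the offset convention, here is a minimal sketch on a made-up row. The paragraph, the sentence split, and the variable names are illustrative assumptions, not values read from the dataset.

```python
# Hypothetical toy row mimicking the columns described above
context_sentences = [
    "She was born in 1981.",
    "Born and raised in Houston, Texas, she performed as a child.",
]
context_paragraph = " ".join(context_sentences)
answer = "Houston, Texas"

# answer_start is the offset of the first character; answer_end is exclusive
answer_start = context_paragraph.index(answer)
answer_end = answer_start + len(answer)

# 0-based index of the first sentence that contains the answer
answer_sent_index = next(i for i, s in enumerate(context_sentences) if answer in s)

extracted = context_paragraph[answer_start:answer_end]
print(extracted, answer_sent_index)
```

Because `answer_end` is exclusive, `answer_end - answer_start` equals `len(answer)`.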

There are several techniques to build a question-answering model from this dataset, which we will introduce in the following sections.

## Part B: Unsupervised Models
In this section, we will implement three unsupervised learning models to identify the sentence that contains the answer to a given question. Here "unsupervised" means that we will not make use of the ground-truth answer provided in the dataset (i.e., the columns `answer`, `answer_start`, and `answer_end`). Instead, the sentence identification will be based only on some pre-defined heuristics.

### Question 1: Jaccard Overlap
The Jaccard overlap of two given sets $A$ and $B$ measures the similarity between them and is defined as

\begin{equation}
J(A, B) = \begin{cases}
    \frac{|A \cap B|}{|A \cup B|} & \text{ if $A \ne \emptyset$ or $B \ne \emptyset$ } \\
    1 & \text { otherwise }
\end{cases}
\end{equation}

Given a question and a list of context sentences, we can identify the answer sentence using Jaccard overlap as follows:
1. Construct the set of words that are in the input question; we will call this set $Q$.
1. Construct the sets of words that are in each context sentence; we will call these sets $S_1, S_2, \ldots, S_m$. Here $S_i$ is the set of words in the $i$-th context sentence.
1. Return the index of the context sentence whose Jaccard overlap with the input question is largest; this is our predicted answer sentence: 

$$\hat y = \underset{1 \le i \le m}{\operatorname{argmax}} J(Q, S_i).$$

Implement the function `get_jaccard_prediction` that performs the above steps on the dataset `df_squad`. For every row, it stores the predicted answer sentence index in a new column `"jaccard_prediction"`, and the corresponding largest Jaccard overlap value in a new column `"jaccard_value"`.

**Notes**:
* Our math notations use 1-based indexing, but in your implementation the indexes start from 0. In other words, if the first sentence in the context paragraph is the predicted answer sentence, you should return 0.
* If multiple context sentences have the same (largest) Jaccard overlap with the question, return the smallest sentence index.
* To build the set of words from a sentence, you should first tokenize the sentence with `nltk`, and then turn the resulting list of tokens into a set.
* You do not need to perform any rounding on the Jaccard overlap values.
* Refer to the [Pandas primer](https://nbviewer.jupyter.org/url/clouddatascience.blob.core.windows.net/primers/pandas-primer/pandas_primer.ipynb) on how to vectorize a row-wise dataframe operation.
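
The steps above can be sketched on toy data. To stay self-contained, this sketch tokenizes with `str.split` rather than `nltk`, and the helper names `jaccard` and `predict_sentence` are illustrative assumptions; it is not the graded implementation.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets overlap fully by definition."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)

def predict_sentence(question, sentences):
    """Return (0-based index, overlap) of the sentence most similar to the question."""
    q = set(question.split())
    scores = [jaccard(q, set(s.split())) for s in sentences]
    # max() keeps the first maximum, so ties resolve to the smallest index
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

index, value = predict_sentence(
    "where was she born",
    ["she sings well", "she was born in houston"],
)
print(index, value)
```

The second sentence shares three of six distinct words with the question, so it wins with an overlap of 0.5.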

In [7]:
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """Optional helper (not used below): lowercase, tokenize, drop stopwords, lemmatize."""
    stop_words = set(stopwords.words('english'))
    words = nltk.word_tokenize(text.lower())
    return [lemmatizer.lemmatize(word) for word in words if word not in stop_words]

def jaccard_overlap(set1, set2):
    union = len(set1.union(set2))
    # By definition, the overlap of two empty sets is 1
    if union == 0:
        return 1.0
    return len(set1.intersection(set2)) / union

def get_jaccard_prediction(df_squad):
    df_squad["jaccard_prediction"] = None
    df_squad["jaccard_value"] = None
    
    for idx, row in df_squad.iterrows():
        question_tokens = set(nltk.word_tokenize(row["question"]))
        context_sentences = row["context_sentences"]
        jaccard_values = [jaccard_overlap(question_tokens, set(nltk.word_tokenize(context))) for context in context_sentences]        
        max_jaccard_value = max(jaccard_values)
        max_jaccard_index = jaccard_values.index(max_jaccard_value)
        
        df_squad.at[idx, "jaccard_prediction"] = max_jaccard_index
        df_squad.at[idx, "jaccard_value"] = max_jaccard_value
    df_squad["jaccard_value"] = df_squad["jaccard_value"].astype(float)
    
    return df_squad


In [8]:
def test_get_jaccard_prediction():
    """Test on the first 10 rows"""
    df_jaccard = get_jaccard_prediction(df_squad.head(10).copy())
    
    check_equal(df_jaccard.shape, (10, 9))
    
    check_approx(df_jaccard["jaccard_value"].tolist(),[
        0.0, 0.046511627906976744, 0.15, 0.03225806451612903, 0.02040816326530612,
        0.18421052631578946, 0.10869565217391304, 0.12, 0.05263157894736842, 0.10256410256410256
    ])
    
    check_equal(df_jaccard["jaccard_prediction"].tolist(), [0, 1, 1, 0, 3, 1, 3, 2, 1, 1])
    
    jaccard_accuracy = (df_jaccard["jaccard_prediction"] == df_jaccard["answer_sent_index"]).values.mean()
    check_equal(jaccard_accuracy, 0.6)
    
    
    """Test on the entire dataset"""
    df_jaccard = get_jaccard_prediction(df_squad.copy())
    
    check_equal(df_jaccard.shape, (86821, 9))
    print(df_jaccard["jaccard_value"])
    check_approx(df_jaccard.tail(10)["jaccard_value"].tolist(), [
        0.11428571428571428, 0.17073170731707318, 0.08333333333333333, 0.10526315789473684, 0.125,
        0.10714285714285714, 0.04, 0.07407407407407407, 0.07142857142857142, 0.08333333333333333
    ])
    
    check_equal(df_jaccard.tail(10)["jaccard_prediction"].tolist(), [0, 0, 0, 4, 4, 1, 1, 1, 1, 1])
    
    jaccard_accuracy = (df_jaccard["jaccard_prediction"] == df_jaccard["answer_sent_index"]).values.mean()
    check_approx(jaccard_accuracy, 0.7001992605475634)
    print("All tests passed!")
    
test_get_jaccard_prediction()

0        0.000000
1        0.046512
2        0.150000
3        0.032258
4        0.020408
           ...   
86816    0.107143
86817    0.040000
86818    0.074074
86819    0.071429
86820    0.083333
Name: jaccard_value, Length: 86821, dtype: float64
All tests passed!


Essentially, the Jaccard technique is saying "the context sentence that is most similar to the question contains the answer." We see that even such a simple heuristic can achieve 70% accuracy, which is not bad at all. This performance can be improved a bit if we take the time to preprocess the questions and context sentences (e.g., remove stopwords, lemmatize tokens) before computing the Jaccard overlap.
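
To see why preprocessing can help, here is a small sketch that compares the overlap with and without stopword removal. The stopword set and the example strings are hypothetical, and the tokenizer is a crude stand-in for `nltk`; the sketch only shows how the score changes for one pair, not how accuracy changes on the dataset.

```python
# Hypothetical, deliberately tiny stopword list
STOPWORDS = {"the", "a", "in", "did", "what", "of", "was"}

def tokens(text, drop_stopwords=False):
    """Lowercase, strip simple punctuation, split on whitespace."""
    words = text.lower().replace("?", "").replace(".", "").split()
    return {w for w in words if not (drop_stopwords and w in STOPWORDS)}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

question = "What city did she grow up in?"
sentence = "She was raised in the city of Houston."

raw = jaccard(tokens(question), tokens(sentence))
clean = jaccard(tokens(question, True), tokens(sentence, True))
print(raw, clean)
```

Dropping the function words shrinks the union more than the intersection, so the content words `city` and `she` carry more weight in the score.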

### Question 2: TF-IDF Vectors
Instead of Jaccard overlap, we can employ other measures of similarity, such as the Euclidean distance:

$$d_{\text{euclidean}}(a, b) = \|a - b\|_2 = \sqrt{\sum_{i=1}^k (a_i - b_i)^2}.$$

To compute this distance, we first need to convert each question and context sentence into a vector. Recall from earlier projects that one way to do so is building a TF-IDF model, which transforms each string into a vector $v \in \mathbb{R}^k$ where:
* $k$ is the number of unique tokens in the entire dataset.
* The $i$-th element is the frequency of token $i$ in the string, multiplied by its IDF value.
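
One detail helps when reading the distance values later on: scikit-learn's `TfidfVectorizer` L2-normalizes each output vector by default (`norm='l2'`), so every non-zero TF-IDF vector has unit length. For unit vectors, $\|u - v\|_2^2 = 2 - 2\,u \cdot v$, which means two strings with no terms in common are exactly $\sqrt{2} \approx 1.414$ apart. The sketch below checks this identity on hand-made unit vectors (hypothetical values, not real TF-IDF output).

```python
import numpy as np

def euclidean(a, b):
    """Plain Euclidean distance between two dense vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

u = np.array([0.6, 0.8, 0.0])  # unit vector over a 3-term vocabulary
v = np.array([0.0, 0.0, 1.0])  # shares no terms with u
w = np.array([0.0, 1.0, 0.0])  # shares one term with u

d_uv = euclidean(u, v)
d_uw = euclidean(u, w)
print(d_uv, d_uw)
```

Here `d_uv` equals $\sqrt{2}$ because the dot product is zero, while `d_uw` equals $\sqrt{2 - 2 \cdot 0.8}$.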

Given a question, a list of context sentences, and a trained TF-IDF model, we can identify the answer sentence as follows:
1. Transform the question into a vector $u$ using the TF-IDF model.
1. Transform each context sentence $s_i$ into a vector $v_i$ using the TF-IDF model.
1. Return the index of the context sentence whose Euclidean distance to the input question is smallest in the TF-IDF space:

$$\hat y = \underset{1 \le i \le m}{\operatorname{argmin}} d_{\text{euclidean}}(u, v_i).$$

Implement the function `get_tfidf_prediction` that performs the above steps on the dataset `df_squad`. For every row, it stores the predicted answer sentence index in a new column `"tfidf_prediction"`, and the corresponding smallest Euclidean distance value in a new column `"distance_value"`.

**Notes**:
* Since the input `tfidf_vectorizer` is already trained, you don't need to fit it on anything. Instead, just call `.transform` on the appropriate question / context sentence.
* Because the TF-IDF transformation outputs sparse matrices, you should only use `scipy.sparse` methods to operate on them. Using standard NumPy/SciPy methods may lead to dimension issues or implicit conversion of the sparse matrices to dense.
* If multiple context sentences have the same (smallest) distance value from the question, return the smallest sentence index.
* You may find that using `df.apply(<your_custom_function>, axis = 1)` to process every row is quite slow, due to the complexity of the operations involved. To work around this issue, try to think about how to completely vectorize this function, using only built-in Pandas, NumPy/SciPy, or Sklearn operations.
* You may find the Pandas method `.explode()` helpful. Note that there are duplicate questions in the dataset (the same question may apply to different context paragraphs), so be careful when performing groupby on the exploded dataset.
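
One possible way to vectorize the distance computation is sketched below, under stated assumptions: the small matrices `Q` and `S` stand in for the output of `tfidf_vectorizer.transform`, and `owner` is a hypothetical array (it would come from the exploded dataframe) giving the question row that each sentence belongs to, with sentences grouped by question in order. It is an illustration of the idea rather than the reference solution.

```python
import numpy as np
import scipy.sparse as sp

# Toy stand-ins: 2 question vectors and 5 exploded sentence vectors
Q = sp.csr_matrix(np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0]]))
S = sp.csr_matrix(np.array([[0.0, 1.0, 0.0],
                            [0.6, 0.8, 0.0],
                            [0.0, 0.0, 1.0],
                            [0.0, 1.0, 0.0],
                            [0.0, 0.6, 0.8]]))
owner = np.array([0, 0, 0, 1, 1])  # question row of each sentence (sorted)

# Row-wise distances using sparse operations only
diff = Q[owner] - S
dist = np.sqrt(np.asarray(diff.multiply(diff).sum(axis=1)).ravel())

# First row of each question's block of sentences
starts = np.flatnonzero(np.r_[True, owner[1:] != owner[:-1]])

# Sort by question, then distance, then position; the first entry of each
# block is the closest sentence, with ties going to the smallest index
order = np.lexsort((np.arange(len(dist)), dist, owner))
closest = order[starts]
pred = closest - starts
min_dist = dist[closest]
print(pred, min_dist)
```

Repeating each question row with `Q[owner]` aligns the two matrices so that a single subtraction covers every question and sentence pair that needs comparing.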

In [7]:
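import scipy.sparse.linalg  # makes scipy.sparse.linalg.norm available

def euclidean_distance(X, Y, df_squad_copy):
    """
    For each question (row of X), find its closest context sentence (rows of Y).

    df_squad_copy is the exploded dataframe aligned with Y; its "id" column gives the
    position of the question that each context sentence belongs to.
    Returns the smallest distance per question and the index of that sentence.
    """
    distances = np.zeros(X.shape[0], dtype=np.float64)
    min_distance_indexes = np.zeros(X.shape[0], dtype=np.int32)
    for i in range(X.shape[0]):
        # Rows of Y that hold the context sentences of question i
        filtered_Y = Y[df_squad_copy["id"]==i]
        norms = np.zeros(filtered_Y.shape[0], dtype=np.float64)
        for ind, y in enumerate(filtered_Y):
            norms[ind] = scipy.sparse.linalg.norm(X[i] - y)

        # np.argmin returns the first minimum, i.e. the smallest sentence index on ties
        min_ind = np.argmin(norms)
        distances[i] = norms[min_ind]
        min_distance_indexes[i] = min_ind

    return distances, min_distance_indexes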
    
    

def get_tfidf_prediction(df_squad, tfidf_vectorizer):
    """
    Identify the answer sentence as one whose TF-IDF representation has minimal distance to that of the question.
    
    args:
        df_squad (pd.DataFrame) : a copy of the SQuAD dataset
        tfidf_vectorizer (sklearn.feature_extraction.text.TfidfVectorizer) :
            the TF-IDF model to transform questions and sentences
        
    returns:
        pd.DataFrame : the input dataframe with two additional columns, "tfidf_prediction" and "distance_value"
    """
    df_squad["tfidf_prediction"] = None
    df_squad["distance_value"] = None

    question_csr = tfidf_vectorizer.transform(df_squad["question"])
    
    df_squad_copy = df_squad.copy()
    df_squad_copy["id"] = df_squad_copy.index
    
    df_squad_copy = df_squad_copy.explode("context_sentences")
    context_csr = tfidf_vectorizer.transform(df_squad_copy["context_sentences"])
    
    
    # For every question, find the closest of its own context sentences
    min_distance, min_distance_indexes = euclidean_distance(question_csr, context_csr, df_squad_copy)

    df_squad["tfidf_prediction"] = min_distance_indexes
    df_squad["distance_value"] = min_distance

    return df_squad

Below is an example trained TF-IDF vectorizer that we can use. Since fitting it on the entire dataset takes a while, we will initialize and fit the vectorizer in the global namespace, so that it can be reused in all subsequent functions.

In [22]:
tfidf_vectorizer = TfidfVectorizer(
    tokenizer = nltk.word_tokenize,
    stop_words = stopwords.words('english'),
    ngram_range = (1,2),
    max_df = 1.0,
    min_df = 10
)
tfidf_vectorizer.fit(df_squad["context_paragraph"].unique())



In [81]:
def test_get_tfidf_prediction():
    """Test on the first 10 rows"""
    df_tf_idf = get_tfidf_prediction(df_squad.head(10).copy(), tfidf_vectorizer)
    
    check_equal(df_tf_idf.shape, (10, 9))
    check_approx(df_tf_idf["distance_value"].tolist(), [
        1.4142135623730951, 1.4142135623730951, 1.118805210307003, 1.414213562373095,
        1.414213562373095, 1.0991965705062294, 1.2823059131390453, 1.1299491754496973,
        1.3217983153692023, 1.1364153055563095
    ])
    
    """Test on the whole dataset"""
    df_tf_idf = get_tfidf_prediction(df_squad.copy(), tfidf_vectorizer)
    
    check_equal(df_tf_idf.shape, (86821, 9))
    
    check_approx(df_tf_idf.tail(10)["distance_value"].tolist(), [
        1.2205482532513834, 1.1011579092263701, 1.23935316383152, 1.2848863907388077,
        1.1456258185112482, 1.2657059224645815, 1.4142135623730951, 1.2768968376533885,
        1.2666294346869091, 1.4142135623730951
    ])
    
    tfidf_accuracy = (df_tf_idf["tfidf_prediction"] == df_tf_idf["answer_sent_index"]).values.mean()
    assert 0.68 <= round(tfidf_accuracy, 2) <= 0.69, tfidf_accuracy
    print("All tests passed!")
    
%time test_get_tfidf_prediction()

(10, 25403) (40, 25403)
(86821, 25403) (443259, 25403)
All tests passed!
CPU times: user 3min 56s, sys: 116 ms, total: 3min 56s
Wall time: 3min 56s


If your implementation is sufficiently optimized, you should expect to see the local test being finished in about 3 minutes or less, on a `STANDARD_NC8AS_T4_V3` GPU Compute. If your code runs for more than 7 minutes, you should try to improve its efficiency.

TF-IDF reaches an accuracy of about 0.68, slightly below the 0.70 of the baseline Jaccard model, so the two perform similarly.

Up until now we have used language models that rely only on word frequencies, without considering the meaning of the words themselves. Now we will address this shortcoming by considering the *word embedding* of each word in our corpus.  Roughly speaking, a word embedding is a vector representation of that word in some space $\mathbb{R}^k$. This representation differs from the TF-IDF transformation in two important ways:

1. If two words have similar meanings in some sense, their embeddings should be close in Euclidean distance. For example, we may expect `"Pittsburgh"` to be closer to `"Chicago"` than to `"Pikachu"`, because the first two are city names while the third is a Pokémon.
1. The dimensionality of the vector $k$ is fixed and typically much smaller than the vocabulary size.

While these features sound promising, constructing word embeddings requires a very large amount of data (you need to see `"Pittsburgh"` and `"Chicago"` appear together in overlapping context enough times for the model to learn that they are similar). The algorithms to train word embedding models are unfortunately beyond the scope of this course, as they involve many machine learning theories we haven't covered.

That said, there are many powerful pre-trained models that we can use. These models have been trained on huge amounts of data and encode a lot of information about the meanings of the words. The first pre-trained model we will use is one from the [SentenceTransformer library](https://github.com/UKPLab/sentence-transformers#getting-started) called `'distilbert-base-nli-stsb-mean-tokens'`. Once loaded, it can encode a collection of strings, yielding a matrix where each row is the vector embedding of one string.

Run the following cell to see the model in action. Note that the first time you do so, it may take some time to download the model. Also note that the embedding matrix is a dense NumPy array, rather than a sparse SciPy matrix like the one TF-IDF outputs.

In [23]:
sent_transformer = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
embedding = sent_transformer.encode([
    'This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.', 
    'The quick brown fox jumps over the lazy dog.'
])
print(embedding.shape)
print(embedding)

(3, 768)
[[-0.21486165  0.395723    0.469087   ... -0.23119043 -0.4957917
   0.42366374]
 [-0.4400171  -0.28488502  0.23363815 ...  0.11956113 -0.16530254
  -0.0862514 ]
 [-0.29504815 -0.24928899 -0.02407092 ...  0.1194457   0.00626569
   1.0400687 ]]


### Question 3: Sent2vec Encoders
Here we will follow roughly the same procedure as in the previous question: first convert the questions and context sentences into vectors using `sent_transformer`, then identify the context sentence whose vector representation is closest in Euclidean distance to that of the input question. However, one important caveat is that encoding the questions and context sentences with `sent_transformer` takes a lot longer (as much as 10 times longer) than with `tfidf_vectorizer`, because this encoding is actually the forward pass through a pre-trained neural network. To address this issue, we recommend the following outline for your implementation:

1. Construct a mapping from each unique question / context sentence to its vector encoding.
1. Use this mapping to compute the vector representation of every question / context sentence in the dataset.
1. Compute the Euclidean distances between the questions and context sentences to identify the answer sentence index for each question.

The key idea is that there are some duplicate questions and many duplicate context sentences (since several questions share the same context paragraph), so you want to store the encoding of each unique question / context sentence to avoid recomputing them several times. This same idea also applies to Question 2, although repeated computations are not as big of an issue there since TF-IDF transformations are fast.

Implement the function `get_sent2vec_prediction` that performs the above steps on the dataset `df_squad`. For every row, it stores the predicted answer sentence index in a new column `"sent2vec_prediction"`, and the corresponding smallest Euclidean distance value in a new column `"distance_value"`.

**Notes**:
* When working with only the values of a Series (e.g., a dataframe column) and you do not care about its name or indices, calling the `.values` attribute to convert it to a NumPy array may provide some speed-up.
* If multiple context sentences have the same (smallest) distance value from the question, return the smallest sentence index.
* You should encode the entire set of unique questions / context sentences at once, rather than encoding each of them individually in a loop. Due to the encoder's internal implementation, you will get different embedding results if you encode each question / context sentence individually.
* Be careful when dealing with nested NumPy arrays. If you get a NumPy array that contains other NumPy arrays, you can use `np.stack` to turn it to a normal multi-dimensional array (otherwise it will be treated as a 1D array of pointers to other arrays).
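
The caching idea in the outline can be sketched as follows. The function `toy_encode` is a hypothetical stand-in for `encoder.encode` (it returns a made-up 2-dimensional vector per string), and the questions and sentences are invented; the point is only the pattern of encoding each unique string once and looking the vectors up afterwards.

```python
import numpy as np

def toy_encode(sentences):
    """Hypothetical encoder: maps each string to [length, number of spaces]."""
    return np.array([[len(s), s.count(" ")] for s in sentences], dtype=np.float64)

questions = ["who sang it", "when was it", "who sang it"]  # note the duplicate
context_sentences = [
    ["a b c", "it was sung in 1999"],
    ["a much longer sentence here", "short one"],
    ["a b c", "it was sung in 1999"],
]

# Unique strings in first-occurrence order, each encoded in a single call
unique_questions = list(dict.fromkeys(questions))
unique_sentences = list(dict.fromkeys(s for sents in context_sentences for s in sents))
q_map = dict(zip(unique_questions, toy_encode(unique_questions)))
s_map = dict(zip(unique_sentences, toy_encode(unique_sentences)))

predictions = []
for q, sents in zip(questions, context_sentences):
    # np.stack turns the list of 1-D vectors into one 2-D array
    S = np.stack([s_map[s] for s in sents])
    d = np.linalg.norm(S - q_map[q], axis=1)
    predictions.append(int(np.argmin(d)))  # first minimum wins on ties
print(predictions)
```

With three rows in the toy data, only two questions and four sentences are actually encoded.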

In [24]:
def encode_sentences(sentences, encoder):
    embeddings = encoder.encode(sentences)
    return {sentence: embedding for sentence, embedding in zip(sentences, embeddings)}

def np_euclidean_distance(X, Y, df_squad_copy):
    distances = np.zeros(X.shape[0], dtype=np.float64)
    min_distance_indexes = np.zeros(X.shape[0], dtype=np.int32)
    for i in range(X.shape[0]):
        filtered_Y = Y[df_squad_copy["id"]==i]
        norms = np.zeros(filtered_Y.shape[0], dtype=np.float64)
        for ind, y in enumerate(filtered_Y):
            norms[ind] = np.linalg.norm(X[i] - y)

        min_ind = np.argmin(norms)
        distances[i] = norms[min_ind]
        min_distance_indexes[i] = min_ind

    return distances, min_distance_indexes

def get_sent2vec_prediction(df_squad, encoder):
    """
    Identify the answer sentence as one whose Sent2vec representation has minimal distance to that of the question.
    
    args:
        df_squad (pd.DataFrame) : a copy of the SQuAD dataset
        encoder (SentenceTransformer) :
            the Sent2vec encoder used to transform questions and sentences into their word embeddings
        
    returns: Tuple(question_embeddings, context_embeddings, df_sent2vec)
        question_embeddings (Dict[str, np.ndarray]) :
            a mapping between each unique question and its Sent2vec embedding
        context_embeddings (Dict[str, np.ndarray]) :
            a mapping between each unique context sentence and its Sent2vec embedding
        df_sent2vec (pd.DataFrame) :
            the input dataframe with two additional columns, "sent2vec_prediction" and "distance_value"
    """
    # Give every row an id so that its exploded sentences can be matched back to it
    df_squad['question_id'] = df_squad.index
    df_squad_copy = df_squad.explode('context_sentences')

    # Collect each unique question / context sentence once, in first-occurrence order
    unique_questions = list(dict.fromkeys(df_squad_copy['question']))
    unique_contexts = list(dict.fromkeys(df_squad_copy['context_sentences']))

    # Encode each collection in a single call rather than string by string
    question_embeddings = encode_sentences(unique_questions, encoder)
    context_embeddings = encode_sentences(unique_contexts, encoder)
    
    distances = []
    min_indexes = []
    for idx, row in df_squad.iterrows():
        q_id = row['question_id']
        q_text = row['question']
        
        c_text_list = df_squad_copy[df_squad_copy["question_id"]==q_id]['context_sentences']
        norms = np.zeros(c_text_list.shape[0], dtype=np.float64)
        q_embedding = question_embeddings[q_text]
        for ind, c_text in enumerate(c_text_list):
            c_embedding = context_embeddings[c_text]
            norms[ind] = np.linalg.norm(q_embedding - c_embedding)

        min_index = np.argmin(norms)
        distances.append(norms[min_index])
        min_indexes.append(min_index)
        

    
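    # Return a copy without the helper column, so only the two documented columns are added
    df_sent2vec = df_squad.drop(columns='question_id')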
    df_sent2vec['sent2vec_prediction'] = min_indexes
    df_sent2vec['distance_value'] = distances
    return question_embeddings, context_embeddings, df_sent2vec

In [112]:
def test_get_sent2vec_prediction():
    """Test on the first 10 rows"""
    question_embeddings_map, context_embeddings_map, df_sent2vec = \
        get_sent2vec_prediction(df_squad.head(10).copy(), sent_transformer)
    
    question = 'When did Beyoncé rise to fame?'
    check_approx(
        question_embeddings_map[question][:10],
            [-0.32787397503852844, -0.15557105839252472, 0.6588357090950012, -0.6630659699440002,
             0.5884889960289001, -0.04990821331739426, 0.4931581914424896, 0.24734026193618774,
             -0.278196781873703, 0.6030771732330322]
    )
    
    question = 'In what city and state did Beyonce  grow up? '
    check_approx(
        question_embeddings_map[question][:10],
            [-0.010077972896397114, -0.3763282299041748, 0.939608097076416, -0.580588161945343,
             0.10827723145484924, 0.25717461109161377, 0.6128496527671814, 0.5031235814094543,
             -0.4418582022190094, 0.4174884855747223]
    )
    
    context = "Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child."
    check_approx(
        context_embeddings_map[context][:10],
            [-0.10903950035572052, -0.27940085530281067, 0.23572109639644623, 0.5180771946907043,
             0.3553291857242584, 0.13151134550571442, 0.17776617407798767, -0.4766906797885895,
             -0.09587486833333969, 0.9343098998069763]
    )
    
    context = "Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time."
    check_approx(
        context_embeddings_map[context][:10],
            [-0.8026146292686462, 0.13804392516613007, 0.5916290283203125, 0.39244064688682556,
             -0.2768344283103943, 0.46939727663993835, 0.41246476769447327, 0.300344318151474,
             -1.0914496183395386, -0.1786373406648636]
    )

    check_approx(
        df_sent2vec["distance_value"],
        [12.849681854248047, 13.180705070495605, 11.950621604919434, 13.039735794067383, 12.867680549621582,
         13.494372367858887, 15.975934028625488, 17.264524459838867, 12.373725891113281, 12.654834747314453]
    )
    
    """Test on the full dataset"""
    question_embeddings_map, context_embeddings_map, df_sent2vec = \
        get_sent2vec_prediction(df_squad.copy(), sent_transformer)
    
    question = 'What is KMC an initialism of?'
    check_approx(
        question_embeddings_map[question][:10],
        [-0.9376349 ,  0.07794607, -0.08829201,  0.23485802,  0.05154163,
         -0.11316115, -0.15181658,  0.69910896,  0.43863776, -0.5214241]
    )
    
    question = 'In what year did Kathmandu create its initial international relationship?'
    check_approx(
        question_embeddings_map[question][:10],
        [0.20806803,  0.5848249 ,  0.51129144, -0.8319231 ,  0.10700534,
        0.35735554,  0.4291537 ,  1.0774763 , -0.12320331,  0.5771949 ]
    )
    
    context = "KMC's first international relationship was established in 1975 with the city of Eugene, Oregon, United States."
    check_approx(
        context_embeddings_map[context][:10],
        [0.65290284,  0.1876768 ,  0.5210961 , -0.13445881, -0.05164678,
        0.5333182 ,  0.6115602 ,  0.6255963 ,  0.38067245, -0.10421762]
    )
    
    context = 'It was established in 1972 and started to impart medical education from 1978.'
    check_approx(
        context_embeddings_map[context][:10],
        [0.72947043, -0.5591796 ,  0.86937994, -0.80692345, -0.05610372,
        0.07088739,  1.1165347 ,  0.8680078 ,  0.04112893, -0.71701765]
    )
    check_approx(
        df_sent2vec["distance_value"].tail(10),
        [11.888211250305176, 13.456267356872559, 14.015336990356445, 14.333199501037598,
         11.331104278564453, 10.903414726257324, 17.290882110595703, 14.260327339172363,
         13.246850967407227, 16.56328582763672]
    )
    
    sent2vec_accuracy = (df_sent2vec["sent2vec_prediction"] == df_sent2vec["answer_sent_index"]).mean()
    check_equal(round(sent2vec_accuracy, 2), 0.68)
    print("All tests passed!")
    
    print("Saving the embedding to pickle files for later use ...")
    with open("question_embeddings_map.pkl", "wb") as f1, open("context_embeddings_map.pkl", "wb") as f2:
        pickle.dump(question_embeddings_map, f1)
        pickle.dump(context_embeddings_map, f2)
    print("Done!")
        
    
%time test_get_sent2vec_prediction()


If your implementation is sufficiently optimized, the local test should finish in about 7.5 minutes or less on a `STANDARD_NC8AS_T4_V3` GPU compute. If your code runs for more than 10 minutes, you should try to improve its efficiency.

We see a slight improvement in accuracy, compared to the previous two methods. It seems like using the meanings of the words isn't too effective here. Now we will try a different technique that utilizes the linguistic structures of the questions and context sentences. Let's walk through an example of what we will do first.

Assume we have the following input `question`:

```
How many parameters does BERT-large have?
```
and `context`:
```
BERT-large is really big... it has 24-layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance.
```

We will perform the following steps:
1. Lowercase `question` and identify its *root word*:
```py
question_root = "have"
```
1. Lemmatize this root word to reduce it to its base form.
```py
lemmatized_question_root = "have"
```
1. Split `context` into context sentences and lowercase them:
```py
sent1 = "bert-large is really big... it has 24-layers and an embedding size of 1,024, for a total of 340m parameters!"
sent2 = "altogether it is 1.34gb, so expect it to take a couple minutes to download to your colab instance."
```
1. Identify the *noun chunks* in each context sentence:
```py
sent1_ncs = ["it", "24-layers", "an embedding size", "a total", "340m parameters"]
sent2_ncs = ["it", "1.34gb", "it", "a couple minutes", "your colab instance"]
```
1. Extract the *root word* from each noun chunk:
```py
sent1_nc_roots = ["it", "layers", "size", "total", "parameters"]
sent2_nc_roots = ["it", "gb", "it", "minutes", "instance"]
```
1. Identify and lemmatize the *head* for each of the above root words. Store these heads into sets:
```py
sent1_nc_root_heads = {"have", "of", "layer", "for"}
sent2_nc_root_heads = {"to", "take", "is"}
```
1. Return the index of the first context sentence whose `nc_root_heads` set contains the question's root word `lemmatized_question_root`. If no context sentence meets this requirement, return 0 (in other words, we predict that the first context sentence is the answer, based on the assumption that the first sentence in a paragraph typically contains the most important information).
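
The final matching step can be sketched in plain Python once the head sets are available. This is only an illustration: the helper name `first_matching_sentence` is hypothetical, and the head sets are copied from the example above rather than computed with spaCy.

```python
def first_matching_sentence(question_root, heads_per_sentence):
    """Return the index of the first sentence whose head set contains the root.

    Falls back to 0 (the first sentence) when no set contains it.
    """
    for i, heads in enumerate(heads_per_sentence):
        if question_root in heads:
            return i
    return 0


# Head sets copied from the walkthrough above (not computed here).
sent1_nc_root_heads = {"have", "of", "layer", "for"}
sent2_nc_root_heads = {"to", "take", "is"}
heads = [sent1_nc_root_heads, sent2_nc_root_heads]

print(first_matching_sentence("have", heads))      # 0: matched in the first sentence
print(first_matching_sentence("take", heads))      # 1: matched in the second sentence
print(first_matching_sentence("download", heads))  # 0: no match, default to first
```

For the example question, the lemmatized root `"have"` is found in the first sentence's set, so the predicted index is 0.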

We first define some global variables that will be employed in this task. In particular, we will use the `en_core_web_sm` model from spaCy, and the part-of-speech lemmatization procedure from Project 4. Note: the autograder will use this cell so do not change its content and do not add the tag `excluded_from_script`.

In [11]:
en_nlp = spacy.load("en_core_web_sm")
lemmatizer = WordNetLemmatizer()
pos_mapping = {
    'ADJ' : wordnet.ADJ, 'NOUN' : wordnet.NOUN,
    'VERB' : wordnet.VERB, 'ADP' : wordnet.ADV
}

def lemmatize_token(token):
    """
    Lemmatize a spaCy token based on its part-of-speech tag. If a tag is not recognized, treat it as a noun.
    
    args:
        token (spacy.tokens.token.Token) : an output token when inputting a raw string to a spaCy model
    
    return:
        str : the lemmatized string
    """
    return lemmatizer.lemmatize(token.text, pos = pos_mapping.get(token.pos_, wordnet.NOUN))

As a first step, we recommend consulting the [AST primer](https://nbviewer.jupyter.org/url/clouddatascience.blob.core.windows.net/primers/ast-primer/ast_primer.ipynb) and the [spaCy documentation](https://spacy.io/usage/linguistic-features) to replicate the above example in code. Once you have successfully done so, proceed to the next question.

**Notes**:
* To identify a question's root word, you can input it to the `en_nlp` model, extract the first sentence from the `.sents` generator, and use the `.root` attribute. This returns a `Token` that you can then input to `lemmatize_token` to get the lemmatized form.
* Remember to also lemmatize the heads of the root words for the noun chunks (in the above example, `sent1_nc_root_heads` and `sent2_nc_root_heads` contain lemmatized strings).

In [12]:
question = "How many parameters does BERT-large have?"
context_sentences = [
    "BERT-large is really big... it has 24-layers and an embedding size of 1,024, for a total of 340M parameters!",
    "Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance."
]

# your code here

### Question 4: Root word matching
Implement the function `get_rootword_prediction` that performs the above steps on the dataset `df_squad`. For every row, it stores:
* the predicted answer sentence index in a new column `"rootword_prediction"`
* the lemmatized root word of the question in a new column `"question_root"`
* the set of lemmatized heads of the root words for the noun chunks in the predicted context sentence (for the above example, this would be the set `sent1_nc_root_heads`) in a new column `"nc_root_heads"`.

**Notes**:
* Recall that the context paragraphs have already been split into sentences in the column `context_sentences`.
* Remember to lowercase all the questions and context sentences when doing root word matching (but do not change the original `question`, `context_paragraph` or `context_sentences` columns in `df_squad`).
* If a question contains multiple sentences, make sure you extract the question root word from the **first sentence**, not the last.

In [28]:
def find_root_head(question):
    """
    Return the lemmatized root word of the first sentence in a question.
    """
    doc = en_nlp(question)
    for token in doc:
        # The first token with the ROOT dependency label is the root of the first sentence
        if token.dep_ == "ROOT":
            return lemmatize_token(token)

    # If no root is found, fall back to the first word
    return lemmatize_token(doc[0])

def get_rootword_prediction(df_squad, en_nlp):
    """
    Identify the answer sentence as the first sentence whose list of heads of the root words for its noun chunks
    contains the question's root word.
    
    args:
        df_squad (pd.DataFrame) : a copy of the SQuAD dataset
        en_nlp (spacy.lang.en.English) : a pre-trained SpaCy language model
    
    returns:
        pd.DataFrame : the input dataframe with three additional columns,
            "rootword_prediction", "question_root" and "nc_root_heads"
    """
    question_roots = []
    nc_root_heads_list = []
    df_squad['rootword_prediction'] = None
    df_squad['question_root'] = None
    df_squad['nc_root_heads'] = None
    for index, row in tqdm(df_squad.iterrows()):
        question = row['question']
        context = row['context_sentences']
        
        # 1. Find question's root head
        question_root = find_root_head(question.lower())
        
        context_roots = []
        for sentence in context:
            sentence_doc = en_nlp(sentence.lower())
            root_heads = set()

            # 2. Find noun chunks and extract their root heads
            for chunk in sentence_doc.noun_chunks:
                root_head = lemmatize_token(chunk.root.head)
                root_heads.add(root_head)

            context_roots.append(root_heads)
            
        # 3. Find the corresponding index
        for i, root_heads in enumerate(context_roots):
            if question_root in root_heads:
                df_squad.at[index, 'rootword_prediction'] = i
                break

        df_squad.at[index, 'question_root'] = question_root

        if df_squad.at[index, 'rootword_prediction'] is None:
            df_squad.at[index, 'rootword_prediction'] = 0
        df_squad.at[index, 'nc_root_heads'] = context_roots[df_squad.at[index, 'rootword_prediction']]

    return df_squad


In [29]:
def test_rootword_prediction():
    """Test on the first 10 rows"""
    df_rootword = get_rootword_prediction(df_squad.head(10).copy(), en_nlp)
    check_equal(df_rootword.shape, (10, 10))
    
    check_equal(df_rootword["rootword_prediction"].tolist(), [0, 0, 0, 1, 2, 1, 0, 0, 0, 0])
    
    check_equal(df_rootword["nc_root_heads"].tolist(), [
        {'carter', 'producer', 'singer', 'songwriter', 'is'},
        {'carter', 'producer', 'singer', 'songwriter', 'is'},
        {'carter', 'producer', 'singer', 'songwriter', 'is'},
        {'in', 'perform', 'of', 'to', 'houston', 'as'},
        {'become', 'father', 'of', 'by'},
        {'in', 'perform', 'of', 'to', 'houston', 'as'},
        {'carter', 'producer', 'singer', 'songwriter', 'is'},
        {'carter', 'producer', 'singer', 'songwriter', 'is'},
        {'carter', 'producer', 'singer', 'songwriter', 'is'},
        {'carter', 'producer', 'singer', 'songwriter', 'is'}
    ])
    
    check_equal(df_rootword["question_root"].tolist(), [
        'start', 'compete', 'leave', 'in', 'become', 'in', 'make', 'manage', 'rise', 'have'
    ])
    
    """Test on 5000 sampled data points"""
    df_rootword = get_rootword_prediction(df_squad.sample(5000, random_state = 0).copy(), en_nlp)
    check_equal(df_rootword.shape, (5000, 10))
    #print(df_rootword["rootword_prediction"].tail(10).tolist())
    #print(df_rootword["question_root"].tail(10).tolist())
    #print(df_rootword["nc_root_heads"].tail(10).tolist())
    #print((df_rootword["rootword_prediction"] == df_rootword["answer_sent_index"]).values.mean())
    check_equal(df_rootword["rootword_prediction"].value_counts().to_dict(), 
        {0: 3547, 1: 523, 2: 394, 3: 244, 4: 156, 5: 64, 6: 41, 7: 18, 8: 4, 9: 4, 10: 2, 11: 2, 21: 1}
    )
    
    check_equal(df_rootword["rootword_prediction"].tail(10).tolist(), [0, 1, 3, 1, 0, 0, 1, 2, 0, 4])
    
    check_equal(df_rootword["question_root"].tail(10).tolist(), [
        'see', 'kill', 'in', 'vote', 'wa', 'is', 'say', 'wa', 'compile', 'assign'
    ])
    
    check_equal(df_rootword["nc_root_heads"].tail(10).tolist(), [
        {'announce', 'by', 'from', 'heed', 'in', 'live'},
        {'destroy', 'in', 'kill', 'of'},
        {'be', 'in', 'park'},
        {'arouse', 'fuel', 'in', 'of', 'studdard', 'than', 'vote'},
        {'at', 'by', 'hold', 'in', 'louis', 'of', 'on'},
        {'achieve', 'as', 'garde', 'have', 'in', 'on', 'skalkottas', 'with', 'xenakis'},
        {'call', 'is', 'of', 'say', 'statement'},
        {'after', 'by', 'find', 'of', 'say', 'wa'},
        {'architect', 'artist', 'attract', 'from', 'in', 'of', 'ruler', 'through'},
        {'assign', 'at', 'by', 'glanville', 'like', 'manage', 'of', 'teach', 'wa', 'with'}
    ])
    
    accuracy = (df_rootword["rootword_prediction"] == df_rootword["answer_sent_index"]).values.mean()
    check_equal(accuracy, 0.4626)
    print("All tests passed!")
    
%time test_rootword_prediction()


10it [00:00, 30.84it/s]
5000it [02:54, 28.57it/s]

[0, 1, 3, 1, 0, 0, 1, 2, 0, 4]
['see', 'kill', 'in', 'vote', 'wa', 'is', 'say', 'wa', 'compile', 'assign']
[{'announce', 'in', 'by', 'heed', 'live', 'from'}, {'of', 'in', 'destroy', 'kill'}, {'be', 'park', 'in'}, {'of', 'than', 'fuel', 'arouse', 'in', 'vote', 'studdard'}, {'of', 'hold', 'in', 'by', 'louis', 'at', 'on'}, {'achieve', 'xenakis', 'in', 'with', 'as', 'skalkottas', 'have', 'garde', 'on'}, {'of', 'is', 'say', 'call', 'statement'}, {'after', 'wa', 'of', 'by', 'say', 'find'}, {'architect', 'attract', 'of', 'ruler', 'in', 'artist', 'through', 'from'}, {'wa', 'of', 'manage', 'with', 'by', 'like', 'assign', 'glanville', 'at', 'teach'}]
0.4626
All tests passed!
CPU times: user 2min 55s, sys: 468 ms, total: 2min 55s
Wall time: 2min 55s





Since this procedure takes a very long time, we only tested it on a random sample of 5000 data points. If your code is sufficiently optimized, the local test should finish in about 4 minutes on a `STANDARD_NC8AS_T4_V3` compute.

## Part C: Supervised Models
So far we have been exploring unsupervised methods for answer extraction, which involve dividing the questions and contexts into tokens and projecting those tokens into a common representation space. You may have noticed that their performance wasn't particularly great because we didn't perform any training on the dataset; instead, we only used pre-defined heuristics and pre-trained models. From this point on, we will move to the supervised learning domain, where we make use of the ground-truth answers and build models that learn from these answers.

### Question 5: Preparing dataset for supervised learning
We will first consider a binary classification setting, where we are given a question and a context sentence, and need to predict whether this context sentence contains the answer. In this setting, we can get several training data points from each row in the original dataset `df_squad`. In particular, if a row in `df_squad` looks like the following:

|question|context_sentences|answer_sent_index|
|---|---|---|
|`q`|`[s0, s1, s2, s3]`|2|

then it contributes four data points:

|question|context_sentence|is_answer_sent|
|---|---|---|
|`q`|`s0`|0|
|`q`|`s1`|0|
|`q`|`s2`|1|
|`q`|`s3`|0|

More generally, a row in `df_squad` where the `context_sentences` list has `n` sentences will be transformed into `n` rows, one for each context sentence. Among these new rows, only the row at index `answer_sent_index` gets assigned the label 1, while the others get the label 0.
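
As a rough illustration of this expansion for a single row, here is a plain-Python sketch. The helper name `expand_row` is hypothetical, and it works on plain values rather than on a dataframe, so it is not a drop-in solution.

```python
def expand_row(question, context_sentences, answer_sent_index):
    """Turn one SQuAD-style row into one labeled row per context sentence."""
    return [
        {
            "question": question,
            "context_sentence": sentence,
            # Only the sentence at answer_sent_index is labeled 1
            "is_answer_sent": int(i == answer_sent_index),
        }
        for i, sentence in enumerate(context_sentences)
    ]


# The example row from the table above
rows = expand_row("q", ["s0", "s1", "s2", "s3"], 2)
for r in rows:
    print(r)
```

The four printed rows carry the labels 0, 0, 1, 0, matching the second table.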

Implement the function `build_data_for_classification` that turns the original dataset `df_squad` into a dataframe with 3 columns -- `question`, `context_sentence` and `is_answer_sent` -- using the procedure specified above.

**Notes**:
* Keep in mind that the new dataframe has a column named `context_sentence`, **not** `context_sentences`.
* You should preserve the original row ordering in the input dataframe.

In [28]:
def build_data_for_classification(df_squad):
    """
    Convert the SQuAD dataset into a format where every row contains one question, one context sentence,
    and a flag that indicates whether the context sentence is the answer.
    
    args:
        df_squad (pd.DataFrame) : a copy of the SQuAD dataset
        
    returns:
        pd.DataFrame : a new dataframe with 3 columns: question, context_sentence, is_answer_sent
    """
    rows = []
    for _, row in df_squad.iterrows():
        # One output row per context sentence; only the answer sentence gets label 1
        for i, context_sentence in enumerate(row["context_sentences"]):
            rows.append((row["question"], context_sentence, int(i == row["answer_sent_index"])))

    return pd.DataFrame(rows, columns=["question", "context_sentence", "is_answer_sent"])

In [293]:
def test_build_data_for_classification():
    """Test on 10 random rows"""
    df_sample = df_squad.sample(n = 10, random_state = 0).copy()
    df_formatted = build_data_for_classification(df_sample)
    
    assert df_formatted.shape == (48, 3), df_formatted.shape
    
    assert sorted(df_formatted.columns) == ['context_sentence', 'is_answer_sent', 'question']
    
    assert df_formatted['is_answer_sent'].tolist() == [
        0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
        1, 0, 0, 0, 1, 0, 1, 0
    ], df_formatted['is_answer_sent'].tolist()
    
    # check that question orderings are preserved
    assert (df_formatted["question"].unique() == df_sample["question"].unique()).all()
    
    """Test on the full dataset"""
    df_formatted = build_data_for_classification(df_squad.copy())
    
    assert df_formatted.shape == (443259, 3), df_formatted.shape
    
    assert df_formatted['is_answer_sent'].sample(n = 40, random_state = 200).tolist() == [
        1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 1, 0, 1
    ], df_formatted['is_answer_sent'].sample(n = 40, random_state = 200).tolist()
    
    assert (df_formatted["question"].unique() == df_squad["question"].unique()).all()
    print("All tests passed!")
    
test_build_data_for_classification()

100%|██████████| 10/10 [00:00<00:00, 58092.85it/s]
100%|██████████| 86821/86821 [00:55<00:00, 1572.51it/s]


All tests passed!


Before building our supervised learning models, we will load the embeddings you created in Question 3 and set up the train set and test set.

In [29]:
question_embeddings_map = pd.read_pickle("question_embeddings_map.pkl")
context_embeddings_map = pd.read_pickle("context_embeddings_map.pkl")

In [30]:
df_squad_train, df_squad_test = train_test_split(df_squad, train_size = 0.8, random_state = 0)
df_train_formatted = build_data_for_classification(df_squad_train.reset_index())
df_test_formatted = build_data_for_classification(df_squad_test.reset_index())

### Question 6: Logistic Regression
Having set up the dataset for binary classification, we can now train a logistic regression model. While the binary labels are already set up, we still need to construct the feature vectors as follows. For every question `q` and context sentence `s`:
* Convert the question to its word embedding $x_q \in \mathbb{R}^k$, using the question embedding map from Question 3.
* Convert the context sentence to its word embedding $x_s \in \mathbb{R}^l$, using the context embedding map from Question 3.
* Concatenate these two vectors to get the input vector to the logistic regression model:

$$x_{q,s} = (x_{q1} \quad x_{q2} \quad \ldots \quad x_{qk} \quad x_{s1} \quad x_{s2} \quad \ldots \quad x_{sl})^\top \in \mathbb{R}^{k+l}.$$

Implement the function `get_lr_prediction` that performs the following steps:

1. Construct the feature vector for each row of the train set `df_train_formatted`, using the above formula.
1. Use an Sklearn `StandardScaler` (with default parameters) to fit and transform the train set `df_train_formatted`.
1. Train an Sklearn `LogisticRegression` model on the train set `df_train_formatted`.
1. Use this model to perform prediction on the test set `df_test_formatted`.
1. Return the trained LR model and its accuracy on the test set (i.e., the number of correct predictions divided by the test set size).

**Notes**:
* When creating a `LogisticRegression` model you should set `random_state` to the input `seed` and `max_iter` to 1000. You do not need to specify any other parameter.
* Make sure you also standardize the feature matrix built from the test set before inputting it to the LR model for prediction.
* Similar to Question 3, if you get a NumPy array that contains other NumPy arrays, you can use `np.stack` to turn it to a normal multi-dimensional array (otherwise it will be treated as a 1D array of pointers to other arrays).
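
To make the concatenation and the train/test standardization concrete, here is a small standard-library sketch. It is not the scikit-learn implementation: the helper names (`build_features`, `fit_scaler`, `transform`) and the 2-dimensional toy vectors are made up for illustration, and the zero-variance guard is an assumption added to keep the toy division well defined.

```python
from statistics import fmean, pstdev


def build_features(q_vec, s_vec):
    """Concatenate a question vector and a sentence vector into one feature row."""
    return list(q_vec) + list(s_vec)


def fit_scaler(rows):
    """Compute the per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Replace a zero standard deviation with 1.0 so the division is always defined
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Standardize every row using previously computed statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy embeddings with k = l = 2 (made-up values)
train = [build_features([1.0, 2.0], [0.0, 4.0]),
         build_features([3.0, 2.0], [2.0, 8.0])]
test = [build_features([2.0, 2.0], [1.0, 10.0])]

means, stds = fit_scaler(train)              # statistics come from the train set only
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)   # the test set reuses the train statistics
print(train_scaled)
print(test_scaled)
```

The point of the sketch is the last three statements: the statistics are estimated once on the train set and then applied unchanged to the test set.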

In [31]:
def get_lr_prediction(df_train_formatted, df_test_formatted, question_embeddings_map, context_embeddings_map, seed = 0):
    """
    Train and evaluate the performance of a binary logistic regression model to predict
    whether a context sentence contains the answer to a given question.
    
    args:
        df_train_formatted (pd.DataFrame) : the train set dataframe with 3 columns:
            question, context_sentence, is_answer_sent
        df_test_formatted (pd.DataFrame) : the test set dataframe with 3 columns:
            question, context_sentence, is_answer_sent
        question_embeddings_map (dict[str, np.ndarray]) : a mapping from question to word embedding
        context_embeddings_map (dict[str, np.ndarray]) : a mapping from context sentence to word embedding
        seed (int) : the random seed used in LogisticRegression
        
    return: Tuple(trained_model, accuracy)
        trained_model (sklearn.linear_model.LogisticRegression) : the LR model trained on the train set
        accuracy (float) : the accuracy score of the trained model on the test set
    """
    def construct_feature_vector(row):
        question_embedding = question_embeddings_map[row["question"]]
        context_embedding = context_embeddings_map[row["context_sentence"]]
        return np.concatenate((question_embedding, context_embedding))

    X_train = np.stack(df_train_formatted.apply(construct_feature_vector, axis=1))
    X_test = np.stack(df_test_formatted.apply(construct_feature_vector, axis=1))

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    
    lr = LogisticRegression(random_state=seed, max_iter=1000)
    lr.fit(X_train_scaled, df_train_formatted["is_answer_sent"])

    y_pred = lr.predict(X_test_scaled)
    accuracy = np.mean(y_pred == df_test_formatted["is_answer_sent"])

    return lr, accuracy

In [297]:
def test_get_lr_prediction():
    "Test on the first 1000 rows of the dataset"
    df_squad_train_mini, df_squad_test_mini = train_test_split(df_squad.head(1000), train_size = 0.8, random_state = 0)
    df_train_formatted_mini = build_data_for_classification(df_squad_train_mini.reset_index())
    df_test_formatted_mini = build_data_for_classification(df_squad_test_mini.reset_index())
    lr_mini, acc_mini = get_lr_prediction(
        df_train_formatted_mini, df_test_formatted_mini,
        question_embeddings_map, context_embeddings_map
    )
    assert lr_mini.coef_.flatten()[:10].round(2).tolist() == [
        0.07, -0.01, -0.03, -0.03, -0.01, 0.01, -0.03, 0.0, 0.04, -0.04
    ]
    assert lr_mini.intercept_.round(2)[0] == -2.73
    assert round(acc_mini, 2) ==  0.81
    
    """Test on the entire dataset"""
    lr, acc = get_lr_prediction(df_train_formatted, df_test_formatted, question_embeddings_map, context_embeddings_map)
    assert lr.coef_.flatten()[:10].round(2).tolist() == [
        -0.04, 0.0, 0.01, 0.0, -0.0, 0.0, -0.01, -0.0, -0.03, 0.01
    ]
    assert lr.intercept_.round(2)[0] == -1.54
    assert round(acc, 2) == 0.80
    
    print("All tests passed!")
    
test_get_lr_prediction()

100%|██████████| 800/800 [00:00<00:00, 59385.22it/s]
100%|██████████| 200/200 [00:00<00:00, 76685.33it/s]


All tests passed!


We obtain about 80% accuracy in this binary classification task, which is not too bad! However, it's important to note that this accuracy cannot be compared with those from previous questions, because it is evaluated in a different setting (`df_test_formatted`). If we were only interested in whether the ground truth answer sentences are correctly detected, we would evaluate the accuracy on the original test set `df_squad_test`, instead of the formatted one.
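
One possible way to carry out that sentence-level evaluation is to pick, for each question, the context sentence with the highest predicted probability and compare its index with `answer_sent_index`. The sketch below assumes the per-sentence probabilities have already been grouped by question (for example from the model's `predict_proba` output); the helper name and the numbers are made up.

```python
def sentence_level_accuracy(scores_per_question, answer_sent_indices):
    """Fraction of questions whose top-scoring sentence is the answer sentence."""
    correct = 0
    for scores, answer_index in zip(scores_per_question, answer_sent_indices):
        # Index of the sentence with the highest score (first one wins on ties)
        predicted_index = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted_index == answer_index)
    return correct / len(answer_sent_indices)


# Made-up probabilities for three questions with 4, 2 and 3 context sentences
scores = [[0.1, 0.2, 0.6, 0.3], [0.7, 0.4], [0.2, 0.5, 0.1]]
answers = [2, 1, 1]
print(sentence_level_accuracy(scores, answers))
```

Here the first and third questions are predicted correctly and the second is not, so the result is 2/3.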

While logistic regression on Sent2vec representation performs relatively well, it still relies on a *uni-directional* representation of words. In this setting, the same word is always mapped to the same vector, even though it may have different meanings in different contexts (e.g., the word `bank` in `bank account` is not the same as in `river bank`). The final model we will explore in this project, which also addresses the above issue, is called BERT (Bidirectional Encoder Representations from Transformers). This is a language model that learns to predict the probability of a sequence of words. The reason for BERT's success is its large feedforward layers and its attention heads, giving it 110 million parameters for the base model and 340 million parameters for the large model. It has been trained on Wikipedia articles and the BookCorpus dataset, which contains text from over 10,000 books of different genres, on the tasks of next sentence prediction (NSP) and masked language modeling (MLM). BERT represents the current state of the art in various NLP tasks, including question answering.

One nice feature of NLP models like BERT is that they have already been pre-trained on massive text corpora, but can also be fine-tuned further for a specific domain (in this case, our SQuAD dataset). We will implement this workflow in the rest of the project. First, we define a sub-class of `Dataset`, similar to Project 6, to turn our SQuAD dataset into a format that PyTorch can work with.

In [32]:
class SQuADDataset(Dataset):
    def __init__(self, df_squad):
        """
        Class constructor.
        """
        self.data = df_squad
        self.data_cols = ["question", "context_paragraph", "answer_start", "answer_end"]
    
    def __len__(self):
        """
        Get the dataset length.
        """
        return len(self.data)
    
    def __getitem__(self, index):
        """
        Get the question, context paragraph, answer start and answer end value
        at the row specified by the input index from the dataset.
        """
        return tuple(self.data.loc[index, self.data_cols])

### Question 7: Tokenization for BERT
Similar to how our `LogisticRegression` model expects a real-valued vector as input, BERT also has its own way of constructing the input. At a high level, we want to convert each input tuple
```
(question, context_paragraph, answer_start, answer_end)
```
into a tuple
```
(encoding, token_start, token_end)
```
where `encoding` is a dictionary that maps three keywords -- `"input_ids"`, `"token_type_ids"`, `"attention_mask"` -- to their respective vector representations.
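
To give a feel for what these three entries contain, here is a toy sketch in plain Python. It is not the HuggingFace tokenizer: the helpers `encode_pair` and `pad_batch`, the whitespace-level tokens and the on-demand vocabulary are all assumptions for illustration, and the real tokenizer returns tensors rather than lists.

```python
def encode_pair(question_tokens, context_tokens, vocab):
    """Build BERT-style inputs for one (question, context) pair, without padding."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    first_segment_len = len(question_tokens) + 2  # [CLS] + question + [SEP]
    return {
        # Toy ids: every new token gets the next free integer
        "input_ids": [vocab.setdefault(t, len(vocab)) for t in tokens],
        # 0 marks the question segment, 1 marks the context segment
        "token_type_ids": [0] * first_segment_len + [1] * (len(tokens) - first_segment_len),
        "attention_mask": [1] * len(tokens),
    }


def pad_batch(encodings, pad_id=0):
    """Right-pad every sequence to the longest one in the batch."""
    longest = max(len(e["input_ids"]) for e in encodings)
    for e in encodings:
        extra = longest - len(e["input_ids"])
        e["input_ids"] += [pad_id] * extra
        e["token_type_ids"] += [0] * extra
        e["attention_mask"] += [0] * extra  # padding positions are masked out
    return encodings


vocab = {"[PAD]": 0}  # toy vocabulary, grown on demand
batch = pad_batch([
    encode_pair(["who", "sings"], ["she", "sings", "well"], vocab),
    encode_pair(["who"], ["she"], vocab),
])
for enc in batch:
    print(enc)
```

In the second, shorter example the last three positions are padding, so its `attention_mask` ends in three zeros.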

Implement the class `SQuADTokenizer` that performs the above conversion using a pre-trained BERT tokenizer. In particular, the class constructor accepts a `BertTokenizer` instance and the maximum sequence size `max_len`, which are stored as instance variables. Then, the `__call__` function accepts a batch of data and performs the following steps:
1. Extract the questions, context paragraphs, answer start indexes, and answer end indexes from the batch.
1. Input the questions and contexts to the tokenizer as separate lists, so that the tokenizer can append special tokens such as `[SEP]` that help BERT recognize different passages. You should also set the following parameters: `padding` to `"longest"`, `truncation` to `True`, `max_length` to the stored `max_len`, and `return_tensors` to `"pt"`.
1. Convert the `answer_start` and `answer_end` character indices to the indices of the two tokens that correspond to these start and end characters. For example, let's say:
```py
question = "Which pets does your son have?"
context_paragraph = "My son loves pets. He has two cats and a dog."
tokenized_context = ["my", "son", "love", "pet", "he", "have", "two", "cat", "and", "a", "dog"]
answer_start, answer_end = 26, 43 # answer = "two cats and a dog"
```
Now you need to identify the token that corresponds to the character `context_paragraph[answer_start]`. This character is `t` and the corresponding token is `two`, which is at index 6 in `tokenized_context`. Similarly for `answer_end`, the character `context_paragraph[answer_end]` is `g`, and the corresponding token is `dog`, which is at index 10 in `tokenized_context`. In summary, you are converting
```
(answer_start = 26, answer_end = 43)
```
to
```
(token_start = 6, token_end = 10)
```
1. Return the encoded data (output from calling the tokenizer on the input questions and context paragraphs in Step 2), as well as the list of token start indexes and the list of token end indexes.

**Notes**:
* The [Tokenizer page](https://huggingface.co/transformers/main_classes/tokenizer.html) and the section about [using the tokenizer](https://huggingface.co/transformers/quicktour.html?highlight=max_length) on HuggingFace may be helpful.
* To convert `(answer_start, answer_end)` to `(token_start, token_end)`, you can use the [.char_to_token](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding.char_to_token) method. Here the `batch_or_char_index` is the index of the current data point within the input batch, and `sequence_index` should be 1 because we are doing this conversion on the context paragraph, which is the second input string to the tokenizer. You need to specify both `batch_or_char_index` and `char_index` when calling `.char_to_token`.
* Sometimes calling `.char_to_token` will return `None` because the tokens have been changed (e.g., due to lemmatization) from their original form in `context_paragraph`. A simple work-around is to also consider the token that corresponds to the next or previous character. If doing so still yields `None`, we will just set the token index as `max_len` as a last resort. More formally, you should assign `token_start` to the first element that is not `None` among the following four values:
    * `char_to_token(char_index = answer_start, ...)`
    * `char_to_token(char_index = answer_start+1, ...)`
    * `char_to_token(char_index = answer_start-1, ...)`
    * `max_len`
* You will do a similar assignment for `token_end` as well.
* After extracting `answer_starts` and `answer_ends` from the batch, you should convert them to `LongTensor` so that they can be used in `char_to_token()`.
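
The fallback rule from the notes can be sketched without the HuggingFace API. In the sketch below, `char_to_token` is a toy stand-in that only looks at character spans (the real `BatchEncoding.char_to_token` also takes the batch index and `sequence_index`), `resolve_token` is a hypothetical helper name, and the tokenization is a simple whitespace split with trailing periods dropped.

```python
def char_to_token(offsets, char_index):
    """Toy stand-in: index of the token whose [start, end) span covers char_index."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


def resolve_token(offsets, char_index, max_len):
    """First non-None result among char_index, char_index + 1, char_index - 1, else max_len."""
    for candidate in (char_index, char_index + 1, char_index - 1):
        token_index = char_to_token(offsets, candidate)
        if token_index is not None:
            return token_index
    return max_len


context_paragraph = "My son loves pets. He has two cats and a dog."

# Toy tokenization: whitespace-separated words with trailing periods dropped,
# recorded as (start, end) character spans into context_paragraph.
offsets, position = [], 0
for word in context_paragraph.split():
    start = context_paragraph.index(word, position)
    position = start + len(word)
    offsets.append((start, start + len(word.rstrip("."))))

print(resolve_token(offsets, 26, max_len=512))   # the 't' of "two": direct hit
print(resolve_token(offsets, 44, max_len=512))   # the final '.': falls back to index 43
print(resolve_token(offsets, 25, max_len=512))   # the space before "two": falls back to index 26
print(resolve_token(offsets, 100, max_len=512))  # nothing nearby: last resort max_len
```

With this toy tokenization the four calls give 6, 10, 6 and 512, which shows the three fallback levels in order.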

In [33]:
class SQuADTokenizer:
    def __init__(self, tokenizer, max_len = 512):
        """
        Store the input BertTokenizer instance and the max length
        """
        self.tokenizer = tokenizer
        self.max_len = max_len
    
    def __call__(self, batch):
        """
        Perform tokenization on a batch of data
        
        args:
            batch (Tuple[questions, contexts, answer_starts, answer_ends]):
                questions (List[str]) : a list of questions
                contexts (List[str]) : a list of context paragraphs
                answer_starts (List[int]) : a list of answer start indexes
                answer_ends (List[int]) : a list of answer end indexes
        
        returns:
            Tuple[encoding, token_starts, token_ends]
                encoding (dict[str, tensor]): the output of calling tokenizer on the questions and contexts
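                token_starts (List[int]) : the list of indexes for the tokens that correspond to the first answer character
                token_ends (List[int]) : the list of indexes for the tokens that correspond to the answer end character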
        """
        questions, contexts, answer_starts, answer_ends = batch
        answer_starts = torch.LongTensor(answer_starts)
        answer_ends = torch.LongTensor(answer_ends)
        encoding = self.tokenizer(questions, contexts, padding="longest", truncation=True, max_length=self.max_len, return_tensors="pt")
        token_starts, token_ends = [], []

        for i, (answer_start, answer_end) in enumerate(zip(answer_starts, answer_ends)):
            token_start = self._char_to_token(i, answer_start, encoding, sequence_index=1)
            token_end = self._char_to_token(i, answer_end, encoding, sequence_index=1)

            token_starts.append(token_start)
            token_ends.append(token_end)

        return encoding, token_starts, token_ends

    def _char_to_token(self, batch_or_char_index, char_index, encoding, sequence_index):
        """
        Map a character index in the context to a token index. Try the character itself,
        then the next and the previous character, and use max_len as a last resort.
        """
        char_index = int(char_index)
        for offset in (0, 1, -1):
            # Skip negative character indexes, which cannot map to any token
            if char_index + offset < 0:
                continue
            token_index = encoding.char_to_token(batch_or_char_index, char_index = char_index + offset, sequence_index = sequence_index)
            if token_index is not None:
                return token_index

        return self.max_len


In [318]:
def test_tokenizer():
    tokenizer = BertTokenizerFast.from_pretrained('rsvp-ai/bertserini-bert-base-squad')
    batch_tokenizer = SQuADTokenizer(tokenizer)
    example_1 = (
        ['When did Beyonce start becoming popular?'],
        ['Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'],
        [269],
        [286]
    )
    encoding, token_starts, token_ends = batch_tokenizer(example_1)
    check_equal(list(encoding["input_ids"].shape), [1, 174])
    check_equal(encoding["input_ids"][0][:10].numpy().tolist(), [
        101, 2043, 2106, 20773, 2707, 3352, 2759, 1029, 102, 20773
    ])
    check_equal(encoding["token_type_ids"].numpy().tolist(), [[0]*9 + [1]*165])
    check_equal(encoding["attention_mask"].numpy().tolist(), [[1] * 174])
    check_equal(token_starts, [75])
    check_equal(token_ends, [79])

    example_2 = (
        ['When did Beyonce start becoming popular?', 'What score did the writer from the Chicago Tribune give to Spectre?'], 
        ['Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".', 'Critical appraisal of the film was mixed in the United States. In a lukewarm review for RogerEbert.com, Matt Zoller Seitz gave the film 2.5 stars out of 4, describing Spectre as inconsistent and unable to capitalise on its potential. Kenneth Turan, reviewing the film for Los Angeles Times, concluded that Spectre "comes off as exhausted and uninspired". Manohla Dargis of The New York Times panned the film as having "nothing surprising" and sacrificing its originality for the sake of box office returns. Forbes\' Scott Mendelson also heavily criticised the film, denouncing Spectre as "the worst 007 movie in 30 years". Darren Franich of Entertainment Weekly viewed Spectre as "an overreaction to our current blockbuster moment", aspiring "to be a serialized sequel" and proving "itself as a Saga". While noting that "[n]othing that happens in Spectre holds up to even minor logical scrutiny", he had "come not to bury Spectre, but to weirdly praise it. Because the final act of the movie is so strange, so willfully obtuse, that it deserves extra attention." 
In a positive review Rolling Stone, Peter Travers gave the film 3.5 stars out of 4, describing "The 24th movie about the British MI6 agent with a license to kill is party time for Bond fans, a fierce, funny, gorgeously produced valentine to the longest-running franchise in movies". Other positive reviews from Mick LaSalle from the San Francisco Chronicle, gave it a perfect 100 score, stating: “One of the great satisfactions of Spectre is that, in addition to all the stirring action, and all the timely references to a secret organization out to steal everyone’s personal information, we get to believe in Bond as a person.” Stephen Whitty from the New York Daily News, gave it an 80 grade, saying: “Craig is cruelly efficient. Dave Bautista makes a good, Oddjob-like assassin. And while Lea Seydoux doesn’t leave a huge impression as this film’s “Bond girl,” perhaps it’s because we’ve already met — far too briefly — the hypnotic Monica Bellucci, as the first real “Bond woman” since Diana Rigg.” Richard Roeper from the Chicago Sun-Times, gave it a 75 grade. He stated: “This is the 24th Bond film and it ranks solidly in the middle of the all-time rankings, which means it’s still a slick, beautifully photographed, action-packed, international thriller with a number of wonderfully, ludicrously entertaining set pieces, a sprinkling of dry wit, myriad gorgeous women and a classic psycho-villain who is clearly out of his mind but seems to like it that way.” Michael Phillips over at the Chicago Tribune, gave it a 75 grade. 
He stated: “For all its workmanlike devotion to out-of-control helicopters, “Spectre” works best when everyone’s on the ground, doing his or her job, driving expensive fast cars heedlessly, detonating the occasional wisecrack, enjoying themselves and their beautiful clothes.” Guy Lodge from Variety, gave it a 70 score, stating: “What’s missing is the unexpected emotional urgency of “Skyfall,” as the film sustains its predecessor’s nostalgia kick with a less sentimental bent.”'], 
        [269, 2118], 
        [286, 2120]
    )
    encoding, token_starts, token_ends = batch_tokenizer(example_2)
    check_equal(list(encoding["input_ids"].shape), [2, 512])
    check_equal(encoding["input_ids"][0][-10:].numpy().tolist(), [0]*10)
    check_equal(encoding["token_type_ids"].numpy().tolist()[0], [0]*9 + [1]*165 + [0]*338)
    check_equal(encoding["token_type_ids"].numpy().tolist()[1], [0]*16 + [1]*496)
    check_equal(encoding["attention_mask"][0].numpy().tolist(), [1]*174 + [0]*338)
    check_equal(token_starts, [75, 512])
    check_equal(token_ends, [79, 512])
    print("All tests passed!")
    
test_tokenizer()

All tests passed!


### Question 8: Fine-tuning BERT
With the tokenizer ready, we can begin fine-tuning a pre-trained BERT model to our dataset with the following steps:
1. Move the input `model` to `device` and set it to training mode.
1. Repeat `n_epochs` times:
    * For every batch of data from the dataloader:
        * Use the input `squad_tokenizer` (an instance of the class `SQuADTokenizer` that you implemented) to encode this batch, yielding the tuple `(encoding, token_starts, token_ends)`.
        * Extract the `input_ids`, `token_type_ids` and `attention_mask` from `encoding`. Convert these tensors to `LongTensor` and move them to `device`.
        * Perform the usual PyTorch training workflow (zero grad, forward pass, backprop, ...). See the PyTorch primer for a reminder.
        
Implement the function `fine_tune_bert` that, given a pretrained BERT model and other training parameters, performs the above training procedure. This function should return the fine-tuned model and the average training loss across epochs.

**Notes**:
* Consult the [Bert documentation page](https://huggingface.co/transformers/v4.7.0/model_doc/bert.html#transformers.BertForQuestionAnswering.forward) for which parameters to specify in the forward pass. Make sure to input them in the correct order, as specified in the documentation.
* To carry out the forward pass, you can input the encoding elements (after converting them to datatype Long), along with `token_starts` and `token_ends`, to `model`. Calling `.loss` on the output of the forward pass will yield the loss value.
* We also provide a `verbose` flag. If you would like to add print debugging messages during model training, simply precede each print statement with an `if verbose` check, for example:
```py
if verbose:
    print("Training loss", train_loss)
```
The autograder will only call your function with `verbose = False`, so that your printout messages do not interfere with grading.
* To compute the average training loss, you should sum all the losses from every forward pass, then divide this sum by the number of epochs at the end of the function.
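
The loss bookkeeping described in the last note can be illustrated with plain numbers. This is a minimal sketch with made-up loss values; `average_training_loss` is a hypothetical helper and not part of the assignment's API.

```python
def average_training_loss(losses_per_epoch, verbose=False):
    """Sum the loss of every forward pass, then divide by the number of epochs."""
    total_loss = 0.0
    for epoch, batch_losses in enumerate(losses_per_epoch, start=1):
        for loss in batch_losses:
            total_loss += loss
        if verbose:
            print("Running total after epoch", epoch, "=", total_loss)
    return total_loss / len(losses_per_epoch)


# Two epochs with three batches each (illustrative values only).
losses = [[4.0, 3.0, 2.0], [2.0, 1.5, 1.5]]
avg_loss = average_training_loss(losses)
print(avg_loss)
```

With this convention the reported value grows with the number of batches per epoch, so it is only comparable between runs that use the same dataloader.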

In [34]:
def fine_tune_bert(model, n_epochs, optimizer, dataloader, squad_tokenizer, device, verbose = False):
    """
    Fine-tune a pre-trained BERT model on the SQuAD dataset.
    
    args:
        model (BertForQuestionAnswering) : a pre-trained BERT model for QA tasks
        n_epochs (int) : the number of epochs to train
        optimizer (torch.optim.Optimizer) : the optimizer used to update the model parameters
        dataloader (DataLoader) : a data loader that provides access to one batch of data at a time
        squad_tokenizer (SQuADTokenizer) : a tokenizer instance to be called on every batch of data from dataloader
        device (torch.device) : the device (CPU or Cuda) that the model and data should be moved to
        verbose (bool) : a flag that indicates whether debug messages should be printed out
    
    return:
        model (BertForQuestionAnswering) : the fine-tuned model
        avg_loss (float) : the average training loss across epochs
    """
    model = model.to(device)
    model.train()
    total_loss = 0.0
    for epoch in range(n_epochs):
        epoch_loss = 0.0
        for data in dataloader:
            encoding, token_starts, token_ends = squad_tokenizer(data)
            # Convert the encoded inputs to Long and move them to the target device
            input_ids = encoding["input_ids"].long().to(device)
            token_type_ids = encoding["token_type_ids"].long().to(device)
            attention_mask = encoding["attention_mask"].long().to(device)

            # The token positions are the training targets for the start and end logits
            token_starts = torch.LongTensor(token_starts).to(device)
            token_ends = torch.LongTensor(token_ends).to(device)

            
            optimizer.zero_grad()

            # Forward pass
            outputs = model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask,
                            start_positions=token_starts, end_positions=token_ends)
            loss = outputs.loss

            # Backward pass and optimization
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
        # Mean loss per batch for this epoch
        epoch_loss /= len(dataloader)
        total_loss += epoch_loss

        if verbose:
            print(f"Epoch {epoch + 1}/{n_epochs}, Loss: {epoch_loss}")

    avg_loss = total_loss / n_epochs

    return model, avg_loss

In [328]:
def test_fine_tune_bert():
    tokenizer = BertTokenizerFast.from_pretrained('rsvp-ai/bertserini-bert-base-squad')
    squad_tokenizer = SQuADTokenizer(tokenizer)
    
    """Train on 8 data points"""
    train_dataset = SQuADDataset(df_squad.head(8)[["question", "context_paragraph", "answer_start", "answer_end"]])
    squad_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=False, num_workers=6, worker_init_fn=_init_fn)
    model = BertForQuestionAnswering.from_pretrained('rsvp-ai/bertserini-bert-base-squad')
    optimizer = optim.AdamW(model.parameters(), lr=2e-5, eps=1e-8)  
    model, avg_train_loss = fine_tune_bert(model, 1, optimizer, squad_dataloader, squad_tokenizer, device)
    assert avg_train_loss < 5.0, avg_train_loss
    
    """Train on 100 data points"""
    train_dataset = SQuADDataset(df_squad.head(100)[["question", "context_paragraph", "answer_start", "answer_end"]])
    squad_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=False, num_workers=6, worker_init_fn=_init_fn)
    model = BertForQuestionAnswering.from_pretrained('rsvp-ai/bertserini-bert-base-squad')
    optimizer = optim.AdamW(model.parameters(), lr=2e-5, eps=1e-8)  
    model, avg_train_loss = fine_tune_bert(model, 1, optimizer, squad_dataloader, squad_tokenizer, device)
    assert avg_train_loss < 38, avg_train_loss
    print("All tests passed!")
    
%time test_fine_tune_bert()

  0%|          | 0/1 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

100%|██████████| 1/1 [00:00<00:00,  1.99it/s]



100%|██████████| 1/1 [00:00<00:00,  1.76it/s]


Epoch 1/1, Loss: 3.8675308227539062


  0%|          | 0/13 [00:00<?, ?it/s]


100%|██████████| 13/13 [00:04<00:00,  2.88it/s]



100%|██████████| 13/13 [00:04<00:00,  2.87it/s]

Epoch 1/1, Loss: 2.6324413739717922
All tests passed!
CPU times: user 6.57 s, sys: 646 ms, total: 7.22 s
Wall time: 6.79 s





We will fine-tune the model on 80% of the data (the remaining 20% will be for evaluation). Because this takes a long time, we will only do it for 1 epoch.

**Notes**:
* The following cell will take about **1 hour** to run, and is **required** for the next question. We recommend that you make a submission to Sail() at this point, to make sure you have everything correct so far. Ideally you would want to avoid having to fine-tune BERT more than once.
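
The `huggingface/tokenizers` fork warnings printed by the earlier test cell themselves suggest setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it is run before the `DataLoader` worker processes are created; the choice of `"false"` here is an assumption, since the source does not say which value suits this workload.

```python
import os

# Assumption: disabling tokenizer parallelism is acceptable for this notebook.
# Setting the variable explicitly is what the warning message recommends.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print(os.environ["TOKENIZERS_PARALLELISM"])
```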

In [329]:
model = BertForQuestionAnswering.from_pretrained('rsvp-ai/bertserini-bert-base-squad')
tokenizer = BertTokenizerFast.from_pretrained('rsvp-ai/bertserini-bert-base-squad')
train_indexes, test_indexes = train_test_split(df_squad.index, train_size = 0.8, random_state = 0)
df_squad_train = df_squad.loc[train_indexes, ["question", "context_paragraph", "answer_start", "answer_end"]].reset_index()
df_squad_test = df_squad.loc[test_indexes].reset_index()

train_dataset = SQuADDataset(df_squad_train)
train_dataloader = DataLoader(train_dataset, batch_size=6, shuffle=False, num_workers=6, worker_init_fn=_init_fn, pin_memory = True)
optimizer = optim.AdamW(model.parameters(), lr=2e-5, eps=1e-8)
squad_tokenizer = SQuADTokenizer(tokenizer)

%time tuned_model, avg_train_loss = fine_tune_bert(model, 1, optimizer, train_dataloader, squad_tokenizer, device)

  0%|          | 0/11576 [00:00<?, ?it/s]


100%|██████████| 11576/11576 [54:40<00:00,  3.53it/s] 

Epoch 1/1, Loss: 0.7943448718426763
CPU times: user 55min 6s, sys: 21.6 s, total: 55min 27s
Wall time: 54min 41s





We will also save the fine-tuned model into a directory to reuse it later. Note that the following code will create a directory `bert_fine_tuned_squad` with several files in it.

In [330]:
tuned_model.cpu().save_pretrained("bert_fine_tuned_squad")

Let's see how the fine-tuned model performs. We provide the following function `get_bert_prediction` to predict the answer given a pair of question and context. Overall the inference process is very similar to the forward pass during fine-tuning, except that we don't provide the starting and ending tokens to the Bert model, since those are what we need to predict.

You may also notice that we are looping through each data point, instead of doing inference in a vectorized manner. The reason is that this same inference code will be used for model deployment on CPU later, where we have limited memory and cannot afford to process data in batches (recall how you performed inference both in the loop approach and the batch approach in the OPE).
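
The span-selection step of this inference can be sketched on toy values. The tokens and logits below are made up for illustration; in `get_bert_prediction` they come from the tokenizer and from the model's `start_logits` and `end_logits`.

```python
def extract_span(tokens, start_logits, end_logits):
    """Pick the highest-scoring start and end positions and return the tokens
    between them (end position excluded, mirroring the slice in the notebook)."""
    token_start = max(range(len(start_logits)), key=start_logits.__getitem__)
    token_end = max(range(len(end_logits)), key=end_logits.__getitem__)
    return tokens[token_start:token_end]


tokens = ["[CLS]", "how", "many", "?", "[SEP]",
          "she", "received", "ten", "nominations", "[SEP]"]
start_logits = [0.1, 0.0, 0.2, 0.0, 0.1, 0.3, 0.5, 4.0, 1.0, 0.0]
end_logits = [0.0, 0.1, 0.0, 0.2, 0.1, 0.2, 0.4, 1.0, 3.5, 0.5]

print(extract_span(tokens, start_logits, end_logits))
```

Because the two positions are chosen independently, the predicted end can come before the predicted start, in which case the slice is empty.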

In [35]:
def get_bert_prediction(questions, contexts, device, model=None, tokenizer=None):
    '''
    Given a list of questions and a list of corresponding contexts, predict the answers using BERT.

    args:
        questions (List[string]): list of questions to be answered
        contexts (List[string]): list of context paragraphs, each for answering a question in the input questions
        device (torch.device) : the device (CPU or Cuda) that the model and data should be moved to
        model (BertForQuestionAnswering): BERT model to be used for question answering 
            or None - if None, `bertserini-bert-base-squad` will be loaded
        tokenizer (BertTokenizerFast object): tokenizer to be used for encoding questions and contexts
            or None - if None, `bertserini-bert-base-squad` will be loaded
    return:
        outputs (List[string]): list of generated answers
    '''
    
    if model is None:
        model = BertForQuestionAnswering.from_pretrained('rsvp-ai/bertserini-bert-base-squad')
    if tokenizer is None:
        tokenizer = BertTokenizerFast.from_pretrained('rsvp-ai/bertserini-bert-base-squad')
    
    model.to(device)
    model.eval()
    
    outputs = []

    for question, context in tqdm(zip(questions, contexts), total=len(questions)):

        encoded_seq = tokenizer(question, context, padding="longest", truncation=True, max_length=512)

        tokens = tokenizer.convert_ids_to_tokens(encoded_seq["input_ids"])
        
        input_ids = torch.LongTensor([encoded_seq["input_ids"]]).to(device)
        token_type_ids = torch.LongTensor([encoded_seq["token_type_ids"]]).to(device)
        attention_mask = torch.FloatTensor([encoded_seq["attention_mask"]]).to(device)

        with torch.no_grad():
            output = model(input_ids=input_ids, 
                           attention_mask=attention_mask, 
                           token_type_ids=token_type_ids)
        logits_start, logits_end = output['start_logits'], output['end_logits']
        token_start = torch.argmax(logits_start)
        token_end = torch.argmax(logits_end)
        
        outputs.append(tokenizer.convert_tokens_to_string(tokens[token_start:token_end]))
    return outputs

Let's try predicting one row of data first:

In [332]:
print("question:\n{}\ncontext:\n{}\nanswer:\n{}\n".format(
    df_squad.loc[200, "question"],
    df_squad.loc[200, "context_paragraph"],
    df_squad.loc[200, "answer"]
))

print("Bert prediction:", get_bert_prediction(
    df_squad.loc[[200], "question"],
    df_squad.loc[[200], "context_paragraph"],
    device,
    model = tuned_model
))

question:
How many awards was Beyonce nominated for at the 52nd Grammy Awards?
context:
At the 52nd Annual Grammy Awards, Beyoncé received ten nominations, including Album of the Year for I Am... Sasha Fierce, Record of the Year for "Halo", and Song of the Year for "Single Ladies (Put a Ring on It)", among others. She tied with Lauryn Hill for most Grammy nominations in a single year by a female artist. In 2010, Beyoncé was featured on Lady Gaga's single "Telephone" and its music video. The song topped the US Pop Songs chart, becoming the sixth number-one for both Beyoncé and Gaga, tying them with Mariah Carey for most number-ones since the Nielsen Top 40 airplay chart launched in 1992. "Telephone" received a Grammy Award nomination for Best Pop Collaboration with Vocals.
answer:
ten



100%|██████████| 1/1 [00:00<00:00, 47.70it/s]

Bert prediction: ['ten']





We see that it does work! Now let's see how accurate the fine-tuned model is on the entire test set. We'll also compute the accuracy of a pre-trained model without fine-tuning, to see how much improvement our fine-tuning provided. Each of the following two cells will take about **11 minutes** to run, and they are **not** required for later questions, so feel free to skip them for now and revisit later.

In [333]:
pretrained_test_prediction = get_bert_prediction(
    df_squad_test["question"], df_squad_test["context_paragraph"],
    device, BertForQuestionAnswering.from_pretrained('rsvp-ai/bertserini-bert-base-squad')
)
print( (np.array(pretrained_test_prediction) == df_squad_test["answer"].str.lower().values ).mean() )

100%|██████████| 17365/17365 [03:46<00:00, 76.59it/s]

0.008407716671465592





In [334]:
tuned_test_predictions = get_bert_prediction(
    df_squad_test["question"], df_squad_test["context_paragraph"],
    device, tuned_model
)
print( (np.array(tuned_test_predictions) == df_squad_test["answer"].str.lower().values ).mean() )

100%|██████████| 17365/17365 [03:46<00:00, 76.79it/s]

0.5861215087820328





We see the pretrained model did not work well at all (under 1% exact-match accuracy on the test set), while fine-tuning the model for one epoch already yields a large improvement (about 59%). Naturally, the model still has plenty of room for improvement when trained for more epochs, but doing so requires a lot more time.
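
The accuracy computed in the two cells above is an exact-match rate: a prediction counts only if it equals the lowercased reference answer character for character. A minimal sketch on made-up strings (the helper name `exact_match_accuracy` is hypothetical):

```python
def exact_match_accuracy(predictions, answers):
    """Fraction of predictions that equal the lowercased reference answer."""
    matches = [pred == ans.lower() for pred, ans in zip(predictions, answers)]
    return sum(matches) / len(matches)


predictions = ["ten", "late 1990s", "in houston", "2003"]
answers = ["ten", "in the late 1990s", "Houston", "2003"]
print(exact_match_accuracy(predictions, answers))
```

A near miss such as `"late 1990s"` against `"in the late 1990s"` scores zero here, so this metric gives no credit to partially correct answers.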

## 4. Model Deployment
As seen from this project, building a QA system involves a very complex technology stack. To make your model easily accessible to others, you can deploy it to a public endpoint on Azure, using the same workflow from Project 6. We have built several models so far and they can all be deployed together under the same endpoint. However, for the rest of this project, let's focus on BERT deployment.

In [36]:
import azureml.core
from azureml.core.workspace import Workspace
from azureml.core.model import Model
from azureml.core.webservice import AciWebservice
from azureml.core.model import InferenceConfig
from azureml.core.webservice import Webservice
from azureml.core.environment import Environment
from azureml.core.webservice import LocalWebservice

We will provide a brief reminder of the involved steps. You can consult the code from your Project 6 or the [model deployment primer](https://nbviewer.jupyter.org/url/clouddatascience.blob.core.windows.net/primers/p5-machine-learning-azure-primer/azure_model_deployment_primer.ipynb) for more details.

**Step 1: Initialize workspace with `Workspace.from_config()`**

In [37]:
# your code here
ws = Workspace.from_config()

**Step 2: Register the model with `Model.register()`**

The `model_path` parameter should be set to the name of your Bert model directory, `"./bert_fine_tuned_squad"`. The `model_name` parameter should be set to `"bert_fine_tuned"`.

In [38]:
# your code here
register_bert_model = Model.register(
    workspace = ws, model_path = "./bert_fine_tuned_squad",
    model_name = "bert_fine_tuned",
    description = "A Bert model"
)

Registering model bert_fine_tuned


**Step 3: Create scoring script and API endpoint**

Our grader will send a POST request to your endpoint, where the JSON content is structured as follows:

```py
{
    "questions" : [
        "How many parameters does BERT-large have?",
        "When did Beyonce start becoming popular?"
    ],

    "context_paragraphs" : [
        "BERT-large is really big... it has 24-layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance.",
        'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'
    ]
}
```
Here `questions` is a list of `N` questions and `context_paragraphs` is a list of `N` context paragraphs. For each pair of question and context paragraph, you should tokenize and input them to the fine-tuned Bert model, and return a dictionary with the following format:
```py
{
    'predicted_ans' : ['340', 'late 1990s']
}
```
where the key `predicted_ans` maps to a list of `N` strings, with each string being the predicted answer for one input pair of question and context paragraph.

Implement the `init` and `run` functions in the `score.py` file to perform the above processing. In particular,
* `init` will load the fine-tuned Bert model from file and store it in a global variable.
* `run` will parse the JSON string `input_data`, extract the questions and contexts, then perform inference and return the specified response. You can reuse the code from `get_bert_prediction` that we provided earlier.

**Notes**:
* To load the fine tuned Bert model, you can call
```py
BertForQuestionAnswering.from_pretrained(Model.get_model_path("bert_fine_tuned"))
```
Also remember that to modify a global variable, you need to use the `global` keyword.
* When doing inference you can use the pre-trained tokenizer `"rsvp-ai/bertserini-bert-base-squad"`. This tokenizer should be created in `init` so that you don't need to take time loading it during inference.
* Keep in mind that the return value of `run` in `score.py` should be a dictionary, not a string.
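
The `init`/`run` contract can be exercised locally before deployment. The sketch below replaces the Bert model with a hypothetical stub (`fake_predict`, which just returns the first word of each context) so that it runs without Azure or model files; only the global-variable pattern and the request and response shapes are meant to carry over.

```python
import json

predictor = None  # populated once by init(), reused by every run() call


def fake_predict(questions, contexts):
    # Hypothetical stand-in for get_bert_prediction: first word of each context.
    return [context.split()[0] for context in contexts]


def init():
    global predictor  # without `global`, the assignment would create a local variable
    predictor = fake_predict


def run(input_data):
    payload = json.loads(input_data)  # the request body is parsed from a JSON string
    answers = predictor(payload["questions"], payload["context_paragraphs"])
    return {"predicted_ans": answers}  # a dictionary, not a JSON string


init()
request_body = json.dumps({
    "questions": ["Q1?", "Q2?"],
    "context_paragraphs": ["Alpha beta gamma.", "Delta epsilon."],
})
response = run(request_body)
print(response)
```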

In [62]:
%%writefile score.py
import os, json, pickle
import numpy as np
import torch
from tqdm import tqdm
from transformers import BertTokenizerFast
from transformers import BertForQuestionAnswering
from azureml.core.model import Model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def get_bert_prediction(questions, contexts, device, model=None, tokenizer=None):
    '''
    Given a list of questions and a list of corresponding contexts, predict the answers using BERT.

    args:
        questions (List[string]): list of questions to be answered
        contexts (List[string]): list of context paragraphs, each for answering a question in the input questions
        device (torch.device) : the device (CPU or Cuda) that the model and data should be moved to
        model (BertForQuestionAnswering): BERT model to be used for question answering 
            or None - if None, `bertserini-bert-base-squad` will be loaded
        tokenizer (BertTokenizerFast object): tokenizer to be used for encoding questions and contexts
            or None - if None, `bertserini-bert-base-squad` will be loaded
    return:
        outputs (List[string]): list of generated answers
    '''
    
    if model is None:
        model = BertForQuestionAnswering.from_pretrained('rsvp-ai/bertserini-bert-base-squad')
    if tokenizer is None:
        tokenizer = BertTokenizerFast.from_pretrained('rsvp-ai/bertserini-bert-base-squad')
    
    model.to(device)
    model.eval()
    
    outputs = []

    for question, context in tqdm(zip(questions, contexts), total=len(questions)):

        encoded_seq = tokenizer(question, context, padding="longest", truncation=True, max_length=512)

        tokens = tokenizer.convert_ids_to_tokens(encoded_seq["input_ids"])
        
        input_ids = torch.LongTensor([encoded_seq["input_ids"]]).to(device)
        token_type_ids = torch.LongTensor([encoded_seq["token_type_ids"]]).to(device)
        attention_mask = torch.FloatTensor([encoded_seq["attention_mask"]]).to(device)

        with torch.no_grad():
            output = model(input_ids=input_ids, 
                           attention_mask=attention_mask, 
                           token_type_ids=token_type_ids)
        logits_start, logits_end = output['start_logits'], output['end_logits']
        token_start = torch.argmax(logits_start)
        token_end = torch.argmax(logits_end)
        
        outputs.append(tokenizer.convert_tokens_to_string(tokens[token_start:token_end]))
    return outputs

def init():
    """
    Load the fine-tuned Bert model from file and store it in a global variable.
    Also initialize the pre-trained tokenizer.
    """
    global model, tokenizer

    model = BertForQuestionAnswering.from_pretrained(Model.get_model_path("bert_fine_tuned"))
    model.to(device)
    model.eval()  # Set the model to evaluation mode

    tokenizer = BertTokenizerFast.from_pretrained("rsvp-ai/bertserini-bert-base-squad")


def run(input_data):
    """
    Parse the JSON request string, extract the questions and contexts,
    then run BERT inference and return the response as a dictionary.
    """
    input_data = json.loads(input_data)
    # Pass the tokenizer created in init() so it is not reloaded on every request.
    predicted_answers = get_bert_prediction(
        input_data["questions"], input_data["context_paragraphs"],
        device, model, tokenizer
    )
    response = {"predicted_ans": predicted_answers}

    return response

Overwriting score.py


Now check that the content of `score.py` is as you expect:

In [63]:
!cat score.py

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
import os, json, pickle
import numpy as np
import torch
from tqdm import tqdm
from transformers import BertTokenizerFast
from transformers import BertForQuestionAnswering
from azureml.core.model import Model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def get_bert_prediction(questions, contexts, device, model=None, tokenizer=None):
    '''
    Given a list of questions and a list of corresponding contexts, predict the answers using BERT.

    args:
        questions (List[string]): list of questions to be answered
        contexts (List[string]): list of context paragraphs, each for answering a question in the input questions
        device (torch.device) : the devi

**Step 4: Create environment file**

Now you will need to create an environment file (`myenv.yml`) that specifies all of the scoring script's package dependencies. This file is used to ensure that all of those dependencies are installed in the Docker image by Azure ML. The `pip_packages` parameter value should be the following list:
```py
['azureml-defaults', 'torch==2.0.1', 'transformers==4.25.0', 'numpy']
```

In [9]:
# your code here
from azureml.core.conda_dependencies import CondaDependencies 

environment_file = CondaDependencies.create(pip_packages=[
    'azureml-defaults', 'torch==2.0.1', 'transformers==4.25.0', 'numpy'
])

with open("myenv.yml","w") as f:
    f.write(environment_file.serialize_to_string())

print(environment_file.serialize_to_string())

# Conda environment specification. The dependencies defined in this file will
# be automatically provisioned for runs with userManagedDependencies=False.

# Details about the Conda environment file format:
# https://conda.io/docs/user-guide/tasks/manage-environments.html#create-env-file-manually

name: project_environment
dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.8 and later.
- python=3.8.13

- pip:
  - azureml-defaults~=1.52.0
  - torch==2.0.1
  - transformers==4.25.0
  - numpy
channels:
- anaconda
- conda-forge



**Step 5: Deploy to local service**

Follow the steps in the [model deployment primer](https://nbviewer.jupyter.org/url/clouddatascience.blob.core.windows.net/primers/machine-learning-azure-primer/azure_model_deployment_primer.ipynb) to create an `Environment`, an `InferenceConfig`, and a `LocalWebservice` deployment configuration, then deploy with `Model.deploy` and call `wait_for_deployment` on the returned service. This code may take about 10 minutes to run.

In [16]:
# your code here
myenv = Environment.from_conda_specification(name = "myenv", file_path = "myenv.yml")
inference_config = InferenceConfig(source_directory = '.', entry_script = "score.py", environment = myenv)
local_deployment_config = LocalWebservice.deploy_configuration(port = 8890)
local_service = Model.deploy(
    ws, "local-service",
    [register_bert_model], inference_config, 
    local_deployment_config
)
local_service.wait_for_deployment(True)


To leverage new model deployment capabilities, AzureML recommends using CLI/SDK v2 to deploy models as online endpoint, 
please refer to respective documentations 
https://docs.microsoft.com/azure/machine-learning/how-to-deploy-managed-online-endpoints /
https://docs.microsoft.com/azure/machine-learning/how-to-attach-kubernetes-anywhere 
For more information on migration, see https://aka.ms/acimoemigration 
  local_service = Model.deploy(


Downloading model bert_fine_tuned:2 to /tmp/azureml_k8johg1o/bert_fine_tuned/2
Generating Docker build context.
Package creation Succeeded
Logging into Docker registry 37a09863d5804f449c0d3945465c9786.azurecr.io
Logging into Docker registry 37a09863d5804f449c0d3945465c9786.azurecr.io
Building Docker image from Dockerfile...
Step 1/5 : FROM 37a09863d5804f449c0d3945465c9786.azurecr.io/azureml/azureml_dd46aeee7ee79528c366831c872e31d2
 ---> 4a6d2bef742f
Step 2/5 : COPY azureml-app /var/azureml-app
 ---> e2503ca83546
Step 3/5 : RUN mkdir -p '/var/azureml-app' && echo eyJhY2NvdW50Q29udGV4dCI6eyJzdWJzY3JpcHRpb25JZCI6ImUwNTkwNmMzLWIwZTctNGYwZC05N2I3LWE3ZDYxMmVmNGU2NyIsInJlc291cmNlR3JvdXBOYW1lIjoicDciLCJhY2NvdW50TmFtZSI6InByb2plY3Q3Iiwid29ya3NwYWNlSWQiOiIzN2EwOTg2My1kNTgwLTRmNDQtOWMwZC0zOTQ1NDY1Yzk3ODYifSwibW9kZWxzIjp7fSwibW9kZWxzSW5mbyI6e319 | base64 --decode > /var/azureml-app/model_config_map.json
 ---> Running in 943cfd058c51
 ---> cc20d92b32e9
Step 4/5 : RUN mv '/var/azureml-app/tmp1wuip4k

**Step 6: Test the local service**

You should change the `local_service` variable below to the variable name that you used earlier to store the return value of `Model.deploy`. Check that the response JSON matches the format specified above.

In [65]:
input_json = json.dumps({
    "questions" : [
        "How many parameters does BERT-large have?",
        "When did Beyonce start becoming popular?"
    ],

    "context_paragraphs" : [
        "BERT-large is really big... it has 24-layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance.",
        'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'
    ]
})

input_json = bytes(input_json, encoding = "utf8")
output = local_service.run(input_json)
print(output)

{'predicted_ans': ['340m', '1990s']}


Note that your model may output `340` or `340m` as the answer to the first question. Both are acceptable, since randomness in fine-tuning means the weights of the BERT model differ from run to run.
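
The format check can also be automated. The sketch below assumes the expected response is a dictionary with a single `predicted_ans` list holding one string per question, as in the output above; `check_response_format` is a hypothetical helper, not part of the Azure ML SDK.

```python
def check_response_format(response, num_questions):
    """Return a list of problems found in a scoring response (empty if none)."""
    if not isinstance(response, dict):
        return [f"expected a dict, got {type(response).__name__}"]
    problems = []
    answers = response.get("predicted_ans")
    if not isinstance(answers, list):
        problems.append("'predicted_ans' should be a list")
    else:
        if len(answers) != num_questions:
            problems.append(f"expected {num_questions} answers, got {len(answers)}")
        if not all(isinstance(a, str) for a in answers):
            problems.append("every answer should be a string")
    return problems


good = check_response_format({"predicted_ans": ["340m", "1990s"]}, 2)
as_string = check_response_format('{"predicted_ans": []}', 0)
too_short = check_response_format({"predicted_ans": ["340m"]}, 2)
print(good, as_string, too_short)
```

Passing the service output and the number of questions sent should give an empty list. A response that arrives as a JSON string rather than a dictionary is reported as a problem.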

**Step 7: Deploy to ACI container**

Now that your local service works properly, the last step is to deploy it to a public endpoint. Follow the steps in the primer to create an `AciWebservice` deployment configuration and deploy with `Model.deploy` again, then call `wait_for_deployment` on the returned service. This code should take about 20 minutes to run. If it takes longer, consult the "Monitoring public deployment" section in the primer to diagnose the issue.

**Notes**:
* When creating the `AciWebservice` deployment configuration, specify `cpu_cores = 3.8` and `memory_gb = 15` to maximize processing power in the deployment environment.

In [73]:
# your code here
aci_deployment_config = AciWebservice.deploy_configuration(
    cpu_cores=3.8, memory_gb=15, description = 'Public endpoint for Bert model',
    tags={'data' : 'SQuAD', 'method' : 'predict answer', 'framework' : 'pytorch'},
)

# create a new service instance
aci_service = Model.deploy(
    ws, "aci-services",
    [register_bert_model], inference_config, 
    aci_deployment_config
)

aci_service.wait_for_deployment(True)
print(aci_service.state)

To leverage new model deployment capabilities, AzureML recommends using CLI/SDK v2 to deploy models as online endpoint, 
please refer to respective documentations 
https://docs.microsoft.com/azure/machine-learning/how-to-deploy-managed-online-endpoints /
https://docs.microsoft.com/azure/machine-learning/how-to-attach-kubernetes-anywhere 
For more information on migration, see https://aka.ms/acimoemigration 
  aci_service = Model.deploy(


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2023-07-27 09:00:57+00:00 Creating Container Registry if not exists.
2023-07-27 09:00:57+00:00 Registering the environment.
2023-07-27 09:00:59+00:00 Use the existing image.
2023-07-27 09:00:59+00:00 Generating deployment configuration.
2023-07-27 09:01:01+00:00 Submitting deployment to compute.
2023-07-27 09:01:04+00:00 Checking the status of deployment aci-services..
2023-07-27 09:04:42+00:00 Checking the status of inference endpoint aci-services.
Succeeded
ACI service creation operation finished, operation "Succeeded"
He

**Step 8: Test the public service**

Let's use the same input JSON from earlier and check that the public endpoint works as well.

In [74]:
deployed_uri = aci_service.scoring_uri
print(deployed_uri)

input_json = json.dumps({
    "questions" : [
        "How many parameters does BERT-large have?",
        "When did Beyonce start becoming popular?"
    ],

    "context_paragraphs" : [
        "BERT-large is really big... it has 24-layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance.",
        'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'
    ]
})

response = requests.post(deployed_uri, input_json, headers = {'Content-Type' : 'application/json'})
print(response.status_code)
print(json.loads(response.content))

http://e3dc4626-e1b5-4273-a938-7a52dad7bc13.eastus.azurecontainer.io/score
200
{'predicted_ans': ['340m', '1990s']}


### Question 9: Deploying Bert model
We provide a local test to check that your deployed model can handle a request that contains 100 data points.

In [75]:
def test_deploy_bert():
    df_test_deploy = df_squad.sample(100, random_state = 100)
    input_json = json.dumps({
        "questions" : df_test_deploy["question"].tolist(),
        "context_paragraphs" : df_test_deploy["context_paragraph"].tolist()
    })
    response = requests.post(deployed_uri, input_json, headers = {'Content-Type' : 'application/json'})
    check_equal(response.status_code, 200)
    print(json.loads(response.content))
    print(df_test_deploy["answer"].str.lower().values)
    print(json.loads(response.content)["predicted_ans"])
    predicted_ans = np.array(json.loads(response.content)["predicted_ans"])
    ans = df_test_deploy["answer"].str.lower().values
    accuracy = (predicted_ans == ans).mean()
    assert accuracy >= 0.3, accuracy
    print("All tests passed!")
    
test_deploy_bert()

{'predicted_ans': ["isma ' il ibn jafar", 'suppression of french protestantism', 'united states', 'four', 'andreas alciatus', '2020', 'the united nations uses myanmar', 'plasma', 'erlacherhof', 'executive order 13526', 'as a result of integration of 28 petty princely states', '1913', 'out of touch', 'two', 'madhavacharya', 'eritrea, tunisia, algiers, the balkans and romania', '1967', 'japanese patriarchal system', '1980', 'intimate emotional', 'placental group', 'plaza del ayuntamiento', 'detailed graphics, fluid animation and high - quality music', 'more expensive', '150, 000', 'vote for the worst', 'collecting and reporting data on crimes', 'peng dehuai', '1 / 12 of a second', 'rhombencephalon', '£750, 000', 'the legislation prohibits interstate and foreign transactions for list species, no provisions are made for in - state commerce', 'john f. kennedy', 'individual and group identity', 'a nearby massive ob star', '12th to 14th', 'moscow', 'third - party software developers', 'king h
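
The test above scores the endpoint by exact match: a prediction counts only if it equals the lower-cased reference answer character for character. A plain-Python sketch of that metric follows; `exact_match_accuracy` is a hypothetical helper, and `preds` and `refs` are illustrative values rather than data from the project.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal the lower-cased reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    matches = sum(p == r.lower() for p, r in zip(predictions, references))
    return matches / len(predictions)


# Illustrative values only: the second prediction misses the word "late".
preds = ["340m", "1990s", "houston"]
refs = ["340M", "late 1990s", "Houston"]
accuracy = exact_match_accuracy(preds, refs)
print(f"{accuracy:.2f}")
```

Because the metric is strict, a partially correct span such as `1990s` for `late 1990s` scores zero, which is one reason the pass threshold in the test is as low as 0.3.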

If you get a 504 error and the service log indicates a timeout, look into optimizing the `run` function in `score.py`. You should not need to change anything in `get_bert_prediction`, but make sure to avoid repeated computation wherever possible (e.g., create the tokenizer in `init` and pass it to `get_bert_prediction`, call `json.loads` only once).
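
One way to judge whether a request is close to the timeout is to time it before redeploying. The sketch below is a generic timing helper under stated assumptions: `timed_call` is a hypothetical name, and `fake_run` is a placeholder that only simulates work, so substitute a call to your local service when measuring real latency.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_run(payload):
    # Hypothetical placeholder for the real scoring call; sleeps briefly to simulate work.
    time.sleep(0.01)
    return {"predicted_ans": ["placeholder"] * len(payload)}


result, elapsed = timed_call(fake_run, ["q1", "q2"])
print(f"{len(result['predicted_ans'])} answers in {elapsed:.3f}s")
```

Timing the same request before and after a change to `run` shows whether the change actually removed repeated work.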

Finally, let's write the deployed URI to a text file so that it can be submitted to the autograder.

In [78]:
with open('scoring_uri.txt', 'w') as f:
    f.write(deployed_uri)
    print("Saved scoring_uri to file!")

Saved scoring_uri to file!


**Step 9: Clean up cloud resources**

You have completed all the questions in this project! One last step you need to do is to make sure you clean up your resources properly, so that no unexpected charge is incurred. **You will still need to use Azure for the final exam**.

Use the left navigation bar in the Azure Machine Learning Studio to manage your computes and endpoints. If you don't anticipate any further usage of Azure ML Studio for this project, you should go back to the [Azure homepage](https://portal.azure.com/) and delete the entire resource group, as the resource group itself also incurs charges over time, even without any computes or endpoints.

<p style="color:red;">
    <strong>To check your remaining budget, you need to visit the <a href="https://portal.azure.com/#blade/Microsoft_Azure_Education/EducationMenuBlade/overview">Education</a> tab on Azure. The Cost Management + Billing tab will not display anything.</strong>
</p>