# [DLAI - Understanding and Applying Text embeddings](https://learn.deeplearning.ai/courses/google-cloud-vertex-ai)

## Project environment setup

- Load credentials and relevant Python Libraries
- If you were running this notebook locally, you would first install Vertex AI.  In this classroom, this is already installed.

In [None]:
# Uncomment to install Vertex AI when running locally
#!pip install google-cloud-aiplatform

from utils import authenticate
credentials, PROJECT_ID = authenticate() # Get credentials and project ID

## Application of Embeddings


In [None]:
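from vertexai.language_models import ChatModel, TextGenerationModel

generation_model = TextGenerationModel.from_pretrained("text-bison@001") # for Q&A
# chat-bison is a chat model, so it is loaded through ChatModel under its own name
# instead of overwriting generation_model
chat_model = ChatModel.from_pretrained("chat-bison@001") # for chatbots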

# The data is a public list of questions and answers from Stack Overflow.
# question_embeddings holds the embeddings of the questions only. It is only defined later
# in this notebook (see "LLMs for Semantic Search, Building a Q&A System"), so this section
# will not run until that cell has been executed.
question_embeddings

### Cluster the embeddings of the Stack Overflow questions

In [None]:
from sklearn.cluster import KMeans

clustering_dataset = question_embeddings[:1000]
n_clusters = 2
kmeans = KMeans(n_clusters=n_clusters, 
                random_state=0, 
                n_init = 'auto').fit(clustering_dataset)
kmeans_labels = kmeans.labels_

# Reduce dimensionality to help visualize
from sklearn.decomposition import PCA
PCA_model = PCA(n_components=2)
PCA_model.fit(clustering_dataset)
new_values = PCA_model.transform(clustering_dataset)

# To visualize
import matplotlib.pyplot as plt
import mplcursors
%matplotlib ipympl
from utils import clusters_2D
# so_df is the Stack Overflow dataframe loaded in the Classification section below
clusters_2D(x_values = new_values[:,0], y_values = new_values[:,1], 
            labels = so_df[:1000], kmeans_labels = kmeans_labels)

### Use Isolation Forest to identify potential outliers

- The `IsolationForest` classifier predicts `-1` for potential outliers and `1` for non-outliers.
- You can inspect the rows that were predicted to be potential outliers and verify that the question about baking is predicted to be an outlier.
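
To make the `-1` / `1` convention concrete, here is a minimal sketch on synthetic data rather than the Stack Overflow embeddings. The names `toy_embeddings`, `toy_clf` and `toy_preds` are hypothetical, and the single far-away point stands in for an off-topic question.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 200 synthetic "embeddings" clustered near the origin, plus one far-away point
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 8))
outlier = np.full((1, 8), 10.0)
toy_embeddings = np.vstack([inliers, outlier])

# contamination is the expected fraction of outliers in the data
toy_clf = IsolationForest(contamination=0.01, random_state=0)
toy_preds = toy_clf.fit_predict(toy_embeddings)  # -1 = potential outlier, 1 = non-outlier

print("Possible values:", set(toy_preds))
print("Far-away point flagged:", toy_preds[-1] == -1)
```

The `contamination` value fixes how many points get flagged, so some ordinary points may be reported next to the real outlier; the flagged rows still need a manual look.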

In [None]:
from sklearn.ensemble import IsolationForest

clf = IsolationForest(contamination=0.005, random_state = 2) 
preds = clf.fit_predict(question_embeddings)

print(f"{len(preds)} predictions. Set of possible values: {set(preds)}")
so_df.loc[preds == -1]

### Classification
- Train a random forest model to classify the category of a Stack Overflow question (as either Python, R, HTML or CSS).

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

import pandas as pd

# Read the category column from the CSV file to get the labels y
so_df = pd.read_csv('so_database_app.csv')
y = so_df['category'].values
y.shape

# X would be question_embeddings
X = question_embeddings
X.shape

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.2, 
                                                    random_state = 2)

clf = RandomForestClassifier(n_estimators=200)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred) # compute accuracy
print("Accuracy:", accuracy)

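# Choose a row index between 0 and 1999 and look up its category and question text
i = 2
label = so_df.loc[i,'category']
question = so_df.loc[i,'input_text']
print(f"Question {i} has category `{label}`:\n{question}")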

## LLMs for Question Answering
- You can ask an open-ended question to the language model.

In [None]:
prompt = "I'm a high school student. Recommend me a programming activity to improve my skills."
generation_model.predict(prompt=prompt).text

### Adjusting Creativity/Randomness
- You can control the behavior of the language model's decoding strategy by adjusting the temperature, top-k, and top-p parameters.
- For tasks where you want the model to consistently output the same result for the same input (such as classification or information extraction), set temperature to zero.
- For tasks where you want more creativity, such as brainstorming or summarization, choose a higher temperature (up to 1).
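
The effect of temperature can be illustrated on a toy distribution. The sketch below is not how Vertex AI implements decoding; it assumes the common formulation where logits are divided by the temperature before the softmax, and that a temperature of zero means always picking the most likely token. `softmax_with_temperature` and `toy_logits` are hypothetical names.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities, sharpened or flattened by temperature."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        # Treat zero temperature as greedy decoding: all mass on the top token
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

toy_logits = [2.0, 1.0, 0.1]
for t in (0.0, 0.5, 1.0):
    print(t, np.round(softmax_with_temperature(toy_logits, t), 3))
```

Lower temperatures concentrate probability on the top token, which is why the output becomes more repeatable.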

In [None]:
temperature = 0.0
response = generation_model.predict(
    prompt=prompt,
    temperature=temperature,
)
# Print the response that was generated with the chosen temperature
print(response.text)

#### Top P
- Top p: sample the minimum set of tokens whose probabilities add up to probability `p` or greater.
- The default value for `top_p` is `0.95`.
- If you want to adjust `top_p` and `top_k` and see different results, remember to set `temperature` to be greater than zero, otherwise the model will always choose the token with the highest probability.
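
As a rough illustration of the top-p rule described above, the following sketch keeps the smallest set of tokens whose probabilities reach `p` and renormalizes them. It works on a made-up distribution; `top_p_filter` and `toy_probs` are hypothetical names, and the model's internal implementation may differ.

```python
import numpy as np

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total probability is >= p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]          # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    # Index of the first position where the running total reaches p, plus one
    cutoff = min(int(np.searchsorted(cumulative, p)) + 1, len(probs))
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()         # renormalize over the kept tokens

toy_probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(toy_probs, 0.2))  # only the top token survives
print(top_p_filter(toy_probs, 0.7))  # the top two tokens survive
```

With a small `top_p` such as `0.2`, only the most likely token remains in this example, so a high temperature has little left to choose from.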

In [None]:
top_p = 0.2
prompt = "Write an advertisement for jackets \
that involves blue elephants and avocados."
response = generation_model.predict(
    prompt=prompt, 
    temperature=0.9, 
    top_p=top_p,
)
print(response.text)

#### Top k
- The default value for `top_k` is `40`.
- You can set `top_k` to values between `1` and `40`.
- The decoding strategy applies `top_k`, then `top_p`, then `temperature` (in that order).
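
Top-k filtering can be sketched in the same spirit: keep the `k` most likely tokens and renormalize. This is an illustration on a made-up distribution with hypothetical names (`top_k_filter`, `toy_probs`), not the model's actual decoding code.

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most likely tokens and renormalize their probabilities."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[::-1][:k]   # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

toy_probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(toy_probs, 2))
print(top_k_filter(toy_probs, 1))
```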

In [None]:
top_k = 20
top_p = 0.7
response = generation_model.predict(
    prompt=prompt, 
    temperature=0.9, 
    top_k=top_k,
    top_p=top_p,
)
print(response.text)

## LLMs for Semantic Search, Building a Q&A System

Build a small database in Python (a vector database could be used instead).

In [None]:
# Read external data, here it is in form of a CSV
import pandas as pd
so_database = pd.read_csv('so_database_app.csv')

# Import text embedding model to convert the external text to embeddings
from vertexai.language_models import TextEmbeddingModel
embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

# Here is the code that embeds the text
import numpy as np
from utils import encode_text_to_embedding_batched

so_questions = so_database.input_text.tolist()
question_embeddings = encode_text_to_embedding_batched(
            sentences = so_questions,
            api_calls_per_second = 20/60, 
            batch_size = 5)

# Load precomputed embeddings from disk; this overwrites the result of the batched call above
import pickle
with open('question_embeddings_app.pkl', 'rb') as file:
    # Call the load method to deserialize
    question_embeddings = pickle.load(file)

# Add embeddings as a column to the dataframe
so_database['embeddings'] = question_embeddings.tolist()

### Semantic Search

When a user asks a question, we can embed their query on the fly and search over all of the Stack Overflow question embeddings to find the most similar data point.
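
The search itself comes down to a cosine similarity between the query vector and every stored vector. Here is a small sketch with two-dimensional toy vectors instead of real embeddings; `cosine_scores`, `toy_docs` and `toy_query` are hypothetical names.

```python
import numpy as np

def cosine_scores(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    dots = doc_matrix @ query_vec
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    return dots / norms

toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([2.0, 2.1])

scores = cosine_scores(toy_query, toy_docs)
best = int(np.argmax(scores))  # row whose direction is closest to the query
print(scores, best)
```

Cosine similarity depends only on direction, so the query's larger magnitude does not affect which row wins.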

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import pairwise_distances_argmin as distances_argmin

query = ['How to concat dataframes pandas']
query_embedding = embedding_model.get_embeddings(query)[0].values

# Calculate the cosine similarity between the query and every stored embedding.
# cosine_similarity expects 2D inputs: the query embedding is wrapped in a list to form
# a single row, and the embeddings column is converted to a list of rows.
cos_sim_array = cosine_similarity([query_embedding], list(so_database.embeddings.values))

# The shape is (1, 2000): one similarity score between the query and each data point
cos_sim_array.shape

# Grab the question with the highest cosine similarity
index_doc_cosine = np.argmax(cos_sim_array)
# For comparison: the nearest neighbor by distance (Euclidean by default)
index_doc_distances = distances_argmin([query_embedding], list(so_database.embeddings.values))[0]
print(so_database.input_text[index_doc_cosine])
print(so_database.output_text[index_doc_cosine])

### Question answering with relevant context

Now that we have found the most similar Stack Overflow question, we can take the corresponding answer and use an LLM to produce a more conversational response.
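
One way to keep the prompt assembly reusable is a small helper that combines the retrieved question, its answer, and the user query. This is only a sketch: `build_grounded_prompt` is a hypothetical function, and the example strings are made up for illustration.

```python
def build_grounded_prompt(context_question, context_answer, query):
    """Assemble a prompt that asks the model to answer only from the given context."""
    context = f"Question: {context_question}\nAnswer: {context_answer}"
    return (
        f"Here is the context: {context}\n"
        "Using the relevant information from the context, "
        f"provide an answer to the query: {query}.\n"
        "If the context doesn't provide any relevant information, answer with "
        "[I couldn't find a good match in the document database for your query]"
    )

example_prompt = build_grounded_prompt(
    context_question="How do I join two dataframes?",   # made-up example text
    context_answer="Use pd.concat([df1, df2]).",         # made-up example text
    query="How to concat dataframes pandas",
)
print(example_prompt)
```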

In [None]:
from vertexai.language_models import TextGenerationModel
generation_model = TextGenerationModel.from_pretrained(
    "text-bison@001")

context = "Question: " + so_database.input_text[index_doc_cosine] +\
"\n Answer: " + so_database.output_text[index_doc_cosine]

# What if the retrieved question isn't actually relevant to the user query?
# Additional instructions in the prompt handle those cases.
# query is a one-element list here, so query[0] inserts the plain query text.
prompt = f"""Here is the context: {context}
             Using the relevant information from the context,
             provide an answer to the query: {query[0]}.
             If the context doesn't provide \
             any relevant information, \
             answer with \
             [I couldn't find a good match in the \
             document database for your query]
             """

from IPython.display import Markdown, display

t_value = 0.2
response = generation_model.predict(prompt = prompt, temperature = t_value, max_output_tokens = 1024)

display(Markdown(response.text))

### Scale with approximate nearest neighbor search

When dealing with a large dataset, computing the similarity between the query and each original embedded document in the database might be too expensive. Instead of doing that, you can use approximate nearest neighbor algorithms that find the most similar documents in a more efficient way.

These algorithms usually work by creating an index for your data, and using that index to find the most similar documents for your queries. In this notebook, we will use ScaNN to demonstrate the benefits of efficient vector similarity search. First, you have to create an index for your embedded dataset.
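
The idea behind such an index can be sketched without ScaNN: partition the vectors into leaves, then score only the vectors in the few leaves closest to the query. The code below is a simplified stand-in that uses `KMeans` for the partitioning on random toy vectors. It is not ScaNN's algorithm, and `approx_search`, `toy_vectors` and `partitioner` are hypothetical names; `num_leaves` and `num_leaves_to_search` only mirror the parameter names used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy "embeddings", normalized so that a dot product equals cosine similarity
toy_vectors = rng.normal(size=(500, 16))
toy_vectors /= np.linalg.norm(toy_vectors, axis=1, keepdims=True)

# Build the index: assign every vector to one of num_leaves partitions
num_leaves = 10
partitioner = KMeans(n_clusters=num_leaves, random_state=0, n_init=10).fit(toy_vectors)
leaf_of = partitioner.labels_

def approx_search(query_vec, num_leaves_to_search=3, top_n=1):
    """Score only the vectors stored in the leaves closest to the query."""
    query_vec = query_vec / np.linalg.norm(query_vec)
    # Rank leaves by the distance from the query to each leaf centroid
    leaf_dist = np.linalg.norm(partitioner.cluster_centers_ - query_vec, axis=1)
    nearest_leaves = np.argsort(leaf_dist)[:num_leaves_to_search]
    candidates = np.flatnonzero(np.isin(leaf_of, nearest_leaves))
    scores = toy_vectors[candidates] @ query_vec
    order = np.argsort(scores)[::-1][:top_n]
    return candidates[order], scores[order]

toy_query = toy_vectors[42] + 0.01 * rng.normal(size=16)
ids, sims = approx_search(toy_query)

# Exhaustive search over all vectors, for comparison
exact_scores = toy_vectors @ (toy_query / np.linalg.norm(toy_query))
print("approximate:", ids[0], "exact:", int(np.argmax(exact_scores)))
```

Searching fewer leaves means fewer similarity computations, at the cost of occasionally missing the true nearest neighbor; searching all leaves reproduces the exhaustive result.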

In [None]:
import scann
from utils import create_index

# Create the index using ScaNN
index = create_index(embedded_dataset = question_embeddings, 
                     num_leaves = 25,
                     num_leaves_to_search = 10,
                     training_sample_size = 2000)

query = "how to concat dataframes pandas"

import time 

# Use scann
start = time.time()
query_embedding = embedding_model.get_embeddings([query])[0].values
neighbors, distances = index.search(query_embedding, final_num_neighbors = 1)
end = time.time()

# doc_id is used instead of id to avoid shadowing the Python built-in
for doc_id, dist in zip(neighbors, distances):
    print(f"[docid:{doc_id}] [{dist}] -- {so_database.input_text[int(doc_id)][:125]}...")

print("Latency (ms):", 1000 * (end - start))

# Use exhaustive cosine similarity for comparison
# (as with ScaNN above, the timing includes the get_embeddings call)
start = time.time()
query_embedding = embedding_model.get_embeddings([query])[0].values
cos_sim_array = cosine_similarity([query_embedding], list(so_database.embeddings.values))
index_doc = np.argmax(cos_sim_array)
end = time.time()

print(f"[docid:{index_doc}] [{np.max(cos_sim_array)}] -- {so_database.input_text[int(index_doc)][:125]}...")

print("Latency (ms):", 1000 * (end - start))