<a href="https://colab.research.google.com/github/shekhar-git-hub/AI-Virtual-Agent-Engine-for-FAQ/blob/main/AI_Virtual_Agent_for_FAQ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###  Virtual Agent for Frequently Asked Questions

Building a Virtual Agent that understands the semantics of user utterances has become simple, thanks to transformer-based models and a large collection of open-source libraries.

<b>Note:</b> Update the variables in the <b>Variables</b> section, if required, before running the notebook. To run the notebook cell by cell, click on a cell and then click the <b>Run</b> button below the <b>Menu</b> bar. To run all cells, select <b>Cell --> Run All</b> from the Menu bar.

##### Variables

The variable <b>DATA_SOURCE_PATH</b> should be set to the path of the <b>input</b> data used to build the agent.
The variable <b>MODEL_PATH</b> is the location where the embeddings of the input text are stored.

The dataset is expected to contain questions under column <b>Q</b> and answers under column <b>A</b>. For an example, see <b>faqs.csv</b> in the downloaded repository, which can be found in the kit_installer file location.
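
As a quick sanity check of the expected layout, the sketch below verifies that a CSV header contains both columns. It uses only the standard library and a toy in-memory file; the helper name `missing_columns` and the sample row are illustrative assumptions, not part of the notebook.

```python
import csv
import io

REQUIRED_COLUMNS = {"Q", "A"}  # column names stated in the text above


def missing_columns(csv_file):
    """Return the required columns that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])
    return REQUIRED_COLUMNS - header


# Toy in-memory file standing in for faqs.csv (contents are illustrative).
sample = io.StringIO("Q,A\nWhat is kandi?,A platform for finding libraries.\n")
print(missing_columns(sample))  # an empty set means both columns are present
```

With a real file, the same function could be called on `open(DATA_SOURCE_PATH, newline="")` before the data is loaded.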

Run the <b>last cell</b> to try out the Virtual Agent you have built.

In [None]:
DATA_SOURCE_PATH = r"faqs.csv"      # Reading the csv dataset of frequently asked questions.

Default model location

In [None]:
MODEL_PATH = r"models/model_va.pickle"      # Path of the pickle file in which the sentence-transformer embeddings are stored.

###### Import libraries for data analysis

In [None]:
import numpy as np
import pandas as pd

###### Import libraries for text mining

In [18]:
!pip install lingualytics 
!pip install texthero



In [None]:
from lingualytics.preprocessing import remove_lessthan, remove_punctuation, remove_stopwords      # Removing noises
from lingualytics.stopwords import en_stopwords
from texthero.preprocessing import remove_digits

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


##### Import libraries for transformers 

In [None]:
!pip install sentence_transformers

Collecting sentence_transformers
  Downloading sentence-transformers-2.2.0.tar.gz (79 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.0-py3-none-any.whl size=120747 sha256=b9fd04193e80a8982aafcca21dc706fe8cb21fb5181cbbce69284353a0ce166b
  Stored in directory: /root/.cache/pip/wheels/83/c0/df/b6873ab7aac3f2465aa9144b6b4c41c4391cfecc027c8b07e7
Successfully built sentence-transformers
Installing collected packages: sentencepiece, sentence-transformers
Successfully installed sentence-transformers-2.2.0 sentencepiece-0.1.96


In [None]:
from sentence_transformers import SentenceTransformer    # Loads a pretrained model that converts sentences into embeddings

##### Import libraries for computing similarities

In [None]:
# torch.nn provides neural-network building blocks. CosineSimilarity scores how close two embeddings are:
# paraphrases such as "what is kandi", "info about kandi" and "tell me about kandi" should score as similar.
from torch.nn import CosineSimilarity
import torch

###### Import library for storing into binary file

In [None]:
import pickle    # Used to dump the embeddings to a binary file and load them back later.

###### Load data source into dataframe

In [19]:
# With the libraries imported, load the data source: read the CSV at DATA_SOURCE_PATH (set earlier) into a dataframe.
df = pd.read_csv(DATA_SOURCE_PATH, encoding_errors="ignore")
pd.set_option('display.max_colwidth', None)                       # Show full cell contents instead of truncating long text.
df                                                                # Display the dataframe as rows and columns.

Unnamed: 0,Q,A
0,What is kandi?,"kandi (pronounced kandee) is a platform that helps developers pick the right library, package, code samples, APIs, and cloud functions, by analyzing over 430 million knowledge items."
1,Have feedback or want to know more?,"We are a passionate set of application focused techies. Wed love to hear from you on your feedback, questions, and any other comments.\nDirect Message us on Twitter Message @OpenWeaverInc\nYou can email us at kandi.support@openweaver.com\nJoin our Discord community here"
2,What components does kandi cover?,kandi helps you select software components across:\nPackages from all package managers and repositories\nSource Code across all major code repositories\nCloud Functions and APIs across all hyperscale cloud providers
3,How do I use kandi?,"kandi provides two simplified experiences to help you choose the right software component to accelerate your application development:\n\n1. Search\nYou can search for the component using natural language to describe your functional and technical requirements, and kandi gets to work by matching these over 430 million knowledge items to show you a shortlist.\nYou can further filter them or refine your query and pick your chosen ones based on scores available on the component listing page.\nClick on the components from the list to review detailed insights such as support, quality, security, and a reference guide covering code snippets, community discussions from the provider, and popular channels.\nThe component listing and detailed insights page have links to the software component home. Some software components may have multiple providers, and you can access all the links.\n\n2. Explore\nYou can Explore kandi curated sections across Popular Collections, Hot Tech, and Industry Domains from the Home Page or the Explore Page. These sections list the popular components among your peers, have functional relevance, and positive security, quality, and support scores in the respective areas.\nYou can browse these sections to get industry insights.\nYou can further filter them and pick your chosen ones based on scores available on the component listing page.\nClick on the components from the list to review detailed insights such as support, quality, security, and a reference guide covering code snippets, community discussions from the provider, and popular channels.\nThe component listing and detailed insights page have links to the software component home. Some software components may have multiple providers, and you can access all the links."
4,How do I shortlist components on kandi?,"You can use the below filters to shortlist components based on your architectural preferences:\n\nLanguages This is an expanding list of languages chosen by popularity amongst kandi users.\nLicenses Licenses are grouped by:\n\nOSS License families, covering Permissive, Weak Copyleft, and Strong Copyleft.\nProprietary license category covering the emerging cloud licenses as well as As-a-Service contracts.\nNo License indicates that the respective repository does not have the license file declared as per the repository managers standard. They could still have a license file declared in a different format or section. Components without a license have all rights reserved, and you may not be able to use them. Hence kandi alerts you when a valid license file is not found.\n\nSupport High support indicates a thriving ecosystem across the author and users, that will help you implement with relative ease.\nComponent Types Component Types are grouped by:\n\nLibraries from package managers and repositories that can be readily installed.\nSource Code that may or may not be associated with a package and are from code repositories.\nCloud Functions and APIs that are provided As-a-Service from cloud providers.\n\nSources This is an expanding list of software component sources chosen by popularity amongst kandi users.\nIndustries This indicates the industry domain that the component has been associated with or could be used in, for specific use cases.\nSecurity This reflects the security score of the software component across reported and code-based vulnerabilities."
5,How do I implement the components that I have selected on kandi?,"The component listing and detailed insights page have links to the software component home. Some software components may have multiple providers, and you can access all the links.\nYou can follow implementation instructions from the software component home page based on the component type."


###### Cleanse data by removing numbers and punctuation
This pre-processing step removes unnecessary text that would otherwise add noise to the embeddings. Techniques such as stemming and lemmatisation can also help here.

Because we are using sentence embeddings, extensive pre-processing is not needed. The better the quality of the dataset, the less pre-processing is required.
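
The cleansing described above can be approximated with the standard library. This is a minimal sketch, not the exact behaviour of the `texthero` or `lingualytics` functions used in the notebook; the helper name `clean_text` and the whitespace collapsing are assumptions made for illustration.

```python
import re
import string

# Patterns for the two removals described above, plus whitespace tidy-up.
_DIGITS = re.compile(r"\d+")
_PUNCT = re.compile(rf"[{re.escape(string.punctuation)}]+")
_SPACES = re.compile(r"\s+")


def clean_text(text):
    """Replace digits and ASCII punctuation with spaces, then collapse whitespace."""
    text = _DIGITS.sub(" ", text)
    text = _PUNCT.sub(" ", text)
    return _SPACES.sub(" ", text).strip()


print(clean_text("What is kandi?"))  # prints: What is kandi
```

Replacing with a space rather than deleting keeps words apart when punctuation sits between them, for example in "code-based".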

In [20]:
# Store the cleaned questions in a new column, 'procd_Q': digits and punctuation are removed from 'Q'.
# Optional extra steps, left disabled here (hi_stopwords is not imported in this notebook):
#   .pipe(remove_lessthan, length=3)
#   .pipe(remove_stopwords, stopwords=en_stopwords.union(hi_stopwords))
df['procd_Q'] = df['Q'].pipe(remove_digits).pipe(remove_punctuation)
df

  return s.str.replace(rf"([{punctuation}])+", " ")


Unnamed: 0,Q,A,procd_Q
0,What is kandi?,"kandi (pronounced kandee) is a platform that helps developers pick the right library, package, code samples, APIs, and cloud functions, by analyzing over 430 million knowledge items.",What is kandi
1,Have feedback or want to know more?,"We are a passionate set of application focused techies. Wed love to hear from you on your feedback, questions, and any other comments.\nDirect Message us on Twitter Message @OpenWeaverInc\nYou can email us at kandi.support@openweaver.com\nJoin our Discord community here",Have feedback or want to know more
2,What components does kandi cover?,kandi helps you select software components across:\nPackages from all package managers and repositories\nSource Code across all major code repositories\nCloud Functions and APIs across all hyperscale cloud providers,What components does kandi cover
3,How do I use kandi?,"kandi provides two simplified experiences to help you choose the right software component to accelerate your application development:\n\n1. Search\nYou can search for the component using natural language to describe your functional and technical requirements, and kandi gets to work by matching these over 430 million knowledge items to show you a shortlist.\nYou can further filter them or refine your query and pick your chosen ones based on scores available on the component listing page.\nClick on the components from the list to review detailed insights such as support, quality, security, and a reference guide covering code snippets, community discussions from the provider, and popular channels.\nThe component listing and detailed insights page have links to the software component home. Some software components may have multiple providers, and you can access all the links.\n\n2. Explore\nYou can Explore kandi curated sections across Popular Collections, Hot Tech, and Industry Domains from the Home Page or the Explore Page. These sections list the popular components among your peers, have functional relevance, and positive security, quality, and support scores in the respective areas.\nYou can browse these sections to get industry insights.\nYou can further filter them and pick your chosen ones based on scores available on the component listing page.\nClick on the components from the list to review detailed insights such as support, quality, security, and a reference guide covering code snippets, community discussions from the provider, and popular channels.\nThe component listing and detailed insights page have links to the software component home. Some software components may have multiple providers, and you can access all the links.",How do I use kandi
4,How do I shortlist components on kandi?,"You can use the below filters to shortlist components based on your architectural preferences:\n\nLanguages This is an expanding list of languages chosen by popularity amongst kandi users.\nLicenses Licenses are grouped by:\n\nOSS License families, covering Permissive, Weak Copyleft, and Strong Copyleft.\nProprietary license category covering the emerging cloud licenses as well as As-a-Service contracts.\nNo License indicates that the respective repository does not have the license file declared as per the repository managers standard. They could still have a license file declared in a different format or section. Components without a license have all rights reserved, and you may not be able to use them. Hence kandi alerts you when a valid license file is not found.\n\nSupport High support indicates a thriving ecosystem across the author and users, that will help you implement with relative ease.\nComponent Types Component Types are grouped by:\n\nLibraries from package managers and repositories that can be readily installed.\nSource Code that may or may not be associated with a package and are from code repositories.\nCloud Functions and APIs that are provided As-a-Service from cloud providers.\n\nSources This is an expanding list of software component sources chosen by popularity amongst kandi users.\nIndustries This indicates the industry domain that the component has been associated with or could be used in, for specific use cases.\nSecurity This reflects the security score of the software component across reported and code-based vulnerabilities.",How do I shortlist components on kandi
5,How do I implement the components that I have selected on kandi?,"The component listing and detailed insights page have links to the software component home. Some software components may have multiple providers, and you can access all the links.\nYou can follow implementation instructions from the software component home page based on the component type.",How do I implement the components that I have selected on kandi


###### Load a sentence transformer model of your choice to get sentence embeddings
Choose a model by comparing the available pretrained models listed here:
https://www.sbert.net/docs/pretrained_models.html

In [21]:
# With the data cleaned, load the model that converts each sentence into a vector (a sentence embedding).
# Here we use the pretrained 'paraphrase-MiniLM-L6-v2' model through the SentenceTransformer class.
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')


In [22]:
from google.colab import drive
drive.mount('/content/drive')

#Optional: move to the desired location:
%cd drive/My Drive/Colab Notebooks/faq-virtual-agent-main

Mounted at /content/drive
/content/drive/My Drive/Colab Notebooks/faq-virtual-agent-main


###### Find embeddings of sentences and store in a binary file

Storing the embeddings in a binary file lets you load and reuse them later without computing them again. We use pickle here; joblib is an alternative.

In [23]:
# Compute an embedding for every cleaned question in 'procd_Q' using the model loaded above.
# For very large datasets, the embeddings can be computed in batches.
q_embs = model.encode(df["procd_Q"].tolist())
with open(MODEL_PATH, "wb") as file:
    pickle.dump(q_embs, file)       # Store the embeddings (q_embs) in a binary file.
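
The batching idea mentioned in the comments can be sketched as follows. To stay runnable without the model, the example uses a stand-in encoder (`toy_encode`) that returns one-element vectors; the helper names are hypothetical. In the notebook itself, `model.encode` also accepts a `batch_size` argument, which may be all that is needed.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def encode_in_batches(items, encode, batch_size=2):
    """Encode `items` batch by batch and return all vectors in input order."""
    vectors = []
    for batch in batched(items, batch_size):
        vectors.extend(encode(batch))
    return vectors


def toy_encode(batch):
    # Stand-in for a real encoder: the "embedding" is just the text length.
    return [[float(len(text))] for text in batch]


questions = ["What is kandi", "How do I use kandi", "What components does kandi cover"]
vectors = encode_in_batches(questions, toy_encode, batch_size=2)
print(len(vectors))  # prints: 3
```

Because each batch is appended in order, row `i` of the result still corresponds to row `i` of the dataframe, which the answer lookup relies on.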

###### Load embeddings from binary file into memory

In [24]:
# With the embeddings stored, the next step is prediction.
with open(MODEL_PATH, "rb") as file:
    q_embs = pickle.load(file)              # Load the stored embeddings back into memory.
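
The store-and-load round trip can be checked in isolation. The sketch below uses a temporary directory and a toy list of vectors instead of real embeddings; `save_embeddings` and `load_embeddings` are hypothetical helper names. It also creates the parent folder first, on the assumption that a path such as `models/model_va.pickle` may point into a directory that does not exist yet.

```python
import os
import pickle
import tempfile


def save_embeddings(path, embeddings):
    """Pickle `embeddings` to `path`, creating the parent directory if needed."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "wb") as file:
        pickle.dump(embeddings, file)


def load_embeddings(path):
    """Load embeddings previously written by save_embeddings."""
    with open(path, "rb") as file:
        return pickle.load(file)


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "models", "model_va.pickle")
    toy_embeddings = [[0.1, 0.2], [0.3, 0.4]]  # illustrative values only
    save_embeddings(demo_path, toy_embeddings)
    restored = load_embeddings(demo_path)

print(restored == toy_embeddings)  # prints: True
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.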

###### Predict answer to user query
The user query is cleansed and pre-processed in the same way as the dataset, and the closest matching question in the data source is then predicted. The matched question is used to look up the corresponding answer.

In [25]:
def pred_answer(usr_query):
    # Run the user query through the same cleaning pipeline that was applied to the dataset.
    df_query = pd.DataFrame([usr_query], columns=["usr_query"])
    df_query["clean_usr_q"] = df_query["usr_query"].pipe(remove_digits).pipe(remove_punctuation)   # cleaned user query
    usr_q_emb = model.encode(df_query["clean_usr_q"].tolist())   # embedding of the cleaned query, shape (1, dim)
    cosine_similarity = CosineSimilarity()
    # Score the query against every stored question embedding and take the index of the best match.
    q_idx = torch.argmax(cosine_similarity(torch.from_numpy(usr_q_emb), torch.from_numpy(q_embs)))
    return df["A"].iloc[q_idx.item()]   # look up the answer of the matched question in the input dataframe
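
To make the matching step concrete, here is a dependency-free sketch of cosine similarity followed by an arg-max over toy two-dimensional vectors. The vectors, answers, and helper names are invented for illustration; real embeddings from the model have many more dimensions, and `torch.nn.CosineSimilarity` handles near-zero norms slightly differently (through its `eps` parameter).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_vec, candidate_vecs):
    """Return (index, score) of the candidate most similar to the query."""
    scores = [cosine_similarity(query_vec, vec) for vec in candidate_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy stand-ins for q_embs and the 'A' column.
toy_q_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_answers = ["answer 0", "answer 1", "answer 2"]

idx, score = best_match([0.9, 0.1], toy_q_embs)
print(toy_answers[idx])  # prints: answer 0
```

The query vector points almost along the first candidate, so that candidate receives the highest score and its answer is returned.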

In [26]:
usr_query = "tell me about kandi"

In [27]:
pred_answer(usr_query)



'kandi (pronounced kandee) is a platform that helps developers pick the right library, package, code samples, APIs, and cloud functions, by analyzing over 430 million knowledge items.'

###### Simulating Virtual Agent

In [None]:
while True:
    usr_q = input("Ask a query(or type 'exit' to exit):")
    if usr_q == "exit":
        break
    else:
        print("Answer: ", pred_answer(usr_q))
    print("-----------------")

Ask a query(or type 'exit' to exit):TELL ME ABOUT KANDI




Answer:  kandi (pronounced kandee) is a platform that helps developers pick the right library, package, code samples, APIs, and cloud functions, by analyzing over 430 million knowledge items.
-----------------
Ask a query(or type 'exit' to exit):HOW TO USE KANDI
Answer:  kandi provides two simplified experiences to help you choose the right software component to accelerate your application development:

1. Search
You can search for the component using natural language to describe your functional and technical requirements, and kandi gets to work by matching these over 430 million knowledge items to show you a shortlist.
You can further filter them or refine your query and pick your chosen ones based on scores available on the component listing page.
Click on the components from the list to review detailed insights such as support, quality, security, and a reference guide covering code snippets, community discussions from the provider, and popular channels.
The component listing and det