In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
df = pd.read_csv('AWS_FAQ_Bot.csv')
df.head()

Unnamed: 0,Question,Answer
0,What is Amazon Elastic Compute Cloud (Amazon E...,Amazon Elastic Compute Cloud (Amazon EC2) is a...
1,What can I do with Amazon EC2?,Just as Amazon Simple Storage Service (Amazon ...
2,How can I get started with Amazon EC2?,"To sign up for Amazon EC2, click the “Sign up ..."
3,Why am I asked to verify my phone number when ...,Amazon EC2 registration requires you to have a...
4,What can developers now do that they could not...,"Until now, small developers did not have the c..."


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 671 entries, 0 to 670
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Question  670 non-null    object
 1   Answer    654 non-null    object
dtypes: object(2)
memory usage: 10.6+ KB


In [5]:
df.isnull().sum()

Question     1
Answer      17
dtype: int64

In [6]:
df.dropna(inplace=True)

In [7]:
df.isnull().sum()

Question    0
Answer      0
dtype: int64

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 653 entries, 0 to 670
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Question  653 non-null    object
 1   Answer    653 non-null    object
dtypes: object(2)
memory usage: 15.3+ KB
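
After `dropna`, the index still runs from 0 to 670 although only 653 rows remain, so row labels and row positions no longer agree. The sketch below uses a small made-up frame (the `toy` data is not from the FAQ file) to show the gap and one optional way to close it with `reset_index(drop=True)`.

```python
import pandas as pd

# Toy stand-in for the FAQ frame; these rows are invented for illustration.
toy = pd.DataFrame({
    "Question": ["q0", None, "q2", "q3"],
    "Answer": ["a0", "a1", None, "a3"],
})

# Rows 1 and 2 each contain a missing value and are dropped.
cleaned = toy.dropna()
print(list(cleaned.index))        # surviving labels keep their old numbers

# Positional access with .iloc ignores the labels entirely.
print(cleaned["Answer"].iloc[1])

# Optionally renumber so labels and positions match again.
renumbered = cleaned.reset_index(drop=True)
print(list(renumbered.index))
```

Either approach works with a positional lookup such as `.iloc`; renumbering mainly helps if label-based access (`.loc`) is used later.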


In [9]:
vec = TfidfVectorizer()
vec.fit(np.concatenate((df['Question'], df['Answer'])))

TfidfVectorizer()

In [10]:
feature_col = vec.get_feature_names_out()
feature_col

array(['00', '000', '00z', ..., 'zonal', 'zone', 'zones'], dtype=object)

In [11]:
feature_col[200:500]

array(['active', 'activities', 'activity', 'actor', 'actual', 'actually',
       'adapter', 'add', 'added', 'adding', 'addition', 'additional',
       'additionally', 'additions', 'address', 'addressed', 'addresses',
       'adds', 'adequate', 'adjust', 'adjusted', 'adjusting',
       'administration', 'administrative', 'administrator', 'adoption',
       'advance', 'advanced', 'advancements', 'advances', 'advantage',
       'advantages', 'advertised', 'advisory', 'aerospike', 'aes',
       'affect', 'affected', 'afi', 'afis', 'after', 'again', 'against',
       'aggregate', 'aggregated', 'aggregates', 'agreement', 'ahead',
       'ai', 'aka', 'alarm', 'alarms', 'alerts', 'algorithm',
       'algorithmic', 'algorithms', 'alias', 'alive', 'all', 'allocate',
       'allocated', 'allocation', 'allow', 'allowed', 'allowing',
       'allows', 'alone', 'along', 'alongside', 'already', 'also',
       'alternate', 'alternatively', 'although', 'always', 'am', 'amazon',
       'amazonaws', 'amer

In [12]:
len(feature_col)

3421

In [13]:
df_vectors = vec.transform(df['Question'])
df_vectors

<653x3421 sparse matrix of type '<class 'numpy.float64'>'
	with 6475 stored elements in Compressed Sparse Row format>
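
Each row of this matrix is one stored question, and a user query is matched against all rows by cosine similarity. Because `argmax` always returns some row, even when every score is zero, it can help to reject weak matches. The sketch below is one possible way to do that on toy data; the helper name `best_answer` and the `min_score=0.2` cutoff are assumptions for illustration, not tuned values.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy question/answer pairs, invented for illustration.
questions = [
    "What is Amazon EC2?",
    "How do I stop an EC2 instance?",
    "What is Amazon S3 storage?",
]
answers = ["EC2 answer", "Stop answer", "S3 answer"]

toy_vec = TfidfVectorizer().fit(questions)
question_vectors = toy_vec.transform(questions)

def best_answer(query, min_score=0.2):
    """Return (answer, score); answer is None when the best match is too weak."""
    query_vec = toy_vec.transform([query])
    # One row of similarities: the query against every stored question.
    scores = cosine_similarity(query_vec, question_vectors)[0]
    best = int(np.argmax(scores))
    score = float(scores[best])
    if score < min_score:  # hypothetical cutoff
        return None, score
    return answers[best], score

print(best_answer("how can I stop my instance"))
print(best_answer("tell me about kubernetes"))
```

A `None` result could be turned into a fallback message such as asking the user to rephrase.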

In [15]:
print("Hello, welcome to the Learnbay Chatbot, a simple chatbot that can answer your questions.")
print("Ask me anything related to AWS")

while True:
    input_question = input()

    # Typing 'stop' ends the session.
    if input_question == 'stop':
        break

    # Vectorize the query with the fitted vocabulary and compare it to every stored question.
    input_question_vec = vec.transform([input_question])
    similarity = cosine_similarity(input_question_vec, df_vectors)
    # Position of the most similar question; .iloc is positional, so the index gaps left by dropna() do not matter.
    closest_ans = int(np.argmax(similarity))
    print(f"Response from chatbot is : {df['Answer'].iloc[closest_ans]}")

Hello, welcome to the Learnbay Chatbot, a simple chatbot that can answer your questions.
Ask me anything related to AWS
Tell me more about AWS ? As i am new in this technology
Response from chatbot is : There is a good chance that you won’t need to make any system changes to handle the new format. If you only use the console to manage AWS resources, you might not be impacted at all, but you should still update your settings to use the longer ID format as soon as possible. If you interact with AWS resources via APIs, SDKs, or the AWS CLI, you might be impacted, depending on whether your software makes assumptions about the ID format when validating or persisting resource IDs. If this is the case, you might need to update your systems to handle the new format.
Some failure modes could include:
If your systems use regular expressions to validate the ID format, you might error if a longer format is encountered.
If there are expectations about the ID length in your database schemas, you m