# Designing and Training a Simple Chatbot

# Preprocessing Text Data

Before training a chatbot, we need to preprocess the text data. Typical steps are tokenizing the text, removing stop words, and applying stemming or lemmatization. The code in this section applies lowercasing and tokenization only.

In [13]:
!pip install nltk



In [17]:
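import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

# Load the dataset (expects 'Questions' and 'Answers' columns)
data = pd.read_csv('chatbot_dataset.csv')

# Download the tokenizer data used by word_tokenize
# (recent NLTK releases may also require nltk.download('punkt_tab'))
nltk.download('punkt')

# Lowercase each question, tokenize it, and rejoin the tokens with spaces
data['Questions'] = data['Questions'].apply(lambda x: ' '.join(word_tokenize(x.lower())))
print(data.head())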

                                     Questions  \
0                   introduction to the course   
1  overview of data science and its importance   
2    introduction to the data science workflow   
3         key skills and tools in data science   
4          where can i find my course videos ?   

                                             Answers  
0  Welcome to the data science course. Here you w...  
1  Data science is crucial for making informed de...  
2  The data science workflow includes data collec...  
3  Important skills include programming, statisti...  
4  You can find all your course videos on the Cip...  


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
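
Stop-word removal and stemming, mentioned at the start of this section, could look roughly like the following sketch. It is a standard-library illustration only: the `STOP_WORDS` set, the `SUFFIXES` tuple, and the `toy_stem` and `preprocess` helpers are hypothetical names, and the suffix stripper is a crude stand-in for a real stemmer. In practice NLTK's `stopwords` corpus and `PorterStemmer` would be the more usual choices.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption for illustration);
# NLTK's `stopwords` corpus is far more complete.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "in", "is", "its", "my", "i", "can"}

# Suffixes stripped by the toy stemmer, checked in this order.
SUFFIXES = ("ing", "es", "s")


def toy_stem(token):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on letters/digits, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("Where can I find my course videos?"))
```

Whether these steps help depends on the data: with short questions, removing words such as "where" or "what" can discard exactly the tokens that distinguish one question from another.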


# Vectorizing Text Data

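We convert the text data into numerical values using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization. Each question becomes one row of term weights, so the printed shape is (number of questions, vocabulary size).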

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data['Questions'])
print(X.shape)

(48, 112)
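
To make the TF-IDF weighting concrete, the following sketch computes it by hand for three of the questions shown earlier. It uses smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row, which is my understanding of `TfidfVectorizer`'s default settings. The whitespace tokenization and the names `docs`, `idf`, and `tfidf_vector` are assumptions for illustration; scikit-learn's default tokenizer also drops single-character tokens, which this sketch does not.

```python
import math
from collections import Counter

docs = [
    "introduction to the course",
    "introduction to the data science workflow",
    "key skills and tools in data science",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency: rarer terms get larger weights.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_vector(tokens):
    """Return the L2-normalised TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm if norm else 0.0 for w in raw]


matrix = [tfidf_vector(doc) for doc in tokenized]
print(len(matrix), len(vocab))  # rows = documents, columns = vocabulary terms
```

A term that appears in only one question, such as "course", receives a higher IDF than one shared by several questions, such as "introduction".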


# Training a Text Classification Model

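We train a Multinomial Naive Bayes classifier on TF-IDF features. The pipeline fits its own `TfidfVectorizer` on the training questions, and each distinct answer string serves as a class label. The held-out split (`X_test`, `y_test`) is created but not evaluated in this section.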

In [21]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data['Questions'], data['Answers'], test_size=0.2, random_state=42
)

# Create a pipeline that vectorizes the questions and then classifies them
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Train the model
model.fit(X_train, y_train)

print("Model Training complete.")

Model Training complete.
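
For intuition about what a multinomial Naive Bayes classifier computes, here is a minimal standard-library sketch. It is not scikit-learn's implementation: it works on raw word counts rather than TF-IDF weights, and the toy `train` data, the labels, and the helper names are all hypothetical. The smoothing value `alpha = 1.0` is Laplace smoothing, which matches my understanding of `MultinomialNB`'s default.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (question, answer label).
train = [
    ("introduction to the course", "welcome"),
    ("overview of the course", "welcome"),
    ("where can i find my course videos", "videos"),
    ("where are the videos", "videos"),
]

alpha = 1.0  # Laplace smoothing so unseen (term, class) pairs keep nonzero probability
class_docs = Counter()             # number of training questions per label
term_counts = defaultdict(Counter)  # per-label word counts
vocab = set()

for text, label in train:
    tokens = text.split()
    class_docs[label] += 1
    term_counts[label].update(tokens)
    vocab.update(tokens)


def log_score(tokens, label):
    """log P(label) plus the sum of log P(token | label) over known tokens."""
    total = sum(term_counts[label].values())
    score = math.log(class_docs[label] / len(train))
    for t in tokens:
        if t in vocab:  # tokens never seen in training are ignored
            score += math.log(
                (term_counts[label][t] + alpha) / (total + alpha * len(vocab))
            )
    return score


def predict(text):
    """Return the label with the highest log score."""
    tokens = text.lower().split()
    return max(class_docs, key=lambda label: log_score(tokens, label))


print(predict("where are my videos"))
```

Note that `predict` always returns some label, even when the question shares little with the training data. The same holds for the pipeline above.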


# Implementing a Function to Get Chatbot Responses

We write a function that preprocesses the user's input in the same way as the training questions and returns the answer predicted by the trained model.

In [22]:
# Function to get a response from the chatbot
def get_response(question):
    # Apply the same lowercasing and tokenization used on the training questions
    question = ' '.join(nltk.word_tokenize(question.lower()))
    answer = model.predict([question])[0]
    return answer

# Testing the function
print(get_response("What is NLP?"))

Seaborn is a Python visualization library based on Matplotlib that provides a high-level interface for drawing attractive statistical graphics. This is covered in the Data Visualization with Seaborn module.
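
The answer printed above describes Seaborn, although the question asked about NLP. A classifier's `predict` always returns one of the training answers, so a chatbot built this way may benefit from a guard that declines to answer unfamiliar questions. The sketch below is one possible approach and not part of the original notebook: it uses Jaccard word overlap with the known questions as a rough familiarity check. The `known_questions` list, the `FALLBACK` message, the `guarded_response` helper, and the threshold of 0.2 are all assumptions for illustration.

```python
import re

# Hypothetical stand-ins for the preprocessed training questions.
known_questions = [
    "introduction to the course",
    "overview of data science and its importance",
    "where can i find my course videos ?",
]

FALLBACK = "Sorry, I don't have an answer for that yet."


def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_overlap(question):
    """Highest Jaccard similarity between the question and any known question."""
    q = tokens(question)
    best = 0.0
    for known in known_questions:
        k = tokens(known)
        union = q | k
        if union:
            best = max(best, len(q & k) / len(union))
    return best


def guarded_response(question, predict, threshold=0.2):
    """Call predict(question) only when the question resembles the training data."""
    if best_overlap(question) < threshold:
        return FALLBACK
    return predict(question)


# A placeholder predictor is used here; in the notebook, get_response could be passed instead.
print(guarded_response("What is NLP?", predict=lambda q: "model answer"))
```

The threshold would need tuning against real questions. Evaluating the model on the held-out `X_test` and `y_test` split is another way to gauge how often its answers can be trusted.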
