# **Artificial Intelligence - Phase 3**

# ChatBot Using Python

**To Do:**


*   Load Dataset
*   Preprocess Dataset
*   Perform Analysis





In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

**Step 1**: Dataset Loading

Here we load the dataset from a local tab-separated file (`/content/dialogs.txt`). Each line holds one user message and the bot's reply to it.

In [None]:

# Load the dataset
dataset_path = '/content/dialogs.txt'
data = pd.read_csv(dataset_path, delimiter='\t', names=['User', 'Bot'])
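
A minimal sketch of what this call expects, assuming each line of `dialogs.txt` holds one user message and one reply separated by a single tab. The two sample lines are copied from the preview in Step 2, and the in-memory `StringIO` buffer is only a stand-in for the real file. `quoting=csv.QUOTE_NONE` is an optional precaution that is not part of the original call: it stops pandas from treating a stray double quote in a message as the start of a quoted field.

```python
import csv
import io

import pandas as pd

# Stand-in for /content/dialogs.txt: one "user<TAB>bot" pair per line.
sample = (
    "hi, how are you doing?\ti'm fine. how about yourself?\n"
    "i'm fine. how about yourself?\ti'm pretty good. thanks for asking.\n"
)

sample_data = pd.read_csv(
    io.StringIO(sample),
    delimiter="\t",
    names=["User", "Bot"],
    quoting=csv.QUOTE_NONE,  # assumption: messages never use quoted fields
)

print(sample_data.shape)
print(sample_data.loc[0, "Bot"])
```
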


**Step 2**: Dataset Overview

In [None]:
# Display the first few rows of the dataset
data.head()

|   | User | Bot |
|---|------|-----|
| 0 | hi, how are you doing? | i'm fine. how about yourself? |
| 1 | i'm fine. how about yourself? | i'm pretty good. thanks for asking. |
| 2 | i'm pretty good. thanks for asking. | no problem. so how have you been? |
| 3 | no problem. so how have you been? | i've been great. what about you? |
| 4 | i've been great. what about you? | i've been good. i'm in school right now. |


**Step 3**: Data Analysis


In this section, we will perform some basic data analysis to gain insights into our dataset.


In [None]:
# Check for missing values
missing_values = data.isnull().sum()
print("Missing Values:")
print(missing_values)

# Check the distribution of user and bot messages
user_message_count = len(data[data['User'].notnull()])
bot_message_count = len(data[data['Bot'].notnull()])
total_messages = len(data)
user_message_percentage = (user_message_count / total_messages) * 100
bot_message_percentage = (bot_message_count / total_messages) * 100

print(f"Total Messages: {total_messages}")
print(f"User Messages: {user_message_count} ({user_message_percentage:.2f}%)")
print(f"Bot Messages: {bot_message_count} ({bot_message_percentage:.2f}%)")


Missing Values:
User    0
Bot     0
dtype: int64
Total Messages: 3725
User Messages: 3725 (100.00%)
Bot Messages: 3725 (100.00%)
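
Beyond missing-value counts, a quick look at message lengths can help later when choosing settings such as a maximum sequence length. The following is a minimal sketch on a hypothetical three-row frame built from rows in the Step 2 preview; `toy`, `user_words`, and `bot_words` are illustrative names, not part of the original notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "User": [
        "hi, how are you doing?",
        "i'm fine. how about yourself?",
        "no problem. so how have you been?",
    ],
    "Bot": [
        "i'm fine. how about yourself?",
        "i'm pretty good. thanks for asking.",
        "i've been great. what about you?",
    ],
})

# Count whitespace-separated words per message.
toy["user_words"] = toy["User"].str.split().str.len()
toy["bot_words"] = toy["Bot"].str.split().str.len()

# Summary statistics (count, mean, min, max, quartiles) for both columns.
print(toy[["user_words", "bot_words"]].describe())
```

On the full dataset this would need to run before Step 4, because `.str.split()` expects the columns to still contain strings.
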


**Step 4**: Data Preprocessing

Now, we will preprocess the text data for use in our chatbot project. This typically involves text cleaning and tokenization.


In [None]:
# Text cleaning and tokenization: lowercase each message, then split it on whitespace
data['User'] = data['User'].str.lower().str.split()
data['Bot'] = data['Bot'].str.lower().str.split()
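
The cell above lowercases and splits on whitespace, so punctuation stays attached to words (`doing?`, `fine.`). If that is undesirable, one possible cleaning step is sketched below. `clean_and_tokenize` is a hypothetical helper, and keeping only letters, digits, and apostrophes is an assumption about what counts as noise in this dataset.

```python
import re


def clean_and_tokenize(text):
    """Lowercase, replace punctuation (except apostrophes) with spaces, split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    return text.split()


print(clean_and_tokenize("Hi, how are you doing?"))
print(clean_and_tokenize("I'm fine. How about yourself?"))
```

It could be applied with `data['User'].apply(clean_and_tokenize)` in place of the lowercase and split calls.
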


## Document Completion

We have successfully loaded and preprocessed the dataset for our chatbot project. The data analysis and preprocessing steps are essential for building a chatbot capable of understanding and responding to user input. Next, we can proceed with more advanced tasks like natural language processing and model development.


# **ALGORITHM SELECTION FOR MACHINE LEARNING**

In this section, we will choose a machine learning algorithm suitable for our chatbot project. We'll provide reasons for our choice.


In [None]:
# Import the necessary machine learning libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Explain the choice of algorithm
algorithm_choice = "Multinomial Naive Bayes"
reasoning = """
We have chosen Multinomial Naive Bayes for its simplicity and effectiveness in text classification tasks. Our dataset primarily contains text data, and Naive Bayes is known to perform well in such scenarios.
"""

# Print the algorithm choice and reasoning
print(f"Selected Algorithm: {algorithm_choice}")
print("Reasoning:", reasoning)


Selected Algorithm: Multinomial Naive Bayes
Reasoning: 
We have chosen Multinomial Naive Bayes for its simplicity and effectiveness in text classification tasks. Our dataset primarily contains text data, and Naive Bayes is known to perform well in such scenarios.
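
To make the choice concrete, here is a minimal sketch of a TF-IDF plus Multinomial Naive Bayes pipeline on a hypothetical four-message toy set. The intent labels (`greeting`, `farewell`) are invented for illustration; the actual dataset has no such labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: two messages per intent.
texts = ["hello there", "hi, how are you", "goodbye for now", "see you later, goodbye"]
labels = ["greeting", "greeting", "farewell", "farewell"]

# The pipeline vectorizes raw text and feeds the TF-IDF features to the classifier.
toy_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_pipeline.fit(texts, labels)

prediction = toy_pipeline.predict(["hello, how are you today"])[0]
print(prediction)
```

Wrapping both steps in one pipeline keeps the vectorizer and the classifier together, so raw strings can be passed to `fit` and `predict` directly.
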



## Data Preparation

Before training the model, we need to prepare the data for training and testing. This typically involves splitting the dataset into training and testing sets and vectorizing text data.


In [None]:
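from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Inputs are the user messages; targets are the bot replies
X = data['User']
y = data['Bot']

# Hold out 20% of the pairs for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4 turned each message into a list of tokens; join them back into strings for the vectorizer
X_train = [' '.join(doc) for doc in X_train]
X_test = [' '.join(doc) for doc in X_test]

# Fit TF-IDF on the training messages only, then apply the same vocabulary to the test messages
tfidf_vectorizer = TfidfVectorizer()
X_train = tfidf_vectorizer.fit_transform(X_train)
X_test = tfidf_vectorizer.transform(X_test)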
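
As a small illustration of the vectorization step, the sketch below fits a `TfidfVectorizer` on a hypothetical three-message corpus and inspects the result. The vectorizer is fitted on the training messages only and then reused for unseen text, mirroring the `fit_transform` / `transform` split above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus.
train_messages = ["how are you", "how is school", "school is fine"]
unseen_messages = ["how is work"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_messages)  # learn vocabulary and IDF weights
unseen_matrix = vectorizer.transform(unseen_messages)    # reuse them; unknown words are ignored

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, unseen_matrix.shape)
```

Each row is one message and each column one vocabulary word, so a word that never appeared in training (`work` here) contributes nothing to the unseen message's vector.
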

**Model Training**

In [None]:
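# The targets are the bot replies themselves, so this is multiclass classification:
# each distinct reply becomes one class (it is not a binary 'Bot' / 'Not Bot' problem)
from sklearn.preprocessing import LabelEncoder

# Step 4 turned each reply into a list of tokens; join them back into strings so they can be encoded
y_train_text = [' '.join(tokens) for tokens in y_train]

# Map each distinct reply string to an integer class label
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train_text)

# Train the classifier to pick a reply class from the TF-IDF features of the user message
model = MultinomialNB()
model.fit(X_train, y_train_encoded)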
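
Here the labels are the bot replies themselves, so each distinct reply is its own class and prediction amounts to picking one of the replies seen during training. A minimal sketch on a hypothetical three-pair toy set shows the round trip from text to encoded label and back via `inverse_transform`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy question/reply pairs.
questions = ["how are you", "what is your name", "where do you live"]
replies = ["i'm fine", "my name is bot", "i live online"]

vec = TfidfVectorizer()
features = vec.fit_transform(questions)

encoder = LabelEncoder()
targets = encoder.fit_transform(replies)  # one integer class per distinct reply

clf = MultinomialNB()
clf.fit(features, targets)

# Predict a class for a new message, then map the class index back to the reply text.
predicted = clf.predict(vec.transform(["what is your name"]))
reply = encoder.inverse_transform(predicted)[0]
print(reply)
```

A model of this kind can only return replies it has already seen; it cannot compose new sentences.
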





In [2]:
pip install transformers


In [None]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Initialize the GPT-2 tokenizer and model
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# GPT-2 has no padding token by default; add one and resize the embeddings to match
padding_token = '<PAD>'
tokenizer.add_special_tokens({'pad_token': padding_token})
model.resize_token_embeddings(len(tokenizer))

# Tokenize and encode the user messages (joined back into strings after Step 4's tokenization)
user_texts = [' '.join(tokens) for tokens in X]
encoded_inputs = tokenizer(user_texts, return_tensors="pt", padding=True, truncation=True, max_length=128)

# Note: no training happens here. The pre-trained GPT-2 model is used as-is for text generation;
# it has not been fine-tuned on this dataset, so its replies are not learned from these dialogs.

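# Define a function to generate responses using the GPT-2 model
def generate_response(user_input):
    input_ids = tokenizer.encode(user_input, return_tensors="pt").to(model.device)
    attention_mask = torch.ones_like(input_ids)
    with torch.no_grad():  # inference only, no gradients needed
        response_ids = model.generate(
            input_ids, attention_mask=attention_mask,
            max_length=50, num_return_sequences=1,  # max_length counts the prompt tokens too
            pad_token_id=tokenizer.eos_token_id,  # pad with EOS; silences the pad-token warning
        )
    # generate() returns the prompt followed by the continuation; decode only the continuation
    response = tokenizer.decode(response_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
    return response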




while True:
    user_input = input("You: ")  # Get user input
    if user_input.lower() == "exit":
        print("Chatbot: Goodbye!")
        break  # Exit the loop if the user types "exit"

    gpt_response = generate_response(user_input)  # Get chatbot response using the GPT-2 model
    print(f"Chatbot: {gpt_response}")  # Display chatbot's response
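
The loop above is hard to test because it reads from the keyboard and calls the model directly. One possible refactoring, sketched below, passes the reply function and the input source in as parameters. `run_chat`, `respond`, and `echo_bot` are hypothetical names, and `echo_bot` is only a stand-in for `generate_response`.

```python
def run_chat(respond, messages):
    """Feed each message to `respond` until "exit" is seen; return the transcript."""
    transcript = []
    for message in messages:
        if message.strip().lower() == "exit":
            transcript.append("Chatbot: Goodbye!")
            break
        transcript.append(f"Chatbot: {respond(message)}")
    return transcript


def echo_bot(message):
    # Stand-in for generate_response; simply echoes the input.
    return f"you said: {message}"


for line in run_chat(echo_bot, ["hello", "EXIT", "never reached"]):
    print(line)
```

For interactive use, `messages` could be a generator that repeatedly yields `input("You: ")`, with `generate_response` passed as `respond`.
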
