### Types of Chatbots:

There are broadly two variants of chatbots: Rule-Based and Self-learning.

    1. In a rule-based approach, a bot answers questions according to rules written in advance by its developers. The rules can range from very simple to very complex. Such bots handle simple queries well but tend to fail on complex queries that the rules do not anticipate.
    
    2. Self-learning bots use machine-learning approaches and are generally more flexible than rule-based bots. They come in two further types: retrieval-based and generative.


    2.a) In retrieval-based models, a chatbot uses a heuristic to select a response from a library of predefined responses. It uses the message and the context of the conversation to choose the best response from that list. The context can include the current position in the dialogue tree, all previous messages in the conversation, and previously saved variables (e.g., username). Heuristics for selecting a response can be engineered in many ways, from rule-based if-else conditional logic to machine learning classifiers.

    2.b) Generative bots compose their answers instead of always replying with one answer from a fixed set. They build a response word by word based on the query, which lets them respond to inputs that no predefined answer covers.
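
To make the rule-based case concrete, here is a minimal sketch, assuming a few hand-written regular-expression rules. The names `RULES`, `FALLBACK`, and `rule_based_reply` are hypothetical and chosen only for illustration; they do not come from any library.

```python
import re

# Hypothetical hand-written rules: (compiled pattern, canned reply).
RULES = [
    (re.compile(r"\b(hi|hello|hey)\b", re.IGNORECASE), "Hello! How can I help you?"),
    (re.compile(r"\b(bye|goodbye)\b", re.IGNORECASE), "Goodbye!"),
    (re.compile(r"\bopening hours\b", re.IGNORECASE),
     "The library is open from 9 AM to 5 PM on weekdays."),
]

# Reply used when no rule matches (an assumption for this sketch).
FALLBACK = "Sorry, I do not understand."


def rule_based_reply(message: str) -> str:
    """Return the reply of the first rule whose pattern matches the message."""
    for pattern, reply in RULES:
        if pattern.search(message):
            return reply
    return FALLBACK


print(rule_based_reply("Hello there"))
print(rule_based_reply("Can I pay my fine online?"))
```

The second query matches no rule and falls through to the fallback, which illustrates why such bots struggle with anything their rules did not anticipate.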

### Examples

Retrieval-Based Chatbot Example:


Predefined Responses:

    "The library is open from 9 AM to 5 PM on weekdays."
    "You can borrow up to 5 books at a time."
    "To renew a book, please visit the library's website or contact the help desk."

Conversation:
User: "What are the library's opening hours?"
Bot: "The library is open from 9 AM to 5 PM on weekdays."

How it works:

    Message: "What are the library's opening hours?"
    Heuristic: The bot matches the user's message to the closest predefined response using keywords or patterns (e.g., "opening hours").
    Selected Response: "The library is open from 9 AM to 5 PM on weekdays."
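
The selection step described above can be sketched as a keyword-overlap score. This is one possible heuristic, not the only one; the keyword sets, the fallback message, and the function names are assumptions made for this example.

```python
import re

# Predefined responses from the example, each paired with hypothetical keywords.
RESPONSES = [
    ({"opening", "hours", "open", "close"},
     "The library is open from 9 AM to 5 PM on weekdays."),
    ({"borrow", "books", "many"},
     "You can borrow up to 5 books at a time."),
    ({"renew", "renewal"},
     "To renew a book, please visit the library's website or contact the help desk."),
]

FALLBACK = "Sorry, I do not have an answer for that."


def tokenize(message: str) -> set:
    """Lowercase the message and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", message.lower()))


def select_response(message: str) -> str:
    """Pick the response whose keyword set overlaps most with the message."""
    words = tokenize(message)
    keywords, best = max(RESPONSES, key=lambda item: len(item[0] & words))
    # If even the best candidate shares no keyword, give the fallback reply.
    return best if keywords & words else FALLBACK


print(select_response("What are the library's opening hours?"))
```

Whatever the user types, the bot can only return one of the three stored sentences or the fallback, which is the limited flexibility noted for retrieval-based bots.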

Generative Chatbot Example:

Imagine a more advanced chatbot that can generate responses on the fly.

Conversation:
User: "What are the library's opening hours?"
Bot: "The library is open from 9 AM to 5 PM on weekdays, but it is closed on weekends."

How it works:

    Message: "What are the library's opening hours?"
    Generative Model: The bot processes the input using a neural network that has been trained on a large dataset of conversational text. It generates a response word by word.
    Generated Response: "The library is open from 9 AM to 5 PM on weekdays, but it is closed on weekends."
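
Training a neural network is beyond a short example, but the idea of producing a response one word at a time can be illustrated with a much simpler stand-in. The sketch below uses a bigram (Markov-chain) model with greedy decoding instead of a neural network; the tiny corpus and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Tiny hypothetical training corpus; a real generative bot would train a
# neural network on a far larger dataset of conversational text.
CORPUS = [
    "the library is open on weekdays",
    "the museum is open on weekdays",
    "the library is closed on weekends",
]


def train_bigrams(sentences):
    """Count, for every word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def generate(follows, max_words=10):
    """Greedy decoding: repeatedly append the most frequent next word."""
    word, output = "<s>", []
    for _ in range(max_words):
        word = follows[word].most_common(1)[0][0]
        if word == "</s>":  # end-of-sentence marker reached
            break
        output.append(word)
    return " ".join(output)


model = train_bigrams(CORPUS)
print(generate(model))
```

The sentence is assembled from learned word-to-word statistics rather than copied from a list of canned replies, which is the key contrast with the retrieval-based approach.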

Key Differences:

    Retrieval-Based Bot:
        Response Source: Predefined responses.
        Selection Method: Heuristics like keyword matching or pattern recognition.
        Flexibility: Limited to the responses it has been given.

    Generative Bot:
        Response Source: Generates responses dynamically.
        Selection Method: Uses machine learning models to create a response based on the input.
        Flexibility: More adaptable; can handle a wider range of queries with nuanced answers.

### Text Pre-processing with NLTK

Text data arrives as strings, but machine learning algorithms need numerical feature vectors as input. So before starting any NLP project, we pre-process the text into a form that can be converted into such features. Basic text pre-processing includes:

    1. Converting the entire text to uppercase or lowercase, so that the algorithm does not treat the same word in different cases as different words.

    2. Tokenization: converting a text string into a list of tokens, i.e., the units (sentences or words) that we want to work with. A sentence tokenizer splits the text into a list of sentences, and a word tokenizer splits it into a list of words.

In [None]:
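# Lowercasing needs only the built-in str.lower(); NLTK is imported for the tokenizers used in later cells
import nltk
nltk.download('punkt')  # Punkt tokenizer models, required by sent_tokenize and word_tokenize
# Recent NLTK releases (3.9 and later) may ask for nltk.download('punkt_tab') instead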

text = "Natural Language Processing with NLTK is Fun!"
text_lowercase = text.lower()
print(text_lowercase)


### Tokenization:

Purpose: Breaking down the text into smaller pieces like sentences or words.

In [4]:
#sentence tokenization
from nltk.tokenize import sent_tokenize

text = "Sentencewise Hello World. Natural Language Processing with NLTK is Fun!"
sentences = sent_tokenize(text)
print(sentences)


['Sentencewise Hello World.', 'Natural Language Processing with NLTK is Fun!']


In [5]:
#word tokenization
from nltk.tokenize import word_tokenize

text = "Wordwise: Natural Language Processing with NLTK is Fun!"
words = word_tokenize(text)
print(words)


['Wordwise', ':', 'Natural', 'Language', 'Processing', 'with', 'NLTK', 'is', 'Fun', '!']


## Term Frequency (TF) and Inverse Document Frequency (IDF):

Term Frequency (TF) and Inverse Document Frequency (IDF) are fundamental concepts in Natural Language Processing (NLP) used to measure the importance of a word in a document relative to a collection of documents (corpus).

a) Term Frequency (TF):
TF measures how frequently a term appears in a document. It is the ratio of the number of times a term appears in a document to the total number of terms in the document.

Formula:
TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)

Example: if a term appears 3 times in a document of 100 terms, TF(t, d) = 3 / 100 = 0.03
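
As a quick check of the formula, the sketch below computes TF for a made-up document. The helper `term_frequency` and the 100-token document are assumptions for illustration only.

```python
from collections import Counter


def term_frequency(term, tokens):
    """TF(t, d): occurrences of `term` divided by the total number of tokens."""
    return Counter(tokens)[term] / len(tokens)


# Hypothetical 100-token document in which "nltk" appears 3 times.
tokens = ["nltk"] * 3 + ["filler"] * 97
print(term_frequency("nltk", tokens))  # 0.03
```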

b) Inverse Document Frequency (IDF):
IDF measures how informative a term is across the corpus. TF alone treats all terms as equally important, yet common words such as "is", "of", and "that" appear frequently while carrying little meaning. IDF weighs down terms that occur in many documents and scales up terms that occur in few.

Formula:
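IDF(t, D) = log(N / Number of documents containing term t), where N is the total number of documents in the corpus D

Example: with N = 1000 documents, 10 of which contain the term, and a base-10 logarithm, IDF(t, D) = log10(1000 / 10) = log10(100) = 2. The code cell later in this section calls math.log, which is the natural logarithm, so it would give ln(100) ≈ 4.605 for the same counts.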

c) TF-IDF:
TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a corpus.

Formula:
TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

Example: with TF = 0.03 and IDF = 2, TF-IDF(t, d, D) = 0.03 × 2 = 0.06
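
The worked numbers above can be reproduced in a few lines. The example arithmetic corresponds to a base-10 logarithm, whereas Python's `math.log` defaults to the natural logarithm, so this sketch prints both for comparison. The variable names are chosen for illustration.

```python
import math

N = 1000       # total number of documents in the corpus
df = 10        # number of documents that contain the term
tf = 3 / 100   # term frequency from the TF example

idf_base10 = math.log10(N / df)   # logarithm used in the worked example
idf_natural = math.log(N / df)    # natural logarithm (math.log default)

print(f"IDF (base 10): {idf_base10:.4f}")              # 2.0000
print(f"IDF (natural): {idf_natural:.4f}")             # 4.6052
print(f"TF-IDF (base 10): {tf * idf_base10:.4f}")      # 0.0600
print(f"TF-IDF (natural): {tf * idf_natural:.4f}")     # 0.1382
```

The choice of base only rescales every score by the same constant, so the ranking of terms within a corpus does not change.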

In [1]:
import nltk
import math
from collections import Counter

# Download necessary resources
nltk.download('punkt')

# Example documents
documents = [
    "Natural, Language, Processing with NLTK is fun.",
    "Natural Language Processing and machine learning are closely related.",
    "Text processing with NLTK and Python is powerful."
]

# Step 1: Convert to lowercase and tokenize the text
tokenized_documents = [nltk.word_tokenize(doc.lower()) for doc in documents]

# Step 2: Calculate Term Frequency (TF)
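def compute_tf(word_dict, doc):
    """Return {word: count / total tokens} for one tokenized document."""
    tf_dict = {}
    doc_count = len(doc)  # total number of tokens in the document
    for word, count in word_dict.items():
        tf_dict[word] = count / doc_count
    return tf_dict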

# Compute TF for each document
tf_documents = []
for doc in tokenized_documents:
    word_counts = Counter(doc)
    tf_documents.append(compute_tf(word_counts, doc))

# Step 3: Calculate Inverse Document Frequency (IDF)
def compute_idf(documents):
    """Return {word: log(N / document frequency)} using the natural logarithm."""
    N = len(documents)
    unique_words = set(word for doc in documents for word in doc)
    idf_dict = dict.fromkeys(unique_words, 0)
    # Count the number of documents that contain each word
    for doc in documents:
        for word in set(doc):
            idf_dict[word] += 1
    # Convert each document count into an IDF value
    for word, val in idf_dict.items():
        idf_dict[word] = math.log(N / val)
    return idf_dict

# Compute IDF
idf_dict = compute_idf(tokenized_documents)

# Step 4: Calculate TF-IDF
def compute_tfidf(tf_doc, idf_dict):
    tfidf_dict = {}
    for word, tf_val in tf_doc.items():
        tfidf_dict[word] = tf_val * idf_dict[word]
    return tfidf_dict

# Compute TF-IDF for each document
tfidf_documents = [compute_tfidf(tf_doc, idf_dict) for tf_doc in tf_documents]

# Print results
for i, doc in enumerate(tfidf_documents):
    print(f"\nDocument {i+1} TF-IDF scores:")
    for word, score in doc.items():
        print(f"{word}: {score:.4f}")



Document 1 TF-IDF scores:
natural: 0.0405
,: 0.2197
language: 0.0405
processing: 0.0000
with: 0.0405
nltk: 0.0405
is: 0.0405
fun: 0.1099
.: 0.0000

Document 2 TF-IDF scores:
natural: 0.0405
language: 0.0405
processing: 0.0000
and: 0.0405
machine: 0.1099
learning: 0.1099
are: 0.1099
closely: 0.1099
related: 0.1099
.: 0.0000

Document 3 TF-IDF scores:
text: 0.1221
processing: 0.0000
with: 0.0451
nltk: 0.0451
and: 0.0451
python: 0.1221
is: 0.0451
powerful: 0.1221
.: 0.0000
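
In the output above, the comma gets the highest score in Document 1 because the tokenizer emits punctuation marks as ordinary tokens. One possible remedy, sketched here under the assumption that punctuation carries no useful signal for the task, is to drop such tokens before counting. The helper name `drop_punctuation` is hypothetical.

```python
def drop_punctuation(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]


# Tokens of Document 1 as produced by lowercasing and word tokenization.
tokens = ["natural", ",", "language", ",", "processing",
          "with", "nltk", "is", "fun", "."]
print(drop_punctuation(tokens))
```

Filtering changes each document's length, so the TF values (and therefore the TF-IDF scores) of the remaining words shift as well.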




In [11]:
from collections import Counter
a = [1,2,32,1,2,32]
Counter(a)  # counts how many times each element occurs; the result behaves like a dictionary

Counter({1: 2, 2: 2, 32: 2})