<a href="https://colab.research.google.com/github/rajat-malvi/Classical-AI-algo/blob/main/Supervised-learning/naiveBayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Naive Bayes Algorithm

1. **Data Split**
   - Split the dataset into training and testing sets (for example, 60/40 or 80/20).

2. **Training**
   - Create a dictionary of unique words in the training data.
   - Calculate word probabilities for each class (e.g., spam and not spam) and store them.
   - Estimate prior probabilities for each class.

3. **Testing**
   - For each test instance, calculate class probabilities and classify based on the highest probability.
   - Compare predictions with actual labels to assess performance.

4. **Evaluation Metrics**
   - Calculate **Accuracy**, **Recall**, **Precision**, and **F1 Score** to evaluate the model.
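
As a compact reference for the training and testing steps above, the usual multinomial Naive Bayes decision rule can be written as follows. The symbols are introduced here for illustration and are not used elsewhere in the notebook: $c$ is a class (ham or spam), $w_1, \dots, w_n$ are the tokens of one message, $\operatorname{count}(w, c)$ is how often word $w$ occurs in training messages of class $c$, and $N_c$ is the total number of word occurrences in class $c$. The likelihood shown is the plain relative frequency, without smoothing.

```latex
\hat{c} = \arg\max_{c \in \{\text{ham},\, \text{spam}\}} \; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
P(w \mid c) \approx \frac{\operatorname{count}(w, c)}{N_c}
```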

For this algorithm, we use the SMS Spam Collection data file hosted on [GitHub](https://github.com/atmabodha/selfshiksha/blob/main/Supervised%20Learning%20Basics/SelfShiksha_SLB_MCQ_31_NaiveBayes_EmailClassification/SMSSpamCollection).


1. Write code to implement the Naive Bayes algorithm from scratch.

2. Explore the Naive Bayes module in sklearn and compare its results with those of your own code.

3. Write code that takes a list of actual output values (0s and 1s) and a list of predicted values (also 0s and 1s) and computes the following statistics: accuracy, TP, TN, FP, FN, recall, precision, and F1 score. Compare your results with the equivalent functions in sklearn.
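
The third task can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own solution: the function name `binary_metrics` and the returned dictionary keys are hypothetical, and it assumes that 1 marks the positive class (for example spam) and 0 the negative class. Ratios with a zero denominator are reported as 0.0.

```python
def binary_metrics(actual, predicted):
    """Return confusion-matrix counts and derived scores for 0/1 labels.

    Assumption: 1 is the positive class and 0 is the negative class.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration only
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(actual, predicted))
```

For the comparison the task asks for, the same two lists can be passed to `accuracy_score`, `precision_score`, `recall_score`, `f1_score`, and `confusion_matrix` from `sklearn.metrics`.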

In [15]:
import pandas as pd
df = pd.read_csv('./SMSSpamCollection.txt', sep='\t', header=None, names=['Label', 'SMS'])
df

| | Label | SMS |
| --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |
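
The first step in the outline is a train/test split, and the cells further down split by row position using `splitor`. If a data file happened to be ordered by label, a positional split would be skewed, so one alternative is to shuffle before splitting. The sketch below uses only the standard library; `shuffled_split`, `train_fraction`, and `seed` are hypothetical names, and the toy rows stand in for the rows of `df`.

```python
import random


def shuffled_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                  # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy (label, text) pairs for illustration only
rows = ([("ham", f"message {i}") for i in range(8)]
        + [("spam", f"offer {i}") for i in range(2)])
train_rows, test_rows = shuffled_split(rows, train_fraction=0.8)
print(len(train_rows), len(test_rows))  # 8 2
```

With the DataFrame loaded above, the rows could be obtained as plain tuples with `list(df.itertuples(index=False, name=None))`.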


In [16]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def tokenization(sent):
    """Return the lowercase tokens of `sent`, without non-letters and English stop words."""
    # Drop everything except ASCII letters and whitespace (digits and punctuation are removed)
    cleaned_sent = re.sub(r'[^a-zA-Z\s]', '', sent)

    # Tokenize the cleaned sentence
    original_words = word_tokenize(cleaned_sent)

    # Lowercase each token and filter out stop words
    filtered_tokens = [word.lower() for word in original_words if word.lower() not in STOP_WORDS]

    return filtered_tokens

# string = "Text and Expressions: Use a relational database (like PostgreSQL) or a document store (like MongoDB) to store text and expressions. Store each chapter or section as a separate document/record with clear segmentation for headings, paragraphs, and inline expressions."

# tokenization(string)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [17]:
# Prior probability of each class; ham is treated as the "positive" class and spam as the "negative" class
dct = dict(df['Label'].value_counts())

prob_posi = dct['ham'] / (dct['ham'] + dct['spam'])
print(prob_posi)
prob_neg = 1 - prob_posi
print(prob_neg)

0.8659368269921034
0.1340631730078966


In [18]:
dbDict = {}

In [19]:
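# Count word occurrences per class over the training rows
def train_data(df, splitor, dbDict):
  # Rows with index below `splitor` are used for training; the rest are held out for testing
  for index, row in df.iterrows():
    if index >= splitor:
      break

    label = row['Label']
    text = row['SMS']

    if label not in dbDict:
      dbDict[label] = {}

    words = tokenization(text)

    # dbDict[label][word] holds how often `word` appears in messages of class `label`
    for word in words:
      if word not in dbDict[label]:
        dbDict[label][word] = 0
      dbDict[label][word] += 1

  return dbDict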
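
The nested-dictionary counting above can also be expressed with `collections.Counter`. The following is a minimal sketch rather than a drop-in replacement: `count_words_by_label` is a hypothetical helper, and it assumes the input is an iterable of `(label, tokens)` pairs, such as the label of each training row paired with the output of `tokenization`.

```python
from collections import Counter, defaultdict


def count_words_by_label(labelled_tokens):
    """Build {label: Counter({word: count})} from (label, tokens) pairs."""
    counts = defaultdict(Counter)
    for label, tokens in labelled_tokens:
        counts[label].update(tokens)  # adds 1 per occurrence of each token
    return counts


# Toy data for illustration only
toy = [
    ("ham", ["see", "you", "tonight"]),
    ("spam", ["free", "prize", "free"]),
    ("ham", ["see", "movie"]),
]
counts = count_words_by_label(toy)
print(counts["ham"]["see"], counts["spam"]["free"], counts["spam"]["see"])  # 2 2 0
```

A `Counter` returns 0 for a word it has never seen, so lookups for unseen words need no special handling.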

In [None]:
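# Exploratory check: relative frequency of one word in ham (note this divides by the number of distinct ham words, not the total ham word count)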
# prob_come = dbDict['ham']['come']/len(dbDict['ham'].keys())
# print(prob_come)

0.031267217630853994


In [23]:
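def computation(query_tokens, positive, dbDict, isPositive):
  """Return prior * product of P(word | class) over the query tokens.

  `positive` is the class prior; `isPositive` selects ham (True) or spam (False).
  """
  total_prob = positive
  val = 'ham' if isPositive else 'spam'

  # Total number of word occurrences counted for this class during training
  total_words = sum(dbDict[val].values())

  for word in query_tokens:
    # A word never seen in this class has count 0, which makes the whole product 0 (no smoothing is applied)
    word_count = dbDict[val].get(word, 0)
    total_prob *= word_count / total_words

  return total_prob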
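
Because the score above is a plain product of relative frequencies, a single token that never occurred in a class drives that class's score to 0, and long messages can underflow. A common remedy is Laplace (add-alpha) smoothing combined with summing logarithms. The sketch below illustrates that variant under stated assumptions and is not the notebook's method: `log_score`, `vocab_size`, and `alpha` are hypothetical names, and the toy counts only mimic the `{label: {word: count}}` shape of `dbDict`.

```python
import math


def log_score(query_tokens, prior, word_counts, vocab_size, alpha=1.0):
    """Return log(prior) + sum of log smoothed word likelihoods for one class.

    word_counts: {word: count} for the class
    vocab_size:  number of distinct words across all classes
    alpha:       add-alpha (Laplace) smoothing constant
    """
    total_words = sum(word_counts.values())
    denom = total_words + alpha * vocab_size
    score = math.log(prior)
    for word in query_tokens:
        # Unseen words get a small non-zero probability instead of 0
        score += math.log((word_counts.get(word, 0) + alpha) / denom)
    return score


# Toy counts for illustration only
toy_counts = {
    "ham": {"see": 2, "you": 1, "tonight": 1},
    "spam": {"free": 3, "prize": 2},
}
vocab = set(toy_counts["ham"]) | set(toy_counts["spam"])
query = ["free", "prize", "tonight"]
ham = log_score(query, 0.5, toy_counts["ham"], len(vocab))
spam = log_score(query, 0.5, toy_counts["spam"], len(vocab))
print("spam" if spam > ham else "ham")  # spam
```

With smoothing, every message receives a non-zero score for both classes, so the comparison always produces a prediction.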


In [24]:
# Naive Bayes Algorithm for SMS Spam Classification
def naive_bayes(df, splitor):
  # Dictionary holding word counts for each class, built from the training rows
  dbDict = train_data(df, splitor, {})

  # Estimate prior probabilities from the training rows only, so test labels are not used
  dct = dict(df['Label'].iloc[:splitor].value_counts())
  total = dct['ham'] + dct['spam']
  prob_positive = dct['ham'] / total
  prob_negative = dct['spam'] / total

  # Iterate through the test set and classify each message
  for i in range(splitor, len(df)):
    query = df['SMS'][i]
    query_word = tokenization(query)

    # Compute the score of each class
    positive_computation = computation(query_word, prob_positive, dbDict, True)
    negative_computation = computation(query_word, prob_negative, dbDict, False)

    # Print actual label, predicted label, and message text.
    # Messages whose two scores are equal (for example, both 0 because of an unseen word) are skipped.
    if positive_computation > negative_computation:
      print(df['Label'][i], ' ham ', df['SMS'][i])
    elif positive_computation < negative_computation:
      print(df['Label'][i], ' spam ', df['SMS'][i])


In [25]:
naive_bayes(df,5500)

ham  ham  Love has one law; Make happy the person you love. In the same way friendship has one law; Never make ur friend feel alone until you are alive.... Gud night
spam  spam  PRIVATE! Your 2003 Account Statement for 07808247860 shows 800 un-redeemed S. I. M. points. Call 08719899229 Identifier Code: 40411 Expires 06/11/04
ham  ham  Wait . I will msg after  &lt;#&gt;  min.
ham  ham  What i told before i tell. Stupid hear after i wont tell anything to you. You dad called to my brother and spoken. Not with me.
ham  ham  I want to be inside you every night...
ham  ham  Machan you go to gym tomorrow,  i wil come late goodnight.
ham  ham  Lol they were mad at first but then they woke up and gave in.
ham  ham  I went to project centre
ham  ham  Just making dinner, you ?
ham  ham  Yes. Please leave at  &lt;#&gt; . So that at  &lt;#&gt;  we can leave
ham  ham  Miles and smiles r made frm same letters but do u know d difference..? smile on ur face keeps me happy even though I am miles away fr

# Naive Bayes using sklearn

In [29]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Load and preprocess data
def load_data(df):
    X = df['SMS']
    y = df['Label'].map({'ham': 0, 'spam': 1})  # Convert labels to binary values (0 for ham, 1 for spam)
    return X, y

# Train and evaluate Naive Bayes classifier
def naive_bayes_sklearn(df, test_size=0.2):
    X, y = load_data(df)

    # Split the data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)

    # Convert text to feature vectors
    vectorizer = CountVectorizer()
    X_train_vectorized = vectorizer.fit_transform(X_train)
    X_test_vectorized = vectorizer.transform(X_test)

    # Initialize and train the Naive Bayes classifier
    nb_classifier = MultinomialNB()
    nb_classifier.fit(X_train_vectorized, y_train)

    # Make predictions on the test set
    y_pred = nb_classifier.predict(X_test_vectorized)

    # Output results
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred, target_names=['ham', 'spam']))

    # Print every test message with its actual and predicted label
    for text, true_label, predicted_label in zip(X_test, y_test, y_pred):
        label_str = 'ham' if predicted_label == 0 else 'spam'
        print(f"Actual: {'ham' if true_label == 0 else 'spam'}, Predicted: {label_str}, SMS: {text}")

# Usage example
# Assuming `df` is your DataFrame with 'SMS' and 'Label' columns
naive_bayes_sklearn(df)


Accuracy: 0.9919282511210762
Classification Report:
               precision    recall  f1-score   support

         ham       0.99      1.00      1.00       966
        spam       1.00      0.94      0.97       149

    accuracy                           0.99      1115
   macro avg       1.00      0.97      0.98      1115
weighted avg       0.99      0.99      0.99      1115

Actual: ham, Predicted: ham, SMS: Squeeeeeze!! This is christmas hug.. If u lik my frndshp den hug me back.. If u get 3 u r cute:) 6 u r luvd:* 9 u r so lucky;) None? People hate u:
Actual: ham, Predicted: ham, SMS: And also I've sorta blown him off a couple times recently so id rather not text him out of the blue looking for weed
Actual: ham, Predicted: ham, SMS: Mmm thats better now i got a roast down me! id b better if i had a few drinks down me 2! Good indian?
Actual: ham, Predicted: ham, SMS: Mm have some kanji dont eat anything heavy ok
Actual: ham, Predicted: ham, SMS: So there's a ring that comes with th