
# Naive Bayes Algorithm

1. **Data Split**
   - Split the dataset into training and testing sets (for example, 60/40 or 80/20).

2. **Training**
   - Create a dictionary of unique words in the training data.
   - Calculate word probabilities for each class (e.g., spam and not spam) and store them.
   - Estimate prior probabilities for each class.

3. **Testing**
   - For each test instance, calculate class probabilities and classify based on the highest probability.
   - Compare predictions with actual labels to assess performance.

4. **Evaluation Metrics**
   - Calculate **Accuracy**, **Recall**, **Precision**, and **F1 Score** to evaluate the model.
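
Steps 2 and 3 amount to the decision rule below, written as a sketch. The symbols $c$, $w_i$, and $\operatorname{count}$ are notation introduced here for illustration, and the likelihood shown is the plain (unsmoothed) relative frequency:

$$
\hat{c} = \arg\max_{c \in \{\text{ham},\, \text{spam}\}} \; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
P(w \mid c) = \frac{\operatorname{count}(w, c)}{\sum_{w'} \operatorname{count}(w', c)}
$$

Here $w_1, \dots, w_n$ are the tokens of one message, $P(c)$ is the prior of class $c$, and $\operatorname{count}(w, c)$ is how often word $w$ appears in training messages of class $c$.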

This notebook uses the SMS Spam Collection data from [GitHub](https://github.com/atmabodha/selfshiksha/blob/main/Supervised%20Learning%20Basics/SelfShiksha_SLB_MCQ_31_NaiveBayes_EmailClassification/SMSSpamCollection).


1. Write code to implement the Naive Bayes algorithm on your own.

2. Explore the Naive Bayes module in sklearn and compare its results with those of your own code.

3. Write code that takes a list of actual output values (0s and 1s) and a list of predicted values (also 0s and 1s) and computes the following statistics: accuracy, TP, TN, FP, FN, recall, precision, and F1 score. Compare your results with the equivalent functions in sklearn.

In [1]:
import pandas as pd
df = pd.read_csv('./SMSSpamCollection.txt', sep='\t', header=None, names=['Label', 'SMS'])
df

| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |
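
Step 1 asks for a train/test split. A split by row position keeps the file order, so a shuffled split is one alternative worth trying. The sketch below is a minimal version under stated assumptions: `split_rows`, its parameters, and the placeholder `rows` list are hypothetical names made up for illustration, not part of the dataset or of any library.

```python
import random

def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train_rows, test_rows)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                    # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed makes the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder (label, text) pairs standing in for the real messages
rows = [("ham" if i % 4 else "spam", f"message {i}") for i in range(10)]
train_rows, test_rows = split_rows(rows, test_fraction=0.2)
print(len(train_rows), len(test_rows))
```

With a DataFrame, `df.sample(frac=1, random_state=42)` should give a comparable shuffle before slicing.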


In [2]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def tokenization(sent):
    """Return the lowercase word tokens of `sent` with stop words removed."""
    # Drop every character that is not a letter or whitespace
    cleaned_sent = re.sub(r'[^a-zA-Z\s]', '', sent)

    # Split the cleaned sentence into word tokens
    original_words = word_tokenize(cleaned_sent)

    # Lowercase each token and filter out stop words
    return [word.lower() for word in original_words if word.lower() not in STOP_WORDS]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [3]:
# Prior probabilities: ham is treated as the positive class, spam as the negative class
dct = dict(df['Label'].value_counts())

prob_posi = dct['ham']/(dct['ham']+dct['spam'])
print(prob_posi)
prob_neg = 1 - prob_posi
print(prob_neg)

0.8659368269921034
0.1340631730078966


In [4]:
dbDict = {}

In [5]:
# Count how often each word appears in each class
def train_data(df, splitor, dbDict):
  # Assumes df has the default 0..n-1 index, so `index` is the row position
  for index, row in df.iterrows():
    # Rows from `splitor` onward are reserved for testing
    if index >= splitor:
      break

    label = row['Label']
    text = row['SMS']

    if label not in dbDict:
      dbDict[label] = {}

    words = tokenization(text)

    for word in words:
      if word not in dbDict[label]:
        dbDict[label][word] = 0
      dbDict[label][word] += 1

  return dbDict


In [6]:
def computation(query_tokens, prior, dbDict, isPositive):
  """Return prior * product of P(word | class) for the chosen class."""
  # ham is the positive class, spam the negative class
  val = 'ham' if isPositive else 'spam'

  # Total number of word occurrences seen in this class during training
  total_words = sum(dbDict[val].values())

  total_prob = prior
  for word in query_tokens:
    # A word never seen in this class has count 0, which zeroes the whole product
    word_count = dbDict[val].get(word, 0)
    total_prob *= word_count/total_words

  return total_prob
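
A product of raw relative frequencies becomes 0 as soon as one query word is missing from a class, and long messages can underflow. A common remedy is add-one (Laplace) smoothing combined with summing logarithms. The sketch below is one hedged way to do that, not the method used in this notebook: `count_words`, `log_score`, the `alpha` parameter, and the toy token lists are all assumptions made up for illustration.

```python
import math
from collections import Counter

def count_words(tokenized_messages):
    """Return word frequencies across a list of token lists."""
    counts = Counter()
    for tokens in tokenized_messages:
        counts.update(tokens)
    return counts

def log_score(tokens, prior, class_counts, vocab_size, alpha=1.0):
    """Return log P(class) + sum of log P(word | class) with add-alpha smoothing."""
    total = sum(class_counts.values())
    score = math.log(prior)
    for word in tokens:
        # Unseen words get a small non-zero probability instead of 0
        smoothed = (class_counts.get(word, 0) + alpha) / (total + alpha * vocab_size)
        score += math.log(smoothed)
    return score

# Toy token lists, invented for this example (not taken from the dataset)
ham_counts = count_words([["see", "you", "tonight"], ["call", "you", "later"]])
spam_counts = count_words([["free", "prize", "call"], ["win", "free", "entry"]])
vocab_size = len(set(ham_counts) | set(spam_counts))

query = ["free", "prize", "tonight"]
ham_score = log_score(query, 0.5, ham_counts, vocab_size)
spam_score = log_score(query, 0.5, spam_counts, vocab_size)
prediction = "ham" if ham_score > spam_score else "spam"
print(prediction)
```

Because the logarithm is monotonic, comparing log scores picks the same class as comparing the smoothed products would.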


In [12]:
# Naive Bayes algorithm for SMS spam classification
def naive_bayes(df, splitor):
  # `splitor` comes in as the test fraction; convert it to the index of the first test row
  splitor = len(df) - int(splitor*len(df))

  # Dictionary of per-class word counts built from the training rows
  dbDict = {}
  dbDict = train_data(df, splitor, dbDict)

  # Estimate the class priors from the training rows only
  dct = dict(df['Label'].iloc[:splitor].value_counts())
  prob_positive = dct['ham']/(dct['ham']+dct['spam'])
  prob_negative = dct['spam']/(dct['ham']+dct['spam'])

  # Confusion-matrix counts (TP, TN, FP, FN), with ham as the positive class
  tp = 0
  tn = 0
  fp = 0
  fn = 0

  # Iterate through the test set, classify each message, and compare with its label
  for i in range(splitor, len(df)):
    query = df['SMS'][i]
    query_word = tokenization(query)

    # Compute the score of each class
    positive_computation = computation(query_word, prob_positive, dbDict, True)
    negative_computation = computation(query_word, prob_negative, dbDict, False)

    # Classify by the larger score (ties go to spam) and print: actual, predicted, message
    if positive_computation > negative_computation:
      print(df['Label'][i], ' ham ', df['SMS'][i])
      if df['Label'][i] == 'ham':
        tp += 1
      else:
        fp += 1
    else:
      print(df['Label'][i], ' spam ', df['SMS'][i])
      if df['Label'][i] == 'spam':
        tn += 1
      else:
        fn += 1

  # Guard each ratio against an empty denominator
  recall = tp/(tp+fn) if (tp+fn) else 0.0
  precision = tp/(tp+fp) if (tp+fp) else 0.0
  accuracy = (tp+tn)/(tp+tn+fp+fn)
  f1 = 2*(recall*precision)/(recall+precision) if (recall+precision) else 0.0

  print("\n\n")
  print({"TP":tp,"TN":tn,"FP":fp,"FN":fn})
  print(f"Recall: {recall}\nPrecision: {precision}\nAccuracy: {accuracy}\nF1: {f1}")


In [13]:
naive_bayes(df,0.2)

ham  ham  Aight should I just plan to come up later tonight?
ham  spam  Die... I accidentally deleted e msg i suppose 2 put in e sim archive. Haiz... I so sad...
spam  spam  Welcome to UK-mobile-date this msg is FREE giving you free calling to 08719839835. Future mgs billed at 150p daily. To cancel send "go stop" to 89123
ham  spam  This is wishing you a great day. Moji told me about your offer and as always i was speechless. You offer so easily to go to great lengths on my behalf and its stunning. My exam is next friday. After that i will keep in touch more. Sorry.
ham  spam  Thanks again for your reply today. When is ur visa coming in. And r u still buying the gucci and bags. My sister things are not easy, uncle john also has his own bills so i really need to think about how to make my own money. Later sha.
ham  spam  Sorry I flaked last night, shit's seriously goin down with my roommate, what you up to tonight?
ham  ham  He said i look pretty wif long hair wat. But i thk he's cuttin

# Naive Bayes with sklearn

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Load and preprocess data
def load_data(df):
    x = df['SMS']
    y = df['Label'].map({'ham': 0, 'spam': 1})
    return x, y

# Train and evaluate Naive Bayes classifier
def naive_bayes(df, test_size=0.2):
    x, y = load_data(df)

    # Split the data into train and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_size, random_state=42)

    # Convert text to feature vectors
    vectorizer = CountVectorizer()

    x_train_vectorized = vectorizer.fit_transform(x_train)
    x_test_vectorized = vectorizer.transform(x_test)

    # Initialize and train the Naive Bayes classifier
    nb_classifier = MultinomialNB()
    nb_classifier.fit(x_train_vectorized, y_train)

    # Make predictions on the test set
    y_pred = nb_classifier.predict(x_test_vectorized)
    print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("\n")

    # Show some predictions
    for text, true_label, predicted_label in zip(x_test[:5], y_test[:5], y_pred[:5]):
        label_str = 'ham' if predicted_label == 0 else 'spam'
        print(f"{'ham' if true_label == 0 else 'spam'} {label_str} {text}")

naive_bayes(df,0.4)

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1938
        spam       0.97      0.94      0.95       291

    accuracy                           0.99      2229
   macro avg       0.98      0.97      0.97      2229
weighted avg       0.99      0.99      0.99      2229

Accuracy: 0.9883355764917003


ham ham Squeeeeeze!! This is christmas hug.. If u lik my frndshp den hug me back.. If u get 3 u r cute:) 6 u r luvd:* 9 u r so lucky;) None? People hate u:
ham ham And also I've sorta blown him off a couple times recently so id rather not text him out of the blue looking for weed
ham ham Mmm thats better now i got a roast down me! id b better if i had a few drinks down me 2! Good indian?
ham ham Mm have some kanji dont eat anything heavy ok
ham ham So there's a ring that comes with the guys costumes. It's there so they can gift their future yowifes. Hint hint


- Write code that takes a list of actual output values (0s and 1s) and a list of predicted values (also 0s and 1s) and computes the following statistics: accuracy, TP, TN, FP, FN, recall, precision, and F1 score. Compare your results with the equivalent functions in sklearn.
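
One possible shape for such a function is sketched below. It is a minimal version under stated assumptions: `binary_metrics` is a hypothetical name, 1 is taken as the positive class, any ratio with an empty denominator is reported as 0.0, and the two example lists are made up for illustration.

```python
def binary_metrics(actual, predicted, positive=1):
    """Return confusion counts and derived metrics for two equal-length label lists."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn,
            "accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

# Made-up labels for illustration only
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(actual, predicted)
print(metrics)
```

For the comparison, `confusion_matrix`, `accuracy_score`, `recall_score`, `precision_score`, and `f1_score` from `sklearn.metrics` accept the same two lists.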