# Naive Bayes

A tutorial on the Naive Bayes classifier using scikit-learn. The example uses data collected with Pyktok to classify TikTok videos as ads or non-ads.

The code is based on a tutorial from StackAbuse: https://stackabuse.com/the-naive-bayes-algorithm-in-python-with-scikit-learn/

### 1. Preparing our data for the model

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

In [None]:
df = pd.read_csv('pyktok_ad_data.csv',
                   usecols=['video_id', 'suggested_words', 'video_description', 'video_is_ad'])

#### Preprocessing the data

In [None]:
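df['video_is_ad'] = df.video_is_ad.map({False: 0, True: 1})

# Use suggested_words where available; fall back to video_description where it is missing
df['description'] = df['suggested_words'].combine_first(df['video_description'])

# Rows where both columns are missing become empty strings so the string methods below don't fail
df['description'] = df['description'].fillna('')

# Lowercase and remove punctuation (regex=True is required in pandas 2.0+,
# where str.replace treats the pattern as a literal string by default)
df['description'] = df['description'].str.lower()
df['description'] = df['description'].str.replace(r'[^\w\s]', '', regex=True)

df.head()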
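
To see what these cleaning steps do, here is a minimal sketch on a two-row toy DataFrame. The rows are invented for illustration; only the column names come from the dataset above.

```python
import pandas as pd

# Hypothetical rows: the second one has no suggested_words
toy = pd.DataFrame({
    "suggested_words": ["Summer SALE!!!", None],
    "video_description": ["ignored here", "My cat's #funny dance!"],
})

# combine_first keeps suggested_words and only falls back to
# video_description where suggested_words is missing
toy["description"] = toy["suggested_words"].combine_first(toy["video_description"])

# Lowercase, then strip everything that is not a word character or whitespace
toy["description"] = (
    toy["description"].str.lower().str.replace(r"[^\w\s]", "", regex=True)
)

print(toy["description"].tolist())
```

Note that `combine_first` does not concatenate the two columns: the first row's `video_description` is dropped entirely because `suggested_words` is present.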

In [None]:
df.shape

#### Tokenize the descriptions into separate words using nltk

You will need to install the nltk library, if you don't have it:

In [None]:
!pip install nltk

***NOTE:***
The code below opens a dialog window asking you to download some packages. In that window, switch to the "Models" tab, select "punkt" in the "Identifier" column, and click "Download" to install the files needed for tokenization.

In [None]:
import nltk
nltk.download()

In [None]:
df['description'] = df['description'].apply(nltk.word_tokenize)

#### Perform word stemming

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
 
df['description'] = df['description'].apply(lambda x: [stemmer.stem(y) for y in x])
df.head()

#### Use CountVectorizer to transform data into occurrences

In [None]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer

# This converts the list of words into space-separated strings
df['description'] = df['description'].apply(lambda x: ' '.join(x))

count_vect = CountVectorizer()
counts = count_vect.fit_transform(df['description'])

#### Use TF-IDF as model features instead of word counts

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer().fit(counts)

counts = transformer.transform(counts)
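
As a rough illustration of what TF-IDF changes, the sketch below runs the same two steps on three made-up descriptions. The toy sentences and the `toy_` variable names are assumptions for this example and are not part of the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy descriptions; "now" appears in every document
toy_docs = ["buy now sale", "funny dance now", "now watch dance"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)

vocab = toy_vect.vocabulary_  # maps each word to its column index
first_doc = toy_tfidf.toarray()[0]

# Each word occurs once in the first document, but the rarer word
# ("buy") ends up with a larger weight than the common one ("now")
print("buy:", first_doc[vocab["buy"]])
print("now:", first_doc[vocab["now"]])

# With the default settings, each row is scaled to unit (L2) length
print(np.linalg.norm(toy_tfidf.toarray(), axis=1))
```

Words that appear in most documents carry little information about the class, which is why down-weighting them can help a classifier.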

### 2. Using the Naive Bayes Model

#### Split the data into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split

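# train_test_split shuffles the rows itself. The labels must come from df, in the same
# row order as counts, so that each feature row stays paired with its own label.
X_train, X_test, y_train, y_test = train_test_split(counts, df['video_is_ad'],
                                                    test_size=0.2, random_state=1)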
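
`train_test_split` shuffles by default and keeps features and labels paired, provided both are passed in the same row order. The sketch below checks this on invented toy data; the `stratify` argument is an optional extra, shown here as one way to keep the class balance similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one feature per row, and the label is the feature's
# parity, so a row's label can always be checked against the row itself
toy_X = np.arange(10).reshape(-1, 1)
toy_y = toy_X.ravel() % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=1, stratify=toy_y
)

print(X_tr.shape, X_te.shape)

# Every row still carries its own label after the shuffle
print(np.array_equal(X_tr.ravel() % 2, y_tr))
print(np.array_equal(X_te.ravel() % 2, y_te))
```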

#### Fit the data to a Naive Bayes classifier

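We use the Multinomial Naive Bayes classifier here, which models word counts (or TF-IDF weights) and is a common choice for text classification. scikit-learn also provides other variants, such as Gaussian and Bernoulli Naive Bayes, for other kinds of features.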
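
To make the idea concrete, here is a minimal NumPy sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is a simplified illustration, not scikit-learn's implementation; the toy matrix, the word names in the comments, and the two helper functions are all made up for this example.

```python
import numpy as np

# Hypothetical document-term counts; columns stand for the words
# ["sale", "discount", "dance", "funny"]
toy_X = np.array([
    [3, 2, 0, 0],  # ad
    [2, 1, 0, 1],  # ad
    [0, 0, 3, 2],  # non-ad
    [0, 1, 2, 3],  # non-ad
])
toy_y = np.array([1, 1, 0, 0])

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed per-class log word probabilities."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # Total count of each word per class, plus alpha so no probability is zero
    word_counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
    log_likelihood = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_likelihood

def predict_multinomial_nb(X, classes, log_prior, log_likelihood):
    """Pick the class with the highest log prior plus summed log word probabilities."""
    scores = X @ log_likelihood.T + log_prior
    return classes[np.argmax(scores, axis=1)]

params = fit_multinomial_nb(toy_X, toy_y)
new_docs = np.array([[1, 1, 0, 0],   # "sale discount"
                     [0, 0, 1, 1]])  # "dance funny"
toy_predictions = predict_multinomial_nb(new_docs, *params)
print(toy_predictions)
```

The "naive" part is the assumption that words are independent given the class, which is what lets the score be a simple sum of per-word log probabilities.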

In [None]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB().fit(X_train, y_train)

#### Testing the model 

In [None]:
import numpy as np

predicted = model.predict(X_test)

print(np.mean(predicted == y_test))

Our model's accuracy varies between 60% and 75%, which isn't great. Let's check the number of features and the sparsity of the document-term matrix.

In [None]:
import numpy as np

features = len(count_vect.get_feature_names_out())
print("Number of features:", features)

# Sparsity is the percentage of zero-valued elements in the matrix
sparsity = (1 - np.count_nonzero(X_train.toarray()) / np.prod(X_train.shape)) * 100
print("Sparsity (%):", sparsity)
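
For large vocabularies, converting the matrix to a dense array can use a lot of memory. As an alternative sketch, the `nnz` attribute of a SciPy sparse matrix gives the number of stored values directly (this normally equals the non-zero count, although explicitly stored zeros would also be counted). The toy matrix below is invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 2 x 4 document-term matrix with two non-zero entries
toy_matrix = csr_matrix(np.array([[1, 0, 0, 0],
                                  [0, 0, 2, 0]]))

total_elements = toy_matrix.shape[0] * toy_matrix.shape[1]

# Percentage of elements that are zero, computed without densifying
toy_sparsity = (1 - toy_matrix.nnz / total_elements) * 100
print("Sparsity (%):", toy_sparsity)
```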


We can use a confusion matrix to get a better idea of our model's performance:

### 3. Confusion Matrix Heatmap

In [None]:
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_test, predicted)

In [None]:
import seaborn as sns

# Plot confusion matrix (fmt='d' shows the counts as whole numbers)
ax = sns.heatmap(conf_matrix, annot=True, fmt='d')
 
# set x-axis label and ticks. 
ax.set_xlabel("Predicted label", fontsize=14, labelpad=20)
ax.xaxis.set_ticklabels(['Non-Ad', 'Ad'])
 
# set y-axis label and ticks
ax.set_ylabel("True label", fontsize=14, labelpad=20)
ax.yaxis.set_ticklabels(['Non-Ad', 'Ad'])
 
# set plot title
ax.set_title("Confusion Matrix for TikTok Ad Detection Model", fontsize=14, pad=20)

In [None]:
# Let's print out the values for each cell in the confusion matrix:
true_neg, false_pos, false_neg, true_pos = conf_matrix.ravel()
 
true_neg, false_pos, false_neg, true_pos
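
The order in which `ravel()` returns the cells follows from the matrix layout: rows are true labels, columns are predicted labels, and classes are sorted (0 before 1). A small sketch with invented labels shows this:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (0 = non-ad, 1 = ad)
toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]

toy_matrix = confusion_matrix(toy_true, toy_pred)
print(toy_matrix)

# Row 0 is the true non-ads, row 1 the true ads, so flattening
# row by row yields TN, FP, FN, TP
toy_tn, toy_fp, toy_fn, toy_tp = toy_matrix.ravel()
print(toy_tn, toy_fp, toy_fn, toy_tp)
```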

#### Calculate the F1 score

In [None]:
from sklearn.metrics import f1_score

f1_score(y_test.values, predicted, average='weighted')
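
To show how the F1 score relates to the confusion matrix, the sketch below computes it by hand from four cell counts. The counts are invented for illustration and are not results from this model; `f1_from_counts` is a hypothetical helper written for this example.

```python
# Hypothetical confusion-matrix counts, invented for illustration
toy_tn, toy_fp, toy_fn, toy_tp = 40, 10, 20, 30

def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# F1 for the ad class
f1_ad = f1_from_counts(toy_tp, toy_fp, toy_fn)

# For the non-ad class the roles swap: true negatives act as the "hits",
# false negatives as false alarms, and false positives as misses
f1_non_ad = f1_from_counts(toy_tn, toy_fn, toy_fp)

# average='weighted' averages the per-class scores, weighting each class
# by its number of true examples (its support)
support_ad = toy_tp + toy_fn
support_non_ad = toy_tn + toy_fp
weighted_f1 = (f1_ad * support_ad + f1_non_ad * support_non_ad) / (
    support_ad + support_non_ad
)

print(round(f1_ad, 3), round(f1_non_ad, 3), round(weighted_f1, 3))
```

Unlike accuracy, F1 ignores neither false positives nor false negatives for a class, so it is more informative when ads and non-ads are unevenly represented.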