# Data analytics question

To create a Naive Bayes classifier that predicts whether an email is spam or not.

## Metric of success

Creating an accurate email classifier using Naive Bayes.

We will split the data into 80-20, 70-30 and 60-40 train and test datasets and select the most accurate model.

## Understanding the context

The collection of spam e-mails came from the postmaster and individuals who had filed spam. The collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.

## Data relevance

- 48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in the e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.
- 6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in the e-mail.
- 1 continuous real [1,...] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters.
- 1 continuous integer [1,...] attribute of type capital_run_length_longest = length of the longest uninterrupted sequence of capital letters.
- 1 continuous integer [1,...] attribute of type capital_run_length_total = sum of the lengths of uninterrupted sequences of capital letters = total number of capital letters in the e-mail.
- 1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
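The attribute definitions above can be sketched in a few lines of Python. The function names and the sample text are hypothetical, and matching words case-insensitively is an assumption that the description does not confirm.

```python
import re

def word_freq(text: str, word: str) -> float:
    """100 * (occurrences of `word`) / (total words); case-insensitive by assumption."""
    # A "word" is a run of alphanumeric characters bounded by non-alphanumerics.
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    matches = sum(1 for w in words if w.lower() == word.lower())
    return 100 * matches / len(words)

def char_freq(text: str, char: str) -> float:
    """100 * (occurrences of `char`) / (total characters)."""
    if not text:
        return 0.0
    return 100 * text.count(char) / len(text)

def capital_runs(text: str) -> list:
    """Lengths of all uninterrupted sequences of capital letters."""
    return [len(run) for run in re.findall(r"[A-Z]+", text)]

email = "FREE money!!! Click now, free offer"   # hypothetical sample text
runs = capital_runs(email)

print("word_freq_free:", word_freq(email, "free"))
print("char_freq_!:", char_freq(email, "!"))
print("capital_run_length_average:", sum(runs) / len(runs))
print("capital_run_length_longest:", max(runs))
print("capital_run_length_total:", sum(runs))
```

Each feature is a ratio or a count, so an e-mail's length does not change the scale of the word and character frequencies, while the capital-run features can grow without bound.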


# Recording the Experimental Design

To conduct the analysis successfully, the following steps will be followed:

Loading the dataset

Data understanding

Data cleaning and manipulation

Exploratory Data analysis

Predictive analysis

Implementing the solution

## Data understanding

In [1]:
### import libraries
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [19]:
### get the data
data=pd.read_csv('/content/spambase (1).data')

In [20]:
### view our data
data.head()

Unnamed: 0,0,0.64,0.64.1,0.1,0.32,0.2,0.3,0.4,0.5,0.6,0.7,0.64.2,0.8,0.9,0.10,0.32.1,0.11,1.29,1.93,0.12,0.96,0.13,0.14,0.15,0.16,0.17,0.18,0.19,0.20,0.21,0.22,0.23,0.24,0.25,0.26,0.27,0.28,0.29,0.30,0.31,0.32.2,0.33,0.34,0.35,0.36,0.37,0.38,0.39,0.40,0.41,0.42,0.778,0.43,0.44,3.756,61,278,1
0,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,0.21,0.79,0.65,0.21,0.14,0.14,0.07,0.28,3.47,0.0,1.59,0.0,0.43,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,0.38,0.45,0.12,0.0,1.75,0.06,0.06,1.03,1.36,0.32,0.51,0.0,1.16,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.12,0.0,0.06,0.06,0.0,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
2,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.0,0.0,0.31,0.0,0.0,3.18,0.0,0.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.0,0.0,0.31,0.0,0.0,3.18,0.0,0.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54,1


In [4]:
##view the data
data.tail()

Unnamed: 0,0,0.64,0.64.1,0.1,0.32,0.2,0.3,0.4,0.5,0.6,0.7,0.64.2,0.8,0.9,0.10,0.32.1,0.11,1.29,1.93,0.12,0.96,0.13,0.14,0.15,0.16,0.17,0.18,0.19,0.20,0.21,0.22,0.23,0.24,0.25,0.26,0.27,0.28,0.29,0.30,0.31,0.32.2,0.33,0.34,0.35,0.36,0.37,0.38,0.39,0.40,0.41,0.42,0.778,0.43,0.44,3.756,61,278,1
4595,0.31,0.0,0.62,0.0,0.0,0.31,0.0,0.0,0.0,0.0,0.0,1.88,0.0,0.0,0.0,0.0,0.0,0.0,0.62,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.31,0.31,0.31,0.0,0.0,0.0,0.232,0.0,0.0,0.0,0.0,1.142,3,88,0
4596,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.353,0.0,0.0,1.555,4,14,0
4597,0.3,0.0,0.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.8,0.3,0.0,0.0,0.0,0.0,0.9,1.5,0.0,0.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.2,0.0,0.0,0.102,0.718,0.0,0.0,0.0,0.0,1.404,6,118,0
4598,0.96,0.0,0.0,0.0,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.32,0.0,0.0,0.0,0.0,0.0,0.0,1.93,0.0,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.32,0.0,0.32,0.0,0.0,0.0,0.057,0.0,0.0,0.0,0.0,1.147,5,78,0
4599,0.0,0.0,0.65,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.65,0.0,0.0,0.0,0.0,0.0,4.6,0.0,0.65,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.97,0.65,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,1.25,5,40,0


In [7]:
#### the shape of the data
data.shape

(4600, 58)
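The column labels shown above ('0', '0.64', '0.64.1', ...) look like data values rather than names, which suggests that pandas treated the first row of the file as a header. If the file has no header row (an assumption, not verified here), `header=None` keeps that row as data. The sketch below shows the difference on a small hypothetical in-memory file.

```python
import io
import pandas as pd

# Hypothetical three-row file without a header line.
raw = "0,0.64,1\n0.21,0.28,1\n0.06,0,0\n"

inferred = pd.read_csv(io.StringIO(raw))               # first row becomes the header
explicit = pd.read_csv(io.StringIO(raw), header=None)  # first row stays as data

print(inferred.shape, list(inferred.columns))
print(explicit.shape, list(explicit.columns))
```

The rest of this notebook refers to the inferred labels (for example '1', '0.25' and '0.31'), so loading with `header=None` would require updating those references.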

In [10]:
### columns
data.columns

Index(['0', '0.64', '0.64.1', '0.1', '0.32', '0.2', '0.3', '0.4', '0.5', '0.6',
       '0.7', '0.64.2', '0.8', '0.9', '0.10', '0.32.1', '0.11', '1.29', '1.93',
       '0.12', '0.96', '0.13', '0.14', '0.15', '0.16', '0.17', '0.18', '0.19',
       '0.20', '0.21', '0.22', '0.23', '0.24', '0.25', '0.26', '0.27', '0.28',
       '0.29', '0.30', '0.31', '0.32.2', '0.33', '0.34', '0.35', '0.36',
       '0.37', '0.38', '0.39', '0.40', '0.41', '0.42', '0.778', '0.43', '0.44',
       '3.756', '61', '278', '1'],
      dtype='object')

In [11]:
#### the datainfo
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 58 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       4600 non-null   float64
 1   0.64    4600 non-null   float64
 2   0.64.1  4600 non-null   float64
 3   0.1     4600 non-null   float64
 4   0.32    4600 non-null   float64
 5   0.2     4600 non-null   float64
 6   0.3     4600 non-null   float64
 7   0.4     4600 non-null   float64
 8   0.5     4600 non-null   float64
 9   0.6     4600 non-null   float64
 10  0.7     4600 non-null   float64
 11  0.64.2  4600 non-null   float64
 12  0.8     4600 non-null   float64
 13  0.9     4600 non-null   float64
 14  0.10    4600 non-null   float64
 15  0.32.1  4600 non-null   float64
 16  0.11    4600 non-null   float64
 17  1.29    4600 non-null   float64
 18  1.93    4600 non-null   float64
 19  0.12    4600 non-null   float64
 20  0.96    4600 non-null   float64
 21  0.13    4600 non-null   float64
 22  

# Data Cleaning and manipulation

In [23]:
#### missing values
data.isnull().sum()

0         0
0.64      0
0.64.1    0
0.1       0
0.32      0
0.2       0
0.3       0
0.4       0
0.5       0
0.6       0
0.7       0
0.64.2    0
0.8       0
0.9       0
0.10      0
0.32.1    0
0.11      0
1.29      0
1.93      0
0.12      0
0.96      0
0.13      0
0.14      0
0.15      0
0.16      0
0.17      0
0.18      0
0.19      0
0.20      0
0.21      0
0.22      0
0.23      0
0.24      0
0.25      0
0.26      0
0.27      0
0.28      0
0.29      0
0.30      0
0.31      0
0.32.2    0
0.33      0
0.34      0
0.35      0
0.36      0
0.37      0
0.38      0
0.39      0
0.40      0
0.41      0
0.42      0
0.778     0
0.43      0
0.44      0
3.756     0
61        0
278       0
1         0
dtype: int64

In [24]:
#### duplicates
data.duplicated().any()

True

In [25]:
#### dealing with duplicates
# note: the result is not assigned back, so `data` itself is left unchanged
data.drop_duplicates()

Unnamed: 0,0,0.64,0.64.1,0.1,0.32,0.2,0.3,0.4,0.5,0.6,0.7,0.64.2,0.8,0.9,0.10,0.32.1,0.11,1.29,1.93,0.12,0.96,0.13,0.14,0.15,0.16,0.17,0.18,0.19,0.20,0.21,0.22,0.23,0.24,0.25,0.26,0.27,0.28,0.29,0.30,0.31,0.32.2,0.33,0.34,0.35,0.36,0.37,0.38,0.39,0.40,0.41,0.42,0.778,0.43,0.44,3.756,61,278,1
0,0.21,0.28,0.50,0.0,0.14,0.28,0.21,0.07,0.00,0.94,0.21,0.79,0.65,0.21,0.14,0.14,0.07,0.28,3.47,0.00,1.59,0.0,0.43,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.000,0.132,0.0,0.372,0.180,0.048,5.114,101,1028,1
1,0.06,0.00,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,0.38,0.45,0.12,0.00,1.75,0.06,0.06,1.03,1.36,0.32,0.51,0.0,1.16,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.06,0.0,0.0,0.12,0.00,0.06,0.06,0.0,0.0,0.010,0.143,0.0,0.276,0.184,0.010,9.821,485,2259,1
2,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.00,0.00,0.31,0.00,0.00,3.18,0.00,0.31,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.000,0.137,0.0,0.137,0.000,0.000,3.537,40,191,1
3,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.00,0.00,0.31,0.00,0.00,3.18,0.00,0.31,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.000,0.135,0.0,0.135,0.000,0.000,3.537,40,191,1
4,0.00,0.00,0.00,0.0,1.85,0.00,0.00,1.85,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.000,0.223,0.0,0.000,0.000,0.000,3.000,15,54,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4595,0.31,0.00,0.62,0.0,0.00,0.31,0.00,0.00,0.00,0.00,0.00,1.88,0.00,0.00,0.00,0.00,0.00,0.00,0.62,0.00,0.00,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.31,0.31,0.31,0.0,0.0,0.000,0.232,0.0,0.000,0.000,0.000,1.142,3,88,0
4596,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,6.00,0.00,2.00,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.00,2.00,0.0,0.0,0.000,0.000,0.0,0.353,0.000,0.000,1.555,4,14,0
4597,0.30,0.00,0.30,0.0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.80,0.30,0.00,0.00,0.00,0.00,0.90,1.50,0.00,0.30,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.00,1.20,0.0,0.0,0.102,0.718,0.0,0.000,0.000,0.000,1.404,6,118,0
4598,0.96,0.00,0.00,0.0,0.32,0.00,0.00,0.00,0.00,0.00,0.00,0.32,0.00,0.00,0.00,0.00,0.00,0.00,1.93,0.00,0.32,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.00,0.32,0.00,0.32,0.0,0.0,0.000,0.057,0.0,0.000,0.000,0.000,1.147,5,78,0
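`drop_duplicates()` returns a new DataFrame rather than modifying the original. A minimal sketch of keeping the de-duplicated rows, using a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame with one repeated row.
toy = pd.DataFrame({"a": [1, 1, 2], "b": [0.5, 0.5, 0.7]})

toy.drop_duplicates()   # result discarded: toy still has 3 rows
print(len(toy))

# Assign the result (and rebuild the index) to actually keep the change.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(deduped))
```

Applying this to `data` would change the row count and therefore the figures reported in the later cells.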


In [26]:
#### detecting outliers
q1=data.quantile(0.25)
q3=data.quantile(0.75)
IQR=q3-q1
print(IQR)

0           0.00000
0.64        0.00000
0.64.1      0.42000
0.1         0.00000
0.32        0.38250
0.2         0.00000
0.3         0.00000
0.4         0.00000
0.5         0.00000
0.6         0.16000
0.7         0.00000
0.64.2      0.80000
0.8         0.00000
0.9         0.00000
0.10        0.00000
0.32.1      0.10000
0.11        0.00000
1.29        0.00000
1.93        2.64000
0.12        0.00000
0.96        1.27000
0.13        0.00000
0.14        0.00000
0.15        0.00000
0.16        0.00000
0.17        0.00000
0.18        0.00000
0.19        0.00000
0.20        0.00000
0.21        0.00000
0.22        0.00000
0.23        0.00000
0.24        0.00000
0.25        0.00000
0.26        0.00000
0.27        0.00000
0.28        0.00000
0.29        0.00000
0.30        0.00000
0.31        0.00000
0.32.2      0.00000
0.33        0.00000
0.34        0.00000
0.35        0.00000
0.36        0.11000
0.37        0.00000
0.38        0.00000
0.39        0.00000
0.40        0.00000
0.41        0.18800


In [27]:
####
data.shape

(4600, 58)

Removing the outliers with the IQR technique is not practical, since it would remove 4450 of the 4600 rows; hence we will not remove the outliers.
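The IQR output above is 0 for most columns, which helps explain why so many rows would be lost: with an IQR of 0, any value different from the quartiles counts as an outlier. The sketch below counts flagged rows on a hypothetical toy frame; the 1.5 × IQR fences are an assumption, since the multiplier is not stated in the notebook.

```python
import pandas as pd

# Hypothetical toy data: f1 is mostly zero, like many word-frequency columns.
toy = pd.DataFrame({
    "f1": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.2],
    "f2": [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 9.0, 1.0],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# A value is flagged if it lies outside [q1 - 1.5*IQR, q3 + 1.5*IQR].
outside = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)

# A row is removed if any of its features is flagged.
rows_flagged = outside.any(axis=1)
print(int(rows_flagged.sum()), "of", len(toy), "rows would be removed")
```

In `f1` the single non-zero value is flagged because the column's IQR is 0. With dozens of sparse columns, almost every row has at least one such value.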

# Feature engineering

In [29]:
#### dividing the data into features and label
x=data.drop('1',axis=1)
y=data['1']

In [36]:
### removing the correlated features
df_x=pd.DataFrame(x)
corr_x=df_x.corr().abs()
corr_x
# Select upper triangle of correlation matrix (np.bool was removed from NumPy; use bool)
upper = corr_x.where(np.triu(np.ones(corr_x.shape), k=1).astype(bool))
# Find feature columns with correlation greater than 0.8
drop = [column for column in upper.columns if any(upper[column] > 0.8)]
drop

['0.25', '0.31']

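The above columns will be dropped because each is highly correlated (absolute correlation above 0.8) with another feature. Naive Bayes assumes that the features are conditionally independent given the class, so keeping strongly correlated features could hurt the model's performance on new data.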

In [37]:
### lets drop the above highly correlated features
x=x.drop(['0.25','0.31'],axis=1)

In [39]:
### normalization
from sklearn.preprocessing import Normalizer
norm=Normalizer()
x=norm.fit_transform(x)
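Note that `Normalizer` rescales each row (each e-mail) to unit length; it does not put the columns on a common scale. The sketch below contrasts it with `StandardScaler`, which is named here only for comparison and is not used in this notebook. The toy values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical rows: two small frequency features and one large count feature.
toy = np.array([[0.5, 0.1, 100.0],
                [0.0, 0.3, 2000.0],
                [1.2, 0.0, 50.0]])

row_scaled = Normalizer().fit_transform(toy)      # each row gets L2 norm 1
col_scaled = StandardScaler().fit_transform(toy)  # each column gets mean 0, std 1

print(np.linalg.norm(row_scaled, axis=1))
print(row_scaled[:, 2])          # the large feature dominates every row
print(col_scaled.mean(axis=0))
```

When one feature is much larger than the others, as the capital-run totals are here, row normalization leaves that feature close to 1 and shrinks the rest towards 0.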

In [42]:
### dimension reduction using PCA
from sklearn.decomposition import PCA
pca=PCA(n_components=5)
x=pca.fit_transform(x)


array([[-0.08679704, -0.10948865, -0.01270154,  0.01235082, -0.0011106 ],
       [-0.07190724, -0.00440353, -0.04256889, -0.00403363, -0.0076519 ],
       [-0.07014222, -0.00520911, -0.02717205, -0.00923936, -0.01096777],
       ...,
       [-0.08876255, -0.15049023,  0.00948988,  0.01695207,  0.0015184 ],
       [-0.08658711, -0.13703148,  0.01511021,  0.01012359, -0.00344649],
       [-0.06427696, -0.06891474,  0.08634874, -0.03060328, -0.00339785]])
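In the cell above, PCA is fitted on the full dataset before the train/test split, so the test rows influence the components. One way to avoid this, sketched below on hypothetical random data, is to split first and wrap the same steps (Normalizer, PCA, GaussianNB) in a scikit-learn pipeline so that fitting uses the training rows only. The variable names and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical stand-in data: 200 rows, 10 features, binary label.
rng = np.random.default_rng(24)
X_demo = rng.random((200, 10))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(int)

# Split first, so the test rows play no part in fitting PCA.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=24
)

# The pipeline fits each step on the training data only,
# then applies the fitted transforms to the test data.
model = make_pipeline(Normalizer(), PCA(n_components=5), GaussianNB())
model.fit(X_tr, y_tr)
score = model.score(X_te, y_te)
print("test accuracy:", score)
```

The same pipeline object can be reused for each split ratio, which keeps the preprocessing identical across the three experiments.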

In [43]:
### dividing the data into train and test sets (80-20)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=24)

# Predictive analysis

In [44]:
#### using Naive Bayes, let's create a classification model to predict whether an email is spam or not
from sklearn.naive_bayes import GaussianNB
bayes=GaussianNB()
bayes.fit(x_train,y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [45]:
### predicting
y_pred=bayes.predict(x_test)
df_1=pd.DataFrame({'actual':y_test,'predict':y_pred})
df_1.describe()

Unnamed: 0,actual,predict
count,920.0,920.0
mean,0.402174,0.851087
std,0.490603,0.356196
min,0.0,0.0
25%,0.0,1.0
50%,0.0,1.0
75%,1.0,1.0
max,1.0,1.0


In [46]:
### the accuracy of the model
from sklearn.metrics import accuracy_score,confusion_matrix
print('accuracy score',accuracy_score(y_test,y_pred))
print('confusion matrix',confusion_matrix(y_test,y_pred))

accuracy score 0.5119565217391304
confusion matrix [[119 431]
 [ 18 352]]


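With the 80-20 split, the model has an accuracy of 51.2%.

In the confusion matrix, 119 non-spam emails are correctly classified (true negatives) and 352 spam emails are correctly classified (true positives), while 431 non-spam emails are misclassified as spam (false positives) and 18 spam emails are missed (false negatives).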
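The confusion matrix printed above gives more than accuracy. The sketch below recomputes accuracy from it and adds precision and recall for the spam class, plus the accuracy of always predicting non-spam as a reference point. It uses only the four counts from the 80-20 output; the variable names are illustrative.

```python
import numpy as np

# 80-20 confusion matrix from the output above.
# scikit-learn layout: rows are actual classes, columns are predicted classes.
cm = np.array([[119, 431],
               [18, 352]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
precision = tp / (tp + fp)       # share of predicted spam that is really spam
recall = tp / (tp + fn)          # share of real spam that is caught
baseline = (tn + fp) / cm.sum()  # accuracy of always predicting non-spam

print(f"accuracy  {accuracy:.3f}")
print(f"precision {precision:.3f}")
print(f"recall    {recall:.3f}")
print(f"baseline  {baseline:.3f}")
```

High recall with low precision means the model labels most emails as spam, which matches the mean of about 0.85 in the `predict` column of the summary table.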

## Model under 70-30 train, test split

In [47]:
### splitting the data into 70 30
x_train1,x_test1,y_train1,y_test1=train_test_split(x,y,train_size=0.7,random_state=24)

In [49]:
### Naive Bayes
bayes1=GaussianNB()
bayes1.fit(x_train1,y_train1)

GaussianNB(priors=None, var_smoothing=1e-09)

In [50]:
### prediction
y_pred1=bayes1.predict(x_test1)
df_2=pd.DataFrame({'actual':y_test1,'predict':y_pred1})
df_2.describe()


Unnamed: 0,actual,predict
count,1380.0,1380.0
mean,0.399275,0.857971
std,0.489927,0.349206
min,0.0,0.0
25%,0.0,1.0
50%,0.0,1.0
75%,1.0,1.0
max,1.0,1.0


In [51]:
### accuracy of the model
print('accuracy',accuracy_score(y_test1,y_pred1))

accuracy 0.5079710144927536


In [53]:
print(confusion_matrix(y_test1,y_pred1))

[[173 656]
 [ 23 528]]


With the 70-30 split, the model has an accuracy of 50.8%.

In the confusion matrix, there are 173 true negatives and 528 true positives, with 656 false positives and 23 false negatives.

## Model under 60-40 train, test split

In [57]:
### splitting data
x_train2,x_test2,y_train2,y_test2=train_test_split(x,y,train_size=0.6,random_state=24)


In [58]:
## Naive Bayes
bayes2=GaussianNB()
bayes2.fit(x_train2,y_train2)

GaussianNB(priors=None, var_smoothing=1e-09)

In [60]:
### predict
y_pred2=bayes2.predict(x_test2)
df_3=pd.DataFrame({'actual':y_test2,'predict':y_pred2})
df_3.describe()

Unnamed: 0,actual,predict
count,1840.0,1840.0
mean,0.394565,0.852717
std,0.48889,0.354484
min,0.0,0.0
25%,0.0,1.0
50%,0.0,1.0
75%,1.0,1.0
max,1.0,1.0


In [61]:
####accuracy
print('accuracy',accuracy_score(y_test2,y_pred2))

accuracy 0.5135869565217391


In [63]:
print(confusion_matrix(y_test2,y_pred2))

[[245 869]
 [ 26 700]]


With the 60-40 split, the model has an accuracy of 51.4%.

In the confusion matrix, there are 245 true negatives and 700 true positives, with 869 false positives and 26 false negatives.

# Implementing the solution

From the above experiments, the 60-40 split has the highest accuracy (51.4%), followed by the 80-20 split (51.2%) and the 70-30 split (50.8%). According to the confusion matrices, the 80-20 split has the fewest correct predictions, 471 to be precise, while the 70-30 split has 701 and the 60-40 split has 945.

The number of correct predictions grows with the size of the test set (920, 1380 and 1840 emails respectively), so it cannot be compared across splits directly; accuracy is the comparable measure.

Hence, by the metric of success, the 60-40 model is selected. The three accuracies differ by less than one percentage point, however, so changing the split ratio alone does little to improve the Naive Bayes model.
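The comparison between the three splits can be checked directly from the confusion matrices printed earlier. The sketch below tabulates correct predictions, test-set size and accuracy for each split; the dictionary and variable names are illustrative.

```python
import numpy as np

# Confusion matrices copied from the outputs of the three experiments.
matrices = {
    "80-20": [[119, 431], [18, 352]],
    "70-30": [[173, 656], [23, 528]],
    "60-40": [[245, 869], [26, 700]],
}

summary = {}
for name, values in matrices.items():
    cm = np.array(values)
    correct = int(np.trace(cm))   # true negatives + true positives
    total = int(cm.sum())         # size of the test set
    summary[name] = (correct, total, correct / total)
    print(f"{name}: {correct} correct of {total}, accuracy {correct / total:.4f}")

# Pick the split with the highest accuracy, not the highest raw count.
best = max(summary, key=lambda k: summary[k][2])
print("highest accuracy:", best)
```

Dividing by the test-set size is what makes the splits comparable, since a larger test set yields more correct predictions even when the model is no better.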

# Follow up question

a) Do we need other data to answer the research question?

The data was sufficient to build the models, but the column names should be provided with the file so that the data is easier to understand.