## Naive Bayes Model

First, we extract the sentences and their labels, then apply successive normalization steps. This yields three tokenized versions of the corpus, each building on the previous one:

- **norm_data**: lower-cased, numbers converted to their word equivalents, punctuation removed, and tokenized
- **filtered**: **norm_data** with common stop words removed
- **lem_data**: **filtered** with lemmatization applied

In [34]:
from utils import extract_data, normalize_corpus, remove_stopwords, lemmatize

raw_data, labels = extract_data('data/Sentences_50Agree.txt')
norm_data = normalize_corpus(raw_data)
filtered = remove_stopwords(norm_data)
lem_data = lemmatize(filtered)
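
The helpers in **utils** are not shown here, so the following is only a rough, standard-library sketch of what the first two stages might look like for a single sentence. The function names, the tiny stop-word set, and the example sentence are all assumptions made for illustration; number-to-word conversion and lemmatization are left out because they need external libraries.

```python
import string

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "was", "by"}

def normalize_sentence(sentence):
    """Lower-case the text, strip ASCII punctuation, and split on whitespace."""
    lowered = sentence.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return no_punct.split()

def drop_stopwords(tokens):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]

# Made-up example sentence, not taken from the dataset.
example = "The profit rose, and sales in Europe grew."
tokens = normalize_sentence(example)   # analogous to one entry of norm_data
kept = drop_stopwords(tokens)          # analogous to one entry of filtered
print(tokens)
print(kept)
```

Each stage takes the previous stage's output as input, which mirrors how **filtered** is derived from **norm_data** above.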

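Next, we define **train_naive_bayes**, which splits a feature matrix into training and test sets, fits a multinomial Naive Bayes classifier, and returns its accuracy on the test set. Wrapping these steps in a function lets us run multiple experiments with different data representations.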
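
For reference, a multinomial Naive Bayes classifier picks the class with the highest log-posterior score. The notation here is introduced for illustration: $x_i$ is the weight of vocabulary term $i$ in the sentence, $N_{ci}$ is the total weight of term $i$ over the training sentences of class $c$, $N_c = \sum_i N_{ci}$, $|V|$ is the vocabulary size, and $\alpha$ is the smoothing parameter (scikit-learn's `MultinomialNB` defaults to $\alpha = 1$).

$$
\hat{y} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{|V|} x_i \log \hat{\theta}_{ci} \right],
\qquad
\hat{\theta}_{ci} = \frac{N_{ci} + \alpha}{N_c + \alpha\,|V|}
$$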

In [40]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

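def train_naive_bayes(data):
    """Fit MultinomialNB on an 80/20 split of `data` and return the test accuracy.

    `data` is a feature matrix with one row per sentence; `labels` is the
    global list returned by extract_data in the earlier cell.
    """
    # A fixed random_state keeps the split identical across experiments
    X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=48)

    nb_model = MultinomialNB()
    nb_model.fit(X_train, y_train)
    y_pred = nb_model.predict(X_test)
    return accuracy_score(y_test, y_pred)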

In [41]:
# Convert tokenized sentences into strings
sentence_strings = [' '.join(sentence) for sentence in lem_data]

# Create an instance of TfidfVectorizer or CountVectorizer
vectorizer = TfidfVectorizer()  # or CountVectorizer()

# Fit and transform the sentences
X = vectorizer.fit_transform(sentence_strings)

acc = train_naive_bayes(X)
print(f"Naive Bayes accuracy: {acc:.4f}")
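
Note that the cell above fits the vectorizer on the whole corpus before the train/test split, so for TF-IDF the document frequencies also reflect the test sentences. One possible alternative, sketched below, is to split the raw strings first and wrap the vectorizer and classifier in a scikit-learn pipeline so both are fitted on the training split only. The **evaluate** helper and the toy corpus are assumptions made up for this illustration, not part of the notebook's data or utilities.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def evaluate(sentences, targets, vectorizer):
    """Split the raw strings first, then fit vectorizer + classifier on the training split only."""
    X_train, X_test, y_train, y_test = train_test_split(
        sentences, targets, test_size=0.2, random_state=48
    )
    model = make_pipeline(vectorizer, MultinomialNB())
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

# Made-up toy corpus for illustration only (not the notebook's dataset).
toy_sentences = [
    "profit rose sharply this quarter",
    "sales grew in europe",
    "revenue increased strongly",
    "earnings improved again",
    "profit fell sharply this quarter",
    "sales dropped in europe",
    "revenue decreased strongly",
    "earnings declined again",
    "the company held a meeting",
    "the board met on monday",
]
toy_labels = [
    "positive", "positive", "positive", "positive",
    "negative", "negative", "negative", "negative",
    "neutral", "neutral",
]

# Run the same experiment with both vectorizers.
results = {
    name: evaluate(toy_sentences, toy_labels, vec)
    for name, vec in [("tfidf", TfidfVectorizer()), ("count", CountVectorizer())]
}
for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

The same helper could be called with strings built from **norm_data**, **filtered**, or **lem_data** to compare representations under an identical split.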