# Text classification introduction & example

## Table of contents

1. [What is text classification?](#what_is_text_classification)
    1. [Instances in which text classification is used](#text_classification_instances)
    2. [Note on supervised learning](#supervised_learning)


2. [Pros and cons of text classification](#pros_and_cons)


3. [Text classification steps](#steps)


4. [Text classification example - spam text message](#example_start)
    1. [Text pre-processing](#text_pre_processing)
    2. [Feature extraction](#feature_extraction)
    3. [Model training](#model_training)
    4. [Model Evaluation](#model_evaluation)


5. [Some further notes](#further_notes)
    1. [Balanced datasets](#balanced_datasets)
    2. [Exploratory data analysis](#EDA)


6. [Further resources](#resources) 

# 1. What is text classification? <a name="what_is_text_classification"></a>

Text classification is a common Natural Language Processing (NLP) task. 

It involves assigning labels or categories to text documents based on their contents. 

The goal is to automatically analyse and organise text into pre-defined categories or classes. 

## 1.1. Instances in which text classification is used <a name="text_classification_instances"></a>

1. Spam detection: used to identify and filter out spam emails from legitimate ones. 

2. Sentiment analysis: used to analyse text and understand the underlying sentiment, usually positive or negative. 

3. Document classification: search engines use text classification to categorise and index web pages.

There are many more applications; these are just a few examples.

### An MOJ example

In the MOJ, the data science hub has done text classification projects, such as the Invoice Tagging project, which is still extensively used by finance. 

Invoices (>£25k) are given a plain English description before publication. The Plain English Description is essentially a set of categories that each invoice fits into.

Before this project, a human had to read each invoice description and assign a category by hand, which was time consuming. 

By using machine learning, a model was created to understand the relevant invoice information well enough to accurately assign the correct Plain English Description to each invoice, saving days of work for finance.

## 1.2. Note on supervised learning <a name="supervised_learning"></a>

Text classification is a supervised machine learning task. But what does this actually mean? 

In this context, it means that you have some group of texts and you know each one's label/category. 
The dataset usually contains input variables, known as features, as well as the output variable, known as the target. 

Essentially, the dataset that you have already contains both the input characteristics (the features) and the answer to the question you're trying to answer (the target).

For example, let's say that you want to build a model that correctly classifies whether an article is about politics, sports, education etc. In a supervised learning context, the data that you have would contain the text of the article as well as a label that tells you what kind of article it is. 

This then allows us to fit a model to the data, using the answers to help us refine our results iteratively.

The machine learning algorithm will try to learn the relationship between features and target and will usually apply these "learnings" to unlabeled data to try and classify it. 

# 2. Pros and cons of text classification <a name="pros_and_cons"></a>

## 2.1. Pros

1. Scalability: can handle massive datasets, making these models suitable for dealing with large amounts of data. 

2. Automation: automates the process of organising large volumes of unstructured text data into pre-determined categories, saving time and effort. 

3. Consistency and accuracy: a properly trained model applies the same criteria to every document, reducing the risk of human bias and error. 

## 2.2. Cons

1. Ambiguity and noise: models may struggle with ambiguous or noisy data, such as misspellings, abbreviations, slang or sarcasm which can lead to inaccuracies.

2. Domain specificity: models trained on one domain may not generalise well to other domains due to differences in vocabulary, writing styles or context, requiring domain-specific training data and fine-tuning.

3. Imbalance and bias: class imbalance in the training data where certain classes are under-represented can lead to biased models with inaccurate predictions. 

4. Feature engineering complexity and interpretability: extracting useful information from text data may require parameter tuning to improve model performance and interpreting results may be challenging. 

Remember that your model will only be as good as the data it learns from. 

If you're training the model for a specific task and the data changes drastically over time, then the trained model will become less effective and will need regular retraining to stay accurate. 

# 3. Text classification steps <a name="steps"></a>

1. Data collection: gather your data with labeled examples

2. Text pre-processing: clean your text data by removing noise and inconsistencies, removing stopwords, converting text to lowercase etc. This is a key step, probably as important as building the model itself. When done correctly, it can drastically improve model performance. 

   Check out this kaggle notebook that goes into detail on pre-processing: https://www.kaggle.com/code/sudalairajkumar/getting-started-with-text-preprocessing/notebook
   
   Tokenization is also an important step in text pre-processing. See this article: https://www.datacamp.com/blog/what-is-tokenization



3. Feature extraction: represent the text data as numerical features that the machine learning model can understand. (Machine learning models can't take raw text as an input because they would not be able to make sense of it. Turning the text into numbers that the model can understand is a key step). 

   See this article on feature extraction techniques: https://www.analyticsvidhya.com/blog/2022/05/a-complete-guide-on-feature-extraction-techniques/

4. Model training: train the machine learning model on labeled data. There are many techniques that can be used such as Naive Bayes, Support Vector Machines (SVM), neural networks and more. 

5. Model evaluation: assess the performance of the trained model on a separate set of data (called test set) that has not been used during training. 

Once you are happy with how the model is performing, you can save the model for future use.
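The steps above can be sketched end to end with scikit-learn's `Pipeline`, which chains feature extraction and model training into a single object. This is only a minimal sketch and not the approach used later in this notebook: the four training messages, their labels and the names `train_texts`, `text_clf` and `new_texts` are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: in a real project this comes from step 1 (data collection)
train_texts = [
    "win a free prize now",
    "free cash offer click now",
    "are we meeting for lunch today",
    "see you at home tonight",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Steps 3 and 4: feature extraction followed by model training
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),  # lowercasing is a basic pre-processing step
    ("model", MultinomialNB()),
])
text_clf.fit(train_texts, train_labels)

# Step 5 (in miniature): predict on messages the model has not seen
new_texts = ["free prize offer", "see you at lunch"]
predictions = text_clf.predict(new_texts).tolist()
print(predictions)
```

Bundling the vectorizer with the model means exactly the same transformation is applied at training time and at prediction time.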

# 4. Text classification example - spam text message data from kaggle <a name="example_start"></a>

We're going to use data from Kaggle. It contains two columns: one with the text of each message and one that says whether the message is "ham" (genuine) or spam. 

Sourced from here: https://www.kaggle.com/datasets/team-ai/spam-text-message-classification?resource=download

In [None]:
# import packages first

import os

# data manipulation packages
import pandas as pd
import numpy as np

# data visualisation packages
import seaborn as sns
import matplotlib.pyplot as plt

# regular expressions package
import re

# NLP packages & functions
import nltk
import string
from nltk.corpus import stopwords

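nltk.download('stopwords')
nltk.download('punkt')
# Newer NLTK releases look for the 'punkt_tab' resource when tokenizing, so fetch it as well
nltk.download('punkt_tab')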

In [None]:
# Check the current working directory
os.getcwd()

In [None]:
# Load the dataset
df = pd.read_csv('data/SPAM text message data kaggle.csv')

df

In [None]:
# Let's check whether there are any rows with NA. If there are, let's remove them.
df.isnull().sum()

In [None]:
# Let's check whether there are any duplicated rows
df.duplicated().sum()

In [None]:
# There are duplicates in the data, let's get rid of them
df = df.drop_duplicates(keep = 'first')

# Let's check the shape of the data after removing duplicates
df.shape

In [None]:
# Let's plot the category column
sns.countplot(x="Category", data = df, hue = "Category")

# You can see that the data is unbalanced, with the majority of messages being ham

## 4.1. Text Pre-Processing <a name="text_pre_processing"></a>

Here, we're trying to standardise the data as much as possible by doing things such as removing hyperlinks and emojis, and applying classic NLP cleaning techniques (stemming, etc). These steps help us train better models.

In this script, we're going to do each step one by one, but usually, you would write one big function that does all the steps in one go. 

In [None]:
# First create a column that turns the Category column into a binary one
df["spam"] = df["Category"].apply(lambda x: 1 if x == "spam" else 0)

df

### 4.1.1. Lowercasing

In [None]:
# Create a function that makes all the text in lowercase
def to_lowercase(text):
    text = text.lower()
    return text

# Now apply the function to the text column
df["cleaned_message"] = df["Message"].apply(to_lowercase)

# print the dataframe
df

### 4.1.2. Removing punctuation

In [None]:
# The string module contains a constant that lists all ASCII punctuation characters, we're going to use it in our function
string.punctuation

In [None]:
# Create a function that removes punctuation from any given text
def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

# Apply to the column of interest
df["cleaned_message"] = df["cleaned_message"].apply(remove_punctuation)

df

### 4.1.3. Removing stopwords

Stop words are a set of commonly used words in a language. Examples of stop words in English are “a,” “the,” “is,” “are,” etc. 

Stop words are commonly used in Text Mining and Natural Language Processing (NLP) to eliminate words that are so widely used that they carry very little useful information.

In [None]:
# The package nltk already contains a list of stopwords inside it that we can call upon. Makes life easier when wanting to remove all stopwords from text.
# You can call the function to print the list. I am attaching ", ".join() to the function to paste all of them together rather than printing a long list.
# When calling on the function, you have to specify the language
", ".join(stopwords.words("english"))

In [None]:
# Store the stopwords as a set for fast lookups.
# Use a new name so that we don't overwrite the imported nltk `stopwords` object
# (overwriting it would make this cell fail if it is run a second time).
stop_words = set(stopwords.words("english"))

# Create a function that removes stopwords
def remove_stopwords(text):
    
    # tokenize the text
    tokenized_text = nltk.word_tokenize(text)
    
    # keep only the words that are not stopwords
    kept_words = [word for word in tokenized_text if word not in stop_words]
        
    return " ".join(kept_words)
    
# Apply the function to the data
df["cleaned_message"] = df["cleaned_message"].apply(remove_stopwords)

#print
df

### 4.1.4. Removing frequent words

Here, we are talking about removing words that are NOT stopwords. Removing frequent words is something that you might want to do on a case by case basis, depending on the project you're working on. 

Essentially, we're trying to remove words that occur very frequently in the corpus that might not be informative for the specific classification task at hand.

You may consider doing this if certain words occur very frequently across all documents in your corpus. They might not carry much discriminatory power so by removing highly frequent non-stop words, you can help the model focus on the more distinctive and discriminative terms that are more likely to capture the semantics relevant to the classification task.


It's up to you to determine whether they have value or not. In some cases, removing highly frequent non-stop words may improve performance, while in others, it might not make a significant difference or could even degrade performance. 

In [None]:
# First let's check which words occur more frequently
# Initialise a counter
from collections import Counter
frequent_words = Counter()

# Count how often each word appears across all messages
for text in df["cleaned_message"]:
    frequent_words.update(text.split())

# Print the 10 most common words
frequent_words.most_common(10)

In this case, it might actually be worth removing some frequent words - they are basically stopwords! But because they are misspelled or abbreviated (u instead of you, dont instead of don't etc.), they were not removed in the previous step. 

In [None]:
# Let's just select the frequent words that behave like stopwords, using their positions in the top 10.
# These positions were picked by looking at the output above, so re-check them if your data changes.
top_words = frequent_words.most_common(10)
remove_words = [top_words[i][0] for i in (0, 2, 5, 7)]
   
print(remove_words)

In [None]:
# Define a function to remove the words
def remove_frequent_words(text):
    return " ".join([word for word in text.split() if word not in remove_words])

df["cleaned_message"] = df["cleaned_message"].apply(remove_frequent_words)

df

You could do the same with rare words if you wanted.

In [None]:
# Show the 10 least common words (the slice walks backwards from the end of the list)
frequent_words.most_common()[:-11:-1]

### 4.1.5. Removing special characters

Remove any special characters that haven't already been removed by the previous steps. 

In [None]:
# Create the function
def remove_special_chars(text):
    # Replace every character that is not a letter (a-z, A-Z) or a digit (0-9) with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # Collapse any run of whitespace into a single space (raw string avoids an invalid escape warning)
    text = re.sub(r"\s+", " ", text)
    return text

# Apply it
df["cleaned_message"] = df["cleaned_message"].apply(remove_special_chars)

df

### 4.1.6. Stemming and lemmatization

Stemming, in Natural Language Processing (NLP), refers to the process of reducing a word to its stem by stripping affixes such as suffixes and prefixes.

For example, the words “programming,” “programmer,” and “programs” can all be reduced down to the common word stem “program,” although the exact output depends on the stemming algorithm used. In other words, the stem “program” stands in for all three inflected words.

Lemmatization is a text pre-processing technique that reduces a word to its dictionary form, known as the lemma, so that different forms of the same word can be matched. For example, a lemmatization algorithm would reduce the word “better” to its lemma, “good”.

Doing either stemming or lemmatization (or both!!) is a key step in text pre-processing. 

Read more about the two here: https://www.datacamp.com/tutorial/stemming-lemmatization-python

#### Stemming

In [None]:
# Let's import the required function
from nltk.stem.porter import PorterStemmer

# Create an instance 
ps = PorterStemmer()

# Create a function that does the stemming
def stemmer(text):
    
    # Tokenization using NLTK
    tokenized_text = nltk.word_tokenize(text)
    
    # stem each token in turn
    stemmed_words = [ps.stem(word) for word in tokenized_text]
        
    return " ".join(stemmed_words)

# Apply the function
df["cleaned_message_stemmed"] = df["cleaned_message"].apply(stemmer)

df

#### Lemmatization

In [None]:
# The steps are quite similar to above
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

# Create an instance
lemmatizer = WordNetLemmatizer() 

# Function
def lemmatize(text):
    tokenized_text = nltk.word_tokenize(text)
    
    # lemmatize each token; without a part-of-speech tag, WordNetLemmatizer treats every word as a noun
    lemmatized_words = [lemmatizer.lemmatize(word) for word in tokenized_text]
        
    return " ".join(lemmatized_words)

# Apply the function
df["cleaned_message_lemmatized"] = df["cleaned_message"].apply(lemmatize)

df

Going forward, we'll use the stemmed column for simplicity

### 4.1.7. Removing URLs

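URLs rarely contribute to text analysis and are usually treated as noise. Note that a URL pattern only matches intact URLs, so in a real pipeline this step should run before punctuation and special characters are removed.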

In [None]:
def remove_url(text):
    # Match anything starting with http://, https:// or www. up to the next whitespace
    return re.sub(r"https?://\S+|www\.\S+", "", text)

df["cleaned_message_stemmed"] = df["cleaned_message_stemmed"].apply(remove_url)

df

These include most of the steps that you'd usually perform in text pre-processing. Depending on your data, you may have to perform other things. 

I've done it step by step so you are able to compare how each function impacts the text, but in standard practice, you'd usually combine everything into one function and apply it to the text once, which is quicker to run and easier to maintain.
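As a rough illustration of combining the steps, here is a minimal sketch of a single cleaning function. It only uses the standard library, so the tiny `SMALL_STOPWORDS` set is a hypothetical stand-in for the nltk stopword list, stemming is left out, and the name `clean_text` is an assumption made for this example.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list so the sketch needs no downloads;
# in the notebook you would use nltk's English stopwords instead
SMALL_STOPWORDS = {"a", "the", "is", "are", "to", "and", "you"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Apply the cleaning steps to one message in a single call."""
    text = text.lower()                        # lowercasing
    text = URL_PATTERN.sub(" ", text)          # URLs, removed while they are still intact
    text = text.translate(PUNCTUATION_TABLE)   # punctuation
    text = re.sub(r"[^a-z0-9]", " ", text)     # any remaining special characters
    words = [w for w in text.split() if w not in SMALL_STOPWORDS]  # stopwords
    return " ".join(words)


cleaned = clean_text("You are a WINNER!! Claim the prize at https://example.com/win today")
print(cleaned)
```

On a dataframe, a function like this could be applied in one go with something like `df["Message"].apply(clean_text)`.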

## 4.2. Feature extraction <a name="feature_extraction"></a>

As mentioned earlier, feature extraction is a step in which you convert the raw text data into numerical inputs that the model can understand. 

There are many techniques that you can perform, for example: 
1. Bag of words (BOW). Read about it here: https://machinelearningmastery.com/gentle-introduction-bag-words-model/
2. Term Frequency - Inverse Document Frequency (TF-IDF). Read more about it here: https://www.learndatasci.com/glossary/tf-idf-term-frequency-inverse-document-frequency/
3. Word embeddings (word2vec, GloVe). Read more about it here: https://www.turing.com/kb/guide-on-word-embeddings-in-nlp

I've just listed the more popular ones, but there are more. 

Before performing TF-IDF, we're going to split the data into training and test sets, so that the vectorizer learns its vocabulary and IDF weights from the training data only. 

test_size = 0.3 means that we're splitting the data into 70% training data and 30% test data. We [split the data](https://www.linkedin.com/pulse/why-do-we-need-data-splitting-utkarsh-sharma) so that we can check how well the model performs on data it has not seen during training, which is how overfitting is detected. Using a 70%-30% or 80%-20% split is common practice, as these ratios tend to work well empirically.

However, there's no hard rule here. It depends on your project as well. You want to ensure that you're training the model and evaluating it on enough data. So sample size of both sets should be big enough.

In [None]:
from sklearn.model_selection import train_test_split

X = df["cleaned_message_stemmed"]
y = df["spam"]

X_train, X_test , y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 2)

In this example, we're going to use TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create an instance of the vectorizer
tfidf = TfidfVectorizer()

# Now you want to pass the text through the vectorizer. fit_transform lets the vectorizer learn the vocabulary and the IDF and return a document-term matrix. 
X_train_tfidf = tfidf.fit_transform(X_train).toarray()
X_test_tfidf = tfidf.transform(X_test).toarray()

There are actually many arguments that you could pass through the TfidfVectorizer() function that may be useful to your problem, most common ones being max_features, min_df or max_df etc.

Check out the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
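To make it clearer what the vectorizer computes, here is a minimal plain-Python sketch of TF-IDF. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is what the scikit-learn documentation describes as the default behaviour. The three toy documents and the helper name `tfidf_vector` are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already cleaned and lowercased
docs = ["free prize now", "free lunch today", "lunch with team today"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency (assumed formula, see the text above)
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_vector(tokens):
    """Return a {term: weight} mapping for one document, L2-normalised."""
    term_counts = Counter(tokens)
    weights = {term: count * idf[term] for term, count in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


first_doc = tfidf_vector(tokenized[0])
print(first_doc)
```

In the first document, "free" gets a lower weight than "prize" because it also appears in another document, so it is less distinctive.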

## 4.3. Model training <a name="model_training"></a>

Model training in machine learning refers to the process of teaching a machine learning model to recognize patterns and make predictions from data. 

In a supervised learning setting, during the training phase, the model learns from a labeled dataset, where input data (features) are paired with corresponding output labels (targets or responses).

In this script, we're going to try out 3 classifiers: multinomial naive Bayes, random forest and linear support vector classification. 

In [None]:
# Import the packages
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix

### 4.3.1. Multinomial Naive Bayes

Article explaining how MNB works: https://www.upgrad.com/blog/multinomial-naive-bayes-explained/

In [None]:
# Start the model up
MNB = MultinomialNB()

# Fit the model to the vectorised training set (this is where the actual machine learning is happening)
MNB.fit(X_train_tfidf, y_train)

# Now that the model has learnt from the training data, let it make predictions on the test set (which it hasn't seen before) and see how accurate it is against the true labels.
y_prediction_MNB = MNB.predict(X_test_tfidf)

# Calculate the probability estimates for each class label for a given input sample.
y_probability_MNB = MNB.predict_proba(X_test_tfidf)

In [None]:
# Add the prediction and the probability back to the dataset with the actuals
df_test = pd.DataFrame(X_test)
df_test["Actual"] = y_test
df_test["MNB_prediction"] = y_prediction_MNB
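# predict_proba returns one column per class in the order of MNB.classes_, so column 0 is the probability of class 0 (ham)
df_test["MNB_probability"] = y_probability_MNB[:, 0]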

df_test.head()

### 4.3.2. Random forest classifier

Nice article from Datacamp on random forest with an explanation of how the algorithm works: https://www.datacamp.com/tutorial/random-forests-classifier-python

In [None]:
# Create an instance of the model
# You can choose the number of trees in the classifiers with the n_estimators argument. The default is 100 so I'll just leave it as that. 
rf = RandomForestClassifier()

# Fit the model
rf.fit(X_train_tfidf, y_train)

# Make predictions
y_prediction_rf = rf.predict(X_test_tfidf)

# Calculate the probabilities of each category
y_probability_rf = rf.predict_proba(X_test_tfidf)

In [None]:
# Adding prediction to the data
df_test["rf_prediction"] = y_prediction_rf
# As with MNB, column 0 is the probability of class 0 (ham)
df_test["rf_probability"] = y_probability_rf[:, 0]

df_test.head()

### 4.3.3. Linear SVC

Article that explains how SVMs work: https://www.analyticsvidhya.com/blog/2021/10/support-vector-machinessvm-a-complete-guide-for-beginners/

In [None]:
# Create an instance of the model
LSVC = LinearSVC()

# Fit the model
LSVC.fit(X_train_tfidf, y_train)

# Make predictions
y_prediction_LSVC = LSVC.predict(X_test_tfidf)

# LinearSVC is a linear classifier that learns a decision boundary directly and does not provide probability estimates.
# So in this case, there's no predict_proba() function; decision_function() returns signed distances to the boundary instead.

In [None]:
# Add prediction to the df
df_test["LSVC_prediction"] = y_prediction_LSVC

df_test.head()

## 4.4. Model evaluation <a name="model_evaluation"></a>

### 4.4.1. Accuracy, precision and recall <a name="metrics_explained"></a>

Accuracy, precision and recall are three metrics used to evaluate a machine learning model's performance. 

**Accuracy is the ratio of correctly classified instances to the total instances in the dataset.** It is a good measure when the classes are balanced, but it can be misleading if there is a class imbalance, as a model might achieve high accuracy by simply predicting the majority class.

**Precision measures the proportion of true positive predictions (correctly identified positive instances) out of all positive predictions made.** It is useful when it is important to minimize false positives, such as in spam detection or medical diagnosis where false positives could have serious consequences.

**Recall (also known as sensitivity or true positive rate) measures the proportion of true positive instances that were actually identified correctly from all the actual positive samples in the dataset.** It is important when it is critical to find all positive instances, such as in fraud detection where missing a fraudulent transaction could be costly.

When deciding which metric to use, consider the following:

Use accuracy when the classes are balanced and the cost of false positives and false negatives is similar.

Use precision when it is crucial to minimize false positives, even if it means missing some true positives.

Use recall when it is essential to find all positive instances, even if it means increasing the number of false positives.

Depending on your problem, you may care more about one metric than another.
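The three definitions above can be written out directly from the four confusion-matrix counts. This is a minimal sketch: the function name `classification_metrics` and the counts passed to it are hypothetical, chosen to show how accuracy can look good on imbalanced data while recall is poor.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when the model makes no positive predictions
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Guard against division by zero when there are no actual positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts for an imbalanced problem: 100 positives and 900 negatives
accuracy, precision, recall = classification_metrics(tp=60, fp=10, fn=40, tn=890)
print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
```

Here the model is right 95% of the time overall, yet it still misses 40 of the 100 positive cases.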

See these sources for more on accuracy, precision and recall: 

https://www.kimberlyfessel.com/mathematics/data/accuracy-precision-recall/

https://developers.google.com/machine-learning/crash-course/classification/accuracy

https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall 

#### Confusion matrix

To visualise accuracy, precision and recall, you could create a confusion matrix. It may give you better insight into each model. 

This article from DataCamp gives a good introduction to confusion matrices:

https://www.datacamp.com/tutorial/what-is-a-confusion-matrix-in-machine-learning

#### F1 score

Ideally, when looking at precision and recall, you'd want your classifier to have high precision and high recall. This is not always possible, and there's also usually a trade-off between the two: trying to improve one comes at the cost of the other. 

There are instances in which it's desirable to have both high precision and high recall - this is where the F1 score comes in. 

The F1 score is calculated as the harmonic mean of the precision and recall of a classification model. Both metrics contribute equally to the score, and because it is a harmonic mean, a low value in either one pulls the score down sharply. 

The score ranges from 0 to 1, with 1 representing a model that classifies every observation into its correct class and 0 representing a model that fails to identify any positive observation correctly. 

A high F1 score indicates a well balanced model performance, with high precision and high recall. 
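As a small illustration of how the harmonic mean behaves, the sketch below compares two sets of scores with the same arithmetic mean. The function name `f1_from` and the precision and recall values are hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical scores: a balanced model and a lopsided one with the same arithmetic mean (0.8)
balanced = f1_from(0.8, 0.8)
lopsided = f1_from(1.0, 0.6)
print(f"balanced F1={balanced:.3f}, lopsided F1={lopsided:.3f}")
```

The lopsided model scores lower even though its average of precision and recall is identical, which is the sense in which F1 rewards balance.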

Read more about F1 score here: 

https://encord.com/blog/f1-score-in-machine-learning/#:~:text=The%20F1%20score%20or%20F,the%20reliability%20of%20a%20model.

https://arize.com/blog-course/f1-score/

https://deepai.org/machine-learning-glossary-and-terms/f-score

Now, let's calculate the three metrics for each of the models we've used in training and visualise the confusion matrices

### 4.4.2. Multinomial Naive Bayes evaluation

In [None]:
# Calculate scores for MNB model
accuracy_MNB = accuracy_score(y_test, y_prediction_MNB)
precision_MNB = precision_score(y_test, y_prediction_MNB)
recall_MNB = recall_score(y_test, y_prediction_MNB)

print(f'Accuracy of the model: {accuracy_MNB}')
print(f'Precision Score of the model: {precision_MNB}')
print(f'Recall Score of the model: {recall_MNB}')

In [None]:
# Confusion matrix for MNB model

# Create the confusion matrix
cf_MNB = confusion_matrix(df_test.Actual, df_test.MNB_prediction)

# Print the matrix
cf_MNB

In [None]:
# Notice above how the result is quite literally a matrix. 
# Our aim, however, is to create a plot with labels that gives more clarity and is easier to understand

# Plot the confusion matrix
sns.heatmap(cf_MNB, annot=True, fmt='d', cmap='Greens',
            xticklabels=['Predicted Negative', 'Predicted Positive'],
            yticklabels=['Actual Negative', 'Actual Positive'])

plt.xlabel('Predicted label')
plt.ylabel('Actual label')
plt.title('Multinomial Naive Bayes Confusion Matrix')
plt.show()

Let's recall the definition of precision first: it measures the proportion of true positive predictions (correctly identified positive instances) out of all positive predictions made.

If you wanted to calculate it manually just from the confusion matrix, you'd do 125/(125+0) = 1. 

This is exactly the same as the precision score of 1 calculated above using the precision_score() function. There were no instances in which the model incorrectly classified a genuine message as spam (no false positives), which is why the top right cell of the confusion matrix is 0.

Let's do the same for recall. Recall measures the proportion of true positive instances that were actually identified correctly from all the actual positive samples in the dataset.

Calculating it manually from the confusion matrix, we would do: 125/(125+72) = 0.63.

There were 72 instances where spam messages were categorised as genuine messages. 

In this example specifically, we care more about precision: we want to avoid misclassifying genuine messages as spam.

### 4.4.3. Random Forest Evaluation

In [None]:
# Create confusion matrix
cf_rf = confusion_matrix(df_test.Actual, df_test.rf_prediction)

# Create plot for matrix
sns.heatmap(cf_rf, annot=True, fmt="d", cmap="Reds", 
            xticklabels = ["Predicted negative", "Predicted positive"],
            yticklabels = ["Actual negative", "Actual positive"])

plt.xlabel("Predicted label")
plt.ylabel("Actual label")
plt.title("Random forest confusion matrix")
plt.show()

# Calculate metrics for random forest
accuracy_rf = accuracy_score(y_test, y_prediction_rf)
precision_rf = precision_score(y_test, y_prediction_rf)
recall_rf = recall_score(y_test, y_prediction_rf)

print(f'Accuracy of the model: {accuracy_rf}')
print(f'Precision Score of the model: {precision_rf}')
print(f'Recall Score of the model: {recall_rf}')

### 4.4.4. Linear Support Vector Classifier Evaluation

In [None]:
# Create confusion matrix
cf_lsvc = confusion_matrix(df_test.Actual, df_test.LSVC_prediction)

# Create plot for confusion matrix
sns.heatmap(cf_lsvc, annot = True, fmt = "d", cmap = "Blues",
            xticklabels = ["Predicted negative", "Predicted positive"],
            yticklabels = ["Actual negative", "Actual positive"])

plt.xlabel("Predicted label")
plt.ylabel("Actual label")
plt.title("LSVC confusion matrix")
plt.show()

# Calculate metrics for LSVC
accuracy_LSVC = accuracy_score(y_test, y_prediction_LSVC)
precision_LSVC = precision_score(y_test, y_prediction_LSVC)
recall_LSVC = recall_score(y_test, y_prediction_LSVC)

print(f'Accuracy of the model: {accuracy_LSVC}')
print(f'Precision Score of the model: {precision_LSVC}')
print(f'Recall Score of the model: {recall_LSVC}')

# 5. Some further notes <a name="further_notes"></a>

## 5.1. Balanced datasets <a name="balanced_datasets"></a>

A balanced dataset is one in which the number of samples for each class is roughly the same. 

In a binary classification problem, for instance, a balanced dataset would have an equal number of instances for both classes. 

This is important because in machine learning, a model trained on an imbalanced dataset can become biased towards the majority class, which can lead to poor performance when predicting the minority class.

Balancing a dataset can help prevent such bias and improve the model's ability to generalize to new, unseen data.

Check this article out on some ways you can deal with an unbalanced dataset: https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html
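One common option is to weight each class inversely to its frequency, so that mistakes on the minority class cost more during training. The sketch below computes such weights by hand, assuming the "balanced" heuristic `n_samples / (n_classes * class_count)` that scikit-learn documents for `class_weight="balanced"`. The label counts are hypothetical.

```python
from collections import Counter

# Hypothetical imbalanced labels: 90 ham (0) and 10 spam (1)
labels = [0] * 90 + [1] * 10

class_counts = Counter(labels)
n_samples = len(labels)
n_classes = len(class_counts)

# "Balanced" heuristic: rarer classes get proportionally larger weights
class_weights = {
    label: n_samples / (n_classes * count) for label, count in class_counts.items()
}
print(class_weights)
```

Classifiers such as `RandomForestClassifier` and `LinearSVC` accept a dictionary like this through their `class_weight` argument.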

## 5.2. Exploratory Data Analysis <a name="EDA"></a>

One step that is usually performed, but hasn't been done here, is exploratory data analysis. 

Usually, you would want to properly explore the text data and understand more about it. What words are common? How long are the words? Etc. Anything that would give you an idea of the text you're working with.

Check out this article on performing EDA: https://www.analyticsvidhya.com/blog/2020/04/beginners-guide-exploratory-data-analysis-text-data/
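As a starting point, questions like these can be answered with a few lines of standard-library Python. This is a minimal sketch on hypothetical messages that stand in for the real dataset. It compares average message length per category and counts the most common words in spam.

```python
from collections import Counter
from statistics import mean

# Hypothetical labelled messages standing in for the real dataset
messages = [
    ("spam", "win a free prize now"),
    ("spam", "free cash offer click here now"),
    ("ham", "are we meeting for lunch today"),
    ("ham", "see you at home"),
]

# Average message length (in words) per category
lengths = {}
for label, text in messages:
    lengths.setdefault(label, []).append(len(text.split()))
avg_length = {label: mean(values) for label, values in lengths.items()}

# Most common words in the spam messages
spam_words = Counter(
    word for label, text in messages if label == "spam" for word in text.split()
)

print(avg_length)
print(spam_words.most_common(2))
```

On the real data, the same idea could be applied to the `Message` and `Category` columns before any cleaning, to see which patterns separate the two classes.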

# 6. Further Resources <a name="resources"></a>

Here are some resources you can use:

NLP Datacamp course in Python: https://www.datacamp.com/tracks/natural-language-processing-in-python

NLP Datacamp course in R: https://www.datacamp.com/courses/introduction-to-natural-language-processing-in-r

Text Mining with R textbook: https://www.tidytextmining.com/

Google's Introduction to Machine Learning crash course: https://developers.google.com/machine-learning/crash-course/ml-intro