## Theory

### Introduction
The Naive Bayes classifier is a probabilistic machine learning algorithm based on Bayes' Theorem,  
used extensively for text classification tasks such as spam detection,  
sentiment analysis, and topic categorization.  
Its "naive" assumption is that features (words, in text classification)  
are conditionally independent given the class label.  
Despite this simplification, it performs well in many real-world scenarios,  
and in the case of banking data, the order of the words is unlikely to hold any information.

### Bayes' Theorem

The formula is given as

$P(H|E) = \frac{P(E|H)P(H)}{P(E)}$

where:
- $P(H|E)$: Posterior probability of the *Hypothesis*, given that the *Evidence* is true
- $P(E|H)$: Likelihood of the *Evidence*, given that the *Hypothesis* is true
- $P(H)$: Prior probability of the *Hypothesis*
- $P(E)$: Prior probability that the *Evidence* is true

In banking file processing, the Hypothesis is that a row belongs to some *Category* $C$,  
and the Evidence is the row's *Feature* vector $X$. The probability of seeing the Evidence, $P(X)$,  
is the same for every Category, so it can be dropped when comparing Categories:

$P(C|X) \propto P(X|C)P(C)$
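
As a small worked example of why $P(X)$ can be dropped, the sketch below applies Bayes' Theorem to made-up numbers (the priors and likelihoods are illustrative assumptions, not values from the banking file). It computes both the full posterior and the unnormalized score $P(X|C)P(C)$, and checks that the two rank the Categories in the same order.

```python
# Illustrative numbers only: priors P(C) and likelihoods P(X|C) for one fixed feature vector X
priors = {"FOOD": 0.5, "HEALTH": 0.2, "OTHER": 0.3}
likelihoods = {"FOOD": 0.08, "HEALTH": 0.01, "OTHER": 0.02}

# Unnormalized scores: P(X|C) * P(C)
scores = {c: likelihoods[c] * priors[c] for c in priors}

# P(X) via the law of total probability: sum over all Categories
evidence = sum(scores.values())

# Full posteriors: P(C|X) = P(X|C) * P(C) / P(X)
posteriors = {c: scores[c] / evidence for c in scores}

ranking_by_score = sorted(scores, key=scores.get, reverse=True)
ranking_by_posterior = sorted(posteriors, key=posteriors.get, reverse=True)

print(ranking_by_score)
print(ranking_by_posterior)
print(posteriors)
```

Dividing by $P(X)$ rescales every score by the same constant, so the ordering of the Categories cannot change.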

The classifier assumes each word's occurrence in a document is independent of the others given the class, which simplifies the computation:

$P(X|C) = P(x_1|C) \cdot P(x_2|C) \cdots P(x_n|C)$
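
A minimal sketch of this product, assuming a hypothetical table of per-token likelihoods for a single class (the tokens and numbers are invented for illustration):

```python
import math

# Hypothetical per-token likelihoods P(x_i|C) for a single class C
token_likelihoods = {"prisma": 0.10, "iso": 0.15, "omena": 0.15}
prior = 0.4  # hypothetical P(C)

tokens = ["prisma", "iso", "omena"]  # the feature vector X of one row

# P(X|C) = P(x_1|C) * P(x_2|C) * ... * P(x_n|C)
likelihood = math.prod(token_likelihoods[t] for t in tokens)

# Unnormalized posterior: P(X|C) * P(C)
score = likelihood * prior
print(likelihood, score)
```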

### Training
Training uses labeled data to estimate the probabilities $P(x_i|C)$ and $P(C)$.  
Handling unseen tokens (words not encountered during training)  
is a critical aspect of the Naive Bayes classifier, because it avoids zero probabilities.  
If a token $x_i$ never occurs in the training data for a particular class $C$,  
its estimated likelihood is $P(x_i|C) = 0$. Since Naive Bayes multiplies probabilities,  
a single zero makes the entire product $P(X|C) = 0$.  
To address this, Laplace smoothing is used:  
1 is added to each token count before the counts are turned into $P(x_i|C)$, so no likelihood is exactly zero.  
The mathematical effect is minimal for larger datasets,  
but it makes the model more robust to unseen data.  
The probabilities are computed in log space to prevent  
underflow errors, and transformed back to ordinary probabilities  
to give more meaningful results.  
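
The sketch below illustrates both ideas on toy counts. It assumes the common add-one form of Laplace smoothing over a fixed vocabulary, $(\text{count} + 1) / (\text{total} + |V|)$. The exact formula used by the `NB` class is not shown in this notebook, so the function names and the denominator are assumptions.

```python
import math

def smoothed_likelihoods(counts, vocabulary):
    """Add-one (Laplace) smoothing: (count + 1) / (total + |V|)."""
    total = sum(counts.get(tok, 0) for tok in vocabulary)
    return {tok: (counts.get(tok, 0) + 1) / (total + len(vocabulary))
            for tok in vocabulary}

def log_score(tokens, likelihoods, prior):
    """log P(C) + sum_i log P(x_i|C); a sum of logs replaces the product."""
    return math.log(prior) + sum(math.log(likelihoods[t]) for t in tokens)

vocabulary = ["prisma", "vr", "stockmann"]
food_counts = {"prisma": 3, "vr": 1}  # "stockmann" was never seen with this class

probs = smoothed_likelihoods(food_counts, vocabulary)
print(probs["stockmann"])  # small, but not zero

# Multiplying many small probabilities underflows to 0.0 ...
print(0.01 ** 200)
# ... while the equivalent sum of logs stays finite
print(200 * math.log(0.01))

print(log_score(["prisma", "stockmann"], probs, prior=0.5))
```

Taking `math.exp` of a log score converts it back to an ordinary probability once the sum has been formed.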

### Advantages
- Computationally efficient and simple to implement.
- Performs well with small datasets and text data.
- Robust to irrelevant features.

### Limitations
- The independence assumption is often violated in real-world text data.
- Struggles with complex decision boundaries compared to more advanced methods like deep learning.

## Code

In [2]:
import pandas as pd
import numpy as np

## Import the model module from Backend

In [1]:
from backend.ml.model import NB

## Get the Example Data

In [21]:
df_train = pd.read_csv('frontend/app_assets/your_banking_file1.csv')

df_train = df_train.fillna('N/A') # The model does not work with NaN values -> those must be filled or removed

df_train['Category'] = np.random.choice(['FOOD', 'HEALTH', 'OTHER'], df_train.shape[0]) # Fill in mock Categories, since the rows are not already classified

display(df_train.head(5))

| Unnamed: 0 | Date | Receiver | Amount | SomeRandomColumn | Category |
|---|---|---|---|---|---|
| 0 | 2022-07-31 | VR-YHTYMÄ OY | -49.9 | | OTHER |
| 1 | 2022-07-27 | VR-YHTYMÄ OY | -27.0 | | FOOD |
| 2 | 2022-07-27 | STOCKMANN TAPIOLA | -53.4 | | OTHER |
| 3 | 2022-07-25 | BESTSELLER | -79.99 | | FOOD |
| 4 | 2022-07-25 | PRISMA ISO OMENA | -34.66 | | OTHER |


## Init the model and Fit it

In [22]:
model = NB()

model.fit(str_features=df_train[['Receiver']].to_numpy(), # The model supports multiple columns for string and float inputs, and expects those to be 2D NumPy arrays
          float_features=df_train[['Amount']].to_numpy(), 
          y=df_train['Category'].to_numpy()
          )

## Display model Prior probabilities for each target 
### $P(C)$ means the overall probability of seeing each Class

In [None]:
for key, value in model.get_priors().items(): # This is the overall probability of seeing each Category in the dataset

    print(f'Category: {key: >6}, Probability: {value*100:.2f}%')

Category:  OTHER, Probability: 46.15%
Category:   FOOD, Probability: 28.21%
Category: HEALTH, Probability: 25.64%
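
Conceptually, a prior is just the relative frequency of each label in the training data. The sketch below shows that idea with `collections.Counter` on a made-up label list. It is an illustration only, not the actual implementation behind `model.get_priors()`, and the helper name `estimate_priors` is hypothetical.

```python
from collections import Counter

def estimate_priors(labels):
    """Return P(C) for each class as its relative frequency, most frequent first."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: count / n for cls, count in counts.most_common()}

labels = ["OTHER", "FOOD", "OTHER", "HEALTH", "OTHER", "FOOD"]  # made-up labels
priors = estimate_priors(labels)
for cls, p in priors.items():
    print(f"Category: {cls: >6}, Probability: {p*100:.2f}%")
```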


## Display the Likelihoods of each Token, given the Category  
### $P(x_i|C)$ means the probability of seeing the Feature, assuming the Class is known

In [None]:
for key, value in model.get_likelihoods().items(): # The likelihood of seeing some Token, given that the Category is known
    print(f'Category: {key}:')

    for key2, value2 in value.items():
        print(f'\tToken: {key2:>16}, Likelihood: {value2*100:.2f}%')

Category: OTHER:
	Token: negativecashflow, Likelihood: 100.00%
	Token:      smallamount, Likelihood: 68.42%
	Token:     mediumamount, Likelihood: 36.84%
	Token:               oy, Likelihood: 26.32%
	Token:            omena, Likelihood: 15.79%
	Token:               vr, Likelihood: 15.79%
	Token:              iso, Likelihood: 15.79%
	Token:           yhtymä, Likelihood: 15.79%
	Token:              oyj, Likelihood: 15.79%
	Token:               at, Likelihood: 10.53%
	Token:        kauhajoki, Likelihood: 10.53%
	Token:           prisma, Likelihood: 10.53%
	Token:       citymarket, Likelihood: 10.53%
	Token:               ya, Likelihood: 10.53%
	Token:              seo, Likelihood: 10.53%
	Token:              vfi, Likelihood: 10.53%
	Token:                k, Likelihood: 10.53%
	Token:              dna, Likelihood: 10.53%
	Token:        markkinak, Likelihood: 10.53%
	Token:             tori, Likelihood: 10.53%
	Token:          finland, Likelihood: 10.53%
	Token:      ca778475780, Likelihood:
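
The listing shows that likelihoods are tracked per token and per Category. As a rough sketch of one way such a table could be built, the code below lowercases each Receiver string, splits it into word tokens, and computes, for each Category, the fraction of rows that contain each token. The tokenizer regex, the function names, and the absence of smoothing are all assumptions made for illustration; the `NB` class may work differently.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Hypothetical tokenizer: lowercase, then split on non-word characters."""
    return re.findall(r"\w+", text.lower())

def token_likelihoods(rows, labels):
    """Fraction of rows in each class that contain each token (no smoothing)."""
    class_sizes = defaultdict(int)
    doc_counts = defaultdict(lambda: defaultdict(int))
    for text, label in zip(rows, labels):
        class_sizes[label] += 1
        for token in set(tokenize(text)):  # count each token once per row
            doc_counts[label][token] += 1
    return {label: {tok: n / class_sizes[label] for tok, n in toks.items()}
            for label, toks in doc_counts.items()}

rows = ["VR-YHTYMÄ OY", "PRISMA ISO OMENA", "VR-YHTYMÄ OY"]
labels = ["OTHER", "FOOD", "OTHER"]  # made-up labels
likelihoods = token_likelihoods(rows, labels)
print(likelihoods["OTHER"])
```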

## Predict with the Model  
### The Predictions are the Posterior probabilities $P(C|X) \propto P(X|C)P(C)$ in Descending order

In [43]:
preds = model.predict(str_features=np.array([['K-Supermaket Somewhere']]), # The prediction inputs are also 2D NumPy arrays, and can have multiple rows for batch prediction
                      float_features=np.array([[-45.45]])
                      )

display(preds) # preds is a list of dicts, where each element is one row; in this case, there is only one row

[{'FOOD': np.float64(0.04113247863247864),
  'OTHER': np.float64(0.017898998508416787),
  'HEALTH': np.float64(0.00847637211273575)}]

In [44]:
first_row_predictions = preds[0]
for key, value in first_row_predictions.items(): # Each dict holds the Posterior probability of each possible Category in descending order

    print(f'Category: {key: >6}, Posterior: {value*100:.2f}%')

Category:   FOOD, Posterior: 4.11%
Category:  OTHER, Posterior: 1.79%
Category: HEALTH, Posterior: 0.85%
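
The three values above add up to roughly 6.75%, not 100%, which is consistent with the scores being proportional to $P(C|X)$ rather than normalized. If probabilities that sum to 1 are wanted, each score can be divided by the total. A minimal sketch, with the printed values copied into a plain dict (the helper name `normalize` is hypothetical and not part of the `NB` class):

```python
def normalize(scores):
    """Scale unnormalized posterior scores so that they sum to 1."""
    total = sum(scores.values())
    return {category: value / total for category, value in scores.items()}

# Values copied from the prediction output above
scores = {"FOOD": 0.04113247863247864,
          "OTHER": 0.017898998508416787,
          "HEALTH": 0.00847637211273575}

normalized = normalize(scores)
for category, value in normalized.items():
    print(f"Category: {category: >6}, Normalized posterior: {value*100:.2f}%")
```

Normalizing does not change the ranking, so the predicted Category stays the same.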
