# Naive Bayes
## 1. Overview

Naive Bayes is a probabilistic machine learning model used for classification tasks. It is based on Bayes' Theorem, a fundamental result in probability theory. The 'naive' part of the name comes from the assumption that the features (or predictors) are independent of each other once the class is known. This simplifying assumption rarely holds exactly in real-world data, but it keeps the algorithm fast and lets it perform well in practice, especially in text classification tasks such as spam filtering.

Bayes' Theorem, in this context, is used to calculate the probability of a hypothesis (like whether an email is spam or not spam) given observed evidence (like the presence of certain words in the email), combined with prior knowledge of how common each hypothesis is.
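In symbols, the idea can be sketched as follows. The notation is introduced here for illustration only: $y$ stands for the class (for example, spam or not spam) and $x_1, \dots, x_n$ for the observed features.

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
$$

Under the naive independence assumption, the joint likelihood factorizes into one term per feature, and the predicted class is the one with the largest product:

$$
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$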

## Explanation in Layman's Terms

Let's say you have a basket of fruits that are either apples or oranges, and you want to determine how likely it is that a fruit you picked is an apple, based on features like color, size, and shape. Naive Bayes helps in making this determination.

Imagine that you know some general facts like apples are generally red, and oranges are mostly orange in color. If you pick a fruit randomly and see it's red, Naive Bayes uses the color information to increase the likelihood in your mind that the fruit is an apple. It does this by calculating probabilities based on the features (color in this case) and what you already know about apples and oranges.

It is called 'naive' because it assumes that each feature (like color, size, shape) contributes independently to the fruit being an apple or an orange. This is like assuming that, for a given kind of fruit, the color tells you nothing about its size or shape, which simplifies the calculation but isn’t always true in real life.
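The fruit example can be worked through by hand. The sketch below is a minimal illustration, not a full classifier: every probability in it is a made-up number chosen for the example, and the `posterior` helper is a hypothetical name.

```python
# Hypothetical numbers chosen only for illustration; they do not come from any dataset.
priors = {"apple": 0.5, "orange": 0.5}  # P(class) before looking at the fruit
likelihoods = {                          # P(feature | class)
    "apple": {"color=red": 0.80, "shape=round": 0.90},
    "orange": {"color=red": 0.05, "shape=round": 0.95},
}


def posterior(observed, priors, likelihoods):
    """Return P(class | observed features) under the naive independence assumption."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature in observed:
            # Independence given the class: multiply one term per feature
            score *= likelihoods[label][feature]
        scores[label] = score
    total = sum(scores.values())  # normalizing constant
    return {label: score / total for label, score in scores.items()}


result = posterior(["color=red", "shape=round"], priors, likelihoods)
print(result)
```

With these assumed numbers, seeing a red, round fruit moves the probability of "apple" from 0.5 to roughly 0.94, mostly because red is far more likely for apples than for oranges.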

Despite this simplification, Naive Bayes can be surprisingly effective and is particularly popular in tasks like email spam detection, where it looks at words in the emails and decides if an email is spam or not based on what it has learned from previous examples.
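A minimal sketch of the spam-detection idea, assuming scikit-learn's `CountVectorizer` and `MultinomialNB` as the tools (the text above does not name a specific implementation). The six training messages are invented for illustration and are far too few for a real filter.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set; a real spam filter needs thousands of labeled emails.
emails = [
    "win a free prize now",
    "free money claim your prize",
    "urgent offer win cash now",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch with the project team",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# CountVectorizer turns each email into word counts;
# MultinomialNB learns P(word | class) from those counts.
spam_model = make_pipeline(CountVectorizer(), MultinomialNB())
spam_model.fit(emails, labels)

predictions = spam_model.predict(["claim your free prize", "review the meeting report"])
print(list(predictions))
```

The first test message contains only words seen in spam examples and the second only words seen in legitimate ones, so the model labels them accordingly.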

https://www.youtube.com/watch?app=desktop&v=O2L2Uv9pdDA

## 2. History of Naive Bayes

* **Based on Bayes' Theorem**: The Naive Bayes classifier is based on Bayes' Theorem, named after Thomas Bayes, whose work on it was published posthumously in 1763. The "naive" classifier as used today came into use in the mid-20th century, with document classification among its early applications.
* **Naive**: It's termed "naive" because it assumes the predictor variables are independent of one another given the class, a simplification that is often unrealistic in real-world scenarios. The assumption keeps the computation simple, which makes Naive Bayes a fast and effective model for certain types of data.

## 3. Sample Code

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Sample dataset creation
data = {
    "Credit_Score": [750, 680, 720, 610, 590, 770, 650, 740, 580, 600],
    "Income": [75000, 45000, 55000, 32000, 27000, 90000, 35000, 62000, 25000, 28000],
    "Loan_Amount": [200000, 100000, 150000, 80000, 70000, 220000, 90000, 160000, 60000, 85000],
    "Loan_Approved": [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]  # 1 = Approved, 0 = Rejected
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Features and target variable
X = df[["Credit_Score", "Income", "Loan_Amount"]]
y = df["Loan_Approved"]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Naive Bayes classifier
nb_model = GaussianNB()

# Train the model
nb_model.fit(X_train, y_train)

# Make predictions
y_pred = nb_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:\n", classification_report(y_test, y_pred))

# Predict a new data point. Passing a DataFrame with the training column names
# avoids scikit-learn's warning about missing feature names.
new_data = pd.DataFrame([[700, 50000, 120000]], columns=X.columns)
prediction = nb_model.predict(new_data)
print("Prediction for new data:", "Approved" if prediction[0] == 1 else "Rejected")
```

Output:

```text
Accuracy: 1.0
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         2
           1       1.00      1.00      1.00         1

    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3

Prediction for new data: Approved
```
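The accuracy above is measured on only three held-out records, so it says little about how the model generalizes. One possible way to get more out of ten records is leave-one-out cross-validation. The sketch below rebuilds the same toy dataset as a NumPy array; the choice of `LeaveOneOut` is an assumption made here, not part of the original example.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Same ten loan records as in the sample: Credit_Score, Income, Loan_Amount
X_loans = np.array([
    [750, 75000, 200000],
    [680, 45000, 100000],
    [720, 55000, 150000],
    [610, 32000, 80000],
    [590, 27000, 70000],
    [770, 90000, 220000],
    [650, 35000, 90000],
    [740, 62000, 160000],
    [580, 25000, 60000],
    [600, 28000, 85000],
])
y_loans = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 0])  # 1 = Approved, 0 = Rejected

# Train on nine records, test on the held-out one, and repeat for every record
loo_scores = cross_val_score(GaussianNB(), X_loans, y_loans, cv=LeaveOneOut())
print("Per-record results (1 = correct, 0 = wrong):", loo_scores)
print("Leave-one-out accuracy:", loo_scores.mean())
```

Each record is used for testing exactly once, so the averaged score reflects all ten records rather than a single three-record split.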




## 4. Use Cases in Finance

- **Fraud Detection:** Classifying transactions as fraudulent or legitimate based on transaction metadata and historical patterns.

- **Credit Scoring:** Categorizing loan applicants into risk categories (low, medium, high) based on credit history and other financial attributes.

- **Spam Detection in Financial Emails:** Identifying phishing or spam emails targeting financial clients using email text features.

- **Customer Segmentation:** Grouping customers into predefined categories based on spending habits, income levels, and product usage patterns.

- **Predicting Customer Attrition:** Estimating the likelihood of customers leaving a financial service or product.

- **Loan Default Prediction:** Classifying borrowers based on their likelihood of default using historical data and financial indicators.

- **Sentiment Analysis for Market Predictions:** Analyzing news headlines or social media sentiment to classify the market as bullish, bearish, or neutral.

- **Insurance Claim Classification:** Categorizing insurance claims into fraudulent or genuine based on historical claims data and customer profiles.

- **Portfolio Risk Assessment:** Classifying assets or portfolios into risk categories (low, medium, high) based on their historical performance.

- **Marketing Campaign Classification:** Identifying the most likely responders to financial product campaigns using demographic and past interaction data.


## 5. Evaluation: Comparing Logistic Regression and Naive Bayes

## Fundamental Approach

- **Logistic Regression:**
  - A discriminative model: it estimates the probability of a binary outcome directly from one or more independent variables.
  - Uses a logistic (sigmoid) function to map a linear combination of those variables to a probability between 0 and 1.

- **Naive Bayes:**
  - A classification technique based on Bayes' Theorem with an assumption of independence among predictors given the class.
  - Particularly well known for text classification, where it uses the conditional probability of each word given each class.
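For reference, the logistic function mentioned above can be written as follows. The symbols are introduced here for illustration: $x$ is the feature vector, and $\beta_0$ and $\beta$ are the intercept and coefficients that logistic regression learns.

$$
P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta^\top x)}}
\quad\Longleftrightarrow\quad
\log \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = \beta_0 + \beta^\top x
$$

The right-hand form shows that the model is linear in the log-odds. Naive Bayes, by contrast, never fits this function directly; it estimates the class priors and per-feature likelihoods and combines them with Bayes' Theorem.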

## Assumptions

- **Logistic Regression:**
  - Assumes the log-odds of the dependent variable are a linear function of the independent variables.
  - Does not require the independent variables to be independent of one another, although strongly correlated variables make the coefficient estimates unstable.

- **Naive Bayes:**
  - Assumes that all features (predictors) are independent of each other given the class, which is the 'naive' part.
  - Works best when this assumption roughly holds, but often classifies well even when it does not, especially in high-dimensional datasets.
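To see what happens when the independence assumption is violated, consider feeding the same piece of evidence to Naive Bayes twice, as if it were two separate features. The sketch below uses a hypothetical `naive_posterior` helper and an invented likelihood ratio of 3; it illustrates the effect and does not measure it on real data.

```python
def naive_posterior(prior_positive, likelihood_ratios):
    """P(positive | features) when each feature multiplies the odds independently.

    Each likelihood ratio is P(feature | positive) / P(feature | negative).
    """
    odds = prior_positive / (1.0 - prior_positive)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


ratio = 3.0  # hypothetical: the feature is 3x as likely under the positive class

once = naive_posterior(0.5, [ratio])          # evidence counted once
twice = naive_posterior(0.5, [ratio, ratio])  # same evidence entered as two "independent" features

print(once)   # 0.75
print(twice)  # 0.9
```

The duplicated feature adds no new information, yet the posterior rises from 0.75 to 0.9. Correlated features therefore tend to make Naive Bayes overconfident, even in cases where the predicted class is still correct.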

## Data Suitability

- **Logistic Regression:**
  - Better suited for cases where the independent variables have a roughly linear relationship with the log-odds of the outcome.
  - Often used in binary classification problems such as spam detection, credit scoring, and disease diagnosis.

- **Naive Bayes:**
  - Highly efficient with large datasets, particularly in text classification (like spam filtering, sentiment analysis).
  - Performs well in multi-class prediction problems.
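As a small illustration of multi-class prediction, the sketch below fits `GaussianNB` to scikit-learn's bundled iris dataset, which has three classes. The dataset is an assumption chosen for convenience and is not part of this document's finance examples.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()  # 150 flowers, 4 numeric features, 3 classes
multi_nb = GaussianNB().fit(iris.data, iris.target)

# One row from each class (rows 0, 50, and 100), used only to show the output format
sample_rows = iris.data[[0, 50, 100]]
class_probs = multi_nb.predict_proba(sample_rows)

print("Classes:", multi_nb.classes_)
print("Probability per class for each row:")
print(class_probs.round(3))
```

No extra configuration is needed for more than two classes: the model keeps one prior and one set of per-feature likelihoods for each class, and `predict_proba` returns one column per class.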

## Performance

- **Logistic Regression:**
  - Provides probability estimates for outcomes and tends to be fairly robust to noise in the data.
  - Benefits from careful feature selection or regularization to avoid overfitting, and from informative features to avoid underfitting.

- **Naive Bayes:**
  - Generally faster to train, because it only needs per-class statistics for each feature, and it scales well to a large number of features.
  - Performs well even with little training data if the assumption of independence roughly holds.

## Use Cases

- **Logistic Regression:** 
  - Ideal for problems where you have a dataset with numeric and categorical variables and you want to predict a binary outcome.

- **Naive Bayes:**
  - Excellent for scenarios with large feature spaces as in text classification, where the independence assumption simplifies the computation significantly.
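The comparison above can be tried out empirically. The sketch below is one possible setup, assuming a synthetic dataset from `make_classification` and 5-fold cross-validation. Which model scores higher depends on the data, so the result should not be read as a general ranking.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data; the redundant features are linear
# combinations of the informative ones, so they violate the independence assumption.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=5, random_state=0
)

candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Gaussian Naive Bayes": GaussianNB(),
}

cv_accuracy = {}
for name, model in candidates.items():
    # Mean accuracy over 5 cross-validation folds
    cv_accuracy[name] = cross_val_score(model, X_syn, y_syn, cv=5).mean()
    print(f"{name}: {cv_accuracy[name]:.3f}")
```

Changing `n_redundant` or `n_samples` is a simple way to explore how correlated features and training-set size affect each model.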
