# **Foundations of AI and Machine Learning**

## Outline
- Machine Learning vs Instruction-based programming.
- Supervised and unsupervised learning paradigms. Examples and applications in real-world scenarios.
- Basics of model training, evaluation metrics, and performance assessment techniques.
- **Hands-on Lab:** Training a classifier.


## **Understanding Artificial Intelligence**

<img src="./images/artificial-intelligence.webp" width="700" align="center"/>

- Artificial Intelligence explores the possibility of endowing computers with intelligent behaviors, akin to those exhibited by humans.

- Today's computers, while vastly more advanced than Babbage's designs, still adhere to the fundamental concept of controlled computations.
  - They operate based on algorithms, following precise sequences of instructions to perform tasks.



## **Programming for Intelligence**

- Programming a computer to accomplish a task is feasible
  - **if** we understand the necessary sequence of steps
- Algorithms play a pivotal role in directing computers to achieve specific goals.

- Modern computing relies on the idea of **controlled computations**
  - Computers execute tasks methodically, relying on algorithms to guide their actions.




## **Weak AI vs. Strong AI**

<img src="./images/WAIS.png" width="500" align="center"/>

#### **Weak AI**

- Weak AI refers to systems designed to solve **specific**, **narrowly-defined** tasks.
  - Identifying a person's age from a photo.
- These systems do not possess **general intelligence** or the ability to perform a wide range of tasks like a human being.
  - Lacks the cognitive abilities associated with human intelligence.
- Can you think of an example of Weak AI in your own environment?

#### **Strong AI (Artificial General Intelligence - AGI)**

- Also known as Artificial General Intelligence (AGI).
  - Aims to create a system with human-like intelligence.
- It seeks to develop computers that can perform a broad spectrum of tasks, similar to **human cognitive** capabilities
- Achieving Strong AI is a complex and ambitious goal
  - As it involves replicating human-level intelligence and understanding

<img src="./images/border.jpg" height="10" width="1500" align="center"/>


## **The Turing Test**

<img src="./images/what-is-the-turing-test.jpg" width= 400 align="center"/>

####  Alan Turing's Turing Test

- Alan Turing proposed the Turing Test as a means to assess the intelligence of a computer system.
- The test compares the system's responses to those of a human being in a text-based dialogue.
- The goal is for the system to mimic human-like responses to the extent that a human interrogator cannot reliably distinguish between the two.



<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Understanding Human Intelligence for AI**


- To make a computer behave like a human, we must model our way of thinking.
- Understanding human intelligence is essential for programming it into a machine.
- Human decision-making involves subconscious and reasoning processes.

#### **Two Approaches to AI**

##### 1. Top-down Approach (Symbolic Reasoning)

- **Characteristics:**
  - Models human reasoning.
  - Involves extracting and representing human knowledge in a computer-readable form.
  - Requires modeling reasoning processes inside a computer.
- **Example**
  - Automated traffic signal control systems
    - Predefined rules and logic dictate when to change traffic lights based on traffic flow and patterns

<img src="./images/traffic.jpeg" width= 500 align="center"/>

#####  2. Bottom-up Approach (Neural Networks)

- **Characteristics:**
  - Models the structure of the human brain.
  - Utilizes neurons as simple units that compute a weighted sum of their inputs and pass it through an activation function.
  - Training neural networks with data enables them to solve practical problems.
- **Example**
  - Image recognition
      - Starts with individual pixels
      - Gradually builds complex features and patterns
        - Identify objects without explicit programming

<img src="./images/imagerecog.jpg" width= 500 align="center"/>

#### Other Approaches

- **Emergent, Synergetic, or Multi-agent Approach:**
  - Complex intelligent behavior arises from interactions among numerous simple agents.
  - Intelligence emerges from reactive behavior during metasystem transition.

- **Evolutionary Approach (Genetic Algorithm):**
  - Utilizes optimization principles based on the concept of evolution.
  - Mimics natural selection to optimize solutions.

## **The Top-Down Approach: Symbolic Reasoning**

#### Modeling Human Reasoning

- In the top-down approach, we aim to model **human reasoning**
- We formalize the **thought processes** that guide human decision-making.
- This approach is known as **symbolic reasoning**.
- Example: Recipe Recommendation
  - How a human chef thinks when creating recipes
  -  "spicy + meat = Mexican cuisine"

#### Rule-Based Decision Making

- Human decision makers often follow **internal rules** when solving problems
  - A doctor diagnosing a patient may use rules to connect symptoms and potential causes
- By applying a set of rules to a specific problem, a computer can reach a conclusion in a similar way.
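
As a rough illustration of rule-based decision making, the sketch below stores the recipe rule quoted earlier ("spicy + meat = Mexican cuisine") as data and applies it to a set of observed facts. The second rule, the `RULES` table, and the function name `infer` are invented for this example; real expert systems use much richer knowledge representations.

```python
# Each rule pairs a set of required facts with a conclusion.
# The first rule mirrors the recipe example; the second is hypothetical.
RULES = [
    ({"spicy", "meat"}, "Mexican cuisine"),
    ({"sweet", "baked"}, "dessert"),
]


def infer(facts):
    """Return the conclusions of every rule whose conditions are all present."""
    facts = set(facts)
    return [conclusion for conditions, conclusion in RULES if conditions <= facts]


print(infer(["spicy", "meat", "grilled"]))  # matches the first rule only
print(infer(["salty"]))                     # no rule applies
```

Note how every behavior has to be written down by hand: a dish that fits no rule simply produces no answer, which is the "too many rules" limitation in miniature.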

#### Knowledge Representation and Reasoning

- Central to the top-down approach is **knowledge representation** and **reasoning**.
- Extracting knowledge from human experts can be **challenging**
  - Experts sometimes arrive at conclusions without explicit reasoning.

#### Challenges in Knowledge-Based Tasks

- Some tasks, like determining a person's age from a photograph, **cannot** be reduced solely to knowledge manipulation
- Complex, nuanced tasks (e.g., diplomatic negotiations) may not align well with symbolic reasoning

<img src="https://raw.githubusercontent.com/wsko/hands-on-gen-ai-2/main/images/border.jpg" height="10" width="1500" align="center"/>

## **The Bottom-Up Approach: Artificial Neural Networks**

#### Modeling Neurons

- Alternatively, we can model the basic elements of the human brain: neurons.
- Artificial neural networks emulate the structure and function of neurons.
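
A single artificial neuron can be sketched in a few lines. The version below assumes the common textbook form (a weighted sum of the inputs plus a bias, passed through a sigmoid activation); the weights and inputs are made-up numbers, and other activation functions are equally valid.

```python
import math


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# With zero weighted input the sigmoid sits at its midpoint.
print(neuron([0.0, 0.0], [0.5, -0.3], 0.0))  # 0.5
# A strongly positive weighted sum pushes the output towards 1.
print(neuron([1.0, 1.0], [4.0, 4.0], 0.0))
```

A neural network stacks many such units, and training adjusts the `weights` and `bias` values rather than the code.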

#### Learning by Example

- Like a newborn learning by observation, we teach artificial neural networks through examples.
- By providing training data, the network learns to solve problems by adjusting its internal connections.




## Discussion

- Identify where AI is most effectively utilized.
- AI applications are prevalent across various domains, enhancing user experiences and enabling new functionalities.
  - Mapping Applications
  - Speech-to-Text Services
  - Video Games

<img src="./images/border.jpg" height="10" width="1500" align="center"/>


## **Machine Learning and Neural Networks**


<img src="./images/ML.png" width="500" align="center"/>


- Machine Learning is a field within Artificial Intelligence focused on **training computer models** using **data** to **solve problems**

- In Machine Learning, we work with datasets consisting of input examples (X) and corresponding output values (Y)
  - Examples are typically represented as N-dimensional vectors with features.
  - Outputs are referred to as labels.

## **Common Machine Learning Problems**

- Two fundamental Machine Learning problems are:
  - **Classification:** Assigning input objects into two or more predefined classes or categories.
    - Medical diagnostics
  - **Regression:** Predicting numerical values for each input sample.
    - Housing prices

<img src="./images/house.png" width="500" align="center"/>

## **Tensor Representation**

<img src="./images/tensors.jpeg" width="500" align="center"/>

- In Machine Learning, data is often represented as tensors.
- The input dataset is a matrix of size M×N, where M is the number of samples and N is the number of features.
- Output labels (Y) are a vector of size M.
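
To make the shapes concrete, here is a small sketch with NumPy. The four houses, their three features, and the prices are invented purely for illustration.

```python
import numpy as np

# Hypothetical toy dataset: M = 4 houses, N = 3 features each
# (area in square metres, number of rooms, age in years).
X = np.array([
    [120.0, 4, 10],
    [85.0, 3, 25],
    [200.0, 6, 2],
    [60.0, 2, 40],
])
# One label per sample (made-up sale prices).
y = np.array([310_000, 195_000, 540_000, 120_000])

M, N = X.shape
print(X.shape)  # (4, 3): M samples by N features
print(y.shape)  # (4,): one label per sample
```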



### Machine Learning definition


__“The field of study that gives computers the ability to learn without being explicitly programmed” (Arthur Samuel, 1959)__


__Example: Self-driving cars__
- Rule-based: Tell the car the rules for all possible scenarios
- Machine Learning: Let the car record the scenery and your reactions, then let it predict the next reaction





### Rule-based systems: examples and limitation

- Credit card fraud detection https://fraud.net/d/rules-based-fraud-detection/
- Loan application approval  https://www.researchgate.net/publication/220841474_Presenting_a_Rule_Based_Loan_Evaluation_Expert_System
- __Problem:__ too many rules.
    - Rule-based systems work only in specific domains with a limited set of clear rules, e.g. chess

### Learning from examples without explicit programming
- Self-driving vehicles
- Image classification https://www.kaggle.com/competitions/dogs-vs-cats
- Language translation
    
### Resources

- __Book__
    - https://hastie.su.domains/Papers/ESLII.pdf


- __Communities__
    - https://www.kdnuggets.com/
    - https://www.kaggle.com/


- __Key Influencers__
    - Andrew Ng
    - Yann LeCun

### scikit-learn
- https://scikit-learn.org/stable/

### What Are Some Common Questions Asked in Data Science?

**Machine learning more or less asks the following questions:**

- Does X predict Y?
- Are there any distinct groups in our data?
- What are the key components of our data?
- Is one of our observations “weird”?

**From a business perspective, Data Science can help us with use cases such as:**

- What is the likelihood that a customer will buy this product?
- Is this a good or bad review?
- How much demand will there be for my service tomorrow?
- Is this the cheapest way to deliver my goods?
- Is there a better way to segment my marketing strategies?
- What groups of products are customers purchasing together?
- Can we automate this simple yes/no decision?


## Are AI and Machine Learning different things?

The AI onion

- Artificial intelligence is an umbrella term that covers machine learning and deep learning
- Deep learning and neural networks are also types of machine learning algorithms  
- How does a Data Scientist differ from a Machine Learning Engineer?


<img src="./images/onion.png?raw=1" width="270" height="270" align="center"/>


## AI History

<img src="./images/aiml.png?raw=1" width="800" align="center"/>


<img src="./images/aiwinters.png?raw=1" width="800" align="center"/>

## Why now?

---

In the last few years there have been many advancements in the technologies that enable AI:
> - Compute Power
> - Big Data
> - Powerful Algorithms

<img src="./images/whynow.png?raw=1" width="500" align="center"/>

Read more here: https://www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/an-executives-guide-to-ai

Can you mention examples of advancements in the above technologies?

# Data collected every minute
<img src="./images/data.png?raw=1" width="400" height="400" align="center"/>


## Why is AI powerful?
<img src="./images/ny-vs-sf.jpg" height="350" align="center"/>



<img src="./images/nysf.png"  height="350" align="center"/>


Check the demo here:
http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

<a id="dswf"></a>
## Introduction: The Data Science Workflow

---
- **Understand the Business Problem**: Develop a hypothesis-driven approach to your analysis.
- **Data Acquisition and Understanding**: Select, import, explore, and clean your data.
- **Build a Model**: Engineer features, train candidate models, evaluate them, and select the best one.
- **Deployment**: deploy your model in production and deliver ROI!

<img src="./images/lifecycle.png" height="650" align="center"/>



# This is what data scientists do

Data scientists identify relevant questions, collect data from a multitude of different data sources, organize the information, translate results into solutions, and communicate their findings in a way that positively affects business decisions. These skills are required in almost all industries, causing skilled data scientists to be increasingly valuable to companies.

<img src="./images/timewise.png" height="400" align="center"/>



### Data scientist vs. machine learning engineer
There is some overlap between the two roles, which is why some data scientists with software engineering backgrounds move into machine learning engineer roles. Data scientists focus on analyzing data, providing business insights, and prototyping models, while machine learning engineers focus on coding and deploying complex, large-scale machine learning products.

## What do data engineers, analysts and architects do?

ETL and data cleaning are the most time-consuming steps

> Data scientists work with machine learning engineers to move their models to production

<img src="./images/time.jpg" width="900" align="center"/>




<img src="./images/border.jpg" height="10" width="1500" align="center"/>

# Data Science Lifecycle Step by Step

# Step 1. Business Understanding

---

- Identify the business/product objectives.
- Identify and hypothesize goals and criteria for success.
- Create a set of questions to help you identify the correct data set.

## An Example Use Case
We work for a real estate company interested in using data science to determine the best properties to buy and resell. Specifically, the company would like to identify the characteristics of residential houses that predict their sale price, and to assess the cost-effectiveness of doing renovations.

> #### Identify the Business/Product Objectives

The customer tells us their business goals are to accurately predict prices for houses (so that they can sell them for as large a profit as possible) and to identify which kinds of features in the housing market would be more likely to lead to foreclosure and other abnormal sales (which could represent more profitable sales for the company).

> #### Identify and Hypothesize Goals and Criteria for Success

Ultimately, the customer wants us to:
* Deliver a presentation to the real estate team.
* Write a business report discussing results, procedures used, and rationales.
* Build an API that provides estimated returns.

> #### Create a Set of Questions to Help You Identify the Correct Data Set

* Can you think of questions that would help this customer deliver on their business goals?
* What sort of features or columns would you want to see in the data?

# Step 2. Data Acquisition

**Ideal Data vs. Available Data**

Oftentimes, we'll start by identifying the *ideal data* we would want for a project.

Then, during the data acquisition phase, we'll learn about the limitations on the types of data actually available. We have to decide if these limitations will inhibit our ability to answer our question of interest or if we can work with what we have to find a reasonable and reliable answer.

For example, we provide a set of housing data for Ames, Iowa, which [includes](./extra-materials/ames_data_documentation.txt):

- 20 continuous variables indicating square footage.
- 14 discrete variables indicating number of each room type.
- 46 categorical variables containing 2–28 classes each, e.g., street type (gravel/paved) and neighborhood (city district name).

---



### **Review the Dataset**

Take a moment to look through the data description. How closely does the set match the ideal data that you envisioned? Would it be sufficient for our purposes? What limitations does it have?

---

This is possibly the hardest step in the data science workflow. At this stage, it's common to realize that the problem you're trying to solve may not be solvable with the information available. The data could be incomplete, nonexistent, or unable to meet the criteria necessary to answer your question.

That said, you now have a better feel for the data that's available and the information they could contain. You can now identify a new, answerable question that ultimately helps you solve or better understand your problem.

## 2.1 Data Wrangling & Cleaning

This is by far the most time-consuming step of the Data Science Lifecycle.

For the Ames housing dataset we discussed,
- What if the data are in different databases and we have to consolidate them?
- What if the values for some columns in the dataset are missing or in the wrong format?


<img src="./images/datac.png" width="400" height="400" align="center"/>


**We will review and practice the data cleaning process as part of this course.**


# Step 3. Modeling
**What is a Model?**

- Using Machine Learning algorithms, we build a model from input data (image, text, ...)
> - For the housing data set discussed above, we can build a model that learns to predict the price of a house
- The resulting model is a compact representation of the patterns in the data used for training

<img src="./images/model.png"  height="400" align="center"/>

> - The size of the output model can be a lot smaller than the training data

## There are many algorithms that can be used to build a model

<img src="./images/modelS.png" height="500" align="center"/>

> - Depending on the use case, requirements and available data, a model will be selected!

## Data scientists use one of these available algorithms and tune it for their use case
> - Most of these algorithms are available in public, open-source libraries

> - Most data scientists do not build their own algorithms; they customize and tune an existing algorithm

<a id="common-ml-defs"> </a>
## 3.1 Supervised  vs. Unsupervised Learning

There are two main categories of machine learning: supervised learning and unsupervised learning.

**Supervised learning (a.k.a., “predictive modeling”):**  
_Classification and regression_
- Predicts an outcome based on input data.
    - Example: Predicts whether an email is spam or ham.
- Attempts to generalize.
- Requires past data on the element we want to predict (the target).

**Unsupervised learning:**  
_Clustering and dimensionality reduction_
- Extracts structure from data.
    - Example: Segmenting grocery store shoppers into “clusters” that exhibit similar behaviors.
- Attempts to represent.
- **Does not require** past data on the element we want to predict.
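
The contrast can be sketched with scikit-learn on synthetic data. The two blobs below are invented for the example; `LogisticRegression` and `KMeans` stand in for "a supervised model" and "an unsupervised model" and are not the only choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Two well-separated synthetic blobs (made-up data, 20 points each).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(5.0, 0.5, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)  # labels exist only in the supervised case

# Supervised: learn the mapping from X to the known labels y.
clf = LogisticRegression().fit(X, y)
print("predicted class:", clf.predict([[4.8, 5.1]]))

# Unsupervised: no labels given; the algorithm groups similar rows itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```

The classifier needs `y` to train, while the clustering step never sees it. The cluster numbers it assigns are arbitrary and need not match the original labels.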

<img src="https://github.com/wsko/machinie-learning-fundamentals-ps/blob/main/AI%20and%20ML%20Concepts/images/sup.png?raw=1" width="800" align="center"/>


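Oftentimes, we may combine both types of machine learning in a project, for example using unlabeled data to learn a better representation and so reduce the cost of collecting labels. This is usually called semi-supervised learning; reusing a representation learned on one task for a different task is referred to as transfer learning.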

Unsupervised learning tends to present more difficult problems because its goals are amorphous. Supervised learning has goals that are almost too clear and can lead people into the trap of optimizing metrics without considering business value.

## 3.2 Feature Engineering

#### Data Enrichment

- Machine learning algorithms need the data to be engineered before they consume it


<img src="./images/garbage.png" height="300" align="center"/>

> - We need feature engineering to enrich the raw data


> - Brainstorm features.
> - Create features.
> - Check how the features work with the model.
> - Repeat from the first step until the features work well.


Put more formally, here is a definition of feature engineering:

### Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.
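
A minimal sketch of what this can look like with pandas. The column names and values are hypothetical (they are not taken from the Ames data documentation), and which derived features actually help depends on the model and the data.

```python
import pandas as pd

# Hypothetical raw housing records; column names are illustrative only.
raw = pd.DataFrame({
    "year_built": [1995, 2008, 1970],
    "year_sold": [2010, 2010, 2009],
    "first_floor_sqft": [900, 1200, 750],
    "second_floor_sqft": [600, 0, 0],
    "street": ["paved", "paved", "gravel"],
})

features = pd.DataFrame({
    # Age at sale is often more informative than two separate year columns.
    "age_at_sale": raw["year_sold"] - raw["year_built"],
    # Combine per-floor areas into one total living area.
    "total_sqft": raw["first_floor_sqft"] + raw["second_floor_sqft"],
    # Encode a two-level categorical variable as 0/1.
    "street_paved": (raw["street"] == "paved").astype(int),
})
print(features)
```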

## 3.3 OverFitting and UnderFitting


**What is a good model?**

Arguably, machine learning models have one sole purpose: to generalize well.

> Generalization is the model’s ability to give sensible outputs to sets of input that it has never seen before.

### Example: which model (red) has the best ability to generalize from the training data (blue)?

<img src="./images/under-over-fit.png" width="900" align="center"/>
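
The same effect can be reproduced numerically. The sketch below fits polynomials of increasing degree to synthetic noisy data and compares the error on the training points with the error on fresh points; the target curve, noise level, and degrees are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)


def make_data(n):
    """Noisy samples of a smooth curve (synthetic data for illustration)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, size=n)
    return x, y


x_train, y_train = make_data(30)
x_test, y_test = make_data(200)


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


results = {}
for degree in (1, 3, 12):
    coefs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    results[degree] = (
        mse(y_train, np.polyval(coefs, x_train)),  # error on seen data
        mse(y_test, np.polyval(coefs, x_test)),    # error on unseen data
    )
    print(f"degree {degree:2d}: train MSE {results[degree][0]:.3f}, "
          f"test MSE {results[degree][1]:.3f}")
```

The training error can only go down as the polynomial gets more flexible, so it cannot reveal overfitting on its own. An underfit model scores poorly on both sets, while an overfit model typically shows a widening gap between training and test error.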






## 3.4 Test Train Split

Should we use all the data for training a model?


> Data scientists usually hold out part of the data to test model performance.

<img src="./images/ttsplit.jpg?raw=1" width="500" align="center"/>

> If we use all the data for training, we have no way of evaluating how the model performs on data it has not seen.







### Cross Validation

> Why rely on a single fixed train/test split when we can evaluate on several different combinations of training and test data?

<img src="./images/cross.png" width="350" align="center"/>

Instead of using one fixed split of the data into training and test sets, we can use cross validation.
> - In cross validation, we split the data several times, each time using a different part for testing and the rest for training

> - The average performance across these splits is then reported as the final performance
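
A short sketch of k-fold cross validation with scikit-learn, assuming a linear regression model and synthetic data. The choice of 5 folds and of MSE as the score is illustrative, not prescribed by the course material.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data: y depends linearly on two features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.5, size=100)

# 5 folds: each fold serves once as the test set, the other 4 as training data.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
neg_mse = cross_val_score(LinearRegression(), X, y, cv=cv,
                          scoring="neg_mean_squared_error")

fold_mse = -neg_mse  # scikit-learn reports negated MSE so that higher is better
print("MSE per fold:", np.round(fold_mse, 3))
print("mean MSE:", round(float(fold_mse.mean()), 3))
```

The spread of the per-fold scores is also informative: a large spread suggests the estimate from any single split would have been unreliable.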

# Step 4. Use Cases
**What are some of the use cases for AI/ML?**

- Nearly all occupations will be affected by automation, but only about 5 percent of occupations could be fully automated by currently demonstrated technologies.
- Many more occupations have portions of their constituent activities that are automatable: we find that about 30 percent of the activities in 60 percent of all occupations could be automated.

<img src="./images/usecase.svg?raw=1" width="800" align="center"/>

## 4.1  Example AI Use Cases

> **Instructor Note**: This is a good section in which to provide your own work (or side project) experience as well! These are just a couple of options:
- [This Person is not real](https://thispersondoesnotexist.com/)
- [Google Quick Draw](https://quickdraw.withgoogle.com/)
- [Deep Dream Generator](https://deepdreamgenerator.com/)
- Add your own!

## AI Ethics


### Data-Biasing

The quality of your model is usually a direct result of the quality and quantity of your data.

You can imagine a myriad of situations in which classification problems could go wrong because of bias in past data. From an ethical perspective, I think we can all agree that systems which discriminate against individuals on the basis of race, gender, age, ethnicity, etc. are unacceptable.

Some bad outcomes:
> Security systems trained to discriminate based on an individual’s race or gender.

> An AI based resume review tool that values the gender of applicants

> Facial recognition systems that lack a diverse training set, resulting in only detecting the race for which they are trained

> Court systems (AI judges/juries) with past biased rulings against certain races as the training data



<img src="./images/border.jpg" height="10" width="1500" align="center"/>


# Machine Learning Algorithms


### Linear Regression

- Supervised learning algorithm

- Maps Features _X_ to Label _y_ using a linear expression: $y = c_0 + c_1 x_1 + c_2 x_2 + \dots + c_N x_N$

- Label _y_ is a continuous numerical value

- Training a linear regression model means computing its coefficients to minimize model error (optimization problem)

- Parametric model
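
To see what "computing the coefficients" means, the sketch below solves the least-squares problem directly with NumPy on noise-free synthetic data, so the recovered coefficients can be checked against the ones used to generate it. This is one standard way to solve the problem, shown for illustration; it is not a description of how any particular library implements it.

```python
import numpy as np

# Noise-free synthetic data generated from y = 1 + 2*x1 - 3*x2.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Prepend a column of ones so the intercept c_0 is learned like any other coefficient.
A = np.column_stack([np.ones(len(X)), X])

# Least squares: choose c to minimize the sum of squared errors ||A c - y||^2.
coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coefs, 6))  # c_0, c_1, c_2
```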

## Example: Tips Data

- Can we predict the amount of a tip from the total amount of the restaurant bill?
#### Univariate linear regression

### Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression

from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split

In [None]:
df = pd.read_csv("./data/tips.csv")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

### Create Feature and Label arrays

- Note that `sklearn` requires data in the form of numerical (NumPy) arrays:
    - Label array (1d)
    - Features array (2d)

In [None]:
X = df[["total_bill"]].to_numpy()  ## feature
y = df['tip'].to_numpy()  ## label

In [None]:
# X

### Train a Linear Regression model

In [None]:
### instantiate a blank linear regression model as a Python object

lrm = LinearRegression()

In [None]:
### use the .fit() method to train the model on X and y. Note that .fit() updates the model object in place (and also returns it)

lrm.fit(X, y)

In [None]:
## display model parameters. What story do they tell?
pd.Series([lrm.coef_[0], lrm.intercept_], index = ['slope', 'intercept'])

In [None]:
## predict Tip from new values of Total Bill

lrm.predict(np.array([[0],[ 100]]))

#### Model Interpretation

- What is the meaning of slope and intercept?

In [None]:
## how well does our model predict Tip? What causes prediction errors?
df['tip_predicted'] = lrm.predict(X)
df.head()

In [None]:
ax = sns.scatterplot(data = df, x = 'total_bill', y = 'tip', label = 'data')
ax = sns.scatterplot(data = df, x = 'total_bill', y = 'tip_predicted', label = 'model')
ax.legend()
ax.set_xlim(0, 55)
ax.set_ylim(0, 11)
plt.show()

### Model Scoring

- How good is our predictive model? How do we measure prediction errors?
- Compare known and predicted values of the label `y` vs `y_pred`

#### Mean Squared Error

    
<img src="./images/linear_regression.png" width="400">



In [None]:
mean_squared_error(y, lrm.predict(X))

- MSE is the mean of the squared differences between the known and predicted values of `y`
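
Written out, $\mathrm{MSE} = \frac{1}{M}\sum_{i=1}^{M}(y_i - \hat{y}_i)^2$, where $\hat{y}_i$ is the prediction for sample $i$. A hand-rolled version makes the arithmetic explicit; the function name `mean_squared_error_manual` is made up here to avoid clashing with the `sklearn` import.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between known and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Errors are 0, 0 and 2, so the squared errors are 0, 0 and 4.
print(mean_squared_error_manual([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # 4/3
```

Because the errors are squared, a single large miss contributes far more than several small ones.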

## Multivariate Linear Regression

- Linear regression requires all features to be numerical
- To convert categorical data into numerical, use one hot encoding (`pd.get_dummies()`)

In [None]:
df = pd.read_csv("./data/tips.csv")
## feature engineering: transform categorical data into numerical using one-hot encoding
df_feature_engineered = pd.get_dummies(df, drop_first=True).astype('float')
df_feature_engineered.head()

- Next, let's build numerical arrays X and y and train a LR model

In [None]:
label = 'tip'
features = list(df_feature_engineered.columns)
features.remove(label)
features

In [None]:
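X = df_feature_engineered[features].to_numpy()
y = df_feature_engineered[label].to_numpy()  ## rebuild the label from the same frame so X and y stay aligned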

In [None]:
X.shape

In [None]:
lrm = LinearRegression()
lrm.fit(X, y)

- Model slopes: now we have one coefficient per feature

In [None]:
pd.Series(lrm.coef_, index = features)

- Fit on the training data improves as we add more features (for ordinary least squares, training MSE cannot increase)

In [None]:
mean_squared_error(y, lrm.predict(X))

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## Are we scoring our models correctly?

So far, we have been computing MSE on the data used to train the model. This may lead to over-optimistic scores, because the model has already seen this data during training.

We need to understand our model's performance on new (previously unseen) data.

##### To simulate "new data", we can split the original data set into two parts
- Training set (50-90 %)
- Test set (the remainder)
    - random splitting works and is preferable for most data types (time-ordered data is a common exception)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state=12)

In [None]:
lrm = LinearRegression()
lrm.fit(X_train, y_train)

In [None]:
### training error

mean_squared_error(y_train, lrm.predict(X_train))

In [None]:
### test error

mean_squared_error(y_test, lrm.predict(X_test))

### Discussion: which MSE is more important for model performance?

- Train error
    - OR
- Test error

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## Learning as an Iterative Process

__Generally, machine learning is approached as an optimization problem and is solved numerically using...__

### Gradient Descent (SGD)
"S" is for "stochastic"

#### Definitions

- Gradient descent (steepest descent) is an iterative optimization algorithm for finding a minimum of a function

- In ML, gradient descent finds a set of model parameters (coefficients) to minimize a loss function such as MSE

- Mathematically, to minimize means to find a "valley" where the gradient (the first derivatives of the loss function w.r.t. the model coefficients) equals `0`

#### How does SGD work?


<img src="./images/gradient_descent_01.gif" alt="Gradient descent" height = "600" width="600">

<img src="./images/gradient_descent_02.gif" alt="Gradient descent" width="900">


- Optional demo (homework): https://remykarem.github.io/backpropagation-demo/
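
The update rule can be sketched in plain Python for a one-feature linear model with MSE loss. Note that this is full-batch gradient descent, which uses every sample at each step; the stochastic variant would estimate the gradient from a random sample or mini-batch instead. The data, learning rate, and step count are arbitrary choices for the example.

```python
# Noise-free synthetic data from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w*x + b by repeatedly stepping against the MSE gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))  # dMSE/dw
        grad_b = 2.0 / n * sum(errors)                             # dMSE/db
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = gradient_descent(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")  # approaches the true values 2 and 1
```

The learning rate controls the step size: too small and convergence is slow, too large and the updates overshoot the valley and diverge.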



<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## Classification

- Logistic Regression
- K Nearest Neighbor (KNN)  Classifier
- Random Forest Classifier


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression

# import data preprocessing functions
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# import various metrics for model scoring

from sklearn.metrics import classification_report
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

import warnings
warnings.filterwarnings('ignore')

## Predict insurance claims

Let's predict if a health insurance policy holder will make a major claim.
- Import insurance2 dataset (originally from kaggle.com):

In [None]:
df = pd.read_csv("./data/insurance2.csv")
df.head()
## insuranceclaim is a categorical label >> Classification

In [None]:
df.info()

- Feature engineering: perform necessary data transformation

In [None]:
### region is stored as an integer code but is really a categorical variable, so cast it to str to have it one-hot encoded
df['region'] = df['region'].astype('str')


### use pd.get_dummies to encode all categorical variables (note they still can be "str" typed)
df = pd.get_dummies(df, drop_first=True).astype('float')
df.head()

In [None]:
### create numpy arrays X and y
label = 'insuranceclaim'
features = list(df.columns)
features.remove(label)
X = df[features].to_numpy()
y = df[label].to_numpy()
print(X.shape, y.shape)

In [None]:
## name label classes
label_classes = ['Not_filed', 'Filed']

In [None]:
### split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

### perform min-max scaling using a sklearn function MinMaxScaler() - use fit and transform methods
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
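
The scaler above is fitted on the training set only and then applied to both sets, so that no information from the test set leaks into preprocessing. The arithmetic behind that fit/transform pattern can be sketched by hand; the `_demo` arrays are made up and deliberately named so they do not overwrite the notebook's variables.

```python
import numpy as np

# Hypothetical single-feature data, already split into train and test parts.
X_train_demo = np.array([[10.0], [20.0], [30.0]])
X_test_demo = np.array([[15.0], [40.0]])

# "fit": learn the per-column minimum and range from the training data only.
col_min = X_train_demo.min(axis=0)
col_range = X_train_demo.max(axis=0) - col_min

# "transform": apply the same training statistics to both sets.
train_scaled = (X_train_demo - col_min) / col_range
test_scaled = (X_test_demo - col_min) / col_range

print(train_scaled.ravel())  # training values span exactly 0 to 1
print(test_scaled.ravel())   # test values may fall outside that range
```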

## Logistic Regression

<img src="./images/logistic_regression.webp" width="600" align="center"/>

In [None]:
lr = LogisticRegression()
lr.fit(X_train, y_train)


## Scoring a Classifier:
- MSE is a suitable score for numerical labels, but not for classes
- Classifier metrics are computed from the numbers of correctly vs incorrectly predicted classes

<img src="./images/confusion_matrix.png" width="700" align="center"/>
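
The four standard metrics follow directly from the confusion-matrix counts. The sketch below uses the usual definitions for a binary classifier; the counts are invented, and for brevity it does not guard against division by zero (for example, a model that never predicts the positive class).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of the predicted positives, how many are right
    recall = tp / (tp + fn)      # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
print(classification_metrics(tp=40, fp=10, fn=20, tn=30))
```

With these counts accuracy is 0.7, precision 0.8, and recall about 0.67, which shows how a single accuracy figure can hide the difference between the two kinds of error.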

In [None]:
# Score a fitted classifier on the test set and plot its confusion matrix
# (uses X_test, y_test and label_classes defined in the cells above)
def classifier_scoring(model):

  # predict once and reuse the result for every metric
  y_pred = model.predict(X_test)
  cm = confusion_matrix(y_test, y_pred)

  print(f'Accuracy: {accuracy_score(y_test, y_pred):.4f}')
  print(f'Precision: {precision_score(y_test, y_pred):.4f}')
  print(f'Recall: {recall_score(y_test, y_pred):.4f}')
  print(f'F1 Score: {f1_score(y_test, y_pred):.4f}')

  # Plot the confusion matrix (rows: actual classes, columns: predicted classes)
  plt.figure(figsize=(6, 4))
  sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
              xticklabels=label_classes, yticklabels=label_classes)
  plt.xlabel('Predicted')
  plt.ylabel('Actual')
  plt.title('Confusion Matrix')
  plt.show()




In [None]:
classifier_scoring(lr)

- Model interpretation

In [None]:
## model parameters (coefficients)
##
pd.Series(lr.coef_[0], index = features)
## feature importance order: bmi, children, smoker, age, etc. .....

#### Logistic Regression Summary

- model type: generalized linear, parametric
- model expression: sigmoid function of a linear expression, with the resulting probability compared against a threshold
- assumptions: the log-odds of the positive class are a linear function of the features (linear decision boundary)
- python implementation: sklearn.linear_model.LogisticRegression
- hyperparameters: regularization parameters (we will discuss regularization later)
- interpretability: high
- scalability: high
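
The "model expression" line can be made concrete with a few lines of plain Python. The coefficients below are hypothetical, not the ones fitted on the insurance data, and the threshold handling is a sketch and may differ from a library's behavior exactly at the boundary.

```python
import math


def predict_proba(features, coefs, intercept):
    """Sigmoid of the linear expression intercept + sum(coef * feature)."""
    z = intercept + sum(c * x for c, x in zip(coefs, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(features, coefs, intercept, threshold=0.5):
    """Return 1 when the predicted probability reaches the threshold, else 0."""
    return int(predict_proba(features, coefs, intercept) >= threshold)


# Hypothetical coefficients for two features.
coefs, intercept = [1.5, -2.0], 0.25
print(predict_proba([1.0, 0.5], coefs, intercept))  # z = 0.75
print(predict_class([1.0, 0.5], coefs, intercept))
print(predict_class([0.0, 1.0], coefs, intercept))  # z = -1.75
```

Raising the threshold above 0.5 makes the model more conservative about predicting the positive class, trading recall for precision.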


https://towardsdatascience.com/comparative-study-on-classic-machine-learning-algorithms-24f9ff6ab222

## K nearest neighbors (KNN)


- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.DistanceMetric.html


- To predict on new data X_new, find the k nearest neighbours in the training data set (using a distance metric such as "euclidean" or "manhattan") and take a majority vote on the class. Optionally, compute the probability (share of votes) for each class.
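
The voting procedure is simple enough to sketch without a library. The version below assumes Euclidean distance and calls the number of neighbours `k`; the training points are made up, and a real implementation would use faster neighbour search than sorting every distance.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, x_new, k=3):
    """Majority vote among the k training points closest to x_new (Euclidean)."""
    distances = [(math.dist(x, x_new), label) for x, label in zip(X_train, y_train)]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    winner, count = votes.most_common(1)[0]
    return winner, count / k  # predicted class and its share of the votes


# Made-up 2-D training points with two classes.
X_demo = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (7.0, 6.5)]
y_demo = ["A", "A", "A", "B", "B"]

print(knn_predict(X_demo, y_demo, (1.2, 1.4)))  # all 3 neighbours are class A
print(knn_predict(X_demo, y_demo, (6.5, 6.0)))  # 2 of 3 neighbours are class B
```

There is no training step beyond storing the data, which is why prediction cost grows with the size of the training set.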

In [None]:
from sklearn.neighbors import KNeighborsClassifier

## defaults are n_neighbors=5 with Euclidean distance; to tune, try e.g. KNeighborsClassifier(n_neighbors=18, metric='euclidean')
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
#print(classification_report(y_test, knn.predict(X_test), target_names = ['Not filed', 'Filed'])[:220])
#print("confusion matrix:\n", confusion_matrix(y_test, knn.predict(X_test)))

In [None]:
# Score the KNN classifier on the test set

classifier_scoring(knn)

#### KNN and Overfit

- What hyperparameters does KNN have?
- What makes KNN overfit?


    
<img src="./images/KNN.png" alt="KNN" width="600">



#### KNN Classifier Summary

- model type: non-parametric
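- assumptions: none about the data distribution, but nearby points are expected to share a class and features should be on comparable scales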
- python implementation: sklearn.neighbors.KNeighborsClassifier
- hyperparameters: k (number of neighbours), distance type
- interpretability: low
- scalability: low (can be improved with "non-brute" distance algorithms)

## Random Forest Classifier

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

# **LAB:** Train a Random Forest Classifier

In [None]:
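from sklearn.ensemble import RandomForestClassifier

## train an ensemble of 400 decision trees
rfc = RandomForestClassifier(n_estimators=400)
rfc.fit(X_train, y_train)
## per-class precision/recall/F1; the [:220] slice keeps only the first 220 characters of the report
print(classification_report(y_test, rfc.predict(X_test), target_names = ['Not filed', 'Filed'])[:220])
## confusion matrix: rows are actual classes, columns are predicted classes
pd.DataFrame(confusion_matrix(y_test, rfc.predict(X_test)))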