<a 
 href="https://colab.research.google.com/github/LearnPythonWithRune/MachineLearningWithPython/blob/main/colab/starter/00 - Lesson - k-Nearest-Neighbors Classifier (KNN).ipynb"
 target="_parent">
<img 
 src="https://colab.research.google.com/assets/colab-badge.svg"
alt="Open In Colab"/>
</a>

# $k$-Nearest-Neighbors Classifier (KNN)

### Goal of Lesson
- Understand the difference between Classical Computing and Machine Learning
- Know the 3 main categories of Machine Learning
- Dive into Supervised Learning
- Classification with $k$-Nearest-Neighbors Classifier (KNN)
- How to classify data
- What are the challenges with cleaning data
- Create a project on real data with $k$-Nearest-Neighbor Classifier

## What is Machine Learning?

![Machine Learning](img/machine_learning.png)

- In the **classical computing model** everything is programmed into the algorithms.
    - This has the limitation that all decision logic needs to be understood before usage.
    - And if things change, we need to modify the program.
- With the **modern computing model (Machine Learning)** this paradigm changes.
    - We feed the algorithms (models) with data.
    - Based on that data, the algorithms (models) make decisions in the program.

## How Machine Learning Works

### Phase 1: Learning

![ML Learning](img/ml_process.png)

- **Get Data**: Identify relevant data for the problem you want to solve. This data set should represent the type of data that the Machine Learning model will predict from in Phase 2 (prediction).
- **Pre-processing**: This step is about cleaning up data. A Machine Learning model cannot figure out by itself what good data looks like, so you need to clean the data and transform it into the desired format.
- **Train model**: This is the learning step, where the magic happens. There are three main paradigms in machine learning.
    - **Supervised**: you tell the algorithm which category each data item is in. Each data item in the training set is tagged with the right answer.
    - **Unsupervised**: the learning algorithm is given data without the right answers and has to find the structure in the data itself.
    - **Reinforcement**: the algorithm learns which actions to take based on the rewards of its past actions.
- **Test model**: Finally, the model is tested to see if it is good. The data was divided into a training set and a test set. The test set is used to see how well the model predicts on data it did not see during training. If it predicts poorly, a new model might be necessary.
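
The division into a training set and a test set can be sketched in plain Python. This is only a minimal illustration: the helper name `holdout_split`, the 25% test fraction, and the stand-in data are assumptions made for this example. The lesson itself uses `train_test_split` from **sklearn** for this step.

```python
import random


def holdout_split(items, test_fraction=0.25, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists.

    Hypothetical helper for illustration only.
    """
    shuffled = list(items)
    # A seeded generator makes the split reproducible
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for 20 labelled data items (not real data)
data = list(range(20))
train, test = holdout_split(data, test_fraction=0.25, seed=42)
print(len(train), len(test))
```

The model is trained only on `train`. The items in `test` are held back so that the test measures predictions on data the model has not seen.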

### Phase 2: Prediction

![ML Prediction](img/ml_prediction.png)

## Supervised Learning
- Given a dataset of input-output pairs, learn a function to map inputs to outputs
- There are different tasks, but we start by focusing on **Classification**

### Classification
- **Classification**: the supervised learning task of learning a function mapping an input point to a discrete category

### Example
- Predict if it is going to rain or not
- We have historical data to train our model

| Date       | Humidity  | Pressure  | Rain      |
| :--------- |:---------:| ---------:| :---------|
| Jan. 1     | 93%       | 999.7     | Rain      |
| Jan. 2     | 49%       | 1015.5    | No Rain   |
| Jan. 3     | 79%       | 1031.1    | No Rain   |
| Jan. 4     | 65%       | 984.9     | Rain      |
| Jan. 5     | 90%       | 975.2     | Rain      |

- This is supervised learning because every row has a label (the **Rain** column)
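
As a minimal sketch, the table above can be written as input-output pairs in plain Python. The variable names and the encoding of the label as `1` (Rain) and `0` (No Rain) are assumptions made for this example. In the lesson the data is read from a CSV file with **pandas** instead.

```python
# The rows of the table above, typed in by hand
rows = [
    {"date": "Jan. 1", "humidity": "93%", "pressure": 999.7, "rain": "Rain"},
    {"date": "Jan. 2", "humidity": "49%", "pressure": 1015.5, "rain": "No Rain"},
    {"date": "Jan. 3", "humidity": "79%", "pressure": 1031.1, "rain": "No Rain"},
    {"date": "Jan. 4", "humidity": "65%", "pressure": 984.9, "rain": "Rain"},
    {"date": "Jan. 5", "humidity": "90%", "pressure": 975.2, "rain": "Rain"},
]

# Inputs: one [humidity, pressure] pair per row (list comprehension)
X = [[float(r["humidity"].rstrip("%")), r["pressure"]] for r in rows]

# Outputs (labels): 1 for Rain, 0 for No Rain (assumed encoding)
y = [1 if r["rain"] == "Rain" else 0 for r in rows]

print(X[0], y[0])
```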

### The task of Supervised Learning
- Simply explained, the task in the example above is to find a function $f$ as follows.

**Ideally**: $f(humidity, pressure)$

Examples:
- $f(93, 999.7) =$ Rain
- $f(49, 1015.5) =$ No Rain
- $f(79, 1031.1) =$ No Rain

**Goal**: Approximate the function $f$ - the approximation function is often denoted $h$

Let's start by visualizing it
- Notice that we can do this because the data only has two dimensions (humidity and pressure)
- Computers have no problem with higher dimensions, even though we cannot easily visualize them.

> #### Programming Notes:
> - Libraries used
>     - [**pandas**](https://pandas.pydata.org) - a data analysis and manipulation tool
>     - [**matplotlib**](http://matplotlib.org) - visualization with Python ([Lecture on **visualization**](https://youtu.be/htIh8YHh4xs))
> - Functionality and concepts used
>     - [**CSV**](https://en.wikipedia.org/wiki/Comma-separated_values) file ([Lecture on CSV](https://youtu.be/LEyojSOg4EI))
>     - [**read_csv()**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) read a comma-separated values (csv) file into **pandas** DataFrame.
>     - **List Comprehension** to convert data ([Lecture on **List Comprehension**](https://youtu.be/vCYEvtfXdig))

### Nearest-Neighbors Classification
- Given an input, choose the class of the nearest data point

![Nearest-Neighbors Classification](img/nearest_neighbors.png)
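
A minimal sketch of nearest-neighbor classification in plain Python, using the five rows from the rain example as training data. Measuring "nearest" with the Euclidean distance, the function name `nearest_neighbor`, and the query points are assumptions made for this example.

```python
import math

# (humidity, pressure) -> label, taken from the rain example
training = [
    ((93, 999.7), "Rain"),
    ((49, 1015.5), "No Rain"),
    ((79, 1031.1), "No Rain"),
    ((65, 984.9), "Rain"),
    ((90, 975.2), "Rain"),
]


def nearest_neighbor(point, training):
    """Return the label of the training point closest to point."""
    # math.dist computes the Euclidean distance (Python 3.8+)
    _, label = min(training, key=lambda item: math.dist(point, item[0]))
    return label


# Hypothetical new observations
print(nearest_neighbor((85, 990.0), training))   # Rain
print(nearest_neighbor((50, 1020.0), training))  # No Rain
```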

### $k$-Nearest-Neighbors Classification
- Given an input, choose the most common class out of the $k$ nearest data points
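
The majority vote among the $k$ nearest data points can be sketched as follows. This is an illustration under the same assumptions as before (Euclidean distance, the five rows of the rain example, a hypothetical query point). It is not how `KNeighborsClassifier` from **sklearn** is implemented.

```python
import math
from collections import Counter

# (humidity, pressure) -> label, taken from the rain example
training = [
    ((93, 999.7), "Rain"),
    ((49, 1015.5), "No Rain"),
    ((79, 1031.1), "No Rain"),
    ((65, 984.9), "Rain"),
    ((90, 975.2), "Rain"),
]


def knn_classify(point, training, k=3):
    """Return the most common label among the k nearest training points."""
    # Sort by Euclidean distance to the query point and keep the k closest
    neighbors = sorted(training, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical new observation
query = (78, 1025.0)
print(knn_classify(query, training, k=3))  # No Rain
print(knn_classify(query, training, k=5))  # Rain
```

The same point gets a different class for $k = 3$ and $k = 5$, so the choice of $k$ matters. With two classes, an odd $k$ avoids tied votes.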

### Overfitting
- A model that fits too closely to a particular dataset and therefore fails to predict future values

### Some approaches to avoid overfitting
- **Regularization**: penalizing hypotheses that are more complex to favor simpler ones
- [**Holdout Cross-validation**](https://en.wikipedia.org/wiki/Cross-validation_(statistics)#Holdout_method): split data into training and testing sets.
- [**$k$-fold Cross validation**](https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation): split data into $k$ sets and run $k$ experiments, using each set as the test set once (and the remaining data as the training set)
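
As a rough sketch of $k$-fold cross-validation, the splitting step can be written in plain Python. The helper name `k_fold_indices` is hypothetical, and the sketch does not shuffle. In practice the data is usually shuffled before it is split into folds.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for each of the k folds.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_items))
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 10 data items split into 5 folds: 5 experiments with 2 test items each
folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each data item is used for testing exactly once and for training in the other $k - 1$ experiments. The accuracies of the $k$ experiments are typically averaged.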

> #### Programming Notes:
> - Libraries used
>     - [**numpy**](http://numpy.org) - scientific computing with Python ([Lecture on NumPy](https://youtu.be/BpzpU8_j0-c))
>     - [**sklearn**](https://scikit-learn.org/stable/) - tools for predictive data analysis
> - Functionality and concepts used
>     - [**dropna()**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) clean the **pandas** DataFrame
>     - **List Comprehension** to convert data ([Lecture on **List Comprehension**](https://youtu.be/vCYEvtfXdig))
>     - [**train_test_split**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from [**sklearn**](https://scikit-learn.org/stable/)
>     - [**KNeighborsClassifier**](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) to train (fit) the model
>     - [**metrics.accuracy_score**](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) to get the accuracy of the predictions