<a href="https://colab.research.google.com/github/yujiecui/MachineLearningBasics/blob/main/Machine_Learning_Basic_Practice_V0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning Basics

## **Important**: Please do not modify this original notebook. To experiment, create your own copy so that the shared version is not affected.

### Author
Yimeng He (Yimeng.He@cshs.org)

## Overview

Our journey begins with a classic project: **classifying Iris flowers based on their characteristics.** This seemingly simple task is a powerful entry point into understanding how machines can learn and make predictions.

Specifically, we will develop two models, a **Decision Tree** and a **Logistic Regression**, and evaluate the performance of each on the Iris dataset.

## The Dataset

In [None]:
import pandas as pd

We are using the pandas library here. Here are some resources that might be helpful:

[Pandas Tutorial (Youtube)](https://www.youtube.com/playlist?list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS)

[Pandas Documentation](https://pandas.pydata.org/docs/)

[Pandas Exercise](https://github.com/guipsamora/pandas_exercises)

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
df = pd.read_csv(url, header=None, names=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
                                          'petal width (cm)', 'Species'])
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


The objective is to use **sepal length**, **sepal width**, **petal length**, and **petal width** to predict **Species**.

## Data Preprocessing


### Train, Test Split

First, we need to split the data into two subsets: one for training and one for testing.

There are two main reasons to split the data: to **detect overfitting** and to **evaluate the model without bias**, using samples the model has not seen during training.
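
To make the idea concrete, here is a minimal standard-library sketch of a shuffled hold-out split. The helper name `split_rows` and the stand-in data are hypothetical, and this is a simplification for illustration, not scikit-learn's implementation of `train_test_split`.

```python
import random


def split_rows(rows, test_size=0.2, seed=37):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # a fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(150))                    # stand-in for the 150 Iris samples
train, test = split_rows(rows)
print(len(train), len(test))               # 120 30
```

Because the test rows are set aside before training, the score measured on them reflects how the model handles unseen samples.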

<br/>
Reference:

[Train Test Split and its importance (Medium)](https://medium.com/@kavyasree42/train-test-split-and-its-importance-f2022472382d)

[Machine Learning Tutorial Python - 7: Training and Testing Data](https://www.youtube.com/watch?v=fwY9Qv96DJY)

In [None]:
from sklearn.model_selection import train_test_split


# TODO
# 1. Separate 'Species' from the dataset
# 2. Use train_test_split to split the data into 2 folds (test_size = .2, random_state=37)


# train_test_split documentation:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html



In [None]:
# Check that the split produced the expected shapes: X_train (120, 4), X_test (30, 4)
assert X_train.shape == (120,4), "X_train shape is incorrect"
assert X_test.shape == (30,4), "X_test shape is incorrect"



### Data Normalization

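For scale-sensitive models such as logistic regression (and distance-based algorithms such as k-nearest neighbors), we need to perform **feature scaling** first. Otherwise, features with larger ranges can dominate the regularization penalty and slow the solver's convergence.

However, rule-based models such as decision trees are not sensitive to the scale of the features, because each split compares a single feature against a threshold.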
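
As a rough illustration of standardization, the sketch below learns the mean and population standard deviation from training values only, then reuses them for the test values. The helper names and the sample numbers are made up for this example, and it handles a single feature rather than a full table.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return fmean(column), pstdev(column)


def transform(column, mean, std):
    """Standardize values using statistics that were learned elsewhere."""
    return [(x - mean) / std for x in column]


train_petal_length = [1.4, 1.3, 4.7, 4.5, 6.0, 5.1]    # made-up training values (cm)
test_petal_length = [1.5, 4.9]                          # made-up test values (cm)

mean, std = fit_scaler(train_petal_length)              # fit on the training data only
train_scaled = transform(train_petal_length, mean, std)
test_scaled = transform(test_petal_length, mean, std)   # reuse the training statistics

print(fmean(train_scaled), pstdev(train_scaled))        # close to 0 and 1
print(test_scaled)
```

Fitting on the training data alone keeps information about the test set from leaking into preprocessing, which is why the scaled test values do not have a mean of exactly 0.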

<br/>
Reference:

[Feature scaling (Wiki)](https://en.wikipedia.org/wiki/Feature_scaling)

[Why Do We Need to Perform Feature Scaling? (Youtube)](https://www.youtube.com/watch?v=nmBqnKSSKfM)

In [None]:
from sklearn.preprocessing import StandardScaler

# TODO
# 1. Fit an instance of StandardScaler on X_train.
# 2. Transform X_train and X_test using that same instance.
# Name the results X_train_scaled and X_test_scaled respectively.

# documentation:
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html





# Both means should be < 0.1 (the training mean should be very close to 0)
print(f'Mean of X_train_scaled:{X_train_scaled.mean()}')
print(f'Mean of X_test_scaled:{X_test_scaled.mean()}')

In [None]:
assert X_train_scaled.mean() < .1 and X_test_scaled.mean() < .1


## Model Training & Evaluation
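
Both models below are scored with accuracy and a confusion matrix. The following sketch shows what those two metrics compute, using hand-written labels and predictions rather than real model output. The helper names are hypothetical, and rows are taken to be true classes and columns predicted classes.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_counts(y_true, y_pred, labels):
    """Matrix of counts: rows are true classes, columns are predicted classes."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
# Hand-written example: one versicolor sample is mistaken for virginica.
y_true = ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
          "Iris-versicolor", "Iris-virginica", "Iris-virginica"]
y_pred = ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
          "Iris-virginica", "Iris-virginica", "Iris-virginica"]

acc = accuracy(y_true, y_pred)
matrix = confusion_counts(y_true, y_pred, labels)
print(acc)
for row in matrix:
    print(row)
```

Accuracy summarizes performance in one number, while the off-diagonal entries of the matrix show which classes are confused with each other.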

### Decision Tree

Model Explanation:

[Decision and Classification Trees, Clearly Explained!!!(Youtube)](https://www.youtube.com/watch?v=_L39rN6gz7Y)

[Decision Tree (Wiki)](https://en.wikipedia.org/wiki/Decision_tree)
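
For intuition about how a classification tree picks a split, here is a simplified sketch that scores candidate thresholds on one feature by weighted Gini impurity. The helper names and the six sample values are made up for illustration. A real implementation searches over every feature and repeats the process recursively.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure group, larger for mixed groups."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(values, labels, threshold):
    """Impurity after splitting into value <= threshold and value > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]            # made-up values (cm)
species = ["setosa"] * 3 + ["versicolor"] * 3

# Candidate thresholds: midpoints between consecutive sorted values.
ordered = sorted(set(petal_length))
candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
best = min(candidates, key=lambda t: weighted_gini(petal_length, species, t))

print(gini(species))                                     # impurity before splitting
print(best, weighted_gini(petal_length, species, best))  # best threshold and its impurity
```

Since the split only asks whether a value is above or below a threshold, rescaling the feature moves the threshold but does not change which samples end up on each side.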

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# TODO
# 1. Train a Decision Tree model with max_depth of 3 using X_train, y_train
# 2. Evaluate its accuracy on the X_test and y_test data
# 3. Compute the confusion matrix for the test predictions
# 4.* Repeat the process on X_train_scaled, y_train and observe the difference

# Decision Tree model:
# Decision Trees Explanation (scikit-learn):
# https://scikit-learn.org/stable/modules/tree.html
# Doc:
# https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

# Accuracy_score:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

# Confusion Matrix:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

# Hint:
# The accuracy score should be 1.0.


### Logistic Regression
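
As a rough picture of how a multiclass (multinomial) logistic regression makes a prediction, the sketch below computes one linear score per class and converts the scores to probabilities with softmax. The weights, biases, and sample are hypothetical numbers chosen for illustration, not fitted values, and only two features are used to keep it short.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(features, weights, biases):
    """One linear score per class, followed by softmax."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


# Hypothetical parameters for 3 classes and 2 scaled features (not fitted).
weights = [[-1.0, -1.5], [0.2, -0.1], [0.8, 1.6]]
biases = [0.0, 0.5, -0.5]
sample = [-1.2, -1.1]                            # a made-up scaled sample

probs = predict_proba(sample, weights, biases)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

Training consists of adjusting the weights and biases iteratively so that the true class receives high probability, which is the step that benefits from scaled features.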

In [None]:
from sklearn.linear_model import LogisticRegression

# TODO
# 1. Train a LogisticRegression model on X_train_scaled, y_train
# 2. Evaluate its accuracy on the X_test_scaled and y_test data
# 3*. Do the same thing on X_train, y_train, and compare the number of solver iterations (n_iter_)

# LogisticRegression:
# Docs:
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

# Tutorial:
# https://towardsdatascience.com/logistic-regression-using-python-sklearn-numpy-mnist-handwriting-recognition-matplotlib-a6b31e2b166a


# Hint:
# The accuracy should also be 1.0.



### Great job completing the first project! You've taken a big step in understanding how to apply machine learning to solve real-world problems.