# Module 2: Fundamental Algorithms I

### Introduction Slide

#### Module 2: Fundamental Algorithms I
- Linear Regression
- Logistic Regression
- Decision Tree

---
## Introduction Script
Hello and welcome. 

In this module, you will learn three supervised machine learning algorithms. Lesson 1 introduces linear regression, lesson 2 introduces logistic regression, and lesson 3 covers decision trees.

In the lesson notebooks, we often include Python code that creates plots to help you understand a concept. You are not required to understand the plotting code.

You should, however, understand how the algorithms work in general and, more importantly, know how to apply them using Python scripts. As we learned in module 1, the steps to apply a machine learning model are the same for all models defined in the Python scikit learn module, so they are easy to pick up.

You also need to know the key hyperparameters of each machine learning model and the value options for those key hyperparameters.

For each of the 3 lessons in this module, please go through the lesson notebook briefly, then watch the lesson video with questions in mind. The lesson videos explain the concept of the machine learning algorithms. To fully understand how to apply those algorithms in python, you need to go through the lesson notebooks again after watching the video. 

I encourage you to run the code in the lesson notebooks, make modifications and rerun the code to see different results. Learning by doing is the best way to learn data analytics.

---
## Lesson 1: Introduction to Linear Regression

---
### Slide 1
#### Linear Regression

- Supervised Learning
- Predicts continuous output
- Assumes linear relationships between input and output
- Feature scaling is normally not needed


### Slide 1 Script

In this lesson, we will introduce linear regression.

Regression is a modeling technique used to explore the relationship between independent and dependent variables. Regression is used to predict continuous output, like the interest rate of a loan or the premium of an auto insurance policy.

Linear regression assumes a linear relationship between the independent and dependent variables. In other words, linear regression suggests that the relationship between the dependent and independent variables can be expressed by a straight line. We normally don't need to scale the continuous features in linear regression.

When we use only one independent variable to predict the outcome of a dependent variable, it's called simple linear regression. If there is more than one independent variable in a linear regression model, it's called multiple linear regression.


###### Simple Linear Regression
$\hat{y} = \beta_0 + \beta_1  x$  

$y = \hat{y} + \epsilon = \beta_0 + \beta_1  x  + \epsilon$

###### Multiple Linear Regression
$\hat{y} = \beta_0 + \beta_1  x_1 + ... + \beta_n  x_n$


### Slide 2

##### Simple Linear Regression
$\hat{y} = \beta_0 + \beta_1  x$  

$y = \hat{y} + \epsilon = \beta_0 + \beta_1  x  + \epsilon$

###### Multiple Linear Regression
$\hat{y} = \beta_0 + \beta_1  x_1 + ... + \beta_n  x_n$



### Slide 2 Script
In a simple linear regression, where there's only one independent variable and one dependent variable, the relationship can be represented by the equation $\hat{y} = \beta_0 + \beta_1  x$, where x is the independent variable and y hat is the predicted value of the dependent variable. This is also the equation of a straight line in a two-dimensional space.


We use y hat in the first equation because this is the predicted outcome of the dependent variable, which is usually not the same as the observed value of the dependent variable, represented by y. The difference between y and y hat is called the error, represented by epsilon in the formula.

Our task is to find the best $\beta_0$ and $\beta_1$ so that we can minimize the overall error. This approach of minimizing errors is often used in machine learning, where we define a cost function and determine the model parameters through the process of minimizing the cost function.

We'll demonstrate cost function with simple linear regression in the next slide.


---
### Slide 3

#### Cost Function
$\epsilon_i^2 = \left( \ y_i - \hat{y}_i \ \right)^2$  
$cost = \sum \epsilon_i^2$

<img src='images/linear_regression.png' width=400>

---
### Slide 3 Script

This is the plot of a simple linear regression. The blue dots are observed values. The red line is the estimated regression line. The slope of the regression line is beta 1, and the intercept of the regression line with the y axis is beta 0. For any given x, for example, x3, the observed output is y3, and the predicted output, which is on the regression line, is y hat 3. The difference between y3 and y hat 3, epsilon 3, is the error for this data point x3.

We wish to find the combination of beta 0 and beta 1 that minimizes the overall error. So we first define a cost function to represent the overall error. In linear regression, the most common cost function is the sum of squared errors. This kind of linear regression is also called ordinary least squares regression, or OLS. The square term is necessary to remove the sign of negative errors like epsilon 10 in the image. It also gives more weight to larger errors, which usually results in a better fit. But that's not always true. If a dataset has some extreme outliers, the outliers may have a big impact on the linear regression due to the square term. You may want to deal with the outliers in the data before applying OLS linear regression.
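To make the cost function concrete, here is a minimal sketch using numpy. The five data points are hypothetical and are not taken from the lesson's dataset, and the helper name `sse_cost` is made up for this illustration. It computes the sum of squared errors for a candidate line and uses the standard closed-form least squares estimates for simple linear regression.

```python
import numpy as np

# Hypothetical observations (not from the lesson's dataset).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def sse_cost(beta0, beta1, x, y):
    """Sum of squared errors between observed y and predicted y hat."""
    y_hat = beta0 + beta1 * x
    return float(np.sum((y - y_hat) ** 2))

# Closed-form least squares estimates for simple linear regression.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

best_cost = sse_cost(beta0, beta1, x, y)
# A line with a slightly different slope has a larger cost.
other_cost = sse_cost(beta0, beta1 + 0.1, x, y)
print(beta0, beta1, best_cost, other_cost)
```

Any change to the fitted betas increases the cost, which is what "best" means for an OLS fit.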

Ok, enough theory. The Python scikit learn module defines a linear regression model, which requires only several lines of code to construct a linear regression. I'll demonstrate this with the tips dataset. Let's take a look at the dataset first.

---
### Slide 4
#### Data
<img src='images/tips_dataset.png' width=400>


### Slide 4 Script
This is the tips dataset, which is built into the seaborn module. The tips dataset records data about people visiting a restaurant.

The columns are pretty self-explanatory. Our goal is to build a linear regression model to predict the amount of the tip. So the dependent variable is tip in the dataset, and the other variables are independent variables which can be used to predict tip.

To construct a linear regression model, we need to preprocess the data first. We will separate the dependent and independent variables, and we will also encode the categorical features. But we don't need to scale the features. We will discuss why we don't need to scale features in linear regression in a few moments.


### Slide 5
#### Data Preparation with patsy
```
import patsy as pts
import seaborn as sns
tdf = sns.load_dataset('tips')  # assumption: tdf holds the seaborn tips data
y, x = pts.dmatrices('tip ~ total_bill + size + C(time)', data=tdf, return_type='dataframe')
```

#### Independent Variables(x)
<img src='images/tips_data_encode.png' width=400>

### Slide 5 Script
Python patsy module defines a function dmatrices which makes data pre-processing very simple when we don't need to scale the features. We simply define a formula of our linear regression as a string and pass it to the dmatrices function. The function returns dependent variable and encoded independent variables directly.

The first item in the formula string is the column name of the dependent variable, in this case 'tip'. Then we use a tilde to separate the dependent variable from the independent variables. You can think of it as the equal sign in a linear regression equation. The column names of the independent variables are connected by + signs. Notice that there's a capital C and parentheses around the time column. This indicates that time is a categorical feature and needs encoding. dmatrices will encode all columns wrapped by capital C. Here we only choose three features as independent variables, two continuous and one categorical.

The table below displays some random rows of the resulting independent variables. In the table, instead of time, there's a C(time)[T.Dinner] column. dmatrices encodes categorical features similarly to one-hot encoding, but with one less dummy feature. For example, there are two unique values in the time feature, dinner and lunch. With one-hot encoding, two dummy features will be created, one for dinner and one for lunch. But actually one dummy is good enough, since a 0 in the dinner column indicates it's lunch.
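The difference between full one-hot encoding and encoding with one less dummy can be sketched on a tiny example. Note that this sketch uses pandas `get_dummies` as a stand-in rather than patsy, so the column names differ from the patsy output, and the four-row `time` column is hypothetical. The category order is set so that lunch is the baseline, as in the lesson.

```python
import pandas as pd

# Hypothetical miniature version of the time column, with Lunch as the first level.
time = pd.Categorical(["Dinner", "Lunch", "Dinner", "Lunch"],
                      categories=["Lunch", "Dinner"])
df = pd.DataFrame({"time": time})

# Full one-hot encoding: one dummy column per category.
one_hot = pd.get_dummies(df["time"], prefix="time")

# Dropping the first level keeps a single dummy; the dropped level is the baseline.
dummy = pd.get_dummies(df["time"], prefix="time", drop_first=True)

print(list(one_hot.columns))
print(list(dummy.columns))
print(dummy["time_Dinner"].astype(int).tolist())
```

With only the dinner dummy left, a 0 in that column can only mean lunch, so no information is lost.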

In the table, there's an extra feature created, Intercept, with value 1 for all data points. This can be used to estimate the intercept of the linear regression model.

Now with the dependent variable and the pre-processed independent variable, we can apply linear regression model.

---
### Slide 6
#### Scikit Learn LinearRegression model
```
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

ind_train, ind_test, dep_train, dep_test = \
    train_test_split(x, y, test_size=0.4, random_state=23)

model = LinearRegression()
model.fit(ind_train, dep_train)

score = model.score(ind_test, dep_test)
```


### Slide 6 Script

We first split the data into train and test sets. The independent variable x is split into independent train and independent test, and the dependent variable y is split into dependent train and dependent test.

Then we create a linear regression model object and fit the model with the independent and dependent train sets. Then we call the score function of the model to evaluate the model with the test set. The score function first makes predictions on the test set, then compares the predictions with the observed output of the test set to calculate the R-squared score.
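For a regression model, the number returned by the score function is the R-squared value. The following sketch checks this on synthetic data (generated here for illustration, not the tips data) by comparing `score` with the two separate steps of predicting and then calling `r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y depends linearly on two features plus noise.
rng = np.random.default_rng(23)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=23)
model = LinearRegression().fit(X_train, y_train)

# score() predicts on the test set and returns R-squared ...
score = model.score(X_test, y_test)
# ... which matches doing the two steps separately.
manual = r2_score(y_test, model.predict(X_test))
print(score, manual)
```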

This is the standard way to apply all machine learning models defined in the scikit learn module.

Now let's explore the parameters or the value of betas in this fitted linear regression model.

### Slide 7

#### Linear Regression Formula
```
# Display model fit parameters for training data
print(f"tip = {model.intercept_[0]:4.2f} + "
      f"{model.coef_[0][1]:4.2f} Dinner + "
      f"{model.coef_[0][2]:4.2f} total_bill + "
      f"{model.coef_[0][3]:4.2f} size")
```

`tip = 0.83 + -0.23 Dinner + 0.09 total_bill + 0.22 size`


### Slide 7 Script
This piece of code constructs the equation of the fitted linear model. We use the intercept_ attribute of the model to get the intercept, or beta 0, and the coef_ attribute to get the values of the other betas, or the coefficients of all independent variables.

We mentioned above that we don't need to scale features for linear regression. This is because the scale of a feature can be absorbed by its coefficient. For example, if we convert the unit of the total_bill column from dollars to cents and still keep dollars as the unit of the tip column, we will get the same linear model, except that the coefficient of total_bill becomes 0.0009 to accommodate the unit change in total_bill.
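The unit-change argument can be checked with a quick sketch. The data here are synthetic (the names `bill_dollars` and `fit_line` are made up for this example), and numpy's `polyfit` is used as a simple least squares fit in place of the scikit learn model.

```python
import numpy as np

# Synthetic bills in dollars and tips that depend linearly on the bill.
rng = np.random.default_rng(0)
bill_dollars = rng.uniform(5, 50, size=100)
tip = 0.8 + 0.09 * bill_dollars + rng.normal(scale=0.2, size=100)

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope

b0_dollars, b1_dollars = fit_line(bill_dollars, tip)
# Same data, but the bill is now expressed in cents.
b0_cents, b1_cents = fit_line(bill_dollars * 100, tip)

print(b1_dollars, b1_cents * 100)  # the slope shrinks by a factor of 100
print(b0_dollars, b0_cents)        # the intercept does not change
```

The predictions are identical in both cases. Only the coefficient changes to compensate for the unit.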

The equation indicates that a larger total bill and a larger group size result in a larger tip, which is understandable. But the tip from dinner is less than that from lunch, which is kind of counterintuitive. Can we trust these results?

Before we discuss the reliability of the regression results, let's look at another way to do linear regression in Python.

---
### Slide 8
#### Statsmodels Linear Regression
```
import statsmodels.formula.api as smf

result = smf.ols(formula='tip ~ total_bill + size + C(time)', data=tdf).fit()
result.summary()
```

### Slide 8 Script
Most of the machine learning models we introduce in this course are defined in the scikit learn module. But for linear regression, there's an OLS implementation in the statsmodels module which, in my opinion, is more convenient to use than the scikit learn linear regression model.

Only one line of code is needed to construct an OLS linear regression with the statsmodels module. We don't need to separate the dependent and independent variables or encode categorical features. We simply pass a formula string, as we did above when we prepared the data with the patsy module.

We can print the linear regression information nicely with the summary function.

### Slide 9
#### Linear Regression Result
<img src='images/lin_reg_result.png' width=400>

### Slide 9 Script
This is the result. We can ignore the bottom part of the result. The top part shows the general information of this linear model. For example, the dependent variable is tip, and the model uses the OLS method. The R-squared score is 0.468. It's a score we can use to evaluate the accuracy of the model. R-squared is normally between 0 and 1, and a larger R-squared indicates a better model. We will discuss evaluation metrics in more detail in the following lessons.

The middle part of the result shows the model parameters. It shows not only the coefficient values but also some statistics of the coefficients, for example, the t statistic. If the absolute value of a coefficient's t statistic is greater than 2, we can say that the value is significant at the 95% confidence level. Notice that the t statistic of the coefficient of C time Dinner is only -0.028, which means the value is not significant. In other words, time is not a good indicator in determining tips.

Notice that the coefficients returned by statsmodels are different from those returned by the scikit learn linear regression model. The reason is that we don't split the dataset and instead use the whole dataset when we use the statsmodels linear model. The R-squared score is also calculated with the whole dataset, which makes it less reliable, since we usually want to evaluate a model with data that the model has not seen. Despite this, statsmodels is still a good choice for linear regression just because it's so easy to use. The scikit learn linear regression model doesn't return coefficient statistics directly. You will have to calculate them.
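If you do need coefficient statistics without statsmodels, they can be calculated by hand. The following is a minimal sketch using the textbook OLS formulas on synthetic data (the features `x1` and `x2` are made up, with `x2` deliberately unrelated to the output). It is an illustration under these assumptions, not the code statsmodels runs internally.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)                  # informative feature
x2 = rng.normal(size=n)                  # feature unrelated to y
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])   # intercept column first
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta
dof = n - X.shape[1]                         # degrees of freedom
sigma2 = residuals @ residuals / dof         # estimated residual variance
cov = sigma2 * np.linalg.inv(X.T @ X)        # covariance of the estimates
std_err = np.sqrt(np.diag(cov))
t_stats = beta / std_err                     # t statistic = coefficient / std error

print(np.round(beta, 3))
print(np.round(t_stats, 2))
```

The t statistic of `x1` is far above 2, while the t statistic of an unrelated feature like `x2` is typically small.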

---
## Lesson 2: Introduction to Logistic Regression

---
### Slide 1
#### Logistic Regression
- Supervised learning
- For classification problems
- Feature scaling is normally not needed




---
### Slide 1 Script

In this lesson we will learn logistic regression, which is a supervised learning algorithm. Despite its name, logistic regression is for classification problems, where we predict discrete output, like true or false, yes or no. We learned in the first lesson that a regression is normally used to predict a continuous value. But if we apply a function that maps the continuous value into a probability in the range from 0 to 1, we can then apply a threshold on the probability to predict a binary output. We will use an example to explain this concept.
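The idea of mapping a continuous value to a probability and then applying a threshold can be sketched in a few lines. The function names `sigmoid` and `predict_class` and the 0.5 threshold are choices made for this illustration.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z, threshold=0.5):
    """Turn the continuous value z into a binary prediction."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3), predict_class(z))
```

Large negative values map to probabilities near 0, large positive values map to probabilities near 1, and 0 maps to exactly 0.5.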

---
### Slide 2
#### Challenger Disaster Investigation
<img src='images/oring.png' width=500>


### Slide 2 Script
In 1986, after the Space Shuttle Challenger disaster, the investigation commission determined that the disaster was caused by the failure of an O-ring seal in the solid rocket motor due to the cold temperature. The figure in this slide is a report of O-ring tests that were actually done before the Challenger launch. The x axis is temperature, and the y axis is the number of O-ring failures. Green dots in the figure indicate no failure, while red dots indicate at least one O-ring failure. We can create a dataset from this figure with two variables. The independent variable is temperature, and the dependent variable is whether there's an O-ring failure, 0 if there's no failure and 1 if there's at least one failure.

Through a logistic function transformation, we can fit a logistic regression model to this data and get the following result.

### Slide 3
#### Map to Probability
<img src='images/sigmoid.png' width=500>

### Slide 3 Script
The dots in the figure are observed data. For any given independent variable, or temperature in this case, the output is either 0 or 1. When we apply a logistic transformation, we convert a linear model into a logistic model which can map temperature to O-ring failure probability. The relationship is represented by the s-shaped curve in the figure, which is called a sigmoid curve. The shape and location of the sigmoid curve are determined by the model parameters, or betas, just like in linear regression. And like in linear regression, we get the best betas by minimizing the overall error of the predictions.
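For reference, the standard form of the logistic model with a single independent variable can be written as follows, where p is the predicted probability of the outcome 1 and x is the independent variable, in this case temperature. The symbols follow the linear regression formulas from lesson 1.

$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1  x)}}$

Equivalently, the linear part of the model is the log of the odds: $\ln \frac{p}{1 - p} = \beta_0 + \beta_1  x$. A negative $\beta_1$ gives a curve where the probability falls as x increases.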

The figure shows the best fit based on the O-ring test data. We can see that, based on this logistic regression model, the O-ring failure probability at 65 degrees Fahrenheit is about 50%, and the probability is close to 100% when the temperature is below 50 degrees. And guess what the temperature was at the Challenger launch time? It was 31 degrees!

---
### Slide 4
#### Adult Income Dataset

<img src='images/adult_dataset.png' width=800>



### Slide 4 Script

Now let's learn how to create a logistic regression model with Python. We will use the adult income data to demonstrate this. The adult income data is extracted from the 1994 Census database. The dataset contains employee information like education, race, sex, weekly work hours, etc. The Salary column has two values, greater than 50 thousand or less than or equal to 50 thousand. This column will be our label, which we are trying to predict from the employee information.

We first need to pre-process the data, which includes encoding the label by mapping greater than 50 thousand to 1 and otherwise 0, encoding the categorical features, and splitting the dataset into train and test sets.

After that, we apply the scikit learn logistic regression model the same way as we did with linear regression.

---
### Slide 5
#### Scikit Learn LogisticRegression model
```
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

adult_model = LogisticRegression(C=1E6)
adult_model = adult_model.fit(x_train, y_train)

predicted = adult_model.predict(x_test)
score = metrics.accuracy_score(y_test, predicted)
```


### Slide 5 Script
This is the code we use to apply the scikit learn logistic regression model. There are a couple of things worth mentioning. One is that we set a hyperparameter C when constructing the LogisticRegression model. This hyperparameter controls regularization, which is used to reduce the risk of overfitting. In scikit learn, C is the inverse of the regularization strength, so a smaller C means stronger regularization. We will discuss how to choose a proper value for C in future lessons. Another thing is that we call the predict function of the model to get predictions on the test set, then use the accuracy_score function in the scikit learn metrics module to calculate the accuracy score of our model. In the previous lesson, for the linear regression model, we directly called the score function of the model to calculate the R-squared score. We can do the same thing for a logistic regression model by calling its score function to get the accuracy score. The score function combines predicting and calculating the accuracy score into one function. Separating the two steps allows us to calculate other evaluation metrics, which we will discuss briefly in the next slide.
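As a quick check that the two approaches give the same number, here is a minimal sketch on synthetic data (generated for this illustration, not the adult income data). It compares the accuracy from predict plus accuracy_score with the accuracy from the model's score function.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary labels that depend on two features plus noise.
rng = np.random.default_rng(23)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=23)
model = LogisticRegression(C=1E6).fit(x_train, y_train)

# Two steps: predict, then compare with the true labels.
predicted = model.predict(x_test)
two_step = metrics.accuracy_score(y_test, predicted)
# One step: score() does both internally.
one_step = model.score(x_test, y_test)
print(two_step, one_step)
```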

---
### Slide 6
### Classification Metrics
#### Classification Report
```
metrics.classification_report(y_test, predicted)
```

<img src='images/classification_report.png' width=600>

### Slide 6 Script
The accuracy score is a common classification evaluation metric. It indicates the overall proportion of correct predictions. But the accuracy score is often not able to reflect the quality of a classification model, especially when the dataset is imbalanced. For example, suppose a dataset has two outcomes, positive and negative, and 99% of the dataset has a negative outcome and only 1% has a positive outcome. A zero model, which predicts the majority outcome for all data, negative in this case, will have an accuracy score of 99%. But the zero model is essentially useless since it's not able to identify any positive outcome.
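The zero model example can be reproduced with a few lines of plain Python. The label counts below are hypothetical and chosen to match the 99% and 1% split described above.

```python
# Hypothetical imbalanced labels: 990 negative (0) and 10 positive (1).
actual = [0] * 990 + [1] * 10
# A zero model always predicts the majority class.
predicted = [0] * len(actual)

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
true_pos = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall_pos = true_pos / sum(a == 1 for a in actual)
print(accuracy, recall_pos)
```

The accuracy is 99%, yet the model finds none of the positive cases.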

So in addition to the accuracy score, we also use some other metrics to evaluate a classification model. The scikit learn metrics module has a classification_report function which returns a classification report with more comprehensive metrics.

The figure in this slide is the classification report for our logistic regression model fitted on the adult dataset. The first line shows the scores for predicting low income, or income at or below 50 thousand dollars, and the second line shows the scores for predicting high income, or income above 50 thousand dollars.

Precision is the proportion of the predictions of a class that are correct. For example, the first line shows that among all low income predictions, 80% of them are correct.

Recall is the proportion of the actual cases of a class that are predicted correctly. For example, the second line in the recall column indicates that our model only identifies 25% of all high income cases correctly.

We can also plot a confusion matrix to show actual number of correct and wrong predictions.

---
### Slide 7
### Classification Metrics
#### Confusion Matrix
<img src='images/confusion_matrix.png' width=400>

### Slide 7 Script
This is the confusion matrix. The lesson 2 notebook has the code to plot this matrix. We will also use this code to plot confusion matrices in future lessons. You don't have to understand the plotting code, but you need to know how to use the function to plot a confusion matrix.

In this matrix, the rows indicate the observed outcomes, or the actual classes, and the columns indicate the predicted outcomes. The first row indicates that there are a total of 1182 plus 29 low income cases, and 1182 of them are predicted correctly as low income by our model. The second row shows that there are 307 plus 82 high income cases, among which only 82 are classified correctly by our model.

We can calculate precision and recall from the confusion matrix. For example, for high income recall, it's 82 divided by all high income cases which is 82 plus 307, the result is 25% as shown in the classification report.
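The calculation generalizes to every class at once. Here is a minimal sketch with numpy on a hypothetical two-class confusion matrix (the counts are made up for illustration and are not the counts in the slide), laid out the same way with rows as actual classes and columns as predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix: rows are actual classes, columns are predictions.
cm = np.array([[80, 20],    # actual class 0: 80 predicted as 0, 20 as 1
               [10, 40]])   # actual class 1: 10 predicted as 0, 40 as 1

correct = np.diag(cm)                 # correct predictions per class
recall = correct / cm.sum(axis=1)     # divide by the row totals (actual cases)
precision = correct / cm.sum(axis=0)  # divide by the column totals (predictions)
accuracy = correct.sum() / cm.sum()

print(recall, precision, accuracy)
```

Recall divides by the row total and precision divides by the column total, which is the only difference between the two.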

We will discuss the evaluation metrics in more detail in future lessons.

---
## Lesson 3: Introduction to Decision Tree

---
### Slide 1
#### Decision Tree
- Supervised learning
- For both classification and regression problems
- Can handle categorical features (numerically encoded)
- Feature scaling is not needed



---
### Slide 1 Script
In this lesson we will discuss the decision tree, which is a simple algorithm that is easy to understand. A decision tree is a tree-like decision making model. It can be used in both classification and regression problems.

For linear regression and logistic regression, we normally want to create dummy variables for categorical features because we don't want to introduce artificial ranking information into the dataset. A decision tree, on the other hand, can handle categorical features directly as long as they have numeric values. For text categorical values, we can simply apply label encoding to map them to numbers.
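Label encoding can be sketched with the scikit learn `LabelEncoder`. The color values are hypothetical. Each distinct text value is mapped to an integer, with the classes sorted alphabetically.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical text categories.
colors = ["red", "green", "blue", "green", "red"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

print(list(encoder.classes_))   # the distinct values, in sorted order
print(list(codes))              # the integer code of each original value
```

The integers imply an ordering (blue, then green, then red) that has no real meaning, which is why this encoding is normally avoided for linear and logistic regression.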

Just like in linear or logistic regression, we normally don't need to scale continuous features with a decision tree.

---
### Slide 2
#### Decision Tree
<img src="images/dt_sunglasses.png" width=600>

### Slide 2 Script
This is a simple decision tree. A decision tree uses conditional statements to split the dataset into different groups. In this image, the blue boxes with conditions are decision nodes, and the topmost node is the root node. The tree splits into branches based on the conditions. The ends of the branches, represented by green ovals that don't split any more, are called leaves, or leaf nodes.

The decision tree in this slide determines whether you should wear sunglasses based on time and location. The root node has a condition on time, if it's night, directly go to decision no sunglasses. If it's day, check location feature and eventually make decision based on location.

In this decision tree, both features are categorical features. If there are continuous features, the conditions will be comparisons, for example, a check of whether the temperature is greater than 90 degrees.
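Since a decision tree is a set of nested conditions, the sunglasses tree can be written as an ordinary function. This is only a sketch: the category values `"night"`, `"day"`, `"outdoors"` and `"indoors"` are assumptions made for the illustration, since the exact values appear only in the slide image.

```python
def wear_sunglasses(time, location):
    """Walk a tiny hand-written decision tree and return the decision."""
    # Root node: condition on time.
    if time == "night":
        return False        # leaf: no sunglasses
    # Decision node: condition on location (hypothetical category names).
    if location == "outdoors":
        return True         # leaf: wear sunglasses
    return False            # leaf: no sunglasses

print(wear_sunglasses("night", "outdoors"))
print(wear_sunglasses("day", "outdoors"))
print(wear_sunglasses("day", "indoors"))
```

A decision tree algorithm learns conditions like these from the data instead of having them written by hand.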

We mentioned above that a decision tree can be used for both classification and regression problems. In this lesson, we will use the iris dataset to demonstrate how to apply a decision tree classifier with the scikit learn decision tree module.

### Slide 3
#### Iris Dataset
<img src='images/iris_dataset.png' width=500>

### Slide 3 Script

Like the tips dataset we used in lesson one, the iris dataset is also a built-in dataset in the seaborn module. The dataset consists of 50 samples from each of three species of iris: Iris setosa, Iris virginica and Iris versicolor. Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

The species column will be our label and other four columns will be the features. All 4 features are continuous features. For a decision tree model, we don't need to scale continuous features. We do need to encode the label, or the species column because they have text values.

After we encode the species column and split the dataset to train and test, we can use the standard way to apply scikit learn decision tree classifier.

---
### Slide 4
#### Decision Tree Classifier
```
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(random_state=23)
dtc = dtc.fit(d_train, l_train)
score = dtc.score(d_test, l_test)
```

### Slide 4 Script
By now you should be very familiar with this piece of code. It is very simple to apply scikit learn machine learning models. That's why modeling doesn't really take too much time. Most of your time and effort will be put into data understanding and data preparation.

Here we use the score function of the model directly to get the accuracy score. If you recall what we did in the previous lesson, we first got predictions with the predict function, then compared the predictions with the true labels to get the accuracy score. Both ways do the same thing.

### Slide 5
<img src='images/dt_image.png' width=500>

### Slide 5 Script
A big advantage of decision tree is its interpretability. We can plot the tree from the scikit learn decision tree classifier. This is the decision tree that's created on the iris dataset.

The first line in the decision nodes is the condition. For example, in the root node, the condition is petal width less than or equal to 0.8.

samples is the total number of data points in the node. In the root node, there are 90 samples, which is the total count of the training dataset.

value is a list which shows the count of each class in the node. There are three different species of iris in the dataset, so value is a list of three numbers. The root node indicates that there are 29 setosa, 32 versicolor and 29 virginica iris in the training dataset.

gini is a measure of the impurity of the node. A smaller gini indicates higher homogeneity. For example, in the first node in the second row, the gini is zero, which means all of the 29 samples in this node are the same species, setosa. We reach this node with the condition petal width less than or equal to 0.8. This means that with just one condition, we can already identify all setosa iris.
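The gini value of a node can be computed from the class counts shown in its value list, using the standard Gini impurity formula of one minus the sum of the squared class proportions. The helper name `gini` is made up for this sketch.

```python
def gini(counts):
    """Gini impurity of a node given the class counts in that node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

root = gini([29, 32, 29])   # class counts of the root node described above
pure = gini([29, 0, 0])     # a node that contains only setosa
print(round(root, 3), pure)
```

A node with an almost even mix of three classes has a gini close to 0.667, and a node with a single class has a gini of 0.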

A decision tree can have many layers and eventually reach a point where all leaves have 0 gini, that is, all data points in a leaf node have the same class. The tree we show in this slide has some impure leaves because we limited the depth of the tree to make sure the tree is not too big to fit in a slide.

We can feed new iris data into this tree and we will reach one of the leaf nodes based on the features. The majority class in the leaf node will be the classification of the new data.

For a regression problem, the mean of the outputs of all training data points in the leaf node will be the prediction.
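The leaf mean behavior is easy to see on a tiny example. The six data points below are hypothetical, and `max_depth=1` is set only so that the tree has a single split and exactly two leaves.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data with two clearly separated groups.
X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

# max_depth=1 allows a single split, so the tree has exactly two leaves.
tree = DecisionTreeRegressor(max_depth=1, random_state=23).fit(X, y)

# Each prediction is the mean of the training outputs in the leaf it falls into.
low, high = tree.predict([[2.5], [11.5]])
print(low, high)
```

The first prediction is the mean of 1, 2 and 3, and the second is the mean of 10, 11 and 12.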

### Slide 6
#### Auto MPG Dataset
<img src='images/mpg_dataset.png' width=800>

### Slide 6 Script
For decision tree regression, we will use another seaborn built-in dataset, the auto mpg dataset. The dependent variable is mpg, or miles per gallon. We'd like to predict a vehicle's fuel efficiency based on its features like cylinders, horsepower, model year, etc.



---
### Slide 7
#### Decision Tree Regressor
```
from sklearn.tree import DecisionTreeRegressor

auto_model = DecisionTreeRegressor(random_state=23)
auto_model = auto_model.fit(ind_train, dep_train)
score = auto_model.score(ind_test, dep_test)
```

### Slide 7 Script

The preprocessing is similar to what we did in the decision tree classification. To apply the scikit learn decision tree regressor, we again just need these several lines.

# Review

### Slide 1
#### Module 2 Review
- Linear Regression
- Logistic Regression
- Decision Tree

---
### Slide 1 Script

We learned three machine learning algorithms in this module. Linear regression is for regression problems, and logistic regression is used for classification problems despite its name. A decision tree can be used on both classification and regression problems.



### Slide 2
#### Module 2 Review
- Continuous Features
 - Scaling not needed
- Categorical Features
 - Create dummy variables
   - Linear Regression
   - Logistic Regression
 - Label encoding
   - Decision Tree

### Slide 2 Script
For continuous features, none of the three algorithms requires feature scaling.

For categorical features, creating dummy variables is normally preferred for linear regression and logistic regression. A decision tree, on the other hand, can deal with categorical features directly as long as they have numeric values. For categories with string values, label encoding is good enough for a decision tree.

### Script without slide
The first module's assignment is fairly straightforward. Just remember to work on the problems in order.

Good luck.