<a href="https://colab.research.google.com/github/zmy2338/Machine-Learning-AWS/blob/main/TRAIN_AWS_P1_Lab_5_%5BSTUDENT%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab \#5: Linear Regression**
---

**Description:**  In this lab, you will practice implementing linear regression models on three datasets. Linear regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables. Linear regression is widely used in various fields such as economics, finance, social sciences, and engineering for making predictions, understanding the nature of relationships between variables, and identifying important factors that contribute to the variability of the data. It is a fundamental tool for data analysis and is often one of the first models explored when working with a new dataset.
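As a rough illustration of what "modeling the relationship" means, the sketch below fits a straight line `y ≈ w*x + b` to a few made-up points using NumPy's least-squares solver. The data and the names `x`, `y`, `w`, and `b` are invented for this example and are not part of the lab's datasets.

```python
import numpy as np

# Made-up data that lies exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix: one column for x and one column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])

# Least squares finds the slope w and intercept b that minimize the squared error
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print("slope:", round(w, 3), "intercept:", round(b, 3))
```

With real data the points will not fall exactly on a line, so the fitted slope and intercept describe the best overall trend rather than a perfect match.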

<br>

**About Datasets:**
- **Boston Housing Dataset**: The Boston Housing Dataset is a collection of data that contains information on various features of houses in the Boston area, such as the number of rooms, the age of the house, and the distance to employment centers. The dataset is often used for regression analysis and is a popular benchmark dataset for machine learning algorithms.

- **Diabetes Dataset**: The diabetes dataset includes various patient features such as BMI, age, blood pressure, and glucose levels, which can be used to predict disease progression in diabetic patients.

- **California Housing Dataset**: The California Housing Dataset is a collection of data containing information on the median house value and other features of census block groups in California.
<br>

### **Lab Structure**
**Part 1**: [Boston Housing Dataset](#p1)

**Part 2**: [Diabetes Dataset](#p2)

**Part 3**: [[OPTIONAL] California Housing Dataset](#p3)

**Part 4**: [[ADDITIONAL PRACTICE] Zoo Animal Classification Dataset](#p4)

<br>

**Goals**: By the end of this lab, you will:
* Implement a linear regression model on your own.
* Test and use linear regression models to predict disease progression and housing prices.
<br>

### **Cheat Sheets**
[EDA cheatsheet](https://drive.google.com/file/d/1ZZnIzgcT8dYcGwWVAR9DDFIwGXTGbIiU/view?usp=sharing)

**Run the code below before continuing:**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, model_selection

<a name="p1"></a>

## **Part 1: Boston Housing Dataset [Practice Together]**
---
The following dataset describes Boston housing using 13 numerical features and a numerical target. **Using several features, we are going to build a housing value predictor for Boston in the 1970s.**

The features are as follows:

* `CRIM`: Per capita crime rate by town
* `ZN`: Proportion of residential land zoned for lots over 25,000 sq. ft
* `INDUS`: Proportion of non-retail business acres per town
* `CHAS`: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
* `NOX`: Nitric oxide concentration (parts per 10 million)
* `RM`: Average number of rooms per dwelling
* `AGE`: Proportion of owner-occupied units built prior to 1940
* `DIS`: Weighted distances to five Boston employment centers
* `RAD`: Index of accessibility to radial highways
* `TAX`: Full-value property tax rate per 10,000 dollars
* `PTRATIO`: Pupil-teacher ratio by town
* `B`: 1000(Bk - 0.63)², where Bk is the proportion of [people of African American descent] by town
* `LSTAT`: Percentage of lower status of the population
* **`TARGET`** (target): Median value of owner-occupied homes in $1000s. *In the CSV loaded below, this column is named `medv`; the loading code renames it to `TARGET`.*

<br>

**NOTE:** The Boston housing prices dataset has a noted ethical problem: the authors of this dataset engineered a non-invertible variable “B” assuming that racial self-segregation had a positive impact on house prices. This variable is likely due to the practice of ['Redlining'](https://www.wgbh.org/news/local-news/2019/11/12/how-a-long-ago-map-created-racial-boundaries-that-still-define-boston) from the 1930s to 1970s in Boston, which has had long-lasting effects that are still present in Boston today. The goal of the research that led to the creation of this dataset was to study the impact of air quality, but it did not adequately demonstrate the validity of this assumption. Please know this dataset is used for *practice only* and can serve as a good example of why ethical standards are so important for ML models and implementation. [Read more](https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8) on the problems within this dataset and why it is not used for anything other than practicing ML.

### **Step #1: Load the data**

In [None]:
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
df = pd.read_csv(url)
df = df.rename(columns={'medv': 'TARGET', 'rm': 'RM', 'lstat':'LSTAT'})
df.head()

### **Step #2: Decide independent and dependent variables**
We are going to use "Rooms per dwelling" (`RM`) and "Percentage of lower status of the population" (`LSTAT`) as our independent variables for predicting `TARGET`. Our target, the dependent variable, is the median value of owner-occupied homes. **With these values, we are building a housing value predictor for Boston in the 1970s.**

In [None]:
df[["RM","LSTAT", "TARGET"]]


**Before we continue, create two graphs. One with `LSTAT` and the target, and another with `RM` and the target to explore the relationship between the variables further.**

#### **Solution**

In [None]:
plt.figure(figsize=(20, 5))

features = ['LSTAT', 'RM']
target = df['TARGET']

for i, col in enumerate(features):
    plt.subplot(1, len(features) , i+1)
    x = df[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('target')

### **Step #3: Split data into training and testing data**


#### **Solution**

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(df[["RM", "LSTAT"]], df[["TARGET"]], test_size=0.2, random_state=42)

### **Step #4: Import your algorithm**
Import sklearn's linear regression algorithm.

In [None]:
# import the LinearRegression algorithm
from sklearn.linear_model import LinearRegression

### **Step #5: Initialize your model and set hyperparameters**
Linear regression has no hyperparameters that need tuning for this lab, so just initialize the model with its defaults.

#### **Solution**

In [None]:
# initialize
reg = LinearRegression()

### **Step #6: Fit your model, test on the testing data, and create a visualization if applicable**


#### **Solution**

In [None]:
# fit
reg.fit(X_train, y_train)
# predict
pred = reg.predict(X_test)

**Create a visualization**

Use `y_test` and your predictions (`pred`) from the model as the x and y values of a scatter plot. Then use the following line to visualize where a correct prediction would be:
```
plt.plot([0, 50], [0, 50], '--k', label="Correct prediction")
```

In [None]:
plt.figure(figsize=(8, 8))
plt.scatter(y_test, pred)
plt.plot([0, 50], [0, 50], '--k', label="Correct prediction")
plt.axis('tight')
plt.xlabel('True price ($1000s)')
plt.ylabel('Predicted price ($1000s)')
plt.title("Real vs predicted house prices in Boston")
plt.legend()
plt.tight_layout()

### **Step #7: Evaluate your model**

Use mean squared error and the R2 score as the evaluation metrics.
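For reference, both metrics can be written out by hand. The sketch below uses small made-up arrays (not lab data) to show that MSE is the average squared error and that R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Made-up true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 10.0])

# Mean squared error: average of the squared differences
mse = np.mean((y_true - y_hat) ** 2)

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)
print("R2:", r2)
```

An R2 of 1 means perfect predictions, while an R2 of 0 means the model does no better than always predicting the mean of the true values.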

#### **Solution**

In [None]:
# r2_score must be imported alongside mean_squared_error
from sklearn.metrics import mean_squared_error, r2_score
print('mean squared: ', mean_squared_error(y_test, pred))
print('R2 score: ', r2_score(y_test, pred))


mean squared:  31.243290601783627


### **Step #8: Use the model**
Using the model we created, predict the price of two houses in Boston:

* House 1:  7 `RM` and `LSTAT` is 5.0%

* House 2:  6 `RM` and `LSTAT` is 4.0%

**Note:** You must create a DataFrame containing the information for the new houses:

```python
new_houses = pd.DataFrame(enter_new_house_data_here, columns =["RM", "LSTAT"])
```

This `new_houses` variable can then be passed directly to the model's `predict()` function (here, `reg.predict()`).

#### **Solution**

In [None]:
new_houses = pd.DataFrame([[7,5], [6,4]], columns =["RM", "LSTAT"])
new_prediction = reg.predict(new_houses)
print('prediction: ', new_prediction)

prediction:  [[31.25202152]
 [26.41942131]]


<a name="p2"></a>

## **Part 2: Diabetes Dataset**
---
This dataset contains data from diabetic patients, with features such as BMI, age, blood pressure, and glucose levels that are useful for predicting disease progression. As in Part 1, we will follow the 8 steps of the Machine Learning Process.

**Steps of the ML Process:**
1. **Load the data**
2. **Decide independent variables and dependent variables**
3. **Split the data into training and test data**
4. **Import the model**
5. **Initialize the model and set hyperparameters**
6. **Fit your model, test on the testing data, and create a visualization if applicable**
7. **Evaluate your model**
8. **Use the model**


### **Step #1: Load the data**
The following code will load the data. Turn this into a DataFrame.
```python
diabetes = datasets.load_diabetes()
```
Add a column called `TARGET` with the target data (`diabetes.target`).  In this case, the target is a measure for disease progression.

In [None]:
diabetes = datasets.load_diabetes()
df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
## YOUR CODE HERE ##
df

### **Step #2: Decide independent and dependent variables**
Here we would like to use the `age`, `bmi`, and `bp` columns as our independent variables and `TARGET` as our dependent variable.

We are building a predictor of disease progression.


In [None]:
df[['age', 'bmi', 'bp', "TARGET"]]

### **Step #3: Split data into training and testing data**
Use `age`, `bmi`, and `bp` for our independent variables.
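One possible way to carry out Steps 1-3 is sketched below. The `test_size=0.2` and `random_state=42` values simply mirror Part 1 and are assumptions here; other choices are equally valid.

```python
import pandas as pd
from sklearn import datasets, model_selection

# Step 1: load the data into a DataFrame and attach the target column
diabetes = datasets.load_diabetes()
df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
df["TARGET"] = diabetes.target

# Steps 2-3: choose features and target, then hold out 20% for testing
# (test_size and random_state are assumptions copied from Part 1)
X = df[["age", "bmi", "bp"]]
y = df["TARGET"]
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```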

### **Step #4: Import your model**


### **Step #5: Initialize your model and set hyperparameters**
Linear regression has no hyperparameters that need tuning for this lab, so just initialize the model with its defaults.
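As a quick check, scikit-learn estimators expose their constructor settings through `get_params()`. The sketch below shows that `LinearRegression` does have a few options (for example `fit_intercept`), but the defaults work for this lab and nothing has to be set by hand.

```python
from sklearn.linear_model import LinearRegression

# Initialize with defaults; no arguments are required
reg = LinearRegression()

# get_params() lists the constructor options and their current values
params = reg.get_params()
print(params)
```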

### **Step #6: Fit your model, test on the testing data, and create a visualization if applicable**

**Create a visualization**

Use `y_test` and your `prediction` from the model to create a scatter plot. Then use the following line to visualize where a correct prediction would be.

**This has already been done for you.**
```
plt.plot([0, 300], [0, 300], '--k', label="Correct prediction")
```

In [None]:
plt.figure(figsize=(8, 8))
plt.scatter(y_test, pred)
plt.plot([0, 300], [0, 300], '--k', label="Correct prediction")
plt.axis('tight')
plt.xlabel('True Progression')
plt.ylabel('Predicted Progression')
plt.title("Real vs predicted Disease Progression in Diabetic Patients")
plt.legend()
plt.tight_layout()

### **Step #7: Evaluate your model**


### **Step #8: Use the model**
Using the model we created, predict the disease progression of two new patients:

* Patient 1:  0.0045 `age` 0.053 `bmi` 0.014 `bp`

* Patient 2:  0.0039 `age` -0.012 `bmi` 0.018 `bp`

**Note:** You must create a DataFrame containing the information for the new patients:

```python
new_patient_data = pd.DataFrame(new_patient_data_here, columns =["age", "bmi", "bp"])
```
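A minimal sketch of this step is shown below. It rebuilds and refits the model so that it runs on its own; the split settings are assumptions copied from Part 1. The patient values look small because the features in `load_diabetes()` are already mean-centered and scaled.

```python
import pandas as pd
from sklearn import datasets, model_selection
from sklearn.linear_model import LinearRegression

# Rebuild and fit the model so this block runs on its own
diabetes = datasets.load_diabetes()
df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
df["TARGET"] = diabetes.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    df[["age", "bmi", "bp"]], df["TARGET"], test_size=0.2, random_state=42
)
reg = LinearRegression().fit(X_train, y_train)

# One row per patient, with columns in the same order used for training
new_patient_data = pd.DataFrame(
    [[0.0045, 0.053, 0.014], [0.0039, -0.012, 0.018]],
    columns=["age", "bmi", "bp"],
)
new_prediction = reg.predict(new_patient_data)
print("prediction:", new_prediction)
```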

<a name="p3"></a>

## **Part 3: California Housing Dataset [Optional]**
---
This dataset was derived from the 1990 U.S. Census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000). We will use the data to help make a model that will predict the median house value in California in 1990.

**Specifically, create a linear regression model and predict the median house value of a district that has 7.2 average rooms, 1.5 average bedrooms, a house age of 51 years, and is located at latitude 38.1 and longitude -121.08. *Try different independent variables for your model and see how the evaluation metrics change.***


### **Step #1: Load the data**

In [None]:
#import relevant packages
from sklearn.datasets import fetch_california_housing
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# 1
cali_data = fetch_california_housing()
df = pd.DataFrame(data=cali_data.data, columns=cali_data.feature_names)
df['TARGET'] = cali_data.target

### **Step #2: Decide independent and dependent variables**
The dependent variable will be `TARGET`, so find the best independent variables using `.var()` and `.corr()`.
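To show how these two methods can guide feature selection, the sketch below runs them on a small made-up table. The column names `strong_feature` and `weak_feature` are hypothetical and are not columns of the California dataset.

```python
import numpy as np
import pandas as pd

# Small made-up table; column names are hypothetical
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "strong_feature": x,
    "weak_feature": rng.normal(size=200),
    "TARGET": 3.0 * x + rng.normal(scale=0.5, size=200),
})

# Variance of each column: a near-constant column carries little information
print(toy.var())

# Correlation of every feature with the target, strongest first
corr_with_target = toy.corr()["TARGET"].drop("TARGET")
print(corr_with_target.abs().sort_values(ascending=False))
```

Features whose correlation with `TARGET` is far from zero are usually better candidates. Variance depends on each column's units, so compare it across columns with care.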

In [None]:
# var

In [None]:
# corr

### **Steps #3-6: Split data, import/initialize your model, fit the model, make a prediction, and create a visualization**

In [None]:
# 2 - features used here: HouseAge, AveBedrms, Latitude, and Longitude

# 3
X_train, X_test, y_train, y_test = train_test_split(df[['HouseAge', 'AveBedrms', 'Latitude', 'Longitude']], df['TARGET'], test_size=0.2)

# 4 - 6


In [None]:
# 6 part two: visualization








### **Steps #7-8: Evaluate and use the model**

In [None]:
# 7


# 8


<a name="p4"></a>

## [OPTIONAL] **Part 4: Zoo Animal Classification Dataset**
---
The following dataset contains information on various zoo animals, including their characteristics and classifications. Our goal is to build a model that predicts the classification of an animal based on its features.

The features are as follows:


* `animal_name`: Name of the animal
* `hair`: Hair presence (1 if present, 0 if not)
* `feathers`: Feather presence (1 if present, 0 if not)
* `eggs`: Egg-laying ability (1 if yes, 0 if no)
* `milk`: Milk production ability (1 if yes, 0 if no)
* `airborne`: Ability to fly (1 if yes, 0 if no)
* `aquatic`: Ability to live in water (1 if yes, 0 if no)
* `predator`: Predatory behavior (1 if yes, 0 if no)
* `toothed`: Teeth presence (1 if present, 0 if not)
* `backbone`: Backbone presence (1 if present, 0 if not)
* `breathes`: Ability to breathe (1 if yes, 0 if no)
* `venomous`: Venom presence (1 if present, 0 if not)
* `fins`: Fin presence (1 if present, 0 if not)
* `legs`: Number of legs (numeric)
* `tail`: Tail presence (1 if present, 0 if not)
* `domestic`: Domestication status (1 if domestic, 0 if not)
* `catsize`: Animal size (1 if cat-size or larger, 0 if smaller)
* `class_type`: Numeric class identifier (1-7)

### **Step #1: Load the data**

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.data"

# Create dataframe
column_names = ['animal_name', 'hair', 'feathers', 'eggs', 'milk', 'airborne', 'aquatic', 'predator', 'toothed', 'backbone', 'breathes', 'venomous', 'fins', 'legs', 'tail', 'domestic', 'catsize', 'class_type']
df = pd.read_csv(url, names=column_names)
df.head()

### **Step #2: Decide independent and dependent variables**
We are going to use all features except `animal_name` and `class_type` as our independent variables for predicting `class_type`.



### **Step #3: Split data into training and testing data**

### **Step #4: Import your algorithm**
Import sklearn's DecisionTreeClassifier algorithm.

### **Step #5: Initialize your model and set hyperparameters**
Initialize the DecisionTreeClassifier model.

### **Step #6: Fit your model, test on the testing data**

### **Step #7: Evaluate your model**
Use `accuracy_score` as the evaluation metric.
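The sketch below shows the general pattern for Steps 3-7 with a classifier. It uses a tiny made-up table with two binary features rather than the zoo data, so the feature values, labels, and split settings are all assumptions for illustration.

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up binary features [hair, feathers] and made-up class labels
X = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [0, 0]] * 5
y = [1, 1, 2, 2, 4, 4] * 5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# accuracy_score is the fraction of test labels predicted correctly
print("accuracy:", accuracy_score(y_test, pred))
```

Because the made-up labels are fully determined by the two features, the tree classifies this toy data perfectly. Real data such as the zoo dataset will usually give a lower score.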

**Reflection question:** How accurately was your algorithm able to predict the type of species?

In [None]:
''

'Your Answer Here'


''

**Congratulations on finishing this notebook!** In this lab, we practiced implementing linear regression models on three datasets: Boston Housing Dataset, Diabetes Dataset, and California Housing Dataset. We learned how to load and explore datasets, split data into training and testing data, and implement a linear regression algorithm in Python using `scikit-learn`.

---
© 2023 The Coding School, All rights reserved