
# **Lab #8: Logistic Regression**
---

### **Description**
In this lab, we'll learn how to use logistic regression to classify data into different categories. We'll start with binary classification, where the goal is to separate data into two categories. We'll revisit the breast cancer dataset, which consists of biopsied breast tissue samples that are classified as either malignant or benign. This is the same dataset we used in Lab 6 with the KNN model.

In the second part of the lab, we'll move on to multiclass classification, where the goal is to separate data into more than two categories. We'll use the Digits dataset, which consists of images of handwritten digits, with each image represented as an 8x8 array of grayscale pixels.

For both of these parts, we'll use pandas to load and preprocess the data, matplotlib to visualize it, and scikit-learn (sklearn) to build and evaluate the logistic regression model.

Finally, we'll return to the Titanic dataset from Kaggle and compare the performance of KNN and Logistic Regression models to see which performs best on the dataset.

<br>


### **Datasets**

> For the binary classification part of the lab, we'll use the **Breast Cancer Wisconsin (Diagnostic) dataset**, which can be loaded using scikit-learn's `load_breast_cancer` function. The dataset consists of 569 samples of biopsied breast tissue, each described by 30 features. The goal is to classify each tissue sample as either malignant (cancerous) or benign (non-cancerous).

> For the multiclass classification part of the lab, we'll use the **Digits dataset**, which can be loaded using scikit-learn's `load_digits` function. The dataset consists of 1,797 images of handwritten digits, with each image represented as an 8x8 array of grayscale pixels. The goal is to classify each image into one of 10 classes, corresponding to the digits 0-9.
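
As a quick sanity check of the numbers above, the sketch below loads both datasets and prints their shapes. The variable names `cancer` and `digits` are illustrative only; the lab cells use `data`.

```python
from sklearn.datasets import load_breast_cancer, load_digits

# Illustrative names (assumption); the lab itself stores each dataset in `data`.
cancer = load_breast_cancer()
digits = load_digits()

print(cancer.data.shape)    # (number of samples, number of features)
print(cancer.target_names)  # class names for labels 0 and 1, in that order
print(digits.data.shape)    # each 8x8 image flattened into 64 pixel values
print(digits.images.shape)  # the same images kept as 8x8 arrays
```

Note that in the breast cancer dataset, label 0 means malignant and label 1 means benign, which matters when you read the classification report later.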

<br>

### **Lab Structure**
**Part 1**: Binary Classification

**Part 2**: Multiclass Classification

**Part 3**: Titanic Project Continued: Battle of the Models

> **Part 3.1:** Build, Train, and Validate Models

> **Part 3.2:** [OPTIONAL] Make Predictions


<br>

### **Goals** 
By the end of this lab, you will:
* Know how to implement logistic regression for binary classification.
* Know how to implement logistic regression for multiclass classification.
* Understand how to validate models by choosing an appropriate metric and using the validation dataset.

<br>

### **Cheat Sheets**
[Logistic Regression with sklearn](https://docs.google.com/document/d/1rLTuWGgx9E-K1pgWYxUF4B1ExKKxt6MVSkgEKoUbhuE/edit?usp=sharing)

<br>

**Before starting, run the code below to import all necessary functions and libraries.**


In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer, load_digits

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

---
## **Part 1: Binary Classification**
---

In this part, we'll use logistic regression to classify breast cancer tumors as malignant or benign. We'll use pandas to load and preprocess the data, matplotlib to visualize it, and scikit-learn (sklearn) to build and evaluate the logistic regression model.

**The code for loading the dataset has been provided for you. Run the cell below.**

In [None]:
data = load_breast_cancer()

X = data.data
y = data.target

df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

##### **Problem #1: Print a summary of the dataset.**
---

The data has been pre-cleaned so we can move on to modeling.

##### **Problem #2: Split the data into train and test sets.**
---

We are not comparing models or making changes to the model, so we will skip adding a validation set this time. Make sure the test dataset is 20% of the original dataset.
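
To illustrate what a 20% test split looks like, here is a minimal sketch on synthetic data, so the exercise on the real dataset is still yours to write. The data from `make_classification`, the `_demo` names, and the `random_state` values are all assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption for illustration, not the lab data).
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # 80 training rows and 20 test rows
```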

##### **Problem #3: Import, initialize, and train a Logistic Regression model.**
---
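
The general import, initialize, and train pattern can be sketched on synthetic data as follows. The data and the choice of `max_iter=1000` are assumptions for this example rather than settings the lab requires.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# max_iter is raised above the default as a precaution against early stopping.
model = LogisticRegression(max_iter=1000)
model.fit(X_demo, y_demo)

print(model.classes_)     # the labels the model learned to separate
print(model.coef_.shape)  # one weight per feature for a binary problem
print(model.intercept_)   # the bias term
```

On unscaled data the default solver may stop before it converges. Raising `max_iter` or scaling the features with `StandardScaler` (already imported above) are two common remedies.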


##### **Problem #4: Make predictions for the test data.**
---
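
A fitted logistic regression model can return either hard labels or class probabilities. The sketch below shows both on synthetic data, plus one way to turn probabilities back into labels. The 0.5 threshold and all variable names here are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

labels = model.predict(X_demo)       # hard class labels (0 or 1)
proba = model.predict_proba(X_demo)  # one column per class: P(class 0), P(class 1)
pos_proba = proba[:, 1]              # probability of class 1 only

# Thresholding at 0.5 (an assumed cutoff) converts probabilities into labels.
labels_from_threshold = (pos_proba >= 0.5).astype(int)

print(proba[:3])
print(labels[:3], labels_from_threshold[:3])
```

Choosing a threshold other than 0.5 trades false positives against false negatives, which can matter in a medical setting like this one.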


In [None]:
y_pred = # COMPLETE THIS LINE

y_pred_proba = # COMPLETE THIS LINE

y_pred_binary = # COMPLETE THIS LINE

##### **Problem #5: Print the accuracy score.**
---


In [None]:
accuracy = # WRITE YOUR CODE HERE
print(f'Accuracy: {accuracy}')

##### **Problem #6: Print the classification report.**
---
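
Both the accuracy score and the classification report can be checked by hand on a tiny example. The labels below are made up for illustration.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hand-made labels (assumption for illustration).
y_true_demo = [0, 1, 1, 0, 1]
y_pred_demo = [0, 1, 0, 0, 1]

# Four of the five predictions match, so accuracy is 4/5.
acc = accuracy_score(y_true_demo, y_pred_demo)
print(f"Accuracy: {acc}")

# The text report lists precision, recall, and F1 for each class.
print(classification_report(y_true_demo, y_pred_demo))

# output_dict=True returns the same numbers as a nested dictionary.
report_dict = classification_report(y_true_demo, y_pred_demo, output_dict=True)
print(report_dict["1"]["recall"])  # 2 of the 3 true 1s were found
```

In both functions the true labels come first and the predictions second.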

In [None]:
report = # WRITE YOUR CODE HERE
print(report)

#### **Discussion Question: How did Logistic Regression perform compared to the KNN model in Lab 6?**

---

<center>

#### **Back to lecture**

</center>

---

---
## **Part 2: Multiclass Classification**
---

In this part, we'll use logistic regression to classify handwritten digits into their respective numerical values. We'll use the Digits dataset, which consists of 1,797 grayscale images of size 8x8 pixels, each represented as a 64-dimensional feature vector. The goal is to classify each image into one of 10 classes (corresponding to the 10 digits).

##### **Problem #7: Load and plot the data.**
---

We will provide code to load the data into a dataframe and plot a sample. Separate the data into `X` (features) and `y` (target) variables.

In [None]:
data = load_digits()
df = pd.DataFrame(data.data, columns=[f'pixel{i}' for i in range(64)])
df['target'] = data.target

fig, axes = plt.subplots(nrows=3, ncols=5, figsize=(8, 6))
for i, ax in enumerate(axes.flat):
    ax.imshow(data.images[i], cmap='binary')
    ax.set_title(f'Target: {data.target[i]}')
plt.show()

X = # WRITE YOUR CODE HERE
y = # WRITE YOUR CODE HERE

##### **Problem #8: Split the data into train and test sets.**
---

We are not comparing models or making changes to the model, so we will skip adding a validation set this time. Make sure the test dataset is 20% of the original dataset.

##### **Problem #9: Initialize and train a Logistic Regression model for multiclass classification.**
---

Use the one-vs-rest (`ovr`) multi-class mode.
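
One-vs-rest trains one binary classifier per class, each separating that class from all the others. The sketch below makes this visible on synthetic three-class data using `OneVsRestClassifier`. This wrapper is a swapped-in alternative to passing `multi_class='ovr'` to `LogisticRegression`, which may raise a deprecation warning depending on your scikit-learn version. The data and names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic three-class data (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# The wrapper fits one binary logistic regression per class.
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_model.fit(X_demo, y_demo)

print(len(ovr_model.estimators_))  # number of underlying binary models
print(ovr_model.classes_)          # the class labels seen during training
```

For the Digits dataset, the same idea would produce 10 binary models, one per digit.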

##### **Problem #10: Make predictions for the test data.**
---
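
With more than two classes, `predict_proba` returns one column per class, and the predicted label is the class with the highest probability. Here is a minimal sketch on synthetic three-class data; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic three-class data (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)  # shape (n_samples, n_classes)

# Picking the most probable column recovers the hard predictions.
top_class = clf.classes_[np.argmax(proba, axis=1)]

print(proba[0])                            # probabilities for the first sample
print(top_class[0], clf.predict(X_demo)[0])
```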


In [None]:
y_pred = # COMPLETE THIS LINE

y_pred_proba = # COMPLETE THIS LINE

y_pred_binary = # COMPLETE THIS LINE

##### **Problem #11: Print the accuracy score.**
---


In [None]:
accuracy = # WRITE YOUR CODE HERE
print(f'Accuracy: {accuracy}')

##### **Problem #12: Plot the confusion matrix.**
---
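
A confusion matrix counts, for each true class (row), how often each class was predicted (column). The hand-made labels below are an assumption for illustration and are small enough to verify by eye.

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Hand-made labels (assumption for illustration).
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes.
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)

# The display object wraps the matrix so it can be drawn as a heatmap.
disp_demo = ConfusionMatrixDisplay(confusion_matrix=cm_demo, display_labels=[0, 1, 2])
```

Calling `disp_demo.plot()` followed by `plt.show()` draws the matrix. Values on the diagonal are correct predictions, and everything off the diagonal is a mistake.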

In [None]:
cm = # WRITE YOUR CODE HERE
disp = # WRITE YOUR CODE HERE
disp.plot()
plt.show()

---

<center>

#### **Back to lecture**

</center>

---

---
## **Part 3: Battle of the Titanic Models**
---

In this part, we will revisit the Titanic dataset from the [Titanic competition on Kaggle](https://www.kaggle.com/c/titanic). In Lab 4, you cleaned the data, created and encoded features, and visualized the data to understand patterns and trends among passengers who survived. Here, we will take the analysis a step further and use the classification models we have learned about (KNN and Logistic Regression) to predict whether a passenger survived. We will validate and improve the models to pick a winner that will make the final predictions. You have the option of submitting those predictions to the Kaggle competition if you want to find out the final score.


---

### **Part 3.1: Build, Train, and Validate Models**

---

Recall the steps for validating and improving models:

1. **Decide on a metric** as the basis for comparing the models and a target value for that metric
2. **Train** the models on the training dataset
3. **Evaluate** the models on the validation dataset with the chosen metric
4. **Make changes** to the models *if improvements are needed*
5. **Repeat** steps 2–4 until the target metric is achieved by one or more models.
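
The steps above can be sketched as a short loop. This example runs on synthetic data with an assumed target of 0.8, and the names `candidates` and `validation_scores` are illustrative rather than part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Step 1: pick a metric (accuracy) and a target value (assumed to be 0.8 here).
target_accuracy = 0.8

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN-3": KNeighborsClassifier(n_neighbors=3),
}

validation_scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)                     # Step 2: train on the training set
    val_predictions = model.predict(X_val)    # Step 3: evaluate on validation data
    validation_scores[name] = accuracy_score(y_val, val_predictions)
    target_met = validation_scores[name] >= target_accuracy
    print(f"{name}: {validation_scores[name]:.3f} (target met: {target_met})")

# Steps 4 and 5: if no model meets the target, adjust the models and rerun the loop.
```

The test data plays no part in this loop. It is reserved for the final predictions once a winner has been chosen.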

Next lecture, we will discuss advanced methods for validating and improving models. 

First, we'll clean and prepare the data the same way we did in Lab 4. We'll wrap all of the data preparation in a single function, which is good practice because it guarantees every dataset is prepared the same way. We'll apply this function to the training data (which we will split further into training and validation datasets) and to the test dataset from Kaggle. We do not have the solutions for the test data, so if you want to see your result, you'll have to submit your predictions to the competition.

**This code has been provided for you. Run the cell below.**

In [None]:
def prepare_titanic_data(input_data):
  # We will create a copy and preserve the original dataframe
  data = input_data.copy() 

  # Clean the data
  data.drop_duplicates(inplace=True)
  data = data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
  data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
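  data['Age'] = data['Age'].fillna(data['Age'].median())
  # Fill any missing fares so every row gets a FareGroup below
  data['Fare'] = data['Fare'].fillna(data['Fare'].median())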

  # Feature creation
  data.loc[data['Age'] < 18, 'AgeGroup'] = 'Child'
  data.loc[(data['Age'] >= 18) & (data['Age'] < 65), 'AgeGroup'] = 'Adult'
  data.loc[data['Age'] >= 65, 'AgeGroup'] = 'Elderly'

  data['FareGroup'] = pd.qcut(data['Fare'], 4, labels=['Cheap','Low','High','Expensive'])
  data['FamilySize'] = data['SibSp'] + data['Parch'] + 1

  # Feature encoding
  sex_map = {'male': 0, 'female': 1}
  data['Sex_encoded'] = data['Sex'].map(sex_map)

  age_map = {'Child': 0, 'Adult': 1, 'Elderly': 2}
  data['AgeGroup_encoded'] = data['AgeGroup'].map(age_map)

  fare_map = {'Cheap': 0, 'Low': 1, 'High': 2, 'Expensive': 3}
  data['FareGroup_encoded'] = data['FareGroup'].map(fare_map)

  embark_map = {'S': 0, 'C': 1, 'Q': 2}
  data['Embarked_encoded'] = data['Embarked'].map(embark_map)
  
  return data
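
To see what the binning and encoding steps in this function do, here is a minimal sketch on a toy table. The `toy` DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Toy data (assumption for illustration, not the Titanic data).
toy = pd.DataFrame({
    "Fare": [1, 2, 3, 4, 5, 6, 7, 8],
    "Sex": ["male", "female"] * 4,
})

# pd.qcut splits the values into 4 bins holding roughly equal numbers of rows.
toy["FareGroup"] = pd.qcut(toy["Fare"], 4, labels=["Cheap", "Low", "High", "Expensive"])

# .map replaces each category with the number assigned to it in the dictionary.
toy["Sex_encoded"] = toy["Sex"].map({"male": 0, "female": 1})

print(toy)
```

Because `pd.qcut` computes its bin edges from whatever data it receives, the same fare can land in a different group in the train and test sets. That is one thing you could revisit when improving the features later.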

Next, we need to load the train and test data and apply the function to preprocess both.

**This code has been provided for you. Run the cell below.**

In [None]:
# Load Kaggle train and test data
train_data = pd.read_csv("https://raw.githubusercontent.com/n-sachdeva/titanic/main/train.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/n-sachdeva/titanic/main/test.csv")

# Use preprocessing function on both train and test data
train = prepare_titanic_data(train_data)
test = prepare_titanic_data(test_data)

# Select features to use for modeling
features = ["Sex_encoded","Pclass", "FamilySize", 
            "AgeGroup_encoded", "FareGroup_encoded", "Embarked_encoded"]

# Separate features and target
X = train[features]
y = train['Survived']
X_test = test[features]
# There is no y_test because we do not have the true labels for the test set

##### **Problem #13: Split the train data into training and validation datasets.**

---



In [None]:
# Split train data into train/validation data
X_train, X_valid, y_train, y_valid = # WRITE YOUR CODE HERE

##### **Problem #14: Initialize a logistic regression model, a KNN model with k=3, and a KNN model with k=5.**

---



In [None]:
# Create models
logr = # WRITE YOUR CODE HERE
knn3 = # WRITE YOUR CODE HERE
knn5 = # WRITE YOUR CODE HERE

---

<center>

#### **Back to lecture**

</center>

---

#### **Step #1: Decide on a metric and target value**
---

There are many choices for metrics to use. In this case we will use accuracy, and set a target value of 0.8. These can be changed later if we learn something from the validation process that suggests this may not be the best metric or target.


#### **Step #2: Train the models on the training set.**
---



##### **Problem #15: Train each of the models: `logr`, `knn3`, and `knn5`.**
---

#### **Step #3: Evaluate**
---

Now, we can check the accuracy of each model on the training and validation datasets.


##### **Problem #16: Evaluate and print the accuracy scores.**
---

Fill in the code where indicated to print the accuracy scores on the training and validation datasets.

In [None]:
# Print training and validation scores
models = [logr, knn3, knn5]
model_names = ['Logistic Regression', 'KNN-3', 'KNN-5']

# Looping over models to print scores for each
for model, name in zip(models, model_names):
  train_predictions = model.predict(# FILL IN CODE HERE)
  train_score = accuracy_score(# FILL IN CODE HERE)
  print(name)
  print(f'Training Accuracy: {train_score}')
  validation_predictions = model.predict(# FILL IN CODE HERE)
  validation_score = accuracy_score(# FILL IN CODE HERE)
  print(f'Validation Accuracy: {validation_score}\n')


#### **Discussion Questions**

* Did any of the models reach the target value on the validation set? 
* What trends do you notice in the results?
* Did any models show signs of overfitting or underfitting? How so?
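
One common sign of overfitting is a training accuracy that is much higher than the validation accuracy. The sketch below shows an extreme case on noisy synthetic data: a KNN model with k=1, which is not one of the lab's models and is used here only as an illustration. All data and names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy synthetic data (assumption): flip_y randomly reassigns 20% of the labels.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, flip_y=0.2, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# With k=1, each training point is its own nearest neighbor,
# so the model simply memorizes the training labels.
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, knn1.predict(X_tr))
val_acc = accuracy_score(y_val, knn1.predict(X_val))

print(f"Training Accuracy: {train_acc}")
print(f"Validation Accuracy: {val_acc}")
print(f"Gap: {train_acc - val_acc}")
```

A large gap suggests the model has memorized noise. Low accuracy on both sets would instead point to underfitting.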

##### **Problem #17: Which model performed best?**

In [None]:
winning_model = # ADD THE WINNING MODEL HERE

---

### **[OPTIONAL] Part 3.2: Make Final Predictions**  

---

Now, we can make final predictions. This dataset is from the [Kaggle Titanic competition](https://www.kaggle.com/c/titanic). There is a test dataset from the competition we can use for final predictions. **We do not have the solutions for the test data.** We will provide code to generate a submission file you can upload on Kaggle if you'd like to submit your work and see the score on Kaggle. You will need to create a Kaggle account to do so. As you progress in this course, feel free to apply your knowledge to improve the features and models in this notebook and resubmit your predictions on Kaggle.

Run the cell below. The file `submission.csv` will appear in your file explorer on the left side panel. You may download it to your computer and upload it to the competition to see your score.

In [None]:
predictions = winning_model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Successfully saved!")

# End of notebook
---
© 2023 The Coding School, All rights reserved