# **Lab 11: Final Projects**
---

### **Description**
In today's notebook, you will apply what you have learned throughout Part II, particularly Deep Learning and Natural Language Processing, to several projects that reinforce these skills. We encourage you not only to solve the problems and follow the steps below, but also to consider how you could apply other techniques you have learned to each problem, or use the given datasets to come up with new problems to solve.

<br>

### **Lab Structure**
**Part 1**: [Heart Attack Predictor](#p1)

**Part 2**: [Amazon Review Sentiment Analysis](#p2)

**Part 3**: [Semantic Segmentation with U-Net](#p3)

**Part 4**: [Generating Wikipedia Entries](#p4)


<br>

### **Goals**
By the end of this lab, you will have honed the skills you have learned throughout the program and started to see how you could extend them to more complex situations.

<br>

### **Cheat Sheets**
* [EDA with pandas](https://docs.google.com/document/d/1FFoqw45P-kuoq912ARP4qfdGeLTqoq73_qjZThPp2_8/edit?usp=drive_link)

* [Data Visualization with matplotlib](https://docs.google.com/document/d/1YlUp6ll81qOyDpU1OWzE-SPxQ3hnF5C9ukLRL_6PYKE/edit?usp=drive_link)

* [Linear Regression with sklearn](https://docs.google.com/document/d/1iVieBynTpoKq1LA0kR-4pqDo6evoW5wvbNyE0wOGhYY/edit?usp=drive_link)

* [KNN with sklearn](https://docs.google.com/document/d/1U-AWXkJEDXZFqhBwFlDjyp9bLsVOeeXGYaxa6SZ7KpY/edit?usp=drive_link)

* [Logistic Regression with sklearn](https://docs.google.com/document/d/1Xi4fXFROik5Rs6C0d3oIM-OmK3pvw7MwkvM3TJw7vn4/edit?usp=drive_link)

* [Deep Learning with pytorch](https://docs.google.com/document/d/1Wm01maZUrSuwdOhuI05uZBtqt5nL5shOGnJ7kTHWl_I/edit?usp=drive_link)

* [CNNs with pytorch](https://docs.google.com/document/d/15UV1gVy5J6fzAD5vYyikiprew4erlR9Fop66h89ql0w/edit?usp=drive_link)

* [Natural Language Processing I](https://docs.google.com/document/d/1MamYMxe8zlWoiDc0tX2RzUKQULCPVUh-2QtdzRRvzcs/edit?usp=drive_link)

* [Natural Language Processing II](https://docs.google.com/document/d/1OoP-sFW6qMk0BzvYMlavgJtiXX9eziTUptlFdzgLfGk/edit?usp=drive_link)

* [Natural Language Processing III](https://docs.google.com/document/d/1jrzya_r_97qrmk7RGKqWCPhkHhsKMsMDkrn7YjdZUK0/edit?usp=drive_link)

<br>

**Before starting, run the code below to import all necessary functions and libraries.**


In [None]:
!pip --quiet install scikit-learn scikit-optimize
!pip --quiet install torchview torch graphviz
!pip --quiet install fastai
!pip --quiet install hyperopt
!conda install -q python-graphviz

In [None]:
import os
import random
from random import choices

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from PIL import Image

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

from fastai.vision.all import *
from fastai.text.all import *
from fastai.optimizer import Adam

import torchvision
from torchvision import datasets, transforms
from torchvision.datasets import ImageFolder

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import *


from skopt import BayesSearchCV
from skopt.space import Integer
from hyperopt import fmin, tpe, rand, hp, Trials, STATUS_OK

import warnings
warnings.filterwarnings('ignore')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # Use the GPU when one is available


def binary_accuracy(y_pred, y_true):
    # Output 0 if y_pred <= 0.5 and 1 if y_pred is > 0.5
    y_pred = (y_pred > 0.5).float()
    # Returns accuracy
    return (y_pred == y_true).float().mean()

<a name="p1"></a>

---
## **Part 1: Heart Attack Predictor**
---

Your goal in this section is to use KNN and a neural network to predict whether a given patient is likely to have a heart attack or not. Specifically, the dataset has:

* **Features**: `'age'`, `'sex'`, `'cp'`, `'trestbps'`, `'chol'`, `'fbs'`, `'restecg'`, `'thalach'`, `'exang'`, `'oldpeak'` that describe a variety of health statistics taken by doctors for a given patient.

* **Target**: `'heart attack'` which is 0 if the patient has not had a heart attack and 1 if the patient has had a heart attack.


<br>

**Run the code provided below to import the dataset and split it into training and validation sets.**

In [None]:
# Load the data into a DataFrame
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSa0metcKBFqn-MHLn05vVGWONMlzljcWa-xIM1wJPXIa5kbrmIzGqmWcMh8eKG_ntByF9qqn6Mx3MT/pub?gid=1052859518&single=true&output=csv'
df = pd.read_csv(url)

# Split the data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(
    df.drop(columns = 'heart attack'),
    df['heart attack'],
    test_size = 0.2,
    random_state = 42)

# Define a custom function for creating a DataLoader from features and target data
def create_dataloader(X,y):
  X_tensor = torch.tensor(X.to_numpy(), dtype=torch.float32)
  y_tensor = torch.tensor(y.to_numpy(), dtype=torch.float32).unsqueeze(1)
  dataset = list(zip(X_tensor, y_tensor))
  dl = DataLoader(dataset, batch_size=64, shuffle=True)
  return dl

# Define the DataLoaders
train_dl = create_dataloader(X_train, y_train)
valid_dl = create_dataloader(X_valid, y_valid)
dls = DataLoaders(train_dl, valid_dl)

# Define DataLoaders for female subset
train_dl_female = create_dataloader(
    X_train[X_train['sex']==0],
    y_train[X_train['sex']==0])
valid_dl_female = create_dataloader(
    X_valid[X_valid['sex']==0],
    y_valid[X_valid['sex']==0])
dls_female = DataLoaders(train_dl_female, valid_dl_female)

# Define DataLoaders for male subset
train_dl_male = create_dataloader(
    X_train[X_train['sex']==1],
    y_train[X_train['sex']==1])
valid_dl_male = create_dataloader(
    X_valid[X_valid['sex']==1],
    y_valid[X_valid['sex']==1])
dls_male = DataLoaders(train_dl_male, valid_dl_male)

#### **Problem #1.1: Use a KNN classifier**


Train a KNN model to perform this task. Try multiple values of K (`n_neighbors`) to achieve the highest performance you can on the training and validation data. Consider using random search, grid search, or Bayesian optimization to find the best value of K.
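
As one illustration of the search strategies mentioned above, here is a minimal grid-search sketch. It runs on synthetic data from `make_classification` rather than the heart attack dataset, and the names `X_demo`, `y_demo`, `k_values`, and `search` are placeholders chosen for this example, not part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 10 numeric features, binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Try odd values of K from 1 to 29 using 5-fold cross-validation on the training split
k_values = list(range(1, 30, 2))
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': k_values},
    cv=5,
    scoring='accuracy')
search.fit(X_tr, y_tr)

# GridSearchCV refits the best model on the full training split by default
best_k = search.best_params_['n_neighbors']
valid_acc = search.score(X_va, y_va)
print(f"Best K: {best_k}, validation accuracy: {valid_acc:.4f}")
```

Cross-validating on the training split keeps the validation set untouched until the final check, which gives a more honest estimate of how the chosen K generalizes.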

In [None]:
# Define the KNN Classifier


# Train the model


# Make predictions
y_pred = # COMPLETE THIS CODE

train_accuracy = accuracy_score(y_train, # COMPLETE THIS CODE
valid_accuracy = accuracy_score(# COMPLETE THIS CODE

print(f"Training accuracy: {train_accuracy:.4f}")
print(f"Validation accuracy: {valid_accuracy:.4f}")

# Display confusion matrix
cm = confusion_matrix(y_valid, y_pred, labels=knn.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=knn.classes_)
disp.plot()

plt.xticks(rotation = 90)
plt.show()

#### **Problem #1.2: Use a fully connected neural network**

Train a fully connected network to perform this task. Change the model architecture and hyperparameters to improve the model, particularly using regularization and hyperparameter tuning techniques. What's the highest accuracy you are able to achieve on the validation set?
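
For reference, one possible shape for such a network is sketched below. The helper `build_mlp` and its default sizes are hypothetical choices for this example, not the expected solution. It uses dropout for regularization and ends in a sigmoid so the outputs are probabilities, which is what the `binary_accuracy` helper (thresholding at 0.5) assumes.

```python
import torch
import torch.nn as nn

def build_mlp(n_features, hidden_size=32, dropout_p=0.3):
    """Small fully connected binary classifier with dropout regularization."""
    return nn.Sequential(
        nn.Linear(n_features, hidden_size),
        nn.ReLU(),
        nn.Dropout(dropout_p),            # randomly zeroes activations during training
        nn.Linear(hidden_size, hidden_size // 2),
        nn.ReLU(),
        nn.Dropout(dropout_p),
        nn.Linear(hidden_size // 2, 1),
        nn.Sigmoid(),                     # outputs a probability in [0, 1]
    )

# Quick shape check on random inputs (10 features, matching the feature list above)
demo_model = build_mlp(n_features=10)
demo_model.eval()                         # disables dropout for inference
with torch.no_grad():
    demo_probs = demo_model(torch.randn(4, 10))
print(demo_probs.shape)
```

With a sigmoid output, `nn.BCELoss()` is a natural loss function. Values such as `hidden_size` and `dropout_p` are good candidates for the hyperparameter search space.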

In [None]:
# Hyperparameter search
# Define the objective function to minimize (Hyperopt minimizes the loss, so we negate accuracy)
def objective(params):

    # COMPLETE THIS CODE



# Define the search space
space = {
    # COMPLETE THIS CODE
}

# Perform optimization
trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20, trials=trials)




# Train the best model
# Get the best parameters and best score
best_params = {k: best[k] for k in best}
best_score = -trials.best_trial['result']['loss']


model = nn.Sequential(
  # COMPLETE THIS CODE
)

# COMPLETE THIS CODE

learn = Learner(dls, model, loss_func=loss_func, metrics=binary_accuracy, # COMPLETE THIS CODE
learn.fit(# COMPLETE THIS CODE



# Calculate training accuracy
train_loss, train_accuracy = learn.validate(# COMPLETE THIS CODE
print(f"Training accuracy: {train_accuracy:.4f}")

# Calculate validation accuracy
valid_loss, valid_accuracy = learn.validate(# COMPLETE THIS CODE
print(f"Validation accuracy: {valid_accuracy:.4f}")

#### **Reflection Questions**
* Which of your models performed better?
* Does deep learning always outperform traditional ML?
* What kinds of problems are best-suited for deep learning?

In [None]:
'''
# WRITE YOUR RESPONSES HERE
''';

#### **Problem #1.3: Evaluate for Females and Males Separately**


Now, evaluate both of your models for females and males separately to see if there's any difference in the performance of your high performing models. You can just print the validation accuracy.

##### **1. Evaluate KNN for Female vs. Male.**

In [None]:
female_rows = X_valid['sex']==0
male_rows = # COMPLETE THIS CODE

female_preds = knn.predict(X_valid[female_rows])
male_preds = # COMPLETE THIS CODE

accuracy_female = accuracy_score(y_valid[female_rows], # COMPLETE THIS CODE
accuracy_male = accuracy_score( # COMPLETE THIS CODE
print(f"Female validation accuracy: {accuracy_female:.4f}")
print(f"Male validation accuracy: {accuracy_male:.4f}")

##### **2. Evaluate NN for Female vs. Male.**

**NOTE**: We have already created separate training and validation data loaders for females and males that you can use: `train_dl_female`, `valid_dl_female`, `train_dl_male`, and `valid_dl_male`.

In [None]:
# Calculate female training and validation accuracy
_, train_accuracy_female = learn.validate(# COMPLETE THIS CODE
print(f"Female training accuracy: {train_accuracy_female:.4f}")

_, valid_accuracy_female = learn.validate(# COMPLETE THIS CODE
print(f"Female validation accuracy: {valid_accuracy_female:.4f}")


# Calculate male training and validation accuracy
_, train_accuracy_male = learn.validate(# COMPLETE THIS CODE
print(f"Male training accuracy: {train_accuracy_male:.4f}")

_, valid_accuracy_male = learn.validate(# COMPLETE THIS CODE
print(f"Male validation accuracy: {valid_accuracy_male:.4f}")

#### **Problem #1.4: Examine why**


You likely saw a noticeable difference in the performance between females and males for both models. Examine why this might be by doing the following:

1. Plot a bar chart of the number of males vs. females in this data.
2. Plot a grouped bar chart of the number of males and females that did not have a heart attack vs. those that did.
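
Before plotting, it can help to tabulate the counts that the two charts will display. The sketch below does this with `pd.crosstab` on a tiny made-up table; `toy_df` and its values are invented for illustration and are not the lab data.

```python
import pandas as pd

# Made-up rows (assumption): sex is 0 = female, 1 = male, as in the lab data
toy_df = pd.DataFrame({
    'sex':          [0, 0, 1, 1, 1, 1, 1, 0],
    'heart attack': [0, 1, 1, 1, 0, 1, 1, 0],
})

# Rows are sex, columns are the target; each cell is a count of patients
counts = pd.crosstab(toy_df['sex'], toy_df['heart attack'])
print(counts)
```

Each column of the resulting table corresponds to one group of bars in the grouped bar chart.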

##### **1. Plot a bar chart of the number of males vs. females in this data.**

In [None]:
plt.bar(['Female', 'Male'], df['sex'].# COMPLETE THIS CODE

# COMPLETE THIS CODE

##### **2. Plot a grouped bar chart of the number of males and females that did not have a heart attack vs. those that did.**

In [None]:
df_female = df[df['sex'] == 0]
df_male = df[df['sex'] == 1]

plt.bar([-0.1, 0.9], # COMPLETE THIS CODE
plt.bar([0.1, 1.1], # COMPLETE THIS CODE

plt.xticks(ticks = [0, 1], labels = ['No Heart Attack', 'Heart Attack'], fontsize = 'x-large')
plt.title('Breakdown of Heart Attacks by Sex', fontsize = 'x-large')
plt.legend()
plt.show()

#### **Problem #1.5: What if we blind the models to this variable?**


A common approach to avoid bias is to take a "blind" approach, in which we remove the biased variable from the equation. In this case, we'll do that by training new models using data without the `'sex'` column. Specifically,

1. Train and evaluate a KNN Classifier on the blind data.
2. Train and evaluate a NN on the blind data.

In both cases, evaluate the models separately on the female and male rows just as you did in Problem #1.3.

<br>

**Run the code below before starting to create the blind data.**

In [None]:
X_train_blind = X_train.drop(columns=['sex'])
X_valid_blind = X_valid.drop(columns=['sex'])

##### **1. Train and evaluate a KNN Classifier on the blind data.**

In [None]:
# COMPLETE THIS CODE

##### **2. Train and evaluate a neural net on the blind data.**

First, run the cell below to set up the dataloaders.


In [None]:
train_dl_blind = create_dataloader(X_train_blind, y_train)
valid_dl_blind = create_dataloader(X_valid_blind, y_valid)
dls_blind = DataLoaders(train_dl_blind, valid_dl_blind)

In [None]:
# Perform hyperparameter search


# Train the model


# Calculate overall training and validation accuracy


# Calculate female training and validation accuracy


# Calculate male training and validation accuracy


### **Potential Future Work**
---

This is an unfortunately common case of biased data, specifically *unbalanced data*, leading to potentially harmful results. The attempt at blinding the models to the sex of the patient likely provided little to no help. Oftentimes, bias runs deeper than the most obvious variables and may be correlated with others in ways that humans and especially advanced ML algorithms can still pick up on. Consider some of the following ideas for improving on these results:

* Training models separately for male and female data.

* Using statistical methods for balancing the data. For instance, upsampling and downsampling are common first approaches to tackling this problem.

* Finding a dataset that is more balanced to begin with. In an ideal world, we would make sure that the data is balanced (representative) upon collection.
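
To make the upsampling idea concrete, here is a minimal sketch using `DataFrame.sample` with replacement. The table `toy_df` and its `'chol'` values are invented for illustration; in practice you would resample only the training split so that duplicated rows never leak into the validation set.

```python
import pandas as pd

# Made-up imbalanced table (assumption): 8 male rows, 2 female rows
toy_df = pd.DataFrame({
    'sex':  [1] * 8 + [0] * 2,
    'chol': [210, 250, 190, 230, 260, 240, 200, 220, 205, 245],
})

majority = toy_df[toy_df['sex'] == 1]
minority = toy_df[toy_df['sex'] == 0]

# Draw minority rows with replacement until both groups are the same size
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

# Combine and shuffle so the groups are interleaved
balanced = (pd.concat([majority, minority_upsampled])
              .sample(frac=1, random_state=42)
              .reset_index(drop=True))
print(balanced['sex'].value_counts())
```

Downsampling is the mirror image: sample the majority group without replacement down to the size of the minority group, at the cost of discarding data.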

<a name="p2"></a>

---
## **Part 2: Amazon Review Sentiment Analysis**
---

Your goal in this section is to use a neural network to predict a customer's satisfaction, measured by their rating from 0.5 to 5, from the text of their review. You can use a model of your choice. This is a difficult dataset. What's the highest accuracy you are able to achieve?

The dataframe includes a second text column with stopwords removed. You can use either `'text'` or `'text_without_stopwords'` for your model.

<br>

**Run the code provided below to import the dataset.**

In [None]:
df = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vT3fAwK4iEaWvsgy5XjbbwVxyzVpQj3En2hk7hO9D5giyk8zvx9xfOP0aU4o9p0ujvaeV4Tcfi-JnyN/pub?gid=103697572&single=true&output=csv')
df['text'] = df['title'] + ' ' + df['review']

# Define stopwords
import sklearn.feature_extraction.text as text
stop = text.ENGLISH_STOP_WORDS

# Remove stopwords from text column
df['text_without_stopwords'] = df['text'].apply(lambda x: ' '.join([word for word in str(x).split() if word not in stop]) if isinstance(x, str) else x)

# Show the results
df[['text', 'text_without_stopwords','rating']].head(2)

In [None]:
df.iloc[3:5]

#### **Step #1: Import and split data into training and validation sets**


Use TextDataLoaders to load and split the data.


In [None]:
dls = TextDataLoaders.from_df(
    df,
    text_col=# COMPLETE THIS CODE
    label_col='rating',
    valid_pct=0.2,
    bs=#choose a batch size,
    seq_len=#choose a sequence length,
    device=device
)

#### **Step #2: Determine the input dimension of your data**


Print the length of the vocabulary.
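
For intuition about why the vocabulary length is the input dimension, the sketch below builds a vocabulary by hand in plain Python. This is an illustration only: fastai builds its own vocabulary with its own tokenizer and special tokens, and the two `reviews` strings and the `'<unk>'` token here are made up.

```python
from collections import Counter

# Made-up reviews (assumption), tokenized by simple whitespace splitting
reviews = ["great product works great", "terrible product broke quickly"]
token_counts = Counter(tok for review in reviews for tok in review.split())

# Reserve index 0 for unknown tokens, then order tokens by frequency
vocab = ['<unk>'] + [tok for tok, _ in token_counts.most_common()]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

# Each review becomes a list of integer ids in the range [0, vocab_size)
encoded = [[token_to_id.get(tok, 0) for tok in review.split()] for review in reviews]
vocab_size = len(vocab)
print(vocab_size, encoded)
```

Every token id indexes one row of an embedding layer (or one position of a one-hot vector), so the first layer of the model needs exactly `vocab_size` entries.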

In [None]:
# COMPLETE THIS CODE

#### **Steps #3-6: Build the neural network**


You can use a model of your choice. We recommend using a model *with* embeddings. You may also use a pre-trained model.

In [None]:
# COMPLETE THIS CODE

#### **Step #7: Train the model**


In [None]:
# COMPLETE THIS CODE

#### **Step #8: Evaluate the model**


In [None]:
# COMPLETE THIS CODE

### **Potential Future Work**
---

Hopefully, through some trial and error and using the variety of tools you have learned at this point, you were able to create a pretty accurate model. If you are interested in going further with this dataset, here are some ideas to consider:
* Compare results with and without stopwords. Which performed better?

* How does the model perform when given just the title or just the review?

* Can you treat this as a regression problem?

* To get further insights into the data or even what your model is doing, you could create a wordcloud using this library: https://pypi.org/project/wordcloud/ (or others).

<a name="p3"></a>

---
## **Part 3: Semantic Segmentation with U-Net**
---

In this project, we'll complete a semantic segmentation task on a subset of the CamVid dataset. The CamVid dataset is a relatively small dataset containing images of street scenes, with pixel-level annotations for 32 semantic classes.

We'll use the U-Net architecture with a pretrained ResNet-34 "backbone" to perform the segmentation.

U-Net is a semantic segmentation architecture designed in a U-shape, composed of an encoder (contracting path) and a decoder (expanding path). The encoder is a typical convolutional neural network (CNN) that uses convolution and pooling layers. The decoder, a "reverse" CNN, recovers spatial information using up-convolution (also called transpose convolution) layers and upsampling layers.
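
The encoder/decoder structure described above can be sketched as a toy model with a single downsampling step. `TinyUNet` and its channel sizes are hypothetical and far smaller than the network `unet_learner()` builds. The point is only to show the pooling step, the transpose convolution that restores resolution, and the skip connection that concatenates encoder features into the decoder.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """One-level U-Net-style network: encode, downsample, upsample, decode."""
    def __init__(self, in_channels=3, n_classes=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_channels, 8, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)                        # halves height and width
        self.mid = nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(16, 8, kernel_size=2, stride=2)  # doubles height and width
        self.dec = nn.Sequential(nn.Conv2d(16, 8, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(8, n_classes, kernel_size=1)  # per-pixel class scores

    def forward(self, x):
        e = self.enc(x)                         # (B, 8, H, W)
        m = self.mid(self.pool(e))              # (B, 16, H/2, W/2)
        u = self.up(m)                          # (B, 8, H, W)
        d = self.dec(torch.cat([u, e], dim=1))  # skip connection: concatenate encoder features
        return self.head(d)                     # (B, n_classes, H, W)

demo_out = TinyUNet()(torch.randn(2, 3, 64, 64))
print(demo_out.shape)
```

The output has one channel per class at the input resolution, so taking the argmax over the channel dimension gives a predicted class for every pixel.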

Using a ResNet-34 backbone means that the encoder of the U-Net is replaced with a pretrained ResNet-34 architecture. This takes advantage of transfer learning and a more efficient design with residual connections, resulting in faster convergence and better performance for the segmentation task.

<br>

**Run the code provided below to import the dataset.**

In [None]:
path = untar_data(URLs.CAMVID_TINY)

def get_y_fn(x):
    return path/'labels'/f'{x.stem}_P{x.suffix}'

codes = np.loadtxt(path/'codes.txt', dtype=str)

dls = SegmentationDataLoaders.from_label_func(
    path,
    get_image_files(path/"images"),
    get_y_fn,
    codes=codes,
    bs=8,
    item_tfms=Resize(460),
    batch_tfms=[*aug_transforms(), Normalize.from_stats(*imagenet_stats)]
)

#### **Problem #3.1: Create the Learner**



You'll be using the [`unet_learner()`](https://docs.fast.ai/vision.learner.html#unet_learner) function, which takes the backbone architecture as an argument; here, that backbone is ResNet-34. Instantiate the learner with the following inputs:
* `dls`
* `resnet34`
* `metrics=Dice()`
* `wd=1e-2`

Finally, you can call the `to_fp16()` method at the end of the learner definition to have the model train in mixed precision (using 16-bit half-precision floats where possible), which can save a lot of training time.

The Dice metric is well suited to segmentation problems. It is a coefficient that measures the overlap between the predicted mask and the ground-truth mask, ranging from 0 (no overlap) to 1 (perfect overlap). You should aim to increase this metric as much as possible when hyperparameter tuning.
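
For intuition, the Dice coefficient for a pair of binary masks is twice the size of their intersection divided by the sum of their sizes. The NumPy sketch below computes it directly; `dice_coefficient` is a hypothetical helper written for this example and is not fastai's implementation.

```python
import numpy as np

def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 * |A and B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    total = pred.sum() + true.sum()
    if total == 0:
        return 1.0  # both masks empty: treat as perfect agreement (a convention)
    return float(2.0 * intersection / total)

pred = [[1, 1], [0, 0]]
true = [[1, 0], [0, 0]]
print(f"Dice: {dice_coefficient(pred, true):.4f}")
```

Here the masks share one pixel out of three marked pixels in total, so the score is 2 * 1 / 3.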

In [None]:
learn = # COMPLETE THIS CODE

#### **Problem #3.2**



Train your model using either `learn.fit()` or `learn.fine_tune()`. Remember that if you use `learn.fit()`, you will need to call `learn.freeze()` first.

In [None]:
# COMPLETE THIS CODE

#### **Problem #3.3: View results**


Run the code below to visualize your results. How did your model perform? How much can you improve the segmentation masks through hyperparameter tuning?

In [None]:
learn.show_results(max_n=2, figsize=(10, 10))

In [None]:
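# Replace this placeholder with the path to an image of your own
image_path = 'path/to/your/image.jpg'
# `Image` is the PIL module imported at the top of the notebook
img = Image.open(image_path)
# learn.predict returns a tuple; the first element is the predicted mask
pred_mask, _, _ = learn.predict(img)
pred_mask.show(figsize=(5, 5))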

### **Potential Future Work**
---

Image segmentation is a challenging task that requires balancing model complexity, dataset size, and computation time. Here are some ideas for improving on these results:

* Training with the full CamVid dataset or another image segmentation dataset.

* Fine-tuning the pre-trained model. This project used a pre-trained ResNet34 as the backbone of the UNet model. However, there are many other pre-trained models available that may work better for this specific task. Consider trying a different model, such as ResNet50 or VGG16, and fine-tuning the model on the segmentation dataset.


<a name="p4"></a>

---
## **Part 4: Generating Wikipedia Entries**
---

In this section, you will apply what you learned about generating text by training a model on 30,000 sentences from Wikipedia as of 2021. This text has been downloaded from [https://wortschatz.uni-leipzig.de/en/download/English](https://wortschatz.uni-leipzig.de/en/download/English), which also contains the entries in several other languages as well as other corpora from the internet.

<br>


**Run the code provided below to import the dataset.**

In [None]:
path = untar_data(URLs.WIKITEXT_TINY)
train_df = pd.read_csv(path/'train.csv', header=None, names=['text'])
dls = TextDataLoaders.from_df(train_df,
                              text_col='text',
                              is_lm=True,
                              valid_pct=0.1,
                              bs=64)

#### **Problem #4.1: Define a pre-trained language model learner**



You'll use the function `language_model_learner()` which works very similarly to `text_classifier_learner()`. Pass the following inputs to the model:
* `dls`
* `AWD_LSTM`
* `metrics=[accuracy, Perplexity()]`,
* `wd=0.1`

Finally, you can call the `to_fp16()` method at the end of the learner definition to have the model train in mixed precision (using 16-bit half-precision floats where possible), which can save a lot of training time.
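
For intuition about the `Perplexity()` metric in the list above: perplexity is the exponential of the average cross-entropy loss, so lower is better. The NumPy sketch below computes it from the probabilities a model assigned to the correct next tokens; the helper `perplexity` and the example probabilities are made up for illustration.

```python
import numpy as np

def perplexity(true_token_probs):
    """exp(mean negative log-probability assigned to the correct tokens)."""
    probs = np.asarray(true_token_probs, dtype=float)
    cross_entropy = -np.mean(np.log(probs))
    return float(np.exp(cross_entropy))

# A model that always gives the correct token probability 0.25 is as
# uncertain as a uniform guess among 4 options.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
print(f"Perplexity: {uniform_ppl:.2f}")
```

A perplexity of 1 would mean the model always assigns probability 1 to the correct next token.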

In [None]:
learn = language_model_learner(dls, AWD_LSTM, metrics=[accuracy, Perplexity()], wd=0.1).to_fp16()

#### **Problem #4.2: Train your model**


In [None]:
learn.fine_tune(5)

If you are happy with your model, we recommend saving it so you don't have to retrain it later. The code for saving the model is provided for you below.

In [None]:
learn.save('fine_tuned_wikitext_tiny')

Below is the code to load the model again. Only the model weights are saved, so when using `learn.load()` make sure you first define the model again.

In [None]:
learn.load('fine_tuned_wikitext_tiny')

#### **Problem #4.3: Generate Text**


We have provided a function below that you can use to generate text. The inputs are:
* `prompt`: the string that the model should continue
* `n_words`: the number of words the model should generate
* `temperature`: a parameter that controls the amount of "randomness" in the model's response
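
To see what temperature does, here is a generic NumPy sketch of temperature-scaled sampling. It is not fastai's internal code, and `softmax_with_temperature` and the example `logits` are made up. Dividing the model's scores by a temperature below 1 sharpens the distribution toward the most likely token, while a temperature above 1 flattens it.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities after scaling by 1 / temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()   # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]             # made-up scores for three candidate tokens
cool = softmax_with_temperature(logits, temperature=0.5)
warm = softmax_with_temperature(logits, temperature=2.0)
print("T=0.5:", np.round(cool, 3))
print("T=2.0:", np.round(warm, 3))

# Sample the next token id from the flatter distribution
rng = np.random.default_rng(0)
next_token = int(rng.choice(len(logits), p=warm))
```

Low temperatures tend to produce safer, more repetitive text, while high temperatures produce more varied but less coherent text.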

**Run the cell below to set up the function.**

In [None]:
def generate_text(prompt, n_words=20, temperature=1.0):
    return learn.predict(prompt, n_words, temperature=temperature)

In [None]:
prompt = "Chicago is a place where"
predicted_text = generate_text(prompt,30,0.75)
print(predicted_text)

### **Potential Future Work**
---

Text generation can be a tricky task that can be extremely dependent on the framing of the problem, the dataset available, the models used, and more. Consider some of the following ideas for improving on these results:

* Training with the full Wikitext dataset or another language model dataset.

* Hyperparameter tuning or using a different pre-trained model, maybe a transformer model.

* Can you think of how to modify this code to make a chatbot?

# End of notebook
---
© 2024 The Coding School, All rights reserved