<a href="https://colab.research.google.com/github/zmy2338/Machine-Learning-AWS/blob/main/TRAIN_AWS_P1_Lab_10_%5BSTUDENT%5D_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab #10: ML Ethics - Analyzing UCI's Adult Data Set** 
---

### **Description**: 
In this lab, we will analyze UCI's Adult Data Set, explore ethical issues with the data set and with models built from it, and attempt to mitigate any bias we find. Along the way, we will explore AWS Clarify.


### **Lab Structure**
* **Part 1**: [Using Visualizations to Identify Bias](#p1)
* **Part 2**: [Using AWS Clarify to Identify Bias](#p2)
* **Part 3**: [Mitigating Bias](#p3)


<br>


### **Goals:** 
By the end of this lab, you will be able to:
* Explain why imbalanced data can lead to bias in models.
* Identify bias through data exploration and AWS Clarify.
* Mitigate bias in data once you have found it using several different approaches.


<br>

**Before starting, run the code below to import all necessary functions and libraries.**


In [None]:
#!pip install scikit-learn
#!pip install --quiet smclarify

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from smclarify.bias.report import *

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

<a name="p1"></a>

---
## **Part 1: Using Visualizations to Identify Bias**
---

We will be exploring UCI's [Adult Data Set](https://archive-beta.ics.uci.edu/ml/datasets/adult) and the bias inherent within it. This data was intended for predicting whether a person's income exceeds $50K per year based on census data from 1994, but as we will discover, there are some pitfalls with this data set. The column names used in this lab are listed in the `names` variable of the loading cell below.

<br>

**NOTE:** This data set was derived from 'Census Data'. For this project, we are assuming this refers to the US Census Data for the year 1994. Our comparisons for visualizations reflect this assumption.

<br>

**Run the cell below to load the data.**

In [None]:
url = "https://raw.githubusercontent.com/batuhansahincanel/UCIsAdultDataset/main/adult.data"
names = ["Age", "Workclass", "Final-Weight", "Education", "Education-number-of-years", "Marital-status",
         "Occupation", "Relationship", "Race", "Sex", "Capital-gain", "Capital-loss",
         "Hours-per-week", "Native-country", "Target"]

# The file has no header row, so supply the column names explicitly
adult = pd.read_csv(url, names = names)
# dropna() only removes true NaN values; placeholder strings such as ' ?' (if present) are kept
adult = adult.dropna()
adult.head()

### **Problem #1: Sex Distribution**
---
Complete the code below to create a bar graph for the sex distribution in this dataset. What do you notice in the visualization?
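
As a rough sketch of the general pattern (not the solution for this data set), the snippet below counts the categories of a made-up `Fruit` column with `value_counts()` and plots them with `plt.bar`. The toy DataFrame and its column name are assumptions for illustration only.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data (hypothetical, not from the Adult data set)
toy = pd.DataFrame({"Fruit": ["apple", "pear", "apple", "kiwi", "apple", "pear"]})

# value_counts() returns a Series: the index holds the labels,
# the values hold the counts (sorted from most to least frequent)
counts = toy["Fruit"].value_counts()
labels = counts.index
heights = counts.values

plt.bar(labels, heights)
plt.title("Fruit Distribution in the Toy Data", fontweight="bold")
plt.xlabel("Fruit", fontweight="bold")
plt.ylabel("Count", fontweight="bold")
plt.xticks(rotation=45)  # rotate tick labels so longer names stay readable
plt.show()
```

The same two-step idea (count, then plot) applies to any categorical column.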

In [None]:
sex_labels = adult["Sex"].# COMPLETE THIS LINE
sex_counts = adult["Sex"].# COMPLETE THIS LINE

plt.# COMPLETE THIS LINE

plt.title('Sex Distribution in the Data', fontweight = 'bold')
plt.xlabel('Sex', fontweight = 'bold')
plt.ylabel('Number of Adults', fontweight = 'bold')

plt.show()

### **Problem #2: Race Distribution**
---
Complete the code below to create a bar graph for the race distribution in this dataset. What do you notice in the visualization?

<br>

Consider using the following when creating your graph to make the x-ticks more readable: `plt.xticks(rotation = 45)`.

In [None]:
race_labels = # COMPLETE THIS LINE
race_counts = # COMPLETE THIS LINE

plt.# COMPLETE THIS LINE

plt.title(# COMPLETE THIS LINE
plt.xlabel(# COMPLETE THIS LINE
plt.ylabel(# COMPLETE THIS LINE
plt.xticks(# COMPLETE THIS LINE

plt.show()

### **Problem #3: Education Distribution**
---
Create a bar graph for the education distribution in this dataset. What do you notice in the visualization?

In [None]:
# COMPLETE THIS CODE

### **[OPTIONAL] Problem #4: Education Distribution Revised**
---

Modify your solution from above so that the counts for:
* `1st-4th` and `5th-6th` are grouped under a new label: `Elementary`.
* `9th`, `10th`, `11th`, `12th` are grouped under a new label: `High School`.
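
One possible way to merge several labels into one is `Series.replace` with a dictionary, sketched here on a made-up series. The labels and the `Lower` group name are hypothetical; in the real data the stored strings may also carry a leading space, so check the exact values first.

```python
import pandas as pd

# Toy data (hypothetical labels, not the Adult data set's values)
grades = pd.Series(["1st", "2nd", "3rd", "7th", "8th", "2nd"])

# Map several original labels onto one new label;
# labels that are not in the dictionary stay unchanged
grouping = {"1st": "Lower", "2nd": "Lower", "3rd": "Lower"}
grouped = grades.replace(grouping)

# Count the grouped labels, ready to be plotted as a bar graph
grouped_counts = grouped.value_counts()
print(grouped_counts)
```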

In [None]:
# COMPLETE THIS CODE

### **Problem #5: Income Distribution**
---

Create a bar graph for the target (income above or below 50K). What do you notice?

In [None]:
# COMPLETE THIS CODE

### **Problem #6: Target by Sex**
---

Create a new feature, `Target by Sex`, that designates any person with:
* `Target` of `' <=50K'` and `Sex` of `' Male'` as `'Male <=50K'`.
* `Target` of `' <=50K'` and `Sex` of `' Female'` as `'Female <=50K'`.
* `Target` of `' >50K'` and `Sex` of `' Male'` as `'Male >50K'`.
* `Target` of `' >50K'` and `Sex` of `' Female'` as `'Female >50K'`.


Then create a bar plot of this feature.
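
A combined feature like this can be built by joining two string columns. The sketch below uses invented `Team` and `Result` columns (assumptions for illustration) and strips the leading space from each value before concatenating.

```python
import pandas as pd

# Toy data (hypothetical); values carry a leading space, as in this lab's raw data
toy = pd.DataFrame({
    "Team": [" Red", " Blue", " Red"],
    "Result": [" Win", " Loss", " Loss"],
})

# .str.strip() removes surrounding whitespace from each value before joining
toy["Result by Team"] = toy["Team"].str.strip() + " " + toy["Result"].str.strip()
print(toy["Result by Team"].tolist())
```

Calling `value_counts()` on the new column then gives the heights for a bar plot.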

In [None]:
# COMPLETE THIS CODE

<a name="p2"></a>

---
## **Part 2: Using AWS Clarify to Identify Bias**
---

In this section, we are going to use Clarify to identify bias. This method is *much faster* than creating bar graphs for every column.

Here is an AWS Clarify [cheatsheet](https://docs.google.com/document/d/1SPnA_Bqewm3tG2gLF_SmG2DIKdm9Jinm1cRBoURv1yw/edit?usp=sharing) for interpreting results.

<br>

**NOTE**: Steps #1 - 2 have already been taken care of above.

### **Step #3: Denote the facet column, the label column, and the group variable.**
---

Set the:
* Facet column as `Sex`.
* Label column with `Target` as the target column and `' >50K'` as the positive label.
* `Sex` as the group variable.

In [None]:
facet_column = FacetColumn(# COMPLETE THIS LINE
label_column = LabelColumn(# COMPLETE THIS LINE
group_variable = # COMPLETE THIS LINE

### **Step #4: Generate bias report.**
---
Check out this [cheatsheet](https://docs.google.com/document/d/1SPnA_Bqewm3tG2gLF_SmG2DIKdm9Jinm1cRBoURv1yw/edit?usp=sharing) for how to interpret pre-training metrics. Using this page, interpret what the following report is saying.

<br>

**Run the cell below to generate your bias report.**

In [None]:
report = bias_report(adult, facet_column, label_column, stage_type=StageType.PRE_TRAINING, group_variable=group_variable)

# use this to print your report - call it "report" for the code to work
for cl in report:
    print("\n\n","-"*35)
    print("-"*15, cl["value_or_threshold"], "-"*15)
    for metric in cl['metrics']:
        print(f"{metric['description']}: {metric['value']}")

### **Step #5: Look at the imbalance.**
---

Use the [cheatsheet](https://docs.google.com/document/d/1SPnA_Bqewm3tG2gLF_SmG2DIKdm9Jinm1cRBoURv1yw/edit?usp=sharing) provided to interpret what the bias report is telling you. 
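
To build intuition for two of the pre-training metrics, the sketch below computes class imbalance (CI) and difference in proportions of labels (DPL) by hand on a toy table. It assumes the formulas documented for SageMaker Clarify, CI = (n_a - n_d) / (n_a + n_d) and DPL = q_a - q_d, with `a` and `d` as two hypothetical groups. The sign in an actual report depends on which facet value is being described, so treat this as an illustration rather than a reproduction of the report.

```python
import pandas as pd

# Toy data (hypothetical): group "a" has 6 rows, group "d" has 4 rows
toy = pd.DataFrame({
    "Group": ["a"] * 6 + ["d"] * 4,
    "Label": [1, 1, 1, 0, 0, 0, 1, 0, 0, 0],  # 1 = positive outcome
})

n_a = (toy["Group"] == "a").sum()
n_d = (toy["Group"] == "d").sum()

# Class imbalance: 0 means equal representation, values near +1 or -1 mean one group dominates
ci = (n_a - n_d) / (n_a + n_d)

# Proportion of positive labels within each group
q_a = toy.loc[toy["Group"] == "a", "Label"].mean()
q_d = toy.loc[toy["Group"] == "d", "Label"].mean()

# Difference in proportions of labels: 0 means both groups receive positive labels equally often
dpl = q_a - q_d

print(f"CI: {ci:.2f}, DPL: {dpl:.2f}")
```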

In [None]:
# Write Male Results here

In [None]:
# Write Female Results

<a name="p3"></a>

---
## **Part 3: Mitigating Bias**
---
In this section, we will implement a KNN model before and after using different techniques for mitigating bias.

### **Problem #1: Convert `Sex` to numerical values.**

We need to convert any non-numerical binary values into numerical ones that can be used by our model. Specifically, convert:
* `Sex` from `' Female'` and `' Male'` to 0 and 1 respectively.

<br>

**NOTE**: There are several ways you could do this that you have learned, including label encoding.
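
One simple option is `Series.map` with a dictionary. The sketch below uses a made-up `Member` column (an assumption for illustration); note that the dictionary keys must match the stored strings exactly, including any leading space.

```python
import pandas as pd

# Toy data (hypothetical) with the same leading-space quirk as this lab's raw data
toy = pd.DataFrame({"Member": [" Yes", " No", " Yes"]})

# Keys must match the stored strings exactly; unmatched values would become NaN
toy["Member"] = toy["Member"].map({" No": 0, " Yes": 1})
print(toy["Member"].tolist())
```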

In [None]:
# COMPLETE THIS CODE

### **Problem #2: Dummy variable encode non-numerical categorical variables.**

Dummy variable encode the following non-numerical categorical variables:
* `Workclass`
* `Education`
* `Marital-status`
* `Occupation`
* `Relationship`
* `Race`
* `Native-country`

<br>

**NOTE**: Now that you have seen how to do this in yesterday's lab, you can use just one `OneHotEncoder` to transform all these variables at once.

In [None]:
# Create the new dataframe without the columns to be encoded (and without the helper column 'Target by Sex')
new_adult = adult.drop(columns = ['Workclass', 'Education', 'Marital-status', 'Occupation', 'Relationship', 'Race', 'Native-country', 'Target by Sex'])

# Create the encoder and transform the desired columns
ohe = OneHotEncoder(drop = 'first', sparse_output = False)
ohe.set_output(transform = 'pandas')

transformed = ohe.fit_transform(adult[[# COMPLETE THIS LINE BY LISTING ALL COLUMNS TO ENCODE

# Add the encoded columns to the new dataframe
new_adult[# COMPLETE THIS LINE


new_adult.head()

### **Problem #3: Prepare the data for modeling.**
---

Specifically,
* Decide the independent and dependent variables (only including numerical variables).
* Split the data into train and test sets such that 20% is left for testing.
* Scale your data.
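
The split-then-scale order matters: the scaler should learn its mean and standard deviation from the training rows only and then reuse them on the test rows. A minimal sketch on randomly generated toy data follows; the arrays and the `random_state` value are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): 100 rows, 2 numeric features, binary label
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
y = rng.integers(0, 2, size=100)

# Hold out 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test rows

print(X_tr_scaled.shape, X_te_scaled.shape)
```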


In [None]:
# Decide the independent and dependent variables
features = # COMPLETE THIS LINE
label = # COMPLETE THIS LINE

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(# COMPLETE THIS LINE

# Scale your data
scaler = # COMPLETE THIS LINE

X_train_scaled = scaler.# COMPLETE THIS LINE
X_test_scaled = scaler.# COMPLETE THIS LINE

### **Problem #3: Initialize and train your model.**
---

Use a KNN model with K = 5.
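
For reference, here is a small sketch of the fit-and-predict pattern with `KNeighborsClassifier` on two made-up, well-separated clusters. The data and the label names `low` and `high` are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (hypothetical): five points near the origin and five points near (5, 5)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.3, 0.0],
                  [5.0, 5.0], [5.1, 5.2], [5.2, 5.1], [5.0, 5.3], [5.3, 5.0]])
y_toy = np.array(["low"] * 5 + ["high"] * 5)

# n_neighbors is K: each prediction is a majority vote among the K closest training points
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_toy, y_toy)

preds = model.predict(np.array([[0.1, 0.1], [5.1, 5.1]]))
print(preds)
```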

In [None]:
knn = # COMPLETE THIS LINE
knn. # COMPLETE THIS LINE

### **Problem #4: Make predictions for the standardized test data.**
---

In [None]:
y_pred = # COMPLETE THIS LINE

### **Problem #5: Evaluate your model.**
---

Specifically,
* Print the accuracy score.
* Plot the confusion matrix.

<br>

**NOTE**: Since we are using `Target` as the label directly here, we can just use `display_labels=knn.classes_`. This is one good reason for using non-encoded labels.
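
The evaluation pattern can be sketched on hand-written toy predictions. The label lists below are invented for illustration; with a fitted classifier, its `classes_` attribute would supply the class names in place of the hypothetical `class_names` list.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

# Toy labels (hypothetical): string labels can be used directly
y_true = ["no", "no", "yes", "yes", "no", "yes"]
y_hat = ["no", "yes", "yes", "yes", "no", "no"]
class_names = ["no", "yes"]

# Fraction of predictions that match the true labels
acc = accuracy_score(y_true, y_hat)
print(f"Accuracy: {acc}")

# Rows are true classes, columns are predicted classes, in the order given by labels
cm_toy = confusion_matrix(y_true, y_hat, labels=class_names)

disp_toy = ConfusionMatrixDisplay(confusion_matrix=cm_toy, display_labels=class_names)
disp_toy.plot()
plt.show()
```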

In [None]:
# Print the accuracy score
accuracy = # COMPLETE THIS LINE
print(f'Accuracy: {accuracy}')

# Plot the confusion matrix
cm = confusion_matrix(# COMPLETE THIS LINE

disp = ConfusionMatrixDisplay(# COMPLETE THIS LINE
disp.plot()

plt.xticks(rotation = 90)
plt.show()

### **Problem #6: Mitigating bias by blinding.**
---

This method involves just *removing* sex as an input for our model. The hope is that by blinding our model to a feature we have explicitly shown to be biased in the previous Parts, we can remove this bias.

<br>

**NOTE**: Although it is often one of the first ideas people try, blinding is not considered good practice! Other features in the data are often correlated with the removed feature and can act as proxies for it, so the bias has not really been removed. In fact, this can be even more problematic: we may believe we have mitigated the bias and trust the results more, when in fact the results still reflect it.
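
To see why blinding can fall short, the sketch below measures how well a remaining feature can recover a removed one. The toy table and its values are hypothetical; the idea is simply to guess the removed attribute from the most common value within each proxy category.

```python
import pandas as pd

# Toy data (hypothetical): "Relationship" stays in the data after "Sex" is dropped
toy = pd.DataFrame({
    "Sex": ["F", "F", "F", "M", "M", "M", "M", "F", "F"],
    "Relationship": ["Wife", "Wife", "Unmarried", "Husband", "Husband",
                     "Husband", "Unmarried", "Wife", "Unmarried"],
})

# For each proxy value, guess the most common Sex seen with that value
guess = toy.groupby("Relationship")["Sex"].transform(lambda s: s.mode().iloc[0])

# Share of rows where the proxy alone recovers the removed attribute
recovered = (guess == toy["Sex"]).mean()
print(f"Recovered from proxy: {recovered:.2f}")
```

A high recovery rate suggests that a model trained without the removed column could still pick up the same signal through the proxy.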

In [None]:
# Decide the independent and dependent variables
features = new_adult.drop(# COMPLETE THIS LINE SUCH THAT 'Sex' IS NOT A FEATURE
label = new_adult['Target']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(# COMPLETE THIS LINE

# Scale your data
scaler = # COMPLETE THIS LINE

X_train_scaled = # COMPLETE THIS LINE
X_test_scaled = # COMPLETE THIS LINE


# Initialize and train your model
knn = # COMPLETE THIS LINE
knn.# COMPLETE THIS LINE


# Make predictions for the standardized test data
y_pred = knn.# COMPLETE THIS LINE


# Evaluate your model
# Print the accuracy score
accuracy = # COMPLETE THIS LINE
print(f'Accuracy: {accuracy}')

# Plot the confusion matrix
cm = confusion_matrix(# COMPLETE THIS LINE

disp = ConfusionMatrixDisplay(# COMPLETE THIS LINE
disp.plot()

plt.xticks(rotation = 90)
plt.show()

### **Problem #7: Mitigate bias by balancing.**
---

This method involves making the data distribution more equal. In this step, we will balance the sex distribution by ensuring there is an equal number of data points for male and female entries used in our model.

<br>

**Run the cell below to create this balanced dataset.**

In [None]:
# Split into two data frames by sex (0 = female, 1 = male)
females, males = new_adult.query("Sex == 0"), new_adult.query("Sex == 1")

# Randomly downsample the males to the size of the smaller group
sampled_males = males.sample(n=min(len(females), len(males))).reset_index(drop=True)

# Combine the two groups and shuffle the rows
balanced_adult = pd.concat([sampled_males, females]).sample(frac=1).reset_index(drop=True)

In [None]:
# Decide the independent and dependent variables
features = balanced_adult.drop(# COMPLETE THIS LINE
label = balanced_adult['Target']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(# COMPLETE THIS LINE

# Scale your data
scaler = # COMPLETE THIS LINE

X_train_scaled = # COMPLETE THIS LINE
X_test_scaled = # COMPLETE THIS LINE


# Initialize and train your model
knn = # COMPLETE THIS LINE
knn.# COMPLETE THIS LINE


# Make predictions for the standardized test data
y_pred = knn.# COMPLETE THIS LINE


# Evaluate your model
# Print the accuracy score
accuracy = # COMPLETE THIS LINE
print(f'Accuracy: {accuracy}')

# Plot the confusion matrix
cm = confusion_matrix(# COMPLETE THIS LINE

disp = ConfusionMatrixDisplay(# COMPLETE THIS LINE
disp.plot()

plt.xticks(rotation = 90)
plt.show()

# End of Lab

---
© 2023 The Coding School, All rights reserved