# MBAI 448 | Week 3 Assignment: AI & Machine Learning Fundamentals

##### Assignment Overview

This assignment explores how statistical learning can be applied to a real-world problem. It is organized into three Acts:

- Act I: Understand the problem and context
- Act II: Prototype a solution with AI technology
- Act III: Socialize the work with stakeholders

##### Assignment Tools

This assignment assumes you will be working with GitHub Copilot in VS Code, and requires you to submit your chat history along with this notebook. If you are curious about how to work effectively with GitHub Copilot, please consult the [VS Code documentation](https://code.visualstudio.com/docs/copilot/overview).

Submissions that demonstrate thoughtless interaction with Copilot (e.g., asking Copilot to just read the notebook and produce all the outputs) will receive reduced credit.

### Act I: Understand the problem and context

##### Business Goal / Case Statement
Improve retention by proactively identifying employees who may resign.

##### Assignment Context

**Relevant Industry and/or Business Function:** HR

**Description:** You report to the new Director of Talent Development at Zaibatsu, a large industrial conglomerate.  Given the tight labor market, one of their top priorities is improving employee retention.  They have shared with you a personnel file of company employees, and are looking for your help to understand and ultimately predict employee attrition.

##### The Data

**Data Location:** <code>'./data/attrition.csv'</code>

#### Step 0 : Scope the work in `agents.md`

Before moving forward, create a file named `agents.md` in the project root directory (likely the same level as the directory in which this notebook lives). This file specifies the intended role of AI in this project and serves as reference context for GitHub Copilot as you work.

Your `agents.md` must include the following five sections:

##### 1. What we’re building
A one-sentence "elevator pitch" describing the prototype and its primary output (e.g., "A predictive lead-scoring engine that identifies high-value customers based on historical CRM data.")

##### 2. How AI helps solve the business problem
2–4 bullet points explaining the specific value-add of the AI components. Focus on the transition from the business "pain point" to the AI "solution."

##### 3. Key file locations and data structure
List the paths that matter (e.g., `./mbai448_week03_assignment.ipynb`, `./data/attrition.csv`).

##### 4. High-level execution plan
A step-by-step outline of the build process (e.g., 1. Data processing, 2. Model training, 3. Evaluation, 4. Model tuning). Feel free to ask Copilot for help (or take a peek at the steps in Act II below) to get a sense of how to structure the work.

##### 5. Code conventions and constraints
To ensure the prototype remains manageable, add 1–2 bullet points specifying that code should be as simple and straightforward as possible, using standard libraries unless instructed otherwise.

### Act II: Prototype a solution with AI technology

## Prototyping an Employee Attrition Forecaster

In this act, you will prototype a workflow that uses employee data to:

- segment the workforce into groups using unsupervised learning, and  
- train classifiers that estimate attrition risk.

This is meant to be an exploratory prototype for better understanding how employee representations, clustering approaches, and predictive models can help anticipate employee attrition.

You are encouraged to use GitHub Copilot throughout. For each step, follow the same disciplined loop:

- **Plan**: Have Copilot create a short, narrative plan describing what needs to happen and what artifacts will be produced.
- **Validate**: Review and revise that plan until it is complete, coherent, and aligned with the purpose of the step.
- **Execute**: Once the plan is validated, have Copilot implement it in code.
- **Check**: Use the resulting code to perform one or two concrete actions that confirm you have what you need.

#### Environment Setup

To run this notebook locally as you move through the assignment, we suggest you create and activate a Python virtual environment.

From the project root directory:

##### On macOS/Linux:
`python -m venv venv`

`source venv/bin/activate`

##### On Windows:
`python -m venv venv`

`venv\Scripts\activate`

Once your virtual environment is activated, you can set it as the kernel for this notebook in the top right corner of your notebook pane.


## Step 1: Load the dataset
### Plan
Have Copilot create a plan to load the employee dataset into a DataFrame, inspect its columns and data types, and display a small sample of rows.

### Validate
- identify the attrition label column explicitly
- distinguish categorical from numeric fields
- produce visible output rather than silent assignments

### Execute
Once the plan is validated, implement it in code.

### Check
- Print the dataset shape and list of column names.
- Display 3–5 rows and confirm they read like plausible employee records.
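As a rough illustration of the load-and-inspect step, the sketch below reads a tiny inline CSV instead of the real file so that it runs anywhere. The column names (`Age`, `Department`, `MonthlyIncome`, `Attrition`) and values are hypothetical placeholders; the actual columns in `./data/attrition.csv` may differ and should be confirmed from the output.

```python
import io

import pandas as pd

# Hypothetical stand-in for ./data/attrition.csv; real column names may differ.
SAMPLE_CSV = """Age,Department,MonthlyIncome,Attrition
41,Sales,5993,Yes
49,Research,5130,No
37,Research,2090,Yes
33,Sales,2909,No
"""

# In the notebook, load the real file instead:
# df = pd.read_csv("./data/attrition.csv")
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

label_col = "Attrition"  # assumed label name; confirm against df.columns

# Split the non-label columns into numeric and categorical fields.
feature_df = df.drop(columns=[label_col])
numeric_cols = feature_df.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in feature_df.columns if c not in numeric_cols]

# Visible output rather than silent assignments.
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())
print("Numeric fields:", numeric_cols)
print("Categorical fields:", categorical_cols)
print(df.head(3))
```

Naming the label column in a variable up front makes it easier to keep it out of the feature matrix in later steps.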

In [None]:
# write Step 1 code below



## Step 2: Transform the data into a modeling-ready representation

### Plan
- clean the data (e.g., missing values)
- encode categorical variables numerically (e.g., one-hot encoding)
- separate features from the attrition label

### Validate
- keep the attrition label out of the feature matrix
- ensure the resulting feature set is purely numeric
- make the transformation visible (e.g., before/after column counts)

### Execute
Implement the validated transformation steps in code.

### Check
- Print the number of feature columns before and after transformation.
- Confirm the attrition label is not included in the feature matrix.
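One possible shape for this transformation is sketched below on a toy DataFrame. The column names, the median fill for missing numeric values, and the `Yes`/`No` label coding are all assumptions made for illustration, not properties of the real dataset.

```python
import pandas as pd

# Hypothetical toy data; the real columns and label values may differ.
df = pd.DataFrame({
    "Age": [41, 49, 37, None, 27],
    "Department": ["Sales", "Research", "Research", "Sales", "HR"],
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
})

label_col = "Attrition"  # assumed label name

# Separate the label from the features before any encoding.
y = (df[label_col] == "Yes").astype(int)
X_raw = df.drop(columns=[label_col])

numeric_cols = X_raw.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in X_raw.columns if c not in numeric_cols]

# Clean: fill missing numeric values with the column median (one simple choice).
X_clean = X_raw.copy()
X_clean[numeric_cols] = X_clean[numeric_cols].fillna(X_clean[numeric_cols].median())

# Encode: one-hot encode the categorical columns.
X = pd.get_dummies(X_clean, columns=categorical_cols, dtype=float)

print("Feature columns before:", X_raw.shape[1])
print("Feature columns after: ", X.shape[1])
print("Label in features?", label_col in X.columns)
```

One-hot encoding widens the matrix by one column per category, which is why the before/after column counts are a useful visibility check.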

### Food for Thought
- Is each row a good representation of an employee?

In [None]:
# write Step 2 code below



## Step 3: Cluster employees using unsupervised learning

### Plan
- fit a KMeans model on the transformed feature matrix
- assign a cluster label to each employee
- store cluster labels alongside the employee data

### Validate
- exclude the attrition label from clustering
- specify the number of clusters explicitly
- produce inspectable cluster assignments

### Execute
Implement KMeans clustering and attach labels to the data.

### Check
- Print the number of employees per cluster.
- Confirm that clusters are not trivially identical in size.
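A minimal sketch of the clustering step with scikit-learn's `KMeans` follows. The feature matrix here is synthetic, the column names are hypothetical, and `n_clusters=3` is an arbitrary starting value. Standardizing features first is an added assumption (KMeans relies on distances, so unscaled columns with large ranges can dominate), not something the assignment prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the transformed feature matrix (no attrition label).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Age": rng.integers(22, 60, size=200),
    "MonthlyIncome": rng.normal(6000, 2000, size=200),
    "YearsAtCompany": rng.integers(0, 30, size=200),
})

# Assumption: put features on a comparable scale before measuring distances.
X_scaled = StandardScaler().fit_transform(X)

# The number of clusters is set explicitly; 3 is only a placeholder.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)

# Store the cluster label alongside the employee data for inspection.
clustered = X.copy()
clustered["cluster"] = kmeans.fit_predict(X_scaled)

cluster_sizes = clustered["cluster"].value_counts().sort_index()
print(cluster_sizes)
```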

In [None]:
# write Step 3 code below



## Step 4: Reduce dimensionality to visualize cluster structure

### Plan
- apply PCA to reduce the feature matrix to two components
- project employees into this reduced space
- visualize clusters in two dimensions

### Validate
- fit PCA only on feature data (not labels)
- produce a 2D representation suitable for plotting
- use cluster labels to color or differentiate points

### Execute
Implement PCA and create a scatter plot of the two components.

### Check
- Generate a scatter plot of the two PCA components colored by cluster.
- Confirm that the plot renders and that points are not all collapsed.
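The projection and plot could look roughly like the sketch below, which uses scikit-learn's `PCA` and matplotlib on a synthetic feature matrix. The data, the cluster count, and the plot styling are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric feature matrix (labels excluded).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X_scaled = StandardScaler().fit_transform(X)

# Cluster labels are used only for coloring, not for fitting PCA.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Fit PCA on features only and project each row to two components.
pca = PCA(n_components=2, random_state=0)
coords = pca.fit_transform(X_scaled)

fig, ax = plt.subplots(figsize=(6, 4))
points = ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="viridis", s=15)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_title("Employees in PCA space, colored by cluster")
fig.colorbar(points, ax=ax, label="cluster")
# In a notebook the figure renders when the cell finishes;
# in a plain script, call plt.show() or fig.savefig(...).

print("Explained variance ratio:", pca.explained_variance_ratio_)
```

The explained variance ratio indicates how much of the original spread the two plotted components retain, which helps calibrate how much to read into the picture.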

In [None]:
# write Step 4 code below



## Step 5: Select and justify a cluster count

### Plan
- compute a clustering quality metric (e.g., inertia) across multiple values of k
- visualize how that metric changes as k increases
- select a reasonable k to proceed with

### Validate
- compare multiple values of k
- treat the selection as judgment-based
- re-run clustering with the chosen k

### Execute
Compute the metric across values of k and plot the elbow/metric curve.

### Check
- Produce a plot showing metric vs. k.
- Re-run clustering with the chosen k and confirm cluster assignments update.
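To make the elbow procedure concrete, the sketch below computes KMeans inertia over a range of k on synthetic blob data and then re-fits with a chosen value. The data, the range of k, and `chosen_k = 4` are assumptions for illustration; on real data the choice is a judgment call informed by the curve.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the transformed feature matrix.
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)

# Compute inertia (within-cluster sum of squared distances) for each k.
k_values = list(range(2, 9))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")

# Judgment call: pick the k where the curve stops dropping sharply.
chosen_k = 4  # placeholder choice for this synthetic data
final_labels = KMeans(n_clusters=chosen_k, n_init=10, random_state=0).fit_predict(X)
print("Clusters after re-run:", sorted(set(final_labels.tolist())))
```

Plotting `k_values` against `inertias` as a line chart gives the elbow curve. Inertia generally keeps falling as k grows, so the bend in the curve matters more than the minimum.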

In [None]:
# write Step 5 code below



## Step 6: Train a baseline attrition classifier (all employees)

### Plan
- split the dataset into training and test sets
- train a classification model to predict attrition
- evaluate model performance

### Validate
- use a classification-appropriate model
- evaluate predictions on held-out data
- produce a concrete evaluation artifact (e.g., confusion matrix)

### Execute
Train the model and evaluate on test data.

### Check
- Generate a confusion matrix.
- Identify which type of error appears most frequently.
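A baseline along these lines could be sketched as follows. Logistic regression is one reasonable classification model, not a required one, and the imbalanced synthetic data (roughly 15% positive) is only an assumption about what attrition data might look like.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1 = left the company, 0 = stayed (assumed coding).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=0
)

# Hold out a test set; stratify keeps the class balance similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"False positives (flagged but stayed): {fp}")
print(f"False negatives (missed leavers):     {fn}")
```

Comparing the false positive and false negative counts directly addresses which type of error appears most often. With imbalanced classes, overall accuracy alone can hide a large share of missed leavers.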

In [None]:
# write Step 6 code below



## Step 7: Compare prediction behavior across clusters

### Plan
- evaluate classifier performance separately within each cluster
- compare results across clusters

### Validate
- use the same model and metrics across clusters
- avoid over-interpreting very small clusters
- produce per-cluster outputs that can be compared

### Execute
Run per-cluster evaluations and summarize differences.

### Check
- Identify one cluster where performance is noticeably better or worse.
- Confirm this difference is visible in the evaluation outputs.
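One way to produce comparable per-cluster outputs is sketched below: a single classifier is trained once, then the same metrics are computed on the test rows of each cluster. The synthetic data, the cluster count, and the choice of accuracy and recall as metrics are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for features and attrition labels.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Split features, labels, and cluster ids together so rows stay aligned.
X_train, X_test, y_train, y_test, c_train, c_test = train_test_split(
    X, y, clusters, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Same model, same metrics, evaluated separately within each cluster.
rows = []
for c in sorted(set(c_test.tolist())):
    mask = c_test == c
    rows.append({
        "cluster": c,
        "n_test": int(mask.sum()),
        "accuracy": accuracy_score(y_test[mask], y_pred[mask]),
        "recall": recall_score(y_test[mask], y_pred[mask], zero_division=0),
    })

per_cluster = pd.DataFrame(rows)
print(per_cluster)
```

Reporting `n_test` next to each metric makes it easier to avoid over-interpreting clusters with only a handful of test rows.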

In [None]:
# write Step 7 code below



## Step 8: Train a model to assign new employees to clusters

### Plan
- train a classifier that predicts cluster labels from employee features
- evaluate its accuracy
- optionally tune model parameters

### Validate
- treat cluster assignment as a separate prediction task
- evaluate performance on held-out data
- avoid claiming that higher accuracy implies 'better' clusters

### Execute
Train and evaluate the cluster-assignment classifier.

### Check
- Report cluster-assignment accuracy.
- Inspect at least one misclassified example.
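The cluster-assignment task might be sketched as below, treating the KMeans labels as the prediction target for a separate classifier. The shallow decision tree, the synthetic data, and the cluster count are illustrative assumptions; any classifier could stand in.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix; overlapping blobs on purpose.
X, _ = make_blobs(
    n_samples=500, centers=3, n_features=4, cluster_std=3.0, random_state=0
)

# Cluster labels from the unsupervised step become the target here.
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

X_train, X_test, c_train, c_test = train_test_split(
    X, cluster_labels, test_size=0.25, stratify=cluster_labels, random_state=0
)

# A shallow tree keeps the assignment rules easy to read.
assigner = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, c_train)
c_pred = assigner.predict(X_test)

accuracy = accuracy_score(c_test, c_pred)
print(f"Cluster-assignment accuracy: {accuracy:.3f}")

# Inspect one misclassified example, if any exist.
wrong = np.flatnonzero(c_pred != c_test)
if wrong.size:
    i = wrong[0]
    print("Features:", X_test[i])
    print("True cluster:", c_test[i], "| Predicted cluster:", c_pred[i])
else:
    print("No misclassified examples in the test set.")
```

High accuracy here only shows that the classifier can reproduce the cluster boundaries; it says nothing about whether those clusters are meaningful.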

In [None]:
# write Step 8 code below



## End of Act II

At this point, you should have some concrete experience with unsupervised and supervised learning methods and the ways in which they may help in forecasting employee attrition.

Before moving on to Act III, create a file named `README.md` in the project root.

This README should capture the current state of the prototype as if you were handing it off to a colleague. Keep it concise and grounded in what actually exists.

### 1. What this prototype does
In one sentence, clearly describe the capability that was built and the problem it is intended to address.

### 2. How it works (at a high level)
In a few bullet points, specify:
- what data the system operates over,
- what representation or model it uses,
- how results are produced.

### 3. Limitations and open questions
Briefly note:
- the most important limitations you observed or can foresee, and
- any open questions that would need to be addressed before broader use.


This README will be used as reference context in Act III.

## Act III: Socialize the Work

You have built a working prototype. Now you need to think about what it would mean to use it.

In this act, you will have conversations with three "colleagues" who approach this feature from different professional perspectives:

- A **Talent Development Lead** focused on employee growth, engagement, and long-term development, responsible for designing programs that support retention without undermining trust.

- An **HR Analytics Associate** focused on using data to understand workforce composition, risk patterns, and structural drivers of attrition across the organization.

- A **Line Manager** responsible for teams where attrition has direct operational consequences, such as staffing gaps, missed deadlines, or increased workload.

Each of these perspectives highlights a different set of circumstantial concerns that emerge once a technical capability is placed inside an organization and exposed to real use.

Your goal in these conversations is to engage with those concerns. This means:
- explaining how the prototype behaves and performs,
- articulating tradeoffs in plain, cross-functional language,
- and reckoning with how technical choices intersect with human expectations, organizational processes, and downstream impact.

Each conversation should feel like a real internal discussion. When a persona has what they need to understand your reasoning and its implications, the conversation will naturally come to a close.


### Submission

- Save the Notebook you have been working in and the other files you created in your repo (e.g., <code>agents.md</code>, <code>README.md</code>).
- Export your Copilot Chat and save it as a <code>.txt</code>, <code>.json</code>, or <code>.md</code> file in the same directory as the above.
- Stop / shut down the Google Colab session in which the Notebook was running.
- **Upload your exported Notebook to [the Canvas page for Assignment 3](https://canvas.northwestern.edu/courses/245397/assignments/1668982).**