# MBAI 448 | Week 5 Assignment: Transformers and NLP

##### Assignment Overview

This assignment explores how transformer-based NLP can be applied to a real-world content enrichment problem in media and entertainment. It is organized into three Acts:

- Act I: Understand the problem and context
- Act II: Prototype a solution with AI technology
- Act III: Socialize the work with stakeholders

##### Assignment Tools

This assignment assumes you will be working with GitHub Copilot in VS Code or Google Colab, and will require you to submit your chat history along with this notebook. If you are curious about how to work effectively with GitHub Copilot, please consult the [VS Code documentation](https://code.visualstudio.com/docs/copilot/overview).

Submissions that demonstrate thoughtless interaction with Copilot (e.g., asking Copilot to just read the notebook and produce all the outputs) will receive reduced credit.

## Business Goal / Case Statement
Improve traffic and user experience for an online news aggregation platform by enriching and embellishing content.

## Assignment Context

**Relevant Industry and/or Business Function:** Media and Entertainment

**Description:** XYZ Inc. operates a digital platform for news, entertainment, and social content. The company wants to boost traffic and user engagement, in particular among young professionals. Your boss, the Head of News Content, has just brokered a deal with a new provider to acquire commercial news articles. They want you to help build out a business content section, complete with subsections for "Good News 🎉" and "Bad News 👎" articles, topical tags, pithy summaries, and a clarifying question about each article.

## The Data

**Data Location:** `'./data/news.csv'`

### Act I: Understand the problem and context

#### Step 0: Scope the work in `agents.md`

Before moving forward, create a file named `agents.md` in the project root directory (most likely the same directory as this notebook). This file specifies the intended role of AI in this project and serves as reference context for GitHub Copilot as you work.

Your `agents.md` must include the following five sections:

##### 1. What we're building
A one-sentence "elevator pitch" describing the prototype and its primary output (e.g., "An automated content enrichment system that classifies, tags, summarizes, and generates FAQs for news articles using transformer-based NLP models.").

##### 2. How AI helps solve the business problem
2–4 bullet points explaining the specific value-add of the AI components. Focus on the transition from the business "pain point" to the AI "solution."

##### 3. Key file locations and data structure
List the paths that matter (e.g., `./mbai448_week05_assignment.ipynb`, `./data/news.csv`).

##### 4. High-level execution plan
A step-by-step outline of the build process (e.g., 1. Data loading and inspection, 2. Environment setup and model loading, 3. Sentiment classification, 4. Named entity extraction, 5. Question answering, 6. Summarization, 7. Full pipeline application). Feel free to ask Copilot for help (or take a peek at the steps in Act II below) for a sense of structuring the work.

##### 5. Code conventions and constraints
To ensure the prototype remains manageable, add 1-2 bullet points specifying that code be as simple and straightforward as possible, using standard libraries unless instructed otherwise.

### Act II: Prototype a solution with AI technology

## Prototyping a Transformer-Based NLP Pipeline for Content Enrichment

In this act, you will prototype a content enrichment system using pretrained transformer models from HuggingFace. The goal is not to build a production system, but to understand how these models behave when applied to real content.

Throughout this act, use GitHub Copilot as a development assistant, following a disciplined loop in every step:

- **Plan**: Have Copilot draft a clear, plain-language plan describing what needs to happen and in what order.
- **Validate**: Review and refine that plan to ensure it does exactly what the step requires, no more and no less.
- **Execute**: Have Copilot implement the validated plan in code.
- **Check**: Perform one or two concrete actions that confirm the code worked and that you understand the result.

This is exploratory prototyping. The goal is to remain in contact with the system's real behavior at all times.

---

#### Environment Setup

To run this notebook in Google Colab, you'll need to connect to Google Drive and install the required packages.

If running locally in VS Code, you may want to create and activate a Python virtual environment.

##### On macOS/Linux:
```
python -m venv venv
source venv/bin/activate
```

##### On Windows:
```
python -m venv venv
venv\Scripts\activate
```

Once your virtual environment is activated, you can set it as the kernel for this notebook in the top right corner of your notebook pane.

---

## Step 1: Load and inspect the article data

Before using any model, you need to understand what data you are working with and how it is organized.

### Plan
Have Copilot create a plan to:
- install the transformers library
- import required libraries (pandas, transformers pipeline)
- load the news dataset from disk
- display dataset structure and sample rows

### Validate
Ensure the plan:
- makes dataset structure visible rather than implicit
- shows actual article headlines and text
- confirms data types and column names

### Execute
Implement the validated plan in code.

### Check
- Print the number of articles in the dataset.
- Display a few sample rows using head() or tail().

**Food for thought:** What kinds of articles are in this dataset? What variation in length and style do you observe?

In [None]:
# write Step 1 code below




---

## Step 2: Sample a single article for exploration

Before applying models at scale, you should work with a single example to understand model behavior.

### Plan
Have Copilot create a plan to:
- extract a single article's headline and text from the dataset
- store them in variables for later use
- print both to inspect content

### Validate
Ensure the plan:
- uses proper pandas indexing (e.g., the `.at[]` accessor)
- creates reusable variables
- displays the full content for inspection

### Execute
Implement the validated plan in code.

### Check
- Print the headline and article text.
- Confirm you can explain what the article is about.

**Food for thought:** How might headline sentiment differ from the sentiment of the full article text?
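As a minimal sketch of the indexing pattern named in the Validate list, the block below pulls one headline and its text out of a DataFrame with `.at[]`. The small in-memory DataFrame stands in for `./data/news.csv`, and the column names `headline` and `text` are assumptions; check the real column names from Step 1 before reusing this.

```python
import pandas as pd

# Stand-in for the real dataset. The column names are assumed, not confirmed.
articles = pd.DataFrame(
    {
        "headline": [
            "Acme posts record quarterly profit",
            "Retailer to close 40 stores",
        ],
        "text": [
            "Acme Corp. said on Tuesday that quarterly profit rose sharply.",
            "The retailer announced that it will close 40 stores this year.",
        ],
    }
)

# .at[] takes a row label and a column name and returns a single scalar value.
row_label = 0
sample_headline = articles.at[row_label, "headline"]
sample_text = articles.at[row_label, "text"]

print("HEADLINE:", sample_headline)
print("TEXT:", sample_text)
```

Keeping the row label in its own variable makes it easy to rerun the later steps on a different article.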

In [None]:
# write Step 2 code below




---

## Step 3: Load pretrained transformer models

You will use four different NLP capabilities. Each requires loading a pretrained model from HuggingFace.

### Plan
Have Copilot create a plan to:
- create a sentiment analysis pipeline with a model different from the walkthrough
- create a named entity recognition pipeline with a model different from the walkthrough
- create a question-answering pipeline with a model different from the walkthrough
- create a summarization pipeline with a model different from the walkthrough

Model sources:
- Sentiment: https://huggingface.co/models?pipeline_tag=text-classification&sort=trending
- NER: https://huggingface.co/models?pipeline_tag=token-classification&sort=trending
- QA: https://huggingface.co/models?pipeline_tag=question-answering&sort=trending
- Summarization: https://huggingface.co/models?pipeline_tag=summarization&sort=trending

### Validate
Ensure the plan:
- uses different models than the walkthrough
- assigns each pipeline to a clearly named variable
- documents why each model was chosen

### Execute
Implement the validated plan in code.

### Check
- Confirm each pipeline loads without errors.
- Note any warnings about model configurations.

**Food for thought:** What commonalities do you notice across models for a given task (e.g., BERT variants, distilled models, SQuAD dataset)? Why might you prefer a "distilled" model?
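One possible way to organize the loading step is sketched below. The model names are illustrative picks from the Hugging Face Hub, not choices made by this assignment; confirm each one on the Hub, and check for yourself that it differs from the model used in the walkthrough. The names `MODEL_CHOICES` and `load_pipelines` are hypothetical.

```python
# Illustrative model choices only; verify each on the Hugging Face Hub.
MODEL_CHOICES = {
    "sentiment-analysis": "cardiffnlp/twitter-roberta-base-sentiment-latest",
    "ner": "dslim/bert-base-NER",
    "question-answering": "deepset/roberta-base-squad2",
    "summarization": "facebook/bart-large-cnn",
}


def load_pipelines(choices=MODEL_CHOICES):
    """Build one pipeline per task. Model weights download on first use."""
    # Imported here so the dictionary above can be inspected without the library.
    from transformers import pipeline

    loaded = {}
    for task, model_name in choices.items():
        # For NER only, ask the pipeline to group word pieces into whole entities.
        extra = {"aggregation_strategy": "simple"} if task == "ner" else {}
        loaded[task] = pipeline(task, model=model_name, **extra)
    return loaded
```

Calling `pipes = load_pipelines()` and then assigning, for example, `sentiment_model = pipes["sentiment-analysis"]` gives each pipeline the clearly named variable that the Validate list asks for.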

In [None]:
# write Step 3 code below




---

## Step 4: Test each model on your sample article

Before applying models at scale, you should understand their behavior on a single example.

### Plan
Have Copilot create a plan to:
- run sentiment analysis on the sample headline
- run named entity recognition on the sample article text
- run summarization on the sample article text
- run question answering with a relevant question about the article

### Validate
Ensure the plan:
- uses the variables created in Step 2
- prints readable outputs for each capability
- includes confidence scores where available

### Execute
Run the inference code for each model.

### Check
- Inspect each model's output.
- Confirm you can explain what each model is producing.

**Food for thought:** How well does each model perform on your sample? What surprised you?

In [None]:
# write Step 4 code below




---

## Step 5: Prepare data for scaled processing

Running NLP models on large datasets is computationally intensive. You'll work with a manageable sample.

### Plan
Have Copilot create a plan to:
- sample the dataset to approximately 100 rows
- reset the index for clean row ordering
- verify the reduced dataset size

### Validate
Ensure the plan:
- uses random sampling for representativeness
- resets index with drop=True to avoid extra columns
- confirms final row count

### Execute
Implement the data preparation code.

### Check
- Call describe() to confirm dataset size.
- Verify the sample looks representative.

**Food for thought:** How might the sample size affect your ability to evaluate model performance?
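The sampling step might look like the sketch below. It assumes a DataFrame is already loaded; the 250-row `demo` frame, the `make_sample` helper, and the seed value are all hypothetical stand-ins.

```python
import pandas as pd


def make_sample(df, n=100, seed=42):
    """Return a random sample of at most n rows with a clean 0..n-1 index."""
    n = min(n, len(df))  # avoid an error when the dataset has fewer than n rows
    # random_state makes the sample reproducible; drop=True discards the old index
    # instead of keeping it as an extra column.
    return df.sample(n=n, random_state=seed).reset_index(drop=True)


# Stand-in data; in the notebook this would be the DataFrame loaded in Step 1.
demo = pd.DataFrame({"headline": [f"Headline {i}" for i in range(250)]})
sample_df = make_sample(demo)
print(sample_df.shape)
```

Fixing the seed means the same articles are selected on every run, which makes later comparisons between models easier to interpret.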

In [None]:
# write Step 5 code below




---

## Step 6: Classify all articles by sentiment

Now you apply sentiment classification to categorize articles for the "Good News" and "Bad News" sections.

### Plan
Have Copilot create a plan to:
- write a function to classify headline sentiment
- map labels to 'Good News 🎉', 'Bad News 👎', or 'Just News 🤷'
- apply the function to create a new 'category' column

### Validate
Ensure the plan:
- handles edge cases (low confidence scores)
- uses a confidence threshold for ambiguous cases
- creates readable category labels

### Execute
Implement the classification pipeline.

### Check
- Display head() to see the new category column.
- Sample a few rows to verify classifications make sense.

**Food for thought:** What threshold would you use to distinguish confident vs. ambiguous classifications?
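The label mapping with a confidence threshold could be sketched as follows. This assumes the classifier returns a list containing a dictionary with `label` and `score` keys, as Hugging Face text-classification pipelines do. Label strings vary by model (some use `POSITIVE`/`NEGATIVE`, others lowercase names or `LABEL_0`-style codes), so the prefix matching here is an assumption to adapt. The threshold of 0.8 is an arbitrary starting point, not a recommendation.

```python
GOOD, BAD, NEUTRAL = "Good News 🎉", "Bad News 👎", "Just News 🤷"


def categorize(result, threshold=0.8):
    """Map one classifier result, e.g. {'label': 'POSITIVE', 'score': 0.97}, to a section."""
    label = str(result["label"]).upper()
    score = float(result["score"])
    if score < threshold:
        return NEUTRAL  # low confidence: treat as ambiguous
    if label.startswith("POS"):
        return GOOD
    if label.startswith("NEG"):
        return BAD
    return NEUTRAL  # explicit neutral labels or unrecognized label names


def classify_headline(headline, classifier, threshold=0.8):
    """Run the classifier on one headline and return its section label."""
    result = classifier(headline)[0]
    return categorize(result, threshold)
```

With a sentiment pipeline stored in a variable such as `sentiment_model` and a headline column named `headline` (both assumed names), the function could be applied with `df["category"] = df["headline"].apply(lambda h: classify_headline(h, sentiment_model))`.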

In [None]:
# write Step 6 code below




---

## Step 7: Extract named entities as tags for all articles

Named entities (people, organizations, locations) serve as topical tags for content discovery.

### Plan
Have Copilot create a plan to:
- write a function to extract unique named entities from text
- filter out duplicates and subword tokens
- apply the function to create a new 'tags' column

### Validate
Ensure the plan:
- returns a list of unique entity strings
- handles articles with no entities gracefully
- produces readable tag names

### Execute
Implement the entity extraction pipeline.

### Check
- Display tail() to see the new tags column.
- Examine a few articles' tags for relevance.

**Food for thought:** How might entity types (person, organization, location) be used differently in a content recommendation system?
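A sketch of the tag clean-up is shown below. It works on the list of dictionaries that a Hugging Face NER pipeline returns (each with a `word` and a `score`), drops word-piece fragments that start with `##`, and removes case-insensitive duplicates. The `extract_tags` name and the `min_score` cutoff are hypothetical.

```python
def extract_tags(entities, min_score=0.5):
    """Turn raw NER output into a list of unique, readable tag strings."""
    tags = []
    seen = set()
    for ent in entities:
        word = str(ent.get("word", "")).strip()
        if not word or word.startswith("##"):
            continue  # skip empty strings and subword fragments
        if float(ent.get("score", 1.0)) < min_score:
            continue  # skip low-confidence entities
        key = word.lower()
        if key not in seen:  # keep the first spelling, preserve order
            seen.add(key)
            tags.append(word)
    return tags


# Example input shaped like NER pipeline output (values are made up).
raw = [
    {"entity_group": "ORG", "word": "Acme Corp", "score": 0.99},
    {"entity_group": "ORG", "word": "acme corp", "score": 0.97},
    {"entity_group": "ORG", "word": "##me", "score": 0.80},
    {"entity_group": "LOC", "word": "Chicago", "score": 0.95},
    {"entity_group": "PER", "word": "Smith", "score": 0.30},
]
print(extract_tags(raw))
```

An article with no entities yields an empty list, which keeps the `tags` column a consistent type across all rows.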

In [None]:
# write Step 7 code below




---

## Step 8: Generate a FAQ for each article

Question-answering models can generate clarifying questions and answers to enhance reader engagement.

### Plan
Have Copilot create a plan to:
- write a function that generates a relevant question based on article content
- use the QA model to find the answer in the article text
- apply the function to create a new 'faq' column with Q&A pairs

### Validate
Ensure the plan:
- generates coherent, relevant questions
- returns both question and answer in a structured format
- handles cases where no good answer is found

### Execute
Implement the FAQ generation pipeline.

### Check
- Display head() to see the new faq column.
- Use .at[] to inspect individual FAQ entries in detail.

**Food for thought:** How might you compose the question differently based on article category?
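One way to structure a FAQ entry is sketched below. It assumes the QA model is called with `question=` and `context=` keywords and returns a dictionary with `answer` and `score`, as Hugging Face question-answering pipelines do. The fixed default question and the `min_score` cutoff are placeholders; the assignment leaves question generation up to you.

```python
def build_faq(text, qa_model, question="What is the main development reported?", min_score=0.1):
    """Return a dict with the question, the extracted answer (or None), and the score."""
    if not text or not text.strip():
        return {"question": question, "answer": None, "score": 0.0}
    result = qa_model(question=question, context=text)
    answer = str(result["answer"]).strip()
    score = float(result["score"])
    if not answer or score < min_score:
        # No usable answer: keep the question but flag the answer as missing.
        return {"question": question, "answer": None, "score": score}
    return {"question": question, "answer": answer, "score": score}
```

Returning `None` for weak answers makes it easy to count how many articles ended up without a usable FAQ entry.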

In [None]:
# write Step 8 code below




---

## Step 9: Summarize all articles

Pithy summaries help readers quickly assess article relevance.

### Plan
Have Copilot create a plan to:
- write a function to summarize article text
- handle text length constraints (some models have max token limits)
- apply the function to create a new 'summary' column

### Validate
Ensure the plan:
- truncates long articles to avoid errors (e.g., first 5000 characters)
- produces summaries significantly shorter than originals
- handles edge cases gracefully

### Execute
Implement the summarization pipeline.

### Check
- Display tail() to see the new summary column.
- Compare a summary to its original article using .at[].

**Food for thought:** How would you evaluate whether a summary preserves the key information?
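The truncation guard can be sketched as below. It assumes the summarizer behaves like a Hugging Face summarization pipeline, returning a list whose first item has a `summary_text` key. The length settings are arbitrary starting values. Because a character cutoff does not guarantee the input fits a model's token limit, the sketch also passes `truncation=True` as a second guard.

```python
def summarize(text, summarizer, max_chars=5000, max_length=60, min_length=15):
    """Summarize one article, clipping very long inputs before calling the model."""
    if not text or not text.strip():
        return ""  # nothing to summarize
    clipped = text[:max_chars]  # character-level cutoff, as suggested in the Validate list
    result = summarizer(
        clipped,
        max_length=max_length,
        min_length=min_length,
        truncation=True,  # token-level cutoff in case the clipped text is still too long
    )
    return result[0]["summary_text"].strip()
```

Very short articles can trigger a warning that `max_length` exceeds the input length; scaling `max_length` to the input size is one way to handle that edge case.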

In [None]:
# write Step 9 code below




---

## Step 10: Review the enriched dataset and reflect on scalability

You now have a fully enriched dataset. Assess the results and consider production implications.

### Plan
Have Copilot create a plan to:
- display the final dataframe structure with all new columns
- sample individual records to inspect quality
- document observations about model behavior

### Validate
Ensure the plan:
- shows all columns including original and enriched data
- allows inspection of individual records in detail
- prompts reflection on scaling considerations

### Execute
Review the final enriched dataset.

### Check
- Confirm all expected columns are present.
- Inspect 2-3 individual records across different categories.

**Food for thought:** If your boss asks you to scale this across millions of articles, how would you answer? What would you use: models like these, or GPT-4o/DeepSeek-v3? Why?

In [None]:
# write Step 10 code below




---

## End of Act II

At this point, you should have direct evidence of how pretrained transformer models behave on your content, what each NLP capability produces, and what quality/performance tradeoffs exist. Use these observations to inform Act III discussions with stakeholders.

Before moving on to Act III, create a file named `README.md` in the project root.

This README should capture the current state of the prototype as if you were handing it off to a colleague. Keep it concise and grounded in what actually exists.

### 1. What this prototype does
In one sentence, clearly describe the capability that was built and the problem it is intended to address.

### 2. How it works (at a high level)
In a few bullet points, specify:
- what data the system operates over,
- what models or representations it uses,
- how results are produced.

### 3. Limitations and open questions
Briefly note:
- the most important limitations you observed or can anticipate, and
- any open questions that would need to be addressed before broader use.

---

## Act III: Socialize the Work

You have built and evaluated a working prototype for automated content enrichment. Now you need to think about what it would mean to use this system in practice.

In this act, you will have conversations with three colleagues who each engage with the system from a different professional perspective. Each one surfaces a distinct set of pressures that emerge when NLP automation is introduced into a real content operation.

Your goal is not to convince them that the system is "good," but to reckon with how its behavior intersects with editorial judgment, organizational responsibility, and operational reality.

---

### Colleague Perspectives

You will speak with:

- A **Content Strategy Lead** focused on how automated tagging and categorization change editorial workflows, including when editors trust the system, when they override it, and how errors affect content quality and reader trust.

- An **Editorial Standards Manager** focused on accountability, accuracy, and brand risk, including what happens when articles are miscategorized or summaries misrepresent content, how decisions are justified after the fact, and whether the system can be relied on for sensitive topics.

- A **Platform Operations Manager** focused on efficiency and scale, including the tradeoffs between processing speed and accuracy, how the system affects content throughput, and whether gains in automation introduce new forms of friction elsewhere in the pipeline.

---

### How to Approach These Conversations

Each conversation should feel like a real internal discussion about a live prototype.

In these interactions, you should be prepared to:
- explain how each model behaves in concrete terms,
- reference evidence from your prototype (e.g., sample classifications, entity extractions, summaries),
- articulate tradeoffs clearly in plain, cross-functional language,
- and acknowledge uncertainty where it exists.

These colleagues are not trying to block the work, but they are responsible for understanding its implications within their domains.

When a colleague has enough information to understand the risks, assumptions, and consequences involved, the conversation will naturally come to a close.

---

### Submission

- Save the Notebook you have been working in and the other files you created in your repo (e.g., `agents.md`, `README.md`).
- Export your Copilot Chat and save it as a `.txt`, `.json`, or `.md` file in the same directory as the files above.
- Stop / shut down the Google Colab session in which the Notebook was running.
- **Upload your completed Notebook to the [Canvas page for Assignment 5](https://canvas.northwestern.edu/courses/245397/assignments/1668984).**