# InventoryAgent: Prototyping an Article Recommendation System for DigiKey Product Retailers


## Introduction

A product retailer's success depends on managing inventory accurately. Online articles on sites like [EETimes](https://www.eetimes.com/) and [EDN](https://www.edn.com/) can have an outsized effect on sales. Unfortunately, there are often too many articles for short-staffed retailers to sort through. This creates an opportunity for a tool that can sort through a high volume of articles and identify the ones most relevant to a retailer's products.

We classify articles into three types based on their relevance to inventory management of a retailer's products:
* Positive: Articles whose content implies a product's sales will increase
* Negative: Articles whose content implies a product's sales will decrease
* Neutral: Articles whose content is not relevant to a product

## Goal
This notebook demonstrates how to prototype InventoryAgent, an article recommendation system application using data science tools. We built the recommender to identify [EETimes](https://www.eetimes.com/) and [EDN](https://www.edn.com/) articles most relevant to specific products sold by [DigiKey](https://www.digikey.com).

## Key Results
- **Formulate**: Brainstorm ideas as a team
- **Collect**: Collect the initial datasets for the model build
- **Clean**: Clean the datasets to have the correct shape and content to fit into a model
- **Analyze**: Explore the cleaned datasets and edit content as needed
- **Model**: Fit different models to the datasets, recommend articles for products, and evaluate their relevance
- **Present**: Present the model outputs in compelling ways (for example, dashboards, visualizations, or reports)

**Run the following notebooks and explore how we prototyped InventoryAgent.**

## 0. Formulate Problem

First, we brainstormed as a team around the problem using the [Question Storming](https://experiencinginformation.com/2011/11/02/questionstorming-framing-the-problem/) technique. This allowed us to narrow down to the few ideas that would make the most compelling prototype.

The results of our Question Storm can be found [here](https://ridethenextwave-my.sharepoint.com/:x:/p/nick_capaldini/EWQWeLLgcz9KroLso29B4fABtNmctnTHb2WQNEyxtPARqg?e=AsxVzH).

## 1. Collect Dataset

Next, we collect the necessary product text data from the [DigiKey](https://www.digikey.com) website, along with article links and article text from the publisher sites.

Run the following notebooks to extract the necessary text data.

#### 1.1 Collect Product
Raw product data is collected to prepare for cleaning.

[01-01-Collect_Product.ipynb](./01-01-Collect_Product.ipynb)

This notebook scrapes product category data from [DigiKey](https://www.digikey.com/en/products), including:

- Product category name and sub-category label

- URL to each category page

- Total number of products per category

- Category classification into higher-level groups (e.g., "Capacitors", "Battery Products")

It performs regex-based parsing of embedded JSON-like data and filters out categories with zero product count.

Output is stored in the following: `./intermediate_data/Products_List_Raw.json`
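
The regex-based parsing and zero-count filter described above could look roughly like the following. This is a minimal sketch, not the notebook's actual code: the sample markup, the field names (`label`, `group`, `url`, `productCount`), and the helper `parse_categories` are all assumptions made for illustration, since the real page structure is not shown here.

```python
import json
import re

# Hypothetical page fragment; the real DigiKey markup is not reproduced here.
SAMPLE_HTML = '''
<script>window.__DATA__ = {"categories": [
  {"label": "Aluminum Electrolytic Capacitors", "group": "Capacitors", "url": "/en/products/filter/aluminum/58", "productCount": 1200},
  {"label": "Discontinued Widgets", "group": "Battery Products", "url": "/en/products/filter/widgets/0", "productCount": 0}
]};</script>
'''

# Match each flat {...} object that carries a "productCount" field.
CATEGORY_RE = re.compile(r'\{[^{}]*"productCount"\s*:\s*\d+[^{}]*\}')


def parse_categories(html: str) -> list:
    """Extract category records and drop those with zero products."""
    categories = []
    for match in CATEGORY_RE.finditer(html):
        record = json.loads(match.group(0))
        if record["productCount"] > 0:
            categories.append(record)
    return categories


categories = parse_categories(SAMPLE_HTML)
print(json.dumps(categories, indent=2))
```

Matching only flat objects keeps the pattern simple; nested objects would need a real JSON or HTML parser instead.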




#### 1.2 Collect Article Links
Collect raw article links for later content extraction.

[01-02-Collect_Article_Links.ipynb](./01-02-Collect_Article_Links.ipynb)

This notebook gathers the latest article metadata (titles, URLs, publish dates) from multiple electronics-focused publishers.

Output is stored in the following: `./intermediate_data/Scraped_Article_Links.csv`
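
As an illustration of link collection, the sketch below pulls titles and URLs from headline anchors and writes de-duplicated rows as CSV. It assumes headlines sit in `<h2>`/`<h3>` tags, which may not match the real publisher pages; `ArticleLinkParser`, `links_to_csv`, and the `example.com` URLs are hypothetical, and publish dates are omitted for brevity.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin


class ArticleLinkParser(HTMLParser):
    """Collect title/URL pairs from <a> tags inside <h2>/<h3> (assumed layout)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._in_headline = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self._in_headline = True
        elif tag == "a" and self._in_headline:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = " ".join("".join(self._text).split())
            if title:
                url = urljoin(self.base_url, self._href)
                self.links.append({"title": title, "url": url})
            self._href = None
        elif tag in ("h2", "h3"):
            self._in_headline = False


def links_to_csv(links):
    """Write links as CSV text, keeping the first row seen for each URL."""
    seen = set()
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "url"])
    writer.writeheader()
    for link in links:
        if link["url"] not in seen:
            seen.add(link["url"])
            writer.writerow(link)
    return buffer.getvalue()


SAMPLE = '''
<h2><a href="/news/gan-power-stage">GaN power stage  cuts losses</a></h2>
<h3><a href="https://www.example.com/design/adc-basics">ADC basics</a></h3>
<h2><a href="/news/gan-power-stage">GaN power stage cuts losses</a></h2>
<p><a href="/about">About us</a></p>
'''
parser = ArticleLinkParser("https://www.example.com")
parser.feed(SAMPLE)
csv_text = links_to_csv(parser.links)
print(csv_text)
```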

#### 1.3 Collect Article Data
Raw article data is collected to prepare for cleaning.

[01-03-Collect_Article_Data.ipynb](./01-03-Collect_Article_Data.ipynb)

This notebook fetches full article content (body text) for the URLs obtained above.

Output is stored in the following: `./intermediate_data/Scraped_Article_Raw_Data.json`
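
One way to extract body text from a fetched page is sketched below, using only the standard library. It assumes the article body lives in `<p>` tags and skips `<script>` and `<style>` content; `BodyTextParser`, `extract_body`, and the sample page are hypothetical, and the HTTP fetch itself is left out so the example runs offline.

```python
import json
from html.parser import HTMLParser


class BodyTextParser(HTMLParser):
    """Gather text from <p> tags, ignoring <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._skip_depth = 0
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p" and self._in_p:
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and not self._skip_depth:
            self._buf.append(data)


def extract_body(html):
    """Return the page's paragraph text, one paragraph per line."""
    parser = BodyTextParser()
    parser.feed(html)
    return "\n".join(parser.paragraphs)


SAMPLE_PAGE = """
<html><head><style>p { color: red; }</style></head>
<body><nav>Home</nav>
<p>New <b>SiC MOSFETs</b> target EV inverters.</p>
<script>var tracking = 1;</script>
<p>Availability is expected in Q3.</p>
</body></html>
"""
record = {"url": "https://www.example.com/news/sic", "body": extract_body(SAMPLE_PAGE)}
print(json.dumps(record, indent=2))
```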

## 2. Clean Datasets

The raw article and product text is not filtered and cannot be directly used for machine learning. Here we use various methods to clean the text data and prepare it for machine learning.

Run the following notebooks to clean the datasets. 

#### 2.1 Clean Product Data
Raw product data is cleaned to prepare for modeling.

[02-01-Clean_Products.ipynb](./02-01-Clean_Products.ipynb)

This notebook processes and filters product category data from DigiKey.

- Reads raw category JSON (`./intermediate_data/Products_List_Raw.json`)

- Iterates over each category and scrapes a sample of up to 100 listed products per category

- Applies light heuristics to extract product metadata

- Retains associated metadata such as category and main category group

Output is stored in the following: `./intermediate_data/Products_List_Clean.json`

#### 2.2 Clean Article Data
Raw article data is cleaned to prepare for modeling.

[02-02-Clean_Article_Data.ipynb](./02-02-Clean_Article_Data.ipynb)

This notebook processes the raw scraped article data into clean, readable text.

- Loads full-article records from `./intermediate_data/Scraped_Article_Raw_Data.json`

- Cleans and normalizes article text

- Preserves article metadata

Output is stored in the following: `./intermediate_data/Cleaned_Article_Data.json`
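
The cleaning and normalization step might look something like this sketch. The specific rules (strip leftover tags, decode HTML entities, apply Unicode NFKC normalization, collapse whitespace) are assumptions about what "clean" means here, and `clean_text`, `clean_record`, and the `body` field name are hypothetical.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(raw):
    """Normalize one article body into plain, single-spaced text."""
    text = TAG_RE.sub(" ", raw)                 # drop leftover HTML tags
    text = html.unescape(text)                  # decode entities such as &nbsp;
    text = unicodedata.normalize("NFKC", text)  # fold non-breaking spaces etc.
    return WHITESPACE_RE.sub(" ", text).strip()


def clean_record(record):
    """Clean the body text while preserving all other metadata fields."""
    cleaned = dict(record)
    cleaned["body"] = clean_text(record.get("body", ""))
    return cleaned


raw = {
    "title": "GaN update",
    "url": "https://www.example.com/news/gan",
    "body": "<p>GaN&nbsp;FETs   cut\n losses by 30&#37;.</p>",
}
cleaned = clean_record(raw)
print(cleaned["body"])
```

Tags are stripped before entities are decoded so that escaped angle brackets in the text are not mistaken for markup.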

## 3. Analyze Data

Now that the data has been cleaned, we can analyze it. Here we do some quick exploration of the cleaned datasets for any insights, ideas, or edits that can inform our model building.

Run the following notebooks to analyze the datasets. 

#### 3.1 Analyze Article Data
Cleaned article data is analyzed in anticipation of modeling.

[03-01-AnalyzeArticles.ipynb](./03-01-AnalyzeArticles.ipynb)

This notebook performs token-level analysis on cleaned articles to identify frequent keywords and evaluate the content length distribution, in the following steps:

- Load cleaned article data from `Cleaned_Article_Data.json`

- Tokenize text while removing manual stopwords

- Skip articles with fewer than 50 valid tokens; truncate articles with more than 3000 valid tokens

- Count and visualize top 20 most frequent words (after cleaning)

- Plot article token count distribution to detect outliers

Output is shown in the following ways:

- Bar plot: Top 20 frequent tokens

- Bar plot: Filtered article length distribution
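
The token-level analysis can be sketched as follows. The stopword list is a tiny placeholder for the manual list, and the sketch assumes short articles are skipped while long ones are truncated to 3000 tokens, which is one reading of the rule above; `tokenize` and `analyze` are hypothetical names. Plotting is left out so the example has no third-party dependencies.

```python
import re
from collections import Counter

# Placeholder stopword list; the notebook's manual list is not shown here.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on", "with"}
MIN_TOKENS, MAX_TOKENS = 50, 3000


def tokenize(text):
    """Lowercase, split into word-like tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z][a-z0-9\-]+", text.lower())
            if t not in STOPWORDS]


def analyze(articles):
    """Return the top 20 tokens and the token count of each kept article."""
    word_counts = Counter()
    lengths = []
    for text in articles:
        tokens = tokenize(text)
        if len(tokens) < MIN_TOKENS:
            continue                     # too short to be informative
        tokens = tokens[:MAX_TOKENS]     # truncate very long articles
        word_counts.update(tokens)
        lengths.append(len(tokens))
    return word_counts.most_common(20), lengths


articles = ["too short", "power converter " * 40, "sensor " * 4000]
top_words, lengths = analyze(articles)
print(top_words)
print(lengths)
```

The two return values correspond to the two plots: `top_words` feeds the frequent-token bar plot and `lengths` feeds the length distribution.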

#### 3.2 Analyze Product Data
Cleaned product data is analyzed in anticipation of modeling.

[03-02-AnalyzeProducts.ipynb](./03-02-AnalyzeProducts.ipynb)

This notebook analyzes the cleaned product category list and computes key statistics for visualization, in the following steps:

- Load `Products_List_Clean.json`

- Extract number of items from category names using regex

- Normalize and clean category names for better readability

- Filter out categories with 0 items

Output is shown in the following ways:

- Bar plot: Top 10 product-rich categories

- Pie chart: Distribution of all categories
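
A rough sketch of the regex extraction and filtering steps is shown below. It assumes category names embed their size in a form like `(812,345 Items)`, which is a guess at the raw format; `split_name_and_count` and the sample names are hypothetical.

```python
import re

# Assumed pattern: an item count such as "(812,345 Items)" or "- 0 items".
COUNT_RE = re.compile(r"\(?\s*(\d[\d,]*)\s+items?\s*\)?", re.IGNORECASE)


def split_name_and_count(raw_name):
    """Split a raw category label into a clean name and an integer count."""
    match = COUNT_RE.search(raw_name)
    if not match:
        return raw_name.strip(), 0
    count = int(match.group(1).replace(",", ""))
    name = raw_name[:match.start()] + raw_name[match.end():]
    name = re.sub(r"\s+", " ", name).strip(" -")
    return name, count


raw_names = [
    "Ceramic Capacitors (812,345 Items)",
    "Battery Holders - 0 items",
    "Film Capacitors (45,210 Items)",
]
parsed = [split_name_and_count(name) for name in raw_names]
non_empty = [(name, count) for name, count in parsed if count > 0]
top10 = sorted(non_empty, key=lambda pair: pair[1], reverse=True)[:10]
print(top10)
```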

## 4. Model Recommendation

Next, we build a recommendation model using the cleaned datasets.

Run the following notebook to build a recommendation model using the data provided.

[04-ModelRec.ipynb](./04-ModelRec.ipynb)

This notebook performs the following steps:

- Loads cleaned article texts and product category data.

- Preprocesses texts using tokenization and stopword removal (via `gensim.utils.simple_preprocess`).

- Combines all article and product texts to build a shared dictionary and TF-IDF representation.

- Uses `SparseMatrixSimilarity` to calculate cosine similarity between product and article vectors.

- Implements a function to **recommend top-N similar articles** for each product category based on textual similarity.

- Filters results to exclude articles with low relevance scores.

- Generates a table matching products with their top article recommendations and saves it as a CSV.

Output is stored in the following: `./intermediate_data/Product_Article_Matching.csv`
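
To make the matching logic concrete, here is a dependency-free sketch of TF-IDF weighting plus cosine similarity. It is a stand-in for the gensim pipeline described above, not the notebook's code: the weighting (term count times log2 of inverse document frequency, then L2 normalization), the `min_score` threshold, and the names `tfidf_vectors`, `cosine`, and `recommend` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items() if w} if norm else {})
    return vectors


def cosine(a, b):
    """Dot product of two normalized sparse vectors (their cosine similarity)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def recommend(product_vec, article_vecs, titles, top_n=3, min_score=0.05):
    """Rank articles by similarity and drop those below min_score."""
    scored = [(cosine(product_vec, v), title) for v, title in zip(article_vecs, titles)]
    scored = [(score, title) for score, title in scored if score >= min_score]
    return sorted(scored, reverse=True)[:top_n]


product_texts = ["ceramic film capacitors capacitance"]
article_titles = [
    "New ceramic capacitors shrink power supplies",
    "Battery chemistry roadmap",
    "Op-amp noise basics",
]
article_texts = [
    "ceramic capacitors reduce ripple in power supplies",
    "lithium battery cells improve energy density",
    "amplifier noise and bandwidth tradeoffs",
]

# Build one shared vocabulary over products and articles, as described above.
docs = [text.lower().split() for text in product_texts + article_texts]
vectors = tfidf_vectors(docs)
product_vecs = vectors[:len(product_texts)]
article_vecs = vectors[len(product_texts):]
results = recommend(product_vecs[0], article_vecs, article_titles)
print(results)
```

Only the capacitor article shares terms with the product text, so it is the single recommendation that survives the relevance filter.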

## 5. Present Recommendations

Finally, we set up the recommendation system so that it can be presented in a Mesmerizing, Original, Professional, and Simple way to a non-technical audience.

Run the following Python file and notebooks to present the recommendations.

#### 5.1 Deploy Streamlit Dashboard

[05-DigiKey_App.py](./05-DigiKey_App.py)

This Streamlit app provides an interactive interface for end-users to:

- Search and select a product category

  - View 90-day forecast as a quantile dot plot

  - Read AI-generated technical summaries

  - Access matched articles with relevance scores and clickable links

- Explore a full product-article table with rich HTML formatting

The app includes the following features:

- Multi-tab layout (Search view + Table view)

- Expandable long descriptions with "See more"

- Clickable product names and article titles with external links

- Forecasts and insights auto-rendered per selection

#### 5.2 Visualize Forecast with Quantile Dot Plots

[05-02-QuantileDotPlot.ipynb](./05-02-QuantileDotPlot.ipynb)

This notebook creates interpretable dot plots for each product category to visualize forecasted unit sales, in the following steps:

- Loads model predictions and dotplot ratio data (`Dot_Plot_Ratio.json`).

- For each product category:
  
  - Multiplies prediction × product count × dot distribution ratio
  
  - Plots 90-day sales forecast as a quantile dot plot
  
  - Saves one image per category to `./figures/`


Output is stored in the following: `./figures/`
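
The arithmetic behind each plot can be sketched as below. The input values are made up, the structure of `Dot_Plot_Ratio.json` is assumed to be a flat list of ratios, and `forecast_dots` and `stack_dots` are hypothetical helpers; the sketch only computes dot positions and leaves the drawing to a plotting library.

```python
from collections import defaultdict


def forecast_dots(prediction, product_count, ratios):
    """One forecast value per dot: prediction x product count x ratio."""
    return [prediction * product_count * ratio for ratio in ratios]


def stack_dots(values, bin_width):
    """Return (x, y) dot positions: x is the bin center, y the stack height."""
    stacks = defaultdict(int)
    positions = []
    for value in sorted(values):
        bin_index = int(value // bin_width)
        stacks[bin_index] += 1
        positions.append(((bin_index + 0.5) * bin_width, stacks[bin_index]))
    return positions


# Hypothetical inputs for one product category.
prediction = 2.0        # assumed forecast units per product over 90 days
product_count = 100
ratios = [0.5, 0.75, 0.75, 1.0, 1.0, 1.0, 1.25, 1.5]

values = forecast_dots(prediction, product_count, ratios)
positions = stack_dots(values, bin_width=50)
for x, y in positions:
    print(x, y)
```

Each dot stands for an equal share of the forecast distribution, so taller stacks mark the more likely sales levels.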

#### 5.3 Generate Category Summaries

[05-03-GenerateDescription.ipynb](./05-03-GenerateDescription.ipynb)

This notebook uses the OpenAI GPT-4 model via LangChain to automatically generate summaries for each product category, in the following steps:

- Loads unique product category names from the matching CSV.

- Sends each category name to an LLM prompt asking for a 2–3 line technical summary.

- Saves the output in structured JSON format.

Output is stored in the following: `./intermediate_data/Product_Description.json`
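
The overall flow of the steps above can be sketched without calling a real model. In this sketch the LLM is replaced by a stub function (`fake_llm`) so the code runs offline; the prompt wording, the CSV column name `product_category`, and the helper names are assumptions, and the actual LangChain call is not shown.

```python
import csv
import io
import json

# Assumed prompt wording; the notebook's actual prompt is not shown here.
PROMPT_TEMPLATE = (
    "Write a 2-3 line technical summary of the electronic component category: {category}"
)


def unique_categories(csv_text, column="product_category"):
    """Return category names in first-seen order (column name is assumed)."""
    seen = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        seen.setdefault(row[column].strip(), None)
    return list(seen)


def generate_descriptions(categories, llm):
    """Call llm(prompt) once per category and collect the replies."""
    return {c: llm(PROMPT_TEMPLATE.format(category=c)).strip() for c in categories}


def fake_llm(prompt):
    """Stub standing in for the real model call."""
    return "[summary for: " + prompt.rsplit(": ", 1)[-1] + "]"


SAMPLE_CSV = (
    "product_category,article_title,score\n"
    "Capacitors,A,0.4\n"
    "Capacitors,B,0.3\n"
    "Battery Products,C,0.2\n"
)
categories = unique_categories(SAMPLE_CSV)
descriptions = generate_descriptions(categories, fake_llm)
print(json.dumps(descriptions, indent=2))
```

Passing the model in as a plain callable keeps the prompt-building and JSON output testable without network access or API keys.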

## Version and Hardware Information

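```python
%load_ext watermark
%watermark -v -m -p ipywidgets,matplotlib,numpy,streamlit,pandas,sklearn
```

This cell requires the `watermark` package. When it is not installed, the cell fails with `ModuleNotFoundError: No module named 'watermark'`; install it with `pip install watermark` and re-run the cell.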

---

**Authors:**
[Salah Mohamoud](mailto:salah.mohamoud.dev@gmail.com),
[Sai Keertana Lakku](mailto:saikeertana005@gmail.com),
[Zhen Zhuang](mailto:zhuangzhen17cs@gmail.com),
[Nick Capaldini](mailto:nick.capaldini@ridethenextwave.com), Ride The Next Wave, May 19, 2025

---