### Challenge
At FakeFashionCorp, we're always looking to **improve our product recommendations for our customers**. As a member of our data science team, your task is to make the best recommendations possible based on the provided data:

- **Customer Search Data**: A list of Google search queries made by a **specific** customer. You can find the data in the `./search_history.json` file.
- **Product Catalog**: A dataset containing 100,000 **fashion items** from our current inventory, including details such as product name, category, description, and other relevant attributes. You can find the data in the `./fashion_catalog.json` file.

Your challenge is to analyze the customer's search history and use this information to select the most relevant items from our product catalog that we should recommend to this customer.

Here are the specific requirements:

- Select the top items that best match the customer's apparent interests and preferences.
- Provide a brief explanation of your approach, including any assumptions you made and the reasoning behind your methodology.
- Include any visualizations or metrics that support your recommendations.
- Make sure to include the cell output in the final commit; we will **not** execute the script ourselves.

### Dummy approach
The following is what we consider a **dummy** approach. We expect you to find a more clever solution than this:
1. embed the customer's searches
2. rank the searches according to some semantic similarity to a fashion related anchor
3. for each fashion related search, find the product in the catalog that is most similar

We encourage you to be creative in your approach. There's no single correct solution, and we're interested in seeing how you tackle this real-world problem.

Hint: **how can we truly understand the customer's preferences?**

### 0. Repo structure

The project structure is organized as follows:

```text
├── README.md
├── challenge.ipynb
├── data
│   ├── chroma_db - Vector database storing processed and embedded fashion catalog items
│   ├── processed
│   │   ├── analysis_summary.txt - Generated user profile and recommendations
│   │   ├── fashion_analysis.json - Extracted fashion entities and topics
│   │   ├── fashion_catalog_sampled.json
│   │   ├── product_recommendations.json
│   │   ├── search_history_sampled.json
│   │   └── user_comp_analysis.txt
│   └── raw
│       ├── fashion_catalog.json - Raw fashion product catalog
│       └── search_history.json - User search query history
├── data_exploration
│   └── search_history.py
├── pyproject.toml
├── requirements.txt
├── setup.py
├── src
│   ├── __init__.py
│   ├── config
│   │   ├── __init__.py
│   │   ├── model_config.py - Configuration for ML models and vector DB
│   │   └── prompts.yaml - Template prompts for analysis tasks
│   ├── data_processing
│   │   ├── __init__.py
│   │   ├── catalog_processor.py - Core processing logic for fashion catalog items
│   │   ├── embeddings.py - Manages vector embeddings for catalog items
│   │   ├── entity_extractor.py
│   │   ├── fashion_catalog_downsampler.py
│   │   ├── schema.py - Data models and validation schemas
│   │   └── search_history_downsampler.py
│   ├── examples
│   │   ├── __init__.py
│   │   ├── process_catalog.py - Main script for processing and embedding catalog data
│   │   ├── product_recommendations.py
│   │   └── test_embeddings.py
│   └── features
│       ├── __init__.py
│       ├── fashion_analysis.py - Analyzes fashion-related entities and trends
│       └── user_comp_analysis.py - Comprehensive user behaviour analysis
└── utils
    ├── generate_tree.py
    └── project_structure.txt
```


### 0. Analyse raw datasets

A preliminary analysis of the input datasets `data/raw/search_history.json` and `data/raw/fashion_catalog.json` was performed early on. Several fields were identified as potentially redundant for the pipeline outlined below, e.g. multiple image URLs per product (no computer vision is performed) and some categorical fields with a cardinality of 1.

The approach decided upon and outlined below was to do LLM-based analysis of the files (`GPT-4o-mini`), so not much time was spent pre-processing the datasets (save for the downsampling, detailed below).

It was noted that the `search_history.json` comprised many different browser searches, and was not fashion specific.

### 1. Downsample datasets

The search history JSON contains 55,383 items, while the fashion catalog contains 100,000.

The downsampling modules (`src/data_processing/search_history_downsampler.py` and `src/data_processing/fashion_catalog_downsampler.py`) allowed me to debug the pipeline on very small subsets of the raw datasets, and then prove the pipeline on larger subsets (10% of the raw input datasets), while keeping the OpenAI API costs down.

Module docstrings for each explain the functionality, but essentially, they are using `random.sample` to sample items from the raw datasets.
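A minimal sketch of this kind of sampling is shown below. It is not the code of either module: the function name `downsample_json` and the `seed` parameter are hypothetical, and it assumes the top level of the input JSON is a list.

```python
import json
import random
from pathlib import Path


def downsample_json(input_path, output_path, fraction, seed=None):
    """Write a random subset of a JSON array to output_path.

    Returns (total_entries, sampled_entries). Assumes the input file
    holds a top-level JSON list (an assumption, not checked against
    the real datasets).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    items = json.loads(Path(input_path).read_text(encoding="utf-8"))

    # int() truncates, so the sample never exceeds the requested fraction.
    sample_size = int(len(items) * fraction)

    # A dedicated Random instance keeps the sample reproducible for a given seed.
    rng = random.Random(seed)
    sampled = rng.sample(items, sample_size)

    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(sampled, ensure_ascii=False, indent=2), encoding="utf-8")
    return len(items), sample_size
```

Passing a fixed `seed` makes repeated debugging runs operate on the same subset, which helps when comparing LLM outputs between runs.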

Example of running `search_history_downsampler.py` from the command line:
```bash
python src/data_processing/search_history_downsampler.py \
    --input data/raw/search_history.json \
    --output data/processed/search_history_sampled.json \
    --fraction 0.1
```

Example output:
```text
Processed search history:
Total entries: 55383
Sampled entries: 5538
Sample fraction: 0.1
```

Example of running `fashion_catalog_downsampler.py` from the command line:
```bash
python src/data_processing/fashion_catalog_downsampler.py \
    --input data/raw/fashion_catalog.json \
    --output data/processed/fashion_catalog_sampled.json \
    --fraction 0.1
```

Example output:
```text
Processed fashion catalog:
Total entries: 100000
Sampled entries: 10000
Sample fraction: 0.1
```

### 2. Extract features from `search_history`

The main module for processing `search_history` (actually, the downsampled output `data/processed/search_history_sampled.json` from the previous step) is `src/features/user_comp_analysis.py`.

It performs the following steps, using `GPT-4o-mini` for NLP tasks:

1. Uses `tiktoken` to estimate the cost of a long run and shows a y/n prompt before starting it
2. Processes large JSON files in chunks for memory efficiency. For each chunk:
    1. Fashion entity extraction (brands, products, styles), updating a structured JSON which logs statistics across the chunks
    2. Chunk summary, including fashion topic/trend extraction
    3. Vector database caching using ChromaDB
3. Once all chunks have been processed, a final prompt asks the model to take all chunk summaries, alongside the extracted entities, and generate a summary of the user. This serves both as a convenient, human-readable output of the processed search history and as an input to the product recommendation steps.
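The chunked loop can be sketched roughly as follows. This is an illustration only: `iter_chunks`, `analyse_in_chunks` and the `analyse_chunk` callback are hypothetical names, and the callback stands in for the real LLM call. The `max_chunks` argument mimics the idea behind the `--test` flag (process only the first few chunks).

```python
from typing import Callable, Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def iter_chunks(items: Sequence[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


def analyse_in_chunks(
    items: Sequence[T],
    chunk_size: int,
    analyse_chunk: Callable[[List[T]], R],
    max_chunks: Optional[int] = None,
) -> List[R]:
    """Run analyse_chunk on each chunk and collect the per-chunk results.

    analyse_chunk is a placeholder for whatever produces a chunk summary
    (in the real pipeline, a prompt to the model).
    """
    summaries: List[R] = []
    for index, chunk in enumerate(iter_chunks(items, chunk_size)):
        if max_chunks is not None and index >= max_chunks:
            break  # stop early, e.g. for a cheap test run
        summaries.append(analyse_chunk(chunk))
    return summaries
```

For example, `analyse_in_chunks(queries, 500, summarise, max_chunks=2)` would call a hypothetical `summarise` function on the first two chunks of 500 queries only.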

Example command line usages:

```bash
    # Full analysis
    $ python user_comp_analysis.py --input data/raw/search_history.json --verbose

    # Test run (2 chunks only)
    $ python user_comp_analysis.py --test --verbose

    # Custom output location
    $ python user_comp_analysis.py -i input.json -o output/analysis.txt
```

#### Output 1: [fashion_analysis.json](data/processed/fashion_analysis.json)

1. Tracks entity frequencies, e.g.
```json
"brands": {
      "luxury": {
        "marc jacobs": 1,
        "Dior": 17,
        "alexander mcqueen": 1,
        "Jacquemus": 4,
        "Christian Louboutin": 9,
      },
      "high_street": {
        "John Lewis Camden": 1,
        "H&M": 2,
        "john lewis camden": 5,
        "Reiss": 3,
        "Marks & Spencer": 19,
      },
      "sportswear": {
        "salomon hiking shoes women": 1,
        "tommy hilfiger": 3,
      }
```
2. Also tracks raw entity counts

Details of this JSON are fed into each iteration of the loop over chunks, and the JSON is updated on each iteration.
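The sample above contains both `"John Lewis Camden"` and `"john lewis camden"` as separate keys, so the running counts can split across spellings. One possible refinement, sketched here under the assumption that each category is a flat name-to-count mapping, is to merge counts case-insensitively when updating the running totals. The function name `merge_entity_counts` is hypothetical and this is not part of the current pipeline.

```python
from collections import Counter


def merge_entity_counts(running, chunk_counts):
    """Add chunk_counts into running, matching entity names case-insensitively.

    Both arguments are plain {name: count} dicts. The first spelling seen
    for a name is kept as the display form in the result.
    """
    totals = Counter()
    display_names = {}
    for source in (running, chunk_counts):
        for name, count in source.items():
            key = name.strip().casefold()  # normalised key used only for matching
            totals[key] += count
            display_names.setdefault(key, name)
    return {display_names[key]: total for key, total in totals.items()}
```

Applied to the high-street sample, the two John Lewis entries would collapse into a single entry with a count of 6.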

#### Output 2: [analysis_summary.txt](./data/processed/analysis_summary.txt)

A summary of all the chunk summaries and extracted entities, structured into various sections, including general and fashion-specific topics.

The full output can be read at the link above; a sample of its sections is reproduced here:

```text
### User Profile

#### 1. **Core User Profile**:
- **Demographics**:
  - **Gender**: Predominantly female
  - **Age Range**: 25-45 years
  - **Life Stage**: Likely to be professionals or in transitional life stages such as new careers or family planning (e.g., engaged or married).

- **Geographic Context**: 
  - Predominantly in urban areas where fashion and luxury brands are more accessible (e.g., major cities in the UK like London).

- **Lifestyle Indicators**: 
  - Enjoys a blend of luxury and high-street fashion.
  - Engaged in professional settings, where business attire is often required.
  - Interest in sustainability and ethical fashion may be inferred from the mix of luxury and high-street brands.

#### 2. **Fashion Profile**:
- **Most Frequently Searched Brands**:
  - **Luxury**: Dior, Christian Louboutin, Prada, and Tiffany & Co.
  - **High Street**: Marks & Spencer, Zara, and Ted Baker, indicating a balance between affordable and high-end shopping preferences.
  - **Sportswear**: Tommy Hilfiger and Nike, suggesting a preference for sporty chic aesthetics.


#### 3. **Shopping Behaviour**:
- **Price Sensitivity**: 
  - The user shows a willingness to invest in luxury brands such as Dior and Prada indicating lower price sensitivity for high-quality products, while still engaging with accessible brands like Marks & Spencer and Zara.


#### 4. **Recommendations**:
- **Product Recommendations**:
  - **Clothing**: Consider offering tailored business suits, designer blouses, and statement dresses that epitomize luxury while being suitable for professional settings.
  - **Footwear**: Recommend high-quality pumps from brands like Manolo Blahnik or elegant heeled sandals from Valentino.
  - **Accessories**: Suggest designer handbags, chic jewelry pieces, and stylish brooches from Van Cleef & Arpels.


#### 5. **Marketing Approach**:
- **Suggested Engagement Channels**:
  - Utilize social media platforms like Instagram and Pinterest for visual inspiration, with a website presence that features blog-style content around fashion tips and trends.
  - Engage via email newsletters with curated product recommendations based on browsing habits.

  ```



### 3. Embed `fashion_catalog`

The script `src/examples/process_catalog.py` takes the (downsampled) fashion catalog and:
1. Processes fashion product data using OpenAI embeddings
2. Extracts structured entities (brand, category, price tier)
3. Stores processed items in a vector database (ChromaDB)
4. Enables semantic search with filtering

Example command line usage:

```bash
    $ python process_catalog.py --catalog-path data/processed/fashion_catalog_sampled.json --verbose
```

The output is a ChromaDB vector database, saved in [data/chroma_db](data/chroma_db).
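ChromaDB handles the filtered similarity search internally. To make the idea concrete, here is a plain-Python sketch of what "semantic search with filtering" amounts to: filter items on metadata, then rank the remainder by cosine similarity to the query vector. The item layout (`embedding` and `metadata` keys) and the function names are assumptions for illustration, not the ChromaDB API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filtered_top_k(query_vec, items, where, k=5):
    """Return the k items most similar to query_vec among those matching `where`.

    items: list of dicts with an 'embedding' (list of floats) and a
           'metadata' dict, e.g. {"price_tier": "luxury"}.
    where: exact-match metadata constraints, e.g. {"price_tier": "luxury"}.
    """
    candidates = [
        item for item in items
        if all(item["metadata"].get(field) == value for field, value in where.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda item: cosine_similarity(query_vec, item["embedding"]),
        reverse=True,
    )
    return ranked[:k]
```

A real vector database avoids the full scan used here by relying on an index, but the ranking criterion is the same.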

### 4. Product recommendations

The module [`src/examples/product_recommendations.py`](src/examples/product_recommendations.py) runs the product recommendation process.

It uses the processed user search history artefacts (the extracted entity counts/densities and the overall summary) to formulate weighted queries.

These are then used to query the ChromaDB vector database of products, semantically.

As a demo, the script currently generates 3 sets of the top 5 product recommendations (by cosine similarity).

The three demo sets are searches by:
1. Fashion items similar to the top 5 **brands** from the user's search history, and filtered by "luxury" price tier
2. Fashion items similar to the top 5 **products** from the user's search history, and filtered by "luxury" price tier
3. Fashion items informed by the comprehensive user search summary from `user_comp_analysis.txt`, and filtered by "luxury" price tier
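How the script weights its queries is not detailed here, so the following is only one plausible way to turn entity counts into a weighted query: take the top `n` entities by frequency and normalise their counts into weights. The function names `top_entities` and `build_weighted_query` are hypothetical.

```python
def top_entities(counts, n=5):
    """Return the n most frequent (name, count) pairs, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))[:n]


def build_weighted_query(counts, n=5):
    """Build a query string and per-entity weights from a {name: count} dict.

    The weights are the counts of the top n entities normalised to sum to 1.
    Returns (query_text, weights).
    """
    top = top_entities(counts, n)
    total = sum(count for _, count in top)
    if total == 0:
        return "", {}
    weights = {name: count / total for name, count in top}
    query_text = ", ".join(name for name, _ in top)
    return query_text, weights
```

With the luxury brand counts from the `fashion_analysis.json` sample above, Dior would receive the largest weight (17 of the 32 counted searches), followed by Christian Louboutin and Jacquemus.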

#### Usage

```bash
# Run with default paths
python src/examples/product_recommendations.py

# Run with custom analysis files
python src/examples/product_recommendations.py --fashion-analysis data/custom/fashion_trends.json \
                        --user-analysis data/custom/user_profile.txt

# Run personalised recommendations with specific user analysis
python src/examples/product_recommendations.py --user-analysis data/users/luxury_profile.txt \
                        --output results/luxury_recommendations.json
```

#### Output:

Product recommendations are saved in [`data/processed/product_recommendations.json`](data/processed/product_recommendations.json).

A sample from the current output (the full output is available at the link above):



**Set 1**:
```json
"metadatas": [
      [
        {
          "brand": "Jean Paul Gaultier",
          "category": "dress",
          "gender": "F",
          "price_tier": "luxury",
          "url": "https://www.lyst.com/clothing/jean-paul-gaultier-dress-96"
        },
        {
          "brand": "Prada",
          "category": "scarves",
          "gender": "F",
          "price_tier": "luxury",
          "url": "https://www.lyst.com/accessories/prada-wool-and-silk-scarf"
        },
        {
          "brand": "Givenchy",
          "category": "scarves",
          "gender": "F",
          "price_tier": "luxury",
          "url": "https://www.lyst.com/accessories/givenchy-silky-scarves-31"
        },
        {
          "brand": "Dolce & Gabbana",
          "category": "scarves",
          "gender": "M",
          "price_tier": "luxury",
          "url": "https://www.lyst.com/accessories/dolce-gabbana-floral-print-striped-scarf"
        },
        {
          "brand": "Fendi",
          "category": "scarves",
          "gender": "M",
          "price_tier": "luxury",
          "url": "https://www.lyst.com/accessories/fendi-winter-scarves-24"
        }
      ]
```



**Set 2**:
```json
"metadatas": [
      [
        {
          "brand": "Tory Burch",
          "category": "handbag",
          "gender": "F",
          "price_tier": "luxury",
          "url": "https://www.lyst.com/bags/tory-burch-handbag-1679"
        },
        {
          "brand": "Vivienne Westwood",
          "category": "shoulder bags",
          "gender": "F",
          "price_tier": "luxury",
          "url": "https://www.lyst.com/bags/vivienne-westwood-clutches-47"
        },
        {
          "brand": "Jimmy Choo",
          "category": "handbag",
          "gender": "F",
          "price_tier": "luxury",
          "url": "https://www.lyst.com/bags/jimmy-choo-handbag-862"
        },
        {
          "brand": "Golden Goose",
          "category": "tote",
          "gender": "F",
          "price_tier": "luxury",
          "url": "https://www.lyst.com/bags/golden-goose-deluxe-brand-handbags-115"
        },
        {
          "brand": "BOYY",
          "category": "totes",
          "gender": "F",
          "price_tier": "luxury",
          "url": "https://www.lyst.com/bags/boyy-leather-lotus-12-handbag-2"
        }
      ]
    ],
```


**Set 3**:
```json
"metadatas": [
  [
        {
          "brand": "maje",
          "category": "jackets",
          "gender": "F",
          "price_tier": "luxury",
          "url": "https://www.lyst.com/clothing/maje-suit-jacket-141"
        },
        {
          "brand": "Versace",
          "category": "jackets",
          "gender": "F",
          "price_tier": "luxury",
          "url": "https://www.lyst.com/clothing/versace-crocodile-jacquard-wool-blazer"
        },
        {
          "brand": "Lanvin",
          "category": "sneakers",
          "gender": "F",
          "price_tier": "luxury",
          "url": "https://www.lyst.com/shoes/lanvin-knitted-sneakers-1"
        },
        {
          "brand": "Versace",
          "category": "jackets",
          "gender": "M",
          "price_tier": "luxury",
          "url": "https://www.lyst.com/clothing/versace-jacket-1767"
        },
        {
          "brand": "Dorothee Schumacher",
          "category": "jackets",
          "gender": "F",
          "price_tier": "luxury",
          "url": "https://www.lyst.com/clothing/dorothee-schumacher-plaid-shirt-jacket-with-embossed-leather-details"
        }
      ]
    ],
```