# University of Stirling

# ITNPBD2 Representing and Manipulating Data

# Assignment Autumn 2025

# A Consultancy Job for JC Penney

This notebook contains both the assignment instructions and the submission document for ITNPBD2. Read the instructions carefully and enter code into the cells as indicated.

You will need these five files, which were in the Zip file you downloaded from the course webpage:

- jcpenney_reviewers.json
- jcpenney_products.json
- products.csv
- reviews.csv
- users.csv

The data in these files describes products that have been sold by the American retail giant, JC Penney, and reviews by customers who bought them. Note that the product data is real, but the customer data is synthetic.

Your job is to process the data, as requested in the instructions in the markdown cells in this notebook.

# Completing the Assignment

Rename this file to be xxxxxx_BD2 where xxxxxx is your student number, then type your code and narrative description into the boxes provided. Add as many code and markdown cells as you need. The cells should contain:

- **Text narrative describing what you did with the data**
- **The code that performs the task you have described**
- **Comments that explain your code**

The final structure (in PDF) of your report must:
- **Start from the main insights observed (max 5 pages)**
- **Include as an appendix the source code used for producing those insights (max 15 pages)**
- **Include an AI cover sheet (provided on Canvas), which must contain a link to a versioned notebook file in OneDrive or another platform for version checks.**

# Marking Scheme
The assessment will be marked against the university Common Marking Scheme (CMS).

Here is a summary of what you need to achieve to gain a grade in the major grade bands:

|Grade|Requirement|
|:---|:---|
| Fail | You will fail if your code does not run or does not achieve even the basics of the task. You may also fail if you submit code without either comments or a text explanation of what the code does.|
| Pass | To pass, you must submit sufficient working code to show that you have mastered the basics of the task, even if not everything works completely. You must include some justifications for your choice of methods, but without mentioning alternatives. |
| Merit | For a merit, your code must be mostly correct, with only small problems or parts missing, and your comments must be useful rather than simply re-stating the code in English. Most choices for methods and structures should be explained and alternatives mentioned. |
| Distinction | For a distinction, your code must be working, correct, and well commented, and must show an appreciation of style, efficiency and reliability. All choices for methods and structures are concisely justified, and alternatives are given well-thought-out consideration. For a distinction, your work should be good enough to present to executives at the company.|

The full details of the CMS can be found here

https://www.stir.ac.uk/about/professional-services/student-academic-and-corporate-services/academic-registry/academic-policy-and-practice/quality-handbook/assessment-policy-and-procedure/appendix-2-postgraduate-common-marking-scheme/

Note that this means there are not certain numbers of marks allocated to each stage of the assignment. Your grade will reflect how well your solutions and comments demonstrate that you have achieved the learning outcomes of the task. 

## Submission
When you are ready to submit, **print** your notebook as PDF (go to File -> Print Preview) in the Jupyter menu. Make sure you have run all the cells and that their output is displayed. Any lines of code or comments that are not visible in the pdf should be broken across several lines. You can then submit the file online.

Late penalties will apply at a rate of three marks per day, up to a maximum of 7 days. After 7 days you will be given a mark of 0. Extensions will be considered under acceptable circumstances outside your control.

## Academic Integrity

This is an individual assignment, and so all submitted work must be fully your own work.

The University of Stirling is committed to protecting the quality and standards of its awards. Consequently, the University seeks to promote and nurture academic integrity, support staff academic integrity, and support students to understand and develop good academic skills that facilitate academic integrity.

In addition, the University deals decisively with all forms of Academic Misconduct.

Where a student does not act with academic integrity, their work or behaviour may demonstrate Poor Academic Practice or it may represent Academic Misconduct.

### Poor Academic Practice

Poor Academic Practice is defined as: "The submission of any type of assessment with a lack of referencing or inadequate referencing which does not effectively acknowledge the origin of words, ideas, images, tables, diagrams, maps, code, sound and any other sources used in the assessment."

### Academic Misconduct

Academic Misconduct is defined as: "any act or attempted act that does not demonstrate academic integrity and that may result in creating an unfair academic advantage for you or another person, or an academic disadvantage for any other member or member of the academic community."

Plagiarism is presenting somebody else’s work as your own **and includes the use of artificial intelligence tools beyond AIAS Level 2 or the use of Large Language Models.** Plagiarism is a form of academic misconduct and is taken very seriously by the University. Students found to have plagiarised work can have marks deducted and, in serious cases, even be expelled from the University. Do not submit any work that is not entirely your own. Do not collaborate with or get help from anybody else with this assignment.

The University of Stirling's full policy on Academic Integrity can be found at:

https://www.stir.ac.uk/about/professional-services/student-academic-and-corporate-services/academic-registry/academic-policy-and-practice/quality-handbook/academic-integrity-policy-and-academic-misconduct-procedure/

## The Assignment
Your task with this assignment is to use the data provided to demonstrate your Python data manipulation skills.

There are three `.csv` files and two `.json` files so you can process different types of data. The files also contain unstructured data in the form of natural language in English and links to images that you can access from the JC Penney website (use the field called `product_image_urls`).

Start with easy tasks to show you can read in a file, create some variables and data structures, and manipulate their contents. Then move onto something more interesting.

Look at the data that we provided with this assessment and think of something interesting to do with it using whatever libraries you like. Describe what you decide to do with the data and why it might be interesting or useful to the company to do it.

You can add additional data if you need to - either download it or access it using `requests`. Produce working code to implement your ideas in as many cells as you need below. There is no single right answer; the aim is simply to show you are competent in using Python for data analysis. Exactly how you do that is up to you.

For a distinction class grade, this must show originality, creative thinking, and insights beyond what you've been taught directly on the module.

## Structure
You may structure the appendix of the project how you wish, but here is a suggested guideline to help you organise your work, based on the CRISP-DM data science methodology:

 1. **Business understanding** - What business context is the data coming from? What insights would be valuable in that context, and what data would be required for that purpose?
 2. **Data understanding and preparation** - Explore the data and show you understand its structure and relations, with the aid of appropriate visualisation techniques. Assess the data quality, which insights you could draw from it, and what preparation the data would require. Add new data from another source if required to bring new insights to the data you already have.
 3. **Data modeling (optional)** - Would modeling be required for the insights you have considered? Use appropriate techniques, if so.
 4. **Evaluation and deployment** - How do the insights you obtained help the company, and how should they be adopted in their business? If modeling techniques have been adopted, is their use scientifically sound and how should they be maintained?

# Remember to make sure you are working completely on your own.
# Don't work in a group or with a friend


## **JCPenney Consultancy Analysis**
### **Date: 27/10/2025**




# Environment Setup Instructions

## Setting up the Environment with Anaconda

Follow these steps to set up your environment for running this Jupyter notebook:

### 1. Clone the Repository
```bash
git clone https://github.com/yakubuaisha318-gif/Representation_and_Manipulation_of_Data_JC_Penny_Consultancy_Assignment.git
cd Representation_and_Manipulation_of_Data_JC_Penny_Consultancy_Assignment
```

### 2. Install Anaconda
If you haven't already installed Anaconda, download it from [anaconda.com](https://www.anaconda.com/products/distribution) and follow the installation instructions for your operating system.

### 3. Create a New Conda Environment
```bash
conda create -n jcpenney-analysis python=3.9
```

### 4. Activate the Environment
```bash
conda activate jcpenney-analysis
```

### 5. Install Required Dependencies
```bash
pip install -r requirements.txt
```

If the requirements file is not available, install the necessary packages:
```bash
conda install pandas numpy matplotlib openpyxl
pip install fpdf
```

### 6. Start Jupyter Notebook
```bash
jupyter notebook
```

### 7. Open and Run This Notebook
1. Navigate to this notebook file in the Jupyter interface
2. Select the kernel: `Kernel` → `Change kernel` → `jcpenney-analysis`
3. Run the cells: `Cell` → `Run All`

### 8. Deactivating the Environment
When you're done working:
```bash
conda deactivate
```

# Importing Libraries

In [12]:
import json
import pandas as pd
import numpy as np
from typing import List, Dict, Any
import warnings
from collections import Counter
import os
import matplotlib.pyplot as plt
import tempfile
import requests
from fpdf import FPDF

warnings.filterwarnings('ignore')




ModuleNotFoundError: No module named 'fpdf'
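The `ModuleNotFoundError` above usually means that `fpdf` is not installed in the environment the kernel is running in. A minimal sketch for checking which of the notebook's third-party imports are unavailable before running the remaining cells; `find_missing_modules` is a hypothetical helper and is not part of the original notebook:

```python
import importlib.util
from typing import Iterable, List


def find_missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Third-party modules imported by this notebook.
required = ["pandas", "numpy", "matplotlib", "requests", "fpdf"]
missing = find_missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If `fpdf` is reported as missing, installing it into the active `jcpenney-analysis` environment and restarting the kernel should resolve the import error.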

# Constants used in project

In [None]:
CHART_SPACING_SMALL = 35  # Optimized spacing after charts
CHART_SPACING_MEDIUM = 75  # Optimized spacing for better page utilization
TEXT_SPACING_SMALL = 5
PDF_REPORT_NAME = "3512017_JCpenney_Analysis_Report.pdf"

# Data Loading Functions

## Converts JSON file to JSON array

In [None]:
def convert_json_file_to_json_array(json_file_path: str) -> List[Dict[str, Any]]:
    '''Convert a JSON Lines file to a JSON array.
    Args:
        json_file_path (str): The path to the JSON Lines file.
    
    Returns:
        List[Dict[str, Any]]: A list of dictionaries representing the JSON array.'''
    data: List[Dict[str, Any]] = []
    try:
        with open(json_file_path, "r", encoding='utf-8') as file:
            for line in file:
                line = line.strip()
                if line:
                    try:
                        data.append(json.loads(line))
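                    except json.JSONDecodeError:
                        # Skip malformed lines rather than aborting the whole load
                        continue
    except OSError as error:
        # Report a missing or unreadable file instead of failing silently
        print(f"Could not read {json_file_path}: {error}")
    return data

jcpenney_products = convert_json_file_to_json_array("jcpenney_products.json")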
jcpenney_reviewers = convert_json_file_to_json_array("jcpenney_reviewers.json")

total_jcpenney_products = len(jcpenney_products)
total_jcpenney_reviewers = len(jcpenney_reviewers)

print("Total jcpenney Products = ",total_jcpenney_products)
print("Total jcpenney Reviewers = ",total_jcpenney_reviewers)
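Because the loader above skips malformed lines, the printed totals alone do not show whether any records were dropped. A minimal sketch of a variant that also counts skipped lines; `read_json_lines` is a hypothetical name, and the sketch assumes one JSON object per line, as the loader above does:

```python
import json
from typing import Any, Dict, List, Tuple


def read_json_lines(path: str) -> Tuple[List[Dict[str, Any]], int]:
    """Read a JSON Lines file and return (records, number_of_skipped_lines)."""
    records: List[Dict[str, Any]] = []
    skipped = 0
    with open(path, "r", encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if not line:
                continue  # blank lines are ignored, not counted as errors
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                skipped += 1  # malformed line: count it instead of hiding it
    return records, skipped
```

For example, `records, skipped = read_json_lines("jcpenney_products.json")` would make it possible to report the number of unreadable lines next to the product total.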

## Converts CSV files to JSON arrays

In [None]:
def convert_csv_file_to_json_array(csv_file_path: str) -> List[Any]:
    '''Convert a CSV file to a JSON array.
    Args:
        csv_file_path (str): The path to the CSV file.
    
    Returns:
        List[Any]: A list of dictionaries representing the JSON array.'''
    try:
        df: pd.DataFrame = pd.read_csv(csv_file_path)
        json_str = df.to_json(orient="records")
        return json.loads(json_str) if json_str else []
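    except (OSError, ValueError) as error:
        # OSError covers missing or unreadable files; ValueError covers parsing problems
        print(f"Could not load {csv_file_path}: {error}")
        return []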

products = convert_csv_file_to_json_array("products.csv")
reviews = convert_csv_file_to_json_array("reviews.csv")
users = convert_csv_file_to_json_array("users.csv")

total_products = len(products)
total_reviews = len(reviews)
total_users = len(users)

print("Total Products = ",total_products)
print("Total Reviews = ",total_reviews)
print("Total Users = ",total_users)
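One side effect of the JSON round trip in `convert_csv_file_to_json_array` is that missing CSV values come back as `None` rather than `NaN`, which is what later `is not None` checks rely on. A minimal sketch of the difference, using a made-up two-row CSV (the values are illustrative only):

```python
import io
import json
import math

import pandas as pd

# Hypothetical CSV content with one missing price.
csv_text = "Uniq_id,Price\na1,10.5\nb2,\n"
df = pd.read_csv(io.StringIO(csv_text))

# Round trip through JSON: the missing value becomes None.
via_json = json.loads(df.to_json(orient="records"))

# Direct conversion: the missing value stays a float NaN.
direct = df.to_dict(orient="records")

print(via_json[1]["Price"] is None)
print(math.isnan(direct[1]["Price"]))
```

With `to_dict`, a filter such as `value is not None` would let `NaN` through, so the round trip is a reasonable choice here even though it costs an extra serialisation step.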

## Data Comparison & Validation Functions


Upon a critical examination of the data, it was observed that jcpenney_products.json is a more detailed version of products.csv, and jcpenney_reviewers.json is a more detailed version of users.csv. Consequently, data comparison and validation functions were developed to analyze the relationships between these files.

Further investigation also showed one repeated username (`dqft3311`). The two records have different dates of birth and states, so they were treated as distinct users rather than duplicates.
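The distinction between a repeated username and a true duplicate record can be checked with pandas. A minimal sketch, using made-up rows that only mirror the `Username`, `DOB` and `State` fields (the values and the date format are assumptions, not taken from the data):

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the users data.
users_df = pd.DataFrame(
    {
        "Username": ["abcd1234", "dqft3311", "dqft3311"],
        "DOB": ["01.01.1980", "15.03.1975", "22.07.1990"],
        "State": ["Texas", "Ohio", "Maine"],
    }
)

# Rows that share a username with at least one other row.
same_username = users_df[users_df.duplicated(subset="Username", keep=False)]

# Rows that are identical in every column (true duplicates).
exact_duplicates = users_df[users_df.duplicated(keep=False)]

print(len(same_username), len(exact_duplicates))
```

Rows that appear in `same_username` but not in `exact_duplicates` are distinct people who happen to share a username, which is the situation described above.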

In [None]:
def compare_dataset_fields(jcpenney_data: List[Dict[str, Any]], csv_data: List[Dict[str, Any]], 
                          dataset_name: str) -> Dict[str, Any]:
    """Compare fields between JCPenney dataset and CSV dataset to identify additional fields.
    
    Args:
        jcpenney_data: The JCPenney dataset (richer version)
        csv_data: The CSV dataset (simpler version)
        dataset_name: Name of the dataset for reporting purposes
        
    Returns:
        Dict containing lists of additional fields in JCPenney data and missing fields
    """
    if not jcpenney_data or not csv_data:
        return {"additional_fields": [], "missing_fields": [], "field_mappings": {}}
    
    # Get field sets from both datasets
    jcpenney_fields = set(jcpenney_data[0].keys()) if jcpenney_data else set()
    csv_fields = set(csv_data[0].keys()) if csv_data else set()
    
    # Define field mappings between CSV and JCPenney datasets
    field_mappings = {
        "SKU": "sku",
        "Uniq_id": "uniq_id",
        "Name": "name_title",
        "Price": "list_price",
        "Description": "description",
        "Av_Score": "average_product_rating"
    }
    
    # Reverse mapping for easier lookup
    reverse_mappings = {v: k for k, v in field_mappings.items()}
    
    # Identify truly additional fields (not just renamed)
    additional_fields = []
    missing_fields = []
    
    for field in jcpenney_fields:
        # Check if this is a renamed version of a CSV field
        if field not in field_mappings.values() or reverse_mappings.get(field) not in csv_fields:
            # Not a simple renaming, so it's truly additional
            if field not in [reverse_mappings.get(f, f) for f in csv_fields]:
                additional_fields.append(field)
    
    for field in csv_fields:
        # Check if this field is missing (not just renamed)
        if field not in field_mappings or field_mappings[field] not in jcpenney_fields:
            # Not a simple renaming, so it's truly missing
            if field not in [field_mappings.get(f, f) for f in jcpenney_fields]:
                missing_fields.append(field)
    
    # Get the actual mappings that exist
    actual_mappings = {}
    for csv_field, jcp_field in field_mappings.items():
        if csv_field in csv_fields and jcp_field in jcpenney_fields:
            actual_mappings[csv_field] = jcp_field
    
    return {
        "additional_fields": sorted(additional_fields),
        "missing_fields": sorted(missing_fields),
        "field_mappings": actual_mappings
    }

def print_dataset_comparison(jcpenney_data: List[Dict[str, Any]], csv_data: List[Dict[str, Any]], 
                           dataset_name: str) -> None:
    """Print a comparison of fields between datasets.
    
    Args:
        jcpenney_data: The JCPenney dataset (richer version)
        csv_data: The CSV dataset (simpler version)
        dataset_name: Name of the dataset for reporting purposes
    """
    comparison = compare_dataset_fields(jcpenney_data, csv_data, dataset_name)
    
    print(f"\n{dataset_name} Dataset Comparison:")
    print(f"  JCPenney data contains all fields from CSV (with different names):")
    for csv_field, jcp_field in comparison['field_mappings'].items():
        print(f"    - {csv_field} -> {jcp_field}")
    
    if comparison['additional_fields']:
        print(f"  Truly additional fields in JCPenney data: {len(comparison['additional_fields'])}")
        for field in comparison['additional_fields']:
            print(f"    - {field}")
    
    if comparison['missing_fields']:
        print(f"  Fields missing in JCPenney data: {len(comparison['missing_fields'])}")
        for field in comparison['missing_fields']:
            print(f"    - {field}")

# Compare datasets and identify additional fields
print_dataset_comparison(jcpenney_reviewers, users, "Reviewers")
print_dataset_comparison(jcpenney_products, products, "Products")

jcpenney_reviewers_usernames = [reviewer["Username"] for reviewer in jcpenney_reviewers]
users_usernames = [user["Username"] for user in users]

jcpenney_products_uniq_ids = [product["uniq_id"] for product in jcpenney_products]
products_uniq_ids = [product["Uniq_id"] for product in products]

print("len of reviewers (json file) ", len(jcpenney_reviewers))
print("len of users (csv file) ", len(users))
# compare length and unique elements of reviewers and users
if (set(jcpenney_reviewers_usernames).issubset(set(users_usernames)) and 
    len(jcpenney_reviewers_usernames) == len(users_usernames)):
    selected_reviewers = jcpenney_reviewers  # Use the richer dataset
    print("Using jcpenney_reviewers for processing (contains additional fields)")
else:
    selected_reviewers = users  # Fallback to simpler dataset
print("len of jcpenny products (json file) ", len(jcpenney_products))
print("len of products (csv file) ", len(products))
# compare length and unique elements of products and jcpenney_products
if (set(jcpenney_products_uniq_ids).issubset(set(products_uniq_ids)) and 
    len(jcpenney_products_uniq_ids) == len(products_uniq_ids)):
    selected_products = jcpenney_products  # Use the richer dataset
    print("Using jcpenney_products for processing (contains additional fields)")
else:
    selected_products = products  # Fallback to simpler dataset
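The subset-and-equal-length test above can pass even when the two files do not hold the same identifiers, for example when one list contains a repeated value. A minimal sketch of a stricter comparison; `compare_keys` is a hypothetical helper and not part of the original notebook:

```python
from collections import Counter
from typing import Dict, List, Sequence


def compare_keys(left: Sequence[str], right: Sequence[str]) -> Dict[str, List[str]]:
    """Report keys unique to each side and keys repeated within each side."""
    left_counts, right_counts = Counter(left), Counter(right)
    return {
        "only_in_left": sorted(set(left_counts) - set(right_counts)),
        "only_in_right": sorted(set(right_counts) - set(left_counts)),
        "repeated_in_left": sorted(k for k, n in left_counts.items() if n > 1),
        "repeated_in_right": sorted(k for k, n in right_counts.items() if n > 1),
    }


# Illustrative input: same length, left is a subset of right, yet "c" is missing on the left.
report = compare_keys(["a", "a", "b"], ["a", "b", "c"])
print(report)
```

Applied to the username or `uniq_id` lists, an empty `only_in_left` and `only_in_right` would confirm that both files describe the same entities.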

# Data Processing Functions

## Extracting Values

In [None]:

def extract_values(data: List[Dict[str, Any]], field_name: str) -> List[float]:
    """Extract numeric values from a field in a list of dictionaries.
    Args:
        data (List[Dict[str, Any]]): The list of dictionaries containing the data.
        field_name (str): The name of the field to extract values from.
    
    Returns:
        List[float]: A list of extracted numeric values converted to floats."""
    values = []
    for item in data:
        value_str = item.get(field_name)
        if value_str is not None:
            try:
                values.append(float(value_str))
            except (ValueError, TypeError):
                pass
    return values

# Print the first four scores extracted from the reviews
print('First four scores extracted from the "Score" field in reviews:', extract_values(reviews, "Score")[:4])

## By numeric data

In [None]:
def analyze_numeric_data(products_data: List[Dict[str, Any]], field_name: str) -> Dict[str, Any]:
    """Analyze numeric data across all products
    Args:
        products_data (List[Dict[str, Any]]): The list of dictionaries containing the products data.
        field_name (str): The name of the field to extract numeric values from.
    
    Returns:
        Dict[str, Any]: A dictionary containing the count, mean, median, standard deviation, minimum, and maximum values."""
    values = extract_values(products_data, field_name)
    if not values:
        return {}
    
    return {
        "count": len(values),
        "mean": np.mean(values),
        "median": np.median(values),
        "std": np.std(values),
        "min": np.min(values),
        "max": np.max(values)
    }

print(analyze_numeric_data(jcpenney_products, "average_product_rating"))
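Note that `np.std` computes the population standard deviation by default (`ddof=0`), whereas pandas' `Series.std` defaults to the sample standard deviation (`ddof=1`), so the two can disagree on the same data. A minimal sketch with made-up values:

```python
import numpy as np

values = [1.0, 2.0, 3.0, 4.0]  # hypothetical ratings, for illustration only

population_std = np.std(values)      # ddof=0, the NumPy default
sample_std = np.std(values, ddof=1)  # ddof=1, matches the pandas default

print(round(float(population_std), 3), round(float(sample_std), 3))
```

For a dataset of this size the difference is negligible, but stating which definition is used avoids confusion when figures are compared across tools.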

## By product

In [None]:
def get_products(products_data: List[Dict[str, Any]], sort_field: str, top_n: int = 5, highest: bool = True) -> List[Dict[str, Any]]:
    """Get top or bottom products based on a specified field.
    Args:
        products_data (List[Dict[str, Any]]): The list of dictionaries containing the products data.
        sort_field (str): The name of the field to sort the products by.
        top_n (int, optional): The number of top or bottom products to retrieve. Defaults to 5.
        highest (bool, optional): Whether to retrieve the highest or lowest values. Defaults to True.
    
    Returns:
        List[Dict[str, Any]]: A list of dictionaries representing the top or bottom products."""
    valid_products = [p for p in products_data if p.get(sort_field) is not None]
    sorted_products = sorted(valid_products, 
                           key=lambda x: float(x[sort_field]), 
                           reverse=highest)
    return sorted_products[:top_n]
    
top_rated_products = get_products(jcpenney_products, "average_product_rating", 5, True)
print("Top rated products: ", [product.get('name_title') for product in top_rated_products])

## By User demographics

In [None]:
def get_user_demographics(reviewers_data: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Analyze user demographics from reviewers data.
    Args:
        reviewers_data (List[Dict[str, Any]]): The list of dictionaries containing the reviewers data.
    
    Returns:
        Dict[str, Any]: A dictionary containing the total number of reviewers, state distribution, top states, and age statistics."""
    states = [reviewer.get("State") for reviewer in reviewers_data if reviewer.get("State")]
    dobs = [reviewer.get("DOB") for reviewer in reviewers_data if reviewer.get("DOB")]
    
    birth_years = []
    for dob in dobs:
        if dob:
            try:
                year = int(dob.split(".")[-1])
                birth_years.append(year)
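            except (ValueError, AttributeError):
                # Skip DOB values that are not strings or do not end in a numeric year
                pass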
    
    current_year = 2025
    ages = [current_year - year for year in birth_years if 1900 <= year <= current_year]
    
    return {
        "total_reviewers": len(reviewers_data),
        "state_distribution": dict(Counter(states)),
        "top_states": Counter(states).most_common(10),
        "age_statistics": {
            "count": len(ages),
            "mean_age": np.mean(ages) if ages else 0,
            "median_age": np.median(ages) if ages else 0
        } if ages else {}
    }
get_user_demographics(users)
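The function above derives age from the birth year alone, which can be off by one for reviewers whose birthday has not yet occurred in the current year. A minimal sketch of an exact calculation, assuming the `DOB` strings follow a `dd.mm.yyyy` layout (an assumption; the code above only relies on the year being the last dot-separated part). `age_on` is a hypothetical helper:

```python
from datetime import date, datetime
from typing import Optional


def age_on(dob_text: str, reference: date, fmt: str = "%d.%m.%Y") -> Optional[int]:
    """Return the age in whole years on the reference date, or None if unparseable."""
    try:
        dob = datetime.strptime(dob_text, fmt).date()
    except (ValueError, TypeError):
        return None  # not a string, or not in the assumed format
    if dob > reference:
        return None  # a birth date in the future is treated as invalid
    # Subtract one year if the birthday has not been reached yet in the reference year.
    before_birthday = (reference.month, reference.day) < (dob.month, dob.day)
    return reference.year - dob.year - int(before_birthday)
```

Passing a fixed reference date, such as the report date, keeps the result reproducible instead of depending on when the notebook is run.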


# Visualization Functions

In [None]:
def create_bar_chart(x_data, y_data, title, x_label, y_label, color, figsize=(10, 3)):
    """Create a bar chart and return the temporary file path.
    Args:
        x_data: Data for the x-axis.
        y_data: Data for the y-axis.
        title: Title of the chart.
        x_label: Label for the x-axis.
        y_label: Label for the y-axis.
        color: Color of the bars.
        figsize: Size of the figure.
    
    Returns:
        str: Temporary file path for the created chart."""
    plt.figure(figsize=figsize)  # Reduced height from (10, 5) to (10, 3)
    bars = plt.bar(x_data, y_data, color=color)
    plt.title(title)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.xticks(x_data)
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        if height > 0:
            plt.text(bar.get_x() + bar.get_width()/2., height,
                    f'{int(height)}' if isinstance(height, (int, float)) else f'{height:.2f}', 
                    ha='center', va='bottom')
    
    temp_file = tempfile.NamedTemporaryFile(suffix='.png', delete=False)
    temp_file.close()
    plt.savefig(temp_file.name, format='png', bbox_inches='tight', dpi=150)
    plt.close()
    
    return temp_file.name

score_distribution = Counter(extract_values(reviews, "Score"))
scores = sorted(score_distribution.keys())
counts = [score_distribution[score] for score in scores]
        
create_bar_chart(scores, counts, 'Distribution of Review Scores','Review Score', 'Number of Reviews', '#2ca02c', (10, 3))

def create_histogram_chart(data, title, x_label, y_label, color, figsize=(10, 3)):
    """Create a histogram chart and return the temporary file path.
    Args:
        data: Data for the histogram.
        title: Title of the chart.
        x_label: Label for the x-axis.
        y_label: Label for the y-axis.
        color: Color of the bars.
        figsize: Size of the figure.
    
    Returns:
        str: Temporary file path for the created chart."""
    plt.figure(figsize=figsize)  # Reduced height from (10, 5) to (10, 3)
    plt.hist(data, bins=10, color=color, alpha=0.7)
    plt.title(title)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.grid(True, alpha=0.3)
    
    temp_file = tempfile.NamedTemporaryFile(suffix='.png', delete=False)
    temp_file.close()
    plt.savefig(temp_file.name, format='png', bbox_inches='tight', dpi=150)
    plt.close()
    
    return temp_file.name

def create_comparison_chart(product_rating, review_score):
    """Create a comparison chart and return the temporary file path.
    Args:
        product_rating: Product rating data.
        review_score: Review score data.
    
    Returns:
        str: Temporary file path for the created chart."""
    plt.figure(figsize=(10, 6))
    x = ['Product Ratings', 'Review Scores']
    y = [product_rating, review_score]
    bars = plt.bar(x, y, color=['#1f77b4', '#ff7f0e'])
    plt.title('Average Ratings Comparison')
    plt.ylabel('Average Score')
    plt.ylim(0, 5)
    
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.2f}', ha='center', va='bottom')
    
    temp_file = tempfile.NamedTemporaryFile(suffix='.png', delete=False)
    temp_file.close()
    plt.savefig(temp_file.name, format='png', bbox_inches='tight', dpi=150)
    plt.close()
    
    return temp_file.name

def create_line_chart(data, title, x_label, y_label, color, figsize=(10, 3)):
    """Create a line chart and return the temporary file path.
    Args:
        data: Data for the line chart.
        title: Title of the chart.
        x_label: Label for the x-axis.
        y_label: Label for the y-axis.
        color: Color of the line.
        figsize: Size of the figure.
    
    Returns:
        str: Temporary file path for the created chart."""
    plt.figure(figsize=figsize)
    
    # Create bins for the line chart
    counts, bin_edges = np.histogram(data, bins=10)
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    
    plt.plot(bin_centers, counts, marker='o', color=color, linewidth=2, markersize=6)
    plt.fill_between(bin_centers, counts, alpha=0.3, color=color)
    plt.title(title)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.grid(True, alpha=0.3)
    
    temp_file = tempfile.NamedTemporaryFile(suffix='.png', delete=False)
    temp_file.close()
    plt.savefig(temp_file.name, format='png', bbox_inches='tight', dpi=150)
    plt.close()
    
    return temp_file.name
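Each chart function above writes a PNG with `delete=False`, so the files stay on disk until they are removed explicitly. A minimal sketch of a cleanup helper that could be called once the images are no longer needed; `remove_temp_files` is a hypothetical name and not part of the original notebook:

```python
import os
from typing import Iterable, List


def remove_temp_files(paths: Iterable[str]) -> List[str]:
    """Delete each file in paths and return the paths that could not be removed."""
    failed: List[str] = []
    for path in paths:
        try:
            os.remove(path)
        except FileNotFoundError:
            pass  # already gone, nothing to do
        except OSError:
            failed.append(path)  # e.g. permission denied or file still in use
    return failed
```

Collecting the returned chart paths in a list and passing that list to this helper at the end of the run keeps the system temporary directory from filling up with stale images.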

# PDF Generation Functions

## Adding product Image to PDF

In [None]:
def add_product_images(pdf: FPDF, products: List[Dict[str, Any]], start_y: float) -> List[str]:
    """Add product images to the PDF and return a list of temp file paths.
    Args:
        pdf (FPDF): The FPDF object to which images will be added.
        products (List[Dict[str, Any]]): The list of dictionaries containing the products data.
        start_y (float): The starting y-coordinate for placing images.
    
    Returns:
        List[str]: A list of temporary file paths for the added images."""
    temp_files = []
    x_position = 10
    image_width = 30
    image_height = 30
    spacing = 5
    
    for i, product in enumerate(products[:3]):
        image_url = product.get("product_image_urls")
        if image_url:
            try:
                headers = {
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                }
                response = requests.get(image_url, headers=headers, timeout=10)
                if response.status_code == 200:
                    temp_file = tempfile.NamedTemporaryFile(suffix='.jpg', delete=False)
                    temp_files.append(temp_file.name)
                    temp_file.close()
                    
                    with open(temp_file.name, 'wb') as f:
                        f.write(response.content)
                    
                    current_x = x_position + (i * (image_width + spacing))
                    pdf.image(temp_file.name, x=current_x, y=start_y, w=image_width, h=image_height)
                    
                    pdf.set_font("Arial", size=8)
                    product_name = product.get("name_title", "")[:15]
                    pdf.set_xy(current_x, start_y + image_height + 2)
                    pdf.cell(image_width, 5, product_name, align="C")
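            except Exception as error:
                # Download or embedding failed: report the URL and show a placeholder instead
                print(f"Could not add image {image_url}: {error}")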
                current_x = x_position + (i * (image_width + spacing))
                pdf.set_font("Arial", size=8)
                pdf.set_xy(current_x, start_y)
                pdf.cell(image_width, image_height, "Image N/A", align="C")
    
    return temp_files

## Visualization of Data

In [None]:
def create_visualization_pdf_report(pdf_filename: str) -> bool:
    """Create a comprehensive PDF report with charts and product images.
    Args:
        pdf_filename (str): The filename of the PDF report to be created.
    
    Returns:
        bool: True if the report was created successfully, False otherwise."""
    print("running...")
    try:        
        class PDF(FPDF):
            def header(self):
                self.set_font('Arial', 'B', 15)
                self.cell(0, 10, 'JCPenney Data Analysis Report', 0, 1, 'C')
                self.ln(5)
            
            def footer(self):
                self.set_y(-15)
                self.set_font('Arial', 'I', 8)
                self.cell(0, 10, f'Page {self.page_no()}', 0, 0, 'C')
        
        pdf = PDF()
        pdf.set_auto_page_break(auto=True, margin=15)
        
        # Load data
        jcpenney_reviewers = convert_json_file_to_json_array("jcpenney_reviewers.json")
        jcpenney_products = convert_json_file_to_json_array("jcpenney_products.json")
        reviews = convert_csv_file_to_json_array("reviews.csv")
        
        # Calculate metrics
        product_ratings = analyze_numeric_data(jcpenney_products, "average_product_rating")
        avg_product_rating = round(product_ratings.get("mean", 0), 2) if product_ratings else 0
        
        review_scores = analyze_numeric_data(reviews, "Score")
        avg_review_score = round(review_scores.get("mean", 0), 2) if review_scores else 0
        
        # Calculate user demographics
        user_demographics = get_user_demographics(jcpenney_reviewers)
        age_stats = user_demographics.get("age_statistics", {})
        avg_age = round(age_stats.get("mean_age", 0), 1) if age_stats else 0
        
        temp_files = []
        
        # Page 1: Title and Key Metrics
        pdf.add_page()
        pdf.set_font("Arial", "B", 24)
        pdf.cell(0, 20, "JCPenney Data Analysis Report", ln=True, align="C")
        pdf.ln(10)
        
        pdf.set_font("Arial", size=12)
        pdf.cell(0, 10, "Comprehensive Analysis of Products and Customer Reviews", ln=True, align="C")
        pdf.ln(20)
        
        # Create comparison chart
        temp_file_path = create_comparison_chart(avg_product_rating, avg_review_score)
        temp_files.append(temp_file_path)
        pdf.image(temp_file_path, x=10, y=80, w=190)
        pdf.ln(120)
        
        # Key metrics
        pdf.set_font("Arial", "B", 14)
        pdf.cell(0, 10, "Key Metrics", ln=True)
        pdf.ln(5)
        
        metrics = [
            ("Total Products", f"{len(jcpenney_products):,}"),
            ("Total Reviews", f"{len(reviews):,}"),
            ("Average Product Rating", str(avg_product_rating)),
            ("Average Review Score", str(avg_review_score)),
            ("Total Reviewers", f"{len(jcpenney_reviewers):,}"),
        ]
        
        pdf.set_font("Arial", size=12)
        for metric, value in metrics:
            pdf.cell(0, 8, f"{metric}: {value}", ln=True)
        
        # Page 2: Review Score Distribution
        pdf.add_page()
        pdf.set_font("Arial", "B", 18)
        pdf.cell(0, 10, "Review Score Distribution", ln=True)
        pdf.ln(10)
        
        # Review score distribution chart
        score_distribution = Counter(_extract_values(reviews, "Score"))
        scores = sorted(score_distribution.keys())
        counts = [score_distribution[score] for score in scores]
        
        temp_file_path = create_bar_chart(scores, counts, 'Distribution of Review Scores', 
                                         'Review Score', 'Number of Reviews', '#2ca02c', (10, 3))
        temp_files.append(temp_file_path)
        pdf.image(temp_file_path, x=10, y=40, w=190)
        pdf.ln(CHART_SPACING_MEDIUM)  # Optimized spacing for better page utilization
        
        # User Demographics
        pdf.set_font("Arial", "B", 16)
        pdf.cell(0, 10, "User Demographics", ln=True)
        pdf.ln(5)
        
        # Calculate age statistics for display
        ages = []
        for reviewer in jcpenney_reviewers:
            dob = reviewer.get("DOB")
            if dob:
                try:
                    year = int(dob.split(".")[-1])
                    if 1900 <= year <= 2025:
                        ages.append(2025 - year)
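                except (ValueError, AttributeError):
                    # Skip DOB values that are not strings or do not end in a numeric year
                    pass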
        
        if ages:
            youngest_age = min(ages)
            oldest_age = max(ages)
            
            # Display age statistics
            pdf.set_font("Arial", size=12)
            pdf.cell(0, 8, f"Youngest Customer: {youngest_age} years old", ln=True)
            pdf.cell(0, 8, f"Oldest Customer: {oldest_age} years old", ln=True)
            pdf.cell(0, 8, f"Average Age: {avg_age} years old", ln=True)
            pdf.ln(TEXT_SPACING_SMALL)
        
        # Display top states
        top_states = user_demographics.get("top_states", [])
        if top_states:
            pdf.set_font("Arial", "B", 14)
            pdf.cell(0, 10, "Top States by Reviewer Count:", ln=True)
            pdf.ln(5)
            
            pdf.set_font("Arial", size=12)
            for rank, (state, count) in enumerate(top_states[:5], start=1):  # Show top 5 states
                pdf.cell(0, 8, f"{rank}. {state}: {count} reviewers", ln=True)
            pdf.ln(TEXT_SPACING_SMALL)
        
        # Age distribution chart
        if age_stats and age_stats.get("count", 0) > 0 and ages:
            temp_file_path = create_line_chart(ages, 'Age Distribution of Reviewers', 
                                             'Age', 'Number of Reviewers', '#9467bd', (10, 3))
            temp_files.append(temp_file_path)
            pdf.image(temp_file_path, x=10, y=pdf.get_y(), w=190)
            pdf.ln(CHART_SPACING_SMALL)  # Optimized spacing after chart
        
        # Page 3: Top Rated Products with Images
        pdf.add_page()
        pdf.set_font("Arial", "B", 18)
        pdf.cell(0, 10, "Top Rated Products", ln=True)
        pdf.ln(10)
        
        # Top rated products chart
        top_rated = get_products(jcpenney_products, "average_product_rating", 5, True)
        if top_rated:
            products = [product.get("name_title", "")[:30] for product in top_rated]
            ratings = [float(product.get("average_product_rating", 0)) for product in top_rated]
        else:
            products = []
            ratings = []
        
        plt.figure(figsize=(10, 8))
        bars = plt.barh(products, ratings, color='#d62728')
        plt.title('Top 5 Highest Rated Products')
        plt.xlabel('Average Rating')
        plt.xlim(0, 5.5)
        
        # Label each bar with its rating value
        for bar, rating in zip(bars, ratings):
            plt.text(bar.get_width() + 0.05, bar.get_y() + bar.get_height()/2,
                    f'{rating:.1f}', ha='left', va='center')
        
        temp_file = tempfile.NamedTemporaryFile(suffix='.png', delete=False)
        temp_file.close()
        temp_files.append(temp_file.name)
        plt.savefig(temp_file.name, format='png', bbox_inches='tight', dpi=150)
        plt.close()
        
        pdf.image(temp_file.name, x=10, y=40, w=190)
        pdf.ln(110)
        
        # Add product images for top rated products
        pdf.set_font("Arial", "B", 14)
        pdf.cell(0, 10, "Product Images:", ln=True)
        pdf.ln(5)
        
        y_position = pdf.get_y()
        image_temp_files = add_product_images(pdf, top_rated, y_position)
        temp_files.extend(image_temp_files)
        
        # Page 4: Low Rated Products with Images
        pdf.add_page()
        pdf.set_font("Arial", "B", 18)
        pdf.cell(0, 10, "Low Rated Products", ln=True)
        pdf.ln(10)
        
        # Low rated products
        low_rated = get_products(jcpenney_products, "average_product_rating", 5, False)
        if low_rated:
            products = [product.get("name_title", "")[:30] for product in low_rated]
            ratings = [float(product.get("average_product_rating", 0)) for product in low_rated]
        else:
            products = []
            ratings = []
        
        plt.figure(figsize=(10, 8))
        bars = plt.barh(products, ratings, color='#1f77b4')
        plt.title('5 Lowest Rated Products')
        plt.xlabel('Average Rating')
        plt.xlim(0, 5.5)
        
        # Label each bar with its rating value
        for bar, rating in zip(bars, ratings):
            plt.text(bar.get_width() + 0.05, bar.get_y() + bar.get_height()/2,
                    f'{rating:.1f}', ha='left', va='center')
        
        temp_file = tempfile.NamedTemporaryFile(suffix='.png', delete=False)
        temp_file.close()
        temp_files.append(temp_file.name)
        plt.savefig(temp_file.name, format='png', bbox_inches='tight', dpi=150)
        plt.close()
        
        pdf.image(temp_file.name, x=10, y=40, w=190)
        pdf.ln(110)
        
        # Add product images for low rated products
        pdf.set_font("Arial", "B", 14)
        pdf.cell(0, 10, "Product Images:", ln=True)
        pdf.ln(5)
        
        y_position = pdf.get_y()
        image_temp_files = add_product_images(pdf, low_rated, y_position)
        temp_files.extend(image_temp_files)
        
        # Page 5: Summary & Recommendations
        pdf.add_page()
        pdf.set_font("Arial", "B", 18)
        pdf.cell(0, 10, "Summary & Recommendations", ln=True)
        pdf.ln(10)
        
        pdf.set_font("Arial", "B", 14)
        pdf.cell(0, 10, "Key Findings:", ln=True)
        pdf.ln(5)
        
        pdf.set_font("Arial", size=12)
        findings = [
            f"1. Average product rating is {avg_product_rating}, indicating {'good product quality' if avg_product_rating >= 3 else 'room for improvement'}",
            f"2. Review scores are {'high' if avg_review_score >= 3 else 'low'} (average {avg_review_score})",
            f"3. Customer base has an average age of {avg_age} years"
        ]
        for finding in findings:
            pdf.cell(0, 8, finding, ln=True)
        
        pdf.ln(10)
        
        pdf.set_font("Arial", "B", 14)
        pdf.cell(0, 10, "Recommendations:", ln=True)
        pdf.ln(5)
        
        pdf.set_font("Arial", size=12)
        recommendations = []
        if avg_product_rating < 3:
            recommendations.append("- Focus on improving product quality to increase ratings")
        if avg_review_score < 3:
            recommendations.append("- Investigate reasons for low review scores")
        if 30 < avg_age < 50:
            recommendations.append("- Target marketing efforts toward middle-aged demographics")
        
        if not recommendations:
            recommendations = [
                "- Continue monitoring product ratings and review scores",
                "- Regularly analyze customer feedback for improvement opportunities",
                "- Consider expanding product lines based on top-rated items"
            ]
            
        for recommendation in recommendations:
            pdf.cell(0, 8, recommendation, ln=True)
        
        pdf.output(pdf_filename)
        
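        # Clean up temporary files; handle each file separately so that one
        # failed deletion does not leave the remaining files behind
        for temp_file in temp_files:
            try:
                os.unlink(temp_file)
            except OSError:
                pass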
        
        print(f"PDF report with visualizations created: {pdf_filename}")
        return True
        
    except Exception as e:
        print(f"Error creating visualization PDF report: {e}")
        return False

create_visualization_pdf_report("jcpenney_analysis.pdf")
