## Overview
This report presents a detailed analysis of JCPenney products and customer review data. It covers customer demographics, product ratings, and review patterns, along with business insights to support decision making.

## Data Sources
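1. `jcpenney_reviewers.json` - Customer reviewer data
2. `jcpenney_products.json` - Product information and ratings
3. `reviews.csv` - Detailed customer reviews
4. `products.csv` - Simplified product data, compared against `jcpenney_products.json`
5. `users.csv` - Simplified reviewer data, compared against `jcpenney_reviewers.json`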

### Libraries Used
- **pandas**: Data manipulation and CSV handling
- **numpy**: Numerical calculations and statistical analysis
- **matplotlib**: Data visualization and chart generation
- **fpdf**: PDF report generation
- **requests**: Downloading product images from product URLs
- **collections**: Counting and organizing data

# Environment Setup Instructions

## Setting up the Environment with Anaconda

Follow these steps to set up your environment for running this Jupyter notebook:

### 1. Clone the Repository
```bash
git clone https://github.com/yakubuaisha318-gif/3512017_BD2_Assignment.git
cd 3512017_BD2_Assignment
```

### 2. Install Anaconda
If Anaconda is not already installed, download it from [anaconda.com](https://www.anaconda.com/products/distribution) and follow the installation instructions for your operating system.

### 3. Create a New Conda Environment
```bash
conda create -n jcpenney-analysis python=3.9
```

### 4. Activate the Environment
```bash
conda activate jcpenney-analysis
```

### 5. Install Required Dependencies
Install the necessary packages:
```bash
conda install pandas numpy matplotlib notebook
pip install fpdf
pip install requests
```
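
Before launching the notebook, it may help to confirm that the packages above are importable in the active environment. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of the project.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names of the libraries installed in step 5.
    required = ["pandas", "numpy", "matplotlib", "fpdf", "requests"]
    missing = missing_packages(required)
    print("Missing packages:", missing if missing else "none")
```

An empty result suggests the environment is ready; any listed name can be installed with the commands in this step.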

### 6. Start Jupyter Notebook
```bash
jupyter notebook
```

### 7. Open and Run This Notebook
1. In the Jupyter interface, navigate to this notebook file
2. Choose the kernel: `Kernel` → `Change kernel` → `jcpenney-analysis`
3. Run the cells: `Cell` → `Run All`

### 8. Deactivating the Environment
When you are finished:
```bash
conda deactivate
```

# Importing Libraries

In [159]:
!pip install fpdf



In [160]:
import json
import pandas as pd
import numpy as np
from typing import List, Dict, Any
import warnings
from collections import Counter
import os
import matplotlib.pyplot as plt
import tempfile
import requests
from fpdf import FPDF

warnings.filterwarnings('ignore')



# Constants used in project

In [161]:
# Constants for pdf layout
CHART_SPACING = 35
TEXT_SPACING = 5
CURRENT_YEAR = 2025
PDF_REPORT_NAME = "3512017_JCpenney_Analysis_Report.pdf"

# Data Loading Functions
- **`convert_json_file_to_json_array(json_file_path: str)`**
  - Argument: `json_file_path` (str) - Path to the JSON Lines file
  - Returns: `List[Dict[str, Any]]` - List of dictionaries representing the JSON array
  - Purpose: Converts a JSON Lines file to a JSON array for processing

- **`convert_csv_file_to_json_array(csv_file_path: str)`**
  - Argument: `csv_file_path` (str) - Path to the CSV file
  - Returns: `List[Any]` - List of dictionaries representing the JSON array
  - Purpose: Converts a CSV file to a JSON array for processing

## Converting a JSON Lines File to a JSON Array

In [162]:
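def convert_json_file_to_json_array(json_file_path: str) -> List[Dict[str, Any]]:
    '''Convert a JSON Lines file (one JSON object per line) to a list of dicts.'''
    data: List[Dict[str, Any]] = []
    try:
        with open(json_file_path, "r", encoding='utf-8') as file:
            for line in file:
                line = line.strip()
                if line:
                    try:
                        data.append(json.loads(line))
                    except json.JSONDecodeError:
                        # Skip malformed lines rather than aborting the load
                        continue
    except (OSError, UnicodeDecodeError) as error:
        # A missing or unreadable file yields whatever was parsed so far (often an empty list)
        print(f"Could not read {json_file_path}: {error}")
    return data


jcpenney_products = convert_json_file_to_json_array("jcpenney_products.json")
jcpenney_reviewers = convert_json_file_to_json_array("jcpenney_reviewers.json")

# Print a truncated, stringified sample because the data is long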
print(str(jcpenney_products)[:100] + "...")
print(str(jcpenney_reviewers)[:100] + "...")

[{'uniq_id': 'b6c0b6bea69c722939585baeac73c13d', 'sku': 'pp5006380337', 'name_title': 'Alfred Dunner...
[{'Username': 'bkpn1412', 'DOB': '31.07.1983', 'State': 'Oregon', 'Reviewed': ['cea76118f6a9110a893d...
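
To make the input format concrete: a JSON Lines file holds one JSON object per line, unlike a JSON array file, which holds a single top-level list. The sketch below parses an in-memory sample line by line and counts malformed lines instead of dropping them silently. The sample records and the helper name `parse_json_lines` are made up for illustration.

```python
import io
import json
from typing import Any, Dict, List, TextIO, Tuple


def parse_json_lines(stream: TextIO) -> Tuple[List[Dict[str, Any]], int]:
    """Parse one JSON object per line; return the records and the number of bad lines."""
    records: List[Dict[str, Any]] = []
    skipped = 0
    for line in stream:
        line = line.strip()
        if not line:
            continue  # Blank lines are ignored, not counted as errors.
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            skipped += 1
    return records, skipped


sample = io.StringIO(
    '{"Username": "user_a", "State": "Oregon"}\n'
    '\n'
    '{"Username": "user_b", "State": \n'  # truncated, so not valid JSON
    '{"Username": "user_c", "State": "Idaho"}\n'
)
records, skipped = parse_json_lines(sample)
print(len(records), "records parsed,", skipped, "line(s) skipped")
```

Reporting the skipped count makes it easier to notice when a data file is partly corrupted.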


## Converting CSV Files to JSON Arrays

In [163]:
def convert_csv_file_to_json_array(csv_file_path: str) -> List[Any]:
    '''Convert a CSV file to a list of row dictionaries.'''
    try:
        df: pd.DataFrame = pd.read_csv(csv_file_path)
        json_str = df.to_json(orient="records")
        return json.loads(json_str) if json_str else []
    except (OSError, ValueError):
        # Missing, empty, or unparseable files yield an empty list
        return []

products = convert_csv_file_to_json_array("products.csv")
reviews = convert_csv_file_to_json_array("reviews.csv")
users = convert_csv_file_to_json_array("users.csv")

# Print a truncated, stringified sample because the data is long
print(str(products)[:100] + "...")
print(str(reviews)[:100] + "...")
print(str(users)[:100] + "...")

[{'Uniq_id': 'b6c0b6bea69c722939585baeac73c13d', 'SKU': 'pp5006380337', 'Name': 'Alfred Dunner® Esse...
[{'Uniq_id': 'b6c0b6bea69c722939585baeac73c13d', 'Username': 'fsdv4141', 'Score': 2, 'Review': 'You ...
[{'Username': 'bkpn1412', 'DOB': '31.07.1983', 'State': 'Oregon'}, {'Username': 'gqjs4414', 'DOB': '...
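
For comparison, the same CSV-to-records conversion can be sketched with the standard library alone. One difference worth noting: `csv.DictReader` yields every value as a string, whereas `pd.read_csv` infers numeric types (which is why `Score` appears as the integer `2` in the output above). The sample rows and the helper name `csv_text_to_records` are invented for illustration.

```python
import csv
import io
from typing import Dict, List


def csv_text_to_records(text: str) -> List[Dict[str, str]]:
    """Convert CSV text with a header row into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


sample_csv = "Username,Score\nuser_a,2\nuser_b,5\n"
rows = csv_text_to_records(sample_csv)
print(rows)

# Values are strings here, so numeric fields need an explicit conversion.
scores = [int(row["Score"]) for row in rows]
print(scores)
```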


# Data Comparison & Validation Functions
- **`compare_dataset_fields(jcpenney_data: List[Dict[str, Any]], csv_data: List[Dict[str, Any]], dataset_name: str)`**
  - Arguments: `jcpenney_data` (List[Dict[str, Any]]) - JCPenney dataset, `csv_data` (List[Dict[str, Any]]) - CSV dataset, `dataset_name` (str) - Name for reporting
  - Returns: `Dict[str, Any]` - Dictionary with additional fields, missing fields, and field mappings
  - Purpose: Compares fields between the JCPenney and CSV datasets

- **`print_dataset_comparison(jcpenney_data: List[Dict[str, Any]], csv_data: List[Dict[str, Any]], dataset_name: str)`**
  - Arguments: `jcpenney_data` (List[Dict[str, Any]]) - JCPenney dataset, `csv_data` (List[Dict[str, Any]]) - CSV dataset, `dataset_name` (str) - Name for reporting
  - Returns: None (prints to console)
  - Purpose: Prints a comparison of fields between the datasets


## Data Quality Notes

A close examination of the data showed that `jcpenney_products.json` is a more detailed version of `products.csv`, and `jcpenney_reviewers.json` is a more detailed version of `users.csv`. Data comparison and validation functions were therefore developed to analyze the relationships between these files.

Further investigation also showed that the username `dqft3311` appears twice, but the two records have different dates of birth and states, so they are distinct reviewers rather than duplicates.
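
A check of this kind can be sketched as follows: count usernames, then compare the full records of any username that appears more than once. The records here are invented placeholders, not rows from the dataset, and both helper names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, List


def find_repeated_usernames(reviewers: List[Dict[str, Any]]) -> Dict[Any, List[Dict[str, Any]]]:
    """Group records by username and keep only usernames that occur more than once."""
    counts = Counter(r.get("Username") for r in reviewers)
    grouped: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for record in reviewers:
        username = record.get("Username")
        if counts[username] > 1:
            grouped[username].append(record)
    return dict(grouped)


def is_true_duplicate(records: List[Dict[str, Any]]) -> bool:
    """Treat records as true duplicates only if DOB and State also match."""
    identities = {(r.get("DOB"), r.get("State")) for r in records}
    return len(identities) == 1


sample = [
    {"Username": "user_x", "DOB": "01.01.1980", "State": "Ohio"},
    {"Username": "user_x", "DOB": "15.06.1992", "State": "Texas"},
    {"Username": "user_y", "DOB": "20.03.1975", "State": "Utah"},
]
repeated = find_repeated_usernames(sample)
for name, records in repeated.items():
    print(name, "true duplicate:", is_true_duplicate(records))
```

In the sample, `user_x` is repeated but not a true duplicate, mirroring the situation described for `dqft3311`.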

In [164]:
def compare_dataset_fields(jcpenney_data: List[Dict[str, Any]], csv_data: List[Dict[str, Any]], 
                          dataset_name: str) -> Dict[str, Any]:
    """Compare fields between JCPenney dataset and CSV dataset."""
    if not jcpenney_data or not csv_data:
        return {"additional_fields": [], "missing_fields": [], "field_mappings": {}}
    
    jcpenney_fields = set(jcpenney_data[0].keys()) if jcpenney_data else set()
    csv_fields = set(csv_data[0].keys()) if csv_data else set()
    
    field_mappings = {
        "SKU": "sku", "Uniq_id": "uniq_id", "Name": "name_title",
        "Price": "list_price", "Description": "description", "Av_Score": "average_product_rating"
    }
    
    reverse_mappings = {v: k for k, v in field_mappings.items()}
    additional_fields = []
    missing_fields = []
    
    for field in jcpenney_fields:
        if field not in field_mappings.values() or reverse_mappings.get(field) not in csv_fields:
            if field not in [reverse_mappings.get(f, f) for f in csv_fields]:
                additional_fields.append(field)
    
    for field in csv_fields:
        if field not in field_mappings or field_mappings[field] not in jcpenney_fields:
            if field not in [field_mappings.get(f, f) for f in jcpenney_fields]:
                missing_fields.append(field)
    
    actual_mappings = {}
    for csv_field, jcp_field in field_mappings.items():
        if csv_field in csv_fields and jcp_field in jcpenney_fields:
            actual_mappings[csv_field] = jcp_field
    
    return {
        "additional_fields": sorted(additional_fields),
        "missing_fields": sorted(missing_fields),
        "field_mappings": actual_mappings
    }

def print_dataset_comparison(jcpenney_data: List[Dict[str, Any]], csv_data: List[Dict[str, Any]], 
                           dataset_name: str) -> None:
    """Print a comparison of fields between datasets."""
    comparison = compare_dataset_fields(jcpenney_data, csv_data, dataset_name)
    
    print(f"\n{dataset_name} Dataset Comparison:")
    print(f"  JCPenney data contains all fields from CSV (with different names):")
    for csv_field, jcp_field in comparison['field_mappings'].items():
        print(f"    - {csv_field} -> {jcp_field}")
    
    if comparison['additional_fields']:
        print(f"  Truly additional fields in JCPenney data: {len(comparison['additional_fields'])}")
        for field in comparison['additional_fields']:
            print(f"    - {field}")

# Compare datasets and identify additional fields
print_dataset_comparison(jcpenney_reviewers, users, "Reviewers")
print_dataset_comparison(jcpenney_products, products, "Products")

jcpenney_reviewers_usernames = [reviewer["Username"] for reviewer in jcpenney_reviewers]
users_usernames = [user["Username"] for user in users]

jcpenney_products_uniq_ids = [product["uniq_id"] for product in jcpenney_products]
products_uniq_ids = [product["Uniq_id"] for product in products]

print("len of reviewers (json file) ", len(jcpenney_reviewers))
print("len of users (csv file) ", len(users))
# compare length and unique elements of reviewers and users
if (set(jcpenney_reviewers_usernames).issubset(set(users_usernames)) and 
    len(jcpenney_reviewers_usernames) == len(users_usernames)):
    selected_reviewers = jcpenney_reviewers
    print("Using jcpenney_reviewers for processing (contains additional fields)")
else:
    selected_reviewers = users  # Fallback to simpler dataset
print("len of jcpenny products (json file) ", len(jcpenney_products))
print("len of products (csv file) ", len(products))
# compare length and unique elements of products and jcpenney_products
if (set(jcpenney_products_uniq_ids).issubset(set(products_uniq_ids)) and 
    len(jcpenney_products_uniq_ids) == len(products_uniq_ids)):
    selected_products = jcpenney_products  # Use the richer dataset
    print("Using jcpenney_products for processing (contains additional fields)")
else:
    selected_products = products  # Fallback to simpler dataset


Reviewers Dataset Comparison:
  JCPenney data contains all fields from CSV (with different names):
  Truly additional fields in JCPenney data: 1
    - Reviewed

Products Dataset Comparison:
  JCPenney data contains all fields from CSV (with different names):
    - SKU -> sku
    - Uniq_id -> uniq_id
    - Name -> name_title
    - Price -> list_price
    - Description -> description
    - Av_Score -> average_product_rating
  Truly additional fields in JCPenney data: 9
    - Bought With
    - Reviews
    - brand
    - category
    - category_tree
    - product_image_urls
    - product_url
    - sale_price
    - total_number_reviews
len of reviewers (json file)  5000
len of users (csv file)  5000
Using jcpenney_reviewers for processing (contains additional fields)
len of jcpenny products (json file)  7982
len of products (csv file)  7982
Using jcpenney_products for processing (contains additional fields)
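
The two selection branches above follow the same rule: use the richer JSON dataset only when it lines up with the CSV dataset. A sketch of that rule as one reusable helper is shown below; the name `select_dataset` and the sample rows are hypothetical. Note that this sketch compares the key sets for equality, which is slightly stricter than the subset-plus-length test used above.

```python
from typing import Any, Dict, List


def select_dataset(rich: List[Dict[str, Any]], simple: List[Dict[str, Any]],
                   rich_key: str, simple_key: str) -> List[Dict[str, Any]]:
    """Return `rich` if it has the same keys and row count as `simple`, else `simple`."""
    rich_ids = [row.get(rich_key) for row in rich]
    simple_ids = [row.get(simple_key) for row in simple]
    same_keys = set(rich_ids) == set(simple_ids)
    same_size = len(rich_ids) == len(simple_ids)
    return rich if (same_keys and same_size) else simple


json_rows = [{"uniq_id": "a1", "brand": "X"}, {"uniq_id": "b2", "brand": "Y"}]
csv_rows = [{"Uniq_id": "a1"}, {"Uniq_id": "b2"}]
chosen = select_dataset(json_rows, csv_rows, "uniq_id", "Uniq_id")
print("Using richer dataset:", chosen is json_rows)
```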


# Data Processing Functions
- **`extract_values(data: List[Dict[str, Any]], field_name: str)`**
  - Arguments: `data` (List[Dict[str, Any]]) - List of dictionaries containing the data, `field_name` (str) - Name of the field to extract values from
  - Returns: `List[float]` - List of extracted numeric values
  - Purpose: Extracts numeric values from a specific field in a list of dictionaries

- **`analyze_numeric_data(products_data: List[Dict[str, Any]], field_name: str)`**
  - Arguments: `products_data` (List[Dict[str, Any]]) - List of dictionaries containing products data, `field_name` (str) - Name of the field to analyze
  - Returns: `Dict[str, Any]` - Dictionary containing count, mean, median, std, min, and max values
  - Purpose: Computes summary statistics for a numeric field across all products

- **`get_products(products_data: List[Dict[str, Any]], sort_field: str, top_n: int = 5, highest: bool = True)`**
  - Arguments: `products_data` (List[Dict[str, Any]]) - List of dictionaries containing products data, `sort_field` (str) - Field to sort by, `top_n` (int) - Number of products to return, `highest` (bool) - Whether to get highest or lowest values
  - Returns: `List[Dict[str, Any]]` - List of top/bottom products
  - Purpose: Retrieves the top or bottom products based on the specified field

- **`get_user_demographics(reviewers_data: List[Dict[str, Any]])`**
  - Argument: `reviewers_data` (List[Dict[str, Any]]) - List of dictionaries containing reviewers data
  - Returns: `Dict[str, Any]` - Dictionary with total reviewers, top states, state distribution, and age statistics
  - Purpose: Analyzes user demographics from the reviewers data


## Extracting Values

In [165]:
def extract_values(data: List[Dict[str, Any]], field_name: str) -> List[float]:
    """Extract numeric values from a field."""
    values = []
    for item in data:
        value_str = item.get(field_name)
        if value_str is not None:
            try:
                values.append(float(value_str))
            except (ValueError, TypeError):
                pass
    return values
print('first four float extracted from a scored field in reviews using extract value function: ',extract_values(reviews, "Score")[:4])

first four float extracted from a scored field in reviews using extract value function:  [2.0, 1.0, 2.0, 0.0]


## By numeric data

In [166]:
def analyze_numeric_data(products_data: List[Dict[str, Any]], field_name: str) -> Dict[str, Any]:
    """Analyze numeric data across all products."""
    values = extract_values(products_data, field_name)
    if not values:
        return {}
    return {
        "count": len(values),
        "mean": np.mean(values),
        "median": np.median(values),
        "std": np.std(values),
        "min": np.min(values),
        "max": np.max(values)
    }

print(analyze_numeric_data(jcpenney_products, "average_product_rating"))

{'count': 7982, 'mean': 2.9886828222275152, 'median': 3.0, 'std': 0.9116162637616239, 'min': 1.0, 'max': 5.0}
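
One detail about the `std` value: `np.std` uses `ddof=0` by default, so it reports the population standard deviation. If the products are treated as a sample, the sample standard deviation (`ddof=1`) divides by n - 1 instead. The standard-library sketch below shows the difference on a small made-up list of ratings; with 7,982 values the two results would be nearly identical.

```python
import statistics

ratings = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up values for illustration

population_std = statistics.pstdev(ratings)  # divides by n, like np.std(ratings)
sample_std = statistics.stdev(ratings)       # divides by n - 1, like np.std(ratings, ddof=1)

print(f"population std: {population_std:.4f}")
print(f"sample std:     {sample_std:.4f}")
```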


## By product

In [167]:
def get_products(products_data: List[Dict[str, Any]], sort_field: str, top_n: int = 5, highest: bool = True) -> List[Dict[str, Any]]:
    """Get top or bottom products based on a specified field."""
    valid_products = [p for p in products_data if p.get(sort_field) is not None]
    sorted_products = sorted(valid_products, key=lambda x: float(x[sort_field]), reverse=highest)
    return sorted_products[:top_n]
    
top_rated_products = get_products(jcpenney_products, "average_product_rating", 5, True)
print("Top 5 5.0 rated products: ", [product.get('name_title') for product in top_rated_products])

Top 5 5.0 rated products:  ['Danny & Nicole® Sleeveless Printed Fit-and-Flare Dress - Plus', 'Danny & Nicole® Sleeveless Printed Fit-and-Flare Dress - Plus', 'Danny & Nicole® Sleeveless Striped Colorblock Fit-and-Flare Dress', 'Azul by Maxine of Hollywood Tankini Swim Top or Skirted Bottoms', 'Azzure 2-Pack Decorative Pillows']
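
The output above lists the same product name twice, presumably because separate catalogue entries share a title. If one entry per name is wanted, a sketch along these lines could de-duplicate after sorting. `get_unique_products` is a hypothetical variant, not part of the notebook, and the sample rows are invented.

```python
from typing import Any, Dict, List


def get_unique_products(products: List[Dict[str, Any]], sort_field: str,
                        name_field: str = "name_title", top_n: int = 5,
                        highest: bool = True) -> List[Dict[str, Any]]:
    """Rank products by `sort_field`, keeping only the first entry for each name."""
    valid = [p for p in products if p.get(sort_field) is not None]
    ranked = sorted(valid, key=lambda p: float(p[sort_field]), reverse=highest)
    seen = set()
    unique: List[Dict[str, Any]] = []
    for product in ranked:
        if len(unique) >= top_n:
            break
        name = product.get(name_field)
        if name in seen:
            continue  # A product with this name is already in the result.
        seen.add(name)
        unique.append(product)
    return unique


sample = [
    {"name_title": "Dress A", "average_product_rating": 5.0},
    {"name_title": "Dress A", "average_product_rating": 5.0},
    {"name_title": "Pillow B", "average_product_rating": 4.5},
    {"name_title": "Top C", "average_product_rating": None},
]
top_names = [p["name_title"] for p in get_unique_products(sample, "average_product_rating", top_n=2)]
print(top_names)
```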


## By User demographics

In [168]:
def get_user_demographics(reviewers_data: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Analyze user demographics from reviewers data."""
    states = [reviewer.get("State") for reviewer in reviewers_data if reviewer.get("State")]
    dobs = [reviewer.get("DOB") for reviewer in reviewers_data if reviewer.get("DOB")]
    
    birth_years = []
    for dob in dobs:
        if dob:
            try:
                year = int(dob.split(".")[-1])
                birth_years.append(year)
            except (ValueError, AttributeError):
                pass
    
    ages = [CURRENT_YEAR - year for year in birth_years if 1900 <= year <= CURRENT_YEAR]
    
    return {
        "total_reviewers": len(reviewers_data),
        "state_distribution": dict(Counter(states)),
        "top_states": Counter(states).most_common(10),
        "age_statistics": {
            "count": len(ages),
            "mean_age": np.mean(ages) if ages else 0,
            "median_age": np.median(ages) if ages else 0
        } if ages else {}
    }

print(str(get_user_demographics(users))[:100] + "...")

{'total_reviewers': 5000, 'state_distribution': {'Oregon': 96, 'Massachusetts': 107, 'Idaho': 79, 'F...
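
The function above derives age from the birth year alone and relies on the hard-coded `CURRENT_YEAR`. Where day-level accuracy matters, a sketch like the following parses the full `DD.MM.YYYY` string (the format seen in the sample output, e.g. `31.07.1983`) and subtracts one year if the birthday has not yet occurred. The helper name `age_on` is an assumption for illustration.

```python
from datetime import date, datetime
from typing import Optional


def age_on(dob: str, today: Optional[date] = None) -> Optional[int]:
    """Return the age in whole years for a DD.MM.YYYY date string, or None if invalid."""
    today = today or date.today()
    try:
        born = datetime.strptime(dob, "%d.%m.%Y").date()
    except (ValueError, TypeError):
        return None
    if born > today:
        return None  # A birth date in the future is treated as invalid.
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


reference_day = date(2025, 1, 1)
print(age_on("31.07.1983", reference_day))  # 41
print(age_on("not a date", reference_day))  # None
```

Passing an explicit reference date keeps the result reproducible; omitting it uses the current date.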


# Visualization Functions
- **`create_chart(chart_type: str, data_x, data_y, title: str, x_label: str, y_label: str, color: str, figsize=(10, 3))`**
  - Arguments: `chart_type` (str) - Type of chart (bar, histogram, line, comparison), `data_x`, `data_y` - Chart data, `title`, `x_label`, `y_label` (str) - Labels, `color` (str) - Chart color, `figsize` (tuple) - Figure size
  - Returns: `str` - Path to the temporary PNG file containing the chart
  - Purpose: Creates various types of charts for visualization

In [169]:
def create_chart(chart_type: str, data_x, data_y, title: str, x_label: str, y_label: str, color: str, figsize=(10, 3)):
    """Create various types of charts."""
    plt.figure(figsize=figsize)
    
    if chart_type == "bar":
        bars = plt.bar(data_x, data_y, color=color)
        for bar in bars:
            height = bar.get_height()
            if height > 0:
                plt.text(bar.get_x() + bar.get_width()/2., height,
                        f'{int(height)}' if isinstance(height, (int, float)) else f'{height:.2f}', 
                        ha='center', va='bottom')
    elif chart_type == "histogram":
        plt.hist(data_x, bins=10, color=color, alpha=0.7)
        plt.grid(True, alpha=0.3)
    elif chart_type == "line":
        counts, bin_edges = np.histogram(data_x, bins=10)
        bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
        plt.plot(bin_centers, counts, marker='o', color=color, linewidth=2, markersize=6)
        plt.fill_between(bin_centers, counts, alpha=0.3, color=color)
        plt.grid(True, alpha=0.3)
    elif chart_type == "comparison":
        bars = plt.bar(data_x, data_y, color=['#1f77b4', '#ff7f0e'])
        plt.ylim(0, 5)
        for bar in bars:
            height = bar.get_height()
            plt.text(bar.get_x() + bar.get_width()/2., height,
                    f'{height:.2f}', ha='center', va='bottom')
    
    plt.title(title)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    
    temp_file = tempfile.NamedTemporaryFile(suffix='.png', delete=False)
    temp_file.close()
    plt.savefig(temp_file.name, format='png', bbox_inches='tight', dpi=150)
    plt.close()
    
    return temp_file.name
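
Because `create_chart` creates its file with `delete=False`, the caller is responsible for removing it. One way to make that cleanup harder to forget is a small context manager. The sketch below uses only the standard library, and `temporary_path` is a hypothetical helper rather than part of the notebook.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_path(suffix: str = ".png") -> Iterator[str]:
    """Yield a path to a closed temporary file and delete it on exit."""
    handle = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    handle.close()  # Close first so other code can reopen the path on any OS.
    try:
        yield handle.name
    finally:
        try:
            os.unlink(handle.name)
        except OSError:
            pass  # The file may already have been removed.


with temporary_path() as path:
    with open(path, "wb") as f:
        f.write(b"placeholder bytes standing in for chart data")
    existed_inside = os.path.exists(path)
    print("exists inside block:", existed_inside)
print("exists after block:", os.path.exists(path))
```

The `finally` clause runs even if the body raises, so the file is removed on both the success and error paths.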

# PDF Generation Functions
- **`add_product_images(pdf: FPDF, products: List[Dict[str, Any]], start_y: float)`**
  - Argument: `pdf` (FPDF) - PDF object, `products` (List[Dict[str, Any]]) - Products data, `start_y` (float) - Starting Y position
  - Returns: `List[str]` - List of temporary file paths
  - Purpose: Adds product images to the PDF report
- **`create_visualization_pdf_report(pdf_filename: str)`**
  - Argument: `pdf_filename` (str) - Name of the PDF file to create
  - Returns: `bool` - True if successful, False otherwise
  - Purpose: Creates a comprehensive PDF report with charts and product images

## Adding Product Images to the PDF

In [170]:
def add_product_images(pdf: FPDF, products: List[Dict[str, Any]], start_y: float) -> List[str]:
    """Add product images to the PDF."""
    temp_files = []
    x_position = 10
    image_width = 30
    image_height = 30
    spacing = 5
    
    for i, product in enumerate(products[:5]):
        image_url = product.get("product_image_urls")
        if image_url:
            try:
                headers = {
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                }
                response = requests.get(image_url, headers=headers, timeout=10)
                if response.status_code == 200:
                    temp_file = tempfile.NamedTemporaryFile(suffix='.jpg', delete=False)
                    temp_files.append(temp_file.name)
                    temp_file.close()
                    
                    with open(temp_file.name, 'wb') as f:
                        f.write(response.content)
                    
                    current_x = x_position + (i * (image_width + spacing))
                    pdf.image(temp_file.name, x=current_x, y=start_y, w=image_width, h=image_height)
                    
                    pdf.set_font("Arial", size=8)
                    product_name = product.get("name_title", "")[:15]
                    pdf.set_xy(current_x, start_y + image_height + 2)
                    pdf.cell(image_width, 5, product_name, align="C")
            except Exception:
                current_x = x_position + (i * (image_width + spacing))
                pdf.set_font("Arial", size=8)
                pdf.set_xy(current_x, start_y)
                pdf.cell(image_width, image_height, "Image N/A", align="C")
    
    return temp_files

## Visualization of Data

In [171]:
def create_visualization_pdf_report(pdf_filename: str) -> bool:
    """Create a comprehensive PDF report with charts and product images."""
    print("Generating pdf report...")
    try:        
        class PDF(FPDF):
            def header(self):
                self.set_font('Arial', 'B', 15)
                self.cell(0, 10, 'JCPenney Data Analysis Report', 0, 1, 'C')
                self.ln(5)
            
            def footer(self):
                self.set_y(-15)
                self.set_font('Arial', 'I', 8)
                self.cell(0, 10, f'Page {self.page_no()}', 0, 0, 'C')
        
        pdf = PDF()
        pdf.set_auto_page_break(auto=True, margin=15)
        
        jcpenney_reviewers = convert_json_file_to_json_array("jcpenney_reviewers.json")
        jcpenney_products = convert_json_file_to_json_array("jcpenney_products.json")
        reviews = convert_csv_file_to_json_array("reviews.csv")
        
        product_ratings = analyze_numeric_data(jcpenney_products, "average_product_rating")
        avg_product_rating = round(product_ratings.get("mean", 0), 2) if product_ratings else 0
        
        review_scores = analyze_numeric_data(reviews, "Score")
        avg_review_score = round(review_scores.get("mean", 0), 2) if review_scores else 0
        
        user_demographics = get_user_demographics(jcpenney_reviewers)
        age_stats = user_demographics.get("age_statistics", {})
        avg_age = round(age_stats.get("mean_age", 0), 1) if age_stats else 0

        temp_files = []
        
        # Page 1: Title and Key Metrics
        pdf.add_page()
        pdf.set_font("Arial", "B", 24)
        pdf.cell(0, 20, "JCPenney Data Analysis Report", ln=True, align="C")
        pdf.ln(5)
        pdf.cell(0, 16, "Student ID: 3512017", ln=True, align="C")
        pdf.ln(5)
        
        pdf.set_font("Arial", size=12)
        pdf.cell(0, 10, "Comprehensive Analysis of Products and Customer Reviews", ln=True, align="C")
        pdf.ln(20)
        
        temp_file_path = create_chart("comparison", ['Product Ratings', 'Review Scores'], [avg_product_rating, avg_review_score], 
                                    'Average Ratings Comparison', '', 'Average Score', '', (10, 6))
        temp_files.append(temp_file_path)
        pdf.image(temp_file_path, x=10, y=80, w=190)
        pdf.ln(120)
        
        pdf.set_font("Arial", "B", 14)
        pdf.cell(0, 10, "Key Metrics", ln=True)
        pdf.ln(5)
        
        metrics = [
            ("Total Products", f"{len(jcpenney_products):,}"),
            ("Total Reviews", f"{len(reviews):,}"),
            ("Average Product Rating", str(avg_product_rating)),
            ("Average Review Score", str(avg_review_score)),
            ("Total Reviewers", f"{len(jcpenney_reviewers):,}"),
        ]
        
        pdf.set_font("Arial", size=10)
        for metric, value in metrics:
            pdf.cell(0, 8, f"{metric}: {value}", ln=True)
        
        # Page 2: Review Score Distribution and User Demographics
        pdf.add_page()
        pdf.set_font("Arial", "B", 18)
        pdf.cell(0, 10, "Review Score Distribution", ln=True)
        pdf.ln(10)
        
        score_distribution = Counter(extract_values(reviews, "Score"))
        scores = sorted(score_distribution.keys())
        counts = [score_distribution[score] for score in scores]
        
        temp_file_path = create_chart("bar", scores, counts, 'Distribution of Review Scores', 
                                    'Review Score', 'Number of Reviews', '#2ca02c', (10, 3))
        temp_files.append(temp_file_path)
        pdf.image(temp_file_path, x=10, y=40, w=190)
        pdf.ln(75)
        
        # User Demographics
        pdf.set_font("Arial", "B", 16)
        pdf.cell(0, 10, "User Demographics", ln=True)
        pdf.ln(5)
        
        ages = []
        for reviewer in jcpenney_reviewers:
            dob = reviewer.get("DOB")
            if dob:
                try:
                    year = int(dob.split(".")[-1])
                    if 1900 <= year <= CURRENT_YEAR:
                        ages.append(CURRENT_YEAR - year)
                except (ValueError, AttributeError):
                    pass
        
        if ages:
            youngest_age = min(ages)
            oldest_age = max(ages)
            
            pdf.set_font("Arial", size=10)
            pdf.cell(0, 8, f"Youngest Customer: {youngest_age} years old", ln=True)
            pdf.cell(0, 8, f"Oldest Customer: {oldest_age} years old", ln=True)
            pdf.cell(0, 8, f"Average Age: {avg_age} years old", ln=True)
            pdf.ln(TEXT_SPACING)
        
        top_states = user_demographics.get("top_states", [])
        if top_states:
            pdf.set_font("Arial", "B", 14)
            pdf.cell(0, 10, "Top States by Reviewer Count:", ln=True)
            pdf.ln(5)
            
            pdf.set_font("Arial", size=10)
            for i, (state, count) in enumerate(top_states[:5]):
                pdf.cell(0, 8, f"{i+1}. {state}: {count} reviewers", ln=True)
            pdf.ln(TEXT_SPACING)
        
        if age_stats and age_stats.get("count", 0) > 0 and ages:
            temp_file_path = create_chart("line", ages, None, 'Age Distribution of Reviewers', 
                                        'Age', 'Number of Reviewers', '#9467bd', (10, 3))
            temp_files.append(temp_file_path)
            pdf.image(temp_file_path, x=10, y=pdf.get_y(), w=190)
            pdf.ln(60)
        
        # Page 3: Top Rated and Low Rated Products (Combined Page)
        pdf.add_page()
        pdf.set_font("Arial", "B", 18)
        pdf.cell(0, 10, "Top 5 Rated Products (Rating 5.0)", ln=True)
        pdf.ln(5)
        
        top_rated_products = get_products(jcpenney_products, "average_product_rating", 5, True)
        temp_files.extend(add_product_images(pdf, top_rated_products, pdf.get_y()))
        pdf.ln(10)
        
        pdf.set_font("Arial", size=10)
        for i, product in enumerate(top_rated_products):
            name = product.get("name_title", "N/A")[:60]
            brand = product.get("brand", "N/A")
            rating = product.get("average_product_rating", "N/A")
            pdf.cell(0, 6, f"{i+1}. {name} | Brand: {brand} | Rating: {rating}", ln=True)
            pdf.ln(3)
        
        # Add a line separator
        pdf.set_draw_color(200, 200, 200)
        pdf.line(10, pdf.get_y(), 200, pdf.get_y())
        pdf.ln(5)
        
        pdf.set_font("Arial", "B", 18)
        pdf.cell(0, 10, "Bottom 5 Rated Products (Rating 1.0)", ln=True)
        pdf.ln(5)
        
        low_rated_products = get_products(jcpenney_products, "average_product_rating", 5, False)
        temp_files.extend(add_product_images(pdf, low_rated_products, pdf.get_y()))
        pdf.ln(10)
        
        pdf.set_font("Arial", size=10)
        for i, product in enumerate(low_rated_products):
            name = product.get("name_title", "N/A")[:60]
            brand = product.get("brand", "N/A")
            rating = product.get("average_product_rating", "N/A")
            pdf.cell(0, 6, f"{i+1}. {name} | Brand: {brand} | Rating: {rating}", ln=True)
            pdf.ln(3)
        
        # Page 4: Detailed Product Analysis
        pdf.add_page()
        pdf.set_font("Arial", "B", 18)
        pdf.cell(0, 10, "Detailed Product Analysis", ln=True)
        pdf.ln(10)
        
        # Price Analysis
        pdf.set_font("Arial", "B", 14)
        pdf.cell(0, 10, "Price Distribution Analysis", ln=True)
        pdf.ln(5)
        
        # Extract price data
        prices = []
        for product in jcpenney_products:
            try:
                price_str = product.get("list_price", "0")
                if price_str and price_str != "0":
                    # Remove dollar sign and convert to float
                    price = float(price_str.replace("$", "").replace(",", ""))
                    prices.append(price)
            except (ValueError, AttributeError):
                pass
        
        if prices:
            pdf.set_font("Arial", size=10)
            pdf.cell(0, 8, f"Total Products with Price Data: {len(prices)}", ln=True)
            pdf.cell(0, 8, f"Average Price: ${np.mean(prices):.2f}", ln=True)
            pdf.cell(0, 8, f"Price Range: ${np.min(prices):.2f} - ${np.max(prices):.2f}", ln=True)
            pdf.ln(5)
            
            # Create price distribution line chart
            temp_file_path = create_chart("line", prices, None, 'Product Price Distribution', 
                                        'Price ($)', 'Number of Products', '#1f77b4', (10, 4))
            temp_files.append(temp_file_path)
            pdf.image(temp_file_path, x=10, y=pdf.get_y(), w=190)
            pdf.ln(80)
        
        # Brand Analysis
        pdf.set_font("Arial", "B", 14)
        pdf.cell(0, 10, "Brand Analysis", ln=True)
        pdf.ln(5)
        
        brands = [product.get("brand", "Unknown") for product in jcpenney_products]
        brand_counts = Counter(brands)
        top_brands = brand_counts.most_common(5)
        
        if top_brands:
            pdf.set_font("Arial", size=10)
            pdf.cell(0, 8, "Top 5 Brands by Product Count:", ln=True)
            pdf.ln(3)
            
            for i, (brand, count) in enumerate(top_brands):
                pdf.cell(0, 7, f"{i+1}. {brand}: {count} products", ln=True)
            pdf.ln(10)
        
        # Page 5: Summary & Recommendations
        pdf.add_page()
        pdf.set_font("Arial", "B", 18)
        pdf.cell(0, 10, "Summary & Recommendations", ln=True)
        pdf.ln(10)
        
        pdf.set_font("Arial", size=10)
        summary_points = [
            f"- The analysis covered {len(jcpenney_products)} products and {len(reviews)} reviews",
            f"- The average product rating is {avg_product_rating}, indicating {'good' if avg_product_rating >= 3 else 'moderate'} customer satisfaction",
            f"- The average review score is {avg_review_score}, showing {'positive' if avg_review_score >= 3 else 'mixed'} sentiment",
            f"- The customer base spans {len(user_demographics.get('state_distribution', {}))} states",
            f"- Reviewer ages range from {min(ages) if ages else 'N/A'} to {max(ages) if ages else 'N/A'} years"
        ]
        
        for point in summary_points:
            pdf.cell(0, 8, point, ln=True)
        
        pdf.ln(10)
        
        pdf.set_font("Arial", "B", 14)
        pdf.cell(0, 10, "Recommendations:", ln=True)
        pdf.ln(5)
        
        recommendations = [
            "1. Low-rated products (1.0) should be improved to raise customer satisfaction",
            "2. High-rated products should be featured in marketing campaigns to attract new customers",
            "3. Marketing strategies should be tailored by age, given the broad demographic distribution",
            "4. Review patterns should be analysed to identify recurring positive and negative feedback",
            "5. Product lines in top-performing categories and brands can be expanded",
            "6. Pricing strategies should be reviewed to remain competitive in key market segments"
        ]
        
        pdf.set_font("Arial", size=10)
        for recommendation in recommendations:
            pdf.cell(0, 7, recommendation, ln=True)
        
        # Clean up temporary files
        for temp_file in temp_files:
            try:
                os.unlink(temp_file)
            except OSError:
                pass
        
        pdf.output(pdf_filename)
        return True
    except Exception as e:
        print(f"Error creating visualization PDF report: {e}")
 
        # Due to the length of this function, most of the code has been omitted from this report;
        # the full version is available in my GitHub repository
        return False


create_visualization_pdf_report(PDF_REPORT_NAME)

Generating pdf report...


True
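
The price-handling step inside the report function strips `$` and `,` before converting to `float`. Pulled out as a standalone helper, it could look like the sketch below. `parse_price` is a hypothetical name and the sample strings are invented; the actual `list_price` values in the dataset may be formatted differently.

```python
from typing import Any, Optional


def parse_price(raw: Any) -> Optional[float]:
    """Convert a price such as '$1,299.50' to a float; return None if not parseable."""
    if raw is None:
        return None
    text = str(raw).replace("$", "").replace(",", "").strip()
    if not text:
        return None
    try:
        return float(text)
    except ValueError:
        return None


samples = ["$41.09", "1,299.50", "", None, "n/a", 25]
prices = [p for p in (parse_price(s) for s in samples) if p is not None]
print(prices)
```

Returning `None` for unparseable input keeps the decision about how to treat missing prices with the caller, instead of hiding it inside a bare `except`.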



## Report Generation
The analysis produces a detailed PDF report with the following structure:
1. **Page 1**: Title and Key Metrics
2. **Page 2**: Review Score Distribution and User Demographics
3. **Page 3**: Top Rated and Low Rated Products (combined page)
4. **Page 4**: Detailed Product Analysis
5. **Page 5**: Summary & Recommendations

## References
JCPenney. (n.d.). *Product and customer review data*. Retrieved from project dataset files.

Requests Documentation. (n.d.). *Requests: HTTP for Humans*. Retrieved from https://requests.readthedocs.io/en/latest/

FPDF Documentation. (n.d.). *PyFPDF Documentation*. Retrieved from https://pyfpdf.readthedocs.io/en/latest/

NumPy Documentation. (n.d.). *NumPy Documentation*. Retrieved from https://numpy.org/doc/

Pandas Documentation. (n.d.). *Pandas Documentation*. Retrieved from https://pandas.pydata.org/docs/getting_started/index.html#getting-started

Matplotlib Documentation. (n.d.). *Matplotlib Documentation*. Retrieved from https://matplotlib.org/stable/index.html

Python Software Foundation. (n.d.). *Python 3 Documentation*. Retrieved from https://docs.python.org/3/

Data Science Process Alliance. (n.d.). CRISP-DM 2.0. Retrieved from https://www.datascience-pm.com/crisp-dm-2/

The analysis provides a strong foundation for understanding customer behavior, product performance, and market positioning in support of strategic business decisions.


[My GitHub Repo](https://github.com/yakubuaisha318-gif/3512017_BD2_Assignment)