# University of Stirling

# ITNPBD2 Representing and Manipulating Data

# Assignment Autumn 2025

# A Consultancy Job for JC Penney

This notebook serves as both the assignment instructions and the submission document for ITNPBD2. Read the instructions carefully and enter code into the cells as indicated.

You will need these five files, which were in the Zip file you downloaded from the course webpage:

- jcpenney_reviewers.json
- jcpenney_products.json
- products.csv
- reviews.csv
- users.csv

The data in these files describes products that have been sold by the American retail giant, JC Penney, and reviews by customers who bought them. Note that the product data is real, but the customer data is synthetic.

Your job is to process the data, as requested in the instructions in the markdown cells in this notebook.

# Completing the Assignment

Rename this file to be xxxxxx_BD2 where xxxxxx is your student number, then type your code and narrative description into the boxes provided. Add as many code and markdown cells as you need. The cells should contain:

- **Text narrative describing what you did with the data**
- **The code that performs the task you have described**
- **Comments that explain your code**

The final structure (in PDF) of your report must:
- **Start from the main insights observed (max 5 pages)**
- **Include as an appendix the source code used for producing those insights (max 15 pages)**
- **Include an AI cover sheet (provided on Canvas), which must contain a link to a versioned notebook file in OneDrive or another platform for version checks.**

# Marking Scheme
The assessment will be marked against the University's Common Marking Scheme (CMS).

Here is a summary of what you need to achieve to gain a grade in the major grade bands:

|Grade|Requirement|
|:---|:---|
| Fail | You will fail if your code does not run or does not achieve even the basics of the task. You may also fail if you submit code without either comments or a text explanation of what the code does.|
| Pass | To pass, you must submit sufficient working code to show that you have mastered the basics of the task, even if not everything works completely. You must include some justifications for your choice of methods, but without mentioning alternatives. |
| Merit | For a merit, your code must be mostly correct, with only small problems or parts missing, and your comments must be useful rather than simply re-stating the code in English. Most choices for methods and structures should be explained and alternatives mentioned. |
| Distinction | For a distinction, your code must be working, correct, and well commented, and must show an appreciation of style, efficiency and reliability. All choices for methods and structures are concisely justified and alternatives are given well-thought-out consideration. For a distinction, your work should be good enough to present to executives at the company.|

The full details of the CMS can be found here:

https://www.stir.ac.uk/about/professional-services/student-academic-and-corporate-services/academic-registry/academic-policy-and-practice/quality-handbook/assessment-policy-and-procedure/appendix-2-postgraduate-common-marking-scheme/

Note that this means there are not certain numbers of marks allocated to each stage of the assignment. Your grade will reflect how well your solutions and comments demonstrate that you have achieved the learning outcomes of the task. 

## Submission
When you are ready to submit, **print** your notebook as a PDF (go to File -> Print Preview in the Jupyter menu). Make sure you have run all the cells and that their output is displayed. Any lines of code or comments that are not fully visible in the PDF should be broken across several lines. You can then submit the file online.

Late penalties will apply at a rate of three marks per day, up to a maximum of 7 days. After 7 days you will be given a mark of 0. Extensions will be considered under acceptable circumstances outside your control.

## Academic Integrity

This is an individual assignment, and so all submitted work must be fully your own work.

The University of Stirling is committed to protecting the quality and standards of its awards. Consequently, the University seeks to promote and nurture academic integrity, support staff academic integrity, and support students to understand and develop good academic skills that facilitate academic integrity.

In addition, the University deals decisively with all forms of Academic Misconduct.

Where a student does not act with academic integrity, their work or behaviour may demonstrate Poor Academic Practice or it may represent Academic Misconduct.

### Poor Academic Practice

Poor Academic Practice is defined as: "The submission of any type of assessment with a lack of referencing or inadequate referencing which does not effectively acknowledge the origin of words, ideas, images, tables, diagrams, maps, code, sound and any other sources used in the assessment."

### Academic Misconduct

Academic Misconduct is defined as: "any act or attempted act that does not demonstrate academic integrity and that may result in creating an unfair academic advantage for you or another person, or an academic disadvantage for any other member or members of the academic community."

Plagiarism is presenting somebody else’s work as your own **and includes the use of artificial intelligence tools beyond AIAS Level 2 or the use of Large Language Models.** Plagiarism is a form of academic misconduct and is taken very seriously by the University. Students found to have plagiarised work can have marks deducted and, in serious cases, even be expelled from the University. Do not submit any work that is not entirely your own. Do not collaborate with or get help from anybody else with this assignment.

The University of Stirling's full policy on Academic Integrity can be found at:

https://www.stir.ac.uk/about/professional-services/student-academic-and-corporate-services/academic-registry/academic-policy-and-practice/quality-handbook/academic-integrity-policy-and-academic-misconduct-procedure/

## The Assignment
Your task with this assignment is to use the data provided to demonstrate your Python data manipulation skills.

There are three `.csv` files and two `.json` files so you can process different types of data. The files also contain unstructured data in the form of natural language in English and links to images that you can access from the JC Penney website (use the field called `product_image_urls`).
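
As a starting point, both file types can be read with the standard library alone. The sketch below assumes the `.json` files hold one JSON object per line (the JSON Lines layout) and uses small in-memory samples with made-up values in place of the real files. The helper names `load_json_lines` and `load_csv_rows` are hypothetical.

```python
import csv
import io
import json


def load_json_lines(handle):
    """Parse one JSON object per non-empty line (JSON Lines layout, assumed)."""
    return [json.loads(line) for line in handle if line.strip()]


def load_csv_rows(handle):
    """Read CSV rows into dictionaries keyed by the header row."""
    return list(csv.DictReader(handle))


# Made-up sample data standing in for the real files.
sample_json = io.StringIO(
    '{"uniq_id": "a1", "product_image_urls": "http://example.com/a.jpg"}\n'
    '\n'
    '{"uniq_id": "b2", "product_image_urls": "http://example.com/b.jpg"}\n'
)
sample_csv = io.StringIO("Username,Score\nuser1,4\nuser2,2\n")

products = load_json_lines(sample_json)
reviews = load_csv_rows(sample_csv)

print(len(products), products[0]["uniq_id"])
print(reviews[1]["Username"], int(reviews[1]["Score"]))
```

With the real files, the same helpers would take a handle such as `open("reviews.csv", newline="", encoding="utf-8")`. Note that `csv.DictReader` returns every value as a string, so numeric fields need an explicit conversion.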

Start with easy tasks to show you can read in a file, create some variables and data structures, and manipulate their contents. Then move onto something more interesting.

Look at the data that we provided with this assessment and think of something interesting to do with it using whatever libraries you like. Describe what you decide to do with the data and why it might be interesting or useful to the company to do it.

You can add additional data if you need to - either download it or access it using `requests`. Produce working code to implement your ideas in as many cells as you need below. There is no single right answer; the aim is simply to show that you are competent in using Python for data analysis. Exactly how you do that is up to you.

For a distinction class grade, this must show originality, creative thinking, and insights beyond what you've been taught directly on the module.

## Structure
You may structure the appendix of the project how you wish, but here is a suggested guideline to help you organise your work, based on the CRISP-DM data science methodology:

 1. **Business understanding** - What business context is the data coming from? What insights would be valuable in that context, and what data would be required for that purpose?
 2. **Data understanding and preparation** - Explore the data and show you understand its structure and relations, with the aid of appropriate visualisation techniques. Assess the data quality, which insights you would be able to answer from it, and what preparation the data would require. Add new data from another source if required to bring new insights to the data you already have.
 3. **Data modeling (optional)** - Would modeling be required for the insights you have considered? Use appropriate techniques, if so.
 4. **Evaluation and deployment** - How do the insights you obtained help the company, and how should they be adopted in the business? If modeling techniques have been adopted, is their use scientifically sound and how should they be maintained?
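
For the data understanding step, a few basic quality checks can be sketched with pandas as below. The rows are invented for illustration, and the presence of a `Uniq_id` column in the reviews table is an assumption; adjust the column names to whatever the real files contain.

```python
import pandas as pd

# Made-up rows; the column names are assumptions modelled on the CSV files.
products = pd.DataFrame({
    "Uniq_id": ["a1", "b2", "c3"],
    "sale_price": [24.16, None, 12.50],
})
reviews = pd.DataFrame({
    "Uniq_id": ["a1", "a1", "b2", "zz"],
    "Username": ["user1", "user1", "user2", "user3"],
    "Score": [4, 4, 2, 5],
})

# Count missing values in each product column.
missing = products.isna().sum()

# Count review rows that exactly repeat an earlier row.
duplicate_reviews = int(reviews.duplicated().sum())

# Find reviews that point at a product absent from the product table.
joined = reviews.merge(products, on="Uniq_id", how="left", indicator=True)
orphans = joined.loc[joined["_merge"] == "left_only", "Uniq_id"].tolist()

print(missing["sale_price"], duplicate_reviews, orphans)
```

The `indicator=True` argument adds a `_merge` column that records whether each row matched, which makes broken relations between tables easy to list.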

# Remember to make sure you are working completely on your own.
# Don't work in a group or with a friend


## **JCPenney Consultancy Analysis**
### **Date: 27/10/2025**




# Environment Setup Instructions

## Setting up the Environment with Anaconda

Follow these steps to set up your environment for running this Jupyter notebook:

### 1. Clone the Repository (if applicable)
```bash
git clone https://github.com/yakubuaisha318-gif/Representation_and_Manipulation_of_Data_JC_Penny_Consultancy_Assignment.git
cd Representation_and_Manipulation_of_Data_JC_Penny_Consultancy_Assignment
```

### 2. Install Anaconda
If you haven't already installed Anaconda, download it from [anaconda.com](https://www.anaconda.com/products/distribution) and follow the installation instructions for your operating system.

### 3. Create a New Conda Environment
```bash
conda create -n jcpenney-analysis python=3.9
```

### 4. Activate the Environment
```bash
conda activate jcpenney-analysis
```

### 5. Install Required Dependencies
```bash
pip install -r requirements.txt
```

If the requirements file is not available, install the necessary packages:
```bash
conda install pandas numpy matplotlib openpyxl
pip install fpdf
```
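
Once the packages are installed, a quick check from Python can confirm that they are importable in the active environment. This is a minimal sketch; `missing_packages` is a hypothetical helper, and the list assumes the import names match the package names installed above.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the packages installed in step 5 (assumed to match).
required = ["pandas", "numpy", "matplotlib", "openpyxl", "fpdf"]
print("Missing packages:", missing_packages(required) or "none")
```

An empty result suggests the environment is ready; any listed name can be installed with `conda install` or `pip install` as in step 5.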

### 6. Start Jupyter Notebook
```bash
jupyter notebook
```

### 7. Open and Run This Notebook
1. Navigate to this notebook file in the Jupyter interface
2. Select the kernel: `Kernel` → `Change kernel` → `jcpenney-analysis`
3. Run the cells: `Cell` → `Run All`

### 8. Deactivating the Environment
When you're done working:
```bash
conda deactivate
```

In [8]:
import json
from collections import defaultdict  # used by analyze_performance_by_field
from typing import List, Dict, Any, Tuple, Union, Set

import numpy as np
import pandas as pd


def convert_json_file_to_json_array(json_file_path):
    '''
    Convert a JSON Lines file to a JSON array.
    '''
    data = []
    # Read as UTF-8 explicitly so the result does not depend on the platform default
    with open(json_file_path, "r", encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if line:  # Skip empty lines
                try:
                    data.append(json.loads(line))
                except json.JSONDecodeError as e:
                    print(f"Error decoding JSON: {e}")
                    print(f"Problematic line: {line[:100]}...")  # Print first 100 chars of problematic line
                    break
    return data


def convert_csv_file_to_json_array(csv_file_path):
    '''
    Convert a CSV file to a JSON array.
    '''
    df = pd.read_csv(csv_file_path)
    return json.loads(df.to_json(orient="records"))


def get_reviews_by_username(username):
    '''
    Get reviews by username.
    '''
    reviews = convert_csv_file_to_json_array("reviews.csv")
    filtered_reviews = [review for review in reviews if review["Username"] == username]
    return filtered_reviews

def get_extra_fields(dict1, dict2):
    """
    Compare two dictionaries and return extra fields in each.
    Returns a tuple of (extra_in_1, extra_in_2)
    """
    keys1 = set(dict1.keys())
    keys2 = set(dict2.keys())
    return (keys1 - keys2, keys2 - keys1)


def analyze_user_reviewing_patterns(reviewers_data: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Analyze how many products each user has reviewed."""
    review_counts = [len(reviewer.get("Reviewed", [])) for reviewer in reviewers_data]
    
    return {
        "total_reviewers": len(reviewers_data),
        "reviewers_with_reviews": len([c for c in review_counts if c > 0]),
        "reviewers_without_reviews": len([c for c in review_counts if c == 0]),
        "average_reviews_per_user": np.mean(review_counts) if review_counts else 0,
        "median_reviews_per_user": np.median(review_counts) if review_counts else 0,
        "max_reviews_by_user": np.max(review_counts) if review_counts else 0
    }


def extract_values(data: List[Dict[str, Any]], field_name: str, value_type: str = "float") -> List[Union[float, int]]:
    """Extract numeric values from a field in a list of dictionaries."""
    values = []
    for item in data:
        value_str = item.get(field_name)
        if value_str is not None:
            try:
                if value_type == "int":
                    values.append(int(value_str))
                else:
                    values.append(float(value_str))
            except (ValueError, TypeError):
                pass
    return values


def analyze_numeric_data(products_data: List[Dict[str, Any]], field_name: str) -> Dict[str, Any]:
    """Analyze numeric data across all products."""
    values = extract_values(products_data, field_name)
    
    if not values:
        return {}
    
    return {
        "count": len(values),
        "mean": np.mean(values),
        "median": np.median(values),
        "std": np.std(values),
        "min": np.min(values),
        "max": np.max(values),
        "quartiles": {
            "25%": np.percentile(values, 25),
            "50%": np.percentile(values, 50),
            "75%": np.percentile(values, 75)
        }
    }


def create_summary_data(jcpenney_reviewers: List[Dict[str, Any]], 
                        jcpenney_products: List[Dict[str, Any]], 
                        reviews: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Create summary statistics data."""
    user_patterns = analyze_user_reviewing_patterns(jcpenney_reviewers)
    product_ratings = analyze_numeric_data(jcpenney_products, "average_product_rating")
    review_scores = analyze_numeric_data(reviews, "Score")
    
    return {
        "Metric": [
            "Total Reviewers", 
            "Reviewers with Reviews", 
            "Reviewers without Reviews",
            "Average Reviews per User",
            "Total Products",
            "Products with Ratings",
            "Total Reviews",
            "Average Product Rating",
            "Median Product Rating",
            "Average Review Score",
            "Median Review Score"
        ],
        "Value": [
            len(jcpenney_reviewers),
            user_patterns["reviewers_with_reviews"],
            user_patterns["reviewers_without_reviews"],
            round(user_patterns["average_reviews_per_user"], 2),
            len(jcpenney_products),
            len([p for p in jcpenney_products if p.get("average_product_rating") is not None]),
            len(reviews),
            round(product_ratings.get("mean", 0), 2),
            round(product_ratings.get("median", 0), 2),
            round(review_scores.get("mean", 0), 2),
            round(review_scores.get("median", 0), 2)
        ]
    }



def create_product_data(products: List[Dict[str, Any]], include_image_url: bool = False) -> List[Dict[str, Any]]:
    """Create standardized product data for Excel sheets."""
    product_data = []
    for product in products:
        data = {
            "Product ID": product.get("uniq_id", ""),
            # Guard against a null name_title before truncating to 50 characters
            "Product Name": (product.get("name_title") or "")[:50],
            "Brand": product.get("brand", ""),
            "Category": product.get("category", ""),
            "Average Rating": product.get("average_product_rating", ""),
            "Total Reviews": product.get("total_number_reviews", ""),
            "Price": f"${product.get('sale_price', 'N/A')}"
        }
        if include_image_url:
            data["Image URL"] = product.get("product_image_urls", "")
        product_data.append(data)
    return product_data


def get_top_products(products_data: List[Dict[str, Any]], sort_field: str, top_n: int = 10, reverse: bool = True) -> List[Dict[str, Any]]:
    """Get top products based on a specified field."""
    valid_products = []
    for p in products_data:
        field_value = p.get(sort_field)
        if field_value is not None:
            try:
                float(field_value) if sort_field == "average_product_rating" else int(field_value)
                valid_products.append(p)
            except (ValueError, TypeError):
                pass
    
    sorted_products = sorted(valid_products, 
                           key=lambda x: float(x[sort_field]) if sort_field == "average_product_rating" else int(x[sort_field]), 
                           reverse=reverse)
    
    return sorted_products[:top_n]


def analyze_performance_by_field(products_data: List[Dict[str, Any]], field_name: str) -> Dict[str, Any]:
    """Analyze performance by a specified field (brand or category)."""
    field_stats = defaultdict(list)
    
    for product in products_data:
        field = product.get(field_name)
        rating_str = product.get("average_product_rating")
        reviews_str = product.get("total_number_reviews")
        
        if field and rating_str is not None and reviews_str is not None:
            try:
                rating = float(rating_str)
                reviews = int(reviews_str)
                field_stats[field].append({
                    "rating": rating,
                    "reviews": reviews
                })
            except (ValueError, TypeError):
                pass
    
    field_analysis = {}
    for field, products in field_stats.items():
        ratings = [p["rating"] for p in products]
        review_counts = [p["reviews"] for p in products]
        
        field_analysis[field] = {
            "product_count": len(products),
            "avg_rating": np.mean(ratings),
            "median_rating": np.median(ratings),
            "total_reviews": sum(review_counts),
            "avg_reviews_per_product": np.mean(review_counts)
        }
    
    sorted_fields = sorted(field_analysis.items(), key=lambda x: x[1]["avg_rating"], reverse=True)
    
    return dict(sorted_fields)


def main():
    jcpenney_reviewers = convert_json_file_to_json_array("jcpenney_reviewers.json")
    jcpenney_products = convert_json_file_to_json_array("jcpenney_products.json")
    reviews = convert_csv_file_to_json_array("reviews.csv")
    products = convert_csv_file_to_json_array("products.csv")
    users = convert_csv_file_to_json_array("users.csv")

    jcpenney_reviewers_usernames = [reviewer["Username"] for reviewer in jcpenney_reviewers]
    users_usernames = [user["Username"] for user in users]

    jcpenney_products_uniq_ids = [product["uniq_id"] for product in jcpenney_products]
    products_uniq_ids = [product["Uniq_id"] for product in products]

    if set(jcpenney_reviewers_usernames).issubset(set(users_usernames)) and len(jcpenney_reviewers_usernames) == len(users_usernames):
        print("jcpenny_reviewers and users are same except for the extra information in jcpenney_reviewers")
        print("jcpenney_reviewers has reviewed field")
        print("I will be using jcpenney_reviewers for further processing")
        extra_in_reviewers, extra_in_users = get_extra_fields(jcpenney_reviewers[0], users[0])
        print("Extra fields in jcpenney_reviewers:", extra_in_reviewers)
        print("Extra fields in users:", extra_in_users)

    if set(jcpenney_products_uniq_ids).issubset(set(products_uniq_ids)) and len(jcpenney_products_uniq_ids) == len(products_uniq_ids):
        print("jcpenney_products and products are same except for the extra information in jcpenney_products")
        print("jcpenney_products has Reviews field")
        print("I will be using jcpenney_products for further processing")
        extra_in_jcpenney, extra_in_products = get_extra_fields(jcpenney_products[0], products[0])
        print("\nExtra fields in jcpenney_products:", extra_in_jcpenney)
        print("Extra fields in products:", extra_in_products)

    print(create_summary_data(jcpenney_reviewers, jcpenney_products, reviews))
    print(create_product_data(jcpenney_products, include_image_url=True)[:2])
    # print("number of reviewers:", len(jcpenney_reviewers))
    # print("number of jcpenney products:", len(jcpenney_products))
    # # print("number of reviews:", len(reviews))
    # print("number of products:", len(products))
    # print("number of users:", len(users))

    # print(jcpenney_reviewers[0])
    # print(jcpenney_products[0])
    # print(type(jcpenney_reviewers))
    # for key in jcpenney_reviewers[0].keys():
    #     print("reviewer_keys:", key)
    # print(jcpenney_products[0].get("Reviews"))
    # print(type(jcpenney_products))
    # for key in jcpenney_products[0].keys():
    #     print("product_keys:", key)
    # print(type(reviews))
    # for key in reviews[0].keys():
    #     print("review_keys:", key)
    # print(type(products))
    # for key in products[0].keys():
    #     print("product_keys:", key)
    # print(type(users))
    # for key in users[0].keys():
    #     print("user_keys:", key)


# print("Uniq_id:", data[0]["uniq_id"])
# print("SKU:", data[0]["sku"])
# print("Name:", data[0]["name_title"])
# print("List_price:", data[0]["list_price"])
# print("Sale_price:", data[0]["sale_price"])
# print("Category:", data[0]["category"])
# print("Category_tree:", data[0]["category_tree"])
# print("Average_product_rating:", data[0]["average_product_rating"])
# print("Product_url:", data[0]["product_url"])
# print("Description:", data[0]["description"])
# print("Product_image_urls:", data[0]["product_image_urls"])
# print("Bought With:", data[0]["Bought With"])

if __name__ == "__main__":
    main()

jcpenny_reviewers and users are same except for the extra information in jcpenney_reviewers
jcpenney_reviewers has reviewed field
I will be using jcpenney_reviewers for further processing
Extra fields in jcpenney_reviewers: {'Reviewed'}
Extra fields in users: set()
{'Metric': ['Total Reviewers', 'Reviewers with Reviews', 'Reviewers without Reviews', 'Average Reviews per User', 'Total Products', 'Products with Ratings', 'Total Reviews', 'Average Product Rating', 'Median Product Rating', 'Average Review Score', 'Median Review Score'], 'Value': [5000, 4029, 971, np.float64(1.6), 7982, 7982, 39063, np.float64(2.99), np.float64(3.0), np.float64(1.49), np.float64(1.0)]}
[{'Product ID': 'b6c0b6bea69c722939585baeac73c13d', 'Product Name': 'Alfred Dunner® Essential Pull On Capri Pant', 'Brand': 'Alfred Dunner', 'Category': 'alfred dunner', 'Average Rating': 2.625, 'Total Reviews': 8, 'Price': '$24.16', 'Image URL': 'http://s7d9.scene7.com/is/image/JCPenney/DP1228201517142050M.tif?hei=380&amp;wi