---
title: Data Card
subtitle: Review Rating Model
version: v0.1
card version: v0.1
author: Tathagata Talukdar, Kartik Gawande
date: 24-Nov-2024
format:
    html:
        toc: true
        code-fold: true
        html-math-method: katex
        embed-resources: true
execute:
    echo: true
    warning: false
dependencies:
    - plotly=5.18.0
jupyter: python3
---


In [4]:
import json
import os
from pathlib import Path

import pandas as pd
from evidently.metric_preset import DataQualityPreset
from evidently.metrics import DatasetCorrelationsMetric
from evidently.report import Report
from rich import print

# Data Quality Functions
def generate_data_quality_report(data: pd.DataFrame) -> None:
    """Write an Evidently data quality report as a Quarto raw-HTML block."""
    data_quality_report = Report(
        metrics=[DataQualityPreset(), DatasetCorrelationsMetric()]
    )
    # Restrict the report to the numeric review columns.
    relevant_data = data[['HelpfulnessNumerator', 'HelpfulnessDenominator', 'Score', 'Id']]
    data_quality_report.run(current_data=relevant_data, reference_data=None)
    root_dir = os.getcwd()
    report_path = os.path.join(root_dir, 'docs_quarto', 'data_quality', 'data_quality_report.qmd')
    os.makedirs(os.path.dirname(report_path), exist_ok=True)
    html_content = data_quality_report.get_html()
    # The raw-HTML fence markers must sit on their own lines for Quarto to parse them.
    with open(report_path, 'w', encoding='utf-8') as f:
        f.write("```{=html}\n")
        f.write(html_content)
        f.write("\n```\n")

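def get_data_quality_metrics(data: pd.DataFrame) -> dict:
    """Return basic shape metrics and the overall fraction of missing cells."""
    metrics = {
        "num_features": len(data.columns),
        "num_rows": len(data),
        # Fraction of all cells (rows x columns) that are missing, not a count.
        "missing_values": data.isnull().sum().sum() / (data.shape[0] * data.shape[1])
    }
    return metrics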

def run_data_quality_checks(df: pd.DataFrame) -> dict:
    generate_data_quality_report(df)
    metrics = get_data_quality_metrics(df)
    return metrics

# Node Function
def create_data_card(loaded_data: pd.DataFrame, data_quality_metrics: dict) -> str:
    """Assemble the data card fields and return them as a JSON string."""
    data_card = {
        "dataset_name": "Amazon Fine Food Reviews",
        "number_of_rows": len(loaded_data),
        "number_of_features": len(loaded_data.columns),
        "feature_names": list(loaded_data.columns),
        "data_quality_metrics": data_quality_metrics,
    }
    return json.dumps(data_card)

# Example usage
data_path = r'data/01_raw/Reviews.csv'  # Change this to the path of your dataset
data = pd.read_csv(data_path)
data_quality_metrics = run_data_quality_checks(data)
data_card_json = create_data_card(data, data_quality_metrics)
data_card = json.loads(data_card_json)
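
The `missing_values` metric above is a single fraction over the whole table, so it cannot show which columns are responsible. A minimal sketch of a per-column breakdown follows. The helper name `missing_fraction_by_column` and the example frame are hypothetical and not part of the pipeline above.

```python
import pandas as pd


def missing_fraction_by_column(data: pd.DataFrame) -> dict:
    """Return the fraction of missing values for each column (hypothetical helper)."""
    # isnull() gives a boolean mask; its column-wise mean is the missing fraction.
    return {col: float(frac) for col, frac in data.isnull().mean().items()}


# Tiny made-up frame, for illustration only (not taken from Reviews.csv).
example = pd.DataFrame(
    {
        "Score": [5, 4, None, 1],
        "Summary": ["good", None, None, "bad"],
    }
)
fractions = missing_fraction_by_column(example)
print(fractions)
```

The resulting dictionary could be stored next to the overall fraction in `data_quality_metrics` if a per-column view is wanted in the card.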

- **Name**: Amazon Fine Food Reviews
- **Description**: This dataset contains over 500,000 food reviews written by Amazon users up to October 2012. Each review includes product and user information, a rating, helpfulness votes, and a summary.
- **Time period**: Oct 1999 - Oct 2012
- **Version**: 2.0

### Dataset Characteristics

In [17]:
# project_dir = Path().absolute().parent
# data_card_path = project_dir / "data" / "08_reporting" / "data-card" / "data_card.json"

# with open(data_card_path) as f:
#     data_card = json.load(f)
#     data_card = json.loads(data_card)
print(f"""• [bold]Number of Instances[/bold]: {data_card['number_of_rows']}
• [bold]Number of Features[/bold]: {data_card['number_of_features']}
• [bold]Target Variable[/bold]: y (boolean)""")

### Features

In [18]:
features_list = [f"{i}. {name}"
                 for i, name in enumerate(data_card["feature_names"], start=1)]
print("\n".join(features_list))

### Data Collection
- **Method**: Unknown

### Intended Use
- **Sentiment Analysis**: Analyzing whether the sentiment of each review is positive, negative, or neutral. This is useful for understanding consumer opinions and improving products or services.

- **Text Classification**: Classifying reviews into various categories based on their content, which could help in automatically sorting feedback into different areas such as packaging, taste, and customer service.

- **Recommendation Systems**: Using the ratings and reviews to build recommendation systems that can suggest products to users based on the preferences of similar users.

- **Language Modeling**: Training language models to generate text that mimics user-generated content, which can be useful for creating automated responses or new reviews for training.

- **Feature Extraction**: Extracting and analyzing specific features from the text, such as the use of adjectives, to study how language use affects the perception of a product.

- **Data Visualization**: Visualizing data to identify trends and patterns in consumer behavior over time, across different products, or among different groups of reviewers.
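
As one illustration of the sentiment analysis use case listed above, star ratings can be turned into coarse sentiment labels. The sketch below assumes the `Score` column holds integer ratings from 1 to 5. The thresholds (1-2 negative, 3 neutral, 4-5 positive) are a common convention chosen here as an assumption, not something this data card prescribes, and `score_to_sentiment` is a hypothetical helper.

```python
import pandas as pd


def score_to_sentiment(score: int) -> str:
    """Map a 1-5 star rating to a sentiment label (assumed thresholds)."""
    if score <= 2:
        return "negative"
    if score == 3:
        return "neutral"
    return "positive"


# Made-up ratings, for illustration only.
example = pd.DataFrame({"Score": [1, 3, 5, 4, 2]})
example["sentiment"] = example["Score"].map(score_to_sentiment)
print(example)
```

Other cut-offs, such as dropping 3-star reviews entirely, are equally defensible. The choice should be recorded alongside any model trained on the derived labels.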

### Ethical Considerations
- Ensure fair and unbiased use of the data, particularly regarding protected attributes like personal status.
- Be cautious of potential biases in the original data collection process.
- Consider the implications of using this data for decision-making in financial contexts.

### Known Limitations
- **Bias in Reviews**: The dataset may not represent the general population, because it only includes Amazon users who chose to write a review. These users may differ significantly in their tastes and expectations from the average consumer.

- **Rating Inflation**: There is often a tendency for review datasets to show rating inflation where the number of high ratings disproportionately exceeds lower ratings. This can skew analysis, particularly if the goal is to understand negative feedback.

- **Outdated Information**: As products and consumer preferences evolve over time, the reviews, especially older ones, may not accurately reflect the current status or quality of a product.

- **Missing Context**: Reviews may reference specific product versions, experiences, or events not fully detailed in the text, leading to potential misinterpretation of the sentiment or content of the review.

- **Text Quality**: The quality of text in reviews can vary greatly. Some reviews may be very brief or poorly written, offering little useful information for analysis, while others may be verbose and detailed.

- **Spam and Fake Reviews**: The presence of spam or fake reviews can distort analysis, leading to inaccurate conclusions unless methods are in place to identify and filter out such content.

- **Limited Demographic Data**: The dataset primarily focuses on the text and ratings of reviews without providing detailed demographic information about the reviewers, which could be crucial for understanding preferences across different user segments.
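
Two of the limitations above, rating inflation and spam, can be roughly quantified before modeling. The sketch below is a starting point under stated assumptions: it assumes a `Score` column and a review-body column named `Text` (the latter is not referenced elsewhere in this card), and it treats exact duplicate text as a crude proxy for spam. Both helper names are hypothetical.

```python
import pandas as pd


def rating_distribution(data: pd.DataFrame, column: str = "Score") -> pd.Series:
    """Share of reviews per rating value, ordered by rating."""
    return data[column].value_counts(normalize=True).sort_index()


def duplicate_text_share(data: pd.DataFrame, column: str = "Text") -> float:
    """Fraction of rows whose text exactly repeats an earlier row."""
    return float(data.duplicated(subset=[column]).mean())


# Made-up rows, for illustration only (not taken from Reviews.csv).
example = pd.DataFrame(
    {
        "Score": [5, 5, 5, 4, 1],
        "Text": ["great", "great", "tasty", "fine", "stale"],
    }
)
shares = rating_distribution(example)
dup_share = duplicate_text_share(example)
print(shares)
print(dup_share)
```

A distribution concentrated on the highest rating points to class imbalance that may call for stratified splits or reweighting. Exact-match duplicates will miss paraphrased spam, so this check gives a lower bound rather than a full filter.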