# Data Profiling

## Prerequisites

- AWS credentials file configured on your local machine, with permission to read the CSV file from S3
- Python 3.8-3.9 virtual environment created and activated (used as the kernel for this notebook)
- Python requirements installed (run the install cell below only once)

## Considerations

- The CSV sample file was uploaded to AWS S3 as is (data and structure unchanged; only the filename was changed), so the data stays private in a secure cloud environment.
- The script below would be simpler if profiling were done on a local file instead of a file stored in AWS.

## Advanced feature

If new files arrive for profiling and the work needs to be automated, a DAG can be created in Airflow that reuses the script below to run the following flow:

* read file from S3 -> generate profiling report -> upload report to the cloud

With that in place, an analyst could simply trigger the DAG to receive a data profiling report.
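
The last step of that flow, uploading the report, is not part of the script in this notebook. A minimal sketch of how it might look, assuming the report has already been written to a local HTML file: `s3_upload_report` is a hypothetical helper name, and the bucket and key passed to it are placeholders rather than values from this project. The S3 client is passed in as an argument so the same function could be called from a notebook cell or from an Airflow task.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def s3_upload_report(s3_client, report_path, bucket_name, key):
    """Upload a local HTML report to S3 and return its s3:// URI.

    `s3_client` is assumed to be a boto3 S3 client, for example
    boto3.client("s3", region_name="us-east-1").
    """
    path = Path(report_path)
    if not path.is_file():
        # Fail early with a clear message instead of a transfer error
        raise FileNotFoundError(f"Report not found: {path}")

    logger.info("Uploading '%s' to bucket '%s' as '%s'...", path, bucket_name, key)

    # ContentType lets a browser render the object as HTML
    # instead of offering it as a download.
    s3_client.upload_file(
        Filename=str(path),
        Bucket=bucket_name,
        Key=key,
        ExtraArgs={"ContentType": "text/html"},
    )

    logger.info("Report was uploaded successfully")
    return f"s3://{bucket_name}/{key}"
```

In this notebook it could be called as `s3_upload_report(boto3.client("s3", region_name=aws_region), report_path, s3_bucket_name, "reports/data_profiling_report.html")`, where the `reports/` prefix is only an example. The credentials would also need write permission on the target bucket, which the prerequisites above do not cover.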

In [2]:
# Run only once
%pip install ydata-profiling==4.16.1
%pip install boto3==1.26.106

Note: you may need to restart the kernel to use updated packages.


In [1]:
import boto3
import pandas as pd
import logging
from io import StringIO
from ydata_profiling import ProfileReport

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    force=True  # Replace handlers already attached to the root logger (needed in Jupyter notebooks)
)
logger = logging.getLogger(__name__)

aws_region = 'us-east-1'
s3_bucket_name = 'novi-raw-dev'
s3_key = 'de_sample.csv'
report_path = "data_profiling_report.html"

def s3_read_csv(bucket_name, key, aws_region):
    """Download a UTF-8 CSV object from S3 and return it as a DataFrame."""
    try:
        s3_client = boto3.client("s3", region_name=aws_region)

        logger.info(f"Reading file '{key}' from bucket '{bucket_name}'...")

        response = s3_client.get_object(Bucket=bucket_name, Key=key)

        # The whole object is loaded into memory and decoded before parsing
        csv_content = response["Body"].read().decode("utf-8")
        df = pd.read_csv(StringIO(csv_content), sep=',', header=0)

        logger.info("File was read successfully")
        return df
    except Exception:
        # Log the full traceback, then let the error propagate to the caller
        logger.exception(f"Failed to read '{key}' from bucket '{bucket_name}'")
        raise

df = s3_read_csv(s3_bucket_name, s3_key, aws_region)
logger.info("Generating profiling report...")
profile = ProfileReport(df, title="Novicap Profiling Report")
profile.to_file(report_path)
logger.info(f"Report was exported as html in {report_path}")



2025-08-18 19:33:29,182 [INFO] Found credentials in shared credentials file: ~/.aws/credentials
2025-08-18 19:33:29,414 [INFO] Reading file 'de_sample.csv' from bucket 'novi-raw-dev'...
2025-08-18 19:33:45,493 [INFO] File was read successfully
2025-08-18 19:33:45,496 [INFO] Generating profiling report...
100%|██████████| 19/19 [00:01<00:00, 12.53it/s]
Summarize dataset: 100%|██████████| 173/173 [00:27<00:00,  6.41it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:07<00:00,  7.23s/it]
Render HTML: 100%|██████████| 1/1 [00:03<00:00,  3.52s/it]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 87.04it/s]
2025-08-18 19:34:25,381 [INFO] Report was exported as html in data_profiling_report.html
