# Data Profiling

## Prerequisites

- AWS credentials file configured on your local machine, with permission to read the CSV file from S3
- Python virtual environment created and activated (this notebook runs in it)
- Python requirements installed (run the install cell below only once)
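
The first prerequisite can be checked before running the rest of the notebook. The snippet below is a minimal sketch and not part of the original script: `can_read_object` is a hypothetical helper, and it assumes that a `HEAD` request on the object is a good enough test (S3 requires the `s3:GetObject` permission for it).

```python
def can_read_object(s3_client, bucket_name, key):
    """Return True if the client's credentials can read s3://bucket_name/key."""
    try:
        # HEAD fetches only the object's metadata, so nothing is downloaded
        s3_client.head_object(Bucket=bucket_name, Key=key)
        return True
    except Exception as exc:  # e.g. access denied, missing key, or no credentials found
        print(f"Cannot read s3://{bucket_name}/{key}: {exc}")
        return False
```

With boto3 installed, it could be called as `can_read_object(boto3.client("s3", region_name="us-east-1"), "novi-raw-dev", "de_sample.csv")`.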

## Considerations

- The sample CSV file was uploaded to AWS S3 unchanged (same data and structure; only the filename was changed), which keeps the data private in a secure cloud environment.
- The script below would be simpler if the profiling ran against a local file instead of a file in S3.
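
To illustrate the second point, here is a minimal sketch of the local-file variant. The function names and the `_profile.html` naming rule are assumptions made for this example; only `pd.read_csv`, `ProfileReport` and `to_file` come from the script in this notebook.

```python
from pathlib import Path


def default_report_path(csv_path):
    """Place the HTML report next to the CSV, e.g. data.csv -> data_profile.html."""
    csv_path = Path(csv_path)
    return csv_path.with_name(f"{csv_path.stem}_profile.html")


def profile_local_csv(csv_path, report_path=None):
    """Profile a local CSV file and write the report as HTML."""
    # Imported here so the path helper above works without these packages installed
    import pandas as pd
    from ydata_profiling import ProfileReport

    report_path = Path(report_path) if report_path else default_report_path(csv_path)
    df = pd.read_csv(csv_path)  # no S3 client, credentials, or decoding step needed
    ProfileReport(df, title=f"Profiling Report: {Path(csv_path).name}").to_file(report_path)
    return report_path
```

Calling `profile_local_csv("de_sample.csv")` would write `de_sample_profile.html` in the same folder.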

## Next steps (advanced feature)

If new files arrive for profiling and the work must be automated, an Airflow DAG can reuse the script below to run the following flow:
 
read file from s3 -> generate profiling report -> upload report to the cloud
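
The flow above can be sketched as a plain Python function that an Airflow task could call. This is only an illustration under stated assumptions: `run_profiling_flow`, `html_report_from_csv` and the `report_key` parameter are hypothetical names, and the report is assumed to go back to the same S3 bucket, which the notebook does not specify. The S3 client is passed in so the flow can be exercised without AWS access.

```python
import io


def html_report_from_csv(csv_bytes, title="Profiling Report"):
    """Build a ydata-profiling HTML report from raw CSV bytes."""
    # Heavy imports stay inside the function so that importing this module is cheap
    import pandas as pd
    from ydata_profiling import ProfileReport

    df = pd.read_csv(io.BytesIO(csv_bytes))
    return ProfileReport(df, title=title).to_html()


def run_profiling_flow(s3_client, bucket_name, key, report_key,
                       build_report=html_report_from_csv):
    """Read a CSV from S3, profile it, and upload the HTML report."""
    # 1. read file from s3
    csv_bytes = s3_client.get_object(Bucket=bucket_name, Key=key)["Body"].read()
    # 2. generate profiling report
    html = build_report(csv_bytes)
    # 3. upload report to the cloud (assumed here: same bucket, different key)
    s3_client.put_object(
        Bucket=bucket_name,
        Key=report_key,
        Body=html.encode("utf-8"),
        ContentType="text/html",
    )
    return report_key
```

Inside a DAG, a single task could call `run_profiling_flow(boto3.client("s3"), ...)`. Splitting it into three tasks would require passing the data between them through S3 or local storage, because the CSV and the report are too large to hand over as task return values.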

In [None]:
# Run only once
!pip install ydata-profiling==4.16.1
!pip install boto3==1.26.106

In [2]:
import boto3
import pandas as pd
import logging
from io import StringIO
from ydata_profiling import ProfileReport

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    force=True,  # Needed in Jupyter: replaces handlers already attached to the root logger
)
logger = logging.getLogger(__name__)

aws_region = 'us-east-1'
s3_bucket_name = 'novi-raw-dev'
s3_key = 'de_sample.csv'
report_path = "eda_report.html"

def s3_read_csv(bucket_name, key, aws_region):
    try:
        s3_client = boto3.client("s3", region_name=aws_region)

        logger.info(f"Reading file '{key}' from bucket '{bucket_name}'...")

        response = s3_client.get_object(Bucket=bucket_name, Key=key)

        # The body is decoded here, so read_csv receives text and needs no encoding argument
        csv_content = response["Body"].read().decode("utf-8")
        df = pd.read_csv(StringIO(csv_content), sep=',', header=0)

        logger.info("File was read successfully")
        return df
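    except Exception as e:
        # The log format already prints the level, so the message only adds context
        logger.error(f"Failed to read '{key}' from bucket '{bucket_name}': {e}")
        raise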

df = s3_read_csv(s3_bucket_name, s3_key, aws_region)
logger.info("Generating profiling report...")
profile = ProfileReport(df, title="Novicap Profiling Report")
profile.to_file(report_path)
logger.info(f"Report was exported as html in {report_path}")

2025-08-18 18:54:11,964 [INFO] Reading file 'de_sample.csv' from bucket 'novi-raw-dev'...


2025-08-18 18:54:13,822 [INFO] File was read successfully
2025-08-18 18:54:13,825 [INFO] Generating profiling report...
100%|██████████| 19/19 [00:02<00:00,  8.92it/s]
Summarize dataset: 100%|██████████| 173/173 [00:26<00:00,  6.48it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:07<00:00,  7.53s/it]
Render HTML: 100%|██████████| 1/1 [00:03<00:00,  3.51s/it]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 75.67it/s]
2025-08-18 18:54:52,201 [INFO] Report was exported as html in eda_report.html
