## 📊 Key features

- **Type inference**: automatic detection of columns' data types (*Categorical*, *Numerical*, *Date*, etc.)
- **Warnings**: A summary of the problems/challenges in the data that you might need to work on (*missing data*, *inaccuracies*, *skewness*, etc.)
- **Univariate analysis**: including descriptive statistics (mean, median, mode, etc.) and informative visualizations such as distribution histograms
- **Multivariate analysis**: including correlations, a detailed analysis of missing data, duplicate rows, and visual support for pairwise interactions between variables
- **Time-Series**: including statistical information specific to time-dependent data, such as auto-correlation and seasonality, along with ACF and PACF plots
- **Text analysis**: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic)
- **File and Image analysis**: file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata
- **Compare datasets**: one-line solution to enable a fast and complete report on the comparison of datasets
- **Flexible output formats**: every analysis can be exported to an HTML report that is easy to share, to JSON for integration into automated systems, or shown as a widget in a Jupyter Notebook
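
The export formats in the last bullet can be sketched as a small helper. This is a minimal sketch, assuming `profile` is a `ProfileReport` (or any object exposing `to_file`) and that `to_file` picks the format from the file extension (`.html` or `.json`). The names `export_report` and `stem` are hypothetical and introduced only for illustration.

```python
def export_report(profile, stem):
    """Write the report as HTML and as JSON; return the file names written.

    `profile` is assumed to expose `to_file(path)`, as ProfileReport does.
    `export_report` and `stem` are illustrative names, not library API.
    """
    outputs = [f"{stem}.html", f"{stem}.json"]
    for path in outputs:
        # The output format is inferred from the extension.
        profile.to_file(path)
    return outputs
```

In a notebook, `profile.to_widgets()` or `profile.to_notebook_iframe()` would display the same report inline instead of writing it to disk.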

The report contains three additional sections:

- **Overview**: mostly global details about the dataset (number of records, number of variables, overall missingness and duplicates, memory footprint)
- **Alerts**: a comprehensive, automatically generated list of potential data quality issues (high correlation, skewness, uniformity, zeros, missing values, constant values, among others)
- **Reproduction**: technical details about the analysis (time, version and configuration)
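
To make the Overview figures concrete, here is a minimal standard-library sketch of how record count, variable count, overall missingness and duplicate rows could be computed for a small list of records. It is an illustration only and not how ydata-profiling computes them. The function `overview`, the sample records and the use of `None` for a missing cell are all assumptions.

```python
def overview(rows):
    """Summarize a list of dict records; None marks a missing cell."""
    n_records = len(rows)
    columns = sorted({key for row in rows for key in row})
    n_cells = n_records * len(columns)
    n_missing = sum(1 for row in rows for col in columns if row.get(col) is None)

    # A row counts as a duplicate if an identical row appeared earlier.
    seen = set()
    n_duplicates = 0
    for row in rows:
        key = tuple(row.get(col) for col in columns)
        if key in seen:
            n_duplicates += 1
        else:
            seen.add(key)

    return {
        "records": n_records,
        "variables": len(columns),
        "missing_fraction": n_missing / n_cells if n_cells else 0.0,
        "duplicate_rows": n_duplicates,
    }


# Hypothetical records, used only to exercise the function.
sample = [
    {"name": "a", "gift": 10},
    {"name": "b", "gift": None},
    {"name": "a", "gift": 10},
]
summary = overview(sample)
```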

In [1]:
import numpy as np
import math
import pandas as pd
import random
from ydata_profiling import ProfileReport
import matplotlib.pyplot as plt

In [8]:
# Load the synthetic constituent dataset

path = "/home/RMittal@ccsfundraising.com/ccs_pred_mod"
filename = "synthetic_constituent_data.csv"

file = f"{path}/{filename}"
df_cd = pd.read_csv(file)


In [9]:
profile = ProfileReport(
    df_cd,
    title="Profiling Synthetic Data",
    html={"style": {"full_width": True}},
)
profile.to_file("synthetic_constituent_data_report.html")

(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'could not convert string to float: 'Marta S. Brill'')
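
The warning above points to the `correlations={"auto": {"calculate": False}}` setting. A minimal sketch of applying it follows, assuming `ProfileReport` accepts the same `correlations` keyword that the warning shows for `df.profile_report(...)`. The helper `report_kwargs` and its `skip_auto_correlation` flag are hypothetical names introduced here.

```python
def report_kwargs(title, skip_auto_correlation=True):
    """Build keyword arguments for ProfileReport (illustrative helper)."""
    kwargs = {"title": title, "html": {"style": {"full_width": True}}}
    if skip_auto_correlation:
        # Setting taken verbatim from the warning message above.
        kwargs["correlations"] = {"auto": {"calculate": False}}
    return kwargs


kwargs = report_kwargs("Profiling Synthetic Data")
# With ydata_profiling imported and df_cd loaded as in the cells above:
# profile = ProfileReport(df_cd, **kwargs)
```

Building the arguments separately keeps the setting easy to toggle if the auto correlation is needed again later.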



In [None]:
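# N = int(9e6)       # approximate number of data rows in the file
# s = int(0.10 * N)  # number of rows to keep (10%)

# Skip N - s data rows; start at 1 so the header (row 0) is never skipped
# skip = sorted(random.sample(range(1, N + 1), N - s))
# print(N, s)
# df_cd = pd.read_csv(file, skiprows=skip)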
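
The commented-out cell above samples rows at read time by listing the line numbers to skip. Here is a standard-library sketch of the same idea on a tiny in-memory CSV, assuming line 0 is the header and must always be kept. The function `sample_skip_rows`, its `seed` argument and the toy data are hypothetical.

```python
import csv
import io
import random


def sample_skip_rows(n_rows, frac, seed=None):
    """Return sorted line numbers to skip so that about `frac` of rows remain.

    Data rows are numbered 1..n_rows; line 0 (the header) is never skipped.
    """
    keep = round(frac * n_rows)
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, n_rows + 1), n_rows - keep))


# Tiny in-memory CSV standing in for the real file.
text = "id,value\n" + "".join(f"{i},{i * i}\n" for i in range(1, 11))
skip = set(sample_skip_rows(n_rows=10, frac=0.3, seed=0))

reader = csv.reader(io.StringIO(text))
rows = [row for i, row in enumerate(reader) if i not in skip]
header, data = rows[0], rows[1:]
```

With pandas, the same sorted list could be passed as `pd.read_csv(file, skiprows=...)`, which avoids loading the skipped rows into memory.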

In [None]:


### If it's a large data set, take a sample:
description = "Disclaimer: this profiling report was generated using a sample of 15% of the original dataset."
# df_cd_sample = df_cd.sample(frac=0.15)

# profile = ProfileReport(
#     df_cd_sample,
#     title="Profiling National Multiple Sclerosis Data",
#     dataset={"description": description},
#     html={"style": {"full_width": True}},
#     minimal=True,
# )
# profile.to_file("nmss_data_report.html")
# profile.to_notebook_iframe()