# Pair coding: Data Profiling with DQX

## Install DQX library

In [0]:
%pip install databricks-labs-dqx==0.9.2

It is advisable to restart the Python kernel after installing a library so that the newly installed version is loaded.

In [0]:
# Restart the Python kernel
dbutils.library.restartPython()

In [0]:
# Import the DQX profiler and the WorkspaceClient; pandas is used later to view the summary statistics
import pandas as pd
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.sdk import WorkspaceClient

In [0]:
# Read the sample data from the Databricks volume
df_sample = spark.read.parquet("/Volumes/securehome/raw/phoenix/sample_10pct_data")

Let's preview the first 100 rows to get a sense of what we are working with.

In [0]:
# Show the first 100 records; limit() keeps the result a DataFrame, which display() renders as a table
display(df_sample.limit(100))

Hmm, we already see a few issues, for example:
- `created_at` seems to have `null` values
- `last_name` seems to have `null` values
- Some `email_address` values contain spaces. Are these valid?
- `phone_number` sometimes seems to have an incorrect format
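
To put numbers on the first three observations, one option is to build a few Spark SQL aggregate expressions and pass them to `selectExpr`. The helper below is a minimal sketch, not part of DQX: `issue_count_exprs` is a hypothetical name, and the check for spaces is a simple `LIKE '% %'` match. The phone number check is left out because the expected format is not stated here.

```python
def issue_count_exprs(nullable_columns, space_check_columns):
    """Build Spark SQL aggregate expressions that count suspicious values.

    Hypothetical helper for illustration only; it returns plain strings.
    """
    # One expression per column counting NULL values
    exprs = [
        f"sum(CASE WHEN `{c}` IS NULL THEN 1 ELSE 0 END) AS `{c}_nulls`"
        for c in nullable_columns
    ]
    # One expression per column counting values that contain a space
    exprs += [
        f"sum(CASE WHEN `{c}` LIKE '% %' THEN 1 ELSE 0 END) AS `{c}_with_spaces`"
        for c in space_check_columns
    ]
    return exprs


exprs = issue_count_exprs(["created_at", "last_name"], ["email_address"])
for expr in exprs:
    print(expr)

# In the notebook, assuming df_sample has been loaded as above:
# display(df_sample.selectExpr(*exprs))
```

Building the expressions as strings keeps the checks easy to read and lets all counts be computed in a single pass over the data.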

Let's compute some overview statistics on the data ourselves to check these observations.

In [0]:
df_sample.summary().display()

## Data Profiling with DQX

We can run data profiling on the input data, which produces summary statistics together with candidate quality rules for each column.

The quality rules generated from profiling can then be used as input for quality checking.

To understand the summary statistics, see the [summary statistics reference](https://databrickslabs.github.io/dqx/docs/guide/data_profiling/#summary-statistics-reference) in the docs.

In [0]:
# We initialize the profiler with a WorkspaceClient, which gives it access to Databricks context.
ws = WorkspaceClient()
profiler = DQProfiler(ws)

The profiler returns two things:
- (1) summary statistics (a dict of column-level stats), and
- (2) profiles (a set of data quality checks).

In [0]:
# Run the profiler on our DataFrame
summary_stats, profiles = profiler.profile(df_sample)
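
A quick way to get a feel for the second return value is to group the generated profiles by column. The sketch below assumes each profile exposes `name` and `column` attributes; check this against the DQX version you have installed. `rules_by_column` is a hypothetical helper, and the stand-in objects with columns `col_a` and `col_b` are made-up examples, not results from our dataset.

```python
from collections import defaultdict
from types import SimpleNamespace


def rules_by_column(profiles):
    """Group profile names by the column they apply to.

    Assumes each profile has `.name` and `.column` attributes (an assumption
    about the DQX profile objects, not confirmed in this notebook).
    """
    grouped = defaultdict(list)
    for profile in profiles:
        grouped[profile.column].append(profile.name)
    return dict(grouped)


# Hypothetical stand-ins so the sketch runs outside Databricks
example_profiles = [
    SimpleNamespace(name="is_not_null", column="col_a", parameters=None),
    SimpleNamespace(name="min_max", column="col_b", parameters={"min": 1, "max": 9}),
    SimpleNamespace(name="is_not_null", column="col_b", parameters=None),
]

grouped_example = rules_by_column(example_profiles)
print(grouped_example)

# In the notebook, assuming the attribute names above hold:
# rules_by_column(profiles)
```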

### Overview of Profiling results

In [0]:
# Convert the summary stats dictionary into a pandas DataFrame for easier viewing
pd.DataFrame(summary_stats).T

From the above table, w