# A data quality validation use case

### **Data quality validation:** Data preparation and the importance of profiling the quality of your data

Data profiling is the process of reviewing a data source and validating it in order to understand its structure, behaviour, content and the relations it depicts.
Data profiling is a crucial step in several kinds of data projects and initiatives:

- **Data Warehousing projects** - While building a new Data Warehouse (DWH) or feeding an existing one;
- **Data conversion and data migration projects** - Data profiling is an essential tool to spot and identify data quality issues. It can also help uncover new requirements for the target system;
- **Data preparation** - When developing ML-based projects, the data preparation stage involves many activities around data filtering and selection, data augmentation, feature creation and splitting the data into training, validation and holdout sets. To measure the impact of each of these transformations, data profiling is a core step that makes it possible to automate any required validations.

YData's data profiling process involves:
- Inference of data types and keyword-tagging-based categories;
- Collection of univariate descriptive statistics (min, max, quantiles, etc.);
- Key integrity and missingness profile, based on the validation of zeros, blanks and nulls;
- Collection of multivariate statistics (how the variables relate to each other);
- Constraints defined from business rules and formatting expectations.
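
The steps listed above can be approximated with plain pandas to get a feel for what each one computes. This is a minimal sketch, not YData's implementation: the `quick_profile` helper, the toy columns and the "at least one adult" business rule are all assumptions made for illustration.

```python
import pandas as pd


def quick_profile(df: pd.DataFrame) -> dict:
    """Hypothetical helper: a rough pandas-only profile of a DataFrame."""
    numeric = df.select_dtypes(include="number")
    return {
        # 1. Inferred data types per column
        "dtypes": df.dtypes.astype(str).to_dict(),
        # 2. Univariate descriptive statistics (min, max, quantiles, ...)
        "univariate": numeric.describe(),
        # 3. Missingness profile: nulls, blank strings and zeros per column
        "nulls": df.isna().sum().to_dict(),
        "blanks": {
            col: int((df[col].astype(str).str.strip() == "").sum())
            for col in df.columns
        },
        "zeros": (numeric == 0).sum().to_dict(),
        # 4. Multivariate statistics: pairwise correlation of numeric columns
        "correlation": numeric.corr(),
    }


# Toy data standing in for a few booking columns (assumed values)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "", "City Hotel"],
    "adults": [2, 0, 1, 2],
    "children": [0.0, None, 1.0, 0.0],
})
profile = quick_profile(toy)

# 5. A constraint from an assumed business rule: at least one adult per booking
violations = toy[toy["adults"] < 1]

print(profile["nulls"])
print(profile["blanks"])
print(f"{len(violations)} row(s) violate the rule")
```

Keeping the constraint check outside the helper is a deliberate choice: descriptive statistics are generic, while business rules are specific to each dataset.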

## Hotel booking - A data preparation & profiling pipeline

To demonstrate the potential of data quality profiling throughout the data preparation process, the example below uses the [Hotel Booking dataset available on Kaggle](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand). The dataset contains the booking information for a city hotel and a resort hotel.
In this tutorial we deliver a full data preparation pipeline for training a classification model that identifies whether a given reservation will be canceled. The pipeline includes the validation of the train/validation split (are the samples representative and comparable?) as well as the profiling of all the data transformations and features extracted throughout the process.
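
One simple way to check whether a train/validation split is representative is to compare the target rate in each part. The sketch below uses a synthetic frame and a stratified split built with pandas `groupby(...).sample(...)`. The helper name `stratified_split`, the 80/20 ratio and the toy data are assumptions, and the snippet only illustrates the idea.

```python
import pandas as pd


def stratified_split(df, target, frac=0.8, seed=0):
    """Hypothetical helper: sample `frac` of each target class for training."""
    train = df.groupby(target, group_keys=False).sample(frac=frac, random_state=seed)
    valid = df.drop(train.index)  # everything not sampled goes to validation
    return train, valid


# Synthetic bookings: 40% canceled, mimicking a binary target like `is_canceled`
bookings = pd.DataFrame({
    "is_canceled": [1] * 40 + [0] * 60,
    "lead_time": range(100),
})
train, valid = stratified_split(bookings, "is_canceled")

# Comparable splits should show (almost) the same cancellation rate
rates = pd.Series({
    "full": bookings["is_canceled"].mean(),
    "train": train["is_canceled"].mean(),
    "valid": valid["is_canceled"].mean(),
})
print(rates)
```

Comparing a single rate is only a first check. The same comparison can be repeated for any feature distribution that matters to the model.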

## Read data

In [6]:
# Import the necessary packages
import pandas as pd

from ydata.connectors import GCSConnector
from ydata.connectors.filetype import FileType
from ydata.utils.formats import read_json

In [7]:
data = pd.read_csv('hotel_booking.csv')

## Metadata calculation
Calculate the overall statistics of the data under observation.

In [12]:
from ydata.dataset import Dataset
from ydata.metadata import Metadata

metadata = Metadata(Dataset(data))
print(metadata)

[########################################] | 100% Completed |  6.5s
Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 32
% of duplicate rows: 26
Target column:

Column detail:
                            Column    Data type Variable type
0                            hotel  categorical        string
1                      is_canceled  categorical           int
2                        lead_time  categorical           int
3                arrival_date_year  categorical           int
4               arrival_date_month  categorical        string
5         arrival_date_week_number  categorical           int
6        arrival_date_day_of_month  categorical           int
7          stays_in_weekend_nights  categorical           int
8             stays_in_week_nights  categorical           int
9                           adults  categorical           int
10                        children  categorical     
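
The share of duplicate rows reported in the summary can be cross-checked with plain pandas. A minimal sketch on a toy frame follows. The `duplicate_row_pct` helper is hypothetical, and how YData counts duplicates (for example, whether the first occurrence is included) is not stated here, so the two figures may not match exactly.

```python
import pandas as pd


def duplicate_row_pct(df: pd.DataFrame) -> float:
    """Hypothetical helper: percentage of rows that repeat an earlier row."""
    # duplicated() flags every occurrence after the first one
    return 100 * df.duplicated().mean()


# Toy frame (assumed values): rows 0, 1 and 3 are identical
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "adults": [2, 2, 1, 2],
})
print(f"{duplicate_row_pct(toy):.0f}% duplicate rows")
```

On the tutorial's data the equivalent call would be `duplicate_row_pct(data)`.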

In [13]:
metadata.save('metadata.pkl')

## Sending the data & Metadata for the next flow steps

In [14]:
data.to_csv('data.csv')
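
When a later step reads `data.csv` back, note that `DataFrame.to_csv` writes the index as an unnamed first column by default. The sketch below uses a toy frame and a temporary directory as stand-ins and shows one way to restore the original frame with `index_col=0`. How the downstream steps of this pipeline actually load the file is not shown here.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the booking data (assumed values)
toy = pd.DataFrame({"hotel": ["City Hotel", "Resort Hotel"], "is_canceled": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.csv"
    toy.to_csv(path)  # default behaviour: the index is written as well

    naive = pd.read_csv(path)                  # index comes back as an extra column
    restored = pd.read_csv(path, index_col=0)  # index is restored instead

print(list(naive.columns))
print(list(restored.columns))
```

An alternative is to pass `index=False` when writing, which avoids the extra column altogether. That choice depends on whether the index carries information the next step needs.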