# A data quality validation use case

### **Data quality validation:** Data preparation and the importance of profiling the quality of your data

Data profiling is the process of reviewing a data source and validating it in order to understand its structure, behaviour and content, as well as the relationships it depicts.
Data profiling is a crucial step in several kinds of data projects and initiatives:

- **Data warehousing projects** - While building a Data Warehouse (DWH) or feeding an existing one;
- **Data conversion and data migration projects** - Data profiling is an essential tool to spot and identify data quality issues. It can also help uncover new requirements for the target system;
- **Data preparation** - When developing ML-based projects, the data preparation stage involves many activities: data filtering and selection, data augmentation, feature creation, and splitting the data into training, validation and holdout sets. Data profiling is a core step for measuring the impact of each of these transformations, and it makes it possible to automate any required validations.

YData's data profiling process involves:
- Inference of data types and of categories based on keyword tagging;
- Collection of univariate descriptive statistics (min, max, quantiles, etc.);
- A key integrity and missingness profile, based on the validation of zeros, blanks and nulls;
- Collection of multivariate statistics (how the variables relate to each other);
- Constraints defined from business rules and formatting expectations.
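
To make a few of these steps concrete, the sketch below profiles a handful of hypothetical records using only the Python standard library. It is not YData's implementation: the helper names (`infer_type`, `profile_column`, `duplicate_row_pct`), the toy rows, and the rule that every repeat of an earlier row counts as a duplicate are all assumptions made for illustration. It covers rough type inference, univariate statistics, the zeros/blanks/nulls profile and the share of duplicate rows, and leaves out the multivariate and constraint checks.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical toy records; the column names are borrowed from the hotel booking dataset.
rows = [
    {"hotel": "Resort Hotel", "lead_time": 342, "agent": None},
    {"hotel": "Resort Hotel", "lead_time": 7, "agent": None},
    {"hotel": "City Hotel", "lead_time": 13, "agent": 304.0},
    {"hotel": "City Hotel", "lead_time": 13, "agent": 304.0},  # exact duplicate
    {"hotel": "", "lead_time": 0, "agent": 240.0},
]


def infer_type(values):
    """Rough inference: numerical if every non-missing value is a number."""
    present = [v for v in values if v not in (None, "")]
    if present and all(isinstance(v, (int, float)) for v in present):
        return "numerical"
    return "categorical"


def profile_column(rows, name):
    """Collect type, missingness counts and basic univariate statistics."""
    values = [row[name] for row in rows]
    profile = {
        "type": infer_type(values),
        "nulls": sum(v is None for v in values),
        "blanks": sum(v == "" for v in values),
        "zeros": sum(v == 0 for v in values),
    }
    if profile["type"] == "numerical":
        numbers = sorted(v for v in values if isinstance(v, (int, float)))
        profile["min"], profile["max"] = numbers[0], numbers[-1]
        if len(numbers) >= 2:  # quantiles() needs at least two data points
            profile["quartiles"] = quantiles(numbers, n=4)
    return profile


def duplicate_row_pct(rows):
    """Percentage of rows that repeat an earlier row exactly."""
    counts = Counter(tuple(sorted(row.items(), key=lambda kv: kv[0])) for row in rows)
    duplicates = sum(count - 1 for count in counts.values())
    return 100 * duplicates / len(rows)


for column in ("hotel", "lead_time", "agent"):
    print(column, profile_column(rows, column))
print("% of duplicate rows:", duplicate_row_pct(rows))
```

Running the same functions before and after a transformation gives two profiles that can be compared, which is one simple way to measure the effect of a preparation step.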

## Hotel booking - A data preparation & profiling pipeline

To demonstrate the potential of data quality profiling throughout the data preparation process, the example below uses the [Hotel Booking dataset available on Kaggle](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand). The dataset contains booking information for a city hotel and a resort hotel.
In this tutorial we deliver a full data preparation pipeline for training a classification model that identifies whether a given reservation will be canceled. The pipeline includes the validation of the train/validation split (are the samples representative and comparable?) as well as the profiling of all the data transformations and features extracted throughout the process.
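
As a rough illustration of what checking that a split is representative can mean, the sketch below splits a hypothetical `is_canceled` target class by class and compares the share of each class in the two samples. The function names, the 25% validation fraction and the toy labels are assumptions for the example. The tutorial's pipeline may perform this check differently.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, validation_fraction=0.25, seed=0):
    """Return (train_idx, validation_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, validation_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_validation = round(len(indices) * validation_fraction)
        validation_idx.extend(indices[:n_validation])
        train_idx.extend(indices[n_validation:])
    return sorted(train_idx), sorted(validation_idx)


def class_proportions(labels):
    """Share of each class in a sample."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


def max_proportion_gap(labels_a, labels_b):
    """Largest absolute difference in class share between two samples."""
    p_a, p_b = class_proportions(labels_a), class_proportions(labels_b)
    return max(abs(p_a.get(k, 0.0) - p_b.get(k, 0.0)) for k in set(p_a) | set(p_b))


# Hypothetical target column: 1 = canceled, 0 = not canceled.
is_canceled = [1] * 40 + [0] * 60
train_idx, validation_idx = stratified_split(is_canceled)
train_labels = [is_canceled[i] for i in train_idx]
validation_labels = [is_canceled[i] for i in validation_idx]
gap = max_proportion_gap(train_labels, validation_labels)
print(f"train={len(train_labels)} validation={len(validation_labels)} gap={gap:.3f}")
```

A gap close to zero suggests the two samples have a comparable target distribution. The same comparison can be repeated for any other column whose distribution should be preserved by the split.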

## Read data

In [10]:
# Import the necessary packages
from pickle import dump

from ydata.platform.datasources import DataSources
from ydata.metadata import Metadata

In [2]:
# Creating a Dataset from the Data Source
datasource = DataSources.get(uid='8a730a1d-d756-4768-a79d-663c86fa2af0', namespace='c69936a5-a14e-43aa-bcc7-f7e95709bb9b')
dataset = datasource.read()
# Quickly previewing the Dataset
dataset.head()

| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | | | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | | | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | | | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |


## Metadata calculation
Calculate the overall statistics of the dataset under analysis.

In [3]:
# Creating a Metadata (where warnings can be reviewed) from the Dataset
metadata = Metadata(dataset)
print(metadata)

  warn("Datasets other than Timeseries don't make use of dataset_attrs")


Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 32
% of duplicate rows: 26
Target column:

Column detail:
                            Column    Data type Variable type
0                            hotel  categorical        string
1                      is_canceled    numerical           int
2                        lead_time    numerical           int
3                arrival_date_year    numerical           int
4               arrival_date_month  categorical        string
5         arrival_date_week_number    numerical           int
6        arrival_date_day_of_month    numerical           int
7          stays_in_weekend_nights    numerical           int
8             stays_in_week_nights    numerical           int
9                           adults    numerical           int
10                        children    numerical         float
11                          babies    numerical           

### Update metadata

In [4]:
dataset.astype('reservation_status_date', 'datetime')

In [5]:
metadata = Metadata(dataset)

  warn("Datasets other than Timeseries don't make use of dataset_attrs")


In [7]:
# Update the metadata here
metadata.update_datatypes({'is_canceled': 'categorical'})

## Pipeline outputs

In [8]:
# Saving the dataset metadata
metadata.save('bookings_metadata.pkl')

In [11]:
from pickle import dump

with open('bookings.pkl', 'wb') as file:
    dump(dataset, file)
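
A later pipeline step can read the pickled object back with `pickle.load`. The sketch below shows the round trip with a plain dictionary standing in for the dataset; the stand-in object and the temporary directory are assumptions made so the example runs anywhere. Unpickling a real dataset requires the package that defines its class to be importable, and pickle files should only be loaded from trusted sources.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the object saved by the pipeline.
artifact = {"columns": ["hotel", "is_canceled"], "n_rows": 5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "bookings.pkl"

    # Write the object in binary mode, as in the cell above.
    with open(path, "wb") as file:
        pickle.dump(artifact, file)

    # Read it back in binary mode.
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == artifact)
```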