# **Milestone 3**
---

- **Name:** Wilson

- **Batch:** HCK-016

- **Objective:** This program validates the cleaned data using the Great Expectations library

- **Dataset overview:** The dataset contains information about apps released on the Google Play Store

# **1. Import Libraries**

In [1]:
from great_expectations.data_context import FileDataContext

# **2. Initialization**
- Create data context
- Create data source
- Create data asset
- Build batch request
- Create expectation suite
- Create validator

In [2]:
# Create a data context
context = FileDataContext.create(project_root_dir='./')

# Give a name to a Datasource. This name must be unique among Datasources.
datasource_name = 'csv_data_m3'
# Delete the datasource only if it already exists, so 'Run All' can be executed more than once
if datasource_name in context.datasources:
    context.delete_datasource(datasource_name)
datasource = context.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'data_asset_m3'
path_to_data = 'P2M3_wilson_data_clean.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()

# Create an expectation suite
expectation_suite_name = 'expectation_suite_m3'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using the expectation suite above
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)

# Check the validator
validator.head()

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,primary_genre,secondary_genre,genres,last_updated,current_ver,android_ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19000000.0,10000,Free,0.0,Everyone,Art & Design,-,['Art & Design'],2018-01-07,1.0.0,4.0.3 and up
1,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8700000.0,5000000,Free,0.0,Everyone,Art & Design,-,['Art & Design'],2018-08-01,1.2.4,4.0.3 and up
2,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25000000.0,50000000,Free,0.0,Teen,Art & Design,-,['Art & Design'],2018-06-08,Varies with device,4.2 and up
3,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2800000.0,100000,Free,0.0,Everyone,Art & Design,Creativity,"['Art & Design', 'Creativity']",2018-06-20,1.1,4.4 and up
4,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5600000.0,50000,Free,0.0,Everyone,Art & Design,-,['Art & Design'],2017-03-26,1,2.3 and up


# **3. Data Validation**

## **3.1. Column 'app' Should be Unique**
- To prevent duplicate app entries

In [3]:
validator.expect_column_values_to_be_unique('app')

{
  "result": {
    "element_count": 8189,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "success": true,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
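
As a rough illustration of what this check means, the sketch below restates the idea in plain Python. It is not how Great Expectations computes the result; the helper name `find_duplicates` and the sample app names are hypothetical.

```python
from collections import Counter

def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value, count in counts.items() if count > 1]

# Hypothetical sample, not rows from the real dataset
sample_apps = ['Sketch', 'Paper flowers', 'Sketch', 'Pixel Draw']
duplicates = find_duplicates(sample_apps)
print(duplicates)
```

An empty list corresponds to the `"unexpected_count": 0` case shown above.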

## **3.2. Column 'rating' Should be Between 1.0 and 5.0**
- To keep ratings within bounds (1 star is the lowest, 5 stars is the highest)

In [4]:
validator.expect_column_values_to_be_between('rating', 1.0, 5.0)

{
  "result": {
    "element_count": 8189,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "success": true,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
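
The range check can be sketched in plain Python as follows. This is only an illustration under two assumptions that match the library's default behaviour as far as we know: both bounds are inclusive, and missing values are counted separately rather than as unexpected. The helper `out_of_range` and the sample ratings are hypothetical, and `None` stands in for a missing value.

```python
def out_of_range(values, min_value, max_value):
    """Return the non-missing values outside the inclusive range [min_value, max_value]."""
    return [
        v for v in values
        if v is not None and not (min_value <= v <= max_value)
    ]

# Hypothetical sample, not rows from the real dataset
sample_ratings = [4.1, 4.7, None, 5.0, 1.0, 5.5, 0.0]
unexpected = out_of_range(sample_ratings, 1.0, 5.0)
print(unexpected)
```

Values equal to 1.0 or 5.0 pass, so only 5.5 and 0.0 are reported for this sample.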

## **3.3. Column 'type' Should be either 'Free' or 'Paid'**
- To reject any type outside the allowed options

In [5]:
validator.expect_column_values_to_be_in_set('type', ['Free', 'Paid'])

{
  "result": {
    "element_count": 8189,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "success": true,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

## **3.4. Column 'size' Should be a Floating-Point Data Type**
- To ensure app sizes are stored as numbers rather than unreadable text

In [6]:
validator.expect_column_values_to_be_in_type_list('size', ['float'])

{
  "result": {
    "observed_value": "float64"
  },
  "success": true,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

## **3.5. Column 'content_rating' Should Not Exceed 20 Characters**
- To prevent content rating values that are too long

In [7]:
validator.expect_column_value_lengths_to_be_between('content_rating', 0, 20)

{
  "result": {
    "element_count": 8189,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "success": true,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

## **3.6. Columns Count Is As Expected**
- To detect any missing or extra columns

In [9]:
validator.expect_table_column_count_to_equal(15)

{
  "result": {
    "observed_value": 15
  },
  "success": true,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
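
One way to arrive at the expected column count without hard-coding a guess is to read the CSV header directly. The sketch below uses only the standard library; the helper `count_columns` is hypothetical, and a small in-memory sample stands in for the real file.

```python
import csv
import io

def count_columns(csv_file):
    """Count the fields in the header row of an open CSV file."""
    header = next(csv.reader(csv_file))
    return len(header)

# Hypothetical three-column sample standing in for the real CSV file
sample_csv = io.StringIO('app,category,rating\nSketch,ART_AND_DESIGN,4.5\n')
n_columns = count_columns(sample_csv)
print(n_columns)
```

For a file on disk, the same function can be called on a handle opened with `open(path, newline='')`.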

## **3.7. Rows Count Is As Expected**
- To detect any missing or extra rows

In [10]:
validator.expect_table_row_count_to_equal(8189)

{
  "result": {
    "observed_value": 8189
  },
  "success": true,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
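
Every result printed in this section carries a top-level `success` flag. If the results are collected as JSON in that shape, failures can be listed in one pass, as in the minimal sketch below. The helper `failed_results` and the two sample results are hypothetical and only mimic the structure of the outputs above.

```python
import json

def failed_results(results):
    """Return the results whose 'success' flag is missing or false."""
    return [r for r in results if not r.get('success', False)]

# Two hypothetical results in the JSON shape printed above
raw = '''
[
  {"result": {"observed_value": 15}, "success": true},
  {"result": {"element_count": 10, "unexpected_count": 3}, "success": false}
]
'''
results = json.loads(raw)
failures = failed_results(results)
print(len(failures), 'of', len(results), 'checks failed')
```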

# **4. Save Expectation Suite**
- Saving the suite lets us reuse this set of expectations
- `discard_failed_expectations` is set to False so that failed expectations are not lost

In [11]:
validator.save_expectation_suite(discard_failed_expectations=False)

# **5. Create Checkpoint**
- A checkpoint bundles the validation of one or more batches against one or more expectation suites

In [12]:
# Create a checkpoint
checkpoint_m3 = context.add_or_update_checkpoint(
    name = 'checkpoint_m3',
    validator = validator,
)

# Test run the checkpoint
checkpoint_result = checkpoint_m3.run()

# **6. Build Documentation**
- Translate expectations and validation results into human-readable documentation

In [13]:
# Build data docs
context.build_data_docs()

{'local_site': 'file://d:\\hacktiv8\\milestone\\p2-ftds016-hck-m3-weewoo2636\\gx\\uncommitted/data_docs/local_site/index.html'}