# **1. Introduction**

**Milestone 3**

Name  : Tasya Amalia

Batch : FTDS-016-HCK

This assignment validates data by checking its accuracy, consistency, and completeness using Great Expectations.


# **2. Import Libraries**

In [2]:
# import pandas as pd
import great_expectations as ge
from great_expectations.data_context import FileDataContext



**Explanation:**

Imports the libraries required for this assignment.

# **3. Data Loading**

In [3]:
context = FileDataContext.create(project_root_dir='./')

In [4]:
# Name the datasource
datasource_name = 'online_shopping_dataset'
# Remove an existing datasource with the same name so the cell can be re-run
if datasource_name in context.datasources:
    context.delete_datasource(datasource_name)
datasource = context.sources.add_pandas(datasource_name)

# Name the data asset
asset_name = 'online_shopping_dataset'
path_to_data = 'P2M3_Tasya_Amalia_Data_Clean.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

batch_request = asset.build_batch_request()

# **4. Great Expectations**

In [5]:
# Create an expectation suite
expectation_suite_name = 'expectation_online_shopping_dataset'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using above expectation suite
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)

# Check the validator
validator.head()

Unnamed: 0,unnamed 0,gender,location,tenure_months,transaction_date,product_category,quantity,avg_price,delivery_charges,coupon_status,...,offline_spend,online_spend,month,discount_pct,total_item_price,discount_amount,discounted_price,net_price,total_spend,revenue_unit
0,0,Male,Chicago,12,2019-01-01,Nest-USA,1,153.71,6.5,Used,...,4500.0,2424.5,1,10,153.71,15.371,138.339,144.839,144.839,144.839
1,1,Male,Chicago,12,2019-01-01,Nest-USA,1,153.71,6.5,Used,...,4500.0,2424.5,1,10,153.71,15.371,138.339,144.839,144.839,144.839
2,2,Male,Chicago,12,2019-01-01,Nest-USA,2,122.77,6.5,Not Used,...,4500.0,2424.5,1,10,245.54,24.554,220.986,227.486,227.486,113.743
3,3,Male,Chicago,12,2019-01-01,Nest-USA,1,81.5,6.5,Clicked,...,4500.0,2424.5,1,10,81.5,8.15,73.35,79.85,79.85,79.85
4,4,Male,Chicago,12,2019-01-01,Nest-USA,1,153.71,6.5,Clicked,...,4500.0,2424.5,1,10,153.71,15.371,138.339,144.839,144.839,144.839


### **4.1 Expect each column value to be unique**

In [6]:
# Expectation 1 : Column `unnamed 0` must be unique

validator.expect_column_values_to_be_unique('unnamed 0')

{
  "success": true,
  "result": {
    "element_count": 52524,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Explanation**:

The validation result shows that every value in the `unnamed 0` column meets the specified criterion: all values in the column are unique.
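
As a rough illustration of what a uniqueness check looks for, the sketch below counts entries that share their value with at least one other entry in a plain Python list. The helper `count_non_unique` and the sample values are hypothetical, and this is not a description of how Great Expectations computes its metrics internally.

```python
from collections import Counter

def count_non_unique(values):
    """Return how many entries share their value with at least one other entry."""
    counts = Counter(values)
    return sum(n for n in counts.values() if n > 1)

# Hypothetical samples: an index-like column and a column with repeats
sample_ids = [0, 1, 2, 3, 4]
sample_with_dupes = [0, 1, 1, 3, 3]

unexpected_unique = count_non_unique(sample_ids)
unexpected_dupes = count_non_unique(sample_with_dupes)
```

A count of zero corresponds to the `"unexpected_count": 0` reported for `unnamed 0` above.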

### **4.2 Expect the column entries to be between a minimum value and a maximum value**

In [7]:
# Expectation 2 : Column `revenue_unit` must be between 0.380 and 540.832

validator.expect_column_values_to_be_between(
    column='revenue_unit', min_value=0.380, max_value=540.832)

{
  "success": true,
  "result": {
    "element_count": 52524,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Explanation**:

The validation result shows that every value in the `revenue_unit` column meets the specified criterion: all values lie between 0.380 and 540.832.
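
To make the fields of the result easier to read, here is a minimal sketch of a range check that returns a summary with the same key names as the output above. The helper `summarize_between` is hypothetical; the sample values are the `revenue_unit` entries from the five preview rows, not the full dataset.

```python
def summarize_between(values, min_value, max_value):
    """Count values outside the inclusive range [min_value, max_value]."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    total = len(values)
    return {
        "element_count": total,
        "unexpected_count": len(unexpected),
        "unexpected_percent": 100.0 * len(unexpected) / total if total else 0.0,
    }

# revenue_unit values from the five preview rows
sample_revenue = [144.839, 144.839, 113.743, 79.85, 144.839]
between_summary = summarize_between(sample_revenue, 0.380, 540.832)
```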

### **4.3 Expect each column value to be in a given set**

In [8]:
# Expectation 3: The `gender` column must contain only 'Male' or 'Female'
gender_gx = ['Male', 'Female']
validator.expect_column_values_to_be_in_set('gender', gender_gx)

{
  "success": true,
  "result": {
    "element_count": 52524,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Explanation**:

The validation result shows that every value in the `gender` column meets the specified criterion: the column contains only the values Male and Female.
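
A set-membership check is an exact, case-sensitive comparison, which is worth keeping in mind for categorical columns. The sketch below shows the idea in plain Python; the helper `values_not_in_set` and the lowercase `'male'` entry are hypothetical and do not come from the dataset.

```python
def values_not_in_set(values, allowed):
    """Return the values that are not members of the allowed set."""
    allowed = set(allowed)
    return [v for v in values if v not in allowed]

# 'male' (lowercase) is a made-up entry to show that matching is exact
unexpected_gender = values_not_in_set(["Male", "Female", "male"], ["Male", "Female"])
```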

### **4.4 Expect a column to contain values of a specified data type**

In [9]:
# Expectation 4: Column `quantity` must be either int or float
validator.expect_column_values_to_be_in_type_list(
    column='quantity',
    type_list=['int64', 'float64'])

{
  "success": true,
  "result": {
    "observed_value": "int64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Explanation**:

The validation result shows that the `quantity` column meets the specified criterion: its data type is one of the expected types, in this case int64.

### **4.5 Expect the values in column A to be greater than column B**

In [17]:
# Expectation 5 : Column `total_item_price` must be greater than column `discounted_price`
validator.expect_column_pair_values_a_to_be_greater_than_b(
    column_A='total_item_price',
    column_B='discounted_price'
)

{
  "success": true,
  "result": {
    "element_count": 52524,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Explanation**:

The validation result shows that every value in the `total_item_price` column meets the specified criterion: each value is greater than the corresponding value in the `discounted_price` column.
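
The comparison is row by row and strict, so a row with no discount (both prices equal) would count as unexpected unless equality is allowed. The sketch below illustrates this in plain Python. The helper `pair_violations` is hypothetical, the first three pairs come from the preview rows, and the last pair (10.0, 10.0) is a made-up zero-discount row.

```python
def pair_violations(col_a, col_b, or_equal=False):
    """Return row positions where A is not greater than (or equal to) B."""
    bad = []
    for i, (a, b) in enumerate(zip(col_a, col_b)):
        ok = a >= b if or_equal else a > b
        if not ok:
            bad.append(i)
    return bad

# First three pairs are from the preview rows; the last pair is hypothetical
total_item_price = [153.71, 245.54, 81.5, 10.0]
discounted_price = [138.339, 220.986, 73.35, 10.0]

strict_violations = pair_violations(total_item_price, discounted_price)
relaxed_violations = pair_violations(total_item_price, discounted_price, or_equal=True)
```

Since the notebook's strict check reports no unexpected rows, `total_item_price` exceeds `discounted_price` on every row of the data.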

### **4.6 Expect the column median to be between a minimum value and a maximum value**

In [18]:
# Expectation 6 : The median of column `avg_price` must be between 1.000 and 20.000

validator.expect_column_median_to_be_between(
    column='avg_price', min_value=1.000, max_value=20.000)

{
  "success": true,
  "result": {
    "observed_value": 16.99
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Explanation**:

The validation result shows that the `avg_price` column meets the specified criterion: its median, 16.99, lies between 1.000 and 20.000.
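
This expectation constrains only the median, not the individual rows, so single values may fall well outside the range. The sketch below uses the standard-library `statistics.median` to show this. The helper `median_between` and the list `mixed_prices` are hypothetical; `preview_prices` holds the `avg_price` values from the five preview rows.

```python
from statistics import median

def median_between(values, min_value, max_value):
    """Check whether the median of values lies in [min_value, max_value]."""
    return min_value <= median(values) <= max_value

# avg_price values from the five preview rows (all far above 20)
preview_prices = [153.71, 153.71, 122.77, 81.5, 153.71]
# Hypothetical sample whose median is 16.99 although one value exceeds 20
mixed_prices = [2.0, 16.99, 153.71]

preview_ok = median_between(preview_prices, 1.0, 20.0)
mixed_ok = median_between(mixed_prices, 1.0, 20.0)
```

The preview rows alone would fail the check, while the full column passes because its median is 16.99.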

### **4.7 Expect the column entries to be strings that match a given regular expression**

In [27]:
# Expectation 7 : every value in column `coupon_status` must match a regular expression
# (here: the value ends with the letter 'd', as in 'Used', 'Not Used', 'Clicked')

validator.expect_column_values_to_match_regex(
    column='coupon_status',
    regex=r'd$',
)

{
  "success": true,
  "result": {
    "element_count": 52524,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Explanation**:

The validation result shows that every value in the `coupon_status` column matches the given pattern.
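
For intuition, a regex check of this kind can be sketched with Python's `re` module. The pattern `d$` (value ends with the letter "d") is an assumption about the intended rule, based on the `coupon_status` values visible in the preview; the helper `values_not_matching` and the extra value `'Pending'` are hypothetical.

```python
import re

def values_not_matching(values, regex):
    """Return the values in which the regex does not match anywhere."""
    pattern = re.compile(regex)
    return [v for v in values if pattern.search(str(v)) is None]

# coupon_status values seen in the preview rows
coupon_values = ["Used", "Not Used", "Clicked"]

unmatched_preview = values_not_matching(coupon_values, r"d$")
# 'Pending' is a made-up value that does not end with 'd'
unmatched_extra = values_not_matching(coupon_values + ["Pending"], r"d$")
```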