'''

**=======================================================================================================**

**MILESTONE 3**

*NAME*  : Muhammad Rofi Seno Aji

*BATCH* : HCK-008

**This program validates the already-cleaned dataset using Great Expectations.**

**=======================================================================================================**

'''

# ***INSTANTIATE DATA CONTEXT***

In [1]:
# Create a data context
from great_expectations.data_context import FileDataContext
context = FileDataContext.create(project_root_dir='./')

# ***CONNECT TO A DATASOURCE***

In [5]:
# Give a name to a Datasource. This name must be unique between Datasources.
datasource_name = 'data_m3_fix'
datasource = context.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'birdstrikes_fix'
path_to_data = 'D:\\Hacktiv8\\Phase2\\Milestone\\coba_coba\\P2M3_muhammad_rofi_data_clean.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()
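The absolute Windows path above resolves only on the original machine. As a minimal sketch, assuming the cleaned CSV sits in the folder the notebook is run from (an assumption, the notebook does not state this), the path could instead be built with `pathlib`:

```python
from pathlib import Path

# Assumption: the cleaned CSV is stored in the current working directory.
# The file name comes from the cell above; the folder choice is hypothetical.
data_dir = Path.cwd()
path_to_data = data_dir / "P2M3_muhammad_rofi_data_clean.csv"

# Convert to str when an API expects a plain string path
path_to_data_str = str(path_to_data)
```

Building the path from a base directory keeps the notebook portable across operating systems, since `pathlib` inserts the correct separator.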

# ***CREATE AN EXPECTATION SUITE***

In [6]:
# Create an expectation suite
expectation_suite_name = 'expectation-data-m3_fix'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using above expectation suite
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)
# Check the validator
validator.head()

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

| Unnamed: 0 | record_id | aircraft_type | airport_name | altitude_bin | aircraft_make_model | wildlife_number_struck | wildlife_number_struck_actual | flightdate | effect_indicated_damage | aircraft_number_of_engines | ... | remains_of_wildlife_collected | remains_of_wildlife_sent_to_smithsonian | wildlife_size | conditions_sky | wildlife_species | pilot_warned_of_birds_or_wildlife | cost_total | feet_above_ground | number_of_people_injured | is_aircraft_large |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 202152 | Airplane | LAGUARDIA NY | > 1000 ft | B-737-400 | Over 100 | 859 | 2000-11-23T00:00:00 | Caused damage | 2 | ... | False | False | Medium | No Cloud | Unknown bird - medium | False | 30736 | 1500.0 | 0 | True |
| 1 | 208159 | Airplane | DALLAS/FORT WORTH INTL ARPT | < 1000 ft | MD-80 | Over 100 | 424 | 2001-07-25T00:00:00 | Caused damage | 2 | ... | False | False | Small | Some Cloud | Rock pigeon | True | 0 | 0.0 | 0 | False |
| 2 | 207601 | Airplane | LAKEFRONT AIRPORT | < 1000 ft | C-500 | Over 100 | 261 | 2001-09-14T00:00:00 | No damage | 2 | ... | False | False | Small | No Cloud | European starling | False | 0 | 50.0 | 0 | False |
| 3 | 215953 | Airplane | SEATTLE-TACOMA INTL | < 1000 ft | B-737-400 | Over 100 | 806 | 2002-09-05T00:00:00 | No damage | 2 | ... | True | False | Small | Some Cloud | European starling | True | 0 | 50.0 | 0 | True |
| 4 | 219878 | Airplane | NORFOLK INTL | < 1000 ft | CL-RJ100/200 | Over 100 | 942 | 2003-06-23T00:00:00 | No damage | 2 | ... | False | False | Small | No Cloud | European starling | False | 0 | 50.0 | 0 | False |


# ***VALIDATE DATA***

## EXPECTATION 1: COLUMN FEET ABOVE GROUND CANNOT CONTAIN MISSING VALUES

In [7]:
validator.expect_column_values_to_not_be_null('feet_above_ground')

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 24747,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**EXPLANATION:**

In analyzing birdstrike incidents in this dataset, which covers the period between 1999 and 2009, the 'feet_above_ground' column, which represents the altitude of the aircraft during each incident, must not contain missing values. This matters for several scholarly and practical reasons:

1. **Incident Severity Assessment:** 'feet_above_ground' provides critical information about the altitude of the aircraft at the time of the birdstrike, facilitating an accurate assessment of incident severity.

2. **Risk Evaluation:** Accurate data on the altitude of birdstrikes is indispensable for evaluating the risk associated with these incidents. Incomplete data can lead to biased risk assessments and inadequate safety measures.

3. **Regulatory Compliance:** The aviation industry is subject to strict regulatory reporting requirements, necessitating comprehensive and accurate data for compliance with authorities such as the Federal Aviation Administration (FAA).

4. **Research and Prevention:** Scholars and aviation safety experts rely on complete data to analyze birdstrike incidents systematically, identify patterns, and develop effective prevention strategies.

5. **Safety Enhancement:** Understanding the altitude at which birdstrikes occur is fundamental for devising safety improvements, modifying aircraft, developing prevention technologies, and implementing operational changes to reduce the occurrence and impact of birdstrikes.
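To make the completeness check concrete, here is a minimal plain-Python sketch of the kind of counting behind the result block above. It is an illustration only, not the Great Expectations implementation; the function name and the sample values are hypothetical.

```python
import math

def summarize_not_null(values):
    """Count missing entries (None or NaN) and report them like the result block."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    element_count = len(values)
    unexpected_count = sum(1 for v in values if is_missing(v))
    percent = 100.0 * unexpected_count / element_count if element_count else 0.0
    return {
        "success": unexpected_count == 0,
        "element_count": element_count,
        "unexpected_count": unexpected_count,
        "unexpected_percent": percent,
    }

# Hypothetical sample, not taken from the dataset
sample_altitudes = [1500.0, 0.0, 50.0, None, float("nan")]
summary = summarize_not_null(sample_altitudes)
```

In this sample two of the five entries are missing, so the check fails. In the notebook's run, all 24747 rows were present and the check succeeded.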

## EXPECTATION 2: COLUMN RECORD ID MUST BE UNIQUE

In [8]:
validator.expect_column_values_to_be_unique('record_id')

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 24747,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**EXPLANATION:**

The uniqueness of the "record_id" is of paramount importance for academic and research purposes. This uniqueness serves to maintain data integrity by preventing duplication and ensuring accuracy. It also enhances data retrieval and linking capabilities, simplifies data analysis, and aids in compliance with regulatory standards. In an academic context, a unique identifier like "record_id" is crucial for the accurate and rigorous study of birdstrike accidents, enabling comprehensive research, analysis, and reporting.
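As a rough illustration of what a uniqueness check looks for, the sketch below counts repeated identifiers with `collections.Counter`. The helper name and the sample list are hypothetical. The first three IDs are copied from the data preview shown earlier, and the fourth is a deliberate repeat.

```python
from collections import Counter

def find_duplicate_ids(record_ids):
    """Return every value that occurs more than once, mapped to its count."""
    counts = Counter(record_ids)
    return {value: n for value, n in counts.items() if n > 1}

# Hypothetical sample: 208159 is repeated on purpose
sample_ids = [202152, 208159, 207601, 208159]
duplicates = find_duplicate_ids(sample_ids)
is_unique = not duplicates
```

An empty `duplicates` dict corresponds to the `"unexpected_count": 0` reported above.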

## EXPECTATION 3: COLUMN FEET ABOVE GROUND VALUE MUST BE BETWEEN 0 AND 40,000

In [10]:
validator.expect_column_values_to_be_between(column='feet_above_ground', min_value=0, max_value=40000)

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 24747,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**EXPLANATION:**

The 'feet_above_ground' column is constrained to altitudes between 0 and 40,000 feet, matching the min_value and max_value arguments passed to the expectation, for two reasons.

1. **Environmental Realism:** Birdstrike incidents, as recorded in this dataset, occur predominantly in the lower atmosphere. A value of 0 feet signifies ground level, so negative altitudes are not valid within the dataset's scope. Conversely, at altitudes far above 40,000 feet the likelihood of bird-related aviation incidents is exceedingly remote.

2. **Aircraft Operational Regime:** The dataset principally covers commercial aviation, where aircraft operate from surface level (0 feet) up to roughly 40,000 feet. This range covers the takeoff, landing, and cruising phases, which is where birdstrikes are expected to occur.
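The range rule can be sketched in plain Python using the bounds from the call above (0 and 40000), treated here as inclusive on both ends. The function name and the sample altitudes are hypothetical and serve only to show which values such a rule would flag.

```python
def out_of_range(values, min_value=0, max_value=40000):
    """Return the values outside the inclusive range [min_value, max_value]."""
    return [v for v in values if not (min_value <= v <= max_value)]

# Hypothetical sample altitudes in feet, including both boundary values
sample_altitudes = [0.0, 50.0, 1500.0, 40000.0, -10.0, 40001.0]
violations = out_of_range(sample_altitudes)
```

The boundary values 0.0 and 40000.0 are accepted. Only the negative altitude and the value above the maximum are flagged.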

## EXPECTATION 4: COLUMN NUMBER OF PEOPLE INJURED MUST BE FLOAT OR INTEGER

In [12]:
validator.expect_column_values_to_be_in_type_list('number_of_people_injured', ['int64', 'float64'])

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": "int64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**EXPLANATION:**

The column 'number_of_people_injured' should be represented as an integer or a float for the following academic and practical reasons:

1. **Data Integrity**: The number of people injured in a birdstrike incident is a discrete, countable value. It cannot be a fraction, so an integer is the natural representation. A float type is also accepted because pandas stores an integer column as float64 by default once it contains missing values.

2. **Compatibility**: Many data analysis and visualization tools expect numerical data to be of type integer or float. Representing the number of people injured as an integer or float ensures compatibility with these tools, making it easier to conduct academic research and practical analysis on the dataset.
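Because the type check also accepts floats, it does not by itself guarantee whole-number counts. The sketch below pairs a type check with a whole-number check. It works on plain Python values rather than pandas dtypes, and both helper names and the sample list are hypothetical.

```python
def all_int_or_float(values):
    """True when every value is an int or a float (bool is deliberately excluded)."""
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in values
    )

def all_whole_numbers(values):
    """True when every numeric value is whole, e.g. 2 or 2.0 but not 2.5."""
    return all(float(v).is_integer() for v in values)

# Hypothetical injury counts mixing int and float representations
sample_injuries = [0, 0, 2, 1.0]
types_ok = all_int_or_float(sample_injuries)
counts_ok = all_whole_numbers(sample_injuries)
```

A value such as 1.5 would pass the type check but fail the whole-number check, which is the case the data integrity argument is concerned with.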

## EXPECTATION 5: COLUMN ALTITUDE BIN VALUE MUST BE '> 1000 ft' OR '< 1000 ft'

In [13]:
validator.expect_column_values_to_be_in_set("altitude_bin", ['> 1000 ft', '< 1000 ft'])

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 24747,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**EXPLANATION:**

The 'altitude_bin' column is restricted to two discrete categories, altitudes below 1000 feet and altitudes above 1000 feet, for the following reasons:

1. **Regulatory Relevance**: Many aviation regulations and safety guidelines are formulated based on specific altitude thresholds. By categorizing altitudes into '< 1000 ft' and '> 1000 ft', the dataset becomes more relevant to regulatory assessments and compliance.

2. **Comparative Analysis**: This categorization facilitates comparative analysis between birdstrikes that occur at lower altitudes (e.g., during takeoff and landing) and those that occur at higher altitudes (e.g., during cruising). Such comparisons are academically valuable in understanding birdstrike patterns.

3. **Simplicity and Interpretability**: The binary categorization is straightforward and enhances the interpretability of the data. Researchers, policymakers, and aviation professionals can easily grasp the altitude-related patterns in birdstrike incidents.
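A set-membership rule compares labels exactly, so small differences in spacing or letter case count as violations. The following sketch shows this with the two labels used in the call above. The helper name and the sample values are hypothetical.

```python
ALLOWED_BINS = {"> 1000 ft", "< 1000 ft"}  # same labels as in the call above

def unexpected_bins(values, allowed=ALLOWED_BINS):
    """Return values that are not an exact match for one of the allowed labels."""
    return [v for v in values if v not in allowed]

# Hypothetical sample: the last two differ only in spacing and letter case
sample_bins = ["> 1000 ft", "< 1000 ft", "< 1000ft", "> 1000 FT"]
bad = unexpected_bins(sample_bins)
```

This is why label cleaning (consistent spacing and casing) needs to happen before validation.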

## EXPECTATION 6: COLUMN RECORD ID MUST HAVE VALUE LENGTH 6

In [14]:
validator.expect_column_value_lengths_to_equal("record_id", 6)

Calculating Metrics:   0%|          | 0/9 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 24747,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**EXPLANATION:**

The record ID column is required to contain six characters, in compliance with National Transportation Safety Board regulations, which mandate that birdstrike incident categories must be documented with a six-character ID.
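A length rule of this kind can be sketched by converting each identifier to a string and measuring it. The helper name is hypothetical. The first two sample IDs appear in the data preview shown earlier, and the last two are invented to show a failure.

```python
def wrong_length_ids(record_ids, expected_length=6):
    """Return IDs whose string form is not exactly expected_length characters."""
    return [rid for rid in record_ids if len(str(rid)) != expected_length]

# 202152 and 208159 come from the preview; 99999 and 1000000 are hypothetical
sample_ids = [202152, 208159, 99999, 1000000]
bad_ids = wrong_length_ids(sample_ids)
```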

## EXPECTATION 7: COLUMN AIRCRAFT TYPE MUST NOT BE HELICOPTER OR DRONE

In [16]:
validator.expect_column_values_to_not_be_in_set('aircraft_type', ["Helicopter", "Drone"])

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 24747,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**EXPLANATION:**

Given that this dataset exclusively pertains to birdstrikes on airplanes, the 'aircraft_type' column should contain only the value 'Airplane'. The expectation enforces this by rejecting the values 'Helicopter' and 'Drone'.
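Excluding specific values is a weaker rule than allowing only one value. The sketch below contrasts the two on a hypothetical sample. The deny-list holds the two values from the call above, while the allow-list and the helper names are assumptions made for illustration.

```python
EXCLUDED = {"Helicopter", "Drone"}  # values rejected in the call above
ALLOWED = {"Airplane"}              # assumption: the only acceptable value

def passes_exclusion(values):
    """Deny-list check: passes unless a listed value appears."""
    return all(v not in EXCLUDED for v in values)

def passes_allow_list(values):
    """Allow-list check: passes only if every value is listed."""
    return all(v in ALLOWED for v in values)

# Hypothetical sample containing a category named in neither list
sample_types = ["Airplane", "Airplane", "Glider"]
exclusion_ok = passes_exclusion(sample_types)
allow_list_ok = passes_allow_list(sample_types)
```

An unexpected category such as 'Glider' slips past the deny-list but is caught by the allow-list. If the stricter rule is wanted, `expect_column_values_to_be_in_set`, already used for 'altitude_bin' in this notebook, is one way to express it.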

## EXPECTATION 8: COLUMN AIRPORT NAME MUST HAVE VALUE LENGTH OF AT MOST 100

In [19]:
validator.expect_column_value_lengths_to_be_between('airport_name', min_value=1, max_value=100)

Calculating Metrics:   0%|          | 0/9 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 24747,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**EXPLANATION:**

The 'airport_name' column, which records the airport where each incident occurred, should be between 1 and 100 characters long for several reasons.

1. **Storage Efficiency:** A fixed maximum length for the 'airport_name' field optimizes data storage. When working with large datasets, efficient storage is critical. If the 'airport_name' field has an unrestricted length, it could lead to excessive storage requirements, which can increase costs and reduce query performance.

2. **Data Quality Control:** Imposing a character limit encourages data quality control. Longer entries may be prone to typographical errors, inconsistencies, or irrelevant information. By limiting the length to 100 characters, data entry errors are more likely to be detected and corrected, leading to a higher overall data quality.

3. **Human Readability:** A 100-character limit strikes a balance between data standardization and human readability. It is sufficiently long to accommodate most airport names and their locations while preventing excessively long or unwieldy entries that can be challenging for users to read and interpret.
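One way to sanity-check a length limit is to profile the lengths actually present. This minimal sketch uses the five airport names from the data preview shown earlier, with a hypothetical helper name, and compares the shortest and longest lengths against the 1 to 100 bounds used above.

```python
def length_profile(names):
    """Return (shortest, longest) character lengths found in names."""
    lengths = [len(name) for name in names]
    return min(lengths), max(lengths)

# Airport names copied from the data preview shown earlier
sample_airports = [
    "LAGUARDIA NY",
    "DALLAS/FORT WORTH INTL ARPT",
    "LAKEFRONT AIRPORT",
    "SEATTLE-TACOMA INTL",
    "NORFOLK INTL",
]
shortest, longest = length_profile(sample_airports)
within_limit = 1 <= shortest and longest <= 100
```

Running the same profile over the full column would show how much headroom the 100-character limit leaves.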

In [20]:
# Save into Expectation Suite
validator.save_expectation_suite(discard_failed_expectations=False)

# ***CHECKPOINT***

In [21]:
# Create a checkpoint

checkpoint_1 = context.add_or_update_checkpoint(
    name = 'checkpoint_1',
    validator = validator,
)

In [22]:
# Run a checkpoint

checkpoint_result = checkpoint_1.run()

Calculating Metrics:   0%|          | 0/48 [00:00<?, ?it/s]
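The checkpoint validates the batch against every expectation saved in the suite. As a rough illustration of how individual outcomes roll up into one pass/fail answer, the sketch below aggregates plain dicts shaped like the JSON results printed earlier. It does not use the object returned by `checkpoint_1.run()`. The function name and the sample outcomes are hypothetical.

```python
def summarize_results(results):
    """Aggregate per-expectation result dicts into an overall summary."""
    failed = [name for name, result in results.items() if not result["success"]]
    return {
        "evaluated": len(results),
        "failed": failed,
        "success": not failed,
    }

# Hypothetical, trimmed-down outcomes keyed by expectation name
sample_results = {
    "expect_column_values_to_not_be_null": {"success": True},
    "expect_column_values_to_be_unique": {"success": True},
    "expect_column_values_to_be_between": {"success": False},
}
overall = summarize_results(sample_results)
```

A single failed expectation makes the overall run fail, which is the behavior a data quality gate in a pipeline usually needs.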

# ***DATA DOCS***

In [23]:
# Build data docs

context.build_data_docs()

{'local_site': 'file://d:\\Hacktiv8\\Phase2\\Milestone\\coba_coba\\gx\\uncommitted/data_docs/local_site/index.html'}