# **Retail Transactional Dataset Analysis**

## **Background**

This dataset contains information about customer transactions in a retail store. The data includes details about:

- Customers
- Products they bought
- Payment methods
- Delivery options

As shopping habits change, it is important for stores to understand what customers want and how they shop. This information can help stores:

- Plan better marketing strategies.
- Provide better service to customers.

## **Goals of Analysis**

As a Data Analyst, my goals are:

1. **Ensure data quality**: Use tools like *Great Expectations* to check and fix data issues, ensuring accurate analysis.
2. **Prepare data for a NoSQL database**: Organize the data so that shopping patterns and customer spending are easier to query and understand.
3. **Create visualizations**: Use *Kibana* to create charts and graphs that make customer transaction data easier to understand.
4. **Provide actionable recommendations**:
   - Help the marketing team create more effective promotions.
   - Support the customer service team in improving the overall customer experience.

## **Report Users**

1. **Marketing Team**
   - Use the analysis to design promotions that match customer preferences.

2. **Customer Service Team**
   - Use insights to improve services and raise customer satisfaction.

3. **Procurement Team**
   - Identify popular products to better plan inventory and stock.

# **Great Expectations**

## **A. Introduction**

---
<b>Milestone 3</b>

Name  : Yasmine Naraindas Setiadi

Batch : FTDS HCK-024

---

## **B. Import Libraries**

In [2]:
import great_expectations as ge # Import the Great Expectations library under the alias 'ge'
from great_expectations.data_context import FileDataContext # Import the FileDataContext class from the library's data_context module

## **C. Data Loading and Validator Definition**

In [3]:
# Create a DataContext to manage all configuration and assets in the Great Expectations project.
context = FileDataContext.create(project_root_dir="./")

# Name of the data source (datasource), which uses the Pandas format.
datasource_name = "P2M3_yasminenaraindassetiadi_datasource"

# Check for an existing datasource with the same name and delete it, so the cell can be re-run.
if datasource_name in context.datasources:
    context.delete_datasource(datasource_name)

# Add the datasource.
datasource = context.sources.add_pandas(datasource_name)

# Define the asset name and the location of the CSV data file to use.
asset_name = "Cleaned Data"
path_to_data = "P2M3_yasminenaraindassetiadi_data_clean.csv"
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build a batch request to fetch data from the asset defined above.
batch_request = asset.build_batch_request()

# Create the expectation suite under the chosen name.
expectation_suite_name = "expectation-cleaned-dataset"
context.add_or_update_expectation_suite(expectation_suite_name)

# Initialize a validator to validate the data against the expectation suite.
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name=expectation_suite_name
)

## **D. Great Expectations**

### ***Expect column to be unique***

In [4]:
# Validate that the values in the "transaction_id" column are unique.
validator.expect_column_values_to_be_unique("transaction_id")

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 274920,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
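To make the fields in the result above easier to read, here is a minimal plain-Python sketch of how a uniqueness check can be summarized. It is an illustration only, not Great Expectations' actual implementation: the function name `uniqueness_summary` and the sample IDs are hypothetical, and the sketch assumes that every row whose value occurs more than once is counted as unexpected.

```python
from collections import Counter


def uniqueness_summary(values):
    """Summarize a uniqueness check over a list of non-null values."""
    counts = Counter(values)
    # Assumption: every row whose value appears more than once is unexpected.
    unexpected = [v for v in values if counts[v] > 1]
    element_count = len(values)
    percent = 100.0 * len(unexpected) / element_count if element_count else 0.0
    return {
        "success": not unexpected,
        "element_count": element_count,
        "unexpected_count": len(unexpected),
        "unexpected_percent": percent,
    }


# Hypothetical transaction IDs, not taken from the dataset.
sample_ids = ["T001", "T002", "T003", "T002"]
summary = uniqueness_summary(sample_ids)
print(summary)
```

In the notebook's result, `unexpected_count` is 0 across all 274920 rows, so no `transaction_id` is repeated.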

### ***Expect column values to be between range***

In [5]:
# Validate that the values in the "age" column fall within the given range (18 to 65).
validator.expect_column_values_to_be_between("age", min_value=18, max_value=65)

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 274920,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
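The range check can be pictured with a small plain-Python sketch. It assumes both bounds are inclusive, which matches the default behavior of `expect_column_values_to_be_between` as far as I know; the helper name `values_outside_range` and the sample ages are made up for illustration.

```python
def values_outside_range(values, min_value, max_value):
    """Return the values that fall outside [min_value, max_value]."""
    # Both bounds are treated as inclusive, so 18 and 65 themselves pass.
    return [v for v in values if not (min_value <= v <= max_value)]


# Hypothetical ages, not taken from the dataset.
sample_ages = [18, 42, 65, 17, 66]
outliers = values_outside_range(sample_ages, 18, 65)
print(outliers)
```

Only values strictly below 18 or strictly above 65 are reported, which is why the boundary ages in the sample pass.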

### ***Expect column values to be in set***

In [6]:
# Validate that the "gender" column contains only the values ["Male", "Female"].
validator.expect_column_values_to_be_in_set("gender", value_set=["Male", "Female"])

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 274920,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
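A set-membership check is an exact comparison, so differences in letter case or stray whitespace would count as unexpected values. The sketch below illustrates this in plain Python; `values_not_in_set` and the sample values are hypothetical and are not part of the notebook or of Great Expectations.

```python
def values_not_in_set(values, value_set):
    """Return the values that are not exact members of value_set."""
    allowed = set(value_set)
    return [v for v in values if v not in allowed]


# Hypothetical raw values showing typical data-entry variations.
sample_genders = ["Male", "Female", "male", "F", "Female "]
unexpected = values_not_in_set(sample_genders, ["Male", "Female"])
print(unexpected)
```

A passing result on the cleaned data therefore suggests that the labels were already normalized to exactly `"Male"` and `"Female"` during cleaning.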

### ***Expect column values to match regex***

In [7]:
# Validate that the values in the "email" column match the given regex pattern.
validator.expect_column_values_to_match_regex(
    "email", regex=r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"
)

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 274920,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
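To see what the pattern accepts and rejects, it can be tried against a few strings with Python's standard `re` module. The sample addresses below are invented for illustration; only the pattern itself comes from the cell above.

```python
import re

# Same pattern as in the expectation above.
EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

# Hypothetical sample strings, not taken from the dataset.
samples = [
    "jane.doe@example.com",
    "user+tag@mail.example.org",
    "no-at-sign.example.com",
    "user@localhost",
    "user@example.c",
]

# Map each sample to whether the pattern matches it.
results = {s: bool(EMAIL_PATTERN.search(s)) for s in samples}
for address, matched in results.items():
    print(f"{address}: {matched}")
```

The pattern checks format only: it requires an `@`, a dot in the domain, and a final label of at least two letters. It does not confirm that an address actually exists.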

### ***Expect column to not have nulls***

In [8]:
# Validate that the "customer_id" and "transaction_id" columns contain no NULL values.
# Note: the notebook displays only the result of the last call ("transaction_id").
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_not_be_null("transaction_id")

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 274920,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
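The result above has no `missing_count` field, presumably because missing values are exactly what this expectation tests. As a rough illustration of what counts as missing, the plain-Python sketch below treats `None` and `NaN` as null, while an empty string is an ordinary value. The helper names and the sample IDs are hypothetical.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing; everything else is a value."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)


def null_summary(values):
    """Summarize a not-null check over a list of values."""
    unexpected_count = sum(1 for v in values if is_missing(v))
    element_count = len(values)
    percent = 100.0 * unexpected_count / element_count if element_count else 0.0
    return {
        "success": unexpected_count == 0,
        "element_count": element_count,
        "unexpected_count": unexpected_count,
        "unexpected_percent": percent,
    }


# Hypothetical customer IDs: one None, one empty string, one NaN.
sample_customer_ids = ["C001", None, "", float("nan")]
null_report = null_summary(sample_customer_ids)
print(null_report)
```

The empty string is not counted here, so blank-but-present identifiers would need a separate check, for example a regex or a minimum-length expectation.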

### ***Expect column values to be of type***

In [9]:
# Validate that the values in the "age" column have the expected data type.
validator.expect_column_values_to_be_of_type("age", type_="int")

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": "int64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

### ***Expect table column count to equal***

In [10]:
# Validate that the table has exactly 30 columns.
validator.expect_table_column_count_to_equal(value=30)

Calculating Metrics:   0%|          | 0/3 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": 30
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
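A column-count check is a cheap guard against structural drift, such as an extra index column written by accident when a DataFrame is saved to CSV. The sketch below uses only the standard `csv` module and two tiny in-memory files; the column names and contents are hypothetical and much smaller than the real 30-column dataset.

```python
import csv
import io


def count_columns(csv_text):
    """Return the number of columns in the header row of a CSV string."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return len(header)


# Hypothetical file with the expected three columns.
clean_csv = "transaction_id,customer_id,age\nT001,C001,34\n"

# Same data, but with an unnamed leading index column in the header.
csv_with_index = ",transaction_id,customer_id,age\n0,T001,C001,34\n"

clean_count = count_columns(clean_csv)
indexed_count = count_columns(csv_with_index)
print(clean_count, indexed_count)
```

With an expected count of 3, the second file would fail the check, flagging the stray column before it reaches later analysis steps.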

## **E. Save expectations**

In [11]:
# Save the expectation suite, keeping any expectations that failed validation.
validator.save_expectation_suite(discard_failed_expectations=False)

# Print a message confirming that the expectation suite was validated and saved.
print("Expectations validated and saved.")


Expectations validated and saved.
