# Cleaning: Dirty Cafe Sales Data

This notebook focuses exclusively on data preparation and cleaning. The input is a dataset of café transactions that is deliberately dirty: it contains missing values, inconsistent formats, and text errors.

### 📂 Dataset Info
* **Source:** [Kaggle - Dirty Cafe Sales Dataset](https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training)
* **File:** `dirty_cafe_sales.csv`
* **Description:** Synthetic dirty data created specifically for cleaning practice.

## Loading data and libraries

In [22]:
# Libraries
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("dirty_cafe_sales.csv")

# Number of rows and columns
print(f"Loaded Dataset: {df.shape[0]} rows, {df.shape[1]} columns")

Loaded Dataset: 10000 rows, 8 columns


## Basic data inspection

In [23]:
df.head()

| | Transaction ID | Item | Quantity | Price Per Unit | Total Spent | Payment Method | Location | Transaction Date |
|---|---|---|---|---|---|---|---|---|
| 0 | TXN_1961373 | Coffee | 2 | 2.0 | 4.0 | Credit Card | Takeaway | 2023-09-08 |
| 1 | TXN_4977031 | Cake | 4 | 3.0 | 12.0 | Cash | In-store | 2023-05-16 |
| 2 | TXN_4271903 | Cookie | 4 | 1.0 | ERROR | Credit Card | In-store | 2023-07-19 |
| 3 | TXN_7034554 | Salad | 2 | 5.0 | 10.0 | UNKNOWN | UNKNOWN | 2023-04-27 |
| 4 | TXN_3160411 | Coffee | 2 | 2.0 | 4.0 | Digital Wallet | In-store | 2023-06-11 |


In [24]:
# Printing names of columns
print(df.columns.tolist())

# Renaming columns for consistency
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# Checking the correctness of names
print(df.columns.tolist())

['Transaction ID', 'Item', 'Quantity', 'Price Per Unit', 'Total Spent', 'Payment Method', 'Location', 'Transaction Date']
['transaction_id', 'item', 'quantity', 'price_per_unit', 'total_spent', 'payment_method', 'location', 'transaction_date']


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Transaction ID    10000 non-null  object
 1   Item              9667 non-null   object
 2   Quantity          9862 non-null   object
 3   Price Per Unit    9821 non-null   object
 4   Total Spent       9827 non-null   object
 5   Payment Method    7421 non-null   object
 6   Location          6735 non-null   object
 7   Transaction Date  9841 non-null   object
dtypes: object(8)
memory usage: 625.1+ KB


🚩 **Initial Findings**
Based on the `df.info()` output, immediate cleaning is required for data types:

* **Numeric Columns (`Quantity`, `Price Per Unit`, `Total Spent`):** Currently stored as `object` (string) rather than a numeric dtype, most likely because of non-numeric placeholders such as the `ERROR` value in `Total Spent` shown by `df.head()`.

* **Date Column (`Transaction Date`):** Currently stored as `object`. Needs conversion to `datetime` format for time-based analysis.
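
The two type fixes listed above could be sketched as follows. This is a minimal illustration, not the notebook's final cleaning code: `sample` is a hypothetical four-row frame that uses the renamed snake_case columns and the `ERROR` / `UNKNOWN` placeholders seen in `df.head()`, and the `%Y-%m-%d` date format is assumed from the rows shown there.

```python
import pandas as pd

# Hypothetical sample mimicking the renamed columns and the placeholder
# strings ("ERROR", "UNKNOWN") visible in the df.head() output.
sample = pd.DataFrame({
    "quantity": ["2", "4", "ERROR", None],
    "price_per_unit": ["2.0", "3.0", "1.0", "UNKNOWN"],
    "total_spent": ["4.0", "12.0", "ERROR", "10.0"],
    "transaction_date": ["2023-09-08", "2023-05-16", "UNKNOWN", "2023-04-27"],
})

numeric_cols = ["quantity", "price_per_unit", "total_spent"]

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
sample[numeric_cols] = sample[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Same idea for dates: entries that do not match the format become NaT.
sample["transaction_date"] = pd.to_datetime(
    sample["transaction_date"], format="%Y-%m-%d", errors="coerce"
)

print(sample.dtypes)
print(sample.isna().sum())
```

Coercing instead of raising means the placeholders show up as ordinary missing values afterwards, so they can be handled together with the gaps that were empty from the start.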

### Duplicates and missing values

#### Duplicates

In [12]:
print(f"Number of duplicated rows: {df.duplicated().sum()}")

Number of duplicated rows: 0
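
`df.duplicated()` only flags rows that are identical in every column, so a count of zero does not by itself show that each transaction ID is unique. A complementary check on the key column might look like the sketch below; the frame and its IDs are hypothetical and only illustrate the difference between the two checks.

```python
import pandas as pd

# Hypothetical data: no two rows are fully identical,
# but one transaction ID appears twice.
sample = pd.DataFrame({
    "transaction_id": ["TXN_1", "TXN_2", "TXN_2"],
    "item": ["Coffee", "Cake", "Cookie"],
})

# Rows that repeat an earlier row in every column.
full_row_dupes = sample.duplicated().sum()

# Rows whose ID repeats an earlier ID, regardless of the other columns.
id_dupes = sample["transaction_id"].duplicated().sum()

print(f"Fully duplicated rows: {full_row_dupes}")
print(f"Repeated transaction IDs: {id_dupes}")
```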


#### Missing values

In [None]:
print(f"Number of missing values:\n{df.isna().sum()}")

Number of missing values: 
Transaction ID         0
Item                 333
Quantity             138
Price Per Unit       179
Total Spent          173
Payment Method      2579
Location            3265
Transaction Date     159
dtype: int64


📉 **Missing Values Summary**

The output above reveals significant gaps in the dataset that require immediate attention:

* **High Severity:** The `Location` (3265 missing) and `Payment Method` (2579 missing) columns are heavily compromised, missing roughly **33%** and **26%** of their values, respectively. Dropping every row with a gap in either column would discard at least 3265 of the 10000 rows.

* **Moderate Severity:** Essential operational columns like `Item` (333 missing), `Price Per Unit` (179 missing), and `Quantity` (138 missing) also have gaps. Since these are required for calculating total sales, we cannot simply ignore them.
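
The severity figures above can be expressed as percentages directly. The sketch below also counts placeholder strings, because `isna()` does not treat text such as `ERROR` or `UNKNOWN` as missing. The `sample` frame is hypothetical, and the assumption that these two strings are the only placeholders comes solely from the rows visible in `df.head()`.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be the full `df`.
sample = pd.DataFrame({
    "payment_method": ["Credit Card", "Cash", "UNKNOWN", None],
    "location": ["Takeaway", None, "UNKNOWN", "ERROR"],
})

# Share of true NaN values per column, in percent.
missing_pct = sample.isna().mean().mul(100).round(1)

# Placeholder strings are not NaN, so count them separately.
placeholders = ["ERROR", "UNKNOWN"]  # assumed list of placeholder values
placeholder_counts = sample.isin(placeholders).sum()

summary = pd.DataFrame({
    "missing_pct": missing_pct,
    "placeholders": placeholder_counts,
})
print(summary)
```

If the placeholder counts turn out to be large on the real data, the effective share of unusable values is higher than the `isna()` totals suggest.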

> **Strategy Required:** Simple removal (`dropna`) is not a viable option for columns like `Location` as we would lose too much data. We will need to combine **imputation** (filling gaps with "Unknown" or statistical averages) with **row removal** for critical missing operational data.
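
One possible shape for that combined strategy is sketched below on a hypothetical frame with already-numeric columns. Two assumptions are worth flagging: the placeholder list (`ERROR`, `UNKNOWN`) is taken from the `df.head()` output only, and the numeric recovery relies on `total_spent = quantity * price_per_unit`, which holds for the complete rows shown in `df.head()` but has not been verified for the whole dataset. It is used here in place of a statistical average.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with numeric columns already converted.
sample = pd.DataFrame({
    "item": ["Coffee", None, "Cake", "Salad"],
    "quantity": [2.0, 4.0, np.nan, np.nan],
    "price_per_unit": [2.0, 1.0, 3.0, np.nan],
    "total_spent": [np.nan, 4.0, 12.0, 10.0],
    "payment_method": ["Cash", "UNKNOWN", None, "ERROR"],
    "location": [None, "In-store", "ERROR", "Takeaway"],
})

cat_cols = ["item", "payment_method", "location"]
placeholders = ["ERROR", "UNKNOWN"]  # assumed placeholder values

# 1. Imputation: blank out placeholders, then label every gap "Unknown".
is_placeholder = sample[cat_cols].isin(placeholders)
sample[cat_cols] = sample[cat_cols].mask(is_placeholder).fillna("Unknown")

# 2. Recover a missing numeric value from the other two where possible
#    (assumes total_spent = quantity * price_per_unit).
sample["total_spent"] = sample["total_spent"].fillna(
    sample["quantity"] * sample["price_per_unit"]
)
sample["quantity"] = sample["quantity"].fillna(
    sample["total_spent"] / sample["price_per_unit"]
)
sample["price_per_unit"] = sample["price_per_unit"].fillna(
    sample["total_spent"] / sample["quantity"]
)

# 3. Row removal: drop rows whose operational values are still missing.
cleaned = sample.dropna(subset=["quantity", "price_per_unit", "total_spent"])

print(cleaned)
```

Only the last row is dropped here, since two of its three numeric fields are missing and nothing can be derived from the remaining one.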

## Data Cleaning 


## Final Validation & Export