# Proof of Concept: How Data Could Be Sorted into Approved / Not Approved

In [83]:
import pandas as pd

## Interpreting the data

In [84]:
df = pd.read_csv('mock.csv')
df.head(6)

Unnamed: 0,bookingID,ticketID,price,status
0,1,1,10,not completed
1,1,1,20,completed
2,2,2,10,not completed
3,2,3,20,not completed
4,2,2,30,completed
5,2,3,20,completed
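
If mock.csv is not at hand, the six rows shown above can be rebuilt in memory. This is a minimal sketch: it assumes the leading 0-5 column in the output is the row index rather than ticket data, and the name `mock_df` is introduced here purely for illustration.

```python
import pandas as pd

# Rebuild the six rows displayed above. The leading 0-5 column is assumed
# to be the DataFrame index, so it is not listed as a data column.
mock_df = pd.DataFrame(
    {
        "bookingID": [1, 1, 2, 2, 2, 2],
        "ticketID": [1, 1, 2, 3, 2, 3],
        "price": [10, 20, 10, 20, 30, 20],
        "status": [
            "not completed", "completed", "not completed",
            "not completed", "completed", "completed",
        ],
    }
)

print(mock_df.shape)               # (6, 4)
print(mock_df["status"].unique())  # the distinct status values
```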


### What can we see?
- Each row has one of two status values: "not completed" or "completed".
- "Completed" implies that the order was approved, and hence that every ticket in the booking was fine to approve.
- "Not completed" implies that the order was not approved, meaning that at least one ticket in the order could not be approved. It does not mean that none of the tickets in the order could be approved.
- If a row matches another row in every column except status, the ticket is assumed to have been fine to approve from the beginning, and only the entry with the "completed" status should be preserved.
- If a row has no match on the other columns, its approval state can be set directly from its status.
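
The matching rule in the last two bullets can be checked directly with `DataFrame.duplicated`. The sketch below is illustrative only: it works on an inline copy of the six sample rows, and the names `sample`, `key_cols` and `has_counterpart` are assumptions rather than part of the notebook.

```python
import pandas as pd

# Inline copy of the six sample rows (index 0-5)
sample = pd.DataFrame({
    "bookingID": [1, 1, 2, 2, 2, 2],
    "ticketID": [1, 1, 2, 3, 2, 3],
    "price": [10, 20, 10, 20, 30, 20],
    "status": ["not completed", "completed", "not completed",
               "not completed", "completed", "completed"],
})

# Compare rows on every column except status
key_cols = [col for col in sample.columns if col != "status"]

# keep=False marks every member of a duplicate group, not just the repeats
sample["has_counterpart"] = sample.duplicated(subset=key_cols, keep=False)

print(sample[sample["has_counterpart"]])
```

On the six sample rows, only rows 3 and 5 (booking 2, ticket 3, price 20) match each other. The remaining rows differ in price, so they count as distinct entries.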

### What steps should we take?
1. Where rows match in every column except status, remove the copy with the status "not completed"
2. Create a column with the final approval state

## Processing the Data

In [85]:
# Assign priority: 'completed' gets highest priority (lowest number)
df['status_priority'] = (df['status'] != 'completed').astype(int)

# Define the columns used to detect duplicates (all except the status columns).
# 'Unnamed: 0' is excluded as well, on the assumption that it is only a saved
# row index; if it were compared, no two rows would ever count as duplicates.
non_key_cols = ['status', 'status_priority', 'Unnamed: 0']
dedup_cols = [col for col in df.columns if col not in non_key_cols]

# Sort so that 'completed' (priority 0) comes first within each group of duplicate rows
df_sorted = df.sort_values(by=dedup_cols + ['status_priority'])

# Drop duplicates based on all columns except status, keeping the preferred row (first)
df_deduped = df_sorted.drop_duplicates(subset=dedup_cols, keep='first')

# Drop the helper column
df_deduped = df_deduped.drop(columns='status_priority')

# Replace status column with approval labels
df_deduped['approval_status'] = df_deduped['status'].apply(lambda x: 'approved' if x == 'completed' else 'not approved')
df_deduped = df_deduped.drop(columns='status')
df_deduped.head(6)


Unnamed: 0,bookingID,ticketID,price,approval_status
0,1,1,10,not approved
1,1,1,20,approved
2,2,2,10,not approved
4,2,2,30,approved
5,2,3,20,approved
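
A result like the one above can be cross-checked with an independent formulation: group the rows on every column except status and mark a group approved when any of its rows is completed. This is only a sketch on an inline copy of the sample rows. `rows`, `key_cols` and `check` are hypothetical names, and the group-by is an alternative technique, not the method used in the cell above.

```python
import pandas as pd

# Inline copy of the six sample rows
rows = pd.DataFrame({
    "bookingID": [1, 1, 2, 2, 2, 2],
    "ticketID": [1, 1, 2, 3, 2, 3],
    "price": [10, 20, 10, 20, 30, 20],
    "status": ["not completed", "completed", "not completed",
               "not completed", "completed", "completed"],
})

key_cols = ["bookingID", "ticketID", "price"]

# A group is approved if any of its rows has the status 'completed'
approved = (
    rows.assign(is_completed=rows["status"].eq("completed"))
        .groupby(key_cols)["is_completed"]
        .any()
)

# Turn the boolean flags into the same labels used for approval_status
check = (
    approved.map({True: "approved", False: "not approved"})
            .rename("approval_status")
            .reset_index()
)
print(check)
```

Each (bookingID, ticketID, price) combination appears exactly once in `check`, so comparing it against the deduplicated table is a quick way to catch a mix-up in the sort order or the `keep` direction.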
