# Task 1: Data Understanding, Preparation and Descriptive Analytics

## Merge the Datasets

I used the `merge` function from the pandas library, which implements SQL-style join operations.

In this case, `transactions` is our primary dataset, with each row representing a transaction record. We want to ensure that every transaction is retained in the final merged dataset, even if certain demographic, merchant, or city information is missing.

Using `how='left'` in each merge step keeps **every row of the left-hand table**, so all transactions are retained even if:

- **Customer data is missing:** transactions without a matching `cc_num` in `customers` still appear, with `NaN` for the customer details.

- **Merchant information is missing:** transactions without a matching `merchant` in `merchants` are included, with `NaN` for the merchant fields.

- **City data is missing:** if a customer’s `city` has no match in `cities`, the transaction is kept, with `NaN` for the city details.
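
The left-join behaviour described above can be illustrated on a small example. This is only a sketch: the two toy tables and their values are invented for illustration and are not taken from the real CSV files. It uses two optional `pd.merge` parameters, `indicator=True` (adds a `_merge` column showing which rows found a match) and `validate='many_to_one'` (raises `MergeError` if the join key is repeated in the right-hand table).

```python
import pandas as pd

# Toy stand-ins for the real tables (values invented for illustration only)
transactions = pd.DataFrame({
    "trans_num": ["T1", "T2", "T3"],
    "cc_num": [111, 222, 333],
    "amt": [10.0, 20.0, 30.0],
})
customers = pd.DataFrame({
    "cc_num": [111, 222],
    "first": ["Jane", "Omar"],
})

# Left join: every transaction is kept; unmatched rows get NaN customer fields.
# indicator=True adds a `_merge` column; validate checks that cc_num is unique
# in the right-hand table.
merged = pd.merge(
    transactions, customers,
    on="cc_num", how="left",
    indicator=True, validate="many_to_one",
)

# Transactions that found no matching customer
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print(f"{len(unmatched)} transaction(s) without a customer match")

# A right-hand table with repeated keys would multiply transaction rows;
# validate="many_to_one" turns that into an explicit error instead.
dup_customers = pd.concat([customers, customers.iloc[[0]]], ignore_index=True)
try:
    pd.merge(transactions, dup_customers, on="cc_num", how="left",
             validate="many_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True
print("Duplicate keys detected:", duplicate_keys_detected)
```

A left join guarantees that no transaction is dropped, but it does not guarantee that each transaction appears only once: if a key such as `city` occurs several times in the right-hand table, the matching transactions are repeated. Comparing `len(merged_data)` with `len(transactions)` after each merge step is a quick way to check for this.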


In [9]:
import pandas as pd

# Load Datasets
transactions = pd.read_csv('data/transactions.csv')
merchants = pd.read_csv('data/merchants.csv')
customers = pd.read_csv('data/customers.csv')
cities = pd.read_csv('data/cities.csv')

# Merge the .csv files into one
merged_data = pd.merge(transactions, customers, on='cc_num', how='left')
merged_data = pd.merge(merged_data, merchants, on='merchant', how='left')
merged_data = pd.merge(merged_data, cities, on='city', how='left')

# Print merged dataset
print(merged_data.head())

# Save merged dataset
merged_data.to_csv('data/merged_data.csv', index=False)

   index trans_date_trans_time            cc_num device_os     merchant  \
0   5381   2023-01-01 00:39:03  2801374844713453       NaN  Merchant_85   
1   5381   2023-01-01 00:39:03  2801374844713453       NaN  Merchant_85   
2   5381   2023-01-01 00:39:03  2801374844713453       NaN  Merchant_85   
3   5381   2023-01-01 00:39:03  2801374844713453       NaN  Merchant_85   
4   5381   2023-01-01 00:39:03  2801374844713453       NaN  Merchant_85   

      amt     trans_num   unix_time  is_fraud first  ...  job         dob  \
0  252.75  TRANS_662964  1672533543         0  Jane  ...  NaN  2002-10-12   
1  252.75  TRANS_662964  1672533543         0  Jane  ...  NaN  2002-10-12   
2  252.75  TRANS_662964  1672533543         0  Jane  ...  NaN  2002-10-12   
3  252.75  TRANS_662964  1672533543         0  Jane  ...  NaN  2002-10-12   
4  252.75  TRANS_662964  1672533543         0  Jane  ...  NaN  2002-10-12   

  category merch_lat  merch_long merchant_id        lat        long  city_pop  \
0    