# Milestone 1: Data Familiarisation

This notebook focuses on understanding the structure, content, and quality of the Globex Retail dataset before any cleaning or transformation is applied.

The goal is to assess:
- available columns and data types
- dataset size and granularity
- presence of missing values or duplicates
- initial observations relevant to business analysis


### Import libraries

In [19]:
import pandas as pd

#### Load the dataset (globex_retail_raw.csv) and identify what each row represents

In [20]:
df = pd.read_csv('globex_retail_raw.csv')
df.head()

# Each row represents a single retail transaction (order line) linked to a customer, product category, date, location, quantity, pricing, discount, and revenue.

|   | Customer_ID | Order_ID | Order_Date | Product_Category | Product_Sub_Category | Quantity | Price | Discount | Customer_Location | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CUST_013738 | ORD_00102406 | 01/01/2023 | Home & Garden | Gardening Tools | 1 | 419.19 | 0.0 | TN | 419.19 |
| 1 | CUST_011726 | ORD_00102902 | 01/01/2023 | Electronics | Laptops | 1 | 222.37 | 0.09 | TN | 202.3567 |
| 2 | CUST_010891 | ORD_00103864 | 01/01/2023 | Electronics | Laptops | 6 | 1107.65 | 0.0 | IN | 6645.9 |
| 3 | CUST_011452 | ORD_00103560 | 01/01/2023 | Electronics | Gaming Consoles | 5 | 288.84 | 0.0 | MA | 1444.2 |
| 4 | CUST_010886 | ORD_00100632 | 02/01/2023 | Electronics | Headphones | 1 | 191.27 | 0.0 | AZ | 191.27 |


#### Dataset size & structure

In [21]:
# data size as (rows, columns):
df.shape

# The dataset contains 5,000 transaction records with 10 columns per record.

(5000, 10)

#### Column names as they appear in the dataset

In [22]:
# column names:
df.columns

# Revenue already exists in the raw data; check later whether it was calculated consistently and correctly.

Index(['Customer_ID', 'Order_ID', 'Order_Date', 'Product_Category',
       'Product_Sub_Category', 'Quantity', 'Price', 'Discount',
       'Customer_Location', 'Revenue'],
      dtype='object')

#### Data types & information about the dataset

In [23]:
# data types and non-null counts per column:
df.info()

# Order_Date is stored as object (string) rather than datetime, so it will need to be converted.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Customer_ID           5000 non-null   object 
 1   Order_ID              5000 non-null   object 
 2   Order_Date            5000 non-null   object 
 3   Product_Category      5000 non-null   object 
 4   Product_Sub_Category  5000 non-null   object 
 5   Quantity              5000 non-null   int64  
 6   Price                 5000 non-null   float64
 7   Discount              5000 non-null   float64
 8   Customer_Location     5000 non-null   object 
 9   Revenue               5000 non-null   float64
dtypes: float64(3), int64(1), object(6)
memory usage: 390.8+ KB
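
The Order_Date conversion noted above could look roughly like the sketch below. It uses only the date strings visible in the df.head() output, and it assumes a day-first format (DD/MM/YYYY). The rows shown so far cannot confirm that assumption, because every day and month value is 12 or lower, so the sketch also reports the largest value in each position. The names sample, Order_Date_parsed, max_first and max_second are illustrative, not part of the dataset.

```python
import pandas as pd

# Date strings copied from the df.head() output; the notebook would use df instead.
sample = pd.DataFrame({"Order_Date": ["01/01/2023", "01/01/2023", "02/01/2023"]})

# A first field above 12 can only be a day (DD/MM/YYYY); a second field above 12
# can only be a day in a month-first layout (MM/DD/YYYY).
parts = sample["Order_Date"].str.split("/", expand=True).astype(int)
max_first, max_second = int(parts[0].max()), int(parts[1].max())

# ASSUMPTION: day-first dates. Use format="%m/%d/%Y" if the check above says otherwise.
sample["Order_Date_parsed"] = pd.to_datetime(
    sample["Order_Date"], format="%d/%m/%Y", errors="coerce"
)

# Strings that do not match the format become NaT, so this counts parse failures.
n_unparsed = int(sample["Order_Date_parsed"].isna().sum())

print(sample.dtypes)
print("Largest first field:", max_first, "| largest second field:", max_second)
print("Unparsed dates:", n_unparsed)
```

Running the same field check on the full Order_Date column before converting is a cheap way to settle the format question rather than guessing.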


#### Missing values
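Every column reports 5,000 non-null values out of 5,000 rows in the df.info() output, so no values are recorded as missing.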
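
A more explicit check is sketched below, since df.info() only counts NaN values and would not flag empty strings. The sketch also covers duplicates, which the goals at the top of this notebook list. It runs on two rows copied from the df.head() output so that it is self-contained; in the notebook the same calls would be applied to df. The variable names are illustrative.

```python
import pandas as pd

# Two rows copied from the df.head() output; the notebook would use df instead.
sample = pd.DataFrame({
    "Customer_ID": ["CUST_013738", "CUST_011726"],
    "Order_ID": ["ORD_00102406", "ORD_00102902"],
    "Product_Category": ["Home & Garden", "Electronics"],
    "Quantity": [1, 1],
})

# NaN count per column (the same information df.info() reports indirectly).
missing_per_column = sample.isna().sum()

# Empty or whitespace-only strings per text column, which isna() does not catch.
text_cols = ["Customer_ID", "Order_ID", "Product_Category"]
blank_per_column = sample[text_cols].apply(lambda s: s.str.strip().eq("").sum())

# Rows that are identical in every column, and Order_ID values that repeat.
n_duplicate_rows = int(sample.duplicated().sum())
n_repeated_order_ids = int(sample["Order_ID"].duplicated().sum())

print(missing_per_column)
print(blank_per_column)
print("Duplicate rows:", n_duplicate_rows)
print("Repeated Order_IDs:", n_repeated_order_ids)
```

A repeated Order_ID is not necessarily an error: if each row is an order line, one order may legitimately span several rows.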

## Initial data quality observations
- The dataset contains 5,000 transaction records with 10 columns.
- Each row represents a single retail transaction.
- No missing values were observed in any column; duplicates have not yet been checked.
- Order_Date is stored as a string and will need conversion to datetime.
- Revenue is already present, suggesting prior calculation logic that should be validated.
- Discount appears to be stored as a decimal (e.g. 0.09 = 9%).
- No data transformations have been applied at this stage.
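
The Revenue and Discount observations above can be tested with a short check like the one below. It assumes Revenue = Quantity × Price × (1 − Discount); that formula is an inference rather than documented logic, although the five rows shown by df.head() are consistent with it. The sketch uses those five rows so that it is self-contained, and the half-cent tolerance is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Numeric columns copied from the df.head() output; the notebook would use df instead.
sample = pd.DataFrame({
    "Quantity": [1, 1, 6, 5, 1],
    "Price": [419.19, 222.37, 1107.65, 288.84, 191.27],
    "Discount": [0.0, 0.09, 0.0, 0.0, 0.0],
    "Revenue": [419.19, 202.3567, 6645.9, 1444.2, 191.27],
})

# ASSUMPTION: Revenue = Quantity * Price * (1 - Discount).
expected = sample["Quantity"] * sample["Price"] * (1 - sample["Discount"])

# Compare with an absolute tolerance to absorb floating-point rounding.
matches = np.isclose(sample["Revenue"], expected, rtol=0, atol=0.005)
n_mismatched = int((~matches).sum())

# A decimal discount should fall between 0 and 1 inclusive.
discount_in_range = sample["Discount"].between(0, 1)

print("Rows where Revenue differs from the assumed formula:", n_mismatched)
print("All discounts within [0, 1]:", bool(discount_in_range.all()))
```

On the full dataset, any mismatched rows would be worth inspecting individually before deciding whether to recalculate Revenue.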
