## **1. Mount Google Drive and Load Data**
---

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
supply_chain_path = '/content/drive/My Drive/supply_chain'
print(os.listdir(supply_chain_path))

['DescriptionDataCoSupplyChain.csv', 'tokenized_access_logs.csv', 'DataCoSupplyChainDataset.csv']


In [None]:
import pandas as pd
# Reuse the folder path defined above; the file is not UTF-8, so read it as latin-1
df = pd.read_csv(os.path.join(supply_chain_path, 'DataCoSupplyChainDataset.csv'), encoding='latin-1')

***

## **2. Remove Empty or Uninformative Columns**
---
> ***Drop columns that carry no information: those that are entirely blank or hold the same value in every row.***

> ***A column with no variation cannot help explain or predict anything, so we remove it.***
***

In [7]:
# Drop 'Product Description' (completely empty)
df = df.drop(columns=['Product Description'])

In [8]:
# Drop columns with at most one distinct value
# (in this dataset: Customer Email, Customer Password, Product Status)
drop_cols = [col for col in df.columns if df[col].nunique() <= 1]
df = df.drop(columns=drop_cols)
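
Before dropping anything, it can help to list which columns actually qualify as empty or constant. The following is a minimal sketch on a toy frame; the helper name `find_uninformative_columns` and the sample values are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def find_uninformative_columns(frame: pd.DataFrame) -> dict:
    """Return columns that are entirely missing or hold a single distinct value."""
    empty = [c for c in frame.columns if frame[c].isna().all()]
    # nunique() ignores NaN by default, so all-NaN columns are handled separately above
    constant = [c for c in frame.columns
                if c not in empty and frame[c].nunique() == 1]
    return {"empty": empty, "constant": constant}

# Toy data standing in for the real dataset (assumption for illustration)
toy = pd.DataFrame({
    "Product Description": [np.nan, np.nan, np.nan],
    "Customer Email": ["XXXXXXXXX"] * 3,
    "Sales": [10.0, 20.0, 30.0],
})

report = find_uninformative_columns(toy)
print(report)
toy_clean = toy.drop(columns=report["empty"] + report["constant"])
```

Note that a column with one distinct value plus some missing entries is also reported as constant; whether the missingness itself is informative is worth a quick look before dropping.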

In [9]:
# Drop 'Order Zipcode' because most of its values are missing
# (confirm with df['Order Zipcode'].isna().mean() before running this cell)
df = df.drop(columns=['Order Zipcode'])
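
One way to make the "mostly missing" decision explicit is to compute the share of missing values per column and drop only those above a cutoff. This is a sketch on toy data; the 0.5 `threshold` is an arbitrary assumption, not a value taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (assumption for illustration)
toy = pd.DataFrame({
    "Order Zipcode": [np.nan, np.nan, np.nan, 92101.0],
    "Customer Zipcode": [725.0, 725.0, np.nan, 95125.0],
})

# Fraction of missing values in each column
missing_share = toy.isna().mean()
print(missing_share)

threshold = 0.5  # hypothetical cutoff; tune to the analysis at hand
mostly_missing = missing_share[missing_share > threshold].index.tolist()
toy = toy.drop(columns=mostly_missing)
```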

***

## **3. Handle Missing Values**
***
> ***Ensure missing values do not create problems for analysis.***

> ***When only a few rows lack an important field, we drop those rows. When many rows are affected, filling in (imputing) values is usually the better choice.***
***

In [11]:
# Drop rows with missing Customer Lname or Customer Zipcode
df = df.dropna(subset=['Customer Lname', 'Customer Zipcode'])
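
To check that "only a few rows are affected" before dropping, one can count the rows with a missing value in any of the key columns. A minimal sketch on toy data (the sample values are made up):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (assumption for illustration)
toy = pd.DataFrame({
    "Customer Lname": ["Smith", None, "Lee", "Diaz"],
    "Customer Zipcode": [725.0, 725.0, np.nan, 95125.0],
    "Sales": [10.0, 20.0, 30.0, 40.0],
})

key_cols = ["Customer Lname", "Customer Zipcode"]

# Rows where at least one key column is missing
rows_affected = int(toy[key_cols].isna().any(axis=1).sum())
print(f"{rows_affected} of {len(toy)} rows would be dropped")

toy_clean = toy.dropna(subset=key_cols)
```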

***

## **4. Convert Data Types**
***
> ***Make sure every column has the correct data type (text, number, date, category).***

> ***This ensures we can run correct calculations, groupings, or time-based analysis later.***
***

In [12]:
# Convert dates to datetime objects
df['order date (DateOrders)'] = pd.to_datetime(df['order date (DateOrders)'])
df['shipping date (DateOrders)'] = pd.to_datetime(df['shipping date (DateOrders)'])
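
Once both columns are real datetimes, time arithmetic works directly. The sketch below uses toy strings and assumes a month/day/year layout with hours and minutes; the explicit `format` and `errors="coerce"` are illustrative choices rather than what the notebook does, and the `delay` name is hypothetical.

```python
import pandas as pd

# Toy strings in an assumed "%m/%d/%Y %H:%M" layout, with one malformed value
toy = pd.DataFrame({
    "order date (DateOrders)": ["1/31/2018 22:56", "1/13/2018 12:27"],
    "shipping date (DateOrders)": ["2/3/2018 22:56", "not a date"],
})

for col in toy.columns:
    # errors="coerce" turns unparseable strings into NaT instead of raising
    toy[col] = pd.to_datetime(toy[col], format="%m/%d/%Y %H:%M", errors="coerce")

# Subtracting two datetime columns yields a timedelta column
delay = toy["shipping date (DateOrders)"] - toy["order date (DateOrders)"]
print(delay)
```

Counting the `NaT` values after conversion is a quick way to see how many entries failed to parse.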

In [15]:
# Convert relevant categorical fields to 'category' dtype
cat_cols = ['Delivery Status', 'Category Name', 'Customer Segment', 'Shipping Mode',
            'Order Status', 'Market', 'Order Region']
for col in cat_cols:
    df[col] = df[col].astype('category')
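
Besides making groupings explicit, the `category` dtype usually reduces memory for columns with few distinct values. A small sketch on synthetic data (the repeated values are invented for illustration):

```python
import pandas as pd

# 10,000 rows cycling through four labels (synthetic data)
toy = pd.DataFrame({
    "Shipping Mode": ["Standard Class", "First Class", "Same Day", "Second Class"] * 2500
})

before = toy["Shipping Mode"].memory_usage(deep=True)
toy["Shipping Mode"] = toy["Shipping Mode"].astype("category")
after = toy["Shipping Mode"].memory_usage(deep=True)

print(f"bytes before: {before}, after: {after}")
print(toy["Shipping Mode"].cat.categories.tolist())
```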

***

## **5. Remove Sensitive Data**
***
> ***Protect privacy by dropping columns that could identify individuals.***

> ***Names and street addresses are not needed for most analyses and should not be shared or published. The masked email and password columns were already removed in Section 2.***
***

In [18]:
# Drop PII (Personally Identifiable Information) columns for privacy
df = df.drop(columns=['Customer Fname', 'Customer Lname', 'Customer Street'])

***

## **6. Reset Index**
***
> ***Dropping rows leaves gaps in the index. Resetting it gives the DataFrame a contiguous index starting at 0.***
***

In [19]:
df = df.reset_index(drop=True)
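
The effect is easiest to see on a small example. In this sketch (toy data), dropping rows leaves the original labels in place, and `reset_index(drop=True)` renumbers them without adding the old index as a column:

```python
import pandas as pd

toy = pd.DataFrame({"Sales": [10.0, None, 30.0, None, 50.0]})

filtered = toy.dropna()
print(filtered.index.tolist())    # original labels survive, leaving gaps

renumbered = filtered.reset_index(drop=True)
print(renumbered.index.tolist())  # contiguous labels starting at 0
```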

***

## **7. Save the Cleaned Data**
***
> ***Store the preprocessed dataset for future use.***
***

In [20]:
df.to_csv('/content/drive/My Drive/supply_chain/supply_chain_data_cleaned.csv', index=False)
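
CSV stores plain text, so the datetime and `category` dtypes set earlier are not preserved in the file. The sketch below shows one way to restore them on reload, using an in-memory buffer and toy data in place of the Drive file; the column choices are illustrative.

```python
import io
import pandas as pd

# Toy frame with the dtypes the cleaning steps produce (illustrative values)
toy = pd.DataFrame({
    "order date (DateOrders)": pd.to_datetime(["2018-01-31 22:56", "2018-01-13 12:27"]),
    "Shipping Mode": pd.Series(["Standard Class", "First Class"], dtype="category"),
})

buffer = io.StringIO()  # stands in for the CSV file on Drive
toy.to_csv(buffer, index=False)

# A plain reload loses both dtypes
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing parse_dates and dtype restores them
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    parse_dates=["order date (DateOrders)"],
    dtype={"Shipping Mode": "category"},
)
print(plain.dtypes)
print(restored.dtypes)
```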

***