<a href="https://colab.research.google.com/github/satyakala-teja/analytics-capstone-satyakala/blob/main/02_data_cleaning_and_feature_engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 02 – Data Cleaning & Feature Engineering
**Author:** Satyakala Devata

## Objective
Clean the dataset, fix data types, handle missing values, and engineer useful features for analytics and dashboarding.


In [1]:
import pandas as pd

df = pd.read_csv('/content/data/sales_data.csv')
df.head()


,order_id,order_date,customer_id,category,sub_category,product,quantity,unit_price,sales,region
0,1001,2023-01-02,C001,Office Supplies,Binders,Elastic Binder,2,5.0,10.0,East
1,1002,2023-01-03,C002,Furniture,Chairs,Ergo Chair,1,150.0,150.0,West
2,1003,2023-01-04,C003,Technology,Phones,SmartPhone X,1,700.0,700.0,North
3,1004,2023-01-05,C001,Office Supplies,Paper,Copy Paper,10,3.5,35.0,East
4,1005,2023-01-06,C004,Technology,Laptops,UltraBook Pro,1,1200.0,1200.0,South


## 1. Missing Value Analysis

Check for missing values in each column and decide how to handle them.


In [3]:
df.isnull().sum()


column,missing_count
order_id,0
order_date,0
customer_id,0
category,0
sub_category,0
product,0
quantity,0
unit_price,0
sales,0
region,0
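When a dataset does contain gaps, raw counts are easier to act on alongside percentages. The following is a minimal sketch of that kind of summary, not part of the original notebook. The `sample` frame and its injected `None` values are hypothetical and exist only to make the output non-trivial.

```python
import pandas as pd

# Hypothetical stand-in for the sales data; the None values are injected
# purely for illustration (the real dataset has no missing values).
sample = pd.DataFrame({
    "order_id": [1001, 1002, 1003, 1004],
    "quantity": [2, None, 1, 10],
    "region": ["East", "West", None, None],
})

# Count and percentage of missing values per column, worst first.
missing_summary = pd.DataFrame({
    "missing_count": sample.isnull().sum(),
    "missing_pct": (sample.isnull().mean() * 100).round(1),
}).sort_values("missing_count", ascending=False)

print(missing_summary)
```

Sorting by `missing_count` puts the columns that most need a handling decision at the top.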



### Handling Missing Values
This dataset has no missing values, so `dropna()` changes nothing here. The standard cleaning steps are included for completeness.


In [6]:
df = df.dropna().reset_index(drop=True)   # remove rows with missing values, then reset the index
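Dropping rows is only one option. When losing rows is too costly, gaps can be filled instead. The sketch below shows one possible approach, assuming median imputation for numeric columns and a placeholder label for text columns. The `sample` frame, its injected gaps, and the `"Unknown"` label are illustrative assumptions, not choices made in the original notebook.

```python
import pandas as pd

# Hypothetical rows with gaps injected for illustration only.
sample = pd.DataFrame({
    "quantity": [2.0, None, 1.0, 10.0],
    "unit_price": [5.0, 150.0, None, 3.5],
    "region": ["East", None, "East", "South"],
})

cleaned = sample.copy()

# Numeric columns: fill gaps with the column median.
for col in ["quantity", "unit_price"]:
    cleaned[col] = cleaned[col].fillna(cleaned[col].median())

# Text column: fill gaps with an explicit placeholder label.
cleaned["region"] = cleaned["region"].fillna("Unknown")

print(cleaned)
```

Unlike `dropna()`, this keeps all four rows, at the cost of introducing estimated values that should be documented for downstream users.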


In [7]:
df.isnull().sum()


column,missing_count
order_id,0
order_date,0
customer_id,0
category,0
sub_category,0
product,0
quantity,0
unit_price,0
sales,0
region,0


## 2. Data Type Conversion
Convert columns to appropriate data types for analysis. The most important conversion is `order_date`, which is read in as object (string) and must become datetime before any date-based features can be extracted.


In [8]:
df.dtypes


column,dtype
order_id,int64
order_date,object
customer_id,object
category,object
sub_category,object
product,object
quantity,int64
unit_price,float64
sales,float64
region,object


In [11]:
df['order_date'] = pd.to_datetime(df['order_date'])
df.dtypes


column,dtype
order_id,int64
order_date,datetime64[ns]
customer_id,object
category,object
sub_category,object
product,object
quantity,int64
unit_price,float64
sales,float64
region,object
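The conversion above works because every `order_date` value is a well-formed date. As a hedged sketch of a more defensive variant, the example below assumes the `YYYY-MM-DD` format seen in the sample rows and uses `errors="coerce"` so that unparseable strings become `NaT` instead of raising. The `"not a date"` value is hypothetical, and casting `region` to `category` is an optional extra, not a step from the original notebook.

```python
import pandas as pd

# Hypothetical frame; the third date is deliberately malformed.
sample = pd.DataFrame({
    "order_date": ["2023-01-02", "2023-01-03", "not a date"],
    "region": ["East", "West", "East"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising.
sample["order_date"] = pd.to_datetime(
    sample["order_date"], format="%Y-%m-%d", errors="coerce"
)

# Low-cardinality text columns can be stored as categoricals.
sample["region"] = sample["region"].astype("category")

bad_dates = sample["order_date"].isna().sum()
print(sample.dtypes)
print("Unparseable dates:", bad_dates)
```

Counting the resulting `NaT` values gives a quick check on how many rows failed to parse before they are dropped or corrected.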


### Extracting Date-Based Features


In [14]:
df['order_year'] = df['order_date'].dt.year
df['order_month'] = df['order_date'].dt.month        # 1 = January ... 12 = December
df['order_day'] = df['order_date'].dt.day            # day of the month
df['order_weekday'] = df['order_date'].dt.weekday    # 0 = Monday ... 6 = Sunday
df.head()



,order_id,order_date,customer_id,category,sub_category,product,quantity,unit_price,sales,region,order_year,order_month,order_day,order_weekday
0,1001,2023-01-02,C001,Office Supplies,Binders,Elastic Binder,2,5.0,10.0,East,2023,1,2,0
1,1002,2023-01-03,C002,Furniture,Chairs,Ergo Chair,1,150.0,150.0,West,2023,1,3,1
2,1003,2023-01-04,C003,Technology,Phones,SmartPhone X,1,700.0,700.0,North,2023,1,4,2
3,1004,2023-01-05,C001,Office Supplies,Paper,Copy Paper,10,3.5,35.0,East,2023,1,5,3
4,1005,2023-01-06,C004,Technology,Laptops,UltraBook Pro,1,1200.0,1200.0,South,2023,1,6,4
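The numeric date parts above can be complemented with features that are often more convenient for dashboards. The sketch below is illustrative only: it reuses the five dates and sales values shown in the output, and the column names `order_weekday_name`, `is_weekend`, and `order_quarter` are assumptions rather than names from the original notebook.

```python
import pandas as pd

# The five orders shown above, reduced to the columns needed here.
sample = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06"]
    ),
    "sales": [10.0, 150.0, 700.0, 35.0, 1200.0],
})

# Readable weekday label, weekend flag (Saturday=5, Sunday=6), and quarter.
sample["order_weekday_name"] = sample["order_date"].dt.day_name()
sample["is_weekend"] = sample["order_date"].dt.weekday >= 5
sample["order_quarter"] = sample["order_date"].dt.quarter

# Total sales per calendar month, a typical dashboard aggregate.
monthly_sales = sample.groupby(sample["order_date"].dt.to_period("M"))["sales"].sum()

print(sample)
print(monthly_sales)
```

Grouping by a monthly period rather than by `order_month` alone keeps months from different years separate once the data spans more than one year.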
