Retail-Orders-Python-SQL-Project

The main objectives of this project are to use the Kaggle API to retrieve retail order data, execute pandas data transformation in Python, and use SQL Server for data analysis. Ref: Ankit Bansal YT Channel

Project Overview

The project performs the following steps:

Extraction: Extracts data from Kaggle using the Kaggle API (Python).
Transformation: Cleans and transforms the data for analysis using pandas (Python).
Loading: Loads the transformed data into a SQL Server database.
Insights: Carries out analysis on the retail order data using SQL Server to find patterns.

Technologies Used

Python
Pandas
Kaggle API
SQL Server

Extract

Kaggle API for data downlaod

Open Kaggle website and go to settings option then click on 'Create New Token'.
Download the kaggle.json file.
Now, go to your home directory in your PC like 'C:\Users\suman.kaggle', if '.kaggle' folder is not there then create a new one and paste the 'kaggle.json' file right there.

Transform

Python for Data Cleaning

I’m using VS Code to use jupyter notebook.

Open the VS Code and create a new '.ipynb' file in a local folder, where we can write python codes.

# install libraries
!pip install kaggle

# import libraries
import kaggle

# downloading the dataset using kaggle api
# copy the api command from kaggle dataset only
!kaggle datasets download -d ankitbansal06/retail-orders

# this is an zip file, so extract files from zip file
import zipfile # importing zipfile libraries
openfile = zipfile.ZipFile('retail-orders.zip') # open the file
openfile.extractall() # extract files to the directory
openfile.close() # close the file

# read the the file in pandas
import pandas as pd # importing pandas libraries
df = pd.read_csv('D:\\Books\\Python\\Project\\Retail-Orders-Python-SQL-Project\\orders.csv')

# reading few values
df.head(10)

# some null values are there in 'Ship Mode' column so, we need to fix this
df['Ship Mode'].unique()

# reading the dataset again and use defaut option to change 'Not Available' & 'unknown' values to NaN
df = pd.read_csv('D:\\Books\\Python\\Project\\Retail-Orders-Python-SQL-Project\\orders.csv', na_values=['Not Available', 'unknown'])
df['Ship Mode'].unique()

# now fix the column names as it has space and non uniformatity in names
df.columns

# lowercase the column names
df.columns = df.columns.str.lower()
df.columns

# add _ inplace of ' '
df.columns = df.columns.str.replace(' ', '_')
df.columns

# derive new column
df['discount'] = (df.list_price * df.discount_percent) / 100 # actual discount column in amount
df['sale_price'] = df.list_price - df.discount # sale_price is (list price - discount)
df['profit'] = (df.sale_price - df.cost_price) # profit is (sale price - cost price)
df

# now we can drop the extra columns
df.drop(['cost_price', 'list_price', 'discount_percent'], axis=1, inplace=True)
df

# now check the daatatypes of the columns
df.dtypes

# look everythis is fine other than the 'order_date' column, 'order_date' column datatype needs to change to time datatype
df.order_date = pd.to_datetime(df.order_date)
df.dtypes

Load

load to SQL Server using python

now we cleaned the dataset and data is ready to load to SQL Server

# install libraries
!pip install sqlalchemy

# connect to the sql server
import pandas as pd
import sqlalchemy as sal
engine = sal.create_engine('mssql://sumanpc/master?driver=ODBC+DRIVER+17+FOR+SQL+SERVER')
conn = engine.connect()

# now, load the data into sql server

# df.to_sql('df_orders', con = conn, index=False, if_exists = 'replace')
# we do not use 'replace' mode because it will create a table in sql with highest datatype values and it will take more space in memory

# so, we will use 'append' mode, but first create a new empty table in SQL Server with all the column names and required datatype with values
df.to_sql('df_orders', con = conn, index=False, if_exists = 'append')

# DDL command from SSMS 
# create table df_orders (
#   [order_id] int primary key,
#   [order_date] date,
#   [ship_mode] varchar (20),
#   [segment] varchar (20),
#   [country] varchar (20),
#   [city] varchar (20),
#   [state] varchar (20),
#   [postal_code] varchar (20),
#   [region] varchar (20),
#   [category] varchar (20),
#   [sub_category] varchar (20),
#   [product_id] varchar (50),
#   [quantity] int,
#   [discount] decimal (7,2),
#   [sale_price] decimal (7,2),
#   [profit] decimal (7,2))

Analysis using SQL Server

Find top 10 highest reveue generating products

select top 10 with ties product_id, sum(profit) total_profit from df_orders
group by product_id
order by 2 desc

Find top 5 highest selling products in each region

with cte as
(select region, product_id, sum(sale_price) qty_sale,
dense_rank() over(partition by region order by sum(sale_price) desc) rn
from df_orders
group by region, product_id)
select * from cte where rn <= 5

Find month over month growth comparison for 2022 and 2023 sales eg jan 2022 vs jan 2023

with cte as
(select year(order_date) order_year, month(order_date) order_month, sum(sale_price) total_sale from df_orders
group by year(order_date), month(order_date))
select order_month,
sum(case when order_year = 2022 then total_sale else 0 end) as sale_2022,
sum(case when order_year = 2023 then total_sale else 0 end) as sale_2023,
cast(cast(round(((sum(case when order_year = 2023 then total_sale else 0 end)) - (sum(case when order_year = 2022 then total_sale else 0 end))) * 100.0 / (sum(case when order_year = 2022 then total_sale else 0 end)), 2) as float) as varchar) + '%' as inc_or_dec_in_2023
from cte
group by order_month
order by order_month

For each category which month had highest sales

with cte as
(select category, format(order_date, 'MMM-yyyy') mnth_yr, sum(sale_price) total_sale,
dense_rank() over(partition by category order by sum(sale_price) desc) rn
from df_orders
group by category, format(order_date, 'MMM-yyyy'))
select category, mnth_yr, total_sale
from cte
where rn = 1

Which sub category had highest growth by profit in 2023 compare to 2022?

with cte as
(select sub_category, year(order_date) order_year, sum(profit) total_profit from df_orders
group by sub_category, year(order_date)),
cte2 as
(select sub_category,
sum(case when order_year = 2022 then total_profit else 0 end) as profit_2022,
sum(case when order_year = 2023 then total_profit else 0 end) as profit_2023
from cte
group by sub_category)
select top 1 with ties *,
(profit_2023 - profit_2022) * 100.0 / profit_2022 profit_percentage
from cte2
order by 4 desc

Conclusion

The Retail Orders Analysis project demonstrated how to use SQL and Python to get valuable insights out of retail data. We were able to determine the best-performing goods, local sales trends, and growth patterns from year to year by preprocessing and analyzing the data. These observations can help guide strategic decision-making and propel retail industry success.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
README.md		README.md
Retail-Orders-Python-SQL-Project.ipynb		Retail-Orders-Python-SQL-Project.ipynb
SQLQuery.sql		SQLQuery.sql
retail-orders.zip		retail-orders.zip

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Retail-Orders-Python-SQL-Project

Project Overview

Technologies Used

Extract

Transform

Load

Analysis using SQL Server

Conclusion

About

Uh oh!

Releases

Packages

Languages

sumanndass/Retail-Orders-Python-SQL-Project

Folders and files

Latest commit

History

Repository files navigation

Retail-Orders-Python-SQL-Project

Project Overview

Technologies Used

Extract

Transform

Load

Analysis using SQL Server

Conclusion

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages