**Task Details**
We would like to find out the impact of order size (number of items), order date and time, and payment method on order status.

**Expected Submission**
Tell us the correlation between:

- returned orders and any other given variables
- completed orders and any other given variables
- cancelled orders and any other given variables

It would help if you could break it down by city as well.

**Evaluation**
We are looking for easy-to-understand graphs and clear insights backed by data.

# Loading data

In [None]:
#Loading of Dataset and required Libraries

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns


dt= pd.read_csv(
    "/kaggle/input/gufhtugu-publications-dataset-challenge/GP Orders - 5.csv",
    encoding="utf_8")

import warnings  
warnings.filterwarnings('ignore')

# Basic Data Exploration

In [None]:
#to check the first few rows
dt.head()

In [None]:
#to check the number of columns & rows
print("dimensions are : ", dt.shape)

So the data contains 19239 rows and 8 columns.

In [None]:
#to check the columns names, data type and null values (if any)

print(dt.info())

The "Non-Null Count" of a few columns shows the presence of null values.

# Data cleaning

**Handling missing values.**

In [None]:
print(dt.isna().sum())

Three columns contain missing values. We will drop the rows that contain them.

In [None]:
print("Before drop, total rows are: ", dt.shape[0])

#drop the null values
dt.dropna(inplace=True)

print("After drop, total rows are: ", dt.shape[0])

print(dt.isna().sum())

Now the data doesn't contain any missing values.
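Before dropping rows, it can help to see what share of each column is missing and how many rows survive. A minimal sketch on a made-up frame (the values below are illustrative, not from the real dataset):

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    "Order Number": [1, 2, 3, 4],
    "City": ["Lahore", None, "Karachi", "Multan"],
    "Book Name": ["A", "B", None, "D"],
})

# Fraction of missing values per column
missing_share = toy.isna().mean()
print(missing_share)

# dropna() removes every row that has at least one missing value
cleaned = toy.dropna()
print("Rows kept:", len(cleaned), "of", len(toy))
```

If the missing share is small, as it appears to be here, dropping rows is a reasonable choice; otherwise filling or flagging the gaps may lose less information.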

**Rename the columns to more appropriate names**

In [None]:
#rename the columns
dt = dt.rename(columns={
    "Order Number": "Order_Number",
    "Order Status": "Order_Status",
    "Book Name": "Book_Name",
    "Order Date & Time": "Date_Time",
    "Payment Method": "Payment_Method",
    "Total items": "Total_Items",
    "Total weight (grams)": "Weight(g)",
})
print("After rename, column names are: ", "\n", "\n" , dt.columns)

**The "Date_Time" column has "object" type; we will change it to datetime64 and also extract the year, month, day, etc.**

In [None]:
#change the type of "Date_Time" columns
dt['Date_Time'] = pd.to_datetime(dt['Date_Time'])
print(dt.info())

#Date (Days, Week) and Time from Date Column
dt['Years'] = dt["Date_Time"].dt.year
dt['Months_Name'] = dt["Date_Time"].dt.month_name()
dt['Months'] = dt["Date_Time"].dt.month
dt['Days'] = dt["Date_Time"].dt.day
dt['DaysName'] = dt["Date_Time"].dt.day_name()
#.dt.week was removed in pandas 2.0; isocalendar().week gives the ISO week number
dt['Weeks'] = dt["Date_Time"].dt.isocalendar().week
dt['Date'] = dt["Date_Time"].dt.date
dt["Month_Year"]=dt["Months_Name"] + "-" + dt["Years"].astype(str)
#Confirm Extracted Columns
dt.head()

# **Handling Inconsistent Data**

The "Book_Name" column can contain more than one book name per order. Let's split it.

In [None]:


#to separate, from multiple to single book title per line

#print('No of rows BEFORE splitting : ',dt.shape[0])

scol = dt['Book_Name'].str.split('/', expand=True).stack()
scol.index = scol.index.droplevel(-1) 
scol.name = 'Book_Name' 
dt = dt.drop(columns='Book_Name').join(scol)

#print('No of rows AFTER splitting : ',dt.shape[0])

#manually rename some Urdu book names to English or romanized titles
dt['Book_Name'] = dt['Book_Name'].replace('انٹرنیٹ سے پیسہ کمائیں','Internet Sy Pysy Kamaen')
dt['Book_Name'] = dt['Book_Name'].replace('انٹرنیٹ سے پیسہ کمائیں؟- مستحقین زکواة','Internet Sy Pysy Kamaen')
dt['Book_Name'] = dt['Book_Name'].replace('ڈیٹا سائنس','Data Science')
dt['Book_Name'] = dt['Book_Name'].replace('ڈیٹا سائنس ۔ ایک تعارف','Data Science')
dt['Book_Name'] = dt['Book_Name'].replace('مشین لرننگ','Machine Learning')
dt['Book_Name'] = dt['Book_Name'].replace('(C++) ++سی','(C++)')
dt['Book_Name'] = dt['Book_Name'].replace("ایک تھا الگورتھم",'Ak Tha Algorithm')


#extracting top 20 books for fuzzywuzzy
top_bks=dt["Book_Name"].value_counts().head(20).reset_index()
top_bks.columns=['Book_Name','Sold_Qty']
all_bks = dt["Book_Name"].unique()

#rename each book name to its closest top-20 title using fuzzywuzzy
from fuzzywuzzy import process

for bks in top_bks['Book_Name']:
    #score every distinct title against this top title (0-100)
    matches = process.extract(bks, all_bks, limit=len(all_bks))
    for potential_match in matches:
        if potential_match[1] > 90:
            dt.loc[dt['Book_Name'] == potential_match[0], "Book_Name"] = bks

dt.reset_index(drop=True, inplace=True)
print("Top 10 books are: \n", dt["Book_Name"].value_counts().head(10))
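The consolidation loop above boils down to one idea: for each canonical title, rewrite every variant whose similarity score exceeds a threshold. Below is a minimal sketch of that idea using the standard-library `difflib` as a stand-in for fuzzywuzzy. The two libraries score strings differently, so the 0.9 threshold, the helper names `similarity` and `consolidate`, and the titles are all assumptions for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def consolidate(names, canonical, threshold=0.9):
    """Map each name to the first canonical title it resembles closely enough."""
    mapping = {}
    for name in names:
        mapping[name] = name  # default: leave the name unchanged
        for target in canonical:
            if similarity(name, target) > threshold:
                mapping[name] = target
                break
    return mapping

# Hypothetical titles, for illustration only
canonical = ["Python Programming", "Data Science"]
variants = ["Python Programming", "python programing", "Data Sciense", "Blockchain"]

mapping = consolidate(variants, canonical)
for original, fixed in mapping.items():
    print(f"{original!r} -> {fixed!r}")
```

A high threshold keeps distinct titles apart at the cost of missing heavier typos; it is worth printing the mapping and checking it by eye before applying it to the data.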

The "City" column contains many typos. Let's fix them.

In [None]:
dt['City'] = dt['City'].replace(['karachi','KARACHI'],'Karachi')
dt['City'] = dt['City'].replace('FSD','Faisalabad')
dt['City'] = dt['City'].replace(['lahore','LAHORE'],'Lahore')

#extracting top 20 cities for fuzzywuzzy

fuzz_top_City=dt["City"].value_counts().head(20).reset_index()
fuzz_top_City.columns=['City','Sold_Qty']
fuzz_all_City = dt["City"].unique()

#fixing typos in city names by mapping close matches to the top-20 spelling

from fuzzywuzzy import process
for city in fuzz_top_City['City']:
    matches = process.extract(city, fuzz_all_City, limit=len(fuzz_all_City))
    for potential_match in matches:
        if potential_match[1] > 90:
            dt.loc[dt['City'] == potential_match[0], "City"] = city

print("Top 20 Cities are: \n", dt["City"].value_counts().head(20))

Removing inconsistent labels in the "Payment_Method" column

In [None]:
dt['Payment_Method'] = dt['Payment_Method'].replace('Cash on Delivery (COD)','Cash on delivery')
print("Unique Payment Methods are: \n",dt['Payment_Method'].unique())

**Data is clean. We are ready to visualize it.**

In [None]:
fig, ax = plt.subplots()
ax=sns.countplot(x="Order_Status",data=dt)
fig.set_size_inches(18,9)
ax.set_title('Total Orders for Different Order Status',fontsize=20)
ax.set_xlabel("Order Status",fontsize=18)
ax.set_ylabel("Number of Order(s)",fontsize=18) 
#plt.xticks(rotation=90)
plt.show()

**The data shows that most orders were completed and fewer were returned or cancelled.**

In [None]:
fig, ax = plt.subplots()
ax=sns.countplot(x="Payment_Method",data=dt,hue="Order_Status")
fig.set_size_inches(18,9)
ax.set_title('Relation between Payment Method & Order Status',fontsize=20)
ax.set_xlabel("Payment Method(s)",fontsize=18)
ax.set_ylabel("Number of Order(s)",fontsize=18) 
#plt.xticks(rotation=90)
plt.show()

**The data shows an interesting relationship: when the payment method is COD, the order is likely to be completed.**
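Raw counts are dominated by whichever payment method is most common, so a comparison of rates can be easier to read. A minimal sketch with a row-normalised crosstab; the data and the second payment method name are made up for illustration:

```python
import pandas as pd

# Toy data, for illustration only (not taken from the real dataset)
orders = pd.DataFrame({
    "Payment_Method": ["Cash on delivery"] * 8 + ["EasyPaisa"] * 2,
    "Order_Status": ["Completed"] * 6 + ["Returned"] * 2 + ["Completed"] * 2,
})

# normalize="index" makes each row sum to 1, so payment methods with
# very different order volumes can be compared directly.
rates = pd.crosstab(
    orders["Payment_Method"], orders["Order_Status"], normalize="index"
)
print(rates.round(2))
```

On the real data the same call would be `pd.crosstab(dt["Payment_Method"], dt["Order_Status"], normalize="index")`.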

In [None]:
fig, ax = plt.subplots()
ax=sns.countplot(x="Years",data=dt,hue="Order_Status")
fig.set_size_inches(18,9)
ax.set_title('Yearly Orders Status (Frequency)',fontsize=20)
ax.set_xlabel("Years",fontsize=18)
ax.set_ylabel("Number of Order(s)",fontsize=18) 
#plt.xticks(rotation=90)
plt.show()

**The data suggests that Gufhtugu had its top sales in 2020. We can't say this with certainty, as only the last 3 months of 2019 and the first month of 2021 are included in the data, while the whole year is included for 2020.**
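Because the years are covered unevenly, one hedged way to compare them is orders per covered month rather than total orders. A minimal sketch on made-up timestamps (the dates below are illustrative only):

```python
import pandas as pd

# Toy timestamps, for illustration only
toy = pd.DataFrame({"Date_Time": pd.to_datetime([
    "2019-11-05", "2019-12-20",
    "2020-01-15", "2020-02-10", "2020-03-03", "2020-03-28",
    "2021-01-02", "2021-01-09", "2021-01-30",
])})
toy["Years"] = toy["Date_Time"].dt.year
toy["Months"] = toy["Date_Time"].dt.month

# Orders and number of distinct months observed in each year
per_year = toy.groupby("Years").agg(
    orders=("Date_Time", "count"),
    months_covered=("Months", "nunique"),
)
per_year["orders_per_month"] = per_year["orders"] / per_year["months_covered"]
print(per_year)
```

In this toy example the year with the most orders in total is not the one with the highest monthly average, which is exactly the distortion partial years can cause.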

In [None]:
fig, ax = plt.subplots()

Months = ['January', 'February', 'March','April', 'May', 'June', 'July', 'August', 'September','October', 'November', 'December']
ax=sns.countplot(x="Months_Name",data=dt,hue="Order_Status", order=Months)
fig.set_size_inches(18,9)
#plt.rcParams["axes.labelsize"] = 25
ax.set_title('Monthly Orders Status (Frequency)',fontsize=20)
ax.set_xlabel("Months",fontsize=18)
ax.set_ylabel("Number of Order(s)",fontsize=18) 
#plt.xticks(rotation=90)
plt.show()

**The data shows that the highest number of orders were processed in January. It is important to note that in this graph the data is grouped across all years, so January includes data from Jan 2020 and Jan 2021. The next graph elaborates further.**

In [None]:

fig, ax = plt.subplots()
ax=sns.countplot(x="Month_Year",data=dt,hue="Order_Status" )
fig.set_size_inches(18,9)
ax.set_title('Monthly Orders Status (Frequency)',fontsize=20)
ax.set_xlabel("Months",fontsize=18)
ax.set_ylabel("Number of Order(s)",fontsize=18) 
plt.xticks(rotation=90)
plt.show()

**The data shows that January 2021 has the highest sales, followed by Dec 2020 and Aug 2020.**

In [None]:
fig, ax = plt.subplots()
ax=sns.countplot(x="Days",data=dt,hue="Order_Status")
fig.set_size_inches(18,9)
ax.set_title('Day wise Orders Status (Frequency)',fontsize=20)
ax.set_xlabel("Days of Month",fontsize=18)
ax.set_ylabel("Number of Order(s)",fontsize=18) 
plt.xticks(rotation=90)
plt.show()

**This graph shows the day-wise total orders. There doesn't appear to be any relation between day of the month and order status.**
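A visual impression like this can be backed by a number. One common measure of association between two categorical variables is Cramér's V, computed from the chi-square statistic of their contingency table. The notebook does not compute it; the sketch below is an illustration with a hypothetical helper `cramers_v`, toy data, and no small-sample bias correction.

```python
import numpy as np
import pandas as pd

def cramers_v(x, y):
    """Cramér's V for two categorical series: 0 = no association, 1 = perfect."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts if the two variables were independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0

# Toy series, for illustration only
group = pd.Series(["a", "a", "b", "b"])
fully_linked = pd.Series(["p", "p", "q", "q"])   # determined by group
unrelated = pd.Series(["p", "q", "p", "q"])      # independent of group

v_linked = cramers_v(group, fully_linked)
v_unrelated = cramers_v(group, unrelated)
print(v_linked, v_unrelated)
```

Applied to the notebook's columns, `cramers_v(dt["Days"], dt["Order_Status"])` would be expected to come out close to 0 if day of the month really has no bearing on order status.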

In [None]:
fig, ax = plt.subplots()
order=['Monday','Tuesday', 'Wednesday','Thursday', 'Friday', 'Saturday','Sunday']
ax=sns.countplot(x="DaysName",data=dt,hue="Order_Status", order=order)
fig.set_size_inches(18,9)
ax.set_title('Orders Status on each day of week',fontsize=20)
ax.set_xlabel("Days of the Week",fontsize=18)
ax.set_ylabel("Number of Order(s)",fontsize=18) 
plt.xticks(rotation=90)
plt.show()

**This graph shows the relationship between total orders and days of the week. The weekend (Saturday/Sunday) has the most orders.**

**Creating new dataset**

In [None]:
cty = dt['City'].value_counts().iloc[:20]

#tcb = order counts per city and status, restricted to the top 20 cities
tcb=dt[dt["City"].isin(cty.index)]

#group the filtered frame (grouping dt here would discard the top-20 filter)
tcb=tcb.groupby(["City","Order_Status"])["Order_Number"].count().reset_index().sort_values("Order_Number", ascending=False)

#to add another column for Province against each city
prov = {'Karachi': "Sindh", 'Lahore': "Punjab", 'Islamabad': "Islamabad", 'Rawalpindi': "Punjab", 'Faisalabad': "Punjab",'Peshawar': "KPK", 'Multan': "Punjab", 'Gujranwala': "Punjab", 'Sialkot': "Punjab", 'Hyderabad': "Sindh",'Quetta':"Baluchistan", 'Bahawalpur':"Punjab", 'Sargodha':"Punjab", 'Abbottabad': "KPK", 'Sahiwal':"Punjab",'Okara':"Punjab",'Sheikhupura':"Punjab",'Gujrat':"Punjab", 'Sukkur':"Sindh", 'Chakwal':"Punjab"}
tcb["Province"] = tcb["City"].map(prov)
tcb.columns=['City','Order_Status','Total_Order','Province']

x=tcb.groupby(["Province","Order_Status"])["Total_Order"].sum().reset_index().sort_values("Total_Order", ascending=False)

fig, ax = plt.subplots()
ax=sns.barplot(x="Province",y="Total_Order",data=x,hue="Order_Status" , ci=None)
fig.set_size_inches(18,9)
ax.set_title('Province wise Order Status',fontsize=20)
ax.set_xlabel("Province",fontsize=18)
ax.set_ylabel("Number of Order(s)",fontsize=18) 
plt.xticks(rotation=90)
plt.show()

**The data shows that the most orders were received from Punjab, followed by Sindh, Islamabad, KPK and Baluchistan.
Please note that, for ease of understanding, the data is taken from the top 20 cities only.**
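The task also asks for a breakdown by city. A minimal sketch of one way to do that: keep the busiest cities and compute the share of each order status within each city. The data is made up, and `top_n` is set to 2 only to keep the toy output small (the notebook uses 20).

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    "City": ["Karachi"] * 4 + ["Lahore"] * 4 + ["Quetta"] * 1,
    "Order_Status": ["Completed", "Completed", "Completed", "Returned",
                     "Completed", "Completed", "Cancelled", "Returned",
                     "Completed"],
})

top_n = 2
top_cities = toy["City"].value_counts().head(top_n).index
subset = toy[toy["City"].isin(top_cities)]

# Share of each status within each city (each row sums to 1)
city_rates = (
    subset.groupby("City")["Order_Status"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print(city_rates.round(2))
```

Sorting `city_rates` by the "Returned" or "Cancelled" column would then highlight the cities where orders most often fail to complete.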

In [None]:
#cty = dt['City'].value_counts().iloc[:20]
bks = dt['Book_Name'].value_counts().iloc[:20]

#tcb = order counts per book and status, restricted below to the top 20 books
tcb=dt.groupby(["Book_Name","Order_Status"])["Order_Number"].count().reset_index().sort_values("Order_Number", ascending=False)

tcb=tcb[tcb["Book_Name"].isin(bks.index)]
#tcb=tcb[tcb["City"].isin(cty.index)]

fig, ax = plt.subplots()
ax=sns.barplot(x="Book_Name",y="Order_Number",data=tcb,hue="Order_Status" )
fig.set_size_inches(18,9)
ax.set_title('Top Books wise Status (Frequency)',fontsize=20)
ax.set_xlabel("Top Book Title(s)",fontsize=18)
ax.set_ylabel("Number of Order(s)",fontsize=18) 
plt.xticks(rotation=90)
plt.show()

**In this graph we can see the top 20 books that were ordered, broken down by order status. Please note that, for ease of understanding, the data is limited to the top 20 books.**

# **Thanks for viewing my notebook, you are welcome to give any comments/suggestions to further improve my work.**