## Introduction:
**Dataset:** The dataset contains detailed information on 200,000 online book orders in Pakistan from January 2019 to January 2021. It contains the order number, order status (completed, cancelled, returned), order date and time, book name, and customer city. This is the most detailed dataset about e-commerce orders in Pakistan available in the public domain.
 
 **Variables:** The dataset contains order number, order status, book name, order date, order time and city of the customer.
 
**Using machine learning and data science to explore these questions:**
*  What is the best-selling book?
*  Visualize order status frequency
*  Find a correlation between date and time with order status
*  Find a correlation between city and order status
*  Find any hidden patterns that are counter-intuitive for a layman
*  Can we predict number of orders, or book names in advance?

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # For Visualization
import matplotlib.pyplot as plt # For visualization
import plotly.express as px # for high-resolution interactive charts

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## **Read The data**

In [None]:
dt = pd.read_csv("/kaggle/input/gufhtugu-publications-dataset-challenge/GP Orders - 5.csv")

dt.head(10)

In [None]:
dt = dt.rename(columns={'Order Number': 'Order_Number',"Order Status":"Order_Status", 
                        "Book Name":"Book_Name","Order Date & Time":"Order_Date","City":"City",
                        "Payment Method":"Payment_Method","Total items":"Total_items","Total weight (grams)":"grams" })

In [None]:
dt.head()

## **Checking for null values and filling missing values**

In [None]:
dt.isna().sum()


In [None]:
print("Checking for null values")

miss_value = dt[dt.isnull().any(axis=1)]
miss_value.head()

Now we can clearly see the rows with missing data. Next we check how the missing values are encoded, because pandas only recognizes `NaN` (and `None`) as missing.
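
The point about encoding can be illustrated with a small sketch. It uses a toy frame rather than the orders file, and the placeholder strings (`""`, `"N/A"`, `"-"`) are hypothetical examples, not values confirmed to occur in this dataset. It shows that `isna()` ignores such placeholders until they are converted to `NaN`.

```python
import numpy as np
import pandas as pd

# Toy frame; the placeholder strings are hypothetical examples,
# not values confirmed to occur in the orders file.
toy = pd.DataFrame({
    "City": ["Karachi", "", "N/A", None],
    "Payment_Method": ["Cash on delivery", "-", "Cash on delivery", "  "],
})

placeholders = ["", "N/A", "-"]

# Strip whitespace first so "  " becomes "", then map placeholders to NaN.
cleaned = toy.apply(lambda col: col.str.strip()).replace(placeholders, np.nan)

print(toy.isna().sum())      # only the real None is counted
print(cleaned.isna().sum())  # placeholders are now counted as missing too
```

If a check like this finds placeholder strings in the real data, converting them before calling `isna()` keeps the missing-value counts honest.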

## **Two approaches to filling missing values: 1) the mean/median for numerical columns, 2) the most frequent value for categorical columns**

We use the second approach because the missing values are in "Payment Method", "Book Name" and "City", which are all categorical columns.
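
As a minimal sketch of the two approaches, the example below fills a categorical column with its most frequent value and a numerical column with its median. The frame and its values are made up for illustration; only the column names mirror the renamed columns in this notebook.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "City": ["Karachi", "Lahore", None, "Karachi"],
    "Total_items": [1.0, 2.0, None, 9.0],
})

# Categorical column: fill with the most frequent value (the mode).
most_common_city = toy["City"].mode().iloc[0]
toy["City"] = toy["City"].fillna(most_common_city)

# Numerical column: fill with the median, which is less sensitive to outliers than the mean.
toy["Total_items"] = toy["Total_items"].fillna(toy["Total_items"].median())

print(toy)
```

Filling with the most frequent value is simple, but it inflates the count of that value, which matters when the same column is later used to rank books or cities.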

In [None]:
# Show the count of each book; the most frequent value will replace the NaN entries
dt['Book_Name'].value_counts()


In [None]:
# Index 0 of value_counts() is the most frequent book, which is used to fill the missing values
dt['Book_Name'].value_counts().index[0]

In [None]:
# Fill the missing (NaN) book names with 'انٹرنیٹ سے پیسہ کمائیں'
# (assign the result back instead of using inplace=True on a column, which newer pandas versions warn about)
dt['Book_Name'] = dt['Book_Name'].fillna(dt["Book_Name"].value_counts().index[0])

In [None]:
# Show the count of each payment method; the most frequent value will replace the missing values
dt['Payment_Method'].value_counts()

In [None]:
# Select the value at index 0 (the most frequent payment method)
dt["Payment_Method"].value_counts().index[0]


In [None]:
# Fill the missing (NaN) payment methods with "Cash on delivery"
dt['Payment_Method'] = dt['Payment_Method'].fillna(dt['Payment_Method'].value_counts().index[0])

In [None]:
# Show the count of each city; the most frequent value will replace the missing values
dt['City'].value_counts()

In [None]:
# Fill the missing (NaN) cities with "Karachi"
dt["City"] = dt["City"].fillna(dt['City'].value_counts().index[0])

The missing values have now been filled.

In [None]:
dt.isna().sum()

In [None]:
dt.City.value_counts()[:10]

In [None]:
top_city = dt.groupby('City')['Order_Number'].count().reset_index().sort_values('Order_Number', ascending = False)
top_city.head(20)

The output above shows duplicated city names: 'Karachi' and 'karachi', for example, are the same city. To merge them, we convert all city names to uppercase.

In [None]:
# Convert the City column to uppercase
dt['City'] = dt['City'].str.upper()


In [None]:
# Confirm that the duplicates are merged and show which cities placed the most orders
dt.groupby('City')['Order_Number'].count().reset_index().sort_values('Order_Number', ascending = False).head(20)
#dt.City.value_counts()[:10]

## What is the best-selling book?

In [None]:
dt["Order_Date"] = pd.DatetimeIndex(dt["Order_Date"])
dt['Date'] = dt['Order_Date'].dt.date
dt['Time'] = dt['Order_Date'].dt.time
dt['Year'] = dt['Order_Date'].dt.year
dt['Month'] = dt['Order_Date'].dt.month_name()
dt['Day'] = dt['Order_Date'].dt.day_name()

The reason for splitting the book column is to count every book. Some orders contain more than one book, and in that case the whole order is stored in a single row, so splitting the column lets us count each book separately.
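
The splitting idea can be sketched on toy data with `str.split` followed by `DataFrame.explode`, which is one alternative way to get one row per book. The order numbers and titles below are invented for illustration, and `/` is assumed to be the separator between books in a single order.

```python
import pandas as pd

# Toy orders; the second field holds one or more books separated by "/".
toy = pd.DataFrame({
    "Order_Number": [101, 102],
    "Book_Name": ["Python Programming/Data Science", "Data Science"],
})

# Split each order's book string into a list, then give every book its own row.
per_book = (
    toy.assign(Book_Name=toy["Book_Name"].str.split("/"))
       .explode("Book_Name")
       .reset_index(drop=True)
)

# Remove stray whitespace so identical titles are counted together.
per_book["Book_Name"] = per_book["Book_Name"].str.strip()

print(per_book["Book_Name"].value_counts())
```

One caveat with any split on `/`: a title that itself contains a slash is also broken into pieces, so such titles need to be mapped back to a single name afterwards.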

In [None]:
split_data = dt.drop('Book_Name', axis=1).join(dt['Book_Name'].str.split('/', expand=True).stack().reset_index(level=1, drop=True).rename('Book_Name'))
split_data.head(10)


In [None]:
# Remove duplicates
# Map the different spellings of the same book to a single name (string comparison in Python is case-sensitive)

split_data['Book_Name'] = split_data['Book_Name'].replace(['(C++) ++سی/سی++', 'سی/سی (C++) ++','(C++)', '(C++) ++سی', 'سی'], 'C++')
split_data['Book_Name'] = split_data['Book_Name'].replace(['ڈیٹا سائنس ۔ ایک تعارف' , 'ڈیٹا سائنس'], 'Data Science')
split_data['Book_Name'] = split_data['Book_Name'].replace(['بلاک چین اور کرپٹو کرنسی'], 'Blockchain, Cryptocurrency And Bitcoin')
split_data["Book_Name"] = split_data["Book_Name"].replace(['انٹرنیٹ سے پیسہ کمائیں؟- مستحقین زکواة'], 'انٹرنیٹ سے پیسہ کمائیں')
split_data['Book_Name'] = split_data['Book_Name'].replace(['R ka Taaruf', 'R ka Taaruf آر کا تعارف'], 'R ka Taaruf آر کا تعارف')
split_data['Book_Name'] = split_data['Book_Name'].replace(['molo masali - مولو مصلی' ], 'molo masali')
split_data['Book_Name'] = split_data['Book_Name'].replace(["python programming- release date: august 14, 2020"], "python programming")

# best selling book group
split_data.groupby('Book_Name')['Order_Number'].count().reset_index().sort_values('Order_Number', ascending = False).head(10)



In [None]:
# Strip leading/trailing whitespace from book names (strip('') with an empty string removes nothing)
split_data["Book_Name"] = split_data["Book_Name"].str.strip()
book_stats = split_data["Book_Name"].value_counts(ascending=False)
book_stats.head()

In [None]:
# Build a two-column frame (Book, Orders) so the plot does not depend on how
# value_counts() names its output, which differs between pandas versions
Best_sell_Book = (split_data["Book_Name"].value_counts().nlargest(15)
                  .rename_axis("Book").reset_index(name="Orders"))

fig = px.bar(Best_sell_Book, x="Book", y="Orders", color="Orders", height=650,
             title='Top 15 Best-Selling Books',
             custom_data=["Orders", "Book"])

fig.update_xaxes(title="Top 15 Best-Selling Books at Gufhtugu Publications",
                 title_font=dict(size=18, family='Courier'), 
                 linecolor='Black', mirror=True)

fig.update_yaxes(title="Books Selling Count",title_font=dict(size=18, family='Courier', ),
                 linecolor='gray', mirror=True)

fig.update_traces(texttemplate='%{y}', textposition='outside') 

# fig.update_traces(marker_color='#ff7c43',
#                   hovertemplate="<br>".join(["Book_Name: %{x}", "Count: %{y}",
 
                                            
#     ]))

#fig.update_layout(hovermode="x unified")
fig.show()

## Visualize order status frequency

This section takes a closer look at the order status. The code below counts the number of completed, cancelled, and returned orders.
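
Raw counts are easier to compare when shown next to their share of all orders. The sketch below does this on a made-up status column; the status labels are illustrative and may not match the exact spelling used in the dataset.

```python
import pandas as pd

# Toy statuses for illustration; in the notebook the real column is dt["Order_Status"].
status = pd.Series(
    ["Completed", "Completed", "Completed", "Cancelled", "Returned"],
    name="Order_Status",
)

# Absolute counts and percentage shares side by side.
summary = pd.DataFrame({
    "orders": status.value_counts(),
    "share_pct": (status.value_counts(normalize=True) * 100).round(1),
})

print(summary)
```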

In [None]:
# Count the orders in each Order_Status category (completed, cancelled, returned)
dt.groupby('Order_Status')['Order_Status'].agg('count')


In [None]:
# Visualize the Order_Status counts for completed, cancelled, and returned orders
px.histogram(dt, x=dt.Order_Status, color=dt.Order_Status, width = 700, height = 500, title= 'Order Status Frequency', marginal='rug',
             hover_name='Order_Number', hover_data=dt.columns)

## Find a correlation between date and time with order status

In [None]:
Year_Data= dt.groupby(["Year"])["Order_Number"].count().reset_index()
fig=px.pie(Year_Data, values=Year_Data.Order_Number, names=Year_Data['Year'])
fig.update_traces(hole=.4)
fig.update_layout(
    title_text="Year Wise total Orders",
    # Add annotations in the center of the donut pies.
    annotations=[dict(text='Order_Status',  font_size=20, showarrow=False),
                 ])
fig.show()

In [None]:
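# Orders per month as a two-column frame (Month, Orders); explicit column names
# avoid relying on how value_counts() names its output across pandas versions
df = dt["Month"].value_counts().nlargest(12).rename_axis("Month").reset_index(name="Orders")
df.head()

fig = px.histogram(df, x="Month", y="Orders", color="Month", title='Most successful Months For Gufhtugu Publishers', )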

fig.update_xaxes(title="Best Month For Gufhtugu Publishers",
                 title_font=dict(size=18, family='Courier'), 
                 linecolor='Black', mirror=True)

fig.update_yaxes(title="Books Selling Count",title_font=dict(size=18, family='Courier', ),
                 linecolor='gray', mirror=True)

fig.show()

In [None]:
plt.figure(figsize=(10,6))
ax=sns.countplot(x =dt['Year'], hue = 'Year', data = dt)
ax.set_title("The Most Orders By Year ", fontsize = 20)
plt.xlabel("Year ",fontsize=17)
plt.ylabel("Number of Orders", fontsize=17)
for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x() + 0.12, p.get_height()), color='black', size=15, ha="center")

In [None]:
plt.figure(figsize = (15,7))
ax = sns.countplot(x=dt.Day,  data=dt, hue = 'Order_Status')

ax.set_title("The Most Orders by Day ", fontsize = 20)
plt.xlabel("Day name of the most orders ",fontsize=17)
plt.ylabel("Number of Orders", fontsize=17)
for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x() + 0.1, p.get_height()), color='black', size=15, ha="center")

## Find a correlation between city and order status

The code below looks at the cities where the most orders were placed.
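
To relate city and order status directly, one option is a row-normalized cross-tabulation, sketched here on invented data. Each row then shows how a city's orders are distributed across statuses, independent of how many orders the city placed. The cutoff of two cities is arbitrary for the toy example; the notebook's own charts use the top 10.

```python
import pandas as pd

# Toy orders for illustration only.
toy = pd.DataFrame({
    "City": ["KARACHI", "KARACHI", "KARACHI", "KARACHI", "LAHORE", "LAHORE"],
    "Order_Status": ["Completed", "Completed", "Completed", "Returned",
                     "Completed", "Cancelled"],
})

# Keep only the cities with the most orders.
top_cities = toy["City"].value_counts().head(2).index
subset = toy[toy["City"].isin(top_cities)]

# normalize="index" makes each row sum to 1, so cities of different size are comparable.
status_share = pd.crosstab(subset["City"], subset["Order_Status"], normalize="index")

print(status_share.round(2))
```

Large differences between rows would suggest that order status depends on the city; similar rows would suggest it does not.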

In [None]:
Most_orderBy_city = dt.City.value_counts()[:10]
Most_orderBy_city.head()

In [None]:
city_count  = dt['City'].value_counts()
city_count = city_count.head(10)
plt.figure(figsize=(15,7))
ax= sns.barplot(x=city_count.index, y=city_count.values)
plt.title('Book Orders in the Top 10 Cities in Pakistan', fontsize=15)
plt.ylabel('Number of Orders', fontsize=15)
plt.xlabel('Cities Names', fontsize=15)
for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x () + 0.4, p.get_height()), color='black', size=15, ha="center")
plt.show()

In [None]:
dt.groupby('Payment_Method')['Order_Number'].agg('count')

In [None]:
# Merge the duplicate label 'Cash on Delivery (COD)' into 'Cash on delivery'
dt['Payment_Method'] = dt['Payment_Method'].replace(['Cash on Delivery (COD)'], 'Cash on delivery')


In [None]:
# Payment_Method is now clean; count the orders per payment method again
dt.groupby('Payment_Method')['Order_Number'].agg('count')

In [None]:
plt.figure(figsize=(15,7))
ax=sns.countplot(x="Order_Status",hue="Payment_Method", data=dt, palette="Set2")
plt.title('Payment Method', fontsize=15)
plt.ylabel('Number of Orders', fontsize=15)
plt.xlabel('Order Status', fontsize=15)
for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x () + 0.1, p.get_height()), color='black', size=15, ha="center")
plt.show()


## Can we predict number of orders, or book names in advance?

The code below works toward predicting the number of orders or the best-selling book in advance. The first task is to find the best-selling books for each year.

In [None]:
Top_Book_year_wise=dt.groupby(["Book_Name","Year" ])["Order_Number"].count().reset_index().sort_values("Order_Number", ascending=False)
Top_Book_year_wise.head()

In [None]:
# Count orders per book and year; this table feeds the yearly top-10 charts used for the prediction
Year_books=dt[['Book_Name','Year']].value_counts().rename_axis(['Book','Year']).reset_index(name='counts')

In [None]:
# Best-selling books in 2019
Year2019=Year_books[Year_books['Year']==2019].nlargest(10, 'counts')
Year2019.head()

px.bar( Year2019, x= Year2019.Book, y='counts', title='Top_10 Books In 2019', color='Book', )

In [None]:
# Best-selling books in 2020
Year2020=Year_books[Year_books['Year']==2020].nlargest(10, 'counts')
px.bar( Year2020, x= Year2020.Book, y='counts', title='Top_10 Books In 2020')

In [None]:
# Best-selling books in 2021
Year2021=Year_books[Year_books['Year']==2021].nlargest(10, 'counts')
px.bar( Year2021, x= Year2021.Book, y='counts', title='Top_10 Books In 2021', color='Book')

According to the dataset, 2679 orders were placed in the first month of 2021. If the order rate stays the same throughout the year, the total for 2021 would be around 32148 orders (2679 × 12). On that basis, the predicted best-selling book is Lucky Draw -Free Book with about 5844 orders, followed by انٹرنیٹ سے پیسہ کمائیں with about 4542 orders.
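
The yearly total is a simple run-rate extrapolation, and the arithmetic can be written out as below. This is only a sketch of that one calculation: it assumes every month of 2021 matches January, so it ignores seasonality and growth, and it does not reproduce the per-book figures.

```python
# Naive run-rate projection: assume every month of 2021 looks like January.
january_orders = 2679   # January 2021 order count quoted in the text
months_in_year = 12

projected_2021_orders = january_orders * months_in_year
print(projected_2021_orders)  # 32148
```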

If you found this notebook helpful, don't forget to upvote.