## GP - A Complete Overview

### This notebook gives a complete and detailed overview of the dataset

### We will explore the following questions with this dataset
1. What is the best-selling book?
2. Visualize order status frequency
3. Find a correlation between date and time with order status
4. Find a correlation between city and order status
5. Find any hidden patterns that are counter-intuitive for a layman
6. Can we predict the number of orders, or the book names, in advance?

In [None]:
#importing necessary libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
import nltk


In [None]:
#importing data to pandas dataframe 
#importing the updated dataset GP Orders - 5.csv  
df = pd.read_csv('/kaggle/input/gufhtugu-publications-dataset-challenge/GP Orders - 5.csv')


In [None]:
#Inspecting the datasets
df.head(10)

In [None]:
df.columns

In [None]:
# Rename columns
col_names = {'Order Number' : 'OrderNumber' , 'Order Status' : 'OrderStatus', 'Book Name' : 'BookName',
           'Order Date & Time' : 'OrderDate', 'City' : 'City', 'Payment Method' : 'PaymentMethod', 'Total items' : 'TotalItems', 'Total weight (grams)' : 'TotalWeight'}
df.rename(col_names, axis = 1, inplace = True)

In [None]:
df.columns

In [None]:
print(f'Shape of the dataset {df.shape}')

In [None]:
#inspecting the dataset
print(df.info())
print(df.describe())

In [None]:
#Checking the Null values
df.isnull().sum()

In [None]:
#Only a few values are null, so dropping those rows should not affect the analysis
df.dropna(inplace = True)

In [None]:
df.isnull().sum()

In [None]:
print(df['BookName'].head(10))

In [None]:
print(f'There are {len(df["City"].unique())} unique city names in the dataset \
and {len(df["BookName"].unique())} unique book names.')

In [None]:
df['BookName'].value_counts().head(10)

In [None]:
df['BookName'].value_counts().tail(10)

In [None]:
#Keeping only the first title when an order lists several books separated by "/"
df["BookName"] = df["BookName"].str.split("/").str[0]
df["BookName"].tail(10)
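
Keeping only the text before the first "/" drops every additional title in a multi-book order, which can undercount books that are usually bought together with others. A minimal sketch of counting every title instead, using `str.split` and `explode` on made-up data (the `orders` frame and its titles are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the raw book-name column; the titles are made up.
orders = pd.DataFrame({
    "OrderNumber": [1, 2, 3],
    "BookName": ["Book A/Book B", "Book A", "Book C / Book A"],
})

# Split on "/" into lists, give each title its own row, then trim spaces.
exploded = orders.assign(BookName=orders["BookName"].str.split("/")).explode("BookName")
exploded["BookName"] = exploded["BookName"].str.strip()

# Every title is now counted once per order it appears in.
book_counts = exploded["BookName"].value_counts()
print(book_counts)
```

With this layout an order number can appear on several rows, so order-level statistics should be computed on the original frame.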

In [None]:
#Convert entries into lower case
df['BookName'] = df['BookName'].str.lower()

In [None]:
#Replace any special character in the "BookName" column with a space
chars = ["!",'"',"#","%","&","'","(",")","*","+",",",".","/",":",";","<",
        "=",">","?","@","[","\\","]","^","_","`","{","|","}","~","–"]
for char in chars:
    #regex=False treats each character literally; otherwise "(", "*", "." and others are read as regex syntax
    df['BookName'] = df['BookName'].str.replace(char, ' ', regex=False)
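
The same cleanup can be done in one pass with a regular-expression character class. The sketch below is an illustration on hypothetical titles, not the notebook's own method: it builds the class from `string.punctuation` (which also contains "$", unlike the `chars` list), keeps hyphens, adds the en dash, and collapses the leftover runs of spaces.

```python
import re
import string
import pandas as pd

# Hypothetical sample titles, not taken from the dataset.
titles = pd.Series(["Python (Programming)!", "Data.Science, 101", "C++ Basics"])

# ASCII punctuation without "-" (hyphens are kept), plus the en dash.
to_strip = string.punctuation.replace("-", "") + "–"
# re.escape makes every character literal inside the character class.
pattern = "[" + re.escape(to_strip) + "]"

cleaned = (
    titles.str.lower()
    .str.replace(pattern, " ", regex=True)   # punctuation -> space
    .str.replace(r"\s+", " ", regex=True)    # collapse repeated spaces
    .str.strip()                             # drop leading/trailing spaces
)
print(cleaned.tolist())
```

Collapsing whitespace matters here because two titles that differ only in the number of spaces would otherwise be counted as different books.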

In [None]:
df['BookName'].value_counts().head(20)

In [None]:
#Some book names need to be merged, such as "python programming" and
#"python programming- release date  august 14  2020 "
#The patterns accept either the original punctuation or the spaces that replaced it
df["BookName"] = df["BookName"].str.replace(r"python programming-\s*release date[:\s]+august 14[,\s]+2020" , "python programming", regex=True)
df["BookName"] = df["BookName"].str.replace("انٹرنیٹ سے پیسہ کمائیں؟- مستحقین زکواة" , "", regex=False)
df["BookName"] = df["BookName"].str.replace("molo masali - مولو مصلی" , "molo masali", regex=False)
df["BookName"] = df["BookName"].str.replace("r ka taaruf  آر کا تعارف" , "r ka taaruf", regex=False)
df["BookName"] = df["BookName"].str.replace(r"linux - an introduction release data -\s*october 3[,\s]+2020" , "linux - an introduction", regex=True)
#Trim stray spaces so identical titles are counted together
df["BookName"] = df["BookName"].str.strip()

In [None]:
df['BookName'].value_counts().head(10)

In [None]:
print(f'There are {len(df["City"].unique())} unique city names in the dataset \
and {len(df["BookName"].unique())} unique book names.')

In [None]:
#Now check the cities: there are 4163 unique city names, so the column needs cleaning
df.City.sample(20)

In [None]:
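#Lowering the case and replacing the special characters with spaces
df['City'] = df['City'].str.lower()
for char in chars:
    #regex=False treats each character literally instead of as regex syntax
    df['City'] = df['City'].str.replace(char, ' ', regex=False)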

In [None]:
df["City"].value_counts().head(10)

In [None]:
df.City.value_counts().head(10)

In [None]:
#The inspection shows there are many distinct values in the city column, so match them against a list of known cities
cities = ['islamabad', 'ahmed nager chatha', 'ahmadpur east', 'ali khan abad', 'alipur', 'arifwala', 'attock', 'bhera',
              'bhalwal', 'bahawalnagar','bahawalpur', 'bhakkar', 'burewala', 'chillianwala', 'chakwal', 'chichawatni',
              'chiniot', 'chishtian','daska', 'darya khan', 'dera ghazi khan', 'dhaular', 'dina', 'dinga', 'dipalpur', 'faisalabad', 'ferozewala',
              'fateh jhang','ghakhar mandi', 'gojra', 'gujranwala', 'gujrat', 'gujar khan', 'hafizabad', 'haroonabad', 'hasilpur',
              'haveli lakha', 'jatoi',
              'jalalpur', 'jattan', 'jampur', 'jaranwala', 'jhang', 'jhelum', 'kalabagh', 'karor lal esan', 'kasur', 'kamalia', 'kamoke',
              'khanewal',
              'khanpur', 'kharian', 'khushab', 'kot addu', 'jauharabad', 'lahore', 'lalamusa', 'layyah', 'liaquat pur',
              'lodhran', 'malakwal', 'mamoori', 'mailsi', 'mandi bahauddin', 'mian channu', 'mianwali', 'multan', 'murree', 
              'muridke', 'mianwali bangla', 'muzaffargarh', 'narowal', 'nankana sahib', 'okara', 'renala khurd', 'pakpattan', 
              'pattoki', 'pir mahal', 'qaimpur', 'qila didar singh', 'rabwah', 'raiwind', 'rajanpur', 'rahim yar khan',
              'rawalpindi',
              'sadiqabad', 'safdarabad', 'sahiwal', 'sangla hill', 'sarai alamgir', 'sargodha', 'shakargarh', 'sheikhupura',
              'sialkot','sohawa', 'soianwala', 'siranwali', 'talagang', 'taxila', 'toba tek singh', 'vehari', 'wah cantonment', 
              'wazirabad',
              'badin', 'bhirkan', 'rajo khanani', 'chak', 'dadu', 'digri', 'diplo', 'dokri', 'ghotki', 'haala', 'hyderabad',
              'islamkot', 'jacobabad', 'jamshoro', 'jungshahi', 'kandhkot', 'kandiaro', 'karachi', 'kashmore', 'keti bandar',
              'khairpur', 'kotri', 'larkana', 'matiari', 'mehar', 'mirpur khas', 'mithani', 'mithi', 'mehrabpur', 'moro',
              'nagarparkar', 'naudero', 'naushahro feroze', 'naushara', 'nawabshah', 'nazimabad', 'qambar', 'qasimabad', 
              'ranipur', 'ratodero', 'rohri', 'sakrand', 'sanghar', 'shahbandar', 'shahdadkot', 'shahdadpur',
              'shahpur chakar', 'shikarpaur', 'sukkur', 'tangwani', 'tando adam khan', 'tando allahyar',
              'tando muhammad khan', 'thatta', 'umerkot', 'warah', 'abbottabad', 'adezai', 'alpuri', 'akora khattak',
              'ayubia', 'banda daud shah', 'bannu', 'batkhela', 'battagram', 'birote', 'chakdara', 'charsadda', 'chitral',
              'daggar', 'dargai', 'darya khan', 'dera ismail khan', 'doaba', 'dir', 'drosh', 'hangu', 'haripur', 'karak',
              'kohat', 'kulachi', 'lakki marwat', 'latamber', 'madyan', 'mansehra', 'mardan', 'mastuj', 'mingora', 'nowshera','paharpur', 'pabbi', 'peshawar', 'saidu sharif', 'shorkot', 'shewa adda', 'swabi', 'swat', 'tangi', 'tank',
              'thall', 'timergara', 'tordher', 'awaran', 'barkhan', 'chagai', 'dera bugti', 'gwadar', 'harnai', 'jafarabad',
              'jhal magsi', 'kacchi', 'kalat', 'kech', 'kharan', 'khuzdar', 'killa abdullah', 'killa saifullah', 'kohlu',
              'lasbela', 'lehri', 'loralai', 'mastung', 'musakhel', 'nasirabad', 'nushki', 'panjgur', 'pishin valley', 
              'quetta', 'sherani', 'sibi', 'sohbatpur', 'washuk', 'zhob', 'ziarat']

def city_unique(city):
    for i in cities:
        if i in str(city):
            return i
    return city

In [None]:
df.City = df.City.apply(city_unique)

In [None]:
df["City"].nunique()

### The number of unique city names has been reduced from 4163 to 1869
*However, many entries are still misspelled or contain a complete address, and need further cleaning.*

In [None]:
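# Function courtesy @hammad40241

def clean_city(row):
    # Unique whitespace-separated tokens of the address (split() already drops empty strings)
    tokens = set(row.City.split())
    # First look for a known city name anywhere in the address
    for city in cities:
        if city in row.City:
            return city

    # Otherwise fall back to fuzzy matching on each token
    for token in tokens:
        for c in cities:
            # Accepts up to 15 character edits. This is very loose: any token of 15
            # characters or fewer matches the first city in the list ('islamabad').
            # Lower the threshold (e.g. to 2 or 3) for stricter spelling correction.
            if nltk.edit_distance(token, c) <= 15:
                return c
    return row.City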
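
A fixed threshold on raw edit distance treats short and long names alike, so a similarity ratio can be easier to tune. The sketch below uses the standard library's `difflib` in place of `nltk.edit_distance`; it is a swapped-in technique, not the function above. The `closest_city` helper, the five-city list and the 0.8 cutoff are illustrative assumptions.

```python
import difflib

# Short illustrative subset of the city list; the real list is much longer.
known_cities = ["islamabad", "lahore", "karachi", "rawalpindi", "peshawar"]

def closest_city(raw, candidates=known_cities, cutoff=0.8):
    """Return the best-matching known city, or the input unchanged."""
    text = str(raw).lower().strip()
    # Try the whole string first, then each token of the address.
    for token in [text] + text.split():
        match = difflib.get_close_matches(token, candidates, n=1, cutoff=cutoff)
        if match:
            return match[0]
    return raw

for raw in ["Lahor", "islambad", "house 12 karachi", "unknown place"]:
    print(raw, "->", closest_city(raw))
```

Entries with no sufficiently similar candidate are returned as they are, so they stay visible for manual review instead of being assigned to an arbitrary city.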

In [None]:
df.City = df.apply(clean_city, axis =1)

In [None]:
print(f'There are {len(df["City"].unique())} unique city names in the dataset \
and {len(df["BookName"].unique())} unique book names.')

#### The book and city names are now cleaned; let's explore the remaining columns

In [None]:
df.OrderDate.dtype

In [None]:
#Converting the object datatype into Pandas Date Time
df.OrderDate = pd.to_datetime(df.OrderDate)
df.OrderDate.head()

In [None]:
df.info()

In [None]:
#Checking the order status
print(df.OrderStatus.unique())
print(df.OrderStatus.value_counts())

In [None]:
#Visualizing order status
sns.set()
plt.figure(figsize = (10,8))
plt.hist(x = 'OrderStatus', data = df)
plt.show()

### Year-wise order status

In [None]:
status_19 = df[df.OrderDate.dt.year == 2019].OrderStatus
status_20 = df[df.OrderDate.dt.year == 2020].OrderStatus
status_21 = df[df.OrderDate.dt.year == 2021].OrderStatus

_ = plt.figure(figsize = (16, 10))
_ = plt.subplot(1,3,1)
_ = plt.hist(status_19)
_ = plt.title('Order Status 2019')
_ = plt.xlabel('Order status')
_ = plt.ylabel('Total no of orders')

_ = plt.subplot(1,3,2)
_ = plt.hist(status_20)
_ = plt.title('Order Status 2020')
_ = plt.xlabel('Order status')
_ = plt.ylabel('Total no of orders')

_ = plt.subplot(1,3,3)
_ = plt.hist(status_21)
_ = plt.title('Order Status 2021')
_ = plt.xlabel('Order status')
_ = plt.ylabel('Total no of orders')

In [None]:
#Top selling books
#Explicit names keep the column labels the same across pandas versions
top_books = df.BookName.value_counts().nlargest(10).rename('Orders').rename_axis('BookName').reset_index()
print(top_books)
fig = px.bar(top_books, x = 'BookName', y = 'Orders', title = 'Top 10 Best-Selling Books', \
             labels={'BookName':'Book Name', 'Orders':'Order Counts'}, color = 'Orders', text = 'Orders')
fig.update_traces(texttemplate='%{text:.3}', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.show()

### Top 20 cities with the most orders


In [None]:
#Explicit names keep the column labels the same across pandas versions
top_cities = df.City.value_counts().nlargest(20).rename('Orders').rename_axis('City').to_frame()
print(top_cities)
fig = px.bar(top_cities.reset_index(), x = 'City', y = 'Orders', title = 'Top 20 Cities with Most Orders', \
             labels={'Orders':'Order Counts'}, color = 'Orders', text = 'Orders')
fig.update_traces(texttemplate='%{text:.3}', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.show()

In [None]:
top_cities.plot(kind='pie',figsize=(12,12),autopct='%1.1f%%', subplots = True)

### Month-wise order frequency

In [None]:
top_months = df.OrderDate.dt.month.value_counts().rename('Orders').rename_axis('Month').to_frame()
#Changing the month numbers to month names
import calendar
d=dict((enumerate(calendar.month_abbr)))
top_months = top_months.rename(index=d).reset_index()
print(top_months)
fig = px.bar(top_months, x = 'Month', y = 'Orders', title = 'Months with Most Orders', \
             labels={'Month':'Months', 'Orders':'Order Counts'}, color = 'Orders', text = 'Orders')
fig.update_traces(texttemplate='%{text:.3}', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.show()

### Day-wise order frequency

In [None]:
top_days = df.OrderDate.dt.day.value_counts().rename('Orders').rename_axis('Day').reset_index()
#Converting the day numbers to strings so Plotly treats them as categories
top_days['Day'] = top_days['Day'].astype(str)
print(top_days)
fig = px.bar(top_days, x = 'Day', y = 'Orders', title = 'Days of the Month with Most Orders', \
             labels={'Orders':'Order Counts'}, color = 'Orders', text = 'Orders')
fig.update_traces(texttemplate='%{text:.3}', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.show()

### Order status by payment method

In [None]:
pay_method_order_count = df.groupby('OrderStatus')['PaymentMethod'].value_counts().to_frame()
print(pay_method_order_count)
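
Raw counts are hard to compare when one payment method dominates the data. A small sketch, on made-up rows, of turning the same grouping into per-method shares with `pd.crosstab(..., normalize="index")`; the status and payment values below are placeholders and may not match the labels in the dataset.

```python
import pandas as pd

# Made-up rows standing in for the OrderStatus and PaymentMethod columns.
sample = pd.DataFrame({
    "OrderStatus": ["Completed", "Completed", "Returned",
                    "Completed", "Canceled", "Completed"],
    "PaymentMethod": ["Cash on delivery", "Cash on delivery", "Cash on delivery",
                      "EasyPaisa", "EasyPaisa", "JazzCash"],
})

# Rows: payment method; columns: order status; values: share within each method.
status_share = pd.crosstab(
    sample["PaymentMethod"], sample["OrderStatus"], normalize="index"
)
print(status_share.round(2))
```

Each row sums to 1, so the completion rate of a rarely used method can be compared directly with that of cash on delivery.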

### Order count by payment method

In [None]:
df.PaymentMethod = df.PaymentMethod.replace({'Cash on Delivery (COD)': 'Cash on delivery'})
fig = px.histogram(df, x = 'PaymentMethod', width = 600, height = 400, title = 'Frequency of Payment Method', color = 'PaymentMethod')
fig.show()

In [None]:
_ = plt.figure(figsize = (20, 8))
_ = plt.subplot(1,4,1)
_ = sns.countplot(x = 'OrderStatus', data = df[df.PaymentMethod == 'Cash on delivery'])
_.set_title('Order Status for Cash on Delivery')
_ = plt.subplot(1,4,2)
_ = sns.countplot(x = 'OrderStatus', data = df[df.PaymentMethod == 'EasyPaisa'])
_.set_title('Order Status for EasyPaisa')
_ = plt.subplot(1,4,3)
_ = sns.countplot(x = 'OrderStatus', data = df[df.PaymentMethod == 'JazzCash'])
_.set_title('Order Status for JazzCash')
_ = plt.subplot(1,4,4)
_ = sns.countplot(x = 'OrderStatus', data = df[df.PaymentMethod == 'BankTransfer'])
_.set_title('Order Status for BankTransfer')

#### As seen in the plots above, Cash on Delivery remains the most successful of all the payment methods

In [None]:
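#Count the orders placed on each calendar date; without a y column the line would follow the row index
daily_orders = df.groupby(df.OrderDate.dt.date).size().reset_index(name='OrderCount')
fig = px.line(daily_orders, x="OrderDate", y="OrderCount", title='Date Wise Order Counts', labels={'OrderCount':'Order Counts', 'OrderDate':'Date'})
fig.show()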
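
Daily counts can be noisy, so growth over time is often easier to read at a monthly level. A minimal sketch, assuming hypothetical timestamps in place of the `OrderDate` column, that counts orders per calendar month and keeps empty months as zero:

```python
import pandas as pd

# Hypothetical order timestamps; the real ones come from the OrderDate column.
order_dates = pd.to_datetime(pd.Series([
    "2019-12-05 10:00", "2019-12-20 18:30",
    "2020-01-03 09:15", "2020-01-09 21:00", "2020-01-28 12:45",
    "2020-03-11 08:00",
]))

# Group the timestamps by calendar month and count the orders in each.
monthly = order_dates.groupby(order_dates.dt.to_period("M")).size()

# Fill in months that had no orders so gaps show up as 0 instead of vanishing.
all_months = pd.period_range(monthly.index.min(), monthly.index.max(), freq="M")
monthly = monthly.reindex(all_months, fill_value=0)
print(monthly)
```

The resulting series can be passed to a bar or line chart after converting the period index to strings with `monthly.index.astype(str)`.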

### According to the sales data, sales have grown substantially since the start date

## Results
The analysis above shows that sales started to increase in December 2019, with rapid growth at the end of April 2020 (possibly an effect of digital marketing). Most orders in the dataset were placed with cash on delivery, which suggests that most Pakistani customers prefer this method for online purchases. The best month for sales is January, and the best day is the 9th of the month. The best-selling city is Islamabad, with 30% of overall sales, and the best-selling book is انٹرنیٹ سے پیسہ کمائیں.

#### The notebook is still in progress


### If you like this notebook, please upvote. If you want to discuss it or suggest improvements, please comment. Thanks!