# Table of Contents

1. Importing the library
2. Loading the dataset
3. Explore the data
    - Checking the null values
4. Answers to the questions asked by Gufhtugu
5. Exploratory data analysis
    - What is the best-selling book?
    - Visualize order status frequency
    - Find a correlation between date and time and order status
    - Find a correlation between city and order status
    - Find any hidden patterns that are counter-intuitive for a layman
    - Can we predict the number of orders, or book names, in advance?

# 1. Import Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 2. Loading the dataset

In [None]:
data = pd.read_csv('/kaggle/input/gufhtugu-publications-dataset-challenge/GP Orders - 4.csv')
data

# 3. Explore the data

In [None]:
# sample data
data.head()

# get the column names
print(data.columns)

# Replace spaces in column names with '_' and convert the names to lower case.
# A column with a space can be accessed as data['Order Number'] but not as an attribute (data.Order Number is a syntax error).

columns = data.columns.str.replace(' ','_').str.lower()

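# remove the special characters '(', ',' and ')'
# regex=True is required: since pandas 2.0 str.replace treats the pattern as a literal string by default
columns = columns.str.replace('[(,)]', '', regex=True)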

data.columns = columns

# data.info() shows the non-null count of each column. order_number, order_status and order_date have as many non-null rows as the dataset has rows, so these columns have no null values.
# book_name has 2 null values and city_billing has 1 null value.
# Let's count the nulls per column to verify this.

print(data.isnull().sum())

# treat the null values
# Treatment for book_name: it is a string column, so we fill nulls with the mode, i.e. the most frequently ordered book: 'انٹرنیٹ سے پیسہ کمائیں'
print(data.book_name.mode()[0])
data.book_name = data.book_name.fillna(data.book_name.mode()[0])

# now book_name has no null values
print(data.isnull().sum())

# Treatment for city_billing: it is a string column, so we fill nulls with the mode, i.e. the city with the most buyers: 'Karachi'
print(data.city_billing.mode())
data.city_billing = data.city_billing.fillna(data.city_billing.mode()[0])

# now city_billing has no null values
print(data.isnull().sum())

# The book_name column can hold multiple books separated by '/'. Split them into a list per order.
data['book_name_list'] = data.book_name.str.split('/')

# convert the date strings to datetime objects
data['order_date'] = pd.to_datetime(data['order_date'])
print(data.head())

# add a column with the number of books in each order
data['order_books_count'] = data.book_name_list.apply(len)
print(data.head())
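
The cleaning steps above can be tried out on a small frame before running them on the full dataset. The sketch below uses a made-up `toy` DataFrame; its column names (`Order Number`, `Book Name`, `City (Billing)`) and values are assumptions chosen to resemble the real CSV, not a copy of it.

```python
import pandas as pd

# Toy frame standing in for the real CSV; names and values are made up.
toy = pd.DataFrame({
    "Order Number": [1, 2, 3, 4],
    "Book Name": ["Python/SQL", None, "Python", "Python"],
    "City (Billing)": ["Karachi", "Lahore", None, "Karachi"],
})

# Normalize column names: spaces -> underscores, lower case, drop '(', ',' and ')'.
toy.columns = (
    toy.columns.str.replace(" ", "_")
    .str.lower()
    .str.replace("[(,)]", "", regex=True)
)

# Fill missing text values with the most frequent value (the mode).
for col in ["book_name", "city_billing"]:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

# Split multi-book orders on '/' and count the books per order.
toy["book_name_list"] = toy["book_name"].str.split("/")
toy["order_books_count"] = toy["book_name_list"].apply(len)

print(toy)
```

Filling with the mode keeps every row, but it also slightly inflates the count of the most popular book and city; with only a handful of nulls that effect is small.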

# 5. Exploratory data analysis

**Total order status**

In [None]:
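# count the orders per status; the statuses are 'Completed', 'Returned' and 'Canceled'
status_counts = data.order_status.value_counts()

# the 'seaborn' style name was removed from recent matplotlib releases, so apply the seaborn theme directly
sns.set_theme()
plt.title('Order Status Frequency')
plt.xlabel('Order status')
plt.ylabel('Number of orders')
# order_status is categorical, so a bar chart of the counts is used instead of a histogram
plt.bar(status_counts.index, status_counts.values)
plt.show()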
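
Alongside the chart, the same frequencies can be read as a table. This is a minimal sketch on a hypothetical `order_status` series (the values are invented); on the real data the series would be `data.order_status`.

```python
import pandas as pd

# Hypothetical statuses; the real column is data.order_status.
order_status = pd.Series(
    ["Completed", "Completed", "Returned", "Completed", "Canceled"],
    name="order_status",
)

counts = order_status.value_counts()                # absolute frequency
shares = order_status.value_counts(normalize=True)  # fraction of all orders

# Combine both views; pandas aligns the two series on the status label.
summary = pd.DataFrame({"orders": counts, "share": shares.round(2)})
print(summary)
```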

**Order status by year**

In [None]:
# get the list of unique years from the order_date column
data_years = data['order_date'].dt.strftime("%Y").unique().tolist()
print(data_years)

# plot the order statuses of 2019, 2020 and 2021 side by side
plot_years = [2019, 2020, 2021]

plt.figure(figsize=(15,5))

for position, year in enumerate(plot_years, start=1):
    # get the statuses of the orders placed in this year
    status_for_year = data[data['order_date'].dt.year == year].order_status

    plt.subplot(1, 3, position)
    plt.title(f'{year} Year Order Status')
    plt.xlabel('Order statuses')
    plt.ylabel('Number of orders')
    plt.hist(status_for_year)

plt.show()
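
To compare years numerically rather than by eye, the statuses can be cross-tabulated against the order year. The sketch below uses an invented `orders` frame; only the column names `order_date` and `order_status` come from the notebook. `pd.crosstab` with `normalize="index"` turns each year's row into shares, which makes years with very different order volumes comparable.

```python
import pandas as pd

# Invented orders; only the column names match the notebook.
orders = pd.DataFrame({
    "order_date": pd.to_datetime([
        "2019-10-03", "2019-11-15", "2020-01-20",
        "2020-06-01", "2020-06-02", "2021-01-05",
    ]),
    "order_status": [
        "Completed", "Returned", "Completed",
        "Completed", "Canceled", "Completed",
    ],
})

year = orders["order_date"].dt.year.rename("year")

# Absolute number of orders per year and status.
status_by_year = pd.crosstab(year, orders["order_status"])

# Share of each status within a year (every row sums to 1).
status_share_by_year = pd.crosstab(year, orders["order_status"], normalize="index")

print(status_by_year)
print(status_share_by_year.round(2))
```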


**Number of orders per month and year**

In [None]:
# count the orders for every (year, month) pair
orders_count_per_month_per_year = data['order_date'].groupby(
    [data['order_date'].dt.year.rename('year'), data['order_date'].dt.month.rename('month')]
).count()
orders_count_per_month_per_year.plot()
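
Grouping by (year, month) skips months that have no orders at all, so gaps do not show up on the plot. One possible alternative, sketched here on invented dates, is to group by a monthly period and reindex over the full range so that empty months appear as zero.

```python
import pandas as pd

# Invented order dates; on the real data this would be data['order_date'].
order_date = pd.Series(pd.to_datetime([
    "2019-11-02", "2019-11-20", "2019-12-05", "2020-02-10",
]))

# Count orders per calendar month.
monthly_orders = order_date.groupby(order_date.dt.to_period("M")).count()

# Reindex over every month between the first and last order,
# so that months without orders are shown as 0 instead of being skipped.
full_range = pd.period_range(
    monthly_orders.index.min(), monthly_orders.index.max(), freq="M"
)
monthly_orders = monthly_orders.reindex(full_range, fill_value=0)

print(monthly_orders)
```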

**Top ten best-selling books**

In [None]:
data['book_name'].value_counts()[:10].plot(kind='bar')
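
Counting `book_name` directly treats an order with several books, such as `'A/B'`, as one distinct title. If the goal is to rank individual books, the split lists can be flattened first. This is a sketch on an invented `book_name` series; on the real data, `data['book_name_list'].explode()` would play the same role.

```python
import pandas as pd

# Invented orders; some contain several books separated by '/'.
book_name = pd.Series(["Python/SQL", "Python", "SQL / Python", "Data Science"])

# Naive count: every distinct combination is its own "book".
per_order_string = book_name.value_counts()

# Per-book count: split on '/', put one book per row, trim stray spaces.
per_book = (
    book_name.str.split("/")
    .explode()
    .str.strip()
    .value_counts()
)

print(per_book.head(10))
```

Here the naive count sees four different strings with one order each, while the per-book count shows that one title appears in three of the four orders.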

**Top Ten Cities**

In [None]:
data['city_billing'].value_counts()[:10].plot(kind='bar')
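
City names are typed by customers, so the same city may appear with different spelling or casing; whether that happens in this dataset is an assumption that should be checked. The sketch below, on invented data, normalizes the names and then compares the status shares of the largest cities, which is one simple way to look at the relation between city and order status.

```python
import pandas as pd

# Invented orders; only the column names match the notebook.
orders = pd.DataFrame({
    "city_billing": ["Karachi", "karachi ", "Lahore", "LAHORE", "Karachi", "Multan"],
    "order_status": ["Completed", "Returned", "Completed",
                     "Completed", "Completed", "Returned"],
})

# Trim whitespace and unify casing so variants of one city are merged.
city = orders["city_billing"].str.strip().str.title()

# Keep only the two largest cities (use head(10) on the real data).
top_cities = city.value_counts().head(2).index
mask = city.isin(top_cities)

# Share of each order status within a city (every row sums to 1).
status_share_by_city = pd.crosstab(
    city[mask], orders.loc[mask, "order_status"], normalize="index"
)

print(status_share_by_city.round(2))
```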

# Please Upvote if you find the notebook interesting.
This notebook is released under the MIT License. Feel free to copy and edit it.