# Topics to be covered in this notebook
• Which one is the best-selling book? <br>
• Visualize order status frequency<br>
• Find a correlation between date and time with order status<br>
• Find a correlation between city and order status<br>
• Find any hidden patterns that are counter-intuitive for a layman<br>
• Can we predict number of orders, or book names in advance?<br>

In [None]:
#Import the required libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings("ignore")

# Steps we follow for this task
* Load the dataset
* Understand the data
* Clean the data and remove null values
* Split orders that contain multiple books, separated by /, into one row per book

In [None]:
#Load the dataset. We use this file (GP Orders - 5.csv) as it contains correct values

In [None]:

df2 = pd.read_csv("../input/gufhtugu-publications-dataset-challenge/GP Orders - 5.csv")


In [None]:
#Understanding the data
df2.shape

In [None]:
#Let's see the first five rows of our data
df2.head()

In [None]:
#Last five rows of our data
df2.tail()

In [None]:
df2.shape

In [None]:
df2.describe()

In [None]:
#See the column names and, if necessary, rename the columns

In [None]:
df2.columns

In [None]:
#Rename the columns for convenience

In [None]:
df2.columns = ['order_num', 'order_status', 'book_name', 'order_date', 'city', 'payment_method', 'items', 'weight']

In [None]:
df2.head()

In [None]:
#Now we check the number of unique values in each column

In [None]:
df2.nunique()

In [None]:
#We can also check the unique values of a single column separately

In [None]:
df2['order_status'].unique()

In [None]:
#Info about our data: int64 represents integer values, object usually represents string values

In [None]:
df2.info()

**Observation & tasks so far** <br>
The data contains 19187 rows (observations) and 8 columns (variables), matching the eight names assigned above <br>
We renamed the columns to user-friendly names <br>
We checked the unique values of every column together and separately <br>
We also checked the data types of our columns

In [None]:
#Data Cleaning

### Check for missing values in the dataset. To sort the counts, chain sort_values(ascending = False); otherwise the plain df2.isnull().sum() is enough

In [None]:
df2.isnull().sum().sort_values(ascending = False)

In [None]:
# We see that book_name has 2 missing values and city has 1. Now we locate exactly where these values are in our dataset

In [None]:
df2[df2['book_name'].isna()]

In [None]:
df2[df2['city'].isna()]

In [None]:
#Now we drop the rows that contain these missing values

In [None]:
df2.dropna(inplace=True)

In [None]:
df2.isnull().sum()
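
Calling dropna() without arguments removes a row when any column is null, not only the columns that matter for the analysis. A minimal sketch on made-up toy data (the frame `orders` and its values are hypothetical, not taken from the dataset) shows how the `subset` argument limits the drop to the required columns:

```python
import pandas as pd

# Toy data standing in for the orders table (all values are made up).
orders = pd.DataFrame({
    "order_num": [1, 2, 3, 4],
    "book_name": ["Book A", None, "Book B", "Book C"],
    "city": ["Lahore", "Karachi", None, "Multan"],
    "payment_method": ["COD", "COD", "COD", None],
})

# Count missing values per column, largest first.
missing = orders.isnull().sum().sort_values(ascending=False)

# Drop a row only when a column needed for the analysis is missing.
required = ["book_name", "city"]
cleaned = orders.dropna(subset=required)

print(missing)
print(cleaned.shape)
```

Here order 4 survives even though its payment method is missing, whereas a plain dropna() would discard it as well. Whether that is desirable depends on which columns later steps rely on.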

In [None]:
#Now we look at the counts of each order_status value

In [None]:
df2.order_status.value_counts()

# Task 1: Which one is the best-selling book?

In [None]:
#Here we split our orders on the basis of "/"

In [None]:
from itertools import chain

# return a flat list from a series of "/"-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split('/')))

# calculate lengths of splits
lens = df2['book_name'].str.split('/').map(len)

# create new dataframe, repeating or chaining as appropriate
df2 = pd.DataFrame({'order_num': np.repeat(df2['order_num'], lens),
                    'order_status': np.repeat(df2['order_status'], lens),
                    'book_name': chainer(df2['book_name']),
                    'order_date': np.repeat(df2['order_date'], lens),
                    'city': np.repeat(df2['city'], lens)})
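
The same split can probably be written more compactly with `Series.str.split` followed by `DataFrame.explode`, which carries every other column along automatically. The sketch below runs on made-up toy data (the frame `orders` and the book titles are hypothetical), and the whitespace stripping is an assumption about how titles may be typed around the "/":

```python
import pandas as pd

# Toy orders; book titles are placeholders, not real dataset values.
orders = pd.DataFrame({
    "order_num": [101, 102, 103],
    "order_status": ["Completed", "Returned", "Completed"],
    "book_name": ["Book A/Book B", "Book C", "Book B / Book A/Book D"],
})

# Turn each "/"-separated string into a list, then one row per list element.
exploded = (
    orders.assign(book_name=orders["book_name"].str.split("/"))
          .explode("book_name")
          .reset_index(drop=True)
)

# Assumption: spaces around a title are not meaningful.
exploded["book_name"] = exploded["book_name"].str.strip()

# Count titles among completed orders only.
completed = exploded["order_status"] == "Completed"
top_books = exploded.loc[completed, "book_name"].value_counts()
print(top_books)
```

Unlike the np.repeat version, this keeps all columns of the original frame and gives the result a fresh, unique index.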

In [None]:
#The total number of rows increases from 19187 to 33091

In [None]:
df2.shape

In [None]:
plt.figure(figsize=(10, 10))
#Count completed orders per book and keep the ten most frequent titles
df2[df2["order_status"]=="Completed"]["book_name"].value_counts().head(10).sort_values().plot.barh()
plt.title("Top 10 purchased books")
plt.xlabel("Number of orders")
plt.ylabel("Name of books")
plt.show()

# Task 2: Visualize order status frequency

In [None]:
#Using bar plot

In [None]:
sns.countplot(data = df2, x = 'order_status')

In [None]:
#The bar chart above does not show the canceled orders clearly, so we look at a pie plot as well

In [None]:
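pal=['#349d6e','#faff00',"#ff0000"]
sns.set_palette(pal)
plt.figure(figsize=(10,10))
#Use one value_counts() result for both wedges and legend so the labels match the wedges
status_counts = df2['order_status'].value_counts()
plt.pie(status_counts)
plt.legend(status_counts.index,bbox_to_anchor=(0.00, 1))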
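
When one status dominates, a small table of counts and percentages can be easier to read than either chart. A minimal sketch, assuming made-up status values (the series `status` below is hypothetical toy data):

```python
import pandas as pd

# Toy status column; the values and their proportions are made up.
status = pd.Series(
    ["Completed"] * 7 + ["Returned"] * 2 + ["Cancelled"] * 1,
    name="order_status",
)

# Absolute counts and percentage share, both sorted by frequency.
counts = status.value_counts()
share = (status.value_counts(normalize=True) * 100).round(1)

summary = pd.DataFrame({"orders": counts, "percent": share})
print(summary)
```

The same two calls should work on df2['order_status'] directly.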

# Task 3: Find a correlation between date and time with order status

In [None]:
#Parse the timestamp once, then extract the date and time parts
order_dt = pd.to_datetime(df2['order_date'])
df2['date'] = order_dt.dt.date
df2['time'] = order_dt.dt.time
#Another way to do this is shown below (plain string split on the space)
#df2['date'] = df2['order_date'].apply(lambda x: str(x).split(' ')[0])
#df2['time'] = df2['order_date'].apply(lambda x: str(x).split(' ')[1])


In [None]:
df2.head()
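
One possible way to relate time to order status is to extract the hour and weekday and compare the share of each status per hour. This is only a sketch on made-up toy data: the frame `orders`, the timestamps, and the ISO-like timestamp format are all assumptions, and a row-normalized cross-tabulation describes an association rather than a formal correlation coefficient.

```python
import pandas as pd

# Toy orders with hypothetical timestamps and statuses.
orders = pd.DataFrame({
    "order_status": ["Completed", "Returned", "Completed", "Completed", "Returned"],
    "order_date": [
        "2020-01-06 09:15:00",
        "2020-01-06 09:45:00",
        "2020-01-07 21:05:00",
        "2020-01-07 21:30:00",
        "2020-01-11 21:50:00",
    ],
})

# Parse once, then derive the time features.
when = pd.to_datetime(orders["order_date"])
orders["hour"] = when.dt.hour
orders["weekday"] = when.dt.day_name()

# Share of each status within every hour (each row sums to 1).
by_hour = pd.crosstab(orders["hour"], orders["order_status"], normalize="index")
print(by_hour)
```

Replacing "hour" with "weekday" in the crosstab call gives the same view per day of the week.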

## Next steps
* Find a correlation between date and time with order status
* Find a correlation between city and order status
* Find any hidden patterns that are counter-intuitive for a layman
* Can we predict number of orders, or book names in advance?
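
For the city question, one possible starting point is a return rate per city. The sketch below uses made-up toy data (the frame `orders`, the status label "Returned", and the threshold `min_orders` are all hypothetical), and it assumes city names may differ only in case and surrounding spaces:

```python
import pandas as pd

# Toy orders; cities and statuses are placeholders.
orders = pd.DataFrame({
    "city": ["Lahore", "lahore ", "Karachi", "Karachi", "KARACHI", "Multan"],
    "order_status": ["Completed", "Returned", "Completed",
                     "Completed", "Completed", "Returned"],
})

# Assumption: spelling variants differ only by case and whitespace.
orders["city_clean"] = orders["city"].str.strip().str.lower()

# Orders per city and status.
table = pd.crosstab(orders["city_clean"], orders["order_status"])

# Ignore cities with too few orders for a rate to be meaningful.
min_orders = 2
table = table[table.sum(axis=1) >= min_orders]

# Fraction of returned orders per remaining city, highest first.
return_rate = (table["Returned"] / table.sum(axis=1)).sort_values(ascending=False)
print(return_rate)
```

The minimum-order filter matters because a city with a single returned order would otherwise show a 100% return rate.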
