# Gufhtugu Publications Transactions Analysis

Tasks to achieve
* What is the best-selling book?
* Visualize order status frequency
* Find a correlation between date and time with order status
* Find a correlation between city and order status
* Find any hidden patterns that are counter-intuitive for a layman
* Can we predict number of orders, or book names in advance?

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
!pip install -U textblob
!pip install googletrans
#Import libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from googletrans import Translator # translate cities name into english 
from textblob import TextBlob

pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#Read Dataset
df = pd.read_csv("../input/gufhtugu-publications-dataset-challenge/GP Orders - 5.csv",encoding="utf-8", delimiter=',')
df.sample(20)

In [None]:
pak_places = pd.read_csv("../input/pakistan-population-census-2017-at-village-level/Pakistan_Population_2017_Census_Village_Administration_Level.csv")

Notes:

Cities:
* Hayatabad, Peshawar, KPK --> peshawar
* Model town --> lahore
* Dulle Wala tehsil darya Khan district bhakkar --> bhakkar
* Akbar Town Danish abad near university road --> Danish abad
* Thutha Rai Bahadur,Teh Kharian,Gujrat --> Gujrat
* Rawal pindi --> rawalpindi
* Khair Pur Mir's --> khairpur
* Kharian cantt 50070
* RYK --> Rahim Yar Khan
* D.i.khan --> dera ismail khan
* D.g.khan --> Dera ghazi khan
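
The manual mappings in these notes could be captured in a small lookup table. The following is only a minimal sketch: `CITY_ALIASES` and `normalize_city` are hypothetical names that do not appear elsewhere in this notebook, and the table holds only a few of the aliases listed above.

```python
import re

# Hypothetical alias table built from the notes above (keys are lowercase)
CITY_ALIASES = {
    "rawal pindi": "rawalpindi",
    "khair pur mir's": "khairpur",
    "ryk": "rahim yar khan",
    "d.i.khan": "dera ismail khan",
    "d.g.khan": "dera ghazi khan",
    "model town": "lahore",
}

def normalize_city(raw):
    """Lowercase, collapse whitespace, then apply the alias table."""
    cleaned = re.sub(r"\s+", " ", str(raw)).strip().lower()
    # Unknown spellings are returned cleaned but otherwise unchanged
    return CITY_ALIASES.get(cleaned, cleaned)

print(normalize_city("RYK"))
print(normalize_city(" D.i.khan "))
print(normalize_city("Lahore"))
```

On the real data this could be applied with `df['City'].apply(normalize_city)`, assuming the alias table were extended as new variants are found.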

### Books

* ڈیٹا سائنس ۔ ایک تعارف --> ڈیٹا سائنس

Dataset Shape

In [None]:
Row, Col = df.shape
print(f'There are {Row} Rows and {Col} columns')

Rename column names for better EDA

In [None]:
df = df.rename(columns={'Order Number': 'Order_Number',"Order Status":"Order_Status", "Book Name":"Book_Name","Order Date & Time":"Order_Date","City":"City","Payment Method":"Payment_Method", "Total items":"Total_items","Total weight (grams)":"grams" })

**Check Null values in dataset**

In [None]:
df.isnull().sum().sort_values(ascending = False).to_frame('counts')

Explore the null values to understand

In [None]:
#rows containing missing data
df[df.isnull().any(axis=1)]

In [None]:
# To Drop NaN values
# Don't drop rows, Need these rows 
#df.dropna(inplace=True)

## Order Status Frequency

In [None]:
df.Order_Status.value_counts().to_frame('count')

Plot histogram of "Order_Status"

In [None]:
# Set the width and height of the figure
plt.figure(figsize=(12,6))

histogram = df.Order_Status.hist()

plt.xticks(rotation=30, horizontalalignment="center")
plt.title("Order Numbers vs. Order Status")
plt.xlabel("Order Status")
plt.ylabel("Total Orders")
plt.show()

## Payment Methods

In [None]:
# Preprocess the date
# Thanks to @hussainsaddam12 & @mnavaidd for the idea behind this code block; I adapted it to the given task.

df["Order_Date"] = pd.DatetimeIndex(df["Order_Date"])
df['date'] = df['Order_Date'].dt.date
df['time'] = df['Order_Date'].dt.time
df["Day_Name"] = df["Order_Date"].dt.day_name()
df["Week_Day"] = df["Order_Date"].dt.dayofweek
df["DayofYear"] = df["Order_Date"].dt.dayofyear
df["Month_Number"] = df["Order_Date"].dt.month
df["Month_Name"] = df["Order_Date"].dt.month_name()
df['year'] = df["Order_Date"].dt.year
df.sample(50)

In [None]:
df['Payment_Method'] = df['Payment_Method'].replace({"Cash on Delivery (COD)": "Cash on delivery"})
df.Payment_Method.value_counts().to_frame('counts')

In [None]:
# Set the width and height of the figure
plt.figure(figsize=(14,5))

df.Payment_Method.value_counts().plot(kind='bar')

# Rotate the x-labels by 30 degrees, and keep the text aligned horizontally
plt.xticks(rotation=30, horizontalalignment="center")
plt.title("Orders per Payment Method")
plt.xlabel("Payment Method")
plt.ylabel("Number of Deliveries")
plt.show()

Now, plot the correlation between completed orders and payment methods

In [None]:
completed_df = df.loc[(df.Order_Status == 'Completed')]
completed_df.Payment_Method.value_counts().to_frame()

In [None]:
# Set the width and height of the figure
plt.figure(figsize=(14,5))

completed_df.Payment_Method.value_counts().plot(kind='bar')

# Rotate the x-labels by 30 degrees, and keep the text aligned horizontally
plt.xticks(rotation=30, horizontalalignment="center")
plt.title("Payment Methods vs. Completed Order")
plt.xlabel("Payment Method")
plt.ylabel("Number of Completed Orders")
plt.show()

Now, plot the correlation between cancelled orders and payment methods

In [None]:
cancelled_df = df.loc[(df.Order_Status == 'Cancelled')]
cancelled_df.Payment_Method.value_counts().to_frame()

In [None]:
# Set the width and height of the figure
plt.figure(figsize=(14,5))

cancelled_df.Payment_Method.value_counts().plot(kind='bar')

# Rotate the x-labels by 30 degrees, and keep the text aligned horizontally
plt.xticks(rotation=30, horizontalalignment="center")
plt.title("Payment Methods vs. Cancelled Orders")
plt.xlabel("Payment Method")
plt.ylabel("Cancelled Orders")
plt.show()

Now, plot the correlation between returned orders and payment methods

In [None]:
returned_df = df.loc[(df.Order_Status == 'Returned')]
returned_df.Payment_Method.value_counts().to_frame('Count')

In [None]:
# Set the width and height of the figure
plt.figure(figsize=(14,5))

returned_df.Payment_Method.value_counts().plot(kind='bar')

# Rotate the x-labels by 30 degrees, and keep the text aligned horizontally
plt.xticks(rotation=30, horizontalalignment="center")
plt.title("Payment Methods vs. Returned Order")
plt.xlabel("Payment Method")
plt.ylabel("Returned Orders")
plt.show()
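
The three per-status charts above can also be summarised in a single table. This is a minimal sketch using `pd.crosstab` on a small made-up frame: the rows in `toy` (including the "EasyPaisa" value) are invented for illustration and only mimic the notebook's column names. On the real data, `df` would take the place of `toy`.

```python
import pandas as pd

# Made-up rows that only mimic the notebook's column names
toy = pd.DataFrame({
    "Order_Status": ["Completed", "Completed", "Returned", "Cancelled", "Completed"],
    "Payment_Method": ["Cash on delivery", "EasyPaisa", "Cash on delivery",
                       "Cash on delivery", "Cash on delivery"],
})

# Rows: payment method, columns: order status, cells: number of orders
counts = pd.crosstab(toy["Payment_Method"], toy["Order_Status"])

# Share of each status within a payment method (each row sums to 1)
shares = pd.crosstab(toy["Payment_Method"], toy["Order_Status"], normalize="index")

print(counts)
print(shares.round(2))
```

Row-normalised shares make payment methods with very different order volumes comparable, which raw counts do not.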

# **Preprocess the dataset**

In the previous version we analysed the data and reached the following conclusions:
* Order Number is unique
* The dataset has 19187 transactions
* There are inconsistencies in the Billing_City column
* Null values are present in the data
* Multiple titles are in a single row

For example:
* There are "4082" unique cities in total where books have been delivered, which is a suspicious number considering that "Gufhtugu" is a relatively new startup.
* The city Karachi occurs in different forms such as "Karachi", "Karachi ", "Khi", etc.

So far we have completed one task: visualizing order status frequency.

Let's preprocess the data

## 1. Convert entries to lower case and remove blank spaces around them.

Performing preprocessing on 'City' and 'Book_Name' Column

In [None]:
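#Preprocess Billing_City
df['City'] = df['City'].str.lower()
# Remove digits (e.g. postal codes); regex=True is required for the \d+ pattern
df['City'] = df['City'].str.replace(r'\d+', '', regex=True)
# The remaining patterns are literal strings ("?" alone is not a valid regex)
df['City'] = df['City'].str.replace('pakistan', '', regex=False)
df['City'] = df['City'].str.replace('city', '', regex=False)
df['City'] = df['City'].str.replace('?', '', regex=False)
df['City'] = df['City'].str.strip()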
#preprocess Book_Name
# All patterns below are literal strings, so regex=False keeps "(", ")" and "." from being read as regex syntax
df['Book_Name'] = df['Book_Name'].str.replace("- مستحقین زکواة", "", regex=False)
df['Book_Name'] = df['Book_Name'].str.lower()
df['Book_Name'] = df['Book_Name'].str.replace("linux - an introduction  (release data - october 3, 2020)", "linux - an introduction", regex=False)
df['Book_Name'] = df['Book_Name'].str.replace("python programming- release date: august 14, 2020", "python programming", regex=False)
df['Book_Name'] = df['Book_Name'].str.replace("ڈیٹا سائنس ۔ ایک تعارف", "ڈیٹا سائنس", regex=False)
#df['Book_Name'] = df['Book_Name'].str.replace("(C++)","(C++) ++سی/سی")
df['Book_Name'] = df['Book_Name'].str.replace("molo masali - مولو مصلی", "molo masali", regex=False)
# Expand the truncated title, then undo the double expansion of titles that were already complete
df['Book_Name'] = df['Book_Name'].str.replace("مشین ل", "مشین لرننگ", regex=False)
df['Book_Name'] = df['Book_Name'].str.replace("مشین لرننگرننگ", "مشین لرننگ", regex=False)
df['Book_Name'] = df['Book_Name'].str.replace("r ka taaruf آر کا تعارف", "r ka taaruf", regex=False)
df['Book_Name'] = df['Book_Name'].str.strip()
df.sample(20)
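
Whether `str.replace` treats its pattern as a regular expression depends on the pandas version (the default for `regex` changed to `False` in pandas 2.0), so passing it explicitly avoids surprises. A minimal sketch on three made-up city strings, not on the real dataset:

```python
import pandas as pd

# Made-up examples of messy city entries
cities = pd.Series(["Kharian cantt 50070", "Lahore?", " Karachi Pakistan "])

cleaned = (
    cities.str.lower()
          .str.replace(r"\d+", "", regex=True)       # drop digits such as postal codes
          .str.replace("pakistan", "", regex=False)  # literal substring
          .str.replace("?", "", regex=False)         # "?" is a regex metacharacter
          .str.strip()
)
print(cleaned.tolist())
```

Chaining the calls also keeps the cleaning steps for one column in a single expression.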

In [None]:
# This code block reduces the number of unique cities from 3533 to 1854, which is not enough.
#if an address contains the name of a Pakistani city from the given list, the entire address is replaced with the name of the city only

#list of pakistani cities obtained from https://gist.github.com/malikbilal1997/4f41d4d153fca7087a8875cac7db8836
pak_cities = ['islamabad', 'ahmed nager chatha', 'ahmadpur east', 'ali khan abad', 'alipur', 'arifwala', 'attock', 'bhera', 'bhalwal', 'bahawalnagar', 'bahawalpur', 'bhakkar', 'burewala', 'chillianwala', 'chakwal', 'chichawatni', 'chiniot', 'chishtian', 'daska', 'darya khan', 'dera ghazi khan', 'dhaular', 'dina', 'dinga', 'dipalpur', 'faisalabad', 'ferozewala', 'fateh jhang', 'ghakhar mandi', 'gojra', 'gujranwala', 'gujrat', 'gujar khan', 'hafizabad', 'haroonabad', 'hasilpur', 'haveli lakha', 'jatoi', 'jalalpur', 'jattan', 'jampur', 'jaranwala', 'jhang', 'jhelum', 'kalabagh', 'karor lal esan', 'kasur', 'kamalia', 'kamoke', 'khanewal', 'khanpur', 'kharian', 'khushab', 'kot addu', 'jauharabad', 'lahore', 'lalamusa', 'layyah', 'liaquat pur', 'lodhran', 'malakwal', 'mamoori', 'mailsi', 'mandi bahauddin', 'mian channu', 'mianwali', 'multan', 'murree', 'muridke', 'mianwali bangla', 'muzaffargarh', 'narowal', 'nankana sahib', 'okara', 'renala khurd', 'pakpattan', 'pattoki', 'pir mahal', 'qaimpur', 'qila didar singh', 'rabwah', 'raiwind', 'rajanpur', 'rahim yar khan', 'rawalpindi', 'sadiqabad', 'safdarabad', 'sahiwal', 'sangla hill', 'sarai alamgir', 'sargodha', 'shakargarh', 'sheikhupura', 'sialkot', 'sohawa', 'soianwala', 'siranwali', 'talagang', 'taxila', 'toba tek singh', 'vehari', 'wah cantonment', 'wazirabad', 'badin', 'bhirkan', 'rajo khanani', 'chak', 'dadu', 'digri', 'diplo', 'dokri', 'ghotki', 'haala', 'hyderabad', 'islamkot', 'jacobabad', 'jamshoro', 'jungshahi', 'kandhkot', 'kandiaro', 'karachi', 'kashmore', 'keti bandar', 'khairpur', 'kotri', 'larkana', 'matiari', 'mehar', 'mirpur khas', 'mithani', 'mithi', 'mehrabpur', 'moro', 'nagarparkar', 'naudero', 'naushahro feroze', 'naushara', 'nawabshah', 'nazimabad', 'qambar', 'qasimabad', 'ranipur', 'ratodero', 'rohri', 'sakrand', 'sanghar', 'shahbandar', 'shahdadkot', 'shahdadpur', 'shahpur chakar', 'shikarpaur', 'sukkur', 'tangwani', 'tando adam khan', 'tando allahyar', 'tando muhammad khan', 'thatta', 'umerkot', 
'warah', 'abbottabad', 'adezai', 'alpuri', 'akora khattak', 'ayubia', 'banda daud shah', 'bannu', 'batkhela', 'battagram', 'birote', 'chakdara', 'charsadda', 'chitral', 'daggar', 'dargai', 'darya khan', 'dera ismail khan', 'doaba', 'dir', 'drosh', 'hangu', 'haripur', 'karak', 'kohat', 'kulachi', 'lakki marwat', 'latamber', 'madyan', 'mansehra', 'mardan', 'mastuj', 'mingora', 'nowshera', 'paharpur', 'pabbi', 'peshawar', 'saidu sharif', 'shorkot', 'shewa adda', 'swabi', 'swat', 'tangi', 'tank', 'thall', 'timergara', 'tordher', 'awaran', 'barkhan', 'chagai', 'dera bugti', 'gwadar', 'harnai', 'jafarabad', 'jhal magsi', 'kacchi', 'kalat', 'kech', 'kharan', 'khuzdar', 'killa abdullah', 'killa saifullah', 'kohlu', 'lasbela', 'lehri', 'loralai', 'mastung', 'musakhel', 'nasirabad', 'nushki', 'panjgur', 'pishin valley', 'quetta', 'sherani', 'sibi', 'sohbatpur', 'washuk', 'zhob', 'ziarat']

def get_nearest_city(city):
  for cand_city in pak_cities:
    if cand_city in str(city):
      return cand_city
  return city 

#print(f'total unique cities in our dataset before normalization: {df.City.nunique()}')

#df['city'] = df['City'].apply(get_nearest_city)

#print(f'total unique cities in our dataset after normalization: {df.city.nunique()}')
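
One reason substring matching is not enough on its own: short names in the list (for example 'dina' or 'chak') can match inside unrelated words. The sketch below mirrors the logic of `get_nearest_city` on a shortened, illustrative candidate list. `nearest_city` and `sample_cities` are hypothetical names used only here.

```python
# Shortened candidate list, for illustration only
sample_cities = ["lahore", "karachi", "dina", "peshawar"]

def nearest_city(address, candidates):
    """Return the first candidate that occurs as a substring of the address."""
    for cand in candidates:
        if cand in str(address):
            return cand
    # No candidate found: keep the original value
    return address

# Intended behaviour: the address collapses to the city it mentions
print(nearest_city("hayatabad, peshawar, kpk", sample_cities))

# False positive: "dina" is found inside "madina"
print(nearest_city("madina town faisalabad", sample_cities))
```

Matching on whole words, or trying longer names first, would be possible refinements; neither is part of the notebook as it stands.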

## 2. Separate the books in each order on "/"

In [None]:
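# Split each order's Book_Name on "/" into a list of titles (one row per order is kept;
# the lists are exploded later when counting). Exploding the string column "Book_Name" had no effect.
df = df.assign(Order_Books_Name=df.Book_Name.str.split("/"))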
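
To illustrate how splitting and exploding work together, here is a minimal sketch on two made-up orders. The `toy` frame is invented for the example and only reuses the notebook's column names.

```python
import pandas as pd

# Two made-up orders; the first contains two titles separated by "/"
toy = pd.DataFrame({
    "Order_Number": [1, 2],
    "Book_Name": ["python programming/linux - an introduction", "python programming"],
})

# Split on "/" to get one list of titles per order
toy["Order_Books_Name"] = toy["Book_Name"].str.split("/")

# explode() turns each list element into its own row (the order's index repeats)
per_book = toy.explode("Order_Books_Name")

print(per_book[["Order_Number", "Order_Books_Name"]])
print(per_book["Order_Books_Name"].value_counts())
```

Because `explode()` repeats the index, code that relies on one row per order should keep working on the unexploded frame.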

# Get Best Selling Books

Best selling book

Top 10 best-selling books

In [None]:
# Set the width and height of the figure
plt.figure(figsize=(14,5))

top_10_seller = df.Order_Books_Name.explode().value_counts()[:10].plot.bar()

# Rotate the x-labels by 30 degrees, and keep the text aligned horizontally
plt.xticks(rotation=30, horizontalalignment="center")
plt.title("Top 10 Best Selling Books")
plt.xlabel("Books")
plt.ylabel("Number of Orders")
plt.show()

10 least-selling books

In [None]:
# Set the width and height of the figure
plt.figure(figsize=(14,5))

least_10_seller = df.Order_Books_Name.explode().value_counts()[-10:].plot.bar()

# Rotate the x-labels by 30 degrees, and keep the text aligned horizontally
plt.xticks(rotation=30, horizontalalignment="center")
plt.title("10 Least-Selling Books")
plt.xlabel("Books")
plt.ylabel("Number of Orders")
plt.show()

### Total number of books sold 

In [None]:
total_sold = df.Order_Books_Name.explode().value_counts().sum()
print(f"In total, {total_sold} books sold by Gufhtugu Publications from {df.Order_Date.min()} to {df.Order_Date.max()}")

# Cities

In [None]:
df.City.str.upper().value_counts()[:10].to_frame()

In [None]:
dump = df.City.value_counts()[:10].plot.bar()

# Cleaned Billing Cities

FuzzyWuzzy

In [None]:
pak_places['District'] = pak_places['District'].str.replace('FR ', '')
pak_places['District'] = pak_places['District'].str.replace(' DISTRICT', '')
pak_places['District'] = pak_places['District'].str.strip()
pak_places['District'] = pak_places['District'].str.lower()
pak_places['Tehsil'] = pak_places['Tehsil'].str.replace(' TEHSIL', '')
pak_places['Tehsil'] = pak_places['Tehsil'].str.replace(' SUB-TEHSIL', '')
pak_places['Tehsil'] = pak_places['Tehsil'].str.replace(' CITY', '')
pak_places['Tehsil'] = pak_places['Tehsil'].str.lower()
pak_places['Tehsil'] = pak_places['Tehsil'].str.strip()
district = pak_places['District'].tolist()
tehsil = pak_places['Tehsil'].tolist()
District_Tehsil = list(set(district+tehsil))

#pop_list = ["tribal area adj. dera ismail khan district",'karachi west','de-excluded area d.g khan']
"""
District_Tehsil.pop()
District_Tehsil.pop()
District_Tehsil.pop()
"""

In [None]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

y = []
def city_correction():
    for x in range(Row):
        city = df.loc[x, 'City']
        try:
            # Best fuzzy match among the census district and tehsil names
            partial = process.extractOne(city.lower(), District_Tehsil)
            y.append(partial[0])
        except (AttributeError, TypeError):
            # Missing or non-string city (row 3919 in this dataset): keep a
            # placeholder so that len(y) still equals the number of rows
            y.append("NaN")
#fuzz.partial_ratio
city_correction()

In [None]:
df['city1'] = y
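
A best-match lookup always returns some candidate, even when nothing is really similar, so a similarity cutoff can help separate genuine matches from noise. The sketch below uses the standard library's `difflib.get_close_matches` as a stand-in for FuzzyWuzzy, so its scores are not the same as the ones used above. `match_place` and `known_places` are hypothetical names, and the cutoff of 0.8 is an arbitrary assumption.

```python
from difflib import get_close_matches

# Tiny stand-in for the District_Tehsil list
known_places = ["karachi", "lahore", "rawalpindi", "peshawar"]

def match_place(city, choices, cutoff=0.8):
    """Return the closest known place, or None when nothing is similar enough."""
    # Missing values (NaN) and empty strings cannot be matched
    if not isinstance(city, str) or not city.strip():
        return None
    hits = get_close_matches(city.strip().lower(), choices, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(match_place("Karachii", known_places))
print(match_place("rawal pindi", known_places))
print(match_place("xyz", known_places))
```

Rows that come back as `None` could then be reviewed by hand instead of being silently assigned to a wrong district.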

In [None]:
df.sample(50)

## Reduced the number of cities from "4082" to "501"

In [None]:
len(y)

In [None]:
#df.drop(columns="city")

In [None]:
from collections import Counter
print(Counter(y))

# Find a correlation between date and time with order status

in progress
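
One possible starting point for this task is to compare the share of completed orders across weekdays. This is only a sketch on made-up timestamps, not a result from the dataset: the `toy` frame and `completed_rate` are hypothetical, while the `Day_Name` column follows the naming used earlier in the notebook.

```python
import pandas as pd

# Made-up orders; 2020-10-05 is a Monday
toy = pd.DataFrame({
    "Order_Date": pd.to_datetime([
        "2020-10-05 09:15", "2020-10-05 21:40",
        "2020-10-06 10:05", "2020-10-10 23:30",
    ]),
    "Order_Status": ["Completed", "Returned", "Completed", "Completed"],
})

toy["Day_Name"] = toy["Order_Date"].dt.day_name()
toy["Hour"] = toy["Order_Date"].dt.hour

# Fraction of completed orders per weekday (mean of a boolean column)
completed_rate = (
    toy.assign(is_completed=toy["Order_Status"].eq("Completed"))
       .groupby("Day_Name")["is_completed"]
       .mean()
)
print(completed_rate)
```

The same grouping could be done on `Hour` to look at time of day instead of weekday.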

## Next Steps

* Find a correlation between date and time with order status
* Find a correlation between city and order status
* Find any hidden patterns that are counter-intuitive for a layman
* Can we predict number of orders, or book names in advance?

Thoughts:

* Datetime correlation is easy
* Need to work on more techniques to normalize and clean the cities column
* We can use apriori algorithm for recommendation system
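
As a first step toward the recommendation idea, one can count how often two books appear in the same order. This is the pair-support computation that Apriori builds on, not the full algorithm. The sketch uses made-up orders, and `orders`, `pair_counts` and `support` are hypothetical names.

```python
from collections import Counter
from itertools import combinations

# Made-up orders, each already split into its book titles
orders = [
    ["python programming", "linux - an introduction"],
    ["python programming", "linux - an introduction", "molo masali"],
    ["python programming"],
]

pair_counts = Counter()
for books in orders:
    # sorted(set(...)) makes (a, b) and (b, a) count as the same pair
    for pair in combinations(sorted(set(books)), 2):
        pair_counts[pair] += 1

# Support = share of all orders that contain the pair
support = {pair: n / len(orders) for pair, n in pair_counts.items()}
print(pair_counts.most_common(1))
```

Pairs with high support are natural candidates for "frequently bought together" suggestions. A full Apriori implementation would also prune infrequent itemsets and extend to larger sets.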

## Please Upvote if you find the notebook interesting.

This notebook is under the [MIT License](https://opensource.org/licenses/MIT). Feel free to copy and edit it.


**Read more:**

* [Bi-gram Model from scratch in python using William Shakespeare Plays](https://www.kaggle.com/asimzahid/bi-gram-model-using-william-shakespeare-plays)
* [How to Scrape Tweets and create Dataset using Twint without Twitter API](https://www.kaggle.com/discussion/207512)

Thank you.