***Overview***

The objective of this notebook was to investigate whether there are particular types of books and cities that have a high rate of books being returned.

I have focused on cleaning the text representing book names and city names, and removing the anomalies, with the help of fuzzy string matching, regular expressions, string handling and data wrangling. 

After cleaning and merging the book and city names, exploratory data analysis was performed on the book return rate using bar plots and ECDF plots. The cleaned book and city names with the highest numbers of orders were also visualized.

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from fuzzywuzzy import fuzz, process
import tqdm.notebook as tq
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

In [None]:
orders = pd.read_csv('../input/gufhtugu-publications-dataset-challenge/GP Orders - 5.csv',encoding="UTF-8-SIG")
orders.head(5)

Orders containing multiple books are broken down into one row per book, columns that are not of interest are dropped, and book and city names are converted to uppercase strings.
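
As a rough illustration of this reshaping, here is a plain-Python sketch on made-up order rows. The titles and cities are hypothetical, and the `strip()` call is an extra assumption that the notebook cell does not make. Each order with several books separated by `/` becomes one row per book, with names upper-cased.

```python
# Hypothetical rows: (book names joined by '/', city)
raw_orders = [
    ("Python Basics/Data Science", "Lahore"),
    ("Machine Learning", "karachi"),
]

def explode_orders(rows):
    """Return one (book, city) pair per book, with both names upper-cased."""
    exploded = []
    for books, city in rows:
        for book in books.split("/"):
            # strip() is an added assumption to drop stray spaces around '/'
            exploded.append((book.strip().upper(), city.upper()))
    return exploded

book_rows = explode_orders(raw_orders)
print(book_rows)
```

The pandas `str.split` plus `explode` combination does the same thing column-wise while keeping the other columns of each order attached to every exploded row.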

In [None]:
orders['Book_O'] = orders['Book Name'].str.split('/')
orders = orders.explode('Book_O')
orders['Book_O'] = orders['Book_O'].apply(str)
orders['City_O'] = orders['City'].apply(str)
orders['Book_O'] = orders['Book_O'].str.upper()
orders['City_O'] = orders['City_O'].str.upper()
orders.drop([ 'Book Name','Order Date & Time','Order Number','Payment Method','Total items','Total weight (grams)'], axis = 1,inplace=True)

In [None]:
print(orders['Order Status'].value_counts())
orders = orders[ orders['Order Status'] != 'Cancelled' ].reset_index(drop=True)
returned = orders[ orders['Order Status']=='Returned'].reset_index(drop=True)
totalBooksNo = orders.shape[0]
returnedBooksNo = returned.shape[0]
print('Total Non-Cancelled Book Orders : ',totalBooksNo)
print('Returned Book Orders : ',returnedBooksNo)
print('Returned Books Percentage : ',round(returnedBooksNo*100/totalBooksNo,2),'%')

In [None]:
origNoBooks = orders['Book_O'].nunique()
origNoCities = orders['City_O'].nunique()
print('No of Unique Book Names : ',origNoBooks)
print('No of Unique City Names : ',origNoCities)

In [None]:
def listSortPrint(lst,n=200):
    lst.sort()
    print(lst[:n])
    print('_______________')

In [None]:
uniqueCities = list(orders['City_O'].unique())
print('First 200 Unique Cities')
listSortPrint(uniqueCities)

Transliteration means converting text from one script to another. I am using it to convert Urdu text into Roman Urdu, both for fuzzy matching and merging of city names and for finding Urdu words in book and city names.

The table for mapping is originally by Ahmed, T. (2009). "Roman to Urdu Transliteration using word list"-Conference of Language and Technology. The dictionary was written by Shan Khan in 2019.
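
To make the idea concrete, here is a minimal sketch that applies a small subset of the mapping with `str.translate`, which handles all characters in one pass. The subset, the names `SUBSET`, `TABLE` and `transliterate_subset`, and the sample word are chosen for illustration only; the full table is in the function that follows.

```python
# A small subset of the mapping table, written as Unicode escapes
SUBSET = {
    "\u0644": "L",   # lam
    "\u0627": "A",   # alef
    "\u06C1": "H",   # heh goal
    "\u0648": "O",   # waw
    "\u0631": "R",   # reh
    "\u0634": "SH",  # sheen (one character can map to several letters)
}
TABLE = str.maketrans(SUBSET)

def transliterate_subset(text):
    """Replace every mapped character; leave everything else untouched."""
    return text.translate(TABLE)

# The five letters lam, alef, heh goal, waw, reh spell the city name Lahore in Urdu
roman = transliterate_subset("\u0644\u0627\u06C1\u0648\u0631")
print(roman)
```

Vowels are mostly not written in Urdu script, so the Roman output is only an approximation of the usual English spelling, which is why fuzzy matching is still needed afterwards.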

In [None]:
def transliterate(string):
    buck2uni = {
            u"\u0627":"A",
            u"\u0675":"A", 
            u"\u0673":"A", 
            # U+0630 is mapped to "Z" further down; a duplicate "A" entry for it was overridden and has been removed
            u"\u0622":"AA", 
            u"\u0628":"B", 
            u"\u067E":"P", 
            u"\u062A":"T", 
            u"\u0637":"T", 
            u"\u0679":"T", 
            u"\u062C":"J", 
            u"\u0633":"S", 
            u"\u062B":"S", 
            u"\u0635":"S", 
            u"\u0686":"CH", 
            u"\u062D":"H", 
            u"\u0647":"H", 
            u"\u0629":"H", 
            u"\u06DF":"H", 
            u"\u062E":"KH", 
            u"\u062F":"D", 
            u"\u0688":"D", 
            u"\u0630":"Z", 
            u"\u0632":"Z", 
            u"\u0636":"Z", 
            u"\u0638":"Z", 
            u"\u068E":"Z", 
            u"\u0631":"R", 
            u"\u0691":"R", 
            u"\u0634":"SH", 
            u"\u063A":"GH", 
            u"\u0641":"F", 
            u"\u06A9":"K", 
            u"\u0642":"K", 
            u"\u06AF":"G", 
            u"\u0644":"L", 
            u"\u0645":"M", 
            u"\u0646":"N", 
            u"\u06BA":"N", 
            u"\u0648":"O", 
            u"\u0649":"Y", 
            u"\u0626":"Y", 
            u"\u06CC":"Y", 

            u"\u06D2":"E", 
            u"\u06C1":"H",
            u"\u064A":"E"  ,
            u"\u06C2":"AH"  ,
            u"\u06BE":"H"  ,
            u"\u0639":"A"  ,
            u"\u0643":"K" ,
            u"\u0621":"A",
            u"\u0624":"O",
            u"\u060C":""  # separator: Arabic (inverted) comma
    }

    for k, v in buck2uni.items():
        string = string.replace(k, v)

    return string

In [None]:
orders['City'] = orders['City_O'].apply(transliterate)

Finding three-letter city names helps in checking for city abbreviations that are used instead of full city names.

In [None]:
uniqueCities = list(orders['City'].unique())
uniqueCities.sort()
abbrevs = list(filter(lambda x: (len(x) == 3), uniqueCities))
print(abbrevs)

After analyzing the unique city names it was found that there are many anomalies in the city names. Apart from spelling mistakes they often include words like CITY, DISTRICT and province names. Sometimes the address is included and city name abbreviations are used.

I wanted to use a single name string for all orders from a particular city, to get accurate counts. The first step is to replace or remove those substrings, so I made a dictionary of substring replacements for the column. A few abbreviations (KHI, HYD) are replaced only when they make up the whole string, because they also occur as substrings of other names.
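
Two details of this kind of replacement are easy to get wrong, so here is a small plain-Python sketch of them. The helper names and the sample strings are hypothetical. First, the rules run in dictionary order, so a longer key has to come before any shorter key it contains. Second, a short abbreviation is safer as a whole-string match than as a substring match.

```python
def apply_substring_rules(name, rules):
    """Apply substring replacements in insertion order."""
    for old, new in rules.items():
        name = name.replace(old, new)
    return name

good_order = {"DISTRICT": "", "DIST": ""}   # longer key first
bad_order = {"DIST": "", "DISTRICT": ""}    # shorter key eats part of the longer one

cleaned_good = apply_substring_rules("LAHORE DISTRICT", good_order)
cleaned_bad = apply_substring_rules("LAHORE DISTRICT", bad_order)
print(repr(cleaned_good), repr(cleaned_bad))

def replace_whole(name, abbreviations):
    """Replace only when the entire string is a known abbreviation."""
    return abbreviations.get(name, name)

exact = {"KHI": "KARACHI"}
whole_hit = replace_whole("KHI", exact)
whole_miss = replace_whole("KHIPRO", exact)            # illustrative name, left alone
substring_damage = "KHIPRO".replace("KHI", "KARACHI")  # what a substring rule would do
print(whole_hit, whole_miss, substring_damage)
```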

In [None]:
replD_subs = {'CITY':'','DISTRICT':'','DISST':'','DIST':'','TEHSEEL':'','TEHSIL':'',' TEH ':'','ZLA':'',
              'VILLAGE':'','PUNJAB':'','SINDH':'','BALOCHISTAN':'','KPK':'','KHYBER PAKHTUNKHWA':'', 'PUR EAST':'PUR SHQI',
              'CANNT':'','CANTT':'','SIND':'','PAKISTAN':'','CENTRAL':'','NORTH/':'','EAST':'','SOUTH/':'','WEST':'',
              'BWN':'BAHAWAL NAGAR','BWP':'BAHAWAL PUR ','LHR':'LAHORE','BAHAWAL PUR':'BAHAWAL PUR  ',
              'MZH':'MUZAFFAR GARH','FSD':'FAISALABAD','DGK':'DERA GHAZI',
              'ISB':'ISLAMABAD',' AND ':''}
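for k,v in tq.tqdm(replD_subs.items()):
    # literal (non-regex) substring replacement, independent of the pandas version default
    orders['City'] = orders['City'].str.replace(k,v,regex=False)
# whole-value replacements, assigned back instead of a chained inplace call
orders['City'] = orders['City'].replace({'KHI':'KARACHI','HYD':'HYDERABAD'})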

This function removes digits, whitespace and any other non-letter characters at the start of the string, before the city name. It uses a regex to search for the first A-Z character (case-insensitive) and keeps everything from there on.
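
A compact sketch of the same idea, with a few sample inputs to show the edge cases. The name `trim_start_sketch` and the inputs are made up for illustration. Note that a string with no letters at all is returned unchanged rather than emptied.

```python
import re

def trim_start_sketch(name):
    """Drop everything before the first ASCII letter, if there is one."""
    match = re.search(r"[A-Za-z]", name)
    return name[match.start():] if match else name

trimmed = [trim_start_sketch(s) for s in ["  12 LAHORE", "#5-KARACHI", "12345", ""]]
print(trimmed)
```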

In [None]:
def trimStart(ct):
    res = None
    temp = re.search(r'[A-Z]',ct,re.I) 
    if temp is not None: 
        res = temp.start()
    else:
        res = 0
    if len(ct)>0: 
        return ct[res:]
    else:
        return ct

In [None]:
orders['City'] = orders['City'].apply(trimStart)

Sometimes there are additional details in the city and book names. City names can include district and province names and address details, so cutting the string off after a fixed length helps in merging similar city names. Book names can also carry additional details, such as a note that it was a free book, but those are rare and book names are often long, so I decided to truncate only the city names.

In [None]:
orders['City'] = orders['City'].str[:14]
orders['Book'] = orders['Book_O'].str[:]

In [None]:
correcNoCities = orders['City'].nunique()
print('No of Original Unique Cities : ',origNoCities)
print('No of Corrected Unique Cities : ',correcNoCities)
print( round(100-(correcNoCities*100/origNoCities),2),'% Reduction so far')

Some cities have a lot of addresses and extra details in their strings, and they are also important cities with high sales. So additional cleaning is done to replace the whole string with the city name whenever that name is found in the string. More cities could be cleaned in the same way.

In [None]:
def cleanLongStr(stri):
    ctLst = ['LAHORE','KARACHI','SIALKOT','MULTAN','ISLAMABAD']
    for ct in ctLst:
        if ct in stri:
            return ct
    return stri

In [None]:
orders['City'] = orders['City'].apply(cleanLongStr)

In [None]:
correcNoCities = orders['City'].nunique()
print('No of Original Unique Cities : ',origNoCities)
print('No of Corrected Unique Cities : ',correcNoCities)
print( round(100-(correcNoCities*100/origNoCities),2),'% Reduction so far')

In [None]:
orders.sort_values(by=['City'],ascending=True,inplace=True)

This function uses fuzzy string matching based on Levenshtein similarity to replace all similar strings in a dataframe column with the first occurrence of a string in the same group.

The main idea comes from code posted by Alperen on StackOverflow. However, that solution compared every string with every other string and took around 40 minutes to run on our data. I made it much faster by sorting the names, finding the start and end index of the names starting with the same letter, and then matching each string only against the strings in its own bin instead of all strings. This requires the first letter to be the same, which actually helps in making better matches in our case. Because the function returns the names in sorted order, the dataframe must be sorted by the same column before the result is assigned back, which is why the sort above is needed.

The time was reduced from 40 min to 2 min 24 sec. The algorithm still has O(n²) time complexity, because each string is compared with the strings in its bin, whose size still grows with the number of strings. However, the time is reduced by a factor close to the number of letters in the alphabet, like the roughly 17x reduction that I got.
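
The binning idea can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: `difflib.SequenceMatcher` stands in for `fuzz.ratio` (its scores may differ somewhat), and the function names and sample city strings are made up.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity score on a 0-100 scale (stand-in for fuzz.ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()

def merge_similar(names, thresh=73):
    """Replace each name by the first similar name in its first-letter bin."""
    names = sorted(names)
    bins = defaultdict(list)          # first letter -> positions in the sorted list
    for idx, name in enumerate(names):
        if name:
            bins[name[0]].append(idx)
    for positions in bins.values():
        for k, i in enumerate(positions):
            for j in positions[k + 1:]:   # compare only within the same bin
                if similarity(names[i], names[j]) >= thresh:
                    names[j] = names[i]
    return names

merged = merge_similar(["LAHORE", "KARACHI", "LAHOR", "LARKANA", "KARACHII"])
print(merged)
```

In this toy run the misspelled `LAHOR` wins over `LAHORE` simply because it sorts first, which is the same effect that leaves some misspelled names in the merged column.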

In [None]:
def fuzzyReplace(df,colName,thresh):
    
    strLst = list(df[colName])
    strLst.sort()
    indxD = {}
    lastS = '*'
    indxD[lastS]=[0]
    for i,stri in enumerate(strLst):
        if len(stri) > 0:
            curS = stri[0]
            if curS != lastS:
                indxD[curS] = [i]
                indxD[lastS].append(i)
                lastS = curS
    indxD[lastS].append(len(strLst))
    
    for i in tq.tqdm(range(len(strLst))):
        if len(strLst[i]) > 0:
            startLtr = strLst[i][0]
            for j in range( indxD[startLtr][0], indxD[startLtr][1] ):
                if i < j and fuzz.ratio(strLst[i], strLst[j]) >= thresh:
                    strLst[j] = strLst[i]
                    
    return strLst

In [None]:
orders['City'] = fuzzyReplace(orders,'City',73)

The number of unique cities was reduced to nearly half by text cleaning and merging of city names.

In [None]:
correcNoCities = orders['City'].nunique()
print('No of Original Unique Cities : ',origNoCities)
print('No of Corrected Unique Cities : ',correcNoCities)
print( round(100-(correcNoCities*100/origNoCities),2),'% Reduction after Fuzzy Merging')

In [None]:
uniqueCities = list(orders['City'].unique())
print('First 200 Unique Cities')
listSortPrint(uniqueCities)

In [None]:
orders.sort_values(by=['Book'],ascending=True,inplace=True)

Book names are relatively clean, with far fewer anomalies. However, several types of issues were discovered after analyzing the unique book names. They include the following:
1. Some books with Urdu names have orders with the name in Urdu text as well as orders with the name in Roman Urdu.
2. Some book names contain both Urdu text and Roman Urdu text, while separate Urdu and Roman Urdu versions exist too.
3. Some versions of a book name carry extra details about the book, often in parentheses.
4. Multiple versions of the same book also exist because of spelling mistakes.

The names of the books are often very similar, and sometimes a book has several parts. The number of book names with anomalies was relatively small and the chance of a wrong merge was significant, so I decided to manually make a list of the book names to replace and merge, as that is more reliable. I am using functions that detect book names likely to have anomalies to help me find those names manually, as a semi-automated human-in-the-loop approach.
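
One way to flag candidate titles for manual review is to test for characters in the Unicode Arabic block (U+0600 to U+06FF), which covers the Urdu letters used here. This is an alternative sketch, not the check used in the notebook's own helper functions, which compare a string with its transliteration. The function name and sample titles are hypothetical.

```python
import re

ARABIC_BLOCK = re.compile(r"[\u0600-\u06FF]")
LATIN_LETTER = re.compile(r"[A-Za-z]")

def classify_title(title):
    """Label a title as 'urdu', 'latin', 'mixed', or 'other' by the scripts it contains."""
    has_urdu = ARABIC_BLOCK.search(title) is not None
    has_latin = LATIN_LETTER.search(title) is not None
    if has_urdu and has_latin:
        return "mixed"
    if has_urdu:
        return "urdu"
    if has_latin:
        return "latin"
    return "other"

labels = [
    classify_title("DATA SCIENCE"),
    classify_title("\u0688\u06CC\u0679\u0627"),
    classify_title("JAVA \u062C\u0627\u0648\u0627 2"),
    classify_title("123"),
]
print(labels)
```

Titles labelled `mixed` or `urdu` are the ones worth comparing by eye against their Roman Urdu counterparts.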

In [None]:
def hasEngUrdBoth(stri):
    return transliterate(stri) != stri and re.search(r'[A-Z]',stri,re.I) is not None

def hasUrduOnly(stri):
    return transliterate(stri) != stri and re.search(r'[A-Z]',stri,re.I) is None

def hasEnglish(stri):
    return re.search(r'[A-Z]',stri,re.I) is not None

def hasParenthesis(stri):
    return '(' in stri or ')' in stri

Note that I'm only printing the first 200 names now, because the full output becomes too big to show. I printed all of them during my analysis.

In [None]:
print('First 200 Book Names')
listSortPrint(list(orders['Book_O'].unique()))

In [None]:
print('Book Titles with English and Urdu words')
mixedBooks = list(orders[ orders['Book'].apply(hasEngUrdBoth) == True ]['Book'].unique())
listSortPrint(mixedBooks)
print('Book Titles with Parenthesis')
listSortPrint(list(orders[ orders['Book'].apply(hasParenthesis) == True ]['Book'].unique()))

In [None]:
print('Book Titles in Urdu')
urduBooks = list(orders[ orders['Book'].apply(hasUrduOnly) == True ]['Book'].unique())
listSortPrint(urduBooks)
urduBooksTL = [ transliterate(bk) for bk in urduBooks ]
print('Book Titles in Urdu - Transliterated')
listSortPrint(urduBooksTL)
englishBooks = list(orders[ orders['Book'].apply(hasEnglish) == True ]['Book'].unique())

In [None]:
for i,bk in enumerate(urduBooksTL):
    bestMatchesE = process.extract(bk,englishBooks,limit=3)
    bestMatchesU = process.extract(urduBooks[i],mixedBooks,limit=3)
    bestMatchesE = [ tpl for tpl in bestMatchesE if tpl[1] >= 80]
    bestMatchesU = [ tpl for tpl in bestMatchesU  if tpl[1] >= 70]
    
    if len(bestMatchesE)>0 or len(bestMatchesU)>0:
        print(i+1,' : ',urduBooks[i])
        if len(bestMatchesE)>0:
            print(bestMatchesE)
        if len(bestMatchesU)>0:
            print(bestMatchesU)

In [None]:
replD_subs = { 'ڈیٹا سائنس':'DATA SCIENCE','مشین لرننگ':'MACHINE LEARNING',
              'ارفع کریم':'ARFA KARIM','ڈیٹا سائنس ۔ ایک تعارف':'DATA SCIENCE',
              'ارطغرل غازی':'ERTUGRUL GHAZI','MOLO MASALI - مولو مصلی':'MOLO MASALI',
              'SHAOOR شعور۔ علم سے آگہی کا سفر':'SHAOOR','SAFAR E HAJJ سفر حج':'SAFAR E HAJJ',
              'JAVA  جاوا 2':'JAVA 2','(C++) ++سی':'(C++)','R KA TAARUF  آر کا تعارف':'R KA TAARUF',
              'JUSTUJU KA SAFAR جستجو کا سفر':'JUSTUJU KA SAFAR (URDU)','JUSTJU KA SAFAR-1':'JUSTUJU KA SAFAR (URDU)',
              'LINUX - AN INTRODUCTION  (RELEASE DATA - OCTOBER 3, 2020)':'LINUX - AN INTRODUCTION',
              'ادھورے گناہ':'ADHORAY GUNNAH','JAN KA KHAMOSH ZAYAN - HIGH BLOOD PRESSURE' : 'JAN KA KHAMOSH ZAYAN',
              'KULYAT MAKATEEB E IQBAL (4 VOLUMES COMPLETE)':'KULLYAT MAKATEEB-E-IQBAL (4 VOLUMES)','انٹرنیٹ سے پیسہ کمائیں؟- مستحقین زکواة':'انٹرنیٹ سے پیسہ کمائیں',
              'IRTEQA SHAHEEN - ارتقاء شاہین':'IRTEQA SHAHEEN','HAR SHAYE KA NAZRIA - ہر شے کا نظریہ':'HAR SHAYE KA NAZRIA',
              'BITCOIN BLOCKCHAIN AUR CRYPTO CURRENCY - FREE E-BOOK':'BIT COIN BLOCK CHAIN AUR CRYPTO CURRENCY',
              'BIT COIN BLOCK CHAIN AUR CRYPTO CURRENCY بٹ کوائن، بلاک چین اور کرپٹو کرنسی':'BIT COIN BLOCK CHAIN AUR CRYPTO CURRENCY',
              'HAZIR GHAYAB حاضر غائب':'HAZIR GHAYAB'
             }
for k,v in tq.tqdm(replD_subs.items()):
    # whole-value replacement, assigned back instead of a chained inplace call
    orders['Book'] = orders['Book'].replace(k,v)

In [None]:
print('First 200 Book Names after merging')
listSortPrint(list(orders['Book'].unique()))

Many book names were merged and fixed, resulting in a 1.3% reduction in unique names, which will make the counts more accurate. However, this is much less than for city names, because book names are cleaner.

In [None]:
correcNoBooks = orders['Book'].nunique()
correcNoCities = orders['City'].nunique()
print('No of Original Unique Books: ',origNoBooks)
print('No of Corrected Unique Books: ',correcNoBooks)
print( round(100-(correcNoBooks*100/origNoBooks),2),'% Reduction')
print('_'*10)
print('No of Original Unique Cities : ',origNoCities)
print('No of Corrected Unique Cities : ',correcNoCities)
print( round(100-(correcNoCities*100/origNoCities),2),'% Reduction')
print('_'*10)

Currently the City column can contain misspelled city names, because the first name in each group is used for the whole group. The original names will be brought back later through joins.

In [None]:
orders.sample(20)

This function generates a dataframe with the return percentage and the completed, returned and total book counts. The return percentage of a book is the share of its non-cancelled orders that were returned.

In [None]:
def getOrderStatusInfo(colName,orders,minObs=20):
    col_status = orders.groupby([colName, "Order Status"])["Order Status"].count().unstack().fillna(0).reset_index()
    col_status['Total'] = col_status[ ['Completed','Returned'] ].sum(axis=1)
    col_status['Return_Percentage'] = col_status['Returned'] / col_status['Total'] * 100
    col_status.sort_values(by=['Return_Percentage'],ascending=False,inplace=True)
    col_status = col_status[ (col_status['Total'] >= minObs) & (col_status['Returned'] > 0) ].reset_index(drop=True)
    return col_status

I am also requiring a minimum number of observations when finding books with a high return percentage, so that we find books and cities with some data to support that they have a high likelihood of being returned. However, I didn't set it higher, because many books and locations don't have many orders but shouldn't be excluded.
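
The effect of the minimum-observations filter is easy to see on a toy example. The sketch below recomputes the return percentage in plain Python on made-up rows. The function name, book names and counts are all hypothetical.

```python
from collections import Counter

def return_percentages(rows, min_obs=5):
    """rows: iterable of (name, status); returns {name: return percentage}."""
    totals, returned = Counter(), Counter()
    for name, status in rows:
        if status in ("Completed", "Returned"):
            totals[name] += 1
            if status == "Returned":
                returned[name] += 1
    result = {}
    for name, total in totals.items():
        # keep only names with enough orders and at least one return
        if total >= min_obs and returned[name] > 0:
            result[name] = round(100 * returned[name] / total, 2)
    return result

rows = (
    [("BOOK A", "Completed")] * 8 + [("BOOK A", "Returned")] * 2
    + [("BOOK B", "Returned")]            # 100% returned, but only one order
    + [("BOOK C", "Completed")] * 5       # enough orders, but never returned
)
stats = return_percentages(rows, min_obs=5)
print(stats)
```

Without the threshold, `BOOK B` would top the ranking at 100% on the strength of a single order, which is exactly the kind of entry the filter is meant to keep out.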

In [None]:
minObsB = 10
minObsC = 5
evidentReturnedBooks = getOrderStatusInfo('Book',orders,minObsB)
evidentReturnedCities = getOrderStatusInfo('City',orders,minObsC)
print('No of Returned Book Titles with at least '+str(minObsB)+' Orders : ',evidentReturnedBooks.shape[0])
print('No of Returned Cities with at least '+str(minObsC)+' Orders : ',evidentReturnedCities.shape[0])

In [None]:
evidentReturnedCities = pd.merge(evidentReturnedCities,orders[['City_O','City']].drop_duplicates(),\
                                 how='left',left_on='City', right_on='City').drop_duplicates(subset=['City'])

evidentReturnedBooks = pd.merge(evidentReturnedBooks,orders[['Book_O','Book']].drop_duplicates(),\
                                 how='left',left_on='Book', right_on='Book').drop_duplicates(subset=['Book'])

After joining, the Book_O column has the original book names, and the Book column has names after merging different versions of same book names.

The sorted return percentage of books along with their total orders can be observed here.

In [None]:
evidentReturnedBooks[:30]

After joining, the City_O column has the original city names, and the City column has the names after merging different versions of the same city name. The merged names can be misspelled, because each one is simply the first name from a group of similar city names.

The sorted return percentage of cities along with their total orders can be observed here.

In [None]:
evidentReturnedCities[:40]

In [None]:
def barPlot(df,topN,catCol,valCol,title,units=''):
    fig, ax = plt.subplots(figsize =(16, 16)) 
    ax.barh(df[catCol].head(topN),df[valCol].head(topN)) 
    ax.invert_yaxis()
    for i in ax.patches: 
        plt.text(i.get_width()+0.25, i.get_y()+0.5,str(round((i.get_width()), 2))+units,\
                 fontsize = 10, fontweight ='bold', color ='grey') 
    ax.set_title(title, loc ='left') 
    plt.show() 

In [None]:
topN = 30
barPlot(evidentReturnedBooks,topN,'Book','Return_Percentage','Top '+str(topN)+' Return Percentages of Books',' %')

The bar chart below shows the top 30 city names or locations by return percentage, among those with at least 5 orders.

It can be observed that there are many cities/locations with a high return percentage, and there are some interesting observations. Some locations that seem rare, like foreign cities, small towns and even particular addresses, have a very high return percentage, up to 100%. I might have some confirmation bias, but I suspect that those orders could be from the same person, or from people with some connection, who are exploiting the return policy by ordering and returning repeatedly.

The small towns often have a high return percentage, which makes sense considering that they might find the books more expensive. There are not many orders from them, so some of that might be by chance; however, a similar trend was observed with a much higher minimum orders requirement.

In [None]:
topN = 30
barPlot(evidentReturnedCities,topN,'City_O','Return_Percentage','Top '+str(topN)+' Return Percentages of Cities',' %')

The ECDF plot for the return percentages of those books shows that around 75% of the books have a return percentage under 13%, while the top 25% are spread between around 18% and 53.3%, with some big jumps represented by the flat portions. The median is around 8%.

This indicates that there is a considerable number of books with unusually high return percentages, which may be more likely to be returned because of their content, and some of them are outliers.
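
For readers new to ECDF plots, the value of the curve at x is just the fraction of observations that are less than or equal to x. A minimal sketch with made-up return percentages (not the notebook's data):

```python
import statistics

def ecdf(values, x):
    """Fraction of values that are <= x."""
    return sum(v <= x for v in values) / len(values)

# Hypothetical return percentages for eight books
sample = [2.0, 5.0, 7.0, 8.0, 9.0, 12.0, 25.0, 53.3]

at_13 = ecdf(sample, 13)     # share of books at or below 13%
at_24 = ecdf(sample, 24)     # unchanged: no book lies between 12% and 25%
median = statistics.median(sample)
print(at_13, at_24, median)
```

The curve stays flat wherever there are no observations, so a long flat stretch followed by a step marks a gap between the bulk of the books and a few with much higher return percentages.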

In [None]:
sns.ecdfplot(data=evidentReturnedBooks, x="Return_Percentage").set_title('ECDF for Return Percentages of Books')

The ECDF plot for the return percentages of those cities/locations shows that most of them (all but roughly the top 20%) have a return percentage under 20%, while the top 20% are spread between around 22% and 100%, with big jumps represented by the flat portions. The median is around 10%.
This indicates that there is a considerable number of cities/locations with unusually high return percentages, which may be more likely to return books. There are several locations with a 100% return percentage, indicated by the vertical jump at the end.

It was a bit unexpected that a significant number of particular cities/locations would have such high return percentages. My intuition was that books would appear to have more influence on the return percentage than cities. The reason for that should be investigated and checked for possible exploitation of the return policy by certain customers and demographics.

In [None]:
sns.ecdfplot(data=evidentReturnedCities, x="Return_Percentage").set_title('ECDF for Return Percentages of Cities')

The cleaned and merged book and city names are also sorted by total number of orders, to get an accurate estimate of the top-selling books and the cities with the most orders.

In [None]:
evidentReturnedBooks.sort_values(by=['Total'],ascending=False,inplace=True)
evidentReturnedCities.sort_values(by=['Total'],ascending=False,inplace=True)

It can be seen that the top-selling books are technical and other books by Zeeshan Usmani, with the book on how to earn from the internet taking the top spot with significantly higher sales than all other books, probably because it is about something that a lot of people are interested in.

In [None]:
topN = 10
barPlot(evidentReturnedBooks,topN,'Book','Total','Top '+str(topN)+' Books by Total Orders')

It can be seen that the most orders are from the big cities with the largest populations, which makes sense. Islamabad has a smaller population than the few cities below it, but people there seem to be more educated and fond of books and learning.

In [None]:
topN = 10
barPlot(evidentReturnedCities,topN,'City_O','Total','Top '+str(topN)+' Cities by Total Orders')

There is potential for more analytics and improvements. Some possible things that can be tried include the following:
* The books can be divided into categories to see analytics for different types of books. Scraping the Gufhtugu website for that might also be a good idea.
* The cities and locations can also be divided into categories to see analytics for different types of areas. Some data of Pakistani cities can be found on Kaggle.
* The city/location merging can be improved with some manual replacement.
* Some hypothesis testing might be performed to verify the insights being indicated. 
* Some more data wrangling and exploration can be done to investigate things and find possible reasons.
* Features other than books and cities can also be explored.
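
As one possible starting point for the hypothesis-testing idea, here is a sketch of a two-proportion z-test that compares the return rate of one city with that of all other cities. The counts are hypothetical, the function name is made up, and the normal approximation it relies on is not reliable for locations with only a handful of orders.

```python
import math

def two_proportion_z_test(returned_a, total_a, returned_b, total_b):
    """Return (z, two-sided p-value) for the difference between two return rates."""
    p_a = returned_a / total_a
    p_b = returned_b / total_b
    pooled = (returned_a + returned_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value

# Hypothetical counts: 30 of 100 orders returned in one city, 10 of 100 elsewhere
z, p_value = two_proportion_z_test(30, 100, 10, 100)
print(round(z, 3), p_value)
```

A small p-value would suggest that the difference in return rates is unlikely to be due to chance alone, although testing many cities at once would call for a multiple-comparison correction.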