# Pakistan's Largest E-commerce Dataset
1. Top ten best-selling categories
2. Hidden Patterns
    * Yearly sales
    * Monthly sales
    * Quarterly sales
    * Best Customer
3. Visualize Payment Methods versus Order Status 
    * Order Status vs Payment Method
    * Payment Methods Used
    * Canceled Orders by Category
    * Canceled Orders by Payment Method

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **What is the best-selling category?**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
eco_df = pd.read_csv('/kaggle/input/pakistans-largest-ecommerce-dataset/Pakistan Largest Ecommerce Dataset.csv')

In [None]:
eco_df.shape

In [None]:
eco_df.describe()

In [None]:
eco_df.info()

In [None]:
eco_df.head()

In [None]:
eco_df.isnull().sum()

In [None]:
df = eco_df.drop(['Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25'], axis=1)

In [None]:
df = df.dropna(how='any', axis=0)

In [None]:
dtype_conv = {
    'item_id': int,
    'qty_ordered': int,
    'Month': int,
    'Year': int,
    'Customer ID': int,
}

df = df.astype(dtype_conv)
df.info()

In [None]:
# Count order lines per category (value_counts sorts in descending order)
selling = df['category_name_1'].value_counts()
selling = dict(selling[:10])

k = list(selling.keys())

# Print the top ten categories
print('Top Ten Best-Selling Categories\n')

for i in range(len(selling)):
    print(i+1," ", k[i])

In [None]:
v = list(selling.values())

# Visualize the top ten best-selling categories
plt.figure()
sns.barplot(x = k , y = v)

plt.title("Top Ten Best-Selling Categories")
plt.xlabel('Category')
plt.xticks(rotation=90, size=10)
plt.yticks(size=10)
plt.ylabel('Number of order lines')

plt.show()

# **Hidden Patterns**

In [None]:
df['created_at'] = pd.to_datetime(df['created_at'])

In [None]:
df.set_index('created_at', inplace=True)

In [None]:
plt.title('Yearly sales')
plt.ylabel('Sales')
df['price'].plot(figsize=(12, 5)) 

* Sales in 2016 are average.
* Sales peak in 2017.
* Sales in 2018 are very low. A pandemic cannot be the cause, since COVID-19 began in 2020; it is worth checking whether the dataset covers the whole of 2018.
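
The year-by-year comparison above can be checked with totals rather than a raw line plot. The sketch below is a minimal illustration on made-up data: `toy` is a hypothetical stand-in for the notebook's `df`, with the same `created_at` index and `grand_total` column. It sums sales per year and lists the first and last order date in each year, which shows whether a low year is simply a partial one.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's `df`
toy = pd.DataFrame(
    {
        "created_at": pd.to_datetime(
            ["2016-07-01", "2016-12-15", "2017-03-10",
             "2017-11-24", "2018-01-05", "2018-02-20"]
        ),
        "grand_total": [100.0, 250.0, 300.0, 900.0, 150.0, 50.0],
    }
).set_index("created_at")

# Total sales per calendar year
yearly_total = toy.groupby(toy.index.year)["grand_total"].sum()

# First and last order date per year, to spot years with partial coverage
coverage = toy.index.to_series().groupby(toy.index.year).agg(["min", "max"])

print(yearly_total)
print(coverage)
```

On the real data, replacing `toy` with `df` should work as long as `created_at` is still the index.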

In [None]:
plt.title('Monthly sales')
plt.ylabel('Sales')
df.loc['2017', 'price'].plot(figsize=(12,5))

**Exploring sales by month shows that sales are high in October, November, December, and January (2018).**
**Sales at the end of 2017 are strong.**
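
One way to back up the monthly observation with numbers is to sum sales per calendar month and rank the months. This is a minimal sketch on made-up data; `toy` is a hypothetical stand-in for `df`. It groups by `to_period("M")` rather than a resample alias, an illustrative choice that avoids the alias differences between pandas versions.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's `df`
toy = pd.DataFrame(
    {
        "created_at": pd.to_datetime(
            ["2017-01-05", "2017-01-20", "2017-10-03",
             "2017-11-24", "2017-11-25", "2017-12-31"]
        ),
        "grand_total": [100.0, 50.0, 200.0, 500.0, 400.0, 300.0],
    }
).set_index("created_at")

# Sum sales per calendar month
monthly_total = toy["grand_total"].groupby(toy.index.to_period("M")).sum()

# Months ranked from highest to lowest total sales
ranked = monthly_total.sort_values(ascending=False)
print(ranked.head(3))
```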

In [None]:
sales = df.groupby(['Year'])['grand_total'].sum().reset_index()
plt.figure()
sns.barplot(x = sales['Year'], y = np.log(sales['grand_total']))

plt.title("Sales by Year (log scale)")
plt.xlabel('Year')
plt.xticks(size=15)
plt.yticks(size=15)
plt.ylabel('log(grand_total)')

plt.show()

1. Who are your best customers?
2. Which customers buy the most items?
3. Cluster the sales categories (incomplete)
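
The first two questions can be sketched on made-up data as follows. `toy` is a hypothetical stand-in for `df`, reusing its `Customer ID`, `grand_total`, and `qty_ordered` column names. The sketch treats "best" as highest total spend and "buys the most" as highest total quantity; both definitions are assumptions.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's `df`
toy = pd.DataFrame(
    {
        "Customer ID": [1, 2, 1, 3, 2, 3, 3],
        "grand_total": [500.0, 120.0, 300.0, 50.0, 80.0, 60.0, 40.0],
        "qty_ordered": [1, 2, 1, 5, 1, 4, 6],
    }
)

# Aggregate spend and units per customer
per_customer = toy.groupby("Customer ID")[["grand_total", "qty_ordered"]].sum()

# Question 1: customers ranked by total spend
top_spenders = per_customer["grand_total"].nlargest(2)

# Question 2: customers ranked by total units bought
top_buyers = per_customer["qty_ordered"].nlargest(2)

print(top_spenders)
print(top_buyers)
```

The two rankings need not agree: in the toy data, customer 1 spends the most while customer 3 buys the most units.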

**Who are your best customers?**

In [None]:
# Rank customers by total spend, highest first
best_cust = df.groupby(['Customer ID'])['grand_total'].sum().reset_index().sort_values(by=['grand_total'], ascending=[False])
plt.figure(figsize= (12,8))
# Cast IDs to str so the bars keep the ranked order instead of being sorted by numeric ID
sns.barplot(x = best_cust['Customer ID'][:10].astype(str), y = best_cust['grand_total'][:10])

plt.title("Top 10 Customers")
plt.xlabel('Customer ID')
plt.xticks(size=15)
plt.yticks(size=15)
plt.ylabel('Grand total')

plt.show()

In [None]:
df.head()

In [None]:
# Correlation between price and quantity ordered
print("Correlation between price and quantity ordered")
round(df['price'].corr(df['qty_ordered']), 3)

**Sales on a monthly basis**

In [None]:
sales_m = df.groupby(['Month'])['grand_total'].sum().reset_index()
np.log(sales_m['grand_total'])

In [None]:
plt.figure(figsize=(12, 5))
sns.barplot(x = sales_m['Month'], y = sales_m['grand_total'])

plt.title("Sales in Month")
plt.xlabel('Months')
plt.xticks(size=20)
plt.yticks(size=20)
plt.ylabel('sales')

plt.show()

**Sales on a quarterly basis**

In [None]:
plt.figure(figsize=(12, 5))
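plt.title('Average Sales on a Quarterly (3-Month) Basis')
plt.ylabel('Average sales')
# resample('Q') groups rows by calendar quarter and mean() averages grand_total in each;
# pandas 2.2+ prefers the alias 'QE'
df['grand_total'].resample('Q').mean().plot(kind='bar')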

# **Visualize Payment Methods versus Order Status**

In [None]:
# Number of orders in each status category
ord_st = df['status'].value_counts()
print('Order Status Statistics\n', ord_st)
plt.figure(figsize=(10, 5))
ord_st.plot(kind='bar',color=['Red','green','blue'])
plt.title("Order Status Category division", size=20)
plt.xlabel('Order Status categories', size=15)
plt.ylabel('Order Value Count', size=15)
plt.show()

In [None]:
pay_mthd = df['payment_method'].value_counts()
print('Payment Method Statistics\n',pay_mthd)

In [None]:
plt.figure(figsize=(10, 5))
pay_mthd.plot(kind='bar', color=['green', 'Yellow', 'blue', 'Pink'])
plt.title('Payment Methods Statistics', size=20)
plt.xlabel('Payment Method', size=15)
plt.ylabel('Payments', size=15)
plt.show()

### **Canceled Orders by Category and Payment Method**

In [None]:
canOrd_df = df[df['status']== 'canceled']
canOrd_df.head()

In [None]:
can_ord = canOrd_df['category_name_1'].value_counts()

plt.figure(figsize=(10, 5))
can_ord.plot(kind='bar', color=['green', 'Yellow', 'blue', 'Pink'])
plt.title('Canceled Orders by Category', size=20)
plt.xlabel('Category', size=15)
plt.ylabel('Number of canceled orders', size=15)
plt.show()

In [None]:
pay_method = canOrd_df['payment_method'].value_counts()

plt.figure(figsize=(10, 5))
pay_method.plot(kind='bar', color=['green', 'Yellow', 'blue', 'Pink'])
plt.title('Canceled Orders by Payment Method', size=20)
plt.xlabel('Payment Method', size=15)
plt.ylabel('Number of canceled orders', size=15)
plt.show()

* **80,369 orders were paid with Payaxis, but almost 50,000 of them were canceled.**
* **57,000 orders were paid with Easypaisa, but almost 35,000 of them were canceled.**

### **These two payment methods account for the largest losses; there could be various reasons for this.**
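
Comparing raw counts across payment methods can mislead when the methods differ in volume, so a cancellation rate per method may be more informative. The sketch below is a minimal illustration on made-up data: `toy` and its values are hypothetical, and only the `payment_method` and `status` column names come from the notebook. It uses `pd.crosstab` with `normalize="index"` so that each row sums to 1.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's `df`
toy = pd.DataFrame(
    {
        "payment_method": ["Payaxis", "Payaxis", "Payaxis",
                           "cod", "cod", "cod", "cod",
                           "Easypaisa", "Easypaisa"],
        "status": ["canceled", "canceled", "complete",
                   "complete", "complete", "canceled", "complete",
                   "canceled", "complete"],
    }
)

# Rows: payment method; columns: order status;
# values: share of that method's orders in each status
status_share = pd.crosstab(toy["payment_method"], toy["status"], normalize="index")

# Cancellation rate per payment method, highest first
cancel_rate = status_share["canceled"].sort_values(ascending=False)
print(cancel_rate)
```

On the real data, `status_share.plot(kind='bar', stacked=True)` would give an order status versus payment method view in a single chart.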