In [None]:

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Data Overview:

Dataset - Pakistan E-Commerce data.

Products sold - Almost all types of everyday products

Customers - Local Pakistanis (some wholesalers)

Transactions Period - **2016 to 2018**

# Results Obtained from EDA:

1. The majority of customers ordered only **one item** at a time.

2. Half of all orders are for items **priced at Rs 900 or less**.

3. The most popular payment methods are **cash on delivery ("cod") and "EasyPaisa"**.

4. Most customers completed their orders.

5. The most demanded category in Pakistan is **mobiles and tablets** (which also includes mobile accessories such as chargers and earpods), followed by men's fashion.

6. The **entertainment category** looks the most profitable; it includes smart TVs, projectors, PlayStations, etc.

7. Sales were highest in the **11th month (November)** across 2016-2018.

8. Although entertainment is the most profitable category in point 6, the category that generated the most income is **mobiles and tablets**. In my view, this is because mobiles and tablets also includes accessories such as chargers and earpods.

9. The category with the most completed orders is **men's fashion**. The category with the most canceled orders is mobiles and tablets, and the category with the most refunded orders is men's fashion.

10. Price shows a positive relationship with sales and year, and a negative relationship with month.

11. Payment method and order status have a negative relationship with each other.

12. Category and order date (created_at) are unrelated to each other.

# Importing Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
plt.rcParams["figure.figsize"] = (25, 10)

# Importing Dataset

In [None]:
dataset = pd.read_csv('../input/pakistans-largest-ecommerce-dataset/Pakistan Largest Ecommerce Dataset.csv')

In [None]:
dataset

# Selecting Features

In [None]:
data = dataset[['item_id','status','created_at','price','qty_ordered','grand_total','category_name_1',
           'payment_method','Year','Month','Customer ID']]

**Here we select some useful features from the dataset for our analysis.**

In [None]:
data.head()

In [None]:
# change the column names (reassign instead of inplace=True to avoid SettingWithCopyWarning on a slice)
data = data.rename(columns={'category_name_1': 'category',
                            'Year': 'year',
                            'Month': 'month',
                            'Customer ID': 'customer_id'})

In [None]:
# rearrange all the columns for easy reference
data = data[['item_id','customer_id','created_at','price','qty_ordered','grand_total','category','status','payment_method','month','year']]

In [None]:
data

In [None]:
df = data.copy()

## Handling Missing Values

In [None]:
# check missing values for each column 
df.isnull().sum()

In [None]:
# check out the rows with missing values
df[df.isnull().any(axis=1)].head()

In [None]:
# drop the rows with missing values
df = df.dropna()

In [None]:
# check missing values for each column 
df.isnull().sum().sort_values(ascending=False)

# Feature Engineering

### Handling payment method

In [None]:
df['payment_method'].unique()

In the payment method column, several values refer to the same method under different names. For example, Easypay_MA, easypay_voucher and Easypay are the same, so we merge them into the single value Easypay. Some of the other payment methods may be unfamiliar, so here are definitions of those terms:

**apg:**

Advance Payment Guarantee (APG) is a form of guarantee issued by a bank to give a principal confidence that an advance paid on a job awarded to the bank's customer will be used for the purpose for which it was made.

**finance settlement:**

Transaction in which a contract is settled on the same day as the trade date, or the next day if the trade occurs after 2:30 p.m. EST and the parties agree to this procedure. Often occurs because a party is strapped for cash and cannot wait until the regular three-business day settlement. 

**easypay:**

 Easypay is an easy payment solution from EasyPaisa (Telenor) that is specially designed for eCommerce customers and sellers in Pakistan. Now you can pay at Shophive.com with greater ease and convenience with EasyPay by using your debit/credit cards, EasyPaisa mobile accounts or via any EasyPaisa Shop.
 

**Customer Credit:**

A consumer credit system allows consumers to borrow money or incur debt, and to defer repayment of that money over time. Having credit enables consumers to buy goods or assets without having to pay for them in cash at the time of purchase.

**Payaxis:**

Gateway for TPS. TPS is a leading provider of cards and payment solutions, powering digital payments for various commercial and central banks, telecoms, processors and financial institutions. In layman's terms, it works with billing companies.

In [None]:
df['payment_method'].unique()

In [None]:
payment_to_replace = {'cashatdoorstep': 'cod', 'Easypay_MA': 'Easypay', 'easypay_voucher': 'Easypay',
                      'jazzvoucher': 'jazzwallet', 'internetbanking': 'Payaxis', 'mygateway': 'Payaxis',
                      'marketingexpense': 'Payaxis'}
df = df.replace({"payment_method": payment_to_replace})

## Handling status

The status column contains many different values, but an order status should be one of three kinds: complete, canceled and order_refunded. I replace all other values with the most closely related of these.

In [None]:
df['status'].unique()

In [None]:
replace_status = {'received': 'complete', 'exchange': 'complete', 'paid': 'complete', 'cod': 'complete',
                  'payment_review': 'complete', 'pending': 'complete', 'processing': 'complete',
                  'closed': 'canceled',
                  'refund': 'order_refunded', 'pending_paypal': 'order_refunded',
                  'fraud': 'order_refunded', 'holded': 'order_refunded'}
df = df.replace({"status": replace_status})

In [None]:
# removing null
df = df[df['status'] != '\\N']

## Handling Category

The category column contains '\\N' placeholders, which need to be treated as missing values.

In [None]:
df['category'].unique()

In [None]:
# convert the category names to lower case (on df, the working copy)
df['category'] = df['category'].str.lower()

In [None]:
df['category'] = df['category'].replace('\\N',' ')

In [None]:
# mark blank categories as missing (only the category column, not the whole row)
df.loc[df['category'] == ' ', 'category'] = np.nan

In [None]:
# assign the result back; ffill returns a new Series
df['category'] = df['category'].ffill()

**The pandas `ffill()` method fills missing values in a DataFrame or Series. "ffill" stands for "forward fill": it propagates the last valid observation forward.**
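
To make the forward-fill behaviour concrete, here is a minimal sketch on a toy Series. The category names below are illustrative placeholders, not values read from the dataset.

```python
import numpy as np
import pandas as pd

# Toy category column (hypothetical values); NaN marks rows whose category was missing.
categories = pd.Series(["mobiles & tablets", np.nan, np.nan, "men's fashion", np.nan])

# Forward fill copies the last valid value into each gap that follows it.
filled = categories.ffill()
print(filled.tolist())
```

Note that a missing value at the very start of the column has nothing before it and stays missing. Forward fill also assumes that neighbouring rows tend to share a category, which depends on how the rows are ordered.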

In [None]:
df['category'].unique()

In [None]:
df.info()

In [None]:
df['created_at'] = pd.to_datetime(df.created_at, format='%m/%d/%Y')

In [None]:
# descriptive statistics
df.describe().round(2)

## Insights from these descriptive statistics:
* The majority of people order only 1 item at a time.
* Half of all orders are for items priced at Rs 900 or less (the median price).
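
As a rough illustration of how these two readings come out of `describe()`, the sketch below uses made-up prices and quantities (not the dataset's values). The `50%` row is the median, and the share of single-item rows can be checked directly.

```python
import pandas as pd

# Hypothetical prices in rupees and quantities, for illustration only.
toy = pd.DataFrame({"price": [150, 400, 900, 2500, 60000],
                    "qty_ordered": [1, 1, 1, 2, 1]})

summary = toy.describe()
median_price = summary.loc["50%", "price"]    # same value as toy["price"].median()
mean_price = summary.loc["mean", "price"]     # pulled upward by the expensive item
share_single_item = (toy["qty_ordered"] == 1).mean()

print(median_price, mean_price, share_single_item)
```

With skewed prices like these the mean sits far above the median, which is why the median is the safer figure to quote for a "typical" order.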


## 2. Univariate Analysis

## Question 1: What is the most popular payment method among customers?

In [None]:
df['payment_method'].value_counts()

In [None]:
# pass the column by keyword so the call works across seaborn versions
chart = sns.countplot(x='payment_method', data=df)

chart.set_xticklabels(chart.get_xticklabels(), rotation=30, horizontalalignment='right')
plt.show()

### Answer: The most popular payment methods are cash on delivery ("cod") and "EasyPaisa". Definitions of the less familiar payment terms are given above, in the payment method section.

## Question 2: Do customers complete their orders or not?

In [None]:
df['status'].value_counts()

In [None]:
# pass the column by keyword so the call works across seaborn versions
chart = sns.countplot(x='status', data=df)

chart.set_xticklabels(chart.get_xticklabels(), rotation=30, horizontalalignment='right')
plt.show()

### Answer: This shows that the majority of customers complete their orders.

## Question 3: What is the most demanded category in Pakistan?

In [None]:
# pass the column by keyword so the call works across seaborn versions
chart = sns.countplot(x='category', data=df)

chart.set_xticklabels(chart.get_xticklabels(), rotation=30, horizontalalignment='right')
plt.show()

### Answer: The most demanded category in Pakistan is mobiles and tablets (which also includes mobile accessories such as chargers and earpods), followed by men's fashion.

# Bi-variate analysis

## Question 4: Which category is the most profitable?

In [None]:

chart = sns.barplot(x=df['category'], y=df['price'])

chart.set_xticklabels(chart.get_xticklabels(), rotation=30, horizontalalignment='right')
plt.show()

### Answer: The entertainment category looks the most profitable, judging by average item price (each bar is the mean price within a category); it includes smart TVs, projectors, PlayStations, etc.
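
By default `sns.barplot` draws the mean of `y` for each `x` group, so the bar heights above correspond to a per-category mean price. A small sketch with made-up rows (hypothetical categories and prices) shows how a ranking by mean price can differ from a ranking by total value:

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "category": ["entertainment"] * 2 + ["mobiles & tablets"] * 5,
    "price": [40000, 60000, 500, 1500, 28000, 30000, 60000],
})

# Equivalent to the bar heights of sns.barplot(x="category", y="price", data=toy)
mean_price = toy.groupby("category")["price"].mean().sort_values(ascending=False)

# Total value per category, which also depends on how many rows each category has
total_price = toy.groupby("category")["price"].sum()

print(mean_price)
print(total_price)
```

In this toy data entertainment has the higher mean price while mobiles and tablets has the higher total, the same kind of split that Question 4 and Question 6 report.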

## Question 5: In which month were sales highest?


In [None]:
df['sales'] = df['qty_ordered'] * df['price']

In [None]:
results = df.groupby('month').sum(numeric_only=True)  # sum the numeric columns for each month
results

In [None]:
months = range(1,13) # 13 is excluded

plt.bar(months, results['sales'])  # months on the x-axis, total sales on the y-axis
plt.xticks(months)  # ticks are showing here
plt.ylabel('Sales in Rupees')
plt.xlabel('Month Number')
plt.show()

### Answer: Sales were highest in the 11th month (November) across 2016-2018
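
The peak month can also be read off programmatically rather than from the chart. The sketch below uses invented rows (not the dataset) and reindexes the monthly totals to all twelve months, since a bar chart built against `range(1, 13)` assumes every month is present.

```python
import pandas as pd

# Hypothetical orders, for illustration only.
toy = pd.DataFrame({
    "month": [3, 3, 11, 11, 11, 12],
    "price": [900, 1200, 25000, 700, 1500, 300],
    "qty_ordered": [1, 2, 1, 3, 1, 1],
})
toy["sales"] = toy["qty_ordered"] * toy["price"]

monthly_sales = toy.groupby("month")["sales"].sum()
# Months without any orders appear as 0 instead of being dropped.
monthly_sales = monthly_sales.reindex(range(1, 13), fill_value=0)

best_month = monthly_sales.idxmax()
print(best_month, monthly_sales.loc[best_month])
```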

## Question 6: Which category generates the most income?

In [None]:
cat=df[["category", "grand_total"]].groupby(['category'], as_index=False).sum().sort_values(by='grand_total', ascending=False)

plt.figure(figsize=(25,8))

sns.barplot(x='category', y='grand_total', data=cat)

plt.show()

### Answer: Although entertainment is the most profitable category in Question 4, the category that generated the most income is mobiles and tablets. In my view, this is because mobiles and tablets also includes accessories such as chargers and earpods.

## Question 7: Visualize category and order status frequency

In [None]:
pd.crosstab(df.category, df.status)

In [None]:
ax = sns.countplot(x="category", hue="status", data=df)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')  # use this plot's own tick labels
plt.show()

### Answer: The category with the most completed orders is men's fashion. The category with the most canceled orders is mobiles and tablets, and the category with the most refunded orders is men's fashion.
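
Raw counts favour categories that simply have more orders. As a complementary view, `pd.crosstab` can normalize each row so that statuses are compared as shares within a category. The rows below are invented for illustration.

```python
import pandas as pd

# Hypothetical orders, for illustration only.
toy = pd.DataFrame({
    "category": ["men's fashion"] * 6 + ["mobiles & tablets"] * 4,
    "status": ["complete"] * 4 + ["canceled", "order_refunded"]
              + ["complete", "canceled", "canceled", "canceled"],
})

counts = pd.crosstab(toy["category"], toy["status"])
# normalize="index" divides each row by its total, giving per-category shares
shares = pd.crosstab(toy["category"], toy["status"], normalize="index").round(2)

print(counts)
print(shares)
```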

# 3. Correlation Analysis:

In [None]:
sns.heatmap(np.round(df.corr(numeric_only=True), 2), annot=True)  # correlate numeric columns only

### Analysis: Price shows a positive relationship with sales and year, and a negative relationship with month

## Question 8: Find a correlation between payment method and order status

In [None]:
corr_dataset = df[['status','payment_method','created_at','category']]

In [None]:
corr_ = corr_dataset.apply(lambda x : pd.factorize(x)[0]).corr(method='pearson', min_periods=1)

In [None]:
sns.heatmap(np.round(corr_, 2), annot=True)  # corr_ is already the correlation matrix

### Answer: With this factorized encoding, payment method and order status have a negative relationship with each other.
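
One caveat when reading this result: `pd.factorize` assigns integer codes in order of first appearance, so for unordered labels the sign of a Pearson coefficient depends on that arbitrary ordering. The toy example below (invented rows) shows the same association coming out as +1 or -1 depending on how one column is coded.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "status": ["complete", "complete", "canceled", "canceled"],
    "payment_method": ["cod", "cod", "Easypay", "Easypay"],
})

# factorize assigns integer codes in order of first appearance
codes = toy.apply(lambda col: pd.factorize(col)[0])
r_original = codes["status"].corr(codes["payment_method"])

# Recode one column with the opposite (equally arbitrary) label order
flipped = codes.assign(payment_method=1 - codes["payment_method"])
r_flipped = codes["status"].corr(flipped["payment_method"])

print(r_original, r_flipped)
```

The magnitude of the coefficient is therefore more informative than its sign here, and with more than two labels per column even the magnitude depends on the coding.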

## Question 9: Find a correlation between order date and item category

### Answer: Category and order date (created_at) are unrelated to each other.

## Credits:
https://www.kaggle.com/agrawaladitya/step-by-step-data-preprocessing-eda

https://towardsdatascience.com/exploratory-data-analysis-using-spermarket-sales-data-in-python-e99d329a07fc