# Hello Guys! Please have a look at this and enjoy!

* This notebook analyzes a dataset from an online retail store.
* First, I load the dataset and clean its columns.
* Then I create some visuals and some new columns to analyze the data.
* Next, I fit a KMeans clustering model (unsupervised machine learning).
* After clustering, I use the Elbow Method to check whether the number of clusters I chose could be improved.
* Finally, I create a Plotly map of the countries by customer revenue!
* UPVOTE IF YOU LIKE THE WORK! THANKS!!!

# Importing Libraries and Loading Data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

In [None]:
df = pd.read_csv("../input/online-retail-customer-clustering/OnlineRetail.csv", delimiter=',', encoding = "ISO-8859-1")

# Observing Datatypes, Columns and Rows

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

# Data Cleaning: Checking for Null Values

In [None]:
msno.bar(df)

* The chart shows that CustomerID has missing values.
* We will first check approximately how many rows are missing this field.
* Then we will delete those rows.
* The analysis groups the data by customer, so rows without a CustomerID cannot be used and must be removed.

In [None]:
df.count()

In [None]:
df[df['CustomerID'].isnull()].count()

In [None]:
# Percentage of rows missing CustomerID: roughly 135,000 of the 541,909 rows
100 - ((541909-135000)/541909 * 100)

* Approximately 25% of the rows are missing a CustomerID.
* We will now proceed with dropping those rows.
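The same check can be sketched on a tiny made-up frame. This is only an illustration: the `toy` data, `missing_pct`, and `toy_clean` are names I am assuming here, not part of the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the retail data; values are made up for illustration.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367", "536368"],
    "CustomerID": [17850.0, np.nan, 13047.0, 13047.0],
})

# isnull() gives a boolean Series; its mean is the fraction of missing values.
missing_pct = toy["CustomerID"].isnull().mean() * 100
print(f"{missing_pct:.1f}% of rows have no CustomerID")

# Drop only the rows that lack a CustomerID, leaving nulls in other columns alone.
toy_clean = toy.dropna(subset=["CustomerID"])
```

Using `subset=` is one possible choice: it removes rows only because of the column the analysis actually depends on.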

In [None]:
df.dropna(inplace=True)

In [None]:
msno.bar(df)

* The bar chart above now shows all bars at the same height,
* meaning that no column has missing values and the data is ready to be used in the model.
* Next, we will change the data type of InvoiceDate to datetime.
* We will also create three new fields:

1. Amount
2. Number of Transactions
3. Last Transaction (how many days ago the customer's last transaction was)
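As a rough sketch of what these three fields look like, here they are computed in a single `groupby` on made-up data. The `toy` frame and the output column names (`total_amount`, `n_rows`, `n_invoices`, `days_since_last`) are assumptions for illustration; the notebook itself builds the fields step by step in the cells that follow.

```python
import pandas as pd

# Toy transactions; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2],
    "InvoiceNo": ["A1", "A2", "B1", "B1", "B2"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-12-05", "2011-11-01", "2011-11-01", "2011-12-09"]
    ),
    "Quantity": [2, 1, 5, 3, 1],
    "UnitPrice": [10.0, 4.0, 1.0, 2.0, 7.5],
})
toy["Amount"] = toy["Quantity"] * toy["UnitPrice"]

# Reference date: the most recent transaction in the data.
latest = toy["InvoiceDate"].max()

per_customer = toy.groupby("CustomerID").agg(
    total_amount=("Amount", "sum"),
    n_rows=("InvoiceNo", "count"),        # one row per line item
    n_invoices=("InvoiceNo", "nunique"),  # distinct invoices
    last_purchase=("InvoiceDate", "max"),
).reset_index()

# Days since each customer's most recent purchase.
per_customer["days_since_last"] = (latest - per_customer["last_purchase"]).dt.days
print(per_customer)
```

Note that `count` and `nunique` can differ when one invoice spans several rows, so it is worth deciding which of the two "number of transactions" should mean.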

In [None]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%d-%m-%Y %H:%M')

# Amount

In [None]:
df['Total Amount Spent']= df['Quantity'] * df['UnitPrice']

total_amount = df['Total Amount Spent'].groupby(df['CustomerID']).sum()
total_amount = pd.DataFrame(total_amount).reset_index()
total_amount.head()

# Number of Transactions

In [None]:
# count() tallies invoice rows (one per line item), not distinct invoices
transactions = df['InvoiceNo'].groupby(df['CustomerID']).count()
transaction = pd.DataFrame(transactions).reset_index()
transaction.head()

# Last Transaction (LT)

In [None]:
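# Time elapsed between each transaction and the latest transaction in the data
final = df['InvoiceDate'].max()
df['Last_transact'] = final - df['InvoiceDate']
# The smallest gap per customer corresponds to their most recent purchase
LT = df.groupby('CustomerID')['Last_transact'].min()
LT = pd.DataFrame(LT).reset_index()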
LT.head()

# Merging All the newly created columns

In [None]:
df_new = pd.merge(total_amount, transaction, how='inner', on='CustomerID')

In [None]:
df_new = pd.merge(df_new, LT, how='inner', on='CustomerID')

The Last_transact column still needs to be simplified to just the number of days.

In [None]:
df_new.head()

In [None]:
df_new['Last_transact'] = df_new['Last_transact'].dt.days

In [None]:
df_new.head()

# K Means Clustering Model

In [None]:
from sklearn.cluster import KMeans

In [None]:
# Fit KMeans with two clusters on the three per-customer features
features = ['Total Amount Spent', 'InvoiceNo', 'Last_transact']
kmeans = KMeans(n_clusters=2)
# fit_predict fits the model and returns each customer's cluster label
pred = kmeans.fit_predict(df_new[features])

In [None]:
kmeans.cluster_centers_

In [None]:
kmeans.labels_

In [None]:
pred = pd.DataFrame(pred, columns=['pred'])
df_new = df_new.join(pred)

# Clusters are created!

* We will now visualize them using different plots.
* KMeans clustering is an unsupervised learning model, so we do not have ground-truth labels to compare its output against.
* However, we will visualize the clusters and try to infer how the model segmented the customers.
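Besides plots, a quick per-cluster summary table can help with this kind of interpretation. Here is a minimal sketch on made-up values; the `toy` frame and the `customers` column are assumptions for illustration, with column names borrowed from the merged table above.

```python
import pandas as pd

# Toy per-customer table with a cluster label; values are made up.
toy = pd.DataFrame({
    "Total Amount Spent": [100.0, 120.0, 5000.0, 5200.0],
    "InvoiceNo": [5, 7, 200, 220],
    "Last_transact": [90, 110, 3, 5],
    "pred": [0, 0, 1, 1],
})

# Mean of every feature per cluster, plus the number of customers in each.
summary = toy.groupby("pred").mean()
summary["customers"] = toy.groupby("pred").size()
print(summary)
```

Comparing the rows of such a table shows which features differ most between clusters, which is a hint (not proof) of what drove the split.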

In [None]:
fig, ax =plt.subplots(nrows= 1, ncols = 3, figsize= (14,6))
ty=sns.stripplot(x='pred', y='Total Amount Spent', data=df_new, s=8, ax = ax[0], palette='magma_r')
sns.despine(left=True)
ty.set_title('Clusters based on different Amounts')
ty.set_ylabel('Total Spent')
ty.set_xlabel('Clusters')

tt=sns.boxplot(x='pred', y='InvoiceNo', data=df_new, ax = ax[1], palette='coolwarm_r')
tt.set_title('Clusters based on Number of Transactions')
tt.set_ylabel('Total Transactions')
tt.set_xlabel('Clusters')

tr=sns.boxplot(x='pred', y='Last_transact', data=df_new, ax = ax[2], palette='magma_r')
tr.set_title('Clusters based on Last Transaction')
tr.set_ylabel('Last Transactions (Days ago)')
tr.set_xlabel('Clusters')

In [None]:
sns.pairplot(hue='pred', data=df_new, diag_kind='kde', palette='magma')

# Elbow Method to decide how many clusters to choose!
* It is difficult to know the right number of clusters in advance.
* The Elbow Method helps: fit KMeans for a range of cluster counts and record the error (inertia, the within-cluster sum of squared distances) for each.
* Afterwards we may increase or reduce the number of clusters based on how much the error drops with each additional cluster!
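To make the idea concrete, here is a small sketch on synthetic data with three obvious groups. Everything in it (the `X` array, the range of `k`, `n_init=10`, `random_state=0`) is an assumption chosen for illustration, not the settings used on the retail data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: three well-separated groups (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(100, 2)) for c in centers])

ks = list(range(1, 7))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared distances to the centroids.
    inertias.append(model.inertia_)

# Relative drop in inertia gained by each extra cluster; the elbow is
# where this drop becomes small.
drops = [1 - inertias[i + 1] / inertias[i] for i in range(len(inertias) - 1)]
for k, d in zip(ks[1:], drops):
    print(f"k={k}: inertia falls by {d:.0%}")
```

On data like this the drop is large up to three clusters and small afterwards, which is the "elbow" we look for in the plot.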


In [None]:
kmeans.inertia_

In [None]:
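# Fit on the three customer features only; df_new also holds CustomerID and the 'pred' labels
features = ['Total Amount Spent', 'InvoiceNo', 'Last_transact']
error_rate = []
for clusters in range(1,16):
    kmeans = KMeans(n_clusters = clusters)
    kmeans.fit(df_new[features])
    # inertia_ is the within-cluster sum of squared distances
    error_rate.append(kmeans.inertia_)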

In [None]:
error_rate = pd.DataFrame({'Cluster':range(1,16) , 'Error':error_rate})

In [None]:
error_rate

In [None]:
plt.figure(figsize=(12,8))
p = sns.barplot(x='Cluster', y= 'Error', data= error_rate, palette='coolwarm_r')
sns.despine(left=True)
p.set_title('Error Rate and Clusters')


# Country-wise Analysis

In [None]:
# Sum only the numeric columns; datetime and text columns cannot be summed
country_wise = df.groupby('Country').sum(numeric_only=True)

In [None]:
country_codes = pd.read_csv('../input/iso-country-codes-global/wikipedia-iso-country-codes.csv', names=['Country', 'two', 'three', 'numeric', 'ISO'])

The country codes dataset is taken from Wikipedia.

In [None]:
country_codes.head()

In [None]:
country_wise = pd.merge(country_codes,country_wise, on='Country')
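The merge above keeps only countries whose names match exactly in both tables, so any country spelled differently is silently dropped from the map. One way to check for this is a left merge with `indicator=True`. A minimal sketch on made-up tables follows; the `revenue` and `codes` frames, their values, and the "EIRE" versus "Ireland" mismatch are assumptions for illustration only.

```python
import pandas as pd

# Toy tables; names and revenue figures are made up for illustration.
revenue = pd.DataFrame({
    "Country": ["France", "Germany", "EIRE"],
    "Total Amount Spent": [200.0, 220.0, 260.0],
})
codes = pd.DataFrame({
    "Country": ["France", "Germany", "Ireland"],
    "three": ["FRA", "DEU", "IRL"],
})

# A left merge with indicator=True adds a `_merge` column that records
# whether each revenue row found a matching country code.
checked = revenue.merge(codes, on="Country", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Country"].tolist()
print("No ISO code found for:", unmatched)
```

Any names that show up in `unmatched` could then be renamed before merging so they are not lost.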

In [None]:
country_wise.head()

In [None]:
from plotly import __version__
import cufflinks as cf
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
cf.go_offline()
import plotly.graph_objs as go

In [None]:
data = dict(type='choropleth',colorscale='GnBu', locations = country_wise['three'], locationmode = 'ISO-3', z= country_wise['Total Amount Spent'], text = country_wise['Country'], colorbar={'title':'Revenue'},  marker = dict(line=dict(width=0))) 
layout = dict(title = 'European Countries According to Revenue!', geo = dict(scope='europe',showlakes=False, projection = {'type': 'winkel tripel'}))
Choromaps2 = go.Figure(data=[data], layout=layout)
iplot(Choromaps2)


In [None]:
data = dict(type='choropleth',colorscale='rainbow', locations = country_wise['three'], locationmode = 'ISO-3', z= country_wise['Total Amount Spent'], text = country_wise['Country'], colorbar={'title':'Revenue'},  marker = dict(line=dict(width=0))) 
layout = dict(title = 'All Countries According to Revenue!', geo = dict(scope='world',showlakes=False, projection = {'type': 'winkel tripel'}))
Choromaps2 = go.Figure(data=[data], layout=layout)
iplot(Choromaps2)



# Thank you! Please Upvote if you liked it :)

<img src="https://media.giphy.com/media/j6ZReIODqJXh5sPLVq/giphy.gif">