# Customer Segmentation for car sharing
<table><tr><td><img src="Images/segment.png"></td></tr></table>

## TASK DESCRIPTION
1. Based on this data, please create a k-cluster model to analyse the relationship between paid spend / hour and total number of bookings completed.
2. How would the above analysis impact your pricing / promo strategy for a) new users b) existing users c) churned users?

*(You will have to cleanse the data yourself and remove outliers.)*

In [None]:
#!pip install pandas-profiling

In [None]:
#Importing Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display
import pandas_profiling
from sklearn.cluster import KMeans
from sklearn import preprocessing

In [None]:
#Loading the dataset in a Pandas DataFrame
Df= pd.read_csv("ClusteringData.csv")

In [None]:
display(Df.head())

There seem to be some missing values in the data. I will explore it further to get a clearer picture.

In [None]:
display(Df.tail())

In [None]:
Df.shape

In [None]:
Df.info()

In [None]:
Df.describe()

#### Utilizing the pandas Profiling Package for a complete report on the Data

In [None]:
Df.profile_report()

- **Create Promo used Column**
- **Handle missing values**
- **Handle outliers**
- **Cluster Analysis**

### Key TakeAways from the Pandas Profiling of the Data

The pandas-profiling report shows that two variables, `promospend` and `paidspend`, have missing values. These need to be handled before we move ahead.

`promospend` has 4492 missing values. These are customers who did not use any promo and paid the full amount, so the missing values can be imputed with 0, meaning the amount spent through promos was zero. Before imputing, I create a new categorical feature that records whether a customer used a promo. This feature is not used in the K-means clustering, since K-means works on Euclidean distances and a categorical code could distort them. `paidspend` is missing only when the customer paid entirely with promos, so it can be imputed with 0 as well. A value of 0 in `paidspend/hour` means either that the customer booked using promos only, or that the customer never drove for a full hour, which is why hours booked is 0 in those cases: the bookings were only for short periods.

- **Missing values:** Two variables, `paidspend` and `promospend`, have missing values.
- **Outliers:** Two variables, `paidspend` and `paidspend/hour`, have about 5 outliers, all for the same booking ID, where the customer apparently spent RM 12,000 on a single booking and `paidspend/hour` is either 861 or 370. These 5 outliers may reflect scenarios where the customer was given a monetary penalty. Keeping them in the dataset would distort the analysis, so I am going to drop these 5 observations.
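
As a complement to reading outliers off the profiling report, the following is a minimal sketch of the common 1.5 × IQR rule. It is a swapped-in technique, not the method used in this notebook, which drops five hand-picked values. The toy values and the helper name `iqr_outlier_mask` are hypothetical. On the real data this rule would probably flag more rows than the five dropped here, so its output should be reviewed rather than applied blindly.

```python
import pandas as pd

# Hypothetical toy values standing in for the 'paidspend/hour' column
toy = pd.DataFrame({"paidspend/hour": [5.0, 6.5, 7.2, 8.0, 9.1, 10.4, 867.84]})


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


mask = iqr_outlier_mask(toy["paidspend/hour"])
flagged = toy[mask]  # rows the rule would treat as outliers
print(flagged)
```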

In [None]:
# Visualise the outliers mentioned in the discussion above
fig, ax = plt.subplots(figsize=(16,8))
sns.boxplot(x=Df['paidspend/hour'],ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.boxplot(x=Df['paidspend'],ax=ax)

Even though `paidspend` has many other extreme values as well, I will keep them because they seem plausible. I only drop the extreme values in `paidspend/hour`, as these look like genuine outliers.

**Plotting scatter plots for a clearer picture of how spend relates to bookings**

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(Df['paidspend/hour'], Df['bookings'])
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(Df['paidspend'], Df['bookings'])
plt.show()

### Dropping Outliers in PaidSpend/Hour

In [None]:
#Handling Outliers
list_of_values = [867.84,  341.32, 290.03, 243.02, 213.06]
Df.drop(Df.loc[Df['paidspend/hour'].isin(list_of_values)].index, inplace=True)

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.boxplot(x=Df['paidspend/hour'],ax=ax)

**We can see that the single extreme value of 800+ is no longer there. The remaining values up to about 220 are not treated as outliers, as they belong to customers with more than one booking.**
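
The drop above matches floating-point values exactly with `isin`, which works when the CSV stores exactly those two-decimal numbers but can silently miss a value that differs by a rounding error. Below is a minimal sketch of a tolerance-based alternative using `np.isclose`. The toy frame and the tolerance `atol=1e-6` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second value carries a tiny floating-point error
toy = pd.DataFrame({"paidspend/hour": [12.5, 867.8400000001, 40.0, 341.32]})
values_to_drop = [867.84, 341.32]

# Exact matching only catches values that are bit-for-bit identical
exact_mask = toy["paidspend/hour"].isin(values_to_drop)

# Tolerance-based matching: compare every row against every target value
col = toy["paidspend/hour"].to_numpy()
close_mask = np.isclose(col[:, None], values_to_drop, rtol=0, atol=1e-6).any(axis=1)

cleaned = toy.loc[~close_mask].reset_index(drop=True)
print(cleaned)
```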

### Feature Engineering
The missing values in `paidspend` and `promospend` indicate that the customer either used only promos or did not use any promos, so dropping these rows is not an option. Instead, I will impute them with zeros and create a new categorical variable that records whether the customer used no promo, used a promo alongside payment, or used promos only.

In [None]:
# Feature engineering
# Flag whether each customer used a promo ('YES') or not ('NO')
Df['UsedPromo'] = np.where(Df['promospend'].isna(), 'NO','YES')
# Customers with no paid spend booked with promos only
Df['UsedPromo'] = np.where(Df['paidspend'].isna(), 'OnlyPromo',Df['UsedPromo'])
Df['UsedPromo']= Df['UsedPromo'].astype('category')
# UsedPromo now indicates whether a customer used promos only, used a promo while also paying, or did not use a promo at all

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
Df.UsedPromo.value_counts().plot.bar(ax=ax)

**We can see that a small group, on the order of a couple of hundred customers, booked rides with promos only and did not spend any money of their own.**

In [None]:
from scipy import stats
fig, ax = plt.subplots(figsize=(16,8))
sns.distplot(Df['hoursbooked'], kde=False, fit=stats.norm,ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.boxplot(x=Df['bookings'],ax=ax)

**The bookings feature also has some extreme values, but these seem plausible: they may indicate customers who use SOCAR services regularly.**

In [None]:
Df.dtypes

**Encoding the Used Promo Feature to numeric using label encoding**

In [None]:
Df['UsedPromo']=Df['UsedPromo'].cat.codes

In [None]:
Df.UsedPromo.value_counts()
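
Once `UsedPromo` is replaced by its integer codes, the counts above no longer show which number stands for which label. pandas assigns codes in the sorted order of the categories, so a small sketch like the following can record the mapping before the column is overwritten. The toy frame and the extra column `UsedPromoCode` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: no promo, promo plus payment, promo only
toy = pd.DataFrame({
    "promospend": [np.nan, 10.0, 25.0],
    "paidspend": [50.0, 30.0, np.nan],
})
toy["UsedPromo"] = np.where(toy["promospend"].isna(), "NO", "YES")
toy["UsedPromo"] = np.where(toy["paidspend"].isna(), "OnlyPromo", toy["UsedPromo"])
toy["UsedPromo"] = toy["UsedPromo"].astype("category")

# Capture code -> label before encoding; categories are stored in sorted order
code_to_label = dict(enumerate(toy["UsedPromo"].cat.categories))
toy["UsedPromoCode"] = toy["UsedPromo"].cat.codes

print(code_to_label)
print(toy)
```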

In [None]:
Df['promospend']=Df['promospend'].fillna(0)
Df['paidspend']=Df['paidspend'].fillna(0)
Df['paidspend/hour']=Df['paidspend/hour'].fillna(0)
Df.promospend.isna().sum()

**Plotting all variables against each other using a Seaborn pair plot, colored by UsedPromo**

In [None]:
sns.pairplot(Df,hue='UsedPromo')

In [None]:
Df.head()

#### Selecting Features for the clustering Algorithm
- **Since the Task requires us to cluster based on the bookings and paidspend/hour we will use these two features**

In [None]:
#Selecting the features for clustering
X=Df[['bookings','paidspend/hour']]

In [None]:
X.head()
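
K-means groups points by Euclidean distance, so a feature with a larger numeric range can dominate the result. This notebook clusters on the raw values. As a hedged alternative, the sketch below standardizes the two features with scikit-learn's `StandardScaler` first. The toy values are hypothetical, and clustering on scaled data would change the clusters, so the takeaways further down would need to be re-checked.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values with the same column names as X
toy_X = pd.DataFrame({
    "bookings": [1, 2, 3, 25, 40],
    "paidspend/hour": [5.0, 12.0, 8.5, 9.0, 60.0],
})

# Rescale each column to zero mean and unit variance
scaler = StandardScaler()
X_scaled = pd.DataFrame(
    scaler.fit_transform(toy_X),
    columns=toy_X.columns,
    index=toy_X.index,
)
print(X_scaled.round(2))
```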

## Task 1 Clustering the Data to see the relation between Bookings and Paidspend/hour
#### Using K Means Clustering 
To select the value of K, I will use the elbow method: fit K-means for a range of K values and look for the point where the inertia (within-cluster sum of squares) stops dropping sharply.

In [None]:
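# Elbow method: plot the K-means inertia for k = 1..10
distortions = []
k_values = range(1,11)
fig, ax = plt.subplots(figsize=(16,8))
for k in k_values:
    # Fixed (arbitrary) random_state so repeated runs give the same curve
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X)
    distortions.append(model.inertia_)

ax.plot(k_values, distortions, marker='o')
ax.set_xlabel('Number of clusters (k)')
ax.set_ylabel('Inertia')
plt.show()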

- **The Elbow method shows that there are 4 Relevant Clusters for this data**
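
Reading an elbow off a curve is somewhat subjective. One way to cross-check the choice of K is the silhouette score, sketched below on synthetic data. This is a swapped-in check, not part of the original analysis. The blob centers and the seed are assumptions chosen so the example has a clear answer. On the real data, `toy_X` would be replaced by `X`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: four well-separated groups (hypothetical)
toy_X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10], [10, -10]],
    cluster_std=1.0,
    random_state=0,
)

# Silhouette needs at least 2 clusters, so start at k = 2
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_X)
    scores[k] = silhouette_score(toy_X, labels)

# Higher silhouette means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```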

Creating the clusters and mapping them to a new feature in the DataFrame

In [None]:
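#Creating Clusters as per the Elbow method
# A fixed (arbitrary) random_state keeps the cluster numbering the same between runs
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
Df['Cluster']=kmeans.fit_predict(X)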

In [None]:
Df.head()

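- **To view the 4 clusters we plot them in a 2-dimensional space. For this I will use principal component analysis (PCA), which projects the features onto two components. As X has only two features, PCA here rotates and rescales the existing axes rather than reducing dimensionality.**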

In [None]:
#Importing the Relevant Package from Sklearn
from sklearn.decomposition import PCA

In [None]:
# Visualizing the clusters using Principal Component Analysis
pca=PCA(n_components=2,whiten=True)
fig, ax = plt.subplots(figsize=(16,8))
# Fit once and reuse the projected coordinates for both axes
components = pca.fit_transform(X)
Df['x']=components[:, 0]
Df['y']=components[:, 1]
ax.scatter(Df['x'],Df['y'],c=Df['Cluster'],s=60)
plt.show()
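
Because the clustering uses only two features, a two-component PCA keeps all of the variance and, without whitening, simply re-expresses the same points in rotated axes. The sketch below checks this on random toy data (the data and seed are hypothetical). With `whiten=True`, as in the cell above, each component is additionally rescaled to unit variance, so distances are no longer preserved.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 100 points, 2 features with different spreads
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(100, 2)) * [5.0, 20.0]

pca = PCA(n_components=2)  # no whitening: centering plus rotation only
coords = pca.fit_transform(toy_X)

# With as many components as features, no variance is discarded
total_variance_kept = pca.explained_variance_ratio_.sum()

# A rotation preserves the distance between any two points
d_orig = np.linalg.norm(toy_X[0] - toy_X[1])
d_pca = np.linalg.norm(coords[0] - coords[1])
print(total_variance_kept, d_orig, d_pca)
```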

- **Based on the K-means clustering, the customers can be segmented into 4 different segments. Each cluster is shown in a different color: white, light gray, dark gray, and black.**

- **To see how these clusters are formed, we group the data by cluster and compare customer behavior with regard to bookings and paidspend/hour.**

In [None]:
Df['Cluster'].value_counts().plot(kind='bar')

In [None]:
# Checking the average behaviour of each of the 4 clusters
# numeric_only=True skips any non-numeric columns instead of raising an error
Df.groupby('Cluster').mean(numeric_only=True)

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
Df['Cluster'].value_counts().plot(kind='bar', ax=ax)

### Key Take Away Points from the Cluster Analysis
- **Cluster 0:** Customers who book 1 to 3 rides on average, for long routes or intervals (judging by their average hours booked, probably inter-city travel). These customers also used promos and spent at least RM 5.62 per hour. The cluster also includes customers who booked a car only once and have not booked since, though this depends entirely on the dataset timeline. Customers who booked only once could be categorized as churners. This is the largest cluster in the dataset.

- **Cluster 1:** Customers who book more than 3 rides on average but for much shorter intervals, probably within the city. These customers pay more per hour, which may indicate that they opt for bigger cars.

- **Cluster 2:** The most loyal customers, who have booked more than 21 rides on average. They spend RM 8 to 10 per hour on average, but they also rent cars for long trips. They also spend the most through promos. This is the third-largest cluster, with more than 700 customers.

- **Cluster 3:** The smallest cluster in the data. These customers have the highest spend per hour but the fewest hours booked. This category may include people who book a car for only a few minutes, possibly less than an hour. They are probably office-goers, students, or leisure users who only use the service for commuting or quick errands.
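
The size and average behaviour of each cluster described above can be tabulated in one step. The following is a minimal sketch using pandas named aggregation. The toy rows and the output column names are hypothetical, and on the real data `toy` would be replaced by `Df`.

```python
import pandas as pd

# Hypothetical toy rows using the notebook's column names
toy = pd.DataFrame({
    "bookings": [1, 2, 3, 25, 30, 4],
    "paidspend/hour": [5.0, 6.0, 7.0, 9.0, 10.0, 20.0],
    "Cluster": [0, 0, 0, 2, 2, 1],
})

# One row per cluster: size plus the mean of each clustering feature
profile = (
    toy.groupby("Cluster")
    .agg(
        customers=("bookings", "size"),
        mean_bookings=("bookings", "mean"),
        mean_paidspend_per_hour=("paidspend/hour", "mean"),
    )
    .sort_values("customers", ascending=False)
)
print(profile)
```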

### Task 2 How would the above analysis impact your pricing / promo strategy for 
- **a) new users**
- **b) existing users**
- **c) churned users**

- **New Users** For new users it is essential to offer a promo where the first couple of hours are free, similar to what Grab did in the beginning to get people on board. The promo should carry a restriction, for example that it can only be used within the city.
- **Existing Users** For existing users the company can offer special promos or loyalty points, where more points earn more rewards. The reward points should be tiered, for example silver and gold, with gold members getting the best promo deals.
- **Churned Users** These users used the service once and did not come back. That does not mean they have deleted the app, so they can be notified by email or message about a special promo. This promo should also have a limitation, perhaps based on ride duration: for example, it could be valid only for a 2-hour ride.
