# Customer Segmentation for a Sports Facility

<div class="alert alert-block alert-info">

[1. Objectives](#1st-bullet)<br>
[2. Import Libraries/Data](#2nd-bullet)<br>
[3. Data Exploration](#3rd-bullet)<br>  
[4. Data Visualization](#4th-bullet)<br> 

</div>

<div class="alert alert-block alert-success">

<a class="anchor" id="1st-bullet">    </a>
## 1. Objectives 
</div>

1. Explore the data and identify the variables that should be used to segment customers.
2. Identify customer segments.
3. Justify the number of clusters chosen (taking the business use into consideration as well).
4. Explain the clusters found.
5. Suggest business applications for the findings and define general marketing approaches for each cluster.


<div class="alert alert-block alert-success">

<a class="anchor" id="2nd-bullet">    </a>
## 2. Import Libraries/Data
</div>

In [3]:
import pandas as pd
import datetime as dt
import numpy as np

In [4]:
data = pd.read_csv('XYZ_sports_dataset.csv', sep =';') 

<div class="alert alert-block alert-success">

<a class="anchor" id="3rd-bullet">    </a>
## 3. Data Exploration 
</div>

In [5]:
data.head()

Unnamed: 0,ID,Age,Gender,Income,EnrollmentStart,EnrollmentFinish,LastPeriodStart,LastPeriodFinish,DateLastVisit,DaysWithoutFrequency,...,OtherActivities,NumberOfFrequencies,AttendedClasses,AllowedWeeklyVisitsBySLA,AllowedNumberOfVisitsBySLA,RealNumberOfVisits,NumberOfRenewals,HasReferences,NumberOfReferences,Dropout
0,10000,60,Female,5500.0,2019-09-03,2019-10-31,2019-07-01,2019-12-31,2019-10-30,1,...,0.0,9.0,7,,6.28,2,0,0.0,0,0
1,10001,29,Female,2630.0,2014-08-12,2015-09-14,2015-01-01,2015-12-31,2015-07-16,60,...,0.0,23.0,1,2.0,17.42,1,2,0.0,0,1
2,10002,23,Male,1980.0,2017-05-02,2017-06-01,2017-01-01,2017-06-30,2017-05-25,7,...,0.0,6.0,0,7.0,30.03,6,0,0.0,0,1
3,10003,9,Male,0.0,2018-09-05,2019-02-12,2018-07-01,2019-06-30,2019-01-21,22,...,0.0,20.0,2,2.0,17.72,3,0,0.0,0,1
4,10004,35,Male,4320.0,2016-04-20,2018-06-07,2018-01-01,2018-06-30,2017-11-09,210,...,,41.0,0,7.0,60.97,0,3,0.0,0,1


The preview already shows that our data contains missing values, encoded as NaN.

#### Data Types:

In [6]:
'Age','Income','DaysWithoutFrequency','LifetimeValue','NumberOfFrequencies', 'AttendedClasses', 'AllowedWeeklyVisitsBySLA', 'AllowedNumberOfVisitsBySLA','RealNumberOfVisits','NumberOfRenewals','NumberOfReferences'
'EnrollmentStart','EnrollmentFinish','LastPeriodStart','LastPeriodFinish','DateLastVisit'

('EnrollmentStart',
 'EnrollmentFinish',
 'LastPeriodStart',
 'LastPeriodFinish',
 'DateLastVisit')

In [7]:
data.dtypes

ID                              int64
Age                             int64
Gender                         object
Income                        float64
EnrollmentStart                object
EnrollmentFinish               object
LastPeriodStart                object
LastPeriodFinish               object
DateLastVisit                  object
DaysWithoutFrequency            int64
LifetimeValue                 float64
UseByTime                       int64
AthleticsActivities           float64
WaterActivities               float64
FitnessActivities             float64
DanceActivities               float64
TeamActivities                float64
RacketActivities              float64
CombatActivities              float64
NatureActivities              float64
SpecialActivities             float64
OtherActivities               float64
NumberOfFrequencies           float64
AttendedClasses                 int64
AllowedWeeklyVisitsBySLA      float64
AllowedNumberOfVisitsBySLA    float64
RealNumberOf

- Date columns are stored as `object` (strings); converting them to a datetime type would make them easier to manipulate and interpret.
- Binary variables are all int/float types. **Should they be converted to boolean?**
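One possible answer to the boolean question, as a minimal sketch on a toy frame: pandas' nullable `boolean` dtype accepts 0/1 values and keeps missing entries as `<NA>` instead of forcing them to `False`. The frame `toy` and its values are invented for illustration, and the sketch assumes the binary columns contain nothing but 0, 1 and NaN.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two binary columns (values are invented).
toy = pd.DataFrame({
    "Dropout": [0, 1, 1],
    "HasReferences": [0.0, np.nan, 1.0],
})

# The nullable "boolean" dtype maps 0/1 to False/True and NaN to <NA>.
toy_bool = toy.astype("boolean")

print(toy_bool.dtypes)
print(toy_bool)
```

Whether this conversion is worthwhile depends on the clustering step: most distance-based algorithms expect numeric input, so the columns may have to be cast back later.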

#### Missing Values:

In [8]:
data.isna().sum()[data.isna().sum()!=0]

Income                      495
AthleticsActivities          36
WaterActivities              37
FitnessActivities            35
DanceActivities              36
TeamActivities               35
RacketActivities             37
CombatActivities             33
NatureActivities             47
SpecialActivities            44
OtherActivities              35
NumberOfFrequencies          26
AllowedWeeklyVisitsBySLA    535
HasReferences                12
dtype: int64

There are some missing values that need to be addressed during the detailed exploration of each variable.
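To judge how serious each gap is, the counts can be put next to their share of all rows. The sketch below does this on a toy frame (invented values); the names `toy` and `summary` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with one incomplete column (values are invented).
toy = pd.DataFrame({
    "Income": [5500.0, np.nan, 1980.0, np.nan],
    "Age": [60, 29, 23, 9],
})

missing_count = toy.isna().sum()
# The mean of a boolean mask is the fraction of True values.
missing_pct = toy.isna().mean().mul(100).round(2)

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
summary = summary[summary["missing"] > 0]
print(summary)
```

Applied to the counts above, the 495 missing `Income` values out of 14942 rows amount to roughly 3.3% of the entries.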

#### Duplicated

In [9]:
data.duplicated().sum()

0

There are no duplicated clients.

#### Descriptive Analysis:

In [10]:
data.describe(include="all").T 

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
ID,14942.0,,,,17470.5,4313.528196,10000.0,13735.25,17470.5,21205.75,24941.0
Age,14942.0,,,,26.015794,14.156582,0.0,19.0,23.0,31.0,87.0
Gender,14942.0,2.0,Female,8931.0,,,,,,,
Income,14447.0,,,,2230.816086,1566.527734,0.0,1470.0,1990.0,2790.0,10890.0
EnrollmentStart,14942.0,1490.0,2015-03-02,92.0,,,,,,,
EnrollmentFinish,14942.0,1300.0,2015-09-16,1684.0,,,,,,,
LastPeriodStart,14942.0,12.0,2019-07-01,3172.0,,,,,,,
LastPeriodFinish,14942.0,11.0,2019-12-31,3694.0,,,,,,,
DateLastVisit,14942.0,1384.0,2019-10-31,475.0,,,,,,,
DaysWithoutFrequency,14942.0,,,,81.224936,144.199576,0.0,13.0,41.0,83.75,1745.0


Looking at the descriptive analysis, some possible problems appear:
- For `Age`, some clients have age 0. The data is also skewed towards younger ages (median 23, maximum 87).
- For `Income`, some clients have value 0. There also seem to be extreme values at the high end (maximum 10890 against a 75th percentile of 2790).
- The low number of unique values (12 and 11) shows that `LastPeriodStart` and `LastPeriodFinish` follow some form of fixed dates.
- `DaysWithoutFrequency`, `LifetimeValue`, `NumberOfFrequencies`, `AttendedClasses`, `AllowedNumberOfVisitsBySLA` and `RealNumberOfVisits` also seem to have extremely high values.
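One common way to make "extreme values" concrete is the 1.5 × IQR rule. This is only a sketch of that rule on invented numbers, not necessarily the criterion used later in this analysis; the helper `iqr_outlier_mask` is a hypothetical name.

```python
import pandas as pd

# Invented values that mimic a long right tail.
toy = pd.DataFrame({"DaysWithoutFrequency": [0, 13, 41, 60, 83, 90, 1745]})


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


outliers = iqr_outlier_mask(toy["DaysWithoutFrequency"])
print(toy[outliers])
```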

### Closer Look At Features:

#### ID:

In [11]:
data['ID'].value_counts()

ID
10000    1
19966    1
19954    1
19955    1
19956    1
        ..
14984    1
14985    1
14986    1
14987    1
24941    1
Name: count, Length: 14942, dtype: int64

All IDs are unique, so we can set `ID` as the index:

In [12]:
data.set_index('ID', inplace=True) 

In [13]:
data.duplicated().sum()

1

We now have one duplicated entry that needs to be removed:

In [14]:
data.drop_duplicates(inplace=True)
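The duplicate only shows up after `ID` becomes the index because `duplicated()` compares column values and ignores the index. A toy example (invented rows) illustrates the effect:

```python
import pandas as pd

# Two clients share every attribute except their ID (invented rows).
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Age": [20, 20, 35],
    "Gender": ["Male", "Male", "Female"],
})

# With ID as a regular column, every row is distinct.
dup_with_id = int(toy.duplicated().sum())

# With ID moved to the index, rows 1 and 2 become identical.
dup_without_id = int(toy.set_index("ID").duplicated().sum())

print(dup_with_id, dup_without_id)
```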

#### Age:

In [15]:
np.sort(data['Age'].unique())

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87])

- All age values seem normal for a sports facility.
- Special attention needs to be paid to babies (0-3) and other children under 16.

#### Income:

Children under 16 should have no income:

In [16]:
data[(data['Age'] < 16) & (data['Income'] != 0)]


ID,Age,Gender,Income,EnrollmentStart,EnrollmentFinish,LastPeriodStart,LastPeriodFinish,DateLastVisit,DaysWithoutFrequency,LifetimeValue,...,OtherActivities,NumberOfFrequencies,AttendedClasses,AllowedWeeklyVisitsBySLA,AllowedNumberOfVisitsBySLA,RealNumberOfVisits,NumberOfRenewals,HasReferences,NumberOfReferences,Dropout
10076,9,Male,,2017-09-16,2017-09-16,2019-07-01,2019-12-31,2019-10-26,5,708.20,...,0.0,64.0,64,2.0,17.42,6,2,0.0,0,0
10224,7,Male,,2016-04-20,2018-11-11,2018-07-01,2018-12-31,2018-10-25,17,836.60,...,0.0,107.0,91,2.0,11.72,6,3,0.0,0,1
10226,10,Female,,2016-11-14,2016-11-14,2019-07-01,2019-12-31,2019-10-26,5,1331.55,...,0.0,65.0,47,1.0,8.71,4,3,0.0,0,0
10261,3,Male,,2017-09-07,2017-09-07,2019-07-01,2019-12-31,2019-10-19,12,1066.40,...,0.0,78.0,62,2.0,17.42,5,2,0.0,0,0
10295,5,Female,,2015-03-05,2015-03-05,2019-07-01,2019-12-31,2019-09-07,54,286.30,...,0.0,15.0,7,1.0,3.14,0,5,0.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24804,4,Female,,2016-03-05,2016-03-05,2019-07-01,2019-12-31,2019-10-26,5,1656.30,...,0.0,95.0,88,2.0,9.72,3,4,0.0,0,0
24830,2,Male,2750.0,2018-07-26,2018-07-26,2019-07-01,2019-12-31,2019-10-19,12,773.32,...,0.0,27.0,20,2.0,17.42,8,2,0.0,0,0
24836,5,Male,,2018-07-02,2018-07-02,2019-07-01,2019-12-31,2019-10-30,1,654.60,...,0.0,20.0,20,2.0,15.42,9,2,0.0,0,0
24874,15,Male,,2015-11-02,2016-07-31,2016-01-01,2016-12-31,2016-05-30,62,353.60,...,0.0,20.0,17,1.0,8.71,0,0,0.0,0,1


There are 360 entries that need to have income set as 0:

In [17]:
data.loc[data["Age"] < 16 , "Income"] = 0
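Note that the filter `Income != 0` also selects missing incomes, since `NaN != 0` evaluates to `True`. The toy sketch below (invented rows, hypothetical IDs 1 to 4) shows the combined mask and the overwrite in one place:

```python
import numpy as np
import pandas as pd

# Invented rows: three children and one adult.
toy = pd.DataFrame(
    {"Age": [9, 7, 2, 35], "Income": [np.nan, 0.0, 2750.0, 4320.0]},
    index=pd.Index([1, 2, 3, 4], name="ID"),
)

under_16 = toy["Age"] < 16
nonzero_income = toy["Income"] != 0  # True for NaN as well as for 2750.0

# Rows that need fixing: a missing income and a positive income.
n_to_fix = int((under_16 & nonzero_income).sum())

# Overwrite income for every client under 16, NaN included.
toy.loc[under_16, "Income"] = 0
print(n_to_fix)
print(toy)
```

A side effect of this step is that part of the missing `Income` values is imputed here, so fewer remain to be handled later.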

#### Gender:

In [18]:
data['Gender'].unique()

array(['Female', 'Male'], dtype=object)

Values for `Gender` are normal.

#### EnrollmentStart / EnrollmentFinish / LastPeriodStart / LastPeriodFinish / DateLastVisit:

Dates should be in datetime format:

In [19]:
data['EnrollmentStart'] = pd.to_datetime(data['EnrollmentStart'])
data['EnrollmentFinish'] = pd.to_datetime(data['EnrollmentFinish'])

data['LastPeriodStart'] = pd.to_datetime(data['LastPeriodStart'])
data['LastPeriodFinish'] = pd.to_datetime(data['LastPeriodFinish'])

data['DateLastVisit'] = pd.to_datetime(data['DateLastVisit'])
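The same conversion can be written as a loop, which also makes it easy to check for strings that fail to parse. This is a sketch on a toy frame; it assumes the `YYYY-MM-DD` format seen in the preview, and the deliberately broken value is invented.

```python
import pandas as pd

date_cols = ["EnrollmentStart", "EnrollmentFinish"]

# Toy frame with one unparseable entry (invented).
toy = pd.DataFrame({
    "EnrollmentStart": ["2019-09-03", "2014-08-12"],
    "EnrollmentFinish": ["2019-10-31", "not a date"],
})

for col in date_cols:
    # errors="coerce" turns unparseable strings into NaT instead of raising.
    toy[col] = pd.to_datetime(toy[col], format="%Y-%m-%d", errors="coerce")

# Number of values per column that could not be parsed.
n_unparsed = toy[date_cols].isna().sum()
print(toy.dtypes)
print(n_unparsed)
```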

Some entries have enrollment start equal to enrollment finish:

In [20]:
data[data['EnrollmentStart'] == data['EnrollmentFinish']]['Dropout'].value_counts()

Dropout
0    2422
Name: count, dtype: int64

All clients in this situation have 'Dropout' status set to 0.

In [21]:
data[data['EnrollmentStart'] == data['EnrollmentFinish']]['LastPeriodFinish'].value_counts()

LastPeriodFinish
2019-12-31    2038
2019-06-30     186
2018-12-31     117
2017-12-31      22
2016-12-31      20
2018-06-30      20
2017-06-30       9
2016-06-30       5
2015-06-30       3
2015-12-31       2
Name: count, dtype: int64

Most cases seem to be clients with currently active contracts (finishing on '2019-12-31'). \
We need to take a deeper look at the other cases:

#### Members who have not been active in the current period (from 2019-07-01 until 2019-12-31):

Checking clients who have not been to the facility in the current period:

In [22]:
data.loc[data['DateLastVisit']< '2019-06-30', 'Dropout'].value_counts()

Dropout
1    11335
0      324
Name: count, dtype: int64

In [23]:
data[
    (data['DateLastVisit'] < '2019-06-30') &
    (data['Dropout'] == 0) &
    (data['EnrollmentFinish'] != data['EnrollmentStart'])
]


ID,Age,Gender,Income,EnrollmentStart,EnrollmentFinish,LastPeriodStart,LastPeriodFinish,DateLastVisit,DaysWithoutFrequency,LifetimeValue,...,OtherActivities,NumberOfFrequencies,AttendedClasses,AllowedWeeklyVisitsBySLA,AllowedNumberOfVisitsBySLA,RealNumberOfVisits,NumberOfRenewals,HasReferences,NumberOfReferences,Dropout


The clients with `DateLastVisit` < '2019-06-30' who are not marked as dropouts all have `EnrollmentStart` = `EnrollmentFinish`, which points to one of two situations:
- These clients have dropped out but have not been recorded in the system as dropouts.
- Their contract is still active because they keep paying, but they are not coming to the gym.

How can we deal with these inconsistencies?
1) If `DateLastVisit` falls within the client's last period of activity (and that period is not the current one), the client is considered a dropout: all dates are correct except the enrollment dates, so `EnrollmentFinish` is set to `DateLastVisit`.
2) All other clients are considered active: `LastPeriodFinish` should be '2019-12-31' and `EnrollmentFinish` '2019-10-31'.


##### Drop Outs:

We create a mask to select clients who follow condition **1** and should be considered dropouts:

In [24]:
drop_mask = (
    (data['EnrollmentStart'] == data['EnrollmentFinish']) &
    (data['LastPeriodStart']<= data['DateLastVisit']) &
    (data['DateLastVisit']<= data['LastPeriodFinish']) &
    (data['DateLastVisit'] < pd.Timestamp(dt.date(2019,6,30)))
)
index_dropout = data.index[drop_mask].tolist()

`Dropout` should be set to 1 and `EnrollmentFinish` to the respective `DateLastVisit`

In [25]:
data.loc[index_dropout, 'Dropout']=1

In [26]:
data.loc[index_dropout, 'EnrollmentFinish'] = data.loc[index_dropout, 'DateLastVisit']
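As a sanity check of condition **1**, here is the same logic on two invented clients, written with `Series.between` (inclusive on both ends by default) instead of two separate comparisons. The frame `toy`, the IDs and the dates are made up for illustration.

```python
import pandas as pd

# Client 1 last visited in a past period; client 2 in the current period.
toy = pd.DataFrame(
    {
        "EnrollmentStart": pd.to_datetime(["2018-09-05", "2017-09-16"]),
        "EnrollmentFinish": pd.to_datetime(["2018-09-05", "2017-09-16"]),
        "LastPeriodStart": pd.to_datetime(["2018-07-01", "2019-07-01"]),
        "LastPeriodFinish": pd.to_datetime(["2018-12-31", "2019-12-31"]),
        "DateLastVisit": pd.to_datetime(["2018-10-25", "2019-10-26"]),
        "Dropout": [0, 0],
    },
    index=pd.Index([1, 2], name="ID"),
)

cutoff = pd.Timestamp("2019-06-30")
rule_1 = (
    (toy["EnrollmentStart"] == toy["EnrollmentFinish"])
    & toy["DateLastVisit"].between(toy["LastPeriodStart"], toy["LastPeriodFinish"])
    & (toy["DateLastVisit"] < cutoff)
)

# Mark the matching clients as dropouts and close their enrollment.
toy.loc[rule_1, "Dropout"] = 1
toy.loc[rule_1, "EnrollmentFinish"] = toy.loc[rule_1, "DateLastVisit"]
print(toy[["EnrollmentFinish", "DateLastVisit", "Dropout"]])
```

Only client 1 is affected; client 2 visited during the current period and stays untouched.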

##### Non - Drop Out:

Looking at the non-dropout clients, instances where `LastPeriodFinish` is not '2019-12-31' will be considered errors and removed, as they contain too many discrepancies to be considered reliable.

In [27]:
mask = ( 
    (data['EnrollmentStart'] == data['EnrollmentFinish']) &
    (data['DateLastVisit']< '2019-06-30') 
)

In [28]:
data[mask]['LastPeriodFinish'].value_counts()

LastPeriodFinish
2019-12-31    81
2019-06-30     4
2018-12-31     4
2016-12-31     1
2018-06-30     1
Name: count, dtype: int64

In [29]:
drop_mask = ( 
    (data['EnrollmentStart'] == data['EnrollmentFinish']) &
    (data['DateLastVisit']< '2019-06-30') &
    (data['LastPeriodFinish']!= pd.Timestamp(dt.date(2019,12,31)))
)
index_dropout = data.index[drop_mask].tolist()

In [30]:
data.drop(index_dropout, inplace = True)

All other instances that have `EnrollmentStart` equal to `EnrollmentFinish` are considered active members:\
`LastPeriodFinish` should be '2019-12-31' and `EnrollmentFinish` '2019-10-31'.

In [31]:
data.loc[data['EnrollmentFinish'] == data['EnrollmentStart'], 'LastPeriodFinish'] = pd.Timestamp('2019-12-31')

In [32]:
data.loc[data['EnrollmentFinish'] == data['EnrollmentStart'], 'EnrollmentFinish'] = pd.Timestamp('2019-10-31')

Clients whose last period contains neither of the following:
- `DateLastVisit`: the last time the client was at the facility
- `EnrollmentFinish`: the end of the contract

are all dropouts that show inconsistencies in the date variables:

In [33]:
mask = (
    ~((data['LastPeriodStart'] <= data['EnrollmentFinish']) & (data['EnrollmentFinish'] <= data['LastPeriodFinish'])) &
    ~((data['LastPeriodStart'] <= data['DateLastVisit']) & (data['DateLastVisit'] <= data['LastPeriodFinish']))
)

In [34]:
data.loc[data.index[mask].tolist()].Dropout.value_counts()

Dropout
1    128
Name: count, dtype: int64

Depending on the clustering results, these clients can either be:
- Dropped from the dataset entirely, as they represent only 0.86% of entries;
- Left as they are.

In [35]:
#data.drop(data.index[mask].tolist(), inplace = True)
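The 0.86% figure follows directly from the counts shown in this notebook; the short check below reproduces the arithmetic. On the real frame, the same share could be obtained as `mask.mean() * 100`, with the denominator being the rows remaining at that point.

```python
# Counts taken from the outputs shown in this notebook.
n_flagged = 128   # rows selected by the mask
n_total = 14942   # rows in the raw dataset

share_pct = round(n_flagged / n_total * 100, 2)
print(share_pct)
```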

<div class="alert alert-block alert-success">

<a class="anchor" id="4th-bullet">    </a>
## 4. Data Visualization 
</div>

Visualizing our data can help us understand more about the distribution of the features and find possible inconsistencies.

We start by separating metric, non-metric and date type features:

In [37]:
metric_features = ['Age','Income','DaysWithoutFrequency','LifetimeValue','NumberOfFrequencies', 'AttendedClasses', 'AllowedWeeklyVisitsBySLA', 'AllowedNumberOfVisitsBySLA','RealNumberOfVisits','NumberOfRenewals','NumberOfReferences']
date_features = ['EnrollmentStart','EnrollmentFinish','LastPeriodStart','LastPeriodFinish','DateLastVisit']

# Everything that is neither metric nor a date (Gender, binary flags) is treated as non-metric.
non_metric_features = data.columns.drop(metric_features + date_features).to_list()
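A manual split like this can be cross-checked by listing the numeric columns that only ever take the values 0 and 1, since those behave as categorical flags rather than as measurements. The sketch below uses a toy frame with invented values; `binary_cols` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Toy frame mixing a measurement, a category and two binary flags.
toy = pd.DataFrame({
    "Age": [60, 29, 23],
    "Gender": ["Female", "Female", "Male"],
    "WaterActivities": [0.0, 1.0, np.nan],
    "Dropout": [0, 1, 1],
})

numeric = toy.select_dtypes(include="number")

# A numeric column is binary if all of its non-missing values are 0 or 1.
binary_cols = [
    col for col in numeric.columns
    if numeric[col].dropna().isin([0, 1]).all()
]
print(binary_cols)
```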

In [1]:
## as much seaborn as possible

- Data Visualizations
- Missing Values
- Outliers
- Feature Engineering
- New Variables