# Data Science Project

Group 17

### Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

### Importing Data

In [2]:
original_data = pd.read_csv('cluster.csv')

In [3]:
df = original_data.copy()

In [4]:
df.head()

Unnamed: 0,Churn,Name,Longevity,Year_Birth,TypeTravel,RoomType,RewardPoints,Comfort,ReceptionSchedule,FoodDrink,...,Wifi,Amenities,Staff,OnlineBooking,PriceQuality,RoomSpace,CheckOut,Checkin,Cleanliness,BarService
0,churn,Ms. Nicole Clarke,yes,1974.0,business,single,4907,3,4,1,...,4,3,4,3,3,3,3,4,3,4
1,nochurn,Mr. Jesse Scott,yes,1965.0,business,single,6724,1,1,1,...,5,3,4,5,5,5,5,1,5,2
2,churn,Mr. Christopher Carter,yes,1973.0,business,single,4365,3,3,3,...,1,4,4,3,3,2,3,2,3,1
3,nochurn,Ms. Jennifer Morgan,yes,1993.0,leisure,double,3849,1,1,1,...,4,4,5,4,4,4,4,4,4,3
4,nochurn,Mr. Michael White,yes,1989.0,business,single,5376,2,2,3,...,5,5,5,5,5,3,4,1,3,5



| Variable      | Description |
| ----------- | ----------- |
| Name      | Customer’s name       |
| Year_Birth    | Customer’s birth year        |
| Longevity   | Whether the customer registered more than 1 year ago or not        |
| Churn   | Whether the customer churned or not (churn or nochurn)        |
| TypeTravel   | Customer’s reason for travelling (business or leisure)        |
| RoomType   | Type of room reserved        |
| RewardPoints   | Customer’s reward points for loyalty        |
| Comfort   | Satisfaction level of customer regarding comfort of the room (0 to 5)        |
| ReceptionSchedule   | Satisfaction level of customer regarding reception schedule (0 to 5)        |
| FoodDrink   | Satisfaction level of customer regarding food and drink available (0 to 5)        |
| Location   |   Satisfaction level of customer regarding accommodation location (0 to 5)      |
| Wifi   |    Satisfaction level of customer regarding wi-fi service (0 to 5)     |
| Amenities   | Satisfaction level of customer regarding accommodation amenities (0 to 5)        |
| Staff   | Satisfaction level of customer regarding staff (0 to 5)        |
| OnlineBooking   | Satisfaction level of customer regarding ease of online booking (0 to 5)        |
| PriceQuality   | Satisfaction level of customer regarding price-quality relationship (0 to 5)        |
| RoomSpace   | Satisfaction level of customer regarding room space (0 to 5)        |
| CheckOut   | Satisfaction level of customer regarding check-out (0 to 5)        |
| Checkin   | Satisfaction level of customer regarding check-in (0 to 5)        |
| Cleanliness   | Satisfaction level of customer regarding cleanliness (0 to 5)        |
| BarService   | Satisfaction level of customer regarding bar service (0 to 5)        |


### 1. Data Understanding

In [5]:
df_rows = df.shape[0]
df_columns = df.shape[1]

print("Dataframe has", df_rows, "rows and", df_columns, "columns")

Dataframe has 15589 rows and 21 columns


In [6]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Year_Birth,15394.0,1981.706444,15.179042,1936.0,1970.0,1981.0,1994.0,2014.0
RewardPoints,15589.0,5022.593816,1027.962379,409.0,4445.0,5088.0,5649.0,6950.0
Comfort,15589.0,2.841619,1.388624,0.0,2.0,3.0,4.0,5.0
ReceptionSchedule,15589.0,2.997242,1.518994,0.0,2.0,3.0,4.0,5.0
FoodDrink,15589.0,2.84457,1.436948,0.0,2.0,3.0,4.0,5.0
Location,15589.0,2.986016,1.299438,1.0,2.0,3.0,4.0,5.0
Wifi,15589.0,3.245109,1.327026,0.0,2.0,3.0,4.0,6.0
Amenities,15589.0,3.374816,1.352417,0.0,2.0,4.0,4.0,5.0
Staff,15589.0,3.506383,1.319565,1.0,3.0,4.0,5.0,5.0
OnlineBooking,15589.0,3.454231,1.310343,0.0,2.0,4.0,5.0,5.0


Year_Birth:  
The minimum and maximum look plausible, with no extreme cases. However, if we only want customers aged 18 or over, we can drop those born after 2004. This variable also has missing values (15394 non-null values out of 15589 rows).

RewardPoints:  
Looks alright.

Remaining variables:  
Mostly look alright, but Wifi has a maximum of 6, which falls outside the documented 0 to 5 scale and should be checked.
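
The observations above can be turned into explicit checks. The sketch below runs on a small toy frame rather than on `cluster.csv`; the column subset in `rating_cols` and the reference year `REFERENCE_YEAR = 2022` are assumptions made for illustration (the latter is inferred from the "born after 2004" cut-off).

```python
import pandas as pd

# Toy frame standing in for df; values are illustrative, not taken from cluster.csv.
toy = pd.DataFrame({
    "Wifi": [4, 5, 6, 1],
    "Comfort": [3, 0, 5, 2],
    "Year_Birth": [1974.0, 2010.0, None, 1989.0],
})

# Assumed subset of the satisfaction columns documented as 0 to 5.
rating_cols = ["Wifi", "Comfort"]

# Count, per rating column, how many values fall outside the 0-5 scale.
out_of_range = toy[rating_cols].apply(lambda s: ~s.between(0, 5)).sum()
print(out_of_range)

# Flag customers younger than 18 for an assumed reference year.
# Rows with a missing Year_Birth compare as False and are not flagged.
REFERENCE_YEAR = 2022
is_minor = (REFERENCE_YEAR - toy["Year_Birth"]) < 18
print(toy[is_minor])
```

On the real data, the same check would use the full list of satisfaction columns in place of `rating_cols`.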

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15589 entries, 0 to 15588
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Churn              15589 non-null  object 
 1   Name               15589 non-null  object 
 2   Longevity          15589 non-null  object 
 3   Year_Birth         15394 non-null  float64
 4   TypeTravel         15589 non-null  object 
 5   RoomType           15589 non-null  object 
 6   RewardPoints       15589 non-null  int64  
 7   Comfort            15589 non-null  int64  
 8   ReceptionSchedule  15589 non-null  int64  
 9   FoodDrink          15589 non-null  int64  
 10  Location           15589 non-null  int64  
 11  Wifi               15589 non-null  int64  
 12  Amenities          15589 non-null  int64  
 13  Staff              15589 non-null  int64  
 14  OnlineBooking      15589 non-null  int64  
 15  PriceQuality       15589 non-null  int64  
 16  RoomSpace          155

Some categorical variables, namely Churn, Longevity, TypeTravel and RoomType, are stored as object and must be converted to a numerical type.
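
One possible way to do this conversion, shown on toy data, is to map the two-level variables to 0/1 and one-hot encode RoomType. The specific 0/1 assignments below are an assumption for illustration, not a choice made in this notebook.

```python
import pandas as pd

# Toy frame with the same category labels as the dataset; rows are illustrative.
toy = pd.DataFrame({
    "Churn": ["churn", "nochurn", "nochurn"],
    "TypeTravel": ["business", "leisure", "business"],
    "RoomType": ["single", "double", "suite"],
})

encoded = toy.copy()

# Two-level variables: map each label to 0 or 1 (assumed coding).
encoded["Churn"] = encoded["Churn"].map({"nochurn": 0, "churn": 1})
encoded["TypeTravel"] = encoded["TypeTravel"].map({"leisure": 0, "business": 1})

# Multi-level variable: one indicator column per room type.
encoded = pd.get_dummies(encoded, columns=["RoomType"], dtype=int)
print(encoded)
```

Longevity can be mapped the same way once its labels have been made consistent.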

In [8]:
df['Churn'].value_counts()

nochurn    8477
churn      7112
Name: Churn, dtype: int64

Churn is a binary variable (churn or nochurn).

In [9]:
df['Longevity'].value_counts()

yes    12548
no      2874
y        167
Name: Longevity, dtype: int64

Longevity has some entries recorded as yes and others as y; we assume they mean the same thing.
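
Under that assumption, the two spellings can be merged with `Series.replace`. A minimal sketch on toy data:

```python
import pandas as pd

# Toy column with the three labels seen in the value counts above.
toy = pd.DataFrame({"Longevity": ["yes", "no", "y", "yes"]})

# Treat "y" as a shorthand for "yes" (assumption stated in the text).
toy["Longevity"] = toy["Longevity"].replace({"y": "yes"})
print(toy["Longevity"].value_counts())
```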

In [10]:
df['TypeTravel'].value_counts()

business    10756
leisure      4833
Name: TypeTravel, dtype: int64

In [11]:
df['RoomType'].value_counts()

single    7442
double    7021
suite     1126
Name: RoomType, dtype: int64

In [12]:
df.isna().sum()

Churn                  0
Name                   0
Longevity              0
Year_Birth           195
TypeTravel             0
RoomType               0
RewardPoints           0
Comfort                0
ReceptionSchedule      0
FoodDrink              0
Location               0
Wifi                   0
Amenities              0
Staff                  0
OnlineBooking          0
PriceQuality           0
RoomSpace              0
CheckOut               0
Checkin                0
Cleanliness            0
BarService             0
dtype: int64

As mentioned before, we have 195 missing values for the variable Year_Birth, which will be analyzed and handled during data preparation.

### 2. Data Preparation

#### 2.1 Handling duplicate rows

In [13]:
duplicates = df.duplicated()

In [14]:
duplicates.value_counts()

False    15586
True         3
dtype: int64

In [15]:
df[duplicates]

Unnamed: 0,Churn,Name,Longevity,Year_Birth,TypeTravel,RoomType,RewardPoints,Comfort,ReceptionSchedule,FoodDrink,...,Wifi,Amenities,Staff,OnlineBooking,PriceQuality,RoomSpace,CheckOut,Checkin,Cleanliness,BarService
8195,nochurn,Ms. Abigail York,yes,1995.0,leisure,double,5098,5,5,5,...,4,5,5,3,3,4,3,3,3,5
9176,churn,Ms. Abigail Kennedy,yes,1991.0,business,suite,5932,3,3,2,...,3,3,3,3,4,1,4,3,4,3
9417,nochurn,Ms. Abigail Buchanan,yes,1972.0,business,double,6769,5,4,4,...,5,5,4,5,5,5,5,2,5,1


We have 3 duplicated rows, which we will remove.
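
Note that `duplicated()` marks only the second and later copies by default, so the first copy of each duplicated row is not shown. To inspect every copy before dropping anything, `keep=False` can be passed. A small sketch on toy data (the names and values are hypothetical):

```python
import pandas as pd

# Toy frame where the first and last rows are identical.
toy = pd.DataFrame({
    "Name": ["Customer A", "Customer B", "Customer A"],
    "RewardPoints": [5098, 5932, 5098],
})

# keep=False flags every member of a duplicated group, not just the repeats.
both_copies = toy[toy.duplicated(keep=False)]
print(both_copies)

# drop_duplicates keeps the first copy of each group by default.
deduped = toy.drop_duplicates()
print(len(toy) - len(deduped), "row(s) removed")
```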

In [16]:
#remove duplicates
df = df.drop_duplicates()

#get new count of rows and columns
df_rows_after_duplicates_removal = df.shape[0]
df_columns_after_duplicates_removal = df.shape[1]

print("We have removed", df_rows - df_rows_after_duplicates_removal, "duplicated rows from the Dataframe.")

We have removed 3 duplicated rows from the Dataframe.


#### 2.2 Handling missing values

We must decide whether to remove the rows with missing values or to impute the missing values.

In [17]:
df[df['Year_Birth'].isnull()].head()

Unnamed: 0,Churn,Name,Longevity,Year_Birth,TypeTravel,RoomType,RewardPoints,Comfort,ReceptionSchedule,FoodDrink,...,Wifi,Amenities,Staff,OnlineBooking,PriceQuality,RoomSpace,CheckOut,Checkin,Cleanliness,BarService
27,nochurn,Ms. Emily Thomas,yes,,leisure,double,4760,0,5,0,...,5,0,3,5,4,4,5,4,4,5
126,churn,Ms. Elizabeth Tyler,yes,,business,double,5151,2,2,2,...,2,2,2,2,3,4,3,2,4,2
147,nochurn,Ms. Elizabeth Santos,no,,business,single,5370,4,4,4,...,2,4,2,2,5,4,4,4,4,2
168,nochurn,Ms. Elizabeth Lawson,yes,,business,single,3669,1,1,1,...,5,5,5,5,4,5,5,4,4,5
170,nochurn,Ms. Elizabeth Morgan,yes,,business,single,4593,1,1,1,...,2,4,5,4,4,4,5,2,4,4


The other variables in these rows hold valid information, so rather than dropping the rows we impute the missing entries of Year_Birth. They will be filled with the mean of the variable, rounded to the nearest year.

In [18]:
Year_Birth_Mean = df['Year_Birth'].mean().round()

In [19]:
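# Fill only the Year_Birth column; assigning to the whole row would overwrite every column
df.loc[df['Year_Birth'].isnull(), 'Year_Birth'] = Year_Birth_Mean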

Now we have no missing values.
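
A quick sanity check after imputation is to confirm that no missing values remain and that the other columns of the affected rows are unchanged. The sketch below does this on toy data and uses `fillna` as an equivalent way to write the imputation; the frame and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing birth year.
toy = pd.DataFrame({
    "Name": ["Customer A", "Customer B", "Customer C"],
    "Year_Birth": [1970.0, np.nan, 1990.0],
    "RewardPoints": [4000, 5000, 6000],
})

# Rounded mean of the observed birth years (NaN is skipped by default).
year_mean = round(toy["Year_Birth"].mean())

# Impute only the Year_Birth column.
toy["Year_Birth"] = toy["Year_Birth"].fillna(year_mean)

# No missing values remain, and the other columns of the row are intact.
print(toy.isna().sum())
print(toy.loc[1])
```

The mean is sensitive to skew; the median would be a common alternative if the birth-year distribution turned out to be asymmetric.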