# Mean comparison between bluecars taken on weekdays and on weekends

# 1. Data

In this project, we compare the mean number of bluecars taken per day on weekdays with the mean number of bluecars taken per day on weekends.

## 1.1. Data Preparation

Data preparation consists of:
 - importing packages
 - loading datasets
 - looking at the variables
 - dealing with missing values
 - removing duplicates
 - harmonizing column names
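
The last three steps in the list can be sketched on a toy frame as follows. This is a minimal illustration, not the notebook's own code: the helper name `prepare` and the toy data are hypothetical, and the real cells further down apply the same ideas to the Autolib dataset.

```python
import pandas as pd


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Harmonize column names and drop exact duplicate rows (hypothetical helper)."""
    out = df.copy()
    # Strip surrounding whitespace, lowercase, and replace inner spaces with "_"
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(" ", "_", regex=False)
    )
    # Remove rows that are exact copies of an earlier row
    return out.drop_duplicates().reset_index(drop=True)


# Toy data (made up): one padded column name and one duplicated row
toy = pd.DataFrame(
    {" Postal code": [75001, 75001, 75002], "BlueCars_taken_sum": [110, 110, 98]}
)
clean = prepare(toy)

# Count missing values per column after cleaning
missing_per_column = clean.isnull().sum()
print(list(clean.columns), len(clean))
```

Working on a copy keeps the original frame untouched, which makes it easy to compare the data before and after cleaning.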

In [1]:
# Importing packages needed for our project
import math
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from scipy.stats import norm

In [2]:
# assigning links to dataset and glossary

url1 = "http://bit.ly/DSCoreAutolibDataset"
url2 = "http://bit.ly/DSCoreAutolibDatasetGlossary"

#url1 for dataset and url2 for glossary

In [3]:
# Loading dataset and viewing five first elements of the dataset

data = pd.read_csv(url1)
data.head()

Unnamed: 0,Postal code,date,n_daily_data_points,dayOfWeek,day_type,BlueCars_taken_sum,BlueCars_returned_sum,Utilib_taken_sum,Utilib_returned_sum,Utilib_14_taken_sum,Utilib_14_returned_sum,Slots_freed_sum,Slots_taken_sum
0,75001,1/1/2018,1440,0,weekday,110,103,3,2,10,9,22,20
1,75001,1/2/2018,1438,1,weekday,98,94,1,1,8,8,23,22
2,75001,1/3/2018,1439,2,weekday,138,139,0,0,2,2,27,27
3,75001,1/4/2018,1320,3,weekday,104,104,2,2,9,8,25,21
4,75001,1/5/2018,1440,4,weekday,114,117,3,3,6,6,18,20


In [4]:
# Loading glossary and viewing all descriptions of dataset variables

glossary = pd.read_excel(url2)
print(glossary)

# We have a description of the dataset as follows:
# - a postal code
# - a date of the row aggregation
# - a day of the week and a type of day(weekday or weekend)
# - 3 variables for cars taken (bluecar, utilib and utilib 1.4)
# - 3 variables for cars returned (bluecar, utilib and utilib 1.4)
# - total recharging slots freed that day and
# - total recharging slots taken that day

               Column name                                        explanation
0              Postal code                 postal code of the area (in Paris)
1                     date                        date of the row aggregation
2      n_daily_data_points  number of daily data poinst that were availabl...
3                dayOfWeek     identifier of weekday (0: Monday -> 6: Sunday)
4                 day_type                                 weekday or weekend
5       BlueCars_taken_sum    Number of bluecars taken that date in that area
6    BlueCars_returned_sum  Number of bluecars returned that date in that ...
7         Utilib_taken_sum      Number of Utilib taken that date in that area
8      Utilib_returned_sum   Number of Utilib returned that date in that area
9      Utilib_14_taken_sum  Number of Utilib 1.4 taken that date in that area
10  Utilib_14_returned_sum  Number of Utilib 1.4 returned that date in tha...
11         Slots_freed_sum  Number of recharging slots released 

In [5]:
# Viewing number of records and variables

print(data.shape)
print()
data.info()


# the dataset has 13 fields/variables/columns and 16085 records/rows.

(16085, 13)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16085 entries, 0 to 16084
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Postal code             16085 non-null  int64 
 1   date                    16085 non-null  object
 2   n_daily_data_points     16085 non-null  int64 
 3   dayOfWeek               16085 non-null  int64 
 4   day_type                16085 non-null  object
 5   BlueCars_taken_sum      16085 non-null  int64 
 6   BlueCars_returned_sum   16085 non-null  int64 
 7   Utilib_taken_sum        16085 non-null  int64 
 8   Utilib_returned_sum     16085 non-null  int64 
 9   Utilib_14_taken_sum     16085 non-null  int64 
 10  Utilib_14_returned_sum  16085 non-null  int64 
 11  Slots_freed_sum         16085 non-null  int64 
 12  Slots_taken_sum         16085 non-null  int64 
dtypes: int64(11), object(2)
memory usage: 1.6+ MB


In [6]:
# Make all column names lowercase

data.columns= data.columns.str.lower()

# Strip leading and trailing whitespace from column names
# (the result must be assigned back, otherwise the change is lost)

data.columns = data.columns.str.strip()

# Replace space in names with "_"

data.columns = data.columns.str.replace(' ','_')

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16085 entries, 0 to 16084
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   postal_code             16085 non-null  int64 
 1   date                    16085 non-null  object
 2   n_daily_data_points     16085 non-null  int64 
 3   dayofweek               16085 non-null  int64 
 4   day_type                16085 non-null  object
 5   bluecars_taken_sum      16085 non-null  int64 
 6   bluecars_returned_sum   16085 non-null  int64 
 7   utilib_taken_sum        16085 non-null  int64 
 8   utilib_returned_sum     16085 non-null  int64 
 9   utilib_14_taken_sum     16085 non-null  int64 
 10  utilib_14_returned_sum  16085 non-null  int64 
 11  slots_freed_sum         16085 non-null  int64 
 12  slots_taken_sum         16085 non-null  int64 
dtypes: int64(11), object(2)
memory usage: 1.6+ MB


In [7]:
# Checking for missing values

data.isnull().sum()

# Our dataset has no missing values

postal_code               0
date                      0
n_daily_data_points       0
dayofweek                 0
day_type                  0
bluecars_taken_sum        0
bluecars_returned_sum     0
utilib_taken_sum          0
utilib_returned_sum       0
utilib_14_taken_sum       0
utilib_14_returned_sum    0
slots_freed_sum           0
slots_taken_sum           0
dtype: int64

In [8]:
# Checking for duplicates

data.duplicated().sum()

# Our dataset has no duplicates

0

In [9]:
# Finding unique values

column_names = list(data.columns)

for col in column_names:
    print()
    print(col, "\n\n", data[col].unique())


postal_code 

 [75001 75002 75003 75004 75005 75006 75007 75008 75009 75010 75011 75012
 75013 75014 75015 75016 75017 75018 75019 75020 75112 75116 78000 78140
 78150 91330 91370 91400 92000 92100 92110 92120 92130 92140 92150 92160
 92170 92190 92200 92210 92220 92230 92240 92250 92260 92270 92290 92300
 92310 92320 92330 92340 92350 92360 92370 92380 92390 92400 92410 92420
 92500 92600 92700 92800 93100 93110 93130 93150 93170 93200 93230 93260
 93300 93310 93350 93360 93370 93390 93400 93440 93500 93600 93700 93800
 94000 94100 94110 94120 94130 94140 94150 94160 94220 94230 94300 94340
 94410 94450 94500 94700 94800 95100 95870 95880]

date 

 ['1/1/2018' '1/2/2018' '1/3/2018' '1/4/2018' '1/5/2018' '1/6/2018'
 '1/7/2018' '1/8/2018' '1/9/2018' '1/10/2018' '1/11/2018' '1/12/2018'
 '1/13/2018' '1/14/2018' '1/15/2018' '1/16/2018' '1/17/2018' '1/18/2018'
 '1/19/2018' '1/20/2018' '1/21/2018' '1/22/2018' '1/23/2018' '1/24/2018'
 '1/25/2018' '1/26/2018' '1/27/2018' '1/28/2018' '1/29/201

## 1.2. Data Analysis

In [10]:
# Some descriptive statistics on numerical variables

data.describe()

Unnamed: 0,postal_code,n_daily_data_points,dayofweek,bluecars_taken_sum,bluecars_returned_sum,utilib_taken_sum,utilib_returned_sum,utilib_14_taken_sum,utilib_14_returned_sum,slots_freed_sum,slots_taken_sum
count,16085.0,16085.0,16085.0,16085.0,16085.0,16085.0,16085.0,16085.0,16085.0,16085.0,16085.0
mean,88791.293876,1431.330619,2.969599,125.926951,125.912714,3.69829,3.699099,8.60056,8.599192,22.629033,22.629282
std,7647.342,33.21205,2.008378,185.426579,185.501535,5.815058,5.824634,12.870098,12.868993,52.120263,52.14603
min,75001.0,1174.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,91330.0,1439.0,1.0,20.0,20.0,0.0,0.0,1.0,1.0,0.0,0.0
50%,92340.0,1440.0,3.0,46.0,46.0,1.0,1.0,3.0,3.0,0.0,0.0
75%,93400.0,1440.0,5.0,135.0,135.0,4.0,4.0,10.0,10.0,5.0,5.0
max,95880.0,1440.0,6.0,1352.0,1332.0,54.0,58.0,100.0,101.0,360.0,359.0


# 2. Hypothesis Testing

### 2.1. Objective

The objective is to find out whether the mean number of bluecars taken per day differs between weekdays and weekends.

 - Null hypothesis: there is no difference between the mean number of bluecars taken on weekdays and the mean number of bluecars taken on weekends.
 - Alternative hypothesis: there is a difference between the mean number of bluecars taken on weekdays and the mean number of bluecars taken on weekends.

H0 : mu1 = mu2

H1 : mu1 != mu2

This is a two-tailed test. The test statistic is a z-score computed from a sample drawn from the main dataset.

### 2.2. Sampling

In [11]:
# Drawing a sample
# Stratified sampling is used so that weekdays and weekends are represented in the sample in the same proportions as in the population
# First we check the population counts with regard to the day_type variable
proportion = data.day_type.value_counts()
proportion

weekday    11544
weekend     4541
Name: day_type, dtype: int64

In [12]:
# Drawing the sample using stratified sampling; we only use 10% of the population
# Sampling no more than 10% of the population without replacement lets us treat the observations as approximately independent

strata_sample = data.groupby("day_type", group_keys = False).apply(lambda strata : strata.sample(frac = 0.1))
strata_sample.day_type.value_counts()

# The sample keeps the weekday/weekend proportions of the population (about 72% / 28%)

weekday    1154
weekend     454
Name: day_type, dtype: int64
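
A minimal sketch of the same stratified draw on a toy frame, assuming a fixed `random_state` is acceptable so that the sample can be reproduced (the cell above draws a different sample on each run). The toy data and the seed value 42 are made up for illustration.

```python
import pandas as pd

# Toy population (made up): 50 weekday rows and 20 weekend rows
toy = pd.DataFrame(
    {
        "day_type": ["weekday"] * 50 + ["weekend"] * 20,
        "bluecars_taken_sum": range(70),
    }
)

# Draw 10% from each stratum separately, then stack the pieces
parts = [
    part.sample(frac=0.1, random_state=42)
    for _, part in toy.groupby("day_type")
]
sample = pd.concat(parts)

# Each stratum contributes 10% of its rows: 5 weekday rows and 2 weekend rows
counts = sample["day_type"].value_counts().to_dict()
print(counts)
```

Sampling each group on its own and concatenating gives the same proportions as the `groupby(...).apply(...)` call, while keeping the seed explicit.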

We use a two-sample z-test to compare the mean number of bluecars taken on weekdays with the mean number of bluecars taken on weekends.

The values needed for each of the two groups are: n (sample size), x (sample mean), mu (population mean) and std (population standard deviation).

We choose a level of significance alpha of 0.05.
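
For reference, a two-sample z statistic with known population standard deviations is usually written as below, and this is the form the calculation in section 2.3 follows. The notation is an assumption of this sketch rather than something taken from the notebook's code: $\bar{x}_i$, $\sigma_i$ and $n_i$ are the sample mean, population standard deviation and sample size of group $i$ (1 = weekday, 2 = weekend), and $\mu_1 - \mu_2$ is the difference in population means that the sample difference is compared against.

$$
z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}},
\qquad
p = 2\,\bigl(1 - \Phi(|z|)\bigr)
$$

Here $\Phi$ is the standard normal cumulative distribution function, and the null hypothesis is rejected when $p < \alpha$.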

In [13]:
# Information about our bluecars column given weekdays and weekends from our sample

strata_description = strata_sample["bluecars_taken_sum"].groupby(strata_sample["day_type"]).describe()
strata_description

day_type,count,mean,std,min,25%,50%,75%,max
weekday,1154.0,109.012998,159.053954,0.0,18.0,40.0,113.5,1087.0
weekend,454.0,155.508811,216.397136,0.0,25.0,56.0,183.5,1127.0


In [14]:
# Information about our bluecars column given weekdays and weekends from our dataset

data_description = pd.DataFrame(data["bluecars_taken_sum"].groupby(data["day_type"]).describe())
data_description

day_type,count,mean,std,min,25%,50%,75%,max
weekday,11544.0,116.028673,169.626905,0.0,18.0,42.0,126.0,1093.0
weekend,4541.0,151.090068,218.565642,0.0,25.0,59.0,156.0,1352.0


In [15]:
strata_weekday = strata_sample[strata_sample["day_type"] == "weekday"]
strata_weekend = strata_sample[strata_sample["day_type"] == "weekend"]
print(strata_weekday.shape)
print(strata_weekend.shape)

(1154, 13)
(454, 13)


### 2.3. Calculating the z-score and the p-value

In [16]:
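# We calculate the z score using the values below.
# x1, x2: sample means; mu1, mu2 and std1, std2: population means and
# standard deviations, copied from the full-dataset summary in cell 14.

n1 = len(strata_weekday["bluecars_taken_sum"])
x1 = strata_weekday["bluecars_taken_sum"].mean()
mu1 = 116.028673
std1 = 169.626905

n2 = len(strata_weekend["bluecars_taken_sum"])
x2 = strata_weekend["bluecars_taken_sum"].mean()
mu2 = 151.090068
std2 = 218.565642

alpha = 0.05

mean_diff = x1 - x2
mu_diff = mu1 - mu2
# Variance of the difference between the two sample means
# (sum of the two variances of the mean, not a pooled variance)
var_diff = ((std1**2)/n1) + ((std2**2)/n2)

z_score = (mean_diff - mu_diff)/math.sqrt(var_diff)
z_score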

-1.002264105707102

In [17]:
# The two-tailed p-value associated with the z-score obtained is given by

p_value = 2*(norm.sf(abs(z_score)))
p_value

0.3162160536493789
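
As a sketch, the two cells above can be folded into one helper that uses only the standard library. The function name `two_sample_z` and the parameter `hypothesized_diff` are hypothetical, introduced here for illustration. The p-value uses the identity 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)) in place of `norm.sf`. `hypothesized_diff` defaults to 0; the notebook's cell subtracts the population difference mu1 - mu2, so that value is passed explicitly here to reproduce its numbers. The inputs are the rounded sample means from cell 13 and the population values from cell 14, so the result agrees with the outputs above only up to rounding.

```python
import math


def two_sample_z(x1, x2, sigma1, sigma2, n1, n2, hypothesized_diff=0.0):
    """Return (z, two-tailed p) for a two-sample z-test with known sigmas.

    Hypothetical helper: hypothesized_diff is the value of mu1 - mu2
    that the sample difference x1 - x2 is compared against.
    """
    # Standard error of the difference between the two sample means
    se = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    z = ((x1 - x2) - hypothesized_diff) / se
    # Two-tailed p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Rounded values copied from the summaries in cells 13 and 14
z, p = two_sample_z(
    x1=109.012998, x2=155.508811,
    sigma1=169.626905, sigma2=218.565642,
    n1=1154, n2=454,
    hypothesized_diff=116.028673 - 151.090068,
)
print(round(z, 4), round(p, 4))
```

Keeping the hypothesized difference as an explicit argument makes it visible which value the sample difference is being tested against.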

# 3. Conclusion

In [18]:
# As a conclusion, if p_value < alpha, we have significant evidence to reject the null hypothesis,
# but if p_value >= alpha, we do not have significant evidence to reject the null hypothesis.

if p_value < alpha:
    print("There is significant evidence to reject the null hypothesis")
else:
    print("There is no significant evidence to reject the null hypothesis")

# As we can see, there is no significant evidence to reject the null hypothesis
# as the p value of about 0.32 is greater than the level of significance alpha = 0.05.

There is no significant evidence to reject the null hypothesis


In conclusion, at the 0.05 level of significance, there is no significant difference between the mean number of bluecars taken per day on weekdays and the mean number of bluecars taken per day on weekends.