# Data analysis on Velib database - Project 2021 - Python

#### Nguyen Hai Vy, Hoang Van Hao, Benzitouni Fethi, Bertin Alexandre

  
<br/>
<div style="text-align: justify">    
We consider the ‘Vélib’ data set, related to the bike sharing system of Paris. The data are loading profiles of the bike stations over one week, collected every hour over the period Monday 2nd Sept. - Sunday 7th Sept., 2014. The loading profile of a station, or simply loading, is defined as the ratio of the number of available bikes to the number of bike docks. A loading of 1 means that the station is fully loaded, i.e. all bikes are available. A loading of 0 means that the station is empty, i.e. all bikes have been rented.
</div>
<br/>
<div style="text-align: justify">  
From the viewpoint of data analysis, the individuals are the stations. The variables are the 168 time steps (hours in the week). The aim is to detect clusters in the data, corresponding to common customer usages. This clustering should then be used to predict the loading profile*.
</div>
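
As a quick illustration of the definition above, the following sketch computes the loading from bike and dock counts. The counts are hypothetical and are not taken from the data set, which already stores the ratio.

```python
import numpy as np

# Hypothetical counts for three stations at one time step (not from the data set).
available_bikes = np.array([0, 12, 20])
n_docks = np.array([20, 24, 20])

# Loading = available bikes / docks: 0 means empty, 1 means fully loaded.
loading_example = available_bikes / n_docks
print(loading_example)
```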

*Authors: J. Guérin, ANITI & O. Roustant, INSA Toulouse. January 2021.

## 1. Preliminary

### 1.1 Load and visualize data

We load the data using pandas.

In [None]:
%config Completer.use_jedi = False # To make sure that autocompletion will work 
import pandas as pd
path    = ''  # If data already in current directory
loading = pd.read_csv(path + 'velibLoading.csv', sep = " ")
loading.head()

In [None]:
print(loading.shape)

In [None]:
loading.info(show_counts=True)  # `null_counts` is deprecated in favor of `show_counts` in recent pandas

The data set has 168 columns, one per hourly time step (from Monday 00:00 to Sunday 23:00), each giving the loading at that hour.
There are no null values, and there are 1189 stations to study.
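
To read the plots that follow, it helps to translate a column position into a day and an hour. The sketch below assumes 0-based column positions with column 0 corresponding to Monday 00:00; the helper name `step_to_day_hour` is ours and is not part of the notebook.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def step_to_day_hour(step):
    """Map a 0-based hourly step (0..167) to (day name, hour of day)."""
    if not 0 <= step < 168:
        raise ValueError("step must be in 0..167")
    day, hour = divmod(step, 24)
    return DAYS[day], hour

print(step_to_day_hour(0))    # ('Mon', 0)
print(step_to_day_hour(32))   # ('Tue', 8)
print(step_to_day_hour(167))  # ('Sun', 23)
```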

Next, we load the additional data describing the stations' locations. It gives the longitude and latitude of each station and indicates whether the station is located on a hill.

In [None]:
velibAdds = pd.read_csv(path + 'velibAdds.csv', sep = " ")
velibAdds.head()

### 1.2 Preliminary: plot the loading of the first station

To get a general overview of the variability of the loading, we plot its evolution over time. This is the graph for the first station.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set_style("darkgrid")

i = 0

loading_data = loading.to_numpy()

n_steps = loading.shape[1]
time    = np.linspace(1, n_steps, n_steps)

plt.figure(figsize = (20, 6))

plt.plot(time, loading_data[i, :], linewidth = 2, color = 'blue')
plt.vlines(x = np.linspace(1, n_steps, 8), ymin = 0, ymax = 1, 
           colors = "orange", linestyle = "dotted", linewidth = 3)

plt.xlabel('Time', fontsize = 20)
plt.ylabel('Loading', fontsize = 20)
plt.title(velibAdds.names[1 + i], fontsize = 25)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.tight_layout()
plt.show()

## 2. Descriptive statistics

### 2.1 Available bike level

We count the stations whose average available bike level (i.e. loading) is greater than or equal to a given proportion. The thresholds are 1, 0.9, 0.8, ..., 0.

In [None]:
temp=np.mean(loading,axis=1)
list_quantile=np.linspace(1,0,11)
Number_of_station=[]
for i in list_quantile:
    Number_of_station+=[np.sum(temp>=i,axis=0)]
df_temp = pd.DataFrame({'Loading>=': np.linspace(1,0,11),'Number_of_station': Number_of_station,'Ratio':np.array(Number_of_station)/1189})
df_temp
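
The same count can be written without an explicit loop. The sketch below is one possible vectorized form using NumPy broadcasting; the helper `count_at_least` and the five mean loadings are hypothetical and only illustrate the idea.

```python
import numpy as np

def count_at_least(values, thresholds):
    """For each threshold, count how many values are >= that threshold."""
    values = np.asarray(values, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # Broadcasting builds a (n_thresholds, n_values) boolean table.
    return (values[None, :] >= thresholds[:, None]).sum(axis=1)

# Hypothetical mean loadings for five stations (not from the data set).
mean_loading = np.array([0.05, 0.35, 0.5, 0.65, 0.95])
thresholds = np.array([1.0, 0.75, 0.5, 0.25, 0.0])

counts = count_at_least(mean_loading, thresholds)
ratios = counts / mean_loading.size
print(counts, ratios)
```

Dividing by `mean_loading.size` rather than a hard-coded number of stations keeps the ratio correct if the data set changes.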

### 2.2 Global service level

We count the stations whose average global service level (i.e. 1 - loading) is greater than or equal to a given proportion. The thresholds are again 1, 0.9, 0.8, ..., 0.

In [None]:
temp=np.mean(1-loading,axis=1)
list_quantile=np.linspace(1,0,11)
Number_of_station=[]
for i in list_quantile:
    Number_of_station+=[np.sum(temp>=i,axis=0)]
df_temp = pd.DataFrame({'1-Loading>=': np.linspace(1,0,11),'Number_of_station': Number_of_station,'Ratio':np.array(Number_of_station)/1189})
df_temp

We plot the loading over time for the first 16 stations.

In [None]:
fig, axs = plt.subplots(4, 4, figsize = (15,12))
for i in range(4):
    for j in range(4):
        k_station = 4 * i + j
        axs[i, j].plot(time, loading_data[k_station, :], linewidth = 1, color = 'blue')
        axs[i, j].set_title(velibAdds.names[1 + k_station], fontsize = 12)
        axs[i, j].vlines(x = np.linspace(1, n_steps, 8), ymin = 0, ymax = 1, 
                         colors = "orange", linestyle = "dotted", linewidth = 3)

for ax in axs.flat:
    ax.set_xlabel('Time', fontsize = 12)
    ax.set_ylabel('Loading', fontsize = 12)
    ax.tick_params(axis='x', labelsize=10)
    ax.tick_params(axis='y', labelsize=10)
    
plt.tight_layout()
plt.show()

The boxplot of the variables, sorted in time order.

In [None]:
plt.figure(figsize = (20,6))

bp = plt.boxplot(loading_data, widths = 0.75, patch_artist = True)

for median in bp['medians']:
    median.set(linewidth=5)
    
plt.vlines(x = np.linspace(1, n_steps, 8), ymin = 0, ymax = 1, 
           colors = "blue", linestyle = "dotted", linewidth = 5)

plt.xlabel('Time', fontsize = 20)
plt.ylabel('Loading', fontsize = 20)
plt.title("Boxplots", fontsize = 25)
plt.xticks(ticks = np.arange(0, 168, 5), labels=np.arange(0, 168, 5), fontsize = 15)
plt.yticks(fontsize = 15)

plt.tight_layout()
plt.show()

The temporal correlation of the variables.

In [None]:
# Scatter plot t versus t+h

t = 5
h = 1

plt.figure(figsize = (7, 7))

plt.scatter(loading_data[:, t], loading_data[:, t + h])

plt.title("Stations loading at t = %i versus stations loading at t = %i" % (t, t + h), fontsize = 18)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)

plt.tight_layout()
plt.show()

The scatter plot above compares, across stations, the loading at time t with the loading at time t+h. We now visualize the full correlation matrix as an image plot.

In [None]:
# Correlation matrix for 168h

CM = np.corrcoef(loading_data.T)

plt.figure(figsize = (10, 10))
plt.imshow(CM, vmin=-1)

plt.title("Correlation matrix of loading at different times", fontsize = 18)
plt.xticks(ticks = np.arange(0, 168, 24), labels = np.arange(0, 168, 24), fontsize = 15)
plt.yticks(ticks = np.arange(0, 168, 24), labels = np.arange(0, 168, 24), fontsize = 15)
plt.xlabel('Time 1 ', fontsize = 20)
plt.ylabel('Time 2', fontsize = 20)
plt.colorbar(fraction = 0.046, pad = 0.04)

plt.tight_layout()
plt.show()


In [None]:
# Correlation matrix for first 24h

CM = np.corrcoef(loading_data[:, :24].T)

plt.figure(figsize = (10, 10))
plt.imshow(CM, vmin=-1)

plt.title("Correlation matrix: first 24 hours", fontsize = 18)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel('Time 1 ', fontsize = 20)
plt.ylabel('Time 2', fontsize = 20)
plt.colorbar(fraction = 0.046, pad = 0.04)
plt.tight_layout()
plt.show()
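
Each entry (i, j) of the matrix returned by `np.corrcoef(data.T)` is the Pearson correlation, across rows, between columns i and j. The sketch below checks this on synthetic data (a random walk standing in for the loading matrix, not the Vélib data); the helper `lag_correlation` is ours.

```python
import numpy as np

def lag_correlation(data, t, h):
    """Pearson correlation, across rows, between column t and column t + h."""
    return np.corrcoef(data[:, t], data[:, t + h])[0, 1]

# Synthetic stand-in: 200 "stations" observed over 24 "hours".
rng = np.random.default_rng(0)
synthetic = np.cumsum(rng.normal(size=(200, 24)), axis=1)

full_matrix = np.corrcoef(synthetic.T)   # shape (24, 24)
value = lag_correlation(synthetic, t=5, h=1)

# The entry (t, t + h) of the full matrix is the same number.
print(value, full_matrix[5, 6])
```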

We plot the station coordinates on a 2D map (latitude versus longitude), using a different color for stations located on a hill.

In [None]:
import matplotlib.cm as cm

plt.figure(figsize = (10, 10))

# Longitude on the x-axis and latitude on the y-axis, so the map has the usual orientation
sctrplt = plt.scatter(velibAdds['longitude'], velibAdds['latitude'], c = velibAdds['bonus'], cmap = cm.Accent)

plt.xlabel('Longitude', fontsize = 20)
plt.ylabel('Latitude', fontsize = 20)
plt.title('Stations coordinates', fontsize = 30)
plt.xticks([])
plt.yticks([])
plt.legend(handles = sctrplt.legend_elements()[0], labels = ["No hill", "Hill"], fontsize = 20)
plt.show()

We repeat the analysis separately for the stations located on a hill and for those that are not.

In [None]:
# Q1

data_hill = loading_data[velibAdds["bonus"] == 1]
dataAdds_hill = velibAdds.to_numpy()[velibAdds["bonus"] == 1]

print("Number of stations on a hill: %i" % dataAdds_hill.shape[0])

fig, axs = plt.subplots(4, 4, figsize = (15,12))
for i in range(4):
    for j in range(4):
        k_station = 4 * i + j
        axs[i, j].plot(time, data_hill[k_station, :], linewidth = 1, color = 'blue')
        axs[i, j].set_title(dataAdds_hill[k_station, 3], fontsize = 12)
        axs[i, j].vlines(x = np.linspace(1, n_steps, 8), ymin = 0, ymax = 1, 
                         colors = "orange", linestyle = "dotted", linewidth = 3)

for ax in axs.flat:
    ax.set_xlabel('Time', fontsize = 12)
    ax.set_ylabel('Loading', fontsize = 12)
    ax.tick_params(axis='x', labelsize=10)
    ax.tick_params(axis='y', labelsize=10)
    
plt.tight_layout()
plt.show()

In [None]:
# Q2

plt.figure(figsize = (20,6))

bp = plt.boxplot(data_hill, widths = 0.75, patch_artist = True)

for median in bp['medians']:
    median.set(linewidth=5)
    
plt.vlines(x = np.linspace(1, n_steps, 8), ymin = 0, ymax = 1, 
           colors = "blue", linestyle = "dotted", linewidth = 5)

plt.xlabel('Time', fontsize = 20)
plt.ylabel('Loading', fontsize = 20)
plt.title("Boxplots", fontsize = 25)
plt.xticks(ticks = np.arange(0, 168, 5), labels=np.arange(0, 168, 5), fontsize = 15)
plt.yticks(fontsize = 15)

plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix 

CM = np.corrcoef(data_hill.T)

plt.figure(figsize = (10, 10))
plt.imshow(CM, vmin=-1)

plt.title("Correlation matrix of loading at different times", fontsize = 18)
plt.xticks(ticks = np.arange(0, 168, 24), labels = np.arange(0, 168, 24), fontsize = 15)
plt.yticks(ticks = np.arange(0, 168, 24), labels = np.arange(0, 168, 24), fontsize = 15)
plt.xlabel('Time 1 ', fontsize = 20)
plt.ylabel('Time 2', fontsize = 20)
plt.colorbar(fraction = 0.046, pad = 0.04)

plt.tight_layout()
plt.show()

In [None]:
# Q1

data_nohill = loading_data[velibAdds["bonus"] == 0]
dataAdds_nohill = velibAdds.to_numpy()[velibAdds["bonus"] == 0]

print("Number of stations not on a hill: %i" % dataAdds_nohill.shape[0])

fig, axs = plt.subplots(4, 4, figsize = (15,12))
for i in range(4):
    for j in range(4):
        k_station = 4 * i + j
        axs[i, j].plot(time, data_nohill[k_station, :], linewidth = 1, color = 'blue')
        axs[i, j].set_title(dataAdds_nohill[k_station, 3], fontsize = 12)
        axs[i, j].vlines(x = np.linspace(1, n_steps, 8), ymin = 0, ymax = 1, 
                         colors = "orange", linestyle = "dotted", linewidth = 3)

for ax in axs.flat:
    ax.set_xlabel('Time', fontsize = 12)
    ax.set_ylabel('Loading', fontsize = 12)
    ax.tick_params(axis='x', labelsize=10)
    ax.tick_params(axis='y', labelsize=10)
    
plt.tight_layout()
plt.show()

In [None]:
# Q2

plt.figure(figsize = (20,6))

bp = plt.boxplot(data_nohill, widths = 0.75, patch_artist = True)

for median in bp['medians']:
    median.set(linewidth=5)
    
plt.vlines(x = np.linspace(1, n_steps, 8), ymin = 0, ymax = 1, 
           colors = "blue", linestyle = "dotted", linewidth = 5)

plt.xlabel('Time', fontsize = 20)
plt.ylabel('Loading', fontsize = 20)
plt.title("Boxplots", fontsize = 25)
plt.xticks(ticks = np.arange(0, 168, 5), labels=np.arange(0, 168, 5), fontsize = 15)
plt.yticks(fontsize = 15)

plt.tight_layout()
plt.show()

In [None]:
# Q3

t = 5
h = 3

plt.figure(figsize = (7, 7))

plt.scatter(data_nohill[:, t], data_nohill[:, t + h])

plt.title("Stations loading at t = %i versus stations loading at t = %i" % (t, t + h), fontsize = 18)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)

plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix 

CM = np.corrcoef(data_nohill.T)

plt.figure(figsize = (10, 10))
plt.imshow(CM, vmin=-1)

plt.title("Correlation matrix of loading at different times", fontsize = 18)
plt.xticks(ticks = np.arange(0, 168, 24), labels = np.arange(0, 168, 24), fontsize = 15)
plt.yticks(ticks = np.arange(0, 168, 24), labels = np.arange(0, 168, 24), fontsize = 15)
plt.xlabel('Time 1 ', fontsize = 20)
plt.ylabel('Time 2', fontsize = 20)
plt.colorbar(fraction = 0.046, pad = 0.04)

plt.tight_layout()
plt.show()

## 3. Principal component analysis

In [None]:
loading = pd.read_csv(path + 'velibLoading.csv', sep = " ")

Import required libraries for PCA

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(loading) 
X_r = scaler.transform(loading) 

In [None]:
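label_dic = {0 : "No hill", 1 : "Hill"}
cmaps = plt.get_cmap("Set1")  # assumed colormap: `cmaps` was not defined in the original

def plot_pca(X_R, fig, ax, nbc, nbc2):
    """Scatter the stations on principal components nbc and nbc2, colored by hill status."""
    for i in range(2):
        xs = X_R[velibAdds.bonus == i, nbc - 1]
        ys = X_R[velibAdds.bonus == i, nbc2 - 1]
        ax.scatter(xs, ys, color = cmaps(i + 1), alpha = .8, s = 10, label = label_dic[i])
    # Axis labels report the share of variance explained by each component (uses the global `pca`)
    ax.set_xlabel("PC%d : %.2f %%" %(nbc, pca.explained_variance_ratio_[nbc - 1] * 100), fontsize = 10)
    ax.set_ylabel("PC%d : %.2f %%" %(nbc2, pca.explained_variance_ratio_[nbc2 - 1] * 100), fontsize = 10)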

In [None]:
pca = PCA()
X_r = pca.fit_transform(X_r)

Percentage of variance explained by the first 10 components

In [None]:
plt.plot(np.arange(1,11),pca.explained_variance_ratio_[0:10]*100,color='g')
plt.bar(np.arange(1,11),pca.explained_variance_ratio_[0:10]*100,color='r')
plt.show()
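
One way to turn such a plot into a number of components to keep is to look at the cumulative explained variance. The sketch below is a minimal version of that rule; the helper `components_for_variance`, the 75% target and the ratios are hypothetical and are not the values obtained on the Vélib data.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, target=0.75):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted gives the first position where cumulative >= target.
    n = int(np.searchsorted(cumulative, target)) + 1
    return min(n, len(cumulative))

# Hypothetical ratios, sorted in decreasing order as PCA returns them.
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.05, 0.05])
print(components_for_variance(ratios, target=0.75))  # 3
```

In the notebook, `pca.explained_variance_ratio_` could be passed in place of `ratios`.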

Boxplots of first 20 principal components

In [None]:
plt.figure(figsize=(10,5))
plt.boxplot(X_r[:,0:20])
plt.show()

Variables factor map

In [None]:
# coordinates of the variables
coord1 = pca.components_[0] * np.sqrt(pca.explained_variance_[0]) 
coord2 = pca.components_[1] * np.sqrt(pca.explained_variance_[1]) 
fig = plt.figure(figsize = (6,6))
ax = fig.add_subplot(1, 1, 1)
u=np.arange(1,169)
for i, j,k in zip(coord1, coord2,u ):
    plt.text(i, j, str(k),size=11)
    plt.arrow(0, 0, i, j, color = 'r')
plt.axis((-1.2, 1.2, -1.2, 1.2))
# unit circle
c = plt.Circle((0,0), radius = 1, color = 'b', fill = False)
ax.add_patch(c)
plt.show()
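
The coordinates drawn above (component times the square root of its variance) can be read as correlations between each standardized variable and each principal component, which is why the arrows stay inside the unit circle. The sketch below verifies this identity with plain NumPy on synthetic data (not the Vélib data), diagonalizing the correlation matrix directly instead of calling scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 300 observations of 4 correlated variables.
latent = rng.normal(size=(300, 2))
mixing = np.array([[1.0, 0.0], [0.8, 0.3], [0.2, 1.0], [-0.5, 0.7]])
X = latent @ mixing.T + 0.3 * rng.normal(size=(300, 4))

# Standardize, then diagonalize the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
R = Z.T @ Z / len(Z)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]            # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs                          # principal component scores
coords = eigvecs * np.sqrt(eigvals)           # coords[j, k]: variable j on PC k

# Direct computation of corr(variable j, PC k) for comparison.
corr = np.array([[np.corrcoef(Z[:, j], scores[:, k])[0, 1]
                  for k in range(4)] for j in range(4)])
print(np.allclose(coords, corr))
```

With scikit-learn the match is only approximate, because `StandardScaler` divides by n while `explained_variance_` divides by n - 1; for 1189 stations the difference is negligible.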

Contrast effect on the second principal component

In [None]:
temp=pca.components_[1] * np.sqrt(pca.explained_variance_[1]) 
K=np.zeros(168)
plt.figure(figsize = (10, 5))
plt.vlines(x = K, ymin = 0, ymax = 1, colors = "white", linewidth = 3)
for i in range(0,168):
    if (temp[i]>0.25):
        plt.vlines(x = i, ymin = 0, ymax = 1, colors = "blue", linewidth = 1)
    if (temp[i]< -0.25):
        plt.vlines(x = i, ymin = 0, ymax = 1, colors = "red", linewidth = 1)
    if (temp[i]> -0.25 and temp[i]<0.25):
        plt.vlines(x = i, ymin = 0, ymax = 1, colors = "white", linewidth = 1)
plt.vlines(x = np.linspace(1, n_steps, 8), ymin = 0, ymax = 1, 
           colors = "black", linewidth = 3)

## 4. Clustering

In [None]:
def crossTable(classe1, classe2):
    """Cross-tabulate two labelings (raw data first, PCA second) and reorder to ease comparison."""
    table = pd.crosstab(classe1, classe2, 
                        rownames = ['raw data classes'], colnames = ['PCA classes'])
    a = np.zeros(np.shape(table)[0])
    b = np.zeros(np.shape(table)[0], dtype = int)   # integer column labels, usable as keys
    for j in range (0, np.shape(table)[0]):
        for i in range (0, np.shape(table)[0]):
            if (a[j] < table[i][j]):
                a[j] = table[i][j]                  # largest count found in row j
                b[j] = i                            # column where it occurs
                                             
    print ("")
    print ("row maxima", a)
    print ("matching columns", b)
    print ("")
    tablebis = np.copy(table)
    for i in range (0, np.shape(table)[0]):
        tablebis[i][:] = table[b[i]][:]        
    return tablebis
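
The idea behind this comparison can be sketched on a toy case: build the contingency table of two labelings, then look at the largest entry of each row. The helper `contingency` and the two labelings below are hypothetical; note that this simple row-wise matching is not guaranteed to be one-to-one.

```python
import numpy as np

def contingency(labels_a, labels_b):
    """Count co-occurrences of 0-based integer labels from two partitions."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    table = np.zeros((labels_a.max() + 1, labels_b.max() + 1), dtype=int)
    np.add.at(table, (labels_a, labels_b), 1)
    return table

# Hypothetical labelings of eight individuals: similar groups, different names.
a = np.array([0, 0, 0, 1, 1, 2, 2, 2])
b = np.array([2, 2, 2, 0, 0, 1, 1, 0])

table = contingency(a, b)
best_match = table.argmax(axis=1)             # closest class of b for each class of a
agreement = table.max(axis=1).sum() / len(a)  # share of individuals in matched classes
print(table)
print(best_match, agreement)
```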

### 4.1 Hierarchical Ascending Classification

#### 4.1.1 First, we perform HAC on the full data

In [None]:
loading = pd.read_csv(path + 'velibLoading.csv', sep = " ")

Import required libraries for HAC

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

Heights of the last merges versus number of classes, and cluster dendrogram

In [None]:
plt.figure(figsize = (8, 4))
#### Heights of the last merges, used to choose the number of classes
Z = linkage(loading, 'ward', metric = 'euclidean') 
height = Z[:, 2]  
x = np.arange(10) + 1
height = sorted(height, reverse = True)
plt.subplot(1,2,1)
plt.scatter(x, height[0:10])
plt.xlabel('Index')
plt.ylabel('Height')
plt.title("Choice of the number of classes")


#### Cluster dendrogram
plt.subplot(1,2,2)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Individuals')
plt.ylabel('Distance')
dendrogram(Z,leaf_font_size = 8., labels = loading.index)
plt.show()


We cut the dendrogram at distance = 31 to get exactly 6 groups

In [None]:
classesCAH = fcluster(Z, t = 31, criterion = 'distance')
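
Reading a cut height off the dendrogram works, but SciPy can also be asked for a number of groups directly. The sketch below uses `criterion="maxclust"`, which returns at most `t` flat clusters, on synthetic blobs (not the Vélib data); the `_demo` names are ours.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Synthetic data: three well-separated blobs of 20 points each.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

Z_demo = linkage(points, method="ward")
# Ask for at most 3 flat clusters instead of giving a cut height.
labels_demo = fcluster(Z_demo, t=3, criterion="maxclust")

# Cluster sizes (fcluster labels start at 1, so index 0 is dropped).
print(np.bincount(labels_demo)[1:])
```

In the notebook, `fcluster(Z, t = 6, criterion = 'maxclust')` would be the analogous call; we have not checked that it reproduces the cut at distance 31 exactly.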

Graph of each group projected on the first two principal components

In [None]:
echantillon1 = X_r[:,0]
echantillon2 = X_r[:,1]
coul = ['b', 'r', 'g', 'k', 'y','purple']
plt.figure(figsize = (5, 5))
for i, j, nom, indcoul in zip(echantillon1, echantillon2, 
                              np.linspace(1, np.shape(X_r[:,:])[0], 
                                          num=np.shape(X_r[:,:])[0]), classesCAH):
    plt.scatter(i, j, c = coul[indcoul - 1])
#plt.axis((-2,2,-1,1))  
plt.show()

Boxplots of each group and the center of each class

In [None]:
plt.figure(figsize = (20,10))
for i in range(1,7):
    plt.subplot(3,2,i)
    bp = plt.boxplot(loading[classesCAH==i], widths = 0.5, patch_artist = True)
    plt.plot(np.mean(loading[classesCAH==i],axis=0), color='black',linewidth=3)
    plt.vlines(x = np.linspace(1, n_steps, 8), ymin = 0, ymax = 1, 
               colors = "red", linewidth = 3)
    plt.xlabel('Time', fontsize = 20)
    plt.ylabel('Loading', fontsize = 20)
    plt.title("Boxplots", fontsize = 25)
    plt.xticks(ticks = np.arange(0, 168, 5), labels=np.arange(0, 168, 5), fontsize = 15)
    plt.yticks(fontsize = 15)
    plt.tight_layout()
plt.show()

#### 4.1.2 Second, we perform HAC on the first 5 principal components

In [None]:
plt.figure(figsize = (8, 4))

#### Heights of the last merges, used to choose the number of classes
Z1 = linkage(X_r[:,0:5], 'ward', metric = 'euclidean') 
height = Z1[:, 2]
x = np.arange(10) + 1
height = sorted(height, reverse = True)
plt.subplot(1,2,1)
plt.scatter(x, height[0:10])
plt.xlabel('Index')
plt.ylabel('Height')
plt.title("Choice of the number of classes")

#### Cluster dendrogram
plt.subplot(1,2,2)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Individuals')
plt.ylabel('Distance')
dendrogram(Z1,leaf_font_size = 8., labels = loading.index)
plt.show()


We cut the dendrogram at distance = 80 to get exactly 6 groups

In [None]:
classesCAH1 = fcluster(Z1, t = 80, criterion = 'distance')

Graph of each group projected on the first two principal components

In [None]:
echantillon1 = X_r[:,0]
echantillon2 = X_r[:,1]
coul = ['b', 'r', 'g', 'k', 'y','purple']
plt.figure(figsize = (5, 5))
for i, j, nom, indcoul in zip(echantillon1, echantillon2, 
                              np.linspace(1, np.shape(X_r[:,:])[0], 
                                          num=np.shape(X_r[:,:])[0]), classesCAH1):
    plt.scatter(i, j, c = coul[indcoul - 1])
#plt.axis((-2,2,-1,1))  
plt.show()

Boxplots of each group and the center of each class

In [None]:
plt.figure(figsize = (20,10))
for i in range(1,7):
    plt.subplot(3,2,i)
    bp = plt.boxplot(loading[classesCAH1==i], widths = 0.5, patch_artist = True)
    plt.plot(np.mean(loading[classesCAH1==i],axis=0), color='black',linewidth=3)
    plt.vlines(x = np.linspace(1, n_steps, 8), ymin = 0, ymax = 1, 
               colors = "red", linewidth = 3)
    plt.xlabel('Time', fontsize = 20)
    plt.ylabel('Loading', fontsize = 20)
    plt.title("Boxplots", fontsize = 25)
    plt.xticks(ticks = np.arange(0, 168, 5), labels=np.arange(0, 168, 5), fontsize = 15)
    plt.yticks(fontsize = 15)
    plt.tight_layout()
plt.show()

In [None]:
crossTable(classesCAH.astype(np.int32)-1,classesCAH1.astype(np.int32)-1)

### 4.2 K-means

In [None]:
loading = pd.read_csv(path + 'velibLoading.csv', sep = " ")

Import required libraries for K-means

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np
# seed the random number generator
np.random.seed(42)
# Plots
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
# Ignore spurious warnings (cf. SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

We apply the K-means method with n_clusters = 4, 5, 6, 7

In [None]:
kmeans_per_k = [KMeans(n_clusters=k, random_state=42).fit(loading)
                for k in range(4, 8)]
silhouette_scores = [silhouette_score(loading, model.labels_)
                     for model in kmeans_per_k[0:]]
y_pred = kmeans_per_k[2].labels_

Silhouette plots for each number of clusters

In [None]:
from sklearn.metrics import silhouette_samples
from matplotlib.ticker import FixedLocator, FixedFormatter

plt.figure(figsize=(11, 9))

for k in ( 4, 5, 6,7):
    plt.subplot(2, 2, k -3)
    
    y_pred = kmeans_per_k[k - 4].labels_
    silhouette_coefficients = silhouette_samples(loading, y_pred)

    padding = len(loading) // 30
    pos = padding
    ticks = []
    for i in range(k):
        coeffs = silhouette_coefficients[y_pred == i]
        coeffs.sort()
        cmap = plt.get_cmap("Spectral")  # matplotlib.cm.get_cmap is no longer available in recent Matplotlib releases
        color = cmap(i / k)
        plt.fill_betweenx(np.arange(pos, pos + len(coeffs)), 0, coeffs,
                          facecolor=color, edgecolor=color, alpha=0.7)
        ticks.append(pos + len(coeffs) // 2)
        pos += len(coeffs) + padding

    plt.gca().yaxis.set_major_locator(FixedLocator(ticks))
    plt.gca().yaxis.set_major_formatter(FixedFormatter(range(k)))
    if k in (4, 6):
        plt.ylabel("Cluster")
    
    if k in (5, 7):
        plt.gca().set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
        plt.xlabel("Silhouette Coefficient")
    else:
        plt.tick_params(labelbottom=False)

    plt.axvline(x=silhouette_scores[k - 4], color="red", linestyle="--")
    plt.title("$k={}$".format(k), fontsize=16)
plt.show()
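
For reference, the silhouette coefficient of a point is s = (b - a) / max(a, b), where a is its mean distance to the other points of its own cluster and b is the smallest mean distance to the points of another cluster. The sketch below computes it from scratch with Euclidean distances on a hypothetical toy example; it is meant to illustrate the formula, not to replace `silhouette_samples`, and it needs at least two clusters.

```python
import numpy as np

def silhouette_values(X, labels):
    """Silhouette s = (b - a) / max(a, b) for each point, Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a singleton cluster gets s = 0
        # a: mean distance to the other points of the same cluster
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Hypothetical 1-D example: two tight, well-separated groups.
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])
s_toy = silhouette_values(X_toy, labels_toy)
print(s_toy, s_toy.mean())
```

For the first point, a = 1 and b = (10 + 11) / 2 = 10.5, so s = 9.5 / 10.5, close to 1 as expected for well-separated groups.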

#### 4.2.1 First, we perform K-means on the full data

In [None]:
k=4
kmeans=KMeans(n_clusters=k, random_state=42).fit(loading)
kclasses = kmeans.labels_

Boxplots of each group and the center of each class

In [None]:
plt.figure(figsize = (20,8))
for i in range(0,4):
    plt.subplot(2,2,i+1)
    bp = plt.boxplot(loading[kmeans.labels_==i], widths = 0.5, patch_artist = True)
    plt.plot(kmeans.cluster_centers_[i], color='black',linewidth=3)
    plt.vlines(x = np.linspace(1, n_steps, 8), ymin = 0, ymax = 1, 
               colors = "red", linewidth = 3)
    plt.xlabel('Time', fontsize = 20)
    plt.ylabel('Loading', fontsize = 20)
    plt.title("Boxplots", fontsize = 25)
    plt.xticks(ticks = np.arange(0, 168, 5), labels=np.arange(0, 168, 5), fontsize = 15)
    plt.yticks(fontsize = 15)
    plt.tight_layout()
plt.show()

- The center of class 1 corresponds to a high daily usage at every hour of the week.

- The center of class 2 corresponds to a relatively higher daily usage in the middle of the day than at the beginning and the end of the day.

- Contrary to class 2, the center of class 3 corresponds to a relatively higher daily usage at the beginning and the end of the day than in the middle of the day.

- Contrary to class 1, the center of class 4 corresponds to a low daily usage at every hour of the week.

#### 4.2.2 Second, we perform K-means on the first 5 principal components

In [None]:
k=4
kmeans1=KMeans(n_clusters=k, random_state=42).fit(X_r[:,:5])
kclasses1 = kmeans1.labels_

Boxplots of each group and the center of each class

In [None]:
plt.figure(figsize = (20,8))
for i in range(0,4):
    plt.subplot(2,2,i+1)
    bp = plt.boxplot(loading[kmeans1.labels_==i], widths = 0.5, patch_artist = True)
    plt.plot(np.mean(loading[kmeans1.labels_==i],axis=0), color='black',linewidth=3)
    plt.vlines(x = np.linspace(1, n_steps, 8), ymin = 0, ymax = 1, 
               colors = "red", linewidth = 3)
    plt.xlabel('Time', fontsize = 20)
    plt.ylabel('Loading', fontsize = 20)
    plt.title("Boxplots", fontsize = 25)
    plt.xticks(ticks = np.arange(0, 168, 5), labels=np.arange(0, 168, 5), fontsize = 15)
    plt.yticks(fontsize = 15)
    plt.tight_layout()
plt.show()

Individual map on full data vs on first 5 principal components

In [None]:
echantillon1 = X_r[:,0]
echantillon2 = X_r[:,1]
coul = ['b', 'r', 'g', 'k', 'y','purple']
plt.figure(figsize = (8, 4))
plt.subplot(1,2,1)
for i, j, nom, indcoul in zip(echantillon1, echantillon2, 
                              np.linspace(1, np.shape(X_r[:,:])[0], 
                                          num=np.shape(X_r[:,:])[0]), kmeans.labels_):
    plt.scatter(i, j, c = coul[indcoul - 1])
    plt.title("on full data")
#plt.axis((-2,2,-1,1))  

plt.subplot(1,2,2)
echantillon1 = X_r[:,0]
echantillon2 = X_r[:,1]
coul = ['b', 'r', 'g', 'k', 'y','purple']
for i, j, nom, indcoul in zip(echantillon1, echantillon2, 
                              np.linspace(1, np.shape(X_r[:,:])[0], 
                                          num=np.shape(X_r[:,:])[0]), kmeans1.labels_):
    plt.scatter(i, j, c = coul[indcoul - 1])
    plt.title("on first 5 principal components")
#plt.axis((-2,2,-1,1))  
plt.show()


In [None]:
crossTable(kmeans.labels_, kmeans1.labels_)

### 4.3 Gaussian Mixture

scikit-learn's `GaussianMixture` offers fewer covariance structures than the package we used in R: its `covariance_type` must be one of 'full', 'tied', 'diag' or 'spherical'. Starting from the result obtained in R (n_cluster=6), we fit the model on the first 5 principal components and use covariance_type 'full' instead of 'VVE', which scikit-learn does not support.

In [None]:
from sklearn.mixture import GaussianMixture
# GMM fitted on the first 5 principal components
gmm = GaussianMixture(n_components = 6,covariance_type='full').fit(X_r[:,:5])
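
To give a sense of what choosing 'full' costs, the sketch below counts the free covariance parameters for each scikit-learn `covariance_type`, with 6 components in 5 dimensions as used here. The helper `gmm_covariance_params` is ours; means and mixture weights are not counted.

```python
def gmm_covariance_params(n_components, n_features, covariance_type):
    """Number of free covariance parameters for a scikit-learn covariance_type."""
    k, d = n_components, n_features
    per_full_matrix = d * (d + 1) // 2   # symmetric d x d matrix
    counts = {
        "full": k * per_full_matrix,     # one full matrix per component
        "tied": per_full_matrix,         # one full matrix shared by all components
        "diag": k * d,                   # one diagonal per component
        "spherical": k,                  # one variance per component
    }
    return counts[covariance_type]

for cov_type in ("full", "tied", "diag", "spherical"):
    print(cov_type, gmm_covariance_params(6, 5, cov_type))
# full 90, tied 15, diag 30, spherical 6
```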

In [None]:
# class assignment
classesGMM = gmm.predict(X_r[:,:5])
# class sizes
pd.DataFrame(classesGMM).hist()

Individual map of Gaussian mixture model on first two principal components

In [None]:
echantillon1 = X_r[:,0]
echantillon2 = X_r[:,1]
coul = ['b', 'r', 'g', 'k', 'y','purple']
plt.figure(figsize = (5, 5))
for i, j, nom, indcoul in zip(echantillon1, echantillon2, 
                              np.linspace(1, np.shape(X_r[:,:])[0], 
                                          num=np.shape(X_r[:,:])[0]), classesGMM):
    plt.scatter(i, j, c = coul[indcoul - 1])
#plt.axis((-2,2,-1,1))  
plt.show()

## 5. Plot on real maps (K-means case)

In [None]:
import gmaps
import gmaps.datasets
locations = velibAdds[['latitude', 'longitude']]
group = []
color=['blue','green','red','yellow','black','white','cyan','magenta']


fig = gmaps.figure()
for i in range(len(np.unique(kclasses))):
    # K-means labels start at 0, so class i is selected with kclasses == i
    group=group+[gmaps.symbol_layer(
    locations[kclasses==i], fill_color=color[i], scale=3)]
    fig.add_layer(group[i])


In [None]:
gmaps.configure(api_key='YOUR_API_KEY')  # placeholder: supply your own Google Maps API key
fig