The goal is to standardize the geographic coordinates and the addresses.

## 1. Import

In [41]:
import numpy as np
import pandas as pd

In [42]:
# Set the working directory:
### Hermine
%cd C:\Users\h.berthon\Desktop\Data Sc\Projet Velib

df = pd.read_csv('2018-2021_comptage-velo-donnees-compteurs.csv')

C:\Users\h.berthon\Desktop\Data Sc\Projet Velib


## 2. New DataFrame

In [43]:
# New df with fewer columns
df_lat_lon= df.drop(['Unnamed: 0','Id_old','Address_old','Date_count', 'Date_instal', 'Photo_old', 
                      'Coord', 'Latitude', 'Longitude', 'Direction', 'Y_Date_Count', 'M_Date_Count', 'D_Date_Count',
                     'Dweek_Date_Count', 'H_Date_Count', 'Y_Date_Instal', 'M_Date_Instal', 'D_Date_Instal'
                     ], axis =1)

df_lat_lon.head()

Unnamed: 0,Id,Address,Count_by_hour,Coord_old,Source
0,100003096,97 avenue Denfert Rochereau,0.0,"48.83511,2.33338",2021
1,100003096,97 avenue Denfert Rochereau,0.0,"48.83511,2.33338",2021
2,100003096,97 avenue Denfert Rochereau,0.0,"48.83511,2.33338",2021
3,100003096,97 avenue Denfert Rochereau,3.0,"48.83511,2.33338",2021
4,100003096,97 avenue Denfert Rochereau,7.0,"48.83511,2.33338",2021


In [44]:
# Change the format of the geographic coordinates

# Create the 'Latitude' and 'Longitude' columns
df_lat_lon[['Latitude', 'Longitude']] = df_lat_lon['Coord_old'].str.split(',', expand = True)

# The coordinates are in decimal degrees (DD); keep 5 decimal places
df_lat_lon['Latitude'] = df_lat_lon['Latitude'].astype('float64').round(5)
df_lat_lon['Longitude'] = df_lat_lon['Longitude'].astype('float64').round(5)

# New column with the rounded coordinates
df_lat_lon['Coord'] = df_lat_lon['Latitude'].astype('str') + ',' + df_lat_lon['Longitude'].astype('str')

# Drop the columns that are no longer needed
df_lat_lon = df_lat_lon.drop(['Coord_old', 'Latitude', 'Longitude'], axis=1)


In [45]:
df_lat_lon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1723879 entries, 0 to 1723878
Data columns (total 5 columns):
 #   Column         Dtype  
---  ------         -----  
 0   Id             int64  
 1   Address        object 
 2   Count_by_hour  float64
 3   Source         int64  
 4   Coord          object 
dtypes: float64(1), int64(2), object(2)
memory usage: 65.8+ MB


In [46]:
# Number of unique values before processing the data
df_lat_lon.nunique()

Id                 71
Address           108
Count_by_hour    1054
Source              4
Coord              84
dtype: int64

The number of identifiers (71) matches neither the number of addresses (108) nor the number of geographic coordinates (84).
There are therefore errors introduced by merging the 2018 to 2021 dataframes.
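One way to locate the mismatch is to count the distinct addresses and coordinates recorded for each `Id`. The sketch below uses a small hypothetical frame (`toy`, `per_id` and `inconsistent` are illustrative names, and the rows are not the project data):

```python
import pandas as pd

# Hypothetical sample mimicking the columns of df_lat_lon (not the real data)
toy = pd.DataFrame({
    'Id': [1, 1, 1, 2, 2],
    'Address': ['10 Bd Auguste Blanqui NE-SO',
                '10 boulevard Auguste Blanqui NE-SO',
                '10 boulevard Auguste Blanqui NE-SO',
                '100 Rue La Fayette O-E',
                '100 Rue La Fayette O-E'],
    'Coord': ['48.8309,2.35324', '48.83068,2.35348', '48.83068,2.35348',
              '48.87746,2.35008', '48.87746,2.35008'],
    'Source': [2018, 2020, 2021, 2018, 2021],
})

# Number of distinct addresses and coordinates per counter Id
per_id = toy.groupby('Id')[['Address', 'Coord']].nunique()

# Ids that carry more than one spelling or more than one position
inconsistent = per_id[(per_id['Address'] > 1) | (per_id['Coord'] > 1)]
print(inconsistent)
```

Applied to `df_lat_lon`, the same grouping would list the counters whose address or position changed between yearly files.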

## 3. Processing addresses and geographic coordinates

In [47]:
df_lat_lon[['Id','Coord','Address', 'Source']].groupby(['Address', 'Id','Coord', 'Source']).count().head(60)

Address,Id,Coord,Source
10 avenue de la Grande Armée SE-NO,100044494,"48.87472,2.29244",2018
10 avenue de la Grande Armée SE-NO,100044494,"48.87472,2.29244",2019
10 Bd Auguste Blanqui NE-SO,100049408,"48.8309,2.35324",2018
10 Bd Auguste Blanqui NE-SO,100049408,"48.8309,2.35324",2019
10 avenue de la Grande Armée SE-NO,100044494,"48.87472,2.29244",2020
10 avenue de la Grande Armée SE-NO,100044494,"48.87472,2.29244",2021
10 boulevard Auguste Blanqui NE-SO,100049408,"48.83068,2.35348",2020
10 boulevard Auguste Blanqui NE-SO,100049408,"48.83068,2.35348",2021
100 Rue La Fayette O-E,100003099,"48.87746,2.35008",2018
100 Rue La Fayette O-E,100003099,"48.87746,2.35008",2019


The first rows show that, depending on the year, neither the spelling of the addresses nor the geographic coordinates are consistent. To fix this, I will write a program that replaces the "aberrant" values with those of 2021, which cover both the recent counters and the oldest ones.
A first look reveals several anomalies: capitalization, abbreviations, possibly accents and extra spaces (not visible when the DataFrame is displayed), etc.

In [48]:
# Counting-site address in upper case
df_lat_lon['Address'] = df_lat_lon['Address'].str.upper()

# Sort by address
df_lat_lon = df_lat_lon.sort_values(by = ['Address'], axis=0)

df_lat_lon.head()

Unnamed: 0,Id,Address,Count_by_hour,Source,Coord
1712536,100044494,10 AVENUE DE LA GRANDE ARMÉE SE-NO,14.0,2018,"48.87472,2.29244"
1565009,100044494,10 AVENUE DE LA GRANDE ARMÉE SE-NO,11.0,2019,"48.87472,2.29244"
1565008,100044494,10 AVENUE DE LA GRANDE ARMÉE SE-NO,3.0,2019,"48.87472,2.29244"
1565007,100044494,10 AVENUE DE LA GRANDE ARMÉE SE-NO,17.0,2019,"48.87472,2.29244"
1565006,100044494,10 AVENUE DE LA GRANDE ARMÉE SE-NO,2.0,2019,"48.87472,2.29244"


In [49]:
# Number of unique values
df_lat_lon.nunique()

Id                 71
Address            81
Count_by_hour    1054
Source              4
Coord              84
dtype: int64

We went from 108 unique addresses to 81.

In [50]:
# Remove accents
import unicodedata
def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')
df_lat_lon['Address'] = df_lat_lon['Address'].apply(strip_accents)
df_lat_lon.nunique()

Id                 71
Address            80
Count_by_hour    1054
Source              4
Coord              84
dtype: int64

We went from 81 unique addresses to 80.

In [51]:
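# Remove extra spaces
# Series.replace only matches whole cell values; .str.replace works inside each string
df_lat_lon['Address'] = df_lat_lon['Address'].str.replace(r'\s+', ' ', regex=True).str.strip()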
df_lat_lon.nunique()

Id                 71
Address            80
Count_by_hour    1054
Source              4
Coord              84
dtype: int64
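Note that `Series.replace` called with plain strings compares entire cell values, whereas `Series.str.replace` substitutes inside each string. A minimal sketch of the difference, on hypothetical addresses (the values and the names `s`, `unchanged` and `cleaned` are illustrative only):

```python
import pandas as pd

# Hypothetical addresses with a doubled space and a trailing space
s = pd.Series(['10  AVENUE DE LA GRANDE ARMEE', '100 RUE LA FAYETTE O-E '])

# Series.replace compares whole cell values: no cell equals '  ', so nothing changes
unchanged = s.replace('  ', ' ')

# The .str accessor works inside each string: collapse whitespace runs, then trim
cleaned = s.str.replace(r'\s+', ' ', regex=True).str.strip()
print(cleaned.tolist())
```

An address count that does not move after this step can therefore mean either that there were no extra spaces or that the call never looked inside the strings.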

In [129]:
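# For each counter Id, overwrite every address with the one recorded in 2021
for i in df_lat_lon['Id'].unique():
    dfi = df_lat_lon[df_lat_lon['Id'] == i]
    address_2021 = dfi['Address'][dfi['Source'] == 2021].unique()
    # Chained indexing: this is the line that triggers SettingWithCopyWarning
    df_lat_lon['Address'][(df_lat_lon['Id'] == i)] = address_2021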

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_lat_lon['Address'][(df_lat_lon['Id'] == i)] = address_2021


ValueError: Length of replacements must equal series length
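The warning comes from chained indexing (`df[col][mask] = ...`). The `ValueError` is consistent with an `Id` whose 2021 subset yields zero or several distinct addresses, so the assigned array cannot be aligned with the selected rows. A minimal sketch of a safer loop, assuming the first 2021 spelling is an acceptable reference and using a hypothetical frame `toy` (rows invented for illustration):

```python
import pandas as pd

# Hypothetical sample: Id 3 has no 2021 record at all
toy = pd.DataFrame({
    'Id': [1, 1, 1, 2, 2, 3],
    'Address': ['10 BD AUGUSTE BLANQUI NE-SO',
                '10 BOULEVARD AUGUSTE BLANQUI NE-SO',
                '10 BOULEVARD AUGUSTE BLANQUI NE-SO',
                '100 RUE LA FAYETTE O-E',
                '100 RUE LA FAYETTE O-E',
                '27 QUAI DE LA TOURNELLE'],
    'Source': [2018, 2020, 2021, 2019, 2021, 2018],
})

for i in toy['Id'].unique():
    is_id = toy['Id'] == i
    address_2021 = toy.loc[is_id & (toy['Source'] == 2021), 'Address'].unique()
    if len(address_2021) == 0:
        continue  # no 2021 record for this Id: leave its addresses untouched
    # Single .loc call (no chained indexing) and a scalar value (no length mismatch)
    toy.loc[is_id, 'Address'] = address_2021[0]

print(toy.groupby('Id')['Address'].nunique())
```

Taking `address_2021[0]` silently picks one spelling when 2021 itself contains several; whether that is acceptable depends on the data and would need checking.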

In [130]:
df_lat_lon.nunique()

Id                 71
Address            76
Count_by_hour    1054
Source              4
Coord              84
dtype: int64

In [131]:
df_lat_lon['Id'].unique()

array([100044494, 100049408, 100003099, 100047536, 100003097, 100003098,
       100056040, 100056044, 100006300, 100056041, 100049407, 100047534,
       100036719, 100056039, 100056036, 100063175, 100047539, 100060174,
       100050132, 300014702, 100047537, 100041488, 100056336, 100007049,
       100056225, 100060175, 100056042, 100047550, 100065336, 100056330,
       100056334, 100036718, 100056038, 100047547, 100044493, 100063174,
       100044495, 100056034, 100047540, 100044506, 100056326, 100063173,
       100047544, 100056032, 100047533, 100060178, 100003096, 100047545,
       100047538, 100047548, 100056327, 100056331, 100047542, 100047546,
       100056226, 100056329, 100056332, 100056045, 100056047, 100056223,
       100056046, 100047551, 100047535, 100047549, 100047541, 100056335,
       100050876, 100057445, 100057329, 100057380, 100042374], dtype=int64)

In [52]:
dfi = df_lat_lon[df_lat_lon['Id'] == 100044494]
address2021 = dfi['Address'][dfi['Source']== 2021].unique()
#df_lat_lon['Address'] = dfi['Address'].replace(value = address2021)
dfi['Address'].unique()
#address2021

TypeError: 'regex' must be a string or a compiled regular expression or a list or dict of strings or regular expressions, you passed a 'bool'

In [None]:
# put the old (2019) addresses in one list and the 2021 addresses in another
#my_dict = dict(zip(numlist, abclist))
#df.replace(my_dict, inplace=True)
# or else write a program
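The dictionary idea noted above could look like the following sketch. It assumes that each `Id` has a single reference spelling in 2021. The frame `toy` and the names `ref`, `pairs`, `old_list` and `new_list` are hypothetical. Unlike the failing calls in this notebook, `replace` receives the values to replace (here as a dict), not only `value=`.

```python
import pandas as pd

# Hypothetical sample (not the project data)
toy = pd.DataFrame({
    'Id': [1, 1, 1, 2, 2],
    'Address': ['10 BD AUGUSTE BLANQUI NE-SO',
                '10 BOULEVARD AUGUSTE BLANQUI NE-SO',
                '10 BOULEVARD AUGUSTE BLANQUI NE-SO',
                '100 RUE LA FAYETTE O-E',
                '100 RUE LA FAYETTE O-E'],
    'Source': [2018, 2020, 2021, 2019, 2021],
})

# Reference spelling per Id: first address found in the 2021 rows
ref = (toy[toy['Source'] == 2021]
       .drop_duplicates('Id')
       .set_index('Id')['Address'])

# Collect every (old spelling, 2021 spelling) pair that actually differs
pairs = toy[['Id', 'Address']].drop_duplicates()
old_list, new_list = [], []
for i, old in zip(pairs['Id'], pairs['Address']):
    if i in ref.index and old != ref.loc[i]:
        old_list.append(old)
        new_list.append(ref.loc[i])

# Old spelling -> 2021 spelling, then exact-match replacement on the column
my_dict = dict(zip(old_list, new_list))
toy['Address'] = toy['Address'].replace(my_dict)
print(my_dict)
```

The mapping is keyed on the address string, so it only behaves well if the same old spelling never belongs to two different counters.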

In [119]:
df_lat_lon['Address'][(df_lat_lon['Id'] == 100049408)].unique()

array(['10 BOULEVARD AUGUSTE BLANQUI NE-SO'], dtype=object)

In [112]:
address_2021 = df_lat_lon['Address'][(df_lat_lon['Source'] == 2021) & (df_lat_lon['Id'] == 100049408)].unique()
print(address_2021)

['10 BOULEVARD AUGUSTE BLANQUI NE-SO']


In [118]:
df_lat_lon['Address'][(df_lat_lon['Id'] == 100049408)] = address_2021

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_lat_lon['Address'][(df_lat_lon['Id'] == 100049408)] = address_2021


In [117]:
df_lat_lon['Address'][(df_lat_lon['Id'] == 100049408)] = df_lat_lon['Address'][(df_lat_lon['Id'] == 100049408)].replace(value = address_2021)

TypeError: 'regex' must be a string or a compiled regular expression or a list or dict of strings or regular expressions, you passed a 'bool'

In [None]:
df1 = df_lat_lon[['Id','Coord','Address', 'Source']].groupby(['Address', 'Id','Coord', 'Source']).count().reset_index()
df1.head()