Processing steps
---
- Merge all the Datasets
- Merge the forename and surname columns into a single column
- Compute each driver's age at the time of each race
- Drop unused columns
- Export the merged Dataset

In [1]:
import sys
sys.path.append('../../')

import pandas as pd

In [2]:
from src.transformer_utils import print_df_missing_values

## 1. Loading the Datasets

In [3]:
df_constructors = pd.read_csv('../../data/outputs/04_constructors_processed.csv')
df_drivers = pd.read_csv('../../data/outputs/03_drivers_processed.csv')
df_races_results = pd.read_csv('../../data/outputs/02_races_results.csv')

In [4]:
df_constructors.head()

Unnamed: 0,constructorId,name,nationality,constructor_is_active
0,1,McLaren,British,1
1,2,BMW Sauber,German,0
2,3,Williams,British,1
3,4,Renault,French,0
4,5,Toro Rosso,Italian,0


In [5]:
df_drivers.head()

Unnamed: 0,driverId,number,code,forename,surname,dob,nationality,driver_is_active
0,1,44,HAM,Lewis,Hamilton,1985-01-07,British,1
1,2,\N,HEI,Nick,Heidfeld,1977-05-10,German,0
2,3,6,ROS,Nico,Rosberg,1985-06-27,German,0
3,4,14,ALO,Fernando,Alonso,1981-07-29,Spanish,1
4,5,\N,KOV,Heikki,Kovalainen,1981-10-19,Finnish,0


In [6]:
df_races_results.head()

Unnamed: 0,resultId,raceId,driverId,constructorId,grid,race_rank,points,laps,milliseconds,fastestLap,fastestLapTime,fastestLapSpeed,statusId,year,round,circuitId,name
0,1,18,1,1,1,1,10.0,58,5690616,39,87452,218.3,1,2008,1,1,Australian Grand Prix
1,2,18,2,2,5,2,8.0,58,5696094,41,87739,217.586,1,2008,1,1,Australian Grand Prix
2,3,18,3,3,7,3,6.0,58,5698779,41,88090,216.719,1,2008,1,1,Australian Grand Prix
3,4,18,4,4,11,4,5.0,58,5707797,58,88603,215.464,1,2008,1,1,Australian Grand Prix
4,5,18,5,1,3,5,4.0,58,5708630,43,87418,218.385,1,2008,1,1,Australian Grand Prix


## 2. Merging the Datasets

### 2.1. Renaming columns in the Datasets

Both the results Dataset and the constructors Dataset have a `name` column, so we rename it in each Dataset before merging to keep the two apart. The constructors' `nationality` column is also renamed to `country`, since the drivers Dataset has a `nationality` column of its own.

In [7]:
df_races_results.rename({'name': 'circuit_name'}, axis=1, inplace=True)
df_constructors.rename({'name': 'constructor_name'}, axis=1, inplace=True)
df_constructors.rename({'nationality': 'country'}, axis=1, inplace=True)
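
To illustrate why the renaming matters, here is a minimal sketch on toy data (the two small DataFrames are made up for the example). When both sides of a merge share a non-key column, pandas keeps both and appends its default `_x` / `_y` suffixes, which makes the result harder to read than explicit names.

```python
import pandas as pd

# Toy stand-ins for the results and constructors Datasets (illustrative only)
results = pd.DataFrame({
    'constructorId': [1, 2],
    'name': ['Australian Grand Prix', 'Australian Grand Prix'],
})
constructors = pd.DataFrame({
    'constructorId': [1, 2],
    'name': ['McLaren', 'BMW Sauber'],
})

# Without renaming, the shared 'name' column gets the default suffixes
clashing = results.merge(constructors, on='constructorId', how='inner')
print(list(clashing.columns))  # ['constructorId', 'name_x', 'name_y']

# With explicit names, the merged columns are self-describing
renamed = results.rename(columns={'name': 'circuit_name'}).merge(
    constructors.rename(columns={'name': 'constructor_name'}),
    on='constructorId',
    how='inner',
)
print(list(renamed.columns))  # ['constructorId', 'circuit_name', 'constructor_name']
```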

### 2.2. Merging the constructors and results Datasets

In [8]:
df_const_race = df_races_results.merge(df_constructors, on='constructorId', how='inner')

In [9]:
df_const_race.head()

Unnamed: 0,resultId,raceId,driverId,constructorId,grid,race_rank,points,laps,milliseconds,fastestLap,fastestLapTime,fastestLapSpeed,statusId,year,round,circuitId,circuit_name,constructor_name,country,constructor_is_active
0,1,18,1,1,1,1,10.0,58,5690616,39,87452,218.3,1,2008,1,1,Australian Grand Prix,McLaren,British,1
1,2,18,2,2,5,2,8.0,58,5696094,41,87739,217.586,1,2008,1,1,Australian Grand Prix,BMW Sauber,German,0
2,3,18,3,3,7,3,6.0,58,5698779,41,88090,216.719,1,2008,1,1,Australian Grand Prix,Williams,British,1
3,4,18,4,4,11,4,5.0,58,5707797,58,88603,215.464,1,2008,1,1,Australian Grand Prix,Renault,French,0
4,5,18,5,1,3,5,4.0,58,5708630,43,87418,218.385,1,2008,1,1,Australian Grand Prix,McLaren,British,1
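
An inner merge silently drops results whose `constructorId` has no match, and duplicates rows if the key is repeated on the constructors side. As a hedged sketch on toy data (not the real files), the `validate` argument of `merge` and a row-count comparison can guard against both cases.

```python
import pandas as pd

# Toy data: three results referring to two constructors (illustrative only)
results = pd.DataFrame({'resultId': [1, 2, 3], 'constructorId': [1, 2, 1]})
constructors = pd.DataFrame({
    'constructorId': [1, 2],
    'constructor_name': ['McLaren', 'BMW Sauber'],
})

# validate='many_to_one' raises a MergeError if constructorId is not unique
# on the right-hand side, which would otherwise duplicate result rows
merged = results.merge(
    constructors, on='constructorId', how='inner', validate='many_to_one'
)

# With an inner join, equal lengths mean no result row was dropped
rows_preserved = len(merged) == len(results)
print(rows_preserved)
```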


### 2.3. Merging the drivers Dataset with the race results

In [10]:
# Preview the drivers Dataset
df_drivers.head()

Unnamed: 0,driverId,number,code,forename,surname,dob,nationality,driver_is_active
0,1,44,HAM,Lewis,Hamilton,1985-01-07,British,1
1,2,\N,HEI,Nick,Heidfeld,1977-05-10,German,0
2,3,6,ROS,Nico,Rosberg,1985-06-27,German,0
3,4,14,ALO,Fernando,Alonso,1981-07-29,Spanish,1
4,5,\N,KOV,Heikki,Kovalainen,1981-10-19,Finnish,0


#### a) Keep only the useful columns from the drivers Dataset

In [11]:
cols = [x for x in df_drivers.columns if x not in ['number', 'code', 'nationality', 'url']]

In [12]:
df_drivers[cols].head()

Unnamed: 0,driverId,forename,surname,dob,driver_is_active
0,1,Lewis,Hamilton,1985-01-07,1
1,2,Nick,Heidfeld,1977-05-10,0
2,3,Nico,Rosberg,1985-06-27,0
3,4,Fernando,Alonso,1981-07-29,1
4,5,Heikki,Kovalainen,1981-10-19,0
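
Note that `df_drivers[cols]` only displays the selection; it does not modify `df_drivers`. If the intent is to keep those columns out of the merged result, one option is to drop them explicitly. The sketch below uses a toy DataFrame and a hypothetical `unused` list; `errors='ignore'` lets the list mention columns such as `url` that may not be present.

```python
import pandas as pd

# Toy stand-in for the drivers Dataset (illustrative only)
drivers = pd.DataFrame({
    'driverId': [1, 2],
    'number': ['44', '\\N'],
    'code': ['HAM', 'HEI'],
    'forename': ['Lewis', 'Nick'],
    'surname': ['Hamilton', 'Heidfeld'],
    'dob': ['1985-01-07', '1977-05-10'],
    'nationality': ['British', 'German'],
    'driver_is_active': [1, 0],
})

# Columns assumed to be unneeded; 'url' is absent here and simply skipped
unused = ['number', 'code', 'nationality', 'url']
slim = drivers.drop(columns=unused, errors='ignore')
print(list(slim.columns))
# ['driverId', 'forename', 'surname', 'dob', 'driver_is_active']
```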


#### b) Create a column containing the driver's full name

In [13]:
df_drivers['full_name'] = df_drivers['forename'] + ' ' + df_drivers['surname']
# Drop these columns since they are no longer needed
df_drivers.drop(['forename', 'surname', 'number'], axis=1, inplace=True)

#### c) Merge all the Datasets

In [14]:
df_merged = df_const_race.merge(df_drivers, on='driverId', how='inner')

In [15]:
print(df_merged.shape)
df_merged.head()

(26080, 25)


Unnamed: 0,resultId,raceId,driverId,constructorId,grid,race_rank,points,laps,milliseconds,fastestLap,...,circuitId,circuit_name,constructor_name,country,constructor_is_active,code,dob,nationality,driver_is_active,full_name
0,1,18,1,1,1,1,10.0,58,5690616,39,...,1,Australian Grand Prix,McLaren,British,1,HAM,1985-01-07,British,1,Lewis Hamilton
1,2,18,2,2,5,2,8.0,58,5696094,41,...,1,Australian Grand Prix,BMW Sauber,German,0,HEI,1977-05-10,German,0,Nick Heidfeld
2,3,18,3,3,7,3,6.0,58,5698779,41,...,1,Australian Grand Prix,Williams,British,1,ROS,1985-06-27,German,0,Nico Rosberg
3,4,18,4,4,11,4,5.0,58,5707797,58,...,1,Australian Grand Prix,Renault,French,0,ALO,1981-07-29,Spanish,1,Fernando Alonso
4,5,18,5,1,3,5,4.0,58,5708630,43,...,1,Australian Grand Prix,McLaren,British,1,KOV,1981-10-19,Finnish,0,Heikki Kovalainen


## 4. Computing each driver's age per race

In [16]:
def calcul_age(row: pd.Series) -> int:
    # With axis=1, apply passes each row as a Series; age = race year - birth year
    return row['year'] - row['dob'].year

In [17]:
# Convert the dob column to a pandas datetime type
df_merged['dob'] = pd.to_datetime(df_merged['dob'])
df_merged['season_age'] = df_merged.apply(calcul_age, axis=1)

# Drop the dob column since it is no longer needed
df_merged.drop('dob', axis=1, inplace=True)

In [18]:
df_merged.head()

Unnamed: 0,resultId,raceId,driverId,constructorId,grid,race_rank,points,laps,milliseconds,fastestLap,...,circuitId,circuit_name,constructor_name,country,constructor_is_active,code,nationality,driver_is_active,full_name,season_age
0,1,18,1,1,1,1,10.0,58,5690616,39,...,1,Australian Grand Prix,McLaren,British,1,HAM,British,1,Lewis Hamilton,23
1,2,18,2,2,5,2,8.0,58,5696094,41,...,1,Australian Grand Prix,BMW Sauber,German,0,HEI,German,0,Nick Heidfeld,31
2,3,18,3,3,7,3,6.0,58,5698779,41,...,1,Australian Grand Prix,Williams,British,1,ROS,German,0,Nico Rosberg,23
3,4,18,4,4,11,4,5.0,58,5707797,58,...,1,Australian Grand Prix,Renault,French,0,ALO,Spanish,1,Fernando Alonso,27
4,5,18,5,1,3,5,4.0,58,5708630,43,...,1,Australian Grand Prix,McLaren,British,1,KOV,Finnish,0,Heikki Kovalainen,27
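
The row-wise `apply` can also be written as a vectorized subtraction using the `.dt.year` accessor, which is usually faster on a table of this size. The sketch below uses two toy rows taken from the preview above. In both forms the result is a season age (race year minus birth year), so it ignores whether the driver's birthday falls before or after the race.

```python
import pandas as pd

# Two toy rows mirroring the first results of the 2008 season
toy = pd.DataFrame({
    'year': [2008, 2008],
    'dob': ['1985-01-07', '1977-05-10'],
})

# Parse the birth dates, then subtract the birth year column-wise
toy['dob'] = pd.to_datetime(toy['dob'])
toy['season_age'] = toy['year'] - toy['dob'].dt.year

print(toy['season_age'].tolist())  # [23, 31]
```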


## 5. Checking for missing values before continuing

In [19]:
car = '\\N'
print_df_missing_values(df_merged, car)

resultId: 0
raceId: 0
driverId: 0
constructorId: 0
grid: 0
race_rank: 0
points: 0
laps: 0
milliseconds: 0
fastestLap: 0
fastestLapTime: 0
fastestLapSpeed: 0
statusId: 0
year: 0
round: 0
circuitId: 0
circuit_name: 0
constructor_name: 0
country: 0
constructor_is_active: 0
code: 16581
nationality: 0
driver_is_active: 0
full_name: 0
season_age: 0
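
`print_df_missing_values` comes from the project's `src.transformer_utils` module, whose source is not shown here. As a rough sketch only, assuming it counts the cells equal to the placeholder string in each column, a comparable helper (hypothetical name `count_placeholder_values`) could look like this. The backslash is doubled because `\N` on its own starts a named Unicode escape in a Python string literal.

```python
import pandas as pd

def count_placeholder_values(df: pd.DataFrame, placeholder: str) -> dict:
    """Hypothetical helper: per column, count cells equal to the placeholder."""
    return {
        col: int((df[col].astype(str) == placeholder).sum())
        for col in df.columns
    }

# Toy data with the '\N' placeholder used for missing values (illustrative only)
sample = pd.DataFrame({
    'code': ['HAM', '\\N', '\\N'],
    'points': [10.0, 8.0, 6.0],
})

counts = count_placeholder_values(sample, '\\N')
for col, n in counts.items():
    print(f'{col}: {n}')
# code: 2
# points: 0
```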


## 6. Exporting the Dataset

In [20]:
df_merged.to_csv('../../data/outputs/05_merged_data.csv', index=False)
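
The `index=False` argument matters when the file is read back later. As a small sketch using an in-memory buffer instead of a real file, writing the default index adds an unnamed first column that `read_csv` then loads as `Unnamed: 0`.

```python
import io
import pandas as pd

# Toy frame standing in for the merged Dataset (illustrative only)
toy = pd.DataFrame({'full_name': ['Lewis Hamilton'], 'season_age': [23]})

# Default export: the index is written as an extra, unnamed column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'full_name', 'season_age']
print(cols_without_index)  # ['full_name', 'season_age']
```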