# Imports

In [7]:
import isodate
import pandas as pd

## Reading the data

Previously, we tried to read the complete file available on Kaggle. However, given our limited RAM, we extracted a sample of roughly 300 thousand rows from the dataset (299,999 rows, as the shape below confirms).

In [10]:
df = pd.read_csv('../data/arquivo.csv')

In [11]:
df.shape

(299999, 27)
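
The row limit described above can be applied at read time, so the full file never has to fit in memory. The following is a minimal sketch, not the extraction actually used for this notebook: the in-memory CSV and its values are hypothetical stand-ins for the Kaggle file, while `nrows` and `chunksize` are standard `pd.read_csv` parameters.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the full Kaggle file (hypothetical data).
csv_text = "legId,totalFare\na,100.0\nb,200.0\nc,300.0\nd,400.0\n"

# nrows stops parsing after the requested number of data rows,
# so the remainder of the file is never loaded.
sample = pd.read_csv(io.StringIO(csv_text), nrows=3)

# chunksize returns an iterator of DataFrames instead of one large frame,
# which allows processing a big file piece by piece.
chunk_sizes = [
    len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]

print(sample.shape)
print(chunk_sizes)
```

With the real file, the same `nrows` argument could be passed alongside the path to cap the sample size.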

## Checking for null or empty values

We found that the totalTravelDistance column has 16,814 missing values, a small share of the data (about 5.6% of the 299,999 rows). We will therefore analyze whether these rows are needed or whether they can be removed.

For the other two columns with missing values, we note that the information is possibly irrelevant to the prediction.

In [12]:
df.isna().sum()

legId                                    0
searchDate                               0
flightDate                               0
startingAirport                          0
destinationAirport                       0
fareBasisCode                            0
travelDuration                           0
elapsedDays                              0
isBasicEconomy                           0
isRefundable                             0
isNonStop                                0
baseFare                                 0
totalFare                                0
seatsRemaining                           0
totalTravelDistance                  16814
segmentsDepartureTimeEpochSeconds        0
segmentsDepartureTimeRaw                 0
segmentsArrivalTimeEpochSeconds          0
segmentsArrivalTimeRaw                   0
segmentsArrivalAirportCode               0
segmentsDepartureAirportCode             0
segmentsAirlineName                      0
segmentsAirlineCode                      0
segmentsEqu
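
To judge whether a column's missing values are really a small part of the data, the counts can be turned into shares. This is a minimal sketch on a hypothetical toy frame (the values are made up); on the real data the same two calls would be applied to df.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; one of four distance values is missing.
toy = pd.DataFrame({
    "totalFare": [100.0, 200.0, 300.0, 400.0],
    "totalTravelDistance": [947.0, np.nan, 947.0, 956.0],
})

# Absolute number of missing values per column.
missing_count = toy.isna().sum()

# The mean of a boolean mask is the fraction of True values,
# i.e. the share of missing entries per column.
missing_share = toy.isna().mean()

# Keep only the columns that actually have gaps.
with_missing = missing_count[missing_count > 0]

print(missing_share)
print(with_missing)
```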

In [13]:
df.head()

Unnamed: 0,legId,searchDate,flightDate,startingAirport,destinationAirport,fareBasisCode,travelDuration,elapsedDays,isBasicEconomy,isRefundable,...,segmentsArrivalTimeEpochSeconds,segmentsArrivalTimeRaw,segmentsArrivalAirportCode,segmentsDepartureAirportCode,segmentsAirlineName,segmentsAirlineCode,segmentsEquipmentDescription,segmentsDurationInSeconds,segmentsDistance,segmentsCabinCode
0,9ca0e81111c683bec1012473feefd28f,2022-04-16,2022-04-17,ATL,BOS,LA0NX0MC,PT2H29M,0,False,False,...,1650223560,2022-04-17T15:26:00.000-04:00,BOS,ATL,Delta,DL,Airbus A321,8940,947,coach
1,98685953630e772a098941b71906592b,2022-04-16,2022-04-17,ATL,BOS,LA0NX0MC,PT2H30M,0,False,False,...,1650200400,2022-04-17T09:00:00.000-04:00,BOS,ATL,Delta,DL,Airbus A321,9000,947,coach
2,98d90cbc32bfbb05c2fc32897c7c1087,2022-04-16,2022-04-17,ATL,BOS,LA0NX0MC,PT2H30M,0,False,False,...,1650218700,2022-04-17T14:05:00.000-04:00,BOS,ATL,Delta,DL,Boeing 757-200,9000,947,coach
3,969a269d38eae583f455486fa90877b4,2022-04-16,2022-04-17,ATL,BOS,LA0NX0MC,PT2H32M,0,False,False,...,1650227460,2022-04-17T16:31:00.000-04:00,BOS,ATL,Delta,DL,Airbus A321,9120,947,coach
4,980370cf27c89b40d2833a1d5afc9751,2022-04-16,2022-04-17,ATL,BOS,LA0NX0MC,PT2H34M,0,False,False,...,1650213180,2022-04-17T12:33:00.000-04:00,BOS,ATL,Delta,DL,Airbus A321,9240,947,coach


Below, we see that the data types are correct.

In [14]:
df.dtypes

legId                                 object
searchDate                            object
flightDate                            object
startingAirport                       object
destinationAirport                    object
fareBasisCode                         object
travelDuration                        object
elapsedDays                            int64
isBasicEconomy                          bool
isRefundable                            bool
isNonStop                               bool
baseFare                             float64
totalFare                            float64
seatsRemaining                         int64
totalTravelDistance                  float64
segmentsDepartureTimeEpochSeconds     object
segmentsDepartureTimeRaw              object
segmentsArrivalTimeEpochSeconds       object
segmentsArrivalTimeRaw                object
segmentsArrivalAirportCode            object
segmentsDepartureAirportCode          object
segmentsAirlineName                   object
segmentsAi

## Modifying values

In [15]:
df['travelDuration'].unique()

array(['PT2H29M', 'PT2H30M', 'PT2H32M', ..., 'PT21H57M', 'PT20H20M',
       'PT20H44M'], shape=(1203,), dtype=object)

We identified that the travelDuration column is in the ISO 8601 duration format. To convert it to seconds, we use the isodate library, which parses this format, and then take the total number of seconds from the parsed duration.

In [16]:
df['travelDurationInSeconds'] = df['travelDuration'].apply(lambda x: isodate.parse_duration(x).total_seconds())
df['travelDurationInSeconds']


0          8940.0
1          9000.0
2          9000.0
3          9120.0
4          9240.0
           ...   
299994    44820.0
299995    20640.0
299996    28980.0
299997    36000.0
299998    26880.0
Name: travelDurationInSeconds, Length: 299999, dtype: float64
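
To make the conversion concrete, here is a standard-library sketch of what parsing a duration such as PT2H29M involves. It is an illustration only, not how isodate is implemented: the helper name `duration_to_seconds` is hypothetical, and it deliberately handles only the day, hour, minute, and second designators.

```python
import re

# Hypothetical helper pattern: covers only D, H, M and S designators,
# which is a subset of what ISO 8601 (and isodate) supports.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def duration_to_seconds(text: str) -> float:
    """Convert a simple ISO 8601 duration string to total seconds."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError(f"unsupported ISO 8601 duration: {text!r}")
    # Designators that are absent from the string count as zero.
    parts = {name: float(value or 0) for name, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


print(duration_to_seconds("PT2H29M"))  # 2 * 3600 + 29 * 60 = 8940.0
```

The result for PT2H29M agrees with the first row of the output above.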

We also separate the taxes and fees from the ticket price by subtracting baseFare from totalFare.

In [17]:
df['taxesFare'] = df['totalFare'] - df['baseFare']
df['taxesFare']

0         30.93
1         30.93
2         30.93
3         30.93
4         30.93
          ...  
299994    40.57
299995    35.39
299996    35.39
299997    42.67
299998    38.11
Name: taxesFare, Length: 299999, dtype: float64
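
Because the fares are stored as floats, the subtraction can leave tiny binary rounding residue. The sketch below, on hypothetical fare values, rounds the result to cents and checks that no derived value is negative; whether this extra step is needed for the real data is an assumption.

```python
import pandas as pd

# Hypothetical fare values chosen for illustration.
toy = pd.DataFrame({
    "baseFare": [217.67, 80.0],
    "totalFare": [248.60, 100.0],
})

# Round to two decimals so the derived column is expressed in whole cents.
toy["taxesFare"] = (toy["totalFare"] - toy["baseFare"]).round(2)

# A negative value would mean totalFare is below baseFare, which would be suspect.
all_non_negative = bool((toy["taxesFare"] >= 0).all())

print(toy["taxesFare"].tolist())
print(all_non_negative)
```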

In [18]:
df.head()

Unnamed: 0,legId,searchDate,flightDate,startingAirport,destinationAirport,fareBasisCode,travelDuration,elapsedDays,isBasicEconomy,isRefundable,...,segmentsArrivalAirportCode,segmentsDepartureAirportCode,segmentsAirlineName,segmentsAirlineCode,segmentsEquipmentDescription,segmentsDurationInSeconds,segmentsDistance,segmentsCabinCode,travelDurationInSeconds,taxesFare
0,9ca0e81111c683bec1012473feefd28f,2022-04-16,2022-04-17,ATL,BOS,LA0NX0MC,PT2H29M,0,False,False,...,BOS,ATL,Delta,DL,Airbus A321,8940,947,coach,8940.0,30.93
1,98685953630e772a098941b71906592b,2022-04-16,2022-04-17,ATL,BOS,LA0NX0MC,PT2H30M,0,False,False,...,BOS,ATL,Delta,DL,Airbus A321,9000,947,coach,9000.0,30.93
2,98d90cbc32bfbb05c2fc32897c7c1087,2022-04-16,2022-04-17,ATL,BOS,LA0NX0MC,PT2H30M,0,False,False,...,BOS,ATL,Delta,DL,Boeing 757-200,9000,947,coach,9000.0,30.93
3,969a269d38eae583f455486fa90877b4,2022-04-16,2022-04-17,ATL,BOS,LA0NX0MC,PT2H32M,0,False,False,...,BOS,ATL,Delta,DL,Airbus A321,9120,947,coach,9120.0,30.93
4,980370cf27c89b40d2833a1d5afc9751,2022-04-16,2022-04-17,ATL,BOS,LA0NX0MC,PT2H34M,0,False,False,...,BOS,ATL,Delta,DL,Airbus A321,9240,947,coach,9240.0,30.93


After these changes, we found that the flight duration in seconds was already present in the segmentsDurationInSeconds column. We also noted that the remaining columns are correct and do not need to be modified.
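
One way to check that observation across all rows, rather than only the first five, is to compare the two columns directly. This sketch rests on an assumption not shown in this notebook: that itineraries with more than one segment store their per-segment durations as a single string joined by `||`. The toy values are hypothetical.

```python
import pandas as pd

# Hypothetical rows: one non-stop itinerary and one with two segments.
# ASSUMPTION: multi-segment durations are stored as "a||b" strings.
toy = pd.DataFrame({
    "travelDurationInSeconds": [8940.0, 20000.0],
    "segmentsDurationInSeconds": ["8940", "7200||9000"],
})

# Sum the per-segment durations of each itinerary.
segment_total = toy["segmentsDurationInSeconds"].apply(
    lambda value: sum(float(part) for part in str(value).split("||"))
)

# Flag the rows where both duration columns agree.
toy["durationsMatch"] = segment_total == toy["travelDurationInSeconds"]

print(toy[["travelDurationInSeconds", "durationsMatch"]])
```

If rows turn out not to match on the real data, the two columns carry different information for those itineraries and both may be worth keeping.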

## Removing duplicate data

In [19]:
df[df.duplicated()]

Unnamed: 0,legId,searchDate,flightDate,startingAirport,destinationAirport,fareBasisCode,travelDuration,elapsedDays,isBasicEconomy,isRefundable,...,segmentsArrivalAirportCode,segmentsDepartureAirportCode,segmentsAirlineName,segmentsAirlineCode,segmentsEquipmentDescription,segmentsDurationInSeconds,segmentsDistance,segmentsCabinCode,travelDurationInSeconds,taxesFare


We found no duplicate rows.
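
For reference, this is a small sketch of how duplicates would be counted and removed had any been found. The toy frame is hypothetical; `duplicated` and `drop_duplicates` compare entire rows by default.

```python
import pandas as pd

# Hypothetical frame in which the third row repeats the first.
toy = pd.DataFrame({
    "legId": ["a", "b", "a"],
    "totalFare": [100.0, 200.0, 100.0],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(len(deduplicated))
```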

## Saving the data

In [20]:
df.to_csv('../data/arquivo_preprocessado.csv', index=False)
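
The `index=False` argument matters when the file is read back later. The sketch below, using a hypothetical toy frame and in-memory buffers instead of files, shows that writing the index produces an extra `Unnamed: 0` column on the next read, while `index=False` preserves the original columns.

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for the preprocessed data.
toy = pd.DataFrame({"legId": ["a", "b"], "totalFare": [100.0, 200.0]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```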