# MBIT School

## Executive Master en Data Science (2020-2021)
 
by

*Nuria Espadas*  
*Mireia Vecino*  
*Tomeu Mir*  

## notebook: pre_process

This notebook loads two datasets:
 - **df_meteo**: contains the daily weather information for 2019 in Tenerife. The data was obtained via API from AEMET.es.
 - **df_xa**: contains all the excursions exported from the Oracle database.

After loading the two datasets, the notebook creates a new dataset that adds to each excursion the weather information for the relevant date.

This step is slightly tricky: the data comes from 4 different stations, so there are 4 readings of each metric per day.

Each excursion gets the weather data of its closest station, which can be determined because the latitude and longitude of every excursion are known.

The output is persisted in a CSV file, **df_tfs_xa_plus_meteo.csv**, which is the input for the subsequent **EDA processes**.

In [2]:
import pandas as pd
import numpy as np

In [4]:
DATA_PATH = '../data/'

# Load idemas dataset

In [3]:
df_idemas = pd.read_csv(DATA_PATH+'idemas.csv')
df_idemas.rename(columns={"idema": "indicativo"}, inplace=True)
df_idemas

Unnamed: 0,indicativo,ubi,lat,lon
0,C429I,TENERIFE/SUR,28.046991,-16.561157
1,C439J,TENERIFE-GÜIMAR,28.318405,-16.382229
2,C447A,TENERIFE/LOS RODEOS,28.477497,-16.329445
3,C449C,SANTA CRUZ DE TENERIFE,28.463337,-16.255281


# Load meteo metrics dataset

It contains all meteo metrics from 2019 for the 4 IDEMAS identified above

In [3]:
df_meteo = pd.read_csv(DATA_PATH+'df_meteo_tfs_2019.csv', decimal=',')
df_meteo['fecha'] = pd.to_datetime(df_meteo['fecha'])
df_meteo.head(4)

Unnamed: 0,fecha,indicativo,nombre,provincia,altitud,tmed,prec,tmin,horatmin,tmax,horatmax,dir,velmedia,racha,horaracha,sol,presMax,horaPresMax,presMin,horaPresMin
0,2019-01-01,C447A,TENERIFE NORTE AEROPUERTO,STA. CRUZ DE TENERIFE,632,15.0,0,10.4,07:56,19.5,15:30,16.0,3.9,11.4,10:11,5.5,956.5,Varias,951.2,1
1,2019-01-01,C449C,STA.CRUZ DE TENERIFE,STA. CRUZ DE TENERIFE,35,19.6,0,15.0,07:00,24.2,15:20,20.0,3.1,8.6,13:00,9.3,1023.3,Varias,1018.1,1
2,2019-01-01,C429I,TENERIFE SUR AEROPUERTO,STA. CRUZ DE TENERIFE,64,19.2,0,15.1,06:54,23.3,13:05,3.0,4.4,8.9,20:50,9.6,1018.3,Varias,1013.0,5
3,2019-01-01,C439J,GÜÍMAR,STA. CRUZ DE TENERIFE,115,16.6,0,10.5,05:50,22.6,17:00,,,,,,1013.7,22,1008.4,5


# Add coordinates to meteo dataset

Getting the lat,lon from the idemas dataset

In [4]:
df_clima = pd.merge(df_meteo, df_idemas, how='left', on='indicativo')
df_clima.drop(['nombre','provincia','ubi','altitud','horatmin','horatmax','horaracha','horaPresMax','horaPresMin'],axis=1, inplace=True)
df_clima.head()

Unnamed: 0,fecha,indicativo,tmed,prec,tmin,tmax,dir,velmedia,racha,sol,presMax,presMin,lat,lon
0,2019-01-01,C447A,15.0,0,10.4,19.5,16.0,3.9,11.4,5.5,956.5,951.2,28.477497,-16.329445
1,2019-01-01,C449C,19.6,0,15.0,24.2,20.0,3.1,8.6,9.3,1023.3,1018.1,28.463337,-16.255281
2,2019-01-01,C429I,19.2,0,15.1,23.3,3.0,4.4,8.9,9.6,1018.3,1013.0,28.046991,-16.561157
3,2019-01-01,C439J,16.6,0,10.5,22.6,,,,,1013.7,1008.4,28.318405,-16.382229
4,2019-01-02,C447A,14.6,0,10.9,18.2,14.0,3.1,7.8,6.3,957.5,954.5,28.477497,-16.329445


In [5]:
df_clima.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 0 to 1459
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   fecha       1460 non-null   datetime64[ns]
 1   indicativo  1460 non-null   object        
 2   tmed        1456 non-null   float64       
 3   prec        1460 non-null   object        
 4   tmin        1456 non-null   float64       
 5   tmax        1458 non-null   float64       
 6   dir         1425 non-null   float64       
 7   velmedia    1440 non-null   float64       
 8   racha       1425 non-null   float64       
 9   sol         1088 non-null   float64       
 10  presMax     1453 non-null   float64       
 11  presMin     1453 non-null   float64       
 12  lat         1460 non-null   float64       
 13  lon         1460 non-null   float64       
dtypes: datetime64[ns](1), float64(11), object(2)
memory usage: 171.1+ KB


### The "prec" column has some non-numeric values ('Ip'); we replace them with 0 and then convert the column to float type

In [6]:
df_clima['prec'] = df_clima['prec'].replace('Ip', '0')  # remove Ip and set to 0
df_clima['prec'] = df_clima['prec'].str.replace(',', '.').astype(float) 
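
The same cleaning step can be sketched on toy values with `pd.to_numeric`. The sample series below is invented for illustration, and treating `Ip` as a trace amount of precipitation (hence 0) is an assumption about the AEMET data, not something confirmed in this notebook.

```python
import pandas as pd

# Toy values mimicking the raw 'prec' column: comma decimals plus the 'Ip' marker.
prec_raw = pd.Series(['0,0', 'Ip', '1,2', '42,8'])

prec_clean = pd.to_numeric(
    prec_raw.replace('Ip', '0').str.replace(',', '.', regex=False),
    errors='coerce',  # any other non-numeric marker would become NaN instead of raising
)
print(prec_clean.tolist())
```

With `errors='coerce'`, an unexpected marker shows up as a missing value in `describe()` rather than stopping the conversion.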

In [7]:
df_clima.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
tmed,1456.0,20.18853,3.45811,11.7,18.0,20.0,22.8,30.4
prec,1460.0,0.421096,2.49384,0.0,0.0,0.0,0.0,42.8
tmin,1456.0,16.498901,3.386078,7.4,14.2,16.4,19.1,25.7
tmax,1458.0,23.868244,3.842759,13.0,21.4,23.7,26.7,36.4
dir,1425.0,32.945263,35.297151,1.0,7.0,18.0,33.0,99.0
velmedia,1440.0,4.965,2.31997,1.1,3.1,4.2,6.7,15.8
racha,1425.0,11.434947,3.513695,3.9,8.9,11.1,13.9,24.7
sol,1088.0,8.080147,3.517593,0.0,5.6,9.1,10.9,13.3
presMax,1453.0,996.353407,26.915474,940.4,997.4,1009.1,1014.2,1025.9
presMin,1453.0,993.248451,26.799201,935.6,992.5,1006.0,1010.9,1022.9


# Add meteo metrics to the excursions dataset

Based on the booking date, get the meteo metrics of that day from the closest indicativo.

  
  1. adding a column to df_xa with the closest station -> df_xa['indicativo'] = get_closest_idema(xa_lat, xa_lon)
  
  
  
  
  2. joining df_xa with df_clima by:
  - df_xa.booking_date = df_clima.fecha 
      - AND 
  - df_xa.indicativo = df_clima.indicativo
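
The two-key join described above can be sketched on toy frames as follows. The frames `xa` and `clima` and their values are hypothetical stand-ins for `df_xa` and `df_clima`; `indicator=True` is added here only to show which rows find no weather match.

```python
import pandas as pd

# Toy stand-ins for df_xa and df_clima (values invented for illustration).
xa = pd.DataFrame({
    'fecha': pd.to_datetime(['2019-01-13', '2019-01-15', '2018-12-31']),
    'indicativo': ['C429I', 'C447A', 'C429I'],
    'booking': [1, 2, 3],
})
clima = pd.DataFrame({
    'fecha': pd.to_datetime(['2019-01-13', '2019-01-15']),
    'indicativo': ['C429I', 'C447A'],
    'tmed': [19.2, 12.5],
})

# A left join keeps every excursion; the _merge column flags rows without weather data.
joined = pd.merge(xa, clima, how='left', on=['fecha', 'indicativo'], indicator=True)
unmatched = joined[joined['_merge'] == 'left_only']
print(len(joined), len(unmatched))
```

The third toy row has a 2018 date, so it stays in the result with missing weather values.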
  
  
**geopy**:<br>
https://github.com/geopy/geopy  
Using library **geopy** for calculating the distance between two location points.<br> 
From terminal execute: **conda install -c conda-forge geopy**




**Quick test of the function geodesic of the geopy lib**
```
# testing the lib
from geopy.distance import geodesic

C429I_TENERIFE_SUR = (28.046991, -16.561157)
C449C_SANTA_CRUZ_DE_TENERIFE = (28.463337, -16.255281)

dist_km = geodesic(C429I_TENERIFE_SUR, C449C_SANTA_CRUZ_DE_TENERIFE).km
print(dist_km)
```
**output:**
```
55.044545572206346
```


**The following code is a quick test of the procedure for getting the closest IDEMA to a location passed as a parameter:**

```
from geopy.distance import geodesic

def get_closest_idema(loc_point):
    # returns the closest idema to the location point passed as parameter
    # parameter: loc_point = (lat, lon)  # tuple
    idema = None
    min_dist = 100000  # start at 100k km so that the first iteration always sets the minimum
    for index, row in df_idemas.iterrows():
        idema_loc = (row['lat'], row['lon'])
        dist = geodesic(loc_point, idema_loc).km
        if dist < min_dist:
            min_dist = dist
            idema = row['indicativo']
        print(row['indicativo'],row['lat'], row['lon'], dist, min_dist)
    return idema

# Testing the function
loc_point = (28.480410, -16.400138) # LESTCI4FX8	Bodegas Alvaro
closest_indicativo = get_closest_idema( loc_point)
closest_indicativo
```

**And the output is:**
```
C429I 28.046991 -16.561157 50.56479217874456 50.56479217874456
C439J 28.318405 -16.382229 18.039925667610657 18.039925667610657
C447A 28.477497 -16.329445 6.930032096794651 6.930032096794651
C449C 28.463337 -16.255281 14.311488034348436 6.930032096794651

'C447A'
```

In [8]:
df_xa = pd.read_csv(DATA_PATH+'df_tfs_xa_2019.csv')

df_xa.head()

Unnamed: 0,H_BO_HEADERID,H_BOOKINGDATE,H_STARTDATE,H_ENDDATE,H_DURATION,H_DAYSBEFOREBOOK,I_BO_ITEMID,I_BOOKINGDATE,I_STARTDATE,I_ENDDATE,...,CHD,INF,AVG_ADT_AGE,LEAD_PAX_AGE,TR_NUM_OF_SERVICES,TR_TOTAL_SALES,XA_NUM_OF_SERVICES,XA_TOTAL_SALES,AC_NUM_OF_SERVICES,AC_TOTAL_SALES
0,21788788,10/05/18,06/01/19,27/01/19,21,241,57811718,10/05/18,14/01/19,14/01/19,...,0,0,,,2.0,13.0,5,221,1.0,0.0
1,21788788,10/05/18,06/01/19,27/01/19,21,241,57811786,10/05/18,22/01/19,22/01/19,...,0,0,,,2.0,13.0,5,221,1.0,0.0
2,21788788,10/05/18,06/01/19,27/01/19,21,241,59637648,10/05/18,22/01/19,22/01/19,...,0,0,,,2.0,13.0,5,221,1.0,0.0
3,21788788,10/05/18,06/01/19,27/01/19,21,241,59244068,10/05/18,16/01/19,16/01/19,...,0,0,,,2.0,13.0,5,221,1.0,0.0
4,21788788,10/05/18,06/01/19,27/01/19,21,241,57811620,10/05/18,16/01/19,16/01/19,...,0,0,,,2.0,13.0,5,221,1.0,0.0


In [9]:

df_xa['H_BOOKINGDATE'] = pd.to_datetime(df_xa['H_BOOKINGDATE'])
df_xa['H_STARTDATE'] = pd.to_datetime(df_xa['H_STARTDATE'])
df_xa['H_ENDDATE'] = pd.to_datetime(df_xa['H_ENDDATE'])
df_xa['I_BOOKINGDATE'] = pd.to_datetime(df_xa['I_BOOKINGDATE'])
df_xa['I_STARTDATE'] = pd.to_datetime(df_xa['I_STARTDATE'])
df_xa['I_ENDDATE'] = pd.to_datetime(df_xa['I_ENDDATE'])
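
The sample rows suggest that the export stores dates as day/month/year (for example, `27/01/19` cannot be month-first). Assuming that holds for all six date columns, a minimal sketch of parsing with an explicit format is shown below; the two strings are taken from the table above. An ambiguous value such as `10/05/18` is read month-first by the default parser.

```python
import pandas as pd

raw = pd.Series(['10/05/18', '27/01/19'])  # day/month/year strings, as in the export

# An explicit format removes the day/month ambiguity.
parsed = pd.to_datetime(raw, format='%d/%m/%y')
print(parsed.dt.strftime('%Y-%m-%d').tolist())
```

Under this assumption `10/05/18` becomes 10 May 2018, which is consistent with the 249 days in `I_DAYSBEFOREBOOK` between that booking date and a start date of 14 January 2019.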


In [10]:
from geopy.distance import geodesic
    
def get_closest_idema(x):
    # returns the closest idema to the (lat, lon) of the row passed as parameter
    # parameter: a df row
    #
    if pd.isna(x['I_LATITUD']) or pd.isna(x['I_LONGITUD']):
        return None
    
    loc_point = (x['I_LATITUD'], x['I_LONGITUD'])
    
    idema = None
    min_dist = 100000 # initially 100k kms so, the first iteration always will set the min. value.
    for index, row in df_idemas.iterrows():
        idema_loc = (row['lat'], row['lon'])
        dist = geodesic(loc_point, idema_loc).km
        if dist < min_dist:
            min_dist = dist
            idema = row['indicativo']
            
    return idema


In [11]:
df_xa['indicativo'] = df_xa.apply(get_closest_idema, axis=1)
df_xa[['STOCK_CODE','I_LATITUD','I_LONGITUD','indicativo']]

Unnamed: 0,STOCK_CODE,I_LATITUD,I_LONGITUD,indicativo
0,XESTCIBSBI,28.056230,-16.724170,C429I
1,PESTCI4KNA,28.063877,-16.727066,C429I
2,LESTCI5XRO,28.033218,-17.190764,C429I
3,LESTCI4FX8,28.480410,-16.400138,C447A
4,XESTCIBUM4,28.061094,-16.727680,C429I
...,...,...,...,...
25480,PESTCI4IGS,28.264957,-16.637370,C429I
25481,PESTCI4FMM,28.417067,-16.542494,C439J
25482,PESTCI4FMM,28.417067,-16.542494,C439J
25483,XESTCID0CS,28.084962,-17.333884,C429I
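
Calling `geodesic` once per row and station is fine for four stations, but the lookup can also be vectorized. The sketch below is an alternative, not the notebook's method: it swaps the geodesic distance for the haversine (spherical) approximation, and `haversine_km` and `closest_station` are hypothetical helper names. The station coordinates are copied from the idemas table.

```python
import numpy as np
import pandas as pd

# Station coordinates copied from the idemas table.
stations = pd.DataFrame({
    'indicativo': ['C429I', 'C439J', 'C447A', 'C449C'],
    'lat': [28.046991, 28.318405, 28.477497, 28.463337],
    'lon': [-16.561157, -16.382229, -16.329445, -16.255281],
})

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km (an approximation)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def closest_station(lats, lons):
    """Return the nearest station code per point; missing coordinates give None."""
    lats = np.asarray(lats, dtype=float)
    lons = np.asarray(lons, dtype=float)
    # Distance matrix of shape (n_points, n_stations) via broadcasting.
    dist = haversine_km(lats[:, None], lons[:, None],
                        stations['lat'].to_numpy()[None, :],
                        stations['lon'].to_numpy()[None, :])
    missing = np.isnan(lats) | np.isnan(lons)
    idx = np.argmin(np.where(np.isnan(dist), np.inf, dist), axis=1)
    codes = stations['indicativo'].to_numpy(dtype=object)[idx]
    codes[missing] = None
    return codes

codes = closest_station([28.480410, 28.056230, float('nan')],
                        [-16.400138, -16.724170, -16.5])
print(list(codes))
```

For points this close together the spherical approximation differs from the geodesic distance by a small fraction of a kilometre, so the choice of nearest station should rarely change; borderline points could still be rechecked with `geodesic`.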


In [12]:
df_xa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25485 entries, 0 to 25484
Data columns (total 37 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   H_BO_HEADERID          25485 non-null  int64         
 1   H_BOOKINGDATE          25485 non-null  datetime64[ns]
 2   H_STARTDATE            25485 non-null  datetime64[ns]
 3   H_ENDDATE              25485 non-null  datetime64[ns]
 4   H_DURATION             25485 non-null  int64         
 5   H_DAYSBEFOREBOOK       25485 non-null  int64         
 6   I_BO_ITEMID            25485 non-null  int64         
 7   I_BOOKINGDATE          25485 non-null  datetime64[ns]
 8   I_STARTDATE            25485 non-null  datetime64[ns]
 9   I_ENDDATE              25485 non-null  datetime64[ns]
 10  I_DURATION             25485 non-null  int64         
 11  I_DAYSBEFOREBOOK       25485 non-null  int64         
 12  I_OD_RESORTOFFICEID    25485 non-null  int64         
 13  I

In [13]:
# Drop bookings without lat,lon info: 25485 - 25429 = 56 bookings to be deleted
indexNames = df_xa[ (np.isnan(df_xa['I_LATITUD'])) | (np.isnan(df_xa['I_LONGITUD'])) ].index
df_xa.drop(indexNames , inplace=True)

In [14]:
df_xa.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25429 entries, 0 to 25484
Data columns (total 37 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   H_BO_HEADERID          25429 non-null  int64         
 1   H_BOOKINGDATE          25429 non-null  datetime64[ns]
 2   H_STARTDATE            25429 non-null  datetime64[ns]
 3   H_ENDDATE              25429 non-null  datetime64[ns]
 4   H_DURATION             25429 non-null  int64         
 5   H_DAYSBEFOREBOOK       25429 non-null  int64         
 6   I_BO_ITEMID            25429 non-null  int64         
 7   I_BOOKINGDATE          25429 non-null  datetime64[ns]
 8   I_STARTDATE            25429 non-null  datetime64[ns]
 9   I_ENDDATE              25429 non-null  datetime64[ns]
 10  I_DURATION             25429 non-null  int64         
 11  I_DAYSBEFOREBOOK       25429 non-null  int64         
 12  I_OD_RESORTOFFICEID    25429 non-null  int64         
 13  I

## Important remark
  
We should not join the **xa** + **meteo** data based on the booking date alone, because some bookings are made far in advance, sometimes as early as 2018, a year for which we have no meteo data. It would not make sense to use meteo data from 2018 either, because these bookings are most likely made from the source-market countries, before people travel to the destination.

**Which meteo data to use?**
  
- If the booking was made 7 days or fewer before the start of the excursion (and not in 2018) -> meteo data of the **booking_date**
- If the booking was made more than 7 days in advance (or in 2018):
    - Meteo data of **starting_date - 1** (the previous day)
    - If the start date is 1-Jan, we take the meteo data of the **starting_date** itself instead, since the previous day falls in 2018.

In [15]:
def set_meteo_date(x):
    fecha = x['I_BOOKINGDATE']
    if x['I_DAYSBEFOREBOOK'] > 7 or fecha.year == 2018:
        fecha = x['I_STARTDATE'] - pd.Timedelta(days=1)
        if fecha.year == 2018:
            fecha = x['I_STARTDATE']
    return fecha
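
As a hedged alternative to the row-wise `apply`, the same rule can be sketched with vectorized `Series.where` calls. The toy frame below is invented for illustration and reuses only the three column names the function reads.

```python
import pandas as pd

df = pd.DataFrame({
    'I_BOOKINGDATE': pd.to_datetime(['2019-03-10', '2018-05-10', '2018-12-30', '2019-06-01']),
    'I_STARTDATE': pd.to_datetime(['2019-03-12', '2019-01-14', '2019-01-01', '2019-06-20']),
    'I_DAYSBEFOREBOOK': [2, 249, 2, 19],
})

# Rows that must take their weather from around the start date.
use_start = (df['I_DAYSBEFOREBOOK'] > 7) | (df['I_BOOKINGDATE'].dt.year == 2018)

day_before = df['I_STARTDATE'] - pd.Timedelta(days=1)
# Fall back to the start date itself when the previous day falls in 2018.
from_start = day_before.where(day_before.dt.year != 2018, df['I_STARTDATE'])

# Keep the booking date unless use_start is set.
df['fecha'] = df['I_BOOKINGDATE'].where(~use_start, from_start)
print(df['fecha'].dt.strftime('%Y-%m-%d').tolist())
```

The four toy rows cover each branch: a late booking, an early booking, a 2018 booking whose previous day is still in 2018, and a 2019 booking made more than 7 days ahead.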

In [16]:
df_xa['fecha'] = df_xa.apply(set_meteo_date, axis=1)
df_xa[['I_BOOKINGDATE','I_DAYSBEFOREBOOK','I_STARTDATE','fecha']]

Unnamed: 0,I_BOOKINGDATE,I_DAYSBEFOREBOOK,I_STARTDATE,fecha
0,2018-10-05,249,2019-01-14,2019-01-13
1,2018-10-05,257,2019-01-22,2019-01-21
2,2018-10-05,257,2019-01-22,2019-01-21
3,2018-10-05,251,2019-01-16,2019-01-15
4,2018-10-05,251,2019-01-16,2019-01-15
...,...,...,...,...
25480,2019-12-29,0,2019-12-29,2019-12-29
25481,2019-12-30,0,2019-12-30,2019-12-30
25482,2019-12-26,0,2019-12-26,2019-12-26
25483,2019-12-30,0,2019-12-30,2019-12-30


In [17]:
df_xa.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25429 entries, 0 to 25484
Data columns (total 38 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   H_BO_HEADERID          25429 non-null  int64         
 1   H_BOOKINGDATE          25429 non-null  datetime64[ns]
 2   H_STARTDATE            25429 non-null  datetime64[ns]
 3   H_ENDDATE              25429 non-null  datetime64[ns]
 4   H_DURATION             25429 non-null  int64         
 5   H_DAYSBEFOREBOOK       25429 non-null  int64         
 6   I_BO_ITEMID            25429 non-null  int64         
 7   I_BOOKINGDATE          25429 non-null  datetime64[ns]
 8   I_STARTDATE            25429 non-null  datetime64[ns]
 9   I_ENDDATE              25429 non-null  datetime64[ns]
 10  I_DURATION             25429 non-null  int64         
 11  I_DAYSBEFOREBOOK       25429 non-null  int64         
 12  I_OD_RESORTOFFICEID    25429 non-null  int64         
 13  I

**Creating the final dataset**: adding the meteo data to the excursions dataset, joining by fecha and indicativo

In [18]:
df_xa_plus_meteo = pd.merge(df_xa, df_clima, how='left', on=['fecha','indicativo'])
df_xa_plus_meteo

Unnamed: 0,H_BO_HEADERID,H_BOOKINGDATE,H_STARTDATE,H_ENDDATE,H_DURATION,H_DAYSBEFOREBOOK,I_BO_ITEMID,I_BOOKINGDATE,I_STARTDATE,I_ENDDATE,...,tmin,tmax,dir,velmedia,racha,sol,presMax,presMin,lat,lon
0,21788788,2018-10-05,2019-06-01,2019-01-27,21,241,57811718,2018-10-05,2019-01-14,2019-01-14,...,14.8,24.5,99.0,5.0,9.2,9.7,1013.5,1010.3,28.046991,-16.561157
1,21788788,2018-10-05,2019-06-01,2019-01-27,21,241,57811786,2018-10-05,2019-01-22,2019-01-22,...,14.7,21.8,99.0,7.2,15.0,4.0,1016.8,1014.3,28.046991,-16.561157
2,21788788,2018-10-05,2019-06-01,2019-01-27,21,241,59637648,2018-10-05,2019-01-22,2019-01-22,...,14.7,21.8,99.0,7.2,15.0,4.0,1016.8,1014.3,28.046991,-16.561157
3,21788788,2018-10-05,2019-06-01,2019-01-27,21,241,59244068,2018-10-05,2019-01-16,2019-01-16,...,9.4,16.1,11.0,3.9,8.9,5.7,952.5,948.7,28.477497,-16.329445
4,21788788,2018-10-05,2019-06-01,2019-01-27,21,241,57811620,2018-10-05,2019-01-16,2019-01-16,...,15.1,22.8,99.0,7.8,12.8,9.8,1014.3,1010.3,28.046991,-16.561157
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25424,33576052,2020-02-01,2019-12-29,2019-12-29,0,-5,76493326,2019-12-29,2019-12-29,2019-12-29,...,16.1,27.0,1.0,3.9,9.7,9.7,1013.5,1010.2,28.046991,-16.561157
25425,33576626,2020-02-01,2019-12-30,2019-12-30,0,-4,76494434,2019-12-30,2019-12-30,2019-12-30,...,15.7,23.0,4.0,3.3,7.2,,1011.9,1008.5,28.318405,-16.382229
25426,33595112,2020-03-01,2019-12-26,2019-12-26,0,-8,76536686,2019-12-26,2019-12-26,2019-12-26,...,15.1,25.0,5.0,2.8,6.9,,1008.3,1004.3,28.318405,-16.382229
25427,33744892,2019-12-30,2019-12-30,2019-12-30,0,0,76905618,2019-12-30,2019-12-30,2019-12-30,...,15.8,27.2,6.0,5.8,11.7,9.6,1016.0,1012.6,28.046991,-16.561157


In [19]:
df_xa_plus_meteo.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25429 entries, 0 to 25428
Data columns (total 50 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   H_BO_HEADERID          25429 non-null  int64         
 1   H_BOOKINGDATE          25429 non-null  datetime64[ns]
 2   H_STARTDATE            25429 non-null  datetime64[ns]
 3   H_ENDDATE              25429 non-null  datetime64[ns]
 4   H_DURATION             25429 non-null  int64         
 5   H_DAYSBEFOREBOOK       25429 non-null  int64         
 6   I_BO_ITEMID            25429 non-null  int64         
 7   I_BOOKINGDATE          25429 non-null  datetime64[ns]
 8   I_STARTDATE            25429 non-null  datetime64[ns]
 9   I_ENDDATE              25429 non-null  datetime64[ns]
 10  I_DURATION             25429 non-null  int64         
 11  I_DAYSBEFOREBOOK       25429 non-null  int64         
 12  I_OD_RESORTOFFICEID    25429 non-null  int64         
 13  I

In [20]:
df_xa_plus_meteo.head(10)

Unnamed: 0,H_BO_HEADERID,H_BOOKINGDATE,H_STARTDATE,H_ENDDATE,H_DURATION,H_DAYSBEFOREBOOK,I_BO_ITEMID,I_BOOKINGDATE,I_STARTDATE,I_ENDDATE,...,tmin,tmax,dir,velmedia,racha,sol,presMax,presMin,lat,lon
0,21788788,2018-10-05,2019-06-01,2019-01-27,21,241,57811718,2018-10-05,2019-01-14,2019-01-14,...,14.8,24.5,99.0,5.0,9.2,9.7,1013.5,1010.3,28.046991,-16.561157
1,21788788,2018-10-05,2019-06-01,2019-01-27,21,241,57811786,2018-10-05,2019-01-22,2019-01-22,...,14.7,21.8,99.0,7.2,15.0,4.0,1016.8,1014.3,28.046991,-16.561157
2,21788788,2018-10-05,2019-06-01,2019-01-27,21,241,59637648,2018-10-05,2019-01-22,2019-01-22,...,14.7,21.8,99.0,7.2,15.0,4.0,1016.8,1014.3,28.046991,-16.561157
3,21788788,2018-10-05,2019-06-01,2019-01-27,21,241,59244068,2018-10-05,2019-01-16,2019-01-16,...,9.4,16.1,11.0,3.9,8.9,5.7,952.5,948.7,28.477497,-16.329445
4,21788788,2018-10-05,2019-06-01,2019-01-27,21,241,57811620,2018-10-05,2019-01-16,2019-01-16,...,15.1,22.8,99.0,7.8,12.8,9.8,1014.3,1010.3,28.046991,-16.561157
5,21863092,2018-11-03,2018-12-22,2019-05-01,14,286,58241674,2018-12-30,2019-02-01,2019-02-01,...,16.5,19.3,99.0,3.3,10.3,0.0,1013.2,1010.3,28.046991,-16.561157
6,21863092,2018-11-03,2018-12-22,2019-05-01,14,286,58194746,2018-12-30,2019-02-01,2019-02-01,...,16.5,19.3,99.0,3.3,10.3,0.0,1013.2,1010.3,28.046991,-16.561157
7,21924996,2018-05-26,2019-01-03,2019-08-03,7,279,61921016,2019-04-03,2019-05-03,2019-05-03,...,13.0,24.0,26.0,2.8,11.7,7.9,1004.6,1002.0,28.046991,-16.561157
8,21926642,2018-01-04,2019-03-30,2019-04-13,14,363,63613814,2019-01-04,2019-09-04,2019-09-04,...,19.9,27.3,99.0,6.1,10.8,10.4,1011.6,1008.3,28.046991,-16.561157
9,21926642,2018-01-04,2019-03-30,2019-04-13,14,363,63310364,2019-01-04,2019-09-04,2019-09-04,...,14.8,21.2,99.0,4.2,11.7,9.2,1017.3,1013.4,28.046991,-16.561157


In [21]:
df_xa_plus_meteo.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25429 entries, 0 to 25428
Data columns (total 50 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   H_BO_HEADERID          25429 non-null  int64         
 1   H_BOOKINGDATE          25429 non-null  datetime64[ns]
 2   H_STARTDATE            25429 non-null  datetime64[ns]
 3   H_ENDDATE              25429 non-null  datetime64[ns]
 4   H_DURATION             25429 non-null  int64         
 5   H_DAYSBEFOREBOOK       25429 non-null  int64         
 6   I_BO_ITEMID            25429 non-null  int64         
 7   I_BOOKINGDATE          25429 non-null  datetime64[ns]
 8   I_STARTDATE            25429 non-null  datetime64[ns]
 9   I_ENDDATE              25429 non-null  datetime64[ns]
 10  I_DURATION             25429 non-null  int64         
 11  I_DAYSBEFOREBOOK       25429 non-null  int64         
 12  I_OD_RESORTOFFICEID    25429 non-null  int64         
 13  I

In [22]:
df_xa_plus_meteo.to_csv(DATA_PATH+'df_tfs_xa_plus_meteo.csv')
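
When the file is read back in the EDA notebooks, two details are worth keeping in mind: `to_csv` writes the index as an unnamed first column by default, and dates come back as strings unless they are parsed. A minimal sketch of the round trip, using an in-memory buffer and a toy frame instead of the real file:

```python
import io
import pandas as pd

df = pd.DataFrame({
    'fecha': pd.to_datetime(['2019-01-13', '2019-01-15']),
    'indicativo': ['C429I', 'C447A'],
    'tmed': [19.2, 12.5],
})

buffer = io.StringIO()          # stands in for the CSV file on disk
df.to_csv(buffer, index=False)  # index=False avoids an extra 'Unnamed: 0' column on reload
buffer.seek(0)

reloaded = pd.read_csv(buffer, parse_dates=['fecha'])
print(reloaded.dtypes)
```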