# Airbnb's Business Question and Plan for Success

## Understanding the Business Question

-- The Objective of the Business Problem:

    -- 1.0. Predict the country in which a new Airbnb user will make their first booking.

-- Proposal for Solution:

-- 

## The Business Planning

# <font color = 'green'> ----- Cycle 1: First Sprint ----- </font>

# 0. Imports 

## 0.1. Libraries:

In [None]:
import sys

!{sys.executable} -m pip install keras
!{sys.executable} -m pip install tensorflow
!{sys.executable} -m pip install scikit-plot

In [None]:
import random

import pandas  as pd
import numpy   as np
import seaborn as sns

from IPython.display import HTML, display

from matplotlib import pyplot as plt

from sklearn import model_selection as ms
from sklearn import preprocessing   as pp
from sklearn import metrics         as m

from scikitplot import metrics as mt

from keras import models as ml
from keras import layers as l

## 0.2. Helper Functions

In [None]:
# Helper function that sets up the notebook layout:
def jupyter_settings():
    %matplotlib inline
    
    plt.style.use( 'bmh' )
    plt.rcParams[ 'figure.figsize' ] = [15, 7]
    plt.rcParams[ 'font.size' ] = 20
    
    display( HTML( '<style>.container { width:95% !important; }</style>' ) )
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.set_option( 'display.expand_frame_repr', False )
    
    sns.set()

In [None]:
jupyter_settings()

## 0.3. Loading the data

In [None]:
# Dataset for training on the users' inherent characteristics:
df_users = pd.read_csv( '../datasets/training_users.csv', low_memory=True )

In [None]:
# Dataset for training on the users' behavior while using the platform:
df_sessions = pd.read_csv( '/content/drive/MyDrive/COMUNIDADE DS/Colab_Notebooks-Projects/pa000_airbnb_predict_first_booking/datasets/sessions.csv', low_memory=True )
print( 'Data size for df_sessions: {}'.format(df_sessions.shape) )

# In case both datasets are merged, pay attention to their GRANULARITY.

# To decide which granularity should be used when merging the datasets, the following questions must be considered:
## Is the prediction made per user?
## Is the prediction made per user + event ('action')?

# Example: return the number of actions performed by an arbitrary platform user while they were active:
df_sessions[ df_sessions['user_id'] == '00023iyk9l' ][['user_id', 'action']].groupby('user_id').count()

# Likewise, to return the count of each distinct action performed by the user '00023iyk9l':
df_sessions[ df_sessions['user_id'] == '00023iyk9l' ][['user_id', 'action']].value_counts()
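
If the prediction is made per user, the event-level session log has to be collapsed to one row per user before it can be merged with the users table. The following is only a minimal sketch on toy data, assuming the column names `user_id`, `action` and `secs_elapsed` from the sessions dataset; the aggregate names `n_actions`, `n_unique_actions` and `total_secs` are hypothetical and not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for df_sessions (values are made up for illustration).
sessions = pd.DataFrame({
    'user_id':      ['u1', 'u1', 'u1', 'u2'],
    'action':       ['search', 'show', 'search', 'show'],
    'secs_elapsed': [10.0, 20.0, 30.0, 5.0],
})

# Toy stand-in for df_users.
users = pd.DataFrame({'id': ['u1', 'u2', 'u3'], 'age': [30, 41, 25]})

# Collapse the event-level log to one row per user (hypothetical aggregates).
sessions_per_user = sessions.groupby('user_id').agg(
    n_actions=('action', 'count'),
    n_unique_actions=('action', 'nunique'),
    total_secs=('secs_elapsed', 'sum'),
).reset_index()

# A left join keeps users without any recorded session (their aggregates become NA).
merged = users.merge(sessions_per_user, how='left', left_on='id', right_on='user_id')
print(merged)
```

With user + event granularity, by contrast, the merge would keep one row per session event and repeat the user attributes on each of them.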


## 0.4. Defining the granularity for both datasets

## 0.5. Merging both datasets

# 1. Data Description

In [None]:
df1 = df_users.copy()

## 1.1. Data Dimensions

In [6]:
# Data dimensions for the 'users.csv' dataset:
print( 'Data size for df_users: \n\nNumber of rows = {} \nNumber of columns = {}'.format(df_users.shape[0],df_users.shape[1]) )

Data size for df_users: 

Number of rows = 213451 
Number of columns = 16


In [None]:
# Printing the size of the sessions dataset:
print( 'Data size for df_sessions: \n\nNumber of rows = {} \nNumber of columns = {}'.format(df_sessions.shape[0],df_sessions.shape[1]) )

## 1.2. Data Types

In [None]:
df_users.dtypes

In [None]:
df_sessions.dtypes

In [None]:
df1.sample().T

## 1.3. Checking for presence of NA data

### 1.3.1. Checking the 'users.csv' dataset

In [None]:
df1.isna().sum() / len(df1)

In [None]:
# Checking why 58% of the 'date_first_booking' values are missing (NA):
aux = df1[ df1['date_first_booking'].isna() ]   # keeps only the rows where 'date_first_booking' is NA.
aux['country_destination'].value_counts(normalize=True)

The snippet above shows that 100% of the users with a missing 'date_first_booking' have 'country_destination' equal to 'NDF' (no destination found). This is because users who browsed the platform without booking anything have no destination country.

However, the time the user took from the moment they entered the website to the moment they made their first booking is an important variable.
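
As an illustration of that variable, the elapsed time can be computed as the difference between the two date columns. This is only a sketch on made-up rows; the column name `days_to_first_booking` is an assumption and does not exist in the raw dataset:

```python
import pandas as pd

# Toy rows using the column names of the users dataset (values are made up).
toy = pd.DataFrame({
    'date_account_created': ['2014-01-01', '2014-03-10', '2014-05-05'],
    'date_first_booking':   ['2014-01-04', None, '2014-05-05'],
})

created = pd.to_datetime(toy['date_account_created'])
booked = pd.to_datetime(toy['date_first_booking'])

# Days elapsed between account creation and first booking; NaN where no booking exists.
toy['days_to_first_booking'] = (booked - created).dt.days
print(toy)
```

The value stays undefined for users who never booked, which is why the missing dates have to be handled before a feature like this can be used.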



In [None]:
# Latest booking date available in the data (the raw dataset is no longer updated!):
pd.to_datetime(df1['date_first_booking']).max()

# 'date_first_booking' -> fill the missing values with the latest booking date:
date_first_booking_max = pd.to_datetime(df1['date_first_booking']).max().strftime('%Y-%m-%d')
df1['date_first_booking'] = df1['date_first_booking'].fillna(date_first_booking_max)

In [None]:
# 'age':
aux = df1[df1['age'].isna()]
aux['country_destination'].value_counts(normalize=True)

In this case, it is necessary to visualize how the 'age' feature is distributed (18 <= 'age' <= 65) across all values of 'country_destination', as follows...

In [None]:
sns.histplot( df1[ df1['age'].between(18, 65) ]['age'], kde=True );

As the plot above suggests, the ages in the dataset are roughly Gaussian-distributed, although with a significant tail at the end and a skew towards ages between 20 and 30 years old.

Because of this, replacing the missing values (NA) with the mean of the observed ages, or with values within one standard deviation of it, should introduce little bias into the analysis.
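
One way to read "mean plus standard deviation" is to draw each replacement from a normal distribution with the observed mean and standard deviation. The sketch below shows that variant on toy data; it is an assumption about how such an imputation could look, not the approach used in this cycle, and the clipping to the 18 to 65 range is likewise an illustrative choice:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(32)  # fixed seed so the sketch is reproducible

# Toy 'age' column with two missing values (made-up numbers).
ages = pd.Series([25.0, 31.0, np.nan, 40.0, np.nan, 28.0], name='age')

age_mean = ages.mean()
age_std = ages.std()

# Draw one value per missing entry from N(mean, std), clip to 18-65 and round.
n_missing = int(ages.isna().sum())
draws = rng.normal(loc=age_mean, scale=age_std, size=n_missing).clip(18, 65).round()

ages_filled = ages.copy()
ages_filled.loc[ages.isna()] = draws
print(ages_filled.tolist())
```

Compared with filling every gap with a single mean value, this keeps some of the spread of the original distribution.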



In [None]:
age_avg = int( round(df1['age'].mean()) )
age_avg

df1['age'] = df1['age'].fillna(age_avg)

In [None]:
## 'first_affiliate_tracked' -> the first marketing channel the user interacted with before signing up:
df1['first_affiliate_tracked'].drop_duplicates()


About 3% of the rows in the raw dataset have a missing (NA) value in 'first_affiliate_tracked'.

When imputing missing values in categorical features, it is common to fill them according to the frequency of each category. However, this can have a large impact, since it may bias the analysis.

For this reason, **the missing values of 'first_affiliate_tracked' will not be imputed in this cycle of the CRISP**; the affected rows are dropped, and the other features will be used throughout the Cross Validation process.
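
For reference, the frequency-based imputation mentioned above could look like the following sketch. It is not applied in this cycle; the toy values and the variable names `filled_mode` and `filled_sampled` are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(32)  # fixed seed so the sketch is reproducible

# Toy categorical column with two missing values (made-up data).
tracked = pd.Series(['untracked', 'linked', 'untracked', None, 'omg', None, 'untracked'])

# Option 1: fill every gap with the most frequent category.
filled_mode = tracked.fillna(tracked.mode().iloc[0])

# Option 2: sample replacements according to the observed category frequencies.
freqs = tracked.value_counts(normalize=True)
sampled = rng.choice(freqs.index.to_numpy(),
                     size=int(tracked.isna().sum()),
                     p=freqs.to_numpy())
filled_sampled = tracked.copy()
filled_sampled.loc[tracked.isna()] = sampled

print(filled_mode.tolist())
print(filled_sampled.tolist())
```

Option 1 inflates the majority category, while option 2 preserves the observed proportions but assigns categories at random; both can distort the relationship with the target.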

In [None]:
df1 = df1[~df1['first_affiliate_tracked'].isna()]

df1.isna().sum() / len(df1)

### 1.3.2. Checking the 'sessions.csv' dataset

In [None]:
df_sessions.isna().sum() / len(df_sessions)

From the output above, a few observations can be made:

- Even though 'user_id' has only 0.3% of missing data, it is the key that connects the 'users.csv' and 'sessions.csv' datasets, so it cannot be imputed. Rows without a 'user_id' have to be dropped.

- The features 'action_type' and 'action_detail' have the same share of missing data, which suggests that their missing values are related. This needs further analysis.
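
Whether the two features are missing on the same rows can be checked by cross-tabulating their NA masks. A minimal sketch on toy data (the values are made up; on the real data the same calls would be applied to `df_sessions`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two session columns.
toy = pd.DataFrame({
    'action_type':   ['click', np.nan, 'view', np.nan, 'data'],
    'action_detail': ['p3',    np.nan, 'p5',   np.nan, 'wishlist'],
})

# Cross-tabulate the NA masks: off-diagonal cells count rows where only one column is missing.
na_table = pd.crosstab(toy['action_type'].isna(), toy['action_detail'].isna())
print(na_table)

# True when both columns are missing on exactly the same rows.
same_rows = (toy['action_type'].isna() == toy['action_detail'].isna()).all()
print(same_rows)
```

If the off-diagonal cells are zero, the two features are always missing together and can be treated as a single missingness pattern.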

In [None]:
# user_id (0.3%)
df_sessions = df_sessions[~df_sessions['user_id'].isna()]

# action (0.7%)
df_sessions = df_sessions[~df_sessions['action'].isna()]

# action_type (11%)
df_sessions = df_sessions[~df_sessions['action_type'].isna()]

# action_detail (11%)
df_sessions = df_sessions[~df_sessions['action_detail'].isna()]

df_sessions.isna().sum() / len(df_sessions)

For the 'secs_elapsed' feature, the distribution can be inspected as follows...

In [None]:
# sns.distplot( df_sessions['secs_elapsed'].sample(100000) );

aux = df_sessions[df_sessions['secs_elapsed'] < 0.25e6]

sns.histplot( aux['secs_elapsed'].sample(100000), kde=True )

In [None]:
aux['secs_elapsed'].mean()

In [None]:
# secs_elapsed (1.2%)
df_sessions = df_sessions[~df_sessions['secs_elapsed'].isna()]

df_sessions.isna().sum() / len(df_sessions)

## 1.4. Changing the Data Types

In [None]:
## At first (1st CRISP cycle), NA values will not be considered in the model analysis, because the goal is to deliver 
## initial results quickly. If the variables 'date_first_booking', 'age' and 'first_affiliate_tracked' prove to be interesting 
## for modeling in a 2nd cycle of the project, they will be introduced as relevant features for the analysis.

# Shape of dataframe containing NA data:
print( 'Number of total columns before NA dropping: {}'.format( df_users.shape[1] ) )
print( 'Number of total rows before NA dropping: {}'.format( df_users.shape[0] ) )

# Removing missing values (Containing NA):
df1 = df1.dropna()

# Shape of dataframe after removing NA data:
print( '\nNumber of total columns after NA dropping: {}'.format( df1.shape[1] ) )
print( 'Number of total rows after NA dropping: {}'.format( df1.shape[0] ) )

In [None]:
df1.dtypes

In [None]:
# Date which the account was created ('date_account_created'):
df1['date_account_created'] = pd.to_datetime( df1['date_account_created'] )

# Timestamp that the user was active the first time ('timestamp_first_active'):
df1['timestamp_first_active'] = pd.to_datetime( df1['timestamp_first_active'], format = '%Y%m%d%H%M%S' )

## (The column is stored as an 'int64' in the form YYYYMMDDHHMMSS, so an explicit format is needed to parse it into a proper datetime.)

# Date for when the user booked the first time ('date_first_booking'):
df1['date_first_booking'] = pd.to_datetime( df1['date_first_booking'] )

# Age:
df1['age'] = df1['age'].astype( int )
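
To make the 'timestamp_first_active' conversion concrete, the sketch below parses two integers written in the YYYYMMDDHHMMSS layout. The values are illustrative, and casting to string first is an assumption made here to keep the fixed-width format explicit:

```python
import pandas as pd

# Integers in the YYYYMMDDHHMMSS layout (illustrative values).
raw = pd.Series([20090319043255, 20140101000000])

# Cast to string, then parse with the explicit fixed-width format.
parsed = pd.to_datetime(raw.astype(str), format='%Y%m%d%H%M%S')
print(parsed)
```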


In [None]:
df1.dtypes

## 1.5. Checking of Balanced Data

In [None]:
df1['country_destination'].value_counts( normalize=True )

# 2. Data Filtering and Cleansing

In [None]:
df2 = df1.copy()

## 2.1 Filtering Rows

## 2.2 Columns Selection

# 3. Data Preparation

In [None]:
df3 = df2.copy()

In [None]:
df3.shape

In [None]:
# Dummy variables (one-hot encoding of the categorical features, which the NN cannot consume as strings):
df3_dummy = pd.get_dummies( df3.drop( ['id', 'country_destination'], axis=1 ) )

# Joining 'id' and 'country_destination' back with the dummy variables:
df3 = pd.concat( [ df3[['id', 'country_destination']], df3_dummy ], axis=1 )

df3.shape

# 4. Feature Selection of Variables

In [None]:
# Dropping original dates due to lack of further information that could be used by the model:
cols_drop = [ 'date_account_created', 'timestamp_first_active', 'date_first_booking' ]

df4 = df3.drop( cols_drop, axis = 1 )

In [None]:
df4.shape

# 5.0 Machine Learning Model

In [None]:
print(df_users.shape)
print(df1.shape)
print(df2.shape)
print(df3.shape)
print(df4.shape)

In [None]:
X = df4.drop( 'country_destination', axis=1 )
print(X.shape)

y = df4['country_destination'].copy()
print(y.shape)

In [None]:
# Split of prepared dataset into training and test datasets:
X_train, X_test, y_train, y_test = ms.train_test_split( X, y, test_size=0.2, random_state=32 )

In [None]:
x_train = X_train.drop( 'id', axis=1 )
x_test = X_test.drop( 'id', axis=1 )

print( 'Shape for training set: {}'.format( x_train.shape ) )
print( 'Shape for test set: {}'.format( x_test.shape ) )

## 5.1. Baseline Model

### 5.1.1. Building the baseline

In [None]:
df1['country_destination'].value_counts(normalize=True).sort_index().tolist()

In [None]:
# For a regression analysis, the baseline model is the average of the data. For this particular case in which
# the problem is based on a classification analysis, the baseline model must be one that *randomly chooses the
# baseline prediction*:

# (https://docs.python.org/3/library/random.html)

country_destination_list = df1['country_destination'].drop_duplicates().sort_values().tolist()

k_num = y_test.shape[0]

country_destination_weights = df1['country_destination'].value_counts(normalize=True).sort_index().tolist()

yhat_random = random.choices(population=country_destination_list,
                             weights=country_destination_weights, 
                             k=k_num)

len(yhat_random)
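
A useful sanity check for this kind of baseline: when labels are drawn with the class frequencies as weights, the expected accuracy is the sum of the squared class frequencies. The sketch below verifies this empirically on hypothetical frequencies (the three classes and their weights are made up, not taken from the dataset):

```python
import random

import numpy as np

random.seed(32)  # fixed seed so the sketch is reproducible

# Hypothetical class frequencies (they only need to sum to 1).
classes = ['NDF', 'US', 'other']
weights = [0.6, 0.3, 0.1]

# Expected accuracy of frequency-weighted guessing: sum of squared class frequencies.
expected_acc = float(np.sum(np.square(weights)))

# Empirical check using the same sampling call as the baseline.
n = 100_000
y_true = random.choices(population=classes, weights=weights, k=n)
y_rand = random.choices(population=classes, weights=weights, k=n)
empirical_acc = float(np.mean(np.array(y_true) == np.array(y_rand)))

print(expected_acc, empirical_acc)
```

Here the expected value is approximately 0.46, so a strongly imbalanced target can make even a random baseline look deceptively accurate.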

### 5.1.2. Evaluation of baseline model performance

In [None]:
# Accuracy
accur_random = m.accuracy_score( y_test, yhat_random )
print( 'Accuracy from model: {}%'.format( accur_random*100 ) )

In [None]:
# Confusion matrix
mt.plot_confusion_matrix( y_test, yhat_random, normalize=False, figsize=(12, 12) );

In [None]:
# Balanced Accuracy
balanced_accur_random = m.balanced_accuracy_score(y_test, yhat_random)
print( '\nBalanced accuracy for the random baseline: {}%\n'.format(balanced_accur_random*100) )

In [None]:
# Classification Report:
print( m.classification_report(y_test, yhat_random) )

In [None]:
# Kappa Metrics:
kappa_random = m.cohen_kappa_score(y_test, yhat_random)
print( '\nKappa Score for the random baseline: {}%'.format(kappa_random*100) )
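
The metrics used in this section react very differently to class imbalance. As a hedged illustration on a made-up target (90 rows of one class and 10 of another), a model that always predicts the majority class gets a high accuracy while its balanced accuracy and Kappa score expose that it learned nothing:

```python
import numpy as np
from sklearn import metrics as m

# Toy imbalanced target: 90 'NDF' rows and 10 'US' rows (made-up proportions).
y_true = np.array(['NDF'] * 90 + ['US'] * 10)

# A degenerate model that always predicts the majority class.
y_pred = np.array(['NDF'] * 100)

acc = m.accuracy_score(y_true, y_pred)               # share of correctly predicted rows
bal_acc = m.balanced_accuracy_score(y_true, y_pred)  # mean of the per-class recalls
kappa = m.cohen_kappa_score(y_true, y_pred)          # agreement beyond chance

print(acc, bal_acc, kappa)
```

This is the reason balanced accuracy and Kappa are reported alongside plain accuracy.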

## 5.2. Transformation of categorical variables

In [None]:
ohe = pp.OneHotEncoder()

# The neural network (NN) expects the categorical target (i.e. y_train) to be one-hot encoded,
# with one column per class:
y_train_nn = ohe.fit_transform( y_train.values.reshape( -1, 1 ) ).toarray()

In [None]:
y_train_nn
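
To show what the encoded target looks like and how a row of predicted probabilities maps back to a label, here is a small sketch on toy labels. The probabilities are made up, and decoding through `argmax` over `ohe.categories_` is shown as one explicit way to invert the encoding:

```python
import numpy as np
from sklearn import preprocessing as pp

# Toy target labels, shaped as a single column.
labels = np.array(['US', 'NDF', 'FR', 'US']).reshape(-1, 1)

ohe = pp.OneHotEncoder()
encoded = ohe.fit_transform(labels).toarray()  # one column per class, in sorted class order

# Made-up softmax-like outputs for three rows, one probability per class column.
probs = np.array([
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
])

# Decoding: pick the column with the highest probability and map it back to its label.
decoded = ohe.categories_[0][probs.argmax(axis=1)]
print(encoded)
print(decoded)
```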

### 5.2.1. Data dimensions for the constructed datasets (Report)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

print('\n')
print(y_train_nn.shape)

# Analysing the datasets to be used in the NN:
print('\n')
print('x_train:')
print(x_train.shape)
print('\ny_train for NN:')
print(y_train_nn.shape)

## 5.3. Building up the Neural Network - NN MLP

In [None]:
# Model definition:
model = ml.Sequential()

# First layer of neural network:
model.add( l.Dense( 128, input_dim=x_train.shape[1], activation='relu' ) )  

# Output layer of the neural network (one unit per encoded target class):
model.add( l.Dense( y_train_nn.shape[1], activation='softmax' ) )

# Compiling the model:
model.compile( loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'] )

# Training the model:
model.fit( x_train, y_train_nn, epochs=100 )

  


# 6. Neural Network (NN) Performance

## 6.1. Evaluating the prediction performance

In [None]:
# Predicting the class probabilities for the test set:
pred_nn = model.predict( x_test )

# Decoding the predicted probabilities back into class labels:
yhat_nn = ohe.inverse_transform( pred_nn )

# Preparing the true and the predicted labels as flat arrays:
y_test_nn = y_test.to_numpy()

yhat_nn = yhat_nn.reshape( 1, -1 )[0]

## 6.2. Post-Evaluation of Metrics

In [None]:
# Accuracy
accur_nn = m.accuracy_score( y_test_nn, yhat_nn )

print( 'Accuracy from model: {}%'.format( accur_nn*100 ) )

In [None]:
# Confusion matrix
mt.plot_confusion_matrix( y_test_nn, yhat_nn, normalize=False, figsize=(12, 12) );

In [None]:
# Balanced Accuracy
balanced_accur_nn = m.balanced_accuracy_score(y_test_nn, yhat_nn)

print( 'Balanced accuracy for NN: {}%'.format(balanced_accur_nn*100) )

In [None]:
# Classification Report:
print( m.classification_report(y_test_nn, yhat_nn) )

In [None]:
# Kappa Metrics:
kappa_nn = m.cohen_kappa_score(y_test_nn, yhat_nn)

print( 'Kappa Score for NN: {}%'.format(kappa_nn*100) )

## 6.3. Implementing Cross Validation for NN Performance

In [None]:
# Generating k-fold:
num_folds = 5
kfold = ms.StratifiedKFold( n_splits=num_folds, shuffle=True, random_state=32 )

balanced_accur_list = []
kappa_accur_list = []

i = 1

for train_ix, val_ix in kfold.split( x_train, y_train ):    # At each iteration, the stratified split preserves the class 
                                                            # proportions in both the training and the validation folds.
  
  # Print out of current fold in iteration:
  print( 'Fold Number: {}/{}'.format(i, num_folds) )
  
  # Getting the folds for...

  ## the training dataset:
  x_train_fold = x_train.iloc[train_ix]
  y_train_fold = y_train.iloc[train_ix]

  ## the validation ('testing') dataset:
  x_val_fold = x_train.iloc[val_ix]
  y_val_fold = y_train.iloc[val_ix]


  # One-hot encoding the target of the training fold (the validation target stays as labels 
  # and is compared against the decoded predictions):
  ohe =  pp.OneHotEncoder()
  y_train_fold_nn = ohe.fit_transform( y_train_fold.values.reshape(-1, 1) ).toarray()


  # Model definition:
  ## (https://keras.io/api/layers/core_layers/dense/)
  model = ml.Sequential()
  model.add( l.Dense(256, input_dim=x_train.shape[1], activation='relu') )
  model.add( l.Dense(y_train_fold_nn.shape[1], activation='softmax') )

  # Compiling the model:
  model.compile( loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'] )

  # Training the model:
  model.fit( x_train_fold, y_train_fold_nn, epochs=100, batch_size=32, verbose=0 )

  # Predictions:
  pred_nn = model.predict(x_val_fold)

  ## 'pred_nn' holds one probability per one-hot encoded class, so it must be 
  ## decoded back into class labels:
  yhat_nn = ohe.inverse_transform(pred_nn)


  # Preparing the data after the prediction built-up:
  y_test_nn = y_val_fold.to_numpy()   # The true labels ('y_val_fold') are converted from a pandas Series into 
                                      # a NumPy array so that they can be compared with the decoded 
                                      # predictions, which are also a NumPy array.
  yhat_nn = yhat_nn.reshape(1, -1)[0]


  # Metrics:

  ## Balanced Accuracy Metrics:
  balanced_accur_nn = m.balanced_accuracy_score(y_test_nn, yhat_nn)
  balanced_accur_list.append(balanced_accur_nn) 

  ## Kappa Metrics:
  kappa_accur_nn = m.cohen_kappa_score(y_test_nn, yhat_nn)
  kappa_accur_list.append(kappa_accur_nn) 

  i += 1
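
The effect of the stratified split used in the loop above can be checked in isolation. The sketch below uses a made-up target with 80% of class 'a' and 20% of class 'b' and shows that every validation fold keeps the same minority share; the dummy feature matrix exists only because `split` requires one:

```python
import numpy as np
from sklearn import model_selection as ms

# Toy imbalanced target: 80% class 'a', 20% class 'b'.
y = np.array(['a'] * 80 + ['b'] * 20)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

kfold = ms.StratifiedKFold(n_splits=5, shuffle=True, random_state=32)

val_shares = []
for train_ix, val_ix in kfold.split(X, y):
    # Share of the minority class inside this validation fold.
    val_shares.append(float(np.mean(y[val_ix] == 'b')))

print(val_shares)
```

With a plain `KFold`, the minority share could vary from fold to fold, which makes the per-fold metrics harder to compare on imbalanced data.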

In [None]:
# List for balanced accuracy, in which each position refers to the specific value of accuracy at 
# the current iterated fold.

# print(type(balanced_accur_list))

# Statistical Description for the balanced accuracy metric (Mean value + standard deviation):

print( 'Average value for Balanced Accuracy: {} +/- {:,.10f}'.format(np.mean(balanced_accur_list), np.std(balanced_accur_list)) )
print( 'Average value for Kappa Score: {} +/- {:,.10f}'.format(np.mean(kappa_accur_list), np.std(kappa_accur_list)) )



Continue the live session at -1:16:50 (time remaining) -> https://membro.comunidadedatascience.com/38233-pa000-previsao-de-agendamento-do-airbnb/841683-live-003-cross-validation-para-dados-desbalanceados