Data analysis.

We are going to analyze the data stored in the dataframe 'final_df', which is located in the directory 'data_new'. The reader can check the Jupyter notebook 'data_preparation' to find out how 'final_df' was built. Many columns in that dataframe have self-explanatory titles. The other columns are described in the nested dictionary 'final_dic' (a dictionary of dictionaries). For confidentiality reasons, 'final_dic' is not publicly available; only its censored version, 'final_dic_pub', is. To start, let's import the relevant modules.

In [1]:
## Relevant modules.
import os
import csv
import numpy as np
import pandas as pd

The files stored in the directory 'data_new' are CSV files. We are going to use them to recover 'final_df' and 'final_dic_pub'.

In [2]:
## Recovering 'final_df'.
final_df = pd.read_csv('data_new/final_df.csv', 
                       index_col=0, 
                       header=0, 
                       low_memory=False,
                       keep_default_na=False,
                       na_values='')

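## Each row of the dtype list is a (column name, dtype string) pair.
with open('data_new/final_df_dtypelist.csv', newline='') as to_read:
    reader = csv.reader(to_read)
    final_df_dtypedic = dict(reader)

## Restore the dtype that each column lost when it was written to csv.
for col in final_df.columns:
    final_df[col] = final_df[col].astype(final_df_dtypedic[col])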

## Recovering 'final_dic_pub'.
final_dic_pub_df = pd.read_csv('data_new/final_dic_pub_df.csv',
                               index_col=0, 
                               header=0, 
                               low_memory=False,
                               keep_default_na=False,
                               na_values='')

## Each column of 'final_dic_pub_df' becomes one inner dictionary.
final_dic_pub = {}
for key in final_dic_pub_df.columns:
    final_dic_pub[key] = final_dic_pub_df[key].dropna().to_dict()
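The two recovery steps above can be tried on toy data. The sketch below is only an illustration: the names 'toy_df', 'toy_dtypes', 'toy_dic_df' and 'toy_dic', as well as the column names and values, are made up and do not come from the real files. It assumes that the dtype list holds one (column name, dtype string) pair per row and that every column of the dictionary dataframe holds one inner dictionary, padded with missing values.

```python
import io
import pandas as pd

## Toy stand-in for 'final_df.csv' (hypothetical data, not the real file).
toy_csv = io.StringIO("id,age,visit\n1,34,2019-01-05\n2,,2019-02-11\n")
toy_df = pd.read_csv(toy_csv, index_col=0, header=0,
                     keep_default_na=False, na_values='')

## Toy stand-in for the dtype list: one (column, dtype) pair per row.
toy_dtypes = dict([['age', 'Int64'], ['visit', 'datetime64[ns]']])

## After read_csv, 'age' is float64 (it has a missing value) and 'visit'
## is a column of strings; astype restores the intended dtypes.
for col in toy_df.columns:
    toy_df[col] = toy_df[col].astype(toy_dtypes[col])

## Toy stand-in for 'final_dic_pub_df': the inner dictionaries have
## different keys, so the dataframe is padded with missing values.
toy_dic_df = pd.DataFrame({'q1': {'1': 'yes', '2': 'no'},
                           'q2': {'1': 'low', '3': 'high'}})

## dropna() removes the padding before each column becomes a dictionary.
toy_dic = {}
for key in toy_dic_df.columns:
    toy_dic[key] = toy_dic_df[key].dropna().to_dict()
```

In this sketch, the nullable dtype 'Int64' is what lets an integer column keep its missing values, and dropping the missing values is what prevents the padding from showing up as spurious entries in the inner dictionaries.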

To keep things safe and simple, let's work with a copy of 'final_df' called 'df'.

In [3]:
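df = final_df.copy()
## 'df.info' prints its report itself and returns None, so no 'print' is needed.
## Note: pandas 1.2 renamed 'null_counts' to 'show_counts'; pandas 2.0 removed the old name.
df.info(verbose=True, null_counts=True)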

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56640 entries, 132755 to 295082
Data columns (total 112 columns):
age                 9051 non-null Int64
age_range           9051 non-null object
country             41135 non-null object
us_region           40599 non-null object
us_zipcode          40529 non-null object
firm_id             56425 non-null float64
firm_type           56425 non-null float64
firm_loc            56425 non-null object
contact_date        56425 non-null datetime64[ns]
visit_datetime      56425 non-null datetime64[ns]
status_cancel       56425 non-null float64
source_online       56425 non-null float64
payment_cc          56425 non-null float64
guests              56425 non-null float64
guests_in           34521 non-null Int64
fee                 56425 non-null float64
discount            56425 non-null float64
tot_quant_addons    56425 non-null float64
tot_rev_addons      56425 non-null float64
tot_rev             56425 non-null float64
q1_bef              3