# %title%

**_Author: Jessica Cervi_**

**Expected time = %expected_time%**

**Total points = 70 points**

    
## Assignment Overview


In this assignment you will perform exploratory data analysis on a dataset from the travel industry.  You will explore different ways of visualizing the data to better understand relationships between variables. You'll examine the descriptive statistics and plots to draw new insights. 


This assignment is designed to build your familiarity and comfort coding in Python while also helping you review key topics from each module. As you progress through the assignment, answers will get increasingly complex. It is important that you adopt a data scientist's mindset when completing this assignment. **Remember to run your code from each cell before submitting your assignment.** Running your code beforehand will notify you of errors and give you a chance to fix your errors before submitting. You should view your Vocareum submission as if you are delivering a final project to your manager or client. 

***Vocareum Tips***
- Do not add arguments or options to functions unless you are specifically asked to. This will cause an error in Vocareum.
- Do not use a library unless you are explicitly asked to in the question. 
- You can download the Grading Report after submitting the assignment. This will include feedback and hints on incorrect questions. 


### Learning Objectives

- Visualize data with matplotlib and probe for insights. 
- Use exploratory data analysis to describe data. 


**IMPORTANT INSTRUCTIONS:** 

- To make this module testable, you will be asked to save your figures as PNG files in a folder called "results". Please don't change the file names we ask you to give the plots, so that you can get all the points in every question.
- Don't add any customization you're not asked to in the plots.

## Index:


#### %title%

- [Question 1](#Question-1)
- [Question 2](#Question-2)
- [Question 3](#Question-3)
- [Question 4](#Question-4)
- [Question 5](#Question-5)
- [Question 6](#Question-6)
- [Question 7](#Question-7)
- [Question 8](#Question-8)
- [Question 9](#Question-9)
- [Question 10](#Question-10)

## %title%

In this assignment you will work with the `pandas` concepts you learned in Module 11 to examine a dataset from the travel industry and visualize relationships between variables. You will begin by familiarizing yourself with the columns of the dataset, then explore their relationships. 


### Inspecting your Data

The dataset that we will be using in this assignment contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. More detailed information about the dataset can be found [here](https://www.kaggle.com/jessemostipak/hotel-booking-demand).


We will begin by importing the necessary libraries for this assignment and by reading the dataset.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import scipy.stats as sp
import seaborn as sns

# Avoid warnings
import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv("./data/hotel_bookings.csv")

For convenience, we will use the method `.head()` to display the first 10 rows of our DataFrame.

In [2]:
df.head(10)

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03
5,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03
6,Resort Hotel,0,0,2015,July,27,1,0,2,2,...,No Deposit,,,0,Transient,107.0,0,0,Check-Out,2015-07-03
7,Resort Hotel,0,9,2015,July,27,1,0,2,2,...,No Deposit,303.0,,0,Transient,103.0,0,1,Check-Out,2015-07-03
8,Resort Hotel,1,85,2015,July,27,1,0,3,2,...,No Deposit,240.0,,0,Transient,82.0,0,1,Canceled,2015-05-06
9,Resort Hotel,1,75,2015,July,27,1,0,3,2,...,No Deposit,15.0,,0,Transient,105.5,0,0,Canceled,2015-04-22


In [3]:
df.describe()

Unnamed: 0,is_canceled,lead_time,arrival_date_year,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,booking_changes,agent,company,days_in_waiting_list,adr,required_car_parking_spaces,total_of_special_requests
count,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119386.0,119390.0,119390.0,119390.0,119390.0,119390.0,103050.0,6797.0,119390.0,119390.0,119390.0,119390.0
mean,0.370416,104.011416,2016.156554,27.165173,15.798241,0.927599,2.500302,1.856403,0.10389,0.007949,0.031912,0.087118,0.137097,0.221124,86.693382,189.266735,2.321149,101.831122,0.062518,0.571363
std,0.482918,106.863097,0.707476,13.605138,8.780829,0.998613,1.908286,0.579261,0.398561,0.097436,0.175767,0.844336,1.497437,0.652306,110.774548,131.655015,17.594721,50.53579,0.245291,0.792798
min,0.0,0.0,2015.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,6.0,0.0,-6.38,0.0,0.0
25%,0.0,18.0,2016.0,16.0,8.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,62.0,0.0,69.29,0.0,0.0
50%,0.0,69.0,2016.0,28.0,16.0,1.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,14.0,179.0,0.0,94.575,0.0,0.0
75%,1.0,160.0,2017.0,38.0,23.0,2.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,229.0,270.0,0.0,126.0,0.0,1.0
max,1.0,737.0,2017.0,53.0,31.0,19.0,50.0,55.0,10.0,10.0,1.0,26.0,72.0,21.0,535.0,543.0,391.0,5400.0,8.0,5.0


[Back to top](#Index:) 


### Question 1
*5 points*


We'll begin by exploring the arrival dates, to see when people begin a trip. 
Use `.value_counts()` on the column `arrival_date_year`. Assign the result to `ans1`.

In [4]:
### GRADED

### YOUR SOLUTION HERE
ans1 = None 

### BEGIN SOLUTION
ans1 = df['arrival_date_year'].value_counts()
### END SOLUTION

In [5]:
### BEGIN HIDDEN TESTS (5)
from pandas.testing import assert_series_equal
ans1_ = df['arrival_date_year'].value_counts()
#
#
#
assert ans1.equals(ans1_), "Remember that .value_counts() is a method that can be called on a column of the DataFrame"
print("Correct!")
### END HIDDEN TESTS

Correct!
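
As a quick illustration of what `.value_counts()` produces, here is a minimal sketch on a small, made-up Series (the values below are hypothetical and are not taken from the hotel dataset). It shows that the result is itself a Series whose index holds the distinct values, ordered from most to least frequent.

```python
import pandas as pd

# Hypothetical toy data standing in for the `arrival_date_year` column
years = pd.Series([2015, 2016, 2016, 2017, 2016, 2017], name="arrival_date_year")

# One entry per distinct value, most frequent first
counts = years.value_counts()

print(type(counts))           # the result is a pandas Series
print(counts.index.tolist())  # distinct years, ordered by descending count
print(counts.tolist())        # the matching counts
```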


[Back to top](#Index:) 

### Question 2
    
*5 points*

What data type does the method `.value_counts()` return? 
- a) A list
- b) A series
- c) A DataFrame.
- d) An object

Assign the character corresponding to your choice as a string to `ans2`.

In [6]:
### GRADED

### YOUR SOLUTION HERE
ans2 = None 

### BEGIN SOLUTION
ans2 = "b"
### END SOLUTION

In [7]:
### BEGIN HIDDEN TESTS (5)
#test
ans2_= "b"
#
#
#
assert ans2 == ans2_, "What is the type of the variable `ans1` from question 1?"
print("Correct!")
### END HIDDEN TESTS

Correct!


[Back to top](#Index:) 

### Question 3
    
*5 points*

Next, we will examine the lead time. How far in advance do people book travel? 
We can compute this with the median of the column `lead_time` by ignoring the NaN values. Assign the result to `ans3`.

In [8]:
### GRADED

### YOUR SOLUTION HERE
ans3 = None 

### BEGIN SOLUTION
ans3 = np.nanmedian(df["lead_time"])
### END SOLUTION

In [9]:
### BEGIN HIDDEN TESTS (5)
ans3_ = np.nanmedian(df["lead_time"])
#
#
#
assert ans3 == ans3_, "To compute the median ignoring the NaN values, use the NumPy function .nanmedian()"
print("Correct!")
### END HIDDEN TESTS

Correct!
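
To see why ignoring NaN values matters, the sketch below compares three ways of taking a median on a small, made-up Series (the numbers are hypothetical). It assumes nothing about the hotel data itself; it only illustrates how `np.median`, `np.nanmedian`, and the pandas `.median()` method treat missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical lead times with one missing value
lead = pd.Series([10.0, np.nan, 30.0, 50.0])

plain = np.median(lead.to_numpy())  # NaN propagates into the result
skip_np = np.nanmedian(lead)        # NaN is ignored: median of 10, 30, 50
skip_pd = lead.median()             # pandas skips NaN by default

print(plain, skip_np, skip_pd)
```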


[Back to top](#Index:) 

### Question 4
    
*10 points*

Now, we will create a heatmap which will help us explore relationships between the different variables, or columns, in the travel dataset. What relationships have we missed? This process will help us find out.

To begin, use the `figsize` parameter of `plt.figure()` to set the figure size to `(15,15)`.
Next, produce a heatmap with the correlation between the different columns of `df`. Specify the parameter `annot=True`. DO NOT specify any other parameter. Save your plot as a PNG with the name "plot4.png" in the folder "results".

In [10]:
### GRADED

### YOUR SOLUTION HERE

### BEGIN SOLUTION
plt.figure(figsize=(15,15))
sns.heatmap(df.corr(), annot= True)
plt.savefig("results/plot4.png")
plt.close()
### END SOLUTION

In [11]:
### BEGIN HIDDEN TESTS (10)
from matplotlib.testing.compare import compare_images
plt.figure(figsize=(15,15))
sns.heatmap(df.corr(), annot= True)
plt.savefig("results/plot4_.png")
plt.close()
#
#
#
assert isinstance(compare_images("results/plot4.png","results/plot4_.png", tol=0.1), type(None)), "Make sure you have used the function .heatmap() with the default parameters."
print("Correct!")

import os
if os.path.isfile('results/plot4.png'):
    os.remove('results/plot4.png')
### END HIDDEN TESTS

Correct!
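
The heatmap above colors the matrix returned by `df.corr()`. As a rough sketch of what that matrix contains, the example below computes it for a tiny, made-up DataFrame (the column names and values are hypothetical) in which one pair of columns rises together and another pair moves in opposite directions.

```python
import pandas as pd

# Hypothetical toy data, not taken from the hotel dataset
toy = pd.DataFrame({
    "nights": [1, 2, 3, 4],
    "cost": [100, 200, 300, 400],   # rises exactly in step with nights
    "discount": [40, 30, 20, 10],   # falls exactly as nights rise
})

# Pairwise Pearson correlation between every pair of columns
corr = toy.corr()
print(corr)
```

Each cell holds a value between -1 and 1, and the diagonal is 1 because every column is perfectly correlated with itself.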


[Back to top](#Index:) 

### Question 5
    
*5 points*

The hotel wants to know how the number of required parking spaces relates to the number of adults booking rooms. We will use feature engineering to create a new measure by dividing `adults` by `required_car_parking_spaces`.

Assign the result of this to a new column created in `df` called `parking_spaces_per_adult`.

In [12]:
### GRADED

### YOUR SOLUTION HERE
df['parking_spaces_per_adult'] = None 

### BEGIN SOLUTION
df['parking_spaces_per_adult'] = df['adults']/df['required_car_parking_spaces']
### END SOLUTION

In [13]:
### BEGIN HIDDEN TESTS (5)
df['parking_spaces_per_adult_'] = df['adults']/df['required_car_parking_spaces']
#
#
#
# Compare the two columns element by element (.equals() treats NaNs in the same position as equal)
assert df['parking_spaces_per_adult'].equals(df['parking_spaces_per_adult_']), "Remember, we want to divide df['adults'] by df['required_car_parking_spaces']"
print("Correct!")
### END HIDDEN TESTS

Correct!
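
Many bookings require zero parking spaces, so it is worth knowing how pandas handles a zero divisor. The sketch below uses a made-up three-row DataFrame (the values are hypothetical) to show that the division does not raise an error: it produces `inf` when only the divisor is zero and `NaN` when both values are zero.

```python
import numpy as np
import pandas as pd

# Hypothetical toy bookings: a normal row, a zero divisor, and zero over zero
toy = pd.DataFrame({
    "adults": [2, 2, 0],
    "required_car_parking_spaces": [1, 0, 0],
})

ratio = toy["adults"] / toy["required_car_parking_spaces"]
print(ratio.tolist())  # 2/1 -> 2.0, 2/0 -> inf, 0/0 -> nan
```

Statistics such as the mean of a column like this are affected by the `inf` entries, so it may be worth filtering or replacing them before summarizing.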


[Back to top](#Index:) 

### Question 6
    
*10 points*

Next, we'll examine the habits of travelers. Are people more likely to stay on week nights or weekend nights?

Produce a jointplot that compares the relationship between `stays_in_week_nights` and `stays_in_weekend_nights` of `df`. Pass only the two columns; DO NOT specify any other parameter. Save your plot as a PNG with the name "plot6.png" in the folder "results".

In [14]:
### GRADED

### YOUR SOLUTION HERE

### BEGIN SOLUTION
# Pass the columns by keyword; recent seaborn releases no longer accept them positionally
sns.jointplot(x=df.stays_in_week_nights, y=df.stays_in_weekend_nights)
plt.savefig("results/plot6.png")
plt.close()
### END SOLUTION

In [15]:
### BEGIN HIDDEN TESTS (10)
from matplotlib.testing.compare import compare_images
# Pass the columns by keyword; recent seaborn releases no longer accept them positionally
sns.jointplot(x=df.stays_in_week_nights, y=df.stays_in_weekend_nights)
plt.savefig("results/plot6_.png")
plt.close()
#
#
#
assert isinstance(compare_images("results/plot6.png","results/plot6_.png", tol=0.1), type(None)), "Make sure you have used the function .jointplot() with the default parameters."
print("Correct!")

import os
if os.path.isfile('results/plot6.png'):
    os.remove('results/plot6.png')
### END HIDDEN TESTS

Correct!


[Back to top](#Index:) 

### Question 7
    
*5 points*

Let's take a closer look at the graph you created to compare stays on week nights with stays on weekend nights. 

From the graph produced in question 6, what can you say about the relationship between `stays_in_week_nights` and `stays_in_weekend_nights`?

- a) The two variables have a correlation value close to one (high correlation)
- b) The two variables are uncorrelated (correlation value close to zero)
- c) The two variables have a correlation value close to -0.5
- d) None of the above


Assign the character corresponding to your choice as a string to `ans7`.

In [16]:
### GRADED

### YOUR SOLUTION HERE
ans7 = None 

### BEGIN SOLUTION
ans7 = "a"
### END SOLUTION

In [17]:
### BEGIN HIDDEN TESTS (5)
ans7_= "a"
#
#
#
assert ans7 == ans7_, "Remember, if variables are highly correlated then their correlation value is close to the highest possible"
print("Correct!")
### END HIDDEN TESTS

Correct!


[Back to top](#Index:) 

### Question 8
    
*10 points*

Could the size of the group impact whether travelers are likely to stay on weekend nights? Let's explore this. 
 
To begin, use the `figsize` parameter of `plt.figure()` to set the figure size to `(5,5)`. Next, produce a boxplot that compares the relationship between `adults` and `stays_in_weekend_nights` of `df` and set the x limits equal to `(-1,5)`. Save your plot as a PNG with the name "plot8.png" in the folder "results".

In [18]:
### GRADED

### YOUR SOLUTION HERE

### BEGIN SOLUTION
fig = plt.figure(figsize=(5, 5))
# Pass the columns by keyword; recent seaborn releases no longer accept them positionally
sns.boxplot(x=df['adults'], y=df['stays_in_weekend_nights'])
plt.xlim(-1,5)
plt.savefig("results/plot8.png")
plt.close()
### END SOLUTION

In [19]:
### BEGIN HIDDEN TESTS (10)
from matplotlib.testing.compare import compare_images
fig = plt.figure(figsize=(5, 5))
# Pass the columns by keyword; recent seaborn releases no longer accept them positionally
sns.boxplot(x=df['adults'], y=df['stays_in_weekend_nights'])
plt.xlim(-1,5)
plt.savefig("results/plot8_.png")
plt.close()
#
#
#

assert isinstance(compare_images("results/plot8_.png","results/plot8.png", tol=0.1), type(None)), "Make sure you have used the function .boxplot() and set the x limits to (-1,5)."
print("Correct!")

import os
if os.path.isfile('results/plot8.png'):
    os.remove('results/plot8.png')
### END HIDDEN TESTS

Correct!


[Back to top](#Index:) 

### Question 9
    
*10 points*

Next, we want to examine the reservation date more closely. We will split the entries of this column into three new columns to see the year, month, and day. 
 
Using the appropriate string method, split the column `reservation_status_date` at every occurrence of “-”. Next, add three new columns to `df`: `year`, `month` and `day`.

In [20]:
### GRADED

### YOUR SOLUTION HERE 
year = None 
month = None
day = None

### BEGIN SOLUTION
new= df["reservation_status_date"].str.split("-", n = 2, expand = True) 
df["year"] = new[0]
df["month"] = new[1]
df["day"] = new[2]
### END SOLUTION

In [21]:
### BEGIN HIDDEN TESTS (10)
# Build the reference columns from the reference split `new_`, not from the student's `new`
new_ = df["reservation_status_date"].str.split("-", n = 2, expand = True) 
df["year_"] = new_[0]
df["month_"] = new_[1]
df["day_"] = new_[2]
#
#
#
# Compare the columns element by element rather than comparing two single booleans
assert df.year.equals(df.year_), "Make sure you have split the column `reservation_status_date` correctly and assign the first split to `year`."
assert df.month.equals(df.month_), "Make sure you have split the column `reservation_status_date` correctly and assign the second split to `month`."
assert df.day.equals(df.day_), "Make sure you have split the column `reservation_status_date` correctly and assign the third split to `day`."
print("Correct!")
### END HIDDEN TESTS

Correct!
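
For reference, here is a minimal sketch of how `.str.split()` with `expand=True` behaves on a small, made-up Series of date strings. It shows that each piece lands in its own numbered column and that the pieces are still strings, so a conversion such as `.astype(int)` is needed before doing arithmetic on them.

```python
import pandas as pd

# Two sample date strings in the same YYYY-MM-DD layout as the dataset
dates = pd.Series(["2015-07-01", "2015-05-06"], name="reservation_status_date")

# expand=True returns a DataFrame with one column per piece (0, 1, 2)
parts = dates.str.split("-", expand=True)
print(parts)

# The pieces are strings; convert them if numeric values are needed
int_years = parts[0].astype(int)
print(int_years.tolist())
```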


[Back to top](#Index:) 

### Question 10
    
*5 points*

To conclude, we will examine requests for required car parking spaces. Does a traveler need a parking space, yes or no?

Use one-hot encoding to create dummy categorical variables for modeling on the column `required_car_parking_spaces`. Make sure you drop the first level by using `drop_first=True`. Save this to a new DataFrame called `df1`.

In [22]:
### GRADED

### YOUR SOLUTION HERE
df1 = None

### BEGIN SOLUTION
df1 =  pd.get_dummies(df, columns = ['required_car_parking_spaces'], drop_first=True)
### END SOLUTION

In [23]:
### BEGIN HIDDEN TESTS (5)
from pandas.testing import assert_frame_equal
df1_ =  pd.get_dummies(df, columns = ['required_car_parking_spaces'], drop_first=True)
#
#
#
assert df1.equals(df1_), "Make sure you have used the function .get_dummies()"
print("Correct!")
### END HIDDEN TESTS

Correct!
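
To make the effect of `drop_first=True` concrete, the sketch below encodes a tiny, made-up DataFrame (the `guest` column and all values are hypothetical). It illustrates that the original column is replaced by one indicator column per level, except for the first level, which is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical toy bookings with three distinct parking-space levels: 0, 1, 2
toy = pd.DataFrame({
    "guest": ["a", "b", "c", "d"],
    "required_car_parking_spaces": [0, 1, 0, 2],
})

encoded = pd.get_dummies(toy, columns=["required_car_parking_spaces"], drop_first=True)

# Level 0 is dropped; a row with all indicators off means "0 spaces"
print(encoded.columns.tolist())
print(encoded)
```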
