# Coding Temple's Data Analytics Course
---


# AirBnB NY Locations Data Case Study
---
In this final project, your task is to take the data provided and find evidence to answer the following questions.

1. Which hosts are the busiest and why?
2. How many neighborhood groups are available and which shows up the most?
3. Are private rooms the most popular in Manhattan?
4. Which hosts are the busiest based on their reviews?
5. Which neighborhood group has the highest average price?
6. Which neighborhood group has the highest total price?
7. Which top 5 hosts have the highest total price?
8. Who currently has no (zero) availability with a review count of 100 or more?
9. What host has the highest total of prices and where are they located?
10. When did Danielle from Queens last receive a review?
11. Which host has the most listings?
12. How many listings have completely open availability?
13. What room types have the highest review numbers?
14. Is there a feature you would add to this dataset? Can we create it using existing data? If so, do it and walk through why. If not, walk through why you think the existing data is complete without any additional features.

In order to simulate a real-life situation in which you have a take-home project from a company, you will have the chance to ask me questions before you start on this project, but once you have started, you will be on your own. You will need to have this completed and turned in by 9am Monday morning for it to be considered complete. You are free to use any libraries, methods, and packages to complete your analysis.

### **Treat this like a real take-home assessment. Your workflow for this project should look something like:**
* Import all necessary libraries for your project
* Create markdown cells and write out your thoughts and pre-analysis of the question, highlighting what features you will use to figure it out and what methods you will be testing.
* Write out the code, with proper comments so anyone can follow along with your thought process.
* Create a markdown cell after, summarizing your findings and conclusions after testing your methods.
* Rinse and repeat until you get through the questions
* Include a concluding summary of your analysis at the end.

#### Import all necessary libraries for your project.


In [319]:
#Your code here.

#Import pandas library using the industry standard alias.
#Import pandas as pd to use functions for working with dates.
  
import pandas as pd

##### 1. Which hosts are the busiest and why?
##### *For this question, I will load the dataset and use the column host_name to find busiest hosts. To support this I will also compare various columns to find other commonalities.*

In [334]:
#Your code here.

import pandas as pd

#Load the dataset.
data = pd.read_csv('AB_NYC_2019 (1).csv')

#Show the first rows of the dataset.
print(data.head())

#value_counts() counts how many times each unique value appears in a Series.
#head(5) keeps the first five rows, which are the five most frequent values.
busiest_hosts = data['host_name'].value_counts().head(5)

#Print with f-string {}.
print(f'Busiest hosts: {busiest_hosts.to_string()}')


     id                                              name  host_id   
0  2539                Clean & quiet apt home by the park     2787  \
1  2595                             Skylit Midtown Castle     2845   
2  3647               THE VILLAGE OF HARLEM....NEW YORK !     4632   
3  3831                   Cozy Entire Floor of Brownstone     4869   
4  5022  Entire Apt: Spacious Studio/Loft by central park     7192   

     host_name neighbourhood_group neighbourhood  latitude  longitude   
0         John            Brooklyn    Kensington  40.64749  -73.97237  \
1     Jennifer           Manhattan       Midtown  40.75362  -73.98377   
2    Elisabeth           Manhattan        Harlem  40.80902  -73.94190   
3  LisaRoxanne            Brooklyn  Clinton Hill  40.68514  -73.95976   
4        Laura           Manhattan   East Harlem  40.79851  -73.94399   

         room_type  price  minimum_nights  number_of_reviews last_review   
0     Private room    149               1                  9  20

#### Conclusion to Q1 
#### *In conclusion, we have found that Michael, David, Sonder, John, and Alex appear to be the busiest hosts, measured by how many listings carry each host name. This could be attributed to the property location.*
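
One caveat with counting `host_name`: first names are shared by many different hosts, so a name like Michael or David may combine several unrelated accounts. The sketch below assumes the `host_id` column shown in the `head()` output identifies each host uniquely, and counts listings per ID instead. The small `sample` frame is made-up illustration data, not rows from the dataset.

```python
import pandas as pd

# Hypothetical sample rows (not taken from the dataset) to illustrate the idea.
sample = pd.DataFrame({
    'host_id':   [1, 1, 2, 3, 3, 3],
    'host_name': ['Michael', 'Michael', 'Michael', 'Sonder', 'Sonder', 'Sonder'],
})

# Counting by name merges the two different hosts called Michael.
by_name = sample['host_name'].value_counts()

# Counting by (host_id, host_name) keeps distinct hosts apart.
by_host = (
    sample.groupby(['host_id', 'host_name'])
          .size()
          .sort_values(ascending=False)
)

print(by_name)
print(by_host)
```

On the real data, the same pattern would be applied to `data` in place of `sample`.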

##### 2. How many neighborhood groups are available and which shows up the most?
##### *For this question, I will use the column neighbourhood_group to find out how many groups are available. I will then calculate which of those groups appears most frequently.*

In [348]:
#Your code here.

#Count how many distinct neighbourhood groups appear in the data.
#nunique() returns the number of unique values in a column.
neighborhood_groups = data['neighbourhood_group'].nunique()

#value_counts() counts the occurrences of each unique value in a Series.
#idxmax() returns the index label of the largest count, i.e. the most common group.
most_common_group = data['neighbourhood_group'].value_counts().idxmax()

#Print with f-string {}.
print(f'Number of neighborhood groups: {neighborhood_groups}')
print(f'Most common neighborhood group: {most_common_group}')

Number of neighborhood groups: 5
Most common neighborhood group: Manhattan


#### Conclusion to Q2
#### *In conclusion, we see that there are 5 different neighborhood groups. The most common of the neighborhood groups is Manhattan.*
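
To show how far ahead the most common group is, and not just its name, the counts and shares can be printed side by side. This is a minimal sketch on made-up values; `groups` stands in for `data['neighbourhood_group']`.

```python
import pandas as pd

# Hypothetical sample values for illustration only.
groups = pd.Series(
    ['Manhattan', 'Brooklyn', 'Manhattan', 'Queens', 'Manhattan', 'Brooklyn'],
    name='neighbourhood_group',
)

counts = groups.value_counts()                # listings per group
shares = groups.value_counts(normalize=True)  # fraction of all listings

# Combine both views into one table, aligned on the group name.
summary = pd.DataFrame({'listings': counts, 'share': shares.round(3)})
print(summary)
```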

##### 3. Are private rooms the most popular in Manhattan?
#### *For this question, I will use the column neighbourhood_group and cross-reference it against the column room_type to see if private rooms are the most popular compared to the other room types available.*

In [361]:
#Your code here.

#value_counts() counts the occurrences of each unique value in a Series.
#idxmax() returns the index label of the largest count, i.e. the most common room type.
private_rooms_manhattan = data[data['neighbourhood_group'] == 'Manhattan']['room_type'].value_counts().idxmax()

#Print with f-string {}.
print(f'Most popular room type in Manhattan: {private_rooms_manhattan}')

Most popular room type in Manhattan: Entire home/apt


#### Conclusion to Q3
#### *In conclusion, we see that the most popular room type in Manhattan is actually entire homes and apartments, not private rooms.*
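
The same comparison can be made for every neighbourhood group at once with `pd.crosstab`. This is a sketch on made-up rows; note that "popular" here still means the number of listings, not the number of bookings.

```python
import pandas as pd

# Hypothetical sample rows for illustration only.
sample = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Manhattan', 'Manhattan', 'Brooklyn', 'Brooklyn'],
    'room_type': ['Entire home/apt', 'Entire home/apt', 'Private room',
                  'Private room', 'Private room'],
})

# Rows are neighbourhood groups, columns are room types, cells are listing counts.
room_counts = pd.crosstab(sample['neighbourhood_group'], sample['room_type'])

print(room_counts)
print(room_counts.loc['Manhattan'].idxmax())
```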

#### 4. Which hosts are the busiest based on their reviews?
#### *For this question, I will calculate the top 10 busiest hosts by total reviews, using the columns host_name and number_of_reviews.*

In [373]:
#Your code here.

#Top 10 of total highest reviews.
#groupby() splits the DataFrame into groups based on the unique values in the given column.
#sum() used to get the sum of values.
#nlargest() used to retrieve the top largest numbers from a Series or DataFrame column.
busiest_hosts_reviews = data.groupby('host_name')['number_of_reviews'].sum().nlargest(10)

#Print with f-string {}.
print(f'Busiest hosts based on reviews: {busiest_hosts_reviews}')


Busiest hosts based on reviews: host_name
Michael    11081
David       8103
John        7223
Jason       6522
Alex        6204
Chris       5028
Anna        4799
Eric        4733
Daniel      4723
Sarah       4579
Name: number_of_reviews, dtype: int64


#### Conclusion to Q4
#### *In conclusion, with this data we see that Michael, David, and John are the top three busiest hosts based on reviews. This information was pulled from a top ten list of the highest review totals.*

#### 5. Which neighborhood group has the highest average price?
#### *For this question, I will use the columns neighbourhood_group and price to find the mean price of each group and return the group with the highest average price.*

In [384]:
#Your code here.

#Highest average price between neighborhoods.
#mean() used to calculate the average of values.
#idxmax() returns the index label of the largest value, here the group name.
highest_avg_price_group = data.groupby('neighbourhood_group')['price'].mean().idxmax()

#Print with f-string {}.
print(f'Neighborhood group with highest average price: {highest_avg_price_group}')

Neighborhood group with highest average price: Manhattan


#### Conclusion to Q5
#### *In conclusion, we see that after calculating the mean or average of price by neighborhood group we find Manhattan to be the highest average priced neighborhood.*
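
Because a mean can be pulled up by a handful of very expensive listings, it can help to check the median alongside it. The sketch below uses made-up prices in which one outlier changes which group has the highest mean; whether the real data behaves this way is not established here.

```python
import pandas as pd

# Hypothetical prices for illustration only; Brooklyn contains one outlier.
sample = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Manhattan', 'Manhattan',
                            'Brooklyn', 'Brooklyn', 'Brooklyn'],
    'price': [100, 150, 200, 90, 100, 1000],
})

# Compute both statistics per group in a single pass.
price_stats = sample.groupby('neighbourhood_group')['price'].agg(['mean', 'median'])

print(price_stats)
print('Highest mean:  ', price_stats['mean'].idxmax())
print('Highest median:', price_stats['median'].idxmax())
```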

#### 6. Which neighborhood group has the highest total price?
#### *For this question, I will use the columns neighbourhood_group and price to find the total (summed) price of each group and return the group with the highest total.*

In [394]:
#Your code here.

#Highest priced neighborhood.
#groupby() splits the DataFrame into groups based on the unique values in the given column.
#sum() used to get the sum of values.
#idxmax() returns the index label of the largest value, here the group name.
highest_total_price_group = data.groupby('neighbourhood_group')['price'].sum().idxmax()

#Print with f-string {}.
print(f'Neighborhood group with highest total price: {highest_total_price_group}')

Neighborhood group with highest total price: Manhattan


#### Conclusion to Q6
#### *In conclusion, after calculating the sum or total of price by neighborhood group, we find Manhattan to be the highest priced neighborhood group. This is not surprising, considering it also has the highest average price and, as Q2 showed, the most listings.*
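
A total price is the number of listings multiplied by the average price, so a group can lead on total simply by having more listings. The sketch below, on made-up prices, puts the three numbers next to each other so that effect is visible.

```python
import pandas as pd

# Hypothetical prices for illustration only.
sample = pd.DataFrame({
    'neighbourhood_group': ['Manhattan'] * 4 + ['Bronx'] * 2,
    'price': [100, 100, 100, 100, 150, 150],
})

# count * mean equals sum for every group.
totals = sample.groupby('neighbourhood_group')['price'].agg(['count', 'mean', 'sum'])

print(totals)
print('Highest total:  ', totals['sum'].idxmax())
print('Highest average:', totals['mean'].idxmax())
```

In this toy example the group with the most listings wins on total while the other wins on average, which is why the two questions are worth answering separately.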

#### 7. Which top 5 hosts have the highest total price?
#### *For this question, I will use the columns host_name and price to find the sum of each of the top 5 highest priced hosts.*

In [403]:
#Your code here.

#Top 5 highest price hosts.
#groupby() splits the DataFrame into groups based on the unique values in the given column.
#sum() used to get the sum of values.
#nlargest() used to retrieve the top largest numbers from a Series or DataFrame column.
top_hosts_highest_price = data.groupby('host_name')['price'].sum().nlargest(5)

#Print with f-string {}.
print(f'Top 5 hosts with highest total price: {top_hosts_highest_price}')

Top 5 hosts with highest total price: host_name
Sonder (NYC)    82795
Blueground      70331
Michael         66895
David           65844
Alex            52563
Name: price, dtype: int64


#### Conclusion to Q7
#### *In conclusion, after calculating the sum or total of price by host name, we can say Sonder has the highest total price as a host. The next highest in the top 5 list are Blueground and Michael. Interestingly enough, we learned earlier that Michael was also one of the busiest hosts.*

#### 8. Who currently has no (zero) availability with a review count of 100 or more?
#### *For this question, I will use the columns availability_365, number_of_reviews, and host_name to see which hosts with 100 or more reviews have no availability.*

In [411]:
#Your code here.

#Select rows where 'availability_365' is 0 and 'number_of_reviews' is greater than or equal to 100.
#The & symbol combines these two conditions, returning True only for rows where both are True.
unavailable_hosts_reviews = data[(data['availability_365'] == 0) & (data['number_of_reviews'] >= 100)]['host_name'].unique()

#Print with f-string {}.
print(f'Hosts with no availability and 100+ reviews: {unavailable_hosts_reviews}')

Hosts with no availability and 100+ reviews: ['MaryEllen' 'Christiana' 'Sol' 'Coral' 'Doug' 'Ori' 'Lissette'
 'Liz And Melissa' 'Ivy' 'Jsun' 'Wanda' 'Ben' 'S' 'Adrienne' 'Lydia'
 'Karin' 'Elle' 'James' 'Jon' 'Liz' 'Jeanine' 'Lorena' 'Ron' 'Dragan'
 'Misty' 'Brian' 'Natalie' 'AJ And Freddy' 'Neil & Katie' 'Emily' 'Evelyn'
 'Alvaro' 'Bernard' 'Sarah' 'Karen' 'Summer' 'William' 'Andy & Friends'
 'Karece' 'Ehren' 'Nicole' 'Terri' 'Ravanna' 'Molly' 'Lane' 'Angelo'
 'DeLex' 'Michelle' 'Katarina' 'Véronique' 'Andreas' 'Caroline' 'Michael'
 'Aurea' 'Kent' 'Brendan' 'Jillian' 'Deanna' 'Jake' 'Emily And Joel'
 'George & Diana' 'Veronica' 'Masha' 'Danielle' 'Jeremy' 'Kyle' 'Stacey'
 'Sasha' 'Nick' 'Carlina' 'Taylor & Tee' 'Devin' 'Ame' 'Richard' 'Micah'
 'Elliott' '正川' 'Chao' 'Pj' 'Lou' 'Ingrid' 'Graham' 'Gurpreet  Singh'
 'Jimmy' 'Catrina' 'Long' 'Deborah' 'Hayes' 'Evan' 'Sofia' 'Antonia'
 'Margarita' 'Abraham' 'Alex' 'Qiyao' 'Cedrick' 'Greg' 'Chelsea' 'Kc'
 'Edward' 'Lasata' 'Krysta' 'Maeve' 'E

#### Conclusion to Q8
#### *As you can see, there are a multitude of hosts that have no availability in an entire calendar year. This is the case even though the sample group was limited to listings with 100 or more reviews.*
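
A long array of names is hard to read and hides the review counts. One possible alternative, sketched here on made-up rows, keeps the matching rows as a small table sorted by review count. The `host_id` column is included on the assumption that it distinguishes hosts who share a name.

```python
import pandas as pd

# Hypothetical sample rows for illustration only.
sample = pd.DataFrame({
    'host_id': [1, 2, 3, 4],
    'host_name': ['Ben', 'Liz', 'Ron', 'Ivy'],
    'number_of_reviews': [150, 99, 100, 300],
    'availability_365': [0, 0, 0, 12],
})

# Both conditions must hold: no availability and at least 100 reviews.
mask = (sample['availability_365'] == 0) & (sample['number_of_reviews'] >= 100)

booked_out = (
    sample.loc[mask, ['host_id', 'host_name', 'number_of_reviews']]
          .sort_values('number_of_reviews', ascending=False)
)

print(booked_out.to_string(index=False))
print(f'Matching listings: {len(booked_out)}')
```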

#### 9. What host has the highest total of prices and where are they located?
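#### *For this question, I will group the data by the columns host_name and neighbourhood and sum the column price to find the host and location with the highest total price.*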

In [418]:
#Your code here.

#Show the highest total prices and its location.
#groupby() splits the DataFrame into groups based on the unique values in the given columns.
#sum() used to get the sum of values.
#nlargest() used to retrieve the top largest numbers from a Series or DataFrame column.
highest_price_host = data.groupby(['host_name', 'neighbourhood'])['price'].sum().nlargest(1)

#Print with f-string {}.
print(f'Host with highest total prices and their location: {highest_price_host}')

Host with highest total prices and their location: host_name     neighbourhood     
Sonder (NYC)  Financial District    57738
Name: price, dtype: int64


#### Conclusion to Q9
#### *In conclusion, we see that once again Sonder comes in with the highest total price. Its highest total is located in the Financial District, an area within Manhattan, the neighborhood group with the highest average price as well as the highest total price.*
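
Grouping by host and neighbourhood together returns the top host-neighbourhood pair, which is not necessarily the host with the highest overall total. In this notebook the two agree (Sonder leads in Q7 as well), but a two-step version makes the logic explicit. The sketch below uses made-up rows in which the two answers differ.

```python
import pandas as pd

# Hypothetical sample rows for illustration only.
sample = pd.DataFrame({
    'host_id': [1, 1, 1, 2, 2],
    'host_name': ['Sonder (NYC)'] * 3 + ['Blueground'] * 2,
    'neighbourhood_group': ['Manhattan'] * 5,
    'neighbourhood': ['Financial District', 'Financial District', 'Murray Hill',
                      'Chelsea', 'Chelsea'],
    'price': [200, 250, 100, 300, 200],
})

# One-step version: the single largest (host, neighbourhood) total.
top_pair = sample.groupby(['host_name', 'neighbourhood'])['price'].sum().idxmax()

# Step 1: find the host with the highest total across all locations.
top_host_id = sample.groupby('host_id')['price'].sum().idxmax()

# Step 2: break that host's total down by location.
top_host_locations = (
    sample[sample['host_id'] == top_host_id]
    .groupby(['neighbourhood_group', 'neighbourhood'])['price']
    .sum()
    .sort_values(ascending=False)
)

print(top_pair)
print(top_host_locations)
```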

#### 10. When did Danielle from Queens last receive a review? 
#### *For this question, I will find when the last review was received by using the columns host_name, neighbourhood_group, and last_review, filtering specifically for Danielle and Queens.*

In [424]:
#Your code here.

#Convert the 'last_review' column to datetime data type.
data['last_review'] = pd.to_datetime(data['last_review'])

#max() method is used to find the maximum value within a Series or DataFrame.
#The & symbol combines these two conditions, returning True only for rows where both are True.
danielle_last_review = data[(data['host_name'] == 'Danielle') & (data['neighbourhood_group'] == 'Queens')]['last_review'].max()

#Print with f-string {}.
print(f"Danielle from Queens' last review: {danielle_last_review}")


Danielle from Queens' last review: 2019-07-08 00:00:00


#### Conclusion to Q10
#### *In conclusion, by using the pd.to_datetime function and taking the latest date, we were able to see that Danielle from Queens last received a review on 07/08/2019.*
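
Two small refinements are possible here: `errors='coerce'` turns unparseable dates into `NaT` instead of raising, and `strftime` prints the date without the `00:00:00` time part. The sketch below uses made-up dates. It also inherits the same caveat as the original filter: if more than one host named Danielle lists in Queens, `max()` returns the latest review across all of them.

```python
import pandas as pd

# Hypothetical sample rows for illustration only (dates are made up).
sample = pd.DataFrame({
    'host_name': ['Danielle', 'Danielle', 'Danielle'],
    'neighbourhood_group': ['Queens', 'Queens', 'Brooklyn'],
    'last_review': ['2019-05-20', None, '2019-06-30'],
})

# Missing or malformed dates become NaT rather than raising an error.
sample['last_review'] = pd.to_datetime(sample['last_review'], errors='coerce')

mask = (sample['host_name'] == 'Danielle') & (sample['neighbourhood_group'] == 'Queens')

# max() skips NaT values and returns the most recent date.
latest = sample.loc[mask, 'last_review'].max()

print(f"Last review: {latest.strftime('%Y-%m-%d')}")
```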

#### 11. Which host has the most listings?
#### *For this question, I will use the column host_name to count how many listings each host has and determine who has the most.*

In [429]:
#Your code here.

#value_counts() counts the occurrences of each unique value in a Series.
#idxmax() returns the index label of the largest count, i.e. the host name with the most listings.
host_most_listings = data['host_name'].value_counts().idxmax()

#Print with f-string {}.
print(f'Host with the most listings: {host_most_listings}')

Host with the most listings: Michael


#### Conclusion to Q11
#### *Here we see that Michael, who we learned earlier is one of the busiest hosts, is in fact also the host with the most listings.*

#### 12. How many listings have completely open availability?
#### *For this question, I will use the column availability_365 in conjunction with the len() function to see how many listings are open and available the entire year.*

In [433]:
#Your code here.

#Filter to the rows where availability_365 equals 365 (open the entire year),
#then use len() to count how many rows remain.
open_avail_listings = len(data[data['availability_365'] == 365])

#Print with f-string {}.
print(f'Number of listings with completely open availability: {open_avail_listings}')

Number of listings with completely open availability: 1295


#### Conclusion to Q12
#### *In conclusion, there are 1295 listings with completely open availability, meaning they are available on all 365 days of the year.*
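
The count is easier to interpret next to its share of all listings. Since a boolean Series sums to the number of `True` values and averages to their fraction, both can come from one mask. This is a sketch on made-up values.

```python
import pandas as pd

# Hypothetical availability values for illustration only.
sample = pd.DataFrame({'availability_365': [365, 0, 120, 365, 364]})

# True where the listing is available every day of the year.
fully_open = sample['availability_365'] == 365

open_count = int(fully_open.sum())  # number of True values
open_share = fully_open.mean()      # fraction of True values

print(f'Fully open listings: {open_count} ({open_share:.1%} of all listings)')
```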

#### 13. What room types have the highest review numbers?
#### *For this question, I will use the columns room_type and number_of_reviews to find the sum of each room type reviews to see which type has the most.*

In [436]:
#Your code here.

#groupby() splits the DataFrame into groups based on the unique values in the given column.
#sum() used to get the sum of values.
#nlargest() used to retrieve the top largest numbers from a Series or DataFrame column.
room_types_highest_reviews = data.groupby('room_type')['number_of_reviews'].sum().nlargest(5)

#Print with f-string {}.
print(f'Room types with the highest review numbers: {room_types_highest_reviews}')

Room types with the highest review numbers: room_type
Entire home/apt    580403
Private room       538346
Shared room         19256
Name: number_of_reviews, dtype: int64


#### Conclusion to Q13
#### *In conclusion, according to the above data we can see that Entire home/apt is the room type with the highest total of reviews at 580403. This is closely followed by Private room at 538346 reviews.*

##### 14. Is there a feature you would add to this dataset? Can we create it using existing data? If so, do it and walk through why. If not, walk through why you think the existing data is complete without any additional features.
#### *One possible feature to add could be an average_num_reviews column for each listing, so you can see which host's property gets the most use. It can be created from the existing data by taking the host's total number of listings (calculated_host_listings_count) and dividing it by the listing's number of reviews per month (reviews_per_month).*

In [438]:
import numpy as np

#Divide each host's listing count by the listing's reviews per month.
#Listings with no reviews are set to 0 instead of being divided.
data['average_num_reviews'] = np.where(data['number_of_reviews'] != 0, data['calculated_host_listings_count'] / data['reviews_per_month'], 0)

# Show the updated DataFrame with the new 'average_num_reviews' column.
print(data[['host_name', 'average_num_reviews']])



                                 host_name  average_num_reviews
0                                     John            28.571429
1                                 Jennifer             5.263158
2                                Elisabeth             0.000000
3                              LisaRoxanne             0.215517
4                                    Laura            10.000000
5                                    Chris             1.694915
6                                    Garon             2.500000
7                                 Shunichi             0.288184
8                                MaryEllen             1.010101
9                                      Ben             3.007519
10                                    Lena             2.325581
11                                    Kate             0.666667
12                                  Laurie             2.238806
13                                 Claudio             1.098901
14                                   Ali

#### Conclusion to Q14
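#### *In conclusion, dividing each host's total number of listings by the reviews per month of each listing gives the new column average_num_reviews. Among the rows displayed above, John has the highest value. Because the ratio divides a listing count by a review rate, a high value reflects many host listings relative to monthly reviews rather than a true average number of reviews per listing.*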
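
If the goal is an average number of reviews per listing for each host, one alternative is to average `number_of_reviews` within each host and attach the result to every row. The sketch below is only one possible definition; the column name `host_avg_reviews_per_listing` is hypothetical, the rows are made up, and grouping by `host_id` assumes that column identifies hosts uniquely.

```python
import pandas as pd

# Hypothetical sample rows for illustration only.
sample = pd.DataFrame({
    'host_id': [1, 1, 2],
    'number_of_reviews': [10, 30, 0],
})

# transform('mean') returns one value per row, so it can be stored as a new column.
sample['host_avg_reviews_per_listing'] = (
    sample.groupby('host_id')['number_of_reviews'].transform('mean')
)

print(sample)
```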

#### Conclusion analysis

#### *In conclusion, the data provided gave us further insight into AirBnB, its hosts, and its properties. For example, we learned that Manhattan not only has the highest average price of any neighborhood group but also the highest total price. This is consistent with the host with the highest total price, Sonder, being located in the Financial District, which is in Manhattan. We can also infer from the analyzed data that hosts with the most listings tend to be the busiest.*