# About this Dataset

Welcome to New York City, one of the most-visited cities in the world. Many Airbnb listings in New York City meet the high demand for temporary lodging, with stays ranging from a few nights to many months. In this project, we will take a closer look at the New York Airbnb market by combining data from multiple file types: .csv, .tsv, and .xlsx.

Recall that **CSV, TSV**, and **Excel** files are three common formats for storing data. Three files containing data on 2019 Airbnb listings are available to you:

**data/airbnb_price.csv** This is a CSV file containing data on Airbnb listing prices and locations.

- listing_id: unique identifier of listing
- price: nightly listing price in USD
- nbhood_full: name of borough and neighborhood where listing is located

**data/airbnb_room_type.xlsx** This is an Excel file containing data on Airbnb listing descriptions and room types.

- listing_id: unique identifier of listing
- description: listing description
- room_type: Airbnb has three types of rooms: shared rooms, private rooms, and entire homes/apartments

**data/airbnb_last_review.tsv** This is a TSV file containing data on Airbnb host names and review dates.

- listing_id: unique identifier of listing
- host_name: name of listing host
- last_review: date when the listing was last reviewed
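
As a quick illustration of how the text formats differ, the sketch below parses a tiny comma-separated and a tiny tab-separated sample with `pd.read_csv`. The two-row samples and the names `csv_text`, `tsv_text`, `csv_df`, and `tsv_df` are made up for this example and only loosely mirror the real files. The Excel file is read with `pd.read_excel` instead, which typically needs an engine such as `openpyxl` installed.

```python
import io

import pandas as pd

# Hypothetical two-row samples; the real files are much larger.
csv_text = "listing_id,price\n2595,225 dollars\n3831,89 dollars\n"
tsv_text = "listing_id\thost_name\n2595\tJennifer\n3831\tLisaRoxanne\n"

# CSV: the comma is the default separator for read_csv.
csv_df = pd.read_csv(io.StringIO(csv_text))

# TSV: same reader, but the separator must be set to a tab.
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(csv_df.columns.tolist())
print(tsv_df.columns.tolist())
```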

# Table of contents

As a consultant working for a real estate start-up, you have collected Airbnb listing data from various sources to investigate the short-term rental market in New York. You'll analyze this data to provide insights on private rooms to the real estate company.

There are three files in the data folder: <code>airbnb_price.csv</code>, <code>airbnb_room_type.xlsx</code>, <code>airbnb_last_review.tsv</code>.

- What are the dates of the earliest and most recent reviews? Store these values as two separate variables with your preferred names.
- How many of the listings are private rooms? Save this into any variable.
- What is the average listing price? Round to the nearest penny and save into a variable.
- Combine the new variables into one DataFrame called <code>review_dates</code> with four columns in the following order: <code>first_reviewed</code>, <code>last_reviewed</code>, <code>nb_private_rooms</code>, and <code>avg_price</code>. The DataFrame should only contain one row of values.

In [14]:
!pip install opendatasets



In [15]:
import opendatasets as od

In [16]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [17]:
# Copy the Kaggle credentials to the locations where they are expected
!mkdir -p ~/.kaggle
!cp '/content/drive/MyDrive/Colab Notebooks/kaggle.json' ~/.kaggle/
!cp '/content/drive/MyDrive/Colab Notebooks/kaggle.json' ./
!chmod 600 ~/.kaggle/kaggle.json
print("ok")

ok


In [18]:
# Download the Kaggle dataset from its URL into a local folder
od.download(
    "https://www.kaggle.com/datasets/frandlt/nyc-airbnb-market")

Skipping, found downloaded files in "./nyc-airbnb-market" (use force=True to force download)


In [19]:
# Download the second Kaggle dataset (room types) into a local folder
od.download(
    "https://www.kaggle.com/datasets/aynashairi/rooms-type")

Skipping, found downloaded files in "./rooms-type" (use force=True to force download)


# Import Libraries

In [20]:
import pandas as pd
import numpy as np


import warnings
warnings.filterwarnings("ignore")

# Load data

In [21]:
# Read the review data (TSV file, so the separator is a tab)
file = '/content/nyc-airbnb-market/airbnb_last_review.tsv'
reviews = pd.read_csv(file, sep='\t')

# Read the price data (CSV file)
file1 = '/content/nyc-airbnb-market/airbnb_price.csv'
prices = pd.read_csv(file1)

# Read the room type data (Excel file, first sheet)
file2 = '/content/rooms-type/airbnb_room_type.xlsx'
room_types = pd.read_excel(file2, sheet_name=0)

# Preview the first rows of each DataFrame
reviews.head(), prices.head(), room_types.head()

(   listing_id    host_name   last_review
 0        2595     Jennifer   May 21 2019
 1        3831  LisaRoxanne  July 05 2019
 2        5099        Chris  June 22 2019
 3        5178     Shunichi  June 24 2019
 4        5238          Ben  June 09 2019,
    listing_id        price                nbhood_full
 0        2595  225 dollars         Manhattan, Midtown
 1        3831   89 dollars     Brooklyn, Clinton Hill
 2        5099  200 dollars     Manhattan, Murray Hill
 3        5178   79 dollars  Manhattan, Hell's Kitchen
 4        5238  150 dollars       Manhattan, Chinatown,
    listing_id                                description        room_type
 0        2595                      Skylit Midtown Castle  Entire home/apt
 1        3831            Cozy Entire Floor of Brownstone  Entire home/apt
 2        5099  Large Cozy 1 BR Apartment In Midtown East  Entire home/apt
 3        5178            Large Furnished Room Near B'way     private room
 4        5238         Cute & Cozy Lower 

In [None]:
# Join the three data frames together into one
listings = pd.merge(prices, room_types, on='listing_id')
listings = pd.merge(listings, reviews, on='listing_id')
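
`pd.merge` performs an inner join by default, so a listing missing from any one file is dropped without warning. A minimal sketch of one way to check this on toy data follows; the frames `toy_prices` and `toy_rooms` and their values are hypothetical, not taken from the dataset. It uses the `validate` and `indicator` options of `pd.merge` to confirm that `listing_id` is unique on both sides and to list rows without a match.

```python
import pandas as pd

# Hypothetical toy frames; listing 3 has a price but no room type.
toy_prices = pd.DataFrame(
    {"listing_id": [1, 2, 3], "price": ["10 dollars", "20 dollars", "30 dollars"]}
)
toy_rooms = pd.DataFrame(
    {"listing_id": [1, 2], "room_type": ["Private room", "entire home/apt"]}
)

# An outer join with indicator=True adds a "_merge" column that says where
# each row came from; validate raises an error if listing_id is duplicated.
checked = pd.merge(
    toy_prices,
    toy_rooms,
    on="listing_id",
    how="outer",
    validate="one_to_one",
    indicator=True,
)
unmatched = checked[checked["_merge"] != "both"]
print(unmatched)

# The default inner join keeps only listings present in both frames.
inner = pd.merge(toy_prices, toy_rooms, on="listing_id")
print(len(inner))
```

If `unmatched` is empty for the real data, the inner join used in this notebook loses nothing.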

In [None]:
# What are the dates of the earliest and most recent reviews?
# min() and max() on the raw strings would compare them alphabetically,
# so convert the last_review column to datetime first
listings['last_review_date'] = pd.to_datetime(listings['last_review'], format='%B %d %Y')
first_reviewed = listings['last_review_date'].min()
last_reviewed = listings['last_review_date'].max()
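
To see why the conversion matters, the sketch below compares `min()`/`max()` on raw date strings with the same calls after parsing. The three sample values come from the preview output above; the names `toy_reviews` and `parsed` are introduced only for this example.

```python
import pandas as pd

# Three review dates in the same "Month DD YYYY" layout as the dataset.
toy_reviews = pd.Series(["May 21 2019", "July 05 2019", "June 22 2019"])

# On strings, min/max are alphabetical: "July..." sorts before "May...".
string_min = toy_reviews.min()
string_max = toy_reviews.max()
print(string_min, "|", string_max)

# %B = full month name, %d = zero-padded day, %Y = four-digit year.
parsed = pd.to_datetime(toy_reviews, format="%B %d %Y")
print(parsed.min().date(), "|", parsed.max().date())
```

Here the string comparison reports July as the earliest date, while the parsed values correctly give May 21 as the earliest and July 5 as the latest.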


In [None]:
# How many of the listings are private rooms?
# Since there are differences in capitalization, make capitalization consistent
listings['room_type'] = listings['room_type'].str.lower()
private_room_count = listings[listings['room_type'] == 'private room'].shape[0]
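
The effect of normalizing capitalization can be checked on a small sample. This is only a sketch: the values in `toy_room_types` are hypothetical spellings chosen to show the problem, and the extra `str.strip()` is an assumption in case values carry stray whitespace (the notebook itself only lowercases).

```python
import pandas as pd

# Hypothetical room types with inconsistent capitalization.
toy_room_types = pd.Series(
    ["Private room", "private room", "PRIVATE ROOM", "Entire home/apt", "shared room"]
)

# Without cleaning, only the exact lowercase spelling matches.
raw_count = (toy_room_types == "private room").sum()

# Strip whitespace and lowercase before comparing.
clean = toy_room_types.str.strip().str.lower()
clean_count = (clean == "private room").sum()

print(raw_count, clean_count)
print(clean.value_counts())
```

Calling `value_counts()` on the cleaned column is a convenient way to confirm that only the expected categories remain.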

In [None]:
# What is the average listing price?
# To convert price to numeric, remove " dollars" from each value
listings['price_clean'] = listings['price'].str.replace(' dollars', '').astype(float)
avg_price = listings['price_clean'].mean()
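
The price cleaning step can be traced by hand on a few values. The sketch below uses three prices from the preview output above; `toy_price_text`, `numeric`, and `toy_avg` are names made up for this example. Passing `regex=False` makes it explicit that the suffix is removed as a literal string.

```python
import pandas as pd

# Three prices in the same "<number> dollars" layout as the dataset.
toy_price_text = pd.Series(["225 dollars", "89 dollars", "200 dollars"])

# Remove the literal " dollars" suffix, then convert to float.
numeric = toy_price_text.str.replace(" dollars", "", regex=False).astype(float)

# (225 + 89 + 200) / 3 = 171.333..., rounded to the nearest cent.
toy_avg = round(numeric.mean(), 2)
print(toy_avg)
```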

In [None]:
review_dates = pd.DataFrame({
    'first_reviewed': [first_reviewed],
    'last_reviewed': [last_reviewed],
    'nb_private_rooms': [private_room_count],
    'avg_price': [round(avg_price, 2)]
})

print(review_dates)

  first_reviewed last_reviewed  nb_private_rooms  avg_price
0     2019-01-01    2019-07-09             11356     141.78
