# Airbnb Data Cleaning and Profiling

## Table of Contents
1. Load Raw Listings Data
2. Dataset Shape and Column Inspection
3. Price Column Cleaning
4. Column Subset Selection
5. Missing Value Analysis
6. Null Row Removal
7. Export Clean Dataset
8. Summary Statistics
9. Category Frequency Checks


In [8]:
import pandas as pd

df = pd.read_csv(r"C:\Users\rbaue\Desktop\listings.csv.gz")
df.shape

(2877, 79)

In [9]:
df.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
       'description', 'neighborhood_overview', 'picture_url', 'host_id',
       'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'ca

In [10]:
df['price'] = (
    df['price']
    .replace(r'[\$,]', '', regex=True)  # raw string avoids the invalid "\$" escape warning
    .astype(float)
)
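
The cell above assumes every price string becomes a valid float once `$` and `,` are stripped. If any other token is present, `.astype(float)` raises. As a minimal sketch of a more tolerant variant, the hypothetical helper `clean_price` below (not part of the original notebook) uses `pd.to_numeric(..., errors="coerce")` so unparseable values become `NaN`. The sample values are made up for illustration.

```python
import pandas as pd


def clean_price(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' from price strings and convert to numbers.

    Hypothetical helper for illustration. Values that still cannot be
    parsed after stripping become NaN instead of raising an error.
    """
    stripped = series.str.replace(r"[\$,]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")


# Toy data, not taken from listings.csv.gz.
sample = pd.Series(["$1,250.00", "$85.00", None, "n/a"])
cleaned = clean_price(sample)
print(cleaned)
```

Coercing rather than raising keeps the pipeline running, but any bad values then show up as extra nulls in the later missing-value check.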

In [11]:
columns = [
    'id','name','neighbourhood','latitude','longitude',
    'room_type','property_type','price','minimum_nights',
    'number_of_reviews','reviews_per_month','availability_365',
    'host_is_superhost','instant_bookable'
]

airbnb = df[columns].copy()
airbnb.shape

(2877, 14)

In [12]:
airbnb.isnull().sum()

id                      0
name                    0
neighbourhood        1365
latitude                0
longitude               0
room_type               0
property_type           0
price                 183
minimum_nights          0
number_of_reviews       0
reviews_per_month     319
availability_365        0
host_is_superhost     161
instant_bookable        0
dtype: int64
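
Raw null counts are easier to judge as a share of the rows. For example, `neighbourhood` is missing in 1365 of 2877 rows, about 47%. The sketch below assumes a hypothetical helper named `missing_summary` and uses a toy frame. It reports the count and percentage for each column that has at least one null.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages, largest first.

    Hypothetical helper for illustration. Columns with no missing
    values are left out of the result.
    """
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)


# Toy data for illustration only.
toy = pd.DataFrame({
    "price": [100.0, None, 80.0, None],
    "room_type": ["a", "b", "c", "d"],
    "reviews_per_month": [1.0, 2.0, None, 0.5],
})
report = missing_summary(toy)
print(report)
```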

In [13]:
airbnb = airbnb.dropna()
airbnb.shape

(1259, 14)
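
Calling `dropna()` with no arguments removes every row that has a null in any column. Here that shrinks the data from 2877 to 1259 rows, and `neighbourhood` alone has 1365 nulls. Whether that loss is acceptable depends on the analysis. As a sketch of one alternative, the toy example below limits the check to the columns listed in `required`. Treating `price` as the only required column is an assumption made for illustration, not a choice made in the original notebook.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "price": [100.0, None, 80.0, 60.0],
    "neighbourhood": [None, "x", "y", None],
    "reviews_per_month": [1.0, 2.0, 0.5, None],
})

# Assumption: only these columns must be present for a row to be usable.
required = ["price"]

all_dropped = toy.dropna()                    # drops rows with a null anywhere
subset_dropped = toy.dropna(subset=required)  # drops rows with a null in `required` only

print(len(toy), len(all_dropped), len(subset_dropped))
```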

In [14]:
airbnb.to_csv("airbnb_clean.csv", index=False)

In [15]:
airbnb.describe()

| | id | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | availability_365 |
|---|---|---|---|---|---|---|---|---|
| count | 1259.0 | 1259.0 | 1259.0 | 1259.0 | 1259.0 | 1259.0 | 1259.0 | 1259.0 |
| mean | 6.007705e+17 | 39.98685 | -82.993226 | 373.557585 | 7.813344 | 100.466243 | 2.275187 | 232.691819 |
| std | 5.13649e+17 | 0.041462 | 0.038348 | 3334.189609 | 12.113242 | 129.306236 | 1.971969 | 112.630375 |
| min | 90676.0 | 39.87764 | -83.160016 | 25.0 | 1.0 | 1.0 | 0.02 | 0.0 |
| 25% | 43884980.0 | 39.95777 | -83.009 | 85.0 | 1.0 | 15.0 | 0.73 | 154.0 |
| 50% | 7.028328e+17 | 39.98119 | -82.99889 | 118.0 | 2.0 | 55.0 | 1.87 | 257.0 |
| 75% | 1.000703e+18 | 39.999821 | -82.977518 | 166.0 | 3.0 | 130.0 | 3.33 | 336.0 |
| max | 1.507536e+18 | 40.14729 | -82.78194 | 50028.0 | 105.0 | 997.0 | 18.77 | 365.0 |
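
The `price` statistics are heavily skewed. The median is 118 and the 75th percentile is 166, yet the maximum is 50028 and the mean is about 374. One common rule of thumb flags values above Q3 + 1.5 × IQR. With the quartiles reported above (85 and 166) that fence is 166 + 1.5 × 81 = 287.5. The sketch below applies the rule to a toy series. The helper name `iqr_upper_fence` and the multiplier `k=1.5` are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd


def iqr_upper_fence(values: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR, a conventional upper bound for outliers.

    Hypothetical helper for illustration. k=1.5 is a rule of thumb,
    not a threshold taken from the original notebook.
    """
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return float(q3 + k * (q3 - q1))


# Toy prices for illustration only.
prices = pd.Series([85.0, 100.0, 118.0, 150.0, 166.0, 50028.0])
fence = iqr_upper_fence(prices)
outliers = prices[prices > fence]
print(fence, outliers.tolist())
```

Flagged rows are worth inspecting by hand before removal, since a very high nightly price may be a genuine listing rather than an error.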


In [16]:
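# Only the last expression in a cell is displayed, so the room_type
# counts are computed here but do not appear in the output.
airbnb['room_type'].value_counts()
airbnb['neighbourhood'].value_counts().head(10)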

neighbourhood
Neighborhood highlights    1259
Name: count, dtype: int64
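
Every remaining row has the same `neighbourhood` value, "Neighborhood highlights", so this column cannot separate listings by area. The raw column list also includes `neighbourhood_cleansed`. It may be the more useful field, though that is an assumption that would need checking against the data. The sketch below uses a hypothetical helper, `constant_columns`, to flag columns with at most one distinct value. The toy values for `neighbourhood_cleansed` are invented.

```python
import pandas as pd


def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns with at most one distinct non-null value.

    Hypothetical helper for illustration.
    """
    return [col for col in frame.columns if frame[col].nunique(dropna=True) <= 1]


# Toy data. The neighbourhood_cleansed values are invented placeholders.
toy = pd.DataFrame({
    "neighbourhood": ["Neighborhood highlights"] * 3,
    "neighbourhood_cleansed": ["A", "B", "A"],
})
flagged = constant_columns(toy)
print(flagged)
```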