# Airbnb Listings Analysis – Data Overview

This notebook loads the raw Airbnb listings dataset and provides an initial 
overview of its structure, data types, and quality.

No transformations are applied at this stage.


In [9]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)

In [13]:
df = pd.read_csv("../data/raw/airbnb_listings.csv")

In [14]:
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,11156,An Oasis in the City,40855,Colleen,,Sydney,-33.86767,151.22497,Private room,,90,193,2020-03-13,1.01,1,364,0,
1,15253,Unique Designer Rooftop Apartment in City Loca...,59850,Morag,,Sydney,-33.87964,151.2168,Private room,,1,632,2025-09-01,3.83,1,295,51,PID-STRA-24061-7
2,44545,Sunny Darlinghurst Warehouse Apartment,112237,Atari,,Sydney,-33.87888,151.21439,Entire home/apt,,2,85,2025-08-31,0.47,1,0,10,PID-STRA-74219
3,58506,"Studio Yindi @ Mosman, Sydney",279955,John,,Mosman,-33.81748,151.23484,Entire home/apt,,2,448,2025-08-31,2.5,1,138,29,PID-STRA-2810
4,68999,A little bit of Sydney - Australia,333581,Bryan,,Hornsby,-33.72966,151.05226,Private room,,1,120,2025-06-08,0.69,1,265,12,PID-STRA-9081


In [16]:
df.shape

(17730, 18)

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17730 entries, 0 to 17729
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              17730 non-null  int64  
 1   name                            17730 non-null  object 
 2   host_id                         17730 non-null  int64  
 3   host_name                       17728 non-null  object 
 4   neighbourhood_group             0 non-null      float64
 5   neighbourhood                   17730 non-null  object 
 6   latitude                        17730 non-null  float64
 7   longitude                       17730 non-null  float64
 8   room_type                       17730 non-null  object 
 9   price                           0 non-null      float64
 10  minimum_nights                  17730 non-null  int64  
 11  number_of_reviews               17730 non-null  int64  
 12  last_review                     

## Initial Observations

- The dataset contains 17,730 listings and 18 columns.
- Several columns contain missing values. Two of them, `price` and `neighbourhood_group`, have no non-null values at all.
- Some columns may require type conversion (e.g. dates).
- Further data cleaning is required before analysis.

## Data Quality Assessment

This section documents the data quality issues identified in the raw dataset and outlines the cleaning strategy that will be applied in the next step.

### Column-Level Data Quality Issues

**1. neighbourhood_group**
- All values are missing
- Column cannot be used for analysis
- Decision: exclude from analysis

**2. price**
- All values are missing
- Pricing analysis is not possible
- Decision: exclude from analysis and clearly document limitation

**3. last_review**
- Contains missing values
- Missing values likely indicate listings with no reviews
- Decision: keep column, treat missing values as "no review"

**4. reviews_per_month**
- Contains missing values
- Likely correlated with missing `last_review`
- Decision: keep column, assess impact during analysis

**5. license**
- Contains missing and non-missing values
- Potential indicator of regulatory compliance
- Decision: keep column for exploratory analysis

**6. Data types**
- `last_review` is stored as object
- Decision: convert to datetime during cleaning
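
The assumption in items 3 and 4, that a missing `last_review` means the listing has no reviews and that `reviews_per_month` is missing on the same rows, can be checked directly rather than taken on trust. The following is a minimal sketch of such a check. The helper name `review_missingness_summary` and the `sample` rows are hypothetical and made up for illustration. In the notebook, the function would be called on `df` instead.

```python
import pandas as pd

# Made-up rows that mimic the relevant columns; NOT taken from the real dataset.
sample = pd.DataFrame(
    {
        "number_of_reviews": [193, 0, 85, 0],
        "last_review": ["2020-03-13", None, "2025-08-31", None],
        "reviews_per_month": [1.01, None, 0.47, None],
    }
)


def review_missingness_summary(frame: pd.DataFrame) -> dict:
    """Count how missing review fields line up with zero-review listings.

    Hypothetical helper for illustration; it only reads the frame.
    """
    no_date = frame["last_review"].isna()
    no_rate = frame["reviews_per_month"].isna()
    zero_reviews = frame["number_of_reviews"].eq(0)
    return {
        "missing_last_review": int(no_date.sum()),
        "missing_both": int((no_date & no_rate).sum()),
        "missing_last_review_and_zero_reviews": int((no_date & zero_reviews).sum()),
        # True only if the two columns are missing on exactly the same rows.
        "same_rows_missing": bool((no_date == no_rate).all()),
    }


summary = review_missingness_summary(sample)
print(summary)
```

If the three counts match and `same_rows_missing` is `True` on the real data, treating a missing `last_review` as "no review" is supported by the data itself. If they diverge, the mismatched rows deserve a closer look before cleaning.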

In [19]:
df.isna().sum().sort_values(ascending=False)

price                             17730
neighbourhood_group               17730
last_review                        2710
reviews_per_month                  2710
license                            1354
host_name                             2
name                                  0
id                                    0
neighbourhood                         0
host_id                               0
room_type                             0
longitude                             0
latitude                              0
number_of_reviews                     0
minimum_nights                        0
calculated_host_listings_count        0
availability_365                      0
number_of_reviews_ltm                 0
dtype: int64

In [20]:
df.describe(include="all").T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
id,17730.0,,,,7.673839601726689e+17,5.625214644127186e+17,11156.0,43835347.5,9.515956114894833e+17,1.2736154036353859e+18,1.4804933972129088e+18
name,17730.0,17269.0,Modern Elegance of Mascot Living,15.0,,,,,,,
host_id,17730.0,,,,211898350.693739,209422262.907644,35582.0,28262450.75,132288219.0,369908172.0,711457600.0
host_name,17728.0,3853.0,Ken,200.0,,,,,,,
neighbourhood_group,0.0,,,,,,,,,,
neighbourhood,17730.0,38.0,Sydney,4190.0,,,,,,,
latitude,17730.0,,,,-33.851333,0.089294,-34.09568,-33.896215,-33.875129,-33.80362,-33.38364
longitude,17730.0,,,,151.175164,0.116042,150.63049,151.135117,151.205292,151.252835,151.34014
room_type,17730.0,4.0,Entire home/apt,14089.0,,,,,,,
price,0.0,,,,,,,,,,


In [21]:
df.nunique().sort_values()

neighbourhood_group                   0
price                                 0
room_type                             4
neighbourhood                        38
calculated_host_listings_count       65
minimum_nights                       81
number_of_reviews_ltm               121
availability_365                    366
number_of_reviews                   470
reviews_per_month                   742
last_review                        1383
host_name                          3853
host_id                            8911
license                           10953
latitude                          14716
longitude                         14920
name                              17269
id                                17730
dtype: int64

## Cleaning Strategy Summary

Based on the data quality assessment, the following steps will be applied 
in the data cleaning phase:

- Drop columns with 100% missing values
- Convert date columns to appropriate datetime formats
- Preserve missing values where they carry business meaning
- Avoid imputing values without domain justification
- Document all transformations clearly

No records will be removed unless justified by data integrity issues.
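
As a rough sketch of how the first three steps could look in pandas, the example below drops columns that are entirely empty, converts `last_review` to datetime, and leaves the remaining missing values untouched. The function name `clean_listings` and the `raw` rows are hypothetical placeholders, not the actual cleaning code or real listings. The `%Y-%m-%d` date format is an assumption based on the values shown in the `df.head()` output.

```python
import pandas as pd

# Made-up rows using the raw file's column names; values are illustrative only.
raw = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "neighbourhood_group": [None, None, None],
        "price": [None, None, None],
        "last_review": ["2020-03-13", None, "2025-08-31"],
        "reviews_per_month": [1.01, None, 0.47],
        "license": [None, "PID-STRA-00001", "PID-STRA-00002"],
    }
)


def clean_listings(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input frame is not modified."""
    # Drop only columns where every value is missing.
    cleaned = frame.dropna(axis="columns", how="all")
    # Parse dates; missing or unparseable entries become NaT instead of raising.
    cleaned = cleaned.assign(
        last_review=pd.to_datetime(
            cleaned["last_review"], format="%Y-%m-%d", errors="coerce"
        )
    )
    # No rows are removed and nothing is imputed.
    return cleaned


cleaned = clean_listings(raw)
print(cleaned.dtypes)
```

Using `how="all"` rather than a missing-value threshold keeps partially filled columns such as `license`, which matches the decision to preserve missing values that may carry meaning.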
