# Data Cleaning – Airbnb Listings

This notebook applies the data cleaning steps defined in the 
Data Quality Assessment phase.

All transformations are documented and justified.
The cleaned dataset will be saved for use in subsequent analysis.


In [5]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)

In [6]:
df = pd.read_csv("../data/raw/airbnb_listings.csv")

In [7]:
df.shape

(17730, 18)

## Removing Columns with Complete Missingness

The following columns contain 100% missing values and cannot contribute 
to the analysis:

- `neighbourhood_group`
- `price`

These columns are removed to simplify the dataset and avoid misleading analysis.


In [8]:
df.drop(columns=["neighbourhood_group", "price"], inplace=True)

In [9]:
df.shape

(17730, 16)
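The two columns above are dropped by name. As a minimal sketch, the same step could detect fully missing columns programmatically, so that the list does not go stale if the raw file changes. The frame `toy` and its values are illustrative assumptions, not the actual listings data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw listings (values are illustrative only).
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "neighbourhood_group": [np.nan, np.nan, np.nan],
    "price": [np.nan, np.nan, np.nan],
    "room_type": ["Entire home/apt", "Private room", None],
})

# Boolean Series indexed by column name: True where every value is missing.
all_missing = toy.isna().all()
empty_cols = all_missing[all_missing].index.tolist()

# Drop the empty columns and return a new frame instead of mutating in place.
toy_trimmed = toy.drop(columns=empty_cols)

print(empty_cols)
print(toy_trimmed.shape)
```

Note that `room_type` is kept even though it has one missing value, because only columns that are missing in every row are selected.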

## Converting Date Columns

The `last_review` column is converted to datetime format. With `errors="coerce"`, any value that cannot be parsed becomes `NaT`.
Missing values are preserved, as they indicate listings with no reviews.


In [10]:
df["last_review"] = pd.to_datetime(df["last_review"], errors="coerce")

print("Converted last_review to datetime")


Converted last_review to datetime


In [11]:
df["last_review"].dtype

dtype('<M8[ns]')
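Because `errors="coerce"` turns unparseable strings into `NaT` without raising, it can be useful to confirm that the conversion did not introduce new missing values. The following is a small sketch on a made-up Series (`raw` and its contents are assumptions for illustration), comparing missingness before and after parsing.

```python
import pandas as pd

# Illustrative input: one genuinely missing value and one malformed string.
raw = pd.Series(["2024-05-01", None, "not a date", "2023-12-31"], name="last_review")

# Same call as in the notebook: invalid values become NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

# Values that were present in the raw data but are missing after parsing.
newly_missing = raw.notna() & parsed.isna()

print(int(newly_missing.sum()))
print(raw[newly_missing].tolist())
```

A count of zero on the real column would indicate that every `NaT` in `last_review` corresponds to a value that was already missing in the raw file.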

## Duplicate Check

Duplicate listing records can distort analysis results.
The dataset is checked for exact duplicate rows.


In [12]:
df.duplicated().sum()

np.int64(0)

In [13]:
# No exact duplicates were found above, so this call is a safeguard and leaves the data unchanged.
df = df.drop_duplicates()

print("Removing duplicates")

Removing duplicates
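The check above looks only for rows that are identical in every column. A listing could still appear twice with slightly different field values. As a hedged sketch, assuming `id` is meant to identify a listing uniquely (the notebook does not state this), a key-based check might look like this on toy data:

```python
import pandas as pd

# Toy data: listing 102 appears twice with a different name.
toy = pd.DataFrame({
    "id": [101, 102, 102, 103],
    "name": ["Loft", "Studio", "Studio (updated)", "Cabin"],
    "host_id": [1, 2, 2, 3],
})

# Exact duplicates compare every column; none exist here.
exact_dupes = int(toy.duplicated().sum())

# Key-based duplicates compare only the assumed identifier column.
id_dupes = int(toy.duplicated(subset="id").sum())

print(exact_dupes, id_dupes)
print(toy["id"].is_unique)
```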


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17730 entries, 0 to 17729
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              17730 non-null  int64         
 1   name                            17730 non-null  object        
 2   host_id                         17730 non-null  int64         
 3   host_name                       17728 non-null  object        
 4   neighbourhood                   17730 non-null  object        
 5   latitude                        17730 non-null  float64       
 6   longitude                       17730 non-null  float64       
 7   room_type                       17730 non-null  object        
 8   minimum_nights                  17730 non-null  int64         
 9   number_of_reviews               17730 non-null  int64         
 10  last_review                     15020 non-null  datetime64[ns]
 11  re

In [15]:
df.isna().sum().sort_values(ascending=False)

last_review                       2710
reviews_per_month                 2710
license                           1354
host_name                            2
neighbourhood                        0
id                                   0
name                                 0
host_id                              0
room_type                            0
longitude                            0
latitude                             0
number_of_reviews                    0
minimum_nights                       0
calculated_host_listings_count       0
availability_365                     0
number_of_reviews_ltm                0
dtype: int64
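The identical missing counts for `last_review` and `reviews_per_month` are consistent with the earlier statement that these gaps mark listings with no reviews. A minimal sketch of how that assumption could be verified row by row is shown below; the frame `toy` is made up for illustration and the check has not been run on the real data here.

```python
import numpy as np
import pandas as pd

# Illustrative rows: two listings without reviews, two with reviews.
toy = pd.DataFrame({
    "number_of_reviews": [0, 12, 0, 3],
    "last_review": pd.to_datetime([None, "2024-01-15", None, "2023-11-02"]),
    "reviews_per_month": [np.nan, 0.8, np.nan, 0.2],
})

no_reviews = toy["number_of_reviews"].eq(0)
missing_date = toy["last_review"].isna()
missing_rate = toy["reviews_per_month"].isna()

# True only if both columns are missing exactly where the review count is zero.
consistent = bool((no_reviews == missing_date).all() and (no_reviews == missing_rate).all())

print(consistent)
```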

In [17]:
df.to_csv("../data/processed/airbnb_listings_cleaned.csv", index=False)

## Output

The cleaned dataset is saved to:

`data/processed/airbnb_listings_cleaned.csv`

This dataset will be used for exploratory analysis and insight generation.
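One point to keep in mind for later notebooks: CSV does not store column types, so `last_review` is not read back as a datetime unless it is parsed on load. The sketch below illustrates this with a temporary directory and a toy frame. The file name `toy_cleaned.csv` and the `processed` subfolder are placeholders, not the project's actual paths.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with a datetime column that contains one missing value.
toy = pd.DataFrame({
    "id": [1, 2],
    "last_review": pd.to_datetime(["2024-03-10", None]),
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "processed" / "toy_cleaned.csv"

    # Create the output folder if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)

    # Without parse_dates the column is not restored as a datetime.
    plain = pd.read_csv(out_path)

    # With parse_dates the datetime dtype is restored and NaT is preserved.
    reloaded = pd.read_csv(out_path, parse_dates=["last_review"])

print(plain["last_review"].dtype)
print(reloaded["last_review"].dtype)
```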
