<a href="https://colab.research.google.com/github/tuhanren/Airbnb-Data-Analysis/blob/main/Final_Project_Group15.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Project Objective:

The goal of this project is to help new Airbnb hosts estimate a reasonable nightly **rental price** for their property. Using historical data from Airbnb listings, we will build predictive models that consider factors such as **location (latitude and longitude)**, **availability**, and possibly **house rules** (missing for roughly 50k listings) to forecast the most suitable rental price for new listings.

Dataset: https://www.kaggle.com/datasets/arianazmoudeh/airbnbopendata/data

Data dictionary: https://docs.google.com/spreadsheets/d/1b_dvmyhb_kAJhUmv81rAxl4KcXn0Pymz


Key Analysis Steps:

1. Data Cleaning: Clean the dataset by handling missing values, fixing inconsistencies, and addressing any outliers.
2. Descriptive Analytics: Explore the overall distribution of rental prices, relationships between location and price, popular room types and other descriptive insights.
3. Diagnostic Analytics: Perform correlation analysis and regression to find which factors (location, room type, availability, etc.) most influence rental prices.
4. Predictive Analytics: Build models that predict rental prices based on property characteristics and availability.

Practical Use Case:
New hosts will be able to input their property’s details, such as location, room type, availability, etc. Our model will then predict a reasonable nightly rental price based on similar listings from the historical data, helping the host price their property competitively.

Feasibility:
This project is feasible using Python libraries like pandas for data handling, scikit-learn and keras for modeling, and matplotlib for visualization. The final result will not only provide price recommendations for new Airbnb hosts but also reveal market trends and insights.

In [1]:
# Import the 'drive' module from the 'google.colab' package to enable Google Drive integration.
# Then, mount Google Drive to the '/drive' directory within the Colab environment.
# The 'force_remount=True' parameter ensures that the Drive is remounted even if it was previously mounted.

from google.colab import drive
drive.mount('/drive', force_remount=True)

# Change the current working directory to the specified folder within Google Drive,
# where you can save and load your Colab notebooks or files.
%cd '/drive/MyDrive/Colab Notebooks/INF1340/group project/'

Mounted at /drive
/drive/MyDrive/Colab Notebooks/INF1340/group project


In [69]:
import pandas as pd

def read_csv(uri: str) -> pd.DataFrame:
  """Read a CSV file from the given URI and return a pandas DataFrame.

  Args:
    uri: The URI of the CSV file to read.

  Returns:
    A pandas DataFrame containing the data from the CSV file.
  """

  try:
    return pd.read_csv(uri)
  except FileNotFoundError as ex:
    print(f'Error! File Not Found! uri={uri}')
    raise ex

def columns_snakecase(dataFrame: pd.DataFrame) -> None:
  """
  Convert column names in a pandas DataFrame to lowercase and
  replace all spaces with underscores e.g. 'My Column Name' becomes 'my_column_name'.

  Args:
    dataFrame: The pandas DataFrame whose column names need to be converted.
  """

  dataFrame.columns = dataFrame.columns.str.lower().str.replace(' ', '_')

def columns_drop(dataFrame: pd.DataFrame, columns: list) -> None:
  """Drop the specified columns from a pandas DataFrame.

  Args:
    dataFrame: The pandas DataFrame from which columns need to be dropped.
    columns: A list of column names to be dropped from the DataFrame.
  """
  dataFrame.drop(columns, axis=1, inplace=True)

def columns_drop_by_null_percentage(
    dataFrame: pd.DataFrame,
    percentage_threshold: float
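) -> None:
  """
  Drop every column whose share of missing values is strictly greater than the given threshold.

  Args:
    dataFrame: The pandas DataFrame from which columns need to be dropped.
    percentage_threshold: The null fraction (between 0.0 and 1.0) above which a column is dropped.
  """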

  columns = dataFrame.columns[dataFrame.isnull().mean() > percentage_threshold]
  print(f"Droping: {columns}")
  columns_drop(dataFrame, columns)

def columns_fill_null(dataFrame: pd.DataFrame, columns: list, value: any) -> None:
  """
  Fill missing values in the specified columns of a pandas DataFrame with a given value.

  Args:
    dataFrame: The pandas DataFrame in which missing values need to be filled.
    columns: A list of column names whose missing values need to be filled.
    value: The value to fill missing values with.
  """

  for column in columns:
    dataFrame[column].fillna(value, inplace=True)

def columns_dollarize(dataFrame: pd.DataFrame, columns: list) -> None:
  """
  Convert values in the specified columns of a pandas DataFrame from string to float
  by removing dollar signs and commas.

  Args:
    dataFrame: The pandas DataFrame in which values need to be converted.
    columns: A list of column names whose values need to be converted.
  """

  for column in columns:
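    # Raw string so that '\$' is passed to the regex engine and not treated as an invalid string escape.
    dataFrame[column] = dataFrame[column].replace(r'[\$,]', '', regex=True).astype(float)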

def rows_drop_by_condition(dataFrame: pd.DataFrame, condition: any) -> None:
  """
  Drop rows from a pandas DataFrame based on a given condition.

  Args:
    dataFrame: The pandas DataFrame from which rows need to be dropped.
    condition: A pandas DataFrame condition to filter rows.
  """

  dataFrame.drop(dataFrame[condition].index, inplace=True)

def rows_drop_by_null(dataFrame: pd.DataFrame, columns: list) -> None:
  """
  Drop rows from a pandas DataFrame that contain missing values in the specified columns.

  Args:
    dataFrame: The pandas DataFrame from which rows need to be dropped.
    columns: A list of column names whose rows need to be dropped.
  """
  dataFrame.dropna(subset=columns, inplace=True)

def columns_lowercase(dataFrame: pd.DataFrame, columns: list) -> None:
  """
  Convert the string values in the specified columns of a pandas DataFrame to lowercase.

  Args:
    dataFrame: The pandas DataFrame whose columns need to be converted.
    columns: A list of column names whose string values need to be lowercased.
  """

  for column in columns:
    dataFrame[column] = dataFrame[column].str.lower()

def columns_categorize(dataFrame: pd.DataFrame, columns: list) -> None:
  """
  Convert the specified columns of a pandas DataFrame to categorical data type.

  Args:
    dataFrame: The pandas DataFrame whose columns need to be converted.
    columns: A list of column names to be converted to categorical data type.
  """

  for column in columns:
    dataFrame[column] = dataFrame[column].astype('category')

def columns_boolize(dataFrame: pd.DataFrame, columns: list) -> None:
  """
  Convert the specified columns of a pandas DataFrame to boolean data type.

  Args:
    dataFrame: The pandas DataFrame whose columns need to be converted.
    columns: A list of column names to be converted to boolean data type.
  """

  for column in columns:
    dataFrame[column] = dataFrame[column].astype(bool)

def columns_intize(dataFrame: pd.DataFrame, columns: list) -> None:
  """
  Convert the specified columns of a pandas DataFrame to integer data type.

  Args:
    dataFrame: The pandas DataFrame whose columns need to be converted.
    columns: A list of column names to be converted to integer data type.
  """

  for column in columns:
    dataFrame[column] = dataFrame[column].astype(int)

def columns_floatize(dataFrame: pd.DataFrame, columns: list) -> None:
  """
  Convert the specified columns of a pandas DataFrame to float data type.
  Args:
    dataFrame: The pandas DataFrame whose columns need to be converted.
    columns: A list of column names to be converted to float data type.
  """

  for column in columns:
    dataFrame[column] = dataFrame[column].astype(float)

def apply_lambda(dataFrame: pd.DataFrame, columns: list, fn: any) -> None:
  """
  Apply a lambda function to the specified columns of a pandas DataFrame.

  Args:
    dataFrame: The pandas DataFrame on which the lambda function needs to be applied.
    columns: A list of column names whose values need to be transformed.
    fn: The function (typically a lambda) to apply to each value in the specified columns.
  """

  for column in columns:
    dataFrame[column] = dataFrame[column].apply(fn)

In [70]:
# Call read_csv() to import csv file.
df = read_csv('Airbnb_Open_Data.csv')
df.head(3)



Unnamed: 0,id,NAME,host id,host_identity_verified,host name,neighbourhood group,neighbourhood,lat,long,country,...,service fee,minimum nights,number of reviews,last review,reviews per month,review rate number,calculated host listings count,availability 365,house_rules,license
0,1001254,Clean & quiet apt home by the park,80014485718,unconfirmed,Madaline,Brooklyn,Kensington,40.64749,-73.97237,United States,...,$193,10.0,9.0,10/19/2021,0.21,4.0,6.0,286.0,Clean up and treat the home the way you'd like...,
1,1002102,Skylit Midtown Castle,52335172823,verified,Jenna,Manhattan,Midtown,40.75362,-73.98377,United States,...,$28,30.0,45.0,5/21/2022,0.38,4.0,2.0,228.0,Pet friendly but please confirm with me if the...,
2,1002403,THE VILLAGE OF HARLEM....NEW YORK !,78829239556,,Elise,Manhattan,Harlem,40.80902,-73.9419,United States,...,$124,3.0,0.0,,,5.0,1.0,352.0,"I encourage you to use my kitchen, cooking and...",


In [71]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102599 entries, 0 to 102598
Data columns (total 26 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   id                              102599 non-null  int64  
 1   NAME                            102349 non-null  object 
 2   host id                         102599 non-null  int64  
 3   host_identity_verified          102310 non-null  object 
 4   host name                       102193 non-null  object 
 5   neighbourhood group             102570 non-null  object 
 6   neighbourhood                   102583 non-null  object 
 7   lat                             102591 non-null  float64
 8   long                            102591 non-null  float64
 9   country                         102067 non-null  object 
 10  country code                    102468 non-null  object 
 11  instant_bookable                102494 non-null  object 
 12  cancellation_pol

In [72]:
df.describe()

Unnamed: 0,id,host id,lat,long,Construction year,minimum nights,number of reviews,reviews per month,review rate number,calculated host listings count,availability 365
count,102599.0,102599.0,102591.0,102591.0,102385.0,102190.0,102416.0,86720.0,102273.0,102280.0,102151.0
mean,29146230.0,49254110000.0,40.728094,-73.949644,2012.487464,8.135845,27.483743,1.374022,3.279106,7.936605,141.133254
std,16257510.0,28539000000.0,0.055857,0.049521,5.765556,30.553781,49.508954,1.746621,1.284657,32.21878,135.435024
min,1001254.0,123600500.0,40.49979,-74.24984,2003.0,-1223.0,0.0,0.01,1.0,1.0,-10.0
25%,15085810.0,24583330000.0,40.68874,-73.98258,2007.0,2.0,1.0,0.22,2.0,1.0,3.0
50%,29136600.0,49117740000.0,40.72229,-73.95444,2012.0,3.0,7.0,0.74,3.0,1.0,96.0
75%,43201200.0,73996500000.0,40.76276,-73.93235,2017.0,5.0,30.0,2.0,4.0,2.0,269.0
max,57367420.0,98763130000.0,40.91697,-73.70522,2022.0,5645.0,1024.0,90.0,5.0,332.0,3677.0


In [73]:
df.isnull().sum()

Unnamed: 0,0
id,0
NAME,250
host id,0
host_identity_verified,289
host name,406
neighbourhood group,29
neighbourhood,16
lat,8
long,8
country,532


**Conclusions from the first snapshot of the data (`info()`, `describe()` and `isnull()`):**


1.   Column names use inconsistent formats and need to be standardized.
2.   Null values need to be handled.
3.   Dtypes need to be adjusted (float, int, date, category, bool, object).

***1. Column name format: rename all columns to lowercase snake_case.***

In [74]:
columns_snakecase(df)
df.columns

Index(['id', 'name', 'host_id', 'host_identity_verified', 'host_name',
       'neighbourhood_group', 'neighbourhood', 'lat', 'long', 'country',
       'country_code', 'instant_bookable', 'cancellation_policy', 'room_type',
       'construction_year', 'price', 'service_fee', 'minimum_nights',
       'number_of_reviews', 'last_review', 'reviews_per_month',
       'review_rate_number', 'calculated_host_listings_count',
       'availability_365', 'house_rules', 'license'],
      dtype='object')
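
As a small illustration of what `columns_snakecase()` does, the sketch below applies the same two string operations to a toy frame. The frame itself is hypothetical; only the column names are borrowed from the raw dataset.

```python
import pandas as pd

# Toy frame (hypothetical) with column names in the raw dataset's mixed style.
toy = pd.DataFrame(columns=['NAME', 'host id', 'Construction year', 'house_rules'])

# Same transformation as columns_snakecase(): lowercase, then spaces -> underscores.
toy.columns = toy.columns.str.lower().str.replace(' ', '_')

snake_columns = list(toy.columns)
print(snake_columns)
```

Names that already use underscores, such as `house_rules`, pass through unchanged.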

***2. Null values handling.***
*   Drop columns with a high missing rate (more than 15%)



In [75]:
columns_drop_by_null_percentage(df, 0.15)

Droping: Index(['last_review', 'reviews_per_month', 'house_rules', 'license'], dtype='object')
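
The selection above relies on `isnull().mean()`, which gives the fraction of missing values per column. A minimal sketch on a toy frame (hypothetical data, not the Airbnb file) shows how the 0.15 threshold picks columns:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): 10 rows with different missing rates per column.
toy = pd.DataFrame({
    'price': [float(i) for i in range(10)],   # 0% missing
    'lat': [np.nan] + [40.7] * 9,             # 10% missing
    'license': [np.nan] * 9 + ['abc'],        # 90% missing
})

# Fraction of nulls per column; True counts as 1 when averaged.
null_share = toy.isnull().mean()

# Strict comparison, as in columns_drop_by_null_percentage().
to_drop = list(null_share[null_share > 0.15].index)
trimmed = toy.drop(columns=to_drop)

print(null_share)
print(to_drop)
```

Because the comparison is strict, a column with exactly 15% missing values would be kept.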


*   Drop irrelevant columns



In [76]:
irrelevant_to_drop = ['name', 'host_id', 'country', 'country_code',
                   'host_name', 'calculated_host_listings_count']
columns_drop(df, irrelevant_to_drop)

In [77]:
# check
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102599 entries, 0 to 102598
Data columns (total 16 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   id                      102599 non-null  int64  
 1   host_identity_verified  102310 non-null  object 
 2   neighbourhood_group     102570 non-null  object 
 3   neighbourhood           102583 non-null  object 
 4   lat                     102591 non-null  float64
 5   long                    102591 non-null  float64
 6   instant_bookable        102494 non-null  object 
 7   cancellation_policy     102523 non-null  object 
 8   room_type               102599 non-null  object 
 9   construction_year       102385 non-null  float64
 10  price                   102352 non-null  object 
 11  service_fee             102326 non-null  object 
 12  minimum_nights          102190 non-null  float64
 13  number_of_reviews       102416 non-null  float64
 14  review_rate_number  


*   Fill numeric missing values.
  -  `price`: 247 rows, fill with the mean price of listings in the same `neighbourhood` and `room_type`; any value still missing after that is filled with 0.
  -  `service_fee`: 273 rows, fill with 0.0.
  -  `minimum_nights`: 409 rows, fill with 0.
  -  `number_of_reviews`: 183 rows, fill with 0.
  -  `review_rate_number`: 326 rows, fill with 0.
  -  `availability_365`: 448 rows, fill with 0.

*   Fill categorical missing values.
  -  `host_identity_verified`: 289 rows, fill with 'unconfirmed'.
  -  `cancellation_policy`: 76 rows, fill with 'strict'.
*   Fill bool missing values.
  -  `instant_bookable`: 105 rows, fill with `False`.
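
The `price` fill uses a group mean computed with `groupby(...).transform('mean')`, which returns a Series aligned row by row with the original frame. The sketch below shows the idea on hypothetical toy data, including the case where a whole group has no known price:

```python
import numpy as np
import pandas as pd

# Toy listings (hypothetical values).
toy = pd.DataFrame({
    'neighbourhood': ['harlem', 'harlem', 'harlem', 'midtown', 'midtown', 'soho'],
    'room_type': ['private room'] * 3 + ['entire home/apt'] * 2 + ['private room'],
    'price': [100.0, 200.0, np.nan, 400.0, np.nan, np.nan],
})

# Mean price of each (neighbourhood, room_type) group, one value per row.
group_mean = toy.groupby(['neighbourhood', 'room_type'])['price'].transform('mean')

# Fill missing prices with the mean of their own group.
toy['price'] = toy['price'].fillna(group_mean)

# A group with no known price has a NaN mean, so a fallback is still needed.
toy['price'] = toy['price'].fillna(0)

print(toy['price'].tolist())
```

The single `soho` listing ends up with a price of 0, which is why a later step that removes zero prices matters for the group-mean strategy.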



In [78]:
columns_dollarize(df, ['price', 'service_fee'])
columns_fill_null(df, ['price'], df.groupby(['neighbourhood', 'room_type'])['price'].transform('mean'))
columns_fill_null(df, ['service_fee', 'price', 'minimum_nights', 'number_of_reviews',
                       'review_rate_number', 'availability_365'], 0)
columns_fill_null(df, ['host_identity_verified'], 'unconfirmed')
columns_fill_null(df, ['cancellation_policy'], 'strict')
columns_fill_null(df, ['instant_bookable'], False)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  dataFrame[column].fillna(value, inplace=True)
  dataFrame[column].fillna(value, inplace=True)
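
The warning above is triggered by calling `fillna(..., inplace=True)` on a column selection inside `columns_fill_null()`. One way to avoid it, following the warning's own suggestion, is to assign the filled column back. This is a minimal sketch; `columns_fill_null_safe` is a hypothetical name and the toy data is made up:

```python
import numpy as np
import pandas as pd

def columns_fill_null_safe(data_frame: pd.DataFrame, columns: list, value) -> None:
    """Fill nulls column by column without chained inplace calls."""
    for column in columns:
        # Assign the result back instead of mutating an intermediate object.
        data_frame[column] = data_frame[column].fillna(value)

# Toy frame (hypothetical values).
toy = pd.DataFrame({
    'service_fee': [10.0, np.nan],
    'minimum_nights': [np.nan, 3.0],
})
columns_fill_null_safe(toy, ['service_fee', 'minimum_nights'], 0)
print(toy)
```

As in the original helper, `value` can be a scalar or a Series aligned on the frame's index.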


*   Drop rows with null values in the following columns (counts taken from the initial snapshot).
  -  `lat` and `long`: 8 rows.
  -  `neighbourhood_group`: 29 rows.
  -  `neighbourhood`: 16 rows.
  -  `construction_year`: 214 rows.
  -  `room_type`: 0 rows (no nulls in the snapshot).

In [79]:
rows_drop_by_null(df, ['lat', 'long', 'neighbourhood_group', 'neighbourhood', 'construction_year', 'room_type'])
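
The per-column null counts listed above add up to 267, while the row count in the following `df.info()` falls from 102599 to 102338, a difference of 261. That is consistent with some rows having nulls in more than one of these columns, since `dropna(subset=...)` removes each such row only once. A small sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): row 1 is null in both checked columns.
toy = pd.DataFrame({
    'lat': [40.7, np.nan, 40.8, 40.6],
    'construction_year': [2010.0, np.nan, np.nan, 2015.0],
    'price': [100.0, 200.0, np.nan, 400.0],
})
subset = ['lat', 'construction_year']

# Total null cells across the checked columns.
per_column_nulls = int(toy[subset].isnull().sum().sum())

# Rows with a null in any checked column are removed once.
cleaned = toy.dropna(subset=subset)
rows_dropped = len(toy) - len(cleaned)

print(per_column_nulls, rows_dropped)
```

Nulls in columns outside `subset`, such as `price` here, do not cause a row to be dropped.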

In [80]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 102338 entries, 0 to 102598
Data columns (total 16 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   id                      102338 non-null  int64  
 1   host_identity_verified  102338 non-null  object 
 2   neighbourhood_group     102338 non-null  object 
 3   neighbourhood           102338 non-null  object 
 4   lat                     102338 non-null  float64
 5   long                    102338 non-null  float64
 6   instant_bookable        102338 non-null  bool   
 7   cancellation_policy     102338 non-null  object 
 8   room_type               102338 non-null  object 
 9   construction_year       102338 non-null  float64
 10  price                   102338 non-null  float64
 11  service_fee             102338 non-null  float64
 12  minimum_nights          102338 non-null  float64
 13  number_of_reviews       102338 non-null  float64
 14  review_rate_number      1

***3. Inconsistency and outliers handling***


*   Convert to lower case for all categorical values.
*   Fix all inconsistent cases.
*   Cast columns to the appropriate types.
*   Handle all outliers.
*   Drop duplicated records.




*   Convert to lower case for all categorical values
  - Applies to the columns `host_identity_verified`, `neighbourhood_group`, `neighbourhood`, `cancellation_policy`, `room_type`.



In [81]:
columns_lowercase(df, ['host_identity_verified', 'neighbourhood_group',
                       'neighbourhood', 'cancellation_policy', 'room_type'])

In [82]:
# check
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 102338 entries, 0 to 102598
Data columns (total 16 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   id                      102338 non-null  int64  
 1   host_identity_verified  102338 non-null  object 
 2   neighbourhood_group     102338 non-null  object 
 3   neighbourhood           102338 non-null  object 
 4   lat                     102338 non-null  float64
 5   long                    102338 non-null  float64
 6   instant_bookable        102338 non-null  bool   
 7   cancellation_policy     102338 non-null  object 
 8   room_type               102338 non-null  object 
 9   construction_year       102338 non-null  float64
 10  price                   102338 non-null  float64
 11  service_fee             102338 non-null  float64
 12  minimum_nights          102338 non-null  float64
 13  number_of_reviews       102338 non-null  float64
 14  review_rate_number      1

In [83]:
# Check inconsistency
df['neighbourhood_group'].value_counts().sort_index()

Unnamed: 0_level_0,count
neighbourhood_group,Unnamed: 1_level_1
bronx,2709
brookln,1
brooklyn,41735
manhatan,1
manhattan,43690
queens,13248
staten island,954


*   Fix all inconsistent cases.
  - `neighbourhood_group`: Misspellings 'manhatan' (should be 'manhattan') and 'brookln' (should be 'brooklyn').



In [84]:
df.loc[df['neighbourhood_group'] == 'manhatan', "neighbourhood_group"] = 'manhattan'
df.loc[df['neighbourhood_group'] == 'brookln', "neighbourhood_group"] = 'brooklyn'
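
An alternative to one `df.loc` assignment per misspelling is a single correction mapping passed to `Series.replace`. The sketch below is an assumption about how this could be organised, not the notebook's own approach; the toy Series is hypothetical:

```python
import pandas as pd

# Toy column (hypothetical) containing the two misspellings found above.
groups = pd.Series(['brooklyn', 'brookln', 'manhattan', 'manhatan', 'queens'])

# One mapping holds every known correction.
corrections = {'manhatan': 'manhattan', 'brookln': 'brooklyn'}

# With a dict, replace() swaps exact matching values and leaves the rest alone.
fixed = groups.replace(corrections)

print(fixed.value_counts().sort_index())
```

Keeping the corrections in one dictionary makes it easier to extend if further misspellings turn up.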

In [85]:
# check
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 102338 entries, 0 to 102598
Data columns (total 16 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   id                      102338 non-null  int64  
 1   host_identity_verified  102338 non-null  object 
 2   neighbourhood_group     102338 non-null  object 
 3   neighbourhood           102338 non-null  object 
 4   lat                     102338 non-null  float64
 5   long                    102338 non-null  float64
 6   instant_bookable        102338 non-null  bool   
 7   cancellation_policy     102338 non-null  object 
 8   room_type               102338 non-null  object 
 9   construction_year       102338 non-null  float64
 10  price                   102338 non-null  float64
 11  service_fee             102338 non-null  float64
 12  minimum_nights          102338 non-null  float64
 13  number_of_reviews       102338 non-null  float64
 14  review_rate_number      1

***Area Name Similarity Check and Results***

* To ensure data accuracy, we manually verified neighbourhoods with similar names using Google Maps. The pairs checked were:
  - Bay Terrace vs. Bay Terrace, Staten Island
  - Chelsea vs. Chelsea, Staten Island
  - Clifton vs. Clinton Hill
  - Concourse vs. Concourse Village
  - Hollis vs. Holliswood
  - Jamaica vs. Jamaica Estates vs. Jamaica Hills
  - Kew Gardens vs. Kew Gardens Hills
  - New Dorp vs. New Dorp Beach

**All Checked: Confirmed as different areas.**
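
The check itself was manual. As a hedged aside, a shortlist of candidates for such a review could be generated with a simple prefix test, sketched below on a handful of names from the list above. This is an illustrative assumption, not the method used in this notebook, and it only catches names where one is a prefix of the other (so it would miss a pair like Clifton vs. Clinton Hill).

```python
from itertools import combinations

# A few lowercase neighbourhood names taken from the manual check list.
names = [
    'bay terrace', 'bay terrace, staten island',
    'clifton', 'clinton hill',
    'hollis', 'holliswood',
    'astoria',
]

# After sorting, a shorter prefix always comes before the longer name,
# so checking b.startswith(a) for each ordered pair is enough.
candidates = [
    (a, b) for a, b in combinations(sorted(names), 2) if b.startswith(a)
]

for pair in candidates:
    print(pair)
```

Each flagged pair would still need a human decision on whether the two names refer to the same area.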


In [86]:
# Check inconsistency
df['neighbourhood'].value_counts().sort_index().to_string()

"neighbourhood\nallerton                        96\narden heights                    9\narrochar                        52\narverne                        223\nastoria                       1872\nbath beach                      48\nbattery park city              118\nbay ridge                      304\nbay terrace                      8\nbay terrace, staten island       4\nbaychester                      29\nbayside                        124\nbayswater                       40\nbedford-stuyvesant            7918\nbelle harbor                    31\nbellerose                       26\nbelmont                         45\nbensonhurst                    157\nbergen beach                    30\nboerum hill                    357\nborough park                   268\nbreezy point                     9\nbriarwood                      121\nbrighton beach                 167\nbronxdale                       48\nbrooklyn heights               308\nbrownsville                    153\nbull's head 

In [87]:
# Check inconsistency
df['host_identity_verified'].value_counts().sort_index()

Unnamed: 0_level_0,count
host_identity_verified,Unnamed: 1_level_1
unconfirmed,51333
verified,51005


In [88]:
# Check inconsistency
df['cancellation_policy'].value_counts().sort_index()

Unnamed: 0_level_0,count
cancellation_policy,Unnamed: 1_level_1
flexible,33975
moderate,34265
strict,34098


In [89]:
# Check inconsistency
df['room_type'].value_counts().sort_index()

Unnamed: 0_level_0,count
room_type,Unnamed: 1_level_1
entire home/apt,53558
hotel room,116
private room,46439
shared room,2225


*   Cast columns to the appropriate types.
  - Integer columns: `minimum_nights`, `number_of_reviews`, `review_rate_number`, `availability_365`, `construction_year`
  - Category columns: `host_identity_verified`, `neighbourhood_group`, `neighbourhood`, `cancellation_policy`, `room_type`
  - Float columns: `lat`,`long`, `price`, `service_fee`
  - Boolean column: `instant_bookable`


In [90]:
int_columns = ['minimum_nights', 'number_of_reviews', 'review_rate_number', 'availability_365', 'construction_year']
cat_columns = ['host_identity_verified', 'neighbourhood_group', 'neighbourhood', 'cancellation_policy', 'room_type']
float_columns = ['lat','long', 'price', 'service_fee']
columns_intize(df, int_columns)
columns_categorize(df, cat_columns)
columns_floatize(df, float_columns)
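
The casts above only work because nulls were filled or dropped earlier: converting a float column that still contains `NaN` to `int` raises an error. The sketch below shows this on hypothetical toy data, and also shows `DataFrame.astype` with a column-to-dtype mapping as a compact alternative to the per-type helpers.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values).
toy = pd.DataFrame({
    'minimum_nights': [10.0, 30.0],
    'room_type': ['private room', 'entire home/apt'],
    'price': [966.0, 142.0],
})

# A mapping casts several columns in one call.
toy = toy.astype({'minimum_nights': int, 'room_type': 'category'})
print(toy.dtypes)

# Casting NaN to int fails, which is why null handling comes first.
cast_failed = False
try:
    pd.Series([1.0, np.nan]).astype(int)
except ValueError:
    cast_failed = True
print(cast_failed)
```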

In [91]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 102338 entries, 0 to 102598
Data columns (total 16 columns):
 #   Column                  Non-Null Count   Dtype   
---  ------                  --------------   -----   
 0   id                      102338 non-null  int64   
 1   host_identity_verified  102338 non-null  category
 2   neighbourhood_group     102338 non-null  category
 3   neighbourhood           102338 non-null  category
 4   lat                     102338 non-null  float64 
 5   long                    102338 non-null  float64 
 6   instant_bookable        102338 non-null  bool    
 7   cancellation_policy     102338 non-null  category
 8   room_type               102338 non-null  category
 9   construction_year       102338 non-null  int64   
 10  price                   102338 non-null  float64 
 11  service_fee             102338 non-null  float64 
 12  minimum_nights          102338 non-null  int64   
 13  number_of_reviews       102338 non-null  int64   
 14  review_ra

*   Handle all outliers.
  - `price`: By definition, this is the daily price in local currency, so it should not be less than or equal to 0. Rows violating this are dropped.
  - `availability_365`: By definition, this is the availability of the listing x days in the future as determined by the calendar, so it should be between 0 and 365. Values outside that range are clipped to 0 or 365.
  - `minimum_nights`: By definition, this is the minimum number of nights per stay for the listing (calendar rules may be different), so it should not be less than 0. Negative values are set to 0.

In [92]:
# Drop rows whose price is zero or negative.
rows_drop_by_condition(df, df['price'] <= 0)
apply_lambda(df, ['availability_365', 'minimum_nights'], lambda x: max(0, x))
apply_lambda(df, ['availability_365'], lambda x: min(365, x))
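
The same bounds can be expressed with the vectorised `Series.clip`, which avoids a Python-level call per row. A minimal sketch, using the extreme values reported by the earlier `describe()` output in a toy frame:

```python
import pandas as pd

# Toy frame built from the min/max values seen in describe(), plus one normal row.
toy = pd.DataFrame({
    'availability_365': [-10, 286, 3677],
    'minimum_nights': [-1223, 3, 5645],
})

# Availability is bounded on both sides.
toy['availability_365'] = toy['availability_365'].clip(lower=0, upper=365)

# Minimum nights only gets a lower bound, matching the cell above.
toy['minimum_nights'] = toy['minimum_nights'].clip(lower=0)

print(toy)
```

As in the notebook cell, large `minimum_nights` values such as 5645 are left untouched.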

In [93]:
# Check
df['availability_365'].describe()

Unnamed: 0,availability_365
count,102338.0
mean,139.651947
std,133.477069
min,0.0
25%,2.0
50%,95.0
75%,268.0
max,365.0


In [94]:
# Check
df['minimum_nights'].describe()

Unnamed: 0,minimum_nights
count,102338.0
mean,8.120112
std,30.227049
min,0.0
25%,1.0
50%,3.0
75%,5.0
max,5645.0


*   Drop duplicated records (541 duplicated rows removed).

In [95]:
df.drop_duplicates(inplace=True)
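
To see how many rows a call like this would remove before running it, `DataFrame.duplicated()` can be summed first. A short sketch on hypothetical data:

```python
import pandas as pd

# Toy frame (hypothetical): the row at index 2 repeats the row at index 1.
toy = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'price': [100.0, 200.0, 200.0, 300.0],
})

# duplicated() marks every repeat after the first occurrence.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence and preserves index labels.
deduped = toy.drop_duplicates()

print(n_dupes, len(deduped))
```

Since index labels are preserved, the index of the cleaned frame can contain gaps, as seen in the `df.info()` output that follows.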

In [96]:
# Cleaned check
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 101797 entries, 0 to 102057
Data columns (total 16 columns):
 #   Column                  Non-Null Count   Dtype   
---  ------                  --------------   -----   
 0   id                      101797 non-null  int64   
 1   host_identity_verified  101797 non-null  category
 2   neighbourhood_group     101797 non-null  category
 3   neighbourhood           101797 non-null  category
 4   lat                     101797 non-null  float64 
 5   long                    101797 non-null  float64 
 6   instant_bookable        101797 non-null  bool    
 7   cancellation_policy     101797 non-null  category
 8   room_type               101797 non-null  category
 9   construction_year       101797 non-null  int64   
 10  price                   101797 non-null  float64 
 11  service_fee             101797 non-null  float64 
 12  minimum_nights          101797 non-null  int64   
 13  number_of_reviews       101797 non-null  int64   
 14  review_ra

In [97]:
# Cleaned check
df.describe()

Unnamed: 0,id,lat,long,construction_year,price,service_fee,minimum_nights,number_of_reviews,review_rate_number,availability_365
count,101797.0,101797.0,101797.0,101797.0,101797.0,101797.0,101797.0,101797.0,101797.0,101797.0
mean,29233240.0,40.728094,-73.949639,2012.487912,625.386732,124.717978,8.111555,27.383833,3.269281,139.562246
std,16243810.0,0.055859,0.049524,5.765736,331.288479,66.548931,30.290076,49.413577,1.29547,133.47313
min,1001254.0,40.49979,-74.24984,2003.0,50.0,0.0,0.0,0.0,0.0,0.0
25%,15162310.0,40.68873,-73.98258,2007.0,341.0,67.0,1.0,1.0,2.0,2.0
50%,29239880.0,40.72229,-73.95443,2012.0,625.0,125.0,3.0,7.0,3.0,95.0
75%,43298680.0,40.76276,-73.93234,2017.0,912.0,182.0,5.0,30.0,4.0,268.0
max,57367420.0,40.91697,-73.70522,2022.0,1200.0,240.0,5645.0,1024.0,5.0,365.0


In [98]:
# Save cleaned DataFrame as 'df_cleaned.csv'
df.to_csv('df_cleaned.csv', index=False)
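
One caveat when reloading the saved file: CSV stores no dtype information, so the `category` columns set earlier will not come back as categories unless they are requested on read. The sketch below demonstrates this with an in-memory buffer and hypothetical toy data instead of `df_cleaned.csv`:

```python
import io
import pandas as pd

# Toy frame (hypothetical) with one categorical column.
toy = pd.DataFrame({
    'room_type': pd.Categorical(['private room', 'shared room']),
    'price': [966.0, 142.0],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read infers dtypes from the text, so the category dtype is lost.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing dtype restores the categorical type on load.
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={'room_type': 'category'})

print(plain.dtypes)
print(typed.dtypes)
```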