
# Data Loading


Importing libraries

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [16]:
airbnb_df = pd.read_csv("/content/airbnb.csv")
airbnb_df.head()

Unnamed: 0,Host Id,Host Since,Name,Neighbourhood,Property Type,Review Scores Rating (bin),Room Type,Zipcode,Beds,Number of Records,Number Of Reviews,Price,Review Scores Rating
0,500,06-26-08,Gorgeous 1 BR with Private Balcony,Manhattan,Apartment,,Entire home/apt,10024.0,3.0,1,0,199,
1,500,06-26-08,Trendy Times Square Loft,Manhattan,Apartment,95.0,Private room,10036.0,3.0,1,39,549,96.0
2,1039,07-25-08,Big Greenpoint 1BD w/ Skyline View,Brooklyn,Apartment,100.0,Entire home/apt,11222.0,1.0,1,4,149,100.0
3,1783,08-12-08,Amazing Also,Manhattan,Apartment,100.0,Entire home/apt,10004.0,1.0,1,9,250,100.0
4,2078,08-15-08,"Colorful, quiet, & near the subway!",Brooklyn,Apartment,90.0,Private room,11201.0,1.0,1,80,90,94.0


In [17]:
airbnb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30475 entries, 0 to 30474
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Host Id                     30475 non-null  int64  
 1   Host Since                  30475 non-null  object 
 2   Name                        30475 non-null  object 
 3   Neighbourhood               30475 non-null  object 
 4   Property Type               30472 non-null  object 
 5   Review Scores Rating (bin)  22155 non-null  float64
 6   Room Type                   30475 non-null  object 
 7   Zipcode                     30341 non-null  float64
 8   Beds                        30390 non-null  float64
 9   Number of Records           30475 non-null  int64  
 10  Number Of Reviews           30475 non-null  int64  
 11  Price                       30475 non-null  object 
 12  Review Scores Rating        22155 non-null  float64
dtypes: float64(4), int64(3), object(6)

### Converting the Host Since column to a proper datetime format

In [18]:
airbnb_df['Host Since'] = pd.to_datetime(airbnb_df['Host Since'], format='%m-%d-%y', errors='coerce')
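
To make the effect of `format='%m-%d-%y'` and `errors='coerce'` concrete, here is a minimal sketch on a few hypothetical values (the malformed entry is made up for illustration and is not taken from the dataset). Values that do not match the format become `NaT` instead of raising an error, so it is worth counting them after the conversion.

```python
import pandas as pd

# Hypothetical sample values; the last one is deliberately malformed.
raw = pd.Series(["06-26-08", "07-25-08", "not a date"])

# Two-digit years are parsed with %y; unparseable entries are coerced to NaT.
parsed = pd.to_datetime(raw, format="%m-%d-%y", errors="coerce")

print(parsed)
print("Unparseable rows:", parsed.isna().sum())
```

In this notebook the converted column still reports 30475 non-null values, which suggests that no rows were coerced to `NaT`.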

In [19]:
airbnb_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30475 entries, 0 to 30474
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   Host Id                     30475 non-null  int64         
 1   Host Since                  30475 non-null  datetime64[ns]
 2   Name                        30475 non-null  object        
 3   Neighbourhood               30475 non-null  object        
 4   Property Type               30472 non-null  object        
 5   Review Scores Rating (bin)  22155 non-null  float64       
 6   Room Type                   30475 non-null  object        
 7   Zipcode                     30341 non-null  float64       
 8   Beds                        30390 non-null  float64       
 9   Number of Records           30475 non-null  int64         
 10  Number Of Reviews           30475 non-null  int64         
 11  Price                       30475 non-null  object    

Creating a new feature, "Host Tenure", from the "Host Since" column.
This feature can help the model predict prices because properties of experienced hosts are often priced differently: sometimes higher due to better service, sometimes lower due to more competitive strategies.
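
The cells in this notebook do not show the tenure computation itself, so the following is only a sketch of one way it could be done. The column names `Host Tenure Days` and `Host Tenure Years` and the reference date are assumptions for illustration; the dataset's actual collection date is not stated here.

```python
import pandas as pd

# Hypothetical sample rows standing in for airbnb_df after the datetime conversion.
df = pd.DataFrame({"Host Since": pd.to_datetime(["2008-06-26", "2012-01-15"])})

# Assumed reference date; replace it with the date the data was actually collected.
reference_date = pd.Timestamp("2015-01-01")

# Tenure as whole days, then as approximate years.
df["Host Tenure Days"] = (reference_date - df["Host Since"]).dt.days
df["Host Tenure Years"] = df["Host Tenure Days"] / 365.25

print(df)
```

Because the preprocessing step later drops `Host Since`, a numeric tenure column like this would have to be created beforehand for the model to see it.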

In [20]:
# EDA: count missing values per column
airbnb_df.isnull().sum()

Unnamed: 0,0
Host Id,0
Host Since,0
Name,0
Neighbourhood,0
Property Type,3
Review Scores Rating (bin),8320
Room Type,0
Zipcode,134
Beds,85
Number of Records,0


In [21]:
# Handling missing values
# Check skewness for both Review Scores Rating & Review Scores Rating (bin) columns
skew_rating = airbnb_df['Review Scores Rating'].skew()
skew_rating_bin = airbnb_df['Review Scores Rating (bin)'].skew()

print("Skewness of Review Scores Rating:", skew_rating)
print("Skewness of Review Scores Rating (bin):", skew_rating_bin)

Skewness of Review Scores Rating: -2.478578476320368
Skewness of Review Scores Rating (bin): -2.1085648167543


### As both columns are numerical and strongly negatively skewed (skewness ≈ -2.5 and -2.1), we should use the median rather than the mean for imputation, because the mean is pulled toward the few very low ratings.
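
A small sketch of why the median is the safer fill value for a negatively skewed column. The ratings below are hypothetical and chosen only to mimic the shape of the real columns: mostly high scores with a few very low ones.

```python
import pandas as pd

# Hypothetical ratings with two missing values and a long left tail.
ratings = pd.Series([100, 98, 97, 95, 95, 94, 92, 90, 40, 20, None, None])

print("Skewness:", ratings.skew())
print("Mean:    ", ratings.mean())
print("Median:  ", ratings.median())

# Median imputation keeps the filled values near the bulk of the data.
filled = ratings.fillna(ratings.median())
print("Missing after fill:", filled.isna().sum())
```

Here the two low scores drag the mean well below the median, so filling with the mean would assign missing listings a rating lower than most observed ones.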

In [22]:
# Fill missing values in both 'Review Scores Rating' and 'Review Scores Rating (bin)' with their respective medians
airbnb_df['Review Scores Rating'] = airbnb_df['Review Scores Rating'].fillna(airbnb_df['Review Scores Rating'].median())
airbnb_df['Review Scores Rating (bin)'] = airbnb_df['Review Scores Rating (bin)'].fillna(airbnb_df['Review Scores Rating (bin)'].median())

In [23]:
# Drop the remaining rows with missing values (Property Type, Zipcode, Beds)
airbnb_df.dropna(inplace=True)

In [24]:
airbnb_df.isnull().sum()

Unnamed: 0,0
Host Id,0
Host Since,0
Name,0
Neighbourhood,0
Property Type,0
Review Scores Rating (bin),0
Room Type,0
Zipcode,0
Beds,0
Number of Records,0


In [27]:
airbnb_df.columns

Index(['Host Id', 'Host Since', 'Name', 'Neighbourhood ', 'Property Type',
       'Review Scores Rating (bin)', 'Room Type', 'Zipcode', 'Beds',
       'Number of Records', 'Number Of Reviews', 'Price',
       'Review Scores Rating'],
      dtype='object')

## Advanced Feature Engineering

In [32]:
#  Clean Price column (remove commas, convert to numeric)
target_col = 'Price'
airbnb_df[target_col] = airbnb_df[target_col].astype(str).str.replace(',', '', regex=False)
airbnb_df[target_col] = pd.to_numeric(airbnb_df[target_col], errors='coerce')
airbnb_df = airbnb_df.dropna(subset=[target_col])

#  Dataset without interaction feature
df_no_interaction = airbnb_df.copy()

#  Dataset with engineered interaction feature
df_interaction = airbnb_df.copy()
# Correct column names: 'Neighbourhood ' (with trailing space!) and 'Room Type'
df_interaction['Neighbourhood_RoomType'] = df_interaction['Neighbourhood '] + "_" + df_interaction['Room Type']

# Preprocessing function (drop target and unneeded columns, one-hot encode categoricals)
def preprocess(df, target_col):
    # errors='ignore' skips any listed column that is not present
    X = df.drop(columns=[target_col, 'Name', 'Host Since'], errors='ignore')
    # One-hot encode the remaining categorical variables
    X = pd.get_dummies(X, drop_first=True)
    return X

X_no_interaction = preprocess(df_no_interaction, target_col)
X_interaction = preprocess(df_interaction, target_col)
y = airbnb_df[target_col]

# Train/test split (the same random_state and row count give identical row splits for both feature sets)
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_no_interaction, y, test_size=0.2, random_state=42)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_interaction, y, test_size=0.2, random_state=42)

# Train a Random Forest on each feature set and compute RMSE as sqrt(MSE)
# (the 'squared' parameter is not available in every scikit-learn version)
rf1 = RandomForestRegressor(random_state=42)
rf1.fit(X_train1, y_train1)
preds1 = rf1.predict(X_test1)
rmse1 = np.sqrt(mean_squared_error(y_test1, preds1))

rf2 = RandomForestRegressor(random_state=42)
rf2.fit(X_train2, y_train2)
preds2 = rf2.predict(X_test2)
rmse2 = np.sqrt(mean_squared_error(y_test2, preds2))

print(f"RMSE without interaction feature: {rmse1:.3f}")
print(f"RMSE with interaction feature:    {rmse2:.3f}")

RMSE without interaction feature: 169.474
RMSE with interaction feature:    164.111


## The new interaction feature improved the model's performance on this test split. The RMSE decreased from 169.474 to 164.111 (about 3%), indicating better predictive accuracy after adding the Neighbourhood_RoomType feature.
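
To illustrate what the interaction feature adds, here is a minimal sketch on four hypothetical rows. One-hot encoding the two columns separately gives one indicator per neighbourhood and one per room type, while the combined column gives one indicator per neighbourhood and room type pair, so a pair-specific price effect can be picked up with a single split.

```python
import pandas as pd

# Hypothetical rows; note the trailing space in 'Neighbourhood ', as in the dataset.
df = pd.DataFrame({
    "Neighbourhood ": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn"],
    "Room Type": ["Entire home/apt", "Private room", "Entire home/apt", "Private room"],
})
df["Neighbourhood_RoomType"] = df["Neighbourhood "] + "_" + df["Room Type"]

# Separate encoding: one dummy per column once the first level is dropped.
separate = pd.get_dummies(df[["Neighbourhood ", "Room Type"]], drop_first=True)

# Interaction encoding: one dummy per remaining (neighbourhood, room type) pair.
combined = pd.get_dummies(df[["Neighbourhood_RoomType"]], drop_first=True)

print(separate.columns.tolist())
print(combined.columns.tolist())
```

A tree ensemble can in principle learn such interactions from the separate indicators, so the gain seen above is an empirical result for this split rather than a guarantee.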

## Model Training and Evaluation

In [33]:


# Train/test split on the feature set that includes the interaction feature
X_train, X_test, y_train, y_test = train_test_split(X_interaction, y, test_size=0.2, random_state=42)

# Model training
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)

# Evaluation on the held-out test set
preds = rf.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))

print(f"Final RMSE of Random Forest model: {rmse:.2f}")

Final RMSE of Random Forest model: 164.11


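### The RMSE is expressed in the same units as Price, so a value of 164.11 means predictions on the test set are typically off by roughly that amount, with large errors weighted more heavily. This gives hosts a concrete margin of error to keep in mind when using predicted prices to guide pricing decisions.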
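
A minimal sketch of how RMSE is computed and why it reacts strongly to large errors. The prices below are hypothetical; the mean absolute error (MAE) is included only for comparison and is not part of the notebook's evaluation.

```python
import numpy as np

# Hypothetical true and predicted prices; the last listing is badly mispredicted.
y_true = np.array([100.0, 150.0, 200.0, 1000.0])
y_pred = np.array([110.0, 140.0, 210.0, 700.0])

errors = y_pred - y_true

# RMSE: square the errors, average them, then take the square root.
rmse = np.sqrt(np.mean(errors ** 2))

# MAE: average of the absolute errors, shown for comparison.
mae = np.mean(np.abs(errors))

print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.2f}")
```

Three of the four predictions are off by only 10, yet the single large miss dominates the RMSE. A few very expensive listings could inflate the reported RMSE in the same way.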

In [34]:
import pickle

# Save the trained model ('rf') to disk
with open("random_forest_airbnb_model.pkl", "wb") as f:
    pickle.dump(rf, f)

In [35]:
# Load the saved model back (only unpickle files from a trusted source)
with open("random_forest_airbnb_model.pkl", "rb") as f:
    loaded_rf = pickle.load(f)