# IS 4487 Assignment 11: Predicting Airbnb Prices with Regression

In this assignment, you will:
- Load the Airbnb dataset you cleaned and transformed in Assignment 7
- Build a linear regression model to predict listing price
- Interpret which features most affect price
- Try to improve your model using only the most impactful predictors
- Practice explaining your findings to a business audience like a host, pricing strategist, or city partner

## Why This Matters

Pricing is one of the most important levers for hosts and Airbnb's business teams. Understanding what drives price, and being able to predict it accurately, helps improve search results, revenue management, and guest satisfaction.

This assignment gives you hands-on practice turning a cleaned dataset into a predictive model. You'll focus not just on code, but on what the results mean and how you'd communicate them to stakeholders.

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Assignments/assignment_11_regression.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>



## Original Source: Dataset Description

The dataset you'll be using is a **detailed Airbnb listing file**, available from [Inside Airbnb](https://insideairbnb.com/get-the-data/).

Each row represents one property listing. The columns include:

- **Host attributes** (e.g., host ID, host name, host response time)
- **Listing details** (e.g., price, room type, minimum nights, availability)
- **Location data** (e.g., neighborhood, latitude/longitude)
- **Property characteristics** (e.g., number of bedrooms, amenities, accommodates)
- **Calendar/booking variables** (e.g., last review date, number of reviews)

The schema is consistent across cities, so you can expect similar columns regardless of the location you choose.

In [29]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score


## 1. Load Your Transformed Airbnb Dataset

**Business framing:**  
Before building any models, we must start with clean, prepared data. In Assignment 7, you exported a cleaned version of your Airbnb dataset. You'll now import that file for analysis.

### Do the following:
- Import your CSV file called `cleaned_airbnb_data_7.csv`. (Note: If you had significant errors with Assignment 7, you can use the file named `airbnb_listings.csv` in the DataSets folder on GitHub as a backup starting point.)
- Use `pandas` to load and preview the dataset

### In Your Response:
1. What does the dataset include?
2. How many rows and columns are present?


In [30]:
# Add code here 🔧
df = pd.read_csv('cleaned_airbnb_data_7.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 459 entries, 0 to 458
Data columns (total 74 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            459 non-null    int64  
 1   listing_url                                   459 non-null    object 
 2   name                                          459 non-null    object 
 3   description                                   449 non-null    object 
 4   neighborhood_overview                         196 non-null    object 
 5   picture_url                                   459 non-null    object 
 6   host_id                                       459 non-null    int64  
 7   host_url                                      459 non-null    object 
 8   host_name                                     459 non-null    object 
 9   host_since                                    459 non-null    obj

In [31]:
df.head()

Unnamed: 0,id,listing_url,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,...,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,null_price_values
0,2992450,https://www.airbnb.com/rooms/2992450,Luxury 2 bedroom apartment,The apartment is located in a quiet neighborho...,,https://a0.muscache.com/pictures/44627226/0e72...,4621559,https://www.airbnb.com/users/show/4621559,Kenneth,1/7/2013,...,4.56,3.22,3.67,f,1,1,0,0,0.07,0
1,3820211,https://www.airbnb.com/rooms/3820211,Funky Urban Gem: Prime Central Location - Park...,Step into the charming and comfy 1BR/1BA apart...,Overview<br /><br />The lovely apartment is lo...,https://a0.muscache.com/pictures/prohost-api/H...,19648678,https://www.airbnb.com/users/show/19648678,Terra,8/7/2014,...,4.81,4.81,4.77,f,4,4,0,0,2.32,0
2,5651579,https://www.airbnb.com/rooms/5651579,Large studio apt by Capital Center & ESP@,"Spacious studio with hardwood floors, fully eq...",The neighborhood is very eclectic. We have a v...,https://a0.muscache.com/pictures/b3fc42f3-6e5e...,29288920,https://www.airbnb.com/users/show/29288920,Gregg,3/13/2015,...,4.88,4.76,4.64,f,2,1,1,0,2.97,0
3,6623339,https://www.airbnb.com/rooms/6623339,Bright & Cozy City Stay · Top Location + Parking!,Step into the charming and comfy 1BR/1BA apart...,Overview<br /><br />The lovely apartment is lo...,https://a0.muscache.com/pictures/prohost-api/H...,19648678,https://www.airbnb.com/users/show/19648678,Terra,8/7/2014,...,4.7,4.8,4.72,f,4,4,0,0,2.68,0
4,9005989,https://www.airbnb.com/rooms/9005989,"Studio in The heart of Center SQ, in Albany NY",(21 years of age or older ONLY) NON- SMOKING.....,"There are many shops, restaurants, bars, museu...",https://a0.muscache.com/pictures/d242a77e-437c...,17766924,https://www.airbnb.com/users/show/17766924,Sugey,7/7/2014,...,4.93,4.87,4.77,f,1,1,0,0,5.67,0


In [32]:
df.describe()

Unnamed: 0,id,host_id,host_response_rate,host_acceptance_rate,host_listings_count,host_total_listings_count,neighbourhood_group_cleansed,latitude,longitude,accommodates,...,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,null_price_values
count,459.0,459.0,431.0,446.0,459.0,459.0,0.0,459.0,459.0,459.0,...,391.0,391.0,391.0,391.0,459.0,459.0,459.0,459.0,391.0,459.0
mean,8.055481e+17,252527500.0,0.957448,0.878655,31.572985,52.172113,,42.658694,-73.776369,3.448802,...,4.841586,4.861688,4.656087,4.718414,6.742919,4.779956,1.962963,0.0,1.986982,0.087146
std,5.35296e+17,200664200.0,0.137379,0.226504,158.860381,308.399586,,0.010116,0.018573,2.465282,...,0.310672,0.269378,0.39757,0.384653,6.381206,5.874372,3.851606,0.0,1.946644,0.282357
min,2992450.0,65760.0,0.0,0.0,1.0,1.0,,42.63066,-73.87649,1.0,...,2.0,3.0,2.5,1.0,1.0,0.0,0.0,0.0,0.03,0.0
25%,54034600.0,47625980.0,0.98,0.87,2.0,2.0,,42.65259,-73.788709,2.0,...,4.825,4.845,4.5,4.66,1.0,1.0,0.0,0.0,0.59,0.0
50%,9.382804e+17,232967900.0,1.0,0.98,5.0,6.0,,42.657921,-73.77367,2.0,...,4.94,4.96,4.8,4.82,3.0,2.0,0.0,0.0,1.35,0.0
75%,1.264937e+18,433096100.0,1.0,1.0,15.0,17.0,,42.665027,-73.76331,4.0,...,5.0,5.0,4.93,4.95,12.0,6.0,1.5,0.0,2.915,0.0
max,1.479181e+18,705477100.0,1.0,1.0,1258.0,2648.0,,42.7149,-73.73825,16.0,...,5.0,5.0,5.0,5.0,21.0,21.0,13.0,0.0,11.48,1.0


### ✍️ Your Response: 🔧
1. The dataset includes Airbnb listings, one per row, with host attributes, listing details, location data, property characteristics, and review information.

2. There are 459 rows and 74 columns.
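
One way to double-check the row and column counts is `DataFrame.shape`, which returns a `(rows, columns)` tuple. The sketch below uses a tiny made-up frame (`demo` and its values are illustrative, not the real listing file) so that it runs on its own; with the assignment data you would call `df.shape` on the loaded file instead.

```python
import pandas as pd

# Hypothetical stand-in for the listing file: 3 listings, 2 columns.
demo = pd.DataFrame({
    "price": [120.0, 85.0, 240.0],
    "accommodates": [2, 1, 6],
})

# shape is (number of rows, number of columns)
n_rows, n_cols = demo.shape
print(f"{n_rows} rows, {n_cols} columns")

# head() only previews the first rows (5 by default), so the length of
# the preview is not the size of the dataset.
preview_rows = len(demo.head())
```

Reading the size from `shape` (or from the `RangeIndex` line of `df.info()`) avoids mistaking the five-row `head()` preview for the full dataset.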

## 2. Drop Columns Not Useful for Modeling

**Business framing:**  
Some columns, like post IDs or text, may not help us predict price and could add noise or bias.

### Do the following:
- Drop columns like `post_id`, `title`, `descr`, `details`, and `address` if they're still in your dataset

### In Your Response:
1. What columns did you drop, and why?
2. What risks might occur if you included them in your model?


In [33]:
# Add code here 🔧
df.drop(columns=['id', 'name', 'description', 'listing_url', 'picture_url', 'host_url', 'host_picture_url'], inplace=True)

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 459 entries, 0 to 458
Data columns (total 67 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   neighborhood_overview                         196 non-null    object 
 1   host_id                                       459 non-null    int64  
 2   host_name                                     459 non-null    object 
 3   host_since                                    459 non-null    object 
 4   host_location                                 345 non-null    object 
 5   host_about                                    251 non-null    object 
 6   host_response_time                            431 non-null    object 
 7   host_response_rate                            431 non-null    float64
 8   host_acceptance_rate                          446 non-null    float64
 9   host_is_superhost                             452 non-null    obj

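### ✍️ Your Response: 🔧
1. I dropped id, name, description, listing_url, picture_url, host_url, and host_picture_url because they are identifiers, URLs, or free text rather than measurable characteristics of a listing, so they are not relevant for regression analysis.

2. If they were included, they could distort the model: linear regression expects numeric inputs, and an arbitrary number such as an ID could pick up a coefficient that reflects noise rather than a real effect on price.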
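
A minimal sketch of dropping identifier and free-text columns defensively, assuming some of the listed names may not exist in a given export. The mini-frame and the `cols_to_drop` list are illustrative; `errors="ignore"` is a real `DataFrame.drop` option that skips names that are absent.

```python
import pandas as pd

# Hypothetical mini-frame: one ID column, one text column, two numeric columns.
demo = pd.DataFrame({
    "id": [101, 102, 103],
    "name": ["Cozy studio", "Loft", "Family home"],
    "accommodates": [2, 3, 6],
    "price": [90.0, 130.0, 250.0],
})

# Columns we do not want as predictors; 'host_url' is deliberately missing here.
cols_to_drop = ["id", "name", "host_url"]

# errors="ignore" skips names that are not present instead of raising KeyError.
trimmed = demo.drop(columns=cols_to_drop, errors="ignore")
print(trimmed.columns.tolist())
```

Returning a new frame (rather than using `inplace=True`) keeps the original available if the cell is re-run.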

## 3. Explore Relationships Between Numeric Features

**Business framing:**  
Understanding how features relate to each other, and to the target, helps guide feature selection and modeling.

### Do the following:
- Generate a correlation matrix
- Identify which variables are strongly related to `price`

### In Your Response:
1. Which variables had the strongest positive or negative correlation with price?
2. Which variables might be useful predictors?


In [35]:
correlations = df.corr(numeric_only=True)
print(correlations['price'].sort_values(ascending=False))

price                                           1.000000
accommodates                                    0.646261
bedrooms                                        0.549395
beds                                            0.547032
bathrooms                                       0.468030
estimated_revenue_l365d                         0.249488
maximum_maximum_nights                          0.124171
minimum_maximum_nights                          0.113842
maximum_nights_avg_ntm                          0.112188
availability_30                                 0.105426
host_acceptance_rate                            0.078302
availability_60                                 0.053181
review_scores_rating                            0.033139
availability_90                                 0.030513
calculated_host_listings_count_entire_homes     0.028133
availability_eoy                                0.020136
calculated_host_listings_count                  0.015198
maximum_nights                 

### ✍️ Your Response: 🔧
1. The variables accommodates, bedrooms, and beds had the highest correlation with price, and minimum_nights, host_response_rate, and review_scores_communication had the lowest correlation.

2. Accommodates, bedrooms, and beds might be useful predictors.
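
Sorting by the signed correlation puts strong negative relationships at the bottom of the list, where they are easy to overlook. A possible alternative, sketched here on made-up data (the `demo` frame and its values are illustrative), is to rank by absolute correlation while keeping the sign visible.

```python
import pandas as pd

# Hypothetical data: price rises with accommodates and falls with minimum_nights.
demo = pd.DataFrame({
    "price": [80.0, 120.0, 150.0, 210.0, 300.0],
    "accommodates": [1, 2, 3, 4, 6],
    "minimum_nights": [7, 5, 3, 2, 1],
    "host_id": [5, 1, 4, 2, 3],
})

# Correlation of every numeric column with price, excluding price itself.
corr_with_price = demo.corr(numeric_only=True)["price"].drop("price")

# Order by strength (absolute value) but keep the signed values.
order = corr_with_price.abs().sort_values(ascending=False).index
ranked = corr_with_price.reindex(order)
print(ranked)
```

In this toy frame `minimum_nights` ranks second despite being negative, which a plain descending sort would place last.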

## 4. Define Features and Target Variable

**Business framing:**  
To build a regression model, you need to define what you're predicting (target) and what you're using to make that prediction (features).

### Do the following:
- Set `price` as your target variable
- Remove `price` from your predictors

### In Your Response:
1. What features are you using?
2. Why is this a regression problem and not a classification problem?


In [36]:
# Create a copy to work with, to avoid unintended modifications to the original 'df'
data_for_model = df.copy()

# Drop 'neighbourhood_group_cleansed' as it contains all NaN values (0 non-null out of 459)
# This was identified from df.info()
if 'neighbourhood_group_cleansed' in data_for_model.columns:
    data_for_model = data_for_model.drop(columns=['neighbourhood_group_cleansed'])

# Define the target variable 'y'
# Drop rows where 'price' is NaN first, as we cannot predict for missing prices
data_for_model = data_for_model.dropna(subset=['price'])
y = data_for_model['price']

# Drop the target variable from the features DataFrame 'X'
X = data_for_model.drop(columns=['price'], errors='ignore')

# Select only truly numeric columns for X
X = X.select_dtypes(include=['number'])

# Now, drop any remaining rows with NaN values in the features (X)
# This ensures that X and y are fully aligned and contain no NaNs
# Alternatively, imputation could be used for NaNs in X
combined_clean = pd.concat([X, y], axis=1).dropna()

# Re-separate X and y from the cleaned and aligned DataFrame
X = combined_clean.drop(columns=['price'])
y = combined_clean['price']

# Verify that X is not empty after dropping NaNs
# (Changing to print a warning instead of raising an error to allow notebook to proceed)
if X.empty:
    print("Warning: Features DataFrame X became empty after dropping NaNs. Check data for extensive missing values.")
if y.empty:
    print("Warning: Target Series y became empty after dropping NaNs. Check data for extensive missing values.")

print(f"Shape of X after cleaning: {X.shape}")
print(f"Shape of y after cleaning: {y.shape}")

Shape of X after cleaning: (352, 43)
Shape of y after cleaning: (352,)


### ✍️ Your Response: 🔧
1. The features I am using are the 43 numeric columns that remain after cleaning, including accommodates, bedrooms, and beds.

2. This is a regression problem because we are predicting a numerical target variable.
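
The cleaning cell drops every row that has a missing feature value, which leaves 352 rows, and its comment notes that imputation is an alternative. A minimal sketch of that alternative, assuming median imputation with scikit-learn's `SimpleImputer`, is shown below; the `X_demo` frame is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical feature frame with one missing value per column.
X_demo = pd.DataFrame({
    "bedrooms": [1.0, np.nan, 3.0, 2.0],
    "review_scores_rating": [4.8, 4.6, np.nan, 5.0],
})

# Replace each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
X_filled = pd.DataFrame(
    imputer.fit_transform(X_demo),
    columns=X_demo.columns,
    index=X_demo.index,
)
print(X_filled)
```

In a real workflow the imputer would usually be fit on the training split only and then applied to the test split, so that test data does not influence the fill values.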

## 5. Split Data into Training and Testing Sets

### Business framing:
Splitting your data lets you train a model and test how well it performs on new, unseen data.

### Do the following:
- Use `train_test_split()` to split into 80% training, 20% testing



In [37]:
# Add code here 🔧
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
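
A quick sanity check after splitting is to confirm the sizes of the two sets. The sketch below uses placeholder arrays with 352 rows (the row count reported by the cleaning step); `X_demo` and `y_demo` are illustrative, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the cleaned dataset.
X_demo = np.arange(352 * 2).reshape(352, 2)
y_demo = np.arange(352)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Re-running with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_te, X_te2)
```

With 352 rows and `test_size=0.2`, scikit-learn rounds the test set up to 71 rows and leaves 281 for training.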

## 6. Fit a Linear Regression Model

### Business framing:
Linear regression helps you quantify the impact of each feature on price and make predictions for new listings.

### Do the following:
- Fit a linear regression model to your training data
- Use it to predict prices for the test set



In [38]:
# Add code here 🔧
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

## 7. Evaluate Model Performance

### Business framing:  
A good model should make accurate predictions. We'll use Mean Squared Error (MSE) and R² to evaluate how close our predictions were to the actual prices.

### Do the following:
- Print MSE and R² score for your model

### In Your Response:
1. What is your R² score? How well does your model explain price variation?
2. Is your MSE large or small? What could you do to improve it?


In [44]:
# Add code here 🔧
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")
print(f"R² Score: {r2}")

Mean Squared Error (MSE): 21692.720282963906
R² Score: 0.34029534225936764


### ✍️ Your Response: 🔧
1. The R² score is 0.34, so the model explains only about 34% of the variation in price, which is not very well.

2. The MSE is very large (about 21,693). To improve it, I could try cross-validation, hyperparameter tuning, or feature engineering.
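
MSE is in squared price units, so it is hard to judge on its own; taking the square root (RMSE) puts the error back on the price scale. The sketch below converts the reported MSE and also shows, on synthetic data, how 5-fold cross-validation could give a less split-dependent R² estimate. `X_demo` and `y_demo` are generated for illustration and are not the Airbnb data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# RMSE from the MSE printed by the evaluation cell.
mse_reported = 21692.720282963906
rmse_reported = np.sqrt(mse_reported)
print(f"RMSE: {rmse_reported:.2f}")

# Synthetic regression problem (assumption: stands in for X and y).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 50 + X_demo @ np.array([30.0, 10.0, 0.0]) + rng.normal(scale=5.0, size=100)

# 5-fold cross-validated R²: one score per held-out fold.
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")
print(f"Mean CV R²: {scores.mean():.3f}")
```

An RMSE of roughly 147 means a typical prediction misses the listed price by about that amount, which is easier to explain to a host than the raw MSE.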

## 8. Interpret Model Coefficients

### Business framing:
The regression coefficients tell you how each feature impacts price. This can help Airbnb guide hosts and partners.

### Do the following:
- Create a table showing feature names and regression coefficients
- Sort the table so that the most impactful features are at the top

### In Your Response:
1. Which features increased price the most?
2. Were any surprisingly negative?
3. What business insight could you draw from this?


In [57]:
# Add code here 🔧
feature_names = X_train.columns
coefficients = model.coef_
coefficients_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})
coefficients_df['Abs_Coefficient'] = coefficients_df['Coefficient'].abs()
coefficients_df = coefficients_df.sort_values(by='Abs_Coefficient', ascending=False)
display(coefficients_df)

Unnamed: 0,Feature,Coefficient,Abs_Coefficient
6,longitude,-548.7228,548.7228
5,latitude,-323.6651,323.6651
30,review_scores_rating,41.58481,41.58481
1,host_response_rate,-23.06157,23.06157
7,accommodates,20.10201,20.10201
17,minimum_nights_avg_ntm,19.5474,19.5474
14,maximum_minimum_nights,-19.05383,19.05383
33,review_scores_checkin,-17.87044,17.87044
34,review_scores_communication,-16.725,16.725
2,host_acceptance_rate,14.66153,14.66153


### ✍️ Your Response: 🔧
1. The features review_scores_rating, accommodates, and minimum_nights_avg_ntm increased price the most. Longitude and latitude had the largest coefficients overall, but both were negative.

2. The coefficients for bedrooms and bathrooms were surprisingly negative. I would have expected customers to care about those features.

3. The business insight is to focus on review ratings, how many people the listing can accommodate, and how long people wish to stay.
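
Raw coefficients are measured per unit of each feature, so features with a tiny range can look dominant. Longitude, for example, has a standard deviation of only about 0.019 degrees in the `describe()` output, so a coefficient of about -549 per degree corresponds to roughly -10 per standard deviation. A sketch of one common remedy, standardizing features before comparing coefficients, is shown below on synthetic data; the frame, the coefficients used to generate `y_demo`, and the variable names are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200

# Synthetic features: longitude has a tiny spread, accommodates a wide one.
X_demo = pd.DataFrame({
    "longitude": rng.normal(-73.78, 0.02, n),
    "accommodates": rng.integers(1, 9, n).astype(float),
})

# Synthetic price: accommodates matters a lot, longitude only a little.
y_demo = (
    40 * X_demo["accommodates"]
    - 200 * (X_demo["longitude"] + 73.78)
    + rng.normal(0, 5, n)
)

# Fit once on raw features and once on standardized features.
raw_model = LinearRegression().fit(X_demo, y_demo)
X_scaled = StandardScaler().fit_transform(X_demo)
std_model = LinearRegression().fit(X_scaled, y_demo)

comparison = pd.DataFrame(
    {"raw_coef": raw_model.coef_, "std_coef": std_model.coef_},
    index=X_demo.columns,
)
print(comparison)
```

In this toy example longitude has the larger raw coefficient, but accommodates has the larger standardized coefficient, so the ranking of "most impactful" features flips once scale is accounted for.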


## 9. Try to Improve the Linear Regression Model

### Business framing:
The first version of your model included all available features, but not all features are equally useful. Removing weak or noisy predictors can often improve performance and interpretation.

### Do the following:
1. Choose your top 3–5 features with the strongest absolute coefficients
2. Rebuild the regression model using just those features
3. Compare MSE and R² between the baseline and refined model

### In Your Response:
1. What features did you keep in the refined model, and why?
2. Did model performance improve? Why or why not?
3. Which model would you recommend to stakeholders?
4. How does this relate to the customized learning outcome you created in Canvas?


In [60]:
# Add code here 🔧

# Rebuild the coefficient table from the baseline model (same steps as Section 8)
feature_names = X_train.columns
coefficients = model.coef_
coefficients_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})

# Now, calculate absolute coefficients and select top 5 features
coefficients_df['Abs_Coefficient'] = coefficients_df['Coefficient'].abs()
top_5_features = coefficients_df.sort_values(by='Abs_Coefficient', ascending=False).head(5)['Feature'].tolist()

X_train_refined = X_train[top_5_features]
X_test_refined = X_test[top_5_features]

print(f"Top 5 features selected: {top_5_features}")
print(f"Shape of X_train_refined: {X_train_refined.shape}")
print(f"Shape of X_test_refined: {X_test_refined.shape}")

model_refined = LinearRegression()
model_refined.fit(X_train_refined, y_train)
y_pred_refined = model_refined.predict(X_test_refined)

print("Refined model trained and predictions made.")
mse_refined = mean_squared_error(y_test, y_pred_refined)
r2_refined = r2_score(y_test, y_pred_refined)

print(f"Mean Squared Error (MSE) Refined: {mse_refined}")
print(f"R² Score Refined: {r2_refined}")

Top 5 features selected: ['longitude', 'latitude', 'review_scores_rating', 'host_response_rate', 'accommodates']
Shape of X_train_refined: (281, 5)
Shape of X_test_refined: (71, 5)
Refined model trained and predictions made.
Mean Squared Error (MSE) Refined: 22095.87347388719
R² Score Refined: 0.3280349141357469


### ✍️ Your Response: 🔧
1. The features I kept in the refined model were the top 5: longitude, latitude, review_scores_rating, host_response_rate, and accommodates. I kept them because they had the largest absolute coefficients in the baseline model.

2. No, the model didn't improve: R² fell from 0.34 to 0.33 and MSE rose from about 21,693 to about 22,096. The refined model had fewer variables to explain the outcome.

3. I would recommend the first model because it explains slightly more of the variation in price and is therefore a little more accurate.

4. This relates to my customized learning outcome because it helps me make more informed, data-driven decisions that can be used in operations.
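
When comparing a baseline and a refined model, it can help to compute both sets of metrics with one helper so the comparison uses the same split. The sketch below does this on synthetic data; the `evaluate` helper, the feature names, and the generated values are hypothetical and only illustrate the pattern.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 300

# Synthetic features: two informative columns and one pure-noise column.
X_demo = pd.DataFrame({
    "accommodates": rng.integers(1, 9, n).astype(float),
    "bathrooms": rng.integers(1, 4, n).astype(float),
    "noise_feature": rng.normal(size=n),
})
y_demo = 35 * X_demo["accommodates"] + 20 * X_demo["bathrooms"] + rng.normal(0, 15, n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

def evaluate(features):
    """Fit on the training split with the given columns and score on the test split."""
    model = LinearRegression().fit(X_tr[features], y_tr)
    pred = model.predict(X_te[features])
    return {
        "n_features": len(features),
        "mse": mean_squared_error(y_te, pred),
        "r2": r2_score(y_te, pred),
    }

# One row per model, one column per metric.
results = pd.DataFrame({
    "baseline": evaluate(list(X_demo.columns)),
    "refined": evaluate(["accommodates", "bathrooms"]),
}).T
print(results)
```

A side-by-side table like this makes it easy to show stakeholders how much accuracy, if any, is traded for a simpler model.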


## 10. Reflect and Recommend

### Business framing:  
Ultimately, the value of your model comes from how well it can guide business decisions. Use your results to make real-world recommendations.

### In Your Response:
1. What business question did your model help answer?
2. What would you recommend to Airbnb or its hosts?
3. What could you do next to improve this model or make it more useful?
4. How does this relate to the customized learning outcome you created in Canvas?


### ✍️ Your Response: 🔧
1. The model helped answer the business question of which factors contribute to the price of an Airbnb listing.

2. I would recommend that Airbnb and its hosts be responsive, understand how many guests would like to stay, do everything in their power to keep their ratings positive, and understand how their location affects price.

3. We could do more feature engineering or cross-validation to lower the MSE.

4. This relates to my customized learning outcome by using a machine learning model to make more informed operations decisions.
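
One concrete form of feature engineering would be to bring categorical columns into the model, since the feature matrix was limited to numeric columns. A minimal sketch using `pd.get_dummies` is shown below; the `room_type` values and the mini-frame are illustrative, and whether this helps on the real data is untested here.

```python
import pandas as pd

# Hypothetical listings with one categorical and one numeric column.
demo = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Shared room"],
    "accommodates": [4, 2, 6, 1],
})

# One-hot encode room_type; drop_first removes one category so the
# dummy columns are not perfectly collinear with the intercept.
encoded = pd.get_dummies(demo, columns=["room_type"], drop_first=True, dtype=int)
print(encoded)
```

Each remaining dummy coefficient would then be read as the price difference relative to the dropped category (here, the first category alphabetically).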

## Submission Instructions
✅ Checklist:
- All code cells run without error
- All markdown responses are complete
- Submit on Canvas as instructed

In [42]:
!jupyter nbconvert --to html "assignment_11_LastnameFirstname.ipynb"
