# IS 4487 Assignment 11: Predicting Airbnb Prices with Regression

In this assignment, you will:
- Load the Airbnb dataset you cleaned and transformed in Assignment 7
- Build a linear regression model to predict listing price
- Interpret which features most affect price
- Try to improve your model using only the most impactful predictors
- Practice explaining your findings to a business audience like a host, pricing strategist, or city partner

## Why This Matters

Pricing is one of the most important levers for hosts and Airbnb’s business teams. Understanding what drives price — and being able to predict it accurately — helps improve search results, revenue management, and guest satisfaction.

This assignment gives you hands-on practice turning a cleaned dataset into a predictive model. You’ll focus not just on code, but on what the results mean and how you’d communicate them to stakeholders.

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Assignments/assignment_11_regression.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>



## Original Source: Dataset Description

The dataset you'll be using is a **detailed Airbnb listing file**, available from [Inside Airbnb](https://insideairbnb.com/get-the-data/).

Each row represents one property listing. The columns include:

- **Host attributes** (e.g., host ID, host name, host response time)
- **Listing details** (e.g., price, room type, minimum nights, availability)
- **Location data** (e.g., neighborhood, latitude/longitude)
- **Property characteristics** (e.g., number of bedrooms, amenities, accommodates)
- **Calendar/booking variables** (e.g., last review date, number of reviews)

The schema is consistent across cities, so you can expect similar columns regardless of the location you choose.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score


## 1. Load Your Transformed Airbnb Dataset

**Business framing:**  
Before building any models, we must start with clean, prepared data. In Assignment 7, you exported a cleaned version of your Airbnb dataset. You’ll now import that file for analysis.

### Do the following:
- Import your CSV file called `cleaned_airbnb_data_7.csv`. (Note: If you had significant errors with Assignment 7, you can use the file named `airbnb_listings.csv` in the DataSets folder on GitHub as a backup starting point.)
- Use `pandas` to load and preview the dataset

### In Your Response:
1. What does the dataset include?
2. How many rows and columns are present?
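
A minimal sketch of the load-and-preview step. The helper name `load_listings` and the three-row in-memory CSV are hypothetical stand-ins so the snippet runs on its own; in the notebook you would pass the path `cleaned_airbnb_data_7.csv` instead.

```python
import io

import pandas as pd


def load_listings(source):
    """Read a listings CSV (path or file-like object) and report its size."""
    df = pd.read_csv(source)
    n_rows, n_cols = df.shape
    print(f"The dataset has {n_rows} rows and {n_cols} columns.")
    return df


# Hypothetical stand-in for cleaned_airbnb_data_7.csv (assumption, not real data).
# In the notebook: df = load_listings("cleaned_airbnb_data_7.csv")
demo_csv = io.StringIO(
    "id,room_type,accommodates,price\n"
    "1,Entire home/apt,4,120.0\n"
    "2,Private room,2,55.0\n"
    "3,Entire home/apt,6,210.0\n"
)
df = load_listings(demo_csv)

# Preview the first rows and the column types.
print(df.head())
print(df.dtypes)
```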


The dataset has 459 rows and 84 columns.


Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,host_url,host_name,...,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,log_transformed_review_scores_communication,minmax_scaled_review_scores_value,zscore_scaled_reviews_per_month,review_scores_location_category,price_per_night,high_reviews_per_month_flag,neighborhood_Neighborhood highlights
0,2992450,https://www.airbnb.com/rooms/2992450,20250804133828,2025-08-04,city scrape,Luxury 2 bedroom apartment,The apartment is located in a quiet neighborho...,,https://www.airbnb.com/users/show/4621559,Kenneth,...,0,0,0.07,1.715598,0.6675,-0.986024,Low,2.5,0,False
1,3820211,https://www.airbnb.com/rooms/3820211,20250804133828,2025-08-04,city scrape,Funky Urban Gem: Prime Central Location - Park...,Step into the charming and comfy 1BR/1BA apart...,Overview<br /><br />The lovely apartment is lo...,https://www.airbnb.com/users/show/19648678,Terra,...,0,0,2.32,1.759581,0.9425,0.171292,High,52.0,0,True
2,5651579,https://www.airbnb.com/rooms/5651579,20250804133828,2025-08-04,city scrape,Large studio apt by Capital Center & ESP@,"Spacious studio with hardwood floors, fully eq...",The neighborhood is very eclectic. We have a v...,https://www.airbnb.com/users/show/29288920,Gregg,...,1,0,2.97,1.771557,0.91,0.505628,High,37.5,0,True
3,6623339,https://www.airbnb.com/rooms/6623339,20250804133828,2025-08-04,city scrape,Bright & Cozy City Stay · Top Location + Parking!,Step into the charming and comfy 1BR/1BA apart...,Overview<br /><br />The lovely apartment is lo...,https://www.airbnb.com/users/show/19648678,Terra,...,0,0,2.68,1.740466,0.93,0.356463,High,50.5,0,True
4,9005989,https://www.airbnb.com/rooms/9005989,20250804133828,2025-08-04,city scrape,"Studio in The heart of Center SQ, in Albany NY",(21 years of age or older ONLY) NON- SMOKING.....,"There are many shops, restaurants, bars, museu...",https://www.airbnb.com/users/show/17766924,Sugey,...,0,0,5.67,1.780024,0.9425,1.894407,High,110.0,0,True


The dataset describes Airbnb listings, including price per night, location, host details, and number of reviews. It is a fairly comprehensive dataset with a lot of valuable information. Some variables were MinMax scaled, z-score scaled, or log transformed in Assignment 7.


There are 459 rows and 84 columns in this dataset.

## 2. Drop Columns Not Useful for Modeling

**Business framing:**  
Some columns — like post IDs or text — may not help us predict price and could add noise or bias.

### Do the following:
- Drop columns like `post_id`, `title`, `descr`, `details`, and `address` if they’re still in your dataset

### In Your Response:
1. What columns did you drop, and why?
2. What risks might occur if you included them in your model?
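
One way to drop identifier, URL, and free-text columns is sketched below. The small DataFrame is a hypothetical stand-in, and the column list simply mirrors the kind of columns discussed in this section; `errors="ignore"` lets the same list be reused even when some of the columns are already gone.

```python
import pandas as pd

# Hypothetical stand-in for the loaded listings data.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "listing_url": ["u1", "u2", "u3"],
    "name": ["a", "b", "c"],
    "accommodates": [4, 2, 6],
    "price": [120.0, 55.0, 210.0],
})

# Identifiers, URLs, and free text that a linear model cannot use directly.
cols_to_drop = [
    "id", "listing_url", "scrape_id", "source", "name", "description",
    "neighborhood_overview", "host_name", "host_about",
    "host_thumbnail_url", "host_picture_url",
]

# Record which of the listed columns are actually present before dropping.
dropped = [c for c in cols_to_drop if c in df.columns]
df_model = df.drop(columns=cols_to_drop, errors="ignore")

print(f"Dropped: {dropped}")
print(f"The dataset now has {df_model.shape[0]} rows and {df_model.shape[1]} columns.")
```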


['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name', 'description', 'neighborhood_overview', 'host_url', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude', 'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability', 'availability_30', 'availability_60', 'availability_90', 'availability_365', 'calendar_last_s

The dataset now has 459 rows and 73 columns after dropping columns.


I dropped the following columns from the dataset: `id`, `description`, `listing_url`, `scrape_id`, `neighborhood_overview`, `host_about`, `host_picture_url`, `name`, `source`, `host_name`, `host_thumbnail_url`. I dropped them because they would not help a regression analysis. They are either arbitrary identifiers such as IDs, URLs, or free-text descriptions that a linear model cannot use directly.


Including ID numbers in a regression model introduces the risk of overfitting, because the model can memorize outcomes associated with specific IDs instead of learning patterns that generalize. ID numbers and URLs also carry no predictive signal, so including them only adds noise. For those reasons, it is best practice to remove them before beginning the analysis.

## 3. Explore Relationships Between Numeric Features

**Business framing:**  
Understanding how features relate to each other — and to the target — helps guide feature selection and modeling.

### Do the following:
- Generate a correlation matrix
- Identify which variables are strongly related to `price`

### In Your Response:
1. Which variables had the strongest positive or negative correlation with price?
2. Which variables might be useful predictors?
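
The correlation step can be sketched as follows. The six-row DataFrame is made-up illustration data, not the assignment's dataset; the pattern is to keep only numeric columns, compute the matrix, and then sort the `price` column of that matrix.

```python
import pandas as pd

# Hypothetical illustration data (not the real listings).
df = pd.DataFrame({
    "room_type": ["home", "room", "home", "home", "room", "home"],
    "accommodates": [1, 2, 3, 4, 5, 6],
    "minimum_nights": [3, 1, 2, 1, 3, 2],
    "price": [50, 80, 95, 130, 160, 200],
})

# Pearson correlation is only defined for numeric columns.
numeric_df = df.select_dtypes(include="number")
corr = numeric_df.corr()

# Correlation of every other numeric feature with price, strongest first.
price_corr = corr["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

To see strong negative relationships as well, sort by absolute value or inspect both ends of `price_corr`.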


Unnamed: 0,host_listings_count,host_total_listings_count,latitude,longitude,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,...,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,log_transformed_review_scores_communication,minmax_scaled_review_scores_value,zscore_scaled_reviews_per_month,price_per_night,high_reviews_per_month_flag,neighborhood_Neighborhood highlights
host_listings_count,1.0,0.97396,-0.070382,0.111195,0.037236,-0.050743,0.041395,0.057712,-0.014677,0.034032,...,0.057724,-0.061563,,-0.116031,-0.351926,-0.375941,-0.116031,-0.060832,-0.012745,0.176851
host_total_listings_count,0.97396,1.0,-0.072907,0.099717,0.057426,-0.044631,0.061755,0.086508,-0.00658,0.005437,...,0.048405,-0.062113,,-0.103788,-0.311565,-0.304869,-0.103788,-0.047459,-0.010989,0.160677
latitude,-0.070382,-0.072907,1.0,-0.548236,0.006482,0.067394,0.016919,0.021012,-0.014666,0.146688,...,-0.017696,0.227452,,-0.119777,0.096122,0.066316,-0.119777,-0.043658,-0.034922,-0.098618
longitude,0.111195,0.099717,-0.548236,1.0,-0.083466,-0.175493,-0.215899,-0.138303,-0.118913,-0.12024,...,0.1096,-0.245058,,0.04517,-0.158498,-0.214617,0.04517,-0.021484,0.047118,0.059345
accommodates,0.037236,0.057426,0.006482,-0.083466,1.0,0.520394,0.809299,0.828875,0.579588,-0.117221,...,0.100762,-0.196689,,0.075746,0.018718,0.009148,0.075746,0.46209,-0.025488,0.105587
bathrooms,-0.050743,-0.044631,0.067394,-0.175493,0.520394,1.0,0.534375,0.462908,0.46803,-0.065774,...,-0.058583,0.129067,,-0.000851,0.010007,0.098843,-0.000851,0.395159,-0.024227,0.004316
bedrooms,0.041395,0.061755,0.016919,-0.215899,0.809299,0.534375,1.0,0.784348,0.499286,-0.073682,...,0.018162,-0.094575,,-0.049262,-0.004458,0.01678,-0.049262,0.325599,-0.036005,0.050272
beds,0.057712,0.086508,0.021012,-0.138303,0.828875,0.462908,0.784348,1.0,0.547032,-0.132096,...,-0.05755,-0.140551,,0.07805,0.001999,0.000307,0.07805,0.394146,0.040904,0.10846
price,-0.014677,-0.00658,-0.014666,-0.118913,0.579588,0.46803,0.499286,0.547032,1.0,-0.075122,...,0.033206,-0.024513,,-0.090843,-0.1313,0.018269,-0.090843,0.83396,-0.019662,0.071828
minimum_nights,0.034032,0.005437,0.146688,-0.12024,-0.117221,-0.065774,-0.073682,-0.132096,-0.075122,1.0,...,-0.121811,0.073141,,-0.265131,-0.003105,-0.071429,-0.265131,-0.210288,-0.019055,-0.120949


| feature | price |
|---|---|
| price | 1.0 |
| price_per_night | 0.83396 |
| accommodates | 0.579588 |
| beds | 0.547032 |
| bedrooms | 0.499286 |
| bathrooms | 0.46803 |
| estimated_revenue_l365d | 0.249488 |
| maximum_maximum_nights | 0.122872 |
| minimum_maximum_nights | 0.112166 |
| maximum_nights_avg_ntm | 0.111271 |


`price_per_night`, `accommodates`, `beds`, `bedrooms`, and `bathrooms` had the strongest positive correlations with price. The most negatively correlated variables were `review_scores_communication`, `review_scores_checkin`, `reviews_per_month`, and `minimum_nights`.


`price_per_night`, `accommodates`, `beds`, `bedrooms`, and `bathrooms` are likely to be the most useful predictors, since they are the most strongly correlated with price.

## 4. Define Features and Target Variable

**Business framing:**  
To build a regression model, you need to define what you’re predicting (target) and what you’re using to make that prediction (features).

### Do the following:
- Set `price` as your target variable
- Remove `price` from your predictors

### In Your Response:
1. What features are you using?
2. Why is this a regression problem and not a classification problem?
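
A sketch of separating the target from the features, using made-up rows. Two choices here are assumptions rather than part of the assignment: keeping only numeric columns and filling missing values with the column median, since `LinearRegression` cannot handle text or `NaN` directly. If a column was computed from `price` in an earlier step (for example, a per-night price), it may leak the target into the features; this sketch does not decide that question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows with one text column and one missing value.
df = pd.DataFrame({
    "room_type": ["home", "room", "home", "room"],
    "accommodates": [4, 2, 6, 3],
    "bedrooms": [2.0, np.nan, 3.0, 1.0],
    "price": [120.0, 55.0, 210.0, 80.0],
})

# Target: the number we want to predict.
y = df["price"]

# Features: everything except the target, numeric columns only (assumption).
X = df.drop(columns=["price"]).select_dtypes(include="number")

# Fill gaps with each column's median so the model receives no NaN (assumption).
X = X.fillna(X.median())

print(X)
print(y)
```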


I am using every feature left in the dataset after the drops in Section 2 (mostly URLs and IDs), except `price`, which is the target variable. I am especially interested in `price_per_night` and `beds`, because they are strongly correlated with price. I think they will be great predictors.


This is a regression problem rather than a classification problem because the target, price, is a continuous number. Classification would instead predict a category or a label.

## 5. Split Data into Training and Testing Sets

### Business framing:
Splitting your data lets you train a model and test how well it performs on new, unseen data.

### Do the following:
- Use `train_test_split()` to split into 80% training, 20% testing
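
An 80/20 split might look like the sketch below. The ten rows are made-up illustration data, and `random_state=42` is an arbitrary choice that just makes the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 listings, two features, one target.
X = pd.DataFrame({
    "accommodates": [1, 2, 2, 3, 4, 4, 5, 6, 6, 8],
    "bathrooms": [1, 1, 1, 1, 2, 1, 2, 2, 3, 3],
})
y = pd.Series([45, 60, 65, 90, 130, 110, 160, 190, 220, 280], name="price")

# test_size=0.2 holds out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```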



## 6. Fit a Linear Regression Model

### Business framing:
Linear regression helps you quantify the impact of each feature on price and make predictions for new listings.

### Do the following:
- Fit a linear regression model to your training data
- Use it to predict prices for the test set
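
A minimal sketch of the fit-and-predict step. The data is synthetic and built so that price is an exact linear function of two features (an assumption made only so the recovered coefficients are easy to check); real listings will be far noisier.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic features (assumption): 40 listings with repeating patterns.
i = np.arange(40)
X = pd.DataFrame({
    "accommodates": i % 8 + 1,
    "bathrooms": i % 3 + 1,
})
# Price is defined exactly as 10 + 25 * accommodates + 40 * bathrooms.
y = 10 + 25 * X["accommodates"] + 40 * X["bathrooms"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows only, then predict the held-out rows.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Intercept:", model.intercept_)
print("Coefficients:", dict(zip(X.columns, model.coef_)))
```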





## 7. Evaluate Model Performance

### Business framing:  
A good model should make accurate predictions. We’ll use Mean Squared Error (MSE) and R² to evaluate how close our predictions were to the actual prices.

### Do the following:
- Print MSE and R² score for your model

### In Your Response:
1. What is your R² score? How well does your model explain price variation?
2. Is your MSE large or small? What could you do to improve it?
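
The metrics can be illustrated on three hand-picked prices, chosen so the arithmetic is easy to follow by hand. The sketch also shows why R² can go negative: it compares the model against always guessing the mean of the actual values, so a model that does worse than that baseline scores below zero.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hand-picked illustration values (not results from the assignment).
y_test = np.array([100.0, 150.0, 200.0])
y_pred = np.array([110.0, 140.0, 190.0])   # every prediction is off by 10
y_mean = np.full(3, y_test.mean())         # baseline: always guess the mean
y_bad = np.array([200.0, 150.0, 100.0])    # ranks the listings backwards

mse = mean_squared_error(y_test, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # same units as price
r2 = r2_score(y_test, y_pred)

r2_mean = r2_score(y_test, y_mean)         # the baseline scores exactly 0
r2_bad = r2_score(y_test, y_bad)           # worse than the baseline: negative

print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared (R²) Score: {r2}")
print(f"R² of mean baseline: {r2_mean}, R² of bad model: {r2_bad}")
```

Because RMSE is in the same units as `price`, it is often easier to explain to a business audience than MSE.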


Mean Squared Error (MSE): 6611.992369323672
R-squared (R²) Score: -0.17808359532509344


My R-squared score is negative (about -0.18), which means that this model currently predicts test-set prices worse than simply guessing the mean price for every listing. This is not good, but it will be improved.

My MSE is enormous at over 6,600, which corresponds to a root mean squared error of roughly 81 in the units of `price`. I could base my model on just the top predictors to improve this score considerably.

## 8. Interpret Model Coefficients

### Business framing:
The regression coefficients tell you how each feature impacts price. This can help Airbnb guide hosts and partners.

### Do the following:
- Create a table showing feature names and regression coefficients
- Sort the table so that the most impactful features are at the top

### In Your Response:
1. Which features increased price the most?
2. Were any surprisingly negative?
3. What business insight could you draw from this?
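
One way to build and sort the coefficient table is sketched below, on synthetic data where price is an exact linear function of three features (an assumption made for checkability). Note that raw coefficients are per one-unit change in each feature, so their sizes depend on each feature's scale; a feature that varies over a tiny range, such as latitude within one city, can carry a very large coefficient without being the most important driver.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): price = 10 + 25*acc + 40*bath - 5*min_nights.
i = np.arange(60)
X = pd.DataFrame({
    "accommodates": i % 8 + 1,
    "bathrooms": i % 3 + 1,
    "minimum_nights": i % 5 + 1,
})
y = 10 + 25 * X["accommodates"] + 40 * X["bathrooms"] - 5 * X["minimum_nights"]

model = LinearRegression().fit(X, y)

# Pair each feature name with its fitted coefficient.
coef_table = pd.DataFrame({
    "feature": X.columns,
    "coefficient": model.coef_,
})

# Sort by absolute size so large negative effects also rise to the top.
coef_table = coef_table.sort_values(
    "coefficient", key=lambda s: s.abs(), ascending=False
).reset_index(drop=True)

print(coef_table)
```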


| index | feature | coefficient |
|---|---|---|
| 40 | log_transformed_review_scores_communication | -1108.126 |
| 3 | longitude | -482.5443 |
| 2 | latitude | -373.9682 |
| 31 | review_scores_communication | 174.2625 |
| 27 | review_scores_rating | 36.26915 |
| 30 | review_scores_checkin | -34.93175 |
| 44 | high_reviews_per_month_flag | 31.54842 |
| 11 | maximum_minimum_nights | -25.4697 |
| 29 | review_scores_cleanliness | -25.28003 |
| 14 | minimum_nights_avg_ntm | 22.90085 |


The features that increased price the most were `review_scores_communication`, `review_scores_rating`, `high_reviews_per_month_flag`, `minimum_nights_avg_ntm`, and `review_scores_value`.


The two features that were surprisingly negative were `longitude` and `latitude`. This does not make sense to me.


From a business standpoint, higher communication scores, and higher review scores across the board, appear to go along with higher prices. A higher review frequency also appears to go with higher prices. Higher-rated properties look to be more expensive, as `review_scores_value` is one of the top features that increase price.

## 9. Try to Improve the Linear Regression Model

### Business framing:
The first version of your model included all available features — but not all features are equally useful. Removing weak or noisy predictors can often improve performance and interpretation.

### Do the following:
1. Choose your top 3–5 features with the strongest absolute coefficients
2. Rebuild the regression model using just those features
3. Compare MSE and R² between the baseline and refined model

### In Your Response:
1. What features did you keep in the refined model, and why?
2. Did model performance improve? Why or why not?
3. Which model would you recommend to stakeholders?
4. How does this relate to the customized learning outcome you created in Canvas?
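
The refinement steps above can be sketched as follows. Everything here is synthetic: the five columns are standardized random numbers, three of which drive price and two of which are pure noise (the names `noise_a` and `noise_b` and the helper `fit_and_score` are hypothetical). Because all columns share one scale, ranking by absolute coefficient is meaningful in this sketch; with unscaled real features that ranking deserves more caution.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic, standardized features (assumption): three signals, two noise columns.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame(
    rng.normal(size=(n, 5)),
    columns=["accommodates", "bathrooms", "minimum_nights", "noise_a", "noise_b"],
)
y = (100 + 30 * X["accommodates"] + 20 * X["bathrooms"]
     - 15 * X["minimum_nights"] + rng.normal(scale=5, size=n))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def fit_and_score(cols):
    """Fit on the training rows using `cols`; return model, test MSE, test R²."""
    model = LinearRegression().fit(X_train[cols], y_train)
    pred = model.predict(X_test[cols])
    return model, mean_squared_error(y_test, pred), r2_score(y_test, pred)


# Baseline: all features.
baseline, mse_base, r2_base = fit_and_score(list(X.columns))

# Keep the 3 features with the largest absolute baseline coefficients.
ranked = pd.Series(baseline.coef_, index=X.columns).abs().sort_values(ascending=False)
top_features = list(ranked.index[:3])

# Refined: top features only.
refined, mse_refined, r2_refined = fit_and_score(top_features)

print(f"Baseline MSE: {mse_base:.2f}, R²: {r2_base:.3f}")
print(f"Refined  MSE: {mse_refined:.2f}, R²: {r2_refined:.3f}")
print("Kept features:", top_features)
```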


In [28]:
mse_refined = mean_squared_error(y_test, y_pred_refined)
r2_refined = r2_score(y_test, y_pred_refined)

print(f"Refined Model Mean Squared Error (MSE): {mse_refined}")
print(f"Refined Model R-squared (R²) Score: {r2_refined}")

Refined Model Mean Squared Error (MSE): 5479.9873083045395
Refined Model R-squared (R²) Score: 0.023610011945060183


I kept the top 5 features with the strongest absolute coefficients. I kept these features because they were the most predictive of price, or at least more predictive than my last set of features.


Model performance did improve, but only marginally. The R-squared score rose from about -0.18 to about 0.02, so the refined model is now slightly better than predicting the mean price. The MSE fell by roughly 1,100, from about 6,612 to about 5,480. It is still high, but it came down a fair amount, which tells me that the model performance improved.


I would recommend the second model to stakeholders because it is more predictive than the first model.


In my customized learning outcome, I mentioned wanting to be able to analyze hospitality booking data and gain value from it. This assignment certainly relates to that strongly.


## 10. Reflect and Recommend

### Business framing:  
Ultimately, the value of your model comes from how well it can guide business decisions. Use your results to make real-world recommendations.

### In Your Response:
1. What business question did your model help answer?
2. What would you recommend to Airbnb or its hosts?
3. What could you do next to improve this model or make it more useful?
4. How does this relate to the customized learning outcome you created in Canvas?


This model helped answer the business question of what predicts the price of Airbnb properties.

I would recommend that Airbnb and its hosts take a good look at communication scores and overall ratings to get a good idea of higher-priced, higher-quality properties. These appear to be strongly related in my model. I would also let Airbnb know that higher review frequency generally goes with higher prices.


To make this model more useful, I would refine the feature set even further and remove more features that do not help predict price. I thought I had done a good job of this at first, but the results suggest otherwise. I think that shrinking the number of features would help the model. Next time I would probably pick just the top ten features to work with.


I mentioned wanting to be able to analyze large datasets and make recommendations to companies in my customized learning outcome I created in Canvas. This assignment has certainly helped me do that. This was a pretty interesting assignment because it gave me a good idea of how challenging a good regression model can be to create, and how important it is to pick great features to use in the model.

## Submission Instructions
✅ Checklist:
- All code cells run without error
- All markdown responses are complete
- Submit on Canvas as instructed

In [29]:
!jupyter nbconvert --to html "assignment_11_regression.ipynb"

[NbConvertApp] Converting notebook assignment_11_regression.ipynb to html
[NbConvertApp] Writing 365286 bytes to assignment_11_regression.html
