# Hackathon: From Raw Data to ML-Ready Dataset
## Insight-Driven EDA and End-to-End Feature Engineering on Airbnb Data Using pandas and Plotly

### What is a Hackathon?

A hackathon is a fast-paced, collaborative event where participants use data and technology to solve a real problem end-to-end.  
In this hackathon, you will work with a **real-world Airbnb dataset** and complete two interconnected goals:

- Produce a **high-quality exploratory data analysis (EDA)** using `pandas` and `plotly`, extracting meaningful insights, trends, and signals from the data.  
- Design and deliver a **clean, feature-rich, ML-ready dataset** that will serve as the foundation for a follow-up hackathon focused on building and evaluating machine learning models.

Your task is to **get the most out of the data**: uncover structure and patterns through EDA, and engineer informative features (numerical, categorical, temporal, textual (TF–IDF), and optionally image-based) to maximize the predictive power of the final dataset.

<div class="alert alert-success">
<b>About the Dataset</b>

<u>Context</u>

The data comes from <a href="https://insideairbnb.com/get-the-data/">Inside Airbnb</a>, an open project that publishes detailed, regularly updated datasets for cities around the world.  
Each city provides three main CSV files:

- <b>listings.csv</b> — property characteristics, host profiles, descriptions, amenities, etc.  
- <b>calendar.csv</b> — daily availability and pricing information for each listing.  
- <b>reviews.csv</b> — guest feedback and textual reviews.

These datasets offer a rich view of the short-term rental market, including availability patterns, pricing behavior, host attributes, and guest sentiment.  

<u>Inspiration</u>

Your ultimate objective is to create a dataset suitable for training a machine learning model that predicts whether a specific Airbnb listing will be <b>available on a given date</b>, using property attributes, review information, and host characteristics.
</div>

<div class="alert alert-info">
<b>Task</b>

Using one city of your choice from Inside Airbnb, create an end-to-end pipeline that:

1. Loads and explores the raw data (EDA).  
2. Engineers features (numerical, categorical, temporal, textual TF–IDF, etc.).  
3. Builds a unified ML-ready dataset.  

Please remember to add comments explaining your decisions. Comments help us understand your thought process and ensure accurate evaluation of your work. This assignment requires code-based solutions—**manually calculated or hard-coded results will not be accepted**. Thoughtful comments and visualizations are encouraged and will be highly valued.

- Write your solution directly in this notebook, modifying it as needed.
- Once completed, submit the notebook in **.ipynb** format via Moodle.
    
<b>Collaboration Requirement: Git & GitHub</b>

You must collaborate with your team using a **shared GitHub repository**.  
Your use of Git is part of the evaluation. We will specifically look at:

- Commit quality (clear messages, meaningful steps).  
- Balanced participation across team members.  
- Use of branches.  
- Ability to resolve merge conflicts appropriately.  
- A clean, readable project history that reflects real collaboration.

Good Git practice is **part of your grade**, not optional.
</div>
<div class="alert alert-danger">
    You are free to add as many cells as you wish as long as you leave untouched the first one.
</div>

<div class="alert alert-warning">

<b>Hints</b>

- Text columns often carry substantial predictive power, use text-vectorization methods to extract meaningful features.  
- Make sure all columns use appropriate data types (categorical, numeric, datetime, boolean). Correct dtypes help prevent subtle bugs and improve performance.  
- Feel free to enrich the dataset with any additional information you consider useful: engineered features, external data, derived temporal features, etc.  
- If the dataset is too large for your computer, use <code>.sample()</code> to work with a subset while preserving the logic of your pipeline.  
- Plotly offers a wide variety of powerful visualizations, experiment creatively, but always begin with a clear analytical question: *What insight am I trying to uncover with this plot?*

</div>




<div class="alert alert-danger">
<b>Submission Deadline:</b> Wednesday, December 3rd, 12:00

Start with a simple, working pipeline.  
Do not over-complicate your code too much. Start with a simple working solution and refine it if you have time.
</div>

<div class="alert alert-danger">
    
You may add as many cells as you want, but the **first cell must remain exactly as provided**. Do not edit, move, or delete it under any circumstances.
</div>


In [50]:
# LEAVE BLANK

### Team Information

Fill in the information below.  
All fields are **mandatory**.

- **GitHub Repository URL**: Paste the link to the team repo you will use for collaboration.
- **Team Members**: List all student names (and emails or IDs if required).

Do not modify the section title.  
Do not remove this cell.


In [51]:
# === Team Information (Mandatory) ===
# Fill in the fields below.

GITHUB_REPO = ""       # e.g. "https://github.com/myteam/airbnb-hackathon"
TEAM_MEMBERS = [
    # "Full Name 1",
    # "Full Name 2",
    # "Full Name 3",
]

GITHUB_REPO, TEAM_MEMBERS


('', [])

### 1. Setup and Imports

In this section we import the main libraries that we will use throughout the notebook:
- **pandas / numpy** for data loading and manipulation.
- **plotly.express** for interactive visualizations.
- **scikit-learn (TF–IDF + SVD)** for turning text into numeric features.


In [52]:
# Core libraries for data handling and analysis
import pandas as pd
import numpy as np

# Plotly for interactive visualizations
import plotly.express as px

# Text -> numeric features (TF–IDF) + dimensionality reduction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Basic ML utilities for splitting data and evaluating models
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

# Make DataFrame outputs easier to read
pd.set_option("display.max_columns", 60)
pd.set_option("display.width", 140)


### 2. Load Antwerp Airbnb data (listings, calendar, reviews)

We now load the three raw CSV files for **Antwerp** that are already in this folder:
- `listings.csv.gz`
- `calendar.csv.gz`
- `reviews.csv.gz`

We will keep this step simple and just read them with sensible dtypes and parse dates where needed.

In [53]:
# Paths to the Antwerp CSV files (in the same folder as this notebook)
LISTINGS_PATH = "listings.csv.gz"
CALENDAR_PATH = "calendar.csv.gz"
REVIEWS_PATH = "reviews.csv.gz"

# Read the data
listings_raw = pd.read_csv(LISTINGS_PATH, low_memory=False)
calendar_raw = pd.read_csv(CALENDAR_PATH, parse_dates=["date"], low_memory=False)
reviews_raw = pd.read_csv(REVIEWS_PATH, parse_dates=["date"], low_memory=False)

listings_raw.head(), calendar_raw.head(), reviews_raw.head()


(       id                          listing_url       scrape_id last_scraped           source  \
 0   50904   https://www.airbnb.com/rooms/50904  20250625193541   2025-06-26      city scrape   
 1  345959  https://www.airbnb.com/rooms/345959  20250625193541   2025-06-25      city scrape   
 2  366252  https://www.airbnb.com/rooms/366252  20250625193541   2025-06-25      city scrape   
 3  522693  https://www.airbnb.com/rooms/522693  20250625193541   2025-06-26  previous scrape   
 4  603545  https://www.airbnb.com/rooms/603545  20250625193541   2025-06-26  previous scrape   
 
                                                 name                                        description  \
 0                    A Place Boutique - Deluxe suite  Decorated in a vintage style combined with a f...   
 1                     Marleen's home in Antwerp city  your entire, private groundfloor 2-bedroom apa...   
 2                ROOM IN FAMILY HOME near C. Station  In the Antwerp district of Borgerhout

### 3. Quick data overview

Here we do a **very light EDA** just to understand the basic shape of each table:
- number of rows / columns
- a few example rows
- key columns we will use later (`id`, `listing_id`, `date`, `available`, `comments`, etc.).


In [54]:
# Basic shapes
print("Listings:", listings_raw.shape)
print("Calendar:", calendar_raw.shape)
print("Reviews:", reviews_raw.shape)

# Peek at important columns
print("\nListings columns (sample):")
print(list(listings_raw.columns)[:20])

print("\nCalendar columns:")
print(calendar_raw.columns.tolist())

print("\nReviews columns:")
print(reviews_raw.columns.tolist())

# Show a few rows of each
display(listings_raw.head(3))
display(calendar_raw.head(3))
display(reviews_raw.head(3))


Listings: (2654, 79)
Calendar: (968710, 7)
Reviews: (122622, 6)

Listings columns (sample):
['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name', 'description', 'neighborhood_overview', 'picture_url', 'host_id', 'host_url', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url']

Calendar columns:
['listing_id', 'date', 'available', 'price', 'adjusted_price', 'minimum_nights', 'maximum_nights']

Reviews columns:
['listing_id', 'id', 'date', 'reviewer_id', 'reviewer_name', 'comments']


Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,...,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,availability_eoy,number_of_reviews_ly,estimated_occupancy_l365d,estimated_revenue_l365d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,50904,https://www.airbnb.com/rooms/50904,20250625193541,2025-06-26,city scrape,A Place Boutique - Deluxe suite,Decorated in a vintage style combined with a f...,,https://a0.muscache.com/pictures/f14b0908-cbc3...,234077,https://www.airbnb.com/users/show/234077,Calvin,2010-09-14,"Antwerp, Belgium",Ever since my childhood I dreamt of having my ...,within a day,90%,82%,f,https://a0.muscache.com/im/pictures/user/User/...,https://a0.muscache.com/im/pictures/user/User/...,,10.0,10.0,"['email', 'phone']",t,t,,Historisch Centrum,,...,,t,18,36,59,331,2025-06-26,3,0,0,155,0,0,0.0,2015-05-06,2022-05-15,5.0,5.0,5.0,5.0,5.0,5.0,5.0,,t,9,5,2,0,0.02
1,345959,https://www.airbnb.com/rooms/345959,20250625193541,2025-06-25,city scrape,Marleen's home in Antwerp city,"your entire, private groundfloor 2-bedroom apa...","nice, quiet residential neighborhood",https://a0.muscache.com/pictures/11642662/f9b6...,1754396,https://www.airbnb.com/users/show/1754396,Marleen,2012-02-15,"Antwerp, Belgium","I'm a reader, gardener and I like meeting peop...",within an hour,100%,95%,t,https://a0.muscache.com/im/users/1754396/profi...,https://a0.muscache.com/im/users/1754396/profi...,,1.0,3.0,"['email', 'phone']",t,t,"Antwerp, Flemish Region, Belgium",Markgrave,,...,,t,3,6,8,251,2025-06-25,130,20,1,76,20,120,7560.0,2012-05-12,2025-06-19,4.82,4.86,4.88,4.91,4.87,4.59,4.81,,f,1,1,0,0,0.81
2,366252,https://www.airbnb.com/rooms/366252,20250625193541,2025-06-25,city scrape,ROOM IN FAMILY HOME near C. Station,"In the Antwerp district of Borgerhout, we live...",we live on the 5th floor on top of a bed store...,https://a0.muscache.com/pictures/airflow/Hosti...,1820186,https://www.airbnb.com/users/show/1820186,Godelieve,2012-02-27,"Antwerp, Belgium","We, Lieve (52 y.) Fronk (49 y.) love talking, ...",within an hour,100%,83%,t,https://a0.muscache.com/im/pictures/user/cf99a...,https://a0.muscache.com/im/pictures/user/cf99a...,,3.0,4.0,"['email', 'phone']",t,t,"Antwerp, Flemish Region, Belgium",Borgerhout Intra Muros Noord,,...,,t,0,0,0,125,2025-06-25,162,1,0,94,2,6,204.0,2012-03-29,2024-09-26,4.64,4.68,4.39,4.86,4.89,4.41,4.64,,t,2,0,2,0,1.0


Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,50904,2025-06-26,t,,,1,1000
1,50904,2025-06-27,f,,,1,1000
2,50904,2025-06-28,f,,,1,1000


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,50904,31511792,2015-05-06,19482395,Jihae,Karin’s “Aplace” is absolutely beautiful and c...
1,50904,470101024356869935,2021-10-10,333559,Emilie,"Karin is a wonderful host, she was really help..."
2,50904,627287279025726941,2022-05-15,32701854,Marie-Lou,The location is super super nice! Karin was al...


### 4. Simple EDA on prices and availability

We start with a few **basic visualizations** to understand price levels and availability patterns in Antwerp.
We:
- convert the `price` column to numeric,
- plot the price distribution,
- look at how availability changes over time.


In [55]:
# Helper: turn price strings like "$120.00" into numeric values
calendar = calendar_raw.copy()

# Parse both `price` and `adjusted_price` to numeric
for col in ["price", "adjusted_price"]:
    if col in calendar.columns:
        calendar[col] = (calendar[col]
                           .astype(str)
                           .str.replace("$", "", regex=False)
                           .str.replace(",", "", regex=False)
                           .replace("nan", np.nan)
                           .astype(float))

# If `price` is missing but `adjusted_price` is present, fall back to adjusted_price
if "price" in calendar.columns and "adjusted_price" in calendar.columns:
    missing_price = calendar["price"].isna() & calendar["adjusted_price"].notna()
    calendar.loc[missing_price, "price"] = calendar.loc[missing_price, "adjusted_price"]

# Basic price distribution (exclude missing / zero)
price_nonzero = calendar["price"].dropna()
price_nonzero = price_nonzero[price_nonzero > 0]

fig_price = px.histogram(price_nonzero, nbins=60, title="Antwerp nightly price distribution")
fig_price.show()

# Share of available vs not-available days over time
availability_by_date = (calendar
    .assign(is_available=lambda df: df["available"] == "t")
    .groupby("date")["is_available"]
    .mean()
    .reset_index(name="share_available"))

fig_avail = px.line(availability_by_date, x="date", y="share_available",
                    title="Share of listings available over time")
fig_avail.show()



Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`


Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`



### 5. Feature engineering overview

To build an ML-ready dataset for **predicting availability of a listing on a given date**, we will:
- **Start from `calendar`** (one row per listing per date).
- Join **listing-level features** from `listings` (e.g. room type, neighbourhood, capacity).
- Add **review-based features** from `reviews` (counts and simple text signal).
- Convert the text **`comments`** column into numeric TF–IDF features, then reduce dimensionality with SVD.

We keep the feature set small and interpretable, focusing on a clean end-to-end pipeline.


In [56]:
# 5.1. Prepare a thin listings table with the most useful columns

# Select a subset of interpretable listing features
listing_cols = [
    "id",                 # listing_id key
    "neighbourhood_cleansed",
    "room_type",
    "property_type",
    "accommodates",
    "bedrooms",
    "bathrooms_text",
    "number_of_reviews",
    "review_scores_rating",
]

listings = listings_raw[listing_cols].copy()
listings = listings.rename(columns={"id": "listing_id"})

# Basic cleaning: fill a few obvious missing values
listings["neighbourhood_cleansed"] = listings["neighbourhood_cleansed"].fillna("Unknown")
listings["room_type"] = listings["room_type"].fillna("Unknown")
listings["property_type"] = listings["property_type"].fillna("Unknown")

# Convert bathrooms_text into a numeric "bathrooms" feature where possible
listings["bathrooms"] = (listings["bathrooms_text"]
    .astype(str)
    .str.extract(r"(\d+\.?\d*)", expand=False)
    .astype(float))

listings.head()


Unnamed: 0,listing_id,neighbourhood_cleansed,room_type,property_type,accommodates,bedrooms,bathrooms_text,number_of_reviews,review_scores_rating,bathrooms
0,50904,Historisch Centrum,Hotel room,Room in boutique hotel,2,1.0,1 private bath,3,5.0,1.0
1,345959,Markgrave,Entire home/apt,Entire condo,3,2.0,1 bath,130,4.82,1.0
2,366252,Borgerhout Intra Muros Noord,Private room,Private room in rental unit,2,1.0,1.5 shared baths,162,4.64,1.5
3,522693,Historisch Centrum,Entire home/apt,Entire rental unit,4,1.0,1 bath,0,,1.0
4,603545,Harmonie,Entire home/apt,Entire loft,1,1.0,1 bath,46,4.85,1.0


In [57]:
# 5.2. Aggregate review information at listing level

# Simple review features per listing
reviews_agg = (reviews_raw
    .groupby("listing_id")
    .agg(
        n_reviews=("id", "count"),
        last_review_date=("date", "max"),
    )
    .reset_index())

# Days since last review (relative to max date in data)
max_review_date = reviews_agg["last_review_date"].max()
reviews_agg["days_since_last_review"] = (
    (max_review_date - reviews_agg["last_review_date"]).dt.days
)

reviews_agg.head()


Unnamed: 0,listing_id,n_reviews,last_review_date,days_since_last_review
0,50904,3,2022-05-15,1140
1,345959,130,2025-06-19,9
2,366252,162,2024-09-26,275
3,603545,46,2024-11-08,232
4,772842,82,2025-06-01,27


In [58]:
# 5.3. Text features: TF–IDF over review comments (per listing)

# For simplicity we:
# - concatenate all comments for each listing into a single string
# - fit a TF–IDF vectorizer
# - reduce to a small number of components with SVD

# Concatenate comments per listing
reviews_text = (reviews_raw
    .dropna(subset=["comments"])
    .groupby("listing_id")["comments"]
    .apply(lambda s: " ".join(s.astype(str)))
    .reset_index())

# Use a small vocabulary and max_features to keep it light
vectorizer = TfidfVectorizer(max_features=500, stop_words="english")
X_tfidf = vectorizer.fit_transform(reviews_text["comments"])

# Reduce dimensionality so we only keep a few dense numeric text features
svd = TruncatedSVD(n_components=5, random_state=42)
X_svd = svd.fit_transform(X_tfidf)

# Build a DataFrame with the SVD components
text_features = pd.DataFrame(
    X_svd,
    columns=[f"review_text_comp_{i+1}" for i in range(X_svd.shape[1])]
)
text_features["listing_id"] = reviews_text["listing_id"].values

text_features.head()


Unnamed: 0,review_text_comp_1,review_text_comp_2,review_text_comp_3,review_text_comp_4,review_text_comp_5,listing_id
0,0.276287,-0.290923,-0.223017,-0.222606,-0.12201,50904
1,0.871716,-0.13989,0.166389,-0.142667,-0.10658,345959
2,0.740089,-0.316491,-0.222935,-0.092113,-0.140684,366252
3,0.593361,-0.479606,-0.386175,-0.273201,0.069963,603545
4,0.762822,-0.129947,-0.249173,-0.024974,0.01672,772842


In [59]:
# 5.4. Build the base calendar table and join listing + review features

# Start from calendar (this will also define our target y = is_available)
base = calendar.copy()
base["is_available"] = (base["available"] == "t").astype(int)

# Extract simple temporal features from the date
base["year"] = base["date"].dt.year
base["month"] = base["date"].dt.month
base["day"] = base["date"].dt.day
base["dayofweek"] = base["date"].dt.dayofweek  # 0 = Monday

# Join static listing features
base = base.merge(listings.drop(columns=["bathrooms_text"]),
                  on="listing_id", how="left")

# Join aggregated review stats
base = base.merge(reviews_agg[["listing_id", "n_reviews", "days_since_last_review"]],
                  on="listing_id", how="left")

# Join text-based SVD components
base = base.merge(text_features, on="listing_id", how="left")

base.head()


Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights,is_available,year,month,day,dayofweek,neighbourhood_cleansed,room_type,property_type,accommodates,bedrooms,number_of_reviews,review_scores_rating,bathrooms,n_reviews,days_since_last_review,review_text_comp_1,review_text_comp_2,review_text_comp_3,review_text_comp_4,review_text_comp_5
0,50904,2025-06-26,t,,,1,1000,1,2025,6,26,3,Historisch Centrum,Hotel room,Room in boutique hotel,2,1.0,3,5.0,1.0,3.0,1140.0,0.276287,-0.290923,-0.223017,-0.222606,-0.12201
1,50904,2025-06-27,f,,,1,1000,0,2025,6,27,4,Historisch Centrum,Hotel room,Room in boutique hotel,2,1.0,3,5.0,1.0,3.0,1140.0,0.276287,-0.290923,-0.223017,-0.222606,-0.12201
2,50904,2025-06-28,f,,,1,1000,0,2025,6,28,5,Historisch Centrum,Hotel room,Room in boutique hotel,2,1.0,3,5.0,1.0,3.0,1140.0,0.276287,-0.290923,-0.223017,-0.222606,-0.12201
3,50904,2025-06-29,f,,,1,1000,0,2025,6,29,6,Historisch Centrum,Hotel room,Room in boutique hotel,2,1.0,3,5.0,1.0,3.0,1140.0,0.276287,-0.290923,-0.223017,-0.222606,-0.12201
4,50904,2025-06-30,t,,,1,1000,1,2025,6,30,0,Historisch Centrum,Hotel room,Room in boutique hotel,2,1.0,3,5.0,1.0,3.0,1140.0,0.276287,-0.290923,-0.223017,-0.222606,-0.12201


In [60]:
# 5.5. Basic cleaning: handle missing values and encode categoricals

ml_df = base.copy()

# 1) Fill ALL numeric NaNs in a robust way (including price, adjusted_price, etc.)
numeric_cols = ml_df.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
    # Compute median on non-NaN values only
    col_median = ml_df[col].dropna().median()
    # If a column is entirely NaN, fall back to 0
    if pd.isna(col_median):
        col_median = 0.0
    ml_df[col] = ml_df[col].fillna(col_median)

# 2) Fill text SVD components with 0 when we have no reviews (redundant but explicit)
text_cols = [c for c in ml_df.columns if c.startswith("review_text_comp_")]
for c in text_cols:
    ml_df[c] = ml_df[c].fillna(0.0)

# 3) Simple one-hot encoding for a few key categoricals (keep it compact)
cat_cols = ["neighbourhood_cleansed", "room_type", "property_type"]
ml_df = pd.get_dummies(ml_df, columns=cat_cols, drop_first=True)

ml_df.head()


Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights,is_available,year,month,day,dayofweek,accommodates,bedrooms,number_of_reviews,review_scores_rating,bathrooms,n_reviews,days_since_last_review,review_text_comp_1,review_text_comp_2,review_text_comp_3,review_text_comp_4,review_text_comp_5,neighbourhood_cleansed_Borgerhout Extra Muros,neighbourhood_cleansed_Borgerhout Intra Muros Noord,neighbourhood_cleansed_Borgerhout Intra Muros Zuid,neighbourhood_cleansed_Brederode,neighbourhood_cleansed_Centraal Station,neighbourhood_cleansed_Dam,...,property_type_Entire cottage,property_type_Entire guest suite,property_type_Entire guesthouse,property_type_Entire home,property_type_Entire loft,property_type_Entire place,property_type_Entire rental unit,property_type_Entire serviced apartment,property_type_Entire townhouse,property_type_Entire vacation home,property_type_Entire villa,property_type_Houseboat,property_type_Private room,property_type_Private room in bed and breakfast,property_type_Private room in boat,property_type_Private room in casa particular,property_type_Private room in condo,property_type_Private room in guest suite,property_type_Private room in guesthouse,property_type_Private room in home,property_type_Private room in loft,property_type_Private room in rental unit,property_type_Private room in serviced apartment,property_type_Private room in townhouse,property_type_Room in aparthotel,property_type_Room in boutique hotel,property_type_Room in hotel,property_type_Shared room in guesthouse,property_type_Shipping container,property_type_Tiny home
0,50904,2025-06-26,t,0.0,0.0,1,1000,1,2025,6,26,3,2,1.0,3,5.0,1.0,3.0,1140.0,0.276287,-0.290923,-0.223017,-0.222606,-0.12201,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False
1,50904,2025-06-27,f,0.0,0.0,1,1000,0,2025,6,27,4,2,1.0,3,5.0,1.0,3.0,1140.0,0.276287,-0.290923,-0.223017,-0.222606,-0.12201,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False
2,50904,2025-06-28,f,0.0,0.0,1,1000,0,2025,6,28,5,2,1.0,3,5.0,1.0,3.0,1140.0,0.276287,-0.290923,-0.223017,-0.222606,-0.12201,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False
3,50904,2025-06-29,f,0.0,0.0,1,1000,0,2025,6,29,6,2,1.0,3,5.0,1.0,3.0,1140.0,0.276287,-0.290923,-0.223017,-0.222606,-0.12201,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False
4,50904,2025-06-30,t,0.0,0.0,1,1000,1,2025,6,30,0,2,1.0,3,5.0,1.0,3.0,1140.0,0.276287,-0.290923,-0.223017,-0.222606,-0.12201,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False


### 6. Final ML-ready dataset

We now:
- keep a clean set of **feature columns X** and target **`is_available`**,
- save the final table to a CSV file `ml_ready_antwerp.csv` that can be used in the next hackathon (modelling).

This dataset has one row per `(listing_id, date)` with all engineered features attached.


In [61]:
# Select target and features
# (We keep identifiers and date as well, they are handy for analysis but not for training directly.)

# Start from all engineered columns, but avoid duplicates and keep `is_available` only once
feature_cols = [
    c for c in ml_df.columns
    if c not in ["listing_id", "date", "available", "is_available"]
]

final_cols = ["listing_id", "date", "is_available"] + feature_cols

ml_ready = ml_df[final_cols].copy()

# Ensure there are no duplicate columns (keep first occurrence)
ml_ready = ml_ready.loc[:, ~ml_ready.columns.duplicated()]

# If, for any reason, `is_available` is still 2D, reduce it to a 1D Series
if hasattr(ml_ready["is_available"], "ndim") and ml_ready["is_available"].ndim > 1:
    ml_ready["is_available"] = ml_ready["is_available"].iloc[:, 0]

# Save to disk (in the same folder as this notebook)
output_path = "ml_ready_antwerp.csv"
ml_ready.to_csv(output_path, index=False)

ml_ready.head(), ml_ready.shape, output_path


(   listing_id       date  is_available  price  adjusted_price  minimum_nights  maximum_nights  year  month  day  dayofweek  accommodates  \
 0       50904 2025-06-26             1    0.0             0.0               1            1000  2025      6   26          3             2   
 1       50904 2025-06-27             0    0.0             0.0               1            1000  2025      6   27          4             2   
 2       50904 2025-06-28             0    0.0             0.0               1            1000  2025      6   28          5             2   
 3       50904 2025-06-29             0    0.0             0.0               1            1000  2025      6   29          6             2   
 4       50904 2025-06-30             1    0.0             0.0               1            1000  2025      6   30          0             2   
 
    bedrooms  number_of_reviews  review_scores_rating  bathrooms  n_reviews  days_since_last_review  review_text_comp_1  \
 0       1.0                 

### 7. Sanity checks for modelling (optional)

Before moving to the modelling hackathon, it is useful to:
- check the **class balance** of the target `is_available`,
- look at the final number of features you will feed into a model.

This section is optional but helps you quickly understand what the next notebook will work with.


In [62]:
# Target distribution
print(ml_ready["is_available"].value_counts(normalize=True).rename("share"))

# Simple X/y split view for the next notebook (no model here)
feature_cols = [c for c in ml_ready.columns
                if c not in ["listing_id", "date", "is_available"]]

X_shape = (len(ml_ready), len(feature_cols))
print("\nNumber of rows (samples):", X_shape[0])
print("Number of feature columns:", X_shape[1])

# Show just the first few feature columns as a preview
print("\nFirst 10 feature columns:")
print(feature_cols[:10])

is_available
1    0.53665
0    0.46335
Name: share, dtype: float64

Number of rows (samples): 968710
Number of feature columns: 112

First 10 feature columns:
['price', 'adjusted_price', 'minimum_nights', 'maximum_nights', 'year', 'month', 'day', 'dayofweek', 'accommodates', 'bedrooms']


### 8. Train/test split and baseline ML model

It is **crucial** not to evaluate a model on the same data it was trained on.
Here we:
- build a feature matrix `X` and target vector `y` from `ml_ready`,
- perform an **80% train / 20% test** split,
- train a simple **logistic regression** classifier,
- evaluate it on the **held-out test set** using accuracy and ROC-AUC.


In [63]:
# Build X (features) and y (target)
# We exclude identifiers and date from the feature matrix.

# Make sure target is strictly 1D
y = np.ravel(ml_ready["is_available"].values)
X = ml_ready.drop(columns=["is_available", "listing_id", "date"]).values

# 80/20 train-test split (stratify to keep class balance similar)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline logistic regression model
log_reg = LogisticRegression(max_iter=1000, n_jobs=-1)
log_reg.fit(X_train, y_train)

# Predictions on the held-out test set
y_pred = log_reg.predict(X_test)
# Some solvers may not support predict_proba if changed, but default does.
y_proba = log_reg.predict_proba(X_test)[:, 1]

# Evaluation metrics
acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_proba)

print(f"Test accuracy: {acc:.3f}")
print(f"Test ROC-AUC: {auc:.3f}\n")

print("Classification report (test set):")
print(classification_report(y_test, y_pred))


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Test accuracy: 0.634
Test ROC-AUC: 0.689

Classification report (test set):
              precision    recall  f1-score   support

           0       0.62      0.55      0.58     89770
           1       0.64      0.71      0.67    103972

    accuracy                           0.63    193742
   macro avg       0.63      0.63      0.63    193742
weighted avg       0.63      0.63      0.63    193742

