# Hackathon: From Raw Data to ML-Ready Dataset
## Insight-Driven EDA and End-to-End Feature Engineering on Airbnb Data Using pandas and Plotly

### What is a Hackathon?

A hackathon is a fast-paced, collaborative event where participants use data and technology to solve a real problem end-to-end.  
In this hackathon, you will work with a **real-world Airbnb dataset** and complete two interconnected goals:

- Produce a **high-quality exploratory data analysis (EDA)** using `pandas` and `plotly`, extracting meaningful insights, trends, and signals from the data.  
- Design and deliver a **clean, feature-rich, ML-ready dataset** that will serve as the foundation for a follow-up hackathon focused on building and evaluating machine learning models.

Your task is to **get the most out of the data**: uncover structure and patterns through EDA, and engineer informative features (numerical, categorical, temporal, textual (TF–IDF), and optionally image-based) to maximize the predictive power of the final dataset.

<div class="alert alert-success">
<b>About the Dataset</b>

<u>Context</u>

The data comes from <a href="https://insideairbnb.com/get-the-data/">Inside Airbnb</a>, an open project that publishes detailed, regularly updated datasets for cities around the world.  
Each city provides three main CSV files:

- <b>listings.csv</b> — property characteristics, host profiles, descriptions, amenities, etc.  
- <b>calendar.csv</b> — daily availability and pricing information for each listing.  
- <b>reviews.csv</b> — guest feedback and textual reviews.

These datasets offer a rich view of the short-term rental market, including availability patterns, pricing behavior, host attributes, and guest sentiment.  

<u>Inspiration</u>

Your ultimate objective is to create a dataset suitable for training a machine learning model that predicts whether a specific Airbnb listing will be <b>available on a given date</b>, using property attributes, review information, and host characteristics.
</div>
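The objective above implies one row per (listing, date) pair: the calendar supplies the date and the availability label, and the listings file supplies the property attributes. A minimal sketch of that framing follows, assuming the Inside Airbnb key columns `listing_id` and `id`; the values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for listings.csv and calendar.csv (values are made up).
listings = pd.DataFrame({
    "id": [10, 20],
    "room_type": ["Entire home/apt", "Private room"],
    "accommodates": [4, 2],
})
calendar = pd.DataFrame({
    "listing_id": [10, 10, 20, 30],
    "date": pd.to_datetime(["2025-01-03", "2025-01-04", "2025-01-03", "2025-01-03"]),
    "available": [True, False, True, True],
})

# One row per (listing, date): each calendar row picks up its listing's attributes.
# validate="many_to_one" raises if a listing id appears twice in `listings`.
rows = calendar.merge(
    listings,
    left_on="listing_id",
    right_on="id",
    how="inner",
    validate="many_to_one",
).drop(columns="id")

print(rows)
```

Listing 30 has no match in `listings`, so the inner join drops it; whether to drop or keep such rows is a modelling choice.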

<div class="alert alert-info">
<b>Task</b>

Using one city of your choice from Inside Airbnb, create an end-to-end pipeline that:

1. Loads and explores the raw data (EDA).  
2. Engineers features (numerical, categorical, temporal, textual TF–IDF, etc.).  
3. Builds a unified ML-ready dataset.  

Please remember to add comments explaining your decisions. Comments help us understand your thought process and ensure accurate evaluation of your work. This assignment requires code-based solutions—**manually calculated or hard-coded results will not be accepted**. Thoughtful comments and visualizations are encouraged and will be highly valued.

- Write your solution directly in this notebook, modifying it as needed.
- Once completed, submit the notebook in **.ipynb** format via Moodle.
    
<b>Collaboration Requirement: Git & GitHub</b>

You must collaborate with your team using a **shared GitHub repository**.  
Your use of Git is part of the evaluation. We will specifically look at:

- Commit quality (clear messages, meaningful steps).  
- Balanced participation across team members.  
- Use of branches.  
- Ability to resolve merge conflicts appropriately.  
- A clean, readable project history that reflects real collaboration.

Good Git practice is **part of your grade**, not optional.
</div>
<div class="alert alert-danger">
    You may add as many cells as you wish, as long as the first cell is left untouched.
</div>

<div class="alert alert-warning">

<b>Hints</b>

- Text columns often carry substantial predictive power; use text-vectorization methods to extract meaningful features.  
- Make sure all columns use appropriate data types (categorical, numeric, datetime, boolean). Correct dtypes help prevent subtle bugs and improve performance.  
- Feel free to enrich the dataset with any additional information you consider useful: engineered features, external data, derived temporal features, etc.  
- If the dataset is too large for your computer, use <code>.sample()</code> to work with a subset while preserving the logic of your pipeline.  
- Plotly offers a wide variety of powerful visualizations. Experiment creatively, but always begin with a clear analytical question: *What insight am I trying to uncover with this plot?*

</div>
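The dtype and sampling hints can be sketched on a toy frame. The column names mirror `calendar.csv`, but the values are invented, and the 50% sample fraction is only an illustrative choice.

```python
import pandas as pd

# Toy stand-in for calendar.csv; the values are made up for illustration.
raw = pd.DataFrame({
    "listing_id": [1, 1, 2, 2],
    "date": ["2025-01-03", "2025-01-04", "2025-01-03", "not a date"],
    "available": ["t", "f", "t", "f"],
    "price": ["$1,200.00", "$80.00", None, "$95.50"],
})

typed = raw.assign(
    # Unparseable dates become NaT instead of raising.
    date=pd.to_datetime(raw["date"], errors="coerce"),
    # "t"/"f" flags become nullable booleans.
    available=raw["available"].map({"t": True, "f": False}).astype("boolean"),
    # Strip currency symbols and thousands separators before converting.
    price=pd.to_numeric(
        raw["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
    ),
)

# A fixed random_state keeps the subset reproducible across runs.
subset = typed.sample(frac=0.5, random_state=42)

print(typed.dtypes)
print(subset)
```

Converting types before sampling means the subset inherits the cleaned dtypes, so the rest of the pipeline behaves the same on the sample and on the full data.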




<div class="alert alert-danger">
<b>Submission Deadline:</b> Wednesday, December 3rd, 12:00

Start with a simple, working pipeline and refine it only if you have time.  
Do not over-complicate your code.
</div>

<div class="alert alert-danger">
    
You may add as many cells as you want, but the **first cell must remain exactly as provided**. Do not edit, move, or delete it under any circumstances.
</div>


In [4]:
# LEAVE BLANK

### Team Information

Fill in the information below.  
All fields are **mandatory**.

- **GitHub Repository URL**: Paste the link to the team repo you will use for collaboration.
- **Team Members**: List all student names (and emails or IDs if required).

Do not modify the section title.  
Do not remove this cell.


In [5]:
# === Team Information (Mandatory) ===
# Fill in the fields below.

GITHUB_REPO = "https://github.com/seanhoet65-source/hackathon_python"
TEAM_MEMBERS = ["Pau Gratacós Fusté", "Sean Hoet", "Florian Nix", "Caroline Wheeler", "Riwad Irshied"]

GITHUB_REPO, TEAM_MEMBERS


('https://github.com/seanhoet65-source/hackathon_python',
 ['Pau Gratacós Fusté',
  'Sean Hoet',
  'Florian Nix',
  'Caroline Wheeler',
  'Riwad Irshied'])

In [None]:
# %% [markdown]
# # Hackathon: From Raw Data to ML-Ready Dataset
# ## Unified Pipeline: EDA, Feature Engineering, and Data Preparation


# %%
# ==============================================================================
# 0. SETUP & TEAM INFO
# ==============================================================================


# === Team Information (Mandatory) ===
GITHUB_REPO = "https://github.com/seanhoet65-source/hackathon_python"
TEAM_MEMBERS = ["Pau Gratacós Fusté", "Sean Hoet", "Florian Nix", "Caroline Wheeler", "Riwad Irshied"]


print(f"Team: {TEAM_MEMBERS}")
print(f"Repo: {GITHUB_REPO}")


# Imports
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import sys
import subprocess


# TF-IDF import (Safe import)
try:
   from sklearn.feature_extraction.text import TfidfVectorizer
except ImportError:
   TfidfVectorizer = None
   print("[WARN] sklearn not found. TF-IDF features will be skipped.")


# %%
# ==============================================================================
# PART 1: LOAD AND EXPLORE THE RAW DATA (EDA)
# Task: Load raw data, clean basic types, and produce Plotly visualizations.
# ==============================================================================


print("\n" + "="*60)
print("STEP 1: LOADING & EDA")
print("="*60)


# 1.1 Load Data
print("Loading Data...")
listings = pd.read_csv("listings.csv.gz", low_memory=False)
calendar = pd.read_csv("calendar.csv.gz", low_memory=False)
reviews = pd.read_csv("reviews.csv.gz", low_memory=False)


# 1.2 Basic Preprocessing for EDA (Type conversion & Cleaning)
# We need clean types to visualize correctly


# Dates
calendar["date"] = pd.to_datetime(calendar["date"], errors="coerce")
reviews["date"] = pd.to_datetime(reviews["date"], errors="coerce")


# Price Cleaning (Remove $ and ,)
for df_temp in [listings, calendar]:
   col_name = "price"
   if col_name in df_temp.columns and df_temp[col_name].dtype == "object":
       df_temp[col_name] = df_temp[col_name].replace(r"[\$,]", "", regex=True)
       df_temp[col_name] = pd.to_numeric(df_temp[col_name], errors="coerce")


# Booleans in Calendar
bool_mapping = {"t": True, "f": False}
if "available" in calendar.columns:
   calendar["available"] = calendar["available"].map(bool_mapping)


print(f"Data Loaded: Listings {listings.shape}, Calendar {calendar.shape}, Reviews {reviews.shape}")


# ------------------------------------------------------------------------------
# 1.3 VISUAL EDA (Required by Assignment: Insight-Driven Plots)
# ------------------------------------------------------------------------------


# Insight 1: What is the price distribution of listings?
# We filter out extreme luxury outliers (>1000) for better visualization
if "price" in listings.columns:
   fig_hist = px.histogram(
       listings[listings["price"] < 1000],
       x="price",
       nbins=50,
       title="Insight 1: Distribution of Listing Prices (Under $1000)",
       template="plotly_white"
   )
   fig_hist.show()


# Insight 2: How does availability change over time?
# We aggregate availability by date
if "available" in calendar.columns:
   daily_availability = calendar.groupby("date")["available"].mean().reset_index()
   fig_line = px.line(
       daily_availability,
       x="date",
       y="available",
       title="Insight 2: Average Availability Rate Over Time",
       template="plotly_white"
   )
   fig_line.show()


# Insight 3: Price vs. Neighbourhood (Box Plot)
# Helps identify expensive areas
if "neighbourhood_cleansed" in listings.columns and "price" in listings.columns:
   # Filter for top 20 neighbourhoods by count to keep chart readable
   top_neighbourhoods = listings["neighbourhood_cleansed"].value_counts().head(20).index
   filtered_listings = listings[listings["neighbourhood_cleansed"].isin(top_neighbourhoods)]
  
   fig_box = px.box(
       filtered_listings[filtered_listings["price"] < 500],
       x="neighbourhood_cleansed",
       y="price",
       title="Insight 3: Price Distribution by Top 20 Neighbourhoods",
       template="plotly_white"
   )
   fig_box.update_layout(xaxis_tickangle=-45)
   fig_box.show()







Team: ['Pau Gratacós Fusté', 'Sean Hoet', 'Florian Nix', 'Caroline Wheeler', 'Riwad Irshied']
Repo: https://github.com/seanhoet65-source/hackathon_python

STEP 1: LOADING & EDA
Loading Data...
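Insight 2 relies on the fact that the mean of a boolean column is the share of `True` values, so averaging `available` per date yields an availability rate. A small sketch of that aggregation on invented data, with a by-weekday variant added as one possible follow-up question:

```python
import pandas as pd

# Toy calendar rows (values are made up); 2025-01-03 is a Friday.
calendar = pd.DataFrame({
    "date": pd.to_datetime(
        ["2025-01-03", "2025-01-03", "2025-01-04", "2025-01-04", "2025-01-05"]
    ),
    "available": [True, False, False, False, True],
})

# Share of available listings per date (mean of booleans = fraction of True).
daily = calendar.groupby("date", as_index=False)["available"].mean()

# The same rate grouped by weekday name instead of by calendar date.
by_weekday = (
    calendar.assign(weekday=calendar["date"].dt.day_name())
    .groupby("weekday")["available"]
    .mean()
)

print(daily)
print(by_weekday)
```

Either frame can be passed directly to `px.line` or `px.bar`.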


In [None]:
# %%
# ==============================================================================
# PART 2: ENGINEER FEATURES
# Task: Create Numerical, Categorical, Temporal, and Textual (TF-IDF) features.
# ==============================================================================


print("\n" + "="*60)
print("STEP 2: FEATURE ENGINEERING")
print("="*60)


# ------------------------------------------------------------------------------
# 2.1 Text Features (TF-IDF)
# ------------------------------------------------------------------------------
# Requirement: "Use text-vectorization methods"
tfidf_df = pd.DataFrame() # Empty placeholder


if TfidfVectorizer is not None:
   # Use description or name
   text_col = "description" if "description" in listings.columns else "name"
   print(f"Vectorizing text from column: {text_col}")
  
   # Fill NA and vectorize
   text_series = listings[text_col].fillna("")
   # We limit to top 20 features to keep the dataset size manageable/performant
   tfidf = TfidfVectorizer(max_features=20, stop_words="english")
   text_matrix = tfidf.fit_transform(text_series)


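   # Create a DataFrame with the features, naming each column after its vocabulary term
   # so the features stay interpretable (e.g. "tfidf_apartment" instead of "tfidf_0")
   tfidf_feature_names = [f"tfidf_{term}" for term in tfidf.get_feature_names_out()]
   tfidf_df = pd.DataFrame(text_matrix.toarray(), columns=tfidf_feature_names, index=listings.index)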
  
   # We attach the listing ID to this dataframe so we can merge it later
   tfidf_df["id"] = listings["id"]
   print(f"✓ Created {len(tfidf_feature_names)} TF-IDF features.")


# ------------------------------------------------------------------------------
# 2.2 Temporal Features (From Calendar)
# ------------------------------------------------------------------------------
# Requirement: "Derived temporal features"
if "date" in calendar.columns:
   calendar["month"] = calendar["date"].dt.month
   calendar["dayofweek"] = calendar["date"].dt.dayofweek
   # 1 if the date is a Friday (4) or Saturday (5), else 0; pandas numbers Monday=0 ... Sunday=6.
   # (Simple weekend definition; Sunday is not flagged.)
   calendar["is_weekend"] = calendar["dayofweek"].isin([4, 5]).astype(int)
   print("✓ Created Temporal Features: month, dayofweek, is_weekend")


# ------------------------------------------------------------------------------
# 2.3 Aggregated Review Features (Numerical)
# ------------------------------------------------------------------------------
# Requirement: "Enrich dataset with additional information"
review_features = (
  reviews.groupby("listing_id")
  .agg({"id": "count", "date": ["min", "max"]})
  .reset_index()
)
review_features.columns = ["listing_id", "review_count", "first_review", "last_review"]


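# Calculate days since last review, measured from the most recent review in the data
# (pd.Timestamp.now() would make this feature change every time the notebook is run)
current_date = reviews["date"].max()
review_features["days_since_last_review"] = (current_date - review_features["last_review"]).dt.days
# Months between first and last review; a zero-length span counts as one month to avoid dividing by zero
review_span_months = ((review_features["last_review"] - review_features["first_review"]).dt.days / 30).replace(0, 1)
review_features["reviews_per_month_calc"] = review_features["review_count"] / review_span_months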


print("✓ Created Aggregated Review Features.")




STEP 2: FEATURE ENGINEERING
Vectorizing text from column: description
✓ Created 20 TF-IDF features.
✓ Created Temporal Features: month, dayofweek, is_weekend
✓ Created Aggregated Review Features.
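The review aggregation in step 2.3 can be checked by hand on a few rows. The sketch below uses invented data and two assumptions that differ from the cell above: named aggregation instead of a dict, and the latest review date as the reference point so that the result does not depend on when the code runs.

```python
import pandas as pd

# Toy reviews (values are made up); 2024 is a leap year.
reviews = pd.DataFrame({
    "listing_id": [1, 1, 1, 2],
    "id": [101, 102, 103, 104],
    "date": pd.to_datetime(["2024-01-01", "2024-01-31", "2024-03-01", "2024-02-15"]),
})

# Assumption: measure recency against the latest review in the data.
reference_date = reviews["date"].max()

# Named aggregation yields flat, self-describing column names.
features = reviews.groupby("listing_id").agg(
    review_count=("id", "count"),
    first_review=("date", "min"),
    last_review=("date", "max"),
).reset_index()

features["days_since_last_review"] = (reference_date - features["last_review"]).dt.days

# A listing with a single review has a zero-length span; treat it as one month.
span_months = ((features["last_review"] - features["first_review"]).dt.days / 30).replace(0, 1)
features["reviews_per_month_calc"] = features["review_count"] / span_months

print(features)
```

Listing 1 has three reviews over 60 days (two 30-day months), giving 1.5 reviews per month; listing 2 has a single review and falls back to 1.0.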


In [None]:
# ==============================================================================
# PART 3: BUILD SMALL IN-MEMORY ML DATASET (NO FILE EXPORT)
# ==============================================================================


print("\n" + "="*60)
print("STEP 3: BUILD SMALL ML DATASET (IN MEMORY ONLY)")
print("="*60)


# 3.0 Subsample calendar to keep dataset small
MAX_ROWS = 50_000  # make this small so memory + size stay safe


if len(calendar) > MAX_ROWS:
   ml_base = calendar.sample(MAX_ROWS, random_state=42)
   print(f"✓ Subsampled calendar from {len(calendar):,} to {len(ml_base):,} rows")
else:
   ml_base = calendar.copy()
   print(f"✓ Using full calendar: {len(ml_base):,} rows")


# 3.1 Merge calendar sample with listings
ml_df = ml_base.merge(
   listings,
   left_on="listing_id",
   right_on="id",
   how="inner",
   suffixes=("_cal", "_list"),
)
print(f"✓ After merging calendar + listings: {ml_df.shape[0]:,} rows × {ml_df.shape[1]} columns")


# 3.2 Merge aggregated review features (listing-level)
ml_df = ml_df.merge(
   review_features,
   on="listing_id",
   how="left"
)
print(f"✓ After adding review features: {ml_df.shape[0]:,} rows × {ml_df.shape[1]} columns")


# 3.3 Create unified PRICE feature (from listings)
price_cols = [c for c in ml_df.columns if "price" in c]
print(f"Detected price-related columns: {price_cols}")


price_list_col = None
for c in price_cols:
   if c.endswith("_list") or c == "price":
       price_list_col = c
       break


if price_list_col is None:
   print("⚠ Warning: No listings-level price column found. 'price' will be NaN.")
   ml_df["price"] = np.nan
else:
   ml_df["price"] = ml_df[price_list_col]


# Clip extreme outliers
ml_df["price"] = ml_df["price"].clip(upper=ml_df["price"].quantile(0.99))


# 3.4 Target variable: availability (True=1, False=0)
if "available" not in ml_df.columns:
   raise ValueError("Column 'available' not found in ml_df. Check previous steps.")
# Rows whose availability flag is missing cannot be labelled, and NaN would make astype(int) fail
ml_df = ml_df.dropna(subset=["available"])
ml_df["target_available"] = ml_df["available"].astype(int)
print("✓ Target variable 'target_available' created.")


# 3.5 Select a *minimal* feature set
numeric_candidates = [
   "price",
   "accommodates",
   "bedrooms",
   "beds",
   "minimum_nights",
   "number_of_reviews",
   "review_scores_rating",
   "review_count",
   "days_since_last_review",
   "month",
   "dayofweek",
   "is_weekend",
]
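numeric_cols = [c for c in numeric_candidates if c in ml_df.columns]

# Attach the TF-IDF features built in step 2.1; without this merge they never reach the final dataset
if not tfidf_df.empty:
   tfidf_cols = [c for c in tfidf_df.columns if c != "id"]
   ml_df = ml_df.merge(
       tfidf_df.rename(columns={"id": "listing_id"}),
       on="listing_id",
       how="left",
   )
   numeric_cols += tfidf_cols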


categorical_candidates = ["room_type"]
categorical_cols = [c for c in categorical_candidates if c in ml_df.columns]


print(f"Using numeric features: {numeric_cols}")
print(f"Using categorical features: {categorical_cols}")


selected_cols = numeric_cols + categorical_cols + ["target_available"]
final_df = ml_df[selected_cols].copy()


# 3.6 Handle missing values
# Numeric: fill with median
for col in numeric_cols:
   final_df[col] = pd.to_numeric(final_df[col], errors="coerce")


numeric_medians = final_df[numeric_cols].median()
final_df[numeric_cols] = final_df[numeric_cols].fillna(numeric_medians)


# Categorical: fill with 'Unknown'
for col in categorical_cols:
   final_df[col] = final_df[col].fillna("Unknown")


# One-hot encode (tiny)
if categorical_cols:
   final_df = pd.get_dummies(
       final_df,
       columns=categorical_cols,
       drop_first=True,
       dtype="int8",
   )


print(f"\nFinal ML-Ready Dataset Shape: {final_df.shape}")
print("\nTarget Distribution:")
print(final_df["target_available"].value_counts(normalize=True))


# Build X and y IN MEMORY ONLY
X = final_df.drop(columns=["target_available"])
y = final_df["target_available"]


print(f"\nFeature matrix X shape: {X.shape}")
print(f"Target vector y length: {y.shape[0]}")


print("\n" + "="*60)
print("✅ SUCCESS: Built small in-memory X and y (no CSV saved)")
print("="*60)
print("Ready for model training in the next cell.")
print("\nSample of X:")
print(X.head())
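
A few cheap checks can confirm that a feature matrix is actually ML-ready before it is handed to a model. The helper below is a hypothetical sketch, not part of the pipeline above; `check_ml_ready` and the toy frames are invented for illustration.

```python
import numpy as np
import pandas as pd


def check_ml_ready(X, y):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    non_numeric = [c for c in X.columns if not pd.api.types.is_numeric_dtype(X[c])]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    n_missing = int(X.isna().sum().sum())
    if n_missing:
        problems.append(f"{n_missing} missing values in X")
    if len(X) != len(y):
        problems.append("X and y have different lengths")
    if y.nunique() < 2:
        problems.append("target has fewer than two classes")
    return problems


# Toy examples (values are made up).
X_ok = pd.DataFrame({"price": [80.0, 120.0], "is_weekend": [0, 1]})
y_ok = pd.Series([1, 0])

X_bad = pd.DataFrame({"price": [80.0, np.nan], "room_type": ["a", "b"]})
y_bad = pd.Series([1, 1])

print(check_ml_ready(X_ok, y_ok))
print(check_ml_ready(X_bad, y_bad))
```

Calling `check_ml_ready(X, y)` on the matrices built above would flag any column that slipped through imputation or encoding.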