# Hackathon: From Raw Data to ML-Ready Dataset
## Insight-Driven EDA and End-to-End Feature Engineering on Airbnb Data Using pandas and Plotly

### What is a Hackathon?

A hackathon is a fast-paced, collaborative event where participants use data and technology to solve a real problem end-to-end.  
In this hackathon, you will work with a **real-world Airbnb dataset** and complete two interconnected goals:

- Produce a **high-quality exploratory data analysis (EDA)** using `pandas` and `plotly`, extracting meaningful insights, trends, and signals from the data.  
- Design and deliver a **clean, feature-rich, ML-ready dataset** that will serve as the foundation for a follow-up hackathon focused on building and evaluating machine learning models.

Your task is to **get the most out of the data**: uncover structure and patterns through EDA, and engineer informative features (numerical, categorical, temporal, textual via TF–IDF, and optionally image-based) to maximize the predictive power of the final dataset.

<div class="alert alert-success">
<b>About the Dataset</b>

<u>Context</u>

The data comes from <a href="https://insideairbnb.com/get-the-data/">Inside Airbnb</a>, an open project that publishes detailed, regularly updated datasets for cities around the world.  
For each city, three main CSV files are available:

- <b>listings.csv</b> — property characteristics, host profiles, descriptions, amenities, etc.  
- <b>calendar.csv</b> — daily availability and pricing information for each listing.  
- <b>reviews.csv</b> — guest feedback and textual reviews.

These datasets offer a rich view of the short-term rental market, including availability patterns, pricing behavior, host attributes, and guest sentiment.  

<u>Inspiration</u>

Your ultimate objective is to create a dataset suitable for training a machine learning model that predicts whether a specific Airbnb listing will be <b>available on a given date</b>, using property attributes, review information, and host characteristics.
</div>
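
As a minimal sketch of how the prediction target described above could be prepared, the example below uses a toy stand-in for `calendar.csv`. The column names (`listing_id`, `date`, `available`) and the `"t"`/`"f"` encoding are assumptions about the file layout, so check them against the city you download.

```python
import pandas as pd

# Toy stand-in for calendar.csv; column names and 't'/'f' values are assumptions.
calendar = pd.DataFrame({
    "listing_id": [101, 101, 202, 202],
    "date": ["2025-12-05", "2025-12-06", "2025-12-05", "2025-12-07"],
    "available": ["t", "f", "f", "t"],
})

# Parse dates and map the text flag to a proper boolean target.
calendar["date"] = pd.to_datetime(calendar["date"])
calendar["available"] = (
    calendar["available"].map({"t": True, "f": False}).astype("boolean")
)

# Simple temporal features derived from the date.
calendar["day_of_week"] = calendar["date"].dt.dayofweek  # Monday=0 ... Sunday=6
calendar["month"] = calendar["date"].dt.month
calendar["is_weekend"] = calendar["day_of_week"] >= 5

print(calendar)
```

Each row is one (listing, date) pair, which is the natural unit for an availability model; listing-level and review-level features can then be joined onto these rows.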

<div class="alert alert-info">
<b>Task</b>

Using one city of your choice from Inside Airbnb, create an end-to-end pipeline that:

1. Loads and explores the raw data (EDA).  
2. Engineers features (numerical, categorical, temporal, textual TF–IDF, etc.).  
3. Builds a unified ML-ready dataset.  

Please add comments explaining your decisions: they help us understand your thought process and evaluate your work accurately. This assignment requires code-based solutions; **manually calculated or hard-coded results will not be accepted**. Thoughtful comments and visualizations are encouraged and will be highly valued.

- Write your solution directly in this notebook, modifying it as needed.
- Once completed, submit the notebook in **.ipynb** format via Moodle.
    
<b>Collaboration Requirement: Git & GitHub</b>

You must collaborate with your team using a **shared GitHub repository**.  
Your use of Git is part of the evaluation. We will specifically look at:

- Commit quality (clear messages, meaningful steps).  
- Balanced participation across team members.  
- Use of branches.  
- Ability to resolve merge conflicts appropriately.  
- A clean, readable project history that reflects real collaboration.

Good Git practice is **part of your grade**, not optional.
</div>
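
To make the "textual TF–IDF" part of step 2 more concrete, here is a from-scratch sketch of one common TF–IDF variant (term frequency times `log(N / df)`), written only to show what the weights mean. The function names and the sample descriptions are hypothetical. In practice a library vectorizer such as scikit-learn's `TfidfVectorizer` is the usual route; its default weighting (smoothed IDF, L2 normalization) differs from this simplified formula.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters; a deliberately naive tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty text
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical listing descriptions, for illustration only.
descriptions = [
    "Sunny loft near the beach",
    "Quiet loft in the old town",
    "Beach house with sunny terrace",
]
for doc_weights in tfidf(descriptions):
    top_terms = sorted(doc_weights.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(top_terms)
```

Terms that appear in every document get a weight of zero, while terms concentrated in a few documents score higher, which is why TF–IDF columns can separate listings by their distinctive vocabulary.
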
<div class="alert alert-danger">
    You may add as many cells as you wish, as long as you leave the first one untouched.
</div>

<div class="alert alert-warning">

<b>Hints</b>

- Text columns often carry substantial predictive power; use text-vectorization methods to extract meaningful features.  
- Make sure all columns use appropriate data types (categorical, numeric, datetime, boolean). Correct dtypes help prevent subtle bugs and improve performance.  
- Feel free to enrich the dataset with any additional information you consider useful: engineered features, external data, derived temporal features, etc.  
- If the dataset is too large for your computer, use <code>.sample()</code> to work with a subset while preserving the logic of your pipeline.  
- Plotly offers a wide variety of powerful visualizations. Experiment creatively, but always begin with a clear analytical question: *What insight am I trying to uncover with this plot?*

</div>
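
The dtype hint above can be sketched as follows. The frame is a toy stand-in for a few `listings.csv` columns; the column names and raw formats (currency strings, `"t"`/`"f"` flags) are assumptions, so adapt the conversions to what you actually find in your city's file.

```python
import pandas as pd

# Toy stand-in for a few listings.csv columns (names and formats are assumptions).
listings = pd.DataFrame({
    "id": [101, 202, 303],
    "price": ["$85.00", "$1,250.00", None],
    "host_is_superhost": ["t", "f", None],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
    "host_since": ["2015-06-01", "2019-11-23", "2021-02-14"],
})

# Currency strings -> float: strip "$" and thousands separators, then convert.
listings["price"] = pd.to_numeric(
    listings["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# 't'/'f' flags -> nullable boolean, so missing values stay missing.
listings["host_is_superhost"] = (
    listings["host_is_superhost"].map({"t": True, "f": False}).astype("boolean")
)

# Low-cardinality text -> category; date strings -> datetime.
listings["room_type"] = listings["room_type"].astype("category")
listings["host_since"] = pd.to_datetime(listings["host_since"])

print(listings.dtypes)
```

Using the nullable `"boolean"` dtype keeps missing flags as `<NA>` instead of silently turning them into `False`, which leaves the imputation decision explicit.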




<div class="alert alert-danger">
<b>Submission Deadline:</b> Wednesday, December 3rd, 12:00

Start with a simple, working pipeline and refine it only if you have time.  
Do not over-complicate your code.
</div>

<div class="alert alert-danger">
    
You may add as many cells as you want, but the **first cell must remain exactly as provided**. Do not edit, move, or delete it under any circumstances.
</div>


In [None]:
# LEAVE BLANK

### Team Information

Fill in the information below.  
All fields are **mandatory**.

- **GitHub Repository URL**: Paste the link to the team repo you will use for collaboration.
- **Team Members**: List all student names (and emails or IDs if required).

Do not modify the section title.  
Do not remove this cell.


In [None]:
# === Team Information (Mandatory) ===
# Fill in the fields below.

GITHUB_REPO = ""       # e.g. "https://github.com/myteam/airbnb-hackathon"
TEAM_MEMBERS = [
    # "Full Name 1",
    # "Full Name 2",
    # "Full Name 3",
]

GITHUB_REPO, TEAM_MEMBERS


In [2]:
import pandas as pd
import os

# --- CONFIGURATION ---
DATA_FOLDER = "data"        # Folder where your raw csv.gz files are located
N_LISTINGS = 1000           # Number of unique properties to sample
SEED = 42                   # Seed for reproducibility (so your team gets the same sample)

print("Starting sampling process...")

# 1. Load Listings (The "Parent" Table)
# We start here because we need to select specific IDs first to keep consistency.
listings_path = os.path.join(DATA_FOLDER, "listings.csv.gz")
df_listings = pd.read_csv(listings_path, compression="gzip")

# 2. Create the Sample of Listings
# We verify if we have enough data before sampling
if len(df_listings) > N_LISTINGS:
    df_listings_sample = df_listings.sample(n=N_LISTINGS, random_state=SEED)
else:
    # If the dataset is smaller than the requested sample, take everything
    df_listings_sample = df_listings.copy()
    
# Get the list of IDs we selected. This is the key to filter the other files.
selected_ids = df_listings_sample['id'].unique()

print(f"Selected {len(selected_ids)} unique listings.")

# 3. Load and Filter Calendar (The "Target" Table)
# This file is usually huge, so we filter it to keep only rows for our selected IDs.
calendar_path = os.path.join(DATA_FOLDER, "calendar.csv.gz")
df_calendar_raw = pd.read_csv(calendar_path, compression="gzip")
df_calendar_sample = df_calendar_raw[df_calendar_raw['listing_id'].isin(selected_ids)].copy()

# 4. Load and Filter Reviews (Text Data)
# Same logic: only keep reviews that belong to the selected listings.
reviews_path = os.path.join(DATA_FOLDER, "reviews.csv.gz")
df_reviews_raw = pd.read_csv(reviews_path, compression="gzip")
df_reviews_sample = df_reviews_raw[df_reviews_raw['listing_id'].isin(selected_ids)].copy()

# 5. Save the Samples
# We save them as new files so you don't have to run this heavy process again.
df_listings_sample.to_csv(os.path.join(DATA_FOLDER, "listings_sample.csv"), index=False)
df_calendar_sample.to_csv(os.path.join(DATA_FOLDER, "calendar_sample.csv"), index=False)
df_reviews_sample.to_csv(os.path.join(DATA_FOLDER, "reviews_sample.csv"), index=False)

# 6. Verification Output
print("\n--- SAMPLING COMPLETE ---")
print(f"Original Listings: {df_listings.shape} -> Sample: {df_listings_sample.shape}")
print(f"Original Calendar: {df_calendar_raw.shape} -> Sample: {df_calendar_sample.shape}")
print(f"Original Reviews:  {df_reviews_raw.shape} -> Sample: {df_reviews_sample.shape}")
print(f"\nFiles saved in '{DATA_FOLDER}/': listings_sample.csv, calendar_sample.csv, reviews_sample.csv")


Starting sampling process...
Selected 1000 unique listings.

--- SAMPLING COMPLETE ---
Original Listings: (3679, 79) -> Sample: (1000, 79)
Original Calendar: (1342835, 7) -> Sample: (365000, 7)
Original Reviews:  (83000, 6) -> Sample: (21615, 6)

Files saved in 'data/': listings_sample.csv, calendar_sample.csv, reviews_sample.csv
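
One possible way to combine the three sampled tables into a single ML-ready frame is sketched below. It uses tiny inline frames instead of the saved files so that it runs on its own; the feature columns (`room_type`, `accommodates`, `comments`) and the aggregated names (`n_reviews`, `mean_comment_length`) are illustrative assumptions, not a required schema.

```python
import pandas as pd

# Tiny stand-ins for the three sampled tables (columns are assumptions).
listings = pd.DataFrame({
    "id": [101, 202],
    "room_type": ["Entire home/apt", "Private room"],
    "accommodates": [4, 2],
})
calendar = pd.DataFrame({
    "listing_id": [101, 101, 202],
    "date": pd.to_datetime(["2025-12-05", "2025-12-06", "2025-12-05"]),
    "available": [True, False, True],
})
reviews = pd.DataFrame({
    "listing_id": [101, 101],
    "comments": ["Great stay", "Very clean and quiet"],
})

# Aggregate reviews to one row per listing first, so the join
# cannot multiply calendar rows.
review_features = (
    reviews.assign(comment_length=reviews["comments"].str.len())
    .groupby("listing_id")
    .agg(
        n_reviews=("comments", "count"),
        mean_comment_length=("comment_length", "mean"),
    )
    .reset_index()
)

# Calendar rows are the unit of prediction; attach listing and review features.
dataset = (
    calendar
    .merge(listings, left_on="listing_id", right_on="id",
           how="left", validate="many_to_one")
    .drop(columns="id")
    .merge(review_features, on="listing_id", how="left", validate="many_to_one")
)

# Listings without reviews get a count of 0 rather than a missing value.
dataset["n_reviews"] = dataset["n_reviews"].fillna(0).astype(int)

# Sanity check: the joins must not add or drop (listing, date) rows.
assert len(dataset) == len(calendar)
print(dataset)
```

The `validate="many_to_one"` argument makes pandas raise an error if the right-hand table unexpectedly contains duplicate keys, which is a cheap guard against silently inflated row counts.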
