# Hackathon: From Raw Data to ML-Ready Dataset
## Insight-Driven EDA and End-to-End Feature Engineering on Airbnb Data Using pandas and Plotly

### What is a Hackathon?

A hackathon is a fast-paced, collaborative event where participants use data and technology to solve a real problem end-to-end.  
In this hackathon, you will work with a **real-world Airbnb dataset** and complete two interconnected goals:

- Produce a **high-quality exploratory data analysis (EDA)** using `pandas` and `plotly`, extracting meaningful insights, trends, and signals from the data.  
- Design and deliver a **clean, feature-rich, ML-ready dataset** that will serve as the foundation for a follow-up hackathon focused on building and evaluating machine learning models.

Your task is to **get the most out of the data**: uncover structure and patterns through EDA, and engineer informative features (numerical, categorical, temporal, textual via TF–IDF, and optionally image-based) to maximize the predictive power of the final dataset.

<div class="alert alert-success">
<b>About the Dataset</b>

<u>Context</u>

The data comes from <a href="https://insideairbnb.com/get-the-data/">Inside Airbnb</a>, an open project that publishes detailed, regularly updated datasets for cities around the world.  
Each city provides three main CSV files:

- <b>listings.csv</b> — property characteristics, host profiles, descriptions, amenities, etc.  
- <b>calendar.csv</b> — daily availability and pricing information for each listing.  
- <b>reviews.csv</b> — guest feedback and textual reviews.

These datasets offer a rich view of the short-term rental market, including availability patterns, pricing behavior, host attributes, and guest sentiment.  

<u>Inspiration</u>

Your ultimate objective is to create a dataset suitable for training a machine learning model that predicts whether a specific Airbnb listing will be <b>available on a given date</b>, using property attributes, review information, and host characteristics.
</div>
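
As a rough illustration of that objective, the sketch below derives a boolean target and a few calendar-based features. It assumes the calendar file has `listing_id`, `date`, and `available` columns, with availability encoded as `t`/`f`. The toy frame and the names `build_target`, `is_available`, `day_of_week`, and `is_weekend` are illustrative choices, not part of the dataset.

```python
import pandas as pd

def build_target(calendar: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (listing_id, date) with a boolean target and simple date features."""
    out = calendar[["listing_id", "date", "available"]].copy()
    out["date"] = pd.to_datetime(out["date"])
    # Assumption: availability is stored as the strings "t" and "f".
    out["is_available"] = out["available"].map({"t": True, "f": False}).astype("boolean")
    out["day_of_week"] = out["date"].dt.dayofweek  # Monday = 0, Sunday = 6
    out["month"] = out["date"].dt.month
    out["is_weekend"] = out["day_of_week"] >= 5
    return out.drop(columns="available")

# Toy rows for illustration only.
toy_calendar = pd.DataFrame({
    "listing_id": [50904, 50904, 50904],
    "date": ["2025-06-26", "2025-06-27", "2025-06-28"],
    "available": ["t", "f", "f"],
})
target = build_target(toy_calendar)
print(target)
```

Each row of the result is one prediction example: a listing, a date, and whether the listing was available on that date.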

<div class="alert alert-info">
<b>Task</b>

Using one city of your choice from Inside Airbnb, create an end-to-end pipeline that:

1. Loads and explores the raw data (EDA).  
2. Engineers features (numerical, categorical, temporal, textual TF–IDF, etc.).  
3. Builds a unified ML-ready dataset.  

Please remember to add comments explaining your decisions. Comments help us understand your thought process and ensure accurate evaluation of your work. This assignment requires code-based solutions—**manually calculated or hard-coded results will not be accepted**. Thoughtful comments and visualizations are encouraged and will be highly valued.

- Write your solution directly in this notebook, modifying it as needed.
- Once completed, submit the notebook in **.ipynb** format via Moodle.
    
<b>Collaboration Requirement: Git & GitHub</b>

You must collaborate with your team using a **shared GitHub repository**.  
Your use of Git is part of the evaluation. We will specifically look at:

- Commit quality (clear messages, meaningful steps).  
- Balanced participation across team members.  
- Use of branches.  
- Ability to resolve merge conflicts appropriately.  
- A clean, readable project history that reflects real collaboration.

Good Git practice is **part of your grade**, not optional.
</div>
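
The three pipeline steps in the task can be sketched end to end on toy data. This is one possible skeleton, not a required design: it assumes listings are keyed by `id` and that calendar and reviews refer to them through `listing_id`. The function name `build_ml_dataset` and the `n_reviews` feature are hypothetical.

```python
import pandas as pd

def build_ml_dataset(listings: pd.DataFrame, calendar: pd.DataFrame, reviews: pd.DataFrame) -> pd.DataFrame:
    """Join calendar rows with listing attributes and one review-based feature."""
    # Tiny feature-engineering example: number of reviews per listing.
    review_feats = reviews.groupby("listing_id").size().rename("n_reviews").reset_index()
    listing_feats = listings.rename(columns={"id": "listing_id"})
    dataset = (
        calendar
        # Suffixes keep same-named columns from the two files apart.
        .merge(listing_feats, on="listing_id", how="left",
               validate="many_to_one", suffixes=("_calendar", "_listing"))
        .merge(review_feats, on="listing_id", how="left", validate="many_to_one")
    )
    # Listings without any review get a count of zero instead of NaN.
    dataset["n_reviews"] = dataset["n_reviews"].fillna(0).astype("int64")
    return dataset

# Toy frames for illustration only.
toy_listings = pd.DataFrame({"id": [1, 2], "room_type": ["Entire home/apt", "Private room"]})
toy_calendar = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "date": ["2025-06-26", "2025-06-27", "2025-06-26"],
    "available": ["t", "f", "t"],
})
toy_reviews = pd.DataFrame({"listing_id": [1, 1], "id": [10, 11], "comments": ["Great", "Nice"]})

dataset = build_ml_dataset(toy_listings, toy_calendar, toy_reviews)
print(dataset)
```

The `validate="many_to_one"` argument makes pandas raise an error if a listing appears more than once on the right-hand side, which guards against silently duplicated calendar rows.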
<div class="alert alert-danger">
    You are free to add as many cells as you wish, as long as you leave the first one untouched.
</div>

<div class="alert alert-warning">

<b>Hints</b>

- Text columns often carry substantial predictive power. Use text-vectorization methods such as TF–IDF to extract meaningful features.  
- Make sure all columns use appropriate data types (categorical, numeric, datetime, boolean). Correct dtypes help prevent subtle bugs and improve performance.  
- Feel free to enrich the dataset with any additional information you consider useful: engineered features, external data, derived temporal features, etc.  
- If the dataset is too large for your computer, use <code>.sample()</code> to work with a subset while preserving the logic of your pipeline.  
- Plotly offers a wide variety of powerful visualizations. Experiment creatively, but always begin with a clear analytical question: *What insight am I trying to uncover with this plot?*

</div>




<div class="alert alert-danger">
<b>Submission Deadline:</b> Wednesday, December 3rd, 12:00

Start with a simple, working pipeline and refine it if you have time.  
Do not over-complicate your code.
</div>

<div class="alert alert-danger">
    
You may add as many cells as you want, but the **first cell must remain exactly as provided**. Do not edit, move, or delete it under any circumstances.
</div>


In [1]:
# LEAVE BLANK

### Team Information

Fill in the information below.  
All fields are **mandatory**.

- **GitHub Repository URL**: Paste the link to the team repo you will use for collaboration.
- **Team Members**: List all student names (and emails or IDs if required).

Do not modify the section title.  
Do not remove this cell.


In [2]:
# === Team Information (Mandatory) ===
# Fill in the fields below.

GITHUB_REPO = "https://github.com/seanhoet65-source/hackathon_python"
TEAM_MEMBERS = ["Pau Gratacós Fusté", "Sean Hoet", "Florian Nix", "Caroline Wheeler", "Riwad Irshied"]

GITHUB_REPO, TEAM_MEMBERS


('https://github.com/seanhoet65-source/hackathon_python',
 ['Pau Gratacós Fusté',
  'Sean Hoet',
  'Florian Nix',
  'Caroline Wheeler',
  'Riwad Irshied'])

In [2]:
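# Core data libraries
import pandas as pd
import numpy as np
# Static plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Default matplotlib style with the seaborn theme applied on top
plt.style.use("default")
sns.set_theme()

# Load the three Inside Airbnb files (gzip-compressed CSVs in the working directory)
listings = pd.read_csv("listings.csv.gz", compression="gzip")
calendar = pd.read_csv("calendar.csv.gz", compression="gzip")
reviews = pd.read_csv("reviews.csv.gz", compression="gzip")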

print("Listings shape:", listings.shape)
print("Calendar shape:", calendar.shape)
print("Reviews shape:", reviews.shape)


Listings shape: (2654, 79)
Calendar shape: (968710, 7)
Reviews shape: (122622, 6)
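
With close to a million calendar rows, a smaller working copy can be useful while the pipeline is still changing. One way to apply the `.sample()` hint is to sample listings rather than individual rows, so that every retained listing keeps its full calendar and all of its reviews. The sketch below is a possible approach on toy frames. The helper `sample_by_listing` and its default `frac` and `seed` values are assumptions.

```python
import pandas as pd

def sample_by_listing(listings, calendar, reviews, frac=0.2, seed=42):
    """Sample a fraction of listings and keep only their calendar and review rows."""
    small_listings = listings.sample(frac=frac, random_state=seed)
    keep_ids = small_listings["id"]
    small_calendar = calendar[calendar["listing_id"].isin(keep_ids)]
    small_reviews = reviews[reviews["listing_id"].isin(keep_ids)]
    return small_listings, small_calendar, small_reviews

# Toy frames for illustration only: 10 listings with 3 calendar days each.
toy_listings = pd.DataFrame({"id": range(1, 11)})
toy_calendar = pd.DataFrame({
    "listing_id": [i for i in range(1, 11) for _ in range(3)],
    "date": ["2025-06-26", "2025-06-27", "2025-06-28"] * 10,
})
toy_reviews = pd.DataFrame({"listing_id": [1, 1, 5, 9], "comments": ["a", "b", "c", "d"]})

small_listings, small_calendar, small_reviews = sample_by_listing(
    toy_listings, toy_calendar, toy_reviews, frac=0.3
)
print(len(small_listings), len(small_calendar), len(small_reviews))
```

Fixing `random_state` keeps the subset reproducible across team members and notebook runs.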


In [3]:
print("=== LISTINGS INFO ===")
listings.info()

print("\n=== CALENDAR INFO ===")
calendar.info()

print("\n=== REVIEWS INFO ===")
reviews.info()

display(listings.head())
display(calendar.head())
display(reviews.head())

=== LISTINGS INFO ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2654 entries, 0 to 2653
Data columns (total 79 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   id                                                2654 non-null   int64  
 1   listing_url                                       2654 non-null   object 
 2   scrape_id                                         2654 non-null   int64  
 3   last_scraped                                      2654 non-null   object 
 4   source                                            2654 non-null   object 
 5   name                                              2654 non-null   object 
 6   description                                       2613 non-null   object 
 7   neighborhood_overview                             1299 non-null   object 
 8   picture_url                                       2654 non-null   object 
 9

| | id | listing_url | scrape_id | last_scraped | source | name | description | neighborhood_overview | picture_url | host_id | ... | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50904 | https://www.airbnb.com/rooms/50904 | 20250625193541 | 2025-06-26 | city scrape | A Place Boutique - Deluxe suite | Decorated in a vintage style combined with a f... | | https://a0.muscache.com/pictures/f14b0908-cbc3... | 234077 | ... | 5.0 | 5.0 | 5.0 | | t | 9 | 5 | 2 | 0 | 0.02 |
| 1 | 345959 | https://www.airbnb.com/rooms/345959 | 20250625193541 | 2025-06-25 | city scrape | Marleen's home in Antwerp city | your entire, private groundfloor 2-bedroom apa... | nice, quiet residential neighborhood | https://a0.muscache.com/pictures/11642662/f9b6... | 1754396 | ... | 4.87 | 4.59 | 4.81 | | f | 1 | 1 | 0 | 0 | 0.81 |
| 2 | 366252 | https://www.airbnb.com/rooms/366252 | 20250625193541 | 2025-06-25 | city scrape | ROOM IN FAMILY HOME near C. Station | In the Antwerp district of Borgerhout, we live... | we live on the 5th floor on top of a bed store... | https://a0.muscache.com/pictures/airflow/Hosti... | 1820186 | ... | 4.89 | 4.41 | 4.64 | | t | 2 | 0 | 2 | 0 | 1.0 |
| 3 | 522693 | https://www.airbnb.com/rooms/522693 | 20250625193541 | 2025-06-26 | previous scrape | Ahome Awayfromhome | | Ahome lies in the heart of the old city, the b... | https://a0.muscache.com/pictures/6812669/ad924... | 2562294 | ... | | | | | f | 1 | 1 | 0 | 0 | |
| 4 | 603545 | https://www.airbnb.com/rooms/603545 | 20250625193541 | 2025-06-26 | previous scrape | \*Perfect\* for a Longer Stay \| Trendy \| City ce... | My loft has it all! Superb location, right in ... | Excellent location. There are plenty of great ... | https://a0.muscache.com/pictures/airflow/Hosti... | 2987880 | ... | 4.91 | 4.91 | 4.72 | | f | 1 | 1 | 0 | 0 | 0.3 |


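| | listing_id | date | available | price | adjusted_price | minimum_nights | maximum_nights |
|---|---|---|---|---|---|---|---|
| 0 | 50904 | 2025-06-26 | t | | | 1 | 1000 |
| 1 | 50904 | 2025-06-27 | f | | | 1 | 1000 |
| 2 | 50904 | 2025-06-28 | f | | | 1 | 1000 |
| 3 | 50904 | 2025-06-29 | f | | | 1 | 1000 |
| 4 | 50904 | 2025-06-30 | t | | | 1 | 1000 |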


| | listing_id | id | date | reviewer_id | reviewer_name | comments |
|---|---|---|---|---|---|---|
| 0 | 50904 | 31511792 | 2015-05-06 | 19482395 | Jihae | Karin’s “Aplace” is absolutely beautiful and c... |
| 1 | 50904 | 470101024356869935 | 2021-10-10 | 333559 | Emilie | Karin is a wonderful host, she was really help... |
| 2 | 50904 | 627287279025726941 | 2022-05-15 | 32701854 | Marie-Lou | The location is super super nice! Karin was al... |
| 3 | 345959 | 1267667 | 2012-05-12 | 1116585 | Saskia | Marleen was very welcoming even though we had ... |
| 4 | 345959 | 2045263 | 2012-08-20 | 1775578 | Max | Marleen was very helpful accommodating our arr... |
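
The previews show dates stored as plain strings and flags stored as `t`/`f`, so a natural next step is the dtype cleanup suggested in the hints. The helper below is a minimal sketch. The name `fix_dtypes` and the choice of which columns to convert are assumptions to adapt to the actual data.

```python
import pandas as pd

def fix_dtypes(df: pd.DataFrame, bool_cols=(), date_cols=(), cat_cols=()) -> pd.DataFrame:
    """Return a copy of df with flag, date, and low-cardinality columns converted."""
    out = df.copy()
    for col in bool_cols:
        # "t"/"f" strings become a nullable boolean column; anything else becomes <NA>.
        out[col] = out[col].map({"t": True, "f": False}).astype("boolean")
    for col in date_cols:
        # Unparseable values become NaT instead of raising an error.
        out[col] = pd.to_datetime(out[col], errors="coerce")
    for col in cat_cols:
        out[col] = out[col].astype("category")
    return out

# Toy rows mirroring a few listings columns, for illustration only.
toy_listings = pd.DataFrame({
    "id": [50904, 345959],
    "last_scraped": ["2025-06-26", "2025-06-25"],
    "source": ["city scrape", "city scrape"],
    "instant_bookable": ["t", "f"],
})
clean = fix_dtypes(
    toy_listings,
    bool_cols=["instant_bookable"],
    date_cols=["last_scraped"],
    cat_cols=["source"],
)
print(clean.dtypes)
```

The same helper could be applied to the calendar and reviews frames by passing their own column lists, for example `date` as a date column.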
