# 🏡 Airbnb Data Analysis with PySpark
This notebook demonstrates how to load, clean, and analyze Airbnb listings using PySpark. We'll walk through:
- Mounting Google Drive
- Installing Spark
- Loading CSV data
- Cleaning and transforming
- Running basic aggregations
- Training a linear regression model

In [ ]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [ ]:
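# Install Java 11 and Spark 3.4.1, plus findspark to locate the Spark install
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q https://mirrors.huaweicloud.com/apache/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
# -xf rather than -xvf: extract without listing every file in the archive
!tar -xf spark-3.4.1-bin-hadoop3.tgz
!pip install -q findspark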

In [ ]:
# Configure environment
import os

import findspark

# Point at the Java and Spark installs from the previous cell
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-11-openjdk-amd64'
os.environ['SPARK_HOME'] = '/content/spark-3.4.1-bin-hadoop3'

# findspark.init() adds Spark to sys.path, so it must run before importing pyspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('AirbnbReviews').getOrCreate()

In [ ]:
# Load dataset
df = spark.read.csv('/content/drive/MyDrive/listings.csv', header=True, inferSchema=True)
df.printSchema()
df.show(5)

## 🧹 Data Cleaning
We'll drop rows with a null `price` and parse `last_review` into a proper date column.

In [ ]:
from pyspark.sql.functions import col, to_date

# Drop rows where price is missing
df_clean = df.filter(col('price').isNotNull())
# With no format argument, to_date follows Spark's casting rules to DateType
df_clean = df_clean.withColumn('last_review', to_date(col('last_review')))
df_clean.select('id', 'name', 'price', 'last_review').show()

## 📊 Aggregation Examples
Let's explore average price by room type and listing count by neighborhood.

In [ ]:
df_clean.groupBy('room_type').avg('price').orderBy('avg(price)', ascending=False).show()

In [ ]:
df_clean.groupBy('neighbourhood').count().orderBy('count', ascending=False).show()