<a href="https://colab.research.google.com/github/solomontessema/Data-Analytics-and-AI-with-Python/blob/main/notebooks/sample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Airbnb Reviews with PySpark (Colab)


This notebook:

1. Sets up Java & Spark on Colab
2. Loads the Airbnb listings dataset from Google Drive
3. Performs light cleaning and exploratory aggregations
4. Builds a simple price prediction model (Linear Regression)
5. Saves the cleaned data and model, then runs a sample prediction

## 1) Connect Google Drive
Mount your Drive so we can read/write datasets and models.

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

## 2) Install Java & Spark
Install OpenJDK 11, download Spark 3.4.1 (Hadoop 3), extract it, and install `findspark`.

In [None]:
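# Install a headless OpenJDK 11 runtime (Spark runs on the JVM)
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
# Download and unpack the Spark 3.4.1 build for Hadoop 3
!wget -q https://mirrors.huaweicloud.com/apache/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
!tar -xf spark-3.4.1-bin-hadoop3.tgz
# findspark makes the unpacked Spark installation importable from Python
!pip install -q findspark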

## 3) Initialize Spark
Set environment variables, initialize `findspark`, and start a `SparkSession`.

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.4.1-bin-hadoop3"

import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("AirbnbReviews").getOrCreate()
spark
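
If the session fails to start, one thing worth checking is whether `JAVA_HOME` and `SPARK_HOME` point to directories that actually exist. The sketch below is a minimal plain-Python check; the helper name `missing_paths` is hypothetical and not part of Spark or `findspark`.

```python
import os

def missing_paths(env_vars):
    """Return the names of variables that are unset or point to a missing path."""
    missing = []
    for name in env_vars:
        path = os.environ.get(name)
        if not path or not os.path.exists(path):
            missing.append(name)
    return missing

problems = missing_paths(["JAVA_HOME", "SPARK_HOME"])
if problems:
    print("Check these variables before starting Spark:", ", ".join(problems))
else:
    print("JAVA_HOME and SPARK_HOME both point to existing paths.")
```

Running this before `findspark.init()` can separate a wrong path from a failed download or extraction.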

## 4) Load Data
Read the CSV from Drive. Update the path if your file is located elsewhere. Print the schema and the first few rows to verify that the file loaded as expected.

In [None]:
data_path = "/content/drive/MyDrive/listings.csv"
df = spark.read.csv(data_path, header=True, inferSchema=True)
df.printSchema()
df.show(5, truncate=False)

## 5) Light Cleaning & Transformation
- Keep rows with non-null `price`
- Parse `last_review` to a proper date type
- Preview selected columns

In [None]:
from pyspark.sql.functions import col, to_date

df_clean = df.filter(col("price").isNotNull())
df_clean = df_clean.withColumn("last_review", to_date(col("last_review")))
df_clean.select("id", "name", "price", "last_review").show(10, truncate=False)

## 6) Exploratory Aggregations
A couple of simple summaries:
- Average price by `room_type`
- Top neighborhoods by listing count

In [None]:
df_clean.groupBy("room_type").avg("price").orderBy("avg(price)", ascending=False).show(truncate=False)

In [None]:
df_clean.groupBy("neighbourhood").count().orderBy("count", ascending=False).show(20, truncate=False)

## 7) Save Cleaned Data
Write the cleaned dataset to Parquet, a columnar format that stores the schema with the data and is typically faster to reload than CSV.

In [None]:
clean_out_path = "/content/drive/MyDrive/listings_clean.parquet"
df_clean.write.mode("overwrite").parquet(clean_out_path)
print(f"Saved cleaned data to: {clean_out_path}")

## 8) Build a Simple Price Prediction Model
We train a **Linear Regression** model to predict `price` using:
- Encoded `room_type`
- `minimum_nights`
- `number_of_reviews`

Steps:
1. Drop rows missing required fields
2. Cast numeric columns to `double`
3. Encode `room_type` using `StringIndexer`
4. Assemble features into a single vector
5. Train/Test split and model training
6. Evaluate with RMSE

In [None]:
from pyspark.sql.functions import col, isnan
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Prepare ML dataframe
ml_df = df.dropna(subset=["room_type", "minimum_nights", "number_of_reviews", "price"])
ml_df = ml_df.withColumn("number_of_reviews", col("number_of_reviews").cast("double"))
ml_df = ml_df.withColumn("minimum_nights", col("minimum_nights").cast("double"))
ml_df = ml_df.withColumn("price", col("price").cast("double"))
ml_df = ml_df.filter((~isnan("price")) & (col("price").isNotNull()))

# Encode room_type
indexer = StringIndexer(inputCol="room_type", outputCol="room_type_index", handleInvalid="skip")
df_indexed = indexer.fit(ml_df).transform(ml_df)

df_indexed = df_indexed.filter(
    (col("minimum_nights").isNotNull()) &
    (col("number_of_reviews").isNotNull()) &
    (col("room_type_index").isNotNull()) &
    (col("price").isNotNull())
)


# Assemble features
assembler = VectorAssembler(
    inputCols=["room_type_index", "minimum_nights", "number_of_reviews"],
    outputCol="features"
)
df_vector = assembler.transform(df_indexed)
df_vector = df_vector.filter(col("features").isNotNull())

# Train/test split
train_df, test_df = df_vector.randomSplit([0.8, 0.2], seed=42)

# Train model
lr = LinearRegression(featuresCol="features", labelCol="price")
model = lr.fit(train_df)

# Evaluate
predictions = model.transform(test_df)
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print(f"RMSE: {rmse:.4f}")
print(f"Coefficients: {model.coefficients}")
print(f"Intercept: {model.intercept:.4f}")
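
To make the printed coefficients, intercept, and RMSE concrete, here is a minimal plain-Python sketch of what they mean. The coefficient, intercept, and price values are made up for illustration and are not results from this dataset. The feature order is assumed to match the `inputCols` passed to the `VectorAssembler`.

```python
import math

def predict(coefficients, intercept, features):
    """Linear regression prediction: intercept + sum(coef_i * feature_i)."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

def rmse(labels, predictions):
    """Root mean squared error between true labels and predictions."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values, ordered as
# [room_type_index, minimum_nights, number_of_reviews]
coefficients = [10.0, -0.5, 0.2]
intercept = 100.0

# Predicted price for one listing
example_price = predict(coefficients, intercept, [1.0, 3.0, 25.0])
print(f"Example prediction: {example_price:.2f}")

# RMSE over two hypothetical (true price, predicted price) pairs
example_rmse = rmse([100.0, 120.0], [110.0, 110.0])
print(f"Example RMSE: {example_rmse:.2f}")
```

RMSE is expressed in the same units as `price`, so it can be read as a typical prediction error in currency terms.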

## 9) Save & Reload the Model
Persist the trained model to Drive and show how to load it back for later use.

In [None]:
model_path = "/content/drive/MyDrive/linear_model"

# Save the model so the load below succeeds on a fresh run;
# overwrite() replaces any copy left at this path by an earlier run.
model.write().overwrite().save(model_path)

from pyspark.ml.regression import LinearRegressionModel
loaded_model = LinearRegressionModel.load(model_path)
print("Model reloaded.")

## 10) Make a Sample Prediction
Predict price for a synthetic listing described by:
- `room_type_index = 1.0`
- `minimum_nights = 3.0`
- `number_of_reviews = 25.0`

> In a production workflow, map real `room_type` strings to indices using the fitted `StringIndexerModel`.

In [None]:
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row

sample_data = [Row(features=Vectors.dense([1.0, 3.0, 25.0]))]
sample_df = spark.createDataFrame(sample_data)
prediction = loaded_model.transform(sample_df)
prediction.select("features", "prediction").show(truncate=False)
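
The note about mapping real `room_type` strings can be sketched without Spark. A fitted `StringIndexerModel` exposes its ordered labels through the `labels` attribute, and a label's position in that list is its index. The label list below is hypothetical, as is the helper `encode_room_type`. In the notebook the labels would come from the fitted indexer (for example, by keeping the result of `indexer.fit(ml_df)` in a variable), so check the real order before relying on index `1.0`.

```python
# Hypothetical label order; in practice read it from the fitted
# StringIndexerModel's `labels` attribute rather than hard-coding it.
labels = ["Entire home/apt", "Private room", "Shared room", "Hotel room"]

# A label's position in the list is the index the model was trained on.
label_to_index = {label: float(i) for i, label in enumerate(labels)}

def encode_room_type(room_type):
    """Return the numeric index for a room type, or None if it was unseen."""
    return label_to_index.get(room_type)

# Build the feature list in the same order as the VectorAssembler inputCols:
# [room_type_index, minimum_nights, number_of_reviews]
features = [encode_room_type("Private room"), 3.0, 25.0]
print(features)
```

Returning `None` for an unseen room type makes the gap explicit, so the caller can decide whether to skip the row or fall back to a default.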

---
### Next Steps
- Feature engineering (e.g., host features, location encodings)
- Robust outlier handling for `price`
- Try tree-based models (Random Forest, GBT) and compare metrics
- Cross-validation and hyperparameter tuning for better generalization