# Document Database Layer
## Project: Beauty Lakehouse Analytics Platform

In this notebook, we implement the **Document-Oriented Database layer** of our multi-model architecture using MongoDB Atlas.

### Objectives:
- Load curated Delta tables from the Data Lake (01_data_ingestion)
- Transform structured Spark DataFrames into nested JSON documents
- Store documents in MongoDB Atlas
- Query MongoDB collections
- Understand the role of document databases in multi-model analytics

This layer complements the Data Lake ingestion layer from 01_data_ingestion.


## MongoDB Overview

MongoDB is a NoSQL document database that stores data as flexible, JSON-like documents in a binary format called BSON.

### Key characteristics:
- Schema flexibility
- Natural support for nested objects and arrays
- Horizontal scalability
- Ideal for semi-structured and evolving data
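
As a rough illustration of the nested objects and arrays mentioned above, the sketch below folds flat order and order-item rows into one document per order. The field names (`order_id`, `customer_id`, `product_id`, `quantity`) and the helper `build_order_documents` are assumptions made for illustration; the actual table schemas are not shown in this notebook.

```python
from collections import defaultdict


def build_order_documents(orders, order_items):
    """Embed each order's line items as an array inside the order document."""
    # Group line items by their parent order (hypothetical key: order_id).
    items_by_order = defaultdict(list)
    for item in order_items:
        # Drop the foreign key; it is implied by the enclosing document.
        embedded = {k: v for k, v in item.items() if k != "order_id"}
        items_by_order[item["order_id"]].append(embedded)

    # One document per order, with an "items" array (empty if no items).
    return [
        {**order, "items": items_by_order.get(order["order_id"], [])}
        for order in orders
    ]


# Tiny made-up rows, standing in for rows collected from the Delta tables.
orders = [{"order_id": 1, "customer_id": 10}, {"order_id": 2, "customer_id": 11}]
order_items = [
    {"order_id": 1, "product_id": 100, "quantity": 2},
    {"order_id": 1, "product_id": 101, "quantity": 1},
]

documents = build_order_documents(orders, order_items)
```

In Spark, a similar shape is typically produced with `groupBy` and `collect_list(struct(...))` before writing, which avoids collecting rows to the driver.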

In this project:
- Data Lake (Delta tables) → structured storage
- MongoDB → semi-structured document layer
- Warehouse → analytical layer
- Graph DB → relationship analytics

This notebook implements the **Document Database layer** of our Lakehouse.


In [0]:
%pip install pymongo certifi pandas

## Load Curated Data from Data Lake (Delta Tables)

We load the cleaned and validated Delta tables created in 01_data_ingestion.


In [0]:
spark.catalog.setCurrentDatabase("dm_project")

customers_df = spark.table("customers")
products_df = spark.table("products")
orders_df = spark.table("orders")
order_items_df = spark.table("order_items")


## Data Validation Before Document Modeling

We verify row counts and schemas to ensure data integrity before transforming into nested documents.
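
Beyond row counts, two checks that are often useful before embedding child rows into parent documents are sketched below: confirming that expected columns exist, and finding child rows whose foreign key has no parent. Both helpers (`missing_columns`, `orphan_keys`) and the column names are hypothetical; they operate on plain Python lists so they can be tried without a cluster.

```python
def missing_columns(actual_columns, required_columns):
    """Return the required columns that are absent, in the order requested."""
    present = set(actual_columns)
    return [c for c in required_columns if c not in present]


def orphan_keys(child_keys, parent_keys):
    """Return child foreign-key values that have no matching parent key."""
    parents = set(parent_keys)
    return sorted(set(child_keys) - parents)


# Made-up values standing in for orders_df.columns and collected key columns.
print(missing_columns(["order_id", "customer_id"], ["order_id", "order_date"]))
print(orphan_keys([1, 1, 2, 7], [1, 2, 3]))
```

With the real tables, `missing_columns(orders_df.columns, [...])` would be the natural call; for large tables an anti-join in Spark is a better fit for the orphan check than collecting keys.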


In [0]:
print("Customers:", customers_df.count())
print("Products:", products_df.count())
print("Orders:", orders_df.count())
print("Order Items:", order_items_df.count())

orders_df.printSchema()
order_items_df.printSchema()


## Connect to MongoDB Atlas

We connect to MongoDB Atlas with PyMongo and set the connection URIs used by the MongoDB Spark Connector.



In [0]:
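# Restart the Python process so the packages installed with %pip are importable.
# This clears Python variables, so the DataFrames are reloaded in a later cell.
dbutils.library.restartPython()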

In [0]:
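from urllib.parse import quote_plus
from pymongo import MongoClient
import certifi

# Password comes from a Databricks secret, not the notebook source.
# The scope and key names are placeholder assumptions; substitute your own.
mongo_password = quote_plus(dbutils.secrets.get(scope="dm_project", key="mongo_password"))
mongo_uri = f"mongodb+srv://zinahpoulus_db_user:{mongo_password}@cluster0.7yguf6a.mongodb.net/"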

client = MongoClient(mongo_uri, tlsCAFile=certifi.where())
db = client["beauty_lakehouse_db"]

client.admin.command("ping")
print("Connected to MongoDB Atlas!")


In [0]:
spark.conf.set("spark.mongodb.write.connection.uri", mongo_uri)
spark.conf.set("spark.mongodb.read.connection.uri", mongo_uri)


In [0]:
# Reload the DataFrames: restartPython() cleared the earlier Python variables.
customers_df = spark.table("customers")
products_df = spark.table("products")
orders_df = spark.table("orders")
order_items_df = spark.table("order_items")

In [0]:
# Write customers to the "Customers" collection; overwrite replaces existing contents.
customers_df.write \
    .format("mongodb") \
    .option("database", "beauty_lakehouse_db") \
    .option("collection", "Customers") \
    .mode("overwrite") \
    .save()