## Prerequisites
- Azure Databricks Runtime 8.0 with Spark 3.1.1
- Install the Cosmos DB Spark Connector on your Spark cluster
  - https://search.maven.org/artifact/com.azure.cosmos.spark/azure-cosmos-spark_3-1_2-12/4.1.0/jar

## Create databases and containers
- First, set the Cosmos DB account credentials, the database name, and the container name.

In [0]:
cosmosEndpoint = "https://cosmosdbatin.documents.azure.com:443/"
cosmosMasterKey = "uWscEe78JP8Kw1Q6TxohCWur10aeG8nOXgrajYiiuxR13QTYpiVoBZRgJVk9Lu6EyHP9tzkQQNwfNGYzfn9s1w=="
cosmosDatabaseName = "sampleDB"
cosmosContainerName = "sampleContainer"

cfg = {
  "spark.cosmos.accountEndpoint" : cosmosEndpoint,
  "spark.cosmos.accountKey" : cosmosMasterKey,
  "spark.cosmos.database" : cosmosDatabaseName,
  "spark.cosmos.container" : cosmosContainerName,
}
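
Hardcoding the account key in a notebook cell is convenient for a demo but easy to leak. As a minimal sketch, the same options dictionary could be built from environment variables instead. The helper name `build_cosmos_config` and the `COSMOS_*` variable names are assumptions made for illustration; they are not part of the connector.

```python
import os

# Connector option -> environment variable that supplies its value.
# The variable names are hypothetical; pick whatever suits your deployment.
OPTION_TO_ENV_VAR = {
    "spark.cosmos.accountEndpoint": "COSMOS_ENDPOINT",
    "spark.cosmos.accountKey": "COSMOS_KEY",
    "spark.cosmos.database": "COSMOS_DATABASE",
    "spark.cosmos.container": "COSMOS_CONTAINER",
}


def build_cosmos_config(env=None):
    """Return the connector options dict, reading values from `env`.

    `env` defaults to the process environment; any mapping works, which
    makes the helper easy to exercise without real credentials.
    """
    if env is None:
        env = os.environ
    missing = [var for var in OPTION_TO_ENV_VAR.values() if var not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {option: env[var] for option, var in OPTION_TO_ENV_VAR.items()}


# Example with a stand-in mapping and placeholder values.
example_env = {
    "COSMOS_ENDPOINT": "https://example.documents.azure.com:443/",
    "COSMOS_KEY": "<account-key>",
    "COSMOS_DATABASE": "sampleDB",
    "COSMOS_CONTAINER": "sampleContainer",
}
example_cfg = build_cosmos_config(example_env)
```

Calling `build_cosmos_config()` with no argument reads the real environment and would produce a dictionary with the same keys as `cfg` above. On Databricks, a secret scope is another common place to keep the key.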

- Next, use the new Catalog API to create a Cosmos DB Database and Container through Spark.

In [0]:
# Register the Cosmos DB catalog and give it the account credentials
spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", cosmosEndpoint)
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountKey", cosmosMasterKey)

# Create a Cosmos DB database using the Catalog API
spark.sql("CREATE DATABASE IF NOT EXISTS cosmosCatalog.{};".format(cosmosDatabaseName))

# Create a Cosmos DB container using the Catalog API, partitioned on /id with 1100 RU/s of manual throughput
spark.sql("CREATE TABLE IF NOT EXISTS cosmosCatalog.{}.{} USING cosmos.oltp TBLPROPERTIES(partitionKeyPath = '/id', manualThroughput = '1100')".format(cosmosDatabaseName, cosmosContainerName))
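
The `CREATE TABLE` statement above packs the partition key path and throughput into one long format string. A small sketch of a helper that assembles the same statement from named arguments may make it easier to vary those settings. The function `create_container_sql` and its parameters are hypothetical, introduced only for this example; the two table properties are the ones used in the cell above.

```python
def create_container_sql(database, container,
                         partition_key_path="/id",
                         manual_throughput=1100,
                         catalog="cosmosCatalog"):
    """Build the CREATE TABLE statement used to create a container."""
    # Table properties are rendered as key = 'value' pairs, in this order.
    properties = {
        "partitionKeyPath": partition_key_path,
        "manualThroughput": str(manual_throughput),
    }
    rendered = ", ".join(
        "{} = '{}'".format(key, value) for key, value in properties.items()
    )
    return (
        "CREATE TABLE IF NOT EXISTS {}.{}.{} USING cosmos.oltp "
        "TBLPROPERTIES({})".format(catalog, database, container, rendered)
    )


statement = create_container_sql("sampleDB", "sampleContainer")
```

The resulting string can be passed to `spark.sql(...)` in place of the inline format call. The helper does not validate or escape its arguments, so it should only be fed trusted names.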

## Ingesting data

- Write an in-memory DataFrame consisting of two items to Cosmos DB:

In [0]:
spark.createDataFrame((("cat-alive", "Schrodinger cat", 2, True), ("cat-dead", "Schrodinger cat", 2, False)))\
  .toDF("id", "name", "age", "isAlive")\
  .write\
  .format("cosmos.oltp")\
  .options(**cfg)\
  .mode("APPEND")\
  .save()

## Querying data
- Using the same cosmos.oltp data source, we can query data; predicates passed to filter are pushed down to Cosmos DB:

In [0]:
from pyspark.sql.functions import col

df = spark.read.format("cosmos.oltp").options(**cfg)\
 .option("spark.cosmos.read.inferSchema.enabled", "true")\
 .load()

df.filter(col("isAlive") == True)\
 .show()

## Schema inference
- When querying data, the Spark Connector can infer the schema by sampling existing items when spark.cosmos.read.inferSchema.enabled is set to true.

In [0]:
df = spark.read.format("cosmos.oltp").options(**cfg)\
 .option("spark.cosmos.read.inferSchema.enabled", "true")\
 .load()
 
df.printSchema()
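
To give a feel for what inferring a schema from sampled items means, here is a conceptual sketch in plain Python. It is not the connector's implementation: the function `infer_fields`, the type names, and the rule of falling back to `string` when sampled items disagree are all assumptions made for this illustration.

```python
# Python type -> illustrative type name. bool has its own entry because
# the lookup uses the exact type, so True is not mistaken for an int.
TYPE_NAMES = {str: "string", bool: "boolean", int: "long", float: "double"}


def infer_fields(sampled_items):
    """Merge the fields seen across sampled items into one flat schema."""
    fields = {}
    for item in sampled_items:
        for name, value in item.items():
            type_name = TYPE_NAMES.get(type(value), "string")
            # The first sighting records the type; a later mismatch
            # widens the field to string (an assumption of this sketch).
            if fields.setdefault(name, type_name) != type_name:
                fields[name] = "string"
    return fields


sample = [
    {"id": "cat-alive", "name": "Schrodinger cat", "age": 2, "isAlive": True},
    {"id": "cat-dead", "name": "Schrodinger cat", "age": 2, "isAlive": False},
]
inferred = infer_fields(sample)
```

Because the result depends on which items happen to be sampled, a field that appears only in unsampled items would be missing from a schema derived this way.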

- Alternatively, you can pass a custom schema to use when reading the data:

In [0]:
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, BooleanType
customSchema = StructType([
      StructField("id", StringType()),
      StructField("name", StringType()),
      StructField("type", StringType()),
      StructField("age", IntegerType()),
      StructField("isAlive", BooleanType())
    ])

df = spark.read.schema(customSchema).format("cosmos.oltp").options(**cfg)\
 .load()
 
df.printSchema()