# Reading from and writing to Azure Cosmos DB

**In this lesson you:**
- Write data into Azure Cosmos DB
- Read data from Azure Cosmos DB

## Library Requirements

1. The Maven library with coordinate `com.databricks.training:databricks-cosmosdb-spark2.2.0-scala2.11:1.0.0` from the `https://files.training.databricks.com/repo` repository.
   - This library allows a Databricks `spark` session to communicate with Azure Cosmos DB.

The next cell walks you through installing the Maven library.

## Lab Setup

If you are running in an Azure Databricks environment that is already configured with the libraries you need, you can skip to the next cell. To use this notebook in your own Databricks environment, you need to create the library yourself using the [Create Library](https://docs.azuredatabricks.net/user-guide/libraries.html) interface in Azure Databricks. Follow the steps below to attach the `azure-cosmosdb-spark` library to your cluster:

1. Right-click the browser tab and select "Duplicate" to open a new tab.
1. In the left-hand navigation menu of your Databricks workspace, select **Clusters**, then select your cluster in the list. If it's not running, start it now.

  ![Select cluster](https://databricksdemostore.blob.core.windows.net/images/10-de-learning-path/select-cluster.png)

1. Select the **Libraries** tab (1), then select **Install New** (2). In the Install Library dialog, select **Maven** under Library Source (3). Under Coordinates, paste `com.databricks.training:databricks-cosmosdb-spark2.2.0-scala2.11:1.0.0` (4). Under Repository, paste `https://files.training.databricks.com/repo` (5), then select **Install** (6).
  
  ![Databricks new Maven library](https://databricksdemostore.blob.core.windows.net/images/14-de-learning-path/install-cosmosdb-spark-library.png)

1. Wait until the library successfully installs before continuing.

Once complete, return to this notebook to continue with the lesson.
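The coordinate pasted in the install dialog follows the usual Maven `groupId:artifactId:version` layout. As a minimal sketch (the `parse_maven_coordinate` helper is hypothetical and handles only this three-part form), you can split it to check each part before pasting it into the dialog:

```python
from typing import NamedTuple


class MavenCoordinate(NamedTuple):
    group_id: str
    artifact_id: str
    version: str


def parse_maven_coordinate(coordinate: str) -> MavenCoordinate:
    """Split a 'groupId:artifactId:version' string into its three parts."""
    parts = coordinate.strip().split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            "Expected 'groupId:artifactId:version', got {!r}".format(coordinate)
        )
    return MavenCoordinate(*parts)


coord = parse_maven_coordinate(
    "com.databricks.training:databricks-cosmosdb-spark2.2.0-scala2.11:1.0.0"
)
print(coord.group_id)     # com.databricks.training
print(coord.artifact_id)  # databricks-cosmosdb-spark2.2.0-scala2.11
print(coord.version)      # 1.0.0
```

The artifact name suggests a build for Spark 2.2.0 and Scala 2.11, so it is worth confirming that your cluster runtime is compatible.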

## ![Spark Logo Tiny](https://files.training.databricks.com/images/wiki-book/general/logo_spark_tiny.png) Load Azure Cosmos DB

Now load a small amount of data into Azure Cosmos DB to verify the connection.

In [0]:
%run ./Includes/Classroom-Setup

Enter your Azure Cosmos DB account information in the cell below. Be sure to replace the **"cosmos-db-uri"** and **"your-cosmos-db-key"** values with your own before executing.

In [0]:
URI = "cosmos-db-uri"
PrimaryKey = "your-cosmos-db-key"
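Rather than pasting the account key into a notebook cell, you may prefer to load it from outside the notebook. The sketch below reads both values from environment variables; the names `COSMOS_URI` and `COSMOS_PRIMARY_KEY` and the `load_cosmos_credentials` helper are assumptions made for this example, not part of the lesson.

```python
import os


def load_cosmos_credentials(env=os.environ):
    """Return (uri, primary_key) read from a mapping of environment variables.

    COSMOS_URI and COSMOS_PRIMARY_KEY are hypothetical names chosen for
    this sketch.
    """
    names = ("COSMOS_URI", "COSMOS_PRIMARY_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return env["COSMOS_URI"], env["COSMOS_PRIMARY_KEY"]


# An explicit mapping is passed here so the sketch runs anywhere;
# omit the argument to read from the real process environment.
example_env = {
    "COSMOS_URI": "https://<your-account>.documents.azure.com:443/",
    "COSMOS_PRIMARY_KEY": "<your-cosmos-db-key>",
}
URI, PrimaryKey = load_cosmos_credentials(example_env)
print(URI)
```

In a Databricks workspace, a secret scope read with `dbutils.secrets.get(scope=..., key=...)` serves the same purpose and keeps the key out of the notebook source.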

<span>1.</span> Enter the Azure Cosmos DB connection information in the cell below.

In [0]:
CosmosDatabase = "AdventureWorks"
CosmosCollection = "ratings"

cosmosConfig = {
  "Endpoint": URI,
  "Masterkey": PrimaryKey,
  "Database": CosmosDatabase,
  "Collection": CosmosCollection,
  "Upsert": "false"
}
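The configuration is a plain dictionary of string options that is later unpacked into `.options(**cosmosConfig)`. The following sketch builds such a dictionary through a small helper that rejects empty values. `build_cosmos_config` is a hypothetical helper, and the endpoint and key shown are placeholders.

```python
def build_cosmos_config(endpoint, master_key, database, collection,
                        upsert=False, **extra):
    """Build an options dictionary using the keys from this lesson.

    Values are stored as strings because they are passed to Spark as
    data source options.
    """
    required = {
        "Endpoint": endpoint,
        "Masterkey": master_key,
        "Database": database,
        "Collection": collection,
    }
    empty = [key for key, value in required.items() if not value]
    if empty:
        raise ValueError("Missing values for: " + ", ".join(empty))
    config = dict(required)
    config["Upsert"] = "true" if upsert else "false"
    # Any additional options (for example query_custom) are added as strings.
    config.update({key: str(value) for key, value in extra.items()})
    return config


example_config = build_cosmos_config(
    endpoint="https://<your-account>.documents.azure.com:443/",  # placeholder
    master_key="<your-cosmos-db-key>",                            # placeholder
    database="AdventureWorks",
    collection="ratings",
)
print(example_config["Upsert"])  # false
```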

<span>2.</span> Read the input parquet file.

In [0]:
from pyspark.sql.functions import col
ratingsDF = (spark.read
  .parquet("dbfs:/mnt/training/initech/ratings/ratings.parquet/")
  .withColumn("rating", col("rating").cast("double")))
print("Num Rows: {}".format(ratingsDF.count()))

In [0]:
display(ratingsDF)

product_id,user_id,rating
2,1,3.5
29,1,3.5
32,1,3.5
31,1,3.5
29,1,4.0
3,2,4.0
1,3,4.0
24,3,3.0
32,3,4.0
31,3,5.0


<span>3.</span> Write the data to Azure Cosmos DB.

In [0]:
ratingsSampleDF = ratingsDF.sample(.0001)
(ratingsSampleDF.write
  .mode("overwrite")
  .format("com.microsoft.azure.cosmosdb.spark")
  .options(**cosmosConfig)
  .save())
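Note that `sample(.0001)` treats the fraction as a per-row probability, so the number of rows written is close to, but not exactly, 0.01% of the input. The plain-Python sketch below imitates that behavior without Spark; `bernoulli_sample` is a hypothetical stand-in and does not reproduce Spark's own random number generator.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(1000000))
sample_a = bernoulli_sample(rows, 0.0001, seed=42)
sample_b = bernoulli_sample(rows, 0.0001, seed=42)

# The size is near 100 (0.0001 * 1,000,000) but is not guaranteed to be exact.
print(len(sample_a))
# Using the same seed reproduces the same sample.
print(sample_a == sample_b)  # True
```

If you need a reproducible sample in the notebook, passing a `seed` argument to `DataFrame.sample` plays the same role as the seed here.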


<span>4.</span> Confirm that your data is now in Azure Cosmos DB.

In [0]:
dfCosmos = (spark.read
  .format("com.microsoft.azure.cosmosdb.spark")
  .options(**cosmosConfig)
  .load())
dfCosmos.count()


In [0]:
ratingsSampleDF.createOrReplaceTempView('sampleDF')

In [0]:
# Prepare the read configuration
CosmosDatabase = "AdventureWorks"
CosmosCollection = "ratings"
query = "SELECT * FROM c"

readConfig = {
  "Endpoint": URI,
  "Masterkey": PrimaryKey,
  "Database": CosmosDatabase,
  "Collection": CosmosCollection,
  "query_custom": query
}
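The `query_custom` option carries the query text sent to Azure Cosmos DB, so the data can be filtered at the source instead of after loading. As a minimal sketch, assuming you only want ratings at or above a threshold (`build_ratings_query` is a hypothetical helper, and the endpoint and key are placeholders):

```python
def build_ratings_query(min_rating=None):
    """Return a Cosmos DB SQL query, optionally filtered by a minimum rating."""
    query = "SELECT * FROM c"
    if min_rating is not None:
        # float() ensures only a number is ever interpolated into the query.
        query += " WHERE c.rating >= {}".format(float(min_rating))
    return query


base_config = {
    "Endpoint": "https://<your-account>.documents.azure.com:443/",  # placeholder
    "Masterkey": "<your-cosmos-db-key>",                            # placeholder
    "Database": "AdventureWorks",
    "Collection": "ratings",
}

# Reuse the shared settings and add only the read-specific option.
filtered_read_config = dict(base_config, query_custom=build_ratings_query(4))
print(filtered_read_config["query_custom"])  # SELECT * FROM c WHERE c.rating >= 4.0
```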

In [0]:
# Read data from Azure Cosmos DB using the custom query
dfCosmos = (spark.read
  .format("com.microsoft.azure.cosmosdb.spark")
  .options(**readConfig)
  .load())
dfCosmos.createOrReplaceTempView('vwCosmos')

In [0]:
%sql
SELECT COUNT(*) FROM vwCosmos

count(1)
170


In [0]:
dfCosmos.describe()

Example document as stored in the collection:

{
    "user_id": 77021,
    "product_id": 27,
    "rating": 5,
    "id": "e5829190-6517-4c9f-95f5-6e2757493f17",
    "_rid": "y+gJAOPKBIQBAAAAAAAAAA==",
    "_self": "dbs/y+gJAA==/colls/y+gJAOPKBIQ=/docs/y+gJAOPKBIQBAAAAAAAAAA==/",
    "_etag": "\"00006305-0000-0c00-0000-5fdfc7a00000\"",
    "_attachments": "attachments/",
    "_ts": 1608501152
}
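In the sample document above, the fields beginning with an underscore (`_rid`, `_self`, `_etag`, `_attachments`, `_ts`) are system properties that Azure Cosmos DB generates, while `id` and the rating fields are the document's own data. The sketch below separates the two; `strip_system_properties` is a hypothetical helper written for this example.

```python
SYSTEM_PROPERTIES = {"_rid", "_self", "_etag", "_attachments", "_ts"}


def strip_system_properties(document):
    """Return a copy of the document without Cosmos DB system properties."""
    return {key: value for key, value in document.items()
            if key not in SYSTEM_PROPERTIES}


document = {
    "user_id": 77021,
    "product_id": 27,
    "rating": 5,
    "id": "e5829190-6517-4c9f-95f5-6e2757493f17",
    "_rid": "y+gJAOPKBIQBAAAAAAAAAA==",
    "_self": "dbs/y+gJAA==/colls/y+gJAOPKBIQ=/docs/y+gJAOPKBIQBAAAAAAAAAA==/",
    "_etag": "\"00006305-0000-0c00-0000-5fdfc7a00000\"",
    "_attachments": "attachments/",
    "_ts": 1608501152,
}

print(sorted(strip_system_properties(document)))
# ['id', 'product_id', 'rating', 'user_id']
```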

In [0]:
%sql
-- UPDATE is not supported, so create a new view with the updated rating values
CREATE OR REPLACE TEMPORARY VIEW updatedCosmos
AS
SELECT user_id, product_id, 5 AS rating, _self, id, _rid, _etag, _attachments, _ts
FROM vwCosmos

In [0]:
# Load the updated view as a DataFrame
updatedDF = spark.table('updatedCosmos')
updatedDF.describe()

In [0]:
display(updatedDF)

user_id,product_id,rating,_self,id,_rid,_etag,_attachments,_ts
77021,27,5,dbs/y+gJAA==/colls/y+gJAOPKBIQ=/docs/y+gJAOPKBIQBAAAAAAAAAA==/,e5829190-6517-4c9f-95f5-6e2757493f17,y+gJAOPKBIQBAAAAAAAAAA==,"""00006305-0000-0c00-0000-5fdfc7a00000""",attachments/,1608501152
74847,6,5,dbs/y+gJAA==/colls/y+gJAOPKBIQ=/docs/y+gJAOPKBIQCAAAAAAAAAA==/,479c517a-797f-48dd-bae4-ac7549871ed2,y+gJAOPKBIQCAAAAAAAAAA==,"""00006605-0000-0c00-0000-5fdfc7a00000""",attachments/,1608501152
76684,29,5,dbs/y+gJAA==/colls/y+gJAOPKBIQ=/docs/y+gJAOPKBIQDAAAAAAAAAA==/,2d239e16-b4ae-4e9f-93e2-744850fc6278,y+gJAOPKBIQDAAAAAAAAAA==,"""00006705-0000-0c00-0000-5fdfc7a00000""",attachments/,1608501152
72506,31,5,dbs/y+gJAA==/colls/y+gJAOPKBIQ=/docs/y+gJAOPKBIQEAAAAAAAAAA==/,fa32467c-6f04-4f62-865c-4ac0548f2943,y+gJAOPKBIQEAAAAAAAAAA==,"""00006805-0000-0c00-0000-5fdfc7a00000""",attachments/,1608501152
78984,32,5,dbs/y+gJAA==/colls/y+gJAOPKBIQ=/docs/y+gJAOPKBIQFAAAAAAAAAA==/,fe6f52e4-da80-493d-813c-83e1234d34e7,y+gJAOPKBIQFAAAAAAAAAA==,"""00006905-0000-0c00-0000-5fdfc7a00000""",attachments/,1608501152
101492,6,5,dbs/y+gJAA==/colls/y+gJAOPKBIQ=/docs/y+gJAOPKBIQGAAAAAAAAAA==/,febd1e0e-d84a-42cd-b86f-c0018a9fa4c5,y+gJAOPKBIQGAAAAAAAAAA==,"""00006a05-0000-0c00-0000-5fdfc7a00000""",attachments/,1608501152
48731,34,5,dbs/y+gJAA==/colls/y+gJAOPKBIQ=/docs/y+gJAOPKBIQHAAAAAAAAAA==/,ef395fd5-3ce3-4f46-8710-15d3d9684dd8,y+gJAOPKBIQHAAAAAAAAAA==,"""00006b05-0000-0c00-0000-5fdfc7a00000""",attachments/,1608501152
89569,1,5,dbs/y+gJAA==/colls/y+gJAOPKBIQ=/docs/y+gJAOPKBIQIAAAAAAAAAA==/,554c324c-d399-453e-81f5-c41d1b394012,y+gJAOPKBIQIAAAAAAAAAA==,"""00006c05-0000-0c00-0000-5fdfc7a00000""",attachments/,1608501152
44507,32,5,dbs/y+gJAA==/colls/y+gJAOPKBIQ=/docs/y+gJAOPKBIQJAAAAAAAAAA==/,706d2bd4-c5a0-4cfb-8e96-ab8b06b22ad3,y+gJAOPKBIQJAAAAAAAAAA==,"""00006d05-0000-0c00-0000-5fdfc7a00000""",attachments/,1608501152
42921,1,5,dbs/y+gJAA==/colls/y+gJAOPKBIQ=/docs/y+gJAOPKBIQKAAAAAAAAAA==/,dbe3d852-2a8f-472c-97b6-5ccc8107fe71,y+gJAOPKBIQKAAAAAAAAAA==,"""00006e05-0000-0c00-0000-5fdfc7a00000""",attachments/,1608501152


In [0]:
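# Write the documents back to Azure Cosmos DB with Upsert enabled.
# NOTE: upsert may not work as expected; in the best case it pushes new data.
# updatedDF (not dfCosmos) is written so that the modified ratings are sent.
# Only id and the data columns are selected; system properties are maintained by the service.
writeConfig = {
  "Endpoint": URI,
  "Masterkey": PrimaryKey,
  "Database": CosmosDatabase,
  "Collection": CosmosCollection,
  "writingBatchSize": "100",
  "Upsert": "true"
}

(updatedDF
  .select("id", "user_id", "product_id", "rating")
  .write
  .format("com.microsoft.azure.cosmosdb.spark")
  .mode("overwrite")
  .options(**writeConfig)
  .save())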
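To make the intent of `"Upsert": "true"` concrete, here is a minimal in-memory sketch of upsert semantics: a document whose `id` already exists replaces the stored one, and any other document is inserted. It models the idea only and does not call the connector; the `upsert` function and the `"new-document-id"` value are hypothetical.

```python
def upsert(collection, documents):
    """Insert or replace documents in a dict keyed by document id.

    Returns a tuple (inserted, replaced) with the number of each.
    """
    inserted = replaced = 0
    for doc in documents:
        if doc["id"] in collection:
            replaced += 1
        else:
            inserted += 1
        collection[doc["id"]] = dict(doc)
    return inserted, replaced


existing_id = "e5829190-6517-4c9f-95f5-6e2757493f17"
collection = {
    existing_id: {"id": existing_id, "user_id": 77021, "product_id": 27, "rating": 3.5},
}

changes = [
    {"id": existing_id, "user_id": 77021, "product_id": 27, "rating": 5},
    {"id": "new-document-id", "user_id": 1, "product_id": 2, "rating": 4},
]

print(upsert(collection, changes))        # (1, 1)
print(collection[existing_id]["rating"])  # 5
```

Under this model, writing back rows that keep their original `id` updates the documents in place, while rows without a matching `id` are added as new documents.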