## MongoDB Atlas Spark Streaming

The following illustrates how to set up continuous streaming of data from MongoDB to Databricks Delta Lake.

### Create a Databricks Cluster and Add the Connector as a Library

1. Create a Databricks cluster.
2. Download the latest version (>10.1) of the MongoDB Connector for Apache Spark JAR file from [Maven Central](https://repo1.maven.org/maven2/org/mongodb/spark/).
   **Note**: Make sure the package is built for the same Scala version as the Databricks cluster.
3. Navigate to the cluster detail page and select the **Libraries** tab.
4. Click the **Install New** button.
5. Select **Upload** as the Library Source and **JAR** as the library type.
6. Drop in the MongoDB connector JAR downloaded in step 2.
7. Click **Install**.

For more information on the MongoDB Spark Connector (which now supports Structured Streaming), see the [MongoDB documentation](https://www.mongodb.com/docs/spark-connector/current/).
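
The Scala-version note in step 2 can be checked mechanically. Connector artifacts on Maven Central are named along the lines of `mongo-spark-connector_<scala>-<version>.jar`, so a small helper can compare the Scala suffix with the Scala version shown for the cluster's Databricks runtime. This is a minimal sketch: the helper name `jar_matches_scala` and the file names passed to it are illustrative assumptions, not part of the connector.

```python
import re

# Artifact names look like mongo-spark-connector_2.12-10.1.1.jar:
# Scala binary version after the underscore, connector version after the dash.
# An optional classifier suffix (for example "-all") is tolerated.
_JAR_PATTERN = re.compile(
    r"^mongo-spark-connector_(?P<scala>\d+\.\d+)"
    r"-(?P<connector>\d+(?:\.\d+)*)(?:-\w+)?\.jar$"
)

def jar_matches_scala(jar_name: str, cluster_scala: str) -> bool:
    """Return True if the JAR was built for the cluster's Scala binary version."""
    match = _JAR_PATTERN.match(jar_name)
    if match is None:
        raise ValueError(f"Unrecognised connector JAR name: {jar_name!r}")
    # Compare only major.minor (e.g. 2.12), ignoring the patch level.
    cluster_binary = ".".join(cluster_scala.split(".")[:2])
    return match.group("scala") == cluster_binary

# Hypothetical file names and cluster Scala version, for illustration only.
print(jar_matches_scala("mongo-spark-connector_2.12-10.1.1.jar", "2.12.14"))
print(jar_matches_scala("mongo-spark-connector_2.13-10.1.1.jar", "2.12.14"))
```

A mismatch here usually surfaces later as class-loading errors when the stream starts, so it is cheaper to catch before uploading the JAR.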

### Create a MongoDB Atlas Instance

Atlas is a fully managed, cloud-based MongoDB service. We'll use Atlas to test the integration between MongoDB and Spark.

1. Sign up for [MongoDB Atlas](https://www.mongodb.com/atlas/database). 
2. [Create an Atlas free tier cluster](https://docs.atlas.mongodb.com/getting-started/).
3. Enable Databricks clusters to connect to the Atlas cluster by adding the external IP addresses of the Databricks cluster nodes to the [whitelist in Atlas](https://docs.atlas.mongodb.com/setup-cluster-security/#add-ip-addresses-to-the-whitelist). For convenience, you can (**temporarily!**) allow access from anywhere, though we recommend enabling [network peering](https://www.mongodb.com/docs/atlas/security-vpc-peering/) for production.

### Prep MongoDB with a sample data-set 

MongoDB Atlas comes with a sample data-set that lets you get started quickly. We will use it throughout this notebook.

1. In MongoDB Atlas, [load the sample data-set](https://www.mongodb.com/docs/charts/tutorial/order-data/prerequisites-setup/) once the cluster is up and running.
2. You can confirm the presence of the data-set via the **Browse Collections** button in the Atlas UI.

### Update Spark Configuration with the Atlas Connection String


1. Note the connection string shown in the **Connect** dialog in MongoDB Atlas. It has the form `mongodb+srv://<username>:<password>@<clustername>.xxxxx.mongodb.net/`.
2. Update the MongoDB connection string below.

In [0]:
connectionString='mongodb+srv://CONNECTION_STRING_HERE/'
database="sample_supplies"
collection="sales"
destination_table="deltalake_sales"
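
Usernames and passwords that contain characters such as `@`, `:` or `/` need to be percent-encoded before they are placed in a MongoDB connection string. A minimal sketch of assembling the string that way follows; the helper name `build_atlas_uri`, the credentials and the host are all hypothetical placeholders.

```python
from urllib.parse import quote_plus

def build_atlas_uri(username: str, password: str, host: str) -> str:
    """Assemble a mongodb+srv URI, percent-encoding the credentials."""
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}/"

# Hypothetical credentials and host, for illustration only.
connectionString = build_atlas_uri(
    "spark_user", "p@ss:word/1", "cluster0.xxxxx.mongodb.net"
)
print(connectionString)
```

In a shared notebook the password would more sensibly come from a secret store than from a literal in a cell.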

#### Create a Spark Readstream

To stream changes from MongoDB into Spark, create a streaming read with `spark.readStream` and the `mongodb` format provided by the MongoDB Connector.

In [0]:
# Streaming DataFrame backed by a MongoDB change stream on the collection.
query = (
    spark.readStream.format("mongodb")
    .option("spark.mongodb.connection.uri", connectionString)
    .option("spark.mongodb.database", database)
    .option("spark.mongodb.collection", collection)
    # Publish only the changed document, not the full change event.
    .option("spark.mongodb.change.stream.publish.full.document.only", "true")
    .option("forceDeleteTempCheckpointLocation", "true")
    .load()
)
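
The `change.stream.publish.full.document.only` option decides what each streamed row contains. A MongoDB change event wraps the changed document in metadata (`operationType`, `ns`, `documentKey` and so on); with the option set to `true`, only the `fullDocument` value is passed on. The plain-Python sketch below mimics that behaviour on hand-written events. It is an approximation for illustration, not connector code: `published_payload` is a hypothetical helper, and the handling of events without a `fullDocument` (such as deletes) is an assumption about the connector.

```python
from typing import Any, Dict, Optional

def published_payload(
    event: Dict[str, Any], full_document_only: bool
) -> Optional[Dict[str, Any]]:
    """Approximate what reaches Spark for one change event (illustrative only)."""
    if not full_document_only:
        return event  # the whole change event, including metadata
    # With the flag on, only the fullDocument value is forwarded; events
    # without one (for example deletes) are assumed to be skipped.
    return event.get("fullDocument")

# Hand-written events shaped like MongoDB change stream documents.
insert_event = {
    "operationType": "insert",
    "ns": {"db": "sample_supplies", "coll": "sales"},
    "documentKey": {"_id": 1},
    "fullDocument": {"_id": 1, "storeLocation": "Denver", "purchaseMethod": "Online"},
}
delete_event = {
    "operationType": "delete",
    "ns": {"db": "sample_supplies", "coll": "sales"},
    "documentKey": {"_id": 1},
}

print(published_payload(insert_event, full_document_only=True))
print(published_payload(delete_event, full_document_only=True))
print(published_payload(insert_event, full_document_only=False)["operationType"])
```

With the option on, the Delta table ends up with the same columns as the source documents rather than change-event metadata, which is what the rest of this notebook relies on.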

#### Write data to Delta Lake

Once you have the data in a streaming DataFrame, use the `writeStream` method to write it to Delta Lake, in a table called `deltalake_sales`.

In [0]:
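# Append each micro-batch to the Delta table; the checkpoint lets the stream resume after a restart.
(
    query.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/delta/_checkpoint/")
    .option("path", "/delta/deltalake_sales")
    .toTable(destination_table)
)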
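
Each streaming query needs a checkpoint location of its own, so if more than one collection is streamed it helps to derive the checkpoint path from the target table name instead of reusing a single directory. The sketch below shows one way to do that; `checkpoint_path_for`, `start_delta_sink` and the root directory are hypothetical names chosen for illustration, and `toTable` assumes Spark 3.1 or later.

```python
import posixpath

def checkpoint_path_for(table_name: str, root: str = "/tmp/delta/_checkpoints") -> str:
    """Derive a checkpoint directory that is unique to one target table."""
    return posixpath.join(root, table_name)

def start_delta_sink(stream_df, table_name: str):
    """Start an append-mode stream into a Delta table and return the query handle."""
    return (
        stream_df.writeStream.format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path_for(table_name))
        .toTable(table_name)
    )

print(checkpoint_path_for("deltalake_sales"))
```

In the notebook this would be called as `streaming_query = start_delta_sink(query, destination_table)`; the returned handle exposes `streaming_query.status` for monitoring and `streaming_query.stop()` for shutting the stream down.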

### Test data streaming from MongoDB Atlas to the Databricks platform
You are now ready to test streaming data from MongoDB Atlas to the Databricks platform.

1. Execute this notebook on the cluster you created.
2. Add or modify any document in the MongoDB collection `sales`.
3. You should see the corresponding changes reflected in the `deltalake_sales` Delta Lake table on the Databricks platform.
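
Step 2 can also be done from code rather than through the Atlas UI. The sketch below builds a small document that loosely follows the shape of the `sample_supplies.sales` documents and inserts it with `pymongo`. It assumes `pymongo` is installed on the cluster (it is not part of the Spark connector); the helper names and the `"Stream Test"` marker value are hypothetical.

```python
from datetime import datetime, timezone

def make_test_sale(store_location: str) -> dict:
    """Build a small document shaped loosely like those in sample_supplies.sales."""
    return {
        "saleDate": datetime.now(timezone.utc),
        "storeLocation": store_location,  # distinctive value makes the row easy to find
        "purchaseMethod": "Online",
        "couponUsed": False,
        "items": [{"name": "notepad", "quantity": 2}],
    }

def insert_test_sale(connection_string: str, store_location: str):
    """Insert one test document and return its _id; requires the pymongo package."""
    from pymongo import MongoClient  # imported lazily so make_test_sale works without it

    client = MongoClient(connection_string)
    try:
        collection = client["sample_supplies"]["sales"]
        return collection.insert_one(make_test_sale(store_location)).inserted_id
    finally:
        client.close()

doc = make_test_sale("Stream Test")
print(sorted(doc))
```

After calling `insert_test_sale(connectionString, "Stream Test")`, a query such as `spark.table("deltalake_sales").filter("storeLocation = 'Stream Test'")` should return the new row once the next micro-batch has been processed.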

## More Info

- Discover more about the configuration of the MongoDB Spark Connector [here](https://www.mongodb.com/docs/spark-connector/current/).