## Configuration

In [74]:
from demolib import Namespace
app_config = Namespace({
    'database': 'test',
    'collection': 'contacts',
    'mongo_uri': 'mongodb://127.0.0.1',
    'mongo_uri_collection': 'mongodb://127.0.0.1/test.contacts',
    'spark_packages': ['org.mongodb.spark:mongo-spark-connector_2.11:2.4.1']
})
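The `mongo_uri_collection` value repeats what `mongo_uri`, `database`, and `collection` already say. As a minimal sketch, it could be derived instead of typed by hand. The `build_collection_uri` helper below is a hypothetical name, not part of `demolib` or the connector, and it assumes the base URI has no path or query string.

```python
def build_collection_uri(base_uri, database, collection):
    """Join a base MongoDB URI with a `database.collection` namespace.

    Hypothetical helper for illustration; assumes `base_uri` carries no
    path or query string (for example 'mongodb://127.0.0.1').
    """
    # Drop a trailing slash so the result never contains '//' before the namespace.
    return "{}/{}.{}".format(base_uri.rstrip("/"), database, collection)


collection_uri = build_collection_uri("mongodb://127.0.0.1", "test", "contacts")
print(collection_uri)
```

Deriving the value this way keeps the two URIs from drifting apart if the database or collection name changes.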

## Initialize Spark

In [75]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
# Fetch the MongoDB Spark connector package when the session starts
spark = SparkSession.builder \
    .config('spark.jars.packages', ','.join(app_config.spark_packages)) \
    .getOrCreate()

## Write to MongoDB Using Spark DataFrame API

Let's start by creating some sample data and loading it into a Spark DataFrame.

In [76]:
people = spark.createDataFrame([
    ("Bilbo Baggins",  50), 
    ("Gandalf", 1000), 
    ("Thorin", 195), 
    ("Balin", 178), 
    ("Kili", 77),
    ("Dwalin", 169),
    ("Oin", 167), 
    ("Gloin", 158), 
    ("Fili", 82), 
    ("Bombur", None)
], 
    ["name", "age"])

In [77]:
people.show()

+-------------+----+
|         name| age|
+-------------+----+
|Bilbo Baggins|  50|
|      Gandalf|1000|
|       Thorin| 195|
|        Balin| 178|
|         Kili|  77|
|       Dwalin| 169|
|          Oin| 167|
|        Gloin| 158|
|         Fili|  82|
|       Bombur|null|
+-------------+----+



In [78]:
people.write.format("mongo") \
    .option("uri", app_config.mongo_uri) \
    .option("database", app_config.database)  \
    .option("collection", app_config.collection) \
    .mode("overwrite") \
    .save()

In the write above we told Spark to use `overwrite` mode. This means the entire collection is replaced.

We could specify `append` mode instead. In that case, when the DataFrame contains an `_id` column, Spark performs upserts: documents whose `_id` matches one in the database are updated, and documents without a matching `_id` are inserted.
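The difference between the two modes can be modelled in plain Python. This is only a sketch of the semantics described above, not connector code: the collection is a dict keyed by `_id`, every incoming document is assumed to carry an `_id`, and the `write` function is a hypothetical name.

```python
def write(collection, docs, mode):
    """Return the collection contents after writing `docs` in the given mode.

    `collection` and the result are dicts keyed by `_id`. This is a plain
    Python model of the semantics, not connector code.
    """
    if mode == "overwrite":
        # The old contents are discarded entirely.
        return {doc["_id"]: doc for doc in docs}
    if mode == "append":
        result = dict(collection)
        for doc in docs:
            # Matching `_id`: replace the stored document. No match: insert.
            result[doc["_id"]] = doc
        return result
    raise ValueError("unsupported mode: {}".format(mode))


existing = {
    "Thorin": {"_id": "Thorin", "age": 195},
    "Balin": {"_id": "Balin", "age": 178},
}
incoming = [
    {"_id": "Thorin", "age": 196},  # matches an existing document
    {"_id": "Kili", "age": 77},     # new document
]

overwritten = write(existing, incoming, "overwrite")
appended = write(existing, incoming, "append")
print(sorted(overwritten))  # ['Kili', 'Thorin']
print(sorted(appended))     # ['Balin', 'Kili', 'Thorin']
```

In the model, `overwrite` loses Balin because he is not in the incoming data, while `append` keeps him, updates Thorin, and inserts Kili.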

## Read Data from MongoDB Using Spark DataFrame API

In [58]:
df = spark.read.format("mongo") \
    .option("uri", app_config.mongo_uri) \
    .option("database", app_config.database) \
    .option("collection", app_config.collection) \
    .load()

In [60]:
df.printSchema()

root
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



In [59]:
df.show(5)

+--------------------+----+-------------+
|                 _id| age|         name|
+--------------------+----+-------------+
|[5d6a18bd0bd95c18...|1000|      Gandalf|
|[5d6a18bd0bd95c18...| 178|        Balin|
|[5d6a18bd0bd95c18...|  77|         Kili|
|[5d6a18bd0bd95c18...|  50|Bilbo Baggins|
|[5d6a18bd0bd95c18...| 158|        Gloin|
+--------------------+----+-------------+
only showing top 5 rows



We can encode the database and collection names in the MongoDB URI:

In [69]:
df = spark.read.format("mongo") \
    .option("uri",app_config.mongo_uri_collection) \
    .load()
df.show(3)

+--------------------+----+-------+
|                 _id| age|   name|
+--------------------+----+-------+
|[5d6a19490bd95c18...| 167|    Oin|
|[5d6a19490bd95c18...|1000|Gandalf|
|[5d6a19490bd95c18...| 178|  Balin|
+--------------------+----+-------+
only showing top 3 rows



## Upsert Documents

First we need to add a document ID field named `_id`.

When writing to the collection, the connector uses the `_id` field to match and update existing documents. If no document with that `_id` exists, a new one is inserted.

The default update behavior is to replace the entire document. To update only the fields present in the DataFrame, set the `replaceDocument` option to `False` (the default is `True`).

In [22]:
from pyspark.sql.functions import col
# Use the name as the document ID so repeated writes target the same documents
people_with_id = people.withColumn('_id', col('name'))

In [24]:
# The DataFrame now has an `_id` column, so append mode upserts by `_id`
people_with_id.write.format("mongo") \
    .option("uri", app_config.mongo_uri) \
    .option("database", app_config.database) \
    .option("collection", "contacts_id") \
    .mode("append") \
    .save()
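The write above uses the default, which replaces each matched document. A minimal sketch of the partial-update variant follows, with the `replaceDocument` option set to `false`. The `write_partial_update` helper is a hypothetical name introduced here, and it assumes `df` is a Spark DataFrame with an `_id` column.

```python
def write_partial_update(df, uri, database, collection):
    """Upsert `df` by `_id`, updating only the fields present in `df`.

    Hypothetical helper for illustration; `df` is expected to be a Spark
    DataFrame that has an `_id` column.
    """
    df.write.format("mongo") \
        .option("uri", uri) \
        .option("database", database) \
        .option("collection", collection) \
        .option("replaceDocument", "false") \
        .mode("append") \
        .save()
```

In this notebook it could be called as `write_partial_update(people_with_id.select('_id', 'age'), app_config.mongo_uri, app_config.database, 'contacts_id')`, which should update each stored `age` while leaving the existing `name` field in place.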