### For Spark to talk to MongoDB, we need to initialize the Spark context with the Mongo URI and include the mongo-spark-connector package.

### Additionally, whoever configures the cluster may need to make sure additional jars are installed in $SPARK_HOME/jars
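
The `--packages` value used in the cells below is a Maven coordinate, and the `_2.11` suffix in it is the Scala version, which should match the Scala version your Spark build was compiled with. A minimal sketch of building that string is shown here; the helper name `connector_submit_args` is hypothetical and not part of the course's `initspark` module.

```python
import os

def connector_submit_args(scala_version="2.11", connector_version="2.4.1"):
    """Build the PYSPARK_SUBMIT_ARGS value that pulls in the MongoDB connector.

    Hypothetical helper: the defaults mirror the coordinate used in this notebook.
    """
    package = (
        f"org.mongodb.spark:mongo-spark-connector_{scala_version}:{connector_version}"
    )
    # 'pyspark-shell' must come last so PySpark launches its gateway correctly.
    return f"--packages {package} pyspark-shell"

# This has to be set before the SparkContext is created, otherwise it is ignored.
os.environ["PYSPARK_SUBMIT_ARGS"] = connector_submit_args()
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

If Spark cannot reach a Maven repository at startup, this is the case where the jars would need to be placed in `$SPARK_HOME/jars` by hand.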



In [None]:
# import os
# from pyspark import SparkConf, SparkContext
# from pyspark.sql import SparkSession

# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1 pyspark-shell'

# def initspark(appname = "Test", servername = "local", mongo="mongodb://127.0.0.1/classroom"):
#     print ('initializing pyspark')
#     conf = SparkConf().setAppName(appname).setMaster(servername)
#     sc = SparkContext(conf=conf)
#     spark = SparkSession.builder.appName(appname) \
#     .config("spark.mongodb.input.uri", mongo) \
#     .config("spark.mongodb.output.uri", mongo) \
#     .enableHiveSupport().getOrCreate()
#     sc.setLogLevel("WARN")
#     print ('pyspark initialized')
#     return sc, spark, conf


In [1]:
import sys, os
#print(sys.version)
os.environ["SPARK_HOME"] = '/usr/local/spark'
# os.environ["PYTHON_PATH"] = 'python3'
# os.environ["PYSPARK_DRIVER_PYTHON"] = 'python3'

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1 pyspark-shell'
sys.path.append('/class')
from initspark import *
sc, spark, conf = initspark()


initializing pyspark
pyspark initialized


In [None]:
df = spark.read.format("mongo").option("uri", "mongodb://127.0.0.1/Northwind.regions").load()
df.show()


### We can also take a DataFrame and write it to a Mongo destination.

In [None]:
x = sc.parallelize([('APAC', '5')])
x1 = spark.createDataFrame(x, schema = ['RegionDescription', 'RegionID'])
x1.write.format("mongo").options(collection="regions", database="Northwind").mode("append").save()
print('Done')


In [None]:
df = spark.read.format("mongo").option("uri", "mongodb://127.0.0.1/Northwind.regions").load()
df.show()


### Like any DataFrame, we can make it into a temporary view and use SparkSQL on it.

In [None]:
df.createOrReplaceTempView('regions')
spark.sql('select * from regions where regionid between 2 and 4').show()
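
The same filter can also be written with DataFrame methods rather than SQL. The sketch below assumes `df` is the regions DataFrame loaded above and that it has a `RegionID` column; the function name `regions_between` is made up for illustration.

```python
def regions_between(df, low, high):
    """Return the rows whose RegionID is in [low, high], using DataFrame methods."""
    # df["RegionID"] is a Column; between() is inclusive at both ends,
    # the same as SQL BETWEEN.
    return df.filter(df["RegionID"].between(low, high))

# Usage in the notebook (assumes `df` from the cell above):
# regions_between(df, 2, 4).show()
```

Either form should give the same rows, so the choice is mostly a matter of which reads better in context.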


## From here we can treat Mongo collections just like tables from any other source and process them with Spark SQL or DataFrame methods.

In [None]:
c = spark.read.format("mongo").option("uri", "mongodb://127.0.0.1/Northwind.categories").load()
p = spark.read.format("mongo").option("uri", "mongodb://127.0.0.1/Northwind.products").load()
c.createOrReplaceTempView('categories')
p.createOrReplaceTempView('products')
spark.sql('''select c.categoryid, c.categoryname, p.productid, p.productname
from products as p 
join categories as c on p.categoryid = c.categoryid 
order by c.categoryid, p.productid''').show()
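
Since every collection is loaded and registered the same way, the read-then-register pattern can be wrapped in a small helper. This is only a sketch: `mongo_uri` and `load_view` are hypothetical names, and the default host is assumed to be the local server used throughout this notebook.

```python
def mongo_uri(database, collection, host="127.0.0.1"):
    """Build a mongodb:// URI of the form used in this notebook."""
    return f"mongodb://{host}/{database}.{collection}"

def load_view(spark, database, collection, view=None):
    """Read a Mongo collection into a DataFrame and register it as a temp view.

    `view` defaults to the collection name; pass it explicitly when the
    collection name is not a valid SQL identifier (e.g. it contains a hyphen).
    """
    df = (
        spark.read.format("mongo")
        .option("uri", mongo_uri(database, collection))
        .load()
    )
    df.createOrReplaceTempView(view or collection)
    return df

# Usage in the notebook (assumes the `spark` session created earlier):
# c = load_view(spark, "Northwind", "categories")
# p = load_view(spark, "Northwind", "products")
```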



## COLLECT_LIST can be used in Spark to create nested repeating fields, instead of using Mongo's aggregation pipeline.

In [None]:
spark.sql('''select c.categoryid, c.categoryname
, COLLECT_LIST(NAMED_STRUCT('productid', p.productid, 'productname', p.productname, 'unitprice', p.unitprice)) as products
from products as p 
join categories as c on p.categoryid = c.categoryid 
group by c.categoryid, c.categoryname
order by c.categoryid''').show()
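
To make the shape of that result concrete, here is a plain-Python illustration of what grouping with `COLLECT_LIST(NAMED_STRUCT(...))` produces: one record per category, each carrying a list of product structs. It does not use Spark, and the sample rows are illustrative rather than read from the Northwind database.

```python
from collections import defaultdict

# Illustrative joined rows:
# (categoryid, categoryname, productid, productname, unitprice)
rows = [
    (1, "Beverages", 1, "Chai", 18.0),
    (1, "Beverages", 2, "Chang", 19.0),
    (2, "Condiments", 3, "Aniseed Syrup", 10.0),
]

def nest_products(joined_rows):
    """Group flat join rows into one record per category with nested products."""
    grouped = defaultdict(list)
    for categoryid, categoryname, productid, productname, unitprice in joined_rows:
        # Each dict plays the role of one NAMED_STRUCT.
        grouped[(categoryid, categoryname)].append(
            {"productid": productid, "productname": productname, "unitprice": unitprice}
        )
    # Sorting by the key mirrors the ORDER BY on categoryid.
    return [
        {"categoryid": cid, "categoryname": cname, "products": products}
        for (cid, cname), products in sorted(grouped.items(), key=lambda kv: kv[0])
    ]

nested = nest_products(rows)
for record in nested:
    print(record)
```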



## Adding the SORT_ARRAY function lets you sort the contents of the nested collection.

In [None]:
spark.sql('''select c.categoryid, c.categoryname
, SORT_ARRAY(COLLECT_LIST(NAMED_STRUCT('productid', p.productid, 'productname', p.productname, 'unitprice', p.unitprice))) as products
from products as p 
join categories as c on p.categoryid = c.categoryid 
group by c.categoryid, c.categoryname
order by c.categoryid''').show()
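
When the array elements are structs, `SORT_ARRAY` compares them field by field in the order the fields were declared, so the query above sorts by `productid` first. Python tuples order the same way, which gives a quick way to see the effect without Spark. The values below are illustrative.

```python
# Tuples in the same field order as the NAMED_STRUCT above:
# (productid, productname, unitprice)
products = [(3, "Aniseed Syrup", 10.0), (1, "Chai", 18.0), (2, "Chang", 19.0)]

# Tuples compare element by element, so this orders by productid.
by_id = sorted(products)

# To order by price instead, move the price to the front, just as you
# would list 'unitprice' first in NAMED_STRUCT.
by_price = sorted((price, pid, name) for pid, name, price in products)

print(by_id)
print(by_price)
```

So the field you want to sort on should be the first field in the struct.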



## LAB: ## 
### Write shippers to Mongo and find all the shippers with an 800 number using a temporary view.
<br>
<details><summary>Click for <b>hint</b></summary>
<p>
Unlike Cassandra, Mongo does not require a collection to exist before writing to it, so just write the DataFrame to a new collection
<br>
Make a DataFrame from the new Mongo collection and turn it into a temporary view
<br>
Use a SQL LIKE expression to find the desired records
<br>
<br>
</p>
</details>

<details><summary>Click for <b>code</b></summary>
<p>

```python
shippers.write.format("mongo").options(collection="shippers", database="classroom").mode("append").save()

s=spark.read.format("mongo").option("uri","mongodb://127.0.0.1/classroom.shippers").load()
s.createOrReplaceTempView('shippers')
spark.sql("select * from shippers where phone like '%800%'").show()
```
</p>
</details>

## HOMEWORK: ## 
**First Challenge**

Read Products from any source and write it to a Cassandra table. For simplicity, we only need to keep the productid, productname, and unitprice columns.

**Second Challenge**

Read Orders_LineItems.json from Day3 folder and write it to a Mongo collection.

**Third Challenge**

Join Products and Orders_LineItems, then flatten and regroup them so that the orders are grouped under each product instead.

**Bonus Challenge**

Include a calculated column showing how many times each product was ordered.

A starting template has been provided in Day4-Homework.py to deal with preparing the Cassandra and Mongo environments for Challenges 1 & 2. If you have difficulty doing those on your own, then start with that template. Otherwise, try it from scratch from the code provided in the course so far.
<br>
<details><summary>Click for <b>hint</b></summary>
<p>
<br>
Read each table from the NoSQL source and turn it into a temporary view
<br>
Use LATERAL VIEW EXPLODE() EXPLODED_TABLE to flatten out the nested file of orders
<br>
Use the flattened results to join to products
<br>
Use the results of the join to group on productid, productname, and collect a structured list of customerid, orderid, orderdate, productid, quantity and price
<br>
Use the size function on the collected list to determine how many times each product was ordered, or alternatively do it as part of the SQL query with other familiar techniques
<br>
<br>
</p>
</details>
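
For intuition about the flatten-and-regroup idea, the following plain-Python sketch shows what exploding a nested list and regrouping on a different key looks like. It is not the Spark solution, and the field names (`orderid`, `customerid`, `lineitems`, `productid`, `quantity`) are assumptions for illustration, not the actual layout of Orders_LineItems.json.

```python
from collections import defaultdict

# Hypothetical nested orders; field names are assumed, not taken from the file.
orders = [
    {"orderid": 1, "customerid": "A",
     "lineitems": [{"productid": 10, "quantity": 2}, {"productid": 11, "quantity": 1}]},
    {"orderid": 2, "customerid": "B",
     "lineitems": [{"productid": 10, "quantity": 5}]},
]

def explode_orders(nested_orders):
    """Produce one flat row per (order, line item), as an explode would."""
    return [
        {"orderid": o["orderid"], "customerid": o["customerid"], **item}
        for o in nested_orders
        for item in o["lineitems"]
    ]

def regroup_by_product(flat_rows):
    """Collect the flat rows under each productid and count them."""
    grouped = defaultdict(list)
    for row in flat_rows:
        grouped[row["productid"]].append(row)
    return [
        # len() here plays the role of size() on the collected list.
        {"productid": pid, "orders": rows, "timesordered": len(rows)}
        for pid, rows in sorted(grouped.items(), key=lambda kv: kv[0])
    ]

flat = explode_orders(orders)
by_product = regroup_by_product(flat)
for record in by_product:
    print(record["productid"], record["timesordered"])
```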





In [29]:
cu = spark.read.format("mongo").option("uri", "mongodb://127.0.0.1/Northwind.customers").load()
o = spark.read.format("mongo").option("uri", "mongodb://127.0.0.1/Northwind.orders").load()
od = spark.read.format("mongo").option("uri", "mongodb://127.0.0.1/Northwind.order-details").load()
p = spark.read.format("mongo").option("uri", "mongodb://127.0.0.1/Northwind.products").load()
ca = spark.read.format("mongo").option("uri", "mongodb://127.0.0.1/Northwind.categories").load()

cu.createOrReplaceTempView('customers')
o.createOrReplaceTempView('orders')
od.createOrReplaceTempView('orderdetails')
p.createOrReplaceTempView('products')
ca.createOrReplaceTempView('categories')

orderjoin = spark.sql('''SELECT o.OrderID, o.OrderDate, o.CustomerID, cu.CompanyName
, od.ProductID, od.UnitPrice as PurchasePrice, od.Quantity
, p.ProductName, p.UnitPrice as ListPrice, ca.CategoryID, ca.CategoryName
FROM orders AS o
JOIN orderdetails AS od ON o.OrderID = od.OrderID
JOIN products AS p ON od.ProductID = p.ProductID
JOIN categories AS ca ON p.CategoryID = ca.CategoryID 
JOIN customers as cu ON o.CustomerID = cu.CustomerID
''')
orderjoin.createOrReplaceTempView('orderjoin')

ord1 = spark.sql('''
SELECT OrderID, OrderDate, CustomerID, CompanyName
, SORT_ARRAY(COLLECT_LIST(NAMED_STRUCT('ProductID', ProductID, 'ProductName', ProductName
               , 'CategoryID', CategoryID, 'CategoryName', CategoryName
               , 'ListPrice', ListPrice, 'PurchasePrice', PurchasePrice, 'Quantity', Quantity
               ))) AS LineItems

from orderjoin
GROUP BY OrderID, OrderDate, CustomerID, CompanyName
''')

#ord1.show()
ord1.createOrReplaceTempView('ord1')

ord2 = spark.sql('''
SELECT CustomerID, CompanyName
, SORT_ARRAY(COLLECT_LIST(NAMED_STRUCT('OrderID', OrderID, 'OrderDate', OrderDate, 'LineItems', LineItems))) AS Orders
from ord1
GROUP BY CustomerID, CompanyName
''')

ord2.show()

ord1.write.format("mongo").options(collection="orders1", database="Northwind2").mode("append").save()
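# Write the customer-level documents to their own collection so the two document shapes are not mixed.
ord2.write.format("mongo").options(collection="orders2", database="Northwind2").mode("append").save()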


+----------+--------------------+--------------------+
|CustomerID|         CompanyName|              Orders|
+----------+--------------------+--------------------+
|     WOLZA|      Wolski  Zajazd|[[10374, 1996-12-...|
|     MAISD|        Maison Dewey|[[10529, 1997-05-...|
|     BLAUS|Blauer See Delika...|[[10501, 1997-04-...|
|     MAGAA|Magazzini Aliment...|[[10275, 1996-08-...|
|     FOLKO|      Folk och fä HB|[[10264, 1996-07-...|
|     ANATR|Ana Trujillo Empa...|[[10308, 1996-09-...|
|     ISLAT|      Island Trading|[[10315, 1996-09-...|
|     VAFFE|        Vaffeljernet|[[10367, 1996-11-...|
|     BLONP|Blondesddsl père ...|[[10265, 1996-07-...|
|     CENTC|Centro comercial ...|[[10259, 1996-07-...|
|     SPLIR|Split Rail Beer &...|[[10271, 1996-08-...|
|     TRAIH|Trail's Head Gour...|[[10574, 1997-06-...|
|     LILAS|   LILA-Supermercado|[[10283, 1996-08-...|
|     WARTH|      Wartian Herkku|[[10266, 1996-07-...|
|     FRANR| France restauration|[[10671, 1997-09-...|
|     SEVE