# Demonstrating Apache Spark interop with MongoDB

<hr>

### Details: 
*  Author:  Tom Bresee
*  Date:  March 2020

### What is this notebook?
> A demonstration of using Apache Spark to interoperate with MongoDB via the purpose-built connector.  
See https://www.mongodb.com/products/spark-connector

An HTML copy of this notebook is available [here](https://htmlpreview.github.io/?https://github.com/tombresee/Prairie-Fire/blob/master/ENTER/ApacheSparkMongoConnector.html) for easier viewing.

*The MongoDB Connector for Apache Spark exposes all of Spark’s libraries, including Scala, Java, Python and R. MongoDB data is materialized as DataFrames and Datasets for analysis with machine learning, graph, streaming, and SQL APIs.*

<img src="https://github.com/tombresee/Prairie-Fire/raw/master/ENTER/spark-connector-diagram.png" width="500">   

<br>

#### Let's get started



```python

Python 3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 21:52:21) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
20/03/16 22:53:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/03/16 22:53:19 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/

Using Python version 3.7.3 (default, Jul  1 2019 21:52:21)
SparkSession available as 'spark'.
>>> 
```


<br>

# The Largest College Football Stadiums - Ranked

### What are we showing here?
1.  We have some raw stadium capacity data (in .csv form).
2.  We start a Spark session with the **MongoDB Spark connector** loaded.
3.  We use Spark to read the raw stadium data into a DataFrame, then write that data **directly** to a local MongoDB instance (specifying the new database name and collection name).
4.  Then, to demonstrate the reverse direction, we read the collection from MongoDB (via the Spark connector) *back* into a new DataFrame and **print** the results. This is effectively the contents of the MongoDB collection.
5.  Together, these steps demonstrate both writing to and reading from a MongoDB database via the connector.


```python


# The file we will execute (stadium_analysis.py)

from pyspark.sql import SparkSession

if __name__ == "__main__":

    spark = SparkSession.builder.appName("Connector-Mongodb-Apache Spark").getOrCreate()

    #logger = spark._jvm.org.apache.log4j
    #logger.LogManager.getRootLogger().setLevel(logger.Level.FATAL)


    # Pull in some raw stadium data
    stadiums = spark.read.csv("/root/stadium_data.csv", header=True, inferSchema=True)


    # WRITE this data to MongoDB through the connector's "mongo" data source.
    # The target database and collection come from spark.mongodb.output.uri.
    stadiums.write.format("mongo").mode("overwrite").save()


    # print the df schema
    print("Stadium Data Schema:")
    stadiums.printSchema()


    # READ the MongoDB collection back VIA the Spark connector,
    # i.e. load the collection named in spark.mongodb.input.uri
    # into a new DataFrame with spark.read
    df = spark.read.format("mongo").load()


    # SQL
    # createOrReplaceTempView replaces registerTempTable, deprecated since Spark 2.0
    df.createOrReplaceTempView("stadium_temp")
    data = spark.sql("SELECT * FROM stadium_temp")
    print("\n\n*******************************************************************")
    print("")
    print("\nRanking of Largest College Football Stadium Sizes, in the USA:")
    data.show(20,False)
    print("")
    print("\n\n*******************************************************************")

    
```


```bash


# We will issue these commands to execute our file:


./bin/spark-submit --master "local[*]"  \
--conf "spark.mongodb.input.uri=mongodb://127.0.0.1/stadium_db.stadium_collection?readPreference=primaryPreferred" \
--conf "spark.mongodb.output.uri=mongodb://127.0.0.1/stadium_db.stadium_collection" \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.7 \
/root/stadium_analysis.py

        
# MongoDB input and output URIs:
# --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/stadium_db.stadium_collection?readPreference=primaryPreferred"
# --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/stadium_db.stadium_collection"
# format:  mongodb://<host>/<mongo database name>.<mongo collection name>
#
# --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.7
#  This coordinate is critical: the _2.11 suffix is the Scala version, which must match
#  the Scala version of your Spark build, and the connector release (2.2.7 here) must be
#  compatible with your Spark version. With a mismatched connector the job will typically fail.
        
``` 
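The URI format described in the comments above can also be assembled programmatically. The following is a minimal sketch rather than part of the original notebook: the helper names `mongo_uri` and `spark_submit_args` are hypothetical, and only the host, database, collection, and package values are taken from the command shown above.

```python
import shlex


def mongo_uri(host, database, collection, read_preference=None):
    """Build mongodb://<host>/<database>.<collection>, optionally with a readPreference."""
    uri = "mongodb://{}/{}.{}".format(host, database, collection)
    if read_preference:
        uri += "?readPreference={}".format(read_preference)
    return uri


def spark_submit_args(script, host, database, collection, package):
    """Return the spark-submit argument list (hypothetical helper for illustration)."""
    return [
        "./bin/spark-submit",
        "--master", "local[*]",
        "--conf", "spark.mongodb.input.uri="
        + mongo_uri(host, database, collection, "primaryPreferred"),
        "--conf", "spark.mongodb.output.uri="
        + mongo_uri(host, database, collection),
        "--packages", package,
        script,
    ]


args = spark_submit_args(
    "/root/stadium_analysis.py",
    "127.0.0.1",
    "stadium_db",
    "stadium_collection",
    "org.mongodb.spark:mongo-spark-connector_2.11:2.2.7",
)

# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(a) for a in args))
```

Building the two URIs from one set of values keeps the input and output settings pointed at the same collection, which is what this demonstration relies on.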
        

##### Output:

<img src="https://github.com/tombresee/Prairie-Fire/raw/master/ENTER/stadium_result.png" width="700">   

#### Prove it: Let's look in MongoDB and see what we created

```javascript
// in the mongo shell

> use stadium_db   // select the database to use
switched to db stadium_db


> show collections
stadium_collection  // this is the collection we created with Spark


> db.stadium_collection.find()  // show every document in the collection
{ "_id" : ObjectId("5e7023ba17d43a604c556d13"), "RANK" : 1, "SCHOOL" : "Michigan", "STADIUM" : "Michigan Stadium (Ann Arbor, Mich.)", "CAPACITY" : "107,601" }
{ "_id" : ObjectId("5e7023ba17d43a604c556d14"), "RANK" : 2, "SCHOOL" : "Penn State", "STADIUM" : "Beaver Stadium (University Park, Pa.)", "CAPACITY" : "106,572" }
{ "_id" : ObjectId("5e7023ba17d43a604c556d15"), "RANK" : 3, "SCHOOL" : "Texas A&M", "STADIUM" : "Kyle Field (College Station, Texas)", "CAPACITY" : "102,733" }
{ "_id" : ObjectId("5e7023ba17d43a604c556d16"), "RANK" : 4, "SCHOOL" : "Tennessee", "STADIUM" : "Neyland Stadium (Knoxville, Tenn.)", "CAPACITY" : "102,455" }
{ "_id" : ObjectId("5e7023ba17d43a604c556d17"), "RANK" : 5, "SCHOOL" : "LSU", "STADIUM" : "Tiger Stadium (Baton Rouge, La.)", "CAPACITY" : "102,321" }
{ "_id" : ObjectId("5e7023ba17d43a604c556d18"), "RANK" : 6, "SCHOOL" : "Ohio State", "STADIUM" : "Ohio Stadium (Columbus, Ohio)", "CAPACITY" : "102,082" }
{ "_id" : ObjectId("5e7023ba17d43a604c556d19"), "RANK" : 7, "SCHOOL" : "Alabama", "STADIUM" : "Bryant-Denny Stadium (Tuscaloosa, Ala.)", "CAPACITY" : "101,821" }
{ "_id" : ObjectId("5e7023ba17d43a604c556d1a"), "RANK" : 8, "SCHOOL" : "Texas", "STADIUM" : "Darrell K Royal-Texas Memorial Stadium (Austin, Texas)", "CAPACITY" : "100,119" }
{ "_id" : ObjectId("5e7023ba17d43a604c556d1b"), "RANK" : 9, "SCHOOL" : "Georgia", "STADIUM" : "Sanford Stadium (Athens, Ga.)", "CAPACITY" : "92,746" }
{ "_id" : ObjectId("5e7023ba17d43a604c556d1c"), "RANK" : 10, "SCHOOL" : "UCLA", "STADIUM" : "Rose Bowl (Pasadena, Calif.)", "CAPACITY" : "90,888" }
{ "_id" : ObjectId("5e7023ba17d43a604c556d1d"), "RANK" : 11, "SCHOOL" : "Florida", "STADIUM" : "Ben Hill Griffin Stadium (Gainesville, Fla.)", "CAPACITY" : "88,548" }
{ "_id" : ObjectId("5e7023ba17d43a604c556d1e"), "RANK" : 12, "SCHOOL" : "Auburn", "STADIUM" : "Jordan-Hare Stadium (Auburn, Ala.)", "CAPACITY" : "87,451" }
{ "_id" : ObjectId("5e7023ba17d43a604c556d1f"), "RANK" : 13, "SCHOOL" : "Oklahoma", "STADIUM" : "Gaylord Family Oklahoma Memorial Stadium (Norman, Okla.)", "CAPACITY" : "86,112" }
{ "_id" : ObjectId("5e7023ba17d43a604c556d20"), "RANK" : 14, "SCHOOL" : "Nebraska", "STADIUM" : "Memorial Stadium (Lincoln, Neb.)", "CAPACITY" : "85,458" }
{ "_id" : ObjectId("5e7023ba17d43a604c556d21"), "RANK" : 15, "SCHOOL" : "Clemson", "STADIUM" : "Frank Howard Field at Clemson Memorial Stadium (Clemson, S.C.)", "CAPACITY" : "81,500" }
{ "_id" : ObjectId("5e7023ba17d43a604c556d22"), "RANK" : 16, "SCHOOL" : "Notre Dame", "STADIUM" : "Notre Dame Stadium (South Bend, Ind.)", "CAPACITY" : "80,795" }
{ "_id" : ObjectId("5e7023ba17d43a604c556d23"), "RANK" : 17, "SCHOOL" : "Wisconsin", "STADIUM" : "Camp Randall Stadium (Madison, Wisc.)", "CAPACITY" : "80,321" }
{ "_id" : ObjectId("5e7023ba17d43a604c556d24"), "RANK" : 18, "SCHOOL" : "South Carolina", "STADIUM" : "Williams-Brice Stadium (Columbia, S.C.)", "CAPACITY" : "80,250" }
{ "_id" : ObjectId("5e7023ba17d43a604c556d25"), "RANK" : 19, "SCHOOL" : "Florida State", "STADIUM" : "Bobby Bowden Field at Doak Campbell Stadium (Tallahassee, Fla.)", "CAPACITY" : "79,560" }
{ "_id" : ObjectId("5e7023ba17d43a604c556d26"), "RANK" : 20, "SCHOOL" : "Southern Cal.", "STADIUM" : "United Airlines Field at Los Angeles Memorial Coliseum (Los Angeles)", "CAPACITY" : "77,500" }


> db.stadium_collection.find().pretty()  // the same documents, pretty-printed
{
        "_id" : ObjectId("5e7023ba17d43a604c556d13"),
        "RANK" : 1,
        "SCHOOL" : "Michigan",
        "STADIUM" : "Michigan Stadium (Ann Arbor, Mich.)",
        "CAPACITY" : "107,601"
}
{
        "_id" : ObjectId("5e7023ba17d43a604c556d14"),
        "RANK" : 2,
        "SCHOOL" : "Penn State",
        "STADIUM" : "Beaver Stadium (University Park, Pa.)",
        "CAPACITY" : "106,572"
}
{
        "_id" : ObjectId("5e7023ba17d43a604c556d15"),
        "RANK" : 3,
        "SCHOOL" : "Texas A&M",
        "STADIUM" : "Kyle Field (College Station, Texas)",
        "CAPACITY" : "102,733"
}
{
        "_id" : ObjectId("5e7023ba17d43a604c556d16"),
        "RANK" : 4,
        "SCHOOL" : "Tennessee",
        "STADIUM" : "Neyland Stadium (Knoxville, Tenn.)",
        "CAPACITY" : "102,455"
}
{
        "_id" : ObjectId("5e7023ba17d43a604c556d17"),
        "RANK" : 5,
        "SCHOOL" : "LSU",
        "STADIUM" : "Tiger Stadium (Baton Rouge, La.)",
        "CAPACITY" : "102,321"
}
{
        "_id" : ObjectId("5e7023ba17d43a604c556d18"),
        "RANK" : 6,
        "SCHOOL" : "Ohio State",
        "STADIUM" : "Ohio Stadium (Columbus, Ohio)",
        "CAPACITY" : "102,082"
}
{
        "_id" : ObjectId("5e7023ba17d43a604c556d19"),
        "RANK" : 7,
        "SCHOOL" : "Alabama",
        "STADIUM" : "Bryant-Denny Stadium (Tuscaloosa, Ala.)",
        "CAPACITY" : "101,821"
}
{
        "_id" : ObjectId("5e7023ba17d43a604c556d1a"),
        "RANK" : 8,
        "SCHOOL" : "Texas",
        "STADIUM" : "Darrell K Royal-Texas Memorial Stadium (Austin, Texas)",
        "CAPACITY" : "100,119"
}
{
        "_id" : ObjectId("5e7023ba17d43a604c556d1b"),
        "RANK" : 9,
        "SCHOOL" : "Georgia",
        "STADIUM" : "Sanford Stadium (Athens, Ga.)",
        "CAPACITY" : "92,746"
}
{
        "_id" : ObjectId("5e7023ba17d43a604c556d1c"),
        "RANK" : 10,
        "SCHOOL" : "UCLA",
        "STADIUM" : "Rose Bowl (Pasadena, Calif.)",
        "CAPACITY" : "90,888"
}
{
        "_id" : ObjectId("5e7023ba17d43a604c556d1d"),
        "RANK" : 11,
        "SCHOOL" : "Florida",
        "STADIUM" : "Ben Hill Griffin Stadium (Gainesville, Fla.)",
        "CAPACITY" : "88,548"
}
{
        "_id" : ObjectId("5e7023ba17d43a604c556d1e"),
        "RANK" : 12,
        "SCHOOL" : "Auburn",
        "STADIUM" : "Jordan-Hare Stadium (Auburn, Ala.)",
        "CAPACITY" : "87,451"
}
{
        "_id" : ObjectId("5e7023ba17d43a604c556d1f"),
        "RANK" : 13,
        "SCHOOL" : "Oklahoma",
        "STADIUM" : "Gaylord Family Oklahoma Memorial Stadium (Norman, Okla.)",
        "CAPACITY" : "86,112"
}
{
        "_id" : ObjectId("5e7023ba17d43a604c556d20"),
        "RANK" : 14,
        "SCHOOL" : "Nebraska",
        "STADIUM" : "Memorial Stadium (Lincoln, Neb.)",
        "CAPACITY" : "85,458"
}
{
        "_id" : ObjectId("5e7023ba17d43a604c556d21"),
        "RANK" : 15,
        "SCHOOL" : "Clemson",
        "STADIUM" : "Frank Howard Field at Clemson Memorial Stadium (Clemson, S.C.)",
        "CAPACITY" : "81,500"
}
{
        "_id" : ObjectId("5e7023ba17d43a604c556d22"),
        "RANK" : 16,
        "SCHOOL" : "Notre Dame",
        "STADIUM" : "Notre Dame Stadium (South Bend, Ind.)",
        "CAPACITY" : "80,795"
}
{
        "_id" : ObjectId("5e7023ba17d43a604c556d23"),
        "RANK" : 17,
        "SCHOOL" : "Wisconsin",
        "STADIUM" : "Camp Randall Stadium (Madison, Wisc.)",
        "CAPACITY" : "80,321"
}
{
        "_id" : ObjectId("5e7023ba17d43a604c556d24"),
        "RANK" : 18,
        "SCHOOL" : "South Carolina",
        "STADIUM" : "Williams-Brice Stadium (Columbia, S.C.)",
        "CAPACITY" : "80,250"
}
{
        "_id" : ObjectId("5e7023ba17d43a604c556d25"),
        "RANK" : 19,
        "SCHOOL" : "Florida State",
        "STADIUM" : "Bobby Bowden Field at Doak Campbell Stadium (Tallahassee, Fla.)",
        "CAPACITY" : "79,560"
}
{
        "_id" : ObjectId("5e7023ba17d43a604c556d26"),
        "RANK" : 20,
        "SCHOOL" : "Southern Cal.",
        "STADIUM" : "United Airlines Field at Los Angeles Memorial Coliseum (Los Angeles)",
        "CAPACITY" : "77,500"
}



> db.stadium_collection.find({}, {RANK:1, SCHOOL:1, _id:0})  // project only the RANK and SCHOOL fields
{ "RANK" : 1, "SCHOOL" : "Michigan" }
{ "RANK" : 2, "SCHOOL" : "Penn State" }
{ "RANK" : 3, "SCHOOL" : "Texas A&M" }
{ "RANK" : 4, "SCHOOL" : "Tennessee" }
{ "RANK" : 5, "SCHOOL" : "LSU" }
{ "RANK" : 6, "SCHOOL" : "Ohio State" }
{ "RANK" : 7, "SCHOOL" : "Alabama" }
{ "RANK" : 8, "SCHOOL" : "Texas" }
{ "RANK" : 9, "SCHOOL" : "Georgia" }
{ "RANK" : 10, "SCHOOL" : "UCLA" }
{ "RANK" : 11, "SCHOOL" : "Florida" }
{ "RANK" : 12, "SCHOOL" : "Auburn" }
{ "RANK" : 13, "SCHOOL" : "Oklahoma" }
{ "RANK" : 14, "SCHOOL" : "Nebraska" }
{ "RANK" : 15, "SCHOOL" : "Clemson" }
{ "RANK" : 16, "SCHOOL" : "Notre Dame" }
{ "RANK" : 17, "SCHOOL" : "Wisconsin" }
{ "RANK" : 18, "SCHOOL" : "South Carolina" }
{ "RANK" : 19, "SCHOOL" : "Florida State" }
{ "RANK" : 20, "SCHOOL" : "Southern Cal." }



```
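One detail visible in the documents above is that `CAPACITY` is stored as a string such as `"107,601"`, presumably because the thousands separator keeps Spark from inferring a numeric type. Sorting on that field as text gives a different order than sorting by size. The sketch below illustrates the difference in plain Python on three records copied from the output; the helper name `parse_capacity` is hypothetical. In PySpark, the same idea would typically be expressed with `regexp_replace` and a cast to an integer column.

```python
def parse_capacity(text):
    """Convert a capacity string like "107,601" to the integer 107601."""
    return int(text.replace(",", ""))


# Three records copied from the collection output above.
sample = [
    {"SCHOOL": "Michigan", "CAPACITY": "107,601"},
    {"SCHOOL": "Penn State", "CAPACITY": "106,572"},
    {"SCHOOL": "Georgia", "CAPACITY": "92,746"},
]

# Sorting on the raw strings compares character by character, so "9..." sorts above "1...".
by_string = sorted(sample, key=lambda d: d["CAPACITY"], reverse=True)

# Sorting on the parsed integers orders the stadiums by actual size.
by_number = sorted(sample, key=lambda d: parse_capacity(d["CAPACITY"]), reverse=True)

print([d["SCHOOL"] for d in by_string])
print([d["SCHOOL"] for d in by_number])
```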



> What's a quick way to know things worked? Whenever you store a document in MongoDB without supplying an `_id`, a unique `_id` is generated for it automatically. So if you write to a MongoDB collection, read the data back, and see a **brand new field called `_id`** on every record, the data really did make the round trip through the database.

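The `_id` check described above can be turned into a small sanity test. This is a minimal sketch in plain Python, with a hypothetical helper name `fields_added_by_store`; the field lists are taken from the documents shown earlier. In the actual script one could pass `stadiums.columns` and `df.columns` instead.

```python
def fields_added_by_store(written_fields, read_fields):
    """Return the field names present after the read that were not in the written data."""
    return sorted(set(read_fields) - set(written_fields))


# Fields in the CSV-backed DataFrame that was written to MongoDB.
written = ["RANK", "SCHOOL", "STADIUM", "CAPACITY"]

# Fields seen on the documents read back from the collection.
read_back = ["_id", "RANK", "SCHOOL", "STADIUM", "CAPACITY"]

added = fields_added_by_store(written, read_back)
print(added)

# If the round trip went through MongoDB, the only new field should be _id.
if added != ["_id"]:
    raise RuntimeError("unexpected fields after round trip: {}".format(added))
```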
##### Full CLI output: 

```
#./bin/spark-submit --master "local[*]"  \
>                      --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/stadium_db.stadium_collection?readPreference=primaryPreferred" \
>                      --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/stadium_db.stadium_collection" \
>                      --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.7 \
>                      /root/stadium_analysis.py

The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark-2.4.5-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.mongodb.spark#mongo-spark-connector_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-1572bebb-4d85-43f8-a3a5-7d8c19073faf;1.0
        confs: [default]
        found org.mongodb.spark#mongo-spark-connector_2.11;2.2.7 in central
        found org.mongodb#mongo-java-driver;3.10.2 in central
        [3.10.2] org.mongodb#mongo-java-driver;[3.10,3.11)
:: resolution report :: resolve 2024ms :: artifacts dl 4ms
        :: modules in use:
        org.mongodb#mongo-java-driver;3.10.2 from central in [default]
        org.mongodb.spark#mongo-spark-connector_2.11;2.2.7 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   2   |   1   |   0   |   0   ||   2   |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-1572bebb-4d85-43f8-a3a5-7d8c19073faf
        confs: [default]
        0 artifacts copied, 2 already retrieved (0kB/5ms)
20/03/16 17:51:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/03/16 17:51:29 INFO SparkContext: Running Spark version 2.4.5
20/03/16 17:51:29 INFO SparkContext: Submitted application: Connector-Mongodb-Apache Spark
20/03/16 17:51:29 INFO SecurityManager: Changing view acls to: root
20/03/16 17:51:29 INFO SecurityManager: Changing modify acls to: root
20/03/16 17:51:29 INFO SecurityManager: Changing view acls groups to:
20/03/16 17:51:29 INFO SecurityManager: Changing modify acls groups to:
20/03/16 17:51:29 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/03/16 17:51:30 INFO Utils: Successfully started service 'sparkDriver' on port 39687.
20/03/16 17:51:30 INFO SparkEnv: Registering MapOutputTracker
20/03/16 17:51:30 INFO SparkEnv: Registering BlockManagerMaster
20/03/16 17:51:30 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/03/16 17:51:30 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/03/16 17:51:30 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-614c947a-3823-44a7-aaed-cfbcf39ecd3d
20/03/16 17:51:30 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
20/03/16 17:51:30 INFO SparkEnv: Registering OutputCommitCoordinator
20/03/16 17:51:30 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/03/16 17:51:30 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://animal-mother.mgmt:4040
20/03/16 17:51:30 INFO SparkContext: Added JAR file:///root/.ivy2/jars/org.mongodb.spark_mongo-spark-connector_2.11-2.2.7.jar at spark://animal-mother.mgmt:39687/jars/org.mongodb.spark_mongo-spark-connector_2.11-2.2.7.jar with timestamp 1584406290461
20/03/16 17:51:30 INFO SparkContext: Added JAR file:///root/.ivy2/jars/org.mongodb_mongo-java-driver-3.10.2.jar at spark://animal-mother.mgmt:39687/jars/org.mongodb_mongo-java-driver-3.10.2.jar with timestamp 1584406290462
20/03/16 17:51:30 INFO SparkContext: Added file file:///root/.ivy2/jars/org.mongodb.spark_mongo-spark-connector_2.11-2.2.7.jar at file:///root/.ivy2/jars/org.mongodb.spark_mongo-spark-connector_2.11-2.2.7.jar with timestamp 1584406290479
20/03/16 17:51:30 INFO Utils: Copying /root/.ivy2/jars/org.mongodb.spark_mongo-spark-connector_2.11-2.2.7.jar to /tmp/spark-05121f50-5990-4238-9553-51ff7b78a6ab/userFiles-08642a1c-3bef-4057-991b-98b364f8872c/org.mongodb.spark_mongo-spark-connector_2.11-2.2.7.jar
20/03/16 17:51:30 INFO SparkContext: Added file file:///root/.ivy2/jars/org.mongodb_mongo-java-driver-3.10.2.jar at file:///root/.ivy2/jars/org.mongodb_mongo-java-driver-3.10.2.jar with timestamp 1584406290495
20/03/16 17:51:30 INFO Utils: Copying /root/.ivy2/jars/org.mongodb_mongo-java-driver-3.10.2.jar to /tmp/spark-05121f50-5990-4238-9553-51ff7b78a6ab/userFiles-08642a1c-3bef-4057-991b-98b364f8872c/org.mongodb_mongo-java-driver-3.10.2.jar
20/03/16 17:51:30 INFO Executor: Starting executor ID driver on host localhost
20/03/16 17:51:30 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41405.
20/03/16 17:51:30 INFO NettyBlockTransferService: Server created on animal-mother.mgmt:41405
20/03/16 17:51:30 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/03/16 17:51:30 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, animal-mother.mgmt, 41405, None)
20/03/16 17:51:30 INFO BlockManagerMasterEndpoint: Registering block manager animal-mother.mgmt:41405 with 366.3 MB RAM, BlockManagerId(driver, animal-mother.mgmt, 41405, None)
20/03/16 17:51:30 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, animal-mother.mgmt, 41405, None)
20/03/16 17:51:30 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, animal-mother.mgmt, 41405, None)
20/03/16 17:51:30 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark-2.4.5-bin-hadoop2.7/spark-warehouse/').
20/03/16 17:51:30 INFO SharedState: Warehouse path is 'file:/opt/spark-2.4.5-bin-hadoop2.7/spark-warehouse/'.
20/03/16 17:51:31 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/03/16 17:51:31 INFO InMemoryFileIndex: It took 35 ms to list leaf files for 1 paths.
20/03/16 17:51:31 INFO InMemoryFileIndex: It took 1 ms to list leaf files for 1 paths.
20/03/16 17:51:33 INFO FileSourceStrategy: Pruning directories with:
20/03/16 17:51:33 INFO FileSourceStrategy: Post-Scan Filters: (length(trim(value#0, None)) > 0)
20/03/16 17:51:33 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
20/03/16 17:51:33 INFO FileSourceScanExec: Pushed Filters:
20/03/16 17:51:33 INFO CodeGenerator: Code generated in 191.762826 ms
20/03/16 17:51:34 INFO CodeGenerator: Code generated in 18.763374 ms
20/03/16 17:51:34 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 284.0 KB, free 366.0 MB)
20/03/16 17:51:34 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.9 KB, free 366.0 MB)
20/03/16 17:51:34 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on animal-mother.mgmt:41405 (size: 23.9 KB, free: 366.3 MB)
20/03/16 17:51:34 INFO SparkContext: Created broadcast 0 from csv at NativeMethodAccessorImpl.java:0
20/03/16 17:51:34 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
20/03/16 17:51:34 INFO SparkContext: Starting job: csv at NativeMethodAccessorImpl.java:0
20/03/16 17:51:34 INFO DAGScheduler: Got job 0 (csv at NativeMethodAccessorImpl.java:0) with 1 output partitions
20/03/16 17:51:34 INFO DAGScheduler: Final stage: ResultStage 0 (csv at NativeMethodAccessorImpl.java:0)
20/03/16 17:51:34 INFO DAGScheduler: Parents of final stage: List()
20/03/16 17:51:34 INFO DAGScheduler: Missing parents: List()
20/03/16 17:51:34 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at csv at NativeMethodAccessorImpl.java:0), which has no missing parents
20/03/16 17:51:34 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 8.9 KB, free 366.0 MB)
20/03/16 17:51:34 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.6 KB, free 366.0 MB)
20/03/16 17:51:34 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on animal-mother.mgmt:41405 (size: 4.6 KB, free: 366.3 MB)
20/03/16 17:51:34 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1163
20/03/16 17:51:34 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at csv at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0))
20/03/16 17:51:34 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
20/03/16 17:51:34 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 8245 bytes)
20/03/16 17:51:34 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
20/03/16 17:51:34 INFO Executor: Fetching file:///root/.ivy2/jars/org.mongodb.spark_mongo-spark-connector_2.11-2.2.7.jar with timestamp 1584406290479
20/03/16 17:51:34 INFO Utils: /root/.ivy2/jars/org.mongodb.spark_mongo-spark-connector_2.11-2.2.7.jar has been previously copied to /tmp/spark-05121f50-5990-4238-9553-51ff7b78a6ab/userFiles-08642a1c-3bef-4057-991b-98b364f8872c/org.mongodb.spark_mongo-spark-connector_2.11-2.2.7.jar
20/03/16 17:51:34 INFO Executor: Fetching file:///root/.ivy2/jars/org.mongodb_mongo-java-driver-3.10.2.jar with timestamp 1584406290495
20/03/16 17:51:34 INFO Utils: /root/.ivy2/jars/org.mongodb_mongo-java-driver-3.10.2.jar has been previously copied to /tmp/spark-05121f50-5990-4238-9553-51ff7b78a6ab/userFiles-08642a1c-3bef-4057-991b-98b364f8872c/org.mongodb_mongo-java-driver-3.10.2.jar
20/03/16 17:51:34 INFO Executor: Fetching spark://animal-mother.mgmt:39687/jars/org.mongodb_mongo-java-driver-3.10.2.jar with timestamp 1584406290462
20/03/16 17:51:34 INFO TransportClientFactory: Successfully created connection to animal-mother.mgmt/10.94.207.196:39687 after 39 ms (0 ms spent in bootstraps)
20/03/16 17:51:34 INFO Utils: Fetching spark://animal-mother.mgmt:39687/jars/org.mongodb_mongo-java-driver-3.10.2.jar to /tmp/spark-05121f50-5990-4238-9553-51ff7b78a6ab/userFiles-08642a1c-3bef-4057-991b-98b364f8872c/fetchFileTemp6763034555078161264.tmp
20/03/16 17:51:34 INFO Utils: /tmp/spark-05121f50-5990-4238-9553-51ff7b78a6ab/userFiles-08642a1c-3bef-4057-991b-98b364f8872c/fetchFileTemp6763034555078161264.tmp has been previously copied to /tmp/spark-05121f50-5990-4238-9553-51ff7b78a6ab/userFiles-08642a1c-3bef-4057-991b-98b364f8872c/org.mongodb_mongo-java-driver-3.10.2.jar
20/03/16 17:51:34 INFO Executor: Adding file:/tmp/spark-05121f50-5990-4238-9553-51ff7b78a6ab/userFiles-08642a1c-3bef-4057-991b-98b364f8872c/org.mongodb_mongo-java-driver-3.10.2.jar to class loader
20/03/16 17:51:34 INFO Executor: Fetching spark://animal-mother.mgmt:39687/jars/org.mongodb.spark_mongo-spark-connector_2.11-2.2.7.jar with timestamp 1584406290461
20/03/16 17:51:34 INFO Utils: Fetching spark://animal-mother.mgmt:39687/jars/org.mongodb.spark_mongo-spark-connector_2.11-2.2.7.jar to /tmp/spark-05121f50-5990-4238-9553-51ff7b78a6ab/userFiles-08642a1c-3bef-4057-991b-98b364f8872c/fetchFileTemp8181146042765144332.tmp
20/03/16 17:51:34 INFO Utils: /tmp/spark-05121f50-5990-4238-9553-51ff7b78a6ab/userFiles-08642a1c-3bef-4057-991b-98b364f8872c/fetchFileTemp8181146042765144332.tmp has been previously copied to /tmp/spark-05121f50-5990-4238-9553-51ff7b78a6ab/userFiles-08642a1c-3bef-4057-991b-98b364f8872c/org.mongodb.spark_mongo-spark-connector_2.11-2.2.7.jar
20/03/16 17:51:34 INFO Executor: Adding file:/tmp/spark-05121f50-5990-4238-9553-51ff7b78a6ab/userFiles-08642a1c-3bef-4057-991b-98b364f8872c/org.mongodb.spark_mongo-spark-connector_2.11-2.2.7.jar to class loader
20/03/16 17:51:34 INFO FileScanRDD: Reading File path: file:///root/stadium_data.csv, range: 0-1337, partition values: [empty row]
20/03/16 17:51:34 INFO CodeGenerator: Code generated in 11.910103 ms
20/03/16 17:51:34 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1334 bytes result sent to driver
20/03/16 17:51:34 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 404 ms on localhost (executor driver) (1/1)
20/03/16 17:51:34 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
20/03/16 17:51:34 INFO DAGScheduler: ResultStage 0 (csv at NativeMethodAccessorImpl.java:0) finished in 0.514 s
20/03/16 17:51:34 INFO DAGScheduler: Job 0 finished: csv at NativeMethodAccessorImpl.java:0, took 0.572299 s
20/03/16 17:51:34 INFO FileSourceStrategy: Pruning directories with:
20/03/16 17:51:34 INFO FileSourceStrategy: Post-Scan Filters:
20/03/16 17:51:34 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
20/03/16 17:51:34 INFO FileSourceScanExec: Pushed Filters:
20/03/16 17:51:34 INFO CodeGenerator: Code generated in 6.918409 ms
20/03/16 17:51:34 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 284.0 KB, free 365.7 MB)
20/03/16 17:51:34 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 23.9 KB, free 365.7 MB)
20/03/16 17:51:34 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on animal-mother.mgmt:41405 (size: 23.9 KB, free: 366.2 MB)
20/03/16 17:51:34 INFO SparkContext: Created broadcast 2 from csv at NativeMethodAccessorImpl.java:0
20/03/16 17:51:34 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
20/03/16 17:51:34 INFO SparkContext: Starting job: csv at NativeMethodAccessorImpl.java:0
20/03/16 17:51:34 INFO DAGScheduler: Got job 1 (csv at NativeMethodAccessorImpl.java:0) with 1 output partitions
20/03/16 17:51:34 INFO DAGScheduler: Final stage: ResultStage 1 (csv at NativeMethodAccessorImpl.java:0)
20/03/16 17:51:34 INFO DAGScheduler: Parents of final stage: List()
20/03/16 17:51:34 INFO DAGScheduler: Missing parents: List()
20/03/16 17:51:34 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[9] at csv at NativeMethodAccessorImpl.java:0), which has no missing parents
20/03/16 17:51:34 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 16.3 KB, free 365.7 MB)
20/03/16 17:51:34 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 9.1 KB, free 365.7 MB)
20/03/16 17:51:34 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on animal-mother.mgmt:41405 (size: 9.1 KB, free: 366.2 MB)
20/03/16 17:51:34 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1163
20/03/16 17:51:34 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[9] at csv at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0))
20/03/16 17:51:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
20/03/16 17:51:34 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, PROCESS_LOCAL, 8245 bytes)
20/03/16 17:51:34 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
20/03/16 17:51:35 INFO FileScanRDD: Reading File path: file:///root/stadium_data.csv, range: 0-1337, partition values: [empty row]
20/03/16 17:51:35 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1534 bytes result sent to driver
20/03/16 17:51:35 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 45 ms on localhost (executor driver) (1/1)
20/03/16 17:51:35 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
20/03/16 17:51:35 INFO DAGScheduler: ResultStage 1 (csv at NativeMethodAccessorImpl.java:0) finished in 0.059 s
20/03/16 17:51:35 INFO DAGScheduler: Job 1 finished: csv at NativeMethodAccessorImpl.java:0, took 0.066280 s
20/03/16 17:51:35 INFO cluster: Cluster created with settings {hosts=[127.0.0.1:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
20/03/16 17:51:35 INFO cluster: Cluster description not yet available. Waiting for 30000 ms before timing out
20/03/16 17:51:35 INFO connection: Opened connection [connectionId{localValue:1, serverValue:137}] to 127.0.0.1:27017
20/03/16 17:51:35 INFO cluster: Monitor thread successfully connected to server with description ServerDescription{address=127.0.0.1:27017, type=STANDALONE, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 2, 3]}, minWireVersion=0, maxWireVersion=8, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=4572629}
20/03/16 17:51:35 INFO MongoClientCache: Creating MongoClient: [127.0.0.1:27017]
20/03/16 17:51:35 INFO connection: Opened connection [connectionId{localValue:2, serverValue:138}] to 127.0.0.1:27017
20/03/16 17:51:35 INFO FileSourceStrategy: Pruning directories with:
20/03/16 17:51:35 INFO FileSourceStrategy: Post-Scan Filters:
20/03/16 17:51:35 INFO FileSourceStrategy: Output Data Schema: struct<RANK: int, SCHOOL: string, STADIUM: string, CAPACITY: string ... 2 more fields>
20/03/16 17:51:35 INFO FileSourceScanExec: Pushed Filters:
20/03/16 17:51:35 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 284.0 KB, free 365.4 MB)
20/03/16 17:51:35 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 23.9 KB, free 365.4 MB)
20/03/16 17:51:35 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on animal-mother.mgmt:41405 (size: 23.9 KB, free: 366.2 MB)
20/03/16 17:51:35 INFO SparkContext: Created broadcast 4 from rdd at MongoSpark.scala:154
20/03/16 17:51:35 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
20/03/16 17:51:35 INFO SparkContext: Starting job: foreachPartition at MongoSpark.scala:117
20/03/16 17:51:35 INFO DAGScheduler: Got job 2 (foreachPartition at MongoSpark.scala:117) with 1 output partitions
20/03/16 17:51:35 INFO DAGScheduler: Final stage: ResultStage 2 (foreachPartition at MongoSpark.scala:117)
20/03/16 17:51:35 INFO DAGScheduler: Parents of final stage: List()
20/03/16 17:51:35 INFO DAGScheduler: Missing parents: List()
20/03/16 17:51:35 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[15] at map at MongoSpark.scala:154), which has no missing parents
20/03/16 17:51:35 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 18.1 KB, free 365.3 MB)
20/03/16 17:51:35 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 9.9 KB, free 365.3 MB)
20/03/16 17:51:35 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on animal-mother.mgmt:41405 (size: 9.9 KB, free: 366.2 MB)
20/03/16 17:51:35 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1163
20/03/16 17:51:35 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[15] at map at MongoSpark.scala:154) (first 15 tasks are for partitions Vector(0))
20/03/16 17:51:35 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
20/03/16 17:51:35 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, executor driver, partition 0, PROCESS_LOCAL, 8245 bytes)
20/03/16 17:51:35 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
20/03/16 17:51:35 INFO CodeGenerator: Code generated in 14.184878 ms
20/03/16 17:51:35 INFO FileScanRDD: Reading File path: file:///root/stadium_data.csv, range: 0-1337, partition values: [empty row]
20/03/16 17:51:35 INFO CodeGenerator: Code generated in 10.456978 ms
20/03/16 17:51:35 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 1437 bytes result sent to driver
20/03/16 17:51:35 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 133 ms on localhost (executor driver) (1/1)
20/03/16 17:51:35 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
20/03/16 17:51:35 INFO DAGScheduler: ResultStage 2 (foreachPartition at MongoSpark.scala:117) finished in 0.152 s
20/03/16 17:51:35 INFO DAGScheduler: Job 2 finished: foreachPartition at MongoSpark.scala:117, took 0.157083 s
20/03/16 17:51:35 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 256.0 B, free 365.3 MB)
20/03/16 17:51:35 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 411.0 B, free 365.3 MB)
20/03/16 17:51:35 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on animal-mother.mgmt:41405 (size: 411.0 B, free: 366.2 MB)
20/03/16 17:51:35 INFO SparkContext: Created broadcast 6 from broadcast at MongoSpark.scala:543
Schema:
root
 |-- RANK: integer (nullable = true)
 |-- SCHOOL: string (nullable = true)
 |-- STADIUM: string (nullable = true)
 |-- CAPACITY: string (nullable = true)

20/03/16 17:51:35 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 320.0 B, free 365.3 MB)
20/03/16 17:51:35 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 441.0 B, free 365.3 MB)
20/03/16 17:51:35 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on animal-mother.mgmt:41405 (size: 441.0 B, free: 366.2 MB)
20/03/16 17:51:35 INFO SparkContext: Created broadcast 7 from broadcast at MongoSpark.scala:543
20/03/16 17:51:35 INFO cluster: Cluster created with settings {hosts=[127.0.0.1:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
20/03/16 17:51:35 INFO cluster: Cluster description not yet available. Waiting for 30000 ms before timing out
20/03/16 17:51:35 INFO connection: Opened connection [connectionId{localValue:3, serverValue:139}] to 127.0.0.1:27017
20/03/16 17:51:35 INFO cluster: Monitor thread successfully connected to server with description ServerDescription{address=127.0.0.1:27017, type=STANDALONE, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 2, 3]}, minWireVersion=0, maxWireVersion=8, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=913141}
20/03/16 17:51:35 INFO MongoClientCache: Creating MongoClient: [127.0.0.1:27017]
20/03/16 17:51:35 INFO connection: Opened connection [connectionId{localValue:4, serverValue:140}] to 127.0.0.1:27017
20/03/16 17:51:35 INFO SparkContext: Starting job: treeAggregate at MongoInferSchema.scala:88
20/03/16 17:51:35 INFO DAGScheduler: Got job 3 (treeAggregate at MongoInferSchema.scala:88) with 1 output partitions
20/03/16 17:51:35 INFO DAGScheduler: Final stage: ResultStage 3 (treeAggregate at MongoInferSchema.scala:88)
20/03/16 17:51:35 INFO DAGScheduler: Parents of final stage: List()
20/03/16 17:51:35 INFO DAGScheduler: Missing parents: List()
20/03/16 17:51:35 INFO DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[23] at treeAggregate at MongoInferSchema.scala:88), which has no missing parents
20/03/16 17:51:35 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 6.2 KB, free 365.3 MB)
20/03/16 17:51:35 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 3.3 KB, free 365.3 MB)
20/03/16 17:51:35 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on animal-mother.mgmt:41405 (size: 3.3 KB, free: 366.2 MB)
20/03/16 17:51:35 INFO SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1163
20/03/16 17:51:35 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[23] at treeAggregate at MongoInferSchema.scala:88) (first 15 tasks are for partitions Vector(0))
20/03/16 17:51:35 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
20/03/16 17:51:35 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, executor driver, partition 0, ANY, 8001 bytes)
20/03/16 17:51:35 INFO Executor: Running task 0.0 in stage 3.0 (TID 3)
20/03/16 17:51:35 INFO Executor: Finished task 0.0 in stage 3.0 (TID 3). 1605 bytes result sent to driver
20/03/16 17:51:35 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 70 ms on localhost (executor driver) (1/1)
20/03/16 17:51:35 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
20/03/16 17:51:35 INFO DAGScheduler: ResultStage 3 (treeAggregate at MongoInferSchema.scala:88) finished in 0.082 s
20/03/16 17:51:35 INFO DAGScheduler: Job 3 finished: treeAggregate at MongoInferSchema.scala:88, took 0.089431 s


*******************************************************************


Ranking of Largest College Football Stadium Sizes, in the USA:
20/03/16 17:51:36 INFO MongoRelation: requiredColumns: RANK, SCHOOL, _id, CAPACITY, STADIUM, filters:
20/03/16 17:51:36 INFO CodeGenerator: Code generated in 13.305146 ms
20/03/16 17:51:36 INFO CodeGenerator: Code generated in 18.310045 ms
20/03/16 17:51:36 INFO SparkContext: Starting job: showString at NativeMethodAccessorImpl.java:0
20/03/16 17:51:36 INFO DAGScheduler: Got job 4 (showString at NativeMethodAccessorImpl.java:0) with 1 output partitions
20/03/16 17:51:36 INFO DAGScheduler: Final stage: ResultStage 4 (showString at NativeMethodAccessorImpl.java:0)
20/03/16 17:51:36 INFO DAGScheduler: Parents of final stage: List()
20/03/16 17:51:36 INFO DAGScheduler: Missing parents: List()
20/03/16 17:51:36 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[29] at showString at NativeMethodAccessorImpl.java:0), which has no missing parents
20/03/16 17:51:36 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 11.8 KB, free 365.3 MB)
20/03/16 17:51:36 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 5.7 KB, free 365.3 MB)
20/03/16 17:51:36 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on animal-mother.mgmt:41405 (size: 5.7 KB, free: 366.2 MB)
20/03/16 17:51:36 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:1163
20/03/16 17:51:36 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[29] at showString at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0))
20/03/16 17:51:36 INFO TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
20/03/16 17:51:36 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 4, localhost, executor driver, partition 0, ANY, 8001 bytes)
20/03/16 17:51:36 INFO Executor: Running task 0.0 in stage 4.0 (TID 4)
20/03/16 17:51:36 INFO Executor: Finished task 0.0 in stage 4.0 (TID 4). 2673 bytes result sent to driver
20/03/16 17:51:36 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 4) in 23 ms on localhost (executor driver) (1/1)
20/03/16 17:51:36 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
20/03/16 17:51:36 INFO DAGScheduler: ResultStage 4 (showString at NativeMethodAccessorImpl.java:0) finished in 0.034 s
20/03/16 17:51:36 INFO DAGScheduler: Job 4 finished: showString at NativeMethodAccessorImpl.java:0, took 0.040323 s
+--------+----+--------------+--------------------------------------------------------------------+--------------------------+
|CAPACITY|RANK|SCHOOL        |STADIUM                                                             |_id                       |
+--------+----+--------------+--------------------------------------------------------------------+--------------------------+
|107,601 |1   |Michigan      |Michigan Stadium (Ann Arbor, Mich.)                                 |[5e701f174dd58907d6a06630]|
|106,572 |2   |Penn State    |Beaver Stadium (University Park, Pa.)                               |[5e701f174dd58907d6a06631]|
|102,733 |3   |Texas A&M     |Kyle Field (College Station, Texas)                                 |[5e701f174dd58907d6a06632]|
|102,455 |4   |Tennessee     |Neyland Stadium (Knoxville, Tenn.)                                  |[5e701f174dd58907d6a06633]|
|102,321 |5   |LSU           |Tiger Stadium (Baton Rouge, La.)                                    |[5e701f174dd58907d6a06634]|
|102,082 |6   |Ohio State    |Ohio Stadium (Columbus, Ohio)                                       |[5e701f174dd58907d6a06635]|
|101,821 |7   |Alabama       |Bryant-Denny Stadium (Tuscaloosa, Ala.)                             |[5e701f174dd58907d6a06636]|
|100,119 |8   |Texas         |Darrell K Royal-Texas Memorial Stadium (Austin, Texas)              |[5e701f174dd58907d6a06637]|
|92,746  |9   |Georgia       |Sanford Stadium (Athens, Ga.)                                       |[5e701f174dd58907d6a06638]|
|90,888  |10  |UCLA          |Rose Bowl (Pasadena, Calif.)                                        |[5e701f174dd58907d6a06639]|
|88,548  |11  |Florida       |Ben Hill Griffin Stadium (Gainesville, Fla.)                        |[5e701f174dd58907d6a0663a]|
|87,451  |12  |Auburn        |Jordan-Hare Stadium (Auburn, Ala.)                                  |[5e701f174dd58907d6a0663b]|
|86,112  |13  |Oklahoma      |Gaylord Family Oklahoma Memorial Stadium (Norman, Okla.)            |[5e701f174dd58907d6a0663c]|
|85,458  |14  |Nebraska      |Memorial Stadium (Lincoln, Neb.)                                    |[5e701f174dd58907d6a0663d]|
|81,500  |15  |Clemson       |Frank Howard Field at Clemson Memorial Stadium (Clemson, S.C.)      |[5e701f174dd58907d6a0663e]|
|80,795  |16  |Notre Dame    |Notre Dame Stadium (South Bend, Ind.)                               |[5e701f174dd58907d6a0663f]|
|80,321  |17  |Wisconsin     |Camp Randall Stadium (Madison, Wisc.)                               |[5e701f174dd58907d6a06640]|
|80,250  |18  |South Carolina|Williams-Brice Stadium (Columbia, S.C.)                             |[5e701f174dd58907d6a06641]|
|79,560  |19  |Florida State |Bobby Bowden Field at Doak Campbell Stadium (Tallahassee, Fla.)     |[5e701f174dd58907d6a06642]|
|77,500  |20  |Southern Cal. |United Airlines Field at Los Angeles Memorial Coliseum (Los Angeles)|[5e701f174dd58907d6a06643]|
+--------+----+--------------+--------------------------------------------------------------------+--------------------------+

*******************************************************************
20/03/16 17:51:36 INFO MongoClientCache: Closing MongoClient: [127.0.0.1:27017]
20/03/16 17:51:36 INFO SparkContext: Invoking stop() from shutdown hook
20/03/16 17:51:36 INFO connection: Closed connection [connectionId{localValue:2, serverValue:138}] to 127.0.0.1:27017 because the pool has been closed.
20/03/16 17:51:36 INFO MongoClientCache: Closing MongoClient: [127.0.0.1:27017]
20/03/16 17:51:36 INFO connection: Closed connection [connectionId{localValue:4, serverValue:140}] to 127.0.0.1:27017 because the pool has been closed.
20/03/16 17:51:36 INFO SparkUI: Stopped Spark web UI at http://animal-mother.mgmt:4040
20/03/16 17:51:36 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!

```
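
The schema printed above reports `CAPACITY` as a string (for example `107,601`), so any ordering on that column compares characters rather than numbers. The sketch below, in plain Python, shows the difference using three values copied from the table; `parse_capacity` is a hypothetical helper name introduced here for illustration, not part of the original script.

```python
def parse_capacity(text: str) -> int:
    """Convert a capacity string such as '107,601' to the integer 107601."""
    # Drop the thousands separator and any padding before converting.
    return int(text.replace(",", "").strip())


# Three CAPACITY values copied from the table above.
capacities = ["92,746", "107,601", "100,119"]

# Sorting the raw strings compares characters, so '92,746' lands first.
as_strings = sorted(capacities, reverse=True)

# Sorting on the parsed integers gives the true largest-first order.
as_numbers = sorted(capacities, key=parse_capacity, reverse=True)

print(as_strings)
print(as_numbers)
```

In Spark itself, a similar cleanup could likely be done on the DataFrame by stripping the commas with `pyspark.sql.functions.regexp_replace` and then calling `.cast("int")` on the resulting column.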

### Another core approach: configure the connector in the SparkSession builder


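```python
from pyspark.sql import SparkSession

# Build a session that pulls in the MongoDB Spark connector package and
# points both reads and writes at the same database.collection namespace.
spark = SparkSession.builder.appName("Spark Mongo Connector - Demo") \
    .config("spark.mongodb.input.uri", "mongodb://localhost:27017/database.collection") \
    .config("spark.mongodb.output.uri", "mongodb://localhost:27017/database.collection") \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.4.5") \
    .getOrCreate()

# Load the collection as a DataFrame; the connector infers the schema from the stored documents.
df = spark.read.format("mongo").load()
# Equivalent long form of the data source name:
# df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
```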
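
The two URI settings above follow the pattern `mongodb://host:port/database.collection`. As a minimal sketch of how those strings could be assembled rather than typed by hand, the helpers below build the URI and the matching pair of connector settings. The function names `mongo_uri` and `connector_conf`, and the `sports.stadiums` namespace, are hypothetical and used only for illustration.

```python
def mongo_uri(database: str, collection: str,
              host: str = "localhost", port: int = 27017) -> str:
    """Return a URI of the form mongodb://host:port/database.collection."""
    return f"mongodb://{host}:{port}/{database}.{collection}"


def connector_conf(database: str, collection: str,
                   host: str = "localhost", port: int = 27017) -> dict:
    """Return the input and output URI settings, both pointing at one namespace."""
    uri = mongo_uri(database, collection, host, port)
    return {
        "spark.mongodb.input.uri": uri,
        "spark.mongodb.output.uri": uri,
    }


# Hypothetical namespace: database "sports", collection "stadiums".
conf = connector_conf("sports", "stadiums")
for key, value in conf.items():
    print(key, "=", value)
```

Each key and value could then be passed to `SparkSession.builder.config(key, value)` in a loop before calling `getOrCreate()`.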

<br><br>