<p align="center">
<img src ="https://raw.githubusercontent.com/microsoft/azuredatastudio/master/src/sql/media/microsoft_logo_gray.svg?sanitize=true" width="250" align="center">
</p>

# **Spark Package Management in SQL Server 2019 Big Data Clusters**
This guide covers installing packages and submitting jobs to a SQL Server 2019 Big Data Cluster using Spark.
* Built-In Tools
* Install Packages from a Maven Repository onto the Spark Cluster at Runtime
* Import .jar from HDFS for use at runtime
* Import .jar at runtime through Azure Data Studio notebook cell configuration
* Install Python Packages at Runtime for use with PySpark 
* Submit local .jar or Python file

# Built-in Tools
* Spark and Hadoop base packages
* Python 3.5 and Python 2.7
* pandas, scikit-learn, NumPy, and other data processing packages
* R and MRO packages
* Sparklyr


# Install Packages from a Maven Repository onto the Spark Cluster at Runtime
Maven packages can be installed onto your Spark cluster through notebook cell configuration at the start of your Spark session. Before starting a Spark session in Azure Data Studio, run the following code:

```
%%configure -f
{"conf": {"spark.jars.packages": "com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1"}}
```
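`spark.jars.packages` accepts a comma-separated list of Maven coordinates, so several packages can be requested in one configuration cell. The following is a minimal sketch of how such a cell could be assembled; the helper name `build_configure_cell` is hypothetical and not part of any Big Data Clusters tooling.

```python
import json


def build_configure_cell(packages):
    """Return the text of a %%configure cell requesting the given Maven packages.

    Hypothetical helper, for illustration only.
    """
    for coordinate in packages:
        # Each entry is expected in group:artifact:version form.
        if len(coordinate.split(":")) != 3:
            raise ValueError(
                "expected group:artifact:version, got {!r}".format(coordinate))
    # Spark reads spark.jars.packages as one comma-separated string.
    conf = {"conf": {"spark.jars.packages": ",".join(packages)}}
    return "%%configure -f\n" + json.dumps(conf)


cell_text = build_configure_cell(
    ["com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1"])
print(cell_text)
```

The printed text can be pasted into the first cell of a notebook. Generating the JSON with `json.dumps` avoids quoting mistakes when the package list grows.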


# Import .jar from HDFS for use at runtime

Reference a .jar file stored in HDFS through notebook cell configuration in Azure Data Studio. The path in the example is an HDFS path.

```
%%configure -f
{"conf": {"spark.jars": "/jar/mycodeJar.jar"}}
```


# Import .jar at runtime through Azure Data Studio notebook cell configuration

```
%%configure -f
{"conf": {"spark.jars": "/jar/mycodeJar.jar"}}
```
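When a session needs both a .jar file and Maven packages, one option is to place both keys in a single configuration cell. The sketch below merges the two settings shown in this guide; `merge_conf` is a hypothetical helper, and the behavior of repeated `%%configure` cells should be checked against your own environment.

```python
import json


def merge_conf(*settings):
    """Merge several Spark conf dictionaries, refusing conflicting values.

    Hypothetical helper, for illustration only.
    """
    merged = {}
    for setting in settings:
        for key, value in setting.items():
            if key in merged and merged[key] != value:
                raise ValueError("conflicting values for {}".format(key))
            merged[key] = value
    return merged


conf = merge_conf(
    {"spark.jars": "/jar/mycodeJar.jar"},
    {"spark.jars.packages":
        "com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1"},
)
# Build the cell text: the magic line followed by the JSON body.
cell_text = "%%configure -f\n" + json.dumps({"conf": conf}, indent=2)
print(cell_text)
```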


# Install Python Packages at Runtime for use with PySpark

The `pip3` snippet in this section installs packages on each executor node at runtime. \
**Note**: This functionality is not available on a non-root BDC deployment (including OpenShift). The installation is temporary and must be repeated each time a new Spark session starts.

From CU5 onward, this requires two settings to be added before deployment.

In control.json, add (under security):

_"allowRunAsRoot": true_

In bdc.json, add (under spec.services.spark.settings):

_"yarn-site.yarn.nodemanager.container-executor.class": "org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor"_
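The two pre-deployment settings can be sketched as edits to the parsed JSON documents. This is only an illustration: `with_setting` is a hypothetical helper, and the starting dictionaries are minimal stand-ins that assume `security` and `spec` sit at the top level of their files, not real deployment configurations. Note that the YARN key itself contains dots, so the path is passed as a list of keys instead of a dotted string.

```python
import copy
import json


def with_setting(config, path, value):
    """Return a copy of config with value stored at the nested key path.

    Hypothetical helper, for illustration only.
    """
    updated = copy.deepcopy(config)
    node = updated
    for key in path[:-1]:
        # Create intermediate objects when they are missing.
        node = node.setdefault(key, {})
    node[path[-1]] = value
    return updated


# Minimal stand-ins for the parsed files (assumed layout, not real configs).
control_base = {"security": {}}
bdc_base = {"spec": {"services": {"spark": {"settings": {}}}}}

control = with_setting(control_base, ["security", "allowRunAsRoot"], True)
bdc = with_setting(
    bdc_base,
    ["spec", "services", "spark", "settings",
     "yarn-site.yarn.nodemanager.container-executor.class"],
    "org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor",
)
print(json.dumps(control, indent=2))
print(json.dumps(bdc, indent=2))
```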

``` Python
import os
import subprocess

# Point pip's cache directory at a writable location.
os.environ["XDG_CACHE_HOME"] = "/tmp"

# Install TensorFlow; pip's stderr is merged into the captured output.
stdout = subprocess.check_output(
    "pip3 install tensorflow",
    stderr=subprocess.STDOUT,
    shell=True).decode("utf-8")
print(stdout)
```
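Because the installation has to be repeated for every new Spark session, it can help to skip the `pip` call when the package is already importable. The following is a minimal sketch under that assumption; `is_installed`, `pip_install_command`, and `ensure_package` are hypothetical names, and the sketch uses `sys.executable -m pip` in place of the `pip3` command above so that the install targets the running interpreter.

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


def pip_install_command(package):
    """Build the pip command for the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", package]


def ensure_package(package, module_name=None):
    """Install package unless its module is already importable.

    module_name covers packages whose import name differs from the
    package name. Returns True if an install was attempted.
    """
    if is_installed(module_name or package):
        return False
    output = subprocess.check_output(
        pip_install_command(package), stderr=subprocess.STDOUT)
    print(output.decode("utf-8"))
    return True


# The standard-library json module is always present, so nothing is installed.
print(ensure_package("json"))
```

On a cluster, the call would look like `ensure_package("tensorflow")`, with the same non-root restrictions described in the note above.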

# Submit local .jar or Python file
One of the key scenarios for big data clusters is the ability to submit Spark jobs for SQL Server. The Spark job submission feature lets you submit local .jar or .py files with references to a SQL Server 2019 big data cluster. It also lets you run .jar or .py files that are already located in HDFS.

* [Submit Spark jobs on SQL Server Big Data Clusters in Azure Data Studio](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-submit-job?view=sqlallproducts-allversions)
* [Submit Spark jobs on SQL Server Big Data Clusters in IntelliJ](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-submit-job-intellij-tool-plugin?view=sqlallproducts-allversions)
* [Submit Spark jobs on SQL Server big data cluster in Visual Studio Code](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-hive-tools-vscode?view=sqlallproducts-allversions)
