# SparkSession and SparkContext

**What are `SparkSession` and `SparkContext`?**

The `SparkSession` and `SparkContext` are two important components in Apache Spark, but they serve different purposes and have different functionalities:

1. `SparkContext`:
    - `SparkContext` was the main entry point for Spark functionality before Spark 2.0, and it remains the core object underneath `SparkSession`.
    - It represents the connection to a Spark cluster and acts as a handle to the cluster.
    - It provides the low-level API of Spark, such as creating RDDs (Resilient Distributed Datasets), on which you then apply transformations and actions.
    - `SparkContext` is used primarily for working with RDDs and is not aware of higher-level structured APIs like DataFrames and Datasets.
   

2. `SparkSession`:

    - `SparkSession` was introduced in Spark 2.0 as a unified entry point for working with structured data and higher-level APIs like DataFrames and Datasets.
    - It encapsulates the functionality of `SparkContext` and provides additional capabilities for working with structured data.
    - `SparkSession` provides a more user-friendly and intuitive API for working with data in Spark.
    - It includes methods for creating DataFrames and Datasets, executing SQL queries, and interacting with various data sources.
    - `SparkSession` internally manages a `SparkContext` and automatically creates one if it doesn't exist.
    
    
In summary, `SparkContext` is the older, lower-level entry point focused on RDD-based operations, while `SparkSession` is the newer, higher-level entry point that wraps a `SparkContext` and adds a structured API for DataFrames, Datasets, and SQL. Use `SparkSession` for most Spark applications; when you do need RDDs or other low-level features, reach the underlying context through `spark.sparkContext`.

## 1. How to build a SparkSession.

**Full documentation**: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.SparkSession.html

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Ejemplo1_SparkSession") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

In [3]:
spark

In [4]:
spark.stop()

## 2. How to build a SparkContext.

**Full documentation**: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.SparkContext.html

In [5]:
from pyspark import SparkContext, SparkConf

In [6]:
def initSC(cores, memory):
    """
    Initialize a SparkContext.
    """
    conf = SparkConf()\
           .setAppName(f"AppName {cores} cores {memory} mem")
           # .set('spark.cores.max', cores)\
           # .set('spark.executorEnv.PYTHONHASHSEED',"123")\
           # .set('spark.executor.memory', memory)
    sc = SparkContext(conf=conf)
    sc.setLogLevel("ERROR")
    #sc.addPyFile("py.zip")
    return sc

In [7]:
sc = initSC(0,0)
sc

In [8]:
sc.stop()

---

It is **not necessary** to **create multiple SparkSessions within the same Spark application**, and it is rarely useful. A Spark application has exactly one active `SparkContext` (only one can run per JVM), and `SparkSession.builder.getOrCreate()` returns the existing session instead of building a second one.

The SparkSession wraps the SparkContext and replaces the older `SQLContext` and `HiveContext` entry points, so a single session gives access to all of Spark's functionality.

If different sections or tasks within your application need isolated SQL settings, temporary views, or registered UDFs, you can **create additional sessions that share the same SparkContext** with the `SparkSession.newSession()` method:

In [10]:
spark = SparkSession.builder.appName("AppName").getOrCreate()

# newSession() returns a new SparkSession backed by the same SparkContext
sc1 = spark.sparkContext
sc2 = spark.newSession().sparkContext  # sc2 is the same object as sc1

This way, you get **multiple SparkSessions with separate SQL configuration and temporary views, all backed by one SparkContext**, so they share the same underlying resources.

Keep in mind that cluster-level settings such as executor memory and cores are fixed when the SparkContext starts and cannot differ between sessions. In most cases, a single SparkSession with its single SparkContext is sufficient.

---

### For the Dana cluster:

In [None]:
# `internal_param` must be defined before running this cell.
conf = SparkConf().setMaster("spark://dana:7077").setAppName(internal_param[1]) \
                  .setAll([('spark.driver.cores', internal_param[2]),
                           ('spark.driver.memory', internal_param[3]),
                           ('spark.executor.instances', internal_param[4]),
                           ('spark.executor.memory', internal_param[5]),
                           ('spark.executor.cores', internal_param[6])])

sc = SparkContext(conf = conf)
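
The cell above indexes `internal_param` without showing its contents. As a sketch only, assuming it is a list laid out to match those indices, the property/value pairs could be assembled in plain Python before being handed to `SparkConf().setAll(...)`. Every value below is hypothetical, and `build_conf_pairs` is a helper introduced here for illustration.

```python
# Hypothetical parameter list; index 0 is unused by the cluster cell.
internal_param = [None, "dana_app", 2, "4g", 3, "8g", 4]

# Spark properties in the order they are read from internal_param[2:7].
CONF_KEYS = [
    "spark.driver.cores",
    "spark.driver.memory",
    "spark.executor.instances",
    "spark.executor.memory",
    "spark.executor.cores",
]


def build_conf_pairs(params):
    """Pair each Spark property with its value, converted to a string."""
    values = [str(v) for v in params[2:7]]
    if len(values) != len(CONF_KEYS):
        raise ValueError("expected values at indices 2 through 6")
    return list(zip(CONF_KEYS, values))


app_name = internal_param[1]
pairs = build_conf_pairs(internal_param)
# The pairs could then be passed on as SparkConf().setAll(pairs).
```

Building the pairs separately makes it easy to inspect or validate the settings before a cluster connection is attempted.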