## Add some Spark in the data

Here, we put a pandas DataFrame into a Spark cluster! The SparkSession class has a method for this.

The .createDataFrame() method takes a pandas DataFrame and returns a Spark DataFrame.

The output of this method is stored locally, not in the SparkSession catalog. This means that you can use all the Spark DataFrame methods on it, but you can't access the data in other contexts.

For example, a SQL query (using the .sql() method) that references the DataFrame will throw an error. To query the data this way, you have to register it as a temporary table with the .createTempView() Spark DataFrame method, which takes the name of the temporary table as its only argument. The method registers the DataFrame as a table in the catalog, but because the table is temporary, it can only be accessed from the specific SparkSession used to create the Spark DataFrame.

There is also the .createOrReplaceTempView() method. It safely creates a new temporary table if nothing was there before, or updates an existing table if one was already defined. Use this method to avoid running into problems with duplicate tables.

In [0]:
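import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

# Reuse the active SparkSession, or start one if none exists
spark = SparkSession.builder.getOrCreate()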

# Create pd_temp: a single-column pandas DataFrame of 10 random floats
pd_temp = pd.DataFrame(np.random.random(10))

# Create spark_temp from pd_temp
spark_temp = spark.createDataFrame(pd_temp)

# Examine the tables in the catalog
print(spark.catalog.listTables())

# Add spark_temp to the catalog
spark_temp.createOrReplaceTempView("temp")

# Examine the tables in the catalog again
print(spark.catalog.listTables())