# Using Hyperspace for indexing in Synapse Spark

Hyperspace introduces the ability for Apache Spark users to create indexes on their datasets, such as CSV, JSON, and Parquet, and use them for potential query and workload acceleration.

Hyperspace helps accelerate your workloads or queries under two circumstances:

- Queries contain filters on predicates with high selectivity. For example, you might want to select 100 matching rows from a million candidate rows.
- Queries contain a join that requires heavy shuffles. For example, you might want to join a 100-GB dataset with a 10-GB dataset.

By default, Spark uses a broadcast join to optimize join queries when the data on one side of the join is small, which is the case for the sample data in this tutorial. We therefore disable broadcast joins so that Spark uses a sort-merge join when we run join queries later. This is mainly to show how Hyperspace indexes would be used at scale to accelerate join queries.

In [None]:
# Disable BroadcastHashJoin so that Spark falls back to the standard SortMergeJoin.
# Currently, Hyperspace indexes rely on SortMergeJoin to speed up join queries.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# Verify that the threshold is now -1 (broadcast joins disabled)
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
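To make the effect of the threshold concrete, here is a minimal sketch of the decision in plain Python. It is a simplified model, not Spark's actual planner: the function name `choose_join_strategy` is hypothetical, and the only Spark facts it relies on are the documented 10 MB default for `spark.sql.autoBroadcastJoinThreshold` and that `-1` disables broadcasting.

```python
# Simplified model of the join-strategy choice; NOT Spark's real planner.
# `choose_join_strategy` is a hypothetical helper used only for illustration.

DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # Spark's documented default (10 MB)


def choose_join_strategy(smaller_side_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Return the join strategy this simplified model would pick."""
    # A negative threshold (for example -1) disables broadcast joins entirely.
    if threshold_bytes < 0:
        return "SortMergeJoin"
    # Otherwise broadcast only when the smaller side fits under the threshold.
    if smaller_side_bytes <= threshold_bytes:
        return "BroadcastHashJoin"
    return "SortMergeJoin"


two_mb = 2 * 1024 * 1024
print(choose_join_strategy(two_mb))      # small table, default threshold
print(choose_join_strategy(two_mb, -1))  # same table, broadcast disabled
```

With the threshold set to `-1`, even a very small table takes the sort-merge path in this model, which is the behavior the tutorial depends on.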

Load customer data into a Spark dataframe.

In [None]:
df_customer = spark.read.load('abfss://wwi-02@#DATA_LAKE_ACCOUNT_NAME#.dfs.core.windows.net/data-generators/generator-customer.csv', format='csv', header=True)
display(df_customer.limit(10))

Load sales data into a Spark dataframe.

In [None]:
df_sales = spark.read.load('abfss://wwi-02@#DATA_LAKE_ACCOUNT_NAME#.dfs.core.windows.net/sale-small/Year=2019/Quarter=Q4/Month=12/*/sale-small-20191201-snappy.parquet', format='parquet')
display(df_sales.limit(10))

Initialize the Hyperspace engine in the Spark session.

In [None]:
from hyperspace import *  
from com.microsoft.hyperspace import *
from com.microsoft.hyperspace.index import *

# Create an instance of Hyperspace
hyperspace = Hyperspace(spark)

Create index configurations for customer and sales data as follows:

- The customer index is built on the `CustomerId` column and also includes (covers) the `BirthDate` column.
- The sales index is built on the `CustomerId` column and also includes (covers) the `ProductId` and `Quantity` columns.

Using the index configurations, create the actual indexes on the customer and sales dataframes.

In [None]:
customer_index_config = IndexConfig("customerIndex1", ["CustomerId"], ["BirthDate"])
sales_index_config = IndexConfig("salesIndex1", ["CustomerId"], ["ProductId", "Quantity"])

hyperspace.createIndex(df_customer, customer_index_config)
hyperspace.createIndex(df_sales, sales_index_config)

Enumerate all available indexes.

In [None]:
hyperspace.indexes().show()

Check the data lake location of the first index from the list.

In [None]:
hyperspace.indexes().first().indexLocation

Hyperspace provides APIs to enable or disable index usage with Spark.

- The **hyperspace.enable()** command makes the Hyperspace optimization rules visible to the Spark optimizer, which can then use existing Hyperspace indexes to optimize user queries.
- The **hyperspace.disable()** command stops Hyperspace rules from being applied during query optimization. Disabling Hyperspace does not affect existing indexes; they remain intact.

Enable Hyperspace on the current Spark session.

In [None]:
hyperspace.enable(spark)

Currently, Hyperspace has rules to exploit indexes for two groups of queries:

- Selection queries with lookup or range selection filtering predicates.
- Join queries with an equality join predicate (that is, equijoins).

Observe the impact of Hyperspace on a selection query. Start with a lookup filter on `CustomerId`, followed by a selection that contains a column not covered by the index.

In [None]:
sales_filter = df_sales.filter('CustomerId = 85100').select(['CustomerId', 'TransactionDate'])
sales_filter.show()

Inspect the execution plan for this query. Note that the physical plan still scans the original data files: `TransactionDate` is not covered by the index, so it must be loaded from the original data.
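The coverage condition can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Hyperspace's actual rule implementation: the helper `index_covers_filter_query` is hypothetical, and it assumes an index is usable for a filter query when the filter references the first indexed column and every referenced column is either indexed or included.

```python
# Hypothetical helper: a simplified model of when a covering index can serve
# a filter query. It is not part of the Hyperspace API.

def index_covers_filter_query(indexed_cols, included_cols, filter_cols, selected_cols):
    """Return True if the index alone can answer the query in this model."""
    covered = set(indexed_cols) | set(included_cols)
    referenced = set(filter_cols) | set(selected_cols)
    # Assumption: the filter must reference the leading indexed column.
    uses_leading_key = indexed_cols[0] in filter_cols
    return uses_leading_key and referenced <= covered


# Column lists of the sales index defined earlier in this tutorial.
sales_indexed = ["CustomerId"]
sales_included = ["ProductId", "Quantity"]

# TransactionDate is outside the index, so the original files must be scanned.
print(index_covers_filter_query(
    sales_indexed, sales_included,
    ["CustomerId"], ["CustomerId", "TransactionDate"]))

# Every referenced column is indexed or included, so the index is sufficient.
print(index_covers_filter_query(
    sales_indexed, sales_included,
    ["CustomerId"], ["CustomerId", "ProductId", "Quantity"]))
```

The first call reports that the index does not cover the query; the second reports that it does, which matches the difference between the two filter queries in this tutorial.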

In [None]:
spark.conf.set("spark.hyperspace.explain.displayMode", "html")

hyperspace.explain(sales_filter, True, displayHTML)

Perform the same filtering but with a selection that is covered by the index.

In [None]:
sales_filter = df_sales.filter('CustomerId == 85100').select(['CustomerId', 'ProductId', 'Quantity'])
sales_filter.show()

Observe how the plan now relies on the Hyperspace index for execution.

In [None]:
hyperspace.explain(sales_filter, True, displayHTML)

Perform a join between the customer and sales dataframes.

In [None]:
customers_sales_join = df_customer.join(df_sales, df_customer.CustomerId == df_sales.CustomerId).select(df_sales.CustomerId, df_sales.ProductId, df_customer.BirthDate, df_sales.Quantity)
customers_sales_join.show()

Observe the impact of both indexes in the execution plan.

In [None]:
hyperspace.explain(customers_sales_join, True, displayHTML)
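As a rough mental model for why both indexes apply to this join, the following plain-Python sketch checks each side of an equijoin. It is a simplification with hypothetical names, not the Hyperspace rule itself: it assumes an index is usable for one side of the join when its indexed columns match the join key and all columns referenced from that side are indexed or included.

```python
# Hypothetical helper: a simplified model of index eligibility for one side
# of an equijoin. It is not part of the Hyperspace API.

def index_usable_for_join_side(indexed_cols, included_cols, join_cols, referenced_cols):
    """Return True if the index can serve this side of the join in this model."""
    # Assumption: the indexed columns must match the join key exactly.
    keys_match = list(indexed_cols) == list(join_cols)
    covered = set(indexed_cols) | set(included_cols)
    return keys_match and set(referenced_cols) <= covered


# Customer side: joined on CustomerId, BirthDate is selected.
customer_ok = index_usable_for_join_side(
    ["CustomerId"], ["BirthDate"],
    ["CustomerId"], ["CustomerId", "BirthDate"])

# Sales side: joined on CustomerId; ProductId and Quantity are selected.
sales_ok = index_usable_for_join_side(
    ["CustomerId"], ["ProductId", "Quantity"],
    ["CustomerId"], ["CustomerId", "ProductId", "Quantity"])

print(customer_ok and sales_ok)
```

In this model both sides qualify, so the join can read from the two indexes instead of the original data files.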