<a href="https://colab.research.google.com/github/visshal2301/AdvanceSpark_GoogleColab/blob/main/Shubham_4_Join_Broadcast_Accumulators.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
#BroadCast

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BroadcastJoinExample").getOrCreate()

# Small dataset (can be broadcasted)
small_df = spark.createDataFrame(
    [(1, "A"), (2, "B"), (3, "C")],
    ["id", "val1"]
)

# Large dataset
large_df = spark.range(0, 1000000).withColumnRenamed("id", "id")

# Force Broadcast Hash Join
broadcast_join = large_df.join(small_df.hint("BROADCAST"), "id", "inner")

broadcast_join.show(5)
broadcast_join.explain()


+---+----+
| id|val1|
+---+----+
|  1|   A|
|  2|   B|
|  3|   C|
+---+----+

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [id#10L, val1#9]
   +- BroadcastHashJoin [id#10L], [id#8L], Inner, BuildRight, false
      :- Range (0, 1000000, step=1, splits=2)
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [plan_id=160]
         +- Filter isnotnull(id#8L)
            +- Scan ExistingRDD[id#8L,val1#9]




You are absolutely right! The explain() output clearly shows that Spark applied a BroadcastHashJoin.

Specifically, you can see BroadcastHashJoin listed in the physical plan, and below it, BroadcastExchange indicates that the small_df (the right side of the join, BuildRight) was broadcasted to all executor nodes. This confirms the broadcast strategy was successfully utilized as intended by the hint("BROADCAST").

- Default value: 10 MB (in most Spark versions).
- can be configured to 8 gb . but it is very unlikely that process will happen

- Meaning: If the size of a DataFrame/table is estimated to be less than this threshold, Spark will broadcast it to all worker nodes and perform a Broadcast Hash Join.
- Effect: This avoids shuffling the larger dataset, making joins much faster for small lookup tables.


In [3]:
#broadcast Hash Join

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("BroadcastJoinExample").getOrCreate()

# Medium-sized dataset (too large to broadcast efficiently)
medium_df = spark.range(0, 500000).withColumnRenamed("id", "id")

# Another large dataset
large_df2 = spark.range(0, 1000000).withColumnRenamed("id", "id")

# Force Shuffle Hash Join
shuffle_hash_join = medium_df.hint("SHUFFLE_HASH").join(large_df2, "id", "inner")

# The original intention might have been to show some rows and then explain the plan.
# If you intended to save the DataFrame, you need to specify a format and a path, e.g.:
# shuffle_hash_join.write.format("parquet").mode("overwrite").save("output_path.parquet")

# Showing the top 5 rows
shuffle_hash_join.show(5)

# Explaining the physical plan of the join
shuffle_hash_join.explain()

# To simulate a write without actually writing the data, call .explain() on the DataFrameWriter
# This shows the plan for the write operation.
# The DataFrameWriter object itself does not have an explain method.
# The explain() on the DataFrame itself (as shown above) usually provides the relevant plan.
# print("\n--- Explaining the write operation (simulation only) ---")
# shuffle_hash_join.write.format("parquet").mode("overwrite").explain()

+----+
|  id|
+----+
|  26|
|  29|
| 474|
| 964|
|1677|
+----+
only showing top 5 rows
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [id#19L]
   +- ShuffledHashJoin [id#19L], [id#21L], Inner, BuildLeft
      :- Exchange hashpartitioning(id#19L, 200), ENSURE_REQUIREMENTS, [plan_id=262]
      :  +- Range (0, 500000, step=1, splits=2)
      +- Exchange hashpartitioning(id#21L, 200), ENSURE_REQUIREMENTS, [plan_id=263]
         +- Range (0, 1000000, step=1, splits=2)




The output of explain() confirms that a ShuffledHashJoin was applied, as you explicitly hinted for with medium_df.hint("SHUFFLE_HASH").

Key elements in the physical plan that confirm this are:

ShuffledHashJoin: This directly indicates the join strategy used.
Exchange hashpartitioning: You'll see this operation for both datasets (medium_df and large_df2). This signifies that both DataFrames were re-partitioned (shuffled) across the network based on their join keys (id#33L and id#35L). This shuffling is necessary for a shuffle hash join to ensure that all rows with the same join key are brought together on the same executor to perform the hash join locally.

- Both datasets are shuffled across the cluster by id.
- In each partition, Spark builds a hash table from the smaller side (medium_df relative to large_df2).
- The larger dataset probes the hash table for matches.
- Broadcast is avoided because medium_df is too large to efficiently send to every executor.


- Broadcast Hash Join: Best when one dataset is very small (few MBs).
- Shuffle Hash Join: Used when broadcast is infeasible but hashing is still efficient


In [4]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ShuffleHashJoinExample").getOrCreate()

# Two medium-sized DataFrames (too large to broadcast efficiently)
df1 = spark.range(0, 500000).withColumnRenamed("id", "id")
df2 = spark.range(250000, 750000).withColumnRenamed("id", "id")

# Force Shuffle Hash Join using a hint
joined_df = df1.hint("SHUFFLE_HASH").join(df2, "id", "inner")

joined_df.show(5)
shuffle_hash_join.explain()

+------+
|    id|
+------+
|250267|
|250622|
|250722|
|250780|
|250953|
+------+
only showing top 5 rows
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [id#19L]
   +- ShuffledHashJoin [id#19L], [id#21L], Inner, BuildLeft
      :- Exchange hashpartitioning(id#19L, 200), ENSURE_REQUIREMENTS, [plan_id=262]
      :  +- Range (0, 500000, step=1, splits=2)
      +- Exchange hashpartitioning(id#21L, 200), ENSURE_REQUIREMENTS, [plan_id=263]
         +- Range (0, 1000000, step=1, splits=2)




This physical plan confirms that a ShuffledHashJoin was performed, as intended by the hint. Here's a breakdown:

ShuffledHashJoin [id#19L], [id#21L], Inner, BuildLeft: This is the core operation. It means Spark is using a Hash Join strategy where data is first shuffled.

[id#19L], [id#21L]: These are the join keys from the two DataFrames (df1 and df2 in your example).
Inner: Specifies the type of join.
BuildLeft: Indicates that Spark built a hash table using the left DataFrame (df1, the smaller one after shuffling) in each partition.
Exchange hashpartitioning(id#19L, 200), ENSURE_REQUIREMENTS (for both datasets): This is the crucial part that signifies shuffling.

Exchange hashpartitioning: Both DataFrames are re-partitioned across the cluster based on the hash of their id column. This ensures that rows with the same id from both DataFrames end up on the same executor to be joined locally.
200: Refers to the default number of partitions Spark creates during shuffling.
In essence, Spark moved data across the network (shuffled) to group matching keys together, and then performed a hash join within each partition.