### Introduction

If you've spent time working with SQL (Structured Query Language), you probably know about JOINs. If not, here's a quick explanation: a JOIN combines two or more tables by matching rows on one or more shared or related columns.

JOINing is useful for various tasks, such as adding more information from a smaller dataset to a larger one, or searching for data in reference datasets.

There are different types of joins, such as left, right, inner, and full outer. In Spark, DataFrames support all of them through the join() method, and there are multiple ways to specify the join condition.

In this discussion, we use the words "table" and "DataFrame" interchangeably. For our purposes here, they mean the same thing.

!['Sql Joins'](sqljoin.png)

- **Inner Join:** This type of join returns records that have matching values in both tables being joined. Rows that don't have matches on both sides are excluded.

- **Left (Outer) Join:** A left join returns all records from the left table and only the matching records from the right table. Left rows with no match are kept, with nulls in the right table's columns. It's like adding information from the right table to the left table.

- **Right (Outer) Join:** The right join mirrors the left join: it returns all records from the right table and only the matching records from the left table, with nulls in the left table's columns where there is no match.

- **Full (Outer) Join:** This type of join returns all records from both the left and right tables, regardless of whether they have matching values or not. It's useful when you want to combine rows from both tables while keeping non-matching rows from each side.

In [1]:
import findspark
findspark.init()
findspark.find()

'C:\\Spark\\sparkhome'

In [3]:
from pyspark.sql import SparkSession

In [4]:
spark = SparkSession.builder.appName('Vamsi_App_2').getOrCreate()

We have two lists of tuples: one pairing animals with their categories, and the other pairing animals with their foods.

In [6]:
categorized_animals = [("dog", "pet"), ("cat", "pet"), ("bear", "wild")]
animal_foods = [("dog", "kibble"), ("cat", "canned tuna"), ("bear", "salmon")]

We turn these lists into RDDs and then create DataFrames from the RDDs using Spark:

In [7]:
animalDataRDD = spark.sparkContext.parallelize(categorized_animals)
animalFoodRDD = spark.sparkContext.parallelize(animal_foods)

animalData = spark.createDataFrame(animalDataRDD, ['name', 'category'])
animalFoods = spark.createDataFrame(animalFoodRDD, ['animal', 'food'])

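Next, we join the two DataFrames on the animal name, which is stored in the name column of animalData and the animal column of animalFoods, and print the results. Because no join type is given, join() performs an inner join: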

In [8]:
animals_enhanced = animalData.join(animalFoods, animalData.name == animalFoods.animal)
animals_enhanced.show()

+----+--------+------+-----------+
|name|category|animal|       food|
+----+--------+------+-----------+
|bear|    wild|  bear|     salmon|
| cat|     pet|   cat|canned tuna|
| dog|     pet|   dog|     kibble|
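+----+--------+------+-----------+

Swapping the two DataFrames in the join gives the same rows, with the columns of animalFoods listed first:
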
In [9]:
animals_enhanced_1 = animalFoods.join(animalData, animalFoods.animal == animalData.name)

In [10]:
animals_enhanced_1.show()

+------+-----------+----+--------+
|animal|       food|name|category|
+------+-----------+----+--------+
|  bear|     salmon|bear|    wild|
|   cat|canned tuna| cat|     pet|
|   dog|     kibble| dog|     pet|
+------+-----------+----+--------+



By doing this, we've successfully combined the data about animals, their categories, and their foods, creating a unified dataset.

Remember, this is just one way to perform data joins in Spark. There's another approach called the "usingColumn" approach within the .join() method.

### The usingColumn Join Method

The "usingColumn" approach is useful when the left and right DataFrames use the same column name for the join key. Instead of a comparison expression, you pass the column name itself. This is a simpler way to achieve the same result as in the previous example, and the key column appears only once in the output.

In Python, this means passing the shared column name as a string to the join() method.

A more general version of this approach joins on multiple columns. You provide the columns to join on as a sequence in Scala or as a list in Python. If you're familiar with the JOIN USING clause in SQL, this is quite similar. Every column in the list must be present on both sides of the join for this method to work.

In Python, the list of column names is passed to join() in place of the single column name.

These techniques are not the only ways to join data, but they are widely used and straightforward. This explanation also doesn't cover the computational cost of data joins, which we can explore in a separate exercise.