In [6]:
import findspark

findspark.find()

'C:\\Users\\Shree123\\spark\\spark-3.5.1'

SparkContext is the low-level entry point to Spark's core functionality. It represents the connection to a Spark cluster and is used to create RDDs (Resilient Distributed Datasets) and perform operations on them.

In [10]:
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('MyApp')
sc = SparkContext(conf = conf)

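Important note: in Databricks notebooks, a SparkSession is already available as spark (and the SparkContext as sc), so you do not need to create either one yourself. Operations on spark use the SparkContext implicitly.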

SparkSession is the unified entry point for reading data, managing configurations, and creating DataFrames. It encapsulates SparkContext and provides a higher-level API for working with structured data.

In [11]:
from pyspark.sql import SparkSession
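# Import only what is needed; a wildcard import would shadow Python built-ins such as sum, min, and max
from pyspark.sql.functions import col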

In [12]:
spark = SparkSession.builder \
        .appName('MyApp')    \
        .getOrCreate()

In [13]:
spark

Reading Data

In [14]:
df = spark.read.csv("Crop_recommendation.csv", header=True, inferSchema=True)

In [16]:
df.show(5) 

+---+---+---+-----------+-----------+-----------+-----------+-----+
|  N|  P|  K|temperature|   humidity|         ph|   rainfall|label|
+---+---+---+-----------+-----------+-----------+-----------+-----+
| 90| 42| 43|20.87974371|82.00274423|6.502985292|202.9355362| rice|
| 85| 58| 41|21.77046169|80.31964408|7.038096361|226.6555374| rice|
| 60| 55| 44|23.00445915| 82.3207629|7.840207144|263.9642476| rice|
| 74| 35| 40|26.49109635|80.15836264|6.980400905|242.8640342| rice|
| 78| 42| 42|20.13017482|81.60487287|7.628472891|262.7173405| rice|
+---+---+---+-----------+-----------+-----------+-----------+-----+
only showing top 5 rows



In [17]:
df.select('N', 'K').show(5)

+---+---+
|  N|  K|
+---+---+
| 90| 43|
| 85| 41|
| 60| 44|
| 74| 40|
| 78| 42|
+---+---+
only showing top 5 rows



Filtering by Column

In [27]:
# Filter on a single condition
df.filter(col('P') > 50).show(5)


# Filter on multiple conditions combined with & (each condition needs its own parentheses)
df.filter((col('N') > 30) & (col('N') < 35) & (col('label') == 'mango')).show()

+---+---+---+-----------+-----------+-----------+-----------+-----+
|  N|  P|  K|temperature|   humidity|         ph|   rainfall|label|
+---+---+---+-----------+-----------+-----------+-----------+-----+
| 85| 58| 41|21.77046169|80.31964408|7.038096361|226.6555374| rice|
| 60| 55| 44|23.00445915| 82.3207629|7.840207144|263.9642476| rice|
| 69| 55| 38|22.70883798|82.63941394| 5.70080568|271.3248604| rice|
| 94| 53| 40|20.27774362|82.89408619|5.718627178|241.9741949| rice|
| 89| 54| 38|24.51588066| 83.5352163|6.685346424|230.4462359| rice|
+---+---+---+-----------+-----------+-----------+-----------+-----+
only showing top 5 rows

+---+---+---+-----------+-----------+-----------+-----------+-----+
|  N|  P|  K|temperature|   humidity|         ph|   rainfall|label|
+---+---+---+-----------+-----------+-----------+-----------+-----+
| 34| 16| 25|30.07202564|50.96040505| 6.10729559|92.09609766|mango|
| 33| 29| 34|31.40948821|49.21729127|6.832979509|92.99739415|mango|
| 34| 38| 31|35.3777559

Adding a New Column

In [30]:
df = df.withColumn('Amalgam', col('N') * col('P') * col('K'))

df.select('Amalgam').show(3)

+-------+
|Amalgam|
+-------+
| 162540|
| 202130|
| 145200|
+-------+
only showing top 3 rows



DataFrames: Higher-level API, optimized with Catalyst optimizer, and easier to use for SQL-like queries.

RDDs: Lower-level API that gives more control over the data and its transformations, but without Catalyst optimization.

Caching and persisting a DataFrame store its intermediate results so that Spark does not recompute the same DataFrame every time it is reused. This matters most with large datasets and iterative algorithms, where it saves both time and resources.


Common storage levels for persist():

MEMORY_ONLY: Store the data only in memory. This is the default for RDD.cache(); DataFrame.cache() uses a memory-and-disk level instead (MEMORY_AND_DISK_DESER in PySpark 3.x).

DISK_ONLY: Store the DataFrame only on disk.

MEMORY_AND_DISK: Store the DataFrame in memory if there is enough space; otherwise, store it on disk.

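MEMORY_ONLY_SER: Store the data in memory as serialized objects (useful for reducing memory usage). This level belongs to the Scala and Java APIs; PySpark's StorageLevel does not define it, because Python objects are always serialized.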

DISK_ONLY_2: Store the DataFrame on disk with two replicas (useful for fault tolerance).

In [35]:
# More on Filtering

df.filter(col('label').isin('mango')).show(5)

+---+---+---+-----------+-----------+-----------+-----------+-----+-------+
|  N|  P|  K|temperature|   humidity|         ph|   rainfall|label|Amalgam|
+---+---+---+-----------+-----------+-----------+-----------+-----+-------+
|  2| 40| 27|29.73770045|47.54885174|5.954626604|90.09586854|mango|   2160|
| 39| 24| 31|33.55695561|53.72979826|4.757114897|98.67527561|mango|  29016|
| 21| 26| 27|27.00315545|47.67525434|5.699586972|95.85118326|mango|  14742|
| 25| 22| 25|33.56150184|45.53556603|5.977413803|95.70525913|mango|  13750|
|  0| 21| 32|35.89855625|54.25964196|6.430139436|92.19721736|mango|      0|
+---+---+---+-----------+-----------+-----------+-----------+-----+-------+
only showing top 5 rows



In [36]:
# Create Employees DataFrame
employees_data = [
    (1, 'Alice', 101, '2022-01-15'),
    (2, 'Bob', 102, '2023-03-12'),
    (3, 'Cathy', 101, '2021-11-01'),
    (4, 'David', 103, '2022-06-20'),
    (5, 'Eve', 102, '2024-02-25')
]
employees_columns = ["employee_id", "name", "department_id", "hire_date"]
employees_df = spark.createDataFrame(employees_data, employees_columns)

# Create Departments DataFrame
departments_data = [
    (101, 'Engineering'),
    (102, 'Marketing'),
    (103, 'HR')
]
departments_columns = ["department_id", "department_name"]
departments_df = spark.createDataFrame(departments_data, departments_columns)

# Create Salaries DataFrame
salaries_data = [
    (1, 70000, '2024-01-01'),
    (1, 75000, '2024-06-01'),
    (2, 80000, '2023-03-12'),
    (3, 72000, '2024-01-01'),
    (4, 68000, '2022-06-20'),
    (5, 75000, '2024-02-25')
]
salaries_columns = ["employee_id", "salary", "effective_date"]
salaries_df = spark.createDataFrame(salaries_data, salaries_columns)

In [39]:
# Window function: number employees within each department by hire date
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# hire_date is an ISO-formatted string, so string ordering matches date ordering
row_spec = Window.partitionBy(col('department_id')).orderBy(col('hire_date'))

window_data = employees_df.withColumn('row_number', row_number().over(row_spec))
# row_number == 1 keeps the earliest hire in each department
window_data.filter(col('row_number') == 1).show()

+-----------+-----+-------------+----------+----------+
|employee_id| name|department_id| hire_date|row_number|
+-----------+-----+-------------+----------+----------+
|          3|Cathy|          101|2021-11-01|         1|
|          2|  Bob|          102|2023-03-12|         1|
|          4|David|          103|2022-06-20|         1|
+-----------+-----+-------------+----------+----------+

