
In [1]:
!pip install pyspark


Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 316.9/316.9 MB 3.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425344 sha256=f0459d9c955a64502f76b2c964d73b43a6b40f2f78c73ef13825f79af6edded6
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


SparkSession and SparkContext are both entry points to Apache Spark, but they operate at different levels of the API.

**SparkContext:**
The SparkContext is the entry point for low-level Spark functionality and represents the connection to a Spark cluster. It was the primary entry point in earlier versions of Spark.

SparkContext is responsible for coordinating the execution of tasks across the cluster and managing the resources (e.g., memory, cores) for Spark applications.

In modern Spark applications (Spark 2.0 and later), you typically don't create a SparkContext directly. Instead, it is automatically created for you when you create a SparkSession.

**SparkSession:**
The SparkSession is a higher-level API introduced in Spark 2.0 to provide a unified entry point for reading data, executing SQL queries, and managing Spark jobs.

**SparkSession encapsulates SparkContext** and provides a single entry point for interacting with structured data using Spark. It includes **functionality for working with DataFrames and Datasets.** (The typed Dataset API exists only in Scala and Java; in PySpark you work with DataFrames.)

Because the session creates its SparkContext automatically, code written for Spark 2.0 and later usually works through SparkSession and reaches the underlying context, when it is needed, through `spark.sparkContext`.

In [5]:
import pyspark
from pyspark.sql import SparkSession


In [6]:
spark = SparkSession.builder.appName('tutorial').getOrCreate()

In [8]:
df = spark.read.load("*.csv", format='csv')

In [9]:
df.show()

+-------+---+----------+-------------------+--------------------+--------------------+---+---------+--------+
|    _c0|_c1|       _c2|                _c3|                 _c4|                 _c5|_c6|      _c7|     _c8|
+-------+---+----------+-------------------+--------------------+--------------------+---+---------+--------+
|SO49171|  1|2021-01-01|      Mariah Foster|mariah21@adventur...|  Road-250 Black, 48|  1|2181.5625| 174.525|
|SO49172|  1|2021-01-01|       Brian Howard|brian23@adventure...|    Road-250 Red, 44|  1|  2443.35| 195.468|
|SO49173|  1|2021-01-01|      Linda Alvarez|linda19@adventure...|Mountain-200 Silv...|  1|2071.4196|165.7136|
|SO49174|  1|2021-01-01|     Gina Hernandez|gina4@adventure-w...|Mountain-200 Silv...|  1|2071.4196|165.7136|
|SO49178|  1|2021-01-01|          Beth Ruiz|beth4@adventure-w...|Road-550-W Yellow...|  1|1000.4375|  80.035|
|SO49179|  1|2021-01-01|          Evan Ward|evan13@adventure-...|Road-550-W Yellow...|  1|1000.4375|  80.035|
|SO49175| 

In [12]:
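# Import only the types the schema needs.
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, DateType, FloatType
)
# Note: this wildcard import shadows Python built-ins such as sum, min, and max.
from pyspark.sql.functions import *

# Explicit schema: column names and types for the header-less CSV files.
orderSchema = StructType([
    StructField("SalesOrderNumber", StringType()),
    StructField("SalesOrderLineNumber", IntegerType()),
    StructField("OrderDate", DateType()),
    StructField("CustomerName", StringType()),
    StructField("Email", StringType()),
    StructField("Item", StringType()),
    StructField("Quantity", IntegerType()),
    # FloatType is 32-bit; DoubleType or DecimalType would keep more precision.
    StructField("UnitPrice", FloatType()),
    StructField("Tax", FloatType())
])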

df = spark.read.load('*.csv', format='csv', schema=orderSchema)
df.show(20)

+----------------+--------------------+----------+-------------------+--------------------+--------------------+--------+---------+--------+
|SalesOrderNumber|SalesOrderLineNumber| OrderDate|       CustomerName|               Email|                Item|Quantity|UnitPrice|     Tax|
+----------------+--------------------+----------+-------------------+--------------------+--------------------+--------+---------+--------+
|         SO49171|                   1|2021-01-01|      Mariah Foster|mariah21@adventur...|  Road-250 Black, 48|       1|2181.5625| 174.525|
|         SO49172|                   1|2021-01-01|       Brian Howard|brian23@adventure...|    Road-250 Red, 44|       1|  2443.35| 195.468|
|         SO49173|                   1|2021-01-01|      Linda Alvarez|linda19@adventure...|Mountain-200 Silv...|       1|2071.4197|165.7136|
|         SO49174|                   1|2021-01-01|     Gina Hernandez|gina4@adventure-w...|Mountain-200 Silv...|       1|2071.4197|165.7136|
|         SO4
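
In the table above, the unit price that appeared as `2071.4196` in the untyped read now shows as `2071.4197`. A likely explanation is that `FloatType` is a 32-bit float, which cannot represent `2071.4196` exactly. The sketch below uses only the Python standard library to round-trip the value through 32 bits; the helper name `to_float32` is made up for this example.

```python
import struct


def to_float32(value):
    """Round-trip a 64-bit Python float through a 32-bit IEEE 754 float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


csv_value = 2071.4196            # UnitPrice as shown in the untyped read
stored = to_float32(csv_value)   # nearest value representable in 32 bits

print(stored)                    # 2071.419677734375
print(round(stored, 4))          # 2071.4197

# Values with a short binary fraction survive unchanged.
print(to_float32(2181.5625))     # 2181.5625
```

If exact prices matter, `DoubleType` or `DecimalType` in the schema would be the usual alternatives.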

In [13]:
df.printSchema()

root
 |-- SalesOrderNumber: string (nullable = true)
 |-- SalesOrderLineNumber: integer (nullable = true)
 |-- OrderDate: date (nullable = true)
 |-- CustomerName: string (nullable = true)
 |-- Email: string (nullable = true)
 |-- Item: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- UnitPrice: float (nullable = true)
 |-- Tax: float (nullable = true)

