# This notebook explains how to create DataFrames using PySpark

### Init Spark session

In [6]:
%run 00.spark_init.ipynb

Initializing Spark session ...
Initialized


### Verify the Spark session

In [7]:
spark

In [24]:
df = spark.read.option("header", True).csv("data/laptop_prices.csv")

## Create a Spark DataFrame from a CSV file
1. Option: treat the first line of the file as the header (`header` set to `True`)
2. Cache the DataFrame before use

In [26]:
df = spark.read.option("header", True).csv("data/laptop_prices.csv").cache()

### Show the first 10 rows of the DataFrame

In [9]:
df.show(10)

+-------+---------------+---------+------+---+----------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+--------------+--------------+----------------+------------------+--------------------+-----------+--------------------+
|Company|        Product| TypeName|Inches|Ram|        OS|Weight|Price_euros|  Screen|ScreenW|ScreenH|Touchscreen|IPSpanel|RetinaDisplay|CPU_company|CPU_freq|     CPU_model|PrimaryStorage|SecondaryStorage|PrimaryStorageType|SecondaryStorageType|GPU_company|           GPU_model|
+-------+---------------+---------+------+---+----------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+--------------+--------------+----------------+------------------+--------------------+-----------+--------------------+
|  Apple|    MacBook Pro|Ultrabook|  13.3|  8|     macOS|  1.37|    1339.69|Standard|   2560|   1600|         No|     Yes|          Yes|      Intel|     2.3|       Co

### Count the number of rows in the DataFrame

In [10]:
df.count()

1275

### Check whether the DataFrame is cached

In [11]:
df.is_cached

True

### Explore the schema

In [70]:
df.printSchema()

root
 |-- Company: string (nullable = true)
 |-- Product: string (nullable = true)
 |-- TypeName: string (nullable = true)
 |-- Inches: string (nullable = true)
 |-- Ram: string (nullable = true)
 |-- OS: string (nullable = true)
 |-- Weight: string (nullable = true)
 |-- Price_euros: string (nullable = true)
 |-- Screen: string (nullable = true)
 |-- ScreenW: string (nullable = true)
 |-- ScreenH: string (nullable = true)
 |-- Touchscreen: string (nullable = true)
 |-- IPSpanel: string (nullable = true)
 |-- RetinaDisplay: string (nullable = true)
 |-- CPU_company: string (nullable = true)
 |-- CPU_freq: string (nullable = true)
 |-- CPU_model: string (nullable = true)
 |-- PrimaryStorage: string (nullable = true)
 |-- SecondaryStorage: string (nullable = true)
 |-- PrimaryStorageType: string (nullable = true)
 |-- SecondaryStorageType: string (nullable = true)
 |-- GPU_company: string (nullable = true)
 |-- GPU_model: string (nullable = true)



## Create a new DataFrame from an existing one

In [18]:
df_apple = df.filter("Company = 'Apple'")

### Explain the DataFrame to see how it is executed

In [75]:
df_apple.explain(extended=False)

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Filter (isnotnull(Company#772) AND (Company#772 = Apple))
   +- InMemoryTableScan [Company#772, Product#773, TypeName#774, Inches#775, Ram#776, OS#777, Weight#778, Price_euros#779, Screen#780, ScreenW#781, ScreenH#782, Touchscreen#783, IPSpanel#784, RetinaDisplay#785, CPU_company#786, CPU_freq#787, CPU_model#788, PrimaryStorage#789, SecondaryStorage#790, PrimaryStorageType#791, SecondaryStorageType#792, GPU_company#793, GPU_model#794], [isnotnull(Company#772), (Company#772 = Apple)]
         +- InMemoryRelation [Company#772, Product#773, TypeName#774, Inches#775, Ram#776, OS#777, Weight#778, Price_euros#779, Screen#780, ScreenW#781, ScreenH#782, Touchscreen#783, IPSpanel#784, RetinaDisplay#785, CPU_company#786, CPU_freq#787, CPU_model#788, PrimaryStorage#789, SecondaryStorage#790, PrimaryStorageType#791, SecondaryStorageType#792, GPU_company#793, GPU_model#794], StorageLevel(disk, memory, deserialized, 1 replicas)
             

In [19]:
df_apple.show(10)

+-------+------------+---------+------+---+--------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+---------+--------------+----------------+------------------+--------------------+-----------+--------------------+
|Company|     Product| TypeName|Inches|Ram|      OS|Weight|Price_euros|  Screen|ScreenW|ScreenH|Touchscreen|IPSpanel|RetinaDisplay|CPU_company|CPU_freq|CPU_model|PrimaryStorage|SecondaryStorage|PrimaryStorageType|SecondaryStorageType|GPU_company|           GPU_model|
+-------+------------+---------+------+---+--------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+---------+--------------+----------------+------------------+--------------------+-----------+--------------------+
|  Apple| MacBook Pro|Ultrabook|  13.3|  8|   macOS|  1.37|    1339.69|Standard|   2560|   1600|         No|     Yes|          Yes|      Intel|     2.3|  Core i5|           128|               0|  

In [20]:
df_apple.count()

21

In [21]:
df_apple.is_cached

False

## Create a DataFrame from a sequence / list
We need to prepare:
1. A list of items
2. A schema for the DataFrame

In [33]:
l = [(i,) for i in range(1000)]

The schema can be given as a string of comma-separated column definitions, each in the following format:
`<column name> <data type>`
<br />
For example:
- `id int, name string, phone_numbers array<int>`

In [34]:
df_l = spark.createDataFrame(l, 'id int')

In [35]:
df_l.explain(extended=True)

== Parsed Logical Plan ==
LogicalRDD [id#4845], false

== Analyzed Logical Plan ==
id: int
LogicalRDD [id#4845], false

== Optimized Logical Plan ==
LogicalRDD [id#4845], false

== Physical Plan ==
*(1) Scan ExistingRDD[id#4845]



In [36]:
df_l.show(10)

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+
only showing top 10 rows



In [37]:
df_l.is_cached

False

### Another example with a different style of schema definition

In [61]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType, LongType
import random

In [62]:
# Define the schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("phone_number", ArrayType(LongType()), True),
    StructField("age", IntegerType(), True),
    StructField("country", StringType(), True)
])

In [51]:
schema.json()

'{"fields":[{"metadata":{},"name":"id","nullable":true,"type":"integer"},{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"phone_number","nullable":true,"type":{"containsNull":true,"elementType":"long","type":"array"}},{"metadata":{},"name":"age","nullable":true,"type":"integer"},{"metadata":{},"name":"country","nullable":true,"type":"string"}],"type":"struct"}'

In [58]:
schema.simpleString()

'struct<id:int,name:string,phone_number:array<bigint>,age:int,country:string>'

In [63]:
# Generate random data
data = []
names = ["Alice", "Bob", "Charlie", "David", "Eva", "Frank", "Grace", "Hannah", "Ian", "Jack"]

for i in range(1000):
    person_id = i + 1  # ids run from 1 to 1000 (avoid shadowing the built-in id())
    name = random.choice(names)
    phone_number = [random.randint(1000000000, 9999999999) for _ in range(random.randint(1, 3))]  # 1 to 3 phone numbers
    age = random.randint(18, 70)  # age between 18 and 70
    country = random.choice(["USA", "Canada", "UK", "Germany", "France"])  # random country
    data.append((person_id, name, phone_number, age, country))
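
The range used for `phone_number` is the reason the schema declares `ArrayType(LongType())` rather than `ArrayType(IntegerType())`: most 10-digit numbers do not fit in a 32-bit signed integer. The sketch below checks this in plain Python; the constant names and the fixed seed are assumptions made for the illustration.

```python
import random

INT32_MAX = 2**31 - 1  # largest value a 32-bit signed integer (IntegerType) can hold
INT64_MAX = 2**63 - 1  # largest value a 64-bit signed integer (LongType) can hold

# Seeded only so the sketch is repeatable; the notebook itself does not set a seed.
rng = random.Random(0)
sample_numbers = [rng.randint(1000000000, 9999999999) for _ in range(1000)]

# Values above INT32_MAX cannot be stored in an IntegerType column.
too_big_for_int = [n for n in sample_numbers if n > INT32_MAX]
fits_in_long = all(n <= INT64_MAX for n in sample_numbers)

print(f"{len(too_big_for_int)} of {len(sample_numbers)} values exceed INT32_MAX")
print("all values fit in LongType:", fits_in_long)
```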

In [64]:
data

[(1, 'Grace', [6721472906, 3599360586, 6559471999], 58, 'Germany'),
 (2, 'Ian', [2126373471, 7806220636, 8025018938], 37, 'UK'),
 (3, 'Frank', [8156990943, 6833753138], 51, 'USA'),
 (4, 'Eva', [5591541570, 2852838966, 6506400926], 45, 'USA'),
 (5, 'Ian', [1673483203, 5862781198], 31, 'Canada'),
 (6, 'David', [1125198908, 6479874130, 3840731034], 24, 'Canada'),
 (7, 'David', [9367792590, 1938039613, 1851477876], 55, 'UK'),
 (8, 'Hannah', [2968823757], 47, 'Germany'),
 (9, 'David', [5423236760], 25, 'USA'),
 (10, 'Eva', [8411101186, 6725706243], 46, 'Germany'),
 (11, 'Grace', [7473409692, 8793061815, 2694380391], 23, 'UK'),
 (12, 'Frank', [7580200039, 8716773039, 6083638015], 41, 'UK'),
 (13, 'Eva', [8187718738, 4059936435, 3578056701], 70, 'USA'),
 (14, 'Eva', [6414601570], 52, 'France'),
 (15, 'Hannah', [4384632419, 4502597307, 2837557624], 45, 'Canada'),
 (16, 'David', [2484634925, 6796238336, 5248909485], 18, 'Canada'),
 (17, 'Grace', [3775611324], 23, 'France'),
 (18, 'Jack', [48150

In [65]:
df_people = spark.createDataFrame(data, schema)

In [66]:
df_people.show(10)

+---+------+--------------------+---+-------+
| id|  name|        phone_number|age|country|
+---+------+--------------------+---+-------+
|  1| Grace|[6721472906, 3599...| 58|Germany|
|  2|   Ian|[2126373471, 7806...| 37|     UK|
|  3| Frank|[8156990943, 6833...| 51|    USA|
|  4|   Eva|[5591541570, 2852...| 45|    USA|
|  5|   Ian|[1673483203, 5862...| 31| Canada|
|  6| David|[1125198908, 6479...| 24| Canada|
|  7| David|[9367792590, 1938...| 55|     UK|
|  8|Hannah|        [2968823757]| 47|Germany|
|  9| David|        [5423236760]| 25|    USA|
| 10|   Eva|[8411101186, 6725...| 46|Germany|
+---+------+--------------------+---+-------+
only showing top 10 rows



In [69]:
df_people.printSchema()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- phone_number: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- age: integer (nullable = true)
 |-- country: string (nullable = true)



In [73]:
df_people.explain(extended=True)

== Parsed Logical Plan ==
LogicalRDD [id#4852, name#4853, phone_number#4854, age#4855, country#4856], false

== Analyzed Logical Plan ==
id: int, name: string, phone_number: array<bigint>, age: int, country: string
LogicalRDD [id#4852, name#4853, phone_number#4854, age#4855, country#4856], false

== Optimized Logical Plan ==
LogicalRDD [id#4852, name#4853, phone_number#4854, age#4855, country#4856], false

== Physical Plan ==
*(1) Scan ExistingRDD[id#4852,name#4853,phone_number#4854,age#4855,country#4856]



In [71]:
df_people.is_cached

False