# This notebook explains how to create DataFrames using PySpark

### Init Spark session

In [1]:
from common import init_spark

spark, sc = init_spark.create()

Initializing Spark session ...
Initialized


### Verify the Spark session

In [2]:
spark

In [4]:
df = spark.read.option("header", True).csv("w05/data/laptop_prices.csv")

## Create a Spark DataFrame from a CSV file
1. Option: enable the header so the first line supplies the column names
2. Cache the DataFrame before use

In [6]:
df = spark.read.option("header", True).csv("w05/data/laptop_prices.csv").cache()

### Show the top 10 rows of the DataFrame

In [7]:
df.show(10)

+-------+---------------+---------+------+---+----------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+--------------+--------------+----------------+------------------+--------------------+-----------+--------------------+
|Company|        Product| TypeName|Inches|Ram|        OS|Weight|Price_euros|  Screen|ScreenW|ScreenH|Touchscreen|IPSpanel|RetinaDisplay|CPU_company|CPU_freq|     CPU_model|PrimaryStorage|SecondaryStorage|PrimaryStorageType|SecondaryStorageType|GPU_company|           GPU_model|
+-------+---------------+---------+------+---+----------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+--------------+--------------+----------------+------------------+--------------------+-----------+--------------------+
|  Apple|    MacBook Pro|Ultrabook|  13.3|  8|     macOS|  1.37|    1339.69|Standard|   2560|   1600|         No|     Yes|          Yes|      Intel|     2.3|       Co

### Count the number of rows in the DataFrame

In [8]:
df.count()

1275

### Check whether the DataFrame is cached

In [9]:
df.is_cached

True

### Explore the schema

In [10]:
df.printSchema()

root
 |-- Company: string (nullable = true)
 |-- Product: string (nullable = true)
 |-- TypeName: string (nullable = true)
 |-- Inches: string (nullable = true)
 |-- Ram: string (nullable = true)
 |-- OS: string (nullable = true)
 |-- Weight: string (nullable = true)
 |-- Price_euros: string (nullable = true)
 |-- Screen: string (nullable = true)
 |-- ScreenW: string (nullable = true)
 |-- ScreenH: string (nullable = true)
 |-- Touchscreen: string (nullable = true)
 |-- IPSpanel: string (nullable = true)
 |-- RetinaDisplay: string (nullable = true)
 |-- CPU_company: string (nullable = true)
 |-- CPU_freq: string (nullable = true)
 |-- CPU_model: string (nullable = true)
 |-- PrimaryStorage: string (nullable = true)
 |-- SecondaryStorage: string (nullable = true)
 |-- PrimaryStorageType: string (nullable = true)
 |-- SecondaryStorageType: string (nullable = true)
 |-- GPU_company: string (nullable = true)
 |-- GPU_model: string (nullable = true)
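
Every column above is read as `string` because the CSV reader does not infer types unless asked. One way to get numeric columns, sketched below, is to cast them after loading with `selectExpr`. The helper names `cast_exprs` and `with_numeric_types` and the chosen target types are assumptions for illustration, guessed from the values in the `df.show(10)` output. The reader's `inferSchema` option is an alternative, at the cost of an extra pass over the file.

```python
# Assumed target types for a few columns, guessed from the sample rows above.
NUMERIC_COLUMNS = {
    "Inches": "double",
    "Ram": "int",
    "Weight": "double",
    "Price_euros": "double",
}


def cast_exprs(columns, type_overrides):
    """Build one SQL expression per column, casting only the overridden ones."""
    exprs = []
    for name in columns:
        if name in type_overrides:
            exprs.append(f"CAST(`{name}` AS {type_overrides[name]}) AS `{name}`")
        else:
            exprs.append(f"`{name}`")
    return exprs


def with_numeric_types(df, type_overrides=NUMERIC_COLUMNS):
    """Return a new DataFrame with the listed string columns cast to numeric types."""
    return df.selectExpr(*cast_exprs(df.columns, type_overrides))


print(cast_exprs(["Company", "Ram"], NUMERIC_COLUMNS))
# prints ['`Company`', 'CAST(`Ram` AS int) AS `Ram`']
```

In the notebook this could be used as `df_typed = with_numeric_types(df)` followed by `df_typed.printSchema()` to check the result.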



## Create a new DataFrame from an existing one

In [11]:
df_apple = df.filter("Company = 'Apple'")

### Explain the DataFrame to see how it is executed

In [12]:
df_apple.explain(extended=False)

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Filter (isnotnull(Company#80) AND (Company#80 = Apple))
   +- InMemoryTableScan [Company#80, Product#81, TypeName#82, Inches#83, Ram#84, OS#85, Weight#86, Price_euros#87, Screen#88, ScreenW#89, ScreenH#90, Touchscreen#91, IPSpanel#92, RetinaDisplay#93, CPU_company#94, CPU_freq#95, CPU_model#96, PrimaryStorage#97, SecondaryStorage#98, PrimaryStorageType#99, SecondaryStorageType#100, GPU_company#101, GPU_model#102], [isnotnull(Company#80), (Company#80 = Apple)]
         +- InMemoryRelation [Company#80, Product#81, TypeName#82, Inches#83, Ram#84, OS#85, Weight#86, Price_euros#87, Screen#88, ScreenW#89, ScreenH#90, Touchscreen#91, IPSpanel#92, RetinaDisplay#93, CPU_company#94, CPU_freq#95, CPU_model#96, PrimaryStorage#97, SecondaryStorage#98, PrimaryStorageType#99, SecondaryStorageType#100, GPU_company#101, GPU_model#102], StorageLevel(disk, memory, deserialized, 1 replicas)
               +- FileScan csv [Company#80,Product#81,Typ

In [13]:
df_apple.show(10)

+-------+------------+---------+------+---+--------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+---------+--------------+----------------+------------------+--------------------+-----------+--------------------+
|Company|     Product| TypeName|Inches|Ram|      OS|Weight|Price_euros|  Screen|ScreenW|ScreenH|Touchscreen|IPSpanel|RetinaDisplay|CPU_company|CPU_freq|CPU_model|PrimaryStorage|SecondaryStorage|PrimaryStorageType|SecondaryStorageType|GPU_company|           GPU_model|
+-------+------------+---------+------+---+--------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+---------+--------------+----------------+------------------+--------------------+-----------+--------------------+
|  Apple| MacBook Pro|Ultrabook|  13.3|  8|   macOS|  1.37|    1339.69|Standard|   2560|   1600|         No|     Yes|          Yes|      Intel|     2.3|  Core i5|           128|               0|  

In [14]:
df_apple.count()

21

In [15]:
df_apple.is_cached

False
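
`df_apple.is_cached` is `False` because the cache flag belongs to each DataFrame, not to its parents. The `InMemoryTableScan` step in the plan above shows that the filter still reads from the cached `df`. If the filtered result itself is reused often, it can be cached as well. A minimal sketch, where `cache_and_materialize` is a hypothetical helper name:

```python
def cache_and_materialize(df):
    """Mark a DataFrame for caching and run an action so the cache is filled."""
    cached = df.cache()          # lazy: only marks the DataFrame for caching
    row_count = cached.count()   # action: computes the rows and stores them
    return cached, row_count


# Usage in the notebook (assumes the `df_apple` DataFrame defined above):
# df_apple, n = cache_and_materialize(df_apple)
# df_apple.is_cached    # expected to be True after the call
# df_apple.unpersist()  # release the cached data when it is no longer needed
```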

## Create a DataFrame from a sequence / list
We need to prepare:
1. A list of items (here, one-element tuples, one per row)
2. A schema for the DataFrame

In [16]:
l = [(i,) for i in range(1000)]

The schema is a comma-separated list of columns, each written as:
`<column name> <data type>`
<br />
For example:
- `id int, name string, phone_numbers array<int>`
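
Because the schema is just a comma-separated string, it can also be assembled from a list of `(name, type)` pairs. The sketch below does that; `ddl_schema` and `make_dataframe` are hypothetical helper names, and `spark` is assumed to be the session created at the top of the notebook.

```python
def ddl_schema(columns):
    """Join (column name, data type) pairs into a schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


def make_dataframe(spark, rows, columns):
    """Create a DataFrame from a list of tuples and (name, type) pairs."""
    return spark.createDataFrame(rows, ddl_schema(columns))


columns = [("id", "int"), ("name", "string"), ("phone_numbers", "array<int>")]
print(ddl_schema(columns))
# prints: id int, name string, phone_numbers array<int>
```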

In [17]:
df_l = spark.createDataFrame(l, 'id int')

In [18]:
df_l.explain(extended=True)

== Parsed Logical Plan ==
LogicalRDD [id#2922], false

== Analyzed Logical Plan ==
id: int
LogicalRDD [id#2922], false

== Optimized Logical Plan ==
LogicalRDD [id#2922], false

== Physical Plan ==
*(1) Scan ExistingRDD[id#2922]



In [19]:
df_l.show(10)

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+
only showing top 10 rows



In [20]:
df_l.is_cached

False

### Another example with a different style of schema definition

In [21]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType, LongType
import random

In [22]:
# Define the schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("phone_number", ArrayType(LongType()), True),
    StructField("age", IntegerType(), True),
    StructField("country", StringType(), True)
])

In [23]:
schema.json()

'{"fields":[{"metadata":{},"name":"id","nullable":true,"type":"integer"},{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"phone_number","nullable":true,"type":{"containsNull":true,"elementType":"long","type":"array"}},{"metadata":{},"name":"age","nullable":true,"type":"integer"},{"metadata":{},"name":"country","nullable":true,"type":"string"}],"type":"struct"}'

In [24]:
schema.simpleString()

'struct<id:int,name:string,phone_number:array<bigint>,age:int,country:string>'
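
The JSON form is handy for storing a schema outside the code. As a small illustration, the sketch below parses the string printed by `schema.json()` with Python's `json` module and lists each field with its type; `field_summary` is a hypothetical helper. In PySpark, the parsed dictionary can also be passed to `StructType.fromJson` to rebuild the schema object.

```python
import json

# The string returned by schema.json() above, split across lines for readability.
schema_json = (
    '{"fields":['
    '{"metadata":{},"name":"id","nullable":true,"type":"integer"},'
    '{"metadata":{},"name":"name","nullable":true,"type":"string"},'
    '{"metadata":{},"name":"phone_number","nullable":true,'
    '"type":{"containsNull":true,"elementType":"long","type":"array"}},'
    '{"metadata":{},"name":"age","nullable":true,"type":"integer"},'
    '{"metadata":{},"name":"country","nullable":true,"type":"string"}'
    '],"type":"struct"}'
)


def field_summary(schema_dict):
    """Return (name, type) pairs, rendering array types as array<element>."""
    summary = []
    for field in schema_dict["fields"]:
        dtype = field["type"]
        # Simple types are plain strings; arrays are nested dictionaries.
        if isinstance(dtype, dict) and dtype.get("type") == "array":
            dtype = f"array<{dtype['elementType']}>"
        summary.append((field["name"], dtype))
    return summary


fields = field_summary(json.loads(schema_json))
print(fields)
```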

In [25]:
# Generate random data
data = []
names = ["Alice", "Bob", "Charlie", "David", "Eva", "Frank", "Grace", "Hannah", "Ian", "Jack"]

for i in range(1000):
    person_id = i + 1  # ids run from 1 to 1000 (named person_id to avoid shadowing the built-in id)
    name = random.choice(names)
    # 1 to 3 ten-digit phone numbers; they exceed the 32-bit integer range, hence LongType in the schema
    phone_number = [random.randint(1000000000, 9999999999) for _ in range(random.randint(1, 3))]
    age = random.randint(18, 70)  # age between 18 and 70, inclusive
    country = random.choice(["USA", "Canada", "UK", "Germany", "France"])  # random country
    data.append((person_id, name, phone_number, age, country))
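
The data above changes on every run because the module-level `random` functions are unseeded. For repeatable examples, a seeded generator can be used instead. This is a sketch rather than the notebook's method: `make_people` and the seed value are assumptions, while the value ranges mirror the loop above.

```python
import random

NAMES = ["Alice", "Bob", "Charlie", "David", "Eva", "Frank", "Grace", "Hannah", "Ian", "Jack"]
COUNTRIES = ["USA", "Canada", "UK", "Germany", "France"]


def make_people(n, seed=42):
    """Generate n (id, name, phone_number, age, country) tuples reproducibly."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    people = []
    for person_id in range(1, n + 1):
        # 1 to 3 ten-digit phone numbers per person
        phones = [rng.randint(1000000000, 9999999999) for _ in range(rng.randint(1, 3))]
        name = rng.choice(NAMES)
        age = rng.randint(18, 70)
        country = rng.choice(COUNTRIES)
        people.append((person_id, name, phones, age, country))
    return people


sample = make_people(1000)
print(len(sample), sample[0][0], sample[-1][0])
# prints: 1000 1 1000
```

Calling `make_people(1000)` twice with the same seed returns identical rows, so the resulting list can be passed to `spark.createDataFrame` together with the schema defined earlier and give the same DataFrame contents on every run.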

In [26]:
data

[(1, 'Hannah', [3492610877], 20, 'Canada'),
 (2, 'David', [3985746259, 2267715323], 19, 'USA'),
 (3, 'Bob', [8830053193], 23, 'Germany'),
 (4, 'Alice', [7327162959, 8555011933, 8310932671], 29, 'France'),
 (5, 'Charlie', [5604867876, 4301756453, 1396013702], 20, 'France'),
 (6, 'Eva', [5661570260], 62, 'Canada'),
 (7, 'Grace', [6379077261], 24, 'USA'),
 (8, 'Eva', [3174883644], 69, 'Germany'),
 (9, 'Ian', [9087484497], 18, 'USA'),
 (10, 'Charlie', [9714735889], 59, 'UK'),
 (11, 'Grace', [6055152068, 9227104339], 49, 'France'),
 (12, 'Eva', [3104815554, 1703746523, 7730621721], 35, 'France'),
 (13, 'Hannah', [4271159088], 33, 'USA'),
 (14, 'Alice', [8458347027, 6844465029, 5072520303], 68, 'Canada'),
 (15, 'David', [7403115605, 1178664473, 8710069674], 62, 'UK'),
 (16, 'Grace', [4060460066], 65, 'UK'),
 (17, 'Hannah', [9338502557, 6080375943], 59, 'Germany'),
 (18, 'Grace', [9013500720], 65, 'USA'),
 (19, 'Grace', [7756847994], 63, 'UK'),
 (20, 'Charlie', [1899960138], 68, 'Germany'),
 

In [27]:
df_people = spark.createDataFrame(data, schema)

In [28]:
df_people.show(10)

+---+-------+--------------------+---+-------+
| id|   name|        phone_number|age|country|
+---+-------+--------------------+---+-------+
|  1| Hannah|        [3492610877]| 20| Canada|
|  2|  David|[3985746259, 2267...| 19|    USA|
|  3|    Bob|        [8830053193]| 23|Germany|
|  4|  Alice|[7327162959, 8555...| 29| France|
|  5|Charlie|[5604867876, 4301...| 20| France|
|  6|    Eva|        [5661570260]| 62| Canada|
|  7|  Grace|        [6379077261]| 24|    USA|
|  8|    Eva|        [3174883644]| 69|Germany|
|  9|    Ian|        [9087484497]| 18|    USA|
| 10|Charlie|        [9714735889]| 59|     UK|
+---+-------+--------------------+---+-------+
only showing top 10 rows



In [29]:
df_people.printSchema()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- phone_number: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- age: integer (nullable = true)
 |-- country: string (nullable = true)



In [30]:
df_people.explain(extended=True)

== Parsed Logical Plan ==
LogicalRDD [id#2929, name#2930, phone_number#2931, age#2932, country#2933], false

== Analyzed Logical Plan ==
id: int, name: string, phone_number: array<bigint>, age: int, country: string
LogicalRDD [id#2929, name#2930, phone_number#2931, age#2932, country#2933], false

== Optimized Logical Plan ==
LogicalRDD [id#2929, name#2930, phone_number#2931, age#2932, country#2933], false

== Physical Plan ==
*(1) Scan ExistingRDD[id#2929,name#2930,phone_number#2931,age#2932,country#2933]



In [31]:
df_people.is_cached

False