# Lab: Introduction to the DataFrame API

## Preparation
1. Start a Spark session.
2. Load a DataFrame from a file.
3. Cache the DataFrame, because it is reused many times in this lab.

In [76]:
%run 00.spark_init.ipynb

Initializing Spark session ...
Initialized


In [77]:
spark

In [131]:
df = spark.read.option('header', True).option('inferSchema', True).csv('data/laptop_prices.csv').cache()

In [132]:
df.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- InMemoryTableScan [Company#23610, Product#23611, TypeName#23612, Inches#23613, Ram#23614, OS#23615, Weight#23616, Price_euros#23617, Screen#23618, ScreenW#23619, ScreenH#23620, Touchscreen#23621, IPSpanel#23622, RetinaDisplay#23623, CPU_company#23624, CPU_freq#23625, CPU_model#23626, PrimaryStorage#23627, SecondaryStorage#23628, PrimaryStorageType#23629, SecondaryStorageType#23630, GPU_company#23631, GPU_model#23632]
      +- InMemoryRelation [Company#23610, Product#23611, TypeName#23612, Inches#23613, Ram#23614, OS#23615, Weight#23616, Price_euros#23617, Screen#23618, ScreenW#23619, ScreenH#23620, Touchscreen#23621, IPSpanel#23622, RetinaDisplay#23623, CPU_company#23624, CPU_freq#23625, CPU_model#23626, PrimaryStorage#23627, SecondaryStorage#23628, PrimaryStorageType#23629, SecondaryStorageType#23630, GPU_company#23631, GPU_model#23632], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- FileScan csv [Company#

In [133]:
df.is_cached

True

In [106]:
df.show()

+-------+-----------------+---------+------+---+----------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+--------------+--------------+----------------+------------------+--------------------+-----------+--------------------+
|Company|          Product| TypeName|Inches|Ram|        OS|Weight|Price_euros|  Screen|ScreenW|ScreenH|Touchscreen|IPSpanel|RetinaDisplay|CPU_company|CPU_freq|     CPU_model|PrimaryStorage|SecondaryStorage|PrimaryStorageType|SecondaryStorageType|GPU_company|           GPU_model|
+-------+-----------------+---------+------+---+----------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+--------------+--------------+----------------+------------------+--------------------+-----------+--------------------+
|  Apple|      MacBook Pro|Ultrabook|  13.3|  8|     macOS|  1.37|    1339.69|Standard|   2560|   1600|         No|     Yes|          Yes|      Intel|     2.3| 

In [105]:
df.printSchema()

root
 |-- Company: string (nullable = true)
 |-- Product: string (nullable = true)
 |-- TypeName: string (nullable = true)
 |-- Inches: double (nullable = true)
 |-- Ram: integer (nullable = true)
 |-- OS: string (nullable = true)
 |-- Weight: double (nullable = true)
 |-- Price_euros: double (nullable = true)
 |-- Screen: string (nullable = true)
 |-- ScreenW: integer (nullable = true)
 |-- ScreenH: integer (nullable = true)
 |-- Touchscreen: string (nullable = true)
 |-- IPSpanel: string (nullable = true)
 |-- RetinaDisplay: string (nullable = true)
 |-- CPU_company: string (nullable = true)
 |-- CPU_freq: double (nullable = true)
 |-- CPU_model: string (nullable = true)
 |-- PrimaryStorage: integer (nullable = true)
 |-- SecondaryStorage: integer (nullable = true)
 |-- PrimaryStorageType: string (nullable = true)
 |-- SecondaryStorageType: string (nullable = true)
 |-- GPU_company: string (nullable = true)
 |-- GPU_model: string (nullable = true)



In [123]:
df.schema.simpleString()

'struct<Company:string,Product:string,TypeName:string,Inches:double,Ram:int,OS:string,Weight:double,Price_euros:double,Screen:string,ScreenW:int,ScreenH:int,Touchscreen:string,IPSpanel:string,RetinaDisplay:string,CPU_company:string,CPU_freq:double,CPU_model:string,PrimaryStorage:int,SecondaryStorage:int,PrimaryStorageType:string,SecondaryStorageType:string,GPU_company:string,GPU_model:string>'

In [107]:
df.select('Company').show()

+-------+
|Company|
+-------+
|  Apple|
|  Apple|
|     HP|
|  Apple|
|  Apple|
|   Acer|
|  Apple|
|  Apple|
|   Asus|
|   Acer|
|     HP|
|     HP|
|  Apple|
|   Dell|
|  Apple|
|  Apple|
|   Dell|
|  Apple|
| Lenovo|
|   Dell|
+-------+
only showing top 20 rows



In [108]:
df.filter('Price_euros > 1000').show()

+-------+---------------+---------+------+---+----------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+--------------+--------------+----------------+------------------+--------------------+-----------+--------------------+
|Company|        Product| TypeName|Inches|Ram|        OS|Weight|Price_euros|  Screen|ScreenW|ScreenH|Touchscreen|IPSpanel|RetinaDisplay|CPU_company|CPU_freq|     CPU_model|PrimaryStorage|SecondaryStorage|PrimaryStorageType|SecondaryStorageType|GPU_company|           GPU_model|
+-------+---------------+---------+------+---+----------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+--------------+--------------+----------------+------------------+--------------------+-----------+--------------------+
|  Apple|    MacBook Pro|Ultrabook|  13.3|  8|     macOS|  1.37|    1339.69|Standard|   2560|   1600|         No|     Yes|          Yes|      Intel|     2.3|       Co

In [109]:
df.where('Price_euros < 1000').show()

+-------+--------------------+------------------+------+---+----------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+-----------------+--------------+----------------+------------------+--------------------+-----------+----------------+
|Company|             Product|          TypeName|Inches|Ram|        OS|Weight|Price_euros|  Screen|ScreenW|ScreenH|Touchscreen|IPSpanel|RetinaDisplay|CPU_company|CPU_freq|        CPU_model|PrimaryStorage|SecondaryStorage|PrimaryStorageType|SecondaryStorageType|GPU_company|       GPU_model|
+-------+--------------------+------------------+------+---+----------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+-----------------+--------------+----------------+------------------+--------------------+-----------+----------------+
|  Apple|         Macbook Air|         Ultrabook|  13.3|  8|     macOS|  1.34|     898.94|Standard|   1440|    900|         No|

In [110]:
df.groupBy('Company').count().show()

+---------+-----+
|  Company|count|
+---------+-----+
|    Razer|    7|
|  Fujitsu|    3|
|   Huawei|    2|
|   Xiaomi|    4|
|       HP|  268|
|     Dell|  291|
|     Vero|    4|
|     Acer|  101|
|     Asus|  152|
|   Lenovo|  289|
| Mediacom|    7|
|  Samsung|    9|
|   Google|    3|
|       LG|    3|
|    Chuwi|    3|
|Microsoft|    6|
|    Apple|   21|
|      MSI|   54|
|  Toshiba|   48|
+---------+-----+



In [111]:
df.groupBy('Company').count().orderBy('count').show()

+---------+-----+
|  Company|count|
+---------+-----+
|   Huawei|    2|
|  Fujitsu|    3|
|   Google|    3|
|       LG|    3|
|    Chuwi|    3|
|   Xiaomi|    4|
|     Vero|    4|
|Microsoft|    6|
|    Razer|    7|
| Mediacom|    7|
|  Samsung|    9|
|    Apple|   21|
|  Toshiba|   48|
|      MSI|   54|
|     Acer|  101|
|     Asus|  152|
|       HP|  268|
|   Lenovo|  289|
|     Dell|  291|
+---------+-----+



In [112]:
from pyspark.sql.functions import desc
df.groupBy('Company').count().orderBy(desc('count')).show()

+---------+-----+
|  Company|count|
+---------+-----+
|     Dell|  291|
|   Lenovo|  289|
|       HP|  268|
|     Asus|  152|
|     Acer|  101|
|      MSI|   54|
|  Toshiba|   48|
|    Apple|   21|
|  Samsung|    9|
|    Razer|    7|
| Mediacom|    7|
|Microsoft|    6|
|   Xiaomi|    4|
|     Vero|    4|
|  Fujitsu|    3|
|   Google|    3|
|       LG|    3|
|    Chuwi|    3|
|   Huawei|    2|
+---------+-----+



In [114]:
df.agg({'Price_euros': 'avg'}).show()

+------------------+
|  avg(Price_euros)|
+------------------+
|1134.9690588235296|
+------------------+



In [134]:
from pyspark.sql import functions as sf

In [138]:
df.withColumn('discount_rate', sf.lit(0.5)).withColumn('discounted_price', sf.col('Price_euros') * sf.col('discount_rate')).show()

+-------+-----------------+---------+------+---+----------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+--------------+--------------+----------------+------------------+--------------------+-----------+--------------------+-------------+----------------+
|Company|          Product| TypeName|Inches|Ram|        OS|Weight|Price_euros|  Screen|ScreenW|ScreenH|Touchscreen|IPSpanel|RetinaDisplay|CPU_company|CPU_freq|     CPU_model|PrimaryStorage|SecondaryStorage|PrimaryStorageType|SecondaryStorageType|GPU_company|           GPU_model|discount_rate|discounted_price|
+-------+-----------------+---------+------+---+----------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+--------------+--------------+----------------+------------------+--------------------+-----------+--------------------+-------------+----------------+
|  Apple|      MacBook Pro|Ultrabook|  13.3|  8|     macOS|  1.37| 

In [139]:
df.withColumnRenamed('Company', 'Brand').show()

+------+-----------------+---------+------+---+----------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+--------------+--------------+----------------+------------------+--------------------+-----------+--------------------+
| Brand|          Product| TypeName|Inches|Ram|        OS|Weight|Price_euros|  Screen|ScreenW|ScreenH|Touchscreen|IPSpanel|RetinaDisplay|CPU_company|CPU_freq|     CPU_model|PrimaryStorage|SecondaryStorage|PrimaryStorageType|SecondaryStorageType|GPU_company|           GPU_model|
+------+-----------------+---------+------+---+----------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+--------------+--------------+----------------+------------------+--------------------+-----------+--------------------+
| Apple|      MacBook Pro|Ultrabook|  13.3|  8|     macOS|  1.37|    1339.69|Standard|   2560|   1600|         No|     Yes|          Yes|      Intel|     2.3|     

In [141]:
df.drop('TypeName').show()

+-------+-----------------+------+---+----------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+--------------+--------------+----------------+------------------+--------------------+-----------+--------------------+
|Company|          Product|Inches|Ram|        OS|Weight|Price_euros|  Screen|ScreenW|ScreenH|Touchscreen|IPSpanel|RetinaDisplay|CPU_company|CPU_freq|     CPU_model|PrimaryStorage|SecondaryStorage|PrimaryStorageType|SecondaryStorageType|GPU_company|           GPU_model|
+-------+-----------------+------+---+----------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+--------------+--------------+----------------+------------------+--------------------+-----------+--------------------+
|  Apple|      MacBook Pro|  13.3|  8|     macOS|  1.37|    1339.69|Standard|   2560|   1600|         No|     Yes|          Yes|      Intel|     2.3|       Core i5|           128|           

In [142]:
df.count()

1275

In [143]:
df.distinct().count()

1275

In [145]:
df.sort(sf.desc('Price_euros')).show()

+-------+------------------+-----------+------+---+----------+------+-----------+-----------+-------+-------+-----------+--------+-------------+-----------+--------+----------------+--------------+----------------+------------------+--------------------+-----------+-----------------+
|Company|           Product|   TypeName|Inches|Ram|        OS|Weight|Price_euros|     Screen|ScreenW|ScreenH|Touchscreen|IPSpanel|RetinaDisplay|CPU_company|CPU_freq|       CPU_model|PrimaryStorage|SecondaryStorage|PrimaryStorageType|SecondaryStorageType|GPU_company|        GPU_model|
+-------+------------------+-----------+------+---+----------+------+-----------+-----------+-------+-------+-----------+--------+-------------+-----------+--------+----------------+--------------+----------------+------------------+--------------------+-----------+-----------------+
|  Razer|         Blade Pro|     Gaming|  17.3| 32|Windows 10|  3.49|     6099.0|4K Ultra HD|   3840|   2160|        Yes|      No|           No| 

In [146]:
df.sort(sf.asc('Price_euros')).show()

+--------+--------------------+--------+------+---+----------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+--------------------+--------------+----------------+------------------+--------------------+-----------+---------------+
| Company|             Product|TypeName|Inches|Ram|        OS|Weight|Price_euros|  Screen|ScreenW|ScreenH|Touchscreen|IPSpanel|RetinaDisplay|CPU_company|CPU_freq|           CPU_model|PrimaryStorage|SecondaryStorage|PrimaryStorageType|SecondaryStorageType|GPU_company|      GPU_model|
+--------+--------------------+--------+------+---+----------+------+-----------+--------+-------+-------+-----------+--------+-------------+-----------+--------+--------------------+--------------+----------------+------------------+--------------------+-----------+---------------+
|    Acer|C740-C9QX (3205U/...| Netbook|  11.6|  2| Chrome OS|   1.3|      174.0|Standard|   1366|    768|         No|      No|           No|      I

In [150]:
df.limit(10).explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- CollectLimit 10
   +- InMemoryTableScan [Company#23610, Product#23611, TypeName#23612, Inches#23613, Ram#23614, OS#23615, Weight#23616, Price_euros#23617, Screen#23618, ScreenW#23619, ScreenH#23620, Touchscreen#23621, IPSpanel#23622, RetinaDisplay#23623, CPU_company#23624, CPU_freq#23625, CPU_model#23626, PrimaryStorage#23627, SecondaryStorage#23628, PrimaryStorageType#23629, SecondaryStorageType#23630, GPU_company#23631, GPU_model#23632]
         +- InMemoryRelation [Company#23610, Product#23611, TypeName#23612, Inches#23613, Ram#23614, OS#23615, Weight#23616, Price_euros#23617, Screen#23618, ScreenW#23619, ScreenH#23620, Touchscreen#23621, IPSpanel#23622, RetinaDisplay#23623, CPU_company#23624, CPU_freq#23625, CPU_model#23626, PrimaryStorage#23627, SecondaryStorage#23628, PrimaryStorageType#23629, SecondaryStorageType#23630, GPU_company#23631, GPU_model#23632], StorageLevel(disk, memory, deserialized, 1 replicas)
            

In [151]:
df.limit(10).collect()

[Row(Company='Apple', Product='MacBook Pro', TypeName='Ultrabook', Inches=13.3, Ram=8, OS='macOS', Weight=1.37, Price_euros=1339.69, Screen='Standard', ScreenW=2560, ScreenH=1600, Touchscreen='No', IPSpanel='Yes', RetinaDisplay='Yes', CPU_company='Intel', CPU_freq=2.3, CPU_model='Core i5', PrimaryStorage=128, SecondaryStorage=0, PrimaryStorageType='SSD', SecondaryStorageType='No', GPU_company='Intel', GPU_model='Iris Plus Graphics 640'),
 Row(Company='Apple', Product='Macbook Air', TypeName='Ultrabook', Inches=13.3, Ram=8, OS='macOS', Weight=1.34, Price_euros=898.94, Screen='Standard', ScreenW=1440, ScreenH=900, Touchscreen='No', IPSpanel='No', RetinaDisplay='No', CPU_company='Intel', CPU_freq=1.8, CPU_model='Core i5', PrimaryStorage=128, SecondaryStorage=0, PrimaryStorageType='Flash Storage', SecondaryStorageType='No', GPU_company='Intel', GPU_model='HD Graphics 6000'),
 Row(Company='HP', Product='250 G6', TypeName='Notebook', Inches=15.6, Ram=8, OS='No OS', Weight=1.86, Price_euros=5

In [154]:
res = df.head(10)

In [155]:
type(res)

list

In [156]:
res

[Row(Company='Apple', Product='MacBook Pro', TypeName='Ultrabook', Inches=13.3, Ram=8, OS='macOS', Weight=1.37, Price_euros=1339.69, Screen='Standard', ScreenW=2560, ScreenH=1600, Touchscreen='No', IPSpanel='Yes', RetinaDisplay='Yes', CPU_company='Intel', CPU_freq=2.3, CPU_model='Core i5', PrimaryStorage=128, SecondaryStorage=0, PrimaryStorageType='SSD', SecondaryStorageType='No', GPU_company='Intel', GPU_model='Iris Plus Graphics 640'),
 Row(Company='Apple', Product='Macbook Air', TypeName='Ultrabook', Inches=13.3, Ram=8, OS='macOS', Weight=1.34, Price_euros=898.94, Screen='Standard', ScreenW=1440, ScreenH=900, Touchscreen='No', IPSpanel='No', RetinaDisplay='No', CPU_company='Intel', CPU_freq=1.8, CPU_model='Core i5', PrimaryStorage=128, SecondaryStorage=0, PrimaryStorageType='Flash Storage', SecondaryStorageType='No', GPU_company='Intel', GPU_model='HD Graphics 6000'),
 Row(Company='HP', Product='250 G6', TypeName='Notebook', Inches=15.6, Ram=8, OS='No OS', Weight=1.86, Price_euros=5

In [157]:
df.first()

Row(Company='Apple', Product='MacBook Pro', TypeName='Ultrabook', Inches=13.3, Ram=8, OS='macOS', Weight=1.37, Price_euros=1339.69, Screen='Standard', ScreenW=2560, ScreenH=1600, Touchscreen='No', IPSpanel='Yes', RetinaDisplay='Yes', CPU_company='Intel', CPU_freq=2.3, CPU_model='Core i5', PrimaryStorage=128, SecondaryStorage=0, PrimaryStorageType='SSD', SecondaryStorageType='No', GPU_company='Intel', GPU_model='Iris Plus Graphics 640')

In [158]:
# Calculate approximate quartiles of Price_euros.
# The last argument is the allowed relative error (0.01 here).

df.approxQuantile('Price_euros', [0.25, 0.5, 0.75], 0.01)

[609.0, 979.0, 1479.0]

In [164]:
# Drop duplicate rows based on specific columns.

df.dropDuplicates(['Company']).count()

19

In [167]:
# Checkpointing needs a directory; set it through the session's SparkContext.
spark.sparkContext.setCheckpointDir('checkpoints/dataframes')

In [168]:
# checkpoint() returns a new, checkpointed DataFrame; df itself is unchanged.
df.checkpoint()

DataFrame[Company: string, Product: string, TypeName: string, Inches: double, Ram: int, OS: string, Weight: double, Price_euros: double, Screen: string, ScreenW: int, ScreenH: int, Touchscreen: string, IPSpanel: string, RetinaDisplay: string, CPU_company: string, CPU_freq: double, CPU_model: string, PrimaryStorage: int, SecondaryStorage: int, PrimaryStorageType: string, SecondaryStorageType: string, GPU_company: string, GPU_model: string]

In [172]:
df.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   TableCacheQueryStage 0
   +- InMemoryTableScan [Company#23610, Product#23611, TypeName#23612, Inches#23613, Ram#23614, OS#23615, Weight#23616, Price_euros#23617, Screen#23618, ScreenW#23619, ScreenH#23620, Touchscreen#23621, IPSpanel#23622, RetinaDisplay#23623, CPU_company#23624, CPU_freq#23625, CPU_model#23626, PrimaryStorage#23627, SecondaryStorage#23628, PrimaryStorageType#23629, SecondaryStorageType#23630, GPU_company#23631, GPU_model#23632]
         +- InMemoryRelation [Company#23610, Product#23611, TypeName#23612, Inches#23613, Ram#23614, OS#23615, Weight#23616, Price_euros#23617, Screen#23618, ScreenW#23619, ScreenH#23620, Touchscreen#23621, IPSpanel#23622, RetinaDisplay#23623, CPU_company#23624, CPU_freq#23625, CPU_model#23626, PrimaryStorage#23627, SecondaryStorage#23628, PrimaryStorageType#23629, SecondaryStorageType#23630, GPU_company#23631, GPU_model#23632], StorageLevel(disk, memory, deserialized

In [159]:
# Remove the DataFrame from memory/disk.

df.unpersist()

DataFrame[Company: string, Product: string, TypeName: string, Inches: double, Ram: int, OS: string, Weight: double, Price_euros: double, Screen: string, ScreenW: int, ScreenH: int, Touchscreen: string, IPSpanel: string, RetinaDisplay: string, CPU_company: string, CPU_freq: double, CPU_model: string, PrimaryStorage: int, SecondaryStorage: int, PrimaryStorageType: string, SecondaryStorageType: string, GPU_company: string, GPU_model: string]

In [160]:
# Check whether the DataFrame is still cached.

df.is_cached

False