# 9 most useful functions for PySpark DataFrame

Article link = https://www.analyticsvidhya.com/blog/2021/05/9-most-useful-functions-for-pyspark-dataframe/

In [1]:
# Importing Sparksession
from pyspark.sql import SparkSession

In [2]:
# Creating a sparksession
spark = SparkSession.builder.appName("PySparkdf").getOrCreate()

In [3]:
df = spark.read.csv('Data/cereal.csv', inferSchema=True, header=True)

In [4]:
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- mfr: string (nullable = true)
 |-- type: string (nullable = true)
 |-- calories: integer (nullable = true)
 |-- protein: integer (nullable = true)
 |-- fat: integer (nullable = true)
 |-- sodium: integer (nullable = true)
 |-- fiber: double (nullable = true)
 |-- carbo: double (nullable = true)
 |-- sugars: integer (nullable = true)
 |-- potass: integer (nullable = true)
 |-- vitamins: integer (nullable = true)
 |-- shelf: integer (nullable = true)
 |-- weight: double (nullable = true)
 |-- cups: double (nullable = true)
 |-- rating: double (nullable = true)



**select():** Returns a new DataFrame that contains only the specified columns of the original DataFrame.

In [8]:
# Selecting relevant columns
df.select('name', 'mfr', 'type', 'calories').show(10)

+--------------------+---+----+--------+
|                name|mfr|type|calories|
+--------------------+---+----+--------+
|           100% Bran|  N|   C|      70|
|   100% Natural Bran|  Q|   C|     120|
|            All-Bran|  K|   C|      70|
|All-Bran with Ext...|  K|   C|      50|
|      Almond Delight|  R|   C|     110|
|Apple Cinnamon Ch...|  G|   C|     110|
|         Apple Jacks|  K|   C|     110|
|             Basic 4|  G|   C|     130|
|           Bran Chex|  R|   C|      90|
|         Bran Flakes|  P|   C|      90|
+--------------------+---+----+--------+
only showing top 10 rows



**withColumn():** Used to modify an existing column or to create a new column derived from existing ones. It is a transformation that returns a new DataFrame, so it can also be used to change the datatype of an existing column.

In [9]:
# Cast the calories column to integer (the result is not assigned, so df itself is unchanged)
df.withColumn("Calories", df['calories'].cast("Integer")).printSchema()

root
 |-- name: string (nullable = true)
 |-- mfr: string (nullable = true)
 |-- type: string (nullable = true)
 |-- Calories: integer (nullable = true)
 |-- protein: integer (nullable = true)
 |-- fat: integer (nullable = true)
 |-- sodium: integer (nullable = true)
 |-- fiber: double (nullable = true)
 |-- carbo: double (nullable = true)
 |-- sugars: integer (nullable = true)
 |-- potass: integer (nullable = true)
 |-- vitamins: integer (nullable = true)
 |-- shelf: integer (nullable = true)
 |-- weight: double (nullable = true)
 |-- cups: double (nullable = true)
 |-- rating: double (nullable = true)



**groupBy():** Groups the rows of a DataFrame that share the same values in the given columns, so that aggregate functions such as count, sum or avg can be applied to each group.

In [11]:
df.groupBy("name", "calories").count().show()

+--------------------+--------+-----+
|                name|calories|count|
+--------------------+--------+-----+
|Just Right Fruit ...|     140|    1|
|         Raisin Bran|     120|    1|
|Shredded Wheat sp...|      90|    1|
|           Corn Pops|     110|    1|
|  Honey Nut Cheerios|     110|    1|
|Muesli Raisins; D...|     150|    1|
|      Fruity Pebbles|     110|    1|
|           100% Bran|      70|    1|
|       Fruitful Bran|     120|    1|
|         Puffed Rice|      50|    1|
|      Raisin Squares|      90|    1|
|   Total Raisin Bran|     140|    1|
|      Golden Grahams|     110|    1|
|   Nutri-grain Wheat|      90|    1|
|   100% Natural Bran|     120|    1|
|Apple Cinnamon Ch...|     110|    1|
|Mueslix Crispy Blend|     160|    1|
|Shredded Wheat 'n...|      90|    1|
|              Smacks|     110|    1|
|      Quaker Oatmeal|     100|    1|
+--------------------+--------+-----+
only showing top 20 rows



**orderBy():** Sorts the entire DataFrame by one or more columns, in ascending order by default.

In [12]:
df.orderBy("protein").show()

+--------------------+---+----+--------+-------+---+------+-----+-----+------+------+--------+-----+------+----+---------+
|                name|mfr|type|calories|protein|fat|sodium|fiber|carbo|sugars|potass|vitamins|shelf|weight|cups|   rating|
+--------------------+---+----+--------+-------+---+------+-----+-----+------+------+--------+-----+------+----+---------+
|      Frosted Flakes|  K|   C|     110|      1|  0|   200|  1.0| 14.0|    11|    25|      25|    1|   1.0|0.75|31.435973|
|        Cap'n'Crunch|  Q|   C|     120|      1|  2|   220|  0.0| 12.0|    12|    35|      25|    2|   1.0|0.75|18.042851|
|Cinnamon Toast Cr...|  G|   C|     120|      1|  3|   210|  0.0| 13.0|     9|    45|      25|    2|   1.0|0.75|19.823573|
|         Puffed Rice|  Q|   C|      50|      1|  0|     0|  0.0| 13.0|     0|    15|       0|    3|   0.5| 1.0|60.756112|
|           Rice Chex|  R|   C|     110|      1|  0|   240|  0.0| 23.0|     2|    30|      25|    1|   1.0|1.13|41.998933|
|               

**split():** Splits a string column into an array of substrings, using a regular expression as the delimiter. It is applied with the help of withColumn() or select(), and getItem() picks individual pieces out of the array to form separate columns.

In [13]:
from pyspark.sql.functions import split

In [15]:
df1 = df.withColumn('Name1', split(df['name'], " ").getItem(0))\
        .withColumn('Name2', split(df['name'], " ").getItem(1))

In [16]:
df1.select("name", "Name1", "Name2").show()

+--------------------+------------+--------+
|                name|       Name1|   Name2|
+--------------------+------------+--------+
|           100% Bran|        100%|    Bran|
|   100% Natural Bran|        100%| Natural|
|            All-Bran|    All-Bran|    null|
|All-Bran with Ext...|    All-Bran|    with|
|      Almond Delight|      Almond| Delight|
|Apple Cinnamon Ch...|       Apple|Cinnamon|
|         Apple Jacks|       Apple|   Jacks|
|             Basic 4|       Basic|       4|
|           Bran Chex|        Bran|    Chex|
|         Bran Flakes|        Bran|  Flakes|
|        Cap'n'Crunch|Cap'n'Crunch|    null|
|            Cheerios|    Cheerios|    null|
|Cinnamon Toast Cr...|    Cinnamon|   Toast|
|            Clusters|    Clusters|    null|
|         Cocoa Puffs|       Cocoa|   Puffs|
|           Corn Chex|        Corn|    Chex|
|         Corn Flakes|        Corn|  Flakes|
|           Corn Pops|        Corn|    Pops|
|       Count Chocula|       Count| Chocula|
|  Crackli

**lit():** Creates a column that holds a literal (constant) value, which is typically used to add a constant column to the DataFrame.

In [18]:
from pyspark.sql.functions import lit

df2 = df.select("name", lit("75 gm").alias("intake quantity"))

In [19]:
df2.show()

+--------------------+---------------+
|                name|intake quantity|
+--------------------+---------------+
|           100% Bran|          75 gm|
|   100% Natural Bran|          75 gm|
|            All-Bran|          75 gm|
|All-Bran with Ext...|          75 gm|
|      Almond Delight|          75 gm|
|Apple Cinnamon Ch...|          75 gm|
|         Apple Jacks|          75 gm|
|             Basic 4|          75 gm|
|           Bran Chex|          75 gm|
|         Bran Flakes|          75 gm|
|        Cap'n'Crunch|          75 gm|
|            Cheerios|          75 gm|
|Cinnamon Toast Cr...|          75 gm|
|            Clusters|          75 gm|
|         Cocoa Puffs|          75 gm|
|           Corn Chex|          75 gm|
|         Corn Flakes|          75 gm|
|           Corn Pops|          75 gm|
|       Count Chocula|          75 gm|
|  Cracklin' Oat Bran|          75 gm|
+--------------------+---------------+
only showing top 20 rows



**when():** Returns a value for each row depending on whether a condition holds. Rows that match no condition get null unless otherwise() is supplied.

In [20]:
from pyspark.sql.functions import when

In [22]:
df.select("name",
          when(df.vitamins >= 25, "rich in vitamins")).show()

+--------------------+----------------------------------------------------+
|                name|CASE WHEN (vitamins >= 25) THEN rich in vitamins END|
+--------------------+----------------------------------------------------+
|           100% Bran|                                    rich in vitamins|
|   100% Natural Bran|                                                null|
|            All-Bran|                                    rich in vitamins|
|All-Bran with Ext...|                                    rich in vitamins|
|      Almond Delight|                                    rich in vitamins|
|Apple Cinnamon Ch...|                                    rich in vitamins|
|         Apple Jacks|                                    rich in vitamins|
|             Basic 4|                                    rich in vitamins|
|           Bran Chex|                                    rich in vitamins|
|         Bran Flakes|                                    rich in vitamins|
|        Cap

**filter():** Keeps only the rows that satisfy a condition on one or more column values.

In [23]:
# No import is needed: filter() is a method of the DataFrame itself

In [24]:
df.filter(df.calories == 100).show()

+--------------------+---+----+--------+-------+---+------+-----+-----+------+------+--------+-----+------+----+---------+
|                name|mfr|type|calories|protein|fat|sodium|fiber|carbo|sugars|potass|vitamins|shelf|weight|cups|   rating|
+--------------------+---+----+--------+-------+---+------+-----+-----+------+------+--------+-----+------+----+---------+
|         Corn Flakes|  K|   C|     100|      2|  0|   290|  1.0| 21.0|     2|    35|      25|    1|   1.0| 1.0|45.863324|
|Cream of Wheat (Q...|  N|   H|     100|      3|  0|    80|  1.0| 21.0|     0|    -1|       0|    2|   1.0| 1.0|64.533816|
|Crispy Wheat & Ra...|  G|   C|     100|      2|  1|   140|  2.0| 11.0|    10|   120|      25|    3|   1.0|0.75|36.176196|
|         Double Chex|  R|   C|     100|      2|  0|   190|  1.0| 18.0|     5|    80|      25|    3|   1.0|0.75|44.330856|
| Frosted Mini-Wheats|  K|   C|     100|      3|  0|     0|  3.0| 14.0|     7|   100|      25|    2|   1.0| 0.8|58.345141|
|        Golden 

**isNull()/isNotNull():** These two Column methods test whether each value in a column is null or not null. Combined with filter(), they show whether a column contains any null values.

In [25]:
# No import is needed: isNull() and isNotNull() are methods of the Column object

In [26]:
# Keep only the rows where name is not null
df.filter(df.name.isNotNull()).show()

+--------------------+---+----+--------+-------+---+------+-----+-----+------+------+--------+-----+------+----+---------+
|                name|mfr|type|calories|protein|fat|sodium|fiber|carbo|sugars|potass|vitamins|shelf|weight|cups|   rating|
+--------------------+---+----+--------+-------+---+------+-----+-----+------+------+--------+-----+------+----+---------+
|           100% Bran|  N|   C|      70|      4|  1|   130| 10.0|  5.0|     6|   280|      25|    3|   1.0|0.33|68.402973|
|   100% Natural Bran|  Q|   C|     120|      3|  5|    15|  2.0|  8.0|     8|   135|       0|    3|   1.0| 1.0|33.983679|
|            All-Bran|  K|   C|      70|      4|  1|   260|  9.0|  7.0|     5|   320|      25|    3|   1.0|0.33|59.425505|
|All-Bran with Ext...|  K|   C|      50|      4|  0|   140| 14.0|  8.0|     0|   330|      25|    3|   1.0| 0.5|93.704912|
|      Almond Delight|  R|   C|     110|      2|  2|   200|  1.0| 14.0|     8|    -1|      25|    3|   1.0|0.75|34.384843|
|Apple Cinnamon 

In [27]:
df.filter(df.name.isNull()).show()

+----+---+----+--------+-------+---+------+-----+-----+------+------+--------+-----+------+----+------+
|name|mfr|type|calories|protein|fat|sodium|fiber|carbo|sugars|potass|vitamins|shelf|weight|cups|rating|
+----+---+----+--------+-------+---+------+-----+-----+------+------+--------+-----+------+----+------+
+----+---+----+--------+-------+---+------+-----+-----+------+------+--------+-----+------+----+------+

