# **Operations performed using spark dataframe**

Once you have loaded your data as a dataframe using pyspark, many operations become available that help you analyze the data more effectively. In this notebook let's discuss how to



> Execute SQL commands on a spark dataframe.


> Query the spark dataframe using filter with single and multiple conditional operators.


> Convert a query result (spark dataframe) into a python dictionary (a JSON-like object).







In [1]:
#As an initial step let's create a spark session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("df_ops").getOrCreate()

In [2]:
#Reading the data using spark as a dataframe.
df=spark.read.csv("Data1.csv", inferSchema=True, header=True)

In [3]:
df.show()

+-----------+----+
|       Name| Age|
+-----------+----+
|Swaminathan|  24|
|      Peter|null|
|        Sam|  54|
|      Henry|null|
+-----------+----+



In [4]:
#Creating a new column "NewAge" that holds twice the value of "Age".
df=df.withColumn("NewAge",df["Age"]*2)
df.show()

+-----------+----+------+
|       Name| Age|NewAge|
+-----------+----+------+
|Swaminathan|  24|    48|
|      Peter|null|  null|
|        Sam|  54|   108|
|      Henry|null|  null|
+-----------+----+------+



## **Executing SQL commands on spark dataframe**

SQL queries can be run against a spark dataframe by first registering it as a temporary view, as shown below.

In [5]:
#Registering the spark dataframe as a temporary view named "employee" (the data itself is not copied).
df.createOrReplaceTempView("employee")

Getting the data of employees whose age is less than or equal to 25.

In [6]:
#Getting employees whose age is less than or equal to 25.
spark.sql("SELECT * FROM employee WHERE Age<=25").show()

+-----------+---+------+
|       Name|Age|NewAge|
+-----------+---+------+
|Swaminathan| 24|    48|
+-----------+---+------+



Getting the name of the employee whose age is 50 or above.

In [7]:
#Getting the name of the employee whose age is 50 or above.
spark.sql("SELECT Name FROM employee WHERE Age>=50").show()

+----+
|Name|
+----+
| Sam|
+----+



Getting the name of the employee whose Age is 24 and NewAge is 48.

In [8]:
#Getting the name of the employee whose Age is 24 and NewAge is 48.
spark.sql("SELECT Name FROM employee WHERE Age = 24 AND NewAge = 48").show()

+-----------+
|       Name|
+-----------+
|Swaminathan|
+-----------+



## **Query the spark dataframe using filter**

Besides running SQL queries, we can also query a spark dataframe using the filter() method that is available for pyspark dataframes.

### **Filtering using single condition**

Let's see how to filter a spark dataframe using one condition.

In [9]:
#Getting the details of an employee whose age is above 50.
df.filter("Age > 50").show()

+----+---+------+
|Name|Age|NewAge|
+----+---+------+
| Sam| 54|   108|
+----+---+------+



You can also write the same query with a column expression instead of a string, as shown below.

In [10]:
df.filter(df["Age"]>50).show()

+----+---+------+
|Name|Age|NewAge|
+----+---+------+
| Sam| 54|   108|
+----+---+------+



In [11]:
#Let's pick two columns from the filter result using select()
df.filter(df["Age"]>50).select(["Name","Age"]).show()

+----+---+
|Name|Age|
+----+---+
| Sam| 54|
+----+---+



### **Filtering using multiple conditions**

With filter() you can query a dataframe on one or more conditions. When the conditions are written as column expressions, combine them with the symbolic operators & (AND), | (OR) and ~ (NOT); the Python keywords and, or and not do not work on columns. Each condition must also be enclosed in brackets (), because & and | bind more tightly than comparison operators such as > and ==. Without the brackets the expression is parsed differently and will usually throw an error.
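
The need for brackets comes from Python's operator precedence rather than from spark itself. As a small sketch, the snippet below uses only the standard-library ast module (no spark session required) to show how Python parses a condition with and without brackets. The names age and name are placeholders standing in for df["Age"] and df["Name"]; they are not part of the notebook.

```python
import ast

# Placeholder expressions: "age" and "name" stand in for df["Age"] and df["Name"].
without_parens = ast.parse('age > 20 & name == "Swaminathan"', mode="eval").body
with_parens = ast.parse('(age > 20) & (name == "Swaminathan")', mode="eval").body

# Without brackets, Python sees a chained comparison:
#   age > (20 & name) == "Swaminathan"
print(type(without_parens).__name__)                 # Compare
print(type(without_parens.comparators[0]).__name__)  # BinOp, i.e. (20 & name)

# With brackets, the two comparisons are combined by a single & operation.
print(type(with_parens).__name__)                    # BinOp
print(type(with_parens.left).__name__)               # Compare
print(type(with_parens.right).__name__)              # Compare
```

On real columns, the chained comparison makes Python try to convert a column into a boolean, which pyspark typically rejects with an error.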

In [12]:
#Filtering using multiple conditions
df.filter((df["Age"]>20) & (df["Name"]=="Swaminathan")).show()

+-----------+---+------+
|       Name|Age|NewAge|
+-----------+---+------+
|Swaminathan| 24|    48|
+-----------+---+------+



In [13]:
#Choosing specific columns of a filtered spark dataframe
df.filter((df["Age"]>20) & (df["Name"]=="Swaminathan")).select(["Name","Age"]).show()

+-----------+---+
|       Name|Age|
+-----------+---+
|Swaminathan| 24|
+-----------+---+



## **Converting a filter query result to JSON (python dictionary)**

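A filtered query result is itself a spark dataframe. You can convert it into a python dictionary (a JSON-like object) in two steps: the dataframe's collect() method returns the result as a list of Row objects, and each Row's asDict() method turns that row into a dictionary. Both methods are provided by pyspark.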

In [14]:
#Let's filter the record of an individual whose NewAge is above 60.
df.filter(df["NewAge"]>60).show()

+----+---+------+
|Name|Age|NewAge|
+----+---+------+
| Sam| 54|   108|
+----+---+------+



In [15]:
#Next, let's collect the filter result and assign it to a variable
filter_result = df.filter(df["NewAge"]>60).collect()

In [16]:
#The result would be a list of rows.
filter_result

[Row(Name='Sam', Age=54, NewAge=108)]

In [17]:
#Let's convert the result into a python dictionary
filter_result[0].asDict()

{'Name': 'Sam', 'Age': 54, 'NewAge': 108}

In [18]:
#Let's get the "Name" value from the dictionary
filter_result[0].asDict()["Name"]

'Sam'