# Row Filtering

As a data scientist, you will often find yourself filtering rows based on the values of one or several columns. In this notebook we will go over the basic tools used to filter rows based on column values.

We will start by importing the pyspark machinery and a helper function that creates the table we will be using.

In [2]:
# Spark related machinery
import pyspark
import pyspark.sql.functions as F
from pyspark import SparkConf
from pyspark.sql.types import *
from pyspark.sql import SparkSession
from pyspark.sql import HiveContext
from pyspark.sql.functions import concat_ws

spark = pyspark.sql.SparkSession.builder.enableHiveSupport().getOrCreate()

In [3]:
from pyspark_functions import create_sp_table1

In the following block of code we create the table and print the first 5 rows. The table has 1000 rows in total and 4 columns:

* student_id
* exam_1
* exam_2
* exam_3

The table represents the grades obtained by 1000 students in 3 exams.

In [5]:
#Create the table
grades = create_sp_table1()

#Print the first 5 rows
grades.show(5)

+----------+------+------+------+
|student_id|exam_1|exam_2|exam_3|
+----------+------+------+------+
|         1|     9|     5|     9|
|         2|     4|     3|    10|
|         3|     0|    10|     0|
|         4|     8|     9|     4|
|         5|     8|     2|     5|
+----------+------+------+------+
only showing top 5 rows



## Filtering Based on One Column
In the following block of code we create a new table containing only rows where the value for the column **exam_1** is equal to 10. Then, we print the first 5 rows of the new table.


In [6]:
#Keep only rows where the column exam_1 is equal to 10
exam1_best = grades.where(F.col("exam_1") == 10)

#Print the first 5 rows of the table
exam1_best.show(5)

#Count the number of rows in the new table
print("number of rows in the new table: ", exam1_best.count())

+----------+------+------+------+
|student_id|exam_1|exam_2|exam_3|
+----------+------+------+------+
|         6|    10|     4|     2|
|         8|    10|     6|     4|
|         9|    10|     8|     4|
|        16|    10|     3|     4|
|        33|    10|     0|     4|
+----------+------+------+------+
only showing top 5 rows

number of rows in the new table:  83


The filtering took place using the method ```where```. However, we could have used the method ```filter``` to accomplish the same task, since ```where``` is an alias for ```filter```. Try it! Just replace ```where``` with ```filter```.

It is important to notice a couple of things:

1. We are referring to the column **exam_1** using the expression ```F.col("exam_1")```, see the first line in the block of code above.
2. The new table has fewer rows (83) than the initial table (1000), see the print statement at the end of the block of code above.

## Filtering Based on Two Columns
In this example we will construct a table where the values for column **exam_1** are less than or equal to 4 and the values of the column **exam_2** are greater than or equal to 7. To achieve this goal we will use the "**and**" operator, which is represented by ```&```. Note that each condition is wrapped in parentheses. We will name our new table **bad_and_good**.

In [7]:
#Create new table by filtering rows
bad_and_good = grades.where((F.col("exam_1") <= 4) & (F.col("exam_2") >= 7))

#Print the first 5 rows of the table to screen
bad_and_good.show(5)

#Count the number of rows in the new table
print("number of rows in the new table: ", bad_and_good.count())

+----------+------+------+------+
|student_id|exam_1|exam_2|exam_3|
+----------+------+------+------+
|         3|     0|    10|     0|
|        17|     1|    10|     3|
|        18|     1|     8|     6|
|        19|     0|    10|     8|
|        20|     3|     8|     2|
+----------+------+------+------+
only showing top 5 rows

number of rows in the new table:  187


As you can see, the new table has fewer rows (187) than the original table (1000).

Now, one cool trick that is very useful when you have many conditions in your filtering is to define the conditions before the filtering. In the block of code below, we create a table where the values for the column **exam_1** are less than or equal to 3, or the values for the column **exam_3** are greater than or equal to 8. The "**or**" operator is represented by ```|```.

In [8]:
#Define conditions
condition_1 = F.col("exam_1") <= 3
condition_2 = F.col("exam_3") >= 8

#Create new table by filtering rows
bad_or_good = grades.where(condition_1 | condition_2)

#Print the first 5 rows of the table to screen
bad_or_good.show(5)

#Count the number of rows in the new table
print("number of rows in the new table: ", bad_or_good.count())

+----------+------+------+------+
|student_id|exam_1|exam_2|exam_3|
+----------+------+------+------+
|         1|     9|     5|     9|
|         2|     4|     3|    10|
|         3|     0|    10|     0|
|         7|     2|     6|     7|
|        13|     4|     6|     8|
+----------+------+------+------+
only showing top 5 rows

number of rows in the new table:  544


## Final Words

We went over the basic concepts behind row filtering in PySpark. Now it is time for you to start coding: try changing the filter conditions and building different combinations of the **"or"** and **"and"** operators. Also try using the method ```filter``` instead of ```where```.