# A Brief Walkthrough of the Technologies

Apache Spark is one of the most widely discussed technologies in data engineering today. It is arguably the framework best positioned to bring Big Data and Machine Learning together.

Spark is written mostly in Scala, a language that combines object-oriented and functional programming and runs on the Java Virtual Machine (JVM). Spark can run on top of Hadoop and read from HDFS, but it does not require either.
Because Scala runs on the JVM, a compatible Java installation is needed on your system. However, for most beginners, Scala is not the first language they learn when venturing into the world of data science.

Fortunately, Spark provides an excellent Python API, called PySpark, which lets Python programmers interface with the Spark framework, manipulate data at scale, and work with objects and algorithms over a distributed file system.



# The Power of Spark


One thing to remember is that Spark is not a programming language like Python or Java. It is a general-purpose distributed data processing engine, suitable for use in a wide range of circumstances. It is particularly useful for big data processing both at scale and with high speed.

Application developers and data scientists generally incorporate Spark into their applications to rapidly query, analyze, and transform data at scale. 

Some of the tasks most frequently associated with Spark include:

* ETL and SQL batch jobs across large data sets (often terabytes in size),
* processing of streaming data from IoT devices and nodes, various sensors, and financial and transactional systems of all kinds, and
* machine learning tasks for e-commerce or IT applications.

Spark is often deployed alongside Hadoop and can read distributed files from HDFS, although it also runs standalone and works with other storage systems. It is mostly implemented in Scala, a JVM language that interoperates with Java.

There is a core Spark data processing engine, but on top of that, there are many libraries developed for SQL-type query analysis, distributed machine learning, large-scale graph computation, and streaming data processing. 

Multiple programming languages are supported by Spark in the form of easy interface libraries: Java, Python, Scala, and R.


# How Spark Works

The basic idea of distributed processing is to divide the data into small, manageable chunks (including some filtering and sorting), bring the computation close to the data, i.e. use the individual nodes of a large cluster for specific jobs, and then recombine the results.

The dividing portion is called the ‘Map’ action and the recombination is called the ‘Reduce’ action. Together, they make the famous ‘MapReduce’ paradigm, which was introduced by Google around 2004.

For example, if a file has 100 records to be processed, 100 mappers can run together to process one record each. Or maybe 50 mappers can run together to process two records each. After all the mappers complete processing, the framework shuffles and sorts the results before passing them on to the reducers. A reducer cannot start while a mapper is still in progress. 

All the map output values that have the same key are assigned to a single reducer, which then aggregates the values for that key.
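The map, shuffle and reduce steps described above can be imitated in a few lines of plain Python. This is only a single-machine sketch of the idea, not how Spark or Hadoop implement it; the `records` list and the helper names `mapper`, `shuffle` and `reducer` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical input: (quality, alcohol) records, invented for illustration.
records = [(5, 9.4), (5, 9.8), (6, 9.8), (5, 9.4), (7, 10.0), (6, 10.5)]

def mapper(record):
    """Map step: emit one (key, value) pair for one record."""
    quality, _alcohol = record
    return (quality, 1)

def shuffle(pairs):
    """Shuffle/sort step: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reducer(key, values):
    """Reduce step: aggregate the values collected for a single key."""
    return (key, sum(values))

mapped = [mapper(r) for r in records]   # every mapper finishes first
grouped = shuffle(mapped)               # then results are grouped by key
counts = dict(reducer(k, v) for k, v in grouped.items())
print(counts)  # {5: 3, 6: 2, 7: 1}
```

Each key ends up with exactly one reducer call, which mirrors the rule that all values sharing a key go to a single reducer.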



Reference: https://www.kdnuggets.com/2020/04/benefits-apache-spark-pyspark.html

# What are we planning to do?

1. Creating a DataFrame manually
2. Read DataFrame
3. Write DataFrame
4. Validate DataFrame
5. Search DataFrame
6. Select functionality on a DataFrame
7. Filter DataFrame
8. Slicing a DataFrame
9. Introducing SQL in PySpark

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.0.tar.gz (281.3 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.2
  Downloading py4j-0.10.9.2-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.0-py2.py3-none-any.whl size=281805912 sha256=aeb5caf5693deff21e6aaebd39764d45b471aec4e53a8ead08e80838ae253133
  Stored in directory: /root/.cache/pip/wheels/0b/de/d2/9be5d59

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
import pyspark # installed via pip above, so findspark is not needed
from pyspark.sql import SparkSession
# May take a while locally
spark = SparkSession.builder.appName("Pyspark_1").getOrCreate()

# Note: this counts block managers (the driver plus any executors), not CPU cores
block_managers = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
print("You are working with", block_managers, "block manager(s)")
spark

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/01/08 12:37:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


You are working with 1 block manager(s)


# Creating a DataFrame manually

In [3]:
values = [('Pasta',100),('Pizza',200),('Noodles',50),('Burger',199),('Dosa',10000),('Bread',1)]
df = spark.createDataFrame(values,['food','my_hunger_index'])
df.show()

                                                                                

+-------+---------------+
|   food|my_hunger_index|
+-------+---------------+
|  Pasta|            100|
|  Pizza|            200|
|Noodles|             50|
| Burger|            199|
|   Dosa|          10000|
|  Bread|              1|
+-------+---------------+



# Reading a DataFrame

In [4]:
path =""

# Some csv data
wine_quality = spark.read.csv(path+'/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv',
                          inferSchema=True,header=True)

In [5]:
wine_quality

DataFrame[fixed acidity: double, volatile acidity: double, citric acid: double, residual sugar: double, chlorides: double, free sulfur dioxide: double, total sulfur dioxide: double, density: double, pH: double, sulphates: double, alcohol: double, quality: int]

#### Evaluating the DataFrame object prints its column names and their data types, separated by colons.

In [6]:
wine_quality.show()

+-------------+----------------+-----------+--------------+-------------------+-------------------+--------------------+-------+----+---------+-------+-------+
|fixed acidity|volatile acidity|citric acid|residual sugar|          chlorides|free sulfur dioxide|total sulfur dioxide|density|  pH|sulphates|alcohol|quality|
+-------------+----------------+-----------+--------------+-------------------+-------------------+--------------------+-------+----+---------+-------+-------+
|          7.4|             0.7|        0.0|           1.9|              0.076|               11.0|                34.0| 0.9978|3.51|     0.56|    9.4|      5|
|          7.8|            0.88|        0.0|           2.6|              0.098|               25.0|                67.0| 0.9968| 3.2|     0.68|    9.8|      5|
|          7.8|            0.76|       0.04|           2.3|              0.092|               15.0|                54.0|  0.997|3.26|     0.65|    9.8|      5|
|         11.2|            0.28|       0

#### That is too many rows; let's use the limit() method.

In [7]:
wine_quality.limit(5).show()

+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+
|fixed acidity|volatile acidity|citric acid|residual sugar|chlorides|free sulfur dioxide|total sulfur dioxide|density|  pH|sulphates|alcohol|quality|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+
|          7.4|             0.7|        0.0|           1.9|    0.076|               11.0|                34.0| 0.9978|3.51|     0.56|    9.4|      5|
|          7.8|            0.88|        0.0|           2.6|    0.098|               25.0|                67.0| 0.9968| 3.2|     0.68|    9.8|      5|
|          7.8|            0.76|       0.04|           2.3|    0.092|               15.0|                54.0|  0.997|3.26|     0.65|    9.8|      5|
|         11.2|            0.28|       0.56|           1.9|    0.075|               17.0|           

#### Better, but still not easy on the eyes.

In [8]:
wine_quality.limit(5).toPandas()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


#### This is much easier on the eyes, but what happens inside the toPandas() call?
* First, this method only works if the Pandas library is installed and available in the Python execution environment. As you may have guessed, it returns a Pandas DataFrame.
* Second, do not use this method to display large content: all the data is loaded into the driver's memory, which amounts to using Pandas without Spark's ability to handle large data.
* If the data is too large to fit, the process will most probably either crash or run at a snail's pace.

#### Note the types here:

In [9]:
print(type(wine_quality))

wine_quality_pandas_df = wine_quality.toPandas()
print(type(wine_quality_pandas_df))

<class 'pyspark.sql.dataframe.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


# Data Summary

#### We have already seen show(), limit() and toPandas() in action for printing data.

#### We have also seen that evaluating the DataFrame object displays its schema, as below

In [10]:
wine_quality

DataFrame[fixed acidity: double, volatile acidity: double, citric acid: double, residual sugar: double, chlorides: double, free sulfur dioxide: double, total sulfur dioxide: double, density: double, pH: double, sulphates: double, alcohol: double, quality: int]

#### The schema can be displayed far more readably with printSchema().

In [11]:
wine_quality.printSchema()


root
 |-- fixed acidity: double (nullable = true)
 |-- volatile acidity: double (nullable = true)
 |-- citric acid: double (nullable = true)
 |-- residual sugar: double (nullable = true)
 |-- chlorides: double (nullable = true)
 |-- free sulfur dioxide: double (nullable = true)
 |-- total sulfur dioxide: double (nullable = true)
 |-- density: double (nullable = true)
 |-- pH: double (nullable = true)
 |-- sulphates: double (nullable = true)
 |-- alcohol: double (nullable = true)
 |-- quality: integer (nullable = true)



#### Pretty slick!

#### Now printing just the columns as a list of strings.

In [12]:
print(wine_quality.columns)

['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']


#### When one needs to know the data type of a single column 

In [13]:
wine_quality.schema['fixed acidity'].dataType

DoubleType

#### Let's try getting some standard stats of the data we have

In [14]:
wine_quality.describe().toPandas()

Unnamed: 0,summary,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
1,mean,8.319637273295838,0.5278205128205131,0.2709756097560964,2.538805503439652,0.0874665415884925,15.87492182614134,46.46779237023139,0.9967466791744832,3.311113195747343,0.6581488430268921,10.422983114446502,5.636022514071295
2,stddev,1.7410963181276948,0.1790597041535352,0.1948011374053182,1.40992805950728,0.04706530201009,10.46015696980971,32.89532447829907,0.0018873339538427,0.1543864649035427,0.1695069795901101,1.0656675818473935,0.8075694397347051
3,min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
4,max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


### PySpark describe() VS Pandas describe()

In [15]:
wine_quality_pandas_df.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


#### The Pandas describe() function additionally reports the quartiles.

#### PySpark's summary() function reports them as well.

In [16]:
wine_quality.summary().toPandas()

                                                                                

Unnamed: 0,summary,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
1,mean,8.319637273295838,0.5278205128205131,0.2709756097560964,2.538805503439652,0.0874665415884925,15.87492182614134,46.46779237023139,0.9967466791744832,3.311113195747343,0.6581488430268921,10.422983114446502,5.636022514071295
2,stddev,1.7410963181276948,0.1790597041535352,0.1948011374053182,1.40992805950728,0.04706530201009,10.46015696980971,32.89532447829907,0.0018873339538427,0.1543864649035427,0.1695069795901101,1.0656675818473935,0.8075694397347051
3,min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
4,25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
5,50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
6,75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.99784,3.4,0.73,11.1,6.0
7,max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


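#### This matches the Pandas describe() output. Notice the difference in the index: PySpark keeps the statistic names in a regular `summary` column, while Pandas uses them as the index.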

#### In case one needs to get the standard stats of a single column

In [17]:
wine_quality.describe(['fixed acidity']).toPandas()

Unnamed: 0,summary,fixed acidity
0,count,1599.0
1,mean,8.319637273295838
2,stddev,1.7410963181276948
3,min,4.6
4,max,15.9


#### The same with summary(), combined with select(), which is covered later in this notebook.

In [18]:
wine_quality.select("fixed acidity").summary().toPandas()

Unnamed: 0,summary,fixed acidity
0,count,1599.0
1,mean,8.319637273295838
2,stddev,1.7410963181276948
3,min,4.6
4,25%,7.1
5,50%,7.9
6,75%,9.2
7,max,15.9


#### The same for multiple columns

In [19]:
wine_quality.select("fixed acidity","chlorides").summary().toPandas()

Unnamed: 0,summary,fixed acidity,chlorides
0,count,1599.0,1599.0
1,mean,8.319637273295838,0.0874665415884925
2,stddev,1.7410963181276948,0.04706530201009
3,min,4.6,0.012
4,25%,7.1,0.07
5,50%,7.9,0.079
6,75%,9.2,0.09
7,max,15.9,0.611


### Schema Modifications 

PySpark may not infer exactly the data type you want for each column.

In that case, the data types can be specified when reading the file.

In [20]:
from pyspark.sql.types import StructField,StringType,IntegerType,StructType,DateType, DoubleType

In [21]:
data_schema = [StructField("fixed acidity", DoubleType(), True),
               StructField("citric acid", DoubleType(), True),
               StructField("chlorides", DoubleType(), True),
               StructField("pH", DoubleType(), True),
               StructField("sulphates", DoubleType(), True),
               StructField("density", DoubleType(), True)]

In [22]:
final_struc = StructType(fields=data_schema)

In [23]:
# header=True skips the header line instead of reading it as a data row
wine_quality_partial = spark.read.csv(path+'/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv',
                              schema=final_struc, header=True)

In [24]:
wine_quality_partial.printSchema()

root
 |-- fixed acidity: double (nullable = true)
 |-- citric acid: double (nullable = true)
 |-- chlorides: double (nullable = true)
 |-- pH: double (nullable = true)
 |-- sulphates: double (nullable = true)
 |-- density: double (nullable = true)



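#### Note how the resulting DataFrame contains only the fields defined in the schema. Be aware that, by default, Spark's CSV reader applies a user-supplied schema by position rather than by header name, so a partial schema like this one attaches its names to the first six columns of the file.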

# Writing a DataFrame

In [25]:
wine_quality.write.mode("overwrite").csv('wine_quality.csv')

##### Note that 'wine_quality.csv' is created as a folder, not a single file. Inside it, Spark writes one auto-named part-* file per partition, so only the folder name is meaningful.


In [26]:
wine_quality.toPandas().to_csv('wine_quality_pandas.csv')

### Saving file in Partitions by Categorical Column

In [27]:
wine_quality.write.mode("overwrite").partitionBy("quality").csv('partitioned_by_quality_csv/')

# Select functionality on a DataFrame

In [28]:
# A namespaced import avoids shadowing Python built-ins such as sum, min and max
from pyspark.sql import functions as F

In [29]:
wine_quality.select(['fixed acidity','sulphates','pH']).show(5)

+-------------+---------+----+
|fixed acidity|sulphates|  pH|
+-------------+---------+----+
|          7.4|     0.56|3.51|
|          7.8|     0.68| 3.2|
|          7.8|     0.65|3.26|
|         11.2|     0.58|3.16|
|          7.4|     0.56|3.51|
+-------------+---------+----+
only showing top 5 rows



### OrderBy

In [30]:
wine_quality.select(['fixed acidity','sulphates','pH', 'quality']).orderBy("quality").show(5)

+-------------+---------+----+-------+
|fixed acidity|sulphates|  pH|quality|
+-------------+---------+----+-------+
|         11.6|     0.57|3.25|      3|
|          7.6|      0.4| 3.5|      3|
|         10.4|     0.63|3.16|      3|
|          7.4|     0.54|3.63|      3|
|         10.4|     0.86|3.38|      3|
+-------------+---------+----+-------+
only showing top 5 rows



### Descending Order

In [31]:
wine_quality.select(['fixed acidity','sulphates','pH', 'quality']).orderBy(wine_quality["quality"].desc()).show(5)

+-------------+---------+----+-------+
|fixed acidity|sulphates|  pH|quality|
+-------------+---------+----+-------+
|          7.9|     0.86|3.35|      8|
|          9.4|     0.92|3.15|      8|
|         10.3|     0.82|3.23|      8|
|          5.6|     0.82|3.56|      8|
|         12.6|     0.82|2.88|      8|
+-------------+---------+----+-------+
only showing top 5 rows



In [32]:
#isin
wine_quality[wine_quality.quality.isin(5, 6)].limit(4).toPandas()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6


#### Notice how in PySpark the isin() function accepts the values directly as arguments; passing a single list works as well.

# Slicing a DataFrame

In [33]:

# Starting
print('Starting row count:',wine_quality.count())
print('Starting column count:',len(wine_quality.columns))

# Slice rows
df2 = wine_quality.limit(100)
print('Sliced row count:',df2.count())

# Slice columns
cols_list = wine_quality.columns[0:3]
df3 = wine_quality.select(cols_list)
print('Sliced column count:',len(df3.columns))

Starting row count: 1599
Starting column count: 12
Sliced row count: 100
Sliced column count: 3


# Filtering a DataFrame

#### Using where construct

In [34]:
wine_quality.select("pH","quality").where(wine_quality.pH.startswith("3")) \
                                  .where(wine_quality.pH.endswith("1")).limit(4).toPandas()

Unnamed: 0,pH,quality
0,3.51,5
1,3.51,5
2,3.51,5
3,3.11,5


#### Filtering a DataFrame with SQL 'Like' Operation

In [35]:
wine_quality.select('fixed acidity','sulphates','pH', 'quality').where(wine_quality.pH.like("%5%")).show(10, False)

+-------------+---------+----+-------+
|fixed acidity|sulphates|pH  |quality|
+-------------+---------+----+-------+
|7.4          |0.56     |3.51|5      |
|7.4          |0.56     |3.51|5      |
|7.4          |0.56     |3.51|5      |
|7.5          |0.8      |3.35|5      |
|7.5          |0.8      |3.35|5      |
|5.6          |0.52     |3.58|5      |
|7.6          |0.65     |3.52|5      |
|6.7          |0.54     |3.35|5      |
|6.9          |0.52     |3.45|6      |
|5.7          |0.48     |3.5 |4      |
+-------------+---------+----+-------+
only showing top 10 rows



#### Using filter construct

Spark DataFrames are built on top of the Spark SQL engine, so if you already know SQL, you can use SQL expressions to perform similar operations in Spark.

In [36]:
wine_quality.filter("pH>3").limit(4).toPandas()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6


# Multiple filters

In [37]:
wine_quality.filter("pH>3 and sulphates<0.6").limit(4).toPandas()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
2,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
3,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5


In [38]:
wine_quality.filter("pH>3 or sulphates<0.6").limit(4).toPandas()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6


In [39]:
wine_quality.filter("pH>3 and sulphates<0.6 and density like '%0.997%'").limit(4).toPandas()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
2,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5
3,7.4,0.59,0.08,4.4,0.086,6.0,29.0,0.9974,3.38,0.5,9.0,4


In [40]:
wine_quality.filter("pH>3 and sulphates<0.6 and density not like '%0.997%'").limit(4).toPandas()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
1,7.9,0.6,0.06,1.6,0.069,15.0,59.0,0.9964,3.3,0.46,9.4,5
2,7.3,0.65,0.0,1.2,0.065,15.0,21.0,0.9946,3.39,0.47,10.0,7
3,7.8,0.58,0.02,2.0,0.073,9.0,18.0,0.9968,3.36,0.57,9.5,7


# Using DataFrame style filters

In [41]:
## or condition
wine_quality.filter( (wine_quality.pH  == 3.0) | (wine_quality.quality  == 5) ).limit(5).toPandas()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
4,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5


In [42]:
## and condition
wine_quality.filter( (wine_quality.pH  == 3.0) & (wine_quality.quality  == 5) ).limit(5).toPandas()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.5,0.705,0.24,1.8,0.36,15.0,63.0,0.9964,3.0,1.59,9.5,5
1,8.9,0.635,0.37,1.7,0.263,5.0,62.0,0.9971,3.0,1.09,9.3,5
2,8.7,0.78,0.51,1.7,0.415,12.0,66.0,0.99623,3.0,1.17,9.2,5
3,8.7,0.78,0.51,1.7,0.415,12.0,66.0,0.99623,3.0,1.17,9.2,5


In [43]:
## and condition combined with not-equal-to
wine_quality.filter( (wine_quality.pH  == 3.0) & (wine_quality.quality  != 5) ).limit(5).toPandas()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,11.9,0.37,0.69,2.3,0.078,12.0,24.0,0.9958,3.0,0.65,12.8,6
1,12.7,0.59,0.45,2.3,0.082,11.0,22.0,1.0,3.0,0.7,9.3,6


# String based filters

In [44]:
wine_quality.filter(wine_quality.pH.startswith("2")).limit(5).toPandas()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,8.6,0.49,0.28,1.9,0.11,20.0,136.0,0.9972,2.93,1.95,9.9,6
1,8.6,0.49,0.28,1.9,0.11,20.0,136.0,0.9972,2.93,1.95,9.9,6
2,8.6,0.49,0.29,2.0,0.11,19.0,133.0,0.9972,2.93,1.98,9.8,5
3,9.2,0.52,1.0,3.4,0.61,32.0,69.0,0.9996,2.74,2.0,9.4,4
4,12.5,0.46,0.63,2.0,0.071,6.0,15.0,0.9988,2.99,0.87,10.2,5


In [45]:
wine_quality.filter(wine_quality.pH.endswith("9")).limit(5).toPandas()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.3,0.65,0.0,1.2,0.065,15.0,21.0,0.9946,3.39,0.47,10.0,7
1,8.9,0.22,0.48,1.8,0.077,29.0,60.0,0.9968,3.39,0.53,9.4,6
2,4.6,0.52,0.15,2.1,0.054,8.0,65.0,0.9934,3.9,0.56,13.1,4
3,6.6,0.5,0.04,2.1,0.068,6.0,14.0,0.9955,3.39,0.64,9.4,6
4,7.0,0.735,0.05,2.0,0.081,13.0,54.0,0.9966,3.39,0.57,9.8,5


In [46]:
wine_quality.filter(wine_quality.pH.contains("4")).limit(5).toPandas()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.9,0.32,0.51,1.8,0.341,17.0,56.0,0.9969,3.04,1.08,9.2,6
1,6.9,0.4,0.14,2.4,0.085,21.0,40.0,0.9968,3.43,0.63,9.7,6
2,6.3,0.39,0.16,1.4,0.08,11.0,23.0,0.9955,3.34,0.56,9.3,5
3,7.1,0.71,0.0,1.9,0.08,14.0,35.0,0.9972,3.47,0.55,9.4,5
4,6.9,0.685,0.0,2.5,0.105,22.0,37.0,0.9966,3.46,0.57,10.6,6


# DataFrame based null filters

In [47]:
## null check
wine_quality.filter(wine_quality['fixed acidity'].isNull() | wine_quality.alcohol.isNull()).limit(4).toPandas()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality


In [48]:
## Not null check: both columns must be non-null (the complement of the null check above)
wine_quality.filter(wine_quality['fixed acidity'].isNotNull() & wine_quality.alcohol.isNotNull()).limit(4).toPandas()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6


# SQL based null filters

In [49]:
wine_quality.filter("`fixed acidity` is NULL").limit(4).toPandas()  # backticks quote a column name; single quotes would make a string literal

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality


In [50]:
wine_quality.filter("`fixed acidity` is not NULL").limit(4).toPandas()  # backticks quote a column name; single quotes would make a string literal

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6


#### Filter and select combination

In [51]:
wine_quality.filter("pH>4").select(['fixed acidity','sulphates','pH', 'quality']).limit(4).toPandas()

Unnamed: 0,fixed acidity,sulphates,pH,quality
0,5.4,0.59,4.01,6
1,5.0,0.59,4.01,6


#### Filter, select and orderby combination

In [52]:
wine_quality.select(['fixed acidity','sulphates','pH', 'quality']).filter("pH>3").orderBy(wine_quality["quality"].desc()).limit(4).toPandas()

Unnamed: 0,fixed acidity,sulphates,pH,quality
0,7.9,0.86,3.35,8
1,11.3,0.69,3.22,8
2,10.3,0.82,3.23,8
3,5.6,0.82,3.56,8


# Collecting Results as Objects

The collect() function on a DataFrame returns a Python list of Row objects, bringing every row back to the driver, so it is best reserved for results that fit in memory. Each Row carries its field names along with its values.

A Row represents one record of a DataFrame. Its fields can be accessed:

* like attributes (row.key)
* like dictionary values (row[key])

The expression `key in row` searches through the row's field names.

Row can be used to create a row object by using named arguments. A named argument cannot be omitted to indicate that a value is None or missing; it must be explicitly set to None in that case.



Row can also be used to create another Row-like class, which can then be used to create Row objects, for example:

In [53]:
from pyspark.sql import Row

Person = Row("name", "age")
Person

<Row('name', 'age')>

In [54]:
'name' in Person

True

In [55]:
Person("Alice", 11)

Row(name='Alice', age=11)

Reference: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Row.html#pyspark.sql.Row

#### Back to collect()

In [56]:
result = wine_quality.select(['fixed acidity','sulphates','pH', 'quality']).filter("pH>3").orderBy(wine_quality["pH"].desc()).collect()

In [57]:
result[:5]

[Row(fixed acidity=5.4, sulphates=0.59, pH=4.01, quality=6),
 Row(fixed acidity=5.0, sulphates=0.59, pH=4.01, quality=6),
 Row(fixed acidity=4.6, sulphates=0.56, pH=3.9, quality=4),
 Row(fixed acidity=5.1, sulphates=0.62, pH=3.9, quality=6),
 Row(fixed acidity=4.7, sulphates=0.6, pH=3.85, quality=6)]

In [58]:
type(result[0])

pyspark.sql.types.Row

#### Accessing a row as a dictionary

In [59]:
result[0].asDict()

{'fixed acidity': 5.4, 'sulphates': 0.59, 'pH': 4.01, 'quality': 6}

In [60]:
for item in result[0]:
    print(item)

5.4
0.59
4.01
6


#### List of rows to DataFrame

In [61]:
backto_df = spark.createDataFrame(result)
backto_df.limit(5).toPandas()

Unnamed: 0,fixed acidity,sulphates,pH,quality
0,5.4,0.59,4.01,6
1,5.0,0.59,4.01,6
2,4.6,0.56,3.9,4
3,5.1,0.62,3.9,6
4,4.7,0.6,3.85,6


## Thanks!

### More in the next notebook.

Reference
* https://sparkbyexamples.com/pyspark/pyspark-where-filter/
* https://sparkbyexamples.com/pyspark/pyspark-filter-rows-with-null-values/
* https://stackoverflow.com/questions/41000273/spark-difference-between-collect-take-and-show-outputs-after-conversion
* https://www.kdnuggets.com/2020/04/benefits-apache-spark-pyspark.html