# **Introduction to PySpark**
This is an introduction to Spark DataFrames and MLlib in Python.




In [16]:
# Clone the GitHub repository
!git clone https://github.com/ssalloum/SDSC-Spark4.git

Cloning into 'SDSC-Spark4'...
remote: Enumerating objects: 28, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 28 (delta 4), reused 22 (delta 4), pack-reused 0
Receiving objects: 100% (28/28), 7.24 MiB | 12.27 MiB/s, done.
Resolving deltas: 100% (4/4), done.


In [18]:
!ls /content/SDSC-Spark4/

data  README.md


# PySpark
PySpark is an interface for Apache Spark that lets you write Spark applications using Python APIs. PySpark supports most of Spark’s features, such as Spark SQL, Streaming, MLlib (Machine Learning), and Spark Core. For detailed information on these components and APIs, please refer to the [official PySpark Documentation](https://spark.apache.org/docs/latest/api/python/index.html).

In [19]:
#You don't need this on Databricks
!pip install pyspark



## Spark SQL
* [Spark SQL](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/index.html) is a Spark module for structured data processing.
* Spark SQL integrates relational processing (using SQL) and functional programming (using the DataFrame API).



## SparkSession
* An essential class in Spark SQL is [pyspark.sql.SparkSession](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.html), which represents a unified entry point to programming in Spark.

* In Spark-shell or Databricks notebooks, a SparkSession is created for you, stored in a variable called `spark`.

In [20]:
# You don't need this on Databricks or spark-shell
from pyspark.sql import SparkSession

# Create a Spark Session
spark = SparkSession.builder\
        .master("local[*]")\
        .appName("Intro to PySpark")\
        .getOrCreate()

In [21]:
# Check Spark Session Information
spark

## SparkContext
* [SparkContext](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.html) was the main entry point in earlier versions of Spark.
* For working with low-level APIs, [Resilient Distributed Datasets (RDDs)](https://spark.apache.org/docs/latest/rdd-programming-guide.html), and for backward compatibility, you can access SparkContext via SparkSession.

In [22]:
# get SparkContext
sc = spark.sparkContext
sc

# Spark DataFrames


*   [pyspark.sql.DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html) represents a distributed collection of data grouped into named columns.

* [pyspark.sql.Column](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.html): represents a column in a DataFrame.
* [pyspark.sql.Row](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Row.html): represents a row in a DataFrame.
*   [pyspark.sql.functions](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html): common functions to work with DataFrames.

* A DataFrame can be constructed from a variety of [supported data sources](https://spark.apache.org/docs/latest/sql-data-sources.html).


## DataFrameReader
* [DataFrameReader](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.html): loading DataFrames from external sources.
* You cannot create an instance of DataFrameReader.
* You can access a DataFrameReader through a SparkSession instance using the [read](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.read.html) property to read data from a static data source (streaming data sources have a different property: readStream).
* DataFrameReader provides several public methods that can be used with all supported data sources, and may take different arguments for each source: [format](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.format.html), [option](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.option.html), [schema](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.schema.html), and [load](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.load.html).

* If you don’t specify the format, then the default is
Parquet or whatever is set in 'spark.sql.sources.default'.

* DataFrameReader also has methods to directly load data from specific formats/sources such as [parquet](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.parquet.html), [csv](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.csv.html), [json](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.json.html).

### Creating DataFrames From CSV files
* You can read data from a [CSV file](https://spark.apache.org/docs/latest/sql-data-sources-csv.html) into a DataFrame.
* The [pyspark.sql.SparkSession.read](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.read.html) property returns a DataFrameReader, whose csv method reads the CSV file and returns a DataFrame of rows and named columns with the types dictated by the schema. We will use CSV files from the flights dataset:

In [23]:
dataPath = "/content/SDSC-Spark4/data/2015-summary.csv"

In [24]:
#try with inferSchema
flights_df = spark.read\
  .option("inferSchema", "true")\
  .option("header", "true")\
  .csv(dataPath)

In [25]:
flights_df.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+
only showing top 5 rows



In [26]:
flights_df.count()

256

## DataFrame Schema
* A schema in Spark defines the column names and associated data types for a DataFrame. In addition to inferring the schema from the source data, Spark allows you to define a schema programmatically.

* A schema is a [StructType](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.StructType.html) made up of a number of fields. Each field is a [StructField](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.StructField.html) that has a name, a type, and a Boolean flag that specifies whether the column can contain missing or null values. Users can also optionally attach metadata to a column.
*  Supported data types are defined in [pyspark.sql.types](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/data_types.html).

* [Spark SQL Guide: Data Types](https://spark.apache.org/docs/latest/sql-ref-datatypes.html)

In [27]:
# Define a schema programmatically
from pyspark.sql.types import *

myFlightSchema = StructType([
  StructField("dest", StringType(), True),
  StructField("origin", StringType(), True),
  StructField("flights", LongType(), False)
])

myFlightSchema

StructType([StructField('dest', StringType(), True), StructField('origin', StringType(), True), StructField('flights', LongType(), False)])

In [28]:
flights_df_2015 = spark.read.schema(myFlightSchema).option("header", "true").csv(dataPath)

In [29]:
flights_df_2015.take(5)

[Row(dest='United States', origin='Romania', flights=15),
 Row(dest='United States', origin='Croatia', flights=1),
 Row(dest='United States', origin='Ireland', flights=344),
 Row(dest='Egypt', origin='United States', flights=15),
 Row(dest='United States', origin='India', flights=62)]

## Columns
* [DataFrame.columns](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.columns.html): get all columns names in a DataFrame.



In [30]:
flights_df_2015.columns

['dest', 'origin', 'flights']

* You can refer to columns in several ways and mix them freely in the same call: [col()](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.col.html), [column()](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.column.html), [expr()](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.expr.html), or attribute access on the DataFrame.

In [31]:
from pyspark.sql.functions import expr, col, column

flights_df_2015.select(
  col("dest"),
  column("dest"),
  expr("lower(dest)"),
  flights_df_2015.dest)\
.show(5)

+-------------+-------------+-------------+-------------+
|         dest|         dest|  lower(dest)|         dest|
+-------------+-------------+-------------+-------------+
|United States|United States|united states|United States|
|United States|United States|united states|United States|
|United States|United States|united states|United States|
|        Egypt|        Egypt|        egypt|        Egypt|
|United States|United States|united states|United States|
+-------------+-------------+-------------+-------------+
only showing top 5 rows



* In Spark DataFrames, columns are objects of type [pyspark.sql.Column](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.html), which provides commonly used methods on columns.

In [32]:
from pyspark.sql import Column

flights_df_2015.orderBy(flights_df_2015.flights.desc()).show(5)

+-------------+-------------+-------+
|         dest|       origin|flights|
+-------------+-------------+-------+
|United States|United States| 370002|
|United States|       Canada|   8483|
|       Canada|United States|   8399|
|United States|       Mexico|   7187|
|       Mexico|United States|   7140|
+-------------+-------------+-------+
only showing top 5 rows



## Rows
* A row in Spark is an object of [pyspark.sql.Row](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Row.html), containing one or more columns.


In [33]:
#get the first Row
flights_df_2015.first()

Row(dest='United States', origin='Romania', flights=15)

In [34]:
#get a list of the first "num" of Rows

In [35]:
flights_df_2015.take(5)

[Row(dest='United States', origin='Romania', flights=15),
 Row(dest='United States', origin='Croatia', flights=1),
 Row(dest='United States', origin='Ireland', flights=344),
 Row(dest='Egypt', origin='United States', flights=15),
 Row(dest='United States', origin='India', flights=62)]

* Because Row is an object in Spark and an ordered collection of fields, you can instantiate a Row in each of Spark’s supported languages and access its fields by an index starting at 0:

In [36]:
from pyspark.sql import Row

blog_row = Row(6, "Reynold", "Xin", "https://tinyurl.6", 255568, "3/2/2015",
["twitter", "LinkedIn"])

# access using index for individual items
blog_row[1]

'Reynold'

In [37]:
# spark.range(5) creates a DataFrame with a single `id` column; each of its rows is a Row object
spark.range(5).show()

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+



Row objects can be used to create DataFrames if you need them for quick interactivity
and exploration:

In [38]:
rows = [Row("Matei Zaharia", "CA"), Row("Reynold Xin", "CA")]
authors_df = spark.createDataFrame(rows, ["Authors", "State"])
authors_df.show()
authors_df.printSchema()

+-------------+-----+
|      Authors|State|
+-------------+-----+
|Matei Zaharia|   CA|
|  Reynold Xin|   CA|
+-------------+-----+

root
 |-- Authors: string (nullable = true)
 |-- State: string (nullable = true)



## Parquet Data Source
* [Parquet](https://parquet.apache.org/) is an open-source columnar format that offers many I/O
optimizations (such as compression, which saves storage space and allows for quick
access to data columns).

* [Parquet files](https://github.com/apache/parquet-format#file-format) are stored in a directory structure that contains the data files, metadata,
a number of compressed files, and some status files.

* Spark SQL provides support for [reading and writing Parquet files](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html).

* Parquet is the default data
source in Spark.

* Unless you are reading from a streaming data source, there’s no need to supply a
schema when reading from a Parquet file, because Parquet saves it as part of its metadata.

* You can read Parquet data directly with the [parquet](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.parquet.html) method.

In [41]:
parquetPath = "/content/SDSC-Spark4/data/2010-summary.parquet"

In [42]:
df2 = spark.read.parquet(parquetPath)

In [43]:
df2.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



## DataFrameWriter
* [DataFrameWriter](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.html) is an interface used to write a DataFrame to external storage systems.

* Unlike with DataFrameReader, you access its instance not from a SparkSession but from the DataFrame you wish to save.

* To get an instance handle, use the [DataFrame.write](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.write.html) property for static data sources (DataFrame.writeStream for streaming data sources).

* It also provides several public methods: [format](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.format.html), [option](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.option.html), [bucketBy](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.bucketBy.html), [save](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.save.html), and [saveAsTable](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.saveAsTable.html).
* DataFrameWriter also has methods to directly write data to specific formats/sources such as [parquet](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.parquet.html), [csv](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.csv.html), [json](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.json.html).

In [44]:
df2.write.parquet(path="/tmp/data/df_parquet1",
  mode="overwrite",
  compression="snappy")

# DataFrame Operations: Transformations and Actions

* Spark operations on DataFrames can be classified into two types: transformations and actions.
* All transformations are evaluated lazily. Their results are not computed immediately,
but they are recorded as a lineage. This allows Spark to optimize the execution
plan.
* Distributed computation occurs only when you invoke an action on a DataFrame, e.g., `show()`, `take()`, `count()`, or `collect()`.

### select
The easiest way to work with columns is just to use the [select](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.select.html) method and pass in the column names as strings:

In [46]:
flights_df.select("DEST_COUNTRY_NAME").show(2)

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
+-----------------+
only showing top 2 rows



In [47]:
flights_df.select("DEST_COUNTRY_NAME","ORIGIN_COUNTRY_NAME").show(2)

+-----------------+-------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|
+-----------------+-------------------+
|    United States|            Romania|
|    United States|            Croatia|
+-----------------+-------------------+
only showing top 2 rows



### selectExpr
Because select followed by a series of expr is such a common pattern, Spark has a shorthand for doing this efficiently: [selectExpr](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.selectExpr.html).

In [48]:
flights_df.selectExpr("DEST_COUNTRY_NAME as Destination", "Origin_COUNTRY_NAME as Origin").show(2)

+-------------+-------+
|  Destination| Origin|
+-------------+-------+
|United States|Romania|
|United States|Croatia|
+-------------+-------+
only showing top 2 rows



### Adding columns
To add a new column to your DataFrame, you can use the [withColumn](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumn.html) method:

In [49]:
flights_df.withColumn("withinCountry", expr("ORIGIN_COUNTRY_NAME == DEST_COUNTRY_NAME"))\
.take(5)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15, withinCountry=False),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1, withinCountry=False),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=344, withinCountry=False),
 Row(DEST_COUNTRY_NAME='Egypt', ORIGIN_COUNTRY_NAME='United States', count=15, withinCountry=False),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='India', count=62, withinCountry=False)]

### Renaming columns
You can rename a column  with the [withColumnRenamed](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumnRenamed.html) method:

In [50]:
flights_df.withColumnRenamed("DEST_COUNTRY_NAME", "Destination").show(2)

+-------------+-------------------+-----+
|  Destination|ORIGIN_COUNTRY_NAME|count|
+-------------+-------------------+-----+
|United States|            Romania|   15|
|United States|            Croatia|    1|
+-------------+-------------------+-----+
only showing top 2 rows



In [51]:
# renaming multiple columns
flights_df.withColumnRenamed("DEST_COUNTRY_NAME", "dest")\
  .withColumnRenamed("ORIGIN_COUNTRY_NAME", "origin").show(2)

+-------------+-------+-----+
|         dest| origin|count|
+-------------+-------+-----+
|United States|Romania|   15|
|United States|Croatia|    1|
+-------------+-------+-----+
only showing top 2 rows



### Removing columns
[drop](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.drop.html) is a dedicated method to remove columns from a DataFrame.

In [52]:
flights_df.drop("count").columns

['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME']

In [53]:
flights_df.drop("ORIGIN_COUNTRY_NAME","count").show(5)

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
|    United States|
|            Egypt|
|    United States|
+-----------------+
only showing top 5 rows



### Filtering Rows
There are two methods to perform filtering operations: you can use [where](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.where.html) or [filter](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.filter.html)
and they both will perform the same operation and accept the same argument types when used
with DataFrames. To filter rows, you need an expression that evaluates to true or false.

In [54]:
flights_df.filter(col("count") < 2).show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



In [55]:
flights_df.where("count < 2").show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



You may be tempted to combine multiple filters into a single expression, but this is rarely necessary: Spark applies all filters at the same time, regardless of the order in which you write them.
To specify multiple filters, chain them sequentially and let Spark handle the rest.

In [56]:
flights_df.where(col("count") < 2).where(col("ORIGIN_COUNTRY_NAME") != "Singapore")\
.show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



### Getting Unique Rows
To extract the unique or distinct values in a DataFrame, you can use the [distinct](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.distinct.html?highlight=distinct#pyspark.sql.DataFrame.distinct) method on a
DataFrame.

In [57]:
flights_df.select("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME").distinct().count()

256

In [58]:
flights_df.select("ORIGIN_COUNTRY_NAME").distinct().count()

125

### Random Samples
To sample some random records from your DataFrame, use the [sample](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sample.html) method.

In [59]:
flights_df.sample(withReplacement = False,
                      fraction= 0.5,
                      seed = 5).count()

138

### Random Splits
You may need to break up your DataFrame into random splits to use with machine learning algorithms to create training,
validation, and test sets. You can do this using the [randomSplit](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.randomSplit.html) method.

In [60]:
flightDataSplits = flights_df.randomSplit([0.25, 0.75], seed = 5)

flightDataSplits

[DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: int],
 DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: int]]

In [61]:
flightDataSplits[0].count() < flightDataSplits[1].count()

True

### Sorting Rows

There are two equivalent operations to sort the values in a DataFrame: [sort](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sort.html) and [orderBy](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.orderBy.html).

In [62]:
flights_df.sort("count", ascending=False).show(5)

+-----------------+-------------------+------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+-----------------+-------------------+------+
|    United States|      United States|370002|
|    United States|             Canada|  8483|
|           Canada|      United States|  8399|
|    United States|             Mexico|  7187|
|           Mexico|      United States|  7140|
+-----------------+-------------------+------+
only showing top 5 rows



In [63]:
flights_df.orderBy("count", "DEST_COUNTRY_NAME").show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|     Burkina Faso|      United States|    1|
|    Cote d'Ivoire|      United States|    1|
|           Cyprus|      United States|    1|
|         Djibouti|      United States|    1|
|        Indonesia|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



In [64]:
flights_df.orderBy(col("count"), col("DEST_COUNTRY_NAME"), ascending=False).show(5)

+-----------------+-------------------+------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+-----------------+-------------------+------+
|    United States|      United States|370002|
|    United States|             Canada|  8483|
|           Canada|      United States|  8399|
|    United States|             Mexico|  7187|
|           Mexico|      United States|  7140|
+-----------------+-------------------+------+
only showing top 5 rows



Let’s find the top five destination countries in the data.

In [65]:
from pyspark.sql.functions import desc

flights_df\
  .groupBy("DEST_COUNTRY_NAME")\
  .sum("count")\
  .withColumnRenamed("sum(count)", "destination_total")\
  .sort(desc("destination_total"))\
  .limit(5)\
  .show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



You can also sort within each partition using the [sortWithinPartitions](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sortWithinPartitions.html) method.

# Spark MLlib

* [pyspark.ml](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html): DataFrame-based Machine Learning library.

* It provides APIs for exploratory data analysis, feature engineering, model training, model evaluation, tuning, pipelines, statistics, and linear algebra.

* [Spark Machine Learning Library (MLlib) Guide](https://spark.apache.org/docs/latest/ml-guide.html).

* [pyspark.ml.linalg](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html#vector-and-matrix): data types for representing vectors and matrices.
* [pyspark.ml.stat](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html#statistics): Summarizer, Correlation, and ChiSquareTest.

* [pyspark.ml.Transformer](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.Transformer.html): The abstract class for transformers.


* [pyspark.ml.Estimator](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.Estimator.html): The abstract class for estimators.

* [pyspark.ml.feature](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html#feature): transformers and estimators for feature extraction, transformation, and selection.



In this section, we only use small training data to explore how to use the MLlib API to train machine learning models.

## Classification
* `pyspark.ml.classification`: [PySpark API: Classification](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html#classification)
* [Spark Programming Guide: Classification](https://spark.apache.org/docs/latest/ml-classification-regression.html#classification)

## Dataset


In [66]:
bInputPath = "/content/SDSC-Spark4/data/binary_class"

In [67]:
bInput = spark.read.parquet(bInputPath).selectExpr("features", "cast(label as double) as label")

In [68]:
bInput.printSchema()

root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)



In [69]:
bInput.show(2)

+-------------+-----+
|     features|label|
+-------------+-----+
|[0.0,0.0,0.0]|  1.0|
|[0.0,0.0,0.0]|  0.0|
+-------------+-----+
only showing top 2 rows



## Logistic Regression
* [LogisticRegression](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.LogisticRegression.html) predicts a categorical outcome; in this section it is used to predict a binary outcome.

* [LogisticRegressionModel](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.LogisticRegressionModel.html): Model fitted by LogisticRegression.

In [70]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression()

### Parameters

In [71]:
print(lr.explainParams())

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The bounds vector size must beequal wi

### Fit a model

You can train the model with the [LogisticRegression.fit()](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.LogisticRegression.html#pyspark.ml.classification.LogisticRegression.fit) method and get a [LogisticRegressionModel](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.LogisticRegressionModel.html).

In [72]:
lrModel = lr.fit(bInput)

In [73]:
lrModel.coefficients

DenseVector([18.7224, -0.5694, 9.3612])

In [74]:
lrModel.intercept

-28.04329511868945

### Training Summary

[BinaryLogisticRegressionTrainingSummary](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.BinaryLogisticRegressionTrainingSummary.html)  provides a summary for a binary LogisticRegressionModel.

In [75]:
lrSummary = lrModel.summary
lrSummary

<pyspark.ml.classification.BinaryLogisticRegressionTrainingSummary at 0x7acc41a73430>

In [76]:
lrSummary.areaUnderROC

0.6666666666666666

In [77]:
lrSummary.objectiveHistory

[0.6730116670092563,
 0.30533476678669746,
 0.19572951692227342,
 0.08238560717506734,
 0.039904390712412516,
 0.019187605729977825,
 0.009480513129879626,
 0.004700793975398925,
 0.002342824005088814,
 0.0011692212872630964,
 0.0005841333526453686,
 0.00029193843681446,
 0.00014593757317782447,
 7.295887614374282e-05,
 3.647309882223227e-05,
 1.822801708342421e-05,
 9.095755464927005e-06,
 4.50530629284565e-06,
 2.1743484095163617e-06,
 1.0422594942126269e-06,
 5.280808738948462e-07,
 2.628531186444535e-07,
 1.3166032239693124e-07,
 6.578498712560823e-08,
 3.290121373800775e-08,
 1.6448921648781483e-08,
 8.224786126080745e-09]

## Decision Tree Classifier

* A decision tree is a set of if-then-else rules learned from the training data.
* [Spark Programming Guide: Decision Trees](https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-trees)

* [pyspark.ml.classification.DecisionTreeClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.DecisionTreeClassifier.html)

* [DecisionTreeClassificationModel](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.DecisionTreeClassificationModel.html)


In [78]:
from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier()

### Parameters

In [79]:
print(dt.explainParams())

cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featuresCol: features column name. (default: features)
impurity: Criterion used for information gain calculation (case-insensitive). Supported options: entropy, gini (default: gini)
labelCol: label column name. (default: label)
leafCol: Leaf indices column name. Predicted leaf index of each instance in each tree by preorder. (default: )
maxBins: Max number of bins for discretizing continuous features.  Must be 

### Fit a model

In [80]:
dtModel = dt.fit(bInput)

### Feature Importance
You can extract the feature importance scores from your model using [featureImportances](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.DecisionTreeClassificationModel.html#pyspark.ml.classification.DecisionTreeClassificationModel.featureImportances):

In [81]:
dtModel.featureImportances

SparseVector(3, {0: 1.0})
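The output above is a sparse vector of size 3 in which only index 0 holds a non-zero value. The following plain-Python sketch expands that representation into a dense list to make it easier to read; `sparse_to_dense` is a hypothetical helper written for illustration (on a real PySpark vector, `toArray()` serves the same purpose).

```python
def sparse_to_dense(size, values):
    """Expand a {index: value} mapping into a dense list of length `size`."""
    dense = [0.0] * size
    for index, value in values.items():
        dense[index] = value
    return dense


# Same size and entries as the SparseVector printed above.
importances = sparse_to_dense(3, {0: 1.0})
print(importances)  # [1.0, 0.0, 0.0]
```

Read this way, the fitted tree attributes all of its importance to the first feature and none to the other two.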

## Random Forest Classifier
* Random forests are ensembles of decision trees: they combine the predictions of many trees to reduce the risk of overfitting.
* [RandomForestClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html)
* [RandomForestClassificationModel](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassificationModel.html)

In [82]:
from pyspark.ml.classification import RandomForestClassifier

rfClassifier = RandomForestClassifier()

### Parameters

In [83]:
print(rfClassifier.explainParams())

bootstrap: Whether bootstrap samples are used when building trees. (default: True)
cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featureSubsetStrategy: The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the featur

* [numTrees](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html#pyspark.ml.classification.RandomForestClassifier.numTrees)

* [featureSubsetStrategy](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html#pyspark.ml.classification.RandomForestClassifier.featureSubsetStrategy)

* [subsamplingRate](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html#pyspark.ml.classification.RandomForestClassifier.subsamplingRate)

In [84]:
rfClassifier.getNumTrees()

20

In [85]:
rfClassifier.getFeatureSubsetStrategy()

'auto'

In [86]:
rfClassifier.getSubsamplingRate()

1.0
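The `'auto'` value returned above is resolved according to the rules quoted in the `featureSubsetStrategy` description. The sketch below restates those rules in plain Python. The function name is hypothetical, only the options visible in the (truncated) parameter text are handled, and rounding up to a whole number of features is an assumption of this sketch rather than something the output above confirms.

```python
import math


def features_per_split(strategy, num_features, num_trees, task="classification"):
    """Number of features considered at each tree node for a given strategy."""
    if strategy == "auto":
        if num_trees == 1:
            strategy = "all"
        else:
            strategy = "sqrt" if task == "classification" else "onethird"
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return math.ceil(math.sqrt(num_features))  # rounding up is an assumption
    if strategy == "onethird":
        return math.ceil(num_features / 3.0)       # rounding up is an assumption
    raise ValueError(f"strategy not covered by this sketch: {strategy}")


# 3 features (as in the feature vectors used here) and the default of 20 trees.
print(features_per_split("auto", num_features=3, num_trees=20))
print(features_per_split("auto", num_features=3, num_trees=20, task="regression"))
print(features_per_split("auto", num_features=3, num_trees=1))
```

Considering only a subset of features at each node decorrelates the trees, which is part of what makes the ensemble more robust than a single tree.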

### Fit a model

In [87]:
rfModel = rfClassifier.fit(bInput)

### Training Summary


In [88]:
rfSummary = rfModel.summary
rfSummary

<pyspark.ml.classification.BinaryRandomForestClassificationTrainingSummary at 0x7acc40099f90>

[BinaryRandomForestClassificationTrainingSummary](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.BinaryRandomForestClassificationTrainingSummary.html)

In [89]:
rfSummary.falsePositiveRateByLabel

[0.6666666666666666, 0.0]
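`falsePositiveRateByLabel` returns one rate per label: among the rows that do not belong to a label, the fraction that were predicted as that label. The sketch below computes it in plain Python. The function name and the five label/prediction pairs are hypothetical; they were chosen to produce the same rates as above and are not the notebook's data.

```python
def false_positive_rate_by_label(labels, predictions):
    """For each class c: rows predicted c but not c, divided by rows that are not c."""
    classes = sorted(set(labels) | set(predictions))
    rates = []
    for c in classes:
        false_positives = sum(1 for y, p in zip(labels, predictions)
                              if p == c and y != c)
        negatives = sum(1 for y in labels if y != c)
        rates.append(false_positives / negatives if negatives else 0.0)
    return rates


# Hypothetical labels and predictions (not taken from bInput).
labels = [1.0, 1.0, 1.0, 0.0, 0.0]
predictions = [0.0, 0.0, 1.0, 0.0, 0.0]

rates = false_positive_rate_by_label(labels, predictions)
print(rates)  # [0.6666666666666666, 0.0]
```

Keep in mind that a training summary is computed on the training data, so it says little about how the model generalizes.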

## Other classification algorithms in MLlib
* Gradient Boosted Trees: [GBTClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.GBTClassifier.html)
* Naive Bayes: [NaiveBayes](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.NaiveBayes.html)

# Regression

* `pyspark.ml.regression`: [PySpark API: Regression](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html#regression)
* [Spark Programming Guide: Regression](https://spark.apache.org/docs/latest/ml-classification-regression.html#regression)

## Dataset

In [90]:
regPath = "/content/SDSC-Spark4/data/regression"

In [91]:
regDF = spark.read.parquet(regPath)
regDF.printSchema()

root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)



In [92]:
regDF.show(2)

+-------------+-----+
|     features|label|
+-------------+-----+
|[0.0,0.0,0.0]|  2.0|
|[0.0,0.0,0.0]|  1.0|
+-------------+-----+
only showing top 2 rows



In [93]:
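# Save a copy to Google Drive (assumes Drive is mounted); overwrite so the cell can be re-run
regDF.write.mode("overwrite").parquet("/content/drive/MyDrive/sample_data/regression")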

## Linear Regression
* [LinearRegression](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegression.html) is used to predict a real number from a set of numeric features.
* [LinearRegressionModel](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegressionModel.html)

In [94]:
from pyspark.ml.regression import LinearRegression

linear = LinearRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
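`regParam` and `elasticNetParam` together define the regularization term added to the loss: `regParam * (elasticNetParam * L1 + (1 - elasticNetParam) / 2 * L2^2)`. The sketch below evaluates that term in plain Python for the values set above (0.3 and 0.8). The function name and the weight vector are hypothetical, and the sketch covers only the penalty, not the optimization itself.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: a mix of L1 and squared-L2 norms of the weights."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1
                        + (1.0 - elastic_net_param) / 2.0 * l2_squared)


weights = [1.0, -2.0, 0.0]  # hypothetical coefficients

print(elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.8))  # mixed penalty
print(elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.0))  # pure L2 (ridge)
print(elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=1.0))  # pure L1 (lasso)
```

With `elasticNetParam` at 0.8 the penalty is dominated by the L1 part, which tends to push some coefficients to exactly zero.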

### Parameters

In [95]:
print(linear.explainParams())

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0, current: 0.8)
epsilon: The shape parameter to control the amount of robustness. Must be > 1.0. Only valid when loss is huber (default: 1.35)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
loss: The loss function to be optimized. Supported options: squaredError, huber. (default: squaredError)
maxBlockSizeInMB: maximum memory in MB for stacking input data into blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. Default 0.0 represents choosing optimal value, depends on specific algorithm. Must be >= 0. (default: 0.0)
maxIter: max number of iterations (>= 0). (defaul

### Fit a model

In [96]:
linearModel = linear.fit(regDF)

### Training Summary
[LinearRegressionTrainingSummary](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegressionTrainingSummary.html)

In [97]:
linearSummary = linearModel.summary
linearSummary

<pyspark.ml.regression.LinearRegressionTrainingSummary at 0x7acc400a5fc0>

In [98]:
linearSummary.residuals.show()
print(linearSummary.totalIterations)
print(linearSummary.objectiveHistory)
print(linearSummary.rootMeanSquaredError)
print(linearSummary.r2)

+-------------------+
|          residuals|
+-------------------+
|  1.762342343964863|
| 0.7623423439648629|
|-0.2376576560351371|
|-0.2376576560351371|
| 0.8547088792080308|
+-------------------+

5
[0.5000000000000001, 0.4315295810362787, 0.3132335933881022, 0.31225692666554117, 0.309150608198303, 0.30915058933480255]
0.9518934791114696
-0.13262649446867214
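The last two numbers above are the root mean squared error and R². As a rough check, the sketch below recomputes the RMSE in plain Python from the five residuals printed above. It also defines R² in its usual form, `1 - SS_res / SS_tot`, and demonstrates it on hypothetical labels and predictions, since the notebook does not display the full label column. The function names are not part of the Spark API.

```python
import math

# Residuals copied from the residuals table above.
residuals = [
    1.762342343964863,
    0.7623423439648629,
    -0.2376576560351371,
    -0.2376576560351371,
    0.8547088792080308,
]


def rmse(residuals):
    """Root mean squared error: sqrt of the mean squared residual."""
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


print(rmse(residuals))  # close to the rootMeanSquaredError reported above

# Hypothetical example: a constant prediction that is far from the label mean.
print(r2([1.0, 2.0, 3.0], [3.0, 3.0, 3.0]))  # -1.5
```

Under this definition, a negative R² such as the one reported above means the model fits the training data worse than always predicting the mean label, which is plausible here given the strong regularization and the tiny dataset.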


## Decision Tree Regressor
* [DecisionTreeRegressor](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.DecisionTreeRegressor.html)
* [DecisionTreeRegressionModel](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.DecisionTreeRegressionModel.html)

In [99]:
from pyspark.ml.regression import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
print(dtr.explainParams())

cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featuresCol: features column name. (default: features)
impurity: Criterion used for information gain calculation (case-insensitive). Supported options: variance (default: variance)
labelCol: label column name. (default: label)
leafCol: Leaf indices column name. Predicted leaf index of each instance in each tree by preorder. (default: )
maxBins: Max number of bins for discretizing continuous features.  Must be >

In [100]:
dtrModel = dtr.fit(regDF)
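For regression trees the only supported impurity is `variance`, as the parameter list above shows. The sketch below illustrates, in plain Python, how a candidate split can be scored by the variance reduction it achieves. The function names and toy data are hypothetical and do not reproduce Spark's distributed implementation.

```python
def variance(labels):
    """Population variance of the labels in a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    mean = sum(labels) / n
    return sum((y - mean) ** 2 for y in labels) / n


def variance_reduction(values, labels, threshold):
    """Decrease in weighted variance obtained by the rule `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(labels) - weighted


# Hypothetical toy data: one numeric feature and a real-valued label.
values = [1.0, 2.0, 3.0, 4.0]
labels = [1.0, 1.0, 3.0, 3.0]

print(variance_reduction(values, labels, threshold=2.0))  # 1.0
print(variance_reduction(values, labels, threshold=1.0))
```

Each leaf of a regression tree then predicts the mean label of the training rows that reach it.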

## Random Forest Regressor
* [RandomForestRegressor](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.RandomForestRegressor.html)
* [RandomForestRegressionModel](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.RandomForestRegressionModel.html)

In [101]:
from pyspark.ml.regression import RandomForestRegressor
rfr = RandomForestRegressor()
print(rfr.explainParams())

bootstrap: Whether bootstrap samples are used when building trees. (default: True)
cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featureSubsetStrategy: The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the featur

In [102]:
# Use a separate name so the classifier's rfModel fitted earlier is not overwritten
rfrModel = rfr.fit(regDF)

## Other regression algorithms in MLlib
* Generalized Linear Regression: [GeneralizedLinearRegression](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.GeneralizedLinearRegression.html)
* Gradient-Boosted Tree Regressor: [GBTRegressor](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.GBTRegressor.html)
* Factorization Machines Regressor: [FMRegressor](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.FMRegressor.html)

# Clustering
* `pyspark.ml.clustering`: [PySpark API: Clustering](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html#clustering)
* [Spark Programming Guide: Clustering](https://spark.apache.org/docs/latest/ml-clustering.html)

## Dataset

In [103]:
retailDF = (spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/content/SDSC-Spark4/data/online-retail-dataset.csv")
  .where("Description IS NOT NULL"))

In [104]:
retailDF.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: integer (nullable = true)
 |-- Country: string (nullable = true)



In [105]:
from pyspark.ml.feature import VectorAssembler

va = VectorAssembler()\
  .setInputCols(["Quantity", "UnitPrice"])\
  .setOutputCol("features")

retail_clusteringDF = va.transform(retailDF).select("features")
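Conceptually, `VectorAssembler` concatenates the chosen input columns of each row into a single numeric vector. The plain-Python sketch below mimics that for one row. The `assemble` helper and the `Country` value are hypothetical; the `Quantity` and `UnitPrice` values match the first feature vector in the notebook.

```python
def assemble(row, input_cols):
    """Concatenate the selected columns of a row into one list of floats."""
    return [float(row[col]) for col in input_cols]


# Hypothetical row; only the selected columns end up in the feature vector.
row = {"Quantity": 6, "UnitPrice": 2.55, "Country": "United Kingdom"}

features = assemble(row, ["Quantity", "UnitPrice"])
print(features)  # [6.0, 2.55]
```

The order of the input columns determines the order of the values in the vector, so it should stay fixed between training and prediction.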

In [106]:
retail_clusteringDF.show(2)

+----------+
|  features|
+----------+
|[6.0,2.55]|
|[6.0,3.39]|
+----------+
only showing top 2 rows



## K-means

* [KMeans](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.KMeans.html) implements k-means with ***k-means||*** initialization, a parallel variant of k-means++ described in [Scalable K-Means++](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
* [KMeansModel](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.KMeansModel.html): a model fitted by KMeans.
* [KMeansSummary](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.KMeansSummary.html): a summary of KMeans.
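As a reminder of what the estimator computes, here is a minimal plain-Python sketch of the standard k-means (Lloyd) iteration: assign each point to its nearest center, then move each center to the mean of its points. It deliberately uses a naive initialization (the first `k` points) instead of k-means||, runs on a single machine, and uses hypothetical toy data, so it should be read as an illustration rather than as Spark's implementation.

```python
import math


def kmeans(points, k, max_iter=20):
    """Plain Lloyd iteration; returns the final centers and cluster sizes."""
    centers = [list(p) for p in points[:k]]  # naive init, NOT k-means||
    for _ in range(max_iter):
        # Assignment step: group points by their nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, [len(cluster) for cluster in clusters]


# Hypothetical toy data with two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, sizes = kmeans(points, k=2)
print(centers)  # [[0.0, 0.5], [10.0, 10.5]]
print(sizes)    # [2, 2]
```

The `max_iter` argument plays the same role as the `maxIter` parameter of the Spark estimator.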

In [107]:
from pyspark.ml.clustering import KMeans
km = KMeans().setK(5)

### Parameters


In [108]:
print(km.explainParams())

distanceMeasure: the distance measure. Supported options: 'euclidean' and 'cosine'. (default: euclidean)
featuresCol: features column name. (default: features)
initMode: The initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (default: k-means||)
initSteps: The number of steps for k-means|| initialization mode. Must be > 0. (default: 2)
k: The number of clusters to create. Must be > 1. (default: 2, current: 5)
maxBlockSizeInMB: maximum memory in MB for stacking input data into blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. Default 0.0 represents choosing optimal value, depends on specific algorithm. Must be >= 0. (default: 0.0)
maxIter: max number of iterations (>= 0). (default: 20)
predictionCol: prediction column name. (default: prediction)
seed: random seed. (default: -7685785370690492299)
solver

### Fit a model

In [109]:
kmModel = km.fit(retail_clusteringDF)

In [110]:
print("Cluster Centers: ")
for center in kmModel.clusterCenters():
    print(center)

Cluster Centers: 
[9.0114221  4.55313107]
[-7.7605e+04  1.5600e+00]
[7.7605e+04 1.5600e+00]
[1.19799628e+03 1.16297398e+00]
[-1.000e+00  3.897e+04]
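Given these centers, each point belongs to the cluster whose center is closest (Euclidean distance is the default `distanceMeasure`). The sketch below applies that rule in plain Python using the centers printed above. `nearest_center` is a hypothetical helper, and the second query point is made up for illustration.

```python
import math

# Cluster centers copied from the output above.
centers = [
    [9.0114221, 4.55313107],
    [-7.7605e+04, 1.5600e+00],
    [7.7605e+04, 1.5600e+00],
    [1.19799628e+03, 1.16297398e+00],
    [-1.000e+00, 3.897e+04],
]


def nearest_center(point, centers):
    """Index of the center with the smallest Euclidean distance to `point`."""
    distances = [math.dist(point, center) for center in centers]
    return distances.index(min(distances))


print(nearest_center([6.0, 2.55], centers))    # first row of retail_clusteringDF -> 0
print(nearest_center([1000.0, 1.0], centers))  # hypothetical bulk order -> 3
```

Several centers sit at extreme `Quantity` or `UnitPrice` values, which suggests that a few outliers account for most of the clusters.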


### Training Summary

In [111]:
summary = kmModel.summary

print("Cluster Sizes: ")
print(summary.clusterSizes)  # number of points in each cluster

Cluster Sizes: 
[540181, 2, 2, 269, 1]


## Other clustering algorithms in MLlib
* Bisecting K-Means: [BisectingKMeans](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.BisectingKMeans.html)

* Gaussian Mixture Models: [GaussianMixture](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.GaussianMixture.html)

* Latent Dirichlet Allocation: [LDA](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.LDA.html)