# PySpark and SparkSQL Module

Apache Spark is a cluster computing system that offers comprehensive libraries and APIs for developers. Spark SQL is the Apache Spark module for processing structured data through the DataFrame API.

In this notebook, we cover the basics of running Spark jobs with PySpark (the Python API) and walk through some useful functions. After following along, you should have a basic understanding of PySpark and its common functions.

In [2]:
import pandas as pd
import numpy as np
from datetime import date, timedelta, datetime
import time

import pyspark
from pyspark.sql import SparkSession, SQLContext
from pyspark.context import SparkContext
# Import only what is used below; a wildcard import would shadow Python built-ins such as sum, min, and max
from pyspark.sql.functions import lit, when

### 1. Initialize the Spark Session

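We begin by initializing the Spark session. Through the session, DataFrames can be created and registered as tables, SQL queries can be executed over those tables, tables can be cached, and files in Parquet, JSON, CSV, or Avro format can be read. Note that the session is bound to the name `sc` in this notebook, although that name is conventionally used for a `SparkContext`.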

In [3]:
sc = SparkSession.builder.appName("PysparkExample").getOrCreate()

In [4]:
sc

### 2. Load Data

You can download the Kaggle dataset, which includes the book title, author, the date of the best seller list, the published date of the list, the book description, the rank (this week and last week), the publisher, number of weeks on the list, and the price [Link](https://www.kaggle.com/cmenca/new-york-times-hardcover-fiction-best-sellers).

Spark can read data in many formats.

DataFrames can be created by reading text, CSV, JSON, and Parquet files. In this example we use a JSON file; the commented-out lines in the cell below show the corresponding read functions for text and CSV files.

In [5]:
#JSON
dataframe = sc.read.json('./data/nyt2.json')

#TXT files
# dataframe_txt = sc.read.text('./data/xyz.data')

#CSV files
# dataframe_csv = sc.read.csv('./data/xyz.csv')


#### Look at the data with show()

In [6]:
dataframe.show(10)

+--------------------+--------------------+--------------------+-----------------+--------------------+-------------+-----------------+-------------+----+--------------+--------------------+-------------+
|                 _id|  amazon_product_url|              author| bestsellers_date|         description|        price|   published_date|    publisher|rank|rank_last_week|               title|weeks_on_list|
+--------------------+--------------------+--------------------+-----------------+--------------------+-------------+-----------------+-------------+----+--------------+--------------------+-------------+
|{5b4aa4ead3089013...|http://www.amazon...|       Dean R Koontz|{{1211587200000}}|Odd Thomas, who c...|   {null, 27}|{{1212883200000}}|       Bantam| {1}|           {0}|           ODD HOURS|          {1}|
|{5b4aa4ead3089013...|http://www.amazon...|     Stephenie Meyer|{{1211587200000}}|Aliens have taken...|{25.99, null}|{{1212883200000}}|Little, Brown| {2}|           {1}|           

#### Simple Data Inspection 

Before going deeper, it usually helps to skim the DataFrame.

In [7]:
# Returns dataframe column names and data types
dataframe.dtypes

[('_id', 'struct<$oid:string>'),
 ('amazon_product_url', 'string'),
 ('author', 'string'),
 ('bestsellers_date', 'struct<$date:struct<$numberLong:string>>'),
 ('description', 'string'),
 ('price', 'struct<$numberDouble:string,$numberInt:string>'),
 ('published_date', 'struct<$date:struct<$numberLong:string>>'),
 ('publisher', 'string'),
 ('rank', 'struct<$numberInt:string>'),
 ('rank_last_week', 'struct<$numberInt:string>'),
 ('title', 'string'),
 ('weeks_on_list', 'struct<$numberInt:string>')]

In [8]:
# Return first 2 rows
dataframe.head(2)

[Row(_id=Row($oid='5b4aa4ead3089013507db18b'), amazon_product_url='http://www.amazon.com/Odd-Hours-Dean-Koontz/dp/0553807056?tag=NYTBS-20', author='Dean R Koontz', bestsellers_date=Row($date=Row($numberLong='1211587200000')), description='Odd Thomas, who can communicate with the dead, confronts evil forces in a California coastal town.', price=Row($numberDouble=None, $numberInt='27'), published_date=Row($date=Row($numberLong='1212883200000')), publisher='Bantam', rank=Row($numberInt='1'), rank_last_week=Row($numberInt='0'), title='ODD HOURS', weeks_on_list=Row($numberInt='1')),
 Row(_id=Row($oid='5b4aa4ead3089013507db18c'), amazon_product_url='http://www.amazon.com/The-Host-Novel-Stephenie-Meyer/dp/0316218502?tag=NYTBS-20', author='Stephenie Meyer', bestsellers_date=Row($date=Row($numberLong='1211587200000')), description='Aliens have taken control of the minds and bodies of most humans, but one woman won’t surrender.', price=Row($numberDouble='25.99', $numberInt=None), published_date=

In [9]:
# Return last 2 rows
dataframe.tail(2)

[Row(_id=Row($oid='5b4aa4ead3089013507dd95c'), amazon_product_url='https://www.amazon.com/Shelter-Place-Nora-Roberts-ebook/dp/B076BGDMK9?tag=NYTBS-20', author='Nora Roberts', bestsellers_date=Row($date=Row($numberLong='1530921600000')), description='Survivors of a mass shooting outside a mall in Portland, Me., develop different coping mechanisms.', price=Row($numberDouble=None, $numberInt='0'), published_date=Row($date=Row($numberLong='1532217600000')), publisher="St. Martin's", rank=Row($numberInt='14'), rank_last_week=Row($numberInt='5'), title='SHELTER IN PLACE', weeks_on_list=Row($numberInt='6')),
 Row(_id=Row($oid='5b4aa4ead3089013507dd95d'), amazon_product_url='https://www.amazon.com/Last-Time-Lied-Novel/dp/1524743070?tag=NYTBS-20', author='Riley Sager', bestsellers_date=Row($date=Row($numberLong='1530921600000')), description='A painter is in danger when she returns to the summer camp where some of her childhood friends disappeared.', price=Row($numberDouble=None, $numberInt='0'

In [10]:
# Returns first row
dataframe.first()

Row(_id=Row($oid='5b4aa4ead3089013507db18b'), amazon_product_url='http://www.amazon.com/Odd-Hours-Dean-Koontz/dp/0553807056?tag=NYTBS-20', author='Dean R Koontz', bestsellers_date=Row($date=Row($numberLong='1211587200000')), description='Odd Thomas, who can communicate with the dead, confronts evil forces in a California coastal town.', price=Row($numberDouble=None, $numberInt='27'), published_date=Row($date=Row($numberLong='1212883200000')), publisher='Bantam', rank=Row($numberInt='1'), rank_last_week=Row($numberInt='0'), title='ODD HOURS', weeks_on_list=Row($numberInt='1'))

In [11]:
# Returns columns of dataframe
dataframe.columns

['_id',
 'amazon_product_url',
 'author',
 'bestsellers_date',
 'description',
 'price',
 'published_date',
 'publisher',
 'rank',
 'rank_last_week',
 'title',
 'weeks_on_list']

In [12]:
# Counts the number of rows in dataframe
dataframe.count()

10195

In [13]:
# Counts the number of distinct rows in dataframe
dataframe.distinct().count()

10195

In [14]:

# Computes summary statistics
dataframe.describe().show()

+-------+--------------------+---------------+--------------------+---------+------------------+
|summary|  amazon_product_url|         author|         description|publisher|             title|
+-------+--------------------+---------------+--------------------+---------+------------------+
|  count|               10195|          10195|               10195|    10195|             10195|
|   mean|                null|           null|                null|     null|1877.7142857142858|
| stddev|                null|           null|                null|     null| 370.9760613506458|
|    min|http://www.amazon...|        AJ Finn|                    |      ACE|  10TH ANNIVERSARY|
|    max|https://www.amazo...|various authors|’Tis for the Rebe...|allantine|               ZOO|
+-------+--------------------+---------------+--------------------+---------+------------------+



### 3. Useful common functions

#### [1] Remove Duplicate Values

Duplicate rows can be removed with the dropDuplicates() function. Here the count is unchanged (10195) because the dataset contains no fully duplicated rows.

In [15]:
dataframe_dropdup = dataframe.dropDuplicates() 
dataframe_dropdup.show(10)

+--------------------+--------------------+------------------+-----------------+--------------------+-------------+-----------------+-------------+----+--------------+-------------------+-------------+
|                 _id|  amazon_product_url|            author| bestsellers_date|         description|        price|   published_date|    publisher|rank|rank_last_week|              title|weeks_on_list|
+--------------------+--------------------+------------------+-----------------+--------------------+-------------+-----------------+-------------+----+--------------+-------------------+-------------+
|{5b4aa4ead3089013...|http://www.amazon...|      Daniel Silva|{{1217030400000}}|Gabriel Allon, an...|{26.95, null}|{{1218326400000}}|       Putnam| {1}|           {0}| THE SECRET SERVANT|          {1}|
|{5b4aa4ead3089013...|http://www.amazon...|        Jane Green|{{1218240000000}}|A woman’s life ch...|    {null, 0}|{{1219536000000}}|       Viking|{18}|           {0}|    THE BEACH HOUSE|     

In [16]:
dataframe_dropdup.count()

10195

#### [2] 'Select' Operation 

Columns can be selected by name (as a string), by attribute (dataframe.author), or by indexing (dataframe['author']).

In [17]:
dataframe.columns

['_id',
 'amazon_product_url',
 'author',
 'bestsellers_date',
 'description',
 'price',
 'published_date',
 'publisher',
 'rank',
 'rank_last_week',
 'title',
 'weeks_on_list']

In [20]:
# Show the first 10 entries of the author column
dataframe.select("author").show(10)

+--------------------+
|              author|
+--------------------+
|       Dean R Koontz|
|     Stephenie Meyer|
|        Emily Giffin|
|   Patricia Cornwell|
|     Chuck Palahniuk|
|James Patterson a...|
|       John Sandford|
|       Jimmy Buffett|
|    Elizabeth George|
|      David Baldacci|
+--------------------+
only showing top 10 rows



In [21]:
# Show the first 10 entries of the author, title, rank, and price columns
dataframe.select("author", "title", "rank", "price").show(10)

+--------------------+--------------------+----+-------------+
|              author|               title|rank|        price|
+--------------------+--------------------+----+-------------+
|       Dean R Koontz|           ODD HOURS| {1}|   {null, 27}|
|     Stephenie Meyer|            THE HOST| {2}|{25.99, null}|
|        Emily Giffin|LOVE THE ONE YOU'...| {3}|{24.95, null}|
|   Patricia Cornwell|           THE FRONT| {4}|{22.95, null}|
|     Chuck Palahniuk|               SNUFF| {5}|{24.95, null}|
|James Patterson a...|SUNDAYS AT TIFFANY’S| {6}|{24.99, null}|
|       John Sandford|        PHANTOM PREY| {7}|{26.95, null}|
|       Jimmy Buffett|          SWINE NOT?| {8}|{21.99, null}|
|    Elizabeth George|     CARELESS IN RED| {9}|{27.95, null}|
|      David Baldacci|     THE WHOLE TRUTH|{10}|{26.99, null}|
+--------------------+--------------------+----+-------------+
only showing top 10 rows



#### [3] 'When' Operation 

In [20]:
# Show each title with a flag: 0 if the title is 'ODD HOURS', otherwise 1
dataframe.select("title", when(dataframe.title != 'ODD HOURS', 1).otherwise(0)).show(10)

+--------------------+-----------------------------------------------------+
|               title|CASE WHEN (NOT (title = ODD HOURS)) THEN 1 ELSE 0 END|
+--------------------+-----------------------------------------------------+
|           ODD HOURS|                                                    0|
|            THE HOST|                                                    1|
|LOVE THE ONE YOU'...|                                                    1|
|           THE FRONT|                                                    1|
|               SNUFF|                                                    1|
|SUNDAYS AT TIFFANY’S|                                                    1|
|        PHANTOM PREY|                                                    1|
|          SWINE NOT?|                                                    1|
|     CARELESS IN RED|                                                    1|
|     THE WHOLE TRUTH|                                                    1|

#### [4] 'isin' Operation 

In [21]:
# Show rows whose author is one of the given names
dataframe[dataframe.author.isin("John Sandford", "Emily Giffin")].show(5)

+--------------------+--------------------+-------------+-----------------+--------------------+-------------+-----------------+------------+----+--------------+--------------------+-------------+
|                 _id|  amazon_product_url|       author| bestsellers_date|         description|        price|   published_date|   publisher|rank|rank_last_week|               title|weeks_on_list|
+--------------------+--------------------+-------------+-----------------+--------------------+-------------+-----------------+------------+----+--------------+--------------------+-------------+
|{5b4aa4ead3089013...|http://www.amazon...| Emily Giffin|{{1211587200000}}|A woman's happy m...|{24.95, null}|{{1212883200000}}|St. Martin's| {3}|           {2}|LOVE THE ONE YOU'...|          {2}|
|{5b4aa4ead3089013...|http://www.amazon...|John Sandford|{{1211587200000}}|The Minneapolis d...|{26.95, null}|{{1212883200000}}|      Putnam| {7}|           {4}|        PHANTOM PREY|          {3}|
|{5b4aa4ead3089

#### [5] 'Like' Operation 

In [22]:
# Show author, title, and a boolean column that is true when the title contains " THE " (surrounded by spaces)
dataframe.select("author", "title", dataframe.title.like("% THE %")).show(15)

+--------------------+--------------------+------------------+
|              author|               title|title LIKE % THE %|
+--------------------+--------------------+------------------+
|       Dean R Koontz|           ODD HOURS|             false|
|     Stephenie Meyer|            THE HOST|             false|
|        Emily Giffin|LOVE THE ONE YOU'...|              true|
|   Patricia Cornwell|           THE FRONT|             false|
|     Chuck Palahniuk|               SNUFF|             false|
|James Patterson a...|SUNDAYS AT TIFFANY’S|             false|
|       John Sandford|        PHANTOM PREY|             false|
|       Jimmy Buffett|          SWINE NOT?|             false|
|    Elizabeth George|     CARELESS IN RED|             false|
|      David Baldacci|     THE WHOLE TRUTH|             false|
|        Troy Denning|          INVINCIBLE|             false|
|          James Frey|BRIGHT SHINY MORNING|             false|
|         Garth Stein|THE ART OF RACING...|            

#### [6] 'Startswith' — 'Endswith' Operation 

startswith() checks whether a string begins with the given value, and endswith() checks whether it ends with that value. Both functions are case-sensitive.

In [23]:
dataframe.select("author", "title", dataframe.title.startswith("THE")).show(5)
dataframe.select("author", "title", dataframe.title.endswith("NT")).show(5)

+-----------------+--------------------+----------------------+
|           author|               title|startswith(title, THE)|
+-----------------+--------------------+----------------------+
|    Dean R Koontz|           ODD HOURS|                 false|
|  Stephenie Meyer|            THE HOST|                  true|
|     Emily Giffin|LOVE THE ONE YOU'...|                 false|
|Patricia Cornwell|           THE FRONT|                  true|
|  Chuck Palahniuk|               SNUFF|                 false|
+-----------------+--------------------+----------------------+
only showing top 5 rows

+-----------------+--------------------+-------------------+
|           author|               title|endswith(title, NT)|
+-----------------+--------------------+-------------------+
|    Dean R Koontz|           ODD HOURS|              false|
|  Stephenie Meyer|            THE HOST|              false|
|     Emily Giffin|LOVE THE ONE YOU'...|              false|
|Patricia Cornwell|           THE

#### [7] 'Substring' Operation 

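In the following examples, substr(startPos, length) extracts text from the author column: 3 characters starting at position 1, 6 characters starting at position 1, and 6 characters starting at position 3. Positions are 1-based. Note that the result columns are aliased as "title" even though the text comes from the author column.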

In [25]:
dataframe.select(dataframe.author.substr(1, 3).alias("title")).show(5)
dataframe.select(dataframe.author.substr(1, 6).alias("title")).show(5)
dataframe.select(dataframe.author.substr(3, 6).alias("title")).show(5)

+-----+
|title|
+-----+
|  Dea|
|  Ste|
|  Emi|
|  Pat|
|  Chu|
+-----+
only showing top 5 rows

+------+
| title|
+------+
|Dean R|
|Stephe|
|Emily |
|Patric|
|Chuck |
+------+
only showing top 5 rows

+------+
| title|
+------+
|an R K|
|epheni|
|ily Gi|
|tricia|
|uck Pa|
+------+
only showing top 5 rows



#### [8] Adding Columns 

In [27]:
# lit() wraps a constant value in a Column, which withColumn() requires
dataframe = dataframe.withColumn('new_column', lit('new column value'))

display(dataframe)

DataFrame[_id: struct<$oid:string>, amazon_product_url: string, author: string, bestsellers_date: struct<$date:struct<$numberLong:string>>, description: string, price: struct<$numberDouble:string,$numberInt:string>, published_date: struct<$date:struct<$numberLong:string>>, publisher: string, rank: struct<$numberInt:string>, rank_last_week: struct<$numberInt:string>, title: string, weeks_on_list: struct<$numberInt:string>, new_column: string]

In [28]:
dataframe.show(2)

+--------------------+--------------------+---------------+-----------------+--------------------+-------------+-----------------+-------------+----+--------------+---------+-------------+----------------+
|                 _id|  amazon_product_url|         author| bestsellers_date|         description|        price|   published_date|    publisher|rank|rank_last_week|    title|weeks_on_list|      new_column|
+--------------------+--------------------+---------------+-----------------+--------------------+-------------+-----------------+-------------+----+--------------+---------+-------------+----------------+
|{5b4aa4ead3089013...|http://www.amazon...|  Dean R Koontz|{{1211587200000}}|Odd Thomas, who c...|   {null, 27}|{{1212883200000}}|       Bantam| {1}|           {0}|ODD HOURS|          {1}|new column value|
|{5b4aa4ead3089013...|http://www.amazon...|Stephenie Meyer|{{1211587200000}}|Aliens have taken...|{25.99, null}|{{1212883200000}}|Little, Brown| {2}|           {1}| THE HOST|  

#### [9] Updating Columns 

To rename a column with the DataFrame API, use the withColumnRenamed() function, which takes two parameters: the existing column name and the new name.

In [29]:
# Rename column 'amazon_product_url' to 'URL'
dataframe = dataframe.withColumnRenamed('amazon_product_url', 'URL')

dataframe.show(5)

+--------------------+--------------------+-----------------+-----------------+--------------------+-------------+-----------------+-------------+----+--------------+--------------------+-------------+----------------+
|                 _id|                 URL|           author| bestsellers_date|         description|        price|   published_date|    publisher|rank|rank_last_week|               title|weeks_on_list|      new_column|
+--------------------+--------------------+-----------------+-----------------+--------------------+-------------+-----------------+-------------+----+--------------+--------------------+-------------+----------------+
|{5b4aa4ead3089013...|http://www.amazon...|    Dean R Koontz|{{1211587200000}}|Odd Thomas, who c...|   {null, 27}|{{1212883200000}}|       Bantam| {1}|           {0}|           ODD HOURS|          {1}|new column value|
|{5b4aa4ead3089013...|http://www.amazon...|  Stephenie Meyer|{{1211587200000}}|Aliens have taken...|{25.99, null}|{{12128832

#### [10] Removing Columns 

A column can be removed in two ways:

1. Passing the column names as strings to the drop() function
2. Passing Column objects (e.g. dataframe.publisher) to the drop() function

In [30]:
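# 1. Drop by column name
# (show() returns None, so keep the DataFrame and display it in a separate step)
dataframe_remove = dataframe.drop("publisher", "published_date")
dataframe_remove.show(5)

# 2. Drop by Column reference
dataframe_remove2 = dataframe.drop(dataframe.publisher).drop(dataframe.published_date)
dataframe_remove2.show(5)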

+--------------------+--------------------+-----------------+-----------------+--------------------+-------------+----+--------------+--------------------+-------------+----------------+
|                 _id|                 URL|           author| bestsellers_date|         description|        price|rank|rank_last_week|               title|weeks_on_list|      new_column|
+--------------------+--------------------+-----------------+-----------------+--------------------+-------------+----+--------------+--------------------+-------------+----------------+
|{5b4aa4ead3089013...|http://www.amazon...|    Dean R Koontz|{{1211587200000}}|Odd Thomas, who c...|   {null, 27}| {1}|           {0}|           ODD HOURS|          {1}|new column value|
|{5b4aa4ead3089013...|http://www.amazon...|  Stephenie Meyer|{{1211587200000}}|Aliens have taken...|{25.99, null}| {2}|           {1}|            THE HOST|          {3}|new column value|
|{5b4aa4ead3089013...|http://www.amazon...|     Emily Giffin|{{12

#### [11] 'GroupBy' Operation

In [31]:
# Group by author and count the rows (weekly list entries) for each author

dataframe.groupBy("author").count().show(10)

+-----------------+-----+
|           author|count|
+-----------------+-----+
|       James Frey|    2|
| Elin Hilderbrand|   58|
|Sharon Kay Penman|    2|
|      Lisa Genova|    7|
|     Will Allison|    1|
|Patricia Cornwell|   64|
|    Laurie R King|    6|
|       Tea Obreht|    6|
|     Sarah Dunant|    1|
|     Tim Johnston|    1|
+-----------------+-----+
only showing top 10 rows



#### [12] 'Filter' Operation

In [35]:
# Filtering entries of title
# Only keeps records having value 'THE HOST'

dataframe.filter(dataframe["title"] == 'THE HOST').show(5)

+--------------------+--------------------+---------------+-----------------+--------------------+-------------+-----------------+-------------+----+--------------+--------+-------------+----------------+
|                 _id|                 URL|         author| bestsellers_date|         description|        price|   published_date|    publisher|rank|rank_last_week|   title|weeks_on_list|      new_column|
+--------------------+--------------------+---------------+-----------------+--------------------+-------------+-----------------+-------------+----+--------------+--------+-------------+----------------+
|{5b4aa4ead3089013...|http://www.amazon...|Stephenie Meyer|{{1211587200000}}|Aliens have taken...|{25.99, null}|{{1212883200000}}|Little, Brown| {2}|           {1}|THE HOST|          {3}|new column value|
|{5b4aa4ead3089013...|http://www.amazon...|Stephenie Meyer|{{1212192000000}}|Aliens have taken...|{25.99, null}|{{1213488000000}}|Little, Brown| {2}|           {2}|THE HOST|       

#### [13] Handling Missing Values

In [33]:
# Return a new DataFrame with nulls in numeric columns replaced by 0 (string and struct columns are left unchanged)
dataframe.fillna(0)

DataFrame[_id: struct<$oid:string>, URL: string, author: string, bestsellers_date: struct<$date:struct<$numberLong:string>>, description: string, price: struct<$numberDouble:string,$numberInt:string>, published_date: struct<$date:struct<$numberLong:string>>, publisher: string, rank: struct<$numberInt:string>, rank_last_week: struct<$numberInt:string>, title: string, weeks_on_list: struct<$numberInt:string>, new_column: string]

In [34]:
# Return a new DataFrame without rows that contain null values (equivalent to dataframe.na.drop())
dataframe.dropna()

DataFrame[_id: struct<$oid:string>, URL: string, author: string, bestsellers_date: struct<$date:struct<$numberLong:string>>, description: string, price: struct<$numberDouble:string,$numberInt:string>, published_date: struct<$date:struct<$numberLong:string>>, publisher: string, rank: struct<$numberInt:string>, rank_last_week: struct<$numberInt:string>, title: string, weeks_on_list: struct<$numberInt:string>, new_column: string]

In [35]:
dataframe.show()

+--------------------+--------------------+--------------------+-----------------+--------------------+-------------+-----------------+--------------------+----+--------------+--------------------+-------------+----------------+
|                 _id|                 URL|              author| bestsellers_date|         description|        price|   published_date|           publisher|rank|rank_last_week|               title|weeks_on_list|      new_column|
+--------------------+--------------------+--------------------+-----------------+--------------------+-------------+-----------------+--------------------+----+--------------+--------------------+-------------+----------------+
|{5b4aa4ead3089013...|http://www.amazon...|       Dean R Koontz|{{1211587200000}}|Odd Thomas, who c...|   {null, 27}|{{1212883200000}}|              Bantam| {1}|           {0}|           ODD HOURS|          {1}|new column value|
|{5b4aa4ead3089013...|http://www.amazon...|     Stephenie Meyer|{{1211587200000}}|Al

In [36]:
# Return new dataframe replacing one value with another
dataframe = dataframe.replace("Dean R Koontz","Dean Koontz")

In [37]:
dataframe.show()

+--------------------+--------------------+--------------------+-----------------+--------------------+-------------+-----------------+--------------------+----+--------------+--------------------+-------------+----------------+
|                 _id|                 URL|              author| bestsellers_date|         description|        price|   published_date|           publisher|rank|rank_last_week|               title|weeks_on_list|      new_column|
+--------------------+--------------------+--------------------+-----------------+--------------------+-------------+-----------------+--------------------+----+--------------+--------------------+-------------+----------------+
|{5b4aa4ead3089013...|http://www.amazon...|         Dean Koontz|{{1211587200000}}|Odd Thomas, who c...|   {null, 27}|{{1212883200000}}|              Bantam| {1}|           {0}|           ODD HOURS|          {1}|new column value|
|{5b4aa4ead3089013...|http://www.amazon...|     Stephenie Meyer|{{1211587200000}}|Al

#### [14] Repartitioning

It is possible to increase or decrease the number of partitions of a DataFrame or of its underlying RDD.

**repartition(numPartitions)** returns a new dataset with exactly the requested number of partitions, which may be higher or lower than the current number. It performs a full shuffle.

**coalesce(numPartitions)** returns a new dataset with the number of partitions reduced to the specified number. It avoids a full shuffle, and it cannot increase the partition count. (The RDD version also accepts an optional shuffle argument: **coalesce(numPartitions, shuffle=False)**.)

In [36]:
# Dataframe with 10 partitions
dataframe.repartition(10).rdd.getNumPartitions()

# Dataframe with 1 partition
dataframe.coalesce(1).rdd.getNumPartitions()

1

#### [15] Running SQL Commands in Spark

In [42]:
# Register the DataFrame as a temporary view (registerTempTable() is deprecated since Spark 2.0)
dataframe.createOrReplaceTempView("df")

sc.sql("select * from df").show(5)

+--------------------+--------------------+-----------------+-----------------+--------------------+-------------+-----------------+-------------+----+--------------+--------------------+-------------+----------------+
|                 _id|                 URL|           author| bestsellers_date|         description|        price|   published_date|    publisher|rank|rank_last_week|               title|weeks_on_list|      new_column|
+--------------------+--------------------+-----------------+-----------------+--------------------+-------------+-----------------+-------------+----+--------------+--------------------+-------------+----------------+
|{5b4aa4ead3089013...|http://www.amazon...|    Dean R Koontz|{{1211587200000}}|Odd Thomas, who c...|   {null, 27}|{{1212883200000}}|       Bantam| {1}|           {0}|           ODD HOURS|          {1}|new column value|
|{5b4aa4ead3089013...|http://www.amazon...|  Stephenie Meyer|{{1211587200000}}|Aliens have taken...|{25.99, null}|{{12128832



In [53]:
sc.sql("select * from df where description like '%Thomas%' and author like 'Dean%'").show(5)

+--------------------+--------------------+-------------+-----------------+--------------------+----------+-----------------+---------+----+--------------+---------+-------------+----------------+
|                 _id|                 URL|       author| bestsellers_date|         description|     price|   published_date|publisher|rank|rank_last_week|    title|weeks_on_list|      new_column|
+--------------------+--------------------+-------------+-----------------+--------------------+----------+-----------------+---------+----+--------------+---------+-------------+----------------+
|{5b4aa4ead3089013...|http://www.amazon...|Dean R Koontz|{{1211587200000}}|Odd Thomas, who c...|{null, 27}|{{1212883200000}}|   Bantam| {1}|           {0}|ODD HOURS|          {1}|new column value|
|{5b4aa4ead3089013...|http://www.amazon...|Dean R Koontz|{{1212192000000}}|Odd Thomas, who c...|{null, 27}|{{1213488000000}}|   Bantam| {3}|           {1}|ODD HOURS|          {2}|new column value|
|{5b4aa4ead3089

In [56]:
theme_query = """
    SELECT CASE
        WHEN description LIKE '%love%' THEN 'Love_Theme'
        WHEN description LIKE '%hate%' THEN 'Hate_Theme'
        WHEN description LIKE '%happy%' THEN 'Happiness_Theme'
        WHEN description LIKE '%anger%' THEN 'Anger_Theme'
        WHEN description LIKE '%horror%' THEN 'Horror_Theme'
        WHEN description LIKE '%death%' THEN 'Criminal_Theme'
        WHEN description LIKE '%detective%' THEN 'Mystery_Theme'
        ELSE 'Other_Themes'
    END AS Themes
    FROM df
"""
sc.sql(theme_query).groupBy('Themes').count().show()

+---------------+-----+
|         Themes|count|
+---------------+-----+
|    Anger_Theme|  203|
|   Other_Themes| 8778|
|  Mystery_Theme|  454|
|     Hate_Theme|   23|
|     Love_Theme|  392|
|Happiness_Theme|   34|
|   Horror_Theme|    6|
| Criminal_Theme|  305|
+---------------+-----+



#### [16] End Spark Session

In [41]:
sc.stop()