In [1]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("example") \
    .getOrCreate()

# Filtering column content with Python

This is often one of the first steps in data cleaning: removing anything that is obviously outside the expected format. For this dataset, look at the original data and note what looks out of place in the `VOTER_NAME` column.

In [2]:
voter_df = spark.read.csv("dataset/DallasCouncilVoters.csv", header=True)
voter_df.show(3)

+----------+-------------+-------------------+
|      DATE|        TITLE|         VOTER_NAME|
+----------+-------------+-------------------+
|02/08/2017|Councilmember|  Jennifer S. Gates|
|02/08/2017|Councilmember| Philip T. Kingston|
|02/08/2017|        Mayor|Michael S. Rawlings|
+----------+-------------+-------------------+
only showing top 3 rows



In [3]:
import pyspark.sql.functions as F
# Show the distinct VOTER_NAME entries
voter_df.select('VOTER_NAME').distinct().show(40, truncate=False)

# Keep rows where VOTER_NAME is 1-19 characters long (the upper bound of 20 is exclusive)
voter_df = voter_df.filter('length(VOTER_NAME) > 0 and length(VOTER_NAME) < 20')

# Drop rows where VOTER_NAME contains an underscore
voter_df = voter_df.filter(~ F.col('VOTER_NAME').contains('_'))

# Show the distinct VOTER_NAME entries again
voter_df.select('VOTER_NAME').distinct().show(40, truncate=False)
voter_df_single = voter_df

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|VOTER_NAME                                                                                                                                                                                                                                                                                                                                                                                                                 |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
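
As a rough plain-Python analogue of the two filters above (a sketch only; `looks_like_name` is a hypothetical helper and not part of PySpark), the rule keeps non-null names of 1 to 19 characters that contain no underscore. It runs without a Spark session, which makes the boundary cases easy to check:

```python
def looks_like_name(name):
    """Plain-Python mirror of the two Spark filters on VOTER_NAME."""
    if name is None:
        # In Spark SQL, length(NULL) is NULL, so the filter condition is
        # not true and the row is dropped.
        return False
    return 0 < len(name) < 20 and "_" not in name


samples = ["Lee Kleinman", "011018__42", "", None, "x" * 20]
kept = [s for s in samples if looks_like_name(s)]
print(kept)  # ['Lee Kleinman']
```

Note that a name of exactly 20 characters is rejected, because the upper bound is a strict `<`.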

# Filtering Question #1

Consider the following DataFrame called `users_df`:
```
ID     Name      Age  State
140    George L  47   Iowa
3260   Mary R    34   Vermont
18502  null      68   Ohio
999    Rick W    23   California
```
If you wanted to return only the entries without nulls, which of the following options would work?


- `users_df = users_df.filter(users_df.Name.isNotNull())`
- `users_df = users_df.where(~ (users_df.ID == 18502) )`
- `users_df = users_df.filter(~ col('Name').isNull())`

# Filtering Question #2

Consider the following DataFrame called `users_df`:
```
ID     Name      Age  State
140    George L  47   Iowa
3260   Mary R    34   Vermont
18502  Audrey    68   Ohio
999    Rick W    23   California
```
If you wanted to return only the Name and State fields for any ID greater than 3000, which code snippet meets these requirements?

- `users_df.filter('ID > 3000').select("Name", "State")`

# Modifying DataFrame columns

Previously, you filtered out any rows that didn't conform to something generally resembling a name. Now, based on your earlier work, your manager has asked you to create two new columns: first_name and last_name. She asks you to split the VOTER_NAME column into words on any whitespace. The code below treats the last word as the last_name and the first word as the first_name (middle names are handled later with a user defined function). You'll be using some new functions in this exercise, including split(), size(), and getItem(). The column method .getItem(index) returns the item at the given index of an array column. The functions split() and size() are in the pyspark.sql.functions module.
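
The split-and-index logic can be tried out in plain Python before running it on the DataFrame. This is a minimal sketch, not the Spark code itself; `split_name` is a hypothetical helper that mirrors `F.split`, `getItem(0)` and `getItem(size - 1)` on a single string:

```python
import re


def split_name(voter_name):
    """Return (first_name, last_name) for one name string."""
    parts = re.split(r"\s+", voter_name)  # analogous to F.split(col, r'\s+')
    return parts[0], parts[-1]            # analogous to getItem(0), getItem(size - 1)


print(split_name("B. Adam  McGough"))     # ('B.', 'McGough')
print(split_name("Carolyn King Arnold"))  # ('Carolyn', 'Arnold')
```

Splitting on `\s+` rather than a single space means the double space in "B. Adam  McGough" does not produce an empty element.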

In [4]:
# Add a new column called splits separated on whitespace
# (raw string so that \s is passed to the regex engine unchanged)
voter_df = voter_df.withColumn('splits', F.split(voter_df.VOTER_NAME, r'\s+'))

# Create a new column called first_name based on the first item in splits
voter_df = voter_df.withColumn('first_name', voter_df.splits.getItem(0))

# Get the last entry of the splits list and create a column called last_name
voter_df = voter_df.withColumn('last_name', voter_df.splits.getItem(F.size('splits') - 1))

# Drop the splits column
voter_df = voter_df.drop('splits')

# Show the voter_df DataFrame
voter_df.show()



+----------+-------------+-------------------+----------+---------+
|      DATE|        TITLE|         VOTER_NAME|first_name|last_name|
+----------+-------------+-------------------+----------+---------+
|02/08/2017|Councilmember|  Jennifer S. Gates|  Jennifer|    Gates|
|02/08/2017|Councilmember| Philip T. Kingston|    Philip| Kingston|
|02/08/2017|        Mayor|Michael S. Rawlings|   Michael| Rawlings|
|02/08/2017|Councilmember|       Adam Medrano|      Adam|  Medrano|
|02/08/2017|Councilmember|       Casey Thomas|     Casey|   Thomas|
|02/08/2017|Councilmember|Carolyn King Arnold|   Carolyn|   Arnold|
|02/08/2017|Councilmember|       Scott Griggs|     Scott|   Griggs|
|02/08/2017|Councilmember|   B. Adam  McGough|        B.|  McGough|
|02/08/2017|Councilmember|       Lee Kleinman|       Lee| Kleinman|
|02/08/2017|Councilmember|      Sandy Greyson|     Sandy|  Greyson|
|02/08/2017|Councilmember|  Jennifer S. Gates|  Jennifer|    Gates|
|02/08/2017|Councilmember| Philip T. Kingston|  

# when() example

The when() clause lets you conditionally modify a DataFrame based on its content. Modify the voter_df DataFrame to add a random number for any voting member whose title is "Councilmember". Rows that match no when() condition are left as NULL.

In [5]:
# Add a random_val column for any voter with the title Councilmember
voter_df = voter_df.withColumn('random_val',
                               F.when(voter_df.TITLE == 'Councilmember', F.rand()))

# Show some of the DataFrame rows, noting whether the when clause worked
voter_df.show()

+----------+-------------+-------------------+----------+---------+--------------------+
|      DATE|        TITLE|         VOTER_NAME|first_name|last_name|          random_val|
+----------+-------------+-------------------+----------+---------+--------------------+
|02/08/2017|Councilmember|  Jennifer S. Gates|  Jennifer|    Gates| 0.33064662047029625|
|02/08/2017|Councilmember| Philip T. Kingston|    Philip| Kingston| 0.02256437187270488|
|02/08/2017|        Mayor|Michael S. Rawlings|   Michael| Rawlings|                NULL|
|02/08/2017|Councilmember|       Adam Medrano|      Adam|  Medrano| 0.38935911511962484|
|02/08/2017|Councilmember|       Casey Thomas|     Casey|   Thomas|0.012822829808157299|
|02/08/2017|Councilmember|Carolyn King Arnold|   Carolyn|   Arnold|  0.5108983699725119|
|02/08/2017|Councilmember|       Scott Griggs|     Scott|   Griggs|  0.6955949493400981|
|02/08/2017|Councilmember|   B. Adam  McGough|        B.|  McGough|  0.8825023775424193|
|02/08/2017|Councilme

# When / Otherwise

This requirement is similar to the last, but now you want to add multiple values based on the voter's position. Modify your voter_df DataFrame to add a random number to any voting member that is defined as a Councilmember. Use 2 for the Mayor and 0 for any other position.
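
The branching that a chained `when()` / `otherwise()` expresses can be pictured as an ordinary if/elif/else applied to each row. The sketch below is a plain-Python analogue under that assumption; `position_value` is a hypothetical helper, and `random.random` stands in for `F.rand()`:

```python
import random


def position_value(title, rng=random.random):
    """First matching branch wins, like F.when(...).when(...).otherwise(...)."""
    if title == "Councilmember":
        return rng()   # stand-in for F.rand(): a float in [0.0, 1.0)
    if title == "Mayor":
        return 2
    return 0           # the .otherwise(0) default


for title in ["Councilmember", "Mayor", "Deputy Mayor Pro Tem"]:
    print(title, position_value(title))
```

Without the final default, the last branch would return nothing, which corresponds to the NULL values Spark produces when `.otherwise()` is omitted.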

In [6]:
# Add a column to voter_df for a voter based on their position
voter_df = voter_df.withColumn('random_val',
                               F.when(voter_df.TITLE == 'Councilmember', F.rand())
                               .when(voter_df.TITLE == 'Mayor', 2)
                               .otherwise(0))

# Show some of the DataFrame rows
voter_df.show(3)

# Use the .filter() clause with random_val
voter_df.filter(voter_df.random_val == 0).show(3)

+----------+-------------+-------------------+----------+---------+-------------------+
|      DATE|        TITLE|         VOTER_NAME|first_name|last_name|         random_val|
+----------+-------------+-------------------+----------+---------+-------------------+
|02/08/2017|Councilmember|  Jennifer S. Gates|  Jennifer|    Gates| 0.1427376437076402|
|02/08/2017|Councilmember| Philip T. Kingston|    Philip| Kingston|0.35573805376397416|
|02/08/2017|        Mayor|Michael S. Rawlings|   Michael| Rawlings|                2.0|
+----------+-------------+-------------------+----------+---------+-------------------+
only showing top 3 rows

+----------+--------------------+-----------------+----------+---------+----------+
|      DATE|               TITLE|       VOTER_NAME|first_name|last_name|random_val|
+----------+--------------------+-----------------+----------+---------+----------+
|04/25/2018|Deputy Mayor Pro Tem|     Adam Medrano|      Adam|  Medrano|       0.0|
|04/25/2018|       Mayo

# Understanding user defined functions

When creating a new user defined function, which is a possible value for the second argument (the return type)?

- `ArrayType(IntegerType())`
- `IntegerType()`
- `LongType()`
- `StringType()`

# Using user defined functions in Spark

For this exercise, you'll use the voter_df DataFrame again, this time adding a first_and_middle_name column that holds the first and middle names.

In [7]:
# Split VOTER_NAME on single spaces into an array column
voter_df = voter_df.withColumn('splits', F.split(voter_df.VOTER_NAME, ' '))
voter_df.show(3)

+----------+-------------+-------------------+----------+---------+-------------------+--------------------+
|      DATE|        TITLE|         VOTER_NAME|first_name|last_name|         random_val|              splits|
+----------+-------------+-------------------+----------+---------+-------------------+--------------------+
|02/08/2017|Councilmember|  Jennifer S. Gates|  Jennifer|    Gates| 0.1427376437076402|[Jennifer, S., Ga...|
|02/08/2017|Councilmember| Philip T. Kingston|    Philip| Kingston|0.35573805376397416|[Philip, T., King...|
|02/08/2017|        Mayor|Michael S. Rawlings|   Michael| Rawlings|                2.0|[Michael, S., Raw...|
+----------+-------------+-------------------+----------+---------+-------------------+--------------------+
only showing top 3 rows



In [8]:
from pyspark.sql.types import StringType

def getFirstAndMiddle(names):
  # Return a space separated string of every name except the last
  return ' '.join(names[:-1])

# Define the method as a UDF
udfFirstAndMiddle = F.udf(getFirstAndMiddle, StringType())

# Create a new column using your UDF
voter_df = voter_df.withColumn('first_and_middle_name', udfFirstAndMiddle(voter_df.splits))

# Show the DataFrame
# voter_df.show(3)
voter_df.count()

43912

In [9]:
voter_df.dtypes

[('DATE', 'string'),
 ('TITLE', 'string'),
 ('VOTER_NAME', 'string'),
 ('first_name', 'string'),
 ('last_name', 'string'),
 ('random_val', 'double'),
 ('splits', 'array<string>'),
 ('first_and_middle_name', 'string')]
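
Because a UDF wraps an ordinary Python function, the function can be checked on plain lists before it is registered with `F.udf`. The sketch below assumes the intent is to drop the last element of the list; `first_and_middle` is a hypothetical name used only for illustration:

```python
def first_and_middle(names):
    """Join every element except the last one with single spaces."""
    return " ".join(names[:-1])


print(first_and_middle(["Jennifer", "S.", "Gates"]))  # Jennifer S.
print(first_and_middle(["Adam", "Medrano"]))          # Adam

# Splitting on a single space keeps an empty string where two spaces meet.
print("B. Adam  McGough".split(" "))  # ['B.', 'Adam', '', 'McGough']
```

The last line shows plain Python behavior; a similar empty element may appear in the Spark `splits` array for names that contain a double space, which would leave a trailing space in the joined result.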

# Adding an ID Field

When working with data, you sometimes only want to access certain fields and perform various operations. In this case, find all the unique voter names from the DataFrame and add a unique ID number. Remember that Spark IDs are assigned based on the DataFrame partition, so the ID values may be much greater than the actual number of rows in the DataFrame.

With Spark's lazy processing, the IDs are not actually generated until an action is performed, and their values depend on how the data is partitioned at that point. They are guaranteed to be unique and increasing, but not consecutive.
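
To see why the IDs can be far larger than the row count, it helps to reproduce the layout described in the Spark documentation for `monotonically_increasing_id()`: the partition index goes in the upper 31 bits and the record number within the partition in the lower 33 bits. The following is a plain-Python sketch of that arithmetic; `expected_row_id` is a hypothetical helper, not a Spark function:

```python
def expected_row_id(partition_index, row_in_partition):
    """Combine a partition index and a per-partition row counter into one ID."""
    return (partition_index << 33) + row_in_partition


print(expected_row_id(0, 35))  # 35: in a single partition, IDs look like row numbers
print(expected_row_id(1, 0))   # 8589934592: first row of a second partition
```

With a single partition the IDs run 0, 1, 2, and so on, while the first row of any later partition jumps by a multiple of 2^33.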

In [10]:
df = spark.read.csv("dataset/DallasCouncilVotes.csv", header=True)
df.show(3)

+----------+------------------+---------+--------+-------------+-------------------+---------+------------------+-----------------------+------------------+--------------------+
|      DATE|AGENDA_ITEM_NUMBER|ITEM_TYPE|DISTRICT|        TITLE|         VOTER NAME|VOTE CAST|FINAL ACTION TAKEN|AGENDA ITEM DESCRIPTION|         AGENDA_ID|             VOTE_ID|
+----------+------------------+---------+--------+-------------+-------------------+---------+------------------+-----------------------+------------------+--------------------+
|02/08/2017|                 1|   AGENDA|      13|Councilmember|  Jennifer S. Gates|      N/A|  NO ACTION NEEDED|          Call to Order|020817__Special__1|020817__Special__...|
|02/08/2017|                 1|   AGENDA|      14|Councilmember| Philip T. Kingston|      N/A|  NO ACTION NEEDED|          Call to Order|020817__Special__1|020817__Special__...|
|02/08/2017|                 1|   AGENDA|      15|        Mayor|Michael S. Rawlings|      N/A|  NO ACTION NEED

In [11]:

# Select all the unique council voters
voter_df = df.select(df["VOTER NAME"]).distinct()

# Count the rows in voter_df
print("\nThere are %d rows in the voter_df DataFrame.\n" % voter_df.count())

# Add a ROW_ID
voter_df = voter_df.withColumn('ROW_ID', F.monotonically_increasing_id())

# Show the rows with 10 highest IDs in the set
voter_df.orderBy(voter_df.ROW_ID.desc()).show(10)


There are 36 rows in the voter_df DataFrame.

+--------------------+------+
|          VOTER NAME|ROW_ID|
+--------------------+------+
|                NULL|    35|
|  the  final  201...|    34|
|  the  final   20...|    33|
|   the   final  2...|    32|
|  the  final  201...|    31|
|   the   final  2...|    30|
| the final 2018 A...|    29|
|  the  final   20...|    28|
|          011018__42|    27|
|        Lee Kleinman|    26|
+--------------------+------+
only showing top 10 rows



# IDs with different partitions

You've just completed adding an ID field to a DataFrame. Now, take a look at what happens when you do the same thing on DataFrames containing a different number of partitions.

In [13]:
# Print the number of partitions in each DataFrame
print("\nThere are %d partitions in the voter_df DataFrame.\n" % voter_df.rdd.getNumPartitions())
print("\nThere are %d partitions in the voter_df_single DataFrame.\n" % voter_df_single.rdd.getNumPartitions())

# Add a ROW_ID field to each DataFrame
voter_df = voter_df.withColumn('ROW_ID', F.monotonically_increasing_id())
voter_df_single = voter_df_single.withColumn('ROW_ID', F.monotonically_increasing_id())




There are 1 partitions in the voter_df DataFrame.


There are 1 partitions in the voter_df_single DataFrame.



In [14]:
# Show the top 10 IDs in each DataFrame 
voter_df.orderBy(voter_df.ROW_ID.desc()).show(10)
voter_df_single.orderBy(voter_df_single.ROW_ID.desc()).show(10)

+--------------------+------+
|          VOTER NAME|ROW_ID|
+--------------------+------+
|                NULL|    35|
|  the  final  201...|    34|
|  the  final   20...|    33|
|   the   final  2...|    32|
|  the  final  201...|    31|
|   the   final  2...|    30|
| the final 2018 A...|    29|
|  the  final   20...|    28|
|          011018__42|    27|
|        Lee Kleinman|    26|
+--------------------+------+
only showing top 10 rows

+----------+--------------------+-------------------+------+
|      DATE|               TITLE|         VOTER_NAME|ROW_ID|
+----------+--------------------+-------------------+------+
|11/20/2018|       Councilmember|      Mark  Clayton| 43911|
|11/20/2018|       Councilmember|     Tennell Atkins| 43910|
|11/20/2018|       Councilmember|       Kevin Felder| 43909|
|11/20/2018|       Councilmember|       Omar Narvaez| 43908|
|11/20/2018|       Councilmember|Rickey D.  Callahan| 43907|
|11/20/2018|       Mayor Pro Tem|      Casey  Thomas| 43906|
|11/2

# More ID tricks

Once you define a Spark process, you'll likely want to use it many times. Depending on your needs, you may want to start your IDs at a certain value so there isn't overlap with previous runs of the Spark task. This behavior is similar to how IDs would behave in a relational database. You have been given the task to make sure that the IDs output from a monthly Spark task start at the highest value from the previous month.

In [42]:
df = voter_df_single.withColumnRenamed("DATE","NEWDATE")
df = df.withColumn("DATE", F.to_date(F.col("NEWDATE"), "MM/dd/yyyy"))
df = df.drop('NEWDATE')
print(df.dtypes)

# Split the data by month of the parsed DATE column
voter_df_march = df.filter(F.month(F.col("DATE")) == 3)
voter_df_april = df.filter(F.month(F.col("DATE")) == 4)
# Assign fresh IDs to the March rows
voter_df_march = voter_df_march.withColumn('ROW_ID', F.monotonically_increasing_id())
voter_df_march.show(3)

[('TITLE', 'string'), ('VOTER_NAME', 'string'), ('ROW_ID', 'bigint'), ('DATE', 'date')]
+-------------+----------------+------+----------+
|        TITLE|      VOTER_NAME|ROW_ID|      DATE|
+-------------+----------------+------+----------+
|Councilmember|   Scott  Griggs|     0|2018-03-21|
|Councilmember|B. Adam  McGough|     1|2018-03-21|
|Councilmember| Lee M. Kleinman|     2|2018-03-21|
+-------------+----------------+------+----------+
only showing top 3 rows



In [61]:

# Highest ROW_ID assigned in March
previous_max_ID = voter_df_march.agg({'ROW_ID': 'max'}).first()[0]
previous_max_ID

1109

In [62]:
# Add a ROW_ID column to voter_df_april starting at the desired value
voter_df_april = voter_df_april.withColumn('ROW_ID', previous_max_ID + F.monotonically_increasing_id())

# Show the ROW_ID from both DataFrames and compare
voter_df_march.select('ROW_ID').show(5)
voter_df_april.select('ROW_ID').show(5)

+------+
|ROW_ID|
+------+
|     0|
|     1|
|     2|
|     3|
|     4|
+------+
only showing top 5 rows

+------+
|ROW_ID|
+------+
|  1109|
|  1110|
|  1111|
|  1112|
|  1113|
+------+
only showing top 5 rows
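
The output above shows April starting at 1109, which is also the highest March ID. If strictly non-overlapping IDs are wanted, one option is to offset by the previous maximum plus one. The plain-Python sketch below illustrates the difference; it assumes the March IDs are the contiguous range 0 to 1109, which holds only when the data sits in a single partition:

```python
previous_max_id = 1109                        # max ROW_ID reported for March
march_ids = set(range(previous_max_id + 1))   # assumed contiguous: 0..1109

# Offsetting by the maximum itself reuses that value for the first new row.
april_ids_same = [previous_max_id + i for i in range(5)]
# Offsetting by maximum + 1 starts just past the previous run.
april_ids_next = [previous_max_id + 1 + i for i in range(5)]

print(april_ids_same, march_ids & set(april_ids_same))  # [1109, ..., 1113] {1109}
print(april_ids_next, march_ids & set(april_ids_next))  # [1110, ..., 1114] set()
```

In the Spark code, the equivalent change would be to add `previous_max_ID + 1` to `F.monotonically_increasing_id()`; whether the shared boundary value matters depends on how the IDs are used downstream.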

