# Manipulating Data in DataFrames

In this lecture we will learn how to manipulate data in DataFrames. You will need these techniques for tasks such as:

 - Change data types when they are incorrectly interpreted
 - Clean your data
 - Create new columns
 - Rename columns
 - Extract or create new values
 
We will also cover how to manipulate arrays.

#### So let's get started!

First, we create our Spark session, as we do at the start of every project.

In [2]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Manipulate").getOrCreate()
spark

In [4]:
names = spark.createDataFrame([('Abraham', 'Lincoln')],['first_name','last_name'])
names.show()

+----------+---------+
|first_name|last_name|
+----------+---------+
|   Abraham|  Lincoln|
+----------+---------+



In [5]:
names.rdd.id()

10

In [6]:
from pyspark.sql.functions import *
names = names.select(names.first_name, names.last_name, concat_ws(' ', names.first_name, names.last_name).alias('full_name'))
names.show()

+----------+---------+---------------+
|first_name|last_name|      full_name|
+----------+---------+---------------+
|   Abraham|  Lincoln|Abraham Lincoln|
+----------+---------+---------------+



In [7]:
# The RDD id changed: select() returned a new DataFrame instead of modifying names in place.
names.rdd.id()

18

In [9]:
path = 'C:/spark/PySpark_Essentials/data/'
videos = spark.read.csv(path + 'youtubevideos.csv', inferSchema=True, header=True)

In [10]:
videos.limit(4).toPandas()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"""last week tonight trump presidency""|""last wee...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"""racist superman""|""rudy""|""mancuso""|""king""|""bac...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"""rhett and link""|""gmm""|""good mythical morning""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...


In [11]:
videos.printSchema()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: string (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- publish_time: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: string (nullable = true)
 |-- likes: string (nullable = true)
 |-- dislikes: string (nullable = true)
 |-- comment_count: string (nullable = true)
 |-- thumbnail_link: string (nullable = true)
 |-- comments_disabled: string (nullable = true)
 |-- ratings_disabled: string (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)



In [12]:
from pyspark.sql.types import *

In [25]:
# Cast the count columns to integers and parse trending_date (stored as year.day.month)
df = (videos
      .withColumn("views", videos["views"].cast(IntegerType()))
      .withColumn("likes", videos["likes"].cast(IntegerType()))
      .withColumn("dislikes", videos["dislikes"].cast(IntegerType()))
      .withColumn("trending_date", to_date(videos.trending_date, 'yy.dd.MM')))
# publish_time still contains the 'T' and 'Z' markers, so it is converted in the cleaning section below

In [26]:
df.printSchema()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: date (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- publish_time: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: integer (nullable = true)
 |-- likes: integer (nullable = true)
 |-- dislikes: integer (nullable = true)
 |-- comment_count: string (nullable = true)
 |-- thumbnail_link: string (nullable = true)
 |-- comments_disabled: string (nullable = true)
 |-- ratings_disabled: string (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)



In [27]:
df.limit(4).toPandas()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,2017-11-14,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,2017-11-14,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"""last week tonight trump presidency""|""last wee...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,2017-11-14,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"""racist superman""|""rudy""|""mancuso""|""king""|""bac...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,2017-11-14,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"""rhett and link""|""gmm""|""good mythical morning""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
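The `'yy.dd.MM'` pattern tells `to_date` to read the raw value as two-digit year, then day, then month, which is why `17.14.11` shows up as `2017-11-14`. The snippet below is only a plain-Python analogue of that field order: it uses `datetime.strptime` directives rather than Spark's pattern letters, so treat it as a way to reason about the pattern, not as Spark code.

```python
from datetime import datetime

# Raw trending_date value as it appears in the CSV: year.day.month
raw = "17.14.11"

# %y.%d.%m is the strptime counterpart of the 'yy.dd.MM' field order
parsed = datetime.strptime(raw, "%y.%d.%m").date()
print(parsed)

# Reading the same text as year.month.day fails, because there is no month 14
try:
    datetime.strptime(raw, "%y.%m.%d")
    swapped_ok = True
except ValueError:
    swapped_ok = False
print(swapped_ok)
```

Getting the order of the pattern letters right matters more than anything else here: a pattern that does not match the data does not produce the date you expect.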


## Cleaning data

In [28]:
df = df.withColumn('publish_time_2', regexp_replace(df.publish_time, 'T', ' '))

In [29]:
df = df.withColumn('publish_time_2', regexp_replace(df.publish_time_2, 'Z', ''))

In [30]:
df.select('publish_time_2').show(5, False)

+-----------------------+
|publish_time_2         |
+-----------------------+
|2017-11-13 17:13:01.000|
|2017-11-13 07:30:00.000|
|2017-11-12 19:05:24.000|
|2017-11-13 11:00:04.000|
|2017-11-12 18:01:41.000|
+-----------------------+
only showing top 5 rows



In [31]:
df = df.withColumn("publish_time_3", to_timestamp(df.publish_time_2, 'yyyy-MM-dd HH:mm:ss.SSS'))

In [32]:
df.printSchema()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: date (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- publish_time: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: integer (nullable = true)
 |-- likes: integer (nullable = true)
 |-- dislikes: integer (nullable = true)
 |-- comment_count: string (nullable = true)
 |-- thumbnail_link: string (nullable = true)
 |-- comments_disabled: string (nullable = true)
 |-- ratings_disabled: string (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)
 |-- publish_time_2: string (nullable = true)
 |-- publish_time_3: timestamp (nullable = true)



In [37]:
df.select("publish_time","publish_time_2","publish_time_3").show(5,False)

+------------------------+-----------------------+-------------------+
|publish_time            |publish_time_2         |publish_time_3     |
+------------------------+-----------------------+-------------------+
|2017-11-13T17:13:01.000Z|2017-11-13 17:13:01.000|2017-11-13 17:13:01|
|2017-11-13T07:30:00.000Z|2017-11-13 07:30:00.000|2017-11-13 07:30:00|
|2017-11-12T19:05:24.000Z|2017-11-12 19:05:24.000|2017-11-12 19:05:24|
|2017-11-13T11:00:04.000Z|2017-11-13 11:00:04.000|2017-11-13 11:00:04|
|2017-11-12T18:01:41.000Z|2017-11-12 18:01:41.000|2017-11-12 18:01:41|
+------------------------+-----------------------+-------------------+
only showing top 5 rows
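The three steps above (replace `T`, drop `Z`, parse with a millisecond pattern) can be reasoned about on a single value in plain Python. This is an analogue under the assumption that the strings always have the shape shown in the output; it uses `str.replace` and `datetime.strptime`, not the Spark functions.

```python
from datetime import datetime

raw = "2017-11-13T17:13:01.000Z"

# Step 1 and 2: same idea as the two regexp_replace calls
cleaned = raw.replace("T", " ").replace("Z", "")

# Step 3: %f accepts the three-digit fraction, like .SSS in the Spark pattern
parsed = datetime.strptime(cleaned, "%Y-%m-%d %H:%M:%S.%f")

print(cleaned)
print(parsed)
```

One thing to keep in mind: in ISO 8601 the trailing `Z` marks the time as UTC, so removing it also removes the time zone information from the string.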



### Translate Function

In [39]:
df.select("publish_time", translate(col("publish_time"), "TZ", " ").alias('trans')).show(5, False)

+------------------------+-----------------------+
|publish_time            |trans                  |
+------------------------+-----------------------+
|2017-11-13T17:13:01.000Z|2017-11-13 17:13:01.000|
|2017-11-13T07:30:00.000Z|2017-11-13 07:30:00.000|
|2017-11-12T19:05:24.000Z|2017-11-12 19:05:24.000|
|2017-11-13T11:00:04.000Z|2017-11-13 11:00:04.000|
|2017-11-12T18:01:41.000Z|2017-11-12 18:01:41.000|
+------------------------+-----------------------+
only showing top 5 rows
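In the call above, `"TZ"` is the set of characters to match and `" "` is the replacement string. The output shows that `T` becomes a space while `Z`, which has no counterpart in the shorter replacement string, is removed. A minimal plain-Python sketch of that character-by-character mapping follows; `translate_like` is a hypothetical helper written for illustration, not part of PySpark.

```python
def translate_like(text, matching, replace):
    """Map matching[i] to replace[i]; characters without a counterpart are deleted."""
    mapping = {}
    for i, ch in enumerate(matching):
        if ord(ch) in mapping:
            continue  # keep the first rule for a repeated character
        mapping[ord(ch)] = replace[i] if i < len(replace) else None
    return text.translate(mapping)

result = translate_like("2017-11-13T17:13:01.000Z", "TZ", " ")
print(result)
```

Compared with the two `regexp_replace` calls in the cleaning section, `translate` does both substitutions in one pass, but it works on single characters only.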



In [40]:
# Trim
df = df.withColumn('title', trim(df.title))

In [41]:
df.select('title').show(4, False)

+--------------------------------------------------------------+
|title                                                         |
+--------------------------------------------------------------+
|WE WANT TO TALK ABOUT OUR MARRIAGE                            |
|The Trump Presidency: Last Week Tonight with John Oliver (HBO)|
|Racist Superman | Rudy Mancuso, King Bach & Lele Pons         |
|Nickelback Lyrics: Real or Fake?                              |
+--------------------------------------------------------------+
only showing top 4 rows



In [44]:
# Lower
df.select('title', lower(df.title)).show(5, False)

+--------------------------------------------------------------+--------------------------------------------------------------+
|title                                                         |lower(title)                                                  |
+--------------------------------------------------------------+--------------------------------------------------------------+
|WE WANT TO TALK ABOUT OUR MARRIAGE                            |we want to talk about our marriage                            |
|The Trump Presidency: Last Week Tonight with John Oliver (HBO)|the trump presidency: last week tonight with john oliver (hbo)|
|Racist Superman | Rudy Mancuso, King Bach & Lele Pons         |racist superman | rudy mancuso, king bach & lele pons         |
|Nickelback Lyrics: Real or Fake?                              |nickelback lyrics: real or fake?                              |
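|I Dare You: GOING BALD!?                                      |i dare you: going bald!?                                      |
+--------------------------------------------------------------+--------------------------------------------------------------+
only showing top 5 rows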

### Case When

In [67]:
print("Option#1: select or withColumn() using when-otherwise")
from pyspark.sql.functions import when
df.select("likes","dislikes",(when(df.likes > df.dislikes, 'Good').when(df.likes < df.dislikes, 'Bad').otherwise('Undetermined')).alias("Favorability")).show(3)

print("Option2: select or withColumn() using expr function")
from pyspark.sql.functions import expr 
df.select("likes","dislikes",expr("CASE WHEN likes > dislikes THEN  'Good' WHEN likes < dislikes THEN 'Bad' ELSE 'Undetermined' END AS Favorability")).show(3)

print("Option 3: selectExpr() using SQL equivalent CASE expression")
df.selectExpr("likes","dislikes","CASE WHEN likes > dislikes THEN  'Good' WHEN likes < dislikes THEN 'Bad' ELSE 'Undetermined' END AS Favorability").show(3)

Option#1: select or withColumn() using when-otherwise
+------+--------+------------+
| likes|dislikes|Favorability|
+------+--------+------------+
| 57527|    2966|        Good|
| 97185|    6146|        Good|
|146033|    5339|        Good|
+------+--------+------------+
only showing top 3 rows

Option2: select or withColumn() using expr function
+------+--------+------------+
| likes|dislikes|Favorability|
+------+--------+------------+
| 57527|    2966|        Good|
| 97185|    6146|        Good|
|146033|    5339|        Good|
+------+--------+------------+
only showing top 3 rows

Option 3: selectExpr() using SQL equivalent CASE expression
+------+--------+------------+
| likes|dislikes|Favorability|
+------+--------+------------+
| 57527|    2966|        Good|
| 97185|    6146|        Good|
|146033|    5339|        Good|
+------+--------+------------+
only showing top 3 rows
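All three options express the same rule. One detail worth spelling out is what happens when `likes` or `dislikes` is null: in SQL a comparison with null is neither true nor false, so both `WHEN` branches are skipped and the row falls through to `'Undetermined'`, the same label used for ties. The plain-Python function below mirrors that behaviour; `favorability` is a hypothetical name used only for this sketch.

```python
def favorability(likes, dislikes):
    # A missing value makes both comparisons unknown, so neither WHEN branch applies
    if likes is None or dislikes is None:
        return "Undetermined"
    if likes > dislikes:
        return "Good"
    if likes < dislikes:
        return "Bad"
    # Only ties reach this point
    return "Undetermined"

labels = [
    favorability(57527, 2966),
    favorability(10, 50),
    favorability(7, 7),
    favorability(None, 3),
]
print(labels)
```

If ties and missing values should be told apart, a separate branch testing for nulls would be needed before the comparisons.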



In [48]:
# Concatenate
df.select(concat_ws(' ', df.title, df.channel_title).alias('text')).show(5, False)

+------------------------------------------------------------------------------+
|text                                                                          |
+------------------------------------------------------------------------------+
|WE WANT TO TALK ABOUT OUR MARRIAGE CaseyNeistat                               |
|The Trump Presidency: Last Week Tonight with John Oliver (HBO) LastWeekTonight|
|Racist Superman | Rudy Mancuso, King Bach & Lele Pons Rudy Mancuso            |
|Nickelback Lyrics: Real or Fake? Good Mythical Morning                        |
|I Dare You: GOING BALD!? nigahiga                                             |
+------------------------------------------------------------------------------+
only showing top 5 rows



In [49]:
df.select("trending_date", year("trending_date"), month("trending_date")).show(5)

+-------------+-------------------+--------------------+
|trending_date|year(trending_date)|month(trending_date)|
+-------------+-------------------+--------------------+
|   2017-11-14|               2017|                  11|
|   2017-11-14|               2017|                  11|
|   2017-11-14|               2017|                  11|
|   2017-11-14|               2017|                  11|
|   2017-11-14|               2017|                  11|
+-------------+-------------------+--------------------+
only showing top 5 rows



In [51]:
# difference between dates, in days
df.select("trending_date", "publish_time_3", datediff(df.trending_date, df.publish_time_3).alias("diff")).show(5)

+-------------+-------------------+----+
|trending_date|     publish_time_3|diff|
+-------------+-------------------+----+
|   2017-11-14|2017-11-13 17:13:01|   1|
|   2017-11-14|2017-11-13 07:30:00|   1|
|   2017-11-14|2017-11-12 19:05:24|   2|
|   2017-11-14|2017-11-13 11:00:04|   1|
|   2017-11-14|2017-11-12 18:01:41|   2|
+-------------+-------------------+----+
only showing top 5 rows
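Note that `datediff` compares calendar dates: the time-of-day part of `publish_time_3` is dropped before subtracting. That is why a video published at 19:05 on 2017-11-12 gets a `diff` of 2 against a trending date of 2017-11-14, although less than two full days have passed. The same arithmetic in plain Python, as an illustration only:

```python
from datetime import date, datetime

trending = date(2017, 11, 14)
published = datetime(2017, 11, 12, 19, 5, 24)

# Calendar-day difference: drop the time of day first, like datediff does
calendar_days = (trending - published.date()).days

# Elapsed full days measured from the publish time to midnight of the trending date
elapsed_days = (datetime.combine(trending, datetime.min.time()) - published).days

print(calendar_days, elapsed_days)
```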



In [54]:
# Note: the name `array` shadows pyspark.sql.functions.array brought in by the wildcard import
array = df.select('title', split(df.title, ' ').alias('new'))

In [56]:
array.show(5, False)

+--------------------------------------------------------------+-------------------------------------------------------------------------+
|title                                                         |new                                                                      |
+--------------------------------------------------------------+-------------------------------------------------------------------------+
|WE WANT TO TALK ABOUT OUR MARRIAGE                            |[WE, WANT, TO, TALK, ABOUT, OUR, MARRIAGE]                               |
|The Trump Presidency: Last Week Tonight with John Oliver (HBO)|[The, Trump, Presidency:, Last, Week, Tonight, with, John, Oliver, (HBO)]|
|Racist Superman | Rudy Mancuso, King Bach & Lele Pons         |[Racist, Superman, |, Rudy, Mancuso,, King, Bach, &, Lele, Pons]         |
|Nickelback Lyrics: Real or Fake?                              |[Nickelback, Lyrics:, Real, or, Fake?]                                   |
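|I Dare You: GOING BALD!?                                      |[I, Dare, You:, GOING, BALD!?]                                           |
+--------------------------------------------------------------+-------------------------------------------------------------------------+
only showing top 5 rows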

In [57]:
# array_contains
array.select("title", array_contains(array.new, "MARRIAGE")).show(1, False)

+----------------------------------+-----------------------------+
|title                             |array_contains(new, MARRIAGE)|
+----------------------------------+-----------------------------+
|WE WANT TO TALK ABOUT OUR MARRIAGE|true                         |
+----------------------------------+-----------------------------+
only showing top 1 row
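`array_contains` checks whether the array holds an element equal to the given value, so the match is on whole elements and is case-sensitive. A rough plain-Python analogue on the first title is shown below, assuming a single space as the separator (Spark's `split` takes a regular expression, while `str.split` takes a literal string; for one space they behave the same).

```python
title = "WE WANT TO TALK ABOUT OUR MARRIAGE"

# Same idea as split(df.title, ' ')
words = title.split(" ")

# Membership tests mirror array_contains: whole element, exact case
has_exact = "MARRIAGE" in words
has_lower = "marriage" in words
has_partial = "MARR" in words

print(words)
print(has_exact, has_lower, has_partial)
```

To make the check case-insensitive in Spark, one option is to apply `lower` to the title before splitting and search for the lowercase word.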



In [59]:
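# array_distinct and array_remove were added in Spark 2.4.0.
# On older versions they are not defined, and running these lines raises the NameError shown below.
# array.select('title', array_distinct(array.new)).show(5, False)
# array.select('title', array_remove(array.new, "WE")).show(5, False)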

NameError: name 'array_remove' is not defined

## User Defined Function
- applying custom functions to dataframe columns

In [61]:
from pyspark.sql.functions import udf

In [62]:
def square(x):
    return int(x**2)

In [63]:
square_udf = udf(lambda z: square(z), IntegerType())

In [66]:
df.select('dislikes', square_udf('dislikes')).where(col('dislikes').isNotNull()).show(5)

+--------+------------------+
|dislikes|<lambda>(dislikes)|
+--------+------------------+
|    2966|           8797156|
|    6146|          37773316|
|    5339|          28504921|
|     666|            443556|
|    1989|           3956121|
+--------+------------------+
only showing top 5 rows
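Spark hands null column values to a Python UDF as `None`, and `None ** 2` raises a `TypeError`, so the `square` function above relies on null rows being filtered out before it runs. A null-tolerant variant is sketched below in plain Python; `square_or_none` and `fits_in_integer_type` are hypothetical names, and the function could be passed to `udf` in place of the lambda.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1  # IntegerType is a 32-bit signed integer

def square_or_none(x):
    # Return None for missing input instead of raising a TypeError
    if x is None:
        return None
    return int(x ** 2)

def fits_in_integer_type(value):
    # Helper to see whether a result is within the IntegerType range
    return value is None or INT32_MIN <= value <= INT32_MAX

print(square_or_none(2966))
print(square_or_none(None))
print(fits_in_integer_type(square_or_none(46340)))
print(fits_in_integer_type(square_or_none(46341)))
```

The last two lines show that squares stop fitting into a 32-bit integer once the input exceeds 46340, so for columns with large counts a `LongType()` return type may be the safer choice.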

