**Lazy Evaluation**

Lazy evaluation means that Spark waits until the very last moment to execute the graph of computation instructions. Instead of modifying the data as soon as you express an operation, you build up a plan of transformations to apply to your source data. By deferring execution until an action requires a result, Spark can compile this plan from your raw DataFrame transformations into a streamlined physical plan that runs as efficiently as possible across the cluster. This is a major benefit because Spark can optimize the entire data flow from end to end. One example is predicate pushdown on DataFrames. If we build a large Spark job but specify a filter at the end that only requires one row from the source data, the most efficient way to execute it is to access just the single record we need. Spark optimizes this for us by pushing the filter down automatically.
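As a rough illustration of the laziness described above, the sketch below builds a small plan, prints it with `explain()`, and only then triggers execution with an action. The in-memory `range` data, the column name `doubled`, and the app name are assumptions made up for this example; in the notebook, `spark` already exists and `getOrCreate()` simply returns it.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a throwaway local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("lazy-eval-sketch").getOrCreate()

// Hypothetical in-memory data standing in for a real source.
val df = spark.range(0, 1000).toDF("id")

// Transformations only describe the plan; no job runs here.
val planned = df
  .withColumn("doubled", col("id") * 2)
  .filter(col("id") === 42)

// explain() prints the physical plan without executing it.
planned.explain()

// An action such as collect() is what actually triggers execution.
val rows = planned.collect()
```

With a file-based source, the scan node of the printed plan has a `PushedFilters` field (visible in the `explain()` output further down in this notebook), which is where filters handed to the reader would be listed.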

You can monitor the progress of a job through the Spark web UI, which is available on port 4040 of the driver node by default. If you are running in local mode, this will be http://localhost:4040.

In [1]:
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

Intitializing Scala interpreter ...

Spark Web UI available at http://Ramya:4040
SparkContext available as 'sc' (version = 3.0.1, master = local[*], app id = local-1607478600669)
SparkSession available as 'spark'


import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._


In [2]:
// in Scala
val movie_data = spark.read.option("inferSchema", "true").option("header", "true").csv("C:/Users/ramya/Desktop/Santa_Clara_University/Projects/TODO/Moviesallstreaming/MoviesOnStreamingPlatforms_updated.csv")

movie_data: org.apache.spark.sql.DataFrame = [ID: int, Title: string ... 13 more fields]


In [3]:
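val fillColValues = Map("IMDb" -> 0, "Age" -> "all")
// Note: DataFrames are immutable, so na.fill returns a new DataFrame (res0 below);
// movie_data itself is unchanged because the result is not assigned.
movie_data.na.fill(fillColValues)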

fillColValues: scala.collection.immutable.Map[String,Any] = Map(IMDb -> 0, Age -> all)
res0: org.apache.spark.sql.DataFrame = [ID: int, Title: string ... 13 more fields]


In [4]:
movie_data.take(3)

res1: Array[org.apache.spark.sql.Row] = Array([1,Inception,2010,13+,8.8,87%,1,0,0,0,Christopher Nolan,Action,Adventure,Sci-Fi,Thriller,United States,United Kingdom,English,Japanese,French,148], [2,The Matrix,1999,18+,8.7,87%,1,0,0,0,Lana Wachowski,Lilly Wachowski,Action,Sci-Fi,United States,English,136], [3,Avengers: Infinity War,2018,13+,8.5,84%,1,0,0,0,Anthony Russo,Joe Russo,Action,Adventure,Sci-Fi,United States,English,149])


In [5]:
movie_data.sort("Title").explain()

== Physical Plan ==
*(1) Sort [Title#17 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(Title#17 ASC NULLS FIRST, 200), true, [id=#32]
   +- FileScan csv [ID#16,Title#17,Year#18,Age#19,IMDb#20,Rotten Tomatoes#21,Netflix#22,Hulu#23,Prime Video#24,Disney+#25,Directors#26,Genres#27,Country#28,Language#29,Runtime#30] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/C:/Users/ramya/Desktop/Santa_Clara_University/Projects/TODO/Moviesallstre..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ID:int,Title:string,Year:string,Age:string,IMDb:double,Rotten Tomatoes:string,Netflix:int,...




**SCHEMA**

A schema is a StructType made up of a number of fields, StructFields, each of which has a name, a type, and a Boolean flag that specifies whether the column can contain missing or null values. Users can also optionally attach metadata to a column. The metadata is a way of storing information about the column (Spark uses this in its machine learning library).
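To make the pieces of a schema concrete, here is a minimal sketch that builds one by hand. The three fields and the `note` metadata key are hypothetical and chosen only for illustration; they are not the schema Spark inferred for the movie file.

```scala
import org.apache.spark.sql.types.{IntegerType, Metadata, StringType, StructField, StructType}

// Hypothetical three-column schema; the names are borrowed from the movie dataset.
val manualSchema = StructType(Seq(
  // name, type, and the nullable flag
  StructField("ID", IntegerType, nullable = false),
  StructField("Title", StringType, nullable = true),
  // Optional metadata attached to the column, parsed from a JSON string.
  StructField("Year", IntegerType, nullable = true, Metadata.fromJson("""{"note":"release year"}"""))
))

// Prints the schema as an indented tree.
manualSchema.printTreeString()
```

A schema built this way can be passed to `spark.read.schema(manualSchema)` as an alternative to the `inferSchema` option used earlier, provided it matches the columns of the file being read.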

In [6]:
movie_data.schema

res3: org.apache.spark.sql.types.StructType = StructType(StructField(ID,IntegerType,true), StructField(Title,StringType,true), StructField(Year,StringType,true), StructField(Age,StringType,true), StructField(IMDb,DoubleType,true), StructField(Rotten Tomatoes,StringType,true), StructField(Netflix,IntegerType,true), StructField(Hulu,IntegerType,true), StructField(Prime Video,IntegerType,true), StructField(Disney+,IntegerType,true), StructField(Directors,StringType,true), StructField(Genres,StringType,true), StructField(Country,StringType,true), StructField(Language,StringType,true), StructField(Runtime,IntegerType,true))


In [7]:
spark.conf.set("spark.sql.shuffle.partitions", "5")

**MAX RUN TIME**

In [8]:
movie_data.select(max("Runtime")).take(1)

res5: Array[org.apache.spark.sql.Row] = Array([1256])


**YEARS WITH THE MOST MOVIES OVERALL**

In [9]:
movie_data.groupBy("Year").count().withColumnRenamed("count", "Count_movies").sort(desc("Count_movies")).show(5)

+----+------------+
|Year|Count_movies|
+----+------------+
|2017|        1401|
|2018|        1284|
|2016|        1206|
|2015|        1065|
|2014|         986|
+----+------------+
only showing top 5 rows



**MOVIES PER YEAR ON EACH PLATFORM**

In [10]:
movie_data.groupBy("Year").agg(sum("Netflix"),sum("Hulu"),sum("Prime Video"),sum("Disney+")).sort(desc("Year")).
withColumnRenamed("sum(Netflix)","Movies_Netflix").show(8)

+----+--------------+---------+----------------+------------+
|Year|Movies_Netflix|sum(Hulu)|sum(Prime Video)|sum(Disney+)|
+----+--------------+---------+----------------+------------+
|2020|           104|        6|              31|           9|
|2019|           428|      104|             172|          23|
|2018|           560|      158|             624|          16|
|2017|           569|      124|             763|          22|
|2016|           444|       62|             730|          17|
|2015|           272|       61|             765|          10|
|2014|           174|       46|             783|          12|
|2013|           133|       45|             811|          12|
+----+--------------+---------+----------------+------------+
only showing top 8 rows



**AVERAGE RATING BY AGE GROUP AND YEAR ON NETFLIX**

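The $"..." syntax marks a string as a column reference rather than a plain string, so $"Netflix" is shorthand for col("Netflix") and can be used to build expressions. It relies on spark.implicits._ being in scope.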
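As a small sketch of how this compares with the other ways of referring to a column, the three filters below should be equivalent. The tiny DataFrame and the app name are made up for illustration; in the notebook, `spark` already exists.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

// Assumption: a throwaway local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("column-ref-sketch").getOrCreate()
import spark.implicits._ // brings the $"..." interpolator and toDF into scope

// Hypothetical rows: a title and a 0/1 Netflix flag.
val df = Seq(("A", 1), ("B", 0), ("C", 1)).toDF("Title", "Netflix")

// Three ways to express the same predicate on the Netflix column.
val viaDollar = df.filter($"Netflix" === 1).count()
val viaCol    = df.filter(col("Netflix") === 1).count()
val viaExpr   = df.filter(expr("Netflix = 1")).count()
```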

In [11]:
movie_data.filter($"Netflix"=== 1).groupBy($"Age",$"Year").agg(round(mean("IMDb"),2)).withColumnRenamed("round(avg(IMDb), 2)","Mean_rating").sort(desc("Mean_rating")).show(5)

+---+----+-----------+
|Age|Year|Mean_rating|
+---+----+-----------+
|18+|1966|        8.8|
|18+|1989|        8.5|
|13+|1968|        8.5|
| 7+|1985|        8.5|
| 7+|1981|        8.4|
+---+----+-----------+
only showing top 5 rows



**WHICH MOVIE IS PRESENT IN ALL PLATFORMS**

In [12]:
val movie_data_1=movie_data.withColumn("No_of_platforms",col("Netflix")+col("Hulu")+col("Prime Video")+ col("Disney+"))

movie_data_1: org.apache.spark.sql.DataFrame = [ID: int, Title: string ... 14 more fields]


No movie is present on all 4 platforms, but the ones below are present on 3 of them.

select and selectExpr allow you to do the DataFrame equivalent of SQL queries on a table of data.
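The difference between the two can be sketched as follows: `select` takes column names or Column expressions, while `selectExpr` takes SQL expression strings. The two-row DataFrame here is hypothetical and only mirrors the platform-flag columns of the movie data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a throwaway local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("select-sketch").getOrCreate()
import spark.implicits._

// Hypothetical rows: a title plus 0/1 flags for two platforms.
val df = Seq(("A", 1, 0), ("B", 1, 1)).toDF("Title", "Netflix", "Hulu")

// select with Column expressions.
val viaSelect = df.select(col("Title"), (col("Netflix") + col("Hulu")).alias("No_of_platforms"))

// selectExpr with the same logic written as SQL strings.
val viaSelectExpr = df.selectExpr("Title", "Netflix + Hulu AS No_of_platforms")
```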

In [13]:
movie_data_1.filter(col("No_of_platforms")===3).select("Title","IMDb","Rotten Tomatoes").show()

+--------------------+----+---------------+
|               Title|IMDb|Rotten Tomatoes|
+--------------------+----+---------------+
|                 Amy| 7.8|            95%|
|          The Square| 8.1|           100%|
|       The Interview| 6.5|            52%|
|              Blame!| 6.7|            82%|
|           Evolution| 6.1|            43%|
|No Game No Life: ...| 7.5|           null|
|              Zapped| 5.1|           null|
|              Mother| 5.6|           null|
|             The Kid| 5.9|            45%|
|          Inside Out| 4.5|            25%|
+--------------------+----+---------------+



**TOP-RATED MOVIES ON IMDB FOR EACH YEAR**

In [14]:
val w = Window.partitionBy("Year")

w: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@2712c334


In [15]:
movie_data_1.withColumn("max_R", max("IMDb").over(w)).filter($"IMDb" === $"max_R").select("Year","Title","IMDb","Rotten Tomatoes").sort(desc("Year")).show(false)

+----+-------------------------------------------------------------------------------------------+----+---------------+
|Year|Title                                                                                      |IMDb|Rotten Tomatoes|
+----+-------------------------------------------------------------------------------------------+----+---------------+
|2020|Sufna                                                                                      |8.2 |null           |
|2019|My Next Guest with David Letterman and Shah Rukh Khan                                      |9.3 |null           |
|2019|Square One                                                                                 |9.3 |null           |
|2018|Operation Toussaint: Operation Underground Railroad and the Fight to End Modern Day Slavery|8.8 |null           |
|2017|Where's Daddy?                                                                             |9.1 |null           |
|2016|Natsamrat                         

**TOP-RATED MOVIES ON IMDB FOR EACH YEAR ON NETFLIX**

In [16]:
movie_data_1.filter($"Netflix"=== 1).withColumn("max_R", max("IMDb").over(w)).filter($"IMDb" === $"max_R").select("Year","Title","IMDb","Rotten Tomatoes").sort(desc("Year")).show(false)

+----+---------------------------------------------------------------------+----+---------------+
|Year|Title                                                                |IMDb|Rotten Tomatoes|
+----+---------------------------------------------------------------------+----+---------------+
|2020|A Secret Love                                                        |8.1 |null           |
|2019|My Next Guest with David Letterman and Shah Rukh Khan                |9.3 |null           |
|2018|Untamed Romania                                                      |8.7 |null           |
|2017|One Heart: The A.R. Rahman Concert Film                              |8.7 |null           |
|2016|Natsamrat                                                            |9.1 |null           |
|2015|Eh Janam Tumhare Lekhe                                               |8.7 |null           |
|2014|Punjab 1984                                                          |8.5 |null           |
|2013|Bo Burnham: Wh

**COUNT OF MOVIES ON NETFLIX AND HULU, YEAR-WISE AFTER 2000, WITH A RATING OF 5 OR ABOVE**

In [17]:
movie_data_1.where(col("Year") > 2000).where(col("IMDb") >= 5).groupBy($"Year").agg(sum($"Netflix"),sum($"Hulu")).sort(desc("Year")).show()

+----+------------+---------+
|Year|sum(Netflix)|sum(Hulu)|
+----+------------+---------+
|2020|          82|        4|
|2019|         337|       90|
|2018|         468|      132|
|2017|         481|      104|
|2016|         370|       51|
|2015|         235|       50|
|2014|         147|       32|
|2013|         118|       36|
|2012|          92|       29|
|2011|          69|       27|
|2010|          74|       22|
|2009|          57|       26|
|2008|          56|       23|
|2007|          30|        5|
|2006|          40|       10|
|2005|          31|       12|
|2004|          23|        5|
|2003|          28|        5|
|2002|          21|        9|
|2001|          22|        4|
+----+------------+---------+



**COUNT OF DISTINCT DIRECTORS ENTRIES AFTER 2000**

In [18]:
movie_data_1.where(col("Year") > 2000).select("Directors").distinct().count()

res13: Long = 9213


**WHICH YEAR HAD THE MOST MOVIES ACROSS ALL THE PLATFORMS TOGETHER**

In [19]:
movie_data_1.where(col("Year") > 2000).groupBy($"Year").
agg(sum($"Netflix"),sum($"Hulu"),sum("Prime Video"),sum("Disney+")).
withColumn("Count_of_all", col("sum(Netflix)")+col("sum(Hulu)")+col("sum(Prime Video)")+col("sum(Disney+)")).
sort(desc("Count_of_all")).select(col("Year"),col("Count_of_all")).show(2)

+----+------------+
|Year|Count_of_all|
+----+------------+
|2017|        1478|
|2018|        1358|
+----+------------+
only showing top 2 rows



**WHICH YEAR HAD THE HIGHEST AVERAGE RATING ACROSS ALL THE PLATFORMS TOGETHER**

In [20]:
movie_data.where(col("Year") > 2000).groupBy("Year").
agg(mean("IMDb")).
withColumnRenamed("avg(IMDb)","Average_rating").
sort(desc("Average_rating")).select(col("Year"),col("Average_rating")).show(2)

+----+-----------------+
|Year|   Average_rating|
+----+-----------------+
|2020|6.175000000000001|
|2002|6.065608465608464|
+----+-----------------+
only showing top 2 rows



**HOW DID THE DURATION OF MOVIES CHANGE ACROSS YEARS: AVERAGE DURATION OF HIGHLY RATED AND LOW-RATED MOVIES**

In [21]:
// Note: despite the "Avg_rating_..." name, this column holds the mean Runtime
// (rounded to 2 decimals) of movies with IMDb >= 5.
val movie_h=movie_data.withColumn("High_or_low", when(col("IMDb")>=5, "Highly_rated").otherwise("Low_rated")).
where(col("High_or_low")==="Highly_rated").groupBy("Year").agg(round(mean("Runtime"),2)).
withColumnRenamed("round(avg(Runtime), 2)","Avg_rating_of_highly_rated_movies").sort(desc(("Year")))

movie_h: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Year: string, Avg_rating_of_highly_rated_movies: double]


In [22]:
// Note: as above, this column holds the mean Runtime, here for movies that are not rated 5 or above.
val movie_l=movie_data.withColumn("High_or_low", when(col("IMDb")>=5, "Highly_rated").otherwise("Low_rated")).
where(col("High_or_low")==="Low_rated").groupBy("Year").agg(round(mean("Runtime"),2)).
withColumnRenamed("round(avg(Runtime), 2)","Avg_rating_of_low_rated_movies").sort(desc(("Year")))

movie_l: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Year: string, Avg_rating_of_low_rated_movies: double]


In [23]:
movie_h.join(movie_l, "Year").sort(desc("Year")).show()

+----+---------------------------------+------------------------------+
|Year|Avg_rating_of_highly_rated_movies|Avg_rating_of_low_rated_movies|
+----+---------------------------------+------------------------------+
|2020|                            93.69|                         95.16|
|2019|                            94.54|                         89.04|
|2018|                            97.88|                         85.96|
|2017|                            96.01|                         89.42|
|2016|                            94.74|                         91.97|
|2015|                            94.21|                          89.1|
|2014|                            94.63|                         88.95|
|2013|                            92.36|                         88.97|
|2012|                            91.61|                         87.49|
|2011|                            94.53|                         86.86|
|2010|                            93.12|                        

**ENGLISH AND NON ENGLISH MOVIES AVERAGE RATING**

In [24]:
val descripFilter = lower(col("Language")).contains("english")
movie_data.where(descripFilter).agg(mean(col("IMDb"))).show()

+-----------------+
|        avg(IMDb)|
+-----------------+
|5.822641363881808|
+-----------------+



descripFilter: org.apache.spark.sql.Column = contains(lower(Language), english)


In [25]:
movie_data.where(!descripFilter).agg(mean(col("IMDb"))).show()

+-----------------+
|        avg(IMDb)|
+-----------------+
|6.260852713178291|
+-----------------+



**NUMBER OF MOVIES THAT MATCH A SET OF CRITERIA**

In [26]:
val descripFilter = lower(col("Language")).contains("english")
val countfilter= lower(col("Country")).contains("united states")
val genrefilter= upper(col("Genres")).contains("ACTION")

descripFilter: org.apache.spark.sql.Column = contains(lower(Language), english)
countfilter: org.apache.spark.sql.Column = contains(lower(Country), united states)
genrefilter: org.apache.spark.sql.Column = contains(upper(Genres), ACTION)


In [27]:
movie_data_1.withColumn("Criteria", countfilter.and((descripFilter.or(genrefilter))))
  .where("Criteria")
  .count()

res19: Long = 10303


**CORRELATION DURATION AND RATING**

In [28]:
import org.apache.spark.sql.functions.{corr}
movie_data.stat.corr("Runtime", "IMDb")

import org.apache.spark.sql.functions.corr
res20: Double = 0.22394511701216854


**HOW MANY UNIQUE GENRES**

In [29]:
val movie_data_2=movie_data.withColumn("Genres_all",split(($"Genres"),",")).drop("Genres")

movie_data_2: org.apache.spark.sql.DataFrame = [ID: int, Title: string ... 13 more fields]


In [30]:
movie_data_2.select(explode($"Genres_all")).distinct().count()

res21: Long = 27


**HOW MANY UNIQUE LANGUAGES**

In [31]:
movie_data.withColumn("Lang_all",split(($"Language"),",")).drop("Language").select(explode($"Lang_all")).distinct().count()

res22: Long = 178


**MOVIES PER LANGUAGE**

In [32]:
movie_data.withColumn("Lang_all",explode(split(($"Language"),","))).drop("Language").groupBy("Lang_all").count().sort(desc("count")).show(5)

+--------+-----+
|Lang_all|count|
+--------+-----+
| English|13233|
| Spanish|  872|
|  French|  799|
|   Hindi|  731|
|  German|  483|
+--------+-----+
only showing top 5 rows



**MOST COMMON GENRE PER LANGUAGE**

In [33]:
val gen_lan=movie_data.withColumn("Lang_all",explode(split(($"Language"),","))).withColumn("Genre_all",explode(split(($"Genres"),","))).
withColumn("Number", lit(1)).select("Lang_all","Genre_all","Number")

gen_lan: org.apache.spark.sql.DataFrame = [Lang_all: string, Genre_all: string ... 1 more field]


In [34]:
val w = Window.partitionBy("Lang_all")
gen_lan.groupBy("Lang_all","Genre_all").count().withColumn("Max_count", max("count").over(w)).
where(col("count") === col("Max_count")).sort(desc("Max_count")).select("Lang_all","Genre_all").
show(5)

+--------+---------+
|Lang_all|Genre_all|
+--------+---------+
| English|    Drama|
|   Hindi|    Drama|
|  French|    Drama|
| Spanish|    Drama|
|  German|    Drama|
+--------+---------+
only showing top 5 rows



w: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@38021de4


**OTHER THAN THE US, AUSTRALIA AND THE UK, WHICH COUNTRIES HAVE ENGLISH MOVIES**

In [35]:
val filter1 = not(lower(col("Country")).contains("united states"))
val filter2 = not(lower(col("Country")).contains("kingdom"))
val filter3 = not(lower(col("Country")).contains("australia"))
val filter4 = lower(col("Language")).contains("english")

filter1: org.apache.spark.sql.Column = (NOT contains(lower(Country), united states))
filter2: org.apache.spark.sql.Column = (NOT contains(lower(Country), kingdom))
filter3: org.apache.spark.sql.Column = (NOT contains(lower(Country), australia))
filter4: org.apache.spark.sql.Column = contains(lower(Language), english)


In [36]:
val movies= movie_data.withColumn("Criteria", filter4.and(filter1.and(filter2.and(filter3)))).where("Criteria").withColumn("Lang_all",explode(split(($"Language"),","))).withColumn("Country_all",explode(split(($"Country"),",")))
.select("Lang_all","Country_all").filter(col("Lang_all")==="English").groupBy("Country_all").count().sort(desc("count"))

movies: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Country_all: string, count: bigint]


In [37]:
movies.show(5)

+-----------+-----+
|Country_all|count|
+-----------+-----+
|     Canada|  597|
|     France|  171|
|      India|  150|
|    Germany|  136|
|      Italy|  107|
+-----------+-----+
only showing top 5 rows



**WHICH MOVIE IS AVAILABLE IN THE MOST LANGUAGES**

In [38]:
 val movie_data_3=movie_data.withColumn("Lang_all",size(split(($"Language"),",")))

movie_data_3: org.apache.spark.sql.DataFrame = [ID: int, Title: string ... 14 more fields]


In [39]:
val max_count=movie_data_3.agg(max("Lang_all"))

max_count: org.apache.spark.sql.DataFrame = [max(Lang_all): int]


In [40]:
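// 10 is hardcoded here; max_count above is only a lazy DataFrame, so call max_count.show() to display the actual maximum.
movie_data_3.filter(col("Lang_all") === 10).select("Title").show()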

+-----+
|Title|
+-----+
| 2012|
+-----+



**HOW MANY MOVIES HAVE ENGLISH AS A TRANSLATED LANGUAGE (not the first name in the list)**

In [41]:
movie_data.withColumn("Lang_all",size(split(($"Language"),","))).filter(col("Lang_all")>1).
withColumn("xyz",(split(($"Language"),","))).withColumn("slice", slice($"xyz", 2, 10)).
where(array_contains(col("slice"),"English")).count()



res27: Long = 705


**REGEX REPLACE USAGE**

In [42]:
import org.apache.spark.sql.functions.regexp_replace
val simpleColors = Seq("black", "white", "red", "green", "blue")
val regexString = simpleColors.map(_.toUpperCase).mkString("|")

import org.apache.spark.sql.functions.regexp_replace
simpleColors: Seq[String] = List(black, white, red, green, blue)
regexString: String = BLACK|WHITE|RED|GREEN|BLUE


In [43]:
movie_data.select(regexp_replace(col("Title"), regexString, "COLOR").alias("color_clean"),col("Title")).show(2)

+-----------+----------+
|color_clean|     Title|
+-----------+----------+
|  Inception| Inception|
| The Matrix|The Matrix|
+-----------+----------+
only showing top 2 rows



**HOW MANY CRIME OR THRILLER MOVIES EACH LANGUAGE HAS**

In [44]:
val filter1 = lower(col("Genres")).contains("crime")
val filter2 = lower(col("Genres")).contains("thriller")

filter1: org.apache.spark.sql.Column = contains(lower(Genres), crime)
filter2: org.apache.spark.sql.Column = contains(lower(Genres), thriller)


In [45]:
movie_data.withColumn("Criteria", filter1.or(filter2)).where("Criteria").
withColumn("Lang_all",explode(split(($"Language"),","))).select("Lang_all","Criteria").groupBy("Lang_all").agg(count("Criteria")).show(5)

+---------+---------------+
| Lang_all|count(Criteria)|
+---------+---------------+
|  English|           3518|
|Afrikaans|              8|
|    Xhosa|              2|
| Hawaiian|              1|
|  Swedish|             26|
+---------+---------------+
only showing top 5 rows



**TOP GENRE BY AVERAGE RATING PER LANGUAGE**

In [47]:
val gen_lan=movie_data.withColumn("Lang_all",explode(split(($"Language"),","))).
withColumn("Genre_all",explode(split(($"Genres"),","))).select("Lang_all","Genre_all","IMDb").
groupBy("Genre_all","Lang_all").agg(round(mean("IMDb"),2).alias("Rating"))

gen_lan: org.apache.spark.sql.DataFrame = [Genre_all: string, Lang_all: string ... 1 more field]


In [51]:
val w = Window.partitionBy("Lang_all")

gen_lan.withColumn("Max_rating", max("Rating").over(w)).where(col("Rating") ===  col("Max_rating")).sort(desc("Max_rating")).select("Lang_all","Genre_all").show(5)

+---------+---------+
| Lang_all|Genre_all|
+---------+---------+
|   Polish|    Sport|
|  Bosnian|  History|
|  Bosnian|   Family|
|    Xhosa|     News|
|Afrikaans|    Music|
+---------+---------+
only showing top 5 rows



w: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@571dd030


**WHAT ARE THE HIGHEST-RATED KIDS' MOVIES EVERY YEAR**

In [60]:
val gen_lan=movie_data.filter(col("Age")==="7+")

gen_lan: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: int, Title: string ... 13 more fields]


In [63]:
val w = Window.partitionBy("Year")
gen_lan.withColumn("Max_rating",max("IMDb").over(w)).
where(col("IMDb") === col("Max_rating")).sort(desc("Max_rating")).select("Title","Year","IMDb").
show(5,false)

+----------------------------------+----+----+
|Title                             |Year|IMDb|
+----------------------------------+----+----+
|Untamed Romania                   |2018|8.7 |
|Star Wars: The Empire Strikes Back|1980|8.7 |
|Slednecks 13                      |2010|8.6 |
|Star Wars: A New Hope             |1977|8.6 |
|Fabulous Frogs                    |2014|8.6 |
+----------------------------------+----+----+
only showing top 5 rows



w: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@69682265
