# PYSPARK ASSIGNMENT

#### Group 13 - 20BDA15

### Entertainment - Netflix shows Analytics

## 1. Extract: Load the data 

#### Read data as pandas dataframe

In [34]:
import pandas as pd  # data analysis library

In [35]:
pd_data=pd.read_csv("netflix_titles.csv") #reading the csv data using read_csv function in pandas

In [36]:
pd_data.head() #displaying first 5 rows to check if the data is loaded correctly

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


#### Create spark dataframe

In [37]:
from pyspark.sql import SparkSession  # SparkSession is the entry point used to create PySpark RDDs and DataFrames

In [38]:
#Creating a PySpark SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

In [39]:
spark_data= (spark.read.format("csv").options(header="true")
    .load("netflix_titles.csv"))                              #loading the csv file as a spark dataframe

In [40]:
spark_data.printSchema() #to see the schema of spark dataframe  
spark_data.show()

root
 |-- show_id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- title: string (nullable = true)
 |-- director: string (nullable = true)
 |-- cast: string (nullable = true)
 |-- country: string (nullable = true)
 |-- date_added: string (nullable = true)
 |-- release_year: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- listed_in: string (nullable = true)
 |-- description: string (nullable = true)

+-------+-------+--------------------+--------------------+--------------------+--------------------+------------------+------------+------+---------+--------------------+--------------------+
|show_id|   type|               title|            director|                cast|             country|        date_added|release_year|rating| duration|           listed_in|         description|
+-------+-------+--------------------+--------------------+--------------------+--------------------+------------------+------------+

#### Create a table view "netflix_table" for Spark SQL

In [41]:
spark_session = SparkSession.builder.master("local").\
        appName("SparkApplication").\
        config("spark.driver.bindAddress","localhost").\
        config("spark.ui.port","4041").\
        getOrCreate()
sc = spark_session.sparkContext

In [42]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
sqlContext.registerDataFrameAsTable(spark_data, "netflix_table")
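
The `SQLContext` route above works, but recent PySpark releases mark `SQLContext` as deprecated. A minimal sketch of the same step using the DataFrame's own `createOrReplaceTempView` method is shown below. The helper names `register_view` and `select_all` are hypothetical and exist only for illustration; the view name `netflix_table` is the one used in the cell above.

```python
def register_view(df, view_name="netflix_table"):
    """Register a Spark DataFrame as a temporary view and return the view name."""
    # createOrReplaceTempView is a DataFrame method; no SQLContext is needed.
    df.createOrReplaceTempView(view_name)
    return view_name


def select_all(session, view_name="netflix_table"):
    """Run SELECT * against the view; `session` is any object with a .sql() method."""
    return session.sql(f"SELECT * FROM {view_name}")


# Usage inside the notebook (assumes `spark` and `spark_data` from the cells above):
# register_view(spark_data)
# select_all(spark).show()
```

Because the view belongs to the session, either `spark` or the existing `sqlContext` can be passed as `session`.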



In [43]:
# Query the registered table to show the Netflix data
raw_data = sqlContext.sql("SELECT * FROM netflix_table")
raw_data.show()

+-------+-------+--------------------+--------------------+--------------------+--------------------+------------------+------------+------+---------+--------------------+--------------------+
|show_id|   type|               title|            director|                cast|             country|        date_added|release_year|rating| duration|           listed_in|         description|
+-------+-------+--------------------+--------------------+--------------------+--------------------+------------------+------------+------+---------+--------------------+--------------------+
|     s1|  Movie|Dick Johnson Is Dead|     Kirsten Johnson|                null|       United States|September 25, 2021|        2020| PG-13|   90 min|       Documentaries|As her father nea...|
|     s2|TV Show|       Blood & Water|                null|Ama Qamata, Khosi...|        South Africa|September 24, 2021|        2021| TV-MA|2 Seasons|International TV ...|After crossing pa...|
|     s3|TV Show|           Ganglan

## 2. Transform: Exploratory data analysis with Spark DataFrames and Spark SQL

#### Unique show_id count

In [44]:
from pyspark.sql.functions import countDistinct
count=spark_data.select(countDistinct("show_id"))
count.show()


+-----------------------+
|count(DISTINCT show_id)|
+-----------------------+
|                   8809|
+-----------------------+



#### Group by type and release_year, and count show_id

In [45]:
groupby=spark_data.groupBy("type","release_year").count()

In [46]:
groupby.show()

+-------+---------------+-----+
|   type|   release_year|count|
+-------+---------------+-----+
|  Movie|  June 12, 2021|    1|
|  Movie|           1963|    1|
|TV Show|           1981|    1|
|  Movie|           1971|    5|
|TV Show|           1972|    1|
|TV Show|           1988|    2|
|TV Show|  Nse Ikpe-Etim|    1|
|  Movie|           1956|    2|
|  Movie| Charles Rocket|    1|
|  Movie|           1997|   33|
|  Movie|           2015|  397|
|  Movie|           1969|    2|
|  Movie|           2010|  153|
|  Movie|           1993|   24|
|  Movie|           1977|    6|
|TV Show|           2020|  436|
|TV Show|           1997|    4|
|  Movie|           2016|  657|
|  Movie|           1992|   20|
|TV Show|           1945|    1|
+-------+---------------+-----+
only showing top 20 rows
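
The distinct count and the grouping above use the DataFrame API. Since the data is also registered as `netflix_table`, the same two results can be requested through Spark SQL. The sketch below is one way to phrase them; the dictionary `QUERIES`, the helper `run_queries`, and the column aliases are assumptions made for illustration.

```python
# SQL equivalents of the countDistinct and groupBy steps.
QUERIES = {
    "unique_show_ids": (
        "SELECT COUNT(DISTINCT show_id) AS unique_show_ids "
        "FROM netflix_table"
    ),
    "titles_by_type_and_year": (
        "SELECT type, release_year, COUNT(show_id) AS title_count "
        "FROM netflix_table "
        "GROUP BY type, release_year "
        "ORDER BY type, release_year"
    ),
}


def run_queries(session):
    """Run every query; `session` is any object with a .sql() method
    (for example the `sqlContext` or SparkSession created earlier)."""
    return {name: session.sql(sql) for name, sql in QUERIES.items()}


# Usage inside the notebook:
# for name, result in run_queries(sqlContext).items():
#     print(name)
#     result.show()
```

The `ORDER BY` clause is an addition that makes the grouped output easier to scan; drop it to match the unordered DataFrame result.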



#### Update the duration column: convert values such as "90 min" to 90 and "2 Seasons" to 2

In [47]:
from pyspark.sql.functions import regexp_replace

# Strip the trailing unit (" min", " Season" or " Seasons") so that only the number remains
update2 = spark_data.withColumn('duration', regexp_replace('duration', r'\s*(min|Seasons?)\s*$', ''))

In [48]:
update2

DataFrame[show_id: string, type: string, title: string, director: string, cast: string, country: string, date_added: string, release_year: string, rating: string, duration: string, listed_in: string, description: string]
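
As the schema above shows, `duration` is still a string column after the unit text is removed. A possible alternative, sketched below, extracts the leading digits and casts them to an integer in one step. `leading_number` is a pure-Python reference for the intended behaviour, and `with_numeric_duration` is a hypothetical helper that applies the same pattern with `regexp_extract`; both names are assumptions, not part of the original notebook.

```python
import re

# One or more digits at the start of the value, e.g. "90" in "90 min".
DURATION_PATTERN = r"^(\d+)"


def leading_number(duration):
    """Pure-Python reference: '90 min' -> 90, '2 Seasons' -> 2, otherwise None."""
    match = re.match(DURATION_PATTERN, duration or "")
    return int(match.group(1)) if match else None


def with_numeric_duration(df):
    """Return df with `duration` replaced by its leading number as an int column."""
    # Imported here so the reference function above can be tried without Spark.
    from pyspark.sql import functions as F

    digits = F.regexp_extract(F.col("duration"), DURATION_PATTERN, 1)
    # regexp_extract yields "" when nothing matches; map that case to null.
    return df.withColumn("duration", F.when(digits != "", digits.cast("int")))


# Usage inside the notebook:
# numeric = with_numeric_duration(spark_data)
# numeric.printSchema()
```

Values that do not start with a number become null rather than being passed on as text.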

#### Group by type and average duration

In [49]:
from pyspark.sql.functions import avg

# Average duration per type (minutes for movies, seasons for TV shows)
update2.groupBy("type") \
    .agg(avg("duration").alias("avg durations")) \
    .show(truncate=False)

+-------------+------------------+
|type         |avg durations     |
+-------------+------------------+
|TV Show      |1.7654320987654322|
|Movie        |99.88907068062828 |
|William Wyler|null              |
|null         |null              |
+-------------+------------------+
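
Rows such as `William Wyler` in the `type` column, like the names and dates that appear under `release_year` earlier, suggest that some records were split at the wrong place when the CSV was loaded. One plausible cause, stated here as an assumption, is that the file escapes quotes inside quoted fields by doubling them (`""`), while Spark's CSV reader expects a backslash escape by default. The sketch below lists reader options that may address this and uses Python's `csv` module as a rough analogy for the effect; `CSV_OPTIONS`, `load_netflix`, and the sample row are made up for illustration.

```python
import csv

# Reader options to try: treat "" as an escaped quote and allow line breaks
# inside quoted fields. Whether they fix every row must be checked on the data.
CSV_OPTIONS = {
    "header": "true",
    "quote": '"',
    "escape": '"',
    "multiLine": "true",
}


def load_netflix(spark, path="netflix_titles.csv"):
    """Hypothetical helper: reload the CSV with quote-aware options."""
    return spark.read.format("csv").options(**CSV_OPTIONS).load(path)


# Analogy in plain Python: the same row parsed with and without "" handling.
sample = 's9,Movie,"A film about ""home"", family and loss",2021'
quote_aware = next(csv.reader([sample]))  # doubled quotes are understood
backslash_only = next(csv.reader([sample], doublequote=False, escapechar="\\"))

print(len(quote_aware), len(backslash_only))
```

In the analogy, the quote-aware parse keeps the description as one field, while the backslash-only parse breaks it at the comma after the doubled quotes and shifts the remaining values into the wrong columns. Rerunning the distinct count and the grouping on the reloaded DataFrame would show whether the stray values disappear.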



## 3. Load: Save analysis report

#### Save as CSV files partitioned by type

In [50]:
spark_data.write.option("header",True) \
        .partitionBy("type") \
        .mode("overwrite") \
        .csv("D:/ADATA/netflix")