# TV-Shows data analytics
#### This notebook explores a dataset of TV shows using PySpark. The information gained could be of value to a market researcher, content creator or TV enthusiast. The data is imported in JSON format as a Spark DataFrame. Some visualisations are created with Seaborn after converting the data to a Pandas DataFrame.

In [None]:
from pyspark.sql import SparkSession
# Column helpers used in later cells
from pyspark.sql.functions import col, desc, explode

import json
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
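# Start (or reuse) a Spark session for this analysis
spark = SparkSession.builder.appName("TVShowAnalysis").getOrCreate()
# `file_path` must already hold the location of the JSON dataset
data = spark.read.json(file_path)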

PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
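
The `JAVA_GATEWAY_EXITED` error above is typically raised when PySpark cannot launch a JVM, most often because Java is not installed or `JAVA_HOME` is not set correctly. A minimal diagnostic sketch using only the standard library follows; the helper name `java_environment` is hypothetical and not part of PySpark.

```python
import os
import shutil


def java_environment():
    """Report what this Python process can see of a Java installation."""
    return {
        # Path of the `java` executable found on PATH, or None if there is none
        "java_on_path": shutil.which("java"),
        # Value of the JAVA_HOME environment variable, or None if it is unset
        "java_home": os.environ.get("JAVA_HOME"),
    }


info = java_environment()
print(info)
```

If both values come back as `None`, installing a JDK and restarting the kernel is a reasonable first thing to try before re-running the cell.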

#### Look at a data row

#### Return number of rows

In [None]:
data.count()

#### Remove duplicate rows, and return row count

In [None]:
# dropDuplicates() returns a new DataFrame, so the result must be assigned
data = data.dropDuplicates()
data.count()

#### View column datatypes

In [None]:
data.dtypes

#### View schema info

In [None]:
data.printSchema()

#### View a selection of title names

In [None]:
display(data.select("name").limit(10))
display(data.select("name").count())

#### Count number of shows in each language

In [None]:
# Keep only rows where the 'language' column is not null
filtered_data = data.filter(col('language').isNotNull())
# Count the number of distinct languages
display(filtered_data.groupBy('language').count().count())
# Group by 'language' column and count occurrences, then sort in descending order and limit to 7 rows
sorted_data = filtered_data.groupBy('language').count().orderBy(desc('count')).limit(7)

# Display the sorted and limited data
display(sorted_data)
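
To make the logic of the cell above concrete without a running Spark session, the same steps (drop missing languages, count shows per language, keep the most frequent) can be sketched in plain Python. The rows below are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical stand-in for the 'language' column; None marks a missing value
languages = ["English", "Japanese", None, "English", "Korean", "English", "Japanese"]

# Keep only rows that have a language (mirrors col('language').isNotNull())
known = [lang for lang in languages if lang is not None]

# Count shows per language (mirrors groupBy('language').count())
counts = Counter(known)

# Number of distinct languages (mirrors the outer .count() on the grouped result)
n_languages = len(counts)
print(n_languages)

# Most frequent first, limited to the top 2 (mirrors orderBy(desc('count')).limit(2))
top = counts.most_common(2)
print(top)
```

In this toy data English and Japanese have different counts, so the ordering is unambiguous; with tied counts, the order among the tied languages should not be relied on.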

In [None]:
data.createOrReplaceTempView("tv_shows")

In [None]:
genres_data = spark.sql("""
          SELECT DISTINCT genres
          FROM tv_shows
         """)

In [None]:
# `genres` holds an array per show: explode it into one row per genre, then deduplicate
distinct_genres = genres_data.select(explode("genres").alias("genre")).distinct()

# Show the distinct genres
distinct_genres.show()
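
What `explode` followed by `distinct` does to an array column can be illustrated in plain Python. This is only a sketch of the idea on invented data, not Spark code; the variable names are hypothetical.

```python
# Hypothetical stand-in for the 'genres' column: one list of genres per show
genres_column = [["Drama", "Crime"], ["Comedy"], ["Drama", "Crime"], ["Drama", "Comedy"], []]

# Mirrors explode("genres"): one output row per array element
# (a show with an empty list contributes no rows)
exploded = [genre for genres in genres_column for genre in genres]

# Mirrors .distinct(): each genre is kept once (sorted here only for stable printing)
unique_genres = sorted(set(exploded))
print(unique_genres)
```

Because the final step deduplicates individual genres, the result is the same whether or not identical arrays were removed beforehand.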


#### View count by show-type

In [None]:
type_data = data.groupby('type').count()
display(type_data)
types_pandas = type_data.toPandas()
sns.set_style("whitegrid")
sns.barplot(data=types_pandas, x='count', y='type')

In [None]:
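# Count shows per distinct average runtime
runtime_counts = data.groupby('averageRuntime').count()
display(runtime_counts.limit(5))

# KDE over the distinct runtime values (each value appears once in the grouped data)
pandas_df = runtime_counts.toPandas()
sns.displot(data=pandas_df, kind="kde", x='averageRuntime')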
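
One caveat when reading this plot: after grouping by `averageRuntime`, each distinct runtime appears as a single row, so the density describes the spread of distinct runtime values rather than the spread of shows. The toy example below, with invented counts, sketches how much the two views can differ.

```python
# Invented counts: {average runtime in minutes: number of shows}
shows_per_runtime = {30: 500, 60: 400, 95: 3, 100: 2, 120: 1}

# Every distinct runtime counted once (what a KDE over the grouped rows sees)
runtimes = list(shows_per_runtime)
unweighted_mean = sum(runtimes) / len(runtimes)

# Each runtime weighted by its number of shows (what the un-grouped rows would give)
total_shows = sum(shows_per_runtime.values())
weighted_mean = sum(r * n for r, n in shows_per_runtime.items()) / total_shows

print(unweighted_mean)
print(weighted_mean)
```

In this example the unweighted mean is far higher than the weighted one, because a handful of rare long runtimes count as much as the common short ones. To describe shows rather than distinct values, one option is to plot the un-grouped `averageRuntime` column; another is to pass the count column as `weights`, which seaborn's `kdeplot` accepts.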

#### To conclude, PySpark has been used to reveal some aspects of this large dataset. Scripted shows are by far the most common type in the set, followed by documentaries, reality shows and animation. A total of 73 languages were found, the most prevalent being English, followed by Japanese, Russian and Korean. The density curve of the distinct average-runtime values peaks at just under 100 minutes.