Big Data Analysis using PySpark

Project Objective:

Perform scalable data analysis using PySpark

Extract insights from a large dataset

Demonstrate Big Data processing techniques

Spark Setup

In [1]:
!pip install pyspark




In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Big Data Analysis Project") \
    .getOrCreate()

print("Spark Session Started")


Spark Session Started


Load Dataset

In [3]:
df = spark.read.csv("/content/netflix_titles.csv", header=True, inferSchema=True)

df.printSchema()
df.show(5)



root
 |-- show_id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- title: string (nullable = true)
 |-- director: string (nullable = true)
 |-- cast: string (nullable = true)
 |-- country: string (nullable = true)
 |-- date_added: string (nullable = true)
 |-- release_year: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- listed_in: string (nullable = true)
 |-- description: string (nullable = true)

+-------+-------+--------------------+---------------+--------------------+-------------+------------------+------------+------+---------+--------------------+--------------------+
|show_id|   type|               title|       director|                cast|      country|        date_added|release_year|rating| duration|           listed_in|         description|
+-------+-------+--------------------+---------------+--------------------+-------------+------------------+------------+------+---------+-------------------

Analysis Section

In [4]:
print("Total Records:", df.count())



Total Records: 8809


In [5]:
df.groupBy("type").count().show()


+-------------+-----+
|         type|count|
+-------------+-----+
|         NULL|    1|
|      TV Show| 2676|
|        Movie| 6131|
|William Wyler|    1|
+-------------+-----+



In [6]:
from pyspark.sql.functions import col

df.groupBy("country") \
  .count() \
  .orderBy(col("count").desc()) \
  .show(10)


+--------------+-----+
|       country|count|
+--------------+-----+
| United States| 2805|
|         India|  972|
|          NULL|  832|
|United Kingdom|  419|
|         Japan|  245|
|   South Korea|  199|
|        Canada|  181|
|         Spain|  145|
|        France|  123|
|        Mexico|  110|
+--------------+-----+
only showing top 10 rows


In [10]:
from pyspark.sql.functions import col, year, trim, when, to_date

# Step 1: Clean spaces
df = df.withColumn("date_added_clean", trim(col("date_added")))

# Step 2: Keep only values shaped like "Month D, YYYY"; anything else becomes NULL
df = df.withColumn(
    "date_added_clean",
    when(col("date_added_clean").rlike("^[A-Za-z]+\\s\\d{1,2},\\s\\d{4}$"),
         col("date_added_clean"))
)

# Step 3: Convert to date safely
df = df.withColumn(
    "date_added_clean",
    to_date(col("date_added_clean"), "MMMM d, yyyy")
)

# Step 4: Extract year
df = df.withColumn("year_added", year(col("date_added_clean")))

# Step 5: Drop rows whose date could not be parsed
df_year = df.filter(col("year_added").isNotNull())

# Step 6: Year-wise count
df_year.groupBy("year_added") \
       .count() \
       .orderBy("year_added") \
       .show()


+----------+-----+
|year_added|count|
+----------+-----+
|      2008|    2|
|      2009|    2|
|      2010|    1|
|      2011|   13|
|      2012|    3|
|      2013|   11|
|      2014|   24|
|      2015|   81|
|      2016|  429|
|      2017| 1186|
|      2018| 1647|
|      2019| 2014|
|      2020| 1873|
|      2021| 1491|
+----------+-----+
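To see which raw date_added values survive the cleaning steps, the same trim, shape check and parse logic can be mirrored in plain Python. This is only a sketch for experimenting with sample strings: parse_date_added and the sample values are hypothetical, and Python's "%B %d, %Y" stands in for Spark's "MMMM d, yyyy" pattern.

```python
import re
from datetime import date, datetime
from typing import Optional

# Same shape check as the regex in Step 2: "<Month> <day>, <year>".
DATE_SHAPE = re.compile(r"^[A-Za-z]+\s\d{1,2},\s\d{4}$")


def parse_date_added(raw: Optional[str]) -> Optional[date]:
    """Return a date for strings like 'September 25, 2021', else None."""
    if raw is None:
        return None
    cleaned = raw.strip()                  # Step 1 (strip() removes any whitespace)
    if not DATE_SHAPE.match(cleaned):      # Step 2: reject values that are not date-shaped
        return None
    try:
        # Step 3: assumes the default C/English locale for month names.
        return datetime.strptime(cleaned, "%B %d, %Y").date()
    except ValueError:
        return None                        # right shape, but not a real date


samples = [" September 25, 2021", "August 4, 2017", "2021",
           "Adriane Lenox", "Foo 31, 2020", None]
parsed = [parse_date_added(s) for s in samples]
years = [d.year if d else None for d in parsed]   # Step 4: extract the year
```

The "Foo 31, 2020" case passes the regex but fails parsing, which shows why the shape check alone does not guarantee a valid date.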



In [11]:
df.groupBy("rating") \
  .count() \
  .orderBy(col("count").desc()) \
  .show()


+-----------------+-----+
|           rating|count|
+-----------------+-----+
|            TV-MA| 3195|
|            TV-14| 2158|
|            TV-PG|  862|
|                R|  796|
|            PG-13|  489|
|            TV-Y7|  334|
|             TV-Y|  307|
|               PG|  286|
|             TV-G|  220|
|               NR|   80|
|                G|   41|
|             NULL|    6|
|         TV-Y7-FV|    6|
|               UR|    3|
|            NC-17|    3|
|             2021|    2|
| November 1, 2020|    1|
| Shavidee Trotter|    1|
|    Adriane Lenox|    1|
|    Maury Chaykin|    1|
+-----------------+-----+
only showing top 20 rows


Scalability Section

In [12]:
df_repartitioned = df.repartition(4)
print("Number of partitions:", df_repartitioned.rdd.getNumPartitions())


Number of partitions: 4


In [13]:
df.cache()
df.count()

8809

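The dataset was repartitioned into 4 partitions (df_repartitioned) to illustrate how Spark splits work for parallel processing. Caching was applied to df so that repeated aggregations can reuse the in-memory data instead of re-reading the CSV. Note that the cells below run on the cached df, not on df_repartitioned.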

In [14]:
df.groupBy("type").count().show()


+-------------+-----+
|         type|count|
+-------------+-----+
|         NULL|    1|
|      TV Show| 2676|
|        Movie| 6131|
|William Wyler|    1|
+-------------+-----+



In [15]:
from pyspark.sql.functions import col

df.groupBy("country") \
  .count() \
  .orderBy(col("count").desc()) \
  .show(10)


+--------------+-----+
|       country|count|
+--------------+-----+
| United States| 2805|
|         India|  972|
|          NULL|  832|
|United Kingdom|  419|
|         Japan|  245|
|   South Korea|  199|
|        Canada|  181|
|         Spain|  145|
|        France|  123|
|        Mexico|  110|
+--------------+-----+
only showing top 10 rows


In [16]:
df.groupBy("rating") \
  .count() \
  .orderBy(col("count").desc()) \
  .show()


+-----------------+-----+
|           rating|count|
+-----------------+-----+
|            TV-MA| 3195|
|            TV-14| 2158|
|            TV-PG|  862|
|                R|  796|
|            PG-13|  489|
|            TV-Y7|  334|
|             TV-Y|  307|
|               PG|  286|
|             TV-G|  220|
|               NR|   80|
|                G|   41|
|             NULL|    6|
|         TV-Y7-FV|    6|
|               UR|    3|
|            NC-17|    3|
|             2021|    2|
| November 1, 2020|    1|
| Shavidee Trotter|    1|
|    Adriane Lenox|    1|
|    Maury Chaykin|    1|
+-----------------+-----+
only showing top 20 rows


In [17]:
spark.stop()


Final Insights:

Total records analyzed: 8,809 (a handful of these are malformed rows)

Movies dominate the catalog: 6,131 movies versus 2,676 TV shows

The United States is the most common country value (2,805 titles), followed by India (972)

Additions grew sharply after 2015, from 81 titles in 2015 to a peak of 2,014 in 2019

TV-MA is the most common rating (3,195 titles)

Data cleaning was applied to date_added to handle inconsistent values; the type and rating columns still contain a few misparsed entries

All processing used PySpark's DataFrame API, so the same code can run on a multi-node cluster with little change

Repartitioning (4 partitions) and caching were used to demonstrate Spark's scalability features

Analysis was executed in the Google Colab cloud environment