# Analyzing Content from Streaming Services Using SQL

For this project, I will process and explore content from popular streaming services: Amazon Prime, Hulu, Netflix, and Disney+. Each service has its own table in our database. Fortunately, the tables share essentially the same columns, so we can stack them with UNION. I've opted for UNION ALL to preserve duplicate rows, since we want to count titles that appear on more than one service. To retain streaming service information, each SELECT adds a literal 'service' column. I've also joined in a 'genres' table containing film titles and their respective genres.
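To make the UNION ALL choice concrete, here is a minimal sketch that compares the row counts produced by UNION ALL and UNION when the same title appears on two services. The tables `amazon_demo` and `hulu_demo` are hypothetical stand-ins with made-up rows and only a title column.

```sql
-- Hypothetical toy tables standing in for two of the service tables.
WITH amazon_demo(title) AS (
    VALUES ('Open Season'), ('Pinocchio')
),
hulu_demo(title) AS (
    VALUES ('Open Season')
)
SELECT 'UNION ALL' AS method, COUNT(*) AS row_count
FROM (
    SELECT title FROM amazon_demo
    UNION ALL
    SELECT title FROM hulu_demo
) AS stacked          -- keeps both copies of 'Open Season': 3 rows

UNION ALL

SELECT 'UNION' AS method, COUNT(*) AS row_count
FROM (
    SELECT title FROM amazon_demo
    UNION
    SELECT title FROM hulu_demo
) AS deduplicated;    -- collapses the repeated title: 2 rows
```

In the notebook's queries each branch also adds a distinct service label, so rows from different services are never identical. There, UNION would only remove exact duplicates within a single service, while UNION ALL also skips the deduplication step entirely.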

Now, let's examine the number of rows with missing data in the 'age,' 'IMDb,' and 'Rotten Tomatoes' rating fields.

In [4]:
WITH service_data AS (
	SELECT *, 'amazon' AS service
    FROM public.amazon

    UNION ALL 

    SELECT *, 'hulu' AS service
    FROM hulu 

    UNION ALL 

    SELECT *, 'netflix' AS service 
    FROM netflix 

    UNION ALL 

    SELECT *, 'disney' AS service
    FROM disney
),

all_data AS (
	SELECT *
    FROM service_data AS sd
    LEFT JOIN genres AS g
    ON sd.title = g.film
)

SELECT 
SUM(CASE WHEN imdb IS NULL THEN 1 ELSE 0 END) AS imdb_missing,
SUM(CASE WHEN age IS NULL THEN 1 ELSE 0 END) AS age_missing,
SUM(CASE WHEN rotten_tomatoes IS NULL THEN 1 ELSE 0 END) AS rt_missing
FROM all_data;

Unnamed: 0,imdb_missing,age_missing,rt_missing
0,6901,3633,7
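As a side note, the SUM(CASE ...) pattern above has two common PostgreSQL equivalents. The sketch below runs all three on a hypothetical four-row sample (the `demo` rows are made up) to show that they agree.

```sql
-- Hypothetical sample of rating strings; NULL marks a missing score.
WITH demo(imdb) AS (
    VALUES ('6.6/10'), (NULL), ('7.7/10'), (NULL)
)
SELECT
    SUM(CASE WHEN imdb IS NULL THEN 1 ELSE 0 END) AS missing_case,   -- 2
    COUNT(*) FILTER (WHERE imdb IS NULL)          AS missing_filter, -- 2
    COUNT(*) - COUNT(imdb)                        AS missing_diff    -- 2
FROM demo;
```

One difference to keep in mind: over an empty input, SUM returns NULL while the two COUNT-based forms return 0.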


It appears that there is a significant amount of missing data concerning IMDb scores. The next step is to investigate whether this missing data follows any discernible patterns. Specifically, I aim to determine if there is a discrepancy between types of content. Are movies or TV shows more prone to having missing IMDb scores?

In [5]:
WITH service_data AS (
	SELECT *, 'amazon' AS service
    FROM public.amazon

    UNION ALL 

    SELECT *, 'hulu' AS service
    FROM hulu 

    UNION ALL 

    SELECT *, 'netflix' AS service 
    FROM netflix 

    UNION ALL 

    SELECT *, 'disney' AS service
    FROM disney
),

all_data AS (
	SELECT *
    FROM service_data AS sd
    LEFT JOIN genres AS g
    ON sd.title = g.film
)

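-- COUNT(imdb) counts only the rows where imdb is NOT NULL,
-- so a count of 0 means every row of that type is missing a score.
SELECT type,
COUNT(imdb) 
FROM all_data 
GROUP BY type;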

Unnamed: 0,type,count
0,0,0
1,1,3263


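Because COUNT(imdb) counts only non-NULL values, the result shows that type 0 has no IMDb scores at all, while type 1 has 3,263. Type 1 denotes TV shows and type 0 denotes movies, so the missing IMDb data comes entirely from movies.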
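A variant that reports present and missing scores side by side avoids having to read the result "in reverse." This is only a sketch on a hypothetical `demo` sample. In the notebook, the same SELECT would run against `all_data`.

```sql
-- Hypothetical rows: type 0 = movie, type 1 = TV show.
WITH demo(type, imdb) AS (
    VALUES (0, NULL), (0, NULL), (1, '6.6/10'), (1, '8.3/10')
)
SELECT type,
       COUNT(imdb)            AS imdb_present,  -- non-NULL scores
       COUNT(*) - COUNT(imdb) AS imdb_missing   -- NULL scores
FROM demo
GROUP BY type
ORDER BY type;
-- type 0: 0 present, 2 missing
-- type 1: 2 present, 0 missing
```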

## Determining the Most Family-Friendly Streaming Service

To identify the most family-friendly streaming service, our initial focus will be on analyzing the percentage of content tailored for children.

The genre column can list several genres per title, so filtering on a single broad genre is unreliable: treating all cartoons as children's content would pull in adult shows such as "Rick and Morty," while family titles filed under other genres would be missed. A more precise approach is to pattern-match the genre text for terms such as "kids," "family," and "children."
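The idea can be sketched on three rows. The first two genre strings are taken from the sample output in this notebook, while the "Rick and Morty" genre string is an assumption made purely for illustration. The regex column is an equivalent alternative using PostgreSQL's case-insensitive match operator `~*`.

```sql
WITH demo(title, genre) AS (
    VALUES ('Troop Zero',     'Kids'),
           ('Open Season',    'Children & Family Movies, Comedies'),
           ('Rick and Morty', 'Animation, Comedy')  -- assumed genre string
)
SELECT title,
       (genre ILIKE '%kids%'
        OR genre ILIKE '%family%'
        OR genre ILIKE '%children%')       AS matches_ilike,
       (genre ~* '(kids|family|children)') AS matches_regex
FROM demo;
-- 'Troop Zero' and 'Open Season' match on both columns; 'Rick and Morty' does not.
```

ILIKE ignores case, which is why '%kids%' matches the capitalized 'Kids'.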

In [6]:
WITH service_data AS (
	SELECT *, 'amazon' AS service
    FROM public.amazon

    UNION ALL 

    SELECT *, 'hulu' AS service
    FROM hulu 

    UNION ALL 

    SELECT *, 'netflix' AS service 
    FROM netflix 

    UNION ALL 

    SELECT *, 'disney' AS service
    FROM disney
),

all_data AS (
	SELECT *
    FROM service_data AS sd
    LEFT JOIN genres AS g
    ON sd.title = g.film
)

SELECT *
FROM all_data 
WHERE genre ILIKE '%kids%'
   OR genre ILIKE '%family%'
   OR genre ILIKE '%children%';

Unnamed: 0,id,title,year,age,imdb,rotten_tomatoes,type,service,film,genre
0,4961,Open Season,2006,7+,,69/100,0,amazon,Open Season,"Children & Family Movies, Comedies"
1,5000,How to Steal a Dog,2014,,,68/100,0,amazon,How to Steal a Dog,"Drama, Kids"
2,4827,The Little Prince,2015,7+,,74/100,0,amazon,The Little Prince,"Animation, Kids"
3,4864,Pinocchio,2019,13+,,72/100,0,amazon,Pinocchio,"Animation, Comedy, Family, Fantasy, Musical"
4,5166,Troop Zero,2019,7+,,64/100,0,amazon,Troop Zero,Kids
...,...,...,...,...,...,...,...,...,...,...
1536,5522,Vampirina,2017,all,6.6/10,52/100,1,disney,Vampirina,"Animation, Short, Comedy, Family, Fantasy, Mus..."
1537,5521,Sydney to the Max,2019,all,6.3/10,52/100,1,disney,Sydney to the Max,"Comedy, Family"
1538,5558,The Mickey Mouse Club,1955,all,7.7/10,48/100,1,disney,The Mickey Mouse Club,"Family, Comedy, Drama, Music"
1539,2152,Genius,2017,16+,8.3/10,74/100,1,disney,Genius,"Comedy, Family, Romance, Sci-Fi"


This filter is a reasonable proxy for family content rather than a guarantee: the sample above still includes a title rated 16+. With the filter defined, the next question is which platform has the highest percentage of titles tagged as kids, family, or children's content.

In [7]:
WITH service_data AS (
	SELECT *, 'amazon' AS service
    FROM public.amazon

    UNION ALL 

    SELECT *, 'hulu' AS service
    FROM hulu 

    UNION ALL 

    SELECT *, 'netflix' AS service 
    FROM netflix 

    UNION ALL 

    SELECT *, 'disney' AS service
    FROM disney
),

all_data AS (
	SELECT *
    FROM service_data AS sd
    LEFT JOIN genres AS g
    ON sd.title = g.film
)

SELECT service,
AVG(CASE WHEN genre ILIKE '%kids%'
    OR genre ILIKE '%family%'
    OR genre ILIKE '%children%' THEN 1.0000 ELSE 0.0000 END) * 100 AS pct_family
FROM all_data 
GROUP BY service 
ORDER BY pct_family DESC;

Unnamed: 0,service,pct_family
0,disney,74.744898
1,netflix,11.057287
2,hulu,10.985075
3,amazon,8.319242
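The percentage calculation works because averaging a 1/0 indicator yields the share of rows where the condition holds. Below is a small sketch on made-up rows (the `demo` and `flagged` names are hypothetical), with an equivalent FILTER-based form alongside.

```sql
WITH demo(service, genre) AS (
    VALUES ('disney',  'Comedy, Family'),
           ('disney',  'Animation, Kids'),
           ('disney',  'Documentary'),
           ('netflix', 'Drama'),
           ('netflix', 'Family'),
           ('netflix', NULL),        -- title with no match in the genres table
           ('netflix', 'Thriller')
),
flagged AS (
    SELECT service,
           -- A NULL genre makes every ILIKE NULL; treat that as "not family".
           COALESCE(genre ILIKE '%kids%'
                 OR genre ILIKE '%family%'
                 OR genre ILIKE '%children%', FALSE) AS is_family
    FROM demo
)
SELECT service,
       AVG(CASE WHEN is_family THEN 1.0 ELSE 0.0 END) * 100 AS pct_family,
       100.0 * COUNT(*) FILTER (WHERE is_family) / COUNT(*) AS pct_family_filter
FROM flagged
GROUP BY service
ORDER BY pct_family DESC;
-- disney: 2 of 3 rows (about 66.67), netflix: 1 of 4 rows (25)
```

Note that titles with no genre row still count in the denominator. The same holds for the notebook's query, where the LEFT JOIN leaves genre as NULL for unmatched titles and the CASE falls through to 0.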


## Identifying the Highest-Rated Content

The rotten_tomatoes column stores each title's rating as text, such as '69/100'. To extract the numerical rating, I'll use the SPLIT_PART() function and then cast (::) the result to a numeric value.

We'll also split the data into movies and TV shows.
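As a quick illustration of the extraction step, the sketch below applies SPLIT_PART() and the cast to a few sample strings. The NULLIF guard is an addition of mine, not part of the original query: it only matters if the column contains empty strings, which I have not verified, because casting '' to NUMERIC raises an error.

```sql
-- Sample rating strings in the formats seen in this notebook, plus a NULL.
WITH demo(raw_score) AS (
    VALUES ('69/100'), ('6.6/10'), (NULL)
)
SELECT raw_score,
       SPLIT_PART(raw_score, '/', 1)                      AS text_part,    -- '69', '6.6', NULL
       NULLIF(SPLIT_PART(raw_score, '/', 1), '')::NUMERIC AS numeric_part  -- 69, 6.6, NULL
FROM demo;
```

NULL inputs pass through as NULL, and AVG() ignores NULLs, so titles without a rating do not distort the averages.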

In [10]:
WITH service_data AS (
	SELECT *, 'amazon' AS service
    FROM public.amazon

    UNION ALL 

    SELECT *, 'hulu' AS service
    FROM hulu 

    UNION ALL 

    SELECT *, 'netflix' AS service 
    FROM netflix 

    UNION ALL 

    SELECT *, 'disney' AS service
    FROM disney
),

all_data AS (
	SELECT *
    FROM service_data AS sd
    LEFT JOIN genres AS g
    ON sd.title = g.film
)

SELECT 
       service,
       CASE WHEN type = 1 THEN 'TV' ELSE 'Movie' END AS type,
       AVG(SPLIT_PART(rotten_tomatoes, '/', 1)::NUMERIC) AS rt_score
FROM all_data
GROUP BY service, type
ORDER BY service, type;

Unnamed: 0,service,type,rt_score
0,amazon,Movie,51.990146
1,amazon,TV,52.377207
2,disney,Movie,60.047934
3,disney,TV,54.486034
4,hulu,Movie,60.482517
5,hulu,TV,59.690625
6,netflix,Movie,54.965913
7,netflix,TV,54.229586
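One subtlety in the query above: the output alias `type` has the same name as the input column. In PostgreSQL, an ambiguous name in GROUP BY resolves to the input column, while in ORDER BY it resolves to the output column. The query still works because the label is derived from the column, but a distinct alias removes the ambiguity. The sketch below uses a hypothetical alias `content_type` and made-up rows.

```sql
WITH demo(service, type, rotten_tomatoes) AS (
    VALUES ('hulu', 0, '60/100'),
           ('hulu', 0, '70/100'),
           ('hulu', 1, '58/100')
)
SELECT service,
       CASE WHEN type = 1 THEN 'TV' ELSE 'Movie' END     AS content_type,
       AVG(SPLIT_PART(rotten_tomatoes, '/', 1)::NUMERIC) AS rt_score
FROM demo
GROUP BY service, content_type   -- PostgreSQL allows grouping by an output alias
ORDER BY service, content_type;
-- hulu / Movie averages 65, hulu / TV averages 58
```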


## Analyzing the Divergence Between Critics and Audiences Over Time

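To conclude the analysis, let's set the services aside and ask whether critics and audiences agreed more on TV shows in the past. Because IMDb scores are present only for TV shows in this dataset, filtering on non-missing IMDb scores restricts the comparison to TV shows.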

To prepare the date data for charting purposes, I'll utilize the TO_DATE() function to convert the year into a date format.
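A minimal sketch of the conversion, assuming `year` is stored as an integer (the cast to TEXT in the query suggests a numeric type, but I have not confirmed it). MAKE_DATE() is shown as an alternative that skips the text round trip.

```sql
WITH demo(year) AS (
    VALUES (2000), (2009)
)
SELECT year,
       TO_DATE(year::TEXT, 'YYYY') AS via_to_date,    -- 2000-01-01, 2009-01-01
       MAKE_DATE(year, 1, 1)       AS via_make_date   -- same dates
FROM demo;
```

Both expressions return a plain date set to January 1 of the given year.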

In [14]:
WITH service_data AS (
	SELECT *, 'amazon' AS service
    FROM public.amazon

    UNION ALL 

    SELECT *, 'hulu' AS service
    FROM hulu 

    UNION ALL 

    SELECT *, 'netflix' AS service 
    FROM netflix 

    UNION ALL 

    SELECT *, 'disney' AS service
    FROM disney
),

all_data AS (
	SELECT *
    FROM service_data AS sd
    LEFT JOIN genres AS g
    ON sd.title = g.film
)


SELECT date,
       AVG(ABS(imdb_score - rt_score)) AS avg_difference
FROM (
     SELECT TO_DATE(year::TEXT, 'YYYY') AS date,
            (SPLIT_PART(rotten_tomatoes, '/', 1)::NUMERIC) AS rt_score,
            (SPLIT_PART(imdb, '/', 1)::NUMERIC * 10) AS imdb_score
     FROM all_data
     WHERE imdb IS NOT NULL
        AND rotten_tomatoes IS NOT NULL
        AND year >= 2000
     ) AS sub
GROUP BY date 
ORDER BY date; 

Unnamed: 0,date,avg_difference
0,2000-01-01 00:00:00+00:00,12.857143
1,2001-01-01 00:00:00+00:00,10.941176
2,2002-01-01 00:00:00+00:00,12.526316
3,2003-01-01 00:00:00+00:00,11.103448
4,2004-01-01 00:00:00+00:00,10.741935
5,2005-01-01 00:00:00+00:00,10.962264
6,2006-01-01 00:00:00+00:00,12.092593
7,2007-01-01 00:00:00+00:00,13.825
8,2008-01-01 00:00:00+00:00,13.155172
9,2009-01-01 00:00:00+00:00,11.759036
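To see what a single term inside AVG(ABS(imdb_score - rt_score)) looks like, here is the calculation for one title, using the rating strings from the Vampirina row shown earlier in this notebook. The IMDb score is multiplied by 10 so that both ratings sit on a 0 to 100 scale.

```sql
-- Rating strings copied from the Vampirina row in the family-content output.
SELECT SPLIT_PART('6.6/10', '/', 1)::NUMERIC * 10 AS imdb_score,  -- 66.0
       SPLIT_PART('52/100', '/', 1)::NUMERIC      AS rt_score,    -- 52
       ABS(SPLIT_PART('6.6/10', '/', 1)::NUMERIC * 10
         - SPLIT_PART('52/100', '/', 1)::NUMERIC) AS difference;  -- 14.0
```

The yearly figures in the table above are averages of this per-title difference.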
