# Metacritic TV Shows - Data Cleaning

## Data Source

This project uses the [Metacritic TV Shows Dataset](https://www.kaggle.com/datasets/mohamedasak/metacritic-tv-shows-dataset) from Kaggle. The following is taken verbatim from the site:

> Metacritic TV Shows Dataset provides a structured and clean collection of high-quality television series metadata aggregated from Metacritic. The dataset includes critically acclaimed TV shows across multiple genres, networks, and release periods. It is designed for exploratory data analysis (EDA), recommendation systems, visualization projects, and machine learning tasks related to media analytics.

## Notebook Objective

This notebook focuses on data cleaning and validation to prepare the dataset for analysis in Power BI. 

### Goals

1. **Assess data quality** - Identify missing values, inconsistencies, and data type issues
2. **Handle missing values** - Apply appropriate strategies based on column context (fill, flag, or remove)
3. **Validate data** - Cross-reference with external sources (OMDB API) where needed
4. **Standardize formats** - Ensure consistency in dates, categorical values, and text fields
5. **Document decisions** - Record all cleaning choices and rationale for reproducibility

### Out of Scope

Data normalization for Power BI (dimension tables, bridge tables, etc.) is covered in a separate notebook.



In [None]:
import pandas as pd
from dotenv import load_dotenv

from utils.data_utils import missing_value_summary, get_release_date

In [None]:
load_dotenv(override=True)

In [44]:
data = pd.read_csv("data/metacritic_tv_shows.csv")
data.head(5)

Unnamed: 0,id,title,releaseDate,seasonCount,rating,genres,description,duration,tagline,metascore,metascore_count,metascore_sentiment,userscore,userscore_count,userscore_sentiment,created_by,production_companies,director,writer,top_cast
0,1000358361,Planet Earth: Blue Planet II,2017-10-29,1.0,TV-G,Documentary,"Airing simultaneously on AMC, BBC America, IFC...",50.0,Take a deep breath,97.0,7,Universal acclaim,82,178,Universal acclaim,,"BBC Natural History Unit (NHU),BBC Studios,BBC...",James Honeyborne,,"David Attenborough,Peter Drost,Roger Munns,Rog..."
1,1000359012,America to Me,2018-08-26,1.0,TV-14,Documentary,The 10-part documentary series from Steve Jame...,60.0,,96.0,9,Universal acclaim,59,75,Mixed or average,,"Participant,Kartemquin Films,Starz,Starz,Nolo ...","Bing Liu,Kevin Shaw,Steve James,Rebecca Parrish",,"Kendale McCoy,Charles Donalson Jr.,Ke'Shawn Ku..."
2,1000357720,Planet Earth II,2016-11-06,1.0,TV-G,Documentary,"Narrated by David Attenborough, the sequel to ...",50.0,A new world revealed,96.0,10,Universal acclaim,92,242,Universal acclaim,,"BBC Natural History Unit (NHU),BBC America,Zwe...","Justin Anderson,Ed Charles,Elizabeth White,Emm...",Elizabeth White,"David Attenborough,Chadden Hunter,Gordon Bucha..."
3,1000302375,The Staircase,2004-06,1.0,TV-MA,"Documentary,Crime,Drama",An 8-part documentary series about the celebra...,,Did He Do It?,95.0,9,Universal acclaim,71,62,Generally favorable,,"Maha Productions,ABC News,Docurama,Netflix,Net...",Jean-Xavier de Lestrade,"Jean-Xavier de Lestrade,Nathalie Sobania","Michael Peterson,David Rudolf,Ron Guerette,Mar..."
4,1000366530,The U.S. and the Holocaust,2022-09-18,1.0,TV-14,"Documentary,History","Narrated by Peter Coyote, the three-part docum...",133.0,,96.0,10,Universal acclaim,53,65,Mixed or average,,"Florentine Films,Public Broadcasting Service (...","Sarah Botstein,Ken Burns,Lynn Novick",Geoffrey C. Ward,"Peter Coyote,Daniel Mendelsohn,Peter Hayes,Deb..."


### Dataset Overview
We begin with an overview of the dataset.

| Aspect | Details |
|--------|---------|
| Source | Metacritic |
| Records | 3,330 TV shows |
| Time span (release dates of shows) | 1969 - 2025 |
| Key fields | Metascore, Userscore, Genres, Cast, Writers, Directors, Production Companies |


There are 3,330 TV shows recorded. The `id` column is unique, so each row represents one show and there are no duplicates. Looking at the missing value summary, we note that:
- `title`, `id`, `metascore_count`, `userscore`, and `userscore_count` have no missing values.
- `created_by`, `tagline`, and `director` have the highest percentages of missing values.
- For these three columns and the remaining ones, we will decide whether to fill the gaps, replace them with a placeholder, or leave them as `NaN`.
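
The helper `missing_value_summary` lives in `utils.data_utils` and is not shown in this notebook. The following is only a minimal sketch of how such a summary, along with the `id` uniqueness check, could be computed. The function name `missing_value_summary_sketch`, the toy data, and the choice to hide columns without missing values are assumptions, not the project's actual implementation.

```python
import pandas as pd


def missing_value_summary_sketch(frame):
    """Count and percentage of missing values per column, largest first."""
    missing = frame.isna().sum()
    summary = pd.DataFrame({
        "Number of Missing Values": missing,
        "Percent of Total Values": (100 * missing / len(frame)).round(2),
    })
    # Assumption: columns without missing values are left out of the report
    summary = summary[summary["Number of Missing Values"] > 0]
    return summary.sort_values("Number of Missing Values", ascending=False)


# Toy data, invented for illustration
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": ["A", "B", "C", "D"],
    "tagline": [None, "x", None, None],
    "rating": ["TV-G", None, "TV-MA", "TV-14"],
})

ids_are_unique = toy["id"].is_unique  # True when no id is repeated
toy_summary = missing_value_summary_sketch(toy)
print(ids_are_unique)
print(toy_summary)
```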

In [41]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3330 entries, 0 to 3329
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    3330 non-null   int64  
 1   title                 3330 non-null   object 
 2   releaseDate           3317 non-null   object 
 3   seasonCount           3135 non-null   float64
 4   rating                2807 non-null   object 
 5   genres                3328 non-null   object 
 6   description           3329 non-null   object 
 7   duration              3002 non-null   float64
 8   tagline               1722 non-null   object 
 9   metascore             3314 non-null   float64
 10  metascore_count       3330 non-null   int64  
 11  metascore_sentiment   3314 non-null   object 
 12  userscore             3330 non-null   int64  
 13  userscore_count       3330 non-null   int64  
 14  userscore_sentiment   2871 non-null   object 
 15  created_by           

In [39]:
print(missing_value_summary(data))

                      Number of Missing Values  Percent of Total Values
created_by                                1618                    48.59
tagline                                   1608                    48.29
director                                  1139                    34.20
rating                                     523                    15.71
userscore_sentiment                        459                    13.78
duration                                   328                     9.85
writer                                     243                     7.30
seasonCount                                195                     5.86
metascore_sentiment                         16                     0.48
metascore                                   16                     0.48
releaseDate                                 13                     0.39
top_cast                                    12                     0.36
production_companies                         3                  

In [6]:
df = data.copy()

### Metascore and user score
We begin our data quality checks with two of the most important fields. We noted the following and made these decisions:
- Some TV shows have no valid critic score (`metascore` is `NaN`).
- We remove these rows. The rationale: if `metascore` is `NaN` and `userscore_count` is 0, there is no rating data to analyze.
- We also remove rows where `metascore` is `NaN` even though `metascore_count` > 0.
- The latter case possibly arises because Metacritic did not publish a Metascore despite reviews existing.
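
To make the two removal rules concrete, here is a small sketch on invented toy rows (the titles and scores are not from the dataset). It illustrates that every row caught by the first rule is also caught by the second, so in effect all rows with a missing `metascore` are dropped.

```python
import pandas as pd

# Toy rows, invented for illustration
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "metascore": [97.0, None, None, 80.0],
    "metascore_count": [7, 0, 5, 10],
    "userscore_count": [178, 0, 12, 0],
})

# Rule 1: no critic score and no user ratings at all
no_data = toy["metascore"].isna() & (toy["userscore_count"] == 0)
# Rule 2: no critic score, regardless of the other counts
no_metascore = toy["metascore"].isna()

# Rule 1 never selects a row that rule 2 misses
rule1_only = int((no_data & ~no_metascore).sum())

kept = toy[~no_metascore]
print(kept["title"].tolist())
```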

In [7]:
# Rows where both the metascore and the user scores are unusable
df = df[~(df['metascore'].isna() & (df['userscore_count'] == 0))]
# Remaining rows where metascore is NaN (this filter also covers the case above)
df = df[df['metascore'].notna()]

#### Top Cast
The next columns we investigated were `top_cast` and `seasonCount`. The former revealed the following:
- NaNs: 12 (0.36%). Of these, 10 were documentaries, which a few Google searches suggested usually have no dedicated cast.
- We decided to fill these with the placeholder `No Cast`.

#### Season count
- NaNs: 5.86%. The causes vary; some are movies or specials that Metacritic classifies as TV shows. This could be a quirk of the website, whose rules for classifying media are not made clear.

We decided on the following:

- Create flag column to track original NaNs (could be of use later on)
- Fill with 1 (treating as single-season content)

In [8]:
df['top_cast'] = df['top_cast'].fillna('No Cast')
df['seasonCount_was_null'] = df['seasonCount'].isna()
df['seasonCount'] = df['seasonCount'].fillna(1)

#### Release date
- NaNs: 13 (0.39%)
- Instead of a placeholder, we queried the OMDB (Open Movie Database) API for the release dates.
- This filled all but two, which we then filled in manually after a quick Google search.
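
The helper `get_release_date` comes from `utils.data_utils` and is not shown here. The sketch below is a hypothetical stand-in showing how a title lookup against the OMDb API might work, assuming its `t`, `type`, and `apikey` query parameters and its `Released` response field. The function names, the `OMDB_API_KEY` environment variable, and the stubbed response are assumptions for illustration.

```python
import json
import os
from datetime import datetime
from urllib.parse import urlencode
from urllib.request import urlopen

OMDB_URL = "https://www.omdbapi.com/"


def parse_released(payload):
    """Return the OMDb 'Released' field as YYYY-MM-DD, or None if unavailable."""
    released = payload.get("Released")
    if payload.get("Response") != "True" or released in (None, "N/A"):
        return None
    # OMDb reports dates in the form "17 Jul 2013"
    return datetime.strptime(released, "%d %b %Y").strftime("%Y-%m-%d")


def _fetch_json(url):
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def get_release_date_sketch(title, api_key=None, fetch=_fetch_json):
    # OMDB_API_KEY is an assumed variable name, e.g. loaded from a .env file
    api_key = api_key or os.getenv("OMDB_API_KEY")
    if not api_key:
        raise ValueError("An OMDb API key is required")
    query = urlencode({"t": title, "type": "series", "apikey": api_key})
    return parse_released(fetch(f"{OMDB_URL}?{query}"))


# Offline usage with a stubbed response instead of a real network call
def fake_fetch(url):
    return {"Response": "True", "Released": "17 Jul 2013"}


example_date = get_release_date_sketch("Example Show", api_key="demo", fetch=fake_fetch)
print(example_date)
```

Returning `None` when the lookup fails keeps the value missing, so any titles that OMDb cannot resolve remain visible for manual follow-up.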

In [None]:

mask = df['releaseDate'].isna()
df.loc[mask, 'releaseDate'] = df.loc[mask, 'title'].apply(get_release_date)

In [31]:
df.loc[df['title'] == 'Manhunt (2013)', 'releaseDate'] = '2013-07-17'
df.loc[df['title'] == 'Nightingale', 'releaseDate'] = '2014-06-17'


In [42]:
#missing_value_summary(df)

#### Writers (`writer` column)
- NaNs: ~7%
- Approach: multi-step, based on genre

For this column, we noted that some of the shows with no writers were documentaries, reality TV shows, and the like. So we checked the `genres` column: if a show had `NaN` for the writer and one of these genres, we replaced the `NaN` with a genre-specific placeholder, for example `None - Documentary`.

We then dealt with the 12 remaining edge cases through manual lookups on the Metacritic website.

For shows with many writers who each wrote a single episode, we applied the following rule: we only recorded writers credited with 2+ episodes.
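
The genre-based rule can be illustrated on a few invented rows. This is a sketch only: the toy titles and the helper `has_genre` are hypothetical. It shows why the combined Documentary/Reality-TV case has to be handled first, and that shows outside the listed genres stay `NaN` for manual lookup.

```python
import pandas as pd

# Toy rows, invented for illustration
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": ["Documentary,Reality-TV", "Documentary,Crime", "Drama", "Talk-Show"],
    "writer": [None, None, None, "Jane Doe"],
})


def has_genre(frame, genre):
    """True where the comma-separated genres string mentions the genre."""
    return frame["genres"].str.contains(genre, na=False, regex=False)


# Combined case first, otherwise the single-genre labels would claim these rows
both = toy["writer"].isna() & has_genre(toy, "Documentary") & has_genre(toy, "Reality-TV")
toy.loc[both, "writer"] = "None - Documentary/Reality-TV"

# Single-genre placeholders only touch rows that are still missing a writer
for genre in ["Documentary", "Reality-TV", "Talk-Show", "Game-Show"]:
    mask = toy["writer"].isna() & has_genre(toy, genre)
    toy.loc[mask, "writer"] = f"None - {genre}"

print(toy[["title", "writer"]])
```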

In [None]:
mask_nan = df['writer'].isna()

mask_both = (mask_nan & 
             df['genres'].str.contains('Documentary', na=False) & 
             df['genres'].str.contains('Reality-TV', na=False))
df.loc[mask_both, 'writer'] = 'None - Documentary/Reality-TV'

df.loc[df['writer'].isna() & df['genres'].str.contains('Documentary', na=False), 'writer'] = 'None - Documentary'
df.loc[df['writer'].isna() & df['genres'].str.contains('Reality-TV', na=False), 'writer'] = 'None - Reality-TV'
df.loc[df['writer'].isna() & df['genres'].str.contains('Talk-Show', na=False), 'writer'] = 'None - Talk-Show'
df.loc[df['writer'].isna() & df['genres'].str.contains('Game-Show', na=False), 'writer'] = 'None - Game-Show'

In [16]:
df.loc[df['title'] == 'Generation Cryo', 'writer'] = 'None - Documentary'
df.loc[df['title'] == 'Modern Dads', 'writer'] = 'None - Reality-TV'
df.loc[df['title'] == 'Ellen DeGeneres: For Your Approval', 'writer'] = 'None - Stand-up'
df.loc[df['title'] == 'Killing Fields', 'writer'] = 'None - Documentary'
df.loc[df['title'] == "Dane Cook's Tourgasm", 'writer'] = 'None - Stand-up'

In [17]:
df.loc[df['title'] == 'Star Wars: Visions', 'writer'] = 'George Lucas, Masahiko Ōtsuka, Yasumi Atarashi'
df.loc[df['title'] == "Marvel's M.O.D.O.K.", 'writer'] = 'Jordan Blum, Patton Oswalt'
df.loc[df['title'] == 'The Boys Presents: Diabolical', 'writer'] = 'Garth Ennis, Darick Robertson'
df.loc[df['title'] == 'The Agency (2024)', 'writer'] = 'Jez Butterworth, John-Henry Butterworth'
df.loc[df['title'] == 'Creepshow', 'writer'] = 'Greg Nicotero, John Esposito'
df.loc[df['title'] == 'Ten Percent', 'writer'] = 'John Morton, Ella Road'
df.loc[df['title'] == 'Black Rabbit', 'writer'] = 'Zach Baylin, Kate Susman'

#### Creators (`created_by` column)
- NaNs: 1,605 (48%)
- Attempting to fill these gaps using OMDB proved futile, as the site also records the creators as N/A for most of these shows.
- We decided to fill all with `Unknown` and revisit in Power BI if needed.

In [18]:
df['created_by'] = df['created_by'].fillna('Unknown')

#### Directors (`director` column)
- NaNs: 1,137 (34%). We encountered similar difficulties when trying to use OMDB. (TV shows likely have rotating directors, so listing them all may be cumbersome.)
- We followed a similar approach to the above: fill all with `Unknown`.


In [19]:
df['director'] = df['director'].fillna('Unknown')

### Duration
- NaNs: 327 (9.87%)
- We attempted to use OMDB but rejected it due to accuracy issues (e.g., the show "I Wanna Marry Harry" returned 1 min instead of the 45 min found via a Google search).
- So we leave these as `NaN`; Power BI will exclude them from numeric calculations automatically.
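
As a rough illustration of why a numeric column is better left missing than filled with a placeholder, the sketch below uses pandas, which skips `NaN` in aggregations by default. The values are invented, and the zero fill is shown only as a counterexample, not something this notebook does.

```python
import pandas as pd

# Invented durations in minutes, with one missing value
durations = pd.Series([50.0, None, 60.0], name="duration")

# NaN is skipped by default, so the mean reflects only known durations
mean_skipping_nan = durations.mean()

# A numeric placeholder such as 0 would silently distort the result
mean_with_zero_fill = durations.fillna(0).mean()

print(mean_skipping_nan)
print(round(mean_with_zero_fill, 2))
```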

### Rating (TV-MA, TV-PG, etc.)
- NaNs: 517 (15%)
- We filled these with the placeholder `Unknown`, again because of the OMDB issues.

In [20]:
df['rating'] = df['rating'].fillna('Unknown')

### User score sentiment (`userscore_sentiment` column)
- NaNs: 450. Metacritic assigns the sentiment once there are at least 4 user ratings.
- We performed the following verification: all 450 had `userscore_count` < 4.
- So we fill these as `Insufficient scores`.

In [None]:
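mask_nan = df['userscore_sentiment'].isna()
# Verification: every row with a missing sentiment should have fewer than 4 user scores
low_count = df.loc[mask_nan, 'userscore_count'] < 4
print(low_count.all())

df['userscore_sentiment'] = df['userscore_sentiment'].fillna('Insufficient scores')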

### Production companies
Lastly, we filled the NaNs in this column with the placeholder `Unknown`.

In [22]:
df['production_companies'] = df['production_companies'].fillna('Unknown')

In [23]:
df.loc[df['title'] == 'Generation Cryo', 'genres'] = 'Documentary'
df.loc[df['title'] == 'Modern Dads', 'genres'] = 'Reality-TV'


### Placeholder Values Summary - TV Shows Data Cleaning

| Column | Placeholder | Reason |
|--------|-------------|--------|
| top_cast | `No Cast` | Documentaries typically don't have cast |
| writer | `None - Documentary` | Genre-appropriate |
| writer | `None - Reality-TV` | Genre-appropriate |
| writer | `None - Talk-Show` | Genre-appropriate |
| writer | `None - Game-Show` | Genre-appropriate |
| writer | `None - Documentary/Reality-TV` | Genre-appropriate (overlap) |
| writer | `None - Stand-up` | Genre-appropriate |
| created_by | `Unknown` | Too many unexplained NaNs (48%) |
| director | `Unknown` | Too many unexplained NaNs (34%) |
| rating | `Unknown` | No clear pattern (15% NaN) |
| userscore_sentiment | `Insufficient scores` | Verified < 4 user scores |
| production_companies | `Unknown` | Very few NaNs |
| duration | *Left as NaN* | OMDB unreliable; a numeric field benefits from being excluded from calculations |
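
Before exporting, a quick check can confirm that only the columns intentionally left as `NaN` still contain missing values. The following is a minimal sketch on invented data; the helper name `columns_with_nans` and its `allowed` parameter are hypothetical.

```python
import pandas as pd


def columns_with_nans(frame, allowed=("duration",)):
    """Map each column that still has NaNs to its count, ignoring allowed ones."""
    counts = frame.isna().sum()
    unexpected = counts[(counts > 0) & ~counts.index.isin(allowed)]
    return unexpected.to_dict()


# Toy data: 'rating' still has a gap, 'duration' is allowed to
toy = pd.DataFrame({
    "director": ["Unknown", "Jane Doe"],
    "rating": ["TV-MA", None],
    "duration": [None, 45.0],
})

leftover = columns_with_nans(toy)
print(leftover)
```

An empty dictionary would indicate that every placeholder column has been fully filled.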

In [43]:
#missing_value_summary(df)

In [32]:
df.to_csv("data/shows_cleaned.csv", index=False)