# 🎬 Netflix Data Labeling Lab, Notebook 1: Weak Labeling

---

### 🧠 **Objective**
This lab focuses on the **data labeling** phase of the Netflix dataset workflow.  
The goal is to automatically generate **weak supervision labels** for one classification task, identifying whether a title is **family-friendly**, using heuristic **Labeling Functions (LFs)** instead of manual annotation.  
The labeled dataset created here will later be used for **data augmentation**, **data slicing**, and **model development** in subsequent notebooks.

---

### 🗂 **Dataset Overview**
The dataset `netflix_titles.csv` contains metadata for movies and TV shows in the Netflix catalog (8,807 titles and 12 columns in the copy loaded here).  
Key columns include:
- **`type`**: Movie or TV Show  
- **`title`**: Title of the content  
- **`rating`**: Official content rating (G, PG-13, TV-MA, etc.)  
- **`listed_in`**: Genres or categories  
- **`description`**: Short synopsis  
- **`duration`**: Runtime (e.g., "90 min", "3 Seasons")

---

### ⚙️ **Notebook Workflow**
1. **Load and Inspect Data**: Import the dataset and preview its structure.  
2. **Basic Cleaning**: Standardize column names, parse durations, and create lowercase fields for rule-based matching.  
3. **Define Label Task**: Choose the target variable `is_family_friendly` with labels { 1 = family-friendly, 0 = not family-friendly, -1 = abstain }.  
4. **Design Labeling Functions (LFs)**: Write heuristic rules using ratings, genres, and keywords to assign weak labels.  
5. **Apply LFs**: Evaluate all LFs across the dataset to create a label matrix.  
6. **Coverage & Conflict Analysis**: Measure how many rows were labeled and how often rules conflict.  
7. **Majority Vote Aggregation**: Combine multiple LF votes into a single weak label.  
8. **Fallback Rule**: Assign a default label where the majority vote abstains (no votes or a tie).  
9. **Inspect Examples**: Manually review sample outputs for quality control.  
10. **Save Labeled Dataset**: Export the final weakly labeled data for future steps.  
11. **Evaluate LF Performance**: Compare how often each rule fires and its vote distribution.  
12. **Export Compact Version**: Save a lightweight subset for model training.

---

### 🧩 **Key Concepts**

#### 🪄 Labeling Functions (LFs)
LFs are simple, interpretable heuristics that assign noisy labels using available features such as ratings, genres, or keywords.  
Each LF can output:
- `1` → Family-friendly  
- `0` → Not family-friendly  
- `-1` → Abstain (no decision)

By combining many LFs, we approximate human annotation through **weak supervision**.
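
As a minimal sketch of the idea, the toy labeling function below votes on a plain dictionary record. The function name `lf_toy_rating` and the sample records are hypothetical, invented for illustration; the notebook's actual LFs are defined in Step 4.

```python
# Label values used throughout this notebook
ABSTAIN, NOT_FAM, FAMILY = -1, 0, 1

def lf_toy_rating(record):
    """Hypothetical LF: vote from the rating only, abstain otherwise."""
    rating = str(record.get("rating", "")).lower()
    if rating in {"g", "tv-y"}:
        return FAMILY
    if rating in {"r", "tv-ma"}:
        return NOT_FAM
    return ABSTAIN  # no opinion on any other rating

# Illustrative records, not taken from the dataset
samples = [{"rating": "TV-Y"}, {"rating": "TV-MA"}, {"rating": "TV-14"}]
toy_votes = [lf_toy_rating(r) for r in samples]
print(toy_votes)  # [1, 0, -1]
```

The third record shows the important case: an LF that has no evidence should abstain rather than guess.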

#### ⚖️ Weak Label Aggregation
LFs often disagree. We resolve their votes using a simple **majority vote** aggregator.  
Later labs can replace this with probabilistic models (e.g., Snorkel's LabelModel).

#### 📈 Evaluation Metrics
- **Coverage:** % of records labeled by any LF.  
- **Conflict:** % of records where LFs disagree.  
- **Overlap:** Average number of votes per record, computed in Step 6 over records that received at least one vote.
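
A small worked example may make these definitions concrete. The 4×3 vote matrix below is invented for illustration, with `-1` marking an abstain; the `toy_` variable names are hypothetical. Overlap is taken over rows with at least one vote, which is how Step 6 computes it.

```python
import numpy as np

ABSTAIN = -1
# Toy label matrix: 4 records (rows) x 3 LFs (columns); values are made up
toy = np.array([
    [ 1, -1,  1],   # two agreeing votes
    [ 0,  1, -1],   # two conflicting votes
    [-1, -1, -1],   # no votes at all
    [-1,  0, -1],   # a single vote
])

n_votes = (toy != ABSTAIN).sum(axis=1)

# Coverage: share of rows with at least one non-abstain vote
toy_coverage = (n_votes > 0).mean()
# Conflict: share of rows holding both a 1 vote and a 0 vote
toy_conflict = ((toy == 1).any(axis=1) & (toy == 0).any(axis=1)).mean()
# Overlap: mean number of votes among rows that were labeled
toy_overlap = n_votes[n_votes > 0].mean()

print(toy_coverage, toy_conflict, round(toy_overlap, 3))
```

Here three of four rows are covered, one of four conflicts, and the covered rows carry (2 + 2 + 1) / 3 votes on average.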

---

### 🔄 **Outcome**
By the end of this notebook you will have:
- A cleaned Netflix dataset with parsed durations and normalized text fields.  
- A set of Labeling Functions for family-friendly classification.  
- A weakly labeled CSV (`netflix_labeled_family.csv`) ready for:  
  - **Data Augmentation** (Notebook 2)  
  - **Data Slicing & Model Development** (Notebook 3)

---

### 🧾 **Structure of This Notebook**
| Section | Purpose |
|:--|:--|
| **Step 1: Load and Inspect Data** | Load the raw Netflix dataset and preview it. |
| **Step 2: Basic Cleaning** | Standardize columns and parse runtime information. |
| **Step 3: Define Label Task** | Introduce the target variable and label space. |
| **Step 4: Labeling Functions** | Implement heuristics that generate weak labels. |
| **Step 5: Apply LFs** | Execute rules and store their outputs. |
| **Step 6: Coverage & Conflicts** | Quantify LF agreement and effectiveness. |
| **Step 7: Majority Vote** | Aggregate LF outputs into a single label. |
| **Step 8: Fallback Rule** | Assign labels to abstained records. |
| **Step 9: Inspect Examples** | Qualitatively verify label correctness. |
| **Step 10: Save Output** | Persist the labeled dataset for later labs. |
| **Steps 11-12 (Optional)** | Examine LF activity and export a compact subset. |

---

### 🧩 **Next Notebook Preview**
- **Notebook 2 (Data Augmentation):** Use synthetic text transformations and sampling to balance classes and improve label robustness.  
- **Notebook 3 (Model Development):** Train supervised models (Logistic Regression, Random Forest, XGBoost) using the weakly labeled data and evaluate performance on different data slices.


In [34]:
import pandas as pd

# Path to your CSV
file_path = "/Users/sreevarshansathiyamurthy/Downloads/netflix_lab/netflix_titles.csv"

# Load the dataset directly
df = pd.read_csv(file_path, low_memory=False)

# Verify successful load
print("✅ Successfully loaded Netflix dataset")
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")

# Display first few rows
display(df.head())


✅ Successfully loaded Netflix dataset
Shape: 8807 rows × 12 columns


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


### Step 1: Load and Inspect the Netflix Dataset
We begin by loading the Netflix titles dataset, which contains information about movies and TV shows available on Netflix.  
This step ensures that the dataset is successfully imported into the notebook and allows us to preview the data structure and confirm that key columns such as `title`, `type`, `rating`, and `listed_in` are present.

In [36]:
import pandas as pd
import numpy as np

# If df isn't already loaded, uncomment and set your path:
# df = pd.read_csv("/Users/sreevarshansathiyamurthy/Downloads/netflix_lab/netflix_titles.csv", low_memory=False)

print("Rows:", len(df))
df.head(2)

Rows: 8807


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."


### Step 2: Perform Basic Cleaning
Before labeling, we standardize column names, parse the `duration` column into separate numeric and unit fields, and create lowercase text fields (`title_lower`, `desc_lower`) for easier rule-based text matching.  
This lightweight preprocessing ensures consistency and avoids formatting issues when we apply labeling functions later.

In [38]:
# Standardize columns we will use for rules
df = df.copy()
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Parse duration like "90 min" or "3 Seasons"
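def _parse_duration(x):
    """Split a duration such as "90 min" into (number, unit); return (NaN, None) if unparseable."""
    if pd.isna(x):
        return np.nan, None
    parts = str(x).strip().split()
    try:
        num = int(parts[0])
    except (IndexError, ValueError):   # empty string or non-numeric prefix
        return np.nan, None
    unit = parts[1].lower() if len(parts) > 1 else None
    if unit and unit.endswith("s"):    # guard: unit is None when only a number is present
        unit = unit[:-1]               # minutes->minute, seasons->season
    if unit == "minute":
        unit = "min"
    return num, unit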

dur = df["duration"].apply(_parse_duration)
df[["duration_num","duration_unit"]] = pd.DataFrame(dur.tolist(), index=df.index)

# Lowercase helpers
df["title_lower"] = df["title"].fillna("").str.lower()
df["desc_lower"]  = df.get("description", pd.Series("", index=df.index)).fillna("").str.lower()

# Quick sanity
df[["type","rating","listed_in","duration","duration_num","duration_unit","title_lower"]].head()

Unnamed: 0,type,rating,listed_in,duration,duration_num,duration_unit,title_lower
0,Movie,PG-13,Documentaries,90 min,90.0,min,dick johnson is dead
1,TV Show,TV-MA,"International TV Shows, TV Dramas, TV Mysteries",2 Seasons,2.0,season,blood & water
2,TV Show,TV-MA,"Crime TV Shows, International TV Shows, TV Act...",1 Season,1.0,season,ganglands
3,TV Show,TV-MA,"Docuseries, Reality TV",1 Season,1.0,season,jailbirds new orleans
4,TV Show,TV-MA,"International TV Shows, Romantic TV Shows, TV ...",2 Seasons,2.0,season,kota factory
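
As an alternative to the row-wise parser, the same split can be sketched with a vectorized regular expression. This is an illustration on a made-up Series rather than the method used in the cell above; the names `sample` and `parsed` are hypothetical.

```python
import pandas as pd

# Made-up sample mirroring the formats seen in the duration column
sample = pd.Series(["90 min", "3 Seasons", "1 Season", None])

# Capture the leading integer and the word that follows it
parsed = sample.str.extract(
    r"^\s*(?P<duration_num>\d+)\s+(?P<duration_unit>[A-Za-z]+)"
)
parsed["duration_num"] = pd.to_numeric(parsed["duration_num"])
# Normalize the unit: lowercase, then drop a trailing plural "s"
parsed["duration_unit"] = (
    parsed["duration_unit"].str.lower().str.replace(r"s$", "", regex=True)
)

print(parsed)
```

Missing or unparseable values come out as NaN in both columns, which matches the intent of the `(np.nan, None)` return in `_parse_duration`.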


### Step 3: Define the Labeling Task
The primary objective of this lab is to automatically generate weak labels for one key classification task:  
**`is_family_friendly`**, indicating whether a title is suitable for family audiences.  
We define the label space as:
- `1` → Family-friendly  
- `0` → Not family-friendly  
- `-1` → Abstain (no decision)  

This sets the foundation for all following labeling functions.

In [40]:
ABSTAIN  = -1
FAMILY   = 1
NOT_FAM  = 0

task_name = "is_family_friendly"
print("Label task:", task_name)

Label task: is_family_friendly


### Step 4: Create Labeling Functions (LFs)
Labeling Functions are simple heuristic rules that assign tentative labels to each record based on specific cues such as ratings, genres, or keywords.  
Each LF outputs `1`, `0`, or `-1` (abstain).  
Here we design rules like:
- Positive LFs: detect family-oriented ratings (`G`, `TV-Y`, etc.) or family-related genres.  
- Negative LFs: flag mature content (`R`, `TV-MA`) or adult-oriented genres like *Horror* or *Crime*.  

These diverse rules provide multiple weak supervision signals that we'll later combine.

In [42]:
import re

# Helper: safe-get a lowercase string
def _s(x): 
    return "" if pd.isna(x) else str(x).lower()

FAMILY_RATINGS_POS = {"g","tv-g","tv-y","tv-y7","pg"}          # strong positive signal
MATURE_RATINGS_NEG = {"r","tv-ma","nc-17"}                     # strong negative

FAMILY_WORDS       = ["family", "kids", "children", "children & family"]
NEG_GENRES         = ["horror","thriller","crime","stand-up comedy","stand-up"]

def lf_rating_positive(row):
    r = _s(row.get("rating"))
    return FAMILY if r in FAMILY_RATINGS_POS else ABSTAIN

def lf_rating_negative(row):
    r = _s(row.get("rating"))
    return NOT_FAM if r in MATURE_RATINGS_NEG else ABSTAIN

def lf_genre_family_bucket(row):
    genres = _s(row.get("listed_in"))
    return FAMILY if any(w in genres for w in FAMILY_WORDS) else ABSTAIN

def lf_genre_negative_bucket(row):
    genres = _s(row.get("listed_in"))
    return NOT_FAM if any(g in genres for g in NEG_GENRES) else ABSTAIN

def lf_title_keywords(row):
    t = _s(row.get("title_lower"))
    return FAMILY if any(w in t for w in ["family","kids","children"]) else ABSTAIN

def lf_desc_keywords(row):
    d = _s(row.get("desc_lower"))
    # broad: fires on any standalone mention of family/kids/children/all ages
    if re.search(r"\b(family|kids|children|all ages)\b", d):
        return FAMILY
    return ABSTAIN

def lf_long_gritty_tv(row):
    # heuristic: long, gritty tv shows are less likely family content
    if row.get("type") == "TV Show" and row.get("duration_unit") == "season":
        seasons = row.get("duration_num")
        genres  = _s(row.get("listed_in"))
        if pd.notna(seasons) and seasons >= 4 and any(g in genres for g in ["crime","thriller","horror"]):
            return NOT_FAM
    return ABSTAIN

LFs = [
    lf_rating_positive,
    lf_rating_negative,
    lf_genre_family_bucket,
    lf_genre_negative_bucket,
    lf_title_keywords,
    lf_desc_keywords,
    lf_long_gritty_tv
]

print(f"Loaded {len(LFs)} labeling functions.")

Loaded 7 labeling functions.
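
One thing worth checking is how broad the description rule is. The pattern in `lf_desc_keywords` matches the bare word "family", so the Ganglands synopsis in the preview ("To protect his family from a powerful drug lor...") would trigger a family-friendly vote. The sketch below contrasts that pattern with a narrower one; `STRICT_PATTERN` is a hypothetical alternative that is not part of the notebook, and both test sentences are made up.

```python
import re

# Pattern used in lf_desc_keywords
BROAD_PATTERN = r"\b(family|kids|children|all ages)\b"
# Hypothetical stricter variant: requires an audience phrase such as "for the whole family"
STRICT_PATTERN = r"\b(for|with) (the )?(whole )?(family|kids|children)\b|\ball ages\b"

texts = [
    "a detective must protect his family from a gang",  # made-up crime synopsis
    "a fun adventure for the whole family",              # made-up family synopsis
]

results = []
for t in texts:
    broad = bool(re.search(BROAD_PATTERN, t))
    strict = bool(re.search(STRICT_PATTERN, t))
    results.append((broad, strict))
    print(f"broad={broad} strict={strict} :: {t}")
```

The broad pattern fires on both sentences, while the stricter one fires only on the second. A narrower rule trades coverage for precision, so whether it helps would need to be checked against the coverage and conflict numbers in Step 6.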


### Step 5: Apply Labeling Functions
Each labeling function is applied across every record to create a **label matrix**, where each column corresponds to one LF and each row to a Netflix title.  
This matrix records all individual votes and forms the raw input for our weak-label aggregation step.

In [44]:
# Apply each LF to every row -> label matrix (n_rows x n_lfs)
LF_matrix = pd.DataFrame(
    {f"lf_{i:02d}": df.apply(fn, axis=1) for i, fn in enumerate(LFs)},
    index=df.index
)
LF_matrix.head()

Unnamed: 0,lf_00,lf_01,lf_02,lf_03,lf_04,lf_05,lf_06
0,-1,-1,-1,-1,-1,-1,-1
1,-1,0,-1,-1,-1,-1,-1
2,-1,0,-1,0,-1,1,-1
3,-1,0,-1,-1,-1,-1,-1
4,-1,0,-1,-1,-1,-1,-1


### Step 6: Measure LF Coverage and Conflicts
We evaluate how effective the labeling functions are by computing:
- **Coverage**: the percentage of rows receiving at least one vote.  
- **Conflict**: the percentage of rows with contradictory LF votes.  
- **Overlap**: the average number of votes per row, taken over rows that received at least one vote.  

These diagnostics help us gauge the strength and agreement of our labeling rules.


In [46]:
votes = LF_matrix.values
n, m = votes.shape

coverage = (votes != ABSTAIN).any(axis=1).mean()
conflict = np.array([
    (len(set(row[row != ABSTAIN])) > 1) for row in votes
]).mean()
overlap  = (votes != ABSTAIN).sum(axis=1)
print(f"LF count: {m}")
print(f"Coverage: {coverage*100:.1f}% of rows got at least one vote")
print(f"Conflict: {conflict*100:.1f}% of rows had conflicting votes")
print("Avg #votes / row:", overlap[overlap>0].mean() if (overlap>0).any() else 0)

LF count: 7
Coverage: 69.0% of rows got at least one vote
Conflict: 4.6% of rows had conflicting votes
Avg #votes / row: 1.4811460563148362


### Step 7: Aggregate Labels via Majority Vote
We combine all LF outputs using a simple majority-vote scheme:  
the label receiving more votes (family vs. non-family) is assigned as the final weak label for that record.  
Rows with ties or no votes remain **abstained (`-1`)**, representing uncertain or unlabelled cases.

In [48]:
def majority_vote(row, abstain=ABSTAIN):
    vals = [v for v in row if v != abstain]
    if not vals:
        return abstain
    # majority
    ones = vals.count(FAMILY)
    zeros = vals.count(NOT_FAM)
    if ones > zeros:
        return FAMILY
    if zeros > ones:
        return NOT_FAM
    return abstain  # tie

mv_labels = LF_matrix.apply(lambda r: majority_vote(list(r.values)), axis=1)
df[task_name + "_mv"] = mv_labels
df[task_name + "_mv"].value_counts(dropna=False)

is_family_friendly_mv
 0    4179
-1    2999
 1    1629
Name: count, dtype: int64
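
For larger label matrices, the row-wise `apply` can be replaced by array operations. The following is a sketch of an equivalent vectorized majority vote on a made-up matrix; `majority_vote_matrix` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import numpy as np

ABSTAIN, NOT_FAM, FAMILY = -1, 0, 1

def majority_vote_matrix(votes):
    """Vectorized majority vote; ties and all-abstain rows stay ABSTAIN."""
    votes = np.asarray(votes)
    ones = (votes == FAMILY).sum(axis=1)    # family votes per row
    zeros = (votes == NOT_FAM).sum(axis=1)  # not-family votes per row
    out = np.full(votes.shape[0], ABSTAIN)
    out[ones > zeros] = FAMILY
    out[zeros > ones] = NOT_FAM
    return out

# Made-up rows: clear family, clear not-family, tie, no votes
toy = [[1, 1, 0], [0, -1, 0], [1, 0, -1], [-1, -1, -1]]
mv_toy = majority_vote_matrix(toy)
print(mv_toy.tolist())  # [1, 0, -1, -1]
```

In the notebook this could be called as `majority_vote_matrix(LF_matrix.values)`, under the assumption that the matrix only contains the three label values.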

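### Step 8: Handle Abstains with a Fallback Rule
To avoid leaving rows unlabeled, we apply a fallback rule based on rating categories.  
If the majority vote abstained (no votes or a tie), "family-rated" content (e.g., `G`, `TV-Y`) is labeled positive, mature ratings (`R`, `TV-MA`) are labeled negative, and everything else defaults to not family-friendly.  
Rows on which every LF abstained cannot carry one of these ratings (a rating LF would have fired), so the rating branches only resolve ties; rows with no votes always receive the default `0`.  
This produces a complete labeled dataset, but not a balanced one: the counts below are 7166 not family-friendly versus 1641 family-friendly.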


In [50]:
def fallback_rule(row):
    r = _s(row.get("rating"))
    if r in FAMILY_RATINGS_POS:
        return FAMILY
    if r in MATURE_RATINGS_NEG:
        return NOT_FAM
    # neutral default
    return NOT_FAM

labels_final = []
for i in df.index:
    y = df.at[i, task_name + "_mv"]
    if y == ABSTAIN:
        y = fallback_rule(df.loc[i])
    labels_final.append(y)

df[task_name] = labels_final
df[task_name].value_counts()

is_family_friendly
0    7166
1    1641
Name: count, dtype: int64
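
Because the fallback overwrites abstains silently, it can help to record where each final label came from. The sketch below does this on a made-up frame; the column name `label_source` and the sample values are hypothetical.

```python
import pandas as pd

ABSTAIN = -1
demo = pd.DataFrame({
    "is_family_friendly_mv": [1, -1, 0, -1],  # made-up majority-vote results
    "is_family_friendly":    [1,  0, 0,  1],  # made-up final labels after fallback
})

# Mark rows whose final label came from the fallback rather than from LF votes
demo["label_source"] = demo["is_family_friendly_mv"].eq(ABSTAIN).map(
    {True: "fallback", False: "majority_vote"}
)
print(demo["label_source"].value_counts())
```

Keeping such a column would let later notebooks treat fallback labels as lower-confidence, for example by slicing on them during evaluation.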

### Step 9: Inspect Labeled Examples
To qualitatively assess label quality, we sample a few examples from both the *family-friendly* and *non-family* classes.  
Reviewing these helps confirm whether the weak labeling logic aligns with human intuition.


In [52]:
pos_examples = df[df[task_name] == FAMILY].sample(5, random_state=1)[["title","rating","listed_in","description"]]
neg_examples = df[df[task_name] == NOT_FAM].sample(5, random_state=2)[["title","rating","listed_in","description"]]
display(pos_examples)
display(neg_examples)

Unnamed: 0,title,rating,listed_in,description
813,The Adventures of Sonic the Hedgehog,TV-Y7,Kids' TV,"Hyper hedgehog Sonic and his cohort Miles ""Tai..."
1576,Bobbleheads The Movie,PG,"Children & Family Movies, Comedies",A team of bobbleheads band together to defend ...
1442,Korean Pork Belly Rhapsody,TV-G,"Docuseries, International TV Shows",A love letter to pork belly — a perennial favo...
3490,Birders,TV-G,"Documentaries, International Movies",Bird watchers on both sides of the U.S.-Mexico...
1170,Secret Magic Control Agency,TV-Y7,"Children & Family Movies, Comedies",Hansel and Gretel of fairy tale fame — now act...


Unnamed: 0,title,rating,listed_in,description
2547,Have a Good Trip: Adventures in Psychedelics,TV-MA,Documentaries,Explore hallucinogenic highs and lows as celeb...
495,Awon Boyz,TV-MA,"Documentaries, International Movies",This documentary takes a close look at the liv...
2334,Seven (Tamil),TV-MA,"Dramas, International Movies, Romantic Movies",Multiple women report their husbands as missin...
5984,100% Hotter,TV-14,"British TV Shows, International TV Shows, Real...","A stylist, a hair designer and a makeup artist..."
5915,Aziz Ansari Live at Madison Square Garden,TV-MA,Stand-Up Comedy,"Stand-up comedian and TV star Aziz Ansari (""Pa..."


### Step 10: Save the Labeled Dataset
We export the dataframe containing all derived labels to a new CSV file.  
This file serves as the primary input for the upcoming **Data Augmentation** and **Model Development** notebooks.

In [54]:
out_path = "/Users/sreevarshansathiyamurthy/Downloads/netflix_lab/netflix_labeled_family.csv"
df.to_csv(out_path, index=False)
print("✅ Saved labeled dataset to:", out_path)

✅ Saved labeled dataset to: /Users/sreevarshansathiyamurthy/Downloads/netflix_lab/netflix_labeled_family.csv
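
The absolute path above is specific to one machine. A more portable variant could build paths relative to a data directory, as sketched here; `DATA_DIR` and its value are assumptions, not something the notebook defines.

```python
from pathlib import Path

# Hypothetical data directory; adjust to wherever netflix_titles.csv lives
DATA_DIR = Path("netflix_lab")

in_path = DATA_DIR / "netflix_titles.csv"
out_path = DATA_DIR / "netflix_labeled_family.csv"

print(in_path.name, "->", out_path.name)
# In the notebook this would be followed by: df.to_csv(out_path, index=False)
```

pandas accepts `Path` objects in `read_csv` and `to_csv`, so the rest of the cell would not need to change.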


### Step 11: Evaluate Labeling Function Performance
We summarize how frequently each labeling function fired, along with its positive and negative votes.  
This quick diagnostic highlights which rules are most influential and whether any are too dominant or redundant.


In [56]:
stats = (LF_matrix != ABSTAIN).sum().sort_values(ascending=False).to_frame("fires")
stats["positive_votes"] = (LF_matrix == FAMILY).sum()
stats["negative_votes"] = (LF_matrix == NOT_FAM).sum()
stats

Unnamed: 0,fires,positive_votes,negative_votes
lf_01,4009,0,4009
lf_03,1783,0,1783
lf_00,1189,1189,0
lf_02,1092,1092,0
lf_05,804,804,0
lf_04,68,68,0
lf_06,50,0,50
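
The `lf_00` to `lf_06` ids follow the order of the `LFs` list from Step 4, so `lf_01` is `lf_rating_negative` and `lf_03` is `lf_genre_negative_bucket`. A lookup table makes the summary easier to read. The sketch below uses stand-in functions so that it runs on its own; `LFs_demo` and `lf_names` are hypothetical names, and in the notebook the real `LFs` list would be used instead.

```python
# Stand-in functions with the same names and order as the LFs list in Step 4
def lf_rating_positive(row): return -1
def lf_rating_negative(row): return -1
def lf_genre_family_bucket(row): return -1
def lf_genre_negative_bucket(row): return -1
def lf_title_keywords(row): return -1
def lf_desc_keywords(row): return -1
def lf_long_gritty_tv(row): return -1

LFs_demo = [
    lf_rating_positive, lf_rating_negative, lf_genre_family_bucket,
    lf_genre_negative_bucket, lf_title_keywords, lf_desc_keywords,
    lf_long_gritty_tv,
]

# Column ids are built the same way as in Step 5: f"lf_{i:02d}"
lf_names = {f"lf_{i:02d}": fn.__name__ for i, fn in enumerate(LFs_demo)}
for col, name in lf_names.items():
    print(col, "->", name)
```

With the real list, `stats.rename(index=lf_names)` would relabel the table rows by function name.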


### Step 12: Export Compact Dataset
Finally, we create a lightweight version containing only essential metadata and the generated family-friendly label.  
This trimmed file makes future modeling faster and easier to share.


In [58]:
keep_cols = [
    "show_id","type","title","director","cast","country","date_added","release_year","rating",
    "duration","listed_in","description","duration_num","duration_unit", task_name
]
df_small = df[[c for c in keep_cols if c in df.columns]].copy()
small_out = "/Users/sreevarshansathiyamurthy/Downloads/netflix_lab/netflix_family_labels_min.csv"
df_small.to_csv(small_out, index=False)
print("📦 Saved compact file:", small_out, "→", df_small.shape)

📦 Saved compact file: /Users/sreevarshansathiyamurthy/Downloads/netflix_lab/netflix_family_labels_min.csv → (8807, 15)
