# 🎬 Netflix Data Augmentation Lab: Notebook 2

---

### 🧠 **Objective**
This lab focuses on the **data augmentation** phase of the Netflix labeling pipeline.  
Using the weak labels created in Notebook 1, we enhance dataset diversity and mitigate class imbalance through text-based augmentation of the *title* and *description* fields.

---

### ⚙️ **Notebook Workflow**
1. **Load labeled dataset**: import the file produced in Notebook 1.
2. **Check class balance**: inspect label distribution.
3. **Text cleaning**: normalize descriptions for augmentation.
4. **Augment data**: create synthetic variants of under-represented samples using simple NLP transformations:
   - synonym replacement
   - random word swap / deletion
   - noise injection
5. **Concatenate & deduplicate**: merge original + augmented data.
6. **Evaluate balance**: verify improved class ratio.
7. **Save augmented dataset**: export for slicing & modeling.

---

### 🧩 **Key Concepts**
- **Data Augmentation** improves generalization by exposing models to slightly altered but semantically similar samples.  
- **Weak Labels** from Notebook 1 are preserved during augmentation.  
- **Target Column:** `is_family_friendly`  

---

### 🔄 **Outcome**
By the end of this notebook, you will have an expanded dataset  
`netflix_augmented.csv` ready for:
- **Data Slicing** and  
- **Model Development** (Notebook 3).


### Step 1: Load the Weakly Labeled Dataset
We import the labeled CSV produced in Notebook 1 and inspect its shape and first few rows.


In [3]:
import pandas as pd
import numpy as np
from collections import Counter

file_path = "/Users/sreevarshansathiyamurthy/Downloads/netflix_lab/netflix_labeled_family.csv"
df = pd.read_csv(file_path)

print("✅ Loaded:", file_path)
print("Shape:", df.shape)
df.head(3)

✅ Loaded: /Users/sreevarshansathiyamurthy/Downloads/netflix_lab/netflix_labeled_family.csv
Shape: (8807, 18)


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,duration_num,duration_unit,title_lower,desc_lower,is_family_friendly_mv,is_family_friendly
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",90.0,min,dick johnson is dead,"as her father nears the end of his life, filmm...",-1,0
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",2.0,season,blood & water,"after crossing paths at a party, a cape town t...",0,0
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,1.0,season,ganglands,to protect his family from a powerful drug lor...,0,0
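
Before going further, it can help to confirm that the loaded file has the columns the later steps rely on. The following is a minimal sketch, not part of the original notebook: the helper name `validate_labeled_frame` and the toy frame are hypothetical, and the required columns are assumed from the cells in this notebook (`title`, `description`, `is_family_friendly`).

```python
import pandas as pd

# Columns used by the later cells of this notebook (assumption based on those cells).
REQUIRED_COLUMNS = ["title", "description", "is_family_friendly"]

def validate_labeled_frame(frame, label_col="is_family_friendly"):
    """Raise ValueError if expected columns are missing or labels are not 0/1."""
    missing = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    unexpected = set(frame[label_col].dropna().unique()) - {0, 1}
    if unexpected:
        raise ValueError(f"Unexpected label values: {sorted(unexpected)}")
    return True

# Toy frame standing in for the real CSV (values are illustrative only).
toy = pd.DataFrame({
    "title": ["A", "B"],
    "description": ["x", None],
    "is_family_friendly": [0, 1],
})
print(validate_labeled_frame(toy))
```

On the real data, the call would be `validate_labeled_frame(df)` right after `pd.read_csv`.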


### Step 2: Check Class Balance
We count how many samples are labeled as **family-friendly (1)** vs **non-family (0)**.  
If a class imbalance exists, we will primarily augment the minority class to improve balance.


In [8]:
label_col = "is_family_friendly"
counts = df[label_col].value_counts()
print(counts)
print("\nClass balance (%):")
print(round(counts / counts.sum() * 100, 2))

is_family_friendly
0    7166
1    1641
Name: count, dtype: int64

Class balance (%):
is_family_friendly
0    81.37
1    18.63
Name: count, dtype: float64
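
The counts above also tell us how much augmentation is needed. As a rough worked example, the sketch below computes the imbalance ratio, the number of extra minority rows required for parity, and the share the minority class would reach if every minority row produced three variants (the number generated in Step 5). It ignores the deduplication in Step 6, so the real share ends up somewhat lower.

```python
# Counts copied from the output above.
n_majority = 7166
n_minority = 1641
variants_per_row = 3  # assumption: three augmented variants per minority row, as in Step 5

imbalance_ratio = n_majority / n_minority
needed_for_parity = n_majority - n_minority
max_synthetic = n_minority * variants_per_row

# Upper bound on the minority share, before any duplicates are dropped.
projected_minority = n_minority + max_synthetic
projected_share = projected_minority / (projected_minority + n_majority)

print(f"Imbalance ratio (majority / minority): {imbalance_ratio:.2f}")
print(f"Extra minority rows needed for parity: {needed_for_parity}")
print(f"Synthetic rows from {variants_per_row} variants per row: {max_synthetic}")
print(f"Projected minority share: {projected_share:.2%}")
```

Three variants per row therefore cannot reach exact parity on their own, since 4923 synthetic rows is below the 5525 needed.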


### Step 3: Clean and Prepare Text Fields
We merge the title and description into a single text column and clean basic punctuation, spacing, and casing.  
This unified text will be used for augmentation operations.


In [11]:
import re

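def clean_text(txt):
    """Lowercase the text, replace non-alphanumeric characters with spaces, collapse whitespace."""
    if pd.isna(txt):
        return ""
    txt = str(txt).lower()
    # Anything other than a-z, 0-9, or whitespace becomes a space.
    txt = re.sub(r"[^a-z0-9\s]", " ", txt)
    txt = re.sub(r"\s+", " ", txt).strip()
    return txt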

df["text"] = (df["title"].fillna("") + " " + df["description"].fillna("")).apply(clean_text)
df["text"].head(3)

0    dick johnson is dead as her father nears the e...
1    blood water after crossing paths at a party a ...
2    ganglands to protect his family from a powerfu...
Name: text, dtype: object
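
One side effect of this cleaning is that apostrophes become spaces, so a possessive such as "Equestria's" turns into the two tokens `equestria s` (visible in the previews later in this notebook). If stray single-letter tokens are a concern, one option is to drop apostrophes before the punctuation pass. The variant below is a sketch only; `clean_text_keep_contractions` is a hypothetical name and is not used elsewhere in the notebook.

```python
import re

def clean_text_keep_contractions(txt):
    """Variant of clean_text that removes apostrophes before stripping punctuation."""
    if txt is None:
        return ""
    txt = str(txt).lower()
    # Delete straight and curly apostrophes so "equestria's" becomes "equestrias".
    txt = re.sub(r"['\u2019]", "", txt)
    # Same two passes as clean_text: punctuation to spaces, then collapse whitespace.
    txt = re.sub(r"[^a-z0-9\s]", " ", txt)
    txt = re.sub(r"\s+", " ", txt).strip()
    return txt

sample = "Equestria's divided. But a bright-eyed hero..."
print(clean_text_keep_contractions(sample))
# equestrias divided but a bright eyed hero
```

The trade-off is that `equestrias` is no longer a dictionary word, so WordNet will not find synonyms for it, whereas `equestria s` leaves a stray `s` token that WordNet may try to replace.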

### Step 4: Define Text Augmentation Functions
We implement three lightweight, rule-based augmentation strategies:
1. **Synonym Replacement** using WordNet
2. **Random Swap** of two words
3. **Random Deletion**, where each word is dropped with probability `p`

Each produces a slightly varied sentence that is intended to preserve the overall meaning.

In [15]:
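import random

import nltk
from nltk.corpus import wordnet

# WordNet data only needs to be downloaded once per environment.
nltk.download('wordnet')
nltk.download('omw-1.4')

def synonym_replacement(sentence, n=1):
    """Replace up to n words with a different single-word WordNet lemma."""
    words = sentence.split()
    # Track positions rather than values so repeated words and n > 1 are handled safely.
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    replaced = 0
    for idx in candidates:
        if replaced >= n:
            break
        word = words[idx]
        # Keep alphabetic lemmas only and skip the word itself to avoid a no-op.
        lemmas = sorted({
            l.name().lower()
            for s in wordnet.synsets(word)
            for l in s.lemmas()
            if l.name().isalpha() and l.name().lower() != word
        })
        if lemmas:
            words[idx] = random.choice(lemmas)
            replaced += 1
    return " ".join(words)

def random_swap(sentence, n=1):
    """Swap two randomly chosen word positions, n times."""
    words = sentence.split()
    for _ in range(n):
        if len(words) < 2:
            break
        i1, i2 = random.sample(range(len(words)), 2)
        words[i1], words[i2] = words[i2], words[i1]
    return " ".join(words)

def random_deletion(sentence, p=0.1):
    """Drop each word independently with probability p."""
    words = sentence.split()
    if len(words) <= 1:
        return sentence
    kept = [w for w in words if random.uniform(0, 1) > p]
    # Never return an empty string: fall back to one random word.
    return " ".join(kept) if kept else random.choice(words)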

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/sreevarshansathiyamurthy/nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/sreevarshansathiyamurthy/nltk_data...
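
The workflow at the top of this notebook lists noise injection, but the cell above defines only synonym replacement, swap, and deletion. The sketch below shows one possible form of noise injection: swapping two adjacent inner characters of a word to mimic a typo. The function name `inject_char_noise`, the per-word probability `p`, and the choice of character-level noise are all assumptions for illustration, not the notebook's own method.

```python
import random

def inject_char_noise(sentence, p=0.05, rng=None):
    """Swap two adjacent inner characters of a word, with probability p per word."""
    rng = rng or random
    noisy = []
    for word in sentence.split():
        # Only touch words with at least two inner characters.
        if len(word) > 3 and rng.random() < p:
            # Pick a position so that the first and last characters stay fixed.
            i = rng.randrange(1, len(word) - 2)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        noisy.append(word)
    return " ".join(noisy)

# A dedicated Random instance keeps the demo repeatable without touching global state.
rng = random.Random(0)
print(inject_char_noise("my little pony a new generation", p=0.5, rng=rng))
```

Keeping the first and last characters in place tends to leave words recognizable, and a low `p` limits how many words per sentence are disturbed.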


### Step 5: Augment Minority Class Samples
We apply our augmentation functions to the minority class (whichever has fewer samples).  
Each selected record generates a few synthetic variations that inherit the same weak label.


In [19]:
minority_label = counts.idxmin()
df_minority = df[df[label_col] == minority_label]

augmented_rows = []
for _, row in df_minority.iterrows():
    txt = row["text"]
    aug1 = synonym_replacement(txt)
    aug2 = random_swap(txt)
    aug3 = random_deletion(txt)
    for aug_txt in [aug1, aug2, aug3]:
        augmented_rows.append({
            "type": row.get("type"),
            "title": row.get("title"),
            "description": row.get("description"),
            "text": aug_txt,
            label_col: row[label_col]
        })

df_aug = pd.DataFrame(augmented_rows)
print("Generated augmented samples:", len(df_aug))
df_aug.head(3)


Generated augmented samples: 4923


Unnamed: 0,type,title,description,text,is_family_friendly
0,Movie,My Little Pony: A New Generation,Equestria's divided. But a bright-eyed hero be...,my little pony a new generation equestria s di...,1
1,Movie,My Little Pony: A New Generation,Equestria's divided. But a bright-eyed hero be...,my little pony a new generation s equestria di...,1
2,Movie,My Little Pony: A New Generation,Equestria's divided. But a bright-eyed hero be...,my little pony a new generation equestria s di...,1
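
The augmentation functions draw from Python's global `random` state, so each run of the cell above yields different variants and slightly different counts after deduplication. A minimal sketch of making one augmentation pass repeatable is shown below. The helper `augment_with_seed` and the stand-in `toy_swap` are hypothetical names, and the seed value is arbitrary.

```python
import random

def toy_swap(sentence):
    """Stand-in augmentation: swap two random word positions."""
    words = sentence.split()
    if len(words) >= 2:
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

def augment_with_seed(texts, augment_fn, seed=42):
    """Apply augment_fn to every text under a fixed seed, then restore the RNG state."""
    saved_state = random.getstate()
    random.seed(seed)
    try:
        return [augment_fn(t) for t in texts]
    finally:
        # Restore the previous state so other code is unaffected by the seeding.
        random.setstate(saved_state)

texts = ["my little pony a new generation", "go go cory carson"]
run_a = augment_with_seed(texts, toy_swap)
run_b = augment_with_seed(texts, toy_swap)
print(run_a == run_b)  # True: same seed, same variants
```

A simpler alternative is a single `random.seed(...)` call at the top of the notebook, at the cost of fixing the random state for every later cell as well.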


### Step 6: Combine Original and Augmented Data
We merge the newly created synthetic samples with the original dataset and remove any duplicates based on the `text` field.


In [22]:
df_full = pd.concat([df, df_aug], ignore_index=True)
df_full.drop_duplicates(subset=["text"], inplace=True)
print("After merge:", df_full.shape)

After merge: (13083, 19)
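
The merged shape implies that 8807 + 4923 - 13083 = 647 rows were dropped as duplicates. To see where such duplicates come from, the rows can be tagged by origin before deduplicating. The sketch below uses toy frames and a hypothetical `source` column. Because `drop_duplicates` keeps the first occurrence by default and the originals are concatenated first, an augmented row that is identical to an original is the one that gets removed.

```python
import pandas as pd

# Toy stand-ins for df and df_aug (values are illustrative only).
orig = pd.DataFrame({"text": ["a b c", "d e f"], "label": [1, 0]})
aug = pd.DataFrame({"text": ["a b c", "a c b", "a c b"], "label": [1, 1, 1]})

# Tag each row with its origin (hypothetical helper column).
orig["source"] = "original"
aug["source"] = "augmented"
combined = pd.concat([orig, aug], ignore_index=True)

# Mark every repeat of a text after its first occurrence.
dup_mask = combined.duplicated(subset=["text"], keep="first")
dropped_by_source = combined.loc[dup_mask, "source"].value_counts()
deduped = combined.loc[~dup_mask].reset_index(drop=True)

print("Rows dropped:", int(dup_mask.sum()))
print(dropped_by_source)
print("Rows kept:", len(deduped))
```

In the toy data, one augmented row repeats an original and one repeats another augmented row, so two rows are dropped and three are kept.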


### Step 7: Evaluate Post-Augmentation Balance
After augmentation, we verify that both classes are better represented and that the dataset size has increased as expected.


In [25]:
new_counts = df_full[label_col].value_counts()
print("Class distribution after augmentation:")
print(new_counts)
print("\nPercentage:")
print(round(new_counts / new_counts.sum() * 100, 2))


Class distribution after augmentation:
is_family_friendly
0    7163
1    5920
Name: count, dtype: int64

Percentage:
is_family_friendly
0    54.75
1    45.25
Name: count, dtype: float64
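
The before and after counts printed in Steps 2 and 7 can be placed side by side to make the net effect easier to read. The sketch below hard-codes those printed counts; in the notebook itself, `counts` and `new_counts` could be used instead.

```python
import pandas as pd

# Counts copied from the Step 2 and Step 7 outputs.
before = pd.Series({0: 7166, 1: 1641}, name="before")
after = pd.Series({0: 7163, 1: 5920}, name="after")

summary = pd.concat([before, after], axis=1)
summary["before_pct"] = (summary["before"] / summary["before"].sum() * 100).round(2)
summary["after_pct"] = (summary["after"] / summary["after"].sum() * 100).round(2)
# Net change per class after augmentation and deduplication.
summary["net_change"] = summary["after"] - summary["before"]
print(summary)
```

The net change shows that class 1 gained 4279 rows rather than the full 4923 generated, and that class 0 lost 3 rows, which points to a few original rows sharing the same cleaned `text`.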


### Step 8: Save Augmented Dataset
We export the expanded dataset to a new CSV file for use in **Notebook 3 (Data Slicing & Model Development)**.

In [28]:
out_path = "/Users/sreevarshansathiyamurthy/Downloads/netflix_lab/netflix_augmented.csv"
df_full.to_csv(out_path, index=False)
print("✅ Saved augmented dataset to:", out_path)

✅ Saved augmented dataset to: /Users/sreevarshansathiyamurthy/Downloads/netflix_lab/netflix_augmented.csv
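
A quick round-trip check can catch surprises before Notebook 3 reads the file. The sketch below writes a toy frame to a temporary directory and reads it back; the file name and values are illustrative. It highlights one pandas behavior worth knowing: with default settings, an empty string is written as an empty field and read back as `NaN`, so a `text` value that ever ends up empty would come back as missing.

```python
import os
import tempfile

import pandas as pd

# Toy rows: one text is empty, and "director" is missing for that row,
# as it would be for augmented rows that copy only a few columns.
frame = pd.DataFrame({
    "text": ["a b c", ""],
    "director": ["Jane Doe", None],
    "is_family_friendly": [1, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "roundtrip_check.csv")
    frame.to_csv(tmp_path, index=False)
    reloaded = pd.read_csv(tmp_path)

print("Same shape:", reloaded.shape == frame.shape)
print("Missing text after reload:", int(reloaded["text"].isna().sum()))
```

If missing texts show up on the real file, filtering out rows with an empty `text` before saving is one way to handle them.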


### Step 9: Preview Augmented Text
We display a few original and augmented pairs to visually confirm that the transformations introduce small, realistic variations without changing semantic meaning.

In [32]:
display(df_minority[["text"]].head(3))
display(df_aug[["text"]].head(3))

Unnamed: 0,text
6,my little pony a new generation equestria s di...
13,confessions of an invisible girl when the clev...
23,go go cory carson chrissy takes the wheel from...


Unnamed: 0,text
0,my little pony a new generation equestria s di...
1,my little pony a new generation s equestria di...
2,my little pony a new generation equestria s di...
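
Because the previews above are truncated, rows 0 and 2 look identical to the original even if they differ further along. A small comparison helper makes the change explicit. The sketch below is one possible approach: `token_jaccard` and `describe_change` are hypothetical names, and the example strings are only the visible prefixes of the preview rows.

```python
def token_jaccard(a, b):
    """Jaccard similarity between the sets of whitespace-separated tokens."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def describe_change(original, augmented):
    """Classify an augmented text as unchanged, reordered, or edited."""
    if original == augmented:
        return "unchanged"
    if sorted(original.split()) == sorted(augmented.split()):
        return "reordered"
    return f"edited (jaccard={token_jaccard(original, augmented):.2f})"

# Visible prefixes of the truncated preview rows above.
original = "my little pony a new generation equestria s"
swapped = "my little pony a new generation s equestria"

print(describe_change(original, original))   # unchanged
print(describe_change(original, swapped))    # reordered
print(describe_change("a b c", "a b d"))     # edited (jaccard=0.50)
```

Applied to the full `text` values of each original and its three variants, this gives a quick count of how many variants are no-ops. Setting `pd.set_option("display.max_colwidth", None)` before calling `display` is another way to see the complete strings.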
