# AUTOMATED PODCAST TRANSCRIPTION AND SEGMENTATION

This project aims to preprocess long-form podcast audio, generate aligned transcripts using automatic speech recognition (ASR), and perform topic segmentation for efficient navigation and analysis.


**MOUNT GOOGLE DRIVE**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**ENVIRONMENT VERIFICATION**

In [None]:
import sys
print("Python version:", sys.version)

Python version: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]


**DATASET ACQUISITION**

The podcast transcript dataset was downloaded from Kaggle.
Because the audio files are large (~6 GB), they are stored separately from the transcripts and processed incrementally.

Dataset link:
https://www.kaggle.com/datasets/thedevastator/this-american-life-podcast-transcript-dataset
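
The notebook does not show how the incremental processing is organised, so the following is only a minimal sketch of one possible approach: walking the audio file list in fixed-size batches so that only a few files are handled at a time. The helper name `iter_batches`, the batch size of 5, and the example file names are assumptions made for illustration; they are not taken from the project.

```python
from typing import Iterator, List


def iter_batches(files: List[str], batch_size: int = 5) -> Iterator[List[str]]:
    """Yield consecutive slices of `files`, each holding at most `batch_size` names."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(files), batch_size):
        yield files[start:start + batch_size]


# Hypothetical file names, used only to illustrate the batching.
example_files = [f"{i}.mp3" for i in range(1, 13)]
batches = list(iter_batches(example_files, batch_size=5))

for number, batch in enumerate(batches, start=1):
    print(f"Batch {number}: {batch}")
```

In a Colab session, each batch could be processed and its results written back to Drive before the next batch starts, so an interrupted runtime loses at most one batch of work.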

**VERIFY DATASET FILES**

In [None]:
import os

print("Transcript files:")
print(os.listdir("/content/drive/MyDrive/podcast-project/data/transcripts_raw"))

Transcript files:
['lines_clean.csv', 'episode_info_clean.csv']
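
Printing the directory listing relies on reading the output by eye. As a hedged alternative, the check could be made explicit by comparing the listing against the file names that are expected. The helper `missing_files` below is hypothetical, and the demonstration runs against a temporary directory rather than the Drive folder.

```python
import os
import tempfile
from typing import Iterable, List


def missing_files(directory: str, expected: Iterable[str]) -> List[str]:
    """Return the expected file names that are absent from `directory`, sorted."""
    present = set(os.listdir(directory))
    return sorted(name for name in expected if name not in present)


# Demonstration on a temporary directory that holds only one of the two files.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "lines_clean.csv"), "w").close()
    missing = missing_files(tmp, ["lines_clean.csv", "episode_info_clean.csv"])

print("Missing:", missing)
```

An empty list means every expected file is present; in the notebook the same call could be pointed at the `transcripts_raw` folder.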


In [None]:
audio_dir = "/content/drive/MyDrive/podcast-project/data/audio_raw"

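# Sort by the numeric part of the file name so that '10.mp3' follows '9.mp3'
# (a plain string sort would place it right after '1.mp3').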
audio_files = sorted(
    os.listdir(audio_dir),
    key=lambda f: int(f.split(".")[0])
)

print("Audio files:")
print(audio_files[:10])   # show only first 10


Audio files:
['1.mp3', '2.mp3', '3.mp3', '4.mp3', '5.mp3', '6.mp3', '7.mp3', '8.mp3', '9.mp3', '10.mp3']
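
The sort key used above calls `int()` on the part of the name before the first dot, so it raises `ValueError` if the folder ever contains a file whose name does not start with a number (a hidden system file, for example). A more defensive variant is sketched below; the helper name `sort_numeric_mp3` and the sample names are assumptions for illustration only.

```python
from typing import List


def sort_numeric_mp3(names: List[str]) -> List[str]:
    """Return the .mp3 names whose stem is an integer, ordered by that integer."""
    numbered = []
    for name in names:
        stem, dot, ext = name.rpartition(".")
        # Keep only names of the form "<digits>.mp3"; silently skip anything else.
        if dot and ext.lower() == "mp3" and stem.isdecimal():
            numbered.append((int(stem), name))
    return [name for _, name in sorted(numbered)]


# Hypothetical listing that mixes episode files with unrelated entries.
sample = ["10.mp3", "2.mp3", ".DS_Store", "1.mp3", "notes.txt", "intro.mp3"]
print(sort_numeric_mp3(sample))
```

Skipping unexpected names is a design choice; if stray files should be treated as an error instead, the function could collect them and raise.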


**TRUNCATE CSV TRANSCRIPT FILES TO THE FIRST 200 ROWS**

In [None]:
import os
import pandas as pd

INPUT_DIR = "/content/drive/MyDrive/podcast-project/data/transcripts_raw"
OUTPUT_DIR = "/content/drive/MyDrive/podcast-project/data/transcripts_raw_truncated"

# Create the output directory if it does not already exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

for file in os.listdir(INPUT_DIR):
    if file.endswith(".csv"):
        df = pd.read_csv(os.path.join(INPUT_DIR, file))

        name, ext = os.path.splitext(file)
        output_file = f"{name}_200{ext}"

        # Keep only the first 200 rows (positions 0-199, selected with iloc) and write them to CSV
        df.iloc[:200].to_csv(
            os.path.join(OUTPUT_DIR, output_file),
            index=False
        )

        print(f"Saved {output_file}")


Saved lines_clean_200.csv
Saved episode_info_clean_200.csv
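
The cell above loads each CSV in full before slicing it. For large transcript files, one alternative is to let pandas stop reading early through the `nrows` parameter of `pd.read_csv`. The sketch below assumes that approach; the helper name `truncate_csv` and the synthetic `line_id` column are illustrative and do not come from the dataset.

```python
import os
import tempfile

import pandas as pd


def truncate_csv(src_path: str, dst_path: str, n_rows: int = 200) -> int:
    """Copy the first `n_rows` data rows of a CSV and return how many rows were written."""
    # nrows limits parsing to the first n_rows data rows (the header is not counted).
    df = pd.read_csv(src_path, nrows=n_rows)
    df.to_csv(dst_path, index=False)
    return len(df)


# Demonstration on a synthetic 500-row file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.csv")
    dst = os.path.join(tmp, "demo_200.csv")
    pd.DataFrame({"line_id": range(500)}).to_csv(src, index=False)
    written = truncate_csv(src, dst)
    check = pd.read_csv(dst)

print(f"Rows written: {written}")
```

The output file should match what `df.iloc[:200]` produces, but without parsing the remaining rows first.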
