# Capstone Project: Bringing It All Together

In this lab, you will combine many of the tools and techniques that you have learned throughout this course into a final project. There are several paths to a solution: you could use AWS managed services, such as Amazon Comprehend, or Amazon SageMaker models. Have fun on whichever path you choose.

## Business scenario

You work for a training organization that recently developed an introductory course about machine learning (ML). The course includes more than 40 videos that cover a broad range of ML topics. You have been asked to create an application that students can use to quickly locate and view video content by searching for topics and key phrases.

You have downloaded all of the videos to an Amazon Simple Storage Service (Amazon S3) bucket. Your assignment is to produce a dashboard that meets your supervisor’s requirements.

To assist you, all of the previous labs have been provided in this workspace.


## 2. Transcribing the videos

Use this section to implement your solution to transcribe the videos.


In [None]:

# --- Imports & Configuration ---
import boto3
import json
import pandas as pd
import re
import uuid
import time
from time import sleep
import matplotlib.pyplot as plt
from collections import Counter

# Lab region
region_name = "us-east-1"

# Student lab S3 bucket for outputs (provided by user)
bucket = "c176045a4549683l12324630t1w934798949390-labbucket-mmdkc0xqpkjx"

# Comprehend role (provided by user)
job_data_access_role = "arn:aws:iam::934798949390:role/service-role/c176045a4549683l12324630t1-ComprehendDataAccessRole-E2EeGxWSgfrW"

# Source videos (read-only shared bucket path for reference)
video_source_bucket_prefix = "s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/"

# AWS clients
s3 = boto3.client("s3", region_name=region_name)
transcribe = boto3.client("transcribe", region_name=region_name)
comprehend = boto3.client("comprehend", region_name=region_name)

print("Region:", region_name)
print("Student bucket:", bucket)
print("Comprehend role:", job_data_access_role)
print("Video source:", video_source_bucket_prefix)


In [None]:

# --- Transcribing the videos ---
# Most Academy roles restrict creating Transcribe jobs. This section prioritizes consuming
# existing outputs in your student bucket (transcribe-job-*.json).
# If no outputs are present, you can *optionally* attempt to start jobs (code commented below).

# 1) (Optional) Peek at the shared source video listing (requires AWS CLI in system shell):
# !aws s3 ls s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/

# 2) Discover completed Transcribe JSONs in your bucket
resp = s3.list_objects_v2(Bucket=bucket, Prefix="transcribe-job-")
output_files = []
for obj in resp.get("Contents", []):
    key = obj["Key"]
    if not key.endswith(".json"):
        continue  # skip anything under the prefix that is not a Transcribe result
    video_id = key.replace("transcribe-job-", "").replace(".json", "")
    output_files.append({"Video": video_id, "OutputKey": key})

print(f"Found {len(output_files)} Transcribe result files in s3://{bucket}/")
if not output_files:
    print("No transcribe-job-*.json files found. If permitted, you may attempt to create jobs below.")

# 3) (Optional) Create Transcribe jobs (often blocked in student roles)
# NOTE: This is provided for completeness; many labs expect AccessDenied for job creation.
# from urllib.parse import quote
# import uuid
# media_uri = "s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/Mod01_Course Overview.mp4"
# job_name = f"CapstoneTranscribeJob-{uuid.uuid4()}"
# try:
#     start_resp = transcribe.start_transcription_job(
#         TranscriptionJobName=job_name,
#         Media={"MediaFileUri": media_uri},
#         MediaFormat="mp4",
#         LanguageCode="en-US",
#         OutputBucketName=bucket
#     )
#     print("Started job:", start_resp["TranscriptionJob"]["TranscriptionJobName"])
# except Exception as e:
#     print("Transcribe start likely not permitted in this lab:", e)

output_files[:5]
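
If job creation is permitted in your lab role, the commented code above starts a job but does not wait for it to finish. The following is a minimal sketch of a polling helper, not part of the lab material. The names `wait_for_job`, `poll_seconds`, and `timeout_seconds` are hypothetical, and the status source is passed in as a function so the sketch can run without any AWS call.

```python
import time
from typing import Callable

# Terminal states reported by Amazon Transcribe for a transcription job.
TERMINAL_STATES = {"COMPLETED", "FAILED"}


def wait_for_job(get_status: Callable[[], str],
                 poll_seconds: float = 30.0,
                 timeout_seconds: float = 1800.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_status() until it returns a terminal state or the timeout passes.

    wait_for_job and its parameters are hypothetical names chosen for this
    sketch; they are not part of boto3.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still {status} after {waited:.0f} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Offline demonstration with a fake status source, so no AWS call is made.
_fake_statuses = iter(["QUEUED", "IN_PROGRESS", "COMPLETED"])
final_status = wait_for_job(lambda: next(_fake_statuses),
                            poll_seconds=0.0,
                            sleep=lambda seconds: None)
print(final_status)
```

In the notebook, the status function could be `lambda: transcribe.get_transcription_job(TranscriptionJobName=job_name)["TranscriptionJob"]["TranscriptionJobStatus"]`, assuming `job_name` holds the name of a job that you started.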


## 3. Normalizing the text

Use this section to perform any text normalization steps that are necessary for your solution.


In [None]:

# --- Load transcripts and normalize text ---
data_rows = []
for entry in output_files:
    key = entry["OutputKey"]
    try:
        obj = s3.get_object(Bucket=bucket, Key=key)
        payload = obj["Body"].read().decode("utf-8")
        data = json.loads(payload)
        transcript = data["results"]["transcripts"][0]["transcript"]
        data_rows.append({"Video": entry["Video"], "Transcription": transcript})
    except Exception as e:
        print(f"Error reading {key}:", e)

df = pd.DataFrame(data_rows)
print("Loaded transcripts:", len(df))
display(df.head())

def normalize_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

if "Transcription" not in df.columns:
    raise KeyError("Expected 'Transcription' column missing; verify previous step.")

df["clean_text"] = df["Transcription"].apply(normalize_text)
print("Normalized transcripts.")
display(df[["Video", "clean_text"]].head())
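
To see what the normalization above does to a single sentence, the sketch below repeats the same three steps on a made-up sample and then applies an optional stop-word filter. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical additions for illustration; the lab does not prescribe a stop-word list.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "s"}


def normalize_text(text: str) -> str:
    """Same steps as the cell above: lowercase, strip punctuation, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def remove_stop_words(clean_text: str, stop_words=STOP_WORDS) -> str:
    """Drop tokens that appear in stop_words, keeping the original word order."""
    return " ".join(tok for tok in clean_text.split() if tok not in stop_words)


sample = "Hello, World! It's the ML-101 course."
cleaned = normalize_text(sample)
print(cleaned)                     # hello world it s the ml 101 course
print(remove_stop_words(cleaned))  # hello world ml 101 course
```

Note that stripping punctuation splits contractions such as "it's" into two tokens, which is one reason a stop-word pass after normalization may be useful.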


## 4. Extracting key phrases and topics

Use this section to extract the key phrases and topics from the videos.


In [None]:

# --- Extracting key phrases and topics (real Amazon Comprehend) ---

# 1) DetectKeyPhrases per transcript (real-time)
key_rows = []
print("Running Amazon Comprehend DetectKeyPhrases...")
for i, row in df.iterrows():
    text = row["Transcription"][:4500]  # ~5000 bytes limit safety
    try:
        resp = comprehend.detect_key_phrases(Text=text, LanguageCode="en")
        phrases = [kp["Text"] for kp in resp.get("KeyPhrases", [])]
    except Exception as e:
        print(f"Comprehend error on {row['Video']}: {e}")
        phrases = []
    key_rows.append({"Video": row["Video"], "KeyPhrases": phrases})
    if (i + 1) % 10 == 0:
        print(f"Processed {i+1}/{len(df)}")
    sleep(0.25)  # polite throttle

df_keys = pd.DataFrame(key_rows)
print("Key phrases extracted for:", len(df_keys), "videos")
display(df_keys.head())

# 2) Prepare inputs for Topics Detection (one text file per transcript)
input_prefix = "transcribe-json-input/"
uploaded = 0
for _, r in df.iterrows():
    key = f"{input_prefix}{r['Video']}.txt"
    s3.put_object(Bucket=bucket, Key=key, Body=r["Transcription"].encode("utf-8"))
    uploaded += 1
print(f"Uploaded {uploaded} ONE_DOC_PER_FILE inputs to s3://{bucket}/{input_prefix}")

# 3) Start Topics Detection Job
input_s3_uri  = f"s3://{bucket}/{input_prefix}"
output_s3_uri = f"s3://{bucket}/comprehend-topics-output/"

try:
    start = comprehend.start_topics_detection_job(
        InputDataConfig={"S3Uri": input_s3_uri, "InputFormat": "ONE_DOC_PER_FILE"},
        OutputDataConfig={"S3Uri": output_s3_uri},
        DataAccessRoleArn=job_data_access_role,
        JobName=f"ComprehendTopics-{uuid.uuid4()}",
        NumberOfTopics=10
    )
    job_id = start["JobId"]
    print("Started Topics Detection Job:", job_id)
except Exception as e:
    job_id = None
    print("Could not start Topics Detection (may be restricted by lab role):", e)

# 4) Monitor (if started)
if job_id:
    while True:
        desc = comprehend.describe_topics_detection_job(JobId=job_id)
        props = desc["TopicsDetectionJobProperties"]
        state = props["JobStatus"]
        print("Job status:", state)
        if state in ("COMPLETED", "FAILED", "STOPPED"):  # stop polling on any terminal state
            print("Final properties:", json.dumps(props, indent=2, default=str))
            break
        time.sleep(60)

# 5) List outputs (if any)
res = s3.list_objects_v2(Bucket=bucket, Prefix="comprehend-topics-output/")
print("Topics output files:")
for obj in res.get("Contents", []) or []:
    print("-", obj["Key"])


## 5. Creating the dashboard

Use this section to create the dashboard for your solution.


In [None]:

# --- Creating the dashboard (simple visualization example) ---
# Here we visualize the top key phrases across all transcripts.
# In a fuller dashboard, you'd add filters/search, per-video links, etc.

if not df_keys.empty and "KeyPhrases" in df_keys.columns:
    all_phrases = [p for sub in df_keys["KeyPhrases"] for p in (sub or [])]
    if all_phrases:
        top = Counter(all_phrases).most_common(10)
        labels, counts = zip(*top)
        plt.figure(figsize=(8,4))
        plt.barh(labels, counts)
        plt.gca().invert_yaxis()
        plt.title("Top Key Phrases (Amazon Comprehend)")
        plt.xlabel("Frequency")
        plt.show()
    else:
        print("No key phrases found to visualize.")
else:
    print("df_keys is empty; run extraction first.")
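
The business scenario asks for a way to locate videos by searching for topics and key phrases, which the bar chart alone does not provide. One possible approach is an inverted index from key phrase to video, sketched below. The function names `build_phrase_index` and `search_videos` and the sample data are hypothetical, and substring matching is just one simple choice of search behavior.

```python
from collections import defaultdict


def build_phrase_index(key_phrases_by_video: dict) -> dict:
    """Map each lowercased key phrase to the set of videos that mention it."""
    index = defaultdict(set)
    for video, phrases in key_phrases_by_video.items():
        for phrase in phrases or []:
            index[phrase.lower().strip()].add(video)
    return index


def search_videos(index: dict, query: str) -> list:
    """Return sorted video names whose key phrases contain the query as a substring."""
    needle = query.lower().strip()
    if not needle:
        return []
    hits = set()
    for phrase, videos in index.items():
        if needle in phrase:
            hits |= videos
    return sorted(hits)


# Made-up sample data; in the notebook this could come from
# dict(zip(df_keys["Video"], df_keys["KeyPhrases"])).
sample = {
    "Mod01_Intro": ["machine learning", "the course"],
    "Mod02_NLP": ["natural language processing", "Machine Learning models"],
    "Mod03_Vision": ["computer vision"],
}
index = build_phrase_index(sample)
print(search_videos(index, "machine learning"))  # ['Mod01_Intro', 'Mod02_NLP']
print(search_videos(index, "vision"))            # ['Mod03_Vision']
```

Lowercasing both the phrases and the query keeps the search case-insensitive, so "Machine Learning models" and "machine learning" match the same query.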
