# Project

In this project, you will bring together many of the tools and techniques that you have learned throughout this course. You can choose from many different paths to get to the solution.

### Business scenario

You work for a training organization that recently developed an introductory course about machine learning (ML). The course includes more than 40 videos that cover a broad range of ML topics. You have been asked to create an application that students can use to quickly locate and view video content by searching for topics and key phrases.

You have downloaded all of the videos to an Amazon Simple Storage Service (Amazon S3) bucket. Your assignment is to produce a dashboard that meets your supervisor’s requirements.

## Project steps

To complete this project, you will follow these steps:

1. [Viewing the video files](#1.-Viewing-the-video-files)
2. [Transcribing the videos](#2.-Transcribing-the-videos)
3. [Normalizing the text](#3.-Normalizing-the-text)
4. [Extracting key phrases and topics](#4.-Extracting-key-phrases-and-topics)
5. [Creating the dashboard](#5.-Creating-the-dashboard)

## Useful information

The following cell contains some information that might be useful as you complete this project.

In [1]:
bucket = "c56161a939430l3396553t1w744137092661-labbucket-rn642jaq01e9"
job_data_access_role = 'arn:aws:iam::744137092661:role/service-role/c56161a939430l3396553t1w7-ComprehendDataAccessRole-1P24MSS91ADHP'

## 1. Viewing the video files
([Go to top](#Project))


The source video files are located in the following shared Amazon Simple Storage Service (Amazon S3) bucket.

In [2]:
!aws s3 ls s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/

2021-04-26 20:17:33  410925369 Mod01_Course Overview.mp4
2021-04-26 20:10:02   39576695 Mod02_Intro.mp4
2021-04-26 20:31:23  302994828 Mod02_Sect01.mp4
2021-04-26 20:17:33  416563881 Mod02_Sect02.mp4
2021-04-26 20:17:33  318685583 Mod02_Sect03.mp4
2021-04-26 20:17:33  255877251 Mod02_Sect04.mp4
2021-04-26 20:23:51   99988046 Mod02_Sect05.mp4
2021-04-26 20:24:54   50700224 Mod02_WrapUp.mp4
2021-04-26 20:26:27   60627667 Mod03_Intro.mp4
2021-04-26 20:26:28  272229844 Mod03_Sect01.mp4
2021-04-26 20:27:06  309127124 Mod03_Sect02_part1.mp4
2021-04-26 20:27:06  195635527 Mod03_Sect02_part2.mp4
2021-04-26 20:28:03  123924818 Mod03_Sect02_part3.mp4
2021-04-26 20:31:28  171681915 Mod03_Sect03_part1.mp4
2021-04-26 20:32:07  285200083 Mod03_Sect03_part2.mp4
2021-04-26 20:33:17  105470345 Mod03_Sect03_part3.mp4
2021-04-26 20:35:10  157185651 Mod03_Sect04_part1.mp4
2021-04-26 20:36:27  187435635 Mod03_Sect04_part2.mp4
2021-04-26 20:36:40  280720369 Mod03_Sect04_part3.mp4
2021-04-26 20:40:01  443479
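
The listing above can also be turned into structured records, for example to total the download size before syncing the bucket. The following is a minimal sketch, assuming each line has the `date time size name` layout shown in the output; `S3Object` and `parse_s3_ls_line` are hypothetical names, not part of the AWS CLI.

```python
from typing import NamedTuple


class S3Object(NamedTuple):
    modified: str    # "YYYY-MM-DD HH:MM:SS"
    size_bytes: int
    key: str         # file name; may contain spaces


def parse_s3_ls_line(line: str) -> S3Object:
    # Split on whitespace at most three times so that spaces
    # inside the file name are preserved
    date, time, size, key = line.strip().split(None, 3)
    return S3Object(f"{date} {time}", int(size), key)


# Two lines copied from the listing above
listing = [
    "2021-04-26 20:17:33  410925369 Mod01_Course Overview.mp4",
    "2021-04-26 20:10:02   39576695 Mod02_Intro.mp4",
]
objects = [parse_s3_ls_line(line) for line in listing]
total_bytes = sum(obj.size_bytes for obj in objects)
print(f"{len(objects)} files, {total_bytes / 1_000_000:.1f} MB")
```

Limiting the split to three separators matters here because at least one file name in the bucket (`Mod01_Course Overview.mp4`) contains a space.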

## 2. Transcribing the videos
([Go to top](#Project))

Use this section to implement your solution to transcribe the videos. 
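
Transcribing more than 40 videos takes a long time, so it can help to make the job resumable. The sketch below is one possible approach, not a required part of the solution: it derives the audio and transcript file names for each video and skips any video whose transcript already exists. The helper names `plan_outputs` and `pending_videos` are hypothetical; the `_transcription.txt` suffix is an assumption that matches the naming used elsewhere in this notebook.

```python
from pathlib import Path


def plan_outputs(video_name):
    # Derive output names from the video name, replacing spaces with underscores
    stem = Path(video_name).stem.replace(" ", "_")
    return {"audio": f"{stem}.wav", "transcript": f"{stem}_transcription.txt"}


def pending_videos(video_names, existing_files):
    # Keep only the videos that do not have a transcript yet
    existing = set(existing_files)
    return [v for v in video_names if plan_outputs(v)["transcript"] not in existing]


videos = ["Mod01_Course Overview.mp4", "Mod02_Intro.mp4"]
already_done = ["Mod02_Intro_transcription.txt"]
todo = pending_videos(videos, already_done)
print(todo)
```

In a notebook, `existing_files` could be the result of `os.listdir()`, so that rerunning the cell after an interruption only processes the remaining videos.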

In [5]:
# Prerequisites for the code to run
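# Install the PyTorch packages from the CUDA 12.1 wheel index
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install the remaining packages from the default index (PyPI); the PyTorch index does not host them.
# dash already bundles the dcc and html component libraries.
!pip install transformers nltk spacy bertopic dash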

# For converting video to audio ffmpeg is required to be installed
#!apt-get install ffmpeg

# Downloading en_core_web_sm model
!python -m spacy download en_core_web_sm

Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting torch
  Using cached https://download.pytorch.org/whl/cu121/torch-2.2.2%2Bcu121-cp311-cp311-win_amd64.whl (2454.8 MB)
Collecting torchvision
  Using cached https://download.pytorch.org/whl/cu121/torchvision-0.17.2%2Bcu121-cp311-cp311-win_amd64.whl (5.7 MB)
Collecting torchaudio
  Using cached https://download.pytorch.org/whl/cu121/torchaudio-2.2.2%2Bcu121-cp311-cp311-win_amd64.whl (4.1 MB)
Installing collected packages: torch, torchvision, torchaudio
Successfully installed torch-2.2.2+cu121 torchaudio-2.2.2+cu121 torchvision-0.17.2+cu121
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)

In [6]:
# Download the videos from the S3 bucket to the SageMaker notebook
#!aws s3 sync s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/ .

In [2]:
# Write your answer/code here

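import os
import subprocess
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Function to extract audio from video
def extract_audio(video_path, output_audio_path):
    # Pass the arguments as a list so that paths containing spaces are handled safely;
    # -y overwrites an existing output file instead of waiting for a prompt
    command = ["ffmpeg", "-y", "-i", video_path, "-ab", "160k", "-ac", "1", "-ar", "16000", "-vn", output_audio_path]
    subprocess.run(command, check=True)  # Raise an error if ffmpeg fails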

# Function to transcribe audio file
def transcribe_audio(audio_file_path, pipe):
    result = pipe(audio_file_path, generate_kwargs={"language": "english"})
    return result["text"]

# Check if CUDA is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load pre-trained model and processor
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, use_safetensors=True)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Create pipeline for automatic speech recognition
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# Get all video files in the current directory
video_files = [f for f in os.listdir() if f.endswith('.mp4') or f.endswith('.avi')]

# Convert each video to audio and transcribe it
for video_file in video_files:
    # Replace spaces in the filename with underscores
    new_video_file = video_file.replace(' ', '_')
    os.rename(video_file, new_video_file)
    video_file = new_video_file
    
    base_name = os.path.splitext(video_file)[0]  # File name without the extension
    audio_file_path = base_name + '.wav'  # Create audio file path
    extract_audio(video_file, audio_file_path)  # Extract audio from video
    transcription = transcribe_audio(audio_file_path, pipe)  # Transcribe audio
    
    # Save transcription to a text file
    output_text_file = base_name + '_transcription.txt'
    with open(output_text_file, 'w') as f:
        f.write(transcription)
    
    print(f"Transcription for {video_file} saved to {output_text_file}")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
  attn_output = torch.nn.functional.scaled_dot_product_attention(


Transcription for Mod01_Course_Overview.mp4 saved to Mod01_Course_Overview_transcription.txt
Transcription for Mod02_Intro.mp4 saved to Mod02_Intro_transcription.txt
Transcription for Mod02_Sect01.mp4 saved to Mod02_Sect01_transcription.txt
Transcription for Mod02_Sect02.mp4 saved to Mod02_Sect02_transcription.txt
Transcription for Mod02_Sect03.mp4 saved to Mod02_Sect03_transcription.txt
Transcription for Mod02_Sect04.mp4 saved to Mod02_Sect04_transcription.txt
Transcription for Mod02_Sect05.mp4 saved to Mod02_Sect05_transcription.txt
Transcription for Mod02_WrapUp.mp4 saved to Mod02_WrapUp_transcription.txt
Transcription for Mod03_Intro.mp4 saved to Mod03_Intro_transcription.txt
Transcription for Mod03_Sect01.mp4 saved to Mod03_Sect01_transcription.txt


--- Logging error ---
Traceback (most recent call last):
  File "C:\Program Files\Python311\Lib\logging\__init__.py", line 1110, in emit
    msg = self.format(record)
          ^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\logging\__init__.py", line 953, in format
    return fmt.format(record)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\logging\__init__.py", line 687, in format
    record.message = record.getMessage()
                     ^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\logging\__init__.py", line 377, in getMessage
    msg = msg % self.args
          ~~~~^~~~~~~~~~~
TypeError: not all arguments converted during string formatting
Call stack:
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Sandy\PycharmProjects\pythonProject\.venv\Lib\site-packages\ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "C:\Users\Sandy\Pychar

Transcription for Mod03_Sect02_part1.mp4 saved to Mod03_Sect02_part1_transcription.txt
Transcription for Mod03_Sect02_part2.mp4 saved to Mod03_Sect02_part2_transcription.txt
Transcription for Mod03_Sect02_part3.mp4 saved to Mod03_Sect02_part3_transcription.txt
Transcription for Mod03_Sect03_part1.mp4 saved to Mod03_Sect03_part1_transcription.txt
Transcription for Mod03_Sect03_part2.mp4 saved to Mod03_Sect03_part2_transcription.txt
Transcription for Mod03_Sect03_part3.mp4 saved to Mod03_Sect03_part3_transcription.txt
Transcription for Mod03_Sect04_part1.mp4 saved to Mod03_Sect04_part1_transcription.txt
Transcription for Mod03_Sect04_part2.mp4 saved to Mod03_Sect04_part2_transcription.txt
Transcription for Mod03_Sect04_part3.mp4 saved to Mod03_Sect04_part3_transcription.txt
Transcription for Mod03_Sect05.mp4 saved to Mod03_Sect05_transcription.txt
Transcription for Mod03_Sect06.mp4 saved to Mod03_Sect06_transcription.txt
Transcription for Mod03_Sect07_part1.mp4 saved to Mod03_Sect07_part

Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.


Transcription for Mod06_Sect02.mp4 saved to Mod06_Sect02_transcription.txt
Transcription for Mod06_WrapUp.mp4 saved to Mod06_WrapUp_transcription.txt
Transcription for Mod07_Sect01.mp4 saved to Mod07_Sect01_transcription.txt
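
Because the pipeline is created with `return_timestamps=True`, its result should contain a `"chunks"` list next to `"text"`, where each chunk has a `"timestamp"` tuple of start and end seconds. The sketch below shows one way such chunks could be turned into time-stamped lines, so that a search hit can point to a position in the video. The sample chunks are made up for illustration, and `format_timestamp` and `chunks_to_lines` are hypothetical helpers. The end time of the last chunk can be `None`, which is consistent with the Whisper warning in the output above.

```python
def format_timestamp(seconds):
    # Convert seconds to an MM:SS label
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def chunks_to_lines(chunks):
    lines = []
    for chunk in chunks:
        start, _end = chunk["timestamp"]  # the end time may be None
        if start is None:
            continue  # skip chunks without a usable start time
        lines.append(f"[{format_timestamp(start)}] {chunk['text'].strip()}")
    return lines


# Made-up sample in the assumed shape of result["chunks"]
sample_chunks = [
    {"timestamp": (0.0, 4.5), "text": " Welcome to the course."},
    {"timestamp": (64.2, None), "text": " Let's get started"},
]
lines = chunks_to_lines(sample_chunks)
print("\n".join(lines))
```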


## 3. Normalizing the text
([Go to top](#Project))

Use this section to perform any text normalization steps that are necessary for your solution.
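
One design choice to consider is whether to overwrite the raw transcripts. Writing the normalized text to a separate file keeps the original transcription available and makes the step safe to rerun. The sketch below illustrates the idea under stated assumptions: `simple_normalize` is only a stand-in for a fuller normalization function, and the `_normalized.txt` suffix is a hypothetical naming choice.

```python
import re
import tempfile
from pathlib import Path


def simple_normalize(text):
    # Stand-in normalizer: lowercase and drop everything except letters and whitespace
    return re.sub(r"[^a-z\s]", "", text.lower())


def normalize_to_copy(transcript_path, normalize=simple_normalize):
    # Write the normalized text next to the raw transcript instead of overwriting it
    src = Path(transcript_path)
    dst = src.with_name(src.name.replace("_transcription.txt", "_normalized.txt"))
    dst.write_text(normalize(src.read_text(encoding="utf-8")), encoding="utf-8")
    return dst


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "Mod02_Intro_transcription.txt"
    raw.write_text("Machine Learning, in 3 steps!", encoding="utf-8")
    out = normalize_to_copy(raw)
    raw_after = raw.read_text(encoding="utf-8")
    normalized = out.read_text(encoding="utf-8")

print(out.name, "->", normalized.split())
```

If this approach is used, later steps would need to read the `_normalized.txt` files rather than the `_transcription.txt` files.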

In [3]:
# Write your answer/code here

import os
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize NLTK components
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Define synonyms and abbreviations mapping
synonym_map = {
    'u': 'you',
    'r': 'are',
    '\'ur': 'your',
    '\'ll': 'will',
    'can\'t': 'cannot',
    'don\'t': 'do not',
    'won\'t': 'will not',
    'i\'m': 'i am',
    'i\'ve': 'i have',
    'i\'d': 'i would',
    'isn\'t': 'is not',
    'wasn\'t': 'was not',
    'weren\'t': 'were not',
    'haven\'t': 'have not',
    'hasn\'t': 'has not',
    'hadn\'t': 'had not',
    'doesn\'t': 'does not',
    'didn\'t': 'did not',
    'couldn\'t': 'could not',
    'wouldn\'t': 'would not',
    'mightn\'t': 'might not',
    'mustn\'t': 'must not',
    'shan\'t': 'shall not',
    'shan\'t\'ve': 'shall not have',
    'should\'ve': 'should have',
    'shouldn\'t': 'should not',
    'shouldn\'t\'ve': 'should not have',
    'so\'ve': 'so have',
    'so\'s': 'so as',
    'this\'s': 'this is',
    'that\'s': 'that is',
    'there\'ve': 'there have',
    'there\'s': 'there is',
    'here\'s': 'here is',
    'where\'d': 'where did',
    'where\'s': 'where is',
    'where\'ve': 'where have',
    'who\'ve': 'who have',
    'who\'s': 'who is',
    'who\'d': 'who would',
    'who\'d\'ve': 'who would have',
    'why\'s': 'why is',
    'how\'ve': 'how have',
    'we\'ll': 'we will',
    'you\'ll': 'you will',
    'they\'ll': 'they will',
    'i\'ll': 'i will',
    'he\'ll': 'he will',
    'she\'ll': 'she will',
    'it\'ll': 'it will',
    'youll': 'you will',
    'theyll': 'they will',
    # 'ill', 'hell', and 'shell' are left out because they are also ordinary English words
    'itll': 'it will',
    'youre': 'you are',
    'theyre': 'they are',
    'youve': 'you have',
    'theyve': 'they have',
    'weve': 'we have',
    'ive': 'i have',
    'hes': 'he is',
    'shes': 'she is',
    # 'its' is left out because it is usually the possessive, not a contraction of "it is"
    'thats': 'that is',
    'whos': 'who is',
    'thatll': 'that will',
    'whichll': 'which will',
    # Add more mappings as needed
}

# Function for text normalization
def normalize_text(text):
    # 1. Case normalization (also convert curly apostrophes to straight ones)
    text = text.lower().replace('\u2019', "'")
    
    # 2. Expand contractions and abbreviations while the apostrophes are still present
    text = replace_synonyms_abbreviations(text)
    
    # 3. Remove punctuation, numbers, and symbols
    text = re.sub(r'[^a-z\s]', '', text)
    
    # 4. Tokenization and stop word removal
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    
    # 5. Lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    return ' '.join(tokens)

# Function to replace synonyms and abbreviations
def replace_synonyms_abbreviations(text):
    # Replace whole words only. str.replace() would also rewrite substrings,
    # so a lone token "u" would turn every letter "u" in the text into "you".
    return re.sub(r"[a-z']+", lambda match: synonym_map.get(match.group(0), match.group(0)), text)

# Get all text files in the current directory
text_files = [f for f in os.listdir() if f.endswith('_transcription.txt')]

# Normalize text in each file
for text_file in text_files:
    with open(text_file, 'r') as f:
        transcription = f.read()
        normalized_transcription = normalize_text(transcription)
    
    # Write normalized transcription back to the file
    with open(text_file, 'w') as f:
        f.write(normalized_transcription)

    print(f"Text in {text_file} normalized.")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Sandy\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sandy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Sandy\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Text in Mod01_Course_Overview_transcription.txt normalized.
Text in Mod02_Intro_transcription.txt normalized.
Text in Mod02_Sect01_transcription.txt normalized.
Text in Mod02_Sect02_transcription.txt normalized.
Text in Mod02_Sect03_transcription.txt normalized.
Text in Mod02_Sect04_transcription.txt normalized.
Text in Mod02_Sect05_transcription.txt normalized.
Text in Mod02_WrapUp_transcription.txt normalized.
Text in Mod03_Intro_transcription.txt normalized.
Text in Mod03_Sect01_transcription.txt normalized.
Text in Mod03_Sect02_part1_transcription.txt normalized.
Text in Mod03_Sect02_part2_transcription.txt normalized.
Text in Mod03_Sect02_part3_transcription.txt normalized.
Text in Mod03_Sect03_part1_transcription.txt normalized.
Text in Mod03_Sect03_part2_transcription.txt normalized.
Text in Mod03_Sect03_part3_transcription.txt normalized.
Text in Mod03_Sect04_part1_transcription.txt normalized.
Text in Mod03_Sect04_part2_transcription.txt normalized.
Text in Mod03_Sect04_part3_

## 4. Extracting key phrases and topics
([Go to top](#Project))

Use this section to extract the key phrases and topics from the videos.
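
Once key phrases are available for each video, one possible way to make them searchable is an inverted index that maps each phrase to the videos that mention it. The sketch below is an illustration only: the sample key phrases are made up, `build_phrase_index` and `search` are hypothetical helpers, and the case-insensitive substring match is a design assumption rather than a requirement.

```python
from collections import defaultdict


def build_phrase_index(key_phrases_by_video):
    # Map each lowercased key phrase to the set of videos that contain it
    index = defaultdict(set)
    for video, phrases in key_phrases_by_video.items():
        for phrase in phrases:
            index[phrase.strip().lower()].add(video)
    return index


def search(index, query):
    # Return the videos that have at least one key phrase containing the query
    q = query.strip().lower()
    hits = set()
    if q:
        for phrase, videos in index.items():
            if q in phrase:
                hits |= videos
    return sorted(hits)


# Made-up key phrases for three of the course videos
sample = {
    "Mod02_Intro.mp4": ["machine learning", "business problem"],
    "Mod03_Sect01.mp4": ["machine learning pipeline", "feature engineering"],
    "Mod06_WrapUp.mp4": ["natural language processing"],
}
index = build_phrase_index(sample)
results = search(index, "Machine Learning")
print(results)
```

Building the index once and reusing it for every query avoids repeating the extraction work each time a student searches.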

In [4]:
#Write your answer/code here

import os
from bertopic import BERTopic
import spacy
from collections import defaultdict

# Load the spaCy model once instead of reloading it for every file
nlp = spacy.load("en_core_web_sm")

# Function to extract the text and key phrases from a single text file
def extract_topics_and_keyphrases(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()

    doc = nlp(text)

    # Extracting key phrases (spaCy noun chunks)
    key_phrases = [chunk.text for chunk in doc.noun_chunks]

    return text, key_phrases

# Function to process all text files in the current directory
def process_text_files():
    # Get all files ending with "_transcription.txt" in the current directory
    files = [f for f in os.listdir('.') if f.endswith('_transcription.txt')]

    topics_and_keyphrases = defaultdict(list)

    for file in files:
        text, key_phrases = extract_topics_and_keyphrases(file)
        topics_and_keyphrases[file] = (text, key_phrases)

    return topics_and_keyphrases

# Function to suggest videos based on input topic or keyword
def suggest_videos(input_topic_or_keyword, topics_and_keyphrases):
    # The transcripts were lowercased during normalization, so normalize the query the same way
    query = input_topic_or_keyword.strip().lower()

    # Perform topic modeling using BERTopic
    # (the fitted topics are not used by the key phrase lookup below)
    texts = [text for text, _ in topics_and_keyphrases.values()]
    model = BERTopic(language="english")
    topics, _ = model.fit_transform(texts)

    # Get the files that have the query as one of their key phrases (exact match)
    related_files = []
    for file, (_, key_phrases) in topics_and_keyphrases.items():
        if query in key_phrases:
            related_files.append(file)

    # Extract video names from transcript file names
    video_names = [file.split('_transcription.txt')[0] for file in related_files]

    # Search for corresponding video files in the directory
    video_files = [file + '.mp4' for file in video_names if os.path.isfile(file + '.mp4')]

    return video_files

# Process text files in the current directory
topics_and_keyphrases = process_text_files()

# Example of suggesting videos based on a topic or keyword
input_topic_or_keyword = "machine learning"
suggested_videos = suggest_videos(input_topic_or_keyword, topics_and_keyphrases)
print("Suggested videos based on '{}' topic or keyword:".format(input_topic_or_keyword))
for video in suggested_videos:
    print(video)

Suggested videos based on 'machine learning' topic or keyword:
Mod01_Course_Overview.mp4
Mod02_Intro.mp4
Mod02_Sect02.mp4
Mod02_Sect03.mp4
Mod02_Sect04.mp4
Mod02_WrapUp.mp4
Mod03_Sect02_part1.mp4


## 5. Creating the dashboard
([Go to top](#Project))

Use this section to create the dashboard for your solution.
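
Keeping the search logic in a plain function, separate from the dashboard callback, makes it possible to test the behavior without starting a web server. The following is a minimal sketch of that idea, assuming the dashboard shows one message or one video name per line; `search_messages` and `fake_search` are hypothetical names used only for illustration.

```python
def search_messages(query, search_fn):
    # Validate the query first, then delegate to the supplied search function
    if query is None or not query.strip():
        return ["Please enter a topic or keyword"]
    videos = search_fn(query.strip())
    if not videos:
        return ["No videos found for the given topic or keyword"]
    return list(videos)


def fake_search(query):
    # Stand-in for the real search; returns a fixed result for one query
    return ["Mod02_Intro.mp4"] if query == "machine learning" else []


print(search_messages("  machine learning ", fake_search))
print(search_messages("", fake_search))
print(search_messages("unknown topic", fake_search))
```

A dashboard callback could then wrap each returned string in a paragraph element, with the real search function passed in place of `fake_search`.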

In [6]:
# Write your answer/code here

import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import os

app = dash.Dash(__name__)

# Process text files in the current directory
topics_and_keyphrases = process_text_files()

# Define CSS styles (Dash expects camelCased style keys)
styles = {
    'input': {'margin': '10px', 'width': '300px', 'height': '30px', 'fontSize': '16px'},
    'button': {'margin': '10px', 'width': '100px', 'height': '40px', 'fontSize': '16px', 'backgroundColor': '#4CAF50', 'color': 'white', 'border': 'none', 'cursor': 'pointer'},
    'output': {'margin': '10px', 'fontSize': '16px'}
}

# Create Dash layout
app.layout = html.Div([
    html.H1("Video Suggestion Dashboard", style={'textAlign': 'center'}),
    dcc.Input(id='input-topic', type='text', placeholder='Enter topic or keyword', style=styles['input']),
    html.Button('Submit', id='submit-val', n_clicks=0, style=styles['button']),
    html.Div(id='output-videos', style=styles['output'])
])

# Callback to suggest videos based on input topic or keyword
@app.callback(
    Output('output-videos', 'children'),
    [Input('submit-val', 'n_clicks')],
    [dash.dependencies.State('input-topic', 'value')]
)
def update_output(n_clicks, input_topic):
    # Treat a missing, empty, or whitespace-only query as "no input"
    if not input_topic or not input_topic.strip():
        return html.Div("Please enter a topic or keyword", style=styles['output'])
    else:
        suggested_videos = suggest_videos(input_topic, topics_and_keyphrases)
        if not suggested_videos:
            return html.Div("No videos found for the given topic or keyword", style=styles['output'])
        else:
            return html.Div([html.P(video) for video in suggested_videos], style=styles['output'])

if __name__ == '__main__':
    # app.run replaces the older app.run_server in recent Dash releases
    app.run(debug=True)