# Project

In this project, you will bring together many of the tools and techniques that you learned throughout this course. There are many different paths to a working solution, and you can choose whichever one suits you.

### Business scenario

You work for a training organization that recently developed an introductory course about machine learning (ML). The course includes more than 40 videos that cover a broad range of ML topics. You have been asked to create an application that students can use to quickly locate and view video content by searching for topics and key phrases.

You have downloaded all of the videos to an Amazon Simple Storage Service (Amazon S3) bucket. Your assignment is to produce a dashboard that meets your supervisor’s requirements.

## Project steps

To complete this project, you will follow these steps:

1. [Viewing the video files](#1.-Viewing-the-video-files)
2. [Transcribing the videos](#2.-Transcribing-the-videos)
3. [Normalizing the text](#3.-Normalizing-the-text)
4. [Extracting key phrases and topics](#4.-Extracting-key-phrases-and-topics)
5. [Creating the dashboard](#5.-Creating-the-dashboard)

## Useful information

The following cell contains some information that might be useful as you complete this project.

In [None]:
bucket = "c56161a939430l3396553t1w744137092661-labbucket-rn642jaq01e9"
job_data_access_role = 'arn:aws:iam::744137092661:role/service-role/c56161a939430l3396553t1w7-ComprehendDataAccessRole-1P24MSS91ADHP'

In [None]:
!aws s3 ls s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/


In [None]:
!aws s3 cp s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/ . --recursive


## 1. Viewing the video files
([Go to top](#Capstone-8:-Bringing-It-All-Together))


The source video files are located in the shared Amazon Simple Storage Service (Amazon S3) bucket listed in the previous section (`s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/`).

In [1]:
import cv2 
import numpy as np
import speech_recognition as sr 
from pydub.utils import make_chunks
from tqdm import tqdm
from pydub import AudioSegment 
import os

## 2. Transcribing the videos
 ([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to implement your solution to transcribe the videos. 

In [2]:
videoFiles = [] 
for file in os.listdir('./videos/'):
    if file.endswith('.mp4'):
        videoFiles.append(file)

In [6]:
print("Total videos-",len(videoFiles))

Total videos- 46


In [7]:
# Write your answer/code here
# Extract the audio track from each video as 16 kHz, mono, 16-bit WAV
os.makedirs('audios', exist_ok=True)  # export() fails if the folder is missing
for file in videoFiles:
    video = AudioSegment.from_file(f'videos/{file}', format="mp4")
    audio = video.set_channels(1).set_frame_rate(16000).set_sample_width(2)
    audio_filename = os.path.splitext(file)[0] + '.wav'
    audio.export(f'audios/{audio_filename}', format="wav")

In [8]:
audioFiles = []
for file in os.listdir('./audios/'):
    if file.endswith('.wav'):
        audioFiles.append(file)

In [9]:
print("Total audio files-",len(audioFiles))

Total audio files- 46


In [12]:
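text_ = []
#count = 0
r = sr.Recognizer()
chunks_length = 35000  # chunk length in milliseconds (35 s)
os.makedirs('chunkedAudios', exist_ok=True)
for file in audioFiles:
    myaudio = AudioSegment.from_wav(f'./audios/{file}')
    chunks = make_chunks(myaudio, chunks_length)
    text = ""
    for j, chunk in enumerate(chunks):
        chunkName = f"{file[:-4]}_{j}.wav"
        chunk.export(f"./chunkedAudios/{chunkName}", format="wav")
        with sr.AudioFile(f"./chunkedAudios/{chunkName}") as source:
            audio_data = r.record(source)
        try:
            # recognize_google sends the chunk to the Google Web Speech API
            text = text + " " + r.recognize_google(audio_data)
        except sr.UnknownValueError:
            # The chunk contained no recognizable speech
            print(f"Got UnknownValueError for {chunkName}")
        except sr.RequestError as e:
            # The API could not be reached or rejected the request
            print(f"Request failed for {chunkName}: {e}")
    data = [file, text]
    text_.append(data)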
    #count += 1
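
The transcription loop splits each audio file into 35,000 ms chunks before recognition. As a rough illustration of what that split looks like, the sketch below computes the chunk boundaries for a given duration. `chunk_bounds` is a hypothetical helper, not part of pydub, and it assumes `make_chunks` slices the audio into consecutive fixed-length windows with a shorter final chunk.

```python
import math


def chunk_bounds(duration_ms, chunk_ms=35000):
    """Return (start, end) millisecond offsets for consecutive fixed-length chunks.

    The final chunk may be shorter than chunk_ms.
    """
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    count = math.ceil(duration_ms / chunk_ms)
    return [(i * chunk_ms, min((i + 1) * chunk_ms, duration_ms))
            for i in range(count)]


# Hypothetical 80-second clip: two full chunks and one 10-second remainder
print(chunk_bounds(80_000))
# [(0, 35000), (35000, 70000), (70000, 80000)]
```

Because the cut points fall at fixed offsets, a word that straddles a boundary can be split across two chunks, which may explain some recognition misses near chunk edges.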

In [14]:
for i in text_:
    filename = i[0][:-4]+".txt"
    with open(f'./Transcribed_files/{filename}','w') as writer:
        writer.write(i[1])
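
Transcription is the slowest step, so it can help to reload the saved `.txt` files instead of re-running it after a kernel restart. The following is a minimal sketch under that assumption. `load_transcripts` is a hypothetical helper that rebuilds a list shaped like `text_` (pairs of audio file name and transcript) from a folder of transcript files.

```python
from pathlib import Path


def load_transcripts(folder, audio_ext=".wav"):
    """Rebuild [audio_filename, text] pairs from the .txt files in `folder`."""
    pairs = []
    for path in sorted(Path(folder).glob("*.txt")):
        # Uses the platform default encoding, matching how the files were written
        pairs.append([path.stem + audio_ext, path.read_text()])
    return pairs


# Example (assumes the folder written by the previous cell exists):
# text_ = load_transcripts("./Transcribed_files")
```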

## 3. Normalizing the text
([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to perform any text normalization steps that are necessary for your solution.

In [15]:
# Write your answer/code here
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
import re

In [16]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Raj\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Raj\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Raj\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [44]:
def preprocess_text(text):
    # Lowercase, trim, and collapse runs of whitespace
    text = text.lower()
    text = text.strip()
    text = re.sub(r'\s+', ' ', text)

    # Filler words common in video intros and outros
    intro_phrases = ['hi', 'hello', 'welcome', 'thanks', 'watching', 'video']

    # Tokenize and keep alphabetic tokens only
    words = word_tokenize(text)
    words = [word for word in words if word.isalpha()]

    # Remove English stopwords and the filler words
    stop_words = set(stopwords.words('english')) | set(intro_phrases)
    words = [word for word in words if word not in stop_words]

    # Lemmatize words
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]

    preprocessedSentences = ' '.join(words)

    return preprocessedSentences
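
To see what the normalization steps do to a sentence without downloading any NLTK data, here is a stripped-down, dependency-free sketch. It is not equivalent to `preprocess_text`: `simple_normalize` and `DEMO_STOP_WORDS` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, a regular expression replaces `word_tokenize`, and lemmatization is skipped.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list.
DEMO_STOP_WORDS = {"hi", "and", "to", "the", "this", "video", "welcome",
                   "in", "we", "will", "about"}


def simple_normalize(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, collapse whitespace, keep alphabetic tokens, drop stopwords."""
    text = re.sub(r"\s+", " ", text.lower().strip())
    tokens = re.findall(r"[a-z]+", text)
    return " ".join(token for token in tokens if token not in stop_words)


print(simple_normalize(
    "Hi, and welcome to  Module 2! In this video we will learn about ML pipelines."
))
# module learn ml pipelines
```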

In [45]:
transcripts = []

for file, transcript in text_:
    transcripts.append([f"{file[:-4]}.mp4", preprocess_text(transcript)])


In [46]:
transcripts

[['Mod01_Course Overview.mp4',
  'amazon academy machine learning foundation module learn course objective various job role machine learning domain go learn machine learning completing module able identify course prerequisite objective indicate role data scientist business identify resource learning going look prerequisite taking course take course recommend first complete aws academy cloud foundation also general technical knowledge including foundational computer literacy skill like basic computer concept email file management good understanding internet also recommend intermediate skill python programming general knowledge applied statistic finally general business knowledge important course includes insight information technology used business also important business related skill set communication skill leadership skill orientation towards customer service course introduced key concept machine learning tool us also introduced work aws service machine learning learn recognize machi

In [22]:
pip install yake

Collecting yake
  Downloading yake-0.4.8-py2.py3-none-any.whl (60 kB)
     ---------------------------------------- 60.2/60.2 kB 1.1 MB/s eta 0:00:00
Collecting segtok
  Downloading segtok-1.5.11-py3-none-any.whl (24 kB)
Installing collected packages: segtok, yake
Successfully installed segtok-1.5.11 yake-0.4.8
Note: you may need to restart the kernel to use updated packages.


## 4. Extracting key phrases and topics
([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to extract the key phrases and topics from the videos.

In [47]:
# Write your answer/code here
import yake

In [48]:
def extract_phrases(text):
    extractor = yake.KeywordExtractor()
    keywords = extractor.extract_keywords(text)
    return keywords

In [49]:
keywords = []

for file, text in transcripts:
    keywords.append([file, [phrase for phrase, score in extract_phrases(text)]])
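
The loop above keeps every phrase and discards its score. YAKE assigns lower scores to more relevant phrases, so one optional refinement is to keep only the best-scoring ones. The sketch below assumes input shaped like the `(phrase, score)` tuples that `extract_phrases` returns. `select_phrases`, the `max_score` threshold, and the sample scores are all made up for illustration.

```python
def select_phrases(scored_phrases, max_score=0.1, limit=10):
    """Keep up to `limit` phrases whose score is at most `max_score`.

    scored_phrases: iterable of (phrase, score) tuples, lower score = more relevant.
    """
    ranked = sorted(scored_phrases, key=lambda item: item[1])
    return [phrase for phrase, score in ranked if score <= max_score][:limit]


# Made-up scores, for illustration only
sample = [
    ("machine learning", 0.02),
    ("learning section learn", 0.30),
    ("natural language processing", 0.05),
]
print(select_phrases(sample))
# ['machine learning', 'natural language processing']
```

A suitable threshold depends on the transcripts, so it is worth inspecting real scores before choosing one.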

In [50]:
keywords

[['Mod01_Course Overview.mp4',
  ['machine learning pipeline',
   'machine learning section',
   'machine learning',
   'machine learning problem',
   'amazon machine learning',
   'machine learning service',
   'certified machine learning',
   'machine learning engineer',
   'implement machine learning',
   'machine learning specialty',
   'learning section learn',
   'machine learning model',
   'natural language processing',
   'role machine learning',
   'statistic machine learning',
   'learning section describes',
   'learning machine learning',
   'field machine learning',
   'machine learning technology',
   'machine learning professional']],
 ['Mod02_Intro.mp4',
  ['business problem solved',
   'solve business problem',
   'challenge face completing',
   'traditional software development',
   'software development method',
   'development method ready',
   'problem solved machine',
   'aws academy machine',
   'deep learning part',
   'business problem describe',
   'academy m

## 5. Creating the dashboard
([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to create the dashboard for your solution.

In [None]:
# Write your answer/code here
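
The dashboard cell is left open. One possible building block, sketched here as an assumption rather than the required solution, is an inverted index that maps each key phrase to the videos that contain it, so a dashboard can answer a phrase lookup without scanning every video. `build_index` and `lookup` are hypothetical helpers, and the sample data uses placeholder file names in the same `[filename, phrases]` shape as `keywords`.

```python
from collections import defaultdict


def build_index(keywords):
    """Map each key phrase to the list of videos whose keyword list contains it."""
    index = defaultdict(list)
    for video, phrases in keywords:
        for phrase in phrases:
            if video not in index[phrase]:
                index[phrase].append(video)
    return dict(index)


def lookup(index, phrase):
    """Return the videos for an exact phrase, or an empty list if none match."""
    return index.get(phrase, [])


# Placeholder data in the same shape as `keywords`
sample_keywords = [
    ["video_a.mp4", ["machine learning", "natural language processing"]],
    ["video_b.mp4", ["solve business problem", "machine learning"]],
]
index = build_index(sample_keywords)
print(lookup(index, "machine learning"))
# ['video_a.mp4', 'video_b.mp4']
```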

## Testing

In [None]:
while True:
    query = input('Search: ')
    if query == 'Stop':
        break
    # Normalize the query the same way as the transcripts, then extract its phrases
    preprocessed_query = preprocess_text(query)
    query_keywords = [phrase for phrase, score in extract_phrases(preprocessed_query)]
    suggestedvideos = []
    for keyword in query_keywords:
        for k in keywords:
            # k is [filename, list of key phrases]; require an exact phrase match
            if keyword in k[1] and k[0] not in suggestedvideos:
                suggestedvideos.append(k[0])
    if suggestedvideos:
        print(suggestedvideos)
    else:
        print('Nothing found.')
    print('--------------------------------------------------------')

Search: Bollywood Movies
Nothing found.
--------------------------------------------------------
Search: service machine learning
['Mod02_Sect05.mp4', 'Mod01_Course Overview.mp4', 'Mod02_Sect01.mp4', 'Mod02_Sect02.mp4', 'Mod02_Sect04.mp4', 'Mod02_WrapUp.mp4', 'Mod03_Sect01.mp4', 'Mod07_Sect01.mp4']
--------------------------------------------------------
Search: unsupervised machine learning
['Mod02_Sect02.mp4', 'Mod03_Sect01.mp4', 'Mod01_Course Overview.mp4', 'Mod02_Sect01.mp4', 'Mod02_Sect04.mp4', 'Mod02_Sect05.mp4', 'Mod02_WrapUp.mp4', 'Mod07_Sect01.mp4']
--------------------------------------------------------
Search: Web development
Nothing found.
--------------------------------------------------------
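
In the run above, the two machine learning queries return the same eight videos in a different order, presumably because both match on a shared phrase. One way to make the ordering more informative is to rank videos by how many query phrases they match. This is a sketch of that idea, not part of the original solution. `rank_videos` is a hypothetical helper, and the sample data uses placeholder file names in the same shape as `keywords`.

```python
from collections import Counter


def rank_videos(query_phrases, keywords):
    """Return (video, match_count) pairs, most matches first, ties broken by name."""
    hits = Counter()
    wanted = set(query_phrases)
    for video, phrases in keywords:
        matched = wanted & set(phrases)
        if matched:
            hits[video] = len(matched)
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Placeholder data in the same shape as `keywords`
sample_keywords = [
    ["video_a.mp4", ["machine learning"]],
    ["video_b.mp4", ["machine learning", "unsupervised learning"]],
    ["video_c.mp4", ["business problem"]],
]
print(rank_videos(["machine learning", "unsupervised learning"], sample_keywords))
# [('video_b.mp4', 2), ('video_a.mp4', 1)]
```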
