# Project

In this Project, you will bring together many of the tools and techniques that you have learned throughout this course into a final project. You can choose from many different paths to get to the solution. 

### Business scenario

You work for a training organization that recently developed an introductory course about machine learning (ML). The course includes more than 40 videos that cover a broad range of ML topics. You have been asked to create an application that will students can use to quickly locate and view video content by searching for topics and key phrases.

You have downloaded all of the videos to an Amazon Simple Storage Service (Amazon S3) bucket. Your assignment is to produce a dashboard that meets your supervisor’s requirements.

## Project steps

To complete this project, you will follow these steps:

1. [Viewing the video files](#1.-Viewing-the-video-files)
2. [Transcribing the videos](#2.-Transcribing-the-videos)
3. [Normalizing the text](#3.-Normalizing-the-text)
4. [Extracting key phrases and topics](#4.-Extracting-key-phrases-and-topics)
5. [Creating the dashboard](#5.-Creating-the-dashboard)

## Useful information

The following cell contains some information that might be useful as you complete this project.

In [1]:
bucket = "c56161a939430l3396553t1w744137092661-labbucket-rn642jaq01e9"
job_data_access_role = 'arn:aws:iam::744137092661:role/service-role/c56161a939430l3396553t1w7-ComprehendDataAccessRole-1P24MSS91ADHP'

## 1. Viewing the video files
([Go to top](#Capstone-8:-Bringing-It-All-Together))


The source video files are located in the following shared Amazon Simple Storage Service (Amazon S3) bucket.

In [2]:
!aws s3 ls s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/

2021-04-26 20:17:33  410925369 Mod01_Course Overview.mp4
2021-04-26 20:10:02   39576695 Mod02_Intro.mp4
2021-04-26 20:31:23  302994828 Mod02_Sect01.mp4
2021-04-26 20:17:33  416563881 Mod02_Sect02.mp4
2021-04-26 20:17:33  318685583 Mod02_Sect03.mp4
2021-04-26 20:17:33  255877251 Mod02_Sect04.mp4
2021-04-26 20:23:51   99988046 Mod02_Sect05.mp4
2021-04-26 20:24:54   50700224 Mod02_WrapUp.mp4
2021-04-26 20:26:27   60627667 Mod03_Intro.mp4
2021-04-26 20:26:28  272229844 Mod03_Sect01.mp4
2021-04-26 20:27:06  309127124 Mod03_Sect02_part1.mp4
2021-04-26 20:27:06  195635527 Mod03_Sect02_part2.mp4
2021-04-26 20:28:03  123924818 Mod03_Sect02_part3.mp4
2021-04-26 20:31:28  171681915 Mod03_Sect03_part1.mp4
2021-04-26 20:32:07  285200083 Mod03_Sect03_part2.mp4
2021-04-26 20:33:17  105470345 Mod03_Sect03_part3.mp4
2021-04-26 20:35:10  157185651 Mod03_Sect04_part1.mp4
2021-04-26 20:36:27  187435635 Mod03_Sect04_part2.mp4
2021-04-26 20:36:40  280720369 Mod03_Sect04_part3.mp4
2021-04-

In [1]:
!pip install moviepy torch transformers accelerate nltk




In [2]:
import os
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from moviepy.editor import VideoFileClip
import boto3
import shutil
import re
from unicodedata import normalize
import nltk
from nltk.tokenize import word_tokenize
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
import ipywidgets as widgets
from IPython.display import display, clear_output, Video
import boto3

ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4554:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4554:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1246:(snd_func_refer) error evaluating name
ALSA lib conf.c:4554:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5033:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2501:(snd_pcm_open_noupdate) Unknown PCM default
ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4554:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4554:(_snd_config_evaluate) function snd_func_concat returned error: N

## 2. Transcribing the videos
 ([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to implement your solution to transcribe the videos. 

In [3]:

def download_from_s3(s3, bucket_name, key, output_path):
    s3.download_file(bucket_name, key, output_path)

# Following code convert into the audio format
def convert_video_to_audio(video_route, audio_route):
   
    video = VideoFileClip(video_route)
    audio = video.audio
    audio.write_audiofile(audio_route)

# Following code will be processing the video
def process_video(video_index, output_directory):
    
    # Following code will download the video and than send it to convert into the audio format
    print("Downloading video:", video_index)
    video_route = os.path.join(output_directory, 'video.mp4')
    audio_file_name = os.path.splitext(os.path.basename(video_index))[0] + '.mp3'
    print(audio_file_name)
    audio_route = os.path.join(output_directory, audio_file_name)  

    download_from_s3(s3, bucket_name, video_index, video_route)

    print("Converting video to audio:", video_index)
    
    convert_video_to_audio(video_route, audio_route)
    print(f"audio_output/{audio_file_name}")
    result = pipelin(f"audio_output/{audio_file_name}")
    print("Video:", video_index)
    print("Text:", result["text"])
    if not os.path.exists("text_data"):
        os.makedirs("text_data")
    text_file_name = audio_file_name.split(".")
    with open(f"text_data/{text_file_name[0]}.txt","w",encoding="UTF-8") as file:
        file.write(result["text"])

    # remove the all temporary file
    os.remove(video_route)
    os.remove(audio_route)

# AWS S3 configuration
bucket_name = "aws-tc-largeobjects"
output_prefix = "CUR-TF-200-ACMNLP-1/video/"
output_directory = "audio_output"  


if not os.path.exists(output_directory):
    os.makedirs(output_directory)

s3 = boto3.client('s3')

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processors = AutoProcessor.from_pretrained(model_id)
pipelin = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processors.tokenizer,
    feature_extractor=processors.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# print the list of videos in the S3 bucket
response = s3.list_objects_v2(Bucket=bucket_name, Prefix=output_prefix)
video_indexs = [obj['Key'] for obj in response.get('Contents', [])]

for i, video_index in enumerate(video_indexs):
    print(f"Processing video {i+1}/{len(video_indexs)}")
    process_video(video_index, output_directory)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Processing video 1/46
Downloading video: CUR-TF-200-ACMNLP-1/video/Mod01_Course Overview.mp4
Mod01_Course Overview.mp3
Converting video to audio: CUR-TF-200-ACMNLP-1/video/Mod01_Course Overview.mp4
MoviePy - Writing audio in audio_output/Mod01_Course Overview.mp3


                                                                        

MoviePy - Done.
audio_output/Mod01_Course Overview.mp3


Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.


Video: CUR-TF-200-ACMNLP-1/video/Mod01_Course Overview.mp4
Text:  Hi, and welcome to Amazon Academy Machine Learning Foundations. In this module, you'll learn about the course objectives, various job roles in the machine learning domain, and where you can go to learn more about machine learning. After completing this module, you should be able to identify course prerequisites and objectives, indicate the role of the data scientist in business, and identify resources for further learning. We're now going to look at the prerequisites for taking this course. Before you take this course, we recommend that you first complete AWS Academy Cloud Foundations. You should also have some general technical knowledge of IT, including foundational computer literacy skills like basic computer concepts, email, file management, and a good understanding of the Internet. We also recommend that you have intermediate skills with Python programming and a general knowledge of applied statistics. Finally, gene

                                                                      

MoviePy - Done.
audio_output/Mod02_Intro.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod02_Intro.mp4
Text:  Hi, and welcome to Module 2 of AWS Academy Machine Learning. In this module, we're going to introduce machine learning. We'll first look at the business problems that can be solved by machine learning. We'll then talk about terminology, process, tools, and some of the challenges you'll face. process, tools, and some of the challenges you'll face. After completing this module, you should be able to recognize how machine learning and deep learning are part of artificial intelligence, describe artificial intelligence and machine learning terminology, identify how machine learning can be used to solve a business problem, describe the machine learning process, list the tools available to data scientists, and identify when to use machine learning instead of traditional software development methods. You're now ready to get started with Section 1. See you in the next video.
Processing video 3/4

                                                                        

MoviePy - Done.
audio_output/Mod02_Sect01.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod02_Sect01.mp4
Text:  Hi, and welcome to Section 1. In this section, we're going to talk about what machine learning is. This course is an introduction to machine learning, which is also known as ML. But first, we'll discuss where machine learning fits into the larger picture. Machine learning is a subset of artificial intelligence, or AI. This is a broad branch of computer science that's focused on building machines that can do human tasks. Deep learning is a subdomain of machine learning. To understand where these all fit together, we'll discuss each one. As we just mentioned, machine learning is a subset of a broader computer science field known as artificial intelligence. AI focuses on building machines that can perform tasks a human would typically perform. In contemporary popular culture, you've probably seen AIs in movies, television, or works of fiction. For example, you might have seen AIs that co

                                                                        

MoviePy - Done.
audio_output/Mod02_Sect02.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod02_Sect02.mp4
Text:  Hi and welcome back. In this section, we're going to look at the types of business problems machine learning can help you solve. Machine learning is used all across your digital lives. Your email spam filter is the result of a machine learning program that was trained with examples of spam and regular email messages. Based on books you're reading or products you bought, machine learning programs can predict other books or products you're likely to be interested in. Again, the machine learning program was trained with data from other readers' habits and purchases. When detecting credit card fraud, the machine learning program was trained on examples of transactions that turned out to be fraud, along with normal transactions. You can probably think of many more examples, from social media applications, using facial detection to group your photos, to detecting brain tumors in brain scans

                                                                        

MoviePy - Done.
audio_output/Mod02_Sect03.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod02_Sect03.mp4
Text:  Hi and welcome back. This is section 3 and we're going to give you a quick, high-level overview of machine learning terminology and a typical workflow. We will cover these topics in more detail later in this course, but for now we'll focus on the larger picture. So to begin, you should always start with the business problem you or your team believe could benefit from machine learning. From there, you want to do some problem formulation. In this phase, one task is to articulate your business problem and convert it to an ML problem. After you've formulated the problem, you move on to the data preparation and pre-processing phase. You'll pull data from one or more data sources. These data sources might have differences in data or types that need to be reconciled so you can form a single, cohesive view of your data. You'll need to visualize your data and use statistics to determine if the

                                                                      

MoviePy - Done.
audio_output/Mod02_Sect04.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod02_Sect04.mp4
Text:  Welcome back. In this section, we'll look at some of the tools you'll be using throughout the rest of this course. Before we start, this list isn't an exhaustive list of all the tools available today. We're only going to cover them at a high level, but it's a good place to get started. First, there's the Jupyter Notebook. The Jupyter Notebook is an open-source web application you can use to create and share documents that contain live code, equations, visualizations, and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. JupyterLab is a web-based interactive development environment for Jupyter notebooks, code, and data. JupyterLab is flexible. You can use it to configure and arrange the user interface to support a wide range of workflows in data science, scientific computing, a

                                                                      

MoviePy - Done.
audio_output/Mod02_Sect05.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod02_Sect05.mp4
Text:  Hi, welcome back. This is section 5 and we're going to discuss challenges with machine learning. You'll come across many challenges in machine learning. There are a lot of poor quality and inconsistent data available. A significant portion of your job will be getting access to or generating enough good data that's representative of the problem you want to solve. A key issue to watch out for is under or overfitting the model. It's not all about the data, although it mostly is. Do you have data science experience? Is staffing a team of data scientists cost-effective? Does management support using machine learning? What does the business landscape look like? Are the problems too complex to formulate into a machine learning problem? Can the resulting model be explained to the business? If it can't be explained, it might not get adopted. What's the cost of building, updating, and operating

                                                                      

MoviePy - Done.
audio_output/Mod02_WrapUp.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod02_WrapUp.mp4
Text:  It's now time to review the module. Here are the main takeaways for this module. First, we looked at defining machine learning and how it fits into the broader AI landscape. We also looked at the types of problems machine learning can help us solve and how machine learning applies learning algorithms to develop models from large datasets. We then looked at the machine learning pipeline and the different stages for developing a machine learning application. Finally, we introduced some of the tools and services you can use, before discussing some of the challenges with machine learning. In summary, in this module you learned how to recognize how machine learning. In summary, in this module, you learned how to recognize how machine learning and deep learning are part of artificial intelligence, describe artificial intelligence and machine learning terminology, identify how machine learni

                                                                      

MoviePy - Done.
audio_output/Mod03_Intro.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod03_Intro.mp4
Text:  Welcome back to AWS Academy Machine Learning. This is module three, and we're going to work through the entire machine learning pipeline by using Amazon SageMaker. This module will discuss a typical process for handling a machine learning problem. The machine learning pipeline can be applied to many machine learning problem. The machine learning pipeline can be applied to many machine learning problems. The focus is on supervised learning, but the process you learn in this module can be adapted to other types of machine learning as well. This is a large module and we'll be covering a lot of material. At the end of this module, you'll be able to formulate a problem from a business request, obtain and secure data for machine learning, build a Jupyter notebook by using Amazon SageMaker, outline the process for evaluating data, explain why data needs to be preprocessed, use open-source tool

                                                                      

MoviePy - Done.
audio_output/Mod03_Sect01.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod03_Sect01.mp4
Text:  Hi, and welcome back to module 3. This is section 1, and we're going to take a look at some of the data sets we'll use in this module. We'll also look at guidance for how to formulate a business problem. Before we get started, here's a reminder of the machine learning pipeline we looked at in the previous module, and how that maps to the sections in this module. This section, Section 1, will cover how to formulate a problem. It will also cover the datasets we'll use throughout this module. Section 2 will discuss how to obtain and secure data for your machine learning activities. In Section 3, we'll show you tools and techniques for gaining an understanding of your data. Then in Section 4, we'll look at pre-processing your data so it's ready to train a model. Section 5 will cover selecting and training an appropriate machine learning model. Section 6 will show you how to deploy a model

--- Logging error ---                                                   
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/logging/__init__.py", line 678, in format
    record.message = record.getMessage()
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/logging/__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/runpy.py", line 86, in _run_cod

MoviePy - Done.
audio_output/Mod03_Sect02_part1.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod03_Sect02_part1.mp4
Text:  Hi, welcome back. We're now going to look at a few ways you can collect and secure data. In this section, we'll explore some of the techniques and challenges associated with collecting and securing the data that's needed for machine learning. Consider again the original example about predicting credit card fraud. You've further formulated the problem. But what data do you need to actually train your model so you can get the desired output and subsequently achieve your intended business outcome? Do you have access to the data? If so, how much data do you have and where is it? What solution can you use to bring all this data into one centralized repository? The answers to these questions are essential at this stage. The good news for a budding data scientist is that there are many places where you can obtain data. Private data from you or your existing customer already exist

                                                                      

MoviePy - Done.
audio_output/Mod03_Sect02_part2.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod03_Sect02_part2.mp4
Text:  Hi, welcome back. We'll continue exploring data collection by reviewing how to extract, transform, and load data. Data is typically spread across many different systems and data providers. This presents a challenge. You'll need to bring all these data sources together into something that can be consumed by a machine learning model. You can do this through extract, transform, and load, which is also known as ETL. The steps in ETL are defined this way. In the extract step, you pull the data from the sources to a single location. During extraction, you might need to modify the data, combine matching records, or do other tasks that transform the data. Finally, in the load step, the data is loaded into a repository such as Amazon S3. A typical ETL framework has several components. As an example, consider the diagram. First, the Crawler A program connects to a data store, which 

                                                                      

MoviePy - Done.
audio_output/Mod03_Sect02_part3.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod03_Sect02_part3.mp4
Text:  Hi, welcome back. We'll continue exploring data collection by reviewing how to secure your data. It's important to consider the security of your data. Though the data sets used in this course are all public, real data about customer transactions or health records need to be kept secure. You can use AWS Identity and Access Management, which is also known as IAM. It's a service that controls access to resources. Make sure you're securing your data within AWS correctly so you can avoid data breaches. The diagram shows a simple IAM policy that allows only read access to a specific S3 bucket for the listed role. In addition to controlling access to data, you need to make sure your data is secure. It's a good practice and it might also be legally required for certain data types, such as financial data or healthcare records. AWS provides encryption features for storage services, 

                                                                      

MoviePy - Done.
audio_output/Mod03_Sect03_part1.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod03_Sect03_part1.mp4
Text:  Hi and welcome back. This is section 3 and we're going to cover how to evaluate your data. In this section, we'll look at different data formats and types. We'll also look at how you can visualize and analyze the data before feature engineering. Before you can start running statistics on your data to better understand what you're working with, you need to ensure it's in the right format for analysis. For Amazon SageMaker, algorithms support training with data in CSV format. Many of the tools you'll use to explore, visualize, and analyze the data can also read it in CSV format. Generally speaking, you'll need to have at least some domain knowledge for the problem you're trying to solve with machine learning. For example, if you're developing a model to predict if a set of symptoms indicates a disease, you'd need to know the relationship between the symptoms and the disease.

                                                                      

MoviePy - Done.
audio_output/Mod03_Sect03_part2.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod03_Sect03_part2.mp4
Text:  Hi, welcome back. We'll continue exploring how to describe your data. Now that your data is in a readable format, you can perform descriptive statistics on the data to better understand it. Descriptive statistics help you gain valuable insights into your data so that you can effectively pre-process the data and prepare it for your ML model. We'll look at how you can do that and discuss why it's so important. First, descriptive statistics can be organized into a few different categories. Overall statistics include the number of rows and the number of columns in your dataset. This information, which relates to the dimensions of your data, is very important. For example, it can indicate that you have too many features, which can lead to high dimensionality and poor model performance. Attribute statistics are another type of descriptive statistic, specifically for numeric attr

                                                                      

MoviePy - Done.
audio_output/Mod03_Sect03_part3.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod03_Sect03_part3.mp4
Text:  Hi, welcome back. Now we'll review how to find correlations in your dataset. How can you quantify the linear relationship among the variables you're seeing in a scatterplot? A correlation matrix is a good tool in this situation. It conveys both the strong and weak linear relationships among numerical variables. Correlation can go as high as 1 or as low as minus 1. When the correlation is 1, this means those two numerical features are perfectly correlated with each other. It's like saying y is proportional to x. When the correlation of those two variables is minus one, it's like saying y is proportional to minus x. Any linear relationship in between can be quantified by the correlation. So if the correlation is zero, this means there's no linear relationship. But it doesn't mean that there's no relationship, it's just an indication that there's no linear relationship betwee

                                                                      

MoviePy - Done.
audio_output/Mod03_Sect04_part1.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod03_Sect04_part1.mp4
Text:  Hi, and welcome to section 4. In this section, we're going to look at feature engineering. Feature engineering is one of the most impactful things you can do to improve your machine learning model. We'll now look at what it is. There are two things that can help make your models more successful. The first is feature selection and the second is feature extraction or the process of creating features. In feature selection you select the most relevant features and discard the rest. You can apply feature selection to prevent redundancy or irrelevance in the existing features. You can also use it to limit the number of features to help prevent overfitting. Feature extraction builds valuable information from raw data by reformatting, combining, and transforming primary features into new ones. This process continues until it yields a new data set that can be consumed by the model 

                                                                      

MoviePy - Done.
audio_output/Mod03_Sect04_part2.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod03_Sect04_part2.mp4
Text:  Hi, welcome back. We'll continue exploring feature engineering by reviewing how to clean your dataset. In addition to converting string data to numerical data, you'll need to clean your dataset for several other potential problem areas. Before encoding the string data, make sure the strings are all consistent. You'll also need to make sure variables use a consistent scale. For example, if one variable describes the number of doors in a car, the scale will probably be between 2 and 8. But if another variable describes the number of cars of a particular type sold in the state of California, the scale will type sold in the state of California, the scale will probably be in the thousands. Some data items might also capture more than one variable in a single value. For instance, suppose the dataset includes variables that combine safety and maintenance into a single variable, s

                                                                      

MoviePy - Done.
audio_output/Mod03_Sect04_part3.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod03_Sect04_part3.mp4
Text:  Hi, welcome back. We'll continue exploring feature engineering by describing how to work with outliers. You might also need to clean your data based on any outliers that exist. Outliers are points in your dataset that lie at an abnormal distance from other values. They're not always something you want to clean up because they can add richness to your dataset. But they can also make it harder to make accurate predictions because they skew values away from the other, more normal values related to that feature. An outlier might also indicate that the data point actually belongs to another column. You can think of outliers as falling into two broad categories. might also indicate that the data point actually belongs to another column. You can think of outliers as falling into two broad categories. The first is a single variation for just a single variable, or a univariate outl

                                                                        

MoviePy - Done.
audio_output/Mod03_Sect05.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod03_Sect05.mp4
Text:  Hi, welcome back to module 3. This is section 5 on training. In this section, we're going to look at how to select a model and train it with the data we have preprocessed. At this point, you've done a lot to clean and prepare your data, but that doesn't mean your data is completely ready to train the algorithm. Some algorithms may not be able to work with training data in a data frame format. Some file formats, like CSV, are commonly used by various algorithms, but they do not make use of that optimization that some of the file formats, like RecordIO Protobuf, can use. Many Amazon SageMaker algorithms support training with data in a CSV format. Amazon SageMaker requires that a CSV file doesn't have a header record and that the target variable is in the first column. Most Amazon SageMaker algorithms work best when you use the optimized protobuf record I-O format for the training data. 

                                                                      

MoviePy - Done.
audio_output/Mod03_Sect06.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod03_Sect06.mp4
Text:  Hi and welcome back. This is section 6 and we're going to look at hosting and using the model. In this section, we'll look at how you can deploy your trained model so it can be consumed by applications. After you've trained, tuned, and tested your model you'll learn more about testing in the next section, you're now ready to deploy your model. If you're thinking that we're looking at the phases out of order, here's why we're discussing deployment now. If you want to test your model and get performance metrics from it, you first need to make an inference or prediction from the model, and this typically requires deployment. Deployment for testing is different from production, although the mechanics are the same. Amazon SageMaker provides everything you need to host your model for simple testing and evaluation, from a few requests to deployments handling tens of thousands of requests. Th

                                                                      

MoviePy - Done.
audio_output/Mod03_Sect07_part1.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod03_Sect07_part1.mp4
Text:  Hi, welcome back to module 3. In this section, we'll look at how you can evaluate your model's success in predicting results. At this point, you've trained your models. It's now time to evaluate that model to determine if it will do a good job predicting the target on new and future data. Because future instances have unknown target values, you need to assess how the model will perform on data where you already know the target answer. You'll then use this assessment as a proxy for performance on future data. This is the reason why you hold out a sample of your data for evaluating or testing. An important part of this phase involves choosing the most appropriate metric for your business situation. Think back to the earlier section on problem formulation. During that phase, you define your business problem and outcome, and then you craft a business metric to evaluate success

                                                                      

MoviePy - Done.
audio_output/Mod03_Sect07_part2.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod03_Sect07_part2.mp4
Text:  Hi, welcome back. We'll continue exploring how to evaluate your model. The diagram shows the confusion matrix of how two different models performed on the same data. Can you tell which one's better? Which is better isn't a good question to ask. What do you mean by better? Does better mean making sure you find all the cats? Even if it means you'll get many false positives? Or does better mean making sure the model is the most accurate? It's difficult to see just by looking at the two charts. What if you're trying several models, using multiple folds, and have hundreds of data points to compare? To do that, you'll need to calculate more metrics. The first metric is sensitivity. This is sometimes referred to as recall, hit rate, or true positive rate. Sensitivity is the percentage of positive identifications. In the cat example, it represents what percentage of cats were corr

                                                                        

MoviePy - Done.
audio_output/Mod03_Sect07_part3.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod03_Sect07_part3.mp4
Text:  Hi, welcome back. We'll continue exploring how to evaluate your model. Classification models are going to return a probability for the target. This is a value of the input belonging to the target class, and it will be between 0 and 1. To convert the value to a class, you need to determine the threshold to use. You might think it's 50%, but you could change it to be lower or higher to improve your results. As you've seen with sensitivity and specificity, there's a trade-off between correctly and incorrectly identifying classes. Changing the threshold can impact that outcome. We're going to take a look at how you can visualize this. A receiver operating characteristic graph is also known as an ROC graph. It summarizes all the confusion matrices that each threshold produced. To build one, you calculate and plot the sensitivity, or true positive rate, against the false positiv

                                                                        

MoviePy - Done.
audio_output/Mod03_Sect08.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod03_Sect08.mp4
Text:  Hi, and welcome back to module 3. This is section 8. In this section, we're going to take a look at how you can tune the model's hyperparameters to improve model performance. Recall from an earlier module that hyperparameters can be thought of as the knobs that tune the machine learning algorithm to improve its performance. Now that we're looking more explicitly at tuning models, it's time to look more specifically at the different types of hyperparameters and how to perform hyperparameter optimization. There are a couple of different categories of hyperparameters. The first kind are model hyperparameters. The first kind are model hyperparameters. They help define the model itself. As an example, consider a neural network for a computer vision problem. For this case, additional attributes of the architecture need to be defined, like filter size, pooling, and the stride or padding. The

                                                                      

MoviePy - Done.
audio_output/Mod03_WrapUp.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod03_WrapUp.mp4
Text:  It's now time to review the module and wrap up with a knowledge check. In this module, you learned how to formulate a problem from a business request, obtain and secure data for machine learning, build a Jupyter notebook by using Amazon SageMaker. Outline the process for evaluating data. Explain why data needs to be pre-processed. Use open source tools to examine and pre-process data. Use Amazon SageMaker to train and host a machine learning model. Use cross-validation to test the performance of an ML model, use a hosted model for inference, and create an Amazon SageMaker hyperparameter tuning job to optimize a model's effectiveness. That concludes this module. Thanks for watching. We'll see you again in the next video.
Processing video 27/46
Downloading video: CUR-TF-200-ACMNLP-1/video/Mod04_Intro.mp4
Mod04_Intro.mp3
Converting video to audio: CUR-TF-200-ACMNLP-1/video/Mod04_Intro.mp

                                                                      

MoviePy - Done.
audio_output/Mod04_Intro.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod04_Intro.mp4
Text:  Hi, and welcome to Module 4 of AWS Academy Machine Learning. In this module, we're going to look at forecasting. We'll start with an introduction to forecasting and look at how time series data is different from other kinds of data. Then, we're going to look at Amazon Forecast, a service that helps you simplify building forecasts. At the end of this module, you'll be able to describe the business problem solved with Amazon Forecast, describe the challenges of working with time series data, list the steps required to create a forecast by using Amazon Forecast. And use Amazon Forecast to make a prediction. See you in the next video!
Processing video 28/46
Downloading video: CUR-TF-200-ACMNLP-1/video/Mod04_Sect01.mp4
Mod04_Sect01.mp3
Converting video to audio: CUR-TF-200-ACMNLP-1/video/Mod04_Sect01.mp4
MoviePy - Writing audio in audio_output/Mod04_Sect01.mp3


                                                                      

MoviePy - Done.
audio_output/Mod04_Sect01.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod04_Sect01.mp4
Text:  Hi, and welcome to Section 1. We'll get started by reviewing what forecasting is and some use cases for it. Forecasting is an important area of machine learning. It's important because there are so many opportunities for predicting future outcomes based on historical data. Many of these opportunities involve a time component. However, while the time component adds additional information, it also makes time series problems more difficult to handle compared to other types of predictions. You can think of time series data as falling into two broad categories. The first type is univariate data, which means there's just one variable. The second one is multivariate data, which means there's more than one variable. There are several common patterns in time series data. The first pattern is a trend. With a trend, you get a pattern with the values increasing, decreasing, or staying the same ov

                                                                        

MoviePy - Done.
audio_output/Mod04_Sect02_part1.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod04_Sect02_part1.mp4
Text:  Hi and welcome back. This is section 2 and we're going to focus on processing time series data because it can be different from other types of data you've been using so far. Time series data is data that is captured in chronological sequence over a defined period of time. Introducing time into a machine learning model has a positive impact because the model can derive meaning from changes in the data points over time. Time series data tends to be correlated. This means that there's a dependency between data points. This has mixed results for forecasting. This is because you're dealing with a regression problem, and regression assumes that data points are independent. You need to develop a method for dealing with data dependence so you can increase the validity of the predictions. In addition to the time series data, you can add related data to augment a forecasting model. 

                                                                      

MoviePy - Done.
audio_output/Mod04_Sect02_part2.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod04_Sect02_part2.mp4
Text:  Hi, welcome back. We'll continue exploring wrangling time series data. Seasonality in data is any kind of repeating observation where the frequency of the observation is stable. For example, in sales, you typically see higher sales at the end of a quarter and into the fourth quarter. Consumer retail sees even higher sales in the fourth quarter. Be aware that data can have multiple types of seasonality in the same data set. There are many times when you should incorporate seasonality information into your forecast. For instance, localized holidays are a good example for sales. The chart shows that the total revenue generated by arcades has a strong correlation with the number of computer science doctorates awarded in the US. But correlations do not mean causation. If you disagree, see the source for the chart. There are many other correlations plotted on the site, and none 

                                                                        

MoviePy - Done.
audio_output/Mod04_Sect02_part3.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod04_Sect02_part3.mp4
Text:  Hi and welcome back. In this section, we'll look at how you can use Amazon Forecast to create a predictor and generate forecasts. When you generate forecasts, you can apply the machine learning development pipeline you've seen throughout this course. But you still need data. You need to import as much data as you have, both historical data and related data. You'll want to do some basic evaluation and feature engineering before you use the data to train a model so you can meet the requirements of Amazon Forecast. To train a predictor, you need to choose an algorithm. If you're not sure which algorithm is the best for your data, Amazon Forecast can choose for you. To do this, select AutoML as your algorithm. You also need to select a domain for your data. If you're not sure what the best fit is, you can also select a custom domain. Domains have specific types of data they re

                                                                    

MoviePy - Done.
audio_output/Mod04_WrapUp.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod04_WrapUp.mp4
Text:  Hi, welcome back. It's now time to review the module and wrap it up. In this module, you learned how to describe the business problem solved by Amazon Forecast, describe the challenges of working with time series data, list the steps required to create Forecast by using Amazon Forecast, and use Amazon Forecast to make a prediction. Thanks for participating. See you in the next module.
Processing video 33/46
Downloading video: CUR-TF-200-ACMNLP-1/video/Mod05_Intro.mp4
Mod05_Intro.mp3
Converting video to audio: CUR-TF-200-ACMNLP-1/video/Mod05_Intro.mp4
MoviePy - Writing audio in audio_output/Mod05_Intro.mp3


                                                                      

MoviePy - Done.
audio_output/Mod05_Intro.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod05_Intro.mp4
Text:  Welcome back to AWS Academy Machine Learning. This is Module 5, and we have a great topic for you today, Computer Vision. In this module, we'll start with an overview of the computer vision space, and you'll learn about some of the use cases and terminology. Next, we'll explore details about analyzing image and video with managed services from Amazon Web Services or AWS. Finally, we'll look at how you can use your own customized data sets for performing object detection. At the end of this module, you'll be able to describe the use cases for computer vision, describe the Amazon Managed Machine Learning services available for image and video analysis, list the steps required to prepare a custom dataset for object detection, describe how Amazon SageMaker Ground Truth can be used to prepare a custom dataset. And finally, use Amazon Recognition to perform facial detection. Thanks for watchi

                                                                      

MoviePy - Done.
audio_output/Mod05_Sect01_ver2.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod05_Sect01_ver2.mp4
Processing video 35/46
Downloading video: CUR-TF-200-ACMNLP-1/video/Mod05_Sect02_part1_ver2.mp4
Mod05_Sect02_part1_ver2.mp3
Converting video to audio: CUR-TF-200-ACMNLP-1/video/Mod05_Sect02_part1_ver2.mp4
MoviePy - Writing audio in audio_output/Mod05_Sect02_part1_ver2.mp3


                                                                        

MoviePy - Done.
audio_output/Mod05_Sect02_part1_ver2.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod05_Sect02_part1_ver2.mp4
Text:  Welcome back. In this section, we'll explore image analysis in more detail. And in part two, we'll take a closer look into video analysis. To start, we'll introduce the main Amazon service we'll be using, Amazon Recognition. Amazon Recognition is a computer vision service that's based on deep learning. You can use it to add image and video analysis to your applications. There are many uses for Amazon Recognition, including creating searchable image and video libraries. Amazon Recognition makes both images and stored videos searchable, so you can discover the objects and scenes that appear in them. You can use Amazon Recognition to build a face-based user verification system, so your applications can confirm user identities by comparing their live image with a reference image. Amazon Recognition interprets emotional expressions, such as happy, sad, or surprise. It

                                                                        

MoviePy - Done.
audio_output/Mod05_Sect02_part2.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod05_Sect02_part2.mp4
Text:  Hi, welcome back. We'll continue exploring image analysis with a closer look at facial detection. Facial detection uses a model that was tuned to perform predictions specifically for detecting faces and facial features. Facial detection has many of the same features as standard object detection, such as a bounding box or the coordinates of the box surrounding the face that was detected. This will include a value representing the confidence that the bounding box contains a face. There will be a list of attributes if found, such as if the face has a beard or if it appears to be male or female. There will also be a confidence score for these attributes. It can also detect physical emotions, like whether the person is smiling or frowning. It's important to understand this classification is based only on visual clues, and so it might not represent the actual emotion of the pers

                                                                      

MoviePy - Done.
audio_output/Mod05_Sect03_part1.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod05_Sect03_part1.mp4
Text:  In this section, we'll look at preparing custom datasets for computer vision, so you can detect custom objects. One challenge of using a pre-built model is that it will only find images it was trained to find. Though Amazon Rekognition was trained with tens of millions of images, it can't detect objects that it wasn't trained on. For example, consider the 8 of hearts playing card. If you run this card through Amazon Recognition, the results show various attributes. However, none of the labels are playing card or 8 of hearts. If you want Amazon Recognition to detect images in your problem domain, you must train the model with your images. So in this section, you'll learn how to train Amazon Recognition with images from your problem domain. Though you'll focus only on using Amazon Recognition here, you'll encounter a similar process if you use other pre-trained models. Train

                                                                      

MoviePy - Done.
audio_output/Mod05_Sect03_part2.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod05_Sect03_part2.mp4
Text:  Hi, welcome back. We'll continue exploring video analysis by reviewing how to create the training dataset. Datasets contain information that's needed to train and test an Amazon Recognition Custom Labels model, such as images, labels, and bounding boxes. such as images, labels, and bounding boxes. You can use images from Amazon S3, or you can upload them from your computer to S3 as part of the process. To train a model, your dataset should have at least two labels, with at least 10 images per label. Each image in your dataset must be labeled. As we mentioned earlier, you can use the Amazon Recognition Custom Labels console or Amazon SageMaker Ground Truth to label your images. Again, to train an Amazon Recognition Custom Labels model, your images must be labeled. A label indicates that an image contains an object, scene, or concept. As we mentioned earlier, a dataset needs

                                                                      

MoviePy - Done.
audio_output/Mod05_Sect03_part3.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod05_Sect03_part3.mp4
Text:  Hi, welcome back. We'll continue exploring video analysis by reviewing how to create the test dataset. The final step before you train your model is to identify a test dataset. You will use this test dataset to validate and evaluate the model's performance. You'll do this by performing an inference on the images in the test dataset. You'll then compare the results with the labeling information that's in the training dataset. You can create your own test dataset. Alternatively, you can use Amazon Recognition Custom Labels to split your training dataset into two datasets by using an 80-20 split. This split means that 80% of the data is used for training and 20% is used for testing. After you define the training and test datasets, Amazon Recognition Custom Labels can automatically train the model for you. The service automatically loads and inspects the data, selects the corr

                                                                      

MoviePy - Done.
audio_output/Mod05_Sect03_part4_ver2.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod05_Sect03_part4_ver2.mp4
Text:  Hi, welcome back. We'll continue exploring video analysis by reviewing how to evaluate and improve your model. In general, you can improve the quality of your model with larger quantities of better quality data. Use training images that clearly show the object or scene, and don't include many things that you're not interested in. For bounding boxes around objects, use training images that show the object as fully visible and not hidden by other objects. Make sure that your training and test data sets match the type of images that you'll eventually run inference on. For objects where you have just a few training examples, like logos, you should provide bounding boxes around the logo in your test images. These images represent the scenarios you want to localize the object in. Reducing false positives often results in better precision. To reduce false positives, fir

                                                                     

MoviePy - Done.
audio_output/Mod05_WrapUp_ver2.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod05_WrapUp_ver2.mp4
Text:  It's now time to summarize some of the main points in this module. In this module, you learned how to describe the use cases for computer vision, describe the Amazon Managed Machine Learning services available for image and video analysis, list the steps required to prepare a custom data set for object detection. Describe how Amazon SageMaker Ground Truth can be used to prepare a custom data set. And use Amazon Recognition to perform facial detection. That concludes this introduction to computer vision. Thanks for watching. We'll see you again in the next video.
Processing video 42/46
Downloading video: CUR-TF-200-ACMNLP-1/video/Mod06_Intro.mp4
Mod06_Intro.mp3
Converting video to audio: CUR-TF-200-ACMNLP-1/video/Mod06_Intro.mp4
MoviePy - Writing audio in audio_output/Mod06_Intro.mp3


                                                                      

MoviePy - Done.
audio_output/Mod06_Intro.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod06_Intro.mp4
Text:  Introduction to Natural Language Processing Hi, and welcome to Module 6 of AWS Academy Machine Learning, Introduction to Natural Language Processing. In this module, we'll introduce Natural Language Processing, which is also known as NLP. This section includes a description of the major challenges faced by NLP and the overall development process for NLP applications. We'll then review five AWS services you can use to speed up the development of NLP-based applications. After completing this module, you should be able to describe the NLP use cases that are solved by using managed Amazon ML services, and describe the managed Amazon ML services available for NLP. Let's get started.
Processing video 43/46
Downloading video: CUR-TF-200-ACMNLP-1/video/Mod06_Sect01.mp4
Mod06_Sect01.mp3
Converting video to audio: CUR-TF-200-ACMNLP-1/video/Mod06_Sect01.mp4
MoviePy - Writing audio in audio_output/

                                                                        

MoviePy - Done.
audio_output/Mod06_Sect01.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod06_Sect01.mp4
Text:  We'll get started by reviewing what Natural Language Processing means. Natural Language Processing is also known as NLP. Before we explain what NLP is, we'll consider an example of NLP, Amazon Alexa. Alexa works by having a device, such as an Amazon Echo, record your words. The recording of your speech is sent to Amazon's servers to be analyzed more efficiently. Amazon breaks down your phrase into individual sounds. Then, it connects to a database containing the pronunciation of various words to find which words most closely correspond to the combination of individual sounds. Amazon identifies important words to make sense of the tasks and carry out corresponding functions. For instance, if Alexa notices words like outside or temperature, it will open the weather Alexa skill. Amazon servers then send the information back to your device and Alexa skill. Amazon servers then send the inf

                                                                        

MoviePy - Done.
audio_output/Mod06_Sect02.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod06_Sect02.mp4
Text:  Welcome back. In this section, we'll review five managed machine learning services you can use for various use cases. These services simplify the process of creating a machine learning application. We'll start by looking at Amazon Transcribe. You can use Amazon Transcribe to recognize speech in audio files and produce a transcription. It can recognize specific voices in an audio file, and you can create a customized vocabulary for terms that are specialized for a particular domain. You can also add a transcription service to your applications by integrating with WebSockets, a transcription service to your applications by integrating with WebSockets, an internet protocol you can use for two-way communication between an application and Amazon Transcribe. Here are some of the more common use cases for Amazon Transcribe. First, medical professionals can record their notes, and Amazon Tran

                                                                    

MoviePy - Done.
audio_output/Mod06_WrapUp.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod06_WrapUp.mp4
Text:  Welcome back. It's now time to review the module and wrap it up. In summary, in this module, you learned how to describe the NLP use cases that are solved by using managed Amazon ML services and describe the managed ML services available for NLP. Good job. Thanks for watching. We'll see you in the next module. and describe the managed ML services available for NLP. Good job. Thanks for watching. We'll see you in the next module.
Processing video 46/46
Downloading video: CUR-TF-200-ACMNLP-1/video/Mod07_Sect01.mp4
Mod07_Sect01.mp3
Converting video to audio: CUR-TF-200-ACMNLP-1/video/Mod07_Sect01.mp4
MoviePy - Writing audio in audio_output/Mod07_Sect01.mp3


                                                                      

MoviePy - Done.
audio_output/Mod07_Sect01.mp3
Video: CUR-TF-200-ACMNLP-1/video/Mod07_Sect01.mp4
Text:  Welcome to Module 7, Course Wrap-Up. Congratulations on completing the AWS Academy Machine Learning course. We'll take a few minutes to review what you've learned and where you can go from here. We're going to start with a review of what you've learned in this course. You learned how to describe machine learning, implement a machine learning pipeline, and use Amazon machine learning services for forecasting, computer vision, and natural language processing. Well done. Although this course isn't designed to prepare you to become certified for the AWS Certified Machine Learning specialty, we'll review how you can continue to work towards that certification. AWS Certification helps you build credibility and confidence by validating your cloud expertise with an industry-recognized credential. It also helps organizations identify skilled professionals who can lead cloud initiatives by usin

## 3. Normalizing the text
([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to perform any text normalization steps that are necessary for your solution.

In [11]:
nltk.download('punkt')
nltk.download('stopwords')

def normalize_text(text):
    normalized_text = normalize('NFKD', text).encode('ASCII', 'ignore').decode('utf-8')
    normalized_text = normalized_text.lower()
    words = word_tokenize(normalized_text)
    words = [word for word in words if word.isalnum()]
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    normalized_text = ' '.join(words)
    return normalized_text

def normalization_of_files(folder_path):
    normalized_texts = {}
    for filename in os.listdir(folder_path):
        if filename.endswith('.txt'):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
                text = file.read()
                normaalize_text = nomalization_of_text(text)
                normalized_texts[filename] = normaalize_text
    return normalized_texts

if not os.path.exists("normalized_text"):
    os.makedirs("normalized_text")
    
folder_path = 'text_data/'  
normalized_texts = normalization_of_files(folder_path)
for filename, normaalize_text in normalized_texts.items():
    print("File:", filename)
    print("\n")
    print("Normalized Text:", normaalize_text)
    print("\n\n")
    with open(os.path.join("normalized_text", filename), 'w', encoding='utf-8') as file:
        file.write(normaalize_text)
            

File: Mod04_Sect02_part2.txt


Normalized Text: hi, welcome back. well continue exploring wrangling time series data. seasonality in data is any kind of repeating observation where the frequency of the observation is stable. for example, in sales, you typically see higher sales at the end of a quarter and into the fourth quarter. consumer retail sees even higher sales in the fourth quarter. be aware that data can have multiple types of seasonality in the same data set. there are many times when you should incorporate seasonality information into your forecast. for instance, localized holidays are a good example for sales. the chart shows that the total revenue generated by arcades has a strong correlation with the number of computer science doctorates awarded in the us. but correlations do not mean causation. if you disagree, see the source for the chart. there are many other correlations plotted on the site, and none of them make any sense. with your own data, be careful that youre no

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 4. Extracting key phrases and topics
([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to extract the key phrases and topics from the videos.

In [12]:

def download_stopwords():
    nltk.download('stopwords')

def extract_key_phrases(text):
    # Tokenization of the text data 
    words = word_tokenize(text)

    # Removing all the stop words
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word.lower() not in stop_words]

    phrase_one = BigramCollocationFinder.from_words(filtered_words)
    phrase_two = TrigramCollocationFinder.from_words(filtered_words)

    phrase_one_score = phrase_one.score_ngrams(BigramAssocMeasures.likelihood_ratio)
    phrase_two_score = phrase_two.score_ngrams(TrigramAssocMeasures.likelihood_ratio)


    best_phrases_one_list = [phrases_one_list for phrases_one_list, score in sorted(phrase_one_score, key=lambda x: -x[1])[:7]]
    best_phrases_two_list = [phrases_two_list for phrases_two_list, score in sorted(phrase_two_score, key=lambda x: -x[1])[:7]]

    topics_list = []
    for ngram, score in sorted(phrase_one_score + phrase_two_score, key=lambda x: -x[1])[:7]:
        topics_list.append(' '.join(ngram))

    return best_phrases_one_list, best_phrases_two_list, topics_list

def process_normalized_files(input_folder, output_folder):
    
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    for filename in os.listdir(input_folder):
        if filename.endswith('.txt'):
            with open(os.path.join(input_folder, filename), 'r', encoding='utf-8') as file:
                text = file.read()
                
                key_bigrams, key_trigrams, topics_list = extract_key_phrases(text)
                
                output_file_path = os.path.join(output_folder, f"{os.path.splitext(filename)[0]}_key_phrases.txt")
                with open(output_file_path, 'w', encoding='utf-8') as output_file:
                    output_file.write("Key Phrases:\n")
                    for phrases_one_list in key_bigrams:
                        output_file.write(' '.join(phrases_one_list) + '\n')
                    for phrases_two_list in key_trigrams:
                        output_file.write(' '.join(phrases_two_list) + '\n')
                    output_file.write("\nKey Topics:\n")
                    for topic in topics_list:
                        output_file.write(topic + '\n')
                
                print(f"Processed file: {filename}")
                print("Key Phrases:")
                for phrases_one_list in key_bigrams:
                    print(' '.join(phrases_one_list))
                for phrases_two_list in key_trigrams:
                    print(' '.join(phrases_two_list))
                print("\nKey Topics:")
                for topic in topics_list:
                    print(topic)
                print()


normalized_text_folder = 'normalized_text/'  
output_folder = 'key_phrases/'  
process_normalized_files(normalized_text_folder, output_folder)


Processed file: Mod04_Sect02_part2.txt
Key Phrases:
time series
amazon forecast
appendix excellent
arent independent
want determine
fourth quarter
noise separate
time series data
excellent time series
time series support
trend time series
face time series
random time series
wrangling time series

Key Topics:
time series data
excellent time series
time series support
trend time series
face time series
random time series
wrangling time series

Processed file: Mod05_Sect03_part4_ver2.txt
Key Phrases:
custom labels
amazon recognition
bounding box
bounding boxes
false positives
models calculated
calculated threshold
custom labels operation
detect custom labels
recognition custom labels
custom labels returned
custom labels accuracy
number custom labels
custom labels console

Key Topics:
custom labels operation
detect custom labels
recognition custom labels
custom labels returned
custom labels accuracy
number custom labels
custom labels console

Processed file: Mod05_Intro.txt
Key Phrases:
co

## 5. Creating the dashboard
([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to create the dashboard for your solution.

In [16]:
s3_bucket = 'aws-tc-largeobjects'
s3_prefix = 'CUR-TF-200-ACMNLP-1/video/'

search_field = widgets.Text(placeholder='Enter key phrase and topics', description='Search:', layout=widgets.Layout(width='50%'))
search_button = widgets.Button(description='Search', layout=widgets.Layout(margin='0 10px 0 0'))
clear_button = widgets.Button(description='Clear', layout=widgets.Layout(margin='0 0 0 10px'))
output_box = widgets.Output()
video_display = widgets.Output(layout=widgets.Layout(width='70%', height='400px'))

search_box = widgets.HBox([search_field, search_button, clear_button], layout=widgets.Layout(justify_content='flex-start'))

display(search_box, output_box, video_display)

def search_function(button):
    search_term = search_field.value.lower()
    
    matching_files = []
    for filename in os.listdir('key_phrases/'):
        with open(os.path.join('key_phrases/', filename), 'r', encoding='utf-8') as file:
            key_phrases = file.read().lower()
            if search_term in key_phrases:
                matching_files.append(filename[:-16])
    
    with output_box:
        clear_output(wait=True) 
        if matching_files:
            print("Matching files:")
            for file in matching_files:
                matching_file_button = widgets.Button(description=file, layout=widgets.Layout(width='auto'))
                matching_file_button.on_click(lambda x, file=file: display_video(file))
                display(matching_file_button)
        else:
            print("No matching files found.")

def display_video(file_name):
    with video_display:
        clear_output(wait=True)
    
    s3 = boto3.client('s3')
    video_url = f'https://{s3_bucket}.s3.amazonaws.com/{s3_prefix}{file_name}.mp4'
    with video_display:
        display(Video(video_url, width=800)) 

def clear_video(button):
    with video_display:
        clear_output(wait=True)

search_button.on_click(search_function)
clear_button.on_click(clear_video)


HBox(children=(Text(value='', description='Search:', layout=Layout(width='50%'), placeholder='Enter key phrase…

Output()

Output(layout=Layout(height='400px', width='70%'))