# Project

In this project, you will bring together many of the tools and techniques that you have learned throughout this course. You can choose from many different paths to reach the solution.

### Business scenario

You work for a training organization that recently developed an introductory course about machine learning (ML). The course includes more than 40 videos that cover a broad range of ML topics. You have been asked to create an application that students can use to quickly locate and view video content by searching for topics and key phrases.

You have downloaded all of the videos to an Amazon Simple Storage Service (Amazon S3) bucket. Your assignment is to produce a dashboard that meets your supervisor’s requirements.

## Project steps

To complete this project, you will follow these steps:

1. [Viewing the video files](#1.-Viewing-the-video-files)
2. [Transcribing the videos](#2.-Transcribing-the-videos)
3. [Normalizing the text](#3.-Normalizing-the-text)
4. [Extracting key phrases and topics](#4.-Extracting-key-phrases-and-topics)
5. [Creating the dashboard](#5.-Creating-the-dashboard)

## Useful information

The following cell contains some information that might be useful as you complete this project.

In [None]:
bucket = "c56161a939430l3396553t1w744137092661-labbucket-rn642jaq01e9"
job_data_access_role = 'arn:aws:iam::744137092661:role/service-role/c56161a939430l3396553t1w7-ComprehendDataAccessRole-1P24MSS91ADHP'

## 1. Viewing the video files
([Go to top](#Capstone-8:-Bringing-It-All-Together))


The source video files are located in the following shared Amazon Simple Storage Service (Amazon S3) bucket.

In [None]:
!aws s3 ls s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/

In [None]:
import os

# Create the local download folder if it does not already exist.
os.makedirs('videos', exist_ok=True)
!aws s3 cp s3://aws-tc-largeobjects/CUR-TF-200-ACMNLP-1/video/ videos --recursive

## 2. Transcribing the videos
 ([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to implement your solution to transcribe the videos. 

In [31]:
%%time
# Install and import the libraries used for audio extraction and speech recognition.
!pip install moviepy
!pip install speechrecognition
!pip install pydub
import moviepy.editor as mp
import speech_recognition as sr
from pydub import AudioSegment
from pydub.utils import make_chunks
from tqdm import tqdm
import os


# Create the working folders for the extracted audio, the temporary audio chunks,
# and the final transcripts. The original videos are stored in the 'videos' folder.
for folder in ('chunked', 'Wav File', 'Transcribe Files'):
    os.makedirs(folder, exist_ok=True)

video_files = [file for file in os.listdir('videos') if file.endswith('.mp4')]

print(len(video_files))
# extract_text(video_files)



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com

In [3]:
def extract_text(fileNames):
    for i, file in enumerate(fileNames):
        print(i)  # progress indicator: index of the video being processed
        base_name = os.path.splitext(file)[0]
        # Create an audio file from the video file.
        video = mp.VideoFileClip(f'videos/{file}')
        audio_filename = base_name + '.wav'
        video.audio.write_audiofile(f'Wav File/{audio_filename}')
        video.close()
        # Split the audio into chunks because the speech recognition library
        # accepts only small audio files and the full recordings are large.
        myaudio = AudioSegment.from_wav(f'Wav File/{audio_filename}')
        chunks_length_ms = 70000
        chunks = make_chunks(myaudio, chunks_length_ms)
        text = ""
        r = sr.Recognizer()
        # Transcribe each chunk, append its text to the running transcript,
        # and finally write the combined transcript to a text file.
        for j, chunk in enumerate(chunks):
            chunkName = f"{base_name}_{j}.wav"
            chunk.export(f"chunked/{chunkName}", format="wav")
            with sr.AudioFile(f"chunked/{chunkName}") as source:
                audio_data = r.record(source)
            try:
                text = text + " " + r.recognize_google(audio_data)
            except sr.UnknownValueError:
                print(f"Could not recognize any speech in {chunkName}")
            os.remove(f"chunked/{chunkName}")
        filename = base_name + ".txt"
        with open(f'Transcribe Files/{filename}', 'w') as writer:
            writer.write(text)
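
The chunking step in `extract_text` cuts each recording into fixed pieces of 70,000 ms before recognition. As a minimal sketch of that idea, the chunk boundaries can be computed as shown here. The helper name `chunk_bounds` and the 3-minute duration are hypothetical and used only for illustration; they are not part of pydub. When the duration is not a multiple of the chunk length, the last chunk is simply shorter.

```python
def chunk_bounds(duration_ms, chunk_length_ms=70000):
    """Return (start, end) millisecond offsets for fixed-length chunks."""
    if chunk_length_ms <= 0:
        raise ValueError("chunk_length_ms must be positive")
    return [
        (start, min(start + chunk_length_ms, duration_ms))
        for start in range(0, duration_ms, chunk_length_ms)
    ]

# A hypothetical 3-minute recording (180,000 ms) is split into three chunks;
# the final chunk covers only the remaining 40,000 ms.
bounds = chunk_bounds(180000)
print(bounds)
```

A shorter chunk length produces more recognition requests per video, so the value is a trade-off between request size and request count.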

In [32]:
%%time
# Read the text from each transcript file and build a DataFrame that the later steps use.
import pandas as pd
text_files = [file for file in os.listdir("Transcribe Files") if file.endswith('.txt')]
def get_data_from_text(text_files):
    """Return a list of [video name, transcript] pairs, one per transcript file."""
    final_dataset = []
    for text_file in text_files:
        name = text_file[:-4]  # drop the '.txt' extension
        with open(f'Transcribe Files/{text_file}') as f:
            contents = f.read()
        final_dataset.append([name, contents])
    return final_dataset
dataset = get_data_from_text(text_files)
final_dataset = pd.DataFrame(dataset, columns=['name', 'Transcribe'])

CPU times: user 7.95 ms, sys: 9 µs, total: 7.96 ms
Wall time: 294 ms
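
The same loading step can also be written with `pathlib`, which handles the file extension and path joining. The following is only a sketch under assumptions: `load_transcripts` and the two demo file names are hypothetical, and the demo writes to a temporary folder rather than to `Transcribe Files`. The resulting list of dictionaries could be passed to `pd.DataFrame` to get the same two columns.

```python
import tempfile
from pathlib import Path

def load_transcripts(folder):
    """Return one {'name', 'Transcribe'} row per .txt file, sorted by file name."""
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        rows.append({
            "name": path.stem,  # file name without the .txt extension
            "Transcribe": path.read_text(encoding="utf-8"),
        })
    return rows

# Demonstrate with two hypothetical transcript files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo_b.txt").write_text("in this section", encoding="utf-8")
    Path(tmp, "demo_a.txt").write_text("hi and welcome", encoding="utf-8")
    rows = load_transcripts(tmp)

print(rows)
```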


In [33]:
final_dataset.head(10)

|   | name | Transcribe |
|---|------|------------|
| 0 | Mod02_Sect03 | hi and welcome back this is Section 3 and we'... |
| 1 | Mod03_Sect02_part2 | hi welcome back we'll continue exploring data... |
| 2 | Mod05_WrapUp_ver2 | it's now time to summarize some of the main p... |
| 3 | Mod03_Sect01 | hi and welcome back to module 3 this is Secti... |
| 4 | Mod05_Sect03_part1 | in this section we'll look at preparing custo... |
| 5 | Mod05_Sect02_part1_ver2 | welcome back in this section we'll explore im... |
| 6 | Mod03_Sect03_part3 | hi welcome back now will review how to find c... |
| 7 | Mod03_Sect04_part2 | hi welcome back we'll continue exploring feat... |
| 8 | Mod03_Sect07_part3 | hi welcome back we'll continue exploring how ... |
| 9 | Mod02_Sect01 | hi and welcome to section 1 in this section w... |


## 3. Normalizing the text
([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to perform any text normalization steps that are necessary for your solution.

In [34]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download("wordnet")
nltk.download("omw-1.4")
stopWords = set(stopwords.words('english'))  # a set makes membership checks fast
nltk.download('punkt')
LEMMATIZER = WordNetLemmatizer()
# Normalize the text: lowercase it, trim surrounding whitespace, remove punctuation,
# drop non-alphabetic tokens, stop words, and one-letter tokens, and lemmatize what remains.
def normalized_text(text:str):
    text = text.lower()
    text = text.strip()
    text = re.sub(r'[^\w\s]', '', text)  # raw string avoids invalid-escape warnings
    text = text.split()
    text = [word for word in text if word.isalpha()]
    text = [word for word in text if word not in stopWords and len(word) >= 2]
    text = [LEMMATIZER.lemmatize(word) for word in text]
    return ' '.join(text)

final_dataset['cleaned']=final_dataset['Transcribe'].apply(normalized_text)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/ec2-user/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
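
To see what the normalization steps do to a single sentence, here is a small standalone sketch. It assumes a deliberately tiny, hypothetical stop-word set (`DEMO_STOP_WORDS`) instead of the NLTK list, and it leaves out lemmatization so that it runs without any downloads; `normalize_demo` is not the function used in the notebook.

```python
import re

# Hypothetical, deliberately tiny stop-word set used only for this illustration.
DEMO_STOP_WORDS = {"and", "this", "is", "to", "the", "we", "a", "of"}

def normalize_demo(text):
    """Lowercase, remove punctuation, and drop stop words, numbers, and one-letter tokens."""
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation characters
    tokens = [token for token in text.split() if token.isalpha()]
    tokens = [t for t in tokens if t not in DEMO_STOP_WORDS and len(t) >= 2]
    return " ".join(tokens)

print(normalize_demo("Hi, and welcome back! This is Section 3."))
# prints: hi welcome back section
```

Because punctuation is removed before the stop-word check, a contraction such as "we'll" becomes "well" and is no longer matched by a stop-word entry for the contraction, which is worth keeping in mind when reading the extracted key phrases.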


## 4. Extracting key phrases and topics
([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to extract the key phrases and topics from the videos.

In [35]:
# Use the KeyBERT library to extract keywords from each transcript.
!pip install keybert
from keybert import KeyBERT

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com


In [36]:
# Use KeyBERT with the all-mpnet-base-v2 embedding model to extract the top 30 key phrases (one to three words each) from each transcript.
keyword_model = KeyBERT(model='all-mpnet-base-v2')
def keyword_extractor_keybert(text):
    # The result is a list of (phrase, score) tuples.
    keywords = keyword_model.extract_keywords(text, keyphrase_ngram_range=(1, 3), stop_words='english', highlight=False, top_n=30)
    return list(keywords)
final_dataset['keywords']=final_dataset['cleaned'].apply(keyword_extractor_keybert)

In [37]:
# Keep only the key phrases and drop their scores.
final_dataset['keywords']=final_dataset['keywords'].apply(lambda x: list(dict(x).keys()))
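
The setting `keyphrase_ngram_range=(1, 3)` asks KeyBERT to consider candidate phrases of one to three words. The sketch below only illustrates what such candidates look like for a short token list. `ngram_candidates` is a hypothetical helper; it does not reproduce how KeyBERT builds or scores its candidates.

```python
def ngram_candidates(tokens, ngram_range=(1, 3)):
    """Return the distinct word n-grams of the requested lengths, in order of appearance."""
    low, high = ngram_range
    candidates = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[start:start + n])
            if phrase not in candidates:
                candidates.append(phrase)
    return candidates

candidates = ngram_candidates("machine learning model training".split())
print(candidates)
```

For four tokens this yields four single words, three two-word phrases, and two three-word phrases, which shows why a wider n-gram range increases the number of candidates that have to be embedded and compared.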

## 5. Creating the dashboard
([Go to top](#Capstone-8:-Bringing-It-All-Together))

Use this section to create the dashboard for your solution.

In [38]:
# Write your answer/code here
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('distilbert-base-nli-mean-tokens')

# rank_videos encodes the key phrases extracted for every video, computes the cosine
# similarity between the input keywords and each video's key phrases, and sorts the
# videos by similarity score. get_videos then returns the names of the top 5 videos.

def rank_videos(input_keywords):
    video_encode = model.encode(final_dataset['keywords'].apply(lambda x: ' '.join(x)).tolist())
    input_encode = model.encode(' '.join(input_keywords))
    similarity_score = cosine_similarity(input_encode.reshape(1, -1), video_encode)
    # Pair each row position with its score and sort from most to least similar.
    video_rankings = sorted(enumerate(similarity_score[0]), key=lambda x: x[1], reverse=True)
    return video_rankings
# Generate recommendations
def get_videos(input_keywords):
    ranked_videos = rank_videos(input_keywords)
    indexes = [i[0] for i in ranked_videos]
    return final_dataset.iloc[indexes[:5]]['name']
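
The ranking logic reduces to two steps: score every video vector against the query vector with cosine similarity, then sort by score. Here is a dependency-free sketch of those two steps. The three-dimensional vectors and the names `video_a`, `video_b`, and `video_c` are hypothetical stand-ins for real sentence embeddings, and `cosine` and `top_k` are illustrative helpers rather than the functions used in the notebook.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, item_vecs, k=5):
    """Return the k (name, score) pairs with the highest cosine similarity to the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in item_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical embeddings for three videos and one query.
demo_vectors = {
    "video_a": [1.0, 0.0, 0.0],
    "video_b": [0.6, 0.8, 0.0],
    "video_c": [0.0, 0.0, 1.0],
}
ranking = top_k([1.0, 0.0, 0.0], demo_vectors, k=2)
print([name for name, _ in ranking])
# prints: ['video_a', 'video_b']
```

Since the video vectors do not change between queries, they could be encoded once and reused, leaving only the query to be encoded per search.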

In [44]:
while True:
    keyword = input("Which videos you want ? or Type Exit or Quit ")
    # Stop when the user types exit or quit, in any letter case.
    if keyword.strip().lower() in ("exit", "quit"):
        break
    # Extract key phrases from the query and keep only the phrases, not the scores.
    keyword = list(dict(keyword_extractor_keybert(keyword)).keys())
    result = get_videos(keyword)
    print("I would like to recommend this 5 videos :")
    for index, i in enumerate(result, start=1):
        print(f'{index}) https://nlp-project-transcribevideos.s3.amazonaws.com/videos/{i}.mp4')

Which videos you want ? or Type Exit or QuitI want video on ml
I would like to recommend this 5 videos :
1) https://nlp-project-transcribevideos.s3.amazonaws.com/videos/Mod03_Sect03_part3.mp4
2) https://nlp-project-transcribevideos.s3.amazonaws.com/videos/Mod06_Intro.mp4
3) https://nlp-project-transcribevideos.s3.amazonaws.com/videos/Mod04_Sect02_part1.mp4
4) https://nlp-project-transcribevideos.s3.amazonaws.com/videos/Mod03_Sect07_part2.mp4
5) https://nlp-project-transcribevideos.s3.amazonaws.com/videos/Mod05_Sect01_ver2.mp4
Which videos you want ? or Type Exit or QuitI want video on nlp
I would like to recommend this 5 videos :
1) https://nlp-project-transcribevideos.s3.amazonaws.com/videos/Mod06_Intro.mp4
2) https://nlp-project-transcribevideos.s3.amazonaws.com/videos/Mod03_Sect03_part3.mp4
3) https://nlp-project-transcribevideos.s3.amazonaws.com/videos/Mod06_Sect01.mp4
4) https://nlp-project-transcribevideos.s3.amazonaws.com/videos/Mod03_Sect07_part2.mp4
5) https://nlp-project-tran