<a href="https://colab.research.google.com/github/vicotrbb/machine_learning/blob/master/projects/podcast_summarizer/video_to_text_summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [15]:
!pip install youtube_dl
!pip install SpeechRecognition pydub

Collecting SpeechRecognition
  Downloading https://files.pythonhosted.org/packages/26/e1/7f5678cd94ec1234269d23756dbdaa4c8cfaed973412f88ae8adf7893a50/SpeechRecognition-3.8.1-py2.py3-none-any.whl (32.8MB)
Collecting pydub
  Downloading https://files.pythonhosted.org/packages/7b/d1/fbfa79371a8cd9bb15c2e3c480d7e6e340ed5cc55005174e16f48418333a/pydub-0.24.1-py2.py3-none-any.whl
Installing collected packages: SpeechRecognition, pydub
Successfully installed SpeechRecognition-3.8.1 pydub-0.24.1


In [16]:
from __future__ import unicode_literals
import youtube_dl as yt

import speech_recognition as sr 
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

In [None]:
%edit yt.main
help(yt)

In [None]:
help(AudioSegment)

In [35]:
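def download_sound_file(sound_link, source='youtube'):
    """Download the audio track of `sound_link` and convert it to MP3.

    `source` is currently unused; only links youtube_dl can handle are supported.
    """
    ydl_opts = {
        # Prefer an audio-only stream, falling back to the best combined format.
        'format': 'bestaudio/best',
        # Re-encode the download to a 192 kbps MP3 with ffmpeg.
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '192',
        }]
    }

    with yt.YoutubeDL(ydl_opts) as ydl:
        ydl.download([sound_link])

    return True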

In [22]:
def convert_sound_to_text(sound_file):
    r = sr.Recognizer()
    folder_name = "audio-chunks"
    sound = AudioSegment.from_mp3(sound_file)

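    # Cut the recording wherever it stays quiet for long enough.
    chunks = split_on_silence(
        sound,
        min_silence_len=500,             # a pause must last at least 500 ms
        silence_thresh=sound.dBFS - 14,  # "quiet" = 14 dB below the average loudness
        keep_silence=500,                # keep 500 ms of padding around each chunk
    )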

    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)
    converted_text = ""

    for i, audio_chunk in enumerate(chunks, start=1):
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")
        with sr.AudioFile(chunk_filename) as source:
            audio_listened = r.record(source)
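            try:
                text = r.recognize_google(audio_listened)
            except sr.UnknownValueError as e:
                # No intelligible speech in this chunk; skip it.
                print("Error:", str(e))
            except sr.RequestError as e:
                # The recognition service was unreachable or rejected the request; skip the chunk.
                print("Request error:", str(e))
            else:
                text = f"{text.capitalize()}. "
                print(chunk_filename, ":", text)
                converted_text += text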

    return converted_text
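
The joining rule in `convert_sound_to_text` decides what later counts as a "sentence": each chunk is passed through `capitalize()` and terminated with `". "`, and `read_article` splits on that same delimiter. A minimal sketch of the round trip, using made-up chunk transcripts in place of real `recognize_google` results (the strings and variable names here are illustrative assumptions, not output from the notebook):

```python
# Hypothetical chunk transcripts, standing in for recognize_google results.
chunk_texts = ["it's confusing", "and sometimes it's even demoralizing"]

# Same joining rule as convert_sound_to_text: capitalize, then append ". ".
converted_text = "".join(f"{t.capitalize()}. " for t in chunk_texts)

# Splitting on ". " gives one entry per chunk plus a trailing empty string,
# which is why read_article drops the last element.
parts = converted_text.split(". ")
print(parts)

# str.capitalize() also lowercases everything after the first character,
# so acronyms and the pronoun "I" lose their capitals.
lowered = "i talked to IBM".capitalize()
print(lowered)
```

Since every chunk becomes exactly one sentence, a long stretch of speech without a 500 ms pause ends up as a single very long "sentence" in the summarizer.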

In [36]:
print(download_sound_file('https://www.youtube.com/watch?v=LLyd-bqLnu8&t=61s'))

[youtube] LLyd-bqLnu8: Downloading webpage
[download] Destination: The Truth About Programming-LLyd-bqLnu8.webm
[download] 100% of 4.89MiB in 00:00
[ffmpeg] Destination: The Truth About Programming-LLyd-bqLnu8.mp3
Deleting original file The Truth About Programming-LLyd-bqLnu8.webm (pass -k to keep)
True


In [37]:
text = convert_sound_to_text('The Truth About Programming-LLyd-bqLnu8.mp3')
print(text)

audio-chunks/chunk1.wav : If you've actually done any kind of programming you would know that programming is frustrating. 
audio-chunks/chunk2.wav : It's confusing. 
audio-chunks/chunk3.wav : And sometimes it's even demoralizing. 
audio-chunks/chunk4.wav : You remember when you just started. 
audio-chunks/chunk5.wav : You were so motivated so passionate about the things you can do with it all you see are people building their empires with it and affecting billions of lives with their lines of code but no one ever told you about their failures the untold stories of people who have fallen and the hardships that brought to them the reality is when we start programming and building things break things get stuck projects get scraped or even worse business has died and jobs are lost startups fail and time is wasted failure after failure or happening everyday but all you see are two successes that coating brought to these different people but no one talks about the path to get their programmi

In [38]:
from nltk.corpus import stopwords
import nltk
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [59]:
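import re

def read_article(text):
    # Sentences were joined with ". " by convert_sound_to_text.
    article = text.split(". ")
    sentences = []

    for sentence in article:
        print(sentence)
        # str.replace matches its argument literally, so the regex-style pattern
        # never fired; re.sub applies it (apostrophes kept so contractions survive).
        sentences.append(re.sub(r"[^a-zA-Z']", " ", sentence).split())
    # The text ends with ". ", so the final split element is empty; drop it.
    sentences.pop()

    return sentences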

def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []
 
    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]
 
    all_words = list(set(sent1 + sent2))
 
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
 
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1
 
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1
 
    return 1 - cosine_distance(vector1, vector2)
 
def build_similarity_matrix(sentences, stop_words):
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
 
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2:
                continue 
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)

    return similarity_matrix


def generate_summary(text, top_n=5):
    stop_words = stopwords.words('english')
    summarize_text = []

    sentences = read_article(text)

    # Pairwise cosine similarity between sentences, used as edge weights.
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)

    # Rank sentences with PageRank over the similarity graph.
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)

    ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    print("Ranked sentences (score, tokens): ", ranked_sentence)

    # Never ask for more sentences than the text contains.
    for i in range(min(top_n, len(ranked_sentence))):
        summarize_text.append(" ".join(ranked_sentence[i][1]))

    print("Summarized text: \n", ". ".join(summarize_text))
    return summarize_text
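
The two ideas in this cell, bag-of-words cosine similarity and graph ranking, can be checked by hand on a toy input. The sketch below is pure Python and is not the notebook's implementation: `bow_cosine` and `rank_sentences` are hypothetical names, the ranking is a plain power iteration rather than `nx.pagerank` (it does not redistribute the score of sentences with no neighbours), and returning 0.0 for an empty vector is an assumption made here to avoid a division by zero.

```python
import math

def bow_cosine(sent1, sent2, stop_words=()):
    """Cosine similarity between the word-count vectors of two token lists."""
    words1 = [w.lower() for w in sent1 if w.lower() not in stop_words]
    words2 = [w.lower() for w in sent2 if w.lower() not in stop_words]
    vocab = sorted(set(words1) | set(words2))
    v1 = [words1.count(w) for w in vocab]
    v2 = [words2.count(w) for w in vocab]
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(x * x for x in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an empty vector means "no similarity"
    return sum(a * b for a, b in zip(v1, v2)) / (norm1 * norm2)

def rank_sentences(matrix, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a similarity matrix."""
    n = len(matrix)
    row_sums = [sum(row) for row in matrix]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Score flowing into sentence i from every sentence j linked to it.
            incoming = sum(
                scores[j] * matrix[j][i] / row_sums[j]
                for j in range(n)
                if row_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Toy, already-tokenized sentences (illustrative data only).
toy_sentences = [
    ["programming", "is", "hard"],
    ["programming", "is", "frustrating"],
    ["cats", "sleep"],
]
toy_stop_words = {"is"}

toy_matrix = [
    [0.0 if i == j else bow_cosine(a, b, toy_stop_words)
     for j, b in enumerate(toy_sentences)]
    for i, a in enumerate(toy_sentences)
]
toy_scores = rank_sentences(toy_matrix)
print(toy_matrix)
print(toy_scores)
```

The first two sentences share the word "programming" and reinforce each other, so they outrank the unrelated third sentence; this is the same effect that pushes topically central sentences to the top of the summary.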

In [57]:
summary = generate_summary(text, top_n=2)

If you've actually done any kind of programming you would know that programming is frustrating
It's confusing
And sometimes it's even demoralizing
You remember when you just started
You were so motivated so passionate about the things you can do with it all you see are people building their empires with it and affecting billions of lives with their lines of code but no one ever told you about their failures the untold stories of people who have fallen and the hardships that brought to them the reality is when we start programming and building things break things get stuck projects get scraped or even worse business has died and jobs are lost startups fail and time is wasted failure after failure or happening everyday but all you see are two successes that coating brought to these different people but no one talks about the path to get their programming is hard and you will fail at one point you will doubt yourself you'll see you're not smart enough that you're not lucky enough but i do

In [58]:
for point in summary:
  print('-> ' + point + ';')

-> Get ignore my advice if that's you then it means you're very stubborn but you don't give up and you care it means you fail all the time but you pick yourself back up and you become stronger it means you're dumb enough to try to do something impossible but then you make it with you push technology to its limit and you are at the forefront of innovation and you push even further and when people tell you that you can't do it you go ahead and you do it when people tell you you're wasting your time you go back and you work twice as hard and you come back and prove them wrong when people tell you you're crazy it's a compliment to you if that sounds like you then you are a true innovator and we need more people like you because we need more people to face the impossible;
-> I'm partnered with ibm today to talk about a global initiative called call for code which calls for developers to build something impactful and have a positive change across the world through their cold as you know ther