## From the Colab menu, select **Runtime** > **Change runtime type**, verify that the runtime is set to Python 3, and select GPU if you want to try the GPU version.

## Common Setup

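1. **Install the Google Cloud Speech package**

You may have to restart the runtime after this. The transcription cell further down imports `enums`, which is only available in `google-cloud-speech` releases before 2.0.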

In [1]:
!pip3 install google-cloud-speech



## Setup

1. **Upload the Google Cloud credentials file**

Store your Google Cloud credentials in a file named **`gc-creds.json`** and upload it by running the following code cell. See https://developers.google.com/accounts/docs/application-default-credentials for more details.

This may require enabling **third-party cookies**. See https://colab.research.google.com/notebooks/io.ipynb for alternatives.

In [2]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving gc-creds.json to gc-creds (1).json
User uploaded file "gc-creds.json" with length 2391 bytes


In [3]:
!pwd
!ls -l ./gc-creds.json

/content
-rw-r--r-- 1 root root 2391 May 17 03:00 ./gc-creds.json


2. **Set environment variable**

In [4]:
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/content/gc-creds.json'

!ls -l $GOOGLE_APPLICATION_CREDENTIALS

-rw-r--r-- 1 root root 2391 May 17 03:00 /content/gc-creds.json
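Before calling the Speech API, it can help to confirm that the variable points at a readable key file. The following is a minimal sketch and not part of the original notebook: the helper name `check_credentials_file` is hypothetical, and the set of required keys is an assumption based on the usual layout of a service-account JSON key.

```python
import json
import os

# Assumption: a service-account key file carries at least these fields.
REQUIRED_KEYS = {"type", "project_id", "private_key", "client_email"}

def check_credentials_file(path):
    """Return the sorted list of expected keys missing from the JSON file at `path`."""
    if not os.path.isfile(path):
        raise FileNotFoundError("No credentials file at {}".format(path))
    with open(path) as f:
        data = json.load(f)
    return sorted(REQUIRED_KEYS - set(data))

# Usage (hypothetical; relies on the environment variable set above):
# missing = check_credentials_file(os.environ["GOOGLE_APPLICATION_CREDENTIALS"])
# print("Missing keys:", missing)
```

An empty list means every expected field is present; it does not prove that the key is still valid on the Google Cloud side.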


In [8]:
from google.cloud import speech_v1p1beta1 as speech_v1
from google.cloud.speech_v1p1beta1 import enums

def sample_long_running_recognize(storage_uri, sample_rate):
    """
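    Transcribe a long audio file from Cloud Storage using asynchronous speech
    recognition.

    Args:
      storage_uri: URI of the audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]
      sample_rate: Sample rate of the audio in Hertz

    Returns:
      A list with the most probable transcript of each recognized segment.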
    """

    client = speech_v1.SpeechClient()

    # storage_uri = 'gs://cloud-samples-data/speech/brooklyn_bridge.raw'

    # Sample rate in Hertz of the audio data sent
    sample_rate_hertz = sample_rate

    # The language of the supplied audio
    language_code = "en-US"

    # Encoding of audio data sent. This sample sets this explicitly.
    # This field is optional for FLAC and WAV audio formats.
    encoding = enums.RecognitionConfig.AudioEncoding.MP3
    config = {
        "sample_rate_hertz": sample_rate_hertz,
        "language_code": language_code,
        "encoding": encoding,
    }
    audio = {"uri": storage_uri}

    operation = client.long_running_recognize(config, audio)

    print(u"Waiting for operation to complete...")
    response = operation.result()

    output = []

    for result in response.results:
        # First alternative is the most probable result
        alternative = result.alternatives[0]
        #print(u"Transcript: {}".format(alternative.transcript))
        output.append(alternative.transcript)

    return output
        
# Cloud Storage URIs of the podcasts; the list index is the podcast ID (see the appendix)
podcast_uris = [
    r"gs://ru_hacks_2020/art_A journey through the mind of an artist Dustin Yellin.mp3",
    r"gs://ru_hacks_2020/art_Art in the Age of Instagram Jia Jia Fei TEDxMarthasVineyard.mp3",
    r"gs://ru_hacks_2020/art_How art can help you analyze - Amy E. Herman.mp3",
    r"gs://ru_hacks_2020/art_Why art is important Katerina Gregos TEDxGhent.mp3",
    r"gs://ru_hacks_2020/education_Every kid needs a championRita Pierson.mp3",
    r"gs://ru_hacks_2020/education_Teaching history in the 21st century Thomas Ketchell at TEDxLiege.mp3",
    r"gs://ru_hacks_2020/education_Why teachers teach but kids dont learnBen RichardsTEDxYouthHaileybury.mp3",
    r"gs://ru_hacks_2020/sports_Are athletes really getting faster, better, stronger David Epstein.mp3",
    r"gs://ru_hacks_2020/sports_The Math Behind Basketball's Wildest Moves Rajiv Maheswaran TED Talks.mp3",
    r"gs://ru_hacks_2020/sports_The best teams have this secret weapon Adam Grant.mp3",
    r"gs://ru_hacks_2020/sports_The real importance of sports Sean Adams TEDxACU.mp3",
    r"gs://ru_hacks_2020/tech_A beginner's guide to quantum computing Shohini Ghose.mp3",
    r"gs://ru_hacks_2020/tech_The next step in nanotechnology George Tulevski.mp3",
    r"gs://ru_hacks_2020/tech_iot.mp3",
]

transcripts = []

for i, uri in enumerate(podcast_uris):
  print('Handling podcast ', i+1, '...')
  transcript = sample_long_running_recognize(uri, sample_rate=24000)
  transcripts.append(transcript)

Handling podcast  1 ...
Waiting for operation to complete...
Handling podcast  2 ...
Waiting for operation to complete...
Handling podcast  3 ...
Waiting for operation to complete...
Handling podcast  4 ...
Waiting for operation to complete...
Handling podcast  5 ...
Waiting for operation to complete...
Handling podcast  6 ...
Waiting for operation to complete...
Handling podcast  7 ...
Waiting for operation to complete...
Handling podcast  8 ...
Waiting for operation to complete...
Handling podcast  9 ...
Waiting for operation to complete...
Handling podcast  10 ...
Waiting for operation to complete...
Handling podcast  11 ...
Waiting for operation to complete...
Handling podcast  12 ...
Waiting for operation to complete...
Handling podcast  13 ...
Waiting for operation to complete...
Handling podcast  14 ...
Waiting for operation to complete...
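The loop above stops at the first operation that raises, and the transcripts collected so far stay only in memory. A minimal sketch of a more forgiving batch loop follows; `transcribe_all` and its `transcribe_fn` parameter are hypothetical names introduced here for illustration, not part of the notebook or of the Speech client.

```python
def transcribe_all(uris, transcribe_fn):
    """Run transcribe_fn on every URI, keeping successes and failures separate.

    transcribe_fn is any callable that takes a URI and returns a list of
    transcript segments.
    """
    transcripts = {}
    errors = {}
    for i, uri in enumerate(uris):
        print("Handling podcast", i + 1, "...")
        try:
            transcripts[uri] = transcribe_fn(uri)
        except Exception as exc:  # keep going so one bad file does not stop the batch
            errors[uri] = repr(exc)
    return transcripts, errors
```

In this notebook the callable could be `lambda uri: sample_long_running_recognize(uri, sample_rate=24000)`. Keying the results by URI keeps each transcript tied to its source even when some files fail.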


In [0]:
# Write transcripts list to file
import pickle

with open("transcripts.pkl", "wb") as f:
  pickle.dump(transcripts, f)

In [15]:
transcripts[0]

["I was raised by lesbians in the mountains and I should have came like a forest gnome to New York City a while back really messed with my head but I'm going to do that later I'll start with when I was 8 years old I took a wood box and I buried a dollar bill a pen at a fork inside this box in Colorado and I thought some strange humanoids are aliens in 500 years would find this box and learn about the way our species exchanged ideas maybe how we ate our spaghetti I really didn't know",
 "anyways it's kind of funny cuz here I am 30 years later and I'm still making boxes",
 " no at some point I was in Hawaii I like to hike and Surf and do all that weird stuff I was making a collage for my mom and I took a addiction and I ripped it up and I made it to the start of an Agnes Martin grid I poured resin all over and Abby got stuck that she's afraid of bees and she's allergic to them so I poured more resin on the canvas thinking I could like hide it or something said the opposite they have an i

# Process transcripts for summarization

In [0]:
to_summarize = []

for t in transcripts:
  paragraph = '. '.join(t)
  to_summarize.append(paragraph)
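Some transcript segments start with a space (see the third segment of `transcripts[0]` above), so a plain `'. '.join(t)` leaves a double space after the period. A minimal sketch of a variant that trims each segment first is shown below; the helper name `segments_to_paragraph` is hypothetical.

```python
def segments_to_paragraph(segments):
    """Join transcript segments into one paragraph, one sentence per segment."""
    # Trim surrounding whitespace and drop segments that are empty afterwards
    cleaned = [s.strip() for s in segments if s.strip()]
    return ". ".join(cleaned)

example = ["I took a wood box", " no at some point I was in Hawaii", ""]
print(segments_to_paragraph(example))
# I took a wood box. no at some point I was in Hawaii
```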

In [0]:
import pickle

with open("to_summarize.pkl", "wb") as f:
  pickle.dump(to_summarize, f)

# Summarization

In [19]:
# Install the BERT extractive summarizer package

!pip install bert-extractive-summarizer

Collecting bert-extractive-summarizer
  Downloading https://files.pythonhosted.org/packages/23/1d/71f0a5c7f81b1a87d4428a6a935e9ddeb5e662e41512952e11bd10533cd9/bert-extractive-summarizer-0.4.2.tar.gz
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/22/97/7db72a0beef1825f82188a4b923e62a146271ac2ced7928baa4d47ef2467/transformers-2.9.1-py3-none-any.whl (641kB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/3b/88/49e772d686088e1278766ad68a463513642a2a877487decbd691dec02955/sentencepiece-0.1.90-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting tokenizers==0.7.0

In [20]:
!pip install spacy==2.1.3
!pip install transformers==2.2.2
!pip install neuralcoref

Collecting spacy==2.1.3
  Downloading https://files.pythonhosted.org/packages/52/da/3a1c54694c2d2f40df82f38a19ae14c6eb24a5a1a0dae87205ebea7a84d8/spacy-2.1.3-cp36-cp36m-manylinux1_x86_64.whl (27.7MB)
Collecting preshed<2.1.0,>=2.0.1
  Downloading https://files.pythonhosted.org/packages/20/93/f222fb957764a283203525ef20e62008675fd0a14ffff8cc1b1490147c63/preshed-2.0.1-cp36-cp36m-manylinux1_x86_64.whl (83kB)
Collecting blis<0.3.0,>=0.2.2
  Downloading https://files.pythonhosted.org/packages/34/46/b1d0bb71d308e820ed30316c5f0a017cb5ef5f4324bcbc7da3cf9d3b075c/blis-0.2.4-cp36-cp36m-manylinux1_x86_64.whl (3.2MB)
Collecting plac<1.0.0,>=0.9.6
  Downloading https://files.pythonhosted.org/packages/9e/9b/62c60d2f5bc135d2aa1d8c8a86aaf84edb719a59c7f11a4316259e61a298/plac-0.9.6-py2.py3-none-any.whl
Co

In [21]:
import spacy.cli
spacy.cli.download("en_core_web_sm")

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [22]:
from summarizer import Summarizer

"""
model(
    body: str        # The text that you want to summarize
    ratio: float     # The fraction of sentences to keep in the final summary
    min_length: int  # Sentences shorter than this many characters are removed (40 if not specified)
    max_length: int  # Sentences longer than this many characters are removed
)
"""

summaries = []
model = Summarizer()

for t in to_summarize:
  result = model(t, min_length=60)
  full = ''.join(result)
  summaries.append(full)



In [32]:
original_len = []
summary_len = []
for i in range(len(summaries)):
  original_len.append(len(to_summarize[i]))
  summary_len.append(len(summaries[i]))

reductions = []
min_r = 100.0
for i in range(len(summaries)):
  curr = summary_len[i]/original_len[i]
  reductions.append(curr)
  if curr < min_r:
    min_r = curr


print(1- (sum(reductions)/len(summaries))) # average reduction in characters over podcasts
print(1-min_r) # max reduction in characters

0.8586608326137236
0.9764298843118175
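The two numbers above are one minus the mean summary-to-original length ratio and one minus the smallest such ratio. A compact sketch of the same calculation on toy strings is given below; the function name `reduction_stats` is hypothetical, and it assumes non-empty originals.

```python
def reduction_stats(originals, summaries):
    """Return (average reduction, maximum reduction) in characters."""
    # Ratio of summary length to original length for each pair
    ratios = [len(s) / len(o) for o, s in zip(originals, summaries)]
    average_reduction = 1 - sum(ratios) / len(ratios)
    max_reduction = 1 - min(ratios)
    return average_reduction, max_reduction

# Toy data: a 100-character text cut to 20, and a 200-character text cut to 10
avg, best = reduction_stats(["a" * 100, "b" * 200], ["a" * 20, "b" * 10])
print(avg, best)
```

With the notebook's data the call would be `reduction_stats(to_summarize, summaries)`.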


In [0]:
import pickle

with open("summaries.pkl", "wb") as f:
  pickle.dump(summaries, f)

# Generate BERT embeddings (bert-as-service)

### The embeddings were generated with bert-as-service on my Ubuntu laptop and are loaded here from `summary_emb.pkl`

In [0]:
import pickle 

with open('/content/summary_emb.pkl', 'rb') as f:
  emb = pickle.load(f)

In [10]:
print(len(emb)) # 14, 1 for each podcast
print(len(emb[0])) # 1 for each sentence
print(len(emb[0][0])) # 768 features per BERT embedding vector

14
2
768


In [0]:
import numpy as np

summary_vectors = []

for i in range(len(emb)):
  sentence_vectors = emb[i]
  avg = np.mean(sentence_vectors, axis=0)
  summary_vectors.append(avg)
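Averaging along `axis=0` collapses the sentence dimension, so each summary ends up as a single vector with the same number of features as one sentence embedding. A toy sketch with 3-dimensional vectors instead of 768 follows; the helper name `mean_pool` is hypothetical.

```python
import numpy as np

def mean_pool(sentence_vectors):
    """Average a (num_sentences, dim) array of sentence embeddings into one (dim,) vector."""
    return np.mean(np.asarray(sentence_vectors, dtype=float), axis=0)

# Two "sentences", three features each
toy = [[1.0, 2.0, 3.0],
       [3.0, 4.0, 5.0]]
print(mean_pool(toy))  # [2. 3. 4.]
```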

In [0]:
import pickle

with open("summary_vectors.pkl", "wb") as f:
  pickle.dump(summary_vectors, f)

# Cosine similarity calculations

In [0]:
import pickle 

with open("summary_vectors.pkl", "rb") as f:
  summary_vectors = pickle.load(f)

In [0]:
# Calculate pairwise cosine similarities and store them in a dict: distances[i][j] is the
# similarity between podcasts i and j (higher means more alike), with 0.0 where i == j
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity as cs
from collections import defaultdict

distances = defaultdict(list)

for i in range(len(summary_vectors)):
  for j in range(len(summary_vectors)):
    if i != j:
      dist = cs(summary_vectors[i].reshape(1, -1), summary_vectors[j].reshape(1, -1))[0][0]
      distances[i].append(dist)
    else:
      distances[i].append(0.0)

In [14]:
distances[10]

[0.9704179389894878,
 0.9821084034950969,
 0.9621183552132359,
 0.989027292957433,
 0.9862388284240569,
 0.990664320041551,
 0.983770893558252,
 0.9830749119935601,
 0.98226528763534,
 0.975102426588823,
 0.0,
 0.9764336751799597,
 0.9601839280403783,
 0.9855245698411912]
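The cosine similarity of two vectors is their dot product divided by the product of their norms, so the whole table can also be computed in one matrix product after normalizing each row. The following NumPy sketch is an alternative to the nested loop, not the code that produced the output above; the function name `cosine_similarity_matrix` is hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return an (n, n) matrix of cosine similarities with zeros on the diagonal."""
    X = np.asarray(vectors, dtype=float)
    # Scale every row to unit length (assumes no all-zero rows)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, 0.0)  # mirror the notebook's 0.0 for i == j
    return sims

toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sims = cosine_similarity_matrix(toy)
print(sims)
```

Row `i` of the result plays the same role as `distances[i]` in the notebook.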

In [0]:
import pickle

with open("cosine_distances.pkl", "wb") as f:
  pickle.dump(distances, f)

# k-NN algorithm

In [0]:
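def knn(query_id, k, distances):
  # Sort the stored scores in ascending order; the 0.0 self-entry sorts first
  d = sorted(distances[query_id])
  closest = []
  for dist in d:
    # Map each score back to the podcast ID at that position
    closest.append(distances[query_id].index(dist))
  # Skip the self-entry and return the next k IDs. The scores are cosine
  # similarities, so these are the k lowest-scoring podcasts.
  return closest[1:k+1]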

In [22]:
print(knn(1, 3, distances))
print(distances[1])

[12, 2, 0]
[0.9693567816602391, 0.0, 0.9693399680323607, 0.9877530214461293, 0.9749146207604942, 0.9845845324604345, 0.9762406851846627, 0.9815967211546198, 0.9811022299256805, 0.9715402140936132, 0.9821084034950969, 0.9766872620908579, 0.9513618326860354, 0.9854427861804697]
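Because the stored values are cosine similarities, where larger means closer, the ascending sort in `knn` returns the podcasts with the lowest scores. A minimal sketch of ranking by highest similarity instead is shown below, using the row printed above as input; the function name `most_similar` is hypothetical.

```python
def most_similar(query_id, k, similarities):
    """Return the ids of the k entries with the highest similarity to query_id."""
    scores = similarities[query_id]
    # Exclude the query itself, then rank the rest from most to least similar
    candidates = [j for j in range(len(scores)) if j != query_id]
    candidates.sort(key=lambda j: scores[j], reverse=True)
    return candidates[:k]

# Values copied from the printed distances[1] above.
row_1 = [0.9693567816602391, 0.0, 0.9693399680323607, 0.9877530214461293,
         0.9749146207604942, 0.9845845324604345, 0.9762406851846627,
         0.9815967211546198, 0.9811022299256805, 0.9715402140936132,
         0.9821084034950969, 0.9766872620908579, 0.9513618326860354,
         0.9854427861804697]
print(most_similar(1, 3, {1: row_1}))  # [3, 13, 5]
```

Excluding the query by index rather than by position in the sorted list also avoids relying on the 0.0 placeholder for the self-entry.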


## Appendix: Podcasts and their IDs

In [0]:
podcast_uris = [
    r"gs://ru_hacks_2020/art_A journey through the mind of an artist Dustin Yellin.mp3",                         # 0
    r"gs://ru_hacks_2020/art_Art in the Age of Instagram Jia Jia Fei TEDxMarthasVineyard.mp3",                   # 1
    r"gs://ru_hacks_2020/art_How art can help you analyze - Amy E. Herman.mp3",                                  # 2
    r"gs://ru_hacks_2020/art_Why art is important Katerina Gregos TEDxGhent.mp3",                                # 3
    r"gs://ru_hacks_2020/education_Every kid needs a championRita Pierson.mp3",                                  # 4
    r"gs://ru_hacks_2020/education_Teaching history in the 21st century Thomas Ketchell at TEDxLiege.mp3",       # 5
    r"gs://ru_hacks_2020/education_Why teachers teach but kids dont learnBen RichardsTEDxYouthHaileybury.mp3",   # 6
    r"gs://ru_hacks_2020/sports_Are athletes really getting faster, better, stronger David Epstein.mp3",         # 7
    r"gs://ru_hacks_2020/sports_The Math Behind Basketball's Wildest Moves Rajiv Maheswaran TED Talks.mp3",      # 8
    r"gs://ru_hacks_2020/sports_The best teams have this secret weapon Adam Grant.mp3",                          # 9
    r"gs://ru_hacks_2020/sports_The real importance of sports Sean Adams TEDxACU.mp3",                           # 10
    r"gs://ru_hacks_2020/tech_A beginner's guide to quantum computing Shohini Ghose.mp3",                        # 11
    r"gs://ru_hacks_2020/tech_The next step in nanotechnology George Tulevski.mp3",                              # 12
    r"gs://ru_hacks_2020/tech_iot.mp3",                                                                          # 13
]
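Each file name starts with a category prefix (`art`, `education`, `sports`, `tech`) followed by an underscore and the talk title, which makes it easy to check whether the neighbours returned for a podcast share its category. A minimal sketch of splitting a URI into those two parts follows; the helper name `parse_podcast_uri` is hypothetical.

```python
import posixpath

def parse_podcast_uri(uri):
    """Split 'gs://bucket/<category>_<title>.mp3' into (category, title)."""
    filename = posixpath.basename(uri)           # drop the bucket part
    stem, _ext = posixpath.splitext(filename)    # drop the .mp3 extension
    category, _, title = stem.partition("_")     # split at the first underscore
    return category, title

print(parse_podcast_uri("gs://ru_hacks_2020/tech_iot.mp3"))
# ('tech', 'iot')
```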