<a href="https://colab.research.google.com/github/sedwardsmarsh/Marine-Mammal-Classifier/blob/master/Marine_Mammal_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Marine Mammal Classifier**


*   Source for audio data: Watkins Marine Mammal Sound Database, Woods Hole Oceanographic Institution: https://whoicf2.whoi.edu/science/B/whalesounds/index.cfm
*   Thanks to Todd Hayton for the Python tutorial *Scraping by Example - Iterating through Select Items With Mechanize*: http://toddhayton.com/2015/01/09/scraping-by-example-ntu-edu/
*   This Stack Overflow answer was also used, thank you: https://stackoverflow.com/questions/5974595/download-all-the-linksrelated-documents-on-a-webpage-using-python






# Before running anything, tell Colab to use a GPU.
1.   Click the ‘Runtime’ tab and select ‘Change runtime type’.
2.   A pop-up window opens with a drop-down menu. Select ‘GPU’ from the menu and click ‘Save’.









<img src="https://course.fast.ai/images/colab/03.png" alt="Click the 'Runtime' tab above and select 'Change runtime type'" height="80%" width="40%">
<img src="https://course.fast.ai/images/colab/04.png" alt="A pop-up window will open up with a drop-down menu. Select ‘GPU’ from the menu and click ‘Save’." height="50%" width="50%">
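
Once the runtime type is changed, it can help to confirm that a GPU is actually visible before training. The following is a minimal sketch, not part of the original notebook: `gpu_available` is a hypothetical helper that asks PyTorch if it is installed (as it normally is on Colab) and otherwise falls back to probing for the `nvidia-smi` tool.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if a CUDA GPU appears usable from this runtime."""
    try:
        import torch  # assumed to be preinstalled on Colab; optional elsewhere
        return bool(torch.cuda.is_available())
    except ImportError:
        # Fall back to the NVIDIA driver tool when PyTorch is not installed.
        if shutil.which("nvidia-smi") is None:
            return False
        result = subprocess.run(["nvidia-smi", "-L"],
                                capture_output=True, text=True)
        return result.returncode == 0 and "GPU" in result.stdout


print("GPU available:", gpu_available())
```

If this prints `False` on Colab, the runtime type was probably not saved, or no GPU was assigned to the session.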

# Set up the environment

In [1]:
# connect to google drive
from google.colab import drive
from pathlib import Path
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"
data_dir = root_dir + "Colab Notebooks/watkins_data/"
bash_path = r"/content/gdrive/My\ Drive/Colab\ Notebooks/watkins_data/"  # raw string keeps the shell-escaping backslashes literal

Mounted at /content/gdrive
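
The Drive path contains spaces, which is why the cell above keeps a separate, hand-escaped `bash_path`. As an alternative sketch (the names `data_path` and `quoted` are illustrative and not used elsewhere in this notebook), `pathlib` can build the path and `shlex.quote` can derive a shell-safe form from it, so the two never drift apart.

```python
import shlex
from pathlib import Path

# Build the data directory path piece by piece; same location as data_dir above.
root_path = Path("/content/gdrive/My Drive")
data_path = root_path / "Colab Notebooks" / "watkins_data"

# Quote once for use in shell commands instead of escaping each space by hand.
quoted = shlex.quote(str(data_path))
print(quoted)
```

With this approach, `data_path.mkdir(parents=True, exist_ok=True)` would create the directory from Python without going through the shell at all.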


In [2]:
# create the Google Drive directory that holds the Watkins marine mammal data
# (-p makes this a no-op when the directory already exists)
%mkdir -p {bash_path}


In [3]:
# fetch the latest fast.ai version 
!curl -s https://course.fast.ai/setup/colab | bash

Updating fastai...
Done.


In [4]:
# install the latest versions of SoX and ffmpeg 
!apt-get -qq install sox ffmpeg


# Script to download all audio files from the 'Best Of' section of https://whoicf2.whoi.edu/science/B/whalesounds/index.cfm

### *Notice: only run this script if you have not run it before, or if a previous run was interrupted. It takes a long time.*
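
Because a full run is slow, one way to make an interrupted run cheaper to resume is to skip files that are already on disk. The sketch below is an assumption about how that could be done, not what the scraper currently does: `needs_download` is a hypothetical helper, and it only detects missing or empty files, so a partially written file from an interrupted transfer would still be treated as complete.

```python
import os
import tempfile


def needs_download(path: str) -> bool:
    """True when the target file is missing or empty."""
    return not os.path.exists(path) or os.path.getsize(path) == 0


# Small demonstration in a temporary directory (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "Example0.wav")
    print(needs_download(target))  # file does not exist yet
    with open(target, "wb") as f:
        f.write(b"RIFF")
    print(needs_download(target))  # file now exists and is non-empty
```

In the scraper, the call to `temp_br.retrieve(...)` could be wrapped in `if needs_download(destination):` to get this behaviour.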

In [5]:
# install the latest mechanize version
!pip -q install mechanize

[?25l[K     |███                             | 10kB 29.7MB/s eta 0:00:01[K     |██████                          | 20kB 36.1MB/s eta 0:00:01[K     |█████████                       | 30kB 42.7MB/s eta 0:00:01[K     |████████████                    | 40kB 47.8MB/s eta 0:00:01[K     |███████████████                 | 51kB 40.9MB/s eta 0:00:01[K     |██████████████████              | 61kB 43.7MB/s eta 0:00:01[K     |█████████████████████           | 71kB 32.3MB/s eta 0:00:01[K     |████████████████████████        | 81kB 33.3MB/s eta 0:00:01[K     |███████████████████████████     | 92kB 35.4MB/s eta 0:00:01[K     |██████████████████████████████  | 102kB 34.1MB/s eta 0:00:01[K     |████████████████████████████████| 112kB 34.1MB/s 
[?25h

In [0]:
#!/usr/bin/env python

# download all audio files from the 'Best Of' section on:
# https://whoicf2.whoi.edu/science/B/whalesounds/index.cfm


import mechanize
import re
from time import sleep
import os
import pathlib

URL = 'https://whoicf2.whoi.edu/science/B/whalesounds/index.cfm'
DELAY = 0.5 

class WatkinsScraper:
    def __init__(self, url=URL, delay=DELAY):
        # initialize the browser, url, delay, and the item and download-link lists
        self.br = mechanize.Browser()
        self.url = url
        self.delay = delay
        self.items = []
        self.dl_links = []


    def scrape(self):
        '''
        Get the list of items in the first dropdown menu, "Common name",
        and collect the .wav links on each item's page.
        Save the files to a per-species subdirectory of data_dir.
        '''
        self.get_items()

        for item in self.items:
            # Skip invalid/blank item selections
            if "https" in str(item):
                self.get_links(str(item))

                label = ' '.join([label.text for label in item.get_labels()])

                # remove everything except ASCII letters (spaces, digits, punctuation)
                label = re.sub('[^a-zA-Z]', '', label)


                # make the species directory if it does not already exist
                os.makedirs(os.path.join(data_dir, label), exist_ok=True)

                print("downloading %s..." % label)
                self.download_links(label)
                print("%s finished downloading!" % label)

        
    def get_items(self):
        '''
        Get the list of items in the first dropdown of the form.
        '''
        self.br.open(self.url)
        self.br.select_form('jump1')

        # get the items of the 'getSpeciesCommon' select control
        self.items = self.br.form.find_control('getSpeciesCommon').get_items()


    def get_links(self, parent_url):
        '''
        Locates the links on a given webpage.
        '''
        temp_br = mechanize.Browser()
        temp_br.open(str(parent_url))

        # filetypes holds the extensions of the files we want to download.
        filetypes=[".wav"]
        # iterate through links inside browser on the page.
        for link in temp_br.links():
            # check if this link has the file extension we want.
            for ft in filetypes:
                if ft in str(link):
                    self.dl_links.append(link)


    def download_links(self, label):
        '''
        Downloads every link stored in self.dl_links, then clears the list.
        '''

        temp_br = mechanize.Browser()
        file_num = 0

        for link in self.dl_links:
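            # swap the path of base_url for link.url to get an absolute URL
            # (assumes link.url is a site-absolute path beginning with '/')
            straight_url = re.sub('/science.*',link.url,link.base_url)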
            sleep(self.delay)
            temp_br.retrieve(straight_url, 
                             data_dir+label+"/"+label+str(file_num)+".wav")
            file_num+=1

        # empty the links.
        self.dl_links = []


if __name__ == '__main__':
    scraper = WatkinsScraper()
    scraper.scrape()
    print("all sound files have finished downloading!")
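
Two small steps in the scraper are easy to check in isolation: turning a dropdown label into a directory name, and turning a link into an absolute download URL. The sketch below mirrors the label regex used above and, as an alternative to the `re.sub` on `base_url`, uses the standard library's `urljoin`, which also handles relative links. The helper names and the example page and file paths are hypothetical.

```python
import re
from urllib.parse import urljoin


def clean_label(text: str) -> str:
    """Keep ASCII letters only, as the scraper does for directory names."""
    return re.sub(r'[^a-zA-Z]', '', text)


def absolute_url(base_url: str, href: str) -> str:
    """Resolve href against the page it was found on."""
    return urljoin(base_url, href)


# Hypothetical page and file paths, for illustration only.
page = "https://whoicf2.whoi.edu/science/B/whalesounds/page.cfm?x=1"

print(clean_label("Atlantic Spotted Dolphin"))
print(absolute_url(page, "/science/B/whalesounds/example.wav"))
print(absolute_url(page, "sounds/example.wav"))
```

Note that `clean_label` also drops digits and accented characters, so two labels that differ only in those characters would map to the same directory.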

# Python script that shortens all audio files to the length of the shortest file

In [0]:
#!/usr/bin/env python

# measure the duration of a single WAV file; this is the first step towards
# locating the shortest audio file and shortening all longer files to that length.

import wave
import contextlib

fname = '/tmp/test.wav'
with contextlib.closing(wave.open(fname,'r')) as f:
    frames = f.getnframes()
    rate = f.getframerate()
    duration = frames / float(rate)
    print(duration)

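
The cell above only measures one file. A possible way to finish the job with the standard library alone is sketched below. The helpers `wav_duration`, `shortest_duration` and `trim_wav` are hypothetical names, and the sketch assumes every file is an uncompressed PCM WAV that the `wave` module can read, which has not been verified for the Watkins recordings.

```python
import wave
from pathlib import Path


def wav_duration(path) -> float:
    """Duration of one WAV file in seconds."""
    with wave.open(str(path), 'rb') as f:
        return f.getnframes() / f.getframerate()


def shortest_duration(root) -> float:
    """Shortest duration among all .wav files under root.

    Raises ValueError if no .wav files are found.
    """
    return min(wav_duration(p) for p in Path(root).rglob('*.wav'))


def trim_wav(src, dst, seconds: float) -> None:
    """Write the first `seconds` of src to dst, keeping the audio format."""
    with wave.open(str(src), 'rb') as fin:
        params = fin.getparams()
        frames = fin.readframes(int(seconds * fin.getframerate()))
    with wave.open(str(dst), 'wb') as fout:
        fout.setparams(params)
        # The frame count in the header is corrected when the file is closed.
        fout.writeframes(frames)
```

A full pass would call `shortest_duration(data_dir)` once and then `trim_wav` on each file, writing to a separate output directory so the original downloads stay intact.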

In [0]:
# `!cd` runs in a throwaway subshell, so use the %cd magic to change directory for later cells
%cd {data_dir}

In [0]:
# convert one sound file to a spectrogram image via SoX
!sox AtlanticSpottedDolphin/AtlanticSpottedDolphin0.wav -n spectrogram -hmr -o print.png
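
To produce spectrograms for every downloaded file rather than one, the same SoX call can be driven from Python. This is a sketch under assumptions: `spectrogram_command` and `convert_tree` are hypothetical helpers, `sox` must be on the PATH, and the flags are the ones used above written separately (`-h`, `-m`, `-r`), which as far as I read the SoX manual select the high-colour palette, monochrome output, and a raw image without axes or legend.

```python
import subprocess
from pathlib import Path


def spectrogram_command(wav_path, png_path):
    """Build the SoX argument list for one file (same flags as the cell above)."""
    return ["sox", str(wav_path), "-n", "spectrogram",
            "-h", "-m", "-r", "-o", str(png_path)]


def convert_tree(root) -> int:
    """Render a .png next to every .wav under root; return the number converted."""
    count = 0
    for wav_path in sorted(Path(root).rglob("*.wav")):
        png_path = wav_path.with_suffix(".png")
        # check=True stops at the first file SoX cannot process.
        subprocess.run(spectrogram_command(wav_path, png_path), check=True)
        count += 1
    return count


print(spectrogram_command("AtlanticSpottedDolphin/AtlanticSpottedDolphin0.wav",
                          "print.png"))
```

Passing the arguments as a list avoids shell quoting entirely, so the spaces in the Drive path need no escaping here.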