<a href="https://colab.research.google.com/github/sedwardsmarsh/Marine-Mammal-Classifier/blob/master/Marine_Mammal_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Marine Mammal Classifier**


*   source for audio data: Watkins Marine Mammal Sound Database, Woods Hole Oceanographic Institution: https://whoicf2.whoi.edu/science/B/whalesounds/index.cfm
*   thanks to Todd Hayton for the python tutorial *Scraping by Example - Iterating through Select Items With Mechanize*: http://toddhayton.com/2015/01/09/scraping-by-example-ntu-edu/
*   this answer from stack exchange was used as well, thank you: https://stackoverflow.com/questions/5974595/download-all-the-linksrelated-documents-on-a-webpage-using-python






Before running anything, you need to tell Colab that you are interested in using a GPU. You can do this by clicking on the ‘Runtime’ tab and selecting ‘Change runtime type’. A pop-up window will open up with a drop-down menu. Select ‘GPU’ from the menu and click ‘Save’.

# ***make these images a lot smaller***

![Click the 'Runtime' tab above and select 'Change runtime type'](https://course.fast.ai/images/colab/03.png)

![A pop-up window will open up with a drop-down menu. Select ‘GPU’ from the menu and click ‘Save’.](https://course.fast.ai/images/colab/04.png)

# Setup the environment

In [0]:
# connect to google drive
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My\ Drive/"
data_dir = root_dir + "Colab\ Notebooks/watkins_data/"

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


In [0]:
# create google drive directory to hold watkins marine mammal data
!mkdir {data_dir}

mkdir: cannot create directory ‘/content/gdrive/My Drive/Colab Notebooks/watkins_data/’: File exists


In [0]:
# fetch the latest fast.ai version 
!curl -s https://course.fast.ai/setup/colab | bash

Updating fastai...
Done.


In [0]:
# install the latest SoX version
!apt-get install -qq sox

Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  libmagic-mgc libmagic1 libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa
  libsox-fmt-base libsox3
Suggested packages:
  file libsox-fmt-all
The following NEW packages will be installed:
  libmagic-mgc libmagic1 libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa
  libsox-fmt-base libsox3 sox
0 upgraded, 8 newly installed, 0 to remove and 25 not upgraded.
Need to get 760 kB of archives.
After this operation, 6,715 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libopencore-amrnb0 amd64 0.1.3-2.1 [92.0 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libopencore-amrwb0 amd64 0.1.3-2.1 [45.8 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libmagic-mgc amd64 1:5.32-2ubuntu0.3 [184 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libmagic1

In [1]:
# install the latest mechanize version
!pip install mechanize

Collecting mechanize
[?25l  Downloading https://files.pythonhosted.org/packages/13/08/77368b47ba2f9e0c03f33902ed2c8e0fa83d15d81dcf7fe102b40778d810/mechanize-0.4.5-py2.py3-none-any.whl (109kB)
[K     |███                             | 10kB 30.0MB/s eta 0:00:01[K     |██████                          | 20kB 3.0MB/s eta 0:00:01[K     |█████████                       | 30kB 4.2MB/s eta 0:00:01[K     |████████████                    | 40kB 2.9MB/s eta 0:00:01[K     |███████████████                 | 51kB 3.3MB/s eta 0:00:01[K     |██████████████████              | 61kB 3.9MB/s eta 0:00:01[K     |█████████████████████           | 71kB 4.2MB/s eta 0:00:01[K     |████████████████████████        | 81kB 4.6MB/s eta 0:00:01[K     |███████████████████████████     | 92kB 5.1MB/s eta 0:00:01[K     |██████████████████████████████  | 102kB 4.9MB/s eta 0:00:01[K     |████████████████████████████████| 112kB 4.9MB/s 
Installing collected packages: mechanize
Successfully installed me

In [4]:
# change working directory to watkins_data folder to save data
%cd /content/gdrive/'My Drive'/'Colab Notebooks'/watkins_data
%pwd

[Errno 2] No such file or directory: '/content/gdrive/My Drive/Colab Notebooks/watkins_data'
/content


'/content'

In [7]:
#!/usr/bin/env python

# import sys
import signal
import mechanize 
import wave
from time import sleep

URL = 'https://whoicf2.whoi.edu/science/B/whalesounds/index.cfm'
DELAY = 5

def sigint(signal, frame):
  sys.stderr.write('Exiting...\n')
  sys.exit(0)    

class WatkinsScraper:
    def __init__(self, url=URL, delay=DELAY):
        # initilize browser, url, delay and items array
        self.br = mechanize.Browser()
        self.url = url
        self.delay = delay
        self.items = []
        self.dl_links = []


    def scrape(self):
        '''
        Get the list of items in the first dropdown menu, "Common name", 
        submit the form for each item. 
        Using the response, save the files to this script's 
        directory.
        '''
        items = self.get_items()

        for item in items:
            # Skip invalid/blank item selections
            if "https" in str(item):
                # print(item)
                response = self.follow_link(str(item))
                self.save_item_results(item, response)


    # working
    def get_items(self):
        '''
        Get the list of items in the first dropdown of the form
        '''
        self.br.open(self.url)
        self.br.select_form('jump1')

        # get items from submit tag 
        items = self.br.form.find_control('getSpeciesCommon').get_items()
        # print(items[1])
        return items


    # def submit_form(self, item):
    #     '''
    #     Submit form using selection item.name and download the audio files
    #     to data_dir
    #     '''
    #     max_tries = 3
    #     num_tries = 0

    #     while num_tries < max_tries:
    #         # loop through each item name from submit tag.
    #         try:
    #             # this isn't submitting the correct form
    #             self.br.open(self.url)
    #             self.br.select_form('jump1')
    #             self.br.form['getSpeciesCommon'] = [ item.name ]
    #             self.br.submit()
    #             break
    #         # unless encountering an error.
    #         except (mechanize.HTTPError, mechanize.URLError) as e:
    #             if isinstance(e,mechanize.HTTPError):
    #                 print(e.code)
    #             else:
    #                 print(e.reason.args)

    #         num_tries += 1
    #         time.sleep(num_tries * self.delay)

    #     if num_tries == max_tries:
    #         raise

    #     # return page response from server.
    #     return self.br.response().read()


    # def get_links(self):
    #     '''
    #     Locates the links on a given webpage
    #     '''
    #     # filetypes holds the extensions of the files we want to download.
    #     filetypes=[".wav"]
    #     # iterate through links inside browser on the page.
    #     for link in self.br.links():
    #         # check if this link has the file extension we want.
    #         for ft in filetypes:
    #             if ft in str(link): 
    #                 self.dl_links.append(link)


    # def download_link(self, link, label):
    #     # with open("%s.wav" % label, 'w') as f:
    #     #     f.write(results)
    #     #     f.close()

    #     # possibly replace with wave.open
    #     f = open(str(link),"w")
    #     # possibly replace with br.follow_link(link)
    #     br.click_link(link)
    #     f.write(br.response().read())
    #     print ("%s has been downloaded" % str(link))

        
    # def save_item_results(self, item):
    #     label = ' '.join([label.text for label in item.get_labels()])
    #     label = '-'.join(label.split())
        
    #     for link in self.dl_links:
    #         # throttle so you dont hammer the site
    #         sleep(self.delay) 
    #         self.download_link(link, label)
    #         print("all %s files have been downloaded" % label)
    #         # clear stored links to prep for next iteration
    #         self.dl_links = []


if __name__ == '__main__':
    signal.signal(signal.SIGINT, sigint)
    scraper = WatkinsScraper()
    scraper.scrape()
    # some_items = scraper.get_items()
    # token = scraper.save_item_results(item=some_items[1])
    # for x in zip(some_items): 
    #     print(x)

AttributeError: ignored