# Download Images - Medical Devices

<!-- <hr> -->

## Table of Contents: <a class="anchor" id="table-of-content"></a>
* [1. Background and Motivation](#problem-background)
* [2. Import Libraries](#import-package)
* [3. Custom Functions](#custom-functions)
* [4. Download Images](#download-images)

## 1. Background and Motivation <a class="anchor" id="problem-background"></a>

### Problem
> Download images of medical devices. The downloaded images will serve as training data for a YOLO model that detects medical devices in YouTube videos.
>

### Steps
> - Use Selenium and Beautiful Soup for web scraping.
> - Selenium launches the web browser and scrolls the page so that the page's JavaScript loads more images.
> - Beautiful Soup extracts the HTML content from the loaded page.
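
The extraction step can be tried without a browser. The snippet below is a minimal illustration, not the notebook's code: it uses Python's built-in `html.parser` as a stand-in for Beautiful Soup so that it runs without extra packages, and the `ImgCollector` class, the `extract_img_tags` helper, and the sample HTML are all hypothetical. In the notebook, the HTML string would be Selenium's `driver.page_source` after scrolling.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the attributes of every <img> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; store it as a dictionary
        if tag == "img":
            self.images.append(dict(attrs))


def extract_img_tags(page_source):
    """Return one attribute dictionary per <img> tag found in page_source."""
    collector = ImgCollector()
    collector.feed(page_source)
    return collector.images


# Hypothetical page source standing in for a scrolled search results page.
sample_html = """
<html><body>
  <img src="https://example.com/a.jpg" alt="first">
  <p>text</p>
  <img data-src="https://example.com/b.jpg">
</body></html>
"""

tags = extract_img_tags(sample_html)
print(len(tags), "image tags found")
```

Each result is a plain dictionary of attributes, which is enough to see which source attribute (`src`, `data-src`, and so on) a given tag carries.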

* [Go to Top](#table-of-content)

## 2. Import Libraries <a class="anchor" id="import-package"></a>
<br>

### Libraries<br>

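>**OS:**<br>
Used to create a new folder in which to save the image files  <br>

>**Selenium and Beautiful Soup:**<br>
Used for web scraping  <br>

>**requests:**<br>
Used to download each image from its URL  <br>

>**PIL and io:**<br>
Used to read each downloaded image's dimensions so that thumbnails can be excluded from the image data set <br>

>**time:**<br>
Used to add a delay so that the web page has time to load <br>

>**pandas and logging:**<br>
Used to read the search queries from a CSV file and to log progress <br>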

* [Go to Top](#table-of-content)

In [19]:
!pip install selenium

Defaulting to user installation because normal site-packages is not writeable


In [20]:
from bs4 import BeautifulSoup
import requests
import os
from selenium.webdriver.common.by import By

from selenium import webdriver
import time

from PIL import Image
from io import BytesIO

import pandas as pd

import logging
# logging.basicConfig(filename="../Data/download_images_log.log", level=logging.INFO)
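
With the `logging.basicConfig` call commented out, the root logger stays at its default `WARNING` level, so by default the `logging.info` messages used later in the notebook are not recorded. The following is a minimal sketch of writing those messages to a file. It assumes a hypothetical log location in a temporary directory rather than the notebook's `../Data` folder, and a dedicated logger named `download_images_example`.

```python
import logging
import os
import tempfile

# Hypothetical log location; the notebook's commented-out call points at ../Data instead.
log_path = os.path.join(tempfile.mkdtemp(), "download_images_log.log")

# Use a dedicated logger so the example does not depend on the root logger's state.
logger = logging.getLogger("download_images_example")
logger.setLevel(logging.INFO)

handler = logging.FileHandler(log_path)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

# INFO messages are now written to the file.
logger.info("New folder created")
handler.flush()

with open(log_path) as f:
    log_text = f.read()

print(log_text)
```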

## 3. Custom Functions <a class="anchor" id="custom-functions"></a>

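Glossary of user-defined functions:
1. **create_folder** - checks whether the folder exists and creates it if it does not.
2. **get_image_url** - opens the search results in a browser, scrolls to load more results, and returns the `<img>` tags found on the page.
3. **download_images** - downloads the images referenced by those tags and saves them to the folder.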
* [Go to Top](#table-of-content)

In [21]:
def create_folder(folder_name):
    """Create folder_name if it does not exist yet."""
    # example: folder_name = r'..\Data\Test'
    isExist = os.path.exists(folder_name)
    
    if not isExist:
        os.makedirs(folder_name)
        print('New folder created')
        logging.info('New folder created')
    else:
        print('Folder is already present')
        logging.info('Folder is already present')
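
The same check can be written more compactly with `os.makedirs(..., exist_ok=True)`. The sketch below is only an illustration: `ensure_folder` is a hypothetical helper name, and it works in a temporary directory so that it can be run anywhere. It returns whether the folder had to be created.

```python
import os
import tempfile


def ensure_folder(folder_name):
    """Create folder_name (and any missing parents); return True if it was new."""
    already_there = os.path.isdir(folder_name)
    # exist_ok=True makes the call a no-op when the folder is already present
    os.makedirs(folder_name, exist_ok=True)
    return not already_there


# Hypothetical target inside a throwaway directory.
base = tempfile.mkdtemp()
target = os.path.join(base, "glucose_pen", "images")

first_call = ensure_folder(target)   # folder is created here
second_call = ensure_folder(target)  # folder already exists
print(first_call, second_call)
```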

In [22]:
def get_image_url(url):
    """Open url in Chrome, scroll until no new results load, and return all <img> tags."""
    SCROLL_PAUSE_TIME = 5   # seconds to wait after each scroll
    load_more_images = 1    # how many times to click the 'mye4qd' (show more) button
    show_more_ctr = 0

    # open a web browser
    driver = webdriver.Chrome()

    try:
        # open the url and expand to fullscreen
        driver.get(url)
        driver.fullscreen_window()

        # wait for page to load
        time.sleep(5)

        # Get scroll height
        last_height = driver.execute_script("return document.body.scrollHeight")

        while True:
            # Scroll down to bottom
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

            # Wait to load page
            time.sleep(SCROLL_PAUSE_TIME)

            # Calculate new scroll height and compare with last scroll height
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                # the page stopped growing: click the button once, then stop
                show_more_ctr += 1
                if show_more_ctr > load_more_images:
                    break
                # find_elements returns an empty list instead of raising when nothing matches
                buttons = driver.find_elements(By.CLASS_NAME, 'mye4qd')
                if not buttons:
                    break
                try:
                    buttons[0].click()
                except Exception:
                    # the button is hidden or not clickable
                    break
            last_height = new_height

        # extract source code of page after scrolling
        page_source = driver.page_source
    finally:
        # close the browser even if scrolling failed
        driver.quit()

    # parse the page source
    soup = BeautifulSoup(page_source, 'html.parser')

    # get all the html image tags
    images = soup.find_all('img')

    return images
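
The stopping rule of the scroll loop can be exercised without a browser. In the sketch below, `FakeDriver` is a hypothetical stand-in that mimics only the `execute_script` call used above, and `scroll_until_stable` is an illustrative name. The button click is left out so that only the height comparison is shown, and a `max_rounds` cap is added as an assumption to guard against pages that never stop growing.

```python
import time


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver: the page grows a few times, then stops."""

    def __init__(self, heights):
        self._heights = list(heights)
        self._index = 0
        self.scrolls = 0

    def execute_script(self, script):
        if script.startswith("return"):
            # report the current page height
            return self._heights[self._index]
        # any other script is treated as a scroll; content loads until the list runs out
        self.scrolls += 1
        if self._index < len(self._heights) - 1:
            self._index += 1
        return None


def scroll_until_stable(driver, pause=0.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops changing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    return last_height


driver = FakeDriver([1000, 2500, 4000])
final_height = scroll_until_stable(driver)
print(final_height, driver.scrolls)
```

The loop always performs one more scroll than the number of times the page grew, because the final scroll is the one that confirms the height is unchanged.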



In [4]:
# def get_image_url(url):
# #     url = 'https://www.google.com/search?q=glucose+pen&tbm=isch&chips=q:glucose+pen,g_1:diabetes:jn6D-rZ6h10%3D&rlz=1C1RXQR_enUS1017US1017&hl=en&sa=X&ved=2ahUKEwiGuarJoIL-AhWPPkQIHQPEB_wQ4lYoAHoECAEQLA&biw=1381&bih=684'
# #     r = requests.get(url)
# #     soup= BeautifulSoup(r.text, 'html.parser')
    
#     # open a web browser
#     driver = webdriver.Chrome()

#     # open the url and expand to fullscreen
#     driver.get(url)
#     driver.fullscreen_window()
    
#     # wait for page to load
#     time.sleep(5)

#     # number of times page should be scrolled
#     page_scroll=10
    
#     # scrolling the page enables the javascript of the webpage and helps to download more images
#     for i in range(page_scroll):
#         #page scroll
#         driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
#         # wait for page to load
#         time.sleep(5)
    
#     # extract source code of page after scrolling
#     page_source=driver.page_source
    
#     # close the browser
# #     driver.close()
    
#     # get html part of page source
#     soup= BeautifulSoup(page_source, 'html.parser')
    
# #     r = requests.get(url)
# #     soup= BeautifulSoup(r.text, 'html.parser')

#     # get all the html image tags
#     images = soup.findAll('img')
    
#     return images



In [23]:
def download_images(folder_name,url):
    """Download the images found at url into folder_name, skipping thumbnails."""
    count=0
    
    #get the list of <img> tags
    images = get_image_url(url)
    
    #remove duplicate tags
    images=list(set(images))
    
    #number of images related to the search query
    total_images = len(images)

    print(f'Total {total_images} Images Found!!')
    logging.info(f'Total {total_images} Images Found!!')

    if total_images == 0:
        print('No images found')
        logging.error('No images found')
        return

    create_folder(folder_name)

    # only the first 100 tags are downloaded
    for i, image in enumerate(images[0:100]):
        # look for the image URL in "data-srcset", "data-src",
        # "data-fallback-src" and "src", in that order
        image_link = None
        for attribute in ('data-srcset', 'data-src', 'data-fallback-src', 'src'):
            image_link = image.get(attribute)
            if image_link:
                break

        # no source URL found: skip this tag instead of reusing the previous link
        if not image_link:
            continue

        try:
            r = requests.get(image_link, timeout=10).content

            # capture size to remove thumbnails and very small images
            im = Image.open(BytesIO(r))
            width, height = im.size
        except Exception:
            # the link could not be fetched or is not a valid image
            continue

        if width>50 and height >50:
            with open(f'{folder_name}/images_{i+1}.jpg','wb') as f:
                f.write(r)
            count += 1

    if count == total_images:
        print("All Images Downloaded!")
        logging.info("All Images Downloaded!")

    # if not all images were downloaded
    else:
        print(f"Total {count} Images Downloaded Out of {total_images}")
        logging.info(f"Total {count} Images Downloaded Out of {total_images}")
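
The attribute lookup in `download_images` amounts to a priority order over four attributes. A minimal sketch of that lookup on plain dictionaries follows; `pick_image_link` and the sample data are hypothetical, and the same `.get()` call is available on a Beautiful Soup tag. The sketch also de-duplicates the resolved links rather than the tags, which is an alternative to the notebook's approach, not what the notebook does.

```python
# Attributes checked for an image URL, in priority order.
SOURCE_ATTRIBUTES = ("data-srcset", "data-src", "data-fallback-src", "src")


def pick_image_link(tag):
    """Return the first non-empty source attribute, or None if there is none."""
    for attr in SOURCE_ATTRIBUTES:
        link = tag.get(attr)
        if link:
            return link
    return None


# Hypothetical tags, represented as attribute dictionaries.
tags = [
    {"src": "https://example.com/a.jpg", "data-src": "https://example.com/a_full.jpg"},
    {"src": "https://example.com/b.jpg"},
    {"alt": "no source at all"},
    {"src": "https://example.com/b.jpg"},
]

# Resolve each tag to a link, drop tags without one, and de-duplicate while keeping order.
links = list(dict.fromkeys(link for link in map(pick_image_link, tags) if link))
print(links)
```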

## 4. Download Images <a class="anchor" id="download-images"></a>


* [Go to Top](#table-of-content)

In [6]:
# # enter search query
# searchTerm = "Glucose Pen"

# # enter folder name to download images
# searchTerm_Label='glucose_pen'
# folder_name='..\Data\\' + searchTerm_Label

# # replace space with %20 
# searchTerm = searchTerm.replace(' ','%20')

# url = "https://www.google.co.in/search?q="+searchTerm+"&source=lnms&tbm=isch"
# # url = 'https://www.google.com/search?q=glucose+pen&tbm=isch&chips=q:glucose+pen,g_1:diabetes:jn6D-rZ6h10%3D&rlz=1C1RXQR_enUS1017US1017&hl=en&sa=X&ved=2ahUKEwiGuarJoIL-AhWPPkQIHQPEB_wQ4lYoAHoECAEQLA&biw=1381&bih=684'
# download_images(folder_name,url)

In [28]:
# extract all the search queries
df_search_query=pd.read_csv('SearchQuery.csv')
df_search_query.head()

df_search_query['Object'] = df_search_query['Object'].str.lower()
df_search_query['Search Terms'] = df_search_query['Search Terms'].str.lower()

# extract list of all the medical devices
search_object=df_search_query['Object'].unique()

# loop through all the medical devices
for medical_object in search_object:
    print('Medical Object:',medical_object)
    logging.info('Medical Object: ' + medical_object)
    
    search_terms = df_search_query[df_search_query['Object']==medical_object]['Search Terms'].unique()
    # loop through all the search queries related to this medical device
    for i, searchTerm in enumerate(search_terms[0:100]):
        # search query
        print(i+1,' : ',searchTerm)
        logging.info(str(i+1)+' : '+searchTerm)
    
        # folder in which to save the images: ../Data/<object>/<search term>
        folder_name = os.path.join('..', 'Data', medical_object, searchTerm)

        # replace space with %20 
        searchTerm = searchTerm.replace(' ','%20')

        url = "https://www.google.co.in/search?q="+searchTerm+"&source=lnms&tbm=isch"
        download_images(folder_name,url)
            
    print('*'*30)
    logging.info('-----------------------------------')
    
    

Medical Object: donuts
1  :  donuts
Total 1322 Images Found!!
New folder created
Total 53 Images Downloaded Out of 1322
******************************
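
Replacing spaces with `%20` covers simple queries, but a search term containing characters such as `&` would also need escaping. The sketch below is one hedged way to handle that with the standard library's `urllib.parse.urlencode`, which encodes spaces as `+` and escapes reserved characters. `build_search_url` and `build_folder_name` are hypothetical helper names; the base URL and query parameters are the ones used in the loop above.

```python
import os
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(search_term):
    """Build the image-search URL used by the loop, with the query safely encoded."""
    params = {"q": search_term, "source": "lnms", "tbm": "isch"}
    return "https://www.google.co.in/search?" + urlencode(params)


def build_folder_name(medical_object, search_term, root=os.path.join("..", "Data")):
    """Join root, object, and search term with the platform's path separator."""
    return os.path.join(root, medical_object, search_term)


# Hypothetical search term containing a reserved character.
url = build_search_url("glucose pen & strips")
query = parse_qs(urlsplit(url).query)
folder = build_folder_name("glucose pen", "glucose pen & strips")

print(url)
print(folder)
```

Decoding the URL again with `parse_qs` is a quick way to confirm that the search term survives the round trip unchanged.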
