# Assignment 1 - Threading and Multiprocessing

In this project, we will explore multithreading an multiprocessing difference. For that purpose, we have an imaginary colleage whose name is John, who asks for your help to increase the speed of his process while downloading images from internet.

John already has the code for serial-programming, however, he don't know concurrent programming and parallel programming! Help John to succeed in his mission by using multithreading and multiprocessing logic to increase the speed of his task.
He has two tasks:

Download images from internet


Resize them to 128x128 px.

# Imports

In [26]:
import os
import json
import requests
import threading
from PIL import Image

# Client ID Specification and Required Functions

In [27]:
client_id = "b1fe5ed600a5829"

In [28]:
def create_download_dir():
    """
    creates a download directory for images.
    """
    dir_images = os.path.join('images')

    if not os.path.exists(dir_images):
        os.mkdir(dir_images)

    return dir_images

In [29]:
def download_image_from_url(url, directory):
    """
    download image and save into given directory.
    """
    response = requests.get(url, stream=True)
    if response.status_code == 200:
        filename = os.path.basename(url)
        filepath = os.path.join(directory, f'{filename}')
        with open(filepath, 'wb') as f:
            f.write(response.content)

In [30]:
def build_link_list(client_id, num_of_images):
    """
    builds a list of image links.
    """
    i = 1
    cnt = 0
    url_list = []
    url_list_len = []

    try:
        while(cnt < num_of_images):
            # get request
            response = requests.get(
                f'https://api.imgur.com/3/gallery/random/random/{i}', 
                headers={'Authorization': f'Client-ID {client_id}'},
                stream=True
            )
            
            # control
            if response.status_code == 200:
                data_list = json.loads(response.content)['data']
                url_list.extend([
                    i['link']
                    for i in data_list 
                    if 'type' in i 
                    and i['type'] in ('image/png', 'image/jpeg')
                    and i['link'] not in url_list
                ])

                cnt = len(url_list)
                url_list_len.append(cnt)
                i += 1
                
                # control if api doesn't return anything new
                if set(url_list_len[-10:]) == 1:
                    break
            
            elif response.status_code == 429:
                print('too many requests, enough, or you can choose to put time.sleep() in here...') 
                break

            else:
                break

    except:
        print('api limit reached!')
        
    
    return url_list

In [31]:
def create_thumbnail(size, path):
    """
    create resized version of the image path given, with the same name 
    extended with _thumbnail.
    """
    try:
        # create thumbnail
        image = Image.open(path)
        image.thumbnail(size)

        # create path for thumbnail
        dir_images = os.path.join(path)
        filename, extension = os.path.splitext(path)
        new_filename = os.path.join('{}{}{}'.format(filename, '_thumbnail', extension))

        # save thumbnail
        image.convert('RGB').save(new_filename)
    except:
        'image error'

# Global Variables

In [32]:
NUM_OF_IMAGES = 1000 # max requests can be done per day is 12500

IMAGES_DIR = create_download_dir()

# Serial code of John

In this section, we will download some images from internet. As network related tasks are considered as IO bound, it can be fasten by multithreading the downloading task. Our john already did serial way of downloading, it is your turn to do multithreading.

In [33]:
%%time

image_links = build_link_list(client_id, NUM_OF_IMAGES)

for image_link in image_links:
    download_image_from_url(image_link, IMAGES_DIR)

too many requests, enough, or you can choose to put time.sleep() in here...
CPU times: user 430 ms, sys: 55.7 ms, total: 486 ms
Wall time: 2.04 s


# MultiThreading John's Task

In [35]:
%%time

if __name__ == "__main__":
    image_links = build_link_list(client_id, NUM_OF_IMAGES)
    t1 = threading.Thread(target = download_image_from_url,args =(image_link, IMAGES_DIR) )
    t2 = threading.Thread(target = download_image_from_url,args =(image_link, IMAGES_DIR) )
    t3 = threading.Thread(target = download_image_from_url,args =(image_link, IMAGES_DIR) )
    t4 = threading.Thread(target = download_image_from_url,args =(image_link, IMAGES_DIR) )  
    t1.start()
    t2.start()
    t3.start()
    t4.start()
    t1.join()
    t2.join()
    t3.join()
    t3.join()

too many requests, enough, or you can choose to put time.sleep() in here...
Done
CPU times: user 103 ms, sys: 12 ms, total: 115 ms
Wall time: 287 ms


# Serial Code for John - Resizing the image

In [36]:
%%time

image_path_list = os.listdir('images')
for image_path in image_path_list:
    create_thumbnail((128, 128), os.path.join('images', image_path))

CPU times: user 30.6 s, sys: 5.26 s, total: 35.9 s
Wall time: 40.9 s


# Multi - Processing John's Code

In this part, we have to resize the images downloaded into another size, in this example case, it will be 128x128px. As CPU bound operations are generally considered as multiprocessing tasks, resizing suits exactly for this purpose!

In [39]:
import multiprocessing
from multiprocessing import Pool
from itertools import product

In [37]:
%%time

p = multiprocessing.Pool(processes=5)
image_path_list = os.listdir('images')
for image_path in image_path_list:
    p.starmap(create_thumbnail, product((128, 128), os.path.join('images', image_path)))

CPU times: user 8.71 s, sys: 1.65 s, total: 10.4 s
Wall time: 12.8 s


# Conclusion

Create a table to show differences between all four approaches and the time it took for those tasks. Table can be anything, as long as you show the differences, as in below.

In [40]:
from prettytable import PrettyTable

x = PrettyTable()

x.field_names = ["Description", "Time Taken"]

x.add_row(["Downloaded pics", "2:04 s"])
x.add_row(["MultiThreading", "287 ms"])
x.add_row(["Resize pics", "40.9 s"])
x.add_row(["MultiProcessing", "12.8 s"])

print(x)

+-----------------+------------+
|   Description   | Time Taken |
+-----------------+------------+
| Downloaded pics |   2:04 s   |
|  MultiThreading |   287 ms   |
|   Resize pics   |   40.9 s   |
| MultiProcessing |   12.8 s   |
+-----------------+------------+
