# The Project #
1. This is a project with minimal scaffolding. Expect to use the discussion forums to gain insights! It’s not cheating to ask others for opinions or perspectives!
2. Be inquisitive, try out new things.
3. Use the previous modules for insights into how to complete the functions! You'll have to combine Pillow, OpenCV, and Pytesseract.
4. There are hints provided in Coursera; feel free to explore the hints if needed. Each hint provides progressively more detail on how to solve the issue. This project is intended to be comprehensive and difficult if you do it without the hints.

### The Assignment ###
Take a [ZIP file](https://en.wikipedia.org/wiki/Zip_(file_format)) of images and process them using a [library built into Python](https://docs.python.org/3/library/zipfile.html) that you need to learn how to use. A ZIP file takes several different files and compresses them into one single file, thus saving space. The files in the ZIP file we provide are newspaper images (like you saw in week 3). Your task is to write Python code which allows one to search through the images looking for the occurrences of keywords and faces. E.g. if you search for "pizza" it will return a contact sheet of all of the faces which were located on the newspaper pages which mention "pizza". This will test your ability to learn a new [library](https://docs.python.org/3/library/zipfile.html), your ability to use OpenCV to detect faces, your ability to use tesseract to do optical character recognition, and your ability to use PIL to composite images together into contact sheets.

Each page of the newspapers is saved as a single PNG image in a file called [images.zip](./readonly/images.zip). These newspapers are in English and contain a variety of stories, advertisements and images. Note: This file is fairly large (~200 MB) and may take some time to work with, so I would encourage you to use [small_img.zip](./readonly/small_img.zip) for testing.

Here's an example of the output expected. Using the [small_img.zip](./readonly/small_img.zip) file, if I search for the string "Christopher" I should see the following image:
![Christopher Search](./readonly/small_project.png)
If I were to use the [images.zip](./readonly/images.zip) file and search for "Mark" I should see the following image (note that there are times when there are no faces on a page, but a word is found!):
![Mark Search](./readonly/large_project.png)

Note: That big file can take some time to process - for me it took nearly ten minutes! Use the small one for testing.

In [2]:
!pip install pytesseract
!pip install opencv-python

Processing ./.cache/pip/wheels/ac/5b/f4/d5bcc930771126a32285e058c576eda84e43691453a9f7ad71/pytesseract-0.3.7-py2.py3-none-any.whl
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.7


In [3]:
import zipfile

from PIL import Image
import pytesseract
import cv2 as cv
import numpy as np

# loading the face detection classifier
face_cascade = cv.CascadeClassifier('readonly/haarcascade_frontalface_default.xml')

# the rest is up to you!

### Define individual functions

In [None]:
# Small images to test, large images to submit
SMALL_IMAGES_PATH = 'readonly/small_img.zip'
LARGE_IMAGES_PATH = 'readonly/images.zip'

def unzip_files(file_path):
    """
    file_path: the file path where the zipped images are stored (small_img.zip or images.zip)

    Returns a dictionary mapping each file name to its PIL image
    """
    images = {}

    # Loop through the archive and save each image into a dictionary
    with zipfile.ZipFile(file_path, 'r') as folder:
        for item in folder.infolist():
            # convert() loads the pixel data, so the image stays usable
            # after the archive member is closed
            with folder.open(item) as image_file:
                image = Image.open(image_file).convert('RGB')
            # Save these into our dictionary (small_img has 4 PNGs)
            images[item.filename] = image

    return images


def ocr_text(image_dict):
    # Empty dictionary to hold our results
    image_text = {}
    
    for key, value in image_dict.items():
        text = pytesseract.image_to_string(value.convert("L"))
        image_text[key] = text 
        
    return image_text

images = unzip_files(SMALL_IMAGES_PATH)
image_text = ocr_text(images)

# The large archive is slow to OCR, so it is loaded in its own cell
# under "Search for keywords in large images" below

"""
# Check we got all the images        
for key, value in images.items():
    print(key)
    print(type(value))
    display(value)
"""

In [None]:
# Put a PIL image in, and get back list of cropped faces
def get_faces(image):
    """
    image: PIL image

    Returns a list of cropped faces
    """
    # The Haar cascade works on grayscale pixel data
    gray = np.array(image.convert('L'))

    faces = face_cascade.detectMultiScale(gray, 1.3, 5)

    cropped_faces = []

    for x, y, w, h in faces:
        # Crop slightly beyond the detected box (10% wider and taller)
        cropped_image = image.crop((x, y, int(x + 1.1 * w), int(y + 1.1 * h)))
        cropped_faces.append(cropped_image)

    return cropped_faces

In [None]:
# Make contact sheet
def make_contact_sheet(list_of_images, num_of_images):
    """
    list_of_images: a list of images that are cropped faces, returned by get_faces()
    num_of_images: the number of images in list_of_images

    Displays a contact sheet with the results
    """
    # Rows of 5 images across, rounding up so a partial row still fits
    num_height = (num_of_images + 4) // 5

    image_width = 50
    image_height = 50
    contact_sheet = Image.new('RGB', (image_width * 5, image_height * num_height))

    x = 0
    y = 0

    for image in list_of_images:
        # thumbnail() shrinks the image in place and keeps its aspect ratio
        # (Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter)
        image.thumbnail((image_width, image_height), Image.LANCZOS)

        contact_sheet.paste(image, (x, y))

        # Move one cell right, wrapping to the next row at the sheet's edge
        if x + image_width == contact_sheet.width:
            x = 0
            y = y + image_height
        else:
            x = x + image_width
    display(contact_sheet)

In [None]:
# Search for the keyword in the image_text dictionary
# Either prints out the contact sheet if it finds faces, or a statement
def search_keyword(word, image_text, image_dict):
    """
    word: string, a keyword to be searched for in the text
    image_text: dictionary, what was returned from pytesseract
    image_dict: dictionary of PIL images with the same keys as image_text
    """

    for key, value in image_text.items():
        if word in value:
            print(f"Keyword: {word}")

            # This statement is printed for every match
            statement1 = f"Results found in file {key}"

            # Get cropped faces from the matching page
            cropped_faces = get_faces(image_dict[key])

            if cropped_faces:
                print(statement1)

                # Put cropped faces into contact sheet
                make_contact_sheet(cropped_faces, len(cropped_faces))
            else:
                # get_faces() returns an empty list when no face is detected
                statement2 = "\nBut there were no faces in that file!"
                print(statement1 + statement2)

### String all the functions together

In [None]:
# We pass in the dictionaries for image_dict and image_text_dict
# Can test on small images and the large images separately
def search_in_newspaper(word, image_dict, image_text_dict):
    """
    word: string, a keyword to be searched for in the text
    image_dict: dictionary, when we open the zipped folder and save the images
    image_text_dict: dictionary, what was returned from pytesseract
    """
    # search_keyword already loops over every page, so call it only once
    search_keyword(word, image_text_dict, image_dict)

### Search for keywords in small images

In [None]:
search_in_newspaper("Christopher", images, image_text)

In [None]:
search_in_newspaper("Mark", images, image_text)

### Search for keywords in large images

In [None]:
large_images = unzip_files(LARGE_IMAGES_PATH)
large_images_text = ocr_text(large_images)

In [None]:
search_in_newspaper("Christopher", large_images, large_images_text)

In [None]:
search_in_newspaper("Mark", large_images, large_images_text)

### Hint 1
To access the newspapers in the zipfile, you must first use the Zipfile library to open the zipfile then iterate through the objects (newspapers) in the zipfile using .infolist(). Try and write a simple routine to just go through the zipfile, printing out the name of the file as well as using display(). Remember that the PIL.Image library can .open() files, and that items in .infolist() in the zipfile each appear to Python just as if they were a file (these are called "file-like" objects). 
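
As a rough illustration of that routine, the sketch below builds a tiny in-memory ZIP (the member names and byte contents are made up, so it runs without the course files) and walks it with `.infolist()`. With real image data, the file-like object returned by `.open()` could be handed to `PIL.Image.open()` instead of being read as bytes.

```python
import io
import zipfile

# Build a tiny in-memory ZIP so the sketch runs without the course files.
# The member names and byte contents are invented for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("page-0.png", b"not really a PNG")
    archive.writestr("page-1.png", b"also not a PNG")

buffer.seek(0)
names = []
sizes = {}
with zipfile.ZipFile(buffer, "r") as archive:
    for info in archive.infolist():
        names.append(info.filename)
        # archive.open() returns a file-like object for this member
        with archive.open(info) as member:
            sizes[info.filename] = len(member.read())

print(names)
print(sizes)
```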


### Hint 2
You can spend a lot of time converting between PIL.Image files and byte arrays, but you don't have to. Why not just store the PIL.Image objects in a global data structure, maybe a list or a dictionary indexed by name? Then you can further process this data structure, by adding in information such as the text detected on the pages or the bounding boxes around faces. Come to think of it, a list of dictionary objects, where each entry in the list would have the PIL image, the bounding boxes, and the text discovered on the page, would be a handy way to store this data.
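
One possible shape for that list of dictionaries is sketched below. The key names (`name`, `image`, `text`, `bboxes`), the sample values, and the helper `pages_mentioning` are all hypothetical; in the notebook, `image` would hold a PIL image rather than `None`.

```python
# Hypothetical records: "image" would hold a PIL.Image in the notebook,
# "text" the pytesseract output, and "bboxes" the OpenCV detections.
pages = [
    {"name": "a-0.png", "image": None,
     "text": "Christopher spoke today", "bboxes": [(10, 20, 50, 50)]},
    {"name": "a-1.png", "image": None,
     "text": "Mark won the vote", "bboxes": []},
]

def pages_mentioning(word, pages):
    """Return the records whose OCR text contains word."""
    return [page for page in pages if word in page["text"]]

for page in pages_mentioning("Mark", pages):
    print(page["name"], len(page["bboxes"]), "face(s)")
```

Keeping everything about a page in one record means the search step only needs to look at `text`, and the contact-sheet step only needs `image` and `bboxes`.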


### Hint 3
A quick reminder - in Python all strings are just like lists of characters. Kind of (remember they are immutable lists - more like tuples!). But this means you can use the `in` keyword to find substrings really easily. So the following statement will return True if the substring is matched: `if "Christopher" in my_text`
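
A small sketch of that check follows; the sample sentence is made up. Note that `in` is case-sensitive, so lowering both sides is one way to match regardless of capitalisation, if that is what you want.

```python
my_text = "Mr. Christopher Smith spoke at the meeting."

found_exact = "Christopher" in my_text             # exact substring match
found_lower = "christopher" in my_text             # different case, no match
found_folded = "christopher" in my_text.lower()    # normalise case first

print(found_exact, found_lower, found_folded)
```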


### Hint 4
Creating the contact sheet can be a bit of a pain. But you can resize images without having to worry about the aspect ratio if you use the PIL.Image.thumbnail function. I used it when creating the output images, maybe you should too! And check out the lecture on the contact sheet; you want to be careful that you don't "walk off" the end of the images when creating a row (or column).
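
To avoid "walking off" the end of a row, it can help to work out the layout arithmetic on its own before touching any images. The sketch below only does that arithmetic; `grid_layout` is a hypothetical helper, and the 5-column, 50-pixel cell defaults are assumptions you can change.

```python
def grid_layout(count, columns=5, cell=(50, 50)):
    """Return (sheet_size, positions) for `count` thumbnails in a grid."""
    cell_w, cell_h = cell
    rows = -(-count // columns)  # ceiling division: a partial row still counts
    positions = [
        ((i % columns) * cell_w, (i // columns) * cell_h)
        for i in range(count)
    ]
    return (columns * cell_w, rows * cell_h), positions

sheet_size, positions = grid_layout(7)
print(sheet_size)
print(positions)
```

Each position could then be used as the upper-left corner when pasting a thumbnail into a sheet of `sheet_size`.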