<h1> Webcam Image Recognizer </h1>

<h3> Import all required libraries and define the model URL and directory </h3>
Notably, we use:

- OpenCV (cv2) to pull the video feed from the webcam.

- TensorFlow (and NumPy) for image classification.

- gTTS to access the Google Text-to-Speech API.

- Pygame to play the audio file generated by gTTS.

In [13]:
import os
import re
import sys
import tarfile
import time
from threading import Thread

import cv2
import numpy as np
import pygame
import tensorflow as tf
from gtts import gTTS
from six.moves import urllib

model_url = 'http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz'
model_dir = '/tmp/'

<h3> Threaded Video Processing Class </h3>
The read() method of cv2.VideoCapture is a blocking operation, so the main thread stalls until the frame has been read from the webcam and returned. In a real-time system, this slows down overall processing.

This class starts a separate thread that keeps pulling new frames from the webcam while the main thread processes the most recent one.

In [21]:
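class ThreadedVideoProcessing:
    def __init__(self):
        # Open the default webcam and read one frame so a frame is always available
        self.stream = cv2.VideoCapture(0)
        (self.grabbed, self.frame) = self.stream.read()
        self.stopped = False

    def begin(self):
        # Read frames in a background daemon thread so the main thread never blocks on the webcam
        reader = Thread(target=self.update, args=())
        reader.daemon = True
        reader.start()
        return self

    def update(self):
        # Keep overwriting self.frame with the newest frame until end() is called
        while not self.stopped:
            (self.grabbed, self.frame) = self.stream.read()
        # Release the webcam from the reader thread once the loop has exited
        self.stream.release()

    def get_curr_frame(self):
        return self.frame

    def end(self):
        self.stopped = True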
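
The same "latest value" pattern can be tried without a webcam. The sketch below is only an illustration: LatestValueReader and make_slow_counter are hypothetical names that do not appear in the notebook, and the slow counter stands in for a blocking call such as the webcam read.

```python
import threading
import time


class LatestValueReader:
    """Keep only the newest result of a blocking read function.

    read_fn is a hypothetical stand-in for a blocking source such as
    the webcam read used in the class above.
    """

    def __init__(self, read_fn):
        self._read_fn = read_fn
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._latest = read_fn()  # prime with one value before the thread starts
        self._thread = threading.Thread(target=self._update)
        self._thread.daemon = True

    def begin(self):
        self._thread.start()
        return self

    def _update(self):
        # Background loop: block on the source, then publish the result.
        while not self._stop.is_set():
            value = self._read_fn()
            with self._lock:
                self._latest = value

    def get_latest(self):
        # Never waits on the source; returns whatever was read most recently.
        with self._lock:
            return self._latest

    def end(self):
        self._stop.set()
        self._thread.join()


def make_slow_counter(delay=0.01):
    """Build a fake blocking source that returns 1, 2, 3, ... after a short sleep."""
    state = {"n": 0}

    def read():
        time.sleep(delay)
        state["n"] += 1
        return state["n"]

    return read


reader = LatestValueReader(make_slow_counter()).begin()
time.sleep(0.1)
first = reader.get_latest()
time.sleep(0.1)
second = reader.get_latest()
reader.end()
```

The consumer only ever sees the newest value, so any values produced while it was busy are silently dropped. That is the intended trade-off for a live video feed.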

<h3>Image Recognition</h3><h4>Part 1</h4>
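The Labelize class maps a node ID returned by the model to a human-readable class name. It builds this mapping from two files shipped with the model: one maps node IDs to synset IDs, and the other maps synset IDs to names.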

In [16]:
class Labelize(object):
    def __init__(self, class_path=None, node_path=None):
        if not class_path:
            class_path = os.path.join(model_dir, 'imagenet_2012_challenge_label_map_proto.pbtxt')
        if not node_path:
            node_path = os.path.join(model_dir, 'imagenet_synset_to_human_label_map.txt')
        self.name_dict = self.update_dict(class_path, node_path)

    def update_dict(self, class_path, node_path):
        if not tf.gfile.Exists(class_path):
            tf.logging.fatal('Label path does not exist: ' + str(class_path))
        if not tf.gfile.Exists(node_path):
            tf.logging.fatal('Node path does not exist: ' + str(node_path))

        class_dict = dict()
        for line in tf.gfile.GFile(class_path).readlines():
            if line.startswith('  target_class:'):
                current_class = int(line.split(': ')[1])
            if line.startswith('  target_class_string:'):
                class_dict[current_class] = line.split(': ')[1][1:-2]

        string_dict = dict()
        for line in tf.gfile.GFile(node_path).readlines():
            parsed_items = re.compile(r'[n\d]*[ \S,]*').findall(line)
            string_dict[parsed_items[0]] = parsed_items[2]

        name_dict = dict()
        for node, string in class_dict.items():
            if string not in string_dict:
                tf.logging.fatal('Label does not exist: ' + str(string))
            name = string_dict[string]
            name_dict[node] = name

        return name_dict

    def get_label(self, node):
        if node not in self.name_dict:
            return ''
        return self.name_dict[node]
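
To make the two-step lookup concrete, here is a minimal sketch that runs on in-memory samples instead of the real files. The sample IDs and names are placeholders, the tab-separated layout of the synset file is an assumption, and build_name_dict is a hypothetical helper rather than part of the class above.

```python
import re

# Placeholder samples in the assumed file layouts (not real ImageNet entries).
SAMPLE_PBTXT = '''entry {
  target_class: 1
  target_class_string: "n00000001"
}
entry {
  target_class: 2
  target_class_string: "n00000002"
}
'''
SAMPLE_SYNSETS = "n00000001\tfirst label, alias\nn00000002\tsecond label\n"


def build_name_dict(pbtxt_text, synset_text):
    """Return {node_id: human_readable_name} from the two label maps."""
    # Step 1: node ID -> synset ID, e.g. 1 -> "n00000001".
    pairs = re.findall(
        r'target_class:\s*(\d+)\s*target_class_string:\s*"([^"]+)"',
        pbtxt_text)
    node_to_synset = {int(node): synset for node, synset in pairs}

    # Step 2: synset ID -> name (assumes one tab between ID and name).
    synset_to_name = {}
    for line in synset_text.splitlines():
        if line.strip():
            synset, name = line.split("\t", 1)
            synset_to_name[synset] = name.strip()

    # Step 3: chain the two maps, skipping synsets without a name.
    return {node: synset_to_name[synset]
            for node, synset in node_to_synset.items()
            if synset in synset_to_name}


name_dict = build_name_dict(SAMPLE_PBTXT, SAMPLE_SYNSETS)
print(name_dict.get(1, ''))
```

Using .get with an empty-string default mirrors get_label, which returns '' for an unknown node.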

<h3>Image Recognition</h3><h4>Part 2</h4>
First, download and extract the tar file containing Google's Inception v3 CNN model. Next, create a graph from the downloaded model.

In [22]:
# Download and extract model (if not already in model_dir)
if not os.path.exists(model_dir):
    os.makedirs(model_dir)
filename = model_url.split('/')[-1]
filepath = os.path.join(model_dir, filename)
if not os.path.exists(filepath):
    
    def downloadbar(count, block_size, total_size):
        sys.stdout.write('\r>> Downloading %s %.1f%%' %
            (filename, float(count * block_size) / float(total_size) * 100.0))
        sys.stdout.flush()
    
    filepath, _ = urllib.request.urlretrieve(model_url, filepath, downloadbar)
    statinfo = os.stat(filepath)
    print('Downloaded', filename, statinfo.st_size, 'bytes.')
tarfile.open(filepath, 'r:gz').extractall(model_dir)

# Create graph to feed into TF
with tf.gfile.FastGFile(os.path.join(model_dir, 'classify_image_graph_def.pb'), 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    g_in = tf.import_graph_def(graph_def, name='')
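
The cell above skips the download when the archive is already present, but it extracts the archive on every run. One possible refinement, sketched here on a throwaway archive so nothing is downloaded, is to skip extraction when a known file already exists. extract_if_missing and the marker file name are hypothetical, not part of the notebook.

```python
import os
import tarfile
import tempfile


def extract_if_missing(archive_path, dest_dir, marker_name):
    """Extract archive_path into dest_dir unless marker_name is already there.

    Returns True if extraction ran and False if it was skipped.
    """
    if os.path.exists(os.path.join(dest_dir, marker_name)):
        return False
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return True


# Build a tiny .tgz that stands in for the model archive.
work_dir = tempfile.mkdtemp()
payload = os.path.join(work_dir, "payload.txt")
with open(payload, "w") as f:
    f.write("stand-in for the model graph file")
archive = os.path.join(work_dir, "demo.tgz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(payload, arcname="payload.txt")

dest = os.path.join(work_dir, "out")
os.makedirs(dest)
first_run = extract_if_missing(archive, dest, "payload.txt")   # extracts
second_run = extract_if_missing(archive, dest, "payload.txt")  # skipped
```

In the notebook, the marker would plausibly be classify_image_graph_def.pb inside model_dir, since that is the file the graph is loaded from.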

<h3> The Main Loop </h3>

The main loop starts the video feed using the threaded video processing class and opens a TensorFlow session. Every fifth frame is saved to disk and fed into the model; the output of the softmax tensor is passed to the Labelize class to obtain a label for the top prediction. Once more than 50 frames have passed since the last announcement and no audio is playing, the loop looks in the current directory for an audio file matching the label. If none exists, the Google Text-to-Speech API is used (through gTTS) to generate one, which is saved in the current directory and then played with Pygame. The label and prediction score are drawn on the video feed for the first 40 frames after each announcement, and the fps is shown once more than 20 frames have been processed.
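
Selecting the top prediction from the softmax output is plain NumPy indexing. The sketch below uses made-up scores and shows that the argsort idiom used in the loop gives the same index as np.argmax, and that the same idiom extends to a top-k list.

```python
import numpy as np

# Made-up softmax output for three classes, shaped (1, 3) like a batch of one.
predictions = np.array([[0.05, 0.70, 0.25]])
predictions = np.squeeze(predictions)           # shape (3,)

# Idiom used in the loop: sort ascending, keep the last index.
top_via_argsort = predictions.argsort()[-1:][::-1][0]

# Equivalent and shorter when only the single best class is needed.
top_via_argmax = int(np.argmax(predictions))
score = float(predictions[top_via_argmax])

# The argsort form generalizes to the k best classes, best first.
top_k = predictions.argsort()[-2:][::-1]
```

The argsort form is only worth keeping if more than one candidate label is needed.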

Pressing 'q' while the video window is focused quits the application and closes all windows.

In [24]:
frames = 0
score = 0
fps = 0
start_time = time.time()
pygame.mixer.init()
frames_since_pred = 0
class_label = ""
spoken_label = ""
audio_filename = ""
print("\n\nHit 'Q' to quit.\n\n")
vid = ThreadedVideoProcessing().begin()
# Build the label lookup once instead of re-parsing both label files on every prediction
node_lookup = Labelize()

with tf.Session() as sess:
    softmax_tensor = sess.graph.get_tensor_by_name('softmax:0')
    while True:
        frame = vid.get_curr_frame()
        frames += 1
        if (frames % 5 == 0):
            # Classify every fifth frame only
            cv2.imwrite("current_frame.jpg", frame)
            with tf.gfile.FastGFile("./current_frame.jpg", 'rb') as f:
                image_data = f.read()
            predictions = sess.run(softmax_tensor, {'DecodeJpeg/contents:0': image_data})
            predictions = np.squeeze(predictions)

            top_node = predictions.argsort()[-1:][::-1][0]
            class_label = node_lookup.get_label(top_node)
            score = predictions[top_node]
            if (class_label == "iPod"):
                class_label = "iPhone"
            if (score > .4):
                labels = class_label.split()
                class_label = " ".join(labels)
                # Join the words of the label (not the characters of its first word)
                audio_filename = "-".join(labels)
                # Remember the label that belongs to this audio file
                spoken_label = class_label

            current_time = time.time()
            fps = frames / (current_time - start_time)

        # Announce only once a confident prediction exists, so audio_filename is never empty here
        if ((frames_since_pred > 50) and audio_filename and not pygame.mixer.music.get_busy()):
            audio_description = audio_filename + ".mp3"
            if not os.path.isfile(audio_description):
                gTTS(text="I see a " + spoken_label, lang='en').save(audio_description)
            frames_since_pred = 0
            pygame.mixer.music.load(audio_description)
            pygame.mixer.music.play()
        if ((frames_since_pred < 40) and (frames > 10)):
            cv2.putText(frame, class_label, (20, 400), cv2.FONT_HERSHEY_DUPLEX, 1, (255, 255, 255))
            cv2.putText(frame, str(np.round(score * 100, 2)) + "%", (20, 440), cv2.FONT_HERSHEY_DUPLEX, 1, (255, 255, 255))
        if (frames > 20):
            cv2.putText(frame, "fps: " + str(np.round(fps, 2)), (460, 460), cv2.FONT_HERSHEY_DUPLEX, 1, (255, 255, 255))
        cv2.imshow("Frame", frame)
        frames_since_pred += 1
        if (cv2.waitKey(1) & 0xFF == ord("q")):
            break

# The session is closed by the with block; stop the reader thread and close the window
vid.end()
cv2.destroyAllWindows()



Hit 'Q' to quit.
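
The audio caching step of the main loop can be looked at in isolation. The sketch below is a hedged illustration: audio_path_for, get_or_create_audio and fake_synthesize are hypothetical names, and the fake synthesizer only writes an empty file so that the example runs without network access or gTTS.

```python
import os
import re
import tempfile


def audio_path_for(label, cache_dir):
    """Turn a label such as "tabby, tabby cat" into a safe .mp3 path."""
    slug = "-".join(re.findall(r"[A-Za-z0-9]+", label.lower()))
    return os.path.join(cache_dir, slug + ".mp3")


def get_or_create_audio(label, cache_dir, synthesize):
    """Return the cached audio path, calling synthesize(text, path) only on a miss."""
    path = audio_path_for(label, cache_dir)
    if not os.path.isfile(path):
        synthesize("I see a " + label, path)
    return path


calls = []


def fake_synthesize(text, path):
    # Stand-in for a real text-to-speech call: record the text, write an empty file.
    calls.append(text)
    with open(path, "wb") as f:
        f.write(b"")


cache_dir = tempfile.mkdtemp()
p1 = get_or_create_audio("tabby, tabby cat", cache_dir, fake_synthesize)
p2 = get_or_create_audio("tabby, tabby cat", cache_dir, fake_synthesize)
```

With gTTS, the synthesize argument could wrap the same call the notebook already makes, for example a function that runs gTTS(text=text, lang='en').save(path). Deriving the file name and the spoken text from the same label keeps the cached file consistent with what is announced.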


