# Image Searching
This notebook searches ImageNet for images based on a keyword.
It performs this task as follows:

1. The user inputs a keyword and the number of images to retrieve.
2. The keyword is looked up in WordNet (through NLTK) to obtain the synset ID.
3. A parent (hypernym) of the synset is chosen as the source of similar images.
4. Up to 5 hyponyms of that hypernym (siblings of the synset) are collected.
5. A random synset, intended to be unrelated to the keyword's synset, is picked.
6. A set number of images is retrieved from each synset:
    - About 10% of the images match the keyword exactly.
    - About 50% of the images come from sibling synsets of the keyword.
    - About 40% of the images are unrelated images from the random synset.

The percentages of exact, related, and unrelated images may change depending on what works best for the neural net.
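
The percentage split described above can be sketched as a small helper. This is only an illustration: the function name `split_counts` and the dictionary `PERCENTAGES` are hypothetical and do not appear elsewhere in the notebook. It uses integer percentages so that the arithmetic stays exact, and it rounds each share down.

```python
# Hypothetical helper: split a requested image total into per-category counts.
# The category names and percentages mirror the list above.
PERCENTAGES = {"exact": 10, "similar": 50, "unrelated": 40}

def split_counts(total, percentages=PERCENTAGES):
    """Return a dict of integer counts, each share rounded down."""
    return {name: total * pct // 100 for name, pct in percentages.items()}

print(split_counts(100))  # {'exact': 10, 'similar': 50, 'unrelated': 40}
print(split_counts(25))   # {'exact': 2, 'similar': 12, 'unrelated': 10}
```

Because every share is rounded down, the counts can add up to slightly less than the requested total (24 instead of 25 in the second call).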

## Imports and Initialization

In [22]:
import requests
import random
import urllib.request
from socket import gaierror
from IPython.display import Image

# Download wordnet corpus using nltk
from nltk import download
download("wordnet")

# Import the wordnet from nltk corpus
from nltk.corpus import wordnet as wn

API = {
    'allsynsets': "http://image-net.org/api/text/imagenet.synset.obtain_synset_list",
    'wordsfor': "http://image-net.org/api/text/wordnet.synset.getwords?wnid={}",
    'urlsfor': "http://image-net.org/api/text/imagenet.synset.geturls?wnid={}",
    'hyponymfor': "http://image-net.org/api/text/wordnet.structure.hyponym?wnid={}",
}

synsets = requests.get(API['allsynsets']).content.decode().splitlines()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mattj\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [23]:
def getSynsetId(synset):
    return "n{}".format(str(synset.offset()).zfill(8))
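
As a quick illustration of the ID format that `getSynsetId` produces, the sketch below builds and parses such IDs without needing NLTK. The names `to_wnid` and `from_wnid` are hypothetical helpers introduced only for this example; the format itself (the letter `n` followed by the WordNet offset zero-padded to 8 digits) is taken from the function above.

```python
def to_wnid(offset, pos="n"):
    """Build an ID such as 'n02131653' from a numeric WordNet offset."""
    return "{}{:08d}".format(pos, offset)

def from_wnid(wnid):
    """Split an ID such as 'n02131653' back into (pos, offset)."""
    return wnid[0], int(wnid[1:])

print(to_wnid(2131653))        # n02131653
print(from_wnid("n02131653"))  # ('n', 2131653)
```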

## 1. User Prompt

In [24]:
keyword = input("Keyword: ")
imgCount = int(input("Image Count: "))

Keyword: bear
Image Count: 100


## 2. Obtain Synset ID
Hyponym: a child of the synset (a more specific concept)  
Hypernym: a parent of the synset (a more general concept)
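
To make the two terms concrete, here is a minimal sketch on a hand-written toy taxonomy. The dictionary `HYPERNYM_OF` is a simplified, hypothetical fragment and not the real WordNet hierarchy; the helper names are illustrative as well. It shows how siblings fall out of the definitions: they are the hyponyms of a term's hypernym, minus the term itself.

```python
# Hypothetical toy taxonomy: each term maps to its parent (hypernym).
HYPERNYM_OF = {
    "bear": "carnivore",
    "canine": "carnivore",
    "feline": "carnivore",
    "carnivore": "mammal",
}

def hyponyms(term):
    """Children of `term`: every entry whose hypernym is `term`."""
    return sorted(child for child, parent in HYPERNYM_OF.items() if parent == term)

def siblings(term, limit=5):
    """Other children of the term's parent, capped at `limit`."""
    parent = HYPERNYM_OF.get(term)
    if parent is None:
        return []
    return [child for child in hyponyms(parent) if child != term][:limit]

print(hyponyms("carnivore"))  # ['bear', 'canine', 'feline']
print(siblings("bear"))       # ['canine', 'feline']
```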

In [25]:
# Use the first listed noun sense of the keyword; fail clearly if there is none
matches = wn.synsets(keyword, pos=wn.NOUN)
if not matches:
    raise ValueError("No noun synset found for '{}'".format(keyword))
synset = matches[0]
synsetId = getSynsetId(synset)
print("{} : {} : {}".format(keyword, synset, synsetId))

synInImagenet = synsetId in synsets
print("In imagenet? {}".format(synInImagenet))

bear : Synset('bear.n.01') : n02131653
In imagenet? True


## 3. Obtain Synset Parent

In [26]:
parent = random.choice(synset.hypernyms())
print(parent)

Synset('carnivore.n.01')


## 4. Obtain Siblings of Synset

In [27]:
# Keep up to 5 hyponyms of the parent, excluding the synset itself
siblings = [s for s in parent.hyponyms() if s != synset][:5]

for sibling in siblings:
    print(sibling)

Synset('canine.n.02')
Synset('feline.n.01')
Synset('fissiped_mammal.n.01')
Synset('musteline_mammal.n.01')
Synset('procyonid.n.01')


## 5. Obtain Random Synset

In [28]:
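from nltk.corpus.reader.wordnet import WordNetError

while True:
    randomSynsetId = random.choice(synsets)
    words = requests.get(API["wordsfor"].format(randomSynsetId)).content.decode().splitlines()
    randomSynsetName = random.choice(words)
    try:
        randomSynset = wn.synset("{}.n.01".format(randomSynsetName))
        break
    except (WordNetError, ValueError):
        # The lookup fails when the name is not a WordNet lemma as written,
        # e.g. multi-word names that contain spaces instead of underscores.
        print("{} is not a noun, try again.".format(randomSynsetName))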

print(randomSynset)

picture tube is not a noun, try again.
Carissa grandiflora is not a noun, try again.
deer hunt is not a noun, try again.
cassette player is not a noun, try again.
Chinese elm is not a noun, try again.
Synset('north_carolinian.n.01')
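
Every rejected name in the output above is a multi-word name containing a space, whereas WordNet lemma names in NLTK join words with underscores (as in `north_carolinian`). The sketch below shows one way such a name could be normalized before it is passed to `wn.synset`. The helper `to_synset_name` is hypothetical, and it still assumes that the first noun sense is the one wanted, which need not be the synset behind the randomly chosen ID.

```python
def to_synset_name(words, pos="n", sense=1):
    """Turn a display name like 'picture tube' into 'picture_tube.n.01'."""
    lemma = "_".join(words.strip().lower().split())
    return "{}.{}.{:02d}".format(lemma, pos, sense)

print(to_synset_name("picture tube"))  # picture_tube.n.01
print(to_synset_name("Chinese elm"))   # chinese_elm.n.01
```

NLTK also provides `wn.synset_from_pos_and_offset(pos, offset)`, which may be a more direct route here because it looks a synset up by its numeric offset rather than by name.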


## 6. Display Obtained Synsets

In [29]:
print("Synset:")
print("-------")
print("{} Id('{}')\n".format(synset, synsetId))

print("Siblings:")
print("-------")
for sibling in siblings:
    print("{} Id('{}')\n".format(sibling, getSynsetId(sibling)))

print("Random:")
print("-------")
print("{} Id('{}')\n".format(randomSynset, randomSynsetId))

Synset:
-------
Synset('bear.n.01') Id('n02131653')

Siblings:
-------
Synset('canine.n.02') Id('n02083346')

Synset('feline.n.01') Id('n02120997')

Synset('fissiped_mammal.n.01') Id('n02082190')

Synset('musteline_mammal.n.01') Id('n02441326')

Synset('procyonid.n.01') Id('n02507649')

Random:
-------
Synset('north_carolinian.n.01') Id('n09744834')



## 7. Retrieve Percentage of Images
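
The core idea of this step is to walk a list of candidate URLs, skip the ones that fail, and stop once enough images are saved. The following is a minimal offline sketch of that loop, not the notebook's actual download code: `collect`, `fake_fetch`, and the example URLs are all hypothetical, and the fetcher is passed in as a parameter so the logic can be tried without network access.

```python
def collect(urls, count, fetch):
    """Try each URL at most once until `count` downloads succeed.

    `fetch` is any callable taking (url, index) that raises on failure.
    Returns the list of URLs that were fetched successfully.
    """
    saved = []
    for url in urls:
        if len(saved) >= count:
            break
        try:
            fetch(url, len(saved))
        except Exception:
            continue  # dead link: move on to the next candidate
        saved.append(url)
    return saved

# Stand-in fetcher for demonstration: any URL containing "dead" fails.
def fake_fetch(url, index):
    if "dead" in url:
        raise OSError("unreachable: " + url)

urls = ["http://a.example/1.jpg", "http://dead.example/2.jpg", "http://b.example/3.jpg"]
print(collect(urls, 2, fake_fetch))
# ['http://a.example/1.jpg', 'http://b.example/3.jpg']
```

Iterating over the URL list, rather than looping until the count is reached, means the loop ends on its own when the candidates run out. In the notebook, `fetch` could be a thin wrapper around `urllib.request.urlretrieve`.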

In [None]:
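import os

# Split the requested total: 10% exact, 50% similar, 40% unrelated
exact = int(imgCount * 0.1)
similar = int(imgCount * 0.5)
unrelated = int(imgCount * 0.4)

def getImages(count, imageType, passedSynset, startIndex=0):
    """Save up to `count` images of a synset, numbered from `startIndex`."""
    print("count:{} imageType:{} passedSynset:{}".format(count, imageType, passedSynset))

    urls = requests.get(API['urlsfor'].format(passedSynset)).content.decode().splitlines()

    # Make sure the target folder exists before downloading into it
    folder = os.path.join(".", "CollectedImages", imageType)
    os.makedirs(folder, exist_ok=True)

    retrieved = 0
    # Try each URL once; stop when enough images are saved or the URLs run out
    for url in urls:
        if retrieved >= count:
            break
        file = os.path.join(folder, "img{}.jpg".format(startIndex + retrieved))
        print(file)
        try:
            urllib.request.urlretrieve(url, file)
        except Exception:
            # Dead link or unreachable host: skip to the next URL
            print("File not found.")
            continue
        retrieved += 1
        print(file)

# Get exact images
getImages(exact, "Exact", synsetId)

# Get similar images, split evenly across the siblings; startIndex keeps
# each sibling's files from overwriting those of the previous sibling
perSibling = similar // len(siblings) if siblings else 0
for index, sibling in enumerate(siblings):
    getImages(perSibling, "Similar", getSynsetId(sibling), startIndex=index * perSibling)

# Get unrelated images
getImages(unrelated, "Unrelated", randomSynsetId)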

count:10 imageType:Exact passedSynset:n02131653
.\CollectedImages\Exact\img0.jpg
.\CollectedImages\Exact\img0.jpg
.\CollectedImages\Exact\img1.jpg
.\CollectedImages\Exact\img1.jpg
.\CollectedImages\Exact\img2.jpg
File not found.
.\CollectedImages\Exact\img2.jpg
File not found.
.\CollectedImages\Exact\img2.jpg
.\CollectedImages\Exact\img2.jpg
.\CollectedImages\Exact\img3.jpg
.\CollectedImages\Exact\img3.jpg
.\CollectedImages\Exact\img4.jpg
.\CollectedImages\Exact\img4.jpg
.\CollectedImages\Exact\img5.jpg
.\CollectedImages\Exact\img5.jpg
.\CollectedImages\Exact\img6.jpg
.\CollectedImages\Exact\img6.jpg
.\CollectedImages\Exact\img7.jpg
.\CollectedImages\Exact\img7.jpg
.\CollectedImages\Exact\img8.jpg
.\CollectedImages\Exact\img8.jpg
.\CollectedImages\Exact\img9.jpg
File not found.
.\CollectedImages\Exact\img9.jpg
.\CollectedImages\Exact\img9.jpg
count:10 imageType:Similar passedSynset:n02083346
.\CollectedImages\Similar\img0.jpg
File not found.
.\CollectedImages\Similar\img0.jpg
.\Collect