# Distant Viewing with Deep Learning: Part 5 (optional)

This notebook is only needed if you want to process your own corpus of images. Here,
you will need to install **keras** and **tensorflow** to run the
neural networks from scratch.

In [None]:
import numpy as np
import scipy as sp
import pandas as pd

import os
from os.path import join

from keras.applications.vgg19 import VGG19
from keras.preprocessing import image
from keras.applications.vgg19 import preprocess_input, decode_predictions
from keras.models import Model

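Load the Keras models. The first time you run this cell, the pretrained ImageNet weights
will be downloaded from the internet. The second model, `vgg_fc2`, stops at the `fc2`
layer, so it outputs a 4096-dimensional embedding instead of class probabilities.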

In [None]:
base_model = VGG19(weights='imagenet')
vgg_fc2 = Model(inputs=base_model.input,
                outputs=base_model.get_layer('fc2').output)

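Here is the function to compute the embeddings from scratch. It reads the corpus metadata
from `meta/<name>.csv`, loads each image at 224 by 224 pixels, and returns the `fc2`
embeddings together with the top 20 predicted ImageNet categories for each image.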

In [None]:
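def compute_embedding(cn):
    """Return the fc2 embeddings and top-20 ImageNet categories for corpus `cn`."""
    corpus = pd.read_csv(join("meta", cn + ".csv"))
    # One 224x224 RGB slot per image; the whole corpus is held in memory
    corpus_img = np.zeros((corpus.shape[0], 224, 224, 3))
    for index, row in corpus.iterrows():
        img_path = join('images', cn, row['filename'])
        # Resize on load so every image matches the VGG19 input size
        img = image.load_img(img_path, target_size=(224, 224))
        x = image.img_to_array(img)
        corpus_img[index, :, :, :] = x
        if (index % 50) == 0:
            print("Done with {0:03d}".format(index))

    # Apply the VGG19-specific pixel preprocessing expected by the weights
    corpus_img = preprocess_input(corpus_img)
    # Embeddings from the fc2 layer
    corpus_fc2 = vgg_fc2.predict(corpus_img, verbose=True)
    # Class probabilities from the full model, decoded to the top 20 labels
    corpus_base = base_model.predict(corpus_img, verbose=True)
    cats = decode_predictions(corpus_base, top=20)

    return corpus_fc2, cats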
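
Note that `compute_embedding` holds the whole corpus in one dense array before predicting. As a rough sketch of what that costs, the hypothetical helper below (`estimate_corpus_bytes` is not part of the notebook) multiplies out the array shape, assuming NumPy's default `float64` dtype at 8 bytes per value:

```python
def estimate_corpus_bytes(n_images, height=224, width=224, channels=3, bytes_per_value=8):
    """Approximate size in bytes of the (n_images, height, width, channels) array.

    Hypothetical helper for illustration; bytes_per_value=8 assumes float64,
    which is what np.zeros produces by default.
    """
    return n_images * height * width * channels * bytes_per_value


per_image = estimate_corpus_bytes(1)
thousand = estimate_corpus_bytes(1000)
print("One image: {0} bytes".format(per_image))
print("1000 images: {0:.2f} GiB".format(thousand / 2**30))
```

If the estimate is larger than your available memory, one option is to split your images into several smaller corpora and process them one at a time.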

Now, you will need to create a directory inside of the `images` directory
with a name describing your corpus (no spaces please!) and place your images in it.
Next, we will create a CSV file to describe the corpus. The code below works on the
corpus called "test"; to run it on your data, change the first line to your
dataset name.

In [None]:
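cn = "test"
corpus_dir = join("images", cn)
# Collect every file under images/<cn>, including subdirectories
impaths = [join(dp, f) for dp, dn, fn in os.walk(corpus_dir) for f in fn]
# Keep only image files; compare extensions case-insensitively
impaths = [x for x in impaths if os.path.splitext(x)[1].lower() in ['.jpg', '.png', '.jpeg']]
# Store paths relative to the corpus directory (works for any corpus name)
impaths = [os.path.relpath(x, corpus_dir) for x in impaths]
im_meta = pd.DataFrame({'filename': impaths, 'title': impaths})
im_meta.to_csv(join("meta", cn + ".csv"), index=False)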
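
Before computing embeddings, it can be worth confirming that every row of the metadata file points at a real image. The sketch below uses only the standard library; `find_missing_files` is a hypothetical helper, and it assumes the layout used in this notebook (`meta/<name>.csv` with a `filename` column, images under `images/<name>`):

```python
import csv
import os
from os.path import join


def find_missing_files(cn, meta_dir="meta", image_dir="images"):
    """Return the filenames listed in <meta_dir>/<cn>.csv that are not on disk.

    Hypothetical helper for illustration; not part of the notebook.
    """
    missing = []
    with open(join(meta_dir, cn + ".csv"), newline="") as f:
        for row in csv.DictReader(f):
            if not os.path.isfile(join(image_dir, cn, row["filename"])):
                missing.append(row["filename"])
    return missing


# Only run the check if the metadata file for the "test" corpus exists
if os.path.exists(join("meta", "test.csv")):
    print(find_missing_files("test"))
```

An empty list means every file named in the CSV was found.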

Then, you can run the `compute_embedding` function on the dataset and save the results.

In [None]:
X, cats = compute_embedding(cn)
np.save(join("data", cn + "_vgg19_fc2"), X)
np.save(join("data", cn + "_vgg19_categories"), cats)
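
`np.save` appends the `.npy` extension when the path does not already end with it, so the embeddings above end up in `data/test_vgg19_fc2.npy`. As a small illustration, the sketch below saves and reloads a stand-in array in a temporary directory; the zeros array and the `demo` name are placeholders, not real embeddings:

```python
import os
import tempfile
from os.path import join

import numpy as np

# Stand-in for real embeddings: 3 images, 4096 fc2 values each
X_demo = np.zeros((3, 4096), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    # No extension given; np.save adds ".npy"
    np.save(join(tmp, "demo_vgg19_fc2"), X_demo)
    saved_files = os.listdir(tmp)
    # Loading requires the full file name, including the extension
    X_loaded = np.load(join(tmp, "demo_vgg19_fc2.npy"))

print(saved_files)
print(X_loaded.shape)
```

`np.save` does not create missing directories, so the `data` directory needs to exist before the cell above is run.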

You should now be able to run the Part-3 notebook on your own corpus. If you run into
any difficulties, please let us know!