## Instructions

## Data Analytics at Scale Summative Assessment

You have been contracted as a consultant for Find Images Now (FIN), a tech start-up that wants to match and cluster images at scale. FIN has identified a promising technique but needs help evaluating the performance of the approach. Your evaluation should assess both the accuracy and computational costs (e.g., CPU, memory, runtime demands and how these scale with the number of images as input). 

In particular, FIN has identified “image hashing” approaches, specifically the ImageHash library for Python (https://github.com/JohannesBuchner/imagehash) and its own in-house hashing approach, called FINd, for which a pure Python implementation is available at https://github.com/oxfordinternetinstitute/das2020.
Candidates must first seek to optimize FINd by identifying what portions of the algorithm are computational bottlenecks, implementing alternatives, and comparing computational performance. Candidates must plan and implement two or three optimizations. These optimizations may include: 
* use of scientific Python libraries such as NumPy, SciPy, and pandas, 
* different execution approaches (e.g., single process on one CPU vs multiprocessing on a single computer vs distributed approaches), 
* use of GPUs, 
* use of C-compiled code (e.g., Cython), 
* etc. 

When implementing the optimizations, candidates should ensure the correctness of their output by writing an appropriate unit test. They should also profile the code to analyse CPU, memory, runtime, and other relevant aspects of computational performance.  
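
A correctness test of the kind described above can be sketched as follows. This is a minimal illustration only: `luma_reference` and `luma_vectorised` are hypothetical stand-ins for an original and an optimized function (they are not taken from the FINd code), and the weights are the standard luma coefficients that also appear later in this notebook.

```python
import unittest

import numpy as np

# Standard luma weights for R, G and B.
LUMA_COEFFS = np.array([0.299, 0.587, 0.114])


def luma_reference(rgb):
    """Slow per-pixel conversion, standing in for the unoptimized code."""
    rows, cols, _ = rgb.shape
    out = [0.0] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            r, g, b = (int(v) for v in rgb[i, j])
            out[i * cols + j] = 0.299 * r + 0.587 * g + 0.114 * b
    return out


def luma_vectorised(rgb):
    """Vectorised conversion, standing in for the optimized code."""
    return (rgb.astype(np.float64) @ LUMA_COEFFS).ravel().tolist()


class LumaEquivalenceTest(unittest.TestCase):
    def test_matches_reference(self):
        # A small random image with a fixed seed keeps the test fast and repeatable.
        rng = np.random.default_rng(42)
        rgb = rng.integers(0, 256, size=(7, 5, 3), dtype=np.uint8)
        np.testing.assert_allclose(
            luma_vectorised(rgb), luma_reference(rgb), rtol=1e-12
        )
```

Inside a notebook the test can be run with `unittest.main(argv=[""], exit=False)`. A tolerance is used rather than exact equality because the two versions may add the three terms in a different order.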

After this, candidates must compare the performance (both in terms of accuracy and computational costs) of two methods from the imagehash library to their optimized version of FINd using the dataset provided in class. That is, compare FINd with any two of the following from the imagehash library: 
* average hashing (aHash) 
* perceptual hashing (pHash) 
* difference hashing (dHash) 
* wavelet hashing (wHash) 

FIN expects a written report, not to exceed 3,500 words, consisting of two parts. Part 1 will report on FINd, including an initial assessment of the performance of the code, the optimizations attempted, and the resulting changes (positive or negative) in computational performance. Candidates should discuss any relevant trade-offs (e.g., CPU vs. memory) of these optimizations. Part 2 should focus on how FINd compares to the two other image hashing methods selected. This part should analyse both the accuracy of the results and the computational costs. The report should devote the most depth to the trade-offs of the different approaches (e.g., the advantages and disadvantages of each approach and how the approaches compare to one another). Finally, candidates should discuss which approach is the 'best' for the given dataset and FIN’s need to match similar images at scale. 

### Further details: 
Projects will be examined based on the following criteria and approximate weights: 
* Part 1 
    * 15 points - Clear plan for and analysis of the provided FINd algorithm  
    * 5 points - Executable code analysing the provided FINd algorithm 
    * 10 points - Clear rationale/justification for the planned optimizations for FINd along with a clear description of two or three computational optimizations to FINd that the author will implement and compare (and the relative ‘difficulty’ of these approaches [see below]) 
    * 10 points - Executable code that implements the described optimizations 
    * 15 points - Comparison of each optimization (i.e., what trade-offs are involved) 
* Part 2 
    * 15 points - Comparison of the accuracy and performance of FINd against two other image hashing approaches 
    * 10 points - Executable code that implements the comparisons between FINd and two other image hashing approaches 
    * 20 points - Discussion as to which approach is most suited for the specific data and task. 

The best projects will typically contain at least some of the following: 
* Evidence that the candidate has gone beyond the materials presented in class and the lab sessions 
* Include multiple approaches that are clearly distinct. For example, comparing the performance of three different Python libraries is considerably easier than implementing single-process, multi-process, and distributed approaches. 
* Have clear code with sufficient comments/documentation for it to be easily understood  
* Be well-written and demonstrate originality 

### FAQs 
* Can I base my project purely on the code distributed in class? - Yes, but try to use it in an original way. A project that simply duplicates a class exercise is unlikely to get a good mark, though it would likely not fail if it were well executed. 
* Can I use code found on the Internet? Can I use library X? - Yes. Understanding another person’s code / library is a fundamental part of programming, but it is important to understand the code including its computational and memory implications. Following best practices, please cite any sources from where you obtained code and be clear about what modifications you made.  
* Can one of my approaches be a standalone software package (e.g., NodeXL, QDA Miner, etc.)? - Not for this class: please choose approaches that require programming. 
* Can one of my approaches be in a language other than Python? - No. For the purpose of this assignment, please only compare Python approaches. 
* One of my computational approaches is unlikely to finish in the time available. Help! - It is perfectly fine (and in fact good practice) to profile your code on a subset of the data if the data set is very large. There is no requirement that you run all your approaches on the full dataset. 
* Can I work together with another student? - No. The report submitted and code developed must be entirely your own work. Collaboration with another student in the course is not allowed. 
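
Profiling on a subset, as the FAQ above suggests, can be sketched with the standard library alone. The helper below is a hypothetical illustration (its name and parameters are not part of the assignment): it times a function over random subsets of increasing size and records the peak memory seen by `tracemalloc`, which gives a rough picture of how cost scales with the number of images.

```python
import random
import time
import tracemalloc


def profile_on_subsets(func, items, sizes, seed=42):
    """Run func over random subsets of growing size.

    Returns a list of (subset_size, seconds, peak_bytes) tuples.
    """
    rng = random.Random(seed)  # fixed seed so the same subsets are drawn each run
    rows = []
    for n in sizes:
        subset = rng.sample(items, min(n, len(items)))
        tracemalloc.start()
        start = time.perf_counter()
        for item in subset:
            func(item)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        rows.append((len(subset), elapsed, peak))
    return rows
```

In this notebook a call might look like `profile_on_subsets(findHasher.fromFile, image_paths, [10, 50, 100])`, where `image_paths` is an assumed list of file paths. Tracing memory slows execution, so timings taken this way are best compared with each other rather than with `%timeit` results.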

In [6]:
from FINd import FINDHasher
from FINd_optim import FINDHasher as FINDHasherImproved
import imagehash
from PIL import Image

In [7]:
# Additional import statements
import os
import random
random.seed(42)
import pickle
import numpy as np
import bottleneck as bn
import math
from scipy import misc
from numba import jit, njit

In [3]:
findHasher=FINDHasher()
findHasherImproved = FINDHasherImproved()

In [4]:
ex1=findHasher.fromFile("das_images/0040_10318987.jpg")
ex2=findHasher.fromFile("das_images/0040_10321701.jpg")

ex3=findHasher.fromFile("das_images/0120_27768345.jpg")

img=Image.open("das_images/0120_28335973.jpg")
ex4=findHasher.fromImage(img)

In [5]:
%load_ext line_profiler
%lprun -f findHasherImproved.boxFilter findHasherImproved.fromFile("das_images/0040_10318987.jpg")


Timer unit: 1e-06 s

Total time: 0.036702 s
File: /Users/willemzents/Documents/SDS/Code/Michaelmas/DAS/das-summative/FINd_optim.py
Function: boxFilter at line 165

Line #      Hits         Time  Per Hit   % Time  Line Contents
   165                                               @classmethod
   166                                               def boxFilter(cls, input, output, rows, cols, rowWin, colWin):
   167         1          9.0      9.0      0.0          halfColWin = int((colWin + 2) / 2)  # 7->4, 8->5
   168         1          1.0      1.0      0.0          halfRowWin = int((rowWin + 2) / 2)
   169         1       2580.0   2580.0      7.0          x = np.array(input).reshape((rows, cols))
   170         1          1.0      1.0      0.0          temp = [x]
   171         1          1.0      1.0      0.0          matrices = []
   172         3          6.0      2.0      0.0          for i in range(1, halfColWin):
   173         2         88.0     44.0      0.2              temp.a

In [35]:
# %load_ext line_profiler
# %lprun -f findHasherImproved.fromImage findHasherImproved.fromFile("das_images/0040_10318987.jpg")
img=Image.open("das_images/0120_28335973.jpg")

findHasherImproved.fromImage(img)

array([0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0,
       1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1])
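
The 0/1 array printed above can be compared with another hash of the same length by counting differing bits (the Hamming distance). The sketch below assumes hashes are available as NumPy arrays of equal length; the function names and the default threshold are illustrative choices, not values taken from FINd.

```python
import numpy as np


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which two equal-length hashes differ."""
    a = np.asarray(hash_a).ravel()
    b = np.asarray(hash_b).ravel()
    if a.shape != b.shape:
        raise ValueError("hashes must have the same number of bits")
    return int(np.count_nonzero(a != b))


def is_match(hash_a, hash_b, max_distance=32):
    """Treat two images as similar if their hashes differ in few bits.

    max_distance is an illustrative threshold and would need tuning on real data.
    """
    return hamming_distance(hash_a, hash_b) <= max_distance
```

For the imagehash library no such helper is needed, since subtracting one `ImageHash` object from another already returns the Hamming distance.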

In [14]:
ex1=findHasherImproved.fromFile("das_images/0040_10318987.jpg")
ex2=findHasher.fromFile("das_images/0040_10318987.jpg")

ex3=findHasher.fromFile("das_images/0120_27768345.jpg")

img=Image.open("das_images/0120_28335973.jpg")
ex4=findHasher.fromImage(img)

In [21]:
%lprun -f findHasherImproved.boxFilter findHasherImproved.fromFile("das_images/0040_10318987.jpg")

Timer unit: 1e-06 s

Total time: 0.041068 s
File: /Users/willemzents/Documents/SDS/Code/Michaelmas/DAS/das-summative/FINd_optim.py
Function: boxFilter at line 165

Line #      Hits         Time  Per Hit   % Time  Line Contents
   165                                               @classmethod
   166                                               def boxFilter(cls, input, output, rows, cols, rowWin, colWin):
   167         1          4.0      4.0      0.0          halfColWin = int((colWin + 2) / 2)  # 7->4, 8->5
   168         1          2.0      2.0      0.0          halfRowWin = int((rowWin + 2) / 2)
   169         1       3326.0   3326.0      8.1          x = np.array(input).reshape((rows, cols))
   170         1          2.0      2.0      0.0          temp = [x]
   171         1          1.0      1.0      0.0          matrices = []
   172         3         11.0      3.7      0.0          for i in range(1, halfColWin):
   173         2        162.0     81.0      0.4              temp.a

In [38]:

%prun findHasherImproved.fromFile("das_images/0120_27768345.jpg")

 

         2430 function calls (2325 primitive calls) in 0.071 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      179    0.015    0.000    0.015    0.000 {built-in method numpy.array}
       35    0.014    0.000    0.014    0.000 arraypad.py:88(_pad_simple)
        1    0.011    0.011    0.011    0.011 FINd_optim.py:108(dct64To16)
   141/36    0.004    0.000    0.023    0.001 {built-in method numpy.core._multiarray_umath.implement_array_function}
        1    0.004    0.004    0.060    0.060 FINd_optim.py:76(findHash256FromFloatLuma)
        1    0.003    0.003    0.003    0.003 {built-in method bottleneck.reduce.nanmean}
       70    0.003    0.000    0.003    0.000 {method 'astype' of 'numpy.ndarray' objects}
       37    0.003    0.000    0.003    0.000 {method 'tolist' of 'numpy.ndarray' objects}
        1    0.002    0.002    0.043    0.043 FINd_optim.py:165(boxFilter)
        1    0.002    0.002    0.002    0.002 FIN

In [8]:
das_images = list(os.listdir("das_images"))
images_sample = random.sample(das_images, 100)

In [33]:
%%prun
results = []
for image in images_sample:
    results.append(findHasherImproved.fromFile(f"das_images/{image}"))

 

         241559 function calls (231119 primitive calls) in 5.017 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    17800    1.182    0.000    1.201    0.000 {built-in method numpy.array}
      100    0.949    0.009    0.949    0.009 FINd_optim.py:108(dct64To16)
     3480    0.812    0.000    0.838    0.000 arraypad.py:88(_pad_simple)
      100    0.281    0.003    0.281    0.003 {built-in method bottleneck.reduce.nanmean}
14020/3580    0.252    0.000    1.324    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
     3680    0.201    0.000    0.201    0.000 {method 'tolist' of 'numpy.ndarray' objects}
      100    0.194    0.002    4.230    0.042 FINd_optim.py:76(findHash256FromFloatLuma)
      100    0.147    0.001    2.913    0.029 FINd_optim.py:165(boxFilter)
      100    0.128    0.001    0.128    0.001 FINd_optim.py:98(decimateFloat)
     6960    0.110    0.000    0.110    0.000 {method 'as

In [49]:
# Save orig_results to disk; the context manager closes the file handle afterwards.
with open('orig_results.p', 'wb') as f:
    pickle.dump(orig_results, f)

### Code Profiling

In [23]:
%%timeit
for image in images_sample:
    findHasherImproved.fromFile(f"das_images/{image}")

464 ms ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [24]:
%%timeit
for image in images_sample:
    findHasher.fromFile(f"das_images/{image}")

4.23 s ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [17]:
4.57/0.808

5.655940594059406

We see that most of the time is spent in the boxFilter() and fillFloatLumaFromBufferImage() functions, so we will focus on these for performance tuning. Now let's use line_profiler to dive into these specific functions.

In [333]:
%load_ext line_profiler

path_to_image = "das_images/0040_10318987.jpg"
%lprun -f findHasher.boxFilter findHasher.fromFile(path_to_image)

The line_profiler extension is already loaded. To reload it, use:
  %reload_ext line_profiler


Timer unit: 1e-06 s

Total time: 2.29095 s
File: /Users/willemzents/Documents/SDS/Code/Michaelmas/DAS/das-summative/FINd.py
Function: boxFilter at line 170

Line #      Hits         Time  Per Hit   % Time  Line Contents
   170                                               @classmethod
   171                                               def boxFilter(cls, input, output, rows, cols, rowWin, colWin):
   172         1          2.0      2.0      0.0          halfColWin = int((colWin + 2) / 2)  # 7->4, 8->5
   173         1          1.0      1.0      0.0          halfRowWin = int((rowWin + 2) / 2)
   174       251         95.0      0.4      0.0          for i in range(0, rows):
   175     62750      19771.0      0.3      0.9              for j in range(0, cols):
   176     62500      19563.0      0.3      0.9                  s = 0
   177     62500      38613.0      0.6      1.7                  xmin = max(0, i-halfRowWin)
   178     62500      31103.0      0.5      1.4                  xma

In [334]:
path_to_image = "das_images/0040_10318987.jpg"
%lprun -f findHasherImproved.boxFilter findHasherImproved.fromFile(path_to_image)

Timer unit: 1e-06 s

Total time: 2.57408 s
File: /Users/willemzents/Documents/SDS/Code/Michaelmas/DAS/das-summative/FINd_optim.py
Function: boxFilter at line 167

Line #      Hits         Time  Per Hit   % Time  Line Contents
   167                                               @classmethod
   168                                               def boxFilter(cls, input, output, rows, cols, rowWin, colWin):
   169         1          2.0      2.0      0.0          halfColWin = int((colWin + 2) / 2)  # 7->4, 8->5
   170         1          1.0      1.0      0.0          halfRowWin = int((rowWin + 2) / 2)
   171       251         93.0      0.4      0.0          x = np.zeros((rows, cols))
   172     62750      20536.0      0.3      0.8          for k in range(0, rows):
   173     62500      23055.0      0.4      0.9              for l in range(0, cols):
   174     62500      42016.0      0.7      1.6                  x[k, l] = input[k*rows+l]
   175     62500      32793.0      0.5      1.3    

In [22]:
path_to_image = "das_images/0040_10318987.jpg"
%lprun -f findHasherImproved.boxFilter findHasherImproved.fromFile(path_to_image)

Timer unit: 1e-06 s

Total time: 0.059269 s
File: /Users/willemzents/Documents/SDS/Code/Michaelmas/DAS/das-summative/FINd_optim.py
Function: boxFilter at line 165

Line #      Hits         Time  Per Hit   % Time  Line Contents
   165                                               @classmethod
   166                                               def boxFilter(cls, input, output, rows, cols, rowWin, colWin):
   167         1         12.0     12.0      0.0          halfColWin = int((colWin + 2) / 2)  # 7->4, 8->5
   168         1          1.0      1.0      0.0          halfRowWin = int((rowWin + 2) / 2)
   169         1       3891.0   3891.0      6.6          x = np.array(input).reshape((rows, cols))
   170         1          2.0      2.0      0.0          temp = [x]
   171         1          0.0      0.0      0.0          matrices = []
   172         3          6.0      2.0      0.0          for i in range(1, halfColWin):
   173         2        919.0    459.5      1.6              temp.a

In [23]:
path_to_image = "das_images/0040_10318987.jpg"
%prun findHasherImproved.fromFile(path_to_image)

 

         2430 function calls (2325 primitive calls) in 0.070 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      179    0.013    0.000    0.013    0.000 {built-in method numpy.array}
        1    0.010    0.010    0.010    0.010 FINd_optim.py:108(dct64To16)
       35    0.010    0.000    0.010    0.000 arraypad.py:88(_pad_simple)
       37    0.006    0.000    0.006    0.000 {method 'tolist' of 'numpy.ndarray' objects}
   141/36    0.004    0.000    0.020    0.001 {built-in method numpy.core._multiarray_umath.implement_array_function}
        1    0.003    0.003    0.003    0.003 {built-in method bottleneck.reduce.nanmean}
        1    0.002    0.002    0.052    0.052 FINd_optim.py:76(findHash256FromFloatLuma)
       35    0.002    0.000    0.002    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.002    0.002    0.002    0.002 {method 'decode' of 'ImagingDecoder' objects}
       70    0.002    0.000    0.0

In [50]:
path_to_image = "das_images/0040_10318987.jpg"

# %lprun only exists once the line_profiler extension is loaded in the current kernel session.
%load_ext line_profiler
%lprun -f findHasherImproved.fillFloatLumaFromBufferImage findHasherImproved.fromFile(path_to_image)


### Scraps

In [4]:
import numpy as np  # used by the vectorised luma functions below
from matrix import MatrixUtil
from PIL import Image
from numba import jit, njit
img=Image.open("das_images/0120_28335973.jpg")

img = img.copy()
img.thumbnail((512, 512))

LUMA_FROM_R_COEFF = float(0.299)
LUMA_FROM_G_COEFF = float(0.587)
LUMA_FROM_B_COEFF = float(0.114)

numCols, numRows = img.size
buffer1 = MatrixUtil.allocateMatrixAsRowMajorArray(numRows, numCols)
buffer2 = MatrixUtil.allocateMatrixAsRowMajorArray(numRows, numCols)

buffer64x64 = MatrixUtil.allocateMatrix(64, 64)
buffer16x64 = MatrixUtil.allocateMatrix(16, 64)
buffer16x16 = MatrixUtil.allocateMatrix(16, 16)
numCols, numRows = img.size

def computeBoxFilterWindowSize(dimension):
    """ Round up."""
    return int(
        (dimension + 64 - 1)
        / 64
    )

windowSizeAlongRows = computeBoxFilterWindowSize(numCols)
windowSizeAlongCols = computeBoxFilterWindowSize(numRows)

def fillFloatLumaFromBufferImage_alt(img, luma):
    rgb_image = img.convert("RGB")
    numCols, numRows = img.size
    for i in range(numRows):
        for j in range(numCols):
            r, g, b = rgb_image.getpixel((j, i))
            luma[i * numCols + j] = (
                LUMA_FROM_R_COEFF * r
                + LUMA_FROM_G_COEFF * g
                + LUMA_FROM_B_COEFF * b
            )
    return luma

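def fillFloatLumaFromBufferImage(img, luma):
    # Vectorise the colour conversion but still copy into luma element by element.
    numCols, numRows = img.size
    coeffs = np.array(
        [LUMA_FROM_R_COEFF, LUMA_FROM_G_COEFF, LUMA_FROM_B_COEFF])
    # Convert to RGB first (as the _alt version does) so that greyscale or
    # RGBA images also yield an array of shape (rows, cols, 3) for the dot product.
    converted = np.dot(np.asarray(img.convert("RGB")), coeffs)

    for i in range(numRows):
        for j in range(numCols):
            luma[i * numCols + j] = converted[i,j]
    
    return luma

def fillFloatLumaFromBufferImage_1(img, luma):
    # Fully vectorised: the luma argument is kept for signature compatibility
    # but is replaced by a new row-major list.
    coeffs = np.array(
        [LUMA_FROM_R_COEFF, LUMA_FROM_G_COEFF, LUMA_FROM_B_COEFF])
    converted = np.dot(np.asarray(img.convert("RGB")), coeffs).flatten()
    luma=converted.tolist()
    return luma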

buffer1 = fillFloatLumaFromBufferImage(img, buffer1)

def boxFilter_orig(input, output, rows, cols, rowWin, colWin):
    halfColWin = int((colWin + 2) / 2)  # 7->4, 8->5
    halfRowWin = int((rowWin + 2) / 2)
    for i in range(0, rows):
        for j in range(0, cols):
            s = 0
            xmin = max(0, i-halfRowWin)
            xmax = min(rows, i+halfRowWin)
            ymin = max(0, j-halfColWin)
            ymax = min(cols, j+halfColWin)
            for k in range(xmin, xmax):
                for l in range(ymin, ymax):
                    s += input[k*rows+l]
            output[i*rows+j] = s/((xmax-xmin)*(ymax-ymin))
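
The four nested loops in `boxFilter_orig` can be replaced by a summed-area table, so that each window sum costs four lookups regardless of window size. The sketch below is one possible vectorisation and is not the implementation in `FINd_optim.py`; the name `box_filter_sat` is hypothetical. It indexes the buffer as row-major (`k*cols + l`), which matches the `k*rows + l` indexing of `boxFilter_orig` only for square images.

```python
import numpy as np


def box_filter_sat(luma, rows, cols, row_win, col_win):
    """Mean filter with windows clamped at the borders, via a summed-area table.

    luma is a flat row-major sequence of rows * cols values.
    Returns a (rows, cols) array.
    """
    half_col = (col_win + 2) // 2  # 7->4, 8->5, as in the original
    half_row = (row_win + 2) // 2
    x = np.asarray(luma, dtype=np.float64).reshape(rows, cols)

    # sat[a, b] holds the sum of x[:a, :b]; the extra zero row and column
    # remove the need for special cases at the top and left borders.
    sat = np.zeros((rows + 1, cols + 1))
    sat[1:, 1:] = x.cumsum(axis=0).cumsum(axis=1)

    i = np.arange(rows)
    j = np.arange(cols)
    x0 = np.maximum(0, i - half_row)[:, None]
    x1 = np.minimum(rows, i + half_row)[:, None]
    y0 = np.maximum(0, j - half_col)[None, :]
    y1 = np.minimum(cols, j + half_col)[None, :]

    # Sum over the half-open window [x0, x1) x [y0, y1) for every pixel at once.
    total = sat[x1, y1] - sat[x0, y1] - sat[x1, y0] + sat[x0, y0]
    return total / ((x1 - x0) * (y1 - y0))
```

Results should agree with the loop version up to floating-point rounding, so a comparison with `np.allclose` is more appropriate than exact equality.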

## Cython

In [4]:
%reload_ext cython

In [5]:
%%cython -a

# The unused cimports (cython, numpy, libc.math, cython.parallel) are dropped;
# cimport numpy additionally needs the NumPy headers at build time.
def boxFilter_cython(input, output, int rows, int cols, int rowWin, int colWin):
    # C-typed counters and accumulator let Cython compile the nested loops to C.
    cdef int i, j, k, l, xmin, xmax, ymin, ymax
    cdef double s
    cdef int halfColWin = (colWin + 2) // 2  # 7->4, 8->5
    cdef int halfRowWin = (rowWin + 2) // 2
    for i in range(rows):
        for j in range(cols):
            s = 0.0
            xmin = max(0, i-halfRowWin)
            xmax = min(rows, i+halfRowWin)
            ymin = max(0, j-halfColWin)
            ymax = min(cols, j+halfColWin)
            for k in range(xmin, xmax):
                for l in range(ymin, ymax):
                    s += input[k*rows+l]
            output[i*rows+j] = s/((xmax-xmin)*(ymax-ymin))

LinkError: command 'x86_64-apple-darwin13.4.0-clang' failed with exit status 1

In [None]:
%timeit boxFilter_cython(buffer1, buffer2, numRows, numCols, windowSizeAlongRows, windowSizeAlongCols)

## Multiprocessing

In [65]:
import concurrent.futures
import datetime
import glob

total = []

def process_file(file_name_list):
    startfile=datetime.datetime.now()
    print("{} - start {}".format(startfile, len(file_name_list)))
    hashes = []
    for file_name in file_name_list:
        out = findHasherImproved.fromFile(file_name)
        hashes.append(out)

    print("{} - finish {} in {}".format(datetime.datetime.now(),len(file_name_list), datetime.datetime.now()-startfile))
    return hashes

if __name__ == "__main__":
    start=datetime.datetime.now()

    images = glob.glob("das_images/*.jpg")
    n = 1000
    x=[images[i:i + n] for i in range(0, len(images), n)]

    with concurrent.futures.ProcessPoolExecutor() as executor:
        results=executor.map(process_file, x[:10])

    for r in results:
        total.append(r)

    time=datetime.datetime.now()-start
    print(time)
    print(total)


, 0, 1, 0, 0,
       0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1,
       1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1,
       1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]), array([1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1
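
The batching pattern used in the cell above can be written more compactly. This is a sketch under assumptions rather than a drop-in replacement: `chunked`, `hash_batch`, and `hash_in_parallel` are hypothetical names, and `hash_one` stands for any picklable function that maps a file path to a hash.

```python
from concurrent.futures import ProcessPoolExecutor


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def hash_batch(paths, hash_one):
    """Hash every path in one batch; runs inside a worker process."""
    return [hash_one(p) for p in paths]


def hash_in_parallel(paths, hash_one, batch_size=100, workers=None):
    """Hash all paths using a pool of processes, preserving input order."""
    batches = chunked(paths, batch_size)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() yields one result list per batch, in submission order.
        results = pool.map(hash_batch, batches, [hash_one] * len(batches))
        return [h for batch in results for h in batch]
```

Larger batches reduce inter-process overhead but balance the load less evenly across workers. On platforms that start workers with the spawn method (for example macOS and Windows), functions defined only inside a notebook may not be importable by the workers, so placing them in a `.py` module is the safer choice.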

## Spark

In [22]:
from pyspark.sql import SparkSession, SQLContext
spark = SparkSession.builder.appName('my_app').getOrCreate()

In [2]:
#Resilient Distributed Datasets (RDDs) are at the heart of Spark
rdd = spark.sparkContext.parallelize(range(1000))
rdd.first()

0

In [3]:
# path to your image source directory
sample_img_dir = "das_images/"
# Read image data using new image scheme
image_df = spark.read.format("image").load(sample_img_dir)

# Databricks display includes built-in image display support
display(image_df) 


DataFrame[image: struct<origin:string,height:int,width:int,nChannels:int,mode:int,data:binary>]

In [24]:
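# SQLContext('local', 'test') passes strings where a SparkContext is expected; reuse the session's context instead.
sc = spark.sparkContext
# images_sample holds bare file names, so the directory is prepended; collect() triggers the lazy map().
hashes = sc.parallelize(images_sample).map(lambda name: FINDHasherImproved().fromFile(f"das_images/{name}")).collect()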

## Submission Instructions

Please produce a 3,500-word report in PDF format along with all code and its output. The code must be provided in an executable format (e.g., .py Python scripts or .ipynb Jupyter notebooks). The work must be submitted electronically via the Assignment Submission WebLearn Site before midday on Friday of Week 0 (15th January) of Hilary term.

If anything goes wrong with your submission, email msc@oii.ox.ac.uk immediately. In cases where a technical fault that is later determined to be a fault of the WebLearn system (and not a fault of your computer) prevents you from submitting the assessment on time, having a time-stamped email message will help the Proctors determine whether your assessment will be accepted. Please note that you should not wait until the last minute to submit materials since WebLearn can run slowly at peak submission times and this is not considered a technical fault.

Full instructions on using WebLearn for electronic submissions can be found on Canvas.

Candidate Number and Cover Sheet: Remember to use the OII coversheet, stating clearly your candidate number, your course, assignment, title and word count. Your work should be identified ONLY by your candidate number (which can be found by visiting the online Student Self-Service facility).

Remember we are required under regulations to accept your FIRST submission so please make sure you are uploading the correct file.

### To optimize
* boxFilter
* fillFloatLumaFromBufferImage