## Instructions

## Data Analytics at Scale Summative Assessment

You have been contracted as a consultant for Find Images Now (FIN), a tech start-up that wants to match and cluster images at scale. FIN has identified a promising technique but needs help evaluating the performance of the approach. Your evaluation should assess both the accuracy and computational costs (e.g., CPU, memory, runtime demands and how these scale with the number of images as input). 

In particular, FIN has identified “image hashing” approaches, specifically the Python imagehash library (https://github.com/JohannesBuchner/imagehash), as well as its own in-house hashing approach, called FINd, for which a pure Python implementation is available at https://github.com/oxfordinternetinstitute/das2020 . 
Candidates must first seek to optimize FINd by identifying what portions of the algorithm are computational bottlenecks, implementing alternatives, and comparing computational performance. Candidates must plan and implement two or three optimizations. These optimizations may include: 
* use of scientific Python libraries such as numpy, scipy, and pandas, 
* different execution approaches (e.g., single process on one CPU vs multiprocessing on a single computer vs distributed approaches), 
* use of GPUs, 
* use of C-compiled code (e.g., Cython), 
* etc. 

When implementing the optimizations, candidates should ensure the correctness of their output by writing an appropriate unit test. They should also profile the code to analyse CPU, memory, runtime, and other relevant aspects of computational performance.  

After this, candidates must compare the performance (both in terms of accuracy and computational costs) of two methods from the imagehash library to their optimized version of FINd using the dataset provided in class. That is, compare FINd with any two of the following from the imagehash library: 
* average hashing (aHash) 
* perceptual hashing (pHash) 
* difference hashing (dHash) 
* wavelet hashing (wHash) 

FIN expect a written report, not to exceed 3,500 words, consisting of two parts. Part 1 will report on FINd including an initial assessment of the performance of the code, the optimizations attempted, and the resulting changes (positive or negative) in computational performance. Candidates should discuss any relevant trade-offs (e.g., CPU vs. memory) of these optimizations. The second half of the report should focus on how FINd compares to the two other image hashing methods selected. This part should analyse both the accuracy of the results as well as the computational costs. The report should focus most in-depth on the trade-offs of different approaches (e.g., the advantages and disadvantages of each approach and how the approaches compare to one another). Finally, candidates should discuss which approach is the 'best' for the given dataset and FIN’s need to match similar images at scale. 

### Further details: 
Projects will be examined based on the following criteria and approximate weights: 
* Part 1 
    * 15 points - Clear plan for and analysis of the provided FINd algorithm  
    * 5 points - Executable code analysing the provided FINd algorithm 
    * 10 points - Clear rationale/justification for the planned optimizations for FINd along with a clear description of two or three computational optimizations to FINd that the author will implement and compare (and the relative ‘difficulty’ of these approaches [see below]) 
    * 10 points - Executable code that implements the described optimizations 
    * 15 points - Comparison of each optimization (i.e., what trade-offs are involved) 
* Part 2 
    * 15 points - Comparison of the accuracy and performance of FINd against two other image hashing approaches 
    * 10 points - Executable code that implements the comparisons between FINd and two other image hashing approaches 
    * 20 points - Discussion as to which approach is most suited for the specific data and task. 

The best projects will typically contain at least some of the following: 
* Evidence that the candidate has gone beyond the materials presented in class and the lab sessions 
* Include multiple approaches that are clearly distinct. For example, comparing the performance of three different Python libraries is considerably easier than implementing single-process, multi-process, and distributed approaches. 
* Have clear code with sufficient comments/documentation for it to be easily understood  
* Be well-written and demonstrate originality 

### FAQs 
* Can I base my project purely on the code distributed in class? - Yes, but try to at least use it in an original way. A project which simply duplicates a class exercise is unlikely to get a good mark, though it likely would not fail if it was at least well executed. 
* Can I use code found on the Internet? Can I use library X? - Yes. Understanding another person’s code / library is a fundamental part of programming, but it is important to understand the code including its computational and memory implications. Following best practices, please cite any sources from where you obtained code and be clear about what modifications you made.  
* Can one of my approaches be a standalone software package (e.g., NodeXL, QDA Miner, etc.)? - Not for this class: please choose approaches that require programming. 
* Can one of my approaches be in a language other than Python? - No. For the purpose of this assignment, please only compare Python approaches. 
* One of my computational approaches is unlikely to finish in the time available. Help! - It is perfectly fine (and in fact good practice) to profile your code on a subset of the data if the data set is very large. There is no requirement that you run all your approaches on the full dataset. 
* Can I work together with another student? - No. The report submitted and code developed must be entirely your own work. Collaboration with another student in the course is not allowed. 

In [1]:
from FINd import FINDHasher
from FINd_optim import FINDHasher as FINDHasherImproved
import imagehash
from PIL import Image

In [2]:
# Additional import statements
import os
import random
random.seed(42)
import pickle
import numpy as np
import math
from scipy import misc
from numba import jit, njit

In [3]:
findHasher=FINDHasher()
findHasherImproved = FINDHasherImproved()

In [4]:
ex1=findHasher.fromFile("das_images/0040_10318987.jpg")
ex2=findHasher.fromFile("das_images/0040_10321701.jpg")

ex3=findHasher.fromFile("das_images/0120_27768345.jpg")

img=Image.open("das_images/0120_28335973.jpg")
ex4=findHasher.fromImage(img)

In [16]:
%lprun -f findHasherImproved.findHash256FromFloatLuma findHasherImproved.fromFile("das_images/0040_10318987.jpg")


Timer unit: 1e-06 s

Total time: 0.123512 s
File: /Users/willemzents/Documents/SDS/Code/Michaelmas/DAS/das-summative/FINd_optim.py
Function: findHash256FromFloatLuma at line 75

Line #      Hits         Time  Per Hit   % Time  Line Contents
    75                                               def findHash256FromFloatLuma(
    76                                                       self,
    77                                                       fullBuffer1,
    78                                                       fullBuffer2,
    79                                                       numRows,
    80                                                       numCols,
    81                                                       buffer64x64,
    82                                                       buffer16x64,
    83                                                       buffer16x16,
    84                                               ):
    85         1         17.0     17.0     

In [14]:
# %load_ext line_profiler
# %lprun -f findHasherImproved.fromImage findHasherImproved.fromFile("das_images/0040_10318987.jpg")
img=Image.open("das_images/0120_28335973.jpg")

findHasherImproved.fromImage("das_images/0040_10318987.jpg")

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [None]:
ex1=findHasherImproved.fromFile("das_images/0040_10318987.jpg")
ex2=findHasherImproved.fromFile("das_images/0040_10321701.jpg")

ex3=findHasher.fromFile("das_images/0120_27768345.jpg")

img=Image.open("das_images/0120_28335973.jpg")
ex4=findHasher.fromImage(img)
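
Hashes such as `ex1` and `ex2` are usually compared by Hamming distance, the number of bit positions in which they differ. The sketch below assumes each hash can be flattened to a one-dimensional sequence of 0/1 values; `hamming_distance` is an illustrative helper, not part of FINd or imagehash.

```python
def hamming_distance(bits_a, bits_b):
    """Count the positions at which two equal-length bit sequences differ."""
    a, b = list(bits_a), list(bits_b)
    if len(a) != len(b):
        raise ValueError("hashes must have the same number of bits")
    return sum(int(x) != int(y) for x, y in zip(a, b))


# Toy 8-bit hashes (illustrative values, not real FINd output).
h1 = [0, 1, 1, 0, 1, 0, 0, 1]
h2 = [0, 1, 0, 0, 1, 1, 0, 1]
print(hamming_distance(h1, h2))  # 2
```

In the imagehash library, subtracting two `ImageHash` objects returns this same count.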

In [5]:
%prun findHasher.fromFile("das_images/0040_10318987.jpg")

 

         500235 function calls in 0.499 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.343    0.343    0.382    0.382 FINd.py:170(boxFilter)
        1    0.034    0.034    0.102    0.102 FINd.py:68(fillFloatLumaFromBufferImage)
    62500    0.026    0.000    0.067    0.000 Image.py:1345(getpixel)
    62504    0.023    0.000    0.032    0.000 Image.py:801(load)
   125000    0.021    0.000    0.021    0.000 {built-in method builtins.max}
   125000    0.018    0.000    0.018    0.000 {built-in method builtins.min}
    62500    0.010    0.000    0.010    0.000 {method 'getpixel' of 'ImagingCore' objects}
        1    0.009    0.009    0.009    0.009 FINd.py:113(dct64To16)
    62503    0.008    0.000    0.008    0.000 {method 'pixel_access' of 'ImagingCore' objects}
        1    0.001    0.001    0.001    0.001 {method 'decode' of 'ImagingDecoder' objects}
        1    0.001    0.001    0.499    0.499 FINd.py:40(
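
Outside IPython, the same kind of report can be produced with the standard-library `cProfile` and `pstats` modules. The sketch below is a minimal example; `profile_call` and the toy `slow_sum` workload are hypothetical stand-ins for a call such as `findHasher.fromFile(...)`.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, report sorted by internal time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("tottime").print_stats(top)  # keep only the `top` entries
    return result, stream.getvalue()


def slow_sum(n):
    """Toy workload standing in for an expensive hashing call."""
    total = 0
    for i in range(n):
        total += i * i
    return total


result, report = profile_call(slow_sum, 10_000)
print(report)
```

Sorting by `"tottime"` corresponds to the "Ordered by: internal time" view that `%prun` prints by default.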

In [41]:
sample = random.sample(das_images, 1)[0]

In [42]:

%prun findHasherImproved.fromFile(f"das_images/{sample}")

 

         3284 function calls (3137 primitive calls) in 0.101 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      246    0.032    0.000    0.033    0.000 {built-in method numpy.array}
       48    0.014    0.000    0.015    0.000 arraypad.py:88(_pad_simple)
        1    0.010    0.010    0.010    0.010 FINd_optim.py:107(dct64To16)
   197/50    0.007    0.000    0.052    0.001 {built-in method numpy.core._multiarray_umath.implement_array_function}
        1    0.007    0.007    0.092    0.092 FINd_optim.py:75(findHash256FromFloatLuma)
       51    0.006    0.000    0.006    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.004    0.004    0.018    0.018 nanfunctions.py:70(_replace_nan)
       96    0.003    0.000    0.003    0.000 {method 'astype' of 'numpy.ndarray' objects}
        1    0.002    0.002    0.074    0.074 FINd_optim.py:164(boxFilter)
       50    0.002    0.000    0.002    0.000 {method 'tolist'

In [47]:
das_images = sorted(os.listdir("das_images"))  # sorted: os.listdir order is arbitrary, which would defeat random.seed
images_sample = random.sample(das_images, 10)

In [14]:
orig_results = []
for image in images_sample:
    orig_results.append(findHasher.fromFile(f"das_images/{image}"))

In [49]:
with open('orig_results.p', 'wb') as fh:
    pickle.dump(orig_results, fh)
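
The pickled reference hashes can later serve as a regression check for the optimized hasher. The following is a minimal sketch, not the test used in this notebook: `load_reference` and `find_mismatches` are hypothetical helper names, and `key` is assumed to map both hash types to a common comparable form (the default `str` is only a placeholder).

```python
import pickle


def load_reference(path):
    """Load a list of reference hashes written earlier with pickle.dump."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def find_mismatches(reference, candidate, key=str):
    """Return the indices i where key(reference[i]) != key(candidate[i])."""
    if len(reference) != len(candidate):
        raise ValueError("result lists differ in length")
    return [
        i
        for i, (ref, new) in enumerate(zip(reference, candidate))
        if key(ref) != key(new)
    ]


# Intended use with the names from this notebook (left as comments because
# it needs the image data and both hasher classes):
# reference = load_reference("orig_results.p")
# new = [findHasherImproved.fromFile(f"das_images/{image}") for image in images_sample]
# assert find_mismatches(reference, new) == []
```

Returning the mismatching indices rather than a single boolean makes it easier to see which images an optimization changes.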

### Code Profiling

In [48]:
%%timeit
for image in images_sample:
    findHasherImproved.fromFile(f"das_images/{image}")

863 ms ± 86.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [16]:
%%timeit
for image in images_sample:
    findHasher.fromFile(f"das_images/{image}")

4.57 s ± 734 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [17]:
4.57/0.808

5.655940594059406
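
The timings above cover runtime only. Peak Python-level memory can be sampled in a similar way with the standard-library `tracemalloc` module. The sketch below is one possible approach, not the method used in this notebook: `measure` and the toy `build_list` workload are hypothetical, and `tracemalloc` only sees allocations made through Python's allocator, so memory held by some C extensions may be missed.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Return (result, elapsed_seconds, peak_bytes) for one call of func."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def build_list(n):
    """Toy workload standing in for hashing a batch of n images."""
    return [float(i) for i in range(n)]


# Repeat the measurement for growing input sizes to see how costs scale.
for n in (1_000, 10_000, 100_000):
    _, seconds, peak = measure(build_list, n)
    print(f"n={n:>7}  time={seconds:.4f}s  peak={peak / 1024:.0f} KiB")
```

Tracing adds overhead of its own, so runtime and memory are probably best measured in separate runs.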

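The %prun output for the original hasher shows that most of the time is spent in boxFilter() (0.343 s of the 0.499 s total) and fillFloatLumaFromBufferImage() (0.102 s cumulative, largely in per-pixel getpixel calls), so these two functions are the main targets for performance tuning. Next, we use line_profiler to examine them line by line.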

In [333]:
%load_ext line_profiler

path_to_image = "das_images/0040_10318987.jpg"
%lprun -f findHasher.boxFilter findHasher.fromFile(path_to_image)

The line_profiler extension is already loaded. To reload it, use:
  %reload_ext line_profiler


Timer unit: 1e-06 s

Total time: 2.29095 s
File: /Users/willemzents/Documents/SDS/Code/Michaelmas/DAS/das-summative/FINd.py
Function: boxFilter at line 170

Line #      Hits         Time  Per Hit   % Time  Line Contents
   170                                               @classmethod
   171                                               def boxFilter(cls, input, output, rows, cols, rowWin, colWin):
   172         1          2.0      2.0      0.0          halfColWin = int((colWin + 2) / 2)  # 7->4, 8->5
   173         1          1.0      1.0      0.0          halfRowWin = int((rowWin + 2) / 2)
   174       251         95.0      0.4      0.0          for i in range(0, rows):
   175     62750      19771.0      0.3      0.9              for j in range(0, cols):
   176     62500      19563.0      0.3      0.9                  s = 0
   177     62500      38613.0      0.6      1.7                  xmin = max(0, i-halfRowWin)
   178     62500      31103.0      0.5      1.4                  xma

In [334]:
path_to_image = "das_images/0040_10318987.jpg"
%lprun -f findHasherImproved.boxFilter findHasherImproved.fromFile(path_to_image)

Timer unit: 1e-06 s

Total time: 2.57408 s
File: /Users/willemzents/Documents/SDS/Code/Michaelmas/DAS/das-summative/FINd_optim.py
Function: boxFilter at line 167

Line #      Hits         Time  Per Hit   % Time  Line Contents
   167                                               @classmethod
   168                                               def boxFilter(cls, input, output, rows, cols, rowWin, colWin):
   169         1          2.0      2.0      0.0          halfColWin = int((colWin + 2) / 2)  # 7->4, 8->5
   170         1          1.0      1.0      0.0          halfRowWin = int((rowWin + 2) / 2)
   171       251         93.0      0.4      0.0          x = np.zeros((rows, cols))
   172     62750      20536.0      0.3      0.8          for k in range(0, rows):
   173     62500      23055.0      0.4      0.9              for l in range(0, cols):
   174     62500      42016.0      0.7      1.6                  x[k, l] = input[k*rows+l]
   175     62500      32793.0      0.5      1.3    

In [66]:
path_to_image = "das_images/0040_10318987.jpg"
%lprun -f findHasherImproved.boxFilter findHasherImproved.fromFile(path_to_image)

Timer unit: 1e-06 s

Total time: 0.073981 s
File: /Users/willemzents/Documents/SDS/Code/Michaelmas/DAS/das-summative/FINd_optim.py
Function: boxFilter at line 164

Line #      Hits         Time  Per Hit   % Time  Line Contents
   164                                               @classmethod
   165                                               def boxFilter(cls, input, output, rows, cols, rowWin, colWin):
   166         1          6.0      6.0      0.0          halfColWin = int((colWin + 2) / 2)  # 7->4, 8->5
   167         1          1.0      1.0      0.0          halfRowWin = int((rowWin + 2) / 2)
   168                                           
   169         1       2561.0   2561.0      3.5          x = np.array(input).reshape((rows, cols))
   170         1          3.0      3.0      0.0          temp = [x]
   171         1          0.0      0.0      0.0          matrices = []
   172         4          7.0      1.8      0.0          for i in range(1, halfColWin+1):
   173         

In [52]:
%load_ext line_profiler

In [50]:
path_to_image = "das_images/0040_10318987.jpg"

%lprun -f findHasherImproved.fillFloatLumaFromBufferImage findHasherImproved.fromFile(path_to_image)

UsageError: Line magic function `%lprun` not found.


### Scraps

In [32]:
from matrix import MatrixUtil
from PIL import Image
from numba import jit, njit 
img=Image.open("das_images/0120_28335973.jpg")

img = img.copy()
img.thumbnail((512, 512))

LUMA_FROM_R_COEFF = float(0.299)
LUMA_FROM_G_COEFF = float(0.587)
LUMA_FROM_B_COEFF = float(0.114)

numCols, numRows = img.size
buffer1 = MatrixUtil.allocateMatrixAsRowMajorArray(numRows, numCols)
buffer2 = MatrixUtil.allocateMatrixAsRowMajorArray(numRows, numCols)

buffer64x64 = MatrixUtil.allocateMatrix(64, 64)
buffer16x64 = MatrixUtil.allocateMatrix(16, 64)
buffer16x16 = MatrixUtil.allocateMatrix(16, 16)
numCols, numRows = img.size

def computeBoxFilterWindowSize(dimension):
    """ Round up."""
    return int(
        (dimension + 64 - 1)
        / 64
    )

windowSizeAlongRows = computeBoxFilterWindowSize(numCols)
windowSizeAlongCols = computeBoxFilterWindowSize(numRows)

def fillFloatLumaFromBufferImage_alt(img, luma):
    rgb_image = img.convert("RGB")
    numCols, numRows = img.size
    for i in range(numRows):
        for j in range(numCols):
            r, g, b = rgb_image.getpixel((j, i))
            luma[i * numCols + j] = (
                LUMA_FROM_R_COEFF * r
                + LUMA_FROM_G_COEFF * g
                + LUMA_FROM_B_COEFF * b
            )
    return luma

def fillFloatLumaFromBufferImage(img, luma):
    numCols, numRows = img.size
    coeffs = np.array(
        [LUMA_FROM_R_COEFF, LUMA_FROM_G_COEFF, LUMA_FROM_B_COEFF])
    converted = np.dot(np.asarray(img), coeffs)

    for i in range(numRows):
        for j in range(numCols):
            luma[i * numCols + j] = converted[i,j]
    
    return luma

def fillFloatLumaFromBufferImage_1(img, luma):
    coeffs = np.array(
        [LUMA_FROM_R_COEFF, LUMA_FROM_G_COEFF, LUMA_FROM_B_COEFF])
    converted = np.dot(np.asarray(img), coeffs).flatten()
    luma=converted.tolist()
    return luma

buffer1 = fillFloatLumaFromBufferImage(img, buffer1)

def boxFilter_orig(input, output, rows, cols, rowWin, colWin):
    halfColWin = int((colWin + 2) / 2)  # 7->4, 8->5
    halfRowWin = int((rowWin + 2) / 2)
    for i in range(0, rows):
        for j in range(0, cols):
            s = 0
            xmin = max(0, i-halfRowWin)
            xmax = min(rows, i+halfRowWin)
            ymin = max(0, j-halfColWin)
            ymax = min(cols, j+halfColWin)
            for k in range(xmin, xmax):
                for l in range(ymin, ymax):
                    s += input[k*rows+l]
            output[i*rows+j] = s/((xmax-xmin)*(ymax-ymin))

def boxFilter(input, output, rows, cols, rowWin, colWin):
    halfColWin = int((colWin + 2) / 2)  # 7->4, 8->5
    halfRowWin = int((rowWin + 2) / 2)
    x = np.array(input).reshape((rows,cols))
    temp = [x]
    matrices = []
    for i in range(1,halfColWin+1):
        temp.append(np.pad(x.astype(float),((0,0),(0,i)), mode='constant', constant_values=np.nan)[:, i:])
        temp.append(np.pad(x.astype(float),((0,0),(i,0)), mode='constant', constant_values=np.nan)[:, :-i])
    for matrix in temp:
        for i in range(1,halfRowWin+1):
            matrices.append(np.pad(matrix.astype(float),((i,0),(0,0)), mode='constant', constant_values=np.nan)[:-i, :])
            matrices.append(np.pad(matrix.astype(float),((0,i),(0,0)), mode='constant', constant_values=np.nan)[i:, :])
    matrices.extend(temp)
    means = np.nanmean(np.array(matrices), axis=0)
    # Write into the caller's buffer in place (rebinding `output` would leave
    # it untouched) and return it so that `result = boxFilter(...)` also works.
    output[:] = means.ravel().tolist()
    return output
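
The shifted-copy `boxFilter` above averages over a symmetric window, while `boxFilter_orig` uses the half-open ranges `range(xmin, xmax)` and `range(ymin, ymax)`, so the two need not agree. One alternative that keeps the original window definition is a summed-area table (integral image). The sketch below is an illustration under stated assumptions rather than the code in `FINd_optim.py`: `box_filter_integral` is a hypothetical name, and it indexes the buffer as `k * cols + l`, which coincides with the original `k*rows+l` only for square images.

```python
import numpy as np


def box_filter_integral(values, rows, cols, row_win, col_win):
    """Box filter with the window [i-half, i+half) x [j-half, j+half),
    clipped to the image, computed from a summed-area table."""
    half_col = int((col_win + 2) / 2)
    half_row = int((row_win + 2) / 2)
    x = np.asarray(values, dtype=np.float64).reshape(rows, cols)

    # Summed-area table with a leading row and column of zeros, so that
    # sat[r, c] is the sum of x[:r, :c].
    sat = np.zeros((rows + 1, cols + 1))
    sat[1:, 1:] = x.cumsum(axis=0).cumsum(axis=1)

    i = np.arange(rows)
    j = np.arange(cols)
    r0 = np.maximum(0, i - half_row)[:, None]
    r1 = np.minimum(rows, i + half_row)[:, None]
    c0 = np.maximum(0, j - half_col)[None, :]
    c1 = np.minimum(cols, j + half_col)[None, :]

    # Window sum from four table lookups per pixel (inclusion-exclusion).
    sums = sat[r1, c1] - sat[r0, c1] - sat[r1, c0] + sat[r0, c0]
    counts = (r1 - r0) * (c1 - c0)
    return (sums / counts).ravel()


rng = np.random.default_rng(0)
rows = cols = 6
data = rng.random(rows * cols).tolist()
smoothed = box_filter_integral(data, rows, cols, row_win=1, col_win=1)
print(smoothed.reshape(rows, cols).round(3))
```

Because the sums are accumulated in a different order, results should match the nested-loop version only up to floating-point rounding, so any comparison is best done with a tolerance.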


In [31]:
test = np.array([[1,2,3],[4,5,6]])
test[1][1]

5

In [20]:
buffer1 = fillFloatLumaFromBufferImage(img, buffer1)
output_a = buffer2.copy()
output_a = boxFilter(buffer1, output_a, numRows, numCols, windowSizeAlongRows, windowSizeAlongCols)

In [22]:
buffer1 = fillFloatLumaFromBufferImage(img, buffer1)
output_b = buffer2.copy()
boxFilter_orig(buffer1, output_b, numRows, numCols, windowSizeAlongRows, windowSizeAlongCols)  # fills output_b in place; returns None


In [23]:
output_a == output_b

False

In [9]:
output_a = buffer2.copy()
output_b = buffer2.copy()

output_a = findHasherImproved.boxFilter(buffer1, output_a, numRows, numCols, windowSizeAlongRows, windowSizeAlongCols)
findHasher.boxFilter(buffer1, output_b, numRows, numCols, windowSizeAlongRows, windowSizeAlongCols)


In [17]:
len(list(set(output_a).difference(output_b)))


42111
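
An exact `==` check on float buffers flags any rounding difference, and a set difference additionally ignores element positions. A tolerance-based, position-wise comparison is often more informative. The sketch below is illustrative; `compare_buffers` and its default tolerances are assumptions, not values taken from FINd.

```python
import numpy as np


def compare_buffers(a, b, rtol=1e-9, atol=1e-9):
    """Return (n_mismatch, max_abs_diff) for two equal-length float buffers."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    if a.shape != b.shape:
        raise ValueError("buffers differ in length")
    close = np.isclose(a, b, rtol=rtol, atol=atol)
    max_abs_diff = float(np.max(np.abs(a - b))) if a.size else 0.0
    return int((~close).sum()), max_abs_diff


# Toy buffers: the second value differs only by floating-point rounding,
# the third differs by 0.5.
n_bad, worst = compare_buffers([1.0, 0.1 + 0.2, 3.0], [1.0, 0.3, 3.5])
print(n_bad, worst)  # 1 0.5
```

A large mismatch count together with a large maximum difference suggests a genuine change in the algorithm rather than rounding noise.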

In [38]:
numRows, numCols = 250,250
buffer1 = MatrixUtil.allocateMatrixAsRowMajorArray(numRows, numCols)
test = buffer1.copy()

In [39]:
test2 = findHasherImproved.fillFloatLumaFromBufferImage(img, buffer1)
findHasher.fillFloatLumaFromBufferImage(img, buffer1)
buffer1 == test2

True

In [37]:
buffer1

[175.63499999999996,
 235.86499999999998,
 242.069,
 219.17,
 229.32499999999996,
 231.14699999999996,
 218.995,
 229.09699999999998,
 224.17100000000002,
 223.778,
 223.679,
 224.20100000000002,
 224.09899999999996,
 224.242,
 224.09700000000004,
 223.94599999999997,
 235.34099999999998,
 207.98199999999997,
 242.20999999999998,
 219.726,
 215.182,
 225.111,
 225.86599999999999,
 229.09399999999997,
 225.806,
 226.40400000000002,
 215.11599999999999,
 227.6,
 228.97,
 215.155,
 226.166,
 228.23699999999997,
 222.68,
 219.03900000000002,
 229.056,
 224.845,
 217.976,
 227.22099999999998,
 227.232,
 222.759,
 230.21499999999997,
 232.97,
 218.13799999999998,
 224.138,
 225.682,
 220.155,
 223.69899999999998,
 237.75900000000001,
 226.004,
 219.004,
 233.004,
 228.00399999999996,
 217.004,
 228.00399999999996,
 224.004,
 230.00399999999996,
 223.004,
 217.004,
 236.00399999999996,
 228.00399999999996,
 226.004,
 228.00399999999996,
 221.00399999999996,
 228.00399999999996,
 226.280999999

In [75]:
%lprun -f boxFilter boxFilter(buffer1, buffer2, numRows, numCols, windowSizeAlongRows, windowSizeAlongCols)


Timer unit: 1e-06 s

Total time: 0.06242 s
File: <ipython-input-74-d788e442d3fc>
Function: boxFilter at line 66

Line #      Hits         Time  Per Hit   % Time  Line Contents
    66                                           def boxFilter(input, output, rows, cols, rowWin, colWin):
    67         1          5.0      5.0      0.0      halfColWin = int((colWin + 2) / 2)  # 7->4, 8->5
    68         1          1.0      1.0      0.0      halfRowWin = int((rowWin + 2) / 2)
    69                                           
    70         1       6510.0   6510.0     10.4      x = np.array(input).reshape((rows,cols))
    71         1          2.0      2.0      0.0      temp = [x]
    72         1          0.0      0.0      0.0      matrices = []
    73         4          7.0      1.8      0.0      for i in range(1,halfColWin+1):
    74         3       1622.0    540.7      2.6          temp.append(np.pad(x.astype(float),((0,0),(0,i)), mode='constant', constant_values=np.nan)[:, i:])
    75     

## Multiprocessing

In [None]:
import concurrent.futures
import multiprocessing
import datetime

import json
import math
import glob
import gzip

from geopy.distance import distance  # needed by process_file below


LONDON=[51.507222, -0.1275]

london_tweets = 0
total_tweets = 0

def process_file(filename):
    startfile=datetime.datetime.now()
    print("{} - start {}".format(startfile,filename))
    london_tweets=0
    total_tweets=0
    
    findHasherImproved.fromFile(f"das_images/{sample}")
    
    with gzip.open(filename,"rt") as fh:
        for line in fh:
            total_tweets+=1
            tweet=json.loads(line)
            if "coordinates" in tweet and tweet["coordinates"] is not None:
                loc1=tweet["coordinates"]["coordinates"]
                loc1=[loc1[1],loc1[0]] #Twitter coordinates are longitude,latitude. geopy expects latitude, longitude.
                dist=distance(LONDON,loc1).km
                if dist<50:
                    london_tweets+=1
    print("{} - finish {} in {}".format(datetime.datetime.now(),filename,datetime.datetime.now()-startfile))
    return (london_tweets,total_tweets)

if __name__ == "__main__":
    start=datetime.datetime.now()

    with concurrent.futures.ProcessPoolExecutor() as executor:
        results=executor.map(process_file, glob.glob("/data/twitter-geo/2016-01-*.gz"))


    for r in results:
        london_tweets+=r[0]
        total_tweets+=r[1]

    time=datetime.datetime.now()-start
    print(time)
    print(london_tweets)
    print(total_tweets)
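
For the image-hashing task, the same `ProcessPoolExecutor` pattern can be applied to image paths instead of tweet files. The sketch below is a hedged outline: `hash_path` is a hypothetical stand-in that hashes the path string with `hashlib` so that the example runs without the image data; in practice it would wrap a call such as `findHasherImproved.fromFile(path)`. `hash_many` and its parameters are likewise illustrative.

```python
import concurrent.futures
import hashlib


def hash_path(path):
    """Stand-in worker returning (path, digest); swap in a real image hash."""
    return path, hashlib.sha256(path.encode("utf-8")).hexdigest()


def hash_many(paths, worker=hash_path, max_workers=None, chunksize=16,
              executor_cls=concurrent.futures.ProcessPoolExecutor):
    """Apply worker to every path in parallel and return {path: hash}."""
    with executor_cls(max_workers=max_workers) as executor:
        # chunksize batches tasks per worker process to reduce
        # inter-process overhead; thread pools ignore it.
        results = executor.map(worker, paths, chunksize=chunksize)
        return dict(results)


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the pool.
    paths = [f"das_images/example_{i}.jpg" for i in range(8)]
    hashes = hash_many(paths)
    print(len(hashes))
```

With the spawn start method (the default on macOS and Windows), worker functions defined in a notebook cell generally cannot be located by the child processes, so the worker is best kept in an importable module.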


In [78]:
 findHasherImproved.fromFile(f"das_images/0040_10318987.jpg")

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

## Submission Instructions

Please produce a 3,500-word report in PDF format along with all code and its output. The code must be provided in an executable format (e.g., .py Python scripts or .ipynb Jupyter notebooks). The work must be submitted electronically via the Assignment Submission WebLearn Site before midday on Friday of Week 0 (15th January) of Hilary term.

If anything goes wrong with your submission, email msc@oii.ox.ac.uk immediately. In cases where a technical fault that is later determined to be a fault of the WebLearn system (and not a fault of your computer) prevents your submitting the assessment on time, having a time stamped email message will help the Proctors determine if your assessment will be accepted. Please note that you should not wait until the last minute to submit materials since WebLearn can run slowly at peak submission times and this is not considered a technical fault.

Full instructions on using WebLearn for electronic submissions can be found on Canvas.

Candidate Number and Cover Sheet: Remember to use the OII coversheet, stating clearly your candidate number, your course, assignment, title and word count. Your work should be identified ONLY by your candidate number (which can be found by visiting the online Student Self-Service facility).

Remember we are required under regulations to accept your FIRST submission so please make sure you are uploading the correct file.

### To optimize
* boxFilter
* fillFloatLumaFromBufferImage