For this problem set, we'll be using the Jupyter notebook:

![](jupyter.png)

## RDD exercises

In this notebook you will implement multiple small methods that process and analyze country, city and location data.

We will use a sample of the "allCountries.txt" data from http://download.geonames.org/export/dump/allCountries.zip.

You can test your functions in the cell below them. The variable `testFile` contains the data.

Read https://spark.apache.org/docs/latest/rdd-programming-guide.html for a guide about RDDs.

This notebook and all the following notebooks have hidden tests that check your solutions. If your answer does not pass them, try using different variable names in your solution: reusing the variable names that the hidden tests use can cause them to fail.

### Data schema

Name | Description
------ | :-----
geonameid         | integer id of record in geonames database  
name              | name of geographical point (utf8) varchar(200)  
asciiname         | name of geographical point in plain ascii characters, varchar(200)  
alternatenames    | alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)  
latitude          | latitude in decimal degrees (wgs84)  
longitude         | longitude in decimal degrees (wgs84)  
feature class     | see http://www.geonames.org/export/codes.html, char(1)  
feature code      | see http://www.geonames.org/export/codes.html, varchar(10)  
country code      | ISO-3166 2-letter country code, 2 characters  
cc2               | alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters  
admin1 code       | fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)  
admin2 code       | code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80)   
admin3 code       | code for third level administrative division, varchar(20)  
admin4 code       | code for fourth level administrative division, varchar(20)  
population        | bigint (8 byte int)   
elevation         | in meters, integer  
dem               | digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or   30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.  
timezone          | the iana timezone id (see file timeZone.txt) varchar(40)  
modification date | date of last modification in yyyy-MM-dd format  
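
Because each line of the file lists its columns in the order of the table above, a field can be picked out by its position. As a small sketch (the `COLUMNS` list and the `column_index` helper are illustrative names, not part of the assignment), the schema can be written as a Python list so that zero-based positions are looked up rather than counted by hand:

```python
# Column names in the order given by the schema table (zero-based positions).
COLUMNS = [
    "geonameid", "name", "asciiname", "alternatenames",
    "latitude", "longitude", "feature class", "feature code",
    "country code", "cc2", "admin1 code", "admin2 code",
    "admin3 code", "admin4 code", "population", "elevation",
    "dem", "timezone", "modification date",
]

def column_index(column_name):
    """Return the zero-based position of a column in a split line."""
    return COLUMNS.index(column_name)

print(column_index("name"), column_index("country code"), column_index("dem"))
# prints: 1 8 16
```

These positions are the ones to use when indexing the list produced by splitting a line on tabs.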



In [1]:
from pyspark import SparkContext, SparkConf
sc = SparkContext("local","GeoProcessor")
testFile = sc.textFile("allCountries_sample.txt")

22/09/30 19:02:44 WARN Utils: Your hostname, bigdata2022-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
22/09/30 19:02:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/09/30 19:02:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Extract Data
`extractData` removes unnecessary fields and splits the data so that the RDD looks like `RDD(Array("name", "countryCode", "dem"), ...)`.

Fields to include:  
* name  
* countryCode  
* dem (digital elevation model)  


param `data`: data set loaded into spark as RDD[String]  

`return`: RDD containing filtered location data. There should be an Array for each location.


Hint: first split each line into an array; columns are separated by the tab ("\t") character. Then take the appropriate fields. Fields are numbered by their position in the original data schema, counting from 0. Use positive indexes. Despite the method's name, you might only need the `map` function.
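
To illustrate the splitting step on its own, here is a minimal plain-Python sketch. The sample line is made up and has only five columns, unlike the real data:

```python
# A made-up line with five tab-separated columns; the third and fourth are empty.
sample_line = "42\tExample Place\t\t\t12.5"

fields = sample_line.split("\t")

# Empty columns are kept as empty strings, so later columns keep their positions.
print(len(fields))   # 5
print(fields[1])     # Example Place
print(fields[4])     # 12.5
```

Passing `"\t"` explicitly matters here: `split()` with no argument splits on any run of whitespace and drops empty fields, which would shift the positions of the later columns.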




In [2]:
def extractData(data):
    # split each tab-separated line into its columns
    result = data.map(lambda x: x.split("\t"))
    # keep name (1), country code (8) and dem (16)
    result = result.map(lambda x: [x[1], x[8], x[16]])
    return result

In [3]:
#Example print

extractData(testFile).take(5)

[['Tosa de la Llosada', 'AD', '2475'],
 ['Riu de la Llosada', 'AD', '1900'],
 ['Obaga de la Llosada', 'AD', '2300'],
 ['Emprius de la Llosada', 'AD', '2299'],
 ['Basers de la Llosada', 'AD', '2321']]

In [4]:
'''extractData tests'''
filtered = extractData(testFile)
testObject = filtered.collect()[1]
assert testObject[0] == "Riu de la Llosada", "the name value of the object was expected to be 'Riu de la Llosada' but it was %s" % testObject[0]
assert testObject[1] == "AD", "the country code value of the object was expected to be 'AD' but it was %s" % testObject[1]
assert testObject[2] == "1900", "the dem value of the object was expected to be 1900 but it was %s" % testObject[2]
assert len(testObject) == 3, "the length of the array was expected to be 3 but it was %s" % len(testObject)
assert type(testObject) is list, "the type of the RDD element was expected to be list but it was %s" % type(testObject)

## Filter Elevation

`filterElevation` filters an RDD to a given country code and returns an RDD containing only dem information. You will have to convert the dem information to `int` values.

param `countryCode`: country code, e.g. "AD"  
param `data`: an RDD containing multiple Array["name", "countryCode", "dem"] (as returned by the `extractData` function)  

`return`: RDD[int] containing only dem information related to the country code  


In [5]:
def filterElevation(countryCode, data):
    filtered = data.filter(lambda x: x[1] == countryCode)
    filtered = filtered.map(lambda x: x[2])
    filtered = filtered.map(lambda x: int(x))
    
    return(filtered)

In [6]:
#Example print

filterElevation("AD", extractData(testFile)).take(5)

[2475, 1900, 2300, 2299, 2321]

In [7]:
'''filterElevation tests'''
filtered = extractData(testFile)
first = filterElevation("SE", filtered).first()
assert type(first) is int, "the type of the RDD element was expected to be int but it was %s" % type(first)
assert first == 56, "the value of the RDD element was expected to be 56 but it was %s" % first
object = filterElevation("AD", filtered).collect()[4]
assert object == 2321, "the value of the RDD element was expected to be 2321 but it was %s" % object

## Elevation Average

`elevationAverage` calculates the average dem of a given dataset.  

param `data`: RDD[int] containing only dem information  

`return`: The average elevation  

Hint: use the `sum()` function instead of mapping and reducing.


In [8]:
def elevationAverage(data):
    # total elevation divided by the number of elements
    count = data.count()
    elev = data.sum()
    return elev / count

In [9]:
#Example print

elevationAverage(sc.parallelize(filterElevation("AD", extractData(testFile)).take(5)))

2259.0

In [10]:
'''elevationAverage tests'''
avg = elevationAverage(sc.parallelize([1, 2, 3 ,4])) 
assert abs(avg - 2.5) < 0.00001, "the average was expected to be 2.5 but it was %s" % avg
filtered = extractData(testFile)
elevations = filterElevation("AD", filtered)
avg2 = elevationAverage(elevations)
assert abs(avg2 - 1792.25) < 0.00001, "the average was expected to be 1792.25 but it was %s" % avg2
assert type(avg2) is float, "the type of the RDD element was expected to be float but it was %s" % type(avg2)

## Most Common Words

`mostCommonWords` counts how often each word occurs in place names and returns an RDD[(String,Int)]. You can assume that words are separated by a single space ' '.

param `data`: an RDD containing multiple Array["name", "countryCode", "dem"].  

`return`: RDD[(String,Int)] where the string is the word and the Int is the number of occurrences. The RDD should be in descending order (sorted by number of occurrences), e.g. ("hotel", 234), ("airport", 120), ("new", 12).

Example:  
Assume that the place name is "Andorra la Vella Heliport". We split the name so that we have 4 separate words: "Andorra", "la", "Vella" and "Heliport".
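
The same counting idea can be tried on a plain Python list before moving to RDDs. This is only an illustrative sketch with made-up place names. It uses `collections.Counter` instead of Spark operations, so it is not the solution expected by the tests:

```python
from collections import Counter

# Made-up place names, used only for illustration.
names = ["Andorra la Vella Heliport", "Hotel la Vella", "Hotel la Placa"]

# Split every name on single spaces and count each word.
word_counts = Counter(word for name in names for word in name.split(" "))

# Sort (word, count) pairs by count, largest first.
ranked = sorted(word_counts.items(), key=lambda pair: pair[1], reverse=True)

print(ranked[0])   # ('la', 3)
```

In Spark the corresponding steps are usually a `flatMap` to split the names, a `map` to `(word, 1)` pairs, a `reduceByKey` to add up the counts, and a sort on the count.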


In [57]:
def mostCommonWords(data):
    names = data.map(lambda x: x[0])
    
    # split each name into words on single spaces
    names = names.flatMap(lambda x: x.split(' '))
    # pair each word with a count of 1
    names = names.map(lambda x: (x, 1))
    # sum the counts for each distinct word
    names = names.reduceByKey(lambda x, y: x + y)
    # sort by count in descending order
    names = names.sortBy(lambda x: x[1], ascending=False)
    
    return(names)

In [58]:
#Example print

mostCommonWords(extractData(testFile)).take(5)

[('Hotel', 22), ('de', 15), ('la', 12), ('Hotell', 7), ('dels', 6)]

In [59]:
'''mostCommonWords tests'''
filtered = extractData(testFile)
words = mostCommonWords(filtered).collect()
first = words[0]
second = words[1]
third = words[2]
assert type(first[0]) is str, "the type of the first value in array was expected to be str but it was %s" % type(first[0])
assert type(first[1]) is int, "the type of the second value in array was expected to be int but it was %s" % type(first[1])
assert first[1] >= second[1], "the first element in RDD was expected to have more occurrences than the second"
assert first[0] == "Hotel", "the first element was expected to be named Hotel but it was %s" % first[0]
assert first[1] == 22, "the count of the first element was expected to be 22 but it was %s" % first[1]
assert third[0] == "la", "the third element was expected to be named 'la' but it was %s" % third[0]

## Most Common Country

`mostCommonCountry` tells which country has the most entries in the geolocation data. The name for a given country code can be found in countrycodes.csv. The columns in countrycodes.csv are separated by ",". More specifically, the file is structured like this:

Fiji,FJ  
Finland,FI  
France,FR  

param `data`: an RDD containing multiple Array["name", "countryCode", "dem"].  
param `codeData`: data from countrycodes.csv file  

`return`: the most common country as a String, e.g. Finland, or the empty string "" if countrycodes.csv doesn't have that entry.

Note: take into account that the countrycodes.csv file has a header.
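
As a plain-Python sketch of the header handling and the code lookup, consider the following. The header text `Name,Code` and the helper `country_name` are assumptions made for this illustration, since the real header of countrycodes.csv is not shown here:

```python
# Made-up file contents in the layout described above; the header text is an assumption.
lines = ["Name,Code", "Fiji,FJ", "Finland,FI", "France,FR"]

# Drop the header by comparing every line with the first one.
header = lines[0]
rows = [line for line in lines if line != header]

# Build a code -> name mapping. rsplit with maxsplit=1 splits on the last comma only,
# so a name that itself contains a comma would stay intact.
code_to_name = {}
for row in rows:
    name, code = row.rsplit(",", 1)
    code_to_name[code] = name

def country_name(code):
    """Return the country name for a code, or "" when the code is unknown."""
    return code_to_name.get(code, "")

print(country_name("FI"))        # Finland
print(repr(country_name("AA")))  # ''
```

Returning `""` for an unknown code mirrors the required behavior when countrycodes.csv has no entry for the most common code.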

In [299]:
countryCodes = sc.textFile("countrycodes.csv")

def mostCommonCountry(data, codeData):
    # filter out header
    header = codeData.first()
    codeData = codeData.filter(lambda x: x!= header)
    
    # build (country code, country name) pairs
    codes = codeData.map(lambda x: x.split(','))
    codes = codes.map(lambda x: (x[1], x[0]))

    # count the entries per country code
    counts = data.map(lambda x: (x[1], 1))
    counts = counts.reduceByKey(lambda x, y: x + y)
    # sort by count in descending order
    counts = counts.sortBy(lambda x: x[1], ascending=False)

    # look up the name of the most common code; lookup returns a list of matches
    mostCommon = counts.first()
    names = codes.lookup(mostCommon[0])
    return names[0] if len(names) > 0 else ""

In [300]:
#Example print

mostCommonCountry(extractData(testFile), countryCodes)

'Sweden'

In [301]:
'''mostCommonCountry tests'''
filtered = extractData(testFile)
mostCommon = mostCommonCountry(filtered, countryCodes)
assert type(mostCommon) is str, "the type of the returned object was expected to be str but it was %s" % type(mostCommon)
assert mostCommon == "Sweden", "the most common was expected to be Sweden but it was %s" % mostCommon
mostCommon2 = mostCommonCountry(sc.parallelize(filtered.take(40)), countryCodes)
assert mostCommon2 == "Andorra", "the most common was expected to be Andorra but it was %s" % mostCommon2
false = sc.parallelize([["a", "AA", 123], ["b", "AA", 1234]])
assert mostCommonCountry(false, countryCodes) == "", "The method was expected to return empty when called with false data"

## Hotels In Area

`hotelsInArea` determines how many hotels there are within 10 km (<= 10000.0 meters) of a given latitude and longitude.
Use the Haversine formula ( https://en.wikipedia.org/wiki/Haversine_formula ). The Earth's radius is 6371000 meters.

In this exercise you should use the asciiname field instead of name. Start by reading the data and extracting the correct fields (asciiname, latitude, longitude), similarly to the `extractData` function. Then use the Haversine formula to keep only the places within a 10 km radius of the given latitude and longitude. You will probably want a helper function; Python lets you define functions inside functions. Finally, keep only the places whose name contains the word "hotel". A location is a hotel if its name contains the word "hotel" in any combination of uppercase and lowercase letters (for instance "Hotel" or "hOtel"). Multiple different hotels can exist in the same location, but the same hotel (exact same name and location) should not be counted more than once; the `distinct` function can help with that.
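
The name matching and de-duplication rules can be checked on a small plain-Python example first. The records and the helper `is_hotel` below are made up for illustration, and a Python set stands in for Spark's `distinct`:

```python
# Made-up (asciiname, latitude, longitude) records, used only for illustration.
places = [
    ("Grand Hotel", 59.33, 18.07),
    ("Grand Hotel", 59.33, 18.07),    # exact duplicate, should be counted once
    ("hOtel Central", 59.34, 18.06),
    ("Central Station", 59.33, 18.05),
]

def is_hotel(name):
    """True if the name contains "hotel" in any mix of upper and lower case."""
    return "hotel" in name.lower()

# A set keeps one copy of each identical (name, latitude, longitude) tuple.
unique_hotels = {place for place in places if is_hotel(place[0])}

print(len(unique_hotels))   # 2
```

Lowercasing the name before the substring test is one simple way to make the match case-insensitive.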

Note that both latitude and longitude in the data are in decimal degrees ( https://en.wikipedia.org/wiki/Decimal_degrees ), so you have to convert them to radians. They should also be converted to double values first, e.g. `math.radians(float(x))`.
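
The conversion and the distance formula can be combined into one helper. The following is a minimal sketch in plain Python rather than the required Spark solution. The names `haversine` and `EARTH_RADIUS_M` are illustrative assumptions, and the function takes decimal degrees and converts them to radians internally:

```python
import math

EARTH_RADIUS_M = 6371000  # Earth radius in meters, as given in the exercise

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2) - math.radians(lon1)
    # Haversine term: sin^2(d_phi / 2) + cos(phi1) * cos(phi2) * sin^2(d_lambda / 2)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# One degree of latitude along a meridian is roughly 111 km.
print(round(haversine(0.0, 0.0, 1.0, 0.0)))
```

A place then counts as being in the area when the returned distance is at most 10000.0 meters.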

param `lat`: latitude as Double  
param `long`: longitude as Double  
param `data`: the original data set loaded into spark as RDD[String].  

`return`: number of hotels in area


In [344]:
import math

def hotelsInArea(lat, long, data):
    lat1, long1 = math.radians(lat), math.radians(long)

    # Haversine distance in meters from (lat1, long1); all arguments in radians
    def distance(lat2, long2):
        a = math.sin((lat2 - lat1) / 2) ** 2 \
            + math.cos(lat1) * math.cos(lat2) * math.sin((long2 - long1) / 2) ** 2
        return 2 * 6371000 * math.asin(math.sqrt(a))

    # keep asciiname (2), latitude (4) and longitude (5); coordinates in radians
    places = data.map(lambda x: x.split("\t"))
    places = places.map(lambda x: (x[2], math.radians(float(x[4])), math.radians(float(x[5]))))
    hotels = places.filter(lambda x: "hotel" in x[0].lower() and distance(x[1], x[2]) <= 10000.0)
    return hotels.distinct().count()

In [345]:
#Example print
hotelsInArea(59.334591, 18.063240, testFile)

In [329]:
'''hotelsInArea tests'''
a1 = hotelsInArea(42.5423, 1.5977, testFile)
a2 = hotelsInArea(59.334591, 18.063240, testFile)
a3 = hotelsInArea(63.8532, 15.5652, testFile)
assert a1 == 0, "the number of hotels was expected to be 0 but it was %s" % a1
assert a2 == 3, "the number of hotels was expected to be 3 but it was %s" % a2
assert a3 == 1, "the number of hotels was expected to be 1 but it was %s" % a3


In [None]:
sc.stop()