### Task 1

Using the Chicago crimes dataset from above, create:
1. an RDD holding a distinct list of the column "Primary type" (30 rows total including heading)
2. an RDD holding the same distinct list, together with a count of each primary type, e.g. ('HOMICIDE', 48)

The dataset can be found at "mnt/training/Chicago-Crimes-2018.csv"

In [0]:
# first RDD

crimes_rdd = sc.textFile("mnt/training/Chicago-Crimes-2018.csv")

# Use this to check which column holds "Primary Type". It is the 6th column (index 5).
# (crimes_rdd
#  .map(lambda x: x.split("\t"))
#  .take(1)
# )


# Create the RDD (remember: transformations such as map() and distinct() return a new RDD, while actions such as count() or take() return plain Python values to the driver)
distinct_list_rdd = (crimes_rdd
 .map(lambda x: x.split("\t")[5]) # split() returns a list, so we can index it directly to keep only the column we are interested in
 .distinct()
)

# Validate the count of distinct values in this column
count_of_distinct = distinct_list_rdd.count()

print(f"Distinct rows including heading: {count_of_distinct}")

# Validate that the type is RDD
print(f"The type is: {type(distinct_list_rdd)}")



Distinct rows including heading: 30
The type is: <class 'pyspark.rdd.PipelinedRDD'>


In [0]:
# second RDD

# Create the RDD
rdd_with_counts = (crimes_rdd
 .map(lambda x: x.split("\t")[5])
 .map(lambda x: (x,1)) # x is string type. We now turn it into a tuple, where we hold the string (as key) and integer 1 (as value) for summing up counts.
 .reduceByKey(lambda x,y: x+y) # we group based on key (string) and sum all values where the key is the same
 )

# Let's validate how it looks:
rdd_with_counts.take(30)

Out[2]: [('Primary Type', 1),
 ('HOMICIDE', 48),
 ('BURGLARY', 1037),
 ('BATTERY', 4076),
 ('CRIMINAL DAMAGE', 2178),
 ('ROBBERY', 1018),
 ('KIDNAPPING', 19),
 ('WEAPONS VIOLATION', 439),
 ('NARCOTICS', 1050),
 ('CRIMINAL TRESPASS', 676),
 ('PUBLIC PEACE VIOLATION', 96),
 ('LIQUOR LAW VIOLATION', 27),
 ('PROSTITUTION', 36),
 ('OBSCENITY', 9),
 ('NON-CRIMINAL', 3),
 ('HUMAN TRAFFICKING', 1),
 ('DECEPTIVE PRACTICE', 1380),
 ('THEFT', 5247),
 ('ASSAULT', 1624),
 ('OTHER OFFENSE', 1490),
 ('OFFENSE INVOLVING CHILDREN', 196),
 ('MOTOR VEHICLE THEFT', 1138),
 ('CRIM SEXUAL ASSAULT', 115),
 ('STALKING', 17),
 ('SEX OFFENSE', 47),
 ('INTERFERENCE WITH PUBLIC OFFICER', 99),
 ('ARSON', 29),
 ('GAMBLING', 5),
 ('INTIMIDATION', 4),
 ('CONCEALED CARRY LICENSE VIOLATION', 4)]
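
The `(key, 1)` plus `reduceByKey` pattern used above can be traced locally without a cluster. The following is a minimal plain-Python sketch, not Spark code: the sample lines and the helper name `count_by_key` are invented for illustration, and only the column index 5 is taken from the cell above.

```python
from functools import reduce

# Hypothetical tab-separated sample; only column index 5 ("Primary Type") matters here.
sample_lines = [
    "1\tA\tB\tC\tD\tTHEFT\tX",
    "2\tA\tB\tC\tD\tHOMICIDE\tX",
    "3\tA\tB\tC\tD\tTHEFT\tX",
]


def count_by_key(pairs):
    """Mimic reduceByKey(lambda x, y: x + y) on a list of (key, value) pairs."""
    def merge(acc, pair):
        key, value = pair
        # Combine values that share a key, exactly as the reducer lambda would.
        acc[key] = acc[key] + value if key in acc else value
        return acc
    return reduce(merge, pairs, {})


# Same two map steps as in the Spark pipeline: pick the column, then pair it with 1.
pairs = [(line.split("\t")[5], 1) for line in sample_lines]
counts = count_by_key(pairs)
print(sorted(counts.items()))
```

Running the same steps on a few hand-written lines is a quick way to confirm the column index and the reducer before applying them to the full file.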

### Task 2

Using the jsonplaceholder data, create an RDD that would return username and proportion (decimal percentage) of todos done.

Endpoints:
* https://jsonplaceholder.typicode.com/users
* https://jsonplaceholder.typicode.com/todos

Expected output is a sorted RDD, descending by proportion.<br>
Example take(3):<br>
[('Moriah.Stanton', 0.6),<br>
 ('Kamren', 0.6),<br>
 ('Maxime_Nienow', 0.55)]<br>

In [0]:
import requests

users_path = "https://jsonplaceholder.typicode.com/users"
todos_path = "https://jsonplaceholder.typicode.com/todos"

users_resp = requests.get(users_path)
todos_resp = requests.get(todos_path)

users_rdd = sc.parallelize(users_resp.json())
todos_rdd = sc.parallelize(todos_resp.json())

# Let's see how the data looks
# users_rdd.take(1) # We need the "username" field, plus a field to join on - most likely "id"
# todos_rdd.take(1) # We need the "completed" field here. The matching join field is probably "userId"

# We can create the RDDs in steps, e.g.:
# users_rdd_for_joining = users_rdd.map(lambda x: (x["id"], x["username"]))
# todos_rdd_for_joining = todos_rdd.map(lambda x: (x["userId"], x["completed"]))

# This is good practice while you are still setting up your pipeline.

# In the long term you will probably want something more concise, so you can chain everything together:
todos_done_rdd = (users_rdd.map(lambda x: (x["id"], x["username"]))
 .join(
    todos_rdd.map(lambda x: (x["userId"], x["completed"]))
  ) # we now have a tuple (key-value), where the first element is "userId", the second element is a tuple of "username" and "completed"
 .map(lambda x: (x[1][0],(1, 1)) if x[1][1] else (x[1][0], (0, 1))) # we create a new tuple, with username as key, and value as a tuple with "completed" (1=True, 0=False) and count
 .reduceByKey(lambda x,y: (x[0]+y[0],x[1]+y[1])) # we sum the tuples element-wise per username, so we now have the number of completed tasks and the total number of tasks
 .mapValues(lambda x: x[0]/x[1]) # we divide completed/total to get the proportion
 .sortBy(lambda x: (x[1]), ascending=False) # we sort by the proportion in a descending manner
)

# And we can validate the result:
todos_done_rdd.take(20)

Out[3]: [('Moriah.Stanton', 0.6),
 ('Kamren', 0.6),
 ('Maxime_Nienow', 0.55),
 ('Bret', 0.55),
 ('Elwyn.Skiles', 0.45),
 ('Antonette', 0.4),
 ('Delphine', 0.4),
 ('Samantha', 0.35),
 ('Karianne', 0.3),
 ('Leopoldo_Corkery', 0.3)]
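
The join-then-aggregate logic can also be traced on a handful of records in plain Python. This is a sketch under assumptions: the records below are invented (they are not the jsonplaceholder data), and `proportion_done` is a hypothetical helper that mirrors the `join`, `reduceByKey`, `mapValues`, and `sortBy` steps rather than calling Spark.

```python
def proportion_done(users, todos):
    """Return [(username, completed / total)] sorted by proportion, descending."""
    # The lookup table plays the role of the join on id == userId.
    username_by_id = {u["id"]: u["username"] for u in users}

    totals = {}  # username -> (completed_count, total_count)
    for todo in todos:
        username = username_by_id.get(todo["userId"])
        if username is None:
            continue  # inner join: drop todos without a matching user
        done, total = totals.get(username, (0, 0))
        totals[username] = (done + (1 if todo["completed"] else 0), total + 1)

    proportions = [(name, done / total) for name, (done, total) in totals.items()]
    return sorted(proportions, key=lambda pair: pair[1], reverse=True)


# Invented sample records with the same field names as the API responses.
users = [{"id": 1, "username": "alice"}, {"id": 2, "username": "bob"}]
todos = [
    {"userId": 1, "completed": True},
    {"userId": 1, "completed": False},
    {"userId": 2, "completed": True},
    {"userId": 2, "completed": True},
]
print(proportion_done(users, todos))
```

Carrying a `(completed, total)` pair per user, as the Spark version does, keeps the aggregation to a single pass; the division happens only once per user at the end.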