## CSE 255 Programming Assignment 1: Collinear points

For this problem set, we'll be using a Jupyter notebook.

### In this programming assignment, you will write Python 3 code using pyspark to find sets of collinear points among an arbitrary number of 2D points.

Definition of collinearity[1]: In geometry, collinearity of a set of points is the property of their lying on a single line. A set of points with this property is said to be collinear.

![](non-collinear-points.jpg)

Here, points P, Q, R and A, R, B are collinear. However, points A, B, C are non-collinear. For more, see [2].

Given an input file with a set of coordinates, your task is to use pyspark library functions to write a program that finds every set of three or more collinear points.

For instance, if given these points: {(1,1), (0,1), (2,2), (3,3), (0,5), (3,4), (5,6), (0,-3), (-2,-2)}

Sets of collinear points are: {((-2,-2), (1,1), (2,2), (3,3)), ((0,1), (3,4), (5,6)), ((0,-3), (0,1), (0,5))}. Note that the ordering of the points in a set or the order of the sets does not matter. 

Note: Every set of collinear points must have at least three points (a single point or a pair of points always lies on a line).
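
To make the definition concrete, here is a minimal plain-Python check for three points. It uses the cross-product test rather than the slope-based approach this assignment prescribes, and the function name `are_collinear` is an illustrative assumption, not part of the notebook.

```python
def are_collinear(a, b, c):
    """Return True if the 2D points a, b, c lie on a single line."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    # The cross product of vectors AB and AC is zero exactly when they are parallel,
    # which for vectors sharing the point A means A, B, C are on one line.
    return (bx - ax) * (cy - ay) - (by - ay) * (cx - ax) == 0

print(are_collinear((-2, -2), (1, 1), (3, 3)))  # True
print(are_collinear((0, 1), (3, 4), (5, 6)))    # True
print(are_collinear((1, 1), (0, 1), (2, 2)))    # False
```

With integer coordinates this test involves no division, so it also handles vertical lines without a special case.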

1. https://en.wikipedia.org/wiki/Collinearity
2. http://www.mathcaptain.com/geometry/collinear-points.html

### Initialize spark context using 4 local cores as workers.
Note that we can create a SparkConf() object and use it to initialize the spark context. 

In [21]:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("Collinear Points").setMaster("local[4]")
# getOrCreate() returns the running context if one exists, so re-running this cell
# does not fail with "Cannot run multiple SparkContexts at once".
sc = SparkContext.getOrCreate(conf=conf)

### Guidelines for implementation

The goal of this assignment is to make you familiar with programming using pyspark. There are many ways to find sets of collinear points from a list of points. For the purposes of this assignment, we will use the following approach:

1. Find the Cartesian product of the list of points with itself, then discard pairs whose two points are identical. For example, given three points [(1,0), (2,0), (3,0)], we construct [((1,0), (2,0)), ((2,0), (1,0)), ((1,0), (3,0)), ((3,0), (1,0)), ((2,0), (3,0)), ((3,0), (2,0))]. The pairs ((1,0),(1,0)), ((2,0),(2,0)) and ((3,0),(3,0)) are deliberately excluded because each repeats a single point.

2. Use this intermediate result to find the slope of the line segment connecting each pair of points. If two pairs have the same slope and share a point, the three points involved must be collinear. For example, in [((1,0), (2,0)), ((2,0), (1,0)), ((1,0), (3,0)), ((3,0), (1,0)), ((2,0), (3,0)), ((3,0), (2,0))], the slope between (1,0) and (2,0) is 0, and the slope between (1,0) and (3,0) is also 0. Since (1,0) is common to both pairs, ((1,0), (2,0), (3,0)) must be collinear.
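
The two steps above can be sketched in plain Python, without Spark, to see what the pipeline should produce. This is only an illustration under assumptions: `collinear_sets` is a hypothetical name, and `itertools.product` and a dictionary stand in for the RDD `cartesian` and `groupByKey` operations.

```python
from collections import defaultdict
from itertools import product

def collinear_sets(points):
    # Step 1: Cartesian product of the points with themselves, minus pairs (A, A).
    pairs = [(a, b) for a, b in product(points, repeat=2) if a != b]

    # Step 2: group every B under the key (A, slope of AB).
    groups = defaultdict(list)
    for (ax, ay), (bx, by) in pairs:
        slope = "inf" if bx == ax else (by - ay) / (bx - ax)
        groups[((ax, ay), slope)].append((bx, by))

    # A key with two or more partners means A plus those partners share a line.
    # Sorting each result and collecting into a set removes the repeats that
    # arise because every point on a line acts as the anchor A once.
    return {tuple(sorted(bs + [a])) for (a, _), bs in groups.items() if len(bs) > 1}

print(collinear_sets([(1, 0), (2, 0), (3, 0)]))
```

For the three points in the example, this yields the single set ((1, 0), (2, 0), (3, 0)).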


Keeping the above technique in mind, you will complete the leftover parts of this notebook. You are required to use pyspark's map, reduce, groupby and other library functions to do so.

### Below are some helper functions that will be used in your implementations. You are neither required to nor encouraged to change the definitions of these functions.

In [65]:
def to_tuple(x):
    """
    Converts each point of form 'Ax Ay' (whitespace-separated integers) into a point of form (Ax, Ay) for further processing.
    """
    arr = x.split()
    return tuple([int(element) for element in arr])

def non_duplicates(x):
    """  
    input: Pair (A,B) where A and B are of form (Ax, Ay) and (Bx, By) respectively.
    Returns True if A != B, False otherwise.
    
    Use this function inside the get_cartesian() function to filter out pairs with duplicate points. 
    """
    return x[0] != x[1]

def format_result(x):
    """
    input: ((A,slope), [C1,..., Ck]) where each of A, C1,..., Ck is a point of form (Ax, Ay) and slope is a float.
    output: (C1,..., Ck, A)
    
    Concatenates collinear points.
    """
    x[1].append(x[0][0])
    return tuple(x[1])

def to_sorted_points(x):
    """
    Sorts and returns a tuple of points for further processing.
    """
    return tuple(sorted(x))
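
As a quick illustration of the data shapes these helpers expect, the sketch below copies the bodies of `to_tuple` and `format_result` so it runs on its own. The sample record is taken from the kind of `((A, slope), [C1, ..., Ck])` value described in the docstring; nothing here is required for the assignment.

```python
def to_tuple(x):
    # Same body as the helper above: split on whitespace, convert each field to int.
    return tuple([int(element) for element in x.split()])

def format_result(x):
    # Same body as the helper above: append the anchor point A to its partner list.
    x[1].append(x[0][0])
    return tuple(x[1])

point = to_tuple("-2 -2")
record = (((1, 1), 1.0), [(2, 2), (3, 3), (-2, -2)])
merged = format_result(record)

print(point)                  # (-2, -2)
print(merged)                 # ((2, 2), (3, 3), (-2, -2), (1, 1))
print(tuple(sorted(merged)))  # ((-2, -2), (1, 1), (2, 2), (3, 3))
```

Note that `format_result` appends to the list in place, so the list inside `record` is modified as a side effect.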

In [66]:
def get_cartesian(rdd):
    """
    Does a Cartesian product of an RDD with itself and returns an RDD with DISTINCT pairs of points.
    
    Refer:  http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=cartesian#pyspark.RDD.cartesian
    """
    
    ### BEGIN SOLUTION
    rdd = rdd.cartesian(rdd)
    
    # Remove pairs made of the same point twice, for example ((1,0), (1,0))
    rdd = rdd.filter(non_duplicates)
    
    return rdd
    ### END SOLUTION

In [67]:

def find_slope(x):
    """
    input: Pair (A,B) where A and B are of form (Ax, Ay) and (Bx, By) respectively.
    output: Pair ((A,slope), B), where A and B have the same definition as in the input, and slope is the slope of the line segment connecting points A and B.
    
    Computes slope between points A and B.
    """
    ### BEGIN SOLUTION
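    # Vertical line (Ax == Bx): the slope is undefined, so use the sentinel string "inf".
    if x[1][0] - x[0][0] == 0:
        slope = "inf"
    else:
        # Multiply by 1.0 to force float division (rise over run).
        slope = 1.0 * (x[1][1] - x[0][1])/(x[1][0] - x[0][0])
    # Key on (A, slope) so that groupByKey collects every B on the same line through A.
    return ((x[0], slope), (x[1]))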
    ### END SOLUTION



def find_collinear(rdd):
    
    """
    1. Find the slope of the line between all pairs of points A = (Ax, Ay) and B = (Bx, By).
    2. For each (A, B), find all points C = ((C1x, C1y), (C2x, C2y), ... (Cnx, Cny)) 
       where slope of (A, B) = slope of (A, Ci).
    3. Return (A, B, Ck) where Ck = all points of C which satisfy the condition in step 2.
    
    Call func format_result() from inside this function.
    TODO: to see if definition can be improved.
    """
    ### BEGIN SOLUTION
    rdd = rdd.map(find_slope).groupByKey().mapValues(list).filter(lambda x: len(x[1]) > 1)
    
    rdd = rdd.map(format_result)
    ### END SOLUTION
    return rdd
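
Because `find_slope` groups on a float, two different lines through the same point could in principle round to the same key when coordinates are very large. One possible alternative, shown here only as a sketch and not as part of the required solution, is an exact rational slope. It assumes integer coordinates as produced by `to_tuple`; `exact_slope` is a hypothetical helper name.

```python
from fractions import Fraction

def exact_slope(a, b):
    """Slope of the line through a and b as an exact Fraction, or None if vertical."""
    (ax, ay), (bx, by) = a, b
    if bx == ax:
        return None  # vertical line: slope undefined
    # Fraction reduces to lowest terms and normalizes the sign, so equal slopes
    # compare equal and hash equal regardless of which pair produced them.
    return Fraction(by - ay, bx - ax)

print(exact_slope((0, 0), (3, 1)) == exact_slope((0, 0), (6, 2)))            # True
print(exact_slope((0, 0), (1, 1)) == exact_slope((0, 0), (2**53, 2**53 + 1)))  # False
print(exact_slope((0, 1), (0, 5)))                                           # None
```

The trade-off is that `Fraction` arithmetic is slower than float division, which may matter for large inputs.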

### The cell below runs all of the above functions step by step. Again, you are neither required nor encouraged to change the definitions of these functions.

In [68]:
def get_sorted(x):
    return tuple(sorted(x))

def execute(filename):
    """
    Computes the set of collinear points
    """
    
    # Load the data file into an RDD
    rdd = sc.textFile(filename)
    
    # Transform the loaded points into tuples
    rdd = rdd.map(to_tuple)
    
    # Transform the RDD to now have the cartesian product of the RDD with itself
    rdd = get_cartesian(rdd)
    
    # Transform the RDD to now just have the Collinear Points 
    rdd = find_collinear(rdd)
    
    # Sorting each of your returned sets of collinear points. This is for grading purposes. You may ignore this.
    rdd = rdd.map(to_sorted_points)
    
    # Collecting the collinear points in a set. This is for grading purposes. You may ignore this.
    res = set(rdd.collect())
    
    return res
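
`execute` reads one point per line and passes each line to `to_tuple`, which implies a file of whitespace-separated integer pairs. The sketch below writes such a file and parses it with plain Python instead of `sc.textFile`. The file name `points.txt` and its contents are assumptions for illustration; the actual `data.txt` is not shown in this notebook.

```python
import os
import tempfile

def to_tuple(x):
    # Same body as the helper above.
    return tuple([int(element) for element in x.split()])

# Assumed file layout: one "x y" pair per line.
lines = ["1 1", "0 1", "2 2", "3 3", "0 5", "3 4", "5 6", "0 -3", "-2 -2"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "points.txt")  # hypothetical file name
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    with open(path) as f:
        # split() ignores the trailing newline on each line.
        points = [to_tuple(line) for line in f]

print(points)
```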

In [69]:
"""Check that the execute function returns the correct output for the sample input file"""
assert execute("data.txt") == {((-2, -2), (1, 1), (2, 2), (3, 3)), ((0, 1), (3, 4), (5, 6)), ((0, -3), (0, 1), (0, 5))}
# assert execute("test.txt") == {((1, 0), (2, 0), (3, 0))}
### BEGIN HIDDEN TESTS

### END HIDDEN TESTS

[(((0, 1), 'inf'), [(0, 5), (0, -3)]), (((0, 5), 'inf'), [(0, 1), (0, -3)]), (((0, 1), 1.0), [(3, 4), (5, 6)]), (((3, 4), 1.0), [(0, 1), (5, 6)]), (((5, 6), 1.0), [(0, 1), (3, 4)]), (((0, -3), 'inf'), [(0, 1), (0, 5)]), (((1, 1), 1.0), [(2, 2), (3, 3), (-2, -2)]), (((3, 3), 1.0), [(1, 1), (2, 2), (-2, -2)]), (((2, 2), 1.0), [(1, 1), (3, 3), (-2, -2)]), (((-2, -2), 1.0), [(1, 1), (2, 2), (3, 3)])]


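In [70]:
"""Check that the execute function raises an error for invalid input files"""
# NOTE: data.txt is a valid input, so execute() does not raise here and this check
# fails with AssertionError; a malformed input file is needed to exercise the error path.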
try:
    execute("data.txt")
except ValueError:
    pass
else:
    raise AssertionError("did not raise")


AssertionError: did not raise
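
For reference, here is a plain-Python sketch of which input lines make `to_tuple` raise `ValueError`. The helper `parses_cleanly` is hypothetical and exists only for this illustration.

```python
def to_tuple(x):
    # Same body as the helper above.
    return tuple([int(element) for element in x.split()])

def parses_cleanly(line):
    """Return True if to_tuple accepts the line, False if it raises ValueError."""
    try:
        to_tuple(line)
    except ValueError:
        return False
    return True

print(parses_cleanly("3 4"))     # True
print(parses_cleanly("3, 4"))    # False: int("3,") is not a valid integer literal
print(parses_cleanly("3 four"))  # False
print(to_tuple(""))              # (): a blank line yields an empty tuple, not an error
```

When the same function runs inside a Spark job, the failure may reach the driver wrapped in a Spark exception rather than as a bare `ValueError`, so it is worth checking what `execute` actually raises before relying on `except ValueError`.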