**Goals**: In this assignment you will practise basic tools and create 
building blocks that will be useful in your final projects. The assignment
covers the material taught in the first four weeks: 
regular expressions, the basics of text processing, and functional programming concepts. 
Your solutions will be reused in Assignment 2 to solve two data processing 
problems on the Spark platform.

## 1: Regular Expressions
Write a regular expression pattern matching a _valid URL_. For the purposes of this exercise, a valid URL is any string of the form `protocol://domain/optional_file_path/optional_file_name`, where

   * `protocol` is one of `file`, `http`, `https`, or `ftp`.
   * `domain` is a sequence of labels separated by a single `.` (dot) character where each label is a combination of alphanumeric (i.e., both letters and numbers) characters in either lower or upper case, and the rightmost label representing the top-level domain is not all numbers.
   * `optional_file_path` is a (potentially empty) sequence of labels separated by a `/` (forward slash) character, where each label is a combination of alphanumeric characters in either lower or upper case, and hyphens (`-`).
   * `optional_file_name` is a (potentially empty) sequence of at most two labels separated by a `.` (dot) character, where each label is a combination of alphanumeric characters in either lower or upper case and hyphens (`-`).

For example, all of the following strings are valid URLs: `https://my.Domain.com/some/file.html`, `ftp://com/my-file.json`, `http://123.456.12a/`, `http://bigdata`, and `http://cs5234.rhul.ac.uk/sub-dir/`, whereas `http://234.345`, `http://rhul.ac.uk/my.long.filename.html`, `http://.bigdata/`, and `http://big..data` are not.

In [391]:
import re

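# Put your pattern inside ''
# Domain: dot-separated alphanumeric labels; the last label must contain a letter.
# Then, optionally: '/', directory labels each ending in '/', and a file name of one or two labels.
url_regex = r'(?:https?|ftp|file)://(?:[a-zA-Z0-9]+\.)*[a-zA-Z0-9]*[a-zA-Z][a-zA-Z0-9]*(?:/(?:[a-zA-Z0-9-]+/)*(?:[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)?)?)?'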


Your solution is correct if `re.compile(url_regex).fullmatch(s)` returns a value other than
`None` for every string `s` that is a valid URL according to 
the definition above, and `None` for every other string.
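
One way to sanity-check a candidate pattern is to run it against the example strings listed above. The sketch below assumes a hypothetical helper, `misclassified`, and a deliberately over-permissive placeholder pattern, `too_loose`; neither is part of the assignment, and the placeholder is not a solution.

```python
import re

# Example strings taken from the exercise text above.
VALID = [
    'https://my.Domain.com/some/file.html',
    'ftp://com/my-file.json',
    'http://123.456.12a/',
    'http://bigdata',
    'http://cs5234.rhul.ac.uk/sub-dir/',
]
INVALID = [
    'http://234.345',
    'http://rhul.ac.uk/my.long.filename.html',
    'http://.bigdata/',
    'http://big..data',
]

def misclassified(pattern, valid, invalid):
    """Hypothetical helper: return the strings that `pattern` gets wrong under fullmatch."""
    compiled = re.compile(pattern)
    missed = [s for s in valid if compiled.fullmatch(s) is None]
    wrongly_accepted = [s for s in invalid if compiled.fullmatch(s) is not None]
    return missed + wrongly_accepted

# Placeholder only: checks the protocol and accepts anything after '://'.
too_loose = r'(?:file|https?|ftp)://.*'
print(misclassified(too_loose, VALID, INVALID))
```

With the placeholder, the helper reports all four invalid examples as wrongly accepted. A correct pattern should produce an empty list, although passing these nine examples alone does not prove correctness.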

## 2: Regular Expressions
Write a regular expression pattern matching any string consisting of _fields_ separated by _commas_. A field may include any printable character except whitespace and commas. 
A valid string must start and end with a field. 
For example, the strings `'ab1c,de_f,xyz'`, `'ab1c,de_%^f,xyz'`, and `'abc'` 
are valid, whereas the strings `'ab1c,, de_f'` and 
`'ab1c,de_f, xyz,'` are not.

In [249]:
# Put your pattern inside ''
# One non-empty field, then zero or more groups of exactly one comma plus a field.
csv_regex = r'^[^\s,]+(?:,[^\s,]+)*$'

Your solution is correct if `re.compile(csv_regex).fullmatch(s)` returns a value other than
`None` for every string `s` that is valid according to 
the definition above, and `None` for every other string.
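
Because correctness is judged with `fullmatch`, a pattern that only matches a prefix of the string is not enough. The following minimal sketch illustrates the difference between `match` and `fullmatch`; the pattern `single_field` is a hypothetical one that describes a single field only, not the full answer.

```python
import re

# Hypothetical pattern for ONE field only (not the full answer).
single_field = re.compile(r'[^\s,]+')

s = 'ab1c,de_f'

prefix = single_field.match(s)           # anchors at the start only
whole = single_field.fullmatch(s)        # must consume the entire string
one = single_field.fullmatch('ab1c')

print(prefix.group())   # ab1c
print(whole)            # None, because ',' is not part of a field
print(one.group())      # ab1c
```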

## 3: Generator Functions
Write a generator function `gen_seq_from_csv_string(s)` that takes as argument a string `s` matching the regular expression pattern described by `csv_regex` and produces the sequence of values extracted from `s`. For example, `gen_seq_from_csv_string('ab1c,de_f,xyz')` will yield the values 
`'ab1c'`, `'de_f'`, `'xyz'`, one at a time and in that order.
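
As a reminder of how a generator function delivers its values, here is a small sketch. The helper `gen_tokens` and its `sep` parameter are hypothetical and chosen only for illustration; the point is that each `yield` hands one value to the caller, which can pull values lazily with `next` or drain them with `list`.

```python
def gen_tokens(s, sep):
    """Hypothetical helper: yield the pieces of s between occurrences of sep."""
    for piece in s.split(sep):
        yield piece  # one value per yield, not the whole list at once

g = gen_tokens('a;b;c', ';')
first = next(g)   # pulls only the first value
rest = list(g)    # drains whatever is left
print(first, rest)
```

Yielding a whole list in a single `yield` would instead produce a sequence with exactly one element (the list itself), which is usually not what the caller expects.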

In [250]:
def gen_seq_from_csv_string(s):
    '''
    s: a string matching the pattern stored in csv_regex
    Yields the values extracted from s, one at a time.
    '''
    # Split on commas and yield each non-empty field individually.
    yield from (field for field in s.split(',') if field)

## 4: Lambda Expressions
Write the following lambda expressions:
1. `valid_url`: takes a string `s` as argument and returns `True` if `s` 
matches `url_regex`, and `False` otherwise.
2. `concat_csv_strings`: takes two strings `s1` and `s2` as arguments and 
returns a single string consisting of `s1` and `s2` separated by a comma. For example, if
the strings
`'ab1c,de_f,xyz'` and `'ab1c,de_%^f,xyz'` are given as arguments, the output must be the string
`'ab1c,de_f,xyz,ab1c,de_%^f,xyz'`.
3. `val_by_vec`: takes an object `x` and a sequence of objects `seq`, and returns a sequence
(i.e., an iterator) of tuples `(x, seq[0]), (x, seq[1]), ...`.<br>
_Hint_: Use a generator expression.
4. `not_self_loop`: takes a 2-tuple `(a, b)` and returns `True` if `a != b`, and `False` otherwise.

In [251]:
# Replace the right-hand side of each lambda with your code
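# fullmatch, so that the whole string (not just a prefix) must be a valid URL
valid_url = lambda s: re.fullmatch(url_regex, s) is not None

concat_csv_strings = lambda s1, s2: s1 + ',' + s2

# Generator expression: returns a lazy iterator rather than a list
val_by_vec = lambda x, seq: ((x, a) for a in seq)

not_self_loop = lambda t: t[0] != t[1]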
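
The sketch below suggests how three of these lambdas might be combined with `functools.reduce` and `filter`, in the functional style the assignment is building towards. The definitions are repeated so that the block runs on its own, and the input strings are made up for illustration.

```python
from functools import reduce

# Repeated here so the block is self-contained; behaviour as specified above.
concat_csv_strings = lambda s1, s2: s1 + ',' + s2
val_by_vec = lambda x, seq: ((x, a) for a in seq)
not_self_loop = lambda t: t[0] != t[1]

# Fold several CSV strings into a single one.
merged = reduce(concat_csv_strings, ['a,b', 'c', 'd,e'])
print(merged)  # a,b,c,d,e

# Pair 'b' with every field, then drop the self-loop ('b', 'b').
edges = list(filter(not_self_loop, val_by_vec('b', merged.split(','))))
print(edges)   # [('b', 'a'), ('b', 'c'), ('b', 'd'), ('b', 'e')]
```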