## CSV reader for Titanic data

---

### Step 1
>The function `fill_age` receives a parameter `value` of type `str` and returns it as a `float` if it is not empty; otherwise it returns `-1`. In other words, `age` defaults to `-1` and is overwritten only when a value is present.

In [1]:
def fill_age(value):
    age = -1
    if value:
        age = float(value)
    return age
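
As a quick check of the reasoning above, the sketch below shows why the emptiness test matters: an empty string is falsy, so the guard skips the conversion, whereas calling `float("")` directly raises `ValueError`. `fill_age` is repeated so the snippet runs on its own, and the sample values are illustrative rather than read from `titanic.csv`.

```python
def fill_age(value):
    age = -1
    if value:
        age = float(value)
    return age

known = fill_age("22")   # non-empty string is converted to float
unknown = fill_age("")   # empty string keeps the default -1

# Without the guard, converting an empty string fails.
try:
    float("")
    empty_converts = True
except ValueError:
    empty_converts = False

print(known, unknown, empty_converts)  # 22.0 -1 False
```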

---

### Step 2
>The function `split_raw_fields` receives a parameter `line` of type `str` containing several fields separated by commas, and returns a `list` of `str` in which each element is one field of the original string, without any whitespace at the beginning or the end.
>
>Only leading and trailing whitespace must be removed (removing every space would corrupt names, for example), so I first `split` the string on commas and then apply `strip` to each piece, appending the results to the returned `list`.

In [2]:
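def split_raw_fields(line):
    line_splitted = line.split(",")
    # Named `fields` so the built-in `list` is not shadowed.
    fields = []
    for field in line_splitted:
        # strip() removes only leading and trailing whitespace.
        fields.append(field.strip())
    return fields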
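
To illustrate the point about names, here is a small sketch comparing the two approaches on a made-up line (not an actual row of `titanic.csv`): deleting every space versus splitting first and stripping each piece.

```python
raw = "1, 0, 3, Mr. Owen Harris , male"  # made-up line for illustration

# Removing every space also deletes the spaces inside the name.
all_spaces_removed = raw.replace(" ", "").split(",")

# Splitting first and stripping each piece keeps inner spaces intact.
stripped = [piece.strip() for piece in raw.split(",")]

print(all_spaces_removed[3])  # Mr.OwenHarris
print(stripped[3])            # Mr. Owen Harris
```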

---

### Step 3
>The function `build_record` receives two parameters, `fields` (a `tuple` of heterogeneous elements) and `header` (a `list` of `str`), and returns a `dict` in which each key is an element of `header` and each value is the element of `fields` at the same position.

In [3]:
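def build_record(fields, header):
    # Named `record` so the built-in `dict` is not shadowed.
    record = {}
    for i in range(len(fields)):
        # The i-th header name is the key for the i-th field.
        record[header[i]] = fields[i]
    return record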
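
The same pairing can also be sketched with the built-in `zip`, a common alternative to indexing by position. This is not the version used in this notebook: `build_record_zip` is a hypothetical name, the header is shortened for illustration, and `zip` silently stops at the shorter input instead of raising an `IndexError`.

```python
def build_record_zip(fields, header):
    # Hypothetical alternative: pair names with values, then build the dict.
    return dict(zip(header, fields))

header = ["PassengerId", "Survived", "Age"]  # shortened header for illustration
fields = (1, False, 22.0)

record = build_record_zip(fields, header)
print(record)  # {'PassengerId': 1, 'Survived': False, 'Age': 22.0}
```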

---

### Step 4
>The function `extract_fields` receives a parameter `line` of type `str` with a specific structure (see example below), uses `fill_age` and `split_raw_fields`, and returns a `tuple` of elements with the types `(int, bool, int, str, str, float, int, int, str, float, str, str)`.
>
>First, I split the line with `split_raw_fields` so that I can access each element. I cannot simply loop over `line_splitted`, because the elements are not all converted to the same data type. For the `bool` field I convert the element to `int` first; otherwise `bool` would be applied to a non-empty string, which is `True` whether it is `'0'` or `'1'`. For the age element I use `fill_age` to convert it to `float`.

In [4]:
def extract_fields(line):
    line_splitted = split_raw_fields(line)
    return (
        int(line_splitted[0]),
        bool(int(line_splitted[1])),
        int(line_splitted[2]),
        str(line_splitted[3]),
        str(line_splitted[4]),
        fill_age(line_splitted[5]),
        int(line_splitted[6]),
        int(line_splitted[7]),
        str(line_splitted[8]),
        float(line_splitted[9]),
        str(line_splitted[10]),
        str(line_splitted[11]),
    )
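
The `bool` conversion described above can be checked in isolation. This short sketch uses the two possible raw values of the survival flag and compares a direct `bool` call with the `int`-first conversion.

```python
raw_values = ["0", "1"]

# Any non-empty string is truthy, so both flags come out as True.
direct = [bool(v) for v in raw_values]

# Converting to int first gives the intended False/True.
via_int = [bool(int(v)) for v in raw_values]

print(direct)   # [True, True]
print(via_int)  # [False, True]
```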


---

### Step 5
>Function `read_data` receives a file handle (as returned by `open`) and returns a `list` of records, each record corresponding to each line of `titanic.csv` (except for the first one).
>
>I first iterate through the file handle `fh` to build an intermediate `list` with all the rows read from the CSV. The first row is the `header`, to which I apply `split_raw_fields`; for each remaining row I call `build_record` to create the list of `records` that I return. Before calling `build_record`, I apply `extract_fields` so that every field is converted to its corresponding data type.

In [5]:
def read_data(fh):
    # Named `rows` so the built-in `list` is not shadowed.
    rows = []
    for row in fh:
        rows.append(row)
    # The first row holds the column names.
    header = split_raw_fields(rows[0])
    records = []
    for row in rows[1:]:
        records.append(build_record(extract_fields(row), header))
    return records
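
The header-then-rows pattern can be tried without the real file by using `io.StringIO` as an in-memory stand-in for the file handle. The sketch below is a simplified illustration with made-up two-column content and no type conversion. It also shows that each row read from a handle keeps its trailing newline, which is why stripping each field matters for the last column.

```python
import io

# In-memory stand-in for an open file; the two-column content is made up.
fh = io.StringIO("PassengerId,Embarked\n1,S\n2,Q\n")

rows = list(fh)  # each row keeps its trailing "\n"
header = [name.strip() for name in rows[0].split(",")]

records = []
for row in rows[1:]:
    values = [value.strip() for value in row.split(",")]
    records.append(dict(zip(header, values)))

print(repr(rows[1]))  # '1,S\n'
print(records[0])     # {'PassengerId': '1', 'Embarked': 'S'}
```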

---

### Testing

In [6]:
path = 'titanic.csv'

In [7]:
with open(path) as fh:
    records = read_data(fh)

In [8]:
records[0]

{'PassengerId': 1,
 'Survived': False,
 'Pclass': 3,
 'Name': 'Mr. Owen Harris',
 'Sex': 'male',
 'Age': 22.0,
 'SibSp': 1,
 'Parch': 0,
 'Ticket': 'A/5 21171',
 'Fare': 7.25,
 'Cabin': '',
 'Embarked': 'S'}

In [9]:
records[-1]

{'PassengerId': 891,
 'Survived': False,
 'Pclass': 3,
 'Name': 'Mr. Patrick',
 'Sex': 'male',
 'Age': 32.0,
 'SibSp': 0,
 'Parch': 0,
 'Ticket': '370376',
 'Fare': 7.75,
 'Cabin': '',
 'Embarked': 'Q'}

---

### Extra
>Compute the mean age of the first 10 records, excluding the ones where the age is unknown (where "unknown", given `fill_age` function, is `-1`).
>
>For this problem I iterate through `records` until I find `10` records with a known age. For each record I check whether the age is known (that is, not `-1`) and stop once I have counted 10. I created the function `mean_age_first_n_records` for that purpose; it receives a single parameter `n`, the number of known-age records to average. If fewer than `n` known ages exist, the mean is taken over the ones found.

In [10]:
def mean_age_first_n_records(n):
    total = 0
    n_local = 0
    for record in records:
        # Ages are already floats here; -1 marks an unknown age.
        if record["Age"] != -1:
            total += record["Age"]
            n_local += 1
            if n_local == n:
                break
    # Divide by the number of ages actually summed, which can be below n.
    return total / n_local

In [11]:
mean_age_first_n_records(10)

25.7
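
A variant that takes the records as an argument, instead of reading the global `records`, might look like the sketch below. The name `mean_known_age` and the sample records are hypothetical. Returning `None` when no age is known is a design choice of this sketch rather than part of the exercise.

```python
def mean_known_age(records, n):
    # Collect ages until n known values have been seen (-1 means unknown).
    ages = []
    for record in records:
        if record["Age"] != -1:
            ages.append(record["Age"])
            if len(ages) == n:
                break
    # Average over the ages actually found; None if there were none.
    return sum(ages) / len(ages) if ages else None

# Made-up records for illustration; only the "Age" key is needed.
sample = [{"Age": 22.0}, {"Age": -1}, {"Age": 38.0}, {"Age": 26.0}]

print(mean_known_age(sample, 2))  # 30.0
```

Passing the records explicitly makes the function easy to test on small samples like this one.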