# _Module 1 lesson 2_: Validating restructured data against a schema using a spreadsheet

<div class="alert alert-block alert-warning">
    <b>Learning outcomes:</b>
    <br>
    <ul>
        <li>Employ methods for anonymising data, including address and name redaction, and field obfuscation.</li>
        <li>Perform data validation using Microsoft Excel.</li>
        <li>Learn how to validate machine-readable data in online applications.</li>
    </ul>
</div>

---

## 2.1 Methods for data anonymisation

There is a wide range of techniques available to support anonymisation. Broadly, though, they fit into two types:

- __Redaction__: in which we remove fields or line-item information while maintaining sufficient integrity to permit semantic analysis;
- __Aggregation__: in which we deliberately aggregate data to ensure outlier anonymity.

### 2.1.1 Redaction methods

Before we start doing anything, we need to understand our dataset, and understand how we intend to redact it _while maintaining its internal integrity so that we can continue to conduct analysis_.

#### Attribute suppression

This method requires that we delete an entire field. It is one of the first, and easiest, steps we can take.

- Remove data we do not need
- Remove data we cannot easily redact

This is a destructive step since suppression deletes the original data.
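
As a minimal sketch of attribute suppression, assume the records are held as a list of Python dictionaries. The field names (`name`, `address`, `age`, `diagnosis`), the sample values and the helper `suppress_fields` are hypothetical, chosen only for illustration.

```python
def suppress_fields(records, fields_to_remove):
    """Return copies of the records with the named fields deleted."""
    return [
        {key: value for key, value in record.items() if key not in fields_to_remove}
        for record in records
    ]


# Hypothetical example records; none of these values come from a real dataset.
patients = [
    {"name": "A. Person", "address": "1 Example Road", "age": 34, "diagnosis": "influenza"},
    {"name": "B. Person", "address": "2 Example Road", "age": 57, "diagnosis": "measles"},
]

suppressed = suppress_fields(patients, {"name", "address"})
print(suppressed)
```

The sketch returns new records rather than editing the originals in place, so the destructive step only happens when you choose to discard the original data.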

#### Pseudonymisation

Pseudonymisation is the replacement of identifying data with randomised values. This can be reversible, if you keep a key that maps the original data to the generated values, or irreversible, if you deliberately throw away the key. Persistent pseudonyms allow records for the same individual to be linked across different datasets.
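
One way to sketch pseudonymisation in Python is shown below. It assumes the identifying values are simple strings and uses the standard library's `secrets.token_hex` to generate random tokens. The helper `pseudonymise` and the sample names are hypothetical.

```python
import secrets


def pseudonymise(values, key=None):
    """Replace each value with a random token, reusing tokens for repeated values.

    `key` maps original values to pseudonyms. Keep it to make the process
    reversible; discard it to make the process irreversible.
    """
    key = {} if key is None else key
    pseudonyms = []
    for value in values:
        if value not in key:
            key[value] = secrets.token_hex(8)  # 16 random hexadecimal characters
        pseudonyms.append(key[value])
    return pseudonyms, key


names = ["Alice", "Bob", "Alice"]  # hypothetical identifying values
pseudonyms, key = pseudonymise(names)

# Reversal is only possible while the key exists.
reverse_key = {pseudonym: original for original, pseudonym in key.items()}
restored = [reverse_key[pseudonym] for pseudonym in pseudonyms]
```

Passing the same `key` into a later call on a different dataset would give the same individual the same pseudonym, which is what makes a pseudonym persistent.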

#### Generalisation

Generalisation is a deliberate reduction in the precision of data, such as converting a person's age into a range, or a precise location into a less precise location.

- `range`: conversion of precise numbers into quantiles or statistical ranges;
- `cluster`: aggregation of geospatial data into statistically less significant clusters - this can also be used to mask outliers.

Design the data ranges with appropriate sizes. Sometimes quantiles are the most appropriate; sometimes we use statistical definitions (such as geospatial ranges that are designed to include sufficient numbers of people to reduce the risk of deanonymisation).
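
A minimal sketch of the `range` form of generalisation, assuming ages are whole numbers and that fixed-width bands are acceptable. The helper `age_to_range` and the band width of 10 are assumptions for illustration, not a recommendation.

```python
def age_to_range(age, width=10):
    """Convert a precise age into a fixed-width range label such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


ages = [34, 57, 30, 9]  # hypothetical ages
ranges = [age_to_range(age) for age in ages]
print(ranges)  # ['30-39', '50-59', '30-39', '0-9']
```

Wider bands give more protection and less analytical precision, so the width is a decision to record in your metadata.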

#### Shuffling

Shuffling is where data are rearranged so that the individual attribute values are still represented in the dataset but, in general, no longer correspond to the original records. This is not appropriate for all data. Swapping diseases amongst different patients will certainly render the data anonymous, but will also confuse any epidemiological analysis.
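
Shuffling a single attribute can be sketched as follows, assuming records are dictionaries. The helper `shuffle_column`, the field names and the values are hypothetical. The `seed` argument is there only to make the example repeatable.

```python
import random


def shuffle_column(records, field, seed=None):
    """Return copies of the records with the values of one field shuffled."""
    values = [record[field] for record in records]
    random.Random(seed).shuffle(values)
    return [{**record, field: value} for record, value in zip(records, values)]


records = [
    {"id": 1, "postcode_area": "AB"},
    {"id": 2, "postcode_area": "CD"},
    {"id": 3, "postcode_area": "EF"},
]
shuffled = shuffle_column(records, "postcode_area", seed=1)
```

Every original value is still present in the shuffled column, so column-level counts are preserved, but any analysis that relates the shuffled field to other fields in the same record is no longer meaningful.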

### 2.1.2 Aggregation methods

Aggregation is far more destructive than is redaction. We will lose resolution on patient morphology, and we will lose the direct relationships between data in exchange for summaries of that data. But we will gain security for the individuals concerned.

Where redaction is guided by the data almost exclusively, aggregation is guided by the research objectives for the data. Any form of aggregation will limit what can be done and awareness of these limitations is critical.

Census data are usually aggregated in this way, with the individual microdata (responses from each household) only made available to accredited researchers, while the aggregated versions are made available to the public.

Our objective will be to create groups of data and then perform aggregations on each group. The range of aggregations we can form include:

- `count`: count of the individual members of the group;
- `totals`: sums of values, and sums of sub-groups within the values (e.g. total duration of illness, and duration of each type of illness);
- `averages`: including `mean`, `median` and `mode` of data sequences;
- `distributions`: including `quantiles`, `normals` or other types of distribution.

The groups can be by specific `categories` or `geospatial` ranges.
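
As an illustration of grouping by a category and then aggregating each group, here is a sketch using only the Python standard library. The helper `aggregate`, the field names and the durations are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def aggregate(records, group_field, value_field):
    """Group records by one field and summarise another field for each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_field]].append(record[value_field])
    return {
        group: {
            "count": len(values),
            "total": sum(values),
            "mean": mean(values),
            "median": median(values),
        }
        for group, values in groups.items()
    }


illnesses = [
    {"illness": "influenza", "duration_days": 5},
    {"illness": "influenza", "duration_days": 9},
    {"illness": "measles", "duration_days": 14},
]
summary = aggregate(illnesses, "illness", "duration_days")
print(summary["influenza"]["count"], summary["influenza"]["total"])  # 2 14
```

Note that the `measles` group here contains a single record, so its summary would still identify that individual. Checking group sizes before publishing is part of aggregating well.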

In many ways, an entire course of statistics is required to perform aggregations well.

<div class="alert alert-block alert-warning">
    <p><b>Aggregations require familiarity and experience with the data being aggregated.</b> It's very difficult to simply pick up a random dataset and know how to aggregate it
        in a way that supports analysis and extracts meaning from it. You are unlikely to be responsible for aggregating data you don't have experience with, and when you have that
        experience, knowing how to aggregate it will become clearer.</p>
</div>


---

## 2.2 Apply data validation to cells in a spreadsheet

The following is adapted from a [Microsoft Office tutorial](https://support.office.com/en-gb/article/apply-data-validation-to-cells-29fecbcc-d1b9-42c1-9d76-eff3ce5f7249). This approach will work in OpenOffice as well as Google Sheets, although the specific steps are different.

Microsoft has an example file you can [download](http://download.microsoft.com/download/9/6/8/968A9140-2E13-4FDC-B62C-C1D98D2B0FE6/Data%20Validation%20Examples.xlsx).

### 2.2.1 Specify validation for data types

The process is straightforward:

1. Select the cells in a specific column you wish to limit by type
2. Select __Data > Data Tools > Data Validation__.

  ![Excel data validation](images/excel-data-validation.png)

3. On the __Settings__ tab, under __Allow__, select one of:

  ![Validation settings](images/excel-validation-settings.jpg)
 
  - __Whole Number__: restrict the cell to accept only `integer` values.
  - __Decimal__: restrict the cell to accept only `float` or `number` values.
  - __List__: pick data from a drop-down list, and limited by values constrained by `enum`.
  - __Date__: restrict the cell to accept only `date`.
  - __Time__: restrict the cell to accept only `time`.
  - __Text Length__: restrict the length of the text, equivalent to the `maxLength` constraint.
  - __Custom__: for a custom formula.
 
4. Under __Data__, you can select a condition:

  - between
  - not between
  - equal to
  - not equal to
  - greater than
  - less than
  - greater than or equal to
  - less than or equal to

5. Set the other required values, based on what you chose for __Allow__ and __Data__. For example, if you select between, then select the __Minimum:__ and __Maximum:__ values for the cell(s).
6. Select the __Ignore blank__ checkbox if you want to ignore blank spaces (i.e. for missing data). Note, though, that Excel doesn't have any room for special characters you may be using as a marker for missing data (e.g. `..`) so these would be raised as errors.
7. Select __OK__.
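
To make the logic of such a rule concrete, the sketch below loosely mirrors a __Whole Number__ rule with the __between__ condition, applied to cell values held as text. It is an approximation for illustration, not a description of how Excel evaluates cells, and the helper `is_valid_whole_number` is hypothetical.

```python
def is_valid_whole_number(value, minimum, maximum, ignore_blank=True):
    """Loosely mimic a 'Whole Number, between' rule for one cell value held as text."""
    text = str(value).strip()
    if text == "":
        return ignore_blank  # blank cells pass only when 'Ignore blank' is selected
    try:
        number = int(text)
    except ValueError:
        return False  # not an integer, e.g. a range or a missing-data marker
    return minimum <= number <= maximum


cells = ["205", "200-210", "..", ""]
results = [is_valid_whole_number(cell, 200, 210) for cell in cells]
print(results)  # [True, False, False, True]
```

As in the spreadsheet, a missing-data marker such as `..` fails the rule, while a genuinely blank cell passes when blanks are ignored.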

From now on, but only for newly entered data, if a user tries to enter a value that is not valid, a pop-up appears with the message, "This value doesn’t match the data validation restrictions for this cell." We'll run validation on your existing data shortly, but first a detour into `lists`.

### 2.2.2 Lists are a special type

Before you can validate a `list` type, you need to specify valid terms. In Excel, this requires an [extra set of steps](https://support.office.com/en-us/article/create-a-drop-down-list-7693307a-59ef-400a-b769-c5402dce407b).

1. Create a new worksheet in Excel and list there the terms you want to set as valid values. You can quickly convert your list to a table by selecting any cell in the range and pressing __Ctrl+T__ (this may differ between versions of Excel, and will differ in OpenOffice or other spreadsheet applications).

  ![List terms](images/excel-list-terms.png)
  
2. Add your list data and format it as a Table (__Home tab > Styles > Format as Table__).
3. You can name your table from the Table tools tab - this one could be named "CityTable".  This will help you keep track of multiple tables.
4. In the validation process described in 2.2.1, return to step 3 and select __List__, then add a named range or table name for your list.
5. Specify a source for your terms via __Data tab > Data Validation > Allow List > Source__. Then specify your list of terms as any of:

  ![List source](images/excel-list-source.png)

  - You can select the list sheet and range directly (e.g. `=Sheet1!A4:A10`)
  - Convert your list to a table with __Ctrl+T__, then from the __Table Design__ tab give your table a name, permitting you to reference the table name and column (e.g. `=CityTable[City]`)
  - From the __Formulas__ tab select __Name Manager__, create a __New__ item with an appropriate name (e.g. `CityList`), and reference the cells (e.g. `=Sheet1!A4:A10`), which then lets you reference your list anywhere (e.g. `=CityList`)
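
The idea behind list validation, checking each value against a set of permitted terms, can be sketched as below. The city names and the helper `invalid_terms` are hypothetical, and this sketch compares terms exactly, including letter case.

```python
VALID_CITIES = {"Cape Town", "Durban", "Johannesburg"}  # hypothetical list of valid terms


def invalid_terms(values, valid_terms):
    """Return the values that do not appear in the list of valid terms."""
    return [value for value in values if value not in valid_terms]


entries = ["Durban", "durban", "Pretoria"]
rejected = invalid_terms(entries, VALID_CITIES)
print(rejected)  # ['durban', 'Pretoria']
```

This corresponds to the `enum` constraint: a value is valid only if it matches one of the listed terms.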

### 2.2.3 Validate and get error messages for your existing data

After you've specified validation rules on your existing data, you might be disappointed: Excel does not automatically tell you whether these cells contain invalid data. Here's a quick way to [highlight existing invalid cells](https://support.office.com/en-us/article/more-on-data-validation-f38dee73-9900-4ca6-9301-8a5f6e1f0c4c) by circling the values:

  ![Circle invalid data](images/excel-circled-cell.gif)

1. To apply the circles, select the cells you want to evaluate and go to __Data > Data Tools > Data Validation > Circle Invalid Data__.

  ![List terms](images/excel-data-circle.png)
  
2. If you correct an invalid entry, the circle disappears automatically.
3. To remove data validation for a cell, select it, and then go to __Data > Data Tools > Data Validation > Settings > Clear All__.

Now your turn:

<div class="alert alert-block alert-success">
    <p><b>Exercise:</b></p>
    <p>Using the restructured file you created in Lesson 1.1 specify validation criteria for each column. Check for invalid data and correct where necessary.</p>
    <p>One thing you might notice in your data ... sometimes you have an invalid integer specified as a range, e.g. <code>200-210</code>. Here are some ideas about how to deal with that.</p>
    <ul>
        <li><b>Ranges instead of numbers</b>: if, e.g. your range is <code>200-210</code>, you could reset this value as <code>200</code>, or <code>210</code> or even the range average <code>205</code>. Whatever you decide, document your decision in your metadata file.</li>
        <li><b>Date ranges</b>: the same goes for dates, although you should be careful ... the likelihood is that a value applies from the end of the date range, not the beginning or middle, so e.g. <code>2008-2009</code> is most likely to be <code>2009</code></li>
    </ul>
</div>
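
The options for handling ranges in the exercise above can be sketched in Python. The helper `resolve_range` and its `strategy` names are hypothetical, and the sketch assumes non-negative whole numbers separated by a single hyphen.

```python
def resolve_range(value, strategy="mean"):
    """Convert text such as '200-210' into a single integer.

    `strategy` is one of 'low', 'high' or 'mean'. Plain integers pass through.
    Assumes non-negative whole numbers separated by a single hyphen.
    """
    parts = [int(part) for part in str(value).split("-")]
    if len(parts) == 1:
        return parts[0]
    if len(parts) != 2:
        raise ValueError(f"Cannot interpret {value!r} as a number or range")
    low, high = parts
    if strategy == "low":
        return low
    if strategy == "high":
        return high
    if strategy == "mean":
        return (low + high) // 2  # rounds down when the mean is not a whole number
    raise ValueError(f"Unknown strategy: {strategy!r}")


print(resolve_range("200-210"))            # 205
print(resolve_range("2008-2009", "high"))  # 2009
```

Whichever strategy you choose, apply it consistently and record it in your metadata file.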

<div class="alert alert-block alert-info">
    <b>References:</b>
    <br>
    <ul>
        <li><a href="https://support.office.com/en-gb/article/apply-data-validation-to-cells-29fecbcc-d1b9-42c1-9d76-eff3ce5f7249">Apply data validation to cells</a></li>
        <li><a href="https://support.office.com/en-us/article/create-a-drop-down-list-7693307a-59ef-400a-b769-c5402dce407b">Create a drop-down list</a></li>
        <li><a href="https://support.office.com/en-us/article/more-on-data-validation-f38dee73-9900-4ca6-9301-8a5f6e1f0c4c">More on data validation</a></li>
    </ul>
</div>

---

## 2.3 Saving your validated file as comma-separated values

Comma-separated values files (`.csv`) are text files in which the comma character `,` separates each field of text. Where a comma appears within a value - whether a `string` or `number` - that value is surrounded by quotation marks, e.g. `100,200,"20,000"` contains three values in three separate fields.
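
The quoting rule can be seen with Python's standard `csv` module, which by default adds quotation marks only where a value needs them. This is a small sketch using an in-memory buffer rather than a file on disk.

```python
import csv
import io

# Write one row of three values; the third contains a comma.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["100", "200", "20,000"])
line = buffer.getvalue()
print(repr(line))  # '100,200,"20,000"\r\n'

# Reading the line back recovers the three original values.
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['100', '200', '20,000']
```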

You can change the separator character used in delimited and `.csv` text files, and a wide range of alternatives is in use (e.g. `;`, `*`). There are any number of reasons for this, and it is part of the reason that CSV files are not the cure-all we might hope for in ensuring consistency in open data.
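
The sketch below shows what happens when the separator is not the one a reader expects. The semicolon-separated sample text is hypothetical. Python's `csv.reader` accepts a `delimiter` argument to name the separator explicitly.

```python
import csv
import io

# Hypothetical semicolon-separated text in which values contain commas.
semicolon_text = "name;value\nalpha;1,5\nbeta;2,25\n"

# Read with the correct separator: two fields per row.
correct = list(csv.reader(io.StringIO(semicolon_text), delimiter=";"))
print(correct)  # [['name', 'value'], ['alpha', '1,5'], ['beta', '2,25']]

# Read with the default comma separator: the fields are split in the wrong places.
wrong = list(csv.reader(io.StringIO(semicolon_text)))
print(wrong)  # [['name;value'], ['alpha;1', '5'], ['beta;2', '25']]
```

Neither reading raises an error, which is why the separator used in a file should be documented alongside it.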

In Excel, you can [export a spreadsheet](https://support.office.com/en-gb/article/import-or-export-text-txt-or-csv-files-5250ac4c-663c-47ce-937b-339e391393ba) as a CSV using __Save As__ as follows:

1. Go to __File > Save As__.
2. Click __Browse__.
3. In the __Save As__ dialog box, in the __Save as type__ box, choose the text file format for the worksheet; for example, click __CSV (Comma delimited)__.
4. Browse to the location where you want to save the new text file, and then click __Save__.

> You are only able to export the current worksheet (i.e. the one in view when you complete this process) to the new CSV file. You can save other worksheets as separate text files by repeating this procedure for each worksheet.

> All spreadsheet-specific features will be lost. Formatting (bold, colours, etc), formulae and validation criteria will be removed leaving only the data in a text file.

---

## 2.4 Lesson tutorial

<div class="alert alert-block alert-success">
    <p><b>Tutorial:</b></p>
    <p>Complete the processing of the file you started working with in Lesson 1.</p>
    <ul>
        <li><b>Check for invalid data</b>: Specify validation criteria in your spreadsheet application, check for invalid data and correct where necessary.</li>
        <li><b>Save as CSV</b>: Export your machine-readable data from your spreadsheet application and save it as a <code>.csv</code>.</li>
    </ul>
</div>

Please complete the tutorial before continuing with this series. If you are participating in a taught class, please send your tutorial submission via the required process (email or online).