# _Module 1 lesson 1_: Structuring & wrangling messy data in a spreadsheet

<div class="alert alert-block alert-warning">
    <b>Learning outcomes:</b>
    <br>
    <ul>
        <li>Understand, and have practical experience with, the structure and design of machine-readable data files.</li>
        <li>Use Excel to investigate and manipulate source data to learn its metadata, shape and robustness.</li>
        <li>Learn and apply a basic set of methods to restructure messy source data into machine-readable CSV files using Microsoft Excel.</li>
    </ul>
</div>

---

## 1.1 Introduction

The purpose of this syllabus in _Data Wrangling and Validation for Open Data_ is to guide learners to confidence in delivering technically open data: well-structured, machine-readable data, validated against a defined and standard metadata schema.


## 1.2 The data management lifecycle

While the process of creating, maintaining and inspiring new data is often presented as a cycle, it is more like a spiral, with each cycle spiralling upwards and resulting in more information. However, the effectiveness of each step is defined by the needs of its users, and its relevance in terms of the process or events it reflects.

![Data lifecycle](images/data-lifecycle-en.jpg)

### 1.2.3 Collection and creation

Before data can be collected, there are a range of things which must be known:

- why are we collecting the data?
- what purpose does it serve?
- do we have consent and / or the legal authority to collect that data?
- is there an existing data series which this data extends or complements?
- how will the data be collected?
- who will be responsible for data quality, and how will quality be measured?
- who will have access to the data, and how sensitive is this data (e.g. personally-identifying)?
- are we using a standardised and agreed-upon classification or metadata format?

### 1.2.4 Classification and processing

The creator of the data would best know what the data are about and should assign keywords as descriptors. This data about the data is called metadata. The term is ambiguous, as it is used for two fundamentally different concepts:
 
- __Structural metadata__ correspond to internal metadata (i.e. metadata about the structure of database objects such as tables, columns, keys and indexes);
- __Descriptive metadata__ correspond to external metadata (i.e. metadata typically used for discovery and identification, such as the title, author, subjects, keywords and publisher used to search for and locate an object).
 
Descriptive metadata permits discovery of the object. Structural metadata permits the data to be applied, interpreted, analysed, restructured, and linked to other, similar, datasets.
 
Metadata can permit interoperability between different systems. An agreed-upon structure for querying the ‘aboutness’ of a data series can permit unrelated software systems to find and use remote data.
 
Beyond metadata, there are also mechanisms for the structuring of relationships between hierarchies of keywords. These are known as ontologies and, along with metadata, can be used to accurately define and permit discovery of data.
 
Adding metadata to existing data resources can be a labour-intensive and expensive process. This may become a barrier to implementing a comprehensive knowledge management system.

The types of descriptive metadata we should include (and the terms we should use for them):

| REQUIRED        | RECOMMENDED   | OPTIONAL              |
|:----------------|:--------------|:----------------------|
| Title           | Tag(s)        | Last update           |
| Description     | Terms of use  | Update frequency      |
| Theme(s)        | Contact email | Geographical coverage |
| Publishing body |               | Temporal coverage     |
|                 |               | Validity              |
|                 |               | Related resources     |
|                 |               | Regulations           |

### 1.2.5 Manipulation, conversion or alteration

This part of the process is where the data are transcribed, translated, checked, validated, cleaned and managed.

This presents the greatest risk for data consistency. Any format change or manipulation, or even copying a file from one system to another, introduces the potential for data corruption. Similarly, it also increases the potential for data - whether erroneous or not - to be accidentally released to users or the public before it is ready.

This is also known as __data wrangling__.

### 1.2.6 Analysis and presentation

Analysis may not always be part of the data manager's role, but it is certainly the reason that data are collected.

It is here where the data are interpreted, combined with other datasets to produce meta-analysis, and where analysis becomes the story you wish to tell derived from the data.

The story is only as good as your data and research is only considered valid if the data which informs it are also released.

The purpose of analysis is to inform collective or individual behaviour, influence policy, or support economic activity, amongst many others. Trust can be achieved by "showing your workings", which includes data release.

### 1.2.7 Preserving and storing

Data needs to be preserved from corruption, as well as being available for later use once the initial analysis is complete. It is also necessary to preserve the data in case of queries regarding the analysis.

Long-term storage requires that the __metadata__ be well-defined and exceptionally useful to ensure that understanding what the data describe is still possible long after its initial collection.

Data may end up being stored in multiple formats or across multiple systems. It is essential that primacy is established (i.e. which dataset has priority over the others) and that the various formats are kept in alignment.

### 1.2.8 Release and access

Even where data is only released within government - and not for the public - there will always be others who will want to use your data, or would benefit from it if they know it exists. The greatest inefficiency in data management arises when research is repeated because a different department needs the same data but did not know it already existed.

Release is not simply about making data available to the public, it is also about creating a predictable process for that release.

Regularly collected data (such as inflation rates) need a predictable release cycle since many companies base their investment decisions on the availability of such information.

Publishing a release calendar for your stakeholders (and sticking to it) permits them to plan their own analysis, or response to your analysis.

Access implies that you need a centralised database which is accessible to your stakeholders. In the case of open data, how will data be moved from internal servers to a public repository?

Responsibility for this process needs to be assigned and measured.

### 1.2.9 Retention and reuse

Once data are released, the question arises as to how long they will remain available. Research data should, ordinarily, be available in perpetuity.

Time-series data become more useful the longer they have been collected. Suddenly removing data can cause tremendous disruption.

If appropriate systems and support have not been put in place then retention can become a very expensive problem.

Importantly, in order to support reuse, clear copyright and licensing terms which permit data to be freely reused for any purpose are essential.

### 1.2.10 Archival

Datasets can become very large and may only be accessed infrequently. This can become problematic for long-term storage.

A process of archival - where data can be stored more cheaply but still be accessible - may need to be considered.

<div class="alert alert-block alert-danger">
    <b>Actions:</b>
    <br>
    <ul>
        <li>Ensure there is a data owner who is responsible for the data lifecycle, including publication and responding to feedback or queries;</li>
        <li>Prepare a data-collection plan, ensuring that definitions and data structure conform to international standards;</li>
        <li>Ensure metadata are agreed with stakeholders and are useful and standardised;</li>
        <li>Ensure that data are appropriately licensed to ensure release and reuse;</li>
    </ul>
</div>

<div class="alert alert-block alert-info">
    <b>References:</b>
    <br>
    <ul>
        <li><a href="http://www.bu.edu/datamanagement/background/data-life-cycle/">Boston University Data Life Cycle</a></li>
        <li><a href="http://searchsecurity.techtarget.com/magazineContent/Data-Lifecycle-Management-Model-Shows-Risks-and-Integrated-Data-Flow">Tech Target Data Lifecycle Management</a></li>
        <li><a href="http://en.wikipedia.org/wiki/Dublin_Core">Dublin Core Metadata Standard</a></li>
        <li><a href="http://www.w3.org/TR/vocab-dcat/">DCAT Metadata Standard</a></li>
        <li><a href="http://en.wikipedia.org/wiki/Seven_Basic_Tools_of_Quality">Seven Basic Tools of Quality</a></li>
    </ul>
</div>

---

## 1.3 Data wrangling and preparation for publication

There are a number of common formats for data distribution. Some of these are considered "open" (such as CSV, XML, text and others) and some proprietary (SAS, STATA, SPSS, etc.). XLS and XLSX, associated with Microsoft Excel, are relatively open formats and a number of software systems can interpret the data.
 
Proprietary formats are legitimate even in an open data initiative since these are the software systems used by many professional data users. However, since these formats are often not interoperable, the potential for data re-use is limited unless open formats are also supported. Data dissemination in proprietary formats does not preclude dissemination in open formats and vice versa.

Spreadsheets and distributed data systems often lack an agreed data structure. A researcher who wishes to combine this with other data first needs to normalise it and then decide on standardised terms to define the columns and data-types in those columns.

![A typical "human-readable" structured table of data](images/01-01-human-readable-data.jpg)

Converting semi-structured tabular data into a typical machine-readable format results in the comma-separated-value (CSV) file. These are tabular files with a header row which defines each of the data in the columns and rows below.

![A CSV-structured table of data](images/01-01-machine-readable-data.jpg)

Ignoring any further standards compliance, CSV files can be so arranged that they are "joined" on a common column. For example, a set of geospatial coordinates can be used to connect a number of similar files covering different data series.
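As a rough illustration of joining on a common column, the sketch below uses only Python's standard library. The two small CSV snippets and the `join_on` helper are assumptions made for this example (the figures are taken from the Côte d'Ivoire table later in this lesson); they are not part of any published dataset or library.

```python
import csv
import io

# Two illustrative CSV "files" which share a `region` column.
population_csv = """region,year,value
Gbôklé,2014,400798
Nawa,2014,1053084
"""
districts_csv = """region,district
Gbôklé,Bas-Sassandra
Nawa,Bas-Sassandra
"""


def join_on(left_rows, right_rows, key):
    """Inner-join two lists of dict rows on a shared column (hypothetical helper)."""
    lookup = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = lookup.get(row[key])
        if match is not None:
            # Combine the columns of both rows into one record.
            joined.append({**row, **match})
    return joined


population = list(csv.DictReader(io.StringIO(population_csv)))
districts = list(csv.DictReader(io.StringIO(districts_csv)))
joined = join_on(population, districts, "region")
```

The join only works because both files spell the key values identically, which is one reason standardised terms matter.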

### 1.3.1 Considerations for machine-readable spreadsheets

Data.gov has a useful [Primer on Machine Readability for Online Documents and Data](https://www.data.gov/developers/blog/primer-machine-readability-online-documents-and-data) and a comment thread at Data.gov.uk offers the following guidance:

- A requester can open the data file in freely-available and widely accessible software packages - this means that formats such as CSV should be preferred over, or offered in addition to formats like Excel, and proprietary formats which can only be opened with commercial or specialist software should be avoided. 
- It is possible to process the data directly, carrying out any appropriate operations on it such as sorting columns, filtering rows, running aggregates of values - this requires well structured data. Where possible, the meaning of the data should not be contained in the layout. 
- Common elements in the dataset are expressed in uniform ways - for example, dates are always in the same format, codes or names are always in the same case, and numbers are expressed consistently (e.g. 1,000 or 1000 but not a mixture of the two). 
- The meaning of fields and values is clearly documented - either through clear naming of fields, or through accompanying descriptions provided along with the data.  

In addition, where possible, machine readability is enhanced if:

- The dataset uses common standards where they exist - including standard identifiers and standard field names. These might be standards like the public spending vocabulary developed for government, or third-party standards such as KML for indicating 'points of interest'

It is important to also note that it may be possible to provide data in a variety of machine readable formats, and where possible the authority should correspond with the requester to identify the best format. For example, points of interest could be provided in a CSV spreadsheet form, or KML. Ideally both would be provided: though depending on the context and the re-user one may be more appropriate than the other. 
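The guidance on uniform values can be checked mechanically. The following is a minimal sketch, assuming a column of whole numbers held as text; `number_style` and `styles_in_column` are hypothetical helpers written for this lesson, and they only distinguish the `1,000` and `1000` styles mentioned above.

```python
import re
from collections import Counter


def number_style(text):
    """Label how a whole number is written: with thousands separators or without."""
    if re.fullmatch(r"\d{1,3}(,\d{3})+", text):
        return "comma-separated"
    if re.fullmatch(r"\d+", text):
        return "plain"
    return "other"


def styles_in_column(values):
    """Count the number styles used in one column of text values."""
    return Counter(number_style(value.strip()) for value in values)


column = ["1,000", "1000", "826,666"]
styles = styles_in_column(column)
# More than one style in the same column means the values are not uniform.
is_uniform = len(styles) == 1
```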
  
### 1.3.2 _Long_ vs _wide_ data

Any data series consists of numerical values (usually) described by standardised metadata terms (time, area, a specific description, etc). There are two main ways of presenting these machine-readable data, which can be summarised as _wide_ or _long_. You need to make a deliberate choice as to which format you will choose, and each has its own particular strengths and weaknesses:

- __Wide data__ present numerical data in multiple columns. Either as categories (e.g. each country is presented in its own column) or by date (e.g. each annual update results in a new column). New data go across the screen from left to right:

![Wide data](images/01-01-wide-data.jpg)

__Wide data__ are often used for data visualisation and processing since the data can easily be grouped into the necessary axes for chart libraries. However, it's a difficult archival format since updating such a dataseries requires the equivalent of creating a new field (the _year_ in the fields above) and then updating every row with appropriate information. That can be an expensive operation in a large database, and also means that writing a programmatic method for querying your data is more challenging.

- __Long data__ present numerical data in multiple rows with only one column for values. New data go down the screen from top to bottom:

![Long data](images/01-01-narrow-data.jpg)

__Long data__ are best for archival and for representing the structure you will ordinarily find in a database. Each row in a _long_ dataseries represents a row in a database. Adding new information is relatively straightforward since you only need to add a single row at a time. In database terms, you'd be creating a single database entry.

<div class="alert alert-block alert-warning">
    <p>The preference in open data publication is for the <b>long</b> format, and this will be the method usually recommended for release. That said, conversion between them - as long as data are machine-readable with well-defined metadata - is straightforward.</p>
</div>
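To show why conversion is straightforward once data are machine-readable, here is a minimal sketch of reshaping _wide_ rows into _long_ rows. The sample values and the `wide_to_long` helper are made up for illustration, and the sketch assumes every column other than the identifier holds a value for one year.

```python
import csv
import io

# Made-up wide data: one column per year.
wide_csv = """area,2014,2016
A,10,12
B,20,24
"""


def wide_to_long(rows, id_column, variable_name, value_name="value"):
    """Turn each non-identifier column into its own row (hypothetical helper)."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return long_rows


wide_rows = list(csv.DictReader(io.StringIO(wide_csv)))
long_rows = wide_to_long(wide_rows, id_column="area", variable_name="year")
```

Each wide row becomes one long row per year, so a new release only adds rows and leaves the column structure unchanged.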

### 1.3.3 Defining your destination metadata and schema

Creating a `schema` is the first part of the wrangling process. Your schema defines the structural
metadata target for your wrangling process. This is not the format your input data arrive in, but
it is what you require it to look like when you're done.

Your schema sets the requirements, constraints and sensible defaults available for wrangling input data into the fields defined by the schema. Once complete, you will need to perform further cleaning and validation.

In simple terms, the columns in an input CSV or Excel-file will be restructured into new columns
defined by the fields in your schema. These target fields are likely to be those in a database. Until your input data conform to this structure, your data should not be released as open data.

We'll be using [Frictionless Data's](https://specs.frictionlessdata.io/table-schema/#field-descriptors) definitions and schema to define structural metadata fields. 

#### ___`type` and `format`___

`type` defines the data-type of the field, while `format` further refines the specific `type` properties. These are the core types you're likely to use, with indents for formats:

- `string`: any text-based string.
  - `default`: any string
  - `email`: an email address
  - `uri`: any web address / URI
- `number`: any number-based value, including integers and floats.
- `integer`: any integer-based value.
- `boolean`: a boolean [`true`, `false`] value. Can set category constraints to fix term used.
- `object`: any valid JSON data.
- `array`: any valid array-based data.
- `date`: any date without a time. Must be in ISO8601 format, `YYYY-MM-DD`.
- `datetime`: any date with a time. Must be in ISO8601 format, with UTC time specified (optionally) as `YYYY-MM-DD hh:mm:ss Zz`.
- `year`: any year, formatted as `YYYY`.
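A rough way to get a feel for these types is to test whether a text value from a spreadsheet cell can be read as one of them. The sketch below is an assumption-laden illustration using only Python's standard library: `is_valid` is a hypothetical helper, covers only a few of the types listed above, and is not the Frictionless Data validator.

```python
from datetime import date


def is_valid(value, field_type):
    """Return True if a text value can be read as the given core type."""
    if field_type == "string":
        return True
    if field_type == "year":
        return len(value) == 4 and value.isdigit()
    if field_type == "boolean":
        return value in ("true", "false")
    try:
        if field_type == "integer":
            int(value)
        elif field_type == "number":
            float(value)
        elif field_type == "date":
            date.fromisoformat(value)
        else:
            # Types such as object, array and datetime are left out of this sketch.
            return False
    except ValueError:
        return False
    return True


checks = {
    "plain integer": is_valid("560432", "integer"),
    "integer with commas": is_valid("560,432", "integer"),
    "ISO date": is_valid("2014-03-01", "date"),
    "day-first date": is_valid("01/03/2014", "date"),
}
```

Here the comma-separated number and the day-first date fail, which is the kind of problem the wrangling steps in this lesson are meant to remove.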

#### ___`schema` design___

Have a look at the open data presented at [Données hospitalières relatives à l'épidémie de COVID-19](https://www.data.gouv.fr/fr/datasets/donnees-hospitalieres-relatives-a-lepidemie-de-covid-19/). Preview the metadata schema file for hospitalisation:

![Hospitalisation structural metadata](images/01-01-structural-metadata-covid-hospitalisation.jpg)

The first term in each row defines the destination column name (the structural metadata terms), along with its data type, descriptions in English and French, and an example of an expected value for that field.

Before you start restructuring messy data files, create such a structural metadata definition for your data. And the first thing to do when you begin is simply to look at your data and understand how it is currently structured. Your objective in moving from messy to structured data is simple:

 ___Preserve everything that is known from your source data when you restructure it.___

Here is a summary of these schema design requirements:

- _Establish a standard convention for naming column fields, and stick to it_; the convention refers to the physical structure of the word, such as `startDate` or `place_name`. `camelCase` is one convention, `underscore_separation` is another. Don't mix these as it is frustrating for users.
- _Every value of every column of every table should represent only a single thing_; if you look at the `region` column below, you'll see that it should be split into separate `region` and `district` columns for Côte d'Ivoire:

| description | region                   | year | value     |
|:------------|:-------------------------|:-----|:----------|
| Population  | Gbôklé, Bas-Sassandra    | 2014 | 400,798   |
| Population  | Nawa, Bas-Sassandra      | 2014 | 1,053,084 |
| Population  | San-Pédro, Bas-Sassandra | 2014 | 826,666   |
| Population  | Indénié-Djuablin, Comoé  | 2014 | 560,432   |
| Population  | Sud-Comoé, Comoé         | 2014 | 642,620   |

- _Define the `type`, `format` and `description` for each column name_; the below schema definition is the __long__ form. If each `year` had its own column, you would then - instead of `year` - have a row for `2014`, then the next release might include `2016` and so on. You can see that updating the data and the schema simultaneously is more work for you, and your users, than just updating the data:

| column      | type             | description                                      | example    |
|:------------|:-----------------|:-------------------------------------------------|:-----------|
| description | string           | Definition of the demographic data series        | Population |
| region      | string           | Name of one of the 31 regions in Côte d'Ivoire   | Sud-Comoé  |
| district    | string           | Name of one of the 14 districts in Côte d'Ivoire | Comoé      |
| year        | year             | The year of the data series                      | 2014       |
| value       | integer          | The value for the data series                    | 560432     |

There are additional requirements for validating your data - such as noting that the commas between digits in the numbers must be removed - and we'll go through these below.
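The schema above can also be written down in a machine-readable form. The following is a sketch of how it might look as a Table Schema style descriptor (a `fields` list of `name`, `type` and `description` entries), together with a hypothetical `restructure` helper that splits the combined `region` value and removes the commas from `value`. The `year` field is typed as `year` here because the example value is a bare four-digit year; treat that, and the helper itself, as assumptions of this sketch.

```python
import json

schema = {
    "fields": [
        {"name": "description", "type": "string",
         "description": "Definition of the demographic data series"},
        {"name": "region", "type": "string",
         "description": "Name of one of the 31 regions in Côte d'Ivoire"},
        {"name": "district", "type": "string",
         "description": "Name of one of the 14 districts in Côte d'Ivoire"},
        {"name": "year", "type": "year",
         "description": "The year of the data series"},
        {"name": "value", "type": "integer",
         "description": "The value for the data series"},
    ]
}


def restructure(row):
    """Map one messy source row onto the fields defined in the schema."""
    # "Gbôklé, Bas-Sassandra" holds two things: the region and its district.
    region, district = [part.strip() for part in row["region"].split(",")]
    return {
        "description": row["description"],
        "region": region,
        "district": district,
        "year": row["year"],
        # Remove thousands separators so the value is a true integer.
        "value": int(row["value"].replace(",", "")),
    }


messy = {"description": "Population", "region": "Gbôklé, Bas-Sassandra",
         "year": "2014", "value": "400,798"}
clean = restructure(messy)
schema_json = json.dumps(schema, ensure_ascii=False, indent=2)
```

Saving `schema_json` alongside the CSV gives users the structural metadata in the same place as the data.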

#### ___`descriptive metadata` design___

Before we start wrangling our data, first we should review the following table of formal terms which describe the data:

| REQUIRED        | RECOMMENDED   | OPTIONAL              |
|:----------------|:--------------|:----------------------|
| Title           | Tag(s)        | Last update           |
| Description     | Terms of use  | Update frequency      |
| Theme(s)        | Contact email | Geographical coverage |
| Publishing body |               | Temporal coverage     |
|                 |               | Validity              |
|                 |               | Related resources     |
|                 |               | Regulations           |

These are the high-level definitions and terms to describe what the data you are restructuring are about. A person opening a spreadsheet has very little context as to what they are reading, and these metadata provide much-needed context.

Your next step is to begin restructuring your messy data to conform to this schema you've just created.

### 1.3.4 Habits which make spreadsheets unusable

In his excellent essay [The Art of the Spreadsheet](http://john.raffensperger.org/ArtOfTheSpreadsheet/Chapter09_ShowAllTheInformation.html), John Raffensperger lists 37 ways that you can hide data in a spreadsheet. Here are 10 of them:

- Do not share the file. This is the most common way of hiding information, and the most effective.
- Hide the sheet. You need at least two sheets first, then: Format, Sheet, Hide.
- Hide the row: Format, Row, Hide.
- Hide the column: Format, Column, Hide.
- Hide the cell and protect the sheet: Format, Cells, Protection, Hidden, then Tools, Protection. This shows a display, but hides the formula: =if(1, "Peace!", "Attack at dawn.").
- Make the column too narrow: Format, Column, Width, 0.
- For formulas that are likely to be zero, use Tools, Options, View, and clear the Zero values box. For example: =IF(1, 0, "Attack at dawn.").
- Use a formula that returns a blank: =IF(1, " ", "Attack at dawn.").
- Create a complicated formula that displays the information, but format it as text (with Format, Cells, Number, Text, or just start the cell with a single quotation mark), so the formula is displayed rather than the output.
- Format the font with Wingdings: Format, Cells, Font, Wingdings. This displays unintelligible characters.

<div class="alert alert-block alert-success">
    <b>Exercise:</b>
    <br>
    <p><a href="https://www.data.gouv.fr/fr/datasets/donnees-hospitalieres-relatives-a-lepidemie-de-covid-19/">Santé publique France</a> produce daily updates on France's COVID-19 hospitalisation data. Using John Raffensperger's list as inspiration, your task is to mess up the <a href="https://www.data.gouv.fr/fr/datasets/r/63352e38-d353-4b54-bfd1-f1b3ee1cabd7">donnees-hospitalieres-covid19.csv</a> data as much as possible.</p>
    <p>Marks will be awarded for:</p>
    <ul>
        <li>making the presentation just bad enough that someone using the data might be tempted to think they can still use it!</li>
        <li>the use of colour and font effects in ways that really offend the eye</li>
        <li>ingenuity in hiding bits of data in plain sight.</li>
    </ul>
</div>

### 1.3.5 Using Excel to clean messy data

Excel is one of the most powerful pieces of software in use by ordinary (non-software developer) analysts, and probably the most widespread data management and analysis tool in the world.

Excel can permit you to do all of the following:

- Remove duplicate records
- Separate multiple values contained in the same field
- Analyse the distribution of values throughout a data set
- Group together different representations of the same reality

A simple step-by-step guide to preparing and cleaning data:

- __Create the data__ file with new worksheets for each of: OriginalData | InterimData | FinalData
- __Clean the data__:
  - Create an ID column linking your OriginalData to your InterimData to keep track of deleted rows;
  - Manage duplicate records by creating a search key which should uniquely identify each row; Excel can sort on this key column and remove any duplicates;
  - Strip out undesirable characters (find/replace);
  - Locate out-of-range values (i.e. values that are not what they should be);
  - Remove any non-publishable data (such as personal information, or any data not approved for publication);
- __Process the data__:
  - Parse data (such as using Text to Columns, and splitting on spaces or tabs);
  - Recode data (to convert terms to standardised metadata);
  - Compute new values (such as totals or averages);
  - Reformat data into a standardised format;
- __Create an analysis-ready copy__ of the data by copying the unformatted InterimData to FinalData and stripping out any unneeded columns;
- __Document the data__:
  - File-level documentation, such as project, date completed, checked by;
  - Produce a text file that goes with the data file that has: description, date produced, data owner, etc (from the metadata described in the classification and processing step);
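The same logic can be expressed outside Excel. Below is a minimal sketch of two of the cleaning steps above: removing duplicates with a search key while keeping track of the deleted row IDs, and locating out-of-range values. The helper names are hypothetical, and the sample rows are deliberately broken copies of the population figures used earlier in this lesson.

```python
def search_key(row, columns):
    """Build a key which should uniquely identify a row, ignoring case and stray spaces."""
    return tuple(row[column].strip().lower() for column in columns)


def drop_duplicates(rows, columns):
    """Keep the first row for each search key; report the IDs of the rows removed."""
    seen = set()
    kept, dropped_ids = [], []
    for row in rows:
        key = search_key(row, columns)
        if key in seen:
            dropped_ids.append(row["id"])
        else:
            seen.add(key)
            kept.append(row)
    return kept, dropped_ids


def out_of_range(rows, column, low, high):
    """Return the IDs of rows whose value falls outside the expected range."""
    return [row["id"] for row in rows if not low <= int(row[column]) <= high]


rows = [
    {"id": "1", "region": "Nawa", "year": "2014", "value": "1053084"},
    {"id": "2", "region": "nawa ", "year": "2014", "value": "1053084"},  # duplicate
    {"id": "3", "region": "Sud-Comoé", "year": "2014", "value": "-642620"},  # negative
]
kept, dropped_ids = drop_duplicates(rows, ["region", "year"])
suspect_ids = out_of_range(kept, "value", low=0, high=10_000_000)
```

The `id` column plays the role of the ID linking OriginalData to InterimData, so every deletion can be traced back to its source row. The upper bound of 10,000,000 is an arbitrary choice for this example.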

<div class="alert alert-block alert-info">
    <b>Tutorials & references:</b>
    <br>
    <ul>
        <li><a href="http://blog.ouseful.info/2012/11/27/when-machine-readable-data-still-causes-issues-wrangling-dates/">When Machine Readable Data Still Causes "Issues" – Wrangling Dates…</a></li>
        <li><a href="http://www.opendataimpacts.net/2012/11/more-than-csv/">Digital landscapes: effective open data takes more than a single CSV</a></li>
        <li><a href="https://blog.datawrapper.de/prepare-and-clean-up-data-for-data-visualization/">How to prepare your data for analysis and charting in Excel & Google Sheets</a></li>
        <li><a href="https://support.office.com/en-us/article/Top-ten-ways-to-clean-your-data-2844b620-677c-47a7-ac3e-c2e157d1db19">Top ten ways to clean your data</a></li>
        <li><a href="https://www.mikealche.com/software-development/a-humble-guide-to-database-schema-design">A humble guide to database schema design</a></li>
    </ul>
</div>

---

## 1.4 Worked data wrangling example

The following example was written by [Lisa Charlotte Rost @ Datawrapper.de](https://blog.datawrapper.de/prepare-and-clean-up-data-for-data-visualization/) and is used with permission. This has been edited slightly to conform to the approach in this course. This is a relatively simple example, but will guide you through the main challenges you will experience.

The worked tutorial uses data from the [World Bank](https://data.worldbank.org/indicator/SP.URB.TOTL?view=chart). There's a metadata link which opens the following dialogue:

![Urban population data series](images/01-01-urban-data-descriptive-metadata.jpg)

### 1.4.1 Look at the data

Download [this dataset](http://api.worldbank.org/v2/en/indicator/SP.URB.TOTL?downloadformat=excel) and look at the main sheet.

![Urban population source data](images/full-181101_excel1.png)

When you download an Excel file, it often has multiple sheets. Our data set has three of them, as seen on the bottom: "Data", "Metadata – Countries" and "Metadata – Indicators". Look through all of your sheets and make sure you understand what you're seeing there. Do the headers, file name and/or data itself indicate that you downloaded the right file? Are there footnotes? What do they tell you? Maybe that you're dealing with lots of estimates? (Does that maybe mean that you need to look for other data?) If you don't find notes in the data, make sure you look for them on the website of your source.

Our example data seems fine. There are no estimates we need to worry about. And we get a nice explanation of "Urban population" in the "Metadata – Indicators", beginning with "Urban population refers to people living in urban areas as defined by national statistical offices…" Awesome! That's something we can mention in our chart later on.

### 1.4.2 Rename your file

Now that we know what we're dealing with, let's make sure that we still do so in half a year. "API_SP.URB.TOTL_DS2_en_excel_v2_318520! Yes! I know exactly what that data was about!" said no one ever. (Except three employees at The World Bank.) So let's call it something memorable and precise: Worldbank_urban-population-per-country, for example.

### 1.4.3 Duplicate the data sheet(s) & never touch it again

This is one of the most important parts of the whole process: Before you change anything in the data, duplicate your data sheet in the same file.

Consider renaming your two datasheets, e.g. to "Raw data" and "Data" or to "Data – original" and "Data – edited". If you have a massive Excel file with lots of sheets, you can also create a backup file of the original document and store that in a "source" folder.

Why should you do this? Because you will heavily edit the data. I've learned the hard way that I'll always change the data more than anticipated. "I don't need to copy the data this time." I think. "I only want to clean it up a bit; I won't delete anything important." Two hours pass…and I need to download the data again from its original source because oh, yeah, I did delete this now-important column an hour ago. Learn from my mistakes, save yourself a lot of time and never edit the original data.

### 1.4.4 Save your source in an extra sheet

This trick, too, will make your future self want to pat your present self on the shoulder and say "Thank you": Create a new sheet, name it "Source" and add links to all the data sources you're using in your document. (And yes, you get bonus points for adding the date when you downloaded the file – just in case.)

### 1.4.5 Delete everything above the header

Excel files often come with information in extra rows above the actual data. In our case, nice World Bank employees want us to know that the data source is the "World Development Indicators", and that the data was last updated in October 2019. That's both good to know and something we can put in the chart. But these extra lines prevent us from sorting or filtering the data.

Simply get rid of all empty rows and all information above the header. Delete it (you can always check your "raw data" sheet when you need that information) or copy & paste the information to your "Source" sheet.

![A typical "human-readable" structured table of data](images/181101_excel3.gif)
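If you ever need to repeat this step on many files, it can be scripted. The sketch below assumes the file has been exported to CSV and that you know the text of the first header cell (`Country Name` here). The miniature sample, its values and the `split_preamble` helper are made up for illustration.

```python
import csv
import io

# A made-up miniature of a file with notes above the header row.
raw_csv = """Data Source,World Development Indicators
Last Updated Date,October 2019

Country Name,Indicator Code,2018
Example country,SP.URB.TOTL,100
"""


def split_preamble(rows, header_marker):
    """Separate the notes above the header from the table itself."""
    for index, row in enumerate(rows):
        if row and row[0] == header_marker:
            # Keep the non-empty note rows so that nothing known is lost.
            preamble = [r for r in rows[:index] if any(cell.strip() for cell in r)]
            return preamble, rows[index:]
    raise ValueError(f"no header row starting with {header_marker!r}")


rows = list(csv.reader(io.StringIO(raw_csv)))
preamble, table = split_preamble(rows, "Country Name")
```

The `preamble` rows are what you would paste into your "Source" sheet; `table` now starts with the header row.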

### 1.4.6 Unmerge cells and get rid of double-row headers

Sometimes, you will encounter headers that have two rows, not one. Especially when the table is created to communicate, not analyze, double-row headers can help make sense out of the information. But they will get in your way when you want to delete rows or columns eventually. Machine-readable data must come with one header row, and one header row only.

Unmerge double rows with merged cells like this:

![Double rows with merged cells.](images/full-181101_excel5-bad.png)

To this:

![The better alternative.](images/full-181101_excel5-good.png)

The same is true for merged cells. No matter where they are in your data set, get rid of them:

![More merged cells.](images/full-181101_excel4-bad.png)

To this:

![The better alternative.](images/full-181101_excel4-good.png)

The alternative to double-row headers and merged cells is to copy and paste text (e.g. "Afghanistan" in both examples above). Yes, writing down the same word(s) multiple times doesn't look as tidy as merged cells. But it is tidier in the long run.

In Google Sheets, you unmerge like this:

![Unmerging cells](images/181101_excel16.gif)
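After unmerging, usually only the first cell of each former merged block still holds the text, and the cells below it are blank. Copying the value down can be sketched as follows. The `fill_down` helper and the sample values are hypothetical, and the approach assumes that every blank really does come from a merged cell (a genuinely missing value would be overwritten).

```python
def fill_down(rows, column):
    """Copy the last non-blank value of a column into the blank cells beneath it."""
    filled = []
    last_seen = ""
    for row in rows:
        value = row[column].strip()
        if value:
            last_seen = value
        filled.append({**row, column: last_seen})
    return filled


# Made-up rows as they might look straight after unmerging.
unmerged = [
    {"country": "Afghanistan", "year": "2016", "value": "10"},
    {"country": "", "year": "2017", "value": "11"},
    {"country": "", "year": "2018", "value": "12"},
]
filled = fill_down(unmerged, "country")
```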

### 1.4.7 Bring metrics in the header & delete footnotes

To make sure that data tools like Datawrapper and Excel recognize numbers, ensure you have undisturbed numbers in your data cells. Undisturbed by thousands separators – but we'll take care of that later – and undisturbed by metrics. So free your data cells of all €, $, kg, %, km/h, etc. Instead, put them in the headers.

This will often require a series of `find - replace` steps, but always ensure you've created an additional column capturing these units so that you don't lose them. This is similar to splitting values that contain multiple terms.

This means going from metrics in data cells:

![Metrics in the data cells – not awesome.](images/full-181101_excel6-bad.png)

To either splitting them out into their own column or, as here, capturing them in the column field-name:

![Metrics in the header. Way better!](images/full-181101_excel6-good.png)

One thing to note in the above example is that a percentage is normally stored as a proportion of 1 (e.g. 48% is 0.48), so the values in this column are likely to be ambiguous. Correct this to ensure that your data will validate.
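The same clean-up can be sketched in Python for readers who want to automate it. The helper names `strip_unit` and `move_unit_to_header`, the column name `weight` and the sample cells are all hypothetical; the sketch assumes every cell in the column ends with the same unit and uses `.` as the decimal mark.

```python
def strip_unit(cell, unit):
    """Return the number in `cell` as a float, with the trailing `unit` removed."""
    text = cell.strip()
    if not text.endswith(unit):
        raise ValueError(f"{cell!r} does not end with {unit!r}")
    return float(text[: -len(unit)].strip())


def move_unit_to_header(header, cells, unit, suffix):
    """Return a new header carrying the unit, plus the cleaned numeric values."""
    return f"{header}_{suffix}", [strip_unit(cell, unit) for cell in cells]


header, values = move_unit_to_header("weight", ["12 kg", "7.5 kg"], "kg", "kg")
print(header, values)  # weight_kg [12.0, 7.5]

# A percentage stored as a proportion of 1.
print(strip_unit("48%", "%") / 100)  # 0.48
```

Raising an error on a cell that lacks the expected unit is deliberate: a column that mixes units needs a human decision, not a silent conversion.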

You will also need to extract footnotes. Values like `28.394†` or `1.39[^1]` won't be recognized as numbers. 

Before you delete them, make sure you understand the pattern in the data: Are all 2019 data points estimates? (Should you maybe exclude that year then?) Or is the data from a certain country measured differently? In all these cases, make sure to let the reader of your final chart know. Footnotes in the data you're using should always translate to specific definitions in the metadata file. Your users will need them.
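One way to keep footnotes without leaving them in the data cells is to split each cell into a number and a marker, and store the marker in its own column until its meaning is recorded in the metadata. The sketch below is an illustration only: the function name `split_footnote` is hypothetical, the list of marker styles is an assumption you should extend to match your source, and `.` is assumed to be the decimal mark.

```python
import re

# A number, optionally followed by a footnote marker such as †, ‡, * or [^1].
# The marker styles listed here are assumptions, not an exhaustive list.
CELL_PATTERN = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*(†|‡|\*+|\[\^?\w+\])?\s*$")


def split_footnote(cell):
    """Return (number, marker); marker is an empty string when there is none."""
    match = CELL_PATTERN.match(cell)
    if match is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    number, marker = match.groups()
    return float(number), marker or ""


print(split_footnote("28.394†"))   # (28.394, '†')
print(split_footnote("1.39[^1]"))  # (1.39, '[^1]')
print(split_footnote("5"))         # (5.0, '')
```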

### 1.4.8 Check if the header content makes sense

After doing these technical tasks, let's see if the header names you haven't touched yet actually make sense. Maybe they're just code gibberish like `SP.URB.TOTL`? If that's the case, go back to your source and find out what the codes mean. Or maybe they're too long? For example, `Country Name` can easily be reduced to `Country`. Rename the headers so that it would be easy for outsiders to make sense of them: short, but precise and unique. (Well, if you see columns that you plan to delete, don't bother with the renaming.)

Remember the naming conventions discussed earlier. You could also choose to convert `Country Name` to `country_name`, but be consistent in the way you name all your fields.
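If you settle on the `country_name` style, applying it consistently is easy to script. The sketch below is one possible convention, not a required one: the function names are hypothetical, and any character outside `a-z` and `0-9` (including accented letters) is replaced with an underscore, which may be too aggressive for some headers.

```python
import re


def to_snake_case(header):
    """Lower-case a header and replace runs of other characters with one underscore."""
    name = header.strip().lower()
    name = re.sub(r"[^a-z0-9]+", "_", name)
    return name.strip("_")


def find_duplicates(headers):
    """Return the normalised names that occur more than once."""
    seen, duplicates = set(), []
    for name in map(to_snake_case, headers):
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates


print(to_snake_case("Country Name"))  # country_name
print(to_snake_case("SP.URB.TOTL"))   # sp_urb_totl
print(find_duplicates(["Country Name", "country_name", "Year"]))  # ['country_name']
```

Checking for duplicates after renaming matters because two headers that looked different before normalising can collapse into the same name.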

### 1.4.9 Because it's convenient: Freeze the first row (and column)

Now you should be at a point where your headers look top-notch. Congrats! Let's make sure you always have these beauties in sight and freeze the header:

![Freeze rows & columns](images/181101_excel7.gif)

That's what our data looks like now. It's a bit cleaner already:

![Cleaner headers](images/full-181101_excel8.png)

### 1.4.10 Delete unnecessary columns and rows

Your data are intended to be source-data, so it should be rare that you ever need to remove anything, but sometimes data are redundant. Either because the data are repeats, or because they provide summaries that are not necessary because they can be calculated.

As one example, if you have line-item expenditure data, do you also need a row for the `total`? That is a decision you will need to make, but remember to document this in your descriptive metadata file.
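Before removing a `total` row, it is worth confirming that the total really can be recalculated from the line items. A minimal sketch of that check follows; the function name, the column names `item` and `amount`, the sample figures and the rounding tolerance of 0.005 are all assumptions made for illustration.

```python
def drop_verified_totals(rows, label_column, value_column, total_label="total"):
    """Return rows without the total row, after checking the total can be recomputed."""

    def is_total(row):
        return row[label_column].strip().lower() == total_label

    items = [row for row in rows if not is_total(row)]
    computed = sum(float(row[value_column]) for row in items)
    for row in rows:
        # Refuse to drop a total that disagrees with the sum of the line items.
        if is_total(row) and abs(float(row[value_column]) - computed) > 0.005:
            raise ValueError(f"stated total {row[value_column]} != computed {computed}")
    return items


# Hypothetical line-item expenditure data.
expenditure = [
    {"item": "Salaries", "amount": "100"},
    {"item": "Rent", "amount": "50"},
    {"item": "Total", "amount": "150"},
]
print(drop_verified_totals(expenditure, "item", "amount"))
```

If the check fails, the stated total holds information the line items do not, and that discrepancy belongs in your descriptive metadata rather than in the bin.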

### 1.4.11 Delete thousands separators

Thousands separators are characters (`,` in English, `.` in German, sometimes it's just a space) that make it easy to recognize the magnitude of a number. For example, `38.394.105` rounds faster to 38 million in our minds than `38394105`.

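But while they're great and helpful for humans, they're terrible for computers and lead to ambiguous interpretation: `1.234` is one thousand two hundred and thirty-four in German notation, but just over one in English notation. Let's get rid of any kind of thousands separator.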

`Find & replace` thousands separators to go from this:

![Thousand separators.](images/full-181101_excel10-bad.png)

To this:

![No thousand separators. Good!](images/full-181101_excel10-good.png)
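The find-and-replace step can also be sketched in Python. Because the same character can be a thousands separator in one country and a decimal mark in another, this hypothetical helper makes you state the separator explicitly instead of guessing it. It handles whole numbers only.

```python
def parse_grouped_integer(cell, separator):
    """Remove the given thousands separator and return the value as an int."""
    digits = cell.strip().replace(separator, "")
    if not digits.lstrip("-").isdigit():
        raise ValueError(f"{cell!r} is not a whole number once {separator!r} is removed")
    return int(digits)


print(parse_grouped_integer("38.394.105", "."))  # 38394105
print(parse_grouped_integer("38,394,105", ","))  # 38394105
print(parse_grouped_integer("38 394 105", " "))  # 38394105
```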

### 1.4.12 Standardise text

Review names & text (such as countries, poll questions, etc.) and see if they need correction, or if they should conform to a standard. There may be multiple ways people spell the name of a particular place, but there is likely to be one official spelling. If you know users may find this unclear, you may need to create a new column with a standard identifier code.

The UN, for example, maintains official codes for every country. These are very useful where a name has more than one common form: `Côte d'Ivoire` can also be presented as `Ivory Coast`, but both share a single code.
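A standard identifier column can be built from a small lookup table of the spellings you actually meet in your data. The sketch below is illustrative: the table is hand-made and deliberately tiny, the function name is hypothetical, and `CIV` is the ISO 3166-1 alpha-3 code for Côte d'Ivoire. In practice you would load the full official list rather than type it in.

```python
# Hand-made illustration: every known spelling maps to one standard code.
COUNTRY_CODES = {
    "côte d'ivoire": "CIV",
    "cote d'ivoire": "CIV",
    "ivory coast": "CIV",
}


def country_code(name):
    """Return the standard code for a country name, or None if it is unknown."""
    return COUNTRY_CODES.get(name.strip().lower())


for spelling in ["Côte d'Ivoire", "Ivory Coast", " cote d'ivoire ", "Ivory Cost"]:
    print(repr(spelling), "->", country_code(spelling))
```

Returning `None` for an unknown spelling, rather than guessing, gives you a list of names to review by hand.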

### 1.4.13 Correct dates

Dates are tricky to work with. There are so many different date formats ("Nov 1, 2019", "1/11/19", etc.). And Excel, Google Sheets, etc. don't store dates in any of these formats, but as a "serial number". The 1st of November 2019 becomes 43770 if you change the cell format to a number.

If you ever encounter such a strange number where you expected dates, simply change the cell format to dates. To do so, click through Format > Number > Date.

Always use ISO format for dates (i.e. `YYYY-MM-DD`). This is better for data analysis, since it permits automatic sorting, and it removes all doubt. The `MMM` format implies an abbreviated month name, which differs between countries, sites and languages and so leads to unnecessary confusion and inconsistency. Date formats are a perpetual cause of errors in large data series. If you are used to the American `MM/DD/YYYY` format, be aware that it is rarely used elsewhere; never use it in your data.
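Both conversions, from a serial number and from a text date, can be sketched with Python's standard `datetime` module. The function names are hypothetical. The sketch assumes the default 1900 date system used by Excel on Windows, for which 1899-12-30 is the conventional day zero for any date after February 1900, and it requires you to state the source format explicitly because a value like `1/11/19` cannot be interpreted safely without it.

```python
from datetime import date, datetime, timedelta

# Day zero for spreadsheet serial numbers in the 1900 date system (an assumption
# about how the file was saved; valid for dates after February 1900).
SERIAL_EPOCH = date(1899, 12, 30)


def serial_to_iso(serial):
    """Convert a spreadsheet serial number to a YYYY-MM-DD string."""
    return (SERIAL_EPOCH + timedelta(days=int(serial))).isoformat()


def text_to_iso(text, source_format):
    """Convert a text date in a known format to a YYYY-MM-DD string."""
    return datetime.strptime(text, source_format).date().isoformat()


print(serial_to_iso(43770))                # 2019-11-01
print(text_to_iso("1/11/19", "%d/%m/%y"))  # 2019-11-01
print(text_to_iso("11/1/19", "%m/%d/%y"))  # 2019-11-01
```

The last two lines show why the format must be known: the same day is written two different ways, and reading either with the wrong format gives 11 January instead.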

### 1.4.14 Split variables in separate columns

Sometimes, you'll see multiple variables in a column. Like a column with US states in the format `US-TX`. Or a column with companies and the product they sell: `Datawrapper (Software)`. You might not care. But when you want to analyze or visualize the data based on company products, you start to care. Good thing there are easy ways to separate the country from its states and the company from its product into two columns:

To make two (or more) columns out of one column, you can use the formula `=SPLIT(B1,"-")` in Google Sheets, or the formulas `=LEFT()`, `=RIGHT()` and `=MID()`, which work in both Excel and Google Sheets. In Excel, the "Text to Columns" feature does the same job as `=SPLIT()`. The Datawrapper article ["How to split and extract text from data columns in Excel & Google Sheets"](https://blog.datawrapper.de/split-and-extract-text-in-Excel-and-Google-Sheets) walks through the steps.
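For comparison, here is how the two examples above might be split in Python. Both function names are hypothetical. The first assumes exactly one separator matters (it splits on the first occurrence only); the second assumes the second variable is wrapped in a single pair of brackets at the end of the cell.

```python
import re


def split_code(cell, separator="-"):
    """Split 'US-TX' into ('US', 'TX'); raises ValueError if the separator is missing."""
    left, right = cell.split(separator, 1)
    return left.strip(), right.strip()


# Text, then a bracketed label at the end of the cell.
LABEL_PATTERN = re.compile(r"^(.*?)\s*\(([^)]*)\)\s*$")


def split_bracketed(cell):
    """Split 'Datawrapper (Software)' into ('Datawrapper', 'Software')."""
    match = LABEL_PATTERN.match(cell)
    if match is None:
        return cell.strip(), ""  # no bracketed part: leave the second column empty
    return match.group(1).strip(), match.group(2).strip()


print(split_code("US-TX"))                        # ('US', 'TX')
print(split_bracketed("Datawrapper (Software)"))  # ('Datawrapper', 'Software')
print(split_bracketed("Datawrapper"))             # ('Datawrapper', '')
```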

### 1.4.15 Add columns with additional information (e.g. geographical IDs)

Sometimes, you have data in two different Excel documents. Or in two different sheets of the same Excel file. Sometimes, this information is necessary to visualize the data – e.g. when you need the correct geographical IDs to create a choropleth map, but they're not in your original data source.

These fields are known as `foreign keys` in a database and they permit diverse data to be joined on this common column because of these unique identities.

In our sample dataset, we have some extra information, too. In the "Metadata - Countries" sheet, The World Bank data explains which region and which income group all listed countries are in.

To bring the region and/or income group into our actual "Data" sheet…

- …we can sort the country column alphabetically in both sheets and then copy and paste the region and income group column from their sheet to the "Data" sheet. This approach only works when we have unique values (otherwise sorting becomes unreliable), and if we are 100% certain that the values are the same. As soon as the two columns differ in length, we need to choose the next method:
- …we can use a formula such as `=VLOOKUP(B1,'Metadata - Countries'!A1:B100,2,FALSE)`. This example assumes the lookup values are in column A of the metadata sheet and the values to bring over are in column B; the lookup range must span at least as many columns as the column index you ask for, and `FALSE` requests an exact match. This formula is not super easy-peasy to use, but Microsoft itself does a fairly good job at [explaining it here](https://support.office.com/en-us/article/vlookup-function-0bbc8083-26fe-4963-8ab8-93a18ad188a1).
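The lookup approach amounts to a join on a shared key column. A minimal Python sketch of an exact-match lookup follows; the function name, the column names and the sample rows are illustrative assumptions rather than the actual layout of the World Bank file.

```python
def add_lookup_columns(data_rows, lookup_rows, key, columns):
    """Copy `columns` from lookup_rows into data_rows, matching exactly on `key`."""
    lookup = {}
    for row in lookup_rows:
        if row[key] in lookup:
            raise ValueError(f"duplicate key in lookup table: {row[key]!r}")
        lookup[row[key]] = row

    joined = []
    for row in data_rows:
        match = lookup.get(row[key])
        new_row = dict(row)
        for column in columns:
            # Leave the cell empty when there is no match, so gaps stay visible.
            new_row[column] = match[column] if match else ""
        joined.append(new_row)
    return joined


# Illustrative rows only.
data = [
    {"country_code": "AFG", "2018": "25.5"},
    {"country_code": "XXX", "2018": "1.0"},
]
metadata = [
    {"country_code": "AFG", "region": "South Asia", "income_group": "Low income"},
]
for row in add_lookup_columns(data, metadata, "country_code", ["region", "income_group"]):
    print(row)
```

Rejecting duplicate keys in the lookup table mirrors the point made above: a join is only reliable when the identifier is unique.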

### 1.4.16 Bring the data into the wide format

One can arrange the same data in rows and columns in different formats. Our sample data from The World Bank is in a so-called "wide" format: The values of each year are in a new column, so one row comes with many values. Datawrapper will understand this format well, so we can just copy and paste it into Datawrapper's first step, transpose the data and create, for example, a line chart out of it.

But sometimes, your data will be in a different layout: The "long" format. In the long format, each value has its own row. Datawrapper won't be able to handle the long format, so you'll need to convert it to the wide format first.

To transform data from the long format to the wide format, you can use a feature called "pivot tables" in Excel or Google Sheets. You can learn how to use them in our article "How to get data in the right format with pivot tables".
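To make the difference between the two layouts concrete, here is a small Python sketch of the long-to-wide transformation that a pivot table performs. The function name, the column names and the values are hypothetical, and the sketch assumes each combination of row label and column label occurs only once.

```python
def long_to_wide(rows, index, column, value):
    """Turn one-value-per-row data into one row per `index` with a column per `column`."""
    wide = {}
    for row in rows:
        record = wide.setdefault(row[index], {index: row[index]})
        if row[column] in record:
            raise ValueError(f"duplicate entry for {row[index]!r}, {row[column]!r}")
        record[row[column]] = row[value]
    return list(wide.values())


# Hypothetical long-format data: each value has its own row.
long_rows = [
    {"country": "Afghanistan", "year": "2017", "value": "10"},
    {"country": "Afghanistan", "year": "2018", "value": "12"},
    {"country": "Albania", "year": "2017", "value": "7"},
    {"country": "Albania", "year": "2018", "value": "8"},
]
for row in long_to_wide(long_rows, "country", "year", "value"):
    print(row)
# {'country': 'Afghanistan', '2017': '10', '2018': '12'}
# {'country': 'Albania', '2017': '7', '2018': '8'}
```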

That's it! We went through the whole process. Phew. Now your data should be ready to be analyzed and visualized. You might want to delete some more rows and columns (or add them back in). For now, our spreadsheet looks like this:

![Wrangled final version](images/full-181101_excel15.png)

## 1.5 Lesson tutorial

<div class="alert alert-block alert-success">
    <p><b>Tutorial:</b></p>
    <p>Pick a spreadsheet from <a href="data/lesson-spreadsheet/">training data</a> and restructure it according to the techniques and requirements presented in this lesson.</p>
    <p>The data in this folder are also sourced from the World Bank, but from a long time ago, before the World Bank knew what open or machine-readable data was. They contain some of the worst examples of data-mangling you will ever see.</p>
</div>

Please complete this tutorial before beginning the next lesson. If you are participating in a taught class, please send your tutorial submission via the required process (email or online).