# Data Wrangling: Running Results

Analyze the race results published at [https://valentinslauf.de/Ergebnisse-Fotos/](https://valentinslauf.de/Ergebnisse-Fotos/).

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

In [None]:
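df = pd.read_csv(
    "results_10km.csv",
    encoding="iso8859-1",  # the file is Latin-1 encoded
    skiprows=2,            # skip the two lines above the header row
    sep=";",               # fields are separated by semicolons
    index_col=0            # use the first column as the index
)
df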

## Anonymize the data

The data contains personal information as defined by the GDPR. All participants agreed to the publication of their results, but this consent does not cover redistribution by third parties. The data must therefore be anonymized.

In [None]:
from faker import Faker

f = Faker()

In [None]:
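def anonymize(s):
    # Replace real names with a random first name; leave placeholder
    # strings and non-string values (e.g. NaN floats) unchanged.
    if isinstance(s, str) and s not in ["", "MCL", "NaN", "Kristian"]:
        return f.first_name()
    return s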

## What is Tidy Data?

According to [Tidy Data](https://vita.had.co.nz/papers/tidy-data.pdf) by Hadley Wickham (2014), in tidy data:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

(This corresponds to Codd's third normal form.)
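
As an illustration of the first two rules, the sketch below reshapes a small wide table into tidy form with `DataFrame.melt`. The table and its column names (`bib`, `split_5km`, `split_10km`) are invented for this example and are not taken from the results file.

```python
import pandas as pd

# Hypothetical wide table: the split distance is hidden in the column headers.
wide = pd.DataFrame(
    {
        "bib": [1, 2],
        "split_5km": ["00:22:10", "00:25:40"],
        "split_10km": ["00:45:10", "00:52:48"],
    }
)

# One row per (runner, split) observation, one column per variable.
tidy = wide.melt(id_vars="bib", var_name="split", value_name="time")
tidy["split"] = tidy["split"].str.replace("split_", "", regex=False)
print(tidy)
```

After reshaping, `split` is an ordinary variable that can be filtered, grouped, or plotted like any other column.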


## Tidy Data Checklist

Don'ts (common problems in messy data):

* Column headers are values, not variable names.
* Multiple variables are stored in one column.
* Variables are stored in both rows and columns.
* Multiple types of observational units are stored in the same table.
* A single observational unit is stored in multiple tables.
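
The second item in the list, multiple variables stored in one column, can often be repaired with `Series.str.extract`. The sketch below assumes a hypothetical `age_class` column that combines sex and age group in codes such as `M40`; the column name and the code format are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical messy table: sex and age group share a single column.
messy = pd.DataFrame({"bib": [1, 2, 3], "age_class": ["M40", "W35", "M20"]})

# Named groups in the pattern become the names of the new columns.
parts = messy["age_class"].str.extract(r"(?P<sex>[MW])(?P<age_group>\d+)")

tidy = messy.drop(columns="age_class").join(parts)
tidy["age_group"] = tidy["age_group"].astype(int)
print(tidy)
```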


## Useful phrases to inspect data
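
A few pandas calls that are commonly used for a first look at a table are sketched below. The small `df` and its columns `age_class` and `net_seconds` are hypothetical stand-ins for the results table, so the calls can be run without the original file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the results table.
df = pd.DataFrame(
    {
        "age_class": ["M40", "W35", "M40", None],
        "net_seconds": [2710.0, 3168.0, np.nan, 2955.0],
    }
)

print(df.head())      # first rows
print(df.shape)       # (number of rows, number of columns)
df.info()             # column dtypes and non-null counts
print(df.describe())  # summary statistics of numeric columns
print(df["age_class"].value_counts(dropna=False))  # frequency of each category
print(df.isna().sum())  # number of missing values per column
```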

## What to put in the index of a DataFrame?
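
One common convention, offered here as a suggestion rather than a rule, is to use a column that identifies each row uniquely, such as a bib number. The sketch below uses a hypothetical table with invented columns `bib` and `net_seconds`, checks uniqueness first, and then uses the index for label-based lookup.

```python
import pandas as pd

# Hypothetical table with one row per runner.
df = pd.DataFrame({"bib": [101, 102, 103], "net_seconds": [2710, 3168, 2955]})

# An index is most useful when every label identifies exactly one row.
if not df["bib"].is_unique:
    raise ValueError("bib numbers are not unique")

by_bib = df.set_index("bib")

# Label-based lookup of a single runner's time.
print(by_bib.loc[102, "net_seconds"])
```

Passing `index_col=0` to `pd.read_csv`, as in the cell that loads the data, has the same effect for the first column of the file.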

## Methods to fill missing values
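
The sketch below compares four standard pandas options on a small made-up series: a constant, a summary statistic, forward fill, and linear interpolation. Which one is appropriate depends on what the missing values mean in the data at hand.

```python
import numpy as np
import pandas as pd

# Made-up series with two gaps.
s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

constant = s.fillna(0)         # replace gaps with a fixed value
median = s.fillna(s.median())  # replace gaps with the median of the known values
forward = s.ffill()            # carry the last known value forward
linear = s.interpolate()       # linear interpolation between neighbours

print(pd.DataFrame(
    {"constant": constant, "median": median, "forward": forward, "linear": linear}
))
```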