# Data Wrangling

Data wrangling takes up a substantial portion of a data professional's working life. It encompasses the steps you take to organize and clean your underlying data before analysis, including merging and appending datasets, finding typos, and creating new variables.

## Tidy data

Before we get to the analysis stage of a project, you'll likely first want to arrive at a "cleaned" dataset. Some may call this an "analysis" file, which implies you can start doing analysis without much more fuss. The exact structure of such an analysis file will vary between projects, but there are some fundamental concepts that are common across many situations. Let's start by taking a look at this example from the book, _R for Data Science_ by Hadley Wickham ([Chapter 12](https://r4ds.had.co.nz/tidy-data.html)). Consider the following tables of data.

In [1]:
# Loading packages
import pandas as pd  # The de facto way to handle data in Python

# Table (1)
infections_tidy = pd.DataFrame(
    [
        ["Afghanistan", 1999, 745, 19987071],
        ["Afghanistan", 2000, 2666, 20595360],
        ["Brazil", 1999, 37737, 172006362],
        ["Brazil", 2000, 80488, 174504898],
        ["China", 1999, 212258, 1272915272],
        ["China", 2000, 213766, 1280428583],
    ],
    columns=["country", "year", "cases", "population"],
)

# Table (2)
infections_too_long = pd.DataFrame(
    [
        ["Afghanistan", 1999, "cases", 745],
        ["Afghanistan", 1999, "population", 19987071],
        ["Afghanistan", 2000, "cases", 2666],
        ["Afghanistan", 2000, "population", 20595360],
        ["Brazil", 1999, "cases", 37737],
        ["Brazil", 1999, "population", 172006362],
        ["Brazil", 2000, "cases", 80488],
        ["Brazil", 2000, "population", 174504898],
        ["China", 1999, "cases", 212258],
        ["China", 1999, "population", 1272915272],
        ["China", 2000, "cases", 213766],
        ["China", 2000, "population", 1280428583],
    ],
    columns=["country", "year", "variable", "value"],
)

# Table (3a)
infections_just_cases = pd.DataFrame(
    [["Afghanistan", 745, 2666], ["Brazil", 37737, 80488], ["China", 212258, 213766]],
    columns=["country", "1999", "2000"],
)

# Table (3b)
infections_just_population = pd.DataFrame(
    [
        ["Afghanistan", 19987071, 20595360],
        ["Brazil", 172006362, 174504898],
        ["China", 1272915272, 1280428583],
    ],
    columns=["country", "1999", "2000"],
)

If you wanted to calculate the infection rate (cases per population) and plot it over time by country, which dataset makes that the easiest? Table (1) does, because it is "tidy". Three rules make a dataset tidy:

- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
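To see why the tidy layout helps, here is a minimal sketch of the infection-rate calculation on Table (1). The table is rebuilt so the example runs on its own, and the names `rate_per_10k` and `rate_by_country` (as well as the choice of a per-10,000 scale) are illustrative assumptions rather than anything prescribed above.

```python
import pandas as pd

# Rebuild the tidy table so this example is self-contained
infections = pd.DataFrame(
    [
        ["Afghanistan", 1999, 745, 19987071],
        ["Afghanistan", 2000, 2666, 20595360],
        ["Brazil", 1999, 37737, 172006362],
        ["Brazil", 2000, 80488, 174504898],
        ["China", 1999, 212258, 1272915272],
        ["China", 2000, 213766, 1280428583],
    ],
    columns=["country", "year", "cases", "population"],
)

# With one variable per column, the rate is a single vectorized expression
infections["rate_per_10k"] = infections["cases"] / infections["population"] * 10_000

# One row per year and one column per country is a convenient shape for line plots
rate_by_country = infections.pivot(index="year", columns="country", values="rate_per_10k")
print(rate_by_country)
```

With the other tables, the same calculation would first require a reshape or a join.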

The `pandas` library has several functions to help us translate these untidy tables to tidy ones.

In [3]:
# Table (2): spread each `variable` into its own column
pd.pivot(
    infections_too_long, index=["country", "year"], columns="variable", values="value"
).reset_index().rename_axis(None, axis=1)

       country  year   cases  population
0  Afghanistan  1999     745    19987071
1  Afghanistan  2000    2666    20595360
2       Brazil  1999   37737   172006362
3       Brazil  2000   80488   174504898
4        China  1999  212258  1272915272
5        China  2000  213766  1280428583


In [4]:
# Tables (3a) + (3b)

# Step (1): Reshape to long
# Step (2): Rename variables
infections_cases_long = infections_just_cases.melt(
    id_vars="country", var_name="year", value_name="cases"
)

infections_population_long = infections_just_population.melt(
    id_vars="country", var_name="year", value_name="population"
)

# Step (3): Join the data
infections_cases_long.merge(infections_population_long, on=["country", "year"])

       country  year   cases  population
0  Afghanistan  1999     745    19987071
1       Brazil  1999   37737   172006362
2        China  1999  212258  1272915272
3  Afghanistan  2000    2666    20595360
4       Brazil  2000   80488   174504898
5        China  2000  213766  1280428583
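One caveat with reshaping Tables (3a) and (3b): the years started out as column labels, so after `melt` the `year` column holds strings rather than integers. A minimal sketch of converting it, using only the cases table rebuilt here for self-containment (the names `cases_long` and `year_before` are illustrative):

```python
import pandas as pd

cases_wide = pd.DataFrame(
    [["Afghanistan", 745, 2666], ["Brazil", 37737, 80488], ["China", 212258, 213766]],
    columns=["country", "1999", "2000"],
)

cases_long = cases_wide.melt(id_vars="country", var_name="year", value_name="cases")

# The former column labels come through as strings
year_before = cases_long["year"].tolist()

# Convert to integers so that sorting, filtering, and plotting treat years as numbers
cases_long["year"] = cases_long["year"].astype(int)
print(cases_long.dtypes)
```

Doing the conversion before the merge keeps the join keys the same type on both sides.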


## Tidy values

The previous section discussed the overall structure of the dataset. Now let's dig into common problems you'll run into with actual values (cells) of your data. We can divide these into two categories:

- Strings
- Numeric

There are of course other data types (dates, booleans, and factors), but let's focus on these two for the moment.

### Strings

Strings can be very messy! Let's take the simple example of just asking people to write down what campus they attend.

In [5]:
student_campuses = pd.DataFrame(
    [
        [1, "BERKELEY"],  # Capitalization
        [2, "UCLA "],  # Trailing space
        [3, " UCLA"],  # Leading space
        [4, "U.C.L.A."],  # Punctuation
        [5, "Berkely "],  # Typos
        [6, "UC  Berkeley"],  # Embedded blanks
        [7, "UCB"],  # Abbreviations
        [8, "Not applicable"],  # Missing
    ],
    columns=["student_id", "campus"],
)

In [6]:
# Let's make all campuses uppercase
student_campuses["campus"].str.upper()

0          BERKELEY
1             UCLA 
2              UCLA
3          U.C.L.A.
4          BERKELY 
5      UC  BERKELEY
6               UCB
7    NOT APPLICABLE
Name: campus, dtype: object

In [7]:
# Let's get rid of the extra spaces
student_campuses["campus"].str.strip().replace(r"\s+", " ", regex=True)

0          BERKELEY
1              UCLA
2              UCLA
3          U.C.L.A.
4           Berkely
5       UC Berkeley
6               UCB
7    Not applicable
Name: campus, dtype: object

In [8]:
# Let's get rid of the punctuation
student_campuses["campus"].str.replace(".", "", regex=False)

0          BERKELEY
1             UCLA 
2              UCLA
3              UCLA
4          Berkely 
5      UC  Berkeley
6               UCB
7    Not applicable
Name: campus, dtype: object

In [9]:
# Let's fix typos
student_campuses["campus"].str.replace("Berkely", "Berkeley", regex=False)

0          BERKELEY
1             UCLA 
2              UCLA
3          U.C.L.A.
4         Berkeley 
5      UC  Berkeley
6               UCB
7    Not applicable
Name: campus, dtype: object

In [10]:
# Let's standardize values ("UC", any whitespace, then "Berkeley")
student_campuses["campus"].str.replace(r"UC\s+Berkeley", "UCB", regex=True)

0          BERKELEY
1             UCLA 
2              UCLA
3          U.C.L.A.
4          Berkely 
5               UCB
6               UCB
7    Not applicable
Name: campus, dtype: object

In [11]:
# Cleaning up missing values
student_campuses["campus"].str.replace("Not applicable", "", regex=False)

0        BERKELEY
1           UCLA 
2            UCLA
3        U.C.L.A.
4        Berkely 
5    UC  Berkeley
6             UCB
7                
Name: campus, dtype: object

In [12]:
# Now let's put it all together!
student_campuses["consistent_campus"] = (
    student_campuses["campus"]
    .str.upper()  # Capitalization
    .str.strip()  # Leading and trailing spaces
    .replace(r"\s+", " ", regex=True)  # Embedded blanks
    .str.replace(".", "", regex=False)  # Punctuation
    .str.replace("BERKELY", "BERKELEY", regex=False)  # Typos
    .str.replace(r"(UC\s)?BERKELEY", "UCB", regex=True)  # Abbreviations
    .str.replace("NOT APPLICABLE", "", regex=False)  # Missing
)
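A quick way to check a cleaning pipeline like this is to tabulate the result and confirm that only the expected categories remain. The sketch below repeats the same steps on a standalone Series; recoding the placeholder as a true missing value with `where` (instead of an empty string) is an optional variation, not part of the pipeline above.

```python
import pandas as pd

raw = pd.Series(
    ["BERKELEY", "UCLA ", " UCLA", "U.C.L.A.", "Berkely ", "UC  Berkeley", "UCB", "Not applicable"]
)

cleaned = (
    raw.str.upper()
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.replace(".", "", regex=False)
    .str.replace("BERKELY", "BERKELEY", regex=False)
    .str.replace(r"(UC\s)?BERKELEY", "UCB", regex=True)
)

# Variation: keep every value except the placeholder, which becomes missing
cleaned = cleaned.where(cleaned != "NOT APPLICABLE")

# Tabulate, including missing values, to spot any leftover variants
counts = cleaned.value_counts(dropna=False)
print(counts)
```

If an unexpected label shows up in the tabulation, that is a sign another cleaning rule is needed.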

### String or number?

Sometimes you have a number stuck in a string (e.g., `"$1,000"`) and sometimes you have a string posing as a number (e.g., ZIP Codes). Let's go through an example of each and how to solve these issues.

In [13]:
zcta_earnings = pd.DataFrame(
    [
        [1240, "$33,530"],
        [1242, "*"],
        [89010, "$26,172"],
        [89019, "$36,354"],
    ],
    columns=["zcta", "median_earnings"],
)

Some identification codes consisting of digits (e.g., ZIP Codes and FIPS codes) may be mistakenly stored or loaded as numbers. Why does this matter? In general, we wouldn't want to treat a ZIP Code as a number: it makes no sense to say that 89010 is "greater than" 1240, and leading zeros are dropped (01240 becomes 1240). Ideally, such variables should be stored as strings instead of numbers.

In [14]:
zcta_earnings["zcta"].astype(str).str.pad(width=5, fillchar="0")

0    01240
1    01242
2    89010
3    89019
Name: zcta, dtype: object
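If the codes come from a CSV file, the problem can often be avoided at load time by passing a `dtype` to `pd.read_csv`. The sketch below reads a small in-memory CSV (made up for illustration) both ways to show the difference.

```python
import io

import pandas as pd

# A tiny CSV standing in for a real file; the contents are illustrative
csv_text = 'zcta,median_earnings\n01240,"$33,530"\n89010,"$26,172"\n'

# Default parsing infers integers and silently drops the leading zero
default_load = pd.read_csv(io.StringIO(csv_text))

# Declaring the column as a string preserves the code exactly as written
string_load = pd.read_csv(io.StringIO(csv_text), dtype={"zcta": str})

print(default_load["zcta"].tolist())
print(string_load["zcta"].tolist())
```

Reading the column as a string from the start means no padding step is needed afterwards.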

In [15]:
# Strip "$", "," and the "*" placeholder (raw string avoids invalid-escape warnings)
pd.to_numeric(zcta_earnings["median_earnings"].str.replace(r"\$|,|\*", "", regex=True))

0    33530.0
1        NaN
2    26172.0
3    36354.0
Name: median_earnings, dtype: float64
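When the set of placeholder symbols is not known in advance, an alternative is to strip only the formatting characters and let `pd.to_numeric` turn anything unparseable into a missing value with `errors="coerce"`. A minimal sketch, with the `"N/A"` entry added purely as a hypothetical second placeholder:

```python
import pandas as pd

earnings = pd.Series(["$33,530", "*", "$26,172", "N/A"])

# Remove only the currency symbol and thousands separator
stripped = earnings.str.replace(r"[$,]", "", regex=True)

# Anything that still isn't a number becomes NaN instead of raising an error
numeric = pd.to_numeric(stripped, errors="coerce")
print(numeric)
```

The trade-off is that genuine data-entry errors are also silently converted to missing, so it is worth counting the missing values afterwards.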

### Numeric

Numeric variables can be represented as integers or decimal values (float/double). Sometimes integers depict measurements that represent actual integer values (e.g., number of apartments in a building) and sometimes they represent ordinal values (e.g., survey responses of "strongly disagree", "disagree", etc.).

Let's create an example dataset that contains some of these kinds of variables.

In [16]:
housing = pd.DataFrame(
    [
        [1, 1, 35, 6, 57324],
        [2, 1, 27, 5, 67366],
        [3, 2, 42, -99, 47343],
        [4, 3, 56, 4, -43123],
    ],
    columns=["resident_id", "unit_id", "age", "education", "income"],
)

One of the first things to do with numeric values is to make sure you're representing missing values correctly. Many data sources encode missing values with a placeholder value like -99. Double-check your data documentation to make sure you know how missing values are encoded.

In [17]:
import numpy as np  # Use NumPy missing values

housing["education"].replace(-99, np.nan)

0    6.0
1    5.0
2    NaN
3    4.0
Name: education, dtype: float64
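Notice that the column became `float64`, because `np.nan` is a floating-point value. If keeping integer values matters, one option is the pandas nullable integer dtype `"Int64"`. This is a sketch of that alternative on a standalone Series, not a required step.

```python
import pandas as pd

education = pd.Series([6, 5, -99, 4], name="education")

# Convert to the nullable integer dtype, then blank out the placeholder
education_clean = education.astype("Int64").mask(education == -99)
print(education_clean)
```

The missing entry is shown as `<NA>` and the remaining values stay integers.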

Sometimes numeric values must fall within an interval. For example, the fraction of days absent in a school year must fall between zero and one, and height must be greater than 0. However, there are sometimes ambiguous situations. For example, we might think income needs to be at least zero. But if someone includes business losses in their reported income, negative values might be plausible! For now, let's suppose losses are not possible in our dataset. We can recode all values we think are invalid as missing.

In [18]:
# Recode negative incomes as missing; `mask` avoids chained assignment
housing["income"] = housing["income"].mask(housing["income"] < 0)
housing["income"]

0    57324.0
1    67366.0
2    47343.0
3        NaN
Name: income, dtype: float64
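Before recoding, it can help to count how many values fall outside the expected interval. The sketch below uses `Series.between` on the fraction-of-days-absent example; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical fractions of days absent; valid values lie in [0, 1]
absent = pd.Series([0.05, 0.0, 1.2, -0.1, 0.5])

# `between` is inclusive of both endpoints by default
in_range = absent.between(0, 1)
n_invalid = (~in_range).sum()

# Keep valid values and set everything else to missing
absent_clean = absent.where(in_range)
print(n_invalid)
print(absent_clean)
```

A large count of out-of-range values usually points to a units or encoding problem rather than to individual bad records.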

Once you're more confident with the values in your numeric columns, you're ready to do the calculations needed for your analysis! Sometimes this could mean doing transformations of variables.

In [19]:
housing["ln_income"] = np.log(housing["income"])

Or sometimes you need to do aggregations.

In [21]:
housing.groupby("unit_id").agg({"income": "mean"}).reset_index()

   unit_id   income
0        1  62345.0
1        2  47343.0
2        3      NaN
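Note that `mean` skips missing values, so a group whose values are all missing (unit 3 here) comes back as `NaN`. Reporting how many non-missing values went into each mean makes this visible. A minimal sketch using named aggregation, where the output column names `mean_income` and `n_reported` are illustrative:

```python
import numpy as np
import pandas as pd

housing = pd.DataFrame(
    {
        "unit_id": [1, 1, 2, 3],
        "income": [57324.0, 67366.0, 47343.0, np.nan],
    }
)

summary = (
    housing.groupby("unit_id")
    .agg(
        mean_income=("income", "mean"),  # Ignores missing values
        n_reported=("income", "count"),  # Counts non-missing values only
    )
    .reset_index()
)
print(summary)
```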
