In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import os

# Messy Data and Data Cleaning

## Outline

* Data Generating Processes.
* Introduction to field types.
* Outliers: how to spot them and fix them.
* Missing values: understanding them and dropping them.

## There is no such thing as 'raw data'.

* Data are the result of measurements that must be recorded.
* Humans design the measurements and record the results.
* Data is *always* an imperfect record of the underlying process being measured.

## Data Generating Process

* A **data generating process** is the underlying, real-world (probabilistic) mechanism that generates the observed data. 
* Observed data is an incomplete artifact of the data generating process.
* A data generating process is what a statistical model attempts to describe.

Cleaning data requires understanding of the data generating process.

### Example: Unemployment Data
* Problem: predict the effect policy X has on the US labor market
    - Does it decrease unemployment? Increase wages?
* Data: [labor force data](https://www.bls.gov/cps/cps_htgm.pdf) collected by the BLS


### Example: Unemployment Data
* Sample quality: Is the BLS data a sample or census?
* Measurement; who is counted as:
    - employed? unemployed? underemployed? not in the labor force?
* The data are generated according to a *political* process!
    - For an introduction, see [this article](https://www.nytimes.com/2018/09/14/opinion/columnists/great-recession-economy-gdp.html)

## Data Provenance: can I trust my data?

Data provenance means understanding as much as possible about the lineage of the data, from:
1. the assumptions about the data generating process, to
2. the initial measurements of that (or a similar) process, to  
3. the data in its eventually acquired form.

# Data Cleaning

## Data Cleaning

* The process of transforming data:
    - into a faithful representation of an underlying data generating process 
    - to facilitate subsequent analysis.

* In practice, data cleaning is often detective work to understand data provenance.
    - **always be skeptical of your data!**

## Data cleaning often addresses:

* The structure of the recorded data 
    - are the individuals properly represented as rows?
* The encoding and format of the values in the data.
    - are the data types of a column reflective of the information it contains?
* Corrupt and "incorrect" data; missing values.
    - were there flaws in the 'data recording process'?

<img src="imgs/image_2.png"/>

### Discussion: are each of the following Quantitative, Ordinal, or Nominal?

1. **Price in dollars of a product?**
1. **Star Rating on Yelp?**
1. **Date/time an item was sold?**
1. **Day of the week an item was sold?**
1. **A Credit Card Number?**

### Answers:

|Question|Answer|
|---|---|
|Price in dollars of a product|Quantitative|
|Star Rating on Yelp|Ordinal|
|Date/time an item was sold|Quantitative|
|Day of week an item was sold|????|
|Credit Card Number|Nominal|
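
One way to record an ordinal field in Pandas is an ordered categorical. The sketch below uses made-up star ratings (the `stars` values are hypothetical, not from any dataset used here) and is only meant to show that order-based operations work once the ordering is declared.

```python
import pandas as pd

# Hypothetical star ratings, stored as plain integers.
stars = pd.Series([5, 3, 4, 1, 3])

# An ordered categorical records that 1 < 2 < ... < 5 without
# claiming that the gaps between ratings are equal.
ratings = pd.Series(
    pd.Categorical(stars, categories=[1, 2, 3, 4, 5], ordered=True)
)

print(ratings.min(), ratings.max())   # order-based summaries work
print((ratings >= 4).sum())           # comparisons respect the ordering
```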

## Converting data types in Pandas

### Student dataset
- **Student ID, Student Name** (should be clear)
- **Month, Day, Year**: date when student was accepted to UCSD
- **2018, 2019 tuitions and growth** (should be clear)
- **Paid**: Indicates if tuition is paid yet
- **DSC 80 Final Grade**: Some students may take the class Pass/Fail

What needs to be changed in the dataframe to compute statistics?

In [None]:
df = pd.read_csv("data/Data.csv")
df

### What is the sum of tuition paid in 2018 and 2019?
* Sum tuition columns using `+`
* Save it in a `pd.Series` called `total`.

In [None]:
df

In [None]:
total = df['2018 tuition'] + df['2019 tuition']
total

### Check the data types of the student table
* What data type *should* each column have?
* What kinds of data should each column have?
    - Quantitative, Categorical (Ordinal, Nominal)
* Use the `df.dtypes` attribute to peek at the data types.

In [None]:
df.dtypes

### Cleaning up: `Student ID`

* `Student ID` is a `float64`, should be `int64`
* May be a float value due to earlier processing with e.g. Excel.
* Change the type using `.astype` method
    - `.astype` returns a copy!

In [None]:
df['Student ID'] = df['Student ID'].astype(np.int64)
df
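
A small sketch of two `.astype` behaviors worth keeping in mind, using hypothetical ID values rather than the student table: the original Series is left untouched, and the cast to `np.int64` is refused when a `NaN` is present.

```python
import numpy as np
import pandas as pd

# Hypothetical IDs stored as floats.
ids = pd.Series([1001.0, 1002.0, 1003.0])

# .astype returns a new Series; `ids` itself keeps its dtype.
converted = ids.astype(np.int64)
print(ids.dtype, converted.dtype)

# With a missing value, the integer cast raises instead of guessing.
with_gap = pd.Series([1001.0, np.nan])
try:
    with_gap.astype(np.int64)
    cast_ok = True
except ValueError:
    # NaN has no int64 representation.
    cast_ok = False
print(cast_ok)
```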

### Cleaning up: `20xx tuition`

* The `20xx tuition` columns are stored as `object` dtype (strings), not numerical values.
* The formatting character ($) causes the entries to be interpreted as strings.
* Use `str` methods to strip the dollar sign.

In [None]:
# try this!
df['2018 tuition'].astype(np.float64)

In [None]:
# strip the $
df['2018 tuition'].str.strip('$').astype(np.float64)

In [None]:
# looping through *columns* is ok! don't loop through rows.

for col in df.columns:
    if 'tuition' in col:
        df[col] = df[col].str.strip('$').astype(np.float64)
        
df

### Cleaning up: `Paid`

* The `Paid` column should be either `bool` type, or {0,1}.
* Y/N are typical values from human entry.
* Use the `replace` method.

In [None]:
df['Paid'] = df['Paid'].replace({'Y': True, 'N': False})
df

### Cleaning up: `Month, Day, and Year`
* Each is an `int64` column; this could be *fine* for certain purposes.
* Could store as `objects` of the form `Year-Month-Day`
    - String sorting coincides with date sorting
* Could store as `datetime64` objects (later).

In [None]:
# What is happening with adding a Series and a string? (Broadcasting)
(
    df['Year'].astype(str) + '-' + 
    df['Month'].astype(str).str.zfill(2) + '-' + 
    df['Day'].astype(str).str.zfill(2)
)
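
For the `datetime64` option, a minimal sketch: `pd.to_datetime` can assemble dates from columns named `year`, `month`, and `day`. The small `dates` frame below is a hypothetical stand-in for the `Year`, `Month`, and `Day` columns of the student table.

```python
import pandas as pd

# Hypothetical stand-in for the Year/Month/Day columns.
dates = pd.DataFrame({'Year': [2017, 2018], 'Month': [9, 1], 'Day': [23, 5]})

# Lowercase the column names so they read year, month, day,
# then let pandas assemble one datetime per row.
parts = dates.rename(columns=str.lower)
accepted = pd.to_datetime(parts[['year', 'month', 'day']])

print(accepted.dtype)
print(accepted.dt.day_name())   # datetime accessors are now available
```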

### Cleaning up: `DSC 80 Final Grade`

* `DSC 80 Final Grade` stored as an object.
    - most entries should be numeric;
    - final entry cannot be converted.
* Can use `pd.to_numeric(Series, errors='coerce')`.
    - Be careful with this!
    - `errors='coerce'` can cause uninformed destruction of data.

In [None]:
# try: astype
df

In [None]:
df['DSC 80 Final Grade'] = pd.to_numeric(df['DSC 80 Final Grade'], errors='coerce')
df
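
To make coercion less "uninformed", one option is to look at exactly which entries would be turned into `NaN` before overwriting the column. This is a sketch on a hypothetical `grades` Series, not on the student table itself.

```python
import pandas as pd

# Hypothetical grades; 'P' stands for a Pass/Fail entry.
grades = pd.Series(['91.5', '78', '85.25', 'P'])

coerced = pd.to_numeric(grades, errors='coerce')

# Entries that were present before coercion but are null afterwards:
# these are the values that coercion would silently destroy.
lost = grades[coerced.isnull() & grades.notnull()]
print(lost.value_counts())
```

If `lost` contains anything unexpected, it is worth deciding how to handle those values explicitly rather than letting them become `NaN`.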

### Cleaning up: `Student Name`
* Need the `Student Name` column to have form **Last Name, First Name**.
* Use a custom function and the `apply` method.
    - `Series.apply(func)` applies the function `func` to each entry of `Series`.

In [None]:
df['Student Name']

In [None]:
def transpose_name(name):
    firstname, lastname = name.split()
    return lastname + ', ' + firstname

transpose_name('Aaron Fraenkel')

In [None]:
df['Student Name'].apply(transpose_name)
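
Note that `transpose_name` assumes every name has exactly two words. A slightly more forgiving sketch is below; `transpose_name_safe` is a hypothetical helper, and it assumes (possibly wrongly for some names) that the last whitespace-separated word is the last name.

```python
def transpose_name_safe(name):
    """Return 'Last, First ...', treating the final word as the last name."""
    # Split once from the right: 'Mary Ann Smith' -> ['Mary Ann', 'Smith'].
    parts = name.strip().rsplit(maxsplit=1)
    if len(parts) < 2:
        # Single-word names are returned unchanged.
        return name.strip()
    first, last = parts
    return last + ', ' + first

print(transpose_name_safe('Aaron Fraenkel'))
print(transpose_name_safe('Mary Ann Smith'))
```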

### More data type ambiguities

<div class="image-txt-container">

1. 1537660383 looks like a number, but is probably a date (Unix timestamp)

2. "USD 1,000,000" looks like a string, but is actually a number and a unit.

3. 02111 looks like a number, but is really a zip code (and isn't equal to 2,111)

<img src="imgs/image_3.png"/>

</div>
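
The three ambiguities above can be handled roughly as follows. This is a sketch: it assumes the timestamp is in seconds and that the currency string always has the form `UNIT amount`.

```python
import pandas as pd

# 1. A Unix timestamp (assumed to be in seconds) becomes a datetime.
ts = pd.to_datetime(1537660383, unit='s')

# 2. "USD 1,000,000" is split into a unit and a number.
amount_text = "USD 1,000,000"
unit, number = amount_text.split(maxsplit=1)
amount = int(number.replace(',', ''))

# 3. A zip code is kept as a string; casting to int loses the leading zero.
zip_code = "02111"

print(ts, unit, amount, zip_code, int(zip_code))
```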

## How well does the data capture "reality"

* Does my data contain unrealistic or "incorrect" values?
    - Dates in the future for events in the past
    - Locations that don't exist
    - Negative counts
    - Misspellings of names
    - Large outliers


## How well does the data capture "reality"

    
* Does my data violate obvious dependencies?
    - E.g., age and birthday don't match 
    

* Was the data entered by hand?
     - Spelling errors, fields shifted …
     - Did the form require fields or provide default values?
     
* Are there obvious signs of curb stoning (data falsification):
    - Repeated names, fake looking email addresses, repeated use of uncommon names or fields.
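
A dependency check such as "age and birthday match" can be sketched as below. The `people` table, its column names, and the `as_of` date (the date on which the ages were supposedly recorded) are all hypothetical.

```python
import pandas as pd

# Hypothetical records with a birthday and a separately entered age.
people = pd.DataFrame({
    'birthday': pd.to_datetime(['1990-06-15', '2000-12-31', '1985-01-01']),
    'age': [29, 19, 40],
})
as_of = pd.Timestamp('2020-01-01')  # assumed recording date

b = people['birthday']
# Has the person already had a birthday in the as_of year?
had_birthday = (b.dt.month < as_of.month) | (
    (b.dt.month == as_of.month) & (b.dt.day <= as_of.day)
)
expected_age = as_of.year - b.dt.year - (~had_birthday).astype(int)

# Rows where the recorded age disagrees with the birthday.
mismatch = people[people['age'] != expected_age]
print(mismatch)
```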

# Vehicle Stop Data. Practical Example

## Data Source

<img src="imgs/image_4.png"/>


# Police Vehicle Stops

Vehicle stops made by the San Diego Police Department. 

Vehicle Stops files contain all vehicle stops for a given year.

<img src="imgs/image_5.png"/>

# SDPD Vehicle Stop Data


### Identifying messy data, general questions. 

1. Check the data types, notice any issues? What should we do?
2. Do string-type fields have consistent values?
3. Are there missing values that we don't understand?
4. Do all values fall in a reasonable range?
5. How do we deal with the messiness we found?

In [None]:
fp = os.path.join('data', 'Vehicle_stops_2016_datasd.csv')
stops = pd.read_csv(fp)
stops.head()

### SDPD vehicle stops: data types
* Are the data types correct?
* Are they easily fixable?

In [None]:
# are the data types correct? How to fix them?
stops.info()

### SDPD vehicle stops: unfaithfulness
* Are there suspicious values?
* If a value is suspicious, can we trust the observation?
* Age: Nonsensical? Too old? Too young?

In [None]:
stops['subject_age'].unique()

In [None]:
ages = pd.to_numeric(stops['subject_age'], errors='coerce')
ages.head()

In [None]:
ages.describe()

In [None]:
# drop the rows? change age value to null? Is there really a 220 year old? (investigate!)
stops[(ages > 100)]

In [None]:
ages.loc[lambda x:(0<=x) & (x<16)].value_counts()

In [None]:
stops[(0 < ages) & (ages < 16)]

### SDPD data: unfaithful `subject_age`

* Values of 'No Age' and 0 are likely explicit null values.
* Are unusually small/large ages errors in data entry?
    - The rest of each record is well formed.
* Hard to tell for ages 14 and 15.
    - Each has more than one occurrence; possibly real?
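
One possible way to act on these observations, sketched on a small hypothetical sample in the style of `subject_age`: make the explicit nulls real `NaN`s, and only *flag* (not drop) the suspicious ages. The thresholds 16 and 100 mirror the cells above and are judgment calls.

```python
import pandas as pd

# Hypothetical sample in the style of the subject_age column.
raw_ages = pd.Series(['25', 'No Age', '0', '34', '220'])

# 'No Age' cannot be parsed, so it becomes NaN.
clean_ages = pd.to_numeric(raw_ages, errors='coerce')

# Treat 0 as an explicit null as well (keeps values where age != 0).
clean_ages = clean_ages.where(clean_ages != 0)

# Flag values that deserve investigation rather than deleting them.
suspicious = clean_ages[(clean_ages < 16) | (clean_ages > 100)]
print(clean_ages)
print(suspicious)
```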

### SDPD vehicle stops: human entered data
* Which fields were likely entered by a human?
* Which fields were likely generated by code?
    - what was the original source?

In [None]:
# stop cause
stops.stop_cause.value_counts()

In [None]:
# age distribution -- reasonable ages (e.g. 16-85)
ages.loc[lambda x:(x > 15) & (x<=85)].plot(kind='hist', bins=70)

In [None]:
# computer generated?
stops[['timestamp', 'stop_date', 'stop_time']].head()

In [None]:
stops[['timestamp', 'stop_date', 'stop_time']].tail(10)

## Unfaithful data vs Outliers

* Unfaithful data are data that don't accurately represent the data generating process being measured.
* Outliers are "unusual" observations, unlike the rest of the data. They may be unfaithful, but they may be real (and interesting) as well! 
* The two are hard to tell apart; doing so often requires research.
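
As a starting point for spotting "unusual" observations, a common heuristic (not something specific to this dataset) is the 1.5 × IQR rule. The sketch below applies it to hypothetical ages; it only flags candidates, and deciding whether they are unfaithful still requires research.

```python
import pandas as pd

# Hypothetical ages, one of them very large.
sample_ages = pd.Series([22, 25, 29, 31, 38, 40, 220])

q1, q3 = sample_ages.quantile(0.25), sample_ages.quantile(0.75)
iqr = q3 - q1

# Anything further than 1.5 * IQR outside the quartiles is flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = sample_ages[(sample_ages < lower) | (sample_ages > upper)]
print(flagged)
```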

# Outliers

* **Consistently "nonsense" values**
    - Is it a product of the data ingestion process? Time field has year 1899? Is it an inferred “default” value?
    - Solution: Change the value to the correct one!
    
* **Abnormal artifacts from the data collection process**
    - E.g. unreasonable spikes in recorded ages at round numbers (25, 35, 45)
    - Solution: Try "smoothing" (e.g. binning the ages)
        
* **Unreasonable outliers**
    - Data points with unrealistic and highly unreasonable values. E.g. age = 200
    - Solution: filter it? Maybe it points to bugs in the data collection? Maybe it's real and you should investigate!
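
The "smoothing by binning" idea can be sketched with `pd.cut`. The ages and the ten-year bin edges below are hypothetical; the point is that spikes at 25 and 35 disappear once ages are grouped into intervals.

```python
import pandas as pd

# Hypothetical ages with spikes at the round numbers 25 and 35.
spiky_ages = pd.Series([24, 25, 25, 25, 26, 34, 35, 35, 35, 36])

# Group into the intervals (20, 30] and (30, 40].
binned = pd.cut(spiky_ages, bins=[20, 30, 40])

# Count per bin, in bin order rather than by frequency.
counts = binned.value_counts(sort=False)
print(counts)
```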

# Missing Values

## Many reasons for missing values

* Missing values in a dataset can occur from:
    - Intentional logic, where a value doesn't make sense.
    - A non-response in the measurement process.
    - Mistakes in the recording process
    - ...
    
* Missing values are most often encoded with:
    - `NULL`, `None`, `NaN`, `""`

## Missing values come in many forms

* Missing values can appear as 'placeholder' values:
    - All forms of `0` are a common substitute for null.
    - -1 is common if a column must be non-negative.
    - 1900 and 1970 if a nonnull date is required.

## Missing values come in many forms

<div class="image-txt-container">
    
* These 'Missing Values' may be possible 'real' values!
* "Null Island" at 0°00'00.0"N+0°00'00.0"E
    - Null Island is a popular jogging location on the Strava fitness tracking app.
    - https://en.wikipedia.org/wiki/Null_Island

<img src="imgs/image_6.png"/>

</div>

### Messy missingness in vehicle stops data
* What are the non-`NaN` null values in the SDPD data?
    - Service Area: 'Unknown'
    - Subject Age: 0
    - Others?
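
Once such placeholder values are known, they can be declared when the file is read, using the `na_values` argument of `pd.read_csv`. The sketch below reads a tiny in-memory CSV; the column name `service_area` and the sample rows are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical excerpt in the style of the vehicle stops file.
csv_text = """service_area,subject_age
110,25
Unknown,0
120,No Age
"""

# Per-column placeholder values that should be read as NaN.
sample = pd.read_csv(
    io.StringIO(csv_text),
    na_values={'service_area': ['Unknown'], 'subject_age': ['No Age', '0']},
)
print(sample)
print(sample.isnull().sum())
```

Declaring placeholders per column avoids accidentally nulling a legitimate 0 in some other field.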

## Handling null values in Pandas

* Null values are encoded using NumPy's `NaN` object, which is of float type.
* Method `.isnull()` for DataFrame/Series detects missing values.
    - returns a boolean DataFrame/Series!
* Methods `.dropna()` and `.fillna()` handle missing data.

In [None]:
# proportion of people without an age recorded
stops.subject_age.isnull().mean()

In [None]:
# all columns null percentage
stops.isnull().mean()

### Handling null values: dropping observations
* What happens if any row with a null value is dropped?
* Best to not drop observations until it's needed!

In [None]:
stops.shape

In [None]:
stops.dropna().shape

In [None]:
# Percentage of dataset dropped:
stops.isnull().any(axis=1).mean()

### `.dropna` method

* `.dropna()` drops rows containing *at least one* null value.
* `.dropna(how='all')` drops any row that contains *only* null values.
* `.dropna(axis=1)` drops *columns* containing at least one null value.
* Other keyword arguments: `thresh`, `subset`

In [None]:
nans = pd.DataFrame([[0, 1, np.nan], [np.nan, np.nan, np.nan], [1, 2, 3]])
nans

In [None]:
nans.dropna(how='any')

In [None]:
nans.dropna(how='all')

In [None]:
nans.dropna(axis=1)

In [None]:
nans.dropna(subset=[0,1])
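
The `thresh` keyword is not shown above; a minimal sketch follows. `thresh=k` keeps only the rows with at least `k` non-null values. The frame is rebuilt here under the hypothetical name `frame` so the example is self-contained.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[0, 1, np.nan], [np.nan, np.nan, np.nan], [1, 2, 3]])

# Keep rows with at least 2 non-null entries: rows 0 and 2 survive.
at_least_two = frame.dropna(thresh=2)

# Keep rows with at least 3 non-null entries: only row 2 survives.
at_least_three = frame.dropna(thresh=3)

print(at_least_two)
print(at_least_three)
```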

### `.fillna` method

* `.fillna(val)` fills null entries with value `val`.
* `.fillna(dict)` fills null entries using a dictionary `dict` of column/row values.
* `.ffill()` and `.bfill()` fill null entries using neighboring non-null values (older code writes `.fillna(method='ffill')`, which recent Pandas versions deprecate).

In [None]:
nans

In [None]:
# fill with a fixed value
nans.fillna("FILLED!")

In [None]:
# fill using a column-dictionary
nans.fillna({0:'f0', 1:'f1', 2:'f2'})

In [None]:
# fill with the column mean
means = {c:nans[c].mean() for c in nans.columns}
nans.fillna(means)

In [None]:
# backfill up columns
nans.bfill()

In [None]:
# forward fill down columns
nans.ffill()

## Data Types and `NaN`

* Comparing anything to `NaN` with `==`, `<`, or `>` gives `False`; only `!=` gives `True`.
     - Use functions for checking null: `np.isnan`, `pd.isnull`
* `NaN` is of float-type.
* Be careful of Pandas type coercion with `NaN`!

In [None]:
for x in nans.iloc[0]:
    if x == np.nan:  # always False: NaN never compares equal, even to itself
        print("it's NaN!")
    else:
        print('nope!')

In [None]:
for x in nans.iloc[0]:
    if np.isnan(x):
        print("it's NaN!")
    else:
        print('nope!')

In [None]:
# series with null: ints are cast as float
nans = pd.Series([0, 1, np.nan])
nnan = pd.Series([0,1,1])

In [None]:
# with a null: float64; without nulls: int64
nans.dtype, nnan.dtype
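
If the integer type matters, Pandas also offers a nullable integer dtype, `'Int64'` (capital I), which stores missing entries as `pd.NA` instead of forcing a cast to float. A minimal sketch, with hypothetical variable names:

```python
import numpy as np
import pandas as pd

# Default behavior: a NaN forces the whole Series to float64.
with_nan = pd.Series([0, 1, np.nan])

# Nullable integer dtype: values stay integers, the gap is <NA>.
nullable = pd.Series([0, 1, pd.NA], dtype='Int64')

print(with_nan.dtype, nullable.dtype)
print(nullable.isnull())

# NaN is not equal to anything, including itself.
print(np.nan == np.nan, np.nan != np.nan)
```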