# Chapter 0 - Why do we need robust data? 

<div>
<img src="../imgs/mind_the_gap.png" width="750"/>
<figcaption><em>Not every data gap is as obvious as this one...</em></figcaption>
</div>

## What's the deal with data? 

Before we start exploring how to wrangle real-life, messy data, it's key to understand why we would want to do so. Motivating the need for robust data can help us make decisions on how to generate, process, and interpret data. 

Nowadays, we often hear stories of analysis gone awry. Misapplication of cutting-edge technologies is all too common (e.g., machine learning's [reproducibility crisis](https://www.wired.com/story/machine-learning-reproducibility-crisis/)), but proper analytic techniques implemented in the context of flawed data can also lead to faulty conclusions throughout a research pipeline. 

Data can be flawed in many ways. A particularly salient example of a data flaw is a *biased dataset*. Some of the most common biases you'll find in data include: 

- Selection bias 
  - Ex: a test population that is not representative of the whole because it excludes a key minority group.
- Historical bias 
  - Ex: NLP word embedding models replicating gender-biased analogies like 'man::doctor, woman::nurse' due to historical disparities in opportunity.
- Survivorship bias 
  - Ex: evaluating the performance of hedge funds from 1990 to 2010: only those hedge funds that still exist in 2010 will be present, which already implies a certain degree of monetary success.
- Availability bias 
  - Ex: anytime we use a convenient data source instead of the best one.
- Outlier bias 
  - Ex: using summary statistics that mask, or are overly sensitive to, outliers to draw conclusions and drive decisions.
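
To make two of the biases above concrete, here is a minimal sketch using only Python's standard library. Every number in it is invented for illustration (the salaries, the fund names, and the returns are all hypothetical), so treat it as a toy demonstration of the effect rather than an analysis of real data.

```python
from statistics import mean, median

# --- Outlier bias ---
# Hypothetical annual salaries in USD; the last entry is an extreme outlier.
salaries = [48_000, 52_000, 55_000, 58_000, 61_000, 1_200_000]

mean_salary = mean(salaries)      # pulled upward by the single outlier
median_salary = median(salaries)  # stays close to the typical salary
below_mean = sum(s < mean_salary for s in salaries)

print(f"mean salary:   {mean_salary:,.0f}")
print(f"median salary: {median_salary:,.0f}")
print(f"{below_mean} of {len(salaries)} salaries fall below the mean")

# --- Survivorship bias ---
# Hypothetical funds: (name, average annual return, still operating at the end).
funds = [
    ("Fund A", 0.08, True),
    ("Fund B", 0.06, True),
    ("Fund C", -0.12, False),  # closed before the end of the period
    ("Fund D", 0.10, True),
    ("Fund E", -0.20, False),  # closed before the end of the period
]

all_funds_mean = mean([ret for _, ret, _ in funds])
survivors_mean = mean([ret for _, ret, alive in funds if alive])

print(f"mean return, all funds:      {all_funds_mean:+.1%}")
print(f"mean return, survivors only: {survivors_mean:+.1%}")
```

In this toy data the mean salary sits far above what most people earn, and the survivors-only return looks healthy even though the full set of funds lost money on average.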

For more on data bias: 
1. [Types of Biases in Data](https://towardsdatascience.com/types-of-biases-in-data-cafc4f2634fb)
2. [Statistical Bias Types explained](https://data36.com/statistical-bias-types-explained/)
3. [Fairness in Machine Learning](https://developers.google.com/machine-learning/crash-course/fairness/types-of-bias)

Data bias is just one of many ways a dataset can lack robustness. Before we learn how to create robust datasets, however, we need to define what robust data is.

## Defining robust data 

While the need for quality data may be apparent, what attributes such data would possess is not immediately obvious. 

What would an "ideal" dataset look like to you? How would you collect it? 

[comment]: <> (Break for Zoom discussion)

<div>
<img src="../imgs/datacollection.png" width="750"/>
</div>

When we look to describe the robustness of a dataset, it can be useful to think about what that data will be used for. Data may be used to: 

- Run experiments
- Validate hypotheses
- Draw inferences
- Make decisions

These are all common and critical applications of data, among others, and they share a foundational element: they employ or rely on *data analysis*. 

So, when we think about data robustness, it can be useful to ask whether the data is "robust to analysis" or not. When we define data robustness this way, are there additional ideal dataset qualities you can think of?

[comment]: <> (Break for Zoom discussion)

## A working definition 

Ultimately, the exact definition of robust data is going to depend on your field of research and/or your particular application. However, some key considerations in any definition include: 

- Data sourcing (which can lead to biased data, as previously discussed)
- Data continuity 
- Data timeliness 

In this workshop, we're going to explore how each of these considerations can impact the creation and analysis of US GDP (**G**ross **D**omestic **P**roduct) data. For some quick interactive definitions of these terms, take a look at the code blocks below. Otherwise, feel free to move on to the next chapter to learn more about GDP and typical GDP data before jumping in. 

[comment]: <> (Allow time for code block running before moving onto the next chapter)

#### *Data sourcing example:* 

Data sourcing refers to the process of identifying, assessing, and ultimately selecting a data resource to work with (be it a specific dataset, a wider database, or a third-party data vendor). Decisions made by the curator of a data resource can affect the availability, representativeness, structure, and other critical attributes of the data you use, which can ultimately influence your analysis. Let's see an example: 

In [3]:
""" Run this block before trying to run any of the definition blocks! """
# installing required libraries: 
%pip install pandas

In [12]:
""" DATA SOURCING INTERACTIVE DEFINITION """
# importing some libraries: 
import os
import pandas as pd 

# retrieving the data - this is a dataset detailing data science job salaries
# (the path assumes the sample_datasets folder sits one level above the working directory): 
home_dir  = os.path.dirname(os.getcwd())
data_path = os.path.join(home_dir, 'sample_datasets', 'ds_salaries.csv')

data = pd.read_csv(data_path, encoding='utf-8')

# let's get a quick look at a few entries: 
print(data.head(5))

# what different fields does this dataset possess for each entry? 
print(list(data.columns))

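
The cell above depends on `ds_salaries.csv` being present on disk. As a file-free sketch of the same idea, the block below shows how a curator's inclusion rules can change what a dataset appears to say. The records, the `curate` helper, and the two "sources" are all hypothetical and invented for illustration; they are not taken from the workshop's sample dataset.

```python
from statistics import median

# Hypothetical salary records (all values invented for illustration).
records = [
    {"location": "US", "employment": "full-time", "salary_usd": 120_000},
    {"location": "US", "employment": "full-time", "salary_usd": 135_000},
    {"location": "IN", "employment": "full-time", "salary_usd": 30_000},
    {"location": "DE", "employment": "contract",  "salary_usd": 80_000},
    {"location": "BR", "employment": "part-time", "salary_usd": 25_000},
    {"location": "US", "employment": "contract",  "salary_usd": 95_000},
]

def curate(rows, locations=None, employment=None):
    """Mimic a curator's inclusion rules: keep only rows matching the filters."""
    kept = []
    for row in rows:
        if locations is not None and row["location"] not in locations:
            continue
        if employment is not None and row["employment"] not in employment:
            continue
        kept.append(row)
    return kept

# Hypothetical source A keeps every record it receives.
source_a = curate(records)
# Hypothetical source B only publishes US-based, full-time records.
source_b = curate(records, locations={"US"}, employment={"full-time"})

median_a = median(row["salary_usd"] for row in source_a)
median_b = median(row["salary_usd"] for row in source_b)

print(f"source A: n={len(source_a)}, median salary = {median_a:,.0f} USD")
print(f"source B: n={len(source_b)}, median salary = {median_b:,.0f} USD")
```

Both sources draw on the same underlying records, yet an analyst who only sees source B would report a noticeably higher typical salary. Checking a resource's inclusion rules before committing to it is one way to catch this early.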

#### *Data continuity example:* 
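
This chapter does not spell out a definition at this point, so the sketch below assumes one common reading of data continuity: a series is continuous when no expected observation periods are missing. The quarterly values are hypothetical (they are not real GDP figures), and `expected_quarters` is a helper written for this illustration.

```python
# Hypothetical quarterly observations keyed by (year, quarter); values are invented.
observations = {
    (2020, 1): 100.0,
    (2020, 2): 91.0,
    (2020, 4): 99.5,   # 2020 Q3 is missing
    (2021, 1): 101.0,
    (2021, 3): 104.0,  # 2021 Q2 is missing
}

def expected_quarters(start, end):
    """List every (year, quarter) from start to end, inclusive."""
    year, quarter = start
    periods = []
    while (year, quarter) <= end:
        periods.append((year, quarter))
        quarter += 1
        if quarter > 4:
            year, quarter = year + 1, 1
    return periods

first, last = min(observations), max(observations)
missing = [p for p in expected_quarters(first, last) if p not in observations]

for year, quarter in missing:
    print(f"gap in series: {year} Q{quarter}")
```

Gaps like these are easy to miss when a series is plotted as a line, because most plotting tools simply connect the neighbouring points.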

#### *Data timeliness example:* 
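
As with continuity, no definition is given here, so this sketch assumes a simple reading of data timeliness: how old the most recent observation is at the moment you analyze it. The dates, the `staleness_days` helper, and the 45-day threshold are all hypothetical choices made for illustration; an appropriate threshold depends on your application.

```python
from datetime import date

def staleness_days(latest_observation, as_of):
    """Number of days between the latest observation and the analysis date."""
    return (as_of - latest_observation).days

latest_observation = date(2022, 6, 30)  # hypothetical: end of the last period covered
analysis_date = date(2022, 9, 1)        # hypothetical: the day the analysis is run
max_age_days = 45                       # hypothetical tolerance for this analysis

age = staleness_days(latest_observation, analysis_date)
is_timely = age <= max_age_days

print(f"latest observation is {age} days old")
print("timely enough" if is_timely else "too stale for this analysis")
```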