# Data Setup
The purpose of this notebook is to introduce missing values into a copy of the original CSV data set. This gives us a controlled environment in which to compare how easy the different libraries make it to troubleshoot and accommodate such realistic data-quality problems.

## Example data
As the example data, I will be using the famous NYC taxi data set. You can download it from [Kaggle](https://www.kaggle.com/competitions/nyc-taxi-trip-duration/data).

I picked this data set because:
- it contains a good mix of column data types that we would want to parse in typical use cases (including, in particular, datetimes);
- it is large enough to allow meaningful performance comparisons.

In [1]:
!wc -l raw.csv


1458645 raw.csv


We see it has about 1.5 million rows.

Let's take a look at what the data looks like:

In [2]:
!head raw.csv


id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982154846191406,40.767936706542969,-73.964630126953125,40.765602111816406,N,455
id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415344238281,40.738563537597656,-73.999481201171875,40.731151580810547,N,663
id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979026794433594,40.763938903808594,-74.005332946777344,40.710086822509766,N,2124
id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.010040283203125,40.719970703125,-74.01226806640625,40.706718444824219,N,429
id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973052978515625,40.793209075927734,-73.972923278808594,40.782520294189453,N,435
id0801584,2,2016-01-30 22:01:40,2016-01-30 22:09:03,6,-73.982856750488281,40.742195129394531,-73.992080688476562,40.749183654785156,N,443
id1813257,1,

In [3]:
import pandas as pd
FILE_PATH = 'raw.csv'
df = pd.read_csv(FILE_PATH)

In [14]:
df.head()

,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435


In [4]:
df.isnull().sum().sum()

0

It looks like the data set does not contain any missing values. (It is technically possible that the object columns do, and that a missing value is mistakenly recognized as a string. Since we are not interested in the content of the data, this need not concern us.)
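
The caveat about object columns can be checked by counting cells that hold a string commonly used as a missing-value placeholder. The sketch below runs on a small made-up frame rather than the taxi data. The placeholder list and the helper name `count_placeholder_strings` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical placeholder strings; adjust to what your data might contain.
PLACEHOLDERS = {"", "NA", "N/A", "nan", "null", "?"}


def count_placeholder_strings(frame: pd.DataFrame) -> pd.Series:
    """Count, per non-numeric column, cells whose stripped text is a placeholder."""
    text_cols = frame.select_dtypes(exclude="number")
    return text_cols.apply(
        lambda col: col.astype(str).str.strip().isin(PLACEHOLDERS).sum()
    )


# Toy data standing in for the real file (made up for this example).
toy = pd.DataFrame(
    {
        "id": ["id1", "NA", "id3"],
        "flag": ["N", "Y", " "],
        "trip_duration": [455, 663, 2124],
    }
)
placeholder_counts = count_placeholder_strings(toy)
print(placeholder_counts)
```

A non-zero count for a column would suggest that `isnull()` alone understates how many values are missing.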

In [15]:
df.dtypes


id                     object
vendor_id               int64
pickup_datetime        object
dropoff_datetime       object
passenger_count         int64
pickup_longitude      float64
pickup_latitude       float64
dropoff_longitude     float64
dropoff_latitude      float64
store_and_fwd_flag     object
trip_duration           int64
dtype: object

## Introducing missing values
### Empty String
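Values set to `None` are written as empty strings by default when writing to CSV. As the output below shows, the integer columns are upcast to `float64` as a side effect, because `int64` cannot hold missing values.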

In [17]:
df.iloc[1, :] = None

In [18]:
df.head()

,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2.0,2016-03-14 17:24:55,2016-03-14 17:32:30,1.0,-73.982155,40.767937,-73.96463,40.765602,N,455.0
1,,,,,,,,,,,
2,id3858529,2.0,2016-01-19 11:35:24,2016-01-19 12:10:48,1.0,-73.979027,40.763939,-74.005333,40.710087,N,2124.0
3,id3504673,2.0,2016-04-06 19:32:31,2016-04-06 19:39:40,1.0,-74.01004,40.719971,-74.012268,40.706718,N,429.0
4,id2181028,2.0,2016-03-26 13:30:55,2016-03-26 13:38:10,1.0,-73.973053,40.793209,-73.972923,40.78252,N,435.0


In [19]:
df.dtypes


id                     object
vendor_id             float64
pickup_datetime        object
dropoff_datetime       object
passenger_count       float64
pickup_longitude      float64
pickup_latitude       float64
dropoff_longitude     float64
dropoff_latitude      float64
store_and_fwd_flag     object
trip_duration         float64
dtype: object

In [20]:
df.to_csv(
    'missing_as_empty_string.csv',
    index=False
)
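
To make the effect visible without opening the output file, here is a minimal sketch on a two-column toy frame (made up for this example) that writes to an in-memory buffer instead of disk. The missing row is built in directly rather than assigned afterwards.

```python
import io

import pandas as pd

# Toy frame with one fully missing row (hypothetical data).
toy_empty = pd.DataFrame({"vendor_id": [1, None, 3], "flag": ["N", None, "N"]})

# The missing value forces the numeric column to float64.
print(toy_empty.dtypes)

buffer = io.StringIO()
toy_empty.to_csv(buffer, index=False)  # default na_rep is the empty string
empty_lines = buffer.getvalue().splitlines()
print(empty_lines)
```

The missing row comes out as a bare `,`, that is, two empty fields, and the surviving integers are written as `1.0` and `3.0` because of the upcast.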

### Missing as "NA"

In [21]:
df.to_csv(
    'missing_as_NA.csv',
    index=False,
    na_rep='NA'   # This is the crucial part
)
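
As a quick illustration of what `na_rep` changes, the following sketch writes a small made-up frame to an in-memory buffer. The frame and variable names are assumptions for the example only.

```python
import io

import pandas as pd

# Toy frame with one fully missing row (hypothetical data).
toy_na = pd.DataFrame({"vendor_id": [1, None, 3], "flag": ["N", None, "N"]})

buffer = io.StringIO()
toy_na.to_csv(buffer, index=False, na_rep="NA")  # missing cells become "NA"
na_lines = buffer.getvalue().splitlines()
print(na_lines)
```

Every missing cell, in numeric and string columns alike, is written as the literal text `NA`.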

### Missing as "NA" or "nan"
Here we change the representation of missing values in some of the columns: the two datetime columns use the string `nan`, while the remaining columns still use `NA`.

In [26]:
df.iloc[1, 2:4] = 'nan'

df.to_csv(
    'missing_as_NA_or_nan.csv',
    index=False,
    na_rep='NA'
)
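
To see why mixed markers matter on the reading side, the sketch below parses a tiny made-up CSV twice with `pandas.read_csv`: once with the default missing-value markers, and once with `keep_default_na=False` so that only `NA` counts as missing. It is an illustration under those assumptions and not one of the library comparisons this data set is prepared for.

```python
import io

import pandas as pd

# Tiny stand-in for missing_as_NA_or_nan.csv (hypothetical content).
mixed_csv = (
    "pickup_datetime,passenger_count\n"
    "2016-03-14 17:24:55,1\n"
    "nan,NA\n"
    "2016-01-19 11:35:24,1\n"
)

# pandas' default marker list contains both "NA" and "nan".
default_read = pd.read_csv(io.StringIO(mixed_csv))

# With the defaults switched off, only the listed marker is treated as missing.
literal_read = pd.read_csv(
    io.StringIO(mixed_csv), keep_default_na=False, na_values=["NA"]
)

print(default_read.isna().sum())
print(literal_read.isna().sum())
```

With the defaults, both markers are recognized. With the restricted list, `nan` survives as an ordinary string in the datetime column.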

### Missing as "???"

In [29]:
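# Overwrite the whole row with the literal string '???'
df.iloc[1, :] = '???'

df.to_csv(
    'missing_as_question_marks.csv',
    index=False,
    na_rep='NA'   # No effect here: the row now holds strings, not missing values
)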
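
A custom marker such as `???` is not in pandas' default list of missing-value strings, so a reader has to be told about it. The following sketch, on a tiny made-up CSV, contrasts a naive read with one that passes `na_values`. The variable names are assumptions for the example.

```python
import io

import pandas as pd

# Tiny stand-in for missing_as_question_marks.csv (hypothetical content).
question_csv = "vendor_id,flag\n1,N\n???,???\n2,Y\n"

# Naive read: "???" is kept as text, so vendor_id is no longer numeric.
naive = pd.read_csv(io.StringIO(question_csv))

# Declaring the marker restores a numeric column with a missing value.
aware = pd.read_csv(io.StringIO(question_csv), na_values=["???"])

print(naive.dtypes)
print(aware.dtypes)
```

In the naive read, the single `???` row is enough to turn the whole `vendor_id` column into strings, which is the kind of symptom the comparison is meant to surface.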