# Data Cleaning

In [2]:
import numpy as np
import pandas as pd

#### Missing values

- literally empty
- encoded as special value
    - Python: None
    - Numpy: NaN

### None

In [4]:
data_none = None
print(type(data_none))

<class 'NoneType'>


In [6]:
assert data_none # None is falsy, so this assert fails; arithmetic such as None + 1 raises a TypeError

AssertionError: 
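
As a minimal sketch of the behaviour noted in the cell above, the snippet below shows that `None` is falsy and that arithmetic on it raises a `TypeError` instead of producing a missing-value result. The variable names other than `data_none` are illustrative.

```python
data_none = None

# None is falsy, which is why a bare `assert data_none` raises AssertionError
is_truthy = bool(data_none)

# Arithmetic on None raises TypeError rather than propagating a missing value
try:
    data_none + 1
    add_failed = False
except TypeError:
    add_failed = True

# The idiomatic way to test for None is an identity check
is_missing = data_none is None

print(is_truthy, add_failed, is_missing)
```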

### NaN

In [8]:
data_nan = np.nan

In [12]:
type(data_nan) # is a special type of float value

float

In [11]:
assert data_nan # NaN is truthy (a nonzero float), so it doesn't evaluate as null

In [17]:
np.nan is np.NaN is np.NAN # aliases for the same object; np.NaN and np.NAN were removed in NumPy 2.0, so prefer np.nan

True
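
Because NaN is truthy and, under IEEE 754 rules, does not compare equal to itself, neither `assert` nor `==` is a reliable way to find it. The sketch below uses `np.isnan` instead; the array contents are made up for illustration.

```python
import numpy as np

data_nan = np.nan

# NaN is a nonzero float, so it is truthy
truthy = bool(data_nan)

# NaN never compares equal to anything, including itself
equal_to_itself = data_nan == data_nan

# np.isnan is the reliable test, for scalars and element-wise for arrays
detected = bool(np.isnan(data_nan))
mask = np.isnan(np.array([1.0, np.nan, 3.0]))

print(truthy, equal_to_itself, detected, mask)
```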

In [18]:
# a single NaN propagates through basic calculations, so the result is NaN
dat_a = np.array([1,2,3,np.nan])

print(np.mean(dat_a))

nan


In [19]:
# but if you use the nan version, it will just ignore nan values
np.nanmean(np.array([1,2,3,np.nan]))

2.0

#### Dealing with missing data is a decision point: what do you do?
- Do you drop the observation?
    - What if this entails dropping a lot of observations?
- Do you keep it, but ignore it in any calculations?
    - What if you end up with different N’s in different calculations?
- Do you recode that data point?
    - What do you recode it to?
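
The three options above can be sketched with pandas as follows. The height values mirror the example used later in these notes; recoding to the median is only one possible choice, shown here as an assumption and not as a recommendation.

```python
import numpy as np
import pandas as pd

heights = pd.Series([168.0, 155.0, np.nan, 173.0], name="height")

# Option 1: drop the observation (N shrinks from 4 to 3)
dropped = heights.dropna()

# Option 2: keep it but ignore it in calculations (pandas skips NaN by default)
mean_ignoring = heights.mean()
n_used = heights.count()  # number of non-missing values actually used

# Option 3: recode the missing value, here to the median of the observed values
recoded = heights.fillna(heights.median())

print(len(dropped), mean_ignoring, n_used, recoded.tolist())
```

Checking `count()` alongside each statistic is one way to keep track of differing N's across calculations.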

#### Impossible values
For example, a missing age might be encoded as -999. Such values are not NaN, so functions like dropna() and np.nanmean() will not catch them, and you have to handle them explicitly.
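
A minimal sketch of handling such a sentinel, assuming -999 is known to mean "missing" for the age column: convert it to NaN so the usual missing-value tools apply. The inline CSV text is reconstructed from the example data in these notes; `na_values` is a standard `pd.read_csv` parameter.

```python
import io
import numpy as np
import pandas as pd

ages = pd.Series([20, 27, 25, -999], name="age")

# Recode the sentinel to NaN after loading
cleaned = ages.replace(-999, np.nan)

# Or declare the sentinel while reading, restricted to the age column
csv_text = "id,age,weight\n1,20,11.0\n2,27,\n3,25,14.0\n4,-999,12.0\n"
cleaned_on_read = pd.read_csv(io.StringIO(csv_text), na_values={"age": ["-999"]})

print(ages.mean(), cleaned.mean())
print(cleaned_on_read["age"].isna().sum())
```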

### Pandas cleaning - JSON

In [24]:
df1 = pd.read_json("human.json")

df1
# apparently read_json can order columns alphabetically, in which case you would need to rearrange them
# that did not happen here: id is already the first column
# df1 = df1[['id', 'height']]

Unnamed: 0,id,height
0,1,168.0
1,2,155.0
2,3,
3,4,173.0


In [26]:
# drop nan values
df1.dropna(inplace=True) # inplace=True modifies df1 directly, so the result does not need to be reassigned
df1

Unnamed: 0,id,height
0,1,168.0
1,2,155.0
3,4,173.0


### Pandas cleaning - CSV

In [32]:
df2 = pd.read_csv("data (1).csv")
df2

Unnamed: 0,id,age,weight
0,1,20,11.0
1,2,27,
2,3,25,14.0
3,4,-999,12.0


In [33]:
# dropping every row with a NaN value would reject good data here:
# row 1 has a valid age, and this calculation only needs age and height, not weight
# so we drop the whole weight column instead
df2.drop('weight', axis=1, inplace=True)

In [34]:
df2

Unnamed: 0,id,age
0,1,20
1,2,27
2,3,25
3,4,-999
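
To make the trade-off concrete, the sketch below compares the number of rows kept by each approach on a dataframe rebuilt from the values shown above. `dropna(subset=...)` is included as a third option that only considers the columns you actually need.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [20, 27, 25, -999],
    "weight": [11.0, np.nan, 14.0, 12.0],
})

# Dropping rows with any NaN loses id 2, even though its age is valid
rows_after_dropna = len(df2.dropna())

# Dropping the unused column keeps all four rows
rows_after_drop_col = len(df2.drop(columns="weight"))

# Restricting the NaN check to the needed column also keeps all four rows
rows_after_subset = len(df2.dropna(subset=["age"]))

print(rows_after_dropna, rows_after_drop_col, rows_after_subset)
```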


In [38]:
# check if there are any nan values in the age column
sum(df2['age'].isnull()) # a Series has no isnan method; isnull() returns a boolean Series, and sum counts the True values

0

In [39]:
# merge the two dataframes on the id column: rows with the same id are combined,
# and ids missing from either dataframe are dropped (the default is an inner join)
df = pd.merge(df1, df2, on = 'id')

In [40]:
df

Unnamed: 0,id,height,age
0,1,168.0,20
1,2,155.0,27
2,4,173.0,-999
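
Note that id 3 has disappeared from the merged result. As a sketch of how to audit what a merge drops, the snippet below repeats the merge with `how="outer"` and `indicator=True` (both standard `pd.merge` parameters) on dataframes rebuilt from the values above.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 4], "height": [168.0, 155.0, 173.0]})
df2 = pd.DataFrame({"id": [1, 2, 3, 4], "age": [20, 27, 25, -999]})

# Default inner join: only ids present in both dataframes survive
inner = pd.merge(df1, df2, on="id")

# Outer join with an indicator column showing where each row came from
audit = pd.merge(df1, df2, on="id", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(len(inner))
print(unmatched)
```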


In [43]:
# now check the stats if the data seems reasonable
df.describe()
# an average age of -317 is implausible; -999 is probably a code for missing values, so filter those rows out

Unnamed: 0,id,height,age
count,3.0,3.0,3.0
mean,2.333333,165.333333,-317.333333
std,1.527525,9.291573,590.351026
min,1.0,155.0,-999.0
25%,1.5,161.5,-489.5
50%,2.0,168.0,20.0
75%,3.0,170.5,23.5
max,4.0,173.0,27.0


In [48]:
df = df[df['age']>0]

df.describe()
df['age'].mean() # 23.5 seems plausible

23.5

In [49]:
df

Unnamed: 0,id,height,age
0,1,168.0,20
1,2,155.0,27
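
Filtering rows this way also discards the valid height for id 4. A possible alternative, sketched below on a dataframe rebuilt from the merged values, is to mask only the implausible age with `Series.where` so the rest of the row is kept. Which approach is appropriate depends on the analysis.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 4],
    "height": [168.0, 155.0, 173.0],
    "age": [20, 27, -999],
})

# Row filter: removes the whole observation, including its valid height
filtered = df[df["age"] > 0]

# Mask: keep the row, but set the implausible age to NaN
masked = df.copy()
masked["age"] = masked["age"].where(masked["age"] > 0)

print(filtered["height"].mean(), masked["height"].mean(), masked["age"].mean())
```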


### Tips for data cleaning:

- Read any documentation for the dataset you have
    - Things like missing values might be arbitrarily encoded, but should (hopefully) be documented somewhere
- Check that data types are as expected. If you are reading in mixed type data, make sure you end up with the correct encodings
    - Having numbers read in as strings, for example, is a common way data wrangling can go wrong, and this can cause analysis errors
- Visualize your data! Have a look that the distribution seems reasonable (more on this later)
- Check basic statistics. df.describe() can give you a sense if the data is really skewed
- Keep in mind how your data were collected
    - If anything comes from humans entering information into forms, this might take a lot of cleaning
        - Fixing data entry errors (typos)
        - Dealing with inputs using different units / formats / conventions
    - Cleaning this kind of data is likely to take more manual work (since mistakes are likely idiosyncratic)
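
As an illustration of the data type tip above, the sketch below uses a made-up column in which numbers arrived as strings. `pd.to_numeric` with `errors="coerce"` converts what it can and turns unparseable entries into NaN, which can then be inspected.

```python
import pandas as pd

# Hypothetical raw column: numbers read in as strings, with one unparseable entry
raw = pd.Series(["168", "155", "n/a", "173"], name="height")

# The column is not numeric, so arithmetic on it would not behave as expected
was_numeric = pd.api.types.is_numeric_dtype(raw)

# Convert to numbers; entries that cannot be parsed become NaN
height = pd.to_numeric(raw, errors="coerce")
is_numeric = pd.api.types.is_numeric_dtype(height)

# Look at which original entries failed to convert before deciding what to do
failed = raw[height.isna()]

print(was_numeric, is_numeric, failed.tolist(), height.mean())
```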

# Data privacy and anonymization

The HIPAA Safe Harbor method for de-identification requires that the following identifiers of individuals be removed:
- Names
- Geographic subdivisions smaller than a state**
- Dates (such as birth dates), except the year, and all ages over 89
- Telephone Numbers
- Vehicle Identification Numbers
- Fax numbers
- Device identifiers and serial numbers
- Email addresses
- Web Universal Resource Locators (URLs)
- Social security numbers
- Internet Protocol (IP) addresses
- Medical record numbers
- Biometric identifiers, including finger and voice prints
- Health plan beneficiary numbers
- Full-face photographs and any comparable images
- Account numbers
- Certificate/license numbers
- Any other unique identifying number, characteristic, or code

** The first three digits of the ZIP code can be kept, provided that more than 20,000 people live in the region covered by all the ZIP codes that share the same initial three digits (the same geographic subdivision).
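
A rough sketch of how two of these rules might be applied with pandas: dropping direct identifier columns and generalizing ZIP codes to their first three digits. Everything here is hypothetical: the records, the column names, the `generalize_zip` helper, and the population counts in `PREFIX_POPULATION` are invented for illustration. Real population figures would need to come from census data, and replacing small-population prefixes with "000" is an assumption based on the usual reading of the rule.

```python
import pandas as pd

# Hypothetical population per three-digit ZIP prefix (illustrative numbers only)
PREFIX_POPULATION = {"921": 1_500_000, "036": 12_000}

def generalize_zip(zip_code, populations):
    """Keep the three-digit prefix only if more than 20,000 people share it."""
    prefix = str(zip_code)[:3]
    if populations.get(prefix, 0) > 20_000:
        return prefix
    return "000"  # assumed placeholder for prefixes with too few residents

# Made-up records; ZIP codes are kept as strings to preserve leading zeros
records = pd.DataFrame({
    "name": ["A. Example", "B. Example"],
    "email": ["a@example.com", "b@example.com"],
    "zip": ["92122", "03601"],
    "height": [168.0, 155.0],
})

# Remove direct identifiers, then generalize the geographic information
deidentified = records.drop(columns=["name", "email"])
deidentified["zip"] = deidentified["zip"].map(
    lambda z: generalize_zip(z, PREFIX_POPULATION)
)

print(deidentified)
```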