Import the pandas library and load in the data file
---

In [None]:
import pandas as pd

demo = pd.read_csv('Demographics.csv')

Look at the data types of different columns
---

In [None]:
demo.loc[:,['SEQN','RIDAGEYR','RIAGENDR','DMQMILIT', 'DMDCITZN']].dtypes
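
As a rough illustration of what to look for in the `dtypes` output, the sketch below builds a small made-up table (the values are invented, not taken from `Demographics.csv`) and checks which columns hold text. Text columns usually show up as `object` (or a dedicated string dtype in newer pandas versions), and those are the columns that need cleaning.

```python
import pandas as pd

# Toy stand-in for the demographics file; the values are invented for illustration.
toy = pd.DataFrame({
    'SEQN': [1, 2, 3],
    'RIDAGEYR': [2.0, 77.0, 10.0],
    'DMQMILIT': ['Y', ' No', None],
})

# Numeric columns report int64/float64; text columns report a string-like dtype.
print(toy.dtypes)

# Programmatic check: does the column hold text rather than numbers?
col = toy['DMQMILIT']
is_text = pd.api.types.is_object_dtype(col) or pd.api.types.is_string_dtype(col)
is_numeric = pd.api.types.is_numeric_dtype(toy['SEQN'])
print(is_text, is_numeric)
```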

Find unique entries for Military/Veteran Status
---

According to the [Demographics Codebook](https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DEMO.htm) there should be 5 unique entries:
  * Yes
  * No
  * Don't know
  * Refused
  * Missing (NaN)

In [None]:
# unique() includes NaN, so missing values count as one of the entries
print(len(demo['DMQMILIT'].unique()))
print(demo['DMQMILIT'].unique())
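
A possible complement to `unique()` is `value_counts(dropna=False)`, which also shows how often each variant occurs. The sketch below uses a small invented Series rather than the real column, so the counts are only illustrative.

```python
import numpy as np
import pandas as pd

# Invented sample with inconsistent entries and one missing value.
s = pd.Series(['Yes', 'No', 'Y', ' No', np.nan, 'Yes'])

# unique() keeps NaN, so it counts as one of the distinct entries.
n_with_nan = len(s.unique())

# nunique() drops NaN by default.
n_without_nan = s.nunique()

# value_counts(dropna=False) lists every variant with its frequency, NaN included.
counts = s.value_counts(dropna=False)
print(n_with_nan, n_without_nan)
print(counts)
```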

Remove excess whitespace
---

In [None]:
# str.strip() removes leading and trailing whitespace; missing values stay missing
demo['DMQMILIT'] = demo['DMQMILIT'].str.strip()

print(len(demo['DMQMILIT'].unique()))
print(demo['DMQMILIT'].unique())
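
To see why stripping reduces the number of unique entries, here is a minimal sketch on an invented Series: entries that differ only by surrounding spaces collapse into one value, while missing values pass through unchanged.

```python
import pandas as pd

# Invented sample: three spellings of 'Yes' that differ only in whitespace.
s = pd.Series(['Yes', ' Yes', 'Yes  ', 'No', None])

stripped = s.str.strip()

before = s.nunique()        # distinct non-missing entries before stripping
after = stripped.nunique()  # distinct non-missing entries after stripping
print(before, after)
print(stripped.tolist())
```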

Change Y and N to Yes and No
---

Use the `replace()` method to change 'Y' and 'N' to 'Yes' and 'No'.

* `replace_dict` - specifies how to replace the data
  - Outer dictionary: key is the column name, value is a dictionary of replacements (the inner dictionary)
  - Inner dictionary: key is the value to be replaced, value is the replacement for every instance of that key

In [None]:
replace_dict = {'DMQMILIT': {
                              'Y':'Yes', 
                              'N':'No'
                            }
               }

demo.replace(replace_dict, inplace=True)

print(len(demo['DMQMILIT'].unique()))
print(demo['DMQMILIT'].unique())
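
The nested-dictionary form can be tried on a small invented table first. In this sketch, `OTHER` is a hypothetical column added only to show that columns not named in the outer dictionary are left alone, and that `replace()` matches whole values, so an existing 'Yes' is not altered.

```python
import pandas as pd

# Invented data; 'OTHER' is a hypothetical column used only for contrast.
toy = pd.DataFrame({
    'DMQMILIT': ['Y', 'N', 'Yes'],
    'OTHER': ['Y', 'N', 'Y'],
})

# Outer key: column name. Inner dict: old value -> new value.
replace_dict = {'DMQMILIT': {'Y': 'Yes', 'N': 'No'}}

# Returns a new DataFrame; the original `toy` is not modified.
result = toy.replace(replace_dict)
print(result)
```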

Find unique entries for Citizenship Status
---

According to the [Demographics Codebook](https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DEMO.htm) there should be 5 unique entries:
  * Citizen by birth or naturalization
  * Not a citizen of the US
  * Refused
  * Don't know
  * Missing (NaN)

In [None]:
print(len(demo['DMDCITZN'].unique()))
print(demo['DMDCITZN'].unique())

Remove excess whitespace
---

In [None]:
# str.strip() removes leading and trailing whitespace; missing values stay missing
demo['DMDCITZN'] = demo['DMDCITZN'].str.strip()

print(len(demo['DMDCITZN'].unique()))
print(demo['DMDCITZN'].unique())

Replace misspellings
---

Three values should be replaced with "Don't know":
  * "Dont know"
  * "Don't Know"
  * "Unknown"

In [None]:
replace_dict = {'DMDCITZN': {
                              "Dont know":"Don't know", 
                              "Don't Know":"Don't know",
                              "Unknown":"Don't know"
                            }
               }

demo.replace(replace_dict, inplace=True)

print(len(demo['DMDCITZN'].unique()))
print(demo['DMDCITZN'].unique())
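
One way to confirm that nothing was missed is to compare the remaining entries against the codebook options. The sketch below does this on an invented Series; the `expected` set is typed in by hand from the list above, and `unexpected` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Codebook options listed above (NaN is handled separately via dropna()).
expected = {
    "Citizen by birth or naturalization",
    "Not a citizen of the US",
    "Refused",
    "Don't know",
}

# Invented sample with one leftover misspelling and one missing value.
s = pd.Series([
    "Citizen by birth or naturalization",
    "Not a citizen of the US",
    "Dont know",
    "Don't know",
    np.nan,
])

# Anything left over here still needs to be cleaned.
unexpected = set(s.dropna().unique()) - expected
print(unexpected)
```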

Replace both columns at once
---

In [None]:
replace_dict = {'DMQMILIT': {
                              'Y':'Yes', 
                              'N':'No'
                            },
                'DMDCITZN': {
                              "Dont know":"Don't know", 
                              "Don't Know":"Don't know",
                              "Unknown":"Don't know"
                            }
               }

demo.replace(replace_dict, inplace=True)

Individual Practice
---

1. Find all of the columns in the demographics file that contain string data. 
2. Get rid of all excess whitespace.
3. Make sure all text entries match the text options in the [demographics codebook](https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DEMO.htm).

In [None]:
# Write your code here!

Save data files
---

In [None]:
# Uncomment to write the cleaned data to disk; index=False leaves out the row index column
# demo.to_csv('Demographics', index=False)
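
A quick way to check a saved file is to read it back and compare. This sketch writes an invented table to a temporary directory, so it does not touch any real data; the file name `toy_clean.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Invented cleaned data standing in for `demo`.
toy = pd.DataFrame({'SEQN': [1, 2], 'DMQMILIT': ['Yes', 'No']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_clean.csv')  # hypothetical file name
    toy.to_csv(path, index=False)              # index=False: no extra index column
    reloaded = pd.read_csv(path)

# The reloaded table has the same columns and values as the one that was saved.
print(reloaded)
```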