Import the pandas library and load the data file
---

In [1]:
import pandas as pd

demo = pd.read_csv('Demographics.csv')

Look at the data types of different columns
---

In [2]:
demo.loc[:,['SEQN','RIDAGEYR','RIAGENDR','DMQMILIT', 'DMDCITZN']].dtypes

SEQN          int64
RIDAGEYR    float64
RIAGENDR     object
DMQMILIT     object
DMDCITZN     object
dtype: object

Find unique entries for Military/Veteran Status
---

According to the [Demographics Codebook](https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DEMO.htm), there should be 5 unique entries:
  * Yes
  * No
  * Don't know
  * Refused
  * Missing (NaN)

In [3]:
print(len(demo['DMQMILIT'].unique()))
print(demo['DMQMILIT'].unique())

38
[nan 'Y' 'Yes' '  No' ' No' 'N' 'No' 'No  ' ' Y' 'N ' 'Yes ' '  N' 'No '
 'N  ' ' Yes' 'Y ' ' N' '  N ' 'Y  ' '  Yes' '  No  ' 'Yes  ' "Don't know"
 '  Y' ' N  ' ' No ' ' No  ' '  N  ' 'Refused' '  No ' '  Yes  '
 "Don't know " ' Yes ' ' N ' '  Yes ' ' Yes  ' 'Refused ' " Don't know"]


Remove excess whitespace
---

In [4]:
demo.loc[:,'DMQMILIT'] = demo.loc[:,'DMQMILIT'].str.strip()

print(len(demo['DMQMILIT'].unique()))
print(demo['DMQMILIT'].unique())

7
[nan 'Y' 'Yes' 'No' 'N' "Don't know" 'Refused']
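
As a minimal sketch of what `.str.strip()` does, the example below uses a small made-up Series (the values are illustrative and not taken from `Demographics.csv`). It shows that only leading and trailing whitespace is removed and that missing values stay missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from Demographics.csv
raw = pd.Series([' Yes', 'Yes  ', '  No ', "Don't know ", np.nan])

# Remove leading and trailing whitespace; NaN entries are passed through
cleaned = raw.str.strip()

# Count distinct entries, including NaN, before and after stripping
print(raw.nunique(dropna=False))
print(cleaned.nunique(dropna=False))
print(cleaned.tolist())
```

The apostrophe and the inner space in "Don't know" are untouched, because `strip()` only looks at the two ends of each string.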


Change Y and N to Yes and No
---

Use the `replace()` method to change 'Y' to 'Yes' and 'N' to 'No'.

* `replace_dict` - specifies how to replace the data
  - Outer dictionary: key is column name, value is dictionary of replacements (inner dictionary)
  - Inner dictionary: key is the value to be replaced, value is what to replace all instances of the key with

In [5]:
replace_dict = {'DMQMILIT': {
                              'Y':'Yes', 
                              'N':'No'
                            }
               }

demo.replace(replace_dict, inplace=True)

print(len(demo['DMQMILIT'].unique()))
print(demo['DMQMILIT'].unique())

5
[nan 'Yes' 'No' "Don't know" 'Refused']
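
The nested dictionary can be tried out on a tiny made-up DataFrame. In this sketch the `OTHER` column is a hypothetical addition, included only to show that the replacement is limited to the column named in the outer dictionary. It also suggests that, without `regex=True`, only whole values are matched, so an existing 'Yes' is left alone.

```python
import pandas as pd

# Hypothetical toy data; 'OTHER' is an invented column for illustration
toy = pd.DataFrame({
    'DMQMILIT': ['Y', 'N', 'Yes', 'No'],
    'OTHER': ['Y', 'N', 'Y', 'N'],
})

# Outer key: column name. Inner dictionary: old value -> new value.
replace_dict = {'DMQMILIT': {'Y': 'Yes', 'N': 'No'}}

# Without inplace=True, replace() returns a new DataFrame and leaves toy as is
result = toy.replace(replace_dict)

print(result)
```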


Find unique entries for Citizenship Status
---

According to the [Demographics Codebook](https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DEMO.htm), there should be 5 unique entries:
  * Citizen by birth or naturalization
  * Not a citizen of the US
  * Refused
  * Don't know
  * Missing (NaN)

In [6]:
print(len(demo['DMDCITZN'].unique()))
print(demo['DMDCITZN'].unique())

31
['Citizen by birth or naturalization' 'Not a citizen of the US'
 '  Citizen by birth or naturalization'
 ' Citizen by birth or naturalization'
 'Citizen by birth or naturalization '
 'Citizen by birth or naturalization  ' ' Not a citizen of the US'
 'Not a citizen of the US ' 'Not a citizen of the US  ' 'Refused'
 " Don't know" '  Not a citizen of the US'
 '  Citizen by birth or naturalization '
 ' Citizen by birth or naturalization  '
 ' Citizen by birth or naturalization ' 'Unknown' "Don't know"
 ' Dont know' ' Not a citizen of the US '
 '  Citizen by birth or naturalization  ' ' Not a citizen of the US  ' nan
 '  Refused' 'Unknown  ' '  Not a citizen of the US ' 'Dont know'
 ' Refused' "  Don't know " "Don't Know" "  Don't know" "Don't know  "]


Remove excess whitespace
---

In [7]:
demo.loc[:,'DMDCITZN'] = demo.loc[:,'DMDCITZN'].str.strip()

print(len(demo['DMDCITZN'].unique()))
print(demo['DMDCITZN'].unique())

8
['Citizen by birth or naturalization' 'Not a citizen of the US' 'Refused'
 "Don't know" 'Unknown' 'Dont know' nan "Don't Know"]


Replace misspellings
---

Three values should be replaced with "Don't know":
  * "Dont know"
  * "Don't Know"
  * "Unknown"

In [8]:
replace_dict = {'DMDCITZN': {
                              "Dont know":"Don't know", 
                              "Don't Know":"Don't know",
                              "Unknown":"Don't know"
                            }
               }

demo.replace(replace_dict, inplace=True)

print(len(demo['DMDCITZN'].unique()))
print(demo['DMDCITZN'].unique())

5
['Citizen by birth or naturalization' 'Not a citizen of the US' 'Refused'
 "Don't know" nan]
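
Instead of comparing the unique values to the codebook by eye, the check can be scripted. The sketch below assumes the four non-missing citizenship options listed earlier. The helper name `unexpected_values` and the toy Series are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

def unexpected_values(series, allowed):
    """Return the distinct non-missing values of `series` that are not in `allowed`."""
    present = series.dropna().unique()
    return sorted(set(present) - set(allowed))

# Non-missing options for DMDCITZN as listed from the codebook above
allowed_citizenship = {
    'Citizen by birth or naturalization',
    'Not a citizen of the US',
    'Refused',
    "Don't know",
}

# Hypothetical toy data with three values that still need replacing
toy = pd.Series([
    'Citizen by birth or naturalization',
    'Dont know',
    "Don't Know",
    'Refused',
    np.nan,
    'Unknown',
])

print(unexpected_values(toy, allowed_citizenship))
```

An empty list from this helper would mean every non-missing entry matches one of the allowed options.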


Replace both columns at once
---

The outer dictionary can hold one entry per column, so a single `replace()` call covers both columns.

In [9]:
replace_dict = {'DMQMILIT': {
                              'Y':'Yes', 
                              'N':'No'
                            },
                'DMDCITZN': {
                              "Dont know":"Don't know", 
                              "Don't Know":"Don't know",
                              "Unknown":"Don't know"
                            }
               }

demo.replace(replace_dict, inplace=True)
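
The strip and replace steps can also be bundled together. This is a sketch under stated assumptions, not the notebook's own method: `strip_and_replace` is a hypothetical helper, and the toy DataFrame is made up. It strips every column named in the dictionary and then applies all replacements in one call.

```python
import numpy as np
import pandas as pd

def strip_and_replace(df, replace_dict):
    """Strip whitespace in each column named in replace_dict, then apply the replacements.

    Hypothetical helper for illustration; returns a new DataFrame.
    """
    out = df.copy()
    for col in replace_dict:
        out[col] = out[col].str.strip()
    return out.replace(replace_dict)

# Hypothetical toy data, not taken from Demographics.csv
toy = pd.DataFrame({
    'DMQMILIT': [' Y', 'N ', 'Yes', np.nan],
    'DMDCITZN': ['Dont know', " Don't Know", 'Unknown  ', 'Refused'],
})

replace_dict = {
    'DMQMILIT': {'Y': 'Yes', 'N': 'No'},
    'DMDCITZN': {
        "Dont know": "Don't know",
        "Don't Know": "Don't know",
        "Unknown": "Don't know",
    },
}

cleaned = strip_and_replace(toy, replace_dict)
print(cleaned)
```

Stripping first matters here: a value such as ' Y' would not match the key 'Y' until the leading space is gone.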

Individual Practice
---

1. Find all of the columns in the demographics file that contain string data. 
2. Get rid of all excess whitespace.
3. Make sure all text entries match the text options in the [demographics codebook](https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DEMO.htm).

In [10]:
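# Write your code here!
# Note: pandas reports text columns as dtype 'object' (see In [2]),
# so comparing against 'string' matches no columns and returns an empty Series.
demo.dtypes[demo.dtypes == 'string']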

Series([], dtype: object)
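
One possible way to approach steps 1 and 2 is sketched below on a made-up DataFrame. The text columns are built with `dtype=object` explicitly so that they match the `object` dtype shown in the earlier `dtypes` output. The column values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; text columns are forced to dtype object
toy = pd.DataFrame({
    'SEQN': [1, 2, 3],
    'RIDAGEYR': [2.0, 77.0, 10.0],
    'RIAGENDR': pd.Series([' Male', 'Female ', 'Female'], dtype=object),
    'DMQMILIT': pd.Series([np.nan, ' Yes', 'No '], dtype=object),
})

# Step 1: find the columns stored as dtype object (where text usually lives)
text_cols = toy.select_dtypes(include='object').columns

# Step 2: strip leading and trailing whitespace in each of those columns
for col in text_cols:
    toy[col] = toy[col].str.strip()

print(list(text_cols))
print(toy)
```

An `object` column is not guaranteed to hold only strings, so it is worth inspecting the unique values of each selected column before cleaning it.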

Save data files
---

In [11]:
# Uncomment to write the cleaned data back to disk (this overwrites the original file)
# demo.to_csv('Demographics.csv', index=False)