Read, clean, and validate

The first step of almost any data project is to read the data, check for errors and special cases, and prepare data for analysis. This is exactly what you'll do in this chapter, while working with a dataset obtained from the National Survey of Family Growth.

1. DataFrames and Series

1.1 Exploring the NSFG data

To get the number of rows and columns in a DataFrame, you can read its shape attribute.

To get the column names, you can read the columns attribute. The result is an Index, which is a pandas data structure that is similar to a list. Let's begin exploring the NSFG data! It has been pre-loaded for you into a DataFrame called nsfg.

In [None]:
import pandas as pd
nsfg = pd.read_hdf('datasets/nsfg.hdf5')

# Display the number of rows and columns
print(nsfg.shape)

# Display the names of the columns
print(nsfg.columns)

# Select column birthwgt_oz1: ounces
ounces = nsfg['birthwgt_oz1']

# Print the first 5 elements of ounces
print(ounces.head())
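
As a minimal sketch of how shape and columns behave, the example below uses a small made-up DataFrame (the name toy and its values are hypothetical stand-ins, not NSFG data). It shows that shape is a (rows, columns) tuple and that an Index can be indexed, tested for membership, and converted to a list.

```python
import pandas as pd

# Hypothetical toy data standing in for the real NSFG file
toy = pd.DataFrame({
    "caseid": [1, 2, 3],
    "outcome": [1, 4, 1],
    "birthwgt_oz1": [4.0, None, 12.0],
})

n_rows, n_cols = toy.shape              # shape is a (rows, columns) tuple
first_col = toy.columns[0]              # an Index supports positional indexing
col_names = list(toy.columns)           # and converts to a plain list
has_outcome = "outcome" in toy.columns  # membership tests work too

print(n_rows, n_cols)
print(first_col, col_names, has_outcome)
```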

2. Clean and Validate

2.1 Validate a variable

In the NSFG dataset, the variable 'outcome' encodes the outcome of each pregnancy as shown below:

| value | label             |
|-------|-------------------|
| 1     | Live birth        |
| 2     | Induced abortion  |
| 3     | Stillbirth        |
| 4     | Miscarriage       |
| 5     | Ectopic pregnancy |
| 6     | Current pregnancy |

The nsfg DataFrame has been pre-loaded for you. Explore it in the IPython Shell and use the methods Allen showed you in the video to answer the following question: How many pregnancies in this dataset ended with a live birth?

Possible Answers:
- 6489  (True)
- 9538
- 1469
- 6

In [None]:
nsfg['outcome'].value_counts()
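
To make the counting step concrete, here is a small sketch on made-up outcome codes (the Series below is hypothetical, so its counts do not match the NSFG answer). It sorts the result of value_counts() by code so it lines up with the codebook table, then reads off the count for code 1 and cross-checks it with a Boolean mask.

```python
import pandas as pd

# Hypothetical outcome codes, not real NSFG records
outcome = pd.Series([1, 1, 4, 2, 1, 6, 4], name="outcome")

# value_counts() orders by frequency; sort_index() orders by code instead
counts = outcome.value_counts().sort_index()

live_births = counts.loc[1]        # look up the count for code 1 by label
same_count = (outcome == 1).sum()  # equivalent count using a Boolean mask

print(counts)
print(live_births, same_count)
```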

2.2 Clean a variable

In the NSFG dataset, the variable 'nbrnaliv' records the number of babies born alive at the end of a pregnancy.

If you use .value_counts() to view the responses, you'll see that the value 8 appears once, and if you consult the codebook, you'll see that this value indicates that the respondent refused to answer the question.

Your job in this exercise is to replace this value with np.nan. Recall from the video how Allen replaced the values 98 and 99 in the ounces column using the .replace() method:

```python
ounces.replace([98, 99], np.nan, inplace=True)
```
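
One possible way to apply the same idea to nbrnaliv is sketched below on a small made-up DataFrame (the name toy and its values are hypothetical). It assigns the result of .replace() back to the column instead of using inplace=True, on the assumption that in-place changes to a selected column may not propagate to the DataFrame in newer pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; 8 stands for "refused", as in the codebook
toy = pd.DataFrame({"nbrnaliv": [1, 1, 2, 8, 1]})

# Replace the refusal code with NaN and assign the result back to the column
toy["nbrnaliv"] = toy["nbrnaliv"].replace(8, np.nan)

# value_counts() skips NaN by default, so 8 no longer appears
print(toy["nbrnaliv"].value_counts())
print(toy["nbrnaliv"].isna().sum())
```

After the replacement the column holds floating-point values, because NaN cannot be stored in an integer column.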