# <center>Data Wrangling: Cleaning, Formatting, Standardization, and Investigation - Stacey Sandy</center>

This week you will be retrieving and cleansing data from a survey generated by the National Center for Immunization and Respiratory Diseases about National Immunizations in Children. In completing this assignment, you will be able to combine topics discussed in several of our prior <i>From The Experts</i> (FTE)s.

Files needed to complete this assignment are located in the data_5 folder:
* NISPUF14_CODEBOOK.PDF
* nispuf14.dat

Assignment Requirements:
* Retrieve all of the data within nispuf14.dat and store it in a more <i> accessible format</i>
* <i> Accessible format </i> can be any of the following:
    - csv file
    - json file
    - relational database
* For this assignment, feel free to use a dataframe for intermediate steps. 

<hr>

### What's in these two files?
I'm glad you asked that! And to be honest, you probably are not going to like the answer.

NISPUF14_CODEBOOK.PDF is a PDF that contains a description of the format for the data in nispuf14.dat. In other words, the PDF tells you how to read the data in nispuf14.dat.

Why would we need a PDF to tell us how to read our data?  Well, this data file is stored in a positional format. This means that both the value and the relative position of each character provide meaning within the dataset.

Here's what the data in nispuf14.dat looks like.
<img align="left" style="padding-right:10px;" src="figures_5/positional_data.jpeg" width = 800><br>

Ugly? Yes! And very much so. However, data in this format is not all that uncommon. Mainframe computers operate on positional formatting. 

Q - Who still uses mainframe computers?<br>
A - Mainframes are more prevalent than you'd think. Any industry that has a large volume of daily mathematical calculations to do most likely uses a mainframe computer as part of its normal operations. Take the banking industry, for example. Certainly, the website and customer-facing applications are not run on a mainframe computer, but the nightly accounting processes probably are. 

The following article walks through the history of the mainframe computer and how it has evolved over the years. 
https://www.thocp.net/hardware/mainframe.htm

<hr>

### How are we supposed to read that?
This is where NISPUF14_CODEBOOK.PDF comes into the picture. Section 1 of the PDF contains the description of the positional formatting information for each data field. Here's how it works!

As an example, let's say that our data file looked like this:<br>
CAT  FLUFFY410<br>
DOG  FIDO  522<br>
BIRD CHIRP 2 1<br>

At a glance, we can determine that each line contains information about animals. We can see a field representing an animal_type and perhaps an animal_name.  However, we have little to no information about what the numerics at the end of each line mean. Or even how many fields the numeric group is representing. The last line is leading us to believe that there might be more than one field represented, but we are not confident at this point.


### Does this come with a 'Magical Decoder Ring'?
Short of an actual magical ring, I'd settle for a description of each field and its relative position in the line.  It would be even better if the description was written down for future reference.

Let's look at the above animal dataset in conjunction with the  following description:<br>
Type 1 5<br>
Name 6 11<br>
Age 12 12<br>
Weight 13 14<br>

Aaahhhhh! Now everything is starting to come together!!! We can now confirm that the first field is indeed animal_type, and the second is animal_name. However, we now know that the numeric grouping is really two fields, animal_age and animal_weight. We can also see that animal_age is a single digit, and animal_weight is a 2-digit numeric. We are also able to determine at this point that the animal_name on the first line is actually 'FLUFFY' and not 'FLUFFY410'.

Time to add a little code to this example.

<hr>

In [1]:
# Load the sample data into a list
animal_data = ['CAT  FLUFFY410', 'DOG  FIDO  522', 'BIRD CHIRP 2 1' ]

# processing each animal_data line
for  line in animal_data:
    animal_type = line[0:5]
    animal_name = line[5:11]
    age = line[11]
    weight = line[12:14]
    
    print(f'{animal_name} is a {animal_type} that is {age} years old and weights {weight} pound(s)')


Hopefully, things are looking less scary at this point? 

Retrieving data from a positionally formatted file is just a matter of chunking the larger string up into smaller pieces. The trick is in determining where to make those chunks. 

The key to all this is the 'magical decoder' description because there are no other clues in the file itself. Unlike a csv file, positionally formatted files don't have a delimiter to help identify individual data elements. 
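As a minimal sketch of that idea, the field description from above can be stored as data and used to slice each line, instead of hard-coding the slice numbers. The names `layout` and `parse_line` are hypothetical, introduced only for illustration. The positions in the description are 1-based and inclusive, so the begin position is shifted down by one when it becomes a Python slice, and padding is stripped from each value.

```python
# Layout from the description above: (field name, begin, end), 1-based inclusive.
layout = [('Type', 1, 5), ('Name', 6, 11), ('Age', 12, 12), ('Weight', 13, 14)]

def parse_line(line, layout, fill=' '):
    """Slice one positional line into a dict of field name -> stripped value."""
    record = {}
    for name, begin, end in layout:
        # begin - 1 converts to 0-based; end stays as-is because slices exclude the end
        record[name] = line[begin - 1:end].strip(fill)
    return record

animal_data = ['CAT  FLUFFY410', 'DOG  FIDO  522', 'BIRD CHIRP 2 1']
records = [parse_line(line, layout) for line in animal_data]
for record in records:
    print(record)
```

Keeping the layout separate from the slicing code means a different file only needs a different `layout` list.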

That being said, positional formats do account for every character within a row.  Meaning that even unused characters are given a value. In our example above, a blank  character(' ') was used to fill unused characters. The value used to represent unused characters can literally be anything. For example, if '-' was used instead of a ' ' our sample data would have looked like:

CAT--FLUFFY410<br>
DOG--FIDO--522<br>
BIRD-CHIRP-2-1<br>

Let's see if our code above will still work.

In [2]:
# Load the sample data into a list
animal_data2 = ['CAT--FLUFFY410', 'DOG--FIDO--522', 'BIRD-CHIRP-2-1' ]

# processing each animal_data line
for  line in animal_data2:
    animal_type = line[0:5]
    animal_name = line[5:11]
    age = line[11]
    weight = line[12:14]
    
    print(f'{animal_name} is a {animal_type} that is {age} years old and weights {weight} pound(s)')

FLUFFY is a CAT-- that is 4 years old and weights 10 pound(s)
FIDO-- is a DOG-- that is 5 years old and weights 22 pound(s)
CHIRP- is a BIRD- that is 2 years old and weights -1 pound(s)


Aside from changing the initial list that contains our dataset, no coding changes were needed. 

Our output looks a little different, but that's because of the different unused character representation. Both of the above examples have their respective unused character values in the data elements.  It's just easier to see in the second example over the first.

Let's try stripping out the unused characters in both examples.

In [3]:
# working with the second dataset, animal_data2, first.

# processing each animal_data line
for  line in animal_data2:
    animal_type = line[0:5].strip('-')
    animal_name = line[5:11].strip('-')
    age = line[11].strip('-')
    weight = line[12:14].strip('-')
    
    print(f'{animal_name} is a {animal_type} that is {age} years old and weights {weight} pound(s)')

FLUFFY is a CAT that is 4 years old and weights 10 pound(s)
FIDO is a DOG that is 5 years old and weights 22 pound(s)
CHIRP is a BIRD that is 2 years old and weights 1 pound(s)


In [4]:
# repeat the same thing with the first set, animal_data.
# This set is padded with blanks rather than '-', so we strip whitespace instead.

# processing each animal_data line
for  line in animal_data:
    animal_type = line[0:5].strip()
    animal_name = line[5:11].strip()
    age = line[11].strip()
    weight = line[12:14].strip()
    
    print(f'{animal_name} is a {animal_type} that is {age} years old and weights {weight} pound(s)')

FLUFFY is a CAT that is 4 years old and weights 10 pound(s)
FIDO is a DOG that is 5 years old and weights 22 pound(s)
CHIRP is a BIRD that is 2 years old and weights 1 pound(s)


<div class="alert alert-success">
Success!! The two outputs match!
</div>

<hr>

### Back to our assignment
Section 1 of NISPUF14_CODEBOOK.PDF contains the description of the positional format for nispuf14.dat.

<div class="alert alert-block alert-info">
<b>Helpful Hint::</b> Combining PyPDF2 and Tabula works well for parsing the information in section 1 of NISPUF14_CODEBOOK.PDF: use PyPDF2 to extract section 1 of the PDF, and Tabula to pull the positional formatting information out of the PDF and into a pandas dataframe.
</div>

Installation reminders from the FTE for Week 3.
<div class="alert alert-block alert-success">
<b>Installation - PyPDF2::</b> PyPDF2 can be installed as normal using pip.
</div>

<div class="alert alert-block alert-success">
<b>Installation - Tabula::</b> To install the tabula package, you can use pip as shown before. https://pypi.org/project/tabula-py/
</div>

<div class="alert alert-block alert-success">
<b>Installation - Java::</b> Note: in order to use tabula, you need to have the latest version of java installed. https://aegis4048.github.io/parse-pdf-files-while-retaining-structure-with-tabula-py has some useful information if you need help getting java installed on your machine.
</div>

#### Assignment Approach

<div class="alert alert-block alert-warning">
<b>One possible solution: </b> Students are encouraged to define their approach when completing any assignment in this class.  Below, I have shared my approach to the assignment for this week.  Feel free to use some or all of this design, if you'd like.
</div>



for each line in the file:

    data_line = new list
    for each variable (line) found in the dataframe:
        create a dictionary with variable name as key, 
        use start / end position numbers as a slice to give the dictionary's value
        append dictionary to data_line
    write data_line to CSV file
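
The outline above could be sketched roughly as follows, using the small animal dataset in place of nispuf14.dat and an in-memory buffer in place of a real output file. This is only one way to do it: the `layout` list and the `io.StringIO` buffer are stand-ins I have assumed for the dataframe of positions and the CSV file. It builds one dictionary per data line (variable name as key, sliced text as value), which is the shape `csv.DictWriter` expects.

```python
import csv
import io

# Stand-in for the codebook dataframe: (variable name, begin, end), 1-based inclusive.
layout = [('Type', 1, 5), ('Name', 6, 11), ('Age', 12, 12), ('Weight', 13, 14)]

# Stand-in for the lines of the data file.
data_lines = ['CAT  FLUFFY410', 'DOG  FIDO  522', 'BIRD CHIRP 2 1']

out = io.StringIO()  # a real run would use open(path, 'w', newline='')
writer = csv.DictWriter(out, fieldnames=[name for name, _, _ in layout],
                        lineterminator='\n')
writer.writeheader()

for line in data_lines:
    data_row = {}
    for name, begin, end in layout:
        # use the begin / end positions as a slice to get the value
        data_row[name] = line[begin - 1:end].strip()
    writer.writerow(data_row)  # one CSV row per data line

csv_text = out.getvalue()
print(csv_text)
```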

<hr>

## <b>This is where my Week 5 assignment submittal begins:</b>

In [1]:
#import PyPDF2 
from PyPDF2 import PdfFileReader, PdfFileWriter,PdfFileMerger

# create a file object for our PDF
pdfFileObj = open('data_wk5/NISPUF14_CODEBOOK.PDF', 'rb') 

# create an object to access the PDF
pdfReader = PdfFileReader(pdfFileObj) 

In [2]:
#Check the number of pages. (Remember it should be 250 pages.)
print(pdfReader.numPages) 

250


In [3]:
# creating a page object - pdfReader starts at index 0 but we want to see if the first page of variables print.
pageObj = pdfReader.getPage(4) 

# store the text into a string
page_string = pageObj.extractText()

# display the text on the selected page
page_string

'Codebook and Unweighted Frequencies for the 2014 NIS Public-Use FileSECTION INDEX OF VARIABLESPage 1Codebook and Unweighted Frequencies for the 2014 NIS Public-Use File Page 1 Begin End Variable Name \nPosition Position Section Variable Label SEQNUMC 1 6 1 UNIQUE CHILD IDENTIFIER SEQNUMHH 7 11 1 UNIQUE HOUSEHOLD IDENTIFIER PDAT 12 12 1 CHILD HAS ADEQUATE PROVIDER DATA PROVWT_D 13 31 1 FINAL DUAL-FRAME PROVIDER-PHASE WEIGHT (EXCLUDES TERRITORIES) PROVWT_D_TERR 32 50 1 FINAL DUAL-FRAME PROVIDER-PHASE WEIGHT INCLUDING \nTERRITORIES RDDWT_D 51 69 1 FINAL DUAL-FRAME RDD-PHASE WEIGHT (EXCLUDES \nTERRITORIES) RDDWT_D_TERR 70 88 1 FINAL DUAL-FRAME RDD-PHASE WEIGHT INCLUDING \nTERRITORIES STRATUM 89 92 1 STRATUM VARIABLE FOR DUAL-FRAME VARIANCE ESTIMATION YEAR 93 96 1 YEAR OF INTERVIEW AGECPOXR 97 97 2 AGE IN MONTHS AT CHICKEN POX DISEASE (RECODE) HAD_CPOX 98 99 2 CHILD EVER HAD CHICKEN POX DISEASE? SHOTCARD 100 100 2 SHOT CARD USE FLAG AGEGRP 101 101 3 AGE CATEGORY OF CHILD (19-23, 24-29, 30-

In [4]:
def make_pdf_of_single_page(page_num):
    # same as above
    pdfWriter = PdfFileWriter()
    pdfWriter.addPage(pdfReader.getPage(page_num - 1))

    # construct a unique filename
    out_filename = 'data_wk5/page_' + str(page_num) + '.pdf'
    
    # create a new PDF 
    with open(out_filename, 'wb') as out:
        pdfWriter.write(out)

In [5]:
# remember range(i,n) generates an iterator over the integers from i up to n-1
# https://www.tutorialspoint.com/python3/python_for_loop.htm
for num in range(5,22):
    make_pdf_of_single_page(num)

In [6]:
# verify that the pdfs were created
!dir /S page*

 Volume in drive C is Windows
 Volume Serial Number is 34A7-90A0

 Directory of C:\Users\Stacey\Desktop\RU\MSDE621\WeeklyContent\data_wk3

11/18/2019  08:31 AM            26,004 page_10.pdf
11/18/2019  08:31 AM            26,052 page_11.pdf
11/18/2019  08:31 AM            25,972 page_12.pdf
11/18/2019  08:31 AM            26,033 page_13.pdf
11/18/2019  08:31 AM            26,209 page_14.pdf
11/18/2019  08:31 AM            25,828 page_15.pdf
               6 File(s)        156,098 bytes

 Directory of C:\Users\Stacey\Desktop\RU\MSDE621\WeeklyContent\data_wk5

12/15/2019  03:46 PM             5,773 page_10.pdf
12/15/2019  03:46 PM             5,699 page_11.pdf
12/15/2019  03:46 PM             5,902 page_12.pdf
12/15/2019  03:46 PM             5,929 page_13.pdf
12/15/2019  03:46 PM             5,959 page_14.pdf
12/15/2019  03:46 PM             5,885 page_15.pdf
12/15/2019  03:46 PM             5,860 page_16.pdf
12/15/2019  03:46 PM             5,949 page_17.pdf
12/15/2019  03:46 PM       

In [7]:
# let's create a list of all the pdfs we just created
import glob

pdf_list = [f for f in glob.glob('data_wk5/page_*')]

In [8]:
#Print the list
pdf_list

['data_wk5\\page_10.pdf',
 'data_wk5\\page_11.pdf',
 'data_wk5\\page_12.pdf',
 'data_wk5\\page_13.pdf',
 'data_wk5\\page_14.pdf',
 'data_wk5\\page_15.pdf',
 'data_wk5\\page_16.pdf',
 'data_wk5\\page_17.pdf',
 'data_wk5\\page_18.pdf',
 'data_wk5\\page_19.pdf',
 'data_wk5\\page_20.pdf',
 'data_wk5\\page_21.pdf',
 'data_wk5\\page_5.pdf',
 'data_wk5\\page_6.pdf',
 'data_wk5\\page_7.pdf',
 'data_wk5\\page_8.pdf',
 'data_wk5\\page_9.pdf']

In [9]:
#Re-sort the list so the page numbers are in natural order
from natsort import natsorted
pdf_list = natsorted(pdf_list)
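
For context, the list printed above is in plain string order, which is why `page_10.pdf` comes before `page_5.pdf`. `natsorted` compares the digit runs as numbers instead. If `natsort` is not installed, a rough stand-in using only the standard library might look like the sketch below; `natural_key` is a hypothetical helper of my own, not part of any library, and the paths are written with forward slashes for simplicity.

```python
import re

def natural_key(path):
    """Split a path into text and integer chunks so digit runs compare numerically."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r'(\d+)', path)]

pages = ['data_wk5/page_10.pdf', 'data_wk5/page_21.pdf',
         'data_wk5/page_5.pdf', 'data_wk5/page_9.pdf']
ordered = sorted(pages, key=natural_key)
print(ordered)
```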

In [10]:
# verify they all contain a single page and are now in order.
for pdf in pdf_list:
    n_pdfFileObj = open(pdf, 'rb') 
    n_pdfReader = PdfFileReader(n_pdfFileObj) 
    print(pdf,'has', n_pdfReader.numPages, 'page(s)') 
    
    # close the fileobjects
    n_pdfFileObj.close()

data_wk5\page_5.pdf has 1 page(s)
data_wk5\page_6.pdf has 1 page(s)
data_wk5\page_7.pdf has 1 page(s)
data_wk5\page_8.pdf has 1 page(s)
data_wk5\page_9.pdf has 1 page(s)
data_wk5\page_10.pdf has 1 page(s)
data_wk5\page_11.pdf has 1 page(s)
data_wk5\page_12.pdf has 1 page(s)
data_wk5\page_13.pdf has 1 page(s)
data_wk5\page_14.pdf has 1 page(s)
data_wk5\page_15.pdf has 1 page(s)
data_wk5\page_16.pdf has 1 page(s)
data_wk5\page_17.pdf has 1 page(s)
data_wk5\page_18.pdf has 1 page(s)
data_wk5\page_19.pdf has 1 page(s)
data_wk5\page_20.pdf has 1 page(s)
data_wk5\page_21.pdf has 1 page(s)


In [11]:
# creating pdf file merger object 
pdfMerger = PdfFileMerger() 

# appending the individual pdfs to the merger object
for pdf in pdf_list:
    pdfMerger.append(pdf)    # append() accepts a file path; no open mode is needed here
        
# writing combined pdf to output pdf file 
with open('data_wk5/merged.pdf', 'wb') as new_file: 
    pdfMerger.write(new_file)

In [12]:
m_pdfFileObj = open('data_wk5/merged.pdf', 'rb') 
m_pdfReader = PdfFileReader(m_pdfFileObj) 
print(m_pdfReader.numPages)

17


In [13]:
# looking at last page_21.pdf
v_pdfFileObj = open('data_wk5/page_21.pdf', 'rb') 
v_pdfReader = PdfFileReader(v_pdfFileObj) 
v_pageObj = v_pdfReader.getPage(0) 

# look at the first 200 characters
print(v_pageObj.extractText()[:200])

Codebook and Unweighted Frequencies for the 2014 NIS Public-Use File Page 17 Begin End Variable Name 
Position Position Section Variable Label XVRCTY7 850 850 9 VARICELLA-CONTAINING VACCINATION #7 TYP


In [14]:
# looking at merged.pdf last page 17
m_pageObj = m_pdfReader.getPage(16) 

# only need the first 200 characters for comparison
print(m_pageObj.extractText()[:200])

Codebook and Unweighted Frequencies for the 2014 NIS Public-Use File Page 17 Begin End Variable Name 
Position Position Section Variable Label XVRCTY7 850 850 9 VARICELLA-CONTAINING VACCINATION #7 TYP


In [15]:
# import tabula library
import tabula

# pdf location
pdf_file = "data_wk5/merged.pdf"

# read our pdf into a pandas dataframe
df_NIS = tabula.read_pdf(pdf_file, pages='all')

In [16]:
#View head of dataframe
df_NIS.head()

Unnamed: 0,Variable Name,Position,Position.1,Section,Variable Label
0,SEQNUMC,1.0,6.0,1.0,UNIQUE CHILD IDENTIFIER
1,SEQNUMHH,7.0,11.0,1.0,UNIQUE HOUSEHOLD IDENTIFIER
2,PDAT,12.0,12.0,1.0,CHILD HAS ADEQUATE PROVIDER DATA
3,PROVWT_D,13.0,31.0,1.0,FINAL DUAL-FRAME PROVIDER-PHASE WEIGHT (EXCLUDES
4,,,,,TERRITORIES)


In [17]:
#View tail of dataframe
df_NIS.tail()

Unnamed: 0,Variable Name,Position,Position.1,Section,Variable Label
704,INS_4_5,861.0,862.0,10.0,"IS CHILD COVERED BY INDIAN HEALTH SERVICE, MIL..."
705,,,,,"CARE, TRICARE, CHAMPUS, OR CHAMP-VA?"
706,INS_6,863.0,864.0,10.0,IS CHILD COVERED BY ANY OTHER HEALTH INSURANCE...
707,,,,,CARE PLAN?
708,INS_11,865.0,866.0,10.0,ANY TIME WHEN CHILD WAS NOT COVERED BY ANY HEALTH


In [18]:
#View length (# of rows) of dataframe
len(df_NIS)

709

In [19]:
#View shape (# of rows, # of columns) of dataframe.
df_NIS.shape

(709, 5)

In [20]:
#Let's see how many duplicated rows are in dataframe, but keep first.
df_NIS.duplicated(keep='first')

0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8       True
9      False
10      True
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
       ...  
679    False
680    False
681    False
682    False
683    False
684    False
685    False
686    False
687    False
688    False
689    False
690    False
691    False
692    False
693    False
694    False
695     True
696    False
697    False
698    False
699    False
700    False
701    False
702    False
703    False
704    False
705    False
706    False
707    False
708    False
Length: 709, dtype: bool

In [21]:
#Let's try to drop these duplicate rows. As we know the header may be duplicated due to the pdf merge.
df_NIS.drop_duplicates()

Unnamed: 0,Variable Name,Position,Position.1,Section,Variable Label
0,SEQNUMC,1,6,1,UNIQUE CHILD IDENTIFIER
1,SEQNUMHH,7,11,1,UNIQUE HOUSEHOLD IDENTIFIER
2,PDAT,12,12,1,CHILD HAS ADEQUATE PROVIDER DATA
3,PROVWT_D,13,31,1,FINAL DUAL-FRAME PROVIDER-PHASE WEIGHT (EXCLUDES
4,,,,,TERRITORIES)
5,PROVWT_D_TERR,32,50,1,FINAL DUAL-FRAME PROVIDER-PHASE WEIGHT INCLUDING
6,,,,,TERRITORIES
7,RDDWT_D,51,69,1,FINAL DUAL-FRAME RDD-PHASE WEIGHT (EXCLUDES
9,RDDWT_D_TERR,70,88,1,FINAL DUAL-FRAME RDD-PHASE WEIGHT INCLUDING
11,STRATUM,89,92,1,STRATUM VARIABLE FOR DUAL-FRAME VARIANCE ESTIM...


Notice that the dataframe still contains rows that are NaN in every column except Variable Label. These come from long labels that wrapped onto a second line in the PDF, which Tabula read as separate rows.

In [22]:
#Look at dataframe shape again after the duplicate drops.
df_NIS.shape

(709, 5)

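<b>That didn't work....</b> The duplicates remain and the shape is the same as before. Hhmmmmmm. The reason is that `drop_duplicates()` returns a new dataframe rather than changing `df_NIS` itself, and the result above was displayed but never saved.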
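
To see this behaviour on a small scale, here is a sketch with a made-up three-row dataframe (the names `df` and `deduped` are just for illustration). By default pandas leaves the original dataframe untouched, so the shape only changes if the returned dataframe is kept.

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'A'], 'pos': [1, 2, 1]})

df.drop_duplicates()            # result is discarded, df is unchanged
rows_before = df.shape[0]

deduped = df.drop_duplicates()  # keep the returned DataFrame instead
rows_after = deduped.shape[0]

print(rows_before, rows_after)
```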

In [23]:
#Convert merged PDF into CSV file
tabula.convert_into("data_wk5/merged.pdf", "data_wk5/output.csv", output_format="csv", pages='all')

In [24]:
# Let's just subset the columns Variable Name, begin position and end position.
# We will rename the position columns while we are at it.
NIS_subset = (df_NIS[['Variable Name', 'Position', 'Position.1']]
              .rename(columns={'Position': 'Begin', 'Position.1': 'End'}))
NIS_subset.head()

Unnamed: 0,Variable Name,Begin,End
0,SEQNUMC,1.0,6.0
1,SEQNUMHH,7.0,11.0
2,PDAT,12.0,12.0
3,PROVWT_D,13.0,31.0
4,,,


In [25]:
#Let's drop NaN values
NIS_dropna = NIS_subset.dropna()
print(NIS_dropna.shape)

(477, 3)


In [26]:
#Let's take a look at the first 12 rows
NIS_dropna.head(12)

Unnamed: 0,Variable Name,Begin,End
0,SEQNUMC,1,6
1,SEQNUMHH,7,11
2,PDAT,12,12
3,PROVWT_D,13,31
5,PROVWT_D_TERR,32,50
7,RDDWT_D,51,69
9,RDDWT_D_TERR,70,88
11,STRATUM,89,92
12,YEAR,93,96
13,AGECPOXR,97,97


In [27]:
#Now let's look at the last 12 rows 
NIS_dropna.tail(12)

Unnamed: 0,Variable Name,Begin,End
694,XVRCTY6,849,849
695,Variable Name,Position,Position
696,XVRCTY7,850,850
697,XVRCTY8,851,851
698,XVRCTY9,852,852
699,INS_1,853,854
701,INS_2,855,856
702,INS_3,857,858
703,INS_3A,859,860
704,INS_4_5,861,862


<b>Great!</b> Looks like the NaN rows, including the duplicates at rows 8 and 10, were removed from the dataframe.<br>
We also see that row 695 still holds a repeated column header from the PDF pages. It is not NaN, so `dropna()` kept it. We will need to take care of that too....

In [28]:
#Let's create a final "clean" version dataframe to solve the above problem...
NIS_dfclean = NIS_dropna.drop_duplicates(subset="Variable Name")

In [29]:
#Take a look at the shape.
NIS_dfclean.shape

(462, 3)

In [30]:
#Look at the tail to confirm the drop.
NIS_dfclean.tail(12)

Unnamed: 0,Variable Name,Begin,End
693,XVRCTY5,848,848
694,XVRCTY6,849,849
696,XVRCTY7,850,850
697,XVRCTY8,851,851
698,XVRCTY9,852,852
699,INS_1,853,854
701,INS_2,855,856
702,INS_3,857,858
703,INS_3A,859,860
704,INS_4_5,861,862


<b>This is better!</b> The NaN values are gone and the shape now shows 462 rows, which is more reasonable for the 3 columns (Variable Name, Begin, End) that we care about for this assignment.<br>
<b>Notice</b> how row 695, one of the headers repeated on each PDF page, has been dropped. Yay! Keep in mind that `drop_duplicates` keeps the first occurrence, so one header row (index 41) is still in the dataframe, as the CSV output further down shows.
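
A small sketch of removing the repeated header rows explicitly, rather than relying on `drop_duplicates`, which always keeps the first copy of a repeated row. The dataframe `toy` is made up to have the same shape as ours, with a few values copied from the outputs in this notebook; `is_header` is a hypothetical name.

```python
import pandas as pd

toy = pd.DataFrame({
    'Variable Name': ['M_AGEGRP', 'Variable Name', 'MARITAL2', 'Variable Name', 'XVRCTY7'],
    'Begin': ['163', 'Position', '164', 'Position', '850'],
    'End': ['163', 'Position', '164', 'Position', '850'],
})

# drop_duplicates removes the second header row but keeps the first one
deduped = toy.drop_duplicates(subset='Variable Name')
print(len(deduped))

# filtering on the header text removes every copy
is_header = toy['Variable Name'] == 'Variable Name'
no_headers = toy[~is_header]
print(no_headers['Variable Name'].tolist())
```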

In [31]:
#Let's try to save the dataframe to CSV.
NIS_csv = NIS_dfclean.to_csv(header=True, line_terminator='\n')

In [32]:
#Print CSV...
NIS_csv

',Variable Name,Begin,End\n0,SEQNUMC,1,6\n1,SEQNUMHH,7,11\n2,PDAT,12,12\n3,PROVWT_D,13,31\n5,PROVWT_D_TERR,32,50\n7,RDDWT_D,51,69\n9,RDDWT_D_TERR,70,88\n11,STRATUM,89,92\n12,YEAR,93,96\n13,AGECPOXR,97,97\n14,HAD_CPOX,98,99\n15,SHOTCARD,100,100\n16,AGEGRP,101,101\n17,BF_ENDR06,102,109\n18,BF_EXCLR06,110,117\n20,BF_FORMR08,118,125\n21,BFENDFL06,126,126\n23,BFFORMFL06,127,127\n25,C1R,128,128\n26,C5R,129,130\n27,CBF_01,131,132\n28,CEN_REG,133,133\n29,CHILDNM,134,134\n30,CWIC_01,135,136\n31,CWIC_02,137,138\n32,EDUC1,139,139\n33,FRSTBRN,140,140\n34,I_HISP_K,141,141\n35,INCPORAR,142,157\n36,INCPOV1,158,158\n37,INCQ298A,159,160\n38,INTRP,161,161\n39,LANGUAGE,162,162\n40,M_AGEGRP,163,163\n41,Variable Name,Position,Position\n42,MARITAL2,164,164\n43,MOBIL_I,165,165\n45,NUM_PHONE,166,167\n47,NUM_CELLS_HH,168,169\n49,NUM_CELLS_PARENTS,170,171\n51,RACE_K,172,172\n52,RACEETHK,173,173\n53,RENT_OWN,174,175\n55,SEX,176,176\n56,ESTIAP14,177,179\n57,EST_GRANT,180,181\n59,STATE,182,183\n60,D6R,184,184\n62,

In [33]:
#Look at the CSV type
type(NIS_csv)

str

In [34]:
#Create a new list of the CSV file and print

NIS_csvline = []
for line in NIS_csv:
    NIS_csvline.append(line)

NIS_csvline

[',',
 'V',
 'a',
 'r',
 'i',
 'a',
 'b',
 'l',
 'e',
 ' ',
 'N',
 'a',
 'm',
 'e',
 ',',
 ...

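Okay, that does NOT look good... `to_csv()` returned one long string, so the loop walked through it character by character instead of row by row. Moving on, since this is not part of the assignment deliverables anyway.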
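
For comparison, a short sketch using only the first few rows of the CSV string printed earlier. Looping over a string yields single characters, while `splitlines()` yields one entry per CSV row. The names `by_char` and `by_line` are just for illustration.

```python
# First few rows of the CSV string shown earlier.
csv_text = ',Variable Name,Begin,End\n0,SEQNUMC,1,6\n1,SEQNUMHH,7,11\n'

by_char = [piece for piece in csv_text]   # what looping over the string does
by_line = csv_text.splitlines()           # one entry per CSV row

print(len(by_char), by_line)
```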

In [35]:
#Let's create a dictionary of the dataframe.
NIS_dict = NIS_dfclean.to_dict('records')

In [36]:
#With orient='records' this is the output:
NIS_dict

[{'Variable Name': 'SEQNUMC', 'Begin': '1', 'End': '6'},
 {'Variable Name': 'SEQNUMHH', 'Begin': '7', 'End': '11'},
 {'Variable Name': 'PDAT', 'Begin': '12', 'End': '12'},
 {'Variable Name': 'PROVWT_D', 'Begin': '13', 'End': '31'},
 {'Variable Name': 'PROVWT_D_TERR', 'Begin': '32', 'End': '50'},
 {'Variable Name': 'RDDWT_D', 'Begin': '51', 'End': '69'},
 {'Variable Name': 'RDDWT_D_TERR', 'Begin': '70', 'End': '88'},
 {'Variable Name': 'STRATUM', 'Begin': '89', 'End': '92'},
 {'Variable Name': 'YEAR', 'Begin': '93', 'End': '96'},
 {'Variable Name': 'AGECPOXR', 'Begin': '97', 'End': '97'},
 {'Variable Name': 'HAD_CPOX', 'Begin': '98', 'End': '99'},
 {'Variable Name': 'SHOTCARD', 'Begin': '100', 'End': '100'},
 {'Variable Name': 'AGEGRP', 'Begin': '101', 'End': '101'},
 {'Variable Name': 'BF_ENDR06', 'Begin': '102', 'End': '109'},
 {'Variable Name': 'BF_EXCLR06', 'Begin': '110', 'End': '117'},
 {'Variable Name': 'BF_FORMR08', 'Begin': '118', 'End': '125'},
 {'Variable Name': 'BFENDFL06', 

In [37]:
type(NIS_dict)

list

`to_dict('records')` gives us a list of dictionaries (one per row), NOT a single dictionary keyed by variable name, which is what we need. So let's try something else....

In [38]:
#INDEX set to 'Variable Name' column
NIS_dict1 = NIS_dfclean.set_index("Variable Name").T.to_dict('list')

In [39]:
#Print the new dictionary now...
NIS_dict1

{'SEQNUMC': ['1', '6'],
 'SEQNUMHH': ['7', '11'],
 'PDAT': ['12', '12'],
 'PROVWT_D': ['13', '31'],
 'PROVWT_D_TERR': ['32', '50'],
 'RDDWT_D': ['51', '69'],
 'RDDWT_D_TERR': ['70', '88'],
 'STRATUM': ['89', '92'],
 'YEAR': ['93', '96'],
 'AGECPOXR': ['97', '97'],
 'HAD_CPOX': ['98', '99'],
 'SHOTCARD': ['100', '100'],
 'AGEGRP': ['101', '101'],
 'BF_ENDR06': ['102', '109'],
 'BF_EXCLR06': ['110', '117'],
 'BF_FORMR08': ['118', '125'],
 'BFENDFL06': ['126', '126'],
 'BFFORMFL06': ['127', '127'],
 'C1R': ['128', '128'],
 'C5R': ['129', '130'],
 'CBF_01': ['131', '132'],
 'CEN_REG': ['133', '133'],
 'CHILDNM': ['134', '134'],
 'CWIC_01': ['135', '136'],
 'CWIC_02': ['137', '138'],
 'EDUC1': ['139', '139'],
 'FRSTBRN': ['140', '140'],
 'I_HISP_K': ['141', '141'],
 'INCPORAR': ['142', '157'],
 'INCPOV1': ['158', '158'],
 'INCQ298A': ['159', '160'],
 'INTRP': ['161', '161'],
 'LANGUAGE': ['162', '162'],
 'M_AGEGRP': ['163', '163'],
 'Variable Name': ['Position', 'Position'],
 'MARITAL2': ['

<b>Cool...</b> This is something similar to what we need for our for loop.
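
As a rough sketch of how a dictionary like this could drive the slicing, the string positions can be turned into Python `slice` objects. The three entries are copied from the output above; the 12-character `sample` string and the names `positions`, `slices`, and `fields` are assumptions made for illustration. The codebook positions are 1-based and inclusive, hence the `- 1` on the begin value only.

```python
# A few entries copied from the dictionary above; positions are strings.
positions = {'SEQNUMC': ['1', '6'], 'SEQNUMHH': ['7', '11'], 'PDAT': ['12', '12']}

# Convert each [begin, end] pair into a slice (1-based inclusive -> 0-based, end exclusive).
slices = {name: slice(int(begin) - 1, int(end))
          for name, (begin, end) in positions.items()}

sample = '000011000012'   # illustrative 12-character start of a record
fields = {name: sample[s] for name, s in slices.items()}
print(fields)
```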

In [40]:
#Just rechecking the length (# of rows) again...
len(NIS_dict1)

462

In [41]:
#Just rechecking that it is still a dictionary type...
type(NIS_dict1)

dict

In [42]:
#Let's pretty print the dictionary for no reason...just to try it.
from pprint import pprint

In [43]:
def printplus(obj):
    """
    Pretty-prints the object passed in.

    """
    # Dict
    if isinstance(obj, dict):
        for k, v in sorted(obj.items()):
            print (u'{0}: {1}'.format(k, v))

    # List or tuple            
    elif isinstance(obj, list) or isinstance(obj, tuple):
        for x in obj:
            print (x)

    # Other
    else:
        print (obj)

In [44]:
#Now we pretty print the dictionary.
printplus(NIS_dict1)

AGECPOXR: ['97', '97']
AGEGRP: ['101', '101']
BFENDFL06: ['126', '126']
BFFORMFL06: ['127', '127']
BF_ENDR06: ['102', '109']
BF_EXCLR06: ['110', '117']
BF_FORMR08: ['118', '125']
C1R: ['128', '128']
C5R: ['129', '130']
CBF_01: ['131', '132']
CEN_REG: ['133', '133']
CHILDNM: ['134', '134']
CWIC_01: ['135', '136']
CWIC_02: ['137', '138']
D6R: ['184', '184']
D7: ['185', '185']
DDTP1: ['277', '280']
DDTP2: ['281', '284']
DDTP3: ['285', '288']
DDTP4: ['289', '292']
DDTP5: ['293', '296']
DDTP6: ['297', '300']
DDTP7: ['301', '301']
DDTP8: ['302', '302']
DDTP9: ['303', '303']
DFLU1: ['304', '307']
DFLU2: ['308', '311']
DFLU3: ['312', '315']
DFLU4: ['316', '319']
DFLU5: ['320', '323']
DFLU6: ['324', '327']
DFLU7: ['328', '328']
DFLU8: ['329', '329']
DFLU9: ['330', '330']
DHEPA1: ['331', '334']
DHEPA2: ['335', '338']
DHEPA3: ['339', '342']
DHEPA4: ['343', '346']
DHEPA5: ['347', '347']
DHEPA6: ['348', '348']
DHEPA7: ['349', '349']
DHEPA8: ['350', '350']
DHEPA9: ['351', '351']
DHEPB1: ['352', '355

Nice! The pretty print really shows how the codebook layout works as a dictionary: each variable name maps to its begin and end position in the .dat file.

In [45]:
#Here is a test to print the values of the first Variable Name...
print(NIS_dict1['SEQNUMC'])

['1', '6']


In [46]:
#EXAMPLE of what to use in the final for loop:

for key, value in NIS_dict1.items():
    print(key, '=', value)

SEQNUMC = ['1', '6']
SEQNUMHH = ['7', '11']
PDAT = ['12', '12']
PROVWT_D = ['13', '31']
PROVWT_D_TERR = ['32', '50']
RDDWT_D = ['51', '69']
RDDWT_D_TERR = ['70', '88']
STRATUM = ['89', '92']
YEAR = ['93', '96']
AGECPOXR = ['97', '97']
HAD_CPOX = ['98', '99']
SHOTCARD = ['100', '100']
AGEGRP = ['101', '101']
BF_ENDR06 = ['102', '109']
BF_EXCLR06 = ['110', '117']
BF_FORMR08 = ['118', '125']
BFENDFL06 = ['126', '126']
BFFORMFL06 = ['127', '127']
C1R = ['128', '128']
C5R = ['129', '130']
CBF_01 = ['131', '132']
CEN_REG = ['133', '133']
CHILDNM = ['134', '134']
CWIC_01 = ['135', '136']
CWIC_02 = ['137', '138']
EDUC1 = ['139', '139']
FRSTBRN = ['140', '140']
I_HISP_K = ['141', '141']
INCPORAR = ['142', '157']
INCPOV1 = ['158', '158']
INCQ298A = ['159', '160']
INTRP = ['161', '161']
LANGUAGE = ['162', '162']
M_AGEGRP = ['163', '163']
Variable Name = ['Position', 'Position']
MARITAL2 = ['164', '164']
MOBIL_I = ['165', '165']
NUM_PHONE = ['166', '167']
NUM_CELLS_HH = ['168', '169']
NUM_CELLS_PA

In [47]:
#This saves the dictionary to a csv file.

import pandas as pd

(pd.DataFrame.from_dict(data=NIS_dict1, orient='index')
    .to_csv('data_wk5/NIS_kv.csv', header=False))

In [48]:
#TEST of pandas reading a file with fixed width columns (pd.read_fwf)
NIS_testfwf = pd.read_fwf('data_wk5/nispuf14.dat', header=None, index_col=0)

In [49]:
NIS_testfwf.head()

Unnamed: 0_level_0,1,2,3,4,5,6,7,8,9,10,...,80,81,82,83,84,85,86,87,88,89
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
000011000012 . . 218.30024855484000 218.3002485548400010222014.,223365.2500152.1875182.6250..3,2 131,2,.4223.000000000000001142131299999912 12 222212...,.,.,. .,.,....,. . .,...,,,,,,,. . .,.,.,. .
000021000021 806.84601169505000 806.84601169505000 454.86041741251200 454.8604174125120020362014.,222 91.3125121.7500 91.3125..6,1 123,2,.2120.500000000000003 4.1211 . 2 212 21 363618...,65,127,191 387,.,....,351 387 421,...,D3D3HS,30,74747474.0,D3D308,RMRMRM,,2 . .,2,2,2 .
000031000032 . . 30.54542540283290 30.5454254028329010722014.,222152.1875152.1875 91.3125..6,1 143,1,24111.7668979696736121221222 1 1 131 21 727215...,.,.,. .,.,....,. . .,...,,,,,,,. . .,.,.,. .
000041000041 63.44868567610260 63.44868567610260 36.96593137368630 36.9659313736863020162014.,221334.8125182.6250273.9375..4,1 112,2,.4123.0000000000000011421312 . 2 212 12 161642...,62,91,122 184,.,....,184 217 553,...,D3D3D3D3,30,74747474.0,D3D3D3D3,RGRG,VO,1 2 2,.,2,2 2
000051000051 94.87263225744540 94.87263225744540 64.62020426239790 64.6202042623979010732014.,223 . . . ..8,3 243,1,12120.500000000000003 321212 1 3 312 21 737332...,119,224,308 844,.,....,40410611099,...,HSHSHSHS,VM,747474.0,080821,RM,VM,2 1 1,.,2,277


Very very interesting! Without column specifications, `read_fwf` inspects the first 100 rows to infer where the fixed-width columns are, treating whitespace as the gap between columns. Unfortunately, the .dat file packs many fields together with no whitespace between them, so the inferred columns are wrong and this call is not a good fit as written. For future reference here is the read_fwf documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html#pandas.read_fwf
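
For what it's worth, `read_fwf` also accepts explicit column boundaries through its `colspecs` parameter, which takes half-open, 0-based `(start, end)` pairs. Below is a minimal sketch on the small animal dataset, not on nispuf14.dat; the `layout` list and the in-memory `raw` text are assumptions standing in for the codebook and the data file. Reading everything as `str` is a choice made here so that leading zeros in identifier-like fields would survive.

```python
import io
import pandas as pd

# Three positional records in place of nispuf14.dat.
raw = 'CAT  FLUFFY410\nDOG  FIDO  522\nBIRD CHIRP 2 1\n'

# Codebook-style positions (1-based inclusive) converted to read_fwf's
# half-open, 0-based (start, end) pairs.
layout = [('Type', 1, 5), ('Name', 6, 11), ('Age', 12, 12), ('Weight', 13, 14)]
colspecs = [(begin - 1, end) for _, begin, end in layout]
names = [name for name, _, _ in layout]

df = pd.read_fwf(io.StringIO(raw), colspecs=colspecs, names=names,
                 header=None, dtype=str)
print(df)
```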

Here is the TIP from the week 5 assignment (with my revisions):

NIS_dataline = []

for line in NIS_dict1:
    create a dictionary with Variable Name as key, 
    use Begin / End numbers as a slice to give the dictionary's value
    append dictionary to data_line
write data_line to CSV file

This is another code hint:
Outer loop reads lines from data file:
    Inner loop applies each of the key/value pairs to the line
    Append each result to a dictionary, or string, etc.
Write dictionary/string to CSV or JSON file

#TRY import csv??

NIS_lines = []
with open('data_wk5/nispuf14.dat') as data_file:
    for line in data_file:
        for key, value in NIS_dict1.items():
            keys = line[value]
            NIS_lines.append(line.strip())
            print(key, value)

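The above didn't quite work for me: `value` is a list of two strings such as ['1', '6'], and a string cannot be indexed with a list, so `line[value]` raises a TypeError. Moving on, I reached out for more help to get me out of this rut....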

In [50]:
#Let's go back to make sure our dataframe is still there
NIS_dfclean.shape

(462, 3)

In [51]:
#Take a look at the data frame columns
NIS_dfclean.columns

Index(['Variable Name', 'Begin', 'End'], dtype='object')

In [52]:
#Here we will convert the Begin column to an integer using to_numeric.
import pandas as pd
pd.to_numeric(NIS_dfclean['Begin'], errors='coerce', downcast='signed')

0        1.0
1        7.0
2       12.0
3       13.0
5       32.0
7       51.0
9       70.0
11      89.0
12      93.0
13      97.0
14      98.0
15     100.0
16     101.0
17     102.0
18     110.0
20     118.0
21     126.0
23     127.0
25     128.0
26     129.0
27     131.0
28     133.0
29     134.0
30     135.0
31     137.0
32     139.0
33     140.0
34     141.0
35     142.0
36     158.0
       ...  
675    820.0
676    822.0
677    824.0
678    825.0
679    826.0
680    827.0
681    829.0
682    831.0
683    833.0
684    835.0
685    837.0
686    838.0
687    839.0
688    840.0
689    841.0
690    843.0
691    845.0
692    847.0
693    848.0
694    849.0
696    850.0
697    851.0
698    852.0
699    853.0
701    855.0
702    857.0
703    859.0
704    861.0
706    863.0
708    865.0
Name: Begin, Length: 462, dtype: float64

In [53]:
#Here we will do the same to convert the End column to an integer using to_numeric.
pd.to_numeric(NIS_dfclean['End'], errors='coerce', downcast='signed')

0        6.0
1       11.0
2       12.0
3       31.0
5       50.0
7       69.0
9       88.0
11      92.0
12      96.0
13      97.0
14      99.0
15     100.0
16     101.0
17     109.0
18     117.0
20     125.0
21     126.0
23     127.0
25     128.0
26     130.0
27     132.0
28     133.0
29     134.0
30     136.0
31     138.0
32     139.0
33     140.0
34     141.0
35     157.0
36     158.0
       ...  
675    821.0
676    823.0
677    824.0
678    825.0
679    826.0
680    828.0
681    830.0
682    832.0
683    834.0
684    836.0
685    837.0
686    838.0
687    839.0
688    840.0
689    842.0
690    844.0
691    846.0
692    847.0
693    848.0
694    849.0
696    850.0
697    851.0
698    852.0
699    854.0
701    856.0
702    858.0
703    860.0
704    862.0
706    864.0
708    866.0
Name: End, Length: 462, dtype: float64

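Oddly enough, both 'Begin' and 'End' came back as floats, not integers. The likely culprit is the leftover header row, whose 'Position' text is coerced to NaN; a column containing NaN cannot be stored as a plain integer, so the downcast is skipped. Note also that the converted values are only displayed here and never assigned back to the dataframe. Let's move on to see what we can do further...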
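
A small sketch of that effect on a made-up series (the values and the names `begin`, `coerced`, and `cleaned` are illustrative only). A single non-numeric entry becomes NaN under `errors='coerce'`, which keeps the whole column as float64; dropping the NaN first allows a clean conversion to integers.

```python
import pandas as pd

begin = pd.Series(['1', '7', 'Position', '12'])

# 'Position' becomes NaN, so the result stays float64 despite the downcast request
coerced = pd.to_numeric(begin, errors='coerce', downcast='signed')
print(coerced.dtype)

# dropping the NaN row first makes an integer conversion possible
cleaned = pd.to_numeric(begin, errors='coerce').dropna().astype(int)
print(cleaned.tolist())
```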

In [54]:
#Just looking at the head of the dataframe again....
NIS_dfclean.head()

Unnamed: 0,Variable Name,Begin,End
0,SEQNUMC,1,6
1,SEQNUMHH,7,11
2,PDAT,12,12
3,PROVWT_D,13,31
5,PROVWT_D_TERR,32,50


In [55]:
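data_rows = []
with open('data_wk5/nispuf14.dat') as infile:    
    for line in infile:
        data_dict = {}      # new, empty dictionary for each line
        for row in NIS_dfclean.itertuples():
            # row[0] is the dataframe index; row[1], row[2], row[3] are
            # Variable Name, Begin, End
            try:
                start = int(float(row[2])) - 1   # codebook positions are 1-based
                end = int(float(row[3]))         # slice end is exclusive
            except ValueError:
                continue    # skip the leftover 'Variable Name / Position' header row
            data_dict[row[1]] = line[start : end]
        data_rows.append(data_dict)   # append once per data line, not once per variable

In [56]:
len(data_rows)

24897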

In [57]:
import csv

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('data_wk5/nispuf14.csv', 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, data_rows[0].keys())
    writer.writeheader()
    writer.writerows(data_rows)

In conclusion, I was able to get my hands dirty and experience some complicated data sets. All in all, I was able to create a csv file from the complicated data set after cleaning it. This allows us to deal with the data in a cleaner state and as a csv file.