# Reading SPSS files in Pandas

Pandas reads SPSS files through `pyreadstat`, an optional dependency that is not installed along with pandas.
To read SPSS files we have to install it separately.

Here's an example of the command to install pyreadstat:

`conda install -c conda-forge pyreadstat`

If you launched this notebook from a macOS Terminal or Git Bash on Windows, you may even be able to install directly from this notebook using the next cell. The `!` at the start of the line tells Jupyter Notebook to run the rest of the line as a shell command.


In [2]:
!conda install -c conda-forge pyreadstat


Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.
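
Before moving on, it can help to confirm that the kernel running this notebook can actually see the package, since `conda` may install into a different environment than the one Jupyter is using. A minimal check using only the standard library (the helper name `is_installed` is made up for this sketch):

```python
from importlib.util import find_spec


def is_installed(package_name):
    """Return True if the current Python environment can locate the package."""
    return find_spec(package_name) is not None


if is_installed("pyreadstat"):
    print("pyreadstat is available")
else:
    print("pyreadstat is missing; install it before calling pd.read_spss")
```

If the check reports the package as missing even after a successful install, the notebook kernel is probably running in a different environment from the one `conda` installed into.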



# Download the Add Parent Health Survey 2015-2017 data and unzip it

Go to the [Add Parent Health Survey 2015-2017](https://dataverse.unc.edu/dataset.xhtml?persistentId=doi:10.15139/S3/Q2TW3D). 

Click `Access Dataset` (upper right). You'll have to fill in some basic information, but it's a free and fairly easy download.

Save the data in a project folder and unzip it (e.g. right-click and choose 'Extract All...'). Then navigate your command line interface to that same project folder so you can open the data in pandas.
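
If you would rather unzip from Python than from the file manager, the standard library's `zipfile` module can do it. This is only a sketch: the archive name `dataverse_files.zip` is an assumption, so substitute whatever your download is actually called.

```python
from pathlib import Path
from zipfile import ZipFile


def extract_archive(zip_path, target_dir="."):
    """Extract every file in zip_path into target_dir and return the member names."""
    with ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Hypothetical file name: replace with the name of the file you downloaded.
download = Path("dataverse_files.zip")
if download.exists():
    print(extract_archive(download))
```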



# Check what SPSS .sav files are in your current directory

In [10]:
from os import listdir
all_files = listdir()
spss_files = [filename for filename in all_files if filename.endswith(".sav")]
print(spss_files)

['PP2wgt.sav', 'PParent2.sav', 'pp2ahwgt.sav', 'PFHHSP2.sav', 'Prsp2.sav', 'PSP2.sav', 'PFHHP2.sav', 'Prprnt2.sav']
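
The `endswith(".sav")` test is case-sensitive, so a file named `DATA.SAV` would be skipped. A variation with `pathlib` that ignores case and returns the names in sorted order might look like this (the function name `find_spss_files` is just for illustration):

```python
from pathlib import Path


def find_spss_files(folder="."):
    """Return sorted names of files in folder whose extension is .sav (any case)."""
    return sorted(
        path.name
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == ".sav"
    )


print(find_spss_files())
```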


# Open an SPSS file as a pandas dataframe

In [9]:
import pandas as pd
df = pd.read_spss('./Prprnt2.sav')
df.head()

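| Unnamed: 0 | AID | PFMID | MEMBRNUM | ROST_AID | TOTHHMEM | MEMBRTYP | P2HR10A | P2HR10B | P2HR10C | P2HR10D | ... | P2WP23A1 | P2WP23A2 | P2WP24A | P2WP24B | P2WP24C | P2WP24D | P2WP25A | P2WP25B | P_LVLFLG | LIVEWITH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 57103869 | 27330000 | 2.0 | 57103869.0 | 3.0 | AHSM | 7.0 | 997.0 | 97.0 | 97.0 | ... | 7.0 | 6.0 | 7.0 | 7.0 | 0.0 | 0.0 | 4.0 | 1.0 | 1.0 | 0.0 |
| 1 | 57103869 | 27330000 | 3.0 |  | 3.0 | CHILD | 7.0 | 997.0 | 97.0 | 97.0 | ... | 97.0 | 97.0 | 7.0 | 7.0 | 7.0 | 7.0 | 97.0 | 97.0 | 1.0 | 0.0 |
| 2 | 57103869 | 27330000 | 1.0 |  | 3.0 | CHILD | 1.0 | 37.0 | 97.0 | 3.0 | ... | 97.0 | 97.0 | 7.0 | 7.0 | 7.0 | 7.0 | 97.0 | 97.0 | 1.0 | 1.0 |
| 3 | 57118381 | 57118381 | 3.0 |  | 3.0 | CHILD | 7.0 | 997.0 | 97.0 | 97.0 | ... | 97.0 | 97.0 | 7.0 | 7.0 | 7.0 | 7.0 | 97.0 | 97.0 | 1.0 | 0.0 |
| 4 | 57118381 | 57118381 | 2.0 | 57118381.0 | 3.0 | AHSM | 7.0 | 997.0 | 97.0 | 97.0 | ... | 6.0 | 6.0 | 7.0 | 7.0 | 0.0 | 0.0 | 1.0 | 4.0 | 1.0 | 0.0 |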


Voila! We have a DataFrame. Of course, the column names may be hard to decipher. With government or social science datasets it is often necessary to consult a readme or data dictionary that says exactly what each column means.

Once you know this, you can select the columns you want and give them more informative names within pandas.
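
As a rough sketch of that select-and-rename step, the example below uses a tiny stand-in DataFrame so it runs without the survey file. The column codes come from the output above, but the readable names are hypothetical guesses for illustration and are not taken from the data dictionary.

```python
import pandas as pd

# Stand-in rows so the example runs on its own; this is not real survey data.
df_demo = pd.DataFrame({
    "AID": ["A1", "A2"],
    "MEMBRTYP": ["AHSM", "CHILD"],
    "P2HR10A": [7.0, 1.0],
    "P2HR10B": [997.0, 37.0],
})

# Hypothetical readable names; check the data dictionary before trusting them.
readable_names = {
    "AID": "respondent_id",
    "MEMBRTYP": "member_type",
    "P2HR10A": "childs_sex_code",
}

# Keep only the columns we have names for, then rename them in one step.
subset = df_demo[list(readable_names)].rename(columns=readable_names)
print(subset.columns.tolist())
```

Keeping the mapping in one dictionary means the selection and the renaming cannot drift apart as more columns are added.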

# Renaming Cryptic Columns

The PDF file `prprnt2.pdf` (named after the SPSS file) contains a data dictionary that says what each column means.

Let's make some notes about what these weird acronyms stand for:

`P2HR10A` - Child is: 1 = Male, 2 = Female, 6 = Refuse, 7 = Legitimate Skip

I don't know what "legitimate skip" means here; it might be necessary to read the study documentation more carefully. For now, let's copy this column under the more readable name `Childs_Sex`.

In [None]:
df['Childs_Sex'] = df['P2HR10A']

## Replacing Cryptic Result Codes with Their Meaning

In [21]:
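# Replace each numeric code with its label, touching only the Childs_Sex column.
# (Assigning to df[mask] without naming a column would overwrite entire rows.)
sex_labels = {1: 'Male', 2: 'Female', 6: 'Unknown', 7: 'Unknown'}
df['Childs_Sex'] = df['Childs_Sex'].replace(sex_labels)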

print(list(df['Childs_Sex'])[0:10])

['Unknown', 'Unknown', 'Male', 'Unknown', 'Unknown', 'Male', 'Male', 'Unknown', 'Unknown', 'Unknown']
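
Lumping "Refuse" and "Legitimate Skip" together as `'Unknown'` is one choice. An alternative, sketched here on a few stand-in values rather than the real column, is to map only the substantive codes and leave everything else as missing, so that pandas summary methods can skip or count those rows explicitly.

```python
import pandas as pd

# Stand-in codes for illustration; not the full survey column.
codes = pd.Series([7.0, 7.0, 1.0, 7.0, 2.0], name="P2HR10A")

# Codes 6 (Refuse) and 7 (Legitimate Skip) are left out on purpose.
sex_labels = {1: "Male", 2: "Female"}

# Series.map returns NaN for any value that is not a key in the dictionary.
childs_sex = codes.map(sex_labels)

# dropna=False keeps the missing values in the tally.
print(childs_sex.value_counts(dropna=False))
```

Which approach fits depends on what "legitimate skip" turns out to mean in the study documentation.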
