# Pandas: Data Management 2

*Author: Evan Carey*

*Copyright 2017-2019, BH Analytics, LLC*

## Overview

In this section, we will continue coverage of data management using Pandas. 

Our objectives for this section are:

*  Understand different dtypes in Pandas
*  Create new variables
*  Manipulate dates

## Data for this Session: Healthcare Visits

To demonstrate these concepts, we will use some simulated data from a health care system. There are two files we will use throughout this section. 

*  The first file is called `Patient.csv`, and is information about patients in the healthcare system. There should be one row per patient in this file, so we call this a patient level file. 

*  The second file is called `OutpatientVisit.csv`, and is information about individual visits to the doctor for patients in the healthcare system. There will be more than one row per patient in this file, since patients can have more than one visit. However, there should only be one row per visit, so we call this a visit-level file. 

The files are located here: 

* Data/Data_Sims/healthcare/Patient.csv
* Data/Data_Sims/healthcare/OutpatientVisit.csv

## Libraries

In [1]:
import sys
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import os
import textwrap

In [2]:
# Enable inline plotting for graphics
%matplotlib inline

We are importing a few new packages here!

* Pandas is the data management package that we will focus on. 
* Numpy is the numerical computation package in Python. You can think of it as part of the engine under the hood of Pandas.
* Matplotlib is the main plotting package in Python.
* Seaborn is an 'add-on' plotting package that is built on Matplotlib. More details on these to come later!

In [3]:
# Get Version information
print(textwrap.fill(sys.version),'\n')
print("Pandas version: {0}".format(pd.__version__),'\n')
print("Matplotlib version: {0}".format(matplotlib.__version__),'\n')
print("Numpy version: {0}".format(np.__version__),'\n')
print("Seaborn version: {0}".format(sns.__version__),'\n')

3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)] 

Pandas version: 0.24.2 

Matplotlib version: 3.0.3 

Numpy version: 1.16.2 

Seaborn version: 0.9.0 



In [4]:
# Show the output of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Check your working directory

Subsequent sessions may require you to identify and update your working directory so paths correctly point at the downloaded data files. You can check your working directory like so:

In [1]:
# Working Directory
import os
print("My working directory:\n" + os.getcwd())
# Set Working Directory 
os.chdir(r"/home/ra/host/BH_Analytics/Discover/DataEngineering/")
print("My new working directory:\n" + os.getcwd())

My working directory:
/home/ra/host/BH_Analytics/Discover/DataEngineering/notebooks
My new working directory:
/home/ra/host/BH_Analytics/Discover/DataEngineering


## Set options

Here I set the max printing rows to be 10, so I don't overwhelm the printed workbooks or these presentation materials. You can change this to a larger number since you are running this on your own machine. 

In [6]:
pd.options.display.max_rows = 10

## Importing structured data

We will import the data first. 

In [5]:
## import data
df_patient = pd.read_csv("data/Data_Sims/healthcare/Patient.csv")
df_visits = pd.read_csv("data/Data_Sims/healthcare/OutpatientVisit.csv")

Following import, you can examine the top and bottom of the dataframe simply by calling the object:

In [8]:
df_patient

| | PatientID | FirstName | LastName | State | ZipCode | DateOfBirth | Gender | Race | Income |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Loretta | Gunter | FL | 32250 | 1979-03-29 | female | white | 29.0 |
| 1 | 2 | Todd | Rea | TX | 79602 | 1936-12-20 | male | Missing | |
| 2 | 3 | Margaret | Goodwin | PA | 18106 | 1948-04-19 | female | hispanic | 53.0 |
| 3 | 4 | Anna | McCullough | TX | 75039 | 1997-08-28 | female | Missing | 20.0 |
| 4 | 5 | Glenn | Labrecque | NM | 87102 | 1985-08-19 | male | Missing | 48.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19995 | 19996 | Lucia | Atkins | MI | 49546 | 1970-01-26 | female | white | 166.0 |
| 19996 | 19997 | Wilfredo | Reinhardt | CO | 80112 | 1950-02-23 | male | other | 31.0 |
| 19997 | 19998 | Thanh | Large | FL | 33912 | 1957-11-24 | male | black | 60.0 |
| 19998 | 19999 | Deidre | Croft | GA | 30303 | 1965-01-24 | female | Unknown | |


In [9]:
df_visits

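| | VisitID | StaffID | PatientID | VisitDate | ICD10_1 | ICD10_2 | ICD10_3 | ClinicCode |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 24 | 1 | 2011-08-05 | G801 | | | 7 |
| 1 | 2 | 13 | 1 | 2013-06-15 | G801 | | | 49 |
| 2 | 3 | 36 | 1 | 2013-12-28 | G801 | | | 42 |
| 3 | 4 | 14 | 1 | 2014-10-21 | G801 | | | 29 |
| 4 | 5 | 45 | 1 | 2015-05-11 | G801 | | | 21 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 181391 | 181392 | 44 | 20000 | 2011-08-26 | E1322 | | | 17 |
| 181392 | 181393 | 28 | 20000 | 2012-01-05 | E1322 | | | 57 |
| 181393 | 181394 | 32 | 20000 | 2012-01-23 | E1322 | | | 15 |
| 181394 | 181395 | 38 | 20000 | 2012-05-10 | E1322 | | | 1 |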


## Dtypes in Pandas

Now that you have a better understanding of how Dataframes and Series work, let's discuss the different types of data in Pandas. 

One advantage of the Pandas dataframe is that it can hold columns of different types. For example, we can have...
* a column of integers (like Patient ID)  
* a column of floats (like Income)  
* a column of text data (like Name)  
* and a column of dates (like Date of Birth)  

This would not be possible with a Numpy ndarray, since all elements of an array must share one type. 
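
As a small illustration of that difference, the sketch below uses made-up values (not the course data). Building a Numpy array from mixed values coerces everything to a single dtype, while a dataframe keeps one dtype per column.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only (not taken from Patient.csv)
mixed = np.array([1, 2.5, "three"])
print(mixed.dtype)  # one string dtype shared by every element

toy = pd.DataFrame({
    "id": [1, 2, 3],                # integers
    "income": [29.0, 53.0, 20.0],   # floats
    "name": ["Ann", "Bob", "Cy"],   # text
})
print(toy.dtypes)  # one dtype per column
```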

Pandas has the following dtypes we will review:

* float  
* int  
* bool  
* datetime  
* category  
* object (the default dtype for strings)

We can also specify the bit width of numeric dtypes; the default is 64.

In [10]:
# default length is int64 and float64
df_patient.dtypes

PatientID        int64
FirstName       object
LastName        object
State           object
ZipCode          int64
DateOfBirth     object
Gender          object
Race            object
Income         float64
dtype: object

In [11]:
df_patient.get_dtype_counts()

float64    1
int64      2
object     6
dtype: int64
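
Note that `get_dtype_counts()` was deprecated in Pandas 0.25 and removed in 1.0, so the cell above only runs on older versions such as the 0.24.2 used here. A minimal sketch of an equivalent that should also work on newer versions, shown on a toy dataframe with made-up values:

```python
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "zip": [32250, 79602, 18106],
    "income": [29.0, 53.0, 20.0],
})

# Count how many columns share each dtype
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)
```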

You can also ask the dtype of a series:

In [12]:
df_patient['Income'].dtypes

dtype('float64')

In [13]:
df_patient['FirstName'].dtypes

dtype('O')

One of our first steps after importing and examining data is to establish the correct data types, and convert them as needed.

## Type Conversion in pandas

We can convert amongst dtypes with the `astype()` method.

Object dtype:

* This is the most general dtype.
* Strings are typically objects on import.

Why would we want to convert the PatientID to an Object dtype?  
I like to do this so later analytic routines and code will not accidentally treat it as a number, and do math on it! For example, the mean value of PatientID makes no sense. 

After you import and examine your data, one of your first steps should be to check the dtypes, and convert them as needed. 

In [6]:
# explicit conversion with astype
df_patient['PatientID'] = df_patient['PatientID'].astype('object')
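
To see one practical effect of this conversion, the sketch below uses a toy dataframe (made-up values, with column names borrowed from the patient file). Routines that select numeric columns, such as `select_dtypes(include="number")` or the default `describe()`, skip the object column.

```python
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({"PatientID": [1, 2, 3], "Income": [29.0, 53.0, 20.0]})
toy["PatientID"] = toy["PatientID"].astype("object")

# Only truly numeric columns are selected here
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# describe() summarizes numeric columns by default, so PatientID is skipped
summary = toy.describe()
print(summary)
```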

## Numeric Dtypes

Numeric dtypes are represented as either floats or integers in Pandas. Integers are often 'faster' than floats in compute operations, which is one of the reasons for having two classes.  

Type conversion is straightforward:

In [15]:
# You can convert it back to a float or an integer like this:
df_patient['PatientID'].astype('int64')
df_patient['PatientID'].astype('float64')

0            1
1            2
2            3
3            4
4            5
         ...  
19995    19996
19996    19997
19997    19998
19998    19999
19999    20000
Name: PatientID, Length: 20000, dtype: int64

0            1.0
1            2.0
2            3.0
3            4.0
4            5.0
          ...   
19995    19996.0
19996    19997.0
19997    19998.0
19998    19999.0
19999    20000.0
Name: PatientID, Length: 20000, dtype: float64

In [16]:
# or more generally with to_numeric()
pd.to_numeric(df_patient['PatientID'])

0            1
1            2
2            3
3            4
4            5
         ...  
19995    19996
19996    19997
19997    19998
19998    19999
19999    20000
Name: PatientID, Length: 20000, dtype: int64
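
Real data often contains values that cannot be parsed as numbers. A minimal sketch, using made-up strings, of how `pd.to_numeric()` can turn those into missing values with `errors="coerce"` instead of raising an error:

```python
import pandas as pd

# Made-up raw strings; "abc" is not a valid number
raw = pd.Series(["1", "2", "abc", "4"])

# Unparseable entries become NaN, so the result is a float column
clean = pd.to_numeric(raw, errors="coerce")
print(clean)
```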

## Category Dtype

A central dtype in Pandas is called the 'Category' dtype. 

The category dtype is a lookup table, where there is a fixed number of unique values, potentially with a specified order.

We use the category dtype for variables like gender, race, or credit score category. These are variables we consider to be traits of our data, and we might make a table out of them.

We would not use the category dtype for variables like address, or full name. Those data elements have far too many unique values!
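
The 'lookup table' idea can be seen directly through the `.cat` accessor. The sketch below uses a made-up series: each row stores an integer code that points into the list of categories, and an ordered categorical additionally supports comparisons.

```python
import pandas as pd

# Made-up values for illustration
sizes = pd.Series(["low", "high", "medium", "low"])

# Unordered: categories are sorted alphabetically by default
as_cat = sizes.astype("category")
categories = list(as_cat.cat.categories)  # the lookup table
codes = list(as_cat.cat.codes)            # position of each row in that table
print(categories)
print(codes)

# Ordered: we supply the order ourselves, which enables comparisons
ordered = pd.Series(
    pd.Categorical(sizes, categories=["low", "medium", "high"], ordered=True)
)
above_low = (ordered > "low").tolist()
print(above_low)
```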

In [17]:
## Categorical Dtype
df_patient['Gender'].dtypes

dtype('O')

You can convert (type cast) to category using `pd.Categorical()` or `.astype('category')`

In [7]:
df_patient['Race_cat'] = df_patient['Race'].astype('category')
df_patient['Race_cat']

0           white
1         Missing
2        hispanic
3         Missing
4         Missing
5           white
6           black
7           white
8                
9        hispanic
10               
11          white
12          other
13          white
14        Missing
15               
16          white
17          white
18        Missing
19          black
20          white
21          white
22       hispanic
23          white
24          other
25          black
26        Unknown
27       hispanic
28          white
29          white
           ...   
19970     Missing
19971       white
19972       white
19973    hispanic
19974       black
19975       white
19976     Missing
19977       other
19978       white
19979       white
19980     Missing
19981       white
19982       white
19983            
19984            
19985     Missing
19986       white
19987            
19988       black
19989    hispanic
19990       white
19991       other
19992     Missing
19993       black
19994     

There is an issue with this data. Why are there 9 unique values of Race? We typically need to clean our data after import, and this data is no different. It looks like there are some odd race values, which should be considered missing values. We will discuss missing values more in depth later.

In [19]:
# Why is there a 999, and a `?`
df_patient['Race_cat'].unique()
df_patient['Race_cat'].value_counts()

[white, Missing, hispanic, black, , other, Unknown, NaN, 999, ?]
Categories (9, object): [white, Missing, hispanic, black, ..., other, Unknown, 999, ?]

white       8191
hispanic    2773
black       2020
            1971
Missing     1791
other       1413
Unknown      789
999          154
?            152
Name: Race_cat, dtype: int64

In [20]:
# include missing
df_patient['Race_cat'].value_counts(dropna=False)

white       8191
hispanic    2773
black       2020
            1971
Missing     1791
other       1413
Unknown      789
NaN          746
999          154
?            152
Name: Race_cat, dtype: int64

We need to set the '999', 'Unknown', '?', 'Missing', and blank values to be missing. You could do this explicitly using code like this, with the `.isin()` method:

In [8]:
# Make new var
df_patient['Race2'] = df_patient['Race']
df_patient.loc[df_patient['Race2'].isin(['?', 'Unknown', '999', ' ', 'Missing']), 'Race2'] = np.nan

# Check if fixed
df_patient['Race2'].value_counts(dropna=False)

# Remove that var
del df_patient['Race2']

Or you could do it implicitly by listing the valid category levels; anything not listed is set to missing (`NaN`).

In [9]:
# Keep only the valid levels; everything else becomes missing
df_patient['Race_cat'] = \
    pd.Categorical(df_patient['Race'],
                   categories=['other', 'hispanic', 'white', 'black'])

In [23]:
# Verify it's fixed
df_patient['Race_cat'].value_counts(dropna=False)

white       8191
NaN         5603
hispanic    2773
black       2020
other       1413
Name: Race_cat, dtype: int64

## Date-times in Pandas

Any date or date-time variables need to be coerced into an actual datetime variable after import. 

Here is a question for you: 
> Are Dates stored as strings, or as numbers on a computer?

The answer is sort of both. Internally, dates are stored as numbers: a count of time units (days, seconds, and so on) since some index date. Pandas stores datetimes as the number of nanoseconds since 1970-01-01, but that rarely matters in your code. When you import date data into Python, it typically arrives as a text string. We will then convert it to a Pandas datetime. 
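
To see the numeric representation directly, here is a small sketch using `pd.Timestamp`. Its `.value` attribute exposes the underlying integer, which counts nanoseconds since 1970-01-01.

```python
import pandas as pd

epoch = pd.Timestamp("1970-01-01")
one_day_later = pd.Timestamp("1970-01-02")

# .value is the integer count of nanoseconds since the index date
print(epoch.value)
print(one_day_later.value)

# Subtracting two timestamps gives a Timedelta
gap_days = (one_day_later - epoch).days
print(gap_days)
```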

We can create datetimes easily with `pd.to_datetime()`

In [24]:
## Datetime dtype 
# Coerce to date time
df_patient['DateOfBirth'].head(5)

0    1979-03-29
1    1936-12-20
2    1948-04-19
3    1997-08-28
4    1985-08-19
Name: DateOfBirth, dtype: object

I like to save them as a new column, then compare to make sure there were no errors. Notice these dates are in the standard `YYYY-MM-DD` format.

In [25]:
# Autodetection works here
df_patient['DateOfBirth_dt'] = pd.to_datetime(df_patient['DateOfBirth'])

In [26]:
df_patient.loc[:, ['DateOfBirth_dt', 'DateOfBirth']]

| | DateOfBirth_dt | DateOfBirth |
|---|---|---|
| 0 | 1979-03-29 | 1979-03-29 |
| 1 | 1936-12-20 | 1936-12-20 |
| 2 | 1948-04-19 | 1948-04-19 |
| 3 | 1997-08-28 | 1997-08-28 |
| 4 | 1985-08-19 | 1985-08-19 |
| ... | ... | ... |
| 19995 | 1970-01-26 | 1970-01-26 |
| 19996 | 1950-02-23 | 1950-02-23 |
| 19997 | 1957-11-24 | 1957-11-24 |
| 19998 | 1965-01-24 | 1965-01-24 |


This auto-detection will not always work...

In [27]:
## Will not always work...
pd.to_datetime(pd.Series(["2010/12/25", "12/25/2010", "25/12/2010"]))

0   2010-12-25
1   2010-12-25
2   2010-12-25
dtype: datetime64[ns]

In [28]:
pd.to_datetime(pd.Series(["2010/12/07", "12/7/2010", "7/12/2010"]))

0   2010-12-07
1   2010-12-07
2   2010-07-12
dtype: datetime64[ns]

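If your dates are in a consistent format, the autodetection will generally work fine. But when formats are mixed or ambiguous, Pandas can guess wrong: in the last example, `7/12/2010` was read as July 12 rather than December 7. It is also slower to force Pandas to 'guess' the format.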

## Datetime Formatting

Use the format argument to exactly specify the *incoming* date format.

In [29]:
# Specify format (faster execution); it must match the incoming strings
pd.to_datetime(df_patient['DateOfBirth'], format="%Y-%m-%d")

0       1979-03-29
1       1936-12-20
2       1948-04-19
3       1997-08-28
4       1985-08-19
           ...    
19995   1970-01-26
19996   1950-02-23
19997   1957-11-24
19998   1965-01-24
19999          NaT
Name: DateOfBirth, Length: 20000, dtype: datetime64[ns]
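
An explicit format also removes ambiguity. The sketch below, using made-up strings, reads the same text month-first and day-first, and uses `errors="coerce"` so that values not matching the format become `NaT` rather than raising an error.

```python
import pandas as pd

# The same string read two ways
month_first = pd.to_datetime("7/12/2010", format="%m/%d/%Y")
day_first = pd.to_datetime("7/12/2010", format="%d/%m/%Y")
print(month_first)
print(day_first)

# Values that do not match the format become NaT with errors="coerce"
dates = pd.to_datetime(
    pd.Series(["2010-12-25", "not a date"]),
    format="%Y-%m-%d",
    errors="coerce",
)
print(dates)
```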

We can also go the other direction - create variables that are functions of the date, like day of week, or month, or year. We use these later to answer questions about interesting time trends. 

What if you were interested to know if surgeons make more errors on Fridays compared to Mondays (using a hypothetical dataset)? We would need to do the following:
* Turn surgery_date into a date column
* Extract the weekday from that date
* Calculate errors by weekday
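
The three steps above can be sketched as follows. The dataframe, its column names (`surgery_date`, `error`), and its values are all hypothetical, invented only to show the pattern.

```python
import pandas as pd

# Hypothetical surgery records; column names and values are made up
surgeries = pd.DataFrame({
    "surgery_date": ["2019-03-01", "2019-03-04", "2019-03-08", "2019-03-11"],
    "error": [1, 0, 0, 0],
})

# 1. Turn surgery_date into a datetime column
surgeries["surgery_date"] = pd.to_datetime(
    surgeries["surgery_date"], format="%Y-%m-%d"
)

# 2. Extract the weekday name from that date
surgeries["weekday"] = surgeries["surgery_date"].dt.day_name()

# 3. Calculate the error rate by weekday
error_rate = surgeries.groupby("weekday")["error"].mean()
print(error_rate)
```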

In [30]:
# Extract Year
df_patient['DateOfBirth_dt'].dt.year

0        1979.0
1        1936.0
2        1948.0
3        1997.0
4        1985.0
          ...  
19995    1970.0
19996    1950.0
19997    1957.0
19998    1965.0
19999       NaN
Name: DateOfBirth_dt, Length: 20000, dtype: float64

In [31]:
# Extract Month
df_patient['DateOfBirth_dt'].dt.month

0         3.0
1        12.0
2         4.0
3         8.0
4         8.0
         ... 
19995     1.0
19996     2.0
19997    11.0
19998     1.0
19999     NaN
Name: DateOfBirth_dt, Length: 20000, dtype: float64
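
Notice that the extracted years and months come back as floats, because the column contains a missing date. If whole numbers are preferred, one option is the nullable `Int64` dtype. A minimal sketch on made-up dates:

```python
import pandas as pd

# Made-up dates with one missing value
births = pd.to_datetime(pd.Series(["1979-03-29", None, "1948-04-19"]))

# The missing date becomes a missing year
years = births.dt.year

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>
years_int = years.astype("Int64")
print(years_int)
```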

You can extract an arbitrary format using the `dt.strftime()` method:

In [32]:
df_patient['DateOfBirth_dt'].dt.strftime('The Year was %Y, the day was %A')

0        The Year was 1979, the day was Thursday
1          The Year was 1936, the day was Sunday
2          The Year was 1948, the day was Monday
3        The Year was 1997, the day was Thursday
4          The Year was 1985, the day was Monday
                          ...                   
19995      The Year was 1970, the day was Monday
19996    The Year was 1950, the day was Thursday
19997      The Year was 1957, the day was Sunday
19998      The Year was 1965, the day was Sunday
19999                                        NaT
Name: DateOfBirth_dt, Length: 20000, dtype: object

Check out this link for more details on possible strftime arguments:  
http://strftime.org/ 

## Creating New Columns 

We can add a column by selecting a new column name with bracket syntax, then assigning a value. You have seen this a few times now in the above slides. 

Note we cannot use the attribute syntax (`df.new_col = ...`) to create a new column!

Use the `del` keyword to delete a column.
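
A short sketch of why the attribute syntax does not work, using a toy dataframe with made-up values and hypothetical column names. Assigning to a new attribute name sets a plain Python attribute and leaves the columns unchanged. The sketch also shows `drop()`, which is an alternative to `del` that returns a new dataframe.

```python
import warnings
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({"Income": [29.0, 53.0, 20.0]})

# Bracket syntax creates a real column
toy["Income_k"] = toy["Income"] * 1000

# Attribute syntax on a new name only sets a Python attribute, not a column
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about this pattern
    toy.Income_doubled = toy["Income"] * 2

cols_after = toy.columns.tolist()
print(cols_after)

# drop() returns a new frame without the listed columns
trimmed = toy.drop(columns=["Income_k"])
print(trimmed.columns.tolist())
```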

In [10]:
# Add Column
df_patient["Log_Income"] = np.log(df_patient["Income"] + 0.001)
df_patient.head()

| | PatientID | FirstName | LastName | State | ZipCode | DateOfBirth | Gender | Race | Income | Race_cat | Log_Income |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Loretta | Gunter | FL | 32250 | 1979-03-29 | female | white | 29.0 | white | 3.36733 |
| 1 | 2 | Todd | Rea | TX | 79602 | 1936-12-20 | male | Missing | | | |
| 2 | 3 | Margaret | Goodwin | PA | 18106 | 1948-04-19 | female | hispanic | 53.0 | hispanic | 3.970311 |
| 3 | 4 | Anna | McCullough | TX | 75039 | 1997-08-28 | female | Missing | 20.0 | | 2.995782 |
| 4 | 5 | Glenn | Labrecque | NM | 87102 | 1985-08-19 | male | Missing | 48.0 | | 3.871222 |


In [11]:
# Delete Column
del df_patient["Log_Income"]
df_patient.head()

| | PatientID | FirstName | LastName | State | ZipCode | DateOfBirth | Gender | Race | Income | Race_cat |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Loretta | Gunter | FL | 32250 | 1979-03-29 | female | white | 29.0 | white |
| 1 | 2 | Todd | Rea | TX | 79602 | 1936-12-20 | male | Missing | | |
| 2 | 3 | Margaret | Goodwin | PA | 18106 | 1948-04-19 | female | hispanic | 53.0 | hispanic |
| 3 | 4 | Anna | McCullough | TX | 75039 | 1997-08-28 | female | Missing | 20.0 | |
| 4 | 5 | Glenn | Labrecque | NM | 87102 | 1985-08-19 | male | Missing | 48.0 | |


## Going from Numeric to Categorical (discretize)

A common data operation is to take numeric data and categorize it (or discretize it). An example would be converting incomes to low, medium, and high. We can easily do this in pandas with the `pd.cut()` function.

In [12]:
# Request 3 equal width bins based on range of data
df_patient['Income_Cat'] = pd.cut(df_patient['Income'], bins=3, labels=['Low','Medium','High'])
df_patient['Income_Cat'].value_counts()
df_patient['Income_Cat'].dtypes


CategoricalDtype(categories=['Low', 'Medium', 'High'], ordered=True)

This distribution is odd! That is because income is skewed: the default behavior simply divides the full range into 3 equal-width intervals, so most patients land in the lowest bin. Perhaps a better idea would be to base this on the quantiles of the distribution. 
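
The effect is easy to reproduce on a tiny made-up series with one extreme value. In this sketch, `pd.cut()` with equal-width bins puts almost everything in the lowest bin, while `pd.qcut()` splits the rows into groups of equal size.

```python
import pandas as pd

# Made-up incomes with one extreme value
incomes = pd.Series([10, 20, 30, 40, 50, 1000])
labels = ["Low", "Medium", "High"]

# Equal-width bins over the full range
equal_width = pd.cut(incomes, bins=3, labels=labels)

# Quantile-based bins with (roughly) equal counts
equal_count = pd.qcut(incomes, q=3, labels=labels)

width_counts = equal_width.value_counts(sort=False)
quantile_counts = equal_count.value_counts(sort=False)
print(width_counts)
print(quantile_counts)
```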

In [36]:
# Specify bins, perhaps based on quantiles.
# First calculate quantiles
qts = df_patient['Income'].quantile([0, 0.3, 0.6, 1])
qts

0.0       3.0
0.3      38.0
0.6      72.0
1.0    1780.0
Name: Income, dtype: float64

In [13]:
# Construct intervals based on this
df_patient['Income_Cat'] = \
    pd.cut(df_patient['Income'],
           bins=[0, 36, 70, 1000],
           labels=['Low', 'Medium', 'High'])
df_patient['Income_Cat'].value_counts()

High      6887
Medium    5693
Low       5278
Name: Income_Cat, dtype: int64
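
One thing to watch with hand-written bins: any value outside the outermost edges becomes missing. The quantile output above shows a maximum income of 1780, which is above the top edge of 1000 used here. A minimal sketch on made-up values, assuming you want an open-ended top bin, uses `np.inf` as the last edge:

```python
import numpy as np
import pandas as pd

# Made-up incomes; the last one exceeds the capped top edge
incomes = pd.Series([5.0, 50.0, 1500.0])
labels = ["Low", "Medium", "High"]

# Top edge of 1000: the value 1500 falls outside every bin
capped = pd.cut(incomes, bins=[0, 36, 70, 1000], labels=labels)

# Open-ended top bin: 1500 is labeled 'High'
open_ended = pd.cut(incomes, bins=[0, 36, 70, np.inf], labels=labels)

print(capped.tolist())
print(open_ended.tolist())
```

Bins are also open on the left by default, so a value exactly equal to the lowest edge would be missing unless `include_lowest=True` is passed.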

In [38]:
# Use the built in quantile function
# Note this gives you correct intervals 
# but the cutpoints aren't 'pretty'
df_patient['Income_qcat'] = \
    pd.qcut(df_patient['Income'],
            q=[0, 0.33, 0.66, 1.0],
            labels=['Low_third', 'Med_third', 'Upper_third'])
df_patient['Income_qcat'].value_counts()

# Cleanup
del df_patient['Income_qcat']

Upper_third    6235
Med_third      6215
Low_third      6150
Name: Income_qcat, dtype: int64

## Adding a New Variable Based on Boolean Condition

A very common idiom is creating a new variable based on a boolean condition. This is easily accomplished with the Numpy function `np.where()`.

In [39]:
df_patient['Income_binary'] = \
    np.where(df_patient['Income'] >= 100,
             'High',
             'Low')

# Check result
df_patient.loc[:, ['Income', 'Income_binary']]

# Delete for now
del df_patient['Income_binary']

| | Income | Income_binary |
|---|---|---|
| 0 | 29.0 | Low |
| 1 | | Low |
| 2 | 53.0 | Low |
| 3 | 20.0 | Low |
| 4 | 48.0 | Low |
| ... | ... | ... |
| 19995 | 166.0 | High |
| 19996 | 31.0 | Low |
| 19997 | 60.0 | Low |
| 19998 | | Low |
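
Notice in the output above that rows with a missing income are labeled 'Low'. A comparison against `NaN` evaluates to False, so `np.where()` falls through to the second value. If those rows should stay missing, one possible approach (a sketch on made-up values, not the only option) is to mask the result afterwards:

```python
import numpy as np
import pandas as pd

# Made-up incomes with one missing value
income = pd.Series([29.0, np.nan, 166.0])

# NaN >= 100 is False, so the missing row is labeled 'Low'
naive = pd.Series(np.where(income >= 100, "High", "Low"), index=income.index)

# Keep the label only where income is known; other rows become missing
safe = naive.where(income.notna())

print(naive.tolist())
print(safe.tolist())
```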


## Review

We covered the following topics:

*  Understand different dtypes in Pandas
*  Create new variables
*  Manipulate dates