# Tidy Data, Missing Data, and Data Transformation

Now that we are comfortable importing data into Series and DataFrame data structures, our next several classes are focused on how to process that data for analysis:

* Today: Tidy Data, Missing Data, and Data Transformation
* Next Week: Text Processing
* Then: Merging and Reshaping Data

Friendly Reminders:

* List Comprehension and Generators DataCamp module due tonight by 11:59 p.m.
* Homework #3 due March 5 by 11:59 p.m.
* Project proposal due March 7 by 11:59 p.m.

In [1]:
import numpy as np
from numpy import nan as NA
import pandas as pd
from pandas import Series, DataFrame

## Tidy Data

In the article on tidy data, Hadley Wickham defines *data tidying* as "structuring datasets to facilitate analysis", which is one of the primary goals of this course.

He defines a *data set* as a collection of *values*, which are typically numerical (e.g., int, float) or textual (e.g., string). Each value belongs to an *observation* and a *variable*. An observation contains all values measured on the same unit of analysis (e.g., one video game). A variable contains all values that measure the same attribute (e.g., release year) across observations.

He defines tidy data as having the following structure:

* Each variable forms a column
* Each observation forms a row
* Each type of observational unit forms a table
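
As a minimal sketch of these rules, assume a hypothetical table in which the values of one variable (the year) are spread across the column headers. The table, its column names, and its numbers are made up for illustration; `melt` is one way to move the year into its own column so that each row is a single observation.

```python
import pandas as pd

# Hypothetical "wide" table: the year variable is hidden in the column names
wide = pd.DataFrame({
    'Store': ['A', 'B'],
    '2019': [10.0, 7.5],
    '2020': [12.0, 8.0],
})

# Tidy version: one column per variable (Store, Year, Revenue),
# one row per observation (a store in a given year)
tidy = wide.melt(id_vars='Store', var_name='Year', value_name='Revenue')
print(tidy)
```

Reshaping is covered in more depth in a later class; the point here is only what "each variable forms a column" looks like in practice.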

In the video game data from the pandas lab, the observations are games and the variables include the game title, platform, release year, genre, publisher, and sales data.

In [2]:
# Import video game data (assumes vgsales.csv is in the working directory)
data = pd.read_csv('vgsales.csv', index_col=0)
data.head()

Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


In [3]:
data['Year'].describe()

count    16327.000000
mean      2006.406443
std          5.828981
min       1980.000000
25%       2003.000000
50%       2007.000000
75%       2010.000000
max       2020.000000
Name: Year, dtype: float64

## Missing Data

Most data sets will not come fully populated with complete information for each observation, either because the data does not exist or it exists but was not observed. Missing data may be represented in many ways in the raw data, for example:

* NA, N/A, etc.
* Blank ('')
* Dashes (-, --)
* Explicit text (e.g., blank, missing)
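
Some of these markers are recognized by `pd.read_csv` automatically (e.g., empty fields and `N/A`), while others have to be listed through the `na_values` argument. The sketch below uses a small made-up CSV string rather than a real file, and the column names are hypothetical.

```python
import io
import pandas as pd

# Made-up raw data with several representations of "missing"
raw = io.StringIO(
    "id,score,grade\n"
    "1,90,A\n"
    "2,--,B\n"
    "3,,missing\n"
    "4,N/A,C\n"
)

# '' and 'N/A' are treated as missing by default;
# '--' and 'missing' must be declared explicitly
df_raw = pd.read_csv(raw, na_values=['--', 'missing'])
print(df_raw.isnull().sum())
```

Declaring the markers at import time keeps the `score` column numeric instead of falling back to strings.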

In Python, missing data is represented by dedicated null objects. Missing is not the same as falsy: 0 and the empty string evaluate to False but are not missing.

* None
* np.nan (not a number, imported as NA above)
* pd.NaT (not a time, the missing value for datetime data; covered later)

In [4]:
# Null objects
print(None, NA, pd.NaT)

None nan NaT
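
To make the distinction between missing and falsy values concrete, the following sketch checks a handful of candidate values with `pd.isna`. The list of candidates is chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Three null objects followed by three falsy values that are NOT missing
candidates = [None, np.nan, pd.NaT, 0, '', False]
missing_flags = [bool(pd.isna(value)) for value in candidates]
print(missing_flags)

# NaN never compares equal to itself, so use pd.isna (or .isnull) to detect it
print(np.nan == np.nan)
```

Because `np.nan == np.nan` is False, comparisons such as `ser == np.nan` will not find missing values.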


There are two primary approaches for processing missing data:

1. Filter observations with missing data
2. Impute observations with missing data (i.e., estimate or substitute in a value for the missing observations)

The primary tradeoff between filtering and imputing is that when filtering missing data, you lose some of your sample data, which could affect the power of your analysis. If you filter a small proportion of your data, that is probably OK, but if filtering missing data causes you to lose a significant proportion of your data, you should consider imputation techniques.

The challenge with imputing missing data is that you are intentionally fabricating data. However, if done in a reasonable way, imputing missing data can help preserve your sample size without significantly affecting your analysis.
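
One way to inform the choice between filtering and imputing is to measure how much data would be lost. The sketch below uses a small made-up DataFrame; what counts as "too much" loss depends on the analysis and is not something this code decides.

```python
import numpy as np
import pandas as pd

# Made-up data with missing values in both columns
df_demo = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, np.nan, 1.0, 2.0],
})

# Share of missing values per column
col_share = df_demo.isnull().mean()

# Share of rows that would be lost by dropping any row with a missing value
row_share = df_demo.isnull().any(axis=1).mean()

print(col_share)
print(row_share)
```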

pandas has built-in methods for handling missing data in a Series or DataFrame object:

* .dropna - Filter missing observations
* .fillna - Impute missing data
* .isnull - Returns True for each element that is equivalent to missing data, False otherwise
* .notnull - Returns True for each element that is not equivalent to missing data, False otherwise

Similar to many other pandas methods, .dropna and .fillna have inplace arguments to apply the method in place, as opposed to returning a new object (default).
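
The difference between the default behavior and `inplace=True` can be sketched with a small made-up Series (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

ser_demo = pd.Series([1.0, np.nan, 3.0])

# Default: a new object is returned and the original is left unchanged
filled = ser_demo.fillna(0)
missing_after_copy = int(ser_demo.isnull().sum())

# inplace=True: the original object itself is modified
ser_demo.fillna(0, inplace=True)
missing_after_inplace = int(ser_demo.isnull().sum())

print(missing_after_copy, missing_after_inplace)
```

Assigning the returned object to a name, as in the first call, is often easier to follow than modifying data in place.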

### Missing Data for Series Objects

Handling missing data for Series objects is pretty straightforward, as there is only one set of values.

In [5]:
# Define Series with missing values
ser = Series([1,2,np.nan,-4,np.nan])
ser

0    1.0
1    2.0
2    NaN
3   -4.0
4    NaN
dtype: float64

In [6]:
# .isnull, .notnull methods
ser.isnull()

0    False
1    False
2     True
3    False
4     True
dtype: bool

In [7]:
# Filter missing values - .dropna method
ser.dropna()

0    1.0
1    2.0
3   -4.0
dtype: float64

In [8]:
# Filter missing values - Boolean indexing
ser[ser.notnull()]

0    1.0
1    2.0
3   -4.0
dtype: float64

In [9]:
# Fill missing values with a constant, e.g., a measure of central tendency (here, the mean)
ser.fillna(ser.mean())

0    1.000000
1    2.000000
2   -0.333333
3   -4.000000
4   -0.333333
dtype: float64

In [10]:
# Fill missing values via forward fill or back fill - appropriate for time series data
ser.ffill()  # equivalent to ser.fillna(method='ffill'), which is deprecated in recent pandas
# ser.bfill()  # back fill

0    1.0
1    2.0
2    2.0
3   -4.0
4   -4.0
dtype: float64

### Missing Data for DataFrames

Missing data for DataFrames is a little more complex, because you may want to process missing data based on a single column or multiple columns.

In [11]:
# Define example DataFrame
df = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
df

,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [12]:
# Evaluate missing data
df.isnull().sum()

0    2
1    2
2    2
dtype: int64

In [13]:
# Filtering missing data - .dropna method
df.dropna() # drops rows with any missing values by default

,0,1,2
0,1.0,6.5,3.0


In [14]:
# Filtering missing data - .dropna method
df.dropna(how='all')

,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [16]:
# Filtering missing data - .dropna method with thresh
df.dropna(thresh=1)  # keep rows with at least 1 non-missing value
# df.dropna(thresh=2)  # keep rows with at least 2 non-missing values

,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [17]:
# Filtering missing data - Boolean indexing
df[df[0].notnull()]

,0,1,2
0,1.0,6.5,3.0
1,1.0,,


In [18]:
# Fill missing values with a constant
df.fillna(0)

,0,1,2
0,1.0,6.5,3.0
1,1.0,0.0,0.0
2,0.0,0.0,0.0
3,0.0,6.5,3.0


In [19]:
# Fill missing values for each column
df.fillna({0: 0, 1: -1, 2: -2})

,0,1,2
0,1.0,6.5,3.0
1,1.0,-1.0,-2.0
2,0.0,-1.0,-2.0
3,0.0,6.5,3.0
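
The `subset` argument of `.dropna` restricts the check to one or more columns, which covers the single-column and multiple-column cases mentioned at the start of this section. The sketch below rebuilds the same example values under a different name (`df_demo`) so that it runs on its own.

```python
from numpy import nan as NA
import pandas as pd

df_demo = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])

# Drop rows where column 0 is missing
by_one = df_demo.dropna(subset=[0])

# Drop rows where column 1 or column 2 is missing
by_two = df_demo.dropna(subset=[1, 2])

# Drop rows only when columns 0 and 1 are both missing
by_either = df_demo.dropna(subset=[0, 1], how='all')

# Impute each column with its own mean
col_means = df_demo.fillna(df_demo.mean())

print(by_one.index.tolist(), by_two.index.tolist(), by_either.index.tolist())
print(col_means)
```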


## Data Transformation

Data transformation is a general category of data processing that involves subsetting data, creating new variables (from existing variables), or modifying existing variables in some way. We have already encountered several methods for transforming data:

* Indexing and slicing
* Filtering (via boolean arrays)
* Arithmetic operations and comparisons
* Function application and mapping (.map, .apply, .applymap)
* Sorting and ranking

Today, we will explore several additional methods, including:

* Identifying and removing duplicates
* Replacing values
* Renaming axis indexes
* Random sampling
* Discretization and binning
* Computing dummy variables

### Identifying and Removing Duplicates

pandas offers two methods for processing data with duplicate observations:

* The .duplicated method returns a boolean Series with True values that represent duplicate observations
* The .drop_duplicates method filters duplicate observations from a Series or DataFrame object

The DataFrame methods have a *subset* argument that you can use to specify the column(s) that you want to be considered for duplicate observations.
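
Both methods also accept a `keep` argument. As a small sketch with made-up values, `keep=False` flags every copy of a duplicated observation rather than only the later ones, which can help when inspecting duplicates before deciding what to drop.

```python
import pandas as pd

# Made-up data: rows 0 and 2 are identical; nicknames X and Y each appear twice
df_demo = pd.DataFrame({
    'Player': ['A', 'B', 'A', 'C'],
    'Nickname': ['X', 'Y', 'X', 'Y'],
})

# Flag all copies of rows duplicated across both columns
all_copies = df_demo.duplicated(subset=['Player', 'Nickname'], keep=False)

# Count rows whose nickname already appeared earlier (default keep='first')
n_dupes = int(df_demo.duplicated(subset='Nickname').sum())

print(all_copies.tolist(), n_dupes)
```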

In [20]:
# Example DataFrame - NBA Player Nicknames
df = pd.DataFrame([('Allen Iverson', 'The Answer'), ('Earvin Johnson', 'Magic'), ('Michael Jordan', 'Air Jordan'),
                   ('Rodney Hundley', 'Hot Rod'), ('John Williams', 'Hot Rod'), ('George Gervin', 'Iceman'),
                   ('Michael Jordan', 'Air Jordan'), ('John Salley', 'Spider'), ('Jerry Sloan', 'Spider'),
                   ('Charles Barkley', 'The Chuckster')],
                  columns=['Player', 'Nickname'])
df

,Player,Nickname
0,Allen Iverson,The Answer
1,Earvin Johnson,Magic
2,Michael Jordan,Air Jordan
3,Rodney Hundley,Hot Rod
4,John Williams,Hot Rod
5,George Gervin,Iceman
6,Michael Jordan,Air Jordan
7,John Salley,Spider
8,Jerry Sloan,Spider
9,Charles Barkley,The Chuckster


In [21]:
# .duplicated method - Checks for all-column duplicates by default
df.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
7    False
8    False
9    False
dtype: bool

In [22]:
# .duplicated method - Specific column
df.duplicated(subset='Nickname')

0    False
1    False
2    False
3    False
4     True
5    False
6     True
7    False
8     True
9    False
dtype: bool

In [23]:
# .drop_duplicates method
df.drop_duplicates(subset='Nickname', keep='last') # default for keep is 'first'

,Player,Nickname
0,Allen Iverson,The Answer
1,Earvin Johnson,Magic
4,John Williams,Hot Rod
5,George Gervin,Iceman
6,Michael Jordan,Air Jordan
8,Jerry Sloan,Spider
9,Charles Barkley,The Chuckster


### Replacing Values

Replacement is the general task of substituting one specific value (or a set of specific values) with another value. Imputing missing values is one example (using .fillna), but the .replace method is more general.
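
A common use that ties `.replace` back to missing data is converting sentinel codes into NaN so that `.fillna` or `.dropna` can handle them afterwards. The sentinel values -999 and -1000 below are assumptions made for this sketch, not codes taken from the course data.

```python
import numpy as np
import pandas as pd

# Made-up measurements in which -999 and -1000 stand for "missing"
ser_demo = pd.Series([1.0, -999.0, 2.0, -1000.0, 3.0])

# Replace both sentinel values with NaN
cleaned = ser_demo.replace([-999, -1000], np.nan)

# The usual missing-data tools now apply
filled = cleaned.fillna(cleaned.mean())

print(cleaned.isnull().sum())
print(filled.tolist())
```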

In [24]:
# Single replacement
df.replace('The Chuckster', 'The Round Mound of Rebound')

,Player,Nickname
0,Allen Iverson,The Answer
1,Earvin Johnson,Magic
2,Michael Jordan,Air Jordan
3,Rodney Hundley,Hot Rod
4,John Williams,Hot Rod
5,George Gervin,Iceman
6,Michael Jordan,Air Jordan
7,John Salley,Spider
8,Jerry Sloan,Spider
9,Charles Barkley,The Round Mound of Rebound


In [25]:
# Multiple replacements, single value
df.replace(['Hot Rod', 'Spider'], 'Duplicated')

,Player,Nickname
0,Allen Iverson,The Answer
1,Earvin Johnson,Magic
2,Michael Jordan,Air Jordan
3,Rodney Hundley,Duplicated
4,John Williams,Duplicated
5,George Gervin,Iceman
6,Michael Jordan,Air Jordan
7,John Salley,Duplicated
8,Jerry Sloan,Duplicated
9,Charles Barkley,The Chuckster


In [26]:
# Case-by-case replacements
M = {'The Answer': 'AI', 'Air Jordan': 'MJ', 'The Chuckster': 'The Round Mound of Rebound'}
df.replace(M)

,Player,Nickname
0,Allen Iverson,AI
1,Earvin Johnson,Magic
2,Michael Jordan,MJ
3,Rodney Hundley,Hot Rod
4,John Williams,Hot Rod
5,George Gervin,Iceman
6,Michael Jordan,MJ
7,John Salley,Spider
8,Jerry Sloan,Spider
9,Charles Barkley,The Round Mound of Rebound


In [29]:
# Map approach
df['Nickname'].map(lambda name: M.get(name, name))
# M.get(name, name) returns the original name when it is not a key in M, which avoids a KeyError

0                            AI
1                         Magic
2                            MJ
3                       Hot Rod
4                       Hot Rod
5                        Iceman
6                            MJ
7                        Spider
8                        Spider
9    The Round Mound of Rebound
Name: Nickname, dtype: object

### Renaming Axis Indexes

Similar to the values in a Series or DataFrame, the indexes (row index, column names) can also be transformed:

* The row index can be updated by assigning to the .index attribute
* The column index can be updated by assigning to the .columns attribute

Alternatively, you can use the .rename method (inplace or not).

In [30]:
# Update nickname DataFrame index via assignment
df.index = df['Player'].map(lambda name: name.split()[1]) # extract last name
df.index.name = 'Last Name'
df

Last Name,Player,Nickname
Iverson,Allen Iverson,The Answer
Johnson,Earvin Johnson,Magic
Jordan,Michael Jordan,Air Jordan
Hundley,Rodney Hundley,Hot Rod
Williams,John Williams,Hot Rod
Gervin,George Gervin,Iceman
Jordan,Michael Jordan,Air Jordan
Salley,John Salley,Spider
Sloan,Jerry Sloan,Spider
Barkley,Charles Barkley,The Chuckster


In [31]:
# Update nickname DataFrame index via .rename method
df.rename(index=lambda name: name.upper(), columns={'Player': 'Full Name'})

Last Name,Full Name,Nickname
IVERSON,Allen Iverson,The Answer
JOHNSON,Earvin Johnson,Magic
JORDAN,Michael Jordan,Air Jordan
HUNDLEY,Rodney Hundley,Hot Rod
WILLIAMS,John Williams,Hot Rod
GERVIN,George Gervin,Iceman
JORDAN,Michael Jordan,Air Jordan
SALLEY,John Salley,Spider
SLOAN,Jerry Sloan,Spider
BARKLEY,Charles Barkley,The Chuckster


### Random Sampling

In most cases, you will likely work with your entire data set, but there may be cases when it would be useful to take a sample (with or without replacement). For example:

* You are working with a large data set, and you want to develop and test a prediction model on a small (but representative) subset of the data before executing the (much longer) run on the full data set
* You have a data set of observations that you want to feed into a simulation model
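
For either use case, it is often helpful to make the sample reproducible and to specify its size as a fraction of the data. The sketch below uses a made-up 100-row DataFrame; the seed values passed to `random_state` are arbitrary.

```python
import pandas as pd

df_demo = pd.DataFrame({'x': range(100)})

# 10% sample without replacement; the same random_state gives the same rows
s1 = df_demo.sample(frac=0.1, random_state=42)
s2 = df_demo.sample(frac=0.1, random_state=42)

# Sample with replacement of the same size as the data (a bootstrap-style sample)
boot = df_demo.sample(n=len(df_demo), replace=True, random_state=0)

print(len(s1), s1.equals(s2), len(boot))
```

Without `random_state`, each call returns a different sample, which is why the outputs in the cells that follow will differ from run to run.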

In [32]:
# Sampling without replacement
df.sample(n=5, replace=False)
# Without replacement, each row is drawn at most once; a repeated value in the sample
# means that value is duplicated in the original data set

Last Name,Player,Nickname
Sloan,Jerry Sloan,Spider
Hundley,Rodney Hundley,Hot Rod
Johnson,Earvin Johnson,Magic
Jordan,Michael Jordan,Air Jordan
Gervin,George Gervin,Iceman


In [33]:
# Sampling with replacement
df.sample(n=5, replace=True)

Last Name,Player,Nickname
Johnson,Earvin Johnson,Magic
Hundley,Rodney Hundley,Hot Rod
Gervin,George Gervin,Iceman
Barkley,Charles Barkley,The Chuckster
Williams,John Williams,Hot Rod


### Discretization and Binning

Sometimes, we have numerical data that we would like to discretize into range-based categories. Some common examples of discretization:

* Age ranges
* Time/Date ranges
* Tax income brackets
* Market capitalization
* Normal/Abnormal physiological measurements

pandas offers two functions for discretizing numerical data:

* pd.cut determines the ranges based on specific values (bin edges)
* pd.qcut determines the ranges based on quantiles
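
Before applying these functions to the course data, here is a small sketch of how bin edges are interpreted. The ages, edges, and labels are made up; the `labels`, `right`, and `include_lowest` arguments control the bin names and which edge of each bin is closed.

```python
import pandas as pd

ages = pd.Series([5, 18, 35, 64, 65, 90])
edges = [0, 18, 65, 120]
labels = ['child', 'adult', 'senior']

# Default: right-closed bins (0, 18], (18, 65], (65, 120]
right_closed = pd.cut(ages, edges, labels=labels)

# right=False: left-closed bins [0, 18), [18, 65), [65, 120)
left_closed = pd.cut(ages, edges, labels=labels, right=False)

# A value equal to the lowest edge is excluded unless include_lowest=True
edge_default = pd.cut(pd.Series([0, 10]), [0, 5, 10])
edge_included = pd.cut(pd.Series([0, 10]), [0, 5, 10], include_lowest=True)

# pd.qcut: quantile-based bins hold (roughly) equal numbers of observations
quartiles = pd.qcut(pd.Series(range(1, 101)), 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

print(list(right_closed))
print(list(left_closed))
print(quartiles.value_counts().sort_index())
```

Note how the values 18 and 65 change category depending on which side of the bin is closed.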

Let's discretize the release year column for the video game data:

In [34]:
# Explore distribution of the release year
data['Year'].describe()

count    16327.000000
mean      2006.406443
std          5.828981
min       1980.000000
25%       2003.000000
50%       2007.000000
75%       2010.000000
max       2020.000000
Name: Year, dtype: float64

In [35]:
# Create linearly spaced bins
bins = np.linspace(start=1980, stop=2020, num=9)
bins

array([1980., 1985., 1990., 1995., 2000., 2005., 2010., 2015., 2020.])

In [36]:
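# Add discretized release years to the DataFrame and preview results
# Note: bins are right-closed by default, so Year == 1980 falls outside (1980, 1985]
# and is assigned NaN; pass include_lowest=True to keep it in the first bin
data['Year Bin'] = pd.cut(data['Year'], bins)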
data[['Name', 'Platform', 'Year', 'Year Bin']].head(10)

Rank,Name,Platform,Year,Year Bin
1,Wii Sports,Wii,2006.0,"(2005.0, 2010.0]"
2,Super Mario Bros.,NES,1985.0,"(1980.0, 1985.0]"
3,Mario Kart Wii,Wii,2008.0,"(2005.0, 2010.0]"
4,Wii Sports Resort,Wii,2009.0,"(2005.0, 2010.0]"
5,Pokemon Red/Pokemon Blue,GB,1996.0,"(1995.0, 2000.0]"
6,Tetris,GB,1989.0,"(1985.0, 1990.0]"
7,New Super Mario Bros.,DS,2006.0,"(2005.0, 2010.0]"
8,Wii Play,Wii,2006.0,"(2005.0, 2010.0]"
9,New Super Mario Bros. Wii,Wii,2009.0,"(2005.0, 2010.0]"
10,Duck Hunt,NES,1984.0,"(1980.0, 1985.0]"


In [37]:
# Summarize frequency of release year bins
data['Year Bin'].value_counts().sort_index()

(1980.0, 1985.0]     127
(1985.0, 1990.0]      85
(1990.0, 1995.0]     484
(1995.0, 2000.0]    1618
(2000.0, 2005.0]    3790
(2005.0, 2010.0]    6328
(2010.0, 2015.0]    3538
(2015.0, 2020.0]     348
Name: Year Bin, dtype: int64

In [38]:
# Explore distribution of global sales
data['Global_Sales'].describe()

count    16598.000000
mean         0.537441
std          1.555028
min          0.010000
25%          0.060000
50%          0.170000
75%          0.470000
max         82.740000
Name: Global_Sales, dtype: float64

In [39]:
# Summarize frequency of decile bins
pd.qcut(data['Global_Sales'], 10).value_counts().sort_index()

(0.009000000000000001, 0.02]    1689
(0.02, 0.05]                    2088
(0.05, 0.08]                    1570
(0.08, 0.12]                    1593
(0.12, 0.17]                    1452
(0.17, 0.25]                    1599
(0.25, 0.38]                    1633
(0.38, 0.61]                    1692
(0.61, 1.21]                    1631
(1.21, 82.74]                   1651
Name: Global_Sales, dtype: int64

In [40]:
# Summarize frequency of non-uniform quantile bins
pd.qcut(data['Global_Sales'], [0., 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 1.]).value_counts().sort_index()

(0.009000000000000001, 0.02]    1689
(0.02, 0.06]                    2665
(0.06, 0.17]                    4038
(0.17, 0.47]                    4060
(0.47, 1.21]                    2495
(1.21, 2.04]                     822
(2.04, 5.431]                    663
(5.431, 82.74]                   166
Name: Global_Sales, dtype: int64

### Dummy Variables

Converting categorical variables into dummy variables is a common approach for statistical analysis; pandas offers the pd.get_dummies function for this task. The function can be applied to an ndarray, Series, or DataFrame object, and returns a DataFrame that contains each dummy variable in a separate column.
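
As a minimal sketch on a made-up Series of genres, the example below shows the basic call and the effect of `drop_first`. Recent pandas versions return boolean columns by default, so `dtype=int` is passed here to get 0/1 indicators like those shown in the outputs that follow.

```python
import pandas as pd

genres = pd.Series(['Sports', 'Racing', 'Sports', 'Platform'], name='Genre')

# One indicator column per category, in sorted category order
dummies = pd.get_dummies(genres, dtype=int)

# Drop the first category so that it acts as the reference level
reduced = pd.get_dummies(genres, dtype=int, drop_first=True)

print(dummies)
print(reduced)
```

Each row of `dummies` contains exactly one 1; in `reduced`, a row of all zeros corresponds to the dropped reference category.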

In [41]:
# Apply pd.get_dummies to Series (DataFrame column)
pd.get_dummies(data['Year Bin']).head()

      (1980.0, 1985.0]  (1985.0, 1990.0]  (1990.0, 1995.0]  (1995.0, 2000.0]  \
Rank                                                                           
1                    0                 0                 0                 0   
2                    1                 0                 0                 0   
3                    0                 0                 0                 0   
4                    0                 0                 0                 0   
5                    0                 0                 0                 1   

      (2000.0, 2005.0]  (2005.0, 2010.0]  (2010.0, 2015.0]  (2015.0, 2020.0]  
Rank                                                                          
1                    0                 1                 0                 0  
2                    0                 0                 0                 0  
3                    0                 1                 0                 0  
4                    0                 1                 0                 0  
5                    0                 0                 0                 0  

In [42]:
# Apply pd.get_dummies to specific column in DataFrame
pd.get_dummies(data[['Name', 'Platform', 'Year', 'Year Bin']], columns=['Year Bin'], prefix='', prefix_sep='').head()

Rank,Name,Platform,Year,"(1980.0, 1985.0]","(1985.0, 1990.0]","(1990.0, 1995.0]","(1995.0, 2000.0]","(2000.0, 2005.0]","(2005.0, 2010.0]","(2010.0, 2015.0]","(2015.0, 2020.0]"
1,Wii Sports,Wii,2006.0,0,0,0,0,0,1,0,0
2,Super Mario Bros.,NES,1985.0,1,0,0,0,0,0,0,0
3,Mario Kart Wii,Wii,2008.0,0,0,0,0,0,1,0,0
4,Wii Sports Resort,Wii,2009.0,0,0,0,0,0,1,0,0
5,Pokemon Red/Pokemon Blue,GB,1996.0,0,0,0,1,0,0,0,0


In [43]:
# Create dummy variables but drop first category (reference)
pd.get_dummies(data[['Name', 'Platform', 'Year', 'Year Bin']], columns=['Year Bin'], prefix='', prefix_sep='', drop_first=True).head()

Rank,Name,Platform,Year,"(1985.0, 1990.0]","(1990.0, 1995.0]","(1995.0, 2000.0]","(2000.0, 2005.0]","(2005.0, 2010.0]","(2010.0, 2015.0]","(2015.0, 2020.0]"
1,Wii Sports,Wii,2006.0,0,0,0,0,1,0,0
2,Super Mario Bros.,NES,1985.0,0,0,0,0,0,0,0
3,Mario Kart Wii,Wii,2008.0,0,0,0,0,1,0,0
4,Wii Sports Resort,Wii,2009.0,0,0,0,0,1,0,0
5,Pokemon Red/Pokemon Blue,GB,1996.0,0,0,1,0,0,0,0


# Next Time: Regular Expressions