# 12 Missing Data
File(s) needed: office_visits.csv, Taiwan_CellSurvey_RAW.xlsx, conditioning_example.csv


# Why is data missing?
In almost every dataset you may have the opportunity to work with (outside of a classroom anyway) there will be missing data. It can be missing for any number of reasons. It may simply be that a survey respondent did not answer that question. Or it could be expected due to the design of the survey. It could also be created while we work on the data.

In a survey, a value may be missing because the question on the survey was optional. Consider this set of questions: 
```
1. Are you currently employed? Yes or No
2. If you answered "Yes" to question 1, do you work full-time or part-time? Full-time or Part-time
```

People who are not employed will not answer the second question so there will be an expected missing value there. In fact, if you really want to get into the data, you would want to make sure there were no answers on #2 for anyone who answered "No" for #1.
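
The consistency check described above can be sketched in a few lines. This is only an illustration: the column names `q1_employed` and `q2_full_part` and the four responses are made up for the example, not taken from any of the course files.

```python
import pandas as pd

# Hypothetical responses to the two survey questions above
responses = pd.DataFrame({
    "q1_employed": ["Yes", "No", "Yes", "No"],
    "q2_full_part": ["Full-time", None, "Part-time", "Part-time"],
})

# Rows where question 1 is "No" but question 2 was answered anyway
inconsistent = responses[
    (responses["q1_employed"] == "No") & responses["q2_full_part"].notna()
]
print(inconsistent)
```

Any rows that show up here deserve a closer look before the data is used.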

Whatever the reason, you can expect to see missing data. But that doesn't mean you need to immediately throw out any responses with missing values. If it is expected, you can use that data for a specialized purpose. If it is not expected, there are other ways to work with it. 

You should have an idea of how you will deal with this problem before you begin your analysis. There are really three things you must address when it comes to handling missing values in your dataset.
1. You have to find the missing values.
2. You have to code them consistently (give them all the same representation).
3. You have to do something with them.

In [1]:
# Load pandas so we are ready to go
import pandas as pd

## Loading data
Missing data can be represented in a dataset in several ways:
- null values (there is truly no value present)
- NaN (pandas)
- None (Python 3.x)
- a designated value, like -999

We can have pandas find missing values and either assign the `NaN` value or keep the source value for missing data when we read the file into memory. By default, `pd.read_csv` replaces null, NA, and NaN values with `NaN`. There are three parameters for `pd.read_csv` that control this behavior:
- `na_values` 
    - This is not used very often, but it allows you to specify which additional values should be considered missing. If your data used -999 for missing values, for example, you would use `na_values=[-999]` to make sure pandas treated them as missing.
    - Note that the right side of the equals sign is a list, so you can include multiple values separated by commas.
- `keep_default_na` 
    - This is used in conjunction with `na_values`. 
    - If `True` (the default), then values like NA and NaN will be treated as missing in addition to the values you specified in the `na_values` parameter. 
    - If `False`, just the listed values are considered missing. 
- `na_filter` 
    - Specifies whether _any_ values will be coded as missing.
    - If `True` (the default), missing values are coded as `NaN`.
    - If `False`, nothing is recorded as missing.
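
To compare the three parameters side by side, here is a minimal sketch that reads the same small CSV text four ways and counts the missing cells each time. The CSV text is made up for illustration (it only resembles `office_visits.csv`), and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV text: one -999 code, one "NA", and one empty cell
csv_text = "ident,site,cost\n619,-999,250.35\n622,DR-1,NA\n734,DR-3,\n"

frames = {
    "defaults": pd.read_csv(io.StringIO(csv_text)),
    "na_values": pd.read_csv(io.StringIO(csv_text), na_values=[-999]),
    "na_values only": pd.read_csv(
        io.StringIO(csv_text), na_values=[-999], keep_default_na=False
    ),
    "na_filter=False": pd.read_csv(io.StringIO(csv_text), na_filter=False),
}

# Total number of cells coded as missing under each set of options
missing_counts = {
    name: int(frame.isna().sum().sum()) for name, frame in frames.items()
}
print(missing_counts)
```

Note that when the standard representations are no longer treated as missing, a numeric column such as `cost` can end up being read as text, because "NA" and the empty cell stay in the column as strings.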


In [2]:
# Use a variable to hold the file path and name string
my_file="../MIS-3335/data/office_visits.csv"

In [3]:
# read the office_visits.csv data using the defaults
pd.read_csv(my_file)

Unnamed: 0,ident,site,dated,cost
0,619,-999,2/8/1927,250.35
1,622,DR-1,2/10/1927,98.65
2,734,DR-3,,678.0
3,735,DR-3,1/12/1930,135.64
4,751,DR-3,2/26/1930,
5,752,DR-3,,854.0
6,837,MSK-4,1/14/1932,
7,844,DR-1,3/22/1932,45.0


In [4]:
# read the office_visits.csv data adding the na_values list as missing
pd.read_csv(my_file,na_values=[-999])

Unnamed: 0,ident,site,dated,cost
0,619,,2/8/1927,250.35
1,622,DR-1,2/10/1927,98.65
2,734,DR-3,,678.0
3,735,DR-3,1/12/1930,135.64
4,751,DR-3,2/26/1930,
5,752,DR-3,,854.0
6,837,MSK-4,1/14/1932,
7,844,DR-1,3/22/1932,45.0


In [5]:
# read the office_visits.csv data with ONLY na_values treated as missing
pd.read_csv(my_file,na_values=[-999],keep_default_na=False)

Unnamed: 0,ident,site,dated,cost
0,619,,2/8/1927,250.35
1,622,DR-1,2/10/1927,98.65
2,734,DR-3,,678.0
3,735,DR-3,1/12/1930,135.64
4,751,DR-3,2/26/1930,
5,752,DR-3,,854.0
6,837,MSK-4,1/14/1932,
7,844,DR-1,3/22/1932,45.0


### The bottom line on handling missing values while reading data
You almost always want to replace the standard missing value representations plus any special representations you know about, so at the very least use the default `pd.read_csv` the way we have been doing. An important note: if "NA" is a valid response in your data, be careful when loading so you don't treat it like it is missing.
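
As an illustration of the "NA" warning, suppose a column uses "NA" as a real code (here, a made-up region code for North America). One possible approach, sketched under that assumption, is to turn off the default list and name only the empty string as missing.

```python
import io
import pandas as pd

# Hypothetical data: "NA" is a valid region code, the empty cell is truly missing
csv_text = "order,region\n1,NA\n2,EU\n3,\n"

# Default read: "NA" is silently converted to NaN along with the empty cell
naive = pd.read_csv(io.StringIO(csv_text))

# Careful read: only the empty string counts as missing, so "NA" survives
careful = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

print(naive["region"].isna().sum())
print(careful["region"].tolist())
```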

#### Load a larger data file with missing values. Look at it in Excel before loading in pandas.
The file `Taiwan_CellSurvey_RAW.xlsx` contains results from an actual survey on computer and cell phone use in Taiwan. This version of the data contains a subset of the cell phone data.


In [6]:
# overhead


# Set display option to make sure you can see all columns
pd.options.display.max_columns = 150

# Load a larger data set with missing data to use in examples


In [7]:
# Load the Taiwan survey data into a data frame
df2=pd.read_csv("../MIS-3335/data/Taiwan_CellSurvey_RAW.csv",engine='python')

# Finding missing values
By looking at the output from `df2.info()` we can see there are many missing values in this data. To find where missing values are located we look for null values. If we use the `isna()` method, we get a table of True or False answers to the question "does this cell contain a null value?" It would be more useful to check individual fields for nulls since we know many of the fields will have null values by design.
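
Checking individual fields can be sketched as follows. The small data frame is made up for the example (it borrows three column names from the survey, but the values are invented), so treat it as an illustration rather than the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical slice of survey data with some missing values
survey = pd.DataFrame({
    "Age": [34, np.nan, 51, np.nan],
    "Employed": [1, 2, 1, 1],
    "FullPart": [1, np.nan, np.nan, 2],
})

# Number of null values in each column
null_counts = survey.isna().sum()
print(null_counts)

# Only the rows where one particular field (Age) is null
missing_age = survey[survey["Age"].isna()]
print(missing_age)
```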

In [8]:
# Example: finding null values
df2[df2.isna().any(axis=1)]

Unnamed: 0,SurveyID,Age,Gender,Type,Education,Income,Employed,FullPart,@1OwnCell,@2UsedCell,@3OwnHousePhone,@4UseFreq,BQ4a,BQ4aCont,@5UseLength,BQ5a,BQ5aCont,@6aTexting,@6bEmail,@6cInternet,@6dBank,@6eBills,@6fFacebook,@6gPics,@6hGames,@6iBuy,@7PurchOnline,BQ7a,BQ7aCont,@8CellExper,BQ8a,BQ8b,@9Comfort,BQ9a,@10Satisfaction,BQ10a,PA1,PA2,PA3,PA4,PA5,PA6,PA7,PBI1,PBI2,PBI3,PBI4,PBIUse,PBIBuy,PGL1,PGL2,PGL3,Comments,VAR00001
0,S4,,2,1,3,1.0,2.0,,,,1.0,7.0,6.0,56.0,3.0,2.0,0.75,1.0,1.0,1.0,,,1.0,,,,1.0,0.0,0.0,5.0,4.0,3.0,4.0,-1.0,4.0,0.0,5,3,2,3,1,4,3,5,2,4,1,6.3,2.1,6,4,6,,3
1,S5,,1,1,3,2.0,2.0,,10.0,10.0,1.0,5.0,4.0,12.0,3.0,2.0,0.75,1.0,,,,,,,,,1.0,0.0,0.0,3.0,2.0,1.0,5.0,0.0,3.0,-1.0,4,3,4,2,2,3,2,3,2,4,2,4.9,2.8,4,3,4,,3
2,S8,,2,1,2,2.0,2.0,,25.0,25.0,1.0,7.0,6.0,56.0,3.0,2.0,0.75,1.0,1.0,,,,,1.0,,,1.0,0.0,0.0,6.0,5.0,4.0,5.0,0.0,5.0,1.0,2,2,2,3,3,3,3,3,3,3,1,4.2,2.8,5,5,5,,2
3,S12,,2,1,2,3.0,2.0,,12.0,12.0,1.0,7.0,6.0,56.0,3.0,2.0,0.75,1.0,,,,,,1.0,,,1.0,0.0,0.0,4.0,3.0,2.0,5.0,0.0,5.0,1.0,4,1,1,1,1,3,0,4,1,4,1,5.6,1.4,5,5,5,,2
4,S13,,2,1,3,,2.0,,10.0,10.0,1.0,5.0,4.0,12.0,3.0,2.0,0.75,,,,,,,1.0,,,1.0,0.0,0.0,5.0,4.0,3.0,5.0,0.0,3.0,-1.0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,4,4,4,,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
267,S1,79.0,1,1,1,,2.0,,5.0,5.0,,3.0,2.0,1.0,,,,,,,,,,,,,1.0,0.0,0.0,2.0,1.0,0.0,5.0,0.0,4.0,0.0,3,3,3,1,1,1,1,4,1,1,1,3.5,1.4,3,3,3,,1
268,W028,80.0,2,1,2,2.0,2.0,,8.0,8.0,1.0,5.0,4.0,12.0,3.0,2.0,0.75,,1.0,,,,,,,,1.0,0.0,0.0,3.0,2.0,1.0,4.0,-1.0,2.0,-2.0,1,1,1,2,1,1,1,2,1,3,1,2.5,1.0,2,2,2,沒有學習時間大家較忙碌,2
269,W276,85.0,1,1,4,1.0,2.0,,2.0,2.0,1.0,6.0,5.0,28.0,2.0,1.0,0.25,,,,,,,1.0,,,4.0,3.0,13.0,3.0,2.0,1.0,6.0,1.0,4.0,0.0,3,3,5,5,5,5,3,3,3,6,5,4.5,4.0,5,5,5,,5
270,S33,90.0,2,1,2,1.0,2.0,2.0,,,1.0,1.0,0.0,0.0,1.0,0.0,0.00,,,,,1.0,,,,,1.0,0.0,0.0,1.0,0.0,0.0,1.0,,1.0,,1,1,1,1,3,1,2,2,1,3,1,3.5,1.4,0,1,0,,2


You can also get a frequency count of the values in a column with the `value_counts` method, including a count of null values if you use the parameter `dropna=False`. This is especially good for categorical variables.

In [9]:
# Categorical variable frequency count of the BQ5a column, including nulls ([::-1] reverses the order)
df2['BQ5a'].value_counts(dropna=False)[::-1]

NaN      4
5.0      7
3.0     10
0.0     14
4.0     19
6.0     19
2.0     83
1.0    116
Name: BQ5a, dtype: int64

## Handling missingness
Once we know there are missing values in our data and we have them coded consistently, we have to decide what to do about them. There are three possibilities:
1. ignore the missing data
2. drop the rows with missing data from the dataset
3. fill in the missing data points

Ignoring the problem (as with most problems) may lead to unanticipated consequences for your analysis, so we seldom do that. We can use the `dropna()` method to remove the rows with missing data from the dataframe, but that also can be a problem. Most of the time we will fill in the missing data.

### Filling in with a single value
To fill in missing values with a single value, use the `fillna()` data frame method.

In [10]:
# Fill missing Age values with 60

fill_value={'Age':60}
#
df2.fillna(value=fill_value)

Unnamed: 0,SurveyID,Age,Gender,Type,Education,Income,Employed,FullPart,@1OwnCell,@2UsedCell,@3OwnHousePhone,@4UseFreq,BQ4a,BQ4aCont,@5UseLength,BQ5a,BQ5aCont,@6aTexting,@6bEmail,@6cInternet,@6dBank,@6eBills,@6fFacebook,@6gPics,@6hGames,@6iBuy,@7PurchOnline,BQ7a,BQ7aCont,@8CellExper,BQ8a,BQ8b,@9Comfort,BQ9a,@10Satisfaction,BQ10a,PA1,PA2,PA3,PA4,PA5,PA6,PA7,PBI1,PBI2,PBI3,PBI4,PBIUse,PBIBuy,PGL1,PGL2,PGL3,Comments,VAR00001
0,S4,60.0,2,1,3,1.0,2.0,,,,1.0,7.0,6.0,56.0,3.0,2.0,0.75,1.0,1.0,1.0,,,1.0,,,,1.0,0.0,0.0,5.0,4.0,3.0,4.0,-1.0,4.0,0.0,5,3,2,3,1,4,3,5,2,4,1,6.3,2.1,6,4,6,,3
1,S5,60.0,1,1,3,2.0,2.0,,10.0,10.0,1.0,5.0,4.0,12.0,3.0,2.0,0.75,1.0,,,,,,,,,1.0,0.0,0.0,3.0,2.0,1.0,5.0,0.0,3.0,-1.0,4,3,4,2,2,3,2,3,2,4,2,4.9,2.8,4,3,4,,3
2,S8,60.0,2,1,2,2.0,2.0,,25.0,25.0,1.0,7.0,6.0,56.0,3.0,2.0,0.75,1.0,1.0,,,,,1.0,,,1.0,0.0,0.0,6.0,5.0,4.0,5.0,0.0,5.0,1.0,2,2,2,3,3,3,3,3,3,3,1,4.2,2.8,5,5,5,,2
3,S12,60.0,2,1,2,3.0,2.0,,12.0,12.0,1.0,7.0,6.0,56.0,3.0,2.0,0.75,1.0,,,,,,1.0,,,1.0,0.0,0.0,4.0,3.0,2.0,5.0,0.0,5.0,1.0,4,1,1,1,1,3,0,4,1,4,1,5.6,1.4,5,5,5,,2
4,S13,60.0,2,1,3,,2.0,,10.0,10.0,1.0,5.0,4.0,12.0,3.0,2.0,0.75,,,,,,,1.0,,,1.0,0.0,0.0,5.0,4.0,3.0,5.0,0.0,3.0,-1.0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,4,4,4,,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
267,S1,79.0,1,1,1,,2.0,,5.0,5.0,,3.0,2.0,1.0,,,,,,,,,,,,,1.0,0.0,0.0,2.0,1.0,0.0,5.0,0.0,4.0,0.0,3,3,3,1,1,1,1,4,1,1,1,3.5,1.4,3,3,3,,1
268,W028,80.0,2,1,2,2.0,2.0,,8.0,8.0,1.0,5.0,4.0,12.0,3.0,2.0,0.75,,1.0,,,,,,,,1.0,0.0,0.0,3.0,2.0,1.0,4.0,-1.0,2.0,-2.0,1,1,1,2,1,1,1,2,1,3,1,2.5,1.0,2,2,2,沒有學習時間大家較忙碌,2
269,W276,85.0,1,1,4,1.0,2.0,,2.0,2.0,1.0,6.0,5.0,28.0,2.0,1.0,0.25,,,,,,,1.0,,,4.0,3.0,13.0,3.0,2.0,1.0,6.0,1.0,4.0,0.0,3,3,5,5,5,5,3,3,3,6,5,4.5,4.0,5,5,5,,5
270,S33,90.0,2,1,2,1.0,2.0,2.0,,,1.0,1.0,0.0,0.0,1.0,0.0,0.00,,,,,1.0,,,,,1.0,0.0,0.0,1.0,0.0,0.0,1.0,,1.0,,1,1,1,1,3,1,2,2,1,3,1,3.5,1.4,0,1,0,,2


A discussion of all the statistical tools available to replace null values in our data is well beyond the scope of our class. However, one simple method (that might be appropriate in a simple dataset) is to replace missing values with the best measure of center calculated from the other values. The mean and median only work for numeric fields, but the mode can be used in both numeric and text fields.
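
Here is a minimal sketch of that idea on a made-up data frame: the numeric column is filled with its median and the text column with its mode. Which measure of center is "best" depends on the data, so the choices below are assumptions for the example only.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
people = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 100.0],
    "Gender": ["F", "M", "F", None],
})

# Median for the numeric column, mode (most frequent value) for the text column
fill_values = {
    "Age": people["Age"].median(),
    "Gender": people["Gender"].mode()[0],
}

# fillna returns a new data frame; people itself is not changed
filled = people.fillna(value=fill_values)
print(filled)
```

The median is used for `Age` here because the single large value (100) pulls the mean well above most of the other ages.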

In [11]:
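# Example: display the mean, median, and mode for the Age column
print(df2['Age'].mean(), df2['Age'].median(), df2['Age'].mode().tolist())
# One candidate fill value: the rounded mean of the known ages
fill_values={'Age':round(df2['Age'].mean())}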

In [12]:
# Which one should we use?


In [13]:
# Example: replace missing values with mean, median or mode


In [14]:
# Example: look at the data frame header to see the table itself hasn't changed.
# Need to add the inplace option to change the underlying data.


In [15]:
# Example: replace missing values inplace with our choice of mean, median or mode
# Which version of fill_values shall we use?


### Dropping rows with missing values
As said previously, this is usually not the best way to deal with missing data, but it can be done if necessary. We use the data frame method `dropna()` to do it. There are also parameters available to control how the data is dropped.
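
Two of the parameters that control how rows are dropped are `subset` (which columns to check) and `how` (drop when `"any"` or when `"all"` of the checked values are missing). The sketch below uses a small made-up data frame loosely modeled on the office visits data.

```python
import numpy as np
import pandas as pd

# Hypothetical visits data with missing dates and costs
visits = pd.DataFrame({
    "ident": [619, 622, 734, 735],
    "dated": ["2/8/1927", None, None, "1/12/1930"],
    "cost": [250.35, 98.65, np.nan, np.nan],
})

# Drop a row if ANY of its values is missing (the default)
complete_rows = visits.dropna()

# Only look at the cost column when deciding what to drop
has_cost = visits.dropna(subset=["cost"])

# Drop a row only if dated AND cost are both missing
has_something = visits.dropna(how="all", subset=["dated", "cost"])

print(len(visits), len(complete_rows), len(has_cost), len(has_something))
```

Like `fillna()`, `dropna()` returns a new data frame and leaves `visits` unchanged unless `inplace=True` is used.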

In [16]:
# Drop all rows with any missing values


In [17]:
# Let's focus on the Age column and drop all rows with a missing Age value


#### Remember how to drop a column? Let's get rid of the Comments column because it contains no data of value.

In [18]:
# Drop the Comments column


# Duplicate values
Duplicate data leads to bad results. The duplicated data receives more weight in the analysis than unduplicated data. A couple of duplicated points may not matter much but how do you know only a couple of points are duplicated? Plus, your analysis should always be able to withstand scrutiny and be as reproducible as possible.  So you need to find and remove duplicates.

## Finding duplicates
Use the DataFrame method `.duplicated()` to find duplicate records.

Let's try it on an example dataset.

In [21]:
# Example: finding duplicates
example_df = pd.read_csv("../MIS-3335/data/conditioning_example.csv")

# see how the data is setup
example_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Student         7 non-null      int64 
 1   Dept            7 non-null      object
 2   Class           7 non-null      int64 
 3   Grade           6 non-null      object
 4   Date completed  6 non-null      object
dtypes: int64(2), object(3)
memory usage: 408.0+ bytes


In [23]:
# Example: Find the duplicates
# print the raw data for comparison


# use duplicated() to find the duplicate values
search=example_df.duplicated()
search

0    False
1    False
2     True
3    False
4    False
5    False
6    False
dtype: bool
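
By default `duplicated()` compares every column and marks only the second and later copies of a row. The sketch below, on a made-up data frame, shows the `keep` and `subset` parameters that change this behavior.

```python
import pandas as pd

# Hypothetical class records; rows 1 and 2 are exact copies
grades = pd.DataFrame({
    "Student": [101, 101, 101, 102],
    "Dept": ["MIS", "MGMT", "MGMT", "MGMT"],
    "Class": [3335, 4347, 4347, 4347],
})

# Default (keep="first"): the first copy is not marked, later copies are
later_copies = grades.duplicated()

# keep=False marks every row that has a copy, including the first one
all_copies = grades.duplicated(keep=False)

# subset compares only the listed columns
same_class = grades.duplicated(subset=["Dept", "Class"])

print(pd.DataFrame({"later": later_copies, "all": all_copies, "class": same_class}))
```

Using `keep=False` is handy when you want to see all the copies together before deciding which one to keep.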

Duplicate values can be removed by using the `drop_duplicates()` method of the DataFrame. By default it returns a new copy of the data without the duplicate records and leaves the original untouched. The following code instead alters the original data frame by using the `inplace` parameter like we did before.

In [27]:
# remove duplicates
example_df.drop_duplicates(inplace=True)
example_df

Unnamed: 0,Student,Dept,Class,Grade,Date completed
0,101,MIS,3335,A,4/28/2018
1,101,MGMT,4347,B,
3,102,MGMT,4347,C,4/27/2018
4,102,MIS,3328,A,5/1/2018
5,103,MGMT,4347,,4/28/2018
6,103,QMTH,3335,D,5/3/2018
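
Notice that the row labels in the output above skip from 1 to 3 because the dropped row keeps its gap. The sketch below, using a made-up data frame, shows `drop_duplicates()` without `inplace` (so the original is preserved) followed by `reset_index(drop=True)` to renumber the rows.

```python
import pandas as pd

# Hypothetical records; rows 0 and 1 are exact copies
grades = pd.DataFrame({
    "Student": [101, 101, 102],
    "Grade": ["A", "A", "C"],
})

# Without inplace, the result is a new data frame and grades is unchanged
deduped = grades.drop_duplicates()

# Optionally renumber the rows 0, 1, 2, ... after dropping
renumbered = deduped.reset_index(drop=True)

print(len(grades), deduped.index.tolist(), renumbered.index.tolist())
```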


### Example: Taiwan survey data
Let's try it with a bigger dataset. 
First, let's read a new copy of the data with an index specified and get an idea of what it looks like.

NOTE: Since the whole point of the SurveyID is to uniquely identify each row, this is a better way to read this data into memory.

In [29]:
# Example: new copy of Taiwan data
df = pd.read_csv("../MIS-3335/data/Taiwan_CellSurvey_RAW.csv", index_col="SurveyID",engine='python')
df.head()

SurveyID,Age,Gender,Type,Education,Income,Employed,FullPart,@1OwnCell,@2UsedCell,@3OwnHousePhone,@4UseFreq,BQ4a,BQ4aCont,@5UseLength,BQ5a,BQ5aCont,@6aTexting,@6bEmail,@6cInternet,@6dBank,@6eBills,@6fFacebook,@6gPics,@6hGames,@6iBuy,@7PurchOnline,BQ7a,BQ7aCont,@8CellExper,BQ8a,BQ8b,@9Comfort,BQ9a,@10Satisfaction,BQ10a,PA1,PA2,PA3,PA4,PA5,PA6,PA7,PBI1,PBI2,PBI3,PBI4,PBIUse,PBIBuy,PGL1,PGL2,PGL3,Comments,VAR00001
S4,,2,1,3,1.0,2.0,,,,1.0,7.0,6.0,56.0,3.0,2.0,0.75,1.0,1.0,1.0,,,1.0,,,,1.0,0.0,0.0,5.0,4.0,3.0,4.0,-1.0,4.0,0.0,5,3,2,3,1,4,3,5,2,4,1,6.3,2.1,6,4,6,,3
S5,,1,1,3,2.0,2.0,,10.0,10.0,1.0,5.0,4.0,12.0,3.0,2.0,0.75,1.0,,,,,,,,,1.0,0.0,0.0,3.0,2.0,1.0,5.0,0.0,3.0,-1.0,4,3,4,2,2,3,2,3,2,4,2,4.9,2.8,4,3,4,,3
S8,,2,1,2,2.0,2.0,,25.0,25.0,1.0,7.0,6.0,56.0,3.0,2.0,0.75,1.0,1.0,,,,,1.0,,,1.0,0.0,0.0,6.0,5.0,4.0,5.0,0.0,5.0,1.0,2,2,2,3,3,3,3,3,3,3,1,4.2,2.8,5,5,5,,2
S12,,2,1,2,3.0,2.0,,12.0,12.0,1.0,7.0,6.0,56.0,3.0,2.0,0.75,1.0,,,,,,1.0,,,1.0,0.0,0.0,4.0,3.0,2.0,5.0,0.0,5.0,1.0,4,1,1,1,1,3,0,4,1,4,1,5.6,1.4,5,5,5,,2
S13,,2,1,3,,2.0,,10.0,10.0,1.0,5.0,4.0,12.0,3.0,2.0,0.75,,,,,,,1.0,,,1.0,0.0,0.0,5.0,4.0,3.0,5.0,0.0,3.0,-1.0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,4,4,4,,3


In [30]:
# .info() on this data will give you much more than the previous example!
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 272 entries, S4 to W281
Data columns (total 53 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Age              257 non-null    float64
 1   Gender           272 non-null    int64  
 2   Type             272 non-null    int64  
 3   Education        272 non-null    int64  
 4   Income           263 non-null    float64
 5   Employed         271 non-null    float64
 6   FullPart         82 non-null     float64
 7   @1OwnCell        261 non-null    float64
 8   @2UsedCell       261 non-null    float64
 9   @3OwnHousePhone  266 non-null    float64
 10  @4UseFreq        270 non-null    float64
 11  BQ4a             270 non-null    float64
 12  BQ4aCont         270 non-null    float64
 13  @5UseLength      268 non-null    float64
 14  BQ5a             268 non-null    float64
 15  BQ5aCont         268 non-null    float64
 16  @6aTexting       202 non-null    float64
 17  @6bEmail         90

In [36]:
# Find any duplicates
df.duplicated()

SurveyID
S4      False
S5      False
S8      False
S12     False
S13     False
        ...  
S1      False
W028    False
W276    False
S33     False
W281    False
Length: 272, dtype: bool

Look for these rows in the original Excel file before we move on. Once you are satisfied with those results, save a new copy of the DataFrame without the duplicate values.

**If you are intending to save any changes to disk, make sure you work on a _copy_ of the data file and not on the original data file itself. If we have the original data file available we can always start over.**
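
One way to follow this advice is to write the cleaned data to a new file name and never write to the original path. The sketch below is an illustration only: the file names `survey_raw.csv` and `survey_clean.csv` and their contents are made up, and a temporary folder stands in for the course data folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names; the raw file is only ever read, never written
    original_path = Path(tmp) / "survey_raw.csv"
    cleaned_path = Path(tmp) / "survey_clean.csv"
    original_path.write_text("SurveyID,Age\nS1,79\nS1,79\nS2,80\n")

    raw = pd.read_csv(original_path)
    cleaned = raw.drop_duplicates()

    # Save the cleaned data under a different name
    cleaned.to_csv(cleaned_path, index=False)

    # The original file still has every row, so we can always start over
    original_rows = len(pd.read_csv(original_path))
    cleaned_rows = len(pd.read_csv(cleaned_path))

print(original_rows, cleaned_rows)
```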

In [None]:
# remove duplicates
df.