# Intro to Pandas
by Ryan Orsinger

## Module 3: DataFrames Continued

### Pandas DataFrames Continued - Identifying Missing Values
- Identifying and counting missing values
- Removing rows with missing information
- Dropping columns from a DataFrame

In [12]:
import pandas as pd

In [13]:
# Let's generate some data with missing values. 
# Real world data often has missing values
df = pd.DataFrame([
    {
        "item": "crackers",
        "serving_size": "4 crackers",
        "calories": 10,
        "fat": "1.1g",
        "sodium": "125mg",
        "price": 2.99,
        "discount": None
    },
    {
        "item": "club soda",
        "serving_size": "8 oz",
        "calories": None,
        "fat": None,
        "sodium": "75mg",
        "price": 2.25,
        "discount": None

    },
    {
        "item": "apple",
        "serving_size": 2,
        "calories": 95,
        "fat": None,
        "sodium": None,
        "price": 1.99,
        "discount": None
    },
    {
        "item": "banana",
        "serving_size": 3,
        "calories": 105,
        "fat": "0.4g",
        "sodium": "1mg",
        "price": None,
        "discount": None
    },
    {
        "item": "spam",
        "serving_size": "1 tin",
        "calories": None,
        "fat": None,
        "sodium": None,
        "price": None,
        "discount": None
    }
])

# Set the index to be the item name
df.set_index("item", inplace=True)
df

item,serving_size,calories,fat,sodium,price,discount
crackers,4 crackers,10.0,1.1g,125mg,2.99,
club soda,8 oz,,,75mg,2.25,
apple,2,95.0,,,1.99,
banana,3,105.0,0.4g,1mg,,
spam,1 tin,,,,,


In [14]:
# The .info method outputs data types and non-null value count
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, crackers to spam
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   serving_size  5 non-null      object 
 1   calories      3 non-null      float64
 2   fat           2 non-null      object 
 3   sodium        3 non-null      object 
 4   price         3 non-null      float64
 5   discount      0 non-null      object 
dtypes: float64(2), object(4)
memory usage: 280.0+ bytes


In [10]:
# Notice that missing values in a numeric column show as NaN, which means "not a number"
# For more on NaN, see https://en.wikipedia.org/wiki/NaN
df.calories

item
crackers      10.0
club soda      NaN
apple         95.0
banana       105.0
spam           NaN
Name: calories, dtype: float64

In [11]:
# NaN lets us do math on columns with missing values without raising errors
# Many pandas math methods, like .mean, skip NaNs by default
df.calories.mean()

70.0
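The effect of skipping NaNs can be seen by turning it off. The sketch below rebuilds the calories values as a standalone Series (the variable names are illustrative, not part of the lesson) and compares the default behavior with `skipna=False`.

```python
import math

import pandas as pd

# Same values as the calories column above; None becomes NaN in a numeric Series
calories = pd.Series([10, None, 95, 105, None], name="calories")

# Default: NaNs are skipped, so the mean is taken over the 3 known values
mean_skipping_nan = calories.mean()

# skipna=False: any NaN in the data makes the result NaN
mean_with_nan = calories.mean(skipna=False)

print(mean_skipping_nan)          # 70.0
print(math.isnan(mean_with_nan))  # True
```

Note that the default mean describes only the rows that have a value, not all five items.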

In [20]:
# By default, .value_counts ignores NaNs, too
df.sodium.value_counts()

125mg    1
75mg     1
1mg      1
Name: sodium, dtype: int64

In [21]:
# Use dropna=False to count missing values
df.sodium.value_counts(dropna=False)

None     2
125mg    1
75mg     1
1mg      1
Name: sodium, dtype: int64

In [6]:
# Notice that missing values in a string/object column show as None
df.fat

item
crackers     1.1g
club soda    None
apple        None
banana       0.4g
spam         None
Name: fat, dtype: object

In [7]:
# .isna() can operate on a column, returning a boolean series
df.sodium.isna()

item
crackers     False
club soda    False
apple         True
banana       False
spam          True
Name: sodium, dtype: bool

In [8]:
# .isna() can also operate on the entire dataframe
df.isna()

item,serving_size,calories,fat,sodium,price,discount
crackers,False,False,False,False,False,True
club soda,False,True,True,False,False,True
apple,False,False,True,True,False,True
banana,False,False,False,False,True,True
spam,False,True,True,True,True,True


In [9]:
# Counting the number of nulls by column
print("Number of nulls by column")
df.isna().sum()

Number of nulls by column


serving_size    0
calories        2
fat             3
sodium          2
price           2
discount        5
dtype: int64

In [10]:
print("Proportion of nulls by column")
df.isna().mean()

Proportion of nulls by column


serving_size    0.0
calories        0.4
fat             0.6
sodium          0.4
price           0.4
discount        1.0
dtype: float64
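Both results come from the fact that `True` counts as 1 and `False` as 0, so the sum of a boolean column is a count and its mean is a proportion. A small sketch on a made-up frame (`toy` and its columns are hypothetical) shows that the two agree:

```python
import pandas as pd

# Hypothetical 4-row frame with a numeric and a string column
toy = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": ["x", "y", None, "z"],
})

null_counts = toy.isna().sum()   # number of missing values per column
null_share = toy.isna().mean()   # proportion of missing values per column

# The proportion is the count divided by the number of rows
print(null_counts / len(toy))
print(null_share)
```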

In [11]:
# Counting the number of nulls by row
# Recall that .sum works down each column by default; pass axis=1 to sum across each row
print("Number of nulls by row")
df.isna().sum(axis=1)

Number of nulls by row


item
crackers     1
club soda    3
apple        3
banana       2
spam         5
dtype: int64

In [12]:
# Proportion of nulls by row
# Like .sum, .mean works down each column by default; pass axis=1 to average across each row
print("Proportion of nulls by row")
df.isna().mean(axis=1)

Proportion of nulls by row


item
crackers     0.166667
club soda    0.500000
apple        0.500000
banana       0.333333
spam         0.833333
dtype: float64

### Handling Missing Values
- There's no one right answer for all cases. 
- "It depends" is a common answer in data science. Context matters.
- Sometimes missing values might mean zero, depending on the context, so we can fill in zero.
- Sometimes, dropping entire rows or columns is appropriate.
- Other times, filling missing values with the mean, the median, the mode, or a likely value is appropriate
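As a rough sketch of two of the filling options listed above, the example below fills one missing price with zero and then with the median. The Series reuses the price values from this lesson's data; whether either fill is sensible depends on the context, and the variable names are illustrative.

```python
import pandas as pd

# Prices from the lesson's data, with banana's price missing
prices = pd.Series(
    [2.99, 2.25, 1.99, None],
    index=["crackers", "club soda", "apple", "banana"],
    name="price",
)

# Option 1: treat missing as zero (only reasonable if "missing" really means 0)
filled_zero = prices.fillna(0)

# Option 2: fill with the median of the known values
filled_median = prices.fillna(prices.median())

print(filled_zero)
print(filled_median)
```

`fillna` returns a new Series here, so `prices` itself still contains the missing value.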

In [13]:
# Example of removing null values 
# dropna drops every row that contains at least one null value
# Since there is missing data in every row, this is quite destructive...
# The default argument is axis=0, which means rows are dropped
df.dropna()

item,serving_size,calories,fat,sodium,price,discount


In [14]:
# dropna(axis=1) drops all columns with any missing values
# This is also too destructive to be helpful
df.dropna(axis=1)

item,serving_size
crackers,4 crackers
club soda,8 oz
apple,2
banana,3
spam,1 tin
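`dropna` also accepts arguments that make it less destructive. The sketch below uses a small made-up frame (`toy` is hypothetical, loosely modeled on the data above) to show `how="all"`, which drops only rows or columns that are entirely missing, and `subset`, which checks only the named columns.

```python
import pandas as pd

# Hypothetical frame: discount is entirely missing, spam has no values at all
toy = pd.DataFrame(
    {
        "calories": [10.0, None, None],
        "price": [2.99, 2.25, None],
        "discount": [None, None, None],
    },
    index=["crackers", "club soda", "spam"],
)

# Drop only the columns where every value is missing (discount)
no_empty_cols = toy.dropna(axis=1, how="all")

# Then drop only the rows where every remaining value is missing (spam)
no_empty_rows = no_empty_cols.dropna(how="all")

# Drop rows that are missing a value in the price column specifically
has_price = toy.dropna(subset=["price"])

print(no_empty_rows)
print(has_price)
```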


In [15]:
# Let's review the dataframe
df.head()

item,serving_size,calories,fat,sodium,price,discount
crackers,4 crackers,10.0,1.1g,125mg,2.99,
club soda,8 oz,,,75mg,2.25,
apple,2,95.0,,,1.99,
banana,3,105.0,0.4g,1mg,,
spam,1 tin,,,,,


### Handling Missing Values is a Case in Creative Problem Solving
- Could some missing values be a default like 0 or "unknown"?
- Based on the context, is there a logic you could apply to fill missing values?
- Sometimes, analysts drop rows with too many missing values
- Other times, analysts drop columns with too many missing values
- Missing values can also be filled with a reasonable estimation, like a median, mean, or mode value.
- Filling too many missing values can skew the original data
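One way to express "too many missing values" in code is a cutoff on the proportion of nulls per row. The sketch below is only an illustration: the frame is made up and the 50% cutoff is an arbitrary assumption, not a recommendation. It also shows `dropna(thresh=...)`, which keeps rows with at least that many non-null values.

```python
import pandas as pd

# Hypothetical frame with three columns and varying amounts of missing data
toy = pd.DataFrame(
    {
        "calories": [10.0, None, None],
        "fat": ["1.1g", None, None],
        "price": [2.99, 2.25, None],
    },
    index=["crackers", "club soda", "spam"],
)

# Proportion of missing values in each row
share_missing = toy.isna().mean(axis=1)

# Keep rows where at most half of the values are missing (assumed cutoff)
mostly_complete = toy[share_missing <= 0.5]

# Equivalent idea with dropna: keep rows with at least 2 non-null values
at_least_two = toy.dropna(thresh=2)

print(mostly_complete)
print(at_least_two)
```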

In [16]:
# The discount column is adding no information here, so we can drop it
df.drop(columns="discount", inplace=True)
df

item,serving_size,calories,fat,sodium,price
crackers,4 crackers,10.0,1.1g,125mg,2.99
club soda,8 oz,,,75mg,2.25
apple,2,95.0,,,1.99
banana,3,105.0,0.4g,1mg,
spam,1 tin,,,,


In [17]:
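# The spam row has no data beyond serving_size, so we can remove it
# Reassign df to the rows whose index is not "spam"
# df.drop(index=["spam"], inplace=True) would produce the same result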
df = df[df.index != "spam"]
df

item,serving_size,calories,fat,sodium,price
crackers,4 crackers,10.0,1.1g,125mg,2.99
club soda,8 oz,,,75mg,2.25
apple,2,95.0,,,1.99
banana,3,105.0,0.4g,1mg,


In [18]:
# Example of filling null values with a reasonable value
# Apples and club soda don't have fat, so these missing values can be 0
df.fat = df.fat.fillna(0)
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.fat = df.fat.fillna(0)


item,serving_size,calories,fat,sodium,price
crackers,4 crackers,10.0,1.1g,125mg,2.99
club soda,8 oz,,0,75mg,2.25
apple,2,95.0,0,,1.99
banana,3,105.0,0.4g,1mg,
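The warning above appears because `df` was produced by filtering another DataFrame, so pandas cannot tell whether the assignment is meant to affect the original. One common way to make the intent explicit is to call `.copy()` after filtering. The sketch below uses hypothetical names (`full`, `subset`) and a numeric column. Whether the warning is shown at all depends on the pandas version.

```python
import pandas as pd

# Hypothetical original frame
full = pd.DataFrame(
    {"calories": [10.0, None, None]},
    index=["crackers", "club soda", "spam"],
)

# .copy() makes subset an independent DataFrame rather than a slice of full
subset = full[full.index != "spam"].copy()

# Assigning to the copy is unambiguous
subset["calories"] = subset["calories"].fillna(0)

print(subset)
print(full)  # the original still has its missing values
```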


In [19]:
# Example of .loc's row_indexing and column_indexing
# [start_row:end_row, column_start:column_end]
# [:,] returns all rows and all columns
df.loc[:,].head()

item,serving_size,calories,fat,sodium,price
crackers,4 crackers,10.0,1.1g,125mg,2.99
club soda,8 oz,,0,75mg,2.25
apple,2,95.0,0,,1.99
banana,3,105.0,0.4g,1mg,


In [20]:
# Notice how we're getting the range of rows from club soda through banana
# Label slices with .loc include both endpoints
df.loc["club soda":"banana",]

item,serving_size,calories,fat,sodium,price
club soda,8 oz,,0,75mg,2.25
apple,2,95.0,0,,1.99
banana,3,105.0,0.4g,1mg,


In [21]:
# .loc also accepts a boolean mask; here we select rows by comparing the index
df.loc[df.index == "apple"]

item,serving_size,calories,fat,sodium,price
apple,2,95.0,0,,1.99


In [22]:
# The boolean mask can come from a column, too; here we select rows where serving_size is 3
df.loc[df.serving_size == 3,]

item,serving_size,calories,fat,sodium,price
banana,3,105.0,0.4g,1mg,


In [23]:
# Combine a boolean mask for rows with a label slice for columns
df.loc[df.index == "apple", "serving_size":"fat"]

item,serving_size,calories,fat
apple,2,95.0,0


In [24]:
# All rows, show only calories as the column
df.loc[:, "calories"]

item
crackers      10.0
club soda      NaN
apple         95.0
banana       105.0
Name: calories, dtype: float64

In [25]:
# Notice how : for rows returns all rows
# Show all the columns from calories through price (inclusive)
df.loc[:, "calories":"price"]

item,calories,fat,sodium,price
crackers,10.0,1.1g,125mg,2.99
club soda,,0,75mg,2.25
apple,95.0,0,,1.99
banana,105.0,0.4g,1mg,


In [26]:
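# Some pandas operations may raise a SettingWithCopyWarning
# Recommend reading the documentation carefully
# Pandas developers designed this warning because assigning to a filtered DataFrame can have effects that are difficult to predict
# The earlier fillna assignment still ran despite the warning, but the warning can feel disruptive
# Assigning with .loc[row_indexer, column_indexer] is the form the warning recommends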
df.loc[df.calories.isna(), "calories"] = 0
df

item,serving_size,calories,fat,sodium,price
crackers,4 crackers,10.0,1.1g,125mg,2.99
club soda,8 oz,0.0,0,75mg,2.25
apple,2,95.0,0,,1.99
banana,3,105.0,0.4g,1mg,


## Additional Resources
- [.value_counts](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html)
- [Pandas .isna documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html)
- [numpy random choice](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html) method to choose a random value from a collection

## Exercises
- Use `pd.read_csv` to read `"penguins.csv"` into a dataframe variable named `penguins`
- Write the pandas code to count the number of missing values by column
- Write the pandas code necessary to get the proportion of missing values by row. Store this to a variable named `percent_missing_by_row`
- Sort the `percent_missing_by_row` series in descending order. How many of the rows are mostly empty?

In [3]:
# Use `pd.read_csv` to read `"penguins.csv"` into a dataframe variable named `penguins`


In [4]:
# Use .isna to count the number of missing values by column


In [5]:
# Write the pandas code necessary to get the proportion of missing values
# by row. Store this to a variable named `percent_missing_by_row`


In [None]:
# Sort the `percent_missing_by_row` series in descending order
# How many of the rows are mostly empty?