# Intro to Pandas
by Ryan Orsinger

## Module 3: DataFrames Continued

### Pandas DataFrames Continued - Part 1
- Identifying and counting missing values
- Removing rows with missing information
- Dropping columns from a DataFrame
- Filling missing values
- Using `.loc` with DataFrames

In [1]:
import pandas as pd

In [2]:
# Let's generate some data with missing values.
# Real-world data often has missing values.
df = pd.DataFrame([
    {
        "item": "crackers",
        "serving_size": "4 crackers",
        "calories": 10,
        "fat": "1.1g",
        "sodium": "125mg",
        "price": "$2.99",
        "discount": None
    },
    {
        "item": "club soda",
        "serving_size": "8oz",
        "calories": None,
        "fat": None,
        "sodium": "75mg",
        "price": "$2.25",
        "discount": None
    },
    {
        "item": "apple",
        "serving_size": 2,
        "calories": 95,
        "fat": None,
        "sodium": None,
        "price": "$1.99",
        "discount": None
    },
    {
        "item": "banana",
        "serving_size": 3,
        "calories": 105,
        "fat": "0.4g",
        "sodium": "1mg",
        "price": None,
        "discount": None
    },
    {
        "item": "spam",
        "serving_size": None,
        "calories": None,
        "fat": None,
        "sodium": None,
        "price": None,
        "discount": None
    }
])

df.set_index("item", inplace=True)
df

item,serving_size,calories,fat,sodium,price,discount
crackers,4 crackers,10.0,1.1g,125mg,$2.99,
club soda,8oz,,,75mg,$2.25,
apple,2,95.0,,,$1.99,
banana,3,105.0,0.4g,1mg,,
spam,,,,,,


In [3]:
# The .info method outputs each column's data type and non-null value count
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, crackers to spam
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   serving_size  4 non-null      object 
 1   calories      3 non-null      float64
 2   fat           2 non-null      object 
 3   sodium        3 non-null      object 
 4   price         3 non-null      object 
 5   discount      0 non-null      object 
dtypes: float64(1), object(5)
memory usage: 280.0+ bytes


In [4]:
# Notice that missing values in a numeric column show as NaN, which means "not a number"
# For more on NaN, see https://en.wikipedia.org/wiki/NaN
df.calories

item
crackers      10.0
club soda      NaN
apple         95.0
banana       105.0
spam           NaN
Name: calories, dtype: float64

In [5]:
# Notice that missing values in a string/object column show as None
df.fat

item
crackers     1.1g
club soda    None
apple        None
banana       0.4g
spam         None
Name: fat, dtype: object

In [6]:
# .isna() can operate on a column, returning a boolean series
df.sodium.isna()

item
crackers     False
club soda    False
apple         True
banana       False
spam          True
Name: sodium, dtype: bool
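
A boolean series like the one above can be passed to `.loc` to keep only the matching rows. The sketch below uses a small stand-in frame named `demo` (a hypothetical name, with values copied from the sodium column) so that it runs on its own.

```python
import pandas as pd

# Stand-in frame (hypothetical name) mirroring the sodium column above
demo = pd.DataFrame(
    {"sodium": ["125mg", "75mg", None, "1mg", None]},
    index=["crackers", "club soda", "apple", "banana", "spam"],
)

# Rows where sodium is missing
missing_sodium = demo.loc[demo["sodium"].isna()]

# Rows where sodium is present (~ negates the boolean mask)
has_sodium = demo.loc[~demo["sodium"].isna()]

print(missing_sodium.index.tolist())
print(has_sodium.index.tolist())
```

`.notna()` is the built-in inverse of `.isna()` and reads a little more clearly than negating the mask with `~`.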

In [7]:
# .isna() can also operate on the entire dataframe
df.isna()

item,serving_size,calories,fat,sodium,price,discount
crackers,False,False,False,False,False,True
club soda,False,True,True,False,False,True
apple,False,False,True,True,False,True
banana,False,False,False,False,True,True
spam,True,True,True,True,True,True


In [8]:
# Counting the number of nulls by column
print("Number of nulls by column")
df.isna().sum()

Number of nulls by column


serving_size    1
calories        2
fat             3
sodium          2
price           2
discount        5
dtype: int64

In [9]:
print("Proportion of nulls by column")
df.isna().mean()

Proportion of nulls by column


serving_size    0.2
calories        0.4
fat             0.6
sodium          0.4
price           0.4
discount        1.0
dtype: float64
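
The proportion of nulls per column is a handy input for deciding which columns to drop. The sketch below is one possible approach, not a rule: the 0.5 cutoff and the names `demo`, `threshold`, `null_share`, and `mostly_null` are assumptions made for illustration.

```python
import pandas as pd

# Stand-in frame (hypothetical name) with three of the columns from above
demo = pd.DataFrame(
    {
        "calories": [10, None, 95, 105, None],
        "fat": ["1.1g", None, None, "0.4g", None],
        "discount": [None, None, None, None, None],
    },
    index=["crackers", "club soda", "apple", "banana", "spam"],
)

threshold = 0.5  # assumed cutoff; pick one that suits your data

# Proportion of missing values in each column
null_share = demo.isna().mean()

# Labels of the columns whose null proportion exceeds the cutoff
mostly_null = null_share[null_share > threshold].index

# .drop returns a new DataFrame without those columns
trimmed = demo.drop(columns=mostly_null)
print(trimmed.columns.tolist())
```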

In [10]:
# Counting the number of nulls by row
# Recall that .sum works down each column by default; pass axis=1 to sum across each row
print("Number of nulls by row")
df.isna().sum(axis=1)

Number of nulls by row


item
crackers     1
club soda    3
apple        3
banana       2
spam         6
dtype: int64

In [11]:
# Proportion of nulls by row
# Like .sum, .mean accepts axis=1 to work across each row
print("Proportion of nulls by row")
df.isna().mean(axis=1)

Proportion of nulls by row


item
crackers     0.166667
club soda    0.500000
apple        0.500000
banana       0.333333
spam         1.000000
dtype: float64

### Handling Missing Values
- There's no one right answer for all cases. 
- "It depends" is a co
- Sometimes missing values might mean zero, depending on the context, so we can fill in zero.
    - this
    

In [12]:
# Example of removing null values 
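
A minimal sketch of removing rows with `.dropna`, using a stand-in frame named `demo` (a hypothetical name, with values taken from the grocery data above). The `how`, `subset`, and `thresh` arguments control how strict the removal is.

```python
import pandas as pd

# Stand-in frame (hypothetical name) with three of the grocery columns
demo = pd.DataFrame(
    {
        "calories": [10, None, 95, 105, None],
        "price": ["$2.99", "$2.25", "$1.99", None, None],
        "discount": [None, None, None, None, None],
    },
    index=["crackers", "club soda", "apple", "banana", "spam"],
)

# Default: drop any row with at least one missing value.
# Every row is missing a discount, so nothing survives.
strict = demo.dropna()

# Drop only the rows where every value is missing
rows_any_data = demo.dropna(how="all")

# Drop rows missing a value in specific columns only
priced = demo.dropna(subset=["price"])

# Keep rows that have at least 2 non-null values
at_least_two = demo.dropna(thresh=2)

print(len(strict), rows_any_data.index.tolist())
print(priced.index.tolist(), at_least_two.index.tolist())
```

Each call returns a new DataFrame and leaves `demo` unchanged.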

In [13]:
# Example of filling null values
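
A minimal sketch of filling missing values with `.fillna`, again on a stand-in frame named `demo` (hypothetical). Whether zero, the column mean, or some other value is the right fill depends on what a missing value means in context; the fill values below are assumptions made for illustration.

```python
import pandas as pd

# Stand-in frame (hypothetical name) with two of the grocery columns
demo = pd.DataFrame(
    {
        "calories": [10, None, 95, 105, None],
        "fat": ["1.1g", None, None, "0.4g", None],
    },
    index=["crackers", "club soda", "apple", "banana", "spam"],
)

# Fill one column with a constant (assumes missing calories means zero)
calories_zero = demo["calories"].fillna(0)

# Fill with the column mean instead; .mean skips NaN by default
calories_mean = demo["calories"].fillna(demo["calories"].mean())

# Fill several columns at once by passing a dict of column -> fill value
filled = demo.fillna({"calories": 0, "fat": "0g"})

print(calories_zero.tolist())
print(calories_mean.tolist())
print(filled.isna().sum().sum())
```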

- .isna
- .fillna
- .map for transforming values and/or normalizing values
- str.split() on a single column containing multiple columns of data (maybe move this to tidy data?)
- IQR range rule
- 3 sigma rule
- working with dates/datetime
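
As a possible preview of normalizing values, the sketch below converts strings such as "125mg" into numbers. It uses the `.str` accessor and `pd.to_numeric` rather than `.map`, and it assumes every non-missing value ends in the same "mg" unit. The names `demo` and `sodium_mg` are hypothetical.

```python
import pandas as pd

# Stand-in frame (hypothetical name) mirroring the sodium column above
demo = pd.DataFrame(
    {"sodium": ["125mg", "75mg", None, "1mg", None]},
    index=["crackers", "club soda", "apple", "banana", "spam"],
)

# Strip the unit as literal text, then convert what remains to numbers.
# Missing values stay missing and come out as NaN.
demo["sodium_mg"] = pd.to_numeric(
    demo["sodium"].str.replace("mg", "", regex=False)
)

print(demo["sodium_mg"])
```

Keeping the unit in the new column name (`sodium_mg`) preserves the information that was removed from the values.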

In [14]:
# Preview our data
df = pd.read_csv("penguins.csv")
df.head()

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
