# Intro to Pandas
by Ryan Orsinger

## Module 3: DataFrames Continued

### Pandas DataFrames Continued - Filling Missing Values
- Filling missing values
- Using `.loc` with DataFrames (similar to `.loc` on Series, but two-dimensional, with rows and columns)

In [None]:
import pandas as pd

In [1]:
# Let's generate some data with missing values. 
# Real world data often has missing values
df = pd.DataFrame([
    {
        "item": "crackers",
        "serving_size": "4 crackers",
        "calories": 10,
        "fat": "1.1g",
        "sodium": "125mg",
        "price": 2.99,
        "discount": None
    },
    {
        "item": "club soda",
        "serving_size": "8 oz",
        "calories": None,
        "fat": None,
        "sodium": "75mg",
        "price": 2.25,
        "discount": None

    },
    {
        "item": "apple",
        "serving_size": 2,
        "calories": 95,
        "fat": None,
        "sodium": None,
        "price": 1.99,
        "discount": None
    },
    {
        "item": "banana",
        "serving_size": 3,
        "calories": 105,
        "fat": "0.4g",
        "sodium": "1mg",
        "price": None,
        "discount": None
    },
    {
        "item": "spam",
        "serving_size": "1 tin",
        "calories": None,
        "fat": None,
        "sodium": None,
        "price": None,
        "discount": None
    }
])

# Set the index to be the item name
df.set_index("item", inplace=True)
df


In [None]:
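# Example of filling null values with a reasonable value
# Apples and club soda have essentially no fat, so 0 is a reasonable fill for them.
# Note that fillna(0) fills every missing value in the column, so spam's missing fat becomes 0 too.
df.fat = df.fat.fillna(0)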
df
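
Because `fillna` on a whole column fills every missing value, it can help to restrict the fill to the rows where the replacement is known to be right. The sketch below is only an illustration: it rebuilds the `fat` column as a standalone Series, and the choice of `"0g"` (to match the column's existing string format) and the `fat_free` list are assumptions, not part of the lesson's data cleaning.

```python
import pandas as pd

# Stand-in for the fat column above (values copied from the example data)
fat = pd.Series(
    ["1.1g", None, None, "0.4g", None],
    index=["crackers", "club soda", "apple", "banana", "spam"],
    name="fat",
)

# fillna on the whole column replaces every missing value, spam included
filled_all = fat.fillna("0g")

# To fill only specific items, select those labels first (hypothetical list)
fat_free = ["club soda", "apple"]
filled_some = fat.copy()
filled_some.loc[fat_free] = filled_some.loc[fat_free].fillna("0g")

print(filled_all)
print(filled_some)
```

In the second result, spam's fat stays missing, which keeps the gap visible until a real value is available.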

In [None]:
# Example of .loc's row indexing and column indexing
# .loc[row_start:row_end, column_start:column_end]
# [:, :] returns all rows and all columns
df.loc[:, :].head()

In [None]:
# Notice how we're getting the range of rows from club soda through banana
# Label slices with .loc include both endpoints
df.loc["club soda":"banana"]
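
Label slices with `.loc` include the end label, while positional slices with `.iloc` exclude the stop position, as ordinary Python slices do. A minimal sketch of the difference, using a hypothetical one-column miniature of the DataFrame above (the name `small` is an assumption):

```python
import pandas as pd

# Hypothetical miniature of the DataFrame above, indexed by item
small = pd.DataFrame(
    {"calories": [10, None, 95, 105, None]},
    index=["crackers", "club soda", "apple", "banana", "spam"],
)

# Label slice: both endpoints are included
by_label = small.loc["club soda":"banana"]

# Positional slice: the stop position is excluded, like Python lists
by_position = small.iloc[1:3]

print(list(by_label.index))     # ['club soda', 'apple', 'banana']
print(list(by_position.index))  # ['club soda', 'apple']
```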

In [None]:
# .loc also accepts a boolean mask: only rows where the condition is True are returned
df.loc[df.index == "apple"]

In [None]:
# The mask can come from a column: rows where serving_size equals 3
df.loc[df.serving_size == 3,]

In [None]:
# Combine a boolean row mask with a column label slice (serving_size through fat, inclusive)
df.loc[df.index == "apple", "serving_size":"fat"]
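
It can help to look at the mask on its own before passing it to `.loc`. The sketch below uses a hypothetical miniature of the DataFrame (the names `small`, `mask`, and `selected` are assumptions) to show that the comparison produces one True/False value per row, and that `.loc` keeps only the True rows.

```python
import pandas as pd

# Hypothetical miniature of the DataFrame above, indexed by item
small = pd.DataFrame(
    {"serving_size": ["4 crackers", "8 oz", 2, 3, "1 tin"]},
    index=["crackers", "club soda", "apple", "banana", "spam"],
)

# The comparison yields a boolean Series aligned with the index
mask = small["serving_size"] == 3
print(mask.tolist())  # [False, False, False, True, False]

# .loc keeps only the rows where the mask is True
selected = small.loc[mask]
print(list(selected.index))  # ['banana']
```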

In [None]:
# All rows, show only calories as the column
df.loc[:, "calories"]

In [None]:
# Notice how : for rows returns all rows
# Show all the columns from calories through price (inclusive)
df.loc[:, "calories":"price"]

In [None]:
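# Some pandas assignments can raise a SettingWithCopyWarning, typically when
# assigning through chained indexing such as df[mask]["calories"] = 0.
# The warning exists because it is hard to predict whether such an assignment
# changes the original DataFrame or a temporary copy; see the pandas docs linked below.
# Assigning through a single .loc call with a row mask and a column label, as here,
# is the form the pandas docs recommend.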
df.loc[df.calories.isna(), "calories"] = 0
df
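
The view-versus-copy distinction is easier to see with an explicit copy. In this sketch (the names `df_demo` and `subset` are assumptions, and the data is a cut-down version of the example), changes to a `.copy()` never reach the original, while a single `.loc` assignment writes to the original directly.

```python
import pandas as pd

# Hypothetical cut-down version of the example data
df_demo = pd.DataFrame(
    {"calories": [10, None, 95]},
    index=["crackers", "club soda", "apple"],
)

# An explicit copy is independent: changes to it never reach df_demo
subset = df_demo[df_demo["calories"].isna()].copy()
subset["calories"] = 0
print(df_demo["calories"].isna().sum())  # 1: the original is unchanged

# A single .loc call with a row mask and a column label writes to df_demo itself
df_demo.loc[df_demo["calories"].isna(), "calories"] = 0
print(df_demo["calories"].isna().sum())  # 0
```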

In [None]:
# An average price might be reasonable here, since we don't have other information
df.loc[df.price.isna(), "price"] = df.price.mean()
df
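
The same fill can be written with `fillna`, which may read more directly when a whole column gets one replacement value. This is a sketch on a standalone copy of the `price` column rather than a change to the cell above; note that `.mean()` skips missing values by default, so the average is taken over the three known prices only.

```python
import pandas as pd

# Stand-in for the price column above (values copied from the example data)
price = pd.Series(
    [2.99, 2.25, 1.99, None, None],
    index=["crackers", "club soda", "apple", "banana", "spam"],
    name="price",
)

# Missing values are skipped: (2.99 + 2.25 + 1.99) / 3
mean_price = price.mean()

# Replace every missing price with that average
filled = price.fillna(mean_price)

print(round(mean_price, 2))  # 2.41
print(filled)
```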

## Additional Resources
- [Returning-a-view-versus-a-copy in the pandas docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy)
- [pandas .loc documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)
- [pandas .iloc documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html)

## Exercises
- Run the cells above to remove or fill most of the missing values from the `df` variable.
- Fill the missing sodium values with a logical choice.
- Use `pd.read_csv` to read `"penguins.csv"` into a dataframe variable named `penguins`.
- Fill the missing values of the `bill_length_mm`, `bill_depth_mm`, and `body_mass_g` columns with their respective average values.
- Run `.value_counts` on the `sex` column.
- Fill the missing values in the `sex` column with the `mode`.
- Run `.value_counts` on the `sex` column again, after filling the missing values.
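
As a starting point for the mean and mode fills, here is a minimal sketch on hypothetical toy data. The `toy` DataFrame and its `mass_g` column are made up for illustration and are not the penguins dataset. One detail worth knowing: `.mode()` returns a Series (there can be ties), so the sketch takes its first entry.

```python
import pandas as pd

# Hypothetical toy data, not the penguins dataset
toy = pd.DataFrame({
    "sex": ["female", "male", "male", None, "male", None],
    "mass_g": [3500.0, None, 4100.0, 3900.0, None, 3700.0],
})

# Numeric column: fill missing values with the column average
toy["mass_g"] = toy["mass_g"].fillna(toy["mass_g"].mean())

# Categorical column: fill missing values with the most common value
most_common = toy["sex"].mode()[0]
toy["sex"] = toy["sex"].fillna(most_common)

print(toy["sex"].value_counts())
```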