# Intro to Pandas
by Ryan Orsinger

## Module 3: DataFrames Continued

### Pandas DataFrames Continued - Filling Missing Values
- Filling missing values
- Using `.fillna`
- Using `.loc` with DataFrames (similar to `.loc` on Series, but two-dimensional, with rows and columns)

### Handling Missing Values is a Case in Creative Problem Solving
- There's no single right answer for all cases. 
- "It depends" is a common answer in data science. Context matters.

- Sometimes a missing value means zero in context, so we can fill in zero.
- Sometimes analysts drop entire rows or columns that have too many missing values.
- Other times, missing values are filled with a reasonable estimate, such as the mean, the median, the mode, or another likely value.
- Filling too many missing values can skew the original data.
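
The strategies above can be sketched on a small table. This is a minimal illustration, not a recommendation for any particular dataset: the `toy` DataFrame, its column names, and the threshold of 2 are all made up for the example.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the snack table used in this module
toy = pd.DataFrame(
    {
        "units_sold": [10.0, None, 30.0, None],
        "rating": [4.0, 5.0, None, None],
        "notes": [None, "ok", None, None],
    }
)

# Strategy 1: a missing count may simply mean zero
zero_filled = toy["units_sold"].fillna(0)

# Strategy 2: fill a numeric column with a typical value, here the median
median_filled = toy["rating"].fillna(toy["rating"].median())

# Strategy 3: keep only the rows that have at least 2 non-missing values
rows_kept = toy.dropna(thresh=2)

# Strategy 4: keep only the columns that have at least 2 non-missing values
cols_kept = toy.dropna(axis="columns", thresh=2)
```

Which of these is appropriate depends on what a missing value means in the data at hand.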

In [None]:
import pandas as pd

In [None]:
# Let's generate some data with missing values. 
# Real world data often has missing values
df = pd.DataFrame([
    {
        "item": "crackers",
        "serving_size": "4 crackers",
        "calories": 10,
        "fat": "1.1g",
        "sodium": "125mg",
        "price": 2.99,
    },
    {
        "item": "club soda",
        "serving_size": "8 oz",
        "calories": None,
        "fat": None,
        "sodium": "75mg",
        "price": 2.25,

    },
    {
        "item": "apple",
        "serving_size": 2,
        "calories": 95,
        "fat": None,
        "sodium": None,
        "price": 1.99,
    },
    {
        "item": "banana",
        "serving_size": 3,
        "calories": 105,
        "fat": "0.4g",
        "sodium": "1mg",
        "price": None,
    },
    {
        "item": "spam",
        "serving_size": "1 tin",
        "calories": None,
        "fat": None,
        "sodium": None,
        "price": None,
    }
])

# Set the index to be the item name
df.set_index("item", inplace=True)
df

In [None]:
# Example of filling null values with a reasonable value
# Apples and club soda don't have fat, so these missing values can be 0
# Note: this also sets spam's missing fat to 0 for now; we correct it further down
df.fat = df.fat.fillna(0)
df

### An Aside About Pandas Warnings
- Pandas warnings are not errors: the code still runs, and a warning is a notice that does not halt execution.
- Depending on your version of pandas, the above code might produce the following warning:
```
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
```
- Since this may affect some users, we'll move on to working with `.loc`.
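
As a rough sketch of what the warning's suggestion means, the example below assigns through a single `.loc` call instead of chaining two indexing steps. The `toy` DataFrame and its columns are hypothetical, and the chained form is left commented out because its behavior depends on the pandas version.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({"kind": ["a", "b", "a"], "score": [1.0, None, None]})

# Chained indexing selects first and assigns second. Depending on the pandas
# version it may emit a warning or leave `toy` unchanged, so it is not run here:
# toy[toy.kind == "a"]["score"] = 0

# A single .loc call names the rows and the column in one step,
# so the assignment is applied to `toy` itself
needs_fill = (toy.kind == "a") & toy.score.isna()
toy.loc[needs_fill, "score"] = 0
```

After this runs, only the missing score in a `kind == "a"` row has been set to 0; the other missing score is untouched.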


In [None]:
# Example of .loc's row indexer and column indexer: df.loc[row_indexer, column_indexer]
# Each indexer can be a label slice: [start_row:end_row, start_column:end_column]
# [:, :] returns all rows and all columns
df.loc[:, :]

In [None]:
# Notice how we're getting the range of rows from club soda through banana (inclusive)
# Equivalent to df.loc["club soda":"banana", :]
df.loc["club soda":"banana"]

In [None]:
# .loc also accepts a boolean mask: keep only the rows whose index label is "apple"
df.loc[df.index == "apple"]

In [None]:
# The boolean mask can come from a column: keep the rows where serving_size equals 3
df.loc[df.serving_size == 3]

In [None]:
# Combine a boolean mask for the rows with a label slice for the columns
df.loc[df.index == "apple", "serving_size":"fat"]

In [None]:
# All rows, show only calories as the column
df.loc[:, "calories"]

In [None]:
# Notice how : for rows returns all rows
# Show all the columns from calories through price (inclusive)
df.loc[:, "calories":"price"]
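
One detail worth seeing side by side: label slices with `.loc` include the end label, while position slices with `.iloc` exclude the end position. The sketch below uses a hypothetical `toy` DataFrame with made-up labels to select the same block both ways.

```python
import pandas as pd

# Hypothetical toy data with string row labels
toy = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [10, 20, 30, 40], "z": [100, 200, 300, 400]},
    index=["a", "b", "c", "d"],
)

# .loc slices by label and includes both endpoints: rows b and c, columns x and y
by_label = toy.loc["b":"c", "x":"y"]

# .iloc slices by position and excludes the end: rows 1 and 2, columns 0 and 1
by_position = toy.iloc[1:3, 0:2]

# Both selections contain the same rows and columns
same_block = by_label.equals(by_position)
```

Here `same_block` is `True`, even though the two slices are written with different endpoints.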

In [None]:
# Some pandas operations may raise a SettingWithCopyWarning
# Recommend reading the documentation carefully
# Pandas developers designed this warning because the effects can be difficult to predict
# The earlier fillna assignment still ran, even if a warning can feel disruptive
df.loc[df.calories.isna(), "calories"] = 0
df

In [None]:
# An average price might be reasonable here, since we don't have other information
df.loc[df.price.isna(), "price"] = df.price.mean()
df

In [None]:
# Actual Spam information
spam_calories = 1080
spam_fat = "96g"
spam_sodium = "4740mg" 
spam_price = 3.25

df.loc[df.index == "spam", "calories":"price"] = [spam_calories, spam_fat, spam_sodium, spam_price]
df

In [None]:
# Let's say we received some new information about discounts
# The business manager says that we'll use discounts in the future and the existing values should be 0.
# We'll need to create the column and assign it zero
df["discount"] = 0

In [None]:
df

## Additional Resources
- Using [.fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html)
- [Returning-a-view-versus-a-copy in the pandas docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy)
- [pandas .loc documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)
- [pandas .iloc documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html)

## Exercises
- Run the cells above to remove or fill most of the missing values from the `df` variable.
- Fill the missing sodium value with a logical choice.
- Use `pd.read_csv` to read `"penguins.csv"` into a dataframe variable named `penguins`
- Fill the missing values of the `bill_length_mm` with its average
- Fill in the missing values for `bill_depth_mm` with its average
- Fill in the missing values for `body_mass_g` with its average
- Run `.value_counts` on the `sex` column
- Fill the missing values in the `sex` column with the `mode` (Follow .mode() with [0] to access the string value)
- Run `.value_counts` on the `sex` column again, after filling the missing values
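
The hint about following `.mode()` with `[0]` is easier to see on a small example. This sketch uses a hypothetical `colors` Series rather than the penguins data: `.mode()` returns a Series because several values can tie for most common, so `[0]` picks the first one.

```python
import pandas as pd

# Hypothetical toy data with two missing entries
colors = pd.Series(["red", "blue", "red", None, "blue", "red", None])

# .mode() ignores missing values by default and returns a Series of the most common value(s)
modes = colors.mode()

# [0] pulls the first (here, the only) mode out as a plain string
most_common = modes[0]

# Fill the missing entries with that value, then count again
filled = colors.fillna(most_common)
counts = filled.value_counts()
```

Comparing `colors.value_counts()` with `counts` shows how filling with the mode inflates the most common category.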

In [None]:
# Fill the missing sodium value with a logical choice.


In [None]:
# df.loc[row_indexer, column_indexer] = value
df.loc[df.sodium.isna(), "sodium"] = 0

In [None]:
# Use `pd.read_csv` to read `"penguins.csv"` into a dataframe variable named `penguins`


In [None]:
# Fill the missing values of the `bill_length_mm` with its average


In [None]:
# Fill in the missing values for `bill_depth_mm` with its average


In [None]:
# Fill in the missing values for `body_mass_g` with its average


In [None]:
# Run `.value_counts` on the `sex` column


In [None]:
# Fill the missing values in the `sex` column with the `mode` (Follow .mode() with [0] to access the string value)


In [None]:
# Run `.value_counts` on the `sex` column again, after filling the missing values
