# Dealing with Missing Data

This notebook reviews ways to handle missing data in a pandas DataFrame.

Many machine learning models raise errors when given missing values, and many plotting and summary functions silently skip them, which can distort results.

To avoid this, missing values can be dropped or imputed (replaced with an estimated value).

The goal is usually to retain as much data as possible.

Overview:
* Identify missing data
* Drop or impute missing values
* Check data manipulations

See the [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) for more information. 

In [None]:
# import packages
import pandas as pd

In [None]:
# read in data
df = pd.read_csv('../Data/anime.csv')

In [None]:
# Check shape of data
## Should be 6668 rows and 33 columns
df.shape

In [None]:
# Summarize info with df.info()
# Notice the count of non-null values
df.info()

# Check Missing Data

In [None]:
# pd.isna() returns a DataFrame of booleans: True where a value is missing (NaN/None), False otherwise
pd.isna(df)

In [None]:
# Summing the result of isna() on one column counts the missing values in that column
pd.isna(df['rank']).sum()

In [None]:
# Summing isna() over the whole DataFrame gives a count of missing values per column
pd.isna(df).sum()
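
Raw counts are easier to compare when expressed as a share of all rows. The following is a minimal sketch on a small made-up DataFrame; `toy` and its values are hypothetical stand-ins, not the anime data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the notebook's df
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rank": [1.0, np.nan, 3.0, np.nan],
    "genre": ["Action", None, "Comedy", "Action"],
})

# The mean of a boolean mask is the fraction of True values,
# so isna().mean() gives the fraction of missing values per column
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns with the most missing data first, which helps when deciding what to drop or impute.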

## Plot missing values

In [None]:
# Count the null values in each column and plot the counts as a bar chart
df.isnull().sum().plot(kind='bar')

## Drop all rows with a NaN value

In [None]:
# Drop NaNs and return new dataframe
df2 = df.dropna()

In [None]:
# Check how much data was lost
## Notice we're down to 282 rows; that is a LOT of dropped data
df2.shape
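
When dropping every row with any NaN loses too much data, `dropna()` can be restricted with its `subset` and `thresh` parameters. This is a minimal sketch on made-up data; the `toy` DataFrame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in several columns
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rank": [1.0, np.nan, 3.0, np.nan],
    "genre": ["Action", "Drama", "Comedy", None],
    "votes": [10.0, 20.0, np.nan, np.nan],
})

# Drop a row only when 'rank' is missing
by_subset = toy.dropna(subset=["rank"])

# Keep rows that have at least 3 non-missing values
by_thresh = toy.dropna(thresh=3)

# Compare how many rows each approach keeps
print(len(toy.dropna()), len(by_subset), len(by_thresh))
```

Which option fits depends on which columns the later analysis actually needs.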

## Drop all columns with a NaN value

In [None]:
# Drop cols with NaNs and return new df
df3 = df.dropna(axis=1)

In [None]:
# Check how much data was lost
## Notice we dropped columns this time, so we still have the original number of rows
df3.shape
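
Dropping every column that contains a NaN can discard columns that are nearly complete. One possible alternative, sketched here on made-up data, is to drop only columns that are mostly empty by passing `thresh` together with `axis=1`. The 50% cutoff and the `toy` DataFrame are assumptions chosen for illustration.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: 'rank' is nearly complete, 'notes' is mostly empty
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rank": [1.0, np.nan, 3.0, 4.0],
    "notes": [None, None, None, "x"],
})

# Require at least half of the rows to be non-missing (assumed cutoff)
min_non_null = int(0.5 * len(toy))

# Keep only the columns that meet the threshold
mostly_complete = toy.dropna(axis=1, thresh=min_non_null)
print(mostly_complete.columns.tolist())
```

Here `rank` survives despite one missing value, whereas `toy.dropna(axis=1)` would have removed it.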

## Fill NaNs with a value

In [None]:
# Let's focus on df['rank'] and replace the NaN values in that column
df['rank'].describe()

In [None]:
# Create a new column with NaNs filled by mean value
df['rank_filled'] = df['rank'].fillna(df['rank'].mean())

In [None]:
# Check new column; notice the count has changed from above
df['rank_filled'].describe()

In [None]:
# Create a new column with NaNs filled with a 0
df['rank_filled_0'] = df['rank'].fillna(0)

In [None]:
# Check new column; notice the count has changed from above
df['rank_filled_0'].describe()
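
The choice of fill value changes the column's statistics. The sketch below compares mean, median, and zero fills on a small made-up Series with one large outlier. The values are hypothetical, and the median fill is an extra option shown for comparison, not something the cells above use.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry and one large outlier
rank = pd.Series([1.0, 2.0, 3.0, np.nan, 100.0], name="rank")

filled_mean = rank.fillna(rank.mean())      # mean() ignores NaN by default
filled_median = rank.fillna(rank.median())  # median is less affected by the outlier
filled_zero = rank.fillna(0)                # 0 is treated as a real observation

# Compare the fill value used and the resulting column mean
summary = pd.DataFrame({
    "fill_value": [filled_mean[3], filled_median[3], filled_zero[3]],
    "column_mean": [filled_mean.mean(), filled_median.mean(), filled_zero.mean()],
}, index=["mean", "median", "zero"])
print(summary)
```

Filling with the mean leaves the column mean unchanged, while filling with 0 pulls it down, which may matter for later analysis.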

In [None]:
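# Fill missing genre values with the most common genre
# (value_counts() sorts by frequency, so index[0] is the most common value)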
df['genre_filled'] = df['genre'].fillna(df['genre'].value_counts().index[0])

In [None]:
# Check new column; the count now includes the filled rows
df['genre_filled'].describe()
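
An equivalent way to find the most common category is `Series.mode()`. The sketch below also shows filling with an explicit placeholder label, which keeps missing rows distinguishable. The data and the `"Unknown"` label are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column with two missing entries
genre = pd.Series(["Action", "Comedy", None, "Action", None], name="genre")

# mode() returns a Series because ties are possible; take the first entry
most_common = genre.mode()[0]
genre_filled = genre.fillna(most_common)

# Alternative: mark missing values with an explicit placeholder label
genre_unknown = genre.fillna("Unknown")

print(genre_filled.value_counts())
print(genre_unknown.value_counts())
```

Filling with the mode inflates the most common category, so a placeholder may be preferable when many values are missing.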

# Final Tips and Tricks

Remember, when cleaning missing data, you usually want to retain as much data as possible.

For continuous values, it may make sense to fill NaNs with the mean value.

For categorical values, you can consider filling NaNs with the most common category.
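
These two tips can be combined into one helper that also checks the result. This is a minimal sketch under stated assumptions: `simple_impute` is a hypothetical function name, the `toy` data is made up, and mean or mode filling may not suit every column.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def simple_impute(frame):
    """Hypothetical helper: mean-fill numeric columns, mode-fill the rest."""
    out = frame.copy()  # leave the original DataFrame untouched
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            modes = out[col].mode()
            if not modes.empty:  # an all-NaN column has no mode
                out[col] = out[col].fillna(modes.iloc[0])
    return out


# Hypothetical toy data with one numeric and one categorical gap
toy = pd.DataFrame({
    "rank": [1.0, np.nan, 3.0],
    "genre": ["Action", None, "Action"],
})
cleaned = simple_impute(toy)

# Check the manipulation: no missing values should remain
print(cleaned.isna().sum())
```

Working on a copy keeps the original data available for comparison, in the same spirit as the separate `_filled` columns used above.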