# Dealing with NaNs

When dealing with missing data, it is important to find a solution appropriate to the specific dataset and task. 

Ask yourself questions like: Do I have enough data to simply remove rows containing NaNs, or would that discard valuable information? Do the NaNs themselves carry information? Could I replace the NaNs with a value of my choice, or with an average?

Run the code below to generate a dataset with NaNs.

In [None]:
import numpy as np
import pandas as pd

data = {
    "x1": [np.nan, 0.75, 0.88, 0.73, np.nan, 0.77, 0.7, 0.8, 0.58, np.nan],
    "x2": [0.87, 0.77, 0.65, np.nan, 0.54, 0.1, 0.68, 0.20, 0.54, np.nan],
    "y": [1, 1, 2, 2, 1, 2, 1, 1, 2, 2],
}

df = pd.DataFrame(data)

df

Find a simple way to return a boolean indicating whether this dataframe contains any NaN values. [This](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html) should help.
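One possible approach is sketched below. It rebuilds the same dataframe so that the snippet runs on its own, and `has_nan` is a name chosen here for illustration. The idea is to build the boolean NaN mask with `isnull()` and reduce it to a single value.

```python
import numpy as np
import pandas as pd

# Same data as in the cell above, repeated so this snippet is self-contained.
df = pd.DataFrame({
    "x1": [np.nan, 0.75, 0.88, 0.73, np.nan, 0.77, 0.7, 0.8, 0.58, np.nan],
    "x2": [0.87, 0.77, 0.65, np.nan, 0.54, 0.1, 0.68, 0.20, 0.54, np.nan],
    "y": [1, 1, 2, 2, 1, 2, 1, 1, 2, 2],
})

# isnull() gives a boolean mask of the same shape as df;
# .values.any() collapses it to one value, and bool() makes it a plain Python bool.
has_nan = bool(df.isnull().values.any())
print(has_nan)
```

`df.isnull().any().any()` should give the same answer; the version above just avoids the two-step reduction.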

Compute the NaN value count for each column.
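A minimal sketch of one way to do this, again rebuilding the dataframe so the snippet stands alone (`nan_per_column` is an illustrative name). Summing the boolean mask column by column counts the `True` entries, which are the NaNs.

```python
import numpy as np
import pandas as pd

# Same data as in the cell above, repeated so this snippet is self-contained.
df = pd.DataFrame({
    "x1": [np.nan, 0.75, 0.88, 0.73, np.nan, 0.77, 0.7, 0.8, 0.58, np.nan],
    "x2": [0.87, 0.77, 0.65, np.nan, 0.54, 0.1, 0.68, 0.20, 0.54, np.nan],
    "y": [1, 1, 2, 2, 1, 2, 1, 1, 2, 2],
})

# True counts as 1 when summed, so this yields the number of NaNs per column.
nan_per_column = df.isnull().sum()
print(nan_per_column)
```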

Compute how many rows contain NaN values.
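One way to sketch this, under the same assumption that the dataframe is rebuilt locally (`rows_with_nan` is an illustrative name): flag each row that has at least one NaN with `any(axis=1)`, then count the flagged rows.

```python
import numpy as np
import pandas as pd

# Same data as in the cell above, repeated so this snippet is self-contained.
df = pd.DataFrame({
    "x1": [np.nan, 0.75, 0.88, 0.73, np.nan, 0.77, 0.7, 0.8, 0.58, np.nan],
    "x2": [0.87, 0.77, 0.65, np.nan, 0.54, 0.1, 0.68, 0.20, 0.54, np.nan],
    "y": [1, 1, 2, 2, 1, 2, 1, 1, 2, 2],
})

# any(axis=1) is True for a row if any of its columns is NaN.
rows_with_nan = int(df.isnull().any(axis=1).sum())
print(rows_with_nan)
```

Note that this differs from the total number of NaNs: a row with NaNs in both x1 and x2 is counted once.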

### Dropping rows

If your data permits it (enough data, balanced classes, few NaNs), one option is to remove the rows containing NaNs. Compute the class counts you would be left with if you did so.
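A possible sketch, assuming the class label is the y column and rebuilding the dataframe so the snippet runs on its own (`class_counts_after_drop` is an illustrative name). `dropna()` returns a copy without the rows that contain NaNs, and `value_counts()` tallies the remaining labels.

```python
import numpy as np
import pandas as pd

# Same data as in the cell above, repeated so this snippet is self-contained.
df = pd.DataFrame({
    "x1": [np.nan, 0.75, 0.88, 0.73, np.nan, 0.77, 0.7, 0.8, 0.58, np.nan],
    "x2": [0.87, 0.77, 0.65, np.nan, 0.54, 0.1, 0.68, 0.20, 0.54, np.nan],
    "y": [1, 1, 2, 2, 1, 2, 1, 1, 2, 2],
})

# Drop every row with at least one NaN (df itself is left unchanged),
# then count how many rows remain in each class.
class_counts_after_drop = df.dropna()["y"].value_counts().sort_index()
print(class_counts_after_drop)
```

Comparing this with `df["y"].value_counts()` shows whether dropping rows disturbs the class balance.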

### Imputing the mean or median

A common imputation method when dealing with NaNs is to replace them with the column mean. If the data contains many outliers, the column median is a more robust alternative.
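To illustrate why the median can be more robust, here is a small sketch on a made-up series (the values and the name `s` are hypothetical and not part of the exercise data). A single outlier pulls the mean far from the typical values, while the median barely moves.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: three typical values, one outlier, one missing value.
s = pd.Series([1.0, 2.0, 3.0, 100.0, np.nan])

# Both statistics skip NaNs by default.
mean_value = s.mean()      # dominated by the outlier
median_value = s.median()  # stays close to the typical values

filled_with_mean = s.fillna(mean_value)
filled_with_median = s.fillna(median_value)
print(mean_value, median_value)
```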

Create two new dataframes by filling with means and medians. [This](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) should help.

In [None]:
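# One possible solution: fill each column's NaNs with that column's mean
# (y has no NaNs, so it is left unchanged).
df_mean_imputed = df.fillna(df.mean())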

In [None]:
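# One possible solution: fill each column's NaNs with that column's median
# (y has no NaNs, so it is left unchanged).
df_median_imputed = df.fillna(df.median())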

### Imputing a value

At times, your domain knowledge may suggest that NaNs can be replaced with a specific value. Replace missing x1 values with 1 and missing x2 values with 0.
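A minimal sketch of one way to do this, rebuilding the dataframe so the snippet stands alone (`df_value_imputed` is an illustrative name). `fillna` accepts a dictionary that maps column names to fill values, so each column can get its own replacement.

```python
import numpy as np
import pandas as pd

# Same data as in the first cell, repeated so this snippet is self-contained.
df = pd.DataFrame({
    "x1": [np.nan, 0.75, 0.88, 0.73, np.nan, 0.77, 0.7, 0.8, 0.58, np.nan],
    "x2": [0.87, 0.77, 0.65, np.nan, 0.54, 0.1, 0.68, 0.20, 0.54, np.nan],
    "y": [1, 1, 2, 2, 1, 2, 1, 1, 2, 2],
})

# Per-column fill values: 1 for x1, 0 for x2. Returns a new dataframe.
df_value_imputed = df.fillna({"x1": 1, "x2": 0})
print(df_value_imputed)
```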