# `pandas` - Missing Data

__Contents__: There are essentially three ways of dealing with missing data:
1. Drop the columns with missing data
1. Drop the rows with missing data
1. Replace the missing data with actual values
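
As a quick preview, the three approaches listed above can be sketched on a small made-up DataFrame. The name `toy_df` and its values are hypothetical and are not taken from the `imports-85.csv` file; this is only a minimal illustration, assuming the median is an acceptable replacement value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'price' and 'stroke' each contain one missing value.
toy_df = pd.DataFrame({
    'make': ['audi', 'bmw', 'dodge', 'honda'],
    'price': [13950.0, np.nan, 6377.0, 7295.0],
    'stroke': [3.4, 2.8, np.nan, 3.58],
})

# 1. Drop every column that contains at least one missing value.
dropped_cols = toy_df.dropna(axis=1)

# 2. Drop every row that contains at least one missing value.
dropped_rows = toy_df.dropna(axis=0)

# 3. Replace missing values, here with each column's median.
filled = toy_df.fillna({'price': toy_df['price'].median(),
                        'stroke': toy_df['stroke'].median()})

print(dropped_cols.shape, dropped_rows.shape, filled.isnull().sum().sum())
```

Each call returns a new DataFrame and leaves `toy_df` unchanged.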

Load the libraries.

In [4]:
import pandas  as pd
import numpy   as np
(pd.__version__,
 np.__version__
)

Load the DataFrame from the `imports-85.csv` CSV file. Set the column names (replacing hyphens with underscores) and treat `?` as a missing value.

In [6]:
column_names = ['symboling', 'normalized_losses', 'make', 'fuel-type',
                'aspiration', 'num_of_doors', 'body_style', 'drive_wheels',
                'engine_location', 'wheel_base', 'length', 'width',
                'height', 'curb_weight', 'engine_type', 'num_of_cylinders',
                'engine_size', 'fuel_system', 'bore', 'stroke',
                'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg',
                'highway_mpg', 'price']
import_df = pd.read_csv('/dbfs/mnt/datalab-datasets/file-samples/imports-85.csv',
                        names=[string.replace('-','_') for string in column_names],
                        na_values=['?']
                       )

In [7]:
import_df.isnull().sum().sort_values(ascending=False)[lambda x: x > 0]
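
The chained expression in the cell above can be unpacked one step at a time. The sketch below does so on a small hypothetical DataFrame (`toy_df` and its values are made up for illustration, not read from the CSV file).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the imports-85 file).
toy_df = pd.DataFrame({
    'make': ['audi', 'bmw', 'dodge', 'honda'],
    'price': [np.nan, 16430.0, np.nan, 7295.0],
    'stroke': [3.4, np.nan, 3.23, 3.58],
})

# isnull() marks each cell True/False; sum() counts the True values per column.
na_counts = toy_df.isnull().sum()

# Put the columns with the most missing values first.
sorted_counts = na_counts.sort_values(ascending=False)

# Indexing with a callable keeps only the entries where the condition holds.
only_missing = sorted_counts[lambda x: x > 0]
print(only_missing)
```

Here `make` has no missing values, so it does not appear in the final result.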

## Dropping Columns

- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

### `dropna` Method
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dropna.html

In [10]:
import_df.shape

In [11]:
dropna_col_df = import_df.dropna(axis=1)
dropna_col_df.shape

In [12]:
dropna_col_df.info()

Drop specific columns with the DataFrame `drop` method. Pass `axis=1` so that the labels are interpreted as column names rather than row labels.

In [14]:
import_df.drop(['normalized_losses','price','stroke','bore',
                'peak_rpm','horsepower','num_of_doors'],axis=1).info()
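
Typing the column names by hand is easy to get wrong. As one possible alternative, the list of columns to drop can be derived from `isnull` itself. The sketch below uses a hypothetical `toy_df` (made-up values) and checks that the result matches `dropna(axis=1)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the imports-85 file).
toy_df = pd.DataFrame({
    'make': ['audi', 'bmw', 'dodge', 'honda'],
    'price': [13950.0, np.nan, 6377.0, 7295.0],
    'stroke': [3.4, 2.8, np.nan, 3.58],
})

# any() is True for a column if at least one of its values is missing.
cols_with_na = toy_df.columns[toy_df.isnull().any()].tolist()

# Dropping those columns by name gives the same result as dropna(axis=1).
by_drop = toy_df.drop(cols_with_na, axis=1)
by_dropna = toy_df.dropna(axis=1)

print(cols_with_na)
print(by_drop.equals(by_dropna))
```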

## Dropping Rows

### `dropna` Method
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dropna.html

In [17]:
import_df.shape

In [18]:
import_df.dropna(axis=0).shape

### `isnull` Method

Recall that the `isnull` method returns, elementwise, whether each value of a Series is `NaN`.

In [21]:
import_df.normalized_losses.isnull().head()

In [22]:
type(import_df.normalized_losses.isnull())

We can use this boolean Series to display only those rows with a missing value in the `normalized_losses` column.

In [24]:
import_df.loc[import_df.normalized_losses.isnull(),:]

Finally, we can negate the values of this series and return only the rows that do not have missing values for the `normalized_losses` column.

In [26]:
new_df = import_df.loc[~ import_df.normalized_losses.isnull(),:]
new_df.shape

Repeat this for each of the columns with missing values.
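
Rather than filtering one column at a time, the same filter can be applied to several columns at once. The sketch below shows two equivalent ways on a hypothetical `toy_df` (made-up values): combining `notnull` masks, and the `subset` parameter of `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the imports-85 file).
toy_df = pd.DataFrame({
    'make': ['audi', 'bmw', 'dodge', 'honda'],
    'price': [13950.0, np.nan, 6377.0, 7295.0],
    'stroke': [3.4, 2.8, np.nan, 3.58],
    'num_of_doors': ['four', 'two', 'four', None],
})

# Keep the rows where both 'price' and 'stroke' are present.
mask = toy_df[['price', 'stroke']].notnull().all(axis=1)
by_mask = toy_df.loc[mask, :]

# dropna with subset only looks at the listed columns.
by_subset = toy_df.dropna(subset=['price', 'stroke'])

print(by_mask.equals(by_subset))
print(by_subset.shape, toy_df.dropna(axis=0).shape)
```

The last row is kept by `subset` even though `num_of_doors` is missing there, whereas a plain `dropna(axis=0)` would remove it.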

## Impute Values

### `fillna` Method
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.fillna.html
- Filling forward/backward is not covered here.

Notice that a missing value is in row `9` of the `price` column.

In [31]:
import_df.price.head(10)

Missing numeric values are often replaced with the median of that column.

So we find the median of the `price` column.

In [33]:
import_df.price.median()

Now fill in that value and check that row `9` no longer shows `NaN`.

In [35]:
import_df.price.fillna(value=10295.0).head(10)
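
The median does not have to be typed in as a literal. A minimal sketch, using a hypothetical `price` Series with made-up values, computes it and passes it straight to `fillna`. Note that `fillna` returns a new Series, so the result has to be assigned if it is to be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one missing value (not read from the CSV file).
price = pd.Series([13495.0, 16500.0, np.nan, 13950.0, 17450.0], name='price')

# median() skips NaN by default.
median_price = price.median()

# fillna returns a new Series; 'price' itself is left unchanged.
filled_price = price.fillna(value=median_price)

print(median_price)
print(price.isnull().sum(), filled_price.isnull().sum())
```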

Check the list of columns with missing values to find another column to impute.

In [37]:
import_df.isnull().sum().sort_values(ascending=False)[lambda x: x > 0]

In [38]:
# Call median() so that the number, not the method object, is used as the fill value.
import_df.stroke.fillna(value=import_df.stroke.median()).isnull().sum()

In [39]:
column_names = ['symboling', 'normalized_losses', 'make', 'fuel-type',
                'aspiration', 'num_of_doors', 'body_style', 'drive_wheels',
                'engine_location', 'wheel_base', 'length', 'width',
                'height', 'curb_weight', 'engine_type', 'num_of_cylinders',
                'engine_size', 'fuel_system', 'bore', 'stroke',
                'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg',
                'highway_mpg', 'price']
import_df = pd.read_csv('/dbfs/mnt/datalab-datasets/file-samples/imports-85.csv',
                        names=[string.replace('-','_') for string in column_names],
                        na_values=['?']
                       )
import_df = import_df.dropna(axis=0)

__The End__