# Handling Unusual Values

Real-world data often contains “unusual values”. These so-called unusual values can arise for a number of reasons, such as human error or faulty measuring devices.

## Libraries

In [None]:
from pandas import read_csv
from numpy import nan

## Dataset (Bling Bling $$)

For this tutorial we will use the diamonds dataset. It is a classic dataset, well suited to beginners exploring data analysis.

In [None]:
# load the dataset; make sure the path matches your working directory
data = read_csv('../input/diamonds/diamonds.csv')

#view the dataset
data

## Summary Statistics

We can use summary statistics to help identify these unusual values.

In [None]:
# summarize the dataset
data.describe()

The red arrow indicates the minimum value in each column. Note that columns x, y and z, the dimensions of the diamonds in mm, have a minimum value of 0.

We know that a diamond can’t have a width of 0 mm, so these values are unusual and must be incorrect.

![](https://www.datum.my/img/datadescribe.PNG)
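The minimum row can also be pulled out of the summary in code rather than read off the table. The following is a minimal sketch on a few made-up rows (the `toy` frame and its values are assumptions for illustration, not rows from the diamonds file):

```python
import pandas as pd

# Made-up rows standing in for a few diamonds columns (not real data)
toy = pd.DataFrame({
    "carat": [0.23, 0.31, 1.01],
    "x": [3.95, 0.0, 6.43],
    "y": [3.98, 4.35, 0.0],
    "z": [2.43, 2.75, 3.99],
})

# describe() returns a DataFrame, so the 'min' row can be selected by label
minimums = toy.describe().loc["min"]

# Columns whose minimum is exactly 0 deserve a closer look
suspect_columns = minimums[minimums == 0].index.tolist()
print(suspect_columns)  # ['x', 'y']
```

On the real dataset the same two lines can be applied to `data` instead of `toy`.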

## What's wrong with having 0?
If 0 is a real value, for example a mark of 0 in an exam, we can keep it.

But when 0 is a data error, as in the diamonds dataset, it distorts the data distribution: every erroneous 0 pulls statistics such as the mean and the minimum downward.
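To make the effect concrete, here is a small sketch with hypothetical widths (the numbers are invented for illustration). Two erroneous zeros are enough to drag the mean well below any real measurement:

```python
import pandas as pd

# Hypothetical widths in mm; the two zeros stand in for data-entry errors
widths = pd.Series([5.7, 5.8, 5.6, 5.9, 0.0, 0.0])

mean_with_zeros = widths.mean()
mean_without_zeros = widths[widths != 0].mean()

print(round(mean_with_zeros, 2))     # 3.83
print(round(mean_without_zeros, 2))  # 5.75
```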

## Handling the Unusual Value
We first locate and count these zeros.

In [None]:
# Finding and counting 0
(data.loc[:, 'carat':'z'] == 0).sum()

Note that columns x, y and z contain 8, 7 and 20 zeros respectively.

![](https://www.datum.my/img/finding0.PNG)

Now let's view the rows that contain a 0 in any of these columns.

In [None]:
# Filtering rows with column condition
data.query("x == 0 or y == 0 or z == 0")
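The same filter can also be written as a boolean mask, which is handy when the list of columns is long or built at run time. This is a sketch on made-up rows (the `toy` frame is an assumption for illustration) showing that both forms select the same rows:

```python
import pandas as pd

# Made-up rows; rows 1 and 2 each contain one zero
toy = pd.DataFrame({
    "x": [3.95, 0.0, 6.43, 4.10],
    "y": [3.98, 4.35, 0.0, 4.12],
    "z": [2.43, 2.75, 3.99, 2.50],
})

# Filter with a query string, as in the cell above
by_query = toy.query("x == 0 or y == 0 or z == 0")

# Equivalent boolean mask: True for rows where any listed column equals 0
by_mask = toy[(toy[["x", "y", "z"]] == 0).any(axis=1)]

print(by_mask.index.tolist())  # [1, 2]
```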

## Marking with NaN

In Pandas or NumPy, we can replace these zeros with NaN (missing values); some refer to this step as marking with NaN. NaN values are skipped by operations such as sum or count.

Note that we only need to focus on columns x, y and z, because these are the columns where the 0’s are located.

In [None]:
# Replace 0 with NaN
data[['x','y','z']] = data[['x','y','z']].replace(0, nan)
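To see what "skipped" means in practice, here is a minimal sketch on a made-up series (the values are invented for illustration). After marking, `count`, `sum` and `mean` only consider the non-missing entries:

```python
import numpy as np
import pandas as pd

# Made-up depths; the 0.0 plays the role of an erroneous measurement
depths = pd.Series([2.5, 0.0, 3.5])
marked = depths.replace(0, np.nan)

print(marked.count())  # 2   (NaN is not counted)
print(marked.sum())    # 6.0 (NaN is skipped)
print(marked.mean())   # 3.0 (mean of the two remaining values)
print(depths.mean())   # 2.0 (the zero drags the mean down)
```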

After replacing 0 with NaN, we can use the isnull() function to flag the NaN values as True and count the number of missing values in each column.

In [None]:
# Counting NaN
data.isnull().sum()

Running this produces output similar to counting the 0’s. Note that columns x, y and z have the same number of NaN values as the zeros identified above. Now you can have peace of mind!
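If you would rather not compare the two outputs by eye, the check can be automated. The sketch below uses a made-up frame (an assumption for illustration) and compares the zero counts before marking with the NaN counts after:

```python
import numpy as np
import pandas as pd

# Made-up rows with two zeros in x and one in y
toy = pd.DataFrame({
    "x": [3.95, 0.0, 0.0],
    "y": [3.98, 4.35, 0.0],
    "z": [2.43, 2.75, 3.99],
})

# Count zeros before marking, and NaN values after marking
zero_counts = (toy == 0).sum()
nan_counts = toy.replace(0, np.nan).isnull().sum()

# True when every column has as many NaN values as it had zeros
counts_match = bool((zero_counts == nan_counts).all())
print(counts_match)  # True
```

This comparison only holds if the column had no missing values before the zeros were marked.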

Now let's view the rows with missing values.

In [None]:
data[data.isnull().any(axis = 1)]

Now let's look at the summary of the dataset again after the cleanup. Note that:

* There are no more 0 values, and
* The counts for x, y and z are < 53940, because NaNs are ignored.

In [None]:
data.describe()

## Fill NaN with Values

[I updated this section based on the comment]

Training a model on a dataset that contains missing values can hurt the quality of your machine learning model. There are a few ways to manage this; one is to replace the missing values in each column with the mean of the non-missing values in that column (a technique known as *imputation*).

Let's replace the missing values with the mean. We can use `DataFrame.fillna` to fill the missing values across the whole dataset.

In [None]:
# Fill NaN with column means (numeric_only=True skips the text columns)
data = data.fillna(data.mean(numeric_only=True))
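To see what the fill does row by row, here is a minimal sketch on a made-up frame that mixes a text column with numeric ones, as diamonds does (the `toy` frame and its values are assumptions for illustration). It passes `numeric_only=True` so that the mean is computed over numeric columns only; recent pandas versions raise an error otherwise when text columns are present:

```python
import numpy as np
import pandas as pd

# Made-up rows: one text column and two numeric columns with gaps
toy = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good"],
    "x": [4.0, np.nan, 6.0],
    "y": [4.2, 4.4, np.nan],
})

# Per-column means, computed from the non-missing values only
column_means = toy.mean(numeric_only=True)

# fillna with a Series fills each column with the value under its label
filled = toy.fillna(column_means)

print(filled.loc[1, "x"])            # 5.0, the mean of 4.0 and 6.0
print(filled.isnull().sum().sum())   # 0, no missing values remain
```

Mean imputation keeps each column's mean unchanged but shrinks its spread, so it is a simple default rather than the only option.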

Finally, let's check again whether all the missing values have been handled.

In [None]:
data[data.isnull().any(axis = 1)]

Compare this new summary with the previous one, which contained missing values. Note that now:
* the counts for x, y and z = 53940, because the NaNs have been replaced with the mean of their respective columns.

In [None]:
data.describe()