**Handling Outliers**

https://chrisalbon.com/machine_learning/preprocessing_structured_data/handling_outliers/

**Options for handling outliers**

**DROP - Not a great option, because we lose information. First find out whether the value is a genuine extreme or an error such as a broken sensor.**

**MARK - Safest option. Every row is kept, and we can check whether the outliers had an effect.**

**RESCALE - Take the log of the values, so that outliers don't have as great an effect.**

**Preliminaries**

In [1]:
#import library
import pandas as pd

**Create Data**

In [2]:
houses = pd.DataFrame()

In [3]:
houses['price'] = [534433, 392333, 293222, 4322032]
houses['Bathrooms'] = [2, 3.5, 2, 116]
houses['Square_Feet'] = [1500, 2500, 1500, 48000]

houses

| | price | Bathrooms | Square_Feet |
|---|---|---|---|
| 0 | 534433 | 2.0 | 1500 |
| 1 | 392333 | 3.5 | 2500 |
| 2 | 293222 | 2.0 | 1500 |
| 3 | 4322032 | 116.0 | 48000 |


**Option 1 : Drop**

In [4]:
#Drop observations with 20 or more bathrooms by keeping rows where Bathrooms < 20
#This returns a filtered copy; houses itself is unchanged
houses[houses['Bathrooms'] < 20]

| | price | Bathrooms | Square_Feet |
|---|---|---|---|
| 0 | 534433 | 2.0 | 1500 |
| 1 | 392333 | 3.5 | 2500 |
| 2 | 293222 | 2.0 | 1500 |
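
The cutoff of 20 bathrooms above is hand-picked. As a minimal sketch of one common alternative, the block below derives an upper bound from the interquartile range (IQR) instead. The helper name `iqr_upper_bound` and the multiplier `k=1.5` are assumptions for illustration, not part of the original notebook, and with only four rows the quartiles are very rough.

```python
import pandas as pd

# Same toy data as in the notebook
houses = pd.DataFrame({
    'price': [534433, 392333, 293222, 4322032],
    'Bathrooms': [2, 3.5, 2, 116],
    'Square_Feet': [1500, 2500, 1500, 48000],
})

def iqr_upper_bound(series, k=1.5):
    """Hypothetical helper: return Q3 + k * IQR for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    return q3 + k * (q3 - q1)

# Compute the bound from the data instead of hard-coding 20
upper = iqr_upper_bound(houses['Bathrooms'])

# Keep rows at or below the bound; the original frame is not modified
kept = houses[houses['Bathrooms'] <= upper]
print(upper)
print(kept)
```

On this data the bound works out to about 76 bathrooms, so the 116-bathroom row is still the only one dropped.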


**Option 2 : Mark**

In [6]:
#Load library
import numpy as np

#Create feature based on boolean condition: 0 if Bathrooms < 20, otherwise 1
houses['outlier'] = np.where(houses['Bathrooms'] < 20, 0, 1)

houses

| | price | Bathrooms | Square_Feet | outlier |
|---|---|---|---|---|
| 0 | 534433 | 2.0 | 1500 | 0 |
| 1 | 392333 | 3.5 | 2500 | 0 |
| 2 | 293222 | 2.0 | 1500 | 0 |
| 3 | 4322032 | 116.0 | 48000 | 1 |
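
Once rows are marked, one way to see whether the outliers had an effect is to compare a summary statistic with and without them. The sketch below does this for the mean price; the variable names `mean_all` and `mean_without` are illustrative assumptions, and the mean is just one possible statistic to compare.

```python
import numpy as np
import pandas as pd

# Same toy data and outlier flag as in the notebook
houses = pd.DataFrame({
    'price': [534433, 392333, 293222, 4322032],
    'Bathrooms': [2, 3.5, 2, 116],
    'Square_Feet': [1500, 2500, 1500, 48000],
})
houses['outlier'] = np.where(houses['Bathrooms'] < 20, 0, 1)

# Mean price over every row, including the flagged one
mean_all = houses['price'].mean()

# Mean price over rows that are not flagged as outliers
mean_without = houses.loc[houses['outlier'] == 0, 'price'].mean()

print(mean_all, mean_without)

# Per-group means show the same comparison in one call
print(houses.groupby('outlier')['price'].mean())
```

Here the single flagged row raises the mean price from roughly 407,000 to roughly 1,386,000, which is the kind of effect the flag makes visible without discarding any data.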


**Option 3 : Rescale**

In [8]:
#Create log feature (natural log, applied to the whole column at once)
houses['Log_of_Square_Feet'] = np.log(houses['Square_Feet'])

#Show data
houses

| | price | Bathrooms | Square_Feet | outlier | Log_of_Square_Feet |
|---|---|---|---|---|---|
| 0 | 534433 | 2.0 | 1500 | 0 | 7.313220 |
| 1 | 392333 | 3.5 | 2500 | 0 | 7.824046 |
| 2 | 293222 | 2.0 | 1500 | 0 | 7.313220 |
| 3 | 4322032 | 116.0 | 48000 | 1 | 10.778956 |
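
To make the effect of the log concrete, the sketch below compares how far the largest value sits from the median before and after the transform. The ratio used here is an illustrative measure chosen for this example, not something from the original notebook. It also shows `np.log1p`, which computes log(1 + x) and is a common choice when a column may contain zeros, where `np.log` would return `-inf`.

```python
import numpy as np
import pandas as pd

# Square footage from the notebook's toy data
sqft = pd.Series([1500, 2500, 1500, 48000], name='Square_Feet')
log_sqft = np.log(sqft)

# How many times larger is the maximum than the median?
raw_ratio = sqft.max() / sqft.median()
log_ratio = log_sqft.max() / log_sqft.median()
print(raw_ratio, log_ratio)

# log1p handles zeros gracefully: log(1 + 0) = 0
with_zero = pd.Series([0, 1500, 48000])
print(np.log1p(with_zero))
```

On the raw scale the largest house is 24 times the median; after the log it is only about 1.4 times the median, so the extreme row has much less leverage on anything computed from the column.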
