# Detecting and Replacing Outliers

In [1]:
import pandas as pd
import numpy as np
import os 

In [2]:
filename = os.path.join("/Users/salmanyagaka/Documents/interviews/adult.csv")
df = pd.read_csv(filename, header=0)

### Get the Dimensions of the Dataset

In [3]:
df.shape

(48842, 15)

### Glance at the Data

In [4]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


## Step 1: Compute the n-th Percentile of a Given Column

As an analyst, your goal is to detect the outliers in the `hours-per-week` column. In particular, you want to get the 99.9th percentile of the values in the `hours-per-week` column.<br>

As was discussed in the videos, *z-scores* can be used to compute the n-th percentile of a data array. Toward the end of this notebook, we will be looking at a few ways to compute the z-scores and then figure out the n-th percentile in a data column. For now, however, we will show you a ready-made method from `numpy` that achieves our objective.

The code cell below uses the `np.percentile()` function and gets the value of `hours-per-week` that corresponds to the 99.9th percentile.

In [5]:
hpw_999 = np.percentile(df['hours-per-week'], 99.9)
hpw_999

np.float64(99.0)
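To make the percentile computation concrete, here is a minimal sketch on a small made-up array (the values in `hours` are hypothetical and are not taken from the dataset). It computes a cutoff with `np.percentile()` and then uses a boolean mask to list the values above that cutoff, which is one simple way to use a percentile as an outlier threshold.

```python
import numpy as np

# Hypothetical weekly hours for ten workers (illustrative values only)
hours = np.array([35, 40, 40, 40, 42, 45, 45, 50, 60, 99])

# Value below which 90% of the observations fall
# (np.percentile interpolates linearly between data points by default)
cutoff = np.percentile(hours, 90)

# Boolean mask selecting the observations above the cutoff
above_cutoff = hours[hours > cutoff]

print(cutoff)
print(above_cutoff)
```

Because of the default linear interpolation, the cutoff does not have to be a value that actually occurs in the data.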

In the code cell below, find the value of `educational-num` that corresponds to the 90th percentile of years of education. Hint: Use the same method as in the code cell above, but replace the column name and the percentile value. Save your result to the variable `edu_90`.

### Graded Cell
The cell below will be graded. Remove the line "raise NotImplementedError()" before writing your code.

In [9]:
edu_90 = np.percentile(df['educational-num'], 90.0)
edu_90


np.float64(13.0)

### Self-Check

Run the cell below to test the correctness of your code above before submitting for grading. Do not add code or delete code in the cell.

In [7]:
# Run this self-test cell to check your code; 
# do not add code or delete code in this cell
from jn import testEdu

try:
    p, err = testEdu(df, edu_90)
    print(err)
except Exception as e:
    print("Error!\n" + str(e))
    


Correct!


## Step 2: Add a Column With the Winsorized Version of the Original Column

In this next section, we will use a new package called SciPy, which stands for Scientific Python. For more information about SciPy, consult the online [documentation](https://scipy.github.io/devdocs/reference/index.html).

First, import the `stats` module from the `scipy` package.

In [10]:
import scipy.stats as stats

Read the documentation for the function `stats.mstats.winsorize()`.

In [11]:
stats.mstats.winsorize?

Signature:
stats.mstats.winsorize(
    a,
    limits=None,
    inclusive=(True, True),
    inplace=False,
    axis=None,
    nan_policy='propagate',
)
Docstring:
Returns a Winsorized version of the input array.

The (limits[0])th lowest values are set to the (limits[0])th percentile,
and the (limits[1])th highest values are set to the (1 - limits[1])th
percentile.
Masked values are skipped.


Parameters
----------
a : sequence
    Input array.
limits : {None

This function returns a copy of a given column in which the outlier values have been replaced. In particular, you pass two cutoffs to the `limits` parameter as a list of fractions (for example, `0.01` for 1%). All values below the lower cutoff are replaced with the value at the lower percentile, and all values above the upper cutoff are replaced with the value at the upper percentile.

The code cell below uses the `stats.mstats.winsorize()` function to add a new column to DataFrame `df`. The column is named `educational-num-win` and contains the winsorized version of the `educational-num` column, with the cutoffs at the bottom and at the top both set to 1%.

In [15]:
# The argument limits=[0.01, 0.01] means:
# Cap the lowest 1% of values (bottom 1 percentile) at the value at the 1st percentile
# Cap the highest 1% of values (top 1 percentile) at the value at the 99th percentile
df['educational-num-win'] = stats.mstats.winsorize(df['educational-num'], limits=[0.01, 0.01])
df.head(15)

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income,educational-num-win
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K,7
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K,9
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K,12
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K,10
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K,10
5,34,Private,198693,10th,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K,6
6,29,?,227026,HS-grad,9,Never-married,?,Unmarried,Black,Male,0,0,40,United-States,<=50K,9
7,63,Self-emp-not-inc,104626,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,White,Male,3103,0,32,United-States,>50K,15
8,24,Private,369667,Some-college,10,Never-married,Other-service,Unmarried,White,Female,0,0,40,United-States,<=50K,10
9,55,Private,104996,7th-8th,4,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,10,United-States,<=50K,4
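The effect of winsorizing is easier to see on a tiny array than on the full DataFrame. The sketch below uses ten hypothetical values and a 10% cutoff at each end, so exactly one value is replaced at the bottom and one at the top.

```python
import numpy as np
from scipy.stats import mstats

# Hypothetical values; 1 and 10 are the extremes
a = np.array([10, 4, 9, 8, 5, 3, 7, 2, 1, 6])

# Replace the lowest 10% and the highest 10% of the values
# (one value at each end for an array of length 10)
a_win = mstats.winsorize(a, limits=[0.1, 0.1])

print(a)
print(a_win)  # 10 becomes 9 and 1 becomes 2; other values are unchanged
```

Note that `winsorize()` returns a NumPy masked array and, with the default `inplace=False`, leaves the input array untouched.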


## Deep Dive: Computing z-scores 

First, let's review what the *z-score of a given value* is. <br>
Say your dataset contains a feature (aka a one-dimensional array, a vector, a list, a variable, a data column) called `X`, and you want to compute the z-score for one particular observation (aka an example value, a cell) of this feature. Let's call this observation $x_i$. <br>
The z-score of $x_i$ is given by:
$$z_i = \frac{x_i-\bar{x}}{s},$$
where $\bar{x}$ is the mean of all the values of `X` in your data, and $s$ is the standard deviation of those values.<br>

The code cells below implement this formula.

#### Calculate a z-score for one (given) value, a given mean, and a given standard deviation

In [16]:
F_mean = 5.44
F_std = 7.7
value = 4 

value_zscore = (value-F_mean)/F_std
value_zscore

-0.18701298701298705

#### Calculate a z-score for one (given) value, given the full sample of values. (The `numpy` way)

In [19]:
F = [4, 6, 3, -3, 4, 5, 6, 7, 3, 8, 1, 9, 1, 2, 2, 35, 4, 1]
value = F[0]
F_std = np.std(F)
F_mean = np.mean(F)
value_zscore = (value-F_mean)/F_std
value_zscore

np.float64(-0.1874826669747723)

In [21]:
F_mean

np.float64(5.444444444444445)
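One detail worth knowing about the cell above: `np.std()` computes the population standard deviation by default (it divides by $n$, i.e. `ddof=0`). If you want the sample standard deviation (dividing by $n-1$), pass `ddof=1`. The sketch below compares the two on the same data; which one is appropriate depends on your analysis, so treat this as an illustration rather than a recommendation.

```python
import numpy as np

F = [4, 6, 3, -3, 4, 5, 6, 7, 3, 8, 1, 9, 1, 2, 2, 35, 4, 1]
value = F[0]
F_mean = np.mean(F)

# Population standard deviation: divides by n (NumPy's default, ddof=0)
z_population = (value - F_mean) / np.std(F)

# Sample standard deviation: divides by n - 1 (ddof=1)
z_sample = (value - F_mean) / np.std(F, ddof=1)

print(z_population)
print(z_sample)
```

The sample standard deviation is slightly larger, so the resulting z-score is slightly closer to zero. The difference shrinks as the number of observations grows.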

#### Calculate the z-score for all values of a feature vector. (The `numpy` way)

All we need to do now is to apply the computation we implemented above to every value in the feature vector `F`. 

In [22]:
F_std = np.std(F)
F_mean = np.mean(F)
zscores = []
for value in F:
    z = (value-F_mean)/F_std
    zscores.append(z)
    
zscores

[np.float64(-0.1874826669747723),
 np.float64(0.07210871806722008),
 np.float64(-0.3172783594957685),
 np.float64(-1.0960525146217457),
 np.float64(-0.1874826669747723),
 np.float64(-0.057686974453776116),
 np.float64(0.07210871806722008),
 np.float64(0.2019044105882163),
 np.float64(-0.3172783594957685),
 np.float64(0.3317001031092125),
 np.float64(-0.5768697445377609),
 np.float64(0.46149579563020865),
 np.float64(-0.5768697445377609),
 np.float64(-0.4470740520167647),
 np.float64(-0.4470740520167647),
 np.float64(3.83618380117611),
 np.float64(-0.1874826669747723),
 np.float64(-0.5768697445377609)]

Now, let's write code that implements the same computation the *pythonic* way -- using *list comprehensions*. <br>
Tip: remember that list comprehension syntax looks like this: <br>
`[action_to_apply(new_var_name) for new_var_name in list_containing_values]`

In [23]:
F_std = np.std(F)
F_mean = np.mean(F)
zscores = [(value-F_mean)/F_std for value in F]
zscores

[np.float64(-0.1874826669747723),
 np.float64(0.07210871806722008),
 np.float64(-0.3172783594957685),
 np.float64(-1.0960525146217457),
 np.float64(-0.1874826669747723),
 np.float64(-0.057686974453776116),
 np.float64(0.07210871806722008),
 np.float64(0.2019044105882163),
 np.float64(-0.3172783594957685),
 np.float64(0.3317001031092125),
 np.float64(-0.5768697445377609),
 np.float64(0.46149579563020865),
 np.float64(-0.5768697445377609),
 np.float64(-0.4470740520167647),
 np.float64(-0.4470740520167647),
 np.float64(3.83618380117611),
 np.float64(-0.1874826669747723),
 np.float64(-0.5768697445377609)]
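A third option, sketched below, is to skip the explicit loop entirely. If the values are stored in a NumPy array, arithmetic is applied element-wise, so one expression computes every z-score at once. The name `zscores_vec` is introduced here only for illustration.

```python
import numpy as np

F = [4, 6, 3, -3, 4, 5, 6, 7, 3, 8, 1, 9, 1, 2, 2, 35, 4, 1]

# Convert the list to an array so that arithmetic applies element-wise
F_arr = np.asarray(F, dtype=float)

# One expression computes the z-score of every element
zscores_vec = (F_arr - F_arr.mean()) / F_arr.std()

print(zscores_vec)
```

The result is a NumPy array rather than a list of `np.float64` objects, and it holds the same values as the loop and list comprehension versions.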

#### Calculate the z-score for all values of a feature vector. (The `scipy` way)

Previously, we computed z-scores by implementing the defining formula with `numpy`.<br>
This time, we will use the ready-made function `zscore()` from `scipy.stats`. By default it uses the same population standard deviation (`ddof=0`) as `np.std()`.

In [28]:
zscores = stats.zscore(df['hours-per-week'])
zscores

array([-0.03408696,  0.77292975, -0.03408696, ..., -0.03408696,
       -1.64812038, -0.03408696], shape=(48842,))
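Once you have z-scores, a common way to use them for outlier detection is to flag values whose absolute z-score exceeds some threshold. The sketch below applies this to a small array of hypothetical values. The threshold of 3 is an assumed rule of thumb, not a value prescribed by this notebook.

```python
import numpy as np
from scipy import stats

F = np.array([4, 6, 3, -3, 4, 5, 6, 7, 3, 8, 1, 9, 1, 2, 2, 35, 4, 1])

zscores = stats.zscore(F)

# Assumed rule of thumb: flag values more than 3 standard deviations from the mean
threshold = 3
outliers = F[np.abs(zscores) > threshold]

print(outliers)
```

Here only the value 35 is flagged. Its z-score is about 3.84, while every other value lies within roughly one standard deviation of the mean.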

#### Calculate z-scores for all values of all (numeric) columns

We will demonstrate how to use the Pandas `apply()` method to apply the same function (`stats.zscore`) to every column of a DataFrame that has first been filtered down to its numeric columns:

In [30]:
df_zscores = df.select_dtypes(include=['number']).apply(stats.zscore)
df_zscores.head(10)

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week,educational-num-win
0,-0.995129,0.351675,-1.212683,-0.144804,-0.217127,-0.034087,-1.212683
1,-0.046942,-0.945524,-0.426896,-0.144804,-0.217127,0.77293,-0.426896
2,-0.776316,1.394723,0.751784,-0.144804,-0.217127,-0.034087,0.751784
3,0.390683,-0.277844,-0.034003,0.886874,-0.217127,-0.034087,-0.034003
4,-1.505691,-0.815954,-0.034003,-0.144804,-0.217127,-0.841104,-0.034003
5,-0.338691,0.085498,-1.605577,-0.144804,-0.217127,-0.841104,-1.605577
6,-0.703379,0.353796,-0.426896,-0.144804,-0.217127,-0.034087,-0.426896
7,1.776496,-0.805263,1.930465,0.271598,-0.217127,-0.6797,1.930465
8,-1.068066,1.704525,-0.034003,-0.144804,-0.217127,-0.034087,-0.034003
9,1.192996,-0.801759,-2.391364,-0.144804,-0.217127,-2.455137,-2.391364
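Step 1 mentioned that z-scores can be used to find the n-th percentile of a column. One way to do this, sketched below, assumes that the data are approximately normally distributed: look up the z-score that corresponds to the target percentile with `stats.norm.ppf()`, then convert it back to the scale of the data using the mean and standard deviation. The data here are synthetic and all variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic, roughly normal data (hypothetical; not the adult dataset)
rng = np.random.default_rng(seed=0)
x = rng.normal(loc=40, scale=12, size=100_000)

p = 0.99  # target percentile, expressed as a fraction

# z-score that cuts off the lowest 99% of a standard normal distribution
z_p = stats.norm.ppf(p)

# Convert that z-score back to the scale of the data
approx_percentile = x.mean() + z_p * x.std()

# Empirical percentile for comparison
empirical_percentile = np.percentile(x, 100 * p)

print(approx_percentile, empirical_percentile)
```

The two numbers agree closely here only because the synthetic data really are normal. For a column whose distribution is far from normal, the z-score approximation can be noticeably off, and `np.percentile()` is the safer choice.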
