# Anomaly Detection

# Continuous Variable Probabilistic Methods for Identifying Outliers

## Exercises
Following the repo setup directions, set up a new local and remote repository named anomaly-detection-exercises. The local version of your repo should live inside of ~/codeup-data-science.

Save this work in your anomaly-detection-exercises repo. Then add, commit, and push your changes.



Do your work in a file named continuous_probabilistic_methods.py or continuous_probabilistic_methods.ipynb.

1. Define a function named get_lower_and_upper_bounds that takes two arguments. The first argument is a pandas Series. The second argument is the multiplier, which should have a default value of 1.5.

## Imports

In [13]:
# standard imports
import numpy as np
import pandas as pd

# my imports
import wrangle as w
import explore as e

# visualization imports
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# argument1 = pd.Series

In [3]:
# argument 2 is a multiplier, which should have a default argument of 1.5
# argument2 = 1.5

In [4]:
# def get_lower_and_upper_bounds(argument1, argument2):
    
#     return
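The stub above can be filled in along these lines. This is a minimal sketch of the function the exercise describes, assuming the bounds are the usual IQR fences (Q1 minus and Q3 plus the multiplier times the IQR) and that the function returns them as a `(lower, upper)` tuple. The `example` Series holds toy values, not data from lemonade.csv.

```python
import pandas as pd


def get_lower_and_upper_bounds(series, multiplier=1.5):
    """Return the (lower, upper) IQR fences for a pandas Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - multiplier * iqr
    upper_bound = q3 + multiplier * iqr
    return lower_bound, upper_bound


# Toy values for illustration only (not taken from lemonade.csv)
example = pd.Series([1, 2, 3, 4, 5])
lower, upper = get_lower_and_upper_bounds(example)
```

Values below `lower` or above `upper` are the candidates to inspect; a larger multiplier widens the fences and flags fewer points.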

### 1. Using lemonade.csv dataset and focusing on continuous variables:

In [5]:
url = "https://gist.githubusercontent.com/ryanorsinger/19bc7eccd6279661bd13307026628ace/raw/e4b5d6787015a4782f96cad6d1d62a8bdbac54c7/lemonade.csv"

In [6]:
df = pd.read_csv(url)

In [7]:
summary = w.data_summary(df)
summary

data shape: (365, 7)


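| | data type | #missing | %missing | #unique | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| date | object | 0 | 0.0 | 365 | 365.0 | | | | | | | |
| day | object | 0 | 0.0 | 7 | 365.0 | | | | | | | |
| temperature | float64 | 0 | 0.0 | 176 | 365.0 | 61.224658 | 18.085892 | 15.1 | 49.7 | 61.1 | 71.7 | 212.0 |
| rainfall | float64 | 0 | 0.0 | 35 | 365.0 | 0.825973 | 0.27389 | 0.4 | 0.65 | 0.74 | 0.91 | 2.5 |
| flyers | int64 | 0 | 0.0 | 63 | 365.0 | 40.10411 | 13.786445 | -38.0 | 31.0 | 39.0 | 49.0 | 80.0 |
| price | float64 | 0 | 0.0 | 1 | 365.0 | 0.5 | 0.0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| sales | int64 | 0 | 0.0 | 39 | 365.0 | 27.865753 | 30.948132 | 7.0 | 20.0 | 25.0 | 30.0 | 534.0 |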


#### Use the IQR Range Rule and the upper and lower bounds to identify the lower outliers of each column of lemonade.csv, using the multiplier of 1.5. Do these lower outliers make sense? Which outliers should be kept?

In [15]:
object_cols = e.get_object_cols(df)
object_cols

['date', 'day']

In [16]:
num_cols = e.get_numeric_cols(df)
num_cols

['temperature', 'rainfall', 'flyers', 'price', 'sales']

In [24]:
df_nums = w.remove_columns(df, object_cols)
df_nums

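| | temperature | rainfall | flyers | price | sales |
|---|---|---|---|---|---|
| 0 | 27.0 | 2.00 | 15 | 0.5 | 10 |
| 1 | 28.9 | 1.33 | 15 | 0.5 | 13 |
| 2 | 34.5 | 1.33 | 27 | 0.5 | 15 |
| 3 | 44.1 | 1.05 | 28 | 0.5 | 17 |
| 4 | 42.4 | 1.00 | 33 | 0.5 | 18 |
| ... | ... | ... | ... | ... | ... |
| 360 | 42.7 | 1.00 | 33 | 0.5 | 19 |
| 361 | 37.8 | 1.25 | 32 | 0.5 | 16 |
| 362 | 39.5 | 1.25 | 17 | 0.5 | 15 |
| 363 | 30.9 | 1.43 | 22 | 0.5 | 13 |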


In [27]:
def outlier(df, feature, m=1.5):
    '''
    outlier takes in a dataframe and the name of one of its features, then:
    - calculates the feature's 1st & 3rd quartiles,
    - uses their difference to calculate the IQR,
    - applies the `m` multiplier to the IQR to calculate
      the upper and lower bounds
    Returns (upper_bound, lower_bound), in that order.
    '''
    q1 = df[feature].quantile(.25)
    q3 = df[feature].quantile(.75)

    iqr = q3 - q1

    upper_bound = q3 + (m * iqr)
    lower_bound = q1 - (m * iqr)

    return upper_bound, lower_bound

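In [28]:
# `feature` must be the name of a column in df, passed as a string;
# calling outlier(df, feature) with no `feature` variable defined raises a NameError
outlier(df, 'temperature', m=1.5)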

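In [21]:
def get_outliers_col(df, m=1.5):
    '''
    For each numeric column, calculate the IQR bounds with the `m`
    multiplier, print them, and keep only the rows inside those bounds.
    Columns are filtered one after another, so later bounds are
    calculated on the rows that survived the earlier columns.
    '''
    # total rows before filtering
    orig_shape = df.shape[0]

    # quantiles are only defined for numeric columns, so skip date and day
    for col in df.select_dtypes(include='number').columns:
        # finding the lower and upper bounds for this column
        upper, lower = outlier(df, col, m)
        # inclusive comparison, so a constant column (IQR of 0) keeps its rows
        df = df[(df[col] <= upper) & (df[col] >= lower)]
        print(f"{col}: lower= {lower}, upper= {upper}, new rows= {df.shape[0]}\n")

    new_shape = df.shape[0]
    shape_rem = orig_shape - new_shape
    print(f"Total of rows originally: {orig_shape}")
    print(f"Total of rows removed: {shape_rem}")
    print(f"New total of rows: {new_shape}")

    return df

In [30]:
get_outliers_col(df)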

#### Use the IQR Range Rule and the upper and lower bounds to identify the upper outliers of each column of lemonade.csv, using the multiplier of 1.5. Do these upper outliers make sense? Which outliers should be kept?
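One way to approach this is to collect, for each numeric column, the values that sit above the upper fence. The sketch below is illustrative only: `upper_outliers` is a hypothetical helper name, and `toy` is a made-up frame in which only the 212.0 mirrors the maximum temperature reported in the summary table.

```python
import pandas as pd


def upper_outliers(df, multiplier=1.5):
    """Map each numeric column to the values above its upper IQR fence."""
    found = {}
    for col in df.select_dtypes(include="number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        upper_bound = q3 + multiplier * (q3 - q1)
        found[col] = df.loc[df[col] > upper_bound, col]
    return found


# Hypothetical toy frame; 212.0 mirrors the maximum temperature in the
# summary table, the remaining values are made up for illustration.
toy = pd.DataFrame({
    "temperature": [49.7, 55.0, 61.1, 65.0, 71.7, 212.0],
    "flyers": [31, 35, 39, 42, 49, 45],
})
result = upper_outliers(toy)
```

Here `result["temperature"]` holds the single value 212.0 and `result["flyers"]` is empty. Whether a flagged value is a data-entry error or a real extreme is a judgment call that the fence alone cannot make.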

#### Using the multiplier of 3, IQR Range Rule, and the lower bounds, identify the outliers below the lower bound in each column of lemonade.csv. Do these lower outliers make sense? Which outliers should be kept?

#### Using the multiplier of 3, IQR Range Rule, and the upper bounds, identify the outliers above the upper_bound in each column of lemonade.csv. Do these upper outliers make sense? Which outliers should be kept?
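To see how the choice of multiplier changes what gets flagged, it can help to count the points beyond each fence for both multipliers side by side. The sketch below assumes a hypothetical helper, `count_outliers`, and a made-up `flyers` Series in which only the -38 mirrors the minimum reported in the summary table.

```python
import pandas as pd


def count_outliers(series, multiplier):
    """Return (n_below_lower_fence, n_above_upper_fence) for one Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    below = (series < q1 - multiplier * iqr).sum()
    above = (series > q3 + multiplier * iqr).sum()
    return int(below), int(above)


# Hypothetical toy values; -38 mirrors the minimum `flyers` value in the
# summary table, the rest are made up for illustration.
flyers = pd.Series([-38, 5, 31, 35, 39, 42, 45, 49, 70])
counts = {m: count_outliers(flyers, m) for m in (1.5, 3)}
```

With these toy values the 1.5 multiplier flags two low points and one high point, while the multiplier of 3 flags only the -38. The wider fences keep the mild extremes and isolate the value that cannot be a real flyer count.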

### 2. Identify if any columns in lemonade.csv are normally distributed. For normally distributed columns:

- Use a 2 sigma decision rule to isolate the outliers.

> Do these make sense?
> Should certain outliers be kept or removed?
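For a column already judged to be approximately normal, the 2 sigma rule can be sketched with z-scores. This is an illustrative sketch, not the notebook's solution: `sigma_outliers` is a hypothetical helper, the sample standard deviation (pandas' default) is assumed, and `values` is made-up data.

```python
import pandas as pd


def sigma_outliers(series, k=2):
    """Return the values lying more than k standard deviations from the mean."""
    z_scores = (series - series.mean()) / series.std()
    return series[z_scores.abs() > k]


# Made-up toy values for illustration only
values = pd.Series([20, 22, 24, 25, 25, 26, 27, 28, 30, 75])
flagged = sigma_outliers(values, k=2)
```

With these toy values only 75 is flagged at 2 sigma, and nothing is flagged at `k=3`, because an extreme value inflates the standard deviation it is measured against. That effect is strongest in small samples.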


### 3. Now use a 3 sigma decision rule to isolate the outliers in the normally distributed columns from lemonade.csv
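When comparing the 2 sigma and 3 sigma results, a useful reference point is how many rows a truly normal column would be expected to flag by chance. The sketch below computes that share from the normal distribution using only the standard library. `share_outside` is a hypothetical helper name, and 365 is the row count reported for lemonade.csv above.

```python
import math


def share_outside(k):
    """Probability that a normal variable falls more than k sigma from its mean."""
    return 1 - math.erf(k / math.sqrt(2))


n_rows = 365  # row count reported for lemonade.csv above
expected = {k: share_outside(k) * n_rows for k in (2, 3)}
```

Under normality, roughly 4.6% of rows (about 17 of 365) fall outside 2 sigma and roughly 0.27% (about 1 of 365) fall outside 3 sigma. A column that flags far more than that at 3 sigma is either not normal or contains values worth investigating.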