## Anomalies in the real world

Allow me to quote the following from the classic book **Data Mining: Concepts and Techniques** by Han et al.:

Outlier detection (also known as anomaly detection) is the process of finding data objects with behaviors that are very different from expectation. Such objects are called outliers or anomalies.

Suppose you are a credit card holder and, on an unfortunate day, your card gets stolen. Payment processing companies (like PayPal) keep track of your usage pattern so that they can notify you of any dramatic change in it. The pattern includes transaction amounts, the locations of transactions and so on. If a credit card is stolen, the transactions that follow are very likely to differ substantially from the usual ones. This is where (among many other instances) companies use the concept of anomalies to detect the unusual transactions that may take place after a credit card theft. But do not confuse anomalies with noise. **Noise and anomalies are not the same**. So, what does noise look like in the real world?

Let’s take the example of the sales record of a grocery shop. People tend to buy a lot of groceries at the start of a month, and as the month progresses the shop owner starts to see a clear decrease in sales. He then offers discounts on a number of grocery items and makes sure to advertise the scheme. This discount scheme might cause an uneven increase in sales, but is that increase normal? It surely is not. Such fluctuations are **noise** (more specifically, stochastic noise).

Could not get any better, right? To be able to make more sense of anomalies, it is important to understand what makes an anomaly different from noise.

The way data is generated plays a huge role in this. The normal instances of a dataset were most likely generated by the same process, whereas outliers were often generated by one or more different processes.

<img src="images/anomaly_detection_1.PNG">

The above figure shows what outliers look like within a set of *closely related data points*. The closeness is governed by the process that generated the data points. From this, it can be inferred that the process that generated the two encircled data points must have been different from the one that generated the others. But how do we justify that those red data points were generated by some other process? ***Assumptions!***

While doing anomaly analysis, it is a common practice to make several assumptions on the normal instances of the data and then distinguish the ones that violate these assumptions. More on these assumptions later!
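
To make the role of assumptions concrete, here is a minimal sketch. It assumes (purely for illustration) that the normal instances come from a single Gaussian process and that a normal value stays within three standard deviations of the mean; the numbers, the seed and the threshold are all hypothetical choices, not something prescribed by the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "normal" process: values scattered closely around 50
normal_points = rng.normal(loc=50.0, scale=2.0, size=200)
# Assumed "different" process: two values that come from somewhere else
odd_points = np.array([80.0, 15.0])

data = np.concatenate([normal_points, odd_points])

# Assumption on normal instances: |z-score| stays below 3
z_scores = (data - data.mean()) / data.std()
violations = np.flatnonzero(np.abs(z_scores) > 3)

# Indices of the points that violate the assumption
print(violations)
```

The two injected values sit at positions 200 and 201 of `data`, so those are the indices one would expect to see flagged. A different assumption (another distribution, another threshold) could flag a different set of points, which is exactly why the assumptions matter.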

The above figure may give you the impression that **anomaly analysis** and **cluster analysis** are the same thing. They are indeed closely related, but they are not the same! They differ in their purposes: while **cluster analysis** lets you **group similar data points**, **anomaly analysis** lets you **figure out the odd ones among a set of data points.**



## Generation of anomalies in data

The way anomalies are generated varies hugely from domain to domain and from application to application. Let’s take a moment to review some of the fields where anomaly detection is vital:

* **Intrusion detection systems** - In computer networks and systems, unusual network traffic and abnormal user actions are common signs of intrusion. Such intrusions can compromise confidential assets of an organization. Detecting them is a form of anomaly detection.

* **Fraud detection in transactions** - One of the most prominent use cases of anomaly detection. Nowadays, it is common to hear about events where one’s credit card number and related information get compromised. This can, in turn, lead to abnormal behavior in the usage pattern of the credit cards. Therefore, to effectively detect these frauds, anomaly detection techniques are employed.

* **Electronic sensor events** - Electronic sensors enable us to capture data from various sources. Nowadays, our mobile devices are also equipped with various sensors such as light sensors, accelerometers, proximity sensors, ultrasonic sensors and so on. Sensor data analysis has a lot of interesting applications. But what happens when the sensors become ineffective? This shows up in the data they capture. When a sensor malfunctions, it fails to capture the data correctly and thereby produces anomalies. Anomalies can also come from the measured quantity itself: for example, one’s pulse rate may get abnormally high due to several conditions. This point is crucial in today’s industrial scenario. We are approaching and embracing **Industry 4.0**, in which **IoT** (Internet of Things) and **AI** (Artificial Intelligence) are integral parts. When sensors start to behave inconsistently, the signals they convey become unreliable as well, which leads to unexpected troubleshooting. Hence, systematic anomaly detection is a must here.

## Anomalies can be of different types

In the data science literature, anomalies are usually divided into the following three types. Understanding these types can significantly affect how you deal with anomalies.

* Global
* Contextual
* Collective

In the following subsections, we take a closer look at each of these types and discuss their key aspects, such as why they matter and the situations in which they deserve attention.

### 1. Global anomalies

Global anomalies are the most common type of anomalies and correspond to data points that deviate largely from the rest of the data points. The figure used in the “Anomalies in the real world” section actually depicts global anomalies.

<img src="images/anomaly_detection_2.PNG">
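
One simple way to look for global anomalies is the interquartile-range rule (Tukey's fences). The sketch below is only an illustration: the values are made up, and the factor 1.5 is the conventional but still arbitrary choice.

```python
import numpy as np

# Hypothetical measurements; most of them lie between 12 and 15
values = np.array([12, 14, 13, 15, 14, 13, 12, 95, 14, 13, 15, -40], dtype=float)

# Fences at 1.5 interquartile ranges beyond the first and third quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences deviate from the bulk of the data as a whole
global_anomalies = values[(values < lower) | (values > upper)]
print(global_anomalies)  # [ 95. -40.]
```

The rule compares every point against the whole dataset and ignores any context, which is what makes the detected points "global".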

You might be thinking that the idea of global anomalies (deviation from the normal) may not always be practical, given that what counts as normal depends on conditions, context and similar aspects. Yes, you are thinking just right. Anomalies can be contextual too!

### 2. Contextual anomalies

Suppose today’s temperature is 32 degrees centigrade and we are in Kolkata, a city in India. Is the temperature normal today? This is a highly relative question that demands more information before it can be answered. Information about the season, location and so on is needed before we can respond to the question “Is the temperature normal today?”

Now, in India, specifically in Kolkata, if it is summer, the temperature mentioned above is fine. But if it is winter, we need to investigate further. Let’s take another example. We are all aware of the tremendous climate change that is causing global warming. Recent observations reflect it too. From the archives of The Washington Post:

 - Alaska just finished one of its most unusually warm Marches ever recorded. In its northern reaches, the March warmth was unprecedented.
 
Take note of the phrase “unusually warm”. It refers to 59 degrees Fahrenheit. This may not be unusually warm for other places, but for Alaska in March it is an anomaly.

These are called **contextual anomalies** where the deviation that leads to the anomaly **depends on contextual information**. These contexts are **governed by contextual attributes and behavioral attributes**. In this example, **location is a contextual attribute** and **temperature is a behavioral attribute.**

<img src="images/anomaly_detection_3.PNG">

The above figure depicts time-series data over a particular period of time. The plot was further smoothed by kernel density estimation to present the boundary of the trend. The values have not fallen outside the normal global bounds, but there are indeed abnormal points (highlighted in orange) when compared to the seasonality.
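
The temperature example can be sketched in a few lines. Everything here is hypothetical: the readings, the two seasons used as the contextual attribute, and the tolerance of 8 degrees around the seasonal median. The point is only that the same behavioral value (32) is judged differently depending on its context.

```python
import numpy as np

# Hypothetical daily temperatures (in degrees centigrade) and their season
seasons = np.array(["summer"] * 6 + ["winter"] * 6)
temps = np.array([31, 33, 32, 34, 32, 33,    # summer readings
                  15, 16, 14, 32, 15, 16],   # winter readings, one of them 32
                 dtype=float)

# Hypothetical tolerance: how far a reading may sit from its seasonal median
TOLERANCE = 8.0

# Globally, 32 is unremarkable: the same value occurs among the summer readings
print(np.sum(temps == 32.0))  # 3

# Contextually, each reading is compared with the median of its own season
contextual_anomaly = np.zeros(len(temps), dtype=bool)
for season in np.unique(seasons):
    in_season = seasons == season
    seasonal_median = np.median(temps[in_season])
    contextual_anomaly[in_season] = np.abs(temps[in_season] - seasonal_median) > TOLERANCE

print(np.flatnonzero(contextual_anomaly))  # [9]
```

Only the 32-degree reading recorded in winter (index 9) is flagged; the two summer readings with the same value are not.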

### 3. Collective anomalies

In the following figure, the data points marked in green have collectively formed a region which substantially deviates from the rest of the data points.

<img src="images/anomaly_detection_4.PNG">


This is an example of a collective anomaly. The main idea behind collective anomalies is that the data points forming the collection may not be anomalies when considered individually. Let’s take the example of a daily supply chain in a textile firm. Delayed shipments are very common in industries like this. But if there are numerous shipment delays on a given day, it might need further investigation. A single delayed shipment does not signal a problem on its own; instead, a collective summary is taken into account when analyzing situations like this.

Collective anomalies are interesting because here you do not only look at individual data points but also analyze their behavior collectively.
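
The shipment example can be sketched as follows. The shipment log, the number of days and the rule "more than three times the typical daily count" are all hypothetical; the sketch only illustrates that the anomaly appears in the daily summary rather than in any individual record.

```python
import numpy as np

# Hypothetical shipment log: 10 days with 20 shipments each
n_days, per_day = 10, 20
day_of_shipment = np.repeat(np.arange(n_days), per_day)
delayed = np.zeros(n_days * per_day, dtype=bool)

# A couple of delayed shipments every day is business as usual ...
for day in range(n_days):
    delayed[day * per_day : day * per_day + 2] = True
# ... but on day 6, twelve shipments are delayed
delayed[6 * per_day : 6 * per_day + 12] = True

# No single delayed shipment stands out; the daily summary does
delays_per_day = np.bincount(day_of_shipment[delayed], minlength=n_days)
typical = np.median(delays_per_day)
flagged_days = np.flatnonzero(delays_per_day > 3 * typical)  # hypothetical rule

print(delays_per_day)  # [ 2  2  2  2  2  2 12  2  2  2]
print(flagged_days)    # [6]
```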





## Machine learning & anomalies: Could it get any better?

The heart and soul of any machine learning model is the data that is fed to it. Data can practically be of any form: structured, semi-structured or unstructured. We will not go into these categories here. At their core, machine learning models try to find the underlying patterns that best represent the data. These patterns are generally learned as mathematical functions and are then used for making predictions, drawing inferences and so on. To this end, consider the following toy dataset:

<img src="images/anomaly_detection_5.PNG">

The dataset has two features, **x1** and **x2**, and the target variable (or the label) is **y**. The dataset has 6 observations. On taking a close look at the data points, the fifth data point appears to be the odd one out. Really? Well, it depends on a few things:

* We need to take into account the domain to which the dataset belongs. This is particularly important because, unless we have information on that, we cannot really say if the fifth data point is an extreme one (an anomaly). It might well be that this set of values is possible in the domain.

* While the data was being captured, what was the state of the capturing process? Was it functioning the way it was expected to? We may not always have answers to questions like these. But they are worth considering because they can change the whole course of the anomaly detection process.

Now coming to the perspective of a machine learning model, let’s formalize the problem statement:

 - Given the input features x1 and x2, the task is to predict y.
 
The prediction task is a classification task. Say, you have trained a model M on this data and you got a classification accuracy of 96% on this dataset. Great start for a baseline model, isn’t it? You may not be able to come up with a better model than this for this dataset. Is this evaluation just enough? Well, the answer is no! Let’s now find out why.

When we know that our dataset contains a weird data point, going by classification accuracy alone is not enough. Classification accuracy is the percentage of correct predictions made by the model. So, before jumping to a conclusion about the model’s predictive supremacy, we should check whether the model is able to correctly classify the weird data point. Although the importance of anomaly detection varies from application to application, it is still good practice to take this into account. Long story short, when a dataset contains anomalies, classification accuracy alone may not be a justified evaluation criterion.

 * The illusion created by the classification accuracy score in situations like the one described above is also known as the **classification paradox** (more commonly called the accuracy paradox).
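
The illusion is easy to reproduce with a few lines of numpy. The sketch assumes a hypothetical dataset of 100 observations with 2 anomalies and a hypothetical model that labels everything as normal.

```python
import numpy as np

# 100 observations, 2 of which are anomalies (label 1)
y_true = np.zeros(100, dtype=int)
y_true[[10, 57]] = 1

# A hypothetical model that always predicts "normal" (label 0)
y_pred = np.zeros(100, dtype=int)

accuracy = np.mean(y_pred == y_true)
# Share of the true anomalies that the model actually caught
anomaly_recall = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(f"accuracy = {accuracy:.2f}")              # accuracy = 0.98
print(f"anomaly recall = {anomaly_recall:.2f}")  # anomaly recall = 0.00
```

An accuracy of 98% looks excellent, yet the model has not detected a single anomaly.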
 
So, when a machine learning model is learning the patterns of the data given to it, it may have a hard time figuring out these anomalies and may give unexpected results. A trivial and naive way to tackle this is to simply drop the anomalies from the data before feeding it to a model. But what happens when the detection of anomalies is extremely important in an application (we have seen examples of such applications in the earlier sections)? Can’t the anomalies be utilized in a more systematic modeling process? Well, the next section deals with that.


 


## Getting benefits from anomalies

When training machine learning models for applications where anomaly detection is extremely important, we need to thoroughly investigate whether the models are able to identify the anomalies effectively and consistently. A good way of utilizing the anomalies that may be present in the data is to train the model with the anomalies themselves, so that it becomes good at detecting them. So, on a very high level, the task becomes training a machine learning model specifically to identify anomalies; the model can later be incorporated into a broader automation pipeline.

A well-known method to train a machine learning model for this purpose is **Cost-Sensitive Learning.** The idea here is to associate a cost with the predictions a model makes, so that different kinds of mistakes can be weighted differently. Traditional machine learning models typically treat every wrong prediction alike, regardless of its consequences. Let’s take the example of a fraudulent transaction detection system. The objective of the model is to identify fraudulent transactions effectively and consistently. This is essentially a binary classification task. Now, let’s see what happens when a model makes a wrong prediction about a given transaction. The model can go wrong in either of the following ways:

* It misclassifies a legitimate transaction as fraudulent, or
* It misclassifies a fraudulent transaction as legitimate

<img src="images/anomaly_detection_6.PNG">

To understand this more clearly, we need to take into account the cost (incurred by the authorities) associated with each misclassification. If a legitimate transaction is categorized as fraudulent, the user generally contacts the bank to figure out what went wrong, and in most cases the respective authority and the user come to a mutual agreement. In this case, the administrative cost of handling the matter is likely to be small. Now, consider the other scenario, in which a fraudulent transaction is misclassified as legitimate. This can indeed lead to serious concerns. Suppose your credit card has been stolen and the thief (let’s assume he somehow got to know the security PIN as well) purchased something for an amount that is unusual given your credit limit. Further, suppose this transaction did not raise any alarm at the credit card agency. In this case, the amount debited because of the theft may have to be reimbursed by the agency.

In traditional machine learning models, the optimization process generally just minimizes the cost of the wrong predictions made by the model, with every kind of error counted alike. When cost-sensitive learning is incorporated, we associate a hypothetical cost with each outcome, for instance with a fraud that goes undetected or with an anomaly that is identified correctly. The model then tries to minimize the net cost (as incurred by the agency in this case) instead of the plain misclassification cost.

**N.B.:** Most machine learning models are trained by optimizing a cost function.

 * Effectiveness and consistency are very important in this regard because a model may sometimes identify an anomaly correctly just by chance. We need to make sure that the model performs consistently well at identifying anomalies.
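
The difference between counting errors and costing them can be sketched as follows. The two cost figures and the predictions of the two models are hypothetical, chosen only to mirror the fraud scenario above; `net_cost` is an illustrative helper, not a function from any library.

```python
import numpy as np

# Hypothetical costs (in USD), assumptions for illustration only
COST_FALSE_ALARM = 5.0      # legitimate transaction flagged as fraud (admin cost)
COST_MISSED_FRAUD = 500.0   # fraudulent transaction let through (reimbursement)

def net_cost(y_true, y_pred):
    """Total cost of the errors in y_pred, weighting each error type differently."""
    false_alarms = np.sum((y_pred == 1) & (y_true == 0))
    missed_frauds = np.sum((y_pred == 0) & (y_true == 1))
    return false_alarms * COST_FALSE_ALARM + missed_frauds * COST_MISSED_FRAUD

# 1 = fraudulent, 0 = legitimate
y_true  = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
model_a = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])  # misses one fraud
model_b = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 1])  # raises two false alarms

for name, y_pred in [("A", model_a), ("B", model_b)]:
    accuracy = np.mean(y_pred == y_true)
    print(name, f"accuracy={accuracy:.1f}", f"net cost={net_cost(y_true, y_pred):.0f}")
# A accuracy=0.9 net cost=500
# B accuracy=0.8 net cost=10
```

Model A wins on accuracy, but model B is far cheaper once the two kinds of error are priced differently. A cost-sensitive training procedure would push the model towards the behavior of B.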

## A case study of anomaly detection in Python

We will start off just by looking at the dataset from a visual perspective and see if we can find the anomalies.


In [18]:
# Import the necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Show the plots in your Jupyter Notebook
%matplotlib inline
# Use a predefined style set
plt.style.use('ggplot')

Let's first create a dummy dataset for ourselves. The dataset will contain just two columns -

* Name of the employees of an organization
* Salaries of those employees (in USD) within a range of 1000 to 2500 (Monthly)

For generating the names (and making them look like real ones), we will use a Python library called Faker (see its documentation). For generating salaries, we will use good old numpy. After generating these, we will combine them in a pandas DataFrame. We are going to generate records for 100 employees. Let's begin.

**Note:** Synthesizing dummy datasets for experimental purposes is indeed an essential skill.

In [19]:

from faker import Faker

# To ensure the results are reproducible
# (recent Faker versions seed through the class method, before instantiation)
Faker.seed(4321)
fake = Faker()

names_list = []
for _ in range(100):
    names_list.append(fake.name())

# Verify if 100 names were generated
len(names_list)

100

In [20]:
# To ensure the results are reproducible
np.random.seed(7)

salaries = []
for _ in range(100):
    # randint draws from [1000, 2500): the upper bound is exclusive
    salary = np.random.randint(1000, 2500)
    salaries.append(salary)

# Verify if 100 salary values were generated
len(salaries)

100

In [21]:

# Create pandas DataFrame
salary_df = pd.DataFrame(
    {'Person': names_list,
     'Salary (in USD)': salaries
    })

# Print a subsection of the DataFrame
salary_df.head()

| | Person | Salary (in USD) |
|---|---|---|
| 0 | Jason Brown | 1175 |
| 1 | Jacob Stein | 2220 |
| 2 | Cody Brown | 1537 |
| 3 | Larry Morales | 1502 |
| 4 | Jessica Hendricks | 1211 |


Let's now manually change the salary entries of two individuals. In reality, this can happen for a number of reasons; for example, the data recording software may have been corrupted at the time of recording the data.

In [22]:

salary_df.at[16, 'Salary (in USD)'] = 23
salary_df.at[65, 'Salary (in USD)'] = 17

In [23]:
# Verify if the salaries were changed
print(salary_df.loc[16])
print(salary_df.loc[65])

Person             Miss Amanda Harris MD
Salary (in USD)                       23
Name: 16, dtype: object
Person             Joyce Bishop
Salary (in USD)              17
Name: 65, dtype: object


To be able to treat the task of anomaly detection as a classification task, we need a labeled dataset. Let's give our existing dataset some labels.

We will first assign all the entries to the class of 0 and then we will manually edit the labels for those two anomalies. We will keep these class labels in a column named class. The label for the anomalies will be 1 (and for the normal entries the labels will be 0).

In [24]:
# First assign all the instances to class 0
salary_df['class'] = 0

# Manually edit the labels for the anomalies
salary_df.at[16, 'class'] = 1
salary_df.at[65, 'class'] = 1

# Verify
print(salary_df.loc[16])

Person             Miss Amanda Harris MD
Salary (in USD)                       23
class                                  1
Name: 16, dtype: object


We now have a labeled dataset with two classes. We are going to use proximity-based anomaly detection for this task. The basic idea is that the proximity of an anomalous data point to its nearest neighbors deviates largely from the proximity of most other data points to their nearest neighbors.

We are going to use a k-NN based detector for this, from a Python library called PyOD, which is developed specifically for anomaly detection. Note that the detector is fitted on the salary values only; the class labels are used later to evaluate it.
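
To build some intuition for proximity-based scoring, here is a small numpy sketch of one common proximity score: the distance from each point to its k-th nearest neighbor. It is an illustration under made-up data, not PyOD's implementation, and `kth_neighbor_distance` is a hypothetical helper name.

```python
import numpy as np

def kth_neighbor_distance(X, k):
    """Distance from every row of X to its k-th nearest other row."""
    # Pairwise Euclidean distances, shape (n, n)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor
    return np.sort(dists, axis=1)[:, k]

# Hypothetical salaries: five ordinary values and one very low one
X = np.array([[1500.], [1520.], [1480.], [1510.], [1495.], [20.]])
scores = kth_neighbor_distance(X, k=2)

print(scores)             # [  10.   20.   20.   10.   15. 1475.]
print(np.argmax(scores))  # 5
```

The ordinary salaries have close neighbors and therefore small scores, while the salary of 20 is far away from everything else and gets by far the largest score.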

In [26]:
# Importing KNN module from PyOD
from pyod.models.knn import KNN

The column Person is not at all useful for the model as it is nothing but a kind of identifier. Let's prepare the training data accordingly.

In [27]:
# Segregate the salary values and the class labels 
X = salary_df['Salary (in USD)'].values.reshape(-1,1)
y = salary_df['class'].values

# Train kNN detector
clf = KNN(contamination=0.02, n_neighbors=5)
clf.fit(X)

KNN(algorithm='auto', contamination=0.02, leaf_size=30, method='largest',
  metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2,
  radius=1.0)

Let's discuss the two parameters we passed into KNN():

* **contamination** - the expected proportion of anomalies in the data, which in our case is 2/100 = 0.02
* **n_neighbors** - the number of neighbors to consider when measuring proximity
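
The sketch below shows one way a contamination proportion can be turned into a cut-off on outlier scores: take the score percentile that leaves the expected share of anomalies above it. The scores are made up, and the snippet is meant as an illustration of the idea rather than as a copy of what PyOD does internally.

```python
import numpy as np

# Hypothetical outlier scores for 100 points; two of them are far larger
scores = np.linspace(10.0, 15.0, 100)
scores[[16, 65]] = [1400.0, 1450.0]

contamination = 0.02  # expected proportion of anomalies

# Cut-off that leaves roughly `contamination` of the scores above it
threshold = np.percentile(scores, 100 * (1 - contamination))
labels = (scores > threshold).astype(int)

print(labels.sum())            # 2
print(np.flatnonzero(labels))  # [16 65]
```

Setting the proportion too high would label ordinary points as anomalies, and setting it too low would miss some, so this parameter deserves some care when the true share of anomalies is unknown.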

Let's now get the prediction labels and the outlier scores of the training data. The higher the score, the more abnormal the data point, so the scores indicate the overall abnormality in the data. These handy features make PyOD a great utility for anomaly detection tasks.



In [28]:
# Get the prediction labels of the training data
y_train_pred = clf.labels_ 
    
# Outlier scores
y_train_scores = clf.decision_scores_

Let's now try to evaluate KNN() with respect to the training data. PyOD provides a handy function for this - evaluate_print().

In [29]:
# Import the utility function for model evaluation
from pyod.utils import evaluate_print

# Evaluate on the training data
evaluate_print('KNN', y, y_train_scores)

KNN ROC:1.0, precision @ rank n:1.0


We see that the KNN() model performed exceptionally well on the training data. evaluate_print() reports two metrics and their scores:

* ROC (the area under the ROC curve)
* Precision @ rank n (the precision among the n highest-scoring points)

Note: While detecting anomalies, we almost always consider ROC and precision, as they give a much better idea of the model's performance than accuracy alone. We have also seen their significance in the earlier sections.
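
For readers who want to see what these two metrics measure, here is a hand-rolled numpy sketch on made-up labels and scores. The helper names `roc_auc` and `precision_at_n` are hypothetical and the code is a simplified illustration; PyOD's own implementation may differ in details such as tie handling.

```python
import numpy as np

def roc_auc(y_true, scores):
    """Probability that a random anomaly scores higher than a random normal point."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every anomaly with every normal point; ties count as half
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def precision_at_n(y_true, scores):
    """Share of true anomalies among the n highest scores (n = number of anomalies)."""
    n = int(y_true.sum())
    top_n = np.argsort(scores)[::-1][:n]
    return y_true[top_n].mean()

# Hypothetical ground truth (1 = anomaly) and outlier scores
y_true = np.array([0, 0, 1, 0, 1, 0])
scores = np.array([0.1, 0.4, 0.9, 0.2, 0.3, 0.5])

print(roc_auc(y_true, scores))         # 0.75
print(precision_at_n(y_true, scores))  # 0.5
```

Both metrics look at how the anomalies are ranked by their scores, so they are not inflated by the large number of normal points in the way plain accuracy is.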

We don't have any test data. But we can generate a sample salary value, right?

In [30]:
# A salary of $37 (an anomaly right?)
X_test = np.array([[37.]])

Let's now test whether the model detects this salary value as an anomaly.



In [31]:
# Check what the model predicts on the given test data point
clf.predict(X_test)

array([1])

We can see the model predicts just right. Let's also see how the model does on a normal data point.

In [32]:
# A salary of $1256 (a normal value)
X_test_normal = np.array([[1256.]])

# Predict
clf.predict(X_test_normal)

array([0])

The model predicted this one as a normal data point, which is correct. With this, we conclude our case study of anomaly detection, which leads us to the concluding section of this article.

ref : https://blog.floydhub.com/introduction-to-anomaly-detection-in-python/
https://github.com/sayakpaul/FloydHub-Anomaly-Detection-Blog/blob/master/FloydHub%20Anomaly%20Detection%20Blog.ipynb