![Cloud-First](../image/CloudFirst.png) 


# SIT742: Modern Data Science 
**(Module: Big Data Analytics)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.
- If you find any issue or bug in this document, please submit an issue at [tulip-lab/sit742](https://github.com/tulip-lab/sit742/issues)


Prepared by **SIT742 Teaching Team**

---



## Session 5C: Isolation Forest

Let's import the required libraries first: numpy, pandas, seaborn and matplotlib. Apart from those, we also need to import IsolationForest from sklearn.ensemble.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

%matplotlib inline

In [None]:
import warnings
warnings.filterwarnings('ignore')

Once the libraries are imported we need to read the data from the csv to the pandas data frame and check the first 10 rows of data.

The data is a collection of Hong Kong arrival time series. It contains a few anomalies (values that are unusually high or low), which we will try to detect.

In [None]:
link_to_data = 'https://raw.githubusercontent.com/tulip-lab/open-data/master/HK2012-2018/Australia.csv'

# Data Preprocessing
df = pd.read_csv(link_to_data)
df = df[['date','arrival','Hong kong','Hong kong dollar']]
df = df.set_index('date', drop=False)
df.head(10)

In [None]:
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df[['arrival','Hong kong','Hong kong dollar']])
df_normal = pd.DataFrame(x_scaled,columns=['arrival','Hong kong','Hong kong dollar'])
df_normal = df_normal.set_index(df.index.values)

## Exploratory Data Analysis

To get a better idea of the data, we draw a violin plot of the three normalised columns (arrival, Hong kong and Hong kong dollar) as shown below. A violin plot is a method of plotting the distribution of numeric data.

Typically a violin plot includes all the data that is in a box plot, a marker for the median of the data, a box or marker indicating the interquartile range, and possibly all sample points, if the number of samples is not too high.

In [None]:
# Consolidate the values of all three columns into one 'value' column,
# and add a 'category' column to distinguish where each value came from.
# .copy() makes each slice an independent frame, so adding a column is safe.
df_arr = df_normal[['arrival']].copy()
df_hk = df_normal[['Hong kong']].copy()
df_hkd = df_normal[['Hong kong dollar']].copy()
df_arr['category'] = 'arrival'
df_hk['category'] = 'hongkong'
df_hkd['category'] = 'hongkongdollar'
df_arr.columns = ['value','category']
df_hk.columns = ['value','category']
df_hkd.columns = ['value','category']
df_eda = pd.concat([df_arr,df_hk,df_hkd],axis=0)


In [None]:
ax = sns.violinplot(x="category", y="value", data=df_eda)

To get a better idea of the outliers we can look at a box plot as well, also known as a box-and-whisker plot. The box shows the quartiles of the dataset, while the whiskers show the rest of the distribution, except for points that are drawn individually as outliers.
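
As a rough illustration of how a box plot decides which points to draw as outliers, the sketch below applies the common 1.5 × IQR rule (the default whisker length in seaborn's `boxplot`) to a small made-up array. The helper name `iqr_outlier_mask` and the sample values are assumptions for illustration only; they are not part of the Hong Kong dataset.

```python
import numpy as np

def iqr_outlier_mask(values, whis=1.5):
    """Return True for points outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])   # first and third quartiles
    iqr = q3 - q1                              # interquartile range
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return (values < lower) | (values > upper)

# Made-up sample: nine values close together and one far away
sample = np.array([10, 12, 11, 13, 12, 11, 50, 12, 10, 13])
mask = iqr_outlier_mask(sample)
print(sample[mask])   # the points a box plot would draw beyond the whiskers
```

This rule looks at one column at a time and assumes a single central bulk of data; Isolation Forest, introduced next, does not rely on quartiles at all.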

In [None]:
ax = sns.boxplot(x="category", y="value", data=df_eda)

## IsolationForest

Once we have completed our exploratory data analysis, it's time to define and fit the model.

We'll create a model variable and instantiate the IsolationForest class. The four parameters that matter most when configuring an Isolation Forest are listed below.

- Number of estimators: n_estimators refers to the number of base estimators or trees in the ensemble, i.e. the number of trees that will get built in the forest. This is an integer parameter and is optional. The default value is 100.

- Max samples: max_samples is the number of samples to be drawn to train each base estimator. If max_samples is more than the number of samples provided, all samples will be used for all trees. The default value of max_samples is 'auto'. If 'auto', then max_samples=min(256, n_samples)

- Contamination: contamination is a parameter that the algorithm is quite sensitive to; it refers to the expected proportion of outliers in the data set. It is used when fitting to define the threshold on the scores of the samples. The default value is 'auto'. If 'auto', the threshold is determined as in the original Isolation Forest paper.

- Max features: max_features is the number (or, if given as a float, the fraction) of features drawn from the full feature set to train each base estimator or tree, so a tree does not have to see every feature in the dataset. The default value is 1.0, which means all features are used.
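
To make the four parameters above concrete, here is a minimal sketch that spells out each of them with its scikit-learn default value. The data is synthetic (200 points drawn around zero plus 5 points far away), which is an assumption for illustration and not the Hong Kong dataset; the variable names `X_demo` and `demo_model` are likewise hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng_demo = np.random.RandomState(42)
# Synthetic data (illustration only): 200 "normal" points and 5 far-away points
X_demo = np.concatenate([
    rng_demo.normal(0, 1, size=(200, 1)),
    rng_demo.uniform(8, 10, size=(5, 1)),
])

demo_model = IsolationForest(
    n_estimators=100,       # number of trees in the forest
    max_samples='auto',     # min(256, n_samples) rows drawn per tree
    contamination='auto',   # threshold chosen as in the original paper
    max_features=1.0,       # fraction of features per tree (1.0 = all)
    random_state=42,        # fixed seed so the result is reproducible
)
demo_model.fit(X_demo)

print(len(demo_model.estimators_))   # number of fitted trees
print(demo_model.max_samples_)       # effective number of rows per tree
demo_labels = demo_model.predict(X_demo)   # 1 = normal, -1 = anomaly
```

Because the synthetic set has 205 rows, fewer than 256, `max_samples='auto'` resolves to all 205 rows per tree here.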


Here we instantiate sklearn's IsolationForest with max_samples=100 and a fixed random state for reproducibility; all other parameters keep their default values.

In [None]:
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(0)
clf = IsolationForest(max_samples=100, random_state=rng)

After defining the model we need to train it on the data. For this we use the `fit()` method, which is called in the code cell of the next section. This method takes one argument, our data of interest (in this case, the `arrival` column of the dataset).

Once the model is trained, `fit()` returns the fitted IsolationForest instance itself.

### Anomaly Score

Now it is time to compute the scores and the anomaly column for the dataset.

Once the model is fitted, we obtain the values of the scores column by calling `decision_function()` on the trained model and passing the arrival column as the argument.

Similarly, we obtain the values of the anomaly column by calling the `predict()` function of the trained model with the same arrival column.

These two columns are collected in a new data frame, `score_df`, together with the original arrival values. As expected, the data frame then has three columns: scores, anomaly and arrival. **A negative score and a value of -1 in the anomaly column indicate an anomaly. A value of 1 in the anomaly column represents normal data.**

Each data point in the training set is assigned an anomaly score by this algorithm. In scikit-learn, lower scores are more abnormal, so if we define our own threshold, a data point is marked as anomalous when its score falls below that threshold.
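
The relationship between the score and the -1/1 label can be checked directly. The sketch below uses synthetic data (an assumption for illustration) and shows that `decision_function()` is `score_samples()` shifted by the fitted `offset_`, and that `predict()` returns -1 exactly where the shifted score is negative. The stricter cut-off at the end, the lowest 1% of scores, is an arbitrary choice made only to show how a custom threshold could be applied.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng_toy = np.random.RandomState(0)
# Synthetic data (illustration only): 200 typical values plus two extreme ones
X_toy = np.concatenate([rng_toy.normal(0, 1, size=(200, 1)),
                        np.array([[9.0], [-9.0]])])

toy_model = IsolationForest(max_samples=100, random_state=0).fit(X_toy)

raw_scores = toy_model.score_samples(X_toy)    # lower = more abnormal
shifted = toy_model.decision_function(X_toy)   # raw score minus offset_
toy_labels = toy_model.predict(X_toy)          # -1 = anomaly, 1 = normal

# decision_function is score_samples shifted by the fitted offset_
same_scores = np.allclose(shifted, raw_scores - toy_model.offset_)
# predict flags exactly the points whose shifted score is negative
same_labels = np.array_equal(toy_labels == -1, shifted < 0)
print(same_scores, same_labels)

# A custom, stricter threshold: flag only the lowest 1% of scores
custom_threshold = np.quantile(shifted, 0.01)
custom_flags = shifted < custom_threshold
print(custom_flags.sum())
```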


In [None]:
clf.fit(df[['arrival']])
score_df = pd.DataFrame()
score_df['scores']=clf.decision_function(df[['arrival']])
score_df['anomaly']=clf.predict(df[['arrival']])
score_df = score_df.set_index(df.index.values)
score_df['arrival'] = df['arrival']
score_df.head(20)

### Output the Anomalies

After adding the scores and anomaly labels for all rows in the data, we print the predicted anomalies. As the table above shows, predicted anomalies have a value of -1 in the anomaly column and a negative score. Using this information we can filter and print the predicted anomalies (two data points in this case) as below.

Note that we could print not only the anomalous values but also their index in the dataset, which is useful information for further processing. 



In [None]:
anomaly=score_df.loc[score_df['anomaly']==-1]
anomaly_index=list(anomaly.index)
print(anomaly)

Let's draw the time series and add a red dashed line at the date of each anomaly.

In [None]:
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.figsize':(20,5), 'figure.dpi':120})
positions = list(range(len(score_df.index)))

anomaly_index_seq = []
for i in anomaly_index:
    anomaly_index_seq.append(np.where((score_df.index.values == i))[0][0])

series = score_df['arrival']
ax = series.plot(rot=45)
ax.set_xticks(positions)
ax.set_xticklabels(score_df.index.values)
for xc in anomaly_index_seq:
    ax.axvline(x=xc,color='r',linestyle='dashed',linewidth=1)
plt.show()

It can be seen that the isolation forest captures anomalies at both the peaks and the valleys of the arrival series.

## Another IsolationForest Example: Anomaly Scores for Multiple Columns

In [None]:
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(0)

# Helper function to train and predict IF model for a feature
def train_and_predict_if(df, feature):
    clf = IsolationForest(max_samples=100, random_state=rng)
    clf.fit(df[[feature]])
    pred = clf.predict(df[[feature]])
    scores = clf.decision_function(df[[feature]])
    stats = pd.DataFrame()
    stats['val'] = df[feature]
    stats['score'] = scores
    stats['outlier'] = pred 
    stats['min'] = df[feature].min()
    stats['max'] = df[feature].max()
    stats['mean'] = df[feature].mean()
    stats['feature'] = [feature] * len(df)
    return stats

# Helper function to print outliers
def print_outliers(df, feature, n):
    print(feature)
    print(df[feature].head(n).to_string(), "\n")

In [None]:
# Run through all features and save the outlier scores for each feature
num_columns = [col for col in df_normal.columns if col not in df_normal.select_dtypes('object').columns and col != 'date']
result = pd.DataFrame()
for feature in num_columns:
    stats = train_and_predict_if(df_normal, feature)
    result = pd.concat([result, stats])
    
# Gather the samples for each feature, sorted so the lowest (most anomalous) scores come first
outliers = {feature_name: grp.drop('feature', axis=1)
       for feature_name, grp in result.sort_values(by='score').groupby('feature')}

# Print the top 10 outlier samples for a few selected features
n_outliers = 10
print_outliers(outliers, "arrival", n_outliers)
print_outliers(outliers, "Hong kong", n_outliers)
print_outliers(outliers, "Hong kong dollar", n_outliers)