## **Extracting the zipfile**

- Run this only once, the first time the notebook is opened

In [None]:
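import os
import zipfile

# List the zip archives available in Drive (handy for checking the file name)
zip_files = [i for i in os.listdir('drive/MyDrive') if i.endswith('.zip')]

# Extract the dataset next to the archive; the context manager closes the file afterwards
with zipfile.ZipFile('drive/MyDrive/anomaly_scoring_data.zip') as z:
    z.extractall('drive/MyDrive')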

### Approach : 2 (Static Approach)

### **Statistics Approach**


In this approach, we build a fixed formula: derive new parameters from the fields of the dataframe, then check whether they can be combined into an anomaly score that also takes the past anomaly scores into account.

In [None]:
pip install chart_studio                                           # For visualization purpose

Collecting chart_studio
  Downloading https://files.pythonhosted.org/packages/ca/ce/330794a6b6ca4b9182c38fc69dd2a9cbff60fd49421cb8648ee5fee352dc/chart_studio-1.1.0-py3-none-any.whl (64kB)
Installing collected packages: chart-studio
Successfully installed chart-studio-1.1.0


In [None]:
# Libraries

# Standard
import os
import pathlib
import seaborn as sns
import matplotlib.style as style
sns.set(style = 'whitegrid',color_codes = True)
import matplotlib.pyplot as plt
%matplotlib inline

# Manipulation
import pandas as pd
import numpy as np

# Plotly
import chart_studio.plotly as py
import plotly.express as px
import plotly.graph_objects as go
import colorlover as cl
from plotly.subplots import make_subplots


# Sklearn
from sklearn.preprocessing import MinMaxScaler,StandardScaler  # For the purpose of shifting and scaling the data

In [None]:
import os
import pathlib
data_path = pathlib.Path('drive/MyDrive/anomaly_scoring_data')    # Parent directory of the data files
csv_files = [f for f in os.listdir(data_path) if f.endswith('.csv')]   # Only the CSV files in that directory

In [None]:
import pandas as pd
df = pd.read_csv(data_path/csv_files[0])                          # Sample CSV file; the steps are generalized to all CSV files later
df['is_anomaly'] = df['is_anomaly'].astype(int)                   # Casting the is_anomaly flag to an integer (0/1)

## Framing the Problem
### The task is to output an anomaly score similar to the one a human would assign. If I were the person in charge of scoring an anomaly, I would proceed as follows:

1. Train a model that outputs a number for each timestamp, taking all of the previous data into consideration.
2. Check whether the prediction and the actual value differ by a **considerable amount**.
3. This **considerable amount** is the crux: once we can quantify it, the task is essentially solved.
4. Since the prediction has already been done for us and is given in the dataframe, create a column `distance` that quantifies the gap between the actual value and the predicted value.
5. Then, after observing the distances of the anomalous data points, apply some maths to turn them into an anomaly score.

In [None]:
df['distance'] = np.abs(df['value'] - df['predicted'])           # Distance, which would be responsible for predicting the anomaly

In [None]:
from scipy import stats
import numpy as np
z = np.abs(stats.zscore(df['distance']))                        # Absolute z-score of the distance, used to define the outliers
df['zscore'] = z                                                # If the z-score flags enough of the outliers, it can be used to determine the anomaly score
threshold = 3                                                   # Points more than 3 standard deviations from the mean would be considered outliers
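
The cell above defines `threshold` but does not yet apply it. As a minimal sketch of how the 3-standard-deviation rule could be applied, the hypothetical helper `zscore_outliers` below flags points whose absolute z-score exceeds the threshold. It uses the population standard deviation, which is assumed to match the default of `scipy.stats.zscore`, and the toy distances are made up for illustration.

```python
import numpy as np

def zscore_outliers(distance, threshold=3.0):
    """Flag points whose absolute z-score exceeds `threshold` (hypothetical helper)."""
    d = np.asarray(distance, dtype=float)
    std = d.std()  # population standard deviation (ddof=0)
    if std == 0:
        # A constant series has no spread, so nothing can be an outlier
        return np.zeros(d.shape, dtype=bool)
    z = np.abs((d - d.mean()) / std)
    return z > threshold

# Toy data: twenty small distances and one large one
toy_distance = np.array([0.1] * 20 + [5.0])
flags = zscore_outliers(toy_distance)
print(flags.sum())  # number of points flagged as outliers
```

Comparing `flags` with the `is_anomaly` column would show how many labelled anomalies the z-score rule actually recovers.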

In [None]:
# Figure layout
fig = make_subplots(rows=1, cols=1, insets=[{'cell': (1,1)}])
fig.update_layout(title="Boxplot of the Z Score of Distance column",font=dict(size=12,color="#7f7f7f"),
                 template = "ggplot2", title_font_size = 20, hovermode= 'closest')
# Figure data
fig.add_trace(go.Box(x = df['zscore'], y = df['is_anomaly'],boxpoints = 'all',jitter = 0.1, 
                     pointpos = -1.6, marker_color = 'rgb(210,105,30)', boxmean = True),
             row = 1, col = 1)
fig.update_traces(orientation='h')

#### From the plot above, I treated every point whose distance is greater than 0.0001*SD (SD = standard deviation) as an outlier


The next 3 cells document my experimentation on the dataframe: I was trying to work out how an anomaly score could be assigned under the following criteria:

* The anomaly score should reflect both the deviation from the recent past and the anomaly's deviation from all the past data.
* This suggests an intuition: take the maximum value up to the given timestamp T, shift and scale all the data, and compare each point with the maximum deviation seen so far. That could be a potential candidate.



### Why could the above method be a potential candidate?

* If I were asked to score an anomaly, I would first look at the maximum deviation seen so far (this was my first step), then scale all the points around 0, take the ratio of the scaled data point to the scaled maximum, and output that ratio as the score.
* As a result, a data point whose deviation is higher than the previous maximum automatically gets 100; otherwise, it gets a score relative to the most deviated data point.
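
The idea in the two bullets above can be sketched with running minima and maxima. This is only an illustration under stated assumptions: `running_minmax_score` is a hypothetical helper, the scaling is a plain min-max over the history up to each timestamp, and the toy values are made up.

```python
import numpy as np

def running_minmax_score(distance, is_anomaly):
    """Score each anomalous point against the history seen so far (hypothetical helper)."""
    d = np.asarray(distance, dtype=float)
    flag = np.asarray(is_anomaly).astype(bool)
    run_min = np.minimum.accumulate(d)  # smallest distance up to each timestamp
    run_max = np.maximum.accumulate(d)  # largest distance up to each timestamp
    span = run_max - run_min
    # Min-max scale each point against its own history; a flat history scores 0
    scaled = np.divide(d - run_min, span, out=np.zeros_like(d), where=span > 0)
    scores = np.clip(100.0 * scaled, 0, 100)
    # Non-anomalous points always get 0
    return np.where(flag, scores, 0.0)

toy_scores = running_minmax_score([1, 2, 4, 3, 8], [0, 0, 1, 1, 1])
print(toy_scores)
```

In this toy series the anomalies at distances 4 and 8 each set a new maximum and score 100, while the anomaly at distance 3 scores about 67 because it is compared with the earlier maximum of 4.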

In [None]:
fraction = 0.0001
thres = df.loc[df['is_anomaly']==1,'distance'].max()   # Largest distance among the labelled anomalies
# Assigning the maximum score (100) to the statistical outliers, i.e. the points whose distance exceeds 0.0001*max
df.loc[df['distance']>fraction*thres,'scores'] = 100

In [None]:
print(df[df['distance']>fraction*thres]['is_anomaly'].sum())
print(df[df['is_anomaly']==1]['is_anomaly'].sum())

776
776


The two counts match: all 776 labelled anomalies exceed the threshold. This suggests that statistics can be used here, so the next step is to work out the formula for the anomaly_score.
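
Matching counts only show that no labelled anomaly falls below the cutoff; they do not show how many normal points also exceed it. A minimal sketch for checking both sides, assuming a hypothetical helper `threshold_precision_recall` and made-up toy data:

```python
import numpy as np

def threshold_precision_recall(distance, is_anomaly, cutoff):
    """Precision and recall of the rule `distance > cutoff` (hypothetical helper)."""
    d = np.asarray(distance, dtype=float)
    y = np.asarray(is_anomaly).astype(bool)
    flagged = d > cutoff
    true_pos = int(np.sum(flagged & y))
    # Precision: share of flagged points that are labelled anomalies
    precision = true_pos / int(flagged.sum()) if flagged.any() else 0.0
    # Recall: share of labelled anomalies that were flagged
    recall = true_pos / int(y.sum()) if y.any() else 0.0
    return precision, recall

# Toy data: two normal points and two anomalies, cutoff set to 0.0001 * max anomaly distance
precision, recall = threshold_precision_recall(
    [0.5, 0.2, 3.0, 4.0], [0, 0, 1, 1], cutoff=0.0001 * 4.0)
print(precision, recall)
```

With such a small cutoff the toy example reaches full recall but only 50% precision, which is why the rule is worth checking against the non-anomalous rows as well.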



### Some more experiments that take more points into consideration: with the earlier threshold and fraction there were 0 points, so the threshold is changed

In [None]:
fraction = 3
thres = df.loc[df['is_anomaly']==1,'distance'].describe()['75%']

In [None]:
s = MinMaxScaler()                     # Scaling

# Assigning the score, based on distance

'''

OPERATIONS FOR ANOMALY SCORE
STEP 1: Scaling down the distance with the help of Min Max Scaler
STEP 2: Dividing each of the number by the maximum value of the scaled version
STEP 3: Multiply the number by hundred
STEP 4: Clipping the value between 0 and 100


'''
mask = (df['is_anomaly']==1) & (df['distance']<=fraction*thres)   # Anomalies within the new threshold
df.loc[mask,'scores'] = np.abs(s.fit_transform(df.loc[mask,'distance'].values.reshape(-1,1))).ravel()
df.loc[mask,'scores'] /= df.loc[mask,'scores'].max()
df.loc[mask,'scores'] *= 100
df.loc[mask,'scores'] = np.clip(df.loc[mask,'scores'],0,100)
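
As a small worked example of the four steps listed in the cell above, the sketch below applies them to a made-up array with plain NumPy instead of `MinMaxScaler`. The helper name `minmax_score` is hypothetical.

```python
import numpy as np

def minmax_score(distances):
    """Apply the four scoring steps to an array of distances (hypothetical helper)."""
    d = np.asarray(distances, dtype=float)
    span = d.max() - d.min()
    # STEP 1: min-max scaling (a constant array scales to all zeros)
    scaled = (d - d.min()) / span if span > 0 else np.zeros_like(d)
    # STEP 2: divide by the maximum of the scaled values
    peak = scaled.max()
    ratio = scaled / peak if peak > 0 else scaled
    # STEP 3 and STEP 4: multiply by 100 and clip to [0, 100]
    return np.clip(100.0 * ratio, 0, 100)

print(minmax_score([1, 3, 5]))  # → [  0.  50. 100.]
```

After min-max scaling the largest value is already 1, so STEP 2 only matters as a safeguard; the smallest distance in the group always scores 0.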


**Visualizing if the score satisfies the criteria mentioned in the Notion's sheet**

In [None]:
layout = dict(plot_bgcolor='white',
              margin=dict(t=20, l=20, r=20, b=20),
              xaxis=dict(title='Time stamp',
                         linecolor='#d9d9d9',
                         showgrid=False,
                         mirror=True),
              
              yaxis=dict(title='Value',
                         linecolor='#d9d9d9',
                         showgrid=False,
                         mirror=True))
data = go.Scatter(x=list(df[df['is_anomaly']==1]['timestamp'])[:50],
                  y=list(df[df['is_anomaly']==1]['value'])[:50],
                  text=list(df[df['is_anomaly']==1]['scores'])[:50],
                  textposition='top right',
                  #textfont=dict(color='#E58606'),
                  mode='lines+markers+text',
                  marker=dict(color='#5D69B1', size=8),
                  line=dict(color='#52BCA3', width=1, dash='dash'),
)
fig = go.Figure(data=data, layout=layout)
fig.show()

SEEMS TO WORK

### Making the whole pipeline:

1. Take the CSV file
2. Make a distance column
3. Create a function that computes an anomaly score
4. Write the scores back to the CSV file

In [None]:
# Making a folder (exist_ok avoids an error when the cell is re-run)
target_path = pathlib.Path('/content/drive/MyDrive/Cliff.ai')
target_path.mkdir(parents=True, exist_ok=True)

In [None]:
def calculate_score(df,id):

  '''
  The function would:
  Input: Take a dataframe with `distance` and `is_anomaly` columns, and the timestamp's index (aka, id)
  Process: Take all the entries up to and including timestamp 'id'
  IF :
  * NON-ANOMALOUS: RETURN 0
  ELSE
  * Apply the Normalization (min-max scaling of the distances seen so far)
  * Express the target's scaled distance relative to the scaled maximum
  Output: A rounded score between 0 and 100
  '''
  s = MinMaxScaler()
  if df.loc[id,'is_anomaly']<1:
    return 0
  target = np.array(df.loc[id,'distance']).reshape(-1,1)  # Taking the target value
  entries_ano = df.loc[:id,'distance'].values.reshape(-1,1)  # Taking all the entries up to timestamp 'id'
  entries_ano = np.abs(s.fit_transform(entries_ano))    # Apply the Normalization

  '''
      * The line below was just my intuition that the mean could also be a factor in determining the anomaly score
      * Adding the mean of the distance, as that could be an additional factor in defining the score
      * NOTE: as written, the scaler is refit on the single target value, so this mean term appears to evaluate to 0
  '''

  target = np.abs(s.transform(target))  + np.array(np.abs(np.mean(s.fit_transform(np.array(df.loc[id,'distance']).reshape(-1,1))))).reshape(-1,1)
  first_value = 100*target/np.max(entries_ano+1e-3) # The small constant keeps the denominator from becoming zero

  # Clipping the value between 0 and 100
  return round(np.clip(first_value,0,100)[0][0])

# Looping over all the values
l = np.array([calculate_score(df,i) for i in df.index])

# Plotting all the entries
layout = dict(plot_bgcolor='white',
              margin=dict(t=20, l=20, r=20, b=20),
              xaxis=dict(title='Time stamp',
                         linecolor='#d9d9d9',
                         showgrid=False,
                         mirror=True),
              yaxis=dict(title='Value',
                         linecolor='#d9d9d9',
                         showgrid=False,
                         mirror=True))
data = go.Scatter(x=list(df[df['is_anomaly']==1]['timestamp'])[:100],
                  y=list(df[df['is_anomaly']==1]['distance'])[:100],
                  text=list(l[(df['is_anomaly']==1).values])[:100],   # Scores of the anomalous rows, aligned with x and y
                  textposition='top right',
                  textfont=dict(color='#E58606'),
                  mode='lines+markers+text',
                  marker=dict(color='#5D69B1', size=8),
                  line=dict(color='#52BCA3', width=1, dash='dash'),
)
fig = go.Figure(data=data, layout=layout)
fig.show()

In [None]:
for file in csv_files:

  # STEP 1: Reading the data frame
  df = pd.read_csv(data_path/file)

  # STEP 2: Creating a list for storing the anomaly score
  l = []  

  # STEP 3: Creating a distance column
  df['distance'] = np.abs(df['value'] - df['predicted'])

  # STEP 4: Calculating the score
  for i in list(df.index):
    l.append(calculate_score(df,i))
  l = np.array(l)
  df['anomaly_score'] = l

  # STEP 5: Visualizing the Scores
  layout = dict(plot_bgcolor='white',
              margin=dict(t=20, l=20, r=20, b=20),
              xaxis=dict(title='Time stamp',
                         linecolor='#d9d9d9',
                         showgrid=False,
                         mirror=True),
              yaxis=dict(title='Value',
                         linecolor='#d9d9d9',
                         showgrid=False,
                         mirror=True))
  
  # TIME SERIES
  data = go.Scatter(x = df['timestamp'],y = df['value'],name = 'Time Series')
  fig = go.Figure(data=data, layout=layout)

  # SCATTER PLOTS FOR ANOMALIES
  fig.add_trace(go.Scatter(x=list(df[df['is_anomaly']==1]['timestamp']),
                    y=list(df[df['is_anomaly']==1]['value']),
                    text=list(df[df['is_anomaly']==1]['anomaly_score']),
                    textposition='top right',
                    textfont=dict(color='#E58606'),
                    mode='lines+markers+text',
                    marker=dict(color='#5D69B1', size=8),
                    line=dict(color='#52BCA3', width=1, dash='dash'),name = 'anomalous data point'))
  fig.show()

  # STEP 6: DROPPING THE DISTANCE COLUMN, AS IT IS NOT REQUIRED
  df.drop('distance',axis = 1,inplace=True)

  # STEP 7: SAVING THE DATA
  df.to_csv(target_path/file,index = False)