# Example of training an anomaly detection to detect data outliers


### Description:
The model to be trained have a task to calculte and display the probability of being a fault information (data instance inputed by the user).
<br/>**NB:** you don't understand what's going on? visit the [Github Repository](link) for detailed information about the project.

## Dependancies

In [1]:
import pandas as pd # for data analytics
import numpy as np # for numerical computation
from matplotlib import pyplot as plt, style # for ploting
import seaborn as sns # for ploting
from sklearn.metrics import fbeta_score, precision_score, recall_score, confusion_matrix # for evaluation
import itertools

style.use('ggplot')
np.random.seed(42) 

### Example of dataset to sell to the user (entreprises side)
the dataset is from [kaggle](https://www.kaggle.com/datasets/dev0914sharma/customer-clustering?select=segmentation+data.csv) website

In [2]:
dataset = pd.read_csv('../datasets/segmentation data.csv')
dataset.head()

Unnamed: 0,ID,Sex,Marital status,Age,Education,Income,Occupation,Settlement size
0,100000001,0,0,67,2,124670,1,2
1,100000002,1,1,22,1,150773,1,2
2,100000003,0,0,49,1,89210,0,0
3,100000004,0,0,45,1,171565,1,1
4,100000005,0,0,53,1,149031,1,1


### Data preprocessing

In [3]:
dataset = dataset.drop('ID', axis=1)

dataset['Age'] = np.log(dataset['Age'] + 1)
dataset['Income'] = np.log(dataset['Income'] + 1)

from sklearn.preprocessing import StandardScaler, Normalizer

scaler = StandardScaler()
normalizer = Normalizer()

# i'll use the normalizer
X = normalizer.fit_transform(dataset)

### Save the normalizer model as pickle file for future uses

In [6]:
import pickle

pickle_out = open('../deployment/normalizer.pkl', 'wb')
pickle.dump(normalizer, pickle_out)
pickle_out.close()

In [7]:
X = pd.DataFrame(X)

### Anomaly detection

In [8]:
from scipy.stats import multivariate_normal

mu = X.mean(axis=0)
sigma = X.cov()

model = multivariate_normal(cov=sigma, mean=mu)

# predict probability of each instance
probas = model.pdf(X)

In [9]:
#proba_max --> 100%
max_proba = probas.max()
max_proba

65444665.36036694

### Prediction
**Note:** the problem i face in this step is that tere is a very big variance between probabilities values. i tried to normalize them using **Normalizer()** method, then scale them using **StandardScaler()** method but nothing changed.
the max value of probability is 

In [10]:
def predict_proba(x):
    x[2] = np.log(x[2] + 1)
    x[4] = np.log(x[4] + 1)
    x = normalizer.fit_transform([x])
    
    p = model.pdf(x) / max_proba
    return round(min(p*1000, 99), 3) #to display more readable probabilities

In [11]:
# test
sample = np.array([1, 1, 22, 1, 150773, 1, 2])
print(f'the input instance is accure with probability: {predict_proba(sample)}%')

the input instance is accure with probability: 73.79%


### Save the model as pickle file (.pkl)

In [12]:
import pickle

pickle_out = open('../deployment/model.pkl', 'wb')
pickle.dump(model, pickle_out)
pickle_out.close()

# End!