# Demonstration of Quantiles fitting + Linear model

This notebook demonstrates the anomalydetector module, which implements an anomaly detection model for solar panels, on a dataset of 9 solar panels in California. The model is based on the following paper: "insert link".

### Library import 

In [23]:
from anomalydetector import MultiDataHandler, QLinear
from anomalydetector import save, load, train_test_split
import pandas as pd
import os
import numpy as np
from dask.distributed import Client
from scipy.stats import beta

### Data loading

We are using a dataset of 9 solar panels situated in California. 

In [6]:
folder_path = "C:/Users/coren/Solar-data-tools/site_data"

# Load every CSV file in the folder into a dict keyed by filename.
dfs = {}
for filename in os.listdir(folder_path):
    if filename.endswith(".csv"):
        file_path = os.path.join(folder_path, filename)
        df = pd.read_csv(file_path)
        # Parse the timestamps and shift them back by 8 hours.
        df['measured_on'] = pd.to_datetime(df['measured_on']) + pd.Timedelta(hours=-8)
        # Keep only the first two columns.
        dfs[filename] = df.iloc[:, :2]
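
To make the effect of the loading loop concrete, here is a minimal sketch on a synthetic frame. The column names other than measured_on are hypothetical, and the reading of the 8 hour shift as a UTC to Pacific Standard Time conversion is an assumption, since the notebook does not state the original time zone.

```python
import pandas as pd

# Synthetic stand-in for one site file; 'ac_power' and 'extra_sensor'
# are hypothetical column names.
raw = pd.DataFrame({
    "measured_on": ["2020-06-01 20:00:00", "2020-06-01 20:15:00"],
    "ac_power": [1500.0, 1480.0],
    "extra_sensor": [0.1, 0.2],
})

# Parse the strings and apply the same fixed offset as in the loading loop.
raw["measured_on"] = pd.to_datetime(raw["measured_on"]) + pd.Timedelta(hours=-8)

# Keep only the first two columns, as the loop does.
site = raw.iloc[:, :2]
print(site)
```

A fixed offset ignores daylight saving time, which keeps the daily cycle aligned to the same clock hours all year.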

As an extension of the DataHandler class, the MultiDataHandler class takes as an argument either a list of pandas DataFrames or a single DataFrame containing all the time series. It creates a MultiDataHandler object able to run the pipeline for each time series and to align and dilate the DataFrames.

In [9]:
# If the DataFrame index does not represent time, specify the name of the column that does.
dhs = MultiDataHandler(data_frames=list(dfs.values()), datetime_col='measured_on')
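
If you prefer the single-DataFrame form, one way to build it is to join the per-site frames on their timestamps. This is only a sketch with hypothetical column names; the exact layout that MultiDataHandler expects for a combined frame is an assumption here, so check the class documentation before relying on it.

```python
import pandas as pd

# Two hypothetical sites whose records start 15 minutes apart.
site_a = pd.DataFrame({
    "measured_on": pd.date_range("2020-01-01 00:00", periods=3, freq="15min"),
    "ac_power_a": [0.0, 1.0, 2.0],
})
site_b = pd.DataFrame({
    "measured_on": pd.date_range("2020-01-01 00:15", periods=3, freq="15min"),
    "ac_power_b": [5.0, 6.0, 7.0],
})

# Outer join on the timestamp index: one column per site, NaN where a site
# has no record for a given timestamp.
wide = pd.concat(
    [df.set_index("measured_on") for df in (site_a, site_b)],
    axis=1,
)
print(wide)
```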

### Running the pipelines and aligning the datasets

The align method runs the pipelines for each DataHandler (if not already done) and computes the list of common valid days.

In [10]:
dhs.align()

Running pipelines: 100%|██████████| 9/9 [03:19<00:00, 22.20s/it]
Aligning datasets: 100%|██████████| 9/9 [00:00<00:00, 158.78it/s]
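
The idea of "common valid days" can be sketched as an intersection of per-site validity masks. The masks below are hypothetical and assume that every site is already indexed on the same calendar; the actual bookkeeping inside align may differ.

```python
import numpy as np

# Hypothetical per-site masks: True where a day passed the quality checks.
valid_days = {
    "site_a": np.array([True, True, False, True, True]),
    "site_b": np.array([True, False, True, True, True]),
    "site_c": np.array([True, True, True, True, False]),
}

# A day is kept only if every site marks it as valid.
common = np.logical_and.reduce(list(valid_days.values()))
common_idx = np.flatnonzero(common)
print(common_idx)  # prints [0 3]
```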


The dilate method applies the dilation process to each time series. The result is stored in the object rather than returned. If a failure scenario has already been computed, the method recomputes it with default parameters.

In [11]:
dhs.dilate(ndil = 51)
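
As a rough illustration, the sketch below assumes that dilation means resampling the daylight part of each day onto a fixed grid of ndil points, so that days of different length become directly comparable. The helper dilate_day and the sunrise/sunset indices are hypothetical and are not the library's implementation.

```python
import numpy as np

def dilate_day(power, sunrise_idx, sunset_idx, ndil):
    """Resample the daylight part of one day onto ndil evenly spaced points."""
    daylight = power[sunrise_idx:sunset_idx + 1]
    src = np.linspace(0.0, 1.0, daylight.size)   # original daylight positions
    dst = np.linspace(0.0, 1.0, ndil)            # common dilated grid
    return np.interp(dst, src, daylight)

# Two synthetic clear-sky days with different daylight lengths (96 samples per day).
t = np.arange(96)
short_day = np.clip(np.sin(np.pi * (t - 30) / 36), 0, None)  # daylight from 30 to 66
long_day = np.clip(np.sin(np.pi * (t - 20) / 56), 0, None)   # daylight from 20 to 76

ndil = 51
dil_short = dilate_day(short_day, 30, 66, ndil)
dil_long = dilate_day(long_day, 20, 76, ndil)
print(dil_short.shape, dil_long.shape)
```

After dilation both days have the same number of samples and nearly the same shape, which is what makes a day-by-day comparison across seasons possible.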

The generate_failure method takes a target as an argument and creates a failure scenario for the given time series. You can also provide a loss distribution in the scipy.stats format (default: uniform distribution), and a proportion_totalday value between 0 and 1, which specifies the proportion of days with a failure lasting the entire day.

In [None]:
example_dist = beta(a = 2, b = 5)
dhs.generate_failure('ac_power_inv_30219',
                     loss_distribution=example_dist,
                     proportion_totalday=0.5)
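
To show how a loss distribution and proportion_totalday can interact, here is a minimal sketch of failure injection on a synthetic matrix. It assumes that each failure multiplies the power by (1 - loss) and that a failure covers the whole day with probability proportion_totalday; the number of failure days and the matrix layout are arbitrary choices, not the behavior of generate_failure.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

# Synthetic dilated power matrix: rows are samples within a day, columns are days.
n_samples, n_days = 51, 10
profile = np.sin(np.linspace(0, np.pi, n_samples))
clean = np.tile(profile[:, None], (1, n_days))

loss_distribution = beta(a=2, b=5)   # fraction of power lost, in [0, 1]
proportion_totalday = 0.5            # chance that a failure spans the whole day

failed = clean.copy()
labels = np.zeros(n_days, dtype=int)
for day in rng.choice(n_days, size=4, replace=False):   # 4 failure days (arbitrary)
    loss = loss_distribution.rvs(random_state=rng)
    if rng.random() < proportion_totalday:
        start, stop = 0, n_samples                       # failure lasts all day
    else:
        start = rng.integers(0, n_samples - 1)           # failure starts mid-day
        stop = rng.integers(start + 1, n_samples + 1)
    failed[start:stop, day] *= 1.0 - loss
    labels[day] = 1

print(labels)
```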

Finally, you can split the dataset into a training set and a test set. The first is used to fit the model and the second to test it.

In [41]:
dh_train,dh_test = train_test_split(dhs,test_size=0.2,shuffle=False)
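
With shuffle=False the split is presumably chronological, which avoids training on days that come after the test period. The helper below is a hypothetical sketch of such a split on day indices, not the module's train_test_split.

```python
import numpy as np

def chronological_split(n_days, test_size=0.2):
    """Return index arrays for an unshuffled split: earliest days train, latest days test."""
    n_test = int(round(n_days * test_size))
    idx = np.arange(n_days)
    return idx[: n_days - n_test], idx[n_days - n_test:]

train_idx, test_idx = chronological_split(10, test_size=0.2)
print(train_idx, test_idx)
```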

### QLinear model

The QLinear class implements the model presented in the following paper: "insert link". First, we fit a SmoothPeriodicQuantile model on each time series to transform the data into a Gaussian variable. Second, we fit a linear regression to predict the target from the other time series. Finally, we classify each scenario as failure or non-failure from the residuals.
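
The three steps can be illustrated on synthetic data. This sketch swaps in much simpler pieces than the real model: empirical quantiles instead of SmoothPeriodicQuantile, ordinary least squares without lags or a basis expansion, and a fixed residual threshold instead of a trained classifier. All names and numbers are assumptions chosen for illustration.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
n_train, n_test = 300, 100

def simulate(weather):
    """Three correlated sites driven by a shared irradiance factor plus noise."""
    clean = weather[:, None] * np.array([1.0, 0.9, 1.1])
    return clean + rng.normal(0.0, 0.005, size=clean.shape)

train = simulate(rng.uniform(0.2, 1.0, size=n_train))
# Test days use mid-range irradiance so a 50 % loss stays inside the training range.
test = simulate(rng.uniform(0.5, 0.9, size=n_test))
labels = np.zeros(n_test, dtype=int)
labels[::10] = 1
test[labels == 1, 0] *= 0.5          # inject a 50 % loss on the target site

# Step 1: map each site to an approximately Gaussian variable
# through the empirical quantiles of the training data.
probs = (np.arange(1, n_train + 1) - 0.5) / n_train
z_grid = np.array([NormalDist().inv_cdf(p) for p in probs])

def gaussianize(x, reference):
    return np.interp(x, np.sort(reference), z_grid)

z_train = np.column_stack([gaussianize(train[:, j], train[:, j]) for j in range(3)])
z_test = np.column_stack([gaussianize(test[:, j], train[:, j]) for j in range(3)])

# Step 2: linear regression of the target (column 0) on the other sites.
X_train = np.column_stack([np.ones(n_train), z_train[:, 1:]])
coef, *_ = np.linalg.lstsq(X_train, z_train[:, 0], rcond=None)

# Step 3: flag days whose residual is far below what was seen in training.
res_train = z_train[:, 0] - X_train @ coef
threshold = -4.0 * res_train.std()
X_test = np.column_stack([np.ones(n_test), z_test[:, 1:]])
res_test = z_test[:, 0] - X_test @ coef
pred = (res_test < threshold).astype(int)
print(pred)
```

Working in the Gaussian space puts every site on the same scale, so a single linear model and a single residual scale can serve all irradiance levels.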

In [18]:
sites = list(dhs.dil_mat.keys())
ndil = dhs.ndil()
target = dhs.target

In [20]:
model = QLinear(sites=sites,
                ndil=ndil,
                target=target)

#### Fitting

We can fit either the whole model at once or one part at a time.

In [None]:
client = Client(n_workers=4, threads_per_worker=2, memory_limit='15GB')
print(client.dashboard_link)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 59356 instead


http://127.0.0.1:59356/status




In [52]:
client.close()

In [28]:
param = {
    'weight_quantiles': 5,  # Weight for the quantile fit
    'quantiles': np.array([0.02, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.98]),  # Quantiles to fit
    'solver_quantiles': 'mosek',  # Solver used for the quantile fit
    'num_harmonics': [30, 3],  # Number of harmonics for the first and second periods
    # 'client': client,  # Dask client used to parallelize the computation
    'nlag': 3,  # Lag to use; the model can be retrained with another nlag without changing everything else
    'num_basis': 8,  # Number of Chebyshev polynomials used
    # 'weight_linear':  # Weights for the linear regression, of size (num_basis, n_features)
    # 'lambda_range':  # Range of lambda for the global weight
    'num_split': 5,  # Number of splits to use for the K-fold
    # 'model_residuals':  # Model to use for the residuals classification
    'train_size': 0.8,
}
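
Two of these parameters are easier to picture with a small example. The sketch below builds lagged features for a given nlag and evaluates a Chebyshev basis with num_basis terms. How QLinear actually combines them is not shown in this notebook, so the helper lagged_features and the choice of a within-day position as the basis variable are assumptions.

```python
import numpy as np
from numpy.polynomial import chebyshev

def lagged_features(x, nlag):
    """Stack x[t], x[t-1], ..., x[t-nlag] as columns; the first nlag rows are dropped."""
    n = len(x)
    return np.column_stack([x[nlag - k : n - k] for k in range(nlag + 1)])

x = np.arange(10.0)
lags = lagged_features(x, nlag=3)
print(lags.shape)   # prints (7, 4)

# Chebyshev basis evaluated on a position rescaled to [-1, 1].
num_basis = 8
position = np.linspace(-1.0, 1.0, 51)
basis = chebyshev.chebvander(position, num_basis - 1)
print(basis.shape)  # prints (51, 8)
```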

In [44]:
model.fit_quantiles(dh_train,param=param)

100%|███████████████████████████████████████████| 10/10 [01:19<00:00,  7.97s/it]
100%|███████████████████████████████████████████| 10/10 [01:20<00:00,  8.08s/it]
100%|███████████████████████████████████████████| 10/10 [01:15<00:00,  7.59s/it]
100%|██████████████████████████████████████████| 10/10 [32:54<00:00, 197.44s/it]
100%|███████████████████████████████████████████| 10/10 [00:59<00:00,  5.92s/it]
100%|███████████████████████████████████████████| 10/10 [01:00<00:00,  6.01s/it]
100%|███████████████████████████████████████████| 10/10 [01:01<00:00,  6.11s/it]
100%|███████████████████████████████████████████| 10/10 [01:00<00:00,  6.07s/it]
100%|███████████████████████████████████████████| 10/10 [01:04<00:00,  6.40s/it]


In [49]:
model.fit_linear(param = param)

In [50]:
model.fit_residuals(param=param)

{'accuracy_train': 0.9228395061728395,
 'f1_score_train': 0.9163879598662207,
 'accuracy_val': 0.8888888888888888,
 'f1_score_val': 0.8928571428571429}
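
For reference, accuracy and F1 score for a binary failure label can be computed as follows. This is a generic sketch on hypothetical labels; it assumes that 1 denotes a failure day and does not reproduce the numbers above.

```python
import numpy as np

def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 score for binary labels, with 1 as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    accuracy = float(np.mean(y_true == y_pred))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(acc, f1)
```

The F1 score is worth reporting alongside accuracy because failure days are usually much rarer than normal days.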

In [51]:
model.fit(dh_train,param=param)

100%|███████████████████████████████████████████| 10/10 [00:59<00:00,  5.98s/it]
100%|███████████████████████████████████████████| 10/10 [01:01<00:00,  6.13s/it]
100%|███████████████████████████████████████████| 10/10 [01:00<00:00,  6.09s/it]
100%|███████████████████████████████████████████| 10/10 [01:01<00:00,  6.18s/it]
100%|███████████████████████████████████████████| 10/10 [01:04<00:00,  6.46s/it]
100%|███████████████████████████████████████████| 10/10 [01:04<00:00,  6.45s/it]
100%|███████████████████████████████████████████| 10/10 [01:06<00:00,  6.67s/it]
100%|███████████████████████████████████████████| 10/10 [01:05<00:00,  6.54s/it]
100%|███████████████████████████████████████████| 10/10 [01:06<00:00,  6.63s/it]


{'accuracy_train': 0.9228395061728395,
 'f1_score_train': 0.9163879598662207,
 'accuracy_val': 0.8888888888888888,
 'f1_score_val': 0.8928571428571429}

The recommended approach is to fit everything at once the first time, then retrain individual parts if needed (mostly the later ones) starting from the pretrained model.

#### Testing

You can test the model with the test method on a new dataset with a failure scenario. 

In [53]:
model.test(dh_test)

{'accuracy_test': 0.8627450980392157, 'f1_score_test': 0.8613861386138614}

#### Predicting

Finally, you can use the predict method on a new dataset.

In [54]:
model.predict(dh_test)

array([1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,