#About

This notebook shows basic usage of the datapop library.

In [1]:
%matplotlib inline
import pandas
import numpy
import matplotlib.pyplot as plt

#Data

In [2]:
import pandas
data = pandas.read_csv('data/inputs/popularity-910days.csv')

#ReplicationPlacementStrategy

The **ReplicationPlacementStrategy** class generates data replication recommendations.

pandas.DataFrame **data**: data for the analysis.

int **min_replicas**: minimum number of replicas per dataset. Default: 1

int **max_replicas**: maximum number of replicas per dataset. Default: 7

In [3]:
from datapop import ReplicationPlacementStrategy
rps = ReplicationPlacementStrategy(data=data, min_replicas=1, max_replicas=7)

Recommends which datasets should lose replicas, and in which order, to **save N Tb** of disk space. Each step removes exactly one replica. Datasets with the lowest metric value are reduced first.

int **n_tb**: number of Tb to save. If **None**, the number of replicas is reduced for all datasets. Default: None

In [4]:
report = rps.save_n_tb(n_tb=10)
report.head()

Unnamed: 0,Name,Probability,roc_auc,precision0,Prediction,rmse,Nb_Replicas,LFNSize,Metric,DecreaseReplicas
0,/MC/2011/Beam3500GeV-2011-MagUp-Nu2-Pythia6/Si...,0.009002,0.817979,0.808357,0,0,3,0.018161,0,1
0,/MC/2011/Beam3500GeV-2011-MagUp-Nu2-Pythia6/Si...,0.009002,0.817979,0.808357,0,0,2,0.018161,0,1
0,/MC/2012/Beam4000GeV-2012-MagDown-Nu2.5-Pythia...,0.009002,0.817979,0.808357,0,0,3,0.021516,0,1
0,/MC/2012/Beam4000GeV-2012-MagDown-Nu2.5-Pythia...,0.009002,0.817979,0.808357,0,0,2,0.021516,0,1
0,/MC/2012/Beam4000GeV-2012-MagDown-Nu2.5-Pythia...,0.009002,0.817979,0.808357,0,0,3,0.026533,0,1
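
The one-replica-per-step ordering described above can be sketched with a priority queue. This is an illustration under stated assumptions, not datapop's implementation: `sketch_save_n_tb` and the toy frame are hypothetical, the metric is assumed to be `Prediction / Nb_Replicas` (a ratio consistent with the `Metric` values in the `fill_n_tb` report of this notebook), and removing one replica is assumed to free `LFNSize` Tb.

```python
import heapq
import pandas as pd

def sketch_save_n_tb(df, n_tb=None, min_replicas=1):
    """Greedy sketch: repeatedly drop one replica from the lowest-metric dataset."""
    # Heap entries: (assumed metric, row position as tie-breaker, current replicas).
    heap = [(row.Prediction / row.Nb_Replicas, i, row.Nb_Replicas)
            for i, row in enumerate(df.itertuples(index=False))
            if row.Nb_Replicas > min_replicas]
    heapq.heapify(heap)

    steps, saved = [], 0.0
    while heap and (n_tb is None or saved < n_tb):
        metric, i, replicas = heapq.heappop(heap)
        row = df.iloc[i]
        steps.append({"Name": row["Name"], "Nb_Replicas": replicas,
                      "LFNSize": row["LFNSize"], "Metric": metric,
                      "DecreaseReplicas": 1})
        saved += row["LFNSize"]          # assumption: one replica frees LFNSize Tb
        replicas -= 1
        if replicas > min_replicas:      # dataset can still lose another replica
            heapq.heappush(heap, (row["Prediction"] / replicas, i, replicas))
    return pd.DataFrame(steps)

# Toy data, invented for the example.
toy = pd.DataFrame({
    "Name": ["a", "b", "c"],
    "Prediction": [0.0, 10.0, 100.0],
    "Nb_Replicas": [3, 2, 4],
    "LFNSize": [1.0, 2.0, 5.0],
})
plan = sketch_save_n_tb(toy, n_tb=3.0)
print(plan[["Name", "Nb_Replicas", "LFNSize"]])
```

In the toy run, dataset `a` appears twice (with 3 and then 2 replicas) before `b` is touched, which matches the pattern of repeated names in the report above. The fill case would mirror this sketch: take the highest metric first and stop at `max_replicas`.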


Recommends which datasets should gain replicas, and in which order, to **fill N Tb** of disk space. Each step adds exactly one replica. Datasets with the highest metric value are increased first.

int **n_tb**: number of Tb to fill. If **None**, the number of replicas is increased for all datasets. Default: None

In [5]:
report = rps.fill_n_tb(n_tb=100)
report.head()

Unnamed: 0,Name,Probability,roc_auc,precision0,Prediction,rmse,Nb_Replicas,LFNSize,Metric,IncreaseReplicas
0,/LHCb/Collision12/Beam4000GeV-VeloClosed-MagDo...,0.86016,0.818104,0.811092,116439.102564,38776.19917,3.985338,29.072362,29216.870078,1
0,/LHCb/Collision12/Beam4000GeV-VeloClosed-MagUp...,0.86016,0.818104,0.811092,98721.145299,31029.067283,4.008043,28.696464,24630.760024,1
0,/LHCb/Collision12/Beam4000GeV-VeloClosed-MagDo...,0.86016,0.818104,0.811092,116439.102564,38776.19917,4.985338,29.072362,23356.310558,1
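
Because each report row is a single one-replica step, the same dataset can appear several times. A small helper can collapse the steps into per-dataset totals. The helper `summarize_steps` is hypothetical and not part of datapop, the toy frame is invented, and the sketch assumes that each step accounts for `LFNSize` Tb.

```python
import pandas as pd

def summarize_steps(report, flag_col):
    """Collapse a step-by-step report into per-dataset totals."""
    per_dataset = report.groupby("Name", sort=False).agg(
        replica_changes=(flag_col, "sum"),  # number of steps for the dataset
        total_tb=("LFNSize", "sum"),        # Tb involved, assuming LFNSize per step
    )
    return per_dataset, report["LFNSize"].sum()

# Toy stand-in for the output of rps.fill_n_tb(...).
toy_report = pd.DataFrame({
    "Name": ["x", "y", "x"],
    "LFNSize": [29.0, 28.5, 29.0],
    "IncreaseReplicas": [1, 1, 1],
})
per_dataset, total_tb = summarize_steps(toy_report, "IncreaseReplicas")
print(per_dataset)
print(total_tb)
```

The same helper should work on a `save_n_tb` report by passing `"DecreaseReplicas"` as `flag_col`.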


Recommends which datasets can be removed from disk, and in which order, to **clean N Tb** of disk space. Datasets with the lowest probability of being accessed are removed first.

int **n_tb**: number of Tb to clean. If **None**, all datasets are removed. Default: None

In [6]:
report = rps.clean_n_tb(n_tb=10)
report.head()

Unnamed: 0,Name,Probability,roc_auc,precision0,Prediction,rmse,Nb_Replicas,LFNSize
8179,/MC/2011/Beam3500GeV-2011-MagUp-Nu2-Pythia6/Ge...,0.008161,0.817752,0.812015,0,0,1,0.020774
8613,/MC/2012/Beam4000GeV-2012-MagUp-Nu2.5-Pythia6/...,0.008161,0.817752,0.812015,0,0,1,0.013174
8344,/MC/2012/Beam4000GeV-2012-MagDown-Nu2.5-Pythia...,0.008161,0.817752,0.812015,0,0,1,0.012778
8366,/MC/2012/Beam4000GeV-2012-MagDown-Nu2.5-Pythia...,0.008161,0.817752,0.812015,0,0,3,0.005714
8901,/MC/Dev/Beam3500GeV-2011-MagUp-Fix1-NominalBea...,0.008161,0.817752,0.812015,0,0,1,0.005546
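
The removal order described above amounts to sorting by access probability and cutting the list once enough space is freed. A minimal sketch follows. It is not datapop's code: `sketch_clean_n_tb` and the toy frame are hypothetical, and removing a dataset is assumed to free `LFNSize` Tb regardless of its replica count.

```python
import pandas as pd

def sketch_clean_n_tb(df, n_tb=None):
    """Pick datasets to remove, lowest access probability first."""
    ordered = df.sort_values("Probability", kind="stable")
    if n_tb is None:
        return ordered                      # no target: every dataset is listed
    # Space already freed before each row is removed.
    freed_before = ordered["LFNSize"].cumsum() - ordered["LFNSize"]
    # Keep rows while the target has not been reached yet.
    return ordered[freed_before < n_tb]

# Toy data, invented for the example.
toy = pd.DataFrame({
    "Name": ["a", "b", "c"],
    "Probability": [0.5, 0.1, 0.3],
    "LFNSize": [4.0, 2.0, 3.0],
})
plan = sketch_clean_n_tb(toy, n_tb=4.0)
print(plan)
```

The row that crosses the target is included, so the freed space is at least `n_tb` whenever the data set is large enough.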


`get_combine_report` combines the long-term prediction report and the short-term forecast report.

pandas.DataFrame **data**: data for the analysis.

In [7]:
report = rps.get_combine_report(data)
report.head()

Unnamed: 0,Name,Probability,roc_auc,precision0,Prediction,rmse,Nb_Replicas,LFNSize
0,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.468589,0.817803,0.810964,99.138462,133.1586,2.0,0.3179
1,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.072328,0.817803,0.810964,43.746154,130.982252,0.004592,2.402856
2,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.061695,0.817803,0.810964,16.515385,33.867127,0.001443,0.085333
3,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.230931,0.817803,0.810964,12.546154,19.818181,3.973568,0.649204
4,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.131098,0.817803,0.810964,4.115385,12.346154,3.984375,0.803981
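
A combined table of this shape could be produced by joining two per-dataset reports on `Name`. The sketch below only illustrates that idea with plain pandas; how datapop builds the report internally is not shown in this notebook. The frames `long_term` and `short_term`, their values, and the split of columns between them are assumptions made for the example.

```python
import pandas as pd

# Hypothetical long-term report: access probability plus classifier quality.
long_term = pd.DataFrame({
    "Name": ["ds1", "ds2"],
    "Probability": [0.47, 0.07],
    "roc_auc": [0.82, 0.82],
    "precision0": [0.81, 0.81],
})

# Hypothetical short-term report: forecast value plus its error.
short_term = pd.DataFrame({
    "Name": ["ds1", "ds2"],
    "Prediction": [99.1, 43.7],
    "rmse": [133.2, 131.0],
})

# An outer join keeps datasets that appear in only one of the reports.
combined = pd.merge(long_term, short_term, on="Name", how="outer")
print(combined)
```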


`get_full_combine_report` combines the long-term prediction report and the short-term forecast report, and adds the dataset features.

pandas.DataFrame **data**: data for the analysis.

In [8]:
report = rps.get_full_combine_report(data)
report.head()

Unnamed: 0,Name,Probability,roc_auc,precision0,Prediction,rmse,Nb_Replicas,LFNSize,recency,reuse_distance,first_used,creation,frequency,frequency_week,type,extentions,size,nblfn
0,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.478415,0.818284,0.811681,99.138462,133.1586,2.0,0.3179,10,14,118,197,2122,6,1,1,0.3179,67
1,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.074194,0.818284,0.811681,43.746154,130.982252,0.004592,2.402856,113,12,129,220,0,0,1,1,2.402856,871
2,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.059725,0.818284,0.811681,16.515385,33.867127,0.001443,0.085333,75,27,113,220,0,0,1,1,0.085333,693
3,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.226111,0.818284,0.811681,12.546154,19.818181,3.973568,0.649204,35,1,112,192,1177,6,1,1,0.649204,227
4,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.127448,0.818284,0.811681,4.115385,12.346154,3.984375,0.803981,48,999,48,192,535,1,1,1,0.803981,256
