#About

This notebook shows how to use the datapop library in a simple way.

In [1]:
%matplotlib inline
import pandas
import numpy
import matplotlib.pyplot as plt

#Data

In [2]:
import pandas
data = pandas.read_csv('data/inputs/popularity-910days.csv')

#ReplicationPlacementStrategy

The class **ReplicationPlacementStrategy** generates data replication recommendations.

pandas.DataFrame **data**: data for the analysis.
    
int **min_replicas**: minimum number of replicas per dataset. Default: 1

int **max_replicas**: maximum number of replicas per dataset. Default: 7

In [3]:
from datapop import ReplicationPlacementStrategy
rps = ReplicationPlacementStrategy(data=data, min_replicas=1, max_replicas=7)

Recommendations for which datasets should have their number of replicas decreased, and in which order, to **save N TB** of disk space. Each step removes just one replica. Datasets with the lowest metric value have their replicas decreased first.

int **n_tb**: number of TB to save. If **None**, the number of replicas is reduced for all datasets. Default: None

In [4]:
report = rps.save_n_tb(n_tb=10)
report.head()

Unnamed: 0,Name,Probability,roc_auc,precision0,Prediction,rmse,Nb_Replicas,LFNSize,Metric,DecreaseReplicas
0,/LHCb/Collision11/Beam3500GeV-VeloClosed-MagDo...,0.000468,0.871852,0.881211,0,0,4.162791,0.204121,0,1
0,/LHCb/Collision11/Beam3500GeV-VeloClosed-MagDo...,0.000468,0.871852,0.881211,0,0,3.162791,0.204121,0,1
0,/LHCb/Collision11/Beam3500GeV-VeloClosed-MagDo...,0.000468,0.871852,0.881211,0,0,2.162791,0.204121,0,1
0,/LHCb/Collision11/Beam3500GeV-VeloClosed-MagUp...,0.000468,0.871852,0.881211,0,0,4.1875,0.146613,0,1
0,/LHCb/Collision11/Beam3500GeV-VeloClosed-MagUp...,0.000468,0.871852,0.881211,0,0,3.1875,0.146613,0,1
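
The greedy procedure described above can be sketched in plain Python. This is only an illustration, not datapop's implementation: the function `sketch_save_n_tb` and the dictionary keys `name`, `metric`, `replicas`, and `size_tb` are hypothetical, and the sketch assumes the metric stays fixed while replicas are removed.

```python
def sketch_save_n_tb(datasets, n_tb=None, min_replicas=1):
    """Greedy sketch: drop one replica per step, lowest metric first.

    `datasets` is a list of dicts with the hypothetical keys
    'name', 'metric', 'replicas' and 'size_tb' (size of one replica).
    Returns a list of (name, replicas_left) tuples, one per step.
    """
    replicas = {d["name"]: d["replicas"] for d in datasets}
    # Lowest metric first; sorted() is stable, so ties keep input order.
    order = sorted(datasets, key=lambda d: d["metric"])
    steps, saved = [], 0.0
    for d in order:
        # Remove replicas of this dataset one at a time, never going
        # below min_replicas.
        while replicas[d["name"]] - 1 >= min_replicas:
            if n_tb is not None and saved >= n_tb:
                return steps  # target reached
            replicas[d["name"]] -= 1
            saved += d["size_tb"]
            steps.append((d["name"], replicas[d["name"]]))
    return steps


# Made-up input, for illustration only.
example = [
    {"name": "ds_a", "metric": 0.0, "replicas": 3, "size_tb": 2.0},
    {"name": "ds_b", "metric": 5.0, "replicas": 2, "size_tb": 1.0},
]
steps = sketch_save_n_tb(example, n_tb=3)
```

With `n_tb=None` the loop runs until every dataset is down to `min_replicas`, which mirrors the documented default.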


Recommendations for which datasets should have their number of replicas increased, and in which order, to **fill N TB** of disk space. Each step adds just one replica. Datasets with the highest metric value have their replicas increased first.

int **n_tb**: number of TB to fill. If **None**, the number of replicas is increased for all datasets. Default: None

In [5]:
report = rps.fill_n_tb(n_tb=100)
report.head()

Unnamed: 0,Name,Probability,roc_auc,precision0,Prediction,rmse,Nb_Replicas,LFNSize,Metric,IncreaseReplicas
0,/LHCb/Collision12/Beam4000GeV-VeloClosed-MagDo...,0.894488,0.870367,0.878102,90368.930288,38158.044273,3.985338,29.072362,22675.349064,1
0,/LHCb/Collision12/Beam4000GeV-VeloClosed-MagUp...,0.894488,0.870367,0.878102,86111.981671,31571.521761,4.008043,28.696464,21484.794866,1
0,/LHCb/Collision12/Beam4000GeV-VeloClosed-MagDo...,0.894488,0.870367,0.878102,90368.930288,38158.044273,4.985338,29.072362,18126.941501,1
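
In the rows shown above, `Metric` equals `Prediction / Nb_Replicas` (for example, 90368.930288 / 3.985338 ≈ 22675.349). What appears to be the same dataset shows up again in the third row with one more replica and a lower metric. The sketch below reproduces that behaviour with a heap. It is not datapop's implementation: the function `sketch_fill_n_tb`, the dictionary keys, and the dataset labels `A` and `B` are hypothetical, and the metric formula is an assumption inferred from these three rows only.

```python
import heapq


def sketch_fill_n_tb(datasets, n_tb=None, max_replicas=7):
    """Greedy sketch: add one replica per step, highest metric first.

    `datasets` is a list of dicts with the hypothetical keys 'name',
    'prediction', 'replicas' (> 0) and 'size_tb' (size of one replica).
    Assumes metric = prediction / replicas.
    Returns (name, replicas_before, metric) tuples, one per step.
    """
    # heapq is a min-heap, so store the negated metric to pop the largest.
    heap = [(-d["prediction"] / d["replicas"], i)
            for i, d in enumerate(datasets)]
    heapq.heapify(heap)
    replicas = [d["replicas"] for d in datasets]
    steps, filled = [], 0.0
    while heap and (n_tb is None or filled < n_tb):
        neg_metric, i = heapq.heappop(heap)
        if replicas[i] + 1 > max_replicas:
            continue  # already at the cap, drop it from the heap
        steps.append((datasets[i]["name"], replicas[i], -neg_metric))
        replicas[i] += 1
        filled += datasets[i]["size_tb"]
        # Re-insert with the metric recomputed for the new replica count.
        heapq.heappush(heap, (-datasets[i]["prediction"] / replicas[i], i))
    return steps


# Numbers copied from the first two rows of the table above;
# the labels "A" and "B" stand in for the truncated dataset names.
example = [
    {"name": "A", "prediction": 90368.930288, "replicas": 3.985338,
     "size_tb": 29.072362},
    {"name": "B", "prediction": 86111.981671, "replicas": 4.008043,
     "size_tb": 28.696464},
]
steps = sketch_fill_n_tb(example, n_tb=80)
```

Under these assumptions the sketch picks `A`, then `B`, then `A` again, which is the same order as the three rows in the report.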


Recommendations for which datasets can be removed from disk, and in which order, to **clean N TB** of disk space. Datasets with the lowest probability of being accessed are removed first.

int **n_tb**: number of TB to clean. If **None**, all datasets are removed. Default: None

In [6]:
report = rps.clean_n_tb(n_tb=10)
report.head()

Unnamed: 0,Name,Probability,roc_auc,precision0,Prediction,rmse,Nb_Replicas,LFNSize
8428,/MC/2012/Beam4000GeV-2012-MagDown-Nu2.5-Pythia...,0.000427,0.871252,0.88008,0,0,1,0.037904
8827,/MC/2012/Beam4000GeV-2012-MagUp-Nu2.5-Pythia8/...,0.000427,0.871252,0.88008,0,0,1,0.033392
8832,/MC/2012/Beam4000GeV-2012-MagUp-Nu2.5-Pythia8/...,0.000427,0.871252,0.88008,0,0,1,0.03368
7968,/MC/2011/Beam3500GeV-2011-MagUp-Nu2-Pythia8/Ge...,0.000435,0.871252,0.88008,0,0,1,0.00012
7969,/MC/2011/Beam3500GeV-2011-MagUp-Nu2-Pythia8/Ge...,0.000435,0.871252,0.88008,0,0,1,0.000123
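
The selection rule above (lowest access probability first, until N TB are freed) can be sketched with pandas. This is an illustration, not datapop's code: the function `sketch_clean_n_tb` is hypothetical, and it assumes that `LFNSize` is the size of one replica in TB, so that removing a dataset frees `LFNSize * Nb_Replicas`.

```python
import pandas as pd


def sketch_clean_n_tb(report, n_tb=None):
    """Sketch: list datasets to remove, lowest access probability first."""
    # mergesort is stable, so equal probabilities keep their input order.
    ordered = report.sort_values("Probability", kind="mergesort")
    if n_tb is None:
        return ordered  # no target: every dataset is a candidate
    # Assumed space freed per dataset: one replica size times replica count.
    freed = (ordered["LFNSize"] * ordered["Nb_Replicas"]).cumsum()
    # Keep rows up to and including the one that reaches the target.
    n_rows = int((freed < n_tb).sum()) + 1
    return ordered.head(n_rows)


# Made-up input, for illustration only.
example = pd.DataFrame({
    "Name": ["ds_a", "ds_b", "ds_c"],
    "Probability": [0.5, 0.1, 0.3],
    "Nb_Replicas": [1, 2, 1],
    "LFNSize": [1.0, 2.0, 3.0],
})
to_remove = sketch_clean_n_tb(example, n_tb=5)
```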


Combination of the long-term prediction and the short-term forecast reports.

pandas.DataFrame **data**: data for the analysis.

In [7]:
report = rps.get_combine_report(data)
report.head()

Unnamed: 0,Name,Probability,roc_auc,precision0,Prediction,rmse,Nb_Replicas,LFNSize
0,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.57545,0.871351,0.881032,6.660679,153.429956,2.0,0.3179
1,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.047044,0.871351,0.881032,0.0,0.25641,0.004592,2.402856
2,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.076623,0.871351,0.881032,5.4e-05,18.919273,0.001443,0.085333
3,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.307749,0.871351,0.881032,10.6752,22.378141,3.973568,0.649204
4,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.333904,0.871351,0.881032,0.0,0.0,3.984375,0.803981
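
One way to picture such a combined report is as a join of two per-dataset tables on `Name`. Judging by the column names, `Probability` (with `roc_auc` and `precision0`) may come from the long-term prediction and `Prediction` (with `rmse`) from the short-term forecast; this reading is an assumption, and the frames `long_term` and `short_term` below are hypothetical stand-ins rather than objects exposed by datapop.

```python
import pandas as pd

# Hypothetical long-term report: probability that a dataset is accessed.
long_term = pd.DataFrame({
    "Name": ["ds_a", "ds_b"],
    "Probability": [0.58, 0.05],
})

# Hypothetical short-term report: forecast number of accesses.
short_term = pd.DataFrame({
    "Name": ["ds_b", "ds_a"],
    "Prediction": [0.0, 6.66],
})

# Join the two reports on the dataset name; an inner join keeps only
# datasets present in both and follows the row order of long_term.
combined = long_term.merge(short_term, on="Name", how="inner")
```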


Combination of the long-term prediction and the short-term forecast reports with dataset features.

pandas.DataFrame **data**: data for the analysis.

In [8]:
report = rps.get_full_combine_report(data)
report.head()

Unnamed: 0,Name,Probability,roc_auc,precision0,Prediction,rmse,Nb_Replicas,LFNSize,recency,reuse_distance,first_used,creation,frequency,frequency_week,type,extentions,size,nblfn
0,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.582984,0.871754,0.881163,6.660679,153.429956,2.0,0.3179,10,14,118,197,2122,6,1,1,0.3179,67
1,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.041421,0.871754,0.881163,0.0,0.25641,0.004592,2.402856,113,12,129,220,0,0,1,1,2.402856,871
2,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.083709,0.871754,0.881163,5.4e-05,18.919273,0.001443,0.085333,75,27,113,220,0,0,1,1,0.085333,693
3,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.285343,0.871754,0.881163,10.6752,22.378141,3.973568,0.649204,35,1,112,192,1177,6,1,1,0.649204,227
4,/LHCb/Collision10/Beam3500GeV-VeloClosed-MagDo...,0.308699,0.871754,0.881163,0.0,0.0,3.984375,0.803981,48,999,48,192,535,1,1,1,0.803981,256
