**Basic reading and writing of csv files as a first data processing step**  

This script takes the raw csv files provided by central DQM as its starting input.  
These files are difficult to work with: each contains a fixed number of lines, so the rows are not grouped by e.g. run number, and many histogram types are mixed together in one file.  
This script (whose functionality lives almost entirely in the 'utils' folder) puts them into a more useful form: one file per histogram type and per year, containing all runs and lumisections of that type for that year.  

It is a good idea to run this code yourself, with the histogram types changed to the ones you intend to use in your study.  
Options are also available (although not shown in detail in this small tutorial) to make files per era instead of per year, if you prefer.
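
The regrouping described above (many mixed input files in, one file per histogram type out) can be sketched with plain pandas on toy data. This is only an illustration of the idea, not the implementation in utils/csv_utils: the helper name `split_by_histname`, the toy values and the output file names are hypothetical, while the column names `fromrun`, `fromlumi` and `hname` are taken from the example printout further down.

```python
import os
import tempfile

import pandas as pd

def split_by_histname(frames):
    """Merge several dataframes and regroup the rows per histogram type.

    Hypothetical helper, for illustration only.
    """
    merged = pd.concat(frames, ignore_index=True)
    # order the rows by run and lumisection before splitting
    merged = merged.sort_values(['fromrun', 'fromlumi'])
    return {hname: group.reset_index(drop=True)
            for hname, group in merged.groupby('hname')}

# toy stand-ins for two raw csv files with mixed histogram types
file1 = pd.DataFrame({'fromrun': [302036, 302036],
                      'fromlumi': [30, 29],
                      'hname': ['goodvtxNbr', 'adc_PXLayer_1']})
file2 = pd.DataFrame({'fromrun': [302036, 302037],
                      'fromlumi': [29, 1],
                      'hname': ['goodvtxNbr', 'adc_PXLayer_1']})

per_type = split_by_histname([file1, file2])

# write one csv file per histogram type (file names are made up)
with tempfile.TemporaryDirectory() as outdir:
    for hname, group in per_type.items():
        group.to_csv(os.path.join(outdir, 'DF2017_{}.csv'.format(hname)))
    written = sorted(os.listdir(outdir))
print(written)
```

For the real data, use the functions in the 'utils' folder as shown in the cells below; they take care of locating and merging the input files.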

For more information, check the documentation of utils/csv_utils and utils/dataframe_utils! See also the comments in the code below for some more explanation.

In [1]:
### imports

# external modules
import sys
import importlib

# local modules
sys.path.append('../utils')
import csv_utils as csvu
import dataframe_utils as dfu
importlib.reload(csvu)
importlib.reload(dfu)

<module 'dataframe_utils' from '../utils/dataframe_utils.py'>

In [2]:
# read an example csv file

dim = 1 # dimension of histograms (1 or 2)
datadirs = list(csvu.get_data_dirs(year='2017',dim=dim)) 
# get_data_dirs returns the directories where to find the input csv files.
# this is hard-coded for now and might change in the future.
# if your data is located elsewhere, you can easily write an equivalent function with the same output.
print('data directories:')
print(datadirs)
datadir = datadirs[2] # pick one directory as an example (index 2 is era D in the printout below)
csvfiles = csvu.sort_filenames(list(csvu.get_csv_files(datadir)))
# sort_filenames and get_csv_files are more or less self-explanatory.
print('number of csv files in {}: {}'.format(datadir,len(csvfiles)))
df = csvu.read_csv(csvfiles[0])
# read_csv turns an input csv file into a pandas dataframe.
# the next lines print the dataframe before any further processing.
# comment out print(df) to get a better view of the rest of the printouts in this cell.
#print('example data frame:')
print(df)
print('--- available runs present in this file: ---')
for r in dfu.get_runs(df): print(r)
print('--- available histogram types in this file ---')
for h in dfu.get_histnames(df): print(h)

data directories:
['/eos/project/c/cmsml4dc/ML_2020/UL2017_Data/DF2017B_1D_Complete', '/eos/project/c/cmsml4dc/ML_2020/UL2017_Data/DF2017C_1D_Complete', '/eos/project/c/cmsml4dc/ML_2020/UL2017_Data/DF2017D_1D_Complete', '/eos/project/c/cmsml4dc/ML_2020/UL2017_Data/DF2017E_1D_Complete', '/eos/project/c/cmsml4dc/ML_2020/UL2017_Data/DF2017F_1D_Complete']
number of csv files in /eos/project/c/cmsml4dc/ML_2020/UL2017_Data/DF2017D_1D_Complete: 34
       fromrun  fromlumi                                   hname  fromrun.1  \
0       302036        29                              goodvtxNbr     302036   
1       302036        29                           adc_PXLayer_1     302036   
2       302036        29                           adc_PXLayer_2     302036   
3       302036        29                           adc_PXLayer_3     302036   
4       302036        29                           adc_PXLayer_4     302036   
...        ...       ...                                     ...        ...   
99
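
The comments in the cell above mention that, if your data is located elsewhere, you can write an equivalent of `get_data_dirs` with the same output. A minimal sketch of what that could look like is given here. It assumes that the output is an iterable of directory paths (as the `list(...)` wrapper and the printout suggest) and that your directories follow the `DF<year><era>_<dim>D_Complete` naming seen in the printout. The names `my_get_data_dirs`, `my_get_csv_files` and `part_0.csv` are hypothetical, and the real functions in utils/csv_utils may behave differently.

```python
import os
import tempfile

def my_get_data_dirs(basedir, year='2017', eras=None, dim=1):
    """Yield the data directories found below basedir.

    Hypothetical stand-in for csvu.get_data_dirs, for data stored elsewhere.
    """
    if eras is None:
        eras = ['B', 'C', 'D', 'E', 'F']
    for era in eras:
        dirname = 'DF{}{}_{}D_Complete'.format(year, era, dim)
        path = os.path.join(basedir, dirname)
        # only yield directories that actually exist
        if os.path.isdir(path):
            yield path

def my_get_csv_files(datadir):
    """Yield the csv files directly inside datadir, in lexicographic order."""
    for fname in sorted(os.listdir(datadir)):
        if fname.endswith('.csv'):
            yield os.path.join(datadir, fname)

# demonstration on a temporary directory tree with two eras
with tempfile.TemporaryDirectory() as basedir:
    for era in ['B', 'D']:
        d = os.path.join(basedir, 'DF2017{}_1D_Complete'.format(era))
        os.makedirs(d)
        open(os.path.join(d, 'part_0.csv'), 'w').close()
    datadirs = list(my_get_data_dirs(basedir, year='2017', dim=1))
    found = [os.path.basename(d) for d in datadirs]
    nfiles = [len(list(my_get_csv_files(d))) for d in datadirs]
print(found)
print(nfiles)
```

Note that `sorted` orders file names lexicographically, which may not match the order produced by `csvu.sort_filenames`.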

In [3]:
# main reformatting of input csv files
# note that this function can take quite a while to run!

histnames = [
    'Summary_ClusterStoNCorr__OnTrack__TIB__layer__1',
    'Summary_ClusterStoNCorr__OnTrack__TOB__layer__1',
    'Chi2oNDF_lumiFlag_GenTk',
    'NumberOfRecHitsPerTrack_lumiFlag_GenTk',
    'NumberOfTracks_lumiFlag_GenTk',
    'chargeInner_PXLayer_1',
    'chargeInner_PXLayer_2',
    'chargeInner_PXLayer_3',
    'chargeInner_PXLayer_4',
    'chargeOuter_PXLayer_1',
    'chargeOuter_PXLayer_2',
    'chargeOuter_PXLayer_3',
    'chargeOuter_PXLayer_4',
]
csvu.write_skimmed_csv(histnames,'2017',eras=['B'],dim=1)

INFO in csv_utils.py / read_and_merge_csv: reading and merging 33 csv files...
  - now processing file 1 of 33...
  - now processing file 2 of 33...
  - now processing file 3 of 33...
  - now processing file 4 of 33...
  - now processing file 5 of 33...
  - now processing file 6 of 33...
  - now processing file 7 of 33...
  - now processing file 8 of 33...
  - now processing file 9 of 33...
  - now processing file 10 of 33...
  - now processing file 11 of 33...
  - now processing file 12 of 33...
  - now processing file 13 of 33...
  - now processing file 14 of 33...
  - now processing file 15 of 33...
  - now processing file 16 of 33...
  - now processing file 17 of 33...
  - now processing file 18 of 33...
  - now processing file 19 of 33...
  - now processing file 20 of 33...
  - now processing file 21 of 33...
  - now processing file 22 of 33...
  - now processing file 23 of 33...
  - now processing file 24 of 33...
  - now processing file 25 of 33...
  - now processing file 26 of 

In [None]:
# extra: for 2D histograms, even the files per histogram type and per era might be too big to work with easily.
# this cell writes an even smaller file (one histogram type from a single input csv file) for quicker testing.

year = '2017'
era = 'B'
dim = 1 # dimension of histograms (1 or 2)
histname = 'chargeInner_PXLayer_2'
datadirs = list(csvu.get_data_dirs(year=year,eras=[era],dim=dim)) 
datadir = datadirs[0]
csvfiles = csvu.sort_filenames(list(csvu.get_csv_files(datadir)))
print('number of csv files in {}: {}'.format(datadir,len(csvfiles)))
df = csvu.read_csv(csvfiles[0])
df = dfu.select_histnames(df,[histname])
df.to_csv('DF'+year+era+'_'+histname+'_subset.csv')
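
When reading such a subset file back in with plain pandas, keep in mind that `DataFrame.to_csv` also writes the row index by default. The sketch below shows the effect on a toy dataframe with made-up values, using an in-memory buffer instead of a file on disk. Whether `csvu.read_csv` handles this index column for you is not covered here, so treat this as an assumption to check against its documentation.

```python
import io

import pandas as pd

# toy stand-in for the selected dataframe (values are made up)
df = pd.DataFrame({
    'fromrun': [297050, 297050],
    'fromlumi': [1, 2],
    'hname': ['chargeInner_PXLayer_2', 'chargeInner_PXLayer_2'],
})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer)  # index=True by default, so the row index is written too

# naive read: the index shows up as an extra 'Unnamed: 0' column
buffer.seek(0)
df_naive = pd.read_csv(buffer)
print(df_naive.columns.tolist())

# reading with index_col=0 restores the original columns
buffer.seek(0)
df_back = pd.read_csv(buffer, index_col=0)
print(df_back.columns.tolist())
```

Alternatively, passing `index=False` to `to_csv` avoids writing the index in the first place.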