## Process Excel peak files from the MACS14 peak-calling program

MACS14 reports its results as `.xls` files (read here as tab-separated text) that contain all ChIP-seq peak information. This code filters out peaks that are less than 20-fold enriched over Input and adds a column naming the KRAB-ZFP for which the ChIP-seq peaks were determined. The results are saved as BED files for further processing.
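
As a small illustration of the two operations described above, the following sketch applies the 20-fold cutoff and the ZFP annotation to a hand-made table. The peak values and the file name `Zfp100_peaks.xls` are invented for the example; only the `fold_enrichment` and `Zfp_name` column names are taken from this notebook.

```python
import pandas as pd

# Toy peak table; the values are made up for illustration only.
toy_peaks = pd.DataFrame({
    'chr': ['chr1', 'chr1', 'chr2'],
    'start': [100, 5000, 200],
    'end': [400, 5300, 650],
    'fold_enrichment': [35.2, 12.7, 20.0],
})

# Hypothetical file name; the ZFP name is the text before the first underscore.
toy_file = 'Zfp100_peaks.xls'
toy_peaks['Zfp_name'] = toy_file.split('_')[0]

# Keep peaks with at least 20-fold enrichment, strongest first.
toy_kept = toy_peaks[toy_peaks['fold_enrichment'] >= 20].sort_values(
    'fold_enrichment', ascending=False
)
print(toy_kept)
```

A peak with exactly 20-fold enrichment is kept, because only peaks below the cutoff are filtered out.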

In [87]:
import pandas as pd
import glob

In [None]:
files = glob.glob('*.xls')    # create list with file names containing all .xls files in the working directory 
print(files)    

for file in files:    # iterate through list with file names
    
    df = pd.read_csv(file, skiprows=23, sep='\t')    # load file as pandas DataFrame, skipping the first 23 lines before the column header
    df['Zfp_name'] = file.split('_')[0]              # extract ZFP name (text before the first underscore) from the file name and add it as a new column
    df_sorted = df.sort_values('fold_enrichment', ascending=False)       # sort DataFrame by the fold_enrichment column, highest first
    df_sorted_filtered = df_sorted[df_sorted['fold_enrichment'] >= 20]   # filter DataFrame to keep only peaks with >= 20-fold enrichment
    new_file_name = file[:-4] + '.bed'    # replace the '.xls' extension with '.bed' to build the output file name
    df_sorted_filtered.to_csv(new_file_name, sep='\t', header=False, index=False)    # save DataFrame as tab-separated text without header or index (BED-style)
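
The `skiprows=23` above assumes that every file has exactly 23 lines before the column header. One possible alternative, sketched here under the assumption that the preamble lines start with `#`, is to let pandas skip them with `comment='#'`. The file content below is a made-up stand-in rather than real MACS14 output, and `read_macs_peaks` is a hypothetical helper name.

```python
import io
import pandas as pd

def read_macs_peaks(source):
    """Read a tab-separated peak table, ignoring '#' preamble lines and blank lines."""
    # comment='#' makes pandas ignore everything from '#' to the end of a line,
    # so lines that start with '#' are skipped entirely; blank lines are
    # skipped by default (skip_blank_lines=True).
    return pd.read_csv(source, sep='\t', comment='#')

# Made-up stand-in for a peak file (assumption: preamble lines begin with '#').
example_text = (
    "# This is a preamble line\n"
    "# name = example\n"
    "\n"
    "chr\tstart\tend\tfold_enrichment\n"
    "chr1\t100\t400\t35.2\n"
    "chr2\t200\t650\t12.7\n"
)

example_df = read_macs_peaks(io.StringIO(example_text))
print(example_df.columns.tolist())
```

This only works if no data field contains a `#` character, because pandas would cut that line off at the `#`.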