# Intro

In this notebook, we preprocess the light curves to make them suitable for the model. The light curves can be stored as CSV files in a folder with any name. However, each light curve must be its own CSV file with a unique name and three columns: 'mjd', 'mag', and 'magerr'.
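
As a rough illustration of the expected input layout, the sketch below writes one light curve to its own CSV file and checks that the header has the three required columns. It uses only the standard library. The helper names (`write_light_curve`, `has_required_columns`), the file name `object_0001.csv`, and the numbers are hypothetical and are not part of QNPy_Latte.

```python
import csv
import os
import tempfile

REQUIRED_COLUMNS = ["mjd", "mag", "magerr"]

def write_light_curve(path, rows):
    """Write (mjd, mag, magerr) rows to a CSV file with the expected header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(REQUIRED_COLUMNS)
        writer.writerows(rows)

def has_required_columns(path):
    """Return True if the CSV header contains all three required columns."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle), [])
    return all(column in header for column in REQUIRED_COLUMNS)

# Illustrative values only; these are not real observations.
example_rows = [(58000.0, 19.2, 0.05), (58003.5, 19.4, 0.06), (58010.1, 19.1, 0.04)]
example_dir = tempfile.mkdtemp()
example_path = os.path.join(example_dir, "object_0001.csv")
write_light_curve(example_path, example_rows)
print(has_required_columns(example_path))
```

A check like this could be run over every file in the light curve folder before preprocessing starts.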


The package can be installed with pip.

In [None]:
#If the package isn't in the environment that you are working in

!pip install QNPy_Latte

### Importing the necessary packages

In [None]:
import QNPy_Latte.Preprocess as pr #Importing Preprocess module from the package
from QNPy_Latte.Preprocess import transform #Importing the transform function used to transform the data
from QNPy_Latte.Preprocess import * #Importing all external packages from Preprocess
import shutil #Is used for creation and deletion of folders

In [None]:
import QNPy_Latte.SPLITTING_AND_TRAINING as st #Importing SPLITTING_AND_TRAINING module from the package
from QNPy_Latte.SPLITTING_AND_TRAINING import * #Importing all packages from SPLITTING_AND_TRAINING module

### Importing Data and keyword definitions

In [None]:
SRC_LCs = 'Light_Curves/' #The name of the folder that the light curves are stored in
#Alternatively, you can have your light curves all in a folder with bands
#SRC_LCs = 'Light_Curves/band_name/'
file_name = 'LCs' #The suffix to attach to the new files and folders created

In [None]:
#The names of the files
file_names = glob.glob(os.path.join(SRC_LCs, '*.csv'))

In [None]:
#Importing the data. This can be done in any desired manner, but the data must contain:
#mjd - MJD or time, mag - magnitude, and magerr - magnitude error.
#In this example we use the pandas package to import the .csv data, but numpy can be used if the data is
#in a .txt file
#Get the data
path = SRC_LCs
csv_files = glob.glob(os.path.join(path, "*.csv"))
df_list = (pd.read_csv(file) for file in csv_files)
data = pd.concat(df_list, ignore_index=True)

### Cleaning Light Curves

We offer the option to clean the light curves by removing outlier observations. This is achieved by removing extreme outliers (magerr>1), applying a three-point median filter, and removing points that deviate by more than a certain threshold from a 5th-degree polynomial fit to the light curve. If too many points would be removed, the threshold is increased until at most 10% of the points are removed (methods from Sanchez-Saez et al. 2021 and Tachibana and Graham et al. 2020).
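
The sketch below illustrates the idea of these steps with the standard library only; it is not the implementation behind `clean_outliers_median`. It makes several assumptions that should be read as such: the median-filtered curve stands in for the 5th-degree polynomial fit, the starting threshold (0.25 mag) and step (0.05 mag) are hypothetical, residuals are assumed to be finite, and all function names are made up for illustration.

```python
from statistics import median

def drop_large_errors(mjd, mag, magerr, max_err=1.0):
    """Keep only observations whose magnitude error is at most max_err."""
    keep = [i for i, err in enumerate(magerr) if err <= max_err]
    return ([mjd[i] for i in keep],
            [mag[i] for i in keep],
            [magerr[i] for i in keep])

def median_filter_3(values):
    """Three-point running median; the two endpoints are left unchanged."""
    if len(values) < 3:
        return list(values)
    inner = [median(values[i - 1:i + 2]) for i in range(1, len(values) - 1)]
    return [values[0]] + inner + [values[-1]]

def adaptive_outlier_mask(residuals, threshold=0.25, step=0.05, max_fraction=0.10):
    """Flag |residual| > threshold, raising the threshold by `step` until
    at most max_fraction of the points are flagged."""
    n = len(residuals)
    while True:
        flagged = [abs(r) > threshold for r in residuals]
        if n == 0 or sum(flagged) <= max_fraction * n:
            return flagged, threshold
        threshold += step

# Illustrative values only; these are not real observations.
mjd = [float(day) for day in range(11)]
mag = [19.0, 19.1, 19.0, 22.0, 19.1, 19.0, 19.2, 19.1, 19.0, 19.1, 19.0]
magerr = [0.05] * 10 + [1.5]

# Step 1: remove extreme outliers in magnitude error.
mjd, mag, magerr = drop_large_errors(mjd, mag, magerr)
# Step 2: smooth with a three-point median filter (stand-in for the fit).
smooth = median_filter_3(mag)
# Step 3: flag points far from the smooth curve, capped at 10% of the points.
residuals = [m - s for m, s in zip(mag, smooth)]
flagged, used_threshold = adaptive_outlier_mask(residuals)
clean_mag = [m for m, bad in zip(mag, flagged) if not bad]
print(len(clean_mag), used_threshold)
```

The cap on the removed fraction is what keeps the cleaning from stripping genuine variability out of a strongly varying light curve.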

In [None]:
#If you would like to clean the curve
clean = True

In [None]:
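if clean:
    #Folder where the cleaned light curves are saved
    cleaned_path = f'./Cleaned_Light_Curves_{file_name}/'
    clean_outliers_median(path, cleaned_path, median=True)
    #Use the cleaned light curves in the following steps
    path = cleaned_path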

### Padding the Light Curves

We pad the light curves so that they all have the same number of observations, which lets us batch the data for the model.
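
A minimal sketch of backward padding is shown here, assuming that the whole last observation (time, magnitude, and error) is repeated at the end of the curve until the desired length is reached. How `pr.backward_pad_curves` treats each column is not stated in this notebook, so that detail and the function name `backward_pad` are assumptions.

```python
def backward_pad(mjd, mag, magerr, desired_observations=100):
    """Pad a light curve at its end by repeating the last observation.

    Curves that already have enough points (or are empty) are returned as copies.
    """
    missing = desired_observations - len(mjd)
    if missing <= 0 or not mjd:
        return list(mjd), list(mag), list(magerr)
    return (list(mjd) + [mjd[-1]] * missing,
            list(mag) + [mag[-1]] * missing,
            list(magerr) + [magerr[-1]] * missing)

# Illustrative values only; these are not real observations.
mjd = [58000.0, 58003.5, 58010.1]
mag = [19.2, 19.4, 19.1]
magerr = [0.05, 0.06, 0.04]

padded_mjd, padded_mag, padded_magerr = backward_pad(mjd, mag, magerr, desired_observations=5)
print(padded_mag)
```

After padding, every curve has the same length, so several curves can be stacked into one array per batch.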

In [None]:
# Padding the light curves
# backward_pad_curves pads each curve at the end with its last observed value
# The length for padding should remain 100 or above
# verbose controls whether a confirmation is printed for each file (>0) or nothing is printed (=0)
padding = pr.backward_pad_curves(path, f'./Padded_lc_{file_name}', desired_observations=100, verbose=0)

### Preprocessing/Transforming Data 

We preprocess the data so that both the times and the magnitudes are scaled to the range [-2,2]. We also save the transformation coefficients so that the transform can be reversed later on.
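
One common way to reach a fixed range is a linear min-max rescaling, sketched below together with its inverse. This is an assumption about the general idea, not a description of what `transform` does internally; the function names and the choice of storing the minimum and maximum as the coefficients are hypothetical.

```python
def scale(values, lo, hi, a=-2.0, b=2.0):
    """Linearly map values from [lo, hi] to [a, b]."""
    span = hi - lo
    if span == 0:
        # A constant series is mapped to the middle of the target range.
        return [(a + b) / 2 for _ in values]
    return [a + (v - lo) * (b - a) / span for v in values]

def unscale(scaled, lo, hi, a=-2.0, b=2.0):
    """Invert scale() using the stored coefficients lo and hi."""
    return [lo + (s - a) * (hi - lo) / (b - a) for s in scaled]

# Illustrative values only; these are not real observations.
mjd = [58000.0, 58050.0, 58100.0]
mag = [19.2, 19.4, 19.1]

# The coefficients needed for the reverse transform.
coeffs = {"mjd": (min(mjd), max(mjd)), "mag": (min(mag), max(mag))}

scaled_mjd = scale(mjd, *coeffs["mjd"])
scaled_mag = scale(mag, *coeffs["mag"])
restored_mjd = unscale(scaled_mjd, *coeffs["mjd"])
print(scaled_mjd, restored_mjd)
```

Keeping the per-curve coefficients is what makes it possible to convert model output back to physical times and magnitudes.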

In [None]:
#Path to Padded Data
DATA_SRC = f"./Padded_lc_{file_name}" 
#path to folder to save preproc data (transformed data)
DATA_DST = f"./preproc_{file_name}"

In [None]:
#Making the preprocess directory
os.makedirs(DATA_DST, exist_ok=True)

In [None]:
#listing the data that are going to be transformed. 
#In case that your original data is in one table, this is not needed
files = os.listdir(DATA_SRC) 

In [None]:
#Making the TR_Coeffs folder and setting the name of the coefficients file
os.makedirs('TR_Coeffs',exist_ok = True)
trcoeff_filename = f'TR_Coeffs/trcoeff_{file_name}.pickle'

In [None]:
#Running the preprocess transformation
number_of_points, trcoeff = pr.transform_and_save(files, DATA_SRC, DATA_DST, transform,trcoeff_file = trcoeff_filename)

In [None]:
#Remove the padded folder (Optional)
shutil.rmtree(DATA_SRC)

### Splitting the Data

We split the data into train, test, and validation folders. The split is roughly 80-10-10, but the validation folder is guaranteed to have at least two light curves.
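
The splitting rule can be sketched as follows. This is only an illustration of an 80-10-10 split with a minimum validation size; the shuffling, the rounding, the seed, and the function name `split_names` are assumptions and may differ from what `st.split_data` does.

```python
import random

def split_names(names, test_frac=0.1, val_frac=0.1, min_val=2, seed=0):
    """Split file names into train, test, and validation lists.

    The validation list gets at least min_val names (when enough names exist);
    everything not assigned to validation or test goes to train.
    """
    names = sorted(names)
    random.Random(seed).shuffle(names)  # reproducible shuffle
    n = len(names)
    n_val = min(n, max(min_val, round(n * val_frac)))
    n_test = min(n - n_val, round(n * test_frac))
    val = names[:n_val]
    test = names[n_val:n_val + n_test]
    train = names[n_val + n_test:]
    return train, test, val

# Hypothetical file names for illustration.
example_names = [f"object_{i:04d}.csv" for i in range(10)]
train, test, val = split_names(example_names)
print(len(train), len(test), len(val))
```

With only ten light curves, a strict 10% share would leave a single validation curve, so the minimum of two takes over and the training share drops slightly below 80%.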

In [None]:
#Make the new data source the transformed light curves
DATA_SRC = DATA_DST #Path to transformed data

In [None]:
#listing the transformed data
files = os.listdir(DATA_SRC) 

In [None]:
#creating the folders for saving the split data
st.create_split_folders(train_folder=f'./dataset_{file_name}/train/', test_folder=f'./dataset_{file_name}/test/',
                        val_folder=f'./dataset_{file_name}/val/')

In [None]:
#Paths to the TRAIN, TEST and VAL folders where your split data will be saved.
#You can also enter this information directly in the split_data function
TRAIN_FOLDER = f'./dataset_{file_name}/train/'
TEST_FOLDER = f'./dataset_{file_name}/test/'
VAL_FOLDER = f'./dataset_{file_name}/val/'

In [None]:
# clearing the output folders
# if you don't have anything in your TRAIN, TEST and VAL folders this can be skipped
st.prepare_output_dir(TRAIN_FOLDER) 
st.prepare_output_dir(TEST_FOLDER) 
st.prepare_output_dir(VAL_FOLDER) 

In [None]:
#running the function for splitting the data
#As in backward_pad_curves, verbose controls whether a confirmation is printed (>0) or not (=0)
st.split_data(files, DATA_SRC, TRAIN_FOLDER, TEST_FOLDER, VAL_FOLDER, verbose=0)

In [None]:
#Remove the preproc folders (Optional)
shutil.rmtree(DATA_SRC)