# Part 2.1 - Feature Engineering for XGBoost Model
In this notebook, we will engineer aggregated features based on the statistical properties of the flux and observation times.

This notebook is broken down into five sections:
1. Compute the flux skewness for each timeseries sequence
2. Compute other aggregated features from `flux` and observation time
3. Load training labels and merge with training set
4. Concatenate features from Sections 1 and 2
5. Concatenate bottleneck features

In [None]:
import cudf as gd
import pandas as pd
import numpy as np
import math
import seaborn as sns
from termcolor import colored
import matplotlib.pyplot as plt
import warnings

from numba import cuda,jit,float32

from utils import scatter, groupby_skew

In [None]:
warnings.filterwarnings("ignore")
sns.set()
print(gd.__version__)

In [None]:
PATH = "../../../../../data/plasticc_data"

In [None]:
ts_cols = ['object_id', 'mjd', 'passband', 'flux', 'flux_err', 'detected']
ts_dtypes = ['int32', 'float32', 'int32', 'float32','float32','int32']

### Section 1 - Compute Flux Skewness for each timeseries sequence

Load the training and test datasets back into cuDF DataFrames.

In [None]:
train_gd = gd.read_csv('%s/training_set.csv'%PATH, names=ts_cols,dtype=ts_dtypes,skiprows=1)
test_gd = gd.read_csv('%s/test_set_sample.csv'%PATH, names=ts_cols,dtype=ts_dtypes,skiprows=1)

In [None]:
test_gd.head().to_pandas()

Let's visualize the flux of the first object in the training set.

In [None]:
for idx, oid in zip(range(1), train_gd.object_id.unique()[:1]):
    train = train_gd.to_pandas()
    mask = train.object_id== oid

    scatter(train.loc[mask,'mjd'].values,
            train.loc[mask,'flux'].values,
            values=train.loc[mask,'passband'].values,
            xlabel='time',
            ylabel='flux',
            title='Object %d class 42'%oid)
    plt.show()

Since we are going to be computing the skewness of the `flux` for each `object_id`, we can safely drop the other columns. 

In [None]:
test_flux_skew_gd = test_gd[['object_id','flux']]
train_flux_skew_gd = train_gd[['object_id','flux']]

We compute the flux skewness with the same `groupby()` / `apply_grouped()` technique we used in the RNN feature engineering stage.

We will use the `groupby_skew()` helper function, which can be found in the supplementary Python script `utils.py`.
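
Since `utils.py` is not shown here, the following is only a CPU-side sketch of what a per-object skewness computes, not the helper's actual implementation. It assumes the bias-corrected (adjusted Fisher-Pearson) estimator; `groupby_skew()` may use a different variant. The names `sample_skew` and `skew_by_group` are hypothetical.

```python
import math
from collections import defaultdict


def sample_skew(values):
    """Bias-corrected sample skewness (assumed estimator, see lead-in)."""
    n = len(values)
    if n < 3:
        return float("nan")  # the correction factor is undefined for n < 3
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant sequence: treat as unskewed
    g1 = m3 / m2 ** 1.5
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


def skew_by_group(keys, values):
    """Group values by key, then compute one skewness per group."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    return {k: sample_skew(v) for k, v in groups.items()}


# Toy data: two objects with four flux readings each (made-up values).
object_id = [1, 1, 1, 1, 2, 2, 2, 2]
flux = [1.0, 2.0, 3.0, 10.0, 4.0, 5.0, 6.0, 7.0]
skews = skew_by_group(object_id, flux)
print(skews)
```

Object 1 has one large outlier on the high side, so its skewness is positive. Object 2's readings are symmetric around their mean, so its skewness is zero.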

In [None]:
test_flux_skew_gd = groupby_skew(test_flux_skew_gd, "object_id", "flux")
train_flux_skew_gd = groupby_skew(train_flux_skew_gd, "object_id", "flux")

In [None]:
train_flux_skew_gd.head().to_pandas()

### Section 2 - Compute Statistical Summary Features

While useful, the flux skew alone is unlikely to yield an accurate classifier. In this section, we will engineer more features by aggregating existing ones.
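
One of the features built later in this section, `flux_w_mean`, is a weighted mean of the flux in which each observation is weighted by `(flux / flux_err) ** 2`. The plain-Python sketch below mirrors that arithmetic on made-up numbers to show how a precise measurement dominates the result. The function name `weighted_mean_flux` is hypothetical.

```python
def weighted_mean_flux(flux, flux_err):
    """Mean of flux weighted by the squared flux-to-error ratio."""
    weights = [(f / e) ** 2 for f, e in zip(flux, flux_err)]
    weighted_sum = sum(f * w for f, w in zip(flux, weights))
    return weighted_sum / sum(weights)


# Toy readings: the first has a small error, the other two are noisy.
flux = [10.0, 20.0, 30.0]
flux_err = [1.0, 10.0, 10.0]

w_mean = weighted_mean_flux(flux, flux_err)
plain_mean = sum(flux) / len(flux)
print(w_mean, plain_mean)
```

Here the weights are 100, 4, and 9, so the weighted mean sits close to the first reading, well below the unweighted mean of 20.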

### Independent Exercise

Using the functions `groupby()` and `agg()`, cuDF can build aggregations for many predefined functions, such as `max`, `min`, and `mean`.

Play around with this strategy in the cell that follows to get familiar with how it works. When you feel you have a good understanding, fill in the `perform_aggregation()` function so that the remaining aggregated features can be computed in `compute_aggregated_features()`.
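
If the pattern is unfamiliar, here is a small illustration on a toy table. It uses pandas on the CPU as a stand-in, on the assumption that cuDF's `groupby()` and `agg()` behave the same way for this case; the column names `group` and `value` are made up and are not part of the PLAsTiCC data.

```python
import pandas as pd

# Toy table: two groups with a few values each.
toy = pd.DataFrame({
    "group": [1, 1, 2, 2, 2],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# One row per group, holding the maximum of "value" within that group.
# as_index=False keeps "group" as a regular column instead of the index.
summary = toy.groupby("group", as_index=False).agg({"value": "max"})
print(summary)
```

The result has one row per distinct `group` and keeps the grouping key as an ordinary column, which is what later allows merging on that key.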

In [None]:
train_gd.groupby(, as_index=False).agg().head().to_pandas()

In [None]:
def perform_aggregation(df, groupby_col, agg_col, agg_type):
    return # Fill in the groupby().agg() to enable the remaining feature engineering

In [None]:
def groupby_aggs(df,aggs,col = "object_id"):
    res = None
    for i,j in aggs.items():
        for k in j:
            tmp = perform_aggregation(df, col, i, k)
            tmp.columns = [col,'%s_%s'%(k,i)]
            res = tmp if res is None else res.merge(tmp,on=[col],how='left')
        df.drop_column(i)
    return res

In [None]:
def compute_aggregated_features(df):
    """
    Engineer new features by aggregating existing features
    """
    
    aggs = {
        'passband': ['mean'],  # mean passband
        'detected': ['mean'],  # mean detected
        'mjd':['max','min'],   # min / max time range
    }
    
    agg_df = groupby_aggs(df, aggs)
    
    # If flux uncertainty is low, we get a high ratio squared. If the uncertainty is high,
    # we will get a low ratio squared.
    df['flux_ratio_sq'] = df['flux'] / df['flux_err']
    df['flux_ratio_sq'] = df['flux_ratio_sq'].applymap(lambda x: math.pow(x,2))
    
    # Multiply flux by the ratio squared
    df['flux_by_flux_ratio_sq'] = df['flux'] * df['flux_ratio_sq']
    
    aggs2 = {
        'flux_ratio_sq':['sum'],            # Sum the sq flux ratios
        'flux_by_flux_ratio_sq':['sum'],    # Sum the flux * sq flux ratios
        'flux': ['min', 'max', 'mean'],     # Summary stats for flux
        'flux_err': ['min', 'max', 'mean'], # Summary stats for flux uncertainty
    }
    
    agg_df2 = groupby_aggs(df, aggs2)
    agg_df = agg_df.merge(agg_df2,on=['object_id'],how='left')
    del agg_df2

    agg_df['flux_diff'] = agg_df['max_flux'] - agg_df['min_flux']
    agg_df['flux_dif2'] = (agg_df['max_flux'] - agg_df['min_flux']) / agg_df['mean_flux']
    
    agg_df['flux_w_mean'] = agg_df['sum_flux_by_flux_ratio_sq'] / agg_df['sum_flux_ratio_sq']
    agg_df['flux_dif3'] = (agg_df['max_flux'] - agg_df['min_flux']) / agg_df['flux_w_mean']
    
    agg_df['mjd_diff'] = agg_df['max_mjd'] - agg_df['min_mjd']
    agg_df.drop_column('max_mjd')
    agg_df.drop_column('min_mjd')
    
    return agg_df

In [None]:
train_final_gd = compute_aggregated_features(train_gd)
test_final_gd = compute_aggregated_features(test_gd)

In [None]:
train_final_gd.head().to_pandas()

### Section 3 - Load Training Labels and Metadata

Metadata for each `object_id` in our training set is supplied in a separate CSV file. This metadata also includes a `target` column, which holds the training labels.

In [None]:
cols = ['object_id', 'ra', 'decl', 'gal_l', 'gal_b', 'ddf',
       'hostgal_specz', 'hostgal_photoz', 'hostgal_photoz_err', 
       'distmod','mwebv', 'target']

dtypes = ['int32']+['float32']*4+['int32']+['float32']*5+['int32']

train_meta_gd = gd.read_csv('%s/training_set_metadata.csv'%PATH, names=cols, dtype=dtypes, skiprows=1)

We can drop columns we won't need for training our classifier.

In [None]:
for col in ['ra','decl','gal_l','gal_b']:
    train_meta_gd.drop_column(col)

Merge the metadata with our training set by `object_id`.

In [None]:
train_final_gd = train_meta_gd.merge(train_final_gd,on=['object_id'],how='left')

We can now safely delete the DataFrames we no longer need in order to free memory.

In [None]:
del train_gd, train_meta_gd, test_gd

### Section 4 - Merge Flux Skew & Statistical Summaries

We use cuDF's `merge()` to combine the flux skew and statistical summary DataFrames by `object_id`.
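
As a rough illustration of what a left merge on `object_id` does, the sketch below uses pandas on the CPU with made-up values, assuming cuDF's `merge()` follows the same semantics. The column name `skew_flux` is hypothetical; the actual output column of `groupby_skew()` is not shown in this notebook.

```python
import pandas as pd

# Toy summary features for three objects.
features = pd.DataFrame({
    "object_id": [1, 2, 3],
    "mean_flux": [0.5, 1.5, 2.5],
})

# Toy skew table that is missing object 2.
skew = pd.DataFrame({
    "object_id": [1, 3],
    "skew_flux": [0.1, -0.2],
})

# how="left" keeps every row of `features`; objects without a match
# in `skew` get a missing value in the new column.
merged = features.merge(skew, on=["object_id"], how="left")
print(merged)
```

A left merge never drops rows from the left table, so it is worth checking the merged columns for missing values before training.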

In [None]:
train_final_gd = train_final_gd.merge(train_flux_skew_gd,on=['object_id'],how='left')
test_final_gd = test_final_gd.merge(test_flux_skew_gd,on=['object_id'],how='left')

In [None]:
train_final_gd.head().to_pandas()

### Section 5 - Merge in Bottleneck Features

Load the bottleneck features extracted from the RNN in `Part 1.2` and concatenate with the aggregated features we created above.

In [None]:
train_bn = gd.from_pandas(pd.read_pickle("train_bn.pkl"))
test_bn = gd.from_pandas(pd.read_pickle("test_bn.pkl"))

In [None]:
train_final_gd = train_final_gd.merge(train_bn,on=['object_id'],how='left')
test_final_gd = test_final_gd.merge(test_bn,on=['object_id'],how='left')

In [None]:
train_final_gd.head().to_pandas()

In [None]:
del train_bn,test_bn  # Free device memory

### Store Final Train/Test Data to Disk

Write the final feature tables to pickle files so that we can use them downstream.
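
For reference, a downstream step could read such a file back as sketched below. This is a toy round trip through a temporary directory with made-up values, not the downstream notebook's actual loading code.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the final feature table.
features = pd.DataFrame({
    "object_id": [1, 2],
    "mean_flux": [0.5, 1.5],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_gdf.pkl")
    features.to_pickle(path)          # write, as done in the cell below
    restored = pd.read_pickle(path)   # read back as a pandas DataFrame

print(restored.equals(features))
```

Pickle preserves column dtypes, so the `int32` and `float32` columns come back unchanged. If a GPU DataFrame is needed again, the restored table can be passed to `from_pandas`, as was done for the bottleneck features above.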

In [None]:
train_final_gd.to_pandas().to_pickle("train_gdf.pkl")
test_final_gd.to_pandas().to_pickle("test_gdf.pkl")