# Fourier Coefficient Computation and Parity Feature Generation

This notebook uses DNF rules to convert raw features into Boolean features, then generates parity features and computes their Fourier coefficients. Finally, it selects the 30 parity features with the largest absolute Fourier coefficients.
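
For reference, the quantities computed in Step 2 can be written out as below. This is a sketch read off the notebook's code rather than a formal definition; the symbols $S$ (a set of Boolean feature columns), $\chi_S$ (the parity feature), $n$ (the number of rows), and $y^{(j)}$ (the relabeled target of row $j$) are notation introduced here for illustration, and the labels are assumed to be 0/1.

```latex
\chi_S(x) = (-1)^{\sum_{i \in S} x_i},
\qquad
y^{(j)} = 1 - 2\,\mathrm{label}^{(j)} \in \{+1, -1\},
\qquad
\hat{f}(S) = \frac{1}{n} \sum_{j=1}^{n} y^{(j)} \, \chi_S\!\left(x^{(j)}\right)
```

Parity features are ranked by $|\hat{f}(S)|$, which matches the `abs(...)` applied in the code cells.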

**Input:**
  -  *fname_data*: filename of the raw data file (.csv format); must be uploaded manually
  -  *fname_rules*: filename of the DNF rules file (.csv format); must be uploaded manually
  -  *fname_saved*: filename used to save the parity features as a .csv file; the saved file is downloaded automatically

**Output:**
- A .csv file that contains the top parity features; filename specified by 'fname_saved'

**Useful references:**

1. How to generate column combinations:  
  - https://stackoverflow.com/questions/43347939/all-possible-combinations-of-columns-in-dataframe-pandas-python
2. Pandas cheat sheet: 
  - https://www.dataquest.io/blog/large_files/pandas-cheat-sheet.pdf

# STEP 1: Boolean Feature Generation

*   Load **DNF formulas** from a .csv rule file (generated in the previous step)
*   Load **raw features** from a .csv file (from the datasets/Processed(CSV) folder)
*   Convert *raw features* into *Boolean features* using the DNFs
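
The conversion described above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's own cell: the frames `raw` and `rules` and their values are made up, and the rule layout (feature name, comparison symbol, threshold) is assumed from the rule file preview shown later in this notebook.

```python
import operator as op
import pandas as pd

# Toy stand-ins for the raw data and the rule file (values are made up).
raw = pd.DataFrame({"V_1": [7.5, 12.0, 9.1], "V_6": [0.05, 0.20, 0.10]})
rules = pd.DataFrame([["V_1", ">=", 8.0], ["V_6", "<=", 0.1]])

# Map each comparison symbol in the rule file to a Python operator.
ops = {">=": op.ge, "<=": op.le, "=": op.eq}

# One 0/1 column per rule, named B<feature>_<rule index> as in the notebook.
bool_cols = {
    f"B{feat}_{i}": ops[sym](raw[feat], thr).astype(int)
    for i, (feat, sym, thr) in enumerate(rules.itertuples(index=False))
}
bool_data = pd.DataFrame(bool_cols)
print(bool_data)
```

Each rule yields exactly one Boolean column, so the Boolean table has as many columns as the rule file has rows.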


In [0]:
# SET FILENAMES HERE
fname_data = 'sample_raw_data.csv'      # Raw feature filename 
fname_rules = 'sample_raw_data_DNF.csv' # DNF rules filename 
fname_saved = 'sample_raw_data_parityFeat.csv' # The parity features are saved to this file 

## **TODO**: **UPLOAD RAW DATA FILE**

In [0]:
import numpy as np
import pandas as pd
import os
import io
from google.colab import files
# Upload .csv data file from local 
uploaded = files.upload()
df_raw_data = pd.read_csv(io.StringIO(uploaded[fname_data].decode('utf-8')))
df_raw_data.head()

Saving sample_raw_data.csv to sample_raw_data (3).csv


Unnamed: 0,V_1,V_2,V_3,V_4,V_5,V_6,V_7,V_8,V_9,V_10,...,V_22,V_23,V_24,V_25,V_26,V_27,V_28,V_29,V_30,labels_bi
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,1
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,1
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,1
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,1
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,1


## **TODO**: **UPLOAD DNF RULE FILE**

In [0]:
# Upload .csv Rule file from local
uploaded = files.upload()
df_DNF = pd.read_csv(io.StringIO(uploaded[fname_rules].decode('utf-8')), header = None)
df_DNF.head()

Saving sample_raw_data_DNF.csv to sample_raw_data_DNF (6).csv


Unnamed: 0,0,1,2
0,V_28,>=,0.1556
1,V_8,>=,0.05985
2,V_6,<=,0.1
3,V_1,>=,8.0
4,V_9,<=,0.16


## **Boolean Feature Conversion**

In [0]:
import operator as op

op_dic = {'>=': op.ge, '<=': op.le, '=':op.eq}
# Booleanize: build one 0/1 column per rule, then join the columns side by side
bool_cols = []
for i in range(df_DNF.shape[0]):
  col_name = 'B'+df_DNF[0][i]+'_'+str(i)
  bool_cols.append(pd.DataFrame({col_name: op_dic[df_DNF[1][i]](df_raw_data[df_DNF[0][i]],df_DNF[2][i])*1}))
# a single concat replaces DataFrame.append, which was removed in pandas 2.0
df_bool_data = pd.concat(bool_cols, axis = 1)
# Extract labels 
df_label = pd.DataFrame({'label':df_raw_data.iloc[:,-1]})
df_bool_data.head(15)

Unnamed: 0,BV_28_0,BV_8_1,BV_6_2,BV_1_3,BV_9_4
0,1,1,0,1,0
1,1,1,1,1,0
2,1,1,0,1,0
3,1,1,0,1,0
4,1,1,0,1,0
5,1,1,0,1,0
6,1,1,0,1,0
7,1,1,0,1,0
8,1,1,0,1,0
9,1,1,0,1,0


# STEP 2: Fourier Coefficient Computation for Parity Features

* Take the Booleanized features from above and generate *parity features*
* Compute **Fourier coefficients** for parity features of **size 1** to **size 4**
* Select the **top 30** features by absolute Fourier coefficient
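
The per-size cells below all follow the same pattern, which can be sketched as one function of the subset size `k`. This is an illustrative sketch under the assumption of 0/1 labels; the helper names `parity` and `top_parity_coeffs` and the toy data are hypothetical and do not appear in the notebook.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def parity(df_bool, cols):
    """Parity feature in {+1, -1}: (-1) raised to the row-wise sum of the chosen columns."""
    return (-1) ** df_bool[list(cols)].sum(axis=1)


def top_parity_coeffs(df_bool, labels, k, num_top):
    """Absolute Fourier coefficients of all size-k parities, largest num_top only."""
    y = 1 - 2 * np.asarray(labels)  # labels {0, 1} -> {+1, -1}
    subsets = list(combinations(df_bool.columns, k))
    values = [abs((parity(df_bool, cols) * y).mean()) for cols in subsets]
    # Keep each subset as a plain tuple label instead of a MultiIndex.
    index = pd.Index(subsets, tupleize_cols=False)
    return pd.Series(values, index=index).nlargest(num_top)


# Toy data (made up): the label is the XOR of columns "a" and "b".
toy_bool = pd.DataFrame({"a": [0, 1, 0, 1], "b": [0, 0, 1, 1]})
toy_labels = [0, 1, 1, 0]

top1 = top_parity_coeffs(toy_bool, toy_labels, k=1, num_top=30)
top2 = top_parity_coeffs(toy_bool, toy_labels, k=2, num_top=30)
print(top1)
print(top2)
```

On this toy data neither column alone correlates with the label, so both size-1 coefficients are 0, while the size-2 parity reproduces the label exactly and gets coefficient 1. Note that the number of subsets grows quickly with `k`, which is presumably why the notebook keeps only the top features at each size.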




In [0]:
# set number of top features to be selected
num_top_feat = 30

## Parity feature: **K = 1**

In [0]:
# convert labels from {0, 1} to {+1, -1} (0 -> +1, 1 -> -1), assuming 0/1 labels
df_label_bin = df_label*(-2)+1
# generate parity feature for size 1
parity_feat_k1 = (-1)**(df_bool_data)
# compute Fourier coefficients
fourier_coeff_k1 = abs(parity_feat_k1[parity_feat_k1.columns].multiply(df_label_bin['label'], axis="index").mean(axis=0))
# keep only top-k features and coefficients
fourier_coeff_k1 = fourier_coeff_k1.nlargest(num_top_feat)
parity_feat_k1 = parity_feat_k1[fourier_coeff_k1.index]

## Parity feature: **K = 2**

In [0]:
from itertools import combinations
# generate parity feature for size 2
cc = list(combinations(df_bool_data.columns,2))
parity_feat_k2 = pd.concat([df_bool_data[c[1]].add(df_bool_data[c[0]]) for c in cc], axis=1, keys=cc)
parity_feat_k2 = (-1)**parity_feat_k2
# compute Fourier coefficients
fourier_coeff_k2 = abs(parity_feat_k2[parity_feat_k2.columns].multiply(df_label_bin['label'], axis="index").mean(axis=0))
# keep only top-k features and coefficients
fourier_coeff_k2 = fourier_coeff_k2.nlargest(num_top_feat)
parity_feat_k2 = parity_feat_k2[fourier_coeff_k2.index]

## Parity feature: **K = 3**




In [0]:
# generate parity feature for size 3
cc = list(combinations(df_bool_data.columns,3))
parity_feat_k3 = pd.concat([df_bool_data[c[2]].add(df_bool_data[c[1]].add(df_bool_data[c[0]])) for c in cc], axis=1, keys=cc)
parity_feat_k3 = (-1)**parity_feat_k3
# compute Fourier coefficients
fourier_coeff_k3 = abs(parity_feat_k3[parity_feat_k3.columns].multiply(df_label_bin['label'], axis="index").mean(axis=0))
# keep only top-k features and coefficients
fourier_coeff_k3 = fourier_coeff_k3.nlargest(num_top_feat)
parity_feat_k3 = parity_feat_k3[fourier_coeff_k3.index]

## Parity feature: **K = 4**


In [0]:
# generate parity feature for size 4
cc = list(combinations(df_bool_data.columns,4))
parity_feat_k4 = pd.concat([df_bool_data[c[3]].add(df_bool_data[c[2]].add(df_bool_data[c[1]].add(df_bool_data[c[0]]))) for c in cc], axis=1, keys=cc)
parity_feat_k4 = (-1)**parity_feat_k4
# compute Fourier coefficients
fourier_coeff_k4 = abs(parity_feat_k4[parity_feat_k4.columns].multiply(df_label_bin['label'], axis="index").mean(axis=0))
# keep only top-k features and coefficients
fourier_coeff_k4 = fourier_coeff_k4.nlargest(num_top_feat)
parity_feat_k4 = parity_feat_k4[fourier_coeff_k4.index]

## **Find Top Features by Ranking Fourier Coefficients**

In [0]:
fourier_coeff = pd.concat([fourier_coeff_k1, fourier_coeff_k2, fourier_coeff_k3, fourier_coeff_k4], axis = 0)
del fourier_coeff_k1, fourier_coeff_k2, fourier_coeff_k3, fourier_coeff_k4
parity_feat = pd.concat([parity_feat_k1, parity_feat_k2, parity_feat_k3, parity_feat_k4], axis = 1)
del parity_feat_k1, parity_feat_k2, parity_feat_k3, parity_feat_k4
fourier_coeff = fourier_coeff.nlargest(num_top_feat)
parity_feat = pd.concat([parity_feat[fourier_coeff.index], df_label], axis = 1)

#print(parity_feat.shape) # check that the shape makes sense
fourier_coeff.head()

BV_28_0              0.972752
(BV_28_0, BV_1_3)    0.950954
BV_1_3               0.923706
BV_8_1               0.918256
(BV_8_1, BV_1_3)     0.896458
dtype: float64

**SAVE Parity features and DOWNLOAD to LOCAL**

In [0]:
from google.colab import files
# save the parity features to a .csv file 
parity_feat.to_csv(fname_saved, index=False)
files.download(fname_saved)  # download file to local 