# Prepare Data for Higgs Dataset

## Install requirements
We will need pandas for the data preparation. 


## Prepare data

### Download and Store Data

To run the examples, we first download the dataset from the HIGGS website. We will download, uncompress, and store the dataset under:

```
/tmp/nvflare/dataset/input/

```

You can download the file directly with either wget or curl, whichever is installed. Here we use the curl command. The zip file is over 2.6 GB, so the download will take a while.
    

In [1]:
!mkdir -p /home/yaqeen/NVFlare/dataset/pcr/input

Alternatively, download with wget: `wget -P /tmp/nvflare/dataset/input/ https://archive.ics.uci.edu/static/public/280/higgs.zip`

Once the zip file is downloaded, we unzip it with the pre-installed `unzip` and `gunzip` tools.
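If the command-line tools are not available, the same two steps can be approximated with Python's standard library. The following is a minimal sketch, not part of the original example: the helper name `extract_dataset` is hypothetical, and it assumes the archive contains one or more `.gz` members that still need to be decompressed after unzipping.

```python
import gzip
import shutil
import zipfile
from pathlib import Path


def extract_dataset(zip_path, out_dir):
    """Unzip an archive, then gunzip any .gz files found in the output directory.

    Hypothetical helper for illustration; returns the decompressed file paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: the equivalent of `unzip`.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)

    # Step 2: the equivalent of `gunzip` for every .gz file under out_dir.
    extracted = []
    for gz_path in sorted(out_dir.rglob("*.gz")):
        target = gz_path.with_suffix("")  # drop the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        extracted.append(target)
    return extracted


# Example call (paths follow the tutorial text; adjust to your setup):
# extract_dataset("/tmp/nvflare/dataset/input/higgs.zip", "/tmp/nvflare/dataset/input/")
```

Unlike `gunzip`, this sketch keeps the compressed `.gz` file next to the decompressed copy, which costs disk space but makes the step safe to repeat.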

Let's check our current files under the data folder.

In [4]:
!ls -al /home/yaqeen/NVFlare/dataset/pcr/input/

total 128
drwxrwxr-x 2 yaqeen yaqeen   4096 Jul 23 13:31 .
drwxrwxr-x 7 yaqeen yaqeen   4096 Jul 23 13:29 ..
-rw-rw-r-- 1 yaqeen yaqeen    175 Jul 23 13:30 headers.csv
-rw-rw-r-- 1 yaqeen yaqeen 116756 Jul 23 09:48 train_selected_features.csv


In [5]:
import pandas as pd

data = pd.read_csv('/home/yaqeen/NVFlare/dataset/pcr/input/train_selected_features.csv')

data.head()

Unnamed: 0,pcr,original_firstorder_Kurtosis,original_glcm_Imc2,original_shape_Maximum3DDiameter,original_shape_Sphericity,original_shape_Maximum2DDiameterRow,hr,er,pr,her2,tumor_subtype
0,0,3.493691,0.961782,31.870497,0.667575,27.939877,0,0,0,1,1
1,0,2.884773,0.996339,26.676194,0.425407,21.252044,0,0,0,0,6
2,1,3.078196,0.961089,53.990776,0.409528,49.52919,1,1,0,1,0
3,1,4.453866,0.961408,42.018724,0.386899,36.591695,0,0,0,0,6
4,0,3.153635,0.960872,64.169069,0.469745,49.711593,0,0,0,0,6


In [6]:

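# Convert every column to numeric; values that cannot be parsed become NaN.
data = data.apply(pd.to_numeric, errors='coerce')

# Shuffle the rows and overwrite the input file.
# Note: by default, to_csv also writes the index as an extra unnamed first column.
df = data.sample(frac=1).reset_index(drop=True)
df.to_csv('/home/yaqeen/NVFlare/dataset/pcr/input/train_selected_features.csv')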
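The `Unnamed: 0` column visible in the `data.head()` output is typical of a CSV that was written with its index and then read back. The sketch below illustrates this on a small made-up frame (the values are not from the real dataset). It also shows what `errors='coerce'` does, and uses `random_state`, an addition not present in the cell above, to make the shuffle repeatable.

```python
import io

import pandas as pd

# Toy frame standing in for the real feature table (values are made up).
toy = pd.DataFrame({"pcr": ["0", "1", "n/a"], "hr": [0, 1, 1]})

# errors="coerce" replaces values that cannot be parsed with NaN instead of raising.
numeric = toy.apply(pd.to_numeric, errors="coerce")

# random_state fixes the shuffle order so repeated runs give the same result.
shuffled = numeric.sample(frac=1, random_state=0).reset_index(drop=True)


def roundtrip_columns(frame, **to_csv_kwargs):
    """Write a frame to an in-memory CSV, read it back, and return the column names."""
    buf = io.StringIO()
    frame.to_csv(buf, **to_csv_kwargs)
    buf.seek(0)
    return pd.read_csv(buf).columns.tolist()


with_index = roundtrip_columns(shuffled)                  # default: index is written
without_index = roundtrip_columns(shuffled, index=False)  # index is skipped

print(with_index)
print(without_index)
```

Passing `index=False` to `to_csv` is one way to keep the file's columns identical across repeated runs of the cell.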

In [7]:

list_ = data.columns.tolist()
print(list_)

['pcr', 'original_firstorder_Kurtosis', 'original_glcm_Imc2', 'original_shape_Maximum3DDiameter', 'original_shape_Sphericity', 'original_shape_Maximum2DDiameterRow', 'hr', 'er', 'pr', 'her2', 'tumor_subtype']


### Data Split 

The HIGGS dataset contains 11 million instances (rows), each with 28 attributes.
The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator.
The last seven features are functions of the first 21 features; these are high-level features derived by physicists to help discriminate between the two classes. The last 500,000 examples are used as a test set.

The first column is the class label (1 for signal, 0 for background), followed by the 28 features (21 low-level features, then 7 high-level features): lepton pT, lepton eta, lepton phi, missing energy magnitude, missing energy phi, jet 1 pt, jet 1 eta, jet 1 phi, jet 1 b-tag, jet 2 pt, jet 2 eta, jet 2 phi, jet 2 b-tag, jet 3 pt, jet 3 eta, jet 3 phi, jet 3 b-tag, jet 4 pt, jet 4 eta, jet 4 phi, jet 4 b-tag, m_jj, m_jjj, m_lv, m_jlv, m_bb, m_wbb, m_wwbb. For more detailed information about each feature, please see the original paper.

We will split the dataset uniformly, so that all clients have the same amount of data, and store the result under the output directory:

```
/tmp/nvflare/dataset/output/

```

First, to mimic real-world use cases, we generate a header file that stores the feature names (CSV file headers) in the data directory.

#### Generate the csv header file


In [8]:
import csv

# Feature names (CSV column headers)
features = [
    'pcr', 'original_firstorder_Kurtosis', 'original_glcm_Imc2',
    'original_shape_Maximum3DDiameter', 'original_shape_Sphericity',
    'original_shape_Maximum2DDiameterRow', 'hr', 'er', 'pr', 'her2', 'tumor_subtype',
]

# Use the column names taken from the DataFrame in the previous cell.
features = list_
# Specify the file path
file_path = '/home/yaqeen/NVFlare/dataset/pcr/input/headers.csv'

with open(file_path, 'w', newline='') as file:
    csv_writer = csv.writer(file)
    csv_writer.writerow(features)

print(f"features written to {file_path}")

features written to /home/yaqeen/NVFlare/dataset/pcr/input/headers.csv


In [9]:
!cat /home/yaqeen/NVFlare/dataset/pcr/input/headers.csv

pcr,original_firstorder_Kurtosis,original_glcm_Imc2,original_shape_Maximum3DDiameter,original_shape_Sphericity,original_shape_Maximum2DDiameterRow,hr,er,pr,her2,tumor_subtype


Now, assume you are in the `/examples/hello-world/step-by-step/higgs` directory.

In [10]:
!pwd

/home/yaqeen/NVFlare/examples/hello-world/step-by-step/higgs


#### Split higgs.csv into multiple csv files for clients

Then we split the data into multiple files, one for each site. We make sure each site has a header file corresponding to its CSV data. In a horizontal split, all the headers are the same, while for vertical learning each site can have different headers.
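Keeping the header in a separate file means the column names have to be re-attached when a site loads its data. A minimal sketch of that step follows; the helper name `load_site_data` is hypothetical, and it assumes the data file has no header row and that the header file holds exactly one name per data column.

```python
import csv

import pandas as pd


def load_site_data(data_path, header_path):
    """Load a headerless CSV and label its columns from a one-line header file.

    Hypothetical helper for illustration only.
    """
    with open(header_path, newline="") as f:
        names = next(csv.reader(f))  # the first (and only) row holds the column names
    # header=None tells pandas that the first line of the data file is data, not names.
    return pd.read_csv(data_path, names=names, header=None)


# Example call (file names follow the pattern used later in this tutorial):
# df = load_site_data(".../output/site-1.csv", ".../output/site-1_header.csv")
```

In a vertical setting, the same helper would work per site, since each site would simply pass its own header file.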

First, we install the requirements, assuming the current directory is '/examples/hello-world/step-by-step/higgs'

In [11]:
!pwd

/home/yaqeen/NVFlare/examples/hello-world/step-by-step/higgs


In [12]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


In this tutorial, we use 4 clients with a uniform split. To do so, simply run `split_csv.py`. This may take a few minutes.

>note 
    You can set a sample rate such as 0.3 to make the demo faster to run, or an even smaller value such as 0.003 to reduce the file size, especially during development or debugging. The command below uses `--sample_rate=1`.

In [14]:
!python split_csv.py \
  --input_data_path=/home/yaqeen/NVFlare/dataset/pcr/input/train_selected_features.csv \
  --input_header_path=/home/yaqeen/NVFlare/dataset/pcr/input/headers.csv \
  --output_dir=/home/yaqeen/NVFlare/dataset/pcr/output/ \
  --site_num=4 \
  --sample_rate=1

site-1= start_index=0 end_index=296
site-2= start_index=296 end_index=592
site-3= start_index=592 end_index=888
site-4= start_index=888 end_index=1186
File copied to /home/yaqeen/NVFlare/dataset/pcr/output/site-1_header.csv
File copied to /home/yaqeen/NVFlare/dataset/pcr/output/site-2_header.csv
File copied to /home/yaqeen/NVFlare/dataset/pcr/output/site-3_header.csv
File copied to /home/yaqeen/NVFlare/dataset/pcr/output/site-4_header.csv
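The index ranges printed above are consistent with a simple scheme: every site gets `total_rows // site_num` rows, and the last site also takes the remainder. The sketch below reproduces those ranges for 1186 rows and 4 sites. It is an illustration of the arithmetic only; the function name is hypothetical, and it is not taken from `split_csv.py`.

```python
def uniform_split_indices(total_rows, site_num):
    """Return {site name: (start_index, end_index)} for a uniform split.

    Hypothetical helper: each site gets total_rows // site_num rows and the
    last site absorbs whatever is left over.
    """
    size = total_rows // site_num
    ranges = {}
    for i in range(site_num):
        start = i * size
        end = total_rows if i == site_num - 1 else start + size
        ranges[f"site-{i + 1}"] = (start, end)
    return ranges


for site, (start, end) in uniform_split_indices(1186, 4).items():
    print(f"{site}: start_index={start} end_index={end}")
```

With 1186 rows, 1186 // 4 = 296, so the first three sites receive 296 rows each and the last receives 1186 - 888 = 298.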


Now let's check the files and their instance counts.

In [46]:
!ls -al /home/yaqeen/NVFlare/dataset/pcr/output/

total 2208
drwxrwxr-x 2 yaqeen yaqeen   4096 Jul 16 17:20 .
drwxrwxr-x 5 yaqeen yaqeen   4096 Jul 11 18:01 ..
-rw-rw-r-- 1 yaqeen yaqeen 558104 Jul 16 18:05 site-1.csv
-rw-rw-r-- 1 yaqeen yaqeen   3530 Jul 16 18:05 site-1_header.csv
-rw-rw-r-- 1 yaqeen yaqeen 556568 Jul 16 18:05 site-2.csv
-rw-rw-r-- 1 yaqeen yaqeen   3530 Jul 16 18:05 site-2_header.csv
-rw-rw-r-- 1 yaqeen yaqeen 556116 Jul 16 18:05 site-3.csv
-rw-rw-r-- 1 yaqeen yaqeen   3530 Jul 16 18:05 site-3_header.csv
-rw-rw-r-- 1 yaqeen yaqeen 560396 Jul 16 18:05 site-4.csv
-rw-rw-r-- 1 yaqeen yaqeen   3530 Jul 16 18:05 site-4_header.csv


In [47]:
!wc -l /home/yaqeen/NVFlare/dataset/pcr/output/site-1.csv

296 /home/yaqeen/NVFlare/dataset/pcr/output/site-1.csv


In [48]:
!wc -l /home/yaqeen/NVFlare/dataset/pcr/output/site-2.csv

296 /home/yaqeen/NVFlare/dataset/pcr/output/site-2.csv


In [49]:
!wc -l /home/yaqeen/NVFlare/dataset/pcr/output/site-3.csv

296 /home/yaqeen/NVFlare/dataset/pcr/output/site-3.csv


Now our data is prepared, and we are ready to do other computations.

In [50]:
!wc -l /home/yaqeen/NVFlare/dataset/pcr/output/site-4.csv

298 /home/yaqeen/NVFlare/dataset/pcr/output/site-4.csv
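The per-site `wc -l` checks can also be done in one pass from Python. This is a small sketch using only the standard library; the helper names are hypothetical, and it assumes the data files follow the `site-N.csv` naming shown in the listing above, with header files ending in `_header.csv`.

```python
from pathlib import Path


def count_rows(path):
    """Count the lines in a file without loading it all into memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def site_row_counts(output_dir):
    """Return {file name: line count} for every site data file in output_dir."""
    return {
        p.name: count_rows(p)
        for p in sorted(Path(output_dir).glob("site-*.csv"))
        if not p.name.endswith("_header.csv")  # skip the header files
    }


# Example call (directory taken from the cells above):
# counts = site_row_counts("/home/yaqeen/NVFlare/dataset/pcr/output/")
# print(counts, sum(counts.values()))
```

Summing the counts gives a quick check that no rows were lost in the split: for the run shown here, 296 + 296 + 296 + 298 = 1186.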
