# <span style="color:orange">**Feature Extraction**</span>
- This notebook generates two CSV files, one with training data and one with test data. It extracts features from each individual building file and then concatenates them.
- Generating these two files is time-consuming (~12h for train, ~2h30 for test).
- Not all features are useful for training/inference: only about 10k out of 50k are kept. However, better feature selection could improve the model.
___

## **Libraries**

In [1]:
# Import useful libraries
from utils_librairies import *

## **Directory Paths**

In [2]:
# Import useful paths
from utils_paths import *

## **Import function to extract features**

In [3]:
from utils_features_creation import *

___
___
# **Load useful columns that will be saved in the csv**
This avoids saving every column to the CSV used for training and inference, which saves memory.

In [4]:
# Load useful columns. The rest is not used in the process (modeling and inference)
useful_cols = joblib.load(feats_dir + "useful_cols.joblib")
print(f"Useful columns : {len(useful_cols)}.")

Useful columns : 11021.
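
Selecting `df_tmp[useful_cols]` raises a `KeyError` if a file lacks any of the listed columns. Below is a minimal sketch of a more tolerant selection, assuming that missing features may be filled with `NaN`. The helper name `select_useful_columns` and the toy column names are hypothetical and are not part of the notebook's utilities.

```python
import pandas as pd


def select_useful_columns(df, useful_cols):
    """Return df restricted to useful_cols, in that order.

    Columns absent from df are created and filled with NaN (assumption:
    downstream code tolerates missing values).
    """
    missing = [c for c in useful_cols if c not in df.columns]
    if missing:
        print(f"{len(missing)} useful column(s) missing, filled with NaN: {missing[:5]}")
    return df.reindex(columns=useful_cols)


# Toy example with hypothetical feature names
df_tmp = pd.DataFrame({"feat_a": [1.0, 2.0], "feat_b": [3.0, 4.0], "unused": [0, 0]})
subset = select_useful_columns(df_tmp, ["feat_a", "feat_c"])
print(subset.columns.tolist())
```

Whether to fill missing columns or fail loudly depends on how the model handles `NaN`. If every file is expected to produce every feature, raising an error may be the safer choice.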


___
___
# **Create <u>train</u> df**

/!\ This part is not required to run inference. Only the test CSV file needs to be generated.

In [None]:
%%time

# Collect one dataframe of features per building file
frames = []

# Process all files
for path in tqdm(sorted(os.listdir(train_filedir))):

    # Skip files with '(' in their name and non-parquet files
    if '(' in path or not path.endswith('parquet'):
        continue

    # Open file and create features
    df_tmp = create_df_from_filepath(train_filedir + path)

    # Keep only the useful columns (for every file, including the first)
    frames.append(df_tmp[useful_cols])

# Concatenate once at the end to avoid re-copying a growing dataframe
df = pd.concat(frames)
    
# Show
print(df.shape)
df.head()

  0%|          | 36/7200 [03:02<10:29:58,  5.28s/it]
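
With the filter applied inside the loop, `tqdm` also counts the entries that are skipped, so the total shown in the progress bar can overstate the work. One possible way around this, sketched here with only the standard library, is to build the filtered list first. The helper name `list_parquet_files` and the toy file names are hypothetical.

```python
import os
import tempfile


def list_parquet_files(directory):
    """Return sorted names of parquet files in directory, skipping names containing '('."""
    return [
        name
        for name in sorted(os.listdir(directory))
        if "(" not in name and name.endswith("parquet")
    ]


# Toy directory with hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.parquet", "a.parquet", "a (1).parquet", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    selected = list_parquet_files(tmp)

print(selected)
```

The loop could then iterate over `tqdm(list_parquet_files(train_filedir))` without the inline `continue`.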

## **Export csv**

In [None]:
%%time

# Save to csv for later use
df.to_csv('df_train.csv', index=False)
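
Because the CSV is written only once the whole loop has finished, an interruption during the ~12h run loses all extracted features. A hedged alternative is to append each file's features to the CSV as they are produced. The sketch below assumes every per-file dataframe has the same columns in the same order (for example after selecting `useful_cols`). The helper name `append_to_csv` and the toy columns are hypothetical.

```python
import os
import tempfile

import pandas as pd


def append_to_csv(df, csv_path):
    """Append df to csv_path, writing the header only when the file is created."""
    write_header = not os.path.exists(csv_path)
    df.to_csv(csv_path, mode="a", header=write_header, index=False)


# Toy example: two per-file feature frames with hypothetical columns
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "df_train.csv")
    append_to_csv(pd.DataFrame({"feat_a": [1, 2], "feat_b": [3, 4]}), csv_path)
    append_to_csv(pd.DataFrame({"feat_a": [5], "feat_b": [6]}), csv_path)
    result = pd.read_csv(csv_path)

print(result.shape)
```

This trades one large write for many small ones and keeps memory usage roughly constant. A leftover CSV from a previous run should be deleted first, since new rows would otherwise be appended to it.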

___
___
# **Create <u>test</u> df**

In [None]:
%%time

# Collect one dataframe of features per building file
frames = []

# Process all files
for path in tqdm(sorted(os.listdir(test_filedir))):

    # Skip files with '(' in their name and non-parquet files
    if '(' in path or not path.endswith('parquet'):
        continue

    # Open file and create features
    df_tmp = create_df_from_filepath(test_filedir + path)

    # Keep only the useful columns (for every file, including the first)
    frames.append(df_tmp[useful_cols])

# Concatenate once at the end to avoid re-copying a growing dataframe
df = pd.concat(frames)
    
# Show
print(df.shape)
df.head()

 68%|██████▊   | 980/1441 [1:45:13<33:12,  4.32s/it]   

In [None]:
# Save to csv for later use
df.to_csv('df_test.csv', index=False)