# Data Preparation for Input Data

## Overview
The input data preparation process ensures that the dataset is clean, structured, and ready for further processing. This involves handling missing values, ensuring numerical consistency, and structuring hierarchical levels appropriately.

## Data Requirements
The dataset should adhere to the following structure:

- **Date Column**: Represents the timestamp of the data entry.
- **Hierarchical Levels**:
  - `Category`
  - Other hierarchical columns vary by dataset (for example, `vendor` and `brand` in the configuration used in this notebook).
- **Feature Columns**: All feature columns must be numerical.
- **Target Columns**: The target variables should also be numerical and contain no null values.
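
The requirements above can be checked before any cleaning starts. The following is a minimal sketch, not part of the `data_prepare` pipeline: the function name `check_requirements`, its arguments, and the toy frame are all hypothetical and exist only for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_requirements(df, date_col, level_cols, feature_cols, target_cols):
    """Return a list of problems; an empty list means the frame passes."""
    problems = []

    # Every expected column must be present before dtypes can be inspected.
    required = [date_col, *level_cols, *feature_cols, *target_cols]
    missing = [col for col in required if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems

    # Feature and target columns must be numerical.
    for col in [*feature_cols, *target_cols]:
        if not is_numeric_dtype(df[col]):
            problems.append(f"{col} is not numerical")

    # Target columns must not contain nulls.
    for col in target_cols:
        if df[col].isna().any():
            problems.append(f"{col} contains null values")

    return problems


# Toy data (made up for this example): one bad feature value, one null target.
example = pd.DataFrame({
    "date": ["2022-08-13", "2022-08-20"],
    "Category": ["category_1", "category_2"],
    "awareness": [84.2, "n/a"],
    "market_share": [0.04, None],
})

problems = check_requirements(
    example, "date", ["Category"], ["awareness"], ["market_share"]
)
print(problems)
```

Returning a list of problems rather than raising on the first one makes it easier to fix all issues in the source file in one pass.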

## Data Cleaning Process

1. **Ensure Date Format Consistency**
   - Ensure the dates are in a standard format.
   - Remove any rows with invalid date entries.

2. **Handle Missing Values**
   - Drop rows with null values in hierarchical columns.
   - Impute or remove missing values in the target column so that every remaining value is numerical.

3. **Convert Hierarchical Levels to Categorical Type**
   - Convert `Category` and other hierarchical columns to categorical data types if applicable.

4. **Ensure All Feature and Target Columns are Numerical**
   - Convert any non-numerical feature or target columns to numerical using encoding if necessary.
   - Remove any columns that cannot be converted to numerical values.

5. **Verify Data Integrity**
   - Check for duplicate rows and remove them if necessary.
   - Ensure hierarchical levels are properly categorized and do not contain inconsistencies.

6. **Final Check**
   - Confirm that no null values remain in the target column.
   - Validate numerical consistency across all numerical columns.
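
The cleaning steps above can be sketched with pandas as follows. This is an illustrative outline under stated assumptions, not the implementation inside `data_prepare`: the function `clean_input` and its parameters are hypothetical, non-numerical values are coerced to `NaN` rather than encoded, and rows with a missing target are removed rather than imputed.

```python
import pandas as pd


def clean_input(df, date_col, level_cols, feature_cols, target_col, date_format=None):
    """Apply the cleaning steps to a copy of df and return the result."""
    out = df.copy()

    # Step 1: parse dates; entries that cannot be parsed become NaT and are dropped.
    out[date_col] = pd.to_datetime(out[date_col], format=date_format, errors="coerce")
    out = out.dropna(subset=[date_col])

    # Step 2: drop rows with nulls in any hierarchical column.
    out = out.dropna(subset=level_cols)

    # Step 4: coerce feature and target columns to numerical (assumption: no encoding).
    for col in [*feature_cols, target_col]:
        out[col] = pd.to_numeric(out[col], errors="coerce")

    # Step 4 (continued): remove feature columns with no numerical values at all.
    unusable = [col for col in feature_cols if out[col].isna().all()]
    out = out.drop(columns=unusable)

    # Step 2 (continued): remove rows whose target is still missing.
    out = out.dropna(subset=[target_col])

    # Step 3: store hierarchical levels as categorical.
    for col in level_cols:
        out[col] = out[col].astype("category")

    # Step 5: remove exact duplicate rows.
    return out.drop_duplicates().reset_index(drop=True)


# Toy data (made up): one invalid date, one duplicate row, one null level.
raw = pd.DataFrame({
    "date": ["13-08-2022", "20-08-2022", "not a date", "20-08-2022", "27-08-2022"],
    "Category": ["category_1", "category_2", "category_1", "category_2", None],
    "awareness": ["84.2", "65.7", "80.7", "65.7", "83.3"],
    "market_share": [0.04, 0.05, 0.07, 0.05, 0.06],
})

clean = clean_input(
    raw, "date", ["Category"], ["awareness"], "market_share", date_format="%d-%m-%Y"
)

# Step 6: final check on the target column.
assert clean["market_share"].notna().all()
print(clean)
```

Of the five toy rows, two survive: the invalid date, the null `Category`, and the duplicate are each removed by a different step.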

## Expected Output
The final dataset should have:
- A properly formatted date column.
- No missing values in hierarchical levels.
- Only numerical values in feature and target columns.
- A structured dataset ready for further processing.



In [1]:
import sys
import os

project_path = os.path.abspath("..")

if project_path not in sys.path:
    sys.path.append(project_path)


import numpy as np
import pandas as pd

from src.brand_health_centre.data_preparation import data_prepare

- The IDV (independent variable) list maps each feature to a predefined equity pillar.
- Feature transformation methods include:
  - **Custom Scaling**
  - **Standard Scaling**
  - **Min-Max Scaling**
  - **Imputation** for handling missing values
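
The transformation methods listed above can be written as short pandas helpers. This is a sketch only: the helper names are hypothetical, median imputation is an assumed choice, and `bounds_scale` assumes that custom scaling rescales each feature by the `min` and `max` bounds given in the IDV list. How `data_prepare` actually implements each method, and how it uses `negation_flag`, is not shown here.

```python
import pandas as pd


def impute_median(s):
    """Fill missing values with the median of the observed values (assumed method)."""
    return s.fillna(s.median())


def standard_scale(s):
    """Centre on the mean and divide by the population standard deviation."""
    return (s - s.mean()) / s.std(ddof=0)


def min_max_scale(s):
    """Rescale to [0, 1] using the observed minimum and maximum."""
    return (s - s.min()) / (s.max() - s.min())


def bounds_scale(s, lower, upper):
    """Rescale using fixed bounds (assumption: the min/max columns of the IDV list)."""
    return (s - lower) / (upper - lower)


# Toy series (made up) with one missing value.
values = pd.Series([20.0, None, 40.0, 60.0])
filled = impute_median(values)

print(min_max_scale(filled).tolist())
print(bounds_scale(filled, lower=0.0, upper=100.0).tolist())
print(standard_scale(filled).round(3).tolist())
```

Fixed bounds keep scaled values comparable across brands and dates, whereas min-max scaling depends on the range observed in the particular sample.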

In [2]:

# Change the config settings according to the documentation
config_path = r"D:\BRAND_HUB_PROJECT\brandhub-capability\src\brand_health_centre\config.yml"
scaled_data, idv_list, config, paths = data_prepare(config_file_path=config_path)


{'filtered_data_path': './output\\filtered_data.csv', 'no_null_imputed_data_path': './output\\no_null_imputed_data.csv', 'scaled_data_path': './output\\scaled_data.csv', 'cfa_fit_data_path': './output\\cfa_fit_data.csv', 'rf_fit_data_path': 'output\\rf_fit_data.csv', 'rf_act_pred_data_path': 'output\\rf_act_pred_data.csv', 'pillar_weights_path': 'output\\pillar_weights.csv', 'pillar_data_path': 'output\\pillar_data.csv', 'trend_past_data_path': 'output\\trend_data.csv', 'scaled_score_data_path': 'output\\scaled_score_data.csv', 'imp_rf_fit_data_path': 'output\\imp_rf_fit_data.csv', 'imp_rf_act_pred_data_path': 'output\\imp_rf_act_pred_data.csv', 'score_card_final_df_path': 'output\\score_card_final_df.csv', 'relative_imp_model_results_path': 'output\\relative_imp_model_results.csv'}
All required columns are present in the DataFrame.
All independent variables in idv_list are present in the data.
Minimum date: 2017-01-07 00:00:00
Maximum date: 2025-01-11 00:00:00
Dropped_columns: [('vend

In [3]:
scaled_data

Unnamed: 0,date,vendor,brand,category,metric,value
0,2022-08-13,vendor_1,brand_1,category_1,directions_awareness_total_awareness_net_mentions,0.842007
1,2022-08-13,vendor_3,brand_5,category_1,directions_awareness_total_awareness_net_mentions,0.807087
2,2022-08-13,vendor_3,brand_6,category_2,directions_awareness_total_awareness_net_mentions,0.833969
3,2022-08-20,vendor_1,brand_1,category_1,directions_awareness_total_awareness_net_mentions,0.842007
4,2022-08-20,vendor_1,brand_2,category_1,directions_awareness_total_awareness_net_mentions,0.657894
...,...,...,...,...,...,...
38395,2024-02-24,vendor_1,brand_2,category_2,market_share,0.041971
38396,2024-02-24,vendor_3,brand_6,category_2,market_share,0.059198
38397,2024-03-02,vendor_3,brand_5,category_1,market_share,0.073273
38398,2024-03-09,vendor_1,brand_1,category_1,market_share,0.007873
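
The `scaled_data` frame is in long format, with one row per date, vendor, brand, category, and metric. If a wide layout with one column per metric is easier to work with, a pivot along the following lines may help. This is a sketch on a made-up toy frame with shortened metric names; it assumes each combination of grouping columns and metric appears at most once.

```python
import pandas as pd

# Toy long-format frame (values made up) with the same columns as scaled_data.
long_df = pd.DataFrame({
    "date": ["2022-08-13"] * 4,
    "vendor": ["vendor_1", "vendor_1", "vendor_3", "vendor_3"],
    "brand": ["brand_1", "brand_1", "brand_5", "brand_5"],
    "category": ["category_1"] * 4,
    "metric": ["awareness", "market_share", "awareness", "market_share"],
    "value": [0.84, 0.01, 0.81, 0.07],
})

group_cols = ["date", "vendor", "brand", "category"]

# One row per group, one column per metric; "first" keeps the single value per cell.
wide = long_df.pivot_table(
    index=group_cols, columns="metric", values="value", aggfunc="first"
).reset_index()
wide.columns.name = None  # drop the leftover "metric" axis label

print(wide)
```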


In [4]:
idv_list

Unnamed: 0,idv,equity_pillar,min,max,negation_flag
0,directions_awareness_total_awareness_net_mentions,awareness,0.0,100.0,P
1,directions_awareness_unaided_awareness_net_men...,awareness,0.0,100.0,P
2,directions_brand_attributes_always_seems_to_be...,brand_perceptions,0.0,100.0,P
3,directions_brand_attributes_creates_excitement...,brand_perceptions,0.0,100.0,P
4,directions_brand_attributes_creates_playful_mo...,brand_perceptions,0.0,100.0,P
5,directions_brand_attributes_has_a_taste_my_cat...,brand_perceptions,0.0,100.0,P
6,directions_brand_attributes_has_a_taste_my_dog...,brand_perceptions,0.0,100.0,P
7,directions_brand_attributes_has_craveable_trea...,brand_perceptions,0.0,100.0,P
8,directions_brand_attributes_has_the_most_satis...,brand_perceptions,0.0,100.0,P
9,directions_brand_attributes_has_treats_i_could...,brand_perceptions,0.0,100.0,P


In [5]:
config

{'input_data': 'D:\\BRAND_HUB_PROJECT\\brandhub-capability\\data\\raw\\harmonised_mock_2.csv',
 'idv_list': 'D:\\BRAND_HUB_PROJECT\\brandhub-capability\\data\\raw\\idv_list_.csv',
 'granularity': 'weekly',
 'data_prep_group_var': ['vendor', 'brand', 'category'],
 'date_column': 'date',
 'date_format': '%d-%m-%Y',
 'start_date': '2022-08-01',
 'end_date': '2024-06-01',
 'dv_column': 'market_share',
 'null_percentage': 0.5,
 'scaling': 'custom',
 'cfa_sampling_seeding': [2, 3, 5, 7, 11, 13, 17, 19],
 'model_type': 'RandomForest',
 'model_config': {'RandomForest': {'grid_search': {'max_depth': [2, 3, 4],
    'n_estimators': [15, 50, 100, 300, 500],
    'max_features': [2, 4, 10],
    'random_state': [42],
    'eval_metrics': []},
   'random_state': 42},
  'XGBoost': {'grid_search': {'max_depth': [2, 3],
    'n_estimators': [100, 500, 1000, 1500],
    'learning_rate': [0.01, 0.02],
    'random_state': [42],
    'eval_metrics': []},
   'random_state': 42},
  'RF_Ridge': {'grid_search': {'al

In [6]:
paths

{'filtered_data_path': './output\\filtered_data.csv',
 'no_null_imputed_data_path': './output\\no_null_imputed_data.csv',
 'scaled_data_path': './output\\scaled_data.csv',
 'cfa_fit_data_path': './output\\cfa_fit_data.csv',
 'rf_fit_data_path': 'output\\rf_fit_data.csv',
 'rf_act_pred_data_path': 'output\\rf_act_pred_data.csv',
 'pillar_weights_path': 'output\\pillar_weights.csv',
 'pillar_data_path': 'output\\pillar_data.csv',
 'trend_past_data_path': 'output\\trend_data.csv',
 'scaled_score_data_path': 'output\\scaled_score_data.csv',
 'imp_rf_fit_data_path': 'output\\imp_rf_fit_data.csv',
 'imp_rf_act_pred_data_path': 'output\\imp_rf_act_pred_data.csv',
 'score_card_final_df_path': 'output\\score_card_final_df.csv',
 'relative_imp_model_results_path': 'output\\relative_imp_model_results.csv'}