In [1]:
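import os
import warnings

# Suppress library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")

# When executed from within the Jupyter Book directory, move up two levels
# so that relative paths resolve from there.
if "jbook" in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

# Passed to the stage builder as `force` when the stage is built below.
FORCE = True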

# Feature Engineering
The feature engineering stage comprises several components designed to extract linguistic, structural, and quality attributes from the text. Each feature set captures distinct dimensions of the review data:

1. **POS Counts**: Counts of nouns, verbs, adverbs, adjectives, and determiners, quantifying the parts of speech (POS) that carry a review's content structure and sentiment potential.

2. **POS Diversity**: The relative proportion of the primary POS tags (nouns, verbs, adverbs, adjectives, determiners) within the text, providing insight into the variety and richness of language used in the review.

3. **POS Intensity**: A normalized metric summing the core POS counts (nouns, verbs, adverbs, adjectives, determiners) to measure the density and concentration of expressive words in each review.

4. **Structural Complexity**: Evaluates the text's structural attributes through unique word proportion, punctuation proportion, and the standard deviation of word length, indicating the complexity and readability of the review content.

5. **Basic Statistics**: Core textual statistics, including counts and proportions of characters, digits, and words, to provide foundational metrics on review size and numeric content.

6. **Review Features**: Captures context-specific features such as the age of the review, its rating deviation from the app's average, and deviation in review length, giving insight into the review's temporal relevance and user perception intensity.

7. **Text Quality Indicators**: Flags notable quality traits, including the presence of terminal punctuation, high readability, excessive digit repetition, and high punctuation density. These indicators help in filtering and categorizing reviews based on clarity and conciseness.

Together, these linguistic, structural, and contextual attributes support review quality screening and downstream model fine-tuning.
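To make a few of these definitions concrete, the following is a minimal sketch of how some structural complexity, basic statistics, and text quality features could be computed for a single review in plain Python. It is not the pipeline's implementation, which runs as Spark tasks. The function name `structural_features`, the output keys, and the regex-based tokenization are assumptions made for illustration only.

```python
import re
import statistics
import string


def structural_features(text: str) -> dict:
    """Hypothetical helper: illustrative features for one review text."""
    # Assumption: a word is a run of letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    word_lengths = [len(w) for w in words]

    n_chars = len(text)
    n_words = len(words)
    n_punct = sum(ch in string.punctuation for ch in text)
    n_digits = sum(ch.isdigit() for ch in text)

    return {
        # Basic statistics: counts and proportions.
        "char_count": n_chars,
        "word_count": n_words,
        "digit_count": n_digits,
        "digit_proportion": n_digits / n_chars if n_chars else 0.0,
        # Structural complexity.
        "unique_word_proportion": len(set(words)) / n_words if n_words else 0.0,
        "punctuation_proportion": n_punct / n_chars if n_chars else 0.0,
        "word_length_std": statistics.pstdev(word_lengths) if n_words else 0.0,
        # Text quality indicator: does the review end with ., !, or ?
        "has_terminal_punctuation": text.rstrip().endswith((".", "!", "?")),
    }


features = structural_features("Great app, great price!")
print(features["unique_word_proportion"])
```

In this sketch, proportions are taken relative to the character or word count of the review, and empty reviews yield zeros rather than raising a division error. The actual normalization used by the stage may differ.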


## Import Libraries

In [2]:
from discover.container import DiscoverContainer
from discover.infra.config.flow import FlowConfigReader
from discover.flow.data_prep.feature.stage import FeatureEngineeringStage

## Dependency Container

In [3]:
container = DiscoverContainer()
container.init_resources()
container.wire(
    modules=[
        "discover.flow.data_prep.stage",
    ],
)

## Feature Engineering Pipeline

In [4]:
# Obtain the configuration
reader = FlowConfigReader()
config = reader.get_config("phases", namespace=False)
stage_config = config["dataprep"]["stages"]["feature"]

# Build and run the Feature Engineering Stage
stage = FeatureEngineeringStage.build(stage_config=stage_config, force=FORCE)
asset_id = stage.run()

[11/04/2024 02:56:17 PM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [_remove_dataset_file_by_filepath] : Removed dataset file at workspace/dev/dataset/01_dataprep/appvocai_discover-01_dataprep-02_feature-review-dataset.parquet from repository.
[11/04/2024 02:56:17 PM] [INFO] [discover.infra.persistence.repo.dataset.DatasetRepo] [remove] : Removed dataset dataset-dev-dataprep-feature-review from the repository.




#                           Feature Engineering Stage                            #





:: loading settings :: url = jar:file:/home/john/miniconda3/envs/appvocai/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/john/.ivy2/cache
The jars for the packages stored in: /home/john/.ivy2/jars
com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-dcdbb3fc-ab34-4d42-9d40-4f0623b97d8a;1.0
	confs: [default]
	found com.johnsnowlabs.nlp#spark-nlp_2.12;5.3.3 in central
	found com.typesafe#config;1.4.2 in central
	found org.rocksdb#rocksdbjni;6.29.5 in central
	found com.amazonaws#aws-java-sdk-s3;1.12.500 in central
	found com.amazonaws#aws-java-sdk-kms;1.12.500 in central
	found com.amazonaws#aws-java-sdk-core;1.12.500 in central
	found commons-logging#commons-logging;1.1.3 in central
	found commons-codec#commons-codec;1.15 in central
	found org.apache.httpcomponents#httpclient;4.5.13 in central
	found org.apache.httpcomponents#httpcore;4.4.13 in central
	found software.amazon.ion#ion-java;1.0.2 in central
	found joda-time#joda-time;2.8.1 in central
	found com.amazonaws#jmespath-java;1.12.500 in central
	f



                                    NLPTask                                     
                                    -------                                     
                          Start Datetime | Mon, 04 Nov 2024 14:56:43
pos_ud_ewt download started this may take some time.
Approximate size to download 2.2 MB
Download done! Loading the resource.

                                                                                

[OK!]
                       Complete Datetime | Mon, 04 Nov 2024 14:57:06
                                 Runtime | 23.17 seconds


                              ComputePOSStatsTask                               
                              -------------------                               
                          Start Datetime | Mon, 04 Nov 2024 14:57:11
                       Complete Datetime | Mon, 04 Nov 2024 14:57:12
                                 Runtime | 0.59 seconds


                             ComputeBasicStatsTask                              
                             ---------------------                              
                          Start Datetime | Mon, 04 Nov 2024 14:57:12
                       Complete Datetime | Mon, 04 Nov 2024 14:57:12
                                 Runtime | 0.63 seconds


                              ComputeReviewAgeTask                              
                              --------------------                   

                                                                                

                       Complete Datetime | Mon, 04 Nov 2024 14:57:16
                                 Runtime | 3.44 seconds


                            ComputeAggDeviationStats                            
                            ------------------------                            
                          Start Datetime | Mon, 04 Nov 2024 14:57:16
                       Complete Datetime | Mon, 04 Nov 2024 14:57:16
                                 Runtime | 0.26 seconds


                            ComputeAggDeviationStats                            
                            ------------------------                            
                          Start Datetime | Mon, 04 Nov 2024 14:57:16
                       Complete Datetime | Mon, 04 Nov 2024 14:57:17
                                 Runtime | 0.39 seconds


                             ComputeTQAFiltersTask                              
                             ---------------------                          





                           Feature Engineering Stage                            
                           Stage Started | Mon, 04 Nov 2024 14:56:17
                         Stage Completed | Mon, 04 Nov 2024 14:58:31
                           Stage Runtime | 2.0 minutes and 14.42 seconds





                                                                                