In [1]:
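import os
import warnings

# When the notebook is built from inside the Jupyter Book folder, move two levels up.
if "jbook" in os.getcwd():
    os.chdir(os.path.abspath("../.."))

# Silence library warnings to keep the rendered output readable.
warnings.filterwarnings("ignore")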

Features such as lexical density, review length, and readability can reflect how much influence a review has on other buyers' purchase decisions. Prior research indicates that readability has a greater effect on review helpfulness than length does, which underscores the importance of lexical features for consumer behavior {cite}`korfiatisEvaluatingContentQuality2012`.

In this section, we extract textual and lexical features from the review text to gain insight into how a review is expressed.

**1. Basic Text Features**
- Word Count: Total number of words in a review.
- Sentence Count: Total number of sentences in a review.
- Word Density: Average number of words per sentence.
- Sentence Density: Average number of sentences per paragraph (if applicable).
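
As a rough illustration of the basic text features above, the sketch below computes them for a single review using regular expressions. The function name `basic_text_features` and the splitting rules are assumptions made for this example, not the `FeatureEngineer` implementation; a production pipeline would more likely rely on a proper tokenizer such as NLTK's.

```python
import re


def basic_text_features(text: str) -> dict:
    """Compute simple count-based features for one review (illustrative only)."""
    # Assumed rules: paragraphs are separated by blank lines, sentences end
    # with '.', '!' or '?', and words are runs of letters or apostrophes.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        # Word density: average number of words per sentence.
        "word_density": len(words) / len(sentences) if sentences else 0.0,
        # Sentence density: average number of sentences per paragraph.
        "sentence_density": len(sentences) / len(paragraphs) if paragraphs else 0.0,
    }


features = basic_text_features("Great app. Love it!\n\nCrashes sometimes though.")
print(features)
```

Guarding each ratio against an empty denominator matters in practice, since some reviews contain only emoji or punctuation.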

**2. Lexical Features**
- Unique Words: Count of unique words in a review.
- Type/Token Ratio (TTR): Ratio of unique words to the total number of words.
- Average Word Length: Average number of characters per word.
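
The lexical features can be sketched in the same way. The helper `lexical_features` below is hypothetical and assumes case-insensitive matching when counting unique words; the actual pipeline may normalize text differently (for example, by removing stopwords first).

```python
import re


def lexical_features(text: str) -> dict:
    """Compute vocabulary-based features for one review (illustrative only)."""
    # Lowercase so that "Good" and "good" count as the same word type.
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    n_words = len(words)
    unique_words = set(words)
    return {
        "unique_word_count": len(unique_words),
        # Type/token ratio: unique words divided by total words.
        "type_token_ratio": len(unique_words) / n_words if n_words else 0.0,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
    }


lex = lexical_features("Good app, good price, good support")
print(lex)
```

Note that TTR tends to fall as text gets longer, so comparisons across reviews of very different lengths should be made with care.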

**3. Readability Metrics**
- Flesch Reading Ease: A widely used measure of readability based on average sentence length and average number of syllables per word; higher scores indicate easier text.
- Flesch–Kincaid Grade Level: Expresses readability as the US school grade level needed to understand the text.
- Gunning Fog Index: Estimates text complexity from average sentence length and the percentage of words that have three or more syllables.
- SMOG Index: Estimates the difficulty of a text from the number of words with three or more syllables relative to the number of sentences.
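
To make the readability metrics concrete, the sketch below applies the standard published formulas for the four scores. The syllable counter is a crude vowel-group heuristic and both function names are assumptions for this example; it is not the method used by `FeatureEngineer`, and dedicated readability libraries apply more careful rules (for instance, the formal Gunning Fog definition excludes proper nouns and some suffixed words from the complex-word count, and SMOG was designed for samples of 30 sentences).

```python
import math
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability_scores(text: str) -> dict:
    """Apply the standard readability formulas to one review (illustrative only)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return {}
    syllables = [count_syllables(w) for w in words]
    polysyllables = sum(1 for s in syllables if s >= 3)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(syllables) / len(words)
    return {
        "flesch_reading_ease": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "flesch_kincaid_grade": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
        "gunning_fog": 0.4 * (words_per_sentence + 100 * polysyllables / len(words)),
        "smog": 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291,
    }


scores = readability_scores("The cat sat on the mat. It was happy.")
print(scores)
```

Short reviews can produce extreme values (very high reading ease, negative grade levels), so these scores are most informative on reviews with several sentences.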

Redundant features such as `category_id` will be removed from the dataset and downstream analyses. Finally, we'll parse the review date into its constituent parts for temporal analysis.
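
As a minimal sketch of the date parsing step, the example below splits an ISO-formatted timestamp into the parts used later in this section (`year`, `month`, `day`, `year_month`, `ymd`). The helper `date_parts` and the string formats chosen for `year_month` and `ymd` are assumptions for illustration; on a DataFrame, the same parts are typically derived from a datetime column through the pandas `.dt` accessor.

```python
from datetime import datetime


def date_parts(timestamp: str) -> dict:
    """Split an ISO-formatted timestamp into calendar components (illustrative only)."""
    dt = datetime.fromisoformat(timestamp)
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        # Assumed formats for the aggregate keys; the real pipeline may differ.
        "year_month": dt.strftime("%Y-%m"),
        "ymd": dt.strftime("%Y-%m-%d"),
    }


parts = date_parts("2023-07-14 09:30:00")
print(parts)
```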


In [2]:
from appvocai-discover.data_prep.feature import FeatureEngineer, FeatureEngineeringConfig



The following snippet configures the feature engineering parameters and applies the transformations to the dataset:

In [3]:
config = FeatureEngineeringConfig(force=True)
features = FeatureEngineer(config=config)
data_fe = features.run()

File was not found at data/prod/03_clean/reviews.pkl
[Errno 2] No such file or directory: 'data/prod/03_clean/reviews.pkl'
Traceback (most recent call last):
  File "/home/john/projects/appvocai-discover/appvocai-discover/utils/io.py", line 46, in read
    return self._io.read(filepath=filepath)
  File "/home/john/projects/appvocai-discover/appvocai-discover/utils/file.py", line 360, in read
    return io.read(filepath, **kwargs)
  File "/home/john/projects/appvocai-discover/appvocai-discover/utils/file.py", line 43, in read
    data = cls._read(filepath, **kwargs)
  File "/home/john/projects/appvocai-discover/appvocai-discover/utils/file.py", line 245, in _read
    with open(filepath, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/prod/03_clean/reviews.pkl'

Let's review the results, subsetting on the key features.

In [None]:
data_fe[["id", "app_name", "category", "author", "year", "month", "day", "year_month", "ymd"]].sample(n=5)

The author information has been anonymized, and the review date has been parsed into its constituent parts.

With the initial stages of data cleaning and feature engineering complete, we move on to a critical phase of data preparation: text processing. This phase transforms raw text into a structured format suitable for analysis and modeling. We will use PySpark, the Python API for Apache Spark, to process the large volume of text efficiently.