In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))
import warnings
warnings.filterwarnings("ignore")

# Feature Engineering

This section covers two feature engineering tasks: anonymizing review authors and parsing review dates for temporal analysis.

1. Anonymizing Authors with the BLAKE2 Hashing Algorithm
We use the blake2b hash function from Python's hashlib module to anonymize author names.

2. Parsing Dates for Temporal Analysis
We use pandas to parse each review date into its day of the week, month, and year.
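As a rough illustration of the anonymization step, the sketch below hashes author names with blake2b from the standard library. The helper name `anonymize_author` and the sample names are hypothetical. The 10-byte digest size is an assumption, chosen only because it yields 20 hexadecimal characters, the length of the author values in the sample output further down; the actual pipeline settings may differ.

```python
import hashlib


def anonymize_author(name: str, digest_size: int = 10) -> str:
    """Return a hex BLAKE2b digest of an author name.

    digest_size=10 (bytes) is an assumed setting; it produces
    a 20-character hexadecimal string.
    """
    return hashlib.blake2b(name.encode("utf-8"), digest_size=digest_size).hexdigest()


# Hypothetical author names, for illustration only.
authors = ["alice", "bob", "alice"]
hashed = [anonymize_author(a) for a in authors]
```

Because the hash is deterministic, the same author always maps to the same digest, so per-author analysis remains possible after anonymization. An unkeyed hash of a guessable name can be reproduced by anyone who knows the name, so a keyed variant (the `key` parameter of `hashlib.blake2b`) may be worth considering where that matters.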

In [2]:
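# Hyphens are not valid in Python module names; the underscore form of the package name is assumed here.
from appvocai_discover.data_prep.feature import FeatureEngineer, FeatureEngineeringConfig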

The following cell configures the feature engineering pipeline and applies its transformations to the dataset:

In [3]:
config = FeatureEngineeringConfig(force=True)
features = FeatureEngineer(config=config)
data_fe = features.execute()



#                         FeatureEngineering Pipeline                          #

Task ReadTask completed successfully.
Task ParseDatesTask completed successfully.
Task AnonymizeAuthorsTask completed successfully.
Task DropFeaturesTask completed successfully.
Task CastDatatypesTask completed successfully.
Task WriteTask completed successfully.


                               FeatureEngineering                               
                          Pipeline Start | 2024-06-07 02:52:48.424546
                           Pipeline Stop | 2024-06-07 02:52:49.518716
                        Pipeline Runtime | 00 Minutes 01.094170 Seconds
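The log above suggests that the pipeline runs its tasks one after another, each receiving the output of the previous one. The following is a generic sketch of that pattern, not the appvocai implementation: every function name, column name, and record below is hypothetical.

```python
from typing import Callable, Dict, List, Set

Record = Dict[str, object]
Task = Callable[[List[Record]], List[Record]]


def drop_features(columns: Set[str]) -> Task:
    """Build a task that removes the given columns from every record."""
    def task(rows: List[Record]) -> List[Record]:
        return [{k: v for k, v in row.items() if k not in columns} for row in rows]
    task.__name__ = "drop_features"
    return task


def cast_rating(rows: List[Record]) -> List[Record]:
    """Cast the (hypothetical) rating column from string to int."""
    return [{**row, "rating": int(row["rating"])} for row in rows]


def run_pipeline(rows: List[Record], tasks: List[Task]) -> List[Record]:
    """Apply each task in order, logging completion after each step."""
    for task in tasks:
        rows = task(rows)
        print(f"Task {task.__name__} completed successfully.")
    return rows


# Hypothetical input record, for illustration only.
rows = [{"id": "1", "rating": "5", "title": "Great app"}]
result = run_pipeline(rows, [drop_features({"title"}), cast_rating])
```

Keeping each transformation as a small, independent task makes the steps easy to reorder, test in isolation, and report on individually.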

Let's review the results, subsetting on the key features.

In [4]:
data_fe[["id", "app_name", "category", "author", "year", "month", "day", "year_month", "ymd"]].sample(n=5)

| | id | app_name | category | author | year | month | day | year_month | ymd |
|---|---|---|---|---|---|---|---|---|---|
| 7604307 | 1005351188 | MyFitnessPal: Calorie Counter | Health & Fitness | ce3907557610831af7db | 2014 | June | Friday | 2014-06 | 2014-06-06 |
| 8267885 | 7686300705 | Seed to Spoon - Growing Food | Health & Fitness | fe0763ad8fadee1fc8d6 | 2021 | August | Friday | 2021-08 | 2021-08-13 |
| 5468322 | 9757470455 | Gym Workouts, Gym Plan Fitness | Health & Fitness | 87cb31e00c30d4f56091 | 2023 | March | Monday | 2023-03 | 2023-03-27 |
| 10587128 | 5010441450 | Hulu: Watch TV shows & movies | Entertainment | a456c8352581b4ff27cd | 2019 | October | Friday | 2019-10 | 2019-10-25 |
| 6892464 | 1670012324 | Stupid Simple Macro Tracker | Health & Fitness | deac1388df3d6ef3108a | 2017 | July | Wednesday | 2017-07 | 2017-07-05 |
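The date-derived columns in the sample above (year, month, day, year_month, ymd) could be produced with pandas along the following lines. This is a sketch under assumptions: the source column name `date` and the sample timestamps are hypothetical, and the pipeline's ParseDatesTask may work differently.

```python
import pandas as pd

# Hypothetical review timestamps; the real source column name may differ.
df = pd.DataFrame({"date": ["2014-06-06 10:15:00", "2023-03-27 08:00:00"]})

dates = pd.to_datetime(df["date"])

df["year"] = dates.dt.year                        # e.g. 2014
df["month"] = dates.dt.month_name()               # e.g. "June"
df["day"] = dates.dt.day_name()                   # day of the week, e.g. "Friday"
df["year_month"] = dates.dt.strftime("%Y-%m")     # e.g. "2014-06"
df["ymd"] = dates.dt.strftime("%Y-%m-%d")         # e.g. "2014-06-06"
```

Storing month and day as names keeps plots and group-by summaries readable, while the year_month and ymd strings sort chronologically and work well as grouping keys.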


The author names have been replaced with hash digests, and each review date has been parsed into year, month, day-of-week, year-month, and full-date features.

With the initial data cleaning and feature engineering complete, we move on to the next phase of data preparation: text processing. This phase transforms the raw review text into a structured format suited to our analysis and modeling tasks. We will use PySpark, a distributed data processing framework, to handle the large volume of text efficiently.