This project predicts high-potential customers using a Tabular Neural Network model and Anomaly Detection techniques. The model is trained on historical user purchase and engagement data and leverages AutoGluon for efficient hyperparameter tuning. The project includes scripts for training the model (train.py) and performing inference (infer.py) to identify users likely to become high-potential customers.
- train.py: Script to preprocess data, train the Tabular Neural Network and Anomaly Detection models, tune hyperparameters, and save the best-performing model along with the scalers and label encoders.
- infer.py: Script to load the saved model, preprocess new data, and predict high-potential users.
- amc_sql.sql: SQL query to pull historical data for training & inference from AMC.
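For reference, here is a minimal sketch of the kind of AutoGluon tabular training that train.py performs (the label column name `high_potential` and the eval metric are illustrative placeholders, not necessarily the script's exact settings):

```python
# Minimal AutoGluon tabular training sketch (illustrative, not the exact train.py).
from autogluon.tabular import TabularDataset, TabularPredictor

# Load the training CSV produced by the SQL pull (path per the directory layout below).
train_data = TabularDataset("ml/input/data/train/Training Data.csv")

predictor = TabularPredictor(
    label="high_potential",   # hypothetical binary label column
    path="ml/model",          # where trained artifacts are saved
    eval_metric="roc_auc",    # illustrative choice for a ranking use case
).fit(train_data, presets="best_quality")  # AutoGluon handles hyperparameter tuning
```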
Ensure you have the following installed:
- Python 3.7+
- Docker Desktop

Install Docker Desktop: Download and install Docker Desktop from the official Docker website. This application allows you to build, share, and run containerized Python code in your local environment. It provides a streamlined way to manage Docker containers and images, making it easier to develop, test, and deploy applications in a consistent environment.
Directory Structure:
User Churn Prediction/
│
├── ml/
│   ├── code/
│   │   ├── infer/
│   │   │   ├── Dockerfile
│   │   │   ├── infer.py
│   │   │   ├── push_docker.cmd
│   │   │   ├── requirements.txt
│   │   │   ├── test_docker.sh
│   │   │   ├── test_docker.cmd
│   │   │   ├── push_docker.sh
│   │   │   ├── run_docker.sh
│   │   │   └── run_docker.cmd
│   │   │
│   │   └── train/
│   │       ├── Dockerfile
│   │       ├── push_docker.cmd
│   │       ├── push_docker.sh
│   │       ├── requirements.txt
│   │       ├── run_docker.cmd
│   │       ├── run_docker.sh
│   │       └── train.py
│   │
│   ├── input/
│   │   └── data/
│   │       ├── infer/
│   │       │   └── {Inference Data.csv}
│   │       └── train/
│   │           └── {Training Data.csv}
│   │
│   ├── model/
│   └── output/
│       └── data/
│           ├── audiences/
│           └── log/
│
└── SQL/
    └── amc_sql.sql
- Use the training.sql and inference.sql queries in the AMC Sandbox environment. Ensure that the _for_audiences suffix is removed during synthetic data generation. Apply different time windows for the training and inference data, as illustrated below.
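The intended difference between the two pulls is only the date range; a generic illustration (table and column names are hypothetical, and this is standard SQL rather than AMC-specific syntax):

```sql
-- Hypothetical illustration of differing time windows; adapt to the real queries.
-- Training pull: a longer historical window.
SELECT user_id, purchases, engagement_score
FROM user_activity
WHERE event_dt BETWEEN '2023-01-01' AND '2023-12-31';

-- Inference pull: a recent window for scoring.
SELECT user_id, purchases, engagement_score
FROM user_activity
WHERE event_dt BETWEEN '2024-01-01' AND '2024-03-31';
```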
- Once the synthetic data is available, save it in CSV format to the following paths:
  Training: /ml/input/data/train/
  Inference: /ml/input/data/infer/
- Make sure Docker Desktop is started and running properly.
- If there are any issues, restart Docker Desktop before proceeding.
- {Optional} If needed, update train.py and infer.py to fit your use case.
- Update run_docker.cmd in both the train and infer sections of the code with the appropriate paths from your local directory.
- Update push_docker.cmd in both the train and infer sections of the code with the appropriate AWS Account IDs.
- Open a CLI or terminal and navigate to the train section of the code:
  cd /ml/code/train/
- Run run_docker.cmd (or run_docker.sh if you are using a Mac) to train the model and save it locally:
  run_docker.cmd
  ./run_docker.sh
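If you need to adapt the training run script, this sketch shows the usual shape of such a script; the image name and host path are assumptions, so check the actual run_docker.sh in /ml/code/train/:

```sh
#!/bin/bash
# Hypothetical sketch of a training run script; verify against the real run_docker.sh.
# Build the training image from the local Dockerfile.
docker build -t high-potential-train .

# Run the container, mounting the local ml/ directory so the trained model
# and logs written inside the container land on the host.
docker run --rm \
  -v /path/to/your/local/ml:/opt/ml \
  high-potential-train
```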
- Once satisfied with the execution of the training container, run push_docker.cmd (or push_docker.sh if you are using a Mac) to push the container to an ECR repository:
  push_docker.cmd
  ./push_docker.sh
  NOTE: If you get errors while running a .sh file, use chmod to make it executable, as shown below.
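For example, from the script's directory:

```sh
# Mark the helper scripts as executable.
chmod +x run_docker.sh push_docker.sh test_docker.sh
```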
- Open a CLI or terminal and navigate to the infer section of the code:
  cd /ml/code/infer/
- Run run_docker.cmd (or run_docker.sh if you are using a Mac) to start the webserver:
  run_docker.cmd
  ./run_docker.sh
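The inference run script differs from the training one in that it keeps a webserver running; a sketch assuming the container serves HTTP on port 8080 (the port and image name are assumptions, so verify against the real run_docker.sh):

```sh
#!/bin/bash
# Hypothetical sketch of an inference run script; verify against the real run_docker.sh.
docker build -t high-potential-infer .

# Publish the webserver port so the test script can reach the container.
docker run --rm \
  -p 8080:8080 \
  -v /path/to/your/local/ml:/opt/ml \
  high-potential-infer
```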
- Once the webserver is live, run test_docker.cmd (or test_docker.sh if you are using a Mac) to generate inferences from the trained model and save them locally:
  test_docker.cmd
  ./test_docker.sh
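A test script of this kind typically just posts the inference CSV to the running container; the sketch below assumes a SageMaker-style /invocations endpoint on port 8080 (both are assumptions, so check test_docker.sh for the real request):

```sh
# Hypothetical test request; verify the endpoint and port against test_docker.sh.
curl -X POST \
  -H "Content-Type: text/csv" \
  --data-binary @"/path/to/your/local/ml/input/data/infer/Inference Data.csv" \
  -o predictions.csv \
  http://localhost:8080/invocations
```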
- Once satisfied with the execution of the inference container, run push_docker.cmd (or push_docker.sh if you are using a Mac) to push the container to an ECR repository:
  push_docker.cmd
  ./push_docker.sh
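For reference, a push script along these lines is typical; the region, account ID, and repository name are placeholders to replace, matching the AWS Account ID edit mentioned earlier:

```sh
#!/bin/bash
# Hypothetical sketch of pushing an image to ECR; verify against the real push_docker.sh.
REGION=us-east-1            # placeholder region
ACCOUNT_ID=123456789012     # placeholder AWS Account ID
REPO=high-potential-infer   # placeholder ECR repository / local image name

# Authenticate Docker with ECR, then tag and push the image.
aws ecr get-login-password --region "$REGION" | \
  docker login --username AWS --password-stdin "$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com"
docker tag "$REPO:latest" "$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO:latest"
docker push "$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO:latest"
```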
train.py performs the following steps (a short sketch follows this list):
- Preprocesses Data: Scales numerical features and handles missing data.
- Trains a Tabular Neural Network: Uses AutoGluon to optimize hyperparameters and improve performance.
- Saves the Model: Stores the best-performing model, scalers, and other preprocessing artifacts for use during inference.
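A minimal sketch of the preprocess-and-save pattern described above (the scaler choice and artifact file names are assumptions, not the script's exact artifacts):

```python
# Illustrative preprocessing and artifact saving; names are hypothetical.
import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("ml/input/data/train/Training Data.csv")

# Handle missing data: fill numeric gaps with the column median.
numeric_cols = df.select_dtypes("number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Scale numerical features and keep the fitted scaler for reuse at inference time.
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
joblib.dump(scaler, "ml/model/scaler.joblib")
```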
infer.py performs the following steps (a short sketch follows this list):
- Loads the Model: Retrieves the trained model and preprocessing artifacts.
- Preprocesses New Data: Ensures the new data is consistent with the training data.
- Generates Predictions: Outputs probabilities indicating each user's likelihood of being a high-potential customer.
- Sorts Results: Ranks users by their high-potential probability.
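A minimal sketch of that flow with AutoGluon (the paths and output location are assumptions, so verify against infer.py):

```python
# Illustrative inference flow; verify paths and column names against infer.py.
from autogluon.tabular import TabularDataset, TabularPredictor

predictor = TabularPredictor.load("ml/model")  # trained model + preprocessing artifacts
new_data = TabularDataset("ml/input/data/infer/Inference Data.csv")

# Probability of each user being a high potential.
proba = predictor.predict_proba(new_data)
new_data["high_potential_proba"] = proba.iloc[:, -1]  # assumes positive class is the last column

# Rank users from most to least likely and save the result.
ranked = new_data.sort_values("high_potential_proba", ascending=False)
ranked.to_csv("ml/output/data/audiences/predictions.csv", index=False)
```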
Notes (a snippet illustrating the threshold and validation points follows this list):
- SQL Query Consistency: Use the same SQL query for training and inference, but apply different time windows.
- Prediction Threshold: Adjust the threshold for classifying high-probability high-potential customers based on your business needs.
- Data Validation: Ensure input data does not contain NaN or infinite values before running the scripts.
- Performance Optimization: Use a GPU for faster processing, or adjust the batch size and number of epochs for large datasets.
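For the threshold and validation notes above, a quick illustration (the 0.5 cutoff, file paths, and column name are placeholders):

```python
# Illustrative input validation and threshold application; names are hypothetical.
import numpy as np
import pandas as pd

# Data validation: fail fast on NaN or infinite values before scoring.
df = pd.read_csv("ml/input/data/infer/Inference Data.csv")
numeric = df.select_dtypes("number")
assert not numeric.isna().any().any(), "Input contains NaN values"
assert np.isfinite(numeric.to_numpy()).all(), "Input contains infinite values"

# Prediction threshold: flag users above a business-chosen cutoff.
scored = pd.read_csv("ml/output/data/audiences/predictions.csv")  # output of the infer.py sketch
threshold = 0.5  # starting point only; tune to your business needs
scored["is_high_potential"] = scored["high_potential_proba"] >= threshold
```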
Future Enhancements:
- Experiment with additional models, such as Random Forests or Gradient Boosting, for comparison.
- Introduce advanced feature engineering techniques to capture deeper insights.