Comparing R and Python Methods for Predictive Modeling
I heard recently that statisticians still favor R over Python because they're suspicious of Python's accuracy. I doubt it. But there's an easy way to check. I'm going to answer three questions:
- For comparable predictive algorithms, which of the two languages runs faster?
- Are the predictions similar between the languages?
- Do variable importance and partial dependence differ between the languages?
In the dataset, x6, x7, x9, and x10 are categorical variables. Five model configurations will be compared:
- R with default
- Python with default
- Python with R's defaults
- Python's gradient-boosted trees
- XGBoost in Python
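The Python side of that list can be sketched with scikit-learn. The hyperparameter mapping is my own assumption: R's `randomForest` uses `ntree = 500`, `mtry = p/3`, and `nodesize = 5` for regression, which correspond roughly to scikit-learn's `n_estimators`, `max_features`, and `min_samples_leaf`.

```python
# Sketch of the Python model configurations (assumed libraries:
# scikit-learn; the R-default mapping below is approximate).
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

n_features = 10  # assumed width of the predictor matrix

models = {
    # scikit-learn's own defaults
    "py_default": RandomForestRegressor(),
    # approximate R randomForest defaults: ntree=500, mtry=p/3, nodesize=5
    "py_r_defaults": RandomForestRegressor(
        n_estimators=500,
        max_features=max(1, n_features // 3),
        min_samples_leaf=5,
    ),
    # gradient-boosted trees
    "py_gbm": GradientBoostingRegressor(n_estimators=500),
    # "py_xgb": xgboost.XGBRegressor(n_estimators=500),  # if xgboost is installed
}
```

Each model exposes the same `fit`/`predict` interface, so one loop can time and score all of them.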
What are the differences in computational time between these models?
How do these models compare on the predictive performance measures?
Cross-validation will use 10 holdout repeats, each splitting the data into 80% training and 20% validation.
The holdout indices have been saved to data\holdout_indices.csv so the splits are identical between R and Python.
For each run, the predictions will be saved to CSV, the run time recorded, and the mean absolute error (MAE) and mean squared error (MSE) calculated and saved.
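The timing and error bookkeeping could be as simple as the sketch below; the returned dictionary's keys are my own naming, not the article's saved-CSV format.

```python
# Sketch: time one fit/predict cycle and record MAE and MSE.
# The result-row layout is an assumption, not the article's format.
import time
import numpy as np

def evaluate(model, X_train, y_train, X_val, y_val):
    """Fit, predict, and return elapsed seconds plus MAE and MSE."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    elapsed = time.perf_counter() - start
    err = np.asarray(y_val) - np.asarray(pred)
    return {
        "seconds": elapsed,
        "mae": float(np.mean(np.abs(err))),
        "mse": float(np.mean(err ** 2)),
    }
```

Timing the fit and predict together keeps the comparison fair, since the languages may split their cost between training and scoring differently.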
The following hyperparameters will be used:
| Hyperparameter  | Value |
|-----------------|-------|
| Number of trees | 500   |