This GitHub repository gives in-depth insight into the Airbnb use case discussed in the following Medium blog post. This small repository contains the Python notebooks and Data Factory pipeline configuration files. Below is an overview of the life cycle of this use case and the files relevant to each part of the life cycle:
Here the data is extracted via the Opendatasoft API. We used the following Data Factory pipelines to extract all the cities from Opendatasoft: Pipeline_parent_opendatasoft_cities.json, and pipeline_child_get_city_listings to copy the datasets from Opendatasoft to the data lakehouse. Finally, the process_nested_json notebook is executed within the Data Factory pipeline to parse the JSON files using PySpark and write the structured tables to OneLake.
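The core idea of the process_nested_json step is turning nested API records into flat, table-shaped rows. A minimal plain-Python sketch of that flattening idea (the actual notebook uses PySpark, and the record shape and field names below are illustrative, not the real Opendatasoft schema):

```python
import json

# Toy Opendatasoft-style record; field names are made up for illustration.
raw_record = json.loads("""
{
  "record": {
    "fields": {
      "name": "Cozy loft",
      "city": "Amsterdam",
      "location": {"lat": 52.37, "lon": 4.9}
    }
  }
}
""")

def flatten(d, parent_key="", sep="_"):
    """Recursively flatten nested dicts into a single-level dict,
    joining nested keys with `sep` (e.g. location.lat -> location_lat)."""
    items = {}
    for key, value in d.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items

flat = flatten(raw_record["record"]["fields"])
print(flat)
```

In PySpark the same effect is typically achieved by selecting nested columns (e.g. `col("location.lat")`) into top-level columns before writing the table to OneLake.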
All the data exploration can be found in the eda notebook.
Based on the data exploration, we determined which transformations to perform on the dataset. In the transformation notebook the dataset is transformed and written to its final form before model training is performed.
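As a rough illustration of the kind of transformations such a notebook applies to listing data (the column names and steps below are hypothetical, not taken from the actual notebook), here is a small pandas sketch:

```python
import pandas as pd

# Hypothetical mini listings dataset; columns are illustrative only.
df = pd.DataFrame({
    "price": ["$120.00", "$85.50", None],
    "room_type": ["Entire home/apt", "Private room", "Private room"],
    "minimum_nights": [2, 1, 3],
})

# Typical cleaning steps: strip currency formatting, cast the target
# to float, drop rows without a target, one-hot encode categoricals.
df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)
df = df.dropna(subset=["price"])
df = pd.get_dummies(df, columns=["room_type"])
```

After steps like these, the resulting table can be written out as the final training dataset.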
To train the different models we used the following algorithms:
- Linear Regression
- Decision Tree
- Random Forest
- Multilayer Perceptron
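A compact sketch of comparing these four algorithm families on a regression task, here using scikit-learn on synthetic data (the actual notebooks run on Fabric and may use different libraries and hyperparameters):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic regression data standing in for the prepared listings table.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "mlp": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
}

# Fit each model and score it on the held-out test split.
rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse[name] = mean_squared_error(y_test, preds) ** 0.5
print(rmse)
```

Comparing all candidates on the same train/test split keeps the scores directly comparable before picking a winner.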
MLflow was used to create a Machine Learning experiment, track results, and store artifacts. The script can be found in ml_flow_experiment.
In the Random Forrest folder we share the model artifact of the best-performing model of this experiment, which was a Random Forest. Alongside the model artifact generated by Fabric, there is also a numpy file holding the predictions and the ground-truth values of the test set.
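A numpy file like that can be loaded back to recompute evaluation metrics without rerunning the model. A small self-contained sketch (the file name, key names, and values below are invented; the actual file in the folder may use a different layout):

```python
import numpy as np

# Recreate a tiny predictions/ground-truth pair the way such a file
# could be saved and read back; values are illustrative only.
y_true = np.array([120.0, 85.0, 200.0, 60.0])
y_pred = np.array([115.0, 90.0, 210.0, 55.0])
np.savez("predictions_example.npz", y_pred=y_pred, y_true=y_true)

data = np.load("predictions_example.npz")
rmse = float(np.sqrt(np.mean((data["y_pred"] - data["y_true"]) ** 2)))
mae = float(np.mean(np.abs(data["y_pred"] - data["y_true"])))
print(rmse, mae)
```

Shipping the predictions alongside the model makes it easy to audit the reported scores or compute additional metrics later.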