
sekaki22/Medium_blog_Fabric_airbnb


Microsoft Fabric Airbnb use case

This GitHub page gives an in-depth look at the Airbnb use case discussed in the following Medium blog post. This small repository contains the Python notebooks and the Data Factory pipeline configuration files. Below is an overview of the life cycle of this use case and the files relevant to each stage:

Data extraction

Here the data is extracted via the Opendatasoft API. We used the Data Factory pipeline Pipeline_parent_opendatasoft_cities.json to extract all the cities from Opendatasoft, and pipeline_child_get_city_listings to copy the datasets from Opendatasoft to the Lakehouse. Finally, the process_nested_json notebook is executed within the Data Factory pipeline to parse the JSON files using PySpark and write the structured tables to OneLake.

Data exploration

All the data exploration is documented in the eda notebook.

Data transformation/feature engineering

Based on the data exploration, we determined which transformations to perform on the dataset. In the transformation notebook the dataset is transformed and written to its final form before model training is performed.

Model training

To train the different models we used the following algorithms:

  • Linear Regression
  • Decision Tree
  • Random Forest
  • Multilayer Perceptron

MLflow was used to create a machine learning experiment, track results, and store artifacts. The script can be found in ml_flow_experiment.

Model evaluation

In the Random Forest folder we share the model artifact of the best-performing model of this experiment, which was a random forest. Alongside the model artifact generated by Fabric, there is a NumPy file holding the predictions and the ground-truth values of the test set.
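The shared predictions file can be evaluated along these lines (the small arrays below stand in for the real test-set values; the exact layout of the saved .npy file may differ):

```python
import numpy as np

# Illustrative stand-ins for the saved ground-truth and prediction arrays
y_true = np.array([100.0, 80.0, 150.0, 60.0])
y_pred = np.array([110.0, 75.0, 140.0, 65.0])

# Standard regression metrics computed directly from the two arrays
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
mae = np.mean(np.abs(y_true - y_pred))
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE={rmse:.2f} MAE={mae:.2f} R2={r2:.3f}")
```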
