Machine Learning on the Cloud

When does it make sense to run Machine Learning tasks on the Cloud? This project tries to answer that question. We first study the PySpark architecture locally by creating our own cluster, then reproduce the experiments on the Cloud to compare the results.

Getting Started

Download the repo:

git clone https://github.com/riki95/machine-learning-pyspark

The dataset_normalizer folder contains the pandas code used to adapt the dataset. The pipeline_tuning folder contains the different cross validators I used in an experiment.

bank.csv is the 10 MB dataset; it can also be downloaded from here: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

There are three versions of the algorithm:

LogisticRegression_local.py runs locally and can be combined with the cross validators in pipeline_tuning. To launch the different tests more quickly, use the Makefile, which runs them with Sparklint; Sparklint provides a web interface for monitoring the execution.
LogisticRegression_GCP.py runs on Google Cloud Platform. It differs from the local version in the number of folds and the parallelism, and it reads the CSV from a cloud path.
LogisticRegression_HDI.ipynb is a notebook for Azure HDInsight Jupyter Notebook: upload the dataset to HDI and run it. It can also run in a notebook on GCP if you change the path of the dataset.
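
The difference between the local and GCP versions (number of folds, parallelism, CSV path) can be illustrated with a minimal sketch. This is not the code from the repository: RunConfig, LOCAL, GCP and build_cross_validator are hypothetical names, and the fold counts, parallelism values, hyperparameter grid, bucket placeholder and column names ("features", "label") are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Settings that differ between environments (all values are illustrative)."""
    csv_path: str      # where bank.csv is read from
    num_folds: int     # k for k-fold cross-validation
    parallelism: int   # number of models CrossValidator fits concurrently


# Hypothetical values; the repository's scripts may use different ones.
LOCAL = RunConfig(csv_path="bank.csv", num_folds=3, parallelism=2)
GCP = RunConfig(csv_path="gs://<your-bucket>/bank.csv", num_folds=5, parallelism=8)


def build_cross_validator(config: RunConfig, label_col: str = "label"):
    """Return a CrossValidator around LogisticRegression for this config.

    Call this only after a SparkSession has been created, because PySpark
    estimators need an active Spark context.
    """
    # Imported lazily so the configs above can be inspected without Spark.
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Assumes the input DataFrame has a vector column named "features".
    lr = LogisticRegression(featuresCol="features", labelCol=label_col)

    # Assumed grid: 2 x 2 = 4 candidate models per fold.
    grid = (
        ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build()
    )

    return CrossValidator(
        estimator=lr,
        estimatorParamMaps=grid,
        evaluator=BinaryClassificationEvaluator(labelCol=label_col),
        numFolds=config.num_folds,
        parallelism=config.parallelism,
    )
```

With a running SparkSession, something like build_cross_validator(GCP).fit(train_df) would train the model, where train_df is a hypothetical DataFrame that already has a "features" vector column and a numeric "label" column. Keeping the environment-specific values in one place means the same training code can run locally and on the Cloud.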

Author

Riccardo Basso - Università degli studi di Genova - High Performance Computing 2018-2019
