<a href="https://colab.research.google.com/github/salilathalye/rocpy-pycaret-streamlit/blob/main/RocPy_20200216.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Low Code Machine Learning with Colab, PyCaret and Streamlit

### Salil "Sal" Athalye

[LinkedIn](https://www.linkedin.com/in/salilathalye/)

[Github](https://github.com/salilathalye)

## Rochester Python Meetup
### February 16, 2021








## Objective
*Interested in getting started with Machine Learning? In this presentation Sal Athalye will show you how he is using Python and a handful of open-source libraries to create low-cost, low-code, end-to-end solutions.*

## Motivation

* Build a personal hobbyist / "Citizen Data Scientist" codebase for Data Science, ML, NLP and Time Series Forecasting from open-source libraries using techniques from the public domain.
* Give back to the open-source community through public Github repos under the MIT License.
* Enable a low-cost "***workbench***" for experimentation in support of continuous learning.
* Surface areas of weakness in theory and practice.
* Build "muscle memory" with APIs and toolchains to improve productivity and flow.
* Enhance skills with tools and processes needed to contribute to other open-source projects.
* Support my passion for technology and computing and staying informed of new developments in the field.
* Participate in time-boxed datathons and similar challenges in an individual or team-based setting.


## Acknowledgements

To all the Giants, whose shoulders we stand on - Thanks!

# The *Work in Progress* Workbench

## Development Environments
### Python / Jupyter Notebooks
We will mostly be using a handful of Python libraries to create this workbench. Familiarity with writing functions and working with lists, dictionaries and sets should give you the foundation you need to get started.
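As a rough gauge of the level assumed, the sketch below uses a function, a list, a dictionary and a set together. The data and the function name `summarize_labels` are made up for illustration.

```python
def summarize_labels(labels):
    """Count how often each label appears and collect the distinct labels."""
    counts = {}                      # dictionary: label -> number of occurrences
    for label in labels:             # list: ordered, may contain duplicates
        counts[label] = counts.get(label, 0) + 1
    distinct = set(labels)           # set: unordered, duplicates removed
    return counts, distinct


labels = ["spam", "ham", "spam", "ham", "spam"]
counts, distinct = summarize_labels(labels)
print(counts)            # {'spam': 3, 'ham': 2}
print(sorted(distinct))  # ['ham', 'spam']
```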

### Python Virtual Environment
Coming from the Data Science group of Python users, I've used conda and pip for virtual environment and package management.

### Visual Studio Code
This is a personal decision. You will need an Editor/IDE on your local machine. I also use Jupyter Notebook/Jupyter Lab. You need to build "muscle memory" with your editor(s) of choice!

[Visual Studio Code](https://code.visualstudio.com/)

### Google Colaboratory
Google Colaboratory (aka Colab) provides a cloud-based Python development environment with pre-installed libraries in a familiar Jupyter Notebook setting. It integrates with Google Drive and GitHub. Colab is intriguing because we can set up runtime environments with a GPU or TPU for free, subject to constraints such as strict idle timeouts, VM RAM limits and up to 12 hours of notebook run time. There is also a Pro tier that removes some of the constraints.

[Introduction to Colab](https://colab.research.google.com/github/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l01c01_introduction_to_colab_and_python.ipynb)

### git, github
Familiarity with creating a git repo on GitHub, connecting your local repo to a remote, using a .gitignore file, adding new files to the repo, staging changes, committing changes, and pulling and pushing changes is what we need to get started.

Github gists are a great way to build your collection of commonly used code snippets. You can create public or private gists.

[Github Gist](https://gist.github.com/)

[Sal's gists](https://gist.github.com/salilathalye)

## Python Libraries

### pandas
The pandas library, developed by Wes McKinney, is the Swiss Army Knife for data manipulation and analysis in Python. Familiarity with loading data from CSV or xlsx (Excel) formats, and with manipulating and transforming DataFrames and Series (tabular and columnar representations of in-memory data), is the muscle memory we need to build here. There are many excellent tutorials on YouTube to help us get started.

[pandas](https://pandas.pydata.org/)
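A minimal sketch of the operations mentioned above: loading CSV data, pulling out a Series, deriving a column, filtering and aggregating. The CSV content and column names are made up, and an in-memory string stands in for a file on disk.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a file on disk;
# pd.read_csv accepts a path or a file-like object.
csv_text = """species,mass_g,length_cm
robin,77,25
robin,81,26
wren,10,9
wren,12,10
"""

df = pd.read_csv(io.StringIO(csv_text))   # DataFrame: tabular data

mass = df["mass_g"]                        # Series: a single column
df["mass_kg"] = mass / 1000                # derive a new column

heavy = df[df["mass_g"] > 50]              # filter rows with a boolean mask
mean_mass = df.groupby("species")["mass_g"].mean()   # aggregate per group

print(heavy.shape)        # (2, 4)
print(mean_mass["wren"])  # 11.0
```

`pd.read_excel` plays the same role for xlsx files; it needs an Excel engine such as openpyxl installed.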




### cookiecutter
cookiecutter is a command-line utility that helps create directory structures and necessary utility code from project templates.

[cookiecutter](https://github.com/cookiecutter/cookiecutter)

For Data Science projects there is a pre-configured template available.

[cookiecutter-data-science](https://github.com/drivendata/cookiecutter-data-science)


### Streamlit
Streamlit was developed to address the need of data science teams to quickly transform their work (analyses, models) into rich, compelling interactive data web applications. 

[Streamlit](https://www.streamlit.io/)

[Streamlit Documentation](https://docs.streamlit.io/en/stable/)

Streamlit Sharing is currently invite-only as of this presentation. It provides super-easy deployment of Streamlit applications from a public github repo.

[Streamlit Sharing](https://www.streamlit.io/sharing)

### PyCaret

PyCaret is an open-source, low-code Machine Learning library developed by [Moez Ali](https://www.linkedin.com/in/profile-moez/) and team. It provides a set of wrappers and workflows built on widely used open-source libraries that *considerably* reduce the amount of code we need to write to navigate the Machine Learning workflow. Hence the "low-code"!

[PyCaret](https://pycaret.org/)

[PyCaret Guide](https://pycaret.org/guide/)

# PyCaret Dependencies
PyCaret depends on a number of widely used Python libraries for data manipulation, data visualization, exploratory data analysis, ML experiment management, web application development and utilities, and provides a Python interface to help users move through the Data Science / ML lifecycle.

In [None]:
!pip install pycaret==2.2.3



# Cross-Industry Standard Process for Data Mining (CRISP-DM)




## What does the process look like at a 30,000ft level?

![](https://drive.google.com/uc?export=view&id=1oShKT3B1xXSa3o54lboxURaK1He0B2TK)

[Source: Wikipedia](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining)

CRISP-DM is a framework first published in 1999 with the aim of providing a consistent process for data mining across industries. Data Science teams tend to use a hybrid approach marrying Agile principles with a light(er) weight interpretation of CRISP-DM.

As a hobbyist or Citizen Data Scientist it is important to start with Business Understanding instead of rushing to build a model. We need to take into account several considerations to build a value-added ***solution***:

* Business Processes that generate the Data
* Addressing the right problem (5 Whys)
* Success Criteria
* Usage (Who, How, When...), $ at risk
* Cost of Errors
* Budget and Resources 
* Model Performance Evaluation
  - Current Champion Model performance
* Organizational Change Management
* ...



And while CRISP-DM appears to end at Deployment, we know that changes in the competitive landscape and the nature of the business, plus the adoption of our solution will change the data profile over time and reduce the performance of our model. This is often referred to as ***model rot***, and hence our solution needs to be **Monitored**.
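One simple way to monitor is to keep scoring the deployed model on recent labelled data and compare the result with the performance measured at deployment. The following is a minimal sketch under that assumption. The function names, window size and tolerance are illustrative choices, not part of CRISP-DM or of any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def flag_degraded_windows(y_true, y_pred, baseline, window=4, tolerance=0.10):
    """Return (start_index, accuracy) for each window whose accuracy falls
    more than `tolerance` below the `baseline` measured at deployment."""
    flagged = []
    for start in range(0, len(y_true) - window + 1, window):
        acc = accuracy(y_true[start:start + window], y_pred[start:start + window])
        if acc < baseline - tolerance:
            flagged.append((start, acc))
    return flagged


# Made-up labels: the model is right at first, then starts to miss.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1]

print(flag_degraded_windows(y_true, y_pred, baseline=0.90))
# [(4, 0.75), (8, 0.25)]
```

A flagged window is a prompt to investigate, not proof of model rot; small windows are noisy, so a real setup would use larger windows and metrics suited to the business problem.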

And there is **Communication** throughout the process, to technical and non-technical constituents, and as you can imagine, the needs of each constituent have to be considered.



Note that many datathons/challenges and toy datasets will not provide this context and information, but it's great practice to think through it so that we build, deploy and support the right solution.

# Experiment Artifacts

There are many pieces of information that need to be tracked during the model development process:

* The data that was used to train the model algorithm
* Training data schema
* Python library version numbers
* Seeds for pseudo-random number generation
* The set of data transformation steps applied to the data prior to training the model algorithm

For each model algorithm under consideration:
* Model inputs schema
* The set of hyperparameters and their exact settings used to guide the training of the model algorithm
* The set of parameters and their exact values that are learnt by the model algorithm during the training phase
* Charts/graphs and tables showing performance
* Persistent, serialized representation of the model (pickle file)
* Logs, logs and more logs!
* ...

Decisions and choices made along the way need to be captured.


The bottom line is that data preparation, modeling and evaluation generate a lot of information that needs to be catalogued and managed so that we can reproduce our experiments, recreate our models, audit our decisions and compare any Challengers against our current Champion. Doing this manually is time-consuming and error-prone.
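To make the bookkeeping concrete, here is a minimal hand-rolled sketch that gathers several of the items listed above into a single JSON record. The field names, the helper functions and every value are assumptions made up for illustration; this is not how PyCaret or any experiment tracker stores its artifacts.

```python
import hashlib
import json
import platform
import random


def fingerprint(raw_bytes):
    """SHA-256 digest that identifies the exact training data used."""
    return hashlib.sha256(raw_bytes).hexdigest()


def build_experiment_record(train_bytes, schema, seed, transformations,
                            hyperparameters, metrics):
    """Collect experiment artifacts into one JSON-serializable dictionary."""
    return {
        "data_sha256": fingerprint(train_bytes),
        "schema": schema,
        "python_version": platform.python_version(),
        "seed": seed,
        "transformations": transformations,
        "hyperparameters": hyperparameters,
        "metrics": metrics,
    }


seed = 123
random.seed(seed)  # fix the seed so the run can be repeated

# All values below are made up for illustration.
record = build_experiment_record(
    train_bytes=b"age,income,label\n34,52000,1\n51,61000,0\n",
    schema={"age": "int", "income": "int", "label": "int"},
    seed=seed,
    transformations=["drop_missing_rows", "standard_scale"],
    hyperparameters={"max_depth": 4, "n_estimators": 100},
    metrics={"accuracy": 0.87},
)

print(json.dumps(record, indent=2))
```

Library version numbers could be added to the record with `importlib.metadata.version` (Python 3.8+). Writing one such record per run already makes Champion versus Challenger comparisons far less dependent on memory and file naming.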



# Workflow before PyCaret and Streamlit

Prior to PyCaret and Streamlit, a typical hobbyist / Citizen Data Scientist workflow would involve firing up a Jupyter Notebook and entering a cut-and-paste frenzy mode, trying to find applicable code snippets from prior projects and customizing the same in order to meet our immediate needs.

Application of best practices was inconsistent. Experimentation with a variety of model algorithms was restricted to our current codebase or to examples from our favorite blogs and books that we could integrate in time.

Evaluation was limited to charts/graphs and metrics that we had easy access to. Cataloguing artifacts and tracking was based on our personal discipline and ad-hoc naming and filing conventions.

Many models could be and were built, but only a few were evaluated and compared.

Prior to Streamlit it was possible to build a web-based data app, but it required knowledge of how to integrate several component technologies, front-end and back-end. It was harder to share Proof of Value with non-technical constituents without getting them tangled up with the codebase and environment used to produce the solution. The models were not integrated back into the Business Context.

# Workflow with PyCaret and Streamlit

PyCaret may be "low-code" or "lower-code" :) but its depth and breadth of features require "more-thinking"! Yes, it is easy to run it in default mode, but I want to use it to supplement my learning. I find it surfaces gaps in theory and practice more quickly (which is in line with my learning goals) and forces me to be clearer about the choices and options that I select. I spend less time on copy-paste and can invest that time in analyzing and evaluating.
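For reference, "default mode" can be as short as the sketch below. It assumes a classification problem and the PyCaret 2.x API pinned earlier in this notebook; the wrapper `run_default_experiment` and its argument names are my own, not part of PyCaret.

```python
def run_default_experiment(data, target, model_name="best_model", seed=123):
    """Train, compare and save a classifier using PyCaret defaults.

    `data` is a pandas DataFrame and `target` is the name of the label column.
    PyCaret is imported here so this module loads even where it is not installed.
    """
    from pycaret.classification import (
        setup, compare_models, finalize_model, save_model,
    )

    # Infer column types, split train/test and build the preprocessing pipeline.
    # silent=True (a PyCaret 2.x parameter) skips the interactive type confirmation.
    setup(data=data, target=target, session_id=seed, silent=True)

    best = compare_models()          # cross-validate the model library, keep the top one
    final = finalize_model(best)     # refit the winner on the full dataset
    save_model(final, model_name)    # writes <model_name>.pkl, pipeline included
    return final


# Example call, using one of PyCaret's bundled tutorial datasets:
#   from pycaret.datasets import get_data
#   final = run_default_experiment(get_data("juice"), target="Purchase")
```

Every one of these calls hides choices (train/test split, imputation, fold strategy, ranking metric), which is exactly where the "more-thinking" comes in.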

One concern is the footprint of the PyCaret 2.X library and its growth as new features and capabilities are added to it. I had issues deploying a Streamlit app serving a PyCaret-built ML model on a free-tier Heroku dyno, as it exceeded the 500MB slug size limit (it came in at 574MB). I haven't done enough research to understand what is contributing to the size. Several bloggers were able to deploy PyCaret 1.X models to free-tier Heroku dynos earlier in 2020.

Models won't be deployed in an IDE or Jupyter Notebooks - they tend to be situated in containers and web apps. Streamlit helps us place the model in an environment closer to its usage context without requiring the app builder to be fluent in front-end and back-end components and technologies. Yes, eventually a productionized solution will require people fluent in building secure, performant web applications - but Streamlit enables us to independently showcase our work. It forces us to think about the user experience using "business language".
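A minimal sketch of what such an app can look like. The churn scenario, the input fields and the `predict` callable are hypothetical placeholders; only the `st.*` calls are Streamlit's own.

```python
def describe_prediction(label, probability):
    """Turn a raw model output into a sentence in business language."""
    outcome = "likely to churn" if label == 1 else "unlikely to churn"
    return f"This customer is {outcome} (model confidence {probability:.0%})."


def render_app(predict):
    """Draw a small Streamlit page around `predict`, a hypothetical callable
    that maps (tenure_months, monthly_spend) to (label, probability)."""
    import streamlit as st  # imported lazily; only needed when the app runs

    st.title("Customer churn check")
    tenure = st.number_input("Tenure (months)", min_value=0, value=12)
    spend = st.number_input("Monthly spend ($)", min_value=0.0, value=50.0)

    if st.button("Predict"):
        label, probability = predict(tenure, spend)
        st.write(describe_prediction(label, probability))


print(describe_prediction(1, 0.82))
# This customer is likely to churn (model confidence 82%).
```

Saved as `app.py` with a call to `render_app(...)` at the bottom, this would be launched with `streamlit run app.py`. In practice `predict` would wrap a model loaded with PyCaret's `load_model` and scored with `predict_model`.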

Ease of deployment and integration with GitHub for deployment of changes are great selling points for Streamlit Sharing. Pricing for the service is not known at this point. Currently it provides a showcase for Data Science and Machine Learning capabilities and possibilities, similar to what Tableau Public has provided for Visual Analytics.

# Demo

[Demo Github Repo](https://github.com/salilathalye/cwa-dphitech-challenge-54)

[Streamlit App](https://share.streamlit.io/salilathalye/cwa-dphitech-challenge-54/main/src/app.py)

[Data Science and ML Templates](https://github.com/salilathalye/cwa-templates-dsml)

[DPhi](https://dphi.tech/)