
# ***DATA COLLECTION AND DATA MUNGING***

# Data Sources
## Relational Databases or SQL Databases
* Transaction Processing
* Data Warehousing
## NoSQL Databases
* Support Web and some types of analytic applications
## Spreadsheets
* Smaller, semi-informal sources of data
* Used to combine small datasets and make specialized calculations
* Difficult to work with because their structure changes frequently
## Log files
* Generated by applications and devices
* Tend to be semi-structured
* Tools like Microsoft's Log Parser are useful for mapping them to more structured formats.
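
A minimal sketch of mapping semi-structured log lines to structured records, using only Python's standard `re` module. The log format, the field names, and the sample lines are assumptions made up for illustration; real application logs will need their own pattern.

```python
import re

# Hypothetical log format: "<date> <time> <LEVEL> <component>: <message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<component>[\w.]+): "
    r"(?P<message>.*)"
)

def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None

sample_lines = [
    "2021-03-01 12:00:05 INFO api.predict: scored 120 rows",
    "2021-03-01 12:00:09 ERROR api.predict: model file not found",
    "not a log line",
]

# Keep only the lines that fit the expected structure.
records = [r for r in map(parse_log_line, sample_lines) if r is not None]
```

Each record is a plain dictionary, so the result can be loaded into a tabular structure later on.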
## External data sources
* 3rd party data files
* APIs that are programmatically queried

# Extract Transform Load (ETL) or Filter Extract Reformat
* Once data is in hand, analyze it to determine how it must be filtered and reformatted to meet the modeling requirements.
* Within a single data source:
 * identify relevant attributes
 * filter unnecessary content
 * reformat to modeling needs
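
The three steps for a single data source can be sketched with the standard `csv` module. The column names, the sample rows, and the choice to drop rows with a non-numeric spend are all assumptions for illustration.

```python
import csv
import io

# Hypothetical raw export; column names and values are made up.
RAW_CSV = """customer_id,signup_date,country,notes,monthly_spend
101,2021-01-15,US,called support,42.50
102,2021-02-03,DE,,N/A
103,2021-02-20,US,vip,130.00
"""

def clean_rows(raw_text):
    """Keep relevant attributes, drop unusable rows, and convert types."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            spend = float(row["monthly_spend"])
        except ValueError:
            continue  # filter rows whose spend is missing or non-numeric
        cleaned.append({
            "customer_id": int(row["customer_id"]),
            "signup_month": row["signup_date"][:7],  # reformat to YYYY-MM
            "monthly_spend": spend,
        })
    return cleaned

rows = clean_rows(RAW_CSV)
```

The `country` and `notes` columns are simply not carried over, which is the "identify relevant attributes" step.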

#### Combining Data Sets
* Join on common attributes
* consolidate attributes
* build tabular data structure (DataFrame)
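
A rough sketch of joining two data sets on a common attribute and consolidating the result into flat rows. It uses plain dictionaries to stay dependency-free; in practice a library such as pandas (`DataFrame.merge`) would usually do this work. The `customers` and `orders` records are hypothetical.

```python
# Hypothetical records from two sources that share a "customer_id" attribute.
customers = [
    {"customer_id": 101, "country": "US"},
    {"customer_id": 102, "country": "DE"},
    {"customer_id": 103, "country": "US"},
]
orders = [
    {"customer_id": 101, "amount": 20.0},
    {"customer_id": 101, "amount": 35.0},
    {"customer_id": 103, "amount": 12.5},
]

def inner_join_with_totals(customers, orders, key="customer_id"):
    """Join on a common attribute and consolidate order amounts per customer."""
    totals = {}
    for order in orders:
        totals[order[key]] = totals.get(order[key], 0.0) + order["amount"]
    # One flat row per customer that has at least one order (inner join).
    return [
        {**customer, "total_amount": totals[customer[key]]}
        for customer in customers
        if customer[key] in totals
    ]

table = inner_join_with_totals(customers, orders)
```

Customer 102 has no orders, so an inner join leaves that row out; a left join would keep it with an empty total instead.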

# Experimenting with data, features, and algorithms
* Feature Engineering:
 * Question: What additional features can be derived from original attributes?
 * Helpful for some algorithms, such as *Decision Trees*, but often less helpful for Neural Networks, since those can learn non-linear relationships directly from the original attributes.

* Algorithm Selection:
 * Evaluating quality of models built using a variety of algorithms. 
 * Ensemble Method: Combining the results of many different Machine Learning models to create the model of best fit for your use case. 
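
One simple way to combine models is majority voting. The sketch below assumes three hand-written rules standing in for trained models; the feature names (`num_links`, `has_attachment`, `caps_ratio`) are hypothetical, and `caps_ratio` is an example of a derived feature.

```python
from collections import Counter

# Three hypothetical "models", each a simple rule over a dict of features.
def model_a(x):
    return "spam" if x["num_links"] > 3 else "ham"

def model_b(x):
    return "spam" if x["has_attachment"] and x["num_links"] > 1 else "ham"

def model_c(x):
    return "spam" if x["caps_ratio"] > 0.5 else "ham"

def majority_vote(models, x):
    """Return the label predicted by the most models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

message = {"num_links": 5, "has_attachment": False, "caps_ratio": 0.7}
prediction = majority_vote([model_a, model_b, model_c], message)
```

With an odd number of models and two labels there are no ties; other ensemble methods (averaging, stacking, boosting) combine results differently.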


# Testing and Validating Models
* ***Training Data:***
 * Used to build a model from the data that is available at some point in time.
* ***Test Data:***
 * Used in development to validate the model's success on new, unseen data.
 * Used to evaluate the model before deploying it into production.
* ***Validation Data:***
 * Used to measure the quality of predictions made in production (AFTER MODEL DEPLOYMENT) so that model drift can be detected.
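
A minimal sketch of holding out test data before training, using only the standard library. The 80/20 ratio, the seed, and the stand-in rows are assumptions; libraries such as scikit-learn provide their own splitting utilities.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split off a held-out test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for 100 labeled examples
train, test = train_test_split(rows)
```

The test rows never take part in model building, so they behave like new and unseen data when the model is evaluated.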


## Building Data Science Models
1. Build Model
2. Evaluate Model
3. Implement Changes to model
4. Try again
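
The build, evaluate, change, retry loop can be sketched with a deliberately trivial model. The data and the candidate thresholds are made up, and for brevity the model is scored on the same data it is tuned on; in practice the evaluation step would use held-out test data.

```python
# Hypothetical labeled data: (score, is_positive) pairs.
data = [(0.1, 0), (0.3, 0), (0.4, 1), (0.6, 1), (0.7, 0), (0.9, 1)]

def build_model(threshold):
    """'Build' a trivial classifier that predicts 1 when score >= threshold."""
    return lambda score: 1 if score >= threshold else 0

def evaluate(model, data):
    """Return the fraction of examples the model labels correctly."""
    return sum(model(score) == label for score, label in data) / len(data)

best_threshold, best_accuracy = None, -1.0
for threshold in [0.2, 0.35, 0.5, 0.8]:  # each pass is one "change and try again"
    accuracy = evaluate(build_model(threshold), data)
    if accuracy > best_accuracy:
        best_threshold, best_accuracy = threshold, accuracy
```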

# Version Control:
* Tracking each version of a program
 * GitHub
 > - Repositories
 > - Adding files
 > - Committing files
 > - Cloning Repositories

# Predictive Model Markup Language (PMML)
* XML standard 
* a machine learning interchange format for describing predictive models
* ELEMENTS OF PMML:
 * Data dictionary
 > - `<DataDictionary numberOfFields="5">`
 > - `<DataField dataType="double" name="sepal_length" optype="continuous">`
 > - `<Interval closure="closedClosed" leftMargin="4.3" rightMargin="7.9">`

 * Transformations
 * Models
 * Post-processing
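
Because PMML is XML, the data dictionary can be read with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a simplified fragment built from the elements listed above. Real PMML files declare an XML namespace and contain many more elements, so the lookups would need adjusting; the function name `read_data_dictionary` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified PMML fragment; the namespace declaration found in real PMML
# files is omitted here to keep the element lookups short.
PMML_FRAGMENT = """
<PMML version="4.4">
  <DataDictionary numberOfFields="1">
    <DataField dataType="double" name="sepal_length" optype="continuous">
      <Interval closure="closedClosed" leftMargin="4.3" rightMargin="7.9"/>
    </DataField>
  </DataDictionary>
</PMML>
"""

def read_data_dictionary(pmml_text):
    """Return {field name: (dataType, optype, (low, high) or None)}."""
    root = ET.fromstring(pmml_text.strip())
    fields = {}
    for field in root.iter("DataField"):
        interval = field.find("Interval")
        bounds = None
        if interval is not None:
            bounds = (float(interval.get("leftMargin")),
                      float(interval.get("rightMargin")))
        fields[field.get("name")] = (field.get("dataType"),
                                     field.get("optype"), bounds)
    return fields

fields = read_data_dictionary(PMML_FRAGMENT)
```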

BENEFITS:

* Building in exploratory tools
* Deploying in production platforms
* Standard description of models
* Access to coefficients

# Agile Development
* Continuous Integration (CI)
 * Frequent Deployment
 * Build and implement an application with small changes
 * Detect problems and make changes SOONER rather than LATER

* CI for Data Science
 * Build model
 * Check in code
 * Build deployment package
 * Deploy

* Jenkins
 * A tool for automated continuous deployment
 * Open-source tool
 * Written in Java
 * Integrates well with version control systems like Git

* Jenkins Pipeline
 * Delivery pipeline in code
 * supports multiple steps
 * defines execution environments
 * records test results
 * deploys
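
Jenkins Pipelines are defined in a `Jenkinsfile` using a Groovy-based syntax, not Python. The Python sketch below only mimics the idea of a staged pipeline: stages run in order, results are recorded, and a failure stops everything that follows. The stage names and their bodies are hypothetical.

```python
def run_pipeline(stages):
    """Run (name, callable) stages in order; stop at the first failure."""
    results = []
    for name, step in stages:
        try:
            step()
            results.append((name, "passed"))
        except Exception as error:
            results.append((name, f"failed: {error}"))
            break  # later stages (such as deploy) must not run
    return results

# Hypothetical stage bodies standing in for real build, test, and deploy work.
def build_model():
    pass

def run_tests():
    raise AssertionError("accuracy below threshold")

def deploy():
    pass

results = run_pipeline([
    ("build", build_model),
    ("test", run_tests),
    ("deploy", deploy),
])
```

Here the failing test stage prevents the deploy stage from running, which is the "detect problems sooner" benefit of CI.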

# Environments
* Router
* Load Balancer
* Application Server
* Cache
* Database

> Types of Environments:
 1. Development
 2. Staging:
  * integration testing
 3. Production
  * Canary Deployment:
  > - A rollout that changes the code for only some users. When using multiple application servers, we can deploy a new release to just one of them, so only the users routed to that server run the new model. We can then monitor the error rate and other metrics on that server to determine whether the release has problems. If a load balancer distributes the workload across servers, it can be configured to control the share of traffic each server receives. For example, it can direct only 10% of traffic to the server running the canary deployment. This lets you test a new model without exposing all of your users to a potentially flawed release.
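
The 10% traffic split described for canary deployments is normally configured in the load balancer itself. As a rough illustration of the routing logic, the sketch below hashes a user ID into one of 100 buckets so that each user consistently lands on the same server. The function name, the server labels, and the user ID format are assumptions.

```python
import hashlib

def choose_server(user_id, canary_percent=10):
    """Send a stable share of users to the canary server, the rest to stable."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # deterministic bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"

# Simulate 10,000 hypothetical users and measure the canary share.
assignments = [choose_server(f"user-{i}") for i in range(10_000)]
canary_share = assignments.count("canary") / len(assignments)
```

Hashing instead of picking at random means a user does not bounce between the old and the new model from one request to the next.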

# Security Measures (securing the data science models in production)
* ***Access Controls***
 1. *Authentication*
 > - Confirming the identity of a person, process, or device.
 2. *Authorization*
 > - Rules or a set of actions that declare WHO is allowed to do WHAT within an application. (Make use of existing auth systems.)
* ***Software Development security***
 * Identifying risks and vulnerabilities
 * Encryption
 * Testing security measures
 * data management
* ***Operations Security***
 * separation of duties
 * change management (who can make changes)
 * Log monitoring
 * Audit Reviews
* ***Disaster Recovery***
 1. Backups
 2. Design for high availability (Load Balancers)
 3. Game Day Failures (Fail Fast): try to break the system
 4. Business Continuity Planning
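
The difference between authentication and authorization under Access Controls can be sketched as two separate checks. The in-memory token, role, and permission tables are hypothetical; as noted above, a production system should rely on an existing auth system instead.

```python
import hmac

# Hypothetical in-memory stores, for illustration only.
API_TOKENS = {"alice": "token-a1", "bob": "token-b2"}
ROLES = {"alice": "admin", "bob": "analyst"}
PERMISSIONS = {
    "admin": {"predict", "deploy_model"},
    "analyst": {"predict"},
}

def authenticate(user, token):
    """Authentication: is the caller who they claim to be?"""
    expected = API_TOKENS.get(user)
    # compare_digest avoids leaking information through comparison timing.
    return expected is not None and hmac.compare_digest(expected, token)

def authorize(user, action):
    """Authorization: is this identity allowed to perform the action?"""
    return action in PERMISSIONS.get(ROLES.get(user), set())

def handle_request(user, token, action):
    """Return an HTTP-style status code for the request."""
    if not authenticate(user, token):
        return 401  # identity not confirmed
    if not authorize(user, action):
        return 403  # identity confirmed, action not permitted
    return 200
```

For example, `handle_request("bob", "token-b2", "deploy_model")` passes authentication but fails authorization, because the analyst role only permits `predict`.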

# Performance Monitoring
* Resource Utilization
 * Monitoring the use of CPU, GPU, persistent storage, network resources, etc. (making sure all are sufficient for your needs)
* Service Availability
 * Verify core processes are running
 * Are API endpoints accessible?
 * Are APIs returning non-error responses?
 * Are API calls timing out? 
* Throughput
 * Volume of transactions
 * Backlog of requests
 * Time to deliver results
* Model quality
 * Sample predictions
 * evaluate accuracy
 * watch for model drift
 * identify new training instances
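
One possible way to watch model quality is to sample predictions, compare them with the true outcomes once those are known, and flag a drop against the accuracy measured before deployment. The class name, window size, and tolerance below are assumptions, not a standard API.

```python
from collections import deque

class AccuracyMonitor:
    """Track accuracy over the most recent labeled predictions."""

    def __init__(self, baseline_accuracy, window=100, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong

    def record(self, predicted, actual):
        self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self):
        """Accuracy over the current window, or None before any data arrives."""
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def drift_suspected(self):
        """True when windowed accuracy falls below baseline minus tolerance."""
        current = self.accuracy()
        return current is not None and current < self.baseline - self.tolerance

# Hypothetical sample: 7 correct and 3 wrong predictions against a 0.90 baseline.
monitor = AccuracyMonitor(baseline_accuracy=0.90, window=10)
for predicted, actual in [(1, 1)] * 7 + [(1, 0)] * 3:
    monitor.record(predicted, actual)
```

The misclassified samples that trigger the alert are natural candidates for new training instances.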

# Containers
* Docker Components
 * Docker daemon
 * Docker client
 * Docker images
 * Docker Registries
 > - repositories that manage the storage of docker images

* Docker Architecture
 * Client:
 > - Command-line application used for interacting with Docker daemon
 * Host:
 > - Server that runs the Docker daemon and the containers, which are executing instances of images
 * Registry:
 > - Store of docker images; may be public or private
 > - Docker Hub is a widely used Docker registry

# More Docker
## ***Dockerfile***
* A text file that configures an image, including the base image, packages, network ports, and startup commands

## ***Dockerfile Instructions***
* FROM 
 * sets the base image, which Docker pulls from a registry when it is not available locally
 
 EXAMPLE:
* Base Image
 * ***FROM*** python:3.6

Update OS Packages...
* ***RUN*** apt-get update

Copy Local Files to Container
* ***COPY*** requirements.txt requirements.txt

Make Container Accessible
* ***EXPOSE*** 8888

Execute Script
* ***CMD*** python3 data_sci_code.py


---

# Example Dockerfile:

```
#Start with the 3.6 version of python
FROM python:3.6

#Update the package lists of the operating system
RUN apt-get update

#Copy the list of python packages to install, e.g. numpy, scikit-learn
COPY requirements.txt requirements.txt

#Now install the python packages
RUN pip install -r requirements.txt

#Copy the script into the image so the container can run it
COPY data_sci_code.py data_sci_code.py

#Document that the container listens on port 8888 (publish it with -p at run time)
EXPOSE 8888

#When the container starts, run the script called data_sci_code.py
CMD python3 data_sci_code.py
```

#### Build Docker Image:
`docker build -t devops_example .`
#### Run a Docker Image
`docker run -p 8888:8888 -t devops_example`
