# Table of Contents

- [Machine Learning 101](#machine-learning-101)
- [Machine Learning Framework](#machine-learning-framework)
- [Environment Setup](#environment-setup)

# Machine Learning 101

Machine Learning : Computers learn from data
- Supervised : Classification, Regression
- Unsupervised : Clustering, Association Rule Learning 
- Reinforcement : Skill Acquisition, Real-Time Learning

# Machine Learning Framework

1. Problem definition
   - Supervised or Unsupervised
   - Classification or Regression
2. Data
   - Structured, Unstructured
3. Evaluation
   - What is success for us?
4. Features
   - What do we already know about the data?
5. Modelling
   - Based on our problem and data, what model should we use?
6. Experimentation
   - How could we improve?

### 1. Problem definition

Main types of machine learning
- Supervised Learning
  - Data and labels
  - Classification
    - Binary Classification
    - Multi-class classification
  - Regression : predict a number

- Unsupervised Learning
  - Data without labels


- Transfer Learning
  - reuses what one machine learning model has learned as the starting point for another machine learning model

- Reinforcement Learning
  - a computer program performs actions within a defined space and is rewarded for doing well or punished for doing poorly

### 2. Data

Different types of data
- Structured 
- Unstructured : image, audio, natural language text

- Static : doesn't change over time
- Streaming : data that changes constantly over time



### 3. Evaluation

What defines success for us?
- Classification : Accuracy, Precision, Recall
- Regression : Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE)
- Recommendation : Precision at K
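
To make the metrics above concrete, here is a minimal pure-Python sketch of each one. The function names and the toy labels are illustrative assumptions, not part of the original notes; in practice a library such as scikit-learn provides tested versions of these metrics.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for numeric targets."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse, math.sqrt(mse)


def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommended items that are relevant.

    Divides by k even if fewer than k items were recommended
    (one common convention, assumed here).
    """
    top_k = recommended[:k]
    return sum(item in relevant for item in top_k) / k


# Toy data, made up for illustration only.
print(accuracy([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1]))
print(precision_recall([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1]))
print(regression_errors([3.0, 5.0, 2.0, 7.0], [2.0, 5.0, 4.0, 8.0]))
print(precision_at_k(["a", "b", "c", "d"], {"a", "c", "x"}, k=3))
```

Which metric counts as "success" depends on the problem: recall matters more when missing a positive is costly, precision when false alarms are costly.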

### 4. Features

What do we already know about the data?

- Numerical features
- Categorical features

- Derived features (from feature engineering)
- Feature Coverage = 1 - (# of missing values / # of examples)
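
The feature coverage formula above can be sketched in a few lines of Python. This assumes missing values are represented as `None` or `NaN`; the column names `age` and `city` and their values are hypothetical examples.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def feature_coverage(values):
    """Feature Coverage = 1 - (# of missing values / # of examples)."""
    if not values:
        raise ValueError("need at least one example")
    n_missing = sum(is_missing(v) for v in values)
    return 1 - n_missing / len(values)


# Hypothetical columns, made up for illustration.
columns = {
    "age": [34, None, 29, 41],
    "city": ["Oslo", "Lima", "Pune", "Kyiv"],
}
for name, values in columns.items():
    print(name, feature_coverage(values))
```

A feature with low coverage has many gaps, so it may need imputation or may not be worth using at all.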

### 5. Modelling

Based on our problem and data, what model should we use?

1. Choosing and training a model
2. Tuning a model
3. Model comparison

#### 5.1 Splitting Data

3 sets : training set (70-80%), validation set (10-15%), and test set (10-15%)

Generalization : the ability for a machine learning model to perform well on data it hasn't seen before
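
As a rough sketch of the three-way split described above, the helper below shuffles the examples once and cuts them into 70% training, 15% validation, and 15% test. The function name, the default fractions, and the fixed seed are assumptions chosen for illustration.

```python
import random


def train_val_test_split(examples, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Shuffle once, then cut into training, validation, and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

The model is trained on the training set, tuned on the validation set, and scored once on the test set, which is what gives an honest estimate of generalization.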

#### 5.2 Choosing the model
- Structured Data : try tree-based model first
- Unstructured Data : try deep learning, transfer learning

- Start with a small number of examples to minimize the time between experiments
- Things to remember
  - Some models work better than others on different problems
  - Don't be afraid to try things
  - Start small and build up (add complexity) as you need.

#### 5.3 Tuning
- Things to remember
  - Machine learning models have hyperparameters you can adjust (tuning)
  - A model's first results are not its last
  - Tuning can take place on the training or validation data sets
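
One way to picture tuning is to try several values of a hyperparameter and keep the one that scores best on the validation set. The sketch below does this for `k` in a tiny one-feature k-nearest-neighbours classifier; the model, the helper names, and the toy data are all assumptions made for illustration, not a method taken from these notes.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def knn_accuracy(train, data, k):
    """Accuracy of the k-NN model on (feature, label) pairs in data."""
    hits = sum(knn_predict(train, x, k) == label for x, label in data)
    return hits / len(data)


def tune_k(train, val, candidates):
    """Pick the candidate k with the best validation accuracy."""
    scores = {k: knn_accuracy(train, val, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy (feature, label) pairs, made up for illustration.
train = [(1.0, "low"), (1.2, "low"), (1.9, "low"), (2.0, "high"),
         (2.1, "high"), (2.8, "high"), (3.0, "high")]
val = [(1.1, "low"), (1.8, "low"), (2.2, "high"), (2.9, "high")]

best_k, scores = tune_k(train, val, candidates=[1, 3, 5])
print("validation accuracy per k:", scores)
print("chosen k:", best_k)
```

The test set plays no part in this loop, so it can still give an unbiased score for the final model.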

#### 5.4 Model Comparison
- Overfitting vs. Underfitting

- Solving underfitting
  - try a more advanced model
  - adjust hyperparameters to increase model complexity
  - add more (or more informative) features
  - train longer

- Solving overfitting
  - collect more data
  - try a less advanced model
  - reduce the number of features

- Things to remember
  - avoid overfitting and underfitting; aim for generalization
  - keep the test set separate at all costs
  - compare apples to apples (different models, same dataset)
  - one best performance metric does not equal the best model
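
A quick, informal way to spot underfitting or overfitting is to compare a model's score on the training set with its score on the validation set. The sketch below encodes that rule of thumb; the function name and both thresholds (`low_score`, `max_gap`) are arbitrary assumptions for illustration and would need adjusting for a real problem and metric.

```python
def diagnose_fit(train_score, val_score, low_score=0.7, max_gap=0.1):
    """Rough label from training and validation scores (higher is better).

    low_score and max_gap are illustrative thresholds, not standard values.
    """
    if train_score < low_score:
        # Poor even on the data it was trained on.
        return "underfitting"
    if train_score - val_score > max_gap:
        # Good on training data, much worse on unseen data.
        return "overfitting"
    return "reasonable fit"


# Hypothetical scores for three models evaluated on the same data sets.
for name, train_score, val_score in [
    ("model_a", 0.64, 0.47),
    ("model_b", 0.99, 0.80),
    ("model_c", 0.95, 0.93),
]:
    print(name, diagnose_fit(train_score, val_score))
```

Because all three models are scored on the same data sets, the comparison stays apples to apples.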

### 6. Experimentation

How could we improve / what can we try next?
  - Start with a problem
  - Data Analysis: Data, Evaluation, Features
  - Machine learning modelling: Model 1
  - Experiments: Try model 2

6 Step Machine Learning Framework questions
  - Problem definition: What kind of problems do you face day to day?
  - Data: What kind of data do you use?
  - Evaluation: What do you measure?
  - Features: What are the features of your problems?
  - Modelling: What was the last thing you tested your ability on?

# Environment Setup

Tools We Will Use
- Data Science: 6 Step Machine Learning Framework
- Data Science: [Anaconda](https://www.anaconda.com/), [Jupyter Notebook](https://jupyter.org/)
- Data Analysis: Data, Evaluation and Features
- Data Analysis: [pandas](https://pandas.pydata.org/), [Matplotlib](https://matplotlib.org/), [NumPy](https://numpy.org/)
- Machine Learning: Modelling
- Machine Learning: [TensorFlow](https://www.tensorflow.org/), [PyTorch](https://pytorch.org/), [scikit-learn](https://scikit-learn.org/stable/), [XGBoost](https://xgboost.ai/), [CatBoost](https://catboost.ai/)
- [Elements of AI](https://www.elementsofai.com/)

### Introducing Our Tools

- Steps to learn machine learning [Recall]
  - Create a framework [Done] Refer to Section 3
  - Match to data science and machine learning tools
  - Learn by doing
- Your computer -> Setup Miniconda + Conda for Data Science
  - [Anaconda](https://www.anaconda.com/): Hardware Store = 3GB
  - [Miniconda](https://docs.conda.io/en/latest/miniconda.html): Workbench = 200 MB
  - [Anaconda vs. miniconda](https://stackoverflow.com/questions/45421163/anaconda-vs-miniconda)
  - [Conda](https://docs.conda.io/en/latest/): Personal Assistant
- Conda -> setup the rest of tools
  - Data Analysis: [pandas](https://pandas.pydata.org/), [Matplotlib](https://matplotlib.org/), [NumPy](https://numpy.org/)
  - Machine Learning: [TensorFlow](https://www.tensorflow.org/), [PyTorch](https://pytorch.org/), [scikit-learn](https://scikit-learn.org/stable/), [XGBoost](https://xgboost.ai/), [CatBoost](https://catboost.ai/)

### What is Conda?

- [Anaconda](https://www.anaconda.com/): Software Distributions (Full package, 3GB)
- [Miniconda](https://docs.conda.io/en/latest/miniconda.html): Software Distributions (Essential package, 200 MB)
- [Anaconda vs. miniconda](https://stackoverflow.com/questions/45421163/anaconda-vs-miniconda)
- [Conda](https://docs.conda.io/en/latest/): Package Manager
- Your computer -> Miniconda + Conda -> install other tools
  - Data Analysis: [pandas](https://pandas.pydata.org/), [Matplotlib](https://matplotlib.org/), [NumPy](https://numpy.org/)
  - Machine Learning: [TensorFlow](https://www.tensorflow.org/), [PyTorch](https://pytorch.org/), [scikit-learn](https://scikit-learn.org/stable/), [XGBoost](https://xgboost.ai/), [CatBoost](https://catboost.ai/)
- Conda -> Project 1: sample-project
- Resources
  - [Conda Cheatsheet](conda-cheatsheet.pdf)
  - [Getting started with conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html)
  - [Getting your computer ready for machine learning](https://www.mrdbourke.com/get-your-computer-ready-for-machine-learning-using-anaconda-miniconda-and-conda/)

### Mac Environment Setup

- Resources
  - [Getting Started Anaconda, Miniconda and Conda](https://whimsical.com/BD751gt65nKjAD5i1CNEXU)
  - [Miniconda installers](https://docs.conda.io/en/latest/miniconda.html) - Choose latest pkg version
- Create conda environment: go to the [sample-project](https://github.com/chesterheng/machine-learning-data-science/tree/sample-project) folder
  - `conda create --prefix ./env pandas numpy matplotlib scikit-learn`
- Activate conda environment: `conda activate /Users/xxx/Desktop/sample-project/env`
- List Conda environments: `conda env list`
  - `cd ~/.conda` -> `environments.txt`
- Deactivate conda environment: `conda deactivate`
- Install Jupyter: `conda install jupyter`
- Run Jupyter Notebook: `jupyter notebook`
- Remove packages: `conda remove openpyxl xlrd`
- List all packages: `conda list`
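
After activating the environment, a short Python check can confirm that the packages installed above are importable. This is a minimal sketch using only the standard library; the `REQUIRED` list and the `check_packages` helper are illustrative names, and note that scikit-learn is imported as `sklearn`.

```python
import importlib.util

# Import names for the packages installed above; scikit-learn imports as "sklearn".
REQUIRED = ["pandas", "numpy", "matplotlib", "sklearn"]


def check_packages(names=REQUIRED):
    """Map each import name to True if it can be found in the active environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages().items():
        print(f"{name}: {'ok' if found else 'missing'}")
```

If a package shows as missing, the wrong environment is probably active, or the package still needs to be installed with `conda install`.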

### Sharing

- Share a .yml file of your Conda environment: `conda env export --prefix ./env > environment.yml`
  - [Sharing an environment](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#sharing-an-environment)
- Create an environment called env_from_file from a .yml file: `conda env create --file environment.yml --name env_from_file`
  - [Creating an environment from an environment.yml file](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-from-an-environment-yml-file)