Notes on the free Machine Learning course from Pluralsight, found here.
To see the Python scripts, you first need to have Python 3 and pip installed.
Then install Jupyter and start the Jupyter notebook:
pip install jupyter
jupyter notebook
Machine learning: Building a model from example inputs to make data-driven predictions vs. following strictly static program instructions.
Instead of hand-written "if", "case", "while" and "until" logic, we parse the data into a format we can use and pass this formatted data to an algorithm. The algorithm analyses the data (data analysis) and creates a model that solves the problem based on that data.
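To make the contrast concrete, here is a minimal sketch in plain Python. The readings, the hard-coded cutoff of 120 and the function names are all invented for illustration; they do not come from the course. The "algorithm" simply tries each observed value as a cutoff and keeps the one that classifies the most examples correctly.

```python
# Hypothetical toy data: (glucose reading, developed diabetes?) pairs, invented for illustration.
examples = [(85, False), (92, False), (101, False), (130, True), (145, True), (160, True)]


def static_rule(glucose):
    """Traditional programming: a human picks the cutoff and hard-codes it."""
    return glucose > 120


def train_threshold(data):
    """Tiny 'algorithm': pick the cutoff that classifies the most examples correctly."""
    best_cutoff, best_correct = None, -1
    for cutoff in sorted(value for value, _ in data):
        # Count the examples where "value >= cutoff" agrees with the known label.
        correct = sum((value >= cutoff) == label for value, label in data)
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff


# "Training" produces the model: here the model is just the learned cutoff.
cutoff = train_threshold(examples)


def model(glucose):
    """Data-driven prediction using the cutoff learned from the examples."""
    return glucose >= cutoff
```

If the examples change, `train_threshold` yields a different cutoff without anyone editing the prediction logic, whereas `static_rule` would have to be rewritten by hand.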
- Supervised: Data is labeled and has features, and we know the result we want to obtain for that data.
- Unsupervised: Searches unlabeled data for clusters, finding groups of data that share the same traits.
Supervised | Unsupervised |
---|---|
Value prediction | Identify clusters of like data |
Needs training data containing value being predicted | Data does not contain cluster membership |
Trained model predicts value in new data | Model provides access to data by cluster |
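The difference can be sketched with a few lines of plain Python. The values and labels below are invented, and the two techniques (a nearest-neighbour lookup and a one-dimensional k-means with two clusters) are picked only because they are short; the course may use other algorithms.

```python
# Hypothetical one-feature measurements, invented for illustration.
values = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
labels = ["low", "low", "low", "high", "high", "high"]  # only the supervised case sees these


def predict_nearest(x, train_values, train_labels):
    """Supervised: predict the label of the closest labeled training example."""
    nearest = min(range(len(train_values)), key=lambda i: abs(train_values[i] - x))
    return train_labels[nearest]


def two_means(data, iterations=10):
    """Unsupervised: split unlabeled data into two clusters (1-D k-means with k=2)."""
    centers = [min(data), max(data)]
    for _ in range(iterations):
        clusters = [[], []]
        for x in data:
            # Assign each point to the nearer center.
            index = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            clusters[index].append(x)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters


prediction = predict_nearest(7.2, values, labels)  # needs the labels
clusters = two_means(values)                       # never sees the labels
```

The supervised function returns a predicted value for new data, while the unsupervised one only returns groups; naming those groups is left to whoever inspects them.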
An orchestrated and repeatable pattern which systematically transforms and processes information to create prediction solutions.
1. Ask the right question
2. Preparing data
3. Selecting the algorithm
4. Training the model
5. Testing the model --> If something went wrong, iterate from step 2
- Early steps are most important. Each step depends on previous steps
- Expect to go backwards. Later knowledge affects previous steps
- Data is never as you need it. Data will have to be altered.
- More data is better. More data => better results.
- Don't pursue a bad solution. Reevaluate, fix or quit.
Define end goal, starting point and how to achieve goal.
Predict if a person will develop diabetes
This statement can be improved:
- Define scope (including data sources)
- Define target performance
- Define context for usage
- Define how solution will be created
Detailed:
- Scope and data sources:
- Understand the features in data
- Identify critical features
- Focus on at risk population
- Select data source -> Pima Indian diabetes study is a good source
Using Pima Indian Diabetes data, predict which people will develop diabetes.
- Performance targets:
- Binary result (True or False)
- We want more accuracy than just a coin flip (>50%)
- Genetic differences are a factor
- 70% accuracy is a common target.
Using Pima Indian Diabetes data, predict with 70% or greater accuracy, which people will develop diabetes.
- Context
- Disease prediction
- Medical research practices
- Unknown variations between people
- Likelihood is used
Using Pima Indian Diabetes data, predict with 70% or greater accuracy, which people are likely to develop diabetes.
- Solution creation
- Usually we will use the Machine Learning Workflow to develop the solution.
- Process Pima Indian Data.
- Transform data as required.
Using the Machine Learning Workflow to process and transform Pima Indian Diabetes data to create a prediction model. This model must predict which people are likely to develop diabetes with 70% or greater accuracy.
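The accuracy target in the statement above can be checked with a small helper. This is only a sketch: the outcome and prediction lists are invented for illustration and are not taken from the Pima Indian data, and `TARGET` and `COIN_FLIP` are names chosen here.

```python
def accuracy(predictions, actual):
    """Fraction of predictions that match the actual outcomes."""
    if len(predictions) != len(actual):
        raise ValueError("predictions and actual must have the same length")
    matches = sum(p == a for p, a in zip(predictions, actual))
    return matches / len(actual)


TARGET = 0.70     # performance target from the problem statement
COIN_FLIP = 0.50  # the bar any useful binary model has to clear

# Hypothetical outcomes and model predictions, invented for illustration only.
actual      = [True, False, False, True,  False, True, False, False, True, False]
predictions = [True, False, False, False, False, True, True,  False, True, False]

score = accuracy(predictions, actual)
beats_coin_flip = score > COIN_FLIP
meets_target = score >= TARGET
```

In a real project the score should be computed on data the model did not see during training, otherwise the number says little about new patients.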
- Find data we need
- Inspect and clean data
- Explore data and modify if necessary
- Mold the data to tidy data
Tidy datasets are easy to manipulate, model and visualize, and have a specific structure:
- Each variable is a column
- Each observation is a row
- Each type of observational unit is a table
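As a rough illustration of the three rules above, the sketch below reshapes a "wide" table into a tidy one using only plain Python. The patients, column names and readings are invented, and `to_tidy` is a hypothetical helper, not something from the course.

```python
# Hypothetical "wide" table: one row per patient, one column per visit (invented data).
wide_rows = [
    {"patient": "A", "glucose_visit1": 90, "glucose_visit2": 95},
    {"patient": "B", "glucose_visit1": 140, "glucose_visit2": 150},
]


def to_tidy(rows, id_column, prefix):
    """Reshape so each variable is a column and each observation is a row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column.startswith(prefix):
                tidy.append({
                    id_column: row[id_column],
                    # The visit number was hidden in the column name; make it a variable.
                    "visit": int(column[len(prefix):]),
                    "glucose": value,
                })
    return tidy


tidy_rows = to_tidy(wide_rows, id_column="patient", prefix="glucose_visit")
```

Each entry in `tidy_rows` is now one observation (one patient at one visit), and `visit` and `glucose` are ordinary columns that can be filtered, grouped or plotted directly.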
50-80% of a ML project is spent getting, cleaning, and organizing data.
Where to get it:
- Government databases
- Professional or company data sources
- Your company
- All of the above
Data Rule #1
: The closer the data is to what you are predicting, the better.
Data Rule #2
: Data will never be in the format you need.