# 6 Steps for any Machine Learning Project

##### Questions to Ask

- **Problem definition:** What business problem are we trying to solve? How can it be phrased as a machine learning problem?
- **Data:** If machine learning is about getting insights out of data, what data do we have? How does it match the problem definition? Is our data structured or unstructured? Static or streaming?
- **Evaluation:** What defines success? Is a 95% accurate machine learning model good enough?
- **Features:** What parts of our data are we going to use for our model? How can what we already know influence this?
- **Modelling:** Which model should we choose? How can we improve it? How do we compare it with other models?
- **Experimentation:** What else could we try? Does our deployed model do as we expected? How do the other steps change based on what we’ve found?

##### **Problem Definition**
Think: what problem are we trying to solve?<br>
Is it a supervised or unsupervised learning problem? Is it a classification or regression problem?<br>
Is it a problem where machine learning should be used at all?<br>
Identify whether the problem calls for supervised learning, unsupervised learning, transfer learning or reinforcement learning (this is rare).


##### **Data**
Ask yourself, what kind of data do we have?<br>
Since machine learning involves using algorithms to find and learn patterns in data, data is a requirement for any machine learning project.<br>
Data comes in two main types: structured and unstructured. Structured data is what you'd expect to see in a CSV or Excel file, whereas unstructured data covers things like images, videos and natural language text such as transcribed phone calls.<br>
Within these two data types there's static and streaming data. Static data doesn't change over time. You may have a spreadsheet of patient records in .csv format, which stands for comma-separated values and simply means all of the data is in one file with the values separated by commas. Streaming data changes constantly over time. Stock prices are an example, say if you wanted to predict how a stock price will change.
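
As a minimal sketch of what static, structured data looks like in practice, the snippet below parses a few comma-separated patient records with Python's standard `csv` module. The column names and values are hypothetical and only serve as an illustration; a real project would read them from a `.csv` file.

```python
import csv
import io

# Hypothetical patient records (assumption: these columns are made up for
# illustration). In practice this text would come from a .csv file on disk.
CSV_TEXT = """patient_id,age,blood_pressure,heart_disease
1,54,130,yes
2,41,118,no
3,67,145,yes
"""


def load_rows(text):
    """Parse comma-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


rows = load_rows(CSV_TEXT)
print(len(rows), "rows; columns:", list(rows[0].keys()))
```

Every value comes back as a string, so numeric columns such as `age` still need converting before they are used as model inputs.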


##### **Evaluation**
Ask yourself, what does success mean to us?<br>
Since machine learning is experimental, you could keep going forever, trying to improve your results in search of the perfect model.<br>

Other things you should take into consideration for classification problems:
- **False negatives:** Model predicts negative, actually positive. In some cases, like email spam prediction, false negatives aren’t too much to worry about. But if a self-driving car’s computer vision system predicts no pedestrian when there was one, this is not good.
- **False positives:** Model predicts positive, actually negative. Predicting someone has heart disease when they don’t might seem okay. Better to be safe, right? Not if it negatively affects the person’s lifestyle or sets them on a treatment plan they don’t need.
- **True negatives:** Model predicts negative, actually negative. This is good.
- **True positives:** Model predicts positive, actually positive. This is good.
- **Precision:** What proportion of positive predictions were actually correct? A model that produces no false positives has a precision of 1.0.
- **Recall:** What proportion of actual positives were predicted correctly? A model that produces no false negatives has a recall of 1.0.
- **F1 score:** A combination of precision and recall (their harmonic mean). The closer to 1.0, the better.
- **Receiver operating characteristic (ROC) curve & Area under the curve (AUC):** The ROC curve is a plot of the true positive rate against the false positive rate at different classification thresholds. The AUC metric is the area under the ROC curve. A model whose predictions are 100% wrong has an AUC of 0.0, one whose predictions are 100% right has an AUC of 1.0, and random guessing lands around 0.5.
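
The counts and ratios above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a replacement for a library implementation: the function names and the toy labels (`1` = positive, `0` = negative) are assumptions made for the example.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels, made up for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, tn, fn)          # 3 1 3 1
print(precision, recall, f1)   # 0.75 0.75 0.75
```

Here one false positive and one false negative pull both precision and recall down to 0.75, even though 6 of the 8 predictions are correct.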

Other things you should take into consideration for regression problems:
- **Mean absolute error (MAE):** The average absolute difference between your model's predictions and the actual numbers.
- **Root mean square error (RMSE):** The square root of the average of squared differences between your model's predictions and the actual numbers.
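
Both regression metrics follow directly from their definitions. The sketch below uses plain Python; the function names and the sample numbers are made up for illustration.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference between actual and predicted."""
    mean_squared = sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mean_squared)


# Toy values, made up for illustration only.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(mae)             # 1.0
print(round(rmse, 4))  # 1.2247
```

RMSE comes out larger than MAE here because squaring gives the single error of 2 more weight than the two errors of 1.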


##### **Features**
Think, what do we already know about the data?<br>
Within different types of data there are different kinds of features. For example, to predict whether someone has heart disease, you can use features such as their body weight, blood pressure or chest pain.

##### **Modelling**
Based on our problem and data, what machine learning model should we use?<br>
You need to choose a model that suits the problem. There are plenty of pre-made ML models.<br>
Modelling is divided into three parts: choosing and training a model, tuning a model, and model comparison.<br>

##### **Experiments**
Which model fits our problem best?<br>
You might have to create thousands of models. Some may have insufficient data and some may fail at evaluation. The last model satisfying all the conditions is taken as our ML model, but even that might not be the perfect model.

### **Workflow of a model**
A common data science workflow begins by opening a CSV file in a Jupyter notebook, a tool for building machine learning projects. You then explore the data and perform data analysis using pandas, a Python library for data analysis, and make visualizations such as graphs comparing different data points using matplotlib. Finally, you build machine learning models on the data using scikit-learn and use the patterns they learn to make predictions.

# Different types of Metrics
<img src='./images/Different Types of Metrics.png'/>

# Short Recap
##### **Framework**
- Problem Definition (“What problem are we trying to solve?”)
- Data (“What kinds of data do we have?”)
- Evaluation (“What defines success for us?”)
- Features (“What do we already know about the data?”)
- Modelling (“Based on our problem and data, what model should we use?”)
- Experimentation (“How could we improve/what can we try next?”)


##### **Problem Definition**
When shouldn’t we use machine learning? Can we hand-code the instructions?<br>
Main types of machine learning:
- Supervised learning
- Unsupervised learning
- Transfer learning
- Reinforcement learning


Supervised learning has two main categories: classification and regression. In both cases you know the inputs and outputs (Example: inputs = patients records, outputs = which patients have a disease).

Unsupervised learning works on data that has no labels; clustering is a typical example. You try to find patterns in the data and derive labels from the existing data.

Transfer learning is about reusing what a machine learning model has learned in one domain on a different domain.

Reinforcement learning is about repeating a task in a problem space, and rewarding or punishing a certain outcome. Classic example: teaching a computer how to play chess.

##### **Data**
Structured data vs. unstructured data.

- Structured: tables (rows, columns) - CSV
- Unstructured: images, audio files
- Static data: values don’t change over time (e.g., patient data)
- Streaming data: data updates constantly (e.g., news headlines)


##### **Evaluation**
Which metrics will we use? For example accuracy, precision and recall.

##### **Features**
Use feature variables (input data variables) to predict a target variable.<br>
Feature variables can be numerical or categorical.<br>
Ideal: 100% feature coverage (“complete” sample data, where every sample has a value for every feature).
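
To make the split between feature variables and the target variable concrete, here is a small sketch in plain Python. The records and column names are hypothetical, and one-hot encoding is just one common way (chosen here as an assumption) to turn a categorical feature into numbers.

```python
# Hypothetical records: one numerical feature, one categorical feature,
# and a target variable (all values made up for illustration).
records = [
    {"weight_kg": 82.0, "chest_pain": "typical", "heart_disease": 1},
    {"weight_kg": 64.5, "chest_pain": "none", "heart_disease": 0},
    {"weight_kg": 91.2, "chest_pain": "atypical", "heart_disease": 1},
]


def one_hot(value, categories):
    """Encode a categorical value as a list with a single 1 at its category."""
    return [1 if value == category else 0 for category in categories]


# Fixed, sorted category order so every row is encoded the same way.
categories = sorted({r["chest_pain"] for r in records})

# X holds the feature variables, y holds the target variable.
X = [[r["weight_kg"]] + one_hot(r["chest_pain"], categories) for r in records]
y = [r["heart_disease"] for r in records]

print(categories)  # ['atypical', 'none', 'typical']
print(X[0], y[0])  # [82.0, 0, 0, 1] 1
```

Each row of `X` now contains only numbers, which is the form most models expect.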

##### **Modelling**
3 steps:
- Choosing and training a model
- Tuning a model
- Model comparison

Split your input into three different data sets: training, validation and test.<br>
Some models work better than others on different problems.<br>
Try to minimize feedback time in your experimentation/training. Start with small datasets, build up.<br>
Model comparison = “How will the model perform in the real world?”<br>
This step tests your model on data that it hasn’t seen yet. The model should be able to generalize.<br>
Keep the test set separate at all costs.
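
A three-way split can be sketched with the standard library alone. The 70/15/15 proportions, the fixed seed and the function name below are assumptions made for this example; the notes above do not prescribe particular ratios.

```python
import random


def train_val_test_split(items, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Shuffle a copy of the items and cut it into train, validation and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible

    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)

    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Stand-in for 100 samples (assumption: each integer represents one sample).
samples = list(range(100))
train, val, test = train_val_test_split(samples)
print(len(train), len(val), len(test))  # 70 15 15
```

The model is trained on `train`, tuned against `val`, and `test` is only touched once for the final comparison, so no sample ever appears in more than one set.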

##### **Experimentation**
The framework is an iterative process.