# Steps to design & improve an ML model

https://developers.google.com/machine-learning/glossary
Learn TensorFlow - https://developers.google.com/machine-learning/crash-course

- pipeline
- ensemble

- understand the data

Go through the first and last few rows, run describe(), and check for any discrepancies.
Find missing entries and decide whether they really matter for training.
Plot each feature against the target and see which ones show a major impact.
Use a heatmap to understand the correlation among the features.
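
As a rough sketch of these checks, the snippet below uses only the Python standard library on a made-up table (the column names `rooms`, `age`, `price` and all values are hypothetical). With pandas you would normally reach for `head()`, `tail()`, `describe()` and `corr()` instead; `summarize` and `pearson` here are illustrative helpers, not library functions.

```python
import math

# Hypothetical toy rows; None marks a missing entry.
rows = [
    {"rooms": 3, "age": 10, "price": 200.0},
    {"rooms": 4, "age": None, "price": 260.0},
    {"rooms": 2, "age": 35, "price": 150.0},
    {"rooms": 5, "age": 5, "price": 330.0},
    {"rooms": 3, "age": 20, "price": 210.0},
]

def summarize(name):
    """A tiny stand-in for describe(): count, missing, min, max, mean."""
    values = [r[name] for r in rows if r[name] is not None]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# First and last few rows.
print(rows[:2], rows[-2:])

# Per-column summary, including how many entries are missing.
for name in ("rooms", "age", "price"):
    print(name, summarize(name))

# Correlation of one feature with the target.
rooms_vs_price = pearson([r["rooms"] for r in rows], [r["price"] for r in rows])
print("corr(rooms, price) =", round(rooms_vs_price, 3))
```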

- Randomize the data before splitting, to make sure each set has variation
- Split the data into 3 sets - train, validation, test
Splitting the data into 3 sets helps you overcome the problem of overfitting. Training the model on the train set
and then constantly tweaking it against the test set makes the model overfit to the test set. It's better practice to tune the
model on the validation set and then test it on completely unseen data.
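
A minimal sketch of shuffle-then-split, assuming the examples fit in a list. The function name `train_val_test_split`, the 70/15/15 proportions and the seed are illustrative choices, not something these notes prescribe.

```python
import random

def train_val_test_split(examples, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Shuffle a copy of the examples, then cut it into three disjoint sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n = len(shuffled)
    n_test = round(n * test_fraction)
    n_val = round(n * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

data = list(range(100))  # stand-in for real examples
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))
```

Tune hyperparameters against `val` only, and touch `test` once at the end.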

- Feature engineering
It means transforming raw data into a feature vector. Expect to spend significant time doing feature engineering.
Mapping numeric values:
Integer and floating-point data don't need a special encoding because they can be multiplied by a numeric weight.
Mapping categorical values:
Categorical features have a discrete set of possible values. For example, there might be a feature called street_name with options that include: {'Charleston Road', 'North Shoreline Boulevard', 'Shorebird Way', 'Rengstorff Avenue'}
Since models cannot multiply strings by the learned weights, we use feature engineering to convert strings to numeric values.

What to use?
Use one-hot encoding when each example takes exactly one value of the feature (a single element of the vector is 1), and multi-hot encoding when an example can take several values at once (multiple elements are 1).
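
A small sketch of both encodings using the street names from the example above. The helper names `one_hot` and `multi_hot` are made up for illustration.

```python
vocabulary = [
    "Charleston Road",
    "North Shoreline Boulevard",
    "Shorebird Way",
    "Rengstorff Avenue",
]

def one_hot(value, vocab):
    """Vector with a single 1 at the position of `value`."""
    return [1 if v == value else 0 for v in vocab]

def multi_hot(values, vocab):
    """Vector with a 1 at the position of every value present."""
    present = set(values)
    return [1 if v in present else 0 for v in vocab]

print(one_hot("Shorebird Way", vocabulary))
# [0, 0, 1, 0]
print(multi_hot(["Charleston Road", "Rengstorff Avenue"], vocabulary))
# [1, 0, 0, 1]
```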

- Representation: Qualities of Good Features
1. Avoid rarely used feature values
2. Prefer features with clear and obvious meanings
3. Don't mix "magic" values with actual data

For example, suppose a feature holds a floating-point value between 0 and 1. So, values like the following are fine:

    quality_rating: 0.82
    quality_rating: 0.37

However, if a user didn't enter a quality_rating, perhaps the data set represented its absence with a magic value like the following:

    quality_rating: -1

To explicitly mark magic values, create a Boolean feature that indicates whether or not a quality_rating was supplied. Give this Boolean feature a name like is_quality_rating_defined.

In the original feature, replace the magic values as follows:

    For variables that take a finite set of values (discrete variables), add a new value to the set and use it to signify that the feature value is missing.
    For continuous variables, ensure missing values do not affect the model by using the mean value of the feature's data.
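
A sketch of the continuous case, assuming `-1` is the magic value as in the example above. The sample ratings are invented.

```python
MAGIC_MISSING = -1  # sentinel used for "no rating supplied"

ratings = [0.82, 0.37, -1, 0.55, -1]  # hypothetical raw feature values

# Boolean companion feature: was a rating actually supplied?
is_quality_rating_defined = [r != MAGIC_MISSING for r in ratings]

# Mean of the values that were really supplied.
known = [r for r in ratings if r != MAGIC_MISSING]
mean_rating = sum(known) / len(known)

# Replace the magic value with the mean so it does not skew the model.
quality_rating = [r if r != MAGIC_MISSING else mean_rating for r in ratings]

print(is_quality_rating_defined)
print(quality_rating)
```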

- Cleaning the data
Scaling feature values
Scaling means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range (for example, 0 to 1 or -1 to +1)
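
One common way to do this is min-max scaling, sketched below. The function name `min_max_scale` and the sample values are illustrative.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly map values from [min, max] onto [low, high]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant feature carries no range information; map it to `low`.
        return [low for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]

raw = [100, 300, 500, 900]
print(min_max_scale(raw))             # scaled into 0 to 1
print(min_max_scale(raw, -1.0, 1.0))  # scaled into -1 to +1
```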
- Handling extreme outliers
How? Take the log of every value in that feature; if the plot still shows a long tail, cap (clip) the values at a chosen maximum.
Something like the line below, where my_dataframe, my_feature_name and max_value are placeholders you must define yourself (running it as-is raises a NameError):

    clipped_feature = my_dataframe["my_feature_name"].apply(lambda x: min(x, max_value))

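
A standard-library sketch of the two steps together. The feature name `rooms_per_person`, its values and the cap of 2.0 are hypothetical, and `log1p` (log of 1 + x) is used here only so that zero values stay defined.

```python
import math

def log_then_cap(values, cap):
    """Log-transform each value, then clip anything still above `cap`."""
    logged = [math.log1p(v) for v in values]
    return [min(v, cap) for v in logged]

rooms_per_person = [0.5, 1.2, 2.0, 3.5, 55.0]  # the last value is an extreme outlier
capped = log_then_cap(rooms_per_person, cap=2.0)
print(capped)
```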

- Binning
Group a numeric feature's values into bins (ranges) to understand how the feature impacts the target.
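
A sketch of binning one feature and comparing the average target per bin. The bin boundaries, ages and prices are invented for illustration.

```python
from bisect import bisect_right
from collections import defaultdict

boundaries = [20, 40, 60]  # bins: <20, 20-39, 40-59, >=60 (hypothetical house ages)
ages = [5, 18, 25, 33, 47, 52, 70]
prices = [300, 280, 250, 240, 200, 190, 150]

# bisect_right counts how many boundaries are <= age, which is the bin index.
by_bin = defaultdict(list)
for age, price in zip(ages, prices):
    by_bin[bisect_right(boundaries, age)].append(price)

# Average target value in each bin.
mean_price_per_bin = {b: sum(p) / len(p) for b, p in sorted(by_bin.items())}
print(mean_price_per_bin)
# {0: 290.0, 1: 245.0, 2: 195.0, 3: 150.0}
```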
- Data scrubbing
Deal with cases like:
Omitted values - For instance, a person forgot to enter a value for a house's age.
Duplicate examples - For example, a server mistakenly uploaded the same logs twice.
Bad labels - For instance, a person mislabeled a picture of an oak tree as a maple.
Bad feature values - For example, someone typed in an extra digit, or a thermometer was left out in the sun.

    Handle them carefully; decide which examples to keep and which to drop.
    Also check these statistics for each feature to spot values that look wrong:
    Maximum and minimum
    Mean and median
    Standard deviation
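
A sketch of basic scrubbing on made-up records: drop exact duplicates, set aside rows with omitted values, then look at summary statistics. A large gap between mean and median hints at a bad value.

```python
import statistics

records = [
    {"id": 1, "house_age": 12},
    {"id": 2, "house_age": None},   # omitted value
    {"id": 3, "house_age": 30},
    {"id": 3, "house_age": 30},     # duplicate example
    {"id": 4, "house_age": 25},
    {"id": 5, "house_age": 2500},   # extra digits typed in
]

# Remove exact duplicates while keeping the original order.
seen, unique = set(), []
for r in records:
    key = (r["id"], r["house_age"])
    if key not in seen:
        seen.add(key)
        unique.append(r)

# Keep only rows where the value was actually entered.
complete = [r for r in unique if r["house_age"] is not None]
ages = [r["house_age"] for r in complete]

stats = {
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),
}
print(stats)
```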
    
**The most important step of all: know your data. Good ML relies on good data.**
    
- Fix missing data

- Use of synthetic features / modified data features
If needed, combine 2 features and see whether the combination impacts the target more than the individual features do.
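
A sketch with invented numbers: a ratio feature (`rooms_per_person`, a hypothetical name) is built from two raw features, and its correlation with the target is compared against theirs. The `pearson` helper is written out here and is not a library call.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical raw features and target.
total_rooms = [100, 200, 300, 400, 500]
population = [50, 200, 100, 400, 125]
price = [200, 110, 310, 90, 400]

# Synthetic feature: ratio of the two raw features.
rooms_per_person = [r / p for r, p in zip(total_rooms, population)]

corr_rooms = pearson(total_rooms, price)
corr_population = pearson(population, price)
corr_ratio = pearson(rooms_per_person, price)
print(round(corr_rooms, 3), round(corr_population, 3), round(corr_ratio, 3))
```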

- Choose the simplest ML model if you are just starting your journey in ML
It is easier to explain and understand.
It's always better practice to compare at least 3 ML models, tune their parameters, and pick the one that best suits your needs.
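
A toy sketch of comparing three candidates on a validation set. The three "models" (a mean baseline, a least-squares line and a nearest-neighbour lookup) and the data are illustrative stand-ins; in practice these would be real library estimators.

```python
def fit_mean(xs, ys):
    """Baseline: always predict the mean of the training targets."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Least-squares line through the training points."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_nearest(xs, ys):
    """Predict the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared error of a fitted model on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical train and validation data.
train_x, train_y = [1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
val_x, val_y = [1.5, 3.5, 5.5], [3.0, 7.1, 10.9]

candidates = {"mean": fit_mean, "linear": fit_linear, "nearest": fit_nearest}
scores = {name: mse(fit(train_x, train_y), val_x, val_y) for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "best:", best)
```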

- check the loss
Use confusion matrix
Check True Positive rates and True Negative rates - See which one is more important for your business.
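
A sketch of the confusion-matrix counts and the two rates for a binary classifier, using invented labels and predictions (1 = positive, 0 = negative).

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
true_positive_rate = tp / (tp + fn)  # share of actual positives caught
true_negative_rate = tn / (tn + fp)  # share of actual negatives caught
print(tp, fp, tn, fn, true_positive_rate, true_negative_rate)
```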
Use Area Under Curve (AUC) for ROC
ROC curve:
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

True Positive Rate
False Positive Rate

True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

$$TPR = \frac{TP}{TP + FN}$$

False Positive Rate (FPR) is defined as follows:

$$FPR = \frac{FP}{FP + TN}$$

An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.

[Figure: a typical ROC curve, plotting TPR against FPR at different classification thresholds]

    AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.

AUC is desirable for the following two reasons:

AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.
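
A sketch that computes ROC points and AUC by hand on invented scores. AUC is computed two ways: as the area under the ROC points (trapezoid rule) and as the fraction of positive/negative pairs that are ranked correctly. All function names here are illustrative.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) at every distinct threshold, from strictest to loosest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

auc_area = trapezoid_area(roc_points(y_true, scores))
auc_rank = auc_by_ranking(y_true, scores)
# Scale invariance: rescaling the scores keeps the ranking, so AUC is unchanged.
auc_scaled = auc_by_ranking(y_true, [10 * s for s in scores])
print(auc_area, auc_rank, auc_scaled)
```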

- Complexity of the model
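Finally, check the complexity of the model: does it really need to be complex?

Calculate the size of the model and use regularization, for example Ridge Regression (L2) or Lasso (L1).
Lasso can drive weights to exactly zero; Ridge only shrinks them towards zero.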

One way to reduce model complexity is to use a regularization function that encourages weights to be exactly zero. For linear models such as regression, a zero weight is equivalent to not using the corresponding feature at all. In addition to avoiding overfitting, the resulting model will be more efficient.

To calculate the model size, we simply count the number of parameters that are non-zero.
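
A sketch of counting non-zero parameters, with one shrinkage step of each kind to show why L1 reduces model size and L2 does not. The weights and the strength `lam` are invented. The soft-thresholding and `w / (1 + lam)` updates are the standard proximal steps for the L1 and L2 penalties, shown here in isolation rather than as a full training loop.

```python
import math

def model_size(weights):
    """Model size as defined above: the number of non-zero parameters."""
    return sum(1 for w in weights if w != 0)

def l1_shrink(weights, lam):
    """Soft-thresholding: weights with |w| <= lam become exactly zero."""
    return [math.copysign(1.0, w) * max(abs(w) - lam, 0.0) for w in weights]

def l2_shrink(weights, lam):
    """L2 shrinkage: every weight gets smaller, but none reaches zero."""
    return [w / (1.0 + lam) for w in weights]

weights = [0.8, -0.05, 0.0, 0.3, 0.02]  # hypothetical linear-model weights
lam = 0.1

after_l1 = l1_shrink(weights, lam)
after_l2 = l2_shrink(weights, lam)
print(model_size(weights), model_size(after_l1), model_size(after_l2))
# 4 2 4
```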