# Machine Learning Practice and Understanding

<table>
    <tbody>
        <tr>
            <td><img src="attachment:1_PQ8tdohapfm-YHlrRIRuOA%5B1%5D.gif" width="400" /></td>
            <td><img src="attachment:1_0Ve21Rildq950wRrlJvdLQ%5B1%5D.gif" width="500" /> </td>
        </tr>
    </tbody>
</table>

<table>
    <thead>
        <tr>
            <th style="text-align: left;">Factors</th>
            <th style="text-align: left;">Example</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>What is the business use case?</td>
            <td>Recommendation, Search, Forecasting, Prediction</td>
        </tr>
        <tr>
            <td>What are the model metrics?</td>
            <td>F1-score &gt; 0.95; RMSE &lt; 15.4</td>
        </tr>
        <tr>
            <td>What kind of data is being used?</td>
            <td>Structured – Numerical, categorical (Tabular, JSON) <br> Unstructured – Image, Text, Audio, Video</td>
        </tr>
        <tr>
            <td>What is the machine learning problem formulation?</td>
            <td>Regression, Classification – multi-class or multi-label, Clustering</td>
        </tr>
        <tr>
            <td>What are the relevant ML and DL models?</td>
            <td>Random Forest, XGBoost, CNNs, Transformers</td>
        </tr>
        <tr>
            <td>What hyperparameter optimization techniques to consider?</td>
            <td>Grid search, Bayesian optimization</td>
        </tr>
        <tr>
            <td>Additional constraints</td>
            <td>Explainability, Fairness, Bias, Privacy</td>
        </tr>
    </tbody>
</table>
<style>
    th, td, tr {
        text-align: left;
    }
</style>
    

**Problem Statement:**  
<br>
**Target and Metrics:**    
<br>



## Import Data and Libraries

In [None]:
# (EDA  ++  Preprocessing)  ->   Feature Selection/Feature Eng    ->   Baseline Modeling   ->   Hyperparameter Tuning  -> Extras

## Exploratory Data Analysis


**1. Statistical Analysis**  
**2. Visualization**  
   > Univariate Analysis **(U)**  
   > Multivariate Analysis **(M)**  
   
<table style="border: 1px solid black;">
    <thead>
        <tr>
            <th>Categorical Column</th>
            <th>Numerical Column</th>
            <th>Univariate/Multivariate</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Boxplot</td>
            <td>Histogram</td>
            <td>U</td>
        </tr>
        <tr>
            <td>Barplot/Countplot</td>
            <td>KDEplot</td>
            <td>U</td>
        </tr>
        <tr>
            <td>Treemap</td>
            <td>Violinplot</td>
            <td>U</td>
        </tr>
        <tr>
            <td>Boxplot with a numerical column on Y-axis</td>
            <td>Scatterplot</td>
            <td>M</td>
        </tr>
        <tr>
            <td>Relplot with a numerical column on Y-axis</td>
            <td>Scatterplot with a categorical column as HUE</td>
            <td>M</td>
        </tr>
        <tr>
            <td>Multicategorical Barplot</td>
            <td>Heatmaps</td>
            <td>M</td>
        </tr>
        <tr>
            <td>Boxenplots</td>
            <td>Lineplots</td>
            <td>M / U</td>
        </tr>
    </tbody>
</table>



<table>
    <tbody>
        <tr>
            <td><img src="attachment:1_IDYToFfz-OPcg3iZYstPXw%5B1%5D.png" width="400" /></td>
            <td><img src="attachment:1_LNSsOkVT0QEkNyQA1Rai1w%5B1%5D.png" width="400" /> </td>
        </tr>
    </tbody>
</table>



In [None]:


# null values
# categories present (what type)
# outliers
# skewed
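
The four checks listed above can be sketched with pandas. This is a minimal illustration on a made-up DataFrame: the column names `age` and `city` and the 1.5 × IQR outlier rule are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; replace with your own DataFrame.
df = pd.DataFrame({
    "age": [22, 25, 27, 30, 31, 35, 120, np.nan],
    "city": ["A", "B", "A", "C", "B", "A", None, "C"],
})

# Null values per column
null_counts = df.isna().sum()

# Categories present (what type): every non-numerical column and its value counts
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
category_counts = {col: df[col].value_counts() for col in categorical_cols}

# Outliers: rows outside 1.5 * IQR of a numerical column
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

# Skewness of numerical columns (positive means a long right tail)
skewness = df.select_dtypes(include="number").skew()

print(null_counts, category_counts, outliers, skewness, sep="\n\n")
```

A strongly positive or negative skew is usually a hint that a log or similar transformation may help in the preprocessing step.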

## Preprocessing

<center> <b>Apply selectively:</b>





<table style="border: 1px solid black;">
    <thead>
        <tr>
            <th>Process</th>
            <th>Techniques</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Imputation</td>
            <td>Fill null values with a central tendency value (mean, median, mode) or with imputer models</td>
        </tr>
        <tr>
            <td>Outlier Handling</td>
            <td>IQR, Z-score</td>
        </tr>
        <tr>
            <td>Grouping</td>
            <td>Binning</td>
        </tr>
        <tr>
            <td>Transformation</td>
            <td>Scaling, Normalization, Log transform</td>
        </tr>
        <tr>
            <td>Categorical Encoding</td>
            <td>OneHot (Nominal), Label (Ordinal), Custom Encoders (TargetEncoding)</td>
        </tr>
        <tr>
            <td>Feature Conversion</td>
            <td>Converting object-typed features (such as datetime or pincode) into meaningful numbers</td>
        </tr>
    </tbody>
</table>
<br>
At the end of preprocessing, you will have a clean dataset.</center>
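
Several of the techniques in the table can be chained with scikit-learn. The sketch below assumes a toy DataFrame with one numerical column (`income`) and one nominal column (`city`); both names, and the choice of median imputation and standard scaling, are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in both columns.
df = pd.DataFrame({
    "income": [30.0, 45.0, np.nan, 60.0, 52.0],
    "city": ["A", "B", "A", np.nan, "C"],
})

# Numerical column: impute with the median, then scale.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Nominal column: impute with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, ["income"]),
        ("cat", categorical_steps, ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

clean = preprocess.fit_transform(df)
print(clean.shape)  # one scaled column plus one column per city category
```

Wrapping the steps in a pipeline means the same imputation and scaling statistics learned on the training data are reused on the test data.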



## Feature Selection

sklearn modules (`sklearn.feature_selection`):
* SelectKBest(score_func=chi2, k=20)
* SelectPercentile
* SelectFromModel(estimator=RandomForestClassifier(), max_features=20)   *e.g. 31 features => 20 best features => best score*
* RFE(estimator=..., n_features_to_select=20)

selection statistics:
1. Classification:
    * chi2
    * anova (f_classif)
    * mutual_info_classif
2. Regression:
    * F-statistic and p-values (f_regression)
    * pearson's correlation (r_regression)
    * mutual_info_regression


<center>At the end of this step, you will have an ideal dataset containing only the most useful features.</center>
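
As a minimal sketch of univariate selection, the example below applies `SelectKBest` with the ANOVA statistic (`f_classif`) to scikit-learn's bundled breast cancer dataset. The dataset and `k=10` are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)  # 30 numerical features

# Score every feature against the target and keep the 10 best.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

kept = selector.get_support(indices=True)
print(X.shape, "->", X_selected.shape)
print("kept feature indices:", kept)
```

`chi2` can be swapped in for `f_classif` when all features are non-negative (for example counts or one-hot columns).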


![1_kpqurK-46RQxCllffLgM3w%5B1%5D.png](attachment:1_kpqurK-46RQxCllffLgM3w%5B1%5D.png)

## Baseline Modeling

Linear Models:
* Linear Regression (regression)
* Logistic Regression (classification)
* LinearSVM

Non-Linear Models:
* Decision Trees
* Naive Bayes (e.g. Gaussian Naive Bayes)
* KNN
* Support Vector Machines (with non-linear kernels)

Ensemble Models:
* Bagged models: Random Forest, ExtraTrees
* Boosted models: AdaBoost, CatBoost


**What is bias?**

Bias is the difference between the average prediction of our model and the correct value we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the problem. This leads to high error on both training and test data.

**What is variance?**

Variance is the variability of the model's prediction for a given data point, which tells us how much the predictions spread when the model is trained on different samples of the data. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn’t seen before. As a result, such models perform very well on training data but have high error rates on test data.

![1_9hPX9pAO3jqLrzt0IE3JzA%5B1%5D.png](attachment:1_9hPX9pAO3jqLrzt0IE3JzA%5B1%5D.png)

**What is Bias Variance Tradeoff?**

If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it is likely to have high variance and low bias. So we need to find the right balance, without overfitting or underfitting the data.

This tradeoff in complexity is why there is a tradeoff between bias and variance. An algorithm can’t be more complex and less complex at the same time.


<img src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Overfitting-underfitting.png?resize=512%2C292&ssl=1" width="600" />
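
The tradeoff can be observed by fitting polynomials of increasing degree to noisy data. The sketch below uses synthetic samples from a sine curve; the data, the noise level and the degrees 1, 4 and 12 are assumptions for illustration only.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)

# Noisy samples from a sine curve (synthetic data).
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 200)


def mse(y_true, y_pred):
    """Mean squared error between two arrays."""
    return float(np.mean((y_true - y_pred) ** 2))


results = {}
for degree in (1, 4, 12):
    # Least-squares polynomial fit on the training data only.
    coefs = P.polyfit(x_train, y_train, degree)
    train_err = mse(y_train, P.polyval(x_train, coefs))
    test_err = mse(y_test, P.polyval(x_test, coefs))
    results[degree] = (train_err, test_err)
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

The straight line (degree 1) is the high-bias case: its error stays high on both sets. Very high degrees typically drive the training error down while the gap to the test error widens, which is the high-variance case.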

In [None]:
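from sklearn.model_selection import train_test_split

# X, y: features and target produced by the previous steps
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = Algorithm()  # placeholder for any estimator, e.g. LogisticRegression()

model.fit(X_train, y_train)  # model learns here

pred = model.predict(X_test)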

In [None]:
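# compare_score: placeholder for any metric function that takes (y_true, y_pred)
compare_score(y_train, model.predict(X_train))  # training error (low)
compare_score(y_test, pred)                     # test error (low)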

In [None]:
# Try about 5 baseline models and keep the ones where
# training score =~ test score (no large gap between the two)
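
A runnable version of this baseline loop might look as follows. It is a sketch under assumptions: the bundled breast cancer dataset, the five estimators and the 80/20 split are illustrative choices, not the notebook's own setup.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Five baseline models; scale-sensitive ones get a StandardScaler in front.
models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # .score() returns accuracy for classifiers
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name:20s} train={scores[name][0]:.3f}  test={scores[name][1]:.3f}")
```

A model whose training score is far above its test score is likely overfitting; models where the two are close are better candidates for tuning.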

# Feature Engineering (optional)

**[Fundamental Techniques of Feature Engineering for Machine Learning](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114)**    
Let's look into the terms:
* **Feature:** An attribute that is useful for your modeling task. An attribute that has no impact on the problem is not part of the problem. Hence, once the "meaningless attributes" are removed from the data, the remaining attributes are the features.
* **Feature Importance:** An estimate of the usefulness of a feature. Features are allocated scores and can then be ranked by those scores. The features with the highest scores can be selected for inclusion in the training dataset, whereas the rest can be ignored. Some complex predictive modeling algorithms like MARS, Random Forest, and Gradient Boosted Machines report the variable importance determined during the model preparation process.
* **Feature Extraction:** The automatic construction of new features from raw data, reducing the dimensionality of the raw observations to a much smaller set that can be modeled. For tabular data, this might include projection methods like Principal Component Analysis and unsupervised clustering methods.
* **Feature Selection:** Going from many features to a few that are useful. Regularization methods like LASSO and ridge regression may also be considered algorithms with feature selection baked in, as they actively seek to remove or discount the contribution of features as part of the model building process.
* **Feature Construction:** The manual construction of new features from raw data. With tabular data, it often means aggregating or combining features to create new features, and decomposing or splitting features to create new features. Examples include splitting date-time columns, reframing numerical columns, and encoding categorical columns.
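
Feature construction, as defined in the last bullet, can be sketched with pandas. The DataFrame and its column names (`order_time`, `price`, `quantity`) are hypothetical and exist only to show one decomposition and one combination.

```python
import pandas as pd

# Hypothetical toy data with a date-time column stored as text.
df = pd.DataFrame({
    "order_time": ["2023-01-15 08:30:00", "2023-06-07 19:45:00", "2023-12-24 23:10:00"],
    "price": [120.0, 80.0, 200.0],
    "quantity": [2, 5, 1],
})

# Decompose: split the date-time column into numerical parts.
ts = pd.to_datetime(df["order_time"])
df["month"] = ts.dt.month
df["hour"] = ts.dt.hour
df["is_weekend"] = (ts.dt.dayofweek >= 5).astype(int)  # Saturday=5, Sunday=6

# Combine: build a new feature from two existing ones.
df["total"] = df["price"] * df["quantity"]

print(df.drop(columns="order_time"))
```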


![fs.png](attachment:fs.png)

## Hyperparameter Tuning


* GridSearchCV
* RandomizedSearchCV
* TuneSearchCV
* Optuna


Hyperparameter tuning takes a lot of time and computation power.

**Steps:**
1. Set up the cross-validation split
2. Define the parameter grid
3. Choose an estimator (model)
4. Fit the search
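
The four steps map directly onto `GridSearchCV`. The sketch below assumes the iris dataset, a decision tree, and a small grid; all three are illustrative choices rather than recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 1. Cross-validation split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 2. Parameter grid (values chosen only for illustration)
param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5]}

# 3. Estimator
estimator = DecisionTreeClassifier(random_state=42)

# 4. Fitting: every grid combination is evaluated on every fold
search = GridSearchCV(estimator, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The cost grows with the number of grid combinations times the number of folds (here 6 × 5 = 30 fits), which is why randomized or Bayesian search is often preferred for larger grids.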


<img src="https://cs.calvin.edu/courses/data/202/ka37/slides/w11/w11d1-cv-review_files/figure-html/cv-anim-.gif" width="500" />


<table>
    <tbody>
        <tr>
            <td><img src="https://dmol.pub/_images/loss-lr.gif" width="600" /></td>
            <td><img src="https://storage.googleapis.com/kaggle-media/learn/images/rFI1tIk.gif" width="600" /> </td>
        </tr>
    </tbody>
</table>




In [None]:
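# Ways to cope with long run times and large data:
# 1. Use a subset of rows (n_rows < total)
# 2. Use an out-of-core library (vaex, dask)
# 3. Use cloud compute (Kaggle, Colab)
# 4. Decrease the size (convert dtypes from 64 bits to 32 bits)

import vaex

df = vaex.read_csv(chunks)  # chunks: placeholder for the CSV path and chunking options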
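
Point 4 can be sketched with plain pandas. The synthetic DataFrame below is an assumption for illustration; on real data, check that the values fit the smaller integer range and that the reduced float precision is acceptable before downcasting.

```python
import numpy as np
import pandas as pd

# Synthetic frame with explicit 64-bit numeric columns.
df = pd.DataFrame({
    "count": np.arange(100_000, dtype="int64"),
    "value": np.linspace(0.0, 1.0, 100_000),  # float64
})
before = df.memory_usage(deep=True).sum()

# Convert 64-bit columns to 32-bit.
df["count"] = df["count"].astype("int32")
df["value"] = df["value"].astype("float32")
after = df.memory_usage(deep=True).sum()

print(f"{before} bytes -> {after} bytes")
```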


# Model Evaluation (performance analysis)

**[20 Popular Machine Learning Metrics - I](https://towardsdatascience.com/20-popular-machine-learning-metrics-part-1-classification-regression-evaluation-metrics-1ca3e282a2ce)**  

1. Classification:
    * classification report
    * accuracy
    * precision
    * recall
    * f1-score
    * ROC-AUC
    
2. Regression:
    * MSE
    * MAE
    * RMSE
    * r2 score
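
Most of the metrics listed above are available in `sklearn.metrics`. The sketch below uses small made-up label and prediction vectors; ROC-AUC is left out because it needs predicted scores or probabilities rather than hard labels.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
)

# Classification: hypothetical true labels and predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")

# Regression: hypothetical true values and predictions.
y_true_r = [3.0, 5.0, 2.0, 7.0]
y_pred_r = [2.5, 5.0, 3.0, 8.0]

mse = mean_squared_error(y_true_r, y_pred_r)
mae = mean_absolute_error(y_true_r, y_pred_r)
rmse = float(np.sqrt(mse))  # RMSE is the square root of MSE
r2 = r2_score(y_true_r, y_pred_r)
print(f"MSE={mse:.4f} MAE={mae:.3f} RMSE={rmse:.2f} R2={r2:.3f}")
```

`sklearn.metrics.classification_report(y_true, y_pred)` prints precision, recall and F1 per class in one table.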

## Explainable AI