![](https://rapidminer.com/wp-content/uploads/2019/01/Automated-Machine-Learning.jpg)
<font size='5' color='blue' align = 'center'>Table of Contents</font> 
<font size='3' color='purple'>
1. [Introduction](#1)
1. [Pandas Profiling](#2)
1. [Data Analysis Baseline Library (dabl)](#3)
    1. [Data Cleaning](#31)
    1. [EDA](#32)
    1. [Model Building](#33)
1. [PyCaret](#4)
    1. [Load dataset](#41)
    1. [Comparing all models](#42)
    1. [Create Model](#43)
    1. [Tune Model](#44)
    1. [Plot Model](#45)
    1. [Predict Model](#46)
    1. [Finalise Model](#47)
    1. [Predict on unseen data](#48)
    1. [Deploy Trained Model on cloud](#49)
    1. [Save & Load Trained Model from cloud](#410)
1. [datasist](#5)
1. [AutoViz](#6) 
1. [missingno](#7) 
1. [Conclusion](#8)
</font>


# 1. Introduction <a id="1"></a> <br>
Automated machine learning (AutoML) represents a fundamental shift in the way organizations of all sizes approach machine learning and data science. Applying traditional machine learning methods to real-world business problems is time-consuming, resource-intensive, and challenging. It requires experts in several disciplines, including data scientists, who are some of the most sought-after professionals in the job market right now.

Machine learning technology is no longer just for technology geeks but is now understood and used by business users. However, the future growth of this technology continues to depend on the availability of skilled ML workers and data science experts. With the current shortage of skilled ML professionals, most businesses have neither the budget nor the resources to invest in a trained team proficient in these technologies.

Automated machine learning changes that, making it easier to build and use machine learning models in the real world by running systematic processes on raw data and selecting models that pull the most relevant information from the data – what is often referred to as “the signal in the noise.” Automated machine learning incorporates machine learning best practices from top-ranked data scientists to make data science more accessible across the organization.

This is made possible by automated machine learning pipeline technology, also known as the AutoML pipeline, shown below:

![](https://docs.microsoft.com/en-us/azure/machine-learning/media/tutorial-auto-train-models/flow2.png)

Automated machine learning can provide the following benefits:

* Improve productivity of data experts by automating any repetitive ML-related tasks and help them focus on other issues.
* Reduce human errors in ML models that arise mainly due to manual steps.
* Make machine learning accessible for all users, thus promoting a decentralized process.

# 2. Pandas Profiling<a id="2"></a><br>
![](https://warehouse-camo.ingress.cmh1.psfhosted.org/e93a5dcd9f413f15f1c575d45b9e7ab8269179d8/68747470733a2f2f70616e6461732d70726f66696c696e672e6769746875622e696f2f70616e6461732d70726f66696c696e672f646f63732f6173736574732f6c6f676f5f6865616465722e706e67)

pandas-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

* **Type inference**: detect the types of columns in a dataframe.
* **Essentials**: type, unique values, missing values
* **Quantile statistics** like minimum value, Q1, median, Q3, maximum, range, interquartile range
* **Descriptive statistics** like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
* **Most frequent values**
* **Histogram**
* **Correlations** highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
* **Missing values** matrix, count, heatmap and dendrogram of missing values
* **Text analysis** learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
* **File and Image analysis** extract file sizes, creation dates and dimensions and scan for truncated images or those containing EXIF information.
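
To make the list above more concrete, here is a minimal sketch that computes a few of the "Essentials" and "Quantile statistics" by hand with plain pandas. It is not how pandas-profiling is implemented; the `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for a real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "age": [22, 35, 58, None, 41],
    "city": ["Oslo", "Lima", "Oslo", None, "Pune"],
})

# "Essentials": type, missing values and distinct values per column.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
    "n_unique": toy.nunique(),  # missing values are not counted as a value
})

# "Quantile statistics" only make sense for the numeric column.
age_quantiles = toy["age"].quantile([0.25, 0.5, 0.75])
age_iqr = age_quantiles.loc[0.75] - age_quantiles.loc[0.25]

print(summary)
print(age_quantiles)
print("IQR:", age_iqr)
```

The profile report gathers these numbers, and many more, for every column at once.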

In [None]:
!pip install pandas-profiling

In [None]:
import pandas as  pd
from pandas_profiling import ProfileReport
train = pd.read_csv('../input/birdsong-recognition/train.csv')

In [None]:
### To Create the Simple report quickly
train.profile_report()

# 3. Data Analysis Baseline Library (dabl) <a id="3"></a> <br>

This project tries to help make supervised machine learning more accessible for beginners, and reduce boilerplate for common tasks.

This library is in very active development, so it’s not recommended for production use.
Development at github.com/amueller/dabl.


The idea behind dabl is to jump-start your supervised learning task. dabl has several tools that make it easy to clean and inspect your data, and create strong baseline models.

Building machine learning models is an inherently iterative task with a human in the loop. Big jumps in performance are often achieved by better understanding of the data and task, and more appropriate features. dabl tries to provide as much insight into the data as possible, and enable interactive analysis.

Many analyses start with the same rote tasks of cleaning and basic data visualization, and initial modeling. dabl tries to make these steps as easy as possible, so that you can spend your time thinking about the problem and creating more interesting custom analyses.

There are two main packages that dabl takes inspiration from and that dabl builds upon: scikit-learn and auto-sklearn. The design philosophies and use-cases are quite different, however.

Scikit-learn provides many essential building blocks, but is built on the idea to do exactly what the user asks for. That requires specifying every step of the processing in detail. dabl on the other hand has a best-guess philosophy: it tries to do something sensible, and then provides tools for the user to inspect and evaluate the results to judge them.

auto-sklearn is completely automatic and black-box. It searches a vast space of models and constructs complex ensembles of high accuracy, taking a substantial amount of computation and time in the process. The goal of auto-sklearn is to build the best model possible given the data. dabl, conversely, tries to enable the user to quickly iterate and get a grasp on the properties of the data at hand and the fitted models.

dabl is meant to support you in the following tasks, in order:
## 3.1 Data Cleaning<a id="31"></a> <br>

**Install dabl**

In [None]:
!pip install dabl

**Import libraries & Load dataset**

In [None]:
import pandas as pd
import dabl
df = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")

Now let’s ask **dabl** what it thinks by cleaning up the data.

**dabl** tries to detect the types of your data and apply appropriate conversions. It also tries to detect potential data quality issues. The field of data cleaning is impossibly broad, and dabl’s approaches are by no means sophisticated. The goal of dabl is to get the data “clean enough” to create useful visualizations and models, and to allow users to perform custom cleaning operations themselves. In particular if the detection of semantic types (continuous, categorical, ordinal, text, etc) fails, the user can provide **type_hints**:

In [None]:
df_clean = dabl.clean(df, verbose=0)

Depending on the verbosity setting, cleaning can tell us a lot about what is happening in the different columns. **On a small dataset we might be able to figure this out quickly from a call to head, but in larger datasets this might be a bit tricky.** For example, dabl can flag dirty columns, such as a mostly numeric column with “?” in it. Such a value is probably a marker for a missing value; we could go back and fix our parsing of the CSV, but let’s continue with what dabl is doing automatically for now. **In dabl, we can also get a best guess of the column types in a convenient format:**

In [None]:
types = dabl.detect_types(df_clean)
print(types) 
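
To give a feel for what semantic type detection involves, here is a rough pandas-only sketch. The helper `guess_type` and its `max_categories` cutoff are hypothetical and much cruder than dabl's own logic; the toy columns merely borrow names from the House Prices data.

```python
import pandas as pd

def guess_type(col: pd.Series, max_categories: int = 10) -> str:
    """Very rough semantic type guess (hypothetical helper, not dabl's logic)."""
    non_null = col.dropna()
    if non_null.empty or non_null.nunique() == 1:
        return "useless"  # empty or constant columns carry no information
    few_values = non_null.nunique() <= max_categories
    if pd.api.types.is_numeric_dtype(non_null):
        # A handful of distinct numbers often encodes categories or ratings.
        return "categorical" if few_values else "continuous"
    return "categorical" if few_values else "free_string"

toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0, 14260.0, 14115.0],
    "OverallQual": [7, 6, 7, 7, 8, 5],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Pave", "Pave"],
    "Utilities": ["AllPub"] * 6,
})

guessed = {name: guess_type(toy[name], max_categories=4) for name in toy.columns}
print(guessed)
```

A heuristic like this will sometimes guess wrong, which is exactly the situation where passing explicit type hints helps.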

Having a very rough idea of the shape of our data, we can now start looking at the actual content. The easiest way to do that is to visualize univariate and bivariate patterns. With plot, we can create plots of the features deemed most important for our task.

## 3.2 EDA<a id="32"></a> <br>

**dabl** provides a high-level interface that summarizes several common high-level plots. For low dimensional datasets, all features are shown; for high dimensional datasets, only the most informative features for the given task are shown. This is clearly not guaranteed to surface all interesting aspects of the data, or to find all data quality issues. However, it will give you quick insight into which features are important, how they interact, and how hard the problem might be. It also allows a good assessment of whether there is any data leakage through spurious representations of the target in the data.

In [None]:
dabl.plot(df, 'SalePrice')

## 3.3 Model Building<a id="33"></a> <br>

Finally, we can find an initial model for our data. Because `SalePrice` is a continuous target, this is a regression problem, so we use the SimpleRegressor (the regression counterpart of the SimpleClassifier), which does all the work for us. It implements the familiar scikit-learn API of fit and predict. Here we use the same interface as before: we pass the whole data frame and specify the target column.

In [None]:
reg = dabl.SimpleRegressor(random_state=0).fit(df, target_col="SalePrice") 

Like the SimpleClassifier, the SimpleRegressor first tries several baseline and instantaneous models, potentially on subsampled data, to get an idea of what a low baseline should be. This again is a good place to surface data leakage, as well as find the most predictive features in the dataset. Both estimators allow specifying data in the scikit-learn-style fit(X, y) with a 1d y and features X, or with X being a dataframe and specifying the target column inside of X as target_col.

Both estimators also perform preprocessing such as missing value imputation and one-hot encoding. 
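
The idea of establishing a low baseline can be sketched with scikit-learn directly. This is not dabl's internal procedure: the synthetic data, the two models and the 5-fold setting are assumptions chosen for illustration, and regressors are used because a target like `SalePrice` is continuous.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data: the target depends linearly on two of three features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

baselines = {
    "predict the mean": DummyRegressor(strategy="mean"),
    "linear regression": LinearRegression(),
}

# Cross-validated R^2 for each baseline; anything fancier has to beat these.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in baselines.items()
}

for name, score in scores.items():
    print(f"{name}: R^2 = {score:.3f}")
```

If a trivial model already scores suspiciously well on real data, that is often a hint of leakage rather than an easy problem.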

# 4. PyCaret<a id="4"></a> <br>
![](https://pycaret.org/wp-content/uploads/2020/04/thumbnail.png)

PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within seconds in your choice of notebook environment.

**Install pycaret**

In [None]:
!pip install pycaret

## 4.1 Load Dataset<a id="41"></a> <br>

To demonstrate PyCaret's capabilities we will use a dataset from UCI called **Default of Credit Card Clients Dataset**. This dataset contains information on default payments, demographic factors, credit data, payment history, and billing statements of credit card clients in Taiwan from April 2005 to September 2005. There are 30,000 samples and 25 features. Short descriptions of each column are as follows:

- **ID:** ID of each client
- **LIMIT_BAL:** Amount of given credit in NT dollars (includes individual and family/supplementary credit)
- **SEX:** Gender (1=male, 2=female)
- **EDUCATION:** (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
- **MARRIAGE:** Marital status (1=married, 2=single, 3=others)
- **AGE:** Age in years
- **PAY_0 to PAY_6:** Repayment status by n months ago (PAY_0 = last month ... PAY_6 = 6 months ago) (Labels: -1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
- **BILL_AMT1 to BILL_AMT6:** Amount of bill statement by n months ago ( BILL_AMT1 = last_month .. BILL_AMT6 = 6 months ago)
- **PAY_AMT1 to PAY_AMT6:** Amount of payment by n months ago ( PAY_AMT1 = last_month .. PAY_AMT6 = 6 months ago)
- **default.payment.next.month:** Default payment (1=yes, 0=no) `Target Column`

In [None]:
import pandas as pd
data=pd.read_csv('../input/default-of-credit-card-clients-dataset/UCI_Credit_Card.csv')
data.head()

In order to demonstrate the predict_model() function on unseen data, a sample of 1500 records has been withheld from the original dataset to be used for predictions. This should not be confused with a train/test split as this particular split is performed to simulate a real life scenario. Another way to think about this is that these 1500 records are not available at the time when the machine learning experiment was performed.

In [None]:
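# Sample 95% of the rows for modeling and drop them by their ORIGINAL index to get the unseen 5%
dataset = data.sample(frac=0.95, random_state=786)
data_unseen = data.drop(dataset.index)
# Reset the indices only after the split, otherwise drop() would remove the wrong rows
dataset = dataset.reset_index(drop=True)
data_unseen = data_unseen.reset_index(drop=True)

print('Data for Modeling: ' + str(dataset.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))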
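
For the withheld records to be truly unseen, the two partitions must not share any rows. A minimal sketch of the split and a sanity check on a toy frame (the frame and the variable names `modeling` and `unseen` are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"ID": range(1, 101)})

# Sample first, drop by the ORIGINAL index, and only then reset the indices.
modeling = toy.sample(frac=0.95, random_state=786)
unseen = toy.drop(modeling.index)
modeling = modeling.reset_index(drop=True)
unseen = unseen.reset_index(drop=True)

# No client ID may appear in both partitions.
overlap = set(modeling["ID"]) & set(unseen["ID"])
print(len(modeling), len(unseen), len(overlap))
```

Resetting the index before calling `drop` would make `drop` remove the first 95 rows by position instead of the sampled ones, so the check above would report an overlap.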

PyCaret's classification module (`pycaret.classification`) is a supervised machine learning module which is used for classifying elements into discrete groups (classes) based on various techniques and algorithms. 

The PyCaret classification module can be used for Binary or Multi-class classification problems. It has over 18 algorithms and 14 plots to analyze the performance of models. Be it hyper-parameter tuning, ensembling or advanced techniques like stacking, PyCaret's classification module has it all.

In [None]:
#import classification module from pycaret
from pycaret.classification import *
exp_clf = setup(data=dataset, target="default.payment.next.month")

Once the setup has been successfully executed it prints the information grid which contains several important pieces of information. Most of the information is related to the pre-processing pipeline which is constructed when `setup()` is executed. The majority of these features are out of scope for the purposes of this tutorial; however, a few important things to note at this stage include:

- **session_id :**  A pseudo-random number distributed as a seed in all functions for later reproducibility. If no `session_id` is passed, a random number is automatically generated that is distributed to all functions. 
<br/>
- **Target Type :**  Binary or Multiclass. The Target type is automatically detected and shown. There is no difference in how the experiment is performed for Binary or Multiclass problems. All functionalities are identical.<br/>
<br/>
- **Label Encoded :**  When the Target variable is of type string (i.e. 'Yes' or 'No') instead of 1 or 0, it automatically encodes the label into 1 and 0 and displays the mapping (0 : No, 1 : Yes) for reference. In this experiment no label encoding is required since the target variable is of type numeric. <br/>
<br/>
- **Original Data :**  Displays the original shape of the dataset. In this experiment (28500, 25) means 28500 samples and 25 features including the target column. <br/>
<br/>
- **Missing Values :**  When there are missing values in the original data this will show as True. For this experiment there are no missing values in the dataset. 
<br/>
<br/>
- **Numeric Features :**  The number of features inferred as numeric. In this dataset, 15 out of 25 features are inferred as numeric. <br/>
<br/>
- **Categorical Features :**  The number of features inferred as categorical. In this dataset, 9 out of 25 features are inferred as categorical. <br/>
<br/>
- **Transformed Train Set :**  Displays the shape of the transformed training set. Notice that the original shape of (28500, 25) is transformed into (5984, 90) for the transformed train set and the number of features has increased to 90 from 25 due to categorical encoding. <br/>
<br/>
- **Transformed Test Set :**  Displays the shape of the transformed test/hold-out set. There are 2566 samples in test/hold-out set. This split is based on the default value of 70/30 that can be changed using the `train_size` parameter in setup. <br/>

Notice how a few tasks that are imperative to perform modeling are automatically handled such as missing value imputation (in this case there are no missing values in the training data, but we still need imputers for unseen data), categorical encoding etc. Most of the parameters in `setup()` are optional and used for customizing the pre-processing pipeline. 
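
Two of the effects described above, the growth in feature count from categorical encoding and the 70/30 split, can be reproduced on a toy frame with pandas and scikit-learn. This is only a sketch of the idea, not PyCaret's actual pipeline; the ten rows are made up and the target is shortened to `default`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000, 50000, 50000, 500000, 100000, 140000, 20000, 200000],
    "EDUCATION": [2, 2, 2, 2, 2, 1, 2, 3, 3, 3],
    "MARRIAGE": [1, 2, 2, 1, 1, 2, 2, 1, 2, 2],
    "default": [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
})

features = toy.drop(columns="default")

# One-hot encoding turns each category level into its own column.
encoded = pd.get_dummies(features, columns=["EDUCATION", "MARRIAGE"])

# 70/30 split, stratified so both parts keep a similar share of defaults.
X_train, X_test, y_train, y_test = train_test_split(
    encoded, toy["default"], train_size=0.7, random_state=0, stratify=toy["default"]
)

print("features before/after encoding:", features.shape[1], encoded.shape[1])
print("train/test shapes:", X_train.shape, X_test.shape)
```

Three raw features become six encoded columns here, which is the same mechanism that takes the credit card data from 25 columns to 90.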

## 4.2 Comparing all models<a id="42"></a> <br>
Comparing all models to evaluate performance is the recommended starting point for modeling once the setup is completed (unless you know exactly what kind of model you need, which is often not the case). This function trains all models in the model library and scores them using stratified cross validation for metric evaluation. The output prints a score grid that shows average Accuracy, AUC, Recall, Precision, F1 and Kappa across the folds (10 by default) of all the available models in the model library.

In [None]:
compare_models()

There you go: one line of code created over 15 models using 10-fold stratified cross validation and evaluated them on the 6 most commonly used classification metrics (Accuracy, AUC, Recall, Precision, F1, Kappa). The score grid printed above highlights the best score for each metric, for comparison purposes only. The grid by default is sorted using 'Accuracy' (highest to lowest) which can be changed by passing the sort parameter. For example **compare_models(sort = 'Recall')** will sort the grid by Recall instead of Accuracy. If you want to change the number of folds from the default value of 10 to a different value then you can use the fold parameter. For example **compare_models(fold = 5)** will compare all models on 5-fold cross validation. Reducing the number of folds will shorten the training time.
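
What such a comparison does under the hood can be approximated with scikit-learn. The sketch below is an assumption-laden stand-in, not PyCaret's implementation: it uses synthetic data, only three candidate models, 5 folds, and leaves out Kappa for brevity.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["accuracy", "roc_auc", "recall", "precision", "f1"]

candidates = {
    "lr": LogisticRegression(max_iter=1000),
    "dt": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}

rows = []
for name, model in candidates.items():
    # Every model is scored on the same stratified folds.
    result = cross_validate(model, X, y, cv=cv, scoring=metrics)
    row = {"model": name}
    for metric in metrics:
        row[metric] = result["test_" + metric].mean()
    rows.append(row)

# Sort by accuracy, highest first, like the default score grid.
score_grid = pd.DataFrame(rows).set_index("model").sort_values("accuracy", ascending=False)
print(score_grid.round(4))
```

Sorting by a different column, for example `recall`, mirrors what the sort parameter does.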
## 4.3 Create model<a id="43"></a> <br>
While compare_models() is a powerful function and often a starting point in any experiment, it does not return any trained models. PyCaret's recommended experiment workflow is to use compare_models() right after setup to evaluate top performing models and finalize a few candidates for continued experimentation. As such, the function that actually allows you to create a model is unimaginatively called **create_model()**.

There are 18 classifiers available in the model library of PyCaret. 

For illustration purposes only, we will consider the following classifiers:

* Logistic Regression('lr')
* Decision Tree Classifier ('dt')
* K Neighbors Classifier ('knn')
* Random Forest Classifier ('rf')

In [None]:
lr = create_model('lr')

In [None]:
dt = create_model('dt')

In [None]:
knn = create_model('knn')

In [None]:
rf = create_model('rf')

Notice that the mean score of all models matches with the score printed in compare_models(). This is because the metrics printed in the compare_models() score grid are the average scores across all CV folds. Similar to compare_models(), if you want to change the fold parameter from the default value of 10 to a different value then you can use the fold parameter. For Example: create_model('dt', fold = 5) will create a Decision Tree Classifier using 5 fold stratified CV.
## 4.4 Tune model<a id="44"></a> <br>
When a model is created using the create_model() function it uses the default hyperparameters. In order to tune hyperparameters, the tune_model() function is used. This function automatically tunes the hyperparameters of a model on a pre-defined search space and scores it using stratified cross validation. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1 and Kappa by fold.
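
The mechanics of tuning over a pre-defined search space can be sketched with scikit-learn's `RandomizedSearchCV`. The search space, the 10 iterations and the 5 folds below are illustrative assumptions, not the values PyCaret uses for a decision tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical search space for a decision tree.
search_space = {
    "max_depth": [2, 3, 4, 6, 8, None],
    "min_samples_leaf": [1, 2, 5, 10, 20],
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=search_space,
    n_iter=10,  # number of sampled hyperparameter combinations
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 4))
```

Because only a sample of the space is evaluated, the tuned model is not guaranteed to beat the default hyperparameters.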

Now let us tune the below models 
* Logistic Regression('lr')
* Decision Tree Classifier ('dt')
* K Neighbors Classifier ('knn')
* Random Forest Classifier ('rf')

In [None]:
# Tune the Logistic regression model
tuned_lr = tune_model('lr')

In [None]:
# Tune the Decision Tree Classifier model
tuned_dt = tune_model('dt')

In [None]:
# Tune the K Neighbors Classifier model
tuned_knn = tune_model('knn')

In [None]:
# Tune the Random Forest Classifier model
tuned_rf = tune_model('rf')

**Note:**

Notice how the results change after tuning: three of the four models improve, while Logistic Regression is unchanged:

* Logistic Regression(Before: 0.7786 , After: 0.7786)
* Decision Tree Classifier (Before: 0.7216 , After: 0.7413)
* K Neighbors Classifier (Before: 0.7355 , After: 0.7772)
* Random Forest Classifier (Before: 0.8015 , After: 0.8103)

## 4.5 Plot Model<a id="45"></a> <br>

Before model finalization, the `plot_model()` function can be used to analyze the performance across different aspects such as AUC, confusion_matrix, decision boundary etc. This function takes a trained model object and returns a plot based on the test / hold-out set. 

There are 15 different plots available.

In [None]:
#Plot LR model: ROC-AUC curve
plot_model(lr)

In [None]:
#Plot tuned LR model: ROC-AUC curve
plot_model(tuned_lr)

In [None]:
#Plot Decision Tree model: ROC-AUC curve
plot_model(dt)

In [None]:
#Plot KNN model: ROC-AUC curve
plot_model(knn)

Another way to analyze the performance of models is to use the **evaluate_model()** function which displays a user interface for all of the available plots for a given model. It internally uses the plot_model() function.

In [None]:
evaluate_model(lr)

In [None]:
#create a tree-based model to interpret it and check feature importance
dt = create_model('dt')
#interpret a model
interpret_model(dt)

In [None]:
#optimize threshold for trained LR model
optimize_threshold(lr)
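
The idea behind optimizing the probability threshold is that the default cut-off of 0.5 is not necessarily the cheapest one when false negatives and false positives have different costs. Here is a minimal NumPy sketch of that idea; the labels, probabilities and the two cost constants are made up, and the sweep is not PyCaret's implementation.

```python
import numpy as np

# Made-up true labels and predicted probabilities of default.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.05, 0.22, 0.31, 0.45, 0.42, 0.55, 0.63, 0.71, 0.15, 0.91])

# Assumed business costs: missing a default hurts 5x more than a false alarm.
COST_FALSE_POSITIVE = 1.0
COST_FALSE_NEGATIVE = 5.0

def total_cost(threshold: float) -> float:
    """Total cost of classifying with the given probability threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    false_pos = int(np.sum((y_pred == 1) & (y_true == 0)))
    false_neg = int(np.sum((y_pred == 0) & (y_true == 1)))
    return false_pos * COST_FALSE_POSITIVE + false_neg * COST_FALSE_NEGATIVE

thresholds = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9
costs = [total_cost(t) for t in thresholds]
best_threshold = thresholds[int(np.argmin(costs))]

print("cost at 0.5:", total_cost(0.5))
print("best threshold:", best_threshold, "with cost", min(costs))
```

On this toy data the cheapest threshold is 0.4 (cost 2.0), compared with a cost of 6.0 at the default 0.5.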

## 4.6 Predict<a id="46"></a> <br>
Now, using our final trained model stored in the tuned_rf variable we will predict against the hold-out sample and evaluate the metrics to see if they are materially different than the CV results.

In [None]:
predict_model(tuned_rf);

## 4.7 Finalise Trained Model<a id="47"></a> <br>
Model finalization is the last step in the experiment. A normal machine learning workflow in PyCaret starts with setup(), followed by comparing all models using compare_models() and shortlisting a few candidate models (based on the metric of interest) to perform several modeling techniques such as hyperparameter tuning, ensembling, stacking etc. This workflow will eventually lead you to the best model for use in making predictions on new and unseen data. The finalize_model() function fits the model onto the complete dataset including the test/hold-out sample (30% in this case). The purpose of this function is to train the model on the complete dataset before it is deployed in production.

Once the model is finalized using finalize_model(), the entire dataset including the test/hold-out set is used for training. 

In [None]:
final_rf = finalize_model(tuned_rf)
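
In scikit-learn terms, finalizing amounts to refitting the same estimator, with the same hyperparameters, on all available rows. A minimal sketch under that assumption (synthetic data, an untuned random forest standing in for the tuned one):

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0, stratify=y
)

# Model selected during the experiment: trained on the 70% training part only.
tuned = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
holdout_accuracy = tuned.score(X_test, y_test)

# "Finalizing": same hyperparameters, refit on 100% of the data.
final = clone(tuned).fit(X, y)

print("hold-out accuracy before finalizing:", round(holdout_accuracy, 4))
```

Once the hold-out rows have been used for training, they can no longer give an honest performance estimate, which is why a separate unseen sample was set aside earlier.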

## 4.8 Predict on unseen data<a id="48"></a> <br>
The predict_model() function is also used to predict on the unseen dataset. The only difference from section 4.6 above is that this time we pass data_unseen through the data parameter. data_unseen is the variable created at the beginning of the tutorial and contains 5% (1500 samples) of the original dataset which was never exposed to PyCaret.

In [None]:
unseen_predictions = predict_model(final_rf, data=data_unseen)
unseen_predictions.head()

The Label and Score columns are added onto the data_unseen set. Label is the prediction and Score is the probability of the prediction. We have now finished the experiment by finalizing the tuned_rf model which is now stored in the final_rf variable. We have also used the model stored in final_rf to predict data_unseen. 
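
Taking that description literally, the two columns can be derived from a classifier's predicted probabilities. The scikit-learn sketch below is an illustration on synthetic data, not PyCaret's code; it assumes Score is the probability of the predicted class.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, y_train, X_new = X[:150], y[:150], X[150:]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_new)  # one column of probabilities per class

predictions = pd.DataFrame({
    "Label": model.classes_[proba.argmax(axis=1)],  # most probable class
    "Score": proba.max(axis=1),                     # its probability
})
print(predictions.head())
```

With two classes, a Score defined this way can never fall below 0.5.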

## 4.9 Deploy Trained Model on cloud<a id="49"></a> <br>

In [None]:
# 'aws' matches the S3 bucket passed in authentication; AWS credentials must already be configured
deploy_model(model = tuned_rf, model_name = 'deploy_lr', platform = 'aws', authentication = {'bucket' : 'pycaret-test'})

## 4.10 Save & Load Trained Model from cloud<a id="410"></a> <br>

The save_model() function stores a trained model under a given name. To load a saved model at a future date in the same or an alternative environment, we would use PyCaret's load_model() function and then easily apply the saved model on new unseen data for prediction.

In [None]:
save_model(tuned_rf, 'tuned_rf_16072020')

In [None]:
# Load the saved model
saved_rf = load_model('tuned_rf_16072020')

In [None]:
# Save the experiment
save_experiment('experiment_16072020')

In [None]:
# Load experiment
saved_experiment = load_experiment('experiment_16072020')
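
Saving and loading a model boils down to serializing the fitted object and reading it back. A minimal round trip with the standard library's `pickle` and a scikit-learn model looks like this; the file name is arbitrary, and the serialization format PyCaret itself uses is not assumed here.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "tuned_model.pkl"

    # Save the fitted model to disk ...
    with open(path, "wb") as handle:
        pickle.dump(model, handle)

    # ... and load it back, as a later session would.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("restored model predicts identically:", same_predictions)
```

Pickled models should only be loaded from trusted sources and with compatible library versions.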

# 5. datasist<a id="5"></a> <br>
![](https://warehouse-camo.ingress.cmh1.psfhosted.org/6572d848c045b008268a4d6ca2617526a102d9b0/68747470733a2f2f726973656e772e6769746875622e696f2f64617461736973742f64617461736973742e706e67)
**datasist** is a Python package providing a fast, abstracted interface to popular and frequently used functions or techniques relating to data analysis, visualization, data exploration, feature engineering, computer vision, NLP, deep learning, modeling, model deployment etc.

In [None]:
!pip install datasist

In [None]:
import pandas as pd
import datasist as ds  #import datasist library
import numpy as np

train_df = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test_df = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

**check_train_test_set**: Checks the distribution of train and test for uniqueness in order to determine the best feature engineering strategy.

In [None]:
ds.structdata.check_train_test_set(train_df, test_df, index=None, col=None)

**describe:** Calculates statistics and information about a data set. Information displayed are shapes, size, number of categorical/numeric/date features, missing values, dtypes of objects etc.

In [None]:
ds.structdata.describe(train_df)

In [None]:
ds.structdata.describe(test_df)

**detect_outliers**: Detect Rows with outliers.

In [None]:
numerical_feats = ds.structdata.get_num_feats(train_df)
ds.structdata.detect_outliers(train_df,80,numerical_feats)
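
One common way to flag rows with outliers is Tukey's IQR rule applied per column, followed by counting how many columns flag each row. The sketch below implements that rule with a hypothetical helper, `rows_with_many_outliers`; it is an assumption about the general technique, not datasist's source code, and the toy values are made up.

```python
from collections import Counter

import numpy as np
import pandas as pd

def rows_with_many_outliers(data: pd.DataFrame, n: int, features) -> list:
    """Return index labels of rows that are outliers in more than n columns."""
    hits = Counter()
    for col in features:
        q1, q3 = np.nanpercentile(data[col], [25, 75])
        step = 1.5 * (q3 - q1)  # Tukey's fence
        mask = (data[col] < q1 - step) | (data[col] > q3 + step)
        hits.update(data.index[mask.to_numpy()])
    return [idx for idx, count in hits.items() if count > n]

toy = pd.DataFrame({
    "LotArea": [8000, 9000, 9500, 10000, 11000, 200000],
    "GrLivArea": [1200, 1500, 1600, 1700, 1800, 9000],
    "YearBuilt": [1990, 1995, 2000, 2003, 2005, 2008],
})

flagged = rows_with_many_outliers(toy, n=1, features=toy.columns)
print(flagged)
```

Under a counting rule like this one, a threshold `n` larger than the number of numeric columns can never flag a row, so the threshold is worth choosing with the column count in mind.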

**display_missing**: Display missing values as a pandas dataframe.

In [None]:
ds.structdata.display_missing(train_df)

In [None]:
ds.structdata.display_missing(test_df)

**get_cat_feats** : Returns the categorical features in a data set

In [None]:
cat_feats = ds.structdata.get_cat_feats(train_df)
cat_feats

**get_num_feats** : Returns the numerical features in a data set

In [None]:
num_feats = ds.structdata.get_num_feats(train_df)
num_feats

In [None]:
get_unique_counts = ds.structdata.get_unique_counts(train_df)
get_unique_counts

In [None]:
ds.visualizations.autoviz(train_df)

In [None]:
all_data, ntrain, ntest = ds.structdata.join_train_and_test(train_df, test_df)
print("New size of combined data {}".format(all_data.shape))
print("Old size of train data: {}".format(ntrain))
print("Old size of test data: {}".format(ntest))

#later splitting after transformations
train = all_data[:ntrain]
test = all_data[ntrain:]
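
Joining train and test before transforming them keeps the two sets consistent, for example when a category only appears in the test set. A pandas-only sketch of the same join-transform-split pattern (toy rows made up for illustration, not datasist's implementation):

```python
import pandas as pd

toy_train = pd.DataFrame({
    "Id": [1, 2, 3],
    "Street": ["Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500],
})
toy_test = pd.DataFrame({"Id": [4, 5], "Street": ["Pave", "Alley"]})

n_train, n_test = len(toy_train), len(toy_test)

# Stack train on top of test; the test rows get NaN for the missing target.
combined = pd.concat([toy_train, toy_test], sort=False).reset_index(drop=True)

# Encode once on the combined frame so both parts share the same columns.
encoded = pd.get_dummies(combined, columns=["Street"])

# Split back by position after the transformation.
train_part = encoded.iloc[:n_train]
test_part = encoded.iloc[n_train:].drop(columns="SalePrice")

print(train_part.columns.tolist())
print(test_part.columns.tolist())
```

Encoding the two frames separately would have produced a `Street_Alley` column in the test set only.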

In [None]:
new_train_df = ds.feature_engineering.drop_missing(train_df,  
                                                    percent=7.0)
ds.structdata.display_missing(new_train_df)

In [None]:
ds.feature_engineering.drop_redundant(new_train_df)

In [None]:
df = ds.feature_engineering.fill_missing_cats(train_df)
ds.structdata.display_missing(df)

In [None]:
df = ds.feature_engineering.fill_missing_num(train_df)
ds.structdata.display_missing(df)

In [None]:
df = ds.feature_engineering.fill_missing_num(df)
df = ds.feature_engineering.log_transform(df,columns=['Id'])

# 6. AutoViz<a id="6"></a> <br>
![](https://github.com/AutoViML/AutoViz/raw/master/logo.png)
Automatically Visualize any dataset, any size with a single line of code.

AutoViz performs automatic visualization of any dataset with one line. Give any input file (CSV, txt or json) and AutoViz will visualize it.

In [None]:
!pip install autoviz

In [None]:
import pandas as pd
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()

In [None]:
sep = ','
target = 'medv'
datapath = ''
filename = 'https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/MASS/Boston.csv'
# dfte is only needed when passing a DataFrame instead of a file, so it is left empty here
dft = AV.AutoViz(datapath+filename, sep=sep, depVar=target, dfte=None, header=0, verbose=2,
                            lowess=False,chart_format='svg',max_rows_analyzed=1500,max_cols_analyzed=30)

# 7. missingno <a id="7"></a> <br>
![](https://storage.googleapis.com/coderzcolumn/static/tutorials/data_science/article_image/missingno%20-%20Visualize%20Missing%20Data%20in%20Python.jpg)
Messy datasets? Missing values? missingno provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allows you to get a quick visual summary of the completeness (or lack thereof) of your dataset. Just pip install missingno to get started.

In the case of a real-world dataset, it is very common that some values in the dataset are missing. We represent these missing values as NaN (Not a Number) values. But to build a good machine learning model our dataset should be complete. That’s why we use some imputation techniques to replace the NaN values with some probable values. But before doing that we need to have a good understanding of how the NaN values are distributed in our dataset.

Missingno library offers a very nice way to visualize the distribution of NaN values. Missingno is a Python library and compatible with Pandas.

In [None]:
!pip install missingno

In [None]:
import missingno as msno
import pandas as pd
import numpy as np

train_df = pd.read_csv('../input/titanic/train.csv')

**Matrix:**

Visualising missing values for a sample of 100 rows. Using this matrix you can very quickly find the pattern of missingness in the dataset.

In [None]:
msno.matrix(train_df.sample(100))

**Heatmap**

The missingno correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another:

In [None]:
msno.heatmap(train_df)
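
Nullity correlation can be computed by hand: turn the frame into a missing/not-missing mask and correlate the columns of that mask. The pandas sketch below shows the idea on made-up Titanic-like rows; dropping columns whose nullity never varies is a choice made here because their correlation is undefined.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22, np.nan, 26, np.nan, 35, 54],
    "Cabin": ["C85", None, "E46", None, "C123", None],
    "Fare": [7.25, 71.3, 7.9, 53.1, 8.05, 51.9],
})

is_missing = toy.isnull()

# Columns that are never (or always) missing have no nullity variance.
varying = is_missing.loc[:, is_missing.nunique() > 1]

# Correlation of the 0/1 missingness indicators.
nullity_corr = varying.astype(int).corr()
print(nullity_corr.round(3))
```

A value close to 1 means the two columns tend to be missing together, while a value close to -1 means one tends to be present when the other is missing.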

**Bar Chart :**

This bar chart gives you an idea about how many missing values are there in each column.

In [None]:
msno.bar(train_df.sample(880))

**Dendrogram:**

The dendrogram uses a hierarchical clustering algorithm (courtesy of scipy) to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.

To interpret this graph, read it from a top-down perspective. Cluster leaves which are linked together at a distance of zero fully predict one another's presence: one variable might always be empty when another is filled, or they might always both be filled or both empty, and so on. In this specific example the dendrogram glues together the variables which are required and therefore present in every record.

Cluster leaves which split close to zero, but not at it, predict one another very well, but still imperfectly. If your own interpretation of the dataset is that these columns actually match, or ought to match, each other in nullity, then the height of the cluster leaf tells you, in absolute terms, how often the records are "mismatched" or incorrectly filed, that is, how many values you would have to fill in or drop, if you are so inclined.

As with matrix, only up to 50 labeled columns will comfortably display in this configuration. However the dendrogram more elegantly handles extremely large datasets by simply flipping to a horizontal configuration.

In [None]:
msno.dendrogram(train_df)

# 8. Conclusion <a id="8"></a> <br>

Automated ML tools thus enable data scientists to improve their productivity, reach insights faster, and shorten time to market. I hope you find this kernel useful and will put the above tools to good use in your day-to-day data science work.

# If you like this kernel, an <font color='red'>UPVOTE</font> is greatly appreciated