# Introducing CyberDeck - A free one click platform to perform end to end Data Science Pipeline


Data science is one of the most beautiful things out there. The sheer ability to extract meaningful information from raw data is a wonder in itself. But being a data scientist comes with its own kinks, and one of the major challenges is writing code. We often write hundreds of lines of code to extract valuable information from data, and even then the work does not scale: as the data changes, the code changes, so we end up rewriting hundreds of lines even though the processes and pipelines remain the same. But what if there were a one-stop platform for doing our regular data science work at the click of a mouse, from data processing and EDA all the way to modelling?

To answer this ever-growing problem, we are developing a **FREE** one-stop community platform for data scientists named **"CyberDeck"**. Every similar platform we have seen costs a lot of money, but not this one. It is not out for release yet, but if we get positive feedback from this gold mine of a community, we will make it a reality soon, tentatively by the end of this year!

To demonstrate the capability of CyberDeck, we are choosing the COVID India dataset. We will do a side-by-side comparison of coding vs. using CyberDeck for a variety of processes. We will perform a thorough Exploratory Data Analysis and then use machine learning to predict the number of deaths. After that, we will demonstrate our Explainable AI section to understand which features impact the predicted number of deaths the most, and do a "WHAT-IF" analysis to see how the prediction would have changed if some feature values were different.

In our previous kernel (https://www.kaggle.com/sagarnildass/convert-your-data-science-hours-to-minutes), we gave more attention to the EDA section of CyberDeck. This time, we will explore the Dashboard section more thoroughly.

As we plan to offer this product for free, we really need to understand whether data scientists need it, and we couldn't think of a better place to ask than Kaggle. So do let us know if you need this product!

As a final note, if you like this demo and want to be a contributor, contact us at sagarnil.das@cyberdeck.in

**You can sign up for the Pre-release here:** https://cyberdeck.in/

You need it? You got it!

**You can see the medium article here:** https://medium.com/analytics-vidhya/convert-your-data-science-hours-to-minutes-with-this-one-method-7089ff2664ff

Let's dive right in!

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import warnings
import plotly.express as px

warnings.filterwarnings('ignore')
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/latest-covid19-india-statewise-data/Latest Covid-19 India Status.csv')

In [None]:
df.head()

# A) Exploratory Data Analysis

## 1. Number of total cases by state

### With Coding

In [None]:
plt.figure(figsize=(8, 10))
sns.barplot(data = df, y="State/UTs", x="Total Cases")

### With CyberDeck

We go to the **Dashboard** section of CyberDeck and select the data


<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/1_dashboard_data_select.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


Then we click the **Open Chart Type** button and select a **Horizontal Bar Chart**


<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/2_select_bar_chart.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


We put the appropriate values in the X and Y axis.


<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/4_select_axes.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


And BOOM! We have now generated the same plot with a few clicks of the mouse!


<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/3_bar_chart.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


So we see that Maharashtra has the highest number of COVID cases.

## 2. Correlation between Active Ratio and Death Ratio

### With Coding

In [None]:
px.scatter(df, x='Active Ratio (%)',y='Death Ratio (%)', color=df['State/UTs'])


### With CyberDeck

We create a new plot **(Scatter Plot)** and select the axes in the same way. 

**NOTE:** All the plots are draggable and resizable


<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/5_scatter_plot_added-1.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


We see that Punjab has the highest Death ratio of 2.74 and Mizoram has the highest active ratio of 18.34.
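Readings like these can also be double-checked numerically instead of by hovering over the chart. The following is a minimal sketch, assuming the column names used in the plotting calls above. `top_by` is a hypothetical helper, and the small frame holds made-up placeholder values, not rows from the dataset.

```python
import pandas as pd


def top_by(frame: pd.DataFrame, column: str, label: str = "State/UTs") -> tuple:
    """Return (label, value) for the row with the largest value in `column`."""
    row = frame.loc[frame[column].idxmax()]
    return row[label], row[column]


# Toy stand-in for df; the values are placeholders for illustration only.
toy = pd.DataFrame({
    "State/UTs": ["A", "B", "C"],
    "Active Ratio (%)": [1.0, 7.5, 0.4],
    "Death Ratio (%)": [2.1, 0.3, 1.2],
})

print(top_by(toy, "Death Ratio (%)"))
print(top_by(toy, "Active Ratio (%)"))
```

In the notebook itself, the same helper could be called as `top_by(df, 'Death Ratio (%)')`.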

## 3. Heatmap of Number of total cases by state

### With Coding

In [None]:
px.density_heatmap(df, y="Total Cases", x="State/UTs", nbinsx=20, nbinsy=20)

### With CyberDeck

We choose a 2D histogram and select **State** on the x-axis and **Total Cases** on the y-axis.


<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/6_2d_hist_added.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


So you can understand how much time we are actually saving with this. We are just dragging and dropping things and plots are automagically generated. Let's move forward.

## 4. Relationship between Total Cases and Discharged

### With Coding

In [None]:
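# relplot is a figure-level function that creates its own figure,
# so a preceding plt.figure() only produces an empty extra figure.
# Size the plot through height and aspect instead.
sns.relplot(x='Total Cases', y='Discharged', hue='State/UTs', data=df, height=8, aspect=1.5)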


### With CyberDeck

We select another scatter plot from the available plots and select **Total Cases** in the x-axis, **Discharged** in the y-axis and **State** as color.


<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/7_another_scatter_added.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


## 5. Relationship between Total Cases, Active, Discharged and Deaths

### With Coding

For this let's plot a scatter matrix.

In [None]:
fig = px.scatter_matrix(df, dimensions=["Total Cases", "Active", "Discharged", "Deaths"], color=df['State/UTs'])
fig.show()

### With CyberDeck

We select a **Scatter Matrix** from the list of available plots. We select "Total Cases", "Active", "Discharged", "Deaths" in the **Dimensions** field, State in the **Color** Field and done!


<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/8_scatter_mat_added.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


## 6. Relationship between Total cases, Active and Deaths

### With Coding

In [None]:
fig = px.scatter(df, x="Total Cases", y="Active", size="Deaths", color=df['State/UTs'], log_x=True, size_max=50)
fig.show()

### With CyberDeck

We select a **Bubble Chart** from the list and select **Total Cases** in x-axis, **Active** in y-axis, **State** in color and **Death** as size. Note that we have resized the charts a little bit.


<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/9_bubble_chart_added.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


From this plot, we see:

1. Maharashtra has the highest number of deaths, but it also has a very low number of active cases.
2. Kerala has the second-highest number of total cases, and its number of active cases is also tremendously high.

## 7. Total Cases by State - Pie chart

### With Coding

In [None]:
fig = px.pie(df, values='Total Cases', names=df['State/UTs'], title='Covid cases (%) in all states of India')
fig.show()

### With CyberDeck

We select a **Pie Chart** from the Chart type and select the appropriate fields.


<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/10_pie_chart_added.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


## 8. EDA Conclusion : Time saver - isn't it!

So you see just how much time we saved in this whole process compared with writing redundant code! This is what CyberDeck is all about. Now let's move on to the next section, where we will train an ML model to predict the number of deaths. I am not going to code in this section because I am lazy and that's why I created CyberDeck :D . But if you want to see a coding vs CyberDeck comparison for Machine Learning and see how much time we can save, I would suggest you look at this kernel: https://www.kaggle.com/sagarnildass/convert-your-data-science-hours-to-minutes

# B) Predicting number of deaths with CyberDeck AutoML

First we go to the AutoML Section of CyberDeck, select the dataset, and choose "Training".

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/11_automl_select_label.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


If you want to go the shortest route, choose the Target variable from the dropdown and hit **"Generate Leaderboard"** and you're done. CyberDeck will run a plethora of well known algorithms in the backend, perform K-Fold cross validation and present you with a leaderboard. On the other hand, you can also choose how to preprocess your dataset by selecting **"AutoML parameters"**. If you hit this, a popup like this will appear.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/13_13_automl_parameters.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


So you can see here we can customize to our heart's content. We can 

1. Select the train size
2. Normalize and Transform the data
3. Perform PCA
4. Ignore low-variance features
5. Ignore outliers
6. Remove multicollinearity
7. Choose Feature selection method
8. Fix if there's any imbalance in the dataset
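For readers who want a feel for what a few of these options correspond to in code, here is a rough scikit-learn sketch. It is an approximation under our own assumptions, not CyberDeck's actual backend: the data is synthetic, and only the train size, low-variance filter, normalization and PCA options are illustrated.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 5 features, the last one constant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 4] = 1.0  # zero-variance column that the filter should drop
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Option 1: select the train size.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

preprocess = Pipeline([
    ("drop_low_variance", VarianceThreshold(threshold=0.0)),  # option 4
    ("normalize", StandardScaler()),                          # option 2
    ("pca", PCA(n_components=3)),                             # option 3
])

# Fit on the training split only, then reuse the fitted steps on the test split.
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
print(X_train_t.shape, X_test_t.shape)
```

Fitting the preprocessing on the training split alone keeps information from the test split out of the scaler and PCA.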

For this demo purpose, we are going to leave everything at its default value and then hit that **Generate Leaderboard** button.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/13_leaderboard.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


CyberDeck trains a plethora of ML algorithms and presents the user with a leaderboard. Here we can see that CatBoost worked best in terms of MAE, MSE, RMSE and R2, but it also took the longest to train. So a user who wants a slightly inferior but faster model can choose one, and everything needed for that decision is visible on this leaderboard. Let’s go with CatBoost for now.
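To make the leaderboard idea concrete, the sketch below builds a small one by hand with scikit-learn: a few regressors, K-Fold cross-validation, and the same metric columns. The model list, the synthetic data and the column names are our assumptions for illustration; this is not how CyberDeck is implemented internally.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data (assumption, stands in for the real dataset).
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=120)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
scoring = {
    "MAE": "neg_mean_absolute_error",
    "MSE": "neg_mean_squared_error",
    "R2": "r2",
}
cv = KFold(n_splits=5, shuffle=True, random_state=42)

rows = []
for name, model in models.items():
    res = cross_validate(model, X, y, cv=cv, scoring=scoring)
    mse = -res["test_MSE"].mean()  # scorers are negated, so flip the sign
    rows.append({
        "Model": name,
        "MAE": -res["test_MAE"].mean(),
        "MSE": mse,
        "RMSE": np.sqrt(mse),  # root of the mean fold MSE (one common convention)
        "R2": res["test_R2"].mean(),
        "Fit time (s)": res["fit_time"].sum(),
    })

# Best (lowest MAE) model first, like a leaderboard.
leaderboard = pd.DataFrame(rows).sort_values("MAE").reset_index(drop=True)
print(leaderboard)
```

The fit-time column is what makes the accuracy vs. speed trade-off visible at a glance.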

During the leaderboard generation, CyberDeck took a subset of the dataset, trained the ML models and did K-Fold cross validation. Once the user finalizes a model and hits the **“Train model”** button, it will take the full dataset and train the model on that. It will also tune the hyperparameters to find the best version of the selected model. The user can also provide the hyperparameters manually with the second button, i.e. **“Tune Hyperparameters”**. Let’s take a look at that.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/14_tune_hyperparameters.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


For now, we will let CyberDeck tune the model hyperparameters. So we will select the **“Catboost Regressor”** and hit the **“Train Model”** button. When the training is complete, the model will be saved as a pickle file, which can later be used for inference. We will also see a **“Training Successful”** notification below the leaderboard.
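The tune, train and save flow described above can be approximated in a few lines. This is a hedged sketch, not CyberDeck's code: we swap in scikit-learn's `GradientBoostingRegressor` for CatBoost to avoid an extra dependency, the search space and synthetic data are our own assumptions, and the file name `model.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.2, size=150)

# Randomized search over a small, illustrative hyperparameter space.
search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=0),
    param_distributions={
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.03, 0.1, 0.3],
        "max_depth": [2, 3, 4],
    },
    n_iter=5,
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X, y)  # refit=True (default) retrains the best setting on all of X, y
best_model = search.best_estimator_

# Save the tuned model as a pickle file and load it back for inference.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name
    with open(path, "wb") as fh:
        pickle.dump(best_model, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

same_predictions = np.allclose(restored.predict(X[:5]), best_model.predict(X[:5]))
print(search.best_params_, same_predictions)
```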


<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/14_training_done.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


# C) Explainable AI

**Explainable AI** is fast becoming a major part of any organization's data science workflow. Shapley values, which are based on a game-theoretic approach, give us tremendous insight into the **WHY** behind a prediction. This in turn helps data scientists generate insights directly or change their workflow. In this section, we will see how this works in CyberDeck. Now that you have got the point of this platform, we will not do any coding and will directly show you how easy it is inside CyberDeck.
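For readers curious about the game-theoretic idea itself, here is a tiny worked example, kept outside the platform. It computes exact Shapley values by averaging each feature's marginal contribution over every ordering. The value function `toy_value` and the feature names `x1`, `x2`, `x3` are invented for illustration; real tools use approximations because this brute-force approach grows factorially with the number of features.

```python
from itertools import permutations


def shapley_values(value_fn, players):
    """Exact Shapley values: mean marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value_fn(with_p) - value_fn(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition):
    """Toy payoff: x1 adds 10, x2 adds 4, both together add 6 more, x3 adds nothing."""
    value = 0.0
    if "x1" in coalition:
        value += 10.0
    if "x2" in coalition:
        value += 4.0
    if "x1" in coalition and "x2" in coalition:
        value += 6.0  # interaction term, split evenly between x1 and x2
    return value


phi = shapley_values(toy_value, ["x1", "x2", "x3"])
print(phi)  # {'x1': 13.0, 'x2': 7.0, 'x3': 0.0}
```

Note that the values add up to 20, the payoff of the full coalition, which is the property that makes Shapley values a fair way to split a prediction among features.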

Inside CyberDeck, we go to the **ML Model Explainer** section, choose the model we want explained and hit **Show Model Explanation**.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/15_model_explain_model_select.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


Immediately, we see that 5 tabs are generated:

1. **Feature Importance**: To see the global level importance of features
2. **Model Summary**: To explore various model metrics and performance
3. **Individual Prediction Analysis**: To deep dive into individual row level and see what factors impacted the outcome
4. **What If Analysis**: To change certain parameters and check how the outcome would have changed
5. **Dependence**: To see the SHAP summary and SHAP dependence plots.

Let's go through them one by one

## 1. Feature importance

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/16_feature_imp.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


We see that the most important features affecting the number of deaths are:

1. Active
2. Death Ratio (%)
3. Active Ratio (%)
4. Discharge Ratio (%)
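A global ranking like this can also be produced in code. The sketch below uses scikit-learn's permutation importance, which is a different technique from whatever CyberDeck uses behind this tab (we do not know its internals), but it answers the same question: which features does the model rely on most? The data and feature names `f0`, `f1`, `f2` are synthetic assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data (assumption): f0 matters most, f1 a little, f2 is pure noise.
rng = np.random.default_rng(0)
feature_names = ["f0", "f1", "f2"]
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Shuffle one column at a time and measure how much the score drops.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
order = np.argsort(result.importances_mean)[::-1]
ranking = [feature_names[i] for i in order]

for i in order:
    print(f"{feature_names[i]}: {result.importances_mean[i]:.3f}")
```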

## 2. Model Performance

The next set of screenshots shows the customizable performance monitoring you have at your disposal at the click of a mouse, e.g. Model Performance, Residuals, etc.



<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/17_model_summary.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


## 3. Individual Prediction analysis

In this tab, we can go to any row number and see which factors most affected the number of deaths due to COVID.

a) In the **Select Index** section, we can select any row number for which we want the analysis. We can also filter out these rows by prediction probability or observed label.

b) In the **Prediction** section, we get the observed label and also the model's prediction probability for the same. 

c) In the **Contributions Plot** section, we see what factors impacted positively (in green) and negatively (in red) for the deaths due to COVID.

d) In the **Partial Dependence** plot, we can see how a particular feature's contribution to the predicted deaths varies across its values.
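The partial dependence idea in item d) is simple enough to compute by hand: fix one feature at a grid of values for every row, and average the model's predictions at each grid value. The following is a minimal sketch on synthetic data with a hypothetical helper `partial_dependence_curve`; it shows the averaged (global) curve and is not CyberDeck's implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def partial_dependence_curve(model, X, feature_idx, grid):
    """Average prediction when column `feature_idx` is forced to each grid value."""
    curve = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # override the feature for every row
        curve.append(model.predict(X_mod).mean())
    return np.array(curve)


# Synthetic, noise-free data (assumption): y = 3 * x0 + x1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + X[:, 1]
model = LinearRegression().fit(X, y)

grid = np.linspace(-2.0, 2.0, 5)
curve = partial_dependence_curve(model, X, feature_idx=0, grid=grid)
slopes = np.diff(curve) / np.diff(grid)
print(curve)
print(slopes)  # close to 3 everywhere for this linear toy model
```

For a linear model the curve is a straight line; for tree ensembles it reveals thresholds and plateaus.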

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/18_individual_preds-1.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


## 4. What-If Analysis

In this section, we can change different feature values for any row and see how the outcome (Number of Deaths) would have changed. For this row, we see that the values are as follows:

1. Active: 1247
2. Death Ratio (%): 1.35
3. Active Ratio (%): 0.38
4. Discharge Ratio (%): 98.27

For this row, we see that the predicted number of deaths is **8437**.


<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/19_what_if_1.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


Let's change the Death Ratio from 1.35 to 5 and see how the predicted number changes.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/20_what_if_2-1.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


We immediately see that the predicted number of deaths jumps from 8437 to 9162. This is very intuitive because as the Death Ratio increases, the number of deaths will also increase.
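In code, a what-if check boils down to copying a row, overriding one or more feature values and predicting again. Here is a minimal sketch with a hypothetical helper `what_if`. The model is a plain linear regression fitted on synthetic values that merely reuse two of the column names; the numbers have nothing to do with the real dataset or with the CatBoost model trained above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def what_if(model, row: pd.Series, changes: dict) -> tuple:
    """Return (baseline prediction, prediction after applying `changes`)."""
    baseline = pd.DataFrame([row])
    modified = baseline.copy()
    for column, value in changes.items():
        modified[column] = value
    return float(model.predict(baseline)[0]), float(model.predict(modified)[0])


# Synthetic training data (assumption), reusing two column names for readability.
rng = np.random.default_rng(1)
train = pd.DataFrame({
    "Active": rng.uniform(0, 5000, 50),
    "Death Ratio (%)": rng.uniform(0.5, 3.0, 50),
})
target = 0.5 * train["Active"] + 1000.0 * train["Death Ratio (%)"]
model = LinearRegression().fit(train, target)

row = train.iloc[0]
before, after = what_if(model, row, {"Death Ratio (%)": 5.0})
print(round(before, 1), round(after, 1))
```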

**NOTE**: The **WHAT-IF** analysis can be a very big part of any business or industry. It allows you to quickly check any hypothesis you might have without actually implementing it, which can be very costly.

## 5. Dependence

In this last tab, you will see two plots. The one on the left is again a global feature importance, but this time it also shows how each variable affects the predicted number of deaths (in a positive or negative way). The one on the right is a SHAP dependence plot, which shows the relation between feature values and SHAP values. This allows you to investigate the general relationship between a feature's value and its impact on the prediction.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/21_dependence_tab.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


## 6. AutoML Conclusion (TL-DC: Too Long, Didn't Code)

We see that in this whole process, we didn't have to write a single line of code, whereas doing the same by hand takes at least hundreds of lines of code. We generated really strong insights from the data via machine learning and explainable AI.

Next, we are going to see the AutoClustering module of CyberDeck.



# D) Auto Clustering

This data is a good candidate for clustering. As it contains multiple important features about the demographic impact of COVID, we will try to cluster it in an unsupervised manner to see whether some data points are similar to each other.

From our Dashboard section, we saw that Maharashtra and Kerala are big outliers. Hence, we will remove these two data points.
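Dropping the two outlier rows is a one-line filter in pandas. The snippet below is a small sketch on a toy frame; the `Total Cases` values are placeholders, not real figures, and in the notebook the same filter would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for df; the numbers are placeholders for illustration only.
toy = pd.DataFrame({
    "State/UTs": ["Maharashtra", "Kerala", "Goa", "Sikkim"],
    "Total Cases": [4, 3, 2, 1],
})

outliers = ["Maharashtra", "Kerala"]

# Keep every row whose state is NOT in the outlier list.
filtered = toy[~toy["State/UTs"].isin(outliers)].reset_index(drop=True)
print(filtered)
```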

## 1. Read the data

We go to the **Auto Clustering** module of CyberDeck and select the dataset. 

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/22_clustering_read_data.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


As soon as we click on the **Read File** button two things will happen.

1. The file will be read.
2. An elbow plot will be generated by a default k-means clustering method to show the user the optimum number of clusters.
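The numbers behind an elbow plot are easy to reproduce: fit k-means for a range of k and record the inertia (within-cluster sum of squares) each time. Below is a minimal sketch on synthetic blobs with three well-separated centers, which is our assumption for illustration and not the COVID data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 clearly separated groups (assumption).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Inertia for k = 1..7; the "elbow" is where the drop flattens out.
inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `inertias` against k gives the familiar curve; here the sharp bend sits at k = 3 because the data was generated with three centers.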

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/23_elbow_plot.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


## 2. Clustering Parameter

Here, we see that the elbow forms around 3 clusters, so we can deduce that 3 is the optimal number of clusters. To choose 3 clusters, we click on the **Clustering Parameter** button. Here we can choose the number of clusters we want along with many other variables. Let's take a look at them.

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/24_clustering_parameter.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


So again, just like in AutoML, we see that the user can select a plethora of parameters for the clustering. For now, we only change the number of clusters to 3 and keep everything else at its default value. CyberDeck offers a variety of clustering algorithms, viz.:

1. K-means clustering
2. K-modes clustering
3. Affinity propagation clustering
4. Mean shift clustering
5. Spectral clustering
6. Agglomerative clustering
7. Density-based spatial clustering (DBSCAN)
8. OPTICS clustering
9. BIRCH clustering
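Several of the algorithms in this list have scikit-learn counterparts, so trying more than one and comparing them is straightforward. The sketch below runs three of them on synthetic blobs and scores each with the silhouette coefficient. The choice of algorithms, data and metric is ours for illustration; we are not claiming this is how CyberDeck wires them up.

```python
from sklearn.cluster import AgglomerativeClustering, Birch, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups (assumption).
X, _ = make_blobs(
    n_samples=90,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

algorithms = {
    "K-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
    "BIRCH": Birch(n_clusters=3),
}

# Silhouette ranges from -1 to 1; higher means tighter, better separated clusters.
scores = {}
for name, algorithm in algorithms.items():
    labels = algorithm.fit_predict(X)
    scores[name] = silhouette_score(X, labels)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```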

We hit the **Run Clustering** button now and wait for the algorithm to finish running. After it is finished, we are presented with the following charts.

## 3. Run Auto Clustering

<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/25_clustering_viz_1.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


1. In the first chart, we see the data reduced to 2 dimensions (via PCA), color coded by cluster, to show what the clusters look like. Here, we see a good separation between the points, which is a good sign.
2. In the second chart, we see how many data points are assigned to each cluster.
3. In the third chart, we see a t-SNE plot in 3 dimensions to understand the separation of the clusters again.
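The data behind these three charts can be sketched with scikit-learn: k-means labels, a 2-component PCA projection, per-cluster counts and a 3-component t-SNE embedding. The data here is synthetic and the parameter choices (such as `perplexity=10`) are our assumptions; plotting is left out to keep the example short.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic 5-feature data with 3 groups (assumption).
X_raw, _ = make_blobs(n_samples=60, centers=3, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X_raw)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Chart 1: 2D PCA projection, to be colored by `labels`.
pca_2d = PCA(n_components=2).fit_transform(X)

# Chart 2: number of points assigned to each cluster.
counts = np.bincount(labels, minlength=3)

# Chart 3: 3D t-SNE embedding; perplexity must stay below the sample count.
tsne_3d = TSNE(n_components=3, perplexity=10, random_state=0).fit_transform(X)

print(pca_2d.shape, counts, tsne_3d.shape)
```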

All these plots are interactive. Upon hovering on the points, we find out that the states are clustered in the following manner:

* Cluster 0: Uttar Pradesh, Madhya Pradesh, Uttarakhand, Bihar etc.
* Cluster 1: Odisha, Telangana, Assam, West Bengal
* Cluster 2: Andhra Pradesh, Karnataka, Tamil Nadu

So we see that neighbouring states often occupy the same cluster in terms of the impact of COVID. This is very intuitive, as the medical services, rate of spread, immunity etc. are likely to be similar for the people of these states.


## 4. Gain additional insights

If we scroll down a little on this page, we see that there is a provision for two scatter charts: 2D and 3D. Users can select different features here, and the points will automatically be colored by cluster number. This gives a better understanding of how different features relate to the clustering.

1. In the first plot, we plot Active vs Deaths (colored by Cluster number)

2. In the second plot, we plot Active vs Discharged vs Deaths (colored by Cluster number)



<div style="width:100%;text-align: center;"> <img align=middle src="https://cyberdeck.in/wp-content/uploads/2021/09/26_clustering_viz_2.png" alt="Heat beating" style="height:400px;margin-top:3rem;"> </div>


## 5. Clustering conclusion

So you can see how many different ways there are to approach the problem at hand, and what's even more beautiful is that every approach opens up a new door of insights and information. All you have to do is think about how you want to approach a problem, which is of fundamental value to data science, and CyberDeck will ease your life as much as it can so that you can spend more time on thinking and less time on coding.

# E) Conclusion

This marks the end of the first demo of the CyberDeck platform that we are actively building right now. We plan to make this a community product for data scientists, by data scientists. We plan to make the first pre-release around January 2022. So if you think that this product can make your life a little bit easier, then don't forget to sign up at https://www.cyberdeck.in/ and we will get right back to you as soon as we can!

That's it for now! Stay tuned until we bring you the next demo of CyberDeck!