<h1><center><font size="10">THE DATA SCIENCE STORY OF INDIA</font></center></h1>

<h1><center>Told through the dataset of Kaggle Machine Learning & Data Science Survey 2021!</center></h1>

<h1><center>All charts are generated using Seaborn!</center></h1>

The population of India is ~1.4 billion!

Only ~35% of the population lives in urban areas.

Only ~50% of the population has access to the Internet.

The survey effectively covers a slice of the Indian population ranging from students in the final years of their undergraduate degree to professionals nearing the retirement age of 60.

With these points in mind, let's deep-dive into the Kaggle Data Science Survey & try to extract insights about the people of India working in the field of Data Science.

I explain the code, charts & insights in markdown text alongside each step for easy understanding.

We start by importing the necessary libraries.

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.ticker as mtick  
import matplotlib.pyplot as plt
from matplotlib import rcParams
%matplotlib inline

To make use of the wide screen, let's widen the Jupyter notebook so that it fits the screen & displays charts better!

In [None]:
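# display and HTML are part of the public IPython.display module;
# importing them from IPython.core.display is deprecated
from IPython.display import display, HTML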
display(HTML("<style>.container { width:100% !important; }</style>"))

Next we import the dataset & examine its columns & shape.

In [None]:
data=pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.info()

We examine the number of survey entries received from India in the next step.

In [None]:
data['Q3'].value_counts()

We can see that out of a total of 25974 responses received for this survey, 7434 are from India.

Since this is a sufficiently large subset (~29%), this notebook focuses on analyzing & understanding the Data Science scene in India.
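
As a quick check of the ~29% figure, the share can be computed directly from the two counts quoted above. This is only a sketch of the arithmetic; the variable names are illustrative and are not part of the original notebook.

```python
# Counts quoted in the text above
total_responses = 25974
india_responses = 7434

# Share of responses that come from India, as a percentage
india_share = india_responses / total_responses * 100
print(f"India share: {india_share:.1f}%")
```

On the real DataFrame, `data['Q3'].value_counts(normalize=True)` should give the same kind of proportion for every country at once.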

The first row after the header in the dataset holds the full text of each survey question rather than a response. I will drop this row.
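
As an alternative sketch, the question-text row could also be skipped while reading the file. The example below uses a tiny in-memory CSV with made-up values in place of the survey file, so the column contents are assumptions; `skiprows` is a standard `pd.read_csv` parameter that takes 0-indexed file line numbers.

```python
import io
import pandas as pd

# Tiny stand-in for the survey file (hypothetical values):
# file line 0 is the header, file line 1 repeats the full question text.
csv_text = (
    "Q1,Q3\n"
    "What is your age (# years)?,In which country do you currently reside?\n"
    "25-29,India\n"
    "30-34,Japan\n"
)

# Option 1: load everything, then drop the first data row (as this notebook does).
# reset_index is added here only so the two results can be compared directly.
raw = pd.read_csv(io.StringIO(csv_text))
dropped = raw.drop(raw.index[:1]).reset_index(drop=True)

# Option 2: skip the question row while reading (file line index 1)
skipped = pd.read_csv(io.StringIO(csv_text), skiprows=[1])

# Both routes should leave the same two response rows
print(dropped.equals(skipped))
```

Skipping the row at load time also lets pandas infer numeric dtypes for columns such as the duration, since the question text no longer forces them to strings.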

In [None]:
data=data.drop(data.index[:1])

To narrow the data down to survey entries from India only, I will drop the rows from other countries in the next step.
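
The same filtering can be expressed with a boolean mask, which some readers find easier to follow than dropping by index. This is a minimal sketch on a toy frame with made-up values; the names `responses` and `india` are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the survey responses (hypothetical values)
responses = pd.DataFrame({
    "Q1": ["18-21", "25-29", "30-34", "22-24"],
    "Q3": ["India", "Japan", "India", "Brazil"],
})

# Keep only the rows where the country column equals 'India'.
# .copy() returns an independent frame, so later column assignments
# do not raise SettingWithCopyWarning.
india = responses[responses["Q3"] == "India"].copy()

print(india["Q3"].value_counts())
```

Both approaches keep the original row labels, so the index of the filtered frame is no longer contiguous.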

In [None]:
data.drop(data[data['Q3']!='India'].index, inplace=True)

Let's examine the dataset again to check that column 'Q3' contains only entries for India.

In [None]:
data['Q3'].value_counts()

In [None]:
data.info()

In [None]:
data.head()

Let us examine the gender column & review the data.

In [None]:
data['Q2'].value_counts()

We can see that the value counts for genders apart from Man & Woman are less than 100 in a dataset of 7434 entries. The next cells drop these categories.
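
A compact way to drop a set of categories is `Series.isin` combined with a negated mask. The sketch below uses a toy frame with made-up rows; it is an alternative to the set_index / drop / reset_index sequence used in the following cells, not a replacement for it.

```python
import pandas as pd

# Toy frame standing in for the India subset (hypothetical values)
people = pd.DataFrame({
    "Q2": ["Man", "Woman", "Nonbinary", "Man",
           "Prefer not to say", "Prefer to self-describe"],
    "Q1": ["18-21", "22-24", "25-29", "30-34", "18-21", "22-24"],
})

# Categories to remove, matching the labels dropped in this notebook
rare = ["Prefer not to say", "Prefer to self-describe", "Nonbinary"]

# ~ negates the mask: keep the rows whose gender is NOT in the list
kept = people[~people["Q2"].isin(rare)].reset_index(drop=True)

print(kept["Q2"].value_counts())
```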

In [None]:
data2=data.set_index('Q2')

In [None]:
data2.head()

In [None]:
data2=data2.drop(["Prefer not to say","Prefer to self-describe","Nonbinary"])


In [None]:
data2.reset_index(level=['Q2'],inplace=True)

In [None]:
data2['Q2'].value_counts()

In [None]:
data.shape

In [None]:
data2.info()

Let's plot a pie chart to check the gender split in the dataset.

In [None]:
# Set the palette and figure size first so that they apply to this chart
sns.set_palette('bright')
rcParams['figure.figsize'] = 20,10
data2.groupby('Q2').size().plot(kind='pie', autopct='%.2f')

Let's review the null values in the dataset.

Null values will not be treated, since this is a survey dataset where each null value is itself a form of response from the survey participant.

I will show how null values are handled when I explain the pd.melt() function in later steps.
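
To give an idea of the pattern, here is a hedged sketch of how `melt` can deal with nulls in a multi-select question. The frame and its values are made up, and the notebook's own melt step may differ; the assumption is that each option column holds the option name when selected and NaN otherwise.

```python
import numpy as np
import pandas as pd

# Toy multi-select question (hypothetical values): one column per option,
# holding the option name when selected and NaN otherwise.
answers = pd.DataFrame({
    "Coding-Python": ["Python", "Python", np.nan],
    "Coding-R": [np.nan, "R", np.nan],
    "Coding-SQL": ["SQL", np.nan, np.nan],
})

# Wide -> long: one row per (respondent, option) pair
long_form = answers.melt(var_name="option", value_name="selected")

# NaN rows mean "option not selected"; dropping them leaves only real picks
picks = long_form.dropna(subset=["selected"])

print(picks["selected"].value_counts())
```

In this form a null simply disappears from the counts, so it never needs to be imputed.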

In [None]:
data2.isnull().sum()

In [None]:
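# Percentage of missing values per column
missing = pd.DataFrame((data2.isnull().sum())*100/data2.shape[0]).reset_index()
missing.columns = ['column', 'pct_missing']
plt.figure(figsize=(50,10))
# Pass x and y as keywords; newer seaborn releases no longer accept them positionally
ax = sns.pointplot(x='column', y='pct_missing', data=missing)
plt.xticks(rotation=90, fontsize=7)
plt.title("Percentage of Missing values")
plt.ylabel("PERCENTAGE")
plt.show()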

Next we will focus on the dataset headers.

Right now the header names do not convey the meaning of each column, so let's give each column a meaningful name in the next step.

We will also set the data type of each column in the same step.
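
As a small illustration of the dtype choice, the sketch below converts a repeated-label column to `category` and a quantity column to numbers. The frame and its values are made up; using `pd.to_numeric` for the duration is a suggestion rather than what the cells below do.

```python
import pandas as pd

# Toy columns read as plain strings (hypothetical values)
df = pd.DataFrame({
    "Q1": ["18-21", "25-29", "18-21", "30-34"],
    "Time from Start to Finish (seconds)": ["910", "784", "924", "575"],
})

# Age bands repeat a handful of labels, so 'category' suits them
df["Age_years"] = df["Q1"].astype("category")

# Durations are quantities; converting to numbers allows mean, median, etc.
df["Duration(secs)"] = pd.to_numeric(df["Time from Start to Finish (seconds)"])

print(df.dtypes)
print(df["Duration(secs)"].mean())
```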

In [None]:
# Durations are quantities, so store them as numbers rather than categories
data2['Duration(secs)']=pd.to_numeric(data2['Time from Start to Finish (seconds)'])
data2.pop('Time from Start to Finish (seconds)')

In [None]:
data2['Age_years']=data2['Q1'].astype('category')
data2.pop('Q1')

In [None]:
data2.head()

Let's rename all the remaining column headers in the next steps!
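
The renaming can also be driven by a single mapping instead of one assignment and one `pop` per column. The sketch below shows the idea on a toy frame with three of the survey columns and made-up values; `new_names` is an illustrative name, and the full mapping would need one entry per column.

```python
import pandas as pd

# Toy frame with three of the original survey column names (hypothetical values)
df = pd.DataFrame({
    "Q2": ["Man", "Woman"],
    "Q3": ["India", "India"],
    "Q7_Part_1": ["Python", None],
})

# Old name -> new name; a full version would list every column here
new_names = {
    "Q2": "gender",
    "Q3": "Country",
    "Q7_Part_1": "Coding-Python",
}

# Rename in one call, then convert the renamed columns to 'category'
df = df.rename(columns=new_names)
df = df.astype({name: "category" for name in new_names.values()})

print(df.dtypes)
```

One difference to be aware of: `rename` keeps each column in its original position, whereas assigning a new column and popping the old one moves it to the end of the frame.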

In [None]:
data2['gender']=data2['Q2'].astype('category')
data2.pop('Q2')
data2['Country']=data2['Q3'].astype('category')
data2.pop('Q3')
data2['Education']=data2['Q4'].astype('category')
data2.pop('Q4')
data2['Title']=data2['Q5'].astype('category')
data2.pop('Q5')
data2['Coding_Exp']=data2['Q6'].astype('category')
data2.pop('Q6')
data2['Coding-Python']=data2['Q7_Part_1'].astype('category')
data2.pop('Q7_Part_1')
data2['Coding-R']=data2['Q7_Part_2'].astype('category')
data2.pop('Q7_Part_2')
data2['Coding-SQL']=data2['Q7_Part_3'].astype('category')
data2.pop('Q7_Part_3')
data2['Coding-C']=data2['Q7_Part_4'].astype('category')
data2.pop('Q7_Part_4')
data2['Coding-C++']=data2['Q7_Part_5'].astype('category')
data2.pop('Q7_Part_5')
data2['Coding-Java']=data2['Q7_Part_6'].astype('category')
data2.pop('Q7_Part_6')
data2['Coding-Javascript']=data2['Q7_Part_7'].astype('category')
data2.pop('Q7_Part_7')
data2['Coding-Julia']=data2['Q7_Part_8'].astype('category')
data2.pop('Q7_Part_8')
data2['Coding-Swift']=data2['Q7_Part_9'].astype('category')
data2.pop('Q7_Part_9')
data2['Coding-Bash']=data2['Q7_Part_10'].astype('category')
data2.pop('Q7_Part_10')
data2['Coding-MATLAB']=data2['Q7_Part_11'].astype('category')
data2.pop('Q7_Part_11')
data2['Coding-None']=data2['Q7_Part_12'].astype('category')
data2.pop('Q7_Part_12')
data2['Coding-Other']=data2['Q7_OTHER'].astype('category')
data2.pop('Q7_OTHER')
data2['Coding_lang_Newbies']=data2['Q8'].astype('category')
data2.pop('Q8')
data2['IDE-Jupyter']=data2['Q9_Part_1'].astype('category')
data2.pop('Q9_Part_1')
data2['IDE-Rstudio']=data2['Q9_Part_2'].astype('category')
data2.pop('Q9_Part_2')
data2['IDE-Visual_Studio']=data2['Q9_Part_3'].astype('category')
data2.pop('Q9_Part_3')
data2['IDE-VScode']=data2['Q9_Part_4'].astype('category')
data2.pop('Q9_Part_4')
data2['IDE-Pycharm']=data2['Q9_Part_5'].astype('category')
data2.pop('Q9_Part_5')
data2['IDE-Spyder']=data2['Q9_Part_6'].astype('category')
data2.pop('Q9_Part_6')
data2['IDE-Notepad++']=data2['Q9_Part_7'].astype('category')
data2.pop('Q9_Part_7')
data2['IDE-Sublime_text']=data2['Q9_Part_8'].astype('category')
data2.pop('Q9_Part_8')
data2['IDE-Vim/Emacs']=data2['Q9_Part_9'].astype('category')
data2.pop('Q9_Part_9')
data2['IDE-Matlab']=data2['Q9_Part_10'].astype('category')
data2.pop('Q9_Part_10')
data2['IDE-Jupyter_Notebook']=data2['Q9_Part_11'].astype('category')
data2.pop('Q9_Part_11')
data2['IDE-None']=data2['Q9_Part_12'].astype('category')
data2.pop('Q9_Part_12')
data2['IDE-Other']=data2['Q9_OTHER'].astype('category')
data2.pop('Q9_OTHER')
data2['HostedNB-Kaggle']=data2['Q10_Part_1'].astype('category')
data2.pop('Q10_Part_1')
data2['HostedNB-Colab']=data2['Q10_Part_2'].astype('category')
data2.pop('Q10_Part_2')
data2['HostedNB-Azure']=data2['Q10_Part_3'].astype('category')
data2.pop('Q10_Part_3')
data2['HostedNB-Paperspace/Gradient']=data2['Q10_Part_4'].astype('category')
data2.pop('Q10_Part_4')
data2['HostedNB-Binder/JupyterHub']=data2['Q10_Part_5'].astype('category')
data2.pop('Q10_Part_5')
data2['HostedNB-Code_Ocean']=data2['Q10_Part_6'].astype('category')
data2.pop('Q10_Part_6')
data2['HostedNB-Watson_Studio']=data2['Q10_Part_7'].astype('category')
data2.pop('Q10_Part_7')
data2['HostedNB-Amazon_Sagemaker']=data2['Q10_Part_8'].astype('category')
data2.pop('Q10_Part_8')
data2['HostedNB-Amazon_EMR']=data2['Q10_Part_9'].astype('category')
data2.pop('Q10_Part_9')
data2['HostedNB-GC_NB']=data2['Q10_Part_10'].astype('category')
data2.pop('Q10_Part_10')
data2['HostedNB-GC_Datalab']=data2['Q10_Part_11'].astype('category')
data2.pop('Q10_Part_11')
data2['HostedNB-Databricks_collab']=data2['Q10_Part_12'].astype('category')
data2.pop('Q10_Part_12')
data2['HostedNB-Zepl']=data2['Q10_Part_13'].astype('category')
data2.pop('Q10_Part_13')
data2['HostedNB-Deepnote']=data2['Q10_Part_14'].astype('category')
data2.pop('Q10_Part_14')
data2['HostedNB-Observable']=data2['Q10_Part_15'].astype('category')
data2.pop('Q10_Part_15')
data2['HostedNB-None']=data2['Q10_Part_16'].astype('category')
data2.pop('Q10_Part_16')
data2['HostedNB-Other']=data2['Q10_OTHER'].astype('category')
data2.pop('Q10_OTHER')
data2['Computing_Platform']=data2['Q11'].astype('category')
data2.pop('Q11')
data2['Spl_hardware-nVidia GPUs']=data2['Q12_Part_1'].astype('category')
data2.pop('Q12_Part_1')
data2['Spl_hardware-GC TPUs']=data2['Q12_Part_2'].astype('category')
data2.pop('Q12_Part_2')
data2['Spl_hardware-AWS Trainium Chips']=data2['Q12_Part_3'].astype('category')
data2.pop('Q12_Part_3')
data2['Spl_hardware-AWS Inferentia Chips']=data2['Q12_Part_4'].astype('category')
data2.pop('Q12_Part_4')
data2['Spl_hardware-None']=data2['Q12_Part_5'].astype('category')
data2.pop('Q12_Part_5')
data2['Spl_hardware-Other']=data2['Q12_OTHER'].astype('category')
data2.pop('Q12_OTHER')
data2['TPU_Usage_Count']=data2['Q13'].astype('category')
data2.pop('Q13')
data2['Vis_lib-Matplotlib']=data2['Q14_Part_1'].astype('category')
data2.pop('Q14_Part_1')
data2['Vis_lib-Seaborn']=data2['Q14_Part_2'].astype('category')
data2.pop('Q14_Part_2')
data2['Vis_lib-Plotly/Plotly Express']=data2['Q14_Part_3'].astype('category')
data2.pop('Q14_Part_3')
data2['Vis_lib-GGplot/ggplot2']=data2['Q14_Part_4'].astype('category')
data2.pop('Q14_Part_4')
data2['Vis_lib-Shiny']=data2['Q14_Part_5'].astype('category')
data2.pop('Q14_Part_5')
data2['Vis_lib-D3 js']=data2['Q14_Part_6'].astype('category')
data2.pop('Q14_Part_6')
data2['Vis_lib-Altair']=data2['Q14_Part_7'].astype('category')
data2.pop('Q14_Part_7')
data2['Vis_lib-Bokeh']=data2['Q14_Part_8'].astype('category')
data2.pop('Q14_Part_8')
data2['Vis_lib-Geoplotlib']=data2['Q14_Part_9'].astype('category')
data2.pop('Q14_Part_9')
data2['Vis_lib-Leaflet/Folium']=data2['Q14_Part_10'].astype('category')
data2.pop('Q14_Part_10')
data2['Vis_lib-None']=data2['Q14_Part_11'].astype('category')
data2.pop('Q14_Part_11')
data2['Vis_lib-Other']=data2['Q14_OTHER'].astype('category')
data2.pop('Q14_OTHER')
data2['ML methods-years']=data2['Q15'].astype('category')
data2.pop('Q15')
data2['ML FRMWKS-SK Learn']=data2['Q16_Part_1'].astype('category')
data2.pop('Q16_Part_1')
data2['ML FRMWKS-TensorFlow']=data2['Q16_Part_2'].astype('category')
data2.pop('Q16_Part_2')
data2['ML FRMWKS-Keras']=data2['Q16_Part_3'].astype('category')
data2.pop('Q16_Part_3')
data2['ML FRMWKS-PyTorch']=data2['Q16_Part_4'].astype('category')
data2.pop('Q16_Part_4')
data2['ML FRMWKS-Fast.ai']=data2['Q16_Part_5'].astype('category')
data2.pop('Q16_Part_5')
data2['ML FRMWKS-MXNet']=data2['Q16_Part_6'].astype('category')
data2.pop('Q16_Part_6')
data2['ML FRMWKS-XGBoost']=data2['Q16_Part_7'].astype('category')
data2.pop('Q16_Part_7')
data2['ML FRMWKS-LightGBM']=data2['Q16_Part_8'].astype('category')
data2.pop('Q16_Part_8')
data2['ML FRMWKS-CatBoost']=data2['Q16_Part_9'].astype('category')
data2.pop('Q16_Part_9')
data2['ML FRMWKS-Prophet']=data2['Q16_Part_10'].astype('category')
data2.pop('Q16_Part_10')
data2['ML FRMWKS-H2O 3']=data2['Q16_Part_11'].astype('category')
data2.pop('Q16_Part_11')
data2['ML FRMWKS-Caret']=data2['Q16_Part_12'].astype('category')
data2.pop('Q16_Part_12')
data2['ML FRMWKS-Tidymodels']=data2['Q16_Part_13'].astype('category')
data2.pop('Q16_Part_13')
data2['ML FRMWKS-JAX']=data2['Q16_Part_14'].astype('category')
data2.pop('Q16_Part_14')
data2['ML FRMWKS-PyTorch Lightning']=data2['Q16_Part_15'].astype('category')
data2.pop('Q16_Part_15')
data2['ML FRMWKS-Huggingface']=data2['Q16_Part_16'].astype('category')
data2.pop('Q16_Part_16')
data2['ML FRMWKS-None']=data2['Q16_Part_17'].astype('category')
data2.pop('Q16_Part_17')
data2['ML FRMWKS-Other']=data2['Q16_OTHER'].astype('category')
data2.pop('Q16_OTHER')
data2['ML Algo-Linear/Logistic Reg']=data2['Q17_Part_1'].astype('category')
data2.pop('Q17_Part_1')
data2['ML Algo-Decision Tree/Random Forests']=data2['Q17_Part_2'].astype('category')
data2.pop('Q17_Part_2')
data2['ML Algo-Gradient Boosting']=data2['Q17_Part_3'].astype('category')
data2.pop('Q17_Part_3')
data2['ML Algo-Bayesian']=data2['Q17_Part_4'].astype('category')
data2.pop('Q17_Part_4')
data2['ML Algo-Evolutionary']=data2['Q17_Part_5'].astype('category')
data2.pop('Q17_Part_5')
data2['ML Algo-Dense Neural Ntwks']=data2['Q17_Part_6'].astype('category')
data2.pop('Q17_Part_6')
data2['ML Algo-Conv Neural Ntwks']=data2['Q17_Part_7'].astype('category')
data2.pop('Q17_Part_7')
data2['ML Algo-Gen Adv Ntwks']=data2['Q17_Part_8'].astype('category')
data2.pop('Q17_Part_8')
data2['ML Algo-Rec Neural Ntwks']=data2['Q17_Part_9'].astype('category')
data2.pop('Q17_Part_9')
data2['ML Algo-Transf Ntwks']=data2['Q17_Part_10'].astype('category')
data2.pop('Q17_Part_10')
data2['ML Algo-None']=data2['Q17_Part_11'].astype('category')
data2.pop('Q17_Part_11')
data2['ML Algo-Other']=data2['Q17_OTHER'].astype('category')
data2.pop('Q17_OTHER')
data2['CV-Image/Video']=data2['Q18_Part_1'].astype('category')
data2.pop('Q18_Part_1')
data2['CV-Image Seg']=data2['Q18_Part_2'].astype('category')
data2.pop('Q18_Part_2')
data2['CV-Obj Detection']=data2['Q18_Part_3'].astype('category')
data2.pop('Q18_Part_3')
data2['CV-Image Classi']=data2['Q18_Part_4'].astype('category')
data2.pop('Q18_Part_4')
data2['CV-Gen Ntwks']=data2['Q18_Part_5'].astype('category')
data2.pop('Q18_Part_5')
data2['CV-None']=data2['Q18_Part_6'].astype('category')
data2.pop('Q18_Part_6')
data2['CV-Other']=data2['Q18_OTHER'].astype('category')
data2.pop('Q18_OTHER')
data2['NLP-Word Embeddings']=data2['Q19_Part_1'].astype('category')
data2.pop('Q19_Part_1')
data2['NLP-Encoder/Decoder']=data2['Q19_Part_2'].astype('category')
data2.pop('Q19_Part_2')
data2['NLP-Contextualised Emb']=data2['Q19_Part_3'].astype('category')
data2.pop('Q19_Part_3')
data2['NLP-Transf lang']=data2['Q19_Part_4'].astype('category')
data2.pop('Q19_Part_4')
data2['NLP-none']=data2['Q19_Part_5'].astype('category')
data2.pop('Q19_Part_5')
data2['NLP-Other']=data2['Q19_OTHER'].astype('category')
data2.pop('Q19_OTHER')
data2['Current Employer']=data2['Q20'].astype('category')
data2.pop('Q20')
data2['Size of Company']=data2['Q21'].astype('category')
data2.pop('Q21')
data2['DS team']=data2['Q22'].astype('category')
data2.pop('Q22')
data2['ML models Used']=data2['Q23'].astype('category')
data2.pop('Q23')
data2['Role-Analyze & Understand']=data2['Q24_Part_1'].astype('category')
data2.pop('Q24_Part_1')
data2['Role-Build/Run Data Infra']=data2['Q24_Part_2'].astype('category')
data2.pop('Q24_Part_2')
data2['Role-Explore ML application']=data2['Q24_Part_3'].astype('category')
data2.pop('Q24_Part_3')
data2['Role-Build/Run ML service']=data2['Q24_Part_4'].astype('category')
data2.pop('Q24_Part_4')
data2['Role-Improve ML models']=data2['Q24_Part_5'].astype('category')
data2.pop('Q24_Part_5')
data2['Role-Research for ML']=data2['Q24_Part_6'].astype('category')
data2.pop('Q24_Part_6')
data2['Role-None']=data2['Q24_Part_7'].astype('category')
data2.pop('Q24_Part_7')
data2['Role-Other']=data2['Q24_OTHER'].astype('category')
data2.pop('Q24_OTHER')
data2['Yearly Compensation']=data2['Q25'].astype('category')
data2.pop('Q25')
data2['Money Spent on ML/Cloud']=data2['Q26'].astype('category')
data2.pop('Q26')
data2['Cloud Pltfm-AWS']=data2['Q27_A_Part_1'].astype('category')
data2.pop('Q27_A_Part_1')
data2['Cloud Pltfm-MS Azure']=data2['Q27_A_Part_2'].astype('category')
data2.pop('Q27_A_Part_2')
data2['Cloud Pltfm-GCP']=data2['Q27_A_Part_3'].astype('category')
data2.pop('Q27_A_Part_3')
data2['Cloud Pltfm-Red Hat']=data2['Q27_A_Part_4'].astype('category')
data2.pop('Q27_A_Part_4')
data2['Cloud Pltfm-Oracle']=data2['Q27_A_Part_5'].astype('category')
data2.pop('Q27_A_Part_5')
data2['Cloud Pltfm-SAP']=data2['Q27_A_Part_6'].astype('category')
data2.pop('Q27_A_Part_6')
data2['Cloud Pltfm-Salesforce']=data2['Q27_A_Part_7'].astype('category')
data2.pop('Q27_A_Part_7')
data2['Cloud Pltfm-VMWare']=data2['Q27_A_Part_8'].astype('category')
data2.pop('Q27_A_Part_8')
data2['Cloud Pltfm-Alibaba']=data2['Q27_A_Part_9'].astype('category')
data2.pop('Q27_A_Part_9')
data2['Cloud Pltfm-Tencent']=data2['Q27_A_Part_10'].astype('category')
data2.pop('Q27_A_Part_10')
data2['Cloud Pltfm-None']=data2['Q27_A_Part_11'].astype('category')
data2.pop('Q27_A_Part_11')
data2['Cloud Pltfm-Other']=data2['Q27_A_OTHER'].astype('category')
data2.pop('Q27_A_OTHER')
data2['Best Cloud for Developer exp']=data2['Q28'].astype('category')
data2.pop('Q28')
data2['Cloud Prod-Amazon EC2']=data2['Q29_A_Part_1'].astype('category')
data2.pop('Q29_A_Part_1')
data2['Cloud Prod-MS Azure VM']=data2['Q29_A_Part_2'].astype('category')
data2.pop('Q29_A_Part_2')
data2['Cloud Prod-Google GCE']=data2['Q29_A_Part_3'].astype('category')
data2.pop('Q29_A_Part_3')
data2['Cloud Prod-None']=data2['Q29_A_Part_4'].astype('category')
data2.pop('Q29_A_Part_4')
data2['Cloud Prod-Other']=data2['Q29_A_OTHER'].astype('category')
data2.pop('Q29_A_OTHER')
data2['Data Stor-MS Azure Data Lake']=data2['Q30_A_Part_1'].astype('category')
data2.pop('Q30_A_Part_1')
data2['Data Stor-MS Azure Disk Storage']=data2['Q30_A_Part_2'].astype('category')
data2.pop('Q30_A_Part_2')
data2['Data Stor-Amazon S3']=data2['Q30_A_Part_3'].astype('category')
data2.pop('Q30_A_Part_3')
data2['Data Stor-Amazon EFS']=data2['Q30_A_Part_4'].astype('category')
data2.pop('Q30_A_Part_4')
data2['Data Stor-GCS']=data2['Q30_A_Part_5'].astype('category')
data2.pop('Q30_A_Part_5')
data2['Data Stor-GCF']=data2['Q30_A_Part_6'].astype('category')
data2.pop('Q30_A_Part_6')
data2['Data Stor-No']=data2['Q30_A_Part_7'].astype('category')
data2.pop('Q30_A_Part_7')
data2['Data Stor-Other']=data2['Q30_A_OTHER'].astype('category')
data2.pop('Q30_A_OTHER')
data2['ML Prod-Amazon sagemaker']=data2['Q31_A_Part_1'].astype('category')
data2.pop('Q31_A_Part_1')
data2['ML Prod-MS Azure ML Studio']=data2['Q31_A_Part_2'].astype('category')
data2.pop('Q31_A_Part_2')
data2['ML Prod-GC Vertex AI']=data2['Q31_A_Part_3'].astype('category')
data2.pop('Q31_A_Part_3')
data2['ML Prod-DataRobot']=data2['Q31_A_Part_4'].astype('category')
data2.pop('Q31_A_Part_4')
data2['ML Prod-DataBricks']=data2['Q31_A_Part_5'].astype('category')
data2.pop('Q31_A_Part_5')
data2['ML Prod-Dataiku']=data2['Q31_A_Part_6'].astype('category')
data2.pop('Q31_A_Part_6')
data2['ML Prod-Alteryx']=data2['Q31_A_Part_7'].astype('category')
data2.pop('Q31_A_Part_7')
data2['ML Prod-Rapidminer']=data2['Q31_A_Part_8'].astype('category')
data2.pop('Q31_A_Part_8')
data2['ML Prod-No']=data2['Q31_A_Part_9'].astype('category')
data2.pop('Q31_A_Part_9')
data2['ML Prod-Other']=data2['Q31_A_OTHER'].astype('category')
data2.pop('Q31_A_OTHER')
data2['BD-MySQL']=data2['Q32_A_Part_1'].astype('category')
data2.pop('Q32_A_Part_1')
data2['BD-PostgreSQL']=data2['Q32_A_Part_2'].astype('category')
data2.pop('Q32_A_Part_2')
data2['BD-SQLite']=data2['Q32_A_Part_3'].astype('category')
data2.pop('Q32_A_Part_3')
data2['BD-Oracle DB']=data2['Q32_A_Part_4'].astype('category')
data2.pop('Q32_A_Part_4')
data2['BD-MongoDB']=data2['Q32_A_Part_5'].astype('category')
data2.pop('Q32_A_Part_5')
data2['BD-Snowflake']=data2['Q32_A_Part_6'].astype('category')
data2.pop('Q32_A_Part_6')
data2['BD-IBM DB2']=data2['Q32_A_Part_7'].astype('category')
data2.pop('Q32_A_Part_7')
data2['BD-MS SQL Server']=data2['Q32_A_Part_8'].astype('category')
data2.pop('Q32_A_Part_8')
data2['BD-MS Azure SQL DB']=data2['Q32_A_Part_9'].astype('category')
data2.pop('Q32_A_Part_9')
data2['BD-MS Azure Cosmos DB']=data2['Q32_A_Part_10'].astype('category')
data2.pop('Q32_A_Part_10')
data2['BD-Amazon Redshift']=data2['Q32_A_Part_11'].astype('category')
data2.pop('Q32_A_Part_11')
data2['BD-Amazon Aurora']=data2['Q32_A_Part_12'].astype('category')
data2.pop('Q32_A_Part_12')
data2['BD-Amazon RDS']=data2['Q32_A_Part_13'].astype('category')
data2.pop('Q32_A_Part_13')
data2['BD-Amazon DynamoDB']=data2['Q32_A_Part_14'].astype('category')
data2.pop('Q32_A_Part_14')
data2['BD-GC BigQuery']=data2['Q32_A_Part_15'].astype('category')
data2.pop('Q32_A_Part_15')
data2['BD-GC SQL']=data2['Q32_A_Part_16'].astype('category')
data2.pop('Q32_A_Part_16')
data2['BD-GC Firestore']=data2['Q32_A_Part_17'].astype('category')
data2.pop('Q32_A_Part_17')
data2['BD-GC BigTable']=data2['Q32_A_Part_18'].astype('category')
data2.pop('Q32_A_Part_18')
data2['BD-GC Spanner']=data2['Q32_A_Part_19'].astype('category')
data2.pop('Q32_A_Part_19')
data2['BD-None']=data2['Q32_A_Part_20'].astype('category')
data2.pop('Q32_A_Part_20')
data2['BD-Other']=data2['Q32_A_OTHER'].astype('category')
data2.pop('Q32_A_OTHER')
data2['BD-Selected Choice']=data2['Q33'].astype('category')
data2.pop('Q33')
data2['BI-Tools-Amazon QuickSight']=data2['Q34_A_Part_1'].astype('category')
data2.pop('Q34_A_Part_1')
data2['BI-Tools-MS Power BI']=data2['Q34_A_Part_2'].astype('category')
data2.pop('Q34_A_Part_2')
data2['BI-Tools-G Data Studio']=data2['Q34_A_Part_3'].astype('category')
data2.pop('Q34_A_Part_3')
data2['BI-Tools-Looker']=data2['Q34_A_Part_4'].astype('category')
data2.pop('Q34_A_Part_4')
data2['BI-Tools-Tableau']=data2['Q34_A_Part_5'].astype('category')
data2.pop('Q34_A_Part_5')
data2['BI-Tools-Salesforce']=data2['Q34_A_Part_6'].astype('category')
data2.pop('Q34_A_Part_6')
data2['BI-Tools-Tableau CRM']=data2['Q34_A_Part_7'].astype('category')
data2.pop('Q34_A_Part_7')
data2['BI-Tools-Qlik']=data2['Q34_A_Part_8'].astype('category')
data2.pop('Q34_A_Part_8')
data2['BI-Tools-Domo']=data2['Q34_A_Part_9'].astype('category')
data2.pop('Q34_A_Part_9')
data2['BI-Tools-TIBCO Spotfire']=data2['Q34_A_Part_10'].astype('category')
data2.pop('Q34_A_Part_10')
data2['BI-Tools-Alteryx']=data2['Q34_A_Part_11'].astype('category')
data2.pop('Q34_A_Part_11')
data2['BI-Tools-Sisense']=data2['Q34_A_Part_12'].astype('category')
data2.pop('Q34_A_Part_12')
data2['BI-Tools-SAP Analytics Cloud']=data2['Q34_A_Part_13'].astype('category')
data2.pop('Q34_A_Part_13')
data2['BI-Tools-MS Azure Synapse']=data2['Q34_A_Part_14'].astype('category')
data2.pop('Q34_A_Part_14')
data2['BI-Tools-Thoughtspot']=data2['Q34_A_Part_15'].astype('category')
data2.pop('Q34_A_Part_15')
data2['BI-Tools-None']=data2['Q34_A_Part_16'].astype('category')
data2.pop('Q34_A_Part_16')
data2['BI-Tools-Other']=data2['Q34_A_OTHER'].astype('category')
data2.pop('Q34_A_OTHER')
data2['BI-Tools-Selected Choice']=data2['Q35'].astype('category')
data2.pop('Q35')
data2['AutoML-Data Aug']=data2['Q36_A_Part_1'].astype('category')
data2.pop('Q36_A_Part_1')
data2['AutoML-Feature Engg/Sel']=data2['Q36_A_Part_2'].astype('category')
data2.pop('Q36_A_Part_2')
data2['AutoML-Model Sel']=data2['Q36_A_Part_3'].astype('category')
data2.pop('Q36_A_Part_3')
data2['AutoML-Model Arch']=data2['Q36_A_Part_4'].astype('category')
data2.pop('Q36_A_Part_4')
data2['AutoML-Hyparam Tuning']=data2['Q36_A_Part_5'].astype('category')
data2.pop('Q36_A_Part_5')
data2['AutoML-Full ML Pipelines']=data2['Q36_A_Part_6'].astype('category')
data2.pop('Q36_A_Part_6')
data2['AutoML-No/None']=data2['Q36_A_Part_7'].astype('category')
data2.pop('Q36_A_Part_7')
data2['AutoML-Other']=data2['Q36_A_OTHER'].astype('category')
data2.pop('Q36_A_OTHER')
data2['2AutoML-GC AutoML']=data2['Q37_A_Part_1'].astype('category')
data2.pop('Q37_A_Part_1')
data2['2AutoML-H2O Driverless AI']=data2['Q37_A_Part_2'].astype('category')
data2.pop('Q37_A_Part_2')
data2['2AutoML-Databricks AutoML']=data2['Q37_A_Part_3'].astype('category')
data2.pop('Q37_A_Part_3')
data2['2AutoML-DataRobot AutoML']=data2['Q37_A_Part_4'].astype('category')
data2.pop('Q37_A_Part_4')
data2['2AutoML-Amazon Sagemaker Autopilot']=data2['Q37_A_Part_5'].astype('category')
data2.pop('Q37_A_Part_5')
data2['2AutoML-Azure Auto ML']=data2['Q37_A_Part_6'].astype('category')
data2.pop('Q37_A_Part_6')
data2['2AutoML-No/None']=data2['Q37_A_Part_7'].astype('category')
data2.pop('Q37_A_Part_7')
data2['2AutoML-Other']=data2['Q37_A_OTHER'].astype('category')
data2.pop('Q37_A_OTHER')
data2['ML Exp-Neptune.ai']=data2['Q38_A_Part_1'].astype('category')
data2.pop('Q38_A_Part_1')
data2['ML Exp-Weights & Biases']=data2['Q38_A_Part_2'].astype('category')
data2.pop('Q38_A_Part_2')
data2['ML Exp-Comet.ml']=data2['Q38_A_Part_3'].astype('category')
data2.pop('Q38_A_Part_3')
data2['ML Exp-Sacred+Omniboard']=data2['Q38_A_Part_4'].astype('category')
data2.pop('Q38_A_Part_4')
data2['ML Exp-TensorBoard']=data2['Q38_A_Part_5'].astype('category')
data2.pop('Q38_A_Part_5')
data2['ML Exp-Guild.ai']=data2['Q38_A_Part_6'].astype('category')
data2.pop('Q38_A_Part_6')
data2['ML Exp-Polyaxon']=data2['Q38_A_Part_7'].astype('category')
data2.pop('Q38_A_Part_7')
data2['ML Exp-ClearML']=data2['Q38_A_Part_8'].astype('category')
data2.pop('Q38_A_Part_8')
data2['ML Exp-Domino Model Monitor']=data2['Q38_A_Part_9'].astype('category')
data2.pop('Q38_A_Part_9')
data2['ML Exp-MLflow']=data2['Q38_A_Part_10'].astype('category')
data2.pop('Q38_A_Part_10')
data2['ML Exp-No/None']=data2['Q38_A_Part_11'].astype('category')
data2.pop('Q38_A_Part_11')
data2['ML Exp-Other']=data2['Q38_A_OTHER'].astype('category')
data2.pop('Q38_A_OTHER')
data2['Share ML App-Plotly Dash']=data2['Q39_Part_1'].astype('category')
data2.pop('Q39_Part_1')
data2['Share ML App-Streamlit']=data2['Q39_Part_2'].astype('category')
data2.pop('Q39_Part_2')
data2['Share ML App-NB Viewer']=data2['Q39_Part_3'].astype('category')
data2.pop('Q39_Part_3')
data2['Share ML App-Github']=data2['Q39_Part_4'].astype('category')
data2.pop('Q39_Part_4')
data2['Share ML App-Personal Blog']=data2['Q39_Part_5'].astype('category')
data2.pop('Q39_Part_5')
data2['Share ML App-Kaggle']=data2['Q39_Part_6'].astype('category')
data2.pop('Q39_Part_6')
data2['Share ML App-Colab']=data2['Q39_Part_7'].astype('category')
data2.pop('Q39_Part_7')
data2['Share ML App-Shiny']=data2['Q39_Part_8'].astype('category')
data2.pop('Q39_Part_8')
data2['Share ML App-Not Sharing']=data2['Q39_Part_9'].astype('category')
data2.pop('Q39_Part_9')
data2['Share ML App-Other']=data2['Q39_OTHER'].astype('category')
data2.pop('Q39_OTHER')
data2['Course-Coursera']=data2['Q40_Part_1'].astype('category')
data2.pop('Q40_Part_1')
data2['Course-edX']=data2['Q40_Part_2'].astype('category')
data2.pop('Q40_Part_2')
data2['Course-Kaggle']=data2['Q40_Part_3'].astype('category')
data2.pop('Q40_Part_3')
data2['Course-DataCamp']=data2['Q40_Part_4'].astype('category')
data2.pop('Q40_Part_4')
data2['Course-Fast.ai']=data2['Q40_Part_5'].astype('category')
data2.pop('Q40_Part_5')
data2['Course-Udacity']=data2['Q40_Part_6'].astype('category')
data2.pop('Q40_Part_6')
data2['Course-Udemy']=data2['Q40_Part_7'].astype('category')
data2.pop('Q40_Part_7')
data2['Course-LinkedIN']=data2['Q40_Part_8'].astype('category')
data2.pop('Q40_Part_8')
data2['Course-Cloud Cert']=data2['Q40_Part_9'].astype('category')
data2.pop('Q40_Part_9')
data2['Course-Univ Courses']=data2['Q40_Part_10'].astype('category')
data2.pop('Q40_Part_10')
data2['Course-None']=data2['Q40_Part_11'].astype('category')
data2.pop('Q40_Part_11')
data2['Course-Other']=data2['Q40_OTHER'].astype('category')
data2.pop('Q40_OTHER')
data2['Primary Tool']=data2['Q41'].astype('category')
data2.pop('Q41')
data2['DS Media-Twitter']=data2['Q42_Part_1'].astype('category')
data2.pop('Q42_Part_1')
data2['DS Media-Email']=data2['Q42_Part_2'].astype('category')
data2.pop('Q42_Part_2')
data2['DS Media-Reddit']=data2['Q42_Part_3'].astype('category')
data2.pop('Q42_Part_3')
data2['DS Media-Kaggle']=data2['Q42_Part_4'].astype('category')
data2.pop('Q42_Part_4')
data2['DS Media-Course Forums']=data2['Q42_Part_5'].astype('category')
data2.pop('Q42_Part_5')
data2['DS Media-Youtube']=data2['Q42_Part_6'].astype('category')
data2.pop('Q42_Part_6')
data2['DS Media-Podcasts']=data2['Q42_Part_7'].astype('category')
data2.pop('Q42_Part_7')
data2['DS Media-Blogs']=data2['Q42_Part_8'].astype('category')
data2.pop('Q42_Part_8')
data2['DS Media-Journals']=data2['Q42_Part_9'].astype('category')
data2.pop('Q42_Part_9')
data2['DS Media-Slack Communities']=data2['Q42_Part_10'].astype('category')
data2.pop('Q42_Part_10')
data2['DS Media-None']=data2['Q42_Part_11'].astype('category')
data2.pop('Q42_Part_11')
data2['DS Media-Other']=data2['Q42_OTHER'].astype('category')
data2.pop('Q42_OTHER')
data2['N2Y-AWS']=data2['Q27_B_Part_1'].astype('category')
data2.pop('Q27_B_Part_1')
data2['N2Y-MS Azure']=data2['Q27_B_Part_2'].astype('category')
data2.pop('Q27_B_Part_2')
data2['N2Y-GCP']=data2['Q27_B_Part_3'].astype('category')
data2.pop('Q27_B_Part_3')
data2['N2Y-Red Hat']=data2['Q27_B_Part_4'].astype('category')
data2.pop('Q27_B_Part_4')
data2['N2Y-Oracle Cloud']=data2['Q27_B_Part_5'].astype('category')
data2.pop('Q27_B_Part_5')
data2['N2Y-SAP Cloud']=data2['Q27_B_Part_6'].astype('category')
data2.pop('Q27_B_Part_6')
data2['N2Y-VMware Cloud']=data2['Q27_B_Part_7'].astype('category')
data2.pop('Q27_B_Part_7')
data2['N2Y-Salesforce Cloud']=data2['Q27_B_Part_8'].astype('category')
data2.pop('Q27_B_Part_8')
data2['N2Y-Alibaba Cloud']=data2['Q27_B_Part_9'].astype('category')
data2.pop('Q27_B_Part_9')
data2['N2Y-Tencent Cloud']=data2['Q27_B_Part_10'].astype('category')
data2.pop('Q27_B_Part_10')
data2['N2Y-None']=data2['Q27_B_Part_11'].astype('category')
data2.pop('Q27_B_Part_11')
data2['N2Y-OTHER']=data2['Q27_B_OTHER'].astype('category')
data2.pop('Q27_B_OTHER')
data2['N2Y Cloud Comp-Amazon EC2']=data2['Q29_B_Part_1'].astype('category')
data2.pop('Q29_B_Part_1')
data2['N2Y Cloud Comp-MS Azure VM']=data2['Q29_B_Part_2'].astype('category')
data2.pop('Q29_B_Part_2')
data2['N2Y Cloud Comp-GC Compute Engine']=data2['Q29_B_Part_3'].astype('category')
data2.pop('Q29_B_Part_3')
data2['N2Y Cloud Comp-None']=data2['Q29_B_Part_4'].astype('category')
data2.pop('Q29_B_Part_4')
data2['N2Y Cloud Comp-Other']=data2['Q29_B_OTHER'].astype('category')
data2.pop('Q29_B_OTHER')
data2['N2Y Data Stor-MS Azure Data Lake']=data2['Q30_B_Part_1'].astype('category')
data2.pop('Q30_B_Part_1')
data2['N2Y Data Stor-MS Azure Disk']=data2['Q30_B_Part_2'].astype('category')
data2.pop('Q30_B_Part_2')
data2['N2Y Data Stor-Amazon S3']=data2['Q30_B_Part_3'].astype('category')
data2.pop('Q30_B_Part_3')
data2['N2Y Data Stor-Amazon EFS']=data2['Q30_B_Part_4'].astype('category')
data2.pop('Q30_B_Part_4')
data2['N2Y Data Stor-GCS']=data2['Q30_B_Part_5'].astype('category')
data2.pop('Q30_B_Part_5')
data2['N2Y Data Stor-GCF']=data2['Q30_B_Part_6'].astype('category')
data2.pop('Q30_B_Part_6')
data2['N2Y Data Stor-No/None']=data2['Q30_B_Part_7'].astype('category')
data2.pop('Q30_B_Part_7')
data2['N2Y Data Stor-Other']=data2['Q30_B_OTHER'].astype('category')
data2.pop('Q30_B_OTHER')
data2['N2Y ML Prod-Amazon Sagemaker']=data2['Q31_B_Part_1'].astype('category')
data2.pop('Q31_B_Part_1')
data2['N2Y ML Prod-Azure ML Studio']=data2['Q31_B_Part_2'].astype('category')
data2.pop('Q31_B_Part_2')
data2['N2Y ML Prod-GC Vertex AI']=data2['Q31_B_Part_3'].astype('category')
data2.pop('Q31_B_Part_3')
data2['N2Y ML Prod-DataRobot']=data2['Q31_B_Part_4'].astype('category')
data2.pop('Q31_B_Part_4')
data2['N2Y ML Prod-Databricks']=data2['Q31_B_Part_5'].astype('category')
data2.pop('Q31_B_Part_5')
data2['N2Y ML Prod-Dataiku']=data2['Q31_B_Part_6'].astype('category')
data2.pop('Q31_B_Part_6')
data2['N2Y ML Prod-Alteryx']=data2['Q31_B_Part_7'].astype('category')
data2.pop('Q31_B_Part_7')
data2['N2Y ML Prod-Rapidminer']=data2['Q31_B_Part_8'].astype('category')
data2.pop('Q31_B_Part_8')
data2['N2Y ML Prod-None']=data2['Q31_B_Part_9'].astype('category')
data2.pop('Q31_B_Part_9')
data2['N2Y ML Prod-Other']=data2['Q31_B_OTHER'].astype('category')
data2.pop('Q31_B_OTHER')
data2['N2Y BD Prod-MySQL']=data2['Q32_B_Part_1'].astype('category')
data2.pop('Q32_B_Part_1')
data2['N2Y BD Prod-PostgreSQL']=data2['Q32_B_Part_2'].astype('category')
data2.pop('Q32_B_Part_2')
data2['N2Y BD Prod-SQLite']=data2['Q32_B_Part_3'].astype('category')
data2.pop('Q32_B_Part_3')
data2['N2Y BD Prod-Oracle DB']=data2['Q32_B_Part_4'].astype('category')
data2.pop('Q32_B_Part_4')
data2['N2Y BD Prod-Mongo DB']=data2['Q32_B_Part_5'].astype('category')
data2.pop('Q32_B_Part_5')
data2['N2Y BD Prod-Snowflake']=data2['Q32_B_Part_6'].astype('category')
data2.pop('Q32_B_Part_6')
data2['N2Y BD Prod-IBM DB2']=data2['Q32_B_Part_7'].astype('category')
data2.pop('Q32_B_Part_7')
data2['N2Y BD Prod-MS SQL Server']=data2['Q32_B_Part_8'].astype('category')
data2.pop('Q32_B_Part_8')
data2['N2Y BD Prod-MS Azure SQL DB']=data2['Q32_B_Part_9'].astype('category')
data2.pop('Q32_B_Part_9')
data2['N2Y BD Prod-MS Azure Cosmos DB']=data2['Q32_B_Part_10'].astype('category')
data2.pop('Q32_B_Part_10')
data2['N2Y BD Prod-Amazon Redshift']=data2['Q32_B_Part_11'].astype('category')
data2.pop('Q32_B_Part_11')
data2['N2Y BD Prod-Amazon Aurora']=data2['Q32_B_Part_12'].astype('category')
data2.pop('Q32_B_Part_12')
data2['N2Y BD Prod-Amazon DynamoDB']=data2['Q32_B_Part_13'].astype('category')
data2.pop('Q32_B_Part_13')
data2['N2Y BD Prod-Amazon RDS']=data2['Q32_B_Part_14'].astype('category')
data2.pop('Q32_B_Part_14')
data2['N2Y BD Prod-GC BigQuery']=data2['Q32_B_Part_15'].astype('category')
data2.pop('Q32_B_Part_15')
data2['N2Y BD Prod-GC SQL']=data2['Q32_B_Part_16'].astype('category')
data2.pop('Q32_B_Part_16')
data2['N2Y BD Prod-GC Firestore']=data2['Q32_B_Part_17'].astype('category')
data2.pop('Q32_B_Part_17')
data2['N2Y BD Prod-GC BigTable']=data2['Q32_B_Part_18'].astype('category')
data2.pop('Q32_B_Part_18')
data2['N2Y BD Prod-GC Spanner']=data2['Q32_B_Part_19'].astype('category')
data2.pop('Q32_B_Part_19')
data2['N2Y BD Prod-None']=data2['Q32_B_Part_20'].astype('category')
data2.pop('Q32_B_Part_20')
data2['N2Y BD Prod-Other']=data2['Q32_B_OTHER'].astype('category')
data2.pop('Q32_B_OTHER')
data2['N2Y BI Tools-MS Power BI']=data2['Q34_B_Part_1'].astype('category')
data2.pop('Q34_B_Part_1')
data2['N2Y BI Tools-Amazon QuickSight']=data2['Q34_B_Part_2'].astype('category')
data2.pop('Q34_B_Part_2')
data2['N2Y BI Tools-Google Data Studio']=data2['Q34_B_Part_3'].astype('category')
data2.pop('Q34_B_Part_3')
data2['N2Y BI Tools-Looker']=data2['Q34_B_Part_4'].astype('category')
data2.pop('Q34_B_Part_4')
data2['N2Y BI Tools-Tableau']=data2['Q34_B_Part_5'].astype('category')
data2.pop('Q34_B_Part_5')
data2['N2Y BI Tools-Salesforce']=data2['Q34_B_Part_6'].astype('category')
data2.pop('Q34_B_Part_6')
data2['N2Y BI Tools-Tableau CRM']=data2['Q34_B_Part_7'].astype('category')
data2.pop('Q34_B_Part_7')
data2['N2Y BI Tools-Qlik']=data2['Q34_B_Part_8'].astype('category')
data2.pop('Q34_B_Part_8')
data2['N2Y BI Tools-Domo']=data2['Q34_B_Part_9'].astype('category')
data2.pop('Q34_B_Part_9')
data2['N2Y BI Tools-TIBCO Spotfire']=data2['Q34_B_Part_10'].astype('category')
data2.pop('Q34_B_Part_10')
data2['N2Y BI Tools-Alteryx']=data2['Q34_B_Part_11'].astype('category')
data2.pop('Q34_B_Part_11')
data2['N2Y BI Tools-Sisense']=data2['Q34_B_Part_12'].astype('category')
data2.pop('Q34_B_Part_12')
data2['N2Y BI Tools-SAP Analytics Cloud']=data2['Q34_B_Part_13'].astype('category')
data2.pop('Q34_B_Part_13')
data2['N2Y BI Tools-MS Azure Synapse']=data2['Q34_B_Part_14'].astype('category')
data2.pop('Q34_B_Part_14')
data2['N2Y BI Tools-Thoughtspot']=data2['Q34_B_Part_15'].astype('category')
data2.pop('Q34_B_Part_15')
data2['N2Y BI Tools-None']=data2['Q34_B_Part_16'].astype('category')
data2.pop('Q34_B_Part_16')
data2['N2Y BI Tools-Other']=data2['Q34_B_OTHER'].astype('category')
data2.pop('Q34_B_OTHER')
data2['N2Y AutoML-Data Aug']=data2['Q36_B_Part_1'].astype('category')
data2.pop('Q36_B_Part_1')
data2['N2Y AutoML-Feature Engg/Sel']=data2['Q36_B_Part_2'].astype('category')
data2.pop('Q36_B_Part_2')
data2['N2Y AutoML-Model Sel']=data2['Q36_B_Part_3'].astype('category')
data2.pop('Q36_B_Part_3')
data2['N2Y AutoML-Model Arch']=data2['Q36_B_Part_4'].astype('category')
data2.pop('Q36_B_Part_4')
data2['N2Y AutoML-Hyparam Tuning']=data2['Q36_B_Part_5'].astype('category')
data2.pop('Q36_B_Part_5')
data2['N2Y AutoML-Full ML Pipelines']=data2['Q36_B_Part_6'].astype('category')
data2.pop('Q36_B_Part_6')
data2['N2Y AutoML-None']=data2['Q36_B_Part_7'].astype('category')
data2.pop('Q36_B_Part_7')
data2['N2Y AutoML-Other']=data2['Q36_B_OTHER'].astype('category')
data2.pop('Q36_B_OTHER')
data2['N2Y SP AutoML-Google Cloud']=data2['Q37_B_Part_1'].astype('category')
data2.pop('Q37_B_Part_1')
data2['N2Y SP AutoML-H2O Driverless AI']=data2['Q37_B_Part_2'].astype('category')
data2.pop('Q37_B_Part_2')
data2['N2Y SP AutoML-Databricks']=data2['Q37_B_Part_3'].astype('category')
data2.pop('Q37_B_Part_3')
data2['N2Y SP AutoML-DataRobot']=data2['Q37_B_Part_4'].astype('category')
data2.pop('Q37_B_Part_4')
data2['N2Y SP AutoML-Amazon Sagemaker']=data2['Q37_B_Part_5'].astype('category')
data2.pop('Q37_B_Part_5')
data2['N2Y SP AutoML-Azure']=data2['Q37_B_Part_6'].astype('category')
data2.pop('Q37_B_Part_6')
data2['N2Y SP AutoML-None']=data2['Q37_B_Part_7'].astype('category')
data2.pop('Q37_B_Part_7')
data2['N2Y SP AutoML-Other']=data2['Q37_B_OTHER'].astype('category')
data2.pop('Q37_B_OTHER')
data2['N2Y ML Exp-Neptune.ai']=data2['Q38_B_Part_1'].astype('category')
data2.pop('Q38_B_Part_1')
data2['N2Y ML Exp-Weights & Biases']=data2['Q38_B_Part_2'].astype('category')
data2.pop('Q38_B_Part_2')
data2['N2Y ML Exp-Comet.ML']=data2['Q38_B_Part_3'].astype('category')
data2.pop('Q38_B_Part_3')
data2['N2Y ML Exp-Sacred+Omniboard']=data2['Q38_B_Part_4'].astype('category')
data2.pop('Q38_B_Part_4')
data2['N2Y ML Exp-Tensorboard']=data2['Q38_B_Part_5'].astype('category')
data2.pop('Q38_B_Part_5')
data2['N2Y ML Exp-Guild.ai']=data2['Q38_B_Part_6'].astype('category')
data2.pop('Q38_B_Part_6')
data2['N2Y ML Exp-Polyaxon']=data2['Q38_B_Part_7'].astype('category')
data2.pop('Q38_B_Part_7')
data2['N2Y ML Exp-Clear ML']=data2['Q38_B_Part_8'].astype('category')
data2.pop('Q38_B_Part_8')
data2['N2Y ML Exp-Domino Model Monitor']=data2['Q38_B_Part_9'].astype('category')
data2.pop('Q38_B_Part_9')
data2['N2Y ML Exp-MLflow']=data2['Q38_B_Part_10'].astype('category')
data2.pop('Q38_B_Part_10')
data2['N2Y ML Exp-None']=data2['Q38_B_Part_11'].astype('category')
data2.pop('Q38_B_Part_11')
data2['N2Y ML Exp-Other']=data2['Q38_B_OTHER'].astype('category')
data2.pop('Q38_B_OTHER')
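The assign & pop pairs above can also be written as a single mapping. Below is a minimal sketch of that idea on a made-up 2-column frame (`toy`, `rename_map` & the cell values are placeholders assumed for illustration, not survey data). Note that `rename` keeps each column in its original position, whereas the cell above moves every renamed column to the end of the dataframe.

```python
import pandas as pd

# Made-up stand-in for data2; the cell values are placeholders, not survey data.
toy = pd.DataFrame({
    "Q30_B_Part_2": ["option text", None],
    "Q30_B_Part_3": [None, "option text"],
})

# Old name -> new name, copied from two of the pairs in the cell above.
rename_map = {
    "Q30_B_Part_2": "N2Y Data Stor-MS Azure Disk",
    "Q30_B_Part_3": "N2Y Data Stor-Amazon S3",
}

# Rename in one call, then cast only the renamed columns to the category dtype.
renamed = toy.rename(columns=rename_map).astype(
    {new_name: "category" for new_name in rename_map.values()}
)
print(renamed.dtypes)
```

Extending `rename_map` with the remaining pairs would cover the whole block in one step.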

Let's examine the dataset to check that all columns have been renamed.

In [None]:
data2.head()

At this point I have completed the data cleaning activities & can start plotting charts to gain insights. 

But before that, please note a few observations about the general structure of the dataset.

There are 2 types of questions in the dataset - 

1. Single column-Multiple options questions (e.g. Age_years, gender, Education) & 

2. Multiple columns-Multiple Options questions. 

Within the 2nd type of questions described above there are further 2 types - 

1. Questions pertaining to current preferences (ML Algorithm/BI tool/Cloud Computing) & 

2. Questions related to future preferences i.e. interests in "Next 2 Years".
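The raw column names (before the renaming done above) already encode these types. Here is a small sketch, assuming the naming pattern seen in this notebook: a plain name like `Q3` for single column questions, a `_Part_n` or `_OTHER` suffix for multiple column questions & a `_B_` infix for the "Next 2 Years" variants. The helper `question_type` & the example name `Q7_Part_1` are my own illustrations.

```python
import re

def question_type(column_name):
    """Classify a raw survey column name into the types described above."""
    # Multiple-column question about the next 2 years, e.g. Q30_B_Part_2
    if re.search(r"_B_(Part_\d+|OTHER)$", column_name):
        return "multi-column, next 2 years"
    # Any other multiple-column question, e.g. Q7_Part_1
    if re.search(r"_(Part_\d+|OTHER)$", column_name):
        return "multi-column, current"
    # Everything else is a single-column question, e.g. Q3
    return "single column"

examples = ["Q3", "Q7_Part_1", "Q30_B_Part_2", "Q30_B_OTHER"]
labels = {name: question_type(name) for name in examples}
print(labels)
```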

Let's plot charts for the first type of questions.

I will use Seaborn's countplot & its hue parameter, which splits each bar by a second variable, to extract insights.
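For reference, the bar heights that a countplot with hue draws are simply the counts of each (x, hue) pair. A minimal sketch of the same numbers using `pd.crosstab`, on a few made-up rows (the values in `toy` are invented for illustration):

```python
import pandas as pd

# Made-up rows standing in for data2; not survey data.
toy = pd.DataFrame({
    "Age_years": ["18-21", "18-21", "22-24", "18-21", "22-24"],
    "gender": ["Man", "Woman", "Man", "Man", "Man"],
})

# One row per age group, one column per gender: the counts behind each bar.
counts = pd.crosstab(toy["Age_years"], toy["gender"])

# Same ordering as the charts: most frequent x category first.
order = toy["Age_years"].value_counts().index
counts = counts.loc[order]
print(counts)
```

Printing such a table next to a chart makes it easier to quote exact numbers in the insights.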

In the first chart let's plot Age & gender.

We can note from the chart below that the largest group of Indian participants in the survey is the 18-21 age group, representing students in the final years of their graduate programmes.

The numbers drop significantly in the 30+ age groups & we see very few participants in the 70+ category.

In [None]:
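# Set the palette & figure size first: they only apply to figures created afterwards.
sns.set_palette("bright")
rcParams['figure.figsize'] = 20,5
# Count participants per age group (most frequent first), split by gender.
sns.countplot(data=data2, x='Age_years',order = data2['Age_years'].value_counts().index, hue='gender')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large',
)
plt.legend(loc='upper right')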

Next we will analyze the distribution of Education against the current Job title.

From the chart below, we can note that the largest number of professionals hold a Bachelor's degree.

The majority of students are studying for a Bachelor's degree & learning data science.

Professionals with a Master's degree form the next largest subset.

In the 4th position, we can note a large number of students learning data science through education that is not a bachelor's degree.

In [None]:
sns.countplot(data=data2, x='Education',order = data2['Education'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
rcParams['figure.figsize'] = 20,5

In the next chart, let's plot coding experience in years against the current title.

From the chart below, we can say that most data science engineers have between 1 & 3 years of coding experience.

In [None]:
sns.countplot(data=data2, x='Coding_Exp',order = data2['Coding_Exp'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
rcParams['figure.figsize'] = 20,10

Next I will plot a chart of Title & coding languages. 

Since this is a multiple response type of question, I will use the pd.melt() function to bring the multiple columns of data into 1 column of coding language. 

The melted dataset has 3 columns: the Title of the participant, the coding language & 'Indians', which holds each Indian participant's response for that language (NaN if the option was not selected). 

This lets us plot the chart using Seaborn's hue parameter & gain additional insights in a single chart.

This procedure is followed for all further multiple response type questions in this notebook.
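To make the reshaping concrete, here is a minimal sketch of melt on a made-up 3-row frame (the rows in `toy` are invented for illustration; only the column names follow this notebook):

```python
import pandas as pd

# Wide format: one column per coding language, NaN where the option was not selected.
toy = pd.DataFrame({
    "Title": ["Student", "Data Scientist", "Student"],
    "Coding-Python": ["Python", "Python", None],
    "Coding-SQL": [None, "SQL", None],
})

# Long format: one row per (participant, language) pair.
melted = toy.melt(id_vars=["Title"], var_name="language", value_name="Indians")
print(melted)
```

The 3 x 2 grid of answers becomes 6 rows, of which only the 3 non-NaN rows are actual selections.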

In the next step I have created a separate dataframe for the coding languages & title.

In [None]:
data6=data2[["Title","Coding-Python","Coding-R","Coding-SQL","Coding-C","Coding-C++","Coding-Java","Coding-Javascript","Coding-Julia","Coding-Swift","Coding-Bash","Coding-MATLAB","Coding-None","Coding-Other"]]

Let's examine this dataset.

In [None]:
data6.head()

I will now use the pd.melt() function to convert the above dataset into a 3 column dataset.

In [None]:
data6melted=data6.melt(id_vars=['Title'],var_name='language',value_name='Indians')

Let's examine this dataset.

In [None]:
data6melted.head()

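We can see there are NaN values in the dataset (options the participant did not select). These must be removed, otherwise the countplot would count every row, selected or not, & every language would show the same bar height.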

In [None]:
data6melted.isnull().sum()

In [None]:
data6melted1=data6melted.dropna(subset=['Indians'])
data6melted2=data6melted1.reset_index(drop=True)

Let's examine this dataset to check if the NaN values have been removed.

In [None]:
data6melted2.head()

In [None]:
data6melted2.isnull().sum()

We can see the NaN values are removed from the dataset & now the chart can be plotted.
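The melt, dropna & reset_index steps above are repeated for every multiple response question further below, so they could be wrapped into one helper. This is only a sketch: `melt_responses` is a hypothetical name, the rows in `toy` are made up & selecting columns by prefix assumes that no unrelated column starts with the same text.

```python
import pandas as pd

def melt_responses(df, prefix, var_name, id_col="Title"):
    """Melt all columns starting with `prefix` & keep only selected options."""
    option_cols = [c for c in df.columns if c.startswith(prefix)]
    return (
        df[[id_col] + option_cols]
        .melt(id_vars=[id_col], var_name=var_name, value_name="Indians")
        .dropna(subset=["Indians"])   # drop options that were not selected
        .reset_index(drop=True)       # renumber rows 0..n-1
    )

# Made-up rows standing in for data2; not survey data.
toy = pd.DataFrame({
    "Title": ["Student", "Data Scientist"],
    "Coding-Python": ["Python", "Python"],
    "Coding-R": [None, "R"],
})
tidy = melt_responses(toy, prefix="Coding-", var_name="language")
print(tidy)
```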

From the chart below we can see Python is the most popular coding language amongst all job titles.

Data Scientists have voted SQL as 2nd favourite & R is at 3rd position.
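A ranking for a single title can also be checked numerically instead of being read off the bars. A minimal sketch on made-up rows shaped like data6melted2 (the title value 'Data Scientist' & all counts are assumptions for illustration, not survey results):

```python
import pandas as pd

# Made-up long-format rows: one row per selected language.
toy = pd.DataFrame({
    "Title": ["Data Scientist"] * 6 + ["Student"] * 3,
    "language": ["Coding-Python"] * 3 + ["Coding-SQL"] * 2 + ["Coding-R"]
                + ["Coding-Python"] * 2 + ["Coding-R"],
})

# Keep one title only, then count & sort its languages (most frequent first).
ds_ranking = toy.loc[toy["Title"] == "Data Scientist", "language"].value_counts()
print(ds_ranking)
```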

In [None]:
sns.countplot(data=data6melted2, x='language', order=data6melted2['language'].value_counts().index,hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
rcParams['figure.figsize'] = 20,5

The next chart shows the number of times a Tensor Processing Unit (TPU) has been used. We can see "Never" topping the chart. 

Surprisingly, a large number of Data Scientists have also responded "Never".

In [None]:
sns.countplot(data=data2, x='TPU_Usage_Count',order = data2['TPU_Usage_Count'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
rcParams['figure.figsize'] = 20,5

The next chart shows the number of years participants have used ML methods, split by their coding experience.

At the top position, we can see participants with less than a year of coding experience who have used machine learning methods for under 1 year.

At 3rd position are participants who don't use ML methods & it's a significant number.

In [None]:
sns.countplot(data=data2, x='ML methods-years', order = data2['ML methods-years'].value_counts().index,hue='Coding_Exp')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
rcParams['figure.figsize'] = 20,5

The chart below shows where the survey participants are currently employed.

The chart indicates that the largest number of participants are from Computers/Technology, either trying to make or having already made a transition to the field of Data Science.

This is followed by a range of various industries!
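A claim such as "the largest number of participants" is easier to judge with percentages. A minimal sketch using value_counts with normalize=True on made-up values (the employer labels & counts below are invented for illustration):

```python
import pandas as pd

# Made-up stand-in for data2['Current Employer']; not survey data.
employers = pd.Series(
    ["Computers/Technology"] * 3 + ["Academics/Education"] * 2,
    name="Current Employer",
)

# Share of each employer type in percent, most frequent first.
share_pct = employers.value_counts(normalize=True).mul(100).round(1)
print(share_pct)
```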

In [None]:
sns.countplot(data=data2, x='Current Employer',order = data2['Current Employer'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
rcParams['figure.figsize'] = 20,5

The next chart shows the Size of Company & the size of the Data Science team. 

We can see that large companies with 10,000 or more employees most often have the largest Data Science teams of 20+ professionals.
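Raw counts depend on how many participants fall into each company size, so a row-normalised table can support this reading. A minimal sketch with `pd.crosstab(..., normalize="index")` on made-up rows (the category labels & counts are assumptions for illustration):

```python
import pandas as pd

# Made-up rows standing in for data2; not survey data.
toy = pd.DataFrame({
    "Size of Company": ["10,000 or more employees"] * 4 + ["0-49 employees"] * 4,
    "DS team": ["20+", "20+", "20+", "1-2", "1-2", "1-2", "1-2", "20+"],
})

# Each row sums to 1: the share of team sizes within one company size.
shares = pd.crosstab(toy["Size of Company"], toy["DS team"], normalize="index")
print(shares)
```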

In [None]:
sns.countplot(data=data2, x='Size of Company',order = data2['Size of Company'].value_counts().index, hue='DS team')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
rcParams['figure.figsize'] = 20,5

The next chart examines which Titles are found in the various Sizes of Companies.

The chart shows that participants who are statisticians are employed in the largest numbers by companies with 10,000 or more employees.

The largest number of Data Scientists work in companies with 0-49 employees.

In [None]:
sns.countplot(data=data2, x='Size of Company',order = data2['Size of Company'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
rcParams['figure.figsize'] = 20,6

The next chart shows ML models used by Size of company.

At the 3rd position we can note a point of interest: the large brown bar shows that many companies with 10,000 or more employees have had models in production for more than 2 years!

In [None]:
sns.countplot(data=data2, x='ML models Used', order = data2['ML models Used'].value_counts().index,hue='Size of Company')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
rcParams['figure.figsize'] = 20,5

The next chart shows the Best Cloud for developer experience, split by the Coding experience of the participant. 

Amazon wins the vote (irrespective of coding experience) here & Google Cloud is at 3rd place with Azure coming 4th.

In [None]:
sns.countplot(data=data2, x='Best Cloud for Developer exp',order = data2['Best Cloud for Developer exp'].value_counts().index, hue='Coding_Exp')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
rcParams['figure.figsize'] = 20,5

The next chart is for selected choice of Big Data tool by Title of the participant.

We can note MySQL is the favourite of all titles & takes the top position by a long margin.

Mongo DB at 3rd position is a favourite with Data Scientists.

In [None]:
sns.countplot(data=data2, x='BD-Selected Choice', order = data2['BD-Selected Choice'].value_counts().index,hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
rcParams['figure.figsize'] = 20,5

The next chart shows popular BI tools by Title of participant. We can note that Tableau & Power BI continue to be the favourites, with Google Data Studio overtaking Qlik for the 3rd position.

In [None]:
sns.countplot(data=data2, x='BI-Tools-Selected Choice', order = data2['BI-Tools-Selected Choice'].value_counts().index,hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
rcParams['figure.figsize'] = 20,5

The next chart shows the primary tool used by various Titles. It is interesting to note that Excel/Google Sheets is the most popular tool, followed by the core tools in 2nd place.

In [None]:
sns.countplot(data=data2, x='Primary Tool',order = data2['Primary Tool'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
rcParams['figure.figsize'] = 20,5

The next chart shows the Coding language suggested for newcomers to the field of data science by participants with various years of experience.

A couple of interesting insights here: Python is the universal favourite & wins the vote without any ambiguity.

SQL comes a distant 2nd while R is relegated to 3rd place. Old time favourites such as C/C++ & Java are further back.

In [None]:
sns.countplot(data=data2, x='Coding_lang_Newbies',order = data2['Coding_lang_Newbies'].value_counts().index, hue='Coding_Exp')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
rcParams['figure.figsize'] = 20,5

In the next chart I will plot the IDEs used by various titles.

The chart indicates Jupyter Notebook is popular not only with students but also with Data Scientists.

VS Code is the next most popular tool, followed by Pycharm.

Spyder is in 5th position & R Studio is even further from the top.

In [None]:
data13=data2[["Title","IDE-Jupyter","IDE-Rstudio","IDE-Visual_Studio","IDE-VScode","IDE-Pycharm","IDE-Spyder","IDE-Notepad++","IDE-Sublime_text","IDE-Vim/Emacs","IDE-Matlab","IDE-Jupyter_Notebook","IDE-None","IDE-Other"]]
data13melted=data13.melt(id_vars=['Title'],var_name='IDE',value_name='Indians')
data13melted1=data13melted.dropna(subset=['Indians'])
data13melted2=data13melted1.reset_index(drop=True)
sns.countplot(data=data13melted2, x='IDE',order=data13melted2['IDE'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 20,5

Next chart is for Hosted Notebooks by various titles.

We can see that the largest number of Data Scientists are using Google Colab, with Kaggle in 2nd position.

In [None]:
data21=data2[["Title","HostedNB-Kaggle","HostedNB-Colab","HostedNB-Azure","HostedNB-Paperspace/Gradient","HostedNB-Binder/JupyterHub","HostedNB-Code_Ocean","HostedNB-Watson_Studio","HostedNB-Amazon_Sagemaker","HostedNB-Amazon_EMR","HostedNB-GC_NB","HostedNB-GC_Datalab","HostedNB-Databricks_collab","HostedNB-Zepl","HostedNB-Deepnote","HostedNB-Observable","HostedNB-None","HostedNB-Other"]]
data21melted=data21.melt(id_vars=['Title'],var_name='HOSTED NB',value_name='Indians')
data21melted1=data21melted.dropna(subset=['Indians'])
data21melted2=data21melted1.reset_index(drop=True)
sns.countplot(data=data21melted2, x='HOSTED NB',order=data21melted2['HOSTED NB'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 20,5

Next chart is for visualisation libraries used by various titles.

As can be seen in the chart, Matplotlib, Seaborn & Plotly occupy the top 3 positions respectively.

D3 js, the library used in the Notebook that won the early submission prize, is rarely used by the participants & takes a low position.

In [None]:
data40=data2[["Title","Vis_lib-Matplotlib","Vis_lib-Seaborn","Vis_lib-Plotly/Plotly Express","Vis_lib-GGplot/ggplot2","Vis_lib-Shiny","Vis_lib-D3 js","Vis_lib-Altair","Vis_lib-Bokeh","Vis_lib-Geoplotlib","Vis_lib-Leaflet/Folium","Vis_lib-None","Vis_lib-Other"]]
data40melted=data40.melt(id_vars=['Title'],var_name='VISUALISATION LIBRARY',value_name='Indians')
data40melted1=data40melted.dropna(subset=['Indians'])
data40melted2=data40melted1.reset_index(drop=True)
sns.countplot(data=data40melted2, x='VISUALISATION LIBRARY',order=data40melted2['VISUALISATION LIBRARY'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5

I am plotting the next chart to see which online platform is preferred for educational courses by various titles.

As seen in the chart below, Coursera is everyone's favourite with Kaggle in 2nd place & Udemy in 3rd position.

In [None]:
data50=data2[["Title","Course-Coursera","Course-edX","Course-Kaggle","Course-DataCamp","Course-Fast.ai","Course-Udacity","Course-Udemy","Course-LinkedIN","Course-Cloud Cert","Course-Univ Courses","Course-None","Course-Other"]]
data50melted=data50.melt(id_vars=['Title'],var_name='Courses',value_name='Indians')
data50melted1=data50melted.dropna(subset=['Indians'])
data50melted2=data50melted1.reset_index(drop=True)
sns.countplot(data=data50melted2, x='Courses',order=data50melted2['Courses'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='xx-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5

The next chart shows which media platforms are used by various titles to gain information about Data Science.

We can see Kaggle tops the chart here, followed by Youtube & blogs.

In [None]:
data51=data2[["Title","DS Media-Twitter","DS Media-Email","DS Media-Reddit","DS Media-Kaggle","DS Media-Course Forums","DS Media-Youtube","DS Media-Podcasts","DS Media-Blogs","DS Media-Journals","DS Media-Slack Communities","DS Media-None","DS Media-Other"]]
data51melted=data51.melt(id_vars=['Title'],var_name='DS INFO MEDIA SOURCE',value_name='Indians')
data51melted1=data51melted.dropna(subset=['Indians'])
data51melted2=data51melted1.reset_index(drop=True)
sns.countplot(data=data51melted2, x='DS INFO MEDIA SOURCE',order=data51melted2['DS INFO MEDIA SOURCE'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='xx-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5

We will see which are the ML frameworks used by various titles in the next chart.

SK Learn is the favourite here, followed by TensorFlow & Keras.

In [None]:
data52=data2[["Title","ML FRMWKS-SK Learn","ML FRMWKS-TensorFlow","ML FRMWKS-Keras","ML FRMWKS-PyTorch","ML FRMWKS-Fast.ai","ML FRMWKS-MXNet","ML FRMWKS-XGBoost","ML FRMWKS-LightGBM","ML FRMWKS-CatBoost","ML FRMWKS-Prophet","ML FRMWKS-H2O 3","ML FRMWKS-Caret","ML FRMWKS-Tidymodels","ML FRMWKS-JAX","ML FRMWKS-PyTorch Lightning","ML FRMWKS-Huggingface","ML FRMWKS-None","ML FRMWKS-Other"]]
data52melted=data52.melt(id_vars=['Title'],var_name='ML FRAMEWORKS',value_name='Indians')
data52melted1=data52melted.dropna(subset=['Indians'])
data52melted2=data52melted1.reset_index(drop=True)
sns.countplot(data=data52melted2, x='ML FRAMEWORKS',order=data52melted2['ML FRAMEWORKS'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='xx-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5

In the next chart we can see the favourite ML Algorithms.

Linear/Logistic Regression & Decision Trees/Random Forests take the top 2 spots.

More Data Scientists use Gradient Boosting, in 4th place overall, than Neural Networks, in 3rd position.

In [None]:
data53=data2[["Title","ML Algo-Linear/Logistic Reg","ML Algo-Decision Tree/Random Forests","ML Algo-Gradient Boosting","ML Algo-Bayesian","ML Algo-Evolutionary","ML Algo-Dense Neural Ntwks","ML Algo-Conv Neural Ntwks","ML Algo-Gen Adv Ntwks","ML Algo-Rec Neural Ntwks","ML Algo-Transf Ntwks","ML Algo-None","ML Algo-Other"]]
data53melted=data53.melt(id_vars=['Title'],var_name='ML ALGORITHMS',value_name='Indians')
data53melted1=data53melted.dropna(subset=['Indians'])
data53melted2=data53melted1.reset_index(drop=True)
sns.countplot(data=data53melted2, x='ML ALGORITHMS',order=data53melted2['ML ALGORITHMS'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5

The next chart shows which Managed ML Products are popular by title.

The chart indicates a surprising result: the largest number of participants are not using ML Products regularly.

Amazon Sagemaker & MS Azure ML Studio are close to each other in terms of popularity amongst Data Scientists.

In [None]:
data54=data2[["Title","ML Prod-Amazon sagemaker","ML Prod-MS Azure ML Studio","ML Prod-GC Vertex AI","ML Prod-DataRobot","ML Prod-DataBricks","ML Prod-Dataiku","ML Prod-Alteryx","ML Prod-Rapidminer","ML Prod-No","ML Prod-Other"]]
data54melted=data54.melt(id_vars=['Title'],var_name='MANAGED ML PRODUCTS',value_name='Indians')
data54melted1=data54melted.dropna(subset=['Indians'])
data54melted2=data54melted1.reset_index(drop=True)
sns.countplot(data=data54melted2, x='MANAGED ML PRODUCTS',order=data54melted2['MANAGED ML PRODUCTS'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5

Now, let's look at which ML Products participants have voted for learning in the next 2 years.

The chart shows MS Azure ML Studio taking the top place, with students keen to learn it in the next 2 years.

An even larger number of students want to learn Google Cloud, which appears in 2nd place on the chart.

In [None]:
data57=data2[["Title","N2Y ML Prod-Azure ML Studio","N2Y ML Prod-GC Vertex AI","N2Y ML Prod-DataRobot","N2Y ML Prod-Databricks","N2Y ML Prod-Dataiku","N2Y ML Prod-Alteryx","N2Y ML Prod-Rapidminer","N2Y ML Prod-None","N2Y ML Prod-Other"]]
data57melted=data57.melt(id_vars=['Title'],var_name='PERSONS GETTING FAMILIAR WITH ML PRODUCTS IN NEXT 2 YEARS',value_name='Indians')
data57melted1=data57melted.dropna(subset=['Indians'])
data57melted2=data57melted1.reset_index(drop=True)
sns.countplot(data=data57melted2, x='PERSONS GETTING FAMILIAR WITH ML PRODUCTS IN NEXT 2 YEARS',order=data57melted2['PERSONS GETTING FAMILIAR WITH ML PRODUCTS IN NEXT 2 YEARS'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5
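A hand-typed column list like the one above is easy to leave incomplete: the renaming step earlier creates a column named 'N2Y ML Prod-Amazon Sagemaker', which is not in the list for data57, so it is not counted in this chart. Below is a minimal sketch of a check for this on an empty made-up frame (`toy`, `selected`, `available` & `missing` are illustrative names & the frame lists only 3 of the option columns):

```python
import pandas as pd

# Empty stand-in for data2 with a few of the renamed columns.
toy = pd.DataFrame(columns=[
    "Title",
    "N2Y ML Prod-Amazon Sagemaker",
    "N2Y ML Prod-Azure ML Studio",
    "N2Y ML Prod-GC Vertex AI",
])

# The hand-typed selection, shortened for the example.
selected = ["Title", "N2Y ML Prod-Azure ML Studio", "N2Y ML Prod-GC Vertex AI"]

# Every column that belongs to the question, found by its shared prefix.
available = [c for c in toy.columns if c.startswith("N2Y ML Prod-")]
missing = [c for c in available if c not in selected]
print(missing)
```

An empty `missing` list would confirm that the selection covers every option of the question.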

The next chart shows the tools used for managing ML Experiments by various titles.

Currently the largest number of participants are not using any tool for managing ML Experiments.

Tensorboard, in 2nd place, is the most popular tool in use.

Let's examine in the chart after this one which tools are on the participants' list to learn in the next 2 years.

In [None]:
data55=data2[["Title","ML Exp-Neptune.ai","ML Exp-Weights & Biases","ML Exp-Comet.ml","ML Exp-Sacred+Omniboard","ML Exp-TensorBoard","ML Exp-Guild.ai","ML Exp-Polyaxon","ML Exp-ClearML","ML Exp-Domino Model Monitor","ML Exp-MLflow","ML Exp-No/None","ML Exp-Other"]]
data55melted=data55.melt(id_vars=['Title'],var_name='TOOL FOR MANAGING ML EXPERIMENTS',value_name='Indians')
data55melted1=data55melted.dropna(subset=['Indians'])
data55melted2=data55melted1.reset_index(drop=True)
sns.countplot(data=data55melted2, x='TOOL FOR MANAGING ML EXPERIMENTS',order=data55melted2['TOOL FOR MANAGING ML EXPERIMENTS'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5

The graph below shows students are most interested in learning Tensorboard for managing ML experiments in the next 2 years.

In [None]:
data58=data2[["Title","N2Y ML Exp-Neptune.ai","N2Y ML Exp-Weights & Biases","N2Y ML Exp-Comet.ML","N2Y ML Exp-Sacred+Omniboard","N2Y ML Exp-Tensorboard","N2Y ML Exp-Guild.ai","N2Y ML Exp-Polyaxon","N2Y ML Exp-Clear ML","N2Y ML Exp-Domino Model Monitor","N2Y ML Exp-MLflow","N2Y ML Exp-None","N2Y ML Exp-Other"]]
data58melted=data58.melt(id_vars=['Title'],var_name='PERSONS GETTING FAMILIAR WITH TOOLS FOR ML EXPERIMENTS IN NEXT 2 YEARS',value_name='Indians')
data58melted1=data58melted.dropna(subset=['Indians'])
data58melted2=data58melted1.reset_index(drop=True)
sns.countplot(data=data58melted2, x='PERSONS GETTING FAMILIAR WITH TOOLS FOR ML EXPERIMENTS IN NEXT 2 YEARS',order=data58melted2['PERSONS GETTING FAMILIAR WITH TOOLS FOR ML EXPERIMENTS IN NEXT 2 YEARS'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5

In the chart below we can see the most popular apps for sharing ML work, as voted by the participants.

We can see Github takes the top spot, followed by Kaggle.

Though Colab appears in 4th position overall, it is in 3rd position among Data Scientists.

In [None]:
data56=data2[["Title","Share ML App-Plotly Dash","Share ML App-Streamlit","Share ML App-NB Viewer","Share ML App-Github","Share ML App-Personal Blog","Share ML App-Kaggle","Share ML App-Colab","Share ML App-Shiny","Share ML App-Not Sharing","Share ML App-Other"]]
data56melted=data56.melt(id_vars=['Title'],var_name='SHARING APP FOR ML',value_name='Indians')
data56melted1=data56melted.dropna(subset=['Indians'])
data56melted2=data56melted1.reset_index(drop=True)
sns.countplot(data=data56melted2, x='SHARING APP FOR ML',order=data56melted2['SHARING APP FOR ML'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5

The next chart shows the categories of Auto ML libraries used by various titles.

We can see that the largest number of participants are not using Auto ML libraries.

In 2nd position, Data Scientists are using Auto ML libraries for model selection.

In the chart after this one, let's examine which Auto ML libraries participants want to study in the next 2 years.

In [None]:
data72=data2[["Title","AutoML-Data Aug","AutoML-Feature Engg/Sel","AutoML-Model Sel","AutoML-Model Arch","AutoML-Hyparam Tuning","AutoML-Full ML Pipelines","AutoML-No/None","AutoML-Other"]]
data72melted=data72.melt(id_vars=['Title'],var_name='CATEGORY OF AUTO ML TOOLS USED REGULARLY',value_name='Indians')
data72melted1=data72melted.dropna(subset=['Indians'])
data72melted2=data72melted1.reset_index(drop=True)
sns.countplot(data=data72melted2, x='CATEGORY OF AUTO ML TOOLS USED REGULARLY',order=data72melted2['CATEGORY OF AUTO ML TOOLS USED REGULARLY'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5

In the next 2 years participants want to learn Auto ML for full ML pipelines followed by Auto ML for Model Selection.

In [None]:
data73=data2[["Title","N2Y AutoML-Data Aug","N2Y AutoML-Feature Engg/Sel","N2Y AutoML-Model Sel","N2Y AutoML-Model Arch","N2Y AutoML-Hyparam Tuning","N2Y AutoML-Full ML Pipelines","N2Y AutoML-None","N2Y AutoML-Other"]]
data73melted=data73.melt(id_vars=['Title'],var_name='CATEGORY OF AUTO ML TOOLS TO BECOME FAMILIAR IN NEXT 2 YEARS',value_name='Indians')
data73melted1=data73melted.dropna(subset=['Indians'])
data73melted2=data73melted1.reset_index(drop=True)
sns.countplot(data=data73melted2, x='CATEGORY OF AUTO ML TOOLS TO BECOME FAMILIAR IN NEXT 2 YEARS',order=data73melted2['CATEGORY OF AUTO ML TOOLS TO BECOME FAMILIAR IN NEXT 2 YEARS'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5

In the next chart we will examine which Auto ML tools are used by various titles.

The largest number of participants are not using any Auto ML tool.

In 2nd place is Google Cloud Auto ML, followed by MS Azure in 3rd.

In the chart after this one, let's see which Auto ML tools participants have voted for learning in the next 2 years.

In [None]:
data74=data2[["Title","2AutoML-GC AutoML","2AutoML-H2O Driverless AI","2AutoML-Databricks AutoML","2AutoML-DataRobot AutoML","2AutoML-Amazon Sagemaker Autopilot","2AutoML-Azure Auto ML","2AutoML-No/None","2AutoML-Other"]]
data74melted=data74.melt(id_vars=['Title'],var_name='AUTO ML TOOLS USED REGULARLY',value_name='Indians')
data74melted1=data74melted.dropna(subset=['Indians'])
data74melted2=data74melted1.reset_index(drop=True)
# Set the figure size and palette before plotting so they apply to this chart.
rcParams['figure.figsize'] = 20,7
sns.set_palette("bright")
sns.countplot(data=data74melted2, x='AUTO ML TOOLS USED REGULARLY',order=data74melted2['AUTO ML TOOLS USED REGULARLY'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')

We can see that the same entries as in the previous chart each move up by 1 position in the top 3.

In [None]:
data75=data2[["Title","N2Y SP AutoML-Google Cloud","N2Y SP AutoML-H2O Driverless AI","N2Y SP AutoML-Databricks","N2Y SP AutoML-DataRobot","N2Y SP AutoML-Amazon Sagemaker","N2Y SP AutoML-Azure","N2Y SP AutoML-None","N2Y SP AutoML-Other"]]
data75melted=data75.melt(id_vars=['Title'],var_name='AUTO ML TOOLS TO BECOME FAMILIAR IN NEXT 2 YEARS',value_name='Indians')
data75melted1=data75melted.dropna(subset=['Indians'])
data75melted2=data75melted1.reset_index(drop=True)
sns.countplot(data=data75melted2, x='AUTO ML TOOLS TO BECOME FAMILIAR IN NEXT 2 YEARS',order=data75melted2['AUTO ML TOOLS TO BECOME FAMILIAR IN NEXT 2 YEARS'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,6

In the next chart we can see that the largest number of participants have the role of analyzing & understanding data in their respective organisations.

We can also see that, in the 4th position, a larger number of Data Scientists have the role of improving ML models.

In [None]:
data67=data2[["Title","Role-Analyze & Understand","Role-Build/Run Data Infra","Role-Explore ML application","Role-Build/Run ML service","Role-Improve ML models","Role-Research for ML","Role-None","Role-Other"]]
data67melted=data67.melt(id_vars=['Title'],var_name='ROLE AT WORK',value_name='Indians')
data67melted1=data67melted.dropna(subset=['Indians'])
data67melted2=data67melted1.reset_index(drop=True)
sns.countplot(data=data67melted2, x='ROLE AT WORK',order=data67melted2['ROLE AT WORK'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5

<h1><center><font size="5">Let's focus on Computer Vision, NLP, Big Data & BI in next charts!</font></center></h1>

In the chart below we can see that Image Classification is the most popular field of Computer Vision for all titles.

In [None]:
data68=data2[["Title","CV-Image/Video","CV-Image Seg","CV-Obj Detection","CV-Image Classi","CV-Gen Ntwks","CV-None","CV-Other"]]
data68melted=data68.melt(id_vars=['Title'],var_name='COMPUTER VISION METHODS USED ON REGULAR BASIS',value_name='Indians')
data68melted1=data68melted.dropna(subset=['Indians'])
data68melted2=data68melted1.reset_index(drop=True)
sns.countplot(data=data68melted2, x='COMPUTER VISION METHODS USED ON REGULAR BASIS',order=data68melted2['COMPUTER VISION METHODS USED ON REGULAR BASIS'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5

In the chart below we can see that word embeddings are the most popular NLP method among all participants.

In [None]:
data69=data2[["Title","NLP-Word Embeddings","NLP-Encoder/Decoder","NLP-Contextualised Emb","NLP-Transf lang","NLP-none","NLP-Other"]]
data69melted=data69.melt(id_vars=['Title'],var_name='NLP METHODS USED ON REGULAR BASIS',value_name='Indians')
data69melted1=data69melted.dropna(subset=['Indians'])
data69melted2=data69melted1.reset_index(drop=True)
sns.countplot(data=data69melted2, x='NLP METHODS USED ON REGULAR BASIS',order=data69melted2['NLP METHODS USED ON REGULAR BASIS'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5

The chart below shows Amazon in the top spot among the data storage products preferred by participants, followed by Google Cloud in 2nd position.

An interesting observation here: zero Indian participants voted in the question about using data storage in the next 2 years.

In [None]:
data70=data2[["Title","Data Stor-MS Azure Data Lake","Data Stor-MS Azure Disk Storage","Data Stor-Amazon S3","Data Stor-Amazon EFS","Data Stor-GCS","Data Stor-GCF","Data Stor-No","Data Stor-Other"]]
data70melted=data70.melt(id_vars=['Title'],var_name='DATA STORAGE PRODUCTS USED REGULARLY',value_name='Indians')
data70melted1=data70melted.dropna(subset=['Indians'])
data70melted2=data70melted1.reset_index(drop=True)
sns.countplot(data=data70melted2, x='DATA STORAGE PRODUCTS USED REGULARLY',order=data70melted2['DATA STORAGE PRODUCTS USED REGULARLY'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5

The chart below shows Cloud platforms preferred by various participants.

Amazon takes the top spot. 

Again, among Data Scientists, the chart shows that an equal number prefer Google Cloud (2nd position) & MS Azure (4th position).

Let's examine the next chart to see if the preferences of the participants change in the next 2 years.

In [None]:
data64=data2[["Title","Cloud Prod-Amazon EC2","Cloud Prod-MS Azure VM","Cloud Prod-Google GCE","Cloud Prod-None","Cloud Prod-Other"]]
data64melted=data64.melt(id_vars=['Title'],var_name='CLOUD PLATFORM USED ON REGULAR BASIS',value_name='Indians')
data64melted1=data64melted.dropna(subset=['Indians'])
data64melted2=data64melted1.reset_index(drop=True)
sns.countplot(data=data64melted2, x='CLOUD PLATFORM USED ON REGULAR BASIS',order=data64melted2['CLOUD PLATFORM USED ON REGULAR BASIS'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5

We can see that the top 3 Cloud platforms remain the same as in the previous chart, with some new entrants at the lower end.

In [None]:
data65=data2[["Title","N2Y-AWS","N2Y-MS Azure","N2Y-GCP","N2Y-Red Hat","N2Y-Oracle Cloud","N2Y-SAP Cloud","N2Y-VMware Cloud","N2Y-Salesforce Cloud","N2Y-Alibaba Cloud","N2Y-Tencent Cloud","N2Y-None","N2Y-OTHER"]]
data65melted=data65.melt(id_vars=['Title'],var_name='CLOUD PLATFORM TO BECOME FAMILIAR WITH IN NEXT 2 YEARS',value_name='Indians')
data65melted1=data65melted.dropna(subset=['Indians'])
data65melted2=data65melted1.reset_index(drop=True)
sns.countplot(data=data65melted2, x='CLOUD PLATFORM TO BECOME FAMILIAR WITH IN NEXT 2 YEARS',order=data65melted2['CLOUD PLATFORM TO BECOME FAMILIAR WITH IN NEXT 2 YEARS'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5
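
One way to check a statement like "the top 3 remain the same" without reading bar heights is to compare the two rankings directly. The following is a minimal sketch, assuming the counts per option have already been obtained (for example with `value_counts()` on the melted column). The numbers and the helpers `top_n` and `top_n_unchanged` are made up for illustration and are not survey results.

```python
import pandas as pd

# Hypothetical vote counts, NOT taken from the survey.
current = pd.Series({"Amazon": 50, "Google": 40, "Azure": 30, "None": 10})
next_2y = pd.Series({"Amazon": 70, "Google": 65, "Azure": 60, "Oracle": 20, "None": 5})

def top_n(counts, n=3):
    """Names of the n most-voted options, highest first."""
    return counts.sort_values(ascending=False).head(n).index.tolist()

def top_n_unchanged(before, after, n=3):
    """True when both rankings share the same top-n options in the same order."""
    return top_n(before, n) == top_n(after, n)

print(top_n(current), top_n(next_2y))
print(top_n_unchanged(current, next_2y))
```

On the notebook's data the option labels differ between the two questions (for example `Cloud Prod-Amazon EC2` versus `N2Y-AWS`), so they would first need to be mapped to common names before such a comparison.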

In the next chart we will see which Big Data tools are most popular among the participants.

The chart indicates that MySQL receives a high number of votes from all titles.

MongoDB takes the 3rd place!

An interesting observation is the low number of votes from students for this question in the survey.

Let's see in the next graph which Big Data tools the participants voted for in the next 2 years.

In [None]:
data61=data2[["Title","BD-MySQL","BD-PostgreSQL","BD-SQLite","BD-Oracle DB","BD-MongoDB","BD-Snowflake","BD-IBM DB2","BD-MS SQL Server","BD-MS Azure SQL DB","BD-MS Azure Cosmos DB","BD-Amazon Redshift","BD-Amazon Aurora","BD-Amazon RDS","BD-Amazon DynamoDB","BD-GC BigQuery","BD-GC SQL","BD-GC Firestore","BD-GC BigTable","BD-GC Spanner","BD-None","BD-Other"]]
data61melted=data61.melt(id_vars=['Title'],var_name='BIG DATA',value_name='Indians')
data61melted1=data61melted.dropna(subset=['Indians'])
data61melted2=data61melted1.reset_index(drop=True)
sns.countplot(data=data61melted2, x='BIG DATA',order=data61melted2['BIG DATA'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5

We can see in the chart below that students have voted in large numbers in every category.

MySQL remains the favourite at top position.

In [None]:
data62=data2[["Title","N2Y BD Prod-MySQL","N2Y BD Prod-PostgreSQL","N2Y BD Prod-SQLite","N2Y BD Prod-Oracle DB","N2Y BD Prod-Mongo DB","N2Y BD Prod-Snowflake","N2Y BD Prod-IBM DB2","N2Y BD Prod-MS SQL Server","N2Y BD Prod-MS Azure SQL DB","N2Y BD Prod-MS Azure Cosmos DB","N2Y BD Prod-Amazon Redshift","N2Y BD Prod-Amazon Aurora","N2Y BD Prod-Amazon DynamoDB","N2Y BD Prod-Amazon RDS","N2Y BD Prod-GC BigQuery","N2Y BD Prod-GC SQL","N2Y BD Prod-GC Firestore","N2Y BD Prod-GC BigTable","N2Y BD Prod-GC Spanner","N2Y BD Prod-None","N2Y BD Prod-Other"]]
data62melted=data62.melt(id_vars=['Title'],var_name='LEARNING BIG DATA TOOLS IN NEXT 2 YEARS',value_name='Indians')
data62melted1=data62melted.dropna(subset=['Indians'])
data62melted2=data62melted1.reset_index(drop=True)
sns.countplot(data=data62melted2, x='LEARNING BIG DATA TOOLS IN NEXT 2 YEARS',order=data62melted2['LEARNING BIG DATA TOOLS IN NEXT 2 YEARS'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,6

In the chart below we can see that Tableau is the most popular BI tool for dashboarding & business analysis.

A surprisingly large number of participants report not using any BI tools.

Power BI appears in 3rd position, with the others following in small numbers.

Let's examine in the next chart whether there are any changes for the coming 2 years!

In [None]:
data59=data2[["Title","BI-Tools-Amazon QuickSight","BI-Tools-MS Power BI","BI-Tools-G Data Studio","BI-Tools-Looker","BI-Tools-Tableau","BI-Tools-Salesforce","BI-Tools-Tableau CRM","BI-Tools-Qlik","BI-Tools-Domo","BI-Tools-TIBCO Spotfire","BI-Tools-Alteryx","BI-Tools-Sisense","BI-Tools-SAP Analytics Cloud","BI-Tools-MS Azure Synapse","BI-Tools-Thoughtspot","BI-Tools-None","BI-Tools-Other"]]
data59melted=data59.melt(id_vars=['Title'],var_name='POPULAR BI TOOLS',value_name='Indians')
data59melted1=data59melted.dropna(subset=['Indians'])
data59melted2=data59melted1.reset_index(drop=True)
sns.countplot(data=data59melted2, x='POPULAR BI TOOLS',order=data59melted2['POPULAR BI TOOLS'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5

We can see that a high number of students have voted for every BI tool.

Tableau keeps its lead over Power BI, & Google Data Studio has the interest of many students!

In [None]:
data60=data2[["Title","N2Y BI Tools-MS Power BI","N2Y BI Tools-Amazon QuickSight","N2Y BI Tools-Google Data Studio","N2Y BI Tools-Looker","N2Y BI Tools-Tableau","N2Y BI Tools-Salesforce","N2Y BI Tools-Tableau CRM","N2Y BI Tools-Qlik","N2Y BI Tools-Domo","N2Y BI Tools-TIBCO Spotfire","N2Y BI Tools-Alteryx","N2Y BI Tools-Sisense","N2Y BI Tools-SAP Analytics Cloud","N2Y BI Tools-MS Azure Synapse","N2Y BI Tools-Thoughtspot","N2Y BI Tools-None","N2Y BI Tools-Other"]]
data60melted=data60.melt(id_vars=['Title'],var_name='POPULAR BI TOOLS IN NEXT 2 YEARS',value_name='Indians')
data60melted1=data60melted.dropna(subset=['Indians'])
data60melted2=data60melted1.reset_index(drop=True)
sns.countplot(data=data60melted2, x='POPULAR BI TOOLS IN NEXT 2 YEARS',order=data60melted2['POPULAR BI TOOLS IN NEXT 2 YEARS'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,5

<h1><center><font size="5">Earning & Spending of Indians in Data Science!</font></center></h1>

Let's plot a chart of yearly compensation for various titles.

We can see that the largest number of participants are in the lowest $0-999 bracket, with far fewer in the higher brackets.

Given such low earnings, let's analyse the spending of Indians in Data Science in the charts further below.

In [None]:
sns.countplot(data=data2, x='Yearly Compensation',order = data2['Yearly Compensation'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
rcParams['figure.figsize'] = 30,5
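
Because `order` is taken from `value_counts()`, the x-axis above is sorted by frequency rather than by salary. If reading the brackets from lowest to highest is preferred, they can be sorted by their lower bound. The following is a minimal sketch, assuming bracket labels look like `$0-999`. The other labels and the helper `bracket_lower_bound` are hypothetical examples of that format, not values read from the dataset.

```python
import re

def bracket_lower_bound(label):
    """Return the first number in a bracket label, ignoring '$', '>' and commas."""
    match = re.search(r"\d+", label.replace(",", ""))
    if match is None:
        raise ValueError(f"no number found in {label!r}")
    return int(match.group())

# Hypothetical labels in the assumed format, deliberately out of order.
labels = ["10,000-14,999", "$0-999", ">$1,000,000", "1,000-1,999"]
ordered = sorted(labels, key=bracket_lower_bound)
print(ordered)
```

A list built this way from the labels actually present in the column could be passed as `order=` to `sns.countplot` in place of the `value_counts()` index.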

In the chart below, the majority of participants with 1-3 years of coding experience have a yearly compensation in the $0-999 bracket.

In [None]:
sns.countplot(data=data2, x='Yearly Compensation',order = data2['Yearly Compensation'].value_counts().index, hue='Coding_Exp')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
rcParams['figure.figsize'] = 30,5

In the chart below we can see that most participants with a yearly compensation of $0-999 work in companies with 0-49 employees.

In [None]:
sns.countplot(data=data2, x='Yearly Compensation',order = data2['Yearly Compensation'].value_counts().index, hue='Size of Company')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
rcParams['figure.figsize'] = 30,5

In the chart below we can see that the majority of participants do not use specialised hardware.

Even if we exclude students, a significant number of participants still do not use any specialised hardware.

In [None]:
data30=data2[["Title","Spl_hardware-nVidia GPUs","Spl_hardware-GC TPUs","Spl_hardware-AWS Trainium Chips","Spl_hardware-AWS Inferentia Chips","Spl_hardware-None","Spl_hardware-Other"]]
data30melted=data30.melt(id_vars=['Title'],var_name='SPECIAL HARDWARE',value_name='Indians')
data30melted1=data30melted.dropna(subset=['Indians'])
data30melted2=data30melted1.reset_index(drop=True)
sns.countplot(data=data30melted2, x='SPECIAL HARDWARE',order=data30melted2['SPECIAL HARDWARE'].value_counts().index, hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='xx-large')
plt.legend(loc='upper right')
sns.set_palette("bright")
rcParams['figure.figsize'] = 30,10
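
The check about excluding students can be made by filtering on `Title` before counting. The following is a minimal sketch on a toy frame, assuming students are labelled `Student` in the `Title` column and that the "none" option is stored under `Spl_hardware-None` as in the cell above. The rows are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented rows standing in for data2; a non-null value means the option was ticked.
toy = pd.DataFrame({
    "Title": ["Student", "Student", "Data Scientist", "Software Engineer"],
    "Spl_hardware-nVidia GPUs": [np.nan, np.nan, "NVIDIA GPUs", np.nan],
    "Spl_hardware-None": ["None", "None", np.nan, "None"],
})

# Drop students, then count the ticked options among the remaining respondents.
non_students = toy[toy["Title"] != "Student"]
ticked = non_students.drop(columns="Title").notna().sum()
share_none = ticked["Spl_hardware-None"] / len(non_students)
print(ticked)
print(share_none)
```

Here `share_none` is the fraction of non-student respondents in the toy frame who ticked the "none" option.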

The next chart shows the preferred computing platform for various professionals/students. 

No surprises here: the laptop is the favourite due to its mobility & convenience of use, while the desktop lags by a wide margin.

Cloud computing platforms are gaining popularity & are currently at 3rd position.

A small number of professionals/students use a deep learning workstation.

High-end computing has never been used by the majority of the participants.

In [None]:
# Set the figure size before plotting so it applies to this chart.
rcParams['figure.figsize'] = 20,5
sns.countplot(data=data2, x='Computing_Platform', order = data2['Computing_Platform'].value_counts().index,hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')

In the chart below we can see that the largest number of participants do not spend any money on machine learning or cloud.

The higher spending brackets have even fewer participants, although in each of them more Data Scientists are spending than any other title.

In [None]:
sns.countplot(data=data2, x='Money Spent on ML/Cloud', order = data2['Money Spent on ML/Cloud'].value_counts().index,hue='Title')
plt.xticks(
rotation=45, 
horizontalalignment='right',
fontweight='light',
fontsize='x-large')
plt.legend(loc='upper right')
rcParams['figure.figsize'] = 20,5
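
Raw counts in a chart like the one above are dominated by whichever title has the most respondents. A complementary view is the share of each title that falls into each spending bracket. The following is a minimal sketch using `pd.crosstab` with `normalize="index"`. The rows and the bracket labels are invented for illustration and are not the survey's wording.

```python
import pandas as pd

# Invented rows; the bracket labels are placeholders, not the survey's wording.
toy = pd.DataFrame({
    "Title": ["Student", "Student", "Student", "Student",
              "Data Scientist", "Data Scientist"],
    "Money Spent on ML/Cloud": ["$0", "$0", "$0", "$1-$99", "$0", "$1-$99"],
})

# Each row sums to 1, so titles with different numbers of respondents become comparable.
shares = pd.crosstab(toy["Title"], toy["Money Spent on ML/Cloud"], normalize="index")
print(shares)
```

Reading across a row shows how one title distributes over the brackets, independent of how many respondents hold that title.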

<h1><center><font size="5">INSIGHTS FROM THE DATA SCIENCE STORY OF INDIA</font></center></h1>

01. India has a large, young & trained work force for the field of Data Science.
02. Even students nearing completion of their studies are practising Data Science, as is evident from the responses received in the survey.
03. Persons who are not trained in certain aspects of Data Science plan to learn them in the next 2 years.
04. The largest number of participants hold a Bachelor's degree.
05. The largest number of participants have 1 to 3 years of coding experience.
06. Python is the most popular coding language in India.
07. Most participants have never used a Tensor Processing Unit.
08. Participants with 1 to 3 years of coding experience form the largest group using ML methods.
09. The largest number of participants are software engineers.
10. Large companies with 10,000 or more employees have the largest Data Science teams, of 20+ Data Scientists.
11. Large companies with 10,000 or more employees employ the largest number of Statisticians.
12. Large companies with 10,000 or more employees have had ML models in production for more than 2 years.
13. Amazon is rated the best cloud for developer experience.
14. MySQL is the most popular big data tool.
15. Tableau is the most popular Business Intelligence tool for creating dashboards.
16. Excel is the most popular primary tool of the participants.
17. Jupyter Notebook is the most popular IDE for all participants.
18. Google Colab is the most popular hosted notebook service for most participants.
19. Matplotlib is the most popular visualisation library.
20. Coursera is the most popular online education platform among participants.
21. Kaggle is the platform participants voted for most as a media source for Data Science.
22. Scikit-learn is the most popular of all ML frameworks.
23. Linear Regression is the most popular ML algorithm among participants.
24. Although Amazon SageMaker is currently the most popular of all managed ML products, most participants have voted to learn MS Azure ML Studio in the next 2 years.
25. TensorBoard is the most popular tool for managing ML experiments and remains at the top position for learning in the next 2 years.
26. GitHub is the most popular platform for sharing ML apps.
27. Currently, Auto ML is used for model selection by most participants, & in the next 2 years participants want to use Auto ML for full ML pipelines.
28. Google Cloud remains the favourite of the largest number of participants, both currently and for the next 2 years.
29. The largest number of Data Scientists have the role of analysing & understanding data in their respective companies.
30. Image Classification is the most popular Computer Vision field for most participants.
31. Word Embeddings is the most popular NLP field for most participants.
32. Amazon remains the favourite of most participants, both currently and for the next 2 years.
33. The largest number of Indian participants earn a yearly compensation in the lowest bracket of $0-999.
34. Most participants have never used special hardware for Data Science activities.
35. Laptops are the most popular of all computing platforms amongst Indian participants.
36. The largest number of participants do not spend any money on Machine Learning or Cloud Computing.

<h1><center>Thank you for reading till the end!</center></h1>



<h1><center>See you next year in the 2022 Kaggle Machine Learning & Data Science Survey</center></h1>