# Introduction

The Kaggle Survey 2020 collected 20,036 responses from Kaggle users across every age group (the CSV holds 20,037 rows because its first row stores the question text). Using this data, I take a deep dive to gain further insights.

The approach is to build an in-depth picture of Kagglers: their skills, their employers, their knowledge across various domains, and the skills they plan to learn or improve.

# Table Of Contents
---
1. [Data](#section-one)
2. [Age Groups interested in Data Science](#section-2)
3. [Countries with the highest resident Kagglers](#section-3)
4. [Gender Count of the Kagglers](#section-4)
5. [Profession-wise Kaggle User Count](#section-5)
6. [Coding Proficiency](#section-6)
    * [Programming Languages](#section-7)
    * [Recommended first language to learn for an aspiring data scientist](#section-8)
    * [Regularly used IDEs](#section-9)
	* [Hosted notebook products used on a regular basis](#section-10)
    
7. [Computing Platform](#section-11)
	* [Specialized hardware](#section-12)
8. [Data Visualization](#section-13)
	* [Visualization Libraries](#section-13)
9. [Machine Learning](#section-14)
	* [Years of Experience using ML methods](#section-14)
	* [Machine Learning Frameworks and Algorithms](#section-15)
	* [Machine Learning Products](#section-16)
	* [Computer Vision Methods Used Regularly](#section-17)
	* [Natural Language Processing Methods](#section-18)    
10. [Company-wise Data](#section-19)
    * [Total Employees vs Number of Employees that are responsible for the Data Science Workload](#section-19)
	* [Companies where Kaggle Users are employed and how they incorporate ML](#section-20)
	* [Kagglers and their role designated by the current employer](#section-21)
	* [Yearly Compensation in USD](#section-22)
	* [Money spent on Machine Learning and cloud computing services in the past 5 years](#section-23)   
11. [Cloud Computing Platforms](#section-24)
12. [Big Data Products used Regularly](#section-25)
13. [Business Intelligence Tools](#section-26)
14. [Automated ML Tools](#section-27)
	* [Tools To Manage Machine Learning Experiments](#section-28)
15. [Deployment Platforms](#section-29)
16. [Platforms to Learn Data Science](#section-30)
	* [Primary Tool Used to analyse data](#section-31)
	* [Media Source to report Data Science Topics](#section-32)
17. [ML, Cloud Computing Platforms, etc. Kagglers hope to get familiar with](#section-33)

---
---

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np # linear algebra
import pandas as pd # data processing
import altair as alt
import gc

<a id="section-one"></a>
## Survey Data

In [None]:
data=pd.read_csv(r'/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', low_memory=False)
print(data.shape)
data.head(5)

<a id="section-2"></a>
### Age Groups that are most interested in Data Science

The three age groups most interested in data science are:
* 25-29
* 22-24
* 18-21
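
The ranking above can be reproduced with a short sketch. It assumes, as the cells in this notebook do, that the first data row of the CSV holds the question text rather than a response. The toy frame `raw` and the names `questions` and `responses` are illustrative stand-ins, not part of the survey file.

```python
import pandas as pd

# Toy stand-in for the survey CSV: row 0 is the question text, not an answer.
raw = pd.DataFrame({
    "Q1": ["What is your age (# years)?",
           "25-29", "22-24", "25-29", "18-21", "25-29", "22-24"],
})

questions = raw.iloc[0]                           # question wording per column
responses = raw.iloc[1:].reset_index(drop=True)   # actual answers only

# value_counts sorts by frequency, so head(3) gives the three largest groups.
age_counts = responses["Q1"].value_counts()
top_3 = age_counts.head(3).index.tolist()
print(top_3)
```

Splitting the frame once like this would avoid repeating the `[1:]` slice in every cell.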

In [None]:
# Count responses per age group; row 0 holds the question text, so skip it.
count_label="Number of people participating in the survey"
age_counts=data.Q1[1:].value_counts().sort_index()
plt.figure(figsize=(25,8))
# Highlight the largest age group in dark blue.
clrs = ['#6EB5FF' if (x < age_counts.max()) else '#00008B' for x in age_counts]
sns.barplot(x=age_counts.index,y=age_counts.values, palette=clrs)
plt.ylabel(count_label)
plt.xlabel('Age Groups')
plt.title('Age Group Distribution')
plt.show()

<a id="section-3"></a>
## Top 10 Countries with the Most Residents Interested in Kaggle

In [None]:
# Keep the ten countries with the most respondents (row 0 is the question text).
count_label="Number of people participating in the survey"
country_counts=data.Q3[1:].value_counts().head(10)
plt.figure(figsize=(30,8))
# Highlight the country with the most respondents.
clrs = ['#00B0B1' if (x < country_counts.max()) else '#ADFF2F' for x in country_counts]
sns.barplot(x=country_counts.index,y=country_counts.values, palette=clrs)
plt.xlabel('Country')
plt.ylabel(count_label)
plt.title('Country-wise Kaggle Users')
plt.show()

## *Analysis:*

- India leads the count of Kaggle users by a wide margin.
- The USA has the second-highest number of users.
- "Other" covers all the countries not listed in the survey.
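
To put a number on the margin, the counts can be converted to percentage shares. The sketch below uses a small made-up series (`toy_countries`, not survey data) and the `normalize=True` option of `value_counts`.

```python
import pandas as pd

# Made-up country answers, for illustration only.
toy_countries = pd.Series([
    "India", "India", "India", "India",
    "United States of America", "United States of America",
    "Other", "Brazil",
])

# normalize=True returns fractions of the total instead of raw counts.
shares = toy_countries.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

Applied to `data.Q3[1:]`, the same call would give each country's share of all respondents.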

<a id="section-4"></a>
## Gender Count of Kagglers Who Filled in the Survey

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(x=data.Q2[1:].value_counts().index,y=data.Q2[1:].value_counts() , color='#6EB5FF')

plt.show()

Male respondents outnumber female respondents.

<a id="section-5"></a>
## Profession-wise Kaggle User count

In [None]:
data_copy=data[1:]
grouped_data=data_copy.groupby(['Q2','Q5'])['Q2'].count().reset_index(name='Count')
clustered_column = sns.catplot(
    data=grouped_data, kind="bar",
    x="Q5", y="Count", hue="Q2",
    ci="sd", palette="dark", alpha=.6, height=8.27, aspect=30/11
)
clustered_column.despine(left=True)
clustered_column.set_axis_labels("", "Number Of Users")
plt.show()

## *Analysis:*

Students are the largest group of Kaggle users, followed by Data Scientists and then Software Engineers.


<a id="section-6"></a>
## Coding Proficiency

In [None]:
data_copy=data[1:]
grouped_data=data_copy.groupby(['Q6','Q5'])['Q6'].count().reset_index(name='Count')
clustered_column = sns.catplot(
    data=grouped_data, kind="bar",
    x="Q5", y="Count", hue="Q6",
    ci="sd", palette="Spectral", height=8.27, aspect=30/11
)
clustered_column.despine(left=True)
clustered_column.set_axis_labels("", "Number Of Users")
plt.show()

Most of the Kaggle users with the longest coding experience (10-20 years and 20+ years) work as Data Scientists, Software Engineers or Research Scientists.

Users with 5-10 years of experience are mostly Data Scientists, Software Engineers and Students.

A vast majority of Students who are Kaggle users have 1-2 years of coding experience.




<a id="section-7"></a>
## Programming Languages
The programming languages used most regularly are Python, R and SQL.
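
Multiple-choice questions such as Q7 are stored as one column per answer option, with the option label in the rows where it was selected and a missing value elsewhere. That layout is what the counting loops in this notebook rely on. A minimal sketch of that counting as a reusable helper, on toy data, might look like this (the function name `count_selected_options` and the column names in `toy` are assumptions for illustration):

```python
import pandas as pd

def count_selected_options(responses, columns):
    """Count how many respondents selected each option of a multi-select question."""
    counts = {}
    for col in columns:
        selected = responses[col].dropna()
        if selected.empty:
            continue  # nobody picked this option
        label = selected.iloc[0].strip()   # every non-null cell holds the same label
        counts[label] = len(selected)
    return pd.Series(counts, dtype="int64").sort_values(ascending=False)

# Toy answers: None marks "not selected".
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", None, "Python"],
    "Q7_Part_2": ["R", None, None, None],
    "Q7_Part_3": [None, "SQL", "SQL", None],
})

language_counts = count_selected_options(toy, ["Q7_Part_1", "Q7_Part_2", "Q7_Part_3"])
print(language_counts)
```

Returning a labelled Series keeps each count attached to its option name, so the bars and their labels cannot drift apart.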


In [None]:
list_Q7=data.iloc[:,7:20].columns
list_1=[]
list_value=pd.DataFrame()
for i in list_Q7:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])
plt.figure(figsize=(10,5))
clrs = ['#00DBFF' if (x < max(list_1)) else '#708090' for x in list_1]
t=sns.barplot(x=list_value.index,y=list_1,palette=clrs )

plt.show()

<a id="section-8"></a>
### Recommended first language to learn for an aspiring data scientist

A large majority of users recommend Python as the first programming language for an aspiring data scientist to learn.

In [None]:
plt.figure(figsize=(10,6))
labels = data.Q8[1:].value_counts().index
pie=plt.pie(x=data.Q8[1:].value_counts(),pctdistance=0.50)
plt.title(data.Q8[0])
plt.legend(pie[0],labels, bbox_to_anchor=(1,0.5), loc="center right", fontsize=10, 
           bbox_transform=plt.gcf().transFigure)
plt.subplots_adjust(left=0.0, bottom=0.1, right=0.45)
plt.show()


<a id="section-9"></a>
### Regularly used IDEs

In [None]:
list_Q9=data.iloc[:,21:33].columns
list_1=[]
list_value=pd.DataFrame()
for i in list_Q9:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])
plt.figure(figsize=(29,5))
clrs = ['#AFEEEE' if (x < max(list_1)) else '#708090' for x in list_1]

t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)

plt.show()

The most used IDEs are Jupyter Notebook, followed by VSCode and then PyCharm.


<a id="section-10"></a>
### Hosted notebook products used on a regular basis

In [None]:
list_Q10=data.iloc[:,33:46].columns
list_1=[]
list_value=pd.DataFrame()
for i in list_Q10:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])
plt.figure(figsize=(35,5))
clrs = ['#AFEEEE' if (x < max(list_1)) else '#708090' for x in list_1]

t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)

plt.show()

Kagglers use Colab Notebooks the most, followed by Kaggle Notebooks.
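
The bar charts in this notebook colour the tallest bar differently from the rest. A small helper along these lines could replace the repeated list comprehension; `highlight_max` and its default colours are illustrative choices rather than part of the original code.

```python
def highlight_max(values, base="#AFEEEE", accent="#708090"):
    """Return one colour per value, using `accent` for the largest value(s)."""
    values = list(values)
    peak = max(values)
    return [accent if v == peak else base for v in values]

colors = highlight_max([120, 450, 300])
print(colors)
```

The resulting list can be passed as `palette=` to `sns.barplot`, as the cells here already do with `clrs`.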

<a id="section-11"></a>
## Computing Platform

Many survey responses suggest that Kaggle users prefer a personal computer or laptop to a hosted computing platform.

This strong preference may be because the largest group of Kaggle users are **students**, who are likely to prefer a personal computer over a paid hosted computing platform.
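
One way to check this explanation is to compare platform preference across professions. The sketch below row-normalises a cross-tabulation of profession (Q5) against computing platform (Q11). The rows of `toy` and the shortened answer labels are invented for illustration and are not survey results.

```python
import pandas as pd

# Invented answers; the real survey labels are longer.
toy = pd.DataFrame({
    "Q5": ["Student", "Student", "Student", "Data Scientist", "Data Scientist"],
    "Q11": ["Personal computer", "Personal computer", "Cloud platform",
            "Personal computer", "Cloud platform"],
})

# normalize="index" turns each profession's row into shares that sum to 1.
platform_share = pd.crosstab(toy["Q5"], toy["Q11"], normalize="index")
print(platform_share)
```

If students really drive the preference, their row should show a visibly higher personal-computer share than the other professions.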

In [None]:
plt.figure(figsize=(35,8))
plt.title(data.Q11[0])
list_1=data.Q11[1:].value_counts()
clrs = ['#98FB98' if (x < max(list_1)) else '#556B2F' for x in list_1]
sns.barplot(x=data.Q11[1:].value_counts().index,y=data.Q11[1:].value_counts(), palette=clrs) #,color='#93C47D' )

plt.show()

<a id="section-12"></a>
### GPU: the preferred specialized hardware

In [None]:
list_Q12=data.iloc[:,48:52].columns
list_1=[]
list_value=pd.DataFrame()
for i in list_Q12:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([
        list_value,test])

plt.figure(figsize=(22,5))
clrs = ['#FFD700' if (x < max(list_1)) else '#FF4500' for x in list_1]
t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)
plt.title(data.Q12_Part_1[0])
plt.show()
plt.figure(figsize=(22,5))
plt.title(data.Q13[0])
clrs = ['#FFD700' if (x < max(data.Q13[1:].value_counts())) else '#FF4500' for x in data.Q13[1:].value_counts()]
sns.barplot(x=data.Q13[1:].value_counts().index,y=data.Q13[1:].value_counts(), palette=clrs )

plt.show()

<a id="section-13"></a>
# Data Visualization
## Visualization Libraries

In [None]:
list_Q14=data.iloc[:,53:65].columns
list_1=[]
list_value=pd.DataFrame()
for i in list_Q14:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])

plt.figure(figsize=(22,5))
clrs = ['#00DBFF' if (x < max(list_1)) else '#708090' for x in list_1]
t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)
plt.title(data.Q14_Part_1[0])
plt.show()

Matplotlib and Seaborn are the most frequently used data visualization libraries.

<a id="section-14"></a>
# Machine Learning
### Years of Experience using ML methods

In [None]:
data_copy=data[1:]
sample_data= data_copy.groupby(['Q5','Q15'])['Q15'].count().reset_index(name='Total')
years= sample_data.Q15.unique()
color=['red', 'blue',  'green', 'orange','magenta']
plt.rc('xtick', labelsize=14)    # fontsize of the tick labels
plt.rc('ytick', labelsize=14)
for i,j in zip(years, color):
    plt.figure(figsize=(38,10))
    new=sample_data[sample_data.Q15==i]
    t=plt.plot(new.Q5, new.Total.tolist(), color=j, marker='o')
    plt.xlabel('Profession', fontsize=18)
    plt.ylabel('Total Number of people', fontsize=18)
    #plt.barplot(new.Q5, new.Total, marker='' , linewidth=1, alpha=0.9, label= new.Q5)
    plt.title('Machine Learning methods used for '+i, fontsize=22)
    plt.show()

Students make up most of the Kaggle users who have used machine learning methods for 1-2 years. Data Scientists follow, with many of them having used ML methods for 2-3 and 3-4 years.

Kaggle users who have used machine learning methods for 10-20 years or for 20 years or more are mostly Data Scientists and Research Scientists.

<a id="section-15"></a>
###  Machine Learning Frameworks and Algorithms

In [None]:
gc.collect()
list_Q16=data.iloc[:,66:82].columns
list_1=[]
list_value=pd.DataFrame()
for i in list_Q16:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])

plt.figure(figsize=(22,5))
plt.title('Machine Learning Frameworks used Regularly', fontsize=14)
plt.xlabel('ML Frameworks/Libraries', fontsize=11)
plt.ylabel('Total Number of People who use ML Frameworks Regularly', fontsize=11)
plt.rc('xtick', labelsize=12)    # fontsize of the tick labels
plt.rc('ytick', labelsize=12)
clrs = ['#6D9EEB' if (x < max(list_1)) else '#708090' for x in list_1]
t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)
plt.show()

list_Q17=data.iloc[:,82:94].columns
list_1=[]
list_value=pd.DataFrame()
for i in list_Q17:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])

plt.figure(figsize=(50,10))
plt.xlabel('ML Algorithms', fontsize=22)
plt.ylabel('Total Number of People who use it Regularly', fontsize=22)
plt.rc('xtick', labelsize=12)    # fontsize of the tick labels
plt.rc('ytick', labelsize=12)

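# Recompute the highlight colours from this chart's own counts.
clrs = ['#6D9EEB' if (x < max(list_1)) else '#708090' for x in list_1]
t=sns.barplot(x=list_value.index,y=list_1,  palette=clrs)
plt.title('Machine Learning Algorithms used Regularly', fontsize=24)
plt.show()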

Kagglers regularly use Scikit-learn as their machine learning framework.
The most commonly used ML algorithms are Linear and Logistic Regression, which are among the most basic ML algorithms.

<a id="section-16"></a>
### Machine Learning Products

In [None]:
list_Q28=data.iloc[:,144:155].columns
list_value=pd.DataFrame()
list_1=[]
for i in list_Q28:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])
clrs = ['#6D9EEB' if (x < max(list_1)) else '#708090' for x in list_1]
plt.figure(figsize=(35,5))
plt.title('Machine Learning Products Regularly Used', fontsize=14)
plt.xlabel('Machine Learning Products', fontsize=11)
plt.ylabel('Total Number of People who use Machine Learning Products', fontsize=11)
t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)
plt.show()

Most Kaggle Users do not use any Machine Learning Products Regularly.

<a id="section-17"></a>
###  Computer Vision Methods Used Regularly

In [None]:
list_Q18=data.iloc[:,94:101].columns
list_1=[]
list_value=pd.DataFrame()
for i in list_Q18:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])


labels=list_value.index
plt.figure(figsize=(40,30))
pie=plt.pie(list_1,pctdistance=0.50)
plt.title('Categories of computer vision methods used on a regular basis', fontsize=20)
plt.legend(pie[0],labels, bbox_to_anchor=(1,0.5), loc="center right", fontsize=18, 
           bbox_transform=plt.gcf().transFigure)
plt.subplots_adjust(left=0.0, bottom=0.40, right=0.90)

The most common computer vision methods among Kagglers are image classification networks such as VGG, Inception, ResNet, ResNeXt, NASNet and EfficientNet.

<a id="section-18"></a>
###  Natural Language Processing Methods

In [None]:
list_Q19=data.iloc[:,101:107].columns
list_1=[]
list_value=pd.DataFrame()
for i in list_Q19:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])


labels=list_value.index
plt.figure(figsize=(40,30))
pie=plt.pie(list_1,pctdistance=0.50)
plt.title('NLP Methods used Regularly', fontsize=20)
plt.legend(pie[0],labels, bbox_to_anchor=(1,0.5), loc="center right", fontsize=18, 
           bbox_transform=plt.gcf().transFigure)
plt.subplots_adjust(left=0.0, bottom=0.40, right=0.90)

The most common NLP methods used by Kagglers are word embeddings/vectors (GloVe, fastText, word2vec).

<a id="section-19"></a>
# Company-wise Data

### Total Employees vs Number of Employees that are responsible for the Data Science Workload

In [None]:
data_copy=data[1:]
grouped_data=data_copy.groupby(['Q20','Q21'])['Q21'].count().reset_index(name='Count')
clustered_column = sns.catplot(
    data=grouped_data, kind="bar",
    x="Q21", y="Count", hue="Q20",
    ci="sd", palette="Spectral", height=8.27, aspect=30/11
)
clustered_column.despine(left=True)
clustered_column.set_axis_labels("Number of individuals responsible for data science workloads at the company", "Number of respondents")
plt.show()

In small companies with fewer than 50 employees, typically only 1-2 people handle the data science workloads.

In companies with 10,000 or more employees, 20 or more individuals typically handle them.

<a id="section-20"></a>
## Companies where Kaggle Users are employed and how they incorporate ML

In [None]:
plt.figure(figsize=(42,10))
plt.title(data.Q22[0], fontsize=18)
plt.xlabel('How the employer incorporates ML', fontsize=11)
plt.ylabel('Total Number of People', fontsize=11)
plt.rc('xtick', labelsize=12)    # fontsize of the tick labels
plt.rc('ytick', labelsize=12)
clrs = ['#6D9EEB' if (x < max(data.Q22[1:].value_counts())) else '#708090' for x in data.Q22[1:].value_counts()]
sns.barplot(x=data.Q22[1:].value_counts().index,y=data.Q22[1:].value_counts(), palette=clrs  )
plt.show()

At most of the companies where Kagglers work, the employer is still exploring machine learning methods.
Machine learning models that are in development may one day be pushed to production.

<a id="section-21"></a>
### Kagglers and their role designated by the current employer

In [None]:
list_Q23=data.iloc[:,110:118].columns
list_1=[]
list_value=pd.DataFrame()
for i in list_Q23:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])


labels=list_value.index
plt.figure(figsize=(40,30))
pie=plt.pie(list_1,pctdistance=0.50)
plt.title('Role at the Current Company', fontsize=20)
plt.legend(pie[0],labels, bbox_to_anchor=(1,0.5), loc="center right", fontsize=18, 
           bbox_transform=plt.gcf().transFigure)
plt.subplots_adjust(left=0.0, bottom=0.40, right=0.90)

Many Kagglers work as data scientists, as depicted in [this chart](#section-5).
Therefore, most Kagglers have to analyze and understand data to influence product or business decisions.

<a id="section-22"></a>
### Yearly Compensation in USD

In [None]:
gc.collect()
plt.figure(figsize=(42,10))
plt.title(data.Q24[0], fontsize=18)
plt.xlabel('Yearly Compensation (USD)', fontsize=11)
plt.ylabel('Total Number of People', fontsize=11)
plt.rc('xtick', labelsize=12)    # fontsize of the tick labels
plt.rc('ytick', labelsize=12)
clrs = ['#6D9EEB' if (x < max(data.Q24[1:].value_counts())) else '#708090' for x in data.Q24[1:].value_counts()]
sns.barplot(x=data.Q24[1:].value_counts().index,y=data.Q24[1:].value_counts(), palette=clrs  )
plt.show()

data_copy=data[1:]
grouped_data= data_copy.groupby(['Q5','Q24'])['Q24'].count().reset_index(name='Total')
clustered_column = sns.catplot(
    data=grouped_data, kind="bar",
    x="Q24", y="Total", hue="Q5",
    ci="sd", palette="Spectral", height=8.27, aspect=60/20
)

plt.title('Yearly compensation with respect to profession')
clustered_column.despine(left=True)
clustered_column.set_axis_labels("Yearly Compensation", "Total number of people with a particular yearly compensation")
plt.show()

Most Kagglers with a yearly compensation of $0-999 work as Data Scientists or Software Engineers.
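
Neither chart places the salary bands in ascending order: `value_counts` sorts by frequency and `groupby` sorts the labels as text. A sketch of sorting them by their lower bound follows, assuming band labels shaped like `$0-999`, `1,000-1,999` and `> $500,000` (the `lower_bound` helper is hypothetical):

```python
import re

def lower_bound(band):
    """Extract the numeric lower bound from a salary-band label."""
    digits = re.sub(r"[^\d-]", "", band)   # drop "$", ",", ">" and spaces
    return int(digits.split("-")[0])

# Assumed label shapes, deliberately out of order.
bands = ["10,000-14,999", "$0-999", "> $500,000", "1,000-1,999"]
ordered = sorted(bands, key=lower_bound)
print(ordered)
```

The sorted list could then be passed to seaborn's `order=` argument so the x-axis reads from the lowest band to the highest.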

<a id="section-23"></a>
###  Money spent on Machine Learning and cloud computing services in the past 5 years

Most Kagglers have not spent any money on ML and Cloud Computing Services in the past 5 years

In [None]:
plt.figure(figsize=(42,15))
plt.title(data.Q25[0], fontsize=18)
plt.xlabel('Money spent on Machine Learning and cloud computing services past 5 years', fontsize=11)
plt.ylabel('Total Number of People', fontsize=11)
plt.rc('xtick', labelsize=12)    # fontsize of the tick labels
plt.rc('ytick', labelsize=12)
clrs = ['#6D9EEB' if (x < max(data.Q25[1:].value_counts())) else '#708090' for x in data.Q25[1:].value_counts()]
sns.barplot(x=data.Q25[1:].value_counts().index,y=data.Q25[1:].value_counts(), palette=clrs  )
plt.show()

<a id="section-24"></a>
# Cloud Computing Platforms

In [None]:
list_Q26=data.iloc[:,120:132].columns
list_value=pd.DataFrame()
list_1=[]
for i in list_Q26:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])

plt.figure(figsize=(28,5))
plt.title('Cloud Computing Platforms used Regularly', fontsize=14)
plt.xlabel('Cloud Computing Platforms', fontsize=11)
plt.ylabel('Total Number of People who use Cloud Computing Platforms Regularly', fontsize=11)
plt.rc('xtick', labelsize=12)    # fontsize of the tick labels
plt.rc('ytick', labelsize=12)
clrs = ['#6D9EEB' if (x < max(list_1)) else '#708090' for x in list_1]
t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)
plt.show()

list_Q27=data.iloc[:,132:144].columns
list_value=pd.DataFrame()
list_1=[]
for i in list_Q27:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])

plt.figure(figsize=(35,5))
plt.title('Cloud Computing Products', fontsize=14)
plt.xlabel('Cloud Computing Products', fontsize=11)
plt.ylabel('Total Number of People who use Cloud Computing Products', fontsize=11)
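# Recompute the highlight colours from this chart's own counts.
clrs = ['#6D9EEB' if (x < max(list_1)) else '#708090' for x in list_1]
t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)
plt.show()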

Most Kagglers use Amazon Web Services (AWS) and its cloud computing capabilities on a regular basis.
Amazon EC2 is the most popular cloud computing product.

<a id="section-25"></a>
# Big Data Products used Regularly

In [None]:
list_Q29=data.iloc[:,155:173].columns
list_value=pd.DataFrame()
list_1=[]
for i in list_Q29:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])
clrs = ['#6D9EEB' if (x < max(list_1)) else '#708090' for x in list_1]
plt.figure(figsize=(40,10))
plt.title('Big Data Products Regularly Used', fontsize=14)
plt.xlabel('Big Data Products (relational databases, data warehouses, data lakes, or similar)', fontsize=11)
plt.ylabel('Total Number of People who use Big Data Products', fontsize=11)
t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)
plt.show()

MySQL is the big data product used most regularly among Kagglers, followed by PostgreSQL.

<a id="section-26"></a>
# Business Intelligence Tools

In [None]:
list_Q31=data.iloc[:,174:189].columns
list_value=pd.DataFrame()
list_1=[]
for i in list_Q31:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])
clrs = ['#6D9EEB' if (x < max(list_1)) else '#708090' for x in list_1]
plt.figure(figsize=(30,5))
plt.title(data['Q31_A_Part_1'][0][:-61], fontsize=14)
plt.xlabel('Business Intelligence Tools', fontsize=11)
plt.ylabel('Total Number of People who use Business Intelligence Tools ', fontsize=11)
t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)
plt.show()

plt.figure(figsize=(42,10))
plt.title(data.Q32[0][:-16], fontsize=20)
plt.xlabel('Business Intelligence Tools', fontsize=11)
plt.ylabel('Total Number of People', fontsize=11)
plt.rc('xtick', labelsize=14)    # fontsize of the tick labels
plt.rc('ytick', labelsize=14)
clrs = ['#6D9EEB' if (x < max(data.Q32[1:].value_counts())) else '#708090' for x in data.Q32[1:].value_counts()]
sns.barplot(x=data.Q32[1:].value_counts().index,y=data.Q32[1:].value_counts(), palette=clrs  )
plt.show()

Tableau is the most often used tool among the choices in the survey.

<a id="section-27"></a>
# Automated ML Tools

In [None]:
gc.collect()
list_Q34=data.iloc[:,198:210].columns
list_1=[]
list_value=pd.DataFrame()
for i in list_Q34:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])
clrs = ['#6D9EEB' if (x < max(list_1)) else '#708090' for x in list_1]
plt.figure(figsize=(50,10))
plt.title('Automated Machine Learning Tools used Regularly', fontsize=18)
plt.xlabel('Automated Machine Learning Tools', fontsize=11)
plt.ylabel('Total Number of People who use Automated Machine Learning Tools', fontsize=11)
t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)
plt.show()

Many Kagglers do not use any automated machine learning tools. Auto-Sklearn has the second-highest count.

<a id="section-28"></a>
## Tools To Manage Machine Learning Experiments

In [None]:
list_Q35=data.iloc[:,210:221].columns
list_1=[]
list_value=pd.DataFrame()
# Count the selections in each Q35 column.
for i in list_Q35:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])
clrs = ['#6D9EEB' if (x < max(list_1)) else '#708090' for x in list_1]
plt.figure(figsize=(50,10))

plt.title('Tools used to manage machine learning experiments', fontsize=14)
plt.xlabel('Tools to Manage Machine Learning Experiments', fontsize=11)
plt.ylabel('Total Number of People who use these Tools', fontsize=11)
t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)
plt.show()

Many Kagglers do not use any tools to manage their machine learning experiments.
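
Selecting columns by position (`data.iloc[:, 210:221]`) is easy to get wrong when blocks are copied between questions. A hedged alternative is to select by column-name prefix, assuming the multi-part columns follow the `Q31_A_Part_1` naming seen earlier. `columns_for_question` is a hypothetical helper and the toy column list is made up.

```python
import pandas as pd

def columns_for_question(df, prefix):
    """Return the columns belonging to one question, e.g. 'Q35_A' or 'Q3'."""
    return [c for c in df.columns if c == prefix or c.startswith(prefix + "_")]

# Empty toy frame; only the column names matter here.
toy = pd.DataFrame(columns=[
    "Q3", "Q34_A_Part_1", "Q34_A_OTHER",
    "Q35_A_Part_1", "Q35_A_Part_2", "Q35_B_Part_1",
])

q35_cols = columns_for_question(toy, "Q35_A")
print(q35_cols)
```

Requiring the underscore after the prefix keeps `Q3` from also matching `Q34` and `Q35` columns.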

<a id="section-29"></a>
# Deployment Platforms

GitHub is the most commonly used deployment platform.

In [None]:
list_Q36=data.iloc[:,221:231].columns
list_1=[]
list_value=pd.DataFrame()
for i in list_Q36:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])
clrs = ['#6D9EEB' if (x < max(list_1)) else '#708090' for x in list_1]
plt.figure(figsize=(50,10))

plt.title('Publicly Share or deploy ML applications', fontsize=14)
plt.xlabel('Tools to publicly Share or deploy ML applications', fontsize=11)
plt.ylabel('Total Number of People who use those Tools', fontsize=11)
t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)
plt.show()

<a id="section-30"></a>
# Platforms to Learn Data Science

Coursera is the leading platform among Kagglers for learning Data Science

In [None]:
list_Q37=data.iloc[:,231:243].columns
list_1=[]
list_value=pd.DataFrame()
for i in list_Q37:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])
clrs = ['#6D9EEB' if (x < max(list_1)) else '#708090' for x in list_1]
plt.figure(figsize=(50,10))

plt.title('Platforms to begin or completed data science courses', fontsize=14)
plt.xlabel('Various Educational Platforms', fontsize=11)
plt.ylabel('Total Number of People who use those Platforms', fontsize=11)
t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)
plt.show()

<a id="section-31"></a>
## Primary Tool Used to analyse data 

Jupyter Notebooks are commonly used to create scripts such that data can be analysed and insights can be depicted with beautiful and illustrative charts.

In [None]:
plt.figure(figsize=(42,15))
plt.title(data.Q38[0][:-18], fontsize=18)
plt.xlabel('Primary tool used to analyze data', fontsize=11)
plt.ylabel('Total Number of People', fontsize=11)
plt.rc('xtick', labelsize=12)    # fontsize of the tick labels
plt.rc('ytick', labelsize=12)
clrs = ['#6D9EEB' if (x < max(data.Q38[1:].value_counts())) else '#708090' for x in data.Q38[1:].value_counts()]
sns.barplot(x=data.Q38[1:].value_counts().index,y=data.Q38[1:].value_counts(), palette=clrs  )
plt.show()

<a id="section-32"></a>
## Media Source to report Data Science Topics

Kaggle leads: many Kaggle users prefer the Kaggle platform for learning about data science topics.

In [None]:
list_Q39=data.iloc[:,244:256].columns
list_1=[]
list_value=pd.DataFrame()
for i in list_Q39:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])
clrs = ['#6D9EEB' if (x < max(list_1)) else '#708090' for x in list_1]
plt.figure(figsize=(50,10))
plt.title('Various media sources that report on data science topics', fontsize=14)
plt.xlabel('Media sources that report on data science topics', fontsize=11)
plt.ylabel('Total Number of People who read the media sources', fontsize=11)
t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)
plt.show()

<a id="section-33"></a>
# ML, Cloud Computing Platforms, etc. Kagglers hope to get familiar with

In the next two years, many Kaggle users want to learn the AWS cloud computing service along with Google Cloud Compute Engine.
In the ML domain, Kagglers also hope to get familiar with the Google Cloud AI Platform and AI Engine. TensorBoard is the option Kagglers selected most often as a tool to become familiar with.

MySQL is the most selected database, and Tableau is the most selected tool on the business intelligence side.



In [None]:
list_Q26B=data.iloc[:,256:268].columns
list_value=pd.DataFrame()
list_1=[]
for i in list_Q26B:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])
clrs = ['#6D9EEB' if (x < max(list_1)) else '#708090' for x in list_1]
plt.figure(figsize=(35,5))
plt.title('Cloud Computing Platforms Kagglers hope to become familiar with in the next 2 years', fontsize=14)
plt.xlabel('Cloud Computing Platforms', fontsize=11)
plt.ylabel('Total Number of People', fontsize=11)
t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)
plt.show()

list_Q27B=data.iloc[:,268:280].columns
list_value=pd.DataFrame()
list_1=[]
for i in list_Q27B:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])
clrs = ['#6D9EEB' if (x < max(list_1)) else '#708090' for x in list_1]
plt.figure(figsize=(35,5))
plt.title('In the next 2 years, do you hope to become more familiar with any of these specific cloud computing products?', fontsize=14)
plt.xlabel('Cloud Computing Products', fontsize=11)
plt.ylabel('Total Number of People', fontsize=11)
t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)
plt.show()


list_Q28B=data.iloc[:,280:291].columns
list_value=pd.DataFrame()
list_1=[]
for i in list_Q28B:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])
clrs = ['#6D9EEB' if (x < max(list_1)) else '#708090' for x in list_1]
plt.figure(figsize=(35,5))
plt.title('In the next 2 years, do you hope to become more familiar with any of these specific machine learning products?', fontsize=14)
plt.xlabel('Machine Learning Products', fontsize=11)
plt.ylabel('Total Number of People', fontsize=11)
t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)
plt.show()

list_Q29B=data.iloc[:,291:309].columns
list_value=pd.DataFrame()
list_1=[]
for i in list_Q29B:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])
clrs = ['#6D9EEB' if (x < max(list_1)) else '#708090' for x in list_1]
plt.figure(figsize=(35,5))
plt.title('In the next 2 years, which of the big data products do you hope to become more familiar with?', fontsize=14)
plt.xlabel('Big Data Products', fontsize=11)
plt.ylabel('Total Number of People', fontsize=11)
t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)
plt.show()

list_Q31B=data.iloc[:,309:324].columns
list_value=pd.DataFrame()
list_1=[]
for i in list_Q31B:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])
clrs = ['#6D9EEB' if (x < max(list_1)) else '#708090' for x in list_1]
plt.figure(figsize=(35,5))
plt.title('Which of the following business intelligence tools do you hope to become more familiar with in the next 2 years?', fontsize=14)
plt.xlabel('Business Intelligence Tools', fontsize=11)
plt.ylabel('Total Number of People', fontsize=11)
t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)
plt.show()




list_Q34B=data.iloc[:,344:355].columns
list_value=pd.DataFrame()
list_1=[]
for i in list_Q34B:
    test=pd.DataFrame(data[i][1:].value_counts())
    list_1.append(data[i][1:].value_counts().iloc[0])
    list_value=pd.concat([list_value,test])
clrs = ['#6D9EEB' if (x < max(list_1)) else '#708090' for x in list_1]
plt.figure(figsize=(35,5))
plt.title('Which of the specific automated machine learning tools do you hope to become more familiar with in the next 2 years?', fontsize=14)
plt.xlabel('Automated ML Tools', fontsize=11)
plt.ylabel('Total Number of People', fontsize=11)
t=sns.barplot(x=list_value.index,y=list_1, palette=clrs)
plt.show()
