# Introduction

Welcome to Kaggle's annual Machine Learning and Data Science Survey competition!
This year, as in 2017, 2018, and 2019, Kaggle set out to conduct an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for 3.5 weeks in October, and after cleaning the data we finished with 20,036 responses!

There's a lot to explore here. The results include raw numbers about who is working with data, what’s happening with machine learning in different industries, and the best ways for new data scientists to break into the field.
In the fourth year of this survey, the community is surprised and awed by the global, diverse, and dynamic nature of the data science and machine learning industry. This survey data EDA provides an overview of the industry on an aggregate scale. Because an aggregate view leaves out individual communities, the researchers invite us to dive deep into the survey datasets and help them tell the diverse stories of data scientists from around the world.


Kaggle's 2020 Machine Learning and Data Science Survey data consists of 20,036 responses from Kaggle members. From this data, we will focus on analyzing the responses from India.

# Objective

The objective is to tell a data story about a subset of the data science community represented in this survey, through a combination of narrative text and data exploration. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. This is an opportunity to be creative and tell the story of a community you identify with or are passionate about!

# Dataset


The dataset comes in two parts: the main data and the supplementary data.
The main data consists of *kaggle_survey_2020_responses.csv*. Responses to multiple choice questions (only a single choice can be selected) were recorded in individual columns. Responses to multiple selection questions (multiple choices can be selected) were split into multiple columns (with one column per answer choice).

The supplementary data consists of two files:
1. kaggle_survey_2020_answer_choices.pdf: a list of answer choices for every question, with footnotes describing which questions were asked to which respondents.

2. kaggle_survey_2020_methodology.pdf: a description of how the survey was conducted.
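
The two column layouts described above call for slightly different counting. The following is a minimal sketch on a toy frame; the column names (`Q_single`, `Q_multi_Part_*`) and the values are invented for illustration and are not taken from the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey file: one single-choice column and one
# multiple-selection question split into one column per answer choice.
# Column names and values are illustrative only.
toy = pd.DataFrame({
    "Q_single": ["Python", "R", "Python", "SQL"],
    "Q_multi_Part_1": ["Jupyter", np.nan, "Jupyter", "Jupyter"],
    "Q_multi_Part_2": [np.nan, "RStudio", "RStudio", np.nan],
    "Q_multi_Part_3": [np.nan, np.nan, np.nan, np.nan],
})

# Single-choice question: one column, so value_counts() is enough.
single_counts = toy["Q_single"].value_counts()

# Multiple-selection question: a non-missing cell means the choice was ticked,
# so counting non-missing cells per column gives the count per answer choice.
multi_cols = [c for c in toy.columns if c.startswith("Q_multi_Part_")]
multi_counts = toy[multi_cols].notna().sum()

print(single_counts)
print(multi_counts)
```

Because one respondent can tick several choices, the per-choice counts of a multiple-selection question can add up to more than the number of respondents.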

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Kaggle is About Learning and Sharing Knowledge
I want to explore what the current trends in Data Science are and what we can learn from experienced Machine Learning practitioners.

In [None]:
import re
import string

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import spacy # Leading library for NLP

from matplotlib.patches import Rectangle
from matplotlib_venn import venn2, venn2_circles
from sklearn.linear_model import LinearRegression
from wordcloud import WordCloud

# spaCy 3 removed the 'en' shortcut, so load the English model by its full name
# (assumes the en_core_web_sm model is installed in the environment)
nlp = spacy.load('en_core_web_sm')

**Importing the Dataset**

In [None]:
df = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', header=1)

df.head()

In [None]:
print('shape: ', df.shape)
print('columns with NaN values: \n', df.isna().sum().loc[df.isna().sum() != 0].shape[0])

### Size of the dataset: number of people who responded to the survey

In [None]:
print("In the '2020 Kaggle ML & DS Survey', {} people responded.".format(df.shape[0]))

In [None]:
def plot_frequency_charts(df, feature, title, xlabel, palette):
    """Plot the 20 most frequent values of `feature`, labelling each bar with its share of all rows."""
    f, ax = plt.subplots(1,1, figsize=(16,4))
    total = float(len(df))
    # Pass the data by keyword: recent seaborn versions no longer accept it positionally
    g = sns.countplot(x=df[feature], order=df[feature].value_counts().index[:20], palette=palette, ax=ax)

    # Annotate each bar with its percentage of all respondents
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(100*height/total),
                ha="center")

    plt.title('Frequency and percentage of {}'.format(title),weight='bold', size=15)
    plt.ylabel('Frequency', size=12, weight='bold')
    plt.xlabel(xlabel, size=12, weight='bold')
    plt.xticks(rotation=90)
    plt.show()

# Age Group:

In [None]:
plot_frequency_charts(df, 'What is your age (# years)?', 'age group','Age Group', 'inferno')

From the graph above we can see that more than half of the respondents are under 30 years of age.
The 25-29 age group makes up a very large share of the people using Kaggle as their data science platform.
The other age groups are close to one another and hard to tell apart, but we clearly see fewer responses from people above 40.
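
The "more than half are under 30" reading can also be checked numerically rather than by eye. This is a minimal sketch on invented age brackets; on the real data the same steps would be applied to the `'What is your age (# years)?'` column of `df`.

```python
import pandas as pd

# Illustrative age brackets only, not the real survey responses
ages = pd.Series(["18-21", "22-24", "25-29", "25-29", "30-34", "40-44", "22-24", "50-54"])

# Share of each bracket, as a fraction of all responses
age_share = ages.value_counts(normalize=True)

# Brackets that lie entirely below 30 years
under_30 = ["18-21", "22-24", "25-29"]
share_under_30 = age_share.reindex(under_30).fillna(0).sum()

print(f"{share_under_30:.1%} of the toy respondents are under 30")
```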

# Gender:

In [None]:
plot_frequency_charts(df, 'What is your gender? - Selected Choice', 'gender','Gender', 'inferno')

# Country:

In [None]:

plot_frequency_charts(df, 'In which country do you currently reside?', 'responders currently reside in','Country', 'winter')

From the graph above we can see that India has the highest share of respondents, at 29%, while the other countries have similar, smaller shares, led by the USA, Brazil, and Russia.
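
Since the analysis is meant to focus on India, one way to narrow the data is a boolean mask on the country column. The sketch below uses a toy frame with invented rows; `df_india` is a hypothetical name, and on the real data the same mask would be applied to `df`.

```python
import pandas as pd

# Toy responses; the column names match the question text used above
country_col = "In which country do you currently reside?"
toy = pd.DataFrame({
    country_col: ["India", "United States of America", "India", "Brazil", "India"],
    "What is your age (# years)?": ["22-24", "30-34", "25-29", "25-29", "18-21"],
})

# Boolean mask keeps only respondents who reside in India
df_india = toy[toy[country_col] == "India"].reset_index(drop=True)

# Share of all respondents that the subset represents
india_share = len(df_india) / len(toy)
print(df_india.shape, f"{india_share:.0%}")
```

Passing such a subset to `plot_frequency_charts` instead of the full frame would give the same charts for Indian respondents only.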

# Time taken to fill the form, by gender

In [None]:
df_gender_time = df.groupby(['What is your gender? - Selected Choice']).agg({'Duration (in seconds)':'mean'}).reset_index()

f, ax = plt.subplots(1,1, figsize=(15,5))
# Pass x and y by keyword: recent seaborn versions no longer accept them positionally
splot = sns.barplot(x=df_gender_time['What is your gender? - Selected Choice'],
                    y=df_gender_time['Duration (in seconds)'], alpha=0.8, ax=ax)

for p in splot.patches:
    splot.annotate(format(p.get_height(), '.2f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center',
                   va = 'center', xytext = (0, 9), textcoords = 'offset points')
plt.title('Average time taken to fill the form - genderwise',weight='bold', size=15)
plt.xlabel('Gender', fontsize=12, fontweight='bold')
plt.ylabel('Time Taken in seconds', fontsize=12, fontweight='bold')
plt.xticks(rotation=90)
plt.show()

The code and graph above show the average time each gender category took to fill in the survey form. On average, men took approximately 9147 seconds to respond, while women took 9234 seconds.
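
A mean can be pulled upward by a few respondents who left the form open for a long time, so it may help to look at the median next to it. The following sketch uses made-up durations purely to illustrate the effect; it says nothing about the real values.

```python
import pandas as pd

gender_col = "What is your gender? - Selected Choice"
time_col = "Duration (in seconds)"

# Made-up durations: one respondent left the form open for a long time
toy = pd.DataFrame({
    gender_col: ["Man", "Man", "Man", "Woman", "Woman", "Woman"],
    time_col: [600, 700, 800, 650, 750, 30000],
})

# Mean and median per gender category, side by side
summary = toy.groupby(gender_col)[time_col].agg(["mean", "median"])
print(summary)
```

In this toy data a single outlier dominates one group's mean while its median stays close to the typical value.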

# Education:

In [None]:

plot_frequency_charts(df, 
'What is the highest level of formal education that you have attained or plan to attain within the next 2 years?',
"responder's education",'Education', 'brg')

The plot above shows the spread of degrees among respondents. Most Kaggle participants hold or plan to attain a Master's degree, followed by a Bachelor's degree at 34.83%.
I would say that a degree doesn't decide whether you can enter the field; this is just a survey. If we are interested and have the zeal to keep improving, we can surely reach great heights.


# Occupation:

In [None]:
plot_frequency_charts(df, 
'Select the title most similar to your current role (or most recent title if retired): - Selected Choice',
"responder's occupation",'Occupations', 'winter')

# Coding experience:

In [None]:
plot_frequency_charts(df, 'For how many years have you been writing code and/or programming?',
                      "responder's coding experience",'Duration', 'viridis')

Respondents with several years of coding experience participate heavily, which rings true: 22.69% of the respondents using Kaggle for their projects have 5 years of experience.

# Programming Language:

In [None]:
plot_frequency_charts(df,
    'What programming language would you recommend an aspiring data scientist to learn first? - Selected Choice',
                      "language recommended for aspiring data scientists to learn first",'Programming language', 'husl')

The graph above covers something we largely know already: which programming language best supports data science.
It shows that Python is the language most often recommended to aspiring data scientists, which is no surprise because most people working in the field are proficient in it.
Kaggle supports both Python and R, while Google Colab is built primarily around Python.
So this survey is quite useful for us.

# Types of Computing Platform:

In [None]:
df['What type of computing platform do you use most often for your data science projects? - Selected Choice'] = df[
    'What type of computing platform do you use most often for your data science projects? - Selected Choice'].replace(
    {'A personal computer or laptop':'PC or Laptop',
     'A cloud computing platform (AWS, Azure, GCP, hosted notebooks, etc)':'Cloud platform',
     'A deep learning workstation (NVIDIA GTX, LambdaLabs, etc)':'Deep Learning workstation'})

plot_frequency_charts(df,
    'What type of computing platform do you use most often for your data science projects? - Selected Choice',
                      "type of computing platform used",'Platform', 'brg')

# Number of times TPU used:

In [None]:
plot_frequency_charts(df, 'Approximately how many times have you used a TPU (tensor processing unit)?',
                      "respondents' TPU (tensor processing unit) usage",'Occasions', 'twilight')

In [None]:
data = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')

data.head()

In [None]:
dff = pd.DataFrame()

relevant_2020 = pd.DataFrame()

In [None]:
relevant_2020['Age'] = data.Q1
relevant_2020['Compensation_Amout'] = data.Q24
relevant_2020['Country'] = data.Q3
relevant_2020['Occupation'] = data.Q5
#relevant_2020['Occupation_Freeform'] = data.Q5_OTHER
relevant_2020['Education'] = data.Q4
relevant_2020['Language_Recommendation'] = data.Q8
#relevant_2020['Language_Recommendation_Freeform'] = data.Q8_OTHER
relevant_2020['Programming_Experience'] = data.Q6

relevant_2020['Hosted_Notebook_Products__Kaggle_Notebooks'] = data.Q10_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__Google_Colab'] = data.Q10_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__Azure_Notebooks'] = data.Q10_Part_3.apply(lambda x: 0 if x != x else 1) | data.Q10_OTHER.apply(lambda x: 1 if ('databricks' in str(x).lower()) else 0)

relevant_2020['Hosted_Notebook_Products__Google_Cloud_Datalab'] = data.Q10_Part_11.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__Paperspace_Gradient'] = data.Q10_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__FloydHub'] = data.Q10_OTHER.apply(lambda x: 1 if ('floyd' in str(x).lower()) else 0)
relevant_2020['Hosted_Notebook_Products__Binder_JupyterHub'] = data.Q10_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__IBM_Watson_Studio'] = data.Q10_Part_7.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__Code_Ocean'] = data.Q10_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__AWS_Notebook'] =data.Q10_Part_9.apply(lambda x: 0 if x != x else 1) | data.Q10_Part_8.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__Domino_Datalab'] = data.Q10_OTHER.apply(lambda x: 1 if ('domino' in str(x).lower()) else 0) 
relevant_2020['Hosted_Notebook_Products__Crestle'] = data.Q10_OTHER.apply(lambda x: 1 if ('crestle' in str(x).lower()) else 0) 
relevant_2020['Hosted_Notebook_Products__Cocalc'] = data.Q10_OTHER.apply(lambda x: 1 if ('cocalc' in str(x).lower()) else 0) 
relevant_2020['Hosted_Notebook_Products__Datalore'] = data.Q10_OTHER.apply(lambda x: 1 if ('datalore' in str(x).lower()) else 0)
relevant_2020['Hosted_Notebook_Products__Google_Cloud_AI'] = data.Q10_Part_10.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__Databricks'] = data.Q10_Part_12.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__None'] = data.Q10_Part_13.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__Other'] = data.Q10_OTHER

relevant_2020['Activities__Analyze_and_understand_data'] = data.Q23_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2020['Activities__Build_ML_service'] = data.Q23_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2020['Activities__Build_data_infrastructure'] = data.Q23_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2020['Activities__Build_prototypes'] = data.Q23_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2020['Activities__Build_Experimentation'] = data.Q23_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2020['Activities__Research'] = data.Q23_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2020['Activities__None'] = data.Q23_Part_7.apply(lambda x: 0 if x != x else 1)
relevant_2020['Activities__Other'] = data.Q23_OTHER

relevant_2020['Algorithms__Linear_or_Logistic_Regression'] = data.Q17_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__Decision_Trees_or_Random_Forests'] =data.Q17_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__Gradient_Boosting_Machines'] =data.Q17_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__Bayesian_Approaches'] = data.Q17_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__Evolutionary_Approaches'] =data.Q17_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__Dense_Neural_Networks'] = data.Q17_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__CNN'] = data.Q17_Part_7.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__GAN'] = data.Q17_Part_8.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__RNN'] = data.Q17_Part_9.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__Transformer_Networks'] = data.Q17_Part_10.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__None'] = data.Q17_Part_11.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__Other'] = data.Q17_OTHER

relevant_2020['Algorithm_Cluster_Traditional_ML'] = relevant_2020.Algorithms__Linear_or_Logistic_Regression| relevant_2020.Algorithms__Decision_Trees_or_Random_Forests | relevant_2020.Algorithms__Gradient_Boosting_Machines | relevant_2020.Algorithms__Bayesian_Approaches | relevant_2020.Algorithms__Evolutionary_Approaches | relevant_2020.Algorithms__Dense_Neural_Networks
relevant_2020['Algorithm_Cluster_Deep_Learning_Vision'] = relevant_2020.Algorithms__CNN | relevant_2020.Algorithms__GAN
relevant_2020['Algorithm_Cluster_Deep_Learning_NLP'] = relevant_2020.Algorithms__RNN | relevant_2020.Algorithms__Transformer_Networks
relevant_2020['Algorithm_Cluster_Deep_Learning'] = relevant_2020['Algorithm_Cluster_Deep_Learning_Vision'] | relevant_2020['Algorithm_Cluster_Deep_Learning_NLP']
relevant_2020['Algorithm_Cluster_Other'] = relevant_2020['Algorithms__Other'].apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithm_Cluster_None'] = ((relevant_2020.Algorithm_Cluster_Traditional_ML == 0) & (relevant_2020.Algorithm_Cluster_Deep_Learning_Vision == 0 )
                                           & (relevant_2020.Algorithm_Cluster_Deep_Learning_NLP == 0) & (relevant_2020.Algorithm_Cluster_Other == 0)).astype(int)

relevant_2020['Hardware__CPU'] = data.Q12_OTHER.apply(lambda x: 1 if ('cpu' in str(x).lower()) else 0)
relevant_2020['Hardware__GPU'] = data.Q12_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hardware__TPU'] = data.Q12_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hardware__None'] = data.Q12_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hardware__Other'] = data.Q12_OTHER

relevant_2020['Framework__TensorFlow'] = data.Q16_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2020['Framework__Keras'] = data.Q16_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2020['Framework__PyTorch'] = data.Q16_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2020['Framework__TensorFlow_Keras'] = relevant_2020['Framework__TensorFlow'] & relevant_2020['Framework__Keras']
relevant_2020['Framework__TensorFlow_Keras_PyTorch'] = relevant_2020['Framework__TensorFlow'] | relevant_2020['Framework__Keras'] | relevant_2020['Framework__PyTorch']

relevant_2020['Cloud_Platform__GCP'] =data.Q26_A_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__AWS'] =data.Q26_A_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__Azure'] =data.Q26_A_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__IBM'] =data.Q26_A_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__Alibaba'] = data.Q26_A_Part_9.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__Tencent'] = data.Q26_A_Part_10.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__Salesforce'] = data.Q26_A_Part_7.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__Oracle'] = data.Q26_A_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__SAP'] = data.Q26_A_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__VMWare'] = data.Q26_A_Part_8.apply(lambda x: 0 if x != x else 1) 
relevant_2020['Cloud_Platform__Red_Hat'] = data.Q26_A_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__None'] = data.Q26_A_Part_11.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__Other'] = data.Q26_A_OTHER.apply(lambda x: 0 if x != x else 1)



relevant_2020['Media__Twitter'] = data.Q39_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__Newsletters'] = data.Q39_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__Reddit'] = data.Q39_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__Kaggle'] = data.Q39_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__Course_Forums'] = data.Q39_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__YouTube'] = data.Q39_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__Podcast'] = data.Q39_Part_7.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__Blogs'] = data.Q39_Part_8.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__Journal_Publications'] = data.Q39_Part_9.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__Slack'] = data.Q39_Part_10.apply(lambda x: 0 if x != x else 1) 
relevant_2020['Media__None'] = data.Q39_Part_11.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__Other'] =data.Q39_OTHER.apply(lambda x: 0 if x != x else 1)
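
The `lambda x: 0 if x != x else 1` pattern used throughout the cell above relies on NaN being the only value that is not equal to itself, so it turns "answer ticked" into 1 and "not ticked" into 0. A minimal sketch on a toy column (invented values) shows the pattern next to a vectorized pandas equivalent:

```python
import numpy as np
import pandas as pd

# Toy multiple-selection column: text when ticked, NaN when not
col = pd.Series(["Kaggle Notebooks", np.nan, "Kaggle Notebooks", np.nan])

# Pattern used in the cell above: NaN is the only value not equal to itself
flag_lambda = col.apply(lambda x: 0 if x != x else 1)

# Vectorized equivalent using pandas' own missing-value check
flag_notna = col.notna().astype(int)

print(flag_lambda.tolist())
print(flag_notna.tolist())
```

Both produce the same 0/1 flags; the `notna()` form states the intent more directly and avoids a Python-level call per row.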

In [None]:
relevant_2020['Year'] = 2020
relevant_2020 = relevant_2020.loc[1:].reset_index(drop=True)

dff = relevant_2020

In [None]:
def categorize_occupations(x):
    if 'data scien' in x:
        return 'Data Scientist'
    elif 'software developer' in x:
        return 'Software Engineer'
    elif any(s in x for s in ['student', 'intern']):
        return 'Student'
    elif 'research' in x:
        return 'Research Scientist'
    elif 'data analy' in x:
        return 'Data Analyst'
    elif 'ness analy' in x:
        return 'Business Analyst'
    elif any(s in x for s in ['prof', 'teach', 'lect', 'educat', 'faculty', 'academi']):
        return 'Teacher/Professor'
    elif any(s in x for s in ['ngineer', 'enginner']):
        return 'Other Engineer'
    elif 'project' in x:
        return 'Product/Project Manager'
    elif any(s in x for s in ['lead', 'manag', 'head', 'direct', 'dircetor']):
        return 'Manager'
    elif any(s in x for s in ['chief', 'cto', 'ceo', 'coo','cfo', 'cio', 'cdo']):
        return 'Chief Officer'
    elif 'test' in x:
        return 'Software Tester'
    elif 'tired' in x:
        return 'Not employed'
    elif any(s in x for s in ['writer', 'journal']):
        return 'Data Journalist'
    else:
        return 'Other'
    
dict_occupation = {'Software Developer/Software Engineer' : 'Software Engineer',
               'Researcher' : 'Research Scientist',
               'Scientist/Researcher' : 'Research Scientist',
               'Research Assistant' : 'Research Scientist',
                   
               'Marketing Analyst' : 'Other',
                   'Salesperson' : 'Other',
                   
                   'Consultant' : 'Other',
              }
dff['Occupation'] = dff['Occupation'].replace(dict_occupation)


dff.Activities__Other = dff.Activities__Other.fillna('NaN')
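# Relabel as 'Student' when the free-text activity mentions studying but not teaching;
# plain `and`/`not` replace the bitwise `&`/`~`, which are not meant for Python booleans
dff.Occupation = dff.apply(lambda x: 'Student' if (('student' in x.Activities__Other.lower()) and not ('teach' in x.Activities__Other.lower()) and not ('educate' in x.Activities__Other.lower()) and (x.Occupation != 'Student')) else x.Occupation, axis = 1)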

# Kagglers are Levelling Up

This year, we have seen a large selection of different competitions on Kaggle. My feeling is that competitions with tabular data might be more popular among Kagglers. I think this is because we are all here to learn and competitions with tabular data are especially beginner friendly because you can apply traditional machine learning algorithms, which might be easier to apply for beginners.

In [None]:
temp = dff.groupby('Year').agg(Survey_Participants = ('Year', 'count' ),
                       Algorithm_Cluster_Traditional_ML = ('Algorithm_Cluster_Traditional_ML', 'sum'),
                       Algorithm_Cluster_Deep_Learning_Vision = ('Algorithm_Cluster_Deep_Learning_Vision', 'sum'),
                       Algorithm_Cluster_Deep_Learning_NLP = ('Algorithm_Cluster_Deep_Learning_NLP', 'sum'),
                       Algorithm_Cluster_Deep_Learning = ('Algorithm_Cluster_Deep_Learning', 'sum'),
                       Algorithms__None = ('Algorithms__None', 'sum'),
                              Algorithms_Cluster_Other = ('Algorithm_Cluster_Other', 'sum'),
                      )
for c in temp.columns:
    if c != 'Survey_Participants':
        temp[c] = temp[c] / temp['Survey_Participants'] *100

f, ax = plt.subplots(nrows=1, ncols=2, figsize=(16,8))
ax[0].set_title("Percentage of Kagglers' \nUsage of Machine Learning Algorithms", fontsize=16)

group1 = int((dff[(dff.Year == 2020) & (dff.Algorithm_Cluster_Traditional_ML == 0) & (dff.Algorithm_Cluster_Deep_Learning != 0)].shape[0] / dff[(dff.Year == 2020)].shape[0])*100)
group2 = int((dff[(dff.Year == 2020) & (dff.Algorithm_Cluster_Traditional_ML != 0) & (dff.Algorithm_Cluster_Deep_Learning == 0)].shape[0] / dff[(dff.Year == 2020)].shape[0])*100) 
intersection = int((dff[(dff.Year == 2020) & (dff.Algorithm_Cluster_Traditional_ML != 0) & (dff.Algorithm_Cluster_Deep_Learning != 0)].shape[0] / dff[(dff.Year == 2020)].shape[0])*100)

v1 = venn2(subsets = (group1, group2, intersection),
          set_labels = ( '', '', ''),
          set_colors=( 'deepskyblue', 'lightgrey'),
           alpha=1,
           ax=ax[0])

v1.get_patch_by_id('11').set_color('skyblue')
c1 = venn2_circles(subsets = (group1, group2, intersection), color='skyblue', ax=ax[0])
c1[0].set_lw(3.0)
c1[1].set_lw(0.0)
ax[0].annotate('Deep Learning', xy=v1.get_label_by_id('10').get_position() - np.array([0, -0.05]), xytext=(-70,70),ha='center', textcoords='offset points', 
arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0.5',color='gray'))

ax[0].annotate('Traditional Machine Learning\n (Regression, Decision Trees, \nGBM, etc.)', xy=v1.get_label_by_id('01').get_position() - np.array([0, -0.05]), xytext=(70,70),
ha='center', textcoords='offset points', 
arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=-0.5',color='gray'))

ax[0].annotate('Both', xy=v1.get_label_by_id('11').get_position() - np.array([0, -0.05]), xytext=(0,70),
ha='center', textcoords='offset points',
arrowprops=dict(arrowstyle='->', color='gray'))

**Findings**

* 31% of Kagglers regularly use both traditional Machine Learning and Deep Learning algorithms
* Most Kagglers using Deep Learning algorithms also use traditional machine learning algorithms
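
The Venn diagram regions come down to four mutually exclusive groups built from two 0/1 flags. The sketch below reproduces that split on invented flags; the column names `traditional_ml` and `deep_learning` are stand-ins for `Algorithm_Cluster_Traditional_ML` and `Algorithm_Cluster_Deep_Learning`.

```python
import pandas as pd

# Toy 0/1 flags in the same shape as the algorithm cluster columns; values are invented
toy = pd.DataFrame({
    "traditional_ml": [1, 1, 1, 0, 0, 1, 1, 0],
    "deep_learning":  [1, 0, 1, 1, 0, 0, 1, 0],
})

trad = toy["traditional_ml"] == 1
deep = toy["deep_learning"] == 1

# The mean of a boolean mask is the fraction of rows where it is True
shares = {
    "deep_only": (deep & ~trad).mean(),
    "traditional_only": (trad & ~deep).mean(),
    "both": (trad & deep).mean(),
    "neither": (~trad & ~deep).mean(),
}

# Among deep learning users, the fraction that also uses traditional ML
dl_users_also_traditional = (trad & deep).sum() / deep.sum()

print(shares)
print(dl_users_also_traditional)
```

The four shares always sum to 1, which is a quick sanity check before passing the first three to `venn2`.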

Since we are all here to learn, let's explore what we can learn from Kagglers that are more experienced in machine learning, professional data scientists and researchers in the field.

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 5))

# Pass the data by keyword: recent seaborn versions no longer accept it positionally
sns.countplot(x=data.iloc[1:].Q15, order=['Under 1 year', '1-2 years', '2-3 years', '3-4 years', '4-5 years','5-10 years',  '10-20 years',
       '20 or more years'], palette='Blues', ax=ax)

ax.set_title('Machine Learning Experience', fontsize = 14)
ax.set_ylabel(ylabel = 'Number of Respondents', fontsize = 12)
ax.set_xlabel(xlabel = 'Years of Machine Learning Experience', fontsize = 12)

# Q1: PyTorch or TensorFlow/Keras?
This question might be asked even more frequently: "I am new to Deep Learning. Should I start with PyTorch or TensorFlow/Keras?" Here is the answer.

In [None]:
temp = dff[(dff.Occupation.isin(['Data Scientist', 'Research Scientist']) & (dff.Year==2020))]
temp = temp.groupby(['Year', 'Occupation']).agg(Framework__TensorFlow_Keras_PyTorch = ('Framework__TensorFlow_Keras_PyTorch', 'sum' ),
                                                                    Framework__TensorFlow = ('Framework__TensorFlow', 'sum'),
                                                                    Framework__Keras = ('Framework__Keras', 'sum'),
                                                                    Framework__PyTorch = ('Framework__PyTorch', 'sum'),
                                                                    Framework__TensorFlow_Keras = ('Framework__TensorFlow_Keras', 'sum'))

temp['Framework__TensorFlow'] = temp['Framework__TensorFlow'] / temp['Framework__TensorFlow_Keras_PyTorch']
temp['Framework__Keras'] = temp['Framework__Keras'] / temp['Framework__TensorFlow_Keras_PyTorch']
temp['Framework__PyTorch'] = temp['Framework__PyTorch'] / temp['Framework__TensorFlow_Keras_PyTorch']
temp['Framework__TensorFlow_Keras'] = temp['Framework__TensorFlow_Keras'] / temp['Framework__TensorFlow_Keras_PyTorch']

temp = temp.reset_index(drop=False)

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))
for i, occ in enumerate(['Data Scientist', 'Research Scientist']):#, 'Student']):

    framework_by_occupation = temp[temp.Occupation==occ][['Year', 'Framework__TensorFlow', 'Framework__Keras', 'Framework__PyTorch']]
    reg_tensorflow = LinearRegression().fit(np.array(framework_by_occupation.Year).reshape(-1, 1), np.array(framework_by_occupation.Framework__TensorFlow).reshape(-1, 1))
    reg_keras = LinearRegression().fit(np.array(framework_by_occupation.Year).reshape(-1, 1), np.array(framework_by_occupation.Framework__Keras).reshape(-1, 1))
    reg_pytorch = LinearRegression().fit(np.array(framework_by_occupation.Year).reshape(-1, 1), np.array(framework_by_occupation.Framework__PyTorch).reshape(-1, 1))
    
    # DataFrame.append was removed in pandas 2.0, so build the forecast rows and concatenate them
    forecast = pd.DataFrame([{'Year': year,
                              'Framework__TensorFlow': reg_tensorflow.predict(np.array([[year]]))[0,0],
                              'Framework__Keras': reg_keras.predict(np.array([[year]]))[0,0],
                              'Framework__PyTorch': reg_pytorch.predict(np.array([[year]]))[0,0]}
                             for year in (2021, 2022)])
    framework_by_occupation = pd.concat([framework_by_occupation, forecast], ignore_index=True)
    
    framework_by_occupation.set_index('Year').plot(ax=ax[i], marker='x')
    
    ax[i].set_ylim([0,1])
    ax[i].set_ylabel(ylabel = 'Share of Respondents Using the Framework (fraction)', fontsize = 14)
    ax[i].set_xlabel(xlabel = 'Year', fontsize = 14)
    ax[i].set_xticks([2018, 2019, 2020, 2021, 2022,])# 2023, 2024])                                    
    ax[i].add_patch(Rectangle((2020.5, 0), 4, 1, fill=True, alpha=0.2, color='Grey', lw=0))
    ax[i].annotate('Prediction', xy=(2020.8, 0.9), fontsize=14, color='black')
    ax[i].set_title(occ, fontsize = 16)
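
The prediction region in the plot rests on a straight-line fit per framework. The following is a minimal sketch of that idea using `numpy.polyfit` in place of `LinearRegression`. The multi-year shares are invented for illustration; this notebook only loads 2020 data, and a single year carries no slope information.

```python
import numpy as np

# Invented shares for illustration only; not taken from any survey year
years = np.array([2018, 2019, 2020])
share = np.array([0.20, 0.30, 0.40])

# Centre the years on the first one to keep the fit well conditioned
x = years - years[0]

# Degree-1 least-squares fit: share ~ slope * x + intercept
slope, intercept = np.polyfit(x, share, deg=1)

# Extrapolate to the next two years
future = np.array([2021, 2022])
forecast = slope * (future - years[0]) + intercept

# Shares are fractions, so keep the extrapolation inside [0, 1]
forecast = np.clip(forecast, 0.0, 1.0)

print(dict(zip(future.tolist(), forecast.round(3).tolist())))
```

A straight line through so few points is only a rough trend indicator, so the shaded prediction area is best read as illustrative rather than as a forecast.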