# Trends in 2020 with Advice from Top Kagglers

In this notebook I want to explore **what the current trends in Data Science are** and **what we can learn from experienced Machine Learning practitioners**. For this, I will also incorporate the [Meta-Kaggle](https://www.kaggle.com/kaggle/meta-kaggle) dataset and the [Chai Time Data Science](https://www.kaggle.com/rohanrao/chai-time-data-science) dataset, which contains the transcripts of the [CTDS Podcast](https://chaitimedatascience.com/).

This year has been eventful to say the least. Let's explore the Kaggle survey data and find out what the current trends in Data Science are:
* Did **PyTorch** exceed **TensorFlow/Keras** in popularity?
* **TPUs** are now available in Kaggle kernels, but how popular are they?

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.set_option('display.max_colwidth', None) # None disables truncation (-1 is deprecated)
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from matplotlib_venn import venn2, venn2_circles

import seaborn as sns
import re

import warnings
warnings.filterwarnings('ignore')

import string
import spacy # Leading library for NLP
nlp = spacy.load('en_core_web_sm') # the 'en' shortcut was removed in spaCy 3

from wordcloud import WordCloud
from sklearn.linear_model import LinearRegression

# Load survey data from 2017
survey_schema_2017 = pd.read_csv("../input/kaggle-survey-2017/schema.csv",encoding='ISO-8859-1')
multiple_choice_responses_2017 = pd.read_csv("../input/kaggle-survey-2017/multipleChoiceResponses.csv",encoding='ISO-8859-1')
freeform_responses_2017 = pd.read_csv("../input/kaggle-survey-2017/freeformResponses.csv",encoding='ISO-8859-1')

# Load survey data from 2018
survey_schema_2018 = pd.read_csv("../input/kaggle-survey-2018/SurveySchema.csv")
multiple_choice_responses_2018 = pd.read_csv("../input/kaggle-survey-2018/multipleChoiceResponses.csv")
freeform_responses_2018 = pd.read_csv("../input/kaggle-survey-2018/freeFormResponses.csv")

# Load survey data from 2019
survey_schema_2019 = pd.read_csv("../input/kaggle-survey-2019/survey_schema.csv")
multiple_choice_responses_2019 = pd.read_csv("../input/kaggle-survey-2019/multiple_choice_responses.csv")
freeform_responses_2019 = pd.read_csv("../input/kaggle-survey-2019/other_text_responses.csv")

# Load survey data from 2020
multiple_choice_responses_2020 = pd.read_csv("../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv")

# Load Kaggle Meta Data
kagglers_achievements_df = pd.read_csv("../input/meta-kaggle/UserAchievements.csv")
kagglers_df = pd.read_csv("../input/meta-kaggle/Users.csv")

ctds_df = pd.read_csv("../input/chai-time-data-science/Episodes.csv")
ctds_kaggle_df = ctds_df[ctds_df.category == 'Kaggle']

# Collect one dataframe per episode and concatenate once at the end
transcripts = []
for episode in ctds_kaggle_df.index:
    try:
        transcript = pd.read_csv(f"/kaggle/input/chai-time-data-science/Cleaned Subtitles/E{episode}.csv")
    except FileNotFoundError:
        print(f"Episode {episode} does not have a transcript.")
        continue
    transcript['episode'] = episode
    transcript['heroes'] = ctds_kaggle_df.loc[episode].heroes
    transcripts.append(transcript)

all_data = pd.concat(transcripts)

all_data = all_data.reset_index()
all_data.columns = ['line', 'time', 'speaker', 'text', 'episode', 'heroes']

all_data_red = all_data[['episode', 'heroes','speaker', 'text']][(all_data.speaker != 'Unknown Speaker') & (all_data.episode !=34)] # Sorry, I have yet to learn how to handle the data cleaning for three different speakers
all_data_red['speaker'] = all_data_red['speaker'].apply(lambda x: re.sub(r'Dr\. ', '', x)) # escape the dot so only a literal "Dr. " is removed
all_data_red['key'] = (all_data_red['speaker'] != all_data_red['speaker'].shift(1)).astype(int).cumsum()
all_data_grouped = all_data_red.groupby(['episode', 'heroes', 'key', 'speaker'])['text'].apply(' '.join).to_frame().reset_index(drop=False)
all_data_grouped['speaker_generic'] = all_data_grouped['speaker'].apply(lambda x: 0 if x=='Sanyam Bhutani' else 1)
all_data_grouped = all_data_grouped.groupby(['episode', 'heroes'], as_index=False).apply(lambda x: x.iloc[:-1]).reset_index(drop=True)
all_data_grouped['new_question'] = (all_data_grouped['speaker'] == 'Sanyam Bhutani').astype(int).cumsum()
all_data_grouped = all_data_grouped.pivot(index=['episode', 'heroes', 'new_question'], columns='speaker_generic')['text']
all_data_grouped.columns = ['interviewer', 'guest']
all_data_grouped = all_data_grouped.fillna('')
quote = all_data_grouped.reset_index(drop=False)

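The transcript cell above merges consecutive lines by the same speaker by comparing each row with the previous one (`shift(1)`) and taking a cumulative sum over the changes. Here is a minimal sketch of that idea on a toy dataframe. The speaker names and sentences are made up for illustration and are not taken from the CTDS transcripts.

```python
import pandas as pd

# Hypothetical toy transcript; names and text are placeholders.
toy = pd.DataFrame({
    "speaker": ["Host", "Host", "Guest", "Guest", "Guest", "Host", "Guest"],
    "text": ["Hi.", "Welcome!", "Thanks.", "Glad", "to be here.", "First question?", "Sure."],
})

# A new run starts wherever the speaker differs from the previous row.
# The cumulative sum turns those change points into a run id.
toy["key"] = (toy["speaker"] != toy["speaker"].shift(1)).astype(int).cumsum()

# Join all lines that belong to the same uninterrupted turn.
turns = (
    toy.groupby(["key", "speaker"])["text"]
    .apply(" ".join)
    .reset_index()
)
print(turns)
```

Each row of `turns` is one uninterrupted turn, which is what makes the later pivot into interviewer and guest columns possible.
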
In [None]:
kaggle_motivation = freeform_responses_2017[~freeform_responses_2017.KaggleMotivationFreeForm.isna()].KaggleMotivationFreeForm

# Convert text to lowercase
kaggle_motivation = kaggle_motivation.apply(lambda x: x.lower())

# Remove punctuation
kaggle_motivation = kaggle_motivation.apply(lambda x: re.sub("[%s]" % re.escape(string.punctuation), "", x))

# Replace non-ASCII characters with a space
kaggle_motivation = kaggle_motivation.apply(lambda x: re.sub("([^\x00-\x7F])+", " ", x))

# Tokenize
def tokenize(x):
    x = nlp(x)
    x_clean = ""
    for token in x:
        # Drop stop words and keep only tokens longer than 3 characters
        if (not token.is_stop) and (len(token) > 3):
            # Append the raw token text (no lemmatization is applied here)
            x_clean += str(token) + " "
    return x_clean

kaggle_motivation = kaggle_motivation.apply(lambda x: tokenize(x))

# Join all cleaned responses into a single string for the word cloud
text = ''.join(kaggle_motivation)

def one_color_func(word=None, font_size=None, 
                   position=None, orientation=None, 
                   font_path=None, random_state=None):
    h = 200 # 0 - 360
    s = 100 # 0 - 100
    l = random_state.randint(30, 70) # 0 - 100
    return "hsl({}, {}%, {}%)".format(h, s, l)
# Create and generate a word cloud image:
wordcloud = WordCloud(background_color ='white', random_state=2020, color_func=one_color_func).generate(text)

# Display the generated image:
plt.figure(figsize=[14,6])
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

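To make the effect of the cleaning steps in the cell above easier to see, here is a small sketch that applies the same regular expressions to a single sentence. It is a simplification, not the notebook's pipeline: it splits on whitespace instead of using spaCy, and `STOP_WORDS` is a tiny hypothetical stop-word set standing in for spaCy's much larger list.

```python
import re
import string

# Hypothetical stop-word set for illustration only.
STOP_WORDS = {"the", "and", "with", "that", "this", "from"}

def clean_response(text, min_length=4):
    """Lowercase, strip punctuation and non-ASCII characters, filter short/stop words."""
    text = text.lower()
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS and len(t) >= min_length]
    return " ".join(tokens)

cleaned = clean_response("I want to learn from the BEST, with real-world data!")
print(cleaned)
```

Because punctuation is deleted rather than replaced with a space, hyphenated words such as "real-world" are fused into a single token ("realworld"). The same happens in the cell above, which is worth keeping in mind when reading the word cloud.
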

In [None]:
### Helper functions ###
def unify_age_2017(x):
    if x < 22:
        return '18-21'
    elif x < 25:
        return '22-24'
    elif x < 30:
        return '25-29' 
    elif x < 35:
        return '30-34' 
    elif x < 40:
        return '35-39'
    elif x < 45:
        return '40-44'
    elif x < 50:
        return '45-49'
    elif x < 55:
        return '50-54'
    elif x < 60:
        return '55-59'
    elif x < 70:
        return '60-69'
    else:
        return '70+'


def unify_compensation_2017(x):
    try:
        x = re.sub(',','', x)
        x = int(x)
        if x < 10000:
            return '0-9,999'
        elif x <20000:
            return '10,000-19,999'
        elif x < 30000:
            return '20,000-29,999'
        elif x < 40000:
            return '30,000-39,999'
        elif x < 50000:
            return '40,000-49,999'
        elif x < 60000:
            return '50,000-59,999'
        elif x < 70000:
            return '60,000-69,999'
        elif x  < 80000:
            return '70,000-79,999'
        elif x < 90000:
            return '80,000-89,999'
        elif x < 100000:
            return '90,000-99,999'
        elif x < 125000:
            return '100,000-124,999'
        elif x < 150000:
            return '125,000-149,999'
        elif x < 200000:
            return '150,000-199,999'
        elif x < 250000:
            return '200,000-249,999'
        elif x < 300000:
            return '250,000-299,999'
        elif x < 500000:
            return '300,000-500,000'
        else: 
            return '> $500,000'
    except:
        return x
    
def unify_compensation_2018(x):
    # The 2018 survey has a single '0-10,000' bucket, so it maps directly to '0-9,999'
    if x == '0-10,000':
        return '0-9,999'
    elif x == '10-20,000':
        return '10,000-19,999'
    elif x == '20-30,000':
        return '20,000-29,999'
    elif x == '30-40,000':
        return '30,000-39,999'
    elif x == '40-50,000':
        return '40,000-49,999'
    elif x == '50-60,000':
        return '50,000-59,999'
    elif x == '60-70,000':
        return '60,000-69,999'
    elif x == '70-80,000':
        return '70,000-79,999'
    elif x == '80-90,000':
        return '80,000-89,999'
    elif x == '90-100,000':
        return '90,000-99,999'
    elif x == '100-125,000':
        return '100,000-124,999'
    elif x == '125-150,000':
        return '125,000-149,999'
    elif x == '150-200,000':
        return '150,000-199,999'
    elif x == '200-250,000':
        return '200,000-249,999'
    elif x == '250-300,000':
        return '250,000-299,999'
    elif ((x == '300-400,000') | (x == '400-500,000')):
        return '300,000-500,000'
    elif x == '500,000+':
        return '> $500,000'
    else: 
        return 'I do not wish to disclose my approximate yearly compensation'

def unify_compensation_2019(x):
    if ((x == '$0-999') | (x== '1,000-1,999') | (x == '2,000-2,999') | (x == '3,000-3,999') | (x == '4,000-4,999')  | (x == '5,000-7,499') | (x == '7,500-9,999')):
        return '0-9,999'
    elif ((x == '10,000-14,999') | (x == '15,000-19,999')):
        return '10,000-19,999'
    elif ((x == '20,000-24,999') | (x == '25,000-29,999')):
        return '20,000-29,999'
    else: 
        return x

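The long `if`/`elif` chain in `unify_compensation_2017` could also be written as a table lookup. The sketch below is one possible alternative, not the notebook's implementation. It copies the thresholds and labels from that function and uses `bisect_right` from the standard library to pick the bucket. The names `BOUNDS`, `LABELS` and `bucket_compensation` are introduced here for illustration.

```python
from bisect import bisect_right

# Exclusive upper bounds, copied from the thresholds in unify_compensation_2017.
BOUNDS = [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000,
          100000, 125000, 150000, 200000, 250000, 300000, 500000]

# One label per interval, plus a final label for everything >= 500,000.
LABELS = ['0-9,999', '10,000-19,999', '20,000-29,999', '30,000-39,999',
          '40,000-49,999', '50,000-59,999', '60,000-69,999', '70,000-79,999',
          '80,000-89,999', '90,000-99,999', '100,000-124,999', '125,000-149,999',
          '150,000-199,999', '200,000-249,999', '250,000-299,999',
          '300,000-500,000', '> $500,000']

def bucket_compensation(amount):
    """Return the unified bucket label for an integer amount."""
    # bisect_right counts how many bounds are <= amount, which is the bucket index.
    return LABELS[bisect_right(BOUNDS, amount)]

print(bucket_compensation(9999), bucket_compensation(10000), bucket_compensation(500000))
```

Keeping the thresholds and labels in two lists makes it easier to spot a missing or duplicated bucket than in a chain of branches.
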
In [None]:
# Merge all survey data from all years into one dataframe and unify responses as much as possible
df =  pd.DataFrame()

relevant_2017 =  pd.DataFrame()
relevant_2018 =  pd.DataFrame()
relevant_2019 =  pd.DataFrame()
relevant_2020 =  pd.DataFrame()


### 2017 ###
relevant_2017['Age'] = multiple_choice_responses_2017.Age
relevant_2017['Age'] = relevant_2017['Age'].apply(lambda x: unify_age_2017(x))
relevant_2017['Compensation_Amout'] = multiple_choice_responses_2017.CompensationAmount
relevant_2017['Compensation_Currency'] = multiple_choice_responses_2017.CompensationCurrency
relevant_2017['Compensation'] = relevant_2017['Compensation_Amout'].apply(lambda x: unify_compensation_2017(x))

relevant_2017['Country'] = multiple_choice_responses_2017.Country
relevant_2017['Occupation'] = multiple_choice_responses_2017.CurrentJobTitleSelect
relevant_2017['Occupation_Freeform'] = freeform_responses_2017.CurrentJobTitleFreeForm
relevant_2017['Student'] = multiple_choice_responses_2017.StudentStatus
relevant_2017['Occupation'] = relevant_2017.apply(lambda x: 'Student' if x.Student == 'Yes' else x.Occupation, axis=1)
relevant_2017 = relevant_2017.drop('Student', axis=1)
relevant_2017['Education'] = multiple_choice_responses_2017.FormalEducation
relevant_2017['Language_Recommendation'] = multiple_choice_responses_2017.LanguageRecommendationSelect
relevant_2017['Language_Recommendation_Freeform'] = freeform_responses_2017.LanguageRecommendationFreeForm
relevant_2017['Programming_Experience'] = multiple_choice_responses_2017.Tenure
relevant_2017['Year'] = 2017

### 2018 ###
relevant_2018['Age'] = multiple_choice_responses_2018.Q2
relevant_2018['Age'] = relevant_2018['Age'].replace({'70-79' : '70+', '80+' : '70+'}) 
relevant_2018['Compensation_Amout'] = multiple_choice_responses_2018.Q9
relevant_2018['Compensation_Currency'] = 'USD'
relevant_2018['Compensation'] = relevant_2018['Compensation_Amout'].apply(lambda x: unify_compensation_2018(x))
relevant_2018['Country'] = multiple_choice_responses_2018.Q3
relevant_2018['Occupation'] = multiple_choice_responses_2018.Q6
relevant_2018['Occupation_Freeform'] = freeform_responses_2018.Q6_OTHER_TEXT
relevant_2018['Education'] = multiple_choice_responses_2018.Q4
relevant_2018['Language_Recommendation'] = multiple_choice_responses_2018.Q18
relevant_2018['Language_Recommendation_Freeform'] = freeform_responses_2018.Q18_OTHER_TEXT
relevant_2018['Programming_Experience'] = multiple_choice_responses_2018.Q24

relevant_2018['Hosted_Notebook_Products__Kaggle_Notebooks'] = multiple_choice_responses_2018.Q14_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2018['Hosted_Notebook_Products__Google_Colab'] = multiple_choice_responses_2018.Q14_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2018['Hosted_Notebook_Products__Azure_Notebooks'] = multiple_choice_responses_2018.Q14_Part_3.apply(lambda x: 0 if x != x else 1) | freeform_responses_2018.Q14_OTHER_TEXT.apply(lambda x: 1 if ('databricks' in str(x).lower()) else 0) 
relevant_2018['Hosted_Notebook_Products__Google_Cloud_Datalab'] = multiple_choice_responses_2018.Q14_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2018['Hosted_Notebook_Products__Paperspace_Gradient'] = multiple_choice_responses_2018.Q14_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2018['Hosted_Notebook_Products__FloydHub'] = multiple_choice_responses_2018.Q14_Part_7.apply(lambda x: 0 if x != x else 1)
relevant_2018['Hosted_Notebook_Products__Binder_JupyterHub'] = multiple_choice_responses_2018.Q14_Part_9.apply(lambda x: 0 if x != x else 1)
relevant_2018['Hosted_Notebook_Products__IBM_Watson_Studio'] = freeform_responses_2018.Q14_OTHER_TEXT.apply(lambda x: 1 if (('ibm' in str(x).lower()) |('watson' in str(x).lower())) else 0) 
relevant_2018['Hosted_Notebook_Products__Code_Ocean'] = freeform_responses_2018.Q14_OTHER_TEXT.apply(lambda x: 1 if ('ocean' in str(x).lower()) else 0) 
relevant_2018['Hosted_Notebook_Products__AWS_Notebook'] = freeform_responses_2018.Q14_OTHER_TEXT.apply(lambda x: 1 if ('aws' in str(x).lower()) else 0) | freeform_responses_2018.Q14_OTHER_TEXT.apply(lambda x: 1 if ('sagemaker' in str(x).lower()) else 0) 
relevant_2018['Hosted_Notebook_Products__Domino_Datalab'] = multiple_choice_responses_2018.Q14_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2018['Hosted_Notebook_Products__Crestle'] = multiple_choice_responses_2018.Q14_Part_8.apply(lambda x: 0 if x != x else 1)
relevant_2018['Hosted_Notebook_Products__Cocalc'] = freeform_responses_2018.Q14_OTHER_TEXT.apply(lambda x: 1 if ('cocalc' in str(x).lower()) else 0) 
relevant_2018['Hosted_Notebook_Products__Datalore'] = freeform_responses_2018.Q14_OTHER_TEXT.apply(lambda x: 1 if ('datalore' in str(x).lower()) else 0) 
relevant_2018['Hosted_Notebook_Products__Databricks'] = freeform_responses_2018.Q14_OTHER_TEXT.apply(lambda x: 1 if ('brick' in str(x).lower()) else 0) 
relevant_2018['Hosted_Notebook_Products__None'] = multiple_choice_responses_2018.Q14_Part_10.apply(lambda x: 0 if x != x else 1)
relevant_2018['Hosted_Notebook_Products__Other'] = freeform_responses_2018.Q14_OTHER_TEXT

relevant_2018['Activities__Analyze_and_understand_data'] = multiple_choice_responses_2018.Q11_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2018['Activities__Build_ML_service'] = multiple_choice_responses_2018.Q11_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2018['Activities__Build_data_infrastructure'] = multiple_choice_responses_2018.Q11_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2018['Activities__Build_Experimentation'] = 0
relevant_2018['Activities__Build_prototypes'] = multiple_choice_responses_2018.Q11_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2018['Activities__Research'] = multiple_choice_responses_2018.Q11_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2018['Activities__None'] = multiple_choice_responses_2018.Q11_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2018['Activities__Other'] = freeform_responses_2018.Q11_OTHER_TEXT

relevant_2018['Framework__TensorFlow'] = multiple_choice_responses_2018.Q19_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2018['Framework__Keras'] = multiple_choice_responses_2018.Q19_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2018['Framework__PyTorch'] = multiple_choice_responses_2018.Q19_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2018['Framework__TensorFlow_Keras'] = relevant_2018['Framework__TensorFlow'] & relevant_2018['Framework__Keras']
relevant_2018['Framework__TensorFlow_Keras_PyTorch'] = relevant_2018['Framework__TensorFlow'] | relevant_2018['Framework__Keras'] | relevant_2018['Framework__PyTorch']

relevant_2018['Cloud_Platform__GCP'] = multiple_choice_responses_2018.Q15_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2018['Cloud_Platform__AWS'] = multiple_choice_responses_2018.Q15_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2018['Cloud_Platform__Azure'] = multiple_choice_responses_2018.Q15_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2018['Cloud_Platform__IBM'] = multiple_choice_responses_2018.Q15_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2018['Cloud_Platform__Alibaba'] = multiple_choice_responses_2018.Q15_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2018['Cloud_Platform__Salesforce'] = freeform_responses_2018.Q15_OTHER_TEXT.apply(lambda x: 1 if ('force' in str(x).lower()) else 0) 
relevant_2018['Cloud_Platform__Oracle'] = freeform_responses_2018.Q15_OTHER_TEXT.apply(lambda x: 1 if ('oracle' in str(x).lower()) else 0) 
relevant_2018['Cloud_Platform__SAP'] = freeform_responses_2018.Q15_OTHER_TEXT.apply(lambda x: 1 if ('sap ' in str(x).lower()) else 0) 
relevant_2018['Cloud_Platform__VMWare'] = freeform_responses_2018.Q15_OTHER_TEXT.apply(lambda x: 1 if ('vm' in str(x).lower()) else 0) 
relevant_2018['Cloud_Platform__Red_Hat'] = freeform_responses_2018.Q15_OTHER_TEXT.apply(lambda x: 1 if ('red' in str(x).lower()) else 0) 
relevant_2018['Cloud_Platform__None'] = multiple_choice_responses_2018.Q15_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2018['Cloud_Platform__Other'] = freeform_responses_2018.Q15_OTHER_TEXT.apply(lambda x: 0 if x != x else 1)

relevant_2018['Media__Twitter'] = multiple_choice_responses_2018.Q38_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2018['Media__Newsletters'] = multiple_choice_responses_2018.Q38_Part_2.apply(lambda x: 0 if x != x else 1) | multiple_choice_responses_2018.Q38_Part_7.apply(lambda x: 0 if x != x else 1) | multiple_choice_responses_2018.Q38_Part_15.apply(lambda x: 0 if x != x else 1)
relevant_2018['Media__Reddit'] = multiple_choice_responses_2018.Q38_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2018['Media__Kaggle'] = multiple_choice_responses_2018.Q38_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2018['Media__Course_Forums'] = multiple_choice_responses_2018.Q38_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2018['Media__YouTube'] = multiple_choice_responses_2018.Q38_Part_6.apply(lambda x: 0 if x != x else 1) | multiple_choice_responses_2018.Q38_Part_9.apply(lambda x: 0 if x != x else 1)
relevant_2018['Media__Podcast'] = multiple_choice_responses_2018.Q38_Part_8.apply(lambda x: 0 if x != x else 1) | multiple_choice_responses_2018.Q38_Part_16.apply(lambda x: 0 if x != x else 1) | multiple_choice_responses_2018.Q38_Part_17.apply(lambda x: 0 if x != x else 1)
relevant_2018['Media__Blogs'] = multiple_choice_responses_2018.Q38_Part_10.apply(lambda x: 0 if x != x else 1) | multiple_choice_responses_2018.Q38_Part_13.apply(lambda x: 0 if x != x else 1) | multiple_choice_responses_2018.Q38_Part_14.apply(lambda x: 0 if x != x else 1)  | multiple_choice_responses_2018.Q38_Part_18.apply(lambda x: 0 if x != x else 1)  | multiple_choice_responses_2018.Q38_Part_19.apply(lambda x: 0 if x != x else 1)  | multiple_choice_responses_2018.Q38_Part_20.apply(lambda x: 0 if x != x else 1)
relevant_2018['Media__Journal_Publications'] = multiple_choice_responses_2018.Q38_Part_11.apply(lambda x: 0 if x != x else 1) | multiple_choice_responses_2018.Q38_Part_12.apply(lambda x: 0 if x != x else 1)
relevant_2018['Media__Slack'] = multiple_choice_responses_2018.Q38_OTHER_TEXT.apply(lambda x: 1 if ('slack' in str(x).lower()) else 0) 
relevant_2018['Media__None'] = multiple_choice_responses_2018.Q38_Part_21.apply(lambda x: 0 if x != x else 1)
relevant_2018['Media__Other'] = multiple_choice_responses_2018.Q38_Part_22.apply(lambda x: 0 if x != x else 1)

relevant_2018['Visual__ggplot2'] = multiple_choice_responses_2018.Q21_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2018['Visual__Matplotlib'] = multiple_choice_responses_2018.Q21_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2018['Visual__Altair'] = multiple_choice_responses_2018.Q21_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2018['Visual__Shiny'] = multiple_choice_responses_2018.Q21_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2018['Visual__D3.js'] = multiple_choice_responses_2018.Q21_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2018['Visual__Plotly'] = multiple_choice_responses_2018.Q21_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2018['Visual__Bokeh'] = multiple_choice_responses_2018.Q21_Part_7.apply(lambda x: 0 if x != x else 1)
relevant_2018['Visual__Seaborn'] = multiple_choice_responses_2018.Q21_Part_8.apply(lambda x: 0 if x != x else 1)
relevant_2018['Visual__Geoplotlib'] = multiple_choice_responses_2018.Q21_Part_9.apply(lambda x: 0 if x != x else 1)
relevant_2018['Visual__Leaflet_Folium'] = multiple_choice_responses_2018.Q21_Part_10.apply(lambda x: 0 if x != x else 1)

relevant_2018['Year'] = 2018

### 2019 ###
relevant_2019['Age'] = multiple_choice_responses_2019.Q1
relevant_2019['Compensation_Amout'] = multiple_choice_responses_2019.Q10
relevant_2019['Compensation_Currency'] = 'USD'
relevant_2019['Compensation'] = relevant_2019['Compensation_Amout'].apply(lambda x: unify_compensation_2019(x))
relevant_2019['Country'] = multiple_choice_responses_2019.Q3
relevant_2019['Occupation'] = multiple_choice_responses_2019.Q5
relevant_2019['Occupation_Freeform'] = freeform_responses_2019.Q5_OTHER_TEXT
relevant_2019['Education'] = multiple_choice_responses_2019.Q4
relevant_2019['Language_Recommendation'] = multiple_choice_responses_2019.Q19
relevant_2019['Language_Recommendation_Freeform'] = freeform_responses_2019.Q19_OTHER_TEXT
relevant_2019['Programming_Experience'] = multiple_choice_responses_2019.Q15

relevant_2019['Hosted_Notebook_Products__Kaggle_Notebooks'] = multiple_choice_responses_2019.Q17_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2019['Hosted_Notebook_Products__Google_Colab'] = multiple_choice_responses_2019.Q17_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2019['Hosted_Notebook_Products__Azure_Notebooks'] = multiple_choice_responses_2019.Q17_Part_3.apply(lambda x: 0 if x != x else 1) | freeform_responses_2019.Q17_OTHER_TEXT.apply(lambda x: 1 if ('databricks' in str(x).lower()) else 0) 
relevant_2019['Hosted_Notebook_Products__Google_Cloud_Datalab'] = multiple_choice_responses_2019.Q17_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2019['Hosted_Notebook_Products__Paperspace_Gradient'] = multiple_choice_responses_2019.Q17_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2019['Hosted_Notebook_Products__FloydHub'] = multiple_choice_responses_2019.Q17_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2019['Hosted_Notebook_Products__Binder_JupyterHub'] = multiple_choice_responses_2019.Q17_Part_7.apply(lambda x: 0 if x != x else 1)
relevant_2019['Hosted_Notebook_Products__IBM_Watson_Studio'] = multiple_choice_responses_2019.Q17_Part_8.apply(lambda x: 0 if x != x else 1)
relevant_2019['Hosted_Notebook_Products__Code_Ocean'] = multiple_choice_responses_2019.Q17_Part_9.apply(lambda x: 0 if x != x else 1)
relevant_2019['Hosted_Notebook_Products__AWS_Notebook'] = multiple_choice_responses_2019.Q17_Part_10.apply(lambda x: 0 if x != x else 1) | freeform_responses_2019.Q17_OTHER_TEXT.apply(lambda x: 1 if ('sagemaker' in str(x).lower()) else 0) 
relevant_2019['Hosted_Notebook_Products__Domino_Datalab'] = freeform_responses_2019.Q17_OTHER_TEXT.apply(lambda x: 1 if ('domino' in str(x).lower()) else 0) 
relevant_2019['Hosted_Notebook_Products__Crestle'] = freeform_responses_2019.Q17_OTHER_TEXT.apply(lambda x: 1 if ('crestle' in str(x).lower()) else 0) 
relevant_2019['Hosted_Notebook_Products__Cocalc'] = freeform_responses_2019.Q17_OTHER_TEXT.apply(lambda x: 1 if ('cocalc' in str(x).lower()) else 0) 
relevant_2019['Hosted_Notebook_Products__Datalore'] = freeform_responses_2019.Q17_OTHER_TEXT.apply(lambda x: 1 if ('datalore' in str(x).lower()) else 0) 
relevant_2019['Hosted_Notebook_Products__Databricks'] = freeform_responses_2019.Q17_OTHER_TEXT.apply(lambda x: 1 if ('brick' in str(x).lower()) else 0) 
relevant_2019['Hosted_Notebook_Products__None'] = multiple_choice_responses_2019.Q17_Part_11.apply(lambda x: 0 if x != x else 1)
relevant_2019['Hosted_Notebook_Products__Other'] = freeform_responses_2019.Q17_OTHER_TEXT

relevant_2019['Activities__Analyze_and_understand_data'] = multiple_choice_responses_2019.Q9_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2019['Activities__Build_ML_service'] = multiple_choice_responses_2019.Q9_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2019['Activities__Build_data_infrastructure'] = multiple_choice_responses_2019.Q9_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2019['Activities__Build_prototypes'] = multiple_choice_responses_2019.Q9_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2019['Activities__Build_Experimentation'] = multiple_choice_responses_2019.Q9_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2019['Activities__Research'] = multiple_choice_responses_2019.Q9_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2019['Activities__None'] = multiple_choice_responses_2019.Q9_Part_7.apply(lambda x: 0 if x != x else 1)
relevant_2019['Activities__Other'] = freeform_responses_2019.Q9_OTHER_TEXT

relevant_2019['Algorithms__Linear_or_Logistic_Regression'] = multiple_choice_responses_2019.Q24_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2019['Algorithms__Decision_Trees_or_Random_Forests'] = multiple_choice_responses_2019.Q24_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2019['Algorithms__Gradient_Boosting_Machines'] = multiple_choice_responses_2019.Q24_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2019['Algorithms__Bayesian_Approaches'] = multiple_choice_responses_2019.Q24_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2019['Algorithms__Evolutionary_Approaches'] = multiple_choice_responses_2019.Q24_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2019['Algorithms__Dense_Neural_Networks'] = multiple_choice_responses_2019.Q24_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2019['Algorithms__CNN'] = multiple_choice_responses_2019.Q24_Part_7.apply(lambda x: 0 if x != x else 1)
relevant_2019['Algorithms__GAN'] = multiple_choice_responses_2019.Q24_Part_8.apply(lambda x: 0 if x != x else 1)
relevant_2019['Algorithms__RNN'] = multiple_choice_responses_2019.Q24_Part_9.apply(lambda x: 0 if x != x else 1)
relevant_2019['Algorithms__Transformer_Networks'] = multiple_choice_responses_2019.Q24_Part_10.apply(lambda x: 0 if x != x else 1)
relevant_2019['Algorithms__None'] = multiple_choice_responses_2019.Q24_Part_11.apply(lambda x: 0 if x != x else 1)
relevant_2019['Algorithms__Other'] = freeform_responses_2019.Q24_OTHER_TEXT

relevant_2019['Algorithm_Cluster_Traditional_ML'] = relevant_2019.Algorithms__Linear_or_Logistic_Regression | relevant_2019.Algorithms__Decision_Trees_or_Random_Forests | relevant_2019.Algorithms__Gradient_Boosting_Machines | relevant_2019.Algorithms__Bayesian_Approaches | relevant_2019.Algorithms__Evolutionary_Approaches | relevant_2019.Algorithms__Dense_Neural_Networks
relevant_2019['Algorithm_Cluster_Deep_Learning_Vision'] = relevant_2019.Algorithms__CNN | relevant_2019.Algorithms__GAN
relevant_2019['Algorithm_Cluster_Deep_Learning_NLP'] = relevant_2019.Algorithms__RNN | relevant_2019.Algorithms__Transformer_Networks
relevant_2019['Algorithm_Cluster_Deep_Learning'] = relevant_2019['Algorithm_Cluster_Deep_Learning_Vision'] | relevant_2019['Algorithm_Cluster_Deep_Learning_NLP']
relevant_2019['Algorithm_Cluster_Other'] = relevant_2019['Algorithms__Other'].apply(lambda x: 0 if x != x else 1)
relevant_2019['Algorithm_Cluster_None'] = ((relevant_2019.Algorithm_Cluster_Traditional_ML == 0) & (relevant_2019.Algorithm_Cluster_Deep_Learning_Vision == 0 ) & (relevant_2019.Algorithm_Cluster_Deep_Learning_NLP == 0) & (relevant_2019.Algorithm_Cluster_Other == 0)).astype(int)


relevant_2019['Hardware__CPU'] = multiple_choice_responses_2019.Q21_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2019['Hardware__GPU'] = multiple_choice_responses_2019.Q21_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2019['Hardware__TPU'] = multiple_choice_responses_2019.Q21_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2019['Hardware__None'] = multiple_choice_responses_2019.Q21_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2019['Hardware__Other'] = freeform_responses_2019.Q21_OTHER_TEXT

relevant_2019['Framework__TensorFlow'] = multiple_choice_responses_2019.Q28_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2019['Framework__Keras'] = multiple_choice_responses_2019.Q28_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2019['Framework__PyTorch'] = multiple_choice_responses_2019.Q28_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2019['Framework__TensorFlow_Keras'] = relevant_2019['Framework__TensorFlow'] & relevant_2019['Framework__Keras']
relevant_2019['Framework__TensorFlow_Keras_PyTorch'] = relevant_2019['Framework__TensorFlow'] | relevant_2019['Framework__Keras'] | relevant_2019['Framework__PyTorch']

relevant_2019['Cloud_Platform__GCP'] = multiple_choice_responses_2019.Q29_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2019['Cloud_Platform__AWS'] = multiple_choice_responses_2019.Q29_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2019['Cloud_Platform__Azure'] = multiple_choice_responses_2019.Q29_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2019['Cloud_Platform__IBM'] = multiple_choice_responses_2019.Q29_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2019['Cloud_Platform__Alibaba'] = multiple_choice_responses_2019.Q29_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2019['Cloud_Platform__Salesforce'] = multiple_choice_responses_2019.Q29_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2019['Cloud_Platform__Oracle'] = multiple_choice_responses_2019.Q29_Part_7.apply(lambda x: 0 if x != x else 1)
relevant_2019['Cloud_Platform__SAP'] = multiple_choice_responses_2019.Q29_Part_8.apply(lambda x: 0 if x != x else 1)
relevant_2019['Cloud_Platform__VMWare'] = multiple_choice_responses_2019.Q29_Part_9.apply(lambda x: 0 if x != x else 1) 
relevant_2019['Cloud_Platform__Red_Hat'] = multiple_choice_responses_2019.Q29_Part_10.apply(lambda x: 0 if x != x else 1)
relevant_2019['Cloud_Platform__None'] = multiple_choice_responses_2019.Q29_Part_11.apply(lambda x: 0 if x != x else 1)
relevant_2019['Cloud_Platform__Other'] = freeform_responses_2019.Q29_OTHER_TEXT.apply(lambda x: 0 if x != x else 1)

relevant_2019['Media__Twitter'] = multiple_choice_responses_2019.Q12_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2019['Media__Newsletters'] = multiple_choice_responses_2019.Q12_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2019['Media__Reddit'] = multiple_choice_responses_2019.Q12_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2019['Media__Kaggle'] = multiple_choice_responses_2019.Q12_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2019['Media__Course_Forums'] = multiple_choice_responses_2019.Q12_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2019['Media__YouTube'] = multiple_choice_responses_2019.Q12_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2019['Media__Podcast'] = multiple_choice_responses_2019.Q12_Part_7.apply(lambda x: 0 if x != x else 1)
relevant_2019['Media__Blogs'] = multiple_choice_responses_2019.Q12_Part_8.apply(lambda x: 0 if x != x else 1)
relevant_2019['Media__Journal_Publications'] = multiple_choice_responses_2019.Q12_Part_9.apply(lambda x: 0 if x != x else 1)
relevant_2019['Media__Slack'] = multiple_choice_responses_2019.Q12_Part_10.apply(lambda x: 0 if x != x else 1) 
relevant_2019['Media__None'] = multiple_choice_responses_2019.Q12_Part_11.apply(lambda x: 0 if x != x else 1)
relevant_2019['Media__Other'] = multiple_choice_responses_2019.Q12_OTHER_TEXT.apply(lambda x: 0 if x != x else 1)

relevant_2019['Visual__ggplot2'] = multiple_choice_responses_2019.Q20_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2019['Visual__Matplotlib'] = multiple_choice_responses_2019.Q20_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2019['Visual__Altair'] = multiple_choice_responses_2019.Q20_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2019['Visual__Shiny'] = multiple_choice_responses_2019.Q20_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2019['Visual__D3.js'] = multiple_choice_responses_2019.Q20_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2019['Visual__Plotly'] = multiple_choice_responses_2019.Q20_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2019['Visual__Bokeh'] = multiple_choice_responses_2019.Q20_Part_7.apply(lambda x: 0 if x != x else 1)
relevant_2019['Visual__Seaborn'] = multiple_choice_responses_2019.Q20_Part_8.apply(lambda x: 0 if x != x else 1)
relevant_2019['Visual__Geoplotlib'] = multiple_choice_responses_2019.Q20_Part_9.apply(lambda x: 0 if x != x else 1)
relevant_2019['Visual__Leaflet_Folium'] = multiple_choice_responses_2019.Q20_Part_10.apply(lambda x: 0 if x != x else 1)

relevant_2019['Year'] = 2019


### 2020 ###
relevant_2020['Age'] = multiple_choice_responses_2020.Q1
relevant_2020['Compensation_Amout'] = multiple_choice_responses_2020.Q24
relevant_2020['Compensation_Currency'] = 'USD'
relevant_2020['Compensation'] = relevant_2020['Compensation_Amout'].apply(lambda x: unify_compensation_2019(x))
relevant_2020['Country'] = multiple_choice_responses_2020.Q3
relevant_2020['Occupation'] = multiple_choice_responses_2020.Q5
#relevant_2020['Occupation_Freeform'] = multiple_choice_responses_2020.Q5_OTHER
relevant_2020['Education'] = multiple_choice_responses_2020.Q4
relevant_2020['Language_Recommendation'] = multiple_choice_responses_2020.Q8
#relevant_2020['Language_Recommendation_Freeform'] = multiple_choice_responses_2020.Q8_OTHER
relevant_2020['Programming_Experience'] = multiple_choice_responses_2020.Q6
relevant_2020['ML_Experience'] = multiple_choice_responses_2020.Q15

relevant_2020['Hosted_Notebook_Products__Kaggle_Notebooks'] = multiple_choice_responses_2020.Q10_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__Google_Colab'] = multiple_choice_responses_2020.Q10_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__Azure_Notebooks'] = multiple_choice_responses_2020.Q10_Part_3.apply(lambda x: 0 if x != x else 1) | multiple_choice_responses_2020.Q17_OTHER.apply(lambda x: 1 if ('databricks' in str(x).lower()) else 0) 
relevant_2020['Hosted_Notebook_Products__Google_Cloud_Datalab'] = multiple_choice_responses_2020.Q10_Part_11.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__Paperspace_Gradient'] = multiple_choice_responses_2020.Q10_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__FloydHub'] = multiple_choice_responses_2020.Q10_OTHER.apply(lambda x: 1 if ('floyd' in str(x).lower()) else 0)
relevant_2020['Hosted_Notebook_Products__Binder_JupyterHub'] = multiple_choice_responses_2020.Q10_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__IBM_Watson_Studio'] = multiple_choice_responses_2020.Q10_Part_7.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__Code_Ocean'] = multiple_choice_responses_2020.Q10_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__AWS_Notebook'] = multiple_choice_responses_2020.Q10_Part_9.apply(lambda x: 0 if x != x else 1) | multiple_choice_responses_2020.Q10_Part_8.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__Domino_Datalab'] = multiple_choice_responses_2020.Q10_OTHER.apply(lambda x: 1 if ('domino' in str(x).lower()) else 0) 
relevant_2020['Hosted_Notebook_Products__Crestle'] = multiple_choice_responses_2020.Q10_OTHER.apply(lambda x: 1 if ('crestle' in str(x).lower()) else 0) 
relevant_2020['Hosted_Notebook_Products__Cocalc'] = multiple_choice_responses_2020.Q10_OTHER.apply(lambda x: 1 if ('cocalc' in str(x).lower()) else 0) 
relevant_2020['Hosted_Notebook_Products__Datalore'] = multiple_choice_responses_2020.Q10_OTHER.apply(lambda x: 1 if ('datalore' in str(x).lower()) else 0)
relevant_2020['Hosted_Notebook_Products__Google_Cloud_AI'] = multiple_choice_responses_2020.Q10_Part_10.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__Databricks'] = multiple_choice_responses_2020.Q10_Part_12.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__None'] = multiple_choice_responses_2020.Q10_Part_13.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hosted_Notebook_Products__Other'] = multiple_choice_responses_2020.Q10_OTHER

relevant_2020['Activities__Analyze_and_understand_data'] = multiple_choice_responses_2020.Q23_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2020['Activities__Build_ML_service'] = multiple_choice_responses_2020.Q23_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2020['Activities__Build_data_infrastructure'] = multiple_choice_responses_2020.Q23_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2020['Activities__Build_prototypes'] = multiple_choice_responses_2020.Q23_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2020['Activities__Build_Experimentation'] = multiple_choice_responses_2020.Q23_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2020['Activities__Research'] = multiple_choice_responses_2020.Q23_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2020['Activities__None'] = multiple_choice_responses_2020.Q23_Part_7.apply(lambda x: 0 if x != x else 1)
relevant_2020['Activities__Other'] = multiple_choice_responses_2020.Q23_OTHER

relevant_2020['Algorithms__Linear_or_Logistic_Regression'] = multiple_choice_responses_2020.Q17_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__Decision_Trees_or_Random_Forests'] = multiple_choice_responses_2020.Q17_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__Gradient_Boosting_Machines'] = multiple_choice_responses_2020.Q17_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__Bayesian_Approaches'] = multiple_choice_responses_2020.Q17_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__Evolutionary_Approaches'] = multiple_choice_responses_2020.Q17_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__Dense_Neural_Networks'] = multiple_choice_responses_2020.Q17_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__CNN'] = multiple_choice_responses_2020.Q17_Part_7.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__GAN'] = multiple_choice_responses_2020.Q17_Part_8.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__RNN'] = multiple_choice_responses_2020.Q17_Part_9.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__Transformer_Networks'] = multiple_choice_responses_2020.Q17_Part_10.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__None'] = multiple_choice_responses_2020.Q17_Part_11.apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithms__Other'] = multiple_choice_responses_2020.Q17_OTHER

relevant_2020['Algorithm_Cluster_Traditional_ML'] = relevant_2020.Algorithms__Linear_or_Logistic_Regression | relevant_2020.Algorithms__Decision_Trees_or_Random_Forests | relevant_2020.Algorithms__Gradient_Boosting_Machines | relevant_2020.Algorithms__Bayesian_Approaches | relevant_2020.Algorithms__Evolutionary_Approaches | relevant_2020.Algorithms__Dense_Neural_Networks
relevant_2020['Algorithm_Cluster_Deep_Learning_Vision'] = relevant_2020.Algorithms__CNN | relevant_2020.Algorithms__GAN
relevant_2020['Algorithm_Cluster_Deep_Learning_NLP'] = relevant_2020.Algorithms__RNN | relevant_2020.Algorithms__Transformer_Networks
relevant_2020['Algorithm_Cluster_Deep_Learning'] = relevant_2020['Algorithm_Cluster_Deep_Learning_Vision'] | relevant_2020['Algorithm_Cluster_Deep_Learning_NLP']
relevant_2020['Algorithm_Cluster_Other'] = relevant_2020['Algorithms__Other'].apply(lambda x: 0 if x != x else 1)
relevant_2020['Algorithm_Cluster_None'] = ((relevant_2020.Algorithm_Cluster_Traditional_ML == 0) & (relevant_2020.Algorithm_Cluster_Deep_Learning_Vision == 0 ) & (relevant_2020.Algorithm_Cluster_Deep_Learning_NLP == 0) & (relevant_2020.Algorithm_Cluster_Other == 0)).astype(int)


relevant_2020['Hardware__CPU'] = multiple_choice_responses_2020.Q12_OTHER.apply(lambda x: 1 if ('cpu' in str(x).lower()) else 0)
relevant_2020['Hardware__GPU'] = multiple_choice_responses_2020.Q12_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hardware__TPU'] = multiple_choice_responses_2020.Q12_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hardware__None'] = multiple_choice_responses_2020.Q12_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2020['Hardware__Other'] = multiple_choice_responses_2020.Q12_OTHER

relevant_2020['Framework__TensorFlow'] = multiple_choice_responses_2020.Q16_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2020['Framework__Keras'] = multiple_choice_responses_2020.Q16_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2020['Framework__PyTorch'] = multiple_choice_responses_2020.Q16_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2020['Framework__TensorFlow_Keras'] = relevant_2020['Framework__TensorFlow'] & relevant_2020['Framework__Keras']
relevant_2020['Framework__TensorFlow_Keras_PyTorch'] = relevant_2020['Framework__TensorFlow'] | relevant_2020['Framework__Keras'] | relevant_2020['Framework__PyTorch']

relevant_2020['Cloud_Platform__GCP'] = multiple_choice_responses_2020.Q26_A_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__AWS'] = multiple_choice_responses_2020.Q26_A_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__Azure'] = multiple_choice_responses_2020.Q26_A_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__IBM'] = multiple_choice_responses_2020.Q26_A_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__Alibaba'] = multiple_choice_responses_2020.Q26_A_Part_9.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__Tencent'] = multiple_choice_responses_2020.Q26_A_Part_10.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__Salesforce'] = multiple_choice_responses_2020.Q26_A_Part_7.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__Oracle'] = multiple_choice_responses_2020.Q26_A_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__SAP'] = multiple_choice_responses_2020.Q26_A_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__VMWare'] = multiple_choice_responses_2020.Q26_A_Part_8.apply(lambda x: 0 if x != x else 1) 
relevant_2020['Cloud_Platform__Red_Hat'] = multiple_choice_responses_2020.Q26_A_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__None'] = multiple_choice_responses_2020.Q26_A_Part_11.apply(lambda x: 0 if x != x else 1)
relevant_2020['Cloud_Platform__Other'] = multiple_choice_responses_2020.Q26_A_OTHER.apply(lambda x: 0 if x != x else 1)


relevant_2020['Media__Twitter'] = multiple_choice_responses_2020.Q39_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__Newsletters'] = multiple_choice_responses_2020.Q39_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__Reddit'] = multiple_choice_responses_2020.Q39_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__Kaggle'] = multiple_choice_responses_2020.Q39_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__Course_Forums'] = multiple_choice_responses_2020.Q39_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__YouTube'] = multiple_choice_responses_2020.Q39_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__Podcast'] = multiple_choice_responses_2020.Q39_Part_7.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__Blogs'] = multiple_choice_responses_2020.Q39_Part_8.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__Journal_Publications'] = multiple_choice_responses_2020.Q39_Part_9.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__Slack'] = multiple_choice_responses_2020.Q39_Part_10.apply(lambda x: 0 if x != x else 1) 
relevant_2020['Media__None'] = multiple_choice_responses_2020.Q39_Part_11.apply(lambda x: 0 if x != x else 1)
relevant_2020['Media__Other'] = multiple_choice_responses_2020.Q39_OTHER.apply(lambda x: 0 if x != x else 1)

relevant_2020['Visual__Matplotlib'] = multiple_choice_responses_2020.Q14_Part_1.apply(lambda x: 0 if x != x else 1)
relevant_2020['Visual__Seaborn'] = multiple_choice_responses_2020.Q14_Part_2.apply(lambda x: 0 if x != x else 1)
relevant_2020['Visual__Plotly'] = multiple_choice_responses_2020.Q14_Part_3.apply(lambda x: 0 if x != x else 1)
relevant_2020['Visual__ggplot2'] = multiple_choice_responses_2020.Q14_Part_4.apply(lambda x: 0 if x != x else 1)
relevant_2020['Visual__Shiny'] = multiple_choice_responses_2020.Q14_Part_5.apply(lambda x: 0 if x != x else 1)
relevant_2020['Visual__D3.js'] = multiple_choice_responses_2020.Q14_Part_6.apply(lambda x: 0 if x != x else 1)
relevant_2020['Visual__Altair'] = multiple_choice_responses_2020.Q14_Part_7.apply(lambda x: 0 if x != x else 1)
relevant_2020['Visual__Bokeh'] = multiple_choice_responses_2020.Q14_Part_8.apply(lambda x: 0 if x != x else 1)
relevant_2020['Visual__Geoplotlib'] = multiple_choice_responses_2020.Q14_Part_9.apply(lambda x: 0 if x != x else 1)
relevant_2020['Visual__Leaflet_Folium'] = multiple_choice_responses_2020.Q14_Part_10.apply(lambda x: 0 if x != x else 1)

relevant_2020['Year'] = 2020


relevant_2018 = relevant_2018.loc[1:].reset_index(drop=True)
relevant_2019 = relevant_2019.loc[1:].reset_index(drop=True)
relevant_2020 = relevant_2020.loc[1:].reset_index(drop=True)

# Stack all survey years into one frame; ignore_index=True renumbers the rows from 0
df = pd.concat([relevant_2017, relevant_2018, relevant_2019, relevant_2020],
               ignore_index=True)

In [None]:
def categorize_occupations(x):
    if 'data scien' in x:
        return 'Data Scientist'
    elif 'software developer' in x:
        return 'Software Engineer'
    elif any(s in x for s in ['student', 'intern']):
        return 'Student'
    elif 'research' in x:
        return 'Research Scientist'
    elif 'data analy' in x:
        return 'Data Analyst'
    elif 'ness analy' in x:
        return 'Business Analyst'
    elif any(s in x for s in ['prof', 'teach', 'lect', 'educat', 'faculty', 'academi']):
        return 'Teacher/Professor'
    elif any(s in x for s in ['ngineer', 'enginner']):
        return 'Other Engineer'
    elif 'project' in x:
        return 'Product/Project Manager'
    elif any(s in x for s in ['lead', 'manag', 'head', 'direct', 'dircetor']):
        return 'Manager'
    elif any(s in x for s in ['chief', 'cto', 'ceo', 'coo','cfo', 'cio', 'cdo']):
        return 'Chief Officer'
    elif 'test' in x:
        return 'Software Tester'
    elif 'tired' in x:
        return 'Not employed'
    elif any(s in x for s in ['writer', 'journal']):
        return 'Data Journalist'
    else:
        return 'Other'

dict_occupation = {'Software Developer/Software Engineer' : 'Software Engineer',
               'Researcher' : 'Research Scientist',
               'Scientist/Researcher' : 'Research Scientist',
               'Research Assistant' : 'Research Scientist',
                   
               'Marketing Analyst' : 'Other',
                   'Salesperson' : 'Other',
                   
                   'Consultant' : 'Other',
              }
df['Occupation'] = df['Occupation'].replace(dict_occupation)

df['Occupation_Freeform'] = df['Occupation_Freeform'].apply(lambda x: str(x).lower())
df['Occupation_Freeform'] = df['Occupation_Freeform'].apply(lambda x: re.sub(r'\W+',' ', x))
df['Occupation_Freeform'] = df['Occupation_Freeform'].apply(lambda x: re.sub('senior','', x))
df['Occupation_Freeform'] = df['Occupation_Freeform'].apply(lambda x: categorize_occupations(x))
df['Occupation'] = df.apply(lambda x: x.Occupation_Freeform if x.Occupation == 'Other' else x.Occupation, axis= 1)#.value_counts().to_frame().head(50)

df.Activities__Other = df.Activities__Other.fillna('NaN')
df.Occupation = df.apply(lambda x: 'Student' if (('student' in x.Activities__Other.lower()) & ~('teach' in x.Activities__Other.lower()) & ~('educate' in x.Activities__Other.lower()) & (x.Occupation != 'Student')) else x.Occupation, axis = 1)

#df.Occupation.value_counts()

# Characteristics of a Kaggler
Kaggle the platform hosts data science competitions, while Kaggle the community is about learning and sharing data science knowledge. A surface-level look at the typical Kaggler would probably yield only demographics such as male, mid-twenties, and college-educated. However, if we look deeper, we can see that **Kagglers are curious and hardworking individuals, ready for a challenge, and open to learning new things**.

Let me show you why I think the survey data indicates this:

## Kagglers are Levelling Up in Deep Learning
This year, we have seen a large selection of different competitions on Kaggle. 
My feeling is that **competitions with tabular data might be more popular** among Kagglers. I think this is because **we are all here to learn**, and competitions with tabular data are especially beginner-friendly: they can be tackled with traditional machine learning algorithms, which are usually easier for beginners to apply.

Let's explore which machine learning algorithms are part of Kagglers' skill set:
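
The indicator columns aggregated in the next cell were built with the idiom `lambda x: 0 if x != x else 1`. It relies on `NaN` being the only value that is not equal to itself, so an unselected multiple-choice option (stored as `NaN`) becomes 0 and any selected option becomes 1. A minimal sketch on made-up answers (the column name `Q_Part_1` is hypothetical) suggests that `notna().astype(int)` gives the same result:

```python
import numpy as np
import pandas as pd

# Toy multi-select column: the option text if ticked, NaN otherwise (hypothetical data)
answers = pd.Series(["GPUs", np.nan, "GPUs", np.nan], name="Q_Part_1", dtype=object)

# Idiom used in this notebook: NaN != NaN evaluates to True
via_lambda = answers.apply(lambda x: 0 if x != x else 1)

# Vectorized equivalent
via_notna = answers.notna().astype(int)

print(via_lambda.tolist())
print(via_notna.tolist())
```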

In [None]:
temp = df.groupby('Year').agg(Survey_Participants = ('Year', 'count' ),
                       Algorithm_Cluster_Traditional_ML = ('Algorithm_Cluster_Traditional_ML', 'sum'),
                       Algorithm_Cluster_Deep_Learning_Vision = ('Algorithm_Cluster_Deep_Learning_Vision', 'sum'),
                       Algorithm_Cluster_Deep_Learning_NLP = ('Algorithm_Cluster_Deep_Learning_NLP', 'sum'),
                       Algorithm_Cluster_Deep_Learning = ('Algorithm_Cluster_Deep_Learning', 'sum'),
                       Algorithms__None = ('Algorithms__None', 'sum'),
                              Algorithms_Cluster_Other = ('Algorithm_Cluster_Other', 'sum'),
                      )

for c in temp.columns:
    if c != 'Survey_Participants':
        temp[c] = temp[c] / temp['Survey_Participants'] *100


f, ax = plt.subplots(nrows=1, ncols=2, figsize=(16,8))
ax[0].set_title("Percentage of Kagglers' \nUsage of Machine Learning Algorithms", fontsize=16)

group1 = int((df[(df.Year == 2020) & (df.Algorithm_Cluster_Traditional_ML == 0) & (df.Algorithm_Cluster_Deep_Learning != 0)].shape[0] / df[(df.Year == 2020)].shape[0])*100)
group2 = int((df[(df.Year == 2020) & (df.Algorithm_Cluster_Traditional_ML != 0) & (df.Algorithm_Cluster_Deep_Learning == 0)].shape[0] / df[(df.Year == 2020)].shape[0])*100) 
intersection = int((df[(df.Year == 2020) & (df.Algorithm_Cluster_Traditional_ML != 0) & (df.Algorithm_Cluster_Deep_Learning != 0)].shape[0] / df[(df.Year == 2020)].shape[0])*100)

v1 = venn2(subsets = (group1, group2, intersection),
          set_labels = ( '', '', ''),
          set_colors=( 'deepskyblue', 'lightgrey'),
           alpha=1,
           ax=ax[0])

v1.get_patch_by_id('11').set_color('skyblue')
c1 = venn2_circles(subsets = (group1, group2, intersection), color='skyblue', ax=ax[0])
c1[0].set_lw(3.0)
c1[1].set_lw(0.0)
ax[0].annotate('Deep Learning', xy=v1.get_label_by_id('10').get_position() - np.array([0, -0.05]), xytext=(-70,70),
ha='center', textcoords='offset points', 
arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0.5',color='gray'))

ax[0].annotate('Traditional Machine Learning\n (Regression, Decision Trees, \nGBM, etc.)', xy=v1.get_label_by_id('01').get_position() - np.array([0, -0.05]), xytext=(70,70),
ha='center', textcoords='offset points', 
arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=-0.5',color='gray'))

ax[0].annotate('Both', xy=v1.get_label_by_id('11').get_position() - np.array([0, -0.05]), xytext=(0,70),
ha='center', textcoords='offset points',
arrowprops=dict(arrowstyle='->', color='gray'))



ax[1].set_title("Percentage of Kagglers' \nUsage of Deep Learning Algorithms", fontsize=16)
group1 = int((df[(df.Year == 2020) & (df.Algorithm_Cluster_Deep_Learning_NLP == 0) & (df.Algorithm_Cluster_Deep_Learning_Vision != 0)].shape[0] / df[(df.Year == 2020)].shape[0])*100)
group2 = int((df[(df.Year == 2020) & (df.Algorithm_Cluster_Deep_Learning_NLP != 0) & (df.Algorithm_Cluster_Deep_Learning_Vision == 0)].shape[0] / df[(df.Year == 2020)].shape[0])*100) 
intersection = int((df[(df.Year == 2020) & (df.Algorithm_Cluster_Deep_Learning_NLP != 0) & (df.Algorithm_Cluster_Deep_Learning_Vision != 0)].shape[0] / df[(df.Year == 2020)].shape[0])*100)

v2 = venn2(subsets = (group1, group2, intersection),
          set_labels = ( '', '', ''),
          set_colors=( 'lightskyblue', 'skyblue'),ax=ax[1])
v2.get_patch_by_id('11').set_color('deepskyblue')
v2.get_patch_by_id('11').set_alpha(1)
#c2 = venn2_circles(subsets = (group1, group2, intersection),linewidth=3, color='skyblue', ax=ax[1])

ax[1].annotate('Computer Vision', xy=v2.get_label_by_id('10').get_position() - np.array([0, -0.05]), xytext=(-70,60),
ha='center', textcoords='offset points', 
arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0.5',color='gray'))

ax[1].annotate('NLP', xy=v2.get_label_by_id('01').get_position() - np.array([0, -0.05]), xytext=(70,60),
ha='center', textcoords='offset points', 
arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=-0.5',color='gray'))

ax[1].annotate('Both', xy=v2.get_label_by_id('11').get_position() - np.array([0, -0.05]), xytext=(0,60),
ha='center', textcoords='offset points',
arrowprops=dict(arrowstyle='->', color='gray'))


plt.show()

temp.columns = ['Participants', 'Traditional ML',
       'Deep Learning (Vision)',
       'Deep Learning (NLP)',
       'Algorithm_Cluster_Deep_Learning', 'Algorithms__None',
       'Algorithms_Cluster_Other']     
temp[['Participants', 'Traditional ML',
       'Deep Learning (Vision)',
       'Deep Learning (NLP)']].iloc[2:].style.set_caption('Percentage of Respondents Using Machine Learning Algorithms')\
    .format({"Traditional ML": "{:20,.1f}%", "Deep Learning (Vision)": "{:20,.1f}%",  "Deep Learning (NLP)": "{:20,.1f}%"})\
    .background_gradient(subset=['Traditional ML', 'Deep Learning (Vision)', 'Deep Learning (NLP)'], cmap='Blues', vmin=0, vmax=62.0)

We can see that 62% of Kagglers regularly use traditional Machine Learning algorithms such as linear regression, decision trees, etc. 34% of Kagglers regularly use Deep Learning algorithms: **29% of all respondents use them in the vision** domain and **19% use them in the language** domain. Furthermore, 31% of Kagglers regularly use both traditional Machine Learning and Deep Learning algorithms, so most Kagglers using Deep Learning algorithms also use traditional Machine Learning algorithms.

Additionally, the **percentage of Kagglers using Deep Learning algorithms has increased by 1.6% from 2019 to 2020.** I believe this is because Kaggle is constantly pushing us with a lot of interesting challenges. About one year ago, I learned how to use Deep Learning algorithms for the first time on Kaggle. Therefore, I want to say **thank you to all the Kagglers sharing their work and teaching each other and thank you to Kaggle for providing us with new challenges!**
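
The Venn diagram shares quoted above come from two 0/1 indicator columns: respondents who ticked only one group, only the other, or both. As a sketch on toy data (the values below are made up and do not reproduce the survey numbers; `venn_shares` is a hypothetical helper), the three regions could be computed like this:

```python
import pandas as pd

# Hypothetical indicator columns: 1 = respondent uses the algorithm group
toy = pd.DataFrame({
    "traditional_ml": [1, 1, 1, 0, 0, 1, 0, 1],
    "deep_learning":  [0, 1, 1, 1, 0, 0, 0, 1],
})

def venn_shares(frame, a, b):
    """Return the percentage of rows in (a only, b only, both)."""
    n = len(frame)
    only_a = ((frame[a] == 1) & (frame[b] == 0)).sum() / n * 100
    only_b = ((frame[a] == 0) & (frame[b] == 1)).sum() / n * 100
    both = ((frame[a] == 1) & (frame[b] == 1)).sum() / n * 100
    return only_a, only_b, both

only_dl, only_ml, both = venn_shares(toy, "deep_learning", "traditional_ml")
print(only_dl, only_ml, both)
```

The plotting cell truncates each share with `int()` before passing it to `venn2`, so the diagram areas are based on whole percentages.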

## Kagglers' Tier Achievements in 2020
While we are on the topic of **Kagglers levelling up**, let's have a quick look at their achievements in 2020.
54 Kagglers reached the Grandmaster tier this year. This is a lot of hard and consistent work. Good job, guys!
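
The next cell takes the year from `TierAchievementDate` by splitting the string on `/`, which assumes a month/day/year layout. A slightly more defensive sketch, under that same assumed format and on made-up dates, parses the column with `pd.to_datetime` instead:

```python
import pandas as pd

# Hypothetical sample in the assumed MM/DD/YYYY layout
dates = pd.Series(["03/14/2020", "11/02/2019", None, "07/30/2020"])

# errors="coerce" turns unparseable or missing entries into NaT instead of raising
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")
years = parsed.dt.year

print(years.tolist())
print(int((years == 2020).sum()))
```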

In [None]:
kagglers_achievements_df = pd.read_csv("../input/meta-kaggle/UserAchievements.csv")
kagglers_achievements_df = kagglers_achievements_df[kagglers_achievements_df.TierAchievementDate.notna()]
kagglers_achievements_df['TierAchievementYear'] = kagglers_achievements_df['TierAchievementDate'].apply(lambda x: str(x).split('/')[2])

labels = ['Expert', 'Master', 'Grandmaster']
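# value_counts() orders by frequency; sort by tier so the counts line up with `labels`
# (assumes the remaining tier codes ascend from Expert to Grandmaster)
tiers = kagglers_achievements_df[(kagglers_achievements_df.TierAchievementYear == '2020') & (kagglers_achievements_df.Tier > 1)].Tier.value_counts().sort_index().values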

x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots(figsize=(6, 4))
rects = ax.bar(x, tiers, width)

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Number of Kagglers', fontsize=14)
ax.set_ylim([0, 2200])
ax.set_title('Tier Achievements in 2020', fontsize=14)
ax.set_xticks(x)
ax.set_xticklabels(labels, fontsize=14)
rects[0].set_color('plum')
rects[1].set_color('orangered')
rects[2].set_color('gold')


"""Attach a text label above each bar in *rects*, displaying its height."""
for rect in rects:
    height = rect.get_height()
    ax.annotate('{}'.format(height),
                xy=(rect.get_x() + rect.get_width() / 2, height),
                xytext=(0, 3),  # 3 points vertical offset
                textcoords="offset points",
                ha='center', va='bottom')



fig.tight_layout()

plt.show()

## Learning New Things

This year Kaggle has enabled [TPUs on Kaggle kernels](https://www.youtube.com/watch?v=1pdwRQ1DQfY) and launched a few challenges for us to get familiar with TPUs. But how many of us have played around with them?

Although Kaggle kernels have TPUs available and there were a few challenges to learn about TPUs on Kaggle, the majority of Kagglers have never used a TPU. Even Kagglers who have used TPUs before have only used them a few times. **However, the percentage of Kagglers who have used a TPU has doubled compared to 2019**. That is why I think Kagglers are open to new challenges and learning new things. 
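
The "doubled" statement compares the share of respondents with `Hardware__TPU == 1` across two survey years. A sketch on toy data (made-up values, not the survey results) of how such a share and its relative change can be derived:

```python
import pandas as pd

# Hypothetical long table: one row per respondent
toy = pd.DataFrame({
    "Year":          [2019] * 10 + [2020] * 10,
    "Hardware__TPU": [1] + [0] * 9 + [1, 1] + [0] * 8,
})

# The mean of a 0/1 column is the fraction of ones, so * 100 gives a percentage
share = toy.groupby("Year")["Hardware__TPU"].mean() * 100

# Relative change: a ratio of 2 means the share doubled
ratio = share.loc[2020] / share.loc[2019]
print(share.round(1).to_dict(), ratio)
```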

In [None]:
hardware_2019_df = df[(df.Year ==2019)][['Algorithm_Cluster_Traditional_ML',
       'Algorithm_Cluster_Deep_Learning_Vision',
       'Algorithm_Cluster_Deep_Learning_NLP','Hardware__CPU', 'Hardware__GPU', 'Hardware__TPU',
       ]]

hardware_2019_df.columns = ['Traditional ML', 'Deep Learning (Vision)', 'Deep Learning (NLP)','CPU', 'GPU', 'TPU']


f, ax = plt.subplots(nrows=1, ncols=2, figsize=(14,6))

sns.countplot(x=multiple_choice_responses_2020.iloc[1:].Q13, order=['Never',  'Once', '2-5 times','6-25 times', 'More than 25 times'], palette='Blues', ax=ax[0])

ax[0].set_title('TPU Experience')
ax[0].set_ylabel(ylabel = 'Number of Respondents', fontsize = 12)
ax[0].set_xlabel(xlabel = 'Number of Used Times', fontsize = 12)

hardware_2020_df = df[(df.Year ==2020)][['Algorithm_Cluster_Traditional_ML',
       'Algorithm_Cluster_Deep_Learning_Vision',
       'Algorithm_Cluster_Deep_Learning_NLP','Hardware__CPU', 'Hardware__GPU', 'Hardware__TPU',
       ]]
hardware_2020_df.columns = ['Traditional ML', 'Deep Learning (Vision)', 'Deep Learning (NLP)','CPU', 'GPU', 'TPU',]

ax[1].set_title('TPU Usage in 2020', fontsize=14)
ax[1].bar([2019, 2020], 
          [round(hardware_2019_df[['TPU']].sum(axis=0).values[0] / len(hardware_2019_df) * 100, 2), 
           round(hardware_2020_df[['TPU']].sum(axis=0).values[0]/ len(hardware_2020_df) * 100, 2)])

ax[1].set_ylim([0, 5])
ax[1].set_ylabel(ylabel = 'Percentage of Respondents [%]', fontsize = 14)
ax[1].set_xticks([2019, 2020])
ax[1].set_xticklabels(['2019', '2020'])

plt.show()

Since we are all here to learn, let's explore what we can learn from Kagglers who are more experienced in machine learning, professional data scientists, and researchers in the field. In the following, we will explore tips from top Kagglers for people ranging from absolute beginners to experienced data science practitioners. For fun, I have used the Kaggle tiers to denote the experience level. 

# Tips for Novices

In my opinion, [CPMP](https://www.kaggle.com/cpmpml) gives some of the most comprehensive advice for newbies on the [CTDS Podcast](https://chaitimedatascience.com/). The TL;DR version is: learn to code, learn some Machine Learning basics, and then get your hands dirty in a Kaggle competition. In the following section, we will explore this a little bit more.

> "So first practice Python. It's not the only language you can use. You can use R. [...] Take some courses. So I like Andrew Ng's Machine Learning course or the Stanford ML course [...] [D]on't start with deep learning unless all you care about is images or natural language processing [...] and then enter a Kaggle 'Playground' or 'Getting Started' competition [...]" - [CPMP](https://www.kaggle.com/cpmpml)$^*$

$^*$ Please unhide output for full quote.


In [None]:
advice = quote[quote.new_question == 562]
print("Sanyam: '" + advice.interviewer.values[0] + "'\n")
print(advice.heroes.values[0] +  ": '" + advice.guest.values[0] + "'")

## Pick A Programming Language
The data science Venn diagram consists of programming, math, and domain knowledge. All three are important to practice data science. Therefore, [Parul Pandey](https://www.kaggle.com/parulpandey) stresses that you should get a good foundation in coding. But what is the best programming language to get started?

> "[...] [M]ake sure you have a solid background and foundation. I especially say stats and coding - **don't run away from coding**. And coding is such an essential tool kit even  if you're not a data scientist [...] And I think you should really, really focus on coding and you should because if you're good at coding, data science also becomes easier. [...] " -  [Parul Pandey](https://www.kaggle.com/parulpandey)$^*$

In [None]:
parul_interview_df = pd.read_csv("../input/chai-time-data-science/Cleaned Subtitles/E49.csv")
parul_interview_df[parul_interview_df.Time == '53:40'].Text.values[0]

In the figure below, we can see that Python is the dominant choice of programming language among experienced data scientists.

In [None]:
language_recommendation_df = df[df.Language_Recommendation.isin(['Python', 'R', 'SQL', 'Julia'])].groupby('Year').Language_Recommendation.value_counts().to_frame()
language_recommendation_df.columns = ['dummy']
language_recommendation_df = language_recommendation_df.reset_index(drop=False)
language_recommendation_df = language_recommendation_df.pivot(index='Year', columns='Language_Recommendation')['dummy']
language_recommendation_df['n_responses'] = df.groupby('Year').Year.count()

for c in language_recommendation_df.columns:
    if c != 'n_responses':
        language_recommendation_df[c] = language_recommendation_df[c]/language_recommendation_df['n_responses'] * 100
language_recommendation_df = language_recommendation_df.drop('n_responses', axis=1) 
language_recommendation_df = language_recommendation_df.fillna(0)

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8, 6))


language_recommendation_df[['Python', 'R', 'SQL']].plot(ax=ax, marker='x')

ax.set_title('Recommendation for Beginners')
ax.set_ylabel(ylabel = 'Percentage of Respondents [%]', fontsize = 12)
ax.set_xticks([2017, 2018, 2019, 2020])
ax.set_xticklabels(['2017', '2018', '2019', '2020'])
plt.show()

**Python has been the most recommended programming language** for data science newbies since the first survey was launched in 2017. This year about 70% of respondents recommend Python while in 2017 less than half of respondents were recommending it. The popularity of R has been decreasing slowly over the past four years. Although it is not a classical programming language, **more and more Kagglers are now recommending that data science newbies should also know SQL**.

To conclude, **pick Python or R. Both are good choices.** However, Python is definitely more popular. Additionally, if you want to get a job as a data scientist, then you probably should know how to write SQL queries.
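
The per-year percentages in the line chart are counts of each recommended language divided by the number of respondents in that year. A compact sketch of the same kind of calculation with `pd.crosstab` on toy data (made-up answers): note that `normalize="index"` divides by the respondents who answered the question, whereas the notebook cell divides by all respondents of that year, so the two can differ when answers are missing.

```python
import pandas as pd

# Hypothetical answers, one row per respondent
toy = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020],
    "Language_Recommendation": ["Python", "Python", "R", "SQL",
                                "Python", "Python", "Python", "R"],
})

# Rows: survey year, columns: language, values: share of that year's answers in percent
shares = pd.crosstab(toy["Year"], toy["Language_Recommendation"],
                     normalize="index") * 100
print(shares)
```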

On a side note: Some of you may have noticed that Kaggle kernels also now give you the option of **Julia scripts**. Although [Julia](https://julialang.org/) has been around since 2012, this year I have seen Julia being titled the [new kid on the block for data science](https://towardsdatascience.com/introducing-julia-an-alternative-to-python-and-r-for-data-science-dcbf98346253). However, none of the respondents were recommending it for beginners in the past two years. This year, only **0.6% of respondents recommended Julia** for data science newbies.

In [None]:
print(f"{round(language_recommendation_df.iloc[3].Julia,2)}% of respondents recommend learning Julia for beginners.")

## Learn Fundamental Machine Learning Theory
From the above quotes from [CPMP](https://www.kaggle.com/cpmpml) and [Parul Pandey](https://www.kaggle.com/parulpandey), we can already see that getting some fundamental Machine Learning and Statistics knowledge is important.

**(TODO: What online courses? What level of education? )**

## Enter Your First Competition

Now it is time to get some practice. Kaggle competitions are a great way to get started. Pick a competition that interests you. If you feel intimidated by live competitions, you can always pick a Playground or Getting Started competition. But to be honest, don't be intimidated. Start with a simple dummy submission and then build up your solution from there. Read through other people's notebooks and try to incorporate hints from the discussions into your solution.
>"[...] But, you know, I would really say get an account go sign in and you know, go go put a submission in you know, make it all zeros make it the average, you know, and just just improve. It's a it's an iterative process to almost all of us. [...] And most people will have better results if they can chip away at that problem in bite sized chunks. So the first thing is get the data in, **go submit something really basic and then just iterate from there** and see where it takes you." - [Dieter](https://www.kaggle.com/christofhenkel)$^*$
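
A baseline like the one Dieter describes takes only a few lines. The sketch below uses small made-up `train` and `test` frames with hypothetical `id` and `target` columns; a real competition defines its own column names and layout in its sample submission file.

```python
import pandas as pd

# Toy stand-ins for a competition's train and test files (hypothetical columns)
train = pd.DataFrame({'id': [1, 2, 3, 4], 'target': [10.0, 12.0, 8.0, 14.0]})
test = pd.DataFrame({'id': [5, 6, 7]})

# Baseline: predict the training mean for every test row
baseline_prediction = train['target'].mean()
submission = pd.DataFrame({'id': test['id'], 'target': baseline_prediction})

# Render the submission as CSV text; pass a file path instead to write it to disk
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Once this scores on the leaderboard, every later idea can be judged against it.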

In [None]:
advice = quote[quote.new_question == 1115]
print("Sanyam: '" + advice.interviewer.values[0] + "'\n")
print(advice.heroes.values[0] +  ": '" + advice.guest.values[0] + "'")


# Tips for Contributors
So, now that you have entered your first competition, where do you go from here? Once you feel comfortable with traditional machine learning techniques, you can start expanding your knowledge into another field such as Deep Learning or time series.

## PyTorch or TensorFlow/Keras?
Over the past year, I have seen this question a few times in the discussion sections of competitions:
> "I am new to Deep Learning. Should I start with PyTorch or TensorFlow/Keras?"

This question might even be more frequently asked than whether you should learn Python or R. 
Although both have their advantages and disadvantages, in my opinion both frameworks strongly benefit from this healthy competition and are even moving in the direction of strong cooperation by [enabling conversion](https://www.youtube.com/watch?v=nngMcB7LKzE) between the two.
However, from what I have seen in Kaggle kernels the past year, **my gut feeling says that PyTorch is about to pass TensorFlow/Keras in popularity**. Let's see if that is the case.

In [None]:
temp = df[(df.Occupation.isin(['Data Scientist', 'Research Scientist']) & (df.Year != 2017))]
temp = temp.groupby(['Year', 'Occupation']).agg(Framework__TensorFlow_Keras_PyTorch = ('Framework__TensorFlow_Keras_PyTorch', 'sum' ),
                                                                   Framework__TensorFlow = ('Framework__TensorFlow', 'sum'),
                                                                    Framework__Keras = ('Framework__Keras', 'sum'),
                                                                    Framework__PyTorch = ('Framework__PyTorch', 'sum'),
                                                                    Framework__TensorFlow_Keras = ('Framework__TensorFlow_Keras', 'sum'))
temp['Framework__TensorFlow'] = temp['Framework__TensorFlow'] / temp['Framework__TensorFlow_Keras_PyTorch']
temp['Framework__Keras'] = temp['Framework__Keras'] / temp['Framework__TensorFlow_Keras_PyTorch']
temp['Framework__PyTorch'] = temp['Framework__PyTorch'] / temp['Framework__TensorFlow_Keras_PyTorch']
temp['Framework__TensorFlow_Keras'] = temp['Framework__TensorFlow_Keras'] / temp['Framework__TensorFlow_Keras_PyTorch']

temp = temp.reset_index(drop=False)


fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))

#ax = sns.heatmap(temp, vmin=0, vmax=100,  annot=True,)
for i, occ in enumerate(['Data Scientist', 'Research Scientist']):#, 'Student']):

    framework_by_occupation = temp[temp.Occupation==occ][['Year', 'Framework__TensorFlow', 'Framework__Keras', 'Framework__PyTorch']]
    reg_tensorflow = LinearRegression().fit(np.array(framework_by_occupation.Year).reshape(-1, 1), np.array(framework_by_occupation.Framework__TensorFlow).reshape(-1, 1))
    reg_keras = LinearRegression().fit(np.array(framework_by_occupation.Year).reshape(-1, 1), np.array(framework_by_occupation.Framework__Keras).reshape(-1, 1))
    reg_pytorch = LinearRegression().fit(np.array(framework_by_occupation.Year).reshape(-1, 1), np.array(framework_by_occupation.Framework__PyTorch).reshape(-1, 1))

    # Extrapolate each framework's linear trend to 2021 and 2022.
    # pd.concat is used because DataFrame.append was removed in pandas 2.0.
    predictions = pd.DataFrame([{'Year': year,
                                 'Framework__TensorFlow': reg_tensorflow.predict(np.array([[year]]))[0,0],
                                 'Framework__Keras': reg_keras.predict(np.array([[year]]))[0,0],
                                 'Framework__PyTorch': reg_pytorch.predict(np.array([[year]]))[0,0]}
                                for year in [2021, 2022]])
    framework_by_occupation = pd.concat([framework_by_occupation, predictions], ignore_index=True)
    '''
    framework_by_occupation = framework_by_occupation.append({'Year':2023, 
                  'Framework__TensorFlow': reg_tensorflow.predict(np.array([[2023]]))[0,0], 
                  'Framework__Keras': reg_keras.predict(np.array([[2023]]))[0,0], 
                  'Framework__PyTorch': reg_pytorch.predict(np.array([[2023]]))[0,0], }, 
                 ignore_index=True)
    
    framework_by_occupation = framework_by_occupation.append({'Year':2024, 
                  'Framework__TensorFlow': reg_tensorflow.predict(np.array([[2024]]))[0,0], 
                  'Framework__Keras': reg_keras.predict(np.array([[2024]]))[0,0], 
                  'Framework__PyTorch': reg_pytorch.predict(np.array([[2024]]))[0,0], }, 
                 ignore_index=True)'''
    #* Among **students** TensorFlow/Keras is still more popular than PyTorch. However, it looks like Keras is gaining popularity due to its low threshold to getting started. However, PyTorch might replace TensorFlow in teaching students concepts.
    
    framework_by_occupation.set_index('Year').plot(ax=ax[i], marker='x')

    
    ax[i].set_ylim([0,1])
    ax[i].set_ylabel(ylabel = 'Fraction of Respondents', fontsize = 14)
    ax[i].set_xlabel(xlabel = 'Year', fontsize = 14)
    ax[i].set_xticks([2018, 2019, 2020, 2021, 2022,])# 2023, 2024])
    ax[i].add_patch(Rectangle((2020.5, 0), 4, 1, fill=True, alpha=0.2, color='Grey', lw=0))
    ax[i].annotate('Prediction', xy=(2020.8, 0.9), fontsize=14, color='black')
    ax[i].set_title(occ, fontsize = 16)

From the above plots, we can see that my gut feeling was wrong but not completely off. In **industry**, TensorFlow/Keras is still leading. However, this could change in two years. In **academia**, PyTorch is about to pass TensorFlow/Keras in popularity. I am looking forward to seeing whether that is the case next year. Overall, it looks like PyTorch is going to become more popular than TensorFlow/Keras in the next few years, with academia leading the way and industry following close behind.
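
Keep in mind that each trend line above is fitted to only three survey years, so any predicted crossover is a rough indication at best. The following sketch shows the same kind of straight-line extrapolation with `np.polyfit` on made-up usage shares (the numbers are illustrative, not survey results) and solves for the year in which the two lines intersect.

```python
import numpy as np

years = np.array([2018, 2019, 2020])
# Illustrative shares only, not taken from the survey
share_a = np.array([0.60, 0.55, 0.50])   # declining framework
share_b = np.array([0.20, 0.30, 0.40])   # growing framework

# Measure time from the first survey year to keep the fit well conditioned
x = years - years[0]

# Fit share = slope * x + intercept for each framework
slope_a, intercept_a = np.polyfit(x, share_a, deg=1)
slope_b, intercept_b = np.polyfit(x, share_b, deg=1)

# The lines cross where slope_a * x + intercept_a == slope_b * x + intercept_b
crossover_year = years[0] + (intercept_b - intercept_a) / (slope_a - slope_b)
print(f"Predicted crossover around {crossover_year:.1f}")
```

With so few points, a single new survey year can shift the crossover considerably.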

> "I've worked quite a lot on **TensorFlow and Keras** in the last two years. So I'm very comfortable with it [...] but **only for natural language processing** problems. **When it comes to images, I like PyTorch more** because, first of all, when you start with an image competition  you can build a small simple convolutional neural net and see how it's performing. Then what's the next thing you are going to do? You're going to take some model from some pre-train model, right? Say ResNet or something like that and try to find the ImageNet weights weren't. Yeah, and then fine-tune. But it's difficult in Tensorflow or Keras, because they don't have a proper a model zoo where you can find these weights for different kinds of models. And with PyTorch people seem to be quite fast, and doing these kind of things. So you have a very extensive zoo of models that you can choose from. And you also have the weights available. So you don't have to train it from scratch on ImageNet." - [Abhishek Thakur](https://www.kaggle.com/abhishek)$^*$


In [None]:
advice = quote[quote.new_question == 78]

print("Sanyam: '" + advice.interviewer.values[0] + "'\n")
print(advice.heroes.values[0] +  ": '" + advice.guest.values[0] + "'")

In [None]:
temp = df[(df.Occupation.isin(['Data Scientist', 'Research Scientist']) & (df.Algorithm_Cluster_Deep_Learning == 1) & (df.Year != 2017))]

# Use a distinct name so the spaCy `nlp` pipeline loaded at the top is not overwritten
nlp_frameworks = temp[temp.Algorithm_Cluster_Deep_Learning_NLP == 1]
nlp_frameworks = nlp_frameworks[['Framework__PyTorch', 'Framework__TensorFlow_Keras']].sum()

vision_frameworks = temp[temp.Algorithm_Cluster_Deep_Learning_Vision == 1]
vision_frameworks = vision_frameworks[['Framework__PyTorch', 'Framework__TensorFlow_Keras']].sum()

deep_learning_framework = pd.concat([nlp_frameworks, vision_frameworks], axis=1)
deep_learning_framework.columns = ['NLP', 'Vision']

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8, 6))

deep_learning_framework.plot(kind='bar', ax=ax, color=['dodgerblue', 'skyblue'])

ax.set_title('Framework Popularity among Deep Learning Practitioners ', fontsize = 16)
ax.set_ylabel(ylabel = 'Respondents', fontsize = 14)
ax.set_xticklabels(labels = ['PyTorch', 'TensorFlow/Keras'], rotation=0, fontsize = 14)

plt.show()

## Understand the Importance of EDA and Visualization
Exploratory Data Analysis (EDA) and visualization are an integral part of the data science workflow. On a side note: if you don't want to start building models directly, you could also start by digging into the datasets and creating EDA notebooks. Let's see which tools there are to help you along.

If you want to visualize your data, you might want to pick one of the libraries below, depending on the programming language you chose earlier.
* R: ggplot2, Shiny
* Python: Matplotlib, Altair, Plotly, Bokeh, Seaborn, Geoplotlib, Leaflet / Folium

Let's see if we can see anything that highly experienced Kagglers are doing differently.

In [None]:
visualization_cols = df.columns[df.columns.str.contains('Visu')]

all_df = df[(df.Year == 2020) ].groupby('Year')[visualization_cols].sum()

all_df['respondents'] = df[(df.Year == 2020)].Year.value_counts()

for c in all_df.columns:
    all_df[c] = all_df[c] / all_df['respondents'] * 100

all_df = all_df.drop('respondents', axis=1)
all_df.columns = [re.sub('_', ' ', c.split('__')[1]) for c in all_df.columns]
all_df['label'] = 'All Respondents'
experienced_df = df[(df.Year == 2020) & df.ML_Experience.isin(['5-10 years',  '10-20 years','20 or more years'])].groupby('Year')[visualization_cols].sum()

experienced_df['respondents'] = df[(df.Year == 2020) & df.ML_Experience.isin(['5-10 years',  '10-20 years','20 or more years'])].Year.value_counts()

for c in experienced_df.columns:
    experienced_df[c] = experienced_df[c] / experienced_df['respondents'] * 100

experienced_df = experienced_df.drop('respondents', axis=1)
experienced_df.columns = [re.sub('_', ' ', c.split('__')[1]) for c in experienced_df.columns]
experienced_df['label'] = 'Experienced Respondents'
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
temp = pd.concat([all_df, experienced_df])
temp = temp.set_index('label')
#temp = temp.reset_index(drop=False)
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14, 6))

temp[['Matplotlib', 'Seaborn','Plotly', 'Bokeh', 'Geoplotlib', 'Leaflet Folium', 'Altair',]].T.plot(kind='bar', ax=ax[0], color=['lightblue', 'deepskyblue', 'steelblue'])
ax[0].set_title('Usage of Visualisation Libraries for Python', fontsize = 14)
ax[0].set_ylabel(ylabel = 'Percentage of Respondents [%]', fontsize = 12)
ax[0].set_ylim([0,80])

temp[['ggplot2', 'Shiny']].T.plot(kind='bar', ax=ax[1], color=['lightblue', 'deepskyblue', 'steelblue'])
ax[1].set_title('Usage of Visualisation Libraries for R', fontsize = 14)
ax[1].set_ylabel(ylabel = 'Percentage of Respondents [%]', fontsize = 12)
ax[1].set_ylim([0,80])
#plt.xticks(rotation=0, fontsize=12)

plt.show()

For Python, the most popular libraries are Matplotlib, Seaborn, and Plotly. For R, ggplot2 is more popular than Shiny. **Overall, Kagglers with more than 5 years of experience in Machine Learning report using visualization libraries more often** than the average Kaggler. From the last point I think we can gain some great insight: **Visualizing your data is an integral part of the Data Science workflow**. So, pick a few libraries and start visualizing some data!
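
The percentages in this comparison are simply the share of respondents in each group who ticked a library. As a minimal sketch of that calculation, here is a toy frame with 0/1 indicator columns; the values are made up for illustration and `is_experienced` is a hypothetical flag, not a column of the survey data.

```python
import pandas as pd

toy = pd.DataFrame({
    'is_experienced': [0, 0, 0, 1, 1],
    'Matplotlib':     [1, 0, 1, 1, 1],
    'Plotly':         [0, 0, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction of respondents who selected it
usage_pct = toy.groupby('is_experienced')[['Matplotlib', 'Plotly']].mean() * 100
usage_pct.index = ['Less experienced', 'Experienced']
print(usage_pct.round(1))
```

Taking the mean avoids the separate respondent count and division, but it assumes that unselected options are stored as 0 rather than as missing values.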

## How Much Computing Power Do I Need?

Training Machine Learning algorithms can be computationally expensive. Large neural networks take ages to train on CPUs alone. Training can be much faster if you use GPUs instead.
Additionally, there are so-called Tensor Processing Units (TPUs), which are specifically designed to handle the heavy computation of Deep Learning algorithms.

Let's see how the usage of certain algorithms correlates with the usage of certain hardware.

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8, 4))

sns.heatmap(hardware_2019_df.corr()[['CPU', 'GPU', 'TPU']].iloc[:3], vmin=0, vmax=1, annot=True, cmap='Blues', ax= ax)
ax.set_title('Correlation of Algorithm and Hardware Usage in 2019', fontsize=14)


plt.show()


**CPUs** are sufficient for **traditional Machine Learning** algorithms. The hardware of choice for **Deep Learning** algorithms seems to be **GPUs**. Although only 4% of respondents use TPUs, **the number of respondents using TPUs has doubled from 2019 to 2020**.
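
The heatmap values are Pearson correlations between 0/1 indicator columns, which is what `DataFrame.corr()` computes by default. A minimal sketch on invented data (the column names and values below are illustrative, not the survey's) shows how to read them: a value near 1 means the two options tend to be ticked by the same respondents, while a value near 0 means ticking one says little about the other.

```python
import pandas as pd

toy = pd.DataFrame({
    'Deep_Learning': [1, 1, 1, 0, 0, 0],
    'GPU':           [1, 1, 1, 0, 0, 0],   # ticked by exactly the deep learning users
    'CPU':           [1, 0, 1, 1, 0, 1],   # equally common in both groups
})

# Pearson correlation between every pair of indicator columns
correlations = toy.corr()
print(correlations.loc['Deep_Learning', ['GPU', 'CPU']])
```

A high correlation only describes co-occurrence in the answers; it does not show that one choice causes the other.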

We can see that if you want to train a Deep Learning algorithm, GPUs are the most popular choice right now. However, GPUs are quite costly. An alternative is using hosted notebook products. For example, Kaggle kernels enable you to use [more than 30 hours of GPU time per week](https://news.developer.nvidia.com/how-kaggle-makes-gpus-accessible-to-5-million-data-scientists/).

In [None]:
advice = quote[quote.new_question == 385]

print("Sanyam: '" + advice.interviewer.values[0] + "'\n")
print(advice.heroes.values[0] +  ": '" + advice.guest.values[0] + "'")

In [None]:
notebook_cols = df.columns[df.columns.str.startswith('Hosted_Notebook')]
cloud_cols = df.columns[df.columns.str.startswith('Cloud_Platform')]
alternatives = df[(df.Hardware__GPU == 1) & (df.Year == 2020)][['Occupation',  'Hosted_Notebook_Products__Google_Colab', 'Hosted_Notebook_Products__Kaggle_Notebooks','Hosted_Notebook_Products__Binder_JupyterHub', 'Hosted_Notebook_Products__None']]
alternatives['Student_status'] = alternatives.Occupation.apply(lambda x: 1 if x == 'Student' else 0)
temp = alternatives.groupby(['Student_status']).sum()
temp['respondents'] = alternatives.Student_status.value_counts()

for c in temp.columns:
    temp[c] = temp[c] / temp['respondents'] * 100

temp = temp.drop('respondents', axis=1)
temp.index = ['Not Student', 'Student']
temp.columns = [re.sub('_', ' ', c.split('__')[1]) for c in temp.columns]
    
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))

temp.T.plot(kind='bar', ax=ax, color=['deepskyblue', 'steelblue'])
ax.set_title('Usage of Hosted Notebook Products', fontsize = 14)
ax.set_ylabel(ylabel = 'Percentage of Respondents [%]', fontsize = 12)
plt.xticks(rotation=0, fontsize=12)
plt.show()


The most popular hosted notebook products among Kagglers are Google Colab, Kaggle Notebooks and Binder JupyterHub. **Students tend to use hosted notebook products more** than non-students.

# Tips for Experts
So, now that you have levelled up and gained quite a bit of knowledge, how do you become truly good at this? The key is to be persistent and to keep learning. But what if you have all the basics down? A good way to keep learning is to read up on how competition winners approached the problem. Usually, their solutions are posted in the discussion section after a competition ends.
>"[...] So I think you should be **persistent**. So even if you didn't perform well, **read how the winner approached the problem.**" - [Abhishek Thakur](https://www.kaggle.com/abhishek)$^*$

> "[...] I think the other thing still is that we live in an information age and there is so much information out there and not making use of it is stupid. So you should also always try to **read on the solutions of previous competitions**. You should also always try  to figure out when a hint on the discussion forums is important because I think that's a very crucial part that a lot of top characters can figure out quite quickly, like someone posting something about, oh, this feature I have found that this feature is important that maybe people should get into it and most kind of ignore that because and top guy just kind of can make the connection between that and what what they should do. [...]" - [Psi](https://www.kaggle.com/philippsinger)$^*$

In [None]:
advice = quote[quote.new_question == 93]
print("Sanyam: '" + advice.interviewer.values[0] + "'\n")
print(advice.heroes.values[0] +  ": '" + advice.guest.values[0] + "'")

advice = quote[quote.new_question == 516]
print("Sanyam: '" + advice.interviewer.values[0] + "'\n")
print(advice.heroes.values[0] +  ": '" + advice.guest.values[0] + "'")

# Sorry for the long post
Here is somebody who answered every freeform question with 'potato'.

In [None]:
freeform_responses_2019.iloc[16716]

In [None]:
"""
## Any Media Recommendations?

Where do Kagglers get the latest news on data science topics? Let's rank the three most popular media sources for each year this question was asked.
"""

media_df = df[['Year','Media__Twitter', 'Media__Newsletters', 'Media__Reddit',
       'Media__Kaggle', 'Media__Course_Forums', 'Media__YouTube',
       'Media__Podcast', 'Media__Blogs', 'Media__Journal_Publications',
       'Media__Slack', 'Media__None']].groupby('Year').sum()

media_df['n_respondents'] = df.Year.value_counts()

for c in media_df.columns:
    if c != 'n_respondents':
        media_df[c] = media_df[c]/media_df['n_respondents'] * 100
media_df = media_df.drop('n_respondents', axis=1) 

media_df = media_df.T.reset_index(drop=False)
media_df.columns=['Media', 2017, 2018, 2019, 2020]

media_dict = {}
for m in ['Media__Twitter', 'Media__Newsletters', 'Media__Reddit',
       'Media__Kaggle', 'Media__Course_Forums', 'Media__YouTube',
       'Media__Podcast', 'Media__Blogs', 'Media__Journal_Publications',
       'Media__Slack', 'Media__None']:
    media_dict[m] = m.split('__')[1]
media_df.Media = media_df.Media.replace(media_dict)

media_ranking_df = pd.DataFrame(columns=[2018, 2019, 2020])
for year in [2018, 2019, 2020]:
    media_ranking_df[year] = media_df.sort_values(by=year, ascending=False).Media[:3].reset_index(drop=True)

    
def color_df(val):
    if val == 'Kaggle':
        return 'background-color: deepskyblue'
    elif val == 'YouTube':
        return 'background-color: tomato'
    elif val == 'Blogs':
        return 'background-color: steelblue'

media_ranking_df.style.applymap(color_df).set_caption('Most Popular Media Sources')

"""
**Findings**:
* Kagglers get their news on Kaggle - duh!
* Video content is becoming more popular than text content

This year, there has been some great content on YouTube by a few Kagglers. Kaggle's first 4x Grandmaster [Abhishek](https://www.kaggle.com/abhishek) has been posting great tutorials and interviews on [his YouTube channel](https://www.youtube.com/user/abhisheksvnit) he **started this year**. 

Also, [Sanyam](https://www.kaggle.com/init27) has been uploading the [Chai Time Data Science Show](https://www.youtube.com/c/ChaiTimeDataScience), which is a podcast where he interviews researchers, Kagglers, and data science practitioners, **twice a week** this year. There even was a fun **analytics challenge** based on the [dataset](https://www.kaggle.com/rohanrao/chai-time-data-science/notebooks) from the interviews.

Another cool podcast I enjoy, by fellow Kagglers [currypurin](https://www.kaggle.com/currypurin) and [regonn](https://www.kaggle.com/regonn), is [regonn&curry.fm](https://open.spotify.com/show/4qRP8siOYH92k3oMSxUG9U?si=yUHWVrk-Sp2f9lV_vnJmOA), where they **discuss the latest Kaggle challenges**. Unfortunately, it is in Japanese only, but if you speak Japanese, I definitely recommend it.

**Thank you guys for sharing your high quality content with us! I have learned a lot!**

Since we are on the topic of podcasts, I was surprised that **only 7.5% of Kagglers listen to data science related podcasts**.

"""

print(f"Only {round(media_df[media_df.Media == 'Podcast'][2020].values[0],1)}% of Kagglers listen to data science related Podcasts.")