# The Market - The Story of Data Science Products
By Aaryam Sharma

The Market is a metaphor for the collection of sites on the internet that provide us with the tools, data and knowledge we need to do our Data Science jobs! The Market is marvelous for sure, and this notebook is practically part of it, because it provides data and knowledge too!!!

This notebook is a journey into this market (which sometimes gives away free stuff, like Kaggle :D) of technological tools and innovations that surround the world of Data Science, seen through the lens of experienced and new machine learning practitioners.

Together, we will analyze how the market is shaped today and how it got here by looking at data from past years.
****

**Why Should I Read This Notebook?**

Firstly, if you are new to ML, then this Notebook is very helpful for you. You will learn about the several different tools available at your disposal, and also learn about who uses what. This can help you choose which tool to learn about and add to your own repertoire.

Even if you are experienced and know about ML tools, this notebook will be very helpful for you. Here, you will be able to see how the market has changed over time, something developers will find especially useful. Moreover, you will learn how other people across the globe are using ML tools, so you get a picture of the global market!
****

**Motivation Of This Topic**

The objective of this data science competition is to identify and present insights into the data provided by this survey. I noticed how sparingly the market of products surrounding machine learning has been analyzed. Moreover, these analyses seldom relate present and past data to say something about potential changes in the future market. I decided to present insights into this topic, to help programmers understand the direction the user base has taken in its use of Data Science tools and software.

Additionally, I decided to compare 2 distinct groups, Experienced and New practitioners, because they reflect the market in the previous and current generations respectively. Furthermore, in this notebook, I will use these statistics to explain whether these trends are beneficial or harmful to the Data Science world.

****

**How We will Analyze the Data**

Firstly I will present a graph of how the users are using the chosen technology in 2021. I will bring attention to the main points of focus and explain them. Here, I will plot the market *share* of users.
The data plotted will be relative numbers, i.e., of all the people in the dataset in our groups, what percentage of them used this tool, etc.

I will then focus on these tools/products, and how they evolved over time. These are also relative.

I used relative numbers to accommodate the different number of survey takers each year. Moreover, the number of Experienced users is far smaller than the number of New users, so relative numbers make it easier to compare these classes. Additionally, they help us analyze the data from some questions that were not asked of certain groups (such as non-professionals).
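
To make the idea of relative numbers concrete, here is a minimal sketch on made-up data. The column names `group` and `tool`, the labels and the counts are all hypothetical and are not taken from the survey; the point is only that dividing by each group's size puts a small group and a large group on the same scale.

```python
import pandas as pd

# Hypothetical toy responses: one row per respondent (not real survey data)
toy = pd.DataFrame({
    "group": ["Experienced"] * 4 + ["New"] * 10,
    "tool": ["Python", "Python", "R", "Python"] + ["Python"] * 6 + ["R"] * 4,
})

# Absolute counts are dominated by the larger group
counts = toy.groupby("group")["tool"].value_counts()

# Relative numbers: each count divided by the size of its own group
share = toy.groupby("group")["tool"].value_counts(normalize=True)

print(counts)
print(share)
```

In absolute terms the New group has more Python users, but in relative terms Python is more common among the Experienced group in this toy data, which is the kind of comparison the rest of the notebook relies on.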

*Note: Certain tools and products were not available in the previous Surveys as Options. They may still be prominent features of the 2021 graph. I will unfortunately not analyze these in past years, but I will clearly indicate when they occur*

Certainly, there will also be some rare exceptions.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px
import seaborn as sn
import plotly.figure_factory as ff
import math
import plotly.graph_objects as go

In [None]:
#Importing all data

data_18 = pd.read_csv("/kaggle/input/kaggle-survey-2018/multipleChoiceResponses.csv", )
data_19 = pd.read_csv("/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv", )
data_20 = pd.read_csv("/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv", )
data_21 = pd.read_csv("/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv", )

In [None]:
# Gets all information from a single question for a single term
def get_all_from_sq(q, term, data = data_21):
    return data[data[q].isin([term])]

# Gets all information from a single question with multiple parts for terms in term_list
def get_all_from_mcq(q, term_list, data = data_21):
    df = data
    for i in range(len(term_list)):
        df = df[df[q + "Part_" + str(i + 1)].isin(term_list)]
    return df

# Gets all information that is = term1 in q1 and term2 in q2 (Extra)
def get_and(q1, q2, term1, term2, data = data_21):
    # Combine both conditions into one mask instead of chaining two boolean filters
    return data[data[q1].isin([term1]) & data[q2].isin([term2])]

# Counts the data after checking data type
def data_count(q1 = 1, cross_ref = None, qt = "sq", cross_ref_t = None, term = 1, term2 = None, data = data_21):
    # Pass `data` through so the helpers do not silently fall back to data_21
    if(cross_ref is None):
        # for single questions it'll return total number of terms
        if(qt == "sq"):
            return len(get_all_from_sq(q1, term, data).count(axis = 1))
        else:
            # for multiple answer questions it'll return total number of terms in all parts
            return len(get_all_from_mcq(q1, term, data).count(axis = 1))
    else:
        return len(get_and(q1, cross_ref, term, term2, data).count(axis = 1))

# Builds a dataset based on 2 questions by correlating them. q1_terms and q2_terms are the terms you want
# to relate
def build_2_var_sq(q1, q2, q1_terms, q2_terms, data = data_21):
    len_opt = len(q1_terms)
    df = {}
    for i in range(len_opt):
        n1 = q1 + str(i + 1)
        df[n1] = data_count(q1 = q1, cross_ref = q2, term = q1_terms[i], term2 = q2_terms, cross_ref_t = "sq", data = data)
    df = pd.DataFrame(df.items(), columns = ['VarX', 'VarY'])
    return df

# Instead of using the logical AND operation we use the OR operation, so it's checking other possibilities
def build_2_var_sq_or(q1, q2, q1_terms, q2_terms, data = data_21):
    len_opt = len(q1_terms)
    df = {}
    for i in range(len_opt):
        n1 = q1 + str(i + 1)
        s1 = 0
        for j in range(len(q2_terms)):
            s1 += data_count(q1 = q1, cross_ref = q2, term = q1_terms[i], term2 = q2_terms[j], cross_ref_t = "sq",
                            data = data)
        df[n1] = s1
    df = pd.DataFrame(df.items(), columns = ['VarX', 'VarY'])
    return df

# From a question it gets all the names of the options
def get_question_index(q1, data = data_21):
    return list(data[q1].value_counts().index)

# If we have 2 data counts for 2 variables, it formats the data
# It also converts it to relative data if rel = true
# Output = [VarX, Mid, VarY]
def relate_data_counts(lab, counts, indexes, rel):
    d = []
    tot = 0
    if(counts.size == 0 and not rel):
        return [lab, 0]
    for i in range(min(len(indexes), len(counts))):
        d.append([lab, indexes[i], counts[i]])
        tot += counts[i]
    if(rel):
        d.append(tot)
    return d
# Get the value_counts as an array
def get_question_counts(q1, data = data_21):
    return np.array(data[q1].value_counts())

# A remade and more efficient version that applies the and operation between 2 questions.
def get_question_and(q1, q2, data = data_21, rel = False):
    q1_names = get_question_index(q1, data)
    q2_names = get_question_index(q2, data)
    d = []
    for name in q1_names:
        if(not rel):
            d1 = relate_data_counts(name,
                                     get_question_counts(q2, data[data[q1].isin([name])])[:-1],
                                     q2_names[:-1], rel)
            for j in d1:
                d.append(j)
        else:
            d1 = relate_data_counts(name,
                                     get_question_counts(q2, data[data[q1].isin([name])])[:-1],
                                     q2_names[:-1], rel)
            tot = d1[-1]
            for j in range(len(d1) - 1):
                d.append([d1[j][0], d1[j][1], float(d1[j][2])/float(tot + 1)])
    return d
# Renames a certain dataframe - helpful to fix data with large names
def rename_frame(array, naming_1, naming_2):
    d = pd.DataFrame(np.array(array), columns = ['VarX', 'Mid', 'VarY'])
    for i in range(len(naming_1)):
        d = d.replace(naming_1[i], value= naming_2[i])
    return d

# For a single variable, it compiles all data from all years. Input the list of question numbers
# If you have "Q1_Part_n", then just put "Q1_"
# Then enter the variable you are searching for.
def multi_Y_data_SV(q1_list, q1_var, get_var = False, data_list = [data_21], year_list = ["2021"], rel = False):
    d = []
    if(get_var):
        for i in range(len(q1_list)):
            co = get_question_counts(q1_list[i], data = data_list[i])[0]
            d.append([year_list[i], q1_var, co])
    else:
        for i in range(len(q1_list)):
            co = get_question_counts(q1_list[i], data = data_list[i])
            names = get_question_index(q1_list[i], data = data_list[i])
            n1 = ""
            for j in range(len(names)):
                if(q1_var[i] == names[j]):
                    n1 = j
                    break
            d.append([year_list[i], q1_var, co[j]])
    return d

# Formats the above data for making into a line graph
def line_graph_SV(q1_list, q1_var, get_var = False, data_list = [data_21], year_list = ["2021"], rel = False, 
                  alternate_name = None, naming = None):
    if(alternate_name == None):
        d = multi_Y_data_SV(q1_list, q1_var, get_var, data_list, year_list, 
                                     rel)
        d = pd.DataFrame(np.array(d), columns = ['VarX', 'Mid', 'VarY'])
    else:
        d = rename_frame(multi_Y_data_SV(q1_list, q1_var, get_var, data_list, year_list, 
                                     rel),
                     naming, 
                     alternate_name)
    d['VarY'] = d['VarY'].astype(str).astype(float)
    return d

# Helps relate multiple variables in multiple questions.
def multi_Y_data_MV(q1_list, q1_var, get_var = False, data_list = [data_21], year_list = ["2021"], year_wise_skip = None,
                  alternate_name = None, naming = None, has_other = False, cut_off = 1, reformat = False):
    d = []
    for i in range(len(q1_var)):
        q_nums = []
        i1 = 0
        for n in q1_list:
            if(has_other and i == len(q1_var) - 1):
                if(i1 <= cut_off):
                    q_nums.append(n + "OTHER_TEXT")
                else:
                    q_nums.append(n + "OTHER")
            elif(year_wise_skip != None):
                if(i1 in year_wise_skip[i]):
                    i1 += 1
            else:
                q_nums.append(n + "Part_" + str(i + 1))
            
            i1 += 1
        d1 = multi_Y_data_SV(q_nums, q1_var[i], get_var = True, data_list = data_list, 
                             year_list = year_list)
        d.append(pd.DataFrame(d1, columns = ['VarX', 'Mid', 'VarY']))
    return pd.concat(d)

# Helps rescale according to what year we are looking at
# the ident is what we are using as base year
def reformat_yearwise(data, year_dict, ident):
    d = []
    for index, row in data.iterrows():
        d.append([row['VarX'][:-3], row['Mid'], 
                  round(row['VarY'] / (1 + year_dict[row['VarX']]) * year_dict[ident])])
    return d

# Convert data into relative data
# Ident is just there for convenience
def reformat_yearwise_rel(data, year_dict, ident):
    d = []
    for index, row in data.iterrows():
        d.append([row['VarX'][:-3], row['Mid'], 
                  row['VarY'] / (1 + year_dict[row['VarX']])])
    return d

# Maps the salary into different numeric categories depending on maping (Extra)
def salary_map(data, qnum, maping):
    d = data
    for key in maping:
        d[qnum] = d[qnum].replace(key, maping[key])
    return d

# A salary map
change_dict_1 = {
    '25,000-29,999':27500, '60,000-69,999':65000, '$0-999':500, '30,000-39,999': 35000,
       '15,000-19,999':17500, '70,000-79,999':75000, '2,000-2,999':2500, '10,000-14,999':12500,
       '5,000-7,499':6250, '20,000-24,999':22500, '1,000-1,999':1500, '100,000-124,999':112500,
       '7,500-9,999':8750, '4,000-4,999':4500, '40,000-49,999':45000, '50,000-59,999':55000,
       '3,000-3,999':3500, '300,000-499,999':400000, '200,000-249,999':225000,
       '125,000-149,999':137500, '250,000-299,999':275000, '80,000-89,999':85000,
       '90,000-99,999':95000, '150,000-199,999':175000, '>$1,000,000': 1100000,
       '$500,000-999,999':750000
}

# convert the data to log. (Extra)
def to_log(data):
    d = []
    for i in range(len(data)):
        d.append(math.log(data[i] + 1))
    return d

# Reconvert the numeric salary to smaller ranges (Extra)
def compile_salary(data, ranges):
    d = []
    for point in data:
        # Emit exactly one label per point; anything unmatched stays "nan"
        label = "nan"
        if(isinstance(point, (int, float))):
            for j in ranges:
                if(ranges[j][0] <= point and point <= ranges[j][1]):
                    label = j
                    break
        d.append(label)
    return d

# Another function to convert the data into relative values
# This is less generalized
def conv_rel(cat_list, data):
    d = []
    for cat in cat_list:
        # Copy the slice so the assignment below does not write into a view of `data`
        d_temp = data[data["VarX"].isin([cat])].copy()
        s_t = d_temp["VarY"].sum()
        d_temp["VarY"] = d_temp["VarY"] / s_t
        d.append(d_temp)
    return pd.concat(d)

# Formats data for making into a line graph with multiple variables.
def year_relator(q_list, y_list, q_points, data_list, q_py):
    d = []
    j = 0
    qs = 0
    for i in range(len(q_list)):
        if(qs == q_py):
            j += 1
            qs = 0
        count = get_question_counts(q_list[i], data_list[j])
        d.append([y_list[j], q_points[qs], count[0]])
        qs += 1
    return d

"""
test_1 = year_relator(["Q7_Part_1", "Q7_Part_2", "Q7_Part_11", "Q7_Part_1", "Q7_Part_2", "Q7_Part_11",
              "Q7_Part_1", "Q7_Part_2", "Q7_Part_11", "Q7_Part_1", "Q7_Part_2", "Q7_Part_11",
              "Q18_Part_1", "Q18_Part_2", "Q18_Part_10", "Q18_Part_1", "Q18_Part_2", "Q18_Part_10",
             "Q16_Part_1", "Q16_Part_2", "Q16_Part_9", "Q16_Part_1", "Q16_Part_2", "Q16_Part_9"
             ], 
             ["2021 Old", "2021 New", "2020 Old", "2020 New", "2019 Old", "2019 New", "2018 Old", "2018 New"], 
             ["Python", "R", "MATLAB"], 
             [d21_o, d21_y, d20_o, d20_y, d19_o, d19_y, d18_o, d18_y], 3)
"""

# Splits data into old and new users
def bifurcate_data(dat):
    old = []
    ne = []
    for val in dat:
        name = val[0]
        if(name[-3:] == "Old"):
            old.append(val)
        else:
            ne.append(val)
    return pd.DataFrame(old, columns = ["VarX", "Mid", "VarY"]), pd.DataFrame(ne, columns = ["VarX", "Mid", "VarY"])

# Combines elements according to the middle values
def comb_mids1(dat, mid_vals, rem_vals, years):
    d = dat
    for i in range(len(years)):
        n1 = years[i]
        for mid in mid_vals:
            #if(d.loc[(d["VarX"] == n1) & (d["Mid"] == mid[1]), 'VarY'] != np.nan and d.loc[(d["VarX"] == n1) & (d["Mid"] == mid[0]), 'VarY'] != np.nan):
            # d.loc[(d["VarX"] == n1) & (d["Mid"] == mid[0]), 'VarY'] += d.loc[(d["VarX"] == n1) & (d["Mid"] == mid[0]), 'VarY'] + d.loc[(d["VarX"] == n1) & (d["Mid"] == mid[1]), 'VarY']
            # Add the rows labelled mid[1] into the rows labelled mid[0] for this year
            base = (d["VarX"] == n1) & (d["Mid"] == mid[0])
            extra = (d["VarX"] == n1) & (d["Mid"] == mid[1])
            d.loc[base, 'VarY'] += d.loc[extra, 'VarY'].sum()
    
    d = d.loc[~d["Mid"].isin(rem_vals)]
    
    return d

# A better version of comb_mids1
def comb_mids(dat, mid_vals, rem_vals, years):
    d = []
    for i in range(len(years)):
        for j in range(len(mid_vals)):
            for k in range(len(dat)):
                if(dat[k][0] == years[i] and dat[k][1] == mid_vals[j][0]):
                    flag = False
                    for l in range(len(dat)):
                        if(dat[l][0] == years[i] and dat[l][1] == mid_vals[j][1]):
                            d.append([dat[k][0], dat[k][1], dat[k][2] + dat[l][2]])
                            flag = True
                            break
                    if(flag):
                        break
    for i in range(len(dat)):
        flag = True
        for j in range(len(mid_vals)):
            if((dat[i][1] == mid_vals[j][0] and dat[i][0] in years) or (dat[i][0] in years and dat[i][1] == mid_vals[j][1])):
                flag = False
                break
        if(flag):
            d.append(dat[i])
    return d

# Makes a line graph for single choice question with multiple variables
def Line_SV_2(q_list, q_to_find, naming, year_list, per_year, data):
    count = 0
    y_no = 0
    d = []
    for i in range(len(q_list)):
        if(count == per_year):
            count = 0
            y_no += 1
        valY = data[y_no][q_list[i]].value_counts()[q_to_find[i]]
        d.append([year_list[y_no], naming[count], valY])
        count += 1
    d = pd.DataFrame(d, columns = ["VarX", "Mid", "VarY"])
    d['VarY'] = d['VarY'].astype(str).astype(float)
    return d

# Another way to convert to relative values
def conv_rel_2(data, mapping):
    d = []
    for i in range(len(data)):
        d.append([data[i][0], data[i][1], data[i][2] / mapping[data[i][0]]])
    d = pd.DataFrame(d, columns = ["VarX", "Mid", "VarY"])
    d['VarY'] = d['VarY'].astype(str).astype(float)
    return d

# Change names of data
def re_name_2(data, naming):
    d = []
    for i in range(len(data)):
        d.append([data[i][0], naming[data[i][1]], data[i][2]])
    return pd.DataFrame(d, columns = ['VarX', 'Mid', 'VarY'])

# Helps make graphs with plotly.graph_objects (one Scatter trace per name in trace_list)
def add_many_traces(fig, data, trace_list, color_list):
    for i in range(len(trace_list)):
        fig.add_trace(go.Scatter(x = data[data['Mid'].isin([trace_list[i]])]['VarX'],
                                y = data[data['Mid'].isin([trace_list[i]])]['VarY'],
                                 hovertemplate = trace_list[i] + ': %{y:.2f}<extra></extra>',
                                 name = trace_list[i],
                                 marker_color = color_list[i]
                                ))

****
**Who are Experienced and New Practitioners?**

Before proceeding, it is important to note what I mean by "Experienced" and "New" practitioners. I chose "New" practitioners to be coders with less than 3 years of *Machine Learning Experience*, while the *Experienced Group* is the set of people with 10 or more years of experience.

Why did I choose this metric? Well, I had considered two other options: salary and coding experience.

Let me explain why I didn't choose these for my analysis (they may be good for others). First of all, salary is not well suited to our purpose. We wish to analyze the previous and current generations of ML practitioners. However, in some rare instances, younger practitioners earn a high salary, which would put them in the same group as older practitioners. This certainly would not help our analysis.
Coding experience is also unsuitable, because a coder with 20 years of full-stack experience who has just transitioned to Machine Learning is *still in the new generation of ML coders* but would be counted in the older generation.
Clearly, these metrics would introduce discrepancies.

Now, why 3 and 10? Well, I didn't want to choose an extremely small group. Besides, in the world of computing, even 10 years is a large number!
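
The coding-experience argument can be illustrated with a small sketch. Everything here is hypothetical: the three respondents, the numeric `coding_years` and `ml_years` columns (the survey itself records ranges, not numbers) and the helper `cohort`, which simply applies the 3 and 10 year thresholds described above.

```python
import pandas as pd

# Hypothetical respondents (made-up values, not survey rows)
people = pd.DataFrame({
    "name": ["A", "B", "C"],
    "coding_years": [20, 1, 15],
    "ml_years": [1, 1, 12],
})

def cohort(years):
    """Apply the thresholds from the text: under 3 is New, 10 or more is Experienced."""
    if years < 3:
        return "New"
    if years >= 10:
        return "Experienced"
    return "Neither"

people["by_ml"] = people["ml_years"].apply(cohort)
people["by_coding"] = people["coding_years"].apply(cohort)

# Rows where the two metrics disagree about the cohort
print(people[people["by_ml"] != people["by_coding"]])
```

Respondent A, a long-time coder who only recently moved to ML, lands in different cohorts depending on the metric, which is why ML experience is used for the split.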

In [None]:
# Segregating the data according to the decided metrics

d21_o = data_21[data_21["Q15"].isin(["10-20 years", "20 or more years"])]
d21_y = data_21[data_21["Q15"].isin(["Under 1 year", "1-2 years"])]
d20_y = data_20[data_20["Q15"].isin(["Under 1 year", "1-2 years"])]
d20_o = data_20[data_20["Q15"].isin(["10-20 years", "20 or more years"])]
d18_y = data_18[data_18["Q25"].isin(["< 1 year", "1-2 years"])]
d18_o = data_18[data_18["Q25"].isin(["10-15 years", "20+ years"])]
d19_y = data_19[data_19["Q23"].isin(["< 1 years", "1-2 years"])]
d19_o = data_19[data_19["Q23"].isin(["10-20 years", "20+ years"])]

In [None]:
print("2021 - Experienced:", len(d21_o), "New Users:", len(d21_y))
print("2020 - Experienced:", len(d20_o), "New Users:", len(d20_y))
print("2019 - Experienced:", len(d19_o), "New Users:", len(d19_y))
print("2018 - Experienced:", len(d18_o), "New Users:", len(d18_y))

**Let The Numbers Rain Down**

Let's start by counting how many people each group has per year. You may check the output of the above cell to see that the number of users in each group from 2018 to 2021 is:

2021 - Experienced: 573 New Users: 13838

2020 - Experienced: 419 New Users: 9771

2019 - Experienced: 183 New Users: 8947

2018 - Experienced: 465 New Users: 11179
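
These group sizes are reused later as the denominators for relative numbers. As a sketch, they could be collected into a dictionary directly from the filtered frames rather than typed by hand. The toy frames and the helper name `group_sizes` below are hypothetical stand-ins for `d21_o`, `d21_y` and the other yearly frames.

```python
import pandas as pd

# Toy stand-ins for the filtered survey frames (hypothetical data)
toy_old = pd.DataFrame({"Q15": ["10-20 years", "20 or more years"]})
toy_new = pd.DataFrame({"Q15": ["Under 1 year", "1-2 years", "1-2 years"]})

def group_sizes(frames):
    """Map each label (e.g. '2021 Old') to the number of rows in its frame."""
    return {label: len(frame) for label, frame in frames.items()}

sizes = group_sizes({"2021 Old": toy_old, "2021 New": toy_new})
print(sizes)
```

Building the mapping this way keeps the denominators in sync with the filters if the cohort definitions ever change.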

****

**An Overview**

Let's use the data about the types of tools that survey respondents use to understand, at a high level, what our two groups look like. In particular, we'll look at the primary tool they use to analyze data.

In [None]:
# Primary Analysis Tool

# Get the question names
names_old_Analysis = get_question_index("Q41", d21_o)
names_y_Analysis = get_question_index("Q41", d21_y)
names_old_Analysis_18 = get_question_index("Q12_MULTIPLE_CHOICE", d18_o)
names_y_Analysis_18 = get_question_index("Q12_MULTIPLE_CHOICE", d18_y)

# Get all data from the single choice questions
Analysis_old_2021 = Line_SV_2(["Q41","Q41","Q41","Q41","Q41","Q41"],
            names_old_Analysis,
          names_old_Analysis, 
          ["2021 Old", "2021 New"], 6, 
          [d21_o[~d21_o["Q41"].isin([np.nan])],d21_y[~d21_y["Q41"].isin([np.nan])]])

Analysis_new_2021 = Line_SV_2(["Q41","Q41","Q41","Q41","Q41","Q41"],
            names_y_Analysis,
          names_y_Analysis, 
          ["2021 New"], 6, 
          [d21_y[~d21_y["Q41"].isin([np.nan])]])

Analysis_2021 = pd.concat([Analysis_old_2021, Analysis_new_2021])

# Convert counts to proportions of each group's size (group sizes taken from the printout above)
Analysis_2021 = conv_rel_2(Analysis_2021.to_numpy(), {'2021 Old' : 573, '2020 Old' : 419, '2019 Old' : 183, '2018 Old': 465, 
                                       '2021 New' : 13838, '2020 New' : 9771, '2019 New' : 8947, '2018 New': 11179})
                         #ident = ['2021 Old', '2021 New'])
# Rename the long names
Analysis_2021 = re_name_2(Analysis_2021.to_numpy(), {'Local or hosted development environments (RStudio, JupyterLab, etc.)': 'Development Environments',
                                     'Local development environments (RStudio, JupyterLab, etc.)': 'Development Environments',
                                     'Basic statistical software (Microsoft Excel, Google Sheets, etc.)': 'Basic Stat Software',
                                     'Business intelligence software (Salesforce, Tableau, Spotfire, etc.)': 'Business Intelligence',
                                     'Cloud-based data software & APIs (AWS, GCP, Azure, etc.)': 'Cloud',
                                      'Other': 'Other',
                                      'Advanced statistical software (SPSS, SAS, etc.)': 'Advanced Stat Soft'
                                     })
# Make a graph
fig = go.Figure()
fig.add_trace(go.Bar(x = Analysis_2021[Analysis_2021['VarX'].isin(['2021 Old'])]['Mid'],
                     y = Analysis_2021[Analysis_2021['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = Analysis_2021[Analysis_2021['VarX'].isin(['2021 New'])]['Mid'],
                     y = Analysis_2021[Analysis_2021['VarX'].isin(['2021 New'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "Primary Data Analysis Tool for Users in 2021", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Data Analysis Tool',
                 yaxis_title = 'Proportion of Users')
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()

We notice that Development Environments are the most popular tool overall. For experienced users, they are by far the most used type of platform. For new users, on the other hand, Basic Statistics Software such as MS Excel is the most popular.

1. Development environments are still very common among new users, but much less than for experienced users
2. Advanced statistics software and cloud tools are more popular among experienced users than for new users

This allows us to infer that new users use ML applications for relatively simple tasks, since they use basic statistics software. On the other hand, experienced users use similar tools, but they focus more on development of software since they use development environments the most.

In [None]:
Analysis_old_2020 = Line_SV_2(["Q38","Q38","Q38","Q38","Q38","Q38"],
            names_old_Analysis,
          names_old_Analysis, 
          ["2020 Old", "2020 New"], 6, 
          [d20_o[~d20_o["Q38"].isin([np.nan])],d20_y[~d20_y["Q38"].isin([np.nan])]])

Analysis_new_2020 = Line_SV_2(["Q38","Q38","Q38","Q38","Q38","Q38"],
            names_y_Analysis,
          names_y_Analysis, 
          ["2020 New"], 6, 
          [d20_y[~d20_y["Q38"].isin([np.nan])]])

Analysis_old_2019 = Line_SV_2(["Q14","Q14","Q14","Q14","Q14","Q14"],
            names_old_Analysis,
          names_old_Analysis, 
          ["2019 Old", "2019 New"], 6, 
          [d19_o[~d19_o["Q14"].isin([np.nan])],d19_y[~d19_y["Q14"].isin([np.nan])]])

Analysis_new_2019 = Line_SV_2(["Q14","Q14","Q14","Q14","Q14","Q14"],
            names_y_Analysis,
          names_y_Analysis, 
          ["2019 New"], 6, 
          [d19_y[~d19_y["Q14"].isin([np.nan])]])

Analysis_old_2018 = Line_SV_2(["Q12_MULTIPLE_CHOICE","Q12_MULTIPLE_CHOICE","Q12_MULTIPLE_CHOICE","Q12_MULTIPLE_CHOICE",
                              "Q12_MULTIPLE_CHOICE","Q12_MULTIPLE_CHOICE"],
            names_old_Analysis_18,
          names_old_Analysis_18, 
          ["2018 Old", "2018 New"], 6, 
          [d18_o[~d18_o["Q12_MULTIPLE_CHOICE"].isin([np.nan])],d18_y[~d18_y["Q12_MULTIPLE_CHOICE"].isin([np.nan])]])

Analysis_new_2018 = Line_SV_2(["Q12_MULTIPLE_CHOICE","Q12_MULTIPLE_CHOICE","Q12_MULTIPLE_CHOICE","Q12_MULTIPLE_CHOICE",
                              "Q12_MULTIPLE_CHOICE","Q12_MULTIPLE_CHOICE"],
            names_y_Analysis_18,
          names_y_Analysis_18, 
          ["2018 New"], 6, 
          [d18_y[~d18_y["Q12_MULTIPLE_CHOICE"].isin([np.nan])]])

Analysis_2021 = pd.concat([Analysis_old_2021, Analysis_new_2021])
AnalysisVsYear = pd.concat([Analysis_2021, Analysis_old_2020, Analysis_new_2020,Analysis_old_2019, Analysis_new_2019,
                           Analysis_old_2018, Analysis_new_2018])


# Use .loc so the assignment modifies AnalysisVsYear itself rather than a temporary copy
AnalysisVsYear.loc[AnalysisVsYear['Mid'] == 'Local or hosted development environments (RStudio, JupyterLab, etc.)', 'Mid'] = 'Local development environments (RStudio, JupyterLab, etc.)'


#AnalysisVsYear.loc(AnalysisVsYear['Mid'] == 'Local or hosted development environments (RStudio, JupyterLab, etc.)')['Mid'] = 'Local development environments (RStudio, JupyterLab, etc.)'


AnalysisVsYear = conv_rel_2(AnalysisVsYear.to_numpy(), {'2021 Old' : 573, '2020 Old' : 419, '2019 Old' : 183, '2018 Old': 465, 
                                       '2021 New' : 13838, '2020 New' : 9771, '2019 New' : 8947, '2018 New': 11179})

AnalysisVsYear = re_name_2(AnalysisVsYear.to_numpy(), {'Local or hosted development environments (RStudio, JupyterLab, etc.)': 'Development Environments',
                                     'Local development environments (RStudio, JupyterLab, etc.)': 'Development Environments',
                                     'Basic statistical software (Microsoft Excel, Google Sheets, etc.)': 'Basic Stat Software',
                                     'Business intelligence software (Salesforce, Tableau, Spotfire, etc.)': 'Business Intelligence',
                                     'Cloud-based data software & APIs (AWS, GCP, Azure, etc.)': 'Cloud',
                                      'Other': 'Other',
                                      'Advanced statistical software (SPSS, SAS, etc.)': 'Advanced Stat Soft'
                                     })

AnalysisVsYear_o, AnalysisVsYear_y = bifurcate_data(AnalysisVsYear.to_numpy())

fig = go.Figure()
add_many_traces(fig, AnalysisVsYear_o.reindex(index=AnalysisVsYear_o.index[::-1]), ['Development Environments', 
                                                                                    'Basic Stat Software', 'Business Intelligence', 'Cloud',
                                                                                    'Other', 'Advanced Stat Soft'],
                ['red', 'green', 'blue', 'black', 'yellow', 'grey', 'purple', 'brown', 'darkgreen'])
fig.update_layout(title_text = "Preferred Data Analysis Tool for Users vs Time by Experienced Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()

fig = go.Figure()
add_many_traces(fig, AnalysisVsYear_y.reindex(index=AnalysisVsYear_y.index[::-1]),['Development Environments', 
                                                                                    'Basic Stat Software', 'Business Intelligence', 'Cloud',
                                                                                    'Other', 'Advanced Stat Soft'], 
                ['red', 'green', 'blue', 'black', 'yellow', 'grey', 'purple', 'brown', 'darkgreen'])
fig.update_layout(title_text = "Preferred Data Analysis Tool for Users vs Time by New Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()



Surprisingly, we notice that the usage of advanced statistics software has declined for both groups. 
Moreover:
1. The use of basic statistics software has risen greatly for both pools.
2. The use of development environments has decreased slightly for experienced users, but significantly for new users.
3. For experienced users, the use of cloud and other software has also declined a little.

Because experienced users are moving towards basic statistics software, it is possible that this software is actually more useful in the long run and has more applications in daily life. This can be inferred because experienced users are likely to have tried most types of software before settling on the one they use most frequently. If that is the case, then it is a good trend that new users are following the experienced users by also using basic software more often (though a more likely explanation is simply that it is easier to use).
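
The rises and declines listed above are read off the line graphs by eye. As a rough sketch, the same comparison could be made numerically by pivoting the notebook's long format (`VarX` = year, `Mid` = tool, `VarY` = share) into one row per tool. The proportions below are made up for illustration and are not survey results.

```python
import pandas as pd

# Made-up proportions in the notebook's long format (not real survey results)
trend = pd.DataFrame(
    [["2018", "Basic Stat Software", 0.10], ["2021", "Basic Stat Software", 0.25],
     ["2018", "Advanced Stat Soft", 0.08], ["2021", "Advanced Stat Soft", 0.03]],
    columns=["VarX", "Mid", "VarY"],
)

# One row per tool, one column per year
wide = trend.pivot(index="Mid", columns="VarX", values="VarY")

# Positive values mean the tool gained share between the first and last year
wide["change"] = wide["2021"] - wide["2018"]
print(wide.sort_values("change", ascending=False))
```

Sorting by `change` puts the fastest-growing tools first, which makes it easier to check statements such as "has risen greatly" against the numbers.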

Now that we have a bit of an overview, we should try to get a bit more information about the 2 groups that will help us in our ensuing analysis.
****
**Expenditure On ML**

We already looked at how data analysis tools are used by each group at a high level. Before moving ahead, we should analyze how the different groups spend money on different products. This will tell us if we made a good decision in forming our categories as well...

In [None]:
Expenditure = Line_SV_2(["Q26","Q26", "Q26","Q26", "Q26","Q26",
          "Q25","Q25", "Q25","Q25", "Q25","Q25",
          "Q11","Q11","Q11","Q11","Q11","Q11",
           "Q26","Q26", "Q26","Q26", "Q26","Q26",
          "Q25","Q25", "Q25","Q25", "Q25","Q25",
          "Q11","Q11","Q11","Q11","Q11","Q11"], 
          ["$0 ($USD)", '$1-$99', '$100-$999', '$1000-$9,999', '$10,000-$99,999', '$100,000 or more ($USD)',
           "$0 ($USD)", '$1-$99', '$100-$999', '$1000-$9,999', '$10,000-$99,999', '$100,000 or more ($USD)',
           "$0 (USD)", '$1-$99', '$100-$999','$1000-$9,999', '$10,000-$99,999', '> $100,000 ($USD)',
           "$0 ($USD)", '$1-$99', '$100-$999', '$1000-$9,999', '$10,000-$99,999', '$100,000 or more ($USD)',
           "$0 ($USD)", '$1-$99', '$100-$999', '$1000-$9,999', '$10,000-$99,999', '$100,000 or more ($USD)',
           "$0 (USD)", '$1-$99', '$100-$999','$1000-$9,999', '$10,000-$99,999', '> $100,000 ($USD)'],
          ['0', '<100', '<1000', '<10000', '<100000', '>100000'], ["2021 Old", "2020 Old", "2019 Old",
                                                                  "2021 New", "2020 New", "2019 New"], 6, 
          [d21_o, d20_o, d19_o, d21_y, d20_y, d19_y])

Expenditure = conv_rel_2(Expenditure.to_numpy(), {'2021 Old' : 573, '2020 Old' : 419, '2019 Old' : 183, '2018 Old': 465, 
                                       '2021 New' : 13838, '2020 New' : 9771, '2019 New' : 8947, '2018 New': 11179})
                         #ident = ['2021 Old', '2021 New'])


fig = px.bar(Expenditure, x = 'Mid', y = 'VarY', color = 'VarX', barmode = "group", labels = {
    '2021 Old': '2021 Experienced', '2020 Old': '2020 Experienced', '2019 Old': '2019 Experienced',
    '2021 New': '2021 New', '2020 New': '2020 New', '2019 New': '2019 New'
})
fig.show()

First, notice that experienced users (labelled "Old") have spent proportionately more money on ML products than new users. This is clearest in the three highest spending categories, where the first 3 bars (experienced users) are much taller than the other 3 bars (new users). In other words, new users don't spend as much on ML as their experienced counterparts.

This suggests that experienced users have most likely experimented with different products, so the tools they currently use are probably the best ones according to their knowledge and practice.
This reinforces the choice of ML experience as our differentiating metric.
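
Comparing groups "proportionately" matters here because the two groups differ hugely in size. The sketch below shows one way raw answer counts could be turned into within-group proportions. The counts are made up for illustration, and `to_proportions` is a hypothetical helper, not the notebook's `conv_rel_2`. Only the two 2021 totals (573 and 13838) are taken from the dictionaries passed to `conv_rel_2` above, assuming these are the group sizes.

```python
import pandas as pd

# 2021 totals from the normalisation dictionaries above (assumed to be group sizes).
GROUP_SIZES = {"2021 Old": 573, "2021 New": 13838}

# Hypothetical raw counts per spending bracket (illustrative numbers only).
counts = pd.DataFrame({
    "VarX": ["2021 Old", "2021 Old", "2021 New", "2021 New"],
    "Mid": ["0", ">100000", "0", ">100000"],
    "VarY": [120, 40, 6000, 150],
})

def to_proportions(df, group_sizes):
    """Divide each count by the size of its group so that groups of different sizes are comparable."""
    out = df.copy()
    out["VarY"] = out["VarY"] / out["VarX"].map(group_sizes)
    return out

shares = to_proportions(counts, GROUP_SIZES)
print(shares)
```

With these toy numbers, the new group has far more respondents in the top bracket in absolute terms (150 vs 40), yet a much smaller share of the group, which is why the bars are compared as proportions.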
****
**Education**

Lastly, before looking at the market, let's look at what education platforms our buyers and sellers have used. This will help us get a full backdrop on why they use the products they do.

In [None]:
# Education Platforms 2021

Ed_2021 = multi_Y_data_MV(["Q40_", "Q40_"], ['Coursera',
'edX',
'Kaggle Learn Courses',
'DataCamp',
'Fast.ai',
'Udacity',
'Udemy',
'LinkedIn Learning',
'Cloud-certification programs',
'University Courses',
'None',
'Other'], 
                          get_var = False, data_list = [d21_o, d21_y], year_list = ["2021 Old", "2021 Young"], 
                       year_wise_skip = None,
                  alternate_name = None, naming = None, has_other = True, cut_off = -1, reformat = False)

#num_old_Deploy_ans = Experiments_2021[Experiments_2021['VarX'].isin(["2021 Old"])]['VarY'].sum()
#num_young_Deploy_ans = Experiments_2021[Experiments_2021['VarX'].isin(["2021 Young"])]['VarY'].sum()


Ed_2021 = conv_rel(["2021 Old", "2021 Young"], Ed_2021)
fig = go.Figure()
fig.add_trace(go.Bar(x = Ed_2021[Ed_2021['VarX'].isin(['2021 Old'])]['Mid'],
                     y = Ed_2021[Ed_2021['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = Ed_2021[Ed_2021['VarX'].isin(['2021 Young'])]['Mid'],
                     y = Ed_2021[Ed_2021['VarX'].isin(['2021 Young'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "Education Platforms Used in 2021", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Education Platform',
                 yaxis_title = 'Proportion of Users')
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()




Overall, the most common platform is Coursera. For new users, Kaggle Learn is the most popular platform.

1. Experienced users have mostly used Coursera.
2. They also relied heavily on university courses.
3. New users, on the other hand, rely most on Kaggle Learn, Coursera, and Udemy.
4. Many more experienced users than new users report using none of these platforms.
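
For readers unfamiliar with multi-select survey questions, the sketch below shows how a share per platform can be computed when each answer option lives in its own column. The toy DataFrame and the helper `multi_select_shares` are hypothetical. The column names only mimic the `Q40_Part_N` pattern used in this notebook, and the assumption is that an unselected option is stored as NaN.

```python
import numpy as np
import pandas as pd

# Toy survey: one column per answer option; a cell holds the option text if selected, NaN otherwise.
responses = pd.DataFrame({
    "Q40_Part_1": ["Coursera", np.nan, "Coursera", "Coursera"],
    "Q40_Part_3": [np.nan, "Kaggle Learn Courses", "Kaggle Learn Courses", np.nan],
    "Q40_Part_11": [np.nan, np.nan, np.nan, np.nan],
})

def multi_select_shares(df, prefix):
    """Share of respondents who selected each option of a multi-select question."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    return df[cols].notna().sum() / len(df)

platform_shares = multi_select_shares(responses, "Q40_Part_")
print(platform_shares)
# The shares may sum to more than 1, because one respondent can select several platforms.
```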


In [None]:
# Graph all except linkedin and cloud because of lack of data

EdVsYear = year_relator(["Q40_Part_1", "Q40_Part_2", "Q40_Part_3", "Q40_Part_4", "Q40_Part_5", "Q40_Part_6", 
                         "Q40_Part_7", "Q40_Part_8", "Q40_Part_9", "Q40_Part_10", "Q40_Part_11",
                         "Q40_Part_1", "Q40_Part_2", "Q40_Part_3", "Q40_Part_4", "Q40_Part_5", "Q40_Part_6", 
                         "Q40_Part_7", "Q40_Part_8", "Q40_Part_9", "Q40_Part_10", "Q40_Part_11",
              "Q37_Part_1", "Q37_Part_2", "Q37_Part_3", "Q37_Part_4", "Q37_Part_5", "Q37_Part_6", 
                         "Q37_Part_7", "Q37_Part_8", "Q37_Part_9", "Q37_Part_10", "Q37_Part_11",
              "Q37_Part_1", "Q37_Part_2", "Q37_Part_3", "Q37_Part_4", "Q37_Part_5", "Q37_Part_6", 
                         "Q37_Part_7", "Q37_Part_8", "Q37_Part_9", "Q37_Part_10", "Q37_Part_11",
               "Q13_Part_2", "Q13_Part_3", "Q13_Part_6", "Q13_Part_4", "Q13_Part_7", "Q13_Part_1", 
                         "Q13_Part_8", "Q13_Part_9", "Q13_Part_12", "Q13_Part_10", "Q13_Part_11",
               "Q13_Part_2", "Q13_Part_3", "Q13_Part_6", "Q13_Part_4", "Q13_Part_7", "Q13_Part_1", 
                         "Q13_Part_8", "Q13_Part_9", "Q13_Part_12", "Q13_Part_10", "Q13_Part_11",
                "Q36_Part_2", "Q36_Part_3", "Q36_Part_6", "Q36_Part_4", "Q36_Part_7", "Q36_Part_1", 
                         "Q36_Part_9", "Q36_Part_13", "Q36_Part_13", "Q36_Part_11", "Q36_Part_12",
                "Q36_Part_2", "Q36_Part_3", "Q36_Part_6", "Q36_Part_4", "Q36_Part_7", "Q36_Part_1", 
                         "Q36_Part_9", "Q36_Part_13", "Q36_Part_13", "Q36_Part_11", "Q36_Part_12"], 
             ["2021 Old", "2021 New", "2020 Old", "2020 New", "2019 Old", "2019 New", "2018 Old", "2018 New"], 
             ['Coursera', 'EdX', 'Kaggle', 'DataCamp', 'Fast.ai', 'Udacity', 'Udemy', 'Linkedin', 'Cloud',
             'University', 'None'], 
             [d21_o, d21_y, d20_o, d20_y, d19_o, d19_y, d18_o, d18_y], 11)

EdVsYear_o, EdVsYear_y = bifurcate_data(EdVsYear)


EdVsYear_o = pd.DataFrame(reformat_yearwise_rel(EdVsYear_o, {'2021 Old' : 573, '2020 Old' : 419, '2019 Old' : 183, '2018 Old': 465},
                              '2021 Old') , columns = ['VarX', 'Mid', 'VarY'])
EdVsYear_y = pd.DataFrame(reformat_yearwise_rel(EdVsYear_y, {'2021 New' : 13838, '2020 New' : 9771, '2019 New' : 8947, '2018 New': 11179},
                              '2021 New'), columns = ['VarX', 'Mid', 'VarY'])
EdVsYear_o = EdVsYear_o.loc[EdVsYear_o['Mid'] != 'Linkedin']
EdVsYear_o = EdVsYear_o.loc[EdVsYear_o['Mid'] != 'Cloud']
EdVsYear_y = EdVsYear_y.loc[EdVsYear_y['Mid'] != 'Linkedin']
EdVsYear_y = EdVsYear_y.loc[EdVsYear_y['Mid'] != 'Cloud']

fig = go.Figure()
add_many_traces(fig, EdVsYear_o.reindex(index=EdVsYear_o.index[::-1]), ['Coursera', 'EdX', 'Kaggle', 'DataCamp', 'Fast.ai', 'Udacity', 'Udemy',
             'University', 'None'],
                ['red', 'grey', 'blue', 'grey', 'grey', 'grey', 'lightgreen', 'yellow', 'grey'])
fig.update_layout(title_text = "Education Platform vs Time by Experienced Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()

fig = go.Figure()
add_many_traces(fig, EdVsYear_y.reindex(index=EdVsYear_y.index[::-1]),['Coursera', 'EdX', 'Kaggle', 'DataCamp', 'Fast.ai', 'Udacity', 'Udemy',
             'University', 'None'], 
                ['red', 'grey', 'blue', 'grey', 'grey', 'grey', 'lightgreen', 'yellow', 'grey'])
fig.update_layout(title_text = "Education Platform vs Time by New Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()


In the graphs above, most of the grey lines show decreasing usage over time.
On the other hand:
1. University education has increased significantly for both groups.
2. For new users, Kaggle has had the largest increase as an education platform.
3. Coursera has remained almost constant for experienced users, but has declined significantly for new users.
4. Udemy has remained almost constant for both groups and has maintained a significant share of the education market.
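
Statements such as "largest increase" can also be read off numerically rather than from the lines. The following is a minimal sketch, using hypothetical proportions in the same `VarX` / `Mid` / `VarY` layout as the plots, that ranks platforms by their change between the first and last survey year.

```python
import pandas as pd

# Hypothetical long-format data (illustrative values only).
trend = pd.DataFrame({
    "VarX": ["2018 New", "2021 New", "2018 New", "2021 New"],
    "Mid": ["Kaggle", "Kaggle", "Coursera", "Coursera"],
    "VarY": [0.20, 0.30, 0.50, 0.40],
})

# Extract the survey year from labels such as "2018 New".
trend["Year"] = trend["VarX"].str.slice(0, 4).astype(int)

# One row per platform, one column per year, then the last-minus-first difference.
wide = trend.pivot(index="Mid", columns="Year", values="VarY")
change = (wide[wide.columns.max()] - wide[wide.columns.min()]).sort_values(ascending=False)
print(change)
```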

****
**First Things First - Programming Languages**

How can we start without looking at how the usage of Programming Languages has evolved over time?

The following graph illustrates the *share* of ML users for each programming language in 2021.

In [None]:
Lang2021 = multi_Y_data_MV(["Q7_", "Q7_"], ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", "TypeScript", "Bash", "MATLAB", "None", 
                 "Other"], get_var = False, data_list = [d21_o, d21_y], year_list = ["2021 Old", "2021 Young"], 
                       year_wise_skip = None,
                  alternate_name = None, naming = None, has_other = False, cut_off = 1, reformat = False)

Lang2021 = conv_rel(["2021 Old", "2021 Young"], Lang2021)

fig = go.Figure()
fig.add_trace(go.Bar(x = Lang2021[Lang2021['VarX'].isin(['2021 Old'])]['Mid'],
                     y = Lang2021[Lang2021['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = Lang2021[Lang2021['VarX'].isin(['2021 Young'])]['Mid'],
                     y = Lang2021[Lang2021['VarX'].isin(['2021 Young'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "Programming Language Usage for Experienced and New Users", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Programming Language',
                 yaxis_title = 'Share of Users',
                 )
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()

In this graph, there are 5 key observations:

1. Python has the most users in both groups. Python's share is proportionately larger among newer users than among experienced users. This is most likely because experienced users also know many other languages, which lowers Python's relative share.
2. Proportionately more experienced users than new users use R. In fact, the proportion of experienced users who know R is twice that of new users.
3. Experienced users are proportionately more proficient in MATLAB than newer users. We can infer this because the bar for MATLAB is 3.5 times taller for experienced users than for new users.
4. The listed languages cover the majority of languages that both experienced and new ML practitioners use, since the bar for "Other" languages is negligible (it shows 0.00 here, but it is non-zero).
5. Some people in both groups do not use any programming language. This may be because they use statistics software such as Excel. More surprisingly, there are proportionately more experienced users in this category than new users. This is a positive trend, because it suggests that more of the newer practitioners are learning programming languages.
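
Comparisons like "twice" or "3.5 times taller" are simply ratios of the two groups' shares. The sketch below illustrates the calculation with hypothetical values chosen to reproduce those ratios. The real shares come from the bar chart and are not reproduced here.

```python
import pandas as pd

# Hypothetical shares per language (illustrative values only).
lang = pd.DataFrame({
    "Mid": ["Python", "R", "MATLAB"],
    "Experienced": [0.80, 0.30, 0.14],
    "New": [0.85, 0.15, 0.04],
}).set_index("Mid")

# A ratio above 1 means the language is relatively more common among experienced users.
lang["Ratio"] = lang["Experienced"] / lang["New"]
print(lang.round(2))
```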

Let's now plot the trend in programming languages for MATLAB, Python and R over time:

In [None]:
LangVsYear = year_relator(["Q7_Part_1", "Q7_Part_2", "Q7_Part_11", "Q7_Part_1", "Q7_Part_2", "Q7_Part_11",
              "Q7_Part_1", "Q7_Part_2", "Q7_Part_11", "Q7_Part_1", "Q7_Part_2", "Q7_Part_11",
              "Q18_Part_1", "Q18_Part_2", "Q18_Part_10", "Q18_Part_1", "Q18_Part_2", "Q18_Part_10",
             "Q16_Part_1", "Q16_Part_2", "Q16_Part_9", "Q16_Part_1", "Q16_Part_2", "Q16_Part_9"
             ], 
             ["2021 Old", "2021 New", "2020 Old", "2020 New", "2019 Old", "2019 New", "2018 Old", "2018 New"], 
             ["Python", "R", "MATLAB"], 
             [d21_o, d21_y, d20_o, d20_y, d19_o, d19_y, d18_o, d18_y], 3)

LangVsYear_o, LangVsYear_y = bifurcate_data(LangVsYear)

LangVsYear_o = pd.DataFrame(reformat_yearwise_rel(LangVsYear_o, {'2021 Old' : 573, '2020 Old' : 419, '2019 Old' : 183, '2018 Old': 465},
                              '2021 Old') , columns = ['VarX', 'Mid', 'VarY'])
LangVsYear_y = pd.DataFrame(reformat_yearwise_rel(LangVsYear_y, {'2021 New' : 13838, '2020 New' : 9771, '2019 New' : 8947, '2018 New': 11179},
                              '2021 New'), columns = ['VarX', 'Mid', 'VarY'])

fig = go.Figure()
add_many_traces(fig, LangVsYear_o.reindex(index=LangVsYear_o.index[::-1]), ["Python", "R", "MATLAB"], ['red', 'green', 'blue'])
fig.update_layout(title_text = "Programming Language Usage vs Time for Experienced Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()

fig = go.Figure()
add_many_traces(fig, LangVsYear_y.reindex(index=LangVsYear_y.index[::-1]), ["Python", "R", "MATLAB"], ['red', 'green', 'blue'])
fig.update_layout(title_text = "Programming Language Usage vs Time for New Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()


We see that the % of users using Python has been increasing over time for new practitioners, whereas it is more or less constant for experienced practitioners. Moreover, the % of users using Python is very high. On the other hand, the % of users using MATLAB and R has slightly decreased over time for both groups. In particular, R has had the largest decrease among new practitioners.

Since experienced users are also moving away from R, it may be that they find the features of Python more attractive (such as its powerful ML libraries). (Sorry for the tough love, R!) The same pattern is seen for MATLAB.

Unless R and MATLAB gain comparable ML libraries, Python will likely keep increasing its share over time, as the trend indicates.
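
To put a number on "increasing" or "decreasing", one simple option is the slope of a least-squares line through the yearly shares. This is only a sketch with hypothetical proportions, and four data points are far too few for a reliable forecast, so the slope should be read as a rough summary rather than a prediction.

```python
import numpy as np

# Years measured from the first survey used here (0 = 2018, ..., 3 = 2021).
years = np.array([2018, 2019, 2020, 2021]) - 2018

# Hypothetical proportions of users per year (illustrative only).
yearly_shares = {
    "Python": np.array([0.78, 0.80, 0.83, 0.85]),
    "R": np.array([0.30, 0.26, 0.22, 0.18]),
}

# np.polyfit(..., 1) returns [slope, intercept]; the slope is the average change per year.
slopes = {name: np.polyfit(years, y, 1)[0] for name, y in yearly_shares.items()}
for name, slope in slopes.items():
    print(f"{name}: {slope:+.3f} per year")
```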



*Hopefully, you are now accustomed to the analysis style. Let's dive deeper*

****
**What about the IDEs?**

We know how the programming languages have changed over time, but what about the IDEs in which they are written?

Let's plot their share of users in 2021.

In [None]:
# IDE Vs Time

IDE2021 = multi_Y_data_MV(["Q9_", "Q9_"], ["JupyterLab", "RStudio", "Visual Studio", 
                                           "VS Code", "PyCharm", "Spyder", "Notepad++", 
                                           "Sublime Text", "Vim", "MATLAB", "Jupyter Notebook", "None",
                                           "Other"], 
                          get_var = False, data_list = [d21_o, d21_y], year_list = ["2021 Old", "2021 Young"], 
                       year_wise_skip = None,
                  alternate_name = None, naming = None, has_other = True, cut_off = -1, reformat = False)

IDE2021 = conv_rel(["2021 Old", "2021 Young"], IDE2021)
fig = go.Figure()
fig.add_trace(go.Bar(x = IDE2021[IDE2021['VarX'].isin(['2021 Old'])]['Mid'],
                     y = IDE2021[IDE2021['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = IDE2021[IDE2021['VarX'].isin(['2021 Young'])]['Mid'],
                     y = IDE2021[IDE2021['VarX'].isin(['2021 Young'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "IDE Usage for Experienced and New Users", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'IDE Platform',
                 yaxis_title = 'Proportion of Users')
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()

Unsurprisingly, Jupyter Notebook is the most used platform in both groups, and, equally unsurprisingly, the newer generation uses it more often. We also notice some other features:

1. VS Code and PyCharm are much more popular among newer users than among the older generation.
2. RStudio is more popular among experienced users. This agrees with what we found out for the language R: it only makes sense that its IDE is less used by newer users if the language itself is less used by them. (But I used RStudio :O)
3. Vim is much more popular among experienced users. This may be because it was already established when they started out, so they are more used to it.
4. Very few people don't use an IDE at all.

Let's focus on how Jupyter products, RStudio, Visual Studio and Vim have changed over time, as these show the most prominent differences.

In [None]:
# Focus on Jupyter, RStudio, Visual Studio, Vim

IDEvsYear_2120 = year_relator(["Q9_Part_1", "Q9_Part_2", "Q9_Part_3", "Q9_Part_4", "Q9_Part_9", "Q9_Part_11", "Q9_Part_1", "Q9_Part_2", "Q9_Part_3", "Q9_Part_4", "Q9_Part_9","Q9_Part_11",
              "Q9_Part_1", "Q9_Part_2", "Q9_Part_3", "Q9_Part_4", "Q9_Part_9", "Q9_Part_11", "Q9_Part_1", "Q9_Part_2", "Q9_Part_3", "Q9_Part_4", "Q9_Part_9","Q9_Part_11"],
             #"Q13_Part_1", "Q13_Part_2", "Q13_Part_4", "Q13_Part_8", "Q13_Part_11", "Q13_Part_1", "Q13_Part_2", "Q13_Part_4", "Q13_Part_8","Q13_Part_11"
             #], 
             ["2021 Old", "2021 New", "2020 Old", "2020 New"], 
                         # "2018 Old", "2018 New"], 
             ["Jupyter Lab", "Jupyter Notebooks", "RStudio", "Visual Studio", "Visual Studio Code", "Vim"], 
             [d21_o, d21_y, d20_o, d20_y], 
              # d19_o, d19_y, d18_o, d18_y], 
              6)

IDEvsYear_18 = year_relator(["Q13_Part_1", "Q13_Part_2", "Q13_Part_4", "Q13_Part_8", "Q13_Part_11", "Q13_Part_1", "Q13_Part_2", "Q13_Part_4", "Q13_Part_8", "Q13_Part_11"],
                         ["2018 Old", "2018 New"], 
             ["Jupyter Lab", "RStudio", "Visual Studio Code", "Visual Studio", "Vim"], 
              [d18_o, d18_y], 
              5)

IDEvsYear_19 = year_relator(["Q16_Part_1", "Q16_Part_2", "Q16_Part_6", "Q16_Part_8","Q16_Part_1", "Q16_Part_2", "Q16_Part_6", "Q16_Part_8",],
                         ["2019 Old", "2019 New"], 
             ["Jupyter Lab", "RStudio", "Visual Studio", "Vim"], 
              [d19_o, d19_y], 
              4)

IDEvsYear = pd.concat([pd.DataFrame(IDEvsYear_2120, columns = ["VarX", 'Mid', 'VarY']), 
                       pd.DataFrame(IDEvsYear_19, columns = ["VarX", 'Mid', 'VarY']),
                       pd.DataFrame(IDEvsYear_18, columns = ["VarX", 'Mid', 'VarY'])])
IDEvsYear = comb_mids(IDEvsYear.to_numpy(), [["Visual Studio","Visual Studio Code"], ["Jupyter Lab", "Jupyter Notebooks"]],
                     ["Visual Studio Code", "Jupyter Notebooks"], ["2021 Old", "2020 Old","2021 New", "2020 New"])
IDEvsYear = comb_mids(IDEvsYear, [["Visual Studio","Visual Studio Code"]],
                     ["Visual Studio Code"], ["2018 Old", "2018 New"])


IDEVsYear_o, IDEVsYear_y = bifurcate_data(IDEvsYear)
IDEVsYear_o["VarY"] = IDEVsYear_o["VarY"].astype(str).astype(float)
IDEVsYear_y["VarY"] = IDEVsYear_y["VarY"].astype(str).astype(float)

IDEVsYear_o = pd.DataFrame(reformat_yearwise_rel(IDEVsYear_o, {'2021 Old' : 573, '2020 Old' : 419, '2019 Old' : 183, '2018 Old': 465},
                              '2021 Old') , columns = ['VarX', 'Mid', 'VarY'])

IDEVsYear_o = IDEVsYear_o.sort_values('VarX', ascending = False)

IDEVsYear_y = pd.DataFrame(reformat_yearwise_rel(IDEVsYear_y, {'2021 New' : 13838, '2020 New' : 9771, '2019 New' : 8947, '2018 New': 11179},
                              '2021 New'), columns = ['VarX', 'Mid', 'VarY'])
IDEVsYear_y = IDEVsYear_y.sort_values('VarX', ascending = False)

fig = go.Figure()
add_many_traces(fig, IDEVsYear_o.reindex(index=IDEVsYear_o.index[::-1]), ["Jupyter Lab", "RStudio", "Visual Studio", "Vim"], ['red', 'green', 'blue', 'black'])
fig.update_layout(title_text = "IDE Used vs Time by Experienced Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()

fig = go.Figure()
add_many_traces(fig, IDEVsYear_y.reindex(index=IDEVsYear_y.index[::-1]), ["Jupyter Lab", "RStudio", "Visual Studio", "Vim"], ['red', 'green', 'blue', 'black'])
fig.update_layout(title_text = "IDE Used vs Time by New Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()


We notice a sudden spike in the share of Vim users in both categories in recent years.
On average, Visual Studio and Jupyter Lab have stayed roughly constant in % of users.
RStudio has been steadily declining in recent years.

For new users, there has been a large fall in Jupyter Lab usage. This is most likely a statistical discrepancy, as it looks like an outlier.
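
One way to judge whether a year-on-year change could be sampling noise is a two-proportion z-test. The sketch below is an illustration under assumptions: the shares are hypothetical, `two_proportion_z` is a helper written for this example, and each year's respondents are treated as independent random samples, which survey data may not satisfy. The group sizes are those listed in the dictionaries passed to `reformat_yearwise_rel` above.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference between two independent sample proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Same hypothetical 5-point change (30% -> 35%) in both groups, 2019 vs 2020.
z_old = two_proportion_z(0.30, 183, 0.35, 419)    # experienced: 183 and 419 respondents
z_new = two_proportion_z(0.30, 8947, 0.35, 9771)  # new: 8947 and 9771 respondents

print(f"experienced: z = {z_old:.2f}")
print(f"new:         z = {z_new:.2f}")
# |z| above roughly 1.96 means the change is unlikely to be pure sampling noise at the 5% level.
```

Because the experienced group is small, the same change that is clearly distinguishable from noise for new users is not distinguishable for experienced users. A large |z| still does not rule out other causes, such as a change in how the question was worded between years.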

****

**Seeing that Notebooks are Very Popular, which Notebook is the most common?**

Let's now see how the notebook platforms have changed their number of users.

In [None]:
# Notebook Usage in 2021

Notebooks2021 = multi_Y_data_MV(["Q10_", "Q10_"], ['Kaggle Notebooks',
'Colab Notebooks',
'Azure Notebooks',
'Paperspace/Gradient',
'Binder/JupyterHub',
'Code Ocean',
'IBM Watson Studio',
'Amazon Sagemaker Notebooks',
'Amazon EMR Notebooks',
'Google Cloud Notebooks',
'Google Cloud Datalab',
'Databricks Colab Notebooks',
'Zeppelin / Zepl Notebooks',
'Deepnote Notebooks',
'Observable Notebooks',
'None',
'Other'], 
                          get_var = False, data_list = [d21_o, d21_y], year_list = ["2021 Old", "2021 Young"], 
                       year_wise_skip = None,
                  alternate_name = None, naming = None, has_other = True, cut_off = -1, reformat = False)

Notebooks2021 = conv_rel(["2021 Old", "2021 Young"], Notebooks2021)
fig = go.Figure()
fig.add_trace(go.Bar(x = Notebooks2021[Notebooks2021['VarX'].isin(['2021 Old'])]['Mid'],
                     y = Notebooks2021[Notebooks2021['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = Notebooks2021[Notebooks2021['VarX'].isin(['2021 Young'])]['Mid'],
                     y = Notebooks2021[Notebooks2021['VarX'].isin(['2021 Young'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "Notebooks Usage in 2021 for Experienced and New Users", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Notebook',
                 yaxis_title = 'Proportion of Users')
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()

The notebook space is dominated by Google Colab and Kaggle for both categories. However, we also see that many people don't use any notebooks (consistent with the previous data, which showed that many people don't use notebooks). Other features we observe:

1. Kaggle Notebooks are much more common among newer users.
2. Colab Notebooks are also more common among newer users, but the difference is much smaller than for Kaggle Notebooks.
3. Many more experienced users than new users use no notebook at all.
4. All other notebooks are used much less than Kaggle and Colab.

But, how has this composition changed over time? Is Colab gaining pace or is Kaggle?

In [None]:
# Focus on Kaggle, Colab, Azure, None

NbVsYear = year_relator(["Q10_Part_1", "Q10_Part_2", "Q10_Part_3", "Q10_Part_16", "Q10_Part_1", "Q10_Part_2", "Q10_Part_3", "Q10_Part_16",
              "Q10_Part_1", "Q10_Part_2", "Q10_Part_3", "Q10_Part_13", "Q10_Part_1", "Q10_Part_2", "Q10_Part_3", "Q10_Part_13",
              "Q17_Part_1", "Q17_Part_2", "Q17_Part_3", "Q17_Part_11", "Q17_Part_1", "Q17_Part_2", "Q17_Part_3", "Q17_Part_11",
             "Q14_Part_1", "Q14_Part_2", "Q14_Part_3", "Q14_Part_10", "Q14_Part_1", "Q14_Part_2", "Q14_Part_3", "Q14_Part_10"
             ], 
             ["2021 Old", "2021 New", "2020 Old", "2020 New", "2019 Old", "2019 New", "2018 Old", "2018 New"], 
             ["Kaggle", "Colab", "Azure", "None"], 
             [d21_o, d21_y, d20_o, d20_y, d19_o, d19_y, d18_o, d18_y], 4)

NbVsYear_o, NbVsYear_y = bifurcate_data(NbVsYear)


NbVsYear_o = pd.DataFrame(reformat_yearwise_rel(NbVsYear_o, {'2021 Old' : 573, '2020 Old' : 419, '2019 Old' : 183, '2018 Old': 465},
                              '2021 Old') , columns = ['VarX', 'Mid', 'VarY'])
NbVsYear_y = pd.DataFrame(reformat_yearwise_rel(NbVsYear_y, {'2021 New' : 13838, '2020 New' : 9771, '2019 New' : 8947, '2018 New': 11179},
                              '2021 New'), columns = ['VarX', 'Mid', 'VarY'])

fig = go.Figure()
add_many_traces(fig, NbVsYear_o.reindex(index=NbVsYear_o.index[::-1]), ["Kaggle", "Colab", "Azure", "None"], ['red', 'green', 'blue', 'black'])
fig.update_layout(title_text = "Notebooks Used vs Time by Experienced Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()

fig = go.Figure()
add_many_traces(fig, NbVsYear_y.reindex(index=NbVsYear_y.index[::-1]), ["Kaggle", "Colab", "Azure", "None"], ['red', 'green', 'blue', 'black'])
fig.update_layout(title_text = "Notebooks Used vs Time by New Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()


We see that the share of people using no notebook at all is decreasing over time for both groups, i.e., people are using notebooks more often.
1. The fastest rise in % of users is for Colab in both categories.
2. For experienced users, the % using Kaggle has remained almost constant.
3. New users, on the other hand, are using Kaggle Notebooks more as time progresses, although the rate of change is not too steep.
4. I also graphed MS Azure as a sample of the other notebooks. Its user % is decreasing over time for both groups, but at a very slow rate.

This suggests that Colab will grow in numbers over the next few years. Moreover, it may be a good decision to shift to Colab notebooks, as more experienced users are choosing to use them. It is a positive trend that new users are following in their footsteps. Despite this competition, the fact that Kaggle Notebooks usage is almost constant among experienced users indicates that they may also be a good choice for new users.
****
**Notebooks often come with TPUs. What about those?**

Well, let's actually take a look at how often TPUs are used today. Note that TPU stands for Tensor Processing Unit.




In [None]:
# TPU Usage

TPUVsYear = Line_SV_2(["Q13","Q13","Q13","Q13","Q13",
            "Q13","Q13","Q13","Q13","Q13",
            "Q22","Q22","Q22","Q22","Q22",
            "Q13","Q13","Q13","Q13","Q13",
            "Q13","Q13","Q13","Q13","Q13",
            "Q22","Q22","Q22","Q22","Q22"],
            ['Never', 'Once', '2-5 times', '6-25 times', 'More than 25 times',
           'Never', 'Once', '2-5 times', '6-25 times', 'More than 25 times',
           'Never', 'Once', '2-5 times', '6-24 times', '> 25 times',
           'Never', 'Once', '2-5 times', '6-25 times', 'More than 25 times',
           'Never', 'Once', '2-5 times', '6-25 times', 'More than 25 times',
           'Never', 'Once', '2-5 times', '6-24 times', '> 25 times'],
          ['Never', 'Once', '2-5 times', '6-25 Times', '> 25 Times'], 
          ["2021 Old", "2020 Old", "2019 Old", "2021 New", "2020 New", "2019 New"], 5, 
          [d21_o[~d21_o["Q13"].isin([np.nan])], d20_o[~d20_o["Q13"].isin([np.nan])], d19_o[~d19_o["Q22"].isin([np.nan])], 
           d21_y[~d21_y["Q13"].isin([np.nan])], d20_y[~d20_y["Q13"].isin([np.nan])], d19_y[~d19_y["Q22"].isin([np.nan])]])

TPUVsYear = conv_rel_2(TPUVsYear.to_numpy(), {'2021 Old' : 573, '2020 Old' : 419, '2019 Old' : 183, '2018 Old': 465, 
                                       '2021 New' : 13838, '2020 New' : 9771, '2019 New' : 8947, '2018 New': 11179})
                         #ident = ['2021 Old', '2021 New'])
TPUVsYear_o, TPUVsYear_y = bifurcate_data(TPUVsYear.to_numpy())

TPUVsYear_2021 = TPUVsYear[TPUVsYear["VarX"].isin(['2021 Old', '2021 New'])]
fig = go.Figure()
fig.add_trace(go.Bar(x = TPUVsYear_2021[TPUVsYear_2021['VarX'].isin(['2021 Old'])]['Mid'],
                     y = TPUVsYear_2021[TPUVsYear_2021['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = TPUVsYear_2021[TPUVsYear_2021['VarX'].isin(['2021 New'])]['Mid'],
                     y = TPUVsYear_2021[TPUVsYear_2021['VarX'].isin(['2021 New'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "TPU Usage in 2021 for Experienced and New Users", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Frequency of TPU Usage',
                 yaxis_title = 'Proportion of Users')
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()

From the graph, we see that the proportion of users who have never used a TPU is very high (the proportions here are simple percentages because the answer options are mutually exclusive). There are two likely reasons: TPUs are specialized hardware that most people cannot access in person, and they were designed around TensorFlow, Google's ML framework. Not everyone uses TensorFlow, so not everyone has a reason to use TPUs (we'll see this later :)). We also observe the following:

1. The proportion of people who have never used TPUs is higher for new users than for experienced users.
2. The proportion of users who have used TPUs 6 or more times is higher for experienced users than for new users.
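
To make the "mutually exclusive" remark concrete, here is a minimal sketch of how shares for a single-choice question can be computed with pandas. The answers below are made-up toy values, not survey data, and this is not necessarily how the notebook's own helpers compute their numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy answers to a single-choice question (NaN = question skipped).
answers = pd.Series(
    ["Never", "Never", "Once", "2-5 times", np.nan, "Never", "Once", np.nan],
    name="Q13",
)

# Drop respondents who skipped the question, then normalise the counts
# so that each option is expressed as a share of those who answered.
answered = answers.dropna()
proportions = answered.value_counts(normalize=True)

# Each respondent picks exactly one option, so the shares sum to 1.
total = proportions.sum()

print(proportions)
print("total:", total)
```

Because every respondent contributes to exactly one bar, the bars of a single-choice question can be read directly as percentages of the group.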

As before, we also want to know how TPU usage is changing over time in the two groups, so let's plot it. Because the full graph is hard to read, only the key lines are highlighted in color. Hover over the grey lines to see what they represent.

In [None]:
# TPU Line Graphs

fig = go.Figure()
add_many_traces(fig, TPUVsYear_o.reindex(index=TPUVsYear_o.index[::-1]), ['Never', 'Once', '2-5 times', '6-25 Times', '> 25 Times'],
                ['red', 'grey', 'blue', 'grey', 'green'])
fig.update_layout(title_text = "TPU Usage Frequency vs Time by Experienced Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()

fig = go.Figure()
add_many_traces(fig, TPUVsYear_y.reindex(index=TPUVsYear_y.index[::-1]), ['Never', 'Once', '2-5 times', '6-25 Times', '> 25 Times'], 
                ['red', 'grey', 'blue', 'grey', 'green'])
fig.update_layout(title_text = "TPU Usage Frequency vs Time by New Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()


Notice that most people in both groups have still never used a TPU. But we see that:
1. The % of people who have never used TPUs is decreasing every year for both groups, at approximately the same rate.
2. The % of people who have just started using TPUs (2-5 times) is increasing for both groups at similar rates.
3. Usage at the highest level (> 25 times) is growing very slowly for both new and experienced users. In fact, the increase is negligible.

The 2nd point indicates that people are exploring TPUs quite openly, since they try them more than once. For experienced users, there are 2 possible explanations for the 3rd point:
Firstly, they may not find TPUs particularly useful, so they stop using them after their initial experimentation.
Alternatively, they may not be doing enough TPU projects per year to move from the 2-5 category to the > 25 category.

The second explanation certainly holds for the new users because, after 3 years, they move into the group of users with 3-10 years of experience, even if they do enough projects. This is a drawback of the grouping used in the analysis. Nevertheless, we learn 2 possible things from this:
 - Either group uses TPUs in only a small share of its projects (or the projects themselves are long)
 OR
 - Both groups do not find practical use for TPUs after some tries.

The second possibility is less likely, though, because TPUs are said to improve performance by 100 times compared with CPUs.
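
Claims such as "the rate of decrease is approximately the same" can be checked numerically rather than by eye. The sketch below fits a straight line to each group's yearly share and compares the slopes. The shares are hypothetical placeholders chosen for illustration, and `yearly_slope` is a helper introduced here, not one of the notebook's functions.

```python
import numpy as np

years = np.array([2019, 2020, 2021])

# Hypothetical shares of "Never used a TPU" per year, for illustration only.
never_experienced = np.array([0.80, 0.76, 0.72])
never_new = np.array([0.90, 0.86, 0.82])


def yearly_slope(years, shares):
    """Least-squares slope of share against year (change in share per year)."""
    # Shift the years so the fit starts at 0; this keeps the fit well conditioned.
    x = years - years[0]
    slope, _intercept = np.polyfit(x, shares, deg=1)
    return slope


slope_experienced = yearly_slope(years, never_experienced)
slope_new = yearly_slope(years, never_new)

print(f"experienced: {slope_experienced:+.3f} per year")
print(f"new:         {slope_new:+.3f} per year")
```

With only three survey years a fitted slope is a rough summary at best, so it is better used to compare the two groups than to forecast future usage.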

****
**Hold On! Did you say Tensorflow?**

In the field of Data Science, ML frameworks have expanded greatly, and many people use them often, even people with little ML experience. This has widened access to the field. It is natural, then, to check the market share of the ML frameworks out there.


In [None]:
# ML Frameworks 2021

Frame_2021 = multi_Y_data_MV(["Q16_", "Q16_"], ['Scikit-learn',
'TensorFlow',
'Keras',
'PyTorch',
'Fast.ai',
'MXNet',
'Xgboost',
'LightGBM',
'CatBoost',
'Prophet',
'H2O 3',
'Caret',
'Tidymodels',
'JAX',
'PyTorch Lightning',
'Huggingface',
'None',
'Other'], 
                          get_var = False, data_list = [d21_o, d21_y], year_list = ["2021 Old", "2021 Young"], 
                       year_wise_skip = None,
                  alternate_name = None, naming = None, has_other = True, cut_off = -1, reformat = False)

Frame_2021 = conv_rel(["2021 Old", "2021 Young"], Frame_2021)
fig = go.Figure()
fig.add_trace(go.Bar(x = Frame_2021[Frame_2021['VarX'].isin(['2021 Old'])]['Mid'],
                     y = Frame_2021[Frame_2021['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = Frame_2021[Frame_2021['VarX'].isin(['2021 Young'])]['Mid'],
                     y = Frame_2021[Frame_2021['VarX'].isin(['2021 Young'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "Machine Learning Frameworks Used in 2021", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Machine Learning Framework',
                 yaxis_title = 'Proportion of Users')
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()

In the graph above, we see that the space is dominated by Scikit-learn, TensorFlow, Keras, Xgboost and PyTorch.
So, there is a large variety of options, each holding a substantial share of the ML framework market. Moreover:

1. Proportionately more new users than experienced users use Scikit-learn, TensorFlow and Keras. The disparity is largest for Scikit-learn. But note that this is market share: the difference most likely arises because experienced users use multiple frameworks (see point 3).
2. On the other hand, Xgboost and PyTorch are preferred very slightly more by veteran users.
3. It is more common for experienced users to use relatively uncommon frameworks, which means that they are more likely than new users to learn and use multiple frameworks.
4. New users are more likely than veterans to use no framework at all.

Point 3 tells us that almost all the popular frameworks gain considerable traction from experienced users. This is likely to continue, meaning that as new users gain experience, they will explore more frameworks.
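
Unlike the TPU question, the framework question is multi-select: the survey stores one column per option (for example `Q16_Part_1`, `Q16_Part_2`), and a respondent can tick several. A minimal sketch of how such shares can be computed is shown below, using a made-up four-person frame. It is an illustration under that assumption, not the notebook's `multi_Y_data_MV` helper.

```python
import numpy as np
import pandas as pd

# Hypothetical toy responses: one column per option, NaN when not ticked.
df = pd.DataFrame({
    "Q16_Part_1": ["Scikit-learn", "Scikit-learn", np.nan, "Scikit-learn"],
    "Q16_Part_2": ["TensorFlow", np.nan, np.nan, "TensorFlow"],
    "Q16_Part_4": [np.nan, "PyTorch", np.nan, "PyTorch"],
})

n_respondents = len(df)

# Share of respondents who ticked each option.
share = df.notna().sum() / n_respondents
share.index = ["Scikit-learn", "TensorFlow", "PyTorch"]

# Average number of frameworks ticked per respondent.
frameworks_per_user = df.notna().sum(axis=1).mean()

print(share)
print("sum of shares:", share.sum())
print("frameworks per user:", frameworks_per_user)
```

The shares of a multi-select question can sum to more than 1, and that sum equals the average number of options ticked per respondent. This is why a group that uses more frameworks per person can show higher bars almost everywhere.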

Let's now plot some trends for the most popular frameworks and identify patterns. This time, I will highlight all the frameworks as there is a pattern in each one, and they are easy to distinguish.

In [None]:
# Focus on scikit-learn TensorFlow, Keras, PyTorch, Xgboost, LightGBM, None

# Not taking hugging face because lack of data

FrameVsYear = year_relator(["Q16_Part_1", "Q16_Part_2", "Q16_Part_3", "Q16_Part_4", "Q16_Part_7", "Q16_Part_8", 'Q16_Part_17', 
                           "Q16_Part_1", "Q16_Part_2", "Q16_Part_3", "Q16_Part_4", "Q16_Part_7", "Q16_Part_8", 'Q16_Part_17',
              "Q16_Part_1", "Q16_Part_2", "Q16_Part_3", "Q16_Part_4", "Q16_Part_7", "Q16_Part_8", 'Q16_Part_15',
              "Q16_Part_1", "Q16_Part_2", "Q16_Part_3", "Q16_Part_4", "Q16_Part_7", "Q16_Part_8", 'Q16_Part_15',
              "Q28_Part_1","Q28_Part_2", "Q28_Part_3", "Q28_Part_6", "Q28_Part_5", "Q28_Part_8", "Q28_Part_11",
              "Q28_Part_1","Q28_Part_2", "Q28_Part_3", "Q28_Part_6", "Q28_Part_5", "Q28_Part_8", "Q28_Part_11",
             # 2018 (Q19): Part_14 is LightGBM; "None" is assumed to be Part_18
             "Q19_Part_1","Q19_Part_2", "Q19_Part_3", "Q19_Part_4", "Q19_Part_10", "Q19_Part_14", "Q19_Part_18",
             "Q19_Part_1","Q19_Part_2", "Q19_Part_3", "Q19_Part_4", "Q19_Part_10", "Q19_Part_14", "Q19_Part_18",
             ], 
             ["2021 Old", "2021 New", "2020 Old", "2020 New", "2019 Old", "2019 New", "2018 Old", "2018 New"], 
             ["Scikit-learn", "TensorFlow", "Keras", 'PyTorch', 'Xgboost', 'LightGBM', "None"], 
             [d21_o, d21_y, d20_o, d20_y, d19_o, d19_y, d18_o, d18_y], 7)

FrameVsYear_o, FrameVsYear_y = bifurcate_data(FrameVsYear)


FrameVsYear_o = pd.DataFrame(reformat_yearwise_rel(FrameVsYear_o, {'2021 Old' : 573, '2020 Old' : 419, '2019 Old' : 183, '2018 Old': 465},
                              '2021 Old') , columns = ['VarX', 'Mid', 'VarY'])
FrameVsYear_y = pd.DataFrame(reformat_yearwise_rel(FrameVsYear_y, {'2021 New' : 13838, '2020 New' : 9771, '2019 New' : 8947, '2018 New': 11179},
                              '2021 New'), columns = ['VarX', 'Mid', 'VarY'])

fig = go.Figure()
add_many_traces(fig, FrameVsYear_o.reindex(index=FrameVsYear_o.index[::-1]), ["Scikit-learn", "TensorFlow", "Keras", 'PyTorch', 'Xgboost', 'LightGBM', "None"],
                ['red', 'green', 'blue', 'black', 'yellow', 'purple', 'grey'])
fig.update_layout(title_text = "ML Frameworks Used vs Time by Experienced Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()

fig = go.Figure()
add_many_traces(fig, FrameVsYear_y.reindex(index=FrameVsYear_y.index[::-1]), ["Scikit-learn", "TensorFlow", "Keras", 'PyTorch', 'Xgboost', 'LightGBM', "None"], 
                ['red', 'green', 'blue', 'black', 'yellow', 'purple', 'grey'])
fig.update_layout(title_text = "ML Frameworks Used vs Time by New Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()





Observe that in each graph Scikit-learn has the most users. We also see that:
1. The % of Scikit-learn users has remained approximately constant in both groups.
2. On the other hand, the % of TensorFlow and Keras users has been slowly decreasing in both groups.
3. In the experienced pool, the % of PyTorch and Xgboost users has increased rapidly. For new users, Xgboost has declined very slightly and PyTorch has increased, but not as much as for experienced users.
4. LightGBM has remained approximately constant for new users, but has increased significantly for veterans.
5. The % of people using no framework has increased slightly for new users, but has decreased significantly for experienced users.

Point 3 is a positive trend for PyTorch, since it suggests that experienced users like it more and more, and that this has reached new users. This is good because new users are learning from experienced users, which shows development in the field.

On the other hand, point 4 is negative because new users are not realizing the same advantages of LightGBM that advanced users are. Similarly, point 5 suggests that ML frameworks must be useful, because veterans are shifting towards them, while new users are slowly drifting away. However, this increase is very slight, and may not indicate a shrinking future market.
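
Trend statements like these are easier to verify from a table than from overlapping lines. The sketch below pivots a long-form frame with the notebook's `VarX`, `Mid` and `VarY` columns into one row per year and computes the change between the first and last year. The numbers are hypothetical placeholders, not values from the survey.

```python
import pandas as pd

# Hypothetical long-form shares using the notebook's column names:
# VarX = "<year> <group>", Mid = category, VarY = share of users.
long_df = pd.DataFrame(
    [
        ("2019 Old", "PyTorch", 0.20),
        ("2020 Old", "PyTorch", 0.28),
        ("2021 Old", "PyTorch", 0.35),
        ("2019 Old", "Keras", 0.45),
        ("2020 Old", "Keras", 0.42),
        ("2021 Old", "Keras", 0.40),
    ],
    columns=["VarX", "Mid", "VarY"],
)

# Extract the year from labels such as "2021 Old".
long_df["year"] = long_df["VarX"].str.split().str[0].astype(int)

# One row per year, one column per framework.
wide = long_df.pivot(index="year", columns="Mid", values="VarY").sort_index()

# Change in share between the first and last survey year.
change = wide.iloc[-1] - wide.iloc[0]

print(wide)
print(change)
```

Running the same steps on the experienced and new frames side by side gives a compact way to compare how far each framework moved in each group.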

****
**Let's go one step further - Let's not help, let's automate - AutoML**

AutoML tools automate certain aspects of ML programming, either partially or fully.

Let us first understand which tasks AutoML helps us with the most:


In [None]:
# AutoML

Auto_2021 = multi_Y_data_MV(["Q36_A_", "Q36_A_"], ['Data augmentation (imgaug, albumentations)',
'Feature engineering/selection (tpot, boruta_py)',
'Model selection (auto-sklearn, xcessiv)',
'Model architecture searches (darts, enas)',
'Hyperparameter tuning (hyperopt, ray.tune, Vizier)',
'Full ML pipelines (Google AutoML, H20 Driverless AI)',
'None',
'Other'], 
                          get_var = False, data_list = [d21_o, d21_y], year_list = ["2021 Old", "2021 Young"], 
                       year_wise_skip = None,
                  alternate_name = None, naming = None, has_other = True, cut_off = -1, reformat = False)

num_old_Auto_ans = Auto_2021[Auto_2021['VarX'].isin(["2021 Old"])]['VarY'].sum()
num_young_Auto_ans = Auto_2021[Auto_2021['VarX'].isin(["2021 Young"])]['VarY'].sum()


Auto_2021 = conv_rel(["2021 Old", "2021 Young"], Auto_2021)
fig = go.Figure()
fig.add_trace(go.Bar(x = Auto_2021[Auto_2021['VarX'].isin(['2021 Old'])]['Mid'],
                     y = Auto_2021[Auto_2021['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = Auto_2021[Auto_2021['VarX'].isin(['2021 Young'])]['Mid'],
                     y = Auto_2021[Auto_2021['VarX'].isin(['2021 Young'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "AutoML (or Partial) Usage in 2021", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'AutoML Platform',
                 yaxis_title = 'Proportion of Users')
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()

Before you read further, it is important to understand how the proportions were computed here. This question was asked only of "Professionals", which excluded a large segment of both groups. As a result, I re-normalized by the number of people who were asked this question. The same re-normalization is applied elsewhere in the notebook and is pointed out wherever it is used.
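
The effect of the denominator can be seen in a small sketch. It assumes, purely for illustration, that a respondent was shown the question if they ticked at least one option (including "None"). The frame is made up, and the notebook's helpers may derive their denominator differently.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 5 respondents, only the first 3 were shown the question.
df = pd.DataFrame({
    "Q36_A_Part_3": ["Model selection", np.nan, np.nan, np.nan, np.nan],
    "Q36_A_Part_5": ["Hyperparameter tuning", "Hyperparameter tuning",
                     np.nan, np.nan, np.nan],
    "Q36_A_Part_7": [np.nan, np.nan, "None", np.nan, np.nan],
})

# Number of respondents who ticked each option.
counts = df.notna().sum()

# Assumption: "was asked" means "ticked at least one option".
n_asked = int(df.notna().any(axis=1).sum())

# Share of the whole group versus share of those who saw the question.
share_all = counts / len(df)
share_asked = counts / n_asked

print("asked:", n_asked)
print(share_all)
print(share_asked)
```

Dividing by the whole group understates usage whenever a question was shown to only part of it, which is why the smaller denominator is the more meaningful one here.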

To me, it was surprising that experienced people use AutoML much more than new users!
1. Most new users do not use any AutoML techniques.
2. Many experienced users also don't use any AutoML techniques, but proportionately far fewer.
3. The feature most used by experienced users is hyperparameter tuning.
4. Taking both groups together, model selection is used the most.
5. Hyperparameter tuning and model selection account for most AutoML usage, as the other tasks are negligible in proportion.

So, has the demand for these products increased over time?

In [None]:
# Graph all and test

AutoVsYear = year_relator(["Q36_A_Part_1", "Q36_A_Part_2","Q36_A_Part_3", "Q36_A_Part_4", "Q36_A_Part_5", "Q36_A_Part_6", "Q36_A_Part_7",
                           "Q36_A_Part_1", "Q36_A_Part_2","Q36_A_Part_3", "Q36_A_Part_4", "Q36_A_Part_5", "Q36_A_Part_6", "Q36_A_Part_7",
              "Q33_A_Part_1", "Q33_A_Part_2", "Q33_A_Part_3", "Q33_A_Part_4", "Q33_A_Part_5", "Q33_A_Part_6", "Q33_A_Part_7",
              "Q33_A_Part_1", "Q33_A_Part_2", "Q33_A_Part_3", "Q33_A_Part_4", "Q33_A_Part_5", "Q33_A_Part_6", "Q33_A_Part_7",
              "Q25_Part_1", "Q25_Part_2", "Q25_Part_3", "Q25_Part_4", "Q25_Part_5", "Q25_Part_6", "Q25_Part_7",
              "Q25_Part_1", "Q25_Part_2", "Q25_Part_3", "Q25_Part_4", "Q25_Part_5", "Q25_Part_6", "Q25_Part_7"], 
             ["2021 Old", "2021 New", "2020 Old", "2020 New", "2019 Old", "2019 New"], 
             ['Augmentation', 'Feature Engineering', 'Model Selection', 'Architechture Searches',
              'Hyperparameter', 'Full Pipelines', 'None'], 
             [d21_o, d21_y, d20_o, d20_y, d19_o, d19_y], 7)

AutoVsYear_o, AutoVsYear_y = bifurcate_data(AutoVsYear)

AutoVsYear_o = pd.DataFrame(AutoVsYear_o, columns = ['VarX', 'Mid', 'VarY'])
AutoVsYear_y = pd.DataFrame(AutoVsYear_y, columns = ['VarX', 'Mid', 'VarY'])

num_old_Auto_ans_21 = AutoVsYear_o[AutoVsYear_o['VarX'].isin(["2021 Old"])]['VarY'].sum()
num_young_Auto_ans_21 = AutoVsYear_y[AutoVsYear_y['VarX'].isin(["2021 New"])]['VarY'].sum()
num_old_Auto_ans_20 = AutoVsYear_o[AutoVsYear_o['VarX'].isin(["2020 Old"])]['VarY'].sum()
num_young_Auto_ans_20 = AutoVsYear_y[AutoVsYear_y['VarX'].isin(["2020 New"])]['VarY'].sum()
num_old_Auto_ans_19 = AutoVsYear_o[AutoVsYear_o['VarX'].isin(["2019 Old"])]['VarY'].sum()
num_young_Auto_ans_19 = AutoVsYear_y[AutoVsYear_y['VarX'].isin(["2019 New"])]['VarY'].sum()


AutoVsYear_o = pd.DataFrame(reformat_yearwise_rel(AutoVsYear_o, {'2021 Old' : num_old_Auto_ans_21, '2020 Old' : num_old_Auto_ans_20,
                                                             '2019 Old' : num_old_Auto_ans_19, '2018 Old': 465},
                              '2021 Old') , columns = ['VarX', 'Mid', 'VarY'])
AutoVsYear_y = pd.DataFrame(reformat_yearwise_rel(AutoVsYear_y, {'2021 New' : num_young_Auto_ans_21, '2020 New' : num_young_Auto_ans_20, 
                                                             '2019 New' : num_young_Auto_ans_19, '2018 New': 11179},
                              '2021 New'), columns = ['VarX', 'Mid', 'VarY'])
fig = go.Figure()
add_many_traces(fig, AutoVsYear_o.reindex(index=AutoVsYear_o.index[::-1]), ['Augmentation', 'Feature Engineering', 'Model Selection', 'Architechture Searches',
              'Hyperparameter', 'Full Pipelines', 'None'],
                ['red', 'green', 'blue', 'black', 'yellow', 'grey', 'purple'])
fig.update_layout(title_text = "AutoML (or Partial) Usage vs Time by Experienced Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()

fig = go.Figure()
add_many_traces(fig, AutoVsYear_y.reindex(index=AutoVsYear_y.index[::-1]),['Augmentation', 'Feature Engineering', 'Model Selection', 'Architechture Searches',
              'Hyperparameter', 'Full Pipelines', 'None'], 
                ['red', 'green', 'blue', 'black', 'yellow', 'grey', 'purple'])
fig.update_layout(title_text = "AutoML (or Partial) Usage vs Time by New Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()

Over time, the share of new users who do not use AutoML has increased, whereas it has remained approximately constant for experienced users.

1. For experienced users, overall AutoML usage remained approximately constant between 2019 and 2021. Hyperparameter tuning has increased, but feature engineering and model selection have slightly decreased, roughly offsetting the increase.
2. For new users, demand for every use has decreased.

This is a negative trend for AutoML, as it indicates that newer users aren't adopting this technology and that its demand is decreasing over time. For experienced users, we only see a shift in the type of demand, while overall usage is constant.

**Let's now check the Specific Software**

In [None]:
# AutoML Specific Tools

AutoP_2021 = multi_Y_data_MV(["Q37_A_", "Q37_A_"], ['Google Cloud AutoML',
'H20 Driverless AI',
'Databricks AutoML',
'DataRobot AutoML',
'Amazon Sagemaker Autopilot',
'Azure Automated Machine Learning',
'No / None',
'Other'], 
                          get_var = False, data_list = [d21_o, d21_y], year_list = ["2021 Old", "2021 Young"], 
                       year_wise_skip = None,
                  alternate_name = None, naming = None, has_other = True, cut_off = -1, reformat = False)

num_old_AutoP_ans = AutoP_2021[AutoP_2021['VarX'].isin(["2021 Old"])]['VarY'].sum()
num_young_AutoP_ans = AutoP_2021[AutoP_2021['VarX'].isin(["2021 Young"])]['VarY'].sum()


AutoP_2021 = conv_rel(["2021 Old", "2021 Young"], AutoP_2021)
fig = go.Figure()
fig.add_trace(go.Bar(x = AutoP_2021[AutoP_2021['VarX'].isin(['2021 Old'])]['Mid'],
                     y = AutoP_2021[AutoP_2021['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = AutoP_2021[AutoP_2021['VarX'].isin(['2021 Young'])]['Mid'],
                     y = AutoP_2021[AutoP_2021['VarX'].isin(['2021 Young'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "Usage of Specific AutoML tools in 2021", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Auto ML Tool',
                 yaxis_title = 'Proportion of Users')
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()

Of the tools in the survey, Google Cloud AutoML is used the most by both groups, and its share is noticeably higher among new users. Experienced users seem to make little use of these specific tools. This suggests that they use alternative tools, as also indicated by the relatively large "Other" bar. Nevertheless, for experienced users, the market share of all the listed tools is approximately the same.

**Ok, I've made something. I want to experiment with it to improve it - Experimentation**

Many tools on the market help us manage our experiments. Let's look at some of them:

In [None]:
# Experiments

Experiments_2021 = multi_Y_data_MV(["Q38_A_", "Q38_A_"], ['Neptune.ai',
'Weights & Biases',
'Comet.ml',
'Sacred + Omniboard',
'TensorBoard',
'Guild.ai',
'Polyaxon',
'Trains',
'Domino Model Monitor',
'MLflow',
'None',
'Other'], 
                          get_var = False, data_list = [d21_o, d21_y], year_list = ["2021 Old", "2021 Young"], 
                       year_wise_skip = None,
                  alternate_name = None, naming = None, has_other = True, cut_off = -1, reformat = False)

num_old_Experiments_ans = Experiments_2021[Experiments_2021['VarX'].isin(["2021 Old"])]['VarY'].sum()
num_young_Experiments_ans = Experiments_2021[Experiments_2021['VarX'].isin(["2021 Young"])]['VarY'].sum()


Experiments_2021 = conv_rel(["2021 Old", "2021 Young"], Experiments_2021)
fig = go.Figure()
fig.add_trace(go.Bar(x = Experiments_2021[Experiments_2021['VarX'].isin(['2021 Old'])]['Mid'],
                     y = Experiments_2021[Experiments_2021['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = Experiments_2021[Experiments_2021['VarX'].isin(['2021 Young'])]['Mid'],
                     y = Experiments_2021[Experiments_2021['VarX'].isin(['2021 Young'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "Experiment Management Tools Used in 2021", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Experiment Management Tools',
                 yaxis_title = 'Proportion of Users')
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()

As with AutoML, a large share of both new and experienced users don't use this technology at all. However, it is clear that experienced users use it much more than new users:
1. TensorBoard and MLflow are the favorites of experienced users. They are also very commonly used by new users, but experienced users use them proportionately much more.
2. A reverse trend is seen here: the least common products are used more by new users than by experienced users. We were seeing the opposite before.
3. This isn't a comprehensive list, since both groups use a few other tools (as indicated by the "Other" bar).

Overall, this means that experiment management is used more by experienced users. This may be because it is a more advanced topic.
****

**Look here buddy, you talked about so many software tools. Where do I learn about them when they come out?**

Most people use specific sources to learn about advancements in the Data Science field, be it algorithms, software tools, or even cool products.

Let's take a peek into these sources:

In [None]:
# Social Media Platforms 2021

Report_2021 = multi_Y_data_MV(["Q42_", "Q42_"], ['Twitter',
"Email newsletters",
'Reddit',
'Kaggle',
'Course Forums',
'YouTube',
'Podcasts',
'Blogs',
'Journal Publications',
'Slack Communities',
'None',
'Other'], 
                          get_var = False, data_list = [d21_o, d21_y], year_list = ["2021 Old", "2021 Young"], 
                       year_wise_skip = None,
                  alternate_name = None, naming = None, has_other = True, cut_off = -1, reformat = False)

#num_old_Deploy_ans = Experiments_2021[Experiments_2021['VarX'].isin(["2021 Old"])]['VarY'].sum()
#num_young_Deploy_ans = Experiments_2021[Experiments_2021['VarX'].isin(["2021 Young"])]['VarY'].sum()


Report_2021 = conv_rel(["2021 Old", "2021 Young"], Report_2021)
fig = go.Figure()
fig.add_trace(go.Bar(x = Report_2021[Report_2021['VarX'].isin(['2021 Old'])]['Mid'],
                     y = Report_2021[Report_2021['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = Report_2021[Report_2021['VarX'].isin(['2021 Young'])]['Mid'],
                     y = Report_2021[Report_2021['VarX'].isin(['2021 Young'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "Media Platform for Data Science Reporting Referred to in 2021", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Reporting Platform',
                 yaxis_title = 'Proportion of Users')
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()


The graph above shows very diverse sources. Let's start with the experienced users:
1. Experienced users prefer Kaggle, blogs and journals the most.
2. They also look at YouTube, but not as much, as well as email newsletters and Twitter.
3. This is not a comprehensive list, because this group uses quite a few other sources not listed here.

On the other hand, for new users:
4. Their main source of updates is Kaggle, at a proportionately much higher rate than for the experienced group.
5. This is closely followed by YouTube (higher than for experienced users) and then blogs.
6. In stark contrast to veterans, they refer to very few journal publications.

Points 5 and 6 are not surprising, but they are concerning, because they show a large difference between the two groups. The experienced group is able to learn more, and to hear about new developments in the field much sooner, than new users. This is not surprising because new users often lack the guidance needed to read research papers quickly, whereas YouTube is easy to watch and digest.
It is concerning because new users are instead relying on sources such as YouTube that are completely unregulated. Although there are good channels maintained by trusted organizations, there are several that don't present accurate or complete information.

In [None]:
ReportVsYear = year_relator(["Q42_Part_1", "Q42_Part_2", "Q42_Part_3", "Q42_Part_4", "Q42_Part_5", "Q42_Part_6", 
                         "Q42_Part_7", "Q42_Part_8", "Q42_Part_9", "Q42_Part_10", "Q42_Part_11",
                         "Q42_Part_1", "Q42_Part_2", "Q42_Part_3", "Q42_Part_4", "Q42_Part_5", "Q42_Part_6", 
                         "Q42_Part_7", "Q42_Part_8", "Q42_Part_9", "Q42_Part_10", "Q42_Part_11",
              "Q39_Part_1", "Q39_Part_2", "Q39_Part_3", "Q39_Part_4", "Q39_Part_5", "Q39_Part_6", 
                         "Q39_Part_7", "Q39_Part_8", "Q39_Part_9", "Q39_Part_10", "Q39_Part_11",
              "Q39_Part_1", "Q39_Part_2", "Q39_Part_3", "Q39_Part_4", "Q39_Part_5", "Q39_Part_6", 
                         "Q39_Part_7", "Q39_Part_8", "Q39_Part_9", "Q39_Part_10", "Q39_Part_11",
              "Q12_Part_1", "Q12_Part_2", "Q12_Part_3", "Q12_Part_4", "Q12_Part_5", "Q12_Part_6", 
                         "Q12_Part_7", "Q12_Part_8", "Q12_Part_9", "Q12_Part_10", "Q12_Part_11",
               "Q12_Part_1", "Q12_Part_2", "Q12_Part_3", "Q12_Part_4", "Q12_Part_5", "Q12_Part_6", 
                         "Q12_Part_7", "Q12_Part_8", "Q12_Part_9", "Q12_Part_10", "Q12_Part_11"], 
             ["2021 Old", "2021 New", "2020 Old", "2020 New", "2019 Old", "2019 New"], 
             ['Twitter', 'Newsletters', 'Reddit', 'Kaggle', 'Course Forums', 'Youtube', 'Podcasts', 
              'Blogs', 'Journals', 'Slack', 'None'], 
             [d21_o, d21_y, d20_o, d20_y, d19_o, d19_y], 11)

"""
ReportVsYear_2018 = year_relator(['Q38_Part_1', 'Q38_Part_2', 'Q38_Part_3','Q38_Part_4','Q38_Part_5',
                              'Q38_Part_6','Q38_Part_7','Q38_Part_8','Q38_Part_9','Q38_Part_10','Q38_Part_11',
                              'Q38_Part_12','Q38_Part_13','Q38_Part_14','Q38_Part_15','Q38_Part_16','Q38_Part_17',
                              'Q38_Part_18','Q38_Part_19','Q38_Part_20','Q38_Part_21','Q38_Part_22',
                              'Q38_Part_1', 'Q38_Part_2', 'Q38_Part_3','Q38_Part_4','Q38_Part_5',
                              'Q38_Part_6','Q38_Part_7','Q38_Part_8','Q38_Part_9','Q38_Part_10','Q38_Part_11',
                              'Q38_Part_12','Q38_Part_13','Q38_Part_14','Q38_Part_15','Q38_Part_16','Q38_Part_17',
                              'Q38_Part_18','Q38_Part_19','Q38_Part_20','Q38_Part_21','Q38_Part_22'],
                             ['2018 Old', '2018 New'],
                             [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
                             [d18_o, d18_y]
                            )

ReportvsYear = comb_mids(ReportVsYear.to_numpy(), [[6, ], ["Jupyter Lab", "Jupyter Notebooks"]],
                     ["Visual Studio Code", "Jupyter Notebooks"], ["2021 Old", "2020 Old","2021 New", "2020 New"])

"""

ReportVsYear_o, ReportVsYear_y = bifurcate_data(ReportVsYear)


ReportVsYear_o = pd.DataFrame(reformat_yearwise_rel(ReportVsYear_o, {'2021 Old' : 573, '2020 Old' : 419, '2019 Old' : 183, '2018 Old': 465},
                              '2021 Old') , columns = ['VarX', 'Mid', 'VarY'])
ReportVsYear_y = pd.DataFrame(reformat_yearwise_rel(ReportVsYear_y, {'2021 New' : 13838, '2020 New' : 9771, '2019 New' : 8947, '2018 New': 11179},
                              '2021 New'), columns = ['VarX', 'Mid', 'VarY'])
ReportVsYear_o = ReportVsYear_o.loc[ReportVsYear_o['Mid'] != 'Newsletters']
ReportVsYear_y = ReportVsYear_y.loc[ReportVsYear_y['Mid'] != 'Newsletters']

fig = go.Figure()
add_many_traces(fig, ReportVsYear_o.reindex(index=ReportVsYear_o.index[::-1]), ['Twitter', 'Newsletters', 'Reddit', 'Kaggle', 'Course Forums', 'Youtube', 'Podcasts', 
              'Blogs', 'Journals', 'Slack', 'None'],
                ['grey', 'grey', 'grey', 'blue', 'yellow', 'red', 'grey', 'green', 'orange', 'grey',
                'grey'])
fig.update_layout(title_text = "Media Reporting Platform vs Time by Experienced Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()

fig = go.Figure()
add_many_traces(fig, ReportVsYear_y.reindex(index=ReportVsYear_y.index[::-1]),['Twitter', 'Newsletters', 'Reddit', 'Kaggle', 'Course Forums', 'Youtube', 'Podcasts', 
              'Blogs', 'Journals', 'Slack', 'None'], 
                ['grey', 'grey', 'grey', 'blue', 'yellow', 'red', 'grey', 'green', 'orange', 'grey',
                'grey'])
fig.update_layout(title_text = "Media Reporting Platform vs Time by New Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()


In the two graphs above, we see the same alarming trend repeat itself:

1. YouTube usage remains roughly constant for both groups.
2. Journals and Blogs are greatly decreasing in usage.
3. Course Forums and Kaggle are also losing users.

This means that, on a relative scale, YouTube is becoming more popular. This is quite alarming because there can be a lot of misinformation on YouTube, whereas journals are peer reviewed and generally more reliable.

On a more positive note, this means that if you want your own work to gain traction over time, YouTube may be a good platform to use.
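
The "relative scale" argument above can be checked with a little arithmetic. The sketch below uses made-up proportions (not the survey values) and a hypothetical helper, `relative_share`, that is not one of this notebook's functions. It only illustrates how a platform with flat usage still gains relative share when the others shrink.

```python
# Hypothetical proportions of users per platform (illustrative only, not survey results).
usage_by_year = {
    2019: {"YouTube": 0.40, "Journals": 0.30, "Blogs": 0.30},
    2021: {"YouTube": 0.40, "Journals": 0.20, "Blogs": 0.20},
}

def relative_share(usage):
    """Return each platform's share of the total mentions in one year."""
    total = sum(usage.values())
    return {platform: value / total for platform, value in usage.items()}

shares = {year: relative_share(usage) for year, usage in usage_by_year.items()}

for year, share in shares.items():
    print(year, {platform: round(value, 2) for platform, value in share.items()})
```

With these invented numbers, YouTube's own usage is unchanged, yet its relative share rises from 0.40 to 0.50 because the total shrinks.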

****

**You discussed so much about ML. I even know where to get information. What about me, I like graphing?**

Seeing is Believing!

We looked at notebooks, frameworks, AutoML and so on, but another key item that most data scientists use is visualization. In fact, I'm using it right now!

Let's check which visualization libraries are currently in use:

In [None]:
# Visualization Libraries

VisL_2021 = multi_Y_data_MV(["Q14_", "Q14_"], ['Matplotlib',
'Seaborn',
'Plotly / Plotly Express',
'Ggplot / ggplot2',
'Shiny',
'D3 js',
'Altair',
'Bokeh',
'Geoplotlib',
'Leaflet / Folium',
'None',
'Other'], 
                          get_var = False, data_list = [d21_o, d21_y], year_list = ["2021 Old", "2021 Young"], 
                       year_wise_skip = None,
                  alternate_name = None, naming = None, has_other = True, cut_off = -1, reformat = False)

VisL_2021 = conv_rel(["2021 Old", "2021 Young"], VisL_2021)
fig = go.Figure()
fig.add_trace(go.Bar(x = VisL_2021[VisL_2021['VarX'].isin(['2021 Old'])]['Mid'],
                     y = VisL_2021[VisL_2021['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = VisL_2021[VisL_2021['VarX'].isin(['2021 Young'])]['Mid'],
                     y = VisL_2021[VisL_2021['VarX'].isin(['2021 Young'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "Visualization Libraries Used in 2021 for Experienced and New Users", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Visualization Library',
                 yaxis_title = 'Proportion of Users')
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()

We see that Matplotlib and Seaborn have most of the market share.
However, Plotly and ggplot are not too far behind.

1. Experienced users use Plotly and ggplot more often than new users do.
2. New users use Matplotlib and Seaborn much more than experienced users do.
3. Experienced users appear to know a wider range of visualization libraries, because they use the less common libraries proportionately more than new users do.
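
For readers wondering what "proportion of users" means for a multi-select question, here is a minimal sketch. It assumes the usual Kaggle survey layout, where every option has its own column (such as `Q14_Part_1`) holding the option text when selected and a missing value otherwise. It also assumes the denominator is the size of the group. The toy rows and the helper `selection_proportions` are hypothetical, and this notebook's own helpers may compute things differently.

```python
import pandas as pd

# Toy stand-in for one group's responses (hypothetical rows, not survey data).
# Each option of the multi-select question has its own column.
toy = pd.DataFrame({
    "Q14_Part_1": ["Matplotlib", "Matplotlib", None, "Matplotlib"],
    "Q14_Part_2": ["Seaborn", None, None, "Seaborn"],
    "Q14_Part_3": [None, None, "Plotly / Plotly Express", None],
})

def selection_proportions(df, prefix):
    """Proportion of respondents in df who selected each option of a question."""
    columns = [c for c in df.columns if c.startswith(prefix)]
    counts = df[columns].notna().sum()
    # Label each column by the option text it contains
    # (assumes every option was selected at least once).
    labels = [df[c].dropna().iloc[0] for c in columns]
    return pd.Series(counts.to_numpy() / len(df), index=labels)

proportions = selection_proportions(toy, "Q14_Part_")
print(proportions)
```

Because respondents can tick several options, these proportions can add up to more than 1 within a group.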

In [None]:
# Focus on Matplotlib, Seaborn, Plotly, ggplot, Shiny, None

VisLVsYear = year_relator(["Q14_Part_1", "Q14_Part_2", "Q14_Part_3", "Q14_Part_4", "Q14_Part_5", "Q14_Part_11",
                           "Q14_Part_1", "Q14_Part_2", "Q14_Part_3", "Q14_Part_4", "Q14_Part_5", "Q14_Part_11",
                           "Q14_Part_1", "Q14_Part_2", "Q14_Part_3", "Q14_Part_4", "Q14_Part_5", "Q14_Part_11",
                           "Q14_Part_1", "Q14_Part_2", "Q14_Part_3", "Q14_Part_4", "Q14_Part_5", "Q14_Part_11",
                           "Q20_Part_2", "Q20_Part_8", "Q20_Part_6", "Q20_Part_1", "Q20_Part_4", "Q20_Part_11",
                           "Q20_Part_2", "Q20_Part_8", "Q20_Part_6", "Q20_Part_1", "Q20_Part_4", "Q20_Part_11",
                           "Q21_Part_2", "Q21_Part_8", "Q21_Part_6", "Q21_Part_1", "Q21_Part_4", "Q21_Part_12",
                           "Q21_Part_2", "Q21_Part_8", "Q21_Part_6", "Q21_Part_1", "Q21_Part_4", "Q21_Part_12",
                          ],
                          ["2021 Old", "2021 New", "2020 Old", "2020 New", "2019 Old", "2019 New", "2018 Old", "2018 New"],
                          ["Matplotlib", "Seaborn", "Plotly", 'GGPlot2', 'Shiny', "None"],
                          [d21_o, d21_y, d20_o, d20_y, d19_o, d19_y, d18_o, d18_y], 6)

VisLVsYear_o, VisLVsYear_y = bifurcate_data(VisLVsYear)


VisLVsYear_o = pd.DataFrame(reformat_yearwise_rel(VisLVsYear_o, {'2021 Old' : 573, '2020 Old' : 419, '2019 Old' : 183, '2018 Old': 465},
                              '2021 Old') , columns = ['VarX', 'Mid', 'VarY'])
VisLVsYear_y = pd.DataFrame(reformat_yearwise_rel(VisLVsYear_y, {'2021 New' : 13838, '2020 New' : 9771, '2019 New' : 8947, '2018 New': 11179},
                              '2021 New'), columns = ['VarX', 'Mid', 'VarY'])

fig = go.Figure()
add_many_traces(fig, VisLVsYear_o.reindex(index=VisLVsYear_o.index[::-1]), ["Matplotlib", "Seaborn", "Plotly", 'GGPlot2', 'Shiny', "None"],
                ['red', 'green', 'blue', 'black', 'yellow', 'purple'])
fig.update_layout(title_text = "Visualization Libraries Used vs Time by Experienced Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()

fig = go.Figure()
add_many_traces(fig, VisLVsYear_y.reindex(index=VisLVsYear_y.index[::-1]), ["Matplotlib", "Seaborn", "Plotly", 'GGPlot2', 'Shiny', "None"], 
                ['red', 'green', 'blue', 'black', 'yellow', 'purple'])
fig.update_layout(title_text = "Visualization Libraries Used vs Time by New Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()

The graphs above tell a story very similar to the bar chart. However, notice that:

1. The percentage of experienced users using Matplotlib and Seaborn has increased significantly over time. Usage of these libraries has also increased among new users, but by a much smaller amount.
2. The percentage of users of Shiny and ggplot has been declining quite a bit for both groups.
3. On average, the shares of Plotly users and of users with no library have stayed constant for both groups.

This suggests that the market share of Matplotlib and Seaborn is likely to keep increasing, greatly for experienced users and slightly for new users. On the other hand, libraries such as ggplot and Shiny are declining in popularity.

The fact that both groups show similar trends suggests that new users are learning from experienced users. This implies that the field is progressing: the two groups are communicating, and they are discovering the same benefits in each tool.
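
One way to put a number on "increasing" and "declining" is to fit a straight line through the yearly proportions and read off its slope. The sketch below is only an illustration: the proportions are invented, and `yearly_trend` is a hypothetical helper rather than something this notebook defines. It uses `numpy.polyfit` with degree 1, which performs an ordinary least-squares line fit.

```python
import numpy as np

years = np.array([2018, 2019, 2020, 2021])

# Hypothetical proportions of users per library and year (not survey results).
proportions = {
    "Matplotlib": np.array([0.60, 0.65, 0.70, 0.75]),
    "Shiny": np.array([0.20, 0.17, 0.14, 0.11]),
}

def yearly_trend(values):
    """Least-squares slope of proportion against year (change per year)."""
    # Shift the years to start at 0 so the fit is well conditioned.
    slope, _intercept = np.polyfit(years - years.min(), values, deg=1)
    return slope

trends = {name: yearly_trend(values) for name, values in proportions.items()}

for name, slope in trends.items():
    print(f"{name}: {slope:+.3f} per year")
```

A positive slope means growing usage and a negative slope means shrinking usage. With only four yearly points, any such slope is a rough indication rather than a forecast.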

****

**Ok ok, we get it. But I've heard that people also use Cloud Services**

Another emerging market is cloud computing in machine learning. 
But here, it gets a bit tricky. Not everyone was asked this question. From analyzing the spending, we saw that it was mostly new members who weren't asked. So, because we're analyzing market share, let's focus only on the people that the survey considered professionals.

Let's check this out:

In [None]:
# Cloud Computing 2021

CC_2021 = multi_Y_data_MV(["Q27_A_", "Q27_A_"], ['Amazon Web Services (AWS)',
'Microsoft Azure',
'Google Cloud Platform',
'IBM Cloud / Red Hat',
'Oracle Cloud',
'SAP Cloud',
'Salesforce Cloud',
'VMware Cloud',
'Alibaba Cloud',
'Tencent Cloud',
'None',
'Other'], 
                          get_var = False, data_list = [d21_o, d21_y], year_list = ["2021 Old", "2021 Young"], 
                       year_wise_skip = None,
                  alternate_name = None, naming = None, has_other = True, cut_off = -1, reformat = False)

num_old_CC_ans = CC_2021[CC_2021['VarX'].isin(["2021 Old"])]['VarY'].sum()
num_young_CC_ans = CC_2021[CC_2021['VarX'].isin(["2021 Young"])]['VarY'].sum()


CC_2021 = conv_rel(["2021 Old", "2021 Young"], CC_2021)
fig = go.Figure()
fig.add_trace(go.Bar(x = CC_2021[CC_2021['VarX'].isin(['2021 Old'])]['Mid'],
                     y = CC_2021[CC_2021['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = CC_2021[CC_2021['VarX'].isin(['2021 Young'])]['Mid'],
                     y = CC_2021[CC_2021['VarX'].isin(['2021 Young'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "Cloud Computing Platforms Used in 2021", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Cloud Computing Platform',
                 yaxis_title = 'Proportion of Users')
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()
names_old_CC = get_question_index("Q28", d21_o)
names_y_CC = get_question_index("Q28", d21_y)

CC_2021_Fav_old = Line_SV_2(["Q28","Q28", "Q28", "Q28", "Q28", "Q28", "Q28","Q28", "Q28"],
            names_old_CC,
          names_old_CC, 
          ["2021 Old", "2021 New"], 9, 
          [d21_o[~d21_o["Q28"].isin([np.nan])],d21_y[~d21_y["Q28"].isin([np.nan])]])

CC_2021_Fav_new = Line_SV_2(["Q28","Q28", "Q28", "Q28", "Q28", "Q28", "Q28","Q28", "Q28","Q28","Q28", "Q28"],
            names_y_CC,
          names_y_CC, 
          ["2021 New"], 12, 
          [d21_y[~d21_y["Q28"].isin([np.nan])]])

CC_2021_Fav = pd.concat([CC_2021_Fav_old,CC_2021_Fav_new])

CC_2021_Fav = conv_rel_2(CC_2021_Fav.to_numpy(), {'2021 Old' : num_old_CC_ans, '2020 Old' : 419, '2019 Old' : 183, '2018 Old': 465, 
                                       '2021 New' : num_young_CC_ans, '2020 New' : 9771, '2019 New' : 8947, '2018 New': 11179})
                         #ident = ['2021 Old', '2021 New'])

fig = go.Figure()
fig.add_trace(go.Bar(x = CC_2021_Fav[CC_2021_Fav['VarX'].isin(['2021 Old'])]['Mid'],
                     y = CC_2021_Fav[CC_2021_Fav['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = CC_2021_Fav[CC_2021_Fav['VarX'].isin(['2021 New'])]['Mid'],
                     y = CC_2021_Fav[CC_2021_Fav['VarX'].isin(['2021 New'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "Preferred Cloud Computing Platform Used in 2021", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Cloud Computing Platform',
                 yaxis_title = 'Proportion of Users')
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()

I have plotted two graphs above: one about market share, and one about product preferences.

1. AWS, GCP and Azure have the largest share.
2. For almost all cloud computing platforms, experienced users use them much more than new users do.
3. A significant number of users from both groups don't use cloud computing at all. In fact, most new users don't use any cloud computing platform.

Before I talk about the preferences, note that most people didn't actually answer this question. Hence, the proportions may not add up to 1.

4. Unsurprisingly, most people prefer AWS and GCP. Shockingly, though, a significant number of people said that the platforms offered a similar developer experience. This group was in fact larger than the one that preferred MS Azure.
5. A considerable number of experienced users stated that none of the platforms were satisfactory.
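
To see why the preference proportions need not add up to 1, consider the sketch below. It assumes, purely for illustration, that the denominator is everyone who was asked about cloud usage, while only some of them went on to answer the preference question. The answers are invented, and the actual denominators used by `conv_rel_2` in this notebook may differ.

```python
from collections import Counter

# Hypothetical single-choice answers to the preference question;
# None marks respondents who skipped it.
answers = ["AWS", "GCP", None, "AWS", None,
           "Similar experience", None, "Azure", None, None]

asked = len(answers)  # everyone who reached the question
counts = Counter(a for a in answers if a is not None)

# Divide by everyone asked, not only by those who answered.
proportions = {platform: n / asked for platform, n in counts.items()}
answered_share = sum(proportions.values())

print(proportions)
print(f"Sum of proportions: {answered_share:.2f}")
```

Here half of the respondents skipped the question, so the bars only add up to 0.50. Dividing by the number of people who answered instead would make them add up to 1.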

In [None]:
# Focus on AWS, Azure, Google, None

CCVsYear = year_relator(["Q27_A_Part_1", "Q27_A_Part_2", "Q27_A_Part_3", "Q27_A_Part_11","Q27_A_Part_1", "Q27_A_Part_2", "Q27_A_Part_3", "Q27_A_Part_11",
              "Q26_A_Part_1", "Q26_A_Part_2", "Q26_A_Part_3", "Q26_A_Part_11","Q26_A_Part_1", "Q26_A_Part_2", "Q26_A_Part_3", "Q26_A_Part_11",
              "Q29_Part_2", "Q29_Part_3", "Q29_Part_1", "Q29_Part_11","Q29_Part_2", "Q29_Part_3", "Q29_Part_1", "Q29_Part_11",
              "Q15_Part_2", "Q15_Part_3", "Q15_Part_1", "Q15_Part_6","Q15_Part_2", "Q15_Part_3", "Q15_Part_1", "Q15_Part_6"
             ], 
             ["2021 Old", "2021 New", "2020 Old", "2020 New", "2019 Old", "2019 New", "2018 Old", "2018 New"], 
             ["AWS", "Azure", "Google", "None"], 
             [d21_o, d21_y, d20_o, d20_y, d19_o, d19_y, d18_o, d18_y], 4)

CCVsYear_o, CCVsYear_y = bifurcate_data(CCVsYear)

num_old_CC_ans_21 = CCVsYear_o[CCVsYear_o['VarX'].isin(["2021 Old"])]['VarY'].sum()
num_young_CC_ans_21 = CCVsYear_y[CCVsYear_y['VarX'].isin(["2021 New"])]['VarY'].sum()
num_old_CC_ans_20 = CCVsYear_o[CCVsYear_o['VarX'].isin(["2020 Old"])]['VarY'].sum()
num_young_CC_ans_20 = CCVsYear_y[CCVsYear_y['VarX'].isin(["2020 New"])]['VarY'].sum()
num_old_CC_ans_19 = CCVsYear_o[CCVsYear_o['VarX'].isin(["2019 Old"])]['VarY'].sum()
num_young_CC_ans_19 = CCVsYear_y[CCVsYear_y['VarX'].isin(["2019 New"])]['VarY'].sum()
num_old_CC_ans_18 = CCVsYear_o[CCVsYear_o['VarX'].isin(["2018 Old"])]['VarY'].sum()
num_young_CC_ans_18 = CCVsYear_y[CCVsYear_y['VarX'].isin(["2018 New"])]['VarY'].sum()

CCVsYear_o = pd.DataFrame(reformat_yearwise_rel(CCVsYear_o, {'2021 Old' : num_old_CC_ans_21, '2020 Old' : num_old_CC_ans_20,
                                                             '2019 Old' : num_old_CC_ans_19, '2018 Old': num_old_CC_ans_18},
                              '2021 Old') , columns = ['VarX', 'Mid', 'VarY'])
CCVsYear_y = pd.DataFrame(reformat_yearwise_rel(CCVsYear_y, {'2021 New' : num_young_CC_ans_21, '2020 New' : num_young_CC_ans_20,
                                                             '2019 New' : num_young_CC_ans_19, '2018 New': num_young_CC_ans_18},
                              '2021 New'), columns = ['VarX', 'Mid', 'VarY'])


fig = go.Figure()
add_many_traces(fig, CCVsYear_o.reindex(index=CCVsYear_o.index[::-1]), ["AWS", "Azure", "Google", "None"],
                ['red', 'green', 'blue', 'black'])
fig.update_layout(title_text = "Cloud Computing Platform Usage vs Time by Experienced Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()

fig = go.Figure()
add_many_traces(fig, CCVsYear_y.reindex(index=CCVsYear_y.index[::-1]),["AWS", "Azure", "Google", "None"], 
                ['red', 'green', 'blue', 'black'])
fig.update_layout(title_text = "Cloud Computing Platform Usage vs Time by New Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()





Year on year, the most striking development is that the proportion of people using NO cloud computing platform has decreased significantly.

1. The user base for AWS has slightly decreased in the experienced pool, and has remained about constant for the new group.
2. Google and Azure platforms have increased significantly in both groups.

This tells us that the market share of GCP and Azure is increasing quite a bit. We can infer that the people who previously did not use any cloud computing might be shifting to these platforms.
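
Changes like these are easiest to compare in percentage points. The sketch below assumes a long-format table with the same `VarX` / `Mid` / `VarY` column names used in this notebook (year label, platform, proportion). The values are invented, and the pivot step is only one possible way to compute the change.

```python
import pandas as pd

# Hypothetical long-format table in the VarX / Mid / VarY layout (invented values).
trend = pd.DataFrame({
    "VarX": ["2018", "2018", "2021", "2021"],
    "Mid": ["Azure", "None", "Azure", "None"],
    "VarY": [0.10, 0.50, 0.18, 0.35],
})

# One row per year, one column per platform.
wide = trend.pivot(index="VarX", columns="Mid", values="VarY").sort_index()

# Change between the first and the last year, in percentage points.
change_pp = (wide.iloc[-1] - wide.iloc[0]) * 100
print(change_pp.round(1))
```

With these toy numbers, Azure gains 8 points while "None" loses 15, which is the kind of pattern described above.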
****

**Speaking of Cloud Computing, what about Managed Machine Learning?**

Unfortunately, there wasn't enough data to do a trend analysis of this over time, but let's look at this in 2021:

In [None]:
# Managed ML

MML_2021 = multi_Y_data_MV(["Q31_A_", "Q31_A_"], ['Amazon SageMaker',
'Azure Machine Learning Studio',
'Google Cloud Vertex AI',
'DataRobot',
'Databricks',
'Dataiku',
'Alteryx',
'Rapidminer',
'No / None',
'Other'], 
                          get_var = False, data_list = [d21_o, d21_y], year_list = ["2021 Old", "2021 Young"], 
                       year_wise_skip = None,
                  alternate_name = None, naming = None, has_other = True, cut_off = -1, reformat = False)

#num_old_MML_ans = CC_2021[CC_2021['VarX'].isin(["2021 Old"])]['VarY'].sum()
#num_young_MML_ans = CC_2021[CC_2021['VarX'].isin(["2021 Young"])]['VarY'].sum()


MML_2021 = conv_rel(["2021 Old", "2021 Young"], MML_2021)
fig = go.Figure()
fig.add_trace(go.Bar(x = MML_2021[MML_2021['VarX'].isin(['2021 Old'])]['Mid'],
                     y = MML_2021[MML_2021['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = MML_2021[MML_2021['VarX'].isin(['2021 Young'])]['Mid'],
                     y = MML_2021[MML_2021['VarX'].isin(['2021 Young'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "Managed ML Platforms Usage in 2021", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Managed ML Platform',
                 yaxis_title = 'Proportion of Users')
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()

This is a largely untapped market. It's clear that most users in both groups don't use any Managed ML platform.

1. Every platform is used by a larger proportion of experienced users than of new users.
2. Many more new users than experienced users don't use any platform.
3. Of the platforms used, Amazon SageMaker, Azure, Databricks and Google have a large lead.
4. Proportionately, there is a large gap between experienced and new users for Rapidminer and Dataiku, meaning that new users may not have discovered them yet.
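
A "proportionate gap" can be expressed as the ratio between the two groups' proportions. The numbers below are invented for illustration and are not the survey results.

```python
# Hypothetical proportions of each group using a platform (not survey results).
experienced = {"Amazon SageMaker": 0.12, "Rapidminer": 0.04, "Dataiku": 0.03}
new = {"Amazon SageMaker": 0.08, "Rapidminer": 0.01, "Dataiku": 0.01}

# A ratio above 1 means the platform is proportionately more common
# among experienced users than among new users.
ratio = {name: experienced[name] / new[name] for name in experienced}

for name, value in sorted(ratio.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:.1f}x")
```

A platform can have a small absolute gap and still a large ratio, which is why small platforms such as Rapidminer and Dataiku can stand out in this view.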

****
**Sadly, Managed ML isn't too hot right now. You know what is? Big Data!**

Yes, for the past few years, Data Science, ML and Big Data have been synonymous. Let's study the progress in Big Data Tools over time. But first, what is its current state in 2021?

In [None]:
# Big Data

BD_2021 = multi_Y_data_MV(["Q32_A_", "Q32_A_"], ['MySQL',
'PostgreSQL',
'SQLite',
'Oracle Database',
'MongoDB',
'Snowflake',
'IBM Db2',
'Microsoft SQL Server',
'Microsoft Azure SQL Database',
'Microsoft Azure Cosmos DB',
'Amazon Redshift',
'Amazon Aurora',
'Amazon RDS',
'Amazon DynamoDB',
'Google Cloud BigQuery',
'Google Cloud SQL',
'Google Cloud Firestore',
'Google Cloud BigTable',
'Google Cloud Spanner',
'None',
'Other'], 
                          get_var = False, data_list = [d21_o, d21_y], year_list = ["2021 Old", "2021 Young"], 
                       year_wise_skip = None,
                  alternate_name = None, naming = None, has_other = True, cut_off = -1, reformat = False)

num_old_BD_ans = BD_2021[BD_2021['VarX'].isin(["2021 Old"])]['VarY'].sum()
num_young_BD_ans = BD_2021[BD_2021['VarX'].isin(["2021 Young"])]['VarY'].sum()


BD_2021 = conv_rel(["2021 Old", "2021 Young"], BD_2021)
fig = go.Figure()
fig.add_trace(go.Bar(x = BD_2021[BD_2021['VarX'].isin(['2021 Old'])]['Mid'],
                     y = BD_2021[BD_2021['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = BD_2021[BD_2021['VarX'].isin(['2021 Young'])]['Mid'],
                     y = BD_2021[BD_2021['VarX'].isin(['2021 Young'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "Big Data Tools Used in 2021", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Big Data Tool',
                 yaxis_title = 'Proportion of Users')
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()
names_old_BD = get_question_index("Q33", d21_o)
names_y_BD = get_question_index("Q33", d21_y)

BD_2021_Fav_old = Line_SV_2(["Q33", "Q33", "Q33", "Q33", "Q33",
                            "Q33", "Q33", "Q33", "Q33", "Q33",
                            "Q33", "Q33", "Q33", "Q33", "Q33",
                            "Q33", "Q33", "Q33", "Q33"],
            names_old_BD,
          names_old_BD, 
          ["2021 Old", "2021 New"], 19, 
          [d21_o[~d21_o["Q33"].isin([np.nan])],d21_y[~d21_y["Q33"].isin([np.nan])]])

BD_2021_Fav_new = Line_SV_2(["Q33", "Q33", "Q33", "Q33", "Q33",
                            "Q33", "Q33", "Q33", "Q33", "Q33",
                            "Q33", "Q33", "Q33", "Q33", "Q33",
                            "Q33", "Q33", "Q33", "Q33", 'Q33'],
            names_y_BD,
          names_y_BD, 
          ["2021 New"], 20, 
          [d21_y[~d21_y["Q33"].isin([np.nan])]])

BD_2021_Fav = pd.concat([BD_2021_Fav_old,BD_2021_Fav_new])

BD_2021_Fav = conv_rel_2(BD_2021_Fav.to_numpy(), {'2021 Old' : num_old_BD_ans, '2020 Old' : 419, '2019 Old' : 183, '2018 Old': 465, 
                                       '2021 New' : num_young_BD_ans, '2020 New' : 9771, '2019 New' : 8947, '2018 New': 11179})
                         #ident = ['2021 Old', '2021 New'])

fig = go.Figure()
fig.add_trace(go.Bar(x = BD_2021_Fav[BD_2021_Fav['VarX'].isin(['2021 Old'])]['Mid'],
                     y = BD_2021_Fav[BD_2021_Fav['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = BD_2021_Fav[BD_2021_Fav['VarX'].isin(['2021 New'])]['Mid'],
                     y = BD_2021_Fav[BD_2021_Fav['VarX'].isin(['2021 New'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "Preferred Big Data Tools Used in 2021", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Big Data Tool',
                 yaxis_title = 'Proportion of Users')
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()

Wow... There are so many, where do I even start? Let's go from left to right.

1. MySQL is the most popular tool for both new and experienced users. In fact, usage of MySQL by new users is much higher than that of any other tool.
2. PostgreSQL is another popular tool. It is slightly more popular among experienced users.
3. In the graph, there aren't any large disparities between experienced and new users, except for MySQL.
4. Surprisingly, many people don't actually use any Big Data tools. This is more so for new users.
5. MS SQL, MongoDB, and SQLite are popular and hold approximately equal market share.

In terms of preferences:

6. MySQL ranks first for both groups, and by a large margin for new users.
7. PostgreSQL is also preferred by experienced users, at almost the same level as MySQL.

This tells us that other than MySQL, the market is pretty evenly shared. Despite PostgreSQL having a smaller market share, it is liked approximately as much as MySQL. This means it may be worth trying PostgreSQL over MySQL.
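
The comparison between market share and preference can be made explicit by dividing one by the other. The sketch below uses invented proportions for a single group. The name `retention` is my own label for this ratio and does not come from the survey.

```python
# Hypothetical proportions for one group (not survey results).
usage = {"MySQL": 0.40, "PostgreSQL": 0.25}       # uses the tool regularly
preference = {"MySQL": 0.20, "PostgreSQL": 0.18}  # names it as the preferred tool

# Share of a tool's users who also name it as their preferred tool.
retention = {tool: preference[tool] / usage[tool] for tool in usage}

for tool, value in retention.items():
    print(f"{tool}: {value:.0%} of its users prefer it")
```

In this toy example PostgreSQL has fewer users, but a larger fraction of them prefer it, which is the pattern the paragraph above describes.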

Although MongoDB has a lot of market share, I will not analyze it over time due to the lack of past data.

In [None]:
# Focus on MySQL, PostgreSQL, SQLite, MS SQL, None (MongoDB and Snowflake are loaded but filtered out below)

BDVsYear = year_relator(["Q32_A_Part_1", "Q32_A_Part_2","Q32_A_Part_3", "Q32_A_Part_5", "Q32_A_Part_6", "Q32_A_Part_8", "Q32_A_Part_20",
              "Q32_A_Part_1", "Q32_A_Part_2","Q32_A_Part_3", "Q32_A_Part_5", "Q32_A_Part_6", "Q32_A_Part_8", "Q32_A_Part_20",
              "Q29_A_Part_1", "Q29_A_Part_2","Q29_A_Part_3", "Q29_A_Part_5", "Q29_A_Part_6", "Q29_A_Part_8", "Q29_A_Part_17",
              "Q29_A_Part_1", "Q29_A_Part_2","Q29_A_Part_3", "Q29_A_Part_5", "Q29_A_Part_6", "Q29_A_Part_8", "Q29_A_Part_17",
              "Q34_Part_1", "Q34_Part_2","Q34_Part_3", "Q32_Part_12", "Q34_Part_4", "Q32_Part_12", "Q32_Part_11",
              "Q34_Part_1", "Q34_Part_2","Q34_Part_3", "Q32_Part_12", "Q34_Part_4",  "Q32_Part_12", "Q32_Part_11",
              "Q29_Part_10", "Q29_Part_11","Q29_Part_12", "Q29_Part_12", "Q29_Part_9", "Q30_Part_16", "Q30_Part_24",
                        "Q29_Part_10", "Q29_Part_11","Q29_Part_12", "Q29_Part_12", "Q29_Part_9", "Q30_Part_16", "Q30_Part_24"], 
             ["2021 Old", "2021 New", "2020 Old", "2020 New", "2019 Old", "2019 New", "2018 Old", "2018 New"], 
             ['MySQL', 'PostgreSQL', 'SQLite', 'MongoDB', 'MS SQL', 'Snowflake', 'None'], 
             [d21_o, d21_y, d20_o, d20_y, d19_o, d19_y, d18_o, d18_y], 7)

BDVsYear_o, BDVsYear_y = bifurcate_data(BDVsYear)

num_old_BD_ans_21 = BDVsYear_o[BDVsYear_o['VarX'].isin(["2021 Old"])]['VarY'].sum()
num_young_BD_ans_21 = BDVsYear_y[BDVsYear_y['VarX'].isin(["2021 New"])]['VarY'].sum()
num_old_BD_ans_20 = BDVsYear_o[BDVsYear_o['VarX'].isin(["2020 Old"])]['VarY'].sum()
num_young_BD_ans_20 = BDVsYear_y[BDVsYear_y['VarX'].isin(["2020 New"])]['VarY'].sum()
num_old_BD_ans_19 = BDVsYear_o[BDVsYear_o['VarX'].isin(["2019 Old"])]['VarY'].sum()
num_young_BD_ans_19 = BDVsYear_y[BDVsYear_y['VarX'].isin(["2019 New"])]['VarY'].sum()
num_old_BD_ans_18 = BDVsYear_o[BDVsYear_o['VarX'].isin(["2018 Old"])]['VarY'].sum()
num_young_BD_ans_18 = BDVsYear_y[BDVsYear_y['VarX'].isin(["2018 New"])]['VarY'].sum()

BDVsYear_o = pd.DataFrame(reformat_yearwise_rel(BDVsYear_o, {'2021 Old' : num_old_BD_ans_21, '2020 Old' : num_old_BD_ans_20, 
                                                             '2019 Old' : num_old_BD_ans_19, '2018 Old': num_old_BD_ans_18},
                              '2021 Old') , columns = ['VarX', 'Mid', 'VarY'])
BDVsYear_y = pd.DataFrame(reformat_yearwise_rel(BDVsYear_y, {'2021 New' : num_young_BD_ans_21, '2020 New' : num_young_BD_ans_20, 
                                                             '2019 New' : num_young_BD_ans_19, '2018 New': num_young_BD_ans_18},
                              '2021 New'), columns = ['VarX', 'Mid', 'VarY'])

BDVsYear_o = BDVsYear_o.loc[BDVsYear_o['Mid'] != 'MongoDB']
BDVsYear_o = BDVsYear_o.loc[BDVsYear_o['Mid'] != 'Snowflake']
BDVsYear_y = BDVsYear_y.loc[BDVsYear_y['Mid'] != 'MongoDB']
BDVsYear_y = BDVsYear_y.loc[BDVsYear_y['Mid'] != 'Snowflake']

fig = go.Figure()
add_many_traces(fig, BDVsYear_o.reindex(index=BDVsYear_o.index[::-1]), ['MySQL', 'PostgreSQL', 'SQLite', 'MS SQL', 'None'],
                ['red', 'green', 'blue', 'black', 'yellow', 'grey', 'purple'])
fig.update_layout(title_text = "Big Data Tools Usage vs Time by Experienced Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()

fig = go.Figure()
add_many_traces(fig, BDVsYear_y.reindex(index=BDVsYear_y.index[::-1]),['MySQL', 'PostgreSQL', 'SQLite', 'MS SQL', 'None'], 
                ['red', 'green', 'blue', 'black', 'yellow', 'grey', 'purple'])
fig.update_layout(title_text = "Big Data Tools Usage vs Time by New Users", legend = dict(x=1,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Time (Years)',
                 yaxis_title = 'Proportion of Users',)
fig.show()






Over time, the proportion of people not using any Big Data tools has decreased. We also see that:

1. For experienced users, MySQL has been approximately constant, whereas for new users it has increased significantly.
2. Microsoft SQL has been losing users very sharply.
3. PostgreSQL has increased very slightly in both groups.

The share of people using Big Data tools has been increasing sharply (seen from the 'None' line), yet MySQL, PostgreSQL and the other tools graphed above have either decreased, remained constant or only slightly increased. We can therefore infer that other Big Data tools have been gaining traction recently, i.e., the market share is equalizing.

****
**What rhymes with Big? Bi-...-Business! (It Really Doesn't)**

Now that we've explored Big Data, let's look at Business Intelligence platforms.


In [None]:
# Business Intelligence Tools

BI_2021 = multi_Y_data_MV(["Q34_A_", "Q34_A_"], ['Amazon QuickSight',
'Microsoft Power BI',
'Google Data Studio',
'Looker',
'Tableau',
'Salesforce',
'Einstein Analytics',
'Qlik',
'Domo',
'TIBCO Spotfire',
'Alteryx',
'Sisense',
'SAP Analytics Cloud',
'Microsoft Azure Synapse',
'Thoughtspot',
'None',
'Other'], 
                          get_var = False, data_list = [d21_o, d21_y], year_list = ["2021 Old", "2021 Young"], 
                       year_wise_skip = None,
                  alternate_name = None, naming = None, has_other = True, cut_off = -1, reformat = False)

num_old_BI_ans = BI_2021[BI_2021['VarX'].isin(["2021 Old"])]['VarY'].sum()
num_young_BI_ans = BI_2021[BI_2021['VarX'].isin(["2021 Young"])]['VarY'].sum()


BI_2021 = conv_rel(["2021 Old", "2021 Young"], BI_2021)
fig = go.Figure()
fig.add_trace(go.Bar(x = BI_2021[BI_2021['VarX'].isin(['2021 Old'])]['Mid'],
                     y = BI_2021[BI_2021['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = BI_2021[BI_2021['VarX'].isin(['2021 Young'])]['Mid'],
                     y = BI_2021[BI_2021['VarX'].isin(['2021 Young'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "Business Intelligence Tools Used in 2021", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Business Intelligence Tools',
                 yaxis_title = 'Proportion of Users')
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()
names_old_BI = get_question_index("Q35", d21_o)
names_y_BI = get_question_index("Q35", d21_y)

BI_2021_Fav_old = Line_SV_2(["Q35","Q35","Q35","Q35","Q35","Q35","Q35",
                            "Q35","Q35","Q35","Q35"],
            names_old_BI,
          names_old_BI, 
          ["2021 Old"], 11, 
          [d21_o[~d21_o["Q35"].isin([np.nan])]])

BI_2021_Fav_new = Line_SV_2(["Q35","Q35","Q35","Q35","Q35","Q35","Q35",
                            "Q35","Q35","Q35","Q35","Q35","Q35","Q35",
                            "Q35"],
            names_y_BI,
          names_y_BI, 
          ["2021 New"], 15, 
          [d21_y[~d21_y["Q35"].isin([np.nan])]])

BI_2021_Fav = pd.concat([BI_2021_Fav_old,BI_2021_Fav_new])

BI_2021_Fav = conv_rel_2(BI_2021_Fav.to_numpy(), {'2021 Old' : num_old_BI_ans, '2020 Old' : 419, '2019 Old' : 183, '2018 Old': 465, 
                                       '2021 New' : num_young_BI_ans, '2020 New' : 9771, '2019 New' : 8947, '2018 New': 11179})
                         #ident = ['2021 Old', '2021 New'])

fig = go.Figure()
fig.add_trace(go.Bar(x = BI_2021_Fav[BI_2021_Fav['VarX'].isin(['2021 Old'])]['Mid'],
                     y = BI_2021_Fav[BI_2021_Fav['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = BI_2021_Fav[BI_2021_Fav['VarX'].isin(['2021 New'])]['Mid'],
                     y = BI_2021_Fav[BI_2021_Fav['VarX'].isin(['2021 New'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "Preferred Business Intelligence Tools in 2021", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Business Intelligence Tools',
                 yaxis_title = 'Proportion of Users')
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()

Business Intelligence tools are something most people don't use unless they actually work in a business setting. The survey takers come from a wide variety of industries, so many of them are unlikely to use these tools.

Nevertheless, we observe the following:
1. The largest proportion of people in both groups don't actually use Business Intelligence Tools.
2. Surprisingly, experienced users use these tools less than new users.
3. Tableau and MS Power BI have the largest market share for both groups. Beyond these two, the other platforms have approximately equal market share.
4. New users use Tableau, MS Power BI and Google Data Studio proportionately more than veterans.
5. Veterans, on the other hand, use more uncommon tools. This is a very common trend we have been seeing.
6. Microsoft Power BI is the most preferred tool in both groups. It surpasses Tableau despite having slightly less market share among new users. This may mean that it is a better tool, as its users are more satisfied with it.
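One rough way to put a number on the "more satisfied" idea in point 6 is to divide each tool's preferred share by its usage share. The sketch below is my own illustration and not something computed elsewhere in this notebook: `preference_ratio` is a hypothetical helper, and the shares are placeholder values (not survey results).

```python
import pandas as pd

def preference_ratio(used: pd.Series, preferred: pd.Series) -> pd.Series:
    """Preferred share divided by usage share, per tool (hypothetical helper)."""
    # Drop tools nobody uses so we never divide by zero.
    used = used[used > 0]
    # Align the preferred shares on the remaining tool names.
    aligned = preferred.reindex(used.index).fillna(0)
    return (aligned / used).sort_values(ascending=False)

# Placeholder shares (illustrative only, NOT survey results).
used = pd.Series({"Tableau": 0.20, "Microsoft Power BI": 0.18, "Qlik": 0.02})
preferred = pd.Series({"Tableau": 0.10, "Microsoft Power BI": 0.12, "Qlik": 0.005})

ratios = preference_ratio(used, preferred)
print(ratios)
```

A tool with a higher ratio is named as the favorite by a larger fraction of the people who use it. This is a crude indicator at best, because a person can use many tools but can pick only one as the most-used.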

****




**It's Aliveee!!! - Deployment**

Huff... We've come so far... You're probably tired of just window shopping and now want to actually get stuff. Don't worry, we're almost there :D

Even if you have made an amazing product, you'll need to deploy it if you want to share it with the world or set up your own shop in the market.

Let's check Deployment:

In [None]:
# Deployment 2021

Deploy_2021 = multi_Y_data_MV(["Q39_", "Q39_"], ['Plotly Dash',
'Streamlit',
'NBViewer',
'GitHub',
'Personal blog',
'Kaggle',
'Colab',
'Shiny',
'None',
'Other'], 
                          get_var = False, data_list = [d21_o, d21_y], year_list = ["2021 Old", "2021 Young"], 
                       year_wise_skip = None,
                  alternate_name = None, naming = None, has_other = True, cut_off = -1, reformat = False)

#num_old_Deploy_ans = Experiments_2021[Experiments_2021['VarX'].isin(["2021 Old"])]['VarY'].sum()
#num_young_Deploy_ans = Experiments_2021[Experiments_2021['VarX'].isin(["2021 Young"])]['VarY'].sum()


Deploy_2021 = conv_rel(["2021 Old", "2021 Young"], Deploy_2021)
fig = go.Figure()
fig.add_trace(go.Bar(x = Deploy_2021[Deploy_2021['VarX'].isin(['2021 Old'])]['Mid'],
                     y = Deploy_2021[Deploy_2021['VarX'].isin(['2021 Old'])]['VarY'],
                     hovertemplate = 'Old: %{y:.2f}<extra></extra>',
                    name = 'Experienced'))
fig.add_trace(go.Bar(x = Deploy_2021[Deploy_2021['VarX'].isin(['2021 Young'])]['Mid'],
                     y = Deploy_2021[Deploy_2021['VarX'].isin(['2021 Young'])]['VarY'],
                     hovertemplate = 'New: %{y:.2f}<extra></extra>',
                     name = 'New',
                    marker_color = 'red'))

#fig = go.Figure(data = [go.bar(x = Lang2021['Mid'], y = Lang2021['VarY'], color = Lang2021['VarX'])])
fig.update_layout(title_text = "Deployment Platforms Used in 2021", legend = dict(x=0.8,
        y=1.0,
        bgcolor='rgba(255, 255, 255, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
), paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
                 xaxis_title = 'Deployment Platform',
                 yaxis_title = 'Proportion of Users')
fig.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})


fig.show()


By far, GitHub is everyone's favorite platform to deploy their products on. But a large subset of people don't deploy their application anywhere. 

1. Slightly more experienced users than new users don't upload their products anywhere, but the gap is small.
2. Kaggle is the next most common deployment platform for both groups. It is twice as popular among new users as it is among experienced users.
3. Google Colab is also quite popular with both.
4. The other options such as personal blogs, Shiny, Plotly Dash, etc. are very popular with the veterans.

We can guess that veterans code some cool stuff. Maybe they already have the recognition they deserve, which is why they post it on more private sources such as blogs. On the other hand, new users are just starting out, so they want as much recognition as possible and the ability to share their products quickly. Hence their preference for Kaggle and GitHub.
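For readers who want to reproduce a "Proportion of Users" figure like the ones above on their own, here is a minimal sketch. It assumes the survey stores a multi-select question as one column per option (for example `Q39_Part_1`, `Q39_Part_2`, ...), with the option text where it was ticked and a missing value otherwise. `multi_select_shares` is a hypothetical helper written for this example, not one of the helpers used in this notebook, and dividing by the number of people who ticked at least one option is my own assumption about the denominator.

```python
import pandas as pd

def multi_select_shares(df: pd.DataFrame, prefix: str) -> pd.Series:
    """Share of answering respondents who ticked each option (hypothetical helper)."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    answered = df[cols].notna()
    # Only count respondents who ticked at least one option.
    n_answering = answered.any(axis=1).sum()
    counts = answered.sum()
    # Label each column with its option text, falling back to the column name.
    counts.index = [
        df[c].dropna().iloc[0] if df[c].notna().any() else c for c in cols
    ]
    return (counts / n_answering).sort_values(ascending=False)

# Tiny toy frame (illustrative only, NOT survey data).
toy = pd.DataFrame(
    {
        "Q39_Part_1": ["GitHub", "GitHub", None, None],
        "Q39_Part_2": ["Kaggle", None, "Kaggle", None],
        "Q39_Part_3": [None, None, None, None],
    }
)

shares = multi_select_shares(toy, "Q39_")
print(shares)
```

Running the same function separately on the experienced and new groups would give two series that can be plotted side by side as grouped bars.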

****
**To Summarize**

- We have gone through the entire market, and looked at its various shops and owners.
- Some shops were niche, like Business Intelligence, while others, such as Education, were used by everyone.
- Most places had one or two players with a near-monopoly, but some places were much more open and had much more competition!
- Now that we have reached the end of the market, we have a good idea about how the market is today and how it came to be. We even have a small idea about the future.

:D

If you enjoyed this notebook, please upvote it!

And if you want to have fun, feel free to guess my category (new/experienced) in the comment section!