# What's the difference between a Data Scientist and a Data Analyst?

People ask this question all the time. The answers are often contested and convoluted, and in truth the line between the two is blurry.

In this notebook, I explore [Kaggle's 2019 ML & DS Survey](https://www.kaggle.com/c/kaggle-survey-2019) - specifically focusing on the answers by those who identified themselves as either a **Data Scientist** or **Data Analyst**. My goal is to bring a data-driven perspective to this discussion by identifying trends or tangible differences between the responses of the two groups.

My exploration will cover three main pillars: 
1. Demographics (Age, Gender, Geographic Location)
2. Professional/Career Identifiers (Education Level, Income, Work Responsibilities, Years of Experience)
3. Tooling Used (Programming Language, ML Frameworks, Visualization Libraries, Algorithms Used)

#### Importing Packages & Data

In [None]:
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns
from itertools import chain
from fuzzywuzzy import fuzz
import math

%matplotlib inline
sns.set()

# setting some global styles for matplotlib & pandas
plt.rcParams['font.family'] = 'monospace'
sns.set_style({"axes.facecolor": "1.0", 'grid.linestyle': '--', 'grid.color': '.8'})
colors = ["#fcd74e", "#0b84a5"]
pd.set_option("display.max_columns", 300)
# None means "no truncation"; the old -1 sentinel is rejected by recent pandas versions
pd.set_option('display.max_colwidth', None)

# dev vs prod settings
dev = False

if dev:
    import os
    # Print input files
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))
else:            
    import warnings 
    warnings.filterwarnings('ignore')
            
# import data
mult_choice_responses = pd.read_csv('/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv')
other_responses       = pd.read_csv('/kaggle/input/kaggle-survey-2019/other_text_responses.csv')

#### Data Wrangling

In [None]:
###############################################################################################
# replace mult_choice OTHER columns with their actual text responses (from other_responses)
OTHERS_to_replace = ['Q9_OTHER_TEXT', 'Q14_Part_1_TEXT', 'Q14_Part_2_TEXT', 'Q14_Part_3_TEXT',
                     'Q14_Part_4_TEXT', 'Q14_Part_5_TEXT', 'Q18_OTHER_TEXT', 'Q19_OTHER_TEXT',
                     'Q20_OTHER_TEXT', 'Q24_OTHER_TEXT', 'Q27_OTHER_TEXT', 'Q28_OTHER_TEXT', 
                     'Q34_OTHER_TEXT']

for col in OTHERS_to_replace:
    mult_choice_responses[col] = other_responses[col]
###############################################################################################

###############################################################################################
# rename columns for easier access
name_mapping = {
    # demographics
    'Q1': 'Age',
    'Q2': 'Gender',
    'Q3': 'Country',
    # professional
    'Q4': 'Education',
    'Q5': 'Title',
    'Q10': 'Annual Compensation',
    'Q15': 'Years_Exp_Data',
    'Q23': 'Years_Exp_ML',
    # day-to-day work roles
    'Q9_Part_1': 'Analyze data to influence biz decisions',
    'Q9_Part_2': 'Build/run data infrastructure',
    'Q9_Part_3': 'Prototype ML applications',
    'Q9_Part_4': 'Build/run an internal ML service',
    'Q9_Part_5': 'Improve existing ML models',
    'Q9_Part_6': 'Research state-of-the-art ML',
    'Q9_Part_7': 'None of these',
    'Q9_Part_8': 'Other',
    # tools used
    'Q14_Part_1_TEXT': 'Basic_stats_software',
    'Q14_Part_2_TEXT': 'Advanced_stats_software',
    'Q14_Part_3_TEXT': 'BI_software',
    'Q14_Part_4_TEXT': 'Local_envs',
    'Q14_Part_5_TEXT': 'Cloud_software'}

mult_choice_responses = mult_choice_responses.rename(columns = name_mapping)
###############################################################################################

###############################################################################################
# aggregate multi-column questions for plotting ease
text_question_cols = {
    'work_roles':    ['Analyze data to influence biz decisions', 'Build/run data infrastructure', 
                      'Prototype ML applications', 'Build/run an internal ML service',
                      'Improve existing ML models', 'Research state-of-the-art ML',
                      'None of these', 'Other'],
    'Programming Language':     list(mult_choice_responses.filter(like='Q18').columns), 
    'beginner_lang': list(mult_choice_responses.filter(like='Q19').columns),
    'Visualization Tools':      list(mult_choice_responses.filter(like='Q20').columns),
    'Algorithms':    list(mult_choice_responses.filter(like='Q24').columns),
    'NLP Tools':     list(mult_choice_responses.filter(like='Q27').columns),
    'ML Frameworks': list(mult_choice_responses.filter(like='Q28').columns),
    'Relational DB Tools':  list(mult_choice_responses.filter(like='Q34').columns)}
###############################################################################################

###############################################################################################
# renaming tool
def renamer(col, old_name, new_name):
    mult_choice_responses[col] = mult_choice_responses[col].replace(regex=old_name, value=new_name)

renamer(col='Country', 
        new_name='United Kingdom', 
        old_name='United Kingdom of Great Britain and Northern Ireland')    
renamer(col='Education', 
        new_name='Some college', 
        old_name='Some college/university study without earning a bachelor’s degree')
renamer(col='Education', 
        new_name='High school', 
        old_name='No formal education past high school')
# 'I have never written code' is an answer to the coding-experience question, not Education
renamer(col='Years_Exp_Data', 
        new_name='Never', 
        old_name='I have never written code')
renamer(col='Analyze data to influence biz decisions', 
        new_name='Analyze data to influence biz decisions', 
        old_name='Analyze and understand data to influence product or business decisions')
renamer(col='Build/run data infrastructure', 
        new_name='Build/run data infrastructure', 
        old_name='Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data')
renamer(col='Prototype ML applications', 
        new_name='Prototype ML applications', 
        old_name='Build prototypes to explore applying machine learning to new areas')
renamer(col='Build/run an internal ML service', 
        new_name='Build/run an internal ML service', 
        old_name='Build and/or run a machine learning service that operationally improves my product or workflows')
renamer(col='Improve existing ML models', 
        new_name='Improve existing ML models', 
        old_name='Experimentation and iteration to improve existing ML models')
renamer(col='Research state-of-the-art ML', 
        new_name='Research state-of-the-art ML', 
        old_name='Do research that advances the state of the art of machine learning')
renamer(col='None of these', 
        new_name='None of these', 
        old_name='None of these activities are an important part of my role at work')
###############################################################################################

###############################################################################################
# bucket compensation col
compensation_replace_dict = {
    '$0-999': '< 10,000','1,000-1,999': '< 10,000','2,000-2,999': '< 10,000','3,000-3,999': '< 10,000',
    '4,000-4,999': '< 10,000','5,000-7,499': '< 10,000','7,500-9,999': '< 10,000','10,000-14,999': '10,000 - 50,000',
    '15,000-19,999': '10,000 - 50,000','20,000-24,999': '10,000 - 50,000','25,000-29,999': '10,000 - 50,000',
    '30,000-39,999': '10,000 - 50,000','40,000-49,999': '10,000 - 50,000','50,000-59,999': '50,000 - 99,000',
    '60,000-69,999': '50,000 - 99,000','70,000-79,999': '50,000 - 99,000','80,000-89,999': '50,000 - 99,000',
    '90,000-99,999': '50,000 - 99,000','100,000-124,999': '> 100,000','125,000-149,999': '> 100,000',
    '150,000-199,999': '> 100,000','200,000-249,999': '> 100,000','250,000-299,999': '> 100,000',
    '300,000-500,000': '> 100,000','> $500,000': '> 100,000'}

mult_choice_responses['Annual Compensation'] = mult_choice_responses['Annual Compensation'].replace(compensation_replace_dict)
###############################################################################################

###############################################################################################
# preprocess the responses to "tools used"
for col in ['Basic_stats_software', 'Advanced_stats_software', 'BI_software', 'Local_envs', 'Cloud_software']:
    mult_choice_responses[col] = mult_choice_responses[col].str.strip().str.lower().fillna('')
###############################################################################################

###############################################################################################
# only keep necessary columns for profiling
non_text_cols = list(name_mapping.values())
text_cols     = list(chain(*text_question_cols.values()))
# list + set to remove duplicate work_roles
keep          = list(set(non_text_cols + text_cols))

mult_choice_responses = mult_choice_responses[keep]
###############################################################################################

In [None]:
# Looking @ just Data Scientists, Data Analysts
data_jobs = mult_choice_responses[mult_choice_responses['Title'].isin(['Data Scientist', 'Data Analyst'])]

num_analysts = len(data_jobs[data_jobs['Title'] == 'Data Analyst'])
num_scientists = len(data_jobs[data_jobs['Title'] == 'Data Scientist'])

#### Wrangling contd.

In [None]:
tools_lookup_dict = {
    'Basic_stats_software':    ['excel', 'python', 'sheets', 'r', 'power bi', 'sql', 'libra', 'tableau', 'weka'],
    'Advanced_stats_software': ['sas', 'spss', 'python', 'r', 'matlab', 'sap'],
    'BI_software':             ['tableau', 'power bi', 'qlik'], 
    'Local_envs':              ['jupyter', 'rstudio', 'pycharm', 'spyder', 'visual studio', 'vscode', 'anaconda'],
    'Cloud_software':          ['aws', 'amazon', 'azure', 'gcp', 'bigquery', 'colab', 'watson', 'ibm', 
                                'databricks', 'paperspace', 'sagemaker']}

tools_rename_dict = {
    'Basic_stats_software':    {'excel': 'Excel', 'python': 'Python', 'sheets': 'Sheets', 'r': 'R', 'power bi': 'Power BI', 
                                'sql': 'SQL', 'libra': 'Libra', 'tableau': 'Tableau', 'weka': 'Weka'},
    
    'Advanced_stats_software': {'sas': 'SAS', 'spss': 'SPSS', 'python': 'Python', 'r': 'R', 'matlab': 'Matlab', 'sap': 'SAP'},
    
    'BI_software':             {'tableau': 'Tableau', 'power bi': 'Power BI', 'qlik': 'Qlik'}, 
    
    'Local_envs':              {'jupyter': 'Jupyter', 'rstudio': 'RStudio', 'pycharm': 'PyCharm', 'spyder': 'Spyder', 
                                'visual studio': 'Visual Studio', 'vscode': 'Visual Studio', 'anaconda': 'Anaconda'},
    
    'Cloud_software':          {'aws': 'AWS', 'amazon': 'AWS', 'azure': 'Azure', 'gcp': 'GCP', 'bigquery': 'GCP', 
                                'colab': 'Colab', 'watson': 'IBM Watson', 'ibm': 'IBM Watson', 'databricks': 'Databricks',
                                'paperspace': 'Paperspace', 'sagemaker': 'Sagemaker'}}

def fuzzy_match(row, tool_category, match):
    '''function to fuzzy match values via the dict above'''
    name = row[tool_category]
    return fuzz.partial_ratio(name, match)

def create_tool_df(tool_category):
    '''creates a DataFrame of a certain tool category'''
    
    frames = []
    
    for tool in tools_lookup_dict[tool_category]:
        if tool != 'r':
            # fuzzy match cell values (threshold = > 70)
            temp = data_jobs[data_jobs.apply(fuzzy_match, tool_category=tool_category, match=tool, axis=1) > 70]
        else:
            # no good way to fuzzy match r
            temp = data_jobs[data_jobs[tool_category] == 'r']
            
        # narrow down columns (copy so we don't assign into a view of data_jobs)
        temp = temp[['Title', tool_category]].copy()
        # replace the free-text answer with the matched tool name
        temp[tool_category] = tool
        frames.append(temp)
        
    # combine the per-tool matches (DataFrame.append was removed in pandas 2.0)
    dataframe = pd.concat(frames)
    # rename    
    dataframe[tool_category] = dataframe[tool_category].replace(tools_rename_dict[tool_category])
        
    return dataframe

Basic_stats_software = create_tool_df('Basic_stats_software')
Advanced_stats_software = create_tool_df('Advanced_stats_software')
BI_software = create_tool_df('BI_software')
Local_envs = create_tool_df('Local_envs')
Cloud_software = create_tool_df('Cloud_software')

#### Creating helper functions for plotting & getting frequency %s

In [None]:
def get_single_col_freq_percents(dataframe, col):
    '''Get the % of responses per value (single column of df)'''
    # get counts by job title
    count_per_col = dataframe.groupby(['Title', col]).size()
    
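    # get frequency percentage via the counts (each count divided by its title's total);
    # transform keeps the original (Title, col) index, so reset_index below works on any pandas version
    col_freq_percents = pd.DataFrame(100 * count_per_col / count_per_col.groupby(level=0).transform('sum'))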
    
    # reset and rename index
    col_freq_percents.reset_index(inplace=True)
    col_freq_percents = col_freq_percents.rename(columns={0: '% of Responses'})
    
    # sort DataFrame
    order = {
        'Education': ['High school', 'Some college', 'Professional degree', 
                      'Bachelor’s degree', 'Master’s degree',  'Doctoral degree'],
        'Years_Exp_Data': ['< 1 years', '1-2 years', '3-5 years', '5-10 years', '10-20 years', '20+ years'],
        'Years_Exp_ML': ['< 1 years', '1-2 years', '2-3 years', '3-4 years', 
                         '4-5 years', '5-10 years', '10-15 years', '20+ years'],
        'Gender': ['Female', 'Male'],
        'Annual Compensation': ['< 10,000', '10,000 - 50,000', '50,000 - 99,000', '> 100,000']}
    
    if col in order:
        col_freq_percents[col] = pd.Categorical(col_freq_percents[col], order[col])
        # sort_values returns a new DataFrame, so the result has to be assigned back
        col_freq_percents = col_freq_percents.sort_values(col)
    
    return col_freq_percents


def get_multicol_freq_percents(question_list):
    '''Get frequency % of responses (over multiple columns)'''
    
    # get counts for Data Scientists and Data Analysts
    DA = data_jobs[data_jobs.Title == 'Data Analyst'][question_list].fillna('').stack().value_counts()
    DS = data_jobs[data_jobs.Title == 'Data Scientist'][question_list].fillna('').stack().value_counts()
    
    # calculate frequency percents
    DA = DA.apply(lambda x: (x / num_analysts) * 100)
    DS = DS.apply(lambda x: (x / num_scientists) * 100)
    
    # combine titles
    df = pd.DataFrame({'Data Analyst': DA, 'Data Scientist': DS}).drop(labels='')
    
    # select top 10 responses
    top10_idx = df.sum(axis=1).sort_values(ascending=False).head(10).index
    df = df.loc[top10_idx]
    
    return df

def single_col_bar_plotter(category, plot_size=(6,5), axis_limit=5.0, legend_loc='lower right', axes=None):
    '''plot a single column response'''
    
    #plt.figure(figsize=plot_size)
    if category == 'Country':
        df = get_single_col_freq_percents(data_jobs, 'Country')
        temp = df[df['Country'] != 'Other'] # Remove 'Others'
        temp = temp.groupby(['Country']).sum() # sum by % of respondants
        temp = temp.reset_index().sort_values('% of Responses', ascending=False).head(10) # top 10
        top_10_countries = list(temp['Country'])

        df = df[df['Country'].isin(top_10_countries)]
        ax = sns.barplot(x='% of Responses', y=category, data=df, 
                         hue='Title', palette=sns.color_palette(colors), edgecolor='.2', ax=axes)
        
    elif category in ['Basic_stats_software','Advanced_stats_software','BI_software','Local_envs','Cloud_software']:
        ax = sns.barplot(x='% of Responses', y=category, data=get_single_col_freq_percents(globals()[category], category),
                         hue='Title', palette=sns.color_palette(colors), edgecolor='.2', ax=axes)
        
    else:
        ax = sns.barplot(x='% of Responses', y=category, data=get_single_col_freq_percents(data_jobs, category),
                         hue='Title', palette=sns.color_palette(colors), edgecolor='.2', ax=axes)
  
    # title formatting
    if category == 'Country':
        ax.set_title('Top 10 Responding Countries\n', fontsize=14)
    elif category == 'Years_Exp_Data':
        ax.set_title('Years Using Code to Analyze Data\n', fontsize=14)
    elif category == 'Years_Exp_ML':
        ax.set_title('Years Using Machine Learning Methods\n', fontsize=14)
    elif category == 'Annual Compensation':
        ax.set_title(category + ' (US$) of Respondents \n', fontsize=14)
    elif category == 'Basic_stats_software':
        ax.set_title('Basic Statistical Software Used by Respondents\n', fontsize=14)
    elif category == 'Advanced_stats_software':
        ax.set_title('Advanced Statistical Software Used by Respondents\n', fontsize=14)        
    elif category == 'BI_software':
        ax.set_title('Business Intelligence Software Used by Respondents\n', fontsize=14)
    elif category == 'Local_envs':
        ax.set_title('Local Environments Used by Respondents\n', fontsize=14)    
    elif category == 'Cloud_software':
        ax.set_title('Cloud Software Used by Respondents\n', fontsize=14)          
    else:
        ax.set_title(category + ' of Respondents \n', fontsize=14)

    # legend formatting
    # create the legend once, on this plot's own axes, so the frame styling is kept
    legend = ax.legend(loc=legend_loc, frameon=True)
    legend_frame = legend.get_frame()
    legend_frame.set_facecolor('white')
    legend_frame.set_edgecolor('black')
    
    # axis formatting
    ax.yaxis.label.set_visible(False)
    ax.xaxis.set_major_formatter(mtick.PercentFormatter())
    x_ax_lim = int(math.ceil(ax.get_xlim()[1] / axis_limit) * axis_limit) # round up to the next multiple of axis_limit (5% by default)
    ax.set(xlim=(0.0, x_ax_lim))
    
    return ax

def multi_col_bar_plotter(category, plot_size=(6,5), axis_limit=5.0, axes=None):
    '''plots multi column responses'''
    
    #plt.figure(figsize=plot_size)
    
    # get frequency percents
    df = get_multicol_freq_percents(text_question_cols[category])
    
    # melt multiple columns
    df = df.reset_index()
    df = pd.melt(df,
                 id_vars=['index'], var_name='Title',
                 value_vars=['Data Analyst', 'Data Scientist'], value_name='% of Responses')
    
    ax = sns.barplot(x='% of Responses', y='index', data=df, hue='Title', 
                 palette=sns.color_palette(colors), edgecolor=".2", ax=axes)

    # title formatting
    if category == 'beginner_lang':
        ax.set_title('Respondents\' suggested programming language for beginners\n', fontsize=14)
    elif category == 'work_roles':
        ax.set_title('Respondents\' roles at work \n', fontsize=14)
    else:
        ax.set_title(category + ' used by Respondents \n', fontsize=14)
    
    # legend formatting
    ax.legend(loc='lower right')
    
    # axis formatting
    ax.yaxis.label.set_visible(False)
    ax.xaxis.set_major_formatter(mtick.PercentFormatter())
    x_ax_lim = int(math.ceil(ax.get_xlim()[1] / axis_limit) * axis_limit) # round up to the next multiple of axis_limit (5% by default)
    ax.set(xlim=(0.0, x_ax_lim))
    
    return ax

### The Landscape
The survey's "job title" question had 12 possible options (not including 'Other' or non-responses). Let's take a look at how many of the 19,717 respondents identified themselves as either a Data Scientist or Data Analyst:

In [None]:
ax = sns.countplot(x='Title', data=data_jobs, order=['Data Analyst', 'Data Scientist'],
              palette=sns.color_palette(colors), edgecolor='.2')

ax.set_title('Number of Respondents by Title\n')
ax.set_ylabel('')
ax.set_xlabel('')

for p in ax.patches:
    ax.annotate(format(p.get_height()), (p.get_x() + p.get_width() / 2.,p.get_height()), 
                ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')

Interesting. My assumption would have been that we'd see more Analysts than Scientists, as the Data Scientist title is often more sought after & more of a specialization. Then again, this *is* a Kaggle survey, you know, "Your Home for Data Science".

Back to the numbers, we're looking at about 8% of respondents being Data Analysts and 21% being Data Scientists.

*Note: for all of the following plots, I'll be using frequency percents (i.e., # of responses per group/# of total responses).*
* For example, # who answered Age = 18-21/total # of respondents to the Age question
* Total #s are within their respective profession
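
As a minimal sketch of that calculation (the toy rows below are made up for illustration and are not survey data), the frequency percent is each (title, answer) count divided by that title's total:

```python
import pandas as pd

# Toy responses, invented for illustration only
toy = pd.DataFrame({
    "Title": ["Data Analyst", "Data Analyst", "Data Analyst", "Data Analyst",
              "Data Scientist", "Data Scientist"],
    "Age":   ["18-21", "18-21", "22-24", "25-29",
              "22-24", "25-29"],
})

# Count responses per (Title, Age), then divide by each title's total
counts = toy.groupby(["Title", "Age"]).size()
freq_pct = 100 * counts / counts.groupby(level="Title").transform("sum")

print(freq_pct.round(1))
```

Because the denominator is the title's own total, the percentages for each title sum to 100, which is what makes the two groups comparable despite their different sizes.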

**Now for the meat of this exploration.**
## Demographics

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(20, 6))
plt.subplots_adjust(wspace=0.4)
single_col_bar_plotter('Age', legend_loc='upper right', axes=axarr[0])
single_col_bar_plotter('Gender', legend_loc='upper right', axes=axarr[1]);

**Age**

The difference here is subtle, but it looks like Data Analysts tend to be a bit younger on average and conversely, Data Scientists seem to be a bit older.

The shift between professions seems to occur between age 29 & 30, where Data Scientists begin to outweigh Analysts in every succeeding bin.

**Gender**

Setting aside the stark difference in overall response counts by gender, which has been covered in a number of other kernels, we can see a roughly equal 5-7% shift in titles between genders: more females are Data Analysts and more males are Data Scientists.

In [None]:
fig, axarr = plt.subplots(1, figsize=(10, 4))
single_col_bar_plotter('Country', legend_loc='upper right');

**Country**

Looking at the top 10 responding countries, we can see a few insights:
* The number one responder, India, overindexes in Data Analysts.
* Number two, the United States, has an even mix but a *slightly* larger Data Scientist population.
* Germany and France have quite a few more Data Scientists than Data Analysts - the largest title-by-title disparities on the plot.

Otherwise, no surprises with regard to who showed up here - all of these countries are large. The smallest of the bunch is Canada, which according to the United Nations' [Population Estimates](http://data.un.org/Data.aspx?d=PopDiv&f=variableID%3a12%3btimeID%3a83%2c84%3bvarID%3a2&c=2,4,6,7&s=_crEngNameOrderBy:asc,_timeEngNameOrderBy:desc,_varEngNameOrderBy:asc&v=1) ranks 39th worldwide in population.

## Career & Education

In [None]:
fig, axarr = plt.subplots(1, figsize=(10, 4))
single_col_bar_plotter('Education', legend_loc='upper right');

**Education**
* Data Analysts tend to have completed a lower level of schooling than Data Scientists. The vast majority of Data Scientists have at least a Bachelor's degree.
* Interestingly enough, individuals with Master's degrees tend to make up the largest portion of both job titles.
* Doctorate holders are about two times more likely to be Data Scientists.

In [None]:
fig, axarr = plt.subplots(1, figsize=(10, 4))
single_col_bar_plotter('Annual Compensation', legend_loc='lower right');

**Annual Compensation**

Two differences clearly stand out here:
1. Many more Data Analysts make < \$10,000 than Data Scientists
2. Many more Data Scientists make > \$100,000 than Data Analysts

> **Get that Scientist title if you want the dough**

Otherwise the biggest surprise to me as an American in this industry is the number of respondents making \$0 - \$50,000. PayScale.com's salary information has the average US salary for [Data Analysts](https://www.payscale.com/research/US/Job=Data_Analyst/Salary) at \$59,732 while [Data Scientists](https://www.payscale.com/research/US/Job=Data_Scientist%2C_IT/Salary) sit at a cool \$91,260. I know my surprise here might expose some global ignorance but it just goes to show how diverse the crowd here at Kaggle is.
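
The four buckets in the plot collapse the survey's finer salary ranges. The wrangling cell does this with a hand-written dictionary; as an alternative sketch, the same buckets can be derived from each range's lower bound. The helper name `bucket_compensation` is hypothetical, and it assumes the raw answers look like the keys of that dictionary (for example `'$0-999'` or `'> $500,000'`).

```python
import re

def bucket_compensation(answer):
    """Map a raw salary-range string to one of the four plotted buckets."""
    if answer.strip().startswith(">"):
        # Open-ended top range, e.g. '> $500,000'
        return "> 100,000"
    # The first number in the string is the lower bound of the range
    lower = int(re.search(r"[\d,]+", answer).group().replace(",", ""))
    if lower < 10_000:
        return "< 10,000"
    if lower < 50_000:
        return "10,000 - 50,000"
    if lower < 100_000:
        return "50,000 - 99,000"
    return "> 100,000"

for raw in ["$0-999", "40,000-49,999", "90,000-99,999", "> $500,000"]:
    print(raw, "->", bucket_compensation(raw))
```

Something like `mult_choice_responses['Annual Compensation'].dropna().map(bucket_compensation)` would apply it to the column; the explicit dictionary is arguably easier to audit, so this is a matter of taste.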

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(20, 6))
plt.subplots_adjust(wspace=0.4)
single_col_bar_plotter('Years_Exp_Data', axes=axarr[0])
single_col_bar_plotter('Years_Exp_ML', legend_loc='lower right', axes=axarr[1]);

**Years of Experience Using Code to Analyze Data & Using ML Methods**

Nothing here comes as too much of a surprise. 
* Most Data Analysts have been **using code to analyze data for < 5 years** and have only been **using machine learning methods for 0-2 years**.
* On the flip side, it looks like Data Scientists have been **using code to analyze data for 3-10 years** and have been **using machine learning methods for 2-10 years**.

This really reinforces what I'd assume after seeing the Educational breakdown - Data Scientists are simply more experienced. 

Lastly, one thing that stuck out to me was the sharp dropoff in response percentage by Data Scientists after the 5-10 year bin. Dropping the term "machine learning" into [Google Trends](https://trends.google.com/trends/explore?date=all&q=machine%20learning) produced the following graph:
![ml_trends_over_time](https://i.imgur.com/Fqr8yvA.png)

This graph really shows the growth in this industry kicked off around 2013 which lines up well with these responses.

In [None]:
fig, axarr = plt.subplots(1, figsize=(12, 6))
multi_col_bar_plotter('work_roles');

**Work Roles** 

Whenever I've read or taken part in discussions about the difference between these two titles, work roles have always seemed to be a differentiator. Without imparting my opinion on the matter, I'll talk about what the data above shows:
* **Far and away the largest response for Data Analysts is "Analyze data to influence biz decisions"**. This is self-explanatory given the word "Analyst" in their job title. It is still an important part of a Data Scientist's work though!
* Another interesting category in which Data Analysts overindex is "Build/run data infrastructure". I'd typically assume this role to be in the wheelhouse of Data Engineers or DBAs, so I think a good deep dive would be to see this broken out by company size. Maybe Analysts in smaller orgs are taking on more responsibilities?
* **Where Data Scientists 'take the lead', so to speak, is any role that involves machine learning**. "Prototype ML applications", "Improve existing ML models", "Build/run an internal ML service", and "Research state-of-the-art ML" are all a bigger part of the work for Data Scientist respondents than for Data Analysts.

<u><font size="1">Note: I renamed these roles from their true text strings.</font></u>
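
Work roles is a "select all that apply" question, so each option lives in its own column and the percentages are computed per respondent rather than per answer. A minimal sketch of that calculation, using made-up toy rows and hypothetical column names (`role_1`, `role_2`):

```python
import numpy as np
import pandas as pd

# Toy multi-select answers, invented for illustration:
# one column per option, NaN when the option was not selected
toy = pd.DataFrame({
    "Title":  ["Data Analyst", "Data Analyst", "Data Scientist", "Data Scientist"],
    "role_1": ["Analyze data", "Analyze data", "Analyze data", np.nan],
    "role_2": [np.nan, np.nan, "Prototype ML", "Prototype ML"],
})
option_cols = ["role_1", "role_2"]

def multiselect_pct(df, title):
    """% of respondents with `title` who selected each option."""
    group = df[df["Title"] == title]
    # Flatten the option columns into one Series and keep only selected options
    picks = group[option_cols].stack().dropna()
    return 100 * picks.value_counts() / len(group)

pct = pd.DataFrame(
    {t: multiselect_pct(toy, t) for t in ["Data Analyst", "Data Scientist"]}
).fillna(0)
print(pct)
```

Since a respondent can tick several options, the percentages within a title can add up to more than 100%.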

## Tooling Used
I'm not so sure these are a great indicator of the difference between titles but it will be interesting to see who is using what.

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(20, 6))
plt.subplots_adjust(wspace=0.4)
multi_col_bar_plotter('Programming Language', axes=axarr[0])
multi_col_bar_plotter('beginner_lang', axes=axarr[1]);

**Programming Language**
* You know those toothpaste commercials that say [4 out of 5 dentists recommend](https://www.youtube.com/watch?v=tXqAyMhgc7I)? 
    * Yeah well, **~6 out of 10 Data Scientists and ~5 out of 10 Data Analysts suggest that if you're a beginner, learn Python**. 
* As far as the languages used by these professionals, Python, R and SQL being the top 3 comes as no surprise.
* What does interest me is the slight uptick in Bash usage by Data Scientists - I'm going to guess this is a byproduct of some more heavy duty coding & model building. 

In [None]:
fig, axarr = plt.subplots(2, 2, figsize=(20, 10))
plt.subplots_adjust(wspace=0.4, hspace=0.6)
single_col_bar_plotter('BI_software', axes=axarr[0][0])
single_col_bar_plotter('Cloud_software', axes=axarr[0][1], legend_loc='lower right')
single_col_bar_plotter('Basic_stats_software', axes=axarr[1][0])
single_col_bar_plotter('Advanced_stats_software', axes=axarr[1][1], legend_loc='upper right');

**Software Used**

I chose to plot these 4 categories together because they're somewhat related and, honestly, there are no huge differences between the two titles.

What I will highlight is what is used most by these professions, together:
* **BI Software**: Tableau has a firm hold here. I guess that's why [Salesforce bought it](https://techcrunch.com/2019/06/10/salesforces-tableau-acquisition-is-huge-but-not-the-hugest/) for \$15.7 billion USD.
* **Cloud Software**: AWS, GCP, and Azure round out the top 3 here with AWS having a firm lead. Interestingly, Data Scientists seem to use Colab a fair bit more than Data Analysts.
* **Basic Statistical Software**: Excel and Google Sheets. That's it.
* **Advanced Statistical Software**: SAS and SPSS!! The classics. I know they don't get much love today but there's no doubting that they're used in legacy software.
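
These software categories come from free-text answers that were matched against a list of known tool names. The notebook does this with `fuzzywuzzy`'s `partial_ratio` and a threshold of 70. The sketch below substitutes the standard library's `difflib` with a sliding window so that it runs without extra dependencies; the helper names `partial_score` and `match_tool` are hypothetical, and the scores will not be identical to `fuzzywuzzy`'s.

```python
from difflib import SequenceMatcher

def partial_score(text, tool):
    """Best 0-100 similarity between `tool` and any same-length window of `text`."""
    text, tool = text.strip().lower(), tool.lower()
    if not text:
        return 0.0
    if len(text) <= len(tool):
        return 100 * SequenceMatcher(None, text, tool).ratio()
    # Slide a window the size of the tool name across the answer
    windows = (text[i:i + len(tool)] for i in range(len(text) - len(tool) + 1))
    return max(100 * SequenceMatcher(None, w, tool).ratio() for w in windows)

def match_tool(text, tools, threshold=70):
    """Return the first tool whose score exceeds `threshold`, else None."""
    for tool in tools:
        if partial_score(text, tool) > threshold:
            return tool
    return None

tools = ["tableau", "power bi", "qlik"]
print(match_tool("Tableu desktop", tools))     # misspelled answer
print(match_tool("microsoft powerbi", tools))  # missing space
print(match_tool("none", tools))               # no tool mentioned
```

Very short names such as `r` match almost anything under this kind of scoring, which is why the notebook falls back to an exact comparison for that one.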

In [None]:
fig, axarr = plt.subplots(1, figsize=(10, 4))
single_col_bar_plotter('Local_envs', legend_loc='lower right');

**Local Environments**

The two titles roughly mirror each other here, and both love Jupyter.

In [None]:
fig, axarr = plt.subplots(1, figsize=(10, 8))
multi_col_bar_plotter('Algorithms');

**Algorithms**

There's something to this one. 
* Both Data Scientists and Data Analysts *more often than not* use Linear/Logistic Regression or Decision Trees/Random Forests. **These are the clear favorites for Data Analysts** - makes sense given that they're accessible, classic techniques.
* More advanced, Deep Learning algorithms, like **Dense Neural Nets, CNNs, and Recurrent Neural Nets, are used more often by Data Scientists.** 
* Gradient Boosted Machines being chosen by > 50% of Data Scientists is an interesting result to me personally. I wonder if this is due to their extreme popularity in the Kaggle community.

In [None]:
fig, axarr = plt.subplots(1, figsize=(10, 4))
multi_col_bar_plotter('ML Frameworks');

**Machine Learning Frameworks**
* The only thing that stands out for Analysts is the number that replied "None". This does reinforce what we saw before in the work roles section though.
* Data Scientists overindex in all of these frameworks, also reinforcing their response about using ML at work.
* Oh, and Scikit-learn rules the game.

In [None]:
fig, axarr = plt.subplots(1, figsize=(10, 4))
multi_col_bar_plotter('Visualization Tools');

**Visualization Tools**

No tangible differences here, apart from the small % of Data Analysts who don't use visualization tools. One could say that Data Scientists use visualization tools slightly more often.

Matplotlib/seaborn dominate the Python ecosystem, ggplot/ggplot2 dominate R.

In [None]:
fig, axarr = plt.subplots(1, figsize=(10, 4))
multi_col_bar_plotter('Relational DB Tools');

**Relational Database Tools**

MySQL, PostgreSQL, and SQL Server are the favorites here. I'm interested in why **so many more Data Scientists use PostgreSQL vs. Data Analysts**.

In [None]:
fig, axarr = plt.subplots(1, figsize=(10, 8))
multi_col_bar_plotter('NLP Tools');

**NLP Tools**

Basically, Data Scientists are working on Natural Language Processing much more often than Data Analysts - we can see this in the low response rate for these tools among Data Analysts.

## Conclusion

Well - did we settle the debate? What do you think? Leave a comment below if you learned anything from this.

In my eyes, there was a clear line of separation between Data Analysts and Data Scientists when it came to years of experience, education level, and day-to-day work responsibilities. I hope some of the other data points I highlighted provided some illumination to the reader.

I omitted a decent number of questions that, in my opinion, didn't add to the dialogue. If there are any you'd like me to add, please let me know.

Lastly, if you'd like to fork this and look at some other professions - please change the 'Data Scientist' and 'Data Analyst' strings in cells 4, 6, 8.
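
One way to make that swap less error-prone is to keep the titles in a single list and derive everything else from it. This is only a sketch: `select_titles` is a hypothetical helper that is not part of the notebook, and the toy titles below are placeholders for whichever professions you want to compare.

```python
import pandas as pd

def select_titles(responses, titles, title_col="Title"):
    """Return the rows whose job title is in `titles`, plus per-title counts."""
    subset = responses[responses[title_col].isin(titles)]
    # reindex keeps the requested order and reports 0 for titles with no rows
    counts = subset[title_col].value_counts().reindex(titles, fill_value=0)
    return subset, counts

# Toy frame standing in for mult_choice_responses
toy = pd.DataFrame({"Title": ["Data Engineer", "Statistician", "Data Engineer", "Student"]})

subset, counts = select_titles(toy, ["Data Engineer", "Statistician"])
print(counts)
```

The per-title counts play the role of `num_analysts` and `num_scientists`, so the percentage helpers could look them up by title instead of hard-coding two names.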

Thanks for reading!