# 1. Competition main information

This dataset provides insights from Kaggle’s annual user survey focused on working data scientists.

Kaggle has conducted an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live from 09/01/2021 to 10/04/2021, and after cleaning the data we finished with 25,973 responses!

The data scientists and ML engineers submitted responses on their backgrounds and day to day experience - everything from educational details to salaries to preferred technologies and techniques.

The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration.

**As a <font color = 'blue'>Data Scientist</font> I want to understand who are the <font color = 'blue'>Kaggle Professionals</font> working with data** (as Data Scientists, Data/ML Engineers, etc) and the **<font color = 'blue'>Common Profiles</font>** for these professionals.

# 2. Import, Clean and Transform Data

Before analysing the data we will clean and transform the features in the dataset.

We will be executing the following steps:
   >       1. import needed packages
   >       2. create the utility functions and classes that will be used during the analysis
   >       3. import the raw data
   >       4. merge features for multiple selection questions

In [None]:
import numpy  as np # linear algebra
import pandas as pd # data processing
import os

import plotly.graph_objects as go
import plotly.express as px

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
def plotPercentageRate(p_data, p_feature, p_colorsIdx, p_title):
    v_plot_l1 = p_data.copy()
    v_plot_l1['Level_01'] = v_plot_l1[p_feature].apply(lambda x: x.split('|')[-2].replace('_', '<br>'))
    v_plot_l1 = ( v_plot_l1.reset_index().groupby(['Level_01'])[['index']].count()
                   .rename(columns = {'index': 'Count'}).reset_index() )
    v_plot_l1['Count'] = v_plot_l1['Count'] / v_plot_l1['Count'].sum() * 100
    v_lambda = ( lambda x:   '<i style="color:blue" >' + x['Level_01']
                           + '</i><br>' + '<b><i style="color:blue" >(' 
                           + str(np.round(x['Count'], 2)) + '%)</i></b>' )
    v_plot_l1['Level_01_D'] = v_plot_l1[['Level_01', 'Count']].apply(v_lambda, axis = 1)
    v_plot_l1 = v_plot_l1.set_index('Level_01')[['Level_01_D']].to_dict(orient = 'index')

    v_plot = ( p_data.reset_index().groupby([p_feature])[['index']].count()
                   .rename(columns = {'index': 'Count'}).reset_index() )
    v_plot['Count'] = v_plot['Count'] / v_plot['Count'].sum() * 100
    
    v_plot['Level_01'] = v_plot[p_feature].apply(lambda x: x.split('|')[-2].replace('_', '<br>'))
    if len(v_plot_l1) > 1:
        v_plot['Level_01'] = v_plot['Level_01'].apply(lambda x: v_plot_l1[x]['Level_01_D'])
    
    v_plot['Level_02'] = v_plot[p_feature].apply(lambda x: x.split('|')[-1].replace('_', '<br>'))
    v_lambda = ( lambda x:    x['Level_02'] 
                           + '<br>' + '<b>(' 
                           + str(np.round(x['Count'], 2)) + '%)</b>'
                           + '<br>' + ' ' )
    v_plot['Level_02'] = v_plot[['Level_02', 'Count']].apply(v_lambda, axis = 1)
    
    v_colors = ['crimson' if idx in p_colorsIdx else 'lightslategray' for idx in range(v_plot.shape[0])] 
    fig = go.Figure(data=[go.Bar( x = [ v_plot['Level_01'].tolist(), v_plot['Level_02'].tolist() ],
                                  y = v_plot['Count'],
                                  marker_color = v_colors )])
    fig.update_layout( title_text = f'{p_title} - Percentage Rate', height = 350,
                       margin = go.layout.Margin( l = 0, # left margin
                                                  r = 0, # right margin 
                                                  b = 0, # bottom margin
                                                  t = 40, # top margin
                     ) )
    fig.show()
    
    return

## 2.1 Import Raw Data

When importing the raw data we see that there are **25973** records and **369** features in the raw dataset.
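The survey file carries two header rows: the question code and the full question text. The sketch below illustrates how `header = [0, 1]` reads both rows and how the columns can then be flattened back to the codes. The three-column CSV is a made-up stand-in for the real file, not actual survey data.

```python
import io
import pandas as pd

# Hypothetical miniature of the survey file:
# row 1 holds the question codes, row 2 the full question text.
csv_text = (
    "Time from Start to Finish (seconds),Q1,Q7_Part_1\n"
    "Duration (in seconds),What is your age (# years)?,Which languages do you use? - Python\n"
    "910,50-54,Python\n"
    "784,22-24,\n"
)

# Reading both header rows yields two-level column labels (code, text).
raw = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# Keep the question text in a lookup and flatten the columns to the codes.
cols_map = {code: text for code, text in raw.columns}
raw.columns = list(cols_map.keys())

print(raw.shape)
print(cols_map["Q1"])
```

Keeping the code-to-text lookup separate keeps the column names short while the full question wording stays available for later checks.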

In [None]:
v_dataRaw = pd.read_csv( '/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv', 
                         sep = ',', header = [0, 1], low_memory = False )
v_colsMap = { col[0]: col[1] for col in v_dataRaw.columns }
v_dataRaw.columns = ['Time in Sec'] + list(v_colsMap.keys())[1:]
print('Dataset size: ', v_dataRaw.shape)
display(v_dataRaw.head(3))

## 2.2 Merge features for multiple selection questions

Responses to **multiple choice questions** (only a single choice can be selected) were recorded in individual columns. 

Responses to **multiple selection questions** (multiple choices can be selected) were split into multiple columns (with one column per answer choice). We will join all the columns linked to one question into one feature for easier processing and feature engineering.
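As a minimal sketch of the merge described above, the toy frame below joins the answer columns of one multiple selection question into a single `__`-separated feature. The column names and answers are illustrative assumptions, not rows from the survey.

```python
import pandas as pd

# Hypothetical answers to one multiple selection question (one column per choice).
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", None],
    "Q7_Part_2": ["R", None, None],
    "Q7_OTHER":  [None, "Other", None],
})

def join_selected(row):
    """Join the non-empty answers of a row, sorted, with '__' as separator."""
    return "__".join(sorted(value for value in row if value != ""))

part_cols = [col for col in toy.columns if col.startswith("Q7_")]
toy["Dev_Programming_Language"] = toy[part_cols].fillna("").apply(join_selected, axis=1)

print(toy["Dev_Programming_Language"].tolist())
```

Sorting the selected values before joining means two respondents with the same set of choices always get the same merged string, whatever the column order.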

In [None]:
# Check all the questions that were used in the survey
v_questions = sorted(set( [f"{'_'.join(key.split('_')[:2])} ==> {value[:200].split('?')[0]}" 
                              for key, value in v_colsMap.items()] ))
v_questions = pd.DataFrame({value.split(' ==> ')[0]: value.split(' ==> ')[1] for value in v_questions}, index = [0]).T
v_questions.reset_index(inplace = True)
v_questions.columns = ['Question', 'Description']
v_questions['No'] = v_questions['Question'].apply(lambda x: x.split('_')[0][1:]).apply(lambda x: 0 if not x.isnumeric() else int(x))
v_questions = v_questions[ ~v_questions['Question'].str.contains('OTHER') ].sort_values('No').drop('No', axis = 1)
fig = go.Figure(data = [go.Table( header = dict( values = list(v_questions.columns),
                                                 align  = 'left',
                                                 fill_color = 'paleturquoise'),
                                  cells  = dict( values = [ v_questions['Question'], v_questions['Description'] ],
                                                 fill_color = 'lavender',
                                                 align = 'left' )) ])
fig.update_layout( height = 350, margin = go.layout.Margin( l = 0, r = 0, b = 0, t = 40 ) )
fig.show()

#--------------------------------------------------------------------------------------------
X_data     = v_dataRaw[list(v_dataRaw.columns)[:7]].copy()
v_usedCols = X_data.columns.tolist()
X_data.columns = [ 'Filling_Duration', 'Age', 'Gender', 'Residence_Country', 'Highest_Degree', 'Role_Title', 
                   'Dev_Experience' ]
X_data['Age'] = X_data['Age'].replace({ '60-69': '60+',
                                        '70+':   '60+' })

#--------------------------------------------------------------------------------------------
v_lambda = lambda x: '__'.join(sorted([value for value in x if value != '']))
for item in [ ('Q25',      'Salary'),
              ('Q15',      'ML_years'),
              ('Q13',      'TPU_Usage'),
              ('Q24_Part', 'Role_Activities'),
              ('Q20',      'Company_Industry'),
              ('Q21',      'Company_Size'),
              ('Q22',      'Company_DS_Team_Size'),
              ('Q23',      'Company_ML_Used'),
              ('Q26',      'Company_ML_Budget'),
              ('Q8',       'Recommend_Programming_Language'),
              ('Q41',      'Most_Used_Analyse_Tool'), 
              ('Q9_Part',  'IDE'),
              ('Q7_Part',  'Dev_Programming_Language'),
              ('Q10_Part', 'Hosted_Notebooks'),
              ('Q12_Part', 'Specialized_Hardware'),
              ('Q14_Part', 'Data_Visualization_Library'),                    
              ('Q18_Part', 'Computer_Vision_Methods'),
              ('Q19_Part', 'NLP_Algorithm'),
              ('Q17_Part', 'ML_Algorithm'),
              ('Q16_Part', 'ML_Framework'),
              ('Q31_A_',   'ML_Managed'),
              ('Q31_B_',   'ML_Managed_2years'),
              ('Q36_A_',   'ML_Auto_01_Tools'),
              ('Q36_B_',   'ML_Auto_01_Tools_2years'),
              ('Q37_A_',   'ML_Auto_02_Tools'),
              ('Q37_B_',   'ML_Auto_02_Tools_2years'),
              ('Q38_A_',   'ML_Experiments'),
              ('Q38_B_',   'ML_Experiments_2years'),
              ('Q39_Part', 'ML_Public_Share'),
              ('Q40_Part', 'DS_Learning_Platform'),
              ('Q42_Part', 'DS_Media_Source'),      
              ('Q28',      'Best_Cloud_Platform'),
              ('Q27_A_',   'Cloud_Computing_Platform'),
              ('Q27_B_',   'Cloud_Computing_Platform_2years'),
              ('Q11',      'Cloud_Computing_Platform_Most_Used'),
              ('Q29_A_',   'Cloud_Computing_Products'),
              ('Q29_B_',   'Cloud_Computing_Products_2years'),
              ('Q30_A_',   'Data_Storage_Products'),
              ('Q30_B_',   'Data_Storage_Products_2years'),
              ('Q32_A_',   'BigData_Product'),
              ('Q32_B_',   'BigData_Product_2years'),
              ('Q33',      'BigData_Product_Most_Used'),
              ('Q34_A_',   'BI_Tool'),
              ('Q34_B_',   'BI_Tool_2years'),
              ('Q35',      'BI_Tool_Most_Used') ]:
    if '_' not in item[0]:
        X_data[item[1]] = v_dataRaw[item[0]]
        v_usedCols.append(item[0])
    else:
        v_tmpCols = [col for col in v_dataRaw.columns if item[0] in col
                                                         or (    '_Part' in item[0]
                                                             and col == f'{item[0].split("_")[0]}_OTHER') ]
        X_data[item[1]] = v_dataRaw[v_tmpCols].fillna('').apply(v_lambda, axis = 1)
        v_usedCols += v_tmpCols

v_data_Ck = v_dataRaw[[col for col in v_dataRaw.columns if col not in v_usedCols]]
if v_data_Ck.shape[1] > 0:
    display(v_data_Ck)
    display({key: value for key, value in v_colsMap.items() if key in v_data_Ck.columns})
    
X_baseData = X_data.reset_index()
X_baseData['index'] = X_baseData['index'].apply(lambda x: f'ID_{x}')
X_baseData.set_index('index', inplace = True)
display(X_baseData.head(3))

# 3. Filter the Dataset

The goal of this analysis is to find the different profiles of respondents that have replied to the **2021 Kaggle Survey** and:
   * are **<font color = 'blue'>Professionals</font>** - based on the feature **Role Title**
   * have a **<font color = 'blue'>University Degree</font>**
   * **<font color = 'blue'>Know whether their Company has Machine Learning Implemented</font>**
   * have **<font color = 'blue'>Experience with Machine Learning</font>**
   * have a **<font color = 'blue'>Declared Gender</font>**
   * are **<font color = 'blue'>less than 60 years old</font>**
   * have **<font color = 'blue'>Declared Their Salary</font>**
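The filters listed above are applied one after another, so it can help to track how many respondents remain after each step. The sketch below shows one way to do that on a made-up four-row frame; the filter names, the helper list and the data are assumptions for illustration only, and only three of the seven filters are shown.

```python
import pandas as pd

# Hypothetical respondents, not survey data.
toy = pd.DataFrame({
    "Role_Title": ["Student", "Data Scientist", "Data Analyst", "Software Engineer"],
    "Gender":     ["Man", "Woman", "Prefer not to say", "Man"],
    "Salary":     [None, "$0-999", "1,000-1,999", None],
})

# Each filter is a (label, mask function) pair applied in sequence.
filters = [
    ("Professionals",   lambda df: df["Role_Title"] != "Student"),
    ("Declared gender", lambda df: df["Gender"].isin(["Man", "Woman"])),
    ("Declared salary", lambda df: df["Salary"].notna()),
]

remaining = toy
log = [("Start", len(remaining))]
for name, mask in filters:
    remaining = remaining[mask(remaining)]
    log.append((name, len(remaining)))

print(pd.DataFrame(log, columns=["Step", "Respondents"]))
```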

## 3.1 Filter Dataset based on the **<font color = 'blue'>Role Title</font>**

We will check the different **Role Titles** that can be found in the dataset.

We can see that **26%** of the respondents in the dataset are **Students** and **74%** of the respondents could be considered as **Professionals**. 

Based on the distribution of the Role Titles, we **transform the data** into new categories in order to group the values that have very few respondents:

| Role Title | Percentage Rate | New Feature Value |
| --- | --- | --- |
| Data Scientist | 13.92% | Data Scientist |
| Machine Learning Engineer | 5.77% | Data/ML Engineer |
| Data Engineer | 2.57% | Data/ML Engineer |
| Research Scientist | 5.92% | Research |
| Statistician | 1.21% | Research |
| Data Analyst | 8.86% | Analyst |
| Business Analyst | 3.73% | Analyst |
| Software Engineer |  9.43% | Software Engineer |
| Program/Project Manager | 3.27% | IT__Other |
| Product Manager | 1.23% | IT__Other |
| DBA/Database Engineer | 0.66% | IT__Other |
| Developer Relations/Advocacy | 0.38% | IT__Other |
| Other | 9.21% | Other |
| Currently not employed | 7.65% | Currently not employed |

For the rest of the analysis we will only select the **<font color = 'blue'>Professionals</font>**.

In [None]:
import plotly.graph_objects as go

v_map = { 'Data Scientist':                  'Data Scientist',
          'Data Engineer':                   'Data/ML Engineer',
          'Machine Learning Engineer':       'Data/ML Engineer',
          'Research Scientist':              'Research',
          'Statistician':                    'Research',
          #-----------------------------------------------------------
          'Data Analyst':                    'Analyst',
          'Business Analyst':                'Analyst',
          #-----------------------------------------------------------
          'Software Engineer':               'Software Engineer',
          'Program/Project Manager':         'IT__Other',
          'Product Manager':                 'IT__Other',
          'DBA/Database Engineer':           'IT__Other',
          'Developer Relations/Advocacy':    'IT__Other', }
X_baseData['Role_Title_F'] = X_baseData['Role_Title'].replace(v_map)

v_cntRole_1 = X_baseData[ X_baseData['Role_Title'] == 'Student' ].shape[0] / X_baseData.shape[0] * 100
v_cntRole_2 = X_baseData[ X_baseData['Role_Title'] != 'Student' ].shape[0] / X_baseData.shape[0] * 100
v_data = { 'nodes':   [ 'Role<br>Titles', 
                        f'Students<br>({np.round(v_cntRole_1, 2)}%)', f'Professionals<br>({np.round(v_cntRole_2, 2)}%)' ],
           'parents': ['', 'Role<br>Titles', 'Role<br>Titles'],
           'values':  [ 100.1, v_cntRole_1, v_cntRole_2 * 1.001] }
v_cntRole_2 = v_cntRole_2 * X_baseData.shape[0] / 100
for value in X_baseData.reset_index().groupby('Role_Title_F')[['index']].count().reset_index().to_dict(orient = 'records'):
    if value['Role_Title_F'] == 'Student': continue
    v_perc = np.round(value['index'] / X_baseData.shape[0] * 100, 2)
    if v_perc > 5:
        v_data['nodes'].append(f"{value['Role_Title_F'].replace(' ', '<br>')}<br>({v_perc}%)")
    else:
        v_data['nodes'].append(value['Role_Title_F'].replace(' ', '<br>'))
    v_data['parents'].append(v_data['nodes'][2])
    v_data['values'].append(v_perc)

fig = go.Figure(go.Sunburst( labels  = v_data['nodes'],
                             parents = v_data['parents'],
                             values  = [np.round(value, 4) for value in v_data['values']],
                             branchvalues = 'total',
                             insidetextorientation = 'radial',
                             marker  = dict( colorscale = 'solar' ), ))
fig.update_layout(margin = dict(t = 0, l = 0, r = 0, b = 0), uniformtext=dict(minsize=10), height = 450)
fig.show()

X_professionals = X_baseData[ X_baseData['Role_Title'] != 'Student' ].copy()
X_professionals['Role_Title__Filtered'] = X_professionals['Role_Title_F']
X_professionals.drop(['Role_Title', 'Role_Title_F'], axis = 1, inplace = True)


## 3.2 Filter respondents based on the **<font color = 'blue'>Highest Degrees</font>**

We check the different values that have been provided for the feature **Highest Degrees**. We can see that we have the following distribution:
   - **9%** of respondents **didn't reply** or have **No University Degree** 
   - **91%** of respondents have a **University Degree**
   
As only **1.77%** of the respondents have a **Professional Doctorate**, we will group them with the Doctoral Degree holders under **Doctoral Degree +**.

For the rest of the analysis we will only select the respondents that have a **<font color = 'blue'>University Degree</font>**.

In [None]:
v_MLProfile = X_professionals.copy()
v_colName   = 'Highest_Degree'
v_colNameF  = f'{v_colName}__Filtered'
v_MLProfile[v_colName] = v_MLProfile[v_colName].fillna('0|Unknown|***NAV')

#-----------------------------------------------------------------------------------------------
for item in [ ('I prefer not to answer',               '1|No University|No_Answer'),
              ('No formal education past high school', '1|No University|No_University'),
              ('without earning a bachelor',           '1|No University|No_University'),
              ('Bachelor’s degree',                    '2|University|Bachelor_Degree'),
              ('Master’s degree',                      '3|University|Master_Degree'),
              ('Doctoral degree',                      '4|University|Doctoral_Degree'),
              ('Professional doctorate',               '5|University|Professional_Doctorate') ]:
    v_idx = v_MLProfile[ v_MLProfile[v_colName].apply(lambda x: item[0] in x) ].index
    v_MLProfile.loc[v_idx, v_colName] = item[1]

plotPercentageRate( v_MLProfile, v_colName, p_colorsIdx = [0, 1], p_title = 'Highest Degree - Initial')
v_MLProfile = v_MLProfile[ v_MLProfile[v_colName].apply(lambda x: x.split('|')[1] != 'No University' ) ]
v_MLProfile[v_colName] = v_MLProfile[v_colName].apply(lambda x: x.replace('|University', ''))
v_MLProfile[v_colName] = v_MLProfile[v_colName].replace({ '4|Doctoral_Degree':        '4|Doctoral_Degree+',
                                                          '5|Professional_Doctorate': '4|Doctoral_Degree+' })
v_MLProfile[v_colNameF] = v_MLProfile[v_colName].apply(lambda x: x.split('|')[0] + '| |' + x.split('|')[1])
v_MLProfile.drop(v_colName, axis = 1, inplace = True)
plotPercentageRate( v_MLProfile, v_colNameF, p_colorsIdx = [], p_title = 'Highest Degree - Filtered')
X_professionals = v_MLProfile.copy()


## 3.3 Filter respondents based on the **<font color = 'blue'>Company Machine Learning Implementation</font>**

We check the different **Company Machine Learning Implementations** that can be found in the dataset. We can see that we have the following distribution:
   - **29.6%** of respondents **didn't reply** or **don't know** whether Machine Learning is implemented in their company
   - **16.6%** of respondents work for a company where **No Machine Learning is implemented**
   - **53.6%** of respondents work for a company where **Machine Learning is implemented**

For the rest of the analysis we will only select the respondents for which the **Company Machine Learning Implementation** **<font color = 'blue'>is Known</font>**.

In [None]:
v_MLProfile = X_professionals.copy()
v_colName   = 'Company_ML_Used'
v_colNameF  = f'{v_colName}__Filtered'
v_MLProfile[v_colName] = v_MLProfile[v_colName].fillna('0|Unknown|***NAV')

#-----------------------------------------------------------------------------------------------
for item in [ ('I do not know',                              '0|Unknown|I do not know'),
              ('No (we do not use ML methods)',              '1|No_Machine_Learning|No_Machine_Learning'),
              ('We are exploring ML methods',                '2|Machine_Learning|Machine_Learning_Exploring'),
              ('We use ML methods for generating insights',  '3|Machine_Learning|Machine_Learning_Generate_Insights'),
              ('models in production for less than 2 years', '4|Machine_Learning|Machine_Learning_Less_than_2 years'),
              ('models in production for more than 2 years', '5|Machine_Learning|Machine_Learning_More_than_2 years') ]:
    v_idx = v_MLProfile[ v_MLProfile[v_colName].apply(lambda x: item[0] in x) ].index
    v_MLProfile.loc[v_idx, v_colName] = item[1]
    
plotPercentageRate( v_MLProfile, v_colName, p_colorsIdx = [0, 1], p_title = 'Machine Learning Implementation - Initial')
v_MLProfile[v_colNameF] = v_MLProfile[v_colName]
v_MLProfile.drop(v_colName, axis = 1, inplace = True)
v_MLProfile = v_MLProfile[ v_MLProfile[v_colNameF].apply(lambda x: x.split('|')[1] != 'Unknown' ) ]
plotPercentageRate( v_MLProfile, v_colNameF, p_colorsIdx = [], p_title = 'Machine Learning Implementation - Filtered')

X_professionals = v_MLProfile.copy()


## 3.4 Filter respondents based on the **<font color = 'blue'>Experience with Machine Learning</font>**

We check the different **Machine Learning Years** that can be found in the dataset. We can see that we have the following distribution:
   - **14%** of respondents **didn't reply** or **don't have Machine Learning Experience** 
   - **86%** of respondents have **Machine Learning Experience**

For the rest of the analysis we will only select the respondents that have **<font color = 'blue'>Machine Learning Experience</font>**.

We will also group the data based on the type of experience: **Very Junior** (less than 1 Year of Experience), **Junior** (with 1 to 2 Years of Experience), **Junior+** (with 2 to 3 Years of Experience), **Medium** (with 3 to 5 Years of Experience) and **Senior** (with more than 5 Years of Experience).

In the filtered dataset we can see that **71%** of the respondents are at a **Junior** level of Machine Learning experience (Very Junior, Junior or Junior+).
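The grouping into experience levels can be sketched with a plain dictionary lookup, as below. The exact answer strings and the sample answers are assumptions chosen for illustration; the notebook code itself matches on substrings of the survey answers instead.

```python
import pandas as pd

# Assumed answer wording mapped to the experience levels described above.
level_by_answer = {
    "Under 1 year":     "Very_Junior",
    "1-2 years":        "Junior",
    "2-3 years":        "Junior+",
    "3-4 years":        "Medium",
    "4-5 years":        "Medium",
    "5-10 years":       "Senior",
    "10-20 years":      "Senior",
    "20 or more years": "Senior",
}

# Hypothetical answers, including a non-user of ML and a missing reply.
answers = pd.Series([ "Under 1 year", "1-2 years", "2-3 years", "5-10 years",
                      "I do not use machine learning methods", None ])

# Answers without a level (no ML experience, no reply) map to NaN and are dropped.
levels = answers.map(level_by_answer)
kept = levels.dropna()

# Share of kept respondents whose level contains 'Junior' (Very_Junior, Junior, Junior+).
junior_share = kept.str.contains("Junior").mean() * 100
print(kept.tolist(), junior_share)
```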

In [None]:
v_MLProfile = X_professionals.copy()
v_colName   = 'ML_years'
v_colNameF  = f'{v_colName}__Filtered'
v_MLProfile[v_colName] = v_MLProfile[v_colName].fillna('0|Unknown|***NAV')

#-----------------------------------------------------------------------------------------------
for item in [ ('I do not use machine learning',            '0|No_Machine_Learning|No_ML'),
              ('Under 1 year',                             '1|Level_Very_Junior|Less than 1 year'),
              ('1-2 years',                                '2|Level_Junior|1 - 2 Years'),
              ('2-3 years',                                '3|Level_Junior+|2 - 3 Years'),
              ('3-4 years',                                '4|Level_Medium|3 - 4 Years'),
              ('4-5 years',                                '4|Level_Medium|4 - 5 Years'),
              ('5-10 years',                               '5|Level_Senior|5 - 10 Years'),
              ('10-20 years',                              '5|Level_Senior|10 - 20 Years'),
              ('20 or more',                               '5|Level_Senior|20+ Years'), ]:
    v_idx = v_MLProfile[ v_MLProfile[v_colName].apply(lambda x: item[0] in x) ].index
    v_MLProfile.loc[v_idx, v_colName] = item[1]

plotPercentageRate( v_MLProfile, v_colName, p_colorsIdx = [0, 1], p_title = 'Machine Learning Years - Initial')
v_MLProfile = v_MLProfile[ v_MLProfile[v_colName].apply(lambda x: 'Level' in x ) ]
v_MLProfile[v_colNameF] = v_MLProfile[v_colName].apply(lambda x: '|'.join(x.split('|')[:2]))
v_MLProfile.drop(v_colName, axis = 1, inplace = True)
v_MLProfile = v_MLProfile[ v_MLProfile[v_colNameF].str.contains('Level') ]
v_MLProfile[v_colNameF] = v_MLProfile[v_colNameF].apply(lambda x: x.split('|')[0]
                                                                  + (      '|Junior|' if 'Junior' in x.split('|')[1] 
                                                                      else '|Not_Junior|' )  
                                                                  + x.split('|')[1])
plotPercentageRate( v_MLProfile, v_colNameF, p_colorsIdx = [], p_title = 'Machine Learning Years - Filtered')
X_professionals = v_MLProfile.copy()


## 3.5 Filter respondents based on the **<font color = 'blue'>Gender</font>**

We check the different **Genders** that can be found in the dataset. We can see that we have the following distribution:
   - **1.72%** of respondents provided a value other than **Man** or **Woman** 
   - **98.2%** of respondents have Declared their Gender

For the rest of the analysis we will only select the respondents that have a **<font color = 'blue'>Declared Gender</font>**.

In the filtered dataset we can see that **85%** of the respondents are **Male** and **15%** are **Female**.

In [None]:
v_MLProfile = X_professionals.copy()
v_colName   = 'Gender'
v_colNameF  = f'{v_colName}__Filtered'
v_MLProfile[v_colName] = v_MLProfile[v_colName].fillna('0|Unknown|***NAV')

#-----------------------------------------------------------------------------------------------
for item in [ ('Prefer not to say',                        '0|Not_Declared|Prefer not to say'),
              ('Prefer to self-describe',                  '0|Not_Declared|Prefer to self-describe'),
              ('Nonbinary',                                '0|Not_Declared|Nonbinary'),
              ('Man',                                      '1|Declared|Male'),
              ('Woman',                                    '2|Declared|Female'), ]:
    v_idx = v_MLProfile[ v_MLProfile[v_colName].apply(lambda x: item[0] in x) ].index
    v_MLProfile.loc[v_idx, v_colName] = item[1]

plotPercentageRate( v_MLProfile, v_colName, p_colorsIdx = [0, 1, 2], p_title = 'Gender - Initial')
v_MLProfile = v_MLProfile[ v_MLProfile[v_colName].apply(lambda x: x.split('|')[1] == 'Declared' ) ]
v_MLProfile[v_colNameF] = v_MLProfile[v_colName].apply(lambda x: x.split('|')[0] + '| |' + x.split('|')[2])
v_MLProfile.drop(v_colName, axis = 1, inplace = True)
plotPercentageRate( v_MLProfile, v_colNameF, p_colorsIdx = [], p_title = 'Gender - Filtered')
X_professionals = v_MLProfile.copy()


## 3.6 Filter respondents based on the **<font color = 'blue'>Age</font>**

We check the different **Ages** that can be found in the dataset. We can see that we have the following distribution:
   - **39.6%** of respondents are **Less Than 30 Years** old 
   - **31%** of respondents are **Between 30 and 40 Years** old 
   - **17.6%** of respondents are **Between 40 and 50 Years** old 
   - **8%** of respondents are **Between 50 and 60 Years** old 
   - **3.5%** of respondents are **More than 60 Years** old

For the rest of the analysis we will only select the respondents that are **<font color = 'blue'>Less than 60 years old</font>**.

In the filtered dataset **60%** of the Professional respondents are **Younger than 35 years old**.

In [None]:
v_MLProfile = X_professionals.copy()
v_colName   = 'Age'
v_colNameF  = f'{v_colName}__Filtered'
v_MLProfile[v_colName] = v_MLProfile[v_colName].fillna('0|Unknown|***NAV')

#-----------------------------------------------------------------------------------------------
for item in [ ('18-21',                '1|Less_Than_30|18-24'),
              ('22-24',                '1|Less_Than_30|18-24'),
              ('25-29',                '1|Less_Than_30|25-29'),
              ('30-34',                '2|Less_Than_40|30-34'),
              ('35-39',                '2|Less_Than_40|35-39'), 
              ('40-44',                '3|Less_Than_50|40-44'), 
              ('45-49',                '3|Less_Than_50|45-49'), 
              ('50-54',                '4|Less_Than_60|50-59'), 
              ('55-59',                '4|Less_Than_60|50-59'), 
              ('60+',                  '5|60+|60+'), 
            ]:
    v_idx = v_MLProfile[ v_MLProfile[v_colName].apply(lambda x: item[0] in x) ].index
    v_MLProfile.loc[v_idx, v_colName] = item[1]

plotPercentageRate( v_MLProfile, v_colName, p_colorsIdx = [7], p_title = 'Age - Initial')
v_MLProfile = v_MLProfile[ v_MLProfile[v_colName].apply(lambda x: x.split('|')[1] != '60+' ) ]
v_map = ['18-24', '25-29', '30-34']
v_MLProfile[v_colNameF] = v_MLProfile[v_colName].apply(lambda x: x.split('|')[0] 
                                                                + ( '|Less Than 35 Years Old|' if x.split('|')[2] in v_map
                                                                    else '|More Than 35 Years Old|') 
                                                                + x.split('|')[2])
v_MLProfile.drop(v_colName, axis = 1, inplace = True)
plotPercentageRate( v_MLProfile, v_colNameF, p_colorsIdx = [], p_title = 'Age - Filtered')
X_professionals = v_MLProfile.copy()


## 3.7 Filter respondents based on the **<font color = 'blue'>Salary</font>**

We check the different **Salaries** that can be found in the dataset. We can see that we have the following distribution:
   - **2.6%** of respondents **did not declare their salary** 
   - **40%** of respondents have a **Low Salary** 
   - **26%** of respondents have a **Medium Salary** 
   - **31%** of respondents have a **High Salary** 

For the rest of the analysis we will only select the respondents that have **<font color = 'blue'>Declared their Salary</font>**.
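As an alternative sketch of the Low / Medium / High grouping, the upper bound of each salary range can be parsed and binned with `pd.cut`. The thresholds (10K and 50K) follow the grouping used in this notebook; the helper name `salary_upper_bound` and the sample answers are assumptions for illustration.

```python
import pandas as pd

def salary_upper_bound(answer):
    """Return the upper bound of a range such as '7,500-9,999'.

    Open-ended answers such as '>$1,000,000' return the stated amount.
    """
    cleaned = answer.replace(",", "").replace("$", "")
    if cleaned.startswith(">"):
        return int(cleaned[1:])
    return int(cleaned.split("-")[1])

# Hypothetical answers in the assumed survey wording.
answers = pd.Series([ "$0-999", "7,500-9,999", "10,000-14,999",
                      "40,000-49,999", "50,000-59,999", ">$1,000,000" ])

upper = answers.map(salary_upper_bound)

# Left-closed bins: [0, 10K) is Low, [10K, 50K) is Medium, [50K, inf) is High.
tier = pd.cut( upper, bins = [0, 10_000, 50_000, float("inf")],
               right = False, labels = ["Low", "Medium", "High"] )
print(tier.astype(str).tolist())
```

Binning on a numeric bound keeps the thresholds in one place, which makes it easier to try a different split than a chain of conditional expressions.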

In [None]:
v_MLProfile = X_professionals.copy()
v_colName   = 'Salary'
v_colNameF  = f'{v_colName}__Filtered'
v_lambda = (lambda x: x.replace(',', '').replace('$', '').replace('>', '-'))
v_MLProfile[v_colName] = v_MLProfile[v_colName].fillna('***NAV').apply(v_lambda)
v_lambda = (lambda x:      '0|Unknown|***NAV' if x == '***NAV' 
                      else '9|High|200k+'        if x == '-1000000'
                      else '1|Low|0 - 1K'        if int(x.split('-')[1]) < 1000
                      else '2|Low|1k - 2K'       if int(x.split('-')[1]) < 2000
                      else '3|Low|2K - 5K'       if int(x.split('-')[1]) < 5000
                      else '4|Low|5K - 10K'      if int(x.split('-')[1]) < 10000
                      else '5|Medium|10K - 20K'  if int(x.split('-')[1]) < 20000
                      else '6|Medium|20K - 50K'  if int(x.split('-')[1]) < 50000
                      else '7|High|50K - 100K'   if int(x.split('-')[1]) < 100000
                      else '8|High|100K - 200K'  if int(x.split('-')[1]) < 200000
                      else '9|High|200k+' )
v_MLProfile[v_colName] = v_MLProfile[v_colName].apply(v_lambda)

plotPercentageRate( v_MLProfile, v_colName, p_colorsIdx = [0], p_title = 'Salary - Initial')
v_MLProfile = v_MLProfile[ v_MLProfile[v_colName].apply(lambda x: x.split('|')[1] != 'Unknown' ) ]
v_MLProfile[v_colNameF] = v_MLProfile[v_colName]
v_MLProfile.drop(v_colName, axis = 1, inplace = True)
plotPercentageRate( v_MLProfile, v_colNameF, p_colorsIdx = [], p_title = 'Salary - Filtered')
X_professionals = v_MLProfile.copy()

# 4. Feature Engineering

We will apply feature engineering techniques for the following features:
   - [x] **Company Data Science Team Size**
   - [x] **Company Size**
   - [x] **Development Experience**
   - [x] **Residence Country**

## 4.1 Feature Engineering for feature **<font color = 'blue'>Company Data Science Team Size</font>**

We check the different **Company Data Science Team Sizes** that can be found in the dataset. After grouping them we get the following distribution:
   - **12.2%** of respondents work in a company that has **No Data Science Team** 
   - **23.5%** of respondents work in a company that has a **Very Small Data Science Team** 
   - **17.6%** of respondents work in a company that has a **Small Data Science Team** 
   - **20.3%** of respondents work in a company that has a **Medium Data Science Team** 
   - **26.2%** of respondents work in a company that has a **Large Data Science Team** 

In [None]:
v_MLProfile = X_professionals.copy()
v_colName   = 'Company_DS_Team_Size'
v_colNameF  = f'{v_colName}__Filtered'
v_MLProfile[v_colName] = v_MLProfile[v_colName].fillna('0|Unknown|***NAV')

#-----------------------------------------------------------------------------------------------
for item in [ ('0',                    '1|NO_DSTeam|0'),
              ('1-2',                  '2|Very_Small_DSTeam|1 - 2'),
              ('3-4',                  '3|Small_DSTeam|3 - 4'),
              ('5-9',                  '4|Medium_DSTeam|5 - 9'),
              ('10-14',                '4|Medium_DSTeam|10 - 14'),
              ('15-19',                '5|Large_DSTeam|15 - 19'),
              ('20+',                  '5|Large_DSTeam|20+'),
            ]:
    v_idx = v_MLProfile[ v_MLProfile[v_colName].apply(lambda x: item[0] == x) ].index
    v_MLProfile.loc[v_idx, v_colName] = item[1]

plotPercentageRate( v_MLProfile, v_colName, p_colorsIdx = [7], p_title = 'Data Science Team Size - Initial')
v_MLProfile[v_colNameF] = v_MLProfile[v_colName].apply(lambda x: x.split('|')[0] + '| |' + x.split('|')[1])
v_MLProfile.drop(v_colName, axis = 1, inplace = True)
plotPercentageRate( v_MLProfile, v_colNameF, p_colorsIdx = [], p_title = 'Data Science Team Size - Grouped')
X_professionals = v_MLProfile.copy()

## 4.2 Feature Engineering for feature **<font color = 'blue'>Company Size</font>**

We check the different **Company Size** values found in the dataset and group them into the following distribution:
   - **29.5%** of respondents work in a **Small Company** with 0 to 49 employees
   - **30.5%** of respondents work in a **Medium Company** with 50 to 999 employees
   - **40%** of respondents work in a **Large Company** with 1,000+ employees

In [None]:
v_MLProfile = X_professionals.copy()
v_colName   = 'Company_Size'
v_colNameF  = f'{v_colName}__Filtered'
v_MLProfile[v_colName] = v_MLProfile[v_colName].fillna('0|Unknown|***NAV')

#-----------------------------------------------------------------------------------------------
for item in [ ('0-49',                             '1|Small|0 - 49 employees'),
              ('50-249',                           '2|Medium|50 - 249 employees'),
              ('250-999',                          '2|Medium|250 - 999 employees'),
              ('1000-9,999',                       '3|Large|1000 - 9999 employees'),
              ('10,000 or more',                   '3|Large|10000+ employees') ]:
    v_idx = v_MLProfile[ v_MLProfile[v_colName].apply(lambda x: item[0] in x) ].index
    v_MLProfile.loc[v_idx, v_colName] = item[1]

plotPercentageRate( v_MLProfile, v_colName, p_colorsIdx = [], p_title = 'Company Size - Initial')
v_MLProfile[v_colNameF] = v_MLProfile[v_colName].apply(lambda x: '|'.join(x.split('|')[:2]))
v_MLProfile[v_colNameF] = v_MLProfile[v_colNameF].apply(lambda x: x.split('|')[0] + '| |' + x.split('|')[1])
v_MLProfile.drop(v_colName, axis = 1, inplace = True)
plotPercentageRate( v_MLProfile, v_colNameF, p_colorsIdx = [], p_title = 'Company Size - Grouped')
X_professionals = v_MLProfile.copy()

## 4.3 Feature Engineering for feature **<font color = 'blue'>Development Experience</font>**

We check the different **Development Experience** levels found in the dataset and group them into the following distribution:
   - **13%** of respondents are **Very Junior** with less than 1 year of Development Experience
   - **26%** of respondents are **Junior** with 1 to 3 years of Development Experience
   - **37%** of respondents are **Medium** with 3 to 10 years of Development Experience
   - **24%** of respondents are **Senior** with more than 10 years of Development Experience

In [None]:
v_MLProfile = X_professionals.copy()
v_colName   = 'Dev_Experience'
v_colNameF  = f'{v_colName}__Filtered'
v_MLProfile[v_colName] = v_MLProfile[v_colName].fillna('0|Unknown|***NAV')

#-----------------------------------------------------------------------------------------------
for item in [ ('I have never written code',                '0|No_Coding|No_Coding'),
              ('< 1 years',                                '1|Level_Very_Junior|Less than 1 year'),
              ('1-3 years',                                '2|Level_Junior|1 - 3 years'),
              ('3-5 years',                                '3|Level_Medium|3 - 5 years'),
              ('5-10 years',                               '3|Level_Medium|5 - 10 years'),
              ('10-20 years',                              '4|Level_Senior|10 - 20 years'),
              ('20+',                                      '4|Level_Senior|20+ years'), ]:
    v_idx = v_MLProfile[ v_MLProfile[v_colName].apply(lambda x: item[0] in x) ].index
    v_MLProfile.loc[v_idx, v_colName] = item[1]

plotPercentageRate( v_MLProfile, v_colName, p_colorsIdx = [], p_title = 'Development Experience - Initial')
# Keep only respondents with coding experience; .copy() avoids assigning into a slice below
v_MLProfile = v_MLProfile[ v_MLProfile[v_colName].apply(lambda x: 'Level' in x ) ].copy()
v_MLProfile[v_colNameF] = v_MLProfile[v_colName].apply(lambda x: '|'.join(x.split('|')[:2]))
v_MLProfile[v_colNameF] = v_MLProfile[v_colNameF].apply(lambda x: x.split('|')[0] + '| |' + x.split('|')[1])
v_MLProfile.drop(v_colName, axis = 1, inplace = True)
plotPercentageRate( v_MLProfile, v_colNameF, p_colorsIdx = [], p_title = 'Development Experience - Grouped')
X_professionals = v_MLProfile.copy()

## 4.4 Feature Engineering for feature **<font color = 'blue'>Residence Country</font>**

We check the different **Residence Country** values found in the dataset and group them by continent, which gives the following distribution:
   - **42%** of respondents are from **Asia**, of which **22%** of respondents are from **India**
   - **21%** of respondents are from **Europe**
   - **14%** of respondents are from **North America**
   - **7.5%** of respondents are from **Africa**
   - **7%** of respondents are from **South America**
   - **3.5%** of respondents are from **Oceania**

In [None]:
v_MLProfile = X_professionals.copy()
v_colName   = 'Residence_Country'
v_colNameF  = f'{v_colName}__Filtered'
v_MLProfile[v_colName] = v_MLProfile[v_colName].fillna('0|Unknown|***NAV')

#-----------------------------------------------------------------------------------------------
v_map = { 'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
          'United States of America':                             'United States',
          'Hong Kong (S.A.R.)':                                   'China',
          'South Korea':                                          'S. Korea',
          'South Africa':                                         'S. Africa',
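          # NOTE: 'Taiwan' is mapped to 'Vietnam', which looks like a copy-paste slip;
          # only the continent is kept afterwards, so both are expected to resolve to Asia
          'Taiwan':                                               'Vietnam',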
          'Viet Nam':                                             'Vietnam',
          'Iran, Islamic Republic of...':                         'Iran',
          'I do not wish to disclose my location':                'Other' }
v_countriesMap = pd.read_csv( '/kaggle/input/countries/Countries.csv', sep = ',', low_memory = False )
v_continentsMap = { 'eu':    '1|Europe|Others',
                    'na':    '2|North America|Others',
                    'sa':    '3|South America|Others',
                    'af':    '4|Africa|Others',
                    'asia':  '5|Asia|Others',
                    'ocean': '6|Oceania|Others' }
v_countriesMap['continent'] = v_countriesMap['continent'].apply(lambda x: x if not x in v_continentsMap.keys() 
                                                                          else v_continentsMap[x] )
v_countriesMap = v_countriesMap.set_index('name')[['continent']].to_dict(orient = 'index')
v_countriesMap = { key: value['continent'] for key, value in v_countriesMap.items() }
v_countriesMap['India'] = '5|Asia|India'
v_countriesMap['United States'] = '2|North America|USA'
v_countriesMap['Other'] = '9|Other|Other'

#-----------------------------------------------------------------------------------------------
v_MLProfile[v_colName] = v_MLProfile[v_colName].replace(v_map).replace(v_countriesMap)
plotPercentageRate( v_MLProfile, v_colName, p_colorsIdx = [], p_title = 'Residence Country - Grouped')
v_MLProfile[v_colNameF] = v_MLProfile[v_colName]
v_MLProfile.drop(v_colName, axis = 1, inplace = True)
X_professionals = v_MLProfile.copy()
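
The two-step lookup used above (first normalise the country name, then replace it with a continent label) can be sketched on a toy series. The dictionaries below are hypothetical stand-ins for `v_map` and for the lookup built from `Countries.csv`, whose real content is not shown here.

```python
import pandas as pd

countries = pd.Series(['India', 'Viet Nam', 'United States of America', 'France'])

# Hypothetical stand-ins for v_map and the Countries.csv lookup.
name_fixes   = {'Viet Nam': 'Vietnam', 'United States of America': 'United States'}
continent_of = {'India':         '5|Asia|India',
                'Vietnam':       '5|Asia|Others',
                'United States': '2|North America|USA',
                'France':        '1|Europe|Others'}

# Chain the two replacements: survey name -> short name -> 'rank|continent|detail'.
labelled  = countries.replace(name_fixes).replace(continent_of)
continent = labelled.str.split('|').str[1]
print(continent.tolist())
```

A country missing from the second dictionary would simply keep its name, so it is worth checking the result for labels without a `|` separator.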

# 5. Common Profiles

We will apply an anomaly detection algorithm in order to identify:
   - [x] **Common Profiles**
   - [ ] **Special Profiles**
   
Anomaly detection, also called outlier detection, is the identification of unexpected observations or items that differ significantly from the norm. Anomaly detection rests upon two basic assumptions:
   * Anomalies in data occur only very rarely
   * The features of data anomalies are significantly different from those of normal instances
   
Based on the encoded data returned by the outlier detection algorithm, we will filter the common profiles and do a segmentation through clustering.

We will also apply a multi-class classification model in order to check that the profiles are well separated.

In the end we will analyse all these profiles in order to understand their common and/or specific characteristics.

## 5.1. Create a training and validation dataset

We will split the dataset into train and validation.

The training dataset will be used by the model during the learning process.

The validation dataset is a sample of data held back from training the model that is used to give an estimate of the model skill and performance.
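
As a minimal sketch of what a stratified split does, the toy example below (made-up data, not the survey) keeps the class balance identical in both parts. It uses the same `train_test_split` arguments as the cell that follows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, two balanced classes (hypothetical, not the survey data).
X_toy = pd.DataFrame({'feature': range(20)})
y_toy = pd.Series([0] * 10 + [1] * 10)

# stratify keeps the class proportions of y_toy in both the train and validation parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=2013, stratify=y_toy)

print(len(X_tr), len(X_va))
print(y_va.value_counts().to_dict())
```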

In [None]:
from sklearn.model_selection import train_test_split

X_data = X_professionals[[col for col in X_professionals.columns if '__Filtered' in col]].copy()
X_data.columns = [col.replace('__Filtered', '') for col in X_data.columns]
for col in X_data.columns:
    if col in ['Residence_Country']: continue
    v_values = X_data[col].value_counts().index.tolist()
    if sum([1 for value in v_values if '|' in value]) > 0:        
        if len(set([value.split('|')[1] for value in v_values])) > 1:
            X_data[f'{col}__Categ'] = X_data[col].apply(lambda x: x.split('|')[1])
        X_data[col] = X_data[col].apply(lambda x: x.split('|')[0] + '__' + x.split('|')[2])
y_data = pd.DataFrame()
v_labelCols = [col for col in X_data.columns if 'Salary' in col]
for label in sorted(set([col for col in v_labelCols])):
    y_data[label] = X_data[[col for col in v_labelCols if label in col]].sum(axis = 1) 

X_data  = X_data.drop(v_labelCols, axis = 1)
X_dataD = pd.get_dummies(X_data)
display(X_data.head(3).T.head())
display(y_data.head(3).T.head())
X_train, X_valid, y_train, y_valid = train_test_split( X_dataD, y_data,
                                                       test_size    = 0.3, 
                                                       random_state = 2013,
                                                       stratify     = y_data )
print('Total number of records: ', X_data.shape, y_data.shape)
print('Train records: ', X_train.shape, y_train.shape)

## 5.2 Identify the **Common Profiles**

We create the AutoEncoder model which will be trained in order to identify the **<font color = 'blue'>Common Profiles</font>**.

This **Unsupervised Method** of anomaly detection detects anomalies in the unlabeled dataset based solely on the intrinsic properties of that data. The working assumption is that, as in most cases, the large majority of the instances in the dataset will be normal. The anomaly detection algorithm will then flag the instances that fit least well with the rest of the dataset.

In [None]:
import tensorflow as tf
from numpy.random import seed
from datetime import datetime

seed(2013)
tf.random.set_seed(2013)

In [None]:
class AutoEncoderBase(tf.keras.models.Model):
    
    __threshold__  = None
    
    def __init__(self, p_inputDims, p_outputDims, p_encodedDims):
        super().__init__()    
        # deconstruct / encoder
        self.__encoder__ = tf.keras.models.Sequential([ 
                                  tf.keras.layers.Dense(p_inputDims, activation = 'selu'), 
                                  tf.keras.layers.Dense(30, activation = 'selu'),
                                  tf.keras.layers.Dropout(0.1),
                                  tf.keras.layers.Dense(20, activation = 'selu'),
                                  tf.keras.layers.Dropout(0.1),
                                  tf.keras.layers.Dense(p_encodedDims, activation = 'selu') ])
        # reconstruction / decoder
        self.__decoder__ = tf.keras.models.Sequential([                   
                                  tf.keras.layers.Dense(20, activation = 'selu'),
                                  tf.keras.layers.Dropout(0.1),
                                  tf.keras.layers.Dense(30, activation = 'selu'),
                                  tf.keras.layers.Dropout(0.1),
                                  tf.keras.layers.Dense(p_outputDims, activation = 'relu') ])            
        return
        
    def call(self, inputs):
        v_encoded = self.__encoder__(inputs)
        v_decoded = self.__decoder__(v_encoded)
        return v_decoded
    
    def getEncoder(self):
        return self.__encoder__

class AutoEncoder():
    
    __model__   = None
    __encoder__ = None
    
    def __init__(self, p_model):
        self.__model__   = p_model
        self.__encoder__ = self.__model__.get_layer(self.__model__.layers[0].name)
        self.__encoder__.summary()
        return
    
    def set_threshold(self, X_data, y_data):
        y_pred = self.__model__.predict(X_data.values)
        v_loss = tf.keras.losses.mean_squared_logarithmic_error(y_data, y_pred).numpy()
        self.__threshold__ = np.mean(v_loss) + 1.8 * np.std(v_loss)
        return 
    
    def getPredictions(self, X_data, y_data):
        y_pred = self.__model__.predict(X_data.values)
        v_loss = pd.Series(tf.keras.losses.mean_squared_logarithmic_error(y_data, y_pred).numpy(), index = X_data.index)
        v_loss = ( v_loss <= self.__threshold__ ).astype(int)
        y_pred = pd.DataFrame(self.__encoder__.predict(X_data.values), index = X_data.index)
        y_pred.columns = [f'Dimension__{col + 1}' for col in y_pred.columns]
        y_pred['Common'] = v_loss
        display(v_loss.value_counts())
        return y_pred

try:
    v_path = '/kaggle/input/professionals-common-profiles-autoencoder/autoencoder_Model.dmp'
    v_autoencoder = tf.keras.models.load_model(v_path)
except Exception:  # no saved model could be loaded, so train a new one
    v_path = 'autoencoder_Model.dmp'
    v_BATCH_SIZE = 16

    # define our early stopping
    early_stop = tf.keras.callbacks.EarlyStopping( monitor   = 'val_loss',
                                                   min_delta = 0.0001,
                                                   patience  = 10,
                                                   verbose   = 1, 
                                                   mode      = 'min',
                                                   restore_best_weights = True ) 

    v_autoencoder = AutoEncoderBase( p_inputDims   = X_train.shape[1], 
                                     p_outputDims  = X_train.shape[1], 
                                     p_encodedDims = 9 )
    v_autoencoder.compile(loss = 'mse', metrics = ['msle'], optimizer = 'adam')

    # Train the model
    history = v_autoencoder.fit( X_train, X_train,
                                 shuffle    = True,
                                 epochs     = 25,
                                 batch_size = v_BATCH_SIZE,
                                 verbose    = 2,
                                 callbacks  = [early_stop],
                                 validation_data = (X_valid, X_valid) )

    v_autoencoder.save(v_path)
    v_autoencoder = tf.keras.models.load_model(v_path)

v_autoencoder = AutoEncoder(v_autoencoder)

# Compute the threshold based on the training data
v_autoencoder.set_threshold(X_train, X_train)

# Create predictions
v_predEnc = pd.concat([ v_autoencoder.getPredictions(X_dataD, X_dataD), X_data ], axis = 1) 

# Display the distribution of the profiles
v_plot = ( v_predEnc['Common'].fillna(2).value_counts() / v_predEnc.shape[0] * 100 ).reset_index()
v_plot.columns = ['Common Profile', 'Percentage']
v_plot['Common Profile'] = v_plot['Common Profile'].replace({0: 'No', 1: 'Yes'})

fig = px.bar(v_plot, y = 'Common Profile', x = 'Percentage', color = 'Common Profile',
             height = 400)
fig.show()

# Select only the common profiles
v_predEnc = v_predEnc[ v_predEnc['Common'] == 1 ].drop('Common', axis = 1)

We can see that **5%** of the dataset has been marked as **Special Profiles** and that **95%** of the dataset has been marked as **Common Profiles**.
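
The split between common and special profiles comes from the threshold rule in `set_threshold` (mean plus 1.8 standard deviations of the reconstruction loss). A minimal sketch with made-up loss values shows how the rule isolates the rows that are reconstructed poorly:

```python
import numpy as np

# Hypothetical per-row reconstruction errors: 19 well-reconstructed rows and one poorly reconstructed row.
errors = np.array([0.01] * 19 + [0.50])

# Same rule as set_threshold: mean + 1.8 standard deviations of the loss.
threshold = errors.mean() + 1.8 * errors.std()

# Rows at or below the threshold count as common, the rest as special.
is_common = errors <= threshold
print(int(is_common.sum()), 'common,', int((~is_common).sum()), 'special')
```

In the notebook the threshold is computed on the training rows only and then applied to the whole dataset, so the share of special profiles is a result of the data, not a fixed quota.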

## 5.3. Create the profiles through clustering

People naturally group things by similarity, especially when it comes to business statistics. With data, this process becomes much more precise and is called clustering: the act of grouping records that share common characteristics.

Separating into groups, categorizing, and segmenting is a way of gathering information or data based on common characteristics.

We decide to create these profiles in order to better understand the common attributes of the Professionals, but also the differences between them.
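
To make the idea concrete, here is a minimal K-Means sketch on six made-up 2D points forming two obvious groups. In the notebook the same estimator is applied to the encoded `Dimension__` columns, not to toy points.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy groups (hypothetical values).
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# n_init and random_state are set explicitly so the run is reproducible.
model  = KMeans(n_clusters=2, n_init=10, random_state=2013)
labels = model.fit_predict(points)

# Points of the same group share a label; the label numbers themselves are arbitrary.
print(labels)
```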

In [None]:
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster     import KMeans

X_kmean = v_predEnc[[col for col in v_predEnc.columns if 'Dimension__' in col]]
v_model = KMeans(random_state = 2013)   # seed the estimator itself so the elbow search is reproducible
# k is the range of cluster counts to evaluate.
visualizer = KElbowVisualizer(v_model, k = (2, 30), timings = False)
visualizer.fit(X_kmean)              # Fit data to visualizer
visualizer.show()                    # Finalize and render figure

v_kmean = KMeans(n_clusters = 9, random_state = 2013)
v_kmean.fit(X_kmean)
X_clusters = v_predEnc[[col for col in v_predEnc.columns if not 'Dimension__' in col]].copy()
X_clusters['Cluster'] = v_kmean.predict(X_kmean)

We will create **9** profiles for the Professionals.

In order to understand how well separated our profiles are, and which characteristics they have in common, we will create a LightGBM multi-class classification model.

In [None]:
import lightgbm as lgb
import sklearn.metrics as metrics

def plot_confusion_matrix(cm,
                          target_names,
                          title='Confusion matrix',
                          cmap=None,
                          normalize=True):
    """
    given a sklearn confusion matrix (cm), make a nice plot

    Arguments
    ---------
    cm:           confusion matrix from sklearn.metrics.confusion_matrix
    target_names: given classification classes such as [0, 1, 2] the class names, for example: ['high', 'medium', 'low']
    title:        the text to display at the top of the matrix
    cmap:         the gradient of the values displayed from matplotlib.pyplot.cm
                  see http://matplotlib.org/examples/color/colormaps_reference.html
    normalize:    If False, plot the raw numbers
                  If True, plot the proportions

    Usage
    -----
    plot_confusion_matrix(cm           = cm,                  # confusion matrix created by
                                                              # sklearn.metrics.confusion_matrix
                          normalize    = True,                # show proportions
                          target_names = y_labels_vals,       # list of names of the classes
                          title        = best_estimator_name) # title of graph

    Citation
    ---------
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

    """
    import matplotlib.pyplot as plt
    import numpy as np
    import itertools

    accuracy = np.trace(cm) / float(np.sum(cm))
    misclass = 1 - accuracy

    if cmap is None:
        cmap = plt.get_cmap('Blues')

    plt.figure(figsize=(8, 6))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

    if target_names is not None:
        tick_marks = np.arange(len(target_names))
        plt.xticks(tick_marks, target_names, rotation=45)
        plt.yticks(tick_marks, target_names)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
        
    cm = pd.DataFrame(cm.tolist())
    for col in cm.columns:
        cm[col] = cm[col].apply(lambda x: 0 if x < 0.1 else np.round(x, 2))
    cm = cm.values
    
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if normalize:
            plt.text(j, i, cm[i, j] if cm[i, j] > 0 else '',
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        else:
            plt.text(j, i, "{:,}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass))
    ax = plt.gca()
    ax.grid(False)
    plt.show()
    return

# Create the training and validation datasets
X_lgb = pd.get_dummies(X_clusters.drop(['Cluster'], axis = 1).fillna('***NAV'))
y_lgb = X_clusters['Cluster']

X_train, X_valid, y_train, y_valid = train_test_split( X_lgb, y_lgb,
                                                       test_size    = 0.3, 
                                                       random_state = 2013,
                                                       stratify     = y_lgb )
print('Total number of records: ', X_lgb.shape, y_lgb.shape)
print('Train records: ', X_train.shape, y_train.shape)

# Set the parameters for the LGB Model
v_params = {}
v_params['objective']        = 'multiclass' # the target to predict is the cluster label
v_params['is_unbalance']     = False
v_params['n_jobs']           = -1
v_params['random_state']     = 2013
v_params['metric']           = ['multi_logloss']
v_params['num_class']        = y_lgb.nunique()  # cluster labels run from 0 to n_clusters - 1
v_params['learning_rate']    = 0.05
v_params['num_leaves']       = 30
v_params['max_depth']        = 9
v_params['feature_fraction'] = 0.75
v_params['bagging_fraction'] = 0.75
v_params['verbosity'] = -1

# Train the LGB model
v_lgbtrain = lgb.Dataset(X_train, y_train)
v_lgbvalid = lgb.Dataset(X_valid, y_valid)
v_model    = lgb.train( v_params, v_lgbtrain, 2000, valid_sets = [v_lgbtrain, v_lgbvalid], early_stopping_rounds = 100, 
                        verbose_eval = 20 )

# Reduce the number of features used and retrain the model if we have more than 75 features in the base dataset
if X_train.shape[1] > 75:
    v_features = pd.DataFrame(sorted(zip(X_train.columns, v_model.feature_importance('gain'))), columns=['Feature', 'Value'])
    v_features = v_features.sort_values('Value', ascending = False).head(75)['Feature'].tolist()
    X_lgb = X_lgb[v_features]
    v_lgbtrain = lgb.Dataset(X_train[v_features], y_train)
    v_lgbvalid = lgb.Dataset(X_valid[v_features], y_valid)
    v_model    = lgb.train( v_params, v_lgbtrain, 2000, valid_sets = [v_lgbtrain, v_lgbvalid], early_stopping_rounds = 100, 
                            verbose_eval = 20 )

# Generate the predictions
v_pred = pd.DataFrame(v_model.predict(X_lgb, num_iteration = v_model.best_iteration), index = X_lgb.index)
v_pred['__Pred'] = v_pred.apply(lambda x: np.argmax(x), axis = 1)
v_pred['__Real'] = y_lgb
v_pred[['__Real', '__Pred']].value_counts()

# Show the confusion matrix
v_confMatrix = metrics.confusion_matrix(v_pred['__Real'], v_pred['__Pred'])
plot_confusion_matrix(v_confMatrix, [f'Cluster_{label + 1}' for label in sorted(v_pred['__Real'].unique())] )

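We can see from the **exceptional performance** of the model that the different profiles that we have identified have specific characteristics which **<font color = 'blue'>Make Them Unique</font>**. Note that the confusion matrix is computed on the full dataset, so it also includes the rows the model was trained on.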
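
The two quantities shown in the plot, overall accuracy and row-normalised proportions, are computed inside `plot_confusion_matrix` as sketched below. The matrix here is made up for illustration and is not the notebook's result.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows are true labels, columns are predictions.
cm = np.array([[8, 2,  0],
               [1, 9,  0],
               [0, 0, 10]])

# Accuracy is the share of rows on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Row normalisation: each row then sums to 1 and reads as "share of this true class".
row_share = cm / cm.sum(axis=1, keepdims=True)

print(round(float(accuracy), 2))
print(row_share.round(2))
```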

## 5.4. Most important features

We will check the most important features for the identified profiles by using SHAP Values. 

In 2017, two researchers from the University of Washington (Scott Lundberg and Su-In Lee) published SHAP (SHapley Additive exPlanations), a technique for explaining individual model predictions. It includes a fast variant for tree-based models such as XGBoost and LightGBM, which is the one used here through `shap.TreeExplainer`.

We can see that the most important features for the different profiles are:
   - **Age Category for Less Than 35 Years Old**
   - **Machine Learning Experience Level Junior**
   - **Development Experience Level Medium**
   - **Development Experience Level Junior**
   - **Highest Degree Master**
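
As a rough sketch of how such a ranking is obtained for a multi-class model, assume the SHAP values come as one array per class (the layout returned by older `shap` versions; newer ones may return a single 3D array). The feature names and numbers below are made up.

```python
import numpy as np

feature_names = ['Age_lt_35', 'ML_Junior', 'Degree_Master']  # hypothetical names

# One (n_rows, n_features) array per class, filled with toy values.
shap_per_class = [
    np.array([[ 0.4, -0.1,  0.0], [-0.2,  0.3,  0.1]]),
    np.array([[-0.4,  0.1,  0.0], [ 0.2, -0.3, -0.1]]),
]

# Global importance: mean absolute SHAP value per feature, summed over the classes.
importance = sum(np.abs(values).mean(axis=0) for values in shap_per_class)

# Features sorted from most to least important.
ranking = [feature_names[i] for i in np.argsort(importance)[::-1]]
print(ranking)
```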

In [None]:
import shap

# Calculate the SHAP values and plot the summary
v_explainer   = shap.TreeExplainer(v_model)
v_shap_values = v_explainer.shap_values(X_lgb)
shap.summary_plot(v_shap_values, X_lgb, max_display = 20)

## 5.5. Analyze Identified profiles

In [None]:
!pip install pyvis

In [None]:
from pyvis.network import Network
import json

v_options = {
  "nodes": {
    "borderWidth": 3,
    "borderWidthSelected": 5,
    "fixed": {
      "x": True
    }
  },
  "edges": {
    "arrows": {
      "to": {
        "enabled": True
      }
    },
    "color": {
      "inherit": True
    },
    "smooth": False
  },
  "layout": {
    "hierarchical": {
      "enabled": True,
      "treeSpacing": 225,
      "sortMethod": "directed"
    }
  },
  "physics": {
    "hierarchicalRepulsion": {
      "centralGravity": 0
    },
    "minVelocity": 0.75,
    "solver": "hierarchicalRepulsion"
  }
}

v_profilesPath = { 'Profile 1': ['Senior ML', 'Older\nThan 35 years|S'],
                   'Profile 3': ['Medium ML', 'Younger\nThan 35 years|M'],
                   'Profile 9': ['Medium ML', 'Older\nThan 35 years|M'],
                   'Profile 5': ['Junior ML', 'Older\nThan 35 years|J', 'Large\nCompany|J'],
                   'Profile 4': ['Junior ML', 'Older\nThan 35 years|J', 'Small/Medium\nCompany|J'],                   
                   'Profile 8': ['Junior ML', 'Younger\nThan 35 years|J', 'Very Junior\nExp.Dev|J',],
                   'Profile 2': ['Junior ML', 'Younger\nThan 35 years|J', 'Junior\nExp.Dev|J', 'Bachelor\nDegree|J'],
                   'Profile 7': ['Junior ML', 'Younger\nThan 35 years|J', 'Junior\nExp.Dev|J', 'Master\nDegree|J'],
                   'Profile 6': ['Junior ML', 'Younger\nThan 35 years|J', 'Medium\nExp.Dev|J',], }
v_group = { key: idx + 2 for idx, key in enumerate(list(v_profilesPath.keys())) }

v_network_S = Network(notebook = True, height = '500px', width = '600px')
v_network_J = Network(notebook = True, height = '780px', width = '1000px')

v_network_S.add_node(2, label = 'Senior ML', size = 25, color = 'red')
v_network_S.add_node(3, label = 'Medium ML', size = 25, color = 'red')
v_network_J.add_node(4, label = 'Junior ML', size = 25, color = 'red')

v_nodes = { 'Experience ML': 1, 'Senior ML': 2, 'Medium ML': 3, 'Junior ML': 4 }
for key, paths in v_profilesPath.items():
    v_network = (      v_network_S if paths[0] == 'Senior ML'
                  else v_network_S if paths[0] == 'Medium ML'
                  else v_network_J )
    v_lastNode = paths[0]
    for node in paths:
        if not node in v_nodes.keys(): 
            v_nodes[node] = len(v_nodes) + 1
            v_network.add_node(v_nodes[node], label = (node if not '|' in node else node.split('|')[0]), size = 6, group = v_group[key])
            v_network.add_edge(v_nodes[v_lastNode], v_nodes[node], weight = 3.5)
        v_lastNode = node
    
    v_nodes[key] = len(v_nodes) + 1
    v_network.add_node(v_nodes[key], label = key, color = 'blue', size = 15)
    v_network.add_edge(v_nodes[v_lastNode], v_nodes[key])

### 5.5.1 Main Profiles Characteristics

For the respondents of the Kaggle Survey that:
   - have a **<font color = 'blue'>University Degree</font>**
   - **<font color = 'blue'>know whether or not the Company has Machine Learning Implemented</font>**
   - have **<font color = 'blue'>Experience with Machine Learning</font>**
   - have a **<font color = 'blue'>Declared Gender</font>**
   - are **<font color = 'blue'>less than 60 years old</font>**
   - have **<font color = 'blue'>Declared Their Salary</font>**

we have identified **<font color = 'blue'>9 Profiles</font>** with the following characteristics:

   * 1 - Professionals that have **<font color = '#ba5618'>Higher Experience with Machine Learning</font>** (**3 Profiles**)
        - One profile for Professionals:
          - **<font color = '#ba5618'>Senior Level with Machine Learning</font>** (**7.5 times** more than in the average profile)
          - **<font color = '#ba5618'>Older than 35 years</font>** (**2.2 times** more than in the average profile)
          - Senior Level in Development (**4 times** more than in the average profile)
          - Usually not working for Small Companies (50% less than in the average profile)
          - Usually not living in Asia (30% less than in the average profile)
          - Usually not living in India (70% less than in the average profile)
          - Usually not doing Machine Learning Exploring (50% less than in the average profile)
        - One profile for Professionals:
          - **<font color = '#ba5618'>Medium Level with Machine Learning</font>** (**5 times** more than in the average profile)
          - **<font color = '#ba5618'>Younger than 35 years</font>** (**1.7 times** more than in the average profile)
          - Medium Level in Development (**2.3 times** more than in the average profile)
          - 43% of them are living in Europe or USA (1.3 times more than in the average profile)
          - 21% of them have a Doctoral Degree or higher (1.2 times more than in the average profile)
          - 22% of them have the role title ML Engineer (1.5 times more than in the average profile)
        - One profile for Professionals:
          - **<font color = '#ba5618'>Medium Level with Machine Learning</font>** (**5.2 times** more than in the average profile)
          - **<font color = '#ba5618'>Older than 35 years</font>** (**2.5 times** more than in the average profile)
          - Medium Level in Development (**2.3 times** more than in the average profile)
          - Usually not working for Companies that don't have Machine Learning Implemented (40% less than in the average profile)
          - 50% of them are living in Europe or USA (1.4 times more than in the average profile)
          - Usually not living in India (40% less than in the average profile)

In [None]:
v_network_S.set_options(json.dumps(v_options))
v_network_S.show('networkS.html')

  * 2 - Professionals that have **<font color = '#ba5618'>Junior Experience with Machine Learning</font>** (**6 Profiles**)
      - 2.1. Professionals that are **<font color = '#ba5618'>Older than 35 years</font>** (**2 Profiles**)
        - One profile for Professionals:
          - **<font color = '#ba5618'>Not Working for Large Companies</font>** 
          - 21% of them are working for companies without a Data Science Team (**1.8 times** more than in the average profile)
          - 13% of them have the role title IT Other (**1.8 times** more than in the average profile)
          - 18% of them have the role title Other (**1.8 times** more than in the average profile)
          - Usually not having the role title Data Scientist (50% less than in the average profile)
        - One profile for Professionals:
          - **<font color = '#ba5618'>Working for Large Companies</font>** (**2.4 times** more than in the average profile)
          - 20% of them have the role title IT Other (**2.6 times** more than in the average profile)
          - 16% of them have the role title Other (**1.6 times** more than in the average profile)
          - Usually not having the role title Research (45% less than in the average profile)
          - Usually not having the role title Data Scientist (50% less than in the average profile)
          - Usually not working in Small Data Science Teams (40% less than in the average profile)
      - 2.2. Professionals that are **<font color = '#ba5618'>Younger than 35 years</font>** (**4 Profiles**)
        - One profile for Professionals:
          - **<font color = '#ba5618'>Junior Level in Development</font>** (**3.6 times** more than in the average profile)
          - **<font color = '#ba5618'>Bachelor Degree</font>** (**3 times** more than in the average profile)
          - 23% of them have the role title Analyst (**1.3 times** more than in the average profile)
          - 14% of them are living in Africa (**1.9 times** more than in the average profile)
          - Usually not living in Europe (60% less than in the average profile)
        - One profile for Professionals:
          - **<font color = '#ba5618'>Junior Level in Development</font>** (**3.6 times** more than in the average profile)
          - **<font color = '#ba5618'>Master Degree</font>** (**1.9 times** more than in the average profile)
          - 24% of them have the role title Analyst (**1.4 times** more than in the average profile)
          - 19% of them are Female (**1.4 times** more than in the average profile)
        - One profile for Professionals:
          - **<font color = '#ba5618'>Medium Level in Development</font>** (**2.7 times** more than in the average profile)
          - 23% of them have the role title Software Engineer (**1.8 times** more than in the average profile)
          - 18% of them have the role title ML Engineer (**1.2 times** more than in the average profile)
        - One profile for Professionals:
          - **<font color = '#ba5618'>Very Junior Level in Development</font>** (**7.2 times** more than in the average profile)
          - Very Junior Level with Machine Learning (**3 times** more than in the average profile)
          - 24% of them are working for companies without a Data Science Team (**2 times** more than in the average profile)
          - 21% of them are Female (**1.5 times** more than in the average profile)
          - 14% of them are living in Africa (**1.8 times** more than in the average profile)
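
The "N times more than in the average profile" figures above are ratios between a feature's mean inside one profile and its mean over all respondents. A minimal sketch of that calculation on made-up data (the frame `X_toy`, the labels `y_toy` and the feature names are hypothetical, not the notebook's `X_lgb` / `y_lgb`):

```python
import pandas as pd

# Hypothetical one-hot features for six respondents (not the survey data).
X_toy = pd.DataFrame({
    'Bachelor_Degree': [1, 1, 1, 0, 0, 0],
    'Lives_In_Europe': [0, 0, 1, 1, 1, 1],
})
# Hypothetical cluster (profile) label per respondent.
y_toy = pd.Series([0, 0, 0, 1, 1, 1], name='Cluster')

overall_mean = X_toy.mean()                 # share of each feature over everyone
cluster_mean = X_toy.groupby(y_toy).mean()  # share of each feature per profile
lift = cluster_mean / overall_mean          # > 1: over-represented, < 1: under-represented

print(lift.round(2))
```

A lift of 2.0 reads as "2 times more than in the average profile", and a lift of 0.5 reads as "50% less than in the average profile".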

In [None]:
v_network_J.set_options(json.dumps(v_options))
v_network_J.show('networkJ.html')

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

v_profileName = {}
v_avgFeat  = X_lgb.describe().T[['mean']]
v_clusters = []
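# For each cluster, keep only the features that are mostly present (mean > 0.7) or mostly absent (mean < 0.3)
for cluster in sorted(y_lgb.unique()):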
    v_data = X_lgb.copy()
    v_data['Cluster'] = y_lgb
    v_dataC = v_data[ v_data['Cluster'] == cluster ] 
    v_data = v_dataC.drop('Cluster', axis = 1).describe().T[['mean']]
    v_dataC = np.round(v_dataC.shape[0] / X_lgb.shape[0] * 100, 2)
    v_profileName[cluster] = v_profilesPath[f'Profile {cluster + 1}'][0] + f'\nProfile {cluster + 1}\n({v_dataC}%)'
    v_clusters.append(v_data[   (v_data['mean'] > 0.7)
                              | (v_data['mean'] < 0.3) ].rename(columns = {'mean': v_profileName[cluster]}))

def maskImportant(p_data):
    p_data = p_data.T
    
    v_mask = p_data.copy().fillna(1)
    for col in v_mask.columns:
        v_mask[col] = v_mask[col].apply(lambda x: 0 if x < 0.1 else x)
    v_mask['Sum'] = v_mask.sum(axis = 1)
    p_data = p_data[ v_mask['Sum'] > 0 ]
    
    v_mask = p_data.copy()
    for col in v_mask.columns:
        v_mask[col] = v_mask[col].apply(lambda x: 0 if np.isnan(x) else 1)
    v_mask['__Rank'] = v_mask.sum(axis = 1)
    
    for col in p_data.columns:
        p_data[col] = p_data[col].apply(lambda x: 0 if x < 0.05 else x)
    p_data['__Sum']  = p_data.fillna(0).sum(axis = 1)
    v_idx = p_data[ p_data['__Sum'] == 0 ].index
    p_data['__Rank'] = v_mask['__Rank']        
    p_data.loc[v_idx, '__Rank'] = p_data['__Rank'].max() + 1
    
    v_mask = v_mask.sort_values('__Rank', ascending = False)
    v_mask = v_mask[ v_mask['__Rank'] > 0 ].index.values
    p_data = p_data.loc[v_mask, :]

    return p_data.sort_values(['__Rank', '__Sum'], ascending = [False, False]).drop(['__Rank', '__Sum'], axis = 1)

def plotProfile(p_data, p_top = 100, p_avgFeat = v_avgFeat):
    v_data = p_data.reset_index().merge(p_avgFeat.reset_index(), how = 'inner', on = 'index').set_index('index')
    for col in p_data.columns:
        v_data[col] = np.round(v_data[col] / v_data['mean'], 3)
        v_data[col] = v_data[col].replace({0: np.nan})
    v_data = v_data[['mean'] + p_data.columns.tolist()]
        
    v_figSize = ( (4 * p_data.shape[1]) if p_data.shape[1] > 1 else 6, 0.4 * p_data.shape[0] )        
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize = v_figSize)
    sns.heatmap(np.round(p_data.head(p_top), 2), annot = True, linewidths = .5, cmap = "Blues", ax = ax1)
    sns.heatmap(np.round(v_data.head(p_top), 2), annot = True, linewidths = .5, cmap = "Reds", ax = ax2)
    ax1.xaxis.tick_top()
    ax1.title.set_text('Profiles')
    ax2.xaxis.tick_top()
    ax2.yaxis.set_visible(False)
    ax2.title.set_text('Difference with AVG')
    plt.subplots_adjust(left = 0.1,bottom = 0.1, right = 0.9, top = 0.9, wspace = 0.4, hspace = 0.4)
    plt.show()
    return
    
v_clusters = maskImportant(pd.concat(v_clusters, axis = 1).T)
v_clusters_F = v_clusters.T

for item in [ 'ML_years__Categ_Not_Junior',
              'Age__Categ_More Than 35 Years Old',
              'Dev_Experience_2__Level_Junior' ]:
    plotProfile(maskImportant(v_clusters_F[ np.round(v_clusters_F[item], 0) == 1 ]))
    v_clusters_F = maskImportant(v_clusters_F[ np.round(v_clusters_F[item], 0) != 1 ]).T

plotProfile(v_clusters_F.T)

### 5.5.2 Main characteristics distribution

Findings:
   - **<font color = 'blue'>70%</font>** of professionals that have a **Senior Level for Machine Learning** are found in **Profile 1**
   - **<font color = 'blue'>68%</font>** of professionals that have a **Very Junior Development Experience** are found in **Profile 8**
   - **<font color = 'blue'>80%</font>** of professionals that have a **Junior Development Experience** are found in **Profile 2** and **Profile 7**
   - **<font color = 'blue'>97%</font>** of professionals that have a **Medium Level for Machine Learning** are found in **Profile 3** and **Profile 9**
   - **<font color = 'blue'>90%</font>** of professionals that have a **Senior Development Experience** are found in **Profile 4**, **Profile 5**, **Profile 9** and **Profile 1**
   - **Profile 4** and **Profile 5** correspond to a **Junior Level for Machine Learning**, which suggests that the professionals in these 2 profiles have changed their career path from IT Development to ML Development
   - **<font color = 'blue'>50%</font>** of professionals that have the role title **IT Other** are found in **Profile 4** and **Profile 5**
   - the professionals that have **Medium / Senior Level for Machine Learning** also have **Medium / Senior Development Experience**
   - **<font color = 'blue'>88%</font>** of professionals from Africa are **Junior Level for Machine Learning**
   - **<font color = 'blue'>80%</font>** of professionals from India are **Junior Level for Machine Learning**
   - **<font color = 'blue'>70%</font>** of professionals from South America are **Junior Level for Machine Learning**
   - **<font color = 'blue'>60%</font>** of professionals that have a **Doctoral Degree+** also have a **Medium / Senior Level for Machine Learning**
   - **<font color = 'blue'>only 22%</font>** of companies having a **Very Small Data Science Team** will have professionals with a **Medium / Senior Level for Machine Learning**
   - **<font color = 'blue'>only 12%</font>** of professionals that have a **Bachelor Degree** also have a **Medium / Senior Level for Machine Learning**
   - **<font color = 'blue'>only 17%</font>** of professionals that have the role title **Software Engineer** also have a **Medium / Senior Level for Machine Learning**
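
Each percentage above answers the question "of all respondents with this characteristic, what share falls into each profile?". A small sketch of that calculation, assuming a one-hot feature table and one cluster label per respondent (all names and values below are illustrative, not taken from the survey):

```python
import pandas as pd

# Illustrative one-hot features and profile labels (not the survey data).
features = pd.DataFrame({
    'ML_Senior':  [1, 1, 1, 0, 0, 1, 0, 0],
    'Dev_Junior': [0, 0, 0, 1, 1, 0, 1, 1],
})
profile = pd.Series(['Profile 1', 'Profile 1', 'Profile 1', 'Profile 2',
                     'Profile 2', 'Profile 2', 'Profile 2', 'Profile 1'],
                    name='Cluster')

# Count the holders of each feature per profile, then divide by the
# total number of holders so that every column sums to 100.
holders_per_profile = features.groupby(profile).sum()
share = holders_per_profile / features.sum() * 100

print(share.round(0))
```

Because each column sums to 100, a single large value in a column means that the characteristic is concentrated in one profile.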

In [None]:
X_cluster = pd.concat([y_lgb, X_lgb], axis = 1)
X_cluster['Cluster'] = X_cluster['Cluster'].apply(lambda x: v_profileName[x])
v_data = []
# For each feature, compute the percentage of its total that falls into each cluster
for col in X_lgb.columns:
    v_tmp = X_cluster[ X_cluster[col] != 0 ]
    v_tmp = v_tmp.groupby('Cluster')[[col]].sum() / v_tmp[col].sum() * 100
    v_data.append(v_tmp)
v_data = pd.concat(v_data, axis = 1).T.fillna(0)
v_cols = v_data.columns
v_data['__Min']  = v_data[v_cols].min(axis = 1)
v_data['__Max']  = v_data[v_cols].max(axis = 1)
v_data['__Diff'] = v_data['__Max'] - v_data['__Min']
v_data = ( v_data[ v_data['__Max'] > 20 ]
              .sort_values('__Diff', ascending = False)
              .drop(['__Min', '__Max'], axis = 1) 
              .rename(columns = {'__Diff': 'Difference\nBetween\nMin and Max'}) )
v_data['__Total']  = v_data[v_cols].sum(axis = 1)
fig, ax = plt.subplots(figsize = ( 1.3 * v_data.shape[1], 0.5 * v_data.shape[0] ))
sns.heatmap( np.round(v_data, 2), annot = True, linewidths = .5, cmap = "viridis_r", fmt = '.0f', vmax = 40, ax = ax )
ax.xaxis.tick_top()
plt.show()

print('\n')
v_data.drop(v_data.columns.tolist()[-2:], axis = 1, inplace = True)
for col in v_data.columns:
    v_perc = float(col.split('\n')[2].split('%')[0][1:])
    v_data[col] = v_data[col] / v_perc
    v_data[col] = v_data[col].apply(lambda x: np.nan if x < 0.02 else x)

v_colors  = sns.color_palette("Reds_r")[:2] + sns.color_palette("Greys")[:1] * 4 + sns.color_palette("Blues")[-2:]
fig, ax = plt.subplots(figsize = ( 1.3 * v_data.shape[1], 0.5 * v_data.shape[0] ))
sns.heatmap( np.round(v_data, 2), annot = True, linewidths = .5, cmap = v_colors, fmt = '.2f', vmin = 0.5, vmax = 2, ax = ax )
ax.xaxis.tick_top()
plt.title('Rate Difference with expected value')
plt.show()

### 5.5.3 **Role Title** and **Residence Country**

Findings:
   - Role Titles **Data Scientist** and **Engineer (Software / Data / ML)** are the most common ones with an average value of **28%**
   - Role Title **Data Analyst** is mostly composed of professionals that have a **Junior Level for Machine Learning**
   - **Profile 2** and **Profile 8** have around **40%** of the respondents based in India (70% above average value)
   - **Profile 4** and **Profile 5** have only **25%** of their respondents having a role title **Data Scientist** or **Data Engineer**
   - **Medium / Senior Level for Machine Learning** profiles have around **40%** of the respondents having the Role Title **Data Scientist** compared to **Profile 4** and **Profile 5** where it drops to **14%** (50% below average value)
   - **Medium / Senior Level for Machine Learning** profiles have around **50%** of the respondents based in Europe (50% above average value) and USA (more than 70% above average value)
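
The "above / below average value" figures compare the share of a category inside a profile with its share over all respondents. A hedged sketch with `pandas.crosstab`, using invented column names and values rather than the survey data:

```python
import pandas as pd

# Invented sample: one row per respondent.
df = pd.DataFrame({
    'Cluster': ['Profile 2'] * 5 + ['Profile 9'] * 5,
    'Country': ['India', 'India', 'India', 'India', 'USA',
                'India', 'USA', 'USA', 'USA', 'USA'],
})

# Share (in %) of each country inside each profile; every row sums to 100.
per_profile = pd.crosstab(df['Cluster'], df['Country'], normalize='index') * 100
# Share (in %) of each country over all respondents.
average = df['Country'].value_counts(normalize=True) * 100

# Relative difference: 0.5 means "50% above average", -0.5 means "50% below".
relative_diff = per_profile / average - 1

print(relative_diff.round(2))
```

The division aligns the columns of `per_profile` with the index of `average` by country name, so the order of the categories does not matter.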

In [None]:
def layoutPerProfile( p_data, p_cluster, p_cols = [('N/A', 'N/A')], p_total = False, p_profileName = v_profileName, p_title = None, p_debug = False,
                      p_groupValues = False, p_minGroup = 12 ):    
    v_dataA = []
    v_valuesU_All = []
    v_valuesAvg_All = []
    v_return = {}
    v_mergedDS = p_cluster.reset_index().merge(p_data.reset_index(), how = 'inner', on = 'index')
    if p_debug: print(v_mergedDS.shape)
    v_none = '*** None'
    for item in p_cols:
        # Select the merged data for the column of interest
        v_col  = item[0]
        v_data = v_mergedDS[['index', 'Cluster', v_col]].copy()
        v_data[v_col] = v_data[v_col].fillna(v_none).replace({'None': v_none, '': v_none})
        v_data[v_col] = v_data[v_col].apply(lambda x: v_none if x.strip() in ['No / None', ''] else x)
        v_data[f'{v_col}__L'] = v_data[v_col].apply(lambda x: x.split('__')).apply(lambda x: [v_none] if len(x) == 0 else x)
        
        # Extract all the possible values for the given feature
        v_valuesU = []
        for value in pd.Series(v_data[v_col].unique()).apply(lambda x: x.split('__')).tolist():
            v_valuesU += value
        v_values = list(sorted(set(v_valuesU)))
                
        #----------------------------------------------------------------------------------------------------
        # Create the columns based on the extracted features
        v_valuesU = v_data[['index', 'Cluster']].copy()
        for value in v_values:
            v_valuesU[value.strip()] = v_data[f'{v_col}__L'].apply(lambda x: value in x).astype(int)            
        v_valuesU = v_valuesU.set_index(['index', 'Cluster']).stack().reset_index()
        v_groupedCols    = {}
        v_groupedColsAll = []
        if p_groupValues:
            v_valuesU2 = v_valuesU.pivot_table( index   = ['index', 'Cluster'],
                                                columns = 'level_2',
                                                values  = 0 )
            v_cols   = [col for col in v_valuesU2.columns if col == v_none]
            v_values = v_valuesU2.drop(v_cols, axis = 1).sum().to_dict()
            v_values = sorted(set([value.split(' ')[0] for value in v_values.keys() if len(value.split(' ')[0]) > 4
                                                                                   and value.split(' ')[0] not in v_values.keys()]))
            for value in v_values:
                v_cols = [col for col in v_valuesU2.columns if col.split(' ')[0] == value]
                if len(v_cols) > 1:
                    v_col = f'Grouped__{value}'
                    v_valuesU2[v_col] = v_valuesU2[v_cols].max(axis = 1)
                    v_groupedCols[v_col] = v_cols
                    
            if item[0] == 'BigData_Product':
                v_col = f'Grouped___NotCloud'
                v_groupedCols[v_col] = ['Microsoft SQL Server', 'MongoDB', 'MySQL', 'Oracle Database', 'PostgreSQL', 'SQLite', 'IBM Db2',]
                v_valuesU2[v_col] = v_valuesU2[v_groupedCols[v_col]].max(axis = 1)
                
                v_col2 = f'Grouped___Cloud'
                v_groupedCols[v_col2] = [value for value in v_valuesU['level_2'].unique() if value not in v_groupedCols[v_col] + [v_none, 'Other',]]
                v_valuesU2[v_col2] = v_valuesU2[v_groupedCols[v_col2]].max(axis = 1)                
                    
            v_values = v_valuesU2[list(v_groupedCols.keys())].sum() / v_valuesU2.shape[0] * 100
            v_values = v_values[ v_values < 10 ]
            v_groupedCols = {key: value for key, value in v_groupedCols.items() if key not in v_values.index.tolist()}
            v_valuesU2 = v_valuesU2[list(v_groupedCols.keys())].stack().reset_index()
            v_valuesU  = pd.concat([v_valuesU, v_valuesU2])               
            for value in v_groupedCols.values():
                v_groupedColsAll += value
        v_valuesU['level_2'] = v_valuesU['level_2'].apply(lambda x: f'{item[1]} - {x}')
        if p_debug: print('Total number of cases - Step 01:', v_valuesU[[0]].sum().sum(), v_valuesU2[[0]].sum().sum() if p_groupValues else 'N/A')
                    
        if item[0] == 'Role_Title':
            v_valuesU['level_2'] = v_valuesU['level_2'].apply(lambda x: 'Engineer' if 'Engineer' in x else x)
            
        #----------------------------------------------------------------------------------------------------
        # Group values into Other(s) if below a certain threshold
        v_valuesAvg = v_valuesU.groupby('level_2')[[0]].sum().sort_values(0, ascending = False).reset_index().reset_index()
        v_valuesAvg['Avg'] = v_valuesAvg[0] / v_data.shape[0] * 100
        v_valuesAvg = v_valuesAvg[ ~(   ( v_valuesAvg['index'] < 10 )
                                      & ( v_valuesAvg['Avg'] > p_minGroup ) ) ]['level_2'].tolist()
        v_valuesAvg = {value: f'_ {item[1]} - Other(s)' for value in v_valuesAvg if value not in list(v_groupedCols.keys())}   
        v_groupedColsAll = [value for value in v_valuesAvg.keys() if value in v_groupedColsAll]
        v_valuesU = v_valuesU[ ~v_valuesU.isin(v_groupedColsAll) ]
        v_valuesU['level_2'] = v_valuesU['level_2'].replace(v_valuesAvg)        
        v_return[v_col] = [value.replace(f'{item[1]} - ', '') for value in v_valuesU['level_2'].unique()]        

        #----------------------------------------------------------------------------------------------------
        # Aggregate the data
        v_valuesU = v_valuesU.groupby(['index', 'Cluster', 'level_2'])[[0]].max().reset_index()        
        v_valuesU = v_valuesU.groupby(['Cluster', 'level_2']).agg({0: ['sum', 'count']}).reset_index()
        v_valuesU.columns = ['Cluster', v_col, 'Count', 'Cluster_No'] 
        v_clusterNo = v_valuesU[['Cluster', 'Cluster_No']].drop_duplicates()
        v_clusterNo['Cluster_No'] = np.round(v_clusterNo['Cluster_No'] / v_mergedDS.shape[0] * 100, 2)                 
        v_valuesU = v_valuesU.pivot_table( index   = ['Cluster', 'Cluster_No'],
                                           columns = v_col,
                                           values  = 'Count',
                                           aggfunc = 'sum' )
        v_valuesU.columns = v_valuesU.columns.tolist()
        v_valuesAvg = v_valuesU.sum().reset_index()
        v_valuesAvg['Avg'] = v_valuesAvg[0] / v_data.shape[0] * 100
        v_valuesAvg = v_valuesAvg.set_index('index').sort_values('Avg', ascending = False)
            
        #----------------------------------------------------------------------------------------------------         
        if p_total: 
            v_valuesU['__Total'] = v_valuesU.sum(axis = 1)        
            v_valuesAvg.loc['__Total', 'Avg'] = 100
        
        #----------------------------------------------------------------------------------------------------         
        v_valuesAvg_All.append(v_valuesAvg)
        if p_debug: print('Total number of cases - Step 02:', v_valuesU.sum().sum())
        
        #----------------------------------------------------------------------------------------------------         
        v_valuesU = v_valuesU.reset_index().set_index('Cluster')
        for col in v_valuesU.drop('Cluster_No', axis = 1).columns:
            v_valuesU[col] = v_valuesU[col] / v_valuesU['Cluster_No'] * 100
        
        #----------------------------------------------------------------------------------------------------
        v_valuesU = v_valuesU.drop('Cluster_No', axis = 1).T
        v_valuesU_All.append(v_valuesU)
    
    #----------------------------------------------------------------------------------------------------
    v_valuesAvg_All = pd.concat(v_valuesAvg_All)
    if p_debug:
        print('Total number of cases - Step 03:', p_data.shape, p_cluster.shape, v_mergedDS.shape)
        display(v_clusterNo) 
    v_clusterNo = v_clusterNo.set_index('Cluster').to_dict(orient = 'index')
    v_profileName = { key: value.split('\n') for key, value in p_profileName.items() }
    v_profileName = { key: [ ( 'Ju.' if value[0].strip() == 'Junior ML' else 'Me.' if value[0].strip() == 'Medium ML' else 'Se.' ), 
                                value[1], f"({v_clusterNo[key]['Cluster_No']}%)" ] for key, value in v_profileName.items() }
    v_profileName = { key: '\n'.join(value) for key, value in v_profileName.items() }
    v_valuesU_All = pd.concat(v_valuesU_All)
    v_valuesU_All.columns = [v_profileName[col] for col in v_valuesU_All.columns]
    v_valuesU_All = v_valuesU_All[sorted(list(v_valuesU_All.columns))]
    
    #----------------------------------------------------------------------------------------------------
    v_sort = v_valuesU_All.index.tolist()
    v_data_01 = v_valuesU_All.copy()   
    v_data_02 = pd.concat([v_valuesAvg_All, v_valuesU_All], axis = 1)
    for col in v_valuesU_All.columns:
        v_data_02[col] = np.round(v_data_02[col] / v_data_02['Avg'] - 1, 2)
    v_data_02 = v_data_02[v_valuesU_All.columns]

    v_figSize = (20, 0.6 * v_data_01.shape[0])
    v_colors  = sns.color_palette("Reds_r")[:2] + sns.color_palette("Greys")[:1] * 4 + sns.color_palette("Blues")[-2:]
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize = v_figSize, gridspec_kw={'width_ratios': [9, 1, 9]})
    sns.heatmap(np.round(v_data_01.loc[v_sort, :], 2), annot = True, fmt = '.0f', linewidths = .5, cmap = "viridis_r", vmax = 75, ax = ax1)
    sns.heatmap(np.round(v_valuesAvg_All.loc[v_sort, ['Avg']], 2), annot = True, fmt = '.0f', linewidths = .5, cmap = "viridis_r", vmax = 75, ax = ax2, cbar=False)
    sns.heatmap(np.round(v_data_02.loc[v_sort, :], 2), annot = True, linewidths = .5, cmap = v_colors, vmin = -0.5, vmax = 0.5, ax = ax3)
    ax1.xaxis.tick_top()
    ax1.title.set_text('Profiles')
    ax2.xaxis.tick_top()
    ax2.yaxis.set_visible(False)
    ax2.xaxis.set_visible(False)
    ax2.title.set_text('Average')
    ax3.yaxis.set_visible(False)
    ax3.xaxis.tick_top()
    ax3.title.set_text('Difference with Average')
    plt.subplots_adjust(left = 0.1,bottom = 0.1, right = 0.9, top = 0.9 if p_title is None else 0.8, wspace = 0.1, hspace = 0.1)    
    plt.suptitle(p_title)  
    plt.show()
    
    return [ v_return, v_valuesAvg_All, v_data_01 ]

v_profileChar = {}
v_profileChar['Role_Title'] = layoutPerProfile(X_baseData, y_lgb, [('Role_Title', 'Role Title')], p_total = True, p_minGroup = 10 )
v_profileChar['Residence_Country'] = layoutPerProfile( X_professionals, y_lgb, [('Residence_Country__Filtered', 'Residence Country')], 
                                                       p_total = True, p_minGroup = 7 )

X_professionalsF = X_baseData[['Role_Title']].reset_index().merge(X_professionals.reset_index(), how = 'inner', on = 'index').set_index('index')
X_professionalsF['Role_Title__Grouped'] = X_professionalsF['Role_Title'].apply(lambda x: x if x in v_profileChar['Role_Title'][0]['Role_Title']
                                                                                     else 'Engineer' if 'Engineer' in x
                                                                                     else '_Other(s)' )
v_profileChar['Role_Title'][2] = v_profileChar['Role_Title'][2].drop('__Total')
v_profileChar['Residence_Country'][2] = v_profileChar['Residence_Country'][2].drop('__Total')

### 5.5.3 **BigData Product** and **BI Tools**

Findings:

   - **<font color = 'blue'>52%</font>** of **Data Analysts** are using **BI Tools**
   - **Data Analysts** who have a **Senior Level for Machine Learning** are less likely to use **BI Tools** (the share drops to **35%**)
   - **56%** of **Data Analysts** are using **BigData Products**
   - **7%** of **Data Analysts** are using **BigData Cloud Related Solutions**
   <br>
   
   - **<font color = 'blue'>63%</font>** of **Data Scientists** are using **BigData Products**
   - **<font color = 'blue'>33%</font>** of **Data Scientists** are using **BigData Cloud Related Solutions**
   - **Data Scientists** who have a **Medium / Senior Level for Machine Learning** are more likely to use **BigData Cloud Related Solutions**
   <br>
   
   - **56%** of **Engineers** are using **BigData Products**
   - **7%** of **Engineers** are using **BigData Cloud Related Solutions**
   - **<font color = 'blue'>34%</font>** of **Engineers** are using **BI Tools**
   <br>
   
   - **40%** of other role titles will use **BI Tools**
   - **<font color = 'blue'>55%</font>** of respondents with other role titles and a **Medium / Senior Level for Machine Learning** use **BigData Products**, but only **5%** use **BigData Cloud Related Solutions**
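
Questions such as **BI Tools** allow several answers per respondent, so usage rates need one indicator column per answer. A minimal sketch, assuming the answers are stored as a single string joined with `__` (the separator that the helper code in this notebook splits on). The sample rows and column names are made up, and `Series.str.get_dummies` is used here as a compact alternative, not as the notebook's own method:

```python
import pandas as pd

# Made-up respondents; multiple answers are joined with '__'.
df = pd.DataFrame({
    'Role':    ['Data Analyst', 'Data Analyst', 'Data Scientist', 'Data Scientist'],
    'BI_Tool': ['Tableau__Power BI', 'None', 'Tableau', 'None'],
})

# One indicator column per distinct answer.
indicators = df['BI_Tool'].str.get_dummies(sep='__')

# A respondent "uses BI tools" if any answer other than 'None' is selected.
df['Uses_BI'] = indicators.drop(columns='None').max(axis=1).astype(int)

# Usage rate (in %) per role.
usage = df.groupby('Role')['Uses_BI'].mean() * 100

print(usage)
```

Taking the row-wise maximum counts a respondent once, no matter how many tools they selected, which is what a "% of role using BI Tools" figure requires.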

In [None]:
def layoutPerProfileRole(X_professionalsF, y_lgb, p_cols, p_minGroup, p_debug = False, p_groupValues = True):
    v_data_01, v_data_02 = [], []
    for role in sorted(X_professionalsF['Role_Title__Grouped'].unique()):
        print('\n***********************************************************************************')
        print(role)
        _, v_tmp_01, v_tmp_02 = layoutPerProfile( X_professionalsF[ X_professionalsF['Role_Title__Grouped'] == role ], y_lgb, p_cols, 
                                                  p_debug = p_debug, p_groupValues = p_groupValues, p_minGroup = p_minGroup )
        v_tmp_01['Role'] = role
        v_tmp_02['Role'] = role
        v_data_01.append(v_tmp_01)
        v_data_02.append(v_tmp_02)
        
    v_data_01 = pd.concat(v_data_01)
    v_data_02 = pd.concat(v_data_02)
    v_dataG = v_data_01.reset_index()
    v_dataG = v_dataG[ v_dataG['Avg'] > 30 ]
    v_dataG = v_dataG.pivot_table( index   = 'index',
                                   columns = 'Role',
                                   values  = 'Avg')
    v_dataG.index = v_dataG.index.tolist()
    v_figSize = (10, 0.6 * v_dataG.shape[0])
    fig, ax = plt.subplots(figsize = v_figSize)
    sns.heatmap(np.round(v_dataG, 2), annot = True, fmt = '.0f', linewidths = .5, cmap = "viridis_r", vmin = 50, vmax = 75, ax = ax)    
    ax.xaxis.tick_top()
    ax.title.set_text('Average Usage by Role - minimum average = 30%')
    plt.show()
    
    return [None, v_data_01, v_data_02]
    
v_profileChar['BigData_BI'] = layoutPerProfileRole(X_professionalsF, y_lgb, [ ('BigData_Product', 'BigData_Product'),
                                                                              ('BigData_Product_Most_Used', 'BigData_Product_Most_Used'),
                                                                              ('BI_Tool', 'BI_Tool') ], p_minGroup = 30 )

### 5.5.4 **Cloud Computing Platform**

Findings:

   - **<font color = 'blue'>50%</font>** of **Data Analysts** are using a **Computing Platform**
   - **<font color = 'blue'>70%</font>** of **Data Analysts** are using a **Laptop**
   - **Data Analysts** who have a **Medium / Senior Level for Machine Learning** are more likely to use a **Computing Platform**
   - **<font color = 'blue'>72%</font>** of **Data Analysts** are **not** using a **Cloud Computing Product**
   - **<font color = 'blue'>69%</font>** of **Data Analysts** are **not** using **Data Storage Products**   
   <br>

   - **<font color = 'blue'>64%</font>** of **Data Scientists** are using a **Computing Platform**
   - **<font color = 'blue'>39%</font>** of **Data Scientists** are using **AWS**
   - **<font color = 'blue'>28%</font>** of **Data Scientists** are using **GCP**
   - **<font color = 'blue'>56%</font>** of **Data Scientists** are using a **Laptop**
   - **Data Scientists** who have a **Medium / Senior Level for Machine Learning** are more likely to use **Data Storage Products** and **Cloud Computing**
   - **Data Scientists** who have a **Very Junior Level for Machine Learning** are less likely to use **Data Storage Products** and **Cloud Computing** (**78%** of them use a **Laptop**, **58% don't** use a **Computing Platform**, **75% don't** use **Cloud Computing Products** and **75%** don't use **Data Storage Products**)
   - **<font color = 'blue'>48%</font>** of **Data Scientists** are using **Data Storage Products**
   - **<font color = 'blue'>29%</font>** of **Data Scientists** are using **AWS - S3**
   <br>

   - **<font color = 'blue'>56%</font>** of **Engineers** are using a **Computing Platform**
   - **<font color = 'blue'>33%</font>** of **Engineers** are using **AWS**
   - **<font color = 'blue'>27%</font>** of **Engineers** are using **GCP**
   - **<font color = 'blue'>55%</font>** of **Engineers** are using a **Laptop**
   - **Engineers** who have a **Medium / Senior Level for Machine Learning** are more likely to use **Data Storage Products** and **Cloud Computing**
   - **Engineers** who have a **Very Junior Level for Machine Learning** are less likely to use **Data Storage Products** and **Cloud Computing** (**75%** of them use a **Laptop**, **51% don't** use a **Computing Platform**, **66% don't** use **Cloud Computing Products** and **68%** don't use **Data Storage Products**)
   - **<font color = 'blue'>42%</font>** of **Engineers** are using **Data Storage Products**
   <br>
   
   - **61%** of the other roles use a **Laptop**, **55% don't** use a **Computing Platform**, **73% don't** use **Cloud Computing Products** and **71%** don't use **Data Storage Products**
   - **Medium / Senior Level for Machine Learning** profiles are more likely to use **Data Storage Products** and **Cloud Computing**

In [None]:
v_cols = [ ('Cloud_Computing_Platform',           'Cloud Computing Platform'),
           ('Cloud_Computing_Platform_Most_Used', 'Cloud Computing Platform Most Used'),
           ('Cloud_Computing_Products',           'Cloud Computing Products'),
           ('Data_Storage_Products',              'Data Storage Products') ]
v_profileChar['Cloud'] = layoutPerProfileRole(X_professionalsF, y_lgb, v_cols, p_minGroup = 25 )

### 5.5.5 **ML Managed** and **ML Auto**

Findings:

   - more than **<font color = 'blue'>75%</font>** of the roles are not using **ML Managed** and **ML Auto**, except for the **Data Scientists**
   - **Medium / Senior Level for Machine Learning** roles are more likely to use **ML Managed** and **ML Auto**
   
   - **<font color = 'blue'>35%</font>** of the **Data Scientists** are using **ML Managed**
   - **<font color = 'blue'>33%</font>** of the **Data Scientists** are using **ML Auto**

In [None]:
v_profileChar['ML_Auto'] = layoutPerProfileRole(X_professionalsF, y_lgb, [ ('ML_Managed', 'ML_Managed'),
                                                                           ('ML_Auto_01_Tools', 'ML_Auto_01_Tools'),
                                                                           ('ML_Auto_02_Tools', 'ML_Auto_02_Tools') ], p_minGroup = 30 )

### 5.5.6 **Programming Language** and **IDE**

Findings:

   - more than **<font color = 'blue'>80%</font>** of the respondents are using **<font color = 'blue'>Jupyter Environments</font>** and **<font color = 'blue'>Python</font>**
   - **<font color = 'blue'>SQL</font>** is used by **Data Scientists**, **Data Analysts** and **Engineers** 
   - **<font color = 'blue'>35%</font>** of **Data Scientists** and **Data Analysts** are using **<font color = 'blue'>R</font>**. The percentage increases to over **<font color = 'blue'>50%</font>** for **Data Scientists** and **Data Analysts** who have a **Medium / Senior Level for Machine Learning**
   - **Data Scientists** and **Data Analysts** who have a **Medium / Senior Level for Machine Learning** are more likely to use **<font color = 'blue'>Other Programming Languages</font>**
   
   <br>
   
   - **<font color = 'blue'>67%</font>** of **Data Analysts** are using **SQL**
   <br>
   
   - **<font color = 'blue'>67%</font>** of **Data Scientists** are using **SQL**
   <br>
   
   - **<font color = 'blue'>76%</font>** of **Engineers** are using **Other Programming Languages** (C, C++, Java, JavaScript, etc)
   - **<font color = 'blue'>64%</font>** of **Engineers** are using **Visual Studio**
   - **<font color = 'blue'>38%</font>** of **Engineers** are using **PyCharm**
   - **<font color = 'blue'>53%</font>** of **Engineers** are using **Visual Studio Code**

In [None]:
v_profileChar['Dev_IDE'] = layoutPerProfileRole(X_professionalsF, y_lgb, [ ('Dev_Programming_Language', 'Programming Language'),
                                                                           ('IDE', 'IDE') ], p_minGroup = 33 )

### 5.5.7 **NLP Algorithm** and **Computer Vision**

Findings:

   - more than **<font color = 'blue'>83%</font>** of **Data Analysts** are not using **NLP Algorithm** and **Computer Vision**
   - **<font color = 'blue'>32%</font>** of **Data Analysts** are using **Basic Statistical Software**
   <br>
   
   - **<font color = 'blue'>30%</font>** of **Data Scientists** and **Engineers** are using **NLP Algorithms**
   - **<font color = 'blue'>34%</font>** of **Data Scientists** and **Engineers** are using **Computer Vision**
   - **<font color = 'blue'>45%</font>** of **Engineers** are using **Computer Vision**
   <br>
   - **Data Scientists** who have a **Very Junior Level for Machine Learning**, and **Junior Level for Machine Learning** respondents who work in **Large Companies**, are less likely to use **NLP Algorithm** and **Computer Vision**. For the other profiles we see quite a similar usage for **<font color = 'blue'>Deep Learning</font>**
   <br>
   
   - **Engineers** who have a **Medium / Senior Level for Machine Learning** are more likely to use **<font color = 'blue'>Deep Learning</font>**

In [None]:
_ = layoutPerProfileRole(X_professionalsF, y_lgb, [ ('NLP_Algorithm', 'NLP_Algorithm'),
                                                ('Computer_Vision_Methods', 'Computer_Vision_Methods'),
                                                ('Most_Used_Analyse_Tool', 'Most_Used_Analyse_Tool') ], p_minGroup = 25 )

### 5.5.8 **Hosted Notebooks**
Findings:

   - respondents who are **Older than 35 years** and have a **Medium / Senior Level for Machine Learning** are less likely to post on **Hosted Notebooks**

In [None]:
_ = layoutPerProfileRole(X_professionalsF, y_lgb, [ ('Hosted_Notebooks', 'Hosted_Notebooks') ], p_minGroup = 10 )

### 5.5.9 **ML Algorithm** and **ML Framework**
Findings:

   - **Data Analysts** are the only role among the respondents with less than **<font color = 'blue'>30%</font>** of respondents using **<font color = 'blue'>Deep Learning</font>**
   - more than **<font color = 'blue'>70%</font>** of respondents are implementing **Linear and Logistic Regression** models and using **Sklearn**
   - **<font color = 'blue'>76%</font>** of **Data Scientists** are implementing **Decision Trees or Random Forest** models
   - **<font color = 'blue'>Tensorflow</font>** is the most used **ML Framework** for **Deep Learning**
   - **Medium / Senior Level for Machine Learning** profiles are more likely to use more complex **ML Algorithms** and **ML Frameworks**

In [None]:
_ = layoutPerProfileRole(X_professionalsF, y_lgb, [ ('ML_Algorithm', 'ML_Algorithm'),
                                                    ('ML_Framework', 'ML_Framework') ], p_minGroup = 25 )

### 5.5.10 **Salary**
Findings:

   - regardless of the Role Title, respondents residing in **<font color = 'blue'>USA</font>** and **<font color = 'blue'>Europe</font>** have the **highest** salaries
   - salaries for respondents from **<font color = 'blue'>Africa</font>** are the **lowest**
   - regardless of the Role Title, respondents with a **Medium / Senior Level for Machine Learning** have the **highest** salaries
   
   <br>
   
   - respondents with the role title **Data Analyst** and a **Medium / Senior Level for Machine Learning**:
       - in **USA** will likely have a salary between **<font color = 'blue'>50k and 200k</font>**
       - in **Europe** will likely have a salary between **<font color = 'blue'>20k and 100k</font>**
       - in **India** will likely have a salary between **<font color = 'blue'>1k and 10k</font>**       
       - in **India** **<font color = 'blue'>50%</font>** of the respondents from **Profile 5**  will have a salary between **<font color = 'blue'>20k and 50k</font>**       
       - in **Asia** will likely have a salary between **<font color = 'blue'>50k and 100k</font>**
       - in **Africa** will likely have a salary between **<font color = 'blue'>5k and 20k</font>**
   
   <br>
   
   - respondents with the role title **Data Scientist** and a **Medium / Senior Level for Machine Learning**:
       - in **USA** will likely have a salary between **<font color = 'blue'>100k and 200+ k</font>**
       - in **Europe** will likely have a salary between **<font color = 'blue'>50k and 200k</font>**
       - in **India** will likely have a salary between **<font color = 'blue'>20k and 100k</font>**
       - in **Asia** will likely have a salary between **<font color = 'blue'>50k and 200k</font>**
       - in **Africa** will likely have a salary between **<font color = 'blue'>50k and 100k</font>**
   
   <br>
   
   - respondents with the role title **Engineer** and a **Medium / Senior Level for Machine Learning**:
       - in **USA** will likely have a salary between **<font color = 'blue'>100k and 200+ k</font>**
       - in **Europe** will likely have a salary between **<font color = 'blue'>50k and 100k</font>**
       - in **India** will likely have a salary between **<font color = 'blue'>50k and 200k</font>**
       - in **Asia** will likely have a salary between **<font color = 'blue'>50k and 200k</font>**
       - in **Africa** will likely have a salary between **<font color = 'blue'>1k and 5k</font>**
       - in **Africa** **<font color = 'blue'>50%</font>** of the respondents having **Senior Level for Machine Learning**  will have a salary between **<font color = 'blue'>50k and 100k</font>**
       
   <br>
   
   - respondents with the role title **Other** and a **Medium / Senior Level for Machine Learning**:
      - in **USA** will likely have a salary between **<font color = 'blue'>100k and 200+ k</font>**
      - in **Europe** will likely have a salary between **<font color = 'blue'>20k and 200k</font>**
      - in **India** will likely have a salary between **<font color = 'blue'>20k and 50k</font>**
      - in **Asia** will likely have a salary between **<font color = 'blue'>20k and 200k</font>**
      - in **Africa** will likely have a salary between **<font color = 'blue'>10k and 50k</font>**
   

In [None]:
from plotly.subplots import make_subplots

v_salaries = []
for roleTitle in sorted(X_professionalsF['Role_Title__Grouped'].unique()):
    v_data = y_lgb.reset_index().merge(X_professionalsF[ X_professionalsF['Role_Title__Grouped'] == roleTitle ].reset_index(), how = 'inner', on = 'index').set_index('index')
    v_map = {key: ' - '.join(value.split('\n')[:2]) for key, value in v_profileName.items()}
    v_data['Cluster'] = v_data['Cluster'].replace(v_map)
    v_map = { '1|Europe|Others':          '2. Europe',
              '2|North America|USA':      '1. USA',
              '2|North America|Others':   '6. Others',
              '3|South America|Others':   '6. Others',
              '4|Africa|Others':          '5. Africa',
              '5|Asia|India':             '3. India',
              '5|Asia|Others':            '4. Asia (except India)',
              '6|Oceania|Others':         '6. Others',
              '9|Other|Other':            '6. Others' }
    v_data['Residence_Country__Filtered'] = v_data['Residence_Country__Filtered'].replace(v_map)
    v_data = v_data[ ~v_data['Residence_Country__Filtered'].isin(['6. Others']) ]
    v_data = v_data.groupby(['Salary__Filtered', 'Cluster', 'Residence_Country__Filtered', 'Role_Title__Grouped'])[['Filling_Duration']].count()
    v_data.columns = ['Count']
    v_data.reset_index(inplace = True)
    v_grp = ( v_data.groupby(['Cluster', 'Residence_Country__Filtered'])[['Count']].sum()
                    .rename(columns = { 'Count': 'Count_Grp' }).reset_index() )
    v_data = v_data.merge(v_grp, how = 'inner', on = ['Cluster', 'Residence_Country__Filtered'])
    v_data['Percentage'] = np.round(v_data['Count'] / v_data['Count_Grp'], 4)

    fig = make_subplots()
    for valueF in v_data['Residence_Country__Filtered'].unique():
        v_plot = v_data[ v_data['Residence_Country__Filtered'] == valueF ].copy()
        v_plot['x'] = v_plot[['Residence_Country__Filtered', 'Cluster', ]].apply(lambda x: ' - '.join(x), axis = 1)
        v_plot['y'] = v_plot['Salary__Filtered'].apply(lambda x: x.replace('|', ' - '))    
        fig.add_trace( go.Scatter( x = v_plot['x'].values, y = v_plot['y'].values, 
                                   text = v_plot['Residence_Country__Filtered'], name = valueF, mode = 'markers',
                                   marker = dict( size = v_plot["Percentage"],
                                                  sizeref = 2. * max(v_plot["Percentage"])  / (9.**2), ),
                                   hovertemplate = "<b>%{text}</b><br><br>"
                                                     + "Profile and Residence: %{x}<br>" 
                                                     + "Salary: %{y}<br>" 
                                                     + "Percentage: %{marker.size:.2%}"  # explicit precision keeps the hover label short
                                                     + "<extra></extra>", ) )
    fig.update_layout( xaxis  = dict(type = 'category',categoryorder = 'category ascending'),
                       yaxis  = dict(type = 'category',categoryorder = 'category ascending'), 
                       height = 650, width = 1200 )
    fig.update_layout( title_text = f"Role Title - {roleTitle}")
    fig.show()    
    v_salaries.append(v_data)
v_salaries = pd.concat(v_salaries)
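
The `Percentage` column built in the cell above is each salary band's share within its (`Cluster`, `Residence_Country__Filtered`) group, obtained by merging the group totals back onto the counts. As a minimal sketch on made-up counts, the same share could also be computed with `groupby(...).transform('sum')`. The toy frame and the band labels below are assumptions for illustration only, not survey data.

```python
import pandas as pd

# Made-up counts for illustration; column names mirror the cell above,
# but the values and band labels are hypothetical.
toy = pd.DataFrame({
    "Cluster": ["Profile 1", "Profile 1", "Profile 2", "Profile 2", "Profile 2"],
    "Residence_Country__Filtered": ["1. USA"] * 5,
    "Salary__Filtered": ["band A", "band B", "band A", "band B", "band C"],
    "Count": [30, 10, 5, 10, 5],
})

group_keys = ["Cluster", "Residence_Country__Filtered"]

# transform("sum") broadcasts each group's total back onto its own rows,
# which replaces the separate groupby + merge step.
toy["Count_Grp"] = toy.groupby(group_keys)["Count"].transform("sum")
toy["Percentage"] = (toy["Count"] / toy["Count_Grp"]).round(4)

print(toy[["Cluster", "Salary__Filtered", "Count", "Count_Grp", "Percentage"]])
```

Within each group the shares add up to 1, which is a quick sanity check when reading the bubble charts.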

Findings:
   - respondents having the role title **Data Analyst** will have **<font color = 'blue'>lower</font>** salaries than **Data Scientists** or **Engineers**
   
   <br>
   
   - in ***USA*** respondents having the role title **Data Analyst** and **Junior Level for Machine Learning** will have the **<font color = 'blue'>highest</font>** salaries for **Profile 5** (working for **Large Size Companies**)
   - in ***USA*** respondents having the role title **Data Scientist** and **Very Junior Level for Machine Learning** will have **<font color = 'blue'>lowest</font>** salaries when compared to all the other **Data Scientist** profiles
   - in ***USA*** respondents having the role title **Engineer** will have the **<font color = 'blue'>highest</font>** salary
   
   <br>
   
   - in ***Europe*** respondents having **Medium / Senior Level for Machine Learning** will have the **<font color = 'blue'>highest</font>** salaries
   
   <br>
   
   - in ***India*** respondents having the role title **Data Analyst** will have the **<font color = 'blue'>lowest</font>** salaries
   - in ***India*** respondents having the role title **Data Analyst** and **Junior Level for Machine Learning** will have the **<font color = 'blue'>highest</font>** salaries for **Profile 5** (working for **Large Size Companies**)
   - in ***India*** respondents having **Senior Level for Machine Learning** will have the **<font color = 'blue'>highest</font>** salaries
   
   <br>
   
   - in ***Asia*** respondents having the role title **Engineer** or **Other** and **Junior Level for Machine Learning** will have the **<font color = 'blue'>highest</font>** salaries for **Profile 5** (working for **Large Size Companies**)
   
   <br>
   
   - in ***Africa*** respondents having **Senior Level for Machine Learning** will have the **<font color = 'blue'>highest</font>** salaries
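
One way to check comparisons like these is to look at the most frequent salary band for each profile. The sketch below is only an illustration on made-up shares: the band labels and values are hypothetical, and taking the modal band is an assumed reading of "highest" and "lowest", not necessarily how the findings above were derived.

```python
import pandas as pd

# Made-up shares for illustration; band labels are hypothetical and only
# imitate the "order|label" style of Salary__Filtered.
toy = pd.DataFrame({
    "Cluster": ["Profile 4", "Profile 4", "Profile 5", "Profile 5"],
    "Salary__Filtered": ["1|low", "2|high", "1|low", "2|high"],
    "Percentage": [0.6, 0.4, 0.3, 0.7],
})

# idxmax returns, per profile, the row label holding the largest share.
modal_rows = toy.loc[toy.groupby("Cluster")["Percentage"].idxmax()]
modal_band = modal_rows.set_index("Cluster")["Salary__Filtered"].to_dict()

print(modal_band)
```

With the real `v_salaries` frame, the same idea would need `Residence_Country__Filtered` and `Role_Title__Grouped` added to the grouping keys.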

In [None]:
for residence in sorted(v_salaries['Residence_Country__Filtered'].unique()):
    v_data = v_salaries[ v_salaries['Residence_Country__Filtered'] == residence ]
    
    fig = make_subplots()
    for valueF in v_data['Role_Title__Grouped'].unique():
        v_plot = v_data[   ( v_data['Role_Title__Grouped'] == valueF )
                         & ( v_data['Percentage'] > 0.09 ) ].copy()
        v_plot['Salary__Filtered'] = v_plot['Salary__Filtered'].apply(lambda x: x.replace('|', ' - '))   
        v_plot['x'] = v_plot[['Salary__Filtered', 'Role_Title__Grouped', ]].apply(lambda x: ' - '.join(x), axis = 1) 
        fig.add_trace( go.Scatter( x = v_plot['x'].values, y = v_plot['Cluster'].values, 
                                   name = valueF, mode = 'markers',
                                   text = v_plot['Salary__Filtered'],
                                   marker = dict( size = v_plot["Percentage"],
                                                  sizeref = 2. * max(v_plot["Percentage"])  / (9.**2), ),
                                   hovertemplate = "<b>%{text}</b><br><br>"
                                                     + "Salary and Role: %{x}<br>" 
                                                     + "Profile: %{y}<br>" 
                                                     + "Percentage: %{marker.size:.2%}"  # explicit precision keeps the hover label short
                                                     + "<extra></extra>", ) )
    fig.update_layout( xaxis  = dict(type = 'category',categoryorder = 'category ascending'),
                       yaxis  = dict(type = 'category',categoryorder = 'category ascending'), 
                       height = 650, width = 1200 )
    fig.update_layout( title_text = f"Residence Country / Continent - {residence}")
    fig.show()
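
The `sizeref = 2. * max(size) / (9.**2)` expression used for the bubbles follows the pattern Plotly's bubble chart guide suggests together with `sizemode = 'area'`, whereas the cells above leave `sizemode` at its default, `'diameter'`. The sketch below shows what that difference might mean for the rendered size. The scaling rules in `bubble_diameter_px` are an assumption about how Plotly converts `size` to pixels, and the function itself is a hypothetical helper, not part of the Plotly API.

```python
import math

def bubble_diameter_px(size, sizeref, sizemode="diameter"):
    """Approximate marker diameter in pixels (assumed Plotly scaling rule)."""
    if sizemode == "area":
        # Assumed: area grows linearly with size, so diameter grows with sqrt.
        return 2.0 * math.sqrt(size / (2.0 * sizeref))
    # Assumed: in 'diameter' mode the diameter grows linearly with size.
    return size / sizeref

max_share = 0.5          # hypothetical largest Percentage in a trace
desired_max_px = 9.0     # the 9. used in the cells above
sizeref = 2.0 * max_share / desired_max_px ** 2

diameter_mode = bubble_diameter_px(max_share, sizeref, "diameter")
area_mode = bubble_diameter_px(max_share, sizeref, "area")
print(f"diameter mode: {diameter_mode:.1f}px, area mode: {area_mode:.1f}px")
```

Under these assumptions the largest bubble is drawn at about 40 px in the default mode and at 9 px with `sizemode = 'area'`, so the relative sizes in the charts above scale with the share itself rather than with its square root.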

# 6. Conclusions

From the data science community represented in the Kaggle Survey we have chosen a subset of **Professionals**, for which we have created 9 profiles.

The main characteristic of these profiles is the **ML Level**: most of the technologies, programming languages, algorithms and methods in use become more complex as the **<font color = 'blue'>Seniority Level Increases</font>**, and the salary follows the same trend.

A very interesting profile is **Profile 5**, which closely resembles **Profile 4** in its main characteristics. The differentiator is the size of the company: the respondents in **Profile 5** work for **Large Size Companies**. This profile is **less likely** to be exposed to **more complex algorithms and technologies** (especially for the Role Titles **Data Scientist** and **Engineer**), but its respondents usually have **higher** salaries.


#### **I hope you have enjoyed this analysis. Don't hesitate to provide your feedback in the comments section!**

---------