# A. EXECUTIVE SUMMARY

### Context 

As outlined within the [Kaggle introduction](https://www.kaggle.com/c/kaggle-survey-2020/overview): "the survey was live for 3.5 weeks in October, and after cleaning the data we finished with 20,036 responses".

Membership of the community implies either already having some data science skills, or at least a willingness to observe, learn from, and network with those who do.

**NOTE**: 
1. This notebook contains charts that are generated dynamically: to see all the charts (e.g. the radar chart comparing countries), please open the notebook and select "run all".
2. If you download the notebook, within the environment initialization at the end of this section you will also find the option to use a local copy of the CSV containing the Kaggle results; you will then be able to run locally and export the resulting document as either HTML (suggested) or PDF.
3. If you choose to download and run locally, within the environment initialization at the end of this section you can also set
<code>
    kagglestorysaveimages = True
</code>

By setting that value, when you "run all" the notebook will try to create a subdirectory "images" (if not present), and will save:
* all the results of "print" into a text file
* all the non-plotly charts as SVG files clearly labelled with the section they are produced from
* all the plotly charts as HTML files clearly labelled with the section they are produced from; these files can be opened in any browser and will dynamically reproduce the chart "as is", i.e. as saved

All these options are meant to enable exporting the full documentation and reproducing the results, and have been tested on Debian 10 with Jupyter Notebook and Python 3.7.3 on 2021-01-05.
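
The "print" export described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the notebook's actual logic: the helper name `export_prints` and the throw-away directory are hypothetical, while the `images` subdirectory and the `00text.txt` file name follow the initialization cell at the end of this section.

```python
import contextlib
import tempfile
from pathlib import Path


def export_prints(base_dir, lines):
    """Print each item of `lines` into <base_dir>/images/00text.txt (hypothetical helper)."""
    images_dir = Path(base_dir) / "images"
    # create the subdirectory only if it is not already present
    images_dir.mkdir(exist_ok=True)
    target = images_dir / "00text.txt"
    # redirect_stdout restores the original stdout when the block ends
    with open(target, "w") as handle, contextlib.redirect_stdout(handle):
        for line in lines:
            print(line)
    return target


# hypothetical usage in a throw-away directory
saved = export_prints(tempfile.mkdtemp(), ["===> SECTION A", "first result"])
saved_text = saved.read_text()
```

Using a context manager, rather than reassigning `sys.stdout` for the whole session, means the file is closed and the console output restored even if a cell fails halfway.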

### Focus adopted

My use of the [Kaggle survey results dataset](https://www.kaggle.com/c/kaggle-survey-2020/data) is <u>focused on identifying some potential trends on the bridging between data science and business, notably on the management side</u>.

To that end, I selected people whose job title typically requires interacting with "business domain specialists" and decision-makers: ***Business Analyst*** and ***Product/Project Manager***.

This is because, since the 1980s, across successive rounds of "new data-based decision-making" in business organizations, I have observed that a technology is adopted only when awareness is widespread and its use (at different degrees of expertise) eventually reaches the direct circle of decision-makers, rather than staying within a closed, self-referential technical group.

As an example, in the 1980s Decision Support Systems were probably too technical and too complex, and were embraced by managers in their 30s, at a time when more senior managers did not even personally use a PC.

In the 1990s, business intelligence tools (now widespread) lowered the "knowledge bar", by enabling also intuitive data investigation, and expanding on visualizations that were both easy to obtain, and business-relevant.

But it all starts with people, not technology.

Therefore, this report is a comparative analysis between three countries (or aggregations thereof) that:
* represent different market concepts
* have a different mix of respondents (age, education background, etc)
* have a number of respondents in the same order of magnitude.

After an initial check across all the countries contained within the dataset, three countries have therefore been selected:
* Europe (as an aggregation of countries geographically in Europe, from the Atlantic to the Urals, not just the European Union, as in any case not all the European Union countries are explicitly represented within the survey dataset)
* India
* United States of America (sometimes referenced within the commentary as "USA").

### Preliminary results

The analysis started by reading the Executive Summary released by Kaggle, and then reviewing other material, to confirm the focus of the report.

In early January 2021, while preparing the report, I decided to replace the other reference documents with a new report produced in June 2020 by McKinsey on the "State of AI", focused on the business impacts side, which confirmed some elements within the analysis.

Unfortunately, those data are not available, just the Executive Summary, and therefore I added the link under *"References"* later in this section.

As discussed within the report, this notebook is just a first phase, as more advanced analyses will be carried out in the future.

The purpose was to share feedback on what is inside the dataset, and to suggest which other data (or integration with other data sources) might be useful, within the focus shown above.

The key section of this report, to understand what follows, is the (mainly non-technical) ***Section B. ASSUMPTIONS AND CHOICES***, where choices about both the data and the structure of the report are presented.

From the analysis of the data, some <u>first results</u> (see ***Section E. DATA ANALYSIS*** for more details):
* the three countries cover 11,992 respondents out of 20,036, i.e. 59.50% of the respondents to the survey
* across the job titles of Business Analyst and Product/Project Manager, the three countries cover 898 respondents out of 1,490, i.e. 60.3% of the respondents with those job titles
* by focusing on Business Analysts and Product/Project Managers (898 out of 11,992, i.e. 7.5%), the gender gap is even greater than in the Executive Summary by Kaggle, i.e. the higher up the decision-making chain, the more "gender equality" is lacking, also in Europe, despite all the "positive bias" initiatives adopted since the beginning of the 21st century
* the audience (members of Kaggle) is obviously biased, showing at least an interest in the machine learning domain, but nonetheless the differences between the three countries selected, in demographics and in the companies respondents work for, confirm what is routinely observed in other surveys on the general IT population
* nowhere is there a strong correlation (positive or negative) between the roles selected and the other questions, although, after identifying the average correlation for the three countries, there are differences that might be worth further investigation.

Areas of further investigation identified, based upon the results:
* explore the differences, and try to identify patterns at least for the three questions that show the higher level of difference (Q32, Q24, Q20) for the roles selected
* verify if those patterns apply also to the community of respondents at large, or are just an indication, due to the job titles selected, applying just to the three countries
* for Europe, as the subdivision in countries is present, identify if the above mentioned patterns show differences
* also, I would like to see if these differences are matched by differences vs. the UN SDGs for both the three "macro-countries" and the individual European countries.


### Structure of the report

The report is composed of just one long notebook, plus some files reproducing its contents, charts, and results for comparison and reproducibility purposes.

By the end of January 2021, a PDF version will be added to the files.

The structure of this report:
* A. Executive Summary
* B. Assumptions and Choices
* C. Dimensions of Analysis (data selection)
* D. Data Preparation
* E. Data Analysis
* F. Appendix

Section B contains the focus adopted and the production process, while Section C contains the data selection criteria.

Section D prepares the data for further use, creating different aggregations.

Section E is the core of the report:
* first discussing each one of the features selected (16 columns out of 355)
* then identifying why some are not relevant to the focus of this report
* on those selected to remain, a correlation analysis
* finally, based upon the results, listing the other potential areas of future investigation that I would like to follow, e.g. checking if any pattern is also consistent with patterns within the UN SDGs, notably for individual European countries (datasets on UN SDGs are already posted on Kaggle, as part of another ongoing publication project, on [my kaggle profile](https://kaggle.com/robertolofaro)).

Within Section F, I enclosed my notes/quotes from the two reports listed within the references.


### References

Out of many more documents, I considered as a reference for the analysis two reports:
* [Kaggle's executive summary PDF](https://www.kaggle.com/kaggle-survey-2020), containing the above mentioned 20,036 responses (retrieved and read 2019-11-20) 
* [McKinsey's "Global Survey - The State of AI in 2020"](https://www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/global-survey-the-state-of-ai-in-2020), carried out between 2020-06-09 and 2020-06-19, containing 2,395 responses, p. 13: "representing the full range of regions, industries, company sizes, functional specialties, and tenures. Of those respondents, 1,151 said their organizations had adopted AI in at least one function and were asked questions about their organizations’ initiatives." (retrieved and read 2021-01-02).

Only the data from the Kaggle survey have been used in this report, although the McKinsey survey executive summary (13 pages) can be read online for free (and, upon registration for a free account, the PDF version can be downloaded).


**INITIAL CONFIGURATION AND LOAD INPUT FILE (KAGGLE SURVEY CSV)**


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# add code to load libraries
#import pandas as pd
#import numpy as np
import matplotlib.pyplot as plt
#%matplotlib inline
import seaborn as sns
from math import pi
import plotly.express as px
import plotly.graph_objects as go

# set to True if text and images are to be exported, False otherwise
# on Kaggle: set False
kagglestorysaveimages = False

if kagglestorysaveimages:
    # redirect all the "print" output to a textfile
    import sys
    import os
    from plotly.offline import iplot, init_notebook_mode
    import plotly.graph_objs as go
    import plotly.io as pio
    import psutil
    # if it does not exist, create the directory for the text and images
    if not os.path.exists("images"):
        os.mkdir("images")    
    # define the output for the "print"
    orig_stdout = sys.stdout
    f = open('images/00text.txt', 'w')
    sys.stdout = f
    
# load data
#kaggledataraw = pd.read_csv('kaggle_survey_2020_responses.csv',low_memory=False)
kaggledataraw = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv',low_memory=False)

# B. ASSUMPTIONS AND CHOICES

In the appendix, after the data analysis, I will share more analysis, but this preamble is to explain the logic in my use of the [dataset](https://www.kaggle.com/c/kaggle-survey-2020/data).

**ASSUMPTIONS**

1. To avoid facing yet another [**"AI Winter"**](https://en.wikipedia.org/wiki/AI_winter), AI has to become part of the *"new normal"* that the convergence of [**"Digital Transformation"**](https://en.wikipedia.org/wiki/Digital_transformation), accelerated by the COVID-19 pandemic side-effects, is delivering.

2. The transformation will affect both society and businesses, but I will focus on the business side.

3. What I observed since the 1980s is that technology, to become embedded in businesses, needs:

>a **mandate**, or at least (for initiatives started from the bottom of an organization, as spontaneous experiments, and not as initiatives, programmes, etc from the top) clear sponsorship from decision-makers

>adequate **resources** (budget, but also infrastructure and integration within the ordinary budget of e.g. business units or corporate ICT)

>last but not least, **talent**, i.e. an environment that attracts, retains, and develops human resources, and spreads awareness across the whole organization.

***THEMES SELECTED***

After reading the Survey Executive Summary report and reviewing the data, the following selection criteria have been applied to choose which data (questions) should be used in the analysis:
* identifying people
* defining the organizations they work for
* identifying the level of adoption of data-based decision-making (not just AI- also Business Intelligence, as e.g. Microsoft PowerBI allows integration also with R and Python)

***POPULATION SELECTION***

These are the **job titles present within the dataset**:


In [None]:
print("\r\n===> SECTION B. ASSUMPTIONS AND CHOICES\r\n")

In [None]:
# full list of jobtitles
print("\r\nset(kaggledataraw['Q5'])")
print(set(kaggledataraw['Q5']))

My use of the [Kaggle survey results dataset](https://www.kaggle.com/c/kaggle-survey-2020/data) is <u>focused on identifying some potential trends on the bridging between data science and business, notably on the management side</u>.

To that end, I selected people whose job title typically requires interacting with "business domain specialists" and decision-makers: ***Business Analyst*** and ***Product/Project Manager***.

This report is a comparative analysis between three countries (or aggregations thereof) that:
* represent different market concepts
* have a different mix of respondents (age, education background, etc)
* have a number of respondents in the same order of magnitude.

The three countries selected:
* ***Europe*** (as an aggregation of countries geographically in Europe, from the Atlantic to the Urals, not just the European Union, as in any case not all the European Union countries are explicitly represented within the survey dataset, see below in sections C-D)
* ***India***
* ***United States of America*** (sometimes referenced within the commentary as ***USA***).


**CHOICES**

***STRUCTURE***

Key points:
1. This report is built using just one long Jupyter Notebook
2. Added output exporting logic, to ease producing a report for non-Jupyter users
3. For the plotly charts, the export is in HTML format, to allow interaction from any browser
4. To enhance readability for the textual part of the export, each "print" statement is preceded by the printing of a "contextual positioning" of what is going to be printed
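
Point 4 can be sketched as a small helper. This is only an illustration of the idea: the name `print_with_context` is hypothetical, and the notebook itself writes the two "print" calls explicitly in each cell.

```python
import io
from contextlib import redirect_stdout


def print_with_context(label, value):
    """Print a 'contextual positioning' line, then the value itself (hypothetical helper)."""
    # the label is usually the expression that produced the value
    print("\r\n" + label)
    print(value)


# capture the output in memory to show what ends up in the exported text
buffer = io.StringIO()
with redirect_stdout(buffer):
    print_with_context("kagglestory.shape", (20036, 16))
captured = buffer.getvalue()
```

In the exported text file, the label line lets a reader who never opens Jupyter map each block of output back to the expression that generated it.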

More notebooks have been prepared to identify the focus of the analysis, and investigate other options.

From those notebooks, only the items that could contribute to the narrative chosen for this notebook have been included.

Whenever a group of lines was used more than once, I tried to create functions, to enforce consistency in data presentation and visualization, and to enable future expansion of the analysis while keeping a consistent format.

Except for key questions, where I used different visualizations, after identifying the target discussed above I selected just two charts, using Plotly (to allow interacting with data points):
* a radar chart with multiple (Europe, India, USA) cases
* a static radar chart separating each country, to better highlight patterns relevant to each country, without any scaling issues that e.g. a single country with values out of range might force
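
As a rough sketch of the data preparation behind such radar charts (an assumption about one possible approach, not the functions used later in this notebook), each country's answers can be converted into percentages over a shared list of categories, and the polygon closed by repeating the first point. The helper name `radar_series` and the toy answers are hypothetical.

```python
from collections import Counter


def radar_series(answers, categories):
    """Return (theta, r): the categories and each one's share of answers in percent.

    The first point is repeated at the end so that the polygon closes.
    Assumes `answers` is not empty.
    """
    counts = Counter(answers)
    total = len(answers)
    # a category nobody selected gets 0, so all countries share the same axes
    r = [round(100 * counts[category] / total, 2) for category in categories]
    return categories + categories[:1], r + r[:1]


# hypothetical answers, for illustration only
categories = ["25-29", "30-34", "35-39"]
toy_answers = {
    "Europe": ["25-29", "30-34", "30-34", "35-39"],
    "India": ["25-29", "25-29", "30-34", "35-39"],
}
series = {
    country: radar_series(answers, categories)
    for country, answers in toy_answers.items()
}
```

Each `(theta, r)` pair could then be passed to `plotly.graph_objects.Scatterpolar(r=r, theta=theta, fill='toself', name=country)`. Working in percentages rather than raw counts is what keeps countries with very different numbers of respondents on a comparable scale.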

Future releases might be restructured to add more functions and different visualizations.

***DATA SOURCES***

This first report uses exclusively the Kaggle 2020 survey dataset to identify what the data tell, and to highlight further items to discuss.

***PUBLICATION RELEASE***

Along with the notebook, I also released:
* a ZIP file that allows reading the results and viewing each chart individually (for the plotly charts, each HTML file is the chart "live", as it interfaces with the Plotly server to reproduce the chart, which allows details to be inspected)
* an HTML file containing the whole execution of the notebook as released, to enable reproducibility and data verification, generated by the following command:
<code>
    jupyter nbconvert --execute --to html NOTEBOOKNAME.ipynb
</code>


**PS a personal note**: as I started studying Python etc. during the COVID-19 lockdown (March-June 2020), I might revise this notebook in the future as my knowledge progresses and, as time allows, expand it to explore other concepts.

Therefore, I apologize for un-pythonesque code and choices.



# C. DIMENSIONS OF ANALYSIS - DATA SELECTION


***QUESTIONS***

Following the assumptions discussed under ***"themes selected"*** (see above in section ***B. ASSUMPTIONS AND CHOICES***), only a subset of questions has been selected.

In [None]:
# questions taken into the analysis
# person: Q1=age Q2=sex Q3=country Q4=education Q5=occupation
list_person = ['Q1','Q2','Q3','Q4','Q5']
# company:
# Q20=company size
# Q21=individuals responsible for data science workloads
# Q22=employer currently incorporating ML
# Q24=annual compensation
list_company = ['Q20','Q21','Q22','Q24','Q25']
# roles and influence:
# Q23 (first part only)
list_roles =['Q23_Part_1']
# tech:
# Q6=code writing
# Q11=cloud computing used
# Q15=how long used ML
# Q32=BI tool used
# Q38=primary tool to analyze data
list_tech = ['Q6','Q11','Q15','Q32','Q38']
# data to be extracted - questions subset
kagglestory_listquestions = list_person+list_company+list_roles+list_tech

The original dataset has 20037 rows (the first one being the actual question), over 355 columns

In [None]:
print("\r\n===> SECTION C. DIMENSIONS OF ANALYSIS - DATA SELECTION\r\n")
print("\r\nkaggledataraw.shape")
print(kaggledataraw.shape)

In [None]:
# collect the questions
kaggledataraw_questionstable = kaggledataraw.iloc[0].copy()

# data without the questions
kaggledataraw_modified = kaggledataraw[1:].copy()

# data to be extracted
kagglestory = kaggledataraw_modified[kagglestory_listquestions]


The selected dataset used to seed the analysis contains 20036 rows over 16 columns.

In [None]:
# data to be used to seed the report
print("\r\nkagglestory.shape")
print(kagglestory.shape)

In [None]:
# questiontable relevant only for the selected questions
kagglestory_questionstable = kaggledataraw_questionstable[kagglestory_listquestions].copy()

# clean up the string so that it can be used as a description
kagglestory_questionstable = kagglestory_questionstable.str.replace('- Selected Choice','')

In [None]:
# iterate over (column name, question text) pairs instead of hard-coding 16 positions
for column, question in kagglestory_questionstable.items():
    print(column," ",question,"\r\n")

***VALUES DISTRIBUTION***

This is the set of values available for each question within the selected subset

In [None]:
# gets now the value counts, printing before the question
for i in kagglestory.columns:
    print("\r\n === \n question: ",i,"\n",kagglestory_questionstable[i],"\n ===\r\n")
    print(kagglestory[i].value_counts())
    print("\r\n\r\n")


***JOB TITLES***

These are the **job titles that have been considered in this analysis**:
* 'Business Analyst'
* 'Product/Project Manager'

***GEOGRAPHIC COVERAGE***

While the initial analysis considers all the countries covered within the [Kaggle survey data](https://www.kaggle.com/c/kaggle-survey-2020/data), to compare the current status I selected three geographical areas with roughly the same sample size <u>across the job titles selected</u>:
* USA
* India
* Europe (from the Atlantic to the Urals, i.e. including EU, UK, former USSR European countries)



# D. DATA PREPARATION


**DATA BY COUNTRY**


In [None]:
print("\r\n===> SECTION D. DATA PREPARATION\r\n")

In [None]:
# used for all the following tables: number of respondents, number of BA and PM, percent 
referencecolumnstable = ['answers','ba_pm','percent']

# number of respondents by country (all job titles)
kagglestorytotals = kagglestory.value_counts('Q3')
# number of respondents by country for the two job titles
kagglestoryselection = kagglestory[(kagglestory['Q5'] == 'Business Analyst') | (kagglestory['Q5'] == 'Product/Project Manager')].value_counts('Q3')
# share of the two job titles on the respondents of each country, as a percentage
kagglestorypercent = round((kagglestoryselection / kagglestorytotals)*100,2)
# show the three together
kagglestorysummary = pd.concat([kagglestorytotals,kagglestoryselection,kagglestorypercent],axis=1)
kagglestorysummary.columns = referencecolumnstable
print("\r\nkagglestorysummary")
print(kagglestorysummary)

The column "ba_pm" contains the number of respondents by country whose title is either *Business Analyst* or *Product/Project Manager*; the column "percent" is

$column\ percent = \frac{column\ ba\_pm}{column\ answers} \times 100$

i.e. a percentage, rounded to two decimals.
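
A minimal sketch of how this table behaves, using hypothetical counts (not taken from the survey): a country without any respondent holding those titles does not appear in the selection counts, so `pd.concat` leaves NaN for both "ba_pm" and "percent". Replacing the NaN with 0, as done in the last line, is an assumption about how such a gap could be read, not something the notebook does.

```python
import pandas as pd

# hypothetical counts by country, for illustration only
totals = pd.Series({"Country A": 200, "Country B": 50, "Country C": 80}, name="answers")
# "Country C" has no BA/PM respondents, so it is simply absent here
selection = pd.Series({"Country A": 30, "Country B": 4}, name="ba_pm")

# align the two series on the country index; missing entries become NaN
summary = pd.concat([totals, selection], axis=1)
summary["percent"] = (summary["ba_pm"] / summary["answers"] * 100).round(2)

# optional: read "no BA/PM respondents" as zero instead of NaN
summary_filled = summary.fillna(0)
```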


**CREATE SUBSET(S) FOCUSING ON EUROPE, INDIA, USA**

In [None]:
selectcountrieseurope = ['Belarus','Belgium','France','Germany','Greece',
'Ireland','Italy','Netherlands','Poland','Portugal',
'Romania','Russia','Spain','Sweden','Switzerland',
'Turkey','Ukraine','United Kingdom of Great Britain and Northern Ireland']

kagglestoryeurope = pd.DataFrame()
kagglestoryeuropedetails = pd.DataFrame()
kagglestoryusa = pd.DataFrame()
kagglestoryindia = pd.DataFrame()
kagglestoryeuropedetails = kagglestory[kagglestory['Q3'].isin(selectcountrieseurope)].copy()
kagglestoryeurope = kagglestory[kagglestory['Q3'].isin(selectcountrieseurope)].copy()
kagglestoryeurope['Q3'] = 'Europe'
kagglestoryusa = kagglestory[kagglestory['Q3']=='United States of America'].copy()
kagglestoryindia = kagglestory[kagglestory['Q3']=='India'].copy()
kagglestoryworld = pd.concat([kagglestoryeurope,kagglestoryindia,kagglestoryusa])
kagglestoryworlddetails = pd.concat([kagglestoryeuropedetails,kagglestoryindia,kagglestoryusa])

In [None]:
kagglestoryworldtotals = kagglestoryworld.value_counts('Q3')
kagglestoryworldselection = kagglestoryworld[(kagglestoryworld['Q5'] == 'Business Analyst') | (kagglestoryworld['Q5'] == 'Product/Project Manager')].value_counts('Q3')
kagglestoryworldpercent = round((kagglestoryworldselection / kagglestoryworldtotals)*100,2)
kagglestoryworldsummary = pd.concat([kagglestoryworldtotals,kagglestoryworldselection,kagglestoryworldpercent],axis=1)
kagglestoryworldsummary.columns = referencecolumnstable
print("\r\nkagglestoryworldsummary")
print(kagglestoryworldsummary)

In [None]:
kagglestoryeuropetotals = kagglestoryeuropedetails.value_counts('Q3')
kagglestoryeuropeselection = kagglestoryeuropedetails[(kagglestoryeuropedetails['Q5'] == 'Business Analyst') | (kagglestoryeuropedetails['Q5'] == 'Product/Project Manager')].value_counts('Q3')
kagglestoryeuropepercent = round((kagglestoryeuropeselection / kagglestoryeuropetotals)*100,2)
kagglestoryeuropesummary = pd.concat([kagglestoryeuropetotals,kagglestoryeuropeselection,kagglestoryeuropepercent],axis=1)
kagglestoryeuropesummary.columns = referencecolumnstable
print("\r\nkagglestoryeuropesummary")
print(kagglestoryeuropesummary)

# E. DATA VISUALIZATION

Data visualization is a storyline within the storyline.

It is a series of steps to increase understanding and see how far I could go in my research with the available data.

Generally, E1-E2 are about overall distribution, while E3-E6 are about comparatively profiling countries.

For the former, I selected the usual tools (tables, histograms); for the latter, a comparison tool that I have used since the early 2000s in organizational change activities: the radar chart.

While many dislike it, I found it useful to compare patterns between organizations and also in purchasing and quality/audit activities.

Anyway, the structure and functions (as well as the dataframes created above with self-explanatory names and comments) should enable altering the logic, e.g. to selectively adopt different visualizations for different questions.

My target is a hypothetical request from a hypothetical business customer asking questions about the presence/absence of people with those job titles, and the mix of corporate and technological environments they work in, as well as demographic differences between the three countries (Europe, India, USA).

This table contains the steps within the data analysis; the column **OK/NOK** states if the results of that step supported my original question (*identifying some potential trends on the bridging between data science and business, notably on the management side*).

| Step | Purpose | OK/NOK |
| --- | --- | --- |
| E1. Check BA/PM frequency by country | verify from percentages if could make sense to check Europe as an aggregate, vs. India and USA | OK |
| E2. Confirm comparison countries | verify if, by aggregating countries in Europe, there is a unit comparable in size to India and USA | OK |
| E3. Check decision influencers | check if the distribution of E2. matches that of influencers  | NOK |
| E4. Compare demographic distribution | check, for BAs/PMs, how Europe, India, USA compare | OK |
| E5. Compare companies characteristics | check, for BAs/PMs, how Europe, India, USA compare | OK |
| E6. Compare technical characteristics | check, for BAs/PMs, how Europe, India, USA compare | OK |
| E7. Assess results and analysis | summarize results from the commentary in E1-E6 | OK |
| E8. Further investigations planned | share ideas about other areas of analysis, or new questions | OK |


In [None]:
print("\r\n===> SECTION E. DATA VISUALIZATION\r\n")

## *E1. Check BA/PM frequency by country*

Different countries have different approaches.

Lacking some further information (e.g. how long ago a data science team was created in organizations in each country), some answers cannot be directly obtained from the data.

It is anyway more interesting to consider the distribution of those titles across countries, i.e. the share of people holding that title in each country (considering the number of respondents for the country).

In [None]:
print("\r\n===> SECTION E1. Check BA/PM distribution by country\r\n")
print("\r\nkagglestorypercent.sort_values ascending=False")
print(kagglestorypercent.sort_values(ascending=False))

The percentage of Business Analysts and Product/Project Managers in each country could be an indicator of the focus and team size.

It could also be an indicator of how much those answering are involved in bringing new products or services to the market.

A lower number of project managers might imply smaller teams, e.g. delivering outsourced data science services on-demand, or even just experimentation.

But, again, to explore further, I would need information about industries, business domains, etc.

The McKinsey report linked within the section ***A. Executive Summary*** contains those dimensions, but it would be interesting to have the same questions within the Kaggle survey as well.

**CLUSTERING OF COUNTRIES BY PERCENTAGE OF PM-BA ON NUMBER OF RESPONDENTS**

In [None]:
# ax = sns.histplot(kagglestorysummary,x=kagglestorysummary['percent'], stat="count", kde=True)
ax = kagglestorysummary['percent'].plot.hist(bins=7)
if(kagglestorysaveimages):
    fig = ax.get_figure()
    fig.savefig('images/E1_01_clustering.svg',format='svg')

**Results:** 

* there is a distribution with differences between countries
* multiple countries differ by level of presence of BAs and Product/Project Managers between respondents
* it could make sense to check other dimensions

**Next step:** 

* checking if the aggregations Europe, India, USA have a comparable size


## *E2. Confirm comparison countries*

As stated above, the intent is to focus the analysis on three continental-level countries, to identify if there are different patterns vs. the other questions selected for this report.

These are the three candidates:
* Europe
* India
* United States of America

The question in this step is if the resulting size is comparable.

As the survey results do not cover all the Member States of the European Union, I selected instead a concept of Europe closer to "from Atlantic to Urals", also including countries that are not part of the European Union (yet).

Which countries are considered to be Europe, for the purpose of this report, between those individually available in the dataset?

| country | answers | ba_pm | percent |
| --- | --- | --- | --- |
| Belarus | 59 | 4 | 6.78 |
| Belgium | 60 | 4 | 6.67 |
| France | 330 | 38 | 11.52 |
| Germany | 404 | 33 | 8.17 |
| Greece | 111 | 9 | 8.11 |
| Ireland | 54 | 8 | 14.81 |
| Italy | 267 | 36 | 13.48 |
| Netherlands | 151 | 17 | 11.26 |
| Poland | 148 | 11 | 7.43 |
| Portugal | 122 | 13 | 10.66 |
| Romania | 61 | NaN | NaN |
| Russia | 582 | 61 | 10.48 |
| Spain | 336 | 27 | 8.04 |
| Sweden | 78 | 9 | 11.54 |
| Switzerland | 68 | 4 | 5.88 |
| Turkey | 344 | 17 | 4.94 |
| Ukraine | 170 | 11 | 6.47 |
| United Kingdom of Great Britain and Northern Ireland | 489 | 45 | 9.2 |

As you can see, I considered both geographical and business integration, as e.g. Turkey is integrated within the European supply chains, and the same applies (in other industries, e.g. energy) to Russia and Ukraine.

**Note**: as I have contacts on Kaggle in other European countries that are missing from the list (and from the data), I assume that either they are within "Other", or they have not answered the survey.


In [None]:
print("\r\n===> SECTION E2. Confirm comparison countries\r\n")

sns.set_context("talk", font_scale=1.1)
plt.figure(figsize=(10,6))
sns.scatterplot(x=kagglestoryeuropesummary['answers'], 
                y=kagglestoryeuropesummary['ba_pm'],
                size=kagglestoryeuropesummary['percent'],            
                hue=kagglestoryeuropesummary['percent'],            
                data=kagglestoryeuropesummary)
# Put the legend out of the figure
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel("number of respondents")
plt.ylabel("number of BAs or PMs")
plt.title("European Distribution")
plt.tight_layout()

if(kagglestorysaveimages):
    plt.savefig("images/E2_02_European_distribution.svg",format='svg',dpi=150)

As you can see from the bubble chart, there is no direct correlation between the number of respondents in each European country (as per the concept of "Europe" selected above) and the percentage of Business Analysts and Product/Project Managers (represented by the size and hue of each dot).
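
The visual impression can be complemented by a number. The sketch below is an addition for illustration, not part of the original analysis: it copies the "answers" and "percent" columns from the table under E2 and computes both a Pearson coefficient and a rank-based (Spearman-style) one, after dropping the country without a BA/PM value. A coefficient close to zero would support the reading above.

```python
import pandas as pd

# "answers" and "percent" columns copied from the table under E2
# (the eleventh entry, Romania, has no BA/PM value)
europe = pd.DataFrame({
    "answers": [59, 60, 330, 404, 111, 54, 267, 151, 148,
                122, 61, 582, 336, 78, 68, 344, 170, 489],
    "percent": [6.78, 6.67, 11.52, 8.17, 8.11, 14.81, 13.48, 11.26, 7.43,
                10.66, float("nan"), 10.48, 8.04, 11.54, 5.88, 4.94, 6.47, 9.2],
})

# keep only the countries where both values are available
valid = europe.dropna()

# linear association between sample size and share of BAs/PMs
pearson = valid["answers"].corr(valid["percent"])
# same computation on ranks, less sensitive to the few large countries
rank_based = valid["answers"].rank().corr(valid["percent"].rank())

print("pearson:", round(pearson, 2), "rank-based:", round(rank_based, 2))
```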

In this report, as stated under section ***B. Assumptions and Choices***, I will keep using only the dataset from Kaggle, starting with the selection contained within the dataframe **kagglestory** (created in section ***C. Dimensions of Analysis (data selection)***), and then further restricting it to the two job titles selected (Business Analyst and Product/Project Manager).

And this is the summary table including Europe (both as an aggregate and as individual countries), India, and the United States of America.

In [None]:
kagglechartworldsummary = pd.concat([kagglestoryworldsummary,kagglestoryeuropesummary])
kagglechartworldsummary.columns = referencecolumnstable
print("\r\nkagglechartworldsummary")
print(kagglechartworldsummary)

**Results:** 

* within European countries, there is no direct correlation between number of respondents and presence of BAs/PMs
* aggregating Europe as defined above, it is comparable in size and percentage of BAs/PM with India and USA
* it could make sense to check other dimensions

**Next step:** 

* working on the influencers distribution

## *E3. Check decision influencers*

For the purpose of this report, a single part of Question 23, the one focused on a potential role by the respondent in influencing product and business choices, was selected.

The table under ***E2. Confirm comparison countries*** summarizes the distribution of the job titles target of this report, country by country.

There was a question within the survey (*Question 23: "Select any activities that make up an important part of your role at work"*) that allowed multiple choices.

The first option, of interest here, was within the question Q23_Part_1:


In [None]:
print("\r\n===> SECTION E3. Check decision influencers\r\n")
print("\r\nQ23 Part 1: ",kagglestory_questionstable['Q23_Part_1'])


The aim was to **analyze and understand data to <u>influence product or business decisions</u>**

From the distribution above, you could assume that countries with a lower share of the target roles (Business Analysts and Product/Project Managers) would also have a lower number of "influencers", but this is not the case.


Reminder: this is the number of respondents:

In [None]:
print("\r\nnumber of respondents\nkagglestory Q3,Q23_Part_1 .count Q3")
print(kagglestory[['Q3','Q23_Part_1']].count()['Q3'])

This is the number of those who could be classified as influencers as per Question 23 Part 1:

In [None]:
print("\r\nnumber of influencers\nkagglestory Q3,Q23_Part_1 .count Q23_Part_1")
print(kagglestory[['Q3','Q23_Part_1']].count()['Q23_Part_1'])
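
Both figures rely on `count()`, which counts only the non-missing values of each column. A minimal sketch, assuming (as the counting logic implies) that a respondent who did not select the option has a missing value in `Q23_Part_1`; the tiny dataframe is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: the option is filled in only when the respondent ticked it
sample = pd.DataFrame(
    {
        "Q3": ["India", "India", "Europe", "Europe", "Europe"],
        "Q23_Part_1": ["selected", np.nan, "selected", np.nan, np.nan],
    }
)

counts = sample[["Q3", "Q23_Part_1"]].count()
respondents = counts["Q3"]          # every row has a country
influencers = counts["Q23_Part_1"]  # only rows where the option was ticked

# the same logic applied country by country
by_country = sample.groupby("Q3")["Q23_Part_1"].count()

print(respondents, influencers)
print(by_country)
```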

In [None]:
# add to the table above two columns: count influencers, % influencers on total
kaggleinfluencerstotals = kagglestory[['Q3','Q23_Part_1']].groupby('Q3').count()
kaggleinfluencerstable = pd.concat([kagglestorytotals,kagglestoryselection,kagglestorypercent,kaggleinfluencerstotals],axis=1)
influencerscolumns = referencecolumnstable.copy()
influencerscolumns.append('influencers')
kaggleinfluencerstable.columns = influencerscolumns
kaggleinfluencerstable['pcnt_on_bapm'] = round((kaggleinfluencerstable['influencers'] / kaggleinfluencerstable['ba_pm'])*100,2)
kaggleinfluencerstable['pcnt_on_total'] = round((kaggleinfluencerstable['influencers'] / kaggleinfluencerstable['answers'])*100,2)
print("\r\nkaggleinfluencerstable")
print(kaggleinfluencerstable)

**Results:** 

* the purpose of using this question was to complement the job titles (Q5)
* from the results, there is no direct relationship between being a Business Analyst or Product/Project Manager and influencing product or business decisions
* this is a theme that is worth investigating, but would require further data
* therefore, the **<u>question is excluded from the report</u>**, as its results are probably obfuscated both by the respondents' interpretations of the question and by other information

**Next step:** 

* working on the demographic distribution

# DIGRESSION: CHARTING FUNCTIONS


**NOTE**: 

**<u>from this point on, all the charts are based only on the target job titles, i.e. Business Analyst and Product/Project Manager</u>**
    

In [None]:
# create common variables
kaggleanalysis = kagglestoryworld[(kagglestoryworld['Q5'] == 'Business Analyst') | (kagglestoryworld['Q5'] == 'Product/Project Manager')]

kagglestory_countrylist = list(set(kaggleanalysis['Q3']))
print("\r\nkagglestory_countrylist")
print(kagglestory_countrylist)
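
The row selection above combines two equality tests with `|`. An equivalent and slightly more compact form, sketched here on made-up rows, uses `Series.isin`. Sorting the country list is an optional extra (an assumption, not part of the notebook) that makes the order reproducible, since a `set` has no guaranteed order.

```python
import pandas as pd

# Made-up rows standing in for the survey selection
world = pd.DataFrame(
    {
        "Q3": ["India", "Europe", "United States of America", "Europe"],
        "Q5": ["Business Analyst", "Data Scientist",
               "Product/Project Manager", "Business Analyst"],
    }
)

target_titles = ["Business Analyst", "Product/Project Manager"]

# same rows as (Q5 == 'Business Analyst') | (Q5 == 'Product/Project Manager')
analysis = world[world["Q5"].isin(target_titles)]

# sorted() gives a stable order; list(set(...)) may change between runs
country_list = sorted(analysis["Q3"].unique())
print(country_list)
```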

In [None]:
def create_chart_dataframe(question_to_use):
    # count the answers once, by country (Q3) and by answer to the question
    counts = kaggleanalysis[['Q3',question_to_use]].value_counts()
    # one column per country, one row per answer
    tempchart = pd.concat([counts[country] for country in kagglestory_countrylist],axis=1)
    tempchart.columns = kagglestory_countrylist
    print("\r\nQuestion:",question_to_use,"\r\n")
    print(kagglestory_questionstable[question_to_use])
    print("\r\n",tempchart,"\r\n")
    
    return tempchart
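
The table returned by `create_chart_dataframe` has one row per answer and one column per country. As a sketch of an alternative, assuming only this shape matters, `pd.crosstab` produces the same kind of table in one call; the rows below are invented for illustration. Unlike the `value_counts` approach, combinations that never occur show up as 0 rather than as missing values.

```python
import pandas as pd

# Invented rows: country (Q3) and age bracket (Q1) of six respondents
sample = pd.DataFrame(
    {
        "Q3": ["Europe", "Europe", "India", "India", "India",
               "United States of America"],
        "Q1": ["25-29", "30-34", "25-29", "25-29", "22-24", "30-34"],
    }
)

# rows: answers to the question, columns: countries, cells: respondents
chart = pd.crosstab(sample["Q1"], sample["Q3"])
print(chart)
```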


In [None]:
def plot_spider(dflayer,dfvalues, spider_descriptions, fig_title):
    print("Country: ",fig_title,'\r\n')
    df = pd.DataFrame(dict(
    r=dfvalues,
    theta=spider_descriptions))
    fig = px.line_polar(df, r='r', theta='theta', line_close=True)
    fig.update_traces(fill='toself')
    if(kagglestorysaveimages):
        filenamehtml = 'images/'+dflayer+dfvalues.index.name+fig_title+'.html'
        filenamefig = 'images/'+dflayer+dfvalues.index.name+fig_title+'.svg'
        #plt.savefig(filenamefig,format='svg')
        fig.write_html(filenamehtml)
        # write as static image
        #fig.write_image(filenamefig)
    fig.show()

In [None]:
def plot_radar(dflayer,df, categories):

    maximumvalue = max(max(df['Europe']),
                       max(df['India']),
                       max(df['United States of America']))
    
    fig = go.Figure()

    fig.add_trace(go.Scatterpolar(
      r=df['Europe'],
      theta=categories,
      fill='toself',
      name='Europe'
    ))
    fig.add_trace(go.Scatterpolar(
      r=df['India'],
      theta=categories,
      fill='toself',
      name='India'
    ))
    fig.add_trace(go.Scatterpolar(
      r=df['United States of America'],
      theta=categories,
      fill='toself',
      name='United States of America'
    ))

    fig.update_layout(
      polar=dict(
        radialaxis=dict(
          visible=True,
          range=[0, maximumvalue]
        )),
      showlegend=True
    )

    fig.show()
    if(kagglestorysaveimages):
        filenamehtml = 'images/'+dflayer+df.index.name+'aaradar'+'.html'
        filenamefig = 'images/'+dflayer+df.index.name+'aaradar'+'.svg'
        #plt.savefig(filenamefig,format='svg')
        # write as interactive plotly
        fig.write_html(filenamehtml)
        # write as static image
        #fig.write_image(filenamefig)

## ***E4. Compare demographic distribution***

This section looks at how the three "countries" (Europe, India, USA) compare in terms of demographics, across five variables:

In [None]:
print("\r\n===> SECTION E4. Compare demographic distribution\r\n")

for i in range(5):
    print(kaggleanalysis.columns[i]," ",kagglestory_questionstable[i],"\r\n")

In [None]:
q1_dataframe = create_chart_dataframe('Q1')


The initial idea was to have a single radar chart comparing the three countries, but already the first question showed significant differences in demographics, as visualized in the following radar chart:

In [None]:
plot_radar('E4',q1_dataframe,q1_dataframe.index.values)

Therefore, I decided to create, for each question, both the aggregate radar (as significant differences are information per se) and individual radar charts (to allow a better understanding of each country).

The purpose: to allow a visual comparison of the differences between the three countries across all the 16 questions selected.
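
Because the three groups differ in size, raw counts can make the radar shapes harder to compare. One possible complement, not used in this notebook, is to convert each country's counts into percentages of that country's total before plotting. A minimal sketch with invented counts:

```python
import pandas as pd

# Invented counts, one column per country (same layout as the chart tables)
counts = pd.DataFrame(
    {"Europe": [10, 30, 60], "India": [50, 100, 50]},
    index=["18-21", "22-24", "25-29"],
)

# divide every column by its own total, then express as a percentage
shares = counts.div(counts.sum(axis=0), axis=1) * 100
print(shares.round(1))
```

With shares, each column sums to 100, so the charts compare the profile of each country rather than its number of respondents.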

In [None]:
for i in range(len(kagglestory_countrylist)):
    plot_spider('E4',q1_dataframe[kagglestory_countrylist[i]], q1_dataframe.index.values, kagglestory_countrylist[i])

In [None]:
print("\r\nQuestion:","Q3","\r\n")
print(kagglestory_questionstable["Q3"])
print("\r\n",kagglechartworldsummary.iloc[:3],"\r\n")

The table above contains an extract of the same data already shown in sections ***E1. Check BA/PM frequency by country*** and ***E2. Confirm comparison countries***.

Following the results of this first verification, I created a new function that automatically produces all the tables, charts, etc., so that I could focus only on the question-by-question commentary, to be then integrated within the section ***E7. Assess results and analysis***.

In [None]:
def questions_descriptive_radars(dflayer,question_to_use):
    question_dataframe = create_chart_dataframe(question_to_use)
    plot_radar(dflayer,question_dataframe,question_dataframe.index.values)
    for i in range(len(kagglestory_countrylist)):
        plot_spider(dflayer,question_dataframe[kagglestory_countrylist[i]], question_dataframe.index.values, kagglestory_countrylist[i])
        

In [None]:
questions_descriptive_radars('E4','Q2')

Remember that the focus of this report is just on Business Analysts and Product/Project Managers, i.e. roles that usually are associated with higher frequency of interaction with decision makers.

The [Executive Summary report published by Kaggle on the whole survey](https://www.kaggle.com/kaggle-survey-2020) states about gender differences overall (page 5): *"Data science is still suffering from a large gender gap in the workplace, as 82% of users identify as men. This is only a slight change from last year’s results, where 84% of users identified as males. This is the first year we’ve differentiated between “Nonbinary” and “Prefer to self-describe” with each answer coming in around a third of a percent."*

As you can see from both the table above and the radar chart, within the Kaggle community too, the higher you go within organizations, the higher the gender gap.

And this difference shows little variation between the three countries.

In [None]:
questions_descriptive_radars('E4','Q4')

In terms of formal education, there is little difference between the three countries: for the job titles selected, a master's degree (and, with lesser but still significant frequency, a bachelor's degree) is common in both India and the USA, while a PhD has a non-negligible presence in Europe.

For the question about job titles, Q5, which has just two values, a radar chart would be useless:

In [None]:
question_dataframe = create_chart_dataframe('Q5')


As can be seen from the table above, both Europe and United States of America have slightly more Product/Project Managers within the respondents than Business Analysts, while in India there is a larger proportion of Business Analysts.

**Results:** 

* the gender gap shown within the Kaggle Executive Summary of the whole survey is even stronger when focusing on the Business Analyst and Product/Project Manager job titles
* it would be worth investigating why in India the profile seems to be closer to a standard IT project, i.e. more business analysts than product/project managers, while in Europe it is just the opposite: this might be due to a gap in knowledge (hence, more interest in joining Kaggle) or to other factors related to market structure
* it is also interesting to note how in Europe a PhD seems to be almost as common as a Bachelor's degree, while in all three countries a Master's degree is the most common educational level

**Next step:** 

* working on companies characteristics

## ***E5. Compare companies characteristics***



In [None]:
print("\r\n===> SECTION E5. Compare companies characteristics\r\n")
for i in range(5,10):
    print(kagglestory.columns[i]," ",kagglestory_questionstable[i],"\r\n")

In [None]:
questions_descriptive_radars('E5','Q20')

More than the table, it is the first radar chart that clearly shows the differences among the respondents.

While those from the USA are spread across company sizes, in Europe those on Kaggle who answered work predominantly in the typical small company (0-49), while India is predominantly represented (again, just for these two job titles, ***Business Analyst*** and ***Product/Project Manager***) by employees of very large companies.

In [None]:
questions_descriptive_radars('E5','Q21')

On how many individuals are responsible for data science workloads, both India and the USA are oriented toward 20 or more, while Europe, confirming what was shown by the previous question, is geared toward smaller numbers, predominantly 1 or 2 people per company.

In [None]:
questions_descriptive_radars('E5','Q22')

In terms of actual use of machine learning methods within business, the USA is more evenly spread, while in both India and Europe the focus is on experimentation ("recently started" and "exploring").

Also, Europe confirms its lag in the actual integration of machine learning in business, with almost double the number of "No (we do not use ML methods)" answers compared with both India and the USA.

In [None]:
questions_descriptive_radars('E5','Q24')

I left this dimension, compensation, only for future reference: as expected, interpreting the differences in yearly compensation would require integrating further information, e.g. whether the activities are internal or directed toward external customers (such as outsourcing, business process management, consulting, etc.).

In [None]:
questions_descriptive_radars('E5','Q25')

This question actually merges multiple dimensions:
* home and work
* machine learning and cloud computing

which makes it difficult to consider or identify the "why", i.e. the actual allocation of costs toward business ends (e.g. technological alignment, digital transformation, migration to the cloud, new services, etc.)

**Results:** 

* the questions in this group are those that would benefit most from merging the database of this survey with the survey from McKinsey
* what the data show is that, within the Business Analyst or Product/Project Manager respondents, there is a difference between countries, both in terms of structure and, probably, aim (external, i.e. for customers, vs. internal, i.e. business development)
* blended with the answers within the previous sections, could actually influence some considerations (discussed later in this report)
* the information about expenditure on machine learning and/or cloud services as well as the one on remuneration probably would require additional questions to clarify what they really represent

**Next step:** 

* working on the technical characteristics

## ***E6. Compare technical characteristics***



In [None]:
print("\r\n===> SECTION E6. Compare technical characteristics\r\n")

for i in range(11,16):
    print(kagglestory.columns[i]," ",kagglestory_questionstable[i],"\r\n")

The questions selected for this section are mainly to cross-reference potential doubts deriving from the previous sections, e.g. on demographic, distribution of the two roles selected (***Business Analyst*** and ***Product/Project Manager***), and information about the business the respondents are working for.

In [None]:
questions_descriptive_radars('E6','Q6')

The first radar chart highlights the difference between the three countries, with India having a base that is younger (in terms of number of years of experience) and includes more newcomers.

The USA has fewer respondents who never programmed, while both the USA and Europe have a larger group of members with 10 or more years of experience than India; but it is in the oldest range (20+ years) that the numbers for Europe and the USA probably highlight a different market structure.

In [None]:
questions_descriptive_radars('E6','Q11')

This question shows no divergence: it is still to be confirmed (through other information) how much those numbers represent business users or students.
    
Using a laptop or desktop is common also in business, from observation, for at least two reasons:
* using cloud services for data analysis activities still receives lukewarm acceptance (if any) from the business side
* purchasing ad hoc hardware (e.g. a machine learning workstation) is still rare for "line" business uses, and more the domain of (various forms of) research 

In [None]:
questions_descriptive_radars('E6','Q15')

The main difference is shown from 5 years or more of experience in using machine learning methods, three times more common in Europe and USA than in India, albeit both Europe and India show significantly more "newcomers" than USA.

In [None]:
questions_descriptive_radars('E6','Q32')

In both the USA and Europe, Microsoft Power BI is more common (probably thanks to the diffusion of Office 365 in corporate environments), while Tableau is more common in India (probably also due to licensing reasons).

In Europe, along with Power BI, Qlik also has a not-so-small presence.

The absence of Salesforce and SAP Analytics Cloud is probably due, respectively, to the different audience (Kaggle targets those who are focused more on data than on business) and to the relatively recent introduction of the SAP offer (even in Europe it is not so common in SAP customer environments, as it is mainly associated with the transition to the more recent offer, S/4HANA).

In [None]:
questions_descriptive_radars('E6','Q38')

As shown by the first radar, "traditional" tools (Excel etc.) are common everywhere, even considering the audience of Kaggle.

Remember, again, that this representation focuses only on those whose job title (***Business Analyst*** or ***Product/Project Manager***) usually implies more contact with business.

Therefore, "to analyze data" usually implies "sharing analyses".

Until business users also become used to e.g. Jupyter Notebooks or other "live data" tools that are closer to data analysis and statistics, the humble spreadsheet will probably remain more common, at least in Europe and India.


**Results:** 

* for the purpose of this report, three questions deliver interesting highlights: number of years of experience in programming, tools used for analysis, and tools used for business intelligence
* the key point is the difference in demographic represented by the years of experience
* as for the tools, it could be interesting to see also how they are used in data-related activities, e.g. whether Power BI is also used together with Python and R, i.e. integrated with the data science pipeline

**Next step:** 

* summarize results

## ***E7. Assess results and analysis***


In [None]:
print("\r\n===> SECTION E7. Assess results and analysis\r\n")

## ***E7.1. Focus on the questions to use to compare countries***

To summarize the preceding sections, this was the list of questions initially considered:

In [None]:
for i in range(16):
    print(kagglestory.columns[i]," ",kagglestory_questionstable[i],"\r\n")

In this section, the following dimensions will be used for the correlation analysis:
* Country: Europe, India, USA (question Q3)
* Job title: Business Analyst, Product/Project Manager (question Q5)

Following the results of the analysis and visualization of each individual question, the following questions will not be considered, for the reason listed:

| Question | Reason | See |
| --- | --- | --- |
| Q2 | gender gap is a constant | E4 |
| Q23_Part_1 | "influencer" status would require further data | E3 |
| Q25 | question definition obfuscates purpose | E5 |
| Q11 | question definition obfuscates purpose | E6 |


In [None]:
kagglecheck_columns = list(kaggleanalysis.columns)
# remove Q2 gender 
kagglecheck_columns.remove('Q2')
# remove Q23_Part_1 "influencer"
kagglecheck_columns.remove('Q23_Part_1')
# remove Q25 money spent on ML and cloud computing services
kagglecheck_columns.remove('Q25')
# remove Q11 platform
kagglecheck_columns.remove('Q11')

In [None]:
# list of the columns that will be considered in this section
print("\r\nkagglecheck_columns")
print(kagglecheck_columns)

In [None]:
print("\r\nkagglestory_questionstable kagglecheck_columns")
print(kagglestory_questionstable[kagglecheck_columns])

In [None]:
kagglecheck = kaggleanalysis[kagglecheck_columns]
print("\r\nkagglecheck")
print(kagglecheck)

In [None]:
print("data used in this section:")
print("\r\nnumber of rows:",kagglecheck.shape[0])
print("\r\nnumber of columns:",kagglecheck.shape[1])

## ***E7.2. Identify correlations***

Before comparing countries, identify whether the level of correlation between the variables differs significantly between the three countries.

In [None]:
# identifying correlations requires first converting datatypes
print("\r\nkagglecheck.dtypes")
print(kagglecheck.dtypes)

In [None]:
# columns to convert into numeric category codes
list_of_columns = ['Q1','Q4','Q5','Q20','Q21','Q22','Q24',
                   'Q6','Q15','Q32','Q38']

def encode_categories(source):
    # work on a copy, so that the source dataframe is left untouched
    encoded = source.copy()
    # this oneliner solution from: https://stackoverflow.com/a/61761109
    encoded[list_of_columns] = encoded[list_of_columns].apply(lambda col:pd.Categorical(col).codes)
    return encoded

# convert into categories the overall dataset
df = encode_categories(kagglecheck)

# convert into categories europe
df_europe = encode_categories(kagglecheck[kagglecheck['Q3']=='Europe'])

# convert into categories india
df_india = encode_categories(kagglecheck[kagglecheck['Q3']=='India'])

# convert into categories united states of america
df_usa = encode_categories(kagglecheck[kagglecheck['Q3']=='United States of America'])
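
`pd.Categorical(col).codes` numbers the labels in sorted (alphabetical) order and encodes missing values as -1. For ordinal answers such as company size, alphabetical order is not necessarily the natural order, which can affect the sign and size of the correlations. The sketch below shows the difference and one way to impose an explicit order; the labels are illustrative and should be checked against the actual answer options.

```python
import numpy as np
import pandas as pd

sizes = pd.Series(["50-249 employees", "0-49 employees",
                   "10,000 or more employees", np.nan])

# default: categories sorted as strings, NaN encoded as -1
default_codes = pd.Categorical(sizes).codes
print(list(default_codes))

# explicit order (assumed here), from the smallest to the largest company
ordered_labels = ["0-49 employees", "50-249 employees",
                  "10,000 or more employees"]
ordered_codes = pd.Categorical(sizes, categories=ordered_labels,
                               ordered=True).codes
print(list(ordered_codes))
```

In the default encoding, "10,000 or more employees" falls between the two smaller sizes, because strings are compared character by character.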

In [None]:
# function to create heatmap of correlation between columns
def create_heatmap(dfcorr,dftitle):
    # see this page for more information on this version of the heatmap:
    # https://medium.com/@szabo.bibor/how-to-create-a-seaborn-correlation-heatmap-in-python-834c0686b88e
    temptitle = 'Correlation Heatmap for: '+ dftitle
    plt.figure(figsize=(16, 6))
    sns.set_context("poster",font_scale=.7)
    # define the mask to set the values in the upper triangle to True
    mask = np.triu(np.ones_like(dfcorr, dtype=bool))
    heatmap = sns.heatmap(dfcorr, mask=mask, vmin=-1, vmax=1, annot_kws={"size":8}, annot=True)
    heatmap.set_title(temptitle, fontdict={'fontsize':18}, pad=16)
    if(kagglestorysaveimages):
        filenametosave = "images/E7_02_Correlations_" + dftitle + ".svg"
        plt.savefig(filenametosave,format='svg',dpi=150)
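
The mask passed to `sns.heatmap` hides every cell where it is `True`. `np.triu` keeps the upper triangle including the diagonal, so only the lower triangle of the symmetric correlation matrix is drawn. A small sketch with a 3x3 matrix of invented values:

```python
import numpy as np

# Invented symmetric correlation matrix
corr = np.array([[1.0, 0.2, -0.1],
                 [0.2, 1.0, 0.4],
                 [-0.1, 0.4, 1.0]])

# True on and above the diagonal: these cells are hidden by the heatmap
mask = np.triu(np.ones_like(corr, dtype=bool))
print(mask)

# the values that remain visible are those strictly below the diagonal
visible = corr[~mask]
print(visible)
```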


In [None]:
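# Q3 (country) is still text: compute the correlations only on the encoded columns,
# as recent pandas versions no longer drop non-numeric columns silently
df_corr = df[list_of_columns].corr()
create_heatmap(df_corr," Europe, India, USA")
df_europe_corr = df_europe[list_of_columns].corr()
create_heatmap(df_europe_corr," Europe")
df_india_corr = df_india[list_of_columns].corr()
create_heatmap(df_india_corr," India")
df_usa_corr = df_usa[list_of_columns].corr()
create_heatmap(df_usa_corr," USA")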

A simple visual inspection, i.e. of the first column, shows some differences, but there is no strong correlation (positive or negative) anywhere.

The heatmap shows the level of positive (number > 0) or negative (number < 0) correlation between the questions.

While correlation is not causation, it is interesting to look at the first heatmap before moving to the correlation assessment country-by-country.

Therefore, I decided to check visually (through a different use of a heatmap) whether there are significant differences, before moving forward.


In [None]:
kagglecheck_matrix_df = pd.DataFrame(df_corr['Q5'])
kagglecheck_matrix_df.rename(columns={'Q5':'All'},inplace=True)
kagglecheck_matrix_df_europe = pd.DataFrame(df_europe_corr['Q5'])
kagglecheck_matrix_df_europe.rename(columns={'Q5':'Europe'},inplace=True)
kagglecheck_matrix_df_india = pd.DataFrame(df_india_corr['Q5'])
kagglecheck_matrix_df_india.rename(columns={'Q5':'India'},inplace=True)
kagglecheck_matrix_df_usa = pd.DataFrame(df_usa_corr['Q5'])
kagglecheck_matrix_df_usa.rename(columns={'Q5':'United States of America'},inplace=True)
kagglecheck_matrix = pd.DataFrame()
kagglecheck_matrix = kagglecheck_matrix_df.copy()
kagglecheck_matrix['Europe'] = kagglecheck_matrix_df_europe
kagglecheck_matrix['India'] = kagglecheck_matrix_df_india
kagglecheck_matrix['United States of America'] = kagglecheck_matrix_df_usa
# remove the row with Q5 as this is the correlation vs. Q5
kagglecheck_matrix.drop(['Q5'],axis=0, inplace=True)
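
The same side-by-side table can also be assembled in one step by passing the `Q5` column of each correlation matrix to `pd.concat` with `keys`. This is only a sketch of an alternative, shown with two invented 3x3 correlation matrices standing in for `df_corr` and `df_europe_corr`.

```python
import pandas as pd

labels = ["Q1", "Q5", "Q20"]
# Invented correlation matrices, NOT survey results
corr_all = pd.DataFrame([[1.0, 0.10, 0.30],
                         [0.10, 1.0, -0.05],
                         [0.30, -0.05, 1.0]], index=labels, columns=labels)
corr_europe = pd.DataFrame([[1.0, 0.20, 0.25],
                            [0.20, 1.0, 0.02],
                            [0.25, 0.02, 1.0]], index=labels, columns=labels)

# one column per group: correlation of every question with Q5
matrix = pd.concat([corr_all["Q5"], corr_europe["Q5"]],
                   axis=1, keys=["All", "Europe"])

# the Q5 row is the correlation of Q5 with itself, so drop it
matrix = matrix.drop(index="Q5")
print(matrix)
```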

In [None]:
print("\r\nkagglecheck_matrix - correlation values vs. Q5, i.e. jobtitle")
print(kagglecheck_matrix)

In [None]:
plt.figure(figsize=(16, 10))
heatmap = sns.heatmap(kagglecheck_matrix, vmin=-1, vmax=1, annot_kws={"size":8}, annot=True)
heatmap.set_title("Correlation Heatmap vs. Q5 job title, by country", fontdict={'fontsize':18}, pad=16)
# save to an external file if the flag is set at the top of the notebook
if(kagglestorysaveimages):
    filenametosave = "images/E7_02_CorrelationByCountry.svg"
    plt.savefig(filenametosave,format='svg',dpi=150)


In [None]:
def computedelta(df,dfsource,dfdelta):
    return round(((df[dfdelta] - df[dfsource])/df[dfsource])*100,0)

In [None]:
kagglecheck_delta = pd.DataFrame()
kagglecheck_delta['Europe'] = computedelta(kagglecheck_matrix,'All','Europe')
kagglecheck_delta['India'] = computedelta(kagglecheck_matrix,'All','India')
kagglecheck_delta['United States of America'] = computedelta(kagglecheck_matrix,'All','United States of America')

The following bar chart shows how much the three countries differ on each of the following questions, again for those who listed as their job title ***Business Analyst*** or ***Product/Project Manager***:

| Question | |
| --- | --- |
| Q38 | Primary tool for analysis |
| Q32 | Business intelligence tools |
| Q15 | For how many years have you used machine learning |
| Q6  | For how many years have you been writing code |
| Q24 | What is your current yearly compensation |
| Q22 | Does your current employer incorporate machine learning |
| Q21 | Approximately how many individuals are responsible |
| Q20 | What is the size of the company |
| Q4  | What is the highest level of formal education |
| Q1  | What is your age (# years)? |

In [None]:
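# columns sorted explicitly, as the sort_columns argument is no longer available in recent pandas versions
kagglecheck_delta.sort_index(axis=1).plot.barh(figsize=(16,10))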
if(kagglestorysaveimages):
    filenametosave = "images/E7_02_CorrelationByCountryBar.svg"
    plt.savefig(filenametosave,format='svg',dpi=150)

The bar chart shows that, even though there is no strong correlation, there are actually large differences between the three countries in the correlation between the other questions and job title.

The differences might be interesting to explore, at least to see if there are specific characteristics worth following or considering, e.g. for talent development and retention.
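
The percentages in the bar chart are relative differences, computed as `(country - All) / All * 100`. Since the overall correlations are small, a modest absolute change can turn into a very large percentage, which is worth keeping in mind when reading the chart. A worked sketch with invented correlation values (the function body mirrors `computedelta` as defined in this notebook):

```python
import pandas as pd

# Invented correlations with Q5, NOT survey results
matrix = pd.DataFrame(
    {"All": [0.02, 0.20], "Europe": [0.05, 0.23]},
    index=["Q32", "Q20"],
)

def computedelta(df, dfsource, dfdelta):
    # relative difference of dfdelta versus dfsource, in percent
    return round(((df[dfdelta] - df[dfsource]) / df[dfsource]) * 100, 0)

delta = computedelta(matrix, "All", "Europe")
print(delta)
```

In this example both rows differ from the baseline by the same 0.03, yet the relative differences are 150% and 15%.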

## ***E8. Further investigations planned***

Explore the differences, and try to identify patterns at least for the three questions that show the highest level of difference (Q32, Q24, Q20) for the job titles selected.

Then, verify if those patterns apply also to the community of respondents at large, or are just an indication, due to the roles selected, of how in each of the countries there are different approaches.

At the same time, for Europe, as the subdivision into countries is available, identify whether the above-mentioned patterns show differences.

Also, I would like to see if these differences are matched by differences vs. UN SDG for both the three "macro-countries" and the individual European countries.


In [None]:
print("\r\n===> SECTION E8. Further investigations planned\r\n")


# F. APPENDIX


In [None]:
print("\r\n===> SECTION F. APPENDIX\r\n")

## F1. Notes from the Kaggle Executive Summary

**page 2**

Based on responses from 20,036 Kaggle members,
we’ve created this report focused on the 13% (2,675
respondents) who are currently employed as data
scientists

=> see note on page 2

**page 3**

Data science continues to have a heavy gender
imbalance, with most identifying as male

The vast majority of data scientists are under 35 years
old

Over half of data scientists have graduate degrees

More than half of data scientists have less than three
years of experience with machine learning

Scikit-learn is the most popular machine learning tool
in 2020, with over four in five data scientists using it

Tableau and PowerBI are the most popular business
intelligence tools

**page 5**

Similar to 2019 results, data scientists tend to be in their
late 20s or early 30s, with about 60% between 22 and 34.
Only one in five professional data scientists are 40 or older.
There are signs of the numbers skewing even younger, as
generation Z gets more involved. Nearly 7% of data
scientists are aged 18-21, an increase from last year’s 5%.
Though not included in this chart, responses from students
have also increased each year (26.8% in 2020, 21% in 2019,
22.9% in 2018). As these students graduate into the
workforce, we may see future surveys with even younger
data scientists.

**page 6**

Two countries have far more representation in the Kaggle
community. India makes up almost 22% of Kaggle data
scientists, while 14.5% reside in the United States. Brazil is
a distant third, at under 5%.

**page 7**

Graduate degrees continue to be the norm for data
scientists, with over 68% having obtained either a Master’s
or doctoral degree. Fewer than 5% of data scientists have
no degree beyond a high school diploma.

**page 8**

Data science and machine learning are quickly changing,
so it’s no surprise over 90% of Kaggle data scientists
maintain ongoing education. While about 30% take
traditional higher education courses, many more learn
through online materials.

Coursera, Udemy, and Kaggle Learn top the most common
mediums in our survey. Unsurprisingly, many Kaggle data
scientists chose multiple resources in the survey, with an
average of 2.8 mediums selected.

**page 9**

Most Kaggle data scientists have at least a few years of
experience under their belt. Just over 8% of data scientists
have been programming since the 20th century! That’s not
to say there aren’t newcomers, however. Over 9% have
taken up programming in the last year. Just under 2% of
data scientists claim to have never written code at all.

Compared to the global audience, United States data
scientists have significantly greater programming
experience. In the US, 37% have been programming 10 or
more years, versus 22% worldwide.

**page 11-12**

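US data scientists salaries, compared with the global figures:

| Salary range | US | Globally |
| --- | --- | --- |
| 100k-125k | 18.6% | 6.8% |
| 125k-150k | 18% | 4.5% |
| 150k-200k | 21.3% | 4.7% |
| 200k-250k | 8.9% | 1.6% |
| 250k-300k | 1.4% | 0.3% |
| 300k-500k | 3.9% | 0.7% |
| >500k | 0.8% | 0.5% |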

**page 14**

median salary by country:
USA 125k-150k
Germany 70-80k
others: lower

=> check other countries

**page 15-16-17**

on companies employee data science, data science teams, enterprise machine learning adoption: biased by the absence of turnover information to see the weight

**page 18**

interactive development environments

\>74.1% Jupyter lab
33.2% Visual Studio code
31.9% PyCharm
\>31.5% RStudio
21.8% Spyder
\>19.4% Notepad++

**page 19-20**

methods and algorithms usage

\>83.7% linear or logistic regression => i.e. the same I used in the 1980s on DSS
\>78.1% decision trees or random forests
\>61.4% gradient boosting machines (xgboost, lightgbm, etc)
\>43.2% CNN
\>31.4% Bayesian approaches
30.2% RNN
28.2% Neural networks (MLPs, etc)
14.8% Transformer networks (BERT, gpt-3, etc) 
\>7.3% GAN
\>6.5% evolutionary approaches
4.5% other
1.7% none

Python-based tools continue to dominate the machine
learning frameworks. Scikit-learn, a swiss army knife
applicable to most projects, is the top with four in five data
scientists using it. TensorFlow and Keras, notably used in
combination for deep learning, were each selected on
about half of the data scientist surveys. Gradient boosting
library xgboost is fourth, with about the same usage as
2019.

The fifth place tool, PyTorch, climbed above 30%, up from
about 26% in 2019.

The most popular of the tools added to the survey this year
is R-based Tidymodels, reaching over 7 percent.

machine learning framework usage
\>82.8% scikit-learn
\>50.5% tensorflow
\>50.5% keras
\>48.4% xgboost
\>30.9% pytorch
26.1% lightgbm
14.1% caret
13.7% catboost
10% prophet
7.5% fast.ai
7.2% tidymodels
6% h2o 3
2.1% mxnet
3.7% other
3.2% none
0.7% JAX

**page 21**

enterprise cloud computing

There are clearly three big players in cloud computing, and
it’s no surprise who: Amazon Web Services, Google Cloud
Platform, and Microsoft Azure. Notably, more data
scientists are using the cloud overall. In 2019, about 25%
had not adopted cloud computing, which decreased to 17%
in this year’s survey.

48.2% AWS
35.3% GCP
29.4% Microsoft Azure
17.1% none
5.6% IBM Cloud / Red Hat
4.1% other
3% Oracle Cloud
2.9% VMWare cloud
1.9% Salesforce cloud
1.8% SAP Cloud
0.9% Alibaba cloud
0.7% Tencent cloud

**page 23-24-25-26**

enterprise machine learning product usage

Those who use AWS, Google Cloud Platform, or Microsoft
Azure were asked about machine learning (ML) tools in
particular. Over half of these data scientists do not use ML
in the cloud.

Of those with ML usage, Amazon SageMaker was the most
popular answer, followed closely by Google Cloud AI and
ML.

55.2% no/none
16.5% amazon sagemaker
14.8% google cloud ai platform / google cloud ml engine
12.9% azure machine learning studio
8% google cloud vision ai
7.8% google cloud natural language
6.4% azure cognitive services
4.3% amazon rekognition
4.3% google cloud video ai
3.7% amazon forecast 
2.9% other

enterprise big data

Business Intelligence tools help data scientists visualize
their data, but four in 10 do not use one. The majority do
employ BI, with Tableau as the most popular tool. Microsoft
Power BI and Google Data Studio round out the top three.

data scientist usage of business intelligence tools

\>38.8% none
33.3% tableau
\>27% microsoft power bi
9.1% google data studio
6.4% other
5% qlik
2.9% amazon quicksight
2.8% salesforce
2.5% looker
2.1% alteryx
2% SAP analytics cloud
1.4% tibco spotfire
1.2% sisense
0.9% einstein analytics
0.7% domo

Regarding databases, there isn't a clear favorite among
data scientists. MySQL was mentioned most often (35.6%),
followed by PostgreSQL (28.86%) and SQL Server (24.93%).

database usage by data scientists

35.6% mysql
28.9% postgresql
24.9% microsoft sql server
18.7% mongodb
16.5% sqlite
15.4% none
13.5% google cloud bigquery
12.9% oracle database
9.3% amazon redshift
9.1% microsoft azure data lake storage
7.9% other
6.7% amazon athena
5.9% google cloud sql
5.6% snowflake
5.1% amazon dynamodb
4.2% microsoft access
3.5% ibm db2
2.8% google cloud firestore

As with machine learning overall, many data scientists
(33%) do not use auto ML tools. Google Cloud AutoML saw
gains from last year’s survey, nearly 14% versus 6% in 2019.

13.9% google cloud automl
9.5% h2o driverless ai
8.4% datarobot automl
6.5% databricks automl

**page 27**

Among data scientists who use tools to manage machine
learning experiments, TensorBoard is a clear favorite (over
21%). The closest competitor is Weights & Biases, with 6%.
However, the vast majority (68%) of data scientists do not
use special tools to keep track of and manage their ML
experiments.

Usage of machine learning experiment tools

68.1% no/none
21.6% tensorboard
6% weights&biases
5.4% other
3.1% trains
2.3% neptune
1% domino model monitor
0.9% polyaxon
0.8% guild.ai
0.7% comet.ml
0.6% sacred+omniboard


## F2. Notes for the McKinsey Survey

**cover**

"Since our 2019 survey, artificial intelligence has become more of a
revenue driver. Companies earning the most from AI plan to invest
even more in response to COVID-19—and perhaps widen the gap
with others."

**page 2**

"Overall, half of respondents say their organizations
have adopted AI in at least one function.

While AI adoption was about equal across regions last
year, this year’s respondents working for companies
with headquarters in Latin American countries and
in other developing countries are much less likely
than those elsewhere to report that their companies
have embedded AI into a process or product in at
least one function or business unit. By industry,
respondents in the high-tech and telecom sectors
are again the most likely to report AI adoption, with
the automotive and assembly sector falling just
behind them (down from sharing the lead last year).

The business functions in which organizations
adopt AI remain largely unchanged from the 2019
survey, with service operations, product or service
development, and marketing and sales again taking
the top spots"


**page 2**

"The use cases
that most commonly led to cost decreases are
optimization of talent management, contact-center
automation, and warehouse automation"

**page 4**

"This year we asked about adoption of deep
learning—a type of machine learning that uses
neural networks and can sometimes deliver superior
results—for the first time. Just 16 percent of
respondents say their companies have taken deep
learning beyond the piloting stage. Once again, high-tech
and telecom companies are leading the charge,
with 30 percent of respondents from those sectors
saying their companies have embedded deep-learning
capabilities."

From the comment by Michael Chui, partner, McKinsey Global Institute, San Francisco: "However, there
was a bit of a decrease in bullishness this
year, perhaps reflecting the passing of
AI’s hype phase. We do think AI is worth
the investment, but it requires effective
execution to generate significant value,
particularly at enterprise scale."

Besides the usual points about performance and aligned leadership (up to championing AI initiatives directly from the C-suite), as in other technological waves I have witnessed since the 1980s, it is not just a matter of budgets, or of the share of the ICT budget: it is a matter of internalizing skills and mindset.

**page 5**

"High performers also
tend to have the ability to develop AI solutions
in-house—as opposed to purchasing solutions—
and they typically employ more AI-related talent,
such as data engineers, data architects, and
translators, than do their counterparts. They
also are much more likely than others to say their
companies have built a standardized end-to-end
platform for AI-related data science, data
engineering, and application development."

from commentary by Bryce Hall, associate partner, Washington DC: "Many executives now realize that
AI solutions typically need to be developed
or adapted in close collaboration with business
users to address real business needs
and enable adoption, scale, and real value
creation. As a result, we see companies
increasingly developing a bench of AI
talent and launching training programs to
raise the overall analytics acumen across
their organizations."

Other dimensions are discussed in the report, e.g. the risks associated with AI.

These cover reputational as well as business risks, such as the misuse of recommendation systems in business decision-making. This converges with the current drive toward ensuring transparency via explainability, in terms understandable to business users.

Unfortunately, as noted by Roger Burkhardt, partner, New York, on page 10: "Overall, however, the results are
concerning. While some risks, such as
physical safety, apply to only particular
industries, it’s difficult to understand why
universal risks aren’t recognized by a much
higher proportion of respondents."

The report also notes a curious side effect, probably linked to the paradigms used (e.g. learning from the past) at a time (the COVID-19 pandemic and its associated phases of universal business lockdown) when the drivers of a business's past performance are no longer relevant: "Generally, respondents from companies that have
adopted more AI capabilities are more likely to report
seeing AI models misperform amid the COVID-19
pandemic than others are. Responses indicate
that high-performing organizations, which tend to
have adopted more AI capabilities than others, are
witnessing more misperformance than companies
seeing less value from AI. These high-performing
organizations’ models were particularly vulnerable
within marketing and sales, product development,
and service operations (Exhibit 6)—the areas where
AI adoption is most commonly reported."

**data reported on page 12**

"
Respondents from AI high performers most often say their models have
misperformed within the business functions where AI is used the most.
32% Marketing and sales
21% Product and/or service development 
19% Service operations
"

**page 13, methodology**

"
The online survey was in the field from June 9 to
June 19, 2020, and garnered responses from 2,395
participants representing the full range of regions,
industries, company sizes, functional specialties,
and tenures. Of those respondents, 1,151 said their
organizations had adopted AI in at least one function
and were asked questions about their organizations’
AI use. To adjust for differences in response rates,
the data are weighted by the contribution of each
respondent’s nation to global GDP. McKinsey also
conducted interviews with executives between May
and August 2020 about their companies’ use of AI.
All quotations from executives were gathered during
those interviews.
"
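The GDP weighting mentioned in the methodology can be illustrated with a minimal sketch. The report does not spell out the exact scheme, so this is only one plausible reading: each country's answers count in proportion to its share of global GDP, regardless of how many people from that country responded. The function name `gdp_weighted_share`, the country labels, and all the numbers are hypothetical and are not taken from the McKinsey data.

```python
from collections import defaultdict


def gdp_weighted_share(responses, gdp_share):
    """Return the GDP-weighted share of respondents who answered True.

    responses: list of (country, answer) tuples, one per respondent.
    gdp_share: dict mapping country -> assumed share of global GDP.
    Each respondent gets the weight gdp_share[country] / (respondents
    from that country), so a country's total weight equals its GDP share.
    """
    per_country = defaultdict(int)
    for country, _ in responses:
        per_country[country] += 1

    total_weight = 0.0
    weighted_yes = 0.0
    for country, answer in responses:
        weight = gdp_share[country] / per_country[country]
        total_weight += weight
        if answer:
            weighted_yes += weight
    return weighted_yes / total_weight


# Hypothetical example: country "A" is over-represented among respondents
# but has the smaller assumed GDP share.
example_responses = [("A", True), ("A", True), ("A", True), ("A", False), ("B", False)]
example_gdp_share = {"A": 0.25, "B": 0.75}

unweighted = sum(1 for _, answer in example_responses if answer) / len(example_responses)
weighted = gdp_weighted_share(example_responses, example_gdp_share)
```

In this made-up case the unweighted share is 0.6, while the weighted share drops to 0.1875, because the single respondent from "B" carries three quarters of the total weight. This is why weighted survey percentages can differ visibly from raw counts.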

In [None]:
if kagglestorysaveimages:
    # Restore the original stdout and close the text file that captured the "print" output
    sys.stdout = orig_stdout
    f.close()
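
The cell above is the closing half of a redirect-and-restore pattern: when image saving is enabled, "print" output goes to a text file, and at the end the original stream is put back. A minimal, self-contained sketch of the same pattern follows. The helper name `run_with_captured_prints` is hypothetical and is not part of this notebook; the `try`/`finally` is an addition so that stdout is restored even if the wrapped code raises.

```python
import io
import sys


def run_with_captured_prints(func, log_file):
    """Call func() with print output redirected to log_file, then restore stdout."""
    orig_stdout = sys.stdout      # remember the stream that was active before
    sys.stdout = log_file         # from here on, print() writes to log_file
    try:
        func()
    finally:
        sys.stdout = orig_stdout  # always restore, even if func() fails


# Usage with an in-memory buffer standing in for the text file.
buffer = io.StringIO()
run_with_captured_prints(lambda: print("captured line"), buffer)
captured_text = buffer.getvalue()
```

With a real file, the caller would open it before the call and close it afterwards, as the notebook does with `f.close()`.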