# Learning about machine learning: Kaggle member survey 2021 analysis


**TABLE OF CONTENTS**<br>

* [1. Introduction](#chapter_1)
    - [1.1 The possibilities of ML](#chapter_1_1)
    - [1.2 The limits of ML](#chapter_1_2)
    - [1.3 The ethical dimension of ML](#chapter_1_3)
    - [1.4 Research question](#chapter_1_4)
* [2. Data Exploration](#chapter_2)
* [3. ML and Age](#chapter_3)
* [4. ML and Gender](#chapter_4)
* [5. ML and Location](#chapter_5)
* [6. Population and ML](#chapter_6)
* [7. ML Goes Global](#chapter_7) 
* [8. Conclusion](#chapter_8)
    - [8.1 The present of ML](#chapter_8_1)
    - [8.2 The future of ML](#chapter_8_2)

In [None]:
# import module
from IPython import display

# display image from url
# (IPython's Image has no style argument, so only the width is set here)
display.Image(url="https://raw.githubusercontent.com/mojopriest/mojopriest.github.io/master/pratchett.jpg", width=600)

### 1. Introduction <a class="anchor" id="chapter_1"></a>

#### 1.1 The possibilities of ML <a class="anchor" id="chapter_1_1"></a>

This notebook, and particularly its research question, first came into being thanks to my Japanese language studies, as I was watching the Japanese drama series *Japan Sinks: People of Hope* (*Nippon Chinbotsu: Kibo no Hito*) the other night (*link:* __[series Wikipedia article](https://en.wikipedia.org/wiki/Japan_Sinks:_People_of_Hope)__). The series - as its revealing title suggests - is the latest reboot of the 1973 disaster novel by **Sakyo Komatsu** with the same title (*link:* __[novel Wikipedia article](https://en.wikipedia.org/wiki/Japan_Sinks)__).

As a published novelist myself,  my original motive for learning more about data analysis was strictly personal: I was in the middle of writing a new book and needed information no one else seemed to be in possession of. Thanks to for example the well-constructed Kaggle crash courses, I now know a little something also about machine learning (ML) methods. 

However, I noticed that even in the updated *Japan Sinks* drama series, analyzing and modeling geological data is portrayed as something almost magical, something only a handful of dedicated specialists can do. Most people in the series still rely on either the Japanese government or journalists and news agencies to passively receive fateful information about the future of their sinking homeland (*not a spoiler thanks to series title*).

Contrary to this, as Kaggle has effectively proven with its own existence over the past few years, ***skills concerning different aspects of data analysis can actually have an empowering effect***. Instead of relying on others for facts and hypotheses, people can now do it all themselves. When put in professional use, these methods may potentially lead to for example life-saving machine learning and artificial intelligence (AI) applications for better breast cancer screening (*link:* __[Forbes article](https://www.forbes.com/sites/jenniferhicks/2021/10/06/heres-how-artificial-intelligence-can-help-predict-breast-cancer-risk/?sh=351a9dbd4bec)__).

Thus in my mind there's no doubt that machine learning can be applied in numerous ways that significantly benefit people's lives. To me the value of Kaggle lies in the fact that ***increasing data analysis and machine learning knowledge also increases the ability for critical thinking, i.e. recognizing misinformation, deliberately biased interpretations and falsely applied methods***. As the recent COVID-19 vaccination discourse has shown, even these skills may end up being life-saving.


#### 1.2 The limits of ML <a class="anchor" id="chapter_1_2"></a>

One of the most talked about recent topics concerning data science has been setting international regulations for using AI and ML, as well as safeguards on applications considered "risky" (*link:* __[AP News article](https://apnews.com/article/artificial-intelligence-technology-business-europe-ursula-von-der-leyen-19ec99f8a970fe14a99a84d52017ec22)__). Also, there is the existing debate on the "inbound" limits of AI and computers in general. As **Robert J. Marks** argued to **Larry Linenschmidt** in his podcast episode (*link:* __[podcast article](https://mindmatters.ai/2020/08/six-limitations-of-artificial-intelligence-as-we-know-it/)__), these limitations are as follows:
<br><br>
- ***Computers (and AI) are limited to algorithms.***<br>
- ***Computers are faster than before but not more intelligent.***<br>
- ***Computers only make use of data which they’ve been presented.***<br>
- ***Computers don’t experience things or make judgements.***<br> 
- ***Computers do exactly what they were programmed to do.***<br>

If one thinks about for example the case of *Japan Sinks*, more often than not in the series plot AI and ML possess almost supernatural powers similar to what electricity was once considered to have, or what nuclear technology made people imagine in 1950s classic sci-fi stories (*some half of all Marvel's superheroes seem to have been created by different nuclear accidents*).

***This mental image of AI and ML as a “go-to-solution” to every possible problem is largely fictional. Yet this line of thinking has a profound effect on what people expect AI and ML to do for them in reality.*** As another example, just think about the mental image of flying cars. Although no such inventions have so far existed in reality, many people are genuinely disappointed that this feature of flying is not included in their current commuter vehicles as a default setting.

#### 1.3 The ethical dimension of ML <a class="anchor" id="chapter_1_3"></a>

There is also the ethical aspect of applying AI and ML. Because flying cars are not available, let's imagine a logistical ML application designed for transferring groups of people from place A to place B. For example in commuting or post-pandemic travel industry, well-functioning global logistics undoubtedly result in better customer satisfaction as well as in various cost savings (fuel and energy consumption etc.).

*But what would happen if the very same application were given the task of efficiently transferring people to concentration camp facilities?*

Probably the application would calculate the most direct routes, efficient hubs and correct connections without any problem, since the data would not include the actual experience or moral context of the task. There would be no need for the computer to defend its actions, since the computer would only do exactly what it was programmed to do. However, in only slightly different words, the very same argument was presented - not by a computer but by a human defendant - in the Nuremberg trial (*link:* __[Nuremberg trial article](https://famous-trials.com/Nuremberg)__):

**"*I was given this assignment which I could not refuse.*"** 
**(Fritz Sauckel, Chief of Slave Labor Recruitment)**

Moreover, presented only with logistical data, an AI judge application would probably have concluded that the very defendant was not guilty, since the AI app would not have found any procedural errors. Based on this, we can conclude that there can be both positive and negative aspects regarding the use of AI and ML. ***The decision of how and when ML applications are used still falls in the hands of the people***, no matter how evolved a system we are talking about. After all, ***the computer cannot refuse the task it is given***.

The appropriate role of humans in all this is not to imitate AI and ML strengths but, rather, to compensate for their weaknesses. In the end computing power is all about calculations, but that’s not how the human brain or human life in general works. This was effectively proven for example by the protagonist **Arthur Dent** in **Douglas Adams's** novel *The Hitchhiker’s Guide to the Galaxy*, when Arthur crashed the most powerful computer in the universe by giving it the menial task of making a decent cup of tea. Our thinking, laced sometimes with mere suspicions and vague speculations, does not comply with the rigid terms of machine learning, and neither should this be the case. In the end, ***it is the people who came up with machine learning, not the other way around***.

**But what was the specific research question that came to my mind while watching** 
***Japan Sinks ?*** 

#### 1.4 Research question <a class="anchor" id="chapter_1_4"></a>

In last year's Kaggle Survey competition I basically analyzed where Kaggle users are *not* from, what age are they *not* etc. This year - already preoccupied with the above themes concerning machine learning - I decided to go for a slightly different approach. 

***My main point of interest in this notebook is the 15th question (Q15) in the survey ("For how many years have you used machine learning methods?")***. I was intrigued by the fact that *there were no additional questions regarding specific purposes of people using ML*. Secondly, ***the first answer option ("I do not use machine learning methods") seemed to me like a potential "digital divide", defined here as lack of adequate physical, economical, educational etc. resources to information technology.*** 

If the use of various AI and ML applications is to rapidly increase in the near future as it seems, those on the other side of that divide will inevitably "sink" or be defined as "0" for example by any job recruit ML application compared to those getting "1" in the same appraisal. Also, the programmed machine learning application would - being completely unaware of it - be guilty of applicant favoritism based on its own existence. ("*My name is ML. Do you know me or how I work? No? Ok, next applicant, please.*")

Based on all this, ***my research question is to find out if there are common factors between Kaggle members not familiar with machine learning methods***. Conversely, ***I will also search for potential similarities in Kaggle users with at least some ML methods experience*** by their own admission. To attain this, I will deliberately create the aforementioned digital divide and classify Kaggle members as "0" and "1" when it comes to ML knowledge.

That's already a full plate, so let's start enjoying, or *itadakimasu*, as they say in Japan before a delicious meal.
<br>
<br>
***November 5th, 2021***<br>
***Jari Peltola***

<br><br>
******

### 2. Data Exploration <a class="anchor" id="chapter_2"></a>

In [None]:
# import modules
import math
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
#%matplotlib inline

import seaborn as sns
import plotly.express as px

# enable showing all columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

After importing the modules, we upload the survey data to a dataframe **df**.

In [None]:
# upload data to dataframe
df = pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv", low_memory = False)

# show dataframe's first three rows
df.head(3)

It's always good to know the overall shape (number of rows and columns) of the dataframe right from the start.

In [None]:
# get dataframe shape
shape = df.shape
print('\nDataFrame Shape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])

To preserve the original dataframe as such, next we make a copy of it and call it rather unimaginatively **df_copy**. As we don't need the actual survey questions in our analysis, we also select the 25973 *last* rows from the survey dataset, leaving the first row including questions out. 

In [None]:
# make a copy of original dataframe
df_copy = df.copy()

# select all other rows except the first (the row holding the question texts)
df_copy = df_copy.iloc[1:]

Our first point of interest in the dataframe is the aforementioned **Q15** column. Let's see the unique values in it. 

In [None]:
# check unique values
df_copy['Q15'].unique()

In this case we are interested only in *whether* a Kaggle member uses machine learning methods, not *how long* those methods have been in use. Therefore it is practical to create a new column with value "0" if the person does not possess ML experience and "1" if he or she is familiar with the subject matter. 

There are several ways (lambda function etc.) for doing this, but in this case I will make use of the original unique answer choices we just printed. As the string "year" is included only in answers with ML experience, those answers will be mapped as "1" based on the string. 

Furthermore, I would still like to keep the answers with null values (nan) in ML experience column because of the data included in other columns. Here I will also make the general assumption that no one answering the Kaggle survey wanted to keep their true ML experience as a secret, and map those null values as "0" as well.

***In some special cases hiding one's ML abilities may actually be necessary***. After the latest Taliban takeover, in Afghanistan hundreds of girls and women continue to learn coding online and in hidden makeshift classrooms (*link:* __[Al Jazeera article about Afghanistan](https://www.aljazeera.com/news/2021/10/29/afghanistan-girls-coding-underground-taliban-education?sf153637013=1)__). As one Afghan woman described her situation in the article: “*when online, you can be locked at home and explore the virtual world without any hesitation, without worrying about geographical boundaries. That’s the beauty of technology.*” 

Taking a look at the other columns while keeping in mind the research question, ***the member age, gender and location columns would seem to provide the most potentially interesting information***. As for other columns, for example ***social media use may not be a question of personal choice, since the access to some social media platforms is restricted in certain parts of the world*** (*link:* __[Al Jazeera article about Belarus](https://www.aljazeera.com/news/2021/10/29/belarus-classifies-social-media-channels-as-extremist?sf153665818=1)__).

Also, ***income as a universal criterion would not work since the value of money is different depending on location***. For example, the GDP per capita in Ethiopia is 2772 USD (*link:* __[Ethiopia Wikipedia article](https://en.wikipedia.org/wiki/Ethiopia)__), whereas the same figure in Norway (*link:* __[Norway Wikipedia article](https://en.wikipedia.org/wiki/Norway)__) is 64856 USD, rendering the two figures basically incomparable with each other. 

As for occupation, without any analysis a strong hypothesis can be made for example about the positive correlation of data scientists and machine learning knowledge. Thus studying this feature feels like writing a young adult novel *The Adventures of Captain Obvious*, so it's better to move on.

The different questions regarding software applications would indeed give us a clearer picture of ML knowledgeable members, since the applications in question are used in machine learning. However, as our research question concerns the divide between "ML members" and "non-ML members", those survey answers are in our case less relevant.

The survey dataset does include an abundance of different kinds of data, so for clarity's sake I will next create a new dataframe **df_ml** with **IsML** as its first column.  

In [None]:
# select rows based on existing data
is_ml = df_copy['Q15'].str.contains('year')

# map values as 0 and 1
is_ml = is_ml.map({True: 1, False: 0})

# create new dataframe with empty column
df_ml = pd.DataFrame(columns=["IsML"])

# create row values for the new column
df_ml['IsML'] = np.array(is_ml)

# change all values to integer
# replace nan values with 0
df_ml['IsML'] = df_ml['IsML'].fillna(0).astype(int)

df_ml.head()
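
As a minimal sketch on made-up answers (the strings in `toy_q15` are illustrative and not copied from the survey), the same 0/1 mapping could also be written in one step by passing `na=False` to `str.contains`, so that missing answers become 0 directly:

```python
import numpy as np
import pandas as pd

# toy answers, assumed for illustration only
toy_q15 = pd.Series([
    "1-2 years",
    "I do not use machine learning methods",
    np.nan,
    "Under 1 year",
])

# na=False treats a missing answer as "no match", which then becomes 0
toy_is_ml = toy_q15.str.contains("year", na=False).astype(int)

print(toy_is_ml.tolist())
```

This is only an alternative spelling of the mapping above; the assumption that a missing answer means "no ML experience" stays exactly the same.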

Now we have Kaggle members divided in ones and zeroes based on their self-reported ML experience. First it would be interesting to know the percentage of Kaggle people familiar with ML methods.

In [None]:
# calculate ML user and non-user percentages
ml_perc = df_ml['IsML'].value_counts(normalize=True) * 100

ml_perc

We can see that ***over 76 percent of Kaggle members included in the survey dataset have at least some experience with ML methods***. To me this is an important discovery in itself, since it suggests that ***people on Kaggle either already possess ML method knowledge on arrival or start to acquire it right after that***. 

It would definitely be interesting to see if the time spent on Kaggle has something to do with ML knowledge. Unfortunately there was no data in the survey for example about the exact time people have been affiliated with Kaggle, so there is no meaningful way of learning more details on this.

Considering the nature of Kaggle as a data-oriented community, the widespread ML knowledge level is probably not a big surprise though. In a similar manner there are hordes of sports enthusiasts on their own fan sites, metalheads on heavy rock online forums etc. 

Regarding the research question, this fact does make the dataset inevitably biased, since there is significantly more data on ML experienced people. However, since this is the existing reality on Kaggle and the analysis is all about Kaggle members, this aspect will not be treated as a deficiency from here on. 

<br><br>
******

### 3. ML and Age <a class="anchor" id="chapter_3"></a>

Next we take a look at the age column **Q1** in the survey.

In [None]:
# check unique values
df_copy['Q1'].unique()

In the original dataset, age groups are in format "xx-yy". This does not enable treating them as what they actually are: numerical entities.

By changing the "-" character to a common dot, age groups become decimal figures in format "xx.yy". Finally, by replacing the "+" character with double zero in the "70+" category, all groups will maintain their preferred order also after the transformation. 

For clarity's sake, we will store the freshly formatted column values in our dataframe as **Age** and change all the column values to numeric format. 

In [None]:
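# replace characters
# regex=False treats "-" and "+" as literal characters ("+" alone is not a valid regular expression)
age_ml = df_copy.Q1.str.replace('-', '.', regex=False)
age_ml = age_ml.str.replace('+', '.00', regex=False)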

# create new column
df_ml['Age'] = np.array(age_ml)

# the Age column is still in string format
# convert column datatype to float
df_ml["Age"] = df_ml.Age.astype(float)

df_ml.head()
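
To make the transformation concrete, here is a small sketch on three hand-picked age labels (the `toy_age` series is an assumption for illustration). It shows that the converted values keep the original order of the age groups:

```python
import pandas as pd

# three age labels in the survey's "xx-yy" / "70+" format
toy_age = pd.Series(["18-21", "22-24", "70+"])

# regex=False makes "-" and "+" plain characters instead of regex syntax
toy_age_num = (
    toy_age.str.replace("-", ".", regex=False)
           .str.replace("+", ".00", regex=False)
           .astype(float)
)

print(toy_age_num.tolist())
```

The decimals carry no numeric meaning of their own here; they only serve as sortable labels for the groups.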

Let's find out more about the size of different age groups in visual form.

In [None]:
# set plot size etc.
sns.set(rc={'figure.figsize':(16.7,8.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(x = 'Age',
              data = df_ml,
              order = df_ml['Age'].value_counts().index)

# set plot title etc.
plot.axes.set_title("Kaggle Member Survey 2021: age groups",fontsize=20)
plot.set_xlabel("Age group",fontsize=18)
plot.set_ylabel("Number of answers",fontsize=18)
plot.tick_params(labelsize=14)

# show plot
plt.show()

As the same data divided by percentage shows, some two thirds of Kaggle members who took part in the survey are of age 35 years or less.  

In [None]:
# calculate age value count percentage
ml_age = df_ml['Age'].value_counts(normalize=True) * 100

ml_age

It would be interesting to know how the age groups are distributed among members with and without machine learning knowledge, so let's see those figures next.

In [None]:
# calculate age group percentages within the 'ML=0' and 'ML=1' groups
ml_age_group = df_ml.groupby('IsML')['Age'].value_counts(normalize=True) * 100

ml_age_group

It seems that only the youngest age group (18-21) is overrepresented among members *lacking* ML method knowledge. Among "non-ML" Kaggle members, the proportion of 18-21-year-olds is about 25 percent, whereas their share of all Kaggle members who answered the survey is somewhat lower (about 19 percent). 

However we can generally conclude that ***on Kaggle age is not a major defining factor when it comes to ML method knowledge***. As mentioned, ***people arriving at Kaggle are most likely following their topic of interest and are therefore often "pre-equipped" with at least some ML knowledge***.
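
The two possible readings of "ML knowledge by age" are easy to mix up, so here is a hedged sketch on a tiny made-up dataframe (`toy` and its values are assumptions) that computes both: the age distribution within each ML group, and the share of ML users within each age group.

```python
import pandas as pd

# toy data, assumed for illustration only
toy = pd.DataFrame({
    "IsML": [0, 0, 1, 1, 1, 1, 1, 1],
    "Age":  [18.21, 22.24, 18.21, 22.24, 22.24, 25.29, 25.29, 25.29],
})

# share of each age group within the ML=0 and ML=1 groups (each row sums to 100)
age_within_ml = pd.crosstab(toy["IsML"], toy["Age"], normalize="index") * 100

# share of ML users within each age group (mean of a 0/1 column)
ml_within_age = toy.groupby("Age")["IsML"].mean() * 100

print(age_within_ml.round(1))
print(ml_within_age.round(1))
```

The comparison in the text uses the first reading: how large a slice each age group takes among "non-ML" members versus among all members.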

<br><br>
******

### 4. ML and Gender <a class="anchor" id="chapter_4"></a>

Let's take a closer look at ML knowledge when gender (column **Q2**) is considered. First we need the overall gender percentages of Kaggle members according to survey answers.

In [None]:
# calculate gender percentage
gender_perc = df_copy['Q2'].value_counts(normalize=True) * 100

gender_perc

Almost 80 percent of people answering the survey defined themselves as men, as we can see. Earlier we found out that over 76 percent of Kaggle members possess some level of ML method knowledge. Next we will find out the correlation between gender and ML knowledge. The "Nonbinary" and "Prefer to self-describe" choices combined make up about 0.5 percent of total answers, and together with "Prefer not to say" the percentage of these three choices is about 2 percent of all answers.

Based on this, next the gender data will be remodeled in our dataframe to a column **IsMan** with binary values 0 and 1. Again we can attain this by making use of the string data in survey answers.

In [None]:
# choose rows based on existing data
is_man = df_copy['Q2'].str.contains('Man')

# map values to 0 and 1
is_man = is_man.map({True: 1, False: 0})

# create new column
df_ml['IsMan'] = np.array(is_man)

# change values to integer
df_ml['IsMan'] = df_ml['IsMan'].astype(int)

df_ml.head()

Now we can find out the user percentages divided by ML knowledge and gender.

In [None]:
# calculate gender percentages within the 'ML=0' and 'ML=1' groups
ml_gender_group = df_ml.groupby('IsML')['IsMan'].value_counts(normalize=True) * 100

ml_gender_group

As we can see, Kaggle members with ML method knowledge are divided similarly to the overall gender percentages. Among those with no ML experience, the percentage of men drops from 80 percent to 74 percent, which indicates that **among women and other "not-men" gender choices the lack of ML method knowledge is slightly more common**. However, the difference is not dramatic.

<br><br>
******

### 5. ML and Location <a class="anchor" id="chapter_5"></a>

Let's take a look at whether the place of residence (column **Q3**) makes any difference regarding our research question.

In [None]:
# check unique locations
df_copy['Q3'].unique()

Being a Finnish person myself, I definitely felt a pinch in my conscience when first taking a look at the country data. As I missed the survey deadline myself and only countries or territories with 50 respondents received their own location tag *(see Kaggle dataset appendix "Survey Methodology")*, I unwittingly may have with my non-actions erased my own country from the unique locations list. On the other hand, analyzing self-created data could also be considered as a bias, like writing a review of your own novel, so I will soldier on with the data at hand.

Next we will check the percentages of different unique locations and print out the top ten locations based on value count.

In [None]:
# calculate location percentages
location_perc = df_copy['Q3'].value_counts(normalize=True) * 100

# print first ten list items
location_perc[:10]

As the total number of survey answers was 25973, we can quickly calculate that one percent of answers equals roughly 260 members. Excluding the "Other" location, only 23 countries in the world have a share bigger than one percent among Kaggle survey answers. Taking a look at the two value counts below, we can see that the number of answers ranges from 43 (Iraq and Ethiopia) to 7434 (India).

In [None]:
# check unique location count
location_count = df_copy['Q3'].value_counts()

# print first three list items
location_count[:3]

In [None]:
# print last three list items
location_count[-3:]
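
The one percent threshold can be sketched as follows. The total of 25973 comes from the text above, whereas the `toy_loc` series is made up purely to show the filtering step:

```python
import pandas as pd

total_answers = 25973              # number of survey answers stated in the text
one_percent = total_answers / 100
print(round(one_percent))          # roughly 260 members

# toy location column, assumed for illustration only
toy_loc = pd.Series(["India"] * 120 + ["Japan"] * 79 + ["Iraq"] * 1)
toy_share = toy_loc.value_counts(normalize=True) * 100

# keep only locations whose share exceeds one percent
above_one = toy_share[toy_share > 1]
print(above_one.index.tolist())
```

On the real data the same filter would be applied to **location_perc**, leaving out the "Other" row before counting.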

Next we will modify the original survey data before bringing it into our own dataframe. For better compatibility as well as readability, some individual country names are replaced with new versions. Also, the "Other" and "I do not wish to disclose my location" options in the original data are both combined under "Other". 

In [None]:
# replace selected strings
location = df_copy['Q3'].replace({
    'United States of America': 'United States',
    'Viet Nam': 'Vietnam',
    'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
    'Iran, Islamic Republic of...': 'Iran',
    'Republic of Korea': 'South Korea',
    'I do not wish to disclose my location': 'Other',
})

# remove parentheses, then shorten the Hong Kong label
location = location.replace(to_replace=r'[()]', value='', regex=True)
location = location.replace('Hong Kong S.A.R.', 'Hong Kong')

# assign by row position: df_copy's index starts at 1 while df_ml's starts at 0,
# so assigning the Series directly would shift every location by one row
df_ml['Location'] = location.to_numpy()
df_ml['Location'] = df_ml['Location'].fillna('Other')

In [None]:
# check unique locations
df_ml['Location'].unique()
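
Because **df_copy** keeps the survey's row labels starting from 1 (its first row was left out) while **df_ml** was built with a default index starting from 0, it is worth keeping in mind that pandas aligns a Series on index labels, not on row position, when it is assigned as a new column. A minimal sketch with made-up data (the names `source` and `target` are hypothetical):

```python
import pandas as pd

# source Series whose index starts at 1, like a dataframe after leaving out its first row
source = pd.Series(["India", "Japan", "Brazil"], index=[1, 2, 3])

# target dataframe with a default index starting at 0
target = pd.DataFrame({"IsML": [1, 0, 1]})

target["ByLabel"] = source                 # aligned on index labels
target["ByPosition"] = source.to_numpy()   # aligned on row position

print(target)
```

In the label-aligned column the first row is empty and the last source value is lost, whereas the position-aligned column keeps all three values in their original order.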

For clarity, a new dataframe **df_location_temp** with the unique locations as index is created.

In [None]:
# calculate location percentages
list_1 = df_ml['Location'].value_counts(normalize=True) * 100

# percentages to dataframe with named columns
# (naming the index and the values explicitly does not depend on
# how the installed pandas version labels the value_counts result)
df_location_temp = list_1.rename_axis('Location').reset_index(name='Location_Perc')

# round values to one decimal
df_location_temp['Location_Perc'] = df_location_temp['Location_Perc'].round(decimals=1)

df_location_temp.head()

We will need to select only users with ML experience. Again for clarity's sake we will do this in a separate dataframe **df_is_ml**.

In [None]:
# new dataframe with users having ML experience (IsML == 1)
df_is_ml = df_ml[df_ml['IsML'] == 1]

df_is_ml.head()

Next we can calculate location percentage based on users with ML experience. The result is dataframe **df_location_temp_two** including member location and ML knowledge percentage as columns.

In [None]:
# calculate location percentages
list_2 = df_is_ml['Location'].value_counts(normalize=True) * 100

# percentages to dataframe with named columns
# (naming the index and the values explicitly does not depend on
# how the installed pandas version labels the value_counts result)
df_location_temp_two = list_2.rename_axis('Location').reset_index(name='Location_IsML')

# round values to one decimal
df_location_temp_two['Location_IsML'] = df_location_temp_two['Location_IsML'].round(decimals=1)

df_location_temp_two.head()

Finally, the dataframes **df_location_temp** and **df_location_temp_two** are merged to a newly created **df_ml_location** dataframe by using the identical **Location** column included in both dataframes. The **Location_Perc** column will include the overall percentage of Kaggle members per location, whereas the **Location_IsML** column will tell us that same location's share of all Kaggle members with ML knowledge.

In [None]:
# merge dataframes
df_ml_location = pd.merge(df_location_temp, df_location_temp_two, on=['Location'],how = 'left')

df_ml_location.head()

At least in the countries with the most Kaggle members the figures are pretty much identical. As we can see, for example Japan's overall share of Kaggle members is 3.5 percent, and its share of ML knowledgeable members on Kaggle is 3.6 percent, which is well within the range of statistical error.

Next we subtract the **Location_Perc** column value from the **Location_IsML** column value and store the result in a new column **ML_Surplus**. If the new column value is positive, it means that the location's share of ML knowledgeable members surpasses the same location's overall relative member representation on Kaggle, i.e. creating an "ML surplus".

In [None]:
# calculate new column value
df_ml_location['ML_Surplus'] = (df_ml_location['Location_IsML'] - df_ml_location['Location_Perc'])

df_ml_location.tail()
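
As a worked example of the surplus calculation, the sketch below uses Japan's two figures quoted in the text (3.5 and 3.6 percent) next to a made-up location called "Exampleland", which is an assumption added only to show a negative value:

```python
import pandas as pd

toy = pd.DataFrame({
    "Location": ["Japan", "Exampleland"],
    "Location_Perc": [3.5, 2.0],   # share of all survey answers
    "Location_IsML": [3.6, 1.5],   # share of all answers with ML experience
})

# positive value: the location is overrepresented among ML experienced members
toy["ML_Surplus"] = (toy["Location_IsML"] - toy["Location_Perc"]).round(1)

print(toy)
```

Rounding to one decimal matches the precision of the two input columns and hides floating point noise in the subtraction.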

The result - here deliberately showing the tail end of our dataframe - does not really differ from previous results. The percentage of Kaggle members with ML knowledge and the overall member percentage on Kaggle are very similar regardless of the individual place of residence.

Next we will take another approach and analyze the same thing by using the value counts per individual location. In this case the results should however be taken with a pinch of digital salt, since we know beforehand that there is for example some 170 times more data on Kaggle members from India compared to their peers residing in Ethiopia.

In [None]:
# calculate value counts per country (IsML = 1)
is_ml_count = df_is_ml['Location'].value_counts()

# create new column
# match the counts by location name instead of row position, because
# is_ml_count is sorted by its own counts, not in df_ml_location's row order
df_ml_location['IsML_Count'] = df_ml_location['Location'].map(is_ml_count).fillna(0).astype(int)

df_ml_location.head()

For clarity, we also create another column with value counts of those Kaggle members with no ML experience. Let's start by creating a similar dataframe as before, but this time we select only those users with value "0" in the **IsML** column.

In [None]:
# new dataframe with no ML experience users (IsML == 0)
df_not_ml = df_ml[df_ml['IsML'] == 0]

# calculate value counts per country (IsML = 0)
not_ml_count = df_not_ml['Location'].value_counts()

# create new column
# match the counts by location name instead of row position, because
# not_ml_count is sorted by its own counts, not in df_ml_location's row order
df_ml_location['NotML_Count'] = df_ml_location['Location'].map(not_ml_count).fillna(0).astype(int)

df_ml_location.head()
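
Since `value_counts` sorts every result by its own counts, the order of locations can differ between the overall, "ML" and "non-ML" counts. The sketch below, on made-up data (`toy_ml` and `summary` are hypothetical names), attaches the counts by location name so that each row receives the figures that belong to it:

```python
import pandas as pd

# toy data, assumed for illustration only
toy_ml = pd.DataFrame({
    "Location": ["A", "A", "A", "B", "B", "B", "B", "C"],
    "IsML":     [1,   0,   0,   1,   1,   1,   0,   1],
})

# summary table ordered by total answers: B (4), A (3), C (1)
summary = (
    toy_ml["Location"].value_counts()
          .rename_axis("Location")
          .reset_index(name="Total")
)

ml_counts = toy_ml.loc[toy_ml["IsML"] == 1, "Location"].value_counts()
not_ml_counts = toy_ml.loc[toy_ml["IsML"] == 0, "Location"].value_counts()

# look the counts up by name; a location missing from a group gets 0
summary["IsML_Count"] = summary["Location"].map(ml_counts).fillna(0).astype(int)
summary["NotML_Count"] = summary["Location"].map(not_ml_counts).fillna(0).astype(int)

print(summary)
```

A quick sanity check for this kind of table is that the two counts add up to the total on every row.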

Finally, we add the total number of Kaggle members grouped by their place of residence. 

In [None]:
# calculate total count
df_ml_location['ML_Total_Count'] = (df_ml_location['IsML_Count'] + df_ml_location['NotML_Count'])

Since we now have the total count in our dataframe, we can calculate ML knowledgeable as well as "non-ML" member percentages per individual location.

In [None]:
# calculate percentage of ML users (IsML == 1)
df_ml_location['IsML_Perc'] = (df_ml_location['IsML_Count'] / df_ml_location['ML_Total_Count']) * 100

# round values to one decimal
df_ml_location['IsML_Perc'] = df_ml_location['IsML_Perc'].round(decimals=1)

# calculate percentage of ML non-users (IsML == 0)
# this is by no means necessary, but rather the column is created for clarity
df_ml_location['NotML_Perc'] = (df_ml_location['NotML_Count'] / df_ml_location['ML_Total_Count']) * 100

# round values to one decimal
df_ml_location['NotML_Perc'] = df_ml_location['NotML_Perc'].round(decimals=1)

df_ml_location.head()
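
For comparison, the per-location counts and percentages can also be derived in one pass with a `groupby`, because the mean of a 0/1 column is the share of ones. This is only a sketch on made-up data (`toy_ml` is an assumption), not the route taken in this notebook:

```python
import pandas as pd

# toy data, assumed for illustration only
toy_ml = pd.DataFrame({
    "Location": ["A", "A", "A", "B", "B", "B", "B", "C"],
    "IsML":     [1,   0,   0,   1,   1,   1,   0,   1],
})

# sum of a 0/1 column = number of ML users, count = all answers
per_location = toy_ml.groupby("Location")["IsML"].agg(
    IsML_Count="sum", ML_Total_Count="count"
)

per_location["IsML_Perc"] = (
    per_location["IsML_Count"] / per_location["ML_Total_Count"] * 100
).round(1)
per_location["NotML_Perc"] = (100 - per_location["IsML_Perc"]).round(1)

print(per_location)
```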

Let's make visual presentations of the locations with the highest percentage in the **IsML_Perc** column. We know that our "base values" to compare any results with are about 76.5 percent for ML knowledgeable users and 23.5 percent for non-ML users, as this information was obtained earlier.

In [None]:
ml_perc.round(decimals=1)


As we are about to plot graphics based on specific places of residence, it seems legitimate to drop the "Other" row from our dataset, so let's start with that before continuing. Rather than relying on the row's index number, we will select the row by its **Location** value.

In [None]:
# drop the row with no specific residence data (Other)
df_ml_location = df_ml_location[df_ml_location['Location'] != 'Other']

As we are interested in specific locations with highest ML knowledge percentage, we will sort all locations by their machine learning skills and then select the ten highest ML percentages. For plotting purposes, we will also create a separate dataframe **df_ml_location_plot_one**.

In [None]:
# sort countries by IsML_Perc column value from highest to lowest
df_ml_location = df_ml_location.sort_values(by ='IsML_Perc', ascending=False)

# create new dataframe for plotting
# select first ten rows
df_ml_location_plot_one = df_ml_location.iloc[:10,:]

df_ml_location_plot_one.head(10)

Even before creating any plots, we can see that most of the **Location_Perc** as well as **ML_Total_Count** column values are low. This means that we are talking about relatively small amount of Kaggle members. For example, in this comparison the location on top position has a total of 35 Kaggle members who answered the ML skills question in the first place. From that amount, 29 people (82.9 percent) informed having some sort of ML experience. Thus a number of Kaggle members equal to one classroom full of enthusiastic ML students basically carried the whole nation with them to top position.

However, this was again more or less the expected result, since we already knew that ML knowledge is fairly evenly distributed among Kaggle members regardless of their place of residence. Because of this, ***a single Kaggle member with ML skills moves the percentage more in locations with fewer Kaggle members in total***, i.e. the least represented individual locations included in the survey dataset.
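To get a feeling for how uncertain a percentage based on 35 answers is, one option is a confidence interval for a proportion. The sketch below uses the Wilson score interval, which is my own choice of method rather than anything used elsewhere in this notebook; the function name `wilson_interval` is hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95 percent Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# 29 of 35 respondents reported ML experience in the top location
low, high = wilson_interval(29, 35)
print(f"observed {29 / 35:.1%}, interval {low:.1%} to {high:.1%}")
```

With so few respondents the interval is wide enough to include the overall 76.5 percent baseline, so the top position should not be read as a clear lead.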

In the plot below, hovering over a bar shows the values in columns **Location**, **IsML_Perc** and **IsML_Count**.

In [None]:
# plot figure
# define parameters
fig = px.bar(df_ml_location_plot_one, x='Location', y='IsML_Perc', text = 'IsML_Perc',
            hover_data= ['IsML_Count'], color= 'IsML_Perc')

# set graphics
fig.data[0].marker.line.width = 0.5
fig.data[0].marker.line.color = "black"

fig.update_traces(texttemplate='%{text}', textposition='outside')
fig.update_layout(uniformtext_mode='hide')  

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

# set annotations
annotations = []

# data source
annotations.append(dict(xref='paper', yref='paper', x=0.88, y=-0.12,
                              xanchor='center', yanchor='top',
                              text='Data: Kaggle Member Survey 2021',
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

# set plot title
fig.update_layout(
    title='<b>Kaggle Member Survey 2021</b>:<br>ML method knowledge percentage per location (highest)',
                font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

# set axis titles etc.
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='LightGrey')

fig.update_yaxes(title_text='Percentage')
fig.update_xaxes(title_text='Location')

fig.update_layout(coloraxis_colorbar=dict(
    title="ML skill percentage"    
))

fig.update_yaxes(title_font=dict(size=14))
fig.update_xaxes(title_font=dict(size=14))

# show figure
fig.show()

The next plot is all about the locations with the lowest values in the **IsML_Perc** column, so we need to select those rows instead and create a new dataframe **df_ml_location_plot_two**.

In [None]:
# create new dataframe for plotting
# select last ten rows
df_ml_location_plot_two = df_ml_location.iloc[-10: ,:]

df_ml_location_plot_two.head(10)

Perhaps the most notable thing here is that even the location with the lowest percentage of ML-skilled members (Ukraine) reaches 73.2 percent, which is still high and not far from the overall baseline of 76.5 percent.

Let's turn this data into a plot as well.

In [None]:
# plot figure
# define parameters
fig = px.bar(df_ml_location_plot_two, x='Location', y='IsML_Perc', text = 'IsML_Perc',
            hover_data= ['IsML_Count'], color= 'IsML_Perc')

# set graphics
fig.data[0].marker.line.width = 0.5
fig.data[0].marker.line.color = "black"

fig.update_traces(texttemplate='%{text}', textposition='outside')
fig.update_layout(uniformtext_mode='hide')  

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

# set annotations
annotations = []

# data source
annotations.append(dict(xref='paper', yref='paper', x=0.88, y=-0.20,
                              xanchor='center', yanchor='top',
                              text='Data: Kaggle Member Survey 2021',
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

# set plot title
fig.update_layout(
    title='<b>Kaggle Member Survey 2021</b>:<br>ML method knowledge percentage per location (lowest)',
                font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

# set axis titles etc.
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='LightGrey')

fig.update_yaxes(title_text='Percentage')
fig.update_xaxes(title_text='Location')

fig.update_layout(coloraxis_colorbar=dict(
    title="ML skill percentage"    
))

fig.update_yaxes(title_font=dict(size=14))
fig.update_xaxes(title_font=dict(size=14))

# show figure
fig.show()
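The bar charts in this notebook repeat the same layout settings with only the title, the colorbar title and the vertical position of the source note changing. As a possible refactoring, the shared settings could be collected into one helper that returns a plain dictionary. The helper name `bar_layout` and its parameters are hypothetical; the values are copied from the plotting cells.

```python
def bar_layout(title, colorbar_title, source_y=-0.20):
    """Return the layout settings shared by the bar charts as a plain dict."""
    source_note = dict(
        xref="paper", yref="paper", x=0.88, y=source_y,
        xanchor="center", yanchor="top",
        text="Data: Kaggle Member Survey 2021",
        font=dict(family="arial narrow", size=8, color="rgb(96,96,96)"),
        showarrow=False,
    )
    return dict(
        title=dict(text=title),
        font=dict(family="calibri", size=12, color="rgb(64,64,64)"),
        uniformtext_mode="hide",
        annotations=[source_note],
        xaxis=dict(
            showgrid=False,
            title=dict(text="Location", font=dict(size=14)),
        ),
        yaxis=dict(
            showline=True, linecolor="rgb(204, 204, 204)", linewidth=2,
            ticks="outside",
            tickfont=dict(family="Arial", size=12, color="rgb(82, 82, 82)"),
            showgrid=True, gridwidth=1, gridcolor="LightGrey",
            title=dict(text="Percentage", font=dict(size=14)),
        ),
        coloraxis=dict(colorbar=dict(title=dict(text=colorbar_title))),
    )

layout = bar_layout(
    "<b>Kaggle Member Survey 2021</b>:<br>ML method knowledge percentage per location (lowest)",
    "ML skill percentage",
)
```

After creating a figure with `px.bar`, the dictionary could be applied with `fig.update_layout(**layout)`. The marker outline and the `update_traces` call would still be set per figure as before.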

Of course, everything we see above is based on Kaggle member survey data only. ***What would happen, for example, if we combined this data with world population data and calculated ML method familiarity based on that? Where in the world is machine learning knowledge at its relatively highest level?***

<br><br>
******

### 6. Population and ML <a class="anchor" id="chapter_6"></a>

To answer this question we need world population data, so let's start with that. Note that this sort of population data can never be fully accurate, since it is based on national census data. In very populous countries such as China or India, census data is basically old on arrival because of the sheer scale of the task: compiling it often takes years rather than months.

The population data we are going to use in our dataframe **df_population** is retrieved from the **Our World In Data** (*link:* __[Our World In Data](https://ourworldindata.org/)__) open source project.

In [None]:
# get population dataset
# the original dataset is part of the Our World In Data project
url_two = "https://covid.ourworldindata.org/data/ecdc/locations.csv"

# load dataset as pandas dataframe
df_population = pd.read_csv(url_two)

# drop columns irrelevant to task at hand
cols = ['countriesAndTerritories', 'population_year', 'continent']
df_population = df_population.drop(cols, axis=1)

# rename columns
df_population.rename(columns = {'location':'Location', 
                                'population':'Population'}, inplace = True) 

df_population.head(10)

Right now we don't need the population as a numeric value, so we temporarily convert it into an object (string). This is by no means necessary, but merging dataframes sometimes affects how numeric values are formatted, making them harder to read. When numeric values are required later, we can easily convert them back.

In [None]:
# change column datatype
df_population['Population'] = df_population['Population'].astype(str)

# show datatypes
df_population.dtypes

As a precautionary measure, it would be good to know whether our new dataset includes all 64 locations found in the Kaggle data. We can check this by comparing the **Location** columns of the two dataframes.

For this I will use the ".isin" method, which predictably checks whether a value in dataframe A is also found in dataframe B. However, since we are interested in locations that are *not included* in the **Our World In Data** set, the tilde character (~) is thrown into the mix. It effectively turns the query into something like ".isnotin".

In [None]:
# compare the location data between two dataframes
df_ml_location[~df_ml_location['Location'].isin(df_population['Location'])]
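The same pattern can be seen in isolation on two tiny dataframes. The location names and the contents of both frames are made up for illustration; only `.isin` and the tilde operator are the point here.

```python
import pandas as pd

# Toy stand-ins for the survey locations and the population dataset
survey = pd.DataFrame({"Location": ["Japan", "Spain", "Hong Kong (S.A.R.)"]})
population = pd.DataFrame({"Location": ["Japan", "Spain", "China"]})

# True where the survey location is also found in the population data
in_both = survey["Location"].isin(population["Location"])

# The tilde inverts the boolean mask, leaving only the unmatched rows
missing = survey[~in_both]

print(missing["Location"].tolist())
```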

Now we know that the population dataset lacks data for Hong Kong. This does not prevent us from merging the two dataframes, but we already know that the merge will create null values in the Hong Kong row.

Next a new dataframe **df_ml_population** is created based on Kaggle member individual location data and our new population data. Left merge here means that all 64 locations included in our Kaggle survey data (left) are included, but *only data about those specific Kaggle locations* is retrieved from the larger population dataframe (right). 

To make sure we got it right, we also check that the merged dataframe still consists of 64 rows.

In [None]:
# merge the two dataframes
df_ml_population = pd.merge(df_ml_location, df_population, how='left')

# get dataframe shape
shape = df_ml_population.shape
print('\nDataFrame Shape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])
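As an optional extra check, `pd.merge` can report by itself which rows found a match. The sketch below runs on made-up toy frames and names the key column explicitly with `on`; `indicator=True` adds a `_merge` column, and `validate` raises an error if a location appears more than once on either side.

```python
import pandas as pd

# Toy frames with made-up numbers
left = pd.DataFrame({
    "Location": ["Japan", "Spain", "Hong Kong (S.A.R.)"],
    "IsML_Count": [700, 350, 60],
})
right = pd.DataFrame({
    "Location": ["Japan", "Spain", "China"],
    "Population": [126_000_000, 47_000_000, 1_400_000_000],
})

merged = pd.merge(
    left, right,
    on="Location",
    how="left",
    validate="one_to_one",  # fail loudly on duplicate locations
    indicator=True,         # adds a _merge column: "both" or "left_only"
)

# Rows that exist only in the left frame got no population value
unmatched = merged.loc[merged["_merge"] == "left_only", "Location"].tolist()
print(unmatched)
```

A left merge keeps the row count of the left frame, so `len(merged) == len(left)` is the same sanity check as counting the 64 rows above.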

Since we expected null values (NaN) to appear, let's check if this actually happened.

In [None]:
# show rows with nan values
df_ml_population[df_ml_population.isna().any(axis=1)]

Apparently we also stepped right into a political hornet's nest in our analysis. ***The population data we use does not include a specific value for Hong Kong, since it includes only countries***. The Kaggle survey, however, records location as "place of residence", as the term goes.

As we are talking about one row only, the Hong Kong number of residents (7500700, estimated) is next added manually to the dataframe after first retrieving the value from the Hong Kong dedicated Wikipedia page: (*link:* __[Hong Kong Wikipedia article](https://en.wikipedia.org/wiki/Hong_Kong)__) 

This is by no means an ideal method, but for our purposes it can be considered a sufficient one-off fix. Strictly speaking we should probably also subtract the same number of people from China's population, but since that would not profoundly alter China's figure and we know that census data in general is never totally accurate, we'll leave it for now.

In [None]:
# update row value
# row is selected by its index number (10)
df_ml_population.loc[10,['Population']] = ['7500700']
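Selecting the row by index number 10 depends on the current row order. A sketch of selecting it by location name instead is shown below. The exact label "Hong Kong (S.A.R.)" is an assumption about how the survey spells the location, and the toy dataframe holds made-up values apart from the Hong Kong estimate quoted above.

```python
import pandas as pd

# Toy stand-in for df_ml_population; population is stored as text here
df_demo = pd.DataFrame({
    "Location": ["Japan", "Hong Kong (S.A.R.)"],
    "Population": ["126000000.0", None],
})

# Boolean mask that is True only for the Hong Kong row (label is an assumption)
is_hk = df_demo["Location"] == "Hong Kong (S.A.R.)"

# Update the value through the mask instead of a hard-coded index number
df_demo.loc[is_hk, "Population"] = "7500700"

print(df_demo)
```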

A bit earlier we converted our population values to objects (strings), so now is a good time to make them numeric again. We first convert the values to float and, since no decimals are required, then convert the floats to integers (int).

In [None]:
# convert values to float and after that to integer
df_ml_population['Population'] = df_ml_population['Population'].astype(float) 
df_ml_population['Population'] = df_ml_population['Population'].astype(int) 

# show datatypes
df_ml_population.dtypes
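The detour through float is needed because a number stored as text with a decimal part cannot be read directly as an integer. The small sketch below illustrates this with made-up values next to the manually added Hong Kong figure.

```python
import pandas as pd

# A float turned into text keeps its decimal part
text_value = str(126000000.0)  # "126000000.0"

# Plain int() refuses text with a decimal point
try:
    int(text_value)
    direct_ok = True
except ValueError:
    direct_ok = False

# Going through float first handles both "126000000.0" and "7500700"
population = pd.Series([text_value, "7500700"])
as_int = population.astype(float).astype(int)

print(direct_ok, as_int.tolist())
```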

As the population data is now in numeric form, we can sort the dataframe based on population.

In [None]:
# sort by Population column value from highest to lowest
df_ml_population = df_ml_population.sort_values(by ='Population', ascending=False)

df_ml_population.head(10)

Next we calculate the share of ML-knowledgeable Kaggle members in each location's population as a percentage and store the values in a new column **ML_Per_Population**. After that we will visualize the top and bottom values as two separate plots.

In [None]:
# calculate percentage
df_ml_population['ML_Per_Population'] = (df_ml_population['IsML_Count'] / df_ml_population['Population']) * 100

df_ml_population.head(10)

As before, for plotting we will first sort the values and then create a separate dataframe **df_ml_population_plot_one**.

In [None]:
# sort countries by ML_Per_Population
df_ml_population = df_ml_population.sort_values(by ='ML_Per_Population', ascending=False)

# create new dataframe for plotting
# select first ten rows
df_ml_population_plot_one = df_ml_population.iloc[:10,:]

In [None]:
# plot figure
# define parameters
fig = px.bar(df_ml_population_plot_one, x='Location', y='ML_Per_Population', text = 'IsML_Count',
            hover_data=['ML_Per_Population'], color= 'ML_Per_Population')

# set graphics
fig.data[0].marker.line.width = 0.5
fig.data[0].marker.line.color = "black"

fig.update_traces(texttemplate='%{text}', textposition='outside')
fig.update_layout(uniformtext_mode='hide')  

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

# set annotations
annotations = []

# data source
annotations.append(dict(xref='paper', yref='paper', x=0.88, y=-0.20,
                              xanchor='center', yanchor='top',
                              text='Data: Kaggle Member Survey 2021',
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

# set plot title
fig.update_layout(
    title='<b>Kaggle Member Survey 2021</b>:<br>ML method knowledge percentage per population (highest)',
                font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

# set axis titles etc.
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='LightGrey')

fig.update_yaxes(title_text='Percentage')
fig.update_xaxes(title_text='Location')

fig.update_layout(coloraxis_colorbar=dict(
    title="ML per population"    
))


fig.update_yaxes(title_font=dict(size=14))
fig.update_xaxes(title_font=dict(size=14))

# show figure
fig.show()

The numbers above bars in the plot are derived from the **IsML_Count** column to describe how many individual Kaggle members are actually included in the figure. As suspected, the most populous countries/regions are not among the top selection.

In a similar manner, we next plot the smallest **ML_Per_Population** values and create a separate dataframe **df_ml_population_plot_two** for this purpose.

In [None]:
# create new dataframe for plotting
# select last ten rows
df_ml_population_plot_two = df_ml_population.iloc[-10: ,:]

In [None]:
# plot figure
# define parameters
fig = px.bar(df_ml_population_plot_two, x='Location', y='ML_Per_Population', text = 'IsML_Count',
            hover_data=['ML_Per_Population'], color= 'ML_Per_Population')

# set graphics
fig.data[0].marker.line.width = 0.5
fig.data[0].marker.line.color = "black"

fig.update_traces(texttemplate='%{text}', textposition='outside')
fig.update_layout(uniformtext_mode='hide')  

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

# set annotations
annotations = []

# data source
annotations.append(dict(xref='paper', yref='paper', x=0.88, y=-0.20,
                              xanchor='center', yanchor='top',
                              text='Data: Kaggle Member Survey 2021',
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

# set plot title
fig.update_layout(
    title='<b>Kaggle Member Survey 2021</b>:<br>ML method knowledge percentage per population (lowest)',
                font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

# set axis titles etc.
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='LightGrey')

fig.update_yaxes(title_text='Percentage')
fig.update_xaxes(title_text='Location')

fig.update_layout(coloraxis_colorbar=dict(
    title="ML per population"    
))

fig.update_yaxes(title_font=dict(size=14))
fig.update_xaxes(title_font=dict(size=14))

# show figure
fig.show()

When the whole population of each location is taken into account, China falls second to last in relative ML knowledge. Of course, something similar would happen in any comparable ranking because of the sheer size of the Chinese population.

Another possibility is to apply some sort of threshold, similar to what was done in the original dataset for individual locations (50 survey entries were required for a location to get its own tag). Next, *only locations with a higher than average number of ML method related answers to the 2021 Kaggle member survey* are selected for further analysis. After selecting the locations, we will visualize their ML knowledge relative to population.

To begin with, we need this average number of respondents, so we will print that. The column we are interested in here is **ML_Total_Count**, since it includes the answers from both ML-knowledgeable and non-ML Kaggle members.

In [None]:
# calculate column average value
ml_count_average = df_ml_population['ML_Total_Count'].mean()

ml_count_average

Based on this, only locations with a value of 385 or higher in the **ML_Total_Count** column will be included in the new dataframe **df_ml_population_plot_three**. We will also check how many rows, i.e. locations, this leaves us with.

In [None]:
# create new dataframe based on column value condition
df_ml_population_plot_three = df_ml_population.loc[df_ml_population['ML_Total_Count'] >= ml_count_average]

# get dataframe shape
shape = df_ml_population_plot_three.shape
print('\nDataFrame Shape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])
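A mean-based cutoff can be strict when a few locations have far more respondents than the rest, because those few pull the average upwards. The sketch below compares the mean with the median on made-up counts; using the median is only an alternative worth considering, not what this notebook does.

```python
import pandas as pd

# Made-up respondent counts with two very large locations
df_demo = pd.DataFrame({
    "Location": ["A", "B", "C", "D", "E"],
    "ML_Total_Count": [7000, 2500, 200, 150, 100],
})

mean_cut = df_demo["ML_Total_Count"].mean()
median_cut = df_demo["ML_Total_Count"].median()

# Same filter as in the cell above, once per cutoff
above_mean = df_demo[df_demo["ML_Total_Count"] >= mean_cut]
above_median = df_demo[df_demo["ML_Total_Count"] >= median_cut]

print(mean_cut, len(above_mean))
print(median_cut, len(above_median))
```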

In [None]:
df_ml_population_plot_three.head(14)

Suddenly, we are left with only fourteen locations. Let's plot the top ten of these locations by using the **ML_Per_Population** column value as criterion, with the highest value on top of the pile. The number above each bar will be retrieved from the **IsML_Count** column.

In [None]:
# select first ten rows
df_ml_population_plot_three = df_ml_population_plot_three.iloc[:10,:]

In [None]:
# plot figure
# define parameters
fig = px.bar(df_ml_population_plot_three, x='Location', y='ML_Per_Population', text = 'IsML_Count',
            hover_data=['ML_Per_Population'], color= 'ML_Per_Population')

# set graphics
fig.data[0].marker.line.width = 0.5
fig.data[0].marker.line.color = "black"

fig.update_traces(texttemplate='%{text}', textposition='outside')
fig.update_layout(uniformtext_mode='hide')  

fig.update_layout(
    yaxis=dict(
        showline=True,
        linecolor='rgb(204, 204, 204)',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='rgb(82, 82, 82)',
        )))

# set annotations
annotations = []

# data source
annotations.append(dict(xref='paper', yref='paper', x=0.88, y=-0.20,
                              xanchor='center', yanchor='top',
                              text='Data: Kaggle Member Survey 2021',
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

# set plot title
fig.update_layout(
    title='<b>Kaggle Member Survey 2021</b>:<br>ML method knowledge percentage per population (385+ respondents)',
                font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

# set axis titles etc.
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='LightGrey')

fig.update_yaxes(title_text='Percentage')
fig.update_xaxes(title_text='Location')

fig.update_layout(coloraxis_colorbar=dict(
    title="ML per population"    
))

fig.update_yaxes(title_font=dict(size=14))
fig.update_xaxes(title_font=dict(size=14))

# show figure
fig.show()

As we can see, in the group of locations with 385 or more respondents, Spain comes out on top with the most ML-skilled Kaggle members per population. As a general observation, ***the results we get on any queries based on ML method knowledge are directly dependent on the method used in attaining them. As noted in the beginning, the computer only does exactly as it was told to do.***

Returning to the survey data, as the final population-related visualization in this notebook we will create a world map based on the number of ML-knowledgeable Kaggle members living in different regions.

<br><br>
******

### 7. ML Goes Global <a class="anchor" id="chapter_7"></a>

To do this, specific country/region coordinates are required. The source we can get the necessary data from is... ***Kaggle***, thanks to **Paul Mooney** (*link:* __[Paul Mooney's Kaggle profile](https://www.kaggle.com/paultimothymooney)__) and his carefully constructed dataset available for everyone to make use of.


In [None]:
# get coordinates dataset
# original dataset by Kaggle member Paul Mooney
url_three = "../input/latitude-and-longitude-for-every-country-and-state/world_country_and_usa_states_latitude_and_longitude_values.csv"

# create dataframe
df_coordinates = pd.read_csv(url_three)

# select columns
df_coordinates = df_coordinates.loc[:,['latitude', 'longitude', 'country']]

# rename columns
df_coordinates.rename(columns = {'latitude':'Lat', 'longitude': 'Long', 'country': 'Location'}, inplace = True)

df_coordinates.head()

Just like we did earlier with the **Our World In Data** locations, next we check if our existing Kaggle location data is consistent with locations included in the new coordinates dataframe.

In [None]:
# compare the location data between two dataframes
df_ml_population[~df_ml_population['Location'].isin(df_coordinates['Location'])]

It seems that there are no naming issues, so we can move on by merging the two dataframes and checking null values. As a precaution we also check the dataframe shape, which should still consist of 64 rows.

In [None]:
# merge the two dataframes
df_ml_map = pd.merge(df_ml_population, df_coordinates, how='left')

# get dataframe shape
shape = df_ml_map.shape
print('\nDataFrame Shape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])

In [None]:
# show rows with nan values
df_ml_map[df_ml_map.isna().any(axis=1)]

No unexpected null values or other issues showed up so we're "OK to go", as **Ellie Arroway** proclaimed in the modern sci-fi classic *Contact*. As we are producing an analysis based on Kaggle member survey, next the Kaggle members with ML knowledge will be projected on a digital globe. The maximum value of the color range will be set to match the maximum value in the **IsML_Count** column (5699), which is the number of ML knowledgeable Kaggle members in India.

In [None]:
# get column max value
max_ml = df_ml_map["IsML_Count"].max()

max_ml

In [None]:
# plot figure
fig = px.choropleth(df_ml_map, locations="Location",
                    projection="natural earth", locationmode="country names", title="<b>Kaggle Member Survey 2021</b>:<br>ML knowledge per respondents", color="IsML_Count",
                    template="plotly", color_continuous_scale="peach",range_color=[0,max_ml] )

annotations = []

# source
annotations.append(dict(xref='paper', yref='paper', x=0.88, y=-0.07,
                              xanchor='center', yanchor='top',
                              text='Data: Kaggle Member Survey 2021',
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

fig.update_layout(
    font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

fig.update_layout(coloraxis_colorbar=dict(
    title="ML skills"    
))

fig.show()

As we can see, India is indeed the global hotspot as far as ML knowledge on Kaggle is concerned. Also, my conscience started to hurt again because of the color of Finland.

Just for comparison, with the final globe we will use the per capita **ML_Per_Population** column value as criterion, after we first seek out the maximum value (Singapore) in the column.

In [None]:
# get column max value
max_population = df_ml_map["ML_Per_Population"].max()

max_population

As even the maximum value of this column is very small, all the values in the column will next be multiplied by ten thousand. The multiplied values then define the color scale of the map.

This is mainly to make the hover values on the map more readable. Instead of near-zero decimals, the values will now range roughly from zero to 22. Also, as all this is done *only while plotting our globe*, none of it affects the actual dataframe content.

The maximum value we will use in color scheme is the following:

In [None]:
# calculate value
max_map = max_population * 10000

max_map
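Since **ML_Per_Population** is already a percentage (count divided by population, times 100), multiplying it by ten thousand gives the number of ML-knowledgeable respondents per one million inhabitants. The sketch below checks this arithmetic with hypothetical numbers that are not taken from the survey; both function names are made up for the example.

```python
import math

def ml_percent_of_population(is_ml_count, population):
    """Same formula as the ML_Per_Population column: a percentage."""
    return is_ml_count / population * 100

def scaled_for_map(percent):
    """Same scaling as used for the map colors."""
    return percent * 10000

# Hypothetical location: 110 ML-knowledgeable respondents, 5 million inhabitants
percent = ml_percent_of_population(110, 5_000_000)
map_value = scaled_for_map(percent)

# Direct computation of respondents per million inhabitants
per_million = 110 / 5_000_000 * 1_000_000

print(percent, map_value, per_million)
```

Read this way, a map value of about 22 corresponds to roughly 22 ML-knowledgeable survey respondents per million inhabitants.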

In [None]:
# plot figure
# color value is modified before plotting
fig = px.choropleth(df_ml_map, locations="Location",
                    projection="natural earth", locationmode="country names", title="<b>Kaggle Member Survey 2021</b>:<br>ML knowledge per population", 
                    color= df_ml_map["ML_Per_Population"] * 10000, template="plotly", color_continuous_scale="peach",range_color=[0,max_map] )

annotations = []

# source
annotations.append(dict(xref='paper', yref='paper', x=0.88, y=-0.07,
                              xanchor='center', yanchor='top',
                              text='Data: Kaggle Member Survey 2021',
                              font=dict(family='arial narrow',
                                        size=8,
                                        color='rgb(96,96,96)'),
                              showarrow=False))

fig.update_layout(annotations=annotations)

fig.update_layout(
    font=dict(family='calibri',
                                size=12,
                                color='rgb(64,64,64)'))

fig.update_layout(coloraxis_colorbar=dict(
    title="ML skills"    
))


fig.show()

The second globe presentation is a bit more evenly distributed regarding hotspots, telling us that on Kaggle ML method knowledge is indeed spread relatively widely around the globe. Most of Africa is however missing from the picture, along with my home country Finland. As noted earlier, this may in part be because of the threshold of 50 respondents used in the Kaggle survey for gaining an individual location tag.

<br><br>
******

### 8. Conclusion <a class="anchor" id="chapter_8"></a>

#### 8.1 The present of ML <a class="anchor" id="chapter_8_1"></a>

The question of who (which location or country) holds the "ML throne" on Kaggle can be answered in different ways, and the answer depends on the method used, as we saw before. Region-wise, as a working hypothesis, the relatively high ML knowledge in Australia compared to New Zealand or other nearby regions may for example be affected by the large community of students from China and elsewhere living in Australia. However, the Kaggle survey asked about place of residence rather than nationality, so no further analysis of this was possible here.

It is also good to remember that in many cases the location-specific data in the Kaggle survey was based on a relatively small number of members, even with the threshold of 50 members per individual location. Also, not everyone included in the survey answered the question about ML knowledge.

Concluding the analysis, we can argue the following based on our data:

- ***There is no significant correlation between ML knowledge and the age, gender or place of residence of a particular Kaggle member***.
- ***A Kaggle member has a 76.5 percent average probability of being ML knowledgeable regardless of the aforementioned personal details.***

Thus on a global scale, remembering our research question about the "ML divide", one could actually argue that this divide is widest when Kaggle members and "non-Kaggle people" are compared.

#### 8.2 The future of ML <a class="anchor" id="chapter_8_2"></a>

This of course raises the following question concerning the future:

- ***If more people should become aware of Kaggle (and other similar ML online communities) and join in, would this by itself have more effect on overall population ML knowledge than the individual features included in the current member survey data? In other words, is Kaggle membership actually the very threshold we were just looking for when "ML divide" is concerned?***

A good starting point would be to study the relationship between Kaggle membership and ML knowledge more closely. For example, as suggested earlier:

- ***Future surveys could include more detailed questions for example about how long an individual member has been affiliated with Kaggle. That data could then be compared with answers concerning specific ML method knowledge and the timeline of possessing that knowledge.***

Should the analysis prove that these two features indeed correlate with each other, an argument could be made that a significant portion of Kaggle members have gained their ML skills thanks to joining the Kaggle community in the first place. Or one might ask people directly something like "*What Kaggle courses have you taken or plan to take?* ". 

Regarding future courses, a more exploratory question could ask Kagglers, for example, whether there are any new course topics they would be interested in. This wider theme of *why* people chose to join Kaggle in the first place is to me perhaps the most intriguing aspect that has not yet been a specific topic of inquiry in Kaggle surveys.

As we started this data journey from Japan, it is worth noting that the people there didn't actually just wait for the whole place to sink. This is hardly major news, since it is hard to find a nation with more creative people full of unique ideas than Japan. Recently a local start-up company developed a hoverbike which uses machine learning to stabilize the vehicle (*link:* __[BBC article](https://www.bbc.com/news/technology-59065674)__). Cars may still not fly, but now there is something to fly over them. As the analyst interviewed by the BBC said regarding the project, "*things that once seemed like science fiction are becoming more tangible every year* ".



*Arigato gozaimashita* and thank you for your time.

<br><br>
***Jari Peltola***<br>
***Finland***<br>

<br><br>
******
