# How can a kaggler step up to the next level 📊?
## What are the factors? What impact does the use of BI tools have?

---------------------

## <u>About this notebook</u>

In this notebook, I analyze how kaggle users use visualization tools and suggest ways to encourage them to use a wider variety of methods in kaggle notebooks.

On kaggle, one important goal is to build high-performance predictive models. Another important goal is to create and publish notebooks that accurately summarize the results of an analysis and serve as a reference for prediction and discussion. To analyze and visualize data and present it in a kaggle notebook, we need to be familiar with visualization tools. The dataset for this competition includes data on how kaggle users use these tools.

Using this data, I will proceed as follows.

1. First, I will analyze the data to see which tools kaggle users use for visualization.
1. Next, I will combine it with other data in this dataset to identify the characteristics of kaggle users who are familiar with the tools (that is, who can use many of them). In particular, I focus on the impact of being able to use BI tools, for the reasons described in the Motivation section.
1. Finally, based on the results of the analysis, I will suggest what we can do on kaggle.



## <u>Goal of this notebook</u>

* Find out how kaggle users are using the visualization tools.

* Identify the characteristics of users who are familiar with many visualization tools.

* Suggest measures that kaggle could take to encourage users to use a wider variety of methods in kaggle notebooks.

## <u>Motivation</u>

A lot of people publish kaggle notebooks, but I feel there is a big difference in how proficient they are with the tools. For example, even with the same libraries such as seaborn, I was sometimes stunned to see that, while I had only very simple line plots, the experts were using sunburst plots and clustermaps to create elegant and accurate visualizations. Some people use Plotly to toggle unnecessary graphs on and off to keep their analysis compact, and others use Geoplotlib to draw easy-to-understand choropleth maps. How do experts learn and become familiar with such graph types? How can a non-expert kaggler step up to the next level? I was interested in these questions, so I wanted to clarify one possible answer through this analysis.

For this analysis, I was particularly interested in whether being able to use BI tools affects the number of visualization tools available to users. If we are familiar with a BI tool, we can produce a wide variety of good-looking visualizations very quickly. This suggests that users who are familiar with BI tools know many examples of ideal visualizations and are presumably working hard to reproduce or improve on them in their kaggle notebooks.

I also feel that in real-world analysis work we are required to use BI tools, not just tools like the Jupyter notebooks used on kaggle. For example, when a data analysis infrastructure is built, its user interface is often a BI tool such as Tableau, and the analyst is required to be proficient with it. We cannot currently use BI tools on kaggle, but for the reasons above I believe that kaggle, as a platform for data analysts, cannot ignore incorporating BI tools. Therefore, I pay special attention to the impact of BI tools in this analysis.

--------------------

## Load libraries and dataset.

In [None]:
import collections
from datetime import datetime as dt
import itertools
import math
import os
import re
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px
import scipy
from scipy import stats
from scipy.stats import norm
import seaborn as sns

I load kaggle_survey_2020_responses.csv. Since the first row contains the question text, I drop it and re-index the dataframe.

In [None]:
survey = pd.read_csv("../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv")

#Since the first row contains the question text, I drop it and re-index the dataframe.
survey = survey.iloc[1:,:].reset_index(drop=True)
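
The question text in that first row can be handy later for labelling charts. The following is a minimal sketch, on a toy CSV whose column names and values are made up for illustration, of one way to keep the question text as a lookup before dropping it. It is not part of the original analysis.

```python
import io

import pandas as pd

# Toy CSV mimicking the survey layout: header row, question-text row, answers.
# The column names and values are illustrative assumptions, not the real file.
toy_csv = io.StringIO(
    "Q14_Part_1,Q14_Part_2\n"
    "Which tools? - Matplotlib,Which tools? - Seaborn\n"
    " Matplotlib , Seaborn \n"
    " Matplotlib ,\n"
)
raw = pd.read_csv(toy_csv)

# Keep the question text as a dict keyed by column name before dropping the row.
questions = raw.iloc[0].to_dict()

# Same step as in the notebook: drop the question row and re-index.
answers = raw.iloc[1:].reset_index(drop=True)

print(questions["Q14_Part_1"])
print(answers.shape)
```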

# Overview of the tools and libraries that kaggle users use on a daily basis for data analysis

In Q14, each column records whether a given tool is used, so I rename the columns to the tool names.

In [None]:
tools_q14 = ["Matplotlib", "Seaborn", "Plotly/Plotly Express", "Ggplot/ggplot2" ,"Shiny", "D3 js", "Altair",
            "Bokeh", "Geoplotlib", "Leaflet/Folium", "None", "Other"]
cols_q14 = [col for col in survey.columns if "Q14" in col]
survey_q14 = survey[cols_q14]
survey_q14.columns = tools_q14

In [None]:
survey_q14.head()

## Which tools and libraries are being used the most?

The answer can be found in the responses to Q14, which asks the following question:

Q14
What data visualization libraries or tools do you use on a regular basis? (Select all that apply)
* Matplotlib
* Seaborn
* Plotly / Plotly Express
* Ggplot / ggplot2
* Shiny
* D3 js
* Altair
* Bokeh
* Geoplotlib
* Leaflet / Folium
* None
* Other

I think there are four main categories (some of the tools are unfamiliar to me, so I apologize if I am wrong).

* Fundamental visualization tools

    * Matplotlib
    * Ggplot / ggplot2
    
* High level visualization tools
    * Seaborn
    * Altair
    * Bokeh

* Interactive plot

    * Plotly / Plotly Express
    * D3.js
    * Shiny
    
* Geographical plot
    * Geoplotlib
    * Leaflet / Folium

For each tool (from now on, tools and libraries are collectively referred to as tools), I show the number of respondents who answered that they use it regularly in a pie chart.

In [None]:
g_tools = pd.DataFrame(len(survey_q14) - survey_q14.isnull().sum(), columns=["tool"])\
          .plot.pie(y="tool", figsize=(15, 10), fontsize=18)
g_tools.set_title("Number of answers for each regularly used tool.", fontsize=18)
g_tools.set_ylabel("") 
plt.legend(bbox_to_anchor=(1.2, 1), loc='upper left', borderaxespad=0, fontsize=18)

Matplotlib is the most common, followed by Seaborn. Plotly and ggplot come next, but they are only about half as common as the top two. For interactive plots, Plotly is the dominant tool. The geographical visualization tools are rarely selected, which suggests that few users regularly produce geographical visualizations.
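
A pie chart shows each tool's share of all selections, which is not the same as the share of respondents who use it. As a rough sketch on a toy frame (the data and the names `toy_q14`, `tool_counts` and `tool_share` are illustrative assumptions), the per-tool counts and the share of respondents could be computed like this:

```python
import numpy as np
import pandas as pd

# Toy stand-in for survey_q14: one column per tool, NaN means "not selected".
toy_q14 = pd.DataFrame({
    "Matplotlib": [" Matplotlib ", " Matplotlib ", np.nan, " Matplotlib "],
    "Seaborn": [" Seaborn ", np.nan, np.nan, " Seaborn "],
    "Geoplotlib": [np.nan, np.nan, np.nan, " Geoplotlib "],
})

# Number of respondents who selected each tool, most common first.
tool_counts = toy_q14.notna().sum().sort_values(ascending=False)

# Share of all rows (including respondents who selected nothing) per tool.
tool_share = tool_counts / len(toy_q14)

print(tool_counts)
print(tool_share)
```

Whether to divide by all rows or only by rows with at least one answer is a design choice; this sketch uses all rows.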

## What is the distribution of the number of tools available for a kaggle user?

How many visualization tools can each kaggle user use?

In [None]:
q14_not_none_cols = [col for col in survey_q14.columns if 'None' != col]

In [None]:
q14_counts = np.array([np.nan for i in survey_q14[["Matplotlib"]].index])
q14_tool_counts = (~survey_q14[q14_not_none_cols].isnull()).sum(axis=1)
q14_tool_counts = q14_tool_counts[q14_tool_counts>0]

q14_counts[q14_tool_counts[q14_tool_counts>0].index] = q14_tool_counts
q14_counts[survey_q14[survey_q14["None"]=="None"].index] = 0

q14_counts = pd.DataFrame(q14_counts, columns=["number_of_tools"])

In [None]:
q14_counts.head()

In [None]:
plt.figure(figsize=(15, 10))
g_number_of_tools = sns.distplot(q14_counts["number_of_tools"], kde=False, rug=False, bins=np.arange(0,13))
g_number_of_tools.set_title("Distribution of the number of tools available to kaggle users", fontsize=18)

In [None]:
q14_counts["number_of_tools"].describe()

We can see that the average number of visualization tools available to kaggle users is two.

As mentioned above, there are four major types of tools, so an average of two seems small. For example, consider a user who mainly uses Python and is familiar with Matplotlib and its high-level interface, Seaborn. That person would not be able to do interactive plotting or geographical visualization.
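
The counting rule used above (count the selected tools, treat an explicit "None" as zero, and leave respondents who selected nothing as missing) can also be written without index bookkeeping. This is a minimal sketch on toy data; the frame `toy_q14` and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for survey_q14 with a "None" column.
toy_q14 = pd.DataFrame({
    "Matplotlib": [" Matplotlib ", " Matplotlib ", np.nan, np.nan],
    "Seaborn": [" Seaborn ", np.nan, np.nan, np.nan],
    "None": [np.nan, np.nan, "None", np.nan],
})

tool_cols = [col for col in toy_q14.columns if col != "None"]

# Count selected tools per respondent; a count of zero becomes NaN for now.
counts = toy_q14[tool_cols].notna().sum(axis=1)
number_of_tools = counts.where(counts > 0)

# Respondents who explicitly chose "None" are recorded as using zero tools.
number_of_tools[toy_q14["None"].notna()] = 0

print(number_of_tools.tolist())
print(number_of_tools.mean())  # NaN rows are skipped by default
```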

## What is the most common combination of tools?

Let's visualize the combinations of tools that kagglers reported using, in order of decreasing frequency.

In [None]:
def convert_tuple_without_nan(vec):
    """Delete NaN from a list and convert the result to a tuple."""
    # pd.notna is more robust than an identity check against np.nan.
    vec = [element for element in vec if pd.notna(element)]
    return tuple(vec)

#Convert the results of Q14, which are not all missing values, into a list for counting.
survey_q14_records = [convert_tuple_without_nan(val) for val in survey_q14.values]
survey_q14_records = [item for item in survey_q14_records if len(item) > 0]

In [None]:
def convert_most_common(most_common):
    """Convert the result of most_common() into a structure that pd.DataFrame() can use."""
    res = []
    for i in range(len(most_common)):
        ans = tuple(item for item in most_common[i][0] if item is not np.nan)
        res.append([ans, most_common[i][1]])
    return res

#counting
c = collections.Counter(survey_q14_records)

#Convert counting results to pd.DataFrame
tool_comb_freq = convert_most_common(c.most_common())
tool_comb_freq = pd.DataFrame(tool_comb_freq)
tool_comb_freq.columns = ["Combination", "freq"]

#Number of tools that can be used simultaneously
tool_comb_freq["tools"] = tool_comb_freq["Combination"].map(lambda x: len(x))

#Add a column for cumulative sum(%).
total_freq = sum(tool_comb_freq["freq"])
tool_comb_freq["acc"] = [acc/total_freq for acc in itertools.accumulate(tool_comb_freq["freq"])]

In [None]:
plt.figure(figsize=(20, 10))
g_comb = sns.barplot(x="Combination", y="freq", data=tool_comb_freq.query("acc < 0.90"))
g_comb.set_xticklabels(g_comb.get_xticklabels(), rotation=90)

Matplotlib alone and the combination of Matplotlib and Seaborn were the top two responses, followed by respondents who selected None. Plotly first appears in the fourth most common combination. By that point the frequency is already much lower, indicating that most users are not using interactive plots. As for geographic visualization tools, Geoplotlib first appears in the 19th combination, so most users do not use geographic visualization tools.
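
For reference, the counting and cumulative-share steps above can be condensed. The sketch below uses toy data and shows one possible way to do it with `collections.Counter` and `Series.cumsum`; the frame `toy_q14` and its values are assumptions for illustration.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Toy stand-in for survey_q14 (labels are illustrative).
toy_q14 = pd.DataFrame({
    "Matplotlib": [" Matplotlib "] * 4 + [np.nan],
    "Seaborn": [" Seaborn ", " Seaborn ", np.nan, " Seaborn ", np.nan],
    "Plotly": [np.nan, np.nan, np.nan, " Plotly ", np.nan],
})

# One tuple of selected tools per respondent, skipping empty answers.
combos = [tuple(v for v in row if pd.notna(v)) for row in toy_q14.to_numpy()]
combos = [combo for combo in combos if len(combo) > 0]

# most_common() returns (combination, count) pairs in decreasing frequency.
freq = pd.DataFrame(Counter(combos).most_common(), columns=["Combination", "freq"])
freq["tools"] = freq["Combination"].map(len)

# Cumulative share of respondents covered by the top combinations.
freq["acc"] = freq["freq"].cumsum() / freq["freq"].sum()

print(freq)
```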

I guess that learning the tools needed for visualization first, such as Matplotlib and Seaborn, has a higher priority than learning interactive or geographical visualization tools. Let's check this against the data by extracting the combinations that include each type of tool.

First, interactive visualization tools.

In [None]:
def is_combination_includes_interactive_tool(comb):
    if " Plotly / Plotly Express " in comb or " D3 js " in comb or " Shiny " in comb:
        return True
    else:
        return False
     
tool_comb_freq[tool_comb_freq["Combination"].map(is_combination_includes_interactive_tool)].head(10)

Next, geographical visualization tools.

In [None]:
def is_combination_includes_geographical_tool(comb):
    if " Geoplotlib " in comb or " Leaflet / Folium " in comb:
        return True
    else:
        return False
     
tool_comb_freq[tool_comb_freq["Combination"].map(is_combination_includes_geographical_tool)].head(10)
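
The two membership checks follow the same pattern, so they could share one helper. The sketch below is only an illustration: the helper name `includes_any` is hypothetical, and the padded labels are assumed to match the answer strings in the data (note that Q14 lists the option as "D3 js").

```python
# Answer labels per category; the padded strings are assumptions about the data.
INTERACTIVE_TOOLS = {" Plotly / Plotly Express ", " D3 js ", " Shiny "}
GEOGRAPHICAL_TOOLS = {" Geoplotlib ", " Leaflet / Folium "}


def includes_any(comb, tools):
    """Return True if the combination contains at least one of the given tools."""
    return not tools.isdisjoint(comb)


example = (" Matplotlib ", " Seaborn ", " Shiny ")
print(includes_any(example, INTERACTIVE_TOOLS))
print(includes_any(example, GEOGRAPHICAL_TOOLS))
```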

Let's look at the two distributions.

In [None]:
fig, axs = plt.subplots(1, 2, sharey=True, figsize=(15,5))
g_number_of_tools = sns.distplot(tool_comb_freq[tool_comb_freq["Combination"].map(is_combination_includes_interactive_tool)]["tools"],
                                 kde=False, rug=False, bins=np.arange(0,13), ax=axs[0])
axs[0].set_title("interactive plot tools", fontsize=18)
g_number_of_tools = sns.distplot(tool_comb_freq[tool_comb_freq["Combination"].map(is_combination_includes_geographical_tool)]["tools"],
                                 kde=False, rug=False, bins=np.arange(0,13), ax=axs[1])
axs[1].set_title("geographical visualization tools", fontsize=18)

Among the combinations that include geographic or interactive visualization tools, the number of tools is around four on average (note that these histograms count each distinct combination once, not each respondent). So, to produce visualizations that cannot be expressed with the fundamental visualization tools and their high-level interfaces, users need to learn additional tools.
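
Because the frequency table has one row per distinct combination, an average taken over its rows can differ from an average taken over respondents. The toy example below, with made-up frequencies, sketches the difference by weighting each combination by its `freq`.

```python
import numpy as np
import pandas as pd

# Toy frequency table in the same shape as tool_comb_freq (values are made up).
toy_freq = pd.DataFrame({
    "Combination": [
        (" Matplotlib ", " Plotly / Plotly Express "),
        (" Matplotlib ", " Seaborn ", " Plotly / Plotly Express ", " Shiny "),
    ],
    "freq": [30, 10],
})
toy_freq["tools"] = toy_freq["Combination"].map(len)

# Each distinct combination counts once.
mean_per_combination = toy_freq["tools"].mean()

# Each respondent counts once: weight combinations by how often they occur.
mean_per_respondent = np.average(toy_freq["tools"], weights=toy_freq["freq"])

print(mean_per_combination, mean_per_respondent)
```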

# Explore the factors that influence the number of tools used.

Using many tools does not necessarily improve the quality of notebooks, but since many of the tools in the Q14 choices serve different purposes, as we saw above, it can be said that the more tools kaggle users can use, the wider the range of analyses they can express.

So what is the difference between kagglers who can use a lot of tools and those who can't? I would like to explore their characteristics using the survey data.

## Whether or not they can use BI tools

First, let's look at a group divided into those who can use BI tools and those who can't. 

Whether or not a respondent can use BI tools can be determined from Q31-A.

Q31-A
Which of the following business intelligence tools do you use on a regular basis? (Select all that apply)
* Amazon QuickSight
* Microsoft Power BI
* Google Data Studio
* Looker
* Tableau
* Salesforce
* Einstein Analytics
* Qlik
* Domo
* TIBCO Spotfire
* Alteryx
* Sisense
* SAP Analytics Cloud
* None
* Other

In [None]:
#The number of tools available will be reused from the previous analysis.
survey_q14_BItools = q14_counts.copy()

#Extract the data from the answers in Q31.
#I define "cannot use BI tools" as Q31_A_Part_14 ('None') being selected.
#"Can use" means that at least one Q31_A option other than Q31_A_Part_14 is not null.
df_q31a_none = survey[[ col for col in survey.columns if "Q31_A" in col]].query("Q31_A_Part_14 == 'None'")
df_q31a_can_use = survey[[ col for col in survey.columns if "Q31_A" in col]].query("Q31_A_Part_14 != 'None'").dropna(how='all')

#I let can_use_BI_tool column store whether the BI tool can be used or not.
can_use_BI_tool = np.array([np.nan for i in survey_q14_BItools.index])
can_use_BI_tool[df_q31a_can_use.index] = 1
can_use_BI_tool[df_q31a_none.index] = 0
survey_q14_BItools["can_use_BI_tool"] = can_use_BI_tool

#I drop records with null from aggregation.
survey_q14_BItools = survey_q14_BItools.dropna(how='any')

In [None]:
survey_q14_BItools.head()
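
The same kind of flag is built again for Q31-B below, so the logic could be wrapped in a small function. This is a sketch under assumptions: the helper name `multi_select_flag` is hypothetical, and the toy answer strings are illustrative.

```python
import numpy as np
import pandas as pd


def multi_select_flag(df, none_col):
    """1.0 if any option other than none_col is selected, 0.0 if none_col is
    selected, NaN if the respondent selected nothing at all."""
    other_cols = [col for col in df.columns if col != none_col]
    flag = pd.Series(np.nan, index=df.index)
    flag[df[other_cols].notna().any(axis=1)] = 1.0
    # An explicit "None" answer takes precedence, as in the cell above.
    flag[df[none_col].notna()] = 0.0
    return flag


# Toy stand-in for the Q31_A columns.
toy_q31a = pd.DataFrame({
    "Q31_A_Part_2": ["Microsoft Power BI", np.nan, np.nan, np.nan],
    "Q31_A_Part_5": [np.nan, np.nan, np.nan, "Tableau"],
    "Q31_A_Part_14": [np.nan, "None", np.nan, np.nan],
})
flag = multi_select_flag(toy_q31a, none_col="Q31_A_Part_14")
print(flag.tolist())
```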

It is interesting to note that the average number of tools available is about 0.5 larger for the group that can use BI tools than for the group that cannot.

In [None]:
plt.figure(figsize=(20, 10))
sns.set(font_scale=2)
sns.distplot(survey_q14_BItools.query("can_use_BI_tool == 1")["number_of_tools"],
             rug=False, kde=True,kde_kws={'bw':1},
             bins=np.arange(0,13), label='can use BI tool')
sns.distplot(survey_q14_BItools.query("can_use_BI_tool == 0")["number_of_tools"],
             rug=False, kde=True, kde_kws={'bw':1},
             bins=np.arange(0,13), label='can not use BI tool')
plt.xlim(-1, 13)
plt.title("Distribution of the number of tools available per whether can use BI tools.")
plt.legend()

In [None]:
print(f"Number of data for kagglers who can use BI tools is:", len(survey_q14_BItools.query("can_use_BI_tool == 1")["number_of_tools"]))
print(f'Mean number of tools for kagglers who can use BI tools is: {np.mean(survey_q14_BItools.query("can_use_BI_tool == 1")["number_of_tools"])}')

In [None]:
print(f"Number of data for kagglers who can not use BI tools is:", len(survey_q14_BItools.query("can_use_BI_tool == 0")["number_of_tools"]))
print(f'Mean number of tools for kagglers who can not use BI tools is: {np.mean(survey_q14_BItools.query("can_use_BI_tool == 0")["number_of_tools"])}')

To be sure, I will use a t-test to check if there is a significant difference in the means.

In [None]:
scipy.stats.ttest_ind(survey_q14_BItools.query("can_use_BI_tool == 1")["number_of_tools"],\
                      survey_q14_BItools.query("can_use_BI_tool == 0")["number_of_tools"],\
                      axis=0, equal_var=True, nan_policy='propagate')

Since the p-value is much smaller than 0.05, the difference in means is statistically significant.
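
With samples this large, a small p-value does not by itself say how big the difference is. As a complementary sketch (not the test used above), the block below runs Welch's t-test, which does not assume equal variances, and computes Cohen's d as an effect size. The data is synthetic and the helper `cohens_d` is written here for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two groups; these are not survey results.
rng = np.random.default_rng(0)
group_bi = rng.poisson(2.6, size=500)
group_no_bi = rng.poisson(2.1, size=800)

# Welch's t-test: equal_var=False drops the equal-variance assumption.
welch = stats.ttest_ind(group_bi, group_no_bi, equal_var=False)


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample variance."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
        len(a) + len(b) - 2
    )
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


print(welch.pvalue, cohens_d(group_bi, group_no_bi))
```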

If we are familiar with a BI tool, we can produce a wide variety of good-looking visualizations very quickly. This suggests that users who are familiar with BI tools know many examples of ideal visualizations and are presumably working hard to reproduce or improve on them in their kaggle notebooks. So I guess that analysts who can use BI tools tend to be interested in many ways of visualizing data.

## Whether or not they want to become familiar with BI tools

So how does the number of tools relate to the desire to be able to use BI tools in the future? We can find this out from Q31-B.

Q31-B
Which of the following business intelligence tools do you hope to become more familiar with in the next 2 years? (Select all that apply)
* Microsoft Power BI
* Amazon QuickSight
* Google Data Studio
* Looker
* Tableau
* Salesforce
* Einstein Analytics
* Qlik
* Domo
* TIBCO Spotfire
* Alteryx
* Sisense
* SAP Analytics Cloud
* None
* Other

I define a kaggler who "doesn't want to be familiar with BI tools" as one who selected Q31_B_Part_14 (None).
I define "wants to be familiar with BI tools" as having at least one Q31_B option other than Q31_B_Part_14 that is not null.

In [None]:
#The number of tools available will be reused from the previous analysis.
survey_q14_tools = q14_counts.copy()

#Extract the data from the answers in Q31B.
#I define "doesn't want to be familiar with BI tools" as Q31_B_Part_14 ('None') being selected.
#"Wants to be familiar with" means that at least one Q31_B option other than Q31_B_Part_14 is not null.
df_q31b_none = survey[[ col for col in survey.columns if "Q31_B" in col]].query("Q31_B_Part_14 == 'None'")
df_q31b_can_use = survey[[ col for col in survey.columns if "Q31_B" in col]].query("Q31_B_Part_14 != 'None'").dropna(how='all')

#I let want_to_be_familiar_BI_tool column store whether kaggler wants to be familiar with BI tools or not.
will_be_familiar_BI_tool = np.array([np.nan for i in survey_q14_tools.index])
will_be_familiar_BI_tool[df_q31b_can_use.index] = 1
will_be_familiar_BI_tool[df_q31b_none.index] = 0
survey_q14_tools["want_to_be_familiar_BI_tool"] = will_be_familiar_BI_tool

#I drop records with null from aggregation.
survey_q14_tools = survey_q14_tools.dropna(how='any')

In [None]:
survey_q14_tools.head()

In [None]:
plt.figure(figsize=(20, 10))
sns.set(font_scale=2)
sns.distplot(survey_q14_tools.query("want_to_be_familiar_BI_tool == 1")["number_of_tools"],
             rug=False, kde=True,kde_kws={'bw':1},
             bins=np.arange(0,13), label='want to be familiar with BI tool')
sns.distplot(survey_q14_tools.query("want_to_be_familiar_BI_tool == 0")["number_of_tools"],
             rug=False, kde=True, kde_kws={'bw':1},
             bins=np.arange(0,13), label='not want to be familiar with')
plt.xlim(-1, 13)
plt.title("Distribution of the number of tools available per whether want to be familiar with BI tools.")
plt.legend()

In [None]:
print("Number of data for kagglers who want to be familiar with BI tools is:",\
      len(survey_q14_tools.query("want_to_be_familiar_BI_tool == 1")["number_of_tools"]))
print("Mean of number of tools for kagglers who want to be familiar with BI tools is:",\
      np.mean(survey_q14_tools.query("want_to_be_familiar_BI_tool == 1")["number_of_tools"]))

In [None]:
print("Number of data for kagglers who don't want to be familiar with BI tools is:",\
      len(survey_q14_tools.query("want_to_be_familiar_BI_tool == 0")["number_of_tools"]))
print("Mean of number of tools for kagglers who don't want to be familiar with BI tools is:",\
      np.mean(survey_q14_tools.query("want_to_be_familiar_BI_tool == 0")["number_of_tools"]))

In [None]:
scipy.stats.ttest_ind(survey_q14_tools.query("want_to_be_familiar_BI_tool == 1")["number_of_tools"],\
                      survey_q14_tools.query("want_to_be_familiar_BI_tool == 0")["number_of_tools"],\
                      axis=0, equal_var=True, nan_policy='propagate')

There was a difference in the average number of tools available between the group that wanted to become familiar with BI tools and the group that did not, but the difference was small compared with the difference by whether BI tools can be used.

## How long have they been using machine learning?

I will also look at the impact of the number of years of machine learning use. My hypothesis is that the longer kagglers have used machine learning, the more likely they are to be interested in various visualization tools. The number of years of machine learning use is answered in Q15.

Q15
For how many years have you used machine learning methods?
* I do not use machine learning methods
* Under 1 year
* 1-2 years
* 2-3 years
* 3-4 years
* 4-5 years
* 5-10 years
* 10-20 years
* 20 or more years

For the sake of analysis, I will divide these years into three categories.

* short: less than 2 years (including respondents who do not use machine learning). Beginner level period.
* medium: 2 - 5 years. Intermediate level period.
* long: 5 years or more. Advanced level period.

It depends on the person, but I think that after two full years of experience with machine learning, people will reach intermediate level. If they have more than five full years of experience, they are probably advanced.

In [None]:
def categorize_ml_years(years):
    """
    Convert years when people use machine learning methods to three categories.
    - short: "I do not use machine learning methods", "Under 1 year", "1-2 years"
    - medium: "2-3 years", "3-4 years", "4-5 years"
    - long: "5-10 years", "10-20 years", "20 or more years"
    """
    short = ["I do not use machine learning methods", "Under 1 year", "1-2 years"]
    medium = [ "2-3 years", "3-4 years", "4-5 years"]
    
    if pd.isna(years):
        return np.nan
    elif years in short:
        return "short"
    elif years in medium:
        return "medium"
    else:
        return "long"

#Copy count data for q14 and add columns for categorized ml using years.    
survey_q14_years = q14_counts.copy()
survey_q14_years["usage_period"] = survey["Q15"].map(categorize_ml_years)
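
An alternative to the if/elif chain is an explicit lookup table. The sketch below assumes the answer labels match the Q15 options listed above exactly. Unlike the function above, an unexpected label becomes missing instead of falling through to "long". The dictionary name and the toy series are illustrative.

```python
import numpy as np
import pandas as pd

# Explicit mapping from Q15 answers to the three categories.
ML_YEARS_TO_PERIOD = {
    "I do not use machine learning methods": "short",
    "Under 1 year": "short",
    "1-2 years": "short",
    "2-3 years": "medium",
    "3-4 years": "medium",
    "4-5 years": "medium",
    "5-10 years": "long",
    "10-20 years": "long",
    "20 or more years": "long",
}

# Toy answers, including a missing value and a label that is not in the table.
toy_q15 = pd.Series(
    ["Under 1 year", "3-4 years", "20 or more years", np.nan, "unexpected label"]
)

# Labels without an entry in the table (and NaN) map to NaN.
mapped = toy_q15.map(ML_YEARS_TO_PERIOD)

# An ordered categorical keeps short < medium < long in sorts and plots.
period = pd.Categorical(mapped, categories=["short", "medium", "long"], ordered=True)

print(mapped.tolist())
```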

In [None]:
plt.figure(figsize=(20, 10))
sns.set(font_scale=2)
sns.distplot(survey_q14_years.query("usage_period == 'short'")["number_of_tools"],
             rug=False, kde=True,kde_kws={'bw':1},
             bins=np.arange(0,13), label='short')
sns.distplot(survey_q14_years.query("usage_period == 'medium'")["number_of_tools"],
             rug=False, kde=True, kde_kws={'bw':1},
             bins=np.arange(0,13), label='medium')
sns.distplot(survey_q14_years.query("usage_period == 'long'")["number_of_tools"],
             rug=False, kde=True, kde_kws={'bw':1},
             bins=np.arange(0,13), label='long')
plt.xlim(-1, 13)
plt.title("Distribution of the number of tools available per years of experience in machine learning.")
plt.legend()

In [None]:
print(f"Number of data for kagglers who have short years of experience in machine learning is:",\
      len(survey_q14_years.query("usage_period == 'short'")["number_of_tools"]))
print("Mean of number of tools for kagglers who have short years of experience in machine learning is:",\
      np.mean(survey_q14_years.query("usage_period == 'short'")["number_of_tools"]))

In [None]:
print(f"Number of data for kagglers who have medium years of experience in machine learning is:",\
      len(survey_q14_years.query("usage_period == 'medium'")["number_of_tools"]))
print("Mean of number of tools for kagglers who have medium years of experience in machine learning is:",\
      np.mean(survey_q14_years.query("usage_period == 'medium'")["number_of_tools"]))

In [None]:
print(f"Number of data for kagglers who have long years of experience in machine learning is:",\
      len(survey_q14_years.query("usage_period == 'long'")["number_of_tools"]))
print("Mean of number of tools for kagglers who have long years of experience in machine learning is:",\
      np.mean(survey_q14_years.query("usage_period == 'long'")["number_of_tools"]))

In [None]:
#Respondents who skipped Q14 have NaN in number_of_tools here, so omit NaN instead of propagating it.
scipy.stats.ttest_ind(survey_q14_years.query("usage_period == 'short'")["number_of_tools"],\
                      survey_q14_years.query("usage_period == 'medium'")["number_of_tools"],\
                      axis=0, equal_var=True, nan_policy='omit')

We can see that the longer respondents have used machine learning, the greater the number of tools available to them. We can also see that the average does not change much between medium- and long-term users. Note that the t-test above compares only the short and medium groups.
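
To compare all three groups at once instead of one pair, a one-way ANOVA is one option. The block below is a minimal sketch with `scipy.stats.f_oneway` on toy data; it is not the test used in this notebook, and the values are made up.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data in the same shape as survey_q14_years (values are made up).
toy = pd.DataFrame({
    "usage_period": ["short"] * 5 + ["medium"] * 4 + ["long"] * 4,
    "number_of_tools": [1, 2, 1, 2, np.nan, 2, 3, 3, 2, 3, 3, 2, 4],
})

# Drop respondents without a tool count before testing.
clean = toy.dropna(subset=["number_of_tools"])

# One array of tool counts per experience group.
groups = [g["number_of_tools"].to_numpy() for _, g in clean.groupby("usage_period")]

# One-way ANOVA: tests whether all group means are equal.
result = stats.f_oneway(*groups)
print(result.statistic, result.pvalue)
```

A significant result only says that at least one group mean differs; pairwise comparisons would still be needed to tell which.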

## Which tool do they primarily use for analysis?

Does the number of tools they can use differ depending on the tool they use on a daily basis? We can find this out from Q38.

Q38
What is the primary tool that you use at work or school to analyze data? (Include text response)
* Basic statistical software (Microsoft Excel, Google Sheets, etc.)
* Advanced statistical software (SPSS, SAS, etc.)
* Business intelligence software (Salesforce, Tableau, Spotfire, etc.)
* Local development environments (RStudio, JupyterLab, etc.)
* Cloud-based data software & APIs (AWS, GCP, Azure, etc.)
* Other

In [None]:
#The number of tools available will be reused from the previous analysis.
survey_q14_primarily_tools1 = q14_counts.copy()
#survey_q14_tools = survey_q14_tools.reset_index(drop=True)

#Extract the data from the answers in Q38.
#I define "primarily uses a BI tool for analysis" as answering Q38 with
#'Business intelligence software (Salesforce, Tableau, Spotfire, etc.)'.
#"Does not primarily use a BI tool" means that the answer to Q38 is any other option.
df_q38_primary_use_BI = survey[[ col for col in survey.columns if "Q38" in col]].query("Q38 == 'Business intelligence software (Salesforce, Tableau, Spotfire, etc.)'")
                         
df_q38_not_primary_use_BI = survey[[ col for col in survey.columns if "Q38" in col]].query("Q38 != 'Business intelligence software (Salesforce, Tableau, Spotfire, etc.)'").dropna(how='all')
                             

#I let primary_use_BI column store whether or not kaggler uses BI tools primarily for analysis tool.
primary_use_BI = np.array([np.nan for i in survey_q14_primarily_tools1.index])
primary_use_BI[df_q38_primary_use_BI.index] = 1
primary_use_BI[df_q38_not_primary_use_BI.index] = 0
survey_q14_primarily_tools1["primary_use_BI_tool"] = primary_use_BI

#I drop records with null from aggregation.
survey_q14_primarily_tools1 = survey_q14_primarily_tools1.dropna(how='any')

In [None]:
plt.figure(figsize=(20, 10))
sns.set(font_scale=2)
sns.distplot(survey_q14_primarily_tools1.query("primary_use_BI_tool == 1")["number_of_tools"],
             rug=False, kde=True,kde_kws={'bw':1},
             bins=np.arange(0,13), label='primary use BI tool')
sns.distplot(survey_q14_primarily_tools1.query("primary_use_BI_tool == 0")["number_of_tools"],
             rug=False, kde=True, kde_kws={'bw':1},
             bins=np.arange(0,13), label='not primary use BI tool')
plt.xlim(-1, 13)
plt.title("Distribution of the number of tools available per primarily using tool.")
plt.legend()

In [None]:
print(f"Number of data for kagglers who don't primarily use BI tools for analysis is:",\
      len(survey_q14_primarily_tools1.query("primary_use_BI_tool == 0")["number_of_tools"]))
print("Mean of number of tools for kagglers who primarily don't use BI tools for analysis is:",\
      np.mean(survey_q14_primarily_tools1.query("primary_use_BI_tool == 0")["number_of_tools"]))

In [None]:
print(f"Number of data for kagglers who primarily use BI tools for analysis is:",\
      len(survey_q14_primarily_tools1.query("primary_use_BI_tool == 1")["number_of_tools"]))
print("Mean of number of tools for kagglers who primarily use BI tools for analysis is:",\
      np.mean(survey_q14_primarily_tools1.query("primary_use_BI_tool == 1")["number_of_tools"]))

In [None]:
scipy.stats.ttest_ind(survey_q14_primarily_tools1.query("primary_use_BI_tool == 0")["number_of_tools"],\
                      survey_q14_primarily_tools1.query("primary_use_BI_tool == 1")["number_of_tools"],\
                      axis=0, equal_var=True, nan_policy='propagate')

Whether users primarily use BI tools at work or school does not seem to be related to the number of visualization tools they can use.

How about users who primarily use local development environments such as RStudio or JupyterLab?

In [None]:
#The number of tools available will be reused from the previous analysis.
survey_q14_primarily_tools2 = q14_counts.copy()
#survey_q14_tools = survey_q14_tools.reset_index(drop=True)

#Extract the data from the answers in Q38.
#I define "primarily uses a local development environment" as answering Q38 with
#'Local development environments (RStudio, JupyterLab, etc.)'.
#"Does not primarily use a local development environment" means that the answer to Q38 is any other option.
df_q38_primary_use_local = survey[[ col for col in survey.columns if "Q38" in col]].query("Q38 == 'Local development environments (RStudio, JupyterLab, etc.)'")
                         
df_q38_not_primary_use_local = survey[[ col for col in survey.columns if "Q38" in col]].query("Q38 != 'Local development environments (RStudio, JupyterLab, etc.)'").dropna(how='all')                             

#I let primary_use_local store whether or not the kaggler primarily uses a local development environment for analysis.
primary_use_local = np.array([np.nan for i in survey_q14_primarily_tools2.index])
primary_use_local[df_q38_primary_use_local.index] = 1
primary_use_local[df_q38_not_primary_use_local.index] = 0
survey_q14_primarily_tools2["primary_use_local"] = primary_use_local

#I drop records with null from aggregation.
survey_q14_primarily_tools2 = survey_q14_primarily_tools2.dropna(how='any')

In [None]:
plt.figure(figsize=(20, 10))
sns.set(font_scale=2)
sns.distplot(survey_q14_primarily_tools2.query("primary_use_local == 1")["number_of_tools"],
             rug=False, kde=True,kde_kws={'bw':1},
             bins=np.arange(0,13), label='primary use Local development environments')
sns.distplot(survey_q14_primarily_tools2.query("primary_use_local == 0")["number_of_tools"],
             rug=False, kde=True, kde_kws={'bw':1},
             bins=np.arange(0,13), label='not primary use Local development environments')
plt.xlim(-1, 13)
plt.title("Distribution of the number of tools available per primarily using tool.")
plt.legend()

In [None]:
print(f"Number of data for kagglers who don't primarily use Local development environments for analysis is:",\
      len(survey_q14_primarily_tools2.query("primary_use_local == 0")["number_of_tools"]))
print("Mean of number of tools for kagglers who primarily don't use Local development environments for analysis is:",\
      np.mean(survey_q14_primarily_tools2.query("primary_use_local == 0")["number_of_tools"]))

In [None]:
print(f"Number of data for kagglers who primarily use Local development environments for analysis is:",\
      len(survey_q14_primarily_tools2.query("primary_use_local == 1")["number_of_tools"]))
print("Mean of number of tools for kagglers who primarily use Local development environments for analysis is:",\
      np.mean(survey_q14_primarily_tools2.query("primary_use_local == 1")["number_of_tools"]))

In [None]:
scipy.stats.ttest_ind(survey_q14_primarily_tools2.query("primary_use_local == 0")["number_of_tools"],\
                      survey_q14_primarily_tools2.query("primary_use_local == 1")["number_of_tools"],\
                      axis=0, equal_var=True, nan_policy='propagate')

We can see that the average number of tools is about 0.6 higher for users who primarily use a local development environment at work or school.

# What is the impact of BI tools among the factors that may be influencing?

From the previous analysis, the following factors were found to be related to the average number of tools available.

* Can use, or is interested in, BI tools.
* Has been using machine learning for a medium period or longer.
* Primarily uses a local development environment at work or school.
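
Before looking at each group one by one, a two-way table of group means gives a quick overview of how two factors combine. The sketch below uses `DataFrame.pivot_table` on toy data; the column names mirror the ones used in this notebook, but the values are made up.

```python
import pandas as pd

# Toy data: ML experience, BI tool flag, and number of visualization tools.
toy = pd.DataFrame({
    "usage_period": ["short", "short", "short", "short", "long", "long", "long", "long"],
    "can_use_BI_tool": [0, 0, 1, 1, 0, 0, 1, 1],
    "number_of_tools": [1, 2, 2, 3, 2, 2, 3, 4],
})

# Mean number of tools for every (experience, BI flag) pair.
table = toy.pivot_table(
    index="usage_period",
    columns="can_use_BI_tool",
    values="number_of_tools",
    aggfunc="mean",
)

# Difference between BI users and non-users within each experience group.
bi_gap = table[1] - table[0]

print(table)
print(bi_gap)
```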

## ML experience vs. whether or not they can use BI tools

In [None]:
survey_q14_years["can_use_BI_tool"] = can_use_BI_tool
survey_q14_years = survey_q14_years.dropna(how='any')
survey_q14_years.head()

### In short period

First, let's look at the distribution of the number of available tools among the population of users with short ML experience, divided by whether they can use BI tools or not.

In [None]:
plt.figure(figsize=(20, 10))
sns.set(font_scale=2)
sns.distplot(survey_q14_years.query("usage_period == 'short' and can_use_BI_tool == 1")["number_of_tools"],
             rug=False, kde=True,kde_kws={'bw':1},
             bins=np.arange(0,13), label='can use BI tool')
sns.distplot(survey_q14_years.query("usage_period == 'short' and can_use_BI_tool == 0")["number_of_tools"],
             rug=False, kde=True, kde_kws={'bw':1},
             bins=np.arange(0,13), label='can not use BI tool')
plt.xlim(-1, 13)
plt.title("Distribution of the number of tools available per use of BI tool \n when kaggler uses machine learning in short period.")
plt.legend()

In [None]:
print("Number of kagglers with short ML experience who cannot use a BI tool:",
      len(survey_q14_years.query("usage_period == 'short' and can_use_BI_tool == 0")["number_of_tools"]))
print("Mean number of tools for kagglers with short ML experience who cannot use a BI tool:",
      np.mean(survey_q14_years.query("usage_period == 'short' and can_use_BI_tool == 0")["number_of_tools"]))

In [None]:
print("Number of kagglers with short ML experience who can use a BI tool:",
      len(survey_q14_years.query("usage_period == 'short' and can_use_BI_tool == 1")["number_of_tools"]))
print("Mean number of tools for kagglers with short ML experience who can use a BI tool:",
      np.mean(survey_q14_years.query("usage_period == 'short' and can_use_BI_tool == 1")["number_of_tools"]))

In [None]:
scipy.stats.ttest_ind(survey_q14_years.query("usage_period == 'short' and can_use_BI_tool == 0")["number_of_tools"],\
                      survey_q14_years.query("usage_period == 'short' and can_use_BI_tool == 1")["number_of_tools"],\
                      axis=0, equal_var=True, nan_policy='propagate')

I can see that the average number of tools users can use is 0.5 higher if they can use BI tools.

### Medium ML experience

Next, let's look at the distribution of the number of available tools among the population of users with medium ML experience, divided by whether they can use BI tools or not.

In [None]:
plt.figure(figsize=(20, 10))
sns.set(font_scale=2)
sns.distplot(survey_q14_years.query("usage_period == 'medium' and can_use_BI_tool == 1")["number_of_tools"],
             rug=False, kde=True,kde_kws={'bw':1},
             bins=np.arange(0,13), label='can use BI tool')
sns.distplot(survey_q14_years.query("usage_period == 'medium' and can_use_BI_tool == 0")["number_of_tools"],
             rug=False, kde=True, kde_kws={'bw':1},
             bins=np.arange(0,13), label='can not use BI tool')
plt.xlim(-1, 13)
plt.title("Distribution of the number of tools available per use of BI tool \n when kaggler uses machine learning in medium period.")
plt.legend()

In [None]:
print("Number of kagglers with medium ML experience who cannot use a BI tool:",
      len(survey_q14_years.query("usage_period == 'medium' and can_use_BI_tool == 0")["number_of_tools"]))
print("Mean number of tools for kagglers with medium ML experience who cannot use a BI tool:",
      np.mean(survey_q14_years.query("usage_period == 'medium' and can_use_BI_tool == 0")["number_of_tools"]))

In [None]:
print("Number of kagglers with medium ML experience who can use a BI tool:",
      len(survey_q14_years.query("usage_period == 'medium' and can_use_BI_tool == 1")["number_of_tools"]))
print("Mean number of tools for kagglers with medium ML experience who can use a BI tool:",
      np.mean(survey_q14_years.query("usage_period == 'medium' and can_use_BI_tool == 1")["number_of_tools"]))

In [None]:
scipy.stats.ttest_ind(survey_q14_years.query("usage_period == 'medium' and can_use_BI_tool == 0")["number_of_tools"],\
                      survey_q14_years.query("usage_period == 'medium' and can_use_BI_tool == 1")["number_of_tools"],\
                      axis=0, equal_var=True, nan_policy='propagate')

Here too, I can see that the average number of tools users can use is 0.6 higher if they can use BI tools.

### Long ML experience

Last, let's look at the distribution of the number of available tools among the population of users with long ML experience, divided by whether they can use BI tools or not.

In [None]:
plt.figure(figsize=(20, 10))
sns.set(font_scale=2)
sns.distplot(survey_q14_years.query("usage_period == 'long' and can_use_BI_tool == 1")["number_of_tools"],
             rug=False, kde=True,kde_kws={'bw':1},
             bins=np.arange(0,13), label='can use BI tool')
sns.distplot(survey_q14_years.query("usage_period == 'long' and can_use_BI_tool == 0")["number_of_tools"],
             rug=False, kde=True, kde_kws={'bw':1},
             bins=np.arange(0,13), label='can not use BI tool')
plt.xlim(-1, 13)
plt.title("Distribution of the number of tools available per use of BI tool \n when kaggler uses machine learning in long period.")
plt.legend()

In [None]:
print("Number of kagglers with long ML experience who cannot use a BI tool:",
      len(survey_q14_years.query("usage_period == 'long' and can_use_BI_tool == 0")["number_of_tools"]))
print("Mean number of tools for kagglers with long ML experience who cannot use a BI tool:",
      np.mean(survey_q14_years.query("usage_period == 'long' and can_use_BI_tool == 0")["number_of_tools"]))

In [None]:
print("Number of kagglers with long ML experience who can use a BI tool:",
      len(survey_q14_years.query("usage_period == 'long' and can_use_BI_tool == 1")["number_of_tools"]))
print("Mean number of tools for kagglers with long ML experience who can use a BI tool:",
      np.mean(survey_q14_years.query("usage_period == 'long' and can_use_BI_tool == 1")["number_of_tools"]))

In [None]:
scipy.stats.ttest_ind(survey_q14_years.query("usage_period == 'long' and can_use_BI_tool == 0")["number_of_tools"],\
                      survey_q14_years.query("usage_period == 'long' and can_use_BI_tool == 1")["number_of_tools"],\
                      axis=0, equal_var=True, nan_policy='propagate')

Again, I can see that the average number of tools users can use is 0.7 higher if they can use BI tools.
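The three comparisons above can also be summarized in a single table. The following is a minimal sketch, assuming a frame with the same column names as `survey_q14_years`. The data are randomly generated for illustration and are not the survey answers, so the printed numbers do not correspond to the results reported in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic frame with the same column names as survey_q14_years (NOT the survey data).
rng = np.random.default_rng(1)
n = 900
df = pd.DataFrame({
    "usage_period": rng.choice(["short", "medium", "long"], size=n),
    "can_use_BI_tool": rng.integers(0, 2, size=n),
})
# Made-up effect: BI tool users know 0.6 more tools on average.
df["number_of_tools"] = rng.poisson(lam=2.0 + 0.6 * df["can_use_BI_tool"].to_numpy())

# Mean number of tools per (usage_period, can_use_BI_tool) cell,
# with one row per experience level and one column per BI flag.
means = (
    df.groupby(["usage_period", "can_use_BI_tool"])["number_of_tools"]
      .mean()
      .unstack("can_use_BI_tool")
      .reindex(["short", "medium", "long"])
)
means["difference"] = means[1] - means[0]
print(means.round(2))
```

Laying the groups out side by side makes it easier to see whether the gap between BI tool users and non-users changes with ML experience.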

## Using a local environment for work vs. whether users can use BI tools

In the previous analysis, I found that users who use a local development environment for work have a larger number of analysis tools available to them. So, even among those who work in a local environment on a daily basis, is there a difference in the number of tools available between people who can use BI tools and those who cannot?

In [None]:
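survey_q14_tools = q14_counts.copy()
#survey_q14_tools = survey_q14_tools.reset_index(drop=True)

# Note: despite its name, "primary_use_BI" holds the local-environment flag (primary_use_local).
survey_q14_tools["primary_use_BI"] = primary_use_local
survey_q14_tools["can_use_BI_tool"] = can_use_BI_tool
# dropna returns a new frame, so assign it back to actually drop incomplete rows.
survey_q14_tools = survey_q14_tools.dropna(how='any')
survey_q14_tools.head()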

In [None]:
plt.figure(figsize=(20, 10))
sns.set(font_scale=2)
sns.distplot(survey_q14_tools.query("primary_use_BI == 1.0 and can_use_BI_tool == 1")["number_of_tools"],
             rug=False, kde=True,kde_kws={'bw':1},
             bins=np.arange(0,13), label='can use BI tool')
sns.distplot(survey_q14_tools.query("primary_use_BI == 1.0 and can_use_BI_tool == 0")["number_of_tools"],
             rug=False, kde=True, kde_kws={'bw':1},
             bins=np.arange(0,13), label='can not use BI tool')
plt.xlim(-1, 13)
plt.title("Distribution of the number of tools available by whether users can use a BI tool,\n for users who usually analyze in a local environment.")
plt.legend()

In [None]:
# "primary_use_BI" holds the local-environment flag, so these labels describe local-environment users.
print("Number of kagglers who primarily use a local environment and cannot use a BI tool:",
      len(survey_q14_tools.query("primary_use_BI == 1.0 and can_use_BI_tool == 0")["number_of_tools"]))
print("Mean number of tools for kagglers who primarily use a local environment and cannot use a BI tool:",
      np.mean(survey_q14_tools.query("primary_use_BI == 1.0 and can_use_BI_tool == 0")["number_of_tools"]))

In [None]:
# "primary_use_BI" holds the local-environment flag, so these labels describe local-environment users.
print("Number of kagglers who primarily use a local environment and can use a BI tool:",
      len(survey_q14_tools.query("primary_use_BI == 1.0 and can_use_BI_tool == 1")["number_of_tools"]))
print("Mean number of tools for kagglers who primarily use a local environment and can use a BI tool:",
      np.mean(survey_q14_tools.query("primary_use_BI == 1.0 and can_use_BI_tool == 1")["number_of_tools"]))

In [None]:
scipy.stats.ttest_ind(survey_q14_tools.query("primary_use_BI == 1.0 and can_use_BI_tool == 0")["number_of_tools"],\
                      survey_q14_tools.query("primary_use_BI == 1.0 and can_use_BI_tool == 1")["number_of_tools"],\
                      axis=0, equal_var=True, nan_policy='propagate')

Even among users who usually analyze in their local environment, I found that those who can use BI tools can use more visualization tools. The average number of tools is 0.5 higher if they can use BI tools.

From the above analysis, I found that even when I categorize users, those who can use BI tools have more tools available than those who cannot. BI tools make it very easy to do various kinds of visualization (from simple line plots and scatter plots to geographical and interactive plots). So I think users of BI tools become interested in learning about the various visualization methods and how they can be implemented in Jupyter notebooks.
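One way to look at all three factors at once, instead of splitting the users group by group, is an ordinary least squares fit. The following is a minimal sketch on synthetic data. The variable names mirror this notebook, but the data and the effect sizes are made up for illustration and are not results from the survey.

```python
import numpy as np

# Synthetic data (NOT the survey data); the effect sizes below are made up.
rng = np.random.default_rng(2)
n = 2000
can_use_bi = rng.integers(0, 2, size=n)
uses_local = rng.integers(0, 2, size=n)
experience = rng.integers(0, 3, size=n)  # 0 = short, 1 = medium, 2 = long
number_of_tools = (
    1.5 + 0.5 * can_use_bi + 0.6 * uses_local + 0.4 * experience
    + rng.normal(0, 1, size=n)
)

# Design matrix: intercept, BI flag, local flag, dummy columns for medium and long.
X = np.column_stack([
    np.ones(n),
    can_use_bi,
    uses_local,
    (experience == 1).astype(float),
    (experience == 2).astype(float),
])
coef, *_ = np.linalg.lstsq(X, number_of_tools, rcond=None)

names = ["intercept", "can_use_BI_tool", "primary_use_local", "medium", "long"]
for name, value in zip(names, coef):
    print(f"{name:>18}: {value:.2f}")
```

The coefficient on `can_use_BI_tool` is then the difference in the mean number of tools associated with BI tools while holding the other two factors fixed. It describes an association, not a causal effect.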

# What measures can be implemented?

The analysis so far has shown that users who can use BI tools can also use more visualization tools. Then, what measures can be implemented in kaggle? To answer this question, I want to analyze Q32. In this question, the respondents answered which BI tool they use most often. From the answers, I can identify the major BI tools and select good ones to recommend to kagglers.

Q32
Which of the following business intelligence tools do you use most often?

* Amazon QuickSight
* Microsoft Power BI
* Google Data Studio
* Looker
* Tableau
* Salesforce
* Einstein Analytics
* Qlik
* Domo
* TIBCO Spotfire
* Alteryx
* Sisense
* SAP Analytics Cloud
* None
* Other


Let's count it up.

In [None]:
plt.figure(figsize=(15, 5))
sns.set(font_scale=1)
g_bi = sns.countplot(data=survey[["Q32"]].loc[1:,:], x="Q32",
                     order=survey.loc[1:,:]["Q32"].value_counts().index)
g_bi.set_xticklabels(g_bi.get_xticklabels(), rotation=90)

Tableau and Microsoft Power BI occupy first and second place. Google Data Studio is in third place. I think it would be good to devise ways to encourage kagglers to use Google Data Studio more often, for the following reasons.

* Google Data Studio is free, so everyone can use it.
* Kaggle is supported by Google.
* Although it is in third place, some kagglers already use Google Data Studio.
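The ranking can also be expressed as shares instead of raw counts, which makes the gap between the tools easier to compare. The following is a minimal sketch on a toy `Q32`-like column. The answers are invented for illustration and are not the survey data.

```python
import pandas as pd

# Toy answers standing in for survey["Q32"] (NOT the survey data).
answers = pd.Series(
    ["Tableau", "Microsoft Power BI", "Tableau", "Google Data Studio",
     None, "Microsoft Power BI", "Tableau", "Qlik", None, "Google Data Studio"],
    name="Q32",
)

# value_counts drops missing answers by default; normalize=True returns shares
# among the respondents who answered, sorted from most to least common.
shares = answers.value_counts(normalize=True)
print((shares * 100).round(1))
```

On the real data, the first row of `survey` holds the question text, so it would need to be skipped first, as the plotting cell does with `.loc[1:, :]`.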

One good idea is to add content about BI tools to Kaggle Courses. Currently, no such course is publicly available.

<img src="https://raw.githubusercontent.com/tasotasoso/kaggle_media/main/kaggle_Survey2020/cources.JPG" width="300">

It is also necessary to publish notebooks and discussions that show examples of use. Currently, some kagglers seem to be publishing notebooks related to Google Data Studio, but there is still room to grow. If you compare the search results for Google Data Studio with the search results for pytorch, you will see that Google Data Studio could still become much more active in the kaggle community.

<img src="https://raw.githubusercontent.com/tasotasoso/kaggle_media/main/kaggle_Survey2020/Search_result_for_GoogleDataStudio.JPG" width="1000">

<img src="https://raw.githubusercontent.com/tasotasoso/kaggle_media/main/kaggle_Survey2020/Search_result_for_pytorch.JPG" width="1000">

## Future work - Check causal effect

As future work, I want to verify whether the ability to use BI tools actually causes the increase in the number of visualization tools available.

There are many methods for this, but one option is [LiNGAM](https://github.com/cdt15/lingam), which can estimate causal effects. For how to use it for this purpose, see [this reference](https://github.com/cdt15/lingam/blob/master/examples/CausalEffect.ipynb).

In [None]:
! pip install lingam
import lingam
import graphviz

In [None]:
def make_graph(adjacency_matrix, labels=None):
    """Draw the estimated causal graph, keeping edges with |coefficient| > 0.01."""
    idx = np.abs(adjacency_matrix) > 0.01
    dirs = np.where(idx)
    d = graphviz.Digraph(engine='dot')
    names = labels if labels else [f'x{i}' for i in range(len(adjacency_matrix))]
    # adjacency_matrix[to, from] holds the coefficient of the edge from -> to.
    for to, from_, coef in zip(dirs[0], dirs[1], adjacency_matrix[idx]):
        d.edge(names[from_], names[to], label=f'{coef:.2f}')
    return d

Unfortunately, the correlation is not high.

In [None]:
survey_q14_years[["number_of_tools", "can_use_BI_tool"]].corr()
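
Because `can_use_BI_tool` is binary, the Pearson correlation computed here is a point-biserial correlation. It can be small even when the two groups differ clearly in their means. The following is a minimal sketch on synthetic data (not the survey data) that checks both quantities side by side. The effect size of 0.6 is made up for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data (NOT the survey data).
rng = np.random.default_rng(3)
can_use_bi = rng.integers(0, 2, size=1000)
number_of_tools = rng.poisson(lam=2.0 + 0.6 * can_use_bi)

# Pearson correlation between a binary flag and a count variable ...
pearson_r = np.corrcoef(can_use_bi, number_of_tools)[0, 1]
# ... is the same quantity as the point-biserial correlation.
r_pb, p_value = stats.pointbiserialr(can_use_bi, number_of_tools)
# Difference in group means, for comparison with the correlation.
mean_diff = (number_of_tools[can_use_bi == 1].mean()
             - number_of_tools[can_use_bi == 0].mean())

print(f"Pearson r:        {pearson_r:.3f}")
print(f"point-biserial r: {r_pb:.3f}")
print(f"mean difference:  {mean_diff:.3f}")
```

A low correlation therefore does not contradict the mean differences found by the t-tests. It mainly reflects that the spread within each group is large compared with the gap between them.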

The scatter plot looks like this.

In [None]:
sns.scatterplot(data=survey_q14_years, x="can_use_BI_tool", y="number_of_tools", hue="usage_period")

In [None]:
model = lingam.DirectLiNGAM()
model.fit(survey_q14_years[["number_of_tools", "can_use_BI_tool"]])
labels = [f'{i}. {col}' for i, col in enumerate(survey_q14_years[["number_of_tools", "can_use_BI_tool"]].columns)]
make_graph(model.adjacency_matrix_, labels)

Because the correlation between the two variables is not high, the result is only a reference value. Still, the estimated graph suggests that the ability to use BI tools influences the number of visualization tools available.