# **Table of contents**

* [Overview](#overview)
* [Q7 Programming language](#Q7)
* [Who uses R?](#R)
* [IDEs](#IDEs)


In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import warnings
warnings.filterwarnings('ignore')

### **Set colors**

In [None]:
import matplotlib.cm as cm
from matplotlib.colors import rgb2hex

cmap = plt.get_cmap('coolwarm_r', 12)  # colormap name and number of colors; cm.get_cmap is deprecated
col_def =[]
for i in range(cmap.N):
    rgb = cmap(i)[:3]
    col_def.append(rgb2hex(rgb))
    print(rgb2hex(rgb))

### **Data import**

In [None]:
df = pd.read_csv("/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv", low_memory = False)
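df_Q = df[0:1]        # first row holds the question text
df_A = df[1:].copy()  # answers only; copy so new columns can be added without chained-assignment issues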

In [None]:
df.head(3)

<a id='overview'></a>
# **Overview**
The table below lists, for each question, the question number, the question text, the number of unique answers, and the most frequent answer. I look at these first and pick the questions that interest me.  
Age, gender, country, education, programming experience, languages used, and platforms used are generally of interest. In particular, I usually use R more than Python, but R is a minority language on Kaggle, so I want to know how many Kaggle participants use R as well as Python and what kind of people they are.  
I'm also interested in IDE usage. I think most R users work in RStudio (I do too). RStudio can actually run Python as well, but even I, a heavy R user, switch to Google Colab or Jupyter Notebook when I write Python. I wondered what other people do.

In [None]:
pd.set_option('display.max_rows', 30) 
pd.set_option('display.max_columns', 10)
pd.set_option("display.max_colwidth", 200)

pd.concat([df_Q, df_A.describe(include = "all")], axis=0).T[0:20]
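
To make the layout of this overview table concrete, here is a minimal sketch on a toy frame that mimics the survey file (first row holds the question text, the remaining rows hold answers). The names `toy`, `toy_Q`, `toy_A`, and `overview`, as well as the answer values, are made up for illustration.

```python
import pandas as pd

# Toy data: row 0 is the question text, the rest are answers (hypothetical values).
toy = pd.DataFrame({
    "Q1": ["What is your age?", "18-21", "25-29", "18-21"],
    "Q2": ["What is your gender?", "Man", "Woman", "Man"],
})
toy_Q = toy[0:1]   # question text
toy_A = toy[1:]    # answers only

# describe(include="all") on text columns yields count, unique, top, and freq.
overview = pd.concat([toy_Q, toy_A.describe(include="all")], axis=0).T
overview.columns = ["question", "count", "unique", "top", "freq"]
print(overview)
```

Each row of `overview` is one question, so `top` and `freq` can be scanned quickly to decide which questions deserve a closer look.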

## **Background**
Visualization of the data from Q1 to Q6.

### **Q1 Age**

In [None]:
fig, axes = plt.subplots(1,2, figsize=(12, 6))
counts = df_A["Q1"].value_counts()
sns.barplot(x = counts.values, y=counts.index, palette=col_def,
            edgecolor = "black",ax=axes[0])
axes[0].set_title("Age", fontsize=20)
axes[1].pie(x= counts,labels = counts.index, colors=col_def,autopct='%.0f%%',
           explode=[0.03 for i in counts.index])
axes[1].add_artist(plt.Circle((0,0),0.4,fc='white'))

plt.show()

### **Q2 Gender**

In [None]:
fig, axes = plt.subplots(1,2, figsize=(12, 6))
counts = df_A["Q2"].value_counts()
sns.barplot(x = counts.values, y=counts.index, palette=col_def,
            edgecolor = "black",ax=axes[0])
axes[0].set_title("Gender", fontsize=20)
axes[1].pie(x= counts,labels = counts.index, colors=col_def,autopct='%.0f%%',
           explode=[0.02, 0.05, 0.2, 0.6, 1.0])
axes[1].add_artist(plt.Circle((0,0),0.4,fc='white'))

plt.show()


### **Q3 Country**  
The top six countries (India, USA, Japan, China, Brazil, and Russia) account for more than 50% of all respondents.  
India in particular has the largest share, at 29%.  

In [None]:
counts = df_A["Q3"].value_counts()
top = counts.head(12).index
df_A["Q3-2"] = ["UK" if i == 'United Kingdom of Great Britain and Northern Ireland' else
              i if i in top else
              "Other" for i in df_A["Q3"]]

In [None]:
fig, axes = plt.subplots(1,2, figsize=(12, 6))
counts = df_A["Q3-2"].value_counts()
sns.barplot(x = counts.values, y=counts.index, palette=col_def,
            edgecolor = "black", ax = axes[0])
axes[0].set_title("Country", fontsize=20)

axes[1].pie(x= counts,labels = counts.index, colors=col_def,autopct='%.0f%%', pctdistance=0.7,
           explode=[0.03 for i in counts.index])
axes[1].add_artist(plt.Circle((0,0),0.4,fc='white'))

plt.show()
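
A statement such as "the top N countries cover more than half of the respondents" can be checked with a cumulative share. The sketch below uses a made-up `countries` series rather than the survey data, so the numbers are purely illustrative.

```python
import pandas as pd

# Hypothetical answers, not the survey data.
countries = pd.Series(["India"] * 4 + ["USA"] * 3 + ["Japan"] * 2 + ["Brazil"] * 1)

share = countries.value_counts(normalize=True)  # sorted, largest share first
cum_share = share.cumsum()

# Number of top countries needed to cover at least half of the respondents.
n_needed = int((cum_share < 0.5).sum()) + 1
print(cum_share.round(2).to_dict())
print("countries needed for 50%:", n_needed)
```

Applied to the real data, the same two lines (`value_counts(normalize=True)` followed by `cumsum()`) would be run on `df_A["Q3"]`.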

### **Q4 Education**

In [None]:
df_A["Q4-2"] = ["University" if i == "Some college/university study without earning a bachelor’s degree" else
               i for i in df_A["Q4"]]

In [None]:
fig, axes = plt.subplots(1,2, figsize=(15, 6))
counts = df_A["Q4-2"].value_counts()
sns.barplot(x = counts.values, y=counts.index, palette=col_def,
            edgecolor = "black",ax=axes[0])
axes[0].set_title("Education", fontsize=20)
axes[1].pie(x= counts,labels = counts.index, colors=col_def, autopct='%.0f%%',
           explode=[0.03 for i in counts.index])
axes[1].add_artist(plt.Circle((0,0),0.4,fc='white'))

plt.show()

### **Q5 Title**

In [None]:
fig, axes = plt.subplots(1,2, figsize=(15, 6))
counts = df_A["Q5"].value_counts()
sns.barplot(x = counts.values, y=counts.index, palette=col_def,
            edgecolor = "black",ax=axes[0])
axes[0].set_title("Title", fontsize=20)
axes[1].pie(x= counts,labels = counts.index, colors=col_def,autopct='%.0f%%',
           explode=[0.03 for i in counts.index])
axes[1].add_artist(plt.Circle((0,0),0.4,fc='white'))

plt.show()

### **Q6 Programming experience**

In [None]:
fig, axes = plt.subplots(1,2, figsize=(12, 6))
counts = df_A["Q6"].value_counts()
sns.barplot(x = counts.values, y=counts.index, palette=col_def,
            edgecolor = "black",ax=axes[0])
axes[0].set_title("Programming Experience", fontsize=20)
axes[1].pie(x= counts,labels = counts.index, colors=col_def,autopct='%.0f%%',
           explode=[0.03 for i in counts.index])
axes[1].add_artist(plt.Circle((0,0),0.4,fc='white'))

plt.show()

### **Age and Education (Q1 x Q4)**


In [None]:
fig, axes = plt.subplots(2,1, figsize= (15,12))
age_order = ['18-21', '22-24', '25-29','30-34','35-39', '40-44','45-49', '50-54','55-59', '60-69', '70+']
sns.countplot(x = "Q1", hue = "Q4-2", order = age_order, ec="black", data=df_A, ax=axes[0])
axes[0].set_title("Age and Education (Q1 x Q4)", fontsize=20)

pivot = df_A.pivot_table(index="Q1", columns = "Q4-2",values="Q5" , aggfunc = "count")
sns.heatmap(pivot.T, cmap = "Blues", annot=True, linewidths=0.005, linecolor='gray',  ax=axes[1])

plt.show()

### **Age and Title (Q1 x Q5)**


In [None]:
fig, axes = plt.subplots(2,1, figsize= (15,12))
age_order = ['18-21', '22-24', '25-29','30-34','35-39', '40-44','45-49', '50-54','55-59', '60-69', '70+']
sns.countplot(x = "Q1", hue = "Q5", order = age_order, ec="black", data=df_A, ax=axes[0])
axes[0].set_title("Age and Title (Q1 x Q5)", fontsize=20)

pivot = df_A.pivot_table(index="Q1", columns = "Q5",values="Q4" , aggfunc = "count")
sns.heatmap(pivot.T, cmap = "Blues", annot=True, linewidths=0.005, linecolor='gray',  ax=axes[1])

plt.show()

### **Education and Title (Q4 x Q5)**


In [None]:
fig, axes = plt.subplots(2,1, figsize= (15,15))
plt.subplots_adjust(wspace=0.4, hspace=1)
sns.countplot(x = "Q4-2", hue = "Q5", ec="black", data=df_A, ax=axes[0])
axes[0].legend(loc = 'upper right')
axes[0].set_xticklabels(axes[0].get_xticklabels(),rotation = 90)
axes[0].set_title("Education and Title (Q4 x Q5)", fontsize=20)

pivot = df_A.pivot_table(index="Q4-2", columns = "Q5",values="Q1" , aggfunc = "count")
sns.heatmap(pivot.T, cmap = "Blues", annot=True, linewidths=0.005, linecolor='gray',  ax=axes[1])

plt.show()

<a id='Q7'></a>
# **Q7 Programming language**
Over 80% of Kagglers use Python, either alone or in combination with other languages.  


In [None]:
Q7_cols = []
for i in df_A.columns:
    if "Q7" in i:
        Q7_cols.append(i)
        
df_Q7 = df_A[Q7_cols]
df_Q7_desc = df_Q7.describe(include="all").T.sort_values("freq", ascending=False)
df_Q7_desc["prop"] = df_Q7_desc["freq"] / df_Q7.shape[0]

In [None]:
df_Q7_desc[["top", "freq", "prop"]]

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x="top", y="prop", data= df_Q7_desc, palette = col_def, edgecolor = "black")
plt.title("Programming language (Overall)", fontsize=20)

plt.show()
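
Q7 is a multi-select question stored as one column per option: a cell holds the option name if the respondent ticked it and is empty otherwise. As a minimal sketch under that assumption, the share of respondents per option is just the share of non-missing cells. The frame `toy_Q7` below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical multi-select answers: one column per option, NaN = not selected.
toy_Q7 = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", np.nan, "Python"],
    "Q7_Part_2": [np.nan, "R", "R", np.nan],
    "Q7_Part_3": ["SQL", np.nan, np.nan, "SQL"],
})

# Label each column by the option it encodes, then take the share selected.
labels = toy_Q7.apply(lambda s: s.dropna().iloc[0]).to_dict()
prop = toy_Q7.notna().mean().rename(index=labels).sort_values(ascending=False)
print(prop)
```

This gives the same proportions as dividing `freq` from `describe()` by the number of rows, because each column contains only one distinct non-missing value.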

### **Python users**
SQL is the language most often used alongside Python: around 40% of Python users also use SQL.  

In [None]:
#Python user
df_Python = df_Q7[df_Q7["Q7_Part_1"]=="Python"]
df_Python_desc = df_Python.describe(include="all").T.sort_values("freq", ascending=False)
df_Python_desc["prop"] = df_Python_desc["freq"] / df_Python.shape[0]
df_Python_desc.drop(["Q7_Part_1"])

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x="top", y="prop", data = df_Python_desc.drop(["Q7_Part_1"]), 
            palette = col_def, edgecolor = "black")
plt.title("Programming languages used by Python users", fontsize=20)

plt.show()
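
The "used together" rate is a conditional share: among respondents who use one language, what fraction also uses another. A small sketch with a hypothetical boolean table `used` and a hypothetical helper `co_usage` (neither is part of the notebook):

```python
import pandas as pd

# Hypothetical respondents; True means the language was selected.
used = pd.DataFrame({
    "Python": [True, True, True, True, False],
    "R":      [False, True, False, False, True],
    "SQL":    [True, True, False, False, False],
})

def co_usage(used, base):
    """Share of `base` users who also use each of the other languages."""
    subset = used[used[base]]                  # keep only users of `base`
    return subset.drop(columns=base).mean().sort_values(ascending=False)

print(co_usage(used, "Python"))
```

On the survey data, such a boolean table could be built with `df_Q7.notna()`, assuming the one-column-per-option layout described for Q7.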

### **SQL users**
89% of SQL users also use Python! This suggests that many of them preprocess data with SQL and then do the subsequent analysis in Python. About 21% are also R users, so a fair number probably use Python, R, and SQL together.  

In [None]:
#SQL user
df_SQL = df_Q7[df_Q7["Q7_Part_3"]=="SQL"]
df_SQL_desc = df_SQL.describe(include="all").T.sort_values("freq", ascending=False)
df_SQL_desc["prop"] = df_SQL_desc["freq"] / df_SQL.shape[0]
df_SQL_desc.drop(["Q7_Part_3"])

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x="top", y="prop", data = df_SQL_desc.drop(["Q7_Part_3"]), 
            palette = col_def, edgecolor = "black")
plt.title("Programming languages used by SQL users", fontsize=20)
plt.show()

### **R users**
86% of R users also use Python, and 58% also use SQL. As expected, most of them use Python as well.  
Like me, many of them may use R as their main language in daily work (statistical analysis, etc.) but use Python when they participate in Kaggle.  
Given the nature of Kaggle, competing without Python is tough, so I think there is a selection bias compared with the overall population of people working in data science.

In [None]:
#R user
df_R = df_Q7[df_Q7["Q7_Part_2"]=="R"]
df_R_desc = df_R.describe(include="all").T.sort_values("freq", ascending=False)
df_R_desc["prop"] = df_R_desc["freq"] / df_R.shape[0]
df_R_desc.drop(["Q7_Part_2"])

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x="top", y="prop", data = df_R_desc.drop(["Q7_Part_2"]), 
            palette = col_def, edgecolor = "black")
plt.title("Programming languages used by R users", fontsize=20)
plt.show()
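
The Python, SQL, and R views above can also be read as rows of a single co-usage matrix. The sketch below, on a hypothetical 0/1 table `used`, computes pairwise counts with a matrix product and then divides each row by that language's user count.

```python
import numpy as np
import pandas as pd

# Hypothetical respondents; 1 means the language was selected.
used = pd.DataFrame({
    "Python": [1, 1, 1, 1, 0],
    "R":      [0, 1, 0, 0, 1],
    "SQL":    [1, 1, 0, 0, 0],
})

counts = used.T @ used   # counts.loc[a, b] = respondents using both a and b
n_users = pd.Series(np.diag(counts), index=counts.index)  # diagonal = users per language
rates = counts.div(n_users, axis=0)  # row a: share of a-users who also use b
print(rates.round(2))
```

The matrix is not symmetric: `rates.loc["SQL", "Python"]` and `rates.loc["Python", "SQL"]` answer different questions, which is why the share of SQL users who use Python can be much higher than the share of Python users who use SQL.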

<a id='R'></a>
# **Who uses R?**  

I usually use R for my work, but R is a minority language on Kaggle, so I want to know how many Kaggle participants use R as well as Python, and what kind of people they are.  
I'm also interested in IDE usage. I think most R users work in RStudio (I do too). RStudio can actually run Python as well, but even I, a heavy R user, switch to Google Colab or Jupyter Notebook when I write Python. I wondered what other people do.

## **R-only users, Python-only users, BOTH R and Python users**  
From here on, I focus the analysis on R and Python only.  
1.  I classify Kagglers into those who use both R and Python, those who use only one of them, and those who use neither.  
2.  I then look at the background factors for each group.

Since 86% of R users also use Python, I looked at what characterizes the majority of R users on Kaggle.  
As we have seen so far, Kaggle participants overall have the following characteristics:

* Many are young (30 or under)
* Many are students
* Many are from India
* Many hold a Bachelor's or Master's degree

### **New column for the language classification**  
1. R: uses R but not Python  
2. Python: uses Python but not R  
3. Both: uses both R and Python  
4. NoUser: uses neither R nor Python  

Add a new column holding one of the above four categories.  

In [None]:
def RorPython(x):
    # Classify one respondent (one row) by R / Python usage in Q7.
    uses_r = x["Q7_Part_2"] == "R"
    uses_python = x["Q7_Part_1"] == "Python"
    if uses_r and uses_python:
        return "Both"
    if uses_r:
        return "R"
    if uses_python:
        return "Python"
    return "NoUser"
    

In [None]:
df_A["R_or_Python"] = df_A.apply(RorPython, axis=1)
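
A row-wise `apply` works, but the same four-way classification can also be written in vectorized form. This is a sketch of an alternative, not the method used in this notebook. It uses `numpy.select` on a hypothetical frame `toy` with the same two Q7 columns.

```python
import numpy as np
import pandas as pd

# Hypothetical answers: NaN means the language was not selected.
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", np.nan, "Python", np.nan],
    "Q7_Part_2": ["R", "R", np.nan, np.nan],
})

uses_python = toy["Q7_Part_1"].eq("Python")   # False for missing values
uses_r = toy["Q7_Part_2"].eq("R")

# Conditions are checked in order; the first match wins.
toy["R_or_Python"] = np.select(
    [uses_r & uses_python, uses_r, uses_python],
    ["Both", "R", "Python"],
    default="NoUser",
)
print(toy["R_or_Python"].tolist())
```

Because the conditions are evaluated in order, the "R" and "Python" branches only catch rows that did not already match "Both".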

### **Overview**  
R-only users are an endangered species on Kaggle!  

In [None]:
fig, axes = plt.subplots(1,2, figsize=(12, 6))
counts = df_A["R_or_Python"].value_counts()
sns.barplot(x = counts.values, y=counts.index, palette=col_def,
            edgecolor = "black",ax=axes[0])
axes[0].set_title("Programming languages (Python x R)", fontsize=20)
axes[1].pie(x= counts,labels = counts.index, colors=col_def,autopct='%.0f%%',
           explode=[0.03 for i in counts.index])
axes[1].add_artist(plt.Circle((0,0),0.4,fc='white'))

plt.show()

### **Age x Language**  
As age increases, the share of Python-only users decreases, although they remain numerous in every age group.  
Python-only users peak in the 18-21 group and then gradually decline, whereas users of both languages peak in the 25-29 group.  
R-only users are an endangered species to begin with, but they exist even in the 18-21 group, and their count peaks in the 30-34 group.  
**The share of Kagglers using R alone increases with age!**


In [None]:
fig, axes = plt.subplots(2,1, figsize= (14,12))
age_order = ['18-21', '22-24', '25-29','30-34','35-39', '40-44','45-49', '50-54','55-59', '60-69', '70+']
sns.countplot(x = "Q1", hue = "R_or_Python", order = age_order, ec="black", data=df_A, ax=axes[0])
axes[0].set_title("Age and Languages", fontsize=20)

pivot = df_A.pivot_table(index="Q1", columns = "R_or_Python",values="Q4" , aggfunc = "count")
sns.heatmap(pivot.T, cmap = "Blues", annot=True, linewidths=0.005, linecolor='gray',  ax=axes[1])

# df_R = df_A[df_A["R_or_Python"] == "R"]
# sns.countplot(x = "Q1", order = age_order, ec="black", data=df_R, ax=axes[2])
# axes[2].set_title("R-only users", fontsize=20)

plt.show()

In [None]:
plt.figure(figsize= (12,6))

df_R = df_A[df_A["R_or_Python"] == "R"]
ax1 = sns.countplot(x = "Q1", order = age_order, ec="black", data=df_R)
ax1.set_title("R-only users", fontsize=20)

ax2 = ax1.twinx()
pivot = df_A.pivot_table(index= "Q1", columns="R_or_Python", values="Q4", aggfunc = "count")
pivot["R_prop"] = pivot["R"] / (pivot["R"] +  pivot["Python"] +  pivot["Both"] +  pivot["NoUser"]) 
sns.lineplot(x= "Q1", y="R_prop", data=pivot, color="red", marker="x",  ax=ax2)
ax2.spines['right'].set_color("red")
ax2.tick_params(axis='y', colors="red")

plt.show()
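
The share of each language group within an age group can also be computed directly with `pd.crosstab` and row normalization. A minimal sketch on hypothetical data (the frame `toy` and its values are made up):

```python
import pandas as pd

# Hypothetical respondents, not the survey data.
toy = pd.DataFrame({
    "Q1": ["18-21", "18-21", "18-21", "18-21", "30-34", "30-34", "30-34", "30-34"],
    "R_or_Python": ["Python", "Python", "Python", "Both", "Python", "Both", "R", "R"],
})

# normalize="index" makes every age group (row) sum to 1.
share = pd.crosstab(toy["Q1"], toy["R_or_Python"], normalize="index")
print(share.round(2))
```

Unlike a count pivot, combinations that never occur come out as 0 here rather than as missing values, so a ratio such as the R-only share stays defined for every age group.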

### **Title x Languages**  
**Statisticians use R!**  
Statisticians are the only job title in which R predominates.

In [None]:
fig, axes = plt.subplots(2,1, figsize= (18,20))

sns.countplot(y = "Q5", hue = "R_or_Python",  ec="black", data=df_A, ax=axes[0])
axes[0].set_title("Title and Languages", fontsize=20)

pivot = df_A.pivot_table(index="Q5", columns = "R_or_Python",values="Q1" , aggfunc = "count")
sns.heatmap(pivot.T, cmap = "Blues", annot=True, linewidths=0.005, linecolor='gray',  ax=axes[1])

plt.show()

In [None]:
plt.figure(figsize = (12,6))
df_Stat = df_A[df_A["Q5"] == "Statistician"]
sns.countplot(y = "Q5", hue = "R_or_Python" ,ec="black", data=df_Stat)
plt.title("Statisticians use R!", fontsize=20)
plt.show()

### **Education x Languages**  
There seems to be little relationship between the language used and education.

In [None]:
fig, axes = plt.subplots(2,1, figsize= (15,12))

sns.countplot(x = "Q4-2", hue = "R_or_Python", ec="black", data=df_A, ax=axes[0])
axes[0].set_title("Education and Languages", fontsize=20)

pivot = df_A.pivot_table(index="Q4-2", columns = "R_or_Python",values="Q5" , aggfunc = "count")
sns.heatmap(pivot.T, cmap = "Blues", annot=True , linewidths=0.005, linecolor='gray', ax=axes[1])

plt.show()
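
"Little relationship" is judged by eye here. One way to put a number on it, offered only as a sketch and not as the analysis done in this notebook, is Cramér's V computed from the contingency table. The helper `cramers_v` and both toy tables are hypothetical, and the function assumes no row or column of the table sums to zero.

```python
import numpy as np
import pandas as pd

def cramers_v(table):
    """Cramér's V for a table of counts (0 = no association, 1 = perfect)."""
    obs = np.asarray(table, dtype=float)
    n = obs.sum()
    # Expected counts if rows and columns were independent.
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
    chi2 = ((obs - expected) ** 2 / expected).sum()
    k = min(obs.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Hypothetical counts, not taken from the survey.
independent = pd.DataFrame([[20, 10], [40, 20]],
                           index=["Bachelor", "Master"], columns=["Python", "R"])
dependent = pd.DataFrame([[30, 0], [0, 30]],
                         index=["Bachelor", "Master"], columns=["Python", "R"])

print(cramers_v(independent))
print(cramers_v(dependent))
```

On the survey data the input would be a count table such as `pd.crosstab(df_A["Q4-2"], df_A["R_or_Python"])`. A value near 0 would support the impression of a weak association.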

<a id='IDEs'></a>
# **IDEs**  
Since there are many Python users, many Kagglers inevitably use Jupyter-based tools.  
VSCode and PyCharm are also popular.

In [None]:
Q9_cols = []
for i in df_A.columns:
    if "Q9" in i:
        Q9_cols.append(i)
        
df_Q9 = df_A[Q9_cols]
df_Q9_desc = df_Q9.describe(include="all").T.sort_values("freq", ascending=False)
df_Q9_desc["prop"] = df_Q9_desc["freq"] / df_Q9.shape[0]

In [None]:
df_Q9_desc[["top", "freq", "prop"]]

In [None]:
plt.figure(figsize=(15,6))
sns.barplot(x="prop", y="top",data= df_Q9_desc, palette = col_def, edgecolor = "black")
plt.title("IDEs (Overall)", fontsize=20)

plt.show()

### **IDEs x Languages**  
RStudio is the only IDE whose pattern differs distinctly from the others: many of its users use R alone or R together with Python.  
Conversely, most R-only users use RStudio.

In [None]:
dict_trans = pd.DataFrame(df_Q9.describe(include="all").T["top"]).to_dict()

df_Q9L = df_A[Q9_cols + ["R_or_Python"]]
df_Q9L_new = df_Q9L.rename(columns = dict_trans["top"])
df_Q9L_new["rn"] = df_Q9L_new.index
# df_Q9L_new

In [None]:
long = pd.melt(df_Q9L_new, id_vars=['rn','R_or_Python'],var_name='IDEs', value_name='names')
long = long.dropna()

In [None]:
fig, axes = plt.subplots(2,1, figsize= (15,20))

sns.countplot(y = "IDEs", hue = "R_or_Python", ec="black", data=long, ax=axes[0])
axes[0].set_title("IDEs x Languages", fontsize=20)

pivot = long.pivot_table(index = "R_or_Python", columns= "IDEs" , values = "rn", aggfunc="count")
sns.heatmap(pivot.T, cmap = "Blues",annot=True , linewidths=0.005, linecolor='gray', ax=axes[1])

plt.show()
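
The reshaping step above turns one column per IDE into one row per (respondent, selected IDE) pair. Here is a minimal sketch of that wide-to-long conversion on a hypothetical frame `toy`, with shortened IDE names chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one column per IDE, NaN = not selected.
toy = pd.DataFrame({
    "rn": [1, 2, 3],
    "R_or_Python": ["R", "Both", "Python"],
    "RStudio": ["RStudio", "RStudio", np.nan],
    "Jupyter": [np.nan, "Jupyter", "Jupyter"],
})

# One row per (respondent, IDE) pair; drop the pairs that were not selected.
long_toy = (
    pd.melt(toy, id_vars=["rn", "R_or_Python"], var_name="IDEs", value_name="names")
    .dropna(subset=["names"])
)
table = pd.crosstab(long_toy["IDEs"], long_toy["R_or_Python"])
print(table)
```

A respondent who selected several IDEs appears once per selection in the long table, so the counts are selections rather than people.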

In [None]:
#RStudio user
fig, axes = plt.subplots(1,2, figsize=(12, 6))
df_Rs = long[long["IDEs"] == ' RStudio ']
counts = df_Rs["R_or_Python"].value_counts()
sns.barplot(x = counts.values, y=counts.index, palette=col_def,
            edgecolor = "black",ax=axes[0])
axes[0].set_title("RStudio users", fontsize=20)
axes[1].pie(x= counts,labels = counts.index, colors=col_def,autopct='%.0f%%',
           explode=[0.03 for i in counts.index])
axes[1].add_artist(plt.Circle((0,0),0.4,fc='white'))

plt.show()

In [None]:
#R-only user
fig, axes = plt.subplots(1,2, figsize=(12, 6))
df_R = long[long["R_or_Python"] == "R"]
counts = df_R["IDEs"].value_counts()
sns.barplot(x = counts.values, y=counts.index, palette=col_def,
            edgecolor = "black",ax=axes[0])
axes[0].set_title("R-only users", fontsize=20)
axes[1].pie(x= counts,labels = counts.index, colors=col_def,autopct='%.0f%%',
           explode=[0.03 for i in counts.index])
axes[1].add_artist(plt.Circle((0,0),0.4,fc='white'))

plt.show()
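
The pie chart above shows shares of IDE selections, so a respondent who ticked three IDEs contributes three rows. To check a claim about respondents (for example, "most R-only users use RStudio"), the denominator should be the number of people instead. A sketch on a hypothetical long table `long_R`:

```python
import pandas as pd

# Hypothetical long table: one row per (respondent, selected IDE), R-only users only.
long_R = pd.DataFrame({
    "rn":   [1, 1, 1, 2, 3, 3],
    "IDEs": ["RStudio", "Jupyter", "VSCode", "RStudio", "Jupyter", "VSCode"],
})

# Share of all selections that went to each IDE.
selection_share = long_R["IDEs"].value_counts(normalize=True)

# Share of respondents who selected each IDE.
n_respondents = long_R["rn"].nunique()
respondent_share = long_R.groupby("IDEs")["rn"].nunique() / n_respondents

print(round(selection_share["RStudio"], 2))
print(round(respondent_share["RStudio"], 2))
```

Note the assumption that every respondent of interest appears in the long table. People who selected no IDE at all are dropped by `dropna()` and would have to be counted separately.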

# **Work still in progress**