# Searching for Kagglers like Me: Japanese, University Students, etc.

(I am a native Japanese speaker, so this notebook is written in both English and Japanese.)

(私は日本生まれなので、このノートブックでは、英語に加えて日本語の文も書きます。)

Hello, I am a student at Kyoto University in Japan. In this notebook, I look for friends with a background similar to mine.

こんにちは！私は日本の京都大学に所属する大学生です。このノートブックで、私と似たような経歴、背景を持つ仲間を見つけたいです。

I have been coding for several years. However, I am a newcomer to the Kaggle community, and I wonder whether I can compete with veteran Kagglers.

私には数年間のコーディングの経験があります。しかし、Kaggleのコミュニティへは新入りであり、歴戦のKaggler達と戦えるのだろうかと感じています。

Anyway, let's begin the analysis!

何はともあれ、分析を始めましょう！

Request: Please leave your advice in the comments!

お願い: コメント欄でアドバイスをください！

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import itertools

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
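df_res = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv', low_memory=False)
# The first row holds the question text rather than a response; keep it aside, then drop it
df_res_header = df_res.iloc[0]
df_res = df_res.drop(df_res.index[0]).reset_index(drop=True)
# Respondents who selected Japan as their country of residence (Q3)
df_res_jp = df_res[df_res['Q3']=='Japan']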

## Finding Japanese Kagglers | 日本のKagglerを見つけ出す

Which countries do Kagglers live in?

Kaggler達はどの国に住んでいるのでしょうか？

In [None]:
# Share of respondents per country (Q3), in percent; keep the 12 largest groups
df_ctry = df_res['Q3'].value_counts(normalize=True).mul(100)
df_ctry = df_ctry[:12]
# 'Other' is not a single country, so drop it. Naming both columns explicitly keeps
# the result independent of the pandas version's reset_index() defaults.
df_ctry = df_ctry.drop('Other').rename_axis('country').reset_index(name='percent')
df_ctry['country'] = df_ctry['country'].replace({'United States of America': 'USA',
                                                 'United Kingdom of Great Britain and Northern Ireland': 'UK'})
# Highlight Japan in red
color = ['red' if c == 'Japan' else 'lightgray' for c in df_ctry['country']]

# catplot creates its own figure, so a separate plt.figure() call is not needed
p = sns.catplot(x='country', y='percent', data=df_ctry, kind='bar', aspect=2, palette=color)

# Annotate each bar with its percentage
for x, y in enumerate(df_ctry['percent']):
    p.ax.text(x-0.2, y+0.5, str(round(y, 1))+'%')

p.ax.set_xlabel('Country')
p.ax.set_ylabel('percent(%)')
p.ax.set_title('Where do Kagglers currently reside?', fontdict={'fontsize': 20, 'fontweight': 'bold', 'fontfamily': 'serif'})

This graph shows that India has the most Kagglers, followed by the USA. Japan ranks third, which is higher than I expected.

このグラフから、インドのKagglerが一番多く、アメリカがそれに続いていることがわかります。日本は3位であり、私が予想した以上の順位です。
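
The ranking above rests on `value_counts(normalize=True)`, which returns each category's share of all non-missing answers. Below is a minimal sketch of that step on a made-up toy column. The helper name `country_share`, its parameters, and the toy values are illustrative assumptions, not survey results. As in the cell above, the shares are computed before `Other` is dropped, so the percentages stay relative to all respondents.

```python
import pandas as pd


def country_share(countries: pd.Series, exclude=("Other",), top_n=3) -> pd.Series:
    """Percentage share of the top_n most frequent values, skipping `exclude`."""
    # Shares are relative to all non-missing answers, including the excluded ones
    share = countries.value_counts(normalize=True).mul(100)
    # errors="ignore" avoids a KeyError when an excluded label is absent
    share = share.drop(labels=list(exclude), errors="ignore")
    return share.head(top_n).round(1)


# Toy data for illustration only (not taken from the survey)
toy = pd.Series(["India"] * 5 + ["USA"] * 2 + ["Other"] * 2 + ["Japan"] * 1)
ranking = country_share(toy)
print(ranking)
```

Because `value_counts` sorts in descending order, the position of a country in the result is its rank among the remaining categories.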

## Educational Background | 学歴

I am still a university student. I think technical knowledge is so important on Kaggle that expert data analysts make up the majority.

私はまだ大学生です。技術的専門的知識がKaggleにおいて非常に重要で、データ分析の専門家が大多数である、というふうに考えています。

In [None]:
df_res_edu = pd.concat([df_res_jp['Q4'].value_counts(normalize=True), df_res['Q4'].value_counts(normalize=True)], axis=1)
df_res_edu.columns = ['Japan', 'World']
df_res_edu = df_res_edu.loc[['Professional doctorate','Doctoral degree', 'Master’s degree', 'Bachelor’s degree', 'Some college/university study without earning a bachelor’s degree', 'No formal education past high school', 'I prefer not to answer']]

explode = [0.05 if idx=='Some college/university study without earning a bachelor’s degree' else 0 for idx in df_res_edu.index]
colors = ['red' if idx=='Some college/university study without earning a bachelor’s degree' else cm.Greys((i+1)/(len(df_res_edu.index)+1)) for i, idx in enumerate(df_res_edu.index)]
plt.figure(figsize=(12,6))

plt.subplot(1,2,1)
plt.pie(
    df_res_edu['Japan'],
    labels=['{:0.1f}%'.format(ratio*100)  for ratio in df_res_edu['Japan']],
    startangle=90,
    counterclock=False,
    colors=colors,
    explode=explode,
    wedgeprops={'edgecolor': 'white', 'linewidth': 2},
)
plt.title('Japan')

plt.subplot(1,2,2)
plt.pie(
    df_res_edu['World'],
    labels=['{:0.1f}%'.format(ratio*100)  for ratio in df_res_edu['World']],
    startangle=90,
    counterclock=False,
    colors=colors,
    explode=explode,
    wedgeprops={'edgecolor': 'white', 'linewidth':2},
)
plt.title('World')

plt.suptitle("Share of university students among Kagglers: Japan vs. World", fontsize=20, fontfamily='serif', fontweight='bold')
plt.legend(df_res_edu.index ,bbox_to_anchor=(1,0), loc='upper right')

As this graph shows, degree holders (bachelor's, master's, or doctoral) are the majority, but a few percent of Kagglers are university students, both in Japan and worldwide.

このグラフからわかるように、学士、修士、博士といった専門家が大多数なのだけれども、日本と世界の両方において、Kagglerのうち数パーセントは大学生です。
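
The comparison above places the answer shares of one subgroup next to those of all respondents. Below is a minimal sketch of that pattern on made-up data. The helper name `compare_shares`, the shortened answer labels, and the toy rows are hypothetical. Unlike the cell above, the sketch uses `reindex` followed by `fillna(0.0)`, so an answer that nobody in the subgroup chose appears as 0 instead of a missing value.

```python
import pandas as pd


def compare_shares(df, column, group_column, group_value, order):
    """Answer shares of `column` for one subgroup next to the shares for everyone."""
    in_group = df[group_column] == group_value
    group = df.loc[in_group, column].value_counts(normalize=True)
    overall = df[column].value_counts(normalize=True)
    # keys= sets the column names of the combined table
    table = pd.concat([group, overall], axis=1, keys=[group_value, "World"])
    # reindex fixes the row order; answers missing from the subgroup become 0
    return table.reindex(order).fillna(0.0)


# Toy data for illustration only (not taken from the survey)
toy = pd.DataFrame({
    "Q3": ["Japan", "Japan", "India", "India"],
    "Q4": ["Master", "Student", "Master", "Bachelor"],
})
order = ["Master", "Bachelor", "Student"]
table = compare_shares(toy, "Q4", "Q3", "Japan", order)
print(table)
```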

## Programming Experience | プログラミングの経験
I have been interested in programming since high school, when I read a few beginner's books on the subject. I now belong to the informatics department, where I have been attending programming lectures for about two and a half years.

私は高校生の時からプログラミングに興味がありました。高校時代は入門書を何冊か読みました。現在は情報学科に所属しており、約2年半の間、大学のプログラミングの講義もまた受講しています。

In [None]:
df_code_exp = pd.concat([df_res_jp['Q6'].value_counts(normalize=False), df_res['Q6'].value_counts(normalize=False)], axis=1)
df_code_exp.columns = ['Japan', 'World']
df_code_exp = df_code_exp.loc[['I have never written code', '< 1 years', '1-3 years', '3-5 years', '5-10 years', '10-20 years', '20+ years']]
colors = ['red' if idx=='3-5 years' else cm.Greys((i+1)/(len(df_code_exp.index)+1)) for i, idx in enumerate(df_code_exp.index)]
plt.figure(figsize=(12,8))

plt.subplot(2,1,1)
sns.barplot(x=df_code_exp.index, y='Japan', data=df_code_exp, palette=colors)
for x, label in enumerate(df_code_exp['Japan']):
    y=label
    plt.text(x-0.1, y, label)

plt.subplot(2,1,2)
sns.barplot(x=df_code_exp.index, y='World', data=df_code_exp, palette=colors)
for x, label in enumerate(df_code_exp['World']):
    y=label
    plt.text(x-0.15, y, label)
    
plt.suptitle('Number of Kagglers by coding experience (3-5 years highlighted)', fontsize=20, fontfamily='serif', fontweight='bold')

I have been coding for about 4 years, which turns out not to be rare.

私は4年間のコーディング経験がありますが、それは珍しいことではないと分かります。

## Machine Learning | 機械学習

I had been interested in data science but had no experience with it, so I started learning data science and machine learning this year. At my university, the teachers give us an overview of machine learning but do not cover concrete coding methods, so I taught myself machine learning with Python.

私はデータサイエンスに興味はありましたが、経験はありませんでした。今年、データサイエンスと機械学習の学習を始めました。私の大学では、先生方は機械学習の概要を教えてくれますが、コーディングの具体的な方法は教えてくれません。だから、Pythonによる機械学習を独学で学びました。

In [None]:
df_ml_exp = pd.concat([df_res_jp['Q15'].value_counts(normalize=False), df_res['Q15'].value_counts(normalize=False)], axis=1)
df_ml_exp.columns = ['Japan', 'World']
df_ml_exp = df_ml_exp.loc[['I do not use machine learning methods', 'Under 1 year', '1-2 years', '2-3 years', '3-4 years', '4-5 years', '5-10 years', '10-20 years', '20 or more years']]
# Convert the per-category counts into cumulative totals
df_ml_exp['Japan'] = list(itertools.accumulate(df_ml_exp['Japan']))
df_ml_exp['World'] = list(itertools.accumulate(df_ml_exp['World']))
df_ml_exp.index = ['No use', 'Under 1 year', 'Under 2 years', 'Under 3 years', 'Under 4 years', 'Under 5 years', 'Under 10 years', 'Under 20 years', 'All years']

colors = ['red' if idx=='Under 1 year' else cm.Greys((i+1)/(len(df_ml_exp.index)+1)) for i, idx in enumerate(df_ml_exp.index)]
plt.figure(figsize=(12,8))

plt.subplot(2,1,1)
sns.barplot(x=df_ml_exp.index, y='Japan', data=df_ml_exp, palette=colors)
for x, label in enumerate(df_ml_exp['Japan']):
    y=label
    plt.text(x-0.1, y, label)

plt.subplot(2,1,2)
sns.barplot(x=df_ml_exp.index, y='World', data=df_ml_exp, palette=colors)
for x, label in enumerate(df_ml_exp['World']):
    y=label
    plt.text(x-0.2, y, label)
    
plt.suptitle('Cumulative number of Kagglers by years of ML experience', fontsize=20, fontfamily='serif', fontweight='bold')

I have one year or less of machine learning experience. About half of Kagglers fall into this group.

私は機械学習の経験が1年以下です。これはKagglerの約半分を占めます。
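
A statement such as "about half" is easier to check when the cumulative counts are expressed as a percentage of the total. Below is a minimal sketch of that calculation. The helper name `cumulative_share` and the toy counts are assumptions for illustration, not survey results. It uses the pandas method `cumsum`, which gives the same running totals as `itertools.accumulate`.

```python
import pandas as pd


def cumulative_share(counts: pd.Series) -> pd.Series:
    """Running total of `counts` as a percentage of the overall total."""
    return counts.cumsum().div(counts.sum()).mul(100).round(1)


# Toy counts for illustration only (not taken from the survey),
# ordered from least to most experience
toy_counts = pd.Series(
    [10, 40, 30, 20],
    index=["No use", "Under 1 year", "1-2 years", "2+ years"],
)
shares = cumulative_share(toy_counts)
print(shares)
```

The categories must already be sorted from least to most experience, because a cumulative sum depends on the row order.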

## Summary | 総括

Through these analyses, I noticed that there are many beginners in programming and machine learning in the Kaggle community. This encourages me. I will work hard to compete with beginners like me, and someday I would like to challenge the veterans of Kaggle.

以上の分析を通じて、プログラミングや機械学習の初心者がKaggleのコミュニティには大勢いることに気づきました。私はこのことに勇気づけられました。努力して自分のような初心者と競争し、そしていつの日か、Kaggleの熟練者に挑戦したいです。