## Kaggle Survey?
- There is a Data Science Community called Kaggle. Every year, Kaggle conducts a survey of Kaggle users. The results of this survey are a good source of information about how the world's best Data Scientists are working and what they are studying.

- This time we use the "2021 Kaggle Machine Learning Survey" dataset, which was released a few days ago, to take a look at Kagglers working in South Korea (and a few other countries).

### <b>Points</b>
- Understand the EDA process

- Use the numpy, pandas, matplotlib, and seaborn libraries

- Analyze Kagglers

## Step 1. Load Datasets

### Quick intro
 
- The first row of the data holds the text of each question, not a response.

- If you are interested in how the data was collected and how to use it, download the dataset and look in the **supplementary_data** folder.

Source : https://www.kaggle.com/c/kaggle-survey-2021/data
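
Because the first row holds question text, it can help to split it off before counting anything. The sketch below uses a tiny made-up frame (the `raw` values are assumptions, not real survey rows) to show one way to keep the questions as a lookup and the answers as a clean table.

```python
import pandas as pd

# Toy stand-in for the survey file: row 0 holds the question text,
# the remaining rows hold responses. All values here are made up.
raw = pd.DataFrame({
    "Q1": ["What is your age (# years)?", "25-29", "30-34"],
    "Q3": ["In which country do you currently reside?", "South Korea", "India"],
})

questions = raw.iloc[0]                          # Series: column name -> question text
responses = raw.iloc[1:].reset_index(drop=True)  # answers only, re-indexed from 0

print(questions["Q3"])
print(responses)
```

With this split, `questions["Q3"]` answers "what was Q3 about?" without the question text leaking into counts or plots.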

In [None]:
import numpy as np  
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 

pd.options.display.max_rows = 100

import warnings
warnings.filterwarnings('ignore')

In [None]:
# set input path and load dataset
input_path = '../input/kaggle-survey-2021/'
survey = pd.read_csv(input_path + "kaggle_survey_2021_responses.csv")
survey

In [None]:
survey.info() 

In [None]:
survey.describe()

- To look at basic background information (age, gender, country, and education), let's grab the Q1, Q2, Q3, and Q4 columns.

In [None]:
selected_cols = ["Q1", "Q2", "Q3", "Q4"]
temp = survey[selected_cols]
temp

## Step 2. Pick rows to use

In [None]:
# to find my country, South Korea
temp.Q3.unique()

In [None]:
# South Korea
korean = temp.loc[temp.Q3 == "South Korea", :]
korean
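
The row selection above is a boolean mask passed to `.loc`. As a small sketch on made-up data (the `toy` frame is an assumption), the same pattern extends to several countries at once with `isin`:

```python
import pandas as pd

# Made-up rows, only to illustrate the filtering pattern.
toy = pd.DataFrame({
    "Q1": ["25-29", "30-34", "18-21", "25-29"],
    "Q3": ["South Korea", "India", "Japan", "South Korea"],
})

# One country: compare the column to a single value.
korean_toy = toy.loc[toy["Q3"] == "South Korea"]

# Several countries: test membership in a list.
pair = toy.loc[toy["Q3"].isin(["South Korea", "Japan"])]

print(korean_toy)
print(pair)
```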

- Show Age info

In [None]:
sns.countplot(data=korean, x="Q1")

- Show gender info

In [None]:
sns.countplot(data=korean, x="Q2")

- Show educational status

In [None]:
plt.figure(figsize=(8, 12))
sns.countplot(data=korean, y="Q4")

## Step 3. Let's upgrade our plots!

- Plot the age column with the bars sorted by age bracket

In [None]:
X = korean.Q1.value_counts().sort_index()
sns.countplot(data=korean, x="Q1", order=X.index)

- Sort and print the age column in order of frequency

In [None]:
X = korean.Q1.value_counts().sort_values(ascending=False)
sns.countplot(data=korean, x="Q1", order=X.index)
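
The `order` argument only needs a list of category labels, so the two orderings above come down to how the `value_counts()` result is sorted. A minimal sketch on made-up age labels (the `ages` Series is an assumption):

```python
import pandas as pd

# Made-up age brackets in the same "NN-NN" style as Q1.
ages = pd.Series(["25-29", "18-21", "25-29", "30-34", "18-21", "25-29"], name="Q1")

# Sort by the label itself: gives the natural age order.
by_label = ages.value_counts().sort_index()

# Sort by how often each label occurs: most frequent first.
by_count = ages.value_counts().sort_values(ascending=False)

print(list(by_label.index))
print(list(by_count.index))
```

Sorting the labels as strings happens to match age order here because every bracket starts with a two-digit number; labels of mixed width would need an explicit order list.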

- Change color palette to "autumn"

In [None]:
X = korean.Q1.value_counts().sort_values(ascending=False)
sns.countplot(data=korean, x="Q1", order=X.index, palette="autumn")

- Sort and print gender columns in order of frequency

In [None]:
X = korean.Q2.value_counts().sort_values()[::-1]
sns.countplot(data=korean, x="Q2", order=X.index)

- Change count into ratio

In [None]:
X = korean.Q2.value_counts(normalize=True).sort_values()[::-1]
sns.barplot(x=X.index, y=X.values, palette="Set2")
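
To see what `normalize=True` changes, here is a small sketch on made-up answers (the `gender` Series is an assumption). The ratios are simply the counts divided by the number of non-missing answers.

```python
import pandas as pd

# Made-up answers, only to compare counts with ratios.
gender = pd.Series(["Man", "Man", "Woman", "Man"], name="Q2")

counts = gender.value_counts()                 # absolute frequencies
ratios = gender.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(ratios)
```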

- Now, let's draw a countplot of education status

In [None]:
plt.figure(figsize=(6, 4))
X = korean.Q4.value_counts().sort_values()[::-1]
sns.countplot(data=korean, y="Q4", order=X.index)

- Change count to ratio and set color palette as "Set2"

In [None]:
plt.figure(figsize=(6, 4))
X = korean.Q4.value_counts(normalize=True).sort_values()[::-1]
sns.barplot(x=X.values, y=X.index, palette="Set2")

### Now, let's look at age statistics by country with a pivot table over the entire dataset, not just the Korean respondents.

In [None]:
# delete first row (= delete questions)
countries = temp.iloc[1:, :]
countries

In [None]:
pt = pd.pivot_table(data=countries.loc[:, ["Q1", "Q3"]], index=["Q3"], columns="Q1", aggfunc={"Q1":"count"})
pt

- Set NaN to 0

In [None]:
pt = pd.pivot_table(data=countries.loc[:, ["Q1", "Q3"]], index=["Q3"], columns="Q1", aggfunc={"Q1":"count"}, fill_value=0)
pt
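
As an alternative sketch, `pd.crosstab` builds the same kind of country-by-category count table and fills empty cells with 0 on its own. The `toy` frame below is made up; this is not a claim about how the notebook's pivot table behaves on every pandas version.

```python
import pandas as pd

# Made-up rows: country (Q3) and age bracket (Q1).
toy = pd.DataFrame({
    "Q3": ["South Korea", "South Korea", "India", "India", "India"],
    "Q1": ["25-29", "30-34", "25-29", "25-29", "18-21"],
})

# Rows are countries, columns are age brackets, cells are counts.
ct = pd.crosstab(toy["Q3"], toy["Q1"])

# Same table as row-wise shares: each country's row sums to 1.
ct_share = pd.crosstab(toy["Q3"], toy["Q1"], normalize="index")

print(ct)
print(ct_share)
```

Row-wise shares make countries with very different respondent counts easier to compare.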

#### Let's dive into Gender

In [None]:
pt = pd.pivot_table(data=countries.loc[:, ["Q2", "Q3"]], index=["Q3"], columns="Q2", aggfunc={"Q2":"count"})
pt

- Pick USA and Canada info

In [None]:
american = pt.loc["United States of America"]
canada = pt.loc["Canada"]
display(american)
display(canada)

In [None]:
plt.figure(figsize=(6, 4))
plt.title("American")
american.Q2.plot(kind="barh")
plt.show()
plt.figure(figsize=(6, 4))
plt.title("Canada")
canada.Q2.plot(kind="barh")
plt.show()

### We just looked at some background info. How about the other questions?

### Checkpoints

- This time, we select the columns related to the programming language that users frequently use.

- Which columns do we need?

- How can we handle the columns that start with Q7?

In [None]:
# grab columns with Q7s
selected_cols = ["Q1", "Q2", "Q3", "Q8"]
Q7s = [col for col in survey.columns if col.startswith("Q7")]
selected_cols = selected_cols + Q7s
temp2 = survey[selected_cols].copy()  # copy so the later in-place edits do not act on a view of survey
temp2

#### Combine Q7~ columns into new "Q7" column

In [None]:
Q7_list = []
Q7_list.append(temp2[Q7s[0]].loc[0]) # insert Question
for _, row in temp2[Q7s][1:].iterrows():
    #print(row)
    temp_list = row[~row.isna()]
    #print(temp_list.values)
    Q7_list.append(temp_list.values)
print(Q7_list[:3])
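
Another way to handle multi-select columns, sketched here on made-up data, is to reshape them into a long table with `melt`, where each row is one (respondent, selected language) pair. The column names `Q7_Part_1` and `Q7_Part_2` and all values in `toy` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up multi-select answers: one column per option, NaN when not selected.
toy = pd.DataFrame({
    "Q3": ["South Korea", "India", "South Korea"],
    "Q7_Part_1": ["Python", np.nan, "Python"],
    "Q7_Part_2": [np.nan, np.nan, "R"],
})
q7_cols = [c for c in toy.columns if c.startswith("Q7")]

# Wide -> long: one row per (respondent, option column), then drop unselected options.
long = toy.melt(id_vars="Q3", value_vars=q7_cols, value_name="language")
long = long.dropna(subset=["language"])

# Count selected languages within each country.
per_country = long.groupby("Q3")["language"].value_counts()
print(per_country)
```

The long form avoids storing lists inside cells, which makes later grouping by country or gender a plain `groupby`.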

In [None]:
temp2.drop(Q7s, axis=1, inplace=True)
temp2["Q7"] = Q7_list
temp2

- There are some missing values. Delete them!

In [None]:
temp3 = temp2.dropna()
temp_len = temp3.Q7.apply(lambda x: len(x))
temp3["Q7_len"] = temp_len
temp3 = temp3.loc[temp3.Q7_len != 0, :]
temp3.drop("Q7_len", axis=1, inplace=True)
temp3
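
The helper length column above can be skipped by filtering on the list length directly. A minimal sketch, assuming a made-up frame `toy` whose `Q7` column already holds lists:

```python
import pandas as pd

# Made-up rows: Q7 holds the list of selected languages (possibly empty).
toy = pd.DataFrame({
    "Q3": ["South Korea", "India", "Japan"],
    "Q7": [["Python", "SQL"], [], ["R"]],
})

# Keep only respondents who selected at least one language.
non_empty = toy[toy["Q7"].map(len) > 0]
print(non_empty)
```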

- Now, let's pick out my friends, the Korean respondents!

In [None]:
korean = temp3.loc[temp3.Q3 == "South Korea"]
korean

#### To combine list of languages, we create another dataframe.

In [None]:
from collections import Counter
Q7_data = []
for row in korean.Q7:
    Q7_data = Q7_data + list(row)
counter = Counter(Q7_data)
df = pd.DataFrame({"Languages":counter.keys(), "Count":counter.values()}).set_index("Languages")
df.plot(kind="barh")
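
For comparison, pandas can flatten and count a list column without `Counter`. This sketch uses `Series.explode` on made-up lists (the `q7` Series is an assumption) and gives the counts already sorted from most to least frequent.

```python
import pandas as pd

# Made-up list column: each element is one respondent's selected languages.
q7 = pd.Series([["Python", "SQL"], ["Python"], ["R", "Python"]], name="Q7")

# explode() turns every list element into its own row; value_counts() tallies them.
lang_counts = q7.explode().value_counts()
print(lang_counts)
```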

- Sort by count and draw a bar plot!

In [None]:
X = df.sort_values(by="Count", ascending=False)
plt.figure(figsize=(12, 6))
sns.barplot(x=X.index, y=X.Count, palette="Set2")

- This time, choose the Q8 column

In [None]:
plt.figure(figsize=(12, 6))
sns.countplot(data=korean, x="Q8")

- Change count into ratio, and visualize it in ascending order.

In [None]:
X = korean.Q8.value_counts(normalize=True).sort_values()
plt.figure(figsize=(12, 6))
sns.barplot(x=X.index, y=X.values, palette="Set2")

### Now, we look at the programming language statistics by gender through the pivot table for the entire data, not for Koreans.

In [None]:
# drop the first row
genders = temp3.iloc[1:, :]
genders

In [None]:
pt = pd.pivot_table(data=genders.loc[:, ["Q2", "Q8"]], index=["Q2"], columns="Q8", aggfunc={"Q8":"count"})
pt

In [None]:
# fill NaN values to 0
pt = pd.pivot_table(data=genders.loc[:, ["Q2", "Q8"]], index=["Q2"], columns="Q8", aggfunc={"Q8":"count"}, fill_value=0)
pt

#### How about Q8 by country?

In [None]:
pt = pd.pivot_table(data=genders.loc[:, ["Q3", "Q8"]], index=["Q3"], columns="Q8", aggfunc={"Q8":"count"}, fill_value=0)
pt

#### Let's choose some countries.. France and Germany!

In [None]:
france = pt.loc["France"]
germany = pt.loc["Germany"]
display(france)
display(germany)

In [None]:
plt.figure(figsize=(6, 4))
plt.title("France")
france.Q8.plot(kind="barh")
plt.show()
plt.figure(figsize=(6, 4))
plt.title("Germany")
germany.Q8.plot(kind="barh")
plt.show()

#### Redraw the two graphs above in descending order of frequency, and change the color palette to "winter".

In [None]:
plt.figure(figsize=(10, 6))
plt.title("France")
df = france.Q8.sort_values()[::-1]
sns.barplot(x=df.index, y=df.values, palette="winter")
plt.figure(figsize=(10, 6))
plt.title("Germany")
df2 = germany.Q8.sort_values()[::-1]
sns.barplot(x=df2.index, y=df2.values, palette="winter")

### Checkpoints

- This time, we select the columns related to the ML methods that users frequently use.

- Which columns do we need?

- How can we handle the columns that start with Q17, Q18, and Q19?

Let's grab the columns we need

In [None]:
selected_cols = ["Q1", "Q2", "Q3"]
Q17s = [col for col in survey.columns if col.startswith("Q17")]
Q18s = [col for col in survey.columns if col.startswith("Q18")]
Q19s = [col for col in survey.columns if col.startswith("Q19")]

selected_cols = selected_cols + Q17s + Q18s + Q19s
temp4 = survey[selected_cols].copy()  # copy so the later in-place edits do not act on a view of survey
temp4

#### Combine Q17~ columns into single one

In [None]:
Q17_list = []
Q17_list.append(temp4[Q17s[0]].loc[0])
for _, row in temp4[Q17s][1:].iterrows():
    #print(row)
    temp_list = row[~row.isna()]
    #print(temp_list.values)
    Q17_list.append(temp_list.values)
print(Q17_list[:3])

- Same thing in Q18s

In [None]:
Q18_list = []
Q18_list.append(temp4[Q18s[0]].loc[0])
for _, row in temp4[Q18s][1:].iterrows():
    #print(row)
    temp_list = row[~row.isna()]
    #print(temp_list.values)
    Q18_list.append(temp_list.values)
print(Q18_list[:3])

- Same in Q19s

In [None]:
Q19_list = []
Q19_list.append(temp4[Q19s[0]].loc[0])
for _, row in temp4[Q19s][1:].iterrows():
    #print(row)
    temp_list = row[~row.isna()]
    #print(temp_list.values)
    Q19_list.append(temp_list.values)
print(Q19_list[:3])
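
Since the same loop is repeated for Q17, Q18, and Q19, it could be wrapped in one function. The sketch below is a hypothetical helper (`combine_parts` is not part of pandas or of this notebook), shown on made-up columns and values.

```python
import numpy as np
import pandas as pd

def combine_parts(frame, prefix):
    """Collect the non-missing answers of all columns starting with `prefix`
    into one list per row (hypothetical helper for illustration)."""
    cols = [c for c in frame.columns if c.startswith(prefix)]
    rows = frame[cols].itertuples(index=False)
    return pd.Series([[v for v in row if pd.notna(v)] for row in rows],
                     index=frame.index)

# Made-up multi-select columns and answers.
toy = pd.DataFrame({
    "Q17_Part_1": ["method A", np.nan],
    "Q17_Part_2": ["method B", np.nan],
    "Q18_Part_1": [np.nan, "method C"],
})

# Build one list column per question in a single loop.
combined = pd.DataFrame({q: combine_parts(toy, q + "_") for q in ["Q17", "Q18"]})
print(combined)
```

Passing the prefix with a trailing underscore (`"Q17_"`) is a deliberate choice here: a bare prefix such as `"Q1"` would also match `Q17`, `Q18`, and `Q19`.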

#### Replace the Q17, Q18, and Q19 columns with the lists we just made.

In [None]:
temp4.drop(Q17s, axis=1, inplace=True)
temp4["Q17"] = Q17_list
temp4

In [None]:
temp4.drop(Q18s, axis=1, inplace=True)
temp4["Q18"] = Q18_list
temp4

In [None]:
temp4.drop(Q19s, axis=1, inplace=True)
temp4["Q19"] = Q19_list
temp4

#### Drop missing values

In [None]:
temp5 = temp4.dropna()
temp_len = temp4.Q17.apply(lambda x: len(x))
temp_len2 = temp4.Q18.apply(lambda x: len(x))
temp_len3 = temp4.Q19.apply(lambda x: len(x))
temp5["Q17_len"] = temp_len
temp5["Q18_len"] = temp_len2
temp5["Q19_len"] = temp_len3
temp5 = temp5.loc[temp5.Q17_len != 0, :]
temp5 = temp5.loc[temp5.Q18_len != 0, :]
temp5 = temp5.loc[temp5.Q19_len != 0, :]
temp5.drop("Q17_len", axis=1, inplace=True)
temp5.drop("Q18_len", axis=1, inplace=True)
temp5.drop("Q19_len", axis=1, inplace=True)
temp5

#### Let's pick out the Korean respondents!

In [None]:
korean = temp5.loc[temp5.Q3 == "South Korea"]
korean

#### Plot their favorite ML methods

In [None]:
from collections import Counter
Q17_data = []
for row in korean.Q17:
    Q17_data = Q17_data + list(row)
counter = Counter(Q17_data)
df = pd.DataFrame({"Methods":counter.keys(), "Count":counter.values()}).set_index("Methods")
df.plot(kind="barh")

In [None]:
X = df.sort_values(by="Count", ascending=False)
plt.figure(figsize=(6, 4))
sns.barplot(y=X.index, x=X.Count, palette="Set2")

#### Now, in CV

In [None]:
from collections import Counter
Q18_data = []
for row in korean.Q18:
    Q18_data = Q18_data + list(row)
counter = Counter(Q18_data)
df = pd.DataFrame({"Methods":counter.keys(), "Count":counter.values()}).set_index("Methods")
df.plot(kind="barh")

In [None]:
X = df / df.Count.sum()
X = X.sort_values(by="Count", ascending=True)
plt.figure(figsize=(6, 4))
sns.barplot(y=X.index, x=X.Count, palette="Set2")
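
Dividing by `df.Count.sum()` gives each method's share of all selections, which is not the same as the share of respondents who picked it, because one respondent can select several methods. A small sketch on made-up lists (the `q18` Series and its labels are assumptions) shows the difference:

```python
import pandas as pd

# Made-up multi-select answers from three respondents.
q18 = pd.Series([["A", "B"], ["A"], ["A", "C"]])

method_counts = q18.explode().value_counts()

# Share of all selections: always sums to 1.
share_of_selections = method_counts / method_counts.sum()

# Share of respondents who picked each method: can sum to more than 1.
share_of_respondents = method_counts / len(q18)

print(share_of_selections)
print(share_of_respondents)
```

Which denominator to use depends on the question being asked, so it is worth stating it on the axis label.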

#### Now, in NLP

In [None]:
from collections import Counter
Q19_data = []
for row in korean.Q19:
    Q19_data = Q19_data + list(row)
counter = Counter(Q19_data)
df = pd.DataFrame({"Methods":counter.keys(), "Count":counter.values()}).set_index("Methods")
df.plot(kind="barh")

In [None]:
X = df.sort_values(by="Count", ascending=False)
plt.figure(figsize=(6, 4))
sns.barplot(y=X.index, x=X.Count, palette="Spectral")

#### Let's apply this to the entire dataset!

#### ML methods

In [None]:
from collections import Counter
Q17_data = []
for row in temp5[1:].Q17:
    Q17_data = Q17_data + list(row)
counter = Counter(Q17_data)
df = pd.DataFrame({"Methods":counter.keys(), "Count":counter.values()}).set_index("Methods")
df.plot(kind="barh")

In [None]:
X = df / df.Count.sum()
X = X.sort_values(by="Count", ascending=False)
plt.figure(figsize=(6, 4))
sns.barplot(y=X.index, x=X.Count, palette="Set3")

#### CV!!

In [None]:
from collections import Counter
Q18_data = []
for row in temp5[1:].Q18:
    Q18_data = Q18_data + list(row)
counter = Counter(Q18_data)
df = pd.DataFrame({"Methods":counter.keys(), "Count":counter.values()}).set_index("Methods")
df.plot(kind="barh")

In [None]:
X = df.sort_values(by="Count", ascending=False)
plt.figure(figsize=(6, 4))
sns.barplot(y=X.index, x=X.Count, palette="winter")

#### NLP !!

In [None]:
from collections import Counter
Q19_data = []  # start from an empty list so answers collected in an earlier cell are not counted again
for row in temp5[1:].Q19:
    Q19_data = Q19_data + list(row)
counter = Counter(Q19_data)
df = pd.DataFrame({"Methods":counter.keys(), "Count":counter.values()}).set_index("Methods")
df.plot(kind="barh")

In [None]:
X = df.sort_values(by="Count", ascending=True)
plt.figure(figsize=(6, 4))
sns.barplot(y=X.index, x=X.Count, palette="Set2")

## Summary

- We practiced exploring many aspects of this wonderful dataset!


- If you found anything interesting, please upvote and share your own code!


**Thanks a lot! Have a wonderful day**