# Kaggle 2021 Survey Data Visualization

<br>
<br>


## Let's find trends in ML-related tools and algorithms.
### - This can be helpful for students and beginners who want to learn practical ML.
### - We can also use data to see what "the world" is doing.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### The first thing I want to do is
####    - Group responses by expertise
####       - Ex. age, job title, ML experience, compensation.
####    - Why? To qualify each response by the respondent's expected expertise.
   

In [None]:
%matplotlib ipympl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import animation
from matplotlib.animation import FuncAnimation
from matplotlib.lines import Line2D
from IPython.display import HTML

# from matplotlib import animation, rc

In [None]:
survey_2021 = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv', dtype='string')
survey_2020 = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', dtype='string')
survey_2019 = pd.read_csv('../input/kaggle-survey-2019/multiple_choice_responses.csv', dtype='string')
survey_2018 = pd.read_csv('../input/kaggle-survey-2018/multipleChoiceResponses.csv', dtype='string')

list_surveys = [survey_2018, survey_2019, survey_2020, survey_2021]

In [None]:
# each list is in survey order: 2018, 2019, 2020, 2021
groups = {
    'Age': ['Q2', 'Q1', 'Q1', 'Q1'],
    'Gender': ['Q1', 'Q2', 'Q2', 'Q2'],
    'Job Title': ['Q6', 'Q5', 'Q5', 'Q5'],
    'Programming Experience': [np.nan, np.nan, 'Q6', 'Q6'],
    'ML Experience': ['Q25', 'Q23', 'Q15', 'Q15'],
    'Compensation': ['Q9', 'Q10', 'Q25', 'Q25'],
    'Industry': ['Q7', np.nan, 'Q4', 'Q4'],
    
    'Programming Language': ['Q16_', 'Q18_', 'Q7_', 'Q7_'],
    'IDE': ['Q13_', 'Q16_', 'Q9_', 'Q9_'],
    'Notebook': ['Q14_', 'Q17_', 'Q10_', 'Q10_'],
    'Specialized Hardware': [np.nan, 'Q21_', 'Q12_', 'Q12_'],
    'Visualization Libraries/Tools': ['Q21_', 'Q20_', 'Q14_', 'Q14_'],
    'ML Frameworks': ['Q19_', 'Q28_', 'Q16_', 'Q16_'],
    'ML Algorithms': [np.nan, 'Q24_', 'Q17_', 'Q17_'],
    'CV Algorithms': [np.nan, 'Q26_', 'Q18_', 'Q18_'],
    'NLP Methods': [np.nan, 'Q27_', 'Q19_', 'Q19_'],
    'Important Part of Work': ['Q11_', 'Q24_', 'Q24_', 'Q24_'],
    'Computing Platform': ['Q15_', 'Q29_', 'Q26_', 'Q27_'],
    
# Cloud platforms: GCP, AWS, Microsoft Azure, IBM
#     'Daily Cloud Platforms': 'Q27_A_',
#     'Cloud Products': 'Q27_A_'
#     'Data Storage Products': 'Q30_A_',
#     'Big Data Products': 'Q32_A_',
#     'Data Analysis Tools': 'Q41'

    'Language Recommendation': ['Q18', 'Q19', 'Q8', 'Q8'],
    'Courses': ['Q36_', 'Q13_', 'Q40_', 'Q40_'],
    'Favorite Media Source': ['Q38_', 'Q12_', 'Q42_', 'Q42_'],
    'Tools to Share Projects': ['Q49_', np.nan, 'Q39_', 'Q39_']
}

year_df = {
    0: '2018',
    1: '2019',
    2: '2020',
    3: '2021'
}
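
Because question numbers change from one survey year to the next, it can help to check that an ID from the mapping actually selects columns in a given year's file. Below is a minimal sketch of such a check. The helper name `matching_columns` and the toy column names are hypothetical, not part of the notebook or of the real survey files.

```python
import pandas as pd

def matching_columns(df, question):
    """Return the columns of `df` selected by a question ID.

    IDs ending in '_' are treated as prefixes (multi-select questions);
    other IDs must match a column name exactly. A missing ID (NaN) selects nothing.
    """
    if pd.isna(question):
        return []
    if question.endswith("_"):
        return [c for c in df.columns if c.startswith(question)]
    return [c for c in df.columns if c == question]

# Hypothetical empty frame standing in for one survey year.
toy = pd.DataFrame(columns=["Q1", "Q7_Part_1", "Q7_Part_2", "Q8"])

for question in ["Q7_", "Q8", "Q40_", float("nan")]:
    print(question, "->", matching_columns(toy, question))
```

An empty result for a prefix is a hint that the question number is different (or absent) in that year, which would otherwise only show up later as an empty panel.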

In [None]:
# Cleaning Functions

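# Return the answer rows for question Q, dropping row 0 (the question text).
# Q is either a column prefix such as 'Q6_' (multi-select question)
# or an exact column name such as 'Q7' (single-select question).
def get_df(df, Q):
    if pd.isna(Q):
        return None
    elif Q[-1] == '_':
        # Copy so that later in-place edits do not touch the original survey.
        result = df.loc[1:, df.columns.str.startswith(Q)].copy()
    else:
        result = df.loc[1:, [Q]].copy()
    return result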

# 2018, 2019: the free-text "Other" column holds -1 when nothing was entered.
# Map -1 (and missing values) to np.nan and everything else to 'Other'.
def change_other(x):
    if pd.isna(x) or str(x) == '-1':
        return np.nan
    return 'Other'
    
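# For multi-select questions: each column records a single answer option,
# so rename every column to the option it holds.
def change_columns(df): 
    columns = []
    for col in df.columns:
        vals = df[col].dropna().unique()
        # Keep the original name if nobody selected this option.
        columns.append(vals[0] if len(vals) else col)
    df.columns = columns
    return df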

# For single-select questions: spread the one column into one column per answer.
def format_values(df): 
    vals = df.values.flatten().tolist()
    columns = np.unique([x for x in vals if not pd.isna(x)])
    results = pd.DataFrame(index=df.index, columns=columns)

    for col in columns:
        idx = df[df.iloc[:, 0] == col].index
        results.loc[idx, col] = col

    return results

# Count the non-missing answers in df per value of the grouping question, e.g. 'Age'.
def group_by(survey, groups, i, group_name, df):
    col = groups[group_name][i]
    # The grouping question may be missing (np.nan) for this survey year.
    if pd.isna(col):
        return None
    group_col = survey.loc[1:, col].rename(group_name)
    grouped = (pd.concat([group_col, df], axis=1).groupby(group_name)).count()
    return grouped


def get_df_by_group(group_name, variable_name):
    # One grouped count table per survey year (None when the question is missing for that year).
    combined = []
    for i in range(len(list_surveys)):
        df = get_df(list_surveys[i], groups[variable_name][i])
        if df is None:
            combined.append(None)
        else:
            if df.shape[1] == 1:
                df = format_values(df)
            else:
                # if -1 exists in other (2018-2019), apply (change_other)
                if i in [0, 1]:
                    df.iloc[:, -1] = df.iloc[:, -1].apply(change_other)
                df = change_columns(df)
            grouped = group_by(list_surveys[i], groups, i, group_name, df)
            combined.append(grouped)
    return combined
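
To make the cleaning steps above concrete, here is a minimal sketch on a made-up survey with three respondents. The column names (`Q1`, `Q7_Part_1`, `Q7_Part_2`) and the answers are hypothetical toy data, not the real Kaggle files. The point is only to show how a multi-select question spread over several columns becomes a count table per group.

```python
import pandas as pd

# Hypothetical toy survey: row 0 holds the question text, as in the Kaggle CSVs.
toy = pd.DataFrame(
    {
        "Q1": ["What is your age?", "18-21", "18-21", "22-24"],
        "Q7_Part_1": ["Language - Python", "Python", "Python", "Python"],
        "Q7_Part_2": ["Language - SQL", pd.NA, "SQL", pd.NA],
    },
    dtype="string",
)

# 1) Keep only the answer rows and the columns of one multi-select question.
answers = toy.loc[1:, toy.columns.str.startswith("Q7_")].copy()

# 2) Rename each column to the single answer option it records.
answers.columns = [answers[col].dropna().iloc[0] for col in answers.columns]

# 3) Count the non-missing answers per group.
age = toy.loc[1:, "Q1"].rename("Age")
counts = pd.concat([age, answers], axis=1).groupby("Age").count()
print(counts)
```

Each cell of `counts` is the number of respondents in that age group who selected that language, which is the shape of table the animation code expects.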

In [None]:
temp = get_df_by_group('Compensation', 'Computing Platform')

temp[3].empty

In [None]:
def animate_X_by_Group(fig, axes, G, X):
#     plt.close('all')
    fig.suptitle(f'Trend of {X} Grouped by {G} for each Year')
    datas = get_df_by_group(G, X)

    # Upper y-limit per year, with headroom for the bar labels.
    # None marks a year with no data for this question.
    max_values = []
    for df in datas:
        if df is not None and not df.empty:
            max_values.append(df.values.max() + 200)
        else:
            max_values.append(None)

    # One frame per group, limited to the smallest group count among years with data.
    n_frames = min((len(df) for df, m in zip(datas, max_values) if m is not None), default=0)

    def animate(frame):
        for i, ax in enumerate(axes.flatten()):
            if max_values[i] is None:
                continue
            # Order the columns by their counts in the first group.
            data = datas[i].sort_values(by=datas[i].index[0], axis=1)

            ax.clear()
            ax.set_ylim([0, max_values[i]])

            # Each frame shows one group (one row of the count table).
            group = data.iloc[frame, :]

            rects = ax.bar(group.index, group.values, color=None)
            ax.bar_label(rects, labels=group.values)

            ax.set_title(f'Year {year_df[i]}: {G} {group.name}')
            plt.setp(ax.get_xticklabels(), rotation=-45, horizontalalignment='left')

            # Legend listing the five most selected options in this group.
            top_5 = group.sort_values(ascending=False).iloc[:5]
            labels = [f'{name}: {count}' for name, count in zip(top_5.index, top_5.values)]
            custom_lines = [Line2D([0], [0], lw=4) for _ in labels]
            ax.legend(custom_lines, labels, title=f'Top 5 {X}',
                      title_fontsize='large', fontsize='medium',
                      loc=2)
        return axes

    ani = FuncAnimation(fig, animate, frames=n_frames, repeat=True, interval=2000, blit=False)

    return ani
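
As a stripped-down sketch of the animation pattern used above: clear the axes, draw one group's bars, and let `FuncAnimation` step through the groups. The table `toy_counts` and the function name `draw_group` are hypothetical, and the `Agg` backend is chosen only so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.animation import FuncAnimation

# Hypothetical count table: one row per group, one column per tool.
toy_counts = pd.DataFrame(
    {"Python": [30, 50, 20], "SQL": [10, 25, 15], "R": [5, 8, 12]},
    index=["0-1 years", "1-5 years", "5+ years"],
)

fig, ax = plt.subplots()

def draw_group(frame):
    """Redraw the bar chart for the group at row position `frame`."""
    ax.clear()
    row = toy_counts.iloc[frame].sort_values()
    bars = ax.bar(row.index, row.values)
    ax.bar_label(bars)
    # Fixed y-limit so the bars are comparable from frame to frame.
    ax.set_ylim(0, toy_counts.values.max() + 10)
    ax.set_title(f"Group: {row.name}")
    return bars

# One frame per group (row), each shown for two seconds.
ani = FuncAnimation(fig, draw_group, frames=len(toy_counts), interval=2000, repeat=True)
plt.close(fig)
```

In a notebook, `HTML(ani.to_jshtml())` would then render the animation inline, as the cells below do for the real data.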

## Let's See What We Got!

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 12))
plt.subplots_adjust(top = 0.9, bottom=0.1, right=0.9, left=0.1, hspace=0.3, wspace=0.3)
result = animate_X_by_Group(fig, axes, 'Compensation', 'Programming Language')
plt.close()

# 1) Language Used Grouped by Compensation

### Python came first in every group and every year.
### SQL came in a strong second.
<br>
<br>

### Other languages such as JavaScript, C++, R, and Bash also appeared in the top 5.

#### My intuition is that Python is the primary language of data science and artificial intelligence,
#### while JavaScript, SQL, C++, R, and Bash cover the surrounding work: web interaction, database queries, statistics, embedded systems, and operating-system scripting.
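
The "Top 5" legend in each panel comes from sorting one group's counts in descending order. As a small sketch with made-up counts (the numbers below are hypothetical, not survey results), `Series.nlargest` gives a similar ranking in one call:

```python
import pandas as pd

# Hypothetical counts for a single compensation group.
group = pd.Series(
    {"Python": 120, "SQL": 80, "R": 35, "C++": 30, "Bash": 22, "Java": 18, "MATLAB": 9}
)

# Five most selected languages, largest first.
top_5 = group.nlargest(5)
labels = [f"{name}: {count}" for name, count in top_5.items()]
print(labels)
```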

In [None]:
HTML(result.to_jshtml())

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 12))
plt.subplots_adjust(top = 0.9, bottom=0.1, right=0.9, left=0.1, hspace=0.3, wspace=0.3)
result = animate_X_by_Group(fig, axes, 'ML Experience', 'Visualization Libraries/Tools')
plt.close()

# 2) Visualization Libraries/Tools grouped by ML Experience

### - Matplotlib seems to be the go-to choice for visualizations.
### - Seaborn, which is built on top of Matplotlib, came in second.
### - ggplot2, Plotly, Shiny (for R), and Geoplotlib (geo-data) also appeared in the top 5.


<br>
<br>


#### I used Matplotlib to create data visualizations and animations.
#### It allows flexibility and versatility, which, combined with other visualization tools, can help data scientists tremendously.

In [None]:
HTML(result.to_jshtml())

# ML Frameworks Grouped by ML Experience

## The plot shows which frameworks are most popular:
### - Scikit-learn first,
### - then TensorFlow, Keras, and PyTorch,
### - and then XGBoost and RandomForest.

<br/> 

### As with programming languages, the number of users of machine learning frameworks and tools seems to be growing.
### This may be due to rapid developments and advances in machine learning communities.
### It's quite awesome to see what may be the beginning of exponential growth in the data science community!
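
One caveat when reading growth from these plots: they show raw counts, and the number of respondents differs from year to year, so a taller bar does not by itself mean a larger share. Below is a minimal sketch of converting a count table into per-group percentages. The table and the `group_sizes` values are hypothetical numbers, not survey results.

```python
import pandas as pd

# Hypothetical counts: rows are groups, columns are frameworks.
counts = pd.DataFrame(
    {"Scikit-learn": [40, 90], "TensorFlow": [20, 60]},
    index=["< 1 year", "1-2 years"],
)

# Hypothetical number of respondents in each group.
group_sizes = pd.Series([50, 120], index=counts.index)

# Divide each row by its group size to get the percentage of respondents
# in that group who selected each framework.
shares = counts.div(group_sizes, axis=0).mul(100).round(1)
print(shares)
```

Because respondents can select several frameworks, the percentages in a row do not need to add up to 100.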


In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 12))
plt.subplots_adjust(top = 0.9, bottom=0.1, right=0.9, left=0.1, hspace=0.3, wspace=0.3)
result = animate_X_by_Group(fig, axes, 'ML Experience', 'ML Frameworks')
plt.close()

In [None]:
HTML(result.to_jshtml())

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 12))
plt.subplots_adjust(top = 0.9, bottom=0.1, right=0.9, left=0.1, hspace=0.3, wspace=0.3)
result = animate_X_by_Group(fig, axes, 'Compensation', 'Computing Platform')
plt.close()

# Computing Platform Grouped by Compensation for each Year

## The plot reveals a clear trend:
### - Data scientists have been ramping up their use of cloud services.
### - AWS, Google Cloud Platform, and Microsoft Azure are some of the big names at the moment.

<br/> 

### As in the other categories, we expect usage of the top 5 platforms to increase.
### Cloud computing platforms allow data scientists to leverage the power of data centers to analyze data and train machine learning models.

In [None]:
HTML(result.to_jshtml())

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 12))
plt.subplots_adjust(top = 0.9, bottom=0.1, right=0.9, left=0.1, hspace=0.3, wspace=0.3)
result = animate_X_by_Group(fig, axes, 'Compensation', 'Courses')
plt.close()

# Courses Completed or Started Grouped by Compensation for each Year


### - Data scientists have been taking courses from providers such as Coursera, Kaggle, universities, Udemy, DataCamp, and edX.

<br/>
** The 2020 panel is empty: no matching columns for this question were found in the 2020 file.
<br/> 

### Most of these courses are free or affordable, so they are a great place to start studying.
### My intuition is that AI education and consulting will be a huge catalyst for the AI revolution.

In [None]:
HTML(result.to_jshtml())

# Wrapping Things Up...

<br>
<br />

### What I am excited about:
### - The expanding data science community (Stack Overflow, Kaggle, YouTube...). So many people helped me create these visualizations, and I can't thank them enough.
### - Improving documentation (Matplotlib, Seaborn, TensorFlow, PyTorch, Keras) that accelerates data science projects.
### - Powerful technologies being developed by big companies that give immense computing power to individuals like me.
### - Learning more about data visualization and artificial intelligence. There is a lot of room for improvement (speed, accuracy, publications, revisions).

<br />

### Improvements I could have made: 
### - I could have cleaned the data much more thoroughly to allow for more robust analysis.
### - I could have tried different kinds of plots, such as scatter plots and 3D animations.
### - I could have come up with more interesting subsets.