# Analysis QnA
The first question is whether the questionnaire respondents are representative of the Kaggle community. This is also our first assumption: *Assume that the questionnaire respondents are a representative subset of the Kaggle community*. 

(to be filled in)

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import os
import re
import pandas as pd
import seaborn as sns
from textwrap import wrap

plt.rcParams["figure.dpi"] = 100

In [None]:
df = pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")

In [None]:
# By Andrada Olteanu
# https://www.kaggle.com/andradaolteanu/siim-covid-19-box-detect-dcm-metadata
def show_values_on_bars(axs, h_v="v", space=0.4):
    '''Plots the value at the end of each bar of a seaborn barplot.
    axs: the ax (or array of axes) of the plot
    h_v: whether the barplot is vertical ("v") or horizontal ("h")
    space: horizontal offset of the label (horizontal barplots only)'''
    
    def _show_on_single_plot(ax):
        if h_v == "v":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() / 2
                _y = p.get_y() + p.get_height()
                value = int(p.get_height())
                ax.text(round(_x, 5), round(_y, 5), format(round(value, 5), ','), ha="center") 
        elif h_v == "h":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() + float(space)
                _y = p.get_y() + p.get_height()
                value = int(p.get_width())
                ax.text(_x, _y, format(value, ','), ha="left")

    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs): _show_on_single_plot(ax)
    else: _show_on_single_plot(axs)

In [None]:
def our_countplot(column, df):
    ax = sns.countplot(x=df[column].to_numpy()[1:])
    plt.title("\n".join(wrap(df[column].to_numpy()[0], 60)))
    plt.grid(axis="y")
    show_values_on_bars(ax)
    _ = plt.xticks(rotation=90)
    
    
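def re_filter(query, df_in):
    """Count the answers of a multi-part question, e.g. 'Q7'.

    Returns the per-option counts, the option labels and the first matching column.
    """
    r = re.compile(query)
    newlist = list(filter(r.match, df_in.columns.to_list()))
    
    # Use the first non-NaN value of each part column as its label.
    # Note: labels are read from the module-level `df`, not from `df_in`.
    columns = [df[col].iloc[1:].unique().astype("str") for col in newlist]
    columns = [g[g != "nan"][0] for g in columns]
    
    # Row 0 holds the question text, so subtract it from the non-null count.
    g = df_in[newlist].count() - 1
    g.index = columns
    g = g.reset_index(level=0)
    g.columns = ["index", "count"]
    return g, columns, newlist[0]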


def our_plot(query, df):
    if query == "Q3": plt.figure(figsize=(12, 7))
    try: our_countplot(query, df)
    except KeyError:
        out, columns, title_col = re_filter(query, df)
        ax = sns.barplot(x="index", y="count", data=out)
        show_values_on_bars(ax)
        plt.xticks(rotation=90)
        plt.grid(axis="y")
        plt.title("\n".join(wrap(df[title_col].to_numpy()[0].split("?")[0] + "?", 60)))
        
        
def get_columns(query):
    """Given a question, e.g. 'Q7', return all column names that are 'Q7*'"""
    r = re.compile(query)
    return np.array(list(filter(r.match, df.columns.to_list())))
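
One caveat about `get_columns` and `re_filter`: `re.match` only anchors at the start of the string, so a short query such as "Q1" would presumably also pick up "Q10..." columns. Below is a minimal sketch of a stricter matcher. The helper name `question_columns` and the sample column names are hypothetical, and the assumption is that part columns are named with an underscore suffix after the question number.

```python
import re

def question_columns(query, columns):
    """Return columns that are exactly `query` or `query` followed by '_'."""
    pattern = re.compile(rf"{re.escape(query)}(_|$)")
    return [col for col in columns if pattern.match(col)]

# Hypothetical column names, for illustration only.
sample_columns = ["Q1", "Q10_Part_1", "Q7_Part_1", "Q7_Part_3", "Q7_OTHER"]

# Prefix matching, as done with re.compile(query).match in the helpers above.
loose = [c for c in sample_columns if re.compile("Q1").match(c)]
# Stricter matching that stops at the end of the question number.
strict = question_columns("Q1", sample_columns)

print(loose)
print(strict)
```

For queries such as "Q7", where no other question number starts with the same digits, both approaches select the same columns.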

1) What type of people use SQL? Are they mainly business people? How much programming experience do they have? 
We are not asking which products they use here; we only want to know which communities they represent. Given the importance of SQL, it is worth knowing which communities are not exposed to it, so that targeted resources can be created to introduce them to SQL (and to make it easier to pick up if they find it difficult). 

In [None]:
get_columns("Q7")

In [None]:
# Select SQL users while keeping the first row (the question text)
q = np.array(df["Q7_Part_3"] == "SQL")
q[0] = True

df_sql = df[q]
df_sql.head()

In [None]:
our_plot("Q5", df_sql)

After looking through Q1 to Q6, one finds that only Q5 carries useful information; for the other questions, SQL users and non-users have the same population distributions. Q5 shows a difference: 

Most people using SQL are Students, Data Analysts, Data Scientists and Software Engineers. None of this is surprising. For example, the Data Engineer count is low, but that is expected because few Data Engineers took the survey (466 of 668 use SQL). What is surprising is **Machine Learning Engineer (MLE)**.

SQL does not seem to be widely used among MLEs (491 of 1499), so one may deduce that they perhaps focus more on model development. 
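
To make the comparison concrete, the two counts quoted above can be turned into shares. This is only a small arithmetic sketch; the helper name `sql_share` is hypothetical and the numbers are the ones cited in the text.

```python
# Counts quoted in the text: (SQL users, total respondents) per role.
counts = {
    "Data Engineer": (466, 668),
    "Machine Learning Engineer": (491, 1499),
}

def sql_share(users, total):
    """Percentage of respondents in a role who selected SQL."""
    return 100 * users / total

shares = {role: round(sql_share(u, t), 1) for role, (u, t) in counts.items()}
for role, pct in shares.items():
    print(f"{role}: {pct}% use SQL")
```

Roughly 70% of Data Engineers selected SQL against roughly 33% of MLEs, which is the gap discussed here.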

A few months ago, data-centric development was introduced by Andrew Ng (https://www.youtube.com/watch?v=06-AZXmwHjo&ab_channel=DeepLearningAI). (This is from one's own perspective, since that is how one got introduced to the idea; he might not be the first to do data-centric work, it might be his team or another team, and one didn't research that.) Perhaps it's time to start learning SQL and be prepared. Even though we can experiment on a small subset of data, applying the result to the real world means integrating whatever an MLE experimented with into a pipeline without much delay. SQL provides a very quick way to work through data, so it's important to be able to translate whatever query one uses into SQL. 

For example, if one works with a Pandas DataFrame and uses its SQL-like commands, that is not the fastest way to clean data (especially since Pandas itself runs on a single core only, while others like Spark DataFrame run on all cores, and others like cuDF or BlazingSQL run on GPU/CUDA for preprocessing). 

These speed differences are important. 
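
The idea of translating a Pandas query into SQL can be sketched as follows. This is an illustration under stated assumptions, not the pipeline itself: the toy data, the `responses` table name and the `uses_sql` column are hypothetical, and an in-memory SQLite database stands in for whatever database a real pipeline would use.

```python
import sqlite3
import pandas as pd

# Hypothetical toy data standing in for survey answers.
toy = pd.DataFrame({
    "role": ["Student", "Data Analyst", "Student", "Data Scientist", "Student"],
    "uses_sql": [1, 1, 0, 1, 1],
})

# Pandas version: number of SQL users per role.
pandas_result = (
    toy[toy["uses_sql"] == 1]
    .groupby("role")
    .size()
    .sort_index()
)

# SQL version of the same query, run on an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
toy.to_sql("responses", conn, index=False)
sql_result = pd.read_sql_query(
    """
    SELECT role, COUNT(*) AS n
    FROM responses
    WHERE uses_sql = 1
    GROUP BY role
    ORDER BY role
    """,
    conn,
)
conn.close()

# Both approaches should give the same counts per role.
pandas_counts = {k: int(v) for k, v in pandas_result.to_dict().items()}
sql_counts = {r: int(n) for r, n in zip(sql_result["role"], sql_result["n"])}
print(pandas_counts)
print(sql_counts)
```

The filter maps to `WHERE`, `groupby(...).size()` maps to `GROUP BY` with `COUNT(*)`, and `sort_index()` maps to `ORDER BY`.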

In [None]:
our_plot("Q7", df)

2) It seems that about as many people use C++ (ignoring C here) as R (at least among those who took the questionnaire). But Kaggle notebooks only support Python and R. Are C++ notebooks or scripts available to the Kaggle community as well? Is there a difference in performance between developing with the PyTorch Python API and the PyTorch C++ API? (We ignore TensorFlow here, but this may well generalize to TF and other ML libraries if they support both languages.) 