## Analyzing Kaggle Survey in a More Structured Way

Yao-Jen Kuo from <yaojenkuo@datainpoint.com>

In [None]:
import re
import pandas as pd
import matplotlib.pyplot as plt

## TL;DR

The Kaggle Survey 2021 data contains 369 columns covering 51 questions, a mix of multiple choice and multiple selection questions. Exploring it without reusable code such as functions or classes is painful and tedious. In this notebook, we define a class `KaggleSurvey2021` that helps us explore the Kaggle Survey 2021 data in a more structured way, and we show how to conduct data analysis with object-oriented programming in addition to the traditional procedural approach.

## The `KaggleSurvey2021` class

We define three main methods on the `KaggleSurvey2021` class for analyzing both the survey questions and the responses.

1. The `generate_question_table()` method returns a dataframe that maps question indexes to their descriptions and question types.
2. The `summarize_survey_response(question_index, order_by_value=True, show_value_counts=True)` method returns an aggregated summary of value counts for a given question index.
3. The `plot_survey_summary(question_index, horizontal=True, n=3)` method plots a horizontal (default) or vertical bar chart of the aggregated summary for a given question index.
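
To make the idea of a question index concrete, here is a minimal sketch of how column names could be reduced to question indexes. It assumes column names shaped like `Q1`, `Q7_Part_1`, `Q7_OTHER`, or `Q27_A_Part_1`. The helper name `to_question_index` is hypothetical and only mirrors the splitting logic used inside the class.

```python
def to_question_index(column_name: str) -> str:
    """Reduce a survey column name to its question index (hypothetical helper)."""
    parts = column_name.split("_")
    # 'Q27_A_Part_1' -> 'Q27A': keep the A/B suffix as part of the index.
    if len(parts) > 1 and parts[1] in {"A", "B"}:
        return parts[0] + parts[1]
    # 'Q1' -> 'Q1' and 'Q7_Part_1' -> 'Q7': only the leading token matters.
    return parts[0]


# Assumed column names, following the naming pattern described above.
columns = ["Q1", "Q7_Part_1", "Q7_OTHER", "Q27_A_Part_1", "Q27_B_Part_1"]
question_indexes = [to_question_index(c) for c in columns]
print(question_indexes)
```

All columns that belong to one question collapse to the same index, which is what lets the other methods select them as a group.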

In [None]:
class KaggleSurvey2021:
    def __init__(self, csv_file_path: str) -> None:
        """
        Args:
            csv_file_path (str): Specify the file path of kaggle_survey_2021_responses.csv.
        """
        self._first_two_lines = pd.read_csv(csv_file_path, nrows=1)
        temp_df = pd.read_csv(csv_file_path, skiprows=[1], low_memory=False)
        self._survey_data = temp_df.drop('Time from Start to Finish (seconds)', axis=1)
    def generate_question_table(self) -> pd.DataFrame:
        """
        Returns a DataFrame of question indexes, descriptions, and types.
        """
        questions = self._first_two_lines.iloc[0, 1:]
        question_indexes_str_split = self._first_two_lines.columns[1:].str.split("_")
        question_indexes = []
        for question_index in question_indexes_str_split:
            if len(question_index) == 1:
                question_indexes.append(question_index[0])
            elif question_index[1] in {"A", "B"}:
                question_indexes.append("{}{}".format(question_index[0], question_index[1]))
            else:
                question_indexes.append(question_index[0])
        self._question_indexes = pd.Series(question_indexes)
        unique_question_indexes = pd.Series(question_indexes).drop_duplicates().tolist()
        # Raw string so that \( and \) reach the regex engine without invalid-escape warnings.
        multiple_selection_pattern = r" \(Select all that apply\).*"
        multiple_choice_pattern = " - Selected Choice.*"
        questions_substituted = list()
        for question in questions:
            question_sub_multiple_selection_pattern = re.sub(pattern=multiple_selection_pattern, repl="", string=question)
            question_sub_multiple_choice_pattern = re.sub(pattern=multiple_choice_pattern, repl="", string=question_sub_multiple_selection_pattern)
            questions_substituted.append(question_sub_multiple_choice_pattern)
        question_type_counts = dict()
        for question in questions_substituted:
            if question in question_type_counts.keys():
                question_type_counts[question] += 1
            else:
                question_type_counts[question] = 1
        question_table = pd.DataFrame()
        question_table["question_index"] = unique_question_indexes
        question_table["question_description"] = question_type_counts.keys()
        question_table["question_type"] = ["multiple choice" if v == 1 else "multiple selection" for v in question_type_counts.values()]
        return question_table
    def summarize_survey_response(self, question_index: str, order_by_value: bool=True, show_value_counts: bool=True) -> pd.Series:
        """
        Returns a Series of question summaries in value counts or percentages.
        Args:
            question_index (str): Specify the question, e.g. 'Q1' for Question 1, 'Q27A' for Question 27-A.
            order_by_value (bool): Sort by value vs. index.
            show_value_counts (bool): Show value counts vs. percentage.
        """
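        if not hasattr(self, "_question_indexes"):
            # _question_indexes is only set inside generate_question_table(), so build it on first use.
            self.generate_question_table()
        columns = pd.Series(self._survey_data.columns)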
        question_index_columns = columns[self._question_indexes == question_index]
        df_to_summarize = self._survey_data[question_index_columns]
        response_summary = pd.Series(df_to_summarize.values.ravel()).value_counts().sort_values()
        if not order_by_value:
            response_summary = response_summary.sort_index()
        if not show_value_counts:
            response_summary = response_summary / response_summary.sum()
        return response_summary
    def plot_survey_summary(self, question_index: str, horizontal: bool=True, n: int=3) -> None:
        """
        Plots a horizontal (default) or vertical bar chart for a given question index.
        Args:
            question_index (str): Specify the question, e.g. 'Q1' for Question 1, 'Q27A' for Question 27-A.
            horizontal (bool): Plot horizontal vs. vertical bar.
            n (int): Highlight the top n bars in red (horizontal plots only).
        """
        fig = plt.figure()
        axes = plt.axes()
        if horizontal:
            survey_response_summary = self.summarize_survey_response(question_index)
            y = survey_response_summary.index
            width = survey_response_summary.values
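            colors = ['c'] * y.size
            n_highlight = min(n, y.size)
            if n_highlight > 0:
                # The summary is sorted ascending, so the top n are the last n bars.
                colors[-n_highlight:] = ['r'] * n_highlight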
            axes.barh(y, width, color=colors)
            axes.spines['right'].set_visible(False)
            axes.spines['top'].set_visible(False)
            axes.tick_params(length=0)
        else:
            survey_response_summary = self.summarize_survey_response(question_index, order_by_value=False)
            x = survey_response_summary.index
            height = survey_response_summary.values
            colors = ['c' for _ in range(x.size)]
            axes.bar(x, height, color=colors)
            axes.spines['right'].set_visible(False)
            axes.spines['top'].set_visible(False)
            axes.tick_params(length=0)
        question_table = self.generate_question_table()
        nth_unique_question = question_table[question_table['question_index'] == question_index]
        question_description = nth_unique_question['question_description'].values[0]
        axes.set_title(question_description)
        plt.show()

## The Usage

We can instantiate an object of the `KaggleSurvey2021` class by providing a valid file path to `kaggle_survey_2021_responses.csv`.

In [None]:
csv_file_path = "../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv"
kaggle_survey = KaggleSurvey2021(csv_file_path)
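
The constructor reads the file twice because its second row holds question text rather than responses. The sketch below illustrates that pattern on a tiny in-memory CSV. The sample rows are assumptions made up for illustration, not actual survey data.

```python
import io

import pandas as pd

# A tiny stand-in for the survey file: row 1 holds column names,
# row 2 holds the question text, and the remaining rows hold responses.
csv_text = (
    "Time from Start to Finish (seconds),Q1,Q2\n"
    "Duration (in seconds),What is your age (# years)?,What is your gender? - Selected Choice\n"
    "910,50-54,Man\n"
    "784,22-24,Woman\n"
)

# First pass: the header plus one data row, which is the question text.
question_row = pd.read_csv(io.StringIO(csv_text), nrows=1)
# Second pass: skip the question-text row (file line index 1) so only responses remain.
responses = pd.read_csv(io.StringIO(csv_text), skiprows=[1])
responses = responses.drop("Time from Start to Finish (seconds)", axis=1)

print(question_row.iloc[0, 1:].tolist())
print(responses.shape)
```

Keeping the question text in its own one-row frame means the response frame contains only answers, so its value counts are not polluted by question descriptions.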

Call the `generate_question_table()` method for a summary of the Kaggle Survey 2021 questions.

In [None]:
survey_question_table = kaggle_survey.generate_question_table()
survey_question_table.head()

In [None]:
survey_question_table.tail()

In [None]:
n_questions = survey_question_table.shape[0]
question_summary = survey_question_table['question_type'].value_counts()
n_multiple_choice = question_summary['multiple choice']
n_multiple_selection = question_summary['multiple selection']
print(f"There are {n_multiple_choice} multiple choices and {n_multiple_selection} multiple selections among {n_questions} questions.")
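
The question type is inferred from how many columns share the same cleaned description: one column means multiple choice, several columns mean multiple selection. The following sketch shows this rule on a few assumed question texts. It uses `collections.Counter` and a single combined pattern for brevity, where the class applies two substitutions and a plain dictionary.

```python
import re
from collections import Counter

# Assumed question texts, shaped like the second row of the survey file.
questions = [
    "What is your age (# years)?",
    "What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python",
    "What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R",
    "What is your gender? - Selected Choice",
]

# Strip the suffixes that distinguish the columns of one question.
suffix_pattern = re.compile(r" \(Select all that apply\).*| - Selected Choice.*")
descriptions = [suffix_pattern.sub("", q) for q in questions]

# One column -> multiple choice; several columns -> multiple selection.
counts = Counter(descriptions)
question_types = {
    description: "multiple choice" if n_columns == 1 else "multiple selection"
    for description, n_columns in counts.items()
}
for description, question_type in question_types.items():
    print(f"{question_type}: {description}")
```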

Call the `summarize_survey_response(question_index, order_by_value=True, show_value_counts=True)` method to summarize a question. Specify `order_by_value=False` to order the summary by category instead of by count. Specify `show_value_counts=False` to show proportions instead of counts for a multiple choice question.

In [None]:
kaggle_survey.summarize_survey_response("Q1", order_by_value=False)

In [None]:
kaggle_survey.summarize_survey_response("Q7")

In [None]:
kaggle_survey.summarize_survey_response("Q8", show_value_counts=False)
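
For a multiple selection question, the summary counts values across all of the question's columns at once. Here is a minimal sketch of that aggregation on assumed toy data. The column names and answers are illustrative only.

```python
import numpy as np
import pandas as pd

# Assumed toy responses for one multiple selection question:
# each column is one option, and an unselected option is left as NaN.
toy_responses = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", np.nan, "Python"],
    "Q7_Part_2": ["R", np.nan, "R", np.nan],
    "Q7_Part_3": [np.nan, "SQL", np.nan, np.nan],
})

# Flatten all option columns into one Series; value_counts() drops NaN by default.
flattened = pd.Series(toy_responses.values.ravel())
counts = flattened.value_counts().sort_values()
# Proportions are shares of all selections, not shares of respondents.
shares = counts / counts.sum()

print(counts)
print(shares)
```

Because the denominator is the total number of selections, the proportions of a multiple selection question sum to 1 even though each respondent may pick several options.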

Call the `plot_survey_summary(question_index, horizontal=True)` method to plot a horizontal bar chart for a question. Specify `horizontal=False` to plot a vertical bar chart.

In [None]:
kaggle_survey.plot_survey_summary("Q1", horizontal=False)

In [None]:
kaggle_survey.plot_survey_summary("Q7", n=4)
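
The `n` argument controls how many bars are highlighted. Since the summary is sorted in ascending order, the top `n` responses are the last `n` entries, which a horizontal bar chart draws at the top. The sketch below isolates that coloring rule. The helper `highlight_colors` is hypothetical and not part of the class, and it clamps `n` to the number of bars, which is an assumption about the desired behavior.

```python
def highlight_colors(n_bars: int, n: int, base: str = "c", highlight: str = "r") -> list:
    """Color the last n bars with the highlight color (hypothetical helper)."""
    n = max(0, min(n, n_bars))  # Clamp n so the list length always equals n_bars.
    return [base] * (n_bars - n) + [highlight] * n


print(highlight_colors(5, 2))
```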

## The Exploration

We would like to portray the profile of a data scientist/analyst/engineer by exploring the following questions:

- Major responsibility (Q24)
- Programming languages (Q7, Q8)
- Relational database management systems (Q33)
- Visualization libraries (Q14)
- Business intelligence tools (Q35)
- Machine learning (Q16, Q17)

In [None]:
kaggle_survey.plot_survey_summary("Q24")

In [None]:
kaggle_survey.plot_survey_summary("Q7", n=4)

In [None]:
kaggle_survey.plot_survey_summary("Q8")

In [None]:
kaggle_survey.plot_survey_summary("Q33")

In [None]:
kaggle_survey.plot_survey_summary("Q14", n=4)

In [None]:
kaggle_survey.plot_survey_summary("Q35", n=2)

In [None]:
kaggle_survey.plot_survey_summary("Q16", n=5)

In [None]:
kaggle_survey.plot_survey_summary("Q17", n=4)

## Conclusion

Notebook users have long been criticized for not writing reusable code. In this notebook, we use object-oriented programming to show how exploratory analysis can be made reproducible by instantiating the `KaggleSurvey2021` class. If you also find it convenient, it is time to add some object-oriented flavor to your own notebooks!

The `KaggleSurvey2021` module can be downloaded via Kaggle Datasets: <https://www.kaggle.com/yaojenkuo/ks2021py>.