The Stack Overflow Survey asked the question:

> To what extent do you agree or disagree with each of the following statements:
>
> * I feel a sense of kinship or connection to other developers
> * I think of myself as competing with my peers
> * I’m not as good at programming as most of my peers

These questions are interesting not only to Stack Overflow but also to the Kaggle community where we learn and compete with one another!

In [13]:
import numpy   as np
import pandas  as pd
import seaborn as sns

import matplotlib
import matplotlib.patches as mpatches
import matplotlib.pyplot  as plt

matplotlib.rcParams['figure.figsize'] = (20, 7)

In [14]:
survey = pd.read_csv('../input/survey_results_public.csv', dtype='category')

## The Statements: Agree or Disagree

In [16]:
statements = {
    'AgreeDisagree1': "I feel a sense of kinship or connection to other developers",
    'AgreeDisagree2': "I think of myself as competing with my peers",
    'AgreeDisagree3': "I'm not as good at programming as most of my peers"
}

general_counts = []

for i in range(3):
    column = f"AgreeDisagree{i + 1}"

    survey[column] = survey[column].cat.reorder_categories([
        'Strongly agree',
        'Agree',
        'Neither Agree nor Disagree',
        'Disagree',
        'Strongly disagree',
    ])

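    # Name the index 'answer' up front so the result does not depend on
    # what reset_index calls the index column in a given pandas version.
    df = (
        survey[column].value_counts().sort_index()
        .rename_axis('answer').reset_index(name='count')
    )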
    df['statement'] = statements[column]

    general_counts.append(df)

ax = plt.subplot()
ax.set_title("To what extent do you agree or disagree with each of the following statements?")

sns.barplot(
    data=pd.concat(general_counts),
    x='count',
    y='answer',
    hue='statement',
    ax=ax
)

plt.savefig('statements.png')
plt.show()

We see that most developers feel a sense of kinship with one another - that is great. We also see that while competitiveness is pretty split, developers generally have confidence in their skills.

This notebook takes a close look at the competitiveness divide. It should be interesting to see how developers who do or don't see themselves as competing with their peers answer the rest of the survey.

## Some preparations

We implement a percentage stacked bar plot, since shares are easier to compare than raw counts when the groups differ a lot in size.

In [4]:
def percent_plot(
    df,
    bar=0.85,
    orientation='h',
    title='',
    palette=sns.color_palette("coolwarm", 5),
    figsize=None
):
    # get percentages from counts
    df = df.div(df.sum(axis=1), axis=0)

    ticks = range(len(df))
    last_values = [0 for tick in ticks]

    # if graphing horizontally, we want to order
    # from top to bottom
    if orientation == 'h':
        df = df.iloc[::-1]
        
    if figsize is not None:
        plt.figure(figsize=figsize)

    for i, col in enumerate(df.columns):
        if orientation == 'h':
            plt.xlabel("Percent")
            plt.yticks(ticks, df.index.tolist())
            plt.barh(
                ticks,
                df[col].tolist(),
                height=bar,
                left=last_values,
                color=palette[i]
            )

        else:
            plt.ylabel("Percent")
            plt.xticks(ticks, df.index.tolist())
            plt.bar(
                ticks,
                df[col].tolist(),
                width=bar,
                bottom=last_values,
                color=palette[i]
            )

        last_values = [
            last + curr
            for last, curr in zip(last_values, df[col].tolist())
        ]

    plt.legend(
        title='Competitive',
        handles=[
            mpatches.Patch(color=palette[i], label=col)
            for i, col in enumerate(df.columns)
        ]
    )

    plt.title(title)
    plt.show()
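The arithmetic inside `percent_plot` can be checked on a toy table. This is a minimal sketch with made-up counts; `counts`, `shares`, and `lefts` are names introduced here for illustration, and `lefts` reproduces with `cumsum` what the function accumulates in `last_values`.

```python
import pandas as pd

# Hypothetical counts: rows are answer groups, columns are competitiveness levels.
counts = pd.DataFrame(
    {'Agree': [30, 10], 'Disagree': [70, 30]},
    index=['18 - 24 years old', '25 - 34 years old'],
)

# Same normalization as in percent_plot: divide each row by its own total.
shares = counts.div(counts.sum(axis=1), axis=0)

# Starting edge of each stacked segment: the cumulative share of the
# columns that come before it (zero for the first column).
lefts = shares.cumsum(axis=1).shift(1, axis=1).fillna(0)

print(shares)
print(lefts)
```

Every row of `shares` sums to 1, which is why all bars in the plot have the same total length.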

Also, we will need a function that can take a column in the survey and produce an aggregate count dataframe in the shape needed by the above graphing utility.

In [5]:
def calculate_column_counts(column):
    df = survey.groupby([column, 'AgreeDisagree2']).size().reset_index(name='Count')
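    # keyword arguments: recent pandas versions no longer accept them positionally
    df = df.pivot(index=column, columns='AgreeDisagree2', values='Count')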
    return df

# for columns where a respondent can pick several ';'-separated answers
def calculate_multivalue_column_counts(column):
    # one row per selected answer; the original row index is repeated
    # (astype(str) keeps missing answers as the literal string 'nan')
    s = survey[column].astype(str).str.split(';').explode()
    s.name = column

    return (
        survey[['AgreeDisagree2']]
        .join(s)
        .groupby([column, 'AgreeDisagree2'])
        .size()
        .reset_index(name='Count')
        .pivot(index=column, columns='AgreeDisagree2', values='Count')
    )
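The split-and-count idea behind `calculate_multivalue_column_counts` can be tried on a toy frame. This is only a sketch: the data is invented, `toy`, `options`, and `counts` are illustrative names, and dropping missing answers before splitting is a choice made here, not necessarily what the notebook does.

```python
import pandas as pd

# Hypothetical mini-survey; the values are made up for illustration.
toy = pd.DataFrame({
    'Gender': ['Male', 'Female;Transgender', 'Male', None],
    'AgreeDisagree2': ['Agree', 'Disagree', 'Disagree', 'Agree'],
})

# Split on ';' and give each selected option its own row.
# The index of the original row is repeated for every option.
options = toy['Gender'].dropna().str.split(';').explode()

# Join back on the index, then count (option, answer) pairs.
counts = (
    toy[['AgreeDisagree2']]
    .join(options, how='inner')
    .groupby(['Gender', 'AgreeDisagree2'])
    .size()
    .unstack(fill_value=0)
)
print(counts)
```

A respondent who selects two options contributes one count to each of them, so the table total can exceed the number of respondents.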

Lastly, let's add a function that does the previous two steps in one call.

In [6]:
def graph_column(column, calculate=calculate_column_counts, **kwargs):
    df = calculate(column)
    percent_plot(df, title=f"Competitiveness by {column}", **kwargs)

Now we can start.

## Who are they?


### Age

First, let's look at their ages.

In [7]:
# assign the result back: `inplace=True` is not available in recent pandas
survey['Age'] = survey['Age'].cat.reorder_categories([
    'Under 18 years old',
    '18 - 24 years old',
    '25 - 34 years old',
    '35 - 44 years old',
    '45 - 54 years old',
    '55 - 64 years old',
    '65 years or older'
])

graph_column('Age')

We see that competitiveness peaks at `18 - 24 years old`. After that, it declines steadily, and the decline only breaks once respondents reach retirement age.
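The `reorder_categories` call before the plot matters because counts and group-bys follow category order. A minimal sketch with three made-up answers (the series `ages` is illustrative only):

```python
import pandas as pd

# Hypothetical answers. Without an explicit order, pandas infers the
# categories in sorted (lexical) order, so digits come before 'Under'.
ages = pd.Series(
    ['25 - 34 years old', 'Under 18 years old', '18 - 24 years old'],
    dtype='category',
)
print(ages.cat.categories.tolist())

# Declare the order we actually want on the axis.
ages = ages.cat.reorder_categories([
    'Under 18 years old',
    '18 - 24 years old',
    '25 - 34 years old',
])

# sort_index on the counts now follows the declared order.
print(ages.value_counts().sort_index().index.tolist())
```

The same applies to `groupby`, which is why the bars in the plot come out youngest to oldest.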

### Country

In [8]:
graph_column('Country', figsize=(20, 40))

It's interesting to note that we have countries with only competitive developers but no country with only non-competitive developers. There are only 11 respondents from Angola and all of them say they see other developers as competition.

In [9]:
graph_column('Gender', calculate=calculate_multivalue_column_counts)

While there have been studies like this [one](http://gap.hks.harvard.edu/do-women-shy-away-competition-do-men-compete-too-much) that conclude women shy away from competition, that doesn't seem to be true for developers here. What we do see is that gender non-conforming and transgender respondents tend to be less competitive. I wonder why.

Let's check out sexual orientation:

In [10]:
graph_column('SexualOrientation', calculate=calculate_multivalue_column_counts)

Heterosexual and asexual respondents look to have the same level of competitiveness. Gay, lesbian, and bisexual respondents are less competitive than the rest.

Interesting. I wonder if this is significant enough to warrant further study.

## Work

### Number of Monitors

In [9]:
graph_column('NumberMonitors')

If the competitive ones were to have more than one monitor, they wouldn't settle for 2, 3, or even 4. No, they'd have more than that.

### Wake Time

In [10]:
survey['WakeTime'] = survey['WakeTime'].cat.reorder_categories([
    'Before 5:00 AM',
    'Between 5:00 - 6:00 AM',
    'Between 6:01 - 7:00 AM',
    'Between 7:01 - 8:00 AM',
    'Between 8:01 - 9:00 AM',
    'Between 9:01 - 10:00 AM',
    'Between 10:01 - 11:00 AM',
    'Between 11:01 AM - 12:00 PM',
    'After 12:01 PM',
    'I work night shifts',
    'I do not have a set schedule'
])
graph_column('WakeTime')

In [11]:
survey['HoursComputer'] = survey['HoursComputer'].cat.reorder_categories([
    'Less than 1 hour',
    '1 - 4 hours',
    '5 - 8 hours',
    '9 - 12 hours',
    'Over 12 hours'
])
graph_column('HoursComputer')

I expected competitive people to spend more time at a computer, but as it turns out they tend to prefer being outside:

In [12]:
survey['HoursOutside'] = survey['HoursOutside'].cat.reorder_categories([
    'Less than 30 minutes',
    '30 - 59 minutes',
    '1 - 2 hours',
    '3 - 4 hours',
    'Over 4 hours'
])
graph_column('HoursOutside')

It looks like competitive developers understand that career success is defined not just by how long you work in front of a computer but also by how effectively you network outside. To illustrate this further, look at their reasons for joining hackathons:

In [18]:
graph_column('HackathonReasons', calculate=calculate_multivalue_column_counts)

Sure enough, they like joining hackathons to find new job opportunities and to build their professional network.

Also not surprising: they want to win prizes! I'd even guess it's not the prizes or cash awards they're truly after. It's the winning.

## Well-being

In [13]:
graph_column('Exercise')

Nothing very noticeable.

In [14]:
survey['SkipMeals'] = survey['SkipMeals'].cat.reorder_categories([
    'Never',
    '1 - 2 times per week',
    '3 - 4 times per week',
    'Daily or almost every day',
])
graph_column('SkipMeals')

They need their meals if they are to compete effectively!

## Satisfaction

I'm guessing competitive developers are the type that are not satisfied where they are and that drives them to strive for more.

In [34]:
survey['JobSatisfaction'] = survey['JobSatisfaction'].cat.reorder_categories([
    'Extremely satisfied',
    'Moderately satisfied',
    'Slightly satisfied',
    'Neither satisfied nor dissatisfied',
    'Slightly dissatisfied',
    'Moderately dissatisfied',
    'Extremely dissatisfied',
])

graph_column('JobSatisfaction')

And sure enough, they are less likely to report being extremely satisfied. They don't seem to report being dissatisfied more often, though. Let's see if this holds for career satisfaction:

In [35]:
survey['CareerSatisfaction'] = survey['CareerSatisfaction'].cat.reorder_categories([
    'Extremely satisfied',
    'Moderately satisfied',
    'Slightly satisfied',
    'Neither satisfied nor dissatisfied',
    'Slightly dissatisfied',
    'Moderately dissatisfied',
    'Extremely dissatisfied',
])

graph_column('CareerSatisfaction')

What do they hope to be doing in five years?

In [36]:
graph_column('HopeFiveYears')

Very interesting. They tend not to want to be doing the same work, or even to retire. No, they're more ambitious than that. Competitive developers tend to want to move into a managerial role or even found their own company!

And to that end, they tend to be more open to new opportunities:

In [41]:
survey['JobSearchStatus'] = survey['JobSearchStatus'].cat.reorder_categories([
    'I am actively looking for a job',
    'I’m not actively looking, but I am open to new opportunities',
    'I am not interested in new job opportunities',
])

graph_column('JobSearchStatus')

In [12]:
def plot_rank(name, num):
    df = get_rank_df(name, num)
    sns.boxplot(
        data=df,
        x='Rank',
        y=name,
        hue='AgreeDisagree2'
    );
    
def get_rank_df(name, num):
    columns = [
        f"{name}{i + 1}"
        for i in range(num)
    ]

    for column in columns:
        survey[column] = survey[column].astype(float)
    
    columns.append('AgreeDisagree2')

    df = pd.melt(
        survey[columns],
        id_vars=['AgreeDisagree2'],
        value_vars=columns[:num],
        var_name=name,
        value_name='Rank'
    ).dropna()

    return df

## Ethics

#### Imagine that you were asked to write code for a purpose or product that you consider extremely unethical. Do you write the code anyway?

In [26]:
graph_column('EthicsChoice')

This is revealing. Ethics will not stand in the way of competing with others.

#### Do you report or otherwise call out the unethical code in question?

In [27]:
graph_column('EthicsReport')

This is consistent with the previous finding.

I think the reasoning here is that reporting will hurt their career.

#### Who do you believe is ultimately most responsible for code that accomplishes something unethical?

In [28]:
graph_column('EthicsResponsible')

At least they don't shy away from the responsibility.

#### Do you believe that you have an obligation to consider the ethical implications of the code that you write?

In [29]:
graph_column('EthicalImplications')

And this is how we'll end up with cylons: competitive people with little regard for ethics.

Speaking of which, who do they think should be responsible for considering the ramifications of increasingly advanced AI technology?

In [63]:
graph_column('AIResponsible')

At least it's not a resounding "Nobody!".

## Earnings

Given all we have uncovered about competitive developers, it would now be interesting to see how competitiveness affects earnings.

In [32]:
survey['ConvertedSalary'] = survey['ConvertedSalary'].astype('float')

sns.boxplot(
    data=survey[['AgreeDisagree2', 'ConvertedSalary']].dropna(),
    x='AgreeDisagree2',
    y='ConvertedSalary',
    palette=sns.color_palette("coolwarm", 5),
    showfliers=False
)

plt.title('ConvertedSalary by Competitiveness')
plt.show()

That's a bit surprising. I expected them to be earning more but that's not what we see here. If anything, they look to be earning slightly less on average.
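A box plot is easier to read next to the numbers behind it. The sketch below shows one way to compute them; the salaries are invented purely to demonstrate the calculation and say nothing about the real survey, and `toy` and `summary` are illustrative names.

```python
import pandas as pd

# Hypothetical salaries, made up for this example only.
toy = pd.DataFrame({
    'AgreeDisagree2': ['Agree', 'Agree', 'Disagree', 'Disagree', 'Disagree'],
    'ConvertedSalary': [50000.0, 70000.0, 55000.0, 65000.0, 90000.0],
})

# The center line of a box plot is the median, so report the median
# alongside the mean and the group size.
summary = (
    toy.groupby('AgreeDisagree2')['ConvertedSalary']
    .agg(['median', 'mean', 'count'])
)
print(summary)
```

Salary data tends to be skewed, so the median and the mean can tell different stories; looking at both avoids reading too much into either.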

## Taking this survey

Let's see how competitiveness affects how developers feel about the survey.

In [20]:
survey['SurveyTooLong'] = survey['SurveyTooLong'].cat.reorder_categories([
    'The survey was too short',
    'The survey was an appropriate length',
    'The survey was too long'
])

graph_column('SurveyTooLong')

I don't find that very interesting. The next one, though:

In [19]:
survey['SurveyEasy'] = survey['SurveyEasy'].cat.reorder_categories([
    'Very easy',
    'Somewhat easy',
    'Neither easy nor difficult',
    'Somewhat difficult',
    'Very difficult',
])

graph_column('SurveyEasy')

I just find this hilarious. Apparently, competitive developers find surveys difficult. I wonder if that means they tend to fill out surveys more completely:

In [19]:
null_counts = pd.DataFrame({
    'AgreeDisagree2': survey['AgreeDisagree2'],
    'NullCount':      survey.isnull().sum(axis=1)
})

sns.boxplot(
    data=null_counts,
    x='AgreeDisagree2',
    y='NullCount',
    palette=sns.color_palette("coolwarm", 5),
    showfliers=False
)

plt.title('Number of null entries by competitiveness')
plt.show()

Nope.

## Decision Tree

Let's see if we can create a simple decision tree classifier to predict the competitiveness of a developer based on selected survey questions.

### Implementation Detail

I really don't like that sklearn's tree-based classifiers can't handle categories as-is. Fortunately, the categories I'll be selecting have an inherent order that makes sense, so treating an encoded column as an integer might be fine. We have to be careful, though, to use the categorical ordering I have defined above, because `LabelEncoder` sorts the labels alphabetically and would produce an unwanted ordering.

A problem I encountered is where to place `nan` values in the ordering. I decided to just drop rows with `nan` values to avoid the problem. This reduces the dataset from 98855 to 62150 rows (dropping 36705), which looks acceptable to me. Fortunately, the small subset of columns I am using doesn't have too many `nan` values. If I were to do the same to the whole dataset, I'd be left with only 6 rows!
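The ordering concern can be seen on the `SkipMeals` categories. This sketch avoids scikit-learn and imitates alphabetical label numbering with `sorted`, which is an assumption about how a sorting encoder would number these labels; `answers`, `alphabetical`, and `ordinal` are illustrative names.

```python
import pandas as pd

# The SkipMeals answers in their meaningful order.
order = [
    'Never',
    '1 - 2 times per week',
    '3 - 4 times per week',
    'Daily or almost every day',
]
answers = pd.Series(['Never', 'Daily or almost every day', '1 - 2 times per week'])

# Alphabetical numbering: 'Never' sorts after the other labels, so it
# gets the largest code even though it is the smallest amount.
alphabetical = {label: code for code, label in enumerate(sorted(order))}
print(answers.map(alphabetical).tolist())

# Numbering that follows the declared order: 'Never' becomes 0.
ordinal = pd.Categorical(answers, categories=order, ordered=True).codes
print(ordinal.tolist())
```

With the alphabetical codes, a split such as "code <= 1" would mix unrelated answers, while the ordinal codes let a tree split between "rarely" and "often".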

In [134]:
survey_clf = survey[[
    'AgreeDisagree2',
    'Age',
    'SurveyEasy',
    'NumberMonitors',
    'SkipMeals',
    'HoursOutside',
    'EthicalImplications'
]].dropna()

# Encode each categorical column as its position in the category order.
# `.cat.codes` follows the order set with reorder_categories above; rows
# with missing values (which would get code -1) were already dropped.
for column in survey_clf.select_dtypes(include=['category']):
    survey_clf[column] = survey_clf[column].cat.codes

survey_clf.head()

Fit the decision tree

In [130]:
from sklearn import tree
from sklearn.model_selection import train_test_split

X = survey_clf.drop('AgreeDisagree2', axis=1)
y = survey_clf['AgreeDisagree2']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.33,
    random_state=42
)

clf = tree.DecisionTreeClassifier(random_state=42)
clf = clf.fit(X_train, y_train)

Evaluate

In [131]:
import itertools
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.figure()
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

cnf_matrix = confusion_matrix(y_test, clf.predict(X_test))
plot_confusion_matrix(cnf_matrix, classes=survey['AgreeDisagree2'].cat.categories.tolist(), title='Confusion matrix')

plt.show()

It's not very impressive but it's something.
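To put a number on a confusion matrix, accuracy and a majority-class baseline can be read straight off it. The matrix below is hypothetical, not the one produced by the classifier above, and `accuracy`, `baseline`, and `recall` are names introduced for this sketch.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true label, columns: predicted).
cm = np.array([
    [50, 30, 20],
    [25, 60, 15],
    [10, 40, 50],
])

# Accuracy is the share of predictions that fall on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Baseline: always predict the most common true class.
baseline = cm.sum(axis=1).max() / cm.sum()

# Per-class recall is the diagonal divided by each row total.
recall = np.diag(cm) / cm.sum(axis=1)

print(f"accuracy={accuracy:.3f}, baseline={baseline:.3f}")
print(recall)
```

Comparing accuracy against the baseline shows whether the tree learned anything beyond the class frequencies.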

Visualize the tree's first 2 levels:

In [133]:
import graphviz

dot_data = tree.export_graphviz(
    clf,
    out_file=None,
    max_depth=2,
    feature_names=X.columns.tolist(),
    class_names=survey['AgreeDisagree2'].cat.categories.tolist(),
    filled=True,
    rounded=True
)
graph = graphviz.Source(dot_data)
graph

## Conclusion

Understanding our more competitive peers is an important part of understanding our community as a whole. The findings in the Ethics section of this notebook are a little troubling and are something companies should keep in mind when forming teams. We've also uncovered their career goals and their openness to new opportunities. And curiously, we noticed that our underrepresented LGBT peers are less likely to see other developers as competition.

That said, while some of us see other developers as competition, that doesn't affect the sense of kinship we feel with one another. :)

In [17]:
# keyword arguments: recent pandas versions no longer accept them positionally
df = (
    survey.groupby(['AgreeDisagree1', 'AgreeDisagree2'])
    .size()
    .reset_index(name="Count")
    .pivot(index="AgreeDisagree1", columns="AgreeDisagree2", values="Count")
)

ax = sns.heatmap(df, annot=True, fmt="d")
ax.set(
    title="Competitiveness and Kinship",
    xlabel="I think of myself as competing with my peers",
    ylabel="I feel a sense of kinship or connection to other developers"
)
plt.show()
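The count table behind the heatmap can also be built with `pd.crosstab`. This is a small sketch on invented answers, meant only to show that the two routes give the same table here; `toy`, `by_groupby`, and `by_crosstab` are illustrative names.

```python
import pandas as pd

# Hypothetical answers to the kinship and competition statements.
toy = pd.DataFrame({
    'AgreeDisagree1': ['Agree', 'Agree', 'Disagree', 'Agree'],
    'AgreeDisagree2': ['Agree', 'Disagree', 'Disagree', 'Disagree'],
})

# Route 1: group, count, and spread the second key into columns.
by_groupby = (
    toy.groupby(['AgreeDisagree1', 'AgreeDisagree2'])
    .size()
    .unstack(fill_value=0)
)

# Route 2: let crosstab do the counting in one call.
by_crosstab = pd.crosstab(toy['AgreeDisagree1'], toy['AgreeDisagree2'])

print(by_groupby)
print(by_crosstab)
```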