# Exploration and Analysis of Stack Overflow's Annual Developer Surveys from 2017 to 2019: Pre-Analysis

In this series of notebooks, we examine the tools that developers use, based on Stack Overflow's annual developer surveys from 2017 to 2019.

## Finding Common Data Attributes

Before analyzing the actual survey data, we must identify the questions whose responses can be compared across all three surveys. We'll store the corresponding questions and column names in a dataframe, which we'll later use to extract the data of interest from each survey.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# Load Schema data
dataset_dir = 'datasets/developer_survey'

developer_surveys = {}
for year in range(2017, 2020):
    path = f'{dataset_dir}/{year}/survey_results_schema.csv'
    developer_surveys[year] = pd.read_csv(path)

developer_surveys[2019].head(10)

Unnamed: 0,Column,QuestionText
0,Respondent,Randomized respondent ID number (not in order ...
1,MainBranch,Which of the following options best describes ...
2,Hobbyist,Do you code as a hobby?
3,OpenSourcer,How often do you contribute to open source?
4,OpenSource,How do you feel about the quality of open sour...
5,Employment,Which of the following best describes your cur...
6,Country,In which country do you currently reside?
7,Student,"Are you currently enrolled in a formal, degree..."
8,EdLevel,Which of the following best describes the high...
9,UndergradMajor,What was your main or most important field of ...


### Cleaning Survey Dataframes

In [2]:
def show_column_names():
    print('Column names')
    for year in developer_surveys:
        column_names = developer_surveys[year].columns.to_list()
        delimiter = ', '
        print(f'Survey Schema {year}: {delimiter.join(column_names)}')

show_column_names()

new_column_names = {
    'Question': 'questionText',
    'QuestionText': 'questionText',
    'Column': 'column'
}

for year in developer_surveys:
    developer_surveys[year] = developer_surveys[year].rename(new_column_names, axis=1)

Column names
Survey Schema 2017: Column, Question
Survey Schema 2018: Column, QuestionText
Survey Schema 2019: Column, QuestionText


The cell above renames the `Question` column to `questionText` so that all survey schemas share the same column names, and converts the remaining schema column names to camel case.

In [3]:
for year in developer_surveys:
    print(f'Survey Schema {year}')
    print(developer_surveys[year].isnull().sum(axis=0), '\n')

Survey Schema 2017
column          0
questionText    0
dtype: int64 

Survey Schema 2018
column          0
questionText    0
dtype: int64 

Survey Schema 2019
column          0
questionText    0
dtype: int64 



There's no null data in any of the dataframes.

### Finding Common Data

In [4]:
# Add the survey year to all dataframes and combine them into a single dataframe
for year in developer_surveys:
    developer_surveys[year]['year'] = year

combined_survey_schemas = pd.concat(
    [developer_surveys[year] for year in developer_surveys]).reset_index(drop=True)

combined_survey_schemas['column'].value_counts().sort_values(ascending=False)

Gender                    3
Respondent                3
Country                   3
Methodology               2
Dependents                2
                         ..
StackOverflowJobSearch    1
WantWorkDatabase          1
WorkChallenge             1
AssessJobProjects         1
WelcomeChange             1
Name: column, Length: 334, dtype: int64

In [5]:
column_names = ['Country', 'Gender']
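# Copy the selection so that later column assignments do not write to a slice of the combined dataframe
survey_attributes = combined_survey_schemas[combined_survey_schemas['column'].isin(column_names)].copy()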

Only three columns appear under the same name in every survey: `Respondent`, `Country`, and `Gender`. Because `Respondent` is simply a random identifier, there's no need to include it in our analysis. At first glance, it seems that we may have to limit our analysis to two attributes, `Country` and `Gender`. It is more likely, however, that several columns capture the same type of data under different names, so let's find some of those columns.
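One way to see which column names are shared and which appear in only some years is to cross-tabulate column names against survey years. The sketch below is a minimal illustration on a hypothetical `toy_schemas` frame (a small stand-in for `combined_survey_schemas`, not the notebook's actual data); the names `presence` and `partial` are introduced here for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for combined_survey_schemas, using a few column names from the schemas
toy_schemas = pd.DataFrame({
    'column': ['Country', 'HaveWorkedPlatform',
               'Country', 'PlatformWorkedWith',
               'Country', 'PlatformWorkedWith'],
    'year': [2017, 2017, 2018, 2018, 2019, 2019],
})

# One row per column name, one column per survey year; 1 means the name appears in that year
presence = pd.crosstab(toy_schemas['column'], toy_schemas['year'])

# Names missing from at least one year are candidates for questions that were renamed
partial = presence[(presence == 0).any(axis=1)]
print(partial)
```

Names that show up in `partial` still need a manual look at the question text, since a different name does not guarantee that two questions ask for the same thing.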

In [6]:
survey_attributes['category'] = survey_attributes['column'].str.lower()
survey_attributes

Unnamed: 0,column,questionText,year,category
3,Country,In which country do you currently live?,2017,country
145,Gender,Which of the following do you currently identi...,2017,gender
157,Country,In which country do you currently reside?,2018,country
274,Gender,Which of the following do you currently identi...,2018,gender
289,Country,In which country do you currently reside?,2019,country
361,Gender,Which of the following do you currently identi...,2019,gender


Adding the category column will allow us to identify common data across all surveys.
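To illustrate why a shared `category` label is useful, the sketch below pivots a small hypothetical frame so that each category maps to the column name used in each survey year. The frame `toy_attributes` and the name `lookup` are assumptions made for this example; the column names are taken from the schemas, and the `employment_status` label is assigned by hand here because the underlying column names differ between years.

```python
import pandas as pd

# Hypothetical stand-in for survey_attributes with an explicit category per row
toy_attributes = pd.DataFrame({
    'column': ['Country', 'EmploymentStatus',
               'Country', 'Employment',
               'Country', 'Employment'],
    'year': [2017, 2017, 2018, 2018, 2019, 2019],
    'category': ['country', 'employment_status',
                 'country', 'employment_status',
                 'country', 'employment_status'],
})

# Rows are categories, columns are survey years, values are the survey column names
lookup = toy_attributes.pivot(index='category', columns='year', values='column')
print(lookup)
```

With this layout, `lookup.loc['employment_status', 2017]` returns the name of the 2017 column, regardless of what the same question is called in other years.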

### Finding Platform Data

In [7]:
def search_df(df, cols, search_word):
    # Case-insensitive match; a row is kept if any of the given columns contains search_word
    mask = pd.Series(False, index=df.index)
    for col in cols:
        mask = mask | df[col].str.lower().str.contains(search_word, na=False)
    return df[mask]

def search_survey_schemas(search_word, category=None):
    cols = ['column', 'questionText']
    # Copy the matches so that adding a column does not write to a slice of the combined dataframe
    results = search_df(combined_survey_schemas, cols, search_word).copy()
    if category:
        results['category'] = category
    return results

platform_questions = search_survey_schemas('platform', category='platform')
platform_questions

Unnamed: 0,column,questionText,year,category
16,MobileDeveloperType,For which of the following platforms do you de...,2017,platform
94,HaveWorkedPlatform,Which of the following platforms have you done...,2017,platform
95,WantWorkPlatform,Which of the following platforms have you done...,2017,platform
124,InfluenceRecruitment,How much influence do you have on purchasing d...,2017,platform
223,PlatformWorkedWith,Which of the following platforms have you done...,2018,platform
224,PlatformDesireNextYear,Which of the following platforms have you done...,2018,platform
265,HypotheticalTools3,Please rate your interest in participating in ...,2018,platform
330,PlatformWorkedWith,Which of the following platforms have you done...,2019,platform
331,PlatformDesireNextYear,Which of the following platforms have you done...,2019,platform


In [8]:
def show_survey_questions(questions_df, indexes):
    for index, row in questions_df.loc[indexes].iterrows():
        survey_year_col = f'Survey {row["year"]} (column: {row["column"]})'
        print(survey_year_col)
        print(row['questionText'], '\n')
        
platform_ixs = [94, 223, 330]
show_survey_questions(platform_questions, platform_ixs)

Survey 2017 (column: HaveWorkedPlatform)
Which of the following platforms have you done extensive development work for over the past year, and which do you want to work on over the next year? 

Survey 2018 (column: PlatformWorkedWith)
Which of the following platforms have you done extensive development work for over the past year?   (If you both developed for the platform and want to continue to do so, please check both boxes in that row.) 

Survey 2019 (column: PlatformWorkedWith)
Which of the following platforms have you done extensive development work for over the past year?   (If you both developed for the platform and want to continue to do so, please check both boxes in that row.) 



Now let's add the corresponding rows to `survey_attributes`.

In [9]:
selected_platform_questions = platform_questions.loc[platform_ixs,:]
survey_attributes = pd.concat([survey_attributes, selected_platform_questions])

### Finding Programming Language Data

In [10]:
language_questions = search_survey_schemas('language', 'language')
language_questions

Unnamed: 0,column,questionText,year,category
53,AssessJobTech,When you're assessing potential jobs to apply ...,2017,language
88,HaveWorkedLanguage,Which of the following languages have you done...,2017,language
89,WantWorkLanguage,Which of the following languages have you done...,2017,language
174,AssessJob4,Imagine that you are assessing a potential job...,2018,language
219,LanguageWorkedWith,"Which of the following programming, scripting,...",2018,language
220,LanguageDesireNextYear,"Which of the following programming, scripting,...",2018,language
246,AdsPriorities7,Please rank the following advertising qualitie...,2018,language
326,LanguageWorkedWith,"Which of the following programming, scripting,...",2019,language
327,LanguageDesireNextYear,"Which of the following programming, scripting,...",2019,language


In [11]:
language_ixs = [88, 219, 326]
show_survey_questions(language_questions, language_ixs)

Survey 2017 (column: HaveWorkedLanguage)
Which of the following languages have you done extensive development work in over the past year, and which do you want to work in over the next year? 

Survey 2018 (column: LanguageWorkedWith)
Which of the following programming, scripting, and markup languages have you done extensive development work in over the past year, and which do you want to work in over the next year?  (If you both worked with the language and want to continue to do so, please check both boxes in that row.) 

Survey 2019 (column: LanguageWorkedWith)
Which of the following programming, scripting, and markup languages have you done extensive development work in over the past year, and which do you want to work in over the next year?  (If you both worked with the language and want to continue to do so, please check both boxes in that row.) 



Let's add the matching rows to `survey_attributes`.

In [12]:
selected_language_questions = language_questions.loc[language_ixs,:]
survey_attributes = pd.concat([survey_attributes, selected_language_questions])

### Finding Database Data

In [13]:
database_questions = search_survey_schemas('database', category='database')
database_questions

Unnamed: 0,column,questionText,year,category
92,HaveWorkedDatabase,Which of the following database technologies h...,2017,database
93,WantWorkDatabase,Which of the following database technologies h...,2017,database
121,InfluenceDatabase,How much influence do you have on purchasing d...,2017,database
221,DatabaseWorkedWith,Which of the following database environments h...,2018,database
222,DatabaseDesireNextYear,Which of the following database environments h...,2018,database
324,PurchaseHow,How does your company make decisions about pur...,2019,database
328,DatabaseWorkedWith,Which of the following database environments h...,2019,database
329,DatabaseDesireNextYear,Which of the following database environments h...,2019,database


In [14]:
column_names = ['HaveWorkedDatabase', 'DatabaseWorkedWith']

selected_database_questions = database_questions[database_questions['column'].isin(column_names)]
survey_attributes = pd.concat([survey_attributes, selected_database_questions])

### Finding Employment Data

In [15]:
employment_questions = search_survey_schemas('employment', category='employment_status')
employment_questions

Unnamed: 0,column,questionText,year,category
5,EmploymentStatus,Which of the following best describes your cur...,2017,employment_status
159,Employment,Which of the following best describes your cur...,2018,employment_status
288,Employment,Which of the following best describes your cur...,2019,employment_status


In [16]:
survey_attributes = pd.concat([survey_attributes, employment_questions])

### Finding Salary Data

In [17]:
salary_questions = search_survey_schemas('salary', category='gross_salary')
salary_questions

Unnamed: 0,column,questionText,year,category
64,ImportantBenefits,"When it comes to compensation and benefits, ot...",2017,gross_salary
152,Salary,"What is your current annual base salary, befor...",2017,gross_salary
153,ExpectedSalary,You said before that you are currently learnin...,2017,gross_salary
181,AssessBenefits1,"Now, imagine you are assessing a job's benefit...",2018,gross_salary
206,Salary,What is your current gross salary (before taxe...,2018,gross_salary
207,SalaryType,"Is that salary weekly, monthly, or yearly?",2018,gross_salary
208,ConvertedSalary,Salary converted to annual USD salaries using ...,2018,gross_salary
312,CompTotal,What is your current total compensation (salar...,2019,gross_salary
314,ConvertedComp,Salary converted to annual USD salaries using ...,2019,gross_salary


In [18]:
salary_ixs = [152, 206, 312]
show_survey_questions(salary_questions, salary_ixs)

Survey 2017 (column: Salary)
What is your current annual base salary, before taxes, and excluding bonuses, grants, or other compensation? 

Survey 2018 (column: Salary)
What is your current gross salary (before taxes and deductions), in ${q://QID50/ChoiceGroup/SelectedChoicesTextEntry}? Please enter a whole number in the box below, without any punctuation. If you are paid hourly, please estimate an equivalent weekly, monthly, or yearly salary. If you prefer not to answer, please leave the box empty. 

Survey 2019 (column: CompTotal)
What is your current total compensation (salary, bonuses, and perks, before taxes and deductions), in `CurrencySymbol`? Please enter a whole number in the box below, without any punctuation. If you are paid hourly, please estimate an equivalent weekly, monthly, or yearly salary. If you prefer not to answer, please leave the box empty. 



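We'll select these questions because each asks for pay before taxes, although they are not identical: the 2017 question asks for annual base salary excluding bonuses, the 2018 question accepts a weekly, monthly, or yearly figure, and the 2019 question asks for total compensation including bonuses and perks. We'll need to keep these differences in mind when comparing years. Because our attention is limited to developers in the United States, we won't have to worry about converting between different currencies.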
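Since the 2018 schema pairs `Salary` with a `SalaryType` question ("Is that salary weekly, monthly, or yearly?"), the raw 2018 figures would need to be put on an annual basis before they can be compared with other years. The following is a minimal sketch on made-up numbers. It assumes that `SalaryType` holds the labels `Weekly`, `Monthly`, and `Yearly` and that `Salary` is already numeric; both assumptions should be checked against the actual survey file.

```python
import pandas as pd

# Made-up rows for illustration; the SalaryType labels are an assumption
toy_2018 = pd.DataFrame({
    'Salary': [2000.0, 8000.0, 95000.0],
    'SalaryType': ['Weekly', 'Monthly', 'Yearly'],
})

# Number of pay periods in a year for each reporting frequency
periods_per_year = {'Weekly': 52, 'Monthly': 12, 'Yearly': 1}

# Multiply each reported figure by its pay periods per year
toy_2018['annual_salary'] = toy_2018['Salary'] * toy_2018['SalaryType'].map(periods_per_year)
print(toy_2018)
```

Rows whose `SalaryType` is missing or not in the mapping end up with a missing `annual_salary`, which makes them easy to inspect separately.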

In [19]:
selected_salary_questions = salary_questions.loc[salary_ixs,:]
survey_attributes = pd.concat([survey_attributes, selected_salary_questions])

### Finding Experience Data

In [20]:
experience_questions = search_survey_schemas('years', category='experience')
experience_questions

Unnamed: 0,column,questionText,year,category
11,YearsProgram,How long has it been since you first learned h...,2017,experience
12,YearsCodedJob,For how many years have you coded as part of y...,2017,experience
13,YearsCodedJobPast,For how many years did you code as part of you...,2017,experience
23,ExCoder10Years,You said before that you used to code as part ...,2017,experience
86,CousinEducation,Let's pretend you have a distant cousin. They ...,2017,experience
164,YearsCoding,"Including any education, for how many years ha...",2018,experience
165,YearsCodingProf,For how many years have you coded professional...,2018,experience
168,HopeFiveYears,Which of the following best describes what you...,2018,experience
211,TimeFullyProductive,Suppose a new developer with four years of exp...,2018,experience
296,YearsCode,"Including any education, how many years have y...",2019,experience


In [21]:
experience_ixs = [12, 165, 298]
show_survey_questions(experience_questions, experience_ixs)

Survey 2017 (column: YearsCodedJob)
For how many years have you coded as part of your job? 

Survey 2018 (column: YearsCodingProf)
For how many years have you coded professionally (as a part of your work)? 

Survey 2019 (column: YearsCodePro)
How many years have you coded professionally (as a part of your work)? 



Although the 2017 question is worded differently from the 2018 and 2019 questions, it asks for essentially the same information: years of professional coding experience. We'll still make sure to note the difference as we analyze this data.

In [22]:
selected_experience_questions = experience_questions.loc[experience_ixs,:]
survey_attributes = pd.concat([survey_attributes, selected_experience_questions])

## Exporting Data

Now we will export the dataframe to a CSV file, which will be used to extract and analyze data in subsequent notebooks.

In [23]:
survey_attributes = survey_attributes.reset_index(drop=True)
survey_attributes.head(10)

Unnamed: 0,column,questionText,year,category
0,Country,In which country do you currently live?,2017,country
1,Gender,Which of the following do you currently identi...,2017,gender
2,Country,In which country do you currently reside?,2018,country
3,Gender,Which of the following do you currently identi...,2018,gender
4,Country,In which country do you currently reside?,2019,country
5,Gender,Which of the following do you currently identi...,2019,gender
6,HaveWorkedPlatform,Which of the following platforms have you done...,2017,platform
7,PlatformWorkedWith,Which of the following platforms have you done...,2018,platform
8,PlatformWorkedWith,Which of the following platforms have you done...,2019,platform
9,HaveWorkedLanguage,Which of the following languages have you done...,2017,language


In [24]:
survey_attributes.to_csv('survey_attributes.csv', index=False)
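A later notebook could use the exported file roughly as follows: read the attribute table, pick the rows for one survey year, and load only those columns from that year's results, renaming them to their shared category names. This is a sketch under stated assumptions, not the series' actual extraction code. The in-memory CSVs stand in for `survey_attributes.csv` and a survey results file, and the response rows are invented for illustration.

```python
import io
import pandas as pd

# Stand-in for survey_attributes.csv (question text omitted for brevity)
attributes_csv = io.StringIO(
    "column,questionText,year,category\n"
    "Country,(omitted),2017,country\n"
    "EmploymentStatus,(omitted),2017,employment_status\n"
    "Country,(omitted),2018,country\n"
)

# Stand-in for the 2017 survey results; these rows are invented
results_2017_csv = io.StringIO(
    "Respondent,Country,EmploymentStatus\n"
    "1,United States,Employed full-time\n"
    "2,Canada,Employed part-time\n"
)

attributes = pd.read_csv(attributes_csv)

# Map each 2017 column name to its shared category name
year_attrs = attributes[attributes['year'] == 2017]
rename_map = dict(zip(year_attrs['column'], year_attrs['category']))

# Load only the selected columns and rename them to the category names
survey_2017 = pd.read_csv(results_2017_csv, usecols=list(rename_map)).rename(columns=rename_map)
print(survey_2017.columns.to_list())
```

Repeating this per year yields dataframes with identical column names, which can then be concatenated for cross-year analysis.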