First, read in the main Stack Overflow survey dataset.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
%matplotlib inline

df = pd.read_csv('./survey_results_public.csv')
df.head()

Unnamed: 0,Respondent,Professional,ProgramHobby,Country,University,EmploymentStatus,FormalEducation,MajorUndergrad,HomeRemote,CompanySize,...,StackOverflowMakeMoney,Gender,HighestEducationParents,Race,SurveyLong,QuestionsInteresting,QuestionsConfusing,InterestedAnswers,Salary,ExpectedSalary
0,1,Student,"Yes, both",United States,No,"Not employed, and not looking for work",Secondary school,,,,...,Strongly disagree,Male,High school,White or of European descent,Strongly disagree,Strongly agree,Disagree,Strongly agree,,
1,2,Student,"Yes, both",United Kingdom,"Yes, full-time",Employed part-time,Some college/university study without earning ...,Computer science or software engineering,"More than half, but not all, the time",20 to 99 employees,...,Strongly disagree,Male,A master's degree,White or of European descent,Somewhat agree,Somewhat agree,Disagree,Strongly agree,,37500.0
2,3,Professional developer,"Yes, both",United Kingdom,No,Employed full-time,Bachelor's degree,Computer science or software engineering,"Less than half the time, but at least one day ...","10,000 or more employees",...,Disagree,Male,A professional degree,White or of European descent,Somewhat agree,Agree,Disagree,Agree,113750.0,
3,4,Professional non-developer who sometimes write...,"Yes, both",United States,No,Employed full-time,Doctoral degree,A non-computer-focused engineering discipline,"Less than half the time, but at least one day ...","10,000 or more employees",...,Disagree,Male,A doctoral degree,White or of European descent,Agree,Agree,Somewhat agree,Strongly agree,,
4,5,Professional developer,"Yes, I program as a hobby",Switzerland,No,Employed full-time,Master's degree,Computer science or software engineering,Never,10 to 19 employees,...,,,,,,,,,,


The dataset is large, with many columns. To understand the data in each column, I need to refer to the schema file.

In [2]:
schema = pd.read_csv('./survey_results_schema.csv')
schema.head()

Unnamed: 0,Column,Question
0,Respondent,Respondent ID number
1,Professional,Which of the following best describes you?
2,ProgramHobby,Do you program as a hobby or contribute to ope...
3,Country,In which country do you currently live?
4,University,"Are you currently enrolled in a formal, degree..."


1st question: Which programming languages were most popular this year? I want to know which columns contain information about languages.

In [3]:
print(schema[schema['Column'].str.contains("Language")])

                Column                                           Question
88  HaveWorkedLanguage  Which of the following languages have you done...
89    WantWorkLanguage  Which of the following languages have you done...
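
The question text in the printed output is cut off, so both columns look identical. A minimal sketch of one way to read the full text of every matching question is shown here. The `toy_schema` frame and its placeholder question strings are assumptions for illustration only; on the real data the function would be called with `schema`.

```python
import pandas as pd

# Hypothetical stand-in for the schema file; the question strings are placeholders,
# not the real survey wording.
toy_schema = pd.DataFrame({
    "Column": ["Respondent", "HaveWorkedLanguage", "WantWorkLanguage"],
    "Question": [
        "Respondent ID number",
        "placeholder question text A",
        "placeholder question text B",
    ],
})

def questions_matching(schema_frame, keyword):
    """Return {column name: full question text} for columns whose name contains keyword."""
    mask = schema_frame["Column"].str.contains(keyword, regex=False)
    return schema_frame.loc[mask].set_index("Column")["Question"].to_dict()

language_questions = questions_matching(toy_schema, "Language")
for column, question in language_questions.items():
    # Printing each string directly avoids pandas' column-width truncation
    print(column, "->", question)
```

On the real data, `questions_matching(schema, "Language")` should return the untruncated text for both language columns.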


So the programming language data is in 2 columns: "HaveWorkedLanguage" and "WantWorkLanguage".

In [4]:
df.WantWorkLanguage.value_counts()

JavaScript                                              642
Python                                                  556
Java                                                    544
C#                                                      475
C#; JavaScript; SQL                                     444
                                                       ... 
Clojure; Go; JavaScript; Python; R; Ruby; TypeScript      1
Go; Python; Ruby; SQL; Swift                              1
C#; Java; JavaScript; Lua; Objective-C; SQL               1
Assembly; C#; Rust; Scala                                 1
Erlang; F#; Python; R                                     1
Name: WantWorkLanguage, Length: 11239, dtype: int64

In [None]:
val_wantlanguage = df.WantWorkLanguage.value_counts()
# Plot only the 10 most common answers; the full series has 11239 distinct answers
(val_wantlanguage.head(10)/df.shape[0]).plot(kind="bar");
plt.title("What language do you want to work with?");

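So the most common single answer for the language developers want to work with is JavaScript. Note that value_counts() counts whole answer strings (combinations of languages) here, not individual languages.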
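
Because each answer is a "; "-separated list, counting individual languages requires splitting the answers first. The following is a minimal sketch of that idea, not part of the original analysis. The `sample` series is hypothetical toy data standing in for `df['WantWorkLanguage']`, and `count_languages` is an illustrative helper name.

```python
import pandas as pd

# Hypothetical toy data; the real column holds "; "-separated lists
# and NaN for respondents who skipped the question.
sample = pd.Series(
    ["JavaScript", "C#; JavaScript; SQL", "Python", "Python; SQL", None],
    name="WantWorkLanguage",
)

def count_languages(col):
    """Count how many respondents mention each individual language."""
    return (
        col.dropna()          # ignore respondents who did not answer
        .str.split(";")       # one list of languages per respondent
        .explode()            # one row per (respondent, language)
        .str.strip()          # remove the space left after each ";"
        .value_counts()
    )

language_counts = count_languages(sample)
print(language_counts)
```

On the real data, `count_languages(df['WantWorkLanguage']).head(10).plot(kind="bar")` would give a readable chart of the ten most mentioned languages.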

In [10]:
df.HaveWorkedLanguage.value_counts()

C#; JavaScript; SQL                                      1276
JavaScript; PHP; SQL                                     1143
Java                                                      913
JavaScript                                                807
JavaScript; PHP                                           662
                                                         ... 
C; C++; JavaScript; Matlab; Python; R; TypeScript           1
C++; Go; Java; JavaScript; Perl; Python                     1
C; C++; JavaScript; Lua; Perl; PHP; Python; Ruby; SQL       1
Assembly; C; C++; C#; Matlab; VB.NET                        1
C++; Python; R; Ruby; Scala; SQL                            1
Name: HaveWorkedLanguage, Length: 8438, dtype: int64

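JavaScript also appears in 4 of the 5 most common answers for languages developers have worked with, although the single most common answer is the combination "C#; JavaScript; SQL".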

In [15]:
df['HaveWorkedLanguage'].str.contains("JavaScript").value_counts()/df.shape[0]

True     0.445108
False    0.267551
Name: HaveWorkedLanguage, dtype: float64

In [16]:
df['WantWorkLanguage'].str.contains("JavaScript").value_counts()/df.shape[0]

False    0.339430
True     0.317695
Name: WantWorkLanguage, dtype: float64

The above results also show how popular JavaScript is. Among respondents who answered HaveWorkedLanguage, about 62% (0.445 / (0.445 + 0.268)), or nearly 2/3, have worked with JavaScript. Among those who answered WantWorkLanguage, about 48% (0.318 / (0.318 + 0.339)), or just under 1/2, want to work with it. The True and False shares do not sum to 1 because respondents who skipped the question are NaN and are left out of value_counts().
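
The share among respondents who actually answered can be computed directly instead of comparing the True and False rows by eye. This is a minimal sketch under assumptions: `sample` is hypothetical toy data, `share_mentioning` is an illustrative helper, and the two ratios at the end simply redo the arithmetic on the fractions printed in cells 15 and 16.

```python
import pandas as pd

# Hypothetical toy answers; None marks a respondent who skipped the question
sample = pd.Series(["JavaScript; SQL", "Python", "C#; JavaScript", None, None])

def share_mentioning(col, language):
    """Fraction of respondents who answered the question and mention `language`."""
    answered = col.dropna()
    # regex=False treats the language name as plain text (matters for names like "C++")
    return answered.str.contains(language, regex=False).mean()

share = share_mentioning(sample, "JavaScript")
print(round(share, 3))

# Same calculation from the printed fractions: True / (True + False)
js_have = 0.445108 / (0.445108 + 0.267551)   # HaveWorkedLanguage, about 0.62
js_want = 0.317695 / (0.317695 + 0.339430)   # WantWorkLanguage, about 0.48
print(round(js_have, 2), round(js_want, 2))
```

One caveat with substring matching: a search for "Java" would also match "JavaScript", so a split-based comparison is safer for language names that are prefixes of others.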

2nd question: Who has the higher salary, developers who have worked with the most popular language (JavaScript) or developers who have not?

In [18]:
df[df['HaveWorkedLanguage'].str.contains("JavaScript")==True]['Salary'].mean()

56396.92270639794

In [19]:
df[df['HaveWorkedLanguage'].str.contains("JavaScript")==False]['Salary'].mean()

58155.64831958471

In [20]:
df[df['WantWorkLanguage'].str.contains("JavaScript")==True]['Salary'].mean()

55493.912634418375

In [21]:
df[df['WantWorkLanguage'].str.contains("JavaScript")==False]['Salary'].mean()

58493.89933337937

So developers who have worked with JavaScript, or who want to work with it, have a lower mean salary than developers who did not list JavaScript (about 56,397 vs 58,156 for HaveWorkedLanguage, and about 55,494 vs 58,494 for WantWorkLanguage).
One possible explanation: because JavaScript is the most popular language, knowing it does not set a developer apart, so it may not help in getting a higher salary. To earn more, developers may need to master more difficult languages.
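
The four separate mean() calls can be folded into one grouped summary that also reports the median and the number of salaries behind each mean, which helps judge how much weight the difference can bear. This is a minimal sketch on hypothetical toy data; `toy` and `salary_by_language_use` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy rows; None marks a missing answer or a missing salary
toy = pd.DataFrame({
    "HaveWorkedLanguage": ["JavaScript; SQL", "Python", "C#; JavaScript", None, "Java"],
    "Salary": [50000.0, 60000.0, 40000.0, 70000.0, None],
})

def salary_by_language_use(frame, column, language):
    """Mean, median and count of Salary for respondents with and without `language`."""
    answered = frame.dropna(subset=[column])            # keep only those who answered
    uses = answered[column].str.contains(language, regex=False)
    label = uses.map({True: "with " + language, False: "without " + language})
    # "count" is the number of non-missing salaries in each group
    return answered.groupby(label)["Salary"].agg(["mean", "median", "count"])

summary = salary_by_language_use(toy, "HaveWorkedLanguage", "JavaScript")
print(summary)
```

On the real data the call would be `salary_by_language_use(df, "HaveWorkedLanguage", "JavaScript")`, and the same function works for "WantWorkLanguage".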

In [22]:
df.groupby(['HaveWorkedLanguage'])['Salary'].mean()

HaveWorkedLanguage
Assembly                                                                                                                                101809.954751
Assembly; C                                                                                                                              26382.440394
Assembly; C#                                                                                                                             37276.397262
Assembly; C#; Clojure                                                                                                                             NaN
Assembly; C#; CoffeeScript; Dart; Go; Haskell; Java; JavaScript; Lua; Matlab; PHP; Python; R; Ruby; Rust; Smalltalk; SQL; TypeScript              NaN
                                                                                                                                            ...      
VB.NET; VBA; Visual Basic 6                                                      

In [23]:
df.groupby(['WantWorkLanguage'])['Salary'].mean()

WantWorkLanguage
Assembly                                                                                                                                                    56670.324150
Assembly; C                                                                                                                                                 58582.312455
Assembly; C#                                                                                                                                                55000.000000
Assembly; C#; Clojure; Dart; Elixir; Erlang; F#; Groovy; Hack; Julia; Lua; Matlab; Objective-C; Rust; Scala; Smalltalk; TypeScript; VBA; Visual Basic 6              NaN
Assembly; C#; Clojure; Dart; Groovy; Ruby; Scala                                                                                                                     NaN
                                                                                                                                          

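These results give some support to my conclusion that higher salaries come with more difficult languages: for example, respondents who listed only Assembly in HaveWorkedLanguage have a mean salary of about 101,810. However, the groupby output is ordered alphabetically by language combination, not by mean salary, so Assembly, C and C# appear at the top only because of their names, and some combinations nearby have much lower means (Assembly; C is about 26,382) or no salary data at all (NaN). To rank languages by salary, the means still need to be sorted by value.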
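
To actually rank languages by salary, one option is to split the answers into individual languages, average the salary per language, and sort by that mean. This is a minimal sketch on hypothetical toy data, not a result from the survey. `toy`, `mean_salary_per_language` and the `min_count` parameter are assumptions introduced for illustration.

```python
import pandas as pd

# Hypothetical toy rows standing in for df[['HaveWorkedLanguage', 'Salary']]
toy = pd.DataFrame({
    "HaveWorkedLanguage": ["Assembly; C", "C; Python", "Python", "JavaScript", None],
    "Salary": [30000.0, 50000.0, 70000.0, 40000.0, 90000.0],
})

def mean_salary_per_language(frame, column, min_count=1):
    """Mean salary per individual language, sorted from highest to lowest."""
    rows = frame.dropna(subset=[column, "Salary"]).copy()
    rows["Language"] = rows[column].str.split(";")
    rows = rows.explode("Language")                  # one row per (respondent, language)
    rows["Language"] = rows["Language"].str.strip()
    stats = rows.groupby("Language")["Salary"].agg(["mean", "count"])
    # Drop languages backed by too few salaries, then sort by mean salary
    return stats[stats["count"] >= min_count].sort_values("mean", ascending=False)

ranking = mean_salary_per_language(toy, "HaveWorkedLanguage")
print(ranking)
```

Raising `min_count` on the real data would filter out languages whose mean rests on only a handful of respondents, which are the groups most likely to produce extreme values.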