## Pandas

The dataset is the Stack Overflow Developer Survey 2023, available at https://insights.stackoverflow.com/survey

Reading the Dataset using Pandas

In [1]:
import pandas as pd
df = pd.read_csv("survey_results_public.csv")
df.head(3)

Unnamed: 0,ResponseId,Q120,MainBranch,Age,Employment,RemoteWork,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,Frequency_1,Frequency_2,Frequency_3,TimeSearching,TimeAnswering,ProfessionalTech,Industry,SurveyLength,SurveyEase,ConvertedCompYearly
0,1,I agree,None of these,18-24 years old,,,,,,,...,,,,,,,,,,
1,2,I agree,I am a developer by profession,25-34 years old,"Employed, full-time",Remote,Hobby;Contribute to open-source projects;Boots...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;Friend or fam...,Formal documentation provided by the owner of ...,...,1-2 times a week,10+ times a week,Never,15-30 minutes a day,15-30 minutes a day,DevOps function;Microservices;Automated testin...,"Information Services, IT, Software Development...",Appropriate in length,Easy,285000.0
2,3,I agree,I am a developer by profession,45-54 years old,"Employed, full-time","Hybrid (some remote, some in-person)",Hobby;Professional development or self-paced l...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Formal documentation provided by the owner of ...,...,6-10 times a week,6-10 times a week,3-5 times a week,30-60 minutes a day,30-60 minutes a day,DevOps function;Microservices;Automated testin...,"Information Services, IT, Software Development...",Appropriate in length,Easy,250000.0


In [2]:
print(df.index)

RangeIndex(start=0, stop=89184, step=1)


Finding the count of respondents

In [3]:
# 1. Count of respondents
count_respondents = df['ResponseId'].count()
print(count_respondents)

89184


Finding the count of unique respondents

In [4]:
# Another way to solve #1: count the distinct ResponseId values
df.ResponseId.nunique()

89184

Finding the count of respondents who filled in all the columns

In [5]:
count_nan_i_c = df.isna().sum() # Return count of NaN in columns
print(count_nan_i_c)

ResponseId                 0
Q120                       0
MainBranch                 0
Age                        0
Employment              1286
                       ...  
ProfessionalTech       47401
Industry               52410
SurveyLength            2699
SurveyEase              2630
ConvertedCompYearly    41165
Length: 84, dtype: int64


In [6]:
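# 2. Rough estimate of respondents who filled in all the columns.
# Note: this subtracts the NaN count of the most-skipped column from the total,
# which gives the number of answers in that single column. It is an upper bound
# on the number of fully complete rows, not the exact count.
def c_r_a_q(df):
    count_respondents_a_q = count_respondents - df.isna().sum().max()
    return count_respondents_a_q
print(c_r_a_q(df))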

2621
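
The subtraction above only looks at the single most-skipped column. As a minimal sketch of an exact count, assuming a small made-up frame (`toy` and its values are illustrative, not survey data), each row can be checked across all of its columns:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey frame (values are made up for illustration).
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "Age": ["18-24", "25-34", np.nan, "35-44"],
    "WorkExp": [np.nan, 5.0, 3.0, 10.0],
})

# A row counts as complete only if every column is non-null.
complete_rows = int(toy.notna().all(axis=1).sum())

# Equivalent: drop any row that contains a NaN and count what is left.
complete_rows_alt = len(toy.dropna())

# The "total minus worst column" shortcut, for comparison.
upper_bound = int(len(toy) - toy.isna().sum().max())

print(complete_rows, complete_rows_alt, upper_bound)
```

On the toy data the shortcut reports more rows than are really complete, because different respondents skipped different columns.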


Finding the count of respondents who answered all questions

In [7]:
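# Another way to solve #2, using survey_results_schema.csv:
# keep only the columns that correspond to actual survey questions
schema = pd.read_csv('survey_results_schema.csv')
questions = set(schema.qname.unique()) & set(df.columns)
df.dropna(subset=list(questions)).shape[0]  # pass a list; a set is not a sequence of labels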

2032

In [8]:
# What the schema DataFrame contains
schema.head()

Unnamed: 0,qid,qname,question,force_resp,type,selector
0,QID16,S0,"<div><span style=""font-size:19px;""><strong>Hel...",False,DB,TB
1,QID12,MetaInfo,Browser Meta Info,False,Meta,Browser
2,QID310,Q310,"<div><span style=""font-size:19px;""><strong>You...",False,DB,TB
3,QID312,Q120,,True,MC,SAVR
4,QID1,S1,"<span style=""font-size:22px; font-family: aria...",False,DB,TB


Finding central tendency measures of respondents' work experience

In [9]:
# 3. Central tendency measures of WorkExp
work_exp = df['WorkExp']
mean, median, mode = work_exp.mean(), work_exp.median(), work_exp.mode()
print(mean, median, mode)

11.405126322311204 9.0 0    5.0
Name: WorkExp, dtype: float64
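
The mode prints as a two-line Series because `Series.mode()` returns every tied value, not a scalar. A minimal sketch of extracting a single number, assuming made-up values in `work_exp_toy` (not survey data):

```python
import pandas as pd

# Made-up work-experience values, including one missing answer.
work_exp_toy = pd.Series([1.0, 5.0, 5.0, 9.0, 30.0, None], name="WorkExp")

mean = work_exp_toy.mean()      # NaN is skipped by default
median = work_exp_toy.median()
modes = work_exp_toy.mode()     # a Series: ties would yield several values
mode = modes.iloc[0]            # first (smallest) mode as a plain scalar

print(f"mean={mean:.2f} median={median} mode={mode}")
```

Taking `.iloc[0]` silently picks the smallest value when there are ties, so it is worth checking `len(modes)` first if that matters.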


In [10]:
df.columns  # names of all columns

Index(['ResponseId', 'Q120', 'MainBranch', 'Age', 'Employment', 'RemoteWork',
       'CodingActivities', 'EdLevel', 'LearnCode', 'LearnCodeOnline',
       'LearnCodeCoursesCert', 'YearsCode', 'YearsCodePro', 'DevType',
       'OrgSize', 'PurchaseInfluence', 'TechList', 'BuyNewTool', 'Country',
       'Currency', 'CompTotal', 'LanguageHaveWorkedWith',
       'LanguageWantToWorkWith', 'DatabaseHaveWorkedWith',
       'DatabaseWantToWorkWith', 'PlatformHaveWorkedWith',
       'PlatformWantToWorkWith', 'WebframeHaveWorkedWith',
       'WebframeWantToWorkWith', 'MiscTechHaveWorkedWith',
       'MiscTechWantToWorkWith', 'ToolsTechHaveWorkedWith',
       'ToolsTechWantToWorkWith', 'NEWCollabToolsHaveWorkedWith',
       'NEWCollabToolsWantToWorkWith', 'OpSysPersonal use',
       'OpSysProfessional use', 'OfficeStackAsyncHaveWorkedWith',
       'OfficeStackAsyncWantToWorkWith', 'OfficeStackSyncHaveWorkedWith',
       'OfficeStackSyncWantToWorkWith', 'AISearchHaveWorkedWith',
       'AISearchWan

Counting respondents who work remotely

In [11]:
# Preview of respondents who work remotely
df[
    df['RemoteWork'] == 'Remote'
].head()

Unnamed: 0,ResponseId,Q120,MainBranch,Age,Employment,RemoteWork,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,Frequency_1,Frequency_2,Frequency_3,TimeSearching,TimeAnswering,ProfessionalTech,Industry,SurveyLength,SurveyEase,ConvertedCompYearly
1,2,I agree,I am a developer by profession,25-34 years old,"Employed, full-time",Remote,Hobby;Contribute to open-source projects;Boots...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;Friend or fam...,Formal documentation provided by the owner of ...,...,1-2 times a week,10+ times a week,Never,15-30 minutes a day,15-30 minutes a day,DevOps function;Microservices;Automated testin...,"Information Services, IT, Software Development...",Appropriate in length,Easy,285000.0
4,5,I agree,I am a developer by profession,25-34 years old,"Employed, full-time;Independent contractor, fr...",Remote,Hobby;Contribute to open-source projects;Profe...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Online Courses or Certi...,Formal documentation provided by the owner of ...,...,1-2 times a week,1-2 times a week,3-5 times a week,60-120 minutes a day,30-60 minutes a day,Microservices;Automated testing;Observability ...,Other,Appropriate in length,Neither easy nor difficult,23456.0
5,6,I agree,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Hobby;Professional development or self-paced l...,Some college/university study without earning ...,Books / Physical media;Colleague;Online Course...,Formal documentation provided by the owner of ...,...,1-2 times a week,1-2 times a week,3-5 times a week,30-60 minutes a day,15-30 minutes a day,DevOps function;Microservices;Observability to...,Other,Appropriate in length,Neither easy nor difficult,96828.0
6,7,I agree,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Hobby;Contribute to open-source projects;Profe...,Some college/university study without earning ...,Friend or family member;Online Courses or Cert...,,...,1-2 times a week,3-5 times a week,1-2 times a week,Less than 15 minutes a day,15-30 minutes a day,Microservices;Automated testing;Continuous int...,"Information Services, IT, Software Development...",Appropriate in length,Easy,135000.0
7,8,I agree,I am a developer by profession,25-34 years old,"Employed, full-time",Remote,Hobby,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Online Courses or Certi...,Formal documentation provided by the owner of ...,...,,,,60-120 minutes a day,30-60 minutes a day,None of these,Financial Services,,,80000.0


In [12]:
# 4. Count of respondents who work remotely
count_remote = (df['RemoteWork'] == 'Remote').sum()
print(count_remote)

30566
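
Comparing against one label gives one count. As a sketch, `value_counts` returns the count (or share) for every work arrangement in a single call; the series below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up answers; the label strings mirror those seen in the survey output.
remote_toy = pd.Series(
    ["Remote", "In-person", "Remote", np.nan,
     "Hybrid (some remote, some in-person)"],
    name="RemoteWork",
)

counts = remote_toy.value_counts()                # NaN is excluded by default
shares = remote_toy.value_counts(normalize=True)  # fractions of non-null answers
with_missing = remote_toy.value_counts(dropna=False)  # NaN gets its own row

print(counts)
print(shares.round(2))
```

Note that `normalize=True` divides by the number of non-null answers, so the shares differ from "count divided by all respondents" whenever the column has gaps.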


Finding the percentage of respondents who code in Python

In [13]:
# Count of respondents who code in Python
count_python = df['LanguageHaveWorkedWith'].str.contains('Python').sum()
print(count_python)

43158


In [14]:
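# 5. Percentage of respondents who code in Python
def count_python_perc(df):
    # Compute from the frame passed in rather than from global variables
    python_users = df['LanguageHaveWorkedWith'].str.contains('Python').sum()
    return round(python_users / len(df) * 100, 2)
print(f"{count_python_perc(df)} %")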

48.39 %
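
`str.contains` does substring matching, which is fine for 'Python' but can overcount for names that are prefixes of others. A minimal sketch of the difference, assuming a made-up `langs` series in the same semicolon-separated format:

```python
import numpy as np
import pandas as pd

# Made-up multi-select answers, separated by ';' as in the survey column.
langs = pd.Series(
    ["Python;JavaScript", "Java;SQL", np.nan, "JavaScript"],
    name="LanguageHaveWorkedWith",
)

# Substring search: 'Java' also matches 'JavaScript'.
substring_hits = int(langs.str.contains("Java", na=False, regex=False).sum())

# Exact match: split each answer into a list, then test membership.
# Missing answers stay NaN after the split, so guard with isinstance.
exact_hits = int(
    langs.str.split(";")
    .apply(lambda xs: isinstance(xs, list) and "Java" in xs)
    .sum()
)

print(substring_hits, exact_hits)
```

Splitting first is slower than a plain substring test, so it is mainly worth it when option names overlap.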


Counting respondents who learned to code online

In [15]:
# All values in LearnCodeOnline
df['LearnCodeOnline'].values

array([nan,
       'Formal documentation provided by the owner of the tech;Blogs with tips and tricks;Books;Recorded coding sessions;How-to videos;Video-based Online Courses;Written-based Online Courses;Auditory material (e.g., podcasts);Online challenges (e.g., daily or weekly coding challenges);Written Tutorials;Click to write Choice 20;Stack Overflow',
       'Formal documentation provided by the owner of the tech;Blogs with tips and tricks;How-to videos;Online challenges (e.g., daily or weekly coding challenges);Written Tutorials;Click to write Choice 20;Stack Overflow',
       ..., nan,
       'Formal documentation provided by the owner of the tech;Recorded coding sessions;How-to videos;Video-based Online Courses;Stack Overflow',
       'Formal documentation provided by the owner of the tech;Blogs with tips and tricks;How-to videos;Video-based Online Courses;Written-based Online Courses;Written Tutorials;Stack Overflow'],
      dtype=object)

In [16]:
# 6. Count of respondents who learned to code through online courses (video-based or written)
count_code_online = df['LearnCodeOnline'].str.contains('Video-based Online Courses') | df['LearnCodeOnline'].str.contains('Written-based Online Courses')
print(count_code_online.sum())

42195
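
The two `str.contains` calls can also be folded into one pattern. This is a sketch on made-up answers, not a claim about how the count above was produced; `course_options` and `online_toy` are illustrative names:

```python
import re

import numpy as np
import pandas as pd

# Made-up answers in the same ';'-separated format as LearnCodeOnline.
online_toy = pd.Series([
    "How-to videos;Video-based Online Courses",
    np.nan,
    "Books;Stack Overflow",
    "Written-based Online Courses;Video-based Online Courses",
])

course_options = ["Video-based Online Courses", "Written-based Online Courses"]

# Build one regex "A|B", escaping each option so it is matched literally.
pattern = "|".join(re.escape(option) for option in course_options)

# na=False turns missing answers into False instead of NaN.
took_online_course = online_toy.str.contains(pattern, na=False)

print(int(took_online_course.sum()))
```

A respondent who ticked both options is counted once, because the mask holds one boolean per row.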


Finding the mean and median compensation by country among respondents who use Python

In [17]:
# 7. Mean and median compensation by country, among respondents who use Python
is_python_dev = df['LanguageHaveWorkedWith'].str.contains('Python', na=False)
python_dev = df[is_python_dev]

if not python_dev.empty:
    result = python_dev.groupby('Country').agg(
        {
            'ConvertedCompYearly': ['mean', 'median']
        })
    print(result)
else:
    print("No Python developers found.")

                                     ConvertedCompYearly         
                                                    mean   median
Country                                                          
Afghanistan                                   665.000000     48.0
Albania                                     28008.600000  11844.0
Algeria                                      8336.333333   6586.0
Andorra                                     32127.000000  32127.0
Angola                                        662.000000    662.0
...                                                  ...      ...
Venezuela, Bolivarian Republic of...        24973.529412  12000.0
Viet Nam                                    20191.870370  13401.0
Yemen                                        8373.000000   9000.0
Zambia                                      13051.000000   9687.0
Zimbabwe                                     5600.000000   6000.0

[173 rows x 2 columns]
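
Some rows in the output above have identical mean and median, which suggests very few respondents in those countries. A minimal sketch of adding a sample size next to the statistics, on made-up data (the threshold `min_n` is a hypothetical choice):

```python
import pandas as pd

# Made-up compensation figures for two placeholder countries.
comp_toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B"],
    "ConvertedCompYearly": [10_000.0, 20_000.0, 60_000.0, 500.0],
})

# Named aggregation: each keyword becomes a flat output column.
summary = comp_toy.groupby("Country")["ConvertedCompYearly"].agg(
    mean="mean", median="median", n="count"
)

min_n = 3  # hypothetical minimum number of answers per country
reliable = summary[summary["n"] >= min_n]

print(summary)
print(reliable)
```

Named aggregation also avoids the two-level column header that the dictionary form of `agg` produces.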


Top 5 compensations, with each respondent's education level

In [18]:
# 8. Five highest compensations and the education level of each respondent
df[['ResponseId', 'EdLevel', 'ConvertedCompYearly']].sort_values(by='ConvertedCompYearly', ascending=False).head()

Unnamed: 0,ResponseId,EdLevel,ConvertedCompYearly
53268,53269,"Professional degree (JD, MD, Ph.D, Ed.D, etc.)",74351432.0
77848,77849,"Professional degree (JD, MD, Ph.D, Ed.D, etc.)",73607918.0
66223,66224,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",72714292.0
28121,28122,Primary/elementary school,57513831.0
19679,19680,"Professional degree (JD, MD, Ph.D, Ed.D, etc.)",36573181.0
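
Sorting shows individual respondents rather than a summary per education level. As a sketch of both views on made-up data (the shortened `EdLevel` labels are illustrative): `nlargest` gives the top rows directly, and a grouped median summarises each level in a way that a single extreme value does not dominate.

```python
import pandas as pd

# Made-up data with one extreme compensation value.
edu_toy = pd.DataFrame({
    "EdLevel": ["Bachelor", "Bachelor", "Bachelor", "Master", "Master"],
    "ConvertedCompYearly": [50_000.0, 60_000.0, 9_000_000.0, 70_000.0, 90_000.0],
})

# Top rows without sorting the whole frame.
top2 = edu_toy.nlargest(2, "ConvertedCompYearly")

# Median compensation per education level, highest first.
median_by_level = (
    edu_toy.groupby("EdLevel")["ConvertedCompYearly"]
    .median()
    .sort_values(ascending=False)
)

print(top2)
print(median_by_level)
```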


Finding the percentage of Python programmers in each age category

In [19]:
# Count all respondents by Age
df.groupby('Age')['ResponseId'].count()

Age
18-24 years old       17931
25-34 years old       33247
35-44 years old       20532
45-54 years old        8334
55-64 years old        3392
65 years or older      1171
Prefer not to say       449
Under 18 years old     4128
Name: ResponseId, dtype: int64

In [20]:
# 9. Percentage of Python programmers in each age category
python_age = df.groupby('Age').agg({
    'ResponseId': 'count',
    'LanguageHaveWorkedWith': lambda x: x.str.contains('Python').sum(),
})
percent_python_among_age = python_age['LanguageHaveWorkedWith'] / python_age['ResponseId'] * 100
print(round(percent_python_among_age, 1))

Age
18-24 years old       61.4
25-34 years old       47.6
35-44 years old       41.4
45-54 years old       38.5
55-64 years old       36.5
65 years or older     30.9
Prefer not to say     41.2
Under 18 years old    68.6
dtype: float64
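
The same ratio can be computed without a custom lambda, since the mean of a boolean mask is the fraction of `True` values. A minimal sketch on made-up data (`age_toy` and its shortened age labels are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up respondents; one left the language question blank.
age_toy = pd.DataFrame({
    "Age": ["18-24", "18-24", "18-24", "25-34", "25-34"],
    "LanguageHaveWorkedWith": ["Python;SQL", "Python", "Java", np.nan, "Python"],
})

# One boolean per respondent; missing answers count as "not Python".
uses_python = age_toy["LanguageHaveWorkedWith"].str.contains("Python", na=False)

# Group the mask by age and average it to get the share per group.
pct_python_by_age = (uses_python.groupby(age_toy["Age"]).mean() * 100).round(1)

print(pct_python_by_age)
```

Because blank answers are kept as `False`, the denominator is every respondent in the age group, matching the count-based calculation above.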


Finding the top industries among remote respondents by 75th-percentile compensation

In [21]:
# 10. 75th percentile of compensation per industry, remote respondents only
type_work = ['Remote']
quantile_respondents = df.loc[df['RemoteWork'].isin(type_work)].groupby('Industry')['ConvertedCompYearly'].quantile(0.75)
# Sort so that the highest-paying industries come first
sorted_quantile_respondents = quantile_respondents.sort_values(ascending=False)
print(sorted_quantile_respondents)

Industry
Healthcare                                                             155000.00
Advertising Services                                                   150000.00
Financial Services                                                     150000.00
Other                                                                  148966.00
Insurance                                                              145000.00
Retail and Consumer Services                                           140000.00
Information Services, IT, Software Development, or other Technology    136552.00
Manufacturing, Transportation, or Supply Chain                         130000.00
Higher Education                                                       120000.00
Oil & Gas                                                              117841.00
Legal Services                                                         117500.00
Wholesale                                                              113645.25
Name: ConvertedComp