In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<h1>Stack Overflow 2020 Developer Survey</h1>

<h2>Overview</h2>
     <h4> In this exploratory analysis, we describe the dataset collected by Stack Overflow in detail. The data was first cleaned and then narrowed to the columns this analysis requires. The dataset was then analysed by age, country, experience and gender. Finally, visualizations were used to answer specific questions about the respondents.</h4>

     


Loading The Dataset

In [None]:
#Loading The Dataset
data=pd.read_csv('../input/stack-overflow-developer-survey-2020/developer_survey_2020/survey_results_public.csv')

<h2>Data Preparation And Cleaning</h2>

# The data has been limited according to the following domains:
    
      1. The country from which responses have been filled.
      2. The education level, professional experience and age.
      3. Their knowledge of the programming languages.
       


<h3> Overview of the cleaning process.</h3>
<h4>1.  Checking the rows and columns of the dataset</h4>
<h4>2.  Checking for any Null or Missing Values</h4>
<h4>3.  Checking for any wrong datatype</h4>

In [None]:
#Going through the dataset
data

<h4>**The dataset contains 64461 responses to about 61 questions. Some of the records contain null values, and the responses were collected anonymously to protect the identity of the respondents.**</h4>

In [None]:
#Showing The First Five Records Of The Dataset
data.head()

In [None]:
#Checking The Columns
data.columns

<h4>The shortcodes for the questions have been used as the names of the columns. The schema file maps each shortcode to the full question in its QuestionText column.</h4>

In [None]:
schema = '../input/stack-overflow-developer-survey-2020/developer_survey_2020/survey_results_schema.csv'
new_schema = pd.read_csv(schema, index_col='Column').QuestionText

In [None]:
new_schema

In [None]:
new_schema['WelcomeChange']
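
As a small illustration of the lookup above, the sketch below builds a two-row stand-in for the schema file in memory and shows that `index_col='Column'` lets us fetch a question by its shortcode. The question texts are invented placeholders, not the real survey wording.

```python
import io
import pandas as pd

# Hypothetical stand-in for survey_results_schema.csv (placeholder question texts)
schema_csv = io.StringIO(
    "Column,QuestionText\n"
    "Country,Where do you live?\n"
    "Age,What is your age?\n"
)

# index_col='Column' makes the shortcode the index,
# so .QuestionText is a Series keyed by shortcode
toy_schema = pd.read_csv(schema_csv, index_col='Column').QuestionText

print(toy_schema['Country'])  # label-based lookup
print(toy_schema.Age)         # attribute access also works for simple labels
```

Bracket lookup is the safer habit, since attribute access fails for shortcodes that clash with existing Series attributes.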

<h3>We will create a new list of columns to limit our analysis to some particular fields.</h3>

In [None]:
new_columns = [
    # Demographics
    'Country',
    'Age',
    'Gender',
    'EdLevel',
    'UndergradMajor',
    # Programming experience
    'Hobbyist',
    'Age1stCode',
    'YearsCode',
    'YearsCodePro',
    'LanguageWorkedWith',
    'LanguageDesireNextYear',
    'NEWLearn',
    'NEWStuck',
    # Employment
    'Employment',
    'DevType',
    'WorkWeekHrs',
    'JobSat',
    'JobFactors',
    'NEWOvertime',
    'NEWEdImpt'
]

In [None]:
#Selecting only selected columns from the dataset
data[new_columns]

In [None]:
#Creating a new dataset with only the selected columns; .copy() avoids SettingWithCopyWarning on the later assignments
new_survey=data[new_columns].copy()

In [None]:
new_survey.head()

In [None]:
#Looking For The Data And Object Types
new_survey.info()

<h4>Most of the columns have the Object dtype; only 'WorkWeekHrs' and 'Age' are numeric.</h4>
<h4>'Age1stCode', 'YearsCode' and 'YearsCodePro' hold mostly numbers stored as text, so we convert them with the pandas to_numeric function. With errors='coerce', any non-numeric entry becomes NaN.</h4>

In [None]:
#Converting the 'Age1stCode' column to numeric; non-numeric entries become NaN.
new_survey['Age1stCode'] = pd.to_numeric(new_survey.Age1stCode, errors='coerce')


In [None]:
#Converting the 'YearsCode' column to numeric; non-numeric entries become NaN.
new_survey['YearsCode'] = pd.to_numeric(new_survey.YearsCode, errors='coerce')


In [None]:
#Converting the 'YearsCodePro' column to numeric; non-numeric entries become NaN.
new_survey['YearsCodePro'] = pd.to_numeric(new_survey.YearsCodePro, errors='coerce')
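
To see what `errors='coerce'` does, here is a minimal sketch on a toy Series. The text labels are assumed examples of non-numeric answers, not values taken from the dataset.

```python
import pandas as pd

# Toy answers: numbers stored as text, two assumed text options, one missing value
raw = pd.Series(['12', '25', 'Younger than 5 years', None, 'Older than 85'])

# Anything that cannot be parsed as a number is replaced by NaN
numeric = pd.to_numeric(raw, errors='coerce')

print(numeric.tolist())
print(numeric.isna().sum())  # number of entries that ended up as NaN
```

Note that coercion discards the text answers, so it can be worth counting the new NaN values before relying on the converted column.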

In [None]:
new_survey.describe()

<h3> An interesting observation here is that the maximum value of age is 279 and the minimum is 1 year. These values are not realistic, so we ignore the records in which age is greater than 100 or less than 10.</h3>


In [None]:
#Dropping those records for which age is greater than 100 or less than 10.

new_survey.drop(new_survey[new_survey.Age < 10].index, inplace=True)
new_survey.drop(new_survey[new_survey.Age > 100].index, inplace=True)
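
The same filter can also be written as a single boolean mask, sketched here on toy data. We assume that rows with a missing age should be kept, which matches the drop-based cell above because a comparison against NaN is never True.

```python
import numpy as np
import pandas as pd

# Toy ages, including two implausible values and one missing value
toy = pd.DataFrame({'Age': [1.0, 25.0, 279.0, np.nan, 40.0]})

# between() is inclusive on both ends, so ages 10 and 100 are kept;
# the isna() term keeps rows where age was not reported
plausible = toy.Age.between(10, 100) | toy.Age.isna()
filtered = toy[plausible].copy()

print(filtered)
```

A mask returns a new frame rather than modifying one in place, which makes it easier to compare row counts before and after the filter.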

In [None]:
#Looking For The Null Values

new_survey.isna()

<h3> Another interesting observation is that the number of working hours in a week is as high as 475 for some records. Since a week has only 168 hours, this is not possible, and we drop the records that report more than 140 hours.</h3>

In [None]:
new_survey.drop(new_survey[new_survey.WorkWeekHrs > 140].index, inplace=True)

<h3>The Gender column appears to have many choices but to simplify our analysis, we will consider only 3 choices.</h3>

In [None]:
new_survey['Gender'].value_counts()

In [None]:
#Multiple choices are separated by ';'. Rows whose Gender lists more than one choice are set to NaN (in every column).
import numpy as np
new_survey.where(~(new_survey.Gender.str.contains(';', na=False)), np.nan, inplace=True)
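
It helps to check what a frame-level `where` does on a small example: the condition is applied to whole rows, so every column of a matching row becomes NaN. The column-level variant below is an alternative we sketch for comparison, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame with one multi-choice gender answer and one missing answer
toy = pd.DataFrame({
    'Gender': ['Man', 'Woman;Man', 'Woman', None],
    'Country': ['India', 'Germany', 'Brazil', 'Canada'],
})

# True for rows where more than one option was selected
multi = toy.Gender.str.contains(';', na=False)

# Frame-level where: all columns of a matching row are replaced by NaN
whole_row = toy.where(~multi, np.nan)

# Column-level alternative (assumption): blank out only the Gender value
gender_only = toy.copy()
gender_only['Gender'] = gender_only['Gender'].mask(multi)

print(whole_row)
print(gender_only)
```

The frame-level version also removes those respondents from every later plot, while the column-level version only excludes them from the gender breakdown.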

In [None]:
new_survey['Gender'].value_counts()

In [None]:
new_survey.head()

In [None]:
new_schema.head()

# Exploratory Analysis and Visualization
 We will start the analysis by exploring fields such as country, age, gender and education to get a more granular view of the dataset and to see which demographics and communities are represented.

In [None]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

# Country
We will dig deeper into the responses to see the variation and distribution by Country.

In [None]:
new_schema.Country

In [None]:
#Number of unique countries in the record
new_survey.Country.nunique()

In [None]:
top_countries = new_survey.Country.value_counts().head(15)
top_countries

The USA has the highest number of responses, followed by India, the UK and others.
It is important to note that Stack Overflow runs the survey in English, so most of the responses come from English-speaking countries and other countries are likely underrepresented.

In [None]:
plt.figure(figsize=(12,6))
plt.xticks(rotation=75)
plt.title(new_schema.Country)
sns.barplot(x=top_countries.index, y=top_countries);

# Age

In [None]:
plt.figure(figsize=(12, 6))
plt.title(new_schema.Age)
plt.xlabel('Age')
plt.ylabel('Number of respondents')

plt.hist(new_survey.Age, bins=np.arange(10,80,5), color='lightblue');

Most of the respondents are between 20 and 40 years old. It is also encouraging to see that people above 40 years of age are active programmers too.
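
To put a number on this kind of reading of a histogram, one could bucket the ages with `pd.cut`. This is a sketch on invented ages, so the printed share says nothing about the real survey.

```python
import numpy as np
import pandas as pd

# Toy ages (invented values, not survey data)
ages = pd.Series([19, 22, 27, 31, 38, 45, 52, np.nan])

# Same 5-year bin edges as the histogram: 10, 15, ..., 75
edges = np.arange(10, 80, 5)

# right=False gives half-open brackets [10, 15), [15, 20), ...
brackets = pd.cut(ages, bins=edges, right=False)
counts = brackets.value_counts().sort_index()

# Share of respondents with a reported age of at least 20 and below 40
in_range = (ages >= 20) & (ages < 40)
share_20_to_40 = in_range.sum() / ages.notna().sum()

print(counts)
print(round(share_20_to_40, 3))
```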

# Gender

In [None]:
gender_counts = new_survey.Gender.value_counts()
gender_counts

In [None]:
plt.figure(figsize=(18,9))
plt.title(new_schema.Gender)
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=180,colors=['#C39BD3','#A9CCE3','#76D7C4']);

<h4>91% of the programmers are male, 8% are female and 0.7% are neither. This imbalance may look larger than it really is because not all programmers respond to the survey, but it is also known that women are underrepresented in the programming sector.</h4>

# Education Level

In [None]:
sns.countplot(y=new_survey.EdLevel)
plt.xticks(rotation=75);
plt.title(new_schema['EdLevel'])
plt.ylabel('Qualification')


<h4>From the graph above, it is evident that most of the programmers who submitted responses hold a bachelor's degree.</h4>

# Employment

In [None]:
new_schema.Employment

In [None]:
(new_survey.Employment.value_counts(normalize=True, ascending=True)).plot(kind='barh', color='g')
plt.title(new_schema.Employment)
plt.xlabel('Fraction of respondents')
plt.ylabel('Profession Type')

In [None]:
#Converting X axis to %
(new_survey.Employment.value_counts(normalize=True, ascending=True)*100).plot(kind='barh', color='g')
plt.title(new_schema.Employment)
plt.xlabel('Percentage')
plt.ylabel('Profession Type')

Around 70% of the programmers are employed full-time, while 15% are students and 10% are self-employed.

# Asking and Answering Questions
<h3>We've already gained several insights about the respondents and the programming community by exploring individual columns of the dataset. Let's ask some specific questions and try to answer them using data frame operations and visualizations.</h3>

<h3>Q: In which countries do developers work the highest number of hours per week? Consider countries with more than 250 responses only.</h3>


We will use groupby to aggregate the records for each country

In [None]:
#Grouping The Data
countries_data = new_survey.groupby('Country')[['WorkWeekHrs']].mean().sort_values('WorkWeekHrs', ascending=False)

In [None]:
countries_data

In [None]:
#Considering Countries with records greater than 250
high_response_countries_data = countries_data.loc[new_survey.Country.value_counts() > 250].head(15)


In [None]:
high_response_countries_data
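
The mean and the response count can also be computed in one pass with named aggregation, as sketched below on toy data. One difference to keep in mind: `'count'` here counts only non-null `WorkWeekHrs` values, whereas the cell above counts every row per country with `value_counts`. The name `MIN_RESPONSES` is ours.

```python
import pandas as pd

# Toy data: country A has 3 answers, B has 2, C has 1
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'C'],
    'WorkWeekHrs': [40.0, 44.0, 42.0, 50.0, 30.0, 60.0],
})

MIN_RESPONSES = 2  # the notebook uses 250; lowered here to suit the toy data

# One row per country with the mean hours and the number of answers
stats = toy.groupby('Country')['WorkWeekHrs'].agg(mean_hours='mean', answers='count')

# Keep only countries with enough answers, highest mean first
busy = stats[stats.answers > MIN_RESPONSES].sort_values('mean_hours', ascending=False)

print(busy)
```

Keeping the count next to the mean makes it easy to see how much data stands behind each country's average.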

Overall, the gap between Iran (the country with the most working hours per week in this list) and India (the country with the fewest working hours per week in this list) is only around 4 hours. The top 15 countries are a mix from Asia, Europe and North America.

<h3>Q: How important is it to start young to build a career in programming?</h3>


We will draw a scatterplot of Age against YearsCodePro to show how the age of the respondents relates to their professional experience.

In [None]:
new_schema.YearsCodePro


In [None]:
sns.scatterplot(x='Age', y='YearsCodePro', hue='Hobbyist', data=new_survey)
plt.xlabel("Age")
plt.ylabel("Years of professional coding experience");

The plot suggests that there is no specific age to learn programming. In fact, it can be started at any age.

<h3>Q: How many of the respondents have been exposed to programming at least once in their life?</h3>

In [None]:
plt.title(new_schema.Age1stCode)
sns.histplot(x=new_survey.Age1stCode, bins=30, kde=False);

Most of the respondents were first exposed to programming well before the age of 40. This suggests that many fields call for programming skills and underlines the importance of learning basic programming.

<h3>Q: Which role has the highest average number of hours worked per week? Which one has the lowest?</h3>

In [None]:
employment_hours = new_survey.groupby('Employment')[['WorkWeekHrs']].mean().sort_values('WorkWeekHrs', ascending=False)
employment_hours[:5]


Unsurprisingly, full-time employees work more hours than self-employed and part-time programmers. Among those who work as programmers, part-time employees have the lowest working hours.

<h3>Q: How many of the programmers have Computer Science as their major?</h3>

In [None]:
new_schema.UndergradMajor

In [None]:
undergrad_pct = new_survey.UndergradMajor.value_counts() * 100 / new_survey.UndergradMajor.count()

sns.barplot(x=undergrad_pct, y=undergrad_pct.index)

plt.title(new_schema.UndergradMajor)
plt.ylabel(None);
plt.xlabel('Percentage');
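
The percentage calculation above divides the counts by the number of non-null answers. As a quick check on toy data (the major names are placeholders), `value_counts(normalize=True)` gives the same result, because it also ignores missing values by default.

```python
import pandas as pd

# Toy majors with one missing answer
majors = pd.Series(['CS', 'CS', 'Math', None, 'CS', 'Physics'])

# Manual version, as in the cell above: counts divided by non-null answers
manual = majors.value_counts() * 100 / majors.count()

# Built-in version: normalize=True returns shares that sum to 1
builtin = majors.value_counts(normalize=True) * 100

print(builtin)
```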

It turns out that 40% of programmers holding a college degree have a field of study other than computer science - which is very encouraging. It seems to suggest that while a college education is helpful in general, you do not need to pursue a major in computer science to become a successful programmer.

<h2>Conclusion And Inferences</h2>

The following conclusions have been drawn from the analysis done above:

- Most of the responses come from English-speaking countries. This may not represent the overall programming community.

- Women, Transgender and other communities are still underrepresented in the Computer Science Field. Although their percentage is increasing over the years, more work has to be done to have equal contributions from all communities.

- Most of the working professionals have at least a bachelor's degree and around 40% of them have a master's degree. So it seems that a bachelor's degree is important for pursuing a career in the Computer Science field.

- Almost 60% of the degree holders have Computer Science as their major, while around 40% majored in another field, which is very encouraging. It seems that a degree in Computer Science might not be necessary if you have enough skills.

- Most of the programmers work as full-time employees but around 13% are self-employed, which is very encouraging for beginners.

- Most of the programmers seem to be working around 40 hours per week.

- It is not essential to start programming at a young age. People above 30 years of age can start programming and make a career in it.

- Most of the people have been exposed to programming at least once in their life. This shows the importance of having at least basic programming skills regardless of the field one is in.

- Full-time employees work the most hours in a week but, interestingly, self-employed and part-time employees work almost the same number of hours, so it is quite possible that the dataset is biased.