# Analysis of Stack Overflow 2020 Survey Results

In this notebook, I analyze the Stack Overflow 2020 survey data and cover the following topics:<br>

* How are the respondents distributed across countries, and does that distribution represent<br>
the countries well according to their population?<br>

* What are the most popular programming languages (currently used and desired for next year)<br>
in the most represented countries?<br>

* How are the following features distributed in those countries?<br>

    * Job satisfaction
    * Primary field of study
    * Education level
    * Employment and job-seeking status
    * Education importance

* Which features affect job satisfaction the most?<br>

I created a logistic regression model to predict job satisfaction status.<br>
"Very dissatisfied" and "Slightly dissatisfied" values are labelled as Dissatisfied;<br>
"Neither satisfied nor dissatisfied" values are removed from the data, and "Slightly satisfied" and "Very satisfied" values are labelled as Satisfied.<br>
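
The labelling rule above can be sketched on a toy series as follows. This is only a minimal illustration, not the code used for the model: the names `label_map` and `job_sat_binary` are my own assumptions, and the five answers are the categories listed above.

```python
import pandas as pd

# Toy answers using the five job satisfaction categories named above
job_sat = pd.Series([
    "Very dissatisfied",
    "Slightly dissatisfied",
    "Neither satisfied nor dissatisfied",
    "Slightly satisfied",
    "Very satisfied",
])

# Hypothetical mapping; the neutral answer is left out on purpose
label_map = {
    "Very dissatisfied": "Dissatisfied",
    "Slightly dissatisfied": "Dissatisfied",
    "Slightly satisfied": "Satisfied",
    "Very satisfied": "Satisfied",
}

# Answers without a mapping (the neutral one) become NaN and are dropped
job_sat_binary = job_sat.map(label_map).dropna()
print(job_sat_binary.tolist())
```

Leaving the neutral answer out of the mapping is what removes it: `map` returns NaN for it and `dropna` discards the row.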

The main findings can also be found in the post available [here].<br>

[here]:https://ozkanoztork.medium.com/you-are-a-satisfied-developer-arent-you-95170cc45ad4

[i. Importing Required Packages](#i)<br>
[ii. Importing Data](#ii)<br>

[Q1: How are the respondents distributed across countries, and does the distribution represent the countries well according to their population?](#Q1)<br>
* [Q1.1 : What is the percentage of Respondents according to Countries?](#Q1.1)<br>
* [Q1.2 : What is Respondent Density According to Population?](#Q1.2)<br>

[Q2: What are the most popular programming languages (currently used and desired for next year) in the most represented countries?](#Q2)<br>
* [Q2.1: What is the most popular language among the respondents?](#Q2.1)<br>
* [Q2.2: What are the usage percentages of languages by country, and which language is used most in which country?](#Q2.2)<br>
* [Q2.3: What is the most desired language for next year?](#Q2.3)<br>
* [Q2.4: What are the desired percentages of languages by country, and which language is desired most in which country?](#Q2.4)<br>

[Q3: How is the distribution of following features in those countries ? (Primary field of study, Education Level, Job satisfaction, Employment and Job seeking status, Education importance)](#Q3)<br>
* [Q3.1 How is the distribution of the primary fields in top countries?](#Q3.1)<br>
* [Q3.2 How is the distribution of the education level in top countries?](#Q3.2)<br>
* [Q3.3 How is the distribution of the job satisfaction in top countries?](#Q3.3)<br>

[Q4: Which features affect job satisfaction the most?](#Q4)<br>


[Conclusion](#5)<br>


<a id="i"></a>
### i. Importing Required Packages 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id="ii"></a>
## ii. Importing Data

For the 2020 population data, I used the dataset [here], thanks to Tanu N Prabhu.

[here]:https://www.kaggle.com/tanuprabhu/population-by-country-2020


In [None]:
results_2020 = pd.read_csv("/kaggle/input/stack-overflow-developer-survey-2020/developer_survey_2020/survey_results_public.csv")
schema_2020 = pd.read_csv("/kaggle/input/stack-overflow-developer-survey-2020/developer_survey_2020/survey_results_schema.csv")
population_2020 = pd.read_csv("/kaggle/input/population-by-country-2020/population_by_country_2020.csv")

In [None]:
len(results_2020.Country.value_counts().index)

<a id="Q1"></a>
## Q1: How are the respondents distributed across countries, and does the distribution represent the countries well according to their population?

<a id="Q1.1"></a>
### Q1.1 : What is the percentage of Respondents according to Countries?

Knowing how respondents are distributed across countries is a must for drawing consistent insights from further analysis.<br>
If a country has too few respondents in the survey, that country is not well represented.<br>
Thus, I need to understand the percentage of respondents by country.

I will use the "Country" column to find the number of respondents from each country.<br>
Afterwards, I will divide each country's respondent count by the total number of respondents to get its percentage.<br>
First, I will check whether the Country column has any NaN rows:

In [None]:
respondent_count = results_2020.shape[0]
print("Survey respondents :", respondent_count)


# How many respondents having NaN for Country?
country_nan_count = results_2020.Country.isnull().sum()
print("Survey respondents not having any country: ", country_nan_count)

In [None]:
# For respondents with a missing "Country", check which other features have any non-null values
not_na_percents = results_2020[results_2020.Country.isnull()].notnull().mean()
print("Not-Nan Percent > 0 :\n" ,not_na_percents[not_na_percents>0])

Only 3 features (Respondent, MainBranch and Hobbyist) have any values; the rest of the features are all empty.<br>
Thus, I will drop all rows with a missing Country and create a new data frame:

In [None]:
df = results_2020.dropna(subset = ["Country"], axis = 0)

## How many respondents according to countries?

I will create a function to calculate percentage of respondents according to countries:

In [None]:
def create_count_table(df, col_name):
    '''
    INPUTS:
    df (dataframe) - The dataframe containing the related column
    col_name (string) - Name of the column whose values are to be counted
    
    OUTPUT:
    new dataframe containing columns:
    - col_name : Unique items of the target column
    - Counts : Counts of items in target column
    - Percent : Count of item / total count of column * 100
    
    '''
    counts = df[col_name].value_counts().values
    items = df[col_name].value_counts().index
    percent = np.round(counts / counts.sum()*100,2)
    
    new_table = pd.DataFrame({col_name:items, "Counts":counts, "Percent":percent})
    
    return new_table
    

In [None]:
# Creating a dataframe including countries, counts and percent.
country_count = create_count_table(df, "Country")
country_count.head()

In [None]:
# Visualisation

# Creating ax object
country_count_filtered = country_count[country_count.Percent>1]
ax = country_count_filtered.plot(kind = "bar", x = "Country", y = "Counts",legend = False, figsize = (15,6), edgecolor = "black")

i=0
for patch in ax.patches:
    ax.text((patch.get_x()),
            patch.get_height()+100,
            country_count_filtered.Percent.values[i],
            fontsize=9,
            rotation=0,
            color="red")
    i = i + 1

plt.xlabel("Countries", fontsize = 12)
plt.ylabel("Respondent Count and Percent", fontsize = 12)
plt.title("Respondent Counts and Percents According to Countries (Countries having min %1 respondents are shown)", fontsize=14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.xticks(rotation = 45, fontsize = 9)
plt.savefig("Respondent Counts and Percents According to Countries Countries having min %1 respondents are shown")
plt.show()

From the graph above: the country with the most respondents is the United States, with 19.46% of all respondents.
It is followed by India, the United Kingdom and Germany.

<a id="Q1.2"></a>
### Q1.2 : What is Respondent Density According to Population?

* We see that the top 5 countries by respondent percentage are the US, India, the UK, Germany and Canada.<br>
* I want to understand how respondents are distributed relative to each country's population.<br>
* Thus, I will compare the respondent density of each country:

     Respondent Density = Number of respondents in country / Population of country <br>
     
* I will use the population data to get the population value for each country.
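
As a quick worked example of the density formula, here is a small sketch with made-up numbers. The countries "A" and "B" and their counts are hypothetical; the scaling by 100,000 matches the `Resp_Density*100k` column computed later in this notebook.

```python
import pandas as pd

# Hypothetical counts and populations, for illustration only
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "Counts": [500, 2000],
    "Population": [10_000_000, 200_000_000],
})

# Respondents per 100,000 inhabitants
toy["Resp_Density*100k"] = toy["Counts"] / toy["Population"] * 100_000
print(toy)
```

Country B has four times as many respondents as country A, but only one fifth of its respondent density, which is the kind of difference this comparison is meant to reveal.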

In [None]:
# Creating new df with necessary columns from population_2020 dataframe

# rename() returns a new frame, so no slice of population_2020 is modified in place
pop_df = population_2020[["Country (or dependency)","Population (2020)"]].rename(
    columns={"Country (or dependency)": "Country", "Population (2020)": "Population"})

pop_df.info()

In [None]:
# Checking the Country names if there are differences between pop_df and country_count df

non_matches = sorted(set.difference(set(country_count.Country),set(pop_df.Country)))
non_matches


In [None]:
# Manual correction of non-matches in pop_df

# correcting_list = ["country name in country_count", "country name in pop_df"]
correcting_list = [['Brunei Darussalam','Brunei'],['Cape Verde',  'Cabo Verde'],['Congo, Republic of the...',  'Congo'],
                   ['Czech Republic', 'Czech Republic (Czechia)'],['Democratic Republic of the Congo', 'DR Congo'],['Hong Kong (S.A.R.)', 'Hong Kong'],
                   ['Kosovo', "Kosovo"],["Lao People's Democratic Republic",'Laos'],['Libyan Arab Jamahiriya','Libya'],
                   ['Micronesia, Federated States of...','Micronesia'],['Nomadic',"Nomadic"],['Republic of Korea','South Korea'],['Republic of Moldova', 'Moldova'],
                   ['Russian Federation','Russia'],['Saint Vincent and the Grenadines','St. Vincent & Grenadines'],['Swaziland',"Swaziland"],
                   ['Syrian Arab Republic','Syria'],['The former Yugoslav Republic of Macedonia', "The former Yugoslav Republic of Macedonia"],['United Republic of Tanzania','Tanzania'],
                   ['Venezuela, Bolivarian Republic of...','Venezuela'],['Viet Nam','Vietnam']]

for x,y in correcting_list:
    pop_df.loc[pop_df.Country == y, "Country"] = x
    

In [None]:
# Creating new column in country count for population info


# Look up each country's population in pop_df by country name;
# countries with no match in pop_df get NaN
population_by_country = pop_df.drop_duplicates(subset="Country").set_index("Country")["Population"]
country_count["Population"] = country_count["Country"].map(population_by_country)

In [None]:
# Drop nan
country_count.dropna(inplace = True)


# Change data type to int
country_count.Population = country_count.Population.astype(int)


In [None]:
country_count.info()

In [None]:

#country_count["Population"] = (country_count.Population-country_count.Population.min())/(country_count.Population.max()-country_count.Population.min())
country_count["Resp_Density*100k"] = country_count.Counts/country_count.Population*100000
#country_count["norm"] = (country_count.Pop_Density-country_count.Pop_Density.min())/(country_count.Pop_Density.max()-country_count.Pop_Density.min())

In [None]:
#country_count.iloc[:,3:] = (country_count.iloc[:,3:] - country_count.iloc[:,3:].min())/(country_count.iloc[:,3:].max()-country_count.iloc[:,3:].min())

#country_count.iloc[:,4:] = country_count.iloc[:,4:] /country_count.iloc[:,4:].mean()

In [None]:
country_count_sorted = country_count.sort_values(by=['Resp_Density*100k'], ascending=False)
country_count_sorted[country_count_sorted.Percent >1]

In [None]:
country_count_sorted.describe()

In [None]:

country_count_sorted[country_count_sorted.Percent >1 ].plot.bar("Country",["Percent", "Resp_Density*100k"],figsize=(15,6), edgecolor = "black")
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("Respondent Percent and Respondent Density (Respondent Count / Country Population) of Countries", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.savefig("Respondent Percent and Respondent Density of Countries")
plt.show()


* From the graph above: although a big portion (19.5%) of the respondents are from the US, <br>
it ranks only 8th when we compare respondent densities.<br>
* So, judged by respondent density, people in the US did not show especially high interest in the survey.<br>
* The five most interested countries are Sweden, Netherlands, Israel, Canada and the UK.<br>
* Among the countries with a respondent percentage > 1%, the least interested are Pakistan and India.<br>


<a id="Q2"></a>
## Q2: What are the most popular programming languages (currently used and desired for next year) in the most represented countries?

* As a beginner learning a programming language, Python, I want to understand the 
 popularity of the languages in different countries.
* I will define the most represented (or most interested) countries in the survey as the countries having  <br>
a respondent percentage > 1% (so they are the countries shown in the graphs above).<br>

* To uncover the most popular languages, I will use the columns "LanguageWorkedWith" and "LanguageDesireNextYear".<br>

* I will also create a new data frame containing the columns I am interested in for further analysis.
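
Both language columns store several answers in one semicolon-separated string, so they need to be split before anything can be counted. The sketch below shows one compact way to do that with `str.get_dummies`; it is an alternative illustration on toy data, not the function used later in this notebook, and the names `toy`, `flags`, `overall_share` and `share_by_country` are assumptions of mine.

```python
import pandas as pd

# Toy rows in the same semicolon-separated format as LanguageWorkedWith
toy = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "LanguageWorkedWith": ["Python;SQL", "JavaScript;Python", "JavaScript"],
})

# One 0/1 column per language
flags = toy["LanguageWorkedWith"].str.get_dummies(sep=";")

# Share of respondents using each language, overall and per country
overall_share = flags.mean().sort_values(ascending=False)
share_by_country = flags.groupby(toy["Country"]).mean()

print(overall_share)
print(share_by_country)
```

The mean of a 0/1 column is the fraction of respondents who selected that language, which is why averaging the flags gives usage shares directly.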



In [None]:
top_country_list = list(country_count[country_count.Percent >1].Country.values)
top_country_list

In [None]:
feature_list = ["Age","Age1stCode","ConvertedComp","Country","EdLevel","Employment","Hobbyist", "UndergradMajor","JobSat","JobSeek",
                "LanguageDesireNextYear", "LanguageWorkedWith", "MainBranch","NEWEdImpt","OrgSize","WorkWeekHrs","YearsCode","YearsCodePro"]

In [None]:
# Create new df for top countries
top_country_df = df[df.Country.isin(top_country_list)][feature_list]


<a id="Q2.1"></a>
### Q2.1: What is the most popular language among the respondents ?

In [None]:
# percentage of nulls of columns
top_country_df.isnull().mean()[top_country_df.isnull().mean()>0.25]

In [None]:
def get_language_percent(df, col_country, col_lang, top_country_list ):
    
        
    '''
    INPUT:
    df - Dataframe
    col_country - Column name in df (as string) where the countries are stored
    col_lang - Language column in df (as string) where the languages are stored
    top_country_list - A list containing country names for which popularity is to be calculated
    
    OUTPUT:
    sorted_lang_percent_table:
    Overall shares (fractions between 0 and 1) of respondents per language, sorted descending (as Series)
    country_lang_percent_df:
    Shares (fractions between 0 and 1) of respondents per language in each country (as dataframe)
    lang_list:
    A list containing unique languages
    
    '''
    
    # create new df (a copy, so the columns added below do not modify a slice of df)
    new_df = df[[col_country,col_lang]].copy()
    
    # drop na
    new_df.dropna(inplace = True)
    
    # create list for unique languages
    lang_list = []
    for answer in list(new_df[col_lang].value_counts().index):
        for lang in answer.split(";"):
            lang_list.append(lang)
    lang_list = list(set(lang_list))

    # Arranging the language column by
    # separating it into one 0/1 column per language
    for lang in lang_list:

        new_df[lang] = new_df[col_lang].str.split(";")
        new_df[lang] = [lang in row for row in new_df[lang]]
        new_df[lang] = new_df[lang].astype(int)
    
    # Percentages of languages 
    sorted_lang_percent_table = new_df.iloc[:,2:].mean().sort_values(ascending = False)
    
    
    # Sorting languages descending and creating new list
    sorted_lang_list = new_df.iloc[:,2:].mean().sort_values(ascending = False).index
    
    # Creating series for language statistics of each country
    # and creating df from those series
    series_list = []

    for country in top_country_list:
        country_lang_percent = new_df[new_df[col_country] == country].iloc[:,2:].mean().reindex(sorted_lang_list)
        series_list.append(country_lang_percent)

    country_lang_percent_df = pd.concat(series_list, axis = 1)
    country_lang_percent_df.columns = top_country_list
    country_lang_percent_df = country_lang_percent_df.transpose()
    

    return sorted_lang_percent_table, country_lang_percent_df, lang_list
    

In [None]:
# getting statistics for LanguageWorkedWith
sorted_lang_percent_table, country_lang_percent_df, lang_list = get_language_percent(top_country_df, "Country", "LanguageWorkedWith",top_country_list )

In [None]:
sorted_lang_percent_table

In [None]:
# Visualisation
sorted_lang_percent_table.plot.bar(figsize=(15,6), edgecolor = "black")
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Languages", fontsize = 12)
plt.title("Working Language Percentage of Respondents", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.savefig("Working Language Percentage of Respondents")
plt.show()

From the graph above: 
* The most popular language currently in use is JavaScript, used by 67% of the respondents.
* It is followed by HTML/CSS, SQL, Python, Java... 

<a id="Q2.2"></a>
### Q2.2 What are the usage percents of languages in Countries, Which language is being used most in which Country?

In [None]:
country_lang_percent_df

In [None]:
# Visualisation
country_lang_percent_df.iloc[:,:5].plot.bar(figsize = (20,4), width = 0.8, colormap= "Set3", edgecolor = "black", alpha = 0.8)
#country_lang_percent_df.iloc[:,:5].plot(figsize = (20,4))
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("Usage Percentage of Top 5 Languages in Top Countries", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.savefig("Usage Percentage of Top 5 Languages in Top Countries")
plt.show()

* From the previous graph, the overall top 5 languages are JavaScript, HTML/CSS, SQL, Python and Java.
* JavaScript ranks first in almost all countries; the exception is Pakistan, where HTML/CSS has greater usage than JavaScript.
* Although Python ranks 4th overall, it ranks 2nd in Israel and 4th in some other countries.
* Compared with the other countries, HTML/CSS is used most in Pakistan,<br>
    SQL is used most in Italy,<br>
    Python is used most in Israel and the US,<br>
    Java is used most in Italy and Germany.

<a id="Q2.3"></a>
### Q2.3: What is the most desired language next year ? 

In [None]:
# getting statistics for LanguageDesireNextYear
sorted_desired_lang_percent_table, country_desired_lang_percent_df, lang_list = get_language_percent(top_country_df, "Country", "LanguageDesireNextYear",top_country_list )

In [None]:
sorted_desired_lang_percent_table

In [None]:
# Visualisation
sorted_desired_lang_percent_table.plot.bar(figsize=(15,6), edgecolor = "black")
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Language", fontsize = 12)
plt.title("Desired Language Percentage of Respondents", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.savefig("Desired Language Percentage of Respondents")
plt.show()

* The most desired language is Python, desired by 49.5% of the respondents.
* It is followed by JavaScript, HTML/CSS, SQL and TypeScript.
* So, Bash/Shell/PowerShell and Java seem to be losing popularity.
* TypeScript seems to be getting more popular.
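
One rough way to put numbers on this comparison is to subtract the currently-used share from the desired share for each language. The sketch below uses made-up shares purely for illustration; in this notebook the same subtraction could be applied to `sorted_desired_lang_percent_table` and `sorted_lang_percent_table`. The names `worked`, `desired` and `gap` are assumptions of mine.

```python
import pandas as pd

# Hypothetical shares (fractions of respondents), for illustration only
worked = pd.Series({"JavaScript": 0.67, "Java": 0.40, "TypeScript": 0.25})
desired = pd.Series({"JavaScript": 0.45, "Java": 0.20, "TypeScript": 0.30})

# Positive gap: more respondents want the language than currently use it
gap = (desired - worked).sort_values(ascending=False)
print(gap)
```

pandas aligns the two series on the language names before subtracting, so the order of the entries does not matter.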

<a id="Q2.4"></a>
### Q2.4: What are the desired percentages of languages by country, and which language is desired most in which country?

In [None]:
country_desired_lang_percent_df

In [None]:
# Visualisation
country_desired_lang_percent_df.iloc[:,:5].plot.bar(figsize = (20,4), width = 0.8, colormap= "Set3", edgecolor = "black", alpha = 0.8)
#country_desired_lang_percent_df.iloc[:,:5].plot(figsize = (20,4))
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("Desire Percentage of Top 5 Languages in Top Countries", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.savefig("Desire Percentage of Top 5 Languages in Top Countries")
plt.show()

* Python is desired most in Pakistan, India and Israel,
* JavaScript is desired most in Brazil and Pakistan,
* HTML/CSS is desired most in Brazil, 
* SQL is desired most in Brazil, Italy and the US,
* TypeScript is desired most in the Netherlands.

<a id="Q3"></a>
## Q3: How is the distribution of following features in those countries ? (Primary field of study, Education Level, Job satisfaction, Employment and Job seeking status, Education importance)


In [None]:
top_country_df.head()

In [None]:
# Arranging EdLevel column

top_country_df["EdLevel"] = top_country_df.EdLevel.str.split(r"\(", expand=True)[0]

# Arranging Employment column
top_country_df.Employment = top_country_df.Employment.str.split(", f", expand = True)[0]

# Arranging UndergradMajor column
top_country_df.UndergradMajor = top_country_df.UndergradMajor.str.split(r",|\(", expand = True)[0]

# Arranging OrgSize column
top_country_df.OrgSize = top_country_df.OrgSize.str.split("em|er", expand = True)[0]

# Dropping nulls
top_country_df.dropna(subset = ["LanguageDesireNextYear","LanguageWorkedWith"], inplace = True)


In [None]:
# Arranging Age1stCode column
# Data type to be changed from object to int
# Younger than 5 years  --- > 4
# Convert to numeric; any other non-numeric answer becomes NaN and is filled with the mean
age_first_code = pd.to_numeric(top_country_df.Age1stCode.replace({"Younger than 5 years": 4}), errors="coerce")

top_country_df.Age1stCode = age_first_code.fillna(age_first_code.mean()).astype(int)

In [None]:
# Arranging YearsCode column
# Less than 1 year  --> 0.5
# More than 50 years --> 51

years_code = top_country_df.YearsCode.replace({"Less than 1 year": 0.5, "More than 50 years": 51}).astype(float)

# Fill missing values with the column mean
top_country_df.YearsCode = years_code.fillna(years_code.mean())

In [None]:
# Arranging YearsCodePro column
# Less than 1 year  --> 0.5
# More than 50 years --> 51
years_code_pro = top_country_df.YearsCodePro.replace({"Less than 1 year": 0.5, "More than 50 years": 51}).astype(float)

# Fill missing values with the column mean
top_country_df.YearsCodePro = years_code_pro.fillna(years_code_pro.mean())

In [None]:
top_country_df.WorkWeekHrs = top_country_df.WorkWeekHrs.fillna(top_country_df.WorkWeekHrs.mean())
top_country_df.Age = top_country_df.Age.fillna(top_country_df.Age.mean())

In [None]:
# Fill missing categorical values with the most frequent value of each column
for each in ["MainBranch","Employment","EdLevel","JobSeek","UndergradMajor","NEWEdImpt","JobSat","OrgSize"]:
    top_country_df[each] = top_country_df[each].fillna(top_country_df[each].value_counts().index[0])

In [None]:
top_country_df.notnull().mean().sort_values(ascending = False)

In [None]:
# Arranging LanguageDesireNextYear

for lang in lang_list:

    top_country_df[lang] = top_country_df["LanguageDesireNextYear"].str.split(";")
    top_country_df[lang] = [lang in row for row in top_country_df[lang]]
    top_country_df[lang] = top_country_df[lang].replace({False: 0, True: 1}).astype(int)
    top_country_df.rename(columns={lang: "Desired_{}".format(lang)}, inplace = True)
top_country_df.drop(["LanguageDesireNextYear"], axis = 1, inplace = True)

In [None]:
# Arranging LanguageWorkedWith

for lang in lang_list:

    top_country_df[lang] = top_country_df["LanguageWorkedWith"].str.split(";")
    top_country_df[lang] = [lang in row for row in top_country_df[lang]]
    top_country_df[lang] = top_country_df[lang].replace({False: 0, True: 1}).astype(int)
    top_country_df.rename(columns={lang: "Worked_{}".format(lang)}, inplace = True)
top_country_df.drop(["LanguageWorkedWith"], axis = 1, inplace = True)

In [None]:
top_country_df.head()

<a id="Q3.1"></a>
### Q3.1 How is the distribution of the primary fields in top countries?

In [None]:
def get_pair_statistics(top_country_df, in_col_pair1, col_pair2, top_country_list):

        
    '''
    INPUTS:
    top_country_df - Dataframe in which the features to be investigated (as dataframe)
    in_col_pair1 - Name of the column in which data to be grouped (as string)
    col_pair2 - Name of the column which will be investigated in grouped data (as string)
    top_country_list - country list in order the reindex the data (as list)
    
    OUTPUTS:
    percents_df - Percentage table of col_pair2 in in_col_pair1 (as dataframe)
    ovrl_percents_df - Percentage table of col_pair2 in overall sum of col_pair2 for each in_col_pair1 (as dataframe)
    
    '''
    
    # Create grouped df
    grouped_df = top_country_df.groupby([in_col_pair1])[col_pair2]
    
    # Creating new df from groups in grouped df
    pair1_list = []
    series_list = []
    for a,b in grouped_df:
        pair1_list.append(a)
        series_list.append(b.value_counts())
    
    # Creating new df with series
    percents_df = pd.concat(series_list, axis = 1)

    # Updating column names
    percents_df.columns = pair1_list   
    
    # Copying new_df for relative percentage
    ovrl_percents_df = percents_df.copy()

    # Getting percentages (each column's counts divided by that column's total)
    percents_df = percents_df.div(percents_df.sum(axis = 0), axis = 1)*100
    
    # Getting relative percentages
    ovrl_percents_df = ovrl_percents_df.div(ovrl_percents_df.sum(axis = 1), axis = 0)*100

    # Switching columns and rows in both dfs
    percents_df = percents_df.transpose()
    ovrl_percents_df = ovrl_percents_df.transpose()

    # Reindex both dfs according to top_country_list
    percents_df = percents_df.reindex(top_country_list)
    ovrl_percents_df = ovrl_percents_df.reindex(top_country_list)
    
    return percents_df, ovrl_percents_df


In [None]:
undergradMajor_percent_in_countries_df, undergradMajor_ovrl_percent_in_countries_df = get_pair_statistics(top_country_df, "Country", "UndergradMajor", top_country_list)

In [None]:
undergradMajor_percent_in_countries_df

In [None]:
# Visualisation
undergradMajor_percent_in_countries_df.plot.bar(figsize = (20,4), width = 0.8, colormap= "Paired", edgecolor = "black", alpha = 0.9)
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("Primary Field of Study Percentage of Respondents in Top Countries (Field count in country / Respondent Sum in Country)", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.savefig("Primary Field of Study Percentage of Respondents in Top Countries")
plt.show()

In [None]:
undergradMajor_ovrl_percent_in_countries_df

In [None]:
# Visualisation
undergradMajor_ovrl_percent_in_countries_df.plot.bar(figsize = (20,4), width = 0.8, colormap= "Paired", edgecolor = "black", alpha = 0.9)
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("Primary Field of Study Overall Percentage of Respondents in Top Countries (Field count in country / Field's Overall Count)", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.savefig("Primary Field of Study Overall Percentage of Respondents in Top Countries")
plt.show()

* The primary field of study is computer science in all countries; in each country, roughly 60% or more of the respondents studied computer science. 
* From the 2nd graph, we can see that a big portion of the respondents from computer science and information systems, as well as from the fine arts and humanities disciplines, are in the US.
* In India, respondents are mostly from other engineering disciplines.
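
The two graphs above use two different normalisations of the same counts: the first divides by the country total, the second by the field total. A compact way to see the difference is `pd.crosstab` with its `normalize` argument. The sketch below is an illustration on toy data, not the function used in this notebook; the names `within_country` and `across_countries` are assumptions of mine.

```python
import pandas as pd

# Toy data: three respondents in country A, one in country B
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B"],
    "UndergradMajor": ["Computer science", "Computer science",
                       "Mathematics", "Computer science"],
})

# Share of each field within a country (each row sums to 100)
within_country = pd.crosstab(
    toy["Country"], toy["UndergradMajor"], normalize="index") * 100

# Share of each field's respondents that come from a country
# (each column sums to 100)
across_countries = pd.crosstab(
    toy["Country"], toy["UndergradMajor"], normalize="columns") * 100

print(within_country.round(1))
print(across_countries.round(1))
```

One difference from the tables above: `crosstab` reports 0 where a field does not occur in a country, whereas concatenating `value_counts` results leaves NaN there.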

<a id="Q3.2"></a>
### Q3.2 How is the distribution of the education level in top countries?

In [None]:
EdLevel_percent_in_countries_df, EdLevel_ovrl_percent_in_countries_df = get_pair_statistics(top_country_df, "Country", "EdLevel", top_country_list)
EdLevel_percent_in_countries_df

In [None]:
# Visualisation
EdLevel_percent_in_countries_df.plot.bar(figsize = (20,4), width = 0.8, colormap= "Paired", edgecolor = "black", alpha = 0.9)
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("Education Level Percentage of Respondents in Top Countries (E.Level Count in country / E.Level Total in Country)", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.savefig("Education Level Percentage of Respondents in Top Countries")
plt.show()

In [None]:
EdLevel_ovrl_percent_in_countries_df

In [None]:
# Visualisation
EdLevel_ovrl_percent_in_countries_df.plot.bar(figsize = (20,4), width = 0.8, colormap= "Paired", edgecolor = "black", alpha = 0.9)
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("Education Level Overall Percentage of Respondents in Top Countries (E.Level Count in country / E.Level Total)", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.savefig("Education Level Overall Percentage of Respondents in Top Countries")
plt.show()

* At least 70% of the respondents in India and Pakistan have a Bachelor's degree, which is the highest share among the countries.
* However, overall, the US holds a bigger portion of the Bachelor's degrees than India: roughly 30% of the Bachelor's degrees are in the US and 23% are in India.
* The US has the biggest portion (50%) of the respondents with a Master's degree; Canada ranks 2nd with 13%.
* Spain has the biggest portion (18%) of the respondents with a professional degree; Russia ranks 2nd with 13%.

<a id="Q3.3"></a>
### Q3.3 How is the distribution of the job satisfaction in top countries?

In [None]:
JobSat_percent_in_countries_df, JobSat_ovrl_percent_in_countries_df = get_pair_statistics(top_country_df, "Country", "JobSat", top_country_list)

In [None]:
JobSat_percent_in_countries_df

In [None]:
# Visualisation
JobSat_percent_in_countries_df.plot.bar(figsize = (20,4), width = 0.8, colormap= "Paired", edgecolor = "black", alpha = 0.9)
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("Job Satisfaction Percentage of Respondents in Top Countries (JobSat Count in country / JobSat Total in Country)", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.savefig("Job Satisfaction Percentage of Respondents in Top Countries")
plt.show()

In [None]:
JobSat_ovrl_percent_in_countries_df

In [None]:
# Visualisation
JobSat_ovrl_percent_in_countries_df.plot.bar(figsize = (20,4), width = 0.8, colormap= "Paired", edgecolor = "black", alpha = 0.9)
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("Job Satisfaction Overall Percentage of Respondents in Top Countries (JobSat Count in country / JobSat Total)", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.savefig("Job Satisfaction Overall Percentage of Respondents in Top Countries")
plt.show()

In [None]:
Empl_percent_in_countries_df, Empl_ovrl_percent_in_countries_df = get_pair_statistics(top_country_df, "Country", "Employment", top_country_list)

In [None]:
# Visualisation
Empl_percent_in_countries_df.plot.bar(figsize = (20,4), width = 0.8, colormap= "Paired", edgecolor = "black", alpha = 0.9)
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("Employment Percentage of Respondents in Top Countries (Employment Count in country / Employment Total in Country)", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.savefig("Employment Percentage of Respondents in Top Countries")
plt.show()

In [None]:
# Visualisation
Empl_ovrl_percent_in_countries_df.plot.bar(figsize = (20,4), width = 0.8, colormap= "Paired", edgecolor = "black", alpha = 0.9)
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("Employment Overall Percentage of Respondents in Top Countries (Employment Count in country / Employment Total)", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.savefig("Employment Overall Percentage of Respondents in Top Countries")
plt.show()

In [None]:
JobSeek_percent_in_countries_df, JobSeek_ovrl_percent_in_countries_df = get_pair_statistics(top_country_df, "Country", "JobSeek", top_country_list)

In [None]:
# Visualisation
JobSeek_percent_in_countries_df.plot.bar(figsize = (20,4), width = 0.8, colormap= "Paired", edgecolor = "black", alpha = 0.9)
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("JobSeek Percentage of Respondents in Top Countries (JobSeek Count in country / JobSeek Total in Country)", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.savefig("JobSeek Percentage of Respondents in Top Countries")
plt.show()

In [None]:
# Visualisation
JobSeek_ovrl_percent_in_countries_df.plot.bar(figsize = (20,4), width = 0.8, colormap= "Paired", edgecolor = "black", alpha = 0.9)
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("JobSeek Overall Percentage of Respondents in Top Countries (JobSeek Count in country / JobSeek Total)", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.savefig("JobSeek Overall Percentage of Respondents in Top Countries")
plt.show()

In [None]:
MainBranch_percent_in_countries_df, MainBranch_ovrl_percent_in_countries_df = get_pair_statistics(top_country_df, "Country", "MainBranch", top_country_list)

In [None]:
# Visualisation
MainBranch_percent_in_countries_df.plot.bar(figsize = (20,4), width = 0.8, colormap= "Paired", edgecolor = "black", alpha = 0.9)
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("MainBranch Percentage of Respondents in Top Countries (MainBranch Count in country / MainBranch Total in Country)", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.savefig("MainBranch Percentage of Respondents in Top Countries")
plt.show()

In [None]:
# Visualisation
MainBranch_ovrl_percent_in_countries_df.plot.bar(figsize = (20,4), width = 0.8, colormap= "Paired", edgecolor = "black", alpha = 0.9)
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("MainBranch Overall Percentage of Respondents in Top Countries (MainBranch Count in country / MainBranch Total)", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.savefig("MainBranch Overall Percentage of Respondents in Top Countries")
plt.show()

In [None]:
NEWEdImpt_percent_in_countries_df, NEWEdImpt_ovrl_percent_in_countries_df = get_pair_statistics(top_country_df, "Country", "NEWEdImpt", top_country_list)

In [None]:
# Visualisation
NEWEdImpt_percent_in_countries_df.plot.bar(figsize = (20,4), width = 0.8, colormap= "Paired", edgecolor = "black", alpha = 0.9)
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("Education Importance Percentage of Respondents in Top Countries (Answer Count in country / Answers Total in Country)", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.savefig("Education Importance Percentage of Respondents in Top Countries")
plt.show()

In [None]:
# Visualisation
NEWEdImpt_ovrl_percent_in_countries_df.plot.bar(figsize = (20,4), width = 0.8, colormap= "Paired", edgecolor = "black", alpha = 0.9)
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("Education Overall Importance Percentage of Respondents in Top Countries (Answer Count in country / Answers Total)", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.savefig("Education Overall Importance Percentage of Respondents in Top Countries")
plt.show()

In [None]:
OrgSize_percent_in_countries_df, OrgSize_ovrl_percent_in_countries_df = get_pair_statistics(top_country_df, "Country", "OrgSize", top_country_list)

In [None]:
# Visualisation
OrgSize_percent_in_countries_df.plot.bar(figsize = (20,4), width = 0.8, colormap= "Paired", edgecolor = "black", alpha = 0.9)
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("Organization Size Percentage of Respondents in Top Countries (OrgSize Count in country / Answers Total in Country)", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.show()

In [None]:
# Visualisation
OrgSize_ovrl_percent_in_countries_df.plot.bar(figsize = (20,4), width = 0.8, colormap= "Paired", edgecolor = "black", alpha = 0.9)
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("Organization Size Overall Percentage of Respondents in Top Countries (OrgSize Count in country / Answers Total)", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.show()

In [None]:
Hobbyist_percent_in_countries_df, Hobbyist_ovrl_percent_in_countries_df = get_pair_statistics(top_country_df, "Country", "Hobbyist", top_country_list)

In [None]:
# Visualisation
Hobbyist_percent_in_countries_df.plot.bar(figsize = (20,4), width = 0.8, colormap= "Paired", edgecolor = "black", alpha = 0.9)
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("Hobbyist Percentage of Respondents in Top Countries (Hobbyist Count in country / Answers Total in Country)", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.show()

In [None]:
# Visualisation
Hobbyist_ovrl_percent_in_countries_df.plot.bar(figsize = (20,4), width = 0.8, colormap= "Paired", edgecolor = "black", alpha = 0.9)
plt.legend(bbox_to_anchor=(1,1))
plt.xticks(rotation = 45, fontsize = 9)
plt.xlabel("Countries", fontsize = 12)
plt.title("Hobbyist Overall Percentage of Respondents in Top Countries (Hobbyist Count in country / Answers Total)", fontsize = 14)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.show()
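
The plotting cells in this section repeat the same styling calls and change only the DataFrame, the title and the file name. A small helper could remove that repetition. The function name `plot_percent_bars` and its parameters are hypothetical (they do not exist elsewhere in this notebook), and the demo values are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_percent_bars(percent_df, title, filename=None):
    """Draw a grouped bar chart with the styling used in this section (hypothetical helper)."""
    ax = percent_df.plot.bar(figsize=(20, 4), width=0.8, colormap="Paired",
                             edgecolor="black", alpha=0.9)
    ax.legend(bbox_to_anchor=(1, 1))
    ax.tick_params(axis="x", labelrotation=45, labelsize=9)
    ax.set_xlabel("Countries", fontsize=12)
    ax.set_title(title, fontsize=14)
    ax.grid(axis="both", color="gray", linewidth=0.1)
    if filename is not None:
        # Save only when a file name is given, as some cells above skip saving.
        ax.figure.savefig(filename)
    return ax


# Made-up percentages, only to show the call.
demo_df = pd.DataFrame({"Yes": [78.0, 80.5], "No": [22.0, 19.5]}, index=["US", "India"])
ax = plot_percent_bars(demo_df, "Hobbyist Percentage of Respondents (toy data)")
```

In the notebook, a call such as `plot_percent_bars(Hobbyist_percent_in_countries_df, "...")` followed by `plt.show()` would replace each eight-line plotting cell.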

<a id="Q4"></a>
## Q4: Which features affect job satisfaction the most?

* I will use the scikit-learn library to create a Logistic Regression model.
* Using the model coefficients, I will identify the features that have a negative or positive effect on job satisfaction.
* Before creating the model, I will prepare the data.

In [None]:
#numericals = top_country_df.select_dtypes(exclude = "object")
numericals = ["Age","Age1stCode","ConvertedComp","WorkWeekHrs","YearsCode","YearsCodePro"]
categoricals = ["Country","EdLevel","Employment","Hobbyist","UndergradMajor","JobSeek","MainBranch","NEWEdImpt","OrgSize"]

In [None]:
pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)

In [None]:
top_country_df.JobSat.value_counts()

In [None]:
# Combining JobSat column
# Delete "Neither satisfied nor dissatisfied"
# Combine "Very satisfied" and "Slightly satisfied", label as "Satisfied" -->1
# Combine "Very dissatisfied" and "Slightly dissatisfied", label as "Dissatisfied"-->0

# Delete rows "Neither satisfied nor dissatisfied"
df = top_country_df.drop(top_country_df[top_country_df.JobSat == "Neither satisfied nor dissatisfied"].index)

df.JobSat = [1 if each == "Very satisfied" else
             1 if each == "Slightly satisfied" else
             0 if each == "Very dissatisfied" else
             0 if each == "Slightly dissatisfied" else
             each for each in df.JobSat]
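
The same relabelling can be written with a lookup dictionary and `Series.map`, which keeps the label rules in one place. This is an alternative sketch on toy data, not the notebook's own code; the names `jobsat_labels` and `toy` are introduced here for illustration. One difference to be aware of: `map` turns any value missing from the dictionary into NaN, whereas the list comprehension above passes such values through unchanged.

```python
import pandas as pd

# Label rules described in the comments above: satisfied -> 1, dissatisfied -> 0.
jobsat_labels = {
    "Very satisfied": 1,
    "Slightly satisfied": 1,
    "Slightly dissatisfied": 0,
    "Very dissatisfied": 0,
}

# Toy column with one neutral answer and one missing answer.
toy = pd.DataFrame({"JobSat": ["Very satisfied",
                               "Neither satisfied nor dissatisfied",
                               "Slightly dissatisfied",
                               None]})

# Remove the neutral answers, then translate the remaining labels.
toy = toy[toy["JobSat"] != "Neither satisfied nor dissatisfied"].copy()
toy["JobSat"] = toy["JobSat"].map(jobsat_labels)

print(toy)
```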

In [None]:
# Drop every row that has a NaN in any column (not only ConvertedComp)
df = df.dropna()

In [None]:
# one hot encoding
df = pd.get_dummies(df,  columns = categoricals )
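
As a small illustration of what this step produces, the toy example below (made-up values) shows how `pd.get_dummies` replaces a categorical column with one indicator column per category, named `<column>_<value>`. This naming is why the coefficient chart later lists entries such as a country or an education level as separate features.

```python
import pandas as pd

# Toy frame with one numerical and one categorical column.
toy = pd.DataFrame({"Age": [25, 31, 40], "Country": ["US", "India", "US"]})

# "Country" is replaced by one indicator column per distinct value.
encoded = pd.get_dummies(toy, columns=["Country"])

print(list(encoded.columns))
```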

In [None]:
# Convert all rows except int and float to category
#for each in new_categoricals:
    #df[each] = df[each].astype("category")

In [None]:
#new_numericals = df.select_dtypes(include = ["int","float"]).columns
#new_categoricals = df.select_dtypes(include = "category").columns
#print("new_numericals:",new_numericals)
#print("new_categoricals:",new_categoricals)

In [None]:
# Normalization of numerical features
for each in numericals:
    df[each] = (df[each] - df[each].min()) / (df[each].max() - df[each].min())
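
The loop above applies min-max scaling, x' = (x - min) / (max - min), using the minimum and maximum of the whole data set. A common variant, sketched below on made-up numbers, takes the minimum and maximum from the training rows only and reuses them for the test rows, so that no information from the test set enters the preprocessing. The names `train` and `test` are illustrative and are not the notebook's variables.

```python
import pandas as pd

# Toy training and test rows for a single numerical feature.
train = pd.DataFrame({"WorkWeekHrs": [30.0, 40.0, 50.0]})
test = pd.DataFrame({"WorkWeekHrs": [35.0, 60.0]})

# Learn the scaling parameters from the training rows only.
col_min = train.min()
col_range = train.max() - train.min()

train_scaled = (train - col_min) / col_range
# Test values outside the training range fall outside [0, 1].
test_scaled = (test - col_min) / col_range

print(train_scaled)
print(test_scaled)
```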


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
# Split data into X and y
X = df.drop("JobSat", axis = 1)
y = df.JobSat

In [None]:
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)

In [None]:
# define the model
model = LogisticRegression()
# fit the model on the training set only, so the test set stays unseen
model.fit(X_train, y_train)
# get importance (one coefficient per column of X)
importance = model.coef_[0]

# make predictions for test data and evaluate
# (predict already returns class labels, so no rounding is needed)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
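
An accuracy figure is easier to judge next to a simple baseline. When most respondents are satisfied, a model that always predicts "Satisfied" already scores well. The sketch below computes that majority-class baseline on made-up labels; `y_train_toy` and `y_test_toy` are stand-ins for `y_train` and `y_test`.

```python
import pandas as pd

# Made-up labels: 1 = Satisfied, 0 = Dissatisfied.
y_train_toy = pd.Series([1, 1, 0, 1, 1, 0, 1, 1])
y_test_toy = pd.Series([1, 0, 1, 1])

# The most frequent class in the training labels.
majority_class = y_train_toy.mode()[0]

# Accuracy obtained by predicting that class for every test row.
baseline_accuracy = (y_test_toy == majority_class).mean()
print("Baseline accuracy: %.2f%%" % (baseline_accuracy * 100.0))
```

If the logistic regression accuracy is close to this baseline, the coefficients should be read with extra caution.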

In [None]:
results_df = pd.DataFrame()
results_df["Rates"] = importance.tolist()
results_df["Columns"] = X.columns

new_index = results_df.Rates.sort_values(ascending = False).index
sorted_results = results_df.reindex(new_index)
filtered_results = sorted_results[np.abs(sorted_results.Rates) > 0.1]

plt.figure(figsize =(10,11))
plt.barh(filtered_results.Columns, filtered_results.Rates)
plt.grid(axis="both", color="gray", linewidth=0.1)
plt.title("Features with Negative and Positive Effects on Job Satisfaction",fontsize = 14)
plt.savefig("Features with Negative and Positive Effects on Job Satisfaction")
plt.show()


* Features whose coefficient has an absolute value greater than 0.1 are shown, matching the filter in the code above.
* According to the graph:
    * The top 3 features negatively affecting job satisfaction are age, actively looking for a job, and the age at which the respondent started coding. Job satisfaction may decrease with age because personal expectations increase. As expected, respondents who are looking for a job tend to be dissatisfied. Respondents who started coding at an early age may be dissatisfied because they are well qualified in their profession, while the company they work for may fall short of their expectations. In the same way, satisfaction may decrease as the years of professional coding increase. Longer working hours are also associated with lower satisfaction.
    * Among the countries, the most dissatisfied are Israel, France, Poland, Spain, Brazil and Italy.
    * Respondents whose highest education is primary/elementary school are the most dissatisfied, whereas those with doctoral degrees are the most satisfied.
    * The most satisfied countries are Pakistan, Sweden, the US, Canada and Australia.
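
One way to make the coefficient sizes easier to read is to convert them to odds ratios with `exp(coefficient)`: a value above 1 means the odds of being satisfied go up, a value below 1 means they go down. For a min-max scaled numerical feature, the ratio compares the feature's maximum with its minimum; for a one-hot column, it compares rows where the indicator is 1 with rows where it is 0. The sketch below uses made-up coefficients and hypothetical feature names, not results from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up coefficients in the same layout as results_df (Columns, Rates).
toy_results = pd.DataFrame({
    "Columns": ["feature_a", "feature_b", "feature_c"],
    "Rates": [0.40, 0.0, -0.70],
})

# exp(coefficient) is the multiplicative change in the odds of "Satisfied".
toy_results["OddsRatio"] = np.exp(toy_results["Rates"])

print(toy_results.round(3))
```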

<a id="5"></a>
## Conclusion

Having the top respondent density and a high positive effect ratio on satisfaction, Sweden seems to have the most satisfied developers.<br>

Although Pakistan has the greatest positive effect ratio on job satisfaction, it also has the lowest respondent density, so I cannot conclude that it has the most satisfied developers.<br>

We can conclude that the most dissatisfied developers are from Israel, because its respondent density is relatively high and it has the largest negative effect ratio among the countries.<br>

As for programming languages, Python and TypeScript are gaining popularity; however, JavaScript is still the most used language.