# Content

Hey, thanks for viewing my Kernel!

If you like my work, please leave an upvote. It would be really appreciated and would motivate me to offer more content to the Kaggle community! :)

This notebook analyzes the 2019 Kaggle survey on data science and machine learning. The survey was live for three weeks in October and closed with 19,717 responses. The objective here is to find what distinguishes data scientists with high salaries from those with low salaries. The work has three parts:

**Part 1:** Data Cleaning

**Part 2:** Data Analysis

**Part 3:** Results

# Data Cleaning

The objective is to find the difference between data scientists with high and low salaries. We therefore drop the rows whose job title (Q5) is 'Student' or 'Not employed'. After that, we plot the distribution of the main demographic columns to get a first visual impression of the data.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import warnings


warnings.simplefilter("ignore")
sns.set(font_scale=1)

df = pd.read_csv("../input/kaggle-survey-2019/multiple_choice_responses.csv", low_memory = False)
df = df[~df['Q5'].isin(["Student", "Not employed"])]

In [None]:
df.head()

In [None]:
def plot_distirubution(df, cols, normalize=True):
    df_survey_result = df.loc[1:, :]
    labels = df.loc[0, cols].values
    
    n_cols = 2
    nrows = round(len(cols) / n_cols)
    fig, axes = plt.subplots(nrows, n_cols, figsize=(24, 12))
    plt.subplots_adjust(hspace=0.4)
    
    index = 0
    for row in range(nrows):
        for col in range(n_cols):
            df_count = df_survey_result[cols[index]].value_counts(normalize=True)
            df_count = df_count.mul(100)
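            # Name the index explicitly so the 'index' column exists regardless of pandas version
            df_count = df_count.rename('percent').rename_axis('index').reset_index()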
            
            df_count = df_count[df_count['percent'] > 1]

            g = sns.barplot(y='index', x='percent', data=df_count, ax=axes[row][col])
            axes[row][col].set_xlim(0,100)
            axes[row][col].set_title(labels[index])

            for p in axes[row][col].patches:
                txt = str(p.get_width().round(2)) + '%'
                #txt_x = p.get_x() 
                #txt_y = p.get_height()
                txt_x = p.get_width() 
                txt_y = p.get_y() + p.get_height() / 2
                axes[row][col].text(txt_x,txt_y,txt)
            
            index += 1

In [None]:
# Q1 age, Q2 gender, Q3 country, Q4 education, Q5 job title, Q10 income
cols = ['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q10']
plot_distirubution(df, cols)

# Data Analysis

The compensation column (Q10) stores salary ranges as strings. We therefore first clean these strings and convert each range to a number, the midpoint of the range, stored in `compensation_num`. We then split the values into three groups (low, medium, high) and store the labels in the `compensation_num_group` column.
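
The cleaning step described above can be sketched on a few toy values. This is only an illustration: `range_midpoint` is a hypothetical helper, the sample strings are made up in the style of the Q10 answers, and the open-ended top bracket is treated as its lower bound, which is an assumption of this sketch rather than what the notebook does.

```python
import pandas as pd


def range_midpoint(label):
    """Return the midpoint of a salary range such as '1,000-1,999' (hypothetical helper)."""
    # Strip currency symbols, thousands separators and the '>' of the open-ended bracket
    cleaned = label.replace('$', '').replace(',', '').replace('>', '').strip()
    parts = [int(p) for p in cleaned.split('-')]
    # A single number (open-ended bracket) is its own midpoint in this sketch
    return sum(parts) / len(parts)


# Toy values, not survey results
toy = pd.Series(['$0-999', '1,000-1,999', '10,000-14,999',
                 '50,000-59,999', '100,000-124,999', '> $500,000'])

midpoints = toy.map(range_midpoint)
# Three quantile-based groups, as in the notebook
groups = pd.qcut(midpoints, 3, labels=['low', 'medium', 'high'])

print(pd.DataFrame({'range': toy, 'midpoint': midpoints, 'group': groups}))
```

Because `pd.qcut` cuts at quantiles, each group receives roughly the same number of respondents, whatever the absolute salary values are.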

After that, we score each question by how differently the low-income and high-income groups answered it, using the `categorical_distribution_diff` function, and plot the questions in order of that score. The first graph shows the question with the largest difference and the last graph shows the question with the smallest.
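
To make the idea of a distribution difference concrete, here is a minimal sketch on toy data. It uses the total variation distance on normalized counts as a stand-in; it is not the notebook's `categorical_distribution_diff`, which works on raw counts. The function name and the sample answers are assumptions made for illustration.

```python
import pandas as pd


def total_variation_distance(x, y):
    """Half the sum of absolute differences between two categorical distributions."""
    p = x.value_counts(normalize=True)
    q = y.value_counts(normalize=True)
    # fill_value=0 treats a category missing from one side as having probability 0
    return p.sub(q, fill_value=0).abs().sum() / 2


# Toy answers for one question, split by income group (not survey results)
low = pd.Series(['India', 'India', 'India', 'Brazil'])
high = pd.Series(['USA', 'USA', 'India', 'Germany'])

score = total_variation_distance(low, high)
print(score)
```

A score of 0 means both groups answered identically and a score of 1 means they share no answers, so questions can be ranked by this value in the same spirit as the ranking used here.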

In [None]:
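# regex=False treats '$' as a literal character, not as a regex end-of-string anchor
df['compensation_num'] = df['Q10'].str.replace('$', '', regex=False)
df['compensation_num'] = df['compensation_num'].str.replace(',', '', regex=False)
# Write the open-ended top bracket as a degenerate range so its midpoint below is 600000
df['compensation_num'] = df['compensation_num'].str.replace('> 500000', '600000-600000', regex=False)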

quenstion_dict = {}
for index, value in enumerate(df.loc[0, :]):
    quenstion_dict[df.columns[index]] = value

df['low_compensation_num'] = df.loc[1:, 'compensation_num'].str.split('-').str[0]
df['high_compensation_num'] = df.loc[1:, 'compensation_num'].str.split('-').str[1]

df['low_compensation_num'] = df['low_compensation_num'].fillna(-1)
df['high_compensation_num'] = df['high_compensation_num'].fillna(-1)

df['low_compensation_num'] = df['low_compensation_num'].astype(int)
df['high_compensation_num'] = df['high_compensation_num'].astype(int)

df['compensation_num'] = (df['low_compensation_num'] + df['high_compensation_num']) / 2 
df = df[df['compensation_num'] != -1]

df = df.drop(columns=['low_compensation_num', 'high_compensation_num'])
df['compensation_num_group'] = pd.qcut(df['compensation_num'], 3, labels=["low", "medium", "high"])

print(df['compensation_num_group'].value_counts())

In [None]:
def group_cols(df):
    cols = df.columns
    
    col_part = []
    for col in cols:
        if '_' in col:
            col_part.append(col)
    
    cols_1 = list(set(cols) - set(col_part))
    
    temp_df = pd.DataFrame(col_part)
    temp_df['question'] = temp_df[0].str.split('_').str[0]
    temp_group = temp_df.groupby('question')[0]
    
    cols_2 = []
    for name, group in temp_group:
        if len(group) > 1:
            cols_2.append(list(group.values))
    
    return list(cols_1 + cols_2)


def categorical_distribution_diff(x, y):
    x_counts = x.value_counts()
    y_counts = y.value_counts()
    
    total_diff = 0
    for index in x_counts.index:
        try:
            diff = abs(x_counts[index] - y_counts[index])
            total_diff += diff
        except (KeyError, IndexError):  # category is absent from y
            total_diff += x_counts[index]
    
    for index in y_counts.index:
        try:
            diff = abs(x_counts[index] - y_counts[index])
            total_diff += diff
        except (KeyError, IndexError):  # category is absent from x
            total_diff += y_counts[index]
    
    return total_diff / 2

In [None]:
def find_distribution_diff(df, questions, target_col):
    df_survey_result = df.loc[1:, :]
    
    low_df = df_survey_result[df_survey_result[target_col] == 'low']
    high_df = df_survey_result[df_survey_result[target_col] == 'high']
    
    score_list = []
    for col in questions:
        if type(col) is str:
            dist_diff = categorical_distribution_diff(low_df[col], high_df[col])
            score_list.append(dist_diff)
        else:
            total_diff = 0
            for q in col:
                if 'TEXT' in q:
                    continue
                dist_diff = categorical_distribution_diff(low_df[q], high_df[q])
                total_diff += dist_diff
            # Score a multi-part question by the summed difference over all of its parts
            score_list.append(total_diff)
    
    score_np = np.array(score_list)
    sort_index = np.argsort(score_np)[::-1]
    
    score_cols = {}
    for order in sort_index:
        if type(questions[order]) is str:
            score_cols[questions[order]] = questions[order], score_np[order]
        else:
            question_list = []
            for q in questions[order]:
                if 'TEXT' in q:
                    continue
                question_list.append(q)
            if len(question_list) <= 1:
                continue
            score_cols[questions[order][0].split('_')[0]] = question_list, score_np[order]
    
    return score_cols


def long_sentences_seperate(sentence, step=10):
    try:
        splittext = sentence.split(" ")
        for x in range(step, len(splittext), step):
            splittext[x] = "\n"+splittext[x].lstrip()
        text = " ".join(splittext)
        return text
    except AttributeError:  # non-string input such as NaN has no .split
        return sentence

In [None]:
def plot_point_salary(df, target='', country='all'):
    if country != 'all':
        temp_df = df[df['Q3'] == country]
        if len(temp_df) <= 0:
            print('There is no such a country in dataset like ' + str(country) + ' !!!')
            print('Here is the countries list: ', df['Q3'].unique())
            return 0
    else:
        temp_df = df.copy()
        
    x_list = ['Q4', 'Q15', 'Q15', 'Q15', 'Q15']
    y_list = ['Q2', 'Q1', 'Q4', 'Q5', 'Q19']
    
    fig, ax = plt.subplots(len(x_list))
    plt.close(fig)
    ax_index = 0
    for x, y in zip(x_list, y_list):
        x_values = temp_df[x].unique()
        counts_list = []
        for target_value in ['low', 'medium', 'high']:
            ttemp_df = temp_df[temp_df[target] == target_value]
            for value_name_x in x_values:
                tttemp_df = ttemp_df[ttemp_df[x] == value_name_x]
                counts_y = tttemp_df[y].value_counts()
                for value_name_y in counts_y.index:
                    condition_1 = np.logical_and(temp_df[y]==value_name_y, temp_df[x]==value_name_x)
                    condition_2 = np.logical_and(condition_1, temp_df[target]==target_value)
                    temp_df.loc[condition_2, 'size'] = counts_y.loc[value_name_y]
                    counts_list.append(counts_y.loc[value_name_y])
        
        temp_df[y] = temp_df[y].apply(lambda x: long_sentences_seperate(x, step=4))
        temp_df[x] = temp_df[x].apply(lambda x: long_sentences_seperate(x, step=4))
        
        minsize = min(counts_list)
        maxsize = max(counts_list)
        
        chart = sns.relplot(data=temp_df, y=y, x=x, col=target, hue='size', hue_norm=(minsize,maxsize), 
                            size='size', size_norm=(1, 1000), sizes=(minsize,maxsize), kind='scatter', 
                            legend='brief', ax=ax[ax_index])
        ax_index += 1
        for axes in chart.axes.flat:
            _ = axes.set_xticklabels(axes.get_xticklabels(), rotation=90)
    plt.show()

In [None]:
def plot_treemap_salary(df, target='', country='all'):
    if country != 'all':
        temp_df = df[df['Q3'] == country]
        if len(temp_df) <= 0:
            print('There is no such a country in dataset like ' + str(country) + ' !!!')
            print('Here is the countries list: ', df['Q3'].unique())
            return 0
    else:
        temp_df = df.copy()
    
    temp_df['Q15'] = temp_df['Q15'].fillna('None')
    temp_df['Q11'] = temp_df['Q11'].fillna('None')
    
    x_list = ['Q2', 'Q3', 'Q5']
    y_list = ['Q4', 'Q1', 'Q15']
    z_list = ['Q5', 'Q6', 'Q11']
    title_list = ['Genders', 'Countries', 'Job Titles', 'Courses']
    for title, x, y, z in zip(title_list, x_list, y_list, z_list):
        x_values = temp_df[x].unique()
        z_values = temp_df[z].unique()
        counts_list = []
        for target_value in ['low', 'medium', 'high']:
            ttemp_df = temp_df[temp_df[target] == target_value]
            for value_name_z in z_values:
                tttemp_df = ttemp_df[ttemp_df[z] == value_name_z]
                for value_name_x in x_values:
                    ttttemp_df = tttemp_df[tttemp_df[x] == value_name_x]
                    counts_y = ttttemp_df[y].value_counts()
                    for value_name_y in counts_y.index:
                        condition_1 = np.logical_and(temp_df[y]==value_name_y, temp_df[x]==value_name_x)
                        condition_2 = np.logical_and(condition_1, temp_df[target]==target_value)
                        condition_3 = np.logical_and(condition_2, temp_df[z]==value_name_z)
                        temp_df.loc[condition_3, 'size'] = counts_y.loc[value_name_y]
                        counts_list.append(counts_y.loc[value_name_y])
        
        fig = px.treemap(temp_df, path=[px.Constant(title), x, y, z], values='size',
                      color='compensation_num', color_continuous_scale='RdBu')
        #fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
        fig.show()

In [None]:
def plot_parallel_categories_salary(df, country='all'):
    if country != 'all':
        temp_df = df[df['Q3'] == country]
        if len(temp_df) <= 0:
            print('There is no such a country in dataset like ' + str(country) + ' !!!')
            print('Here is the countries list: ', df['Q3'].unique())
            return 0
    else:
        temp_df = df.copy()
    
    temp_df['compensation_num_factorized'] = pd.qcut(temp_df['compensation_num'], 3, labels=[1, 2, 3])
    
    #################################################################
    
    fig = px.parallel_categories(temp_df, dimensions=['Q4', 'Q2', 'Q15'], 
                                 color="compensation_num_factorized", color_continuous_scale='RdBu',
                                labels={'Q4':'Educations', 'Q2':'Genders', 'Q15':'Experiences'})
    fig.show()
    
    #################################################################
    
    ttemp_df = temp_df[temp_df['Q3'].isin(['France', 'India', 'United States of America', 'Germany', 'Russia', 'Japan',
                                        'Other', 'Brazil', 'Canada', 'Spain', 'United Kingdom of Great Britain and Northern Ireland'])]
    
    fig = px.parallel_categories(ttemp_df, dimensions=['Q11', 'Q3', 'Q5'], 
                                 color="compensation_num_factorized", color_continuous_scale='RdBu',
                                labels={'Q11':'Spending Money For ML', 'Q3':'Countries', 'Q5':'Job Titles'})
    fig.show()
    
    #################################################################
    
    def data_preparation(df_indef, col_names, targets):
        df_list = []
        y_label = col_names[0].split('_')[0]
        df_cols = targets.copy()
        df_cols.append(y_label)
        for col_n in col_names:
            df_part = pd.DataFrame(columns=df_cols)
            df_part[targets] = df_indef[targets]
            df_part[y_label] = df_indef[col_n]
            df_list.append(df_part)
        df_all = pd.concat(df_list)
        return df_all
    courses = ['Q13_Part_1', 'Q13_Part_2', 'Q13_Part_3', 'Q13_Part_4', 'Q13_Part_5', 'Q13_Part_6', 'Q13_Part_7', 'Q13_Part_8', 
               'Q13_Part_9', 'Q13_Part_10', 'Q13_Part_11', 'Q13_Part_12']
    target_list = ['Q14', 'Q1', 'compensation_num_factorized']
    ttemp_df = data_preparation(temp_df, courses, target_list)
    
    ttemp_df = ttemp_df.dropna()
    fig = px.parallel_categories(ttemp_df, dimensions=['Q13', 'Q14', 'Q1'], 
                                 color="compensation_num_factorized", color_continuous_scale='RdBu',
                                labels={'Q13':'Courses', 'Q14':'ML Tools', 'Q1':'Ages'})
    fig.show()
    
    #################################################################
    
    ml_models = ['Q28_Part_1', 'Q28_Part_2', 'Q28_Part_3', 'Q28_Part_4', 'Q28_Part_5', 'Q28_Part_6', 'Q28_Part_7', 'Q28_Part_8', 
                 'Q28_Part_9', 'Q28_Part_10', 'Q28_Part_11', 'Q28_Part_12']
    target_list = ['Q14', 'Q1', 'compensation_num_factorized']
    ttemp_df = data_preparation(temp_df, ml_models, target_list)
    
    ttemp_df = ttemp_df.dropna()
    fig = px.parallel_categories(ttemp_df, dimensions=['Q28', 'Q14', 'Q1'], 
                                 color="compensation_num_factorized", color_continuous_scale='RdBu',
                                labels={'Q28':'ML Models', 'Q14':'ML Tools', 'Q1':'Ages'})
    fig.show()

In [None]:
plot_parallel_categories_salary(df, country='all')

plot_treemap_salary(df, target='compensation_num_group', country='all')

plot_point_salary(df, target='compensation_num_group', country='all')

In [None]:
def plot_salary(df, score_cols, quenstion_dict, target='', country='all'):
    n_cols = 2
    keys = list(score_cols.keys())
    keys.remove('Time from Start to Finish (seconds)')
    keys.remove('compensation')
    keys.remove('Q10')
    nrows = round(len(keys) / n_cols)
    fig, axes = plt.subplots(nrows, n_cols, figsize=(16, 288))
    plt.subplots_adjust(hspace=0.15, wspace=0.5)    
    
    index = 0
    for row in range(nrows):
        for col in range(n_cols):
            col_name, score = score_cols[keys[index]]
            
            if country != 'all':
                temp_df = df[df['Q3'] == country]
                if len(temp_df) <= 0:
                    print('There is no such a country in dataset like ' + str(country) + ' !!!')
                    print('Here is the countries list: ', df['Q3'].unique())
                    return 0
            else:
                temp_df = df.copy()
                    
            if type(col_name) is str:
                df_count = temp_df[col_name].value_counts(normalize=True)
                df_count = df_count.mul(100)
                df_count.index.rename(col_name, inplace=True)
                df_count = df_count.rename('percent').reset_index()
                if col_name != 'Q3':
                    df_count = df_count[df_count['percent'] > 1]
                else:
                    df_count = df_count[df_count['percent'] > 2]
                temp_df = temp_df[temp_df[col_name].isin(df_count[col_name])]
                
                temp_df[col_name] = temp_df[col_name].apply(lambda x: long_sentences_seperate(x, step=3))
                plot_order = temp_df[col_name].value_counts(normalize=True).index
                sns.countplot(data=temp_df, y=col_name, hue=target, order=plot_order, ax=axes[row][col])
                #title_name = title_name + ' - Score: ' + str(score)
                title_name_new = long_sentences_seperate(quenstion_dict[col_name])
                axes[row][col].set_title(title_name_new)
                
                percentages = []
                for plot_index in plot_order:
                    df_percentages = temp_df[temp_df[col_name] == plot_index][target].value_counts(normalize=True)
                    percentages.extend(df_percentages.loc[['low', 'medium', 'high']].values)
                
                percentages_y = []
                for p in axes[row][col].patches:
                    txt_y = p.get_y()
                    percentages_y.append(txt_y)
                percentages_y = np.array(percentages_y)
                percentages_order = percentages_y.argsort()
                
                for p_index in range(0, len(axes[row][col].patches)):
                    p = axes[row][col].patches[percentages_order[p_index]]
                    txt = str(round(percentages[p_index] * 100, 2)) + '%'
                    txt_x = p.get_width() 
                    txt_y = p.get_y() + p.get_height() / 2
                    axes[row][col].text(txt_x,txt_y,txt)
            else:
                df_list = []
                y_label = col_name[0].split('_')[0]
                for col_n in col_name:
                    df_part = pd.DataFrame(columns=[target, y_label])
                    df_part[target] = temp_df[target]
                    df_part[y_label] = temp_df[col_n]
                    df_list.append(df_part)
                df_all = pd.concat(df_list)
                
                df_all[y_label] = df_all[y_label].apply(lambda x: long_sentences_seperate(x, step=3))
                plot_order = df_all[y_label].value_counts(normalize=True).index
                sns.countplot(data=df_all, y=y_label, hue=target, order=plot_order, ax=axes[row][col])
                title_name = quenstion_dict[col_name[0]]
                #title_name = title_name + ' - Score: ' + str(score)
                title_name_new = long_sentences_seperate(title_name)
                axes[row][col].set_title(title_name_new)
                
                percentages = []
                for plot_index in plot_order:
                    df_percentages = df_all[df_all[y_label] == plot_index][target].value_counts(normalize=True)
                    percentages.extend(df_percentages.loc[['low', 'medium', 'high']].values)
                
                percentages_y = []
                for p in axes[row][col].patches:
                    txt_y = p.get_y()
                    percentages_y.append(txt_y)
                percentages_y = np.array(percentages_y)
                percentages_order = percentages_y.argsort()
                
                for p_index in range(0, len(axes[row][col].patches)):
                    p = axes[row][col].patches[percentages_order[p_index]]
                    txt = str(round(percentages[p_index] * 100, 2)) + '%'
                    txt_x = p.get_width() 
                    txt_y = p.get_y() + p.get_height() / 2
                    axes[row][col].text(txt_x,txt_y,txt)
            index += 1

In [None]:
questions = group_cols(df)

score_cols = find_distribution_diff(df, questions, 'compensation_num_group')

sns.set(font_scale=1.2)

plot_salary(df, score_cols, quenstion_dict, target='compensation_num_group', country='all')

# Results

According to the order of the count plots, country is the most important feature for a high salary. The **USA, Germany**, and **Canada** are the countries where high salaries for data scientists are most common.
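
One way to check a statement like this numerically is a row-normalized cross-tabulation of country against income group. The sketch below runs on a made-up toy frame that only reuses the notebook's column names (`Q3`, `compensation_num_group`); the numbers are illustrative and are not survey results.

```python
import pandas as pd

# Toy data reusing the notebook's column names (values are invented for illustration)
toy = pd.DataFrame({
    'Q3': ['India', 'India', 'India',
           'United States of America', 'United States of America', 'Germany'],
    'compensation_num_group': ['low', 'low', 'medium', 'high', 'high', 'high'],
})

# Share of each income group within every country (each row sums to 1)
share = pd.crosstab(toy['Q3'], toy['compensation_num_group'], normalize='index')

# Countries ranked by the share of respondents in the high-income group
high_share = share['high'].sort_values(ascending=False)
print(high_share)
```

Applied to the cleaned survey frame instead of `toy`, the same two lines would give the per-country share of high earners behind the ranking discussed above.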

The second most important feature is experience. A data scientist who has been working for **5 years in data analysis** and for **3–4 years in machine learning** counts as experienced. Age tracks experience closely, so both features effectively represent experience.

Spending money on machine learning is the third most important feature. Someone who has spent over 1,000 dollars on machine learning can earn a higher salary than other data scientists. The most significant point in this plot, however, is that many data scientists have high salaries without spending any money on machine learning. This shows the importance of **free resources**.

The fourth most important feature is the company the data scientist works for. A company that has had machine learning models **in production for more than 2 years**, has **20 people** responsible for data science, and has more than **10,000 employees** tends to pay data scientists a high salary.

Job title and education are other important features. A data scientist with a **Product/Project Manager** job title and a **Doctoral degree or Master's degree** earns a high salary. These features are tied to experience, because becoming a Product/Project Manager or earning a Doctoral or Master's degree takes time.

Programming languages and databases are also important features. The most common programming language is Python, but **SQL** is more strongly associated with a high salary, and databases in general are strong indicators of high compensation. Based on this result, we can say that "all good data scientists have to know databases".

There are many online courses for data science. The most common platform for data science courses is **Coursera**, while **Fast.ai** has the highest share of high salaries.

Machine learning models are used for making decisions. The most important machine learning model is XGBoost, which is also the most common ML model on **Kaggle**.

Cloud computing and data analysis tools are also important features. **AWS** is the most important tool for data science.

                                                         ...

This analysis covers all countries, but the results can change from country to country. If you are interested in the relationship between high compensation and data science talent in your own country, you can copy & edit this [notebook](https://www.kaggle.com/hasanbasriakcay/difference-between-500-225-000-income-in-ds). You only need to change the `country` argument passed to the `plot_salary` function.
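
As a rough sketch of what the country switch does, the plotting functions keep only the rows whose `Q3` value matches the requested country. The helper below mirrors that filter on a toy frame; `filter_by_country` is a hypothetical name, and raising an error for an unknown country is a choice made for this sketch (the notebook's functions print a message instead).

```python
import pandas as pd


def filter_by_country(df, country='all'):
    """Return the rows for one country, or a copy of everything for 'all' (hypothetical helper)."""
    if country == 'all':
        return df.copy()
    subset = df[df['Q3'] == country]
    if subset.empty:
        # Fail loudly and list the valid names so a typo is easy to spot
        raise ValueError(f"No such country: {country!r}. Available: {sorted(df['Q3'].unique())}")
    return subset


# Toy data reusing the notebook's column names (values are invented for illustration)
toy = pd.DataFrame({
    'Q3': ['India', 'Germany', 'India', 'Canada'],
    'compensation_num_group': ['low', 'high', 'medium', 'high'],
})

india = filter_by_country(toy, 'India')
print(india)
```

In the notebook itself, the equivalent change is passing a country name such as `country='India'` instead of `country='all'`; the name has to match a value of the `Q3` column exactly.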
