# Stack Overflow Survey results from 2017 to 2020

> In this project I will retrieve the last four [Stack Overflow Survey results](https://insights.stackoverflow.com/survey). 

EDA on 2011-2020 data coming soon


## Business Questions

- How has an individual's job satisfaction changed over the years?

- What are the changes in salaries and job satisfaction for data scientists?
 
- Do race and gender, or education level, matter more to an individual's salary?



## Data Understanding

 > To complete this project we need to collect the results of the last four years and then follow a data wrangling process. 

#### Data wrangling consists of: <br>
>1- Gathering data <br>
>2- Assessing data <br>
>3- Cleaning data <br>
>4- Storing, analyzing, and visualizing wrangled data <br><br>

#### Gathering data

In [None]:
#Importing packages 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
import plotly.graph_objects as go

In [None]:
#Stack Overflow Survey results from 2017 to 2020 
df_17 = pd.read_csv('../input/stackoverflow-datasets-2011-to-present/2017 Stack Overflow Survey Responses.csv')
df_18 = pd.read_csv('../input/stackoverflow-datasets-2011-to-present/2018 Stack Overflow Survey Responses.csv',low_memory=False)
df_19 = pd.read_csv('../input/stackoverflow-datasets-2011-to-present/2019 Stack Overflow Survey Responses.csv')
df_20 = pd.read_csv('../input/stackoverflow-datasets-2011-to-present/2020 Stack Overflow Survey Responses.csv')

In [None]:
#creating a column that contains the survey year
#later these data sets will be combined
df_17['year'] = 2017
df_18['year'] = 2018
df_19['year'] = 2019
df_20['year'] = 2020


#### Assessing data
> Looking through each set to decide on the main features that can be used in the analysis.

In [None]:
#2017 data set 
df_17.head()

In [None]:
#2018 
df_18.head()

In [None]:
#2019 
df_19.head()

In [None]:
#2020 
df_20.head()

#### Quality Issues 
> There are some issues that need to be fixed.


- Rename columns so the data sets can be compared easily.
- Handle missing values.
- Select the interesting columns.


#### Selecting interesting columns
Since the goal of this project is to compare the four years of data to gain insights, only the columns common to all four years are useful.<br> 

The selected columns are :
- Programming as a Hobby
- Country
- Employment Status
- Have Worked Language
- Want Work Language
- Have Worked Database
- Want Work Database
- Have Worked Platform
- Want Work Platform
- Formal Education
- Years Coded Job
- Years Coded JobPast
- Job Satisfaction
- Developer Type
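
One way to find the columns shared by every year is a set intersection over the column names, taken after the renames. The sketch below runs on two made-up frames; the row values and the two-frame setup are assumptions for illustration, while the `ProgramHobby`/`Hobby` to `Hobbyist` renames follow the notebook.

```python
import pandas as pd

# Hypothetical miniature survey frames; the real ones have many more columns.
frames = {
    2017: pd.DataFrame({"Country": ["US"], "JobSat": [7], "ProgramHobby": ["No"]}),
    2018: pd.DataFrame({"Country": ["DE"], "JobSat": ["Slightly satisfied"], "Hobby": ["Yes"]}),
}

# Harmonise the names first, otherwise a renamed question looks like a missing column.
frames[2017] = frames[2017].rename(columns={"ProgramHobby": "Hobbyist"})
frames[2018] = frames[2018].rename(columns={"Hobby": "Hobbyist"})

# Columns present in every year.
common = set.intersection(*(set(f.columns) for f in frames.values()))
print(sorted(common))
```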


## Data Cleaning and Preparation


### ` Developer Survey Results 2017`

> Most of the columns needed to be renamed.

In [None]:
#Rename columns of 2017
df_17.rename(columns={'ProgramHobby': "Hobbyist", 'FormalEducation': 'EdLevel',
                      'HaveWorkedLanguage':'LanguageWorkedWith','YearsProgram':'YearsCode',
                      'WantWorkLanguage':'LanguageDesireNextYear','YearsCodedJob':'YearsCodePro', 
                    'HaveWorkedDatabase':'DatabaseWorkedWith','WantWorkDatabase':'DatabaseDesireNextYear',
                    'HaveWorkedPlatform':'PlatformWorkedWith','WantWorkPlatform':'PlatformDesireNextYear',
                      'JobSatisfaction':'JobSat', 'EmploymentStatus':'Employment','Salary':'ConvertedComp',
                      'DeveloperType':'DevType','Race':'Ethnicity','MajorUndergrad':'UndergradMajor'},inplace = True)

In [None]:
#selecting columns
df_17 = df_17[['Respondent', 'Hobbyist', 'Country', 'Employment', 'EdLevel',
       'YearsCode', 'YearsCodePro', 'DevType', 'JobSat', 'LanguageWorkedWith',
       'LanguageDesireNextYear', 'DatabaseWorkedWith',
       'DatabaseDesireNextYear', 'PlatformWorkedWith',
       'PlatformDesireNextYear', 'Gender','year','ConvertedComp','Ethnicity','UndergradMajor']]

In [None]:
#Test the change
df_17.head()

##### Jobsat

In [None]:
# All the other survey results rank job satisfaction with categories instead of numbers.
# Here JobSat is numeric, so it must be changed into the same categories.
# Replace NaN with 0; the function below maps 0 to the neutral category.
df_17['JobSat']=df_17['JobSat'].fillna(0)
#change the type
df_17['JobSat']= df_17['JobSat'].astype(int)
def f(column):
    '''
    INPUT - 
            column - pandas Series - the numeric job satisfaction scores
    OUTPUT - 
            list containing the satisfaction category that matches each score
    '''
    JobSat = []
    for x in column: 
        if int(x) >= 1 and int(x) < 2:
            JobSat.append('Very dissatisfied')
        elif int(x) >= 2 and  int(x) <4:
            JobSat.append('Slightly dissatisfied')
        elif int(x) >= 4  and int(x) < 6:
            JobSat.append('Slightly satisfied')
        elif int(x) >= 6 :
            JobSat.append('Very satisfied')
        else :
            JobSat.append('Neither satisfied nor dissatisfied')
    return JobSat
    
    
df_17['JobSat']=f(df_17['JobSat']) 

In [None]:
#test
df_17['JobSat'].unique()
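
The same binning can be sketched with `pd.cut`, which keeps the score ranges in one place. This is an alternative under the assumption that the scores are integers from 0 to 10; the bin edges below are chosen to reproduce the thresholds of `f`, and the sample scores are made up.

```python
import pandas as pd

# Made-up scores covering every branch of the loop-based function.
scores = pd.Series([0, 1, 2, 3, 4, 5, 6, 10])

# Right-inclusive bins: (-1, 0], (0, 1], (1, 3], (3, 5], (5, 10]
bins = [-1, 0, 1, 3, 5, 10]
labels = [
    "Neither satisfied nor dissatisfied",  # 0, which also holds the filled NaN values
    "Very dissatisfied",                   # 1
    "Slightly dissatisfied",               # 2-3
    "Slightly satisfied",                  # 4-5
    "Very satisfied",                      # 6-10
]

binned = pd.cut(scores, bins=bins, labels=labels)
print(binned.value_counts())
```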

In [None]:
#unique values of hobbyist
df_17['Hobbyist'].unique()

> We can reduce the answers to Yes or No to simplify our data.

In [None]:
# create an empty list
Hobbyist = []

for x in df_17['Hobbyist']: 
    # only a plain 'No' means the respondent does not code as a hobby;
    # 'Yes, both' is a 'Yes' answer
    if x == 'No':
        Hobbyist.append('No')
    else : 
        Hobbyist.append('Yes')
# Replace the column with the list
df_17['Hobbyist']= Hobbyist

In [None]:
#test 
df_17['Hobbyist'].unique()

In [None]:
# the distribution of ConvertedComp in 2017
df_17['ConvertedComp'].hist()

In [None]:
#unique values of YearsCodePro in 2020, for comparison
df_20['YearsCodePro'].unique()

In [None]:
#unique values of YearsCodePro in 2017
df_17['YearsCodePro'].unique()

> We can convert these ranges to single values.

In [None]:
#change these values from a range to single values
df_17['YearsCodePro']=df_17['YearsCodePro'].astype(str)
df_17['YearsCodePro'] =df_17['YearsCodePro'].str.split(' ').str[0]
df_17['YearsCodePro'].unique()

In [None]:
#change these values from a range to single values
df_17['YearsCode']=df_17['YearsCode'].astype(str)
df_17['YearsCode'] =df_17['YearsCode'].str.split(' ').str[0]
df_17['YearsCode'].unique()
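
A small sketch of what the split does to range-style answers, and of what a later numeric conversion makes of the result. The sample answers are illustrative and may not match the survey wording exactly.

```python
import pandas as pd

# Illustrative answers, including a missing one.
years = pd.Series(["1 to 2 years", "20 or more years", "Less than a year", None])

# Keep only the first token of each answer.
first_token = years.astype(str).str.split(" ").str[0]

# Tokens that are not numbers (such as "Less") become NaN.
numeric = pd.to_numeric(first_token, errors="coerce")
print(numeric)
```

Note that an answer such as "Less than a year" ends up as NaN rather than 0, so those respondents drop out of any later numeric summary.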

In [None]:
df_17['EdLevel'].value_counts()

In [None]:
#test
df_17.info()

### ` Developer Survey Results 2018`

In [None]:
#Fixing Salary column outliers
df_18["Salary"] = pd.to_numeric(df_18["Salary"], errors='coerce')
df_18=df_18[df_18.Salary< 1.0000000000000002e+6]
df_18["Salary"].describe()

In [None]:
#Rename columns of 2018
df_18.rename(columns={'Hobby': "Hobbyist", 'FormalEducation': 'EdLevel',
                      'YearsCoding':'YearsCode','YearsCodingProf':'YearsCodePro',
                      'JobSatisfaction':'JobSat','Salary':'ConvertedComp','RaceEthnicity':'Ethnicity'},inplace = True)

In [None]:
#selecting columns
df_18 = df_18[['Respondent', 'Hobbyist', 'Country', 'Employment', 'EdLevel',
       'YearsCode', 'YearsCodePro', 'DevType', 'JobSat', 'LanguageWorkedWith',
       'LanguageDesireNextYear', 'DatabaseWorkedWith',
       'DatabaseDesireNextYear', 'PlatformWorkedWith',
       'PlatformDesireNextYear', 'Gender','year','ConvertedComp','Ethnicity','UndergradMajor']]

In [None]:
#testing change
df_18.head()

In [None]:
#uniques values of job sat
df_18['JobSat'].unique()

In [None]:
#Here job satisfaction is ranked differently than in the other results: there are 7 categories, but we need just 5.
#change the values to match the other dfs
#create an empty list
JobSat = []
#loop through the column
for x in df_18['JobSat']: 
    # fold 'Extremely' and 'Moderately' into 'Very';
    # anything else, including NaN, becomes the neutral category
    if x == 'Extremely dissatisfied' or x== 'Moderately dissatisfied' :
        JobSat.append('Very dissatisfied')
    elif x == 'Slightly dissatisfied': 
        JobSat.append('Slightly dissatisfied')
    elif x == 'Slightly satisfied' :
        JobSat.append('Slightly satisfied')
    elif x==  'Extremely satisfied' or x== 'Moderately satisfied':
        JobSat.append('Very satisfied')
    else :
        JobSat.append('Neither satisfied nor dissatisfied')
# Replace the column with the list
df_18['JobSat']= JobSat

In [None]:
#test
df_18['JobSat'].unique()
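
The seven-to-five collapse can also be written as a lookup table with `Series.map`. This is a sketch on made-up answers, not the notebook's own code; as in the loop, anything without an entry in the table (including missing answers) falls back to the neutral category.

```python
import pandas as pd

# Lookup table from the seven 2018 categories to the five shared ones.
to_five = {
    "Extremely dissatisfied": "Very dissatisfied",
    "Moderately dissatisfied": "Very dissatisfied",
    "Slightly dissatisfied": "Slightly dissatisfied",
    "Slightly satisfied": "Slightly satisfied",
    "Moderately satisfied": "Very satisfied",
    "Extremely satisfied": "Very satisfied",
}

# Made-up answers, including a missing one.
raw = pd.Series([
    "Extremely satisfied",
    "Moderately dissatisfied",
    "Neither satisfied nor dissatisfied",
    None,
])

# Unmapped values come back as NaN and are then filled with the neutral label.
collapsed = raw.map(to_five).fillna("Neither satisfied nor dissatisfied")
print(collapsed.tolist())
```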

In [None]:
#uniques values of Hobbyist
df_18['Hobbyist'].unique()

In [None]:
#description of ConvertedComp
df_18['ConvertedComp'].describe()

In [None]:
#hist of ConvertedComp
df_18['ConvertedComp'].hist()

In [None]:
df_18['YearsCode'].nunique()

In [None]:
df_18['YearsCode']=df_18['YearsCode'].astype(str)
df_18['YearsCode'] =df_18['YearsCode'].str.split('-').str[0]
df_18['YearsCode'].nunique()

In [None]:
df_18['YearsCodePro'].unique()

In [None]:
df_18['YearsCodePro']=df_18['YearsCodePro'].astype(str)
df_18['YearsCodePro'] =df_18['YearsCodePro'].str.split('-').str[0]
df_18['YearsCodePro'].unique()

In [None]:
df_18['EdLevel'].value_counts()

In [None]:
#simplify the edlevel
#dropping NaN values, since fillna is not useful here and there is no suitable method to replace the missing values
df_18=df_18.dropna(subset=['EdLevel'])
# Match Bachelor's and Master's on the degree name rather than on "BA"/"MA":
# a label that lists "MBA" also contains "BA" and would be relabelled as a Bachelor's degree.
df_18.loc[df_18['EdLevel'].str.contains("Bachelor", regex=False), 'EdLevel'] = "Bachelor's degree"
df_18.loc[df_18['EdLevel'].str.contains("Master", regex=False), 'EdLevel'] = "Master's degree"
df_18.loc[df_18['EdLevel'].str.contains("Ph.D"), 'EdLevel'] = "Other doctoral degree"
df_18.loc[df_18['EdLevel'].str.contains("without"), 'EdLevel'] = "Some college/university study without earning a bachelor's degree"
df_18.loc[df_18['EdLevel'].str.contains("Associate degree"), 'EdLevel'] = "Associate degree"
df_18.loc[df_18['EdLevel'].str.contains("JD"), 'EdLevel'] = "Professional degree"
df_18.loc[df_18['EdLevel'].str.contains("Secondary school"), 'EdLevel'] = "Secondary school"


df_18['EdLevel'].value_counts()
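
When labels are simplified with substring tests, it is worth checking which rows each pattern actually hits, because short abbreviations can overlap. The sketch below uses two hypothetical labels, written the way such survey options are commonly phrased, to show that a test for "BA" also matches a label that lists "MBA".

```python
import pandas as pd

# Hypothetical raw labels; the exact survey wording is an assumption.
labels = pd.Series([
    "Bachelor’s degree (BA, BS, B.Eng., etc.)",
    "Master’s degree (MA, MS, M.Eng., MBA, etc.)",
])

# Matching on the abbreviation hits both rows, because "MBA" contains "BA".
hits_abbreviation = labels.str.contains("BA", regex=False)

# Matching on the degree name hits only the Bachelor's row.
hits_name = labels.str.contains("Bachelor", regex=False)

print(hits_abbreviation.tolist(), hits_name.tolist())
```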

In [None]:
#testing changes
df_18.info()

### ` Developer Survey Results 2019`

In [None]:
#selecting columns
df_19 = df_19[['Respondent', 'Hobbyist', 'Country', 'Employment', 'EdLevel',
       'YearsCode', 'YearsCodePro', 'DevType', 'JobSat', 'LanguageWorkedWith',
       'LanguageDesireNextYear', 'DatabaseWorkedWith',
       'DatabaseDesireNextYear', 'PlatformWorkedWith',
       'PlatformDesireNextYear', 'Gender','year','ConvertedComp','Ethnicity','UndergradMajor']]

In [None]:

df_19.head()

In [None]:
df_19['JobSat'].unique()

In [None]:
df_19['Hobbyist'].unique()

In [None]:
df_19=df_19[df_19.ConvertedComp < 1.0000000000000002e+6]
df_19['ConvertedComp'].describe()

In [None]:
df_19['ConvertedComp'].hist()

In [None]:
df_19['EdLevel'].value_counts()

In [None]:
#simplify the edlevel
df_19=df_19.dropna(subset=['EdLevel'])
# Match Bachelor's and Master's on the degree name rather than on "BA"/"MA":
# a label that lists "MBA" also contains "BA" and would be relabelled as a Bachelor's degree.
df_19.loc[df_19['EdLevel'].str.contains("Bachelor", regex=False), 'EdLevel'] = "Bachelor's degree"
df_19.loc[df_19['EdLevel'].str.contains("Master", regex=False), 'EdLevel'] = "Master's degree"
df_19.loc[df_19['EdLevel'].str.contains("Ph.D"), 'EdLevel'] = "Other doctoral degree"
df_19.loc[df_19['EdLevel'].str.contains("without"), 'EdLevel'] = "Some college/university study without earning a bachelor's degree"
df_19.loc[df_19['EdLevel'].str.contains("Associate degree"), 'EdLevel'] = "Associate degree"
df_19.loc[df_19['EdLevel'].str.contains("JD"), 'EdLevel'] = "Professional degree"
df_19.loc[df_19['EdLevel'].str.contains("Secondary school"), 'EdLevel'] = "Secondary school"




In [None]:
df_19['EdLevel'].value_counts()

### ` Developer Survey Results 2020`

In [None]:
#selecting columns
df_20 = df_20[['Respondent', 'Hobbyist', 'Country', 'Employment', 'EdLevel',
       'YearsCode', 'YearsCodePro', 'DevType', 'JobSat', 'LanguageWorkedWith',
       'LanguageDesireNextYear', 'DatabaseWorkedWith',
       'DatabaseDesireNextYear', 'PlatformWorkedWith',
       'PlatformDesireNextYear', 'Gender','year','ConvertedComp','Ethnicity','UndergradMajor']]
#test the change
df_20.head()

In [None]:
df_20['Hobbyist'].unique()

In [None]:
df_20=df_20[df_20.ConvertedComp < 1e+6]


In [None]:
df_20.ConvertedComp.hist()

In [None]:
df_20['EdLevel'].value_counts()

In [None]:
#simplify the edlevel
df_20=df_20.dropna(subset=['EdLevel'])
# Match Bachelor's and Master's on the degree name rather than on abbreviations:
# a label that lists "MBA" also contains "BA" and would be relabelled as a Bachelor's degree.
df_20.loc[df_20['EdLevel'].str.contains("Bachelor", regex=False), 'EdLevel'] = "Bachelor's degree"
df_20.loc[df_20['EdLevel'].str.contains("Master", regex=False), 'EdLevel'] = "Master's degree"
df_20.loc[df_20['EdLevel'].str.contains("Ph.D"), 'EdLevel'] = "Other doctoral degree"
df_20.loc[df_20['EdLevel'].str.contains("without"), 'EdLevel'] = "Some college/university study without earning a bachelor's degree"
df_20.loc[df_20['EdLevel'].str.contains("Associate degree"), 'EdLevel'] = "Associate degree"
df_20.loc[df_20['EdLevel'].str.contains("JD"), 'EdLevel'] = "Professional degree"
df_20.loc[df_20['EdLevel'].str.contains("Secondary school"), 'EdLevel'] = "Secondary school"


df_20['EdLevel'].value_counts()

# Prepare Data

In [None]:
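#combine the data sets
#ignore_index gives the combined frame one unique row index instead of four repeated ones
frames=[df_17,df_18,df_19,df_20]
df=pd.concat(frames,sort=False,ignore_index=True)
df.info()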
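
A toy version of the combination step, assuming two small made-up frames with partly different columns. It shows that `pd.concat` keeps the union of the columns and fills the gaps with NaN, and that `ignore_index=True` produces a fresh row index.

```python
import pandas as pd

# Made-up frames; each is tagged with its survey year before combining.
a = pd.DataFrame({"Respondent": [1, 2], "JobSat": ["Very satisfied", "Slightly satisfied"]})
a["year"] = 2019
b = pd.DataFrame({"Respondent": [1], "Country": ["Canada"]})
b["year"] = 2020

combined = pd.concat([a, b], sort=False, ignore_index=True)
print(combined)
```

Respondent numbers restart in every survey, so `Respondent` only identifies a row together with `year`.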

In [None]:
df.shape

In [None]:
df.dtypes

### Education Level
> We can look at the changes of EdLevel through the years.



In [None]:
#value counts for the educational levels

df['EdLevel'].value_counts()

In [None]:
#na=False treats missing EdLevel values as non-matches instead of failing on them
df.loc[df['EdLevel'].str.contains("without", na=False), 'EdLevel'] = "Study without earning a degree"

#reduced data set 
df_ed=df[df.groupby('EdLevel')['EdLevel'].transform('size') > 2000]
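
The row filter above keeps only the categories that occur often enough to be readable in a plot. The same idea on a made-up frame, with a hypothetical threshold of 2 rows instead of 2000:

```python
import pandas as pd

# Made-up data: three Bachelor's rows, two Master's rows, one Primary school row.
toy = pd.DataFrame({
    "EdLevel": ["Bachelor's degree"] * 3 + ["Master's degree"] * 2 + ["Primary school"]
})

min_rows = 2  # hypothetical threshold for this toy example

# transform('size') returns, for every row, the size of the group the row belongs to.
group_size = toy.groupby("EdLevel")["EdLevel"].transform("size")
kept = toy[group_size > min_rows]
print(kept["EdLevel"].value_counts())
```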


In [None]:
#plot EdLevel through years
fig, ax = plt.subplots()
fig.set_size_inches(14, 8)
sns.set_palette('Accent')
sns.countplot(x="year", hue="EdLevel",data=df_ed)
plt.title('Education Level across years');

> The majority are at the undergraduate level.

### Gender


In [None]:
df['Gender'].nunique()

In [None]:
df['Gender'] =df['Gender'].str.split(';').str[0]
df['Gender'].value_counts()

In [None]:
#Similarly, many repeated phrases can be reduced: answers containing "Woman" or "Man" are renamed to Female and Male, other answers are kept as they are.
df=df.dropna(subset=['Gender'])
df.loc[df['Gender'].str.contains("Woman"), 'Gender'] = 'Female'
df.loc[df['Gender'].str.contains("Man"), 'Gender'] = 'Male'
df['Gender'].value_counts()
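
A sketch of the two gender steps on made-up answers. It makes one consequence visible: because only the first selection of a multi-select answer is kept, an answer such as "Woman;Man" is counted as Female.

```python
import pandas as pd

# Made-up answers; multi-select answers are separated by ';'.
gender = pd.Series([
    "Man",
    "Woman;Man",
    "Woman",
    "Non-binary, genderqueer, or gender non-conforming",
    None,
])

# Keep the first selection only, then drop missing answers.
first = gender.str.split(";").str[0].dropna()

# "Woman" is tested first; the match is case-sensitive, so "Woman" does not match "Man".
first[first.str.contains("Woman")] = "Female"
first[first.str.contains("Man")] = "Male"
print(first.value_counts())
```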

In [None]:
#plot

fig, ax = plt.subplots()
fig.set_size_inches(12, 8)
sns.set_palette('Accent')
sns.countplot(x="year", hue="Gender",data=df)
plt.title('The gender across years');

> The participants are overwhelmingly male.

#### Hobbyist

In [None]:
df['Hobbyist'].unique()

In [None]:
df['Hobbyist'].value_counts()

In [None]:
#plot
fig, ax = plt.subplots()
fig.set_size_inches(12, 8)
sns.set_palette('Accent')
sns.countplot(x="year", hue="Hobbyist",data=df)
plt.title('Coding as a hobby across years');

#### Employment

In [None]:
df['Employment'].unique()

In [None]:
df['Employment'].value_counts()

#### Undergrad Major

In [None]:
df=df.dropna(subset=['UndergradMajor'])

#update the values of undergrad major

df.loc[df['UndergradMajor'].str.contains("Humanities"), 'UndergradMajor'] = 'Humanities and social discipline'
df.loc[df['UndergradMajor'].str.contains("humanities"), 'UndergradMajor'] = 'Humanities and social discipline'

df.loc[df['UndergradMajor'].str.contains("business"), 'UndergradMajor'] = 'Business discipline'
df.loc[df['UndergradMajor'].str.contains("Social"), 'UndergradMajor'] = 'Humanities and social discipline'
df.loc[df['UndergradMajor'].str.contains("social"), 'UndergradMajor'] = 'Humanities and social discipline'

df.loc[df['UndergradMajor'].str.contains("health"), 'UndergradMajor'] = 'Health science'
df.loc[df['UndergradMajor'].str.contains('Mathematics' ), 'UndergradMajor'] = 'STEM'
df.loc[df['UndergradMajor'].str.contains('technology'), 'UndergradMajor'] = 'STEM'
df.loc[df['UndergradMajor'].str.contains('natural' ), 'UndergradMajor'] = 'STEM'
df.loc[df['UndergradMajor'].str.contains('engineering' ), 'UndergradMajor'] = 'STEM'
df.loc[df['UndergradMajor'].str.contains('development'), 'UndergradMajor'] = 'STEM'
df.loc[df['UndergradMajor'].str.contains('systems'), 'UndergradMajor'] = 'STEM'
df.loc[df['UndergradMajor'].str.contains('Computer'), 'UndergradMajor'] = 'STEM'
df.loc[df['UndergradMajor'].str.contains('information'), 'UndergradMajor'] = 'STEM'
df.loc[df['UndergradMajor'].str.contains('web'), 'UndergradMajor'] = 'STEM'
df.loc[df['UndergradMajor'].str.contains('arts'), 'UndergradMajor'] = 'Fine arts or performing arts'
df.loc[df['UndergradMajor'].str.contains('Psychology'), 'UndergradMajor'] = 'Health science'
df.loc[df['UndergradMajor'].str.contains('Another engineering discipline'), 'UndergradMajor'] = 'Another engineering discipline'


In [None]:
#test
UndergradMajor= df['UndergradMajor'].value_counts()
UndergradMajor

In [None]:
#plot

fig, ax = plt.subplots()
fig.set_size_inches(12, 8)
sns.set_palette('Accent')
sns.countplot(x="year", hue="UndergradMajor",data=df)
plt.title('Undergraduate major across years');

#### Years of Coding as Professional  and Years of Coding

In [None]:
#convert to numeric values
df['YearsCodePro']= pd.to_numeric(df["YearsCodePro"], errors='coerce')
df['YearsCode']= pd.to_numeric(df["YearsCode"], errors='coerce')
#plot
plt.figure(figsize = [12, 6])
plt.subplot(1, 2, 1)
df['YearsCodePro'].hist()
plt.title('Distribution of Years of coding as Professional')
plt.ylabel('Counts')
#plot
plt.subplot(1, 2, 2)
df['YearsCode'].hist()
plt.title('Distribution of Years of coding')
plt.ylabel('Counts');

###  YearsCodePro versus YearsCode over years 

In [None]:
#creating two plots to show the distribution of YearsCode vs YearsCodePro

# left plot: Years coding
plt.figure(figsize = [16, 8])
plt.subplot(1, 2, 1)
sns.boxplot(x="year", y="YearsCode", data=df,showfliers = False)
plt.xlabel(' Year ')
plt.ylabel(' Years of coding')
plt.title('The distribution of YearsCode')

# Right plot: Professional coding
plt.subplot(1, 2, 2)
sns.boxplot(x="year", y="YearsCodePro", data=df,showfliers = False)
plt.xlabel(' Year ')
plt.ylabel(' Years of professional coding')
plt.title('The distribution of YearsCodePro');


#### Salary

In [None]:
plt.figure(figsize = [10, 6])
df['ConvertedComp'].hist()
plt.xlabel('Salary')
plt.ylabel('Counts')
plt.title('Distribution of salary');

In [None]:
#creating plot to show the distribution salary over years

fig, ax = plt.subplots()
fig.set_size_inches(12, 8)
sns.boxplot(x="year", y="ConvertedComp", data=df,showfliers = False)
plt.xlabel(' Year ')
plt.ylabel(' Salary')
plt.title('The distribution of Salary')



In [None]:
# mean ConvertedComp for each year
#making copy
df2 = df.copy()
#dropping all nan
df2.dropna(how='any',inplace=True)
#average only the salary column and keep the years in chronological order,
#so that every bar gets the matching tick label
mean_comp = df2.groupby('year')['ConvertedComp'].mean().sort_index()
mean_comp=mean_comp.reset_index()
#plot 
plt.figure(figsize = [10, 6])
plt.bar(mean_comp['year'],mean_comp['ConvertedComp'])
plt.xticks(mean_comp['year'], mean_comp['year'].astype(str))
plt.title('Mean salary over the four years');
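
The per-year average can be checked on a small made-up frame. Selecting `ConvertedComp` before calling `mean()` keeps text columns such as `Country` out of the aggregation, and `sort_index()` returns the years in chronological order.

```python
import pandas as pd

# Made-up salaries for two survey years, deliberately out of order.
toy = pd.DataFrame({
    "year": [2018, 2017, 2018, 2017],
    "ConvertedComp": [60000.0, 40000.0, 80000.0, 50000.0],
    "Country": ["US", "DE", "US", "FR"],
})

mean_by_year = toy.groupby("year")["ConvertedComp"].mean().sort_index()
print(mean_by_year)
```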

#### Country

In [None]:
#the countries with the most survey respondents
Countries = df['Country'].value_counts().head(10)
Countries

In [None]:
#Top 10 countries by number of survey respondents
#plot 
base_color = sns.color_palette()[0]
fig, ax = plt.subplots()
fig.set_size_inches(12, 8)
#pass x and y by keyword: recent seaborn versions no longer accept them positionally
sns.barplot(x=Countries.values, y=Countries.index,color = base_color)
plt.title('Country')
plt.xlabel('Counts')
plt.ylabel('-');

#### Ethnicity

In [None]:
#Top 5 
Ethnic=df['Ethnicity'].value_counts().head(5)

#plot 
base_color = sns.color_palette()[0]
fig, ax = plt.subplots()
fig.set_size_inches(12, 8)
#pass x and y by keyword: recent seaborn versions no longer accept them positionally
sns.barplot(x=Ethnic.values, y=Ethnic.index,color = base_color)
plt.title('Top 5 Ethnicities on Stack Overflow')
plt.xlabel('Counts')
plt.ylabel('-');

# Analysing

# How has an individual's job satisfaction changed over the years?


#### Job Satisfaction

In [None]:
df['JobSat'].unique()

In [None]:
df['JobSat'].value_counts()

In [None]:
#plot

fig, ax = plt.subplots()
fig.set_size_inches(12, 8)
sns.set_palette('Accent')
sns.countplot(x="year", hue="JobSat",data=df)
plt.title('The Job Satisfaction across years');


## What are the changes in salaries and job satisfaction for data scientists?

In [None]:
#filter data science data 
df.dropna(subset=['DevType'],inplace=True)
df_datsci = df[df['DevType'].str.contains('Data scientist')]
df_datsci
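
`DevType` is a multi-select answer, so a substring test is a simple way to pick out every respondent who listed a given role. A sketch on made-up answers; passing `na=False` is an alternative to dropping the missing rows first.

```python
import pandas as pd

# Made-up answers; roles are separated by ';' and one answer is missing.
dev = pd.DataFrame({
    "DevType": [
        "Data scientist or machine learning specialist;Developer, back-end",
        "Developer, front-end",
        None,
    ],
    "ConvertedComp": [90000.0, 70000.0, 50000.0],
})

# na=False treats missing answers as "not a data scientist".
is_data_scientist = dev["DevType"].str.contains("Data scientist", na=False)
data_scientists = dev[is_data_scientist]
print(data_scientists)
```

Respondents who selected several roles are included, so their salaries also reflect their other roles.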

#### Average Salary for Data Science

In [None]:
#distribution of salary 

fig, ax = plt.subplots()
fig.set_size_inches(12, 8)
sns.boxplot(x="year", y="ConvertedComp", data=df_datsci,showfliers = False)
plt.xlabel(' Year ')
plt.ylabel(' Salary')
plt.title('The distribution of data science developer salary');



#### Job satisfaction for data science

In [None]:
#plot

fig, ax = plt.subplots()
fig.set_size_inches(12, 8)
sns.set_palette('Accent')
sns.countplot(x="year", hue="JobSat",data=df_datsci)
plt.title('The Job Satisfaction across years for data science developer');

# Do race and gender, or education level, matter more to an individual's salary?

In [None]:
#dropping nan
df.dropna(how='any',inplace=True)
#create a column with the average salary of each ('EdLevel','Gender','Ethnicity','UndergradMajor') group
df['Compa_verage'] = df.groupby(['EdLevel','Gender','Ethnicity','UndergradMajor'])['ConvertedComp'].transform('mean')
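
A toy illustration of what the group-average column contains, using made-up salaries and only two grouping columns. Every row receives the mean of its own group. A side effect worth knowing: averaging that column again over a coarser grouping gives the same value as averaging the raw salaries, because each group mean is repeated once per row.

```python
import pandas as pd

# Made-up data with two grouping columns instead of four.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "EdLevel": ["Bachelor", "Bachelor", "Bachelor", "Master"],
    "ConvertedComp": [50.0, 70.0, 60.0, 90.0],
})

# Each row gets the mean salary of its (EdLevel, Gender) group.
toy["group_mean"] = toy.groupby(["EdLevel", "Gender"])["ConvertedComp"].transform("mean")

# Averaging the group means per gender equals averaging the raw salaries per gender.
by_gender_from_group_mean = toy.groupby("Gender")["group_mean"].mean()
by_gender_from_raw = toy.groupby("Gender")["ConvertedComp"].mean()
print(toy)
```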


In [None]:
#filtering the data to focus on the 5 most common ethnicities in the survey
#Ethnic holds the value counts, so its index contains the ethnicity labels
df_ethnic = df[df['Ethnicity'].isin(Ethnic.index)]
df_ethnic=df_ethnic.query( 'Gender == "Male" or Gender == "Female"')



In [None]:
#plot the gender gap across Ethnicity
#catplot creates its own figure, so no separate plt.figure() call is needed
sns.catplot(x="Ethnicity", y="Compa_verage", hue="Gender", data=df_ethnic,kind="bar")
plt.xticks(rotation=60)
plt.gcf().set_size_inches(12, 8)
plt.title('Gender Earning Gap by Race/Ethnicity');



In [None]:
#plot the gender gap across Ed Level
sns.catplot(x="EdLevel", y="Compa_verage", hue="Gender", data=df_ethnic, kind="bar")
plt.xticks(rotation=60)
plt.gcf().set_size_inches(12, 8)
plt.title('Gender Earning Gap by Education Level');

## Modeling 

> We can use 'Hobbyist', 'Employment_answer', 'Ed_Level', 'YearsCode', 'YearsCodePro', and 'ConvertedComp' to attempt to predict job satisfaction.

> First we must deal with the categorical data, then we can start modeling.

In [None]:
#making copy
df1 = df.copy()
#dropping all nan 
#missing categorical values are hard to impute
df1.dropna(how='any',inplace=True)
#encode Hobbyist as a single 0/1 column (1 = 'Yes');
#get_dummies would return one column per answer, which does not fit into a single column
df1['Hobbyist'] = (df1['Hobbyist'] == 'Yes').astype(int)
#using LabelEncoder to encode each categorical feature as integer codes
label = LabelEncoder()
df1["Employment_answer"] = label.fit_transform(df1["Employment"])
df1[["Employment", "Employment_answer"]]
df1["JobSat_answer"] = label.fit_transform(df1["JobSat"])
df1[['JobSat', 'JobSat_answer']]
df1['Ed_Level'] = label.fit_transform(df1['EdLevel'])
df1[['EdLevel', 'Ed_Level']]
df1['Year'] = label.fit_transform(df1['year'])
df1[['year', 'Year']];
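
The difference between integer codes and one-hot columns, sketched with pandas only on made-up answers. Category codes number the sorted labels, which is the same numbering `LabelEncoder` produces, while `pd.get_dummies` creates one indicator column per label.

```python
import pandas as pd

# Made-up employment answers.
employment = pd.Series(["Employed full-time", "Student", "Employed full-time", "Retired"])

# Integer codes: one column, labels numbered in sorted order.
codes = employment.astype("category").cat.codes

# One-hot encoding: one indicator column per label.
one_hot = pd.get_dummies(employment)

print(codes.tolist())
print(one_hot.columns.tolist())
```

Integer codes imply an order between the labels that may not exist, while one-hot columns avoid that at the cost of more features.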

In [None]:
#Gaussian Naive Bayes (GaussianNB)
#define x and y 
X= df1[['Hobbyist',"Employment_answer",'Ed_Level','YearsCode','YearsCodePro','ConvertedComp']]
Y = df1['JobSat']
#split data
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.5,random_state=0)
#define model
clf= GaussianNB()
#fitting
clf.fit(X_train,y_train)
#predicting
y_pred=clf.predict(X_test)
#accuracy
print("Number of mislabeled points out of a total %d points : %d" % (X_test.shape[0], (y_test != y_pred).sum()))
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

> The accuracy is low, but the model could be improved with more survey answers related to the factors behind job satisfaction. 
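
To put an accuracy figure into context, it can be compared with a majority-class baseline, that is, the accuracy of always predicting the most common label. A minimal sketch on made-up test labels; the label counts are assumptions for illustration only.

```python
import pandas as pd

# Made-up test labels: 5 + 3 + 2 = 10 respondents.
y_test_toy = pd.Series(
    ["Very satisfied"] * 5 + ["Slightly satisfied"] * 3 + ["Very dissatisfied"] * 2
)

# Share of the most common label = accuracy of always predicting that label.
baseline_accuracy = y_test_toy.value_counts(normalize=True).max()
print(baseline_accuracy)
```

A model is only informative if its accuracy is clearly above this baseline.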
