    Executive Summary

    The goal of this study was to build a model that predicts the pay grade of online job postings that list no salary information. The model was built and tested on data from the job aggregator indeed.com. It predicted large differences in pay grade between job titles. Most notably, anything with ‘data science’ or ‘machine learning’ in the title was predicted to pay well, and anything with the word ‘research’ in it was predicted to pay poorly.

    Methods

    The data used in this analysis were collected by scraping 100 pages of job postings on indeed.com for each of 16 major cities across the US. Those cities were New York, Chicago, San Francisco, Austin, Seattle, Los Angeles, Philadelphia, Atlanta, Dallas, Pittsburgh, Portland, Phoenix, Denver, Houston, Miami, and Washington DC. All salaries on indeed.com were listed as yearly, monthly, or hourly figures. Salary ranges were reduced to their midpoint. To put everything on a yearly scale, monthly salaries were multiplied by 12 and hourly salaries by 2000.

    All job postings with usable salary information were used to build a random forest model that predicts whether a given job posting will pay above or below the median salary, based on its job title and location. The model was then used to predict the pay grade in each of our target cities for a list of common job titles that are relevant to data science.

    Results

    The final model was tested on held-out data that was not used for training. On that data it was about 80% accurate in predicting whether a specific job posting pays above or below the median salary. The baseline accuracy for this task is 50%, so the model does give us valuable insight.
 
    Below is a list of job titles and the number of cities (out of 16) where postings with that job title were predicted to pay above the median.

    Job title                    Cities above median (of 16)
    Data Engineer                16
    Data Science Developer       11
    Data Scientist               16
    Data Specialist               5
    Database Analyst              5
    Machine Learning Engineer    16
    Quantitative Analyst         15
    Research Analyst              0
    Research Assistant            5
    Research Associate            1
    Statistical Analyst           6

    The next list shows, for each of our target cities, the number of job titles on our list (out of 11) that were predicted to pay above the median.

    City             Titles above median (of 11)
    Atlanta           6
    Austin            9
    Chicago           5
    Dallas            4
    Denver            5
    Houston           4
    Los Angeles       9
    Miami             5
    New York          5
    Philadelphia      9
    Phoenix           4
    Pittsburgh        4
    Portland          3
    San Francisco    10
    Seattle           5
    Washington DC     9


    Interpretation

    Job postings with the titles ‘data engineer’, ‘data scientist’, and ‘machine learning engineer’ were predicted to be high-paying in all of our target cities. ‘Research analyst’ positions were predicted to pay below the median in every city, and ‘research associate’ positions in all but one. In Washington DC, Philadelphia, San Francisco, Los Angeles, and Austin, the model predicted that every job title other than ‘research analyst’ and ‘research associate’ would pay above the median. The model predicted mixed results for the other combinations of job title and location that were fed into it.

    Conclusion

    If you can (ethically) describe your job as a research position of any type, the model suggests you can offer far less competitive pay. If that does not apply to your job, you can also describe your data-science jobs in different terms so that there is less competition from other employers. For example, statistical analyst, database analyst, and data specialist positions were predicted to pay below the median in most of our target cities. Also, not surprisingly, data science jobs in San Francisco, Washington DC, Los Angeles, and Austin are predicted to pay well. If you can hire for your jobs in other cities instead, you may be able to offer lower wages.

In [1]:
# Load the scraped job postings from csv and drop duplicate rows
import pandas as pd
import re
data=pd.read_csv('jobs_data2.csv').drop('Unnamed: 0', axis=1)
df=data.drop_duplicates().copy()
df=df.reset_index(drop=True)
df.head()

Unnamed: 0,location,company,job_title,salary,city
0,"Houston, TX",MD Anderson Cancer Center,Institute Research Scientist - Translational B...,"$95,000 - $142,000 a year",Houston
1,"Houston, TX",MD Anderson Cancer Center,Computational Scientist,"$76,400 - $114,600 a year",Houston
2,"Houston, TX",MD Anderson Cancer Center,Research Scientist - Neuro-Oncology - Research,"$51,600 - $77,400 a year",Houston
3,"Houston, TX",MD Anderson Cancer Center,Research Scientist - Thoracic H&N Med Oncology...,"$51,600 - $77,400 a year",Houston
4,"Houston, TX","Earl Carl Institute for Legal & Social Policy,...",Part Time Academic Research and Writing Policy...,"$20,000 - $25,000 a year",Houston


In [2]:
df['time_frame']=['yr' if ('year' in _) \
                  else 'mo' if ('month' in _) \
                  else 'hr' if ('hour' in _) \
                  else '-'
                  for _ in df['salary']]

In [3]:
df['salary']=df['salary'].str.replace(' a year', '')
df['salary']=df['salary'].str.replace(' a month', '')
df['salary']=df['salary'].str.replace(' an hour', '')
df['salary']=df['salary'].str.replace('$', '')
df['salary']=df['salary'].str.replace('-','')
df['salary']=df['salary'].str.replace(',','')
df['salary']=df['salary'].str.strip()

In [4]:
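def stringparse(test):
    # Integer range such as '51600  77400': average the two endpoints.
    if re.match(r'^\d+ +\d+', test):
        low, high = re.findall(r'\d+', test)[:2]
        return (float(low) + float(high)) / 2
    # Decimal range such as '15.50  20.00': average the two endpoints.
    elif re.match(r'\d+\.\d+ +\d+\.+\d+', test):
        low, high = re.findall(r'\d+\.\d+', test)[:2]
        return (float(low) + float(high)) / 2
    # Single value: convert it directly.
    else:
        return float(test)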

In [5]:
df['sal_avg']=df['salary'].apply(stringparse)
# Scale every salary to a yearly figure; rows with no recognised pay period get 0.
df['sal_yr']=df['sal_avg']*df['time_frame'].map({'yr':1, 'mo':12, 'hr':2000}).fillna(0)
df.head()

Unnamed: 0,location,company,job_title,salary,city,time_frame,sal_avg,sal_yr
0,"Houston, TX",MD Anderson Cancer Center,Institute Research Scientist - Translational B...,95000 142000,Houston,yr,118500.0,118500.0
1,"Houston, TX",MD Anderson Cancer Center,Computational Scientist,76400 114600,Houston,yr,95500.0,95500.0
2,"Houston, TX",MD Anderson Cancer Center,Research Scientist - Neuro-Oncology - Research,51600 77400,Houston,yr,64500.0,64500.0
3,"Houston, TX",MD Anderson Cancer Center,Research Scientist - Thoracic H&N Med Oncology...,51600 77400,Houston,yr,64500.0,64500.0
4,"Houston, TX","Earl Carl Institute for Legal & Social Policy,...",Part Time Academic Research and Writing Policy...,20000 25000,Houston,yr,22500.0,22500.0
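
The cleaning steps in cells 2 to 5 can also be condensed into a single helper. The sketch below is one possible way to do that, not the notebook's own code: `annualize` is a hypothetical name, and it assumes every usable salary string contains 'year', 'month', or 'hour', as the scraped strings shown above do. The multipliers (12 and 2000) are the ones stated in the Methods section. Unlike the notebook, it returns `None` rather than 0 when no pay period is found.

```python
import re

# Multipliers from the Methods section: 12 months per year and
# 2000 hours per year.
MULTIPLIERS = {'year': 1, 'month': 12, 'hour': 2000}

def annualize(salary_text):
    """Hypothetical helper: turn a raw salary string into a yearly figure."""
    # Identify the pay period from the unit word in the string.
    unit = next((u for u in MULTIPLIERS if u in salary_text), None)
    if unit is None:
        return None  # unknown pay period; the caller decides what to do
    # Drop thousands separators, then collect every number in the string.
    cleaned = salary_text.replace(',', '')
    numbers = [float(n) for n in re.findall(r'\d+(?:\.\d+)?', cleaned)]
    if not numbers:
        return None
    # A range is reduced to its midpoint; a single value is kept as is.
    return sum(numbers) / len(numbers) * MULTIPLIERS[unit]

print(annualize('$51,600 - $77,400 a year'))
print(annualize('$15.50 - $20.50 an hour'))
```

The first call reproduces the 64500.0 shown for the Research Scientist rows above.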


city_list=['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle', 
    'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
    'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', 'Washington+City']

## Predicting salaries using Random Forests + Another Classifier

In [6]:
df['above_med']=(df['sal_yr']>df['sal_yr'].median())

#### The baseline accuracy for this model is about 50%: by construction, roughly half the salaries lie above the median and half lie at or below it.
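
The baseline is the accuracy of always predicting the more common class. Because `above_med` uses a strict `>` comparison, postings that sit exactly on the median fall into the lower class, so the baseline can end up above 50% when many salaries are tied. The sketch below (written for Python 3, with made-up salaries and a hypothetical `baseline_accuracy` helper) shows both cases.

```python
from statistics import median

def baseline_accuracy(salaries):
    """Accuracy of always predicting the more common class."""
    med = median(salaries)
    # Same rule as the notebook: strictly above the median counts as True.
    above = [s > med for s in salaries]
    share_above = sum(above) / len(above)
    return max(share_above, 1 - share_above)

# Illustrative values only, not taken from the scraped data.
print(baseline_accuracy([40000, 50000, 60000, 70000]))
print(baseline_accuracy([40000, 50000, 50000, 50000, 90000]))
```

With four distinct salaries the classes split evenly; in the second list three postings are tied at the median, so only one of five counts as above it.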

#### Create a Random Forest model to predict High/Low salary using Sklearn. Start by ONLY using the location as a feature. 

In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [8]:
X=pd.get_dummies(df['city'])
y=df['above_med']
X_train, X_test, y_train, y_test=train_test_split(X,y, test_size=.2)
rforest=RandomForestClassifier(n_estimators=1000)
rforest.fit(X_train, y_train)
print('RForest Acc., Location Only: '+ str(rforest.score(X_test, y_test)))

RForest Acc., Location Only: 0.5625


#### Create a few new variables in your dataframe to represent interesting features of a job title.
- For example, create a feature that represents whether 'Senior' is in the title or whether 'Manager' is in the title. 
- Then build a new Random Forest with these features. Do they add any value?
- After creating these variables, use count-vectorizer to create features based on the words in the job titles.
- Build a new random forest model with location and these new features included.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import GridSearchCV
titles=df['job_title']
companies=df['company']

In [10]:
cvec_titles=CountVectorizer(stop_words='english', ngram_range=(2,2))
cvec_titles.fit(titles)
cvec_titles_data=cvec_titles.transform(titles)
df_titles=pd.DataFrame(cvec_titles_data.todense(), columns=cvec_titles.get_feature_names())
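
With `ngram_range=(2,2)`, each feature is a pair of adjacent words rather than a single word. The sketch below mimics that in plain Python to show what the columns of `df_titles` represent. It is a simplification: the stop-word set is a tiny hypothetical stand-in for scikit-learn's built-in English list, and `title_bigrams` is an illustrative name.

```python
import re

# Tiny stand-in for scikit-learn's English stop-word list (assumption).
STOP_WORDS = {'and', 'of', 'the', 'for', 'in', 'a', 'an'}

def title_bigrams(title):
    """Return the adjacent word pairs of a job title."""
    # Lowercase and keep tokens of two or more word characters,
    # which mirrors CountVectorizer's default tokenization.
    tokens = re.findall(r'\b\w\w+\b', title.lower())
    # Stop words are dropped before the pairs are formed.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]

print(title_bigrams('Senior Data Scientist'))
print(title_bigrams('Director of Machine Learning'))
print(title_bigrams('Statistician'))
```

A one-word title produces no bigrams at all, so for such postings the model can only rely on the city columns.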

In [11]:
X=pd.concat([df_titles, X], axis=1)
X_train, X_test, y_train, y_test=train_test_split(X,y, test_size=.2)

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 

In [12]:
grid_params={'n_estimators':[10,25,100,1000],
            'max_features':[15,25,50,250]}
grid=GridSearchCV(RandomForestClassifier(), param_grid=grid_params, cv=10, n_jobs=-1)
grid.fit(X_train, y_train)

GridSearchCV(cv=10, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [10, 25, 100, 1000], 'max_features': [15, 25, 50, 250]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
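
To get a sense of the cost of this search, the sketch below enumerates the candidates in plain Python. It only counts model fits and does not train anything; the grid and fold count are copied from the cell above, while the variable names are illustrative.

```python
from itertools import product

grid_params = {'n_estimators': [10, 25, 100, 1000],
               'max_features': [15, 25, 50, 250]}
cv_folds = 10

# Every combination of the two hyperparameter lists is one candidate.
candidates = [dict(zip(grid_params, values))
              for values in product(*grid_params.values())]
# Each candidate is fit once per fold, plus one final refit on all
# training rows because refit=True.
total_fits = len(candidates) * cv_folds + 1

print(len(candidates))
print(total_fits)
```

Sixteen candidates times ten folds, plus the final refit, gives 161 forest fits, a quarter of them with 1000 trees.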

In [13]:
rforest=grid.best_estimator_
rforest.fit(X_train, y_train)
print('RForest Acc., Location+Description CVec: '+str(rforest.score(X_test, y_test)))

RForest Acc., Location+Description CVec: 0.84375


In [14]:
print(grid.best_params_)

{'max_features': 15, 'n_estimators': 100}


#### Repeat the model-building process with a non-tree-based method.

In [15]:
from sklearn.neighbors import KNeighborsClassifier
grid_params={'p':[1,2,3],
            'weights':['uniform', 'distance'],
            'algorithm':['ball_tree', 'kd_tree', 'brute'],
            'n_neighbors':[2,5,10]}

In [16]:
grid=GridSearchCV(KNeighborsClassifier(n_jobs=-1), param_grid=grid_params)
grid.fit(X_train, y_train)
knn=grid.best_estimator_

In [17]:
print(grid.best_params_)
print('knn-score'+str(knn.score(X_test, y_test)))

{'n_neighbors': 2, 'weights': 'uniform', 'algorithm': 'kd_tree', 'p': 1}
knn-score0.71875


In [18]:
X_train.shape

(256, 631)
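
With 256 training rows and `test_size=.2`, the held-out set has roughly 64 rows, so each reported accuracy rests on a small sample. The sketch below is an addition, not part of the original analysis: it converts an accuracy into a count of correct predictions and a normal-approximation 95% interval. The test size of 64 is inferred from the shape above and should be treated as an assumption, and `accuracy_interval` is a hypothetical helper.

```python
import math

def accuracy_interval(accuracy, n_test, z=1.96):
    """Normal-approximation confidence interval for an accuracy score."""
    # Standard error of a proportion estimated from n_test samples.
    se = math.sqrt(accuracy * (1 - accuracy) / n_test)
    return accuracy - z * se, accuracy + z * se

n_test = 64  # assumed: 256 training rows are 80% of 320 postings
for name, acc in [('location only', 0.5625),
                  ('random forest', 0.84375),
                  ('knn', 0.71875)]:
    low, high = accuracy_interval(acc, n_test)
    print('%s: %d/%d correct, 95%% CI %.2f to %.2f'
          % (name, round(acc * n_test), n_test, low, high))
```

Under these assumptions the random forest's interval spans roughly 0.75 to 0.93, while the location-only interval includes 0.5, so location alone is hard to distinguish from the baseline on a test set this small.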

### BONUS 

#### Bonus: Use Count Vectorizer from scikit-learn to create features from the job descriptions. 
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 

In [19]:
feature_importances = pd.DataFrame(rforest.feature_importances_,
                                   index = X.columns,
                                    columns=['importance']).sort_values('importance',
                                                                        ascending=False)
pd.DataFrame(feature_importances)

Unnamed: 0,importance
data scientist,0.089549
research analyst,0.031341
data engineer,0.021630
data analyst,0.017448
Washington+City,0.017037
senior data,0.014319
San+Francisco,0.013993
Chicago,0.011836
machine learning,0.011786
Portland,0.011388


### TESTING

In [20]:
# Note: 'Research Analyst' and 'Statistical Analyst' are listed twice, so their
# rows are duplicated and they are counted twice in the groupby sums below.
test_titles=['Data Scientist', 'Research Analyst', 'Data Engineer', 'Research Associate',
             'Machine Learning Engineer', 'Research Assistant', 'Statistical Analyst', 
             'Quantitative Analyst','Data Science Developer', 'Database Analyst',
             'Data Specialist', 'Research Analyst', 'Predictive Analyst', 'Machine Learning Specialist',
            'Database Engineer', 'Statistical Analyst', 'Software Engineer', 'Software Developer', 'Database Developer']
             
city_list=['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle', 
           'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
           'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', 'Washington+City']
dftest_title_list=[]
dftest_city_list=[]
for tit in test_titles:
    for cit in city_list:
        dftest_title_list.append(tit)
        dftest_city_list.append(cit)
test_df=pd.DataFrame({'job_title':dftest_title_list, 'city':dftest_city_list})
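
The nested loops above build one row per (title, city) pair. An equivalent construction with `itertools.product` is sketched below. It also removes repeated titles first, since a title listed twice contributes two identical rows per city to any later count. The short lists are placeholders for illustration, and `unique_in_order` is a hypothetical helper.

```python
from itertools import product

def unique_in_order(items):
    """Drop repeats while keeping the first occurrence of each item."""
    seen = set()
    return [x for x in items if not (x in seen or seen.add(x))]

# Shortened placeholder lists; the notebook uses 19 titles and 16 cities.
titles = ['Data Scientist', 'Research Analyst', 'Data Engineer', 'Research Analyst']
cities = ['Austin', 'Seattle']

# product() iterates the last argument fastest, matching the nested loops.
pairs = list(product(unique_in_order(titles), cities))
print(len(pairs))
print(pairs[:3])
```

The resulting pairs can be split back into two columns for a DataFrame, in the same row order as the loop version.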

In [21]:
test_X=pd.get_dummies(test_df['city'])
test_titlevecs=cvec_titles.transform(test_df['job_title'])
test_df_titles=pd.DataFrame(test_titlevecs.todense(), columns=cvec_titles.get_feature_names())
test_X=pd.concat([test_df_titles, test_X], axis=1)
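
`pd.get_dummies` only creates columns for the cities that occur in the frame it is given, so the test matrix lines up with the training matrix only when both contain the same set of cities. A defensive alignment step is sketched below in plain Python; `align_to_training` is a hypothetical helper and the column names are illustrative. In pandas, `DataFrame.reindex(columns=..., fill_value=0)` serves the same purpose.

```python
def align_to_training(rows, training_columns):
    """Reorder feature dicts to the training column order, filling gaps with 0."""
    aligned = []
    for row in rows:
        # Columns unseen at training time are dropped; missing ones become 0.
        aligned.append([row.get(col, 0) for col in training_columns])
    return aligned

# Illustrative column names and rows, not the notebook's full feature set.
training_columns = ['data scientist', 'research analyst', 'Austin', 'Seattle']
test_rows = [{'Austin': 1, 'data scientist': 1},
             {'Seattle': 1, 'quantitative analyst': 1}]
print(align_to_training(test_rows, training_columns))
```

The second row shows the consequence for unseen titles: a bigram the model never saw during training is silently dropped, leaving only the city to drive the prediction.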

In [22]:
y_pred=rforest.predict(test_X)

In [23]:
test_df['above_med']=y_pred

In [24]:
pd.set_option("display.max_rows",300)
test_df.sort_values('job_title')

Unnamed: 0,city,job_title,above_med
39,Atlanta,Data Engineer,True
34,San+Francisco,Data Engineer,True
35,Austin,Data Engineer,True
36,Seattle,Data Engineer,True
37,Los+Angeles,Data Engineer,True
38,Philadelphia,Data Engineer,True
40,Dallas,Data Engineer,True
41,Pittsburgh,Data Engineer,True
42,Portland,Data Engineer,True
43,Phoenix,Data Engineer,True


In [25]:
test_df.groupby('job_title')['above_med'].sum().sort_values(ascending=False)

job_title
Data Engineer                  16.0
Machine Learning Engineer      16.0
Data Scientist                 16.0
Statistical Analyst            14.0
Machine Learning Specialist    13.0
Quantitative Analyst           11.0
Data Science Developer         11.0
Predictive Analyst              7.0
Database Engineer               7.0
Database Developer              7.0
Database Analyst                7.0
Software Developer              7.0
Software Engineer               4.0
Data Specialist                 4.0
Research Assistant              1.0
Research Associate              1.0
Research Analyst                0.0
Name: above_med, dtype: float64

In [26]:
test_df.sort_values('city')

Unnamed: 0,city,job_title,above_med
151,Atlanta,Database Analyst,True
55,Atlanta,Research Associate,False
247,Atlanta,Statistical Analyst,True
263,Atlanta,Software Engineer,False
39,Atlanta,Data Engineer,True
71,Atlanta,Machine Learning Engineer,True
231,Atlanta,Database Engineer,True
87,Atlanta,Research Assistant,False
215,Atlanta,Machine Learning Specialist,True
279,Atlanta,Software Developer,True


In [27]:
test_df.groupby('city')['above_med'].sum().sort_values(ascending=False)

city
San+Francisco      17.0
Washington+City    15.0
New+York           15.0
Los+Angeles        15.0
Philadelphia       13.0
Austin             13.0
Atlanta            13.0
Seattle             6.0
Houston             6.0
Denver              6.0
Chicago             6.0
Pittsburgh          4.0
Dallas              4.0
Portland            3.0
Phoenix             3.0
Miami               3.0
Name: above_med, dtype: float64