# Project 4 

### Objectives
1. Determine the **industry factors** that are _most important_ in predicting the salary amounts for these data.
2. Determine the factors that distinguish job categories and titles from each other. For example, can required skills accurately predict job title?

### QUESTION 2: Factors that distinguish job category

Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:
- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?

You may end up making multiple classification models to tackle different questions. Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. The type of classification model you choose is up to you. Be sure to interpret your results and evaluate your models' performance.

In [3]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import re
import time

import matplotlib.pyplot as plt
import seaborn as sns

In [100]:
import nltk
from nltk.corpus import stopwords 
from nltk.stem.porter import PorterStemmer

from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split,cross_val_score,GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score,classification_report,roc_curve, auc

In [5]:
df = pd.read_csv('./datasets/webscrape_alljobs_nosalary.csv',index_col=0)

In [6]:
df.head()

Unnamed: 0,Job_Company,Job_Desc,Job_Salary,Job_Title,Job_employment_type,Min_years,Salary_max,Salary_min,Salary_avg
0,Morgan Stanley,Roles & Responsibilities COMPANY DESCRIPTION ...,4500,Business Analyst,,,4500.0,4500.0,4500.0
1,HP Inc,Roles & Responsibilities Review and understand...,7100,Business Analyst / Business Manager,,,7100.0,7100.0,7100.0
2,Point72,Roles & ResponsibilitiesThe Analyst works toge...,7000,Business Analyst,,,7000.0,7000.0,7000.0
3,Twitter,Roles & Responsibilities\r\n\r\n\r\n\r\n\r\n\r...,4250,Business Analyst,,,4250.0,4250.0,4250.0
4,Henkel,Roles & ResponsibilitiesThe Development Techni...,5000,Technical Leader,,,5000.0,5000.0,5000.0


In [7]:
df[df.duplicated(keep=False)].sort_values('Job_Title')

Unnamed: 0,Job_Company,Job_Desc,Job_Salary,Job_Title,Job_employment_type,Min_years,Salary_max,Salary_min,Salary_avg
362,KERRY CONSULTING PTE. LTD.,Roles & Responsibilities Global specialty chem...,$5000to$7500Monthly,APAC Business Analyst Specialty chemicals,"Permanent, Full Time",,7500.0,5000.0,6250.0
43,KERRY CONSULTING PTE. LTD.,Roles & Responsibilities Global specialty chem...,$5000to$7500Monthly,APAC Business Analyst Specialty chemicals,"Permanent, Full Time",,7500.0,5000.0,6250.0
560,KERRY CONSULTING PTE. LTD.,Roles & Responsibilities Global specialty chem...,$5000to$7500Monthly,APAC Business Analyst Specialty chemicals,"Permanent, Full Time",,7500.0,5000.0,6250.0
429,KERRY CONSULTING PTE. LTD.,Roles & Responsibilities Global specialty chem...,$5000to$7500Monthly,APAC Business Analyst Specialty chemicals,"Permanent, Full Time",,7500.0,5000.0,6250.0
364,FINMECHANICS PTE. LTD.,Roles & ResponsibilitiesYou Need to Be A true...,$4500to$8000Monthly,Associate Consultant,"Permanent, Full Time",,8000.0,4500.0,6250.0
281,FINMECHANICS PTE. LTD.,Roles & ResponsibilitiesYou Need to Be A true...,$4500to$8000Monthly,Associate Consultant,"Permanent, Full Time",,8000.0,4500.0,6250.0
370,KELLY SERVICES (SINGAPORE) PTE. LTD.,Roles & ResponsibilitiesWe are seeking a highl...,$4000to$6000Monthly,BAU Analyst and Support,Contract,,6000.0,4000.0,5000.0
569,KELLY SERVICES (SINGAPORE) PTE. LTD.,Roles & ResponsibilitiesWe are seeking a highl...,$5000to$7500Monthly,BAU Analyst and Support,Contract,,7500.0,5000.0,6250.0
108,KELLY SERVICES (SINGAPORE) PTE. LTD.,Roles & ResponsibilitiesWe are seeking a highl...,$5000to$7500Monthly,BAU Analyst and Support,Contract,,7500.0,5000.0,6250.0
164,KELLY SERVICES (SINGAPORE) PTE. LTD.,Roles & ResponsibilitiesWe are seeking a highl...,$6000to$9000Monthly,BAU Analyst and Support,Contract,,9000.0,6000.0,7500.0


In [8]:
df.drop_duplicates(inplace=True)

In [9]:
df.reset_index(drop=True,inplace=True)

In [10]:
df['Job_Title_clean'] = ['Data Scientist' if 'scien' in jobtitle.lower() else 
                          'Data Analyst'    if 'data' in jobtitle.lower() and 'analy' in jobtitle.lower() else 
                          'Business Analyst'if 'business' in jobtitle.lower() else 
                          'Data Analyst'    if 'analy' in jobtitle.lower() else 
                          'Data Engineer'   if 'engineer' in jobtitle.lower() else
                          'other' for jobtitle in df['Job_Title']]

df['Job_Title_clean'].value_counts()

Business Analyst    644
Data Scientist      233
Data Engineer       226
Data Analyst        198
other               184
Name: Job_Title_clean, dtype: int64
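The chained conditional above applies its rules in order, so the first match wins. A minimal sketch of the same rules as a standalone function makes that ordering easier to check; the name `bucket_title` and the example titles are illustrative and not part of the notebook.

```python
def bucket_title(job_title):
    """Map a raw job title to one of five buckets; the first matching rule wins."""
    title = job_title.lower()
    if 'scien' in title:
        return 'Data Scientist'
    if 'data' in title and 'analy' in title:
        return 'Data Analyst'
    if 'business' in title:
        return 'Business Analyst'
    if 'analy' in title:
        return 'Data Analyst'
    if 'engineer' in title:
        return 'Data Engineer'
    return 'other'


# Hypothetical titles used only to exercise each rule.
examples = ['Business Analyst / Business Manager', 'Technical Leader',
            'Senior Data Scientist', 'Financial Analyst', 'Software Engineer']
buckets = [bucket_title(t) for t in examples]
```

Any remaining title containing 'analy' (for example a financial analyst) lands in Data Analyst, and any title containing 'engineer' lands in Data Engineer, so these two buckets are broader than their names suggest.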

In [11]:
#downsample Business Analyst
ba_index = df[df['Job_Title_clean'] == 'Business Analyst'].index

In [12]:
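import random
# random.sample needs a sequence, so convert the pandas Index to a list first.
# No seed is set, so a different subset of Business Analyst rows is drawn on each run.
newba_index = random.sample(list(ba_index), 220)
print(len(newba_index))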

220
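Because the draw above is unseeded, the downsampled set changes between runs. One possible way to make it repeatable, sketched with the standard library only: the helper name `downsample`, the seed value and the stand-in `range(644)` are assumptions for illustration.

```python
import random


def downsample(indices, n, seed=42):
    """Return n distinct items from indices, drawn with a fixed seed."""
    rng = random.Random(seed)   # private generator, leaves the global state alone
    return rng.sample(list(indices), n)


ba_rows = range(644)            # stand-in for the 644 Business Analyst row labels
first = downsample(ba_rows, 220)
second = downsample(ba_rows, 220)
```

Calling the helper twice with the same seed returns the same 220 rows, so the later train/test split and scores can be reproduced.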


In [13]:
new_index = list(df[df['Job_Title_clean'] != 'Business Analyst'].index) + newba_index
len(new_index)

1061

In [14]:
df.loc[new_index].shape

(1061, 10)

In [15]:
jobs = df.loc[new_index]

In [16]:
jobs['Job_Title_clean'].value_counts()

Data Scientist      233
Data Engineer       226
Business Analyst    220
Data Analyst        198
other               184
Name: Job_Title_clean, dtype: int64

In [20]:
jobs['Job_Title_class'] = [0 if jobtitle == 'Data Scientist' else 
                           1 if jobtitle == 'Data Analyst'   else 
                           2 if jobtitle == 'Business Analyst'else 
                           3 if jobtitle == 'Data Engineer'   else
                           4 for jobtitle in jobs['Job_Title_clean']]

In [21]:
jobs['Job_Title_class'].value_counts()

0    233
3    226
2    220
1    198
4    184
Name: Job_Title_class, dtype: int64
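The integer codes are easy to lose track of once the confusion matrices are labelled `is_0` to `is_4`. A small sketch of keeping the mapping in one place; the names `CLASS_NAMES`, `CLASS_IDS` and `encode` are hypothetical helpers, not part of the notebook.

```python
# Same coding as the list comprehension above: anything unrecognised becomes 4.
CLASS_NAMES = {0: 'Data Scientist', 1: 'Data Analyst', 2: 'Business Analyst',
               3: 'Data Engineer', 4: 'other'}
CLASS_IDS = {name: code for code, name in CLASS_NAMES.items()}


def encode(title_clean):
    """Return the integer class for a cleaned title; unknown titles map to 4."""
    return CLASS_IDS.get(title_clean, 4)


# Readable row labels that could replace 'is_0' ... 'is_4' in a confusion matrix.
row_labels = ['is_{}'.format(CLASS_NAMES[i]) for i in range(5)]
```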

In [22]:
jobs.shape

(1061, 11)

# Modelling

In [115]:
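def obtain_scores(estimator,X_train, y_train, X_test, y_test):
    # X_train and y_train are unused; they are kept so existing calls still work.
    y_pred = estimator.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    # average=None returns one score per class; index 0 is class 0 (Data Scientist).
    ds_recall = recall_score(y_test, y_pred,average=None)[0]
    ds_precision = precision_score(y_test, y_pred,average=None)[0]
    ds_F1 = 2*(ds_precision*ds_recall)/(ds_precision+ds_recall)
    confusion = pd.DataFrame(confusion_matrix(y_test, y_pred), 
                             index=['is_{}'.format(name) for name in range(0,5)],
                             columns=['pred_{}'.format(name) for name in range(0,5)])
    
    # Caveat: hard class labels (0-4) are passed as scores with class 4 ('other') as
    # the positive class, so 'roc' is a rough one-vs-rest figure for 'other', not a
    # multiclass AUC based on predicted probabilities.
    fpr, tpr, _ = roc_curve(y_test, y_pred,pos_label=4)
    roc_auc = auc(fpr, tpr)

    print(classification_report(y_test, y_pred))
    return {'confusion matrix': confusion,
            'roc': roc_auc,
            'for_compare': [accuracy,ds_recall,ds_precision,ds_F1,roc_auc]
           }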

def add_model(model_name,summary,df):
    row = pd.DataFrame([model_name]+summary['for_compare'],
                       index=['model','accuracy','recall_DS','precision_DS','F1_score_DS','ROC']).T
    new_df = pd.concat([df,row])
    return new_df

In [37]:
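X = jobs['Job_Desc']

# Keep letters and spaces only, lower-case, then collapse repeated whitespace.
X = X.map(lambda x: re.sub('[^ a-zA-Z]',' ',x).lower())
X = X.map(lambda x: " ".join(x.split())) 

# Scraped text often glues a heading to the next word (e.g. 'responsibilitiesthe'),
# so add a space after these terms. This also splits words such as 'database'.
to_replace_list = ['requirements', 'responsibilities','data']
for word in to_replace_list:
    X = X.str.replace(word, word +' ')

# Build the stop-word set once instead of once per word, then drop stop words and stem.
ps = PorterStemmer() 
stop_words = set(stopwords.words('english'))
X = X.map(lambda x: ' '.join([ps.stem(word) for word in x.split() if word not in stop_words]))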


X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    jobs['Job_Title_class'], 
                                                    test_size = 0.25, 
                                                    random_state=42)
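The cleaning steps in this cell can be tried on a single string without pandas or NLTK. This is a sketch of the regex and spacing steps only (stop-word removal and stemming are left out because they need the NLTK corpora); `clean_text`, `GLUED_TERMS` and the sample sentence are illustrative names and data.

```python
import re

GLUED_TERMS = ['requirements', 'responsibilities', 'data']


def clean_text(text):
    """Letters only, lower-case, split glued headings, single spaces."""
    text = re.sub('[^ a-zA-Z]', ' ', text).lower()
    for term in GLUED_TERMS:
        # 'responsibilitiesthe' -> 'responsibilities the'
        text = text.replace(term, term + ' ')
    return ' '.join(text.split())


cleaned = clean_text('Roles & ResponsibilitiesThe Analyst works with 3 databases')
```

The sample shows a side effect worth knowing about: because 'data' is in the list, 'databases' is split into 'data bases', so any later feature importance for 'data' also counts such words.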

In [25]:
X_train.sample(5)

836    role respons famili offic focus invest technol...
120    role respons job purpos softwar engin would pa...
956    role respons respons gather analyz document re...
736    role respons canopi http canopi cloud singapor...
193    role respons standard charter lead intern bank...
Name: Job_Desc, dtype: object

In [38]:
y_train.sample(5)

1346    3
1457    4
367     2
842     3
1227    0
Name: Job_Title_class, dtype: int64

In [27]:
cvec = CountVectorizer()
cvec.fit_transform(X_train)
print len(cvec.get_feature_names())

4317


In [116]:
#Set up empty dataframe to add results for easy comparison
summary = pd.DataFrame(columns=['model','accuracy','recall_DS','precision_DS','F1_score_DS','ROC'])

#### Using Logistic Regression

In [117]:
cvec = CountVectorizer()
logreg = LogisticRegression(penalty='l2')

pipe = Pipeline(steps=[('CountVec', cvec),
                       ('logreg',logreg)])
pipe.fit(X_train,y_train)
summary_pipe = obtain_scores(pipe,X_train,y_train,X_test,y_test)
summary = add_model('basic logreg penalty l2',summary_pipe,summary)
print summary_pipe['roc']
summary_pipe ['confusion matrix']

             precision    recall  f1-score   support

          0       0.95      0.98      0.96        53
          1       0.95      0.89      0.92        47
          2       0.87      0.87      0.87        47
          3       0.99      0.99      0.99        69
          4       0.90      0.92      0.91        50

avg / total       0.94      0.94      0.94       266

0.9497685185185185


Unnamed: 0,pred_0,pred_1,pred_2,pred_3,pred_4
is_0,52,0,0,1,0
is_1,2,42,3,0,0
is_2,1,1,41,0,4
is_3,0,0,0,68,1
is_4,0,1,3,0,46
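The figures that `obtain_scores` stores for the Data Scientist class can be recomputed by hand from this confusion matrix, which is a useful check on how each metric is defined. A minimal sketch in plain Python; `class_metrics` is a hypothetical helper.

```python
# Rows are true classes 0-4, columns are predicted classes 0-4 (copied from above).
confusion = [
    [52, 0, 0, 1, 0],
    [2, 42, 3, 0, 0],
    [1, 1, 41, 0, 4],
    [0, 0, 0, 68, 1],
    [0, 1, 3, 0, 46],
]


def class_metrics(matrix, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = matrix[k][k]
    actual = sum(matrix[k])                      # row total: true members of class k
    predicted = sum(row[k] for row in matrix)    # column total: predicted as class k
    recall = tp / float(actual)
    precision = tp / float(predicted)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(5)) / float(total)
ds_precision, ds_recall, ds_f1 = class_metrics(confusion, 0)
```

Recall uses the row total (53 true data scientists) and precision uses the column total (55 predicted data scientists), which is why the two differ even though both share the same 52 correct predictions.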


In [118]:
cvec = CountVectorizer()
logreg = LogisticRegression(penalty='l2')

pipe = Pipeline(steps=[('CountVec', cvec),
                       ('logreg',logreg)])
pipe.fit(X_train,y_train)
summary_pipe = obtain_scores(pipe,X_train,y_train,X_test,y_test)
summary = add_model('basic logreg penalty l2',summary_pipe,summary)
print summary_pipe['roc']
summary_pipe ['confusion matrix']

             precision    recall  f1-score   support

          0       0.95      0.98      0.96        53
          1       0.95      0.89      0.92        47
          2       0.87      0.87      0.87        47
          3       0.99      0.99      0.99        69
          4       0.90      0.92      0.91        50

avg / total       0.94      0.94      0.94       266

0.9497685185185185


Unnamed: 0,pred_0,pred_1,pred_2,pred_3,pred_4
is_0,52,0,0,1,0
is_1,2,42,3,0,0
is_2,1,1,41,0,4
is_3,0,0,0,68,1
is_4,0,1,3,0,46


In [119]:
cvec = CountVectorizer()
tfidf = TfidfTransformer()
logreg = LogisticRegression(penalty='l1')

pipe = Pipeline(steps=[('CountVec', cvec),
                       ('tfidf',tfidf),
                       ('logreg',logreg)])
pipe.fit(X_train,y_train)
summary_pipe = obtain_scores(pipe,X_train,y_train,X_test,y_test)
print summary_pipe['roc']
summary = add_model('tfidf logreg penalty l1',summary_pipe,summary)
summary_pipe ['confusion matrix']

             precision    recall  f1-score   support

          0       0.83      0.94      0.88        53
          1       0.91      0.62      0.73        47
          2       0.60      0.87      0.71        47
          3       0.89      0.90      0.89        69
          4       0.86      0.62      0.72        50

avg / total       0.82      0.80      0.80       266

0.8362037037037037


Unnamed: 0,pred_0,pred_1,pred_2,pred_3,pred_4
is_0,50,1,2,0,0
is_1,5,29,11,0,2
is_2,0,1,41,3,2
is_3,4,0,2,62,1
is_4,1,1,12,5,31
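To see what the tf-idf step changes relative to raw counts, the weighting can be sketched in plain Python. This assumes the scikit-learn defaults (`smooth_idf=True`, `norm='l2'`, no sublinear scaling); the function name `tfidf` and the three toy documents are made up for illustration.

```python
import math


def tfidf(docs):
    """docs: list of token lists. Returns (idf dict, one {term: weight} dict per doc)."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    # Smoothed idf: terms found in fewer documents get a larger weight.
    idf = {t: math.log((1.0 + n) / (1.0 + d)) + 1.0 for t, d in df.items()}
    rows = []
    for doc in docs:
        raw = {}
        for term in doc:
            raw[term] = raw.get(term, 0.0) + idf[term]   # term count * idf
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})   # unit L2 length
    return idf, rows


docs = [['machin', 'learn', 'model'],
        ['report', 'manag', 'model'],
        ['manag', 'report', 'stakehold']]
idf, rows = tfidf(docs)
```

Each document vector is rescaled to unit length, so document length no longer contributes to the features, and terms shared by many postings are down-weighted. Either effect may explain part of the drop from the count-based model, although the notebook does not test this.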


In [129]:
cvec = CountVectorizer()
tfidf = TfidfTransformer()
logreg = LogisticRegression(penalty='l1')

pipe = Pipeline(steps=[('CountVec', cvec),
                       ('tfidf',tfidf),
                       ('logreg',logreg)])
pipe.fit(X_train,y_train)
summary_pipe = obtain_scores(pipe,X_train,y_train,X_test,y_test)
print summary_pipe['roc']
summary = add_model('tfidf logreg penalty l1',summary_pipe,summary)
summary_pipe ['confusion matrix']

             precision    recall  f1-score   support

          0       0.83      0.94      0.88        53
          1       0.91      0.62      0.73        47
          2       0.60      0.87      0.71        47
          3       0.89      0.90      0.89        69
          4       0.86      0.62      0.72        50

avg / total       0.82      0.80      0.80       266

0.8362037037037037


Unnamed: 0,pred_0,pred_1,pred_2,pred_3,pred_4
is_0,50,1,2,0,0
is_1,5,29,11,0,2
is_2,0,1,41,3,2
is_3,4,0,2,62,1
is_4,1,1,12,5,31


In [79]:
gs_params = {
                'penalty':['l1','l2'],
                'solver':['liblinear'],
                'C':np.logspace(-5,0,50),
                'tol':np.logspace(-5,0,10)
            }

lr_gridsearch = GridSearchCV(LogisticRegression(), gs_params, cv=5, verbose=1)
lr_gridsearch.fit(cvec.fit_transform(X_train), y_train)

Fitting 5 folds for each of 1000 candidates, totalling 5000 fits


[Parallel(n_jobs=1)]: Done 5000 out of 5000 | elapsed:  7.5min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': array([1.00000e-05, 1.26486e-05, 1.59986e-05, 2.02359e-05, 2.55955e-05,
       3.23746e-05, 4.09492e-05, 5.17947e-05, 6.55129e-05, 8.28643e-05,
       1.04811e-04, 1.32571e-04, 1.67683e-04, 2.12095e-04, 2.68270e-04,
       3.39322e-04, 4.29193e-04, 5.42868e-...,
       5.99484e-03, 2.15443e-02, 7.74264e-02, 2.78256e-01, 1.00000e+00]), 'solver': ['liblinear']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)
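The "1000 candidates, totalling 5000 fits" line follows directly from the grid: every combination of the four parameter lists is tried once per fold. A sketch of that arithmetic without scikit-learn or NumPy; `logspace` here is a pure-Python stand-in written for this example.

```python
import itertools


def logspace(start, stop, num):
    """Pure-Python stand-in for np.logspace: num powers of 10, evenly spaced in exponent."""
    step = (stop - start) / float(num - 1)
    return [10 ** (start + i * step) for i in range(num)]


grid = {
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'],
    'C': logspace(-5, 0, 50),
    'tol': logspace(-5, 0, 10),
}
candidates = list(itertools.product(*grid.values()))   # 2 * 1 * 50 * 10 combinations
n_fits = len(candidates) * 5                           # cv=5 folds per candidate
```

Since the cost scales with the product of the list lengths, halving the number of `C` values or dropping `tol` from the grid would cut the 7.5 minute run roughly in proportion.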

In [80]:
lr_gridsearch.best_estimator_

LogisticRegression(C=0.05963623316594643, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.021544346900318846, verbose=0,
          warm_start=False)

In [121]:
cvec = CountVectorizer()
logreg = LogisticRegression(C=0.05963623316594643, class_weight=None, dual=False,
                              fit_intercept=True, intercept_scaling=1, max_iter=100,
                              multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
                              solver='liblinear', tol=0.021544346900318846, verbose=0,
                              warm_start=False)

pipe = Pipeline(steps=[('CountVec', cvec),
                       ('logreg',logreg)])
pipe.fit(X_train,y_train)
summary_pipe = obtain_scores(pipe,X_train,y_train,X_test,y_test)
print summary_pipe['roc']
summary = add_model('gridsearch logreg penalty l2',summary_pipe,summary)
summary_pipe ['confusion matrix']

             precision    recall  f1-score   support

          0       0.95      0.98      0.96        53
          1       0.95      0.89      0.92        47
          2       0.87      0.87      0.87        47
          3       0.97      0.99      0.98        69
          4       0.90      0.90      0.90        50

avg / total       0.93      0.93      0.93       266

0.9463425925925927


Unnamed: 0,pred_0,pred_1,pred_2,pred_3,pred_4
is_0,52,0,0,1,0
is_1,2,42,3,0,0
is_2,1,1,41,0,4
is_3,0,0,0,68,1
is_4,0,1,3,1,45


In [122]:
cvec = CountVectorizer()
tfidf = TfidfTransformer()
pipe = Pipeline(steps=[('CountVec', cvec),
                       ('tfidf',tfidf)])

gs_params = {
                'penalty':['l1','l2'],
                'solver':['liblinear'],
                'C':np.logspace(-5,0,50),
                'tol':np.logspace(-5,0,10)
            }

lr_gridsearch = GridSearchCV(LogisticRegression(), gs_params, cv=5, verbose=1)
lr_gridsearch.fit(pipe.fit_transform(X_train), y_train)
lr_gridsearch.best_estimator_

Fitting 5 folds for each of 1000 candidates, totalling 5000 fits


KeyboardInterrupt: 

In [98]:
cvec = CountVectorizer()
logreg = LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                              intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
                              penalty='l2', random_state=None, solver='liblinear',
                              tol=0.021544346900318846, verbose=0, warm_start=False)

pipe = Pipeline(steps=[('CountVec', cvec),
                       ('logreg',logreg)])
pipe.fit(X_train,y_train)
summary_pipe = obtain_scores(pipe,X_train,y_train,X_test,y_test)
print summary_pipe['roc']
summary = add_model('gridsearch logreg penalty l2',summary_pipe,summary)
summary_pipe ['confusion matrix']

             precision    recall  f1-score   support

          0       0.95      0.98      0.96        53
          1       0.95      0.89      0.92        47
          2       0.87      0.87      0.87        47
          3       0.99      0.99      0.99        69
          4       0.90      0.92      0.91        50

avg / total       0.94      0.94      0.94       266

0.9497685185185185


Unnamed: 0,pred_0,pred_1,pred_2,pred_3,pred_4
is_0,52,0,0,1,0
is_1,2,42,3,0,0
is_2,1,1,41,0,4
is_3,0,0,0,68,1
is_4,0,1,3,0,46


#### Using Naive Bayes

In [103]:
cvec = CountVectorizer()
bayes = MultinomialNB()

pipe = Pipeline(steps=[('CountVec', cvec),
                       ('bayes',bayes)])
pipe.fit(X_train,y_train)
summary_pipe = obtain_scores(pipe,X_train,y_train,X_test,y_test)
summary = add_model('bayes',summary_pipe,summary)
print summary_pipe['roc']
summary_pipe ['confusion matrix']

             precision    recall  f1-score   support

          0       0.88      1.00      0.94        53
          1       0.90      0.77      0.83        47
          2       0.71      0.94      0.81        47
          3       0.91      0.91      0.91        69
          4       0.94      0.66      0.78        50

avg / total       0.87      0.86      0.86       266



Unnamed: 0,pred_0,pred_1,pred_2,pred_3,pred_4
is_0,53,0,0,0,0
is_1,2,36,5,3,1
is_2,0,1,44,1,1
is_3,4,2,0,63,0
is_4,1,1,13,2,33


In [123]:
cvec = CountVectorizer()
tfidf = TfidfTransformer()
bayes = MultinomialNB()

pipe = Pipeline(steps=[('CountVec', cvec),
                       ('tfidf',tfidf),
                       ('bayes',bayes)])
pipe.fit(X_train,y_train)
summary_pipe = obtain_scores(pipe,X_train,y_train,X_test,y_test)
summary = add_model('tfidf bayes',summary_pipe,summary)
print summary_pipe['roc']
summary_pipe ['confusion matrix']

             precision    recall  f1-score   support

          0       0.79      1.00      0.88        53
          1       1.00      0.66      0.79        47
          2       0.62      0.98      0.76        47
          3       0.94      0.87      0.90        69
          4       0.97      0.58      0.72        50

avg / total       0.87      0.82      0.82       266

0.8147222222222222


Unnamed: 0,pred_0,pred_1,pred_2,pred_3,pred_4
is_0,53,0,0,0,0
is_1,5,31,10,1,0
is_2,0,0,46,1,0
is_3,7,0,1,60,1
is_4,2,0,17,2,29


#### Using Decision Trees

In [124]:
cvec = CountVectorizer()
tree = DecisionTreeClassifier(random_state=0)

pipe = Pipeline(steps=[('CountVec', cvec),
                       ('decision tree',tree)])
pipe.fit(X_train,y_train)
summary_pipe = obtain_scores(pipe,X_train,y_train,X_test,y_test)
summary = add_model('decision tree',summary_pipe,summary)
print summary_pipe['roc']
summary_pipe ['confusion matrix']

             precision    recall  f1-score   support

          0       0.96      0.98      0.97        53
          1       0.86      0.91      0.89        47
          2       0.80      0.83      0.81        47
          3       0.93      0.93      0.93        69
          4       0.86      0.76      0.81        50

avg / total       0.89      0.89      0.89       266

0.8964814814814814


Unnamed: 0,pred_0,pred_1,pred_2,pred_3,pred_4
is_0,52,0,0,1,0
is_1,0,43,0,2,2
is_2,0,7,39,0,1
is_3,2,0,0,64,3
is_4,0,0,10,2,38


In [125]:
cvec = CountVectorizer()
tfidf = TfidfTransformer()
tree = DecisionTreeClassifier(random_state=0)

pipe = Pipeline(steps=[('CountVec', cvec),
                       ('tfidf',tfidf),
                       ('decision tree',tree)])
pipe.fit(X_train,y_train)
summary_pipe = obtain_scores(pipe,X_train,y_train,X_test,y_test)
summary = add_model('tfidf decision tree',summary_pipe,summary)
print summary_pipe['roc']
summary_pipe ['confusion matrix']

             precision    recall  f1-score   support

          0       0.83      1.00      0.91        53
          1       0.83      0.85      0.84        47
          2       0.95      0.79      0.86        47
          3       0.94      0.91      0.93        69
          4       0.79      0.76      0.78        50

avg / total       0.87      0.87      0.87       266

0.8398611111111112


Unnamed: 0,pred_0,pred_1,pred_2,pred_3,pred_4
is_0,53,0,0,0,0
is_1,1,40,1,1,4
is_2,1,2,37,1,6
is_3,6,0,0,63,0
is_4,3,6,1,2,38


#### Using RFECV on logreg

In [126]:
cvec = CountVectorizer()
logreg = LogisticRegression(penalty='l2')
rfecv = RFECV(logreg, step=10, cv=3)

pipe = Pipeline(steps=[('CountVec', cvec),
                       ('rfecv',rfecv)])
pipe.fit(X_train,y_train)
summary_pipe = obtain_scores(pipe,X_train,y_train,X_test,y_test)
summary = add_model('rfecv logreg penalty l2',summary_pipe,summary)
print summary_pipe['roc']
summary_pipe ['confusion matrix']

             precision    recall  f1-score   support

          0       0.95      0.98      0.96        53
          1       0.95      0.89      0.92        47
          2       0.87      0.87      0.87        47
          3       0.99      0.99      0.99        69
          4       0.90      0.92      0.91        50

avg / total       0.94      0.94      0.94       266

0.9497685185185185


Unnamed: 0,pred_0,pred_1,pred_2,pred_3,pred_4
is_0,52,0,0,1,0
is_1,2,42,3,0,0
is_2,1,1,41,0,4
is_3,0,0,0,68,1
is_4,0,1,3,0,46


In [127]:
cvec = CountVectorizer()
logreg = LogisticRegression(penalty='l1')
rfecv = RFECV(logreg, step=10, cv=3)

pipe = Pipeline(steps=[('CountVec', cvec),
                       ('rfecv',rfecv)])
pipe.fit(X_train,y_train)
summary_pipe = obtain_scores(pipe,X_train,y_train,X_test,y_test)
summary = add_model('rfecv logreg penalty l1',summary_pipe,summary)
print summary_pipe['roc']
summary_pipe ['confusion matrix']

             precision    recall  f1-score   support

          0       0.93      0.98      0.95        53
          1       0.86      0.91      0.89        47
          2       0.91      0.85      0.88        47
          3       1.00      0.94      0.97        69
          4       0.86      0.88      0.87        50

avg / total       0.92      0.92      0.92       266

0.9221296296296296


Unnamed: 0,pred_0,pred_1,pred_2,pred_3,pred_4
is_0,52,1,0,0,0
is_1,2,43,1,0,1
is_2,1,1,40,0,5
is_3,1,2,0,65,1
is_4,0,3,3,0,44


### Analysis on 'coefficients' 

In [130]:
summary.reset_index(drop=True).sort_values('ROC',ascending = False)

Unnamed: 0,model,accuracy,recall_DS,precision_DS,F1_score_DS,ROC
0,basic logreg penalty l2,0.93609,0.981132,0.945455,0.962963,0.949769
1,basic logreg penalty l2,0.93609,0.981132,0.945455,0.962963,0.949769
8,rfecv logreg penalty l2,0.93609,0.981132,0.945455,0.962963,0.949769
4,gridsearch logreg penalty l2,0.932331,0.981132,0.945455,0.962963,0.946343
9,rfecv logreg penalty l1,0.917293,0.981132,0.928571,0.954128,0.92213
6,decision tree,0.887218,0.981132,0.962963,0.971963,0.896481
3,tfidf logreg penalty l2,0.887218,1.0,0.898305,0.946429,0.87713
7,tfidf decision tree,0.868421,1.0,0.828125,0.905983,0.839861
2,tfidf logreg penalty l1,0.800752,0.943396,0.833333,0.884956,0.836204
10,tfidf logreg penalty l1,0.800752,0.943396,0.833333,0.884956,0.836204
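The ROC column is easier to interpret once it is clear what `obtain_scores` passes to `roc_curve`: the predicted class labels 0 to 4, treated as an ordinal score, with class 4 ('other') as the positive class. The sketch below recomputes that quantity in plain Python from the confusion matrix of the basic L2 logistic regression, using the fact that the area under a ROC curve equals the probability that a positive outscores a negative, with ties counted as one half. `pairwise_auc` is a hypothetical helper.

```python
def pairwise_auc(pos_scores, neg_scores):
    """P(pos score > neg score) + 0.5 * P(tie), i.e. the area under the ROC curve."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Predicted labels read off the confusion matrix of the basic L2 model.
# Rows whose true class is 4 ('other'): 1 predicted as 1, 3 as 2, 46 as 4.
pos = [1] * 1 + [2] * 3 + [4] * 46
# All other rows, grouped by predicted label 0..4 (column sums minus the class-4 row).
neg = [0] * 55 + [1] * 43 + [2] * 44 + [3] * 69 + [4] * 5
roc_value = pairwise_auc(pos, neg)
```

The result agrees with the 0.949769 in the table, which suggests the column measures how well the 'other' class is separated under an arbitrary ordering of the labels, rather than overall multiclass performance. Accuracy and the per-class scores are therefore the safer basis for comparing the models.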


In [131]:
cvec = CountVectorizer()
tfidf = TfidfTransformer()
logreg = LogisticRegression(penalty='l2')

X_train_new = cvec.fit_transform(X_train)
X_train_new = tfidf.fit_transform(X_train_new)
logreg.fit(X_train_new,y_train)

# coef_[0] holds the one-vs-rest coefficients for class 0 (Data Scientist).
pd.DataFrame([abs(logreg.coef_[0]),logreg.coef_[0]],
             columns=cvec.get_feature_names(),
             index=['abs_coefficient','coefficient']).T.\
             sort_values('abs_coefficient',ascending=False)

Unnamed: 0,abs_coefficient,coefficient
learn,2.763041,2.763041
machin,2.268495,2.268495
model,2.014537,2.014537
statist,1.955771,1.955771
manag,1.608839,-1.608839
research,1.435420,1.435420
report,1.421285,-1.421285
phd,1.382348,1.382348
data,1.330662,1.330662
scienc,1.310000,1.310000


In [None]:
# logreg.classes_ holds the integer labels 0-4, so sort on column 0 (Data Scientist).
pd.DataFrame(abs(logreg.coef_),columns = cvec.get_feature_names(), 
             index = logreg.classes_)\
            .T.sort_values(0,ascending = False)

In [133]:
cvec = CountVectorizer()
logreg = LogisticRegression(penalty='l1')
rfecv = RFECV(logreg, step=10, cv=3)

X_train_new = cvec.fit_transform(X_train)
rfecv.fit(X_train_new,y_train)
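# NOTE: `pipe` is still the tf-idf + L1 pipeline fitted in an earlier cell, so the scores
# below describe that pipeline rather than the `rfecv` object fitted above.
summary_pipe = obtain_scores(pipe,X_train,y_train,X_test,y_test)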
print summary_pipe['roc']
summary_pipe ['confusion matrix']

             precision    recall  f1-score   support

          0       0.83      0.94      0.88        53
          1       0.91      0.62      0.73        47
          2       0.60      0.87      0.71        47
          3       0.89      0.90      0.89        69
          4       0.86      0.62      0.72        50

avg / total       0.82      0.80      0.80       266

0.8362037037037037


Unnamed: 0,pred_0,pred_1,pred_2,pred_3,pred_4
is_0,50,1,2,0,0
is_1,5,29,11,0,2
is_2,0,1,41,3,2
is_3,4,0,2,62,1
is_4,1,1,12,5,31


In [140]:
# Keep only the terms RFECV selected; a set makes the per-word membership test fast.
rfecv_columns = set(np.array(cvec.get_feature_names())[rfecv.support_])
X_train_best = X_train.map(lambda x: ' '.join([word for word in x.split() if word in rfecv_columns]))

In [142]:
cvec = CountVectorizer()
logreg = LogisticRegression(penalty='l1')

X_train_new = cvec.fit_transform(X_train_best)
logreg.fit(X_train_new,y_train)

# coef_[0] holds the one-vs-rest coefficients for class 0 (Data Scientist).
pd.DataFrame([abs(logreg.coef_[0]),logreg.coef_[0]],
             columns=cvec.get_feature_names(),
             index=['abs_coefficient','coefficient']).T.\
             sort_values('abs_coefficient',ascending=False).head(20)

Unnamed: 0,abs_coefficient,coefficient
phd,2.597537,2.597537
hadoop,1.919791,1.919791
high,1.348101,-1.348101
analyz,1.329823,-1.329823
engin,1.293707,-1.293707
travel,1.235036,1.235036
minimum,1.162412,-1.162412
scientist,1.036516,1.036516
cognit,1.016169,1.016169
mufg,0.998309,0.998309
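Since the features here are raw term counts and the model is one-vs-rest logistic regression, each coefficient can be read as a change in log-odds per extra occurrence of the term, with the other terms held fixed. A small sketch that converts a few of the coefficients above into odds multipliers; the variable names are illustrative.

```python
import math

# Coefficients for class 0 (Data Scientist), copied from the table above.
coefficients = {'phd': 2.597537, 'hadoop': 1.919791, 'high': -1.348101,
                'analyz': -1.329823, 'engin': -1.293707}

# exp(coefficient) is the factor by which the odds of 'Data Scientist vs. rest'
# are multiplied for each additional occurrence of the term.
odds_ratio = {term: math.exp(beta) for term, beta in coefficients.items()}
ranked = sorted(odds_ratio, key=odds_ratio.get, reverse=True)
```

Values above 1 (phd, hadoop) push a posting towards Data Scientist, while values below 1 (high, analyz, engin) push it towards the other classes. With an L1 penalty the exact magnitudes depend on the regularisation strength, so the ranking is more informative than the absolute numbers.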


# Conclusion
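What components of a job posting distinguish data scientists from other data jobs?
- The best models (count-vectorized logistic regression with an L2 penalty) score well on every metric: about 0.94 accuracy on the test set, with 0.98 recall and 0.95 precision for the Data Scientist class.
- This suggests that the text of a job description does reflect the job title.
- Stemmed terms such as phd, hadoop, analyz, engin and travel carry the largest coefficients for the Data Scientist class: phd, hadoop and travel push a posting towards Data Scientist, while analyz and engin push it away.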