# What is the difference between a Data Scientist and a Data Analyst?

As our online and in-person data science programs mature, both the Product and Curriculum teams want data-driven insights into the data science job market, so that the Flatiron School's offerings and marketing fit the evolving market our graduates are entering.

The first step in this exploration is to define more clearly the difference between a data scientist and a data analyst. It is well known in the data science community that the general market, hiring managers, and HR recruiters understand the field very differently, and that roles with widely varying titles overlap significantly. To get some clarity, we scraped a dataset from LinkedIn covering data scientist and data analyst roles and filtered it for roles in NYC, Atlanta, and Kansas City, MO. The aim is to include listings from a major tech hub, a developing tech hub, and a non-technical job market, so that the sample represents the variety of job markets our graduates will enter.

In [2]:
#library imports - obtain, scrub, explore
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

OBTAIN

In [3]:
# read_csv already returns a DataFrame, so no extra wrapping is needed
df = pd.read_csv('Data/jobs.csv')
df.head()

Unnamed: 0,company,description,job_cat,loc,location,position,position_low
0,Loftium,About the role\nMachine learning is core to ou...,machine learning,"Seattle, WA","Seattle, Washington, United States",Machine Learning Engineer,
1,Zume Inc.,Who We Are\n\nZume is on a quest to be the mos...,machine learning,"Seattle, WA","Seattle, WA, US",Machine Learning Engineer - Platform,
2,"TRC Staffing Services, Inc.",The goal is to lead the processes from infrast...,machine learning,"Seattle, WA","Seattle, WA, US",Machine Learning Engineer,
3,bidco,We are looking for a Machine Learning Engineer...,machine learning,"Seattle, WA","Seattle, WA, US",Machine Learning Engineer,
4,Compass /,Engineering\n\nMachine Learning Engineer\n\nSe...,machine learning,"Seattle, WA","Seattle, WA, US",Machine Learning Engineer,


In [4]:
df = df.drop(['location','position_low'], axis=1)

In [5]:
locations = ['NY', 'KCMO', 'ATL']
roles = ['data scientist', 'data analyst']
# df['loc'].unique()
df_filteredbyloc = df[df['loc'].isin(locations)]
df_filteredbyloc = df_filteredbyloc[df_filteredbyloc['job_cat'].isin(roles)].reset_index()
df_filteredbyloc.head()

Unnamed: 0,index,company,description,job_cat,loc,position
0,2078,PRI Technology,Sr. Data Scientist\n\nThe Sr. Data Scientist i...,data scientist,ATL,Sr. Data Scientist
1,2079,StevenDouglas,POSITION SUMMARY:\nThe Data Scientist provides...,data scientist,ATL,Data Scientist
2,2080,Experience LLC,When your team hits a game winner or the band ...,data scientist,ATL,Data Scientist
3,2081,Arby's,Purpose Of The Position\n\nAn Inspire data sci...,data scientist,ATL,Data Scientist
4,2082,Collabera Inc.,"Atlanta, Georgia\nSkills : python ,r ,scala ,j...",data scientist,ATL,Data Scientist


DATA CLEANING - NLP PREP

In [6]:
import nltk, re
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk import word_tokenize
import string
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from gensim.models import Word2Vec, Doc2Vec, TfidfModel
from gensim.models.phrases import Phraser, Phrases

[nltk_data] Downloading package punkt to /Users/blewis2/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [7]:
stopword_set = stopwords.words('english')
stopword_set = stopword_set + ["need", "goal", "include", "looking", "seeking"] + list(string.punctuation)

In [8]:
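# NOTE: cleaner() compares one space-delimited token at a time, so these
# two-word entries never equal a token and the title words are not removed.
stopword_set_titles = stopword_set + ['data scientist', 'data analyst']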

In [9]:
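def cleaner(text, stopwords=stopwords.words('english'), symbols = "?&()*%$#@.!:;^"):
    """Lowercase a description, strip symbols, and return its non-stopword tokens."""
    # newlines and asterisks become spaces; hyphens become underscores so that
    # hyphenated terms (e.g. "large-scale") survive as a single token
    text = text.replace("\n", " ").replace("*", " ").strip().replace("-", "_")
    for symbol in symbols:
        text = text.replace(symbol, " ")
    text = text.lower()
    text = re.sub(" +", " ", text)
    # commas are not in `symbols`, so tokens such as "quality," keep their comma
    return [word for word in text.split(" ") if word and word not in stopwords]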
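Because `cleaner` tests one space-delimited token at a time, a multi-word stopword entry can never equal a token. The sketch below illustrates this with a simplified stand-in for `cleaner`; the helper name `drop_stopwords`, the sample sentence, and both stopword sets are hypothetical and only serve the illustration.

```python
def drop_stopwords(text, stopwords):
    """Simplified stand-in for cleaner(): lowercase, split on spaces, filter tokens."""
    return [tok for tok in text.lower().split(" ") if tok and tok not in stopwords]

# Hypothetical sample text, not taken from the dataset
sample = "Sr Data Scientist needed to support the Data Analyst team"

phrase_stopwords = {"data scientist", "data analyst"}  # two-word entries
token_stopwords = {"data", "scientist", "analyst"}     # one-word entries

kept_with_phrases = drop_stopwords(sample, phrase_stopwords)
kept_with_tokens = drop_stopwords(sample, token_stopwords)

print(kept_with_phrases)  # title words are still present
print(kept_with_tokens)   # title words are gone
```

If the intent is to hide the job titles from the model, single-word entries (or a phrase-level replacement on the raw text before tokenizing) would be one way to get there. Note that dropping `data` as a token also removes it everywhere else it occurs.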

In [10]:
descriptions = df_filteredbyloc.description 
descriptions = [cleaner(d, stopwords=stopword_set) for d in descriptions]
df_filteredbyloc['descriptions'] = descriptions

In [11]:
descriptions[:10]

[['sr',
  'data',
  'scientist',
  'sr',
  'data',
  'scientist',
  'driven',
  'interest',
  'solving',
  'complex',
  'problems',
  'data',
  'exploration',
  'analysis',
  'highly',
  'analytically',
  'minded',
  'robust',
  'technical',
  'tool',
  'kit',
  'substantial',
  'experience',
  'analytics',
  'portfolio',
  'successfully',
  'executed',
  'models',
  'sr',
  'data',
  'scientist',
  'responsible',
  'executing',
  'steps',
  'modeling',
  'process',
  'produce',
  'high',
  'quality,',
  'reproducible',
  'results',
  'work',
  'team',
  'analyst',
  'support',
  'development',
  'processes',
  'technologies',
  'improve',
  'patient',
  'outcomes',
  'role',
  'responsible',
  'extracting,',
  'managing,',
  'analyzing',
  'complex',
  'administrative',
  'healthcare',
  'data',
  'applying',
  'client/ai',
  'concepts',
  'uncover',
  'opportunities',
  'process',
  'improvement',
  'responsibilities',
  'develop',
  'algorithms',
  'applications',
  'apply',
  'math

In [12]:
bigram = Phrases(sentences = descriptions, min_count=1, threshold=1, max_vocab_size=5000)
phrased_descriptions = [bigram[d] for d in descriptions]
df_filteredbyloc['phrased_descriptions'] = phrased_descriptions
df_filteredbyloc['cleaned_phrased_descriptions'] = [' '.join(d) for d in df_filteredbyloc.phrased_descriptions]
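To give a feel for what `min_count=1` and `threshold=1` do, here is a simplified, dependency-free sketch of the bigram scoring rule that gensim documents as its default: `(bigram_count - min_count) / count_a / count_b * vocab_size`. This is not gensim's code. The function name `score_bigrams`, the toy sentences, and the choice to count both unigrams and bigrams toward the vocabulary size are assumptions made for illustration.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score each adjacent token pair with the simplified default rule."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams) + len(bigrams)  # assumption: both kinds are counted
    return {
        pair: (count - min_count) / unigrams[pair[0]] / unigrams[pair[1]] * vocab_size
        for pair, count in bigrams.items()
    }

# Hypothetical toy corpus of already-cleaned token lists
toy = [
    ["machine", "learning", "models"],
    ["machine", "learning", "experience"],
    ["learning", "python"],
]

scores = score_bigrams(toy, min_count=1)
threshold = 1
joined = sorted("_".join(pair) for pair, score in scores.items() if score > threshold)
print(joined)
```

Under this rule a pair that occurs only once scores zero when `min_count=1`, so only repeated pairs can be merged into a single `a_b` token.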

In [13]:
df_filteredbyloc['phrased_descriptions'].head()

0    [sr, data, scientist, sr, data, scientist, dri...
1    [position, summary, data, scientist, provides,...
2    [team, hits, game, winner, band, plays, favori...
3    [purpose, position, inspire, data, scientist, ...
4    [atlanta,, georgia, skills, python, ,r, ,scala...
Name: phrased_descriptions, dtype: object

In [14]:
df_filteredbyloc['cleaned_phrased_descriptions'][0]

"sr data scientist sr data scientist driven interest solving complex problems data exploration analysis highly analytically minded robust technical tool kit substantial experience analytics portfolio successfully executed models sr data scientist responsible executing steps modeling process produce high quality, reproducible results work team analyst support development processes technologies improve patient outcomes role responsible extracting, managing, analyzing complex administrative healthcare data applying client/ai concepts uncover opportunities process improvement responsibilities develop algorithms applications apply mathematics data, perform large_scale experimentation develop functionality work large_scale real_world health data evidence generation drive innovation via use data driven insights customers biggest problems build advocate data_driven solutions help customers improve clinical outcomes processes design build scalable production_ready analytics solutions using wide

In [15]:
y = df_filteredbyloc['job_cat']
X = df_filteredbyloc['cleaned_phrased_descriptions']

In [16]:
#vectorize data
count_vectorizer = CountVectorizer()
tfidf_vectorizer = TfidfVectorizer()

X_t = tfidf_vectorizer.fit_transform(X)

In [17]:
X_t.shape

(1175, 15311)
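Each of the 1175 rows is one description and each of the 15311 columns is one term. As a rough guide to what the cell values mean, the sketch below recomputes TF-IDF by hand using the formula that scikit-learn documents for its defaults: raw term counts, smoothed `idf = ln((1 + n) / (1 + df)) + 1`, and L2 normalization per row. The function name `tfidf_rows` and the three toy documents are hypothetical, and the tokenization is a plain whitespace split rather than the vectorizer's token pattern.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

# Hypothetical toy documents
docs = ["python machine_learning models", "excel reporting reports", "python excel"]
rows = tfidf_rows(docs)
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

Terms that appear in fewer documents receive a larger weight within a row, which is why rare tokens can outrank common ones in a single description.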

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

In [19]:
# for non-pipeline model
X_train_np, X_test_np, y_train_np, y_test_np = train_test_split(X_t.toarray(), y, test_size=0.2,
                                                    random_state=0)

In [20]:
X_train[:3]

1092    design, develop, troubleshoot debug software p...
453     founded 2014, octane lending fintech startup v...
892     focus data science team digitizing recruiting ...
Name: cleaned_phrased_descriptions, dtype: object

SCRUB

In [15]:
# flag any column that contains at least one missing value
df_filteredbyloc.isna().any()

index                           False
company                         False
description                     False
job_cat                         False
loc                             False
position                        False
descriptions                    False
phrased_descriptions            False
cleaned_phrased_descriptions    False
dtype: bool

EXPLORE

In [16]:
df_filteredbyloc.groupby('loc').count()

Unnamed: 0_level_0,index,company,description,job_cat,position,descriptions,phrased_descriptions,cleaned_phrased_descriptions
loc,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
ATL,282,282,282,282,282,282,282,282
KCMO,96,96,96,96,96,96,96,96
NY,797,797,797,797,797,797,797,797


MODEL

In [21]:
#library imports - model
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_curve, auc, confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import MultinomialNB

In [22]:
len(tfidf_vectorizer.get_feature_names())

15311

In [23]:
#classification model pipelines
pipelines = {
    'multinomialnb': make_pipeline(TfidfVectorizer(), MultinomialNB()),
    'logisticregression': make_pipeline(TfidfVectorizer(), LogisticRegression()),
    'randomforestclassifier': make_pipeline(TfidfVectorizer(), RandomForestClassifier()),
    'gradientboostingclassifier': make_pipeline(TfidfVectorizer(), GradientBoostingClassifier())
}
#define hyperparameters for each model 
multinomialnb_hyperparameters = {
    'multinomialnb__alpha': np.linspace(0.5, 1.5, 6),
    'multinomialnb__fit_prior': [True, False]
}
logisticregression_hyperparameters = {
    'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}
randomforestclassifier_hyperparameters = {
    'randomforestclassifier__n_estimators': [100, 200], 
    'randomforestclassifier__max_features': ['auto', 'sqrt',0.33]
}
gradientboostingclassifier_hyperparameters = {
    'gradientboostingclassifier__n_estimators': [100, 200], 
    'gradientboostingclassifier__learning_rate': [0.05, 0.1, 0.2],
    'gradientboostingclassifier__max_depth': [1, 3, 5]
}
#set hyperparameter dictionary
hyperparameters = {
    'multinomialnb': multinomialnb_hyperparameters,
    'logisticregression': logisticregression_hyperparameters,
    'randomforestclassifier': randomforestclassifier_hyperparameters,
    'gradientboostingclassifier': gradientboostingclassifier_hyperparameters
}
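The keys above follow the `<step name>__<parameter>` convention, where `make_pipeline` names each step after its lowercased class name. To get a sense of the search cost before running it, the dependency-free sketch below counts the candidate combinations in each grid. It assumes 10-fold cross-validation, and the `alpha` list writes out the values of `np.linspace(0.5, 1.5, 6)` by hand; the helper `count_candidates` is a hypothetical name.

```python
from itertools import product

grids = {
    "multinomialnb": {
        "multinomialnb__alpha": [0.5, 0.7, 0.9, 1.1, 1.3, 1.5],  # np.linspace(0.5, 1.5, 6)
        "multinomialnb__fit_prior": [True, False],
    },
    "logisticregression": {
        "logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    },
    "randomforestclassifier": {
        "randomforestclassifier__n_estimators": [100, 200],
        "randomforestclassifier__max_features": ["auto", "sqrt", 0.33],
    },
    "gradientboostingclassifier": {
        "gradientboostingclassifier__n_estimators": [100, 200],
        "gradientboostingclassifier__learning_rate": [0.05, 0.1, 0.2],
        "gradientboostingclassifier__max_depth": [1, 3, 5],
    },
}
cv_folds = 10  # assumption: 10-fold cross-validation

def count_candidates(grid):
    """Number of parameter combinations an exhaustive grid search would try."""
    return len(list(product(*grid.values())))

for name, grid in grids.items():
    n = count_candidates(grid)
    print(name, n, "candidates,", n * cv_folds, "cross-validation fits")

total_fits = sum(count_candidates(g) * cv_folds for g in grids.values())
print("total cross-validation fits:", total_fits)
```

The gradient boosting grid dominates the count, which is worth keeping in mind if the search needs to be rerun on a larger scrape.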

In [56]:
fitted_models = {}

# grid-search each pipeline with 10-fold CV, scored by ROC AUC
for name, pipeline in pipelines.items():
    model = GridSearchCV(pipeline, hyperparameters[name], cv=10, n_jobs=-1, scoring='roc_auc')
    model.fit(X_train, y_train)
    fitted_models[name] = model
    print(name, 'has been fitted.')

# best mean cross-validated ROC AUC for each model
for name, model in fitted_models.items(): 
    print(name, model.best_score_)

multinomialnb has been fitted.
logisticregression has been fitted.
randomforestclassifier has been fitted.
gradientboostingclassifier has been fitted.
multinomialnb 0.9639052325801897
logisticregression 0.9830256500153026
randomforestclassifier 0.9898190450801345
gradientboostingclassifier 0.9902678678686624


In [65]:
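# NOTE: scikit-learn's signature is (y_true, y_pred). With predictions passed
# first, the matrix rows are predicted labels, and the precision and recall
# columns of the report are swapped relative to the usual reading.
for name, model in fitted_models.items(): 
    print(name, model.best_score_)
    y_hat = model.predict(X_test)
    print(confusion_matrix(y_hat, y_test))
    print(classification_report(y_hat, y_test))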

multinomialnb 0.9639052325801897
[[ 71   2]
 [ 35 127]]
                precision    recall  f1-score   support

  data analyst       0.67      0.97      0.79        73
data scientist       0.98      0.78      0.87       162

     micro avg       0.84      0.84      0.84       235
     macro avg       0.83      0.88      0.83       235
  weighted avg       0.89      0.84      0.85       235

logisticregression 0.9830256500153026
[[101   7]
 [  5 122]]
                precision    recall  f1-score   support

  data analyst       0.95      0.94      0.94       108
data scientist       0.95      0.96      0.95       127

     micro avg       0.95      0.95      0.95       235
     macro avg       0.95      0.95      0.95       235
  weighted avg       0.95      0.95      0.95       235

randomforestclassifier 0.9898190450801345
[[ 99   1]
 [  7 128]]
                precision    recall  f1-score   support

  data analyst       0.93      0.99      0.96       100
data scientist       0.99  

In [None]:
fitted_models['randomforestclassifier']

In [None]:
fitted_models['randomforestclassifier'].best_params_

In [None]:
# Save the fitted random forest grid search to a file
import pickle 

# 'wb' opens the file for writing in binary mode; the with-block closes it
with open("Tfidf_randomforestclassifier.pickle", 'wb') as save_classifier:
    pickle.dump(fitted_models['randomforestclassifier'], save_classifier)

In [None]:
randomforestclassifier = fitted_models['randomforestclassifier']

In [None]:
# Retrieve the saved file and load it into an object

# 'rb' opens the file for reading in binary mode
with open("Tfidf_randomforestclassifier.pickle", 'rb') as model_randomforestclassifier:
    grid = pickle.load(model_randomforestclassifier)
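A quick way to confirm that a save and load cycle preserves an object is to round-trip it through a temporary file. The sketch below does this with a plain dictionary standing in for the fitted model; the `fitted` contents and the file name `model.pickle` are hypothetical. It uses `with` blocks so the file handles are closed even if an error occurs.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model object
fitted = {"best_params": {"n_estimators": 100, "max_features": 0.33}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pickle")
    with open(path, "wb") as fh:   # write in binary mode
        pickle.dump(fitted, fh)
    with open(path, "rb") as fh:   # read in binary mode
        restored = pickle.load(fh)

print(restored == fitted)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.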

In [24]:
# non-pipeline rf_classifier for convenience
clf = RandomForestClassifier(max_features=0.33, n_estimators=100, random_state=0)
clf.fit(X_train_np, y_train_np)
y_hat_np = clf.predict(X_test_np)
print(confusion_matrix(y_hat_np, y_test_np))
print(classification_report(y_hat_np, y_test_np))

[[ 99   1]
 [  7 128]]
                precision    recall  f1-score   support

  data analyst       0.93      0.99      0.96       100
data scientist       0.99      0.95      0.97       135

     micro avg       0.97      0.97      0.97       235
     macro avg       0.96      0.97      0.97       235
  weighted avg       0.97      0.97      0.97       235
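scikit-learn's `confusion_matrix` and `classification_report` take `(y_true, y_pred)`, while the calls above pass the predictions first. The printed matrix therefore has predicted labels on the rows and true labels on the columns. The dependency-free sketch below recomputes precision and recall from that matrix under the conventional definitions; the helper name `precision_recall` is hypothetical.

```python
labels = ["data analyst", "data scientist"]

# Matrix as printed above: rows are predicted labels, columns are true labels
printed = [[99, 1],
           [7, 128]]

def precision_recall(rows_pred_cols_true, k):
    """Precision and recall for class k, given a rows=predicted matrix."""
    predicted_k = sum(rows_pred_cols_true[k])              # everything predicted as class k
    actual_k = sum(row[k] for row in rows_pred_cols_true)  # everything truly in class k
    hits = rows_pred_cols_true[k][k]
    return hits / predicted_k, hits / actual_k

results = {label: precision_recall(printed, k) for k, label in enumerate(labels)}
for label, (precision, recall) in results.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

Read this way, the precision and recall figures for each class trade places compared with the report above, and the true class sizes in the test set are 106 analysts and 129 scientists rather than the 100 and 135 shown in the support column.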



### Run model with 'data scientist' and 'data analyst' added to stopwords

In [25]:
descriptions_2 = df_filteredbyloc.description 
descriptions_2 = [cleaner(d, stopwords=stopword_set_titles) for d in descriptions_2]
df_filteredbyloc['descriptions_2'] = descriptions_2

In [26]:
bigram = Phrases(sentences = descriptions_2, min_count=1, threshold=1, max_vocab_size=5000)
phrased_descriptions_2 = [bigram[d] for d in descriptions_2]
df_filteredbyloc['phrased_descriptions_2'] = phrased_descriptions_2
df_filteredbyloc['cleaned_phrased_descriptions_2'] = [' '.join(d) for d in df_filteredbyloc.phrased_descriptions_2]

In [27]:
y_2 = df_filteredbyloc['job_cat']
X_2 = df_filteredbyloc['cleaned_phrased_descriptions_2']

X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_2, y_2, test_size=0.2,
                                                    random_state=0)

In [64]:
fitted_models_2 = {}

# grid-search each pipeline on the second feature set
for name, pipeline in pipelines.items():
    model = GridSearchCV(pipeline, hyperparameters[name], cv=10, n_jobs=-1, scoring='roc_auc')
    model.fit(X_train_2, y_train_2)
    fitted_models_2[name] = model
    print(name, 'has been fitted.')

# best mean cross-validated ROC AUC for each model
for name, model in fitted_models_2.items(): 
    print(name, model.best_score_)

multinomialnb has been fitted.
logisticregression has been fitted.
randomforestclassifier has been fitted.
gradientboostingclassifier has been fitted.
multinomialnb 0.9639052325801897
logisticregression 0.9830256500153026
randomforestclassifier 0.9899692028266481
gradientboostingclassifier 0.9900376567432334


In [66]:
for name, model in fitted_models_2.items(): 
    print(name, model.best_score_)
    y_hat_2 = model.predict(X_test_2)
    print(confusion_matrix(y_hat_2, y_test_2))
    print(classification_report(y_hat_2, y_test_2))

multinomialnb 0.9639052325801897
[[ 71   2]
 [ 35 127]]
                precision    recall  f1-score   support

  data analyst       0.67      0.97      0.79        73
data scientist       0.98      0.78      0.87       162

     micro avg       0.84      0.84      0.84       235
     macro avg       0.83      0.88      0.83       235
  weighted avg       0.89      0.84      0.85       235

logisticregression 0.9830256500153026
[[101   7]
 [  5 122]]
                precision    recall  f1-score   support

  data analyst       0.95      0.94      0.94       108
data scientist       0.95      0.96      0.95       127

     micro avg       0.95      0.95      0.95       235
     macro avg       0.95      0.95      0.95       235
  weighted avg       0.95      0.95      0.95       235

randomforestclassifier 0.9899692028266481
[[100   1]
 [  6 128]]
                precision    recall  f1-score   support

  data analyst       0.94      0.99      0.97       101
data scientist       0.99  

In [49]:
X_t_2 = tfidf_vectorizer.fit_transform(X_2)

In [50]:
# for non-pipeline model with ds/da added to stopwords
X_train_2_np, X_test_2_np, y_train_2_np, y_test_2_np = train_test_split(X_t_2.toarray(), y_2, test_size=0.2,
                                                    random_state=0)

In [51]:
# non-pipeline rf_classifier with ds/da added to stopwords for convenience
clf_2 = RandomForestClassifier(max_features=0.33, n_estimators=100, random_state=0)
clf_2.fit(X_train_2_np, y_train_2_np)
y_hat_2_np = clf_2.predict(X_test_2_np)  # predict with clf_2, the model fitted in this cell
print(confusion_matrix(y_hat_2_np, y_test_2_np))
print(classification_report(y_hat_2_np, y_test_2_np))

[[ 99   1]
 [  7 128]]
                precision    recall  f1-score   support

  data analyst       0.93      0.99      0.96       100
data scientist       0.99      0.95      0.97       135

     micro avg       0.97      0.97      0.97       235
     macro avg       0.96      0.97      0.97       235
  weighted avg       0.97      0.97      0.97       235



## Extracting Feature Importances of Each Class

In [59]:
y.value_counts()

data scientist    633
data analyst      542
Name: job_cat, dtype: int64

In [28]:
top_10 = clf.feature_importances_.argsort()[-10:]

In [31]:
# build the vocabulary list once instead of on every iteration
feature_names = tfidf_vectorizer.get_feature_names()
for index in top_10:
    print(feature_names[index])

reporting
models
algorithms
reports
excel
science
python
machine_learning
analyst
scientist


In [32]:
sorted(clf.feature_importances_)[-10:]

[0.0055644793599725585,
 0.00762846660364644,
 0.011749058717741132,
 0.015599509852322438,
 0.018111262549433987,
 0.03504427377414214,
 0.04892389581669526,
 0.12497422654671285,
 0.21824899401101255,
 0.27150730556752606]

In [41]:
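# TF-IDF weights of the first description (row 0 of X_t), keyed by term.
# These are per-document weights, not feature importances from the classifier.
first_map = dict(zip(tfidf_vectorizer.get_feature_names(), X_t.toarray()[0]))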


In [52]:
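# TF-IDF weights of the first description after refitting on X_2 (row 0 of X_t_2).
# As above, these are per-document weights, not classifier feature importances.
second_map = dict(zip(tfidf_vectorizer.get_feature_names(), X_t_2.toarray()[0]))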

In [83]:
for k, v in sorted(first_map.items(), key=lambda v: v[1], reverse=False):
    if v > 0.00:
        print(k, v)

business 0.025165059476156586
years 0.025505856061991704
ability 0.02856444795043758
knowledge 0.030084451209372073
responsibilities 0.03142196705089656
role 0.03170781472731846
support 0.03232609854142378
development 0.03293143415713369
requirements 0.034314777028703336
engineering 0.035729719719219614
degree 0.03605034168305703
understanding 0.036131368448421824
help 0.036459052931611016
use 0.03783077973091068
create 0.039215708349365816
drive 0.03940841072906059
opportunities 0.03960313097140501
bachelor 0.04009904337965895
needs 0.04139996623548871
results 0.04139996623548871
well 0.04200252777652709
apply 0.04217030888372943
quality 0.042226574019634366
computer_science 0.04251048202637372
s_degree 0.042740771201465445
also 0.042798791882637646
mathematics 0.04297394809032143
communicate 0.04344926120611176
algorithms 0.045482203763054826
minimum 0.04651747738350934
professional 0.047241133412761024
mining 0.04738930641370574
languages 0.047613803465022336
master 0.04768924216170

In [34]:
df_first_map = pd.DataFrame.from_dict(first_map, orient='index')
df_first_map.columns = ['importance']
df_first_map = df_first_map[df_first_map.importance > 0.00]
df_first_map.sort_values(by='importance', ascending=False)

Unnamed: 0,importance
data,0.245055
sr,0.217317
packages_,0.124285
co_operative,0.124285
date_sources,0.124285
large_scale,0.122701
driven,0.119967
outcomes,0.117148
executed,0.116745
ai,0.116609


In [54]:
df_second_map = pd.DataFrame.from_dict(second_map, orient='index')
df_second_map.columns = ['importance']
df_second_map = df_second_map[df_second_map.importance > 0.00]
df_second_map.sort_values(by='importance', ascending=False)

Unnamed: 0,importance
data,0.245055
sr,0.217317
packages_,0.124285
co_operative,0.124285
date_sources,0.124285
large_scale,0.122701
driven,0.119967
outcomes,0.117148
executed,0.116745
ai,0.116609
