# (4) Project 4 Modelling Part 2

---
For part 2, I will create a model to predict whether a job is a data scientist or a business analyst role. As with part 1, this is a binary classification problem: the target is 1 for data scientist and 0 for business analyst.

---

In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score



%matplotlib inline
%config InlineBackend.figure_format = 'retina'

---
In part 1 I built a model to predict high vs low salaries, and then used it to predict high or low salaries for the jobs that did not list one. I will now read in the CSV file that contains salary data for all the job listings that I scraped.

---

In [2]:
df = pd.read_csv('all_salary.csv')

In [3]:
df.head()

Unnamed: 0,location,title,company,summary,description,sal_above_mean
0,sydney,data scientist python r,correlate resources,work within a team of industry leading data sc...,our client is an industry pioneering customer ...,1.0
1,sydney,data scientists x 2,alloc8 recruitment solutions pty ltd,the need for 2 data scientist is paramount and...,the company alloc8 have been fortunate enoug...,1.0
2,sydney,data analyst perm role sydney,infopeople,i am looking for a data analyst with at least ...,i am looking for a data analyst with at least ...,0.0
3,sydney,senior data scientist leadership position,morgan mckinley,the time has come to step up and be counted as...,calling all senior data scientists the ti...,1.0
4,sydney,junior data scientist,correlate resources,as a junior data scientist the responsibilitie...,our client is an industry pioneering customer ...,0.0


In [4]:
df.shape

(4372, 6)

---
I need to subset my data so that it only contains jobs that are either data scientist or business analyst. First, I need to standardise the job titles: any title containing 'data scientist' will be renamed to just 'data scientist', and likewise for 'business analyst'.

---

In [5]:
df['title'] = df['title'].map(lambda x: 'data scientist' if 'data scientist' in x else x)
# Rename job titles to data scientist if the title contains data scientist.

In [6]:
df['title'] = df['title'].map(lambda x: 'business analyst' if 'business analyst' in x else x)
# Rename job titles to business analyst if the title contains business analyst.

In [7]:
df.title.value_counts().head()

business analyst      842
data scientist        133
data analyst           77
analyst                30
commercial analyst     23
Name: title, dtype: int64

---
I now have 842 business analysts and 133 data scientists. Now I can subset the data to create a dataframe that only contains business analyst and data scientist jobs.

---

In [8]:
df2 = df[(df.title == 'data scientist') | (df.title == 'business analyst')]
# Create new dataframe with only data scientists and business analysts.

In [9]:
df2.shape

(975, 6)

In [10]:
df2 = df2.copy()   # Work on an explicit copy so pandas does not raise SettingWithCopyWarning.
df2['title'] = df2['title'].map(lambda x: 1 if 'data scientist' in x else 0)
# Now classify data scientist as 1 and business analyst as 0.


In [11]:
df2.title.value_counts()

0    842
1    133
Name: title, dtype: int64

In [12]:
df2.head()

Unnamed: 0,location,title,company,summary,description,sal_above_mean
0,sydney,1,correlate resources,work within a team of industry leading data sc...,our client is an industry pioneering customer ...,1.0
1,sydney,1,alloc8 recruitment solutions pty ltd,the need for 2 data scientist is paramount and...,the company alloc8 have been fortunate enoug...,1.0
3,sydney,1,morgan mckinley,the time has come to step up and be counted as...,calling all senior data scientists the ti...,1.0
4,sydney,1,correlate resources,as a junior data scientist the responsibilitie...,our client is an industry pioneering customer ...,0.0
9,sydney,1,genesis it t pty ltd,our client seeks an experienced data scientist...,our client seeks an experienced data scientist...,1.0


---
I have split my data into just data scientists and business analysts, and binarised the job title. As with part 1, I will use natural language processing to create features from the words in the summary and description. Again, I will be looking at the ratio of ngrams that appear in data scientist summaries and descriptions to the ngrams that appear in business analyst summaries and descriptions.

---

## NLP

---
I will start by splitting the data into X (summary and description) and y (title).

---

In [13]:
X_title = df2[['summary', 'description']]
X_title.shape

(975, 2)

In [14]:
y_title = df2[['title']]
y_title.shape

(975, 1)

### Summary

---
Now I need to create a count vectorizer to count ngrams within the summary. As with part 1, I will be using an ngram range of 1 to 3 words. I will limit the count vectorizer to ngrams that appear in at least 2.5% of the records.
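
As a rough illustration of what a float `min_df` does, the sketch below fits a vectorizer on a made-up four-document corpus. The documents and the 0.5 threshold are assumptions chosen for the example, not values from the scraped data. With `min_df=0.5`, only ngrams that occur in at least half of the documents are kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "data scientist machine learning",
    "data scientist python",
    "business analyst requirements",
    "business analyst agile",
]

# A float min_df is read as a proportion of documents:
# keep ngrams found in at least 50% of the 4 documents.
toy_cvec = CountVectorizer(ngram_range=(1, 2), min_df=0.5)
toy_cvec.fit(docs)

# vocabulary_ maps each kept ngram to its column index.
kept = sorted(toy_cvec.vocabulary_)
print(kept)
```

Ngrams such as "python" or "agile" appear in only one document each, so they fall below the threshold and are dropped.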

---

In [15]:
cvec = CountVectorizer(stop_words='english', ngram_range=(1, 3), min_df=0.025, analyzer='word')
cvec.fit(X_title['summary'])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=0.025,
        ngram_range=(1, 3), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [16]:
summary_words = pd.DataFrame(cvec.transform(X_title['summary']).todense(), columns=cvec.get_feature_names())

In [17]:
summary_words.shape

(975, 114)

In [18]:
summary_words = pd.concat([summary_words, y_title], axis=1)
summary_words.head()

Unnamed: 0,ability,agile,analyse,analysis,analyst,analyst experience,analyst join,analyst role,analysts,analytics,...,technical business analyst,technology,understanding,work,working,working business,years,years experience,years experience business,title
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
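
---
The empty `title` value in row 2 above is a sign that `pd.concat` has aligned the two frames on their index labels rather than by position: `summary_words` gets a fresh index starting at 0, while `y_title` keeps the original row labels from `df`. The sketch below reproduces the effect on made-up data and shows one possible remedy, which is to reuse the label index when building the count frame. In this notebook that would presumably mean passing `index=X_title.index` to `pd.DataFrame`; this is an untested assumption here and it would change the counts that follow.

```python
import pandas as pd

# Toy data: labels keep the index of a filtered dataframe (0, 1, 3),
# while the count matrix gets a fresh 0, 1, 2 index.
labels = pd.DataFrame({"title": [1, 1, 0]}, index=[0, 1, 3])
counts = pd.DataFrame({"data": [2, 1, 0]})

# concat aligns on index labels, so rows 2 and 3 end up half empty.
misaligned = pd.concat([counts, labels], axis=1)
print(misaligned)

# Reusing the label index keeps each row next to its own label.
counts_aligned = pd.DataFrame({"data": [2, 1, 0]}, index=labels.index)
aligned = pd.concat([counts_aligned, labels], axis=1)
print(aligned)
```

---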


In [19]:
summary_count_business = summary_words[summary_words['title'] == 0].sum()
print ('Word count for 0 (business analyst):\n')
print (summary_count_business.sort_values(ascending = False))

print ('\n')

summary_count_data = summary_words[summary_words['title'] == 1].sum()
print ('Word count for 1 (data scientist):\n')
print (summary_count_data.sort_values(ascending = False))

Word count for 0 (business analyst):

business                        232.0
data                            176.0
analyst                         136.0
business analyst                124.0
experience                       69.0
analysis                         33.0
requirements                     24.0
role                             24.0
senior                           23.0
looking                          23.0
experience business              22.0
years                            22.0
experience business analyst      20.0
projects                         20.0
technical                        19.0
team                             19.0
process                          17.0
analytics                        16.0
client                           16.0
experienced                      16.0
working                          16.0
work                             16.0
systems                          14.0
leading                          14.0
skills                           14.0
senior busin

In [20]:
summary_count_compare = pd.DataFrame([summary_count_business, summary_count_data]).T
summary_count_compare["ratio"] = summary_count_compare[1]/summary_count_compare[0]
summary_count_compare.ratio.sort_values(ascending=False)

title                                inf
understanding                   4.000000
industry                        2.333333
company                         1.500000
excellent                       1.333333
proven                          1.000000
develop                         1.000000
migration                       1.000000
ability                         1.000000
technology                      0.750000
seeking                         0.714286
new                             0.666667
key                             0.666667
technical business analyst      0.600000
modelling                       0.600000
scientist                       0.583333
data scientist                  0.583333
business requirements           0.571429
financial                       0.571429
technical business              0.545455
knowledge                       0.500000
highly                          0.500000
experienced business analyst    0.500000
experienced business            0.500000
commercial      

---
As with part 1, I will only be picking specific ngrams that have some relevance to the job. Ngrams with a high ratio are more associated with data scientist jobs, whereas ngrams with a low ratio are more associated with business analyst roles.
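
The ratio here divides raw totals, and the two classes differ a lot in size (842 business analysts vs 133 data scientists), so a raw ratio of 1 already means the ngram is much more common per data scientist listing. The sketch below shows both versions side by side. The ngram totals are hypothetical, only the class sizes come from the value counts above, and `rate_ratio` is a name introduced for this example.

```python
import pandas as pd

# Hypothetical ngram totals per class (not the real counts).
toy_counts = pd.DataFrame(
    {0: [40.0, 10.0], 1: [4.0, 10.0]},
    index=["requirements", "modelling"],
)

# Class sizes taken from the value counts above.
n_business, n_data = 842, 133

# Raw ratio, as used in this notebook: data scientist / business analyst.
toy_counts["ratio"] = toy_counts[1] / toy_counts[0]

# Per-listing ratio: divide each total by its class size first.
toy_counts["rate_ratio"] = (toy_counts[1] / n_data) / (toy_counts[0] / n_business)
print(toy_counts.round(2))
```

The ordering of ngrams is the same under both ratios because the second is just the first multiplied by a constant (842 / 133), so the choice mainly affects where a sensible cut-off sits.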

---

I will use ngrams with a ratio of 1 or above for the data scientist summary feature (understanding, industry, excellent, proven, develop, migration, ability).
<br>Ngrams with a ratio below 0.1 will form the business analyst summary feature (management, years experience, solutions, years, services, background, functional, analyst role, business analysts, project, agile, government).
<br><br>The summary data suggests that roles asking for understanding and ability are more likely to be data scientist jobs, whereas business analyst jobs seem to ask more for experience and for roles in management and services.

---

### Description

In [21]:
# Now create a count vectorizer for description. Descriptions contain many more words, so limit the vectorizer to ngrams that appear in at least 15% and at most 75% of the records.
cvec = CountVectorizer(stop_words='english', ngram_range=(1, 3), min_df=0.15, max_df=0.75, analyzer='word')
cvec.fit(X_title['description'])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.75, max_features=None, min_df=0.15,
        ngram_range=(1, 3), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [22]:
description_words = pd.DataFrame(cvec.transform(X_title['description']).todense(), columns=cvec.get_feature_names())

In [23]:
description_words.shape

(975, 503)

In [24]:
description_words = pd.concat([description_words, y_title], axis=1)
description_words.head()

Unnamed: 0,10,11,12,12 hours,12 hours ago,14,14 hours,14 hours ago,15,17,...,window document location,window ia,window ia var,work,working,workshops,written,years,years experience,title
0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,2.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,...,1.0,1.0,1.0,2.0,2.0,0.0,0.0,0.0,0.0,1.0
2,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,0.0,...,1.0,1.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,
3,3.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0


In [25]:
description_count_business = description_words[description_words['title'] == 0].sum()
print ('Word count for 0 (business analyst):\n')
print (description_count_business.sort_values(ascending = False))

print ('\n')

description_count_data = description_words[description_words['title'] == 1].sum()
print ('Word count for 1 (data scientist):\n')
print (description_count_data.sort_values(ascending = False))

Word count for 0 (business analyst):

nsw                        656.0
sydney                     643.0
sydney nsw                 530.0
document                   344.0
work                       336.0
hours                      328.0
melbourne                  325.0
hours ago                  323.0
requirements               319.0
vic                        299.0
team                       281.0
working                    279.0
analysis                   276.0
melbourne vic              276.0
technical                  261.0
project                    255.0
recjoblink                 255.0
var                        229.0
senior                     221.0
ia                         216.0
management                 207.0
strong                     205.0
value                      202.0
ago business               201.0
development                196.0
ago business analyst       192.0
days ago easily            187.0
process                    172.0
projects                   171.0
austr

In [26]:
# Create a dataframe and use ".T" to transpose it so that the ngrams form the index.
description_count_compare = pd.DataFrame([description_count_business, description_count_data]).T
# Create a "ratio" column comparing ngram counts in data scientist jobs (1) with those in business analyst jobs (0).
description_count_compare["ratio"] = description_count_compare[1]/description_count_compare[0]
description_count_compare.ratio.sort_values(ascending=False)

title                           inf
sap                        1.066667
strategic                  1.047619
click                      0.909091
lead                       0.882353
insights                   0.804348
advanced                   0.714286
government                 0.636364
days ago business          0.621622
engagement                 0.606061
need                       0.577778
highly                     0.576271
key                        0.567010
ba                         0.564103
resume                     0.560000
modelling                  0.553846
10                         0.540541
cv                         0.538462
include                    0.533333
techniques                 0.521739
developing                 0.517241
page                       0.511111
analytics                  0.509259
michael page               0.503759
initiatives                0.500000
page au                    0.500000
michael page au            0.500000
applications               0

---
The description count vectorizer produces many ngrams that are irrelevant to whether the role is a data scientist or business analyst, including many company names and locations. I will only use words that have some relevance to the role itself. The ratios also span a fairly narrow range, so when it comes to modelling, the description may not be a good feature for distinguishing between data scientist jobs and business analyst jobs.

---

Ngrams with a ratio above 0.55 will form the data scientist description feature (sap, strategic, lead, insights, advanced, government, engagement, modelling).
<br>Ngrams with a ratio below 0.1 will form the business analyst description feature (executive, analysts, analyst junior).
<br><br>The description data suggests that roles requiring insights, modelling, and software such as SAP are more likely to be data scientist jobs. Business analyst descriptions ask more for analysts.

---

## Features

---
Now that I have chosen ngrams from the summary and description, I can create binary features.
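
One compact way to build such a flag is to join the chosen words into a single regular expression and test each text once. The sketch below uses made-up summaries and a hypothetical `toy` dataframe; only the word list comes from the selection above. Note that `str.contains` matches substrings, so 'develop' also matches 'development'.

```python
import pandas as pd

# Toy summaries, invented for illustration.
toy = pd.DataFrame({"summary": [
    "proven ability to develop models",
    "agile project delivery for government",
    "join our friendly team",
]})

data_words = ["understanding", "industry", "excellent", "proven",
              "develop", "migration", "ability"]

# Join the words into one regex alternation: "understanding|industry|..."
pattern = "|".join(data_words)

# str.contains returns a boolean Series; astype(int) turns it into 0/1.
toy["data_summary"] = toy["summary"].str.contains(pattern).astype(int)
print(toy)
```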

---

### Summary

In [27]:
df2['data_summary'] = 0                         # Create data scientist summary column, set to 0.
df2.loc[(df2.summary.str.contains('understanding') |
         df2.summary.str.contains('industry') |
         df2.summary.str.contains('excellent') |
         df2.summary.str.contains('proven') |
         df2.summary.str.contains('develop') |
         df2.summary.str.contains('migration') |
         df2.summary.str.contains('ability')), 'data_summary'] = 1   # Set to 1 if any of these words appear in the summary.


df2['business_summary'] = 0                     # Create business analyst summary column, set to 0.
df2.loc[(df2.summary.str.contains('management') |
         df2.summary.str.contains('years experience') |
         df2.summary.str.contains('solutions') |
         df2.summary.str.contains('years') |
         df2.summary.str.contains('services') |
         df2.summary.str.contains('background') |
         df2.summary.str.contains('functional') |
         df2.summary.str.contains('analyst role') |
         df2.summary.str.contains('business analysts') |
         df2.summary.str.contains('project') |
         df2.summary.str.contains('agile') |
         df2.summary.str.contains('government')), 'business_summary'] = 1   # Set to 1 if any of these words appear in the summary.
df2.head(10)

Unnamed: 0,location,title,company,summary,description,sal_above_mean,data_summary,business_summary
0,sydney,1,correlate resources,work within a team of industry leading data sc...,our client is an industry pioneering customer ...,1.0,1,0
1,sydney,1,alloc8 recruitment solutions pty ltd,the need for 2 data scientist is paramount and...,the company alloc8 have been fortunate enoug...,1.0,0,0
3,sydney,1,morgan mckinley,the time has come to step up and be counted as...,calling all senior data scientists the ti...,1.0,0,0
4,sydney,1,correlate resources,as a junior data scientist the responsibilitie...,our client is an industry pioneering customer ...,0.0,0,0
9,sydney,1,genesis it t pty ltd,our client seeks an experienced data scientist...,our client seeks an experienced data scientist...,1.0,0,0
11,sydney,1,7 recruitment,big data data mining machine learning ai g...,we are currently working with an organisation ...,1.0,0,0
13,sydney,1,morgan mckinley,my client a well established australian insti...,utilise your data science skill set across a v...,1.0,0,0
14,sydney,1,nakama,data extraction old new limited and huge d...,mature data science practice rapidly expanding...,1.0,1,0
15,sydney,1,salt recruitment,data modelling client engagement statistical...,technology permanent sydney au 100000 00 ...,1.0,0,0
37,melbourne,1,rmit university,data scientist analytics insights pd new...,job no 564003 work type full time fixed t...,0.0,1,0


### Description

In [28]:
df2['data_description'] = 0                     # Create data scientist description column, set to 0.
df2.loc[(df2.description.str.contains('sap') |
         df2.description.str.contains('strategic') |
         df2.description.str.contains('lead') |
         df2.description.str.contains('insights') |
         df2.description.str.contains('advanced') |
         df2.description.str.contains('government') |
         df2.description.str.contains('engagement') |
         df2.description.str.contains('modelling')), 'data_description'] = 1   # Set to 1 if any of these words appear in the description.


df2['business_description'] = 0                 # Create business analyst description column, set to 0.
df2.loc[(df2.description.str.contains('executive') |
         df2.description.str.contains('analysts') |
         df2.description.str.contains('analyst junior')), 'business_description'] = 1   # Set to 1 if any of these words appear in the description.
df2.head(10)


Unnamed: 0,location,title,company,summary,description,sal_above_mean,data_summary,business_summary,data_description,business_description
0,sydney,1,correlate resources,work within a team of industry leading data sc...,our client is an industry pioneering customer ...,1.0,1,0,1,0
1,sydney,1,alloc8 recruitment solutions pty ltd,the need for 2 data scientist is paramount and...,the company alloc8 have been fortunate enoug...,1.0,0,0,1,0
3,sydney,1,morgan mckinley,the time has come to step up and be counted as...,calling all senior data scientists the ti...,1.0,0,0,1,1
4,sydney,1,correlate resources,as a junior data scientist the responsibilitie...,our client is an industry pioneering customer ...,0.0,0,0,1,0
9,sydney,1,genesis it t pty ltd,our client seeks an experienced data scientist...,our client seeks an experienced data scientist...,1.0,0,0,1,0
11,sydney,1,7 recruitment,big data data mining machine learning ai g...,we are currently working with an organisation ...,1.0,0,0,1,0
13,sydney,1,morgan mckinley,my client a well established australian insti...,utilise your data science skill set across a v...,1.0,0,0,1,0
14,sydney,1,nakama,data extraction old new limited and huge d...,mature data science practice rapidly expanding...,1.0,1,0,1,0
15,sydney,1,salt recruitment,data modelling client engagement statistical...,technology permanent sydney au 100000 00 ...,1.0,0,0,1,0
37,melbourne,1,rmit university,data scientist analytics insights pd new...,job no 564003 work type full time fixed t...,0.0,1,0,1,0


### Other Columns

---
I have created the NLP binary features from the summary and description. Now I need to sort out the other columns so that I have a completely numeric and binary dataframe to feed into the model.

---

In [29]:
cities = pd.get_dummies(df2.location)       # Create dummy variables from location.
df2 = pd.concat([df2, cities], axis=1)      # Concatenate to main dataframe.

In [30]:
df2.drop(['location', 'company', 'summary', 'description'], axis=1, inplace=True)
# Drop other non-numeric columns.

In [31]:
df2.head()

Unnamed: 0,title,sal_above_mean,data_summary,business_summary,data_description,business_description,adelaide,brisbane,melbourne,perth,sydney
0,1,1.0,1,0,1,0,0,0,0,0,1
1,1,1.0,0,0,1,0,0,0,0,0,1
3,1,1.0,0,0,1,1,0,0,0,0,1
4,1,0.0,0,0,1,0,0,0,0,0,1
9,1,1.0,0,0,1,0,0,0,0,0,1


In [32]:
df2.shape

(975, 11)

## Modelling

---
Now that I have my features I can create my model to predict if a job will be a data scientist or business analyst. I will initially try a decision tree, but I may try other classifier models if it does not perform well.

---

### Decision Tree

In [33]:
# Check baseline accuracy.
baseline_acc = 1 - y_title.mean()
baseline_acc

title    0.86359
dtype: float64

---
The baseline accuracy is high, at 86%. This is due to the class imbalance: most of the records are business analysts rather than data scientists, so always predicting business analyst would already be right 86% of the time.
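
The baseline is simply the share of the most common class. A minimal sketch, rebuilding a target with the same class counts as above (the `y` series is constructed for the example rather than taken from `df2`):

```python
import pandas as pd

# Class counts from the value counts earlier in the notebook.
y = pd.Series([0] * 842 + [1] * 133, name="title")

# Baseline accuracy = proportion of the majority class.
baseline = y.value_counts(normalize=True).max()
print(round(baseline, 5))   # 842 / 975
```

This gives the same number as `1 - y.mean()` only because the positive class (1) is the minority; taking the maximum class share works whichever class is larger.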

---

In [34]:
# Create X variable of predictors.
X_title = df2.iloc[:, 1:]
X_title.shape

(975, 10)

In [35]:
# Create y variable of target.
y_title = df2.iloc[:, 0]
y_title.shape

(975,)

In [36]:
# Train / test split, 75:25 split, stratify on y_title to ensure roughly equal proportions of data scientists and business analysts in both training and testing data. 
X_train, X_test, y_train, y_test = train_test_split(X_title, y_title, test_size=0.25, stratify=y_title, random_state=34)

In [37]:
print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)
print('X_test shape:', X_test.shape)
print('y_test shape:', y_test.shape)

X_train shape: (731, 10)
y_train shape: (731,)
X_test shape: (244, 10)
y_test shape: (244,)


In [38]:
y_test.value_counts()

0    211
1     33
Name: title, dtype: int64
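
---
To see what `stratify` does, the sketch below repeats the split on a stand-in dataset with the same class counts and compares the share of data scientists in each part. The `placeholder` feature and the variable names are assumptions made for the example; only the split settings mirror the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same class counts as the real target; the single feature is a placeholder.
y = pd.Series([0] * 842 + [1] * 133, name="title")
X = pd.DataFrame({"placeholder": range(len(y))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=34
)

# Both splits should hold a similar share of data scientists.
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```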

---
I will create multiple decision trees of different depths to see which one performs best.

---

In [39]:
# Create decision trees.
dtr1 = DecisionTreeClassifier(max_depth=1)
dtr2 = DecisionTreeClassifier(max_depth=2)
dtr3 = DecisionTreeClassifier(max_depth=3)
dtrN = DecisionTreeClassifier(max_depth=None)

In [40]:
# Fit trees to training data.
dtr1.fit(X_train, y_train)
dtr2.fit(X_train, y_train)
dtr3.fit(X_train, y_train)
dtrN.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [41]:
# Training scores.
print(dtr1.score(X_train, y_train))
print(dtr2.score(X_train, y_train))
print(dtr3.score(X_train, y_train))
print(dtrN.score(X_train, y_train))

0.8632010943912448
0.8632010943912448
0.8632010943912448
0.8741450068399452


---
All training scores are roughly the same. The three shallow trees score 0.863, which is the proportion of business analysts in the training data, and the tree with no maximum depth has the best training score of 0.874.

---

In [42]:
# Check cross validation scores - 5 folds.
dtr1_scores = cross_val_score(dtr1, X_train, y_train, cv=5)
dtr2_scores = cross_val_score(dtr2, X_train, y_train, cv=5)
dtr3_scores = cross_val_score(dtr3, X_train, y_train, cv=5)
dtrN_scores = cross_val_score(dtrN, X_train, y_train, cv=5)

print((dtr1_scores, np.mean(dtr1_scores)))
print((dtr2_scores, np.mean(dtr2_scores)))
print((dtr3_scores, np.mean(dtr3_scores)))
print((dtrN_scores, np.mean(dtrN_scores)))

(array([0.86394558, 0.8630137 , 0.8630137 , 0.8630137 , 0.8630137 ]), 0.8632000745503682)
(array([0.86394558, 0.8630137 , 0.8630137 , 0.8630137 , 0.8630137 ]), 0.8632000745503682)
(array([0.86394558, 0.8630137 , 0.8630137 , 0.8630137 , 0.8630137 ]), 0.8632000745503682)
(array([0.83673469, 0.8630137 , 0.84246575, 0.86986301, 0.85616438]), 0.8536483086385239)


---
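When cross-validating on the training data, each model performs consistently across the folds. The three shallow trees score about 0.863 on every fold, matching the baseline, while the tree with no maximum depth averages 0.854, slightly below it.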

---

In [43]:
# Test scores.
print(dtr1.score(X_test, y_test))
print(dtr2.score(X_test, y_test))
print(dtr3.score(X_test, y_test))
print(dtrN.score(X_test, y_test))

0.8647540983606558
0.8647540983606558
0.8647540983606558
0.8770491803278688


---
The test scores are very close to the training scores. Again, the tree without a maximum depth performs the best.

---

In [44]:
print (confusion_matrix(y_test, dtrN.predict(X_test)))    # True (-ve),  False (+ve)
                                                          # False (-ve), True (+ve)
print('\n')    
print(classification_report(y_test, dtrN.predict(X_test), target_names=['business analyst', 'data scientist'])) 

[[208   3]
 [ 27   6]]


                  precision    recall  f1-score   support

business analyst       0.89      0.99      0.93       211
  data scientist       0.67      0.18      0.29        33

     avg / total       0.86      0.88      0.85       244



---
Looking at the confusion matrix reveals that only 6 of the 33 data scientist jobs (18%) are being predicted correctly. In other words, there are far more false negatives (27) than true positives (6), so the model is struggling to distinguish between business analysts and data scientists. The high accuracy score of the model is due to the large class imbalance between business analysts and data scientists.

---

I will try using logistic regression and KNN to see if they are able to perform any better.

---

### Logistic Regression

In [45]:
# Create logistic regression model and fit it to training data.
logreg = LogisticRegression()
model = logreg.fit(X_train, y_train)

In [46]:
# Training score.
model.score(X_train, y_train)

0.8673050615595075

In [47]:
# Testing score.
model.score(X_test, y_test)

0.860655737704918

---
As with the decision trees, the logistic regression is getting accuracy scores of roughly the baseline.

---

In [48]:
print (confusion_matrix(y_test, model.predict(X_test)))    # True (-ve),  False (+ve)
                                                           # False (-ve), True (+ve)
print('\n')    
print(classification_report(y_test, model.predict(X_test), target_names=['business analyst', 'data scientist'])) 

[[210   1]
 [ 33   0]]


                  precision    recall  f1-score   support

business analyst       0.86      1.00      0.93       211
  data scientist       0.00      0.00      0.00        33

     avg / total       0.75      0.86      0.80       244



---
The confusion matrix reveals that logistic regression is performing much worse than the decision tree. It is not able to correctly predict any data scientists. It has only predicted 1 data scientist and this is a false positive. Everything else has been classified as a business analyst.

---

In [49]:
model.predict(X_test)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0])

---
As seen in the confusion matrix, the model is only predicting a single data scientist. The confusion matrix revealed this to be a false positive. I will now try KNN.

---

### KNN

In [50]:
# Create KNN model, 9-nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=9, weights='uniform')

In [51]:
# Fit to training data.
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=9, p=2,
           weights='uniform')

In [52]:
# Training score.
knn.score(X_train, y_train)

0.86593707250342

In [53]:
# Testing score.
knn.score(X_test, y_test)

0.8729508196721312

---
As with the decision trees and logistic regression, the scores for KNN are roughly baseline. I will need to check the confusion matrix to get a better understanding of how the model is performing.

---

In [54]:
print (confusion_matrix(y_test, knn.predict(X_test)))      # True (-ve),  False (+ve)
                                                           # False (-ve), True (+ve)
print('\n')    
print(classification_report(y_test, knn.predict(X_test), target_names=['business analyst', 'data scientist']))

[[206   5]
 [ 26   7]]


                  precision    recall  f1-score   support

business analyst       0.89      0.98      0.93       211
  data scientist       0.58      0.21      0.31        33

     avg / total       0.85      0.87      0.85       244



---
The KNN model has correctly predicted one more data scientist than the decision tree (7 vs 6 of 33), but it has also produced more false positives (5 vs 3). On balance, I would conclude that the decision tree has performed the best on this data. Even so, I am not happy with the results, as none of the models can reliably distinguish between data scientists and business analysts. Because all three models produce similar results, I have concluded that the data I scraped, in particular the features built from the summaries and descriptions, are not good predictors of job title. I believe more data would be required to produce a model that could perform better than the baseline.
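
The comparison can be made concrete by computing recall and precision for the data scientist class directly from the three confusion matrices reported above. This is a small sketch in plain Python; `minority_metrics` is a helper name introduced for the example.

```python
# Confusion matrices reported above, laid out as [[TN, FP], [FN, TP]].
matrices = {
    "decision tree": [[208, 3], [27, 6]],
    "logistic regression": [[210, 1], [33, 0]],
    "knn": [[206, 5], [26, 7]],
}

def minority_metrics(cm):
    """Return (recall, precision) for the positive class, data scientist."""
    (tn, fp), (fn, tp) = cm
    recall = tp / (tp + fn)
    # Guard against division by zero when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

results = {name: minority_metrics(cm) for name, cm in matrices.items()}
for name, (recall, precision) in results.items():
    print(f"{name}: recall={recall:.2f}, precision={precision:.2f}")
```

KNN finds slightly more of the data scientists, while the decision tree is right more often when it does predict one, which is the trade-off described above.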

---