# Job Recommendation

with dataset from Careerbuilder

## TOC
* [Import dependencies](#import)
* [Load dataset](#dataset)
* [EDA and Preprocessing](#EDA)
 - [split into training and testing dataset](#split)
 - [location](#location)
 - [preprocessing](#preprocessing)
* [Building model](#model)
* [Clean html](#html)
* [Content based filtering](#content)
 - [job description based recommender](#jobdesc)
 - [similar user based recommender](#similarusers)
* [Collaborative filtering](#collab)

## Import dependencies <a class="anchor" id="import"></a>

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

## Load dataset <a class="anchor" id="dataset"></a>

In [2]:
!ls data/*.tsv

data/apps.tsv  data/test_users.tsv    data/users.tsv
data/jobs.tsv  data/user_history.tsv  data/window_dates.tsv


- users
- jobs
- apps
- users_history
- test_users

In [3]:
users = pd.read_csv('data/users.tsv', sep='\t', encoding='utf-8')
jobs = pd.read_csv('data/jobs.tsv', sep='\t', encoding='utf-8', error_bad_lines=False)
apps = pd.read_csv('data/apps.tsv', sep='\t', encoding='utf-8')
user_history = pd.read_csv('data/user_history.tsv', sep='\t', encoding='utf-8')
test_users = pd.read_csv('data/test_users.tsv', sep='\t', encoding='utf-8')

# jobs = pd.read_csv('data/jobs.tsv', sep='\t')

b'Skipping line 122433: expected 11 fields, saw 12\n'
b'Skipping line 602576: expected 11 fields, saw 12\n'
b'Skipping line 990950: expected 11 fields, saw 12\n'
  interactivity=interactivity, compiler=compiler, result=result)


In [4]:
users.head()

Unnamed: 0,UserID,WindowID,Split,City,State,Country,ZipCode,DegreeType,Major,GraduationDate,WorkHistoryCount,TotalYearsExperience,CurrentlyEmployed,ManagedOthers,ManagedHowMany
0,47,1,Train,Paramount,CA,US,90723,High School,,1999-06-01 00:00:00,3,10.0,Yes,No,0
1,72,1,Train,La Mesa,CA,US,91941,Master's,Anthropology,2011-01-01 00:00:00,10,8.0,Yes,No,0
2,80,1,Train,Williamstown,NJ,US,8094,High School,Not Applicable,1985-06-01 00:00:00,5,11.0,Yes,Yes,5
3,98,1,Train,Astoria,NY,US,11105,Master's,Journalism,2007-05-01 00:00:00,3,3.0,Yes,No,0
4,123,1,Train,Baton Rouge,LA,US,70808,Bachelor's,Agricultural Business,2011-05-01 00:00:00,1,9.0,Yes,No,0


In [5]:
users.columns

Index(['UserID', 'WindowID', 'Split', 'City', 'State', 'Country', 'ZipCode',
       'DegreeType', 'Major', 'GraduationDate', 'WorkHistoryCount',
       'TotalYearsExperience', 'CurrentlyEmployed', 'ManagedOthers',
       'ManagedHowMany'],
      dtype='object')

In [6]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 389708 entries, 0 to 389707
Data columns (total 15 columns):
UserID                  389708 non-null int64
WindowID                389708 non-null int64
Split                   389708 non-null object
City                    389708 non-null object
State                   389218 non-null object
Country                 389708 non-null object
ZipCode                 387974 non-null object
DegreeType              389708 non-null object
Major                   292468 non-null object
GraduationDate          269477 non-null object
WorkHistoryCount        389708 non-null int64
TotalYearsExperience    375528 non-null float64
CurrentlyEmployed       347632 non-null object
ManagedOthers           389708 non-null object
ManagedHowMany          389708 non-null int64
dtypes: float64(1), int64(4), object(10)
memory usage: 44.6+ MB


In [7]:
jobs.replace('NaN',np.NaN)
jobs.head()

Unnamed: 0,JobID,WindowID,Title,Description,Requirements,City,State,Country,Zip5,StartDate,EndDate
0,1,1,Security Engineer/Technical Lead,<p>Security Clearance Required:&nbsp; Top Secr...,<p>SKILL SET</p>\r<p>&nbsp;</p>\r<p>Network Se...,Washington,DC,US,20531.0,2012-03-07 13:17:01.643,2012-04-06 23:59:59
1,4,1,SAP Business Analyst / WM,<strong>NO Corp. to Corp resumes&nbsp;are bein...,<p><b>WHAT YOU NEED: </b></p>\r<p>Four year co...,Charlotte,NC,US,28217.0,2012-03-21 02:03:44.137,2012-04-20 23:59:59
2,7,1,P/T HUMAN RESOURCES ASSISTANT,<b> <b> P/T HUMAN RESOURCES ASSISTANT</b> <...,Please refer to the Job Description to view th...,Winter Park,FL,US,32792.0,2012-03-02 16:36:55.447,2012-04-01 23:59:59
3,8,1,Route Delivery Drivers,CITY BEVERAGES Come to work for the best in th...,Please refer to the Job Description to view th...,Orlando,FL,US,,2012-03-03 09:01:10.077,2012-04-02 23:59:59
4,9,1,Housekeeping,I make sure every part of their day is magica...,Please refer to the Job Description to view th...,Orlando,FL,US,,2012-03-03 09:01:11.88,2012-04-02 23:59:59


In [8]:
apps.head()

Unnamed: 0,UserID,WindowID,Split,ApplicationDate,JobID
0,47,1,Train,2012-04-04 15:56:23.537,169528
1,47,1,Train,2012-04-06 01:03:00.003,284009
2,47,1,Train,2012-04-05 02:40:27.753,2121
3,47,1,Train,2012-04-05 02:37:02.673,848187
4,47,1,Train,2012-04-05 22:44:06.653,733748


In [9]:
user_history.head()

Unnamed: 0,UserID,WindowID,Split,Sequence,JobTitle
0,47,1,Train,1,National Space Communication Programs-Special ...
1,47,1,Train,2,Detention Officer
2,47,1,Train,3,"Passenger Screener, TSA"
3,72,1,Train,1,"Lecturer, Department of Anthropology"
4,72,1,Train,2,Student Assistant


In [10]:
test_users.head()

Unnamed: 0,UserID,WindowID
0,767,1
1,769,1
2,861,1
3,1006,1
4,1192,1


## EDA and Preprocessing <a class="anchor" id="EDA"></a>

### Splitting into Training and Testing dataset <a class="anchor" id="split"></a>

with attribute split:
- users
- apps
- user_history

### users

In [11]:
users_training = users.loc[users['Split'] == 'Train']

In [12]:
users_testing = users.loc[users['Split'] == 'Test']

### apps

In [13]:
apps_training = apps.loc[apps['Split'] == 'Train']

In [14]:
apps_testing = apps.loc[apps['Split'] == 'Test']

### user_history

In [15]:
user_history_training = user_history.loc[user_history['Split'] == 'Train']

In [16]:
user_history_testing = user_history.loc[user_history['Split'] == 'Test']

#### Dataframes
- users_training
- users_testing
- apps_training
- apps_testing
- user_history_training
- user_history_testing

## Preprocessing <a class="anchor" id="preprocessing"></a>
- Considering only US
- Removing data with empty state

In [17]:
jobs_US = jobs.loc[jobs['Country'] == 'US']

In [18]:
jobs_US[['City','State','Country']]

Unnamed: 0,City,State,Country
0,Washington,DC,US
1,Charlotte,NC,US
2,Winter Park,FL,US
3,Orlando,FL,US
4,Orlando,FL,US
5,Ormond Beach,FL,US
6,Orlando,FL,US
7,Orlando,FL,US
8,Orlando,FL,US
9,Winter Park,FL,US


In [19]:
jobs_US.groupby(['City', 'State', 'Country']).size().reset_index(name = 'Locationwise').sort_values('Locationwise', ascending = False).head()

Unnamed: 0,City,State,Country,Locationwise
6601,Houston,TX,US,19306
9835,New York,NY,US,18395
2651,Chicago,IL,US,17806
3475,Dallas,TX,US,13139
610,Atlanta,GA,US,12352


In [20]:
statewise_jobs = jobs_US.groupby(['State']).size().reset_index(name = 'Statewise').sort_values('Statewise', ascending = False)

In [21]:
jobs_US.groupby(['City']).size().reset_index(name='Citywise').sort_values('Citywise', ascending=False)

Unnamed: 0,City,Citywise
4564,Houston,19323
6809,New York,18402
1782,Chicago,17806
2351,Dallas,13202
408,Atlanta,12365
7650,Phoenix,12297
1709,Charlotte,10419
2056,Columbus,9323
4684,Indianapolis,9235
5632,Los Angeles,8878


In [22]:
citywise_jobs = jobs_US.groupby(['City']).size().reset_index(name='Citywise').sort_values('Citywise', ascending=False)

In [23]:
citywise_jobs_top = citywise_jobs.loc[citywise_jobs['Citywise']>=12]

- jobs_US
- statewise_jobs
- citywise_jobs
- citywise_jobs_top

### User profile based on location

In [24]:
users_training_US = users_training.loc[users_training['Country'] == 'US']

In [25]:
users_training_statewise = users_training_US.groupby('State').size().reset_index(
    name='statewise').sort_values('statewise',ascending=False)
users_training_statewise.head()

Unnamed: 0,State,statewise
11,FL,40381
47,TX,33260
6,CA,31141
17,IL,22557
37,NY,19299


In [26]:
users_training_statewise_top = users_training_statewise.loc[users_training_statewise['statewise'] >= 12]

In [27]:
users_training_citywise = users_training_US.groupby(['City']).size().reset_index(
    name='citywise').sort_values('citywise',ascending=False)
users_training_citywise.head()

Unnamed: 0,City,citywise
1528,Chicago,6964
4066,Houston,5487
4177,Indianapolis,4450
5604,Miami,4359
6965,Philadelphia,4347


In [28]:
users_training_citywise_top = users_training_citywise.loc[users_training_citywise['citywise'] >= 12]

- users_training_US
- users_training_statewise
- users_training_citywise
- users_training_citywise_top

In [29]:
import ast 
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
# from nltk.stem.snowball import SnowballStemmer
# from nltk.stem.wordnet import WordNetLemmatizer
# from nltk.corpus import wordnet

## Building model <a class="anchor" id="model"></a>

In [30]:
jobs_US.columns

Index(['JobID', 'WindowID', 'Title', 'Description', 'Requirements', 'City',
       'State', 'Country', 'Zip5', 'StartDate', 'EndDate'],
      dtype='object')

In [31]:
jobs_US.head().transpose()

Unnamed: 0,0,1,2,3,4
JobID,1,4,7,8,9
WindowID,1,1,1,1,1
Title,Security Engineer/Technical Lead,SAP Business Analyst / WM,P/T HUMAN RESOURCES ASSISTANT,Route Delivery Drivers,Housekeeping
Description,<p>Security Clearance Required:&nbsp; Top Secr...,<strong>NO Corp. to Corp resumes&nbsp;are bein...,<b> <b> P/T HUMAN RESOURCES ASSISTANT</b> <...,CITY BEVERAGES Come to work for the best in th...,I make sure every part of their day is magica...
Requirements,<p>SKILL SET</p>\r<p>&nbsp;</p>\r<p>Network Se...,<p><b>WHAT YOU NEED: </b></p>\r<p>Four year co...,Please refer to the Job Description to view th...,Please refer to the Job Description to view th...,Please refer to the Job Description to view th...
City,Washington,Charlotte,Winter Park,Orlando,Orlando
State,DC,NC,FL,FL,FL
Country,US,US,US,US,US
Zip5,20531,28217,32792,,
StartDate,2012-03-07 13:17:01.643,2012-03-21 02:03:44.137,2012-03-02 16:36:55.447,2012-03-03 09:01:10.077,2012-03-03 09:01:11.88


In [32]:
jobs_US_base_line = jobs_US.iloc[0:10000,0:8]

In [33]:
jobs_US_base_line.head()

Unnamed: 0,JobID,WindowID,Title,Description,Requirements,City,State,Country
0,1,1,Security Engineer/Technical Lead,<p>Security Clearance Required:&nbsp; Top Secr...,<p>SKILL SET</p>\r<p>&nbsp;</p>\r<p>Network Se...,Washington,DC,US
1,4,1,SAP Business Analyst / WM,<strong>NO Corp. to Corp resumes&nbsp;are bein...,<p><b>WHAT YOU NEED: </b></p>\r<p>Four year co...,Charlotte,NC,US
2,7,1,P/T HUMAN RESOURCES ASSISTANT,<b> <b> P/T HUMAN RESOURCES ASSISTANT</b> <...,Please refer to the Job Description to view th...,Winter Park,FL,US
3,8,1,Route Delivery Drivers,CITY BEVERAGES Come to work for the best in th...,Please refer to the Job Description to view th...,Orlando,FL,US
4,9,1,Housekeeping,I make sure every part of their day is magica...,Please refer to the Job Description to view th...,Orlando,FL,US


In [34]:
jobs_US_base_line['Title'] = jobs_US_base_line['Title'].fillna('')
jobs_US_base_line['Description'] = jobs_US_base_line['Description'].fillna('')
#jobs_US_base_line['Requirements'] = jobs_US_base_line['Requirements'].fillna('')

jobs_US_base_line['Description'] = jobs_US_base_line['Title'] + jobs_US_base_line['Description']

## Clean html <a class="anchor" id="html"></a>

In [35]:
import re

def preprocessor(text):
    text = text.replace('\\r', '').replace('&nbsp', '').replace('\n', '')
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub('[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text

In [36]:
jobs_US_base_line['Description'] = jobs_US_base_line['Description'].astype(dtype='str').apply(preprocessor)

In [37]:
jobs_US_base_line.loc[0,'Description']

'security engineer technical leadsecurity clearance required top secret job number tmr 447location of job washington dctmr inc is an equal employment opportunity companyfor more job opportunities with tmr visit our website www tmrhq comsend resumes to hr tmrhq2 com job summary leads the customer rsquo s overall cyber security strategy formalizes service offerings consisted with itil best practices and provides design and architecture support provide security design architecture support for ojp rsquo s it security division itsd leads the secops team in the day to day ojp security operations support provides direction when needed in a security incident or technical issues works in concert with network operations on design integration for best security posture supports business development functions including capture management proposal development and responses and other initiatives to include conferences trade shows webinars developing white papers and the like identifies resources and 

## Dataset

**From here onwards use `jobs_US_base_line` data frame to work on, which is selected by `jobs_US.iloc[0:10000,0:8]`.**

In [38]:
jobs_US_base_line.head()

Unnamed: 0,JobID,WindowID,Title,Description,Requirements,City,State,Country
0,1,1,Security Engineer/Technical Lead,security engineer technical leadsecurity clear...,<p>SKILL SET</p>\r<p>&nbsp;</p>\r<p>Network Se...,Washington,DC,US
1,4,1,SAP Business Analyst / WM,sap business analyst wmno corp to corp resumes...,<p><b>WHAT YOU NEED: </b></p>\r<p>Four year co...,Charlotte,NC,US
2,7,1,P/T HUMAN RESOURCES ASSISTANT,p t human resources assistant p t human resour...,Please refer to the Job Description to view th...,Winter Park,FL,US
3,8,1,Route Delivery Drivers,route delivery driverscity beverages come to w...,Please refer to the Job Description to view th...,Orlando,FL,US
4,9,1,Housekeeping,housekeepingi make sure every part of their da...,Please refer to the Job Description to view th...,Orlando,FL,US


#### Dataframes
- users_training
- users_testing
- apps_training
- apps_testing
- user_history_training
- user_history_testing

##### Location
- jobs_US
- statewise_jobs
- citywise_jobs
- citywise_jobs_top

# Content based filtering <a class="anchor" id="content"></a>

## job description based recommender <a class="anchor" id="jobdesc"></a>
using term frequency-inverse document frequency

In [39]:
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0, stop_words='english')
tfidf_matrix = tf.fit_transform(jobs_US_base_line['Description'])

In [40]:
tfidf_matrix.shape

(10000, 535561)

In [41]:
print(tfidf_matrix)

  (0, 441695)	0.22505879122815065
  (0, 169589)	0.030669382747046916
  (0, 488697)	0.04752972531305034
  (0, 264028)	0.07576675940802603
  (0, 91384)	0.043127331709931944
  (0, 415767)	0.020558424763840996
  (0, 441253)	0.04981586018912077
  (0, 253884)	0.08610417293518174
  (0, 326351)	0.03628758749940522
  (0, 498708)	0.23741486018184604
  (0, 12862)	0.07913828672728201
  (0, 522879)	0.04488762967640074
  (0, 133507)	0.07913828672728201
  (0, 176599)	0.028902410192437344
  (0, 167677)	0.03274166426535237
  (0, 337345)	0.021001362288507987
  (0, 104352)	0.07000309538665801
  (0, 336738)	0.023882963162787138
  (0, 520158)	0.03215125588973581
  (0, 524056)	0.036239661087183676
  (0, 532905)	0.026723078362849224
  (0, 498712)	0.07913828672728201
  (0, 108785)	0.07913828672728201
  (0, 425609)	0.0371210673841218
  (0, 226031)	0.03469578910963611
  :	:
  (9999, 344616)	0.05017478681815903
  (9999, 467781)	0.05017478681815903
  (9999, 74829)	0.05017478681815903
  (9999, 523365)	0.0501747868

In [42]:
jobs_US_base_line.loc[0,'Description']

'security engineer technical leadsecurity clearance required top secret job number tmr 447location of job washington dctmr inc is an equal employment opportunity companyfor more job opportunities with tmr visit our website www tmrhq comsend resumes to hr tmrhq2 com job summary leads the customer rsquo s overall cyber security strategy formalizes service offerings consisted with itil best practices and provides design and architecture support provide security design architecture support for ojp rsquo s it security division itsd leads the secops team in the day to day ojp security operations support provides direction when needed in a security incident or technical issues works in concert with network operations on design integration for best security posture supports business development functions including capture management proposal development and responses and other initiatives to include conferences trade shows webinars developing white papers and the like identifies resources and 

In [43]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [44]:
cosine_sim[0]

array([1.        , 0.03241652, 0.00838853, ..., 0.01491531, 0.01491531,
       0.01491531])

In [45]:
jobs_US_base_line = jobs_US_base_line.reset_index()
titles = jobs_US_base_line['Title']
indices = pd.Series(jobs_US_base_line.index, index=jobs_US_base_line['Title'])

In [46]:
jobs_US_base_line.head()

Unnamed: 0,index,JobID,WindowID,Title,Description,Requirements,City,State,Country
0,0,1,1,Security Engineer/Technical Lead,security engineer technical leadsecurity clear...,<p>SKILL SET</p>\r<p>&nbsp;</p>\r<p>Network Se...,Washington,DC,US
1,1,4,1,SAP Business Analyst / WM,sap business analyst wmno corp to corp resumes...,<p><b>WHAT YOU NEED: </b></p>\r<p>Four year co...,Charlotte,NC,US
2,2,7,1,P/T HUMAN RESOURCES ASSISTANT,p t human resources assistant p t human resour...,Please refer to the Job Description to view th...,Winter Park,FL,US
3,3,8,1,Route Delivery Drivers,route delivery driverscity beverages come to w...,Please refer to the Job Description to view th...,Orlando,FL,US
4,4,9,1,Housekeeping,housekeepingi make sure every part of their da...,Please refer to the Job Description to view th...,Orlando,FL,US


In [47]:
print(indices)

Title
Security Engineer/Technical Lead                                 0
SAP Business Analyst / WM                                        1
P/T HUMAN RESOURCES ASSISTANT                                    2
Route Delivery Drivers                                           3
Housekeeping                                                     4
SALON/SPA COORDINATOR                                            5
SUPERINTENDENT                                                   6
ELECTRONIC PRE-PRESS PROFESSIONAL                                7
UTILITY LINE TRUCK OPERATOR/ DIGGER DERRICK                      8
CONSTRUCTION PROJECT MGR & PM TRAINEE                            9
Administrative Assistant                                        10
ACCOUNT EXECUTIVES                                              11
COMMERCIAL ESTIMATOR                                            12
Immediate Opening                                               13
TESL Adjunct                                            

In [48]:
def get_recommendations(title):
    idx = indices[title]
    #print (idx)
    sim_scores = list(enumerate(cosine_sim[idx]))
    #print (sim_scores)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    job_indices = [i[0] for i in sim_scores]
    return titles.iloc[job_indices]

In [49]:
get_recommendations('SAP Business Analyst / WM').head(10)

1                           SAP Business Analyst / WM
6051                    SAP FI/CO Business Consultant
5159                          SAP Basis Administrator
5868                       SAP FI/CO Business Analyst
5351    SAP Sales and Distribution Solution Architect
4796       Senior Specialist - SAP Configuration - SD
5117                       SAP Integration Specialist
4290           SAP FICO Functional -2years experience
4728           SAP ABAP Developer with PRA experience
5244                                 Business Analyst
Name: Title, dtype: object

In [50]:
get_recommendations('Security Engineer/Technical Lead').head(10)

0                        Security Engineer/Technical Lead
5906                             Senior Security Engineer
6380                Security Technology - SIEM Consultant
3248                Senior Lead Systems Security Engineer
1302                       Information Security Architect
5525                   Sr. Information Security Architect
6873              Integrated System Service Engineer - CA
3230                    Computer Systems Security Manager
1568                 Senior Information Security Engineer
4901    Cloud Services Security Application Administrator
Name: Title, dtype: object

## similar user based recommender <a class="anchor" id="similarusers"></a>

- degree type, majors and total years of experience
- `users_training` dataset

In [51]:
users_training.head()

Unnamed: 0,UserID,WindowID,Split,City,State,Country,ZipCode,DegreeType,Major,GraduationDate,WorkHistoryCount,TotalYearsExperience,CurrentlyEmployed,ManagedOthers,ManagedHowMany
0,47,1,Train,Paramount,CA,US,90723,High School,,1999-06-01 00:00:00,3,10.0,Yes,No,0
1,72,1,Train,La Mesa,CA,US,91941,Master's,Anthropology,2011-01-01 00:00:00,10,8.0,Yes,No,0
2,80,1,Train,Williamstown,NJ,US,8094,High School,Not Applicable,1985-06-01 00:00:00,5,11.0,Yes,Yes,5
3,98,1,Train,Astoria,NY,US,11105,Master's,Journalism,2007-05-01 00:00:00,3,3.0,Yes,No,0
4,123,1,Train,Baton Rouge,LA,US,70808,Bachelor's,Agricultural Business,2011-05-01 00:00:00,1,9.0,Yes,No,0


In [52]:
user_based_approach_US = users_training.loc[users_training['Country']=='US']

In [53]:
user_based_approach = user_based_approach_US.iloc[0:10000,:].copy()

In [54]:
user_based_approach.head()

Unnamed: 0,UserID,WindowID,Split,City,State,Country,ZipCode,DegreeType,Major,GraduationDate,WorkHistoryCount,TotalYearsExperience,CurrentlyEmployed,ManagedOthers,ManagedHowMany
0,47,1,Train,Paramount,CA,US,90723,High School,,1999-06-01 00:00:00,3,10.0,Yes,No,0
1,72,1,Train,La Mesa,CA,US,91941,Master's,Anthropology,2011-01-01 00:00:00,10,8.0,Yes,No,0
2,80,1,Train,Williamstown,NJ,US,8094,High School,Not Applicable,1985-06-01 00:00:00,5,11.0,Yes,Yes,5
3,98,1,Train,Astoria,NY,US,11105,Master's,Journalism,2007-05-01 00:00:00,3,3.0,Yes,No,0
4,123,1,Train,Baton Rouge,LA,US,70808,Bachelor's,Agricultural Business,2011-05-01 00:00:00,1,9.0,Yes,No,0


In [55]:
user_based_approach['DegreeType'] = user_based_approach['DegreeType'].fillna('')
user_based_approach['Major'] = user_based_approach['Major'].fillna('')
user_based_approach['TotalYearsExperience'] = str(user_based_approach['TotalYearsExperience'].fillna(''))

user_based_approach['DegreeType'] = user_based_approach['DegreeType'] + user_based_approach['Major'] + \
                                    user_based_approach['TotalYearsExperience']


In [56]:
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0, stop_words='english')
tfidf_matrix = tf.fit_transform(user_based_approach['DegreeType'])

In [57]:
tfidf_matrix.shape

(10000, 7337)

In [58]:
cosine_sim = linear_kernel(tfidf_matrix,tfidf_matrix)

In [59]:
cosine_sim[0]

array([1.        , 0.67053882, 0.84759861, ..., 0.43990417, 0.79335895,
       0.69670809])

In [60]:
user_based_approach = user_based_approach.reset_index()
userid = user_based_approach['UserID']
indices = pd.Series(user_based_approach.index, index=user_based_approach['UserID'])
indices.head(2)

UserID
47    0
72    1
dtype: int64

In [61]:
def get_recommendations_userwise(userid):
    idx = indices[userid]
    #print (idx)
    sim_scores = list(enumerate(cosine_sim[idx]))
    #print (sim_scores)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    user_indices = [i[0] for i in sim_scores]
    #print (user_indices)
    return user_indices[0:11]

In [62]:
get_recommendations_userwise(123)

[4, 150, 1594, 5560, 2464, 2846, 7945, 8125, 1171, 11, 24]

In [63]:
def get_job_id(usrid_list):
    jobs_userwise = apps_training['UserID'].isin(usrid_list) #
    df1 = pd.DataFrame(data = apps_training[jobs_userwise], columns=['JobID'])
    joblist = df1['JobID'].tolist()
    Job_list = jobs['JobID'].isin(joblist) #[1083186, 516837, 507614, 754917, 686406, 1058896, 335132])
    df_temp = pd.DataFrame(data = jobs[Job_list], columns=['JobID','Title','Description','City','State'])
    return df_temp

In [64]:
get_job_id(get_recommendations_userwise(47))

Unnamed: 0,JobID,Title,Description,City,State
905894,428902,Aircraft Servicer,<b>Job Classification: </b> Direct Hire \r\n\r...,Memphis,TN
975525,1098447,Automotive Service Advisor,<div>\r<div>Briggs Nissan in Lawrence Kansas h...,Lawrence,KS
980507,37309,Medical Lab Technician - High Volume Lab,<span>Position Title:<span>&nbsp;&nbsp;&nbsp;&...,Fort Myers,FL
986244,83507,Nurse Tech (CNA/STNA),"<p align=""center""><b>Purpose of Your Job Posit...",Englewood,FL
987452,93883,Nurse Tech II (CNA/STNA),<B>Nurse Tech II (CNA/STNA)</B> <BR>\r<BR>\rTh...,Fort Myers,FL
1000910,228284,REGISTERED NURSE – ICU,"<p><strong><span><font face="""">Registered Nurs...",Punta Gorda,FL
1007140,284840,Certified Nursing Assistant / CNA,"<hr>\r<p style=""text-align: center""><strong>Ce...",Saint Petersburg,FL
1007141,284841,Home Health Aide / HHA,"<hr>\r<p style=""text-align: center""><strong>Ho...",Saint Petersburg,FL
1009455,312536,Secretary II,<br><br><b>Department: </b>COMM Maryland Cardi...,Baltimore,MD
1011978,341662,Medical Assistant,Certified Medical Assistant for busy Pain Clin...,Fort Myers,FL


## Trying out [implicit collaborative filtering](https://medium.com/radon-dev/als-implicit-collaborative-filtering-5ed653ba39fe)

In [65]:
import random
import pandas as pd
import numpy as np

import scipy.sparse as sparse
from scipy.sparse.linalg import spsolve
from sklearn.preprocessing import MinMaxScaler

```py
#-------------------------
# LOAD AND PREP THE DATA
#-------------------------
 
raw_data = pd.read_table('data/usersha1-artmbid-artname-plays.tsv')
raw_data = raw_data.drop(raw_data.columns[1], axis=1)
raw_data.columns = ['user', 'artist', 'plays']
 
 # Drop rows with missing values
 data = raw_data.dropna()
  
 # Convert artists names into numerical IDs
 data['user_id'] = data['user'].astype("category").cat.codes
 data['artist_id'] = data['artist'].astype("category").cat.codes
 
 # Create a lookup frame so we can get the artist names back in 
 # readable form later.
 item_lookup = data[['artist_id', 'artist']].drop_duplicates()
 item_lookup['artist_id'] = item_lookup.artist_id.astype(str)
 
 data = data.drop(['user', 'artist'], axis=1)
 
 # Drop any rows that have 0 plays
 data = data.loc[data.plays != 0]
 
 # Create lists of all users, artists and plays
 users = list(np.sort(data.user_id.unique()))
 artists = list(np.sort(data.artist_id.unique()))
 plays = list(data.plays)
 
 # Get the rows and columns for our new matrix
 rows = data.user_id.astype(int)
 cols = data.artist_id.astype(int)
 
 # Contruct a sparse matrix for our users and items containing number of plays
data_sparse = sparse.csr_matrix((plays, (rows, cols)), shape=(len(users), len(artists)))
```

In [66]:
apps_training.head()

Unnamed: 0,UserID,WindowID,Split,ApplicationDate,JobID
0,47,1,Train,2012-04-04 15:56:23.537,169528
1,47,1,Train,2012-04-06 01:03:00.003,284009
2,47,1,Train,2012-04-05 02:40:27.753,2121
3,47,1,Train,2012-04-05 02:37:02.673,848187
4,47,1,Train,2012-04-05 22:44:06.653,733748


# Collaborative filtering <a class="anchor" id="collab"></a>

In [67]:
from surprise import Reader, Dataset, SVD, evaluate

In [68]:
reader = Reader()

In [69]:
apps_training.head()

Unnamed: 0,UserID,WindowID,Split,ApplicationDate,JobID
0,47,1,Train,2012-04-04 15:56:23.537,169528
1,47,1,Train,2012-04-06 01:03:00.003,284009
2,47,1,Train,2012-04-05 02:40:27.753,2121
3,47,1,Train,2012-04-05 02:37:02.673,848187
4,47,1,Train,2012-04-05 22:44:06.653,733748


In [70]:
data = Dataset.load_from_df(apps_training[['UserID', 'JobID']], reader)
data.split(n_folds=5)


ValueError: not enough values to unpack (expected 3, got 2)

In [None]:
svd = SVD()
evaluate(svd, data, measures=['RMSE', 'MAE'])

## Hybrid Recommender

In [None]:
def hybrid(userId, title):
    idx = indices[title]
    tmdbId = id_map.loc[title]['id']
    #print(idx)
    movie_id = id_map.loc[title]['movieId']
    
    sim_scores = list(enumerate(cosine_sim[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id']]
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, indices_map.loc[x]['movieId']).est)
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)


In [2]:
!jupyter nbconvert --to script job_recommender_v2.ipynb

[NbConvertApp] Converting notebook job_recommender_v2.ipynb to script
[NbConvertApp] Writing 12522 bytes to job_recommender_v2.py
