In this notebook we prepare the data for a recommendation system. The system is content-based, so we apply NLP to the details column of the data. Towards the end of the notebook we also apply NLP to the category column and use it to make content-based recommendations.

In [1]:
# downloading resources and importing libraries
import nltk
nltk.download('punkt')
nltk.download('wordnet')

import pandas as pd
import numpy as np

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
import re

[nltk_data] Downloading package punkt to /home/ish/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/ish/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
df = pd.read_csv('../data_for_notebooks/recomm_df.csv')

In [4]:
# checking the data once
df.isnull().sum()

id                0
job_title         0
company_name      0
job_loc           0
details           1
category          0
compensation      0
start             0
end             215
skills            0
href              0
dtype: int64

In [5]:
# let's look at the row with the missing details
df[df.details.isnull()]

Unnamed: 0,id,job_title,company_name,job_loc,details,category,compensation,start,end,skills,href
99,103,business development,promon,new delhi,,marketing professional,paid,2016-08-16,,business skills,http://letsintern.com/internship/Marketing-Pro...


In [6]:
# only one row is affected, so we drop it
df.dropna(subset = ['details'], inplace= True)

In [7]:
# checking the shape once again
df.shape

(624, 11)

In [8]:
df.head()

Unnamed: 0,id,job_title,company_name,job_loc,details,category,compensation,start,end,skills,href
0,1,hr executive - recruitment,engenia technologies,gurgaon,we are seeking a hr recruiter who will...,human resources recruiter,paid,2019-03-02,2019-08-28,hr practices,http://letsintern.com/internship/Human-Resourc...
1,2,telecalling & lead generation,abalone technologies pvt ltd,noida,selected intern's day-to-day responsib...,tele sales executive,paid,2019-02-17,2019-08-30,office administration,http://letsintern.com/internship/Tele-Sales-Ex...
2,3,digital marketing internship,brandstory digital marketing company,bangalore,are you looking for digital marketing ...,marketing professional,paid,2018-12-25,2020-04-29,digital marketing,http://letsintern.com/internship/Marketing-Pro...
3,4,recruitment of corporate bank back office post,bandhan pvt.ltd,"kathua,barasat,bardhaman,bongoan,habra",huge opportunity in corporate bank for...,accountant,paid,2019-03-12,,analytical skills,http://letsintern.com/internship/Accountant-in...
4,5,software developer,trippyigloo,bangalore,we are looking for interns who are wil...,software developer : python,paid,2019-01-30,2019-06-20,"go(golang),java,mongodb,nginx,python",http://letsintern.com/internship/Software-Deve...


We have decided to use the following features for creating filters in the app:

1. Skills/Category
2. Locations
3. Compensation

In the final app, the user will either choose from the options provided or enter values for these filters themselves; that part has not been decided yet. Given the profile ID the user enters, a content-based recommendation system will recommend similar profiles. The similarity will be computed from the details column.

We will create functions that implement the following steps in this notebook:
1. Normalize: replace non-word characters and digits in the sentences with spaces (everything is already lowercase)
2. Tokenize
3. Remove extra space from all tokens
4. Stem each token
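
As a quick illustration of the normalization step on its own, here is a minimal sketch that uses only the standard library. The helper name `normalize` and the sample sentence are made up for this example, and a plain whitespace split stands in for `word_tokenize`, so the result may differ slightly from what NLTK returns.

```python
import re

def normalize(text):
    """Replace non-word characters and digits with spaces, then split on whitespace."""
    # \W matches anything that is not a letter, digit or underscore,
    # so underscores are left in place
    text = re.sub(r'\W', ' ', text)
    # digits are word characters, so they need their own pass
    text = re.sub(r'[0-9]', ' ', text)
    # whitespace split is a stand-in for word_tokenize (an assumption of this sketch)
    return text.split()

sample = "selected intern's day-to-day responsibilities: 3 calls/day"
words = normalize(sample)
print(words)
```

Note that the apostrophe in "intern's" is replaced too, which leaves a stray one-letter token `s` behind.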

In [9]:
def tokenize(sentences):
    '''
    Tokenizes a paragraph after normalizing it and returns stemmed tokens.
    
    INPUT:
    sentences - a paragraph that needs to be tokenized
    
    OUTPUT:
    tokens - list of stemmed tokens
    
    '''
    # normalizing: replace non-word characters and digits with spaces
    sentences = re.sub(r'\W', ' ', sentences)
    sentences = re.sub(r'[0-9]', ' ', sentences)

    # tokenizing and removing extra whitespace around each token
    tokens = word_tokenize(sentences)
    tokens = [i.strip() for i in tokens]
    
    # stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(i) for i in tokens]
    return tokens

In [10]:
def similarity_matrix(df):
    '''
    returns a similarity matrix, in the form of a dataframe, between different internships by using the 
    details section of df.
    
    INPUT:
    df - dataframe with 'details' as one of the columns
    
    OUTPUT:
    sim - similarity matrix(dataframe) with internship id as column and row labels 
    
    '''
    details = df['details']
    vect = CountVectorizer(tokenizer= tokenize, stop_words = 'english')
    tfidf = TfidfTransformer()
    
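    # TfidfTransformer accepts the sparse count matrix directly
    mat = tfidf.fit_transform(vect.fit_transform(details)).toarray()
    # rows are L2-normalized by default, so the dot product is the cosine similarity
    sim = np.dot(mat, mat.T)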
    sim = pd.DataFrame(sim, columns=df.id, index = df.id)
    return sim
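
`TfidfTransformer` scales every row to unit length by default (`norm='l2'`), which is why the plain dot product in `similarity_matrix` can be read as a cosine similarity. The sketch below shows this with NumPy alone on a made-up count matrix; it skips the idf weighting and illustrates only the normalization.

```python
import numpy as np

# toy term-count matrix: 3 documents x 4 terms (made-up numbers)
counts = np.array([[2, 0, 1, 0],
                   [4, 0, 2, 0],   # same direction as document 0, twice as long
                   [0, 3, 0, 1]],  # shares no terms with document 0
                  dtype=float)

# scale every row to unit Euclidean length
unit = counts / np.linalg.norm(counts, axis=1, keepdims=True)

# dot products of unit vectors are cosine similarities
cos = unit @ unit.T
print(np.round(cos, 3))
```

Every document has similarity 1 with itself, documents 0 and 1 also score 1 because they point in the same direction, and documents 0 and 2 score 0.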

In [11]:
sim = similarity_matrix(df)

In [12]:
sim.shape

(624, 624)

In [13]:
sim.to_csv('../data_for_notebooks/recommendation_matrix.csv', index = True)
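
The saved matrix is meant to be read back elsewhere, so here is a small sketch of the round trip followed by a lookup. Everything in it is illustrative: the ids and scores are made up, and `top_similar` is a hypothetical helper, not the function used in make_recs. One thing to watch is that column labels come back from CSV as strings while the index is parsed as integers.

```python
import io
import pandas as pd

# toy similarity matrix with made-up ids and scores
ids = pd.Index([1, 2, 3, 4], name='id')
toy_sim = pd.DataFrame(
    [[1.0, 0.2, 0.7, 0.1],
     [0.2, 1.0, 0.3, 0.5],
     [0.7, 0.3, 1.0, 0.4],
     [0.1, 0.5, 0.4, 1.0]],
    index=ids, columns=ids)

# write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
toy_sim.to_csv(buffer, index=True)
buffer.seek(0)
loaded = pd.read_csv(buffer, index_col=0)

# column labels are read back as strings; convert them to match the index
loaded.columns = loaded.columns.astype(int)

def top_similar(sim, internship_id, n=2):
    """Hypothetical helper: ids of the n rows most similar to internship_id."""
    scores = sim.loc[internship_id].drop(internship_id)  # exclude the item itself
    return scores.nlargest(n).index.tolist()

recs = top_similar(loaded, 1)
print(recs)
```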

# ***


After trying out the recommendations in the make_recs notebook, I came back to this notebook to make some changes. The recommendations can probably be improved, so we try a few variations on this data (it is small enough to make that cheap).

The ways I can think of right now to change how recommendations are made:
1. Do not use tfidf, but still stem the tokens
2. Lemmatize the words instead of stemming them
3. Use lemmatization and do not use tfidf
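
One thing to keep in mind when dropping tfidf: a dot product of raw counts is not length-normalized, so a longer description can score higher just because it repeats shared terms more often. The toy vectors below are made up to show the effect; whether it matters for this dataset is something only the comparison in make_recs can tell.

```python
import numpy as np

query = np.array([1, 1, 0, 0])       # term counts of the internship we start from
short_doc = np.array([1, 1, 0, 0])   # identical content
long_doc = np.array([3, 3, 2, 2])    # shares the terms but is longer and covers more

def cosine(a, b):
    """Cosine similarity of two count vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# raw count dot products favor the longer document
raw_short = query @ short_doc
raw_long = query @ long_doc

# cosine similarity favors the identical document
cos_short = cosine(query, short_doc)
cos_long = cosine(query, long_doc)

print(raw_short, raw_long)
print(round(cos_short, 3), round(cos_long, 3))
```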

1. Do not use Tfidf

In [14]:
def similarity_matrix_wo_tfidf(df):
    '''
    returns a similarity matrix, in the form of a dataframe, between different internships by using the 
    details section
    
    INPUT:
    df - dataframe with 'details' as one of the columns
    
    OUTPUT:
    sim - similarity matrix(dataframe) with internship id as column and row labels 
    
    '''
    details = df['details']
    vect = CountVectorizer(tokenizer= tokenize, stop_words = 'english')
    
    mat = vect.fit_transform(details).toarray()
    sim = np.dot(mat, mat.T)
    sim = pd.DataFrame(sim, columns=df.id, index = df.id)
    return sim

In [15]:
sim_1 = similarity_matrix_wo_tfidf(df)
sim_1.to_csv('../data_for_notebooks/recommendation_matrix_wo_tfidf.csv', index = True)

2. Lemmatize the words

In [16]:
def tokenize_lem(sentences):
    '''
    tokenizes a bunch of sentences after normalizing it and returns lemmatized tokens.
    
    INPUT:
    sentences - a paragraph that needs to be tokenized
    
    OUTPUT:
    tokens - list of lemmatized tokens
    
    '''
    # normalizing: replace non-word characters and digits with spaces
    sentences = re.sub(r'\W', ' ', sentences)
    sentences = re.sub(r'[0-9]', ' ', sentences)

    tokens = word_tokenize(sentences)
    tokens = [i.strip() for i in tokens]
    
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(i) for i in tokens]
    return tokens

In [17]:
def similarity_matrix_w_lem(df):
    '''
    returns a similarity matrix, in the form of a dataframe, between different internships by using the 
    details section
    
    INPUT:
    df - dataframe with 'details' as one of the columns
    
    OUTPUT:
    sim - similarity matrix(dataframe) with internship id as column and row labels 
    
    '''
    details = df['details']
    vect = CountVectorizer(tokenizer= tokenize_lem, stop_words = 'english')
    tfidf = TfidfTransformer()
    
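    # TfidfTransformer accepts the sparse count matrix directly
    mat = tfidf.fit_transform(vect.fit_transform(details)).toarray()
    # rows are L2-normalized by default, so the dot product is the cosine similarity
    sim = np.dot(mat, mat.T)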
    sim = pd.DataFrame(sim, columns=df.id, index = df.id)
    return sim

In [18]:
sim_2 = similarity_matrix_w_lem(df)
sim_2.to_csv('../data_for_notebooks/recommendation_matrix_w_lem.csv', index = True)

3. Use lemmatization and do not use tfidf

In [19]:
def similarity_matrix_w_lem_wo_tfidf(df):
    '''
    returns a similarity matrix, in the form of a dataframe, between different internships by using the 
    details section
    
    INPUT:
    df - dataframe with 'details' as one of the columns
    
    OUTPUT:
    sim - similarity matrix(dataframe) with internship id as column and row labels 
    '''
    details = df['details']
    vect = CountVectorizer(tokenizer= tokenize_lem, stop_words = 'english')
    
    mat = vect.fit_transform(details).toarray()
    sim = np.dot(mat, mat.T)
    sim = pd.DataFrame(sim, columns=df.id, index = df.id)
    return sim

In [21]:
sim_3 = similarity_matrix_w_lem_wo_tfidf(df)
sim_3.to_csv('../data_for_notebooks/recommendation_matrix_w_lem_wo_tfidf.csv', index = True)

After building these three matrices from the details column, we try an alternative approach and use the 'category' column to form the similarity matrix. This should give more relevant results, because many internships share a category and similar internships have very similar category names. In the make_recs notebook we will try this matrix too and compare it with the results of the three above.

In [22]:
def similarity_matrix_cat(df):
    '''
    returns a similarity matrix, in the form of a dataframe, between different internships by using the 
    category section
    
    INPUT:
    df - dataframe with 'category' as one of the columns
    
    OUTPUT:
    sim - similarity matrix(dataframe) with internship id as column and row labels 
    '''
    cats = df['category']
    vect = CountVectorizer(tokenizer= tokenize, stop_words = 'english')
    
    mat = vect.fit_transform(cats).toarray()
    sim = np.dot(mat, mat.T)
    sim = pd.DataFrame(sim, columns=df.id, index = df.id)
    return sim

*Note: The above has been formed in the same way as sim_1, i.e., without tfidf and with stemmed tokens. This is because, when we compare the recommendations given by the different similarity matrices, sim_1 gives the best results. Later we therefore combine sim_1 and sim_cat to get relevant and interesting recommendations. If sim_cat had been made using tfidf, its elements would have been very small compared to sim_1's, which would have made it harder to build a similarity matrix that uses both.*

In [24]:
sim_cat = similarity_matrix_cat(df)

In [25]:
sim_cat.to_csv('../data_for_notebooks/recommendation_df_cat.csv', index = True)
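
This notebook does not show how sim_1 and sim_cat are combined; that happens in make_recs. Purely as an illustration of why keeping both on the same count-based scale is convenient, here is one possible way to combine them: an element-wise weighted sum. The function `combine`, the weight and all the numbers are assumptions made up for this sketch.

```python
import pandas as pd

ids = [1, 2, 3]

# made-up count-based scores standing in for sim_1 (details) and sim_cat (category)
toy_details = pd.DataFrame([[40, 10, 0],
                            [10, 30, 5],
                            [0, 5, 20]], index=ids, columns=ids)
toy_cat = pd.DataFrame([[2, 0, 2],
                        [0, 3, 0],
                        [2, 0, 2]], index=ids, columns=ids)

def combine(sim_a, sim_b, weight_b=1.0):
    """Hypothetical combination: element-wise sum with a weight on the second matrix."""
    return sim_a + weight_b * sim_b

# a larger weight lets a shared category outrank overlap in the details text
combined = combine(toy_details, toy_cat, weight_b=6.0)

# most similar internship to id 1, excluding itself
best_for_1 = combined.loc[1].drop(1).idxmax()
print(combined)
print(best_for_1)
```

With details alone, internship 2 is the closest match for internship 1; once the category scores are weighted in, internship 3 takes over because it shares the category.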