### Hi everyone! It's been a while since I last showed up on Kaggle.

#### I'm working on the Google Job dataset again. This time, I would like to build a simple recommendation system for one scenario: a user is looking at a position and wants to find similar openings.

![google](http://img.technews.tw/wp-content/uploads/2015/09/Google-logo_1.jpg)

# Outline

## Recommendation System

- [EDA](#0)   
    * I'll do a simple exploration of the data structure and values
- [Modeling](#1)
    * I'll vectorize the text and find similar positions based on the job description
- [Finalizing](#2)
    * I'll also take the requirements into account in this part
    
## [Text Clustering](#Clustering)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.style.use('ggplot')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
df = pd.read_csv('../input/google-job-skills/job_skills.csv')

## [EDA](#0)

**First, I would like to know more about the data**

- Starting from the columns
- Then, the text pattern in columns
- Finally, the correlation between different positions

In [None]:
df.head(3)

In [None]:
# Rename the columns that contain spaces so they can be accessed as attributes (df.column_name)
df = df.rename(columns={'Minimum Qualifications': 'Minimum_Qualifications', 'Preferred Qualifications': 'Preferred_Qualifications'})

In [None]:
df.Company.value_counts()

In [None]:
df.Category.value_counts()

In [None]:
df.Location.value_counts()[:10]

In [None]:
# Take the last comma-separated part of Location and strip the leading space
df['Country'] = df['Location'].apply(lambda x : x.split(',')[-1].strip())

In [None]:
df.Country.value_counts()[:10]

In [None]:
pd.isnull(df).sum()

In [None]:
df = df.dropna(how='any',axis='rows')

## [Modeling](#1)

In [None]:
# Perform the necessary imports for similarity
from sklearn.decomposition import NMF
from sklearn.preprocessing import Normalizer, MaxAbsScaler
from sklearn.pipeline import make_pipeline


scaler = MaxAbsScaler()

model = NMF(n_components=100)

normalizer = Normalizer()

# Chain the steps: scale the features, factorize with NMF, then L2-normalize each row
pipeline = make_pipeline(scaler,model,normalizer)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectors_Responsibilities = vectorizer.fit_transform(df['Responsibilities'])

In [None]:
Responsibilities = pipeline.fit_transform(vectors_Responsibilities)

In [None]:
df_Responsibilities = pd.DataFrame(Responsibilities,index=df['Title'])

In [None]:
df_Responsibilities.head(2)

In [None]:
# None removes the column-width limit (-1 has been deprecated since pandas 1.0)
pd.set_option('display.max_colwidth', None)
print(df[df.Title.str.contains('Data Scientist')]['Title'])

In [None]:
Position = df_Responsibilities.loc['Customer Experience Data Scientist, Google Cloud Support']

In [None]:
similarities_1 = df_Responsibilities.dot(Position)

In [None]:
similarities_1[:3]

In [None]:
print(similarities_1.nlargest())
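
A note on why a plain dot product works as a similarity score here: the last pipeline step, `Normalizer`, rescales each row to unit L2 length by default, so the dot product of two rows equals their cosine similarity. Below is a minimal sketch with made-up vectors. The array `toy` and its values are assumptions for illustration only, and the normalization is done by hand with NumPy rather than through the pipeline.

```python
import numpy as np

# Toy feature rows standing in for the NMF output (values are made up).
toy = np.array([
    [3.0, 4.0, 0.0],
    [6.0, 8.0, 0.0],   # same direction as row 0
    [0.0, 0.0, 5.0],   # orthogonal to row 0
])

# Rescale each row to unit L2 length, mimicking Normalizer's default behaviour.
unit = toy / np.linalg.norm(toy, axis=1, keepdims=True)

# With unit-length rows, the dot product with the query row is the cosine similarity.
query = unit[0]
cosine = unit @ query
print(cosine)
```

Rows pointing in the same direction as the query score close to 1, and rows with no shared features score close to 0, regardless of how long the original vectors were.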

### Let's see if the role is similar and ideal as an alternative.

In [None]:
df[np.isin(df['Title'],similarities_1.nlargest().index.tolist())].head()

In [None]:
type(similarities_1)

### In my opinion, the role is a good alternative, but the requirements could be a blocker. So let's take the requirements into account as well.

In [None]:
vectorizer_Requirements = TfidfVectorizer()
vectors_Requirements = vectorizer_Requirements.fit_transform(df['Minimum_Qualifications'])

In [None]:
Requirements = pipeline.fit_transform(vectors_Requirements)

In [None]:
df_Requirementss = pd.DataFrame(Requirements,index=df['Title'])

In [None]:
Position = df_Requirementss.loc['Customer Experience Data Scientist, Google Cloud Support']

In [None]:
# Compare within the requirements space: Position was taken from df_Requirementss
similarities_2 = df_Requirementss.dot(Position)

In [None]:
print(similarities_2.nlargest())

Though this looks a bit weird, let's first see how to combine responsibilities and requirements.

In [None]:
similarities_1

In [None]:
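# rename() returns a new Series, so the result has to be assigned back
similarities_1 = similarities_1.rename("similarity")
similarities_2 = similarities_2.rename("similarity")

# lsuffix disambiguates the two identically named columns
similarities_1.to_frame().join(similarities_2.to_frame(),lsuffix='1')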

In [None]:
similarities_overall = (2 * similarities_1) + similarities_2

In [None]:
print(similarities_overall.nlargest())
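
The weighting above counts responsibilities twice as much as requirements. Here is a small sketch of the same idea as a reusable helper, assuming both scores are pandas Series indexed by title. The function name `combine`, its weight parameters, and the toy scores are hypothetical and not part of the notebook. Dividing by the total weight is an added step that keeps the result on the same 0 to 1 scale without changing the ranking.

```python
import pandas as pd

# Hypothetical similarity scores for three made-up titles.
resp = pd.Series({"Role A": 1.0, "Role B": 0.8, "Role C": 0.2})
reqs = pd.Series({"Role A": 1.0, "Role B": 0.1, "Role C": 0.9})

def combine(resp_sim, req_sim, w_resp=2.0, w_req=1.0):
    """Weighted average of two similarity Series aligned on their index."""
    return (w_resp * resp_sim + w_req * req_sim) / (w_resp + w_req)

overall = combine(resp, reqs)
print(overall.nlargest(2))
```

Changing `w_resp` and `w_req` is a quick way to check how sensitive the recommendations are to the choice of weights.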

In [None]:
df[np.isin(df['Title'],similarities_overall.nlargest(3).index.tolist())].head()

### The result is not bad! One of the alternative positions puts more emphasis on soft skills, while the other is more similar in terms of hard skills. I think both look like good choices.

## [Text Clustering](#Clustering)

#### The purpose of this part is to find the relevant words, skills, and requirements across different roles using cluster analysis instead of the word cloud from my previous project.

In [None]:
from scipy.cluster.vq import kmeans, vq
from numpy import random

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
import string

# The stop_words module was removed from scikit-learn; the list now lives in feature_extraction.text
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from spacy.lang.en.stop_words import STOP_WORDS

In [None]:
string.punctuation

In [None]:
stop_words_0 = set(stopwords.words('english')) 
stop_words = ['and', 'in', 'of', 'or', 'with','to','on','a']

def remove_noise(text):
    tokens = word_tokenize(text)
    clean_tokens = []
    lemmatizer=WordNetLemmatizer()
    for token in tokens:
        token = re.sub('[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]+', '', token)
        token = lemmatizer.lemmatize(token.lower())
        if len(token) > 1 and token not in stop_words_0 and token not in stop_words:
            clean_tokens.append(token)
            
    return clean_tokens

In [None]:
# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=100,tokenizer=remove_noise)

# Use the .fit_transform() method on the Minimum_Qualifications column
tfidf_matrix = tfidf_vectorizer.fit_transform(df['Minimum_Qualifications'])

In [None]:
# Call seed(); assigning to random.seed would overwrite the function without seeding anything
random.seed(123)

In [None]:
distortions = []
num_clusters = range(2, 25)

# Create a list of distortions from the kmeans function
for i in num_clusters:
    # toarray() gives a plain ndarray; todense() returns np.matrix, which SciPy may not handle well
    cluster_centers, distortion = kmeans(tfidf_matrix.toarray(),i)
    distortions.append(distortion)

# Create a data frame with two lists - num_clusters, distortions
elbow_plot = pd.DataFrame({'num_clusters': num_clusters, 'distortions': distortions})

# Create a line plot of num_clusters and distortions
sns.lineplot(x='num_clusters', y='distortions', data = elbow_plot)
plt.xticks(num_clusters)
plt.title('Clusters and Distortions')
plt.show()
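
To show what the elbow plot is looking for, here is a minimal sketch on synthetic data where the right number of clusters is known in advance. The three blobs, their centers, and the variable names are made up for illustration. SciPy's `kmeans` returns the codebook and the distortion, which is the mean Euclidean distance from each observation to its nearest centroid.

```python
import numpy as np
from scipy.cluster.vq import kmeans

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs in 2-D (made-up data).
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

distortions = []
for k in range(1, 7):
    # kmeans returns (codebook, distortion) for the best of its runs.
    _, distortion = kmeans(points, k)
    distortions.append(distortion)

print(np.round(distortions, 2))
```

On data like this the distortion should fall sharply up to three clusters and then flatten, which is the "elbow". On real TF-IDF features the bend is usually much less clear, so the chosen number of clusters is a judgment call.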

In [None]:
# toarray() gives a plain ndarray; todense() returns np.matrix, which SciPy may not handle well
cluster_centers, distortion = kmeans(tfidf_matrix.toarray(),13)

# Generate terms from the tfidf_vectorizer object
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms = tfidf_vectorizer.get_feature_names_out()

for i in range(13):
    # Sort the terms by centroid weight and print the top 5
    center_terms = dict(zip(terms, list(cluster_centers[i])))
    sorted_terms = sorted(center_terms, key=center_terms.get, reverse=True)
    print(sorted_terms[:5])

From these groups of words, I can tell that different clusters correspond to different fields of positions.

In [None]:
# Return the vocabulary indices of the top_n highest-weighted words in one row of the TF-IDF matrix
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Let's transform that zipped dict into a series
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Let's sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]
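
To make the top-n lookup easier to follow, here is a minimal sketch on a tiny hand-built sparse matrix. The weights, the `index_to_word` mapping, and the helper name `top_terms` are all made up for illustration. Unlike `return_weights`, it returns the words themselves rather than their column indices.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Made-up TF-IDF weights: 2 documents x 4 vocabulary columns.
weights = csr_matrix([
    [0.0, 0.7, 0.2, 0.5],
    [0.9, 0.0, 0.1, 0.0],
])

# Hypothetical column-index -> word mapping.
index_to_word = {0: "python", 1: "degree", 2: "team", 3: "sql"}

def top_terms(matrix, row, top_n):
    """Return the top_n highest-weighted words in one row of a CSR matrix."""
    doc = matrix[row]
    # .indices holds the non-zero column ids, .data the matching weights.
    scores = pd.Series(doc.data, index=[index_to_word[i] for i in doc.indices])
    return scores.sort_values(ascending=False).head(top_n).index.tolist()

print(top_terms(weights, 0, 2))
```

Only the non-zero entries of a row are stored in a CSR matrix, so the lookup never touches words that do not occur in the document.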

In [None]:
vocab = {v:k for k,v in tfidf_vectorizer.vocabulary_.items()}

In [None]:
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Call return_weights (defined above) for each row and extend the list we're creating
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, tfidf_vectorizer.vocabulary_, tfidf_matrix, 5)

# By converting filtered_words back to a list, we can use it to filter the columns in the text vector
filtered_text = tfidf_matrix[:, list(filtered_words)]

In [None]:
print(filtered_text)