# Loading Libraries <a id="1"></a> <br>

In [None]:
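import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk  # tokenizer used in parse_keywords
from nltk.corpus import stopwords
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as py  # assumed source of py.iplot used later
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)  # enable offline Plotly rendering in the notebook
plt.style.use('ggplot')
from collections import Counter
from wordcloud import WordCloud
from PIL import Image
import urllib.request
import random
from sklearn.preprocessing import StandardScaler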

# Data Preprocessing <a id="2"></a> <br>

Loading the dataset and gathering a glimpse:

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv("../input/new-york-city-current-job-postings/nyc-jobs.csv")

In [None]:
df.head()


In [None]:
df.info()

#### Columns Description:

- **Job ID**: The Unique Job ID for each opening
- **Posting Type**: The opening type, whether internal or external, for the job.
- **# Of Positions**: The number of positions available for a certain opening
- **Business Title**: The position the candidate would hold.
- **Civil Service Title**: The Broad Title the position would be classified under
- **Title Code No**: The Code for a particular title
- **Level**: The authority the certain opening would bring with it
- **Job Category**: Broad Classification of where all the jobs would fall in
- **Full-time/Part-Time**: Time frame of a job.
- **Salary Range From**: The beginning salary cap for that particular opening
- **Salary Range To**: The highest cap for that particular job opening.
- **Salary Frequency**: The payment factor for the job, hourly or annual
- **Work Location**: The location of the workplace
- **Division/Work Unit**: Broad working units for all the jobs 
- **Job Description**: A brief idea of what the job will contain
- **Minimum Qual Requirements**: The minimum qualifications a candidate must possess for the job
- **Preferred Skills**: Optimal skills which the posting is looking for
- **Additional Information**: Any additional information provided with the job opening
- **Hours/Shift**: The timings for the job
- **Work Location 1**: Additional information for the work location
- **Recruitment Contact**: Empty field, supposed to contain numbers
- **Residency Requirement**: Whether the employee must be a resident of NYC.
- **Posting Date**: When the opening was announced.
- **Post Until**: The closing date.
- **Posting Updated**: The time when the posting was updated for the opening.
- **Process Date**: When the posting process was completed

Phew! That was a lot of columns. Well then, let's get to exploring them!

# Data Preprocessing

In [None]:
def missing_values_table(df):
   
    # Total missing values
    mis_val = df.isnull().sum()
    
    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    
    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    
    # Rename the columns
    mis_val_table_columns = mis_val_table.rename(
    columns = {0 : 'Missing Values', 1 : '% of Total Values'})
    
    # Keep only columns with missing values (.iloc[:, 1] != 0),
    # then sort by percentage missing, descending
    mis_val_table_columns = mis_val_table_columns[
        mis_val_table_columns.iloc[:,1] != 0].sort_values(
    '% of Total Values', ascending=False).round(2)  # round(2), keep 2 digits
    
    # Print some summary information
    print("Dataset has {} columns.".format(df.shape[1]) + '\n' + 
    "There are {} columns that have missing values.".format(mis_val_table_columns.shape[0]))
    
    # Return the dataframe with missing information
    return mis_val_table_columns

In [None]:
missing_values_table(df)

In [None]:
df = df.drop(['Recruitment Contact', 'Hours/Shift', 'Post Until', 'Work Location 1'],axis=1)

As the table above shows, Recruitment Contact, Hours/Shift, Post Until, and Work Location 1 each have more than 50% null values, so we drop these columns.
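The same decision can be made programmatically instead of by reading the table. Here is a minimal sketch on toy data, assuming a 50% cutoff; the helper name `drop_sparse_columns` and the sample frame are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, threshold=0.5):
    """Return a copy of `frame` without columns whose missing fraction exceeds `threshold`."""
    missing_fraction = frame.isnull().mean()  # per-column share of nulls
    to_drop = missing_fraction[missing_fraction > threshold].index
    return frame.drop(columns=to_drop)

# Toy data (hypothetical): 'b' is 75% missing, 'c' is 25% missing
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [np.nan, "x", "y", "z"],
})
cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

Deriving the list from the data keeps the drop step consistent if the dataset is refreshed and the missing-value shares shift.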

In [None]:
df = df.drop(['Additional Information'],axis=1)

'Additional Information' is also not relevant to our requirement, so it is removed as well.

In [None]:
missing_values_table(df)

In [None]:
for column in ['Job Category','Residency Requirement','Posting Date', 'Posting Updated','Process Date', 'To Apply']:
    df[column] = df[column].fillna(df[column].mode()[0]) 

For the few variables with less than 0.1% null values, we replace the nulls with the mode of the respective feature.
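To make the mode imputation concrete, here is a small sketch on a toy frame. The values are made up for illustration; only the `fillna(mode()[0])` pattern comes from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value per column
toy = pd.DataFrame({
    "Job Category": ["Health", "Legal", "Health", np.nan],
    "Residency Requirement": ["Yes", np.nan, "Yes", "No"],
})

for column in toy.columns:
    # mode() returns the most frequent value(s) as a Series; take the first
    most_frequent = toy[column].mode()[0]
    toy[column] = toy[column].fillna(most_frequent)

print(toy)
```

Mode imputation suits categorical columns with very few gaps, since it barely shifts the category distribution.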

# Exploratory Data Analysis<a id="3"></a> <br>


### Highest High Salary Range <a id="9"></a> <br>

In [None]:

high_sal_range = (df.groupby('Civil Service Title')['Salary Range To'].mean().nlargest(10)).reset_index()

fig = px.bar(high_sal_range, y="Civil Service Title", x="Salary Range To", orientation='h', title = "Highest High Salary Range",color=  "Salary Range To", color_continuous_scale= px.colors.qualitative.G10).update_yaxes(categoryorder="total ascending")
fig.show()


It seems that **Senior General Deputy Manager** has, on average, the highest salary range, reaching up to $230,000 per year!
Now that's an impressive amount.

Most of the openings in the top ten by salary seem to be executive or other senior posts. These are the roles that bring in the most money on average, which explains the high salaries people hear about!

In [None]:
popular_categories = df['Job Category'].value_counts()[:5]
popular_categories

### Top 10 Job Openings via Category <a id="15"></a> <br>

In [None]:
job_categorydf = df['Job Category'].value_counts(sort=True, ascending=False)[:10].rename_axis('Job Category').reset_index(name='Counts')
job_categorydf = job_categorydf.sort_values('Counts')

In [None]:
trace = go.Scatter(y = job_categorydf['Job Category'],x = job_categorydf['Counts'],mode='markers',
                   marker=dict(size= job_categorydf['Counts'].values/2,
                               color = job_categorydf['Counts'].values,
                               colorscale='Viridis',
                               showscale=True,
                               colorbar = dict(title = 'Opening Counts')),
                   text = job_categorydf['Counts'].values)

data = [(trace)]

layout= go.Layout(autosize= False, width = 1000, height = 750,
                  title= 'Top 10 Job Openings Count',
                  hovermode= 'closest',
                  xaxis=dict(title= 'Job Openings Count',showgrid=False,zeroline=False,
                             showline=False),
                  yaxis=dict(title= 'Job Category',ticklen= 2,
                             gridwidth= 5,showgrid=False,
                             zeroline=True,showline=False),
                  showlegend= False)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

# Feature Engineering

In [None]:
num_cols = df.select_dtypes(include='number').columns

In [None]:
num_cols

In [None]:
cat_cols = list(set(df.columns) - set(num_cols))

In [None]:
today = pd.Timestamp.today()

In [None]:
redudant_cols = ['Job ID', '# Of Positions','Posting Updated','Minimum Qual Requirements','To Apply','Business Title','Level']

In [None]:
df[cat_cols]

Based on the business problem given in the problem statement, personal information (posting date, process date, residency details) will be of no use for our employee segregation.

In [None]:
df = df.drop(redudant_cols,axis=1)

In [None]:
df

### Data Cleaning and Transformation

In [None]:
def parse_categories(x):
    l = x.replace('&', ',').split(',')
    l = [x.strip().rstrip(',') for x in l]
    key_categories.extend(l)

In [None]:
def parse_keywords(x, l):
    x = x.lower()
    tokens = nltk.word_tokenize(x)
    stop_words = set(stopwords.words('english'))
    token_l = [w for w in tokens if w not in stop_words and w.isalpha()]
    l.extend(token_l)

In [None]:
def preferred_skills(x):
    kwl = []
    df[df['Job Category'] == x]['Preferred Skills'].dropna().apply(parse_keywords, l=kwl)
    kwl = pd.Series(kwl)
    return kwl.value_counts()[:20]

In [None]:
key_categories = []
df['Job Category'].dropna().apply(parse_categories)
key_categories = pd.Series(key_categories)
key_categories = key_categories[key_categories!='']
popular_categories = key_categories.value_counts().iloc[:25]

In [None]:
key_categories

In [None]:
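# Note: this assignment aligns on index, so row i of df receives the i-th parsed
# token (or NaN where that index was filtered out), not necessarily a category
# taken from row i's own 'Job Category' value
df['cat'] = key_categories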
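If one category per posting is wanted, a row-aligned alternative is to split each row's own `Job Category` value and keep its first entry. Here is a minimal sketch on toy data; the column name `primary_cat` and the choice of the first entry are assumptions for illustration, not the notebook's method.

```python
import pandas as pd

# Hypothetical postings with multi-part categories
toy = pd.DataFrame({
    "Job Category": [
        "Engineering, Architecture, & Planning",
        "Health",
        "Legal Affairs Policy, Research & Analysis",
    ]
})

# Split on ',' and '&' within each row, strip whitespace, drop empty tokens
tokens = (
    toy["Job Category"]
    .str.replace("&", ",", regex=False)
    .str.split(",")
    .apply(lambda parts: [p.strip() for p in parts if p.strip()])
)

# Keep the first token of each row so there is exactly one value per posting
toy["primary_cat"] = tokens.str[0]
print(toy["primary_cat"].tolist())
```

Because every step operates row by row, the resulting column keeps the same index as the postings it was derived from.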

In [None]:
plt.figure(figsize=(10,10))
sns.countplot(y=key_categories, order=popular_categories.index, palette='YlGn')

In [None]:

salary_table = df[['Civil Service Title', 'Salary Range From', 'Salary Range To']]


In [None]:
jobs_highest_high_range = pd.DataFrame(salary_table.groupby(['Civil Service Title'])['Salary Range To'].mean().nlargest(10)).reset_index()
plt.figure(figsize=(8,6))
sns.barplot(y='Civil Service Title', x='Salary Range To', data=jobs_highest_high_range, palette='Greys')

In [None]:
def plot_wordcloud(text):
    wordcloud = WordCloud(background_color='white',
                     width=1024, height=720).generate(text)
    plt.clf()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis('off')
    plt.show()

In [None]:
job_description_keywords = []
df['Job Description'].apply(parse_keywords, l=job_description_keywords)
plt.figure(figsize=(10, 8))
counter = Counter(job_description_keywords)
common = [x[0] for x in counter.most_common(40)]
plot_wordcloud(' '.join(common))

From the above word cloud, it can be seen that work, city, project, water, and new are the most frequently used words in the job descriptions, whereas staff, system, management, planning, design, support, etc. are the skills most demanded by employers.

In [None]:
words = []
counts = []
for word, count in counter.most_common(10):
    words.append(word)
    counts.append(count)

In [None]:
import matplotlib.cm as cm
from matplotlib import rcParams
colors = cm.rainbow(np.linspace(0, 1, 10))
rcParams['figure.figsize'] = 20, 10

plt.title('Top words in the Job description vs their count')
plt.xlabel('Count')
plt.ylabel('Words')
plt.barh(words, counts, color=colors)

So, here we can remove the words that do not convey any information related to skills.
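One way to do that filtering is to drop a hand-picked set of generic words from the counter before ranking. The sketch below uses made-up counts, and the `generic_words` set is an assumption chosen for illustration rather than a list taken from the notebook.

```python
from collections import Counter

# Hypothetical word counts standing in for the notebook's `counter`
counter = Counter({"work": 50, "city": 40, "management": 30, "new": 25, "design": 20})

# Words judged uninformative about skills (an illustrative choice)
generic_words = {"work", "city", "new"}

# Rebuild the counter without the generic words
skill_counter = Counter({w: c for w, c in counter.items() if w not in generic_words})
print(skill_counter.most_common(2))
```

Keeping the exclusion list explicit makes it easy to review which words were treated as noise.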

In [None]:
df['Posting Date'] = pd.to_datetime(df['Posting Date'])

In [None]:
df['Process Date'] = pd.to_datetime(df['Process Date'])

As there is no column for years of experience, we assume that the process date is the date on which the employer published either the latest or a new posting, and we use the gap between it and the posting date as a proxy.

In [None]:
df['years of exprience'] = df['Process Date'] - df['Posting Date']

In [None]:
df['years of exprience'] = df['years of exprience'].dt.days  # stored as a number of days
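Subtracting two datetime columns yields a Timedelta, and `.dt.days` turns it into whole days. If a value in years is preferred, the days can be divided by an approximate year length. Here is a small sketch with made-up dates; the column names `days_open` and `years_open` are hypothetical.

```python
import pandas as pd

# Hypothetical posting and process dates
toy = pd.DataFrame({
    "Posting Date": pd.to_datetime(["2019-01-01", "2018-06-15"]),
    "Process Date": pd.to_datetime(["2019-12-17", "2019-12-17"]),
})

elapsed = toy["Process Date"] - toy["Posting Date"]  # one Timedelta per row
toy["days_open"] = elapsed.dt.days                   # whole days
toy["years_open"] = toy["days_open"] / 365.25        # approximate years
print(toy[["days_open", "years_open"]])
```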

In [None]:
df_cluster = df[['cat','Salary Range To','years of exprience']]

In [None]:
df_cluster.isna().sum()

In [None]:
df_cluster['cat'].value_counts()

In [None]:
df_cluster = df_cluster.fillna({'cat': 'Others'})

In [None]:
df_cluster=df_cluster.replace(r'\*','',regex=True)

In [None]:
df_cluster

We are creating a new dataframe with the job category, the maximum salary for the respective role, and the years of experience. We take the maximum salary instead of the mean salary in order to single out the jobs that demand niche skills and pay higher salaries.

In [None]:
#Calculating the Hopkins statistic
from sklearn.neighbors import NearestNeighbors
from random import sample
from numpy.random import uniform
import numpy as np
from math import isnan
 
def hopkins(X):
    d = X.shape[1]
    #d = len(vars) # columns
    n = len(X) # rows
    m = int(0.1 * n) 
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)
 
    rand_X = sample(range(0, n, 1), m)
 
    ujd = []
    wjd = []
    for j in range(0, m):
        # A uniform random point is not in X, so its nearest neighbour is the first result
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X,axis=0),np.amax(X,axis=0),d).reshape(1, -1), 1, return_distance=True)
        ujd.append(u_dist[0][0])
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])
 
    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0
 
    return H

In [None]:
#Let's check the Hopkins measure
hopkin_df = df_cluster
hopkins(hopkin_df.drop(['cat'],axis=1))

A Hopkins score of 0.99 is very close to 1, so the data shows a strong clustering tendency and is well suited for clustering. The preliminary check is now done.
Standardisation is optional here, but we apply it so that salary and experience are on comparable scales for K-means.
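For reference, the quantity returned by `hopkins` above can be written as follows. Here $u_j$ is the distance recorded in `ujd` for the $j$-th uniformly sampled point, $w_j$ is the distance recorded in `wjd` for the $j$-th sampled observation, and $m$ is the number of samples (10% of the rows). This matches `sum(ujd) / (sum(ujd) + sum(wjd))` in the code; some references raise the distances to the power $d$, which this implementation does not.

$$
H = \frac{\sum_{j=1}^{m} u_j}{\sum_{j=1}^{m} u_j + \sum_{j=1}^{m} w_j}
$$

Under this convention, values near 0.5 suggest uniformly scattered data, while values near 1 suggest a clustering tendency.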

In [None]:
df_cluster_std = df_cluster
X_C = df_cluster_std.drop(['cat'],axis=1)
df_cluster_std = StandardScaler().fit_transform(X_C)
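The effect of `StandardScaler` can be checked on a toy frame. The numbers below are made up for illustration. After scaling, each column has mean approximately 0 and standard deviation approximately 1, so neither feature dominates the distance computation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values on very different scales
toy = pd.DataFrame({
    "Salary Range To": [40000.0, 85000.0, 120000.0, 230000.0],
    "years of exprience": [10.0, 200.0, 35.0, 400.0],
})

scaled = StandardScaler().fit_transform(toy)  # returns a NumPy array

print(scaled.mean(axis=0).round(6))  # per-column means after scaling
print(scaled.std(axis=0).round(6))   # per-column standard deviations after scaling
```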

In [None]:
df_cluster

# K-means Clustering

In [None]:
#Let's check the silhouette score first to identify the ideal number of clusters
# To perform KMeans clustering 
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
sse_ = []
for k in range(2, 10):
    kmeans = KMeans(n_clusters=k).fit(df_cluster_std)
    sse_.append([k, silhouette_score(df_cluster_std, kmeans.labels_)])

In [None]:
plt.plot(pd.DataFrame(sse_)[0], pd.DataFrame(sse_)[1]);

The silhouette score peaks at around 4 clusters, indicating that this might be the ideal number of clusters.
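Rather than reading the peak off the plot, the best-scoring k can be picked directly. The sketch below uses synthetic data with four well-separated groups. The data, `n_init=10`, and `random_state=0` are assumptions for reproducibility and are not settings from the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (illustrative only)
centers = [[0, 0], [10, 10], [-10, 10], [10, -10]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest silhouette score
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Fixing `random_state` matters because K-means starts from random centroids, so scores can otherwise differ slightly between runs.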

In [None]:
#The silhouette score peaks at around 4 clusters, indicating that it might be the ideal number of clusters.
#Let's use the elbow curve method to identify the ideal number of clusters.
ssd = []
for num_clusters in list(range(1,10)):
    model_clus = KMeans(n_clusters = num_clusters, max_iter=50)
    model_clus.fit(df_cluster_std)
    ssd.append(model_clus.inertia_)

# Plot against the actual k values so the x-axis reads as the number of clusters
plt.plot(range(1, 10), ssd)

An elbow forms somewhere between 2 and 5 clusters. Let's finally create the clusters and see for ourselves which ones fare better.
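When the elbow is hard to pin down by eye, the relative drop in inertia from one k to the next gives a rough numeric guide. Here is a minimal sketch on synthetic data with four well-separated groups. The data and the `relative_drop` heuristic are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-separated groups (illustrative only)
centers = [[0, 0], [10, 10], [-10, 10], [10, -10]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=0)

ks = list(range(1, 10))
inertia = pd.Series(
    [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks],
    index=ks,
    name="inertia",
)

# Fraction of inertia removed when moving from k-1 to k clusters
relative_drop = (-inertia.diff() / inertia.shift()).dropna()
print(relative_drop.round(3))
```

A large drop followed by much smaller ones points to the elbow. Indexing the series by k also avoids the off-by-one that arises when inertia values are plotted against their list positions.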

In [None]:

#K-means with k=4 clusters
model_clus4 = KMeans(n_clusters = 4, max_iter=50)
model_clus4.fit(df_cluster_std)

In [None]:
dat4=df_cluster
dat4.index = pd.RangeIndex(len(dat4.index))
dat_km = pd.concat([dat4, pd.Series(model_clus4.labels_)], axis=1)
dat_km.columns = ['cat','salary_max','exp','ClusterID']
dat_km

In [None]:
dat_km['ClusterID'].value_counts()

In [None]:
dat_km

In [None]:
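#One thing we noticed is that distinct clusters are formed, with cluster 1 holding more data points than the others
#Now let's create the cluster means with respect to the various variables mentioned in the question, then plot them to see how they are related
#Note: 'cat' is not unique in either frame, so this merge pairs every df row with every dat_km row of the same category
df_final=pd.merge(df,dat_km,on='cat')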
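Merging on a key that repeats in both frames multiplies rows within each key value. The toy sketch below shows the effect alongside a row-aligned alternative that attaches the labels by position. It assumes both frames keep the same row order, which holds only if no rows are dropped or reordered in between. The frames `postings` and `clusters` are hypothetical stand-ins for `df` and `dat_km`.

```python
import pandas as pd

# Toy stand-ins for `df` and `dat_km`; both share the same row order
postings = pd.DataFrame({
    "cat": ["Engineering", "Engineering", "Health"],
    "Salary Range To": [90000, 120000, 70000],
})
clusters = pd.DataFrame({
    "cat": ["Engineering", "Engineering", "Health"],
    "ClusterID": [1, 1, 0],
})

# Merging on a non-unique key pairs every posting with every cluster row of that category
merged = pd.merge(postings, clusters, on="cat")

# Attaching labels by row position keeps exactly one row per posting
aligned = postings.assign(ClusterID=clusters["ClusterID"].to_numpy())

print(len(postings), len(merged), len(aligned))
```

Counts taken from a many-to-many merge reflect the row multiplication as well as the underlying number of postings, so they are best read as relative rather than absolute.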

In [None]:
df_final

In [None]:
df_final.info()

In [None]:
#Along Job category and years of experience
sns.scatterplot(x='cat',y='exp',hue='ClusterID',data=df_final)

In [None]:
#Along maximum salary and years of experience, coloured by Job category
sns_plot = sns.scatterplot(x='Salary Range To',y='exp',hue='cat',data=df_final)

The above plot shows how salary ranges differ by job category (cat) and years of experience (exp).


In [None]:
fig = sns_plot.get_figure()
fig.savefig("output.png")

Because there are so many job categories, the x-axis labels in the graph are not legible, but the breakdown below gives a clearer picture.

In [None]:
#let's take a look at those Job category clusters and try to make sense if the clustering process worked well.
df_final_on_jobcat = df_final[df_final['ClusterID']==1]

In [None]:
df_final_on_jobcat['cat'].value_counts()

# Conclusion

It can be concluded from the above analysis that the following job categories have the highest demand as well as higher salaries with respect to niche skills:

- Engineering: 51425
- Architecture: 50325
- Planning: 24625

In contrast, the last few job categories have very few openings:


- Health Policy: 9
- Planning Building Operations: 8
- Health Building Operations: 6
- Health Public Safety: 6
- Community Programs Policy: 6
- Innovation Policy: 4
- Human Resources Technology: 4
- Human Resources Communications: 4
- Human Resources Constituent Services: 4
- Human Resources Health Public Safety: 1

It is evident from the clustering, as well as from the merged data with cluster information, that cluster 1 contains the postings with more openings, higher demand, and higher salaries.