# Story of New York Jobs

![New York Jobs](https://kuwaitjobvacancy.com/wp-content/uploads/2017/07/New-York-JOBS.png)

1. [Introduction](#Introduction)
2. [Loading Packages and Data](#Loading)
3. [Data Structure and Content](#DSC)
4. [Data Analysis](#DataAnalysis)

<a id="Introduction"></a>
## Introduction

This dataset contains the current job postings available on the City of New York's official jobs site.

<a id="Loading"></a>
## Loading Packages and Data

In [None]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data plotting
import seaborn as sns # statistical plotting
import operator
import math
from wordcloud import WordCloud, STOPWORDS
from nltk import pos_tag, sent_tokenize, word_tokenize, BigramAssocMeasures,\
    BigramCollocationFinder, TrigramAssocMeasures, TrigramCollocationFinder
from nltk import WordNetLemmatizer
from nltk.corpus import stopwords
import string
%matplotlib inline

In [None]:
df = pd.read_csv("../input/nyc-jobs.csv",index_col='Posting Date', parse_dates=['Posting Date'])

<a id="DSC"></a>
## Data Structure and Content

Let's first check the dataset structure: the number of rows and features it contains.

In [None]:
df.shape

So there are 3096 rows and 27 columns (features).

Let's check which features we have.

In [None]:
list(df.columns)

Let's check whether the dataset contains any null values.

In [None]:
df.isnull().values.any()

Yes, the dataset contains nulls, but we don't know where yet. Let's check column by column.

In [None]:
for column in df.columns:
    if df[column].isnull().any():
        print('{0} has {1} null values'.format(column, df[column].isnull().sum()))

We can see that Hours/Shift, Work Location, and Post Until contain a lot of null values, and Recruitment Contact has no values at all. Job Category contains only 2 null values, which we will try to impute, and the other features contain relatively few nulls.
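
The same per-column count can be produced without an explicit loop. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset); it also reports the share of missing rows, which makes columns like Recruitment Contact stand out at a glance.

```python
import pandas as pd

# Toy frame standing in for the job postings (values are made up).
toy = pd.DataFrame({
    "Job Category": ["Legal", None, "Health"],
    "Post Until": [None, None, "2019-06-30"],
    "Agency": ["A", "B", "C"],
})

null_counts = toy.isnull().sum()              # nulls per column
null_counts = null_counts[null_counts > 0]    # keep only columns that have nulls
null_counts = null_counts.sort_values(ascending=False)
null_share = (null_counts / len(toy)).round(2)  # fraction of rows missing

print(pd.DataFrame({"nulls": null_counts, "share": null_share}))
```

On the real data, replacing `toy` with `df` should give the same numbers as the loop above, sorted from most to fewest nulls.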

Let's look at the rows where Job Category is null.

In [None]:
df[df['Job Category'].isnull().values]

We can look for a pattern in other columns, such as Agency and Civil Service Title together with Level, to find the right value for the nulls.

In [None]:
df[(df['Business Title']=="Account Manager")]

Business Title does not help here: only one row has this title, and it is the row with NaN as Job Category. Let's try some other columns.

In [None]:
df[(df['Civil Service Title']=="CONTRACT REVIEWER (OFFICE OF L")]

Now we can impute the first row as "*Constituent Services & Community Programs*", since the Civil Service Title, title number, and other columns all point to that value.

Now let's look at the other one.

In [None]:
df[(df['Civil Service Title']=="ADMINISTRATIVE BUSINESS PROMOT") & (df['Level'] == 'M3')]

Here we get three candidate values: "Constituent Services & Community Programs", "Public Safety, Inspections, & Enforcement", and "Administration & Human Resources Communication". Taking Agency into account, we can go with the first option for the time being.

So, let's impute "Constituent Services & Community Programs" in Job Category.

In [None]:
df["Job Category"] = df["Job Category"].fillna("Constituent Services & Community Programs")
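
The blanket `fillna` works here because both missing rows end up with the same category. When that is not the case, a more targeted option is to fill each missing value with the most common category among rows that share the same Civil Service Title. The sketch below shows the idea on made-up data; the helper `most_common` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy data (made up): two titles, each with one missing category.
toy = pd.DataFrame({
    "Civil Service Title": ["CLERK", "CLERK", "CLERK", "ENGINEER", "ENGINEER"],
    "Job Category": ["Admin", "Admin", None, "Engineering", None],
})

def most_common(series):
    """Return the most frequent non-null value, or None if there is none."""
    modes = series.dropna().mode()
    return modes.iloc[0] if not modes.empty else None

# Most common category per title, broadcast back to every row of that title.
fallback = toy.groupby("Civil Service Title")["Job Category"].transform(most_common)

# Only the missing entries are replaced; existing values are left untouched.
toy["Job Category"] = toy["Job Category"].fillna(fallback)
print(toy)
```

This automates the manual lookup done above, but it ignores Agency and Level, so the result should still be checked by eye for a handful of rows.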

<a id="DataAnalysis"></a>
## Data Analysis

Now that we have seen the structure and content of the data, let's find some insights in it.

### Number of jobs posted over the years

In [None]:
df[["Job ID"]].resample('M').count().plot(figsize=(20,10), linewidth=3, fontsize=20)
plt.xlabel('Year', fontsize=20)

Hmm... the number of job postings grows sharply after 2016. That is good for New York, but has the number of open positions grown along with the postings? Let's check.

In [None]:
df[["# Of Positions"]].resample('M').sum().plot(figsize=(20,10), linewidth=3, fontsize=20)
plt.xlabel('Year', fontsize=20)

Yes, the number of positions also grows with the postings. There is a small glitch in 2019, but otherwise it looks good.
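
The two charts answer slightly different questions: counting `Job ID` gives postings per month, while summing `# Of Positions` gives openings per month. A small sketch on made-up dates shows both side by side, plus their ratio. It groups by calendar month with `to_period("M")` instead of `resample`, which is just one alternative way to get monthly buckets.

```python
import pandas as pd

# Toy postings (made up), indexed by posting date like the notebook's df.
toy = pd.DataFrame(
    {"Job ID": [1, 2, 3, 4], "# Of Positions": [1, 5, 2, 1]},
    index=pd.to_datetime(["2019-01-03", "2019-01-20", "2019-02-11", "2019-02-12"]),
)

month = toy.index.to_period("M")  # calendar month of each posting

monthly = pd.DataFrame({
    "postings": toy.groupby(month)["Job ID"].count(),
    "positions": toy.groupby(month)["# Of Positions"].sum(),
})
# Average number of openings behind each posting in that month.
monthly["positions_per_posting"] = monthly["positions"] / monthly["postings"]
print(monthly)
```

A rising ratio would mean each posting covers more openings on average, which the two separate line charts do not show directly.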

Now let's check which categories have picked up over the years.

In [None]:
df['year'] = pd.DatetimeIndex(df.index).year
def parse_categories(index, x):
    # Split "A, B & C" style strings into individual category names.
    l = x.replace('&', ',').split(',')
    l = [c.strip() for c in l]
    # Create the per-year dict on first use so the first category is counted too.
    year_counts = key_categories.setdefault(index, {})
    for category in l:
        year_counts[category] = year_counts.get(category, 0) + 1

key_categories = {}
for index, rows in df.iterrows():
    if type(rows['Job Category']) is str:
        parse_categories(rows['year'],rows['Job Category'])

for index,item in key_categories.items():
    if '' in item:
        item.pop('', None)

sorted_x = {}
for index,item in key_categories.items():
    sorted_x[index] = (sorted(item.items(), key=operator.itemgetter(1)))[-5:]
    
tempList = []
for index,y in sorted_x.items():
    for i in range(len(y)):
        temp={}
        temp['year'] = index
        temp['Category'] = y[i][0]
        temp['Count'] = y[i][1]
        tempList.append(temp)

df2 = pd.DataFrame(tempList)

In [None]:
sns.set_style('darkgrid')
sns.set_context("talk")
fig, ax = plt.subplots()
fig.set_size_inches(15, 9)
ax = sns.lineplot(x="year", y="Count", hue="Category",data=df2)
sns.despine()

So Engineering and Architecture is at the top in 2019, and Planning and inspection also looks like a good pick. IT and Telecommunication fades somewhere after 2015.
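
The per-year category counts can also be computed with pandas alone. The sketch below uses made-up rows and mirrors the splitting rule used above (treat `&` like a comma, split, strip, drop empty pieces); it is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

# Toy rows (made up) with the same "A, B & C" style category strings.
toy = pd.DataFrame({
    "year": [2018, 2018, 2019, 2019],
    "Job Category": [
        "Health, Policy & Analysis",
        "Health",
        "Engineering, Architecture, & Planning",
        "Engineering",
    ],
})

# One row per (posting, category) pair.
parts = toy.assign(
    Category=toy["Job Category"].str.replace("&", ",", regex=False).str.split(",")
).explode("Category")
parts["Category"] = parts["Category"].str.strip()
parts = parts[parts["Category"] != ""]  # ", &" leaves empty pieces behind

counts = (
    parts.groupby(["year", "Category"]).size()
    .reset_index(name="Count")
    .sort_values(["year", "Count"], ascending=[True, False])
)
top5 = counts.groupby("year").head(5)  # up to five largest categories per year
print(top5)
```

The resulting frame has the same `year`, `Category`, and `Count` columns as `df2`, so it could be passed to the same `sns.lineplot` call.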

Now let's check the salary growth for the top Civil Service Titles.

In [None]:
title = ["COMMUNITY COORDINATOR","CIVIL ENGINEER","AGENCY ATTORNEY","CITY RESEARCH SCIENTIST","ADMINISTRATIVE PROJECT MANAGER","CLERICAL ASSOCIATE"]
df4 = df[(df['Full-Time/Part-Time indicator']=='F') & df["Civil Service Title"].isin(title)][["Civil Service Title","Level","Salary Range From"]]
df4 = df4.reset_index()
fig, ax = plt.subplots(nrows=3,ncols=2,figsize=(15,15))
k=0
for i in range(3):
    for j in range(2):
        levelList = list(df4[(df4['Civil Service Title']==title[k])]["Level"].unique())
        for level in levelList:
            tempDF = df4[(df4['Civil Service Title']==title[k]) & (df4.Level == level)]
            ax[i,j].plot(tempDF['Posting Date'], tempDF["Salary Range From"])
            ax[i,j].title.set_text(title[k])
            ax[i,j].legend(levelList,loc="upper left")
        k+=1
plt.gcf().autofmt_xdate()

Let's go through the graphs one by one:

**Community Coordinator** - Only one level. The salary stays decent for most years, up to 2019.

**Civil Engineer** - Three levels, with a clearly distinguishable difference between them. All of them get a raise in mid-2018.

**Agency Attorney** - Four levels, with a decent difference in salary between them. Every level is smooth over time except level four, which dips in 2019.

**City Research Scientist** - Four levels, again easy to tell apart, although 4B may have too few postings to be visible on the chart. Level one jobs first appear in 2018.

**Administrative Project Manager** - Five levels, and they are all over the chart :D. M3 and M4 openings seem to have appeared only recently. Level M3 paid more than M4 in 2019, so one might guess that M4 is due for a boost.

**Clerical Associate** - Four levels, with a decent salary compared to the other professions.


### Highest and lowest average posting durations

In [None]:
df5 = df[["Civil Service Title","Post Until"]]
df5 = df5.dropna()
df5 = df5.reset_index()
df5['Posting Date'] = df5['Posting Date'].astype('datetime64[ns]')
df5['Post Until'] = df5['Post Until'].astype('datetime64[ns]')
df5["Posted_Days"] = (df5['Post Until'] - df5['Posting Date']).dt.days
df5 = round(df5.groupby(['Civil Service Title'], as_index=False)['Posted_Days'].mean(),2)
df5 = df5.sort_values(["Posted_Days"],ascending=False)

In [None]:
plt.figure(figsize=(10,10))
sns.barplot(y=df5["Civil Service Title"][:10],x=df5.Posted_Days[:10])

In [None]:
plt.figure(figsize=(10,10))
sns.barplot(y=df5["Civil Service Title"][-10:],x=df5.Posted_Days[-10:])

These are the average numbers of days a job stays posted on the portal. It looks like Correctional Standard Review has been posted for a very long time compared to the others.

Child Welfare Specialist, on the other hand, is posted for seven days on average, which suggests candidates for it are easy to find.
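
An average taken over one or two postings can be misleading, so it may help to look at the number of postings behind each mean. The following sketch, on made-up dates, computes the posted duration the same way as above and reports mean, median, and count per title.

```python
import pandas as pd

# Toy postings (made up); the last one has no "Post Until" date.
toy = pd.DataFrame({
    "Civil Service Title": ["CLERK", "CLERK", "ENGINEER"],
    "Posting Date": ["2019-01-01", "2019-02-01", "2019-03-01"],
    "Post Until": ["2019-01-08", "2019-02-15", None],
})

posted = pd.to_datetime(toy["Posting Date"])
until = pd.to_datetime(toy["Post Until"])      # missing values become NaT
toy["Posted_Days"] = (until - posted).dt.days  # NaN where "Post Until" is missing

summary = (
    toy.dropna(subset=["Posted_Days"])
    .groupby("Civil Service Title")["Posted_Days"]
    .agg(["mean", "median", "count"])
)
print(summary)
```

Titles with a very small `count` are the ones where the top and bottom of the ranking should be read with caution.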


Now let's check which jobs or categories have a residency requirement.

In [None]:
# Flag postings whose text says residency is not required; missing text counts as no match.
no_residency = df['Residency Requirement'].str.contains(
    "not required|no residency requirement", na=False)
df.loc[no_residency, 'Residency_Required'] = 'No'
df["Residency_Required"] = df["Residency_Required"].fillna("Yes")
df7 = df[["Civil Service Title","Residency_Required"]]

In [None]:
((pd.crosstab(index=df7["Civil Service Title"], columns=df7["Residency_Required"]))).sort_values(['Yes', 'No'], ascending=[False, True])[:10].T.plot.bar(figsize=(13,10))

Civil service titles that mostly do not require residency:

In [None]:
((pd.crosstab(index=df7["Civil Service Title"], columns=df7["Residency_Required"]))).sort_values(['Yes', 'No'], ascending=[False, True])[-10:].T.plot.bar(figsize=(13,10))
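
The Yes/No flag depends entirely on a few phrases in free text, so it is worth testing the matching rule on a handful of examples. The sketch below uses made-up sentences and builds the flag in one step with `np.where`. Matching case-insensitively (`case=False`) is an assumption added here, and rows with missing text fall back to "Yes" as in the cell above.

```python
import numpy as np
import pandas as pd

# Made-up requirement texts, including a missing one.
toy = pd.Series([
    "New York City residency is generally required within 90 days.",
    "New York City Residency is not required for this position.",
    "There is no residency requirement for this title.",
    None,
], name="Residency Requirement")

# Phrases that signal an exemption; everything else counts as required.
pattern = r"not required|no residency requirement"
exempt = toy.str.contains(pattern, case=False, na=False)

flag = pd.Series(np.where(exempt, "No", "Yes"), index=toy.index)
print(flag.value_counts())
```

Texts that phrase the exemption differently would still be flagged "Yes", so the counts in the charts are best read as an upper bound on residency requirements.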

**Most preferred skills for Jobs**

In [None]:
def get_bitrigrams(full_text, threshold=30):
    if isinstance(full_text, str):
        text = full_text
    else:
        text = " ".join(full_text)
    bigram_measures = BigramAssocMeasures()
    trigram_measures = TrigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(text.split())
    finder.apply_freq_filter(3)
    bigrams = {" ".join(words): "_".join(words)
               for words in finder.above_score(bigram_measures.likelihood_ratio, threshold)}
    finder = TrigramCollocationFinder.from_words(text.split())
    finder.apply_freq_filter(3)
    trigrams = {" ".join(words): "_".join(words)
                for words in finder.above_score(trigram_measures.likelihood_ratio, threshold)}
    return bigrams, trigrams

def process_text(text, lemmatizer, translate_table, stopwords):
    processed_text = ""
    for sentence in sent_tokenize(text):
        tagged_sentence = pos_tag(word_tokenize(sentence.translate(translate_table)))
        for word, tag in tagged_sentence:
            word = word.lower()
            if word not in stopwords:
                if tag[0] != 'V':
                    processed_text += lemmatizer.lemmatize(word) + " "
    return processed_text

wordnet_lemmatizer = WordNetLemmatizer()
newStopWords = ["new","skill","york","city","new york","new york city"]
#stopwords = stopwords.extend(newStopWords)
stop = set(stopwords.words('english'))
stop1 = stop.union(newStopWords)
translate_table = dict((ord(char), " ") for char in string.punctuation)

def use_ngrams_only(texts, lemmatizer, translate_table, stopwords):
    processed_texts = []
    for index, doc in enumerate(texts):
        if type(doc) is str:
            processed_texts.append(process_text(doc, lemmatizer, translate_table, stopwords))
    bigrams, trigrams = get_bitrigrams(processed_texts)
    indexed_texts = []
    for doc in processed_texts:
        current_doc = []
        for k, v in trigrams.items():
            c = doc.count(k)
            if c > 0:
                current_doc += [v] * c
                doc = doc.replace(k, v)
        for k, v in bigrams.items():
            current_doc += [v] * doc.count(" " + k + " ")
        indexed_texts.append(" ".join(current_doc))
    return " ".join(indexed_texts)

In [None]:
wordcloud = WordCloud(stopwords=stop1, background_color="white").\
    generate(use_ngrams_only(df['Preferred Skills'], wordnet_lemmatizer, translate_table, stop))
plt.figure(figsize=(8, 5))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

So verbal and communication skills are the most important, but this is aggregated across all job categories. We would have to look at a specific job category to find the skills that actually matter for it.
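
To see what the n-gram step contributes, here is a deliberately simplified sketch that counts adjacent word pairs with the standard library only. It uses raw frequency instead of the likelihood-ratio score from NLTK's collocation finders, and the sentences and the one-word stop list are made up for illustration.

```python
from collections import Counter

# Made-up "preferred skills" snippets.
docs = [
    "excellent written and verbal communication skills",
    "strong verbal communication skills and project management",
    "project management experience preferred",
]
stop_words = {"and"}  # tiny illustrative list, not NLTK's English stop words

bigram_counts = Counter()
for doc in docs:
    # Drop stop words first, then pair each remaining token with its successor.
    tokens = [t for t in doc.lower().split() if t not in stop_words]
    bigram_counts.update(zip(tokens, tokens[1:]))

# Keep only pairs seen at least twice, joined back into phrases.
frequent = {" ".join(pair): n for pair, n in bigram_counts.items() if n >= 2}
print(frequent)
```

Phrases such as "verbal communication" survive as single units, which is why the word cloud shows skills as joined terms instead of isolated words.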


**Job Description word cloud**

In [None]:
wordcloud = WordCloud(stopwords=stop1, background_color="white").\
    generate(use_ngrams_only(df['Job Description'], wordnet_lemmatizer, translate_table, stop))
plt.figure(figsize=(8, 5))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

It seems that work on the environment and related areas appears in most of the descriptions.

**Minimum requirement**

In [None]:
wordcloud = WordCloud(stopwords=stop, background_color="white").\
    generate(use_ngrams_only(df['Minimum Qual Requirements'], wordnet_lemmatizer, translate_table, stop))
plt.figure(figsize=(8, 5))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

So a full-time diploma is the minimum requirement for these jobs.

Thanks for reading.

Please upvote if you found this helpful.