# Beat the ATS - Project Description

###### According to Jobscan, 99% of Fortune 500 companies use an Applicant Tracking System (ATS) as their recruitment strategy.
###### All ATS' store candidates' information by using keywords, so it's easy to parse and filter them.
###### If applicant's resume does not meet certain criteria, it is either flagged or auto-rejected.
###### "Beat the ATS" Project aims to analyse the most popular tools and skills (keywords) for Data Analytics field and check if there is a relationship between those skills and earnings.

## Research Questions:
<b>All research questions are based on years 2020-2021 in the United States of America</b>
<br><br>
###### 1. What were the most popular technologies employers sought in Data Analytics field in the period 2020-2021? (Frequency analysis)
###### 2. What were the most popular tools employers sought? (Frequency analysis)
###### 3. What were the most popular soft skills? (Frequency analysis)
###### 4. Is there a relationship between education level and earnings? If yes, what is the relationship? (Regression analysis)
###### 5. Is there a relationship between years of experience and earnings? If yes, what is the relationship? (Regression analysis)
<br><br>
### Hypotheses:
###### 1. There are certain technologies that are more sought after than others.
###### 2. There are certain tools that are more sought after than others.
###### 3. There are certain soft skills that are sought after by employers.
###### 4. There is a correlation between education level and earnings.
###### 5. There is a correlation between experience level (measured in years) and earnings.


In [59]:
# Import Dependencies
import string
import pandas as pd
from pandas import DataFrame
import numpy as np
from scipy import stats
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

In [60]:
# Declare variables and import the data
job_descriptions = pd.read_csv("job_descriptions.csv")
salary_education_experience = pd.read_csv("salary_education_experience.csv")

In [61]:
# Checking if the data was loaded correctly
job_descriptions.head(3)

Unnamed: 0,Job Title,Salary Estimate,Job Description,Company Name,Location
0,"Data Analyst, Center on Immigration and Justic...",37-66,Are you eager to roll up your sleeves and harn...,Vera Institute of Justice\n3.2,"New York, NY"
1,Quality Data Analyst,37-66,Overview\n\nProvides analytical and technical ...,Visiting Nurse Service of New York\n3.8,"New York, NY"
2,"Senior Data Analyst, Insights & Analytics Team...",37-66,We’re looking for a Senior Data Analyst who ha...,Squarespace\n3.4,"New York, NY"


In [62]:
salary_education_experience.head(3)

Unnamed: 0,Year,Company,Job Title,Annual Salary,Location,Years of Experience,Gender,Masters Degree,Bachelors Degree,Doctorate Degree,High School,Some College,Education
0,2020,PwC,Business Analyst,115000,"Los Angeles, CA",5,Female,1,0,0,0,0,Master's Degree
1,2020,Fractal Analytics,Data Scientist,85000,"Bangalore, KA, India",4,Male,0,1,0,0,0,Bachelor's Degree
2,2020,Microsoft,Data Scientist,156000,"Seattle, WA",2,Male,0,0,1,0,0,PhD


In [63]:
# Clean the data
# Drop N/A where relevant (not relevant in the context of education!!!)
# Words spelled differently, aggregate words for the same meaning
# Address encoding problems: convert all job descriptions to UTF-8 using unicode_escape, pandas ignore encoding)

In [64]:
# Searching for keywords:
# Define words matrix
# Stop words removal
# Count the frequency using CountVectorizer
# Produce counts
def words_matrix(words, vectorizer):
    matrix = vectorizer.fit_transform(words)
    return DataFrame(matrix.toarray(),
                     columns=vectorizer.get_feature_names_out()
                     )

In [65]:
# Assigning text to a variable and changing type to string, so it's possible to remove stop words
text = job_descriptions['Job Description'].to_string()

In [66]:
# Declaring and printing stop words - using print function, so they are printed as a list, not as an array
stop_words = stopwords.words('english')

In [83]:
keywords = [word for word in text.split() if word[0] == 'a' or word[0] == 'b' or word[0] == 'c' or word[0] == 'd' or word[0] == 'e' or word[0] == 'f' or word[0] == 'g' or word[0] == 'h' or word[0] == 'i' or word[0] == 'j' or word[0] == 'k' or word[0] == 'l' or word[0] == 'm' or word[0] == 'n' or word[0] == 'o' or word[0] == 'p' or word[0] == 'q' or word[0] == 'r' or word[0] == 's' or word[0] == 't' or word[0] == 'u' or word[0] == 'v' or word[0] == 'w' or word[0] == 'y' or word[0] == 'z' and len(word) >=3 and word not in stop_words]
cleaned_keywords = ' '.join(keywords)
keywords = cleaned_keywords.split()
# print(keywords)

In [84]:
vec = CountVectorizer()
frequency = words_matrix(keywords, vec)
frequency.transpose()
frequency_count = frequency.sum()
frequency_count

8d             1
abi            1
ability        2
about         10
absolutely     2
              ..
you           74
young          1
your          23
yrs            1
ze             1
Length: 1271, dtype: int64

In [86]:
# Create a visualisation for keywords

In [87]:
# Calculate Measures of Central Tendency - mean, median, mode for education
# example_array = np.array([24, 16, 12, 10, 12, 28, 38, 12, 28, 24])
# example_mode = stats.mode(example_array)
# If there are multiple modes, the stats.mode() function will always return the smallest mode in the dataset.

In [71]:
# Create a visualisation for education

In [72]:
# Calculate Measures of Central Tendency - mean, median, mode for experience

In [73]:
# Create a visualisation for experience

In [74]:
# Calculate Measures of Central Tendency - mean, median, mode for earnings

In [75]:
# Create a visualisation for earnings

In [76]:
# Perform regression analysis for education and earnings

In [77]:
# Create a visualisation for regression analysis: education vs earnings

In [78]:
# Perform regression analysis for experience and earnings

In [79]:
# Create a visualisation for regression analysis: experience vs earnings

In [80]:
# We can define the Student t-test as a method that tells us how significant the differences can be between different groups.
# A Student t-test is defined as a statistic and this is used to compare the means of two different populations.
# t-test: stats.ttest_1samp(X, mean)
# Independent t-test: stats.ttest_ind(X,Y)

In [81]:
# Create a visualisation for t-test