# Beat the ATS - Project Description

###### According to Jobscan, 99% of Fortune 500 companies use an Applicant Tracking System (ATS) as part of their recruitment strategy.
###### ATSs store candidates' information using keywords, so that applications are easy to parse and filter.
###### If an applicant's resume does not meet certain criteria, it is either flagged or auto-rejected.
###### The "Beat the ATS" project aims to analyse the most popular tools and skills (keywords) in the Data Analytics field and to check whether there is a relationship between those skills and earnings.

## Research Questions:
<b>All research questions cover the years 2020-2021 in the United States of America</b>
<br><br>
###### 1. What were the most popular technologies employers sought in the Data Analytics field in the period 2020-2021? (Frequency analysis)
###### 2. What were the most popular tools employers sought? (Frequency analysis)
###### 3. What were the most popular soft skills? (Frequency analysis)
###### 4. Is there a relationship between education level and earnings? If yes, what is the relationship? (Regression analysis)
###### 5. Is there a relationship between years of experience and earnings? If yes, what is the relationship? (Regression analysis)
<br><br>
### Hypotheses:
###### 1. There are certain technologies that are more sought after than others.
###### 2. There are certain tools that are more sought after than others.
###### 3. There are certain soft skills that are sought after by employers.
###### 4. There is a correlation between education level and earnings.
###### 5. There is a correlation between experience level (measured in years) and earnings.


In [188]:
#%matplotlib notebook
#%matplotlib inline

In [142]:
# Import Dependencies
import pandas as pd
import numpy as np
import scipy.stats as st
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [244]:
# Declare variables and import the data
# job_descriptions = pd.read_csv("Job descriptions.csv")
# salary_education_experience = pd.read_csv("salary vs education vs experience.csv")
job_desc_df = pd.read_csv("salary_education_experience.csv")

# Finding the relationship between years of experience and earnings

In [270]:
new_df = job_desc_df.rename(columns = {'Annual Salary':'Annual_Salary', 'Years of Experience':'Years_of_Experience'})
new_df

Unnamed: 0,Year,Compant,Job Title,Annual_Salary,Location,Years_of_Experience,Gender,Masters Degree,Bachelors Degree,Doctorate Degree,Highschool,Some College,Education
0,2020,PwC,Business Analyst,115000,"Los Angeles, CA",5,Female,1,0,0,0,0,Master's Degree
1,2020,Fractal Analytics,Data Scientist,85000,"Bangalore, KA, India",4,Male,0,1,0,0,0,Bachelor's Degree
2,2020,Microsoft,Data Scientist,156000,"Seattle, WA",2,Male,0,0,1,0,0,PhD
3,2020,PwC,Business Analyst,25000,"Moscow, MC, Russia",8,Female,1,0,0,0,0,Master's Degree
4,2020,SAP,Business Analyst,41000,"Toronto, ON, Canada",2,Male,0,1,0,0,0,Bachelor's Degree
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1799,2020,Twitter,Business Analyst,89000,"San Francisco, CA",1,Female,0,1,0,0,0,Bachelor's Degree
1800,2020,Accenture,Business Analyst,90000,"Detroit, MI",1,Male,0,1,0,0,0,Bachelor's Degree
1801,2020,Bill.com,Data Scientist,110000,"Houston, TX",0,Male,0,1,0,0,0,Bachelor's Degree
1802,2020,JP Morgan Chase,Data Scientist,132000,"New York, NY",0,Male,1,0,0,0,0,Master's Degree


In [281]:
Average_Annual_Salary = new_df["Annual_Salary"].mean()

In [282]:
Average_Annual_Salary

176344.23503325944

In [283]:
Max_Years_of_Experience = new_df["Years_of_Experience"].max()

In [284]:
Max_Years_of_Experience

45

In [288]:
new_df[new_df.Years_of_Experience == 45].Annual_Salary.value_counts()

155000    1
Name: Annual_Salary, dtype: int64

In [291]:
new_df.groupby(["Location", "Job Title"]).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Year,Compant,Annual_Salary,Years_of_Experience,Gender,Masters Degree,Bachelors Degree,Doctorate Degree,Highschool,Some College,Education
Location,Job Title,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
"Alexandria, VA",Data Scientist,1,1,1,1,1,1,1,1,1,1,0
"Alpharetta, GA",Data Scientist,1,1,1,1,1,1,1,1,1,1,1
"Amsterdam, NH, Netherlands",Business Analyst,3,3,3,3,3,3,3,3,3,3,3
"Amsterdam, NH, Netherlands",Data Scientist,14,14,14,14,14,14,14,14,14,14,12
"Ann Arbor, MI",Data Scientist,1,1,1,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
"Wilton, CT",Business Analyst,1,1,1,1,1,1,1,1,1,1,1
"Woonsocket, RI",Data Scientist,1,1,1,1,1,1,1,1,1,1,1
"Worcester, MA",Data Scientist,1,1,1,1,1,1,1,1,1,1,1
"Zurich, ZH, Switzerland",Business Analyst,3,3,3,3,3,3,3,3,3,3,3


In [275]:
new_df[new_df.Gender == "Female"].Years_of_Experience.value_counts()

2     66
5     59
3     58
4     52
0     32
1     31
6     29
7     27
8     23
10    23
9     12
12     8
20     6
15     5
11     5
13     4
14     2
45     1
35     1
Name: Years_of_Experience, dtype: int64

In [None]:
# Clean the data
# Drop N/A values where relevant (not relevant in the context of education!)
# Aggregate words that are spelled differently but have the same meaning
# Address encoding problems: convert all job descriptions to UTF-8 (e.g. using unicode_escape, or by having pandas ignore encoding errors)
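
The cleaning steps listed above could look roughly like the sketch below. The rows are made-up toy values (not taken from the real CSV), and the alias map is an assumption based on the `Education` column showing "PhD" for rows flagged as `Doctorate Degree`.

```python
import pandas as pd

# Toy rows for illustration only; the real data comes from the project's CSV file.
raw = pd.DataFrame({
    "Annual_Salary": [115000, None, 85000, 156000],
    "Education": ["Master's Degree", "Doctorate Degree", None, "PhD"],
})

# Drop rows only where the salary is missing; a missing Education value is kept,
# because dropping N/A is not relevant in the context of education.
cleaned = raw.dropna(subset=["Annual_Salary"]).copy()

# Aggregate different spellings of the same meaning under a single label.
education_aliases = {"PhD": "Doctorate Degree"}  # assumed mapping
cleaned["Education"] = cleaned["Education"].replace(education_aliases)

print(cleaned)
```

Passing `subset` to `dropna` keeps the education gaps intact while still removing rows that cannot be used for the earnings analysis.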

In [62]:
# Searching for keywords:
# Stop words removal
# Count the frequency using CountVectorizer
# Produce counts
# Define words matrix
def words_matrix(words, vectorizer):
    # Fit the vectorizer on the documents and return the document-term count matrix
    matrix = vectorizer.fit_transform(words)
    return matrix
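
One way the steps in the comments above (stop-word removal, counting, producing totals) might fit together is sketched here. The helper name `keyword_frequencies` and the three sample descriptions are hypothetical; they are not part of the project's data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def keyword_frequencies(documents):
    """Return {word: total count} across all documents, most frequent first."""
    # stop_words="english" uses scikit-learn's built-in English stop-word list
    vectorizer = CountVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(documents)

    # Sum each column (one column per word) over all documents
    totals = np.asarray(matrix.sum(axis=0)).ravel()

    # vocabulary_ maps each word to its column index in the matrix
    counts = {word: int(totals[idx]) for word, idx in vectorizer.vocabulary_.items()}
    return dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical job descriptions, for illustration only
sample_descriptions = [
    "Experience with SQL and Python is required",
    "Python and Tableau are a plus",
    "Strong SQL skills and clear communication",
]
frequencies = keyword_frequencies(sample_descriptions)
print(frequencies)
```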

In [70]:
text = ['Hello, my my my name is Rita and and I am a data scientist.']
text2 = ['This is a vectorizer test']

In [71]:
vec = CountVectorizer()
x = vec.fit_transform(text)
print(x)

  (0, 3)	1
  (0, 5)	3
  (0, 6)	1
  (0, 4)	1
  (0, 7)	1
  (0, 1)	2
  (0, 0)	1
  (0, 2)	1
  (0, 8)	1
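
The printed sparse matrix shows `(row, column) count` triples, where each column index stands for a word in the fitted vocabulary. A small sketch of how those indices can be mapped back to words, reusing the same test sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ['Hello, my my my name is Rita and and I am a data scientist.']

vec = CountVectorizer()
x = vec.fit_transform(text)

# vocabulary_ maps each word to its column index; look up the count for row 0
word_counts = {word: int(x[0, idx]) for word, idx in vec.vocabulary_.items()}

# Print words in column order so the result lines up with the sparse output
for word, idx in sorted(vec.vocabulary_.items(), key=lambda kv: kv[1]):
    print(idx, word, word_counts[word])
```

With the default settings the vectorizer lowercases the text and keeps only tokens of two or more characters, which is why "I" and "a" do not appear in the vocabulary.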


In [None]:
# Create a visualisation for keywords

In [None]:
# Calculate Measures of Central Tendency - mean, median, mode for education
# example_array = np.array([24, 16, 12, 10, 12, 28, 38, 12, 28, 24])
# example_mode = st.mode(example_array)  # scipy.stats is imported as st above
# If there are multiple modes, st.mode() returns only the smallest mode in the dataset.
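
A minimal sketch of the three measures, using the example array from the comment above rather than the project data. `np.atleast_1d` is used only because the shape of the `st.mode` result differs between SciPy versions.

```python
import numpy as np
import scipy.stats as st

example_array = np.array([24, 16, 12, 10, 12, 28, 38, 12, 28, 24])

mean_value = example_array.mean()
median_value = np.median(example_array)

mode_result = st.mode(example_array)
# Older SciPy returns length-1 arrays, newer SciPy returns scalars
mode_value = int(np.atleast_1d(mode_result.mode)[0])
mode_count = int(np.atleast_1d(mode_result.count)[0])

print(mean_value, median_value, mode_value, mode_count)

# Two values tie for most frequent here; only the smaller one is reported
tied = np.array([3, 3, 7, 7, 9])
tied_mode = int(np.atleast_1d(st.mode(tied).mode)[0])
print(tied_mode)
```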

In [None]:
# Create a visualisation for education

In [None]:
# Calculate Measures of Central Tendency - mean, median, mode for experience

In [None]:
# Create a visualisation for experience

In [None]:
# Calculate Measures of Central Tendency - mean, median, mode for earnings

In [None]:
# Create a visualisation for earnings

In [None]:
# Perform regression analysis for education and earnings

In [None]:
# Create a visualisation for regression analysis: education vs earnings

In [None]:
# Perform regression analysis for experience and earnings
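
A simple linear regression of earnings on years of experience could be sketched with `scipy.stats.linregress`, as below. The numbers are made-up toy values chosen to show the mechanics; with the real data the inputs would presumably be the `Years_of_Experience` and `Annual_Salary` columns of `new_df`.

```python
import numpy as np
import scipy.stats as st

# Toy data for illustration only (not from the project's dataset)
years = np.array([0, 1, 2, 3, 4, 5])
salary = np.array([60000, 66000, 69000, 76000, 79000, 86000])

result = st.linregress(years, salary)

# slope: estimated change in salary per extra year of experience
# intercept: estimated salary at zero years of experience
# rvalue: correlation coefficient; pvalue: test of the null hypothesis slope == 0
print(result.slope, result.intercept, result.rvalue, result.pvalue)

# Predicted salary at 3 years of experience under the fitted line
predicted = result.intercept + result.slope * 3
print(predicted)
```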

In [None]:
# Create a visualisation for regression analysis: experience vs earnings

In [None]:
# The Student's t-test is a statistical test of whether the difference between group means is statistically significant.
# It produces a t-statistic (and a p-value) used to compare the means of two populations, or one sample mean against a known value.
# One-sample t-test: st.ttest_1samp(X, mean)
# Independent two-sample t-test: st.ttest_ind(X, Y)
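
Both calls can be tried on small samples, as in the sketch below. The two groups are hypothetical salary values (in thousands) invented for the example; they say nothing about the real dataset.

```python
import numpy as np
import scipy.stats as st

# Hypothetical salaries in thousands, for illustration only
group_a = np.array([50, 52, 54, 56, 58])
group_b = np.array([70, 72, 74, 76, 78])

# One-sample t-test: is the mean of group_a different from 54?
one_sample = st.ttest_1samp(group_a, 54)
print(one_sample.statistic, one_sample.pvalue)

# Independent two-sample t-test: do group_a and group_b have different means?
two_sample = st.ttest_ind(group_a, group_b)
print(two_sample.statistic, two_sample.pvalue)
```

A small p-value (commonly below 0.05) suggests the difference in means is unlikely to be due to chance alone. Note that `ttest_ind` assumes equal variances by default; passing `equal_var=False` runs Welch's t-test instead.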

In [None]:
# Create a visualisation for t-test