## Scraping Job Postings from Techolution

### Link: https://techolution.app.param.ai/jobs/

## <font color='blue'>Steps</font>
1. I used chromedriver to fetch the webpage.
2. Then, I used Beautiful Soup to parse the webpage and get the required job list.
3. After fetching each job's info, I store it in a separate list and then add it to a dataframe.
4. Next, I sort the dataframe by date of posting.
5. Finally, I export the dataframe to a CSV file.

### <font color='blue'> First, import the required libraries </font>

In [1]:
from selenium import webdriver
import os
from bs4 import BeautifulSoup
import pandas as pd
import re
import numpy as np

### <font color='blue'> We initialize empty arrays and an empty dataframe.</font>

In [2]:
job_name = []
date_posted = []
job_info = []
job_loc = []
job_exp = []
job_type= []
job_tln = pd.DataFrame()

<font color='blue'> 
Since the website is dynamic, we cannot fetch its data with a static request.

So, I used **chromedriver**.

Make sure chromedriver is installed and is in the correct path using the code below.

I installed chromedriver in the current working directory itself (shown below).
</font>

In [3]:
DRIVER_PATH = os.path.join(os.getcwd())
print(DRIVER_PATH)

/Volumes/ANAGHA/techolution_assignment


In [4]:
driver_path = ''

if os.name == 'nt':
	driver_path = os.path.join(DRIVER_PATH,'chromedriver.exe')
elif os.name == 'posix':
	driver_path = os.path.join(DRIVER_PATH,'chromedriver')
else:
	driver_path = None
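
The cell above leaves `driver_path` as `None` on an unrecognised platform, which only fails later when the browser is launched. A minimal sketch of an alternative that fails early is shown here; the helper names `chromedriver_filename` and `chromedriver_path` are hypothetical and not part of the original notebook.

```python
import os


def chromedriver_filename(os_name=os.name):
    """Return the chromedriver file name for the given os.name value."""
    names = {"nt": "chromedriver.exe", "posix": "chromedriver"}
    if os_name not in names:
        # Fail early instead of passing None on to the webdriver.
        raise ValueError(f"Unsupported platform: {os_name!r}")
    return names[os_name]


def chromedriver_path(base_dir, os_name=os.name):
    """Join the base directory and the platform-specific file name."""
    return os.path.join(base_dir, chromedriver_filename(os_name))


driver_path = chromedriver_path(os.getcwd())
if not os.path.isfile(driver_path):
    # Only a warning here; the notebook assumes the file is present.
    print(f"chromedriver not found at {driver_path}")
```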

<font color='blue'>
After fetching the webpage using the webdriver's get method, I used Beautiful Soup to parse the HTML.
</font>

In [5]:
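# Selenium 3 style; Selenium 4 passes the driver path through a Service object instead.
driver = webdriver.Chrome(executable_path=driver_path)
url = 'https://techolution.app.param.ai/jobs/'
driver.get(url)
html = driver.page_source
driver.quit()  # close the browser once the HTML has been captured
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, 'html.parser')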


<font color='blue'>
Upon inspection, we see that the div with id joblist holds the information.
So, using the parsed soup object, I find the div tag with the id joblist, which contains the required job information.
</font>

In [6]:
data= soup.find('div', {'id': 'joblist'})

## The h3 tags contain the job names, so I extract them using Beautiful Soup.

<font color='blue'>
I then append all the job names to a list, job_name.
</font>


In [7]:
for jn in data.find_all('h3'):
    temp = jn.get_text()
    job_name.append(temp)


In [31]:
job_name

['Big Data Intern',
 'Senior Cloud Specialist',
 'Cloud Native Developer',
 'Data Scientist Intern',
 'Embedded Engineer',
 'Networking & Security Specialist',
 'System Engineer',
 'Associate QA Engineer',
 'Solution Architect',
 'Sr. Microservices Developer',
 'SOA Consultant',
 'Android Mobile Developer',
 'Associate Cloud Engineer',
 'Sr Full Stack Developer',
 'Sr SAP PI/PO Developer',
 'Blockchain Developer',
 'Junior Cloud Native Developer',
 'Senior DevOps Engineer',
 'Lead DevOps Engineer ',
 'Site Reliability Engineer',
 'OSS DevOps Engineer',
 'Sr SDET',
 'Engineering Lead',
 'Machine Learning Engineer']

### Extracting the date of job posting. 

<font color='blue'>
After extracting, I append the dates of posting to the date_posted list.
</font>

In [8]:
dt = data.find_all("div", {"class": "four wide right aligned computer tablet only column"}) 

for tag in dt:   
    temp = tag.get_text()
    date_posted.append(temp)
    

In [30]:
date_posted

['6 days ago',
 '10 days ago',
 '11 days ago',
 '13 days ago',
 '14 days ago',
 '19 days ago',
 '19 days ago',
 '20 days ago',
 '20 days ago',
 'a month ago',
 'a month ago',
 'a month ago',
 'a month ago',
 'a month ago',
 'a month ago',
 'a month ago',
 '2 months ago',
 '2 months ago',
 '2 months ago',
 '2 months ago',
 '2 months ago',
 '2 months ago',
 '2 months ago',
 '2 months ago']

### Extracting job info, which includes
#### job type: [internship, full time, etc.]
#### job location: [Hyderabad, New York, New Jersey, etc.]
#### experience required: a range of values, e.g. 2-3 years

<font color='blue'>
I get a 2-D NumPy array containing job type, job location and experience required.
</font>

In [9]:
info = data.find_all("div", {"class": "twelve wide computer twelve wide tablet sixteen wide mobile column"}) 
for tag in info:
    # Remove all whitespace (\s already covers \n and \t) before splitting on '·'.
    temp = re.sub(r"\s+", "", tag.find('p').text)
    temp1 = [x.strip() for x in temp.split('·')]
    job_info.append(temp1)
    
job_info = np.asarray(job_info)
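
The regex in the cell above removes every whitespace character, which is why values come out as 'NewJersey' and '0-2Years'. A minimal sketch of a variant that collapses whitespace instead, so spaces inside a field survive, is given here. The helper name `split_job_info` and the sample string are hypothetical; the real text of each job card may be laid out differently.

```python
import re


def split_job_info(raw):
    """Split a job-card line on '·' while keeping spaces inside each field."""
    # Collapse every run of whitespace (newlines, tabs, spaces) to one space.
    collapsed = re.sub(r"\s+", " ", raw)
    # Split on the middle-dot separator and trim each field.
    return [field.strip() for field in collapsed.split("·")]


# Hypothetical raw text, modelled on the messy job-card content.
raw = "\n\t Contract \n · New Jersey\n · 7 - 12 Years \n"
fields = split_job_info(raw)
print(fields)  # ['Contract', 'New Jersey', '7 - 12 Years']
```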

In [11]:
job_info

array([['Internship', 'Hyderabad', '0-2Years'],
       ['Full-time', 'Singapore', '5-10Years'],
       ['Full-time', 'Hyderabad', '2-5Years'],
       ['Internship', 'Hyderabad', '0-4Years'],
       ['Full-time', 'Hyderabad', '2-4Years'],
       ['Full-time', 'Hyderabad', '2-6Years'],
       ['Internship', 'Mauritius', '0-1Years'],
       ['Full-time', 'Hyderabad', '1-3Years'],
       ['Full-time', 'Hyderabad', '9-15Years'],
       ['Full-time', 'Hyderabad', '4-9Years'],
       ['Full-time', 'Hyderabad', '0-1Years'],
       ['Full-time', 'mauritius', '3-8Years'],
       ['Full-time', 'Hyderabad', '0-3Years'],
       ['Full-time', 'Mauritius', '3-8Years'],
       ['Contract', 'NewJersey', '7-12Years'],
       ['Full-time', 'Hyderabad', '1-4Years'],
       ['Full-time', 'Delaware', '1-2Years'],
       ['Full-time', 'Hyderabad', '3-10Years'],
       ['Full-time', 'Hyderabad', '5-11Years'],
       ['Full-time', 'NewYork', '1-3Years'],
       ['Full-time', 'Hyderabad', '6-12Years'],
       [

### We add all the obtained values as columns in the dataframe 'job_tln'

<font color='blue'>
In this step, I add my obtained lists to the job_tln dataframe
</font>

In [23]:
job_tln['job_name'] = job_name
job_tln['date_posted'] = date_posted
job_tln['job_type'] = job_info[:,0]
job_tln['job_loc'] = job_info[:,1]
job_tln['exp'] = job_info[:,2]

In [24]:
#head to see some values.
job_tln.head()

| | job_name | date_posted | job_type | job_loc | exp |
|---|---|---|---|---|---|
| 23 | Big Data Intern | 6 days ago | Internship | Hyderabad | 0-2Years |
| 22 | Senior Cloud Specialist | 10 days ago | Full-time | Singapore | 5-10Years |
| 21 | Cloud Native Developer | 11 days ago | Full-time | Hyderabad | 2-5Years |
| 20 | Data Scientist Intern | 13 days ago | Internship | Hyderabad | 0-4Years |
| 19 | Embedded Engineer | 14 days ago | Full-time | Hyderabad | 2-4Years |


***
***
***
### Sorting by Date of Posting

<font color='blue'>
There was a lot of variation in the way the date of posting was written.

For example: 'a month ago', '2 months ago', '20 days ago'. A sorting algorithm cannot interpret 'a month ago' as '1 month ago'. So, I convert all the values into a number of days.

Note: Though the question was to sort by date of posting, we don't know the exact date of posting.
For example: 'a month ago' could mean approximately 32 or 35 days ago. So, I sorted based on the approximate number of days itself.
If we knew exactly when the job was posted, we could subtract that date from datetime.date.today() to find exactly how many days ago the job was posted, and sort accordingly.

</font>
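
As a rough illustration of the exact-date case described in the note, the sketch below subtracts a posting date from today's date. The site only shows relative strings, so the posting date used here is purely hypothetical, as is the helper name `days_since`.

```python
from datetime import date


def days_since(posted, today=None):
    """Return the number of whole days between the posting date and today."""
    if today is None:
        today = date.today()
    return (today - posted).days


# Hypothetical exact dates, used only to show the arithmetic.
age = days_since(date(2019, 3, 15), today=date(2019, 4, 15))
print(age)  # 31
```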

In [25]:
job_tln['date_posted'] = job_tln['date_posted'].replace('a month ago','30 days ago')
job_tln['date_posted'] = job_tln['date_posted'].replace('2 months ago','60 days ago')

### To sort by date, we split the date_posted to get only the number of days, convert to int and then sort. 

In [26]:
f = lambda x: x['date_posted'].split("days",1)[0] 
job_tln["keys"] = job_tln.apply(f, axis=1).astype(int)
job_tln.sort_values(by='keys', ascending=False, inplace = True)

In [27]:
#drop the keys after sorting the dataframe
job_tln.drop(['keys'], axis=1,inplace = True)
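
The two `replace` calls above only cover the strings that happened to appear on the page. A minimal sketch of a more general conversion is shown here, assuming the site uses phrases of the form '<count> <unit>(s) ago'. The helper `to_days`, the week and year units, and the 30-day month are assumptions rather than anything confirmed by the site.

```python
import re

# Approximate number of days per unit; a month is taken as 30 days.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}


def to_days(text):
    """Convert strings such as '6 days ago' or 'a month ago' to a day count."""
    match = re.match(r"\s*(a|an|\d+)\s+(day|week|month|year)s?\s+ago\s*$", text)
    if match is None:
        raise ValueError(f"Unrecognised posting age: {text!r}")
    count, unit = match.groups()
    number = 1 if count in ("a", "an") else int(count)
    return number * DAYS_PER_UNIT[unit]


posted = ['6 days ago', 'a month ago', '2 months ago', '20 days ago']
# Oldest first, matching ascending=False in the cell above.
oldest_first = sorted(posted, key=to_days, reverse=True)
print(oldest_first)
# ['2 months ago', 'a month ago', '20 days ago', '6 days ago']
```

Because the function is used as a sort key, the original strings stay untouched and no temporary column is needed.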

## Final Result Dataframe.

In [32]:
job_tln

| | job_name | date_posted | job_type | job_loc | exp |
|---|---|---|---|---|---|
| 0 | Machine Learning Engineer | 60 days ago | Full-time | Hyderabad | 3-5Years |
| 1 | Engineering Lead | 60 days ago | Full-time | Mauritius | 7-18Years |
| 2 | Sr SDET | 60 days ago | Full-time | NewYork | 3-10Years |
| 3 | OSS DevOps Engineer | 60 days ago | Full-time | Hyderabad | 6-12Years |
| 4 | Site Reliability Engineer | 60 days ago | Full-time | NewYork | 1-3Years |
| 5 | Lead DevOps Engineer | 60 days ago | Full-time | Hyderabad | 5-11Years |
| 6 | Senior DevOps Engineer | 60 days ago | Full-time | Hyderabad | 3-10Years |
| 7 | Junior Cloud Native Developer | 60 days ago | Full-time | Delaware | 1-2Years |
| 10 | Sr Full Stack Developer | 30 days ago | Full-time | Mauritius | 3-8Years |
| 8 | Blockchain Developer | 30 days ago | Full-time | Hyderabad | 1-4Years |


***
***
***
### Finally, we save the dataframe in a csv.

Note: Please change the path accordingly.

In [29]:
csv_location = '/Volumes/ANAGHA/techolution_assignment/techolution_jobs.csv' 
job_tln.to_csv(csv_location,header = True, sep = ',')
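
By default `to_csv` also writes the dataframe index, which shows up as an unnamed first column when the file is read back. A minimal sketch of an export that drops the index and builds the path from a directory argument follows. The helper `save_jobs` and the one-row dataframe are hypothetical, and it assumes the index carries no information worth keeping.

```python
import os
import tempfile

import pandas as pd


def save_jobs(df, directory, filename="techolution_jobs.csv"):
    """Write the dataframe to directory/filename without the index column."""
    path = os.path.join(directory, filename)
    df.to_csv(path, index=False)
    return path


# Hypothetical one-row dataframe standing in for job_tln.
jobs = pd.DataFrame({"job_name": ["Big Data Intern"],
                     "date_posted": ["6 days ago"]})

# A temporary directory keeps the sketch independent of any machine path.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_jobs(jobs, tmp)
    reloaded = pd.read_csv(saved)

print(list(reloaded.columns))  # ['job_name', 'date_posted']
```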

***
***


## **End Notes**

- https://techolution.app.param.ai/jobs/ is a dynamic website. The content cannot be scraped using Beautiful Soup alone, so we use chromedriver to fetch the HTML of the webpage.
- Other drivers can also be used, depending on your browser.
- Information obtained from the job cards was very messy, with lots of trailing spaces and \n characters. Using a ***regex*** is an efficient way to clean the obtained data.
- All the above code can be written without temporary arrays and consolidated in a single loop. However, to explain the process clearly, I used several arrays and then filled in an empty dataframe.
- Coming to sorting by date, the only way to sort a list such as ['a month ago','2 months ago','10 days ago','20 days ago'...etc] is to convert the values to ***one scale***. So, I converted all of them to a ***days scale***. The sorting can be done in place in a single line by taking the first split of the string. Again, just to explain clearly, I divided it into several steps.
- As noted alongside the code, we cannot sort by ***exact date***. For example: 'a month ago' can be 30 days or 35 days or 40 days. If the exact date of posting were available, it could be subtracted from datetime.date.today() to obtain an accurate sort.
- As future scope, ***PhantomJS*** could be used to also get the URL of each job posting.
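
The single-loop consolidation mentioned in these notes could look roughly like the sketch below. It reuses the selectors from the notebook, but the sample HTML is a hypothetical stand-in for the real page (its nesting is assumed), and the function name `extract_jobs` is not part of the original code.

```python
import re

from bs4 import BeautifulSoup

DATE_CLASS = "four wide right aligned computer tablet only column"
INFO_CLASS = "twelve wide computer twelve wide tablet sixteen wide mobile column"

# Hypothetical markup; only the id and class names come from the notebook.
SAMPLE_HTML = f"""
<div id="joblist">
  <div class="{INFO_CLASS}">
    <h3>Big Data Intern</h3>
    <p>
        Internship · Hyderabad · 0 - 2 Years
    </p>
  </div>
  <div class="{DATE_CLASS}">6 days ago</div>
  <div class="{INFO_CLASS}">
    <h3>Senior Cloud Specialist</h3>
    <p>
        Full-time · Singapore · 5 - 10 Years
    </p>
  </div>
  <div class="{DATE_CLASS}">10 days ago</div>
</div>
"""


def extract_jobs(html):
    """Build one dict per job in a single pass over the three tag lists."""
    soup = BeautifulSoup(html, "html.parser")
    data = soup.find("div", {"id": "joblist"})
    names = data.find_all("h3")
    dates = data.find_all("div", {"class": DATE_CLASS})
    infos = data.find_all("div", {"class": INFO_CLASS})
    rows = []
    for name, posted, info in zip(names, dates, infos):
        # Same cleaning as the notebook: drop all whitespace, split on '·'.
        cleaned = re.sub(r"\s+", "", info.find("p").get_text())
        job_type, job_loc, exp = cleaned.split("·")
        rows.append({
            "job_name": name.get_text().strip(),
            "date_posted": posted.get_text().strip(),
            "job_type": job_type,
            "job_loc": job_loc,
            "exp": exp,
        })
    return rows


rows = extract_jobs(SAMPLE_HTML)
print(rows[0]["job_name"], rows[0]["exp"])  # Big Data Intern 0-2Years
```

The resulting list of dicts can be passed directly to `pd.DataFrame(rows)`, which avoids filling an empty dataframe column by column.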




***
***
#### Thanks & Regards, 
#### Anagha Karanam
#### 15th April 2019
#### email: anaghakaranam@gmail.com
#### linkedin: https://www.linkedin.com/in/anaghakaranam/