# Capstone Web Scraping using BeautifulSoup

By Made Swastika Nata Negara (batch Vulcan)

This notebook walks step by step through scraping the web data and processing it for the application.

## 💭Background

As someone with an interest in the data field, I wanted to know what its job market looks like. Using the job platform Kalibrr, I scraped job postings on April 11th, 2023 with the search keyword "data". In total, 225 jobs from 15 result pages were scraped. I also analysed the data and created interactive plots.

## 📚Importing Modules and Libraries

In [1]:
import pandas as pd
import numpy as np
import json

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

from bs4 import BeautifulSoup 
import requests

## 🔎Requesting the Data and Creating a BeautifulSoup

Let's begin by requesting each results page with the `get` method of `requests`. Inspect the page to find the element that wraps the job list, then pass its tag and class to `.find()`.

In [2]:
detail_urls = []

for i in range(1, 16):
    url_get = requests.get(f'https://www.kalibrr.id/id-ID/job-board/te/data/{i}')
    
    soup = BeautifulSoup(url_get.content,"html.parser")
    
    table = soup.find('div', attrs={'class':'k-bg-white k-divide-y k-divide-solid k-divide-tertiary-ghost-color'})


    # Named job_links to avoid shadowing the built-in `list`
    job_links = table.find_all('a', attrs={'itemprop':'name'})
    for link in job_links:
        detail_urls.append(link['href'])
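
The loop above walks through result pages 1 to 15. As a minimal sketch, the pagination can be kept apart from the network calls so it is easy to check offline. `BASE_URL` and `page_urls` are names introduced here for illustration and are not part of the notebook.

```python
from typing import List

# Hypothetical helper: builds the listing URLs that the loop above requests.
BASE_URL = "https://www.kalibrr.id/id-ID/job-board/te/data/"

def page_urls(first_page: int, last_page: int) -> List[str]:
    """Return one listing URL per page, both ends inclusive."""
    # range() excludes its stop value, hence last_page + 1
    return [f"{BASE_URL}{page}" for page in range(first_page, last_page + 1)]

urls = page_urls(1, 15)
print(len(urls))
print(urls[0])
print(urls[-1])
```

With 225 links collected over 15 pages, each results page contributed 15 job links.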

I collected the URL of each job's detail page. Here are a few samples:

In [117]:
print(len(detail_urls))

for url in detail_urls[:5]:
    print(url)

225
/id-ID/c/pgi-data/jobs/198724/project-manager
/id-ID/c/pgi-data/jobs/208282/it-system-analyst
/id-ID/c/pt-adicipta-inovasi-teknologi/jobs/197693/data-analytics-manager
/id-ID/c/mobius-digital/jobs/219866/devops-and-data-engineer
/id-ID/c/pgi-data/jobs/208690/network-security-engineer
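
The paths printed above are relative to the site root, and each one carries the job id as one of its segments. The sketch below shows one way to turn a path into a full URL and to pull out the id using only the standard library. `absolute_url` and `job_id` are hypothetical helper names.

```python
from urllib.parse import urljoin

SITE = "https://www.kalibrr.id"  # same host the notebook requests

def absolute_url(path: str) -> str:
    """Join a relative job path onto the site root."""
    return urljoin(SITE, path)

def job_id(path: str) -> str:
    """Return the path segment that follows 'jobs'."""
    # '/id-ID/c/<company>/jobs/<id>/<slug>' -> '<id>'
    parts = path.strip("/").split("/")
    return parts[parts.index("jobs") + 1]

sample = "/id-ID/c/pgi-data/jobs/198724/project-manager"
print(absolute_url(sample))
print(job_id(sample))
```

Looking the id up relative to the `jobs` segment gives the same result as a fixed index such as `split('/')[5]`, and it keeps working if the prefix of the path ever changes in length.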


## 💡Finding the Right Key to Scrape the Data & Extracting the Right Information

I decided to collect the following information about each job: Id, Job Title, Company, Location, Date Posted, Position, Employment Type, Minimum Education, Category, and Industry.

In [47]:
ids = []
titles = []
locations = []
companies = []
dates = []
employment_types = []
categories = []
educations= []
positions = []
industries = []

for url in detail_urls:
    url_get = requests.get('https://www.kalibrr.id'+url)
    ids.append(url.split('/')[5])
    
    soup = BeautifulSoup(url_get.content,"html.parser")
    
    title = soup.find('h1', attrs={'itemprop': 'title'})
    location = soup.find('span', attrs={'itemtype':'http://schema.org/PostalAddress'})
    company = soup.find('h2', attrs={'class':'k-inline-block'})
    date = soup.find('span', attrs={'itemprop':'datePosted'})
    employment_type = soup.find('a', attrs={'class':'k-text-grey-900'})
    position = soup.find_all('dl', attrs={'class':'k-mb-4'})[0].find('a')
    category = soup.find_all('dl', attrs={'class':'k-mb-4'})[1].find('a')
    education = soup.find_all('dl', attrs={'class':'k-mb-4'})[2].find('a')
    industry = soup.find('span', attrs={'itemprop':'industry'})
    
    titles.append(title.text if title else None)
    locations.append(location.text if location else None)
    companies.append(company.text if company else None)
    dates.append(date.text if date else None)  # check `date` itself, not `company`
    employment_types.append(employment_type.text if employment_type else None)
    categories.append(category.text if category else None)
    educations.append(education.text if education else None)
    positions.append(position.text if position else None)
    industries.append(industry.text if industry else None)

In [5]:
len(titles)

225
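
The extraction loop repeats the pattern `x.text if x else None` for every field, which makes copy-paste slips easy. A small helper keeps the check in one place. This is only a sketch: `text_or_none` is a hypothetical name, it also strips surrounding whitespace (which the notebook does not do), and the demo uses `SimpleNamespace` objects as stand-ins for the tags returned by `soup.find()` so that it runs without a network request.

```python
from types import SimpleNamespace
from typing import Optional

def text_or_none(node) -> Optional[str]:
    """Return the stripped text of a tag-like object, or None if nothing was found."""
    # soup.find() returns None when no element matches
    return node.text.strip() if node is not None else None

# Stand-ins for scraped tags (assumption: real tags expose .text the same way)
company = SimpleNamespace(text="  PGI Data ")
date = None  # simulates a page without a datePosted element

print(text_or_none(company))
print(text_or_none(date))
```

Each append in the loop could then read `dates.append(text_or_none(date))`, so the variable being tested and the variable being read are always the same one.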

## 🛠️Creating Data Frame & Data Wrangling

Put the lists into a DataFrame.

In [48]:
import pandas as pd

df = pd.DataFrame({'Id': ids, 'Job Title': titles, 'Company': companies, 'Location': locations, 'Date Posted': dates, 
                   'Position': position, 'Employment Type':employment_types,'Minimum Education': educations, 
                   'Category': categories, 'Industry': industries  
                  })
df.head()

| | Id | Job Title | Company | Location | Date Posted | Position | Employment Type | Minimum Education | Category | Industry |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 198724 | Project Manager | PGI Data | Jakarta, Indonesia | 2023-04-03T06:38:05.915181+00:00 | Mid-Senior Level Manager | Penuh waktu | Lulus program Sarjana (S1) | IT and Software | Information Technology / IT |
| 1 | 208282 | IT System Analyst | PGI Data | Jakarta, Indonesia | 2023-03-27T06:14:27.852017+00:00 | Mid-Senior Level Manager | Penuh waktu | Lulus program Sarjana (S1) | IT and Software | Information Technology / IT |
| 2 | 197693 | Data Analytics Manager | PT Adicipta Inovasi Teknologi | Kota Jakarta Barat, Indonesia | 2023-04-06T02:20:33.531699+00:00 | Mid-Senior Level Manager | Penuh waktu | Lulus program Sarjana (S1) | IT and Software | Information Technology / IT |
| 3 | 219866 | DevOps and Data Engineer | Mobius Digital | Tangerang Selatan, Indonesia | 2023-04-04T09:02:35.935010+00:00 | Mid-Senior Level Manager | Penuh waktu | Lulus program Sarjana (S1) | IT and Software | Information Technology / IT |
| 4 | 208690 | Network Security Engineer | PGI Data | Jakarta, Indonesia | 2023-04-05T07:54:25.152029+00:00 | Mid-Senior Level Manager | Penuh waktu | Lulus program Sarjana (S1) | IT and Software | Information Technology / IT |


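Every row shows the same Position, so that column was not collected correctly. A likely cause is the `'Position': position` entry in the cell above, which passes the last scraped tag rather than the `positions` list. I scraped the position again on its own and overwrote the column.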

In [54]:
positions = []

for url in detail_urls:
    url_get = requests.get('https://www.kalibrr.id'+url)
    soup = BeautifulSoup(url_get.content,"html.parser")
    position = soup.find_all('dl', attrs={'class':'k-mb-4'})[0].find('a')

    positions.append(position.text)

In [56]:
df['Position'] = positions
df['Position'].value_counts()

Mid-Senior Level Manager        91
Lulusan Baru / Junior           65
Supervisor / Asisten Manager    60
Magang / OJT                     4
Direktur / Eksekutif             4
Mid-Senior Level / Manager       1
Name: Position, dtype: int64

The Date Posted column holds timestamps, so it is best converted to a datetime dtype.

In [101]:
df['Date Posted'] = pd.to_datetime(df['Date Posted'])

In [8]:
df.dtypes

Id                                object
Job Title                         object
Company                           object
Location                          object
Date Posted          datetime64[ns, UTC]
Position                          object
Employment Type                   object
Minimum Education                 object
Category                          object
Industry                          object
dtype: object

Some values are still inconsistent, partly because the site mixes English and Indonesian labels. I cleaned them and made them consistent.

In [67]:
df['Minimum Education'] = df['Minimum Education'].str.replace("Bachelor's degree graduate", "Lulus program Sarjana (S1)")
df['Position'] = df['Position'].str.replace("Mid-Senior Level / Manager", "Mid-Senior Level Manager")
df['Employment Type'] = df['Employment Type'].str.replace('Full time', 'Penuh waktu')

In [89]:
# regex=True makes replace() match substrings instead of whole cell values
df['Location'] = df['Location'].replace(['Kota', 'City', ', Indonesia'], ['', '', ''], regex=True)\
    .replace(['South Jakarta', 'Central Jakarta', 'North Jakarta', 'East Jakarta', 'West Jakarta', 'South Tangerang'], \
             ['Jakarta Selatan', 'Jakarta Pusat', 'Jakarta Utara', 'Jakarta Timur', 'Jakarta Barat', 'Tangerang Selatan'], regex=True)\
    .str.strip()

In [90]:
df['Location'].value_counts()

Jakarta Selatan      68
Jakarta Pusat        34
Jakarta              21
Tangerang            21
Jakarta Barat        18
Jakarta Utara        11
Jakarta Timur        10
Tangerang Selatan     8
Surabaya              8
Bandung               5
Sleman                4
Banyuwangi            2
Kupang                2
Denpasar              2
Yogyakarta            1
Malang                1
Sukabumi              1
Depok                 1
Makassar              1
Palembang             1
Medan                 1
West Lombok           1
Central Lampung       1
Bekasi                1
Bogor                 1
Name: Location, dtype: int64
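
The cleanup above comes down to two rules: remove the substrings `Kota`, `City` and `, Indonesia`, then translate the English district names into Indonesian. The sketch below applies the same rules to plain strings with the standard `re` module, which is handy for checking the logic one value at a time. `normalise_location` and `EN_TO_ID` are illustrative names, the mapping covers only the six names listed in the cell above, and the second sample input is made up for the demonstration.

```python
import re

# English -> Indonesian names, copied from the cleaning cell above
EN_TO_ID = {
    "South Jakarta": "Jakarta Selatan",
    "Central Jakarta": "Jakarta Pusat",
    "North Jakarta": "Jakarta Utara",
    "East Jakarta": "Jakarta Timur",
    "West Jakarta": "Jakarta Barat",
    "South Tangerang": "Tangerang Selatan",
}

def normalise_location(raw: str) -> str:
    """Apply the notebook's location rules to a single string."""
    # 1. Drop the 'Kota' / 'City' markers and the country suffix
    cleaned = re.sub(r"Kota|City|, Indonesia", "", raw)
    # 2. Translate English district names
    for english, indonesian in EN_TO_ID.items():
        cleaned = cleaned.replace(english, indonesian)
    # 3. Collapse whitespace left behind by the removals
    return " ".join(cleaned.split())

print(normalise_location("Kota Jakarta Barat, Indonesia"))
print(normalise_location("South Tangerang, Indonesia"))  # hypothetical input
```

Values such as `West Lombok` and `Central Lampung` in the output above are not in the mapping, so they pass through untranslated.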

Export the data to a CSV file.

In [125]:
df.to_csv('dataset.csv', index=False)

## 📊Analysis and Conclusion

In [12]:
# Create a custom theme and set it as default
pio.templates["custom"] = pio.templates["plotly_white"]
pio.templates["custom"].layout.margin = {'b': 25, 'l': 25, 'r': 25, 't': 50}
pio.templates["custom"].layout.width = 600
pio.templates["custom"].layout.height = 450
pio.templates["custom"].layout.autosize = False
pio.templates["custom"].layout.font.family="Arial"
pio.templates["custom"].layout.title.update({"x":0.5, "xref":"paper", "font_family":"Arial Black"})
pio.templates["custom"].layout.xaxis.update({"showline":True, "linecolor":"darkgray"})
pio.templates["custom"].layout.yaxis.update({"showline":True, "linecolor":"darkgray"})
pio.templates["custom"].layout.colorway = ['#1F77B4', '#FF7F0E', '#54A24B', '#D62728', '#C355FA',
                                           '#8C564B', '#E377C2', '#7F7F7F',"#FFE323", '#17BECF']
pio.templates.default = "custom"

### Number of Jobs based on Location

In [126]:
temp_df = df.groupby('Location').count()['Id'].sort_values(ascending=False)

fig = px.bar(temp_df[:15], title='Top 15 Locations with Highest Job Postings',
            labels={'Location':'Location', 'value':'Job Postings'}, orientation='h',
            color='value', color_continuous_scale='purp', text_auto=True)
fig.update_layout(coloraxis_showscale=False, xaxis_showgrid=False, width=700)
fig.update_yaxes(categoryorder='total ascending')
fig.data[0].texttemplate = '%{x:s}'
fig.data[0].textposition = 'outside'
fig.add_annotation(xref='x domain', yref="y domain", x=0.5, y=1.06, font_size=14,
                   text="<i>Jakarta and Tangerang have more opportunities than everywhere else</i>", 
                   showarrow=False)

avg = temp_df.mean()
fig.add_vline(x=avg, line_width=1.5, line_dash="dash", line_color="#ff66b3")
fig.add_annotation(x=avg, xanchor="left", y=0.5, text="Average of all locations", showarrow=False,
                   font_color="#ff66b3", textangle=90)

fig.write_html("./static/plots/location.html")
fig.show()
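
In the chart code above, `avg` is computed from `temp_df`, which holds every location, while only the first 15 bars are drawn. The worked example below uses the location counts printed earlier to show how the two averages differ. The variable names are illustrative only.

```python
# Location counts copied from the value_counts() output earlier in the notebook
location_counts = {
    "Jakarta Selatan": 68, "Jakarta Pusat": 34, "Jakarta": 21, "Tangerang": 21,
    "Jakarta Barat": 18, "Jakarta Utara": 11, "Jakarta Timur": 10,
    "Tangerang Selatan": 8, "Surabaya": 8, "Bandung": 5, "Sleman": 4,
    "Banyuwangi": 2, "Kupang": 2, "Denpasar": 2, "Yogyakarta": 1, "Malang": 1,
    "Sukabumi": 1, "Depok": 1, "Makassar": 1, "Palembang": 1, "Medan": 1,
    "West Lombok": 1, "Central Lampung": 1, "Bekasi": 1, "Bogor": 1,
}

ranked = sorted(location_counts.values(), reverse=True)
top_15 = ranked[:15]

overall_avg = sum(ranked) / len(ranked)   # what the dashed line shows
top15_avg = sum(top_15) / len(top_15)     # average of the plotted bars only

print(sum(ranked), len(ranked))   # 225 postings across 25 locations
print(overall_avg)                # 9.0
print(round(top15_avg, 2))        # 14.33
```

The dashed line therefore sits at 9 postings, the average over all 25 locations, which is lower than the average of the 15 bars actually shown.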

### Number of Jobs based on Company

In [127]:
temp_df = df.groupby('Company').count()['Id'].sort_values(ascending=False)

fig = px.bar(temp_df[:15], title='Top 15 Companies with Highest Job Postings',
            labels={'Company':'Company', 'value':'Job Postings'}, orientation='h',
            color='value', color_continuous_scale='purp', text_auto=True)
fig.update_layout(coloraxis_showscale=False, xaxis_showgrid=False, width=700)
fig.update_yaxes(categoryorder='total ascending')
fig.data[0].texttemplate = '%{x:s}'
fig.data[0].textposition = 'outside'
fig.add_annotation(xref='x domain', yref="y domain", x=0.5, y=1.06, font_size=14,
                   text="<i>PT BFI Finance Indonesia Tbk is far in front</i>", 
                   showarrow=False)

avg = temp_df.mean()
fig.add_vline(x=avg, line_width=1.5, line_dash="dash", line_color="#ff66b3")
fig.add_annotation(x=avg, xanchor="left", y=0.5, text="Average of all companies", showarrow=False,
                   font_color="#ff66b3", textangle=90)

fig.write_html("./static/plots/company.html")
fig.show()

### Number of Jobs based on Date Posted

In [128]:
day_dict = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}

temp_df = df.copy()
temp_df['Day'] = temp_df['Date Posted'].dt.dayofweek
temp_df = temp_df.groupby('Day').count()['Id']
temp_df = temp_df.sort_index()
temp_df = temp_df.rename(day_dict)
colors = ['#1F77B4']*7
colors[0] = colors[1] = "#FF7F0E"

fig = px.bar(temp_df, title='Number of Job Postings by Day',
            labels={'Day':'Day', 'value':'Job Postings'},
            hover_data={'variable':False, 'value':':d'}, text_auto=True)
fig.update_layout(showlegend=False, yaxis_showgrid=False)
fig.update_traces(texttemplate="%{y:d}", marker_color=colors)

fig.add_annotation(xref='x domain', yref="y domain", x=0.5, y=1.06, font_size=14,
                   text="<i>Most jobs are posted at the beginning of the week<br>No jobs were posted on Sunday</i>", showarrow=False)

fig.write_html("./static/plots/day.html")
fig.show()
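
The grouping above maps each timestamp to a weekday number (Monday is 0) and counts the rows per day. A `groupby` only returns days that actually occur, so a day without postings is simply missing from the result. The sketch below reproduces the idea with the standard library on the five `Date Posted` values from the sample rows shown earlier, and fills absent days with zero. `DAY_NAMES` and `per_day` are names chosen for this example.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# 'Date Posted' values of the five sample rows shown earlier
timestamps = [
    "2023-04-03T06:38:05.915181+00:00",
    "2023-03-27T06:14:27.852017+00:00",
    "2023-04-06T02:20:33.531699+00:00",
    "2023-04-04T09:02:35.935010+00:00",
    "2023-04-05T07:54:25.152029+00:00",
]

# weekday() uses the same convention as pandas' dt.dayofweek: Monday == 0
counts = Counter(datetime.fromisoformat(ts).weekday() for ts in timestamps)

# Keep all seven days, using 0 where no posting was found
per_day = {name: counts.get(i, 0) for i, name in enumerate(DAY_NAMES)}
print(per_day)
```

In pandas, a comparable zero fill could be done with `reindex(range(7), fill_value=0)` before renaming the index, which would keep the seven-entry `colors` list aligned with seven bars.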

### Number of Jobs based on Employment Type

In [129]:
temp_df = df.copy()
temp_df = temp_df.groupby('Employment Type').count()['Id']

colors = ['#1F77B4']*2
colors[1] = "#FF7F0E"

fig = px.bar(temp_df, title='Number of Job Postings by Employment Type',
            labels={'Employment Type':'Employment Type', 'value':'Job Postings'},
            hover_data={'variable':False, 'value':':d'}, text_auto=True)
fig.update_layout(showlegend=False, yaxis_showgrid=False)
fig.update_traces(texttemplate="%{y:d}", marker_color=colors)
fig.add_annotation(xref='x domain', yref="y domain", x=0.5, y=1.06, font_size=14,
                   text="<i>The majority of companies want to hire full-time employees</i>", showarrow=False)

fig.write_html("./static/plots/employment.html")
fig.show()

### Number of Jobs based on Category

In [130]:
temp_df = df.groupby('Category').count()['Id'].sort_values(ascending=False)

fig = px.bar(temp_df[:15], title='Number of Job Postings by Category',
            labels={'Category':'Category', 'value':'Job Postings'}, orientation='h',
            color='value', color_continuous_scale='purp', text_auto=True)
fig.update_layout(coloraxis_showscale=False, xaxis_showgrid=False, width=700)
fig.update_yaxes(categoryorder='total ascending')

fig.data[0].texttemplate = '%{x:s}'
fig.data[0].textposition = 'outside'
fig.add_annotation(xref='x domain', yref="y domain", x=0.5, y=1.06, font_size=14,
                   text="<i>Most of the jobs are in IT and Software</i>", 
                   showarrow=False)

avg = temp_df.mean()
fig.add_vline(x=avg, line_width=1.5, line_dash="dash", line_color="#ff66b3")
fig.add_annotation(x=avg, xanchor="left", y=0.5, text="Average", showarrow=False,
                   font_color="#ff66b3", textangle=90)

fig.write_html("./static/plots/category.html")
fig.show()

### Number of Jobs based on Industry

In [131]:
temp_df = df.groupby('Industry').count()['Id'].sort_values(ascending=False)

fig = px.bar(temp_df[:15], title='Top 15 Industries',
            labels={'Industry':'Industry', 'value':'Job Postings'}, orientation='h',
            color='value', color_continuous_scale='purp', text_auto=True)
fig.update_layout(coloraxis_showscale=False, xaxis_showgrid=False, width=700)
fig.update_yaxes(categoryorder='total ascending')

fig.data[0].texttemplate = '%{x:s}'
fig.data[0].textposition = 'outside'
fig.add_annotation(xref='x domain', yref="y domain", x=0.5, y=1.06, font_size=13,
                   text="<i>Fintech has more job openings than any other specific industry in IT</i>", 
                   showarrow=False)

avg = temp_df.mean()
fig.add_vline(x=avg, line_width=1.5, line_dash="dash", line_color="#ff66b3")
fig.add_annotation(x=avg, xanchor="left", y=0.5, text="Average of all industries", showarrow=False,
                   font_color="#ff66b3", textangle=90)

fig.write_html("./static/plots/industry.html")
fig.show()

### Number of Jobs based on Education Requirements

In [132]:
temp_df = df.copy()
temp_df = temp_df.groupby('Minimum Education').count()['Id']

colors = ['#1F77B4']*3
colors[2] = "#FF7F0E"

fig = px.bar(temp_df, title='Number of Job Openings by Minimum Education',
            labels={'Minimum Education':'Minimum Education', 'value':'Job Postings'},
            hover_data={'variable':False, 'value':':d'}, text_auto=True)
fig.update_layout(showlegend=False, yaxis_showgrid=False)
fig.update_traces(texttemplate="%{y:d}", marker_color=colors)
fig.add_annotation(xref='x domain', yref="y domain", x=0.5, y=1.06, font_size=14,
                   text="<i>Applicants without a bachelor's degree have far fewer openings to choose from</i>", showarrow=False)

fig.write_html("./static/plots/education.html")
fig.show()

### Number of Jobs based on Position

In [118]:
df['Position'].unique()

array(['Mid-Senior Level Manager', 'Supervisor / Asisten Manager',
       'Lulusan Baru / Junior', 'Magang / OJT', 'Direktur / Eksekutif'],
      dtype=object)

In [133]:
temp_df = df.copy()
temp_df = temp_df.groupby('Position').count()['Id']

colors = ['#1F77B4']*5
colors[3] = "#FF7F0E"

fig = px.bar(temp_df, title='Number of Job Openings by Position',
            labels={'Position':'Position', 'value':'Job Postings'},
            hover_data={'variable':False, 'value':':d'}, text_auto=True)
fig.update_layout(showlegend=False, yaxis_showgrid=False)
fig.update_xaxes(categoryorder='array', categoryarray=['Magang / OJT', 'Lulusan Baru / Junior', 'Supervisor / Asisten Manager',
                                                       'Mid-Senior Level Manager', 'Direktur / Eksekutif'])
fig.update_traces(texttemplate="%{y:d}", marker_color=colors)

fig.add_annotation(xref='x domain', yref="y domain", x=0.5, y=1.06, font_size=14,
                   text="<i>Companies hire very few interns</i>", showarrow=False)

fig.write_html("./static/plots/position.html")
fig.show()
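
To put the bar heights in proportion, the shares can be worked out from the `value_counts()` output shown earlier, after merging the single `Mid-Senior Level / Manager` row into `Mid-Senior Level Manager` (91 + 1 = 92). This is a small arithmetic sketch, and the variable names are illustrative.

```python
# Counts taken from the Position value_counts() output, with the two
# spellings of the mid-senior label merged (91 + 1)
position_counts = {
    "Magang / OJT": 4,
    "Lulusan Baru / Junior": 65,
    "Supervisor / Asisten Manager": 60,
    "Mid-Senior Level Manager": 92,
    "Direktur / Eksekutif": 4,
}

total = sum(position_counts.values())

# Share of all postings per position level, in percent
shares = {name: round(count / total * 100, 1)
          for name, count in position_counts.items()}

for name, share in shares.items():
    print(f"{name}: {share}%")
```

Internships account for about 1.8% of the 225 postings, while mid-senior roles make up about 40.9%.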

### Conclusions
- Most of the job openings are in Jakarta or Tangerang
- PT BFI Finance Indonesia Tbk currently posts the most jobs
- Most jobs are posted at the beginning of the week
- The majority of companies want to hire full-time employees
- Most of the jobs are in IT and Software, and Fintech is the specific industry with the most postings
- A bachelor's degree is almost a must to get a job
- Companies mostly look for mid-senior level managers and hire very few interns