<font color='#2F4F4F'>To use this notebook on Colaboratory, you will need to make a copy of it. Go to File > Save a Copy in Drive. You can then use the new copy that will appear in the new tab.</font>


# <font color='#2F4F4F'>AfterWork Data Science: Web Scraping with Python</font>

## <font color='#2F4F4F'>Problem Statement</font>


Together with a team of startup entrepreneurs, you decide to work on an idea that could
change the way people search for jobs. You believe that job scraping could be the next
big thing because many people are actively looking for jobs in the country, in this case
Kenya.


The problem is that many job listings never reach the job seekers they target. Your task,
as the data scientist on this project, is to scrape job titles and links and put them in
a single table that your team members can use to build a job aggregator.


## <font color='#2F4F4F'>Prerequisites</font>

In [None]:
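# We first import the required libraries
# ---
#
import warnings

import pandas as pd             # library for data manipulation
import requests                 # library for fetching a web page
from bs4 import BeautifulSoup   # library for extracting contents from a web page
from google.colab import files  # Colab helper for downloading files

# Display options so that long titles and URLs are not truncated
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', None)  # None means no limit; recent pandas versions no longer accept -1

# Silence warnings (note that this hides every warning, including deprecation notices)
warnings.filterwarnings('ignore')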

## <font color='#2F4F4F'>Step 1: Obtaining our Data</font>

In [2]:
# PigiaMe: https://www.pigiame.co.ke/it-software-jobs
# ---
#
pigia_me = requests.get('https://www.pigiame.co.ke/it-software-jobs')
pigia_me

<Response [200]>

In [3]:
# MyJobMag: https://www.myjobmag.co.ke/jobs-by-field/information-technology
# ---
#

Myjob_mag=requests.get('https://www.myjobmag.co.ke/jobs-by-field/information-technology')
Myjob_mag

<Response [200]>

In [4]:
# KenyanJob: https://www.kenyajob.com/job-vacancies-search-kenya?f%5B0%5D=im_field_offre_secteur%3A133
# ---

Kenyan_job=requests.get('https://www.kenyajob.com/job-vacancies-search-kenya?f%5B0%5D=im_field_offre_secteur%3A133')
Kenyan_job

<Response [200]>
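
The three requests above succeed, but a request can also hang or return an error status. The following is a minimal sketch of a more defensive fetch, not part of the original notebook: the helper name `fetch_html`, the 10-second timeout, and the `User-Agent` string are all assumptions chosen for illustration. The `get` parameter exists only so the helper can be tried without network access.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the HTML text of `url`, raising an exception on HTTP errors.

    `fetch_html`, the timeout value, and the User-Agent string are
    hypothetical choices for this sketch, not part of the original notebook.
    """
    headers = {"User-Agent": "Mozilla/5.0 (job-aggregator notebook)"}
    response = get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of failing silently
    response.raise_for_status()
    return response.text


# Example usage (requires network access):
# html = fetch_html('https://www.pigiame.co.ke/it-software-jobs')
```

With a helper like this, a blocked or missing page surfaces as an exception at fetch time rather than as an empty table in Step 3.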

## <font color='#2F4F4F'>Step 2: Parsing</font>

In [5]:
# Parsing our document: pigia_me
# ---
# 
soup = BeautifulSoup(pigia_me.text, "html.parser")


In [6]:
# Parsing our document: my_job_mag
# ---
#  
my_job_mag_soup = BeautifulSoup(Myjob_mag.text, "html.parser")


In [7]:
# Parsing our document: kenyan_job
# ---
# 
KenyanJob_soup = BeautifulSoup(Kenyan_job.text, "html.parser")


## <font color='#2F4F4F'>Step 3: Extracting Required Elements</font>

In [11]:
# 1. Extracting job titles and links: pigia_me

results = soup.find(class_="search")
job_elements= results.find_all("div",class_="listings-cards__list-item")

#create empty lists that we will use to store content fetched
job_title = []
job_link = []


#loop through these tags
for job_element in job_elements:

  title_element= job_element.find("div", class_="listing-card__header__title")
  job_url = job_element.find('a')['href']

  #append result to initialized lists
  job_title.append(title_element.text.strip())
  job_link.append(job_url)

df_pigia = pd.DataFrame({"Job Title":job_title, "Job Url": job_link})
df_pigia


,Job Title,Job Url
0,Senior Software Engineer,https://www.pigiame.co.ke/listings/senior-software-engineer-3887056
1,Academic Cloud Advocate – Machine Learning and AI Developer,https://www.pigiame.co.ke/listings/academic-cloud-advocate-machine-learning-and-ai-developer-3887037
2,Academic Cloud Advocate – Developer tools and Application Architecture,https://www.pigiame.co.ke/listings/academic-cloud-advocate-developer-tools-and-application-architecture-3887022
3,Computer literate,https://www.pigiame.co.ke/listings/computer-literate-3884819
4,"Software Engineer – Karen, Kenya",https://www.pigiame.co.ke/listings/software-engineer-karen-kenya-3881204
5,Full Stack Software Developer,https://www.pigiame.co.ke/listings/full-stack-software-developer-3880995
6,Database Administrator I,https://www.pigiame.co.ke/listings/database-administrator-i-3874948
7,Mobile App Developer,https://www.pigiame.co.ke/listings/mobile-app-developer-3869264
8,"Junior Data Analyst (Agriculture) – Nairobi, Kenya",https://www.pigiame.co.ke/listings/junior-data-analyst-agriculture-nairobi-kenya-3857270
9,Software Systems Supervisor - Urgent,https://www.pigiame.co.ke/listings/software-systems-supervisor-urgent-3849477
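
The loop in the cell above assumes every listing card contains both a title `div` and a link; if a card lacks either one, `.text` or `['href']` raises an error. As a hedged sketch, the variant below skips incomplete cards. It runs on a small made-up HTML snippet (`SAMPLE_HTML`) that only imitates the class names used above, and `extract_cards` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the class names used in the cell above
SAMPLE_HTML = """
<div class="search">
  <div class="listings-cards__list-item">
    <a href="https://example.com/listings/data-engineer-1">
      <div class="listing-card__header__title"> Data Engineer </div>
    </a>
  </div>
  <div class="listings-cards__list-item">
    <span>Card with no title and no link</span>
  </div>
</div>
"""


def extract_cards(html):
    """Return a list of {'Job Title', 'Job Url'} dicts, skipping incomplete cards."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="listings-cards__list-item"):
        title = card.find("div", class_="listing-card__header__title")
        anchor = card.find("a", href=True)
        # Skip cards that are missing either the title or the link
        if title is None or anchor is None:
            continue
        rows.append({
            "Job Title": title.get_text(strip=True),
            "Job Url": anchor["href"],
        })
    return rows


rows = extract_cards(SAMPLE_HTML)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)`.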


In [12]:
# 2. Extracting job titles and links: my_job_mag

#create empty lists that we will use to store content fetched
title_mag = []
url_mag = []

# Getting all tags required
# ---
#
results_my_job_mag = my_job_mag_soup.find("ul", class_="job-list")
job_elements_mag = results_my_job_mag.find_all("h2")

job_elements_mag

# We then loop through these tags
for result in job_elements_mag:
   
    # Getting our text from each tag
    text = result.get_text()

    # We concatenate our domain with href link that we scrape
    # in order to form a full link
    link = 'https://www.myjobmag.co.ke'+result.find('a')['href']

    # Then appending the text to our title list
    title_mag.append(text)

    # Then appending the link to our url list
    url_mag.append(link)

df_mag = pd.DataFrame({"Job Title": title_mag, "Job Url":url_mag})
df_mag


,Job Title,Job Url
0,"BMS Operator (Facilities and Operations) at Aga Khan Hospital, Mombasa",https://www.myjobmag.co.ke/job/bms-operator-facilities-and-operations
1,Product Manager - Digital Capabilities at Safaricom Kenya,https://www.myjobmag.co.ke/job/product-manager-digital-capabilities-safaricom-kenya
2,Product Manager - Cyber Security at Safaricom Kenya,https://www.myjobmag.co.ke/job/product-manager-cyber-security-safaricom-kenya
3,Product Manager – Cloud Computing at Safaricom Kenya,https://www.myjobmag.co.ke/job/product-manager-cloud-computing-safaricom-kenya
4,Product Manager – IOT at Safaricom Kenya,https://www.myjobmag.co.ke/job/product-manager-iot-safaricom-kenya
5,Tribe Lead: M-Pesa Next Financial Services at Safaricom Kenya,https://www.myjobmag.co.ke/job/tribe-lead-m-pesa-next-financial-services-safaricom-kenya
6,Product Manager at Ilara Health,https://www.myjobmag.co.ke/job/product-manager-ilara-health-1
7,Product Manager - Timiza at Absa Bank Limited,https://www.myjobmag.co.ke/job/product-manager-timiza-absa-bank-limited-2
8,Software Engineer-MSAI at Microsoft,https://www.myjobmag.co.ke/job/software-engineer-msai-microsoft
9,Data Architect Lead at Safaricom Kenya,https://www.myjobmag.co.ke/job/data-architect-lead-safaricom-kenya-1
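
The cell above builds full links by concatenating the domain with each scraped `href`, which works as long as every `href` is a path starting with `/`. A small sketch of an alternative uses `urllib.parse.urljoin` from the standard library, which also copes with an `href` that is already absolute. The helper name `absolute_link` is an assumption for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.myjobmag.co.ke"


def absolute_link(href, base_url=BASE_URL):
    """Resolve a scraped href against the site's base URL."""
    return urljoin(base_url, href)


# A site-relative path is joined onto the domain
relative = absolute_link("/job/software-engineer-msai-microsoft")

# An href that is already a full URL is returned unchanged
already_absolute = absolute_link(
    "https://www.myjobmag.co.ke/job/software-engineer-msai-microsoft"
)
```

Both calls produce the same full URL, whereas plain string concatenation would duplicate the domain in the second case.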


In [13]:
# 3. Extracting job titles: kenya_job
# ---
#create empty lists that we will use to store content fetched
title_KenyanJob=[]
url_KenyanJob=[]

results_KenyanJob = KenyanJob_soup.find("div", id="content-2")
job_KenyanJob = results_KenyanJob.find_all("h5")

# loop through these tags
for result in job_KenyanJob:
   
    # Getting text from each tag
    text = result.get_text()

    # We concatenate our domain with href link that we scrape
    # in order to form a full link
    link = 'https://www.kenyajob.com'+result.find('a')['href']

    # Then appending the text to our title list
    title_KenyanJob.append(text)

    # Then appending the link to our url list
    url_KenyanJob.append(link)

df_KenyanJob = pd.DataFrame({"Job Title": title_KenyanJob, "Job Url":url_KenyanJob})
df_KenyanJob


,Job Title,Job Url
0,Senior Software Engineer- Substrate App Platform,https://www.kenyajob.com/job-vacancies-kenya/senior-software-engineer-substrate-app-platform-95719
1,Consulting Account Manager,https://www.kenyajob.com/job-vacancies-kenya/consulting-account-manager-95720
2,Business Development Manager - International Organisations,https://www.kenyajob.com/job-vacancies-kenya/business-development-manager-international-organisations-95722
3,Company Telephone Receptionist,https://www.kenyajob.com/job-vacancies-kenya/company-telephone-receptionist-90793
4,Enterprise Architect,https://www.kenyajob.com/job-vacancies-kenya/enterprise-architect-96258
5,Mid Level Data Scientist,https://www.kenyajob.com/job-vacancies-kenya/mid-level-data-scientist-96259
6,Senior Data Engineer,https://www.kenyajob.com/job-vacancies-kenya/senior-data-engineer-96261
7,HR Business Partner (Operations),https://www.kenyajob.com/job-vacancies-kenya/hr-business-partner-operations-96263
8,HR Operations Associate,https://www.kenyajob.com/job-vacancies-kenya/hr-operations-associate-96265
9,Group Head - IT Infrastructure,https://www.kenyajob.com/job-vacancies-kenya/group-head-it-infrastructure-96266
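
The MyJobMag and KenyaJob cells follow the same pattern: find the heading tags, read the text, and build the full link. One possible way to avoid repeating that loop is sketched below. The function `extract_heading_links` and the `SAMPLE` HTML are hypothetical and only imitate the structure scraped above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_heading_links(html, heading_tag, base_url):
    """Collect the title and full link from every `heading_tag` that wraps a link."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for heading in soup.find_all(heading_tag):
        anchor = heading.find("a", href=True)
        # Headings without a link are not job listings, so we skip them
        if anchor is None:
            continue
        rows.append({
            "Job Title": heading.get_text(strip=True),
            "Job Url": urljoin(base_url, anchor["href"]),
        })
    return rows


# Made-up snippet imitating the KenyaJob markup used above
SAMPLE = (
    '<div id="content-2">'
    '<h5><a href="/job-vacancies-kenya/enterprise-architect-96258">'
    'Enterprise Architect</a></h5>'
    '<h5>Heading without a link</h5>'
    '</div>'
)

rows = extract_heading_links(SAMPLE, "h5", "https://www.kenyajob.com")
```

The same function could be called with `"h2"` and the MyJobMag domain, assuming the page structure matches what the cells above rely on.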


## <font color='#2F4F4F'>Step 4: Saving our Data</font>

In [18]:
# Combining the scraped contents into a single dataframe and previewing our data
# ---
#
jobs_df = pd.concat([df_pigia, df_mag, df_KenyanJob], ignore_index=True)

# Convert the dataframe into an Excel file and download it.
# We use the .xlsx format because recent pandas versions can no longer write legacy .xls files.
jobs_df.to_excel('jobs.xlsx', index=False)
files.download('jobs.xlsx')

jobs_df.sample(5)

,Job Title,Job Url
6,Database Administrator I,https://www.pigiame.co.ke/listings/database-administrator-i-3874948
1,Academic Cloud Advocate – Machine Learning and AI Developer,https://www.pigiame.co.ke/listings/academic-cloud-advocate-machine-learning-and-ai-developer-3887037
25,Technical Project Manager at Akvo,https://www.myjobmag.co.ke/job/technical-project-manager-akvo-1
48,Growth Engineer - Marketing Automation,https://www.kenyajob.com/job-vacancies-kenya/growth-engineer-marketing-automation-95168
46,Embedded Linux Consulting Engineering Director,https://www.kenyajob.com/job-vacancies-kenya/embedded-linux-consulting-engineering-director-95166
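
Once the three tables are concatenated, it is no longer obvious which site each row came from, and the same posting may appear more than once. The sketch below, which uses small made-up dataframes and hypothetical site names, shows one way to tag each row with its source and drop rows that share a URL.

```python
import pandas as pd

# Made-up data standing in for the per-site dataframes
df_a = pd.DataFrame({
    "Job Title": ["Data Engineer"],
    "Job Url": ["https://example.com/a"],
})
df_b = pd.DataFrame({
    "Job Title": ["Data Engineer", "ML Engineer"],
    "Job Url": ["https://example.com/a", "https://example.com/b"],
})

# Hypothetical site labels mapped to their dataframes
frames = {"Site A": df_a, "Site B": df_b}

# Add a 'Source' column to each table before stacking them
combined = pd.concat(
    [df.assign(Source=name) for name, df in frames.items()],
    ignore_index=True,
)

# Keep the first occurrence of each URL and renumber the rows
deduped = combined.drop_duplicates(subset="Job Url").reset_index(drop=True)
```

Here `combined` has three rows and `deduped` has two, since the repeated URL is kept only once.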
