<a href="https://colab.research.google.com/github/skevin-dev/NLP-FELLOWSHIP/blob/week4/Kevin_Hackathon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## In this notebook, following tasks will be conducted: 
1. Write a program which takes array of urls for job-posting websites in Rwanda
   * https://www.jobinrwanda.com/  (Job Adverts)
   * https://www.umucyo.gov.rw  ( Tenders)

2. Web-scrap the content and put them in a pandas data-frame.

3. Use regular expression and key-words to retrieve IT/Software related job/consultancy opportunities(Tenders).

4. Use EasyNMT to display the results in 50+ languages using wrapped state of art models.

5. Hosting of our work example in colab

## Data Extraction 

In [1]:
# import Libraries 

import requests # for requesting a link
from bs4 import BeautifulSoup # to scrap information fron a web 
import pandas as pd # for data wrangling/manipulation 
import re # regular expression 


In [109]:
def DataExtraction(link):
  """ This function helps to extract data about job in rwanda given a link 
  
  Args
  ----
  link:str
      link of the page to be scrapped 

  returns
  -------
  df:dataframe
     a dataset containing all information 
  """
  content =  requests.get(link).content
  soup =  BeautifulSoup(content, "html.parser")


  # find all jobs
  list_of_jobs = soup.find_all('span',class_='field field--name-title field--type-string field--label-hidden')
  jobs_titles = [list_of_jobs[index].text for index in range(len(list_of_jobs))]

  # extract companies 
  content_ = soup.find_all('p',class_='card-text')
  companies = [content_[index].find('a')['href'].split('/')[-1] for index in range(len(content_))]

  # extract experience
  experience = [content_[index].text.split('\n')[-3].strip() for index in range(len(content_))]

  # extract published dates 
  published_dates = [content_[index].text.split('\n')[-6].strip().split(' ')[2] for index in range(len(content_))] 

  # extract deadlines 
  deadlines_list = soup.find_all('time',class_='datetime')

  deadlines = [deadlines_list[d].text for d in range(len(deadlines_list))]
  
  # job description
  descriptions = soup.find_all('div',class_='card-body p-2')
  descriptions_links = ['https://www.jobinrwanda.com'+descriptions[index].find('a')['href']for index in range(len(descriptions))]

  # type of the job
  types = [content_[index].find('span').getText() for index in range(len(content_))]
  
  # # creating empty lists 
  # location= []
  Sector = []
  position = []
  contract = []
  application_link = []
  description = []
  # iterate through list of description links 
  for link_ in descriptions_links:
    content_ =  requests.get(link_).content
    soup_ =  BeautifulSoup(content_, "html.parser")
    b= soup_.find('ul',class_="list-group list-group-flush").text
    c =  b.split('\n')
    # remove unnecessary white spaces 
    res = [ele for ele in c if ele.strip()]

    text ="\n".join([" ".join(line.split()) for line in res])

    # retrieving information
    
    # sector
    sector = re.findall('Sector:\n(.+)\n',text)
    Sector.append(" ".join(sector))

    #contract type
    contract.append("".join(re.findall('Contract\stype\n:\s+(.+)\n',text)))

    #positions
    position.append("".join(re.findall('positions:\s(\d+)',text))) 

    #application link
    try:

      f = soup_.find('li',class_='list-group-item px-0 pb-0 job-apply-btn d-grid').find('a')['href']
      application_link.append('https://www.jobinrwanda.com'+f)

    except:
      application_link.append('No Link Available')

    #description
    des = soup_.find('div',class_='clearfix text-formatted field field--name-field-job-full-description field--type-text-long field--label-hidden field__item').text
    description.append(des.split('\n'))
 
 
  # create an empty dataframe 
  df = pd.DataFrame()

  # create columns and assigning values 
  df['Job Titles'] = jobs_titles
  df['Çompanies'] = companies 
  df['Experience'] = experience
  df['Published Date']  =published_dates
  df['Deadline'] = deadlines
  # df['Location'] = location
  df['Number of Position'] = position
  df['Contract Type'] = contract
  # df['Education Level']=education
  df['Description content'] =description
  df['Sectors'] = Sector
  df['Application Link'] = application_link
  df['Type'] = types
 
  return df

In [54]:
d = DataExtraction('https://www.jobinrwanda.com/jobs/all')

In [112]:
def retrieving_link(baselink):
  """this function helps to retrieve all the links including jobs,tenders,internship and consultancy

  Args
  ----
  Baselink:str
     The baselink of the page of job in rwanda websites 

  Returns
  -------
  jobs_link: the link of all jobs
  consultancy_link : the link of all consultancies
  tender_link: the link of all tenders
  internship_link: the link of all internships
  """
  content =  requests.get(baselink).content
  soup =  BeautifulSoup(content, "html.parser")
  list_of_links = []
  for name in ['all','consultancy','tender','internships']:
    name = baselink + soup.find('a',class_='nav-link px-1 text-primary nav-link--jobs-{}'.format(name))['href']
    list_of_links.append(name)

  return list_of_links


In [133]:
baselink = 'https://www.jobinrwanda.com'
data =  []
for i in range(4):
  df = DataExtraction(retrieving_link(baselink)[i])
  data.append(df)
df = pd.concat(data,ignore_index = True)
df

Unnamed: 0,Job Titles,Çompanies,Experience,Published Date,Deadline,Number of Position,Contract Type,Description content,Sectors,Application Link,Type
0,Procurement Manager,access-finance-rwanda-afr,Senior (5+ years of experience),31-10-2022,10-11-2022,1,Full-time,[Advertisement for Recruitment of the Procurem...,"Administration, Business, Logistics, Other, Pr...",https://www.jobinrwanda.com/form/default-job-a...,Job
1,Customer Service Officer,tic-tac-toe,Entry level (1 to 3 years of experience),03-11-2022,11-11-2022,2,Full-time,"[Job Description & Responsibilities:, Assist i...",Marketing and sales,https://www.jobinrwanda.com/form/default-job-a...,Job
2,Operations Manager,gardaworld,Senior (5+ years of experience),03-11-2022,14-11-2022,1,Full-time,"[Job Description – Operations Manager, , Posit...","Administration, Business, Management, Other",https://www.jobinrwanda.com/form/default-job-a...,Job
3,Finance Specialist (FS),peace-corps-rwanda,Senior (5+ years of experience),28-10-2022,11-11-2022,1,Full-time,"[Vacancy Announcement:, Finance Specialist (FS...",Other,https://www.jobinrwanda.com/form/default-job-a...,Job
4,Account Manager- Credit,yellow,Not specified,27-10-2022,27-11-2022,1,Full-time,"[Position: Account Manager- Credit, Locations:...",Other,https://www.jobinrwanda.comhttps://www.yellow....,Job
...,...,...,...,...,...,...,...,...,...,...,...
167,Supply and Installation of Medical Equipment,rwanda-medical-supply-ltd,Not specified,19-10-2022,17-11-2022,1,Tender,"[Invitation for Bids, TITLE: SUPPLY AND INSTA...",Other,No Link Available,Tender
168,Supply and Delivery of Laboratory Commodities ...,rwanda-medical-supply-ltd,Not specified,19-10-2022,22-11-2022,1,Tender,"[Invitation for Bids, TITLE: SUPPLY AND DELIV...",Other,No Link Available,Tender
169,"Supply of Laboratory Reagents, Equipments and ...",rwanda-medical-supply-ltd,Not specified,19-10-2022,15-11-2022,1,Tender,"[Invitation for Bids, TITLE: SUPPLY OF LABORA...",Other,No Link Available,Tender
170,Labor Mobility and Human Development Intern,international-organization-migration-iom-0,Entry level (1 to 3 years of experience),09-11-2022,22-11-2022,1,Internship,"[CALL FOR APPLICATIONS FOR INTERNSHIP, , Posit...","Administration, Other, Project management, Soc...",https://www.jobinrwanda.com/form/default-job-a...,Internship
