<a href="https://colab.research.google.com/github/skevin-dev/NLP-FELLOWSHIP/blob/week4/Shyaka_Kevin_Hackathon_File.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## In this notebook, following tasks will be conducted: 
1. Write a program which takes array of urls for job-posting websites in Rwanda
   * https://www.jobinrwanda.com/  (Job Adverts)
   * https://www.umucyo.gov.rw  ( Tenders)

2. Web-scrap the content and put them in a pandas data-frame.

3. Use regular expression and key-words to retrieve IT/Software related job/consultancy opportunities(Tenders).

4. Use EasyNMT to display the results in 50+ languages using wrapped state of art models.

5. Hosting of our work example in colab

In [11]:
!pip install -U easynmt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [12]:
!pip install fastapi pyngrok uvicorn nest-asyncio


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [13]:
!ngrok authtoken 2HJPtUHZax33oDpfnK1fKxiDPEx_3rHCdzEEgw64LphudF92g


Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml


In [14]:
# import Libraries 

import requests # for requesting a link
from bs4 import BeautifulSoup # to scrap information fron a web 
import pandas as pd # for data wrangling/manipulation 
import re # regular expression 
from easynmt import EasyNMT #for translation 
import json # to read json file
model  = EasyNMT('opus-mt') # for translation
from fastapi import FastAPI, Request #api 
from fastapi.responses import HTMLResponse

## Data Extraction 

## Job in Rwanda 

In [15]:
def DataExtraction(link):
  """ This function helps to extract data about job in rwanda given a link 
  
  Args
  ----
  link:str
      link of the page to be scrapped 

  returns
  -------
  df:dataframe
     a dataset containing all information 
  """
  content =  requests.get(link).content
  soup =  BeautifulSoup(content, "html.parser")


  # find all jobs
  list_of_jobs = soup.find_all('span',class_='field field--name-title field--type-string field--label-hidden')
  jobs_titles = [list_of_jobs[index].text for index in range(len(list_of_jobs))]

  # extract companies 
  content_ = soup.find_all('p',class_='card-text')
  companies = [content_[index].find('a')['href'].split('/')[-1] for index in range(len(content_))]

  # extract experience
  experience = [content_[index].text.split('\n')[-3].strip() for index in range(len(content_))]

  # extract published dates 
  published_dates = [content_[index].text.split('\n')[-6].strip().split(' ')[2] for index in range(len(content_))] 

  # extract deadlines 
  deadlines_list = soup.find_all('time',class_='datetime')

  deadlines = [deadlines_list[d].text for d in range(len(deadlines_list))]
  
  # job description
  descriptions = soup.find_all('div',class_='card-body p-2')
  descriptions_links = ['https://www.jobinrwanda.com'+descriptions[index].find('a')['href']for index in range(len(descriptions))]

  # type of the job
  types = [content_[index].find('span').getText() for index in range(len(content_))]
  
  # # creating empty lists 
  # location= []
  Sector = []
  position = []
  contract = []
  application_link = []
  description = []
  # iterate through list of description links 
  for link_ in descriptions_links:
    content_ =  requests.get(link_).content
    soup_ =  BeautifulSoup(content_, "html.parser")
    b= soup_.find('ul',class_="list-group list-group-flush").text
    c =  b.split('\n')
    # remove unnecessary white spaces 
    res = [ele for ele in c if ele.strip()]

    text ="\n".join([" ".join(line.split()) for line in res])

    # retrieving information
    
    # sector
    sector = re.findall('Sector:\n(.+)\n',text)
    Sector.append(" ".join(sector))

    #contract type
    contract.append("".join(re.findall('Contract\stype\n:\s+(.+)\n',text)))

    #positions
    position.append("".join(re.findall('positions:\s(\d+)',text))) 

    #application link
    try:

      f = soup_.find('li',class_='list-group-item px-0 pb-0 job-apply-btn d-grid').find('a')['href']
      application_link.append('https://www.jobinrwanda.com'+f)

    except:
      application_link.append('No Link Available')

    #description
    des = soup_.find('div',class_='clearfix text-formatted field field--name-field-job-full-description field--type-text-long field--label-hidden field__item').text
    description.append(" ".join(des.split('\n')))
 
 
  # create an empty dataframe 
  df = pd.DataFrame()

  # create columns and assigning values 
  df['Job Titles'] = jobs_titles
  df['Çompanies'] = companies 
  df['Experience'] = experience
  df['Published Date']  =published_dates
  df['Deadline'] = deadlines
  # df['Location'] = location
  df['Number of Position'] = position
  df['Contract Type'] = contract
  # df['Education Level']=education
  df['Description content'] =description
  df['Sectors'] = Sector
  df['Application Link'] = application_link
  df['Type'] = types
 
  return df

In [16]:
def retrieving_link(baselink):
  """this function helps to retrieve all the links including jobs,tenders,internship and consultancy

  Args
  ----
  Baselink:str
     The baselink of the page of job in rwanda websites 

  Returns
  -------
  jobs_link: the link of all jobs
  consultancy_link : the link of all consultancies
  tender_link: the link of all tenders
  internship_link: the link of all internships
  """
  content =  requests.get(baselink).content
  soup =  BeautifulSoup(content, "html.parser")
  list_of_links = []
  for name in ['all','consultancy','tender','internships']:
    name = baselink + soup.find('a',class_='nav-link px-1 text-primary nav-link--jobs-{}'.format(name))['href']
    list_of_links.append(name)

  return list_of_links

### Loop through different links and extract data 

In [17]:
baselink = 'https://www.jobinrwanda.com'
data =  []
for i in range(4):
  df = DataExtraction(retrieving_link(baselink)[i])
  data.append(df)
df = pd.concat(data,ignore_index = True)
df

Unnamed: 0,Job Titles,Çompanies,Experience,Published Date,Deadline,Number of Position,Contract Type,Description content,Sectors,Application Link,Type
0,Operations Manager,gardaworld,Senior (5+ years of experience),03-11-2022,14-11-2022,1,Full-time,Job Description – Operations Manager Position...,"Administration, Business, Management, Other",https://www.jobinrwanda.com/form/default-job-a...,Job
1,Account Manager- Credit,yellow,Not specified,27-10-2022,27-11-2022,1,Full-time,Position: Account Manager- Credit Locations: M...,Other,https://www.jobinrwanda.comhttps://www.yellow....,Job
2,Legal Manager,copedu-plc,Mid career (3 to 5 years of experience),12-11-2022,18-11-2022,1,Full-time,"NOTICE OF RECRUITMENT COPEDU PLC, is a trading...","Finance and investment, Law, Other",https://www.jobinrwanda.com/form/default-job-a...,Job
3,District Coordinator/Junior District Manager,earthenable-rwanda,Not specified,11-11-2022,30-11-2022,1,Full-time,JOB DESCRIPTION: DISTRICT COORDINATOR/JUNIOR D...,"Management, Other",https://www.jobinrwanda.comhttps://docs.google...,Job
4,Radio Sales Executives,royal-radio-ltd,Not specified,10-11-2022,15-11-2022,3,Full-time,JOB OPPORTUNITY 94.3 ROYAL FM is opening up a ...,"Marketing and sales, Other",https://www.jobinrwanda.com/form/default-job-a...,Job
...,...,...,...,...,...,...,...,...,...,...,...
162,Supply and Installation of Medical Equipment,rwanda-medical-supply-ltd,Not specified,19-10-2022,17-11-2022,1,Tender,Invitation for Bids TITLE: SUPPLY AND INSTALL...,Other,No Link Available,Tender
163,Supply and Delivery of Laboratory Commodities ...,rwanda-medical-supply-ltd,Not specified,19-10-2022,22-11-2022,1,Tender,Invitation for Bids TITLE: SUPPLY AND DELIVER...,Other,No Link Available,Tender
164,"Supply of Laboratory Reagents, Equipments and ...",rwanda-medical-supply-ltd,Not specified,19-10-2022,15-11-2022,1,Tender,Invitation for Bids TITLE: SUPPLY OF LABORATO...,Other,No Link Available,Tender
165,Labor Mobility and Human Development Intern,international-organization-migration-iom-0,Entry level (1 to 3 years of experience),09-11-2022,22-11-2022,1,Internship,CALL FOR APPLICATIONS FOR INTERNSHIP Position...,"Administration, Other, Project management, Soc...",https://www.jobinrwanda.com/form/default-job-a...,Internship


### Extract all data from features 

In [18]:
baselink = 'https://www.jobinrwanda.com'

content =  requests.get(baselink).content
soup =  BeautifulSoup(content, "html.parser")
features_link = baselink + soup.find('a',class_='nav-link px-1 active bg-primary border-primary text-white nav-link--jobs-featured')['href']

In [19]:
df1_all_features = DataExtraction(features_link)

In [20]:
df1_all_features 


Unnamed: 0,Job Titles,Çompanies,Experience,Published Date,Deadline,Number of Position,Contract Type,Description content,Sectors,Application Link,Type
0,Hiring a Firm for BRD Maintenance Assessment,development-bank-rwanda-brd,Not specified,10-11-2022,25-11-2022,1,Contract,RE-ADVERTISEMENT OF TENDER : Nº: 022/10/2022/B...,Other,No Link Available,Consultancy
1,Construction du Guichet de Byumba,reseau-interdiocesain-de-microfinance-rim-ltd,Not specified,10-11-2022,21-11-2022,1,Tender,AVIS D’APPEL D’OFFRE REFERENCE : No 06/RIM LTD...,Other,No Link Available,Tender
2,Consultant- Monitoring & Evaluation,development-bank-rwanda-brd,Not specified,09-11-2022,08-12-2022,1,Contract,RE-ADVERTISEMENT OF TENDER : RFP No. 017/08/20...,Other,No Link Available,Consultancy
3,Operations Manager,gardaworld,Senior (5+ years of experience),03-11-2022,14-11-2022,1,Full-time,Job Description – Operations Manager Position...,"Administration, Business, Management, Other",https://www.jobinrwanda.com/form/default-job-a...,Job
4,Terms of Reference for Production Of the Secon...,un-women-rwanda,Senior (5+ years of experience),02-11-2022,15-11-2022,1,Contract,TERMS OF REFERENCE FOR PRODUCTION OF THE SECON...,"Economics, Other, Social sciences",No Link Available,Consultancy
...,...,...,...,...,...,...,...,...,...,...,...
165,Rwanda Tree Lead,one-acre-fund,,15-09-2022,20-11-2022,1,Full-time,"About One Acre Fund Founded in 2006, One Acre ...","Agriculture, Agronomy, Business, Management",https://www.jobinrwanda.comhttps://grnh.se/184...,Job
166,Rwanda Potato Seed Venture ﻿﻿Lead,one-acre-fund,Senior (5+ years of experience),14-09-2022,28-11-2022,1,Full-time,"About One Acre Fund Founded in 2006, One Acre ...","Agriculture, Agronomy, Business, Environmental...",https://www.jobinrwanda.comhttps://grnh.se/a09...,Job
167,IT Operations Senior Manager,one-acre-fund,Not specified,07-09-2022,06-12-2022,1,Full-time,"About One Acre Fund Founded in 2006, One Acre ...","Agriculture, Business, Computer and IT, Manage...",https://www.jobinrwanda.comhttps://grnh.se/304...,Job
168,Rwanda Seed Innovation Centre Lead,one-acre-fund,,01-09-2022,29-11-2022,1,Full-time,"About One Acre Fund Founded in 2006, One Acre ...","Agriculture, Agronomy, Business, Environmental...",https://www.jobinrwanda.comhttps://grnh.se/ea3...,Job


# Data Cleaning / filtering 

In [21]:
def is_it(df:pd.DataFrame,col1:str,col2:str):
  """this function to retrieve IT jobs from all data 

  Args
  ----
  df: dataframe
    dataset contains all data

  col1: str
    the first column to look keywords into 

  col2: str 
      the second column 

  Returns
  -------
  df_it: dataframe
     dataset of IT jobs only
  """

  #all IT keywords
  IT_keywords = ['information technology','technology','IT','cyber security','tech','computer science','programming','business','coding','innovation',
               'software','python','information','computer','information security','technology news','java','networking','hacking','programmer','linux',
               'technology rocks','coder','technology these days','cloud computing','education','engineering','it services','new technology','data analysis','data science','AI','machine learning']

  # creating a regex pattern for keyowrds
  keyword_pattern = re.compile( "|".join(IT_keywords))

  numbers = []

  numbers_sector = []

  # loop through columns to see if there are keywords 
  for index in range(len(df)):
    # if a list because, description can be a text also 
    if type(df[col1][0]) is list:
      description = " ".join(df[col1][index])
      numbers.append(len(set(re.findall(keyword_pattern,description))))
    else: 
      numbers.append(len(set(re.findall(keyword_pattern,df[col1][index]))))

    numbers_sector.append(len(set(re.findall(keyword_pattern,df[col2][index]))))

  df['Number of IT keywords appeared in {}'.format(col1)] = numbers 
  df['number of keywords in this columns {}'.format(col2)] = numbers_sector

  # filtering IT jobs ( jobs that have at least 5 keywords in entire description and at least 1 in sector)
  df_it = df.loc[(df['Number of IT keywords appeared in {}'.format(col1)] >= 5) & (df['number of keywords in this columns {}'.format(col2)] >=1)]


  # reset index since we did filtering 
  df_it.reset_index(inplace=True)
  
  # delete the existing index 
  del df_it['index']

  return df_it

In [22]:
df_it = is_it(df1_all_features,"Description content","Sectors")

In [23]:
df_it

Unnamed: 0,Job Titles,Çompanies,Experience,Published Date,Deadline,Number of Position,Contract Type,Description content,Sectors,Application Link,Type,Number of IT keywords appeared in Description content,number of keywords in this columns Sectors
0,Product Manager,ampersand-rwanda-ltd,Senior (5+ years of experience),10-11-2022,10-01-2023,1,Full-time,Do you want to do work that matters? Do you wa...,"Computer and IT, Management, Project management",https://www.jobinrwanda.comhttps://ampersandel...,Job,5,1
1,Global Client Data Analytics Consultant,one-acre-fund,Senior (5+ years of experience),08-11-2022,20-12-2022,1,Contract,"About One Acre Fund Founded in 2006, One Acre ...","Agriculture, Business, Demography and data ana...",https://www.jobinrwanda.comhttps://grnh.se/ae4...,Consultancy,6,1
2,IT and MIS Director,chancen-international,Senior (5+ years of experience),31-10-2022,18-11-2022,1,Full-time,CHANCEN International is a non-profit organiza...,"Computer and IT, Other, Project management",https://www.jobinrwanda.com/form/default-job-a...,Job,10,1
3,IT Operations Senior Manager,one-acre-fund,Not specified,07-09-2022,06-12-2022,1,Full-time,"About One Acre Fund Founded in 2006, One Acre ...","Agriculture, Business, Computer and IT, Manage...",https://www.jobinrwanda.comhttps://grnh.se/304...,Job,6,1
4,Senior Business Analyst,one-acre-fund,,22-08-2022,17-11-2022,1,Full-time,"ABOUT ONE ACRE FUND Founded in 2006, One Acre ...","Agriculture, Business, Demography and data ana...",https://www.jobinrwanda.comhttps://grnh.se/c02...,Job,5,1


## Umucyo 

In [24]:
def DataExtractionUmucyo(filepath: str):
  """this function helps to extract data from umucyo

  Args
  ----
  filepath:str
     filepath of txt file containing html file of all page after navigating view frame source 
       
  returns
  -------
  df: dataframe
      dataset of the data 
  """
  # given the path to text file of format of the page, I extract 50 records 
  page_content = ""
  with open(filepath,'r') as file:
    for line in file.readlines():
      page_content +=line
  soup =  BeautifulSoup(page_content, "html.parser")

  table = soup.find('table',class_='article_table mb10')

  # extract headers'name of the table 
  headers = [re.sub('[^A-Za-z0–9]',"",title.text) for title in table.find_all('th')[1:]]

  # extract data 
  data = []
  for row in table.find_all('tr')[1:]:
    d =  row.find_all('td')[1:]
    row_data = [ele.text.strip() for ele in d]
    data.append(row_data)

  # create a dataframe 
  df = pd.DataFrame(data,columns =headers)

  return df 

In [25]:
df_umucyo = DataExtractionUmucyo('/content/umucyo_frame_work.txt')

In [26]:
df_umucyo

Unnamed: 0,TenderName,TenderNo,Status,AdvertisingDate,DeadlineofSubmitting,PlanedOpenDate,StageType
0,Hiring consultant Urban Mobility (Public trans...,000008/C/ICB/2022/2023/CoK,Published,10/11/2022,27/12/2022 10:00,27/12/2022 10:30,one stage
1,Hiring consultant Transport economist to suppo...,000009/C/ICB/2022/2023/CoK,Published,10/11/2022,27/12/2022 10:00,27/12/2022 10:30,one stage
2,science kits and laboratory equipment,000001/G/ICB/2022/2023/REB,Published,03/11/2022,20/12/2022 10:00,20/12/2022 10:30,one stage
3,Hiring an international Expert in Data Process...,000008/C/ICB/2022/2023/NISR,Published,02/11/2022,19/12/2022 10:00,19/12/2022 10:15,one stage
4,TENDER FOR OFFICE SUPPLIES AND STATIONARY,000001/G/NCB/2022/2023/Kiziguro,Published,04/11/2022,16/12/2022 10:00,16/12/2022 10:30,one stage
5,supply of labo kits reagents and consumables,000034/G/NCB/2022/2023/RAB,Published,10/11/2022,15/12/2022 11:00,15/12/2022 11:30,one stage
6,Provision of services of raising awareness on ...,000010/NC/NCB/2022/2023/RSB,Published,11/11/2022,15/12/2022 10:00,15/12/2022 10:10,one stage
7,"Rehabilitation of administration bloc, constru...",000001/W/NCB/2022/2023/NIDA,Published,09/11/2022,15/12/2022 10:00,15/12/2022 10:30,one stage
8,Hiring a consultant firm for Study review and ...,000014/C/NCB/2022/2023/CoK,Published,11/11/2022,15/12/2022 09:00,15/12/2022 09:10,one stage
9,Supply of calibration certified reference mate...,000009/G/NCB/2022/2023/RSB,Published,11/11/2022,15/12/2022 09:00,15/12/2022 09:30,one stage


## Data Translation and API 

In [27]:
# load json file containing languages and their codes 

with open('/content/langs.json','r+') as file:
  content =  file.read()
languages_dict = json.loads(content)

In [28]:
app = FastAPI(title='MY FASTAPI') #Starting the FastAPI instance


@app.get('/')#ROUTE
def index():
    return "This is hackathon, to extract data in english add /data/jobinrwanda to the link in your browser. For the language of your add to the link ?language= add your language"


@app.get('/data/jobinrwanda')
def get_data(language:str =None):
  """This function help to get data on the api by entering the language of your choice 

  Args
  ----
  language:str
       the language to be translated into
  Returns
  -------
  json file of the data 
  """
  # copy the dataframe for easy translation(so that translation have to be always from english)
  df_copy = df_it.copy()
  
  # use try and except so that if the user enter the wrong language, the program will not crush 
  try:
    if language:
      # translate the job title, we use .title() to allow any case the use can type the language 

      #Translate job titles 
      df_copy['Job Titles'] = model.translate(df_copy['Job Titles'],target_lang =languages_dict[language.title()])
 
      # translate description. take few first characters to easy translation 
      df_copy['Description content'] = model.translate(df_copy['Description content'][0:500],target_lang =languages_dict[language.title()])

      return df_copy.to_dict('r')
    
    # if the user doesn't include any language it will print the data in english 
    else:
      df_copy['Job Titles'] = model.translate(df_copy['Job Titles'],target_lang ="en")
      return df_copy.to_dict('r')
  
  except:
    message = "The entered language is not available, please use one of this languages: {}".format([keys for keys,values in languages_dict.items()])
    return message 

@app.get('/data/umucyo')
def get_data(language:str =None):
  """This function help to get data on the api by entering the language of your choice 

  Args
  ----
  language:str
       the language to be translated into
  Returns
  -------
  json file of the data 
  """
  # copy the dataframe for easy translation(so that translation have to be always from english)
  df_copy_umucyo = df_umucyo.copy()
  
  # use try and except so that if the user enter the wrong language, the program will not crush 
  try:
    if language:
      # translate the job title, we use .title() to allow any case the use can type the language 

      #Translate tender name 
      df_copy_umucyo['TenderName'] = model.translate(df_copy_umucyo['TenderName'],target_lang =languages_dict[language.title()])
 


      return df_copy_umucyo.to_dict('r')
    
    # if the user doesn't include any language it will print the data in english 
    else:
      df_copy_umucyo['TenderName'] = model.translate(df_copy_umucyo['TenderName'],target_lang ="en")
      return df_copy_umucyo.to_dict('r')
  
  except:
    message = "The entered language is not available, please use one of this languages: {}".format([keys for keys,values in languages_dict.items()])
    return message 

In [29]:
import nest_asyncio
from pyngrok import ngrok
import uvicorn

ngrok_tunnel = ngrok.connect(8000)
print("REST API started")
print("Your public API URL:", ngrok_tunnel.public_url)
print("You can for example open the following URL in your browser: {}?target_lang=en&text=Hallo%20Welt".format(ngrok_tunnel.public_url))

nest_asyncio.apply()
uvicorn.run(app, port=8000)

INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [76]


## Datasets

### all CSV files: https://drive.google.com/drive/folders/1N853LwSHtm6TaPzBm9JQgyNqKX7ZBcdY?usp=sharing

## Report 

###report link: https://drive.google.com/drive/folders/1gPMPGH1VaubvP2lAs2ettC2dQfC0eInk?usp=sharing