### Employment Web Page Scraping and Comparison for IT Job Postings by Raffi Sahakyan

__Task:__ Challenge the claim that _Staff.am_ currently has a significantly higher percentage of IT job announcements than _Careercenter.am_, using any method or approach you find suitable. Your submission must include your code and a report describing the methodology and analyzing the results. You are required to complete the task in Python.

Task Implementation Steps:
    1) Scrape both web pages with BeautifulSoup for job postings
    2) Count the number of occurrences of IT job announcements
    3) Hypothesis Testing

In [11]:
#Basic Necessary Libraries
import numpy as np
import pandas as pd
from itertools import chain
import time
from collections import Counter

#Text Modules Applied
from textblob import TextBlob, Word

#Scraping Modules
import urllib.request
import requests
from urllib.request import urlopen
import bs4
from bs4 import BeautifulSoup

#### 1.1 Scraping careercenter.am with BeautifulSoup

In [12]:
url_cc = "http://www.careercenter.am/ccidxann.php"
html_cc = urlopen(url_cc)
#Parse the announcements index with the lxml parser
soup_cc = BeautifulSoup(html_cc, 'lxml')
title_cc = soup_cc.title
print(title_cc)

<title>Career Center - Announcements Index</title>


In [13]:
all_tables_cc = soup_cc.find_all('table')
print(soup_cc.prettify())

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
 <head>
  <script type="text/javascript">
   <!--
    if (parent.location.href == self.location.href){
      window.location.href = '/';
    } //-->
  </script>
  <meta content="Custom PHP script by http://www.bartellonline.com/" name="generator"/>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <title>
   Career Center - Announcements Index
  </title>
  <base target="main"/>
  <style type="text/css">
   body { background: FFFAA9; margin-left: 3px; font-family: Verdana; font-size: 13px; }
    table { font-family: Verdana; font-size: 13px; }
    p { margin-bottom: 5px; margin-top: 5px; }
  </style>
 </head>
 <body bgcolor="#FFFAA9">
  <p align="center">
   <i>
    <a name="TOP">
     - Last updated:
    </a>
    13 Dec 2018 10:07:52 -
   </i>
  </p>
  <p>
   <strong>
    AVAILABLE TODAY:
   </strong>
  </p>
  <p>
   &amp;nbsp
   <img alt="bullet" border="0" height="6" src="images/bullet_sl.jp

In [14]:
job_cc=[]
all_links_cc=soup_cc.find_all('a')
for link in all_links_cc:
    job_cc.append(link.get_text())

In [15]:
list_to_remove=['- Last updated: ',
 'JOB OPPORTUNITIES',
 'INTERNSHIPS',
 'TRAININGS',
 'NEWS',
 'JOB OPPORTUNITIES','INTERNSHIPS','TRAININGS','Go To Top',
 'NEWS',
 'English Language Courses for Schoolchildren / Career Center',
 'Go To Top']
job_cc = list(set(job_cc).difference(set(list_to_remove)))

In [16]:
company_cc = []
employment_cc = []

for i in job_cc:
    employment_cc.append(i.split(" / ")[0])
    company_cc.append(i.split(' / ')[1])
employment_cc = [i.lower() for i in employment_cc]
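
The split above assumes that every link text contains the ` / ` separator; an entry without it would raise an `IndexError` at `[1]`. A minimal sketch of a more defensive split, assuming such entries should be kept with an empty company name (the helper `split_posting` is hypothetical and not part of the original notebook):

```python
def split_posting(text, sep=" / "):
    """Split 'Job Title / Company' into (lower-cased title, company).

    partition() splits on the first separator only and never raises:
    when the separator is missing, the company part is an empty string.
    """
    title, _, company = text.partition(sep)
    return title.strip().lower(), company.strip()


# Toy inputs modelled on the link texts shown in this notebook.
print(split_posting("UI/ UX Designer / Menu Group"))
print(split_posting("Go To Top"))
```

Entries with an empty company could then be filtered out instead of being listed by hand in `list_to_remove`, although whether that matches the intended cleaning is an assumption.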

In [17]:
#Creating the Dataframe for Careercenter.am
cc_df = pd.DataFrame({"Employment_CC":employment_cc, "Company_CC":company_cc})
cc_df.head(2)

| | Employment_CC | Company_CC |
|---|---|---|
| 0 | ui/ ux designer | Menu Group |
| 1 | receptionist, abovyan branch | Ameriabank |


#### 1.2 Scraping Staff.am with BeautifulSoup

In [38]:
def custom_scraper(url):
    response = requests.get(url)
    page = response.content
    page = BeautifulSoup(page,"html.parser")
    
    titles = page.find_all("div", class_="job-inner job-item-title")
    job = [i.find("p").get_text() for i in titles]
    return job

In [39]:
end_page_staff = 15
url_base_staff = "https://staff.am/en/jobs?page="
job_staff = []

In [40]:
#Iterate over the search result pages 1 through end_page_staff-1 (range excludes its upper bound)
for i in range(1,end_page_staff):
    url = url_base_staff + str(i)
    job = custom_scraper(url)
    job_staff.append(job)
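
Because `range(1, end_page_staff)` excludes its upper bound, the loop above stops at page 14. If page 15 was meant to be scraped as well (an assumption; the notebook does not say), the page URLs could be built with an inclusive bound. The helper `page_urls` below is hypothetical and only constructs the URLs, it does not fetch anything:

```python
def page_urls(base, last_page, first_page=1):
    """Build search-page URLs from first_page through last_page, inclusive."""
    return [base + str(page) for page in range(first_page, last_page + 1)]


urls = page_urls("https://staff.am/en/jobs?page=", 15)
print(len(urls), urls[-1])  # 15 https://staff.am/en/jobs?page=15
```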

In [41]:
employment_staff = list(chain.from_iterable(job_staff))
employment_staff = [i.lower() for i in employment_staff]

#### 2 Count the IT Job Announcements

In [42]:
#IT tags retrieved from a web search of IT professions
#Tags are lower-case because the job titles were lower-cased above; only exact title matches are counted
IT_tags = ['ios developer','node.js developer', 'systems control senior specialist', 'manual qa engineer', 'c++ developer', 'scrum master', 'qa engineer', 'sql developer','python developer','c++ developer','git developer', 'angular developer', 'web qa (developer)', 'react native developer', '.net developer (full stack)', 'python backend developer - remote','react developer', 'software developer', 'r developer','android developer', 'blockchain developer','c# developer','c# software developer','java developer','manual qa engineer', 'qa specialist', 'qa automation engineer', 'automation qa engineer','manual qa engineer','qa engineer','qa intern','quality control manager', 'senior qa engineer','middle qa engineer','ui/ux designer', 'product manager','senior product manager', 'project manager','product owner','scrum master', 'technical project manager','web developer','php developer','js developer','mid/senior mvc developer','php/js developer','backend developer','angular developer'] 

#Creating a Counter object to count over the IT tags
word_counts_staff = Counter(employment_staff)
total_IT_staff = sum(word_counts_staff.get(w,0) for w in IT_tags)
prop_IT_staff = total_IT_staff/len(employment_staff)
print("There are a total of "+str(total_IT_staff)+" IT job postings in Staff.am, "+ str(round(prop_IT_staff*100,2))+'% of Total.')

word_counts_cc = Counter(employment_cc)
total_IT_cc = sum(word_counts_cc.get(w,0) for w in IT_tags)
prop_IT_cc = total_IT_cc/len(employment_cc)
print("-"*100)
print("There are a total of "+str(total_IT_cc)+" IT job postings in Careercenter.am, "+ str(round(prop_IT_cc*100,2))+'% of Total.')


There are a total of 80 IT job postings in Staff.am, 15.56% of Total.
----------------------------------------------------------------------------------------------------
There are a total of 20 IT job postings in Careercenter.am, 13.16% of Total.
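
One caveat about the counting step: `IT_tags` lists some tags more than once (for example `'qa engineer'` and `'manual qa engineer'`), and summing `Counter` lookups over the raw list counts the matching postings once per repetition. A minimal sketch of a count that treats each tag once, using toy data rather than the scraped titles (the helper `count_it_postings` is hypothetical):

```python
from collections import Counter


def count_it_postings(titles, tags):
    """Count titles that exactly match a tag, using each distinct tag once."""
    unique_tags = {tag.lower() for tag in tags}          # drop repeated tags
    counts = Counter(title.lower() for title in titles)  # case-insensitive match
    return sum(counts[tag] for tag in unique_tags)


# Toy data, not scraped results.
titles = ["QA Engineer", "qa engineer", "Accountant", "R Developer"]
tags = ["qa engineer", "qa engineer", "R developer"]

naive = sum(Counter(t.lower() for t in titles).get(tag, 0) for tag in tags)
print(naive)                            # 4: the repeated tag is counted twice
print(count_it_postings(titles, tags))  # 3
```

Whether the totals reported above are affected depends on how many scraped titles match the repeated tags, which cannot be determined from the output shown here.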


#### 3 Hypothesis Testing

Hypothesis Formulation

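H0: IT Job Postings Proportion by Staff.am <= IT Job Postings Proportion by Careercenter.am


H1: IT Job Postings Proportion by Staff.am > IT Job Postings Proportion by Careercenter.am

The claim under test (Staff.am has the higher proportion) is placed in the alternative hypothesis, so that the null hypothesis contains the equality case.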

Significance Level Formulation

Level of significance, $\alpha$ = 0.05 (5%), for a one-tailed test

In [43]:
listofvalues_cc = [0]*(len(employment_cc)-total_IT_cc)+[1]*total_IT_cc
listofvalues_staff = [0]*(len(employment_staff)-total_IT_staff)+[1]*total_IT_staff

In [44]:
#The mean of a 0/1 indicator list is the proportion of IT job postings
hypothesis_df = pd.DataFrame({"Sample Size":[len(employment_cc),len(employment_staff)],
                              "Proportion of IT Job Postings": [np.mean(listofvalues_cc),np.mean(listofvalues_staff)],
                              'Sample Standard Deviation':[np.std(listofvalues_cc),np.std(listofvalues_staff)]},
                             index=['CC.am','Staff.am'])
hypothesis_df

| | Sample Size | Proportion of IT Job Postings | Sample Standard Deviation |
|---|---|---|---|
| CC.am | 152 | 0.131579 | 0.338032 |
| Staff.am | 514 | 0.155642 | 0.362516 |


In [54]:
#Standard error of the difference between two independent sample means: the variances add
std_error = np.sqrt(np.var(listofvalues_cc)/len(employment_cc)+np.var(listofvalues_staff)/len(employment_staff))
t_score = abs(np.mean(listofvalues_cc)-np.mean(listofvalues_staff))/std_error
print("t score with unknown population variance: " + str(round(t_score,2)))

t score with unknown population variance: 0.76


In [53]:
#Conservative degrees of freedom: the smaller sample size minus one
degrees_of_freedom = hypothesis_df['Sample Size'].min()-1
print("Degrees of freedom for t-statistic: " + str(degrees_of_freedom))
t_stat = 1.66 #critical t value for a=0.05 one tail test, tabulated for 100 degrees of freedom and used as a conservative approximation here
print("Comparison of t score with t stat: " + str(t_score>t_stat))

Degrees of freedom for t-statistic: 151
Comparison of t score with t stat: False


The test did not provide statistically significant evidence that the proportion of IT job postings on Staff.am is higher than that on Careercenter.am.
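
As a cross-check, the same question can be posed as a pooled two-proportion z-test, which is a common choice for comparing two proportions at these sample sizes. This is a swapped-in technique, not the method used in the notebook; the function name `two_proportion_z_test` is hypothetical, and the counts are the ones printed in section 2 (80 of 514 for Staff.am, 20 of 152 for Careercenter.am). The sketch uses only the standard library:

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """One-sided pooled z-test of H1: p_a > p_b. Returns (z, p_value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under H0 both samples share one proportion, estimated from the pooled data.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_error
    # Upper-tail probability of the standard normal distribution.
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


z, p_value = two_proportion_z_test(80, 514, 20, 152)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.2f}")
```

With these counts the z statistic stays below the one-sided 5% critical value of about 1.645, so this check also fails to find a significant difference.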