# (1) Project 4 - Web-Scraping

-----
Project 4 requires scraping job data from a job listings website. I will scrape the data from au.indeed.com. Once the data is scraped and cleaned, I will use it to try to answer two questions. First, I will build a model to predict the salary of a job. Second, using the same scraped data, I will build a model to predict job titles, to see which features distinguish a data science job from a business analyst job.

---

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

----
I am going to scrape my data from au.indeed.com. I will be scraping data on several features including Job Title, Company, Location, Salary, Job Summary, and Job Description.

I will be scraping data from the 5 main cities in Australia: Sydney, Melbourne, Brisbane, Perth, and Adelaide. I will also be scraping 3 types of job: data scientist, data analyst, and business analyst.

---

## Scraping

In [2]:
# Create lists of the cities and jobs to scrape.
indeed_cities = ['Sydney', 'Melbourne', 'Brisbane', 'Perth', 'Adelaide']

In [3]:
indeed_jobs = ['data+scientist', 'data+analyst', 'business+analyst']
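
Before running the scrape, it helps to know how many search-result pages will be requested. A minimal sketch, assuming the same two lists and the 100 page offsets used in the scraping loop; `page_offsets` and `search_requests` are names introduced here for illustration only.

```python
from itertools import product

indeed_cities = ['Sydney', 'Melbourne', 'Brisbane', 'Perth', 'Adelaide']
indeed_jobs = ['data+scientist', 'data+analyst', 'business+analyst']

# Offsets 0, 10, ..., 990: the same range the scraping loop iterates over.
page_offsets = range(0, 1000, 10)

# Every (job, city, offset) combination is one search-results request.
search_requests = list(product(indeed_jobs, indeed_cities, page_offsets))
print(len(search_requests))  # 3 jobs * 5 cities * 100 offsets = 1500
```

Each listing found on those pages triggers one further request for its description, so the total number of requests is considerably higher.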

----
I have created a loop using requests and BeautifulSoup to scrape the data for the different jobs and locations.

---

In [4]:
results = []
desc = []

for job in indeed_jobs:      # Iterate through the jobs list.
    for city in indeed_cities:     # Iterate through the cities list.
        for page in range(0, 1000, 10):   # Iterate through 100 indeed.com pages (page numbers are in increments of 10).
            url = 'https://au.indeed.com/jobs?q=' + job + '&l=' + city + '&start=' + str(page)  # Allows multiple jobs, cities, and pages in the url.
            html = requests.get(url)
            soup = BeautifulSoup(html.text, 'lxml')
            for result in soup.find_all('div', {'class':' row result'}):     # Get data from each job listing.
                results.append(result)    # Append the listing to the results list.
                a = result.find('a')      # Follow the listing's first link to scrape the description.
                if a is None or 'href' not in a.attrs:
                    desc.append(None)     # Keep desc aligned with results; get_desc returns 'NA' for None.
                    continue
                ab = 'https://au.indeed.com/' + a.attrs['href']
                html2 = requests.get(ab)
                soup2 = BeautifulSoup(html2.text, 'lxml')
                desc.append(soup2)      # Append the parsed job page to the desc list.
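
The search URL above is built by string concatenation, which relies on the job names already containing `+` in place of spaces. A minimal sketch of an alternative, assuming the same `q`, `l`, and `start` query parameters; `build_search_url` and `BASE_URL` are hypothetical names, not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = 'https://au.indeed.com/jobs'

def build_search_url(job, city, start):
    """Return a search URL; urlencode escapes spaces as '+' and other unsafe characters."""
    query = urlencode({'q': job, 'l': city, 'start': start})
    return BASE_URL + '?' + query

print(build_search_url('data scientist', 'Sydney', 10))
# https://au.indeed.com/jobs?q=data+scientist&l=Sydney&start=10
```

With this approach the job list could hold plain names such as 'data scientist', and the encoding would be handled in one place.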

## Functions

---
Now I need to create functions that will allow me to extract the specific data for each feature, e.g. Job Title, Salary, etc.

---

In [5]:
# Job Title.
def get_job(result):   
    try:
        return result.find('a', {'data-tn-element':'jobTitle'}).text    # Match the title link by its data-tn-element attribute.
    except AttributeError:
        return 'NA'      # find() returned None, so no job title was found.

In [6]:
# Company
def get_comp(result):
    try:
        return result.find('span', {'class':'company'}).text
    except AttributeError:     # find() returned None, so no company was found.
        return 'NA'

In [7]:
# Location.
def get_loc(result):
    try:
        return result.find('span', {'class':'location'}).text
    except AttributeError:     # find() returned None, so no location was found.
        return 'NA'

In [8]:
# Salary.
def get_sal(result):
    try:
        return result.find('span', {'class':'no-wrap'}).text
    except AttributeError:     # find() returned None, so no salary was listed.
        return 'NA'

In [9]:
# Summary.
def get_summ(result):
    try:
        return result.find('span', {'class':'summary'}).text
    except AttributeError:     # find() returned None, so no summary was found.
        return 'NA'

In [10]:
# Description.
def get_desc(desc):
    try:
        return desc.find('td', {'class' : "snip"}).text
    except AttributeError:     # No matching cell (or no page), so no description was found.
        return 'NA'
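
The six functions above share one pattern: find a tag, return its text, and fall back to 'NA'. A minimal sketch of a single helper that captures this pattern, assuming Beautifulsoup's `find` and `get_text`; `get_text_or_na` is a hypothetical name and the sample markup is made up. Unlike `.text`, `get_text(strip=True)` also trims surrounding whitespace.

```python
from bs4 import BeautifulSoup

def get_text_or_na(node, tag, attrs):
    """Return the stripped text of the first matching tag, or 'NA' if it is absent."""
    found = node.find(tag, attrs)
    return found.get_text(strip=True) if found is not None else 'NA'

# Made-up listing markup, used only to exercise the helper.
sample = BeautifulSoup(
    '<div><span class="company"> Example Co </span></div>', 'html.parser'
)
company = get_text_or_na(sample, 'span', {'class': 'company'})
salary = get_text_or_na(sample, 'span', {'class': 'no-wrap'})
print(company, salary)
```

Checking for `None` explicitly avoids relying on an exception for the common case of a listing without a salary.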

## Dataframe

----
Now I will create two empty dataframes. The first holds the main data from the job listings (the results list). The second holds the job descriptions, which were obtained by following each job listing's url and scraping the linked page (the desc list). I will then concatenate them.

---

In [11]:
jobs0 = pd.DataFrame(columns=['location', 'title', 'company', 'salary', 'summary'])

In [12]:
jobs1 = pd.DataFrame(columns=['description'])

----
Now I will create 2 for loops to iterate through the scraped data in the results and desc lists and use the functions to add the data to the dataframes.

---

In [13]:
for entry in results:
    location = get_loc(entry)   # use functions to get location data.
    title = get_job(entry)
    company = get_comp(entry)
    salary = get_sal(entry)
    summary = get_summ(entry)
    jobs0.loc[len(jobs0)] = [location, title, company, salary, summary]     # add data to dataframe.

for entry in desc:
    description = get_desc(entry)
    jobs1.loc[len(jobs1)] = [description]    
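
Adding rows one at a time with `.loc` becomes slow as the dataframe grows. A minimal sketch of an alternative that collects the rows first and builds the dataframe once; the records below are made-up stand-ins for the values the extraction functions return, and `jobs_frame` is a name introduced here.

```python
import pandas as pd

# Made-up rows standing in for the values the get_* functions return.
records = [
    {'location': 'Sydney NSW', 'title': 'Data Scientist', 'company': 'Example Co',
     'salary': 'NA', 'summary': 'Build models.'},
    {'location': 'Perth WA', 'title': 'Data Analyst', 'company': 'Sample Pty Ltd',
     'salary': '$90,000 a year', 'summary': 'Report on data.'},
]

columns = ['location', 'title', 'company', 'salary', 'summary']
jobs_frame = pd.DataFrame(records, columns=columns)   # One construction instead of one insert per row.
print(jobs_frame.shape)
```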

In [14]:
jobs0 = jobs0.replace(r'\n',' ', regex=True)     # Can use regular expression to remove new lines (\n) from text in the dataframe.

In [15]:
jobs1 = jobs1.replace(r'\n',' ', regex=True)
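
With `regex=True`, `replace` substitutes every match inside each string rather than requiring the whole cell to match. A small illustration on a made-up dataframe (`toy` and `cleaned` are names used only here):

```python
import pandas as pd

toy = pd.DataFrame({'summary': ['Line one\nLine two', 'No newline here']})

# Each newline inside a string is replaced by a single space.
cleaned = toy.replace(r'\n', ' ', regex=True)
print(cleaned['summary'].tolist())
```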

In [16]:
jobs0.head(15)

| | location | title | company | salary | summary |
|---|---|---|---|---|---|
| 0 | Sydney NSW | Junior Data Analyst/Scientist | International Institute of Data & Analytics | | In data science and big data anal... |
| 1 | Sydney NSW | Senior Data Scientist | Amazon.com | | A Senior Data Scientist will:. Ou... |
| 2 | Sydney NSW | Research Scientist | Amazon.com | | Data drives the development of new process; D... |
| 3 | Macquarie University NSW | Data Science Research Engineer | Macquarie University | $100,706 - $112,058 a year | We are seeking a Data Science Res... |
| 4 | Sydney NSW | Data Scientist (Sydney) | C3 IoT | | In this capacity, you will partic... |
| 5 | Sydney NSW | Data Scientist | BuildingIQ | | We are looking for a Data Scienti... |
| 6 | Sydney NSW | Data Scientist | Fortune Select | | Experience using and maintaining ... |
| 7 | Alexandria NSW | Data Scientist | Black Swallow Boutique | | We are looking for an experienced... |
| 8 | Sydney NSW | Data & Analytics | KPMG | | Using Data Science techniques and... |
| 9 | Sydney NSW | Data Scientist | SAI Global | | Proficiency with Data Visualisati... |


In [17]:
jobs0.shape

(17710, 5)

In [18]:
print(jobs0.duplicated().sum())

13305


In [19]:
print(jobs1.duplicated().sum())

7609


---
Using df.shape shows that there are 17,710 job records in the dataframe. However, checking for duplicates shows 13,305 duplicate rows in the first dataframe and 7,609 in the description dataframe. I will need to drop these duplicates.
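
By default, `duplicated()` marks every repeat of a row except its first occurrence, so the number of rows that remain after dropping duplicates is the total minus that count (here 17,710 - 13,305 = 4,405 for the listings). A small illustration on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Data Scientist', 'Data Scientist', 'Data Analyst', 'Data Scientist'],
    'company': ['Example Co', 'Example Co', 'Sample Pty Ltd', 'Example Co'],
})

# Rows 1 and 3 repeat row 0, so two rows are flagged as duplicates.
n_duplicates = int(toy.duplicated().sum())
n_unique = len(toy) - n_duplicates
print(n_duplicates, n_unique)
```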

----

In [20]:
jobs1.head(50)

| | description |
|---|---|
| 0 | The International Institute of Data & Analyti... |
| 1 | Excited by using massive amounts of data to d... |
| 2 | Business today operates at the pinnacle of th... |
| 3 | |
| 4 | In this capacity, you will participate in the... |
| 5 | Job Description \| Location – Sydney, Australi... |
| 6 | Location: Sydney Job Type: Permanent Skills... |
| 7 | We are looking for an experienced Data Scient... |
| 8 | Be a key part of our growth & innovation stra... |
| 9 | At SAI Global, we make Intelligent Risk possi... |


In [21]:
jobs1.shape

(17710, 1)

---
I will now concatenate the two dataframes into one dataframe.

---

In [22]:
indeed_jobs = pd.concat([jobs0, jobs1], axis=1)
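
`pd.concat` with `axis=1` pairs rows by index label, not by position. The two dataframes here both have a 0 to n-1 index and the same number of rows (17,710), so each listing lines up with its description. A minimal sketch on made-up data, with a length check as a guard; `left`, `right`, and `combined` are names introduced for illustration.

```python
import pandas as pd

left = pd.DataFrame({'title': ['Data Scientist', 'Data Analyst']})
right = pd.DataFrame({'description': ['Builds models.', 'Writes reports.']})

# Both frames have the default 0..n-1 index, so rows pair up one to one.
combined = pd.concat([left, right], axis=1)

# If the lengths differed, the extra rows would be padded with NaN.
assert len(left) == len(right) == len(combined)
print(combined.shape)
```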

In [23]:
indeed_jobs.head(15)

| | location | title | company | salary | summary | description |
|---|---|---|---|---|---|---|
| 0 | Sydney NSW | Junior Data Analyst/Scientist | International Institute of Data & Analytics | | In data science and big data anal... | The International Institute of Data & Analyti... |
| 1 | Sydney NSW | Senior Data Scientist | Amazon.com | | A Senior Data Scientist will:. Ou... | Excited by using massive amounts of data to d... |
| 2 | Sydney NSW | Research Scientist | Amazon.com | | Data drives the development of new process; D... | Business today operates at the pinnacle of th... |
| 3 | Macquarie University NSW | Data Science Research Engineer | Macquarie University | $100,706 - $112,058 a year | We are seeking a Data Science Res... | |
| 4 | Sydney NSW | Data Scientist (Sydney) | C3 IoT | | In this capacity, you will partic... | In this capacity, you will participate in the... |
| 5 | Sydney NSW | Data Scientist | BuildingIQ | | We are looking for a Data Scienti... | Job Description \| Location – Sydney, Australi... |
| 6 | Sydney NSW | Data Scientist | Fortune Select | | Experience using and maintaining ... | Location: Sydney Job Type: Permanent Skills... |
| 7 | Alexandria NSW | Data Scientist | Black Swallow Boutique | | We are looking for an experienced... | We are looking for an experienced Data Scient... |
| 8 | Sydney NSW | Data & Analytics | KPMG | | Using Data Science techniques and... | Be a key part of our growth & innovation stra... |
| 9 | Sydney NSW | Data Scientist | SAI Global | | Proficiency with Data Visualisati... | At SAI Global, we make Intelligent Risk possi... |


In [24]:
indeed_jobs.tail(15)

| | location | title | company | salary | summary | description |
|---|---|---|---|---|---|---|
| 17695 | Adelaide SA | Customer Service Specialist | Talent Options | | Accurate data entry skills. This ... | Customer Services Specialist Our client is s... |
| 17696 | Adelaide SA | Sales Agronomist (Regional) | Farmers Edge | | From seed selection to yield data... | Farmers Edge is a global leader in decision a... |
| 17697 | Adelaide SA | Animation Supervisor | Rising Sun Pictures (RSP) | | Please review our Privacy Policy ... | Reporting directly to CG and VFX Supervisors,... |
| 17698 | Adelaide SA | VFX Supervisor | Rising Sun Pictures (RSP) | | Please review our Privacy Policy ... | The VFX Supervisor is required to supervise, ... |
| 17699 | Adelaide SA | CG Supervisor | Rising Sun Pictures (RSP) | | Please review our Privacy Policy ... | The CG Supervisor will supervise, monitor, di... |
| 17700 | Adelaide SA | Management Accountant | EGM Partners | | Work collaboratively with the Fin... | We are resruiting a Senior/Management Account... |
| 17701 | Adelaide SA | Data Analyst | Piper Talent Pty Ltd | | Microsoft Data Insights developme... | About my client A big company with a small fe... |
| 17702 | Adelaide City Centre SA | Data Scientist / Data Miner | Peoplebank | | Experience in working in Big Data... | Peoplebank is a preferred supplier to number ... |
| 17703 | Adelaide SA | CG Supervisor | Rising Sun Pictures (RSP) | | Please review our Privacy Policy ... | The CG Supervisor will supervise, monitor, di... |
| 17704 | Adelaide SA | Senior Test Analyst | TicToc Home | | You'll have a passion for breakin... | about TicToc The World's First Instant Home L... |


In [25]:
indeed_jobs.shape

(17710, 6)

---
I will now drop the duplicates. Because jobs0 has far more duplicates than jobs1, I will identify duplicates using only the jobs0 columns (location, title, company, salary, and summary), so that every repeated listing is removed even when its scraped description differs.

---

In [26]:
indeed_jobs.drop_duplicates(['location', 'title', 'company', 'salary', 'summary'], inplace=True)
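
Passing a list of columns restricts the comparison to those columns; for each group of matching rows, the first occurrence is kept along with whatever description it carries. A small illustration on made-up data (`toy` and `deduped` are names used only here), which also resets the index so it runs from 0 again:

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Data Scientist', 'Data Scientist', 'Data Analyst'],
    'company': ['Example Co', 'Example Co', 'Sample Pty Ltd'],
    'description': ['Full text A', 'NA', 'Full text B'],
})

# Rows 0 and 1 match on title and company, so only row 0 (and its description) survives.
deduped = toy.drop_duplicates(subset=['title', 'company']).reset_index(drop=True)
print(deduped['description'].tolist())
```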

In [27]:
indeed_jobs.shape

(4405, 6)

---
Having removed the duplicates, there are 4,405 unique records remaining.

---

I have now finished the scrape and dropped the duplicates. I can save the data in csv format for future EDA.

---

In [29]:
indeed_jobs.to_csv('indeed_jobs.csv', index=False)
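
One thing to be aware of when reloading this file: by default, `pd.read_csv` treats the string 'NA' as a missing value, so the 'NA' placeholders returned by the extraction functions come back as NaN. A minimal sketch using an in-memory buffer instead of a file; `toy`, `default_read`, and `literal_read` are names introduced for illustration.

```python
import io
import pandas as pd

toy = pd.DataFrame({'title': ['Data Scientist'], 'salary': ['NA']})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default behaviour: 'NA' is parsed as a missing value.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps 'NA' as a literal string.
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read['salary'].isna().sum(), literal_read['salary'].tolist())
```

Either behaviour can be useful for the later EDA; the point is to choose one deliberately.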

---
[End of Scraping (1). See file(2) for EDA]