# Your Turn: Build A Pipeline

- Combine Your Knowledge of the Website, `requests` and `bs4`
- Automate Your Scraping Process Across Multiple Pages
- Generalize Your Code For Varying Searches
- Target & Save Specific Information You Want

## Your Tasks:

- Scrape the first 100 available search results
- Generalize your code to allow searching for different locations/jobs
- Pick out information about the URL, job title, and job location
- Save the results to a file

In [25]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

### Part 1: Inspect

- How do the URLs change when you navigate to the next results page?
- How do the URLs change when you use a different location and/or job title search?
- Which HTML elements contain the link, title, and location of each job?

 * Using the same Indeed website: when you navigate to the next results page, the URL gains an additional query parameter, `start`. Its value increases by 10 with every page (`start=0`, `start=10`, `start=20`, and so on).

   - First page: `https://www.indeed.com/jobs?q=python&l=new+york&start=0`
   - Second page: `https://www.indeed.com/jobs?q=python&l=new+york&start=10`

 * A different location or keyword changes the values of the query parameters `q` (job keyword) and `l` (location) in the URL.
 


`https://www.indeed.com/jobs?q=python&l=new+york`

- **Base URL**
    - `https://www.indeed.com/jobs`
- **Query Parameters**
    - Start & Separators: `?`, `&`
    - Information: `q=python`, `l=new+york`
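
The breakdown above can be turned into a small helper. This is a minimal sketch, not part of the original notebook: the name `build_search_url` is hypothetical, and it assumes the `q`, `l` and `start` parameters behave as described in this section.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"

def build_search_url(job, location, start=0):
    """Assemble a search URL from the base URL and its query parameters."""
    # urlencode joins the key=value pairs with "&" and encodes spaces as "+"
    query = urlencode({"q": job, "l": location, "start": start})
    # "?" marks the start of the query string
    return f"{BASE_URL}?{query}"

print(build_search_url("python", "new york"))
print(build_search_url("python", "new york", start=10))
```

Letting `urlencode` build the query string avoids hand-escaping spaces and other special characters in the job title or location.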

### Part 2: Scrape

- Build the code to fetch the first 100 search results. This means you will need to navigate to multiple results pages automatically.
- Write functions that allow you to specify the job title, location, and number of results as arguments.
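
One way to plan the page navigation is to compute the `start` offsets up front. The sketch below is illustrative only: `page_offsets` is a hypothetical helper, and the step of 10 is an assumption taken from the `start=0`, `start=10` pattern observed in Part 1.

```python
def page_offsets(num_results, step=10):
    """Return the `start` values needed to cover at least num_results results."""
    # Ceiling division: round up so a partial last page is still requested
    num_pages = -(-num_results // step)
    return [page * step for page in range(num_pages)]

print(page_offsets(100))
print(page_offsets(25))
```

Each offset can then be placed in the `start` query parameter of one request.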

In [221]:
base_url = "https://countrycode.org/"

In [26]:
# Helper that fetches the raw content of a web page. Code you expect to reuse is worth wrapping in a function.

url = "https://countrycode.org/"

def get_url(url):
    response = requests.get(url).content
    return response 

site_url = get_url(url)
site_url

b'<!DOCTYPE html>\n<!--[if IE 8]>         <html class="ie8" lang="en"> <![endif]-->\n<!--[if IE 9]>         <html class="ie9" lang="en"> <![endif]-->\n<!--[if gt IE 9]><!--> <html lang="en">         <!--<![endif]-->\n    <head>\n        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n        <title>Country Codes, Phone Codes, Dialing Codes, Telephone Codes, ISO Country Codes</title>\n        \n        <meta name="viewport" content="width=device-width, initial-scale=1">\n        <link rel="canonical" href="https://countrycode.org/"/>\n        \n\n        <!-- Favicon -->\n        <link rel="shortcut icon" type="image/x-icon" rel="stylesheet" href="/static/images/favicon.ico" />\n\n        \n        <link href=\'//fonts.googleapis.com/css?family=Montserrat:400,700\' rel=\'stylesheet\' type=\'text/css\'>\n        <link  rel="stylesheet" href="/global-shared/static/global-icons/font-awesome/css/fon

In [28]:
table = pd.read_html(site_url)
table

[                            COUNTRY COUNTRY CODE ISO CODES  POPULATION  \
 0                       Afghanistan           93  AF / AFG    29121286   
 1                           Albania          355  AL / ALB     2986952   
 2                           Algeria          213  DZ / DZA    34586184   
 3                    American Samoa        1-684  AS / ASM       57881   
 4                           Andorra          376  AD / AND       84000   
 5                            Angola          244  AO / AGO    13068161   
 6                          Anguilla        1-264  AI / AIA       13254   
 7                        Antarctica          672  AQ / ATA           0   
 8               Antigua and Barbuda        1-268  AG / ATG       86754   
 9                         Argentina           54  AR / ARG    41343201   
 10                          Armenia          374  AM / ARM     2968000   
 11                            Aruba          297  AW / ABW       71566   
 12                      

In [43]:
table_df = pd.DataFrame(table[-2])

table_df.to_csv("country_code.csv",index=False) #columns=["COUNTRY","COUNTRY CODE ","ISO CODES","POPULATION","AREA KM2","GDP $USD"]
#table_df.head(3)

In [6]:
soup = BeautifulSoup(site_url, "html.parser")  # site_url holds the page bytes returned by get_url above
soup



<!DOCTYPE html>
<!--[if IE 8]>         <html class="ie8" lang="en"> <![endif]--><!--[if IE 9]>         <html class="ie9" lang="en"> <![endif]--><!--[if gt IE 9]><!--><html lang="en"> <!--<![endif]-->
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<title>Country Codes, Phone Codes, Dialing Codes, Telephone Codes, ISO Country Codes</title>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="https://countrycode.org/" rel="canonical"/>
<!-- Favicon -->
<link href="/static/images/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="//fonts.googleapis.com/css?family=Montserrat:400,700" rel="stylesheet" type="text/css"/>
<link href="/global-shared/static/global-icons/font-awesome/css/font-awesome.min_e9898fae19.css" rel="stylesheet"/>
<link href="/global-shared/static/global-icons/entypo/css/entypo.min_e9898fae19.css" rel="stylesheet"/>
<!-- Bootstrap CSS --

In [8]:
results = soup.find("table")  # find() returns the first <table> tag, which should contain the country table
results

<table border="0" class="table table-hover table-striped main-table" data-sort-name="countrycode" data-sort-order="desc" data-toggle="table">
<thead>
<tr>
<th data-field="name" data-sortable="true">
                            COUNTRY
                        </th>
<th data-field="code" data-formatter="country_code" data-sortable="true" data-sorter="code_sort">
                            COUNTRY CODE
                        </th>
<th data-field="iso" data-sortable="true">
                            ISO CODES
                        </th>
<th class="population-col" data-field="population" data-sortable="true" data-sorter="population_sort">
                            POPULATION
                        </th>
<th data-field="area" data-sortable="true" data-sorter="population_sort">
                            AREA KM<span class="superscript">2</span>
</th>
<th data-field="gdp" data-sortable="true" data-sorter="gdp_sort_named">
                            GDP $USD
                        

In [11]:
# Header cells use the tag <th>..</th>, table rows use <tr>..</tr>, and the cell content sits in <td>..</td>
header_rows = results.find("tr")
header_rows 

<tr>
<th data-field="name" data-sortable="true">
                            COUNTRY
                        </th>
<th data-field="code" data-formatter="country_code" data-sortable="true" data-sorter="code_sort">
                            COUNTRY CODE
                        </th>
<th data-field="iso" data-sortable="true">
                            ISO CODES
                        </th>
<th class="population-col" data-field="population" data-sortable="true" data-sorter="population_sort">
                            POPULATION
                        </th>
<th data-field="area" data-sortable="true" data-sorter="population_sort">
                            AREA KM<span class="superscript">2</span>
</th>
<th data-field="gdp" data-sortable="true" data-sorter="gdp_sort_named">
                            GDP $USD
                        </th>
</tr>

In [20]:
table_data = results.find("tbody").find_all("tr")
table_data


[<tr>
 <td><a href="/afghanistan">Afghanistan</a></td>
 <td>93</td>
 <td>AF / AFG</td>
 <td>29,121,286</td>
 <td>647,500</td>
 <td>20.65 Billion</td>
 </tr>, <tr>
 <td><a href="/albania">Albania</a></td>
 <td>355</td>
 <td>AL / ALB</td>
 <td>2,986,952</td>
 <td>28,748</td>
 <td>12.8 Billion</td>
 </tr>, <tr>
 <td><a href="/algeria">Algeria</a></td>
 <td>213</td>
 <td>DZ / DZA</td>
 <td>34,586,184</td>
 <td>2,381,740</td>
 <td>215.7 Billion</td>
 </tr>, <tr>
 <td><a href="/americansamoa">American Samoa</a></td>
 <td>1-684</td>
 <td>AS / ASM</td>
 <td>57,881</td>
 <td>199</td>
 <td>462.2 Million</td>
 </tr>, <tr>
 <td><a href="/andorra">Andorra</a></td>
 <td>376</td>
 <td>AD / AND</td>
 <td>84,000</td>
 <td>468</td>
 <td>4.8 Billion</td>
 </tr>, <tr>
 <td><a href="/angola">Angola</a></td>
 <td>244</td>
 <td>AO / AGO</td>
 <td>13,068,161</td>
 <td>1,246,700</td>
 <td>124 Billion</td>
 </tr>, <tr>
 <td><a href="/anguilla">Anguilla</a></td>
 <td>1-264</td>
 <td>AI / AIA</td>
 <td>13,254</td

In [24]:
country = table_data[0]
country

<tr>
<td><a href="/afghanistan">Afghanistan</a></td>
<td>93</td>
<td>AF / AFG</td>
<td>29,121,286</td>
<td>647,500</td>
<td>20.65 Billion</td>
</tr>
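
A single row like the one above can be turned into a dictionary by reading its `<td>` cells in order. This is a minimal sketch rather than part of the original notebook: `parse_row` and the dictionary keys are hypothetical names, and the row HTML is copied inline so the example runs on its own.

```python
from bs4 import BeautifulSoup

# One table row, copied from the output above
row_html = """
<tr>
<td><a href="/afghanistan">Afghanistan</a></td>
<td>93</td>
<td>AF / AFG</td>
<td>29,121,286</td>
<td>647,500</td>
<td>20.65 Billion</td>
</tr>
"""

def parse_row(row):
    """Extract a few fields from one <tr> element of the country table."""
    cells = row.find_all("td")
    return {
        "country": cells[0].get_text(strip=True),
        "link": cells[0].find("a")["href"],
        "country_code": cells[1].get_text(strip=True),
        "iso_codes": cells[2].get_text(strip=True),
        # Remove the thousands separators before converting to an integer
        "population": int(cells[3].get_text(strip=True).replace(",", "")),
    }

row = BeautifulSoup(row_html, "html.parser").find("tr")
print(parse_row(row))
```

Applying the same function to every element of `table_data` would give one dictionary per country.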

### Part 3: Parse

- Sieve through your HTML soup to pick out only the job title, link, and location
- Format the results in a readable format (e.g. JSON)
- Save the results to a file
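
For the JSON option, the standard library's `json` module is enough. The sketch below is an assumption-laden example: `save_jobs` and the file name `job_search.json` are hypothetical, and the single record only illustrates the title/link/location shape used in this notebook.

```python
import json

# A sample record with the three fields this exercise asks for
jobs = [
    {
        "Job Title": "Software Engineer",
        "Job Link": "https://www.indeed.com/rc/clk?jk=a195d7dabc2ff8e3&fccid=25b5166547bbf543&vjs=3",
        "Job Location": "New York, NY",
    },
]

def save_jobs(jobs, path):
    """Write a list of job dictionaries to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # indent=2 keeps the file readable for humans
        json.dump(jobs, f, indent=2)

save_jobs(jobs, "job_search.json")
```

A list of dictionaries maps directly onto a JSON array of objects, so no conversion step is needed before saving.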

In [225]:

def parse_site(soup):
    results = soup.find("table")
    jobs = results.find_all('div', class_='row')
    base_url = "https://www.indeed.com"
    parsed_jobs = []  # separate name so the `results` element above is not shadowed

    for job in jobs:
        title_link = job.find('h2').find('a')  # the <a> tag inside the job's <h2> tag
        job_title = title_link.text.strip()  # visible link text with surrounding whitespace removed
        job_link = title_link["href"]  # relative URL of the job posting
        job_url = base_url + job_link  # turn the relative link into an absolute URL
        job_loc = job.find(class_="location").text  # text of the element with class "location"
        parsed_jobs.append({"Job Title": job_title, "Job Link": job_url, "Job Location": job_loc})

    return parsed_jobs
parse_site(soup)[10]


{'Job Title': 'Software Engineer',
 'Job Link': 'https://www.indeed.com/rc/clk?jk=a195d7dabc2ff8e3&fccid=25b5166547bbf543&vjs=3',
 'Job Location': 'New York, NY'}

In [226]:
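from urllib.parse import urlencode

# Declare necessary variables
qty_page = 15   # approximate number of results shown on each page
num_jobs = 100  # number of results we want

# Number of full pages to open (integer division, so this can yield slightly fewer than num_jobs results)
page_num = num_jobs//qty_page

def tot_page(title, loc, page_num):
    jobs_tot = []
    for i in range(page_num):
        # `start` advances by 10 for each results page (0, 10, 20, ...)
        query = urlencode({"q": title, "l": loc, "start": i * 10})
        site_content = get_url("https://www.indeed.com/jobs?" + query)  # get_url returns the page bytes
        soup = BeautifulSoup(site_content, "html.parser")
        jobs_tot += parse_site(soup)
    return jobs_tot

Task = tot_page("python","new york",page_num)
Task[75]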

{'Job Title': 'Data Technician (Full- or Part-Time)',
 'Job Link': 'https://www.indeed.com/rc/clk?jk=da727c0cddda240e&fccid=56a26d4c816e53d1&vjs=3',
 'Job Location': 'New York, NY'}

In [227]:
#import necessary modules
import pandas as pd


job_search = pd.DataFrame(Task, columns=["Job Title","Job Link","Job Location"])

job_search.head()

| | Job Title | Job Link | Job Location |
|---|---|---|---|
| 0 | Data Technician (Full- or Part-Time) | https://www.indeed.com/rc/clk?jk=da727c0cddda2... | New York, NY |
| 1 | Python / Java Software Engineer | https://www.indeed.com/rc/clk?jk=6b49e869fa744... | New York, NY |
| 2 | Penetration Testing Trainee (Remote USA) | https://www.indeed.com/rc/clk?jk=487b30db63184... | Florida, NY |
| 3 | Healthcare Data Scientist | https://www.indeed.com/rc/clk?jk=6720cf1c03a1c... | New York, NY |
| 4 | Digital Archives Assistant | https://www.indeed.com/rc/clk?jk=7c541433bfbcb... | Long Island City, NY |


In [228]:
job_search.to_csv("job_search.csv", index=False)  # index=False avoids an extra "Unnamed: 0" column when the file is read back