# Web Scraping Job Vacancies

## Introduction

In this project, we'll build a web scraper that extracts job listings from Naukri (https://www.naukri.com/), a job search platform. For each listing, we'll collect the job title, company name, required experience, salary, location, and a link to the full posting.

Here are the main steps we'll follow in this project:

1. Set up our development environment
2. Understand the basics of web scraping
3. Analyze the website structure of our job search platform
4. Write the Python code to extract job data from our job search platform
5. Save the data to a CSV file
6. Test our web scraper and refine our code as needed

## Prerequisites

Before starting this project, you should have some basic knowledge of Python programming and HTML structure. The code cells in this notebook import the following packages:

- selenium
- pandas
- beautifulsoup4 (imported as `bs4`)
- time (part of the Python standard library, no installation needed)

These packages may already be installed in Coursera's Jupyter Notebook environment. If any of them are missing, or if you are working off platform, you can install them with `!pip install packagename` in a notebook cell, for example:

- `!pip install selenium`
- `!pip install pandas`
- `!pip install beautifulsoup4`
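
Before running the notebook, it can help to confirm that the third-party packages imported in Step 1 are available. The sketch below uses only the standard library. The helper name `missing_packages` is hypothetical, and the module list simply mirrors the import names used in the code cells.

```python
from importlib.util import find_spec


def missing_packages(modules):
    """Return the names from `modules` that cannot be found for import."""
    return [name for name in modules if find_spec(name) is None]


# Import names used by the notebook's code cells
# (bs4 is the import name of the beautifulsoup4 package)
required = ["pandas", "bs4", "selenium"]
print("Missing:", missing_packages(required))
```

An empty list means every package was found; any name that is printed still needs a `!pip install`.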

## Step 1: Importing Required Libraries

In [1]:
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

## Step 2: Setting up the Browser 

The following cell starts a Chrome browser instance through Selenium and opens the Naukri home page:

In [2]:
driver = webdriver.Chrome()
driver.get("https://www.naukri.com/")

## Step 3: Locating Search Input, Sending Search Query and Clicking Button

The script locates the search input field and the search button by their absolute XPaths, types a search query into the input field, and then clicks the button:

In [3]:
# Locate the search box and type the query
input_search = driver.find_element(By.XPATH, "/html/body/div[1]/div[7]/div/div/div[1]/div/div/div[1]/div[1]/div/input")
input_search.send_keys("web scraping")

# Locate the search button, then click it (click() returns None, so the element is stored first)
button = driver.find_element(By.XPATH, "/html/body/div[1]/div[7]/div/div/div[6]")
button.click()

## Step 4: Initializing DataFrame 

The script initializes an empty pandas DataFrame with column names for storing scraped data: 

In [4]:
data = pd.DataFrame(columns=['Title', 'Comp Name', 'Exprience', 'Salary', 'Location', 'Link'])
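
The scraping loop later appends to this DataFrame one row at a time with `pd.concat`. An alternative pattern, sketched here under the assumption that all rows fit in memory, is to collect plain dictionaries in a list and build the DataFrame once at the end. The two rows and the name `data_sketch` are invented for illustration; the column names are the ones used above.

```python
import pandas as pd

COLUMNS = ['Title', 'Comp Name', 'Exprience', 'Salary', 'Location', 'Link']

# Invented rows standing in for scraped postings
rows = [
    {'Title': 'Example Role A', 'Comp Name': 'Example Co', 'Exprience': '1-2 Yrs',
     'Salary': 'Not disclosed', 'Location': 'Remote', 'Link': 'https://example.com/a'},
    {'Title': 'Example Role B', 'Comp Name': 'Example Co', 'Exprience': '3-5 Yrs',
     'Salary': 'Not disclosed', 'Location': 'Pune', 'Link': 'https://example.com/b'},
]

# Build the DataFrame once instead of concatenating after every posting
data_sketch = pd.DataFrame(rows, columns=COLUMNS)
print(data_sketch)
```

Building the frame once avoids copying the whole DataFrame on every append, which matters more as the number of postings grows.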

## Main Scraping Loop
The script enters a `while` loop that runs until it is stopped. Inside the loop:

1. BeautifulSoup: The BeautifulSoup library parses the HTML of the current webpage.
2. Finding Job Postings: The script finds all job postings on the page by their class name, `cust-job-tuple`.
3. Looping through Postings: For each posting, it extracts the following details:
    - Link: The link to the full job posting
    - Title: The title of the job
    - Company Name: The name of the company offering the job
    - Experience: The required experience for the job
    - Salary: The salary offered for the job
    - Location: The location where the job is based
4. Creating a New DataFrame Row: A new row is created from these details and appended to the main DataFrame.
5. Finding Next Button: The script finds and clicks the "Next" button on the page, which loads the next set of job postings.

The loop continues until there are no more job postings to scrape.
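
The per-posting extraction in steps 1 to 3 can be tried offline, without a browser, by parsing a static HTML string. The sketch below is not the notebook's code: the HTML fragment is invented, only the tag and class names (`cust-job-tuple`, `title`, `comp-name`, `expwdth`, `locWdth`) are taken from the notebook, and the helpers `text_or_na` and `parse_postings` are hypothetical. Salary is left out because the sample fragment does not model it.

```python
from bs4 import BeautifulSoup

# Invented HTML fragment; only the tag and class names come from the notebook
SAMPLE_HTML = """
<div class="cust-job-tuple">
  <a class="title" href="https://example.com/job-1">Example Scraping Engineer</a>
  <a class="comp-name">Example Co</a>
  <span class="expwdth">2-4 Yrs</span>
  <span class="locWdth">Remote</span>
</div>
<div class="cust-job-tuple">
  <a class="title" href="https://example.com/job-2">Example Python Developer</a>
</div>
"""


def text_or_na(parent, tag, css_class):
    """Return the stripped text of the first matching child, or 'N/A' if absent."""
    node = parent.find(tag, class_=css_class)
    return node.get_text(strip=True) if node is not None else 'N/A'


def parse_postings(html):
    """Extract one dictionary per posting from a page of HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for post in soup.find_all('div', class_='cust-job-tuple'):
        title_tag = post.find('a', class_='title')
        rows.append({
            'Title': text_or_na(post, 'a', 'title'),
            'Comp Name': text_or_na(post, 'a', 'comp-name'),
            'Exprience': text_or_na(post, 'span', 'expwdth'),
            'Location': text_or_na(post, 'span', 'locWdth'),
            'Link': title_tag.get('href') if title_tag is not None else 'N/A',
        })
    return rows


for row in parse_postings(SAMPLE_HTML):
    print(row)
```

Returning `'N/A'` for a missing element keeps an incomplete posting in the results instead of dropping it, which is one possible design choice; skipping such postings is equally valid.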

In [5]:
from selenium.common.exceptions import NoSuchElementException

while True:
    # Parse the HTML of the page currently loaded in the browser
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    postings = soup.find_all('div', class_='cust-job-tuple')
    if not postings:
        # No job postings on this page: nothing left to scrape
        break

    for post in postings:
        try:
            title_tag = post.find('a', class_='title')
            data_dict = {
                'Title': title_tag.text,
                'Comp Name': post.find('a', class_='comp-name').text,
                'Exprience': post.find('span', class_='expwdth').text,
                'Salary': post.find('span', class_='').text,
                'Location': post.find('span', class_='locWdth').text,
                'Link': title_tag.get('href')}
        except AttributeError:
            # find() returned None: the posting lacks an expected element, so skip it
            continue
        # Append the row to the DataFrame
        new_row = pd.DataFrame(data_dict, index=[0])
        data = pd.concat([data, new_row], ignore_index=True)

    # After every posting on the page is stored, find the "Next" button and click it
    try:
        driver.find_element(By.CSS_SELECTOR, '#lastCompMark > a:nth-child(4)').click()
    except NoSuchElementException:
        # Assumes the button is absent on the last page; stop when it cannot be found
        break
    # Give the next page time to load
    time.sleep(2)

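In the recorded run, the cell ended with the following error, raised when the request for the page source could no longer reach the local WebDriver session:

MaxRetryError: HTTPConnectionPool(host='localhost', port=55735): Max retries exceeded with url: /session/60b422b841dfa2664e9da308c6cca486/source (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f441871d190>: Failed to establish a new connection: [Errno 111] Connection refused'))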

## Preview extracted data

In [7]:
data

| | Title | Comp Name | Exprience | Salary | Location | Link |
|---|---|---|---|---|---|---|
| 0 | Manager, Engineering (Web Scrape) | MX Technologies | 5-7 Yrs | Not disclosed | Kolkata, Mumbai, New Delhi, Hyderabad, Pune, C... | https://www.naukri.com/job-listings-manager-en... |
| 1 | Web Scraping Engineer | Ojcommerce | 2-3 Yrs | Not disclosed | Chennai | https://www.naukri.com/job-listings-web-scrapi... |
| 2 | Node.js Engineer (Web Automation and Web Scrap... | Dataweave Software | 2-4 Yrs | Not disclosed | Bengaluru | https://www.naukri.com/job-listings-node-js-en... |
| 3 | Senior Web Scraping Specialist | Aimleap | 2-6 Yrs | Not disclosed | Remote | https://www.naukri.com/job-listings-senior-web... |
| 4 | Python Developer (Web Scraping) | MEG World It Services | 3-7 Yrs | Not disclosed | Chennai | https://www.naukri.com/job-listings-python-dev... |
| ... | ... | ... | ... | ... | ... | ... |
| 231 | Engineering Manager - Mobile Crawling | CommerceIQ | 2-5 Yrs | Not disclosed | Bengaluru | https://www.naukri.com/job-listings-engineerin... |
| 232 | Python Developer | Neemtree Internet | 2-4 Yrs | Not disclosed | Pune | https://www.naukri.com/job-listings-python-dev... |
| 233 | Python Scrapy Developer ( Immediate) | Embarckle Llp | 4-6 Yrs | Not disclosed | Coimbatore | https://www.naukri.com/job-listings-python-scr... |
| 234 | Senior Python Developer | Embarckle Llp | 4-6 Yrs | Not disclosed | Coimbatore | https://www.naukri.com/job-listings-senior-pyt... |


## Saving Scraped Data 

After all job postings have been scraped, the script saves the data to a CSV file named `scraped_data.csv`. The file uses a semicolon as the field separator, and missing values are written as `N/A`:

In [8]:
data.to_csv('scraped_data.csv', index=False, header=True, sep=';', na_rep='N/A')
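
Because the file is written with `sep=';'`, it has to be read back with the same separator. The round trip below is a minimal sketch that uses an invented one-row DataFrame and a temporary directory rather than the real scraped data; the names `sample` and `restored` are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Invented single-row frame standing in for the scraped data
sample = pd.DataFrame({'Title': ['Example Role'], 'Salary': [None]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'scraped_data.csv')
    # Same options as the cell above
    sample.to_csv(path, index=False, header=True, sep=';', na_rep='N/A')
    # The separator must match the one used when writing
    restored = pd.read_csv(path, sep=';')

print(restored)
```

By default `pd.read_csv` treats the string `N/A` as a missing value, so the placeholder written by `na_rep` comes back as `NaN`.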

In [9]:
driver.quit()
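
Calling `driver.quit()` in a final cell works when every earlier cell succeeds, but an error partway through can leave the browser running. One way to guarantee cleanup in a script, sketched here with a hypothetical helper named `browser_session`, is a context manager that always calls `quit()`.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Create a driver with `make_driver()` and always call quit() on exit."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # Runs on normal exit and when the body raises an exception
        driver.quit()


# Usage with Selenium (commented out because it needs a local Chrome install):
# from selenium import webdriver
# with browser_session(webdriver.Chrome) as driver:
#     driver.get("https://www.naukri.com/")
```

The helper takes a factory such as `webdriver.Chrome` rather than a driver instance, so the browser is only started once the `with` block is entered.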