<a href="https://colab.research.google.com/github/sarveshahuja1992/Webscraping-Indeed.com/blob/master/Scraping_Data_Science_Jobs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This project automates the job search process: it saves time by downloading relevant job posts from Indeed.com into a spreadsheet file.

**The first part of the code** uses **BeautifulSoup** to parse the search results page and extract the following data points from its elements: position/job title, company name, location, and link to the job post.
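
As a minimal sketch of this kind of extraction, the snippet below parses a small hand-written HTML fragment rather than a live Indeed page. The markup and the `div.result` wrapper are assumptions made for illustration; only the `data-tn-element=jobTitle`, `.company` and `.location` selectors are taken from the notebook. It reads the fields one result card at a time, so a card with a missing field still yields a complete row.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on the selectors used in this notebook
html = """
<div class="result">
  <a data-tn-element="jobTitle" href="/rc/clk?jk=abc123"> Data Analyst </a>
  <span class="company"> Example Corp </span>
  <span class="location accessible-contrast-color-location">Austin, TX</span>
</div>
<div class="result">
  <a data-tn-element="jobTitle" href="/rc/clk?jk=def456">ML Intern</a>
  <span class="location accessible-contrast-color-location">Remote</span>
</div>
"""

def text_of(tag):
    """Return the stripped text of a tag, or '' when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else ""

soup = BeautifulSoup(html, "html.parser")

rows = []
for card in soup.select("div.result"):  # "div.result" is an assumed wrapper
    title = card.select_one("[data-tn-element=jobTitle]")
    rows.append({
        "Position": text_of(title),
        "Company": text_of(card.select_one(".company")),
        "Location": text_of(card.select_one(".location")),
        "Link": "https://www.indeed.com" + title["href"] if title is not None else "",
    })

for row in rows:
    print(row)
```

The second card has no company element, so its `Company` value is an empty string while the other fields stay attached to the right job.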

**Pandas** stores this data in a dataframe, which is later exported to a CSV file that can be opened in Excel.

The second part of the code opens every job post link and extracts Company Rating, Job posting date and Job Description from each page. 
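
The posting date is pulled out of the page footer text with a regular expression and a split on dashes. The sketch below shows that idea on made-up footer strings; the sample text and the helper name `posted_date` are assumptions for illustration, and the final `strip()` is an extra tidy-up step.

```python
import re

def posted_date(footer_text):
    """Return the text between the first two dashes, or '' if there is none."""
    match = re.search(r"-(.*)-", footer_text)
    if not match:
        return ""
    # match.group() looks like "- 3 days ago -"; the middle piece is the date
    return match.group().split("-")[1].strip()

print(posted_date("Example Corp - 3 days ago - report job"))
print(posted_date("no dashes here"))
```

A footer with fewer than two dashes produces an empty string instead of raising an error.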

After all this data is extracted, the **dataframe** is exported to a **CSV file** (which opens in Excel) and copied to a folder on **Google Drive**.

In [None]:
# Importing Libraries
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import re
#Creating a python list to store all the data points before it is converted to a dataframe. 
results = []
with requests.Session() as s:
    for i in range(0,500):
        #start=10*i is the offset of the first result on page i; the keywords and location are set after q= and l= in the link below
        url = 'https://www.indeed.com/jobs?q=data+science&l=United+States&explvl=entry_level&sort=date&start='+str(10*i)
        #Extracting the content from webpage
        res = s.get(url)
        soup = bs(res.content, 'lxml')
        #Creating lists for required data points
        titles = [item.text.strip() for item in soup.select('[data-tn-element=jobTitle]')]
        companies = [item.text.strip() for item in soup.select('.company')]
        spans = soup.find_all('span',{'class':'location accessible-contrast-color-location'})
        lines = [span.get_text() for span in spans]
        # Each job title anchor also carries the relative link to the job post
        links = [a['href'] for a in soup.select('[data-tn-element=jobTitle]')]
        # Zipping all iterables in the lists
        data = list(zip(titles, companies,links,lines))
        results.append(data)
newList = [item for sublist in results for item in sublist]
#Creating DataFrame from the list
df = pd.DataFrame(newList)
#Specifying Column Names
df.columns = ['Position', 'Company', 'HREF', 'Location' ]
#Creating HREF Formula for Excel
df['Link'] = 'https://www.indeed.com' + df['HREF'].astype(str)
df['Links'] = '=HYPERLINK("' + df['Link'] + '")'
del df['HREF']
df

Unnamed: 0,Position,Company,Location,Link,Links
0,Data Science Intern,Stitch Fix,"San Francisco, CA",https://www.indeed.com/rc/clk?jk=200aa5d3ed576...,"=HYPERLINK(""https://www.indeed.com/rc/clk?jk=2..."
1,Associate Data Scientist,The Walt Disney Studios,"Glendale, CA",https://www.indeed.com/rc/clk?jk=1d33823a58a27...,"=HYPERLINK(""https://www.indeed.com/rc/clk?jk=1..."
2,Python for Data Science-Lead,Wipro LTD,"Bothell, WA",https://www.indeed.com/rc/clk?jk=560aa33e18458...,"=HYPERLINK(""https://www.indeed.com/rc/clk?jk=5..."
3,2020 Summer Internship- Data Science,Duke Energy,"Charlotte, NC 28202 (Downtown Charlotte area)",https://www.indeed.com/rc/clk?jk=bbfe97193334d...,"=HYPERLINK(""https://www.indeed.com/rc/clk?jk=b..."
4,Jr Data Engineer,Chubb,United States,https://www.indeed.com/rc/clk?jk=64163637daae3...,"=HYPERLINK(""https://www.indeed.com/rc/clk?jk=6..."
...,...,...,...,...,...
4991,Data Scientist - Commercial - PhD,Eli Lilly,"Indianapolis, IN 46278",https://www.indeed.com/rc/clk?jk=831b2847c8329...,"=HYPERLINK(""https://www.indeed.com/rc/clk?jk=8..."
4992,Machine Learning Developer - Entry level,"Modern Technology Solutions, Inc.","Huntsville, AL 35806",https://www.indeed.com/rc/clk?jk=3e37dc3bff8c4...,"=HYPERLINK(""https://www.indeed.com/rc/clk?jk=3..."
4993,"Expert Software Engineer, Data",Walmart eCommerce,"Sunnyvale, CA 94087",https://www.indeed.com/rc/clk?jk=7e5a32f227673...,"=HYPERLINK(""https://www.indeed.com/rc/clk?jk=7..."
4994,Software Engineer in Big Data,Comcast,"New York, NY",https://www.indeed.com/rc/clk?jk=834d0f182f084...,"=HYPERLINK(""https://www.indeed.com/rc/clk?jk=8..."
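
The search URL in the cell above is built by string concatenation. As a hedged alternative, the sketch below assembles the same query string with `urllib.parse.urlencode`, which makes the keywords and location easier to change. The function name `search_url` and the `per_page` parameter are assumptions for illustration; the query parameters themselves are the ones from the URL in the notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"

def search_url(query, location, page, per_page=10):
    """Build the search URL for a zero-based results page."""
    params = {
        "q": query,
        "l": location,
        "explvl": "entry_level",
        "sort": "date",
        "start": page * per_page,  # offset of the first result on this page
    }
    # urlencode replaces spaces with "+", matching the hand-written URL
    return f"{BASE_URL}?{urlencode(params)}"

print(search_url("data science", "United States", 2))
```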


In [None]:
#From previous list extracting all the links for different job positions to open and extract more information
my_list = df["Link"].tolist()
results1 = []
jobtext = []
ratinglist = []
#For each link, Extracting Job Description and storing into a list:
for link in my_list:
  r = requests.get(link)
  r.encoding = 'utf-8'
  html_content = r.text
  soup1 = bs(html_content, 'lxml')
  hits = soup1.find_all("div", {"class": "jobsearch-jobDescriptionText"})
  # Use the last matching block, or an empty string if the page has none,
  # so that text from a previous link is never reused
  jobtext.append(hits[-1].text if hits else '')
  # Extracting Date for the job posting
  t = [item.text.strip() for item in soup1.find_all("div", {"class": "jobsearch-JobMetadataFooter"})]
  str1 = ''.join(t)
  # Keep the dash-delimited part of the footer text, which holds the posting date
  date = re.search('-(.*)-', str1)
  results1.append(date.group() if date else '')
  # Extracting Company Rating (not every page has one)
  rating_tag = soup1.find(itemprop="ratingValue")
  rating = rating_tag.get("content") if rating_tag is not None else None
  ratinglist.append(rating if rating else '')
#Adding columns to original DataFrame
df['Job Description'] = jobtext
df['Posted'] = results1
df['Company Rating'] = ratinglist
# Keep only the text between the first two dashes (the posting date); rows without a date become NaN
df['Posted'] = df['Posted'].str.split('-').str[1].str.strip()
del df['Link']
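#Adding a timestamp to the file name and exporting to CSV
from datetime import datetime
# The original cell never defined timestamp; this format is an assumption
timestamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S_')
name = timestamp + 'BusinessIntelligenceJobs.csv'
df.to_csv(name, index=False, header=True)
#Uploading the CSV file to Google Drive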
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
# copy it there
!cp $name "/content/drive/My Drive/Indeed Job lists"