# Web scrapping for Data Scientist job in CO(Total points 9)

In this exercise we'll do web scrapping for **Data Scientist job in CO**


Here is the link to the search query

https://www.indeed.com/jobs?q=data+scientist&l=CO

As you can see at the bottom of the page there are link to series of pages related to this search.
If you click on second page, search url changes to

https://www.indeed.com/jobs?q=data+scientist&l=CO&start=10

If you click on 3rd then url changes to

https://www.indeed.com/jobs?q=data+scientist&l=CO&start=20

Hence to go to more pages we can format search string(**change start=??** part) for **requests.get in a loop**


# Q1(5 =  4(non indicator columns) + 1(indicator columns) points) Please complete the following task

- Scrape 10 pages (**last page(10 th) url will be https://www.indeed.com/jobs?q=data+scientist&l=CO&start=90**)and build a pandas DataFrame containing following information
    + **job title, name of the company, location, summary of job description**
    + **Indicator columns(with value True/False) about keywords Python, SQL, AWS, RESTFUL, Machine learning, Deep Learning, Text Mining, NLP, SAS, Tableau, Sagemaker, TensorFlow, Spark**

Note:
- Make sure that you do a case insensitive search for keywords when filing(Tue/False) in the indicator columns
- You need to go to the webpage of detail job posting for keywords search. Main job posting only contains summary of the job description.  Build detail job posting webpage url  from web scrapping main search results.

In [2]:
import requests
import bs4
import lxml
import time
import pandas as pd
import re


In [3]:
def extract(page, header):
    """This function scrapes indeed.com to find data scientist jobs in colorado.  The program takes in the number
    of pages to scrape and returns soup html code. """
    
    
             
    url=f'https://www.indeed.com/jobs?q=data%20scientist&l=CO&start={page}'
    
    req = requests.get(url, headers=header)
    print("status of request", req.status_code, "for page", page)
    soup=bs4.BeautifulSoup(req.text,"html.parser")
    time.sleep(5)
    #print(soup.prettify()) #unhash to see soup code
    
    return soup
      




In [4]:
def url_clean_up(soup):  
    """This function takes in html text cleaned in beautifulsoup and returns a tuple that contains two lists.  The 
    first list is URLs for the full job posting and the second is a list of the raw href associated with the full job
    postings.  
    
    Note, there are 15 job postings on each main page.  This function does not include the ad job postings since they
    may repeat themselves on each indexed page.  Therefore the function does not always return 15 job posting."""
    
    href_list = []
    new_url_list =[]
    base_url = "https://www.indeed.com"

    for job in soup.find_all('a'): #look for all "a" with "href" attributes
        if job.has_attr('href'):
            ref = job['href']
            new_url = base_url + ref
            #print(new_url) # unhash to see all the urls that were created
            
            
            #search for '/rc/clk' and '/company' since they are the in the urls for the full job postings.
            # .find() returns -1 if something is not found.  Use double negative to find needed values.
            if new_url.find('/rc/clk') != -1: 
                href_list.append(ref)
                new_url_list.append(new_url)
            if new_url.find('/company/') != -1:
                href_list.append(ref)
                new_url_list.append(new_url)    
        

    return new_url_list, href_list 


In [5]:
def location_extractor(href, soup):
    """The function takes in the href list and the full soup from the origional search page.  It uses the href 
    to look for the location.  The location is then cleaned up to only show the city of the job posting.  The function
    returns a list of the cities for the job description"""
    
    city = []
    for i in range(len(href)):
        
        loc = soup.find('a', {'href':href[i]}).find('div', {'class':'companyLocation'}).text.lower().strip()
        
        # convert location to remote if remote is in the location text.  
        if 'remote' in loc:
            loc = "remote"
        
        #remove everything in the parentheses that contain specific inner city location
        loc = re.sub("[\(\[].*?[\)\]]", "", loc).strip()
        
        #remove the state
        loc = loc.partition(',')[0]
        
        #append to list and return list
        city.append(loc)
        
    return city


In [6]:
def job_df_builder(url, city, header):
    
    #create the dataframe
    job_df = pd.DataFrame(columns=["Title", "Company", "Location", "Summary", "Python", "SQL", "AWS", "RESTFUL", 
                                   "Machine_Learning", "Deep_Learning", "Text_Mining", "NLP", "SAS", 
                                   "Tableau", "Sagemaker", "TensorFlow", "Spark"])
    
    #obtain the html from each full job posting
    for i in range(len(url)):
        req = requests.get(url[i], headers=header)
        soup = bs4.BeautifulSoup(req.text, "html.parser")
        time.sleep(5)
        
        #parse the soup for the needed data
        try:
            title = soup.find('h1', {'class':'icl-u-xs-mb--xs icl-u-xs-mt--none jobsearch-JobInfoHeader-title'}).text
        except:
            continue
        
        try:
            company = soup.find('div', {'class':'icl-u-lg-mr--sm icl-u-xs-mr--xs'}).text
        except:
            continue

        try:
            location = city[i] 
        except:
            location = "nada"
            print(url, "NO location")
            
        try:
            summary = soup.find('div', {'class':'jobsearch-jobDescriptionText'}).text.strip()   
        except:
            continue  
            
        #print(url[i]) #show weblinks to full job descriptions
        
        #boolean portion of the dataframe
        python = "python" in summary.lower()   
        sql = 'sql' in summary.lower()
        aws = 'aws' in summary.lower()
        restful = 'restful' in summary.lower()
        machine = ('machine learning' or 'ml') in summary.lower()
        deep = "deep learning" in summary.lower()
        mining = "text mining" in summary.lower()
        nlp = ('nlp' or 'natural language processing') in summary.lower()
        sas = "SAS" in summary
        tableau = "tableau" in summary.lower()
        sage = "sagemaker" in summary.lower()
        tensor = ("tensorflow" or "tensor flow") in summary.lower()
        spark = "spark" in summary.lower()
            
        #appending data to dataframe   
        job_df = job_df.append({"Title": title, 
                                "Company":company, 
                                "Location":location, 
                                "Summary":summary,
                                "Python":python, 
                                "SQL":sql, "AWS":aws, 
                                "RESTFUL":restful, 
                                "Machine_Learning":machine, 
                                "Deep_Learning":deep, 
                                "Text_Mining":mining, 
                                "NLP":nlp, 
                                "SAS":sas, 
                                "Tableau":tableau, 
                                "Sagemaker":sage, 
                                "TensorFlow":tensor, 
                                "Spark":spark}, ignore_index = True)
        
    print(job_df)
    
    return job_df

In [36]:
def question_3a_3b(input_df):
    """Code for questions 3a and 3b.  results are in the main() function output."""
    # Quesiton 3a code
    q3a = input_df['Location'].max(axis=0)
    
    # Question 3b code
    q3b = (input_df[["Python", "SQL", "AWS", "RESTFUL", 
                     "Machine_Learning", "Deep_Learning", 
                     "Text_Mining", "NLP", "SAS", "Tableau", 
                     "Sagemaker", "TensorFlow", "Spark"]]==True).sum().sort_values(ascending=False)
    
    return q3a, q3b
    
    
#question_3a_3b(final_df) #run block in its own after main() has been run once.

('remote',
 Python              126
 Machine_Learning     84
 SQL                  82
 Spark                41
 AWS                  28
 Tableau              25
 NLP                  16
 Deep_Learning        14
 Text_Mining          10
 Sagemaker             9
 TensorFlow            9
 SAS                   2
 RESTFUL               0
 dtype: int64)

In [38]:
def question_3c(input_df):
    """Code for questions 3c.  results are in the main() function output.""" 
    
    filtered_df = input_df.filter(items=['Title', 'Company','Location','Python', 'SQL'])
    #print(filtered_df)
    
    q3c=filtered_df.query('Python == True & SQL == True')
    
    return q3c
    
    
#question_3c(final_df)  #run block in its own after main program has been run once.

Unnamed: 0,Title,Company,Location,Python,SQL
0,Data Scientist,Xcel Energy,denver,True,True
1,Data Scientist,Ent Credit Union,colorado springs,True,True
5,"Associate Decision Scientist, Data & Media",Ibotta,remote,True,True
8,Data Scientist - 100% Remote,Frontdoor,remote,True,True
9,Senior Data Scientist,Panasonic Corporation of North America,denver,True,True
...,...,...,...,...,...
136,"Data Scientist II, Client Analytics",Teladoc Health,remote,True,True
137,Data Scientist - Denver,Dataiku,denver,True,True
138,Senior Data Scientist,Oracle,broomfield,True,True
143,"Data Scientist, Senior",FlightSafety International,denver,True,True


In [40]:
def main():
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36'}

    full_url_list = []
    full_city_list = []
    page_count = 0

    print("Please be patient, this program takes a long time to compile!")

    while page_count < 10: 
        full_soup = extract(page_count, headers) #initial extraction
        #creates a list of urls to active job pages and a list of hrefs
        url_list, href_list = url_clean_up(full_soup)  
        city_list = location_extractor(href_list, full_soup) #uses hrefs to create a list of cities (cleaned)

        #full list created by while-loop iterations 
        full_url_list = full_url_list + url_list 
        full_city_list = full_city_list + city_list

        page_count += 1

    print("URL count", len(full_url_list), "city count", len(full_city_list)) # status update of while loop

    final_df = job_df_builder(full_url_list, full_city_list, headers) # creates dataframe



    final_df.to_pickle("indeed_job_co_tw.pkl")  # this is for question 2
    q3a, q3b = question_3a_3b(final_df) #this is for question 3a and 3b
    q3c = question_3c(final_df) # this is for question 3c
    print("\n\nQuestion 3a: most commmon location:  ", q3a)
    print("\n\nQuestion 3b: output for order of most demandind skill\n", q3b)
    print("\n\nQuestion 3c: output for Title, Company, and Location for jobs that require Python and SQL.\n", q3c)

if __name__ == "__main__":
    main()

Please be patient, this program takes a long time to compile!
status of request 200 for page 0
status of request 200 for page 1
status of request 200 for page 2
status of request 200 for page 3
status of request 200 for page 4
status of request 200 for page 5
status of request 200 for page 6
status of request 200 for page 7
status of request 200 for page 8
status of request 200 for page 9
URL count 147 city count 147
                                                 Title  \
0                                       Data Scientist   
1                                       Data Scientist   
2                                       Data Scientist   
3           Associate Decision Scientist, Data & Media   
4                                       Data Scientist   
..                                                 ...   
142                            Data Scientist - Denver   
143  Data Scientist / Artificial Intelligence Resea...   
144                                     Data Scientist   

# Q2(1 point) Save you DataFrame to pickle file name *indeed_job_co.pkl*. 
   Load this pkl file in dataFrame and use this dataFrame for answering following questions.

   <font color='red'>upload the pickle file(indeed_job_co.pkl) along with solution notebook to the canvas</font>

The code for the pickle file is in the main function.  The code that I used is:

final_df.to_pickle("indeed_job_co_tw.pkl")

<font size = "6" color='red'> Use pandas functionality to answer question 2</font>
# Q 3 a(1 point) Which city has maximum job posting.



See the function, question_3a_3b, above.  The most common location was remote

# Q 3 b(1.5 point) - Top 3 most demanding skills(like Python, AWS, SQL ...)



See the function, question_3a_3b, above. Python, Machine Learning, SQL

# Q3 c(.5 point) What other questions you would like to ask  based on indeed data?

This is free response questions.

Question is what jobs (Title, Company, Location) require both Python SQL? 

see function question_3c for code and the answser is in the main() function output.