# Example: LinkedIn (Selenium)

- The selenium package is used to automate web browser interaction from Python.

For most web scraping projects (without an API), the logic behind the scrapers is very similar. Think about how to implement each step before you develop the scraper.

In [2]:
# Libraries
import time
import pandas as pd   
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.common.exceptions import NoSuchElementException

In [3]:
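# Get the Chrome driver (Selenium 4 expects the driver path wrapped in a Service object)
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))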

In [4]:
# Implicit wait: element lookups retry for up to 10 seconds before failing
driver.implicitly_wait(10)

# Open the login page
driver.get('https://www.linkedin.com/login')
time.sleep(5)

# Accept cookies
#driver.find_element("xpath", "accept cookies xpath").click()

In [5]:
# User Credentials
# Reading txt file where we have our user credentials
with open('Linkedin_credentials.txt', 'r',encoding="utf-8") as file:
    user_credentials = file.readlines()
    user_credentials = [line.rstrip() for line in user_credentials]

user_name = user_credentials[0] # First line
password = user_credentials[1] # Second line
# Log in with your credentials
driver.find_element("xpath", '//*[@id="username"]').send_keys(user_name)
driver.find_element("xpath", '//*[@id="password"]').send_keys(password)
time.sleep(3)

# Login button
driver.find_element("xpath", '//*[@id="organic-div"]/form/div[3]/button').click()
driver.implicitly_wait(30) # from now on, element lookups retry for up to 30 seconds
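
The credential-reading step above fails with an `IndexError` if the file has fewer than two lines. A minimal sketch of a more defensive version follows. The helper name `read_credentials` is hypothetical and not part of the original notebook; the file layout (username on the first line, password on the second) is taken from the cell above.

```python
from pathlib import Path


def read_credentials(path):
    """Return (user_name, password) from a two-line text file.

    Hypothetical helper: line 1 holds the username, line 2 the password,
    matching the layout the notebook cell expects.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Drop trailing whitespace and skip completely blank lines.
    lines = [line.rstrip() for line in lines if line.strip()]
    if len(lines) < 2:
        raise ValueError(f"{path}: expected a username line and a password line")
    return lines[0], lines[1]
```

It could be used as `user_name, password = read_credentials('Linkedin_credentials.txt')`.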

In [6]:
links = [] # collects all the job URLs

# Pagination over 4 result pages of 25 jobs each:
# range(0, 100, 25) yields the start offsets 0, 25, 50, 75
for start in range(0, 100, 25):
    url = "https://www.linkedin.com/jobs/search/?currentJobId=3279897362&geoId=103644278&keywords=data%20scientist&location=United%20States&refresh=true&start=" + str(start)
    driver.get(url)
    driver.implicitly_wait(10) # element lookups retry for up to 10 seconds

    # find each job's URL
    for card in driver.find_elements("xpath", '//a[@class="disabled ember-view job-card-container__link job-card-list__title"]'):
        link = card.get_attribute('href')
        links.append(link) # add the URLs one by one to the links list
print(len(links))

13
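
The pagination logic above (one search URL per `start` offset) can also be written with the standard library instead of string concatenation. This is only a sketch: the function names are hypothetical, and it assumes the search still works when the other query parameters from the original URL (`currentJobId`, `geoId`, `refresh`) are left out. The second helper removes repeated links while keeping their order, which is useful if the same job card shows up on more than one page.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.linkedin.com/jobs/search/"


def search_page_urls(keywords, location, pages=4, page_size=25):
    """Build one search URL per results page using the `start` offset."""
    urls = []
    for start in range(0, pages * page_size, page_size):
        params = {"keywords": keywords, "location": location, "start": start}
        # quote_via=quote encodes spaces as %20, as in the original URL
        urls.append(f"{BASE_URL}?{urlencode(params, quote_via=quote)}")
    return urls


def unique_in_order(items):
    """Drop repeated items while keeping first-seen order."""
    return list(dict.fromkeys(items))
```

Note that `unique_in_order` only removes exact duplicates; two links to the same job with different tracking parameters would both be kept.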


## Fill in the xpath for each item then complete the scraper

In [8]:
#initialize dictionary with lists
job_info = {'Job_Title':[],'jobinfo':[],'companyName':[],'companysize':[],'alumni':[],'active':[],'description':[]}

# open each job page one by one
for i in links:
    driver.get(i)
    time.sleep(10)
    
    try:
        Jobtitle = driver.find_element("xpath",'//h1[@class="t-24 t-bold jobs-unified-top-card__job-title"]').get_attribute('textContent').strip()
    except NoSuchElementException:
        Jobtitle = 0
    
    try:
        companyinfo = driver.find_element("xpath",'//span[@class="jobs-unified-top-card__subtitle-primary-grouping t-black"]').text.strip()
    except NoSuchElementException:
        companyinfo = 0
    
    try:
        companyName = driver.find_element("xpath",'fill in the xpath').text.strip()
    except NoSuchElementException:
        companyName = 0     
    
    try:
        fulltime = driver.find_element("xpath",'fill in the xpath').get_attribute('textContent').strip() # note: scraped but never stored in job_info
    except NoSuchElementException:
        fulltime = 0
    
    try:
        companysize = driver.find_element("xpath",'fill in the xpath').get_attribute('textContent').strip()
    except NoSuchElementException:
        companysize = 0   
    
    try:
        alumni = driver.find_element("xpath",'fill in the xpath').get_attribute('textContent').strip()
    except NoSuchElementException:
        alumni = 0
    print(alumni)
    
    try:
        active = driver.find_element("xpath",'fill in the xpath').get_attribute('textContent').strip()
    except NoSuchElementException:
        active = 0
    
    try:
        description = driver.find_element("xpath",'fill in the xpath').get_attribute('textContent').strip()
    except NoSuchElementException:
        description = 0
    
    job_info['Job_Title'].append(Jobtitle) 
    job_info['jobinfo'].append(companyinfo)
    job_info['companyName'].append(companyName)
    job_info['companysize'].append(companysize)
    job_info['alumni'].append(alumni)
    job_info['active'].append(active)
    job_info['description'].append(description)
    
    print(Jobtitle, "Done!")
          
driver.quit() # end the browser session (close() would only close the current window)



7 company alumni · 111 school alumni
Sr Data Analyst Done!
17 school alumni
Marketing Analytics Consultant Done!
See recent hiring trends for Publicis Worldwide. Reactivate Premium
Product Data Strategist (Hybrid) Done!
1 connection · 6 company alumni · 2,588 school alumni
Data Scientist Lead Done!
6 company alumni · 4 school alumni
Data Scientist Done!
2 company alumni · 12 school alumni
Senior Data Analyst Done!
2 company alumni · 12 school alumni
Senior Data Analyst Done!
1 company alumni · 5 school alumni
Sr. Analyst, Targeting & Data Done!
7 school alumni
PL Analytics: Rotation Program Analyst (COLLEGE HIRE) Done!
1 connection · 13 company alumni · 267 school alumni
Business Intelligence Analyst - SDS Done!
1 company alumni · 8 school alumni
Remote Data Scientist Done!
1 company alumni · 8 school alumni
Remote Data Scientist Done!
1 connection · 1 company alumni
Data Analytics Consultant, CoE | Telecom, Media & Technology Done!
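
The repeated `try`/`except NoSuchElementException` blocks in the scraper above can be folded into one helper. The sketch below is an assumption about how you might do that, not part of the original notebook: the name `text_or_default` is hypothetical, and it relies on `find_elements` (plural) returning an empty list when nothing matches instead of raising. It also defaults to `None` rather than `0`, so pandas treats the value as missing.

```python
def text_or_default(driver, xpath, default=None):
    """Return the stripped textContent of the first match, or `default`.

    Hypothetical helper. find_elements (plural) returns an empty list
    when nothing matches, so no try/except is needed here.
    """
    matches = driver.find_elements("xpath", xpath)
    if not matches:
        return default
    text = matches[0].get_attribute("textContent")
    return text.strip() if text is not None else default
```

Inside the loop, a field would then read `Jobtitle = text_or_default(driver, '//h1[@class="t-24 t-bold jobs-unified-top-card__job-title"]')`. With an implicit wait set, a lookup that finds nothing still waits for the full timeout before returning the default.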


In [9]:
#store the data to a pandas DataFrame then show the first five rows
jobs = pd.DataFrame(job_info)
jobs.head()

|  | Job_Title | jobinfo | companyName | companysize | alumni | active | description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Sr Data Analyst | Bank of America Jacksonville, FL | Bank of America | 10,001+ employees · Banking | 7 company alumni · 111 school alumni | Actively recruiting | About the job\n \n\n\n \n Job D... |
| 1 | Marketing Analytics Consultant | Blue Cross and Blue Shield of Illinois, Montan... | Blue Cross and Blue Shield of Illinois, Montan... | 10,001+ employees · Hospitals and Health Care | 17 school alumni | Actively recruiting | About the job\n \n\n\n \n At HC... |
| 2 | Product Data Strategist (Hybrid) | Publicis Worldwide Los Angeles, CA | Publicis Worldwide | 10,001+ employees · Advertising Services | See recent hiring trends for Publicis Worldwid... | 0 | About the job\n \n\n\n \n Marce... |
| 3 | Data Scientist Lead | USAA San Antonio, TX | USAA | 10,001+ employees · Financial Services | 1 connection · 6 company alumni · 2,588 school... | Actively recruiting | About the job\n \n\n\n \n Purpo... |
| 4 | Data Scientist | Stanford University Stanford, CA On-site | Stanford University | 10,001+ employees · Higher Education | 6 company alumni · 4 school alumni | Your profile matches this job | About the job\n \n\n\n \n The S... |


In [10]:
#store the links (jobs urls)
linksdf = pd.DataFrame(links, columns = ["Links"])
linksdf.head()

|  | Links |
| --- | --- |
| 0 | https://www.linkedin.com/jobs/view/3284377727/... |
| 1 | https://www.linkedin.com/jobs/view/3283815192/... |
| 2 | https://www.linkedin.com/jobs/view/3279104186/... |
| 3 | https://www.linkedin.com/jobs/view/3294995392/... |
| 4 | https://www.linkedin.com/jobs/view/3284297022/... |


In [11]:
# join (merge) two dataframes
jobs = jobs.join(linksdf)
jobs.head()

|  | Job_Title | jobinfo | companyName | companysize | alumni | active | description | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Sr Data Analyst | Bank of America Jacksonville, FL | Bank of America | 10,001+ employees · Banking | 7 company alumni · 111 school alumni | Actively recruiting | About the job\n \n\n\n \n Job D... | https://www.linkedin.com/jobs/view/3284377727/... |
| 1 | Marketing Analytics Consultant | Blue Cross and Blue Shield of Illinois, Montan... | Blue Cross and Blue Shield of Illinois, Montan... | 10,001+ employees · Hospitals and Health Care | 17 school alumni | Actively recruiting | About the job\n \n\n\n \n At HC... | https://www.linkedin.com/jobs/view/3283815192/... |
| 2 | Product Data Strategist (Hybrid) | Publicis Worldwide Los Angeles, CA | Publicis Worldwide | 10,001+ employees · Advertising Services | See recent hiring trends for Publicis Worldwid... | 0 | About the job\n \n\n\n \n Marce... | https://www.linkedin.com/jobs/view/3279104186/... |
| 3 | Data Scientist Lead | USAA San Antonio, TX | USAA | 10,001+ employees · Financial Services | 1 connection · 6 company alumni · 2,588 school... | Actively recruiting | About the job\n \n\n\n \n Purpo... | https://www.linkedin.com/jobs/view/3294995392/... |
| 4 | Data Scientist | Stanford University Stanford, CA On-site | Stanford University | 10,001+ employees · Higher Education | 6 company alumni · 4 school alumni | Your profile matches this job | About the job\n \n\n\n \n The S... | https://www.linkedin.com/jobs/view/3284297022/... |


**regex**: bool or same types as `to_replace`, default False

Whether to interpret `to_replace` and/or `value` as regular expressions. If this is `True` then `to_replace` *must* be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case `to_replace` must be `None`.

.replace() documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html

In [13]:
# remove newlines, surrounding whitespace, and the leading "About the job" phrase
jobs["description"] = jobs["description"].replace('\n','', regex=True).str.strip()
jobs["description"] = jobs["description"].replace('About the job','', regex=True)
jobs.head()

|  | Job_Title | jobinfo | companyName | companysize | alumni | active | description | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Sr Data Analyst | Bank of America Jacksonville, FL | Bank of America | 10,001+ employees · Banking | 7 company alumni · 111 school alumni | Actively recruiting | Job Description:This Senior ... | https://www.linkedin.com/jobs/view/3284377727/... |
| 1 | Marketing Analytics Consultant | Blue Cross and Blue Shield of Illinois, Montan... | Blue Cross and Blue Shield of Illinois, Montan... | 10,001+ employees · Hospitals and Health Care | 17 school alumni | Actively recruiting | At HCSC, we consider our emp... | https://www.linkedin.com/jobs/view/3283815192/... |
| 2 | Product Data Strategist (Hybrid) | Publicis Worldwide Los Angeles, CA | Publicis Worldwide | 10,001+ employees · Advertising Services | See recent hiring trends for Publicis Worldwid... | 0 | Marcel is seeking a Director... | https://www.linkedin.com/jobs/view/3279104186/... |
| 3 | Data Scientist Lead | USAA San Antonio, TX | USAA | 10,001+ employees · Financial Services | 1 connection · 6 company alumni · 2,588 school... | Actively recruiting | Purpose of JobAbout USAAUSAA... | https://www.linkedin.com/jobs/view/3294995392/... |
| 4 | Data Scientist | Stanford University Stanford, CA On-site | Stanford University | 10,001+ employees · Higher Education | 6 company alumni · 4 school alumni | Your profile matches this job | The School of Humanities and... | https://www.linkedin.com/jobs/view/3284297022/... |
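
Replacing `'\n'` with an empty string glues together words that were separated only by a line break, which is why row 3 above reads "Purpose of JobAbout USAAUSAA...". A possible alternative, sketched here with the standard `re` module, collapses every run of whitespace into a single space. The function name `clean_description` is hypothetical, and treating one space as the right separator is an assumption.

```python
import re


def clean_description(text, prefix="About the job"):
    """Collapse whitespace runs to single spaces and drop a leading prefix."""
    if not isinstance(text, str):
        return text  # e.g. the 0 placeholder used when nothing was scraped
    text = re.sub(r"\s+", " ", text).strip()
    if text.startswith(prefix):
        text = text[len(prefix):].lstrip()
    return text
```

For example, `clean_description("About the job\n \n Purpose of Job\nAbout USAA")` returns `"Purpose of Job About USAA"`. It could be applied to the whole column with `jobs["description"].map(clean_description)`.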


In [14]:
jobs.to_csv("linkedin.csv")

## Let's summarize the general steps for web scraping without an API

(Examples we did in class)
1. Rotten Tomatoes (Challenge Lab 6)
2. Amazon reviews (Challenge Lab 7)
3. LinkedIn

Step 1: 

Step 2:

Step 3:

## Action: develop a scraper 

1. For Kickstarter projects (https://www.kickstarter.com/)
    - hint step 1: find a category (for example, arts)
    - hint step 2: explore a sub-category (arts → Explore Dance: https://www.kickstarter.com/discover/categories/dance)
    - hint step 3: develop a scraper: start with pagination → collect the page URLs → create a list to collect all the project URLs → loop through the projects, opening each one to get all of its details.
2. You are encouraged to use resources from GitHub, Stack Overflow, or any other website
3. Store at least 100 projects (= 100 rows) in a pandas DataFrame (print your dataframe.info())
4. Print the first five rows of the results (.head())
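
The flow described in hint step 3 can be sketched as three small functions. This is only a skeleton under stated assumptions, not a finished solution: all function names are hypothetical, the `page` query parameter is an assumption about how the listing pages are numbered, and `get_links` and `scrape_one` stand for functions you still have to write with Selenium (one returns the project URLs found on a listing page, the other returns a dict of details for one project).

```python
def page_urls(base_url, pages, param="page"):
    """Build one listing URL per page (the parameter name is an assumption)."""
    return [f"{base_url}?{param}={n}" for n in range(1, pages + 1)]


def collect_project_urls(listing_urls, get_links):
    """Visit each listing page and gather project URLs without repeats."""
    seen = {}
    for url in listing_urls:
        for link in get_links(url):
            seen.setdefault(link, None)  # a dict keeps first-seen order
    return list(seen)


def scrape_projects(project_urls, scrape_one):
    """Open each project and collect one dict of details per project."""
    return [scrape_one(url) for url in project_urls]
```

The list of dicts returned by `scrape_projects` can be passed directly to `pd.DataFrame(...)`.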