# Selenium Experiment Assignment

## Before you start: Selenium Download
For the next exercises you will have to use Selenium. 

You can read more about the webdriver at https://chromedriver.chromium.org. To go straight to the download, open https://chromedriver.storage.googleapis.com/index.html?path=89.0.4389.23/ and download the build for your operating system. The ChromeDriver version should match the Chrome version installed on your machine.

The steps to get Selenium to work are:
1. Download the webdriver.
2. Extract the archive.
3. Add the extracted driver to your PATH.
4. Install Selenium from the terminal (e.g. `pip install selenium`).

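Once this is done, you should be able to run:
- `from selenium import webdriver`
- `browser = webdriver.Chrome([the path where you put the chromedriver])`

Note that Selenium 4 deprecated passing the path positionally, and recent releases removed it. There, wrap the path in `Service` from `selenium.webdriver.chrome.service` and call `webdriver.Chrome(service=Service([path]))`.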

In case of any issues, the https://chromedriver.chromium.org website has some straightforward info on common bugs. 

## Selenium Sessions

Go to a website of your choice where you have an account. It can be, for example, the New York Times API website where you created a login last time, but also tutti.ch, comparis, or any other simple website you use often.

Using Selenium, create a session where you:
1. go to the main website
2. log in
3. click on an element of your choice
4. scroll to the bottom of the page
5. save the page

When logging in, you will have to locate the login form (for example by the name or id of its fields), enter your credentials, and then click the login button. An example of a login using Selenium is available at https://crossbrowsertesting.com/blog/test-automation/automate-login-with-selenium/. If you decide to use this example, do not choose Facebook as your website.

Tip: Does the website use a captcha? You can pause your script for a number of seconds with the `time.sleep()` function and solve the captcha manually in the meantime.
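A fixed sleep either wastes time or ends too early, for example when solving a captcha takes longer than expected. The sketch below shows the alternative idea of polling a condition until it holds or a timeout expires. The helper `wait_until` and the stand-in condition are hypothetical names written for this illustration; Selenium ships its own version of this pattern as `WebDriverWait` in `selenium.webdriver.support.ui`.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without the condition becoming true.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in condition for demonstration only: becomes true on the third check.
checks = []


def ready_on_third_check():
    checks.append(1)
    return len(checks) >= 3


wait_until(ready_on_third_check, timeout=5.0, poll_interval=0.1)
print(f"condition met after {len(checks)} checks")
```

With a real driver, the condition could be something like `lambda: driver.find_elements(By.LINK_TEXT, "Bookmarked Challenges")` (with `By` imported from `selenium.webdriver.common.by`), since `find_elements` returns an empty list until a matching element exists.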

In [33]:
# passwords removed for security
EMAIL = ""
PASS = ""
INDEED_PASS = ""

In [34]:
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
# webdriver-manager downloads a matching chromedriver, so the path does
# not have to be set manually (requires 'pip install webdriver-manager')
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# go to main website
driver.get("https://www.hackerrank.com/auth/login")

# login with credentials
driver.find_element(By.ID, "input-1").send_keys(EMAIL)
driver.find_element(By.ID, "input-2").send_keys(PASS)
driver.find_element(By.XPATH, "//span[contains(text(), 'Log In')]").click()

# wait for page to load
time.sleep(3)

# click on element of choice
driver.find_element(By.LINK_TEXT, "Bookmarked Challenges").click()

# scroll to bottom
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")

# save the page (explicit encoding so non-ASCII characters are written safely)
with open('selenium_page.html', 'w', encoding='utf-8') as f:
    f.write(driver.page_source)



Current google-chrome version is 100.0.4896
Get LATEST chromedriver version for 100.0.4896 google-chrome
Driver [C:\Users\sausbot\.wdm\drivers\chromedriver\win32\100.0.4896.60\chromedriver.exe] found in cache


## Measuring personalization

In this exercise you will have to imitate the study described in class on a website of your interest. You will have to measure differences in the content that you receive back from the website under varying treatments. 

You will have to choose a website and a treatment. Use selenium for this exercise as well. 
- As for websites, you can pick an online store, a travel site, a news site, Google News, etc. Basically, try to pick something that you suspect gives different results to different searchers.
- Examples of treatments are location, being logged in with an account, history with the website, being on a phone vs. a desktop, etc.
- You can run multiple searches to make sure you are measuring a real phenomenon and not only noise.
- You can include a control treatment in case you suspect there is A/B testing or noise in how the pages look.
- Finally, you have to pick a measure for the differences on the page. If you receive items on a page, for example URLs or products, you can define an overlap metric. If the page is more unstructured, come up with an explanation for how you define differences.
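One common choice of overlap metric is the Jaccard index: the number of items that appear under both treatments divided by the number of distinct items seen overall. The sketch below is only one possible definition; the function name and the placeholder item identifiers are assumptions made for illustration, not data from a real page.

```python
def jaccard_overlap(items_a, items_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of hashable items."""
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    if not union:
        # Two empty result pages are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Placeholder identifiers standing in for URLs or product IDs.
treatment = ["url-1", "url-2", "url-3", "url-4"]
control = ["url-3", "url-4", "url-5"]

# 2 shared items out of 5 distinct items
print(jaccard_overlap(treatment, control))
```

A value of 1.0 means both treatments returned the same set of items, and 0.0 means they share nothing. The metric ignores ranking, so a page that shows the same items in a different order still scores 1.0.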

As your answer, explain which of the above you chose, how you implemented the experiment, and what difference you found in the pages you collected. 

You can find more info on how to run multiple browsers at the same time here: https://crossbrowsertesting.com/blog/selenium/run-test-multiple-browsers-parallel-selenium/

In [38]:
'''
The aim of the experiment is to compare the salaries shown in job search results when logged into
an Indeed account that contains my name and resume data with the results of the same search
made without being logged in.

I will search for software engineering jobs. My metric will be the salary displayed
to me versus an anonymous user in the first 15 results.
'''

import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

def indeed_login(driver):
    # go to the login page
    driver.get("https://secure.indeed.com/auth")

    # enter account email
    driver.find_element(By.ID, "ifl-InputFormField-3").send_keys(EMAIL)
    driver.find_elements(By.XPATH, "//span[contains(text(), 'Continue')]")[-1].click()

    # wait for page to load
    time.sleep(3)

    # authenticate with password
    driver.find_element(By.ID, "ifl-InputFormField-121").send_keys(INDEED_PASS)
    driver.find_element(By.XPATH, "//span[contains(text(), 'Sign in')]").click()

def run_experiment(login: bool, driver):
    # log in if that option is enabled
    if login:
        indeed_login(driver)

    # navigate to job search
    driver.get("https://ca.indeed.com")

    # search for software engineering jobs; no location is typed in,
    # so the search uses the browser location
    what = driver.find_element(By.ID, "text-input-what")
    what.send_keys("software engineer")
    what.submit()

    # collect the text lines of each job card (located by class name)
    job_results = []
    job_cards = driver.find_elements(By.CLASS_NAME, "jobCard_mainContent")
    for job in job_cards:
        job_results.append(job.text.split('\n'))
    print(job_results)
    return job_results

# Note: we could run this in parallel, but I start getting captcha verification
# if there is too much suspicious activity, so we create a new webdriver after
# the first one is done
driver_2 = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
login_results = run_experiment(login=True, driver=driver_2)
driver_2.quit()
time.sleep(10)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
no_login_results = run_experiment(login=False, driver=driver)
driver.quit()



Current google-chrome version is 100.0.4896
Get LATEST chromedriver version for 100.0.4896 google-chrome
Driver [C:\Users\sausbot\.wdm\drivers\chromedriver\win32\100.0.4896.60\chromedriver.exe] found in cache


[['.Net Developer', 'CompTrak', 'Temporarily Remote in Aurora, ON', '$60,000 - $85,000 a year'], ['Computer Programmer', 'Highlight Motor Freight Inc.', 'Vaughan, ON', '$39 an hour'], ['new', 'Software Developer', 'Milano Software2.3', 'Temporarily Remote in Richmond Hill, ON', '$60,000 - $80,000 a year'], ['ROBOTICS SOFTWARE ENGINEER', 'Maple Advanced Robotics Inc', 'Greater Toronto Area, ON', '$80,000 a year'], ['new', 'C# Software Developer', 'Molarray Research Inc.', 'Richmond Hill, ON', '$60,000 - $90,000 a year'], ['Software Engineer', 'Clutch Canada', 'Toronto, ON', '$90,000 - $155,000 a year'], ['new', 'Software Engineer', 'Akamori Designs', 'Remote in Toronto, ON', '$20 - $32 an hour'], ['Founding Engineer / Software Developer', 'Ratio', 'Toronto, ON', '$100,000 - $200,000 a year'], ['Software Engineer', 'iPartner Staffing4.5', 'Toronto, ON', '$60 - $70 an hour'], ['new', 'Application Developer', 'Intellifi Corporation (formerly Delta 360 Inc.)', 'Temporarily Remote in Markham



Current google-chrome version is 100.0.4896
Get LATEST chromedriver version for 100.0.4896 google-chrome
Driver [C:\Users\sausbot\.wdm\drivers\chromedriver\win32\100.0.4896.60\chromedriver.exe] found in cache


[['new', 'C# Software Developer', 'Molarray Research Inc.', 'Richmond Hill, ON', '$60,000 - $90,000 a year'], ['new', 'Software Developer', 'Milano Software2.3', 'Temporarily Remote in Richmond Hill, ON', '$60,000 - $80,000 a year'], ['new', 'Software Engineer', 'Akamori Designs', 'Remote in Toronto, ON', '$20 - $32 an hour'], ['new', 'Software Developer', 'Interactive Sports Technologies Inc.', 'Concord, ON', '$20 - $29 an hour'], ['new', 'Software Engineer Intern', 'spaice', 'Markham, ON', '$20 an hour'], ['Full Stack Developer', 'iPartner Staffing4.5', 'Toronto, ON', '$55 - $65 an hour'], ['Computer Programmer', 'Highlight Motor Freight Inc.', 'Vaughan, ON', '$39 an hour'], ['Software Engineer | Front-End (Flutter)', 'CMiC3.2', 'Remote in Toronto, ON'], ['Software Engineer', 'Clutch Canada', 'Toronto, ON', '$90,000 - $155,000 a year'], ['Software Engineer', 'HSBC4.0', 'Toronto, ON'], ['Software Developer', 'Kelly Engineering', 'Greater Toronto Area, ON', '$80,000 - $120,000 a year']

In [74]:
'''
Here we parse the two result lists saved from the experiment
'''
import re

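def parse_value(salary, pay_period):
    '''
    Parse a salary string with a regex and return its value, or the midpoint
    if the string holds a range. Yearly values come out in thousands because
    "$60,000" is split into the digit groups "60" and "000".
    '''
    find = re.findall(r'\d+', salary)
    if len(find) == 4 and pay_period == "YEAR":
        # range such as "$60,000 - $85,000": average the thousands groups
        value = (int(find[0]) + int(find[2])) / 2
    elif len(find) == 2 and pay_period == "HOUR":
        # range such as "$20 - $32"
        value = (int(find[0]) + int(find[1])) / 2
    else:
        # single value such as "$80,000" or "$39"
        value = float(find[0])
    return value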

def calculate_avg_salary(results):
    '''
    Calculate the average yearly salary, the average hourly rate and the
    number of postings with no salary
    '''
    total_yearly = 0
    total_hourly = 0
    no_of_hourly_samples = 0
    no_of_yearly_samples = 0
    no_of_postings = 0

    for result in results:
        # the salary, if present, is the last text line of the job card
        salary = result[-1]
        if '$' in salary and 'a year' in salary:
            value = parse_value(salary, "YEAR")
            no_of_yearly_samples += 1
            total_yearly += float(value)
        elif '$' in salary and 'an hour' in salary:
            value = parse_value(salary, "HOUR")
            no_of_hourly_samples += 1
            total_hourly += float(value)

        no_of_postings += 1

    # avoid dividing by zero when a category has no samples
    avg_yearly = total_yearly / no_of_yearly_samples if no_of_yearly_samples else float('nan')
    avg_hourly = total_hourly / no_of_hourly_samples if no_of_hourly_samples else float('nan')

    print(f"The average salary is: ${avg_yearly:.2f}K with {no_of_yearly_samples} samples")
    print(f"The average hourly rate is: ${avg_hourly:.2f} with {no_of_hourly_samples} samples")
    print(f"No salary postings: {no_of_postings-no_of_hourly_samples-no_of_yearly_samples}")

print("For the login experiment:")
calculate_avg_salary(login_results)
print("\n")
print("For the no-login experiment:")
calculate_avg_salary(no_login_results)

For the login experiment:
The average salary is: $97.00K with 10 samples
The average hourly rate is: $38.62 with 4 samples
No salary postings: 1


For the no-login experiment:
The average salary is: $91.88K with 4 samples
The average hourly rate is: $32.42 with 6 samples
No salary postings: 5
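The parsing cell above extracts runs of digits, so "$60,000" becomes the two groups "60" and "000" and yearly figures come out in thousands. As a possible alternative, the sketch below matches whole dollar amounts, including thousands separators and decimals, and returns the midpoint together with the pay period. The names `AMOUNT_RE` and `parse_salary` are hypothetical and not part of the original notebook; the sample strings are taken from the scraped output above.

```python
import re

# Matches dollar amounts such as "$60,000", "$39" or "$20.50".
AMOUNT_RE = re.compile(r"\$(\d[\d,]*(?:\.\d+)?)")


def parse_salary(text):
    """Return (midpoint, period) for a salary string, or None if it has no salary.

    `period` is "year" or "hour"; any other pay period is reported as "other".
    """
    amounts = [float(m.replace(",", "")) for m in AMOUNT_RE.findall(text)]
    if not amounts:
        return None
    if "a year" in text:
        period = "year"
    elif "an hour" in text:
        period = "hour"
    else:
        period = "other"
    # For a single amount the midpoint is the amount itself.
    return sum(amounts) / len(amounts), period


for sample in [
    "$60,000 - $85,000 a year",
    "$39 an hour",
    "$20 - $32 an hour",
    "Remote in Toronto, ON",
]:
    print(sample, "->", parse_salary(sample))
```

Returning `None` for lines without a dollar amount makes the "no salary" postings explicit, so they do not have to be inferred by subtraction.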


The experiment:
I implemented an experiment that searches the job board Indeed under two treatments: (A) logged in with an account that contains my resume data and (B) as an anonymous user with no data. My resume includes 2 internships in various engineering roles and 1.5 years of full-time work in a software-related job, totalling 3 years of experience. I parsed the results and compared the average yearly salary, the average hourly rate, and the number of postings with no salary listed. The search criterion was "Software Engineer" jobs, and the location was set to my browser location of Richmond Hill, ON (my hometown).

The results:
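When logged in, the average yearly salary I received was higher by around $5,000 ($97.00K vs. $91.88K), and the average hourly rate was higher by around $6 ($38.62 vs. $32.42). Only one posting in the first 15 results had no salary data, compared to 5 postings for an anonymous user. Note that these averages rest on few samples (10 vs. 4 yearly salaries and 4 vs. 6 hourly rates).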

The not-logged-in results contain postings for junior and intern software engineers, which could be because my hometown has a lot of schools and is near some universities, so the average user is probably searching for internships. The postings I received when logged in did not include these. My logged-in results also had more applied software positions related to my past roles, such as "Robotics", which did not show up without personalization. Some results are the same in both cases, such as "Interactive Sports Technologies Inc". This is interesting because nothing on my resume indicates that I am interested in sports. Perhaps this shows how sponsored content might come into play in job listings.
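The comparison of shared and treatment-specific postings can also be done programmatically instead of by eye. The sketch below reduces each scraped job card to a (title, company) pair, skipping the optional leading 'new' badge, and then uses set operations. It runs on a three-row excerpt of each result list copied from the output above, so it only illustrates the mechanics and is not the full result; `posting_key` is a hypothetical helper name.

```python
def posting_key(card):
    """Reduce a scraped job card (list of text lines) to a (title, company) pair."""
    # Some cards start with a 'new' badge before the title.
    fields = card[1:] if card[0] == "new" else card
    return fields[0], fields[1]


# Excerpts of the scraped output above (company names are kept exactly as
# scraped, including any rating fused onto the name).
login_excerpt = [
    ['Computer Programmer', 'Highlight Motor Freight Inc.', 'Vaughan, ON', '$39 an hour'],
    ['new', 'Software Developer', 'Milano Software2.3',
     'Temporarily Remote in Richmond Hill, ON', '$60,000 - $80,000 a year'],
    ['ROBOTICS SOFTWARE ENGINEER', 'Maple Advanced Robotics Inc',
     'Greater Toronto Area, ON', '$80,000 a year'],
]
no_login_excerpt = [
    ['new', 'Software Developer', 'Milano Software2.3',
     'Temporarily Remote in Richmond Hill, ON', '$60,000 - $80,000 a year'],
    ['new', 'Software Engineer Intern', 'spaice', 'Markham, ON', '$20 an hour'],
    ['Computer Programmer', 'Highlight Motor Freight Inc.', 'Vaughan, ON', '$39 an hour'],
]

login_keys = {posting_key(card) for card in login_excerpt}
no_login_keys = {posting_key(card) for card in no_login_excerpt}

shared = login_keys & no_login_keys
only_login = login_keys - no_login_keys
only_no_login = no_login_keys - login_keys

print("Shared:", sorted(shared))
print("Only when logged in:", sorted(only_login))
print("Only when anonymous:", sorted(only_no_login))
```

Keying on title and company rather than the whole card means that a posting still counts as the same one if its location line or salary text differs slightly between the two sessions.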

Further Ideas:
A larger-scale experiment that checks more postings would provide better insight into the data than looking at only 15 postings.

It would also be interesting to change details in my resume, such as my name (to a male name) or my age (to be older), and see whether that makes a difference in the search results.