# Career Scraper Lab

### Introduction

In this lesson, we'll work with data collected by a scraper that pulls job postings from Indeed.com.  The scraper was built to learn more about data engineering positions, and tech positions in general.

We can use the data to determine what skills are needed by data engineers, the kinds of companies hiring data engineers, and where they are being hired. 

### Connecting to our database

We can begin by using the psycopg2 library.  This library is already installed on Google Colab, so there is no need to install it there.  To install it on our own computer, we can run the following:

In [3]:
!pip install psycopg2-binary



Then we should be able to import the library.  We can then use it to connect to a Postgres database, one hosted on Jigsaw's Amazon (AWS) account.

In [4]:
import psycopg2


In [None]:
DB_NAME="careers"
DB_HOST="career-scraper.crd5vw1vref2.us-east-1.rds.amazonaws.com"
DB_USER="student"
DB_PASSWORD="jigsaw_student"


In [None]:
import psycopg2
# Build the connection URL from the settings defined above
db_url = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/{DB_NAME}"
conn = psycopg2.connect(db_url)


In [43]:
db_url

'postgresql://student:jigsaw_student@career-scraper.crd5vw1vref2.us-east-1.rds.amazonaws.com/careers'

In [44]:
cursor = conn.cursor()

And from here, we can see all of the tables listed.

In [45]:
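# List every table in the public schema
cursor.execute("""SELECT table_name FROM information_schema.tables
       WHERE table_schema = 'public'""")
tables = []
for table in cursor.fetchall():
    tables.append(table[0])  # each row is a one-element tuple
# Display everything except the first table returned
tables[1:]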

['states',
 'cities',
 'scrapings',
 'scraped_pages',
 'cards',
 'companies',
 'position_locations',
 'position_skills',
 'skills',
 'positions',
 'job_titles']

So there are a number of tables listed, but we can ignore the `scrapings`, `scraped_pages`, and `cards` tables.  This leaves us with the following relevant tables.

In [46]:
relevant_tables = ['states', 'cities', 'companies',
 'position_locations', 'position_skills', 'skills', 'positions',
 'job_titles']

And we can see the columns of each of these tables with the following:

In [47]:
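for relevant_table in relevant_tables:
    # LIMIT 0 returns no rows but still fills in cursor.description
    cursor.execute(f"SELECT * FROM {relevant_table} LIMIT 0")
    # The first entry of each description item is the column name
    print(relevant_table, [desc[0] for desc in cursor.description])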

states ['id', 'name', 'timestamp']
cities ['id', 'name', 'state_id', 'timestamp']
companies ['id', 'name', 'timestamp']
position_locations ['id', 'position_id', 'city_id', 'state_id', 'is_remote', 'timestamp']
position_skills ['id', 'position_id', 'skill_id']
skills ['id', 'name', 'timestamp']
positions ['id', 'source_id', 'card_id', 'title', 'description', 'minimum_salary', 'maximum_salary', 'minimum_experience', 'maximum_experience', 'company_id', 'timestamp', 'date_posted', 'query_string', 'job_title_id']
job_titles ['id', 'name', 'timestamp']
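
The `cursor.description` attribute used above can also label the rows a query returns, which makes results easier to read while exploring. Below is a minimal sketch: the helper name `rows_as_dicts` is hypothetical, and an in-memory SQLite table with one made-up row stands in for the course database so the example runs anywhere.

```python
import sqlite3


def rows_as_dicts(cursor):
    """Pair each fetched row with the column names in cursor.description."""
    columns = [desc[0] for desc in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Stand-in data (invented) so the sketch runs without the course database.
demo_conn = sqlite3.connect(":memory:")
demo_cursor = demo_conn.cursor()
demo_cursor.execute("CREATE TABLE states (id INTEGER, name TEXT)")
demo_cursor.execute("INSERT INTO states VALUES (1, 'New York')")

demo_cursor.execute("SELECT * FROM states")
state_rows = rows_as_dicts(demo_cursor)
print(state_rows)
```

The same helper should work with the psycopg2 `cursor` from earlier, since both libraries put the column name first in each `description` entry.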


Begin by looking at the columns in the tables.  At this point, it's worth diagramming the structure of these tables.  Look at the foreign keys to determine how these tables relate to one another -- specify the relations.

* Which tables do you think we will be relying on the most?
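
As a starting point for the diagram, here is a rough sketch that guesses the relations from column names alone. It assumes a naming convention (a column called `state_id` points at the `states` table) rather than reading the real foreign key constraints, so treat the result as a hypothesis to check. The helper `fk_column_for` is hypothetical, and the column lists are copied from the output above.

```python
# Column lists copied from the notebook output above.
table_columns = {
    "states": ["id", "name", "timestamp"],
    "cities": ["id", "name", "state_id", "timestamp"],
    "companies": ["id", "name", "timestamp"],
    "position_locations": ["id", "position_id", "city_id", "state_id",
                           "is_remote", "timestamp"],
    "position_skills": ["id", "position_id", "skill_id"],
    "skills": ["id", "name", "timestamp"],
    "positions": ["id", "source_id", "card_id", "title", "description",
                  "minimum_salary", "maximum_salary", "minimum_experience",
                  "maximum_experience", "company_id", "timestamp",
                  "date_posted", "query_string", "job_title_id"],
    "job_titles": ["id", "name", "timestamp"],
}


def fk_column_for(table_name):
    """Guess the column name that would reference table_name."""
    if table_name.endswith("ies"):
        singular = table_name[:-3] + "y"   # cities -> city
    elif table_name.endswith("s"):
        singular = table_name[:-1]         # states -> state
    else:
        singular = table_name
    return singular + "_id"


# Map each guessed foreign-key column back to the table it would point at.
fk_lookup = {fk_column_for(name): name for name in table_columns}

relations = [
    (child, column, fk_lookup[column])
    for child, columns in table_columns.items()
    for column in columns
    if column in fk_lookup
]

for child, column, parent in relations:
    print(f"{child}.{column} -> {parent}.id")
```

Columns such as `source_id` and `card_id` are left out because they do not match any of the relevant tables.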

### Answering Questions

> Hint: Before diving into the questions below, it may be useful to explore the data a little, and ask some questions of the data.

#### Assessing the data

Now this scraper does not have data on all of the jobs in the US, so it's good to start by getting a sense of the data.  What type of data does it have the most of?

* For example, what `job_titles` appear most frequently in the database?
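
One possible shape for the job title question is sketched below. It assumes `positions.job_title_id` references `job_titles.id`, which the column names suggest but the lab does not confirm. The rows are invented, and an in-memory SQLite database stands in for the course database so the sketch runs on its own.

```python
import sqlite3

# Toy stand-in with only the columns the query needs; all rows are made up.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE job_titles (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE positions (id INTEGER PRIMARY KEY, title TEXT,
                            job_title_id INTEGER);
    INSERT INTO job_titles VALUES (1, 'data engineer'), (2, 'data analyst');
    INSERT INTO positions VALUES
        (1, 'Sr. Data Engineer', 1),
        (2, 'Data Engineer II', 1),
        (3, 'Analyst', 2);
""")

# Count positions per job title, most frequent first.
TOP_JOB_TITLES = """
    SELECT job_titles.name, COUNT(*) AS num_positions
    FROM positions
    JOIN job_titles ON positions.job_title_id = job_titles.id
    GROUP BY job_titles.name
    ORDER BY num_positions DESC, job_titles.name
    LIMIT 10
"""

top_titles = demo.execute(TOP_JOB_TITLES).fetchall()
print(top_titles)
```

Against the real database, the same SQL string could be passed to `cursor.execute(TOP_JOB_TITLES)` and read back with `cursor.fetchall()`.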

Then ask similar questions to see what type of data has been collected.

> **Hint**: Think about the dimensions in the data (who, what, where, when).

### Diving into data engineers

* What are the top skills required of data engineers?

* What is the average salary of a data engineer?

* How does the average salary of data engineers change based on the experience requirements (write a query that calculates the average salary for each year of minimum_experience)?

* What is the average salary for a data engineering position in New York City?

* Find the top five cities that have the highest average salaries for entry level data engineers.

### Choosing a Career

* Now perhaps we want to use this database to help us choose a specific profession.  What are some questions we can ask?

We can ask questions about: 

* Salary based on job title
* Geographic location of jobs
* Salary based on geographic location
* Can you determine which positions are more entry level?  Is there a career path from one profession to another?

### Choosing a skillset

* Of course, we also may want to determine what skills are most valuable to learn. 
* We could ask questions about:
    * What are the top skills requested?
    * Which skills are most associated with entry level jobs?
    * Which skills are most associated with a higher salary?
    * Does demand for a skillset differ based on region?  Or based on position?
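
Questions about skills go through the `position_skills` join table, so a sketch of that join pattern may help. It assumes `position_skills.skill_id` references `skills.id`, uses invented rows, and runs against an in-memory SQLite stand-in rather than the course database.

```python
import sqlite3

# Toy stand-in with made-up rows; only the columns the query needs.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE skills (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE position_skills (id INTEGER PRIMARY KEY,
                                  position_id INTEGER, skill_id INTEGER);
    INSERT INTO skills VALUES (1, 'sql'), (2, 'python'), (3, 'spark');
    INSERT INTO position_skills VALUES
        (1, 10, 1), (2, 10, 2),
        (3, 11, 1),
        (4, 12, 1), (5, 12, 2), (6, 12, 3);
""")

# Count how many distinct positions request each skill.
TOP_SKILLS = """
    SELECT skills.name,
           COUNT(DISTINCT position_skills.position_id) AS num_positions
    FROM position_skills
    JOIN skills ON position_skills.skill_id = skills.id
    GROUP BY skills.name
    ORDER BY num_positions DESC, skills.name
    LIMIT 5
"""

top_skills = demo.execute(TOP_SKILLS).fetchall()
print(top_skills)
```

Narrowing this to one profession or region would mean adding further joins (for example to `positions` or `position_locations`) and a `WHERE` clause, which is left as part of the lab.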