# Gentle introduction to web scraping with Pandas

- DAPL
- Dartmouth 
- Spring 2022
- Author: [Spencer Bertsch](https://github.com/spencerbertsch1)

In this notebook we will look at [this Wikipedia article](https://en.wikipedia.org/wiki/1999%E2%80%932000_FA_Premier_League) and the [Indeed job listings](https://www.indeed.com/jobs?q=data%20analyst&l&vjk=3187467df45bfdba) for data analytics positions.

Let's first look at information about the FA Premier League in 1999-2000 and answer a few questions about the data displayed on the web.

In [12]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

Import the data as a list of pandas dataframes and find out how many tables were read into memory.

Use the following URL: https://en.wikipedia.org/wiki/1999%E2%80%932000_FA_Premier_League

In [13]:
URL = "https://en.wikipedia.org/wiki/1999%E2%80%932000_FA_Premier_League"
page = requests.get(URL)
html_text = page.text

print(type(html_text))

<class 'str'>


Let's look at what the HTML page looks like in raw text format (not so great).

vvv remember to re-comment 

In [14]:
# html_text  # <-- uncomment and run, then re-comment and run

^^^ remember to re-comment

In [15]:
# use pandas to read the url
dfs: list = pd.read_html(URL)

# print the number of tables read into memory
print(f'Number of tables read: {len(dfs)}')

Number of tables read: 21
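
With this many tables in the list, it can help to scan each table's shape and column names before picking one to work with. The following is a minimal sketch, assuming a small hand-built list stands in for `dfs` so that it runs without a network connection. The helper name `summarize_tables` and all of the toy data are made up for illustration and do not come from the Wikipedia page.

```python
import pandas as pd


def summarize_tables(tables: list) -> pd.DataFrame:
    """Return one row per table: its list index, shape, and column names."""
    rows = []
    for i, df in enumerate(tables):
        rows.append({
            "index": i,
            "n_rows": df.shape[0],
            "n_cols": df.shape[1],
            "columns": [str(c) for c in df.columns],
        })
    return pd.DataFrame(rows)


# Hypothetical stand-in for the list returned by pd.read_html(URL)
toy_dfs = [
    pd.DataFrame({"Key": ["a"], "Value": [1]}),
    pd.DataFrame({
        "Team": ["Team A", "Team B"],
        "Stadium": ["Ground X", "Ground Y"],
        "Capacity": [1000, 2000],
    }),
]

summary = summarize_tables(toy_dfs)
print(summary)

# List positions of the tables that have a column with a given name
matches = [i for i, df in enumerate(toy_dfs) if "Capacity" in df.columns]
print(matches)
```

The same helper could be called on the real `dfs` list; the column names it reports will depend on how pandas parsed each table.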


## Time to answer some questions! 

1. Which team has the stadium that can hold the most fans?

In [1]:
# find the dataframe in the list that holds the relevant information. 

# -- you write this part -- 

In [2]:
# order by Capacity and grab the information from the first record. 

# -- you write this part -- 

In [3]:
# print the solution


2. How many teams are listed in the table we just looked at? 

In [4]:
# -- you write this part -- 

In [5]:
# print the solution


## Indeed Webpage Scraping

Now we will use pandas to scrape a different website! 

In [11]:
URL = "https://www.indeed.com/jobs?q=data%20analyst&l&vjk=cbcdc3d043ce15f7"

In [12]:
# use pandas to read the url
dfs: list = pd.read_html(URL)

# print the number of tables read into memory
print(f'Number of tables read: {len(dfs)}')

Number of tables read: 33


In [13]:
len(dfs)

33

Let's inspect the dataframes to see which ones actually hold the job descriptions.

In [14]:
# display the dataframe at index 5 (the sixth one in the list)
dfs[5]

                                                   0
0  newHealthcare / Healthtech Business Data Analy...


In [15]:
# display the pandas Series for column 0
dfs[5][0]

0    newHealthcare / Healthtech Business Data Analy...
Name: 0, dtype: object

In [16]:
dfs[5][0][0]

'newHealthcare / Healthtech Business Data AnalystConfidentialRemote$120,000 - $160,000 a yearFull-time'

It seems that every *other* dataframe holds a job description, and each description is stored as a single string inside its dataframe. We can pull the string out of each dataframe by first selecting column 0 as a pandas Series, then grabbing element 0 of that Series.
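
The two-step indexing described above can be tried on a hand-built dataframe. This is a small sketch under the assumption that each job dataframe has exactly one cell, with both the column and the row labelled 0; the listing text in `toy` is a made-up placeholder, not a real job post.

```python
import pandas as pd

# Hypothetical one-cell dataframe shaped like the ones in the Indeed result
toy = pd.DataFrame({0: ["Example Job TitleExample CompanyRemoteFull-time"]})

column = toy[0]    # select the column labelled 0, which gives a pandas Series
first = column[0]  # select the row labelled 0, which gives a Python str

# Position-based equivalent that does not rely on the labels being 0
same = toy.iloc[0, 0]

print(type(column).__name__, type(first).__name__)
print(first == same)
```

Using `iloc` is one way to stay robust if a table ever comes back with different row or column labels.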

Let's iterate through every other element in the list of dataframes, extract the string for each job description, and see what they look like:

In [17]:
# the job descriptions sit at the odd indices 3, 5, ..., 19
for i in range(3, 20, 2):
    print(dfs[i][0][0])

newData Analyst (Remote)CAIA Association5.0Remote in Amherst, MAEstimated $79.2K – $100K a yearFull-time
newHealthcare / Healthtech Business Data AnalystConfidentialRemote$120,000 - $160,000 a yearFull-time
newVirtual Data AnalystSITE Improvement AssociationRemote$21.50 - $24.50 an hourFull-time
newProgram Analyst - Data AnalyticsUS Veterans Health Administration3.9Richmond, VA$82,396 - $107,119 a yearFull-time
Data AnalystAppcast Inc.2.8Lebanon, NH 03766+1 locationEstimated $59.9K – $75.9K a year
Public Health Data AnalystLantana Consulting Group3.9East Thetford, VT 05043Estimated $46.7K – $59.1K a yearFull-time
newAnalyst, Data and AnalysisDigitas3.8Boston, MA+3 locationsEstimated $52.2K – $66K a yearInternship
Data Feed AnalystPCRRemote$75,000 - $110,000 a yearFull-time
Business Analyst1StarrRemote$72,017 - $97,717 a yearFull-time
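
Each string printed above runs the title, company, location, and pay together with no separators. As one possible next step, the pay range can be pulled out with a regular expression. This is only a sketch: the pattern and the helper names `to_number` and `parse_salary` are assumptions written against the handful of formats visible in the output above, and other listings may well use formats it does not cover.

```python
import re

# Matches ranges such as "$79.2K – $100K a year" or "$21.50 - $24.50 an hour"
SALARY_RE = re.compile(
    r"\$([\d,]+(?:\.\d+)?)(K?)\s*[-–]\s*\$([\d,]+(?:\.\d+)?)(K?)\s+an?\s+(year|hour)"
)


def to_number(digits: str, k_suffix: str) -> float:
    """Convert text like '120,000' or '79.2' (with optional 'K') to a float."""
    value = float(digits.replace(",", ""))
    if k_suffix:
        value *= 1000
    return round(value, 2)


def parse_salary(listing: str):
    """Return (low, high, unit) for the first pay range found, else None."""
    m = SALARY_RE.search(listing)
    if m is None:
        return None
    low = to_number(m.group(1), m.group(2))
    high = to_number(m.group(3), m.group(4))
    return low, high, m.group(5)


# Three of the strings printed by the loop above
samples = [
    "newData Analyst (Remote)CAIA Association5.0Remote in Amherst, MAEstimated $79.2K – $100K a yearFull-time",
    "newHealthcare / Healthtech Business Data AnalystConfidentialRemote$120,000 - $160,000 a yearFull-time",
    "newVirtual Data AnalystSITE Improvement AssociationRemote$21.50 - $24.50 an hourFull-time",
]

for s in samples:
    print(parse_salary(s))
```

Returning `None` when nothing matches keeps listings without a visible pay range from raising an error, so they can be filtered out later.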


## Open questions: 

What could we do with this data? What insights could we draw from these job listings?

- Talk to your neighbors and come up with two or three use cases