# Counties in Maryland

Use web scraping to gather the information from the table in Wikipedia's list of counties in Maryland: https://en.wikipedia.org/wiki/List_of_counties_in_Maryland#List_of_counties

The information to include in your final dataframe is:

County Name<br>
FIPS Code<br>
County Seat<br>
Established (year)<br>
Origin<br>
Etymology<br>
Population<br>
Area<br>

Upload your completed Jupyter notebook to GitHub and submit the URL for this assignment.

In [26]:
import requests
URL = "https://en.wikipedia.org/wiki/List_of_counties_in_Maryland"
page = requests.get(URL)
type(page)
page.status_code

200

In [27]:
from bs4 import BeautifulSoup

HTMLstr = page.text

soup = BeautifulSoup(HTMLstr, "html.parser")

In [28]:
right_table=soup.find('table', class_='wikitable sortable')

So, all we want is...<br>
1. County Name<br>
2. FIPS Code<br>
3. County Seat<br>
4. Established (year)<br>
5. Origin<br>
6. Etymology<br>
7. Population<br>
8. Area<br>

And the table has data for...<br>
1. County<br>
2. FIPS Code<br>
3. County Seat<br>
4. Est (year)<br>
5. Origin<br>
6. Etymology<br>
7. Flag<br>
8. Seal<br>
9. Population<br>
10. Area<br>
11. Map<br>

So we need to skip a few columns (Flag and Seal) and stop before the last one (Map). We only want columns 1-6 and 9-10 in the final dataframe.
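The column selection described above can be sketched on a plain Python list, independent of the scraping code. The helper name `pick_columns`, the constant `WANTED`, and the sample row are hypothetical and only illustrate the idea; the indices are the 0-based positions of columns 1-6 and 9-10 among the eleven table columns.

```python
# Hypothetical helper, not part of the assignment code.
# 0-based positions of columns 1-6 and 9-10 in the 11-column table.
WANTED = [0, 1, 2, 3, 4, 5, 8, 9]

def pick_columns(row, wanted=WANTED):
    """Return the values of `row` at the positions listed in `wanted`."""
    return [row[i] for i in wanted]

# Made-up row: one placeholder label per table column, in page order.
sample_row = ["County", "FIPS Code", "County Seat", "Est", "Origin",
              "Etymology", "Flag", "Seal", "Population", "Area", "Map"]

selected = pick_columns(sample_row)
print(selected)
```

In the scraping loop itself, the county name sits in a `<th>` tag rather than a `<td>` tag, so the `<td>` indices are shifted down by one: 0-4 for FIPS Code through Etymology, and 7-8 for Population and Area.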

In [29]:
#set empty lists to hold data of each column
A=[] #th[0] Name
B=[] #td[0] FIPS
C=[] #td[1] Seat
D=[] #td[2] Est
E=[] #td[3] Origin
F=[] #td[4] Etymology
G=[] #td[7] Pop
H=[] #td[8] Area

#find all <tr> tags in the table and go through each one (one <tr> per table row)
for row in right_table.find_all("tr"):

    #get the <th> tags, as the names are stored in these before the data, which is in <td> tags
    heads = row.find_all('th')
    A.append(heads[0].find(string=True)) # gets the first text in the County Name column and adds it to list A

    #get all the <td> tags for this <tr> tag
    cells = row.find_all('td')

    #a data row should have 10 <td> tags, though we will not be using all of them
    if len(cells) == 10:
        B.append(cells[0].find(string=True)) # gets the first text in the FIPS column and adds it to list B
        C.append(cells[1].find(string=True)) # gets the first text in the Seat column and adds it to list C
        D.append(cells[2].find(string=True)) # gets the first text in the Est column and adds it to list D
        E.append(cells[3].find(string=True)) # gets the first text in the Origin column and adds it to list E
        F.append(cells[4].find(string=True)) # gets the first text in the Etymology column and adds it to list F
        G.append(cells[7].find(string=True)) # gets the first text in the Population column and adds it to list G
        H.append(cells[8].find(string=True)) # gets the first text in the Area column and adds it to list H

At this point, I realized that the scrape also pulled in the column header for the county name. The easiest fix is to build the dataframe column from elements 1 through the end of list A, rather than trying to exclude the header during the initial scrape. The page itself does seem to separate the body of the table from the header using thead and tbody, but when I look at the table pulled in through right_table, it shows the entire table within a single tbody tag, which makes it difficult to fix the initial scrape.
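As an alternative to slicing the list afterwards, the header row could be skipped inside the loop by checking whether a row has any data cells. The sketch below is a hedged illustration only: it does not call BeautifulSoup, and each tuple in the hypothetical `rows` list stands in for what `row.findAll('th')` and `row.findAll('td')` would return as text for one `<tr>`. The sample values are abbreviated, made-up rows.

```python
# Each tuple mimics one <tr>: (texts of its <th> cells, texts of its <td> cells).
# This is stand-in data, not output from the real page.
rows = [
    (["County", "FIPS Code", "County Seat"], []),    # header row: only <th> cells
    (["Allegany County"], ["001", "Cumberland"]),
    (["Anne Arundel County"], ["003", "Annapolis"]),
]

names, fips = [], []
for heads, cells in rows:
    if not cells:
        # The header row has no <td> cells, so skip it entirely.
        continue
    names.append(heads[0])
    fips.append(cells[0])

# Both lists now have the same length, so no [1:] slice is needed later.
print(names)
print(len(names) == len(fips))
```

Either approach gives the same county names; the guard just keeps every list aligned from the start, assuming header rows never contain `<td>` cells.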

In [31]:
A

['County',
 'Allegany County',
 'Anne Arundel County',
 'Baltimore County',
 'Baltimore City',
 'Calvert County',
 'Caroline County',
 'Carroll County',
 'Cecil County',
 'Charles County',
 'Dorchester County',
 'Frederick County',
 'Garrett County',
 'Harford County',
 'Howard County',
 'Kent County',
 'Montgomery County',
 "Prince George's County",
 "Queen Anne's County",
 "Saint Mary's County",
 'Somerset County',
 'Talbot County',
 'Washington County',
 'Wicomico County',
 'Worcester County']

In [33]:
#import pandas to convert list to data frame
import pandas as pd

df=pd.DataFrame(A[1:], columns=['County']) #turn list A into dataframe first

#add other lists as new columns in my new dataframe
df['FIPS Code'] = B
df['County Seat'] = C
df['Established Year'] = D
df['Origin'] = E
df['Etymology'] = F
df['Population'] = G
df['Area'] = H

#show first 5 rows of created dataframe
df.head()

| | County | FIPS Code | County Seat | Established Year | Origin | Etymology | Population | Area |
|---|---|---|---|---|---|---|---|---|
| 0 | Allegany County | 1 | Cumberland | 1789 | Formed from part of Washington County. | From the Lenape Indian word | 74012 | 430 |
| 1 | Anne Arundel County | 3 | Annapolis | 1650 | Formed from part of St. Mary's County. | Anne Arundell | 550488 | 588 |
| 2 | Baltimore County | 5 | Towson | 1659 | Formed from unorganized territory | Cecil Calvert, 2nd Baron Baltimore | 817455 | 682 |
| 3 | Baltimore City | 510 | Baltimore City | 1851 | Founded in 1729. Detached in 1851 from Baltimo... | Cecil Calvert, 2nd Baron Baltimore | 621342 | 92 |
| 4 | Calvert County | 9 | Prince Frederick | 1654 | Formed as Patuxent County from unorganized ter... | The | 89628 | 345 |
