# Test 1
## Task 1
Integrate information from the “job_url_data” folder into one dictionary. The keys of the dictionary are individual urls scraped from the website, and the values are the earliest date that the corresponding urls were scraped. Save the results into a json file. How many unique job urls have been collected between May 17, 2022 and May 23, 2022? 

In [161]:
import pandas
import os
os.chdir(r'C:/Users/SLdra/Documents/Me/BC/Advance/Data/jupyter/Test 1/job_url_data')

In [162]:
os.listdir() #need to reverse it

['.DS_Store',
 'job_urls_for_parsehub_5172022_v1.csv',
 'job_urls_for_parsehub_5172022_v2.csv',
 'job_urls_for_parsehub_5172022_v3.csv',
 'job_urls_for_parsehub_5172022_v4.csv',
 'job_urls_for_parsehub_5172022_v5.csv',
 'job_urls_for_parsehub_5182022_v1.csv',
 'job_urls_for_parsehub_5192022_v1.csv',
 'job_urls_for_parsehub_5192022_v2.csv',
 'job_urls_for_parsehub_5202022_v1.csv',
 'job_urls_for_parsehub_5202022_v2.csv',
 'job_urls_for_parsehub_5212022_v1.csv',
 'job_urls_for_parsehub_5212022_v2.csv',
 'job_urls_for_parsehub_5222022_v1.csv',
 'job_urls_for_parsehub_5222022_v2.csv',
 'job_urls_for_parsehub_5232022_v1.csv',
 'job_urls_for_parsehub_5232022_v2.csv']

In [163]:
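from datetime import datetime
jobdict = {}
#os.listdir() makes no ordering guarantee, so sort explicitly and walk the newest files first
for filename in sorted(os.listdir(), reverse=True):
    if filename.endswith(".csv"):
        filedate = filename.split("_")[-2]
        #file names have no month/day padding (e.g. 5172022);
        #prepending "0" only works for months<10 and days>9
        filedate = datetime.strptime("0"+filedate, '%m%d%Y').strftime("%Y-%m-%d")
        data = pandas.read_csv(filename)
        for url in data['job_url']:
            #keep the earliest scrape date; ISO dates compare correctly as strings
            if url not in jobdict or filedate < jobdict[url]:
                jobdict[url] = filedate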

In [164]:
import json
os.chdir(r'C:/Users/SLdra/Documents/Me/BC/Advance/Data/jupyter/Test 1')
with open("unique_urls.json", "w") as url_file:
    json.dump(jobdict, url_file)

In [165]:
print("There are", len(jobdict), "unique job urls between 2022-05-17 and 2022-05-23")

There are 21260 unique job urls between 2022-05-17 and 2022-05-23
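
The `"0"+filedate` trick above breaks as soon as a file is named for October through December or for a single-digit day. One way around that, sketched here under the assumption that the stamp is always month, then day, then a four-digit year, is to try every month/day split and accept the stamp only when exactly one split is a valid date. The helper name `parse_unpadded_mdy` is made up for this illustration.

```python
from datetime import date


def parse_unpadded_mdy(stamp: str) -> date:
    """Parse a month-day-year stamp whose month and day may lack zero padding."""
    year, month_day = int(stamp[-4:]), stamp[:-4]
    candidates = set()
    # Try every possible boundary between the month digits and the day digits.
    for cut in range(1, len(month_day)):
        try:
            candidates.add(date(year, int(month_day[:cut]), int(month_day[cut:])))
        except ValueError:
            continue  # not a real calendar date, e.g. month 51
    if len(candidates) != 1:
        raise ValueError(f"cannot parse {stamp!r} unambiguously: {sorted(candidates)}")
    return candidates.pop()


print(parse_unpadded_mdy("5172022").isoformat())
```

A stamp such as `1112022` (January 11 or November 1) raises instead of guessing, which seems safer than silently picking one reading.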


## Task 2
Clean and integrate information from the “job_info_data” folder into one data frame. Files from this subfolder might have two different formats. Some of them are csv files, while others are json files. The columns might also be named differently. Find ways to read each of the files into pandas, drop records with missing job titles and/or missing job descriptions, and combine them into one dataframe. Lastly, drop records with duplicate job urls, and then save them into a separate csv file. How many unique jobs are there in the cleaned dataframe? 

In [166]:
os.chdir(r'C:/Users/SLdra/Documents/Me/BC/Advance/Data/jupyter/Test 1/job_info_data')
job_info = pandas.DataFrame(columns=['Company', 'Job Title', 'Location', 'Description', 'Company Link', 'Job Link'])
#column names used by the json files, mapped to the common names
jsonMap = {'link':'Job Link', 'job_title':'Job Title', 'company':'Company', 'company_url':'Company Link',
                      'company_location':'Location', 'job_description':'Description'}
#the csv files use the same names with a "lnks_" prefix
csvMap = {}
for key in jsonMap:
    csvMap["lnks_"+key] = jsonMap[key]
for filename in os.listdir():
    if filename.endswith(".csv"):
        data = pandas.read_csv(filename)
        #stop if a file doesn't have the expected number of columns
        if(len(data.dtypes) != len(csvMap)):
            print("Unusual data, investigate", filename)
            break
        data.rename(columns=csvMap, inplace=True)
        #drop records missing a job title or a description
        data = data.query('`Job Title`.notna() & `Description`.notna()')
        job_info = pandas.concat([job_info, data])
    elif filename.endswith(".json"):
        with open(filename, encoding='UTF8') as jsonfile:
            otherdata = json.load(jsonfile)
            #the records sit under the top-level 'lnks' key
            data = pandas.DataFrame.from_dict(otherdata['lnks'])
            if(len(data.dtypes) != len(jsonMap)):
                print("Unusual data, investigate", filename)
                break
            data.rename(columns=jsonMap, inplace=True)
            data = data.query('`Job Title`.notna() & `Description`.notna()')
            job_info = pandas.concat([job_info, data])

In [167]:
job_info.head()

Unnamed: 0,Company,Job Title,Location,Description,Company Link,Job Link
0,USAA,Senior Audit Manager- Compliance (Remote),"Happy Valley Ranch, AZ",Purpose of Job We are seeking a talented Senio...,https://www.indeed.com/cmp/Usaa?campaignid=mob...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...
1,Pikes Peak Community College,Access Specialist,"Colorado Springs, CO 80906",Tracking Code: 40069\nWork Type: Full-time\nCa...,https://www.indeed.com/cmp/Pikes-Peak-Communit...,https://www.indeed.com/rc/clk?jk=bf3aa4d4d8608...
2,Memorial Regional Health,Clinic Registered Nurse - Family Practice - FT,"Craig, CO 81625",Position Purpose: The Registered Nurse is resp...,https://www.indeed.com/cmp/Memorial-Regional-H...,https://www.indeed.com/rc/clk?jk=ceee36181304b...
3,Jacobs,Sr Wetland Scientist/Permitting Specialist,"Wethersfield, CT 06109",Our People & Places Solutions business – reinf...,https://www.indeed.com/cmp/Jacobs?campaignid=m...,https://www.indeed.com/rc/clk?jk=e9735aea38be9...
4,Swiss American CDMO,Quality Assurance Engineer,"Carrollton, TX 75006",QUALITY ASSURANCE ENGINEER II\nPosition Summar...,https://www.indeed.com/cmp/Swiss-American-Cdmo...,https://www.indeed.com/company/Swiss-American-...


In [168]:
print("Before job link filter", len(job_info))
job_info.drop_duplicates('Job Link', inplace=True)
print("After job link filter", len(job_info))
job_info.drop_duplicates(['Company', 'Job Title', 'Location', 'Description'], inplace=True) #about 2000 of these

Before job link filter 21510
After job link filter 16303


In [169]:
print("With all filters there are", len(job_info), "unique jobs")

With all filters there are 14387 unique jobs
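
The task statement also asks for the deduplicated frame to be saved to a separate csv file. A minimal sketch of that step follows; the helper name `save_unique_jobs` and the output file name in the usage comment are assumptions made for illustration, not something taken from the cells above.

```python
import pandas as pd


def save_unique_jobs(frame: pd.DataFrame, path: str) -> int:
    """Drop rows with duplicate 'Job Link' values, write the rest to csv, return the row count."""
    unique = frame.drop_duplicates('Job Link')
    # index=False keeps the row index out of the file, so reloading
    # doesn't produce a stray 'Unnamed: 0' column.
    unique.to_csv(path, index=False)
    return len(unique)


# Hypothetical usage in this notebook:
# save_unique_jobs(job_info, "unique_jobs.csv")
```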


## Task 3
Merge between “job_url_data” and “job_info_data”. What is the percentage of jobs that can be matched between these two data sources? How are the missing data (unmatched job urls) distributed by date? What about matched job urls? How many complete job listings were we able to collect each day? How would you interpret this result with respect to data quality? Does it mean that our data collection strategy is flawed and thus introduces non-random sampling biases?

In [170]:
os.chdir(r'C:/Users/SLdra/Documents/Me/BC/Advance/Data/jupyter/Test 1')
with open("unique_urls.json") as url_file:
    otherdata = json.load(url_file)

In [171]:
url_data = pandas.DataFrame(otherdata.items(), columns=['Job Link', 'Date'])

In [172]:
url_data.head()

Unnamed: 0,Job Link,Date
0,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,2022-05-21
1,https://www.indeed.com/rc/clk?jk=0ae589447d347...,2022-05-22
2,https://www.indeed.com/rc/clk?jk=77efd97a2f20a...,2022-05-22
3,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,2022-05-23
4,https://www.indeed.com/company/Golding-Farms-F...,2022-05-23


In [173]:
print(len(url_data))
print(len(job_info))

21260
14387


In [174]:
job_with_date = pandas.merge(job_info, url_data, how='inner', on='Job Link')

In [175]:
job_with_date.head()

Unnamed: 0,Company,Job Title,Location,Description,Company Link,Job Link,Date
0,USAA,Senior Audit Manager- Compliance (Remote),"Happy Valley Ranch, AZ",Purpose of Job We are seeking a talented Senio...,https://www.indeed.com/cmp/Usaa?campaignid=mob...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,2022-05-17
1,Pikes Peak Community College,Access Specialist,"Colorado Springs, CO 80906",Tracking Code: 40069\nWork Type: Full-time\nCa...,https://www.indeed.com/cmp/Pikes-Peak-Communit...,https://www.indeed.com/rc/clk?jk=bf3aa4d4d8608...,2022-05-17
2,Memorial Regional Health,Clinic Registered Nurse - Family Practice - FT,"Craig, CO 81625",Position Purpose: The Registered Nurse is resp...,https://www.indeed.com/cmp/Memorial-Regional-H...,https://www.indeed.com/rc/clk?jk=ceee36181304b...,2022-05-17
3,Jacobs,Sr Wetland Scientist/Permitting Specialist,"Wethersfield, CT 06109",Our People & Places Solutions business – reinf...,https://www.indeed.com/cmp/Jacobs?campaignid=m...,https://www.indeed.com/rc/clk?jk=e9735aea38be9...,2022-05-17
4,Swiss American CDMO,Quality Assurance Engineer,"Carrollton, TX 75006",QUALITY ASSURANCE ENGINEER II\nPosition Summar...,https://www.indeed.com/cmp/Swiss-American-Cdmo...,https://www.indeed.com/company/Swiss-American-...,2022-05-17


In [176]:
print(round(100*len(job_with_date)/len(url_data), 3), "% of urls were matched with jobs")

67.672 % of urls were matched with jobs


In [177]:
link_missing_info = pandas.concat([job_with_date[['Job Link', 'Date']], url_data[['Job Link', 'Date']]]).drop_duplicates('Job Link', keep=False)

In [178]:
link_missing_info.head()

Unnamed: 0,Job Link,Date
0,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,2022-05-21
3,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,2022-05-23
4,https://www.indeed.com/company/Golding-Farms-F...,2022-05-23
5,https://www.indeed.com/rc/clk?jk=300796a784429...,2022-05-23
6,https://www.indeed.com/rc/clk?jk=0b291e71806f9...,2022-05-23


In [179]:
link_missing_info['Date'].value_counts()

2022-05-23    3423
2022-05-17    1728
2022-05-21     681
2022-05-20     412
2022-05-19     326
2022-05-22     217
2022-05-18      86
Name: Date, dtype: int64

In [180]:
link_missing_info['Date'].value_counts().sort_index() #I like this better

2022-05-17    1728
2022-05-18      86
2022-05-19     326
2022-05-20     412
2022-05-21     681
2022-05-22     217
2022-05-23    3423
Name: Date, dtype: int64

In [181]:
url_data['Date'].value_counts().sort_index()

2022-05-17    5257
2022-05-18    1325
2022-05-19    3798
2022-05-20    4523
2022-05-21    2154
2022-05-22     780
2022-05-23    3423
Name: Date, dtype: int64

In [182]:
#want to compare it to the frequency of those dates overall
link_missing_info['Date'].value_counts().sort_index().divide(url_data['Date'].value_counts().sort_index())

2022-05-17    0.328705
2022-05-18    0.064906
2022-05-19    0.085835
2022-05-20    0.091090
2022-05-21    0.316156
2022-05-22    0.278205
2022-05-23    1.000000
Name: Date, dtype: float64
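
The concat/drop_duplicates anti-join and the per-date division above can also be written as a single left merge with pandas' `indicator=True` option, which labels each url as `both` or `left_only`. This is only a sketch of an alternative; the function name `missing_share_by_date` is made up, and it assumes the same `Job Link` and `Date` column names used in this notebook.

```python
import pandas as pd


def missing_share_by_date(url_data: pd.DataFrame, job_info: pd.DataFrame) -> pd.DataFrame:
    """Per scrape date: urls without job info, total urls, and the missing share."""
    links = job_info[['Job Link']].drop_duplicates()
    merged = url_data.merge(links, on='Job Link', how='left', indicator=True)
    # 'left_only' marks urls that never appear in the job info data.
    merged['missing'] = merged['_merge'] == 'left_only'
    summary = merged.groupby('Date')['missing'].agg(['sum', 'size', 'mean'])
    return summary.rename(columns={'sum': 'missing', 'size': 'total', 'mean': 'share'})


# Hypothetical usage in this notebook:
# missing_share_by_date(url_data, job_info)
```

Keeping the counts and the share in one table makes it harder to misalign the two `value_counts()` results when a date has no missing urls at all.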

In [183]:
#everything is one of those categories
accepted = ['pagead', 'rc', 'company']
frames = [link_missing_info, job_with_date]
for frame in frames:
    print(frame['Job Link'].str.split('/').str[3].value_counts())

pagead     3285
rc         3094
company     494
Name: Job Link, dtype: int64
rc         10502
pagead      2308
company     1577
Name: Job Link, dtype: int64


In [184]:
#honestly not really a trend

The missing data are not evenly distributed across days. About 33% of the urls first seen on 2022-05-17 have no job info, compared with 6% to 9% for 2022-05-18 through 2022-05-20, and the share climbs again to about 32% and 28% on 2022-05-21 and 2022-05-22. The high share on the first day may reflect some property of the jobs that were already posted when collection began (perhaps older listings) or an error in the scraper; the data can't distinguish between these, so that is speculation. On the last day (2022-05-23) every url is missing job info, which most likely indicates some sort of scraper error.

Overall I would trust this data for broad conclusions about compliance jobs, but not for specific numerical claims (or I would put large confidence intervals on those claims). I would also revise the stated collection window from the 17th through the 23rd to the 17th through the 22nd, since no job info was collected for the 23rd.
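
To put a number on the uneven distribution, one option is a chi-square test of independence between scrape date and matched/missing status. The sketch below is an illustration added to the write-up, not part of the original analysis. It uses only the counts printed by the `value_counts()` cells above and the standard 5% critical value for 6 degrees of freedom (12.592).

```python
# Counts copied from the value_counts() outputs above.
missing = {"2022-05-17": 1728, "2022-05-18": 86, "2022-05-19": 326,
           "2022-05-20": 412, "2022-05-21": 681, "2022-05-22": 217,
           "2022-05-23": 3423}
total = {"2022-05-17": 5257, "2022-05-18": 1325, "2022-05-19": 3798,
         "2022-05-20": 4523, "2022-05-21": 2154, "2022-05-22": 780,
         "2022-05-23": 3423}


def chi_square_statistic(missing: dict, total: dict) -> float:
    """Pearson chi-square statistic for a (date x missing/matched) table."""
    overall_missing = sum(missing.values()) / sum(total.values())
    stat = 0.0
    for day, day_total in total.items():
        expected_missing = day_total * overall_missing
        expected_matched = day_total - expected_missing
        observed_missing = missing[day]
        observed_matched = day_total - observed_missing
        stat += (observed_missing - expected_missing) ** 2 / expected_missing
        stat += (observed_matched - expected_matched) ** 2 / expected_matched
    return stat


# (7 dates - 1) * (2 outcomes - 1) = 6 degrees of freedom.
CRITICAL_5PCT_DF6 = 12.592
statistic = chi_square_statistic(missing, total)
print(statistic > CRITICAL_5PCT_DF6)  # prints True
```

With more than 21,000 urls almost any difference between days would clear this threshold, so the per-date shares are more informative than the test result on its own.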

## Bonus Task 1
Using the merged dataset from Task 3, extract state information for each individual job based on the “company_location” column. Aggregate them by state, and create a state-level choropleth map to visualize the spatial distribution of compliance jobs. The map should be colored based on the total number of compliance jobs in each state. The boundaries of US states in geojson can be found here. Interpret the results.

In [185]:
import geopandas as gpd
os.chdir(r'C:/Users/SLdra/Documents/Me/BC/Advance/Data/jupyter/Test 1/states')
state_boundaries = gpd.read_file('states.geojson')

In [186]:
job_with_date.head()

Unnamed: 0,Company,Job Title,Location,Description,Company Link,Job Link,Date
0,USAA,Senior Audit Manager- Compliance (Remote),"Happy Valley Ranch, AZ",Purpose of Job We are seeking a talented Senio...,https://www.indeed.com/cmp/Usaa?campaignid=mob...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,2022-05-17
1,Pikes Peak Community College,Access Specialist,"Colorado Springs, CO 80906",Tracking Code: 40069\nWork Type: Full-time\nCa...,https://www.indeed.com/cmp/Pikes-Peak-Communit...,https://www.indeed.com/rc/clk?jk=bf3aa4d4d8608...,2022-05-17
2,Memorial Regional Health,Clinic Registered Nurse - Family Practice - FT,"Craig, CO 81625",Position Purpose: The Registered Nurse is resp...,https://www.indeed.com/cmp/Memorial-Regional-H...,https://www.indeed.com/rc/clk?jk=ceee36181304b...,2022-05-17
3,Jacobs,Sr Wetland Scientist/Permitting Specialist,"Wethersfield, CT 06109",Our People & Places Solutions business – reinf...,https://www.indeed.com/cmp/Jacobs?campaignid=m...,https://www.indeed.com/rc/clk?jk=e9735aea38be9...,2022-05-17
4,Swiss American CDMO,Quality Assurance Engineer,"Carrollton, TX 75006",QUALITY ASSURANCE ENGINEER II\nPosition Summar...,https://www.indeed.com/cmp/Swiss-American-Cdmo...,https://www.indeed.com/company/Swiss-American-...,2022-05-17


In [187]:
job_with_date.loc[job_with_date['Location'].str.split(' ').str[-2].isin(state_boundaries['STUSPS']), 'STUSPS'] = job_with_date['Location'].str.split(' ').str[-2]

In [188]:
job_with_date.loc[job_with_date['Location'].str.split(' ').str[-1].isin(state_boundaries['STUSPS']), 'STUSPS'] = job_with_date['Location'].str.split(' ').str[-1]

In [189]:
fullstate = job_with_date.loc[job_with_date['Location'].isin(state_boundaries['NAME']), 'Location']

In [190]:
fullstate = fullstate.to_frame().rename({'Location':'NAME'}, axis=1)

In [191]:
fullstate = fullstate.merge(state_boundaries[['NAME', 'STUSPS']], on='NAME')

In [192]:
fullstate.head()

Unnamed: 0,NAME,STUSPS
0,New Jersey,NJ
1,New Jersey,NJ
2,New Jersey,NJ
3,New Jersey,NJ
4,New Jersey,NJ


In [193]:
#list because otherwise indices mess it up
job_with_date.loc[job_with_date['Location'].isin(state_boundaries['NAME']), 'STUSPS'] = list(fullstate['STUSPS'])

In [194]:
job_with_date.head()

Unnamed: 0,Company,Job Title,Location,Description,Company Link,Job Link,Date,STUSPS
0,USAA,Senior Audit Manager- Compliance (Remote),"Happy Valley Ranch, AZ",Purpose of Job We are seeking a talented Senio...,https://www.indeed.com/cmp/Usaa?campaignid=mob...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,2022-05-17,AZ
1,Pikes Peak Community College,Access Specialist,"Colorado Springs, CO 80906",Tracking Code: 40069\nWork Type: Full-time\nCa...,https://www.indeed.com/cmp/Pikes-Peak-Communit...,https://www.indeed.com/rc/clk?jk=bf3aa4d4d8608...,2022-05-17,CO
2,Memorial Regional Health,Clinic Registered Nurse - Family Practice - FT,"Craig, CO 81625",Position Purpose: The Registered Nurse is resp...,https://www.indeed.com/cmp/Memorial-Regional-H...,https://www.indeed.com/rc/clk?jk=ceee36181304b...,2022-05-17,CO
3,Jacobs,Sr Wetland Scientist/Permitting Specialist,"Wethersfield, CT 06109",Our People & Places Solutions business – reinf...,https://www.indeed.com/cmp/Jacobs?campaignid=m...,https://www.indeed.com/rc/clk?jk=e9735aea38be9...,2022-05-17,CT
4,Swiss American CDMO,Quality Assurance Engineer,"Carrollton, TX 75006",QUALITY ASSURANCE ENGINEER II\nPosition Summar...,https://www.indeed.com/cmp/Swiss-American-Cdmo...,https://www.indeed.com/company/Swiss-American-...,2022-05-17,TX
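
The three-step assignment above relies on the merged `fullstate` rows staying in the same order as the rows they are written back to. A sketch of an order-independent alternative follows: pull a two-letter code with a regular expression, keep it only if it is a known state, and fall back to a name-to-code lookup for locations that are just a state name. The function name `extract_state` and the `name_to_code` dictionary are hypothetical, and unlike the split-based version this one assumes the code follows a comma.

```python
import pandas as pd


def extract_state(locations: pd.Series, name_to_code: dict) -> pd.Series:
    """Return a two-letter state code per location, or NaN when none is found."""
    # Matches "..., AZ" and "..., CO 80906" at the end of the string.
    codes = locations.str.extract(r",\s*([A-Z]{2})(?:\s+\d{5})?\s*$", expand=False)
    # Discard codes that are not in the lookup (e.g. PR if it is absent).
    codes = codes.where(codes.isin(set(name_to_code.values())))
    # Fall back to full state names such as "New Jersey".
    return codes.fillna(locations.map(name_to_code))


# Hypothetical usage in this notebook:
# name_to_code = dict(zip(state_boundaries['NAME'], state_boundaries['STUSPS']))
# job_with_date['STUSPS'] = extract_state(job_with_date['Location'], name_to_code)
```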


In [195]:
job_with_date.query('STUSPS.isna()')

Unnamed: 0,Company,Job Title,Location,Description,Company Link,Job Link,Date,STUSPS
7,Pirouette Medical,Vice President Quality Assurance and Regulator...,United States,"Vice President, Quality Assurance (QA) and Reg...",https://www.indeed.com/cmp/Pirouette-Medical?c...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,2022-05-17,
14,"Allay Therapeutics, Inc.",Director of Clinical Operations,United States,Allay Therapeutics (www.allaytx.com) is pionee...,"https://www.indeed.com/cmp/Allay-Therapeutics,...",https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,2022-05-17,
23,Total Wrecking,Safety Director,United States,Total Wrecking is seeking an experienced Safet...,https://www.indeed.com/cmp/Asbestos-Demolition...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,2022-05-17,
24,,Principal Medical Writer,United States,Description\nPrincipal Medical Writer\nCome di...,,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,2022-05-17,
42,,Clinical Trial Management (CTMs/COM),United States,Description\nJOB SUMMARY\nThe Clinical Trial M...,,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,2022-05-17,
...,...,...,...,...,...,...,...,...
13432,PharmAllies,Validation Senior Manager - (Bio/Pharma),"Barceloneta, PR","RELOCATION ASSISTANCE TO MADISON, WISCONSIN ME...",https://www.indeed.com/cmp/Pharmallies?campaig...,https://www.indeed.com/company/PharmAllies/job...,2022-05-20,
13599,Alaskan Dream Cruises,Captain 100 Ton,United States,COME EXPLORE WITH THE LOCALS! GREAT JOBS AVAIL...,https://www.indeed.com/cmp/Alaskan-Dream-Cruis...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,2022-05-22,
13786,Amgen,Specialist Quality Assurance – AML 14 (Quality...,"Municipio de Juncos, PR",HOW MIGHT YOU DEFY IMAGINATION?\nYou’ve earned...,https://www.indeed.com/cmp/Amgen?campaignid=mo...,https://www.indeed.com/rc/clk?jk=2d9d672fda30d...,2022-05-21,
14064,Fresenius Medical Care,Clinical Manager,"San Juan, PR 00926","PURPOSE AND SCOPE:\nSupports FMCNA’s mission, ...",https://www.indeed.com/cmp/Fresenius-Medical-C...,https://www.indeed.com/rc/clk?jk=97e20bf5cda26...,2022-05-22,


In [196]:
#just will drop the rest
job_with_date = job_with_date.query('STUSPS.notna()')

In [197]:
geojob = job_with_date.groupby('STUSPS')['Job Link'].count().reset_index()

In [198]:
geojob.rename({'Job Link':'Job Count'}, axis=1, inplace=True)

In [199]:
geojob = state_boundaries[['STUSPS','geometry']].merge(geojob, on='STUSPS', how='inner')

In [200]:
geojob.head()

Unnamed: 0,STUSPS,geometry,Job Count
0,NE,"MULTIPOLYGON (((-104.05303 43.00059, -103.6183...",114
1,WA,"MULTIPOLYGON (((-122.52603 47.35891, -122.5139...",297
2,NM,"MULTIPOLYGON (((-109.04522 36.99908, -108.6460...",130
3,SD,"MULTIPOLYGON (((-104.05770 44.99743, -104.0397...",57
4,KY,"MULTIPOLYGON (((-89.13268 36.98220, -89.16645 ...",174


In [201]:
import folium
clat = 40.0
clon = -95.0
m = folium.Map(location=(clat, clon), zoom_start=4, width=800, height=400, tiles="Cartodb Positron")

folium.Choropleth(
    geo_data=geojob,
    name="choropleth",
    data=geojob,
    columns= ["STUSPS", "Job Count"],
    key_on="feature.properties.STUSPS",
    fill_color="YlOrRd",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Compliance job count by state, May 2022").add_to(m)

folium.LayerControl().add_to(m)
display(m)

There is a high number of compliance jobs in Texas and California, which is expected because they are the two most populous states.

In [202]:
#decided to normalize it by population
#data taken from the 2020 census
os.chdir(r'C:/Users/SLdra/Documents/Me/BC/Advance/Data/jupyter/Test 1')
pop = pandas.read_excel("NST-EST2021-POP.xlsx", header=None, names=["NAME", "Pop", "Ignore1", "Ignore2"],
                        skiprows=9)

In [203]:
pop.head()

Unnamed: 0,NAME,Pop,Ignore1,Ignore2
0,.Alabama,5024279.0,5024803.0,5039877.0
1,.Alaska,733391.0,732441.0,732673.0
2,.Arizona,7151502.0,7177986.0,7276316.0
3,.Arkansas,3011524.0,3012232.0,3025891.0
4,.California,39538223.0,39499738.0,39237836.0


In [204]:
pop['NAME'] = pop['NAME'].str.slice(1)

In [205]:
pop.head()

Unnamed: 0,NAME,Pop,Ignore1,Ignore2
0,Alabama,5024279.0,5024803.0,5039877.0
1,Alaska,733391.0,732441.0,732673.0
2,Arizona,7151502.0,7177986.0,7276316.0
3,Arkansas,3011524.0,3012232.0,3025891.0
4,California,39538223.0,39499738.0,39237836.0


In [206]:
pop = pop.merge(state_boundaries[['NAME', 'STUSPS']], on='NAME')

In [207]:
pop.head()

Unnamed: 0,NAME,Pop,Ignore1,Ignore2,STUSPS
0,Alabama,5024279.0,5024803.0,5039877.0,AL
1,Alaska,733391.0,732441.0,732673.0,AK
2,Arizona,7151502.0,7177986.0,7276316.0,AZ
3,Arkansas,3011524.0,3012232.0,3025891.0,AR
4,California,39538223.0,39499738.0,39237836.0,CA


In [208]:
geojob = geojob.merge(pop[['STUSPS', 'Pop']], on='STUSPS')

In [209]:
geojob.head()

Unnamed: 0,STUSPS,geometry,Job Count,Pop
0,NE,"MULTIPOLYGON (((-104.05303 43.00059, -103.6183...",114,1961504.0
1,WA,"MULTIPOLYGON (((-122.52603 47.35891, -122.5139...",297,7705281.0
2,NM,"MULTIPOLYGON (((-109.04522 36.99908, -108.6460...",130,2117522.0
3,SD,"MULTIPOLYGON (((-104.05770 44.99743, -104.0397...",57,886667.0
4,KY,"MULTIPOLYGON (((-89.13268 36.98220, -89.16645 ...",174,4505836.0


In [210]:
geojob['Jobs per 1000'] = 1000*geojob['Job Count'].divide(geojob['Pop'])

In [211]:
geojob.head()

Unnamed: 0,STUSPS,geometry,Job Count,Pop,Jobs per 1000
0,NE,"MULTIPOLYGON (((-104.05303 43.00059, -103.6183...",114,1961504.0,0.058119
1,WA,"MULTIPOLYGON (((-122.52603 47.35891, -122.5139...",297,7705281.0,0.038545
2,NM,"MULTIPOLYGON (((-109.04522 36.99908, -108.6460...",130,2117522.0,0.061393
3,SD,"MULTIPOLYGON (((-104.05770 44.99743, -104.0397...",57,886667.0,0.064286
4,KY,"MULTIPOLYGON (((-89.13268 36.98220, -89.16645 ...",174,4505836.0,0.038617


In [212]:
import folium
clat = 40.0
clon = -95.0
m = folium.Map(location=(clat, clon), zoom_start=4, width=800, height=400, tiles="Cartodb Positron")

folium.Choropleth(
    geo_data=geojob,
    name="choropleth",
    data=geojob,
    columns= ["STUSPS", "Jobs per 1000"],
    key_on="feature.properties.STUSPS",
    fill_color="YlOrRd",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Compliance jobs per 1,000 residents by state, May 2022").add_to(m)

folium.LayerControl().add_to(m)
display(m)

Now that the counts are normalized by population, we can see that the central states have more compliance jobs per capita than most other states.