# Continuing our original efforts... and starting something new
Since we ran into some issues with the Selenium web scraping (in other words, the site figured out that I was using a bot), let's use this dataset (https://www.kaggle.com/andrewmvd/data-scientist-jobs) to continue our analysis. It is unfortunate that the web scraping effort has come to a halt, but I am partly satisfied knowing that I now have a working understanding of Selenium and will still be able to finish my analysis of data science job descriptions.

In [154]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [155]:
# desired columns for the time being
usecols = [
    'Job Title',
    'Company Name',
    'Location',
    'Salary Estimate',
    'Job Description']

rename = [
    'title',
    'company',
    'location',
    'salary_estimate',
    'description']

In [156]:
# let's grab this csv from my data folder
path = "C:/Users/voyno/Desktop/indeed-jobs/data/data_scientist_jobs.csv"
df = pd.read_csv(path, usecols=usecols)[usecols]

# show current columns
print("Remaining Columns:")
for i, col in enumerate(df.columns):
    print(f'{i:4}. {col}')

Remaining Columns:
   0. Job Title
   1. Company Name
   2. Location
   3. Salary Estimate
   4. Job Description


In [157]:
# update column names
df.columns = rename

print("New Column names:")
for i, col in enumerate(df.columns):
    print(f'{i:4}. {col}')

New Column names:
   0. title
   1. company
   2. location
   3. salary_estimate
   4. description


In [158]:
df.head()

Unnamed: 0,title,company,location,salary_estimate,description
0,Senior Data Scientist,Hopper\n3.5,"New York, NY",$111K-$181K (Glassdoor est.),"ABOUT HOPPER\n\nAt Hopper, we’re on a mission ..."
1,"Data Scientist, Product Analytics",Noom US\n4.5,"New York, NY",$111K-$181K (Glassdoor est.),"At Noom, we use scientifically proven methods ..."
2,Data Science Manager,Decode_M,"New York, NY",$111K-$181K (Glassdoor est.),Decode_M\n\nhttps://www.decode-m.com/\n\nData ...
3,Data Analyst,Sapphire Digital\n3.4,"Lyndhurst, NJ",$111K-$181K (Glassdoor est.),Sapphire Digital seeks a dynamic and driven mi...
4,"Director, Data Science",United Entertainment Group\n3.4,"New York, NY",$111K-$181K (Glassdoor est.),"Director, Data Science - (200537)\nDescription..."


In [159]:
# first we will fix the company field: keep only the name,
# dropping whatever follows the line break (e.g. "Hopper\n3.5" -> "Hopper")
df["company"] = df["company"].str.split("\n").str[0]
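
To make the company fix concrete, here is a minimal sketch on two values taken from the `df.head()` output. It assumes the number after the line break is the company's rating, which the notebook does not state, and `split_company` is a hypothetical helper rather than part of the analysis.

```python
def split_company(raw):
    """Split 'Name\\nRating' into (name, rating); rating is None when absent."""
    name, _, rating = raw.partition("\n")
    # assumption: the trailing number is a rating; it is missing for some rows
    return name.strip(), (float(rating) if rating.strip() else None)


with_rating = split_company("Hopper\n3.5")
without_rating = split_company("Decode_M")
```

If the rating turns out to be useful later, something like `df["company"].apply(split_company)` could be applied before the name-only fix overwrites the column.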

In [160]:
# then we will fix salary estimate values
# TODO create logic for: Per Hour(Glassdoor est.), (Employer est.)

# for now I'll just remove the "Per Hour" rows
is_hourly = df["salary_estimate"].str.contains("Per", regex=False)
df = df.loc[~is_hourly, rename].copy()

# strip the decorations, then take the midpoint of the range (in thousands of USD)
remove_strings = ["K", "$", " (Glassdoor est.)", "(Employer est.)"]
for string in remove_strings:
    df["salary_estimate"] = df["salary_estimate"].str.replace(string, "", regex=False)
df["salary_estimate"] = df["salary_estimate"].apply(lambda x: np.mean(list(map(int, x.split("-")))))
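
One possible way to tackle the TODO above is a single parser that handles every variant instead of dropping the hourly rows. This is only a sketch: `parse_salary` and `HOURS_PER_YEAR` are hypothetical names, the 2080-hour year is an assumption, and the second and third example strings are assumed formats rather than values copied from the dataset.

```python
import re

HOURS_PER_YEAR = 2080  # assumption: 40 hours/week * 52 weeks


def parse_salary(raw):
    """Return the midpoint of a salary range in thousands of USD per year."""
    numbers = [int(n) for n in re.findall(r"\d+", raw)]
    if not numbers:
        return None  # nothing numeric to parse
    midpoint = sum(numbers) / len(numbers)
    if "Per Hour" in raw:
        # hourly figures are plain dollars: annualize, then convert to thousands
        return midpoint * HOURS_PER_YEAR / 1000
    return midpoint


examples = [
    "$111K-$181K (Glassdoor est.)",      # format seen in df.head()
    "$120K-$140K(Employer est.)",        # assumed format
    "$34-$53 Per Hour(Glassdoor est.)",  # assumed format
]
parsed = [parse_salary(s) for s in examples]
```

Pulling the digits out with a regular expression avoids having to list every suffix to strip, at the cost of trusting that the only numbers in the string are the two ends of the range.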

In [197]:
# Q: Where are the jobs, and what do they pay? (salary_mean is in thousands of USD)

groupby_location = [[loc, len(group), group["salary_estimate"].mean()] for loc, group in df.groupby("location")]
groupby_location_cols = ["location", "num_entries", "salary_mean"]
location_df = pd.DataFrame(groupby_location, columns=groupby_location_cols)

location_df.sort_values(by="num_entries", ascending=False).head(25)

Unnamed: 0,location,num_entries,salary_mean
6,"Austin, TX",345,105.233333
33,"Chicago, IL",330,85.559091
146,"San Diego, CA",304,106.230263
113,"New York, NY",303,135.226073
74,"Houston, TX",219,103.230594
128,"Philadelphia, PA",210,90.928571
90,"Los Angeles, CA",203,127.933498
42,"Dallas, TX",180,82.794444
144,"San Antonio, TX",170,93.714706
129,"Phoenix, AZ",165,97.718182
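
The same per-location summary can also be written with pandas named aggregation, which may be easier to extend with more statistics. A minimal sketch on a toy frame follows; the rows and salaries are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# toy rows standing in for the cleaned frame; salaries are made-up thousands of USD
toy = pd.DataFrame({
    "location": ["Austin, TX", "Austin, TX", "New York, NY"],
    "salary_estimate": [100.0, 110.0, 135.0],
})

# one row per location: how many postings, and their mean salary
location_summary = (
    toy.groupby("location")["salary_estimate"]
    .agg(num_entries="size", salary_mean="mean")
    .reset_index()
    .sort_values(by="num_entries", ascending=False)
)
```

Swapping `toy` for `df` should give the same columns as `location_df`, assuming `df` has already been through the salary cleaning step.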


In [248]:
# Q: What are the most frequent words in these job descriptions? (we'll do this quickly for now, with more detail later)

# note: descriptions are joined back to back and split on single spaces only,
# so words at posting boundaries and across line breaks stay fused
raw_data = "".join(df['description'].values).lower().split(" ")

# count occurrences of each token
word_dict = {}
for word in raw_data:
    word_dict[word] = word_dict.get(word, 0) + 1

word_freq = [[key, word_dict[key]] for key in word_dict]

pd.DataFrame(word_freq).sort_values(by=1, ascending=False).iloc[:50]

Unnamed: 0,0,1
13,and,109066
7,to,58020
19,the,45809
27,of,45397
47,in,34708
28,data,31454
5,a,29854
43,with,28489
101,for,20794
224,or,18103
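
Unsurprisingly, the top of the list is mostly filler words. For the more detailed pass, one option is to tokenize each description separately and skip stop words. The sketch below assumes a deliberately tiny, hypothetical stop-word list (just the filler words from the table above) and a simple regular-expression tokenizer; `word_counts` is an illustrative helper, and the sample descriptions are made up.

```python
import re
from collections import Counter

# hypothetical, deliberately tiny stop-word list; a real pass would use a fuller one
STOP_WORDS = {"a", "and", "for", "in", "of", "or", "the", "to", "with"}


def word_counts(descriptions):
    """Count non-stop-word tokens across an iterable of description strings."""
    counts = Counter()
    for text in descriptions:
        # tokenizing per description keeps neighbouring postings from fusing,
        # and the pattern also splits on line breaks and punctuation
        tokens = re.findall(r"[a-z0-9+#]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


# made-up sample descriptions, for illustration only
sample = [
    "Data Scientist\n\nWe work with data and Python.",
    "Experience in SQL, Python or R.",
]
counts = word_counts(sample)
```

On the real data this would be called as `word_counts(df["description"])`, and `counts.most_common(50)` would play the role of the sorted frame above.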
