**Project 1**

Research Questions (semi-tentative; to be expanded through data exploration):
1. Descriptive - What is the average minimum required experience to be eligible for a data science job?
2. Descriptive - Do in-person jobs pay worse than remote jobs?
3. Inference - How well can average salary be predicted for different jobs based on a variety of features?

In [None]:
#This file is a python notebook for all data linked to this project.

#Original research question: On average, what is the minimum experience required to be eligible for a data science job? Measured as a combination of years of experience and level of expertise.

In [1]:
#mount the google drive

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
#load the CSV from the google drive

import pandas as pd

#load the CSV into dataframe
df = pd.read_csv("/content/drive/MyDrive/MATH575Project1/ai_jobs.csv")
df.head()

,job_id,job_title,company_type,industry,country,city,remote_type,experience_level,min_experience_years,salary_min_usd,salary_max_usd,employment_type,posted_year,company_size
0,0IFD0TVBDIVU,MLOps Engineer,Research Lab,Education,Australia,Remote,Remote,Entry,0,56873,72223,Full-time,2023,Large
1,ZMF8MDD4V30T,Data Analyst,Startup,Education,Germany,Remote,Remote,Entry,0,54803,85599,Full-time,2024,Medium
2,CX1945NQ4FMY,MLOps Engineer,Research Lab,Tech,Canada,Remote,Remote,Senior,5,149980,175806,Full-time,2021,Large
3,QJ7YHL1C32OC,Applied Scientist,Research Lab,Healthcare,Australia,Remote,Remote,Entry,0,53483,86477,Full-time,2023,Medium
4,F0T0PVN9ER14,Machine Learning Engineer,Research Lab,Finance,Australia,Sydney,Hybrid,Mid,2,102977,127298,Full-time,2023,Large


In [None]:
#old github only instructions

# #loading data from kaggle
# import os
# from zipfile import ZipFile

# #make sure the data folder exists
# os.makedirs("data", exist_ok=True)

# #download the data if not already downloaded
# if not os.path.exists("data/global-ai-and-data-science-job-market-20202026.zip"):
#     !kaggle datasets download -d mann14/global-ai-and-data-science-job-market-20202026 -p data

# #unzip the data if not already unzipped
# zip_path = "data/global-ai-and-data-science-job-market-20202026.zip"
# if os.path.exists(zip_path):
#     with ZipFile(zip_path, "r") as zip_ref:
#         zip_ref.extractall("data")


In [None]:
# #github only instructions

# #putting locally stored data into df
# import pandas as pd

# #load the CSV into dataframe
# df = pd.read_csv("data/ai_jobs.csv")
# df.head()



In [None]:
#check for missing data

print(df.isna().sum())
#no missing data!

print(df.duplicated().sum())
#no duplicated rows!

job_id                  0
job_title               0
company_type            0
industry                0
country                 0
city                    0
remote_type             0
experience_level        0
min_experience_years    0
salary_min_usd          0
salary_max_usd          0
employment_type         0
posted_year             0
company_size            0
salary_range            0
salary_average          0
salary_mid              0
dtype: int64
0
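
The two checks above can be bundled into a small helper so they are easy to rerun after each cleaning step. This is only a sketch: `quality_report` is a hypothetical name, and the three-row frame is made-up toy data standing in for `ai_jobs.csv`.

```python
import pandas as pd

def quality_report(frame: pd.DataFrame) -> dict:
    """Summarize missing values and duplicate rows (hypothetical helper)."""
    missing = frame.isna().sum()
    return {
        # keep only the columns that actually contain missing values
        "missing_by_column": missing[missing > 0].to_dict(),
        # rows that exactly repeat an earlier row
        "n_duplicate_rows": int(frame.duplicated().sum()),
    }

# Toy data, not the real dataset: one missing salary and one repeated row
toy = pd.DataFrame({
    "job_title": ["Data Analyst", "Data Analyst", "MLOps Engineer"],
    "salary_min_usd": [54803.0, 54803.0, None],
})
report = quality_report(toy)
print(report)
```

On the real data, `quality_report(df)` should return an empty `missing_by_column` and zero duplicate rows if the output above still holds.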


In [None]:
#check for outliers, impossible values, or improperly encoded values
for col in df.columns:
    print(f"\nValue counts for column: {col}")
    print(df[col].value_counts())

#job_title has six levels with suspiciously similar counts
#company_type has three levels, again with similar counts
#industry has five levels, again with similar counts
#country has six levels; findings may not generalize beyond these countries
#city has 25 levels (24 cities + "Remote"), all of them large cities; findings may not transfer to data science jobs in a market like Boise
#remote_type has three levels, and its Remote count matches the "Remote" entries in the city column
#experience_level has three levels
#employment_type has a single level: every job is full-time
#posted_year covers 7 years (2020-2026)
#company_size has three levels

#The near-equal class counts across every categorical variable suggest the data were balanced deliberately (cherry-picked), so they likely do not represent the actual distribution of jobs in the job market.


Value counts for column: job_id
job_id
0IFD0TVBDIVU    1
ZMF8MDD4V30T    1
CX1945NQ4FMY    1
QJ7YHL1C32OC    1
F0T0PVN9ER14    1
               ..
UN0T2IZO2KCL    1
CETZGCR42LC8    1
EQ9PTEJEFKUI    1
92UNH47RTAGS    1
QIN2HZHOH6O6    1
Name: count, Length: 50000, dtype: int64

Value counts for column: job_title
job_title
MLOps Engineer               8439
AI Researcher                8415
Data Scientist               8410
Applied Scientist            8298
Data Analyst                 8260
Machine Learning Engineer    8178
Name: count, dtype: int64

Value counts for column: company_type
company_type
Research Lab    16910
MNC             16576
Startup         16514
Name: count, dtype: int64

Value counts for column: industry
industry
Tech          10083
Healthcare    10029
Retail         9979
Finance        9970
Education      9939
Name: count, dtype: int64

Value counts for column: country
country
UK           8452
India        8350
Germany      8345
Canada       8303
Australia    8276
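
One way to put a number on the "oddly similar counts" is a chi-square goodness-of-fit statistic against a uniform distribution. The sketch below uses only the `job_title` counts printed above and plain Python; the uniform null hypothesis and the 5% threshold are assumptions chosen for illustration, not part of the original analysis.

```python
# Observed job_title counts, copied from the value_counts output above
observed = {
    "MLOps Engineer": 8439,
    "AI Researcher": 8415,
    "Data Scientist": 8410,
    "Applied Scientist": 8298,
    "Data Analyst": 8260,
    "Machine Learning Engineer": 8178,
}

n = sum(observed.values())
expected = n / len(observed)  # count per title if all titles were equally likely

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi2_stat = sum((count - expected) ** 2 / expected for count in observed.values())

# 5% critical value of the chi-square distribution with 6 - 1 = 5 degrees of freedom
CRITICAL_5PCT_DF5 = 11.07
uniform_plausible = chi2_stat < CRITICAL_5PCT_DF5

print(f"n = {n}, chi-square = {chi2_stat:.2f}, uniform plausible: {uniform_plausible}")
```

A statistic below the critical value means the counts cannot be distinguished from a draw in which every title is equally likely. That pattern is what balanced or generated data would produce, and it is hard to reconcile with a real job market, where some titles are far more common than others.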

In [None]:
#check for outliers or impossible values on salary columns
print(df[['salary_min_usd', 'salary_max_usd']].describe())

#no unreasonable min or max values
#mean > median in both columns, which indicates a slight right skew; this is common for salary data
#salary_max_usd tops out at exactly 180K, which is odd: there are definitely data scientists making 300K+, so the cap looks like a deliberate data collection choice

(df['salary_max_usd'] - df['salary_min_usd'] < 0).any() #False means no row has a max below its min

       salary_min_usd  salary_max_usd
count    50000.000000    50000.000000
mean    100871.434320   120858.350740
std      37043.446641    37531.386484
min      50000.000000    65000.000000
25%      61287.000000    83739.000000
50%      97505.000000   117604.000000
75%     143730.500000   161348.750000
max     154999.000000   180000.000000


np.False_
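
As a complement to eyeballing `describe()`, Tukey's 1.5 × IQR rule gives an explicit outlier threshold. The sketch below applies it to the quartiles printed above, so it runs without the CSV; `tukey_fences` is a hypothetical helper name, and the multiplier 1.5 is the conventional default, not something the dataset prescribes.

```python
# Quartiles and extremes copied from the describe() output above
summary = {
    "salary_min_usd": {"q1": 61287.0, "q3": 143730.5, "min": 50000.0, "max": 154999.0},
    "salary_max_usd": {"q1": 83739.0, "q3": 161348.75, "min": 65000.0, "max": 180000.0},
}

def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Return (lower, upper) fences; values outside them count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

has_outliers = {}
for col, s in summary.items():
    low, high = tukey_fences(s["q1"], s["q3"])
    # If even the column min and max sit inside the fences, no value is outside
    has_outliers[col] = s["min"] < low or s["max"] > high
    print(f"{col}: fences = ({low:,.2f}, {high:,.2f}), outliers: {has_outliers[col]}")
```

Because both columns have a wide interquartile range relative to their min and max, the fences fall well outside the observed values, which agrees with the "no unreasonable min or max values" note in the cell above.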

In [None]:
#new columns which might help later descriptive and inferential stats
df['salary_range'] = df['salary_max_usd'] - df['salary_min_usd']
df['salary_mid'] = (df['salary_max_usd'] + df['salary_min_usd'])/2

#I'll leave one-hot encoding for a later step: encoding every categorical column now would add a lot of columns and make the dataset hard to manage in the meantime
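
To see how quickly one-hot encoding widens the frame, the sketch below encodes three categorical columns with `pd.get_dummies`. The toy frame reuses the five rows shown by `df.head()` and is an assumption made so the example runs without the CSV; the real dataset has more levels per column.

```python
import pandas as pd

# Toy frame built from the five rows shown by df.head() earlier in the notebook
toy = pd.DataFrame({
    "job_title": ["MLOps Engineer", "Data Analyst", "MLOps Engineer",
                  "Applied Scientist", "Machine Learning Engineer"],
    "remote_type": ["Remote", "Remote", "Remote", "Remote", "Hybrid"],
    "experience_level": ["Entry", "Entry", "Senior", "Entry", "Mid"],
    "salary_min_usd": [56873, 54803, 149980, 53483, 102977],
    "salary_max_usd": [72223, 85599, 175806, 86477, 127298],
})
toy["salary_mid"] = (toy["salary_max_usd"] + toy["salary_min_usd"]) / 2

categorical = ["job_title", "remote_type", "experience_level"]

# Each distinct level becomes its own indicator column
levels = toy[categorical].nunique()
encoded = pd.get_dummies(toy, columns=categorical)

print(levels.to_dict())
print(toy.shape, "->", encoded.shape)
```

Going by the level counts noted in the value-count comments earlier (6 job titles, 25 city values, and so on), encoding every categorical column of the full dataset would add roughly 55 indicator columns, which supports deferring the step until the modeling stage.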