# Automation of occupations: consequences for the USA

Nobody has a crystal ball that can tell the future, but some people don’t need an ancient relic to foresee what’s going to happen, because they are currently building the future in which we will all live.

It’s true that AI and automation will wreak havoc among the workforce, rendering a large part of the population useless and without economic value.

Not only will they take your jobs, but they will also make the rich even richer.

If you are looking for job opportunities that are less likely to be affected by AI or automation, you’re in the right place.

That said, it might be wise to consider some of the fields which will see an uptick in productivity in the following years.

Questions to analyse:

1. Which occupations are the most sensitive and the most robust to automation (computerisation)?
2. What does the data distribution look like?
3. How many jobs would be lost in the US if automation eliminated occupations with an automation probability of 0.7 or higher?
4. Which US states are the most sensitive and the most robust to automation?
5. Compare the most common occupations at risk of automation.

# I. Data import and functions

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import os
from textwrap import wrap
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
sns.set(font_scale=3)

In [None]:
def plot_heatmap(data_in, title_in, number):
    '''
    Inputs:
        data_in: Data Frame of objects and floats;
        title_in: string, chart title;
        number: boolean, for showing or not showing number values in heatmap.
    Output: 
        heatmap chart.
    '''
    plt.figure(figsize=(30,20))
    sns.heatmap(data=data_in, annot=number) 
    plt.title('\n'.join(wrap(title_in)), fontsize=40, fontweight="bold", pad=40)
    plt.ylabel("")

In [None]:
def barplot(data_in, x_data, y_data, title_in, hue_in, line):
    '''
    Inputs:
        data_in: Data Frame of objects and floats;
        x_data: float, x axis, number of jobs positions
        y_data: object, states names
        title_in: string, chart title (string);
        hue_in: float, used values threshhold or None
        line: float, mean value line for bars.
    Output: 
        bar chart.
    '''
    plt.figure(figsize=(15,14))
    graph = sns.barplot(x=x_data, y=y_data, hue=hue_in, data=data_in, palette="twilight")
    plt.title('\n'.join(wrap(title_in)), fontsize=20, fontweight="bold", pad=20)
    plt.xlabel("Number of jobs positions")
    graph.axvline(line)

In [None]:
# Create a list of colors (from iWantHue)
colors = ["#E13F29", "#D69A80", "#D63B59", "#AE5552", "#CB5C3B", "#EB8076", "#96624E"]

def plot_pie(data_in, title_in, labels_in):
    '''
    Inputs:
        data_in: Data Frame of objects and floats;
        title_in: string, chart title;
        labels_in: object, occupation name.
    Output: 
        pie chart.
    '''
    plt.pie(
        # values that determine the slice sizes
        data_in,
        # slice labels (occupation names)
        labels=labels_in,
        # draw a shadow beneath the pie
        shadow=True,
        # slice colors
        colors=colors,
        # pull the first slice out; this tuple assumes exactly 10 slices
        explode=(0.15, 0, 0, 0, 0, 0, 0, 0, 0, 0),
        # start drawing at 90 degrees
        startangle=90,
        # show each slice's share as a percentage with one decimal
        autopct='%1.1f%%',
        )
    plt.title(label='\n'.join(wrap(title_in, 40)), fontsize=20, fontweight="bold", loc="center", pad=20)
#     sns.set(font_scale=1.3)
    # Equal aspect ratio so the pie is drawn as a circle
    plt.axis('equal')

    # Adjust the layout so labels fit inside the figure
    plt.tight_layout()
#     plt.savefig('Dakota pie.svg', bbox_inches="tight")
#     plt.show()
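
The explode tuple in plot_pie is hard-coded to ten entries, and matplotlib expects one entry per slice, so the function only works for exactly ten values. A minimal sketch of building the tuple from the data length instead; the helper name `explode_first` is hypothetical and not part of the original notebook:

```python
def explode_first(n_slices, offset=0.15):
    """Return an explode tuple that pulls out only the first slice.

    Hypothetical helper: n_slices is the number of values passed to plt.pie,
    offset is how far the first slice is moved away from the centre.
    """
    return tuple(offset if i == 0 else 0 for i in range(n_slices))


# Ten slices reproduces the tuple used in plot_pie
print(explode_first(10))
# A shorter pie gets a correspondingly shorter tuple
print(explode_first(3))
```

Inside plot_pie this could be passed as explode=explode_first(len(data_in)), assuming data_in supports len().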

Import "Automation data by state"

In [None]:
Automation_data = "../input/occupation-salary-and-likelihood-of-automation/automation_data_by_state.csv"
A_data = pd.read_csv(Automation_data, encoding = "ISO-8859-1")
state_names = A_data.columns[3:]

Import "Occupation salary"

In [None]:
salary_data = "../input/occupation-salary-and-likelihood-of-automation/occupation_salary.xlsx"
S_data = pd.read_excel(salary_data, index_col="OCC_CODE")
# S_data.shape
# S_data.isnull().sum() #only in ANNUAL, HOURLY columns a lot of null values.

To compare occupation numbers as percentages, I need US population data.
I couldn't read HTML on the Kaggle platform, so I created the CSV data file in PyCharm. The code is here:

In [None]:
# link = 'https://www.infoplease.com/us/states/state-population-by-rank'
# w = pd.read_html(link, header=0)
# w[0].columns
# df_1 = w[0]
# condition = df_1['State'].isin(state_names)
# population_0 = df_1[condition]

Then I opened the US_population csv here:

In [None]:
link = '../input/uspopulation/US_population_2.csv'
population = pd.read_csv(link, header=0)
population.head()  # Census population: total population figures per state.
# There are no null values in the columns.

# II. Data cleaning and preparation for visualisations

## Salary data (S_data)

Drop the empty columns: ANNUAL, HOURLY

In [None]:
S_data_clean = S_data.drop(['ANNUAL', 'HOURLY'], axis=1)

## Population data

Sort data by states and take only state and population columns

In [None]:
population_sort = population.sort_values(by=['State'])
states_pop = population_sort.loc[:, ['State', 'July 2019 Estimate']].reset_index()

## Automations data (A_data) preparation

Sum the workers in each occupation across all states to get US totals. Then drop the occupations that have zero workers in every state, and sort the data by US worker count and by probability. Finally, take the 5 occupations with the highest and the 5 with the lowest automation probabilities.

In [None]:
us_sum = A_data[state_names].sum(axis=1)
us_sum_DF = pd.DataFrame({'US_workers':us_sum.values})
Occupation_proba = A_data[['Occupation', 'Probability']]
US_O_proba = Occupation_proba.join(us_sum_DF)
index_names = US_O_proba[ US_O_proba['US_workers'] == 0 ].index
US_worker = US_O_proba.drop(index_names).reset_index()
common_US_work = US_worker.drop(['index'], axis=1)

US_work = common_US_work.sort_values(by=['US_workers'], ascending=False).reset_index(drop=True)
US_sort_probability = common_US_work.sort_values(by=['Probability'], ascending=False).reset_index(drop=True)

A_head = US_sort_probability.head()
A_tail = US_sort_probability.tail()
highest_lowest_prob = pd.concat([A_head, A_tail])

Drop the all-zero rows from A_data and set the "SOC" column as the index.

In [None]:
A_data_clean = A_data.drop(index_names).reset_index()
A_data_SOC = A_data_clean.set_index('SOC')

5 occupations with highest probability for automatisation, and 5 with lowest probability for automatisation

In [None]:
A_data_prob_sort = A_data_clean.sort_values(by=['Probability'], ascending=False).reset_index()
A_head_clean = A_data_prob_sort.head()
A_tail_clean = A_data_prob_sort.tail()

A_data_highest_lowest_prob = pd.concat([A_head_clean, A_tail_clean])
A_data_highest_lowest_prob_present = A_data_highest_lowest_prob.drop(['level_0', 'index'], axis=1).set_index('Occupation')

Convert the number of workers per occupation and state into a ratio of workers to state population.

In [None]:
# Transpose A_data so that states become rows
col_names = A_data_clean.columns
A_trans = pd.DataFrame(A_data_clean.values.T, columns=A_data_clean['SOC'], index=col_names)
A_tr = A_trans.iloc[4:]

# Divide the number of jobs per occupation in each state by the state's population, then transpose the data frame back.
# The division is positional (.values), so it assumes both tables list the states in the same order.
reliative_popul = A_tr.div(states_pop['July 2019 Estimate'].values,axis=0)
reliative_popul = reliative_popul.fillna(0)
A_double_T = reliative_popul.T

# Join the occupation and probability columns back onto the transposed data frame.
A_data_double_T = A_data_SOC[['Occupation', 'Probability']].join(A_double_T)
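
The division above pairs states with populations purely by position. As a hedged alternative, the sketch below divides by a population Series indexed by state name, so pandas aligns on labels instead. The toy numbers and the names `workers` and `pop` are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Toy data (hypothetical numbers): workers per occupation (rows) and state (columns)
workers = pd.DataFrame(
    {"Nevada": [100, 300], "Alabama": [200, 50]},
    index=["11-1011", "41-2011"],
)

# Population indexed by state name, deliberately in a different order
pop = pd.Series({"Alabama": 1000, "Nevada": 2000})

# axis="columns" matches the Series index against the column labels,
# so each state is divided by its own population regardless of order
ratio = workers.div(pop, axis="columns")
print(ratio)
```

With label alignment, a state missing from either table shows up as NaN instead of silently shifting every later state by one position.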

Group occupations into categories. For each category, take the mean automation probability and the summed ratio of workers to population per state.

In [None]:
# Split SOC column to 'Occupation_group_no' and 'Occupation_no'
SOC_column = A_data_double_T.reset_index()
SOC_column['SOC'] = SOC_column.SOC.astype(str)
SOC_column[['Occupation_group_no','Occupation_no']] = SOC_column.SOC.str.split("-",expand=True,)
Group_data = SOC_column.copy()

# Per occupation group: mean probability, and summed workers-to-population ratio per state
Group = Group_data.groupby(['Occupation_group_no']).Probability.mean()
Group_states = Group_data.groupby(['Occupation_group_no'])[state_names].sum()

# Titles of occupation groups
s = pd.Series(['Management', 'Business Operations', 'Computer and Mathematical', 'Architecture and Engineering', 'Life, Physical, and Social Science', 'Community and Social Service', 'Legal', 'Education, Training, and Library', 'Design, Entertainment and Sports', 'Healthcare Practitioners', 'Healthcare Support', 'Protective Service', 'Food Serving Related', 'Cleaning and Maintenance', 'Personal Care and Service', 'Sales and Related', 'Administrative Support', 'Farming, Fishing, and Forestry', 'Construction and Extraction', 'Installation and Repair', 'Production', 'Transportation'], index=['11', '13', '15', '17', '19', '21', '23', '25', '27', '29', '31', '33', '35', '37', '39', '41', '43', '45', '47', '49', '51', '53'])

Occupations_groups_pd = pd.DataFrame({'index':s.index, 'Occupations groups':s.values}) # Make data frame from series
Occupations_groups = Occupations_groups_pd.set_index('index')

# Join data to one table and sort by probability
Occupations_groups_join1 = Occupations_groups.join(Group)
Occupations_groups_join2 = Occupations_groups_join1.join(Group_states)
Occupations_groups_prob_sort1 = Occupations_groups_join2.sort_values(by=['Probability'], ascending=False).reset_index()
Occupations_groups_prob_sort = Occupations_groups_prob_sort1.fillna(0)
Occupations_groups_prob_sort.head()

# Merge Occupations groups and probability columns to one
Occupations_groups_prob_round = Occupations_groups_prob_sort['Probability'].round(2)
Occupations_groups_prob_round_df = pd.DataFrame({'Probability':Occupations_groups_prob_round.values})
Occupations_groups_prob_sort["Occupations groups and Probability"] = Occupations_groups_prob_sort["Occupations groups"] + " " + Occupations_groups_prob_round_df["Probability"].astype(str)
Occupations_groups_plot = Occupations_groups_prob_sort.copy().set_index('Occupations groups and Probability')
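
To make the grouping step easier to follow, here is a small sketch on made-up data. It assumes, as the code above does, that the part of the SOC code before the hyphen identifies the occupation group; the frame `toy` and its values are hypothetical:

```python
import pandas as pd

# Hypothetical rows: two management occupations (11-...) and one administrative (43-...)
toy = pd.DataFrame({
    "SOC": ["11-1011", "11-1021", "43-3031"],
    "Probability": [0.1, 0.3, 0.98],
    "Nevada": [0.001, 0.002, 0.004],
})

# The part before the hyphen is the occupation group number
toy["Occupation_group_no"] = toy["SOC"].str.split("-").str[0]

# Mean probability per group, and summed workers-to-population ratio per group
summary = toy.groupby("Occupation_group_no").agg(
    Probability=("Probability", "mean"),
    Nevada=("Nevada", "sum"),
)
print(summary)
```

Note that the mean is unweighted: every occupation counts equally within its group, regardless of how many people work in it.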

# III. Data analysis

Plot heat map

In [None]:
plot_heatmap(Occupations_groups_plot[state_names], 'Occupation categories and the probability of automation by states', False)

In [None]:
plot_heatmap(Occupations_groups_plot[['South Dakota', 'Nevada', 'District of Columbia', 'Massachusetts']],'Occupation categories and probability of automatisation in S. Dakota, Nevada, DC and Massachusetts', True)

To do: shorten the longest occupation names in code, by finding the relevant occupations and replacing their names.

In [None]:
plot_heatmap(A_data_highest_lowest_prob_present[state_names], 'Workers numbers in occupations with 5 highest and 5 lowest probabilities for automation', False)

# **Should this distribution chart be kept?**
It could later be used to select occupations that have both a high automation probability and a large number of workers.
Distribution of US worker numbers and occupation automation probabilities

In [None]:
# plt.title("US workers numbers and Occupations automatisation probabilities distribution")
sns.set_style("dark")
plt.figure(figsize=(20,20))
fig = sns.jointplot(x=US_work['Probability'], y=US_work['US_workers'], kind="kde")
# plt.ylabel("Occupations numbers")
sns.set(font_scale=1.2)
# ax.set_ylim(1,31)
# ax.set_yticks(range(1,32))
# g.despine(bottom=True, left=True)
plt.gcf().set_size_inches(12, 8)
plt.show() 

A probability higher than 0.7 represents a "high risk category, meaning that associated occupations are potentially automatable over some unspecified number of years, perhaps a decade or two" according to the original research paper. We can treat this probability as a rough time frame, where higher-probability occupations are likely to be automated sooner.

Next comes the ratio of lost job positions to all job positions per state. I count a position as lost when the automation probability of its occupation is equal to 0.7 or higher (threshold >= 0.7).

In [None]:
threshhold = 0.7 # occupation automation probability threshold
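
As a hedged illustration of what the rest of this section computes, the sketch below derives the share of positions at or above the threshold for two invented states. The frame `toy`, the state names and all numbers are assumptions made for the example only:

```python
import pandas as pd

threshold = 0.7  # same cut-off as in the notebook

# Hypothetical occupations with their automation probability and workers per state
toy = pd.DataFrame({
    "Probability": [0.9, 0.7, 0.2],
    "StateA": [50, 30, 20],
    "StateB": [10, 10, 80],
})
states = ["StateA", "StateB"]

# All positions per state
total = toy[states].sum()
# Positions in occupations at or above the threshold
at_risk = toy.loc[toy["Probability"] >= threshold, states].sum()
# Share of positions that would be lost, in percent
lost_share = at_risk / total * 100
print(lost_share)
```

This is the same quantity as (total - remaining) / total * 100, where remaining counts the positions with probability below the threshold.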

## Total number of job positions per state

I estimate that these totals understate the real job numbers by about 10%, because some job positions are not included: in the data I received, occupations for which data was not available or which had fewer than 10 employees were marked as zero.

In [None]:
p = A_data_clean.sort_values(by=['Probability'], ascending=False)
sum_work_per_state = p[state_names].sum()
States_sum_DF = pd.DataFrame({'States':sum_work_per_state.index, 'sum_work_positions':sum_work_per_state.values})
States_sum_sort =  States_sum_DF.sort_values(by=['sum_work_positions'], ascending=False)

In [None]:
barplot(States_sum_sort, 'sum_work_positions', 'States', "Total number of jobs positions per state", None, 0)

# Add a second bar for every state: the positions that remain in occupations with automation probability below the threshold (< 0.7).

In [None]:
p_index = p.reset_index().drop(['level_0', 'index', 'SOC'], axis=1)
p05 = p_index.loc[(p_index.Probability < threshhold)]

In [None]:
sum_work_per_state05 = p05[state_names].sum()
States_sum_DF05 = pd.DataFrame({'States':sum_work_per_state05.index, 'sum_work_positions':sum_work_per_state05.values})
States_sum_sort05 =  States_sum_DF05.sort_values(by=['sum_work_positions'], ascending=False)
States_sum_sort05.head()

In [None]:
barplot(States_sum_sort05, 'sum_work_positions', 'States', "Number of job positions per state in occupations with automation probability below 0.7", None, 0)

# How many job positions would each state lose if occupations with an automation probability of 0.7 or higher disappeared?

In [None]:
States_sum_DF['Threshhold'] = 1.0
States_sum_DF05['Threshhold'] = threshhold
States_sum_DF05.head()

In [None]:
Compare_sums = pd.concat([States_sum_DF, States_sum_DF05])
Compare_sums_sort = Compare_sums.sort_values(by=['sum_work_positions'], ascending=False)
Compare_sums_sort

In [None]:
barplot(Compare_sums_sort, 'sum_work_positions', 'States', "Total number of job positions per state now (Threshhold=1.0) and after losing occupations with high automation probability (Threshhold=0.7)", 'Threshhold', 0)

Let's look at the relative losses.

In [None]:
States_sum_join = States_sum_DF.join(States_sum_DF05, lsuffix='', rsuffix='0.5')
States_sum_drop = States_sum_join.drop(['Threshhold', 'States0.5', 'Threshhold0.5'], axis=1)
Relative_jobs_drop = ((States_sum_drop['sum_work_positions']-States_sum_drop['sum_work_positions0.5'])/States_sum_drop['sum_work_positions'])*100
States_sum_drop.head()

In [None]:
Relative_jobs_drop_DF = pd.DataFrame({'Lost jobs ratio':Relative_jobs_drop.values})
Relative_jobs_drop_States = States_sum_drop.join(Relative_jobs_drop_DF)
Relative_jobs_drop_States_sort = Relative_jobs_drop_States.sort_values(by=['sum_work_positions'], ascending=False)
Relative_jobs_drop_mean = Relative_jobs_drop_States_sort['Lost jobs ratio'].mean()
Relative_jobs_drop_mean

In [None]:
barplot(Relative_jobs_drop_States_sort, 'Lost jobs ratio', 'States', "Share of jobs lost per state if occupations with automation probability of 0.7 (the threshold) or higher are lost", None, Relative_jobs_drop_mean)
plt.xlabel("Lost jobs ratio, %")  # barplot() labels the x axis as a job count, so override it here

In [None]:
Relative_jobs_drop_States_highest = Relative_jobs_drop_States.sort_values(by=['Lost jobs ratio'], ascending=False)

In [None]:
barplot(Relative_jobs_drop_States_highest, 'Lost jobs ratio', 'States', "Share of jobs lost per state if occupations with automation probability of 0.7 (the threshold) or higher are lost", None, Relative_jobs_drop_mean)
plt.xlabel("Lost jobs ratio, %")  # barplot() labels the x axis as a job count, so override it here

# South Dakota and Nevada have the largest job losses. Let's look at the biggest occupations they would lose

Let's start with Nevada data

In [None]:
nevada = A_data_clean[['Occupation', 'Probability', 'South Dakota', 'Nevada']].sort_values(by=['Probability'], ascending=False)
nevada07_full = nevada.loc[(nevada.Probability >= threshhold)].sort_values(by=['Nevada'], ascending=False).reset_index()
nevada07 = nevada07_full.head(9)
# Sum everything below the top 9 into a single "Other" row
nevada07_tail = nevada07_full.iloc[9:].Nevada.sum()
df2 = {'Occupation': 'Other', 'Probability': 0, 'South Dakota': 0, 'Nevada': nevada07_tail}
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
nevada07 = pd.concat([nevada07, pd.DataFrame([df2])], ignore_index=True)
nevada07
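
The same "top 9 plus Other" table is built again for each state below, so the pattern could be wrapped in a small helper. The function `top_n_with_other` and the toy data are hypothetical and only sketch the idea; they are not part of the original notebook:

```python
import pandas as pd


def top_n_with_other(df, value_col, label_col="Occupation", n=9):
    """Keep the n largest rows by value_col and sum the rest into one 'Other' row."""
    ranked = df.sort_values(by=value_col, ascending=False)
    top = ranked.head(n)[[label_col, value_col]]
    # Everything after the first n rows goes into the "Other" bucket
    other_total = ranked.iloc[n:][value_col].sum()
    other = pd.DataFrame({label_col: ["Other"], value_col: [other_total]})
    return pd.concat([top, other], ignore_index=True)


# Hypothetical worker counts for one state
toy = pd.DataFrame({
    "Occupation": ["Cashiers", "Cooks", "Clerks", "Drivers"],
    "Nevada": [40, 10, 30, 20],
})
result = top_n_with_other(toy, "Nevada", n=2)
print(result)
```

Slicing with iloc[n:] avoids hard-coding how many rows are left after the top n, so the helper keeps working if the threshold or the dataset changes.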

Plot pie chart of Nevada data

In [None]:
plot_pie(nevada07['Nevada'], "The largest most likely automatable occupations in Nevada", nevada07['Occupation'])

Now let's look at South Dakota

In [None]:
dakota = A_data_clean[['Occupation', 'Probability', 'South Dakota']].sort_values(by=['Probability'], ascending=False)
dakota07_full = dakota.loc[(dakota.Probability >= threshhold)].sort_values(by=['South Dakota'], ascending=False).reset_index()
dakota07 = dakota07_full.head(9)
# Sum everything below the top 9 into a single "Other" row
dakota07_tail = dakota07_full.iloc[9:]['South Dakota'].sum()
df2 = {'Occupation': 'Other', 'Probability': 0, 'South Dakota': dakota07_tail}
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
dakota07 = pd.concat([dakota07, pd.DataFrame([df2])], ignore_index=True)
dakota07

Plot pie chart of South Dakota data

In [None]:
plot_pie(dakota07['South Dakota'], "The largest most likely automatable occupations in South Dakota", dakota07['Occupation'])

The District of Columbia has the lowest sensitivity to automation. Let's check the data

In [None]:
DC = A_data_clean[['Occupation', 'Probability', 'District of Columbia']].sort_values(by=['Probability'], ascending=False)
DC07_full = DC.loc[(DC.Probability >= threshhold)].sort_values(by=['District of Columbia'], ascending=False).reset_index()
DC07 = DC07_full.head(9)
# Sum everything below the top 9 into a single "Other" row
DC07_tail = DC07_full.iloc[9:]['District of Columbia'].sum()
df2 = {'Occupation': 'Other', 'Probability': 0, 'District of Columbia': DC07_tail}
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
DC07 = pd.concat([DC07, pd.DataFrame([df2])], ignore_index=True)
DC07

District of Columbia pie chart

In [None]:
plot_pie(DC07['District of Columbia'], "The largest most likely automatable occupations in District of Columbia", DC07['Occupation'])

Massachusetts data

In [None]:
massachusetts = A_data_clean[['Occupation', 'Probability', 'Massachusetts']].sort_values(by=['Probability'], ascending=False)
# Filter on this frame's own Probability column (the original used the nevada frame here)
massachusetts07_full = massachusetts.loc[(massachusetts.Probability >= threshhold)].sort_values(by=['Massachusetts'], ascending=False).reset_index()
massachusetts07 = massachusetts07_full.head(9)
# Sum everything below the top 9 into a single "Other" row
massachusetts07_tail = massachusetts07_full.iloc[9:].Massachusetts.sum()
df2 = {'Occupation': 'Other', 'Probability': 0, 'Massachusetts': massachusetts07_tail}
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
massachusetts07 = pd.concat([massachusetts07, pd.DataFrame([df2])], ignore_index=True)
massachusetts07

Massachusetts pie chart

In [None]:
plot_pie(massachusetts07['Massachusetts'], "The largest most likely automatable occupations in Massachusetts", massachusetts07['Occupation'])

# Conclusions

1. The occupations most robust to automation: social service, management, computer and mathematical, and medicine. The most sensitive: administrative support, production, farming, fishing and forestry, and food serving related.
2. Slightly more occupations have a high probability of automation than a low one.
3. The US would lose around half of all jobs if automation eliminated occupations with an automation probability of 0.7 or higher.
4. South Dakota and Nevada are the most sensitive, and the District of Columbia and Massachusetts are the most robust to automation.
5. Nevada and South Dakota have the most occupations with a high probability of automation. The District of Columbia also has many occupations with a low probability of automation, such as management, arts and protective service.