# Descriptive statistics:

- Calculate the mean, median, and mode of the salary for each title.
- Calculate the standard deviation and interquartile range (IQR) of the salary for each title.
- Create histograms and boxplots to visualize the distribution of the salary for each title.
- Summarize the findings of your descriptive analysis in a report or presentation.

# Data exploration:

- Create scatter plots to visualize the relationship between salary and regular and total earnings.
- Calculate correlation coefficients to quantify the strength and direction of the relationship between salary and regular and total earnings.
- Identify any outliers or patterns in the data.
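
The correlation and outlier steps above could be sketched as follows. This is a minimal sketch, not the notebook's own method: the values are invented, and it assumes salary corresponds to the `Regular` column and is compared against `Total Earnings`. The 1.5 × IQR rule is one common outlier convention among several.

```python
import pandas as pd

# Hypothetical sample with invented values; the real analysis would use
# df_main after its currency columns have been cleaned.
df = pd.DataFrame({
    "Regular": [30000, 40000, 50000, 60000, 70000, 80000],
    "Total Earnings": [32000, 41000, 55000, 61000, 78000, 250000],
})

# Pearson correlation coefficient between the two columns (range -1 to 1).
corr = df["Regular"].corr(df["Total Earnings"])

# Flag outliers in 'Total Earnings' with the 1.5 * IQR rule.
q1 = df["Total Earnings"].quantile(0.25)
q3 = df["Total Earnings"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["Total Earnings"] < q1 - 1.5 * iqr) | (df["Total Earnings"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(f"correlation: {corr:.3f}")
print(outliers)
```

For the scatter plot itself, `df.plot.scatter(x="Regular", y="Total Earnings")` is probably the shortest route once matplotlib is installed.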

# Predictive modeling:

- Use the data to train a machine learning model to predict salary based on title and regular and total earnings.
- Evaluate the performance of the model using metrics such as mean absolute error (MAE) and root mean squared error (RMSE).
- Use the model to make predictions for new data points.
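
One possible shape for these steps is sketched below. It swaps in an ordinary least-squares linear regression (via `np.linalg.lstsq`) as the model, which the task list does not prescribe. The data is invented, and the choice to predict `Regular` from `Title` and `Total Earnings` is an assumption about what the task intends.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; column names follow the dataset, values are invented.
df = pd.DataFrame({
    "Title": ["Teacher", "Teacher", "Clerk", "Clerk", "Teacher", "Clerk", "Teacher", "Clerk"],
    "Total Earnings": [70000, 82000, 41000, 45000, 90000, 39000, 76000, 47000],
    "Regular": [68000, 80000, 40000, 43000, 87000, 38000, 74000, 45000],
})

# One-hot encode the title (this also plays the role of a per-title intercept)
# and append the numeric feature.
X = pd.get_dummies(df["Title"], dtype=float)
X["Total Earnings"] = df["Total Earnings"].astype(float)
X = X.to_numpy()
y = df["Regular"].to_numpy(dtype=float)

# Simple hold-out split: first six rows for training, last two for testing.
X_train, X_test = X[:6], X[6:]
y_train, y_test = y[:6], y[6:]

# Fit the linear model by least squares and predict the held-out rows.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
y_pred = X_test @ coef

# Evaluate with mean absolute error and root mean squared error.
errors = y_test - y_pred
mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))

print(f"MAE:  {mae:.1f}")
print(f"RMSE: {rmse:.1f}")
```

With well over a thousand distinct titles in the real data, one-hot encoding gives a very wide matrix, so grouping rare titles together first may be worth considering.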

# Hypothesis testing:

- Formulate a hypothesis about the relationship between salary and regular and total earnings.
- Conduct a statistical test to determine whether the hypothesis is supported by the data.
- Draw conclusions about the relationship between salary and regular and total earnings.
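
As an illustration of these steps, the sketch below runs a permutation test on the Pearson correlation. The test choice is an assumption (the task list names none), the values are invented, and `pearson_r` is a hypothetical helper. The null hypothesis is that the two variables have no linear association.

```python
import numpy as np

# Hypothetical, invented values standing in for 'Regular' and 'Total Earnings'.
regular = np.array([30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000], dtype=float)
total = np.array([32000, 41000, 55000, 61000, 78000, 83000, 95000, 108000], dtype=float)


def pearson_r(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


observed = pearson_r(regular, total)

# Shuffling one variable breaks the pairing, so the shuffled correlations
# approximate the distribution expected under the null hypothesis.
rng = np.random.default_rng(0)
n_permutations = 5000
permuted = np.array([
    pearson_r(regular, rng.permutation(total)) for _ in range(n_permutations)
])

# Two-sided p-value; the +1 terms keep it from being exactly zero.
p_value = (np.sum(np.abs(permuted) >= abs(observed)) + 1) / (n_permutations + 1)

print(f"observed r = {observed:.3f}, p = {p_value:.4f}")
```

A small p-value would count as evidence against the null hypothesis of no association. It says nothing about causation.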

# Machine learning:

- Use the data to train a machine learning model to classify job titles based on salary and regular and total earnings.
- Evaluate the performance of the model using metrics such as accuracy and precision.
- Use the model to classify new job titles.
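
A rough sketch of the classification steps follows, using a nearest-centroid classifier written with NumPy. That model is a stand-in chosen for brevity, not something the task list specifies. The data is invented, and `predict` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Hypothetical training and test samples with invented values.
train = pd.DataFrame({
    "Title": ["Teacher", "Teacher", "Teacher", "Clerk", "Clerk", "Clerk"],
    "Regular": [68000, 80000, 74000, 40000, 43000, 38000],
    "Total Earnings": [70000, 82000, 76000, 41000, 45000, 39000],
})
test = pd.DataFrame({
    "Title": ["Teacher", "Clerk", "Clerk", "Teacher"],
    "Regular": [71000, 42000, 60000, 85000],
    "Total Earnings": [73000, 44000, 61000, 90000],
})
features = ["Regular", "Total Earnings"]

# "Training": one centroid (mean feature vector) per title.
centroids = train.groupby("Title")[features].mean()


def predict(frame):
    """Assign each row the title whose centroid is closest (Euclidean)."""
    points = frame[features].to_numpy(dtype=float)
    diffs = points[:, None, :] - centroids.to_numpy()[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)
    return centroids.index.to_numpy()[distances.argmin(axis=1)]


predicted = predict(test)
actual = test["Title"].to_numpy()

# Accuracy: share of rows classified correctly.
accuracy = np.mean(predicted == actual)

# Precision for 'Teacher': of the rows predicted as Teacher, the share that are.
predicted_teacher = predicted == "Teacher"
precision_teacher = np.mean(actual[predicted_teacher] == "Teacher")

print(f"accuracy: {accuracy:.2f}, precision (Teacher): {precision_teacher:.2f}")
```

Because many titles in the real data share similar pay ranges, two pay columns alone are unlikely to separate them well, so low scores should be expected.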

In [51]:
%pip install pandas
%pip install matplotlib
%pip install numpy

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip




In [52]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pprint import pprint

In [53]:
# load data
df_main = pd.read_csv('./data/employee-earnings-report-2011.csv')



In [54]:
# drop the 'Zip Code' and 'Name' columns
df_main = df_main.drop(columns=['Zip Code', 'Name'])

# columns that hold currency strings and need cleaning
list_columns = ['Total Earnings', 'Regular', 'Retro', 'Other', 'Overtime', 'Injured', 'Detail', 'Quinn']

# clean columns: strip '$', ',', '(' and ')', drop the cents, and convert to int
# note: if the parentheses mark negative amounts, removing them discards the sign
for column in list_columns:
    cleaned = df_main[column].fillna("0.0")
    for char in ['$', ',', '(', ')']:
        # regex=False treats '$', '(' and ')' as literal characters
        cleaned = cleaned.str.replace(char, '', regex=False)
    df_main[column] = cleaned.str.split('.', expand=True)[0].astype(int)

pprint(df_main)


                    Department Name                      Title  Regular  \
0              Assessing Department     Property Officer (Asn)    33065   
1      ASD Office Of Labor Relation      Asst Corp Counsel III    76051   
2      Transportation-Parking Clerk  Chief Claims Investigator    56430   
3             Boston Public Library        Spec Library Asst I    35058   
4                    Law Department                 Prin Clerk    41588   
...                             ...                        ...      ...   
20504         Boston Public Schools                    Teacher    87696   
20505         Boston Public Schools         Substitute Teacher    26366   
20506         Boston Public Schools         Substitute Teacher     1942   
20507         Boston Public Schools                    Teacher    64210   
20508         Boston Public Schools                  Developer    94466   

       Retro  Other  Overtime  Injured  Detail  Quinn  Total Earnings  
0          0      0       3

In [55]:
# method to calculate the mode
def mode(data):
    return data.mode().iloc[0]


# Calculate the mean, median, and mode of the salary for each title.
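# string names for the built-in aggregations avoid the deprecation warning raised for np.mean / np.median
df_result = df_main.groupby('Title').agg({'Regular': ['mean', 'median', mode]}).reset_index()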

# show dataframe
df_result



,Title,Regular (mean),Regular (median),Regular (mode)
0,ABA Specialist,29314.400000,38100.0,5880
1,ACC - Attorney,54273.857143,58480.0,1461
2,ACC - Management,91143.500000,93569.5,75634
3,ACC - Sr Attorney,75000.000000,75000.0,75000
4,Academic Superintendent,86364.285714,129565.0,0
...,...,...,...,...
1385,Young Adults Librarian I,50410.000000,47629.0,46650
1386,Young Adults Librarian II,6763.000000,6763.0,6763
1387,Youth Advocate,38604.500000,40543.5,31465
1388,Youth Worker,33231.722222,38032.5,42873


In [56]:
# Calculate the interquartile range (IQR) of the salary for each title.
# The standard deviation needs its own output column: a dict with two 'Regular' keys
# keeps only the last entry, so this cell computes the IQR only.
df_result = df_main.groupby('Title').agg({'Regular': lambda x: x.quantile(.75) - x.quantile(.25)}).reset_index()

# show dataframe
df_result

,Title,Regular (IQR)
0,ABA Specialist,27805.50
1,ACC - Attorney,16364.75
2,ACC - Management,7973.50
3,ACC - Sr Attorney,0.00
4,Academic Superintendent,114884.50
...,...,...
1385,Young Adults Librarian I,5150.50
1386,Young Adults Librarian II,0.00
1387,Youth Advocate,7344.75
1388,Youth Worker,7493.50
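
To get the standard deviation and the IQR side by side, pandas named aggregation is one option. The sketch below uses invented data. The `iqr` helper and the `df_spread` name are hypothetical and do not come from the notebook.

```python
import pandas as pd

# Hypothetical sample with invented values.
df = pd.DataFrame({
    "Title": ["Teacher", "Teacher", "Teacher", "Clerk", "Clerk"],
    "Regular": [60000, 70000, 80000, 40000, 44000],
})


def iqr(series):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return series.quantile(0.75) - series.quantile(0.25)


# Named aggregation gives each statistic its own output column,
# so neither result overwrites the other.
df_spread = (
    df.groupby("Title")["Regular"]
      .agg(std="std", iqr=iqr)
      .reset_index()
)

print(df_spread)
```

Titles with a single employee get `NaN` for the sample standard deviation and `0.0` for the IQR, which matches the zero IQR rows in the output above.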


In [None]:
# Create histograms and boxplots to visualize the distribution of the salary for each title.
# the column is named 'Title'; grouping by 'title' raises a KeyError
df_result = df_main.groupby('Title')
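
The plotting step could be completed along these lines. This is a sketch on invented data. It assumes only a handful of the most frequent titles are plotted, since the grouped output lists well over a thousand titles and one chart per title would be unreadable.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample with invented values.
df = pd.DataFrame({
    "Title": ["Teacher"] * 4 + ["Clerk"] * 3 + ["Librarian"],
    "Regular": [60000, 70000, 80000, 75000, 40000, 44000, 42000, 50000],
})

# Keep only the most common titles (two here) to keep the plots legible.
top_titles = df["Title"].value_counts().head(2).index
subset = df[df["Title"].isin(top_titles)]

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Overlaid histograms, one per title.
labels, grouped = [], []
for title, group in subset.groupby("Title"):
    ax_hist.hist(group["Regular"], bins=5, alpha=0.5, label=title)
    labels.append(title)
    grouped.append(group["Regular"].to_numpy())
ax_hist.set_xlabel("Regular")
ax_hist.set_ylabel("Employees")
ax_hist.legend()

# Side-by-side boxplots; boxes sit at positions 1..N by default.
ax_box.boxplot(grouped)
ax_box.set_xticks(range(1, len(labels) + 1))
ax_box.set_xticklabels(labels)
ax_box.set_ylabel("Regular")

fig.tight_layout()
```

Inside a notebook the figure renders inline; in a script, `fig.savefig(...)` or `plt.show()` would be needed.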