# Lab Five: Wide and Deep Networks

***Md Mahfuzur Rahman, Will Schneider, Nik Zelenikovski***



## 1. Preparation

In [8]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns',None)

df = pd.read_csv('salaries.csv') # read in the csv file


print(df.info())
print('===========')
# note that the describe function defaults to using only some variables
print(df.describe())
print('===========')

print(df.select_dtypes(include=['object']).nunique().sum(),"unique class variables")
print('===========')

df = df.drop_duplicates()
df = df.dropna().reset_index(drop=True)
print(df.shape)

# create a data description table
data_des = pd.DataFrame()
#code adapted from sample Lab 1 Submission
data_des['Features'] = df.columns
data_des['Description'] = ['Work Year', 'Position Experience Level',
                          'Employment Type', 'Job Title',
                          'Position Salary', 'Currency of Salary',
                          'Position Salary (USD)', 'Employee Residence',
                          'Percentage of Job Responsibilities Completed Remotely', 'Company Location', 'Company Size']
data_des['Scales'] = ['interval'] + ['ordinal'] + ['nominal']*2 + ['ratio'] + ['nominal']+ ['ratio'] + ['nominal'] + ['ratio'] + ['nominal'] + ['ordinal']
data_des['Discrete\Continuous'] = ['continuous'] + ['discrete']*3 + ['continuous'] + \
                                  ['discrete'] + ['continuous'] + ['discrete'] + ['continuous'] + ['discrete']*2
data_des['Unique Values'] = df[df.columns].nunique().values
data_des

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23710 entries, 0 to 23709
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           23710 non-null  int64 
 1   experience_level    23710 non-null  object
 2   employment_type     23710 non-null  object
 3   job_title           23710 non-null  object
 4   salary              23710 non-null  int64 
 5   salary_currency     23710 non-null  object
 6   salary_in_usd       23710 non-null  int64 
 7   employee_residence  23710 non-null  object
 8   remote_ratio        23710 non-null  int64 
 9   company_location    23710 non-null  object
 10  company_size        23710 non-null  object
dtypes: int64(4), object(7)
memory usage: 2.0+ MB
None
          work_year        salary  salary_in_usd  remote_ratio
count  23710.000000  2.371000e+04   23710.000000  23710.000000
mean    2023.460565  1.618500e+05  151918.919823     27.954450
std        0.693803  2.

| | Features | Description | Scales | Discrete\Continuous | Unique Values |
|---|---|---|---|---|---|
| 0 | work_year | Work Year | interval | continuous | 5 |
| 1 | experience_level | Position Experience Level | ordinal | discrete | 4 |
| 2 | employment_type | Employment Type | nominal | discrete | 4 |
| 3 | job_title | Job Title | nominal | discrete | 169 |
| 4 | salary | Position Salary | ratio | continuous | 3400 |
| 5 | salary_currency | Currency of Salary | nominal | discrete | 24 |
| 6 | salary_in_usd | Position Salary (USD) | ratio | continuous | 3804 |
| 7 | employee_residence | Employee Residence | nominal | discrete | 89 |
| 8 | remote_ratio | Percentage of Job Responsibilities Completed R... | ratio | continuous | 3 |
| 9 | company_location | Company Location | nominal | discrete | 78 |


### 1.1 Class Variable Definition and Dataset Preparation for Classification/Regression

Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created). You have the option of using tf.dataset for processing, but it is not required. 

#### Class Variables
This dataset has 8 features that act as class (categorical) variables: the seven string-typed columns plus 'remote_ratio'. Four of these features have more than 20 unique classes ('job_title', 'employee_residence', 'company_location', and 'salary_currency'); their dimensionality will be reduced through embedding later in this notebook. 'remote_ratio' is a numeric feature with three values in the dataset (0, 50, and 100) indicating the percentage of remote work. The data source treats these values as labels (0: Non-Remote, 50: Hybrid, 100: Fully Remote), so the feature will be converted to a categorical feature in the code below. A description of all the datatypes is in the data description table above. In total, 371 unique classes exist within the string-typed categorical features. These will need to be processed for dimensionality reduction and embedding into the model. Additionally, two of the categorical features are ordinal, as their names suggest: 'company_**size**' and 'experience_**level**'. They also have low cardinality. These will be mapped to integer codes, and they will be crossed with other categorical columns later in this notebook.
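
As a quick check of the class counts, the arithmetic can be sketched in plain Python. The counts are copied from the data description table; the value 3 for `company_size` is an assumption taken from the S/M/L size mapping used later, since that row is cut off in the table.

```python
# Unique-value counts copied from the data description table.
# company_size = 3 is an assumption based on the S/M/L size mapping.
cardinality = {
    "experience_level": 4,
    "employment_type": 4,
    "job_title": 169,
    "salary_currency": 24,
    "employee_residence": 89,
    "company_location": 78,
    "company_size": 3,
}

# Total number of classes across the string-typed categorical columns
total_classes = sum(cardinality.values())

# Features with more than 20 classes are the candidates for embedding
high_cardinality = sorted(name for name, n in cardinality.items() if n > 20)

print(total_classes)
print(high_cardinality)
```

Under that assumption the total matches the 371 classes quoted above, and four features exceed the 20-class threshold.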

#### Classification splitting

In [9]:
#==================================================================
from sklearn.model_selection import train_test_split
from copy import deepcopy
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline



In [10]:
# Encode ordinal features
exp_mapping = {'EN': 0, 'MI': 1, 'SE': 2, 'EX': 3}
size_mapping = {'S': 0, 'M': 1, 'L': 2}

# Apply the mappings to the entire dataset
df['experience_level'] = df['experience_level'].map(exp_mapping)
df['company_size'] = df['company_size'].map(size_mapping)

# Ensure the numeric headers are float
# define variables that should be scaled or made discrete
numeric_headers = ['work_year', 'remote_ratio','salary', 'salary_in_usd']
df[numeric_headers] = df[numeric_headers].to_numpy().astype(float)


categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
numerical_cols.remove('salary')
numerical_cols.remove('salary_in_usd')
print(numerical_cols)
# Preprocessing for numerical data
numerical_transformer = StandardScaler()

# Preprocessing for categorical data
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

label_transformer = LabelEncoder()
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])


# FILTER LESS FREQUENT JOBS (FOR TARGET VARIABLE ENGINEERING)
# Count occurrences of each job title (this helps with stratified splitting)
job_title_counts = df['job_title'].value_counts()
# Keep only job titles with at least 2 occurrences, so that a median can actually be calculated to predict on
# This also prevents freebies to the model, because a title with a single observation has that observation as its median.
valid_job_titles = job_title_counts[job_title_counts >= 2].index
df = df[df['job_title'].isin(valid_job_titles)].copy()


# Initial split to separate a test set
X = df.drop(columns=['salary', 'salary_in_usd'])
y = df['salary_in_usd']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Combine the full training data for further splits
training = deepcopy(pd.concat([X_train, y_train], axis=1))
testing = deepcopy(pd.concat([X_test, y_test], axis=1))

# Drop any missing values
training.dropna(inplace=True)
training.reset_index(inplace=True, drop=True)
testing.dropna(inplace=True)
testing.reset_index(inplace=True, drop=True)

# Step 2: Calculate the median salary by job title in the training set
median_salary_by_job_title = training.groupby('job_title')['salary_in_usd'].median()

# Step 3: Create binary target variable based on whether salary is above the training median for the job title
training['salary_binary'] = training.apply(
    lambda row: 1 if row['salary_in_usd'] > median_salary_by_job_title[row['job_title']] 
    else 0, axis=1
)

# Create binary target variable for the test set
testing['salary_binary'] = testing.apply(
    lambda row: 1 if row['job_title'] in median_salary_by_job_title
    and row['salary_in_usd'] > median_salary_by_job_title[row['job_title']]
    else 0, axis=1
)


# Separate features and target for further splitting
X_train = training.drop(columns=[ 'salary_in_usd', 'salary_binary'])
y_train = training['salary_binary']
X_test = testing.drop(columns=[ 'salary_in_usd', 'salary_binary'])
y_test= testing['salary_binary']

#Scale testing data
# Preprocessing pipeline
preprocessing_pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Fit and transform the data
X_train = preprocessing_pipeline.fit_transform(X_train)
X_test = preprocessing_pipeline.transform(X_test)


['work_year', 'experience_level', 'remote_ratio', 'company_size']
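
The target construction in the cell above can be illustrated with a small, self-contained sketch. It uses only the standard library and made-up salaries; the `label` function is a hypothetical stand-in for the two `apply` calls, not the notebook's own code. The point it shows is that the medians come from the training rows only, and that a title missing from training falls back to class 0.

```python
from collections import defaultdict
from statistics import median

# Toy rows of (job_title, salary_in_usd); the values are made up.
train_rows = [
    ("Data Scientist", 100_000),
    ("Data Scientist", 140_000),
    ("Data Scientist", 180_000),
    ("Data Engineer", 90_000),
    ("Data Engineer", 130_000),
]
test_rows = [
    ("Data Scientist", 150_000),
    ("Data Engineer", 100_000),
    ("ML Researcher", 250_000),  # title never seen in training
]

# Group the training salaries by title and take the median of each group.
salaries_by_title = defaultdict(list)
for title, salary in train_rows:
    salaries_by_title[title].append(salary)
train_medians = {t: median(s) for t, s in salaries_by_title.items()}


def label(title, salary, medians):
    """Return 1 if the salary is above the training median for its title, else 0."""
    return int(title in medians and salary > medians[title])


# Both splits are labeled against the training medians only.
train_labels = [label(t, s, train_medians) for t, s in train_rows]
test_labels = [label(t, s, train_medians) for t, s in test_rows]
print(train_labels)
print(test_labels)
```

Because the comparison is strict, a salary exactly at the median is labeled 0, which is also how the notebook's lambda behaves.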


#### Pre-Processing with TensorFlow

### 1.2 Crossed Features
#### 1 Employee Residence x Employment Type
- How is the salary in USD influenced by where the employee lives and the type of employment? This is probably a strong indicator of pay based on area. If the pay is not in USD, it might be more of a contract role.
- Total = 9 x 4 = 36
#### 2 Company Location x Company Size
- Companies in the US will probably be larger, as it is the leading big tech country.
- Total = 9 x 3 = 27
#### 3 Job Title x Experience Level
- Specifics about each position (such as "Data Science Lead") suggest the experience/authority level.
- Total = 13 x 4 = 52
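
To make the idea of a crossed feature with a fixed number of bins concrete, here is a minimal sketch. The function `hashed_cross` is hypothetical, and `zlib.crc32` is only a stand-in hash chosen because it is deterministic; Keras uses its own hashing internally, so the bin numbers below will not match what `FeatureSpace.cross` produces.

```python
import zlib


def hashed_cross(values, num_bins):
    """Hash a combination of category values into one of num_bins integer bins."""
    key = "_X_".join(str(v) for v in values).encode("utf-8")
    return zlib.crc32(key) % num_bins


# Cross a job title with an integer-coded experience level into 48 bins.
num_bins = 48
senior_ds = hashed_cross(("Data Scientist", 2), num_bins)
entry_ds = hashed_cross(("Data Scientist", 0), num_bins)
print(senior_ds, entry_ds)
```

The number of bins plays the role of `crossing_dim`: with fewer bins than distinct combinations, some combinations share a bin, which trades a little precision for a much smaller input.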

In [11]:
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score
import warnings
warnings.filterwarnings('ignore')

# Identify groups of features for cross-product
cross_product_features = [('job_title', 'location')]
print(f"Selected cross-product features: {cross_product_features}")
# Identify groups of features for cross-product
cross_product_features = ['job_title', 'location']
print(f"Selected cross-product features: {cross_product_features}")

# Metrics for evaluation
precision_scorer = make_scorer(precision_score)
recall_scorer = make_scorer(recall_score)
f1_scorer = make_scorer(f1_score)

# Choose the cross-validation method
cv = StratifiedKFold(n_splits=10)

# Create and evaluate the model using cross-validation
model = LogisticRegression()
precision_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring=precision_scorer)
recall_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring=recall_scorer)
f1_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring=f1_scorer)

print(f"Mean Precision: {precision_scores.mean()}")
print(f"Mean Recall: {recall_scores.mean()}")
print(f"Mean F1 Score: {f1_scores.mean()}")
# Use StratifiedShuffleSplit to ensure balanced class distribution
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

Selected cross-product features: [('job_title', 'location')]
Selected cross-product features: ['job_title', 'location']
Mean Precision: 0.6099600889264719
Mean Recall: 0.7282836328108407
Mean F1 Score: 0.663717982438963
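
The three values printed by the cell above come from the precision, recall, and F1 scorers, in that order. The sketch below shows how those metrics relate to each other, using made-up confusion counts and a hypothetical helper `precision_recall_f1`. It also combines the reported mean precision and recall into a harmonic mean as a sanity check.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts, for illustration only
p, r, f1 = precision_recall_f1(tp=70, fp=45, fn=26)
print(p, r, f1)

# Harmonic mean of the cross-validated means reported above
reported_precision = 0.6099600889264719
reported_recall = 0.7282836328108407
harmonic = (2 * reported_precision * reported_recall
            / (reported_precision + reported_recall))
print(harmonic)
```

The harmonic mean lands close to, but not exactly on, the third reported score, because averaging F1 over folds is not the same as taking the harmonic mean of the averaged precision and recall.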


In [12]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.utils import FeatureSpace

# Example One: Just lump everything together, and concatenate
feature_space = FeatureSpace(
    features={
        # Categorical feature encoded as string
        # "experience_level": FeatureSpace.string_categorical(num_oov_indices=0),
        "employment_type": FeatureSpace.string_categorical(num_oov_indices=0,),
        "job_title": FeatureSpace.string_categorical(num_oov_indices=1),
        "salary_currency": FeatureSpace.string_categorical(num_oov_indices=1),
        "employee_residence": FeatureSpace.string_categorical(num_oov_indices=1),
        "company_location": FeatureSpace.string_categorical(num_oov_indices=1),

        # Categorical feature encoded as integers
        "remote_ratio": FeatureSpace.integer_categorical(num_oov_indices=0),
        "company_size": FeatureSpace.integer_categorical(num_oov_indices=0),
        "experience_level": FeatureSpace.integer_categorical(num_oov_indices=0),

        # Numerical features to normalize (normalization will be learned)
        # learns the mean, variance, and if to invert (3 parameters)
        # "salary_in_usd": FeatureSpace.float_normalized(),

        "work_year": FeatureSpace.float_normalized()
            },
    output_mode="concat", # can also be a dict, processed internally
)

# now that we have specified the preprocessing, let's run it on the data
ds_train = create_dataset_from_dataframe(training)
ds_test = create_dataset_from_dataframe(testing)
# # create a version of the dataset that can be iterated without labels
# train_ds_with_no_labels = ds_train.map(lambda x, _: x)
# feature_space.adapt(train_ds_with_no_labels) # initialize the feature map to this data

# # the adapt function allows the model to learn one-hot encoding sizes
# # now define a preprocessing operation that returns the processed features
# preproc_ds_train = ds_train.map(lambda x, y: (feature_space(x), y),
#                                      num_parallel_calls=tf.data.AUTOTUNE)
# # run it so that we can use the pre-processed data
# preproc_ds_train = preproc_ds_train.prefetch(tf.data.AUTOTUNE)

# # do the same for the test set
# preproc_ds_test = ds_test.map(lambda x, y: (feature_space(x), y), num_parallel_calls=tf.data.AUTOTUNE)
# preproc_ds_test = preproc_ds_test.prefetch(tf.data.AUTOTUNE)
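
The helper `create_dataset_from_dataframe` is called above but is not defined in this notebook. The sketch below shows one way such a helper could look; the name `dataframe_to_labeled_dataset`, its parameters, and the toy data are all assumptions for illustration, not the authors' implementation. It pops the label column and wraps the remaining columns as a dictionary of tensors, which is the input format `FeatureSpace` works with.

```python
import pandas as pd
import tensorflow as tf


def dataframe_to_labeled_dataset(frame, label_col="salary_binary", batch_size=64):
    """Hypothetical helper: build a batched dataset of (feature dict, label)."""
    features = frame.copy()
    labels = features.pop(label_col)
    ds = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    return ds.batch(batch_size)


# Toy frame with made-up rows
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Engineer", "Data Analyst"],
    "work_year": [2023.0, 2024.0, 2023.0],
    "salary_binary": [1, 0, 0],
})

ds_toy = dataframe_to_labeled_dataset(toy, batch_size=2)

# Label-free view, the form that FeatureSpace.adapt() accepts
features_only = ds_toy.map(lambda x, _: x)

first_features, first_labels = next(iter(ds_toy))
print(sorted(first_features.keys()), first_labels.numpy())
```

With a helper like this, any column that leaks the target (here `salary_in_usd`) would need to be dropped from the frame before the dataset is built.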

In [17]:
# adapt() only accepts a tf.data.Dataset of raw feature dicts, so pass the
# label-free view of ds_train rather than the scikit-learn sparse matrix X_train
feature_space.adapt(ds_train.map(lambda x, _: x))

## 2. Modeling: Define the Wide and Deep Network Models

We will use Keras to develop three alternative wide and deep networks to classify salary data. Each network will have a different design so that we can investigate how different combinations of wide and deep components affect the model's performance.

### Model 1: Basic Wide and Deep Network

This model has a wide component (a basic linear layer) and a deep component (three hidden layers).
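
Before the Keras code, the data flow of a wide and deep network can be sketched with NumPy alone. This is an illustration with random weights and made-up input sizes, not the trained model; the hidden sizes 50, 25, and 10 are borrowed from `create_model1` later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(z):
    return np.maximum(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Made-up sizes: 4 rows, 6 deep-branch inputs, 3 wide-branch inputs
batch, n_deep_in, n_wide_in = 4, 6, 3
deep_in = rng.normal(size=(batch, n_deep_in))
wide_in = rng.normal(size=(batch, n_wide_in))

# Deep branch: three hidden layers with ReLU activations
sizes = [n_deep_in, 50, 25, 10]
h = deep_in
for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
    W = rng.normal(scale=0.1, size=(fan_in, fan_out))
    b = np.zeros(fan_out)
    h = relu(h @ W + b)

# Merge: concatenate the deep output with the wide inputs,
# then apply a single sigmoid unit for binary classification
combined = np.concatenate([h, wide_in], axis=1)
W_out = rng.normal(scale=0.1, size=(combined.shape[1], 1))
probs = sigmoid(combined @ W_out)
print(combined.shape, probs.shape)
```

The wide inputs skip the hidden layers entirely and reach the output unit through one linear weight each, which is what lets that branch memorize specific feature combinations.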

In [18]:
#!pip install pydot
#!pip install graphviz
# adapt() requires a tf.data.Dataset, so use the label-free view of ds_train
# instead of the scikit-learn sparse matrix X_train
feature_space.adapt(ds_train.map(lambda x, _: x))
# these are the placeholder inputs in the computation graph BEFORE
# applying any transformations
dict_inputs = feature_space.get_inputs()  # getting inputs is WAY easier now

# these are the encoded features after they have been processed
# We can use these as additional inputs into the computation graph
encoded_features = feature_space.get_encoded_features() # these features have been encoded


# using feature space above, this will result in 131 concatenated features
# this is calculated based on the one-hot encodings for each category
# now let's create some layers with Keras
x = keras.layers.Dense(128, activation="relu")(encoded_features)
x = keras.layers.Dense(32, activation="relu")(x)
predictions = keras.layers.Dense(1, activation="sigmoid")(x)

# we can now create two input/outputs to the computation graph

# this expects features already transformed
training_model = keras.Model(inputs=encoded_features,
                             outputs=predictions)
training_model.compile(
    optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]
)

# this expects features that are not transformed
inference_model = keras.Model(inputs=dict_inputs,
                              outputs=predictions)
inference_model.compile(loss="binary_crossentropy", metrics=["accuracy"])

inference_model.summary()

# plot_model(
#     training_model, to_file='model.png', show_shapes=True, show_layer_names=True,
#     rankdir='LR', expand_nested=False, dpi=96
# )

In [None]:
from tensorflow.keras.utils import FeatureSpace

# Crossing columns together 
feature_space1 = FeatureSpace(
    features={
        # Categorical feature encoded as string
        # "experience_level": FeatureSpace.string_categorical(num_oov_indices=0),
        "employment_type": FeatureSpace.string_categorical(num_oov_indices=0,),
        "job_title": FeatureSpace.string_categorical(num_oov_indices=0),
        "salary_currency": FeatureSpace.string_categorical(num_oov_indices=0),
        "employee_residence": FeatureSpace.string_categorical(num_oov_indices=0),
        "company_location": FeatureSpace.string_categorical(num_oov_indices=0),

        # Categorical feature encoded as integers
        "remote_ratio": FeatureSpace.integer_categorical(num_oov_indices=0),
        "company_size": FeatureSpace.integer_categorical(num_oov_indices=0),
        "experience_level": FeatureSpace.integer_categorical(num_oov_indices=0),
        
        
        # Numerical features to normalize (normalization will be learned)
        # learns the mean, variance, and if to invert (3 parameters)
        # "salary_in_usd": FeatureSpace.float_normalized(),
        "work_year": FeatureSpace.float_normalized(),
        
            },
    # Specify feature cross with a custom crossing dim
    crosses=[
        FeatureSpace.cross(
            feature_names=('employee_residence','employment_type'), # 9 x 4 = 36 hash bins
            crossing_dim=9*4),
        FeatureSpace.cross(
            feature_names=('company_location','company_size'), # 8 x 3 = 24
            crossing_dim=8*3),
        FeatureSpace.cross(
            feature_names=('job_title','experience_level'), # 12 x 4 = 48
            crossing_dim=12*4),
    ],
    output_mode="concat",
)

#### Dimensionality Reduction (Square Root Heuristic)

In TensorFlow, embeddings convert categorical data into continuous vector spaces in which similar categories can end up close together. The Keras `Embedding` layer is a trainable layer that learns a fixed-size continuous vector for each category, which is especially useful for high-cardinality categorical features. By embedding the high-cardinality features, we reduce the dimensionality of the dataset. This is convenient in Keras, as the only real requirement is that each category is encoded as an integer index (the position that would be "hot" in a one-hot encoding).
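
The relationship between an embedding and a one-hot encoding can be sketched with NumPy. The weights below are random rather than learned, and the 24-category size is borrowed from `salary_currency` purely as an example. The sketch shows that looking up a row of the embedding matrix gives the same result as multiplying the one-hot vector by that matrix.

```python
import numpy as np

n_categories = 24                        # e.g. the number of salary currencies
embed_dim = int(np.sqrt(n_categories))   # square root heuristic, truncated

rng = np.random.default_rng(0)
# Embedding matrix: one row of embed_dim values per category (random here)
W = rng.normal(size=(n_categories, embed_dim))

# Integer category indices for three example rows
idx = np.array([0, 5, 23])

# An embedding layer performs a row lookup ...
looked_up = W[idx]

# ... which equals a one-hot matrix multiplied by the same weights
one_hot = np.eye(n_categories)[idx]
via_matmul = one_hot @ W

print(looked_up.shape, np.allclose(looked_up, via_matmul))
```

The lookup avoids materializing the one-hot matrix at all, which is why integer indices are enough as input.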

In [7]:
from tensorflow.keras.layers import Embedding, Flatten

def setup_embedding_from_categorical(feature_space, col_name):
    # what is the maximum integer value for this variable?
    # which is the same as the number of categories
    N = len(feature_space.preprocessors[col_name].get_vocabulary())
    
    # get the output from the feature space, which is input to embedding
    x = feature_space.preprocessors[col_name].output
    
    # now use an embedding to deal with integers from feature space
    x = Embedding(input_dim=N, 
                  output_dim=int(np.sqrt(N)), 
                  input_length=1, name=col_name+'_embed')(x)
    
    x = Flatten()(x) # get rid of that pesky extra dimension (for time of embedding)
    
    return x # return the tensor here 

# # add explanation of this pre-processing here
# train_ds_with_no_labels = ds_train.map(lambda x, _: x)
# feature_space.adapt(train_ds_with_no_labels)

def setup_embedding_from_crossing(feature_space, col_name):
    # what is the maximum integer value for this variable?
    
    # get the size of the feature
    N = feature_space.crossers[col_name].num_bins
    x = feature_space.crossers[col_name].output
    
    
    # now use an embedding to deal with integers as if they were one hot encoded
    x = Embedding(input_dim=N, 
                  output_dim=int(np.sqrt(N)), 
                  input_length=1, name=col_name+'_embed')(x)
    
    x = Flatten()(x) # get rid of that pesky extra dimension (for time of embedding)
    
    return x

From the raw data, the model has almost 25,000 parameters. That is a lot of parameters and could lead to longer computation times. To reduce dimensionality, each high-cardinality feature is passed through an embedding layer in the TensorFlow model. The embedding weights are learned by gradient descent along with the rest of the network, and each layer returns an output of a specified reduced size ($\sqrt{n}$, truncated to an integer) that is intended to retain most of the information in the original categories. These approximations are built from the feature space in the code below.

##### *'salary_currency'*
- **Number of dimensions after reduction = $\lfloor\sqrt{24}\rfloor = 4$**
##### *'company_location'*
- **Number of dimensions after reduction = $\lfloor\sqrt{78}\rfloor = 8$**
##### *'employee_residence'*
- **Number of dimensions after reduction = $\lfloor\sqrt{89}\rfloor = 9$**
##### *'job_title'*
- **Number of dimensions after reduction = $\sqrt{169} = 13$**
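
The square root heuristic can be checked with a few lines of standard-library Python. The cardinalities are the unique-value counts from the data description table, and `int()` truncates in the same way as the `int(np.sqrt(N))` call in the embedding helpers. Counts after the rare job titles are filtered out may be smaller, so treat this as a sketch of the arithmetic rather than the exact layer sizes.

```python
import math

# Unique-value counts from the data description table
cardinality = {
    "salary_currency": 24,
    "company_location": 78,
    "employee_residence": 89,
    "job_title": 169,
}

# Square root heuristic, truncated to an integer
embed_dims = {name: int(math.sqrt(n)) for name, n in cardinality.items()}

one_hot_width = sum(cardinality.values())
embedded_width = sum(embed_dims.values())

print(embed_dims)
print(one_hot_width, embedded_width)
```

Under these counts the four features shrink from 360 one-hot columns to 34 embedding dimensions.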

### 2.1 Running the Data

In [573]:
# import the loss function we plan to use
from tensorflow.keras.losses import binary_crossentropy
# import a built-in optimizer
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Input, Dense, Concatenate
from tensorflow import keras


def create_model1(feature_space):

    # need to use unprocessed features here, to gain access to each output
    dict_inputs = feature_space.get_inputs()

    # we need to create separate lists for each branch
    crossed_outputs = []

    # for each crossed variable, make an embedding
    for col in feature_space.crossers.keys():
        x = setup_embedding_from_crossing(feature_space, col)

        # save these outputs in list to concatenate later
        crossed_outputs.append(x)

    # now concatenate the crossed embeddings to form the wide branch
    wide_branch = Concatenate(name='wide_concat')(crossed_outputs)

    # reset this input branch
    all_deep_branch_outputs = []

    # 'work_year' is the only numeric feature registered in the feature space,
    # so it is the only one added directly, without an embedding
    for col in ['work_year']:
        x = feature_space.preprocessors[col].output
        all_deep_branch_outputs.append(x)

    # string-typed categorical features registered in the feature space
    categorical_headers = ['employment_type', 'job_title', 'salary_currency',
                           'employee_residence', 'company_location']

    # for each categorical variable
    for col in categorical_headers:

        # get the output tensor from the embedding layer
        x = setup_embedding_from_categorical(feature_space, col)
        # save these outputs in list to concatenate later
        all_deep_branch_outputs.append(x)

    # define the deep branch with numeric and categorical features
    deep_branch = Concatenate(name='embed_concat')(all_deep_branch_outputs)
    deep_branch = Dense(units=50,activation='relu', name='deep1')(deep_branch)
    deep_branch = Dense(units=25,activation='relu', name='deep2')(deep_branch)
    deep_branch = Dense(units=10,activation='relu', name='deep3')(deep_branch)

    # merge the deep and wide branch
    final_branch = Concatenate(name='concat_deep_wide')([deep_branch, wide_branch])
    final_branch = Dense(units=1,activation='sigmoid',
                         name='combined')(final_branch)

    # build and compile a single model on the raw (unprocessed) inputs
    model = keras.Model(inputs=dict_inputs, outputs=final_branch)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

In [574]:
# from tensorflow.keras.layers import Input
# from tensorflow.keras.models import Model

# def create_model1(feature_space):
#     dict_inputs = feature_space.get_inputs()  # Get the inputs correctly
#     encoded_features = feature_space.get_encoded_features()
    
#     # Define the model architecture
#     x = layers.Dense(128, activation='relu')(encoded_features)
#     x =layers.Dense(128, activation='relu')(x)
#     output = layers.Dense(1, activation='sigmoid')(x)
    
#     model = Model(inputs=dict_inputs, outputs=output)
#     model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    
#     return model

In [19]:
results = []
for train_index, val_index in sss.split(X_train, y_train):
    X_train_split, X_val_split = X_train[train_index], X_train[val_index]
    y_train_split, y_val_split = y_train[train_index], y_train[val_index]

    # Encode and scale the splits
    # (encode_and_scale is a helper that is not defined in this notebook)
    X_train_split, X_val_split = encode_and_scale(X_train_split, X_val_split)

    # Combine features and target for processing
    train_data = pd.concat([X_train_split, y_train_split], axis=1)
    val_data = pd.concat([X_val_split, y_val_split], axis=1)
    
    # Create TensorFlow datasets
    ds_train = create_dataset_from_dataframe(train_data,'salary_binary')
    ds_val = create_dataset_from_dataframe(val_data,'salary_binary')
    
    # Adapt the feature space to the training data
    feature_space1.adapt(ds_train.map(lambda x, _: x))
    
    # Train on the raw feature dicts: create_model1 builds the model on
    # feature_space1.get_inputs(), so the preprocessing runs inside the model
    model1 = create_model1(feature_space1)
    history = model1.fit(ds_train, validation_data=ds_val, epochs=10)
    results.append(history)

# Evaluate the model on the test set (only once, not in the loop)
test_data = pd.concat([X_test, y_test], axis=1)
ds_test = create_dataset_from_dataframe(test_data)
test_loss, test_accuracy = model1.evaluate(ds_test)
print(f"Test Accuracy: {test_accuracy:.4f}")

In [447]:
def create_model1(feature_space):
    dict_inputs = feature_space.get_inputs()
    encoded_features = feature_space.get_encoded_features()

    crossed_outputs = []
    for col in feature_space.crossers.keys():
        x = setup_embedding_from_crossing(feature_space, col)
        crossed_outputs.append(x)

    wide_branch = Concatenate(name='wide_concat')(crossed_outputs)

    all_deep_branch_outputs = []
    numeric_headers = ['work_year']  # only numeric feature registered in the feature space; 'salary_in_usd' defines the label and is not an input
    categorical_headers = ['employment_type', 'job_title', 'salary_currency', 'employee_residence', 'company_location']  # Example categorical headers, adjust as necessary

    for idx, col in enumerate(numeric_headers):
        x = feature_space.preprocessors[col].output
        all_deep_branch_outputs.append(x)

    for col in categorical_headers:
        x = setup_embedding_from_categorical(feature_space, col)
        all_deep_branch_outputs.append(x)

    deep_branch = Concatenate(name='embed_concat')(all_deep_branch_outputs)
    deep_branch = Dense(units=50, activation='relu', name='deep1')(deep_branch)
    deep_branch = Dense(units=25, activation='relu', name='deep2')(deep_branch)
    deep_branch = Dense(units=10, activation='relu', name='deep3')(deep_branch)

    final_branch = Concatenate(name='concat_deep_wide')([deep_branch, wide_branch])
    final_branch = Dense(units=1, activation='sigmoid', name='combined')(final_branch)

    model = keras.Model(inputs=dict_inputs, outputs=final_branch)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    return model

In [448]:
results = []
# X_train and y_train are expected to be pandas objects here (the unscaled
# training features, not the sparse matrix returned by the sklearn pipeline)
for train_index, val_index in sss.split(X_train, y_train):
    X_train_split, X_val_split = X_train.iloc[train_index], X_train.iloc[val_index]
    y_train_split, y_val_split = y_train.iloc[train_index], y_train.iloc[val_index]

    # Encode and scale the splits
    # (encode_and_scale is a helper that is not defined in this notebook)
    X_train_split, X_val_split = encode_and_scale(X_train_split, X_val_split)

    # Combine features and target for processing
    train_data = pd.concat([X_train_split, y_train_split], axis=1)
    val_data = pd.concat([X_val_split, y_val_split], axis=1)
    
    # Create TensorFlow datasets
    ds_train = create_dataset_from_dataframe(train_data, 'salary_binary')
    ds_val = create_dataset_from_dataframe(val_data, 'salary_binary')
    
    # Adapt the feature space to the training data
    feature_space1.adapt(ds_train.map(lambda x, _: x))
    
    # Train on the raw feature dicts: create_model1 builds the model on
    # feature_space1.get_inputs(), so the preprocessing runs inside the model
    model1 = create_model1(feature_space1)
    history = model1.fit(ds_train, validation_data=ds_val, epochs=10)
    results.append(history)

# Evaluate the model on the test set (only once, not in the loop)
test_data = pd.concat([X_test, y_test], axis=1)
ds_test = create_dataset_from_dataframe(test_data)
test_loss, test_accuracy = model1.evaluate(ds_test)
print(f"Test Accuracy: {test_accuracy:.4f}")

In [None]:
#!pip install pydot
#!pip install graphviz
# these are the placeholder inputs in the computation graph BEFORE 
# applying any transformations
dict_inputs = feature_space.get_inputs()  #getting inputs is WAY easier now

# these are the encoded features after they have been processed
# We can use these as additional inputs into the computation graph
encoded_features = feature_space.get_encoded_features() # these features have been encoded

# using feature space above, this will result in 131 concatenated features
# this is calculated based on the one-hot encodings for each category

# now let's create some layers with Keras
x = keras.layers.Dense(64, activation="relu")(encoded_features)
x = keras.layers.Dense(32, activation="relu")(x)
predictions = keras.layers.Dense(1, activation="sigmoid")(x)

# we can now create two input/outputs to the computation graph

# this expects features already transformed
training_model = keras.Model(inputs=encoded_features, 
                             outputs=predictions)
training_model.compile(
    optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]
)

# this expects features that are not transformed 
inference_model = keras.Model(inputs=dict_inputs, 
                              outputs=predictions)
inference_model.compile(loss="binary_crossentropy", metrics=["accuracy"])

inference_model.summary()

# plot_model(
#     training_model, to_file='model.png', show_shapes=True, show_layer_names=True,
#     rankdir='LR', expand_nested=False, dpi=96
# )

### 1.2 Crossed Features
#### 1 Employee Residence x Remote Ratio
- How is the salary in USD influenced by the way the employee is paid? This is probably a strong indicator of pay based on area. If the pay is not in USD, the position might be more of a contract role.
- Total = 9 * 4 = 36
#### 2 Company Location x Company Size
- Companies in the US will probably be larger, since the US is the leading country for big tech.
- Total = 9 * 3 = 27
#### 3 Job Title x Experience Level
- Specifics about each position (such as "Data Science Lead") suggest the experience/authority level
- Total = 13 * 4 = 52
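
Each total above is the product of the two features' cardinalities, which is the number of distinct value pairs the cross can take. A minimal sketch of that arithmetic, assuming the cardinalities implied by the totals listed above (the `cardinalities` dictionary and the `cross_dim` helper are illustrative names, not part of the lab code):

```python
from math import prod

# Assumed cardinalities, copied from the totals listed above (illustrative only).
cardinalities = {
    "employee_residence": 9,
    "remote_ratio": 4,
    "company_location": 9,
    "company_size": 3,
    "job_title": 13,
    "experience_level": 4,
}

def cross_dim(*features):
    """Number of distinct value combinations for a cross of the given features."""
    return prod(cardinalities[name] for name in features)

crosses = [
    ("employee_residence", "remote_ratio"),
    ("company_location", "company_size"),
    ("job_title", "experience_level"),
]

# Map each crossed pair to the number of combinations it can produce.
cross_dims = {pair: cross_dim(*pair) for pair in crosses}
for pair, dim in cross_dims.items():
    print(" x ".join(pair), "=", dim)
```

On the real data, the cardinalities could instead be read from `df[col].nunique()`. Choosing a `crossing_dim` smaller than this product would force different value pairs to share a bin.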

In [None]:
from tensorflow.keras.utils import FeatureSpace

# Crossing columns together 
feature_space = FeatureSpace(
    features={
        # Categorical feature encoded as string
        # "experience_level": FeatureSpace.string_categorical(num_oov_indices=0),
        "employment_type": FeatureSpace.string_categorical(num_oov_indices=0,),
        "job_title": FeatureSpace.string_categorical(num_oov_indices=0),
        "salary_currency": FeatureSpace.string_categorical(num_oov_indices=0),
        "employee_residence": FeatureSpace.string_categorical(num_oov_indices=0),
        "company_location": FeatureSpace.string_categorical(num_oov_indices=0),

        # Categorical feature encoded as integers
        "remote_ratio": FeatureSpace.integer_categorical(num_oov_indices=0),
        "company_size": FeatureSpace.integer_categorical(num_oov_indices=0),
        "experience_level": FeatureSpace.integer_categorical(num_oov_indices=0),
        
        
        # Numerical features to normalize (normalization will be learned)
        # learns the mean, variance, and if to invert (3 parameters)
        # "salary_in_usd": FeatureSpace.float_normalized(),
        
        "work_year": FeatureSpace.float_normalized(),
            },
    # Specify feature cross with a custom crossing dim
    crosses=[
        FeatureSpace.cross(
            feature_names=('employee_residence','employment_type'), # dims: 9 x 3 x 4 = 108 
            crossing_dim=9*3*4),
        FeatureSpace.cross(
            feature_names=('company_location','company_size'), # 8 x 3 = 24
            crossing_dim=8*3),
        FeatureSpace.cross(
            feature_names=('job_title','experience_level'), # 12 x 4 = 48
            crossing_dim=12*4),
    ],
    output_mode="concat",
)

# adapt the feature space on the training features only (labels dropped),
# so it learns the category vocabularies and normalization statistics
train_ds_with_no_labels = ds_train.map(lambda x, _: x)
feature_space.adapt(train_ds_with_no_labels)

def setup_embedding_from_crossing(feature_space, col_name):
    # what is the maximum integer value for this variable?
    
    # get the size of the feature
    N = feature_space.crossers[col_name].num_bins
    x = feature_space.crossers[col_name].output
    
    
    # now use an embedding to deal with integers as if they were one hot encoded
    x = Embedding(input_dim=N, 
                  output_dim=int(np.sqrt(N)), 
                  input_length=1, name=col_name+'_embed')(x)
    
    x = Flatten()(x) # get rid of that pesky extra dimension (for time of embedding)
    
    return x

Thanks to dimensionality reduction, the number of parameters drops roughly 7-fold. This reduction leaves us room to cross some of these features and build stronger connections in the model.
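
To make the size of this kind of reduction concrete, here is a rough sketch comparing the one-hot width of each crossed feature with the width of its embedding, using the same `int(sqrt(N))` rule as `setup_embedding_from_crossing`. The bin counts are the `crossing_dim` values from the feature space above; the dictionary keys are illustrative labels. It only compares layer widths and is not meant to reproduce the 7-fold figure, which depends on the full model.

```python
from math import isqrt

# crossing_dim values used in the FeatureSpace definition (keys are illustrative labels)
crossing_dims = {
    "employee_residence_x_employment_type": 9 * 3 * 4,
    "company_location_x_company_size": 8 * 3,
    "job_title_x_experience_level": 12 * 4,
}

def embedding_dim(num_bins):
    """Embedding width under the floor(sqrt(N)) rule."""
    return isqrt(num_bins)

embed_dims = {name: embedding_dim(n) for name, n in crossing_dims.items()}

one_hot_width = sum(crossing_dims.values())  # width if every cross were one-hot encoded
embed_width = sum(embed_dims.values())       # width after embedding each cross

print("one-hot width:", one_hot_width)
print("embedded width:", embed_width)
print("reduction factor:", one_hot_width / embed_width)
```

Any dense layer placed on top of these inputs needs proportionally fewer weights, since its input width shrinks by the same factor.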

In [None]:
dict_inputs = feature_space.get_inputs() # need to use unprocessed features here, to gain access to each output

# we need to create separate lists for each branch
crossed_outputs = []

# for each crossed variable, make an embedding
for col in feature_space.crossers.keys():
    
    x = setup_embedding_from_crossing(feature_space, col)
    
    # save these outputs in list to concatenate later
    crossed_outputs.append(x)
    

# now concatenate the embedded crossed outputs to form the wide branch
wide_branch = Concatenate(name='wide_concat')(crossed_outputs)

# reset this input branch
all_deep_branch_outputs = []

# for each numeric variable, just add its preprocessed output (no embedding needed)
for idx,col in enumerate(numeric_headers):
    x = feature_space.preprocessors[col].output
    # x = tf.cast(132,float) # cast an integer as a float here
    all_deep_branch_outputs.append(x)
    
# for each categorical variable
for col in categorical_headers:
    
    # get the output tensor from embedding layer
    x = setup_embedding_from_categorical(feature_space, col)
    
    # save these outputs in list to concatenate later
    all_deep_branch_outputs.append(x)


# merge the deep branches together
deep_branch = Concatenate(name='embed_concat')(all_deep_branch_outputs)
deep_branch = Dense(units=50,activation='relu', name='deep1')(deep_branch)
deep_branch = Dense(units=25,activation='relu', name='deep2')(deep_branch)
deep_branch = Dense(units=10,activation='relu', name='deep3')(deep_branch)
    
# merge the deep and wide branch
final_branch = Concatenate(name='concat_deep_wide')([deep_branch, wide_branch])
final_branch = Dense(units=1,activation='sigmoid',
                     name='combined')(final_branch)

training_model = keras.Model(inputs=dict_inputs, outputs=final_branch)
training_model.compile(
    optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]
)

training_model.summary()

plot_model(
    training_model, to_file='model.png', show_shapes=True, show_layer_names=True,
    rankdir='LR', expand_nested=False, dpi=96
)

### 1.3 Evaluation Metrics

##### Classification
What are the implications of misclassifications?

##### Regression

Stratified Shuffle Split:

Use it when the dataset is imbalanced and you want random shuffling while still preserving the class distribution in every split.
Because each split is representative of the full dataset, the validation scores are less likely to be skewed by an unlucky split, which makes overfitting easier to detect.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

# Initialize StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

# Perform the split
for train_index, test_index in sss.split(X, y):
    train_data = data.iloc[train_index]
    test_data = data.iloc[test_index]
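
One way to check that the splits really preserve the class distribution is to compare the positive rate in each train/test pair. The sketch below uses synthetic labels with an assumed 80/20 imbalance; `X_demo`, `y_demo`, and `sss_demo` are hypothetical names and the data is not the salary dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic, imbalanced labels: 80% class 0, 20% class 1 (illustrative only)
y_demo = np.array([0] * 800 + [1] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

sss_demo = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

# Record the fraction of positives in the train and test part of every split
positive_rates = []
for train_idx, test_idx in sss_demo.split(X_demo, y_demo):
    positive_rates.append((y_demo[train_idx].mean(), y_demo[test_idx].mean()))

for train_rate, test_rate in positive_rates:
    print(f"train positives: {train_rate:.3f}, test positives: {test_rate:.3f}")
```

If the rates in every split stay close to the overall positive rate, the stratification is working as intended.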