# Assignment 2A: Binary Classification with Logistic Regression

### Task/Problem Statement:
The goal of this part of the assignment is to predict whether a person earns over 50K annually based
on the UCI Adult Income dataset described below. The prediction is made with logistic regression,
implemented as a linear neural network (LNN) model using the TensorFlow Keras API.

### Dataset: UCI Adult Income ("Census Income") Dataset
The UCI Adult Income dataset (also known as the “Census Income” dataset), provided as adult.csv or
adult.xlsx, comprises 14 attributes, including categorical and numerical features. The target
“income” class is a binary variable (<=50K, >50K). The prediction task is to determine whether a
person makes over 50K a year.

##### Source: https://archive.ics.uci.edu/dataset/2/adult

##### Input variables:
- age
- workclass
- fnlwgt
- education
- education-num
- marital-status
- occupation
- relatioship
- race
- sex
- capital-gain
- capital-loss
- hours-per-week
- native-country

##### Output variable: 
- Income (<=50K, >50K)

In [1]:
from adult_preprocessing import AdultPreprocessing

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import tensorflow as tf

## Load Data & Display Dataset Information

In [2]:
df_raw = None

# the following try-except block tries to handle alternate file types for the data
try:
    df_raw = pd.read_csv("adult.csv")
except FileNotFoundError:
    try:
        df_raw = pd.read_excel("adult.xlsx", sheet_name="in")
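    except FileNotFoundError as err:
        # Neither file exists: raise instead of calling exit(), which would kill the notebook kernel
        raise FileNotFoundError("adult.xlsx or adult.csv not found") from err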

print(df_raw.head(5))

    age          workclass  fnlwgt   education  education-num  \
0   NaN          State-gov   77516   Bachelors             13   
1  50.0   Self-emp-not-inc   83311   Bachelors             13   
2  38.0            Private  215646     HS-grad              9   
3  53.0            Private  234721        11th              7   
4  28.0            Private  338409   Bachelors             13   

        marital-status          occupation     relatioship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital-gain  capital-loss  hours-per-week  native-country  income  
0          2174             0              40   United-States 

In [3]:
# Display a summary of the dataset information.
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             32560 non-null  float64
 1   workclass       32561 non-null  object 
 2   fnlwgt          32561 non-null  int64  
 3   education       32561 non-null  object 
 4   education-num   32561 non-null  int64  
 5   marital-status  32561 non-null  object 
 6   occupation      32561 non-null  object 
 7   relatioship     32561 non-null  object 
 8   race            32561 non-null  object 
 9   sex             32561 non-null  object 
 10  capital-gain    32561 non-null  int64  
 11  capital-loss    32561 non-null  int64  
 12  hours-per-week  32561 non-null  int64  
 13  native-country  32561 non-null  object 
 14  income          32561 non-null  object 
dtypes: float64(1), int64(5), object(9)
memory usage: 3.7+ MB


## Data Pre-processing


Check for missing values in the dataset (display using the print method) and handle them using
appropriate techniques. Finally, display whether missing values exist.

In [4]:
# Assuming that we may use the adult income dataset again, I just put the general preprocessing code
# from assignment 1 into a class in a separate file to keep this notebook clean.
ap = AdultPreprocessing(df_raw)
ap.fix_question_marks()
ap.impute_missing_values(verbose=True)

Column `age` has 1 missing values imputed with median
Column `workclass` has 1836 missing values imputed with mode
Column `occupation` has 1843 missing values imputed with mode
Column `native-country` has 583 missing values imputed with mode
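
The internals of `AdultPreprocessing` live in a separate file and are not shown in this notebook. As a rough sketch of what median/mode imputation could look like in plain pandas (the helper `impute_missing` and the toy frame are hypothetical, not the class's actual code):

```python
import numpy as np
import pandas as pd


def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric NaNs with the column median and all other NaNs with the column mode."""
    out = df.copy()
    for col in out.columns:
        if out[col].isna().sum() == 0:
            continue  # nothing to impute in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Toy data (not the real dataset): "?" stands in for a missing category.
toy = pd.DataFrame({
    "age": [np.nan, 50.0, 38.0, 53.0],
    "workclass": ["Private", "?", "Private", "State-gov"],
})
toy = toy.replace("?", np.nan)  # hypothetical analogue of fix_question_marks()
clean = impute_missing(toy)
print(clean.isna().sum().sum())  # total number of missing values left
```

The median is used for numeric columns because it is less sensitive to outliers than the mean; the mode is the natural choice for categorical columns.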


Encode categorical variables into numerical format.

In [5]:
ap.one_hot_encode()
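
`one_hot_encode()` is also defined outside this notebook. The sketch below shows one common way to one-hot encode without redundant columns, using `pd.get_dummies` with `drop_first=True`. The toy frame is made up, and the assumption that the class works this way is only inferred from the resulting column names (for example `income_>50K`).

```python
import pandas as pd

# Toy data (not the real dataset)
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
    "income": ["<=50K", ">50K", "<=50K"],
})

# drop_first=True drops one level per categorical column, so the remaining
# dummy columns are not linearly dependent on each other.
encoded = pd.get_dummies(
    toy, columns=["workclass", "income"], drop_first=True, dtype="int32"
)
print(list(encoded.columns))
```

With a binary target such as income, dropping the first level leaves a single 0/1 column that can be used directly as the label.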

Create a new DataFrame “df” that includes both numeric and encoded categorical columns without
redundancy (handled by AdultPreprocessing class).

Create a deep copy of this DataFrame “df_copy” for use in Experiment 2.

In [12]:
df = ap.get_df()
df_copy = df.copy()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 98 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   age                                        32561 non-null  float64
 1   fnlwgt                                     32561 non-null  int64  
 2   education-num                              32561 non-null  int64  
 3   capital-gain                               32561 non-null  int64  
 4   capital-loss                               32561 non-null  int64  
 5   hours-per-week                             32561 non-null  int64  
 6   workclass_Local-gov                        32561 non-null  int32  
 7   workclass_Never-worked                     32561 non-null  int32  
 8   workclass_Private                          32561 non-null  int32  
 9   workclass_Self-emp-inc                     32561 non-null  int32  
 10  workclass_Self-emp-not

## Data Analysis

Extensive data analysis is skipped because the same data was already explored in assignment 1.
Please see https://github.com/zmswanson/ecen878_knn_classifier for details and data analysis from
assignment 1.

## Create Data Matrix X and Target y

Create a “target” DataFrame containing the target variable and a “features” DataFrame containing all
feature columns.

In [13]:
target = df.pop('income_>50K')
features = df

print(f"target shape: {target.shape} and features shape: {features.shape}")

target shape: (32561,) and features shape: (32561, 97)


From the “features” and “target” DataFrame objects, create a NumPy ndarray for the feature matrix X
and a 1D array for the target y.

In [17]:
X = features.to_numpy()
y = target.to_numpy()

print(f"X shape: {X.shape} and y shape: {y.shape}")

X shape: (32561, 97) and y shape: (32561,)


## Partition the Dataset into Train & Test Subsets

Split the dataset into training and test subsets (20% for testing).

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Then, further split the training set into training and validation subsets (20% for validation).

Display the shape of each subset for both the feature matrix and target array.

In [19]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

print(f"X_train shape: {X_train.shape} and y_train shape: {y_train.shape}")
print(f"X_val shape: {X_val.shape} and y_val shape: {y_val.shape}")
print(f"X_test shape: {X_test.shape} and y_test shape: {y_test.shape}")


X_train shape: (20838, 97) and y_train shape: (20838,)
X_val shape: (5210, 97) and y_val shape: (5210,)
X_test shape: (6513, 97) and y_test shape: (6513,)
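
As a sanity check, the subset sizes above can be reproduced by hand. The sketch assumes that `train_test_split` rounds the held-out count up and gives the remainder to the training side, which matches the shapes printed above.

```python
import math

n_samples = 32561  # rows reported by df.info()
test_size = 0.2

# First split: full dataset -> train (remainder) and test (20%, rounded up)
n_test = math.ceil(test_size * n_samples)
n_train_full = n_samples - n_test

# Second split: remaining training data -> train and validation (20%, rounded up)
n_val = math.ceil(test_size * n_train_full)
n_train = n_train_full - n_val

print(n_train, n_val, n_test)
```

Because the validation split takes 20% of the remaining 80%, the overall proportions are roughly 64% train, 16% validation, and 20% test.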


## Standardize the Data

Standardize the three data subsets. Ensure that there is no data leakage. This is achieved by
fitting the standardizing model on the training set and then applying the same model to transform
the validation and test sets. This ensures that we aren't inadvertently "peeking" at the validation
or test sets.

In [20]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

## Model Construction

Create an LNN model for binary classification. Initially, the Dense layer should have the 
“kernel_regularizer” set to None. Later you will change this value as instructed below.

Display the model summary.

In [26]:
%%time
# The following model creation is borrowed and modified from 
# https://github.com/rhasanbd/Linear-Neural-Networks/blob/main/Linear%20Neural%20Network-1-Binary%20Classification-Linearly%20Separable.ipynb
tf.keras.backend.clear_session()

# Reseed the random number generator to get consistent results
np.random.seed(42)
tf.random.set_seed(42)

model = tf.keras.models.Sequential(name="ZMS_LNN_Binary_Classifier")
# Pass the input shape as a tuple; a bare integer is rejected by some Keras versions
model.add(tf.keras.layers.Input(shape=(X_train.shape[1],), name="Input_Layer"))
model.add(
    tf.keras.layers.Dense(units=1, kernel_initializer="zeros", activation="sigmoid",
        kernel_regularizer=None, name="Output_Layer", use_bias=True)
)

model.summary()

Model: "ZMS_LNN_Binary_Classifier"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 Output_Layer (Dense)        (None, 1)                 98        
                                                                 
Total params: 98
Trainable params: 98
Non-trainable params: 0
_________________________________________________________________
CPU times: total: 0 ns
Wall time: 26.5 ms
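
The 98 parameters in the summary are the 97 feature weights plus one bias. A single Dense unit with a sigmoid activation computes the same function as logistic regression. The NumPy sketch below mirrors that forward pass; `X_demo` and `y_demo` are synthetic stand-ins, not rows from the dataset.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


n_features = 97
w = np.zeros((n_features, 1))  # mirrors kernel_initializer="zeros"
b = np.zeros(1)                # Keras initializes the bias to zeros by default

n_params = w.size + b.size     # 97 weights + 1 bias

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(4, n_features))  # synthetic standardized-looking rows
y_demo = np.array([[0.0], [1.0], [0.0], [1.0]])

# Forward pass: p = sigmoid(X w + b)
p = sigmoid(X_demo @ w + b)

# Binary cross-entropy averaged over the batch
bce = -np.mean(y_demo * np.log(p) + (1 - y_demo) * np.log(1 - p))
print(n_params, p.ravel(), bce)
```

With all-zero weights every prediction is 0.5, so the loss before training is ln 2 (about 0.693). That value is a useful baseline when reading the first epoch of the learning curves.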


## Experiments
Conduct the following experiments. For each experiment:
- Display learning curves (accuracy vs. epochs, and loss vs. epochs). 
- Clearly annotate your code.
- Report performance: Train and test accuracy, test confusion matrix, and test classification report
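
For reference, the quantities in the performance report can be computed by hand as in the sketch below. In practice `sklearn.metrics.confusion_matrix` and `sklearn.metrics.classification_report` do the same job; the helper `binary_report`, the 0.5 threshold, and the toy labels are assumptions for illustration.

```python
import numpy as np


def binary_report(y_true, y_prob, threshold=0.5):
    """Confusion matrix and basic metrics for the positive class."""
    y_true = np.asarray(y_true).astype(int).ravel()
    y_pred = (np.asarray(y_prob).ravel() >= threshold).astype(int)

    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))

    # Rows are actual classes, columns are predicted classes
    cm = np.array([[tn, fp], [fn, tp]])
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    metrics = {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
    return cm, metrics


# Toy labels and sigmoid outputs (made up)
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.7, 0.8, 0.4, 0.9, 0.2]
cm, metrics = binary_report(y_true, y_prob)
print(cm)
print(metrics)
```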

For the following two experiments, you will perform hyperparameter tuning to enhance the performance
of your LNN models. This tuning should be conducted using the Keras Tuner library, where you can
choose either the RandomSearch or Hyperband algorithms to efficiently explore the hyperparameter 
space.

Specifically, you will tune the learning rate, number of epochs, and mini-batch size. Additionally,
you must develop your own heuristic to define the lower and upper ranges for these hyperparameters
and briefly state your heuristic in no more than a few lines.

## Experiment 1

Tune hyperparameters, including learning rate, number of epochs, and mini-batch size. You may also
apply regularization (both weight-based and early stopping) as needed to optimize performance.

In [None]:
%%time

## Experiment 2

Using the deep copy DataFrame “df_copy”, repeat the steps to create target and feature DataFrames.

Split the dataset into training and test subsets (20% for testing) and then into training and
validation subsets (20% for validation) without standardizing the dataset.

Create an optimal logistic regression LNN model. Select optimal hyperparameters and regularizers to
ensure that the test performance of this experiment is comparable to that of Experiment 1. You must
optimize the model’s performance to align it closely with Experiment 1.
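
One reason Experiment 2 needs different hyperparameters is that unstandardized features produce much larger gradients. The sketch below compares the gradient of the mean binary cross-entropy at zero weights for raw and standardized inputs. The data is synthetic: one column on an fnlwgt-like scale, one 0/1 dummy column, and roughly a quarter positive labels, all of which are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins for an fnlwgt-like column (~1e5) and a binary dummy column
X_raw = np.column_stack([
    rng.normal(190_000, 100_000, size=n),
    rng.integers(0, 2, size=n).astype(float),
])
y = (rng.random(n) < 0.24).astype(float)  # imbalanced toy labels

# Standardize column-wise (zero mean, unit variance)
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)


def grad_at_zero(X, y):
    """Gradient of mean binary cross-entropy w.r.t. the weights at w = 0, b = 0."""
    p = np.full(len(y), 0.5)  # sigmoid(0) for every sample
    return X.T @ (p - y) / len(y)


g_raw = grad_at_zero(X_raw, y)
g_std = grad_at_zero(X_std, y)
print(np.abs(g_raw), np.abs(g_std))
```

The raw-scale gradient for the large column is several orders of magnitude bigger than anything in the standardized case. This suggests shifting the learning-rate search range for Experiment 2 toward much smaller values than in Experiment 1, though the right range still has to be found by tuning.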

In [None]:
%%time

