<a href="https://colab.research.google.com/github/thor4/neuralnets/blob/master/projects/state_farm/state_farm_task-step_3-generate_predictions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# State Farm Pre-employment Assessment
### *Model-based supervised learning binary classification task*
Your work will be evaluated in the following areas:
- The appropriateness of the steps you took
- The complexity of your models
- The performance of each model on the test set (using AUC)
- The organization and readability of your code
- The write-up comparing the models
---

## Step 3 - Generate predictions
Create predictions on the data in test.csv using each of your trained models. The predictions should be the class probabilities for belonging to the positive class (labeled '1').  
 
Be sure to output a prediction for each of the rows in the test dataset (10K rows). Save the results of each of your models in a separate CSV file.  Title the two files 'glmresults.csv' and 'nonglmresults.csv'. Each file should have a single column representing the predicted probabilities for its respective model. Please do not include a header label or index column.

We will begin by importing relevant libraries.

In [1]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder, FunctionTransformer, QuantileTransformer, LabelEncoder
import joblib
from google.colab import files

Download the saved models from the GitHub repository.

In [6]:
!wget https://raw.githubusercontent.com/thor4/neuralnets/master/projects/state_farm/models/logistic_regression_model.pkl
!wget https://raw.githubusercontent.com/thor4/neuralnets/master/projects/state_farm/models/gb_model.pkl

--2023-03-21 02:04:29--  https://raw.githubusercontent.com/thor4/neuralnets/master/projects/state_farm/models/logistic_regression_model.pkl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2191 (2.1K) [application/octet-stream]
Saving to: ‘logistic_regression_model.pkl’


2023-03-21 02:04:29 (31.3 MB/s) - ‘logistic_regression_model.pkl’ saved [2191/2191]

--2023-03-21 02:04:29--  https://raw.githubusercontent.com/thor4/neuralnets/master/projects/state_farm/models/gb_model.pkl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 176990 (

Next, we use the pandas library to load our CSV data and joblib to load our models.

In [7]:
test_data = pd.read_csv("exercise_40_test.csv")
logistic_regression = joblib.load('logistic_regression_model.pkl')
gb = joblib.load('gb_model.pkl')

In [8]:
test_data.head()

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,...,x91,x92,x93,x94,x95,x96,x97,x98,x99,x100
0,4.747627,20.509439,Wednesday,2.299105,-1.815777,-0.752166,0.0098%,-3.240309,0.587948,-0.260721,...,,12.542333,no,3.107683,0.533904,12.438759,7.298306,0,,93.56712
1,1.148654,19.301465,Fri,1.8622,-0.773707,-1.461276,0.0076%,0.443209,0.522113,-1.090886,...,-0.848567,7.213829,yes,4.276078,,10.386987,12.527094,1,yes,98.607486
2,4.98686,18.769675,Saturday,1.040845,-1.54869,2.632948,-5e-04%,-1.167885,5.739275,0.222975,...,1.143388,10.483928,no,2.090868,-1.780474,11.328177,11.628247,0,yes,94.578246
3,3.709183,18.374375,Tuesday,-0.169882,-2.396549,-0.784673,-0.016%,-2.662226,1.54805,0.210141,...,0.693646,3.862867,no,2.643847,1.66224,10.064961,10.550014,1,,100.346261
4,3.801616,20.205541,Monday,2.092652,-0.732784,-0.703101,0.0186%,0.056422,2.878167,-0.457618,...,-0.834763,3.632039,yes,4.074434,,9.255766,12.716137,1,yes,102.578918


### Prepare the data

Copy the test dataset for pre-processing.

In [9]:
test = test_data.copy()

Drop the unnecessary columns.

In [10]:
test = test.drop(columns=["x39", "x99", "x79", "x28"])

Convert the binary object columns (`x24`, `x31`, `x93`) to integer codes.


In [11]:
binary_features = ["x24", "x31", "x93"]
for col in binary_features:
    le = LabelEncoder()
    test[col] = le.fit_transform(test[col].astype(str))
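
One detail worth checking in this step: converting a column to strings turns each missing entry into the literal text `nan`, and `LabelEncoder` then treats that text as its own class, so the result is no longer strictly binary. The sketch below illustrates this on made-up values (the `yes`/`no` labels mirror `x93`; whether the real columns contain missing values is not shown here). The explicit mapping at the end is an alternative offered as an assumption, not what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative values only; the real columns may differ.
raw = ["no", "yes", np.nan, "yes"]

# Stringify each value, mirroring the effect of astype(str) in the cell above.
as_text = np.array([str(v) for v in raw])  # np.nan becomes the text "nan"

le = LabelEncoder()
codes = le.fit_transform(as_text)
print(list(le.classes_))  # classes are stored in sorted order
print(codes.tolist())

# Alternative (an assumption, not the notebook's approach): an explicit mapping
# keeps missing values as NaN so that a later imputation step can handle them.
explicit = pd.Series(raw, dtype=object).map({"no": 0, "yes": 1})
print(explicit.tolist())
```

Because `LabelEncoder` is re-fit here, the integer assigned to each label depends on which labels are present, so the codes only match the training notebook if the same set of labels appears in both datasets.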

Replace null values in `x33` and `x77` by randomly sampling from each column's observed categories, weighted by their relative frequencies.

In [12]:
for col in ["x33", "x77"]:
    probs = test[col].value_counts(normalize=True)
    missing = test[col].isna()
    test.loc[missing, col] = np.random.choice(probs.index, size=len(test[missing]), p=probs.values)
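
Because `np.random.choice` draws from NumPy's global random state, the imputed values change from run to run unless a seed is set. A minimal sketch of a reproducible variant follows; the helper name `impute_by_sampling`, the seed value, and the demo values are illustrative assumptions rather than part of the original notebook.

```python
import numpy as np
import pandas as pd

def impute_by_sampling(series, rng):
    """Fill missing entries by sampling observed values in proportion to their frequency."""
    probs = series.value_counts(normalize=True)  # NaN is excluded by default
    missing = series.isna()
    filled = series.copy()
    filled.loc[missing] = rng.choice(
        probs.index.to_numpy(), size=int(missing.sum()), p=probs.to_numpy()
    )
    return filled

# Hypothetical demo column with two missing entries.
demo = pd.Series(["Texas", "Texas", "Ohio", None, None], dtype=object)

# Using the same seed twice yields the same imputed values.
first = impute_by_sampling(demo, np.random.default_rng(42))
second = impute_by_sampling(demo, np.random.default_rng(42))
print(first.tolist())
print(first.equals(second))
```

Passing the generator in explicitly keeps the randomness local to this step instead of relying on global state.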

Merge the abbreviated day names in `x3` (e.g. `Fri`) into their full forms (e.g. `Friday`) so that each day has a single label.

In [13]:
day_mapping = {
    "Mon": "Monday",
    "Tue": "Tuesday",
    "Wed": "Wednesday",
    "Thur": "Thursday",
    "Fri": "Friday",
    "Sat": "Saturday",
    "Sun": "Sunday"
}
test["x3"] = test["x3"].replace(day_mapping)
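
As a quick illustration of what this mapping does, the sketch below applies it to a few made-up values and then checks that no unexpected labels remain. The demo values and the `unexpected` check are assumptions added for illustration.

```python
import pandas as pd

day_mapping = {
    "Mon": "Monday",
    "Tue": "Tuesday",
    "Wed": "Wednesday",
    "Thur": "Thursday",
    "Fri": "Friday",
    "Sat": "Saturday",
    "Sun": "Sunday",
}

# Hypothetical sample mixing abbreviated and full day names.
demo_days = pd.Series(["Mon", "Monday", "Thur", "Fri", "Sunday"])

# replace() with a dict swaps whole values that match a key; other values pass through.
normalized = demo_days.replace(day_mapping)
print(normalized.tolist())

# Any label outside the seven full day names would show up here.
unexpected = set(normalized.dropna()) - set(day_mapping.values())
print(unexpected)
```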

Convert the `x7` column to a float by removing the % sign and dividing by 100. Convert the `x19` column to a float by removing the $ sign.

In [14]:
test['x7'] = test['x7'].str.strip('%').astype(float) / 100
test['x19'] = test['x19'].str.strip('$').astype(float)
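
The sketch below walks through both conversions on a few sample strings. The percent values are taken from the `x7` column shown in the preview above; the currency strings are hypothetical, since the format of `x19` is not shown. Note that `str.strip('$')` only removes characters from the ends of a string, so a value such as `-$12.50` would keep its `$` and fail the float conversion. Using `str.replace` is one way to handle that case, if it occurs.

```python
import pandas as pd

# Percent strings as they appear in x7, plus a missing entry.
pct = pd.Series(["0.0098%", "-5e-04%", None])
pct_float = pct.str.rstrip("%").astype(float) / 100  # 0.0098% becomes 0.000098
print(pct_float.tolist())

# Hypothetical currency strings; the real x19 format may differ.
money = pd.Series(["$3.10", "-$12.50", None])

# replace() removes the sign wherever it occurs, not only at the ends.
money_float = money.str.replace("$", "", regex=False).astype(float)
print(money_float.tolist())
```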

### Run preprocessing pipeline

In [16]:
# Define transformers for each group of columns
one_hot_features = ["x33", "x77", "x3", "x60", "x65"]
range_based_features = ["x58", "x67", "x71", "x84"]
quantile_transform_features = ["x12", "x18", "x61", "x92", "x40", "x57"]
log_transform_features = ["x14", "x16", "x21", "x42", "x45", "x55", "x70", "x73", "x75", "x82", "x89", "x96"]

one_hot_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("one_hot", OneHotEncoder())
])

def custom_discretizer(X, low_quantile=0.25, high_quantile=0.75):
    low_bound = np.quantile(X, low_quantile, axis=0)
    high_bound = np.quantile(X, high_quantile, axis=0)
    return np.where(X < low_bound, 0, np.where(X <= high_bound, 1, 2))

range_based_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("custom_discretizer", FunctionTransformer(custom_discretizer, validate=True)),
    ("one_hot", OneHotEncoder())
])

quantile_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("quantile_transform", QuantileTransformer(output_distribution="normal"))
])

def log1p_with_positive_shift(X):
    positive_shift = np.abs(np.min(X, axis=0)) + 1e-6
    return np.log1p(X + positive_shift)

log_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("log_transform", FunctionTransformer(log1p_with_positive_shift, validate=True)),
    ("standard_scaler", StandardScaler())
])

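# Every float column not already assigned to one of the groups above.
# Note: this list is built from a set, so its order is not guaranteed to be
# stable across Python sessions. The models receive a positional array, so the
# column order must match the order used at training time.
remaining_float_features = list(
    set(test.select_dtypes(include=["float64"]).columns)
    - set(range_based_features)
    - set(quantile_transform_features)
    - set(log_transform_features)
)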

float_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("standard_scaler", StandardScaler())
])

# Create ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ("one_hot", one_hot_transformer, one_hot_features),
    ("range_based", range_based_transformer, range_based_features),
    ("quantile_transform", quantile_transformer, quantile_transform_features),
    ("log_transform", log_transformer, log_transform_features),
    ("float_transform", float_transformer, remaining_float_features)
])

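# Apply the preprocessing pipeline.
# Note: fit_transform re-fits every imputer, scaler, and encoder on the test data,
# so the learned statistics come from the test set rather than the training set.
X_transformed = preprocessor.fit_transform(test)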

# Get the column names from the transformers
one_hot_cols = preprocessor.named_transformers_["one_hot"].named_steps["one_hot"].get_feature_names_out(one_hot_features)
# Assumes every range-based feature yields all three bins (0, 1, 2) after discretization
range_based_categories = ["low", "middle", "high"]
range_based_cols = [f"{col}_{cat}" for col in range_based_features for cat in range_based_categories]
quantile_transform_cols = [f"quantile_{col}" for col in quantile_transform_features]
log_transform_cols = [f"log_{col}" for col in log_transform_features]
float_transform_cols = [f"float_{col}" for col in remaining_float_features]

print("One-hot cols:", len(one_hot_cols))
print("Range-based cols:", len(range_based_cols))
print("Quantile transform cols:", len(quantile_transform_cols))
print("Log transform cols:", len(log_transform_cols))
print("Float transform cols:", len(float_transform_cols))

# Combine column names
columns = (list(one_hot_cols)
           + list(range_based_cols)
           + quantile_transform_cols
           + log_transform_cols
           + float_transform_cols)

print("Total expected columns:", len(columns))
print("Actual columns in transformed dataset:", X_transformed.shape[1])

test_transformed = pd.DataFrame(X_transformed, columns=columns)

One-hot cols: 82
Range-based cols: 12
Quantile transform cols: 6
Log transform cols: 12
Float transform cols: 64
Total expected columns: 176
Actual columns in transformed dataset: 176
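
The cell above calls `fit_transform` on the test data, so the medians, scaling statistics, and one-hot categories are learned from the test set. A common alternative is to fit the preprocessor once on the training data, save it alongside the models, and call only `transform` here. The sketch below shows that pattern on a tiny synthetic dataset; the file name `preprocessor.pkl`, the column names, and the data are all hypothetical, and no such file is known to exist in the repository.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic stand-ins for the training and test sets.
train = pd.DataFrame({"num": [1.0, 2.0, np.nan, 4.0], "cat": ["a", "b", "a", "b"]})
new = pd.DataFrame({"num": [10.0, np.nan], "cat": ["b", "c"]})  # "c" is unseen

numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("standard_scaler", StandardScaler()),
])
demo_preprocessor = ColumnTransformer(transformers=[
    ("num", numeric, ["num"]),
    # handle_unknown="ignore" encodes unseen categories as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat"]),
])

# Training side: fit once and persist the fitted preprocessor.
demo_preprocessor.fit(train)
path = os.path.join(tempfile.mkdtemp(), "preprocessor.pkl")  # hypothetical name
joblib.dump(demo_preprocessor, path)

# Prediction side: load and transform only, reusing the training statistics.
restored = joblib.load(path)
new_transformed = restored.transform(new)
print(new_transformed.shape)
```

With this pattern the number and order of output columns are fixed at training time, so the test matrix lines up with what the models were trained on regardless of which categories appear in the test data.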


In [17]:
test_transformed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Columns: 176 entries, x33_Alabama to float_x53
dtypes: float64(176)
memory usage: 13.4 MB


In [18]:
test_transformed.columns[test_transformed.isnull().sum() != 0]

Index([], dtype='object')
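
Before predicting, it can also help to confirm that the transformed matrix has the number of features each model was trained on. Fitted scikit-learn estimators expose this as `n_features_in_`. The sketch below uses a small stand-in model; the helper name `check_feature_count` and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on three synthetic features.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(20, 3))
y_train_demo = np.array([0, 1] * 10)
demo_model = LogisticRegression().fit(X_train_demo, y_train_demo)

def check_feature_count(model, X):
    """Raise if X does not have the number of columns the model was fitted on."""
    expected = model.n_features_in_
    if X.shape[1] != expected:
        raise ValueError(f"Model expects {expected} features, got {X.shape[1]}")
    return expected

print(check_feature_count(demo_model, X_train_demo))
```

A matching count does not prove the columns are in the right order, so this is a necessary check rather than a sufficient one.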

### Generate and save predictions

Generate predictions (class probabilities) for both models.

In [19]:
logistic_regression_prob = logistic_regression.predict_proba(X_transformed)[:, 1]
gb_prob = gb.predict_proba(X_transformed)[:, 1]
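
The `[:, 1]` index relies on the positive class being the second entry in each model's `classes_` attribute, which holds for labels `0` and `1`. A slightly more defensive sketch looks the column up by label instead; the stand-in classifier and data below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in classifier trained on a single synthetic feature.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X_demo, y_demo)

# predict_proba columns follow the order of clf.classes_.
positive_index = list(clf.classes_).index(1)
positive_prob = clf.predict_proba(X_demo)[:, positive_index]

print(clf.classes_.tolist())
print(positive_index)
```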

Save the predictions in two separate CSV files.

In [20]:
# Save logistic_regression predictions to glmresults.csv
pd.DataFrame(logistic_regression_prob).to_csv('glmresults.csv', header=False, index=False)

# Save gb predictions to nonglmresults.csv
pd.DataFrame(gb_prob).to_csv('nonglmresults.csv', header=False, index=False)
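
Since the task asks for a single column with no header or index, it can be worth reading a file back to confirm its shape before submitting. A minimal sketch of such a check, using a temporary file and made-up probabilities rather than the real predictions:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up probabilities standing in for a model's output.
demo_prob = np.array([0.12, 0.87, 0.5])

path = os.path.join(tempfile.mkdtemp(), "demo_results.csv")  # hypothetical file
pd.DataFrame(demo_prob).to_csv(path, header=False, index=False)

# header=None tells pandas that the first row is data, not column names.
check = pd.read_csv(path, header=None)
print(check.shape)
print(check[0].between(0, 1).all())
```

For the real files, the same check should report 10,000 rows and one column.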

Download the files from Colab for submission.

In [21]:
files.download('glmresults.csv')
files.download('nonglmresults.csv')
