## Load data & organize for training/inference
We load the outputs of the `basic_feature_engineering.ipynb` notebook.

In [44]:
import pandas as pd

# Load data
train = pd.read_pickle('final_train.pkl')
test = pd.read_pickle('final_test.pkl')

# Separate features, targets
x_train = train.drop(columns=['target_scope_1', 'target_scope_2', 'entity_id'])
y_scope1 = train['target_scope_1']
y_scope2 = train['target_scope_2']

# Features passed to the final models at inference time to generate predictions
x_test = test.drop(columns=['entity_id'])
# Keep entity_id: the submission file must include it alongside the predictions
entity_id_test = test['entity_id']

## Train and validate model
We are using a linear regression model for simplicity.

In [45]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


def validate_model(x, y):

    # Initialize model
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', LinearRegression())
        ])
    # Set up K-Fold cross validation
    kfold = KFold(n_splits=5, shuffle=True, random_state=42)

    # Perform cross-validation with RMSE
    # Note: cross_val_score uses negative MSE, so we convert to RMSE
    cv_scores = cross_val_score(
        pipeline, 
        x, 
        y, 
        cv=kfold, 
        scoring='neg_mean_squared_error'
    )

    # Convert to RMSE (positive values)
    rmse_scores = np.sqrt(-cv_scores)

    # Print cross-validation results
    print(f"Cross-Validation RMSE Scores: {rmse_scores}")
    print(f"Mean RMSE: {rmse_scores.mean():.4f}")
    print(f"Standard Deviation RMSE: {rmse_scores.std():.4f}")

    # Return the per-fold RMSE scores so the caller's assignment is not None
    return rmse_scores
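
The conversion used in `validate_model` can be cross-checked against scikit-learn's built-in `neg_root_mean_squared_error` scorer. The sketch below is illustrative only: `x_demo` and `y_demo` are synthetic stand-ins (not the notebook's data), and it simply shows that both routes give the same per-fold RMSE when the same `KFold` splitter is used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; the notebook's real features come from final_train.pkl
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 4))
y_demo = x_demo @ np.array([3.0, -2.0, 0.5, 1.0]) + rng.normal(size=200)

pipeline = Pipeline([('scaler', StandardScaler()), ('model', LinearRegression())])
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Route 1 (as in validate_model): negative MSE per fold, then sqrt of the negation
neg_mse = cross_val_score(
    pipeline, x_demo, y_demo, cv=kfold, scoring='neg_mean_squared_error'
)
rmse_from_mse = np.sqrt(-neg_mse)

# Route 2: built-in scorer that returns negative RMSE per fold
neg_rmse = cross_val_score(
    pipeline, x_demo, y_demo, cv=kfold, scoring='neg_root_mean_squared_error'
)
rmse_direct = -neg_rmse

print(np.allclose(rmse_from_mse, rmse_direct))
```

Either route works; the explicit conversion keeps the intermediate MSE values available, while the built-in scorer saves a step.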




The model results below leave a lot of room for improvement. The poor performance likely stems from how skewed the data is between smaller companies and much larger ones. Additional transformations (e.g., taking the log of heavily skewed columns) could address this, but we do not explore them further in this notebook.
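
For readers who want to try the log idea, one possible starting point is scikit-learn's `TransformedTargetRegressor`, which fits on a transformed target and maps predictions back to the original units. This is a minimal sketch, not part of this notebook's pipeline: the data (`x_demo`, `y_demo`) is synthetic, the helper `build_pipeline` is hypothetical, and it assumes the targets are non-negative so that `np.log1p` is well defined. It transforms only the target; skewed feature columns would need their own handling.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, right-skewed stand-in for the emissions targets (hypothetical data)
rng = np.random.default_rng(42)
x_demo = rng.normal(size=(500, 3))
y_demo = np.exp(x_demo @ np.array([1.0, 0.5, -0.5]) + rng.normal(scale=0.3, size=500))


def build_pipeline():
    """Same scaler + linear regression pipeline used elsewhere in the notebook."""
    return Pipeline([('scaler', StandardScaler()), ('model', LinearRegression())])


# Fit on log1p(y) and invert with expm1, so scores stay in the original units
log_model = TransformedTargetRegressor(
    regressor=build_pipeline(), func=np.log1p, inverse_func=np.expm1
)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, model in [('raw target', build_pipeline()), ('log1p target', log_model)]:
    neg_mse = cross_val_score(
        model, x_demo, y_demo, cv=kfold, scoring='neg_mean_squared_error'
    )
    results[name] = np.sqrt(-neg_mse)
    print(f"{name}: mean RMSE = {results[name].mean():.4f}")
```

Because the inverse transform is applied before scoring, the two RMSE values are directly comparable. Whether the transform helps on the real data would need to be checked with the same cross-validation setup.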

In [46]:
s1_cv_rmse = validate_model(x_train, y_scope1)

Cross-Validation RMSE Scores: [141019.25233045 100115.67828489 109294.43531758  92291.39711949
  91616.90292995]
Mean RMSE: 106867.5332
Standard Deviation RMSE: 18236.5720


In [47]:
s2_cv_rmse = validate_model(x_train, y_scope2)

Cross-Validation RMSE Scores: [167800.84949382 181157.27160287  88319.3774602   81402.86321996
 275091.28553474]
Mean RMSE: 158754.3295
Standard Deviation RMSE: 70798.8001


## Final model inference

In [48]:
# Train final models on all training data
s1_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
    ])
s1_pipeline.fit(x_train,y_scope1)

s2_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
    ])
s2_pipeline.fit(x_train,y_scope2)

# Predict
s1_predictions = s1_pipeline.predict(x_test)
s2_predictions = s2_pipeline.predict(x_test)

## Create your submission file

In [53]:
submission = pd.DataFrame({
    'entity_id': entity_id_test,
    's1_predictions': s1_predictions,
    's2_predictions': s2_predictions
})

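# index=False keeps the row index out of the file, so it holds only entity_id and the predictions
submission.to_csv('submission.csv', index=False)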

submission

| | entity_id | s1_predictions | s2_predictions |
|---|---|---|---|
| 0 | 1076 | 28520.862664 | 35241.35964 |
| 1 | 2067 | 75712.18601 | 45558.347168 |
| 2 | 910 | 97384.30773 | 118591.522996 |
| 3 | 4082 | 93630.636476 | 90888.114501 |
| 4 | 4102 | 48521.360346 | 25294.518406 |
| 5 | 1535 | 49629.638088 | 43017.437985 |
| 6 | 4213 | 71270.516129 | 31431.926798 |
| 7 | 107 | 79034.984495 | 72371.762514 |
| 8 | 2301 | 71280.255588 | 77712.739351 |
| 9 | 1463 | 15574.124049 | 17424.197015 |
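
Since the submission must carry `entity_id` alongside the predictions, it can be worth checking which columns someone reading the CSV will actually see. `DataFrame.to_csv` writes the row index as an extra leading column unless `index=False` is passed. The sketch below uses made-up rows and a hypothetical helper, `columns_after_roundtrip`, to compare the two cases in memory without touching `submission.csv`.

```python
import io

import pandas as pd

# Made-up rows for illustration only; these are not real predictions
demo = pd.DataFrame({
    'entity_id': [101, 102, 103],
    's1_predictions': [10.0, 20.0, 30.0],
    's2_predictions': [1.5, 2.5, 3.5],
})


def columns_after_roundtrip(df, **to_csv_kwargs):
    """Write df to an in-memory CSV and return the columns a reader would see."""
    buffer = io.StringIO()
    df.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return list(pd.read_csv(buffer).columns)


cols_with_index = columns_after_roundtrip(demo)                  # default: index=True
cols_without_index = columns_after_roundtrip(demo, index=False)  # index omitted

print(cols_with_index)
print(cols_without_index)
```

Which layout is expected depends on the submission format, so treat this only as a way to inspect the file before uploading.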
