# Predicted Component Scores from Composite

We're only given composite scores along with the component ranks of each school, so let's try to approximate the component scores using the composite formula:
```
composite = (0.5 * equity) + (0.25 * excellence) + (0.25 * efficiency)
```

## Linear Model
If we assume a linear relationship between ranks (0 is the worst) and component scores, we have the additional relationships:
```
rank_normalized =  rank / total_schools

equity = equity_scaling_factor * equity_rank_normalized
excellence = excellence_scaling_factor * excellence_rank_normalized
efficiency = efficiency_scaling_factor * efficiency_rank_normalized
```
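
As a minimal sketch of these relationships, the snippet below normalizes a few made-up ranks and maps them to scores. The rank values and the 0.8 scaling factor are illustrative assumptions, not values from the data, and `max_rank` is assumed to play the role of `total_schools`.

```python
import numpy

# hypothetical ranks for one component (0 is the worst); not taken from the data
ranks = numpy.array([0, 25, 50, 100])
max_rank = ranks.max()  # assumed stand-in for total_schools

rank_normalized = ranks / max_rank

# assumed scaling factor, purely for illustration
equity_scaling_factor = 0.8
equity = equity_scaling_factor * rank_normalized

print(rank_normalized)
print(equity)
```

Under this assumption the worst-ranked school always scores 0 and the best-ranked school scores exactly the scaling factor.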

Combining these, we arrive at the equation we're trying to solve:
```
composite = (0.5 * equity_scaling_factor * equity_rank_normalized) 
    + (0.25 * excellence_scaling_factor * excellence_rank_normalized) 
    + (0.25 * efficiency_scaling_factor * efficiency_rank_normalized)

composite = beta_0 
    + (beta_1 * equity_rank_normalized) 
    + (beta_2 * excellence_rank_normalized) 
    + (beta_3 * efficiency_rank_normalized)
```

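Solving for the `betas` (with `beta_0` fixed at 0, since the first form has no intercept) then lets us approximate the individual component scores: each scaling factor is its beta divided by the component's weight in the composite.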
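
To illustrate how a fitted beta maps back to a scaling factor, here is a minimal sketch on synthetic data. It uses `numpy.linalg.lstsq` as a stand-in for a no-intercept linear regression; the ranks and `true_factors` are made up for the example and are not the notebook's values.

```python
import numpy

rng = numpy.random.default_rng(0)

# synthetic normalized ranks for 50 hypothetical schools
# (columns: equity, excellence, efficiency)
x = rng.uniform(0, 1, size=(50, 3))

# assumed "true" scaling factors, used only to build the toy composite
true_factors = numpy.array([0.9, 0.6, 0.8])
weights = numpy.array([0.5, 0.25, 0.25])
y = x @ (weights * true_factors)

# least squares without an intercept column, so beta_0 is fixed at 0
betas, *_ = numpy.linalg.lstsq(x, y, rcond=None)

# each beta is weight * scaling_factor, so divide the weight back out
recovered_factors = betas / weights
print(recovered_factors)
```

Because the toy composite is noise-free, the recovered factors should match `true_factors` up to floating-point error; with real data they are only least-squares estimates.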

## Power Model
Although we don't know the shape of the distribution of each component score, we do see that the composite score distribution is only mostly linear. The top and bottom ends have steeper slopes, which a power model fits better. If we assume the composite score distribution is also representative of the component score distributions, we can instead fit a power model, solving for the `p`s:
```
composite = (0.5 * equity_rank_normalized ** p_equity) 
    + (0.25 * excellence_rank_normalized ** p_excellence) 
    + (0.25 * efficiency_rank_normalized ** p_efficiency)
```
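
The effect of the exponents can be sketched as follows. The `power_composite` helper, the rank values, and the exponents are hypothetical and chosen only to show how `p` bends the rank-to-score curve.

```python
import numpy

# how the exponent reshapes a normalized rank (p values are illustrative)
rank_normalized = numpy.array([0.1, 0.5, 0.9])
for p in (0.5, 1.0, 2.0):
    print(p, rank_normalized ** p)


def power_composite(ranks_normalized, powers, weights=(0.5, 0.25, 0.25)):
    """Weighted sum of normalized ranks raised to per-component powers."""
    ranks_normalized = numpy.asarray(ranks_normalized, dtype=float)
    powers = numpy.asarray(powers, dtype=float)
    return float(numpy.sum(numpy.asarray(weights) * ranks_normalized ** powers))


# one hypothetical school: equity, excellence, efficiency normalized ranks
linear_case = power_composite([0.8, 0.4, 0.6], powers=[1.0, 1.0, 1.0])
curved_case = power_composite([0.8, 0.4, 0.6], powers=[2.0, 2.0, 2.0])
print(linear_case, curved_case)
```

With every power at 1 the model reduces to a plain weighted sum of normalized ranks; powers above 1 pull mid-range ranks toward lower scores.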

## Results
Errors were slightly lower with the power model: MAE dropped from about 5.17 to 4.71 and RMSE from about 7.36 to 6.74 composite-score points.

Linear Model Errors:
- mae: 5.17455588230908
- mse: 54.18475698406821
- rmse: 7.3610296143996194

Power Model Errors:
- mae: 4.710755158241806
- mse: 45.38759555751562
- rmse: 6.737031657749251
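
For reference, the three metrics can be computed by hand as in the sketch below. The `actual` and `predicted` arrays are made-up toy values, not scores from the data.

```python
import numpy

# toy composite scores on the 0-100 scale (illustrative only)
actual = numpy.array([10.0, 20.0, 30.0])
predicted = numpy.array([12.0, 18.0, 33.0])

residuals = predicted - actual

mae = numpy.mean(numpy.abs(residuals))  # mean absolute error
mse = numpy.mean(residuals ** 2)        # mean squared error
rmse = numpy.sqrt(mse)                  # root mean squared error, in score points

print({'mae': mae, 'mse': mse, 'rmse': rmse})
```

MAE and RMSE are both in composite-score points, while MSE is in squared points, so the first two are the easier ones to compare against the 0-100 score range.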

## Prep the Data

In [1]:
%run notebooks/Setup.ipynb

import polars
import numpy
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.base import BaseEstimator, RegressorMixin
from scipy.optimize import minimize

In [18]:
# source data we're going to be working with
composite_df = polars.read_csv(workspace_path.joinpath('data/processed/composite_scores_raw.csv'))
composite_df

# normalize the ranks, the max is the same across components
max_rank = composite_df['equity_rank'].max()

normalized_df = composite_df\
    .select(['school_name', 'composite_score', 'equity_rank', 'excellence_rank', 'efficiency_rank'])\
    .with_columns([
        (composite_df['equity_rank'] / max_rank).alias('equity_rank_normalized'),
        (composite_df['excellence_rank'] / max_rank).alias('excellence_rank_normalized'),
        (composite_df['efficiency_rank'] / max_rank).alias('efficiency_rank_normalized'),
        (composite_df['composite_score'] / 100).alias('composite_normalized'),
    ])

normalized_df

| school_name (str) | composite_score (f64) | equity_rank (i64) | excellence_rank (i64) | efficiency_rank (i64) | equity_rank_normalized (f64) | excellence_rank_normalized (f64) | efficiency_rank_normalized (f64) | composite_normalized (f64) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Washington (George) High | 72.91 | 81 | 52 | 98 | 0.81 | 0.52 | 0.98 | 0.7291 |
| Presidio Middle | 51.16 | 52 | 44 | 94 | 0.52 | 0.44 | 0.94 | 0.5116 |
| Lafayette Elementary | 23.61 | 7 | 96 | 48 | 0.07 | 0.96 | 0.48 | 0.2361 |
| Alamo Elementary | 14.13 | 6 | 74 | 35 | 0.06 | 0.74 | 0.35 | 0.1413 |
| Argonne Elementary | 11.46 | 4 | 75 | 40 | 0.04 | 0.75 | 0.4 | 0.1146 |
| … | … | … | … | … | … | … | … | … |
| Milk (Harvey) Civil Rights Ele… | 21.5 | 9 | 73 | 71 | 0.09 | 0.73 | 0.71 | 0.215 |
| Everett Middle | 15.2 | 79 | 1 | 4 | 0.79 | 0.01 | 0.04 | 0.152 |
| Lilienthal (Claire) Elementary | 40.88 | 31 | 87 | 61 | 0.31 | 0.87 | 0.61 | 0.4088 |
| Marina Middle | 35.98 | 65 | 17 | 49 | 0.65 | 0.17 | 0.49 | 0.3598 |


## Model Eval

In [19]:
# return the errors of the predictions to evaluate the model
def errors(predicted_df):
    mae = mean_absolute_error(predicted_df['composite_score'], predicted_df['composite_score_predicted'])
    mse = mean_squared_error(predicted_df['composite_score'], predicted_df['composite_score_predicted'])
    
    return {
        'mae': mae,
        'mse': mse,
        'rmse': numpy.sqrt(mse)
    }


## Linear Model

In [20]:
# setup the regressions
x = normalized_df.select(['equity_rank_normalized', 'excellence_rank_normalized', 'efficiency_rank_normalized'])\
    .to_numpy()
y = normalized_df['composite_normalized'].to_numpy()

# fit
model = LinearRegression(fit_intercept=False)
model.fit(x, y)
beta_0 = model.intercept_
beta_1, beta_2, beta_3 = model.coef_

beta_0, beta_1, beta_2, beta_3

(0.0,
 np.float64(0.3964423881149744),
 np.float64(0.15692051444902033),
 np.float64(0.2131077408232291))

In [21]:
# predict and eval
equity_scaling_factor = beta_1 / 0.5
excellence_scaling_factor = beta_2 / 0.25
efficiency_scaling_factor = beta_3 / 0.25

linear_predictions = normalized_df.with_columns([
    (equity_scaling_factor * normalized_df['equity_rank_normalized'] * 100).alias('equity_score_predicted'),
    (excellence_scaling_factor * normalized_df['excellence_rank_normalized'] * 100).alias('excellence_score_predicted'),
    (efficiency_scaling_factor * normalized_df['efficiency_rank_normalized'] * 100).alias('efficiency_score_predicted'),
])

linear_predictions = linear_predictions.with_columns(
    (
        0.5 * linear_predictions['equity_score_predicted'] +
        0.25 * linear_predictions['excellence_score_predicted'] +
        0.25 * linear_predictions['efficiency_score_predicted'] +
        100 * beta_0  # intercept is added on the 0-100 scale; it is 0.0 here since fit_intercept=False
    ).alias('composite_score_predicted')
)

errors(linear_predictions)

{'mae': np.float64(5.17455588230908),
 'mse': np.float64(54.18475698406821),
 'rmse': np.float64(7.3610296143996194)}

## Power Model

In [22]:
def fit(params, data):
    p_equity, p_excellence, p_efficiency = params
    
    # estimate the parts and composite
    data['s_equity'] = data['equity_rank_normalized'] ** p_equity
    data['s_excellence'] = data['excellence_rank_normalized'] ** p_excellence
    data['s_efficiency'] = data['efficiency_rank_normalized'] ** p_efficiency
    
    data['composite_estimated'] = (
        0.5 * data['s_equity'] +
        0.25 * data['s_excellence'] +
        0.25 * data['s_efficiency']
    )
    
    # return sum squared of errors to minimize
    sse = numpy.sum((data['composite_normalized'] - data['composite_estimated']) ** 2)
    
    return sse

In [23]:
initial_params = [1.0, 1.0, 1.0]
bounds = [(0.1, 10), (0.1, 10), (0.1, 10)]

result = minimize(
    fit,
    initial_params,
    args=(normalized_df.to_pandas(),),
    bounds=bounds,
    method='L-BFGS-B'
)

# extract the powers
p_equity, p_excellence, p_efficiency = result.x
result

  message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
  success: True
   status: 0
      fun: 0.45841471513090776
        x: [ 1.212e+00  1.489e+00  3.536e+00]
      nit: 15
      jac: [-8.882e-08  3.830e-07  5.551e-09]
     nfev: 72
     njev: 18
 hess_inv: <3x3 LbfgsInvHessProduct with dtype=float64>
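
The optimizer's `fun` is a sum of squared errors on the normalized 0-1 scale, so it can be converted to the 0-100 score scale as sketched below. The helper `sse_to_score_errors` is hypothetical, and `n_schools=101` is an assumption: the notebook does not print the row count, and 101 is simply the value that makes `fun` line up with the MSE reported for the power model.

```python
import numpy


def sse_to_score_errors(sse_normalized, n_schools):
    """Convert an SSE on the 0-1 composite scale to MSE/RMSE on the 0-100 scale."""
    mse_normalized = sse_normalized / n_schools
    mse = mse_normalized * 100 ** 2  # scores are 100x the normalized values
    return {'mse': mse, 'rmse': numpy.sqrt(mse)}


# `fun` from the optimizer output above; n_schools=101 is an assumption
score_errors = sse_to_score_errors(0.45841471513090776, n_schools=101)
print(score_errors)
```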

In [24]:
# predict and eval
power_predictions = normalized_df.with_columns([
    (normalized_df['equity_rank_normalized'] ** p_equity * 100).alias('equity_score_predicted'),
    (normalized_df['excellence_rank_normalized'] ** p_excellence * 100).alias('excellence_score_predicted'),
    (normalized_df['efficiency_rank_normalized'] ** p_efficiency * 100).alias('efficiency_score_predicted'),
])

power_predictions = power_predictions.with_columns(
    (
        0.5 * power_predictions['equity_score_predicted'] +
        0.25 * power_predictions['excellence_score_predicted'] +
        0.25 * power_predictions['efficiency_score_predicted']
    ).alias('composite_score_predicted')
)

errors(power_predictions)

{'mae': np.float64(4.710755158241806),
 'mse': np.float64(45.38759555751562),
 'rmse': np.float64(6.737031657749251)}

## Combined Approximates

In [25]:
linear_predictions_renamed = linear_predictions\
    .select(['school_name', 'equity_score_predicted', 'excellence_score_predicted', 'efficiency_score_predicted', 'composite_score_predicted'])\
    .rename({
        'equity_score_predicted': 'equity_score_lmodel',
        'excellence_score_predicted': 'excellence_score_lmodel',
        'efficiency_score_predicted': 'efficiency_score_lmodel',
        'composite_score_predicted': 'composite_score_lmodel'
    })

power_predictions_renamed = power_predictions\
    .select(['school_name', 'equity_score_predicted', 'excellence_score_predicted', 'efficiency_score_predicted', 'composite_score_predicted'])\
    .rename({
        'equity_score_predicted': 'equity_score_pmodel',
        'excellence_score_predicted': 'excellence_score_pmodel',
        'efficiency_score_predicted': 'efficiency_score_pmodel',
        'composite_score_predicted': 'composite_score_pmodel'
    })

# merge with original
combined_predictions = composite_df\
    .select(['school_name', 'composite_score', 'equity_rank', 'excellence_rank', 'efficiency_rank'])\
    .join(
        linear_predictions_renamed, 
        on='school_name', 
        how='left'
    )\
    .join(
        power_predictions_renamed, 
        on='school_name', 
        how='left'
    )

combined_predictions.write_csv(workspace_path.joinpath('data/processed/component_scores.csv'))
