#**Predicting Happiness**

In this problem set you will work with a data set from the [World Happiness Website](https://worldhappiness.report/ed/2018/).

You will use the data in the file,  WHR2018Chapter2OnlineData.xls.

Our goal will be to develop a model for happiness.


#[DSLC Stage 1]: Domain Problem and Data Collection

Read the description of the data at the World Happiness Website.

**TODO: Add a text cell to answer the following questions:**

1. From your new domain knowledge, what variable will you use as a response or dependent variable for your model of happiness?
2. From your new domain knowledge, what variables will you consider as potential predictor (or independent) variables?
3. From your new domain knowledge, can you identify any variables that are cofounders?
4. Please share one question that you still have about the data collection process.


1. The dependent variable will be the 'Life Ladder' column for my model of happiness.
2. Some potential predictor variables would be: Log GDP per Capita, Freedom to make life choices, Healthy Life Expectancy at Birth, and Social Support.
3. I would say that GDP per Capita is a cofounding variable, as it effects both the happiness of people but also the corruption, as countries with higher GDP per Capitas often have lower levels of corruption. Additionally, the GDP per capita of a nation also has an impact on the life expectancy along with general happiness.
4. One question I have is who made the conclusions on a countries 'Democratic Quality'. As the level of democracy in a country could be biased depending on who you are asking. For example, asking a representative of a government versus citizens of a state would likely produce different results. Also if it's just one country collecting the data it could be more biased than a NGO or alliance group like the UN or EU.

# [DSLC stage 2]: Data cleaning, pre-processing, and exploratory data analysis


In this section you will load and clean the data. Please run the code provide and complete modifications as specified.


In [5]:
# Installing scikit-lego package
%pip install scikit-lego



In [6]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklego.linear_model import LADRegression



pd.set_option('display.max_columns', None)
pd.options.display.max_colwidth = 500
pd.options.display.max_rows = 100

In [7]:
# load the happiness data in file WHR2018Chapter2OnlineData.xls
# Upload this file using the folder to left
happiness_orig = pd.read_excel("WHR2018Chapter2OnlineData.xls", sheet_name=0)
happiness_orig

Unnamed: 0,country,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,Standard deviation/Mean of ladder by country-year,GINI index (World Bank estimate),"GINI index (World Bank estimate), average 2000-15","gini of household income reported in Gallup, by wp5-year"
0,Afghanistan,2008,3.723590,7.168690,0.450662,49.209663,0.718114,0.181819,0.881686,0.517637,0.258195,0.612072,-1.929690,-1.655084,1.774662,0.476600,,,
1,Afghanistan,2009,4.401778,7.333790,0.552308,49.624432,0.678896,0.203614,0.850035,0.583926,0.237092,0.611545,-2.044093,-1.635025,1.722688,0.391362,,,0.441906
2,Afghanistan,2010,4.758381,7.386629,0.539075,50.008961,0.600127,0.137630,0.706766,0.618265,0.275324,0.299357,-1.991810,-1.617176,1.878622,0.394803,,,0.327318
3,Afghanistan,2011,3.831719,7.415019,0.521104,50.367298,0.495901,0.175329,0.731109,0.611387,0.267175,0.307386,-1.919018,-1.616221,1.785360,0.465942,,,0.336764
4,Afghanistan,2012,3.782938,7.517126,0.520637,50.709263,0.530935,0.247159,0.775620,0.710385,0.267919,0.435440,-1.842996,-1.404078,1.798283,0.475367,,,0.344540
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1557,Zimbabwe,2013,4.690188,7.565154,0.799274,48.949745,0.575884,-0.076716,0.830937,0.711885,0.182288,0.527755,-1.026085,-1.526321,1.964805,0.418918,,0.432,0.555439
1558,Zimbabwe,2014,4.184451,7.562753,0.765839,50.051235,0.642034,-0.045885,0.820217,0.725214,0.239111,0.566209,-0.985267,-1.484067,2.079248,0.496899,,0.432,0.601080
1559,Zimbabwe,2015,3.703191,7.556052,0.735800,50.925652,0.667193,-0.094585,0.810457,0.715079,0.178861,0.590012,-0.893078,-1.357514,2.198865,0.593776,,0.432,0.655137
1560,Zimbabwe,2016,3.735400,7.538829,0.768425,51.800068,0.732971,-0.065283,0.723612,0.737636,0.208555,0.699344,-0.863044,-1.371214,2.776363,0.743257,,0.432,0.596690


In [8]:
# Examine the first 10 rows
happiness_orig.head(10)

Unnamed: 0,country,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,Standard deviation/Mean of ladder by country-year,GINI index (World Bank estimate),"GINI index (World Bank estimate), average 2000-15","gini of household income reported in Gallup, by wp5-year"
0,Afghanistan,2008,3.72359,7.16869,0.450662,49.209663,0.718114,0.181819,0.881686,0.517637,0.258195,0.612072,-1.92969,-1.655084,1.774662,0.4766,,,
1,Afghanistan,2009,4.401778,7.33379,0.552308,49.624432,0.678896,0.203614,0.850035,0.583926,0.237092,0.611545,-2.044093,-1.635025,1.722688,0.391362,,,0.441906
2,Afghanistan,2010,4.758381,7.386629,0.539075,50.008961,0.600127,0.13763,0.706766,0.618265,0.275324,0.299357,-1.99181,-1.617176,1.878622,0.394803,,,0.327318
3,Afghanistan,2011,3.831719,7.415019,0.521104,50.367298,0.495901,0.175329,0.731109,0.611387,0.267175,0.307386,-1.919018,-1.616221,1.78536,0.465942,,,0.336764
4,Afghanistan,2012,3.782938,7.517126,0.520637,50.709263,0.530935,0.247159,0.77562,0.710385,0.267919,0.43544,-1.842996,-1.404078,1.798283,0.475367,,,0.34454
5,Afghanistan,2013,3.5721,7.503376,0.483552,51.04298,0.577955,0.074735,0.823204,0.620585,0.273328,0.482847,-1.879709,-1.403036,1.22369,0.342569,,,0.304368
6,Afghanistan,2014,3.130896,7.484583,0.525568,51.370525,0.508514,0.118579,0.871242,0.531691,0.374861,0.409048,-1.773257,-1.312503,1.395396,0.445686,,,0.413974
7,Afghanistan,2015,3.982855,7.466215,0.528597,51.693527,0.388928,0.094686,0.880638,0.553553,0.339276,0.260557,-1.844364,-1.291594,2.160618,0.54248,,,0.596918
8,Afghanistan,2016,4.220169,7.461401,0.559072,52.016529,0.522566,0.057072,0.793246,0.564953,0.348332,0.32499,-1.917693,-1.432548,1.796219,0.425627,,,0.418629
9,Afghanistan,2017,2.661718,7.460144,0.49088,52.339527,0.427011,-0.10634,0.954393,0.496349,0.371326,0.261179,,,1.454051,0.546283,,,0.286599


In [9]:
#TODO Add code to examine a random sample of 10 rows
happiness_orig.sample(10)

Unnamed: 0,country,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,Standard deviation/Mean of ladder by country-year,GINI index (World Bank estimate),"GINI index (World Bank estimate), average 2000-15","gini of household income reported in Gallup, by wp5-year"
241,Canada,2010,7.650346,10.613968,0.953765,71.295418,0.933949,0.216607,0.41266,0.878868,0.233113,0.551076,1.144457,1.838644,1.749785,0.22872,0.336,0.3368,0.710133
1137,Qatar,2010,6.849653,11.737195,,66.829788,,0.06017,,,,,0.070845,0.92553,2.023532,0.295421,,,0.454522
1397,Tunisia,2010,5.130521,9.253052,0.863188,64.967682,0.623593,-0.150295,0.732379,0.724581,0.248913,0.903175,-0.740071,-0.002242,1.513249,0.29495,0.358,0.381,0.291472
1495,Uzbekistan,2006,5.232322,8.0874,0.903067,60.67469,0.784301,-0.113762,0.608808,0.727946,0.195058,0.891238,-1.928188,-1.303556,1.917568,0.366485,,0.348,
854,Malaysia,2008,5.806782,9.951739,0.802811,63.934757,0.779566,0.024165,0.883766,0.815414,0.185745,0.663472,-0.227434,0.430628,1.632484,0.281134,,0.461,
794,Liberia,2014,4.571419,6.690494,0.708302,51.581688,0.590451,0.02286,0.868966,0.542849,0.44286,0.348814,-0.445642,-0.945885,2.945781,0.644391,0.332,0.3485,
780,Lebanon,2009,5.205999,9.667031,0.736412,67.514397,0.664734,0.066966,0.937025,0.527855,0.401289,0.371837,-0.958466,-0.508679,2.366042,0.454484,,0.318,0.394788
734,Kosovo,2014,5.000375,9.07617,0.705632,61.744324,0.441391,0.004026,0.775201,0.636128,0.20595,0.34445,,,1.77152,0.354277,,,0.359232
486,Germany,2005,6.61955,10.537519,0.96349,69.18705,0.846624,,0.781007,0.775692,0.197262,0.321759,1.188014,1.642694,1.798156,0.271643,,0.311571,
459,France,2010,6.797901,10.515214,0.942955,71.70446,0.849702,-0.114192,0.622954,0.789724,0.260568,0.401466,0.941972,1.432198,1.757904,0.258595,0.337,0.3203,0.362923


Edit this cell to answer the following question:
5. Please share one question that you still have about the data itself
6. What is the observational unit in the data?

In [10]:
#TODO: Examine missingness
# 1. Calculate the percent of missingness of each variable in the data
# 2. Visualize the percent of missingness of each variable in a heatmap
missingness = []
for col in happiness_orig.columns:
  missingness.append(happiness_orig[col].isna().sum()/happiness_orig.shape[0])

missingness_reshape = np.array(missingness).reshape(1, -1)

fig = px.imshow(missingness_reshape,
                x=happiness_orig.columns,
                y=['Percent of Missingness'],
                title='Percentage of Missing Values per Variable')
fig.show()

In [11]:
# This is a data cleaning function that is provided for you.
# Please feel free to modify this based on decisions you make
# during the pre-processing step. Document any changes you make and why.
def clean_happiness(happiness_orig, predictor_variable = None):
  # rename column names
  happiness_clean = happiness_orig.rename(columns={
    "Life Ladder": "happiness",
    "Log GDP per capita": "log_gdp_per_capita",
    "Social support": "social_support",
    "Healthy life expectancy at birth": "life_expectancy",
    "Freedom to make life choices": "freedom_choices",
    "Generosity": "generosity",
    "Perceptions of corruption": "corruption",
    "Positive affect": "positive_affect",
    "Negative affect": "negative_affect",
    "Confidence in national government": "government_confidence",
    "gini of household income reported in Gallup, by wp5-year": "gini_index"})
  # filter to relevant columns
  happiness_clean = happiness_clean[["country", "year", "happiness", "log_gdp_per_capita",
                                     "social_support", "life_expectancy",
                                     "freedom_choices", "generosity",
                                     "corruption", "positive_affect",
                                     "negative_affect", "government_confidence",
                                     "gini_index"]]

  if (predictor_variable is not None):
    happiness_clean = happiness_clean[["country", "year", "happiness", predictor_variable]]

  return(happiness_clean)


In [12]:
# Cleaning the data
happiness_clean = clean_happiness(happiness_orig)
happiness_clean

Unnamed: 0,country,year,happiness,log_gdp_per_capita,social_support,life_expectancy,freedom_choices,generosity,corruption,positive_affect,negative_affect,government_confidence,gini_index
0,Afghanistan,2008,3.723590,7.168690,0.450662,49.209663,0.718114,0.181819,0.881686,0.517637,0.258195,0.612072,
1,Afghanistan,2009,4.401778,7.333790,0.552308,49.624432,0.678896,0.203614,0.850035,0.583926,0.237092,0.611545,0.441906
2,Afghanistan,2010,4.758381,7.386629,0.539075,50.008961,0.600127,0.137630,0.706766,0.618265,0.275324,0.299357,0.327318
3,Afghanistan,2011,3.831719,7.415019,0.521104,50.367298,0.495901,0.175329,0.731109,0.611387,0.267175,0.307386,0.336764
4,Afghanistan,2012,3.782938,7.517126,0.520637,50.709263,0.530935,0.247159,0.775620,0.710385,0.267919,0.435440,0.344540
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1557,Zimbabwe,2013,4.690188,7.565154,0.799274,48.949745,0.575884,-0.076716,0.830937,0.711885,0.182288,0.527755,0.555439
1558,Zimbabwe,2014,4.184451,7.562753,0.765839,50.051235,0.642034,-0.045885,0.820217,0.725214,0.239111,0.566209,0.601080
1559,Zimbabwe,2015,3.703191,7.556052,0.735800,50.925652,0.667193,-0.094585,0.810457,0.715079,0.178861,0.590012,0.655137
1560,Zimbabwe,2016,3.735400,7.538829,0.768425,51.800068,0.732971,-0.065283,0.723612,0.737636,0.208555,0.699344,0.596690


Edit this cell to answer the following question:
7. What variables were dropped from the original data set?  Would you drop any additional variables from this data set and why?
8. How would you impute the gini_index variable? Explain why. (You do not have to write code to do this unless you need to do so for your model. In this case, include an imputation function and call it from the data cleaning function)

7. The variables that were dropped from the original data set are: Democratic Quality, Delivery Quality, SD of Ladder by country-year, SD/Mean of ladder by country-year, GINI Index, GINI of Household Income. I would consider dropping either corruption or gini index, as the two of them are often interlinked. Government corruption in many countries often leads to high levels of income inequality, so there could be some overlap between the two variables.

8. I would impute the gini_index variable by replacing any instances where there is no gini_index value with the average between the average 2000-2015 gini index and the gini index of household income, assuming that both exist. If both don't exist replace the missing value with whatever of the two values is still present in the data.

Now we will visualize the relationships between variables.

# Plot Guidelines
For all plots and visualizations for this assignment please include


*   Captions: Descriptive captions summarizing the plot's insights.
*   Legends: Clear legends identifying each element in the plot.
*   Axis Labels: Informative labels for both the x and y axes, including units if applicable.
* Style: appropriate colors, font sizes, and plot layouts for better readability and presentation.

*It is important that your visualizations are easy-to-understand plots.*

In [26]:
# Since we are predicting happiness, we need to figure out what variable to use
# as a predictor.
# TODO: Find and justify choice of predictor variable
# 1. Calculate the correlation between the happiness variable
#   and your set of remaining potential predictor variables
# 2. Visualize the correlations between the dependent variable

numeric_df = happiness_clean.select_dtypes(include="number")
corr_matrix = numeric_df.corr()
corr_life = corr_matrix['happiness'].sort_values(ascending=False)
corr_df = corr_life.reset_index()
corr_df.columns = ['Variable', 'Correlation']

fig = px.bar(
    corr_df,
    x='Correlation',
    y='Variable',
    orientation='h',
    color='Correlation',
    title='Correlation of Dependent Variables with Happiness',
)

fig.show()

Edit this cell to answer the following question:
9. From this investigation, what variable do you choose as your predictor and why?

I would choose log_gdp_per_capita as it has the highest correlation value (besides happiness itself) with our happiness variable.

**Separate data into training and validation sets**

During this stage it is important that we choose data sets for training predictive models and validating predictive models.

In [14]:
# TODO: Write code here to separate your data into a training and validation set
# (You do not need to worry about a test set right now)
# Explain your decision to separate the data this way.
# You will reuse these data subsets in the following DSLC stage.
from sklearn.model_selection import train_test_split

X = happiness_clean[['log_gdp_per_capita']]
y = happiness_clean['happiness']

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

#I decided to separate the data into a 80% training set and a 20% validation set because it will give the model enough data to train on while also leaving a sizeable amount to test the model afterwards.

# [DSLC stage 4]: Predictive analysis

In this section we will examine the relationship between happiness as the response variable and your predictor variable. First we will visualize the relationship between happiness and the predictor variable.

In [27]:
#TODO: For your training and validation data sets
# 1. Create data frames that only contain columns for
# country, year, happiness, and your predictor variable
# 2. Create a scatterplot of happiness vs your predictor variable

train_df = happiness_clean.loc[X_train.index, ['country', 'year', 'happiness', 'log_gdp_per_capita']]
val_df   = happiness_clean.loc[X_val.index, ['country', 'year', 'happiness', 'log_gdp_per_capita']]

fig_train = px.scatter(
    train_df,
    x='log_gdp_per_capita',
    y='happiness',
    color='country',
    hover_name='country',
    title='Training Data: Happiness vs Log GDP per capita'
)

fig_train.update_layout(
    xaxis_title='Log GDP per capita',
    yaxis_title='Happiness',
    width=800,
    height=500
)

fig_val = px.scatter(
    val_df,
    x='log_gdp_per_capita',
    y='happiness',
    color='country',
    hover_name='country',
    title='Validation Data: Happiness vs Log GDP per capita'
)

fig_val.update_layout(
    xaxis_title='Log GDP per capita',
    yaxis_title='Happiness',
    width=800,
    height=500
)

fig_train.show()
fig_val.show()

# Modeling the relationship
**Using your training data set**

Train the LAD (L1 loss) and LS (L2 loss) linear fits for predicting happiness  based on your chosen predictor variable of your choosing.

*Once you have completed this, edit this cell here to report the formulas for your fitted models.*

LAD model formula:
happiness = -1.716 + 0.774 * log_gdp_per_capita

LS model formula:
happiness = -1.643 + 0.770 * log_gdp_per_capita

For the LAD model you will use  LADRegression from sklego.linear_model. Examples are available in the L04 notebook and [API documentation](https://koaning.github.io/scikit-lego/user-guide/linear-models/#least-absolute-deviation-regression)

In [35]:
# TODO: Add code here to
# 1. Train LAD model on your training Set
# 2. Get the parameters of your model to write formula
# 3. Train LS model on your training Set
# 4. Get the parameters of your model to write formula
# 5. Create a scatterplot of happiness vs your predictor variable
#    with a line for ea
happiness_clean = happiness_clean.dropna()

X = happiness_clean[['log_gdp_per_capita']]
y = happiness_clean['happiness']

lad_model = LADRegression().fit(X_train, y_train)
lad_intercept = lad_model.intercept_
lad_coef = lad_model.coef_[0]
print(lad_intercept, lad_coef)

-1.716117783403469 0.7739592046721133



Please consider using scikit-learn version of quantile regression.

Hint: `from sklearn.linear_model import QuantileRegressor`
Docs: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.QuantileRegressor.html



For the LS model you will use  LinearRegression from sklearn.linear_model. Examples are available in the L04 notebook and [API documentation](https://scikit-learn.org/dev/modules/generated/sklearn.linear_model.LinearRegression.html)

> Add blockquote



In [36]:
# TODO: Add code here to
# 3. Train LS model on your training Set
# 4. Get the parameters of your model to write formula
# 5. Create a scatterplot of happiness vs your predictor variable
#    with a line for each model

ls_model = LinearRegression().fit(X_train, y_train)
ls_intercept = ls_model.intercept_
ls_coef = ls_model.coef_[0]
print(ls_intercept, ls_coef)

-1.6427999973326566 0.7698163166686945


In [34]:
# TODO: Add code here to
# 5. Create a scatterplot of happiness vs your predictor variable
#    with a line for each model
plotting_df = happiness_clean[['log_gdp_per_capita', 'happiness']].copy()

lad_line = lad_intercept + lad_coef * plotting_df['log_gdp_per_capita']
ls_line = ls_intercept + ls_coef * plotting_df['log_gdp_per_capita']

fig = px.scatter(
    plotting_df,
    x='log_gdp_per_capita',
    y='happiness',
    title='Happiness vs Log GDP per Capita with LAD and LS Fits'
)

fig.add_scatter(
    x=plotting_df['log_gdp_per_capita'],
    y=lad_line,
    mode='lines',
    name='LAD',
    line=dict(dash='dot')
)

fig.add_scatter(
    x=plotting_df['log_gdp_per_capita'],
    y=ls_line,
    mode='lines',
    name='LS',
    line=dict(color='red')
)

fig.update_layout(
    xaxis_title='Log GDP per Capita',
    yaxis_title='Happiness',
    width=800,
    height=500
)
fig.show()

Now we'd like to evaluate how each model has done.

**Using your validation data set**

Compute the rMSE, MAE, MAD, correlation and  $R^2$  evaluations for each algorithm.

In [40]:
#TODO: Write code here to
# Create a  3 column dataframe that for each point in your validation set
# contains the actual observed happiness score, the happiness score predicted
# from LAD, and the happiness score predicted from LS

lad_pred = lad_model.predict(X_val)
ls_pred = ls_model.predict(X_val)

results = pd.DataFrame({
    'observed_happiness': y_val.values,
    'LAD_pred': lad_pred,
    'LS_pred': ls_pred
})

print(results.head())

   observed_happiness  LAD_pred   LS_pred
0            4.966812  5.939953  5.972288
1            5.372040  6.156032  6.187211
2            4.462399  5.731475  5.764927
3            4.453083  5.236806  5.272906
4            6.824173  5.815329  5.848332


In [47]:
# TODO: Write code here to
# Create a scatterplot of the observed happiness score vs
# the happiness score predicted from LAD
# What would a perfect prediction look like?

fig = px.scatter(
    results,
    x='observed_happiness',
    y='LAD_pred',
    title='Observed vs Predicted Happiness (LAD)'
)

fig.update_layout(
    xaxis_title='Observed Happiness',
    yaxis_title='Predicted Happiness (LAD)',
    width=700,
    height=500
)

fig.show()

# a perfect prediction would involve every predicted happiness point matching exactly with every observed happiness point. Graphically that would have all data points be on a line with slope 1 and y-intercept 0.

In [46]:
# TODO: Write code here to
# Create a scatterplot of the observed happiness score vs
# the happiness score predicted from LS
# What would a perfect prediction look like?

fig = px.scatter(
    results,
    x='observed_happiness',
    y='LS_pred',
    title='Observed vs Predicted Happiness (LS)'
)

fig.update_layout(
    xaxis_title='Observed Happiness',
    yaxis_title='Predicted Happiness (LS)',
    width=700,
    height=500
)

fig.show()

# a perfect prediction would involve every predicted happiness point matching exactly with every observed happiness point. Graphically that would have all data points be on a line with slope 1 and y-intercept 0.

In [50]:
#TODO: Using the dataframe that you created
# Write code in this cell to calculate and print
# the rMSE, MAE, MAD, correlation, and R2 of
# the true price with the LS and LAD predictions

def get_metrics(y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    mad = np.median(np.abs(y_true - y_pred))
    corr = np.corrcoef(y_true, y_pred)[0, 1]
    r2 = r2_score(y_true, y_pred)
    return rmse, mae, mad, corr, r2

lad_rmse, lad_mae, lad_mad, lad_corr, lad_r2 = get_metrics(results['observed_happiness'], results['LAD_pred'])

ls_rmse, ls_mae, ls_mad, ls_corr, ls_r2 = get_metrics(results['observed_happiness'], results['LS_pred'])

print(f"LAD Metrics: RMSE: {lad_rmse:.4f} MAE: {lad_mae:.4f} MAD: {lad_mad:.4f} Correlation: {lad_corr:.4f} R^2: {lad_r2:.4f}")
print(f"LS Metrics: RMSE: {ls_rmse:.4f} MAE: {ls_mae:.4f} MAD: {ls_mad:.4f} Correlation: {ls_corr:.4f} R^2: {ls_r2:.4f}")


LAD Metrics: RMSE: 0.7370 MAE: 0.6104 MAD: 0.5502 Correlation: 0.7591 R^2: 0.5745
LS Metrics: RMSE: 0.7372 MAE: 0.6105 MAD: 0.5517 Correlation: 0.7591 R^2: 0.5743


**Evaluating the models**

Based on the scatterplots and evaluation metrics that you have calculated, what model is better for the relationship between happiness and your predictor variable? Please explain why with supporting evidence from your plots and calculations.

I would say that the LAD Regression model is better for predicting the relationship between happiness and log_gdp_per_capita, as the RMSE, MAE, MAD are slightly lower than their respective values for the LS Regression. Additionally, the LAD Regression has a slightly higher R^2 value compared to LS Regression. However, the graphs look nearly identical to each other, and the closeness of our performance metrics also serves as evidence for how close the accuracy and performance of these two models was.

**Citation:**

This problem set is adapted from Ch. 9 exercise 22 from the following upcoming book:

Yu, B., & Barter, R. L. (2024). Veridical data science: The practice of responsible data analysis and decision making. The MIT Press.