## 1.
1. Simple vs. Multiple Linear Regression

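	•	Simple Linear Regression: Models the outcome  Y  using a single predictor  X :
 Y = \beta_0 + \beta_1 X + \epsilon 
Useful for analyzing one predictor’s effect on  Y .
	•	Multiple Linear Regression: Uses several predictors to explain  Y :
 Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon 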
Advantage: It allows you to consider multiple factors, giving a more comprehensive model.
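
The contrast can be sketched on simulated data. This is a minimal illustration assuming NumPy is available; the variable names (x1, x2), the true coefficients, and the noise level are made up for the example and do not come from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Simulated outcome that depends on both predictors (coefficients are made up)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

def fit_ols(X, y):
    """Least-squares coefficients and R^2 for a design matrix X with an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return beta, r2

ones = np.ones(n)
beta_simple, r2_simple = fit_ols(np.column_stack([ones, x1]), y)        # one predictor
beta_multiple, r2_multiple = fit_ols(np.column_stack([ones, x1, x2]), y)  # two predictors

print("simple:  ", beta_simple.round(2), round(r2_simple, 3))
print("multiple:", beta_multiple.round(2), round(r2_multiple, 3))
```

Because the simple model is nested inside the multiple one, its in-sample R^2 can never be higher when both are fit to the same data.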

2. Continuous vs. Indicator Variables in Simple Linear Regression

	•	Continuous Variable: A predictor that can take on any numeric value. With a continuous  X :
 Y = \beta_0 + \beta_1 X + \epsilon 
 Y  changes continuously with  X .
	•	Indicator Variable: A binary predictor (0 or 1) representing group membership:
 Y = \beta_0 + \beta_1 D + \epsilon, \quad D \in \{0, 1\} 
This creates a categorical shift in  Y  based on group (rather than a continuous trend).
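
One way to see the "categorical shift" is that, with a single indicator predictor, the fitted intercept is the mean of the group coded 0 and the fitted coefficient is the difference between the two group means. The sketch below checks this on simulated data; the names d and y and all numeric values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = rng.integers(0, 2, size=100)   # indicator: 0 or 1
d[:2] = [0, 1]                     # make sure both groups are present
y = 3.0 + 2.0 * d + rng.normal(size=100)

# Design matrix: intercept column plus the indicator
X = np.column_stack([np.ones_like(y), d])
(beta0, beta1), *_ = np.linalg.lstsq(X, y, rcond=None)

mean0 = y[d == 0].mean()
mean1 = y[d == 1].mean()
print(f"intercept {beta0:.3f} vs mean of group 0 {mean0:.3f}")
print(f"coefficient {beta1:.3f} vs difference in means {mean1 - mean0:.3f}")
```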

3. Adding an Indicator Variable in Multiple Linear Regression

Combining a continuous variable  X  and an indicator variable  D  in the same model:
 Y = \beta_0 + \beta_1 X + \beta_2 D + \epsilon 
produces parallel lines for each group defined by the indicator. The slope for  X  is the same, but each group has its own intercept.
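
Writing  X  for the continuous predictor and  D  for the indicator (notation assumed here), substituting the two possible values of the indicator shows where the parallel lines come from:

```latex
\begin{align*}
D = 0:&\quad E[Y \mid X] = \beta_0 + \beta_1 X \\
D = 1:&\quad E[Y \mid X] = (\beta_0 + \beta_2) + \beta_1 X
\end{align*}
```

Both lines have slope  \beta_1 ; only the intercept moves, by  \beta_2 .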

4. Interaction Between Continuous and Indicator Variables

Adding an interaction term changes the relationship between the continuous variable  X  and  Y  based on the group:
 Y = \beta_0 + \beta_1 X + \beta_2 D + \beta_3 (X \times D) + \epsilon 
Now, each group defined by the indicator  D  has its own slope and intercept, allowing the lines to diverge based on group.
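
A small simulation can make the "own slope and intercept" claim concrete. In this sketch (all names and numbers are assumptions chosen for illustration), the interaction model is fit once on the pooled data, and the implied per-group lines are compared with lines fit to each group separately.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(0, 10, size=n)
d = rng.integers(0, 2, size=n)
# Made-up truth: group 0 follows 1 + 0.5x, group 1 follows 3 + 1.5x
y = 1.0 + 0.5 * x + 2.0 * d + 1.0 * x * d + rng.normal(scale=0.3, size=n)

# Columns: intercept, x, d, and the interaction x * d
X = np.column_stack([np.ones(n), x, d, x * d])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"group 0: intercept {b0:.2f}, slope {b1:.2f}")
print(f"group 1: intercept {b0 + b2:.2f}, slope {b1 + b3:.2f}")

# Separate straight-line fits per group; np.polyfit returns [slope, intercept]
line0 = np.polyfit(x[d == 0], y[d == 0], 1)
line1 = np.polyfit(x[d == 1], y[d == 1], 1)
```

The pooled interaction fit and the two separate fits give the same lines, which is why the interaction model is often described as fitting one line per group.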

5. Multiple Linear Regression with Indicator Variables for Categorical Variables

For a categorical variable with  k  categories, we use  k - 1  indicators. For example, with  D_j = 1  for observations in category  j  and 0 otherwise:
 Y = \beta_0 + \beta_1 D_2 + \beta_2 D_3 + \cdots + \beta_{k-1} D_k + \epsilon 
	•	The intercept  \beta_0  is the baseline (e.g., first category).
	•	Each indicator shifts  Y  relative to this baseline.

This approach helps compare each category against a reference, giving each category a unique intercept while avoiding redundancy (only using  k - 1  indicators).
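
The redundancy can be checked directly: with an intercept and one indicator per category, the indicator columns sum to the intercept column, so the design matrix loses rank. The sketch below uses a made-up three-level variable (levels "A", "B", "C" are assumptions for illustration).

```python
import numpy as np

categories = np.array(["A", "B", "C", "A", "C", "B", "A", "C"])
levels = np.unique(categories)   # sorted levels; the first one ("A") acts as the baseline

# One indicator column per level: shape (n, k)
all_dummies = (categories[:, None] == levels[None, :]).astype(float)
intercept = np.ones((len(categories), 1))

X_full = np.hstack([intercept, all_dummies])         # intercept + k indicators
X_ref = np.hstack([intercept, all_dummies[:, 1:]])   # intercept + (k - 1) indicators

rank_full = np.linalg.matrix_rank(X_full)
rank_ref = np.linalg.matrix_rank(X_ref)
print(X_full.shape[1], "columns, rank", rank_full)
print(X_ref.shape[1], "columns, rank", rank_ref)
```

The full matrix has more columns than its rank, so its coefficients are not uniquely determined; dropping the baseline indicator removes the problem.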

## 2.

Outcome and Predictor Variables

	•	Outcome Variable: Sales of sports equipment ( \text{Sales} ).
	•	Predictor Variables:
	•	TV Ad Spend ( \text{TV} ): Represents the amount spent on TV advertising.
	•	Online Ad Spend ( \text{Online} ): Represents the amount spent on online advertising.

Interaction Consideration

	•	Potential Interaction: The effectiveness of TV ad spend on sales may depend on the level of online ad spend, and vice versa. This interaction suggests that if both ad spends are high, the combined effect on sales could be different (e.g., synergistic) than if each ad spend were considered in isolation.

Linear Forms with and without Interaction

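	1.	Without Interaction (Additive Model):
 \text{Sales} = \beta_0 + \beta_1 \text{TV} + \beta_2 \text{Online} + \epsilon 
Here, the effect of each type of ad spend on sales does not depend on the level of the other.
	2.	With Interaction (Interactive Model):
 \text{Sales} = \beta_0 + \beta_1 \text{TV} + \beta_2 \text{Online} + \beta_3 (\text{TV} \times \text{Online}) + \epsilon 
The interaction term  \beta_3 (\text{TV} \times \text{Online})  allows the effect of each type of ad spend to vary based on the level of the other. This could mean, for example, that high spending on both platforms has an amplified effect on sales.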

Using These Models for Predictions

	•	Without Interaction: Predict sales by adding the separate contributions of TV and online ad spends. Each has a fixed effect on  \text{Sales} , regardless of the level of the other.
	•	With Interaction: Predict sales by accounting for both individual and combined effects. Here, if both ad spends are high, the interaction could amplify the impact on  \text{Sales} , giving different sales outcomes than the additive model.
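
The difference between the two prediction rules can be sketched with plain arithmetic. The coefficient values below (b0, b1, b2, b3) are hypothetical numbers chosen for illustration, not estimates from any data.

```python
def predict_additive(tv, online, b0=10.0, b1=2.0, b2=3.0):
    """Additive model: each ad channel contributes independently."""
    return b0 + b1 * tv + b2 * online

def predict_interaction(tv, online, b0=10.0, b1=2.0, b2=3.0, b3=0.5):
    """Interaction model: adds a term proportional to tv * online."""
    return b0 + b1 * tv + b2 * online + b3 * tv * online

# Effect of one extra unit of TV spend (5 -> 6) at low and high online spend
effects = {}
for online in (1.0, 10.0):
    add_effect = predict_additive(6.0, online) - predict_additive(5.0, online)
    int_effect = predict_interaction(6.0, online) - predict_interaction(5.0, online)
    effects[online] = (add_effect, int_effect)
    print(f"online={online}: additive effect {add_effect}, interaction effect {int_effect}")
```

Under the additive rule the extra unit of TV spend is always worth b1. Under the interaction rule it is worth b1 + b3 * online, so it grows with online spend when b3 is positive.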

## 3.

Interaction Model

	•	Linear Form:  \text{logit}(P) = \beta_0 + \beta_1 \text{age} + \beta_2 \text{employment\_status} + \beta_3 (\text{age} \times \text{employment\_status}) 
	•	Interpretation:
	•	 \beta_3 : Captures how the effect of age on log odds depends on employment status.
	•	If significant, the slopes of the lines (age effect) differ for employed and unemployed groups.

Visualization

	•	Additive Model Plot: Both groups share one age slope, so the employed and unemployed lines are parallel on the log-odds scale. Plotted as probabilities of high connection, the two curves need not look parallel.
	•	Interaction Model Plot: Allows the relationship between age and high connection probability to vary by employment status.

Statistical Evidence

	•	Use the .summary() output to evaluate p-values for each coefficient. A small p-value for  \beta_3  (interaction term) is evidence that the effect of age differs by employment status, which supports the interaction model over the additive one.
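
Assuming employment_status is coded 1 for employed and 0 for unemployed, substituting each value into the linear form shows how  \beta_3  changes the age slope:

```latex
\begin{align*}
\text{employment\_status} = 0:&\quad \text{logit}(P) = \beta_0 + \beta_1\,\text{age} \\
\text{employment\_status} = 1:&\quad \text{logit}(P) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\,\text{age}
\end{align*}
```

When  \beta_3 = 0  the two lines have the same slope, which is the additive model.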

In [1]:
import pandas as pd
import statsmodels.formula.api as smf
import numpy as np
import plotly.graph_objects as go

# Load the dataset (replace the path with the actual file location or URL if available)
data = pd.read_csv("CSCS_data_anon.csv")

# Create a binary outcome variable
# Assuming 'social_connection_level' is a categorical column with values like 'high', 'low'
data['high_connection'] = (data['social_connection_level'] == 'high').astype(int)

# Ensure predictors are prepared (e.g., binary or continuous)
# Example: 'employment_status' (binary) and 'age' (continuous)
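# Drop rows with a missing outcome or predictor (otherwise a missing outcome is silently coded as 0);
# .copy() avoids modifying a view in the assignment below
data = data.dropna(subset=['social_connection_level', 'age', 'employment_status']).copy()
data['employment_status'] = (data['employment_status'] == 'employed').astype(int)

# Additive logistic regression model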
additive_formula = 'high_connection ~ age + employment_status'
additive_model = smf.logit(additive_formula, data=data).fit()
print(additive_model.summary())
# Interaction logistic regression model
interaction_formula = 'high_connection ~ age * employment_status'
interaction_model = smf.logit(interaction_formula, data=data).fit()
print(interaction_model.summary())
# Simulate continuous predictor (age) range for predictions
ages = np.linspace(data['age'].min(), data['age'].max(), 100)

# Predict probabilities for the additive model
employed_additive_probs = additive_model.predict(pd.DataFrame({'age': ages, 'employment_status': 1}))
unemployed_additive_probs = additive_model.predict(pd.DataFrame({'age': ages, 'employment_status': 0}))

# Predict probabilities for the interaction model
employed_interaction_probs = interaction_model.predict(pd.DataFrame({'age': ages, 'employment_status': 1}))
unemployed_interaction_probs = interaction_model.predict(pd.DataFrame({'age': ages, 'employment_status': 0}))
fig_interaction = go.Figure()

fig_interaction.add_trace(go.Scatter(
    x=ages, y=employed_interaction_probs, mode='lines', name='Employed (Interaction)'
))
fig_interaction.add_trace(go.Scatter(
    x=ages, y=unemployed_interaction_probs, mode='lines', name='Unemployed (Interaction)', line=dict(dash='dash')
))

fig_interaction.update_layout(
    title="Interaction Model: Predicted Probabilities",
    xaxis_title="Age",
    yaxis_title="Probability of High Connection",
    legend_title="Employment Status"
)
fig_interaction.show()

KeyError: 'social_connection_level'

## 4. 

	1.	Low  R^2 : The model explains only 17.6% of the variation in the outcome, suggesting limited overall explanatory power.
	2.	Significant Coefficients: Small p-values indicate strong evidence that specific predictors are associated with the outcome. The size of a coefficient alone is not evidence.

	•	 R^2  reflects the overall model’s ability to explain variability in  Y , which can be low if important predictors are missing or the outcome is highly random.
	•	P-values test the statistical significance of individual predictors, showing whether there is evidence of an association between a predictor and  Y , even if the overall model is weak.

These metrics address different questions. A low  R^2  means much of the variation in  Y  is left unexplained, while significant p-values show that the included predictors are still associated with the outcome. Both can coexist.
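
A short simulation can show both conditions holding at once. This is a sketch under stated assumptions: the data are simulated with a weak true slope and a lot of noise, and the p-value uses a normal approximation to the t distribution so that only NumPy and the standard library are needed.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)   # weak signal, lots of noise (made-up values)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

# Standard error of the slope, then a two-sided p-value (normal approximation)
sigma2 = resid @ resid / (n - 2)
se_slope = math.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_stat = beta[1] / se_slope
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"R^2 = {r2:.3f}, slope = {beta[1]:.3f}, p-value ~ {p_value:.2e}")
```

The slope is estimated precisely because there are many observations, yet most of the variation in y is noise that no predictor in the model can explain.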

## 5.

The code demonstrates how to evaluate a model’s generalizability by comparing in-sample and out-of-sample  R^2 , highlighting the trade-off between simplicity and complexity, and the risks of overfitting.

Cell-by-Cell Explanation

	1.	Data Splitting:
	•	Splits the dataset into 50% training and 50% testing sets to assess the model’s performance on unseen data.
	•	Ensures reproducibility using a random seed.
	2.	Simple Model Fit:
	•	Fits a model predicting HP using Attack and Defense as predictors.
	•	 R^2  from this model reflects how well it explains variability in the training set.
	3.	Simple Model Evaluation:
	•	Compares the model’s performance in the training data (in-sample  R^2 ) with the test data (out-of-sample  R^2 ).
	•	A large gap between these metrics suggests overfitting.
	4.	Complex Model Fit:
	•	Adds many predictors and interaction terms, creating a more flexible but also more complex model.
	•	Likely achieves a higher in-sample  R^2  due to capturing more patterns, including noise.
	5.	Complex Model Evaluation:
	•	Assesses generalizability by comparing in-sample  R^2  to out-of-sample  R^2 .
	•	If out-of-sample  R^2  drops significantly, the model is overfit and fails to generalize.

Key Insights

	•	In-sample  R^2  shows how well the model fits the training data.
	•	Out-of-sample  R^2  measures how well the model predicts new, unseen data.
	•	Simple models tend to generalize better but might miss important patterns.
	•	Complex models fit the training data well but risk overfitting and performing poorly on test data.

This code highlights why evaluating both in-sample and out-of-sample performance is crucial for building models that generalize effectively.
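
The same comparison can be sketched without the original dataset. The example below is an illustration on simulated data, not the code being described: it assumes a truly linear relationship, uses polynomial degree as a stand-in for model complexity, and measures out-of-sample fit as the squared correlation between held-out outcomes and predictions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
x = rng.uniform(-1, 1, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)   # the truth is linear (made up)

# 50/50 train/test split; the seeded generator makes it reproducible
idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2 :]

def design(x, degree):
    """Polynomial design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def r2_in_and_out(degree):
    beta, *_ = np.linalg.lstsq(design(x[train], degree), y[train], rcond=None)
    fitted = design(x[train], degree) @ beta
    in_r2 = 1 - np.sum((y[train] - fitted) ** 2) / np.sum((y[train] - y[train].mean()) ** 2)
    pred = design(x[test], degree) @ beta
    out_r2 = np.corrcoef(y[test], pred)[0, 1] ** 2   # squared correlation on held-out data
    return in_r2, out_r2

simple_in, simple_out = r2_in_and_out(1)      # straight line
complex_in, complex_out = r2_in_and_out(10)   # degree-10 polynomial
print(f"simple:  in-sample {simple_in:.3f}, out-of-sample {simple_out:.3f}")
print(f"complex: in-sample {complex_in:.3f}, out-of-sample {complex_out:.3f}")
```

The complex model's in-sample value is guaranteed to be at least as high as the simple model's, since the simple model is a special case of it. Whether that gain survives on the test half is what the out-of-sample value checks.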

## 6. 
The formula model4_linear_form generates many new predictors (interaction terms) in the design matrix, creating a highly complex representation of the data. Many of these columns are strongly correlated with one another (multicollinearity), which makes the coefficient estimates unstable. With so many parameters, the model also fits noise in the training data (overfitting), so it generalizes poorly to unseen data, as shown by a significant drop in out-of-sample  R^2 .
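
One rough way to see this effect is the condition number of the design matrix, which grows as columns become closer to linearly dependent. The sketch below does not reproduce model4_linear_form; it assumes three simulated, uncentered predictors (a, b, c) on a stat-like scale and adds their interaction columns by hand.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
# Three positive, uncentered predictors (simulated, not the real data)
a, b, c = rng.uniform(20, 150, size=(3, n))

ones = np.ones(n)
X_additive = np.column_stack([ones, a, b, c])
# Add all two-way interactions and the three-way interaction
X_interact = np.column_stack([X_additive, a * b, a * c, b * c, a * b * c])

cond_additive = np.linalg.cond(X_additive)
cond_interact = np.linalg.cond(X_interact)
print("columns:", X_additive.shape[1], "->", X_interact.shape[1])
print(f"condition number: {cond_additive:.3g} -> {cond_interact:.3g}")
```

Adding columns can never lower the condition number, and product terms of uncentered predictors typically raise it sharply, which is one signal of the instability described above.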

## 7.
	•	Model 5 balances simplicity (Model 3) with complexity (Model 4).
	•	Model 6 expands Model 5 by testing additional terms based on data-driven or theoretical insights.
	•	Model 7 streamlines Model 6 by focusing on the most significant terms, aiming for a generalizable and interpretable model.

## 8. 
This approach highlights the variability in model performance due to randomness in training/test splits. By comparing in-sample and out-of-sample R^2 across many iterations, we can better understand the model’s reliability and generalization, avoiding conclusions based on a single train-test split.

In [2]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Load the dataset. "pokemon_data.csv" is a hypothetical local path (an assumption);
# replace it with the actual file location or URL of your data.
pokeaman = pd.read_csv("pokemon_data.csv")
# pokeaman.fillna('None', inplace=True)

# Initialize lists to store in-sample and out-of-sample R^2 values
in_sample_r2 = []
out_of_sample_r2 = []

# Loop to repeat model fitting and evaluation multiple times
for i in range(100):  # 100 iterations
    # Randomly split data into training and testing sets (no fixed seed here)
    train, test = train_test_split(pokeaman, train_size=0.5)

    # Fit a simple model (adjust formula to your context)
    model = smf.ols(formula='HP ~ Attack + Defense', data=train).fit()

    # In-sample R^2 from the fitted model
    in_sample_r2.append(model.rsquared)

    # Out-of-sample R^2, computed as the squared correlation between
    # observed and predicted outcomes on the test set
    y_test = test['HP']
    yhat_test = model.predict(test)
    out_of_sample_r2.append(np.corrcoef(y_test, yhat_test)[0, 1]**2)

# Plot in-sample vs. out-of-sample R^2
plt.figure(figsize=(10, 6))
plt.scatter(in_sample_r2, out_of_sample_r2, alpha=0.7)
plt.axline((0, 0), (1, 1), linestyle='--', color='gray', label='Ideal Generalization')
plt.xlabel("In-Sample R^2")
plt.ylabel("Out-of-Sample R^2")
plt.title("In-Sample vs. Out-of-Sample R^2 Across Iterations")
plt.legend()
plt.show()

## 9.
This code tests how well models trained on Pokémon data from specific generations can predict stats for other generations. It compares the model’s performance in three scenarios:
	1.	Original Model: The model is trained and tested on the standard train-test split (all generations mixed). This gives a baseline for in-sample (training data) and out-of-sample (test data) performance.
	2.	Model Trained Only on Generation 1: The model is trained on just Generation 1 Pokémon and then used to predict stats for Pokémon from other generations. It fits the Generation 1 data well (high in-sample R^2) but likely struggles to predict other generations accurately (low out-of-sample R^2) because their stats may differ significantly.
	3.	Model Trained on Generations 1-5: The model is trained on Pokémon from Generations 1-5 and used to predict stats for Generation 6. This setup tests whether training on a broader set of data helps the model generalize better to unseen data (Generation 6).
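
The generation-based splits can be sketched as follows. This is not the code being described: the data are simulated, the arrays generation, attack, defense, and hp are stand-ins for the real columns, and the relationship between them (a slope that drifts with generation) is an assumption made so that the generations actually differ.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 600
generation = rng.integers(1, 7, size=n)   # simulated generations 1..6
attack = rng.uniform(20, 150, size=n)
defense = rng.uniform(20, 150, size=n)
# Made-up relationship whose attack slope drifts with generation
hp = 30 + (0.2 + 0.03 * generation) * attack + 0.1 * defense + rng.normal(scale=10, size=n)

def design(mask):
    """Design matrix (intercept, attack, defense) for the rows selected by mask."""
    return np.column_stack([np.ones(mask.sum()), attack[mask], defense[mask]])

def fit(mask):
    beta, *_ = np.linalg.lstsq(design(mask), hp[mask], rcond=None)
    return beta

def squared_corr(beta, mask):
    """Squared correlation between observed and predicted hp on the selected rows."""
    return np.corrcoef(hp[mask], design(mask) @ beta)[0, 1] ** 2

gen1, gen1to5, gen6 = generation == 1, generation <= 5, generation == 6
beta_gen1, beta_1to5 = fit(gen1), fit(gen1to5)

results = {
    "train gen 1, in-sample": squared_corr(beta_gen1, gen1),
    "train gen 1, test gens 2-6": squared_corr(beta_gen1, ~gen1),
    "train gens 1-5, in-sample": squared_corr(beta_1to5, gen1to5),
    "train gens 1-5, test gen 6": squared_corr(beta_1to5, gen6),
}
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

One caveat on the metric: squared correlation ignores systematic over- or under-prediction, so a model that is consistently biased on another generation can still score well. A measure based on prediction errors, such as 1 - SSE/SST on the test rows, would penalize that bias.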