<a href="https://colab.research.google.com/github/gitmystuff/DTSC4050/blob/main/Week_11-Regression_II/Week_11_Assignment_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 11 Assignment Example

Your Name


## Instructions

This assignment is a continuation of the Week 11 discussion on data preparation and ways to use Python to automate redundant tasks.

* Edit your name
* Replace the Story with your story
* Replace the variable names to fit the story
* Review the notebook
* Try to get everything to run
* Troubleshoot if needed
* Submit the shared link

# Part 1 - The Story and the Data

## Create the Data

In [None]:
# library to create fake data
!pip install Faker -q

In [None]:
# create fake demographics
import numpy as np
import pandas as pd
import random
from sklearn.datasets import make_regression
from faker import Faker

fake = Faker()

output = []
for x in range(100):
    sex = np.random.choice(['egg', 'seed'], p=[0.5, 0.5])
    output.append({
        'categorical_1': sex,
        'categorical_2': np.random.choice(['A', 'B', 'C']),
        'name_1': fake.first_name_female() if sex == 'egg' else fake.first_name_male(),
        'name_2': fake.last_name(),
        'zip_code': fake.zipcode(),
        'date': fake.date_of_birth(),
        'location': fake.state_abbr()
    })

demographics = pd.DataFrame(output)

def make_null(r, w):
    if random.randint(0, 99) < w:
        return np.nan
    else:
        return r

# Generating features for linear regression with generic names
features, target = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)

generic_cols = ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5']
random.shuffle(generic_cols)
df = pd.DataFrame(data=features, columns=generic_cols)
df['target_variable'] = target

# Introduce non-linearities and interactions
df['feature_1_squared'] = df['feature_1']**2
df['interaction_1_2'] = df['feature_1'] * df['feature_2']

# Apply transformations and add noise
df['target_variable'] = df['target_variable'] + np.random.normal(0, 5, 100)
df['feature_4'] = df['feature_4'].abs()  # make every value non-negative

# Add missing values
for col in generic_cols:
    df[col] = df[col].apply(make_null, args=(2,))

df = pd.concat([df, demographics], axis=1)

print(df.shape)
print(df.info())
df.head()

## Story

**The Great Martian Population Mystery**

In the year 2342, humanity has established thriving colonies across Mars. However, the Martian Central Command has noticed a perplexing trend: the population growth rate of these colonies varies wildly. They have collected data on several factors, including:

* **asteroid_impact:** The frequency and severity of nearby asteroid impacts.
* **solar_flare_intensity:** The intensity of solar flares affecting the colony.
* **alien_signal_strength:** The strength of mysterious alien signals detected.
* **temporal_anomaly_index:** A measure of temporal anomalies observed in the region.
* **cybernetic_enhancement_level:** The average level of cybernetic enhancements among the colonists.
* **colony_population_growth:** The observed population growth rate.

Additionally, they have demographic data on the colonists, including sex, brain wave patterns, names, zipcodes, birthdates, and state of origin (from Earth).

Martian Central Command needs your help to build a linear regression model that can predict the colony population growth rate based on these factors. They suspect that some of the factors may have non-linear relationships or interact with each other. They also know that some data is missing.

**Your Task:**

1.  Analyze the provided `data_science_fiction.csv` dataset.
2.  Clean and preprocess the data, handling missing values and potential outliers.
3.  Build a linear regression model to predict `colony_population_growth`.
4.  Evaluate the model's performance and interpret the coefficients.
5.  Write a report explaining your findings and providing insights into the factors that influence Martian colony population growth.
6. Create visualizations that help to explain your findings.

## Get Creative

My story talks about colonies so the following code snippet creates some colony names. This is unique to my story. Do you need to create unique names for something? Ask your assistant for help. You have an example below.

In [None]:
def fake_colony_name(): # only relevant to example story
    """Generates a fake colony name with a sci-fi feel."""
    name_formats = [
        "Colony " + fake.city_suffix() + " " + fake.word().capitalize(),
        fake.word().capitalize() + " " + fake.word().capitalize() + " Outpost",
        "Sector " + str(fake.random_int(min=1, max=100)) + " " + fake.word().capitalize(), # Convert the integer to a string using str()
        fake.word().capitalize() + " " + "Station " + str(fake.random_int(min=1, max=50)), # Convert the integer to a string using str()
        "Terra " + fake.word().capitalize(),
        fake.word().capitalize() + "-" + fake.word().capitalize() + " Base",
        fake.word().capitalize() + " " + "Settlement"
    ]
    return random.choice(name_formats)

def add_colony_names_to_dataframe(df, num_colonies): # only relevant to example story
    """Adds a 'colony_name' column to a DataFrame."""
    colony_names = [fake_colony_name() for _ in range(num_colonies)]
    df['colony_name'] = colony_names
    return df

# Add colony names to the DataFrame
df = add_colony_names_to_dataframe(df, len(df)) # only relevant to example story

## Rename Columns to Fit Story

In [None]:
def rename_columns(df, name_mapping):
    """
    Renames columns in a DataFrame based on a provided mapping.

    Args:
        df (pd.DataFrame): The DataFrame to rename columns in.
        name_mapping (dict): A dictionary where keys are generic column names
                              and values are the desired new column names.

    Returns:
        pd.DataFrame: The DataFrame with renamed columns.
    """
    return df.rename(columns=name_mapping)

# Example Usage (Students would create their own name_mapping)
example_name_mapping = {
    'feature_1': 'asteroid_impact',
    'feature_2': 'solar_flare_intensity',
    'feature_3': 'alien_signal_strength',
    'feature_4': 'temporal_anomaly_index',
    'feature_5': 'cybernetic_enhancement_level',
    'target_variable': 'colony_population_growth',
    'feature_1_squared' : 'asteroid_impact_squared',
    'interaction_1_2' : 'solar_flare_interaction',
    'categorical_1' : 'sex',
    'categorical_2' : 'brain_wave',
    'name_1': 'given_name',
    'name_2': 'surname',
    'zip_code': 'zipcode',
    'date': 'date_of_birth',
    'location': 'state_of_origin'
}

# Add missing values to demographic data
df['categorical_1'] = df['categorical_1'].apply(make_null, args=(5,))

df = df.sample(frac=1).reset_index(drop=True)

df = rename_columns(df, example_name_mapping)

# index=False keeps the row index from being written as an extra unnamed column
df.to_csv('data_science_fiction.csv', index=False)

print(df.shape)
print(df.info())
df.head()

# Part 2 - The Analysis

## Some Prep Ideas

* Missing Values
* Categorical Encoding
* Duplicates
* Scaling

## Explanations for Cleaning Techniques

* Check out https://github.com/gitmystuff/DSChunks/blob/main/PrePy.ipynb for explanations

In [None]:
# separate X from y
# REPLACE YOUR TARGET VARIABLE
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

X = df.drop(['colony_population_growth'], axis=1)
y = df['colony_population_growth']

X.info()


In [None]:
# missing values using bulky code example
df_categorical_features = df.select_dtypes(include=['category', 'object']).columns
dfx = X.copy()
for feat in df.columns[df.isnull().sum() > 0]:  # every column with at least one missing value
  if feat in df_categorical_features:
    dfx[feat] = df[feat].fillna(df[feat].mode()[0])
  else:
    if abs(df[feat].skew()) < .8:
      dfx[feat] = df[feat].fillna(round(df[feat].mean(), 2))
    else:
      dfx[feat] = df[feat].fillna(df[feat].median())

X = dfx.copy()
# precaution for glitches in code example
for col in X.select_dtypes(include=np.number).columns:
    X[col] = X[col].fillna(X[col].mean())

X.info()

## PrePy

PrePy is an example of storing reusable chunks of code in one place rather than copying and pasting bulky code between notebooks. Check out the README to see some of the functions used in this example.

https://github.com/gitmystuff/preppy/tree/main

In [None]:
! git clone https://github.com/gitmystuff/preppy.git

Check out your session storage folder to see that preppy has been cloned and is now usable.

In [None]:
from preppy.version import __version__
print(__version__)

In [None]:
# categorical encoding with less bulky code
import preppy.utils as preppy

X = preppy.functions.do_OHE(X)

# adding duplicate rows for the example
dupes = X.loc[:7]
X = pd.concat([X, dupes], axis=0)

X.info()

## Pipelines

Scikit-learn's pipeline is designed to streamline and automate machine learning workflows. It expects each step in the pipeline to adhere to a specific interface, ensuring consistent behavior and compatibility. Here's why custom classes are often needed:

**1. Standardization:**

* Scikit-learn operates on the principle of standardized transformers and estimators. This means that all components in a pipeline should have `fit()` and `transform()` methods (for transformers) or `fit()` and `predict()` methods (for estimators).
* These methods provide a predictable way to interact with each component, regardless of its underlying functionality.
* When you create a custom class that inherits from `BaseEstimator` and `TransformerMixin` (or `ClassifierMixin` or `RegressorMixin` for estimators), you are telling scikit-learn that your class follows this standard.

**2. Seamless Integration:**

* The pipeline's strength lies in its ability to chain together multiple steps, like data preprocessing, feature selection, and model training.
* For this chaining to work smoothly, each step must provide a consistent input and output format.
* Custom classes, when properly implemented, ensure that the data passed between pipeline steps is in the expected format (e.g., NumPy arrays or Pandas DataFrames).

**3. Preventing Errors:**

* Scikit-learn's pipeline handles details like cross-validation, grid search, and model persistence.
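* Plain functions or arbitrary code snippets cannot be placed directly in a pipeline, because scikit-learn would have no `fit()`/`transform()` interface through which to manage these operations. (A stateless function can be wrapped with `sklearn.preprocessing.FunctionTransformer`.)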
* Custom classes, with their defined methods, allow scikit-learn to correctly handle data transformations and model fitting across different pipeline stages.
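
As a rough illustration of how a pipeline lets scikit-learn manage cross-validation and grid search, here is a minimal sketch. It uses synthetic data from `make_regression` rather than the assignment's DataFrame, and the names `X_demo`, `y_demo`, `demo_pipe`, and the `alpha` grid are assumptions chosen only for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the assignment's DataFrame)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 n_informative=3, noise=10, random_state=0)

# Each step is a (name, object) pair; the last step is the model
demo_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('lasso', Lasso()),
])

# Step parameters are addressed as <step name>__<parameter name>
param_grid = {'lasso__alpha': [0.01, 0.1, 1.0, 10.0]}

# In every fold, the scaler is fit on that fold's training rows only
search = GridSearchCV(demo_pipe, param_grid, cv=5, scoring='r2')
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the scaler lives inside the pipeline, it is refit within each cross-validation fold, which avoids leaking information from the validation rows into the scaling step.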

**4. Maintaining State:**

* Some transformations or model fitting procedures require storing information learned from the training data (e.g., mean and standard deviation for scaling, learned model coefficients).
* Classes can maintain this "state" as attributes (e.g., `self.mean_`, `self.coef_`).
* This matters because the `transform()` method needs the values learned in `fit()`. Plain functions do not retain state between calls.

**In essence:**

* Scikit-learn's pipeline relies on objects that have consistent methods and data formats.
* Custom classes are used to wrap your custom operations, so they can interact with the pipeline in a way that scikit-learn understands. This makes the pipeline more robust, and less prone to errors.
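
To make the idea of state concrete, the sketch below defines a hypothetical `MeanCenterer` transformer (not part of scikit-learn or preppy). It learns column means in `fit()` and reuses them in `transform()`, including on rows it has never seen.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts each column's training mean."""

    def fit(self, X, y=None):
        # Learned state is stored on the instance; the trailing underscore
        # follows scikit-learn's naming convention for fitted attributes
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        # Reuse the means learned in fit(), even on new data
        return X - self.mean_

train = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})
new = pd.DataFrame({'a': [4.0], 'b': [40.0]})

centerer = MeanCenterer().fit(train)
print(centerer.mean_.tolist())   # [2.0, 20.0]
print(centerer.transform(new))
```

The new row is centered with the training means (2.0 and 20.0), not with its own values, which is the behavior a pipeline relies on when it applies a fitted step to test data.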


In [None]:
# example of custom class to remove duplicate rows and columns
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DuplicateRemover(BaseEstimator, TransformerMixin):
    """
    A simple transformer that removes duplicate rows and columns from a Pandas DataFrame.
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        """Fits the transformer (no fitting needed in this case)."""
        return self

    def transform(self, X):
        """Transforms the input data by removing duplicate rows and columns."""
        X = self.check_row_duplicates(X)
        X = self.check_col_duplicates(X)
        return X

    def check_row_duplicates(self, X):
        """Removes duplicate rows."""
        return X.drop_duplicates()

    def check_col_duplicates(self, X):
        """Removes duplicate columns."""
        duplicate_features = []
        for i in range(0, len(X.columns)):
            orig = X.columns[i]
            for dupe in X.columns[i + 1:]:
                if X[orig].equals(X[dupe]):
                    duplicate_features.append(dupe)
        if duplicate_features:
            X = X.drop(duplicate_features, axis=1)
        return X

**1. `BaseEstimator`**

* **Purpose:**
    * `BaseEstimator` is a base class in scikit-learn that provides a standard interface for all estimators (including transformers and models).
    * It primarily ensures that your custom classes adhere to scikit-learn's conventions and can be used seamlessly within its ecosystem.
* **Key Features:**
    * **`get_params()`:** This method allows scikit-learn to retrieve the parameters of your estimator. It's crucial for functions like `GridSearchCV` and `Pipeline`, which need to inspect and manipulate the parameters.
    * **`set_params()`:** This method allows scikit-learn to set the parameters of your estimator. It's used by `GridSearchCV` and `Pipeline` to configure the estimator during the search or pipeline execution.
    * By inheriting from `BaseEstimator`, you gain these core functionalities without having to implement them yourself.
* **Why It's Important:**
    * Consistency: It enforces a consistent API, making your custom estimators compatible with scikit-learn's tools.
    * Integration: It enables your estimators to work smoothly with pipelines, cross-validation, and hyperparameter tuning.

**2. `TransformerMixin`**

* **Purpose:**
    * `TransformerMixin` is a mixin class specifically designed for transformers.
    * It provides a default implementation of the `fit_transform()` method.
* **Key Features:**
    * **`fit_transform(X, y=None, **fit_params)`:** This method combines the `fit()` and `transform()` steps into a single call.
        * The default implementation simply calls `self.fit(X, y, **fit_params)` followed by `self.transform(X)`.
        * You can override `fit_transform()` if you need a more efficient or customized implementation.
* **Why It's Important:**
    * Convenience: It saves you from having to write the `fit_transform()` method explicitly in every transformer.
    * Efficiency: While the default implementation works, you can optimize it for your specific transformer if needed.
    * By inheriting from TransformerMixin, you are stating that your class will be able to transform data.

**In summary:**

* When you create a custom estimator (like a transformer or a model), you should inherit from `BaseEstimator` to ensure it integrates well with scikit-learn.
* If your estimator is a transformer (i.e., it transforms data), you should also inherit from `TransformerMixin` to get the default `fit_transform()` implementation.

By inheriting from `BaseEstimator` and the appropriate mixin, you make your custom code more robust, maintainable, and compatible with the scikit-learn ecosystem.
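
A small sketch of what the two parent classes provide. `ColumnDropper` is a hypothetical transformer written only for this illustration; the point is that `get_params()`, `set_params()`, and `fit_transform()` are never defined in the class body yet are available.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnDropper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that drops the named columns."""

    def __init__(self, columns=None):
        # BaseEstimator expects __init__ to store its arguments unchanged
        # under attributes with the same names
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X.drop(columns=self.columns or [])

dropper = ColumnDropper(columns=['b'])
print(dropper.get_params())        # {'columns': ['b']}

# Tools such as GridSearchCV reconfigure steps through set_params()
dropper.set_params(columns=['a'])

demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(dropper.fit_transform(demo).columns.tolist())   # ['b']
```

`get_params()` and `set_params()` come from `BaseEstimator`, and `fit_transform()` comes from `TransformerMixin`.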


In [None]:
# using the class without including it in a pipeline
duplicate_remover = DuplicateRemover()
X = duplicate_remover.fit_transform(X)

X.info()

In [None]:
# Example of how several steps could be chained in a pipeline. Do not uncomment the commented-out steps.
pipeline = Pipeline([
    # ('missingValues', preppy.classes.MissingValueImputer()),
    # ('categoricalEncoding', preppy.functions.do_OHE()),
    # ('duplicates', DuplicateRemover()),
    ('constants', preppy.classes.ConstantAndSemiConstantRemover()),
])

X = pipeline.fit_transform(X)
X.info()

## Data Model Using Lasso

Lasso (Least Absolute Shrinkage and Selection Operator) is a linear regression technique that performs both variable selection and regularization. Here's a breakdown of what it does and why it's useful:

**1. Linear Regression Foundation:**

* Like standard linear regression, Lasso aims to find a linear relationship between a dependent variable (the target) and one or more independent variables (features).
* It does this by estimating coefficients for each feature, which represent the strength and direction of the relationship with the target.

**2. Regularization (Shrinkage):**

* Lasso adds a penalty term to the standard linear regression cost function. This penalty is based on the *absolute values* of the coefficients.
* The effect of this penalty is to "shrink" the coefficients towards zero.
* The strength of this shrinkage is controlled by a parameter called "alpha" (or lambda). A higher alpha value leads to more shrinkage.

**3. Variable Selection:**

* A key feature of Lasso is that it can force some coefficients to become *exactly zero*.
* When a coefficient is zero, it effectively removes the corresponding feature from the model.
* This makes Lasso a powerful tool for feature selection, as it can automatically identify and discard irrelevant or redundant features.

**4. How it Works (Simplified):**

* Lasso minimizes a modified cost function:

    ```
    Cost = (Sum of Squared Errors) + alpha * (Sum of Absolute Values of Coefficients)
    ```

* The first part (Sum of Squared Errors) is the standard linear regression cost.
* The second part is the L1 regularization term, which is the sum of the absolute values of the coefficients, multiplied by the alpha parameter.
* Because of the nature of the absolute value, and the minimization process, the lasso can force some of the coefficient values to be exactly 0.
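
One way to see where the exact zeros come from: in the simplified case of a single standardized feature, the Lasso coefficient is the unpenalized coefficient passed through a "soft-thresholding" function. The sketch below is an illustration under that simplifying assumption, not the code scikit-learn runs; the function name `soft_threshold` and the sample values are made up for the example.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; anything within [-t, t] becomes 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Pretend these are unpenalized coefficients and the penalty strength is 1.0
unpenalized = np.array([3.0, -0.5, 1.0, -2.0])
shrunk = soft_threshold(unpenalized, 1.0)

print(shrunk)
```

Coefficients larger than the threshold are pulled toward zero by a fixed amount, while those at or below it are set to exactly zero. A squared (L2) penalty, by contrast, only rescales coefficients and so does not produce exact zeros.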

**5. Why Lasso is Useful:**

* **Feature Selection:** It automatically identifies and removes irrelevant features, leading to simpler and more interpretable models.
* **Reduces Overfitting:** By shrinking coefficients, Lasso can prevent the model from overfitting the training data, especially when dealing with high-dimensional datasets (many features).
* **Improved Prediction Accuracy:** In some cases, removing irrelevant features can improve the model's prediction accuracy on unseen data.
* **Sparse Models:** Lasso produces "sparse" models, meaning that they have few non-zero coefficients. This can be beneficial for efficiency and interpretability.

**6. When to Use Lasso:**

* When you suspect that many of your features are irrelevant.
* When you want to build a model with fewer features for better interpretability.
* When you want to prevent overfitting in high-dimensional datasets.

**In summary:** Lasso is a linear regression technique that adds a penalty to the coefficients, shrinking them towards zero and potentially setting some to exactly zero. This performs feature selection and regularization, leading to simpler, more interpretable, and potentially more accurate models.
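
The feature-selection effect can be checked on synthetic data where only a few features matter. This is a minimal sketch with assumed settings (10 features, 3 informative, `alpha=1.0`); the variable names are not from the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

# Only 3 of the 10 features actually drive the target
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 n_informative=3, noise=5, random_state=0)

# Scale first so the penalty treats every feature the same way
X_scaled = StandardScaler().fit_transform(X_demo)

ols = LinearRegression().fit(X_scaled, y_demo)
lasso = Lasso(alpha=1.0).fit(X_scaled, y_demo)

n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print('Zero coefficients, plain linear regression:', n_zero_ols)
print('Zero coefficients, Lasso:', n_zero_lasso)
```

Plain linear regression assigns a small non-zero coefficient to every feature, while Lasso is expected to zero out most of the uninformative ones. Try a few values of `alpha` to see how the count changes.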


In [None]:
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    random_state=42)


model = Lasso(alpha=1, fit_intercept=True)
model.fit(X_train, y_train)

d = {'Feature': X_train.columns, 'Coef': model.coef_}
pipe_df = pd.DataFrame(d)
print(pipe_df)

In [None]:
# Remove features with 0 coefficients for the heatmap
mask = pd.Series(model.coef_ != 0, index=X_train.columns)

# Apply the mask
X_train_lassoed = X_train.loc[:, mask]
X_test_lassoed = X_test.loc[:, mask]

print("Original feature count:", X_train.shape[1])
print("Selected feature count:", X_train_lassoed.shape[1])
print("Selected features:", X_train_lassoed.columns.tolist())

In [None]:
# correlation heat map
import numpy as np
import seaborn as sns
from scipy import stats

# correlation matrix
sns.set(style="white")

# compute the correlation matrix
corr = X_train_lassoed.corr()

# generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# set up the matplotlib figure
# f, ax = plt.subplots()
f = plt.figure(figsize=(16, 8))

# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5},
            annot=True, annot_kws={"size": 10});

plt.tight_layout()

## Metrics

In [None]:
predictions = model.predict(X_test)

print(f'Model Training Score (R^2): {model.score(X_train, y_train)}')
print(f'Model Test Score (R^2): {model.score(X_test, y_test)}')
print(f'Model Predictions Score: {r2_score(y_test, predictions)}')

mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
print()
print(f'MAE: {mae}')
print(f'MSE: {mse}')
print(f'RMSE: {rmse}')

# Prediction DataFrame
prediction_df = pd.DataFrame({'Actual': y_test, 'Predicted': predictions, 'Residuals': y_test - predictions})
print('\nPrediction DataFrame:')
print(prediction_df.head())

# Visualizations
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.scatterplot(x='Actual', y='Predicted', data=prediction_df)
plt.title('Actual vs. Predicted')

plt.subplot(1, 2, 2)
sns.scatterplot(x='Predicted', y='Residuals', data=prediction_df)
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residual Plot')

plt.show()

**R-squared (Coefficient of Determination)**

R-squared is a statistical measure that represents the proportion of the variance in the dependent variable (the variable you're trying to predict) that is predictable from the independent variables (the variables you're using to make the prediction). In simpler terms, it tells you how well your regression model "fits" the observed data.

**Key Points:**

* **Interpretation:**
    * R-squared values typically range from 0 to 1.
    * An R-squared of 1 indicates that the model perfectly predicts the dependent variable. All the variation in the dependent variable is explained by the independent variables.
    * An R-squared of 0 indicates that the model does not explain any of the variation in the dependent variable. It does no better than always predicting the mean of the dependent variable.
    * A value between 0 and 1 represents the proportion of the variance in the dependent variable that is explained by the model. For example, an R-squared of 0.75 means that 75% of the variance in the dependent variable is explained by the independent variables.
* **Range:**
    * For a least-squares linear model with an intercept, R-squared on the training data falls between **0 and 1**.
    * On held-out data, or for a very poorly fitted model, R-squared can be negative. This occurs when the model predicts worse than a horizontal line at the mean of the dependent variable.
* **Limitations:**
    * R-squared does not tell you whether the coefficients are statistically significant. It only measures how well the model fits the data.
    * R-squared can increase simply by adding more independent variables to the model, even if those variables are not actually related to the dependent variable. This is why adjusted R-squared is often used.
    * R-squared does not indicate if a model is biased.
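
The three reference points for R-squared (perfect, mean-only, and worse than the mean) can be reproduced with a tiny hand-made example. The arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])

# Perfect predictions: no residual error at all
perfect = r2_score(y_true, y_true)

# Always predicting the mean: residual error equals total variation
mean_only = r2_score(y_true, np.full(3, y_true.mean()))

# Predictions in reverse order: worse than predicting the mean
reversed_order = r2_score(y_true, y_true[::-1])

print(perfect, mean_only, reversed_order)   # 1.0 0.0 -3.0
```

In the last case the squared errors sum to 8 while the total variation around the mean is 2, so R-squared is 1 - 8/2 = -3.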


* **`predictions`**:
    * These are the values generated by the trained `model` when it's applied to the `X_test` dataset. They represent the model's estimations of the target variable for the test data.
* **`model.score(X_train, y_train)`**:
    * This calculates the coefficient of determination (R-squared, or R²) score for the model's performance on the training data. R-squared indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.
* **`model.score(X_test, y_test)`**:
    * This calculates the R-squared score for the model's performance on the testing data, providing an evaluation of how well the model generalizes to unseen data.
* **`r2_score(y_test, predictions)`**:
    * This function explicitly calculates the R-squared score by comparing the actual target values in `y_test` with the model's `predictions`.
* **`mae` (mean_absolute_error)**:
    * This is the average of the absolute differences between the predicted values and the actual values. It measures the average magnitude of the errors in the predictions, without considering their direction.
* **`mse` (mean_squared_error)**:
    * This is the average of the squared differences between the predicted values and the actual values. It gives more weight to larger errors due to the squaring.
* **`rmse` (root mean squared error)**:
    * This is the square root of the MSE. It provides an error metric that is in the same units as the target variable, making it easier to interpret.
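
To connect the definitions above to the functions used in the metrics cell, here is a short sketch that computes each metric by hand with NumPy and compares it with scikit-learn. The four data points are made up for the example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = np.array([3.0, -0.5, 2.0, 7.0])
predicted = np.array([2.5, 0.0, 2.0, 8.0])
errors = actual - predicted              # [0.5, -0.5, 0.0, -1.0]

mae_manual = np.mean(np.abs(errors))     # average size of the errors
mse_manual = np.mean(errors ** 2)        # squaring weights large errors more
rmse_manual = np.sqrt(mse_manual)        # back in the target's units

# R-squared: 1 - (unexplained variation / total variation)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(actual, predicted))
print(mse_manual, mean_squared_error(actual, predicted))
print(rmse_manual)
print(r2_manual, r2_score(actual, predicted))
```

Here MAE is 0.5 and MSE is 0.375. The single error of size 1 contributes half of the MAE total but two thirds of the MSE total, which shows how squaring emphasizes larger errors.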


The plots are designed to assess the performance of a regression model. Let's break down each subplot:

**Left Subplot: Actual vs. Predicted**

* **X-axis: Actual:** This axis represents the actual (or true) values of the target variable from your dataset.
* **Y-axis: Predicted:** This axis represents the predicted values of the target variable as generated by your regression model.
* **Scatter Plot:** Each point in the plot represents a data point from your dataset. The x-coordinate of the point is the actual value, and the y-coordinate is the predicted value.
* **Interpretation:**
    * **Ideal Scenario:** If the model were perfect, all points would fall exactly on a straight diagonal line (y = x), meaning the predicted values perfectly match the actual values.
    * **Deviations from the Diagonal:** The further the points are from the diagonal line, the larger the prediction errors.
    * **Patterns:** Look for any patterns in the scatter plot. For instance:
        * **Curvature:** If the points form a curve instead of a straight line, it suggests that a linear model might not be appropriate.
        * **Funnel Shape:** If the spread of the points increases or decreases as you move along the x-axis, it indicates heteroscedasticity (non-constant variance of errors).
        * **Outliers:** Points that are far away from the general trend might be outliers, which could significantly affect the model's performance.

**Right Subplot: Residual Plot**

* **X-axis: Predicted:** This axis represents the predicted values of the target variable.
* **Y-axis: Residuals:** This axis represents the residuals, which are the differences between the actual values and the predicted values (Residual = Actual - Predicted).
* **Horizontal Line at Y=0:** This line represents zero residuals.
* **Interpretation:**
    * **Ideal Scenario:** In a good model, the residuals should be randomly scattered around zero, with no discernible pattern.
    * **Patterns:** Look for any patterns in the residual plot. For instance:
        * **Curvature:** If the residuals form a curve, it suggests that a linear model might not be appropriate.
        * **Funnel Shape:** If the spread of the residuals increases or decreases as you move along the x-axis, it indicates heteroscedasticity.
        * **Outliers:** Points that are far away from the zero line might be outliers.
        * **Non-Random Patterns:** Any systematic patterns, such as a U-shape or a V-shape, indicate that the model is not capturing some underlying structure in the data.

**Overall Interpretation of the Given Plot:**

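Based on the plot from one run of this notebook, here's a general assessment. Because parts of the data are generated randomly without a fixed seed, your plot will differ and this assessment may not match it: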

* **Left Subplot (Actual vs. Predicted):** The points are somewhat scattered around the diagonal line, indicating that the model's predictions are not perfect. There are some noticeable deviations, suggesting that the model has room for improvement.
* **Right Subplot (Residual Plot):** The residuals appear to be somewhat randomly scattered around zero, but there might be a slight tendency for the residuals to be more spread out at lower predicted values. This could indicate mild heteroscedasticity. There are also a couple of points that are relatively far from the zero line, which might be outliers.

**Conclusion:**

The plots suggest that the regression model is capturing some of the underlying relationships in the data, but it's not a perfect fit. There are some deviations in the predicted values and potential issues with heteroscedasticity. Further analysis and potential model refinement might be necessary to improve the model's performance.

**Recommendations:**

* **Check for Heteroscedasticity:** Use statistical tests (e.g., Breusch-Pagan test) to confirm the presence of heteroscedasticity. If confirmed, consider using transformations (e.g., logarithmic) or weighted least squares regression.
* **Address Outliers:** Investigate the potential outliers to determine if they are genuine data points or errors. If they are errors, consider removing them. If they are genuine, consider using robust regression techniques.
* **Consider Non-Linear Models:** If there are patterns in the residuals or the actual vs. predicted plot suggests non-linearity, explore non-linear regression models or feature engineering techniques.
* **Evaluate Model Performance:** Use other evaluation metrics (e.g., R-squared, RMSE) to quantify the model's performance and compare it to other potential models.


## OLS Model

Ordinary Least Squares (OLS) is a statistical method used to estimate the unknown parameters in a linear regression model. It's a fundamental tool in econometrics and statistics, and the Python statsmodels library provides an implementation.

Here's a breakdown of what OLS is and how it works:

**1. Linear Regression Model:**

* OLS is designed for linear regression, which assumes a linear relationship between a dependent variable (the variable you're trying to predict) and one or more independent variables (the variables used to make the prediction).
* The general form of a linear regression model is:

    ```
    y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε
    ```

    * `y`: The dependent variable.
    * `x₁, x₂, ..., xₚ`: The independent variables.
    * `β₀, β₁, β₂, ..., βₚ`: The regression coefficients (parameters) that represent the relationship between the independent variables and the dependent variable.
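    * `ε`: The error term, which captures the variation in `y` that the linear combination of the independent variables does not explain. Its estimate for a fitted model is the residual: the difference between the actual value of `y` and the value predicted by the model.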

**2. Ordinary Least Squares (OLS) Estimation:**

* OLS aims to find the values of the regression coefficients (βs) that minimize the sum of the squared residuals.
* In other words, it finds the line (or hyperplane in multiple regression) that best fits the data points by minimizing the overall difference between the observed values and the predicted values.
* "Least squares" refers to minimizing the sum of the squared errors:

    ```
    Σ(yᵢ - ŷᵢ)²
    ```

    * `yᵢ`: The actual value of the dependent variable for the i-th observation.
    * `ŷᵢ`: The predicted value of the dependent variable for the i-th observation.
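
The minimization can be sketched with NumPy alone. The example below generates data from known coefficients (an assumption made so the answer can be checked), solves the least-squares problem with `np.linalg.lstsq`, and confirms that nudging the solution increases the sum of squared errors.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=(100, 2))

# Known coefficients: intercept, beta_1, beta_2
true_beta = np.array([1.5, -2.0, 0.5])

# Design matrix with a leading column of ones for the intercept
X_design = np.column_stack([np.ones(100), x])
y = X_design @ true_beta + rng.normal(0, 0.1, size=100)

# Least-squares solution: the betas that minimize the sum of squared errors
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
sse = np.sum((y - X_design @ beta_hat) ** 2)

# Any other coefficient vector gives a larger sum of squared errors
sse_perturbed = np.sum((y - X_design @ (beta_hat + 0.1)) ** 2)

print(beta_hat.round(2))
print(sse < sse_perturbed)   # True
```

The estimated coefficients land close to the true ones because the noise is small relative to the signal.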

**3. Assumptions of OLS:**

* For OLS to provide unbiased and efficient estimates, and for the reported standard errors and p-values to be trustworthy, several assumptions should be met:
    * **Linearity:** The relationship between the dependent and independent variables is linear.
    * **Independence:** The error terms are independent of each other.
    * **Homoscedasticity:** The error terms have constant variance.
    * **Normality:** The error terms are normally distributed.
    * **No Multicollinearity:** The independent variables are not perfectly correlated with each other.

**4. Statsmodels Implementation:**

* The statsmodels library in Python provides a convenient way to perform OLS regression.
* Key steps:
    * Import the necessary libraries (`statsmodels.api` and `pandas`).
    * Load your data into a Pandas DataFrame.
    * Define the dependent and independent variables.
    * Add a constant term to the independent variables using `statsmodels.api.add_constant()`.
    * Create an OLS model using `statsmodels.api.OLS(y, X)`.
    * Fit the model using `results = model.fit()`.
    * Print the regression results using `results.summary()`.
* The results summary provides valuable information, including:
    * Coefficient estimates (βs).
    * Standard errors of the coefficients.
    * t-statistics and p-values for the coefficients.
    * R-squared and adjusted R-squared.
    * F-statistic and p-value for the overall model.
    * Diagnostic information about the residuals.

**5. Interpretation of Results:**

* The coefficient estimates indicate the change in the dependent variable for a one-unit change in the corresponding independent variable, holding other variables constant.
* The p-values help determine the statistical significance of the coefficients.
* The R-squared value indicates the proportion of the variance in the dependent variable that is explained by the model.
* The residual analysis is used to verify that the assumptions of OLS are met.

In summary, Statsmodels OLS is a powerful tool for building and analyzing linear regression models in Python. It provides comprehensive statistical output and diagnostic information to help you understand the relationships between variables and assess the quality of your model.


In [None]:
import statsmodels.api as sm

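# statsmodels does not add an intercept automatically, so add a column of ones
X_train = X_train.copy()
if 'const' not in X_train.columns:  # guard so the cell can be re-run safely
    X_train.insert(0, 'const', 1)
X_with_const = X_train

# statsmodels requires the row labels of X and y to match, so reset both
X_with_const = X_with_const.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)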

ols_model = sm.OLS(y_train, X_with_const).fit()
print(ols_model.summary())

In [None]:
# compare the p-values of the OLS summary with what Lasso selected
X_train_lassoed.columns.tolist()

## R-Squared and Adj R-Squared

Understanding the Relationship Between R-Squared and Adjusted R-Squared

**R-Squared (Coefficient of Determination):**

* **What it is:** R-squared measures the proportion of the variance in the dependent variable (the target) that is predictable from the independent variables (the features).
* **How it's calculated:** It is 1 minus the ratio of the residual sum of squares (the unexplained variation) to the total sum of squares (the total variation): `R² = 1 - SS_res / SS_tot`.
* **What it tells you:** How well the model fits the data. A higher R-squared means the model explains a larger portion of the variance.
* **Problem:** On the training data, R-squared never decreases when you add more features to your model, even if those features don't actually improve the model's predictive power. Chasing a higher R-squared can therefore lead to overfitting.

**Adjusted R-Squared:**

* **What it is:** Adjusted R-squared is a modified version of R-squared that penalizes you for adding unnecessary features to your model.
* **How it's calculated:** It takes into account the number of features `p` and the number of data points `n`: `Adj R² = 1 - (1 - R²)(n - 1) / (n - p - 1)`. Adding a feature raises the adjusted value only if the gain in R-squared outweighs the lost degree of freedom.
* **What it tells you:** How well the model fits the data *while accounting for the complexity of the model*. It helps you determine if adding more features is actually improving the model or just overfitting.
* **Why it's important:** It provides a more realistic measure of the model's performance, especially when you're dealing with models that have many features.
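
Both metrics can be computed by hand from the residuals. The sketch below uses NumPy on synthetic data (the sample size and coefficients are made up) and applies the standard formulas, with `n` observations and `p` features excluding the intercept.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)

# Fit OLS with an intercept column
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ beta_hat

ss_res = np.sum((y - y_hat) ** 2)     # residual (unexplained) sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print("R-squared:", round(r2, 4))
print("Adjusted R-squared:", round(adj_r2, 4))
```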

**How They Are Used in Feature Selection:**

1.  **R-squared as a Starting Point:**
    * R-squared can give you a general idea of how well your model is performing. If it's very low, it might indicate that your features are not very predictive.

2.  **Adjusted R-squared for Comparison:**
    * When you're trying to select the best set of features, you should primarily rely on adjusted R-squared.
    * As you add or remove features, compare the adjusted R-squared values of the resulting models.
    * The model with the highest adjusted R-squared is generally considered the best, as it indicates the best balance between model fit and model complexity.

3.  **Avoiding Overfitting:**
    * Adjusted R-squared helps you avoid overfitting by penalizing the inclusion of irrelevant features.
    * If you see a large increase in R-squared but only a small increase (or even a decrease) in adjusted R-squared when you add a feature, it's a sign that the feature is not adding much predictive power and might be leading to overfitting.

4.  **Model Simplicity:**
    * Adjusted R-squared therefore encourages simpler models. If two models have similar R-squared values, the model with fewer features will generally have the higher adjusted R-squared.
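
The comparison described above can be tried on synthetic data. This sketch fits one model with two useful features and another with five pure-noise columns added; the helper `r2_and_adjusted` and all the numbers are hypothetical. R-squared cannot go down when the noise columns are added, while adjusted R-squared is penalized for them.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R-squared, adjusted R-squared) for an OLS fit with intercept."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    ss_res = np.sum((y - design @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return r2, 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(11)
n = 60
useful = rng.normal(size=(n, 2))
y = 1.0 + useful @ np.array([2.0, -1.0]) + rng.normal(size=n)
junk = rng.normal(size=(n, 5))  # columns unrelated to y

r2_base, adj_base = r2_and_adjusted(useful, y)
r2_more, adj_more = r2_and_adjusted(np.column_stack([useful, junk]), y)

print(f"2 features: R2={r2_base:.4f}  adj R2={adj_base:.4f}")
print(f"7 features: R2={r2_more:.4f}  adj R2={adj_more:.4f}")
```

If adjusted R-squared drops or barely moves after adding features, the extra columns are probably not worth keeping.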

**In Summary:**

* Use R-squared for a basic assessment of model fit.
* Use adjusted R-squared to compare models with different numbers of features and to prevent overfitting during feature selection. Adjusted R-squared is the more valuable metric when you are trying to select the most relevant features.


## OLS Regression Summary Explanation

* Endog(enous): statsmodels' term for the dependent variable (`y`)
* Exog(enous): statsmodels' term for the independent variables (`X`)
* https://www.statisticshowto.com/endogenous-variable/
* https://medium.com/swlh/interpreting-linear-regression-through-statsmodels-summary-4796d359035a

### Model Info
* Dep. Variable: the response variable (dependent variable, outcome)
* Model: the model used for training (OLS, ordinary least squares)
* Method: how the parameters (coefficients) were calculated
* No. Observations: the number of observations, or rows (n)
* Df Residuals: degrees of freedom of the residuals, n minus the number of estimated parameters (including the constant)
* Df Model: number of parameters in the model, excluding the constant if present
* Covariance Type: how the covariance of the coefficients (and so the standard errors) is estimated; the default, nonrobust, assumes homoscedastic errors, while robust options such as HC3 relax that assumption

### Goodness of Fit
* R-squared: coefficient of determination, the proportion of variance in the dependent variable explained by the model
* Adj. R-squared: R-squared adjusted for the number of parameters and the residual degrees of freedom
* F-statistic: tests whether the model as a whole fits better than an intercept-only model
* Prob (F-statistic): the p-value of the F-statistic, the probability of an F-statistic at least this large if all slope coefficients were zero (the null hypothesis)
* Log-Likelihood: can be used to compare the fit of different models on the same data; a higher value is better
* AIC: Akaike Information Criterion, used to compare models; a lower score is better (it scores the overall model, not individual features)
* BIC: Bayesian Information Criterion, similar to AIC but with a penalty per parameter that grows with the sample size (heavier than AIC's once n is 8 or more)
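
For intuition, these goodness-of-fit numbers can be reproduced from the residuals with the textbook Gaussian formulas. This is a sketch on synthetic data, under the assumption that `k` counts every estimated coefficient including the constant. The values should agree with a statsmodels summary up to rounding, but verify that on your own fit.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 80, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
ss_res = np.sum((y - X @ beta_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)

k = p + 1  # number of estimated coefficients, including the constant

# Gaussian log-likelihood evaluated at the OLS estimates
log_likelihood = -0.5 * n * (np.log(2 * np.pi) + np.log(ss_res / n) + 1)

aic = 2 * k - 2 * log_likelihood
bic = k * np.log(n) - 2 * log_likelihood

# F-statistic: explained variation per feature over residual variance
f_stat = ((ss_tot - ss_res) / p) / (ss_res / (n - p - 1))

print("Log-Likelihood:", round(log_likelihood, 3))
print("AIC:", round(aic, 3), " BIC:", round(bic, 3))
print("F-statistic:", round(f_stat, 3))
```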

### Coefficients
* coef: the estimated value of the coefficient
* std err: the standard error of the coefficient estimate
* t: the t-statistic (coef divided by std err), which measures how far the estimate is from zero in standard-error units
* P>|t|: the p-value; a value below the significance level (usually 0.05) indicates a statistically significant relationship with the dependent variable
* [0.025 0.975]: the lower and upper bounds of the 95% confidence interval
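
Each column of the coefficient table follows from the one before it. The sketch below rebuilds the table by hand with NumPy and SciPy on synthetic data, assuming the default (nonrobust) covariance; the variable names are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(21)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

k = X.shape[1]    # number of coefficients, including the constant
df_resid = n - k  # residual degrees of freedom

# coef: solve the normal equations (X'X) beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# std err: square root of the diagonal of sigma^2 * (X'X)^-1
residuals = y - X @ beta_hat
sigma2 = residuals @ residuals / df_resid
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
std_err = np.sqrt(np.diag(cov_beta))

# t and P>|t|: two-sided test of "coefficient equals zero"
t_values = beta_hat / std_err
p_values = 2 * stats.t.sf(np.abs(t_values), df_resid)

# [0.025 0.975]: 95% confidence interval
t_crit = stats.t.ppf(0.975, df_resid)
ci_lower = beta_hat - t_crit * std_err
ci_upper = beta_hat + t_crit * std_err

for row in zip(beta_hat, std_err, t_values, p_values, ci_lower, ci_upper):
    print(" ".join(f"{value:10.4f}" for value in row))
```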

### Statistical Tests
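* Skew: a measure of the asymmetry of the residuals about their mean; 0 for a symmetric distribution
* Kurtosis: a measure of how heavy the tails of the residual distribution are; a normal distribution has a kurtosis of 3
* Omnibus: D'Agostino's test, a combined test for skewness and kurtosis of the residuals
* Prob(Omnibus): the p-value of the Omnibus test; a small value suggests the residuals are not normally distributed
* Jarque-Bera (JB): another test of normality based on skewness and kurtosis
* Prob(JB): the p-value of the Jarque-Bera test
* Durbin-Watson: a test for autocorrelation in the residuals (whether the errors are independent); the statistic ranges from 0 to 4, and values near 2 suggest no autocorrelation
* Cond. No.: the condition number of the design matrix, a diagnostic for multicollinearity; large values indicate that the features are nearly linearly dependent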
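
Several of these diagnostics are simple functions of the residuals. The sketch below computes them with NumPy on synthetic data using the standard textbook definitions. It is meant to show what the numbers measure, not to replace the values statsmodels reports.

```python
import numpy as np

rng = np.random.default_rng(13)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat  # residuals

# Skew and kurtosis from the central moments of the residuals
centered = e - e.mean()
m2 = np.mean(centered ** 2)
skew = np.mean(centered ** 3) / m2 ** 1.5
kurtosis = np.mean(centered ** 4) / m2 ** 2  # about 3 for normal residuals

# Jarque-Bera combines both into a single normality statistic
jarque_bera = n / 6 * (skew ** 2 + (kurtosis - 3) ** 2 / 4)

# Durbin-Watson: near 2 when consecutive residuals are uncorrelated
durbin_watson = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Condition number of the design matrix
cond_no = np.linalg.cond(X)

print("Skew:", round(skew, 3), " Kurtosis:", round(kurtosis, 3))
print("Jarque-Bera:", round(jarque_bera, 3))
print("Durbin-Watson:", round(durbin_watson, 3))
print("Cond. No.:", round(cond_no, 2))
```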