## **1. Import Data**

File “StrainTemperature.csv” contains a dataset, with 7 columns and 337 rows.  
- Col. 1 lists the strain measured in a structural member, in micro-strains, με (i.e., in parts per million).  
- Col. 2 to 7 list the temperature measured by 6 thermometers in different locations, in degree Celsius.  
Each row refers to a specific time when all measures (of strain and temperature) are collected.
Measures are collected every 30 minutes for one week (hence the rows are 337 = 7 × 24 × 2 + 1).

In [41]:
import pandas as pd

# Load data with the correct header
data = pd.read_csv("./data/StrainTemperature.csv", header=None)

# Rename columns 
data.columns = ['Strain', 'Temp1', 'Temp2', 'Temp3', 'Temp4', 'Temp5', 'Temp6']

# Check the structure of the corrected data
print(data.head())

    Strain   Temp1   Temp2   Temp3   Temp4   Temp5   Temp6
0   69.754  22.681  23.836  24.512  25.141  23.650  24.048
1   98.703  23.317  24.357  25.073  25.689  24.205  24.622
2  104.404  23.945  24.926  25.564  26.028  24.741  25.121
3  101.514  24.226  25.503  25.994  26.164  25.146  25.481
4   99.808  24.432  26.114  26.177  26.272  25.443  25.765


## **2. Data Proprecessing**

Check for missing values and remove them if present.

In [42]:
# Check for missing values
missing_values = data.isnull().sum()
print(missing_values)

# Remove missing values if any are present
if missing_values.sum() > 0:
    data = data.dropna()  # Remove rows with missing values
    print("Missing values have been handled.")
else:
    print("No missing values found.")

Strain    0
Temp1     0
Temp2     0
Temp3     0
Temp4     0
Temp5     0
Temp6     0
dtype: int64
No missing values found.


The result of the missing value check shows no missing values. Therefore, the data remains unchanged.

## **3. Calibrate a linear regression model**

Using all six temperatures and a “constant feature,” infer the strain as a function of the temperatures.

In [43]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Separate input (X) and output (y)
X = data.iloc[:, 1:]  # Temperature data (columns 2 to 7)
y = data.iloc[:, 0]   # Strain data (column 1)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred = model.predict(X_test)
print("R^2 Score:", r2_score(y_test, y_pred))

R^2 Score: 0.9221524043718972


1) **Coefficient of Determination (𝑅²) for this Fitted Model**  
The coefficient of determination (𝑅²) for this fitted model is **0.9222**.  
This indicates that approximately **92.22% of the variability** in the test data is explained by this linear regression model.

2) **How 𝑅² Relates to the Uncertainty in Inferring the Strain**  
The 𝑅² value represents how well the model explains the relationship between the input data (temperatures) and the output data (strain).  

    - A **high 𝑅² value** (close to 1) suggests that the model effectively explains the variability in the data, resulting in **lower uncertainty** in predictions.  
    - A **low 𝑅² value** indicates that the model does not sufficiently explain the data, leading to **higher uncertainty** in predictions.

3) **Conclusion**  
With an 𝑅² value of **0.9222**, this model effectively infers strain as a function of temperatures, showing relatively **low uncertainty**.  
However, it is important to note that 𝑅² alone does not fully evaluate model quality. Additional analyses, such as residual analysis and checks for multicollinearity, should also be considered.

## **4. Build a 95% confidence interval**

Build a 95% confidence interval for each of the parameters.

In [44]:
import numpy as np
import statsmodels.api as sm
import pandas as pd

# Prepare data (set independent and dependent variables)
X = data.iloc[:, 1:]  # Temperature data (columns 2 to 7)
y = data.iloc[:, 0]   # Strain data (column 1)

# Add a constant term (Statsmodels requires manual addition of the constant term)
X = sm.add_constant(X)

# Fit the linear regression model
model = sm.OLS(y, X).fit()

# Display the results
print(model.summary())

# Extract 95% confidence intervals
confidence_intervals = model.conf_int(alpha=0.05)  # 95% confidence intervals
confidence_intervals.columns = ['Lower Bound', 'Upper Bound']
confidence_intervals.index = ['Constant'] + list(data.columns[1:])  # Add variable names

print("\n95% Confidence Intervals:")
print(confidence_intervals)

                            OLS Regression Results                            
Dep. Variable:                 Strain   R-squared:                       0.932
Model:                            OLS   Adj. R-squared:                  0.931
Method:                 Least Squares   F-statistic:                     754.7
Date:                Tue, 07 Jan 2025   Prob (F-statistic):          2.31e-189
Time:                        12:25:34   Log-Likelihood:                -1484.2
No. Observations:                 337   AIC:                             2982.
Df Residuals:                     330   BIC:                             3009.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -319.8254     14.588    -21.923      0.0

1) **95% Confidence Intervals**  
The 95% confidence intervals for each parameter representing the relationship between temperature and strain are shown in the last two columns [0.025, 0.975] of the results.  

    **Examples:**   
    - Constant (const) confidence interval: **[-348.523, -291.127]**  
    - First variable (Temp1): **[-13.881, 30.264]**  
    - Last variable (Temp6): **[37.726, 184.569]** 
     
    The meaning of these intervals is that the true coefficient is expected to lie within these ranges with 95% confidence.


2) **Significance of Coefficients**  
    The `P>|t|` values indicate whether each variable’s coefficient is statistically significant.  

    **Typically:**  
    - A **p-value < 0.05** implies the coefficient is statistically significant.

    **Analysis of Results:**  
    - Constant (const): **p=0.000**, significant.  
    - Temp5: **p=0.000**, significant.  
    - Temp6: **p=0.003**, significant.  
    - Other variables (Temp1, Temp2, Temp3, Temp4): **p > 0.05**, not significant.


3) **Size of Confidence Intervals**  
Large confidence intervals suggest greater uncertainty in the estimation of the coefficients.

    **Examples:**  
    - Temp1 (first variable): **[-13.881, 30.264]** has a wide interval, indicating that the model struggles to accurately estimate its effect.  
    - Temp2: **[-38.402, 7.197]**, Temp3: **[-29.957, 15.906]**, and Temp4: **[-21.802, 15.807]** also exhibit wide intervals.  
    - Even for Temp6, which is significant, the interval **[37.726, 184.569]** is relatively large, implying some degree of uncertainty.

    The large intervals for all variables (including significant ones) suggest considerable uncertainty in the coefficient estimates.


4) **Reasons for Large Confidence Intervals**  
    - **Multicollinearity:** High correlation among independent variables makes it challenging to estimate individual coefficients reliably.  
    - **Data Quality:** Small sample sizes or noisy data increase the uncertainty of estimates.  
    - **Low Predictive Power:** Variables with limited influence on the dependent variable (strain) result in wider intervals.


## **5. Mmulticollinearity. Compute the Variance Inflation Factor (VIF)**

Compute the Variance Inflation Factor (VIF) for each of the temperatures

In [45]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

# Prepare data (select temperature variables only)
X = data.iloc[:, 1:]  # Temperature data (columns 2 to 7)
X_columns = X.columns

# Calculate VIF
vif_values = []

for i in range(X.shape[1]):
    # Set the current variable (i) as the dependent variable and the rest as independent variables
    y = X.iloc[:, i]
    X_temp = X.drop(X.columns[i], axis=1)
    
    # Train the linear regression model
    model = LinearRegression()
    model.fit(X_temp, y)
    
    # Compute the coefficient of determination (R^2)
    r_squared = model.score(X_temp, y)
    
    # Calculate VIF
    vif = 1 / (1 - r_squared)
    vif_values.append(vif)

# Organize VIF values into a DataFrame
vif_data = pd.DataFrame({
    "Feature": X_columns,
    "VIF": vif_values
})

# Print the results
print("Variance Inflation Factor (VIF):")
print(vif_data)


Variance Inflation Factor (VIF):
  Feature           VIF
0   Temp1   1680.118023
1   Temp2   2419.379388
2   Temp3   3147.839906
3   Temp4   1836.505518
4   Temp5   4979.115281
5   Temp6  25248.587883


VIF (Variance Inflation Factor) measures how much a variable is correlated with other independent variables. A high VIF indicates that the variable is not independent but redundant, as it shares significant information with others. Typically, VIF > 10 suggests a multicollinearity issue.  

In this case, all temperature variables have extremely high VIFs, indicating that they are highly correlated and act as **redundant** features rather than independent ones. This redundancy makes it difficult to isolate the individual impact of each variable in the model, highlighting the need to address multicollinearity through techniques like feature removal, dimensionality reduction (e.g., PCA), or using regularization methods.

In [48]:
# Use library for verification
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

# Prepare data (select temperature variables only)
X = data.iloc[:, 1:]  # Temperature data (columns 2 to 7)

# Calculate VIF
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns  # Variable names
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# Print results
print("Variance Inflation Factor (VIF):")
print(vif_data)

Variance Inflation Factor (VIF):
  Variable            VIF
0    Temp1   43276.786079
1    Temp2   48045.142288
2    Temp3   45966.427308
3    Temp4   29682.584833
4    Temp5   86096.084991
5    Temp6  483266.304104


**Why the Results Differ When Using the Library**

The difference in VIF results arises because the `variance_inflation_factor` function from `statsmodels` uses matrix operations for calculations, providing more stable and consistent values. In contrast, manual methods rely on regression models, which can be sensitive to scaling or numerical precision issues.

Despite the difference in exact VIF values, both approaches clearly indicate a **multicollinearity issue**, as all VIF values are significantly above the threshold (VIF > 10). This confirms that the temperature variables are highly correlated and redundant.

## **6. Solve the Issue of Multicollinearity**

### 6.1 Reasons to Address Multicollinearity

Addressing multicollinearity is essential for several reasons, which are all relevant to the current analysis:

1) **Improving the inference of strain**  
   Multicollinearity prevents the model from accurately estimating the individual effects of independent variables due to high correlations among them. Resolving it improves the accuracy of strain inference and enhances predictive performance.

2) **Obtaining a simpler model**  
   Multicollinearity leads to redundant variables in the model, increasing complexity unnecessarily. Removing such variables or applying dimensionality reduction techniques simplifies the model while maintaining or even improving its performance.

3) **Better understanding of the relationship between strain and specific temperatures**  
   High multicollinearity obscures the individual impact of specific temperature variables on strain. Reducing multicollinearity enables clearer interpretation of how each temperature affects strain.

4) **Reducing uncertainty in model parameters**  
   Multicollinearity destabilizes parameter estimates, widening confidence intervals and lowering the reliability of the model. Addressing it reduces uncertainty in parameter estimation and improves model stability.

In conclusion, resolving multicollinearity is crucial for enhancing model performance, improving interpretability, and ensuring the stability and reliability of parameter estimates.

### 6.2 Techniques for Resolving Multicollinearity 

#### 1) Subset of thermometers

In [50]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Function to calculate VIF
def calculate_vif(X):
    vif_data = pd.DataFrame()
    vif_data["Feature"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif_data

# Initial VIF calculation
print("VIF Before Subset Selection:")
vif_before = calculate_vif(X)
print(vif_before)

# Remove the variable with the highest VIF (e.g., "Temp6")
X_subset = X.drop("Temp6", axis=1)

# New VIF calculation after subset selection
print("\nVIF After Subset Selection:")
vif_after = calculate_vif(X_subset)
print(vif_after)

# Train-test split with the reduced subset
X_train, X_test, y_train, y_test = train_test_split(X_subset, y, test_size=0.2, random_state=42)

# Train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model with the reduced subset
y_pred = model.predict(X_test)
print("\nR^2 Score After Subset Selection:", r2_score(y_test, y_pred))


VIF Before Subset Selection:
  Feature            VIF
0   Temp1   43276.786079
1   Temp2   48045.142288
2   Temp3   45966.427308
3   Temp4   29682.584833
4   Temp5   86096.084991
5   Temp6  483266.304104

VIF After Subset Selection:
  Feature           VIF
0   Temp1   8392.004819
1   Temp2  13700.298040
2   Temp3   9104.187107
3   Temp4    346.453943
4   Temp5  84527.631223

R^2 Score After Subset Selection: 0.9999542061569101


왜 변수 제거를 하기로 했는지  
어떻게 했는지  
결과과  
한계: 변수 제거는 근본적인 해결이 아닐 수 있음. 왜냐하면..  

#### 2) Principal Component Analysis (PCA)  


PCA를 사용해 차원을 축소하고, 다중공선성을 제거한 새로운 변수로 모델을 학습

In [51]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardizing the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA (Principal Component Analysis) with 2 principal components
# PCA is used to reduce dimensionality by transforming correlated variables into independent components.
# Here, we use 2 principal components to retain the most significant variance while simplifying the dataset.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Train-test split with the transformed PCA data
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)

# Train the regression model with PCA-transformed data
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred = model.predict(X_test)
print("R^2 Score After PCA:", r2_score(y_test, y_pred))

# Output the explained variance ratio of the principal components
# This shows how much variance is retained by each principal component.
print("Explained Variance Ratio:", pca.explained_variance_ratio_)

R^2 Score After PCA: 0.999950731484512
Explained Variance Ratio: [0.98406031 0.0133858 ]


왜 PCA를 하기로 했는지  
어떻게 했는지 (왜 주성분이 2개인지)  
결과과
한계: 근본적인 해결이 아닐 수 있음. 왜냐하면..

#### 3) Ridge Regression

In [53]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Ridge regression model
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

# Predictions
y_pred_ridge = ridge.predict(X_test)

# R^2 score
print("R^2 Score with Ridge Regression:", r2_score(y_test, y_pred_ridge))

# Coefficients
print("Ridge Coefficients:", ridge.coef_)

# Mean squared error (MSE) comparison
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
print("Mean Squared Error with Ridge Regression:", mse_ridge)


R^2 Score with Ridge Regression: 0.9999515809826575
Ridge Coefficients: [1.90755538 0.27665518]
Mean Squared Error with Ridge Regression: 0.0010595567328021806


왜 릿지회귀를 를 하기로 했는지  
어떻게 했는지   
결과  
한계: 근본적인 해결이 아닐 수 있음. 왜냐하면..

#### 4) Lasso Regression

In [20]:
from sklearn.linear_model import Lasso

# 라쏘 회귀 모델
lasso = Lasso(alpha=0.1)  # 알파 값은 정규화 강도를 조절
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)

# R^2 결과 출력
print("R^2 Score with Lasso Regression:", r2_score(y_test, y_pred_lasso))


R^2 Score with Lasso Regression: 0.9996497233047609


왜 라쏘쏘회귀를 를 하기로 했는지  
어떻게 했는지   
결과  
한계: 근본적인 해결이 아닐 수 있음. 왜냐하면..


**전체 결론**
1. **Subset Selection**:
   - 변수 제거로 다중공선성을 완화할 수 있으나, 가장 간단한 접근법.
   - 변수 제거는 데이터 해석에 영향을 줄 수 있음.

2. **PCA**:
   - 다중공선성을 완전히 제거하며, 모델의 설명력을 유지.
   - 차원을 축소하므로 모델 해석력이 떨어질 수 있음.

3. **Ridge Regression**:
   - 다중공선성을 해결하면서 모든 변수를 유지.
   - 해석 가능성이 높음.

4. **Lasso Regression**:
   - 다중공선성 해결과 변수 선택을 동시에 수행.
   - 가장 해석 가능하고 간단한 모델을 생성할 수 있음.


## **7. Predict Future**

 선형 회귀를 사용하여, 현재 (또는 과거)의 온도와 변형률을 함수로 하여 미래의 변형률을 예측하는 모델을 개발하십시오. (미래 시점에서 수집된 데이터를 사용할 수 없습니다.)

이전 질문에서 제안한 모델과 관련하여, 예측의 정확도를 정량화하십시오.
미래 변형률에 대해 95% 신뢰구간을 어떻게 정의할 수 있습니까?
이를 답하기 위해, 연속된 시점에서 변형률에 영향을 미치는 노이즈가 상관되어 있을 수 있음을 고려하십시오.

노이즈 상관성 문제를 분석하고 이것이 초래하는 결과를 설명하며, 미래 변형률에 대한 신뢰구간을 정의할 때 이 현상을 어떻게 반영할 수 있을지 논의하십시오.

다중공선성이 마지막 두 질문에서 정의된 미래 변형률 예측과 어떻게 관련되어 있는지 논의