In [1]:
import pandas as pd

# Load the data
df = pd.read_csv("w11/PracticalLab/data/Forbes2000.csv", encoding='ISO-8859-1', index_col=0)

# Check the structure
print(df.columns)
df.head()


Index(['rank', 'name', 'country', 'category', 'sales', 'profits', 'assets',
       'marketvalue'],
      dtype='object')


|   | rank | name | country | category | sales | profits | assets | marketvalue |
|---|------|------|---------|----------|-------|---------|--------|-------------|
| 1 | 1 | Citigroup | United States | Banking | 94.71 | 17.85 | 1264.03 | 255.3 |
| 2 | 2 | General Electric | United States | Conglomerates | 134.19 | 15.59 | 626.93 | 328.54 |
| 3 | 3 | American Intl Group | United States | Insurance | 76.66 | 6.46 | 647.66 | 194.87 |
| 4 | 4 | ExxonMobil | United States | Oil & gas operations | 222.88 | 20.96 | 166.99 | 277.02 |
| 5 | 5 | BP | United Kingdom | Oil & gas operations | 232.57 | 10.27 | 177.57 | 173.54 |


In [2]:
# Keep only the numeric modelling columns and drop rows with any missing value
df_clean = df[['sales', 'profits', 'assets', 'marketvalue']].dropna()


In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Define features and target
X = df_clean[['sales', 'profits', 'assets']]
y = df_clean['marketvalue']

# Train the model
model = LinearRegression()
model.fit(X, y)

# Make predictions
y_pred = model.predict(X)

# Output results
print("=== Scikit-Learn Multiple Linear Regression ===")
print(f"Intercept: {model.intercept_:.4f}")
print(f"Coefficients: {dict(zip(X.columns, model.coef_))}")
print(f"R²: {r2_score(y, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y, y_pred)):.4f}")


=== Scikit-Learn Multiple Linear Regression ===
Intercept: 2.9004
Coefficients: {'sales': 0.5747316233709924, 'profits': 4.597532983583894, 'assets': 0.04891510829753898}
R²: 0.5435
RMSE: 16.5406


In [4]:
import statsmodels.api as sm

# Add intercept to X
X_sm = sm.add_constant(X)

# Fit the model
model_sm = sm.OLS(y, X_sm).fit()

# Output summary
print("=== StatsModels Multiple Linear Regression ===")
print(model_sm.summary())


=== StatsModels Multiple Linear Regression ===
                            OLS Regression Results                            
Dep. Variable:            marketvalue   R-squared:                       0.543
Model:                            OLS   Adj. R-squared:                  0.543
Method:                 Least Squares   F-statistic:                     790.1
Date:                Mon, 07 Apr 2025   Prob (F-statistic):               0.00
Time:                        17:04:04   Log-Likelihood:                -8428.4
No. Observations:                1995   AIC:                         1.686e+04
Df Residuals:                    1991   BIC:                         1.689e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const

### 🧾 PE07-1-2 — Multiple Linear Regression (Scikit-Learn & StatsModels)

We built a **multiple linear regression model** using all three independent variables — `sales`, `profits`, and `assets` — to predict `marketvalue`.

---

#### 📘 Scikit-Learn Results

| Metric             | Value                         |
|--------------------|-------------------------------|
| **Intercept**       | 2.9004                        |
| **Coefficients**    | `sales`: 0.5747<br>`profits`: 4.5975<br>`assets`: 0.0489 |
| **R-squared (R²)**  | **0.5435**                    |
| **RMSE**            | **16.54**                     |
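
The two fit metrics in the table can be computed directly from the residuals. The sketch below is a minimal NumPy illustration of those formulas, not the notebook's own code; the helper name `r2_and_rmse` and the toy values are assumptions made for the example and do not come from Forbes2000.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R², RMSE) computed directly from the residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / len(y_true))              # root of the mean squared residual
    return r2, rmse

# Toy values for illustration only (hypothetical, not from the dataset)
y_toy = [1.0, 2.0, 3.0, 4.0]
pred_toy = [1.5, 1.5, 3.5, 3.5]
r2_toy, rmse_toy = r2_and_rmse(y_toy, pred_toy)
print(f"R² = {r2_toy:.4f}, RMSE = {rmse_toy:.4f}")  # R² = 0.8000, RMSE = 0.5000
```

Applied to `y` and `y_pred` from the cell above, this should agree with `r2_score` and `np.sqrt(mean_squared_error(...))`.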

---

#### 📘 StatsModels Summary (Key Highlights)

| Metric             | Value           | Interpretation |
|--------------------|-----------------|----------------|
| **R-squared**      | **0.543**       | Explains 54.3% of variance in `marketvalue` |
| **p-values**       | All **< 0.0001** ✅ | All predictors are statistically significant |
| **Durbin-Watson**  | 1.886           | Close to 2, so no strong first-order autocorrelation in the residuals |
| **Coefficient Significance** | All t-stats > 11 | Every coefficient is many standard errors away from zero |
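
To make the diagnostics in this table less of a black box, here is a rough NumPy sketch of the textbook formulas behind a nonrobust OLS summary: the t-statistic is each coefficient divided by its standard error, and Durbin-Watson is the sum of squared successive residual differences over the sum of squared residuals. The function name `ols_diagnostics` and the synthetic data are assumptions for illustration; the values in the table come from `model_sm.summary()`, not from this code.

```python
import numpy as np

def ols_diagnostics(X, y):
    """Return (coefficients, t-statistics, Durbin-Watson) for OLS with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])        # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)     # least-squares coefficients
    resid = y - A @ beta
    dof = A.shape[0] - A.shape[1]                    # residual degrees of freedom
    sigma2 = resid @ resid / dof                     # residual variance estimate
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    t_stats = beta / std_err
    dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
    return beta, t_stats, dw

# Synthetic data for illustration only (hypothetical coefficients)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

beta_demo, t_demo, dw_demo = ols_diagnostics(X_demo, y_demo)
print("coefficients: ", np.round(beta_demo, 3))
print("t-statistics: ", np.round(t_demo, 2))
print("Durbin-Watson:", round(float(dw_demo), 3))
```

A Durbin-Watson value always lies between 0 and 4, and values near 2 indicate little first-order autocorrelation, which is the basis for the reading of 1.886 above.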

---

### 🔍 Comparison with PE07-1-1 (Simple Regression)

| Model Type                      | R²      | RMSE     | Assessment      |
|---------------------------------|---------|----------|-----------------|
| `marketvalue ~ sales`           | 0.412   | 18.77    | ✅ Good baseline |
| `marketvalue ~ profits`         | 0.299   | 20.49    | ❌               |
| `marketvalue ~ assets`          | 0.206   | 21.81    | ❌               |
| **Multiple Regression (all 3)** | **0.543** | **16.54** | ✅ **Best overall** |
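
A comparison table like this one can be produced by fitting each single-predictor model and the full model in a loop. The sketch below does that with plain NumPy least squares on synthetic data; the helper `fit_r2_rmse`, the generated columns, and the coefficients are all assumptions for illustration, so the printed numbers will not match the Forbes2000 results.

```python
import numpy as np

def fit_r2_rmse(X, y):
    """Fit OLS with an intercept and return in-sample (R², RMSE)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = resid @ resid
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot, np.sqrt(ss_res / len(y))

# Synthetic stand-ins for the three predictors (hypothetical data)
rng = np.random.default_rng(42)
n = 500
demo_features = {
    "sales": rng.normal(size=n),
    "profits": rng.normal(size=n),
    "assets": rng.normal(size=n),
}
y_demo = (
    1.0
    + 2.0 * demo_features["sales"]
    + 1.5 * demo_features["profits"]
    + 0.5 * demo_features["assets"]
    + rng.normal(size=n)
)

# One simple regression per predictor, then the model with all three
results = {
    f"marketvalue ~ {name}": fit_r2_rmse(col, y_demo)
    for name, col in demo_features.items()
}
X_all = np.column_stack(list(demo_features.values()))
results["Multiple Regression (all 3)"] = fit_r2_rmse(X_all, y_demo)

for label, (r2, rmse) in results.items():
    print(f"{label:30s} R² = {r2:.3f}  RMSE = {rmse:.3f}")
```

Because each simple model is nested inside the full model, the full model's in-sample R² can never be lower and its RMSE can never be higher than any of the single-predictor fits.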

---

### ✅ Final Conclusion

The **multiple linear regression model outperforms all individual simple models**:

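- **R² increased from 0.412 → 0.543**, so the model explains **13.1 percentage points more of the variance** in `marketvalue` than `sales` alone. Adjusted R² is also 0.543, so the gain is not merely an artifact of adding predictors.
- **RMSE dropped from 18.77 → 16.54**, indicating a tighter fit on the data the model was trained on.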
- All three predictors (`sales`, `profits`, and `assets`) are **statistically significant** with meaningful positive coefficients.
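
The size of the improvement can be expressed either in absolute terms or relative to the `sales`-only baseline. The short calculation below uses only the R² and RMSE values reported in the comparison table; the variable names are illustrative.

```python
# Values taken from the comparison table above
r2_sales, r2_multi = 0.412, 0.543
rmse_sales, rmse_multi = 18.77, 16.54

r2_gain_points = (r2_multi - r2_sales) * 100               # absolute gain, in percentage points
r2_gain_relative = (r2_multi - r2_sales) / r2_sales * 100  # gain relative to the baseline R²
rmse_drop_relative = (rmse_sales - rmse_multi) / rmse_sales * 100

print(f"R² gain: {r2_gain_points:.1f} percentage points")      # 13.1
print(f"R² gain relative to baseline: {r2_gain_relative:.1f}%")  # 31.8
print(f"RMSE reduction: {rmse_drop_relative:.1f}%")              # 11.9
```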

> **Among the models compared, combining all three predictors gives the most accurate in-sample prediction of company market value.**

