In [1]:
import pandas as pd

# Load the data
df = pd.read_csv("w11/PracticalLab/data/Forbes2000.csv", encoding='ISO-8859-1', index_col=0)

# Check the structure
print(df.columns)
df.head()


Index(['rank', 'name', 'country', 'category', 'sales', 'profits', 'assets',
       'marketvalue'],
      dtype='object')


|   | rank | name | country | category | sales | profits | assets | marketvalue |
|---|------|------|---------|----------|-------|---------|--------|-------------|
| 1 | 1 | Citigroup | United States | Banking | 94.71 | 17.85 | 1264.03 | 255.30 |
| 2 | 2 | General Electric | United States | Conglomerates | 134.19 | 15.59 | 626.93 | 328.54 |
| 3 | 3 | American Intl Group | United States | Insurance | 76.66 | 6.46 | 647.66 | 194.87 |
| 4 | 4 | ExxonMobil | United States | Oil & gas operations | 222.88 | 20.96 | 166.99 | 277.02 |
| 5 | 5 | BP | United Kingdom | Oil & gas operations | 232.57 | 10.27 | 177.57 | 173.54 |


In [6]:
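from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Drop rows with a NaN in any of the four modelling columns (listwise deletion),
# so the three Scikit-Learn models below are all fitted on the same rows
df_clean = df[['sales', 'profits', 'assets', 'marketvalue']].dropna()


def run_slr_sklearn(X_col):
    """Fit marketvalue ~ X_col with Scikit-Learn and print in-sample metrics."""
    X = df_clean[[X_col]]        # 2-D feature matrix with a single column
    y = df_clean['marketvalue']  # target vector
    model = LinearRegression().fit(X, y)
    y_pred = model.predict(X)    # predictions on the training rows themselves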

    print(f"Predictor: {X_col}")
    print(f"  Intercept: {model.intercept_:.4f}")
    print(f"  Coefficient: {model.coef_[0]:.4f}")
    print(f"  R²: {r2_score(y, y_pred):.4f}")
    print(f"  RMSE: {np.sqrt(mean_squared_error(y, y_pred)):.4f}")
    print()


for col in ['sales', 'profits', 'assets']:
    run_slr_sklearn(col)


Predictor: sales
  Intercept: 3.4309
  Coefficient: 0.8722
  R²: 0.4122
  RMSE: 18.7696

Predictor: profits
  Intercept: 9.0064
  Coefficient: 7.5899
  R²: 0.2994
  RMSE: 20.4900

Predictor: assets
  Intercept: 8.1042
  Coefficient: 0.1114
  R²: 0.2061
  RMSE: 21.8123



In [4]:
import statsmodels.api as sm

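def run_slr_statsmodels(X_col):
    """Fit marketvalue ~ X_col with StatsModels OLS and print the full summary."""
    # Drop NaNs only in the two columns this model uses
    temp_df = df[[X_col, 'marketvalue']].dropna()
    X = sm.add_constant(temp_df[X_col])  # adds intercept term
    y = temp_df['marketvalue']
    model = sm.OLS(y, X).fit()
    print(model.summary())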


for col in ['sales', 'profits', 'assets']:
    print(f"=== StatsModels: marketvalue ~ {col} ===")
    run_slr_statsmodels(col)
    print()


=== StatsModels: marketvalue ~ sales ===
                            OLS Regression Results                            
Dep. Variable:            marketvalue   R-squared:                       0.412
Model:                            OLS   Adj. R-squared:                  0.412
Method:                 Least Squares   F-statistic:                     1401.
Date:                Mon, 07 Apr 2025   Prob (F-statistic):          7.59e-233
Time:                        17:00:55   Log-Likelihood:                -8700.0
No. Observations:                2000   AIC:                         1.740e+04
Df Residuals:                    1998   BIC:                         1.742e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      


### 🧾 PE07-1-1 — Simple Linear Regression (Sales, Profits, Assets)

We built **three simple linear regression models** to predict `marketvalue` using each of the following independent variables individually:
- `sales`
- `profits`
- `assets`

We used **both Scikit-Learn** and **StatsModels** to analyze and evaluate the models.
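
As a reference for what both libraries compute, a simple linear regression has a closed-form least-squares solution: the slope is the covariance of predictor and target divided by the variance of the predictor, and the intercept follows from the two means. The sketch below is a minimal pure-Python illustration; the helper name `fit_simple_ols` and the toy data are hypothetical and are not taken from the Forbes2000 file.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Toy data (hypothetical): the points lie exactly on y = 1 + 2x
x_toy = [0.0, 1.0, 2.0, 3.0]
y_toy = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_simple_ols(x_toy, y_toy)
print(f"Intercept: {intercept:.4f}, Coefficient: {slope:.4f}")
```

When Scikit-Learn and StatsModels are given the same rows, both should reproduce this solution up to floating-point error.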

---

#### 📊 Comparison Table (Summary of Results)

| Predictor | R² (Sklearn) | R² (StatsModels) | Coefficient (β) | RMSE (Sklearn) | p-value (StatsModels) |
|-----------|--------------|------------------|------------------|----------------|------------------------|
| **sales**   | **0.4122**     | **0.412**           | 0.8722 (Sklearn) / 0.8724 (StatsModels) | **18.77**       | **< 0.0001** ✅         |
| profits   | 0.2994       | 0.299            | 7.5899             | 20.49          | **< 0.0001** ✅         |
| assets    | 0.2061       | 0.206            | 0.1114             | 21.81          | **< 0.0001** ✅         |
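
The two slightly different `sales` coefficients (0.8722 and 0.8724) are plausibly explained by the two code paths fitting on different rows: the Scikit-Learn cell drops rows with a NaN in any of the four columns, while the StatsModels cell drops NaNs only in the predictor and `marketvalue`. The sketch below shows how the row counts could be compared. The helper `count_rows_used` and the toy frame are hypothetical; on the real data one would pass `df` instead.

```python
import numpy as np
import pandas as pd


def count_rows_used(frame, predictor, target="marketvalue",
                    all_cols=("sales", "profits", "assets", "marketvalue")):
    """Return (rows after listwise deletion, rows after pairwise deletion)."""
    listwise = frame[list(all_cols)].dropna()      # as in the Scikit-Learn cell
    pairwise = frame[[predictor, target]].dropna()  # as in the StatsModels cell
    return len(listwise), len(pairwise)


# Toy frame (hypothetical values) with one missing profit
toy = pd.DataFrame({
    "sales": [1.0, 2.0, 3.0, 4.0],
    "profits": [0.1, np.nan, 0.3, 0.4],
    "assets": [10.0, 20.0, 30.0, 40.0],
    "marketvalue": [2.0, 4.0, 6.0, 8.0],
})
n_listwise, n_pairwise = count_rows_used(toy, "sales")
print(f"listwise: {n_listwise} rows, pairwise: {n_pairwise} rows")
```

If the two counts differ on the real data, fitting both libraries on the same cleaned frame should make the coefficients agree.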

---

#### 🧠 Analysis & Interpretation

- **Sales** is the strongest individual predictor of `marketvalue`. It yields:
  - The **highest R² (~41.2%)** → explains the largest share of the variance in market value
  - The **lowest RMSE (~18.77)** → has the smallest in-sample prediction error
  - A statistically significant and interpretable coefficient (β ≈ 0.87): each additional unit of sales is associated with roughly 0.87 more units of market value

- **Profits** has a larger coefficient (~7.59), but coefficient size reflects the scale of the predictor rather than the strength of the fit. Its R² is lower (~30%) and its prediction error is higher (RMSE ~20.49).

- **Assets** is the weakest among the three, with the lowest R² (~21%) and highest RMSE (~21.81), though still statistically significant.
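
The two metrics used in the comparison above can be written out directly: R² is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual. The following is a minimal sketch with a hypothetical helper `r2_and_rmse` and made-up numbers; it is meant to mirror the definitions behind `r2_score` and `np.sqrt(mean_squared_error(...))`, not to replace them.

```python
import math


def r2_and_rmse(y, y_pred):
    """Return (R^2, RMSE) for observed values y and predictions y_pred."""
    n = len(y)
    y_mean = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))  # residual sum of squares
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)               # total sum of squares
    r2 = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r2, rmse


# Made-up observations and predictions (hypothetical, not Forbes2000 values)
y_obs = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 1.5, 3.5, 3.5]
r2_toy, rmse_toy = r2_and_rmse(y_obs, y_hat)
print(f"R²: {r2_toy:.4f}, RMSE: {rmse_toy:.4f}")
```

Because RMSE is expressed in the units of `marketvalue`, it can be compared directly across the three models, whereas the coefficients cannot.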

---

### ✅ Conclusion

> **Among the three independent variables (`sales`, `profits`, `assets`), `sales` is the best single predictor of `marketvalue`.**  
> It has the highest R² and the lowest in-sample RMSE, and the Scikit-Learn and StatsModels results agree on this ranking.

---
