# DS 6021: Introduction to Predictive Modeling Class Activity 2

Shiraz Robinson II

Faizan Khan

Thomas Blalock

## Part 1: Transforming Predictors

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
df = pd.read_csv("PlanetsData.csv")
df.head(3)

Unnamed: 0,planet,distance,diameter,revolution,position
0,Mercury,36,3030,88,1
1,Venus,67,7520,225,2
2,Earth,93,7926,365,3


In [None]:
# Part 1. 1 Visualize the distribution of the predictor variable distance

px.box(df, x="distance")

**Description:**

- The distribution is right-skewed (positively skewed)

- Most planets are clustered at lower distances (inner planets), while a few outer planets (Uranus, Neptune, Pluto) have extremely large distances

- With a predictor this skewed, the few distant planets have high leverage and can dominate a linear fit
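
The skew described above can be quantified rather than judged by eye. The sketch below is a minimal check, assuming the nine distance values shown in the notebook's output tables. The helper `sample_skewness` is hypothetical (not part of the notebook) and computes the simple moment-based skewness coefficient for distance and, for comparison, for its natural log, which the next step uses.

```python
import numpy as np

# Distances copied from the notebook's output tables (Mercury through Pluto).
distance = np.array([36, 67, 93, 142, 484, 887, 1765, 2791, 3654], dtype=float)


def sample_skewness(values):
    """Moment-based skewness: mean cubed deviation over (mean squared deviation)^1.5.

    Hypothetical helper for illustration; positive values indicate a longer right tail.
    """
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


raw_skew = sample_skewness(distance)
log_skew = sample_skewness(np.log(distance))

print(f"skewness of distance:      {raw_skew:.2f}")
print(f"skewness of log(distance): {log_skew:.2f}")
```

A clearly positive value for the raw distances and a value near zero for the logged distances would agree with the box plots, though with only nine observations any skewness estimate is rough.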

In [None]:
# Part 1. 2. a. Visualize the distribution of log distance

df["log_distance"] = np.log(df["distance"])
px.box(df, x="log_distance")

In [None]:
# Part 1. 2. b. Create a scatterplot of log distance vs. revolution.

px.scatter(df, x="log_distance", y="revolution", trendline="ols")

In [None]:
df["log_revolution"] = np.log(df["revolution"])
px.scatter(df, x="log_distance", y="log_revolution", trendline="ols")

In [None]:
df['sqrt_distance'] = np.sqrt(df['distance'])
df['sqrt_revolution'] = np.sqrt(df['revolution'])
px.scatter(df, x="sqrt_distance", y="sqrt_revolution", trendline="ols")

In [None]:
df['inverse_distance'] = 1 / df['distance']
df['inverse_revolution'] = 1 / df['revolution']
px.scatter(df, x="inverse_distance", y="inverse_revolution", trendline="ols")

**Explanation:**

Based on the scatterplot, a linear regression model does not seem appropriate for predicting revolution from log_distance. The log_distance vs. revolution plot shows a clearly curved, convex pattern, and the line of best fit does not track the data points. The square root and inverse plots do not look linear either.
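
As a rough numerical companion to the plots, the squared correlation of each transformed pair can be compared. This is only a sketch, assuming the distance and revolution values shown in the notebook's output tables; the dictionary names are illustrative and not from the notebook.

```python
import numpy as np

# Values copied from the notebook's output tables (Mercury through Pluto).
distance = np.array([36, 67, 93, 142, 484, 887, 1765, 2791, 3654], dtype=float)
revolution = np.array([88, 225, 365, 687, 4332, 10760, 30684, 60188, 90467], dtype=float)

# Candidate (predictor, response) pairs plotted in the cells above.
pairs = {
    "log_distance vs revolution": (np.log(distance), revolution),
    "log_distance vs log_revolution": (np.log(distance), np.log(revolution)),
    "sqrt_distance vs sqrt_revolution": (np.sqrt(distance), np.sqrt(revolution)),
    "inverse_distance vs inverse_revolution": (1 / distance, 1 / revolution),
}

# For a one-predictor least-squares line, R^2 equals the squared Pearson correlation.
r_squared = {name: np.corrcoef(x, y)[0, 1] ** 2 for name, (x, y) in pairs.items()}

for name, value in r_squared.items():
    print(f"{name:40s} R^2 = {value:.4f}")
```

A high R² does not by itself establish linearity, since a smooth curve can still correlate strongly with a straight line, so the scatterplots remain the primary check.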

In [None]:
# Part 1. 3. Fit a linear model

import statsmodels.api as sm
x = sm.add_constant(df["log_distance"])
y = df["revolution"]
model = sm.OLS(y, x).fit()
model.summary()

# Part 1. 3. a. Report the estimated regression equation  :

# revolution = -7.171e+04 + 1.565e+04 * log_distance

0,1,2,3
Dep. Variable:,revolution,R-squared:,0.681
Model:,OLS,Adj. R-squared:,0.636
Method:,Least Squares,F-statistic:,14.96
Date:,"Fri, 12 Sep 2025",Prob (F-statistic):,0.00615
Time:,15:29:56,Log-Likelihood:,-100.64
No. Observations:,9,AIC:,205.3
Df Residuals:,7,BIC:,205.7
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-7.171e+04,2.51e+04,-2.857,0.024,-1.31e+05,-1.24e+04
log_distance,1.565e+04,4046.842,3.868,0.006,6085.024,2.52e+04

0,1,2,3
Omnibus:,0.41,Durbin-Watson:,0.614
Prob(Omnibus):,0.815,Jarque-Bera (JB):,0.37
Skew:,0.366,Prob(JB):,0.831
Kurtosis:,2.329,Cond. No.,24.3


In [None]:
# Part 1. 3. b. Compute the residual associated with each prediction of revolution for the planets

df['pred_revolution'] = -7.171e+04 + 1.565e+04 * df['log_distance']
df['residuals'] = df['revolution'] - df['pred_revolution']
df[["distance", "revolution", "pred_revolution", 'residuals']]

Unnamed: 0,distance,revolution,pred_revolution,residuals
0,36,88,-15627.928613,15715.928613
1,67,225,-5906.560507,6131.560507
2,93,365,-774.817932,1139.817932
3,142,687,5848.693451,-5161.693451
4,484,4332,25039.62879,-20707.62879
5,887,10760,34519.773973,-23759.773973
6,1765,30684,45287.928421,-14603.928421
7,2791,60188,52459.529405,7728.470595
8,3654,90467,56675.991583,33791.008417


**Comment on the fit of the model:**

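- The model captures the general upward trend but misses the curvature: residuals are positive for the three innermost and two outermost planets and negative for the four in between, and the predicted revolution is negative for Mercury, Venus, and Earth. This suggests the target variable needs a transformation too.

- Linear regression on log_distance alone performs poorly here because of the nonlinearity and the non-constant residual spread. To improve the fit, we should also transform the target variable.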
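
The residuals can also be computed from the unrounded least-squares coefficients instead of the rounded values typed in from the summary table. The sketch below uses `np.polyfit` as a stand-in for the notebook's statsmodels fit (an assumption, made to keep the example dependency-light), with the data copied from the output tables.

```python
import numpy as np

# Values copied from the notebook's output tables (Mercury through Pluto).
distance = np.array([36, 67, 93, 142, 484, 887, 1765, 2791, 3654], dtype=float)
revolution = np.array([88, 225, 365, 687, 4332, 10760, 30684, 60188, 90467], dtype=float)
log_distance = np.log(distance)

# A degree-1 polyfit returns [slope, intercept] of the least-squares line.
slope, intercept = np.polyfit(log_distance, revolution, deg=1)

predicted = intercept + slope * log_distance
residuals = revolution - predicted

print(f"revolution = {intercept:.1f} + {slope:.1f} * log_distance")
print("planets with a negative predicted revolution:", int((predicted < 0).sum()))
print("residual signs:", np.sign(residuals).astype(int))
```

With the statsmodels result already in the notebook, `model.fittedvalues` and `model.resid` should give the same quantities without retyping coefficients.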

## Part 2: Transforming the Target Variable

In [None]:
# Part 2. 1. Visualize the distribution of the target variable revolution

px.box(df, x="revolution")

**Explanation:**

- The target variable is heavily right-skewed: most planets have short revolution periods, and the outer planets produce an extremely long upper tail.

In [None]:
# Part 2. 2. a. Visualize the distribution of log revolution

df["log_revolution"] = np.log(df["revolution"])
px.box(df, x="log_revolution")

In [None]:
# Part 2. 2. b. Create a scatterplot of log distance vs. log revolution

df["log_revolution"] = np.log(df["revolution"])
px.scatter(df, x="log_distance", y="log_revolution", trendline="ols")

**Explanation:**

- The distribution of log_revolution is much more symmetric.

- Skewness is significantly reduced.

- Variance appears more stable across the range.

In [None]:
# Part 2. 3. Fit a linear regression model using OLS with log revolution as the response and log distance as the predictor

import statsmodels.api as sm
x = sm.add_constant(df["log_distance"])
y = df["log_revolution"]
model2 = sm.OLS(y, x).fit()
model2.summary()

# Part 2. 3. a. Report the estimated regression equation

# log_revolution = -0.9031 + 1.5013 * log_distance

0,1,2,3
Dep. Variable:,log_revolution,R-squared:,1.0
Model:,OLS,Adj. R-squared:,1.0
Method:,Least Squares,F-statistic:,1585000.0
Date:,"Fri, 12 Sep 2025",Prob (F-statistic):,5.26e-20
Time:,15:39:21,Log-Likelihood:,34.696
No. Observations:,9,AIC:,-65.39
Df Residuals:,7,BIC:,-65.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-0.9031,0.007,-122.147,0.000,-0.921,-0.886
log_distance,1.5013,0.001,1259.092,0.000,1.498,1.504

0,1,2,3
Omnibus:,4.152,Durbin-Watson:,2.345
Prob(Omnibus):,0.125,Jarque-Bera (JB):,1.862
Skew:,1.112,Prob(JB):,0.394
Kurtosis:,2.882,Cond. No.,24.3


In [None]:
# Part 2. 3. b. Compute the residual associated with each prediction of revolution for the planets

df['pred_log_revolution'] = -0.9031 + 1.5013 * df['log_distance']
df['residuals_of_log_model'] = df['log_revolution'] - df['pred_log_revolution']
df[["distance", "log_revolution", "pred_log_revolution", 'residuals_of_log_model']]

Unnamed: 0,distance,log_revolution,pred_log_revolution,residuals_of_log_model
0,36,4.477337,4.476837,0.0005
1,67,5.4161,5.409405,0.006695
2,93,5.899897,5.901692,-0.001794
3,142,6.532334,6.537083,-0.004749
4,484,8.373785,8.378064,-0.004279
5,887,9.283591,9.287492,-0.003901
6,1765,10.331497,10.320478,0.011019
7,2791,11.005228,11.008447,-0.003219
8,3654,11.41274,11.412931,-0.000191


In [None]:
# Part 2. 3. b. (continued) Back-transform the log-scale predictions to the original revolution scale

df['pred_revolution_from_log_model'] = np.exp(df['pred_log_revolution'])
df[["distance", "revolution", "pred_revolution_from_log_model"]]

Unnamed: 0,distance,revolution,pred_revolution_from_log_model
0,36,88,87.956026
1,67,225,223.498573
2,93,365,365.655495
3,142,687,690.270232
4,484,4332,4350.578355
5,887,10760,10802.054991
6,1765,30684,30347.749186
7,2791,60188,60382.055904
8,3654,90467,90484.265841


**Explanation:**

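- Excellent fit. The log-log transformation linearizes the relationship and stabilizes variance: every log-scale residual is at most about 0.011 in absolute value, and the back-transformed predictions track the observed revolutions closely.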


**Improvement Over Part 1:**

- Part 1 had large residuals and curvature.

- This model has tight residuals and a strong linear fit.

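- R² rises from 0.681 to 1.000 (as rounded in the summary output), indicating much better explanatory power, although the two values are measured on different response scales.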
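
Because the two models use different response scales (revolution versus log revolution), one way to compare them on common ground is the root mean squared error of the predictions in the original units. The following is a minimal sketch under stated assumptions: the data are copied from the output tables, `np.polyfit` stands in for the statsmodels fits, and the `rmse` helper is hypothetical.

```python
import numpy as np

# Values copied from the notebook's output tables (Mercury through Pluto).
distance = np.array([36, 67, 93, 142, 484, 887, 1765, 2791, 3654], dtype=float)
revolution = np.array([88, 225, 365, 687, 4332, 10760, 30684, 60188, 90467], dtype=float)
log_distance = np.log(distance)

# Part 1 model: revolution regressed on log_distance.
slope1, intercept1 = np.polyfit(log_distance, revolution, deg=1)
pred_part1 = intercept1 + slope1 * log_distance

# Part 2 model: log_revolution regressed on log_distance, back-transformed with exp.
slope2, intercept2 = np.polyfit(log_distance, np.log(revolution), deg=1)
pred_part2 = np.exp(intercept2 + slope2 * log_distance)


def rmse(actual, predicted):
    """Root mean squared error, in the units of `actual` (hypothetical helper)."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


rmse_part1 = rmse(revolution, pred_part1)
rmse_part2 = rmse(revolution, pred_part2)

print(f"Part 1 RMSE (original scale): {rmse_part1:,.0f}")
print(f"Part 2 RMSE (original scale): {rmse_part2:,.0f}")
```

Both errors are computed on the same nine planets used for fitting, so this measures in-sample fit only.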

## Part 2.4 Using algebraic manipulation, show that squared revolution is directly proportional to cubic distance




$\ln(revolution) = -0.9031 + 1.5013 \ln(distance)$<br>
$2 \ln(revolution) = -1.8062 + 3.0026 \ln(distance) \approx -1.806 + 3 \ln(distance)$<br>
$e^{2 \ln(revolution)} \approx e^{-1.806} \cdot e^{3 \ln(distance)}$<br>
$revolution^2 \approx 0.164 \cdot distance^3$<br>
$revolution^2 \propto distance^3$
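
The proportionality can be checked directly against the data: if squared revolution is proportional to cubed distance, the ratio of the two should be roughly the same for every planet and close to the constant implied by the fitted intercept. This sketch assumes the values shown in the notebook's output tables; the variable names are illustrative.

```python
import numpy as np

# Values copied from the notebook's output tables (Mercury through Pluto).
distance = np.array([36, 67, 93, 142, 484, 887, 1765, 2791, 3654], dtype=float)
revolution = np.array([88, 225, 365, 687, 4332, 10760, 30684, 60188, 90467], dtype=float)

# Ratio of squared revolution to cubed distance, one value per planet.
ratio = revolution ** 2 / distance ** 3

# Constant implied by doubling the fitted intercept and exponentiating.
implied_constant = np.exp(2 * -0.9031)

print("revolution^2 / distance^3 per planet:", np.round(ratio, 4))
print(f"min = {ratio.min():.4f}, max = {ratio.max():.4f}")
print(f"constant implied by the fitted intercept: {implied_constant:.4f}")
```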