## Mod 6 Lecture 4 Code-Along: Simple Linear Regression

### Purpose

**Purpose**: Use both the statsmodels and sklearn libraries to fit a simple linear regression model. The model is fitted on the full data (FOR NOW) just to learn the mechanics. Validation comes in future lectures.

### Data
We are looking at ONE month's worth of data (December 2023, to be exact). Remember that the data dictionary is found [HERE](https://data.cityofnewyork.us/Transportation/2023-Yellow-Taxi-Trip-Data/4b4i-vvec/about_data).

### Step 0:  Read in Data & EXPLORE 

**Remember**: EDA takes up MOST of your time, so spend time on EDA before ANY modeling.

In [1]:
# import packages
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np

In [2]:
# Read in data and look at it!
path = '/Users/Marcy_Student/Desktop/Marcy-Modules/marcy-git/DA2025_Lectures/Mod6/data/2023_Yellow_Taxi_Trip_Data_20251015.csv'

df = pd.read_csv(path, low_memory=False)
display(df.head())
display(df.info())  # df.info() prints its report and returns None, so display() also shows "None"
display(df.describe())

| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 12/01/2023 04:11:39 PM | 12/01/2023 04:19:13 PM | 2.0 | 0.69 | 1.0 | N | 141 | 140 | 1 | 7.9 | 2.5 | 0.5 | 3 | 0.0 | 1.0 | 17.4 | 2.5 | 0.0 |
| 1 | 1 | 12/01/2023 04:11:39 PM | 12/01/2023 04:20:41 PM | 3.0 | 1.1 | 1.0 | N | 236 | 263 | 2 | 10.0 | 5.0 | 0.5 | 0 | 0.0 | 1.0 | 16.5 | 2.5 | 0.0 |
| 2 | 2 | 12/01/2023 04:11:39 PM | 12/01/2023 04:20:38 PM | 1.0 | 1.57 | 1.0 | N | 48 | 239 | 4 | -10.7 | -2.5 | -0.5 | 0 | 0.0 | -1.0 | -17.2 | -2.5 | 0.0 |
| 3 | 2 | 12/01/2023 04:11:39 PM | 12/01/2023 04:20:38 PM | 1.0 | 1.57 | 1.0 | N | 48 | 239 | 4 | 10.7 | 2.5 | 0.5 | 0 | 0.0 | 1.0 | 17.2 | 2.5 | 0.0 |
| 4 | 1 | 12/01/2023 04:11:39 PM | 12/01/2023 04:34:39 PM | 2.0 | 3.0 | 1.0 | N | 164 | 211 | 1 | 21.9 | 5.0 | 0.5 | 3 | 0.0 | 1.0 | 31.4 | 2.5 | 0.0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3310907 entries, 0 to 3310906
Data columns (total 19 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   VendorID               int64  
 1   tpep_pickup_datetime   object 
 2   tpep_dropoff_datetime  object 
 3   passenger_count        float64
 4   trip_distance          object 
 5   RatecodeID             float64
 6   store_and_fwd_flag     object 
 7   PULocationID           int64  
 8   DOLocationID           int64  
 9   payment_type           int64  
 10  fare_amount            object 
 11  extra                  float64
 12  mta_tax                float64
 13  tip_amount             object 
 14  tolls_amount           float64
 15  improvement_surcharge  float64
 16  total_amount           object 
 17  congestion_surcharge   float64
 18  airport_fee            float64
dtypes: float64(8), int64(4), object(7)
memory usage: 479.9+ MB


None

| | VendorID | passenger_count | RatecodeID | PULocationID | DOLocationID | payment_type | extra | mta_tax | tolls_amount | improvement_surcharge | congestion_surcharge | airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3310907.0 | 3133527.0 | 3133527.0 | 3310907.0 | 3310907.0 | 3310907.0 | 3310907.0 | 3310907.0 | 3310907.0 | 3310907.0 | 3133527.0 | 3133527.0 |
| mean | 1.750368 | 1.40956 | 1.784972 | 165.0685 | 163.9182 | 1.168633 | 1.48524 | 0.4828656 | 0.5734011 | 0.9758456 | 2.270032 | 0.1367974 |
| std | 0.4356449 | 0.9117169 | 8.283274 | 64.2833 | 69.68454 | 0.5959677 | 1.814139 | 0.1206643 | 2.228458 | 0.2170912 | 0.8078364 | 0.4796868 |
| min | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | -39.17 | -0.5 | -70.0 | -1.0 | -2.5 | -1.75 |
| 25% | 1.0 | 1.0 | 1.0 | 132.0 | 113.0 | 1.0 | 0.0 | 0.5 | 0.0 | 1.0 | 2.5 | 0.0 |
| 50% | 2.0 | 1.0 | 1.0 | 162.0 | 162.0 | 1.0 | 1.0 | 0.5 | 0.0 | 1.0 | 2.5 | 0.0 |
| 75% | 2.0 | 2.0 | 1.0 | 234.0 | 234.0 | 1.0 | 2.5 | 0.5 | 0.0 | 1.0 | 2.5 | 0.0 |
| max | 6.0 | 9.0 | 99.0 | 265.0 | 265.0 | 4.0 | 51.68 | 42.17 | 161.38 | 1.0 | 2.5 | 1.75 |
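
The `info()` output lists `trip_distance`, `fare_amount`, `tip_amount`, and `total_amount` as `object`, and `head()` shows a negative fare, so two quick EDA checks are worth running before modeling. The sketch below uses a tiny made-up `sample` frame (an assumption, not the taxi file) to show one way to count entries that fail a plain numeric conversion and to count negative amounts.

```python
import pandas as pd

# Tiny hypothetical sample that mimics problems seen in the real file:
# numeric columns stored as text, a negative fare, and a missing value.
sample = pd.DataFrame({
    "fare_amount": ["7.9", "10.0", "-10.7", "1,250.00"],
    "trip_distance": ["0.69", "1.1", "1.57", None],
})
cols = ["fare_amount", "trip_distance"]

# Try a plain numeric conversion; values that cannot be parsed become NaN.
converted = sample[cols].apply(pd.to_numeric, errors="coerce")

# Entries that were present in the raw data but failed to convert.
failed = converted.isna() & sample[cols].notna()
print("Unparseable entries per column:")
print(failed.sum())

# Negative amounts (like the -10.7 fare in the head() output) deserve a look.
print("Negative fares:", (converted["fare_amount"] < 0).sum())
```

On the real data, the same checks would run on `df` with the four `object` columns in place of `cols`.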


### Step 1: Regress `tip_pct` on `trip_distance` to make inferences using **statsmodels**

**Note**:  Look up statsmodels documentation BEFORE modeling so you see all the methods and attributes available to you! 

In [3]:
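# Coerce fare, tip, distance to numeric safely
num_cols = ['fare_amount', 'tip_amount', 'trip_distance']
for c in num_cols:
    # Strip whitespace, drop every character that cannot appear in a number
    # (e.g. '$' or ','), then convert; anything still unparseable becomes NaN
    df[c] = pd.to_numeric(
        df[c].astype(str).str.strip().str.replace(r'[^0-9.+\-eE]', '', regex=True),
        errors='coerce'
    )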
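
To see what the coercion step does to individual values, it can help to run the same cleaning expression on a few strings. The `raw` series below is made up for illustration (an assumption; the actual bad values in the taxi file are not shown in this notebook).

```python
import pandas as pd

# Hypothetical raw strings of the kind that force a column to object dtype.
raw = pd.Series(["7.9", " 10.0 ", "$21.90", "1,250.00", "abc", None])

# Same cleaning expression as in the cell above, applied to one Series.
cleaned = pd.to_numeric(
    raw.astype(str).str.strip().str.replace(r"[^0-9.+\-eE]", "", regex=True),
    errors="coerce",
)

# Currency symbols, commas, and padding are stripped before parsing;
# text with no digits left over ends up as NaN.
print(pd.DataFrame({"raw": raw, "cleaned": cleaned}))
```

One design consequence: the regex removes characters rather than rejecting the value, so a string like `1,250.00` is kept as a number instead of being discarded.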

In [5]:
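# Clean the data and add a new column called 'tip_pct' as the target
# Keep only trips with a positive fare, a non-negative tip, and a positive distance
df = df[(df['fare_amount'] > 0) & (df['tip_amount'] >= 0) & (df['trip_distance'] > 0)]
# .copy() gives an independent frame so the column assignment below is unambiguous
df = df.replace([np.inf, -np.inf], np.nan).dropna(subset=num_cols).copy()
# Tip as a fraction of the fare, capped to the range [0, 1]
df['tip_pct'] = (df['tip_amount']/df['fare_amount']).clip(0,1)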

In [6]:
X = sm.add_constant(df[['trip_distance']])  # statsmodels does not add an intercept; add the constant column yourself
y = df['tip_pct']
model = sm.OLS(y, X).fit()
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | tip_pct | R-squared: | 0.0 |
| Model: | OLS | Adj. R-squared: | 0.0 |
| Method: | Least Squares | F-statistic: | 73.62 |
| Date: | Mon, 27 Oct 2025 | Prob (F-statistic): | 9.46e-18 |
| Time: | 10:53:02 | Log-Likelihood: | 1746800.0 |
| No. Observations: | 3200255 | AIC: | -3494000.0 |
| Df Residuals: | 3200253 | BIC: | -3494000.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.1999 | 7.84e-05 | 2549.643 | 0.000 | 0.200 | 0.200 |
| trip_distance | -4.257e-06 | 4.96e-07 | -8.580 | 0.000 | -5.23e-06 | -3.28e-06 |

| | | | |
|---|---|---|---|
| Omnibus: | 67805.313 | Durbin-Watson: | 1.968 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 120891.093 |
| Skew: | 0.172 | Prob(JB): | 0.0 |
| Kurtosis: | 3.888 | Cond. No. | 158.0 |


#### Look at the coefficient and p-value. Do they make sense? How would you interpret the coefficient and the p-value?

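Answer: The slope of -4.257e-06 means each additional mile of `trip_distance` is associated with a decrease of about 0.0000043 in `tip_pct` (a fraction between 0 and 1), which is roughly 0.0004 percentage points. The p-value prints as 0.000, so the slope is statistically distinguishable from zero, but R-squared is also 0.000: distance explains essentially none of the variation in tip percentage. With about 3.2 million observations, even a negligible effect can be statistically significant. The intercept of 0.1999 says the predicted tip is about 20% of the fare as distance approaches zero.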

### Step 2: Regress `tip_pct` on `trip_distance` to make inferences using **sklearn**

**Note**:  Look up sklearn documentation BEFORE modeling so you see all the methods and attributes available to you! 

In [8]:
X = df[['trip_distance']]
y = df['tip_pct']
lin = LinearRegression().fit(X, y)  # fit_intercept=True by default, so no constant column is needed
print('Intercept (β₀):', lin.intercept_)
print('Slope (β₁):', lin.coef_[0])

Intercept (β₀): 0.19985730341232713
Slope (β₁): -4.2574593050381635e-06


#### Did sklearn give you the same results as statsmodels?  What are the code differences? 

### Step 3: Make an inference for the business (NYC Transportation Commission)


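Answer: In this December 2023 data, tip percentage is essentially unrelated to trip distance. The fitted line predicts a tip of about 20% of the fare at any distance, and distance explains almost none of the variation in tipping (R-squared of 0.000). Distance alone is therefore not a useful basis for decisions about tipping behavior. Because the model was fitted on the full data without validation, treat this as a descriptive finding rather than a validated prediction.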
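
To put the fitted line in business terms, the coefficients can be plugged in for a short and a long trip. The sketch below copies the intercept and slope from the sklearn output in Step 2; `predicted_tip_pct` is a hypothetical helper written for this example, and the 1-mile and 20-mile trips are arbitrary illustration values.

```python
# Coefficients copied from the sklearn output in Step 2.
intercept = 0.19985730341232713
slope = -4.2574593050381635e-06


def predicted_tip_pct(miles):
    """Predicted tip as a fraction of the fare for a trip of `miles` (hypothetical helper)."""
    return intercept + slope * miles


short_trip = predicted_tip_pct(1)
long_trip = predicted_tip_pct(20)

# Express the gap in percentage points (fraction * 100).
gap_pp = (short_trip - long_trip) * 100
print(f"1-mile trip:  {short_trip:.4%}")
print(f"20-mile trip: {long_trip:.4%}")
print(f"Difference:   {gap_pp:.4f} percentage points")
```

The two predictions differ by less than a hundredth of a percentage point, which is the practical meaning of a slope this small.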