# Stepwise Regression

In a stepwise regression, variables are added to or removed from the model according to their statistical significance. There are three variants:

- Forward selection starts with no variables and adds statistically significant ones until none of the variables outside the model is significant.
- Backward elimination starts with all the variables and removes those that are not statistically significant until only significant ones remain.
- Bidirectional elimination both adds and removes variables until every variable inside the model is significant and every variable outside it is not.
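The bidirectional variant can be sketched as a loop that alternates one forward step and one backward step. This is a minimal illustration, not a definitive implementation: `bidirectional_select`, `pvalues_for`, and `max_steps` are hypothetical names, and `pvalues_for(columns)` stands in for whatever fits a model on `columns` and returns a p-value per column (for example, a wrapper around an OLS fit). The `max_steps` guard is an assumption added because a single threshold for entry and exit can, in principle, cycle.

```python
def bidirectional_select(candidates, pvalues_for, alpha=0.05, max_steps=100):
    """Alternate forward and backward steps until the model stops changing.

    pvalues_for(columns) is a caller-supplied (hypothetical) function that
    fits a model on `columns` and returns a dict {column: p-value}.
    """
    selected = []                 # variables currently inside the model
    remaining = list(candidates)  # variables currently outside the model
    for _ in range(max_steps):
        changed = False

        # Forward step: add the outside variable with the smallest p-value,
        # provided that p-value is below alpha.
        best, best_p = None, alpha
        for col in remaining:
            p = pvalues_for(selected + [col])[col]
            if p < best_p:
                best, best_p = col, p
        if best is not None:
            selected.append(best)
            remaining.remove(best)
            changed = True

        # Backward step: drop the inside variable with the largest p-value,
        # provided that p-value is above alpha.
        if selected:
            pvals = pvalues_for(selected)
            worst = max(selected, key=lambda c: pvals[c])
            if pvals[worst] > alpha:
                selected.remove(worst)
                remaining.append(worst)
                changed = True

        if not changed:
            break
    return selected


# Toy p-values that do not depend on the model, purely for illustration.
TOY = {"a": 0.001, "b": 0.2, "c": 0.03}
print(bidirectional_select(["a", "b", "c"], lambda cols: {c: TOY[c] for c in cols}))
# ['a', 'c']
```

Real implementations often use a stricter threshold for entry than for removal, which makes cycling less likely.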

In [183]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')
df.head()

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,VEHICLE TYPE CODE 5,lat,long,hour,date,BROOK,BRONX,MANHAT,STATEN,QUEENS
0,01/01/2022,7:05,,,,,,EAST 128 STREET,3 AVENUE BRIDGE,,...,,0,0,7,1,0,0,0,0,0
1,01/01/2022,14:43,,,40.769993,-73.915825,"(40.769993, -73.915825)",GRAND CENTRAL PKWY,,,...,,1,1,14,1,0,0,0,0,0
2,01/01/2022,21:20,QUEENS,11414.0,40.65723,-73.84138,"(40.65723, -73.84138)",91 STREET,160 AVENUE,,...,,1,1,21,1,0,0,0,0,1
3,01/01/2022,4:30,,,,,,Southern parkway,Jfk expressway,,...,,0,0,4,1,0,0,0,0,0
4,01/01/2022,7:57,,,,,,WESTCHESTER AVENUE,SHERIDAN EXPRESSWAY,,...,,0,0,7,1,0,0,0,0,0


### Data scrubbing

In [184]:
df = pd.read_csv("../data/nyc_mv_collisions_202201.csv")

# 1 if a latitude was recorded (non-missing and non-zero), else 0
df["lat"] = df["LATITUDE"].fillna(0).ne(0).astype(int)

# 1 if a longitude was recorded (non-missing and non-zero), else 0
df["long"] = df["LONGITUDE"].fillna(0).ne(0).astype(int)

# Hour of the crash: the part of "H:MM" / "HH:MM" before the colon
df["hour"] = df["CRASH TIME"].str.split(":").str[0].astype(int)

# Day of the month: the "DD" part of "MM/DD/YYYY"
df["date"] = df["CRASH DATE"].str[3:5].astype(int)

# One 0/1 dummy per borough; rows with a missing BOROUGH get 0 in all five
borough_dummies = {
    "BROOK": "BROOKLYN",
    "BRONX": "BRONX",
    "MANHAT": "MANHATTAN",
    "STATEN": "STATEN ISLAND",
    "QUEENS": "QUEENS",
}
for column, borough in borough_dummies.items():
    df[column] = (df["BOROUGH"] == borough).astype(int)
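To make the derived columns concrete, here is a small standalone sketch of the same transformations applied to a single record, using only the standard library. It assumes the `MM/DD/YYYY` date and `H:MM` time formats seen in the first rows of the data; `derive_features` and `BOROUGH_COLUMNS` are hypothetical names introduced for illustration.

```python
from datetime import datetime

# Borough name in the data -> name of its 0/1 dummy column
BOROUGH_COLUMNS = {
    "BROOKLYN": "BROOK",
    "BRONX": "BRONX",
    "MANHATTAN": "MANHAT",
    "STATEN ISLAND": "STATEN",
    "QUEENS": "QUEENS",
}

def derive_features(crash_date, crash_time, borough):
    """Return the derived columns for one record as a dict."""
    # Assumes "MM/DD/YYYY" and "H:MM" or "HH:MM", as in the sample rows.
    stamp = datetime.strptime(f"{crash_date} {crash_time}", "%m/%d/%Y %H:%M")
    features = {"date": stamp.day, "hour": stamp.hour}
    # A missing borough leaves all five indicators at 0.
    for name, column in BOROUGH_COLUMNS.items():
        features[column] = int(borough == name)
    return features

row = derive_features("01/01/2022", "21:20", "QUEENS")
print(row)
# {'date': 1, 'hour': 21, 'BROOK': 0, 'BRONX': 0, 'MANHAT': 0, 'STATEN': 0, 'QUEENS': 1}
```

The printed values match the `date`, `hour`, and borough columns of the row with index 2 in the `df.head()` output.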

In [491]:
x_columns = ["date", "hour", "BROOK", "BRONX", "STATEN", "QUEENS", "MANHAT", "NUMBER OF PERSONS KILLED", "lat"]
y = df["NUMBER OF PERSONS INJURED"]

In [219]:
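def get_stats():
    # Fit OLS on the current predictors and print the summary.
    # No constant column is added, so the model has no intercept.
    x = df[x_columns]
    results = sm.OLS(y, x).fit()
    print(results.summary())

def rem_high():
    # One backward-elimination step: fit on the current predictors, then
    # drop the one with the largest p-value if it exceeds .05.
    # The printed summary is for the model before the removal.
    x = df[x_columns]
    results = sm.OLS(y, x).fit()
    worst = results.pvalues.idxmax()
    if results.pvalues[worst] > .05:
        x_columns.remove(worst)
    print(results.summary())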
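Each call to `rem_high()` removes at most one column, so the elimination is run by calling it repeatedly until nothing changes. A minimal sketch of automating that repetition is shown here. It assumes a step function that, like `rem_high()`, removes at most one entry from the list in place per call. `eliminate_until_stable`, `toy_step`, and the p-values in `TOY_PVALUES` are hypothetical and made up for illustration.

```python
def eliminate_until_stable(columns, step):
    """Call step(columns) until it stops shrinking the list.

    Returns the removed columns in the order they were dropped.
    """
    removed = []
    while True:
        before = list(columns)
        step(columns)
        if len(columns) == len(before):
            return removed
        removed.extend(c for c in before if c not in columns)


# Toy stand-in for a real fit: fixed, made-up p-values.
TOY_PVALUES = {"x1": 0.01, "x2": 0.40, "x3": 0.75, "x4": 0.03}

def toy_step(columns, alpha=0.05):
    """Remove the column with the largest p-value if it exceeds alpha."""
    if not columns:
        return
    worst = max(columns, key=TOY_PVALUES.get)
    if TOY_PVALUES[worst] > alpha:
        columns.remove(worst)

cols = ["x1", "x2", "x3", "x4"]
order = eliminate_until_stable(cols, toy_step)
print(order)  # ['x3', 'x2']
print(cols)   # ['x1', 'x4']
```

With the notebook's global `x_columns`, the same driver could be called as `eliminate_until_stable(x_columns, lambda _: rem_high())`, since `rem_high()` modifies that list in place.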

In [220]:
get_stats()
rem_high()

      date  hour  BROOK  BRONX  STATEN  QUEENS  MANHAT  \
0        1     7      0      0       0       0       0   
1        1    14      0      0       0       0       0   
2        1    21      0      0       0       1       0   
3        1     4      0      0       0       0       0   
4        1     7      0      0       0       0       0   
...    ...   ...    ...    ...     ...     ...     ...   
7654    31    12      0      0       0       1       0   
7655    31    21      1      0       0       0       0   
7656    31     9      1      0       0       0       0   
7657    31     6      0      1       0       0       0   
7658    31    14      0      0       0       1       0   

      NUMBER OF PERSONS KILLED  lat  
0                            0    0  
1                            0    1  
2                            0    1  
3                            0    0  
4                            0    0  
...                        ...  ...  
7654                         0    1  

In [178]:
rem_high()

                                    OLS Regression Results                                    
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.226
Model:                                   OLS   Adj. R-squared (uncentered):              0.225
Method:                        Least Squares   F-statistic:                              279.0
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               03:31:10   Log-Likelihood:                         -8476.1
No. Observations:                       7659   AIC:                                  1.697e+04
Df Residuals:                           7651   BIC:                                  1.702e+04
Df Model:                                  8                                                  
Covariance Type:                   nonrobust                                                  
                               coef    std err    

### The Bronx is removed

In [179]:
rem_high()

                                    OLS Regression Results                                    
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.226
Model:                                   OLS   Adj. R-squared (uncentered):              0.225
Method:                        Least Squares   F-statistic:                              318.8
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               03:31:12   Log-Likelihood:                         -8476.3
No. Observations:                       7659   AIC:                                  1.697e+04
Df Residuals:                           7652   BIC:                                  1.702e+04
Df Model:                                  7                                                  
Covariance Type:                   nonrobust                                                  
                               coef    std err    

### Staten Island is removed

In [180]:
rem_high()

                                    OLS Regression Results                                    
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.226
Model:                                   OLS   Adj. R-squared (uncentered):              0.225
Method:                        Least Squares   F-statistic:                              371.7
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               03:31:15   Log-Likelihood:                         -8477.0
No. Observations:                       7659   AIC:                                  1.697e+04
Df Residuals:                           7653   BIC:                                  1.701e+04
Df Model:                                  6                                                  
Covariance Type:                   nonrobust                                                  
                               coef    std err    

### Brooklyn is removed

In [181]:
rem_high()

                                    OLS Regression Results                                    
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.225
Model:                                   OLS   Adj. R-squared (uncentered):              0.225
Method:                        Least Squares   F-statistic:                              445.4
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               03:31:18   Log-Likelihood:                         -8478.3
No. Observations:                       7659   AIC:                                  1.697e+04
Df Residuals:                           7654   BIC:                                  1.700e+04
Df Model:                                  5                                                  
Covariance Type:                   nonrobust                                                  
                 coef    std err          t      P

### Persons Killed is removed

In [182]:
rem_high()

                                    OLS Regression Results                                    
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.225
Model:                                   OLS   Adj. R-squared (uncentered):              0.225
Method:                        Least Squares   F-statistic:                              555.8
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               03:31:21   Log-Likelihood:                         -8479.9
No. Observations:                       7659   AIC:                                  1.697e+04
Df Residuals:                           7655   BIC:                                  1.700e+04
Df Model:                                  4                                                  
Covariance Type:                   nonrobust                                                  
                 coef    std err          t      P

### Manhattan is removed

After these removals, every remaining variable has a p-value below the .05 threshold used in `rem_high()`. Because no constant was added, the fitted model has no intercept: <br>
y = date * .0017 + hour * .0137 - QUEENS * .0515 + lat * .2052
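As a worked example of the formula, the sketch below evaluates it for one record. The coefficients are copied from the formula above; `predict_injured` and `COEFFICIENTS` are hypothetical names, and the example record uses the values of the row with index 2 in the `df.head()` output (day 1, hour 21, in Queens, latitude recorded).

```python
# Coefficients copied from the fitted formula (no intercept term).
COEFFICIENTS = {"date": 0.0017, "hour": 0.0137, "QUEENS": -0.0515, "lat": 0.2052}

def predict_injured(row):
    """Evaluate the fitted formula for one record given as a dict."""
    return sum(coef * row[name] for name, coef in COEFFICIENTS.items())

example = {"date": 1, "hour": 21, "QUEENS": 1, "lat": 1}
print(round(predict_injured(example), 4))  # 0.4431
```

Term by term this is 1 × 0.0017 + 21 × 0.0137 − 0.0515 + 0.2052 = 0.4431 predicted persons injured.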

In [492]:
x_col = []  # predictors selected so far

def crunch_num():
    # First forward step: fit one single-predictor model per candidate and
    # keep the candidate with the smallest p-value, if it is below .05.
    p_val = 1
    location = 0
    for i in range(len(x_columns)):
        x = df[x_columns[i]]
        results = sm.OLS(y, x).fit()
        candidate_p = results.pvalues.iloc[0]  # the only coefficient
        print(p_val)
        print(candidate_p)
        if p_val > candidate_p:
            p_val = candidate_p
            location = i
        print(results.summary())
    if p_val < .05:
        print(location)
        # Move the chosen predictor from the candidates to the model
        x_col.append(x_columns.pop(location))

In [493]:
crunch_num()

1
1.4323814844141276e-306
                                    OLS Regression Results                                    
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.167
Model:                                   OLS   Adj. R-squared (uncentered):              0.167
Method:                        Least Squares   F-statistic:                              1537.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                   1.43e-306
Time:                               05:18:18   Log-Likelihood:                         -8755.8
No. Observations:                       7659   AIC:                                  1.751e+04
Df Residuals:                           7658   BIC:                                  1.752e+04
Df Model:                                  1                                                  
Covariance Type:                   nonrobust                                                  
                 coef   

### Start by adding Hour

In [494]:
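def crunch_nums():
    # Later forward steps: add each remaining candidate in turn to the
    # already-selected predictors and keep the one whose coefficient has
    # the smallest p-value, if it is below .05.
    p_val = 1
    location = 0
    # x_columns already excludes the selected predictors, so every
    # remaining entry is a candidate.
    for i in range(len(x_columns)):
        x = df[x_col + [x_columns[i]]]
        results = sm.OLS(y, x).fit()
        print(results.summary())
        candidate_p = results.pvalues.iloc[-1]  # candidate is the last column
        print(p_val)
        print(candidate_p)
        if p_val > candidate_p:
            p_val = candidate_p
            location = i
    if p_val < .05:
        print(location)
        # Move the chosen predictor from the candidates to the model
        x_col.append(x_columns.pop(location))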

In [495]:
crunch_nums()

                                    OLS Regression Results                                    
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.215
Model:                                   OLS   Adj. R-squared (uncentered):              0.215
Method:                        Least Squares   F-statistic:                              1049.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:18:31   Log-Likelihood:                         -8529.0
No. Observations:                       7659   AIC:                                  1.706e+04
Df Residuals:                           7657   BIC:                                  1.708e+04
Df Model:                                  2                                                  
Covariance Type:                   nonrobust                                                  
                 coef    std err          t      P

### Add Date

In [460]:
crunch_nums()

                                    OLS Regression Results                                    
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.215
Model:                                   OLS   Adj. R-squared (uncentered):              0.215
Method:                        Least Squares   F-statistic:                              1049.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:12:14   Log-Likelihood:                         -8529.0
No. Observations:                       7659   AIC:                                  1.706e+04
Df Residuals:                           7657   BIC:                                  1.708e+04
Df Model:                                  2                                                  
Covariance Type:                   nonrobust                                                  
                 coef    std err          t      P

### Add Bronx

In [502]:
crunch_nums()

                                    OLS Regression Results                                    
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.216
Model:                                   OLS   Adj. R-squared (uncentered):              0.216
Method:                        Least Squares   F-statistic:                              421.9
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:19:25   Log-Likelihood:                         -8524.1
No. Observations:                       7659   AIC:                                  1.706e+04
Df Residuals:                           7654   BIC:                                  1.709e+04
Df Model:                                  5                                                  
Covariance Type:                   nonrobust                                                  
                 coef    std err          t      P

### Add Brooklyn

The forward-selected model contains Hour, Date, Bronx, and Brooklyn, a very different outcome from the backward elimination above, which kept date, hour, QUEENS, and lat.