# Multivariate Regression

Let's grab a small data set of Blue Book car values:

In [1]:
import pandas as pd

df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
df.head()

,Price,Mileage,Make,Model,Trim,Type,Cylinder,Liter,Doors,Cruise,Sound,Leather
0,17314.103129,8221,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,1,1
1,17542.036083,9135,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,1,0
2,16218.847862,13196,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,1,0
3,16336.91314,16342,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,0,0
4,16339.170324,19832,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,0,1


We can use pandas to split this table into the feature columns we're interested in and the value we're trying to predict.

Note how we are avoiding the make and model; regressions don't work well with categorical values unless you can convert them into some numerical order that makes sense.
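One common alternative to inventing a numerical order is one-hot encoding, where each category becomes its own 0/1 column. The sketch below is only an illustration of that idea on made-up rows (the `toy` frame and its values are hypothetical, not taken from cars.xls), and it is not something this notebook goes on to use:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from cars.xls.
toy = pd.DataFrame({
    'Make': ['Buick', 'Saab', 'Buick', 'Chevrolet'],
    'Mileage': [8221, 12000, 19832, 30500],
})

# One 0/1 indicator column per make; dtype=int keeps the output numeric
# rather than boolean.
encoded = pd.get_dummies(toy, columns=['Make'], dtype=int)
print(encoded)
```

If the model also has an intercept, passing `drop_first=True` to `pd.get_dummies` drops one indicator per category so the columns are not perfectly collinear with the constant.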

Let's scale our feature data into the same range so we can easily compare the coefficients we end up with.

* price = a + b1 \* mileage + b2 \* cylinder + b3 \* doors
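To make the scaling step concrete, here is a rough sketch of what min-max scaling to the range 0 to 1 does to a single column, written by hand with NumPy. The helper `min_max_scale` is a hypothetical name for this illustration, and the cylinder values of 4, 6 and 8 are an assumption (only 6 appears in the rows shown above):

```python
import numpy as np

def min_max_scale(column):
    """Map a 1-D array onto [0, 1] using (x - min) / (max - min)."""
    column = np.asarray(column, dtype=float)
    return (column - column.min()) / (column.max() - column.min())

# Doors takes the values 2 and 4 in this data set, so 4 doors scales to 1.0.
doors = min_max_scale([2, 4, 4, 2])

# Assumed cylinder counts of 4, 6 and 8 (hypothetical; only 6 appears in head()).
cylinders = min_max_scale([4, 6, 8])
print(doors, cylinders)
```

One consequence worth keeping in mind: a coefficient fitted on a feature scaled this way describes the predicted change in price across that feature's full range, so dividing it by (max - min) recovers the change per original unit.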

In [2]:
import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler

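scaler = MinMaxScaler()  # default feature_range=(0, 1)

# Features (numeric columns only) and the target we want to predict
X = df[['Mileage', 'Cylinder', 'Doors']]
y = df['Price']

# Rescale each feature column to the range [0, 1]
X = scaler.fit_transform(X)

print(X)

# Note: sm.OLS does not add an intercept on its own, so this fit has no constant term a
est = sm.OLS(y, X).fit()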

est.summary()

[[ 0.15871591  0.5         1.        ]
 [ 0.17695178  0.5         1.        ]
 [ 0.2579757   0.5         1.        ]
 ..., 
 [ 0.40338381  0.5         1.        ]
 [ 0.5130185   0.5         1.        ]
 [ 0.70621097  0.5         1.        ]]




Dep. Variable:,Price,R-squared:,0.809
Model:,OLS,Adj. R-squared:,0.809
Method:,Least Squares,F-statistic:,1134.0
Date:,"Mon, 05 Feb 2018",Prob (F-statistic):,1.06e-287
Time:,20:03:34,Log-Likelihood:,-8567.2
No. Observations:,804,AIC:,17140.0
Df Residuals:,801,BIC:,17150.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
x1,2.177e+04,1503.306,14.479,0.000,1.88e+04,2.47e+04
x2,2.171e+04,1000.261,21.704,0.000,1.97e+04,2.37e+04
x3,5222.2379,711.954,7.335,0.000,3824.723,6619.753

Omnibus:,130.158,Durbin-Watson:,0.342
Prob(Omnibus):,0.0,Jarque-Bera (JB):,217.94
Skew:,1.019,Prob(JB):,4.73e-48
Kurtosis:,4.535,Cond. No.,4.42
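The formula earlier includes an intercept a, but the summary lists only x1, x2 and x3: `sm.OLS` does not add a constant unless the design matrix contains one, and the usual way to get one in statsmodels is to wrap the features in `sm.add_constant(X)` before fitting. As a minimal sketch of why this matters, the NumPy example below fits synthetic data (made up, with a known intercept of 10 and slope of 3) with and without a column of ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a known intercept of 10 and slope of 3 (made up).
x = rng.uniform(0, 1, size=200)
y = 10 + 3 * x + rng.normal(0, 0.1, size=200)

# Without a constant column the fit is forced through the origin.
X_no_const = x.reshape(-1, 1)
slope_only, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# Prepending a column of ones lets least squares estimate the intercept too.
X_const = np.column_stack([np.ones_like(x), x])
(intercept, slope), *_ = np.linalg.lstsq(X_const, y, rcond=None)

print(slope_only[0], intercept, slope)
```

With the column of ones, the estimates land close to the true intercept and slope; without it, the single slope has to absorb the missing intercept and comes out far too large. The same effect can distort the coefficients of a no-constant fit like the one above, so they are best compared with that caveat in mind.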


In [3]:
y.groupby(df.Doors).mean()

Doors
2    23807.135520
4    20580.670749
Name: Price, dtype: float64

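Surprisingly, more doors does not mean a higher price! Two-door cars average about $23,807, versus about $20,581 for four-door cars. (Maybe two doors implies a sports car in some cases?) So it's not surprising that the door count is the weakest of our three predictors here, with by far the smallest coefficient. This is a very small data set, however, so we can't really read much meaning into it.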
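Because group means on a small data set can rest on very few rows, it can help to look at the group sizes next to the means. The sketch below shows one way to do that with pandas on made-up prices (the `toy` frame is hypothetical and not taken from cars.xls):

```python
import pandas as pd

# Made-up prices for illustration only; not taken from cars.xls.
toy = pd.DataFrame({
    'Doors': [2, 2, 4, 4, 4],
    'Price': [30000.0, 20000.0, 18000.0, 21000.0, 24000.0],
})

# Mean price per door count, plus the number of rows behind each mean.
summary = toy.groupby('Doors')['Price'].agg(['mean', 'count'])
print(summary)
```

On the real data the same call would be `df.groupby('Doors')['Price'].agg(['mean', 'count'])`.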