# Multivariate Regression

Let's grab a small data set of Blue Book car values:

In [9]:
import pandas as pd

df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')


In [10]:
df.head()

Unnamed: 0,Price,Mileage,Make,Model,Trim,Type,Cylinder,Liter,Doors,Cruise,Sound,Leather
0,17314.103129,8221,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,1,1
1,17542.036083,9135,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,1,0
2,16218.847862,13196,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,1,0
3,16336.91314,16342,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,0,0
4,16339.170324,19832,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,0,1


We can use pandas to split up this matrix into the feature vectors we're interested in, and the value we're trying to predict.

Note how we are avoiding the make and model; regressions don't work well with categorical values, unless you can convert them into some numerical order that makes sense.
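If you did want to bring something like the make into a regression, one common option is one-hot encoding: one 0/1 column per category instead of an arbitrary numeric order. Here is a minimal sketch using `pd.get_dummies`; the `sample` frame and its values are made up for illustration and are not rows from the real data set.

```python
import pandas as pd

# Tiny made-up sample standing in for the real cars data (illustrative values only).
sample = pd.DataFrame({
    'Mileage': [8221, 9135, 20000, 15000],
    'Make': ['Buick', 'Buick', 'Saab', 'Chevrolet'],
})

# One 0/1 indicator column per make. drop_first=True drops the first category
# (Buick here), which becomes the baseline and avoids a redundant column.
encoded = pd.get_dummies(sample, columns=['Make'], drop_first=True, dtype=int)

print(encoded)
```

Each remaining indicator column then gets its own coefficient, measured relative to the dropped baseline make.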

Let's scale our feature data onto the same scale so we can easily compare the coefficients we end up with.
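To see what this scaling does, here is a small sketch comparing `StandardScaler` with the same calculation done by hand: subtract each column's mean and divide by its standard deviation. The `raw` array holds made-up mileage and cylinder values, not rows from the real data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up mileage and cylinder values, purely for illustration.
raw = np.array([
    [8221.0, 6.0],
    [9135.0, 6.0],
    [13196.0, 4.0],
    [30000.0, 8.0],
])

scaled = StandardScaler().fit_transform(raw)

# The same thing by hand: subtract each column's mean, then divide by its
# population standard deviation (np.std uses ddof=0 by default).
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, a coefficient reads as "change in price per one standard deviation of this feature", which is what makes the coefficients comparable.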

In [11]:
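import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()

# Copy the columns so the assignment below modifies X itself, not a slice of df.
X = df[['Mileage', 'Cylinder']].copy()
y = df['Price']

# Standardize each feature to zero mean and unit variance.
X[['Mileage', 'Cylinder']] = scale.fit_transform(X[['Mileage', 'Cylinder']].values)

print(X)

# Ordinary least squares. sm.OLS fits no intercept unless you add one with sm.add_constant(X).
est = sm.OLS(y, X).fit()

est.summary()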

      Mileage  Cylinder
0   -1.417485  0.527410
1   -1.305902  0.527410
2   -0.810128  0.527410
3   -0.426058  0.527410
4    0.000008  0.527410
5    0.293493  0.527410
6    0.335001  0.527410
7    0.382369  0.527410
8    0.511409  0.527410
9    0.914768  0.527410
10  -1.171368  0.527410
11  -0.581834  0.527410
12  -0.390532  0.527410
13  -0.003899  0.527410
14   0.430591  0.527410
15   0.480156  0.527410
16   0.509822  0.527410
17   0.757160  0.527410
18   1.594886  0.527410
19   1.810849  0.527410
20  -1.326046  0.527410
21  -1.129860  0.527410
22  -0.667658  0.527410
23  -0.405792  0.527410
24  -0.112796  0.527410
25  -0.044552  0.527410
26   0.190700  0.527410
27   0.337442  0.527410
28   0.566102  0.527410
29   0.660837  0.527410
..        ...       ...
774 -0.161262 -0.914896
775 -0.089234 -0.914896
776 -0.040523 -0.914896
777  0.002572 -0.914896
778  0.236603 -0.914896
779  0.249666 -0.914896
780  0.357220 -0.914896
781  0.365521 -0.914896
782  0.434131 -0.914896
783  0.517269 -0



Dep. Variable:,Price,R-squared:,0.06
Model:,OLS,Adj. R-squared:,0.058
Method:,Least Squares,F-statistic:,25.58
Date:,"Sat, 28 Apr 2018",Prob (F-statistic):,1.71e-11
Time:,17:06:55,Log-Likelihood:,-9208.7
No. Observations:,804,AIC:,18420.0
Df Residuals:,802,BIC:,18430.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Mileage,-1248.6125,805.535,-1.550,0.122,-2829.819,332.594
Cylinder,5585.0480,805.535,6.933,0.000,4003.841,7166.255

Omnibus:,198.944,Durbin-Watson:,0.01
Prob(Omnibus):,0.0,Jarque-Bera (JB):,385.493
Skew:,1.439,Prob(JB):,1.96e-84
Kurtosis:,4.797,Cond. No.,1.03


In [12]:
y.groupby(df.Doors).mean()

Doors
2    23807.135520
4    20580.670749
Name: Price, dtype: float64

Surprisingly, more doors does not mean a higher price! (Maybe two doors implies a sports car in some cases?) So we wouldn't expect the number of doors to be much use as a predictor here. This is a very small data set, however, so we can't really read much meaning into it.
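When comparing group means like this, it can help to look at how many cars are in each group and how spread out their prices are, not just the averages. A minimal sketch on a made-up `toy` frame (the prices are invented for illustration):

```python
import pandas as pd

# Made-up prices: two 2-door cars and four 4-door cars.
toy = pd.DataFrame({
    'Doors': [2, 2, 4, 4, 4, 4],
    'Price': [30000.0, 18000.0, 21000.0, 20000.0, 22000.0, 19000.0],
})

# Mean, group size, and standard deviation of price for each door count.
summary = toy.groupby('Doors')['Price'].agg(['mean', 'count', 'std'])

print(summary)
```

A higher mean backed by only a few cars, or by a very wide spread, is weak evidence of a real difference. The same `agg` call should work on `df` from the cells above.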

## Activity

Mess around with the input data, and see if you can create a measurable influence of number of doors on price. Have some fun with it - why stop at 4 doors?