<a href="https://colab.research.google.com/github/skredenmathias/DS-Unit-2-Linear-Models/blob/master/module3-ridge-regression/assignment_regression_classification_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand — use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set)
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s) !
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in more inferential statistical approach to linear regression and feature selection, looking at p values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
import numpy as np

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

# SALE_PRICE was read as strings.
# Remove symbols, convert to integer
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$','')
    .str.replace('-','')
    .str.replace(',','')
    .astype(int)
)

# Removing outliers
df = df[(df['SALE_PRICE'] >= np.percentile(df['SALE_PRICE'], 0.5)) & 
        (df['SALE_PRICE'] <= np.percentile(df['SALE_PRICE'], 99.5))]

In [0]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [0]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

In [15]:
top10 # What should I do with this?

Index(['FLUSHING-NORTH', 'UPPER EAST SIDE (59-79)', 'UPPER EAST SIDE (79-96)',
       'BEDFORD STUYVESANT', 'BOROUGH PARK', 'UPPER WEST SIDE (59-79)',
       'GRAMERCY', 'ASTORIA', 'FOREST HILLS', 'EAST NEW YORK'],
      dtype='object')

### Use a subset of the data where BUILDING_CLASS_CATEGORY == '01 ONE FAMILY DWELLINGS' and the sale price was more than 100 thousand and less than 2 million.

In [16]:
df.columns

Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING_CLASS_CATEGORY',
       'TAX_CLASS_AT_PRESENT', 'BLOCK', 'LOT', 'EASE-MENT',
       'BUILDING_CLASS_AT_PRESENT', 'ADDRESS', 'APARTMENT_NUMBER', 'ZIP_CODE',
       'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS', 'TOTAL_UNITS',
       'LAND_SQUARE_FEET', 'GROSS_SQUARE_FEET', 'YEAR_BUILT',
       'TAX_CLASS_AT_TIME_OF_SALE', 'BUILDING_CLASS_AT_TIME_OF_SALE',
       'SALE_PRICE', 'SALE_DATE'],
      dtype='object')

In [17]:
df.shape

(22924, 21)

In [0]:
df2 = df[(df['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS') & (df['SALE_PRICE']) > 100000 & (df['SALE_PRICE'] < 2000000 )]

In [0]:
# df2 = df.replace(np.inf, np.nan)

In [20]:
df2.isna().sum()

BOROUGH                             0
NEIGHBORHOOD                        0
BUILDING_CLASS_CATEGORY             0
TAX_CLASS_AT_PRESENT                0
BLOCK                               0
LOT                                 0
EASE-MENT                         118
BUILDING_CLASS_AT_PRESENT           0
ADDRESS                             0
APARTMENT_NUMBER                  118
ZIP_CODE                            0
RESIDENTIAL_UNITS                   0
COMMERCIAL_UNITS                    0
TOTAL_UNITS                         0
LAND_SQUARE_FEET                    0
GROSS_SQUARE_FEET                   0
YEAR_BUILT                          0
TAX_CLASS_AT_TIME_OF_SALE           0
BUILDING_CLASS_AT_TIME_OF_SALE      0
SALE_PRICE                          0
SALE_DATE                           0
dtype: int64

In [0]:
df2 = df2.dropna(axis=1)

In [22]:
df2.shape

(118, 19)

In [23]:
df2.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,BUILDING_CLASS_AT_PRESENT,ADDRESS,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
424,4,OTHER,01 ONE FAMILY DWELLINGS,1,12984,102,A2,179-23 134TH AVENUE,11434.0,1.0,0.0,1.0,2200,880.0,1940.0,1,A2,717359,01/03/2019
437,5,OTHER,01 ONE FAMILY DWELLINGS,1,3736,26,A5,492 BEDFORD AVENUE,10306.0,1.0,0.0,1.0,2325,1888.0,2017.0,1,A5,646589,01/03/2019
679,4,OTHER,01 ONE FAMILY DWELLINGS,1,13463,40,A2,229-02 146TH AVENUE,11413.0,1.0,0.0,1.0,3742,1258.0,1920.0,1,A2,512275,01/04/2019
707,5,OTHER,01 ONE FAMILY DWELLINGS,1,4131,1,A5,124 PELICAN CIRCLE,10306.0,1.0,0.0,1.0,2654,1455.0,1997.0,1,A5,452465,01/04/2019
720,5,OTHER,01 ONE FAMILY DWELLINGS,1,784,20,A2,187 MARTIN AVENUE,10314.0,1.0,0.0,1.0,6000,2160.0,1970.0,1,A2,588005,01/04/2019


## Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.

In [24]:
df2['SALE_DATE'].head()

424    01/03/2019
437    01/03/2019
679    01/04/2019
707    01/04/2019
720    01/04/2019
Name: SALE_DATE, dtype: object

In [0]:
df2['SALE_DATE'] = pd.to_datetime(df2['SALE_DATE'])

In [26]:
df2['SALE_DATE'].head()

424   2019-01-03
437   2019-01-03
679   2019-01-04
707   2019-01-04
720   2019-01-04
Name: SALE_DATE, dtype: datetime64[ns]

In [0]:
df2['SALE_DATE'] = df2['SALE_DATE'].dt.month

In [28]:
df2['SALE_DATE'].head()

424    1
437    1
679    1
707    1
720    1
Name: SALE_DATE, dtype: int64

In [0]:
train = df2[(df2['SALE_DATE'] >= 1) & (df2['SALE_DATE'] <= 3)]

In [30]:
train.head(), train.shape

(    BOROUGH NEIGHBORHOOD  ... SALE_PRICE SALE_DATE
 424       4        OTHER  ...     717359         1
 437       5        OTHER  ...     646589         1
 679       4        OTHER  ...     512275         1
 707       5        OTHER  ...     452465         1
 720       5        OTHER  ...     588005         1
 
 [5 rows x 19 columns], (102, 19))

In [0]:
test = df2[(df2['SALE_DATE'] == 4)]

In [32]:
test.head(), test.shape

(      BOROUGH NEIGHBORHOOD  ... SALE_PRICE SALE_DATE
 18182       1        OTHER  ...          1         4
 18426       5        OTHER  ...    1120075         4
 18632       4        OTHER  ...     506113         4
 18644       4        OTHER  ...     707533         4
 18667       5        OTHER  ...     941881         4
 
 [5 rows x 19 columns], (16, 19))

# Do one-hot encoding of categorical features.

In [33]:
train.describe(exclude='number').T.sort_values(by='unique')

Unnamed: 0,count,unique,top,freq
BUILDING_CLASS_CATEGORY,102,1,01 ONE FAMILY DWELLINGS,102
TAX_CLASS_AT_PRESENT,102,2,1,101
NEIGHBORHOOD,102,3,OTHER,99
BOROUGH,102,4,4,43
BUILDING_CLASS_AT_TIME_OF_SALE,102,7,A1,36
BUILDING_CLASS_AT_PRESENT,102,8,A1,36
LAND_SQUARE_FEET,102,69,4000,9
ADDRESS,102,102,1907 PITMAN AVENUE,1


In [0]:
target = 'SALE_PRICE'
# high_cardinality =  ['BUILDING_CLASS_AT_TIME_OF_SALE',
#                      'BUILDING_CLASS_AT_PRESENT',
#                      'LAND_SQUARE_FEET',
#                      'ADDRESS']

high_cardinality =  ['BUILDING_CLASS_AT_TIME_OF_SALE',
                     'ADDRESS']


features = train.columns.drop([target] + high_cardinality)

In [35]:
features

Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING_CLASS_CATEGORY',
       'TAX_CLASS_AT_PRESENT', 'BLOCK', 'LOT', 'BUILDING_CLASS_AT_PRESENT',
       'ZIP_CODE', 'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS', 'TOTAL_UNITS',
       'LAND_SQUARE_FEET', 'GROSS_SQUARE_FEET', 'YEAR_BUILT',
       'TAX_CLASS_AT_TIME_OF_SALE', 'SALE_DATE'],
      dtype='object')

In [0]:
X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

In [37]:
X_train.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,BUILDING_CLASS_AT_PRESENT,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,SALE_DATE
424,4,OTHER,01 ONE FAMILY DWELLINGS,1,12984,102,A2,11434.0,1.0,0.0,1.0,2200,880.0,1940.0,1,1
437,5,OTHER,01 ONE FAMILY DWELLINGS,1,3736,26,A5,10306.0,1.0,0.0,1.0,2325,1888.0,2017.0,1,1
679,4,OTHER,01 ONE FAMILY DWELLINGS,1,13463,40,A2,11413.0,1.0,0.0,1.0,3742,1258.0,1920.0,1,1
707,5,OTHER,01 ONE FAMILY DWELLINGS,1,4131,1,A5,10306.0,1.0,0.0,1.0,2654,1455.0,1997.0,1,1
720,5,OTHER,01 ONE FAMILY DWELLINGS,1,784,20,A2,10314.0,1.0,0.0,1.0,6000,2160.0,1970.0,1,1


In [38]:
X_train.shape

(102, 16)

In [0]:
import category_encoders as ce
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train = encoder.fit_transform(X_train)

In [0]:
X_test = encoder.transform(X_test)

In [41]:
X_test.head()

Unnamed: 0,BOROUGH_4,BOROUGH_5,BOROUGH_3,BOROUGH_2,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_EAST NEW YORK,BUILDING_CLASS_CATEGORY_01 ONE FAMILY DWELLINGS,TAX_CLASS_AT_PRESENT_1,TAX_CLASS_AT_PRESENT_1B,BLOCK,LOT,BUILDING_CLASS_AT_PRESENT_A2,BUILDING_CLASS_AT_PRESENT_A5,BUILDING_CLASS_AT_PRESENT_A1,BUILDING_CLASS_AT_PRESENT_A3,BUILDING_CLASS_AT_PRESENT_A9,BUILDING_CLASS_AT_PRESENT_A0,BUILDING_CLASS_AT_PRESENT_A4,BUILDING_CLASS_AT_PRESENT_V0,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,"LAND_SQUARE_FEET_2,200","LAND_SQUARE_FEET_2,325","LAND_SQUARE_FEET_3,742","LAND_SQUARE_FEET_2,654","LAND_SQUARE_FEET_6,000","LAND_SQUARE_FEET_2,979","LAND_SQUARE_FEET_2,467","LAND_SQUARE_FEET_4,000","LAND_SQUARE_FEET_1,849","LAND_SQUARE_FEET_3,000","LAND_SQUARE_FEET_2,500","LAND_SQUARE_FEET_4,752","LAND_SQUARE_FEET_1,874","LAND_SQUARE_FEET_3,234","LAND_SQUARE_FEET_3,696","LAND_SQUARE_FEET_1,800",...,"LAND_SQUARE_FEET_1,500","LAND_SQUARE_FEET_4,050","LAND_SQUARE_FEET_2,252","LAND_SQUARE_FEET_3,800","LAND_SQUARE_FEET_3,811","LAND_SQUARE_FEET_3,456","LAND_SQUARE_FEET_2,959","LAND_SQUARE_FEET_1,600","LAND_SQUARE_FEET_8,000","LAND_SQUARE_FEET_2,460","LAND_SQUARE_FEET_4,100","LAND_SQUARE_FEET_1,680","LAND_SQUARE_FEET_1,700","LAND_SQUARE_FEET_8,455","LAND_SQUARE_FEET_8,399","LAND_SQUARE_FEET_4,427","LAND_SQUARE_FEET_2,555","LAND_SQUARE_FEET_4,292","LAND_SQUARE_FEET_2,375","LAND_SQUARE_FEET_1,602","LAND_SQUARE_FEET_2,997","LAND_SQUARE_FEET_2,400","LAND_SQUARE_FEET_3,500","LAND_SQUARE_FEET_1,185","LAND_SQUARE_FEET_2,700","LAND_SQUARE_FEET_6,695","LAND_SQUARE_FEET_4,400","LAND_SQUARE_FEET_1,785","LAND_SQUARE_FEET_6,309",LAND_SQUARE_FEET_780,"LAND_SQUARE_FEET_2,785","LAND_SQUARE_FEET_3,012","LAND_SQUARE_FEET_2,725","LAND_SQUARE_FEET_1,930","LAND_SQUARE_FEET_4,300","LAND_SQUARE_FEET_4,361",GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,SALE_DATE
18182,0,0,0,0,1,0,0,1,1,0,1708,24,0,0,0,0,1,0,0,0,10029.0,1.0,0.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2501.0,1890.0,1,4
18426,0,1,0,0,1,0,0,1,1,0,4309,65,0,0,0,0,0,0,0,0,10306.0,2.0,0.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3274.0,2018.0,1,4
18632,1,0,0,0,1,0,0,1,1,0,11169,21,0,0,1,0,0,0,0,0,11429.0,1.0,0.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1200.0,1935.0,1,4
18644,1,0,0,0,1,0,0,1,1,0,12981,39,1,0,0,0,0,0,0,0,11422.0,1.0,0.0,1.0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1128.0,1950.0,1,4
18667,0,1,0,0,1,0,0,1,1,0,3397,14,0,0,1,0,0,0,0,0,10305.0,1.0,0.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2194.0,2018.0,1,4


In [42]:
X_test.shape

(16, 97)

 # Do feature selection with SelectKBest.

In [43]:
from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=11)

X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform = selector.transform(X_test)

  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)


In [44]:
X_train_selected.shape, X_test_selected.shape

((102, 11), (16, 11))

In [45]:
# Which features were selected? 
all_names = X_train.columns
selected_mask = selector.get_support()

selected_names = all_names[selected_mask]
unselected_names = all_names[~selected_mask]
print('Features selected:')
for name in selected_names:
  print(name)

Features selected:
BOROUGH_2
BLOCK
BUILDING_CLASS_AT_PRESENT_A3
LAND_SQUARE_FEET_2,467
LAND_SQUARE_FEET_1,800
LAND_SQUARE_FEET_4,500
LAND_SQUARE_FEET_4,001
LAND_SQUARE_FEET_1,402
LAND_SQUARE_FEET_4,292
GROSS_SQUARE_FEET
YEAR_BUILT


#### Fit a ridge regression model with multiple features. Use the normalize=True parameter (or do feature scaling beforehand — use the scaler's fit_transform method with the train set, and the scaler's transform method with the test set)

In [46]:
# Note, this is linear regression AFAIK
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

for k in range(1, len(X_train.columns)+1):
  print(f'{k} features')

  selector = SelectKBest(score_func=f_regression, k=k)
  X_train_selected = selector.fit_transform(X_train, y_train)
  X_test_selected = selector.transform(X_test)

  model = LinearRegression(normalize=True)
  model.fit(X_train_selected, y_train)
  y_pred = model.predict(X_test_selected)

  mae = mean_absolute_error(y_test, y_pred)
  print(f'Test MAE: ${mae:,.0f} \n')

  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) 

1 features
Test MAE: $302,479 

2 features
Test MAE: $292,393 

3 features
Test MAE: $266,567 

4 features
Test MAE: $272,979 

5 features
Test MAE: $273,939 

6 features
Test MAE: $268,920 

7 features
Test MAE: $273,814 

8 features
Test MAE: $275,635 

9 features
Test MAE: $281,042 

10 features
Test MAE: $282,357 

11 features
Test MAE: $280,136 

12 features
Test MAE: $276,342 

13 features
Test MAE: $272,940 

14 features
Test MAE: $272,940 

15 features
Test MAE: $272,940 

16 features
Test MAE: $272,940 

17 features
Test MAE: $275,149 

18 features
Test MAE: $274,582 

19 features
Test MAE: $274,485 

20 features
Test MAE: $270,667 

21 features


  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) 

Test MAE: $271,719 

22 features
Test MAE: $270,345 

23 features
Test MAE: $266,030 

24 features
Test MAE: $285,453 

25 features
Test MAE: $294,240 

26 features
Test MAE: $291,667 

27 features
Test MAE: $293,362 

28 features
Test MAE: $289,922 

29 features
Test MAE: $294,104 

30 features
Test MAE: $300,314 

31 features
Test MAE: $293,346 

32 features
Test MAE: $292,978 

33 features
Test MAE: $306,656 

34 features
Test MAE: $301,440 

35 features
Test MAE: $293,398 

36 features
Test MAE: $295,104 

37 features
Test MAE: $299,392 

38 features
Test MAE: $302,040 

39 features
Test MAE: $17,146,828,508,544 

40 features
Test MAE: $20,700,194,235,416 

41 features
Test MAE: $51,601,939,885,568 

42 features
Test MAE: $116,213,137,463,624 

43 features
Test MAE: $26,688,971,708,128 

44 features


  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) 

Test MAE: $11,818,058,806,332 

45 features
Test MAE: $33,433,912,853,223 

46 features
Test MAE: $33,844,657,751,847 

47 features
Test MAE: $51,872,090,037,159 

48 features
Test MAE: $63,416,172,355,687 

49 features
Test MAE: $35,240,686,229,543 

50 features
Test MAE: $1,190,904,697,834 

51 features
Test MAE: $2,131,456,858,212 

52 features
Test MAE: $111,556,531,984,711 

53 features
Test MAE: $1,750,614,350,221 

54 features
Test MAE: $1,169,112,693,649 

55 features
Test MAE: $192,630,442,428 

56 features
Test MAE: $94,684,411,902,908 

57 features
Test MAE: $33,772,117,362,588 

58 features
Test MAE: $7,673,257,555,700 

59 features
Test MAE: $233,945,067,567,292 

60 features
Test MAE: $57,619,760,980,796 

61 features
Test MAE: $94,149,125,691,324 

62 features
Test MAE: $43,065,459,809,596 

63 features
Test MAE: $180,279,609,558,460 

64 features
Test MAE: $84,513,612,491,068 

65 features
Test MAE: $2,262,055,359,451 

66 features


  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) 

Test MAE: $19,521,665,208,651 

67 features
Test MAE: $42,729,280,582,603 

68 features
Test MAE: $23,551,987,606,827 

69 features
Test MAE: $77,650,719,829,067 

70 features
Test MAE: $26,796,976,499,147 

71 features
Test MAE: $14,603,384,806,486 

72 features
Test MAE: $43,365,351,509,750 

73 features
Test MAE: $25,362,971,460,245 

74 features
Test MAE: $45,733,699,806,179 

75 features
Test MAE: $28,458,021,880,035 

76 features
Test MAE: $89,988,382,084,451 

77 features
Test MAE: $23,591,732,175,939 

78 features
Test MAE: $1,437,932,936,348 

79 features
Test MAE: $15,738,776,299,279,892 

80 features
Test MAE: $11,022,830,223,971,472 

81 features


  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms


Test MAE: $21,395,540,092,319,720 

82 features
Test MAE: $19,781,152,154,692,056 

83 features
Test MAE: $201,088,077,350,942,464 

84 features
Test MAE: $345,704,593,510,537,472 

85 features
Test MAE: $138,790,774,409,544,512 

86 features
Test MAE: $438,623,518,005,228,288 

87 features
Test MAE: $988,639,598,762,780,672 

88 features
Test MAE: $386,762,737,643,832,000 

89 features
Test MAE: $143,301,032,793,783,728 

90 features
Test MAE: $189,000,724,100,344,768 

91 features
Test MAE: $942,006,738,275,369,984 

92 features
Test MAE: $2,188,437,479,841,814,016 

93 features
Test MAE: $849,310,917,684,722,176 

94 features
Test MAE: $3,408,197,783,084,802,048 

95 features
Test MAE: $2,927,326,025,771,099,136 

96 features
Test MAE: $24,504,407,134,976,901,120 

97 features
Test MAE: $3,532,851,210,396,498,944 



  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a

In [47]:
# Ridge Regression
import matplotlib.pyplot as plt
from IPython.display import display, HTML
from sklearn.linear_model import Ridge

for alpha in [0.001, 0.01, 0.1, 1.0, 1, 10, 100]:
  # Fit Ridge
  display(HTML(f'Ridge Regression, with alpha={alpha}'))
  model = Ridge(alpha=alpha, normalize=True)
  model.fit(X_train, y_train)
  y_pred = model.predict(X_test)

  # Get test MAE
  mae = mean_absolute_error(y_test, y_pred)
  display(HTML(f'Test Mean Absolute Error: ${mae:,.0f}'))

  # Plot coefficients
  coefficients = pd.Series(model.coef_, X_train.columns)
  plt.figure(figsize=(10,6))
  coefficients.sort_values().plot.barh(color='grey')
  plt.xlim(-400,700)
  plt.show()

# RidgeCV

In [0]:
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

In [49]:
# from sklearn.linear_model import RidgeCV
# ridge = RidgeCV(alphas=alphas, normalize=True)
# ridge.fit(X_train, target)
# ridge.alpha_

TypeError: ignored

# Permutation Importance

In [0]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

data = df2
y = data['SALE_PRICE']
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
X = data[feature_names]
train_X, val_x, train_y, val_y = train_test_split(X, y, random_state=1)
my_model = RandomForestClassifier(n_estimators=100,
                                  random_state=0).fit(train_X, train_y)

In [51]:
!pip install eli5
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(my_model, random_state = 1).fit(val_x, val_y)
eli5.show_weights(perm, feature_names = val_x.columns.tolist())

Collecting eli5
[?25l  Downloading https://files.pythonhosted.org/packages/97/2f/c85c7d8f8548e460829971785347e14e45fa5c6617da374711dec8cb38cc/eli5-0.10.1-py2.py3-none-any.whl (105kB)
[K     |███                             | 10kB 19.3MB/s eta 0:00:01[K     |██████▏                         | 20kB 5.9MB/s eta 0:00:01[K     |█████████▎                      | 30kB 8.1MB/s eta 0:00:01[K     |████████████▍                   | 40kB 5.7MB/s eta 0:00:01[K     |███████████████▌                | 51kB 6.9MB/s eta 0:00:01[K     |██████████████████▋             | 61kB 8.0MB/s eta 0:00:01[K     |█████████████████████▊          | 71kB 9.2MB/s eta 0:00:01[K     |████████████████████████▊       | 81kB 10.2MB/s eta 0:00:01[K     |███████████████████████████▉    | 92kB 8.3MB/s eta 0:00:01[K     |███████████████████████████████ | 102kB 9.1MB/s eta 0:00:01[K     |████████████████████████████████| 112kB 9.1MB/s 
Installing collected packages: eli5
Successfully installed eli5-0.10.1


Using TensorFlow backend.


Weight,Feature
0.0333  ± 0.0422,SALE_PRICE
0.0333  ± 0.0000,LOT
0.0267  ± 0.0267,BLOCK
0  ± 0.0000,TAX_CLASS_AT_TIME_OF_SALE
-0.0200  ± 0.0327,SALE_DATE
