# GEOG 5160 6160 Lab 07

## Data processing

Let's start by importing the modules we'll need for the class:

In [43]:
import pandas as pd
import numpy as np
import sklearn

As before, we will start by loading and cleaning the dataset. There are several steps we need to take here:

- Remove observations with missing values
- Create variables containing the average number of rooms per household and the ratio of bedrooms to rooms in each district
- Create a binary (low/high) variable indicating whether a district is high value. We'll define a district as high value when its median house value is over $250K

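Now load the data and use the `describe()` method to remind ourselves of the available variables/features. By default `describe()` summarizes only the numeric columns, so the text column `ocean_proximity` does not appear in its output.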

In [44]:
housing = pd.read_csv("../datafiles/housing.csv")
print(housing.shape)

(20640, 10)


In [45]:
housing.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |
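The summary above has only nine columns although the DataFrame has ten. The short sketch below, using a made-up two-column DataFrame (the values and the name `toy` are illustrative only), shows how `describe()` skips text columns by default and how `include="all"` brings them back:

```python
import pandas as pd

# Toy data for illustration only (not the real housing file)
toy = pd.DataFrame({
    "median_income": [8.3252, 8.3014, 7.2574],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND"],
})

# Default: only numeric columns are summarized
numeric_summary = toy.describe()
print(numeric_summary.columns.tolist())

# include="all" also summarizes non-numeric columns (count, unique, top, freq)
full_summary = toy.describe(include="all")
print(full_summary.columns.tolist())

# dtypes is a quick way to see which columns are non-numeric
print(toy.dtypes)
```

Running `housing.dtypes` on the real data should give the same kind of overview.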


First use the `dropna()` method to remove missing values:

In [46]:
housing = housing.dropna()
housing.shape

(20433, 10)
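The `count` row of the `describe()` output suggests that `total_bedrooms` is the only numeric column with missing values (20433 of 20640). A quick way to check this kind of thing before dropping rows is `isna().sum()`. The sketch below uses a small made-up DataFrame (`toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data (values are made up)
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Count the missing values in each column before dropping anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value
toy_clean = toy.dropna()
print(toy_clean.shape)
```

Note that `dropna()` drops the whole row, so the non-missing values in those rows are lost as well.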

Next, we'll create two new features: the average number of rooms per household and the ratio of bedrooms to rooms.

In [47]:
housing['avg_rooms'] = housing.total_rooms / housing.households
housing['bedroom_ratio'] = housing.total_bedrooms / housing.total_rooms
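As a small worked check of the two formulas, the sketch below recomputes them for a single district, using the `total_rooms`, `total_bedrooms` and `households` values of the first row shown in the `housing.head()` output later in this lab. The one-row DataFrame `row` is only for illustration:

```python
import pandas as pd

# Values for the first district in the housing.head() output
row = pd.DataFrame({
    "total_rooms": [880.0],
    "total_bedrooms": [129.0],
    "households": [126.0],
})

# Same arithmetic as in the cell above, applied to one row
row["avg_rooms"] = row["total_rooms"] / row["households"]
row["bedroom_ratio"] = row["total_bedrooms"] / row["total_rooms"]

print(row[["avg_rooms", "bedroom_ratio"]].round(6))
```

The rounded results should match the `avg_rooms` and `bedroom_ratio` values reported for that district.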

Now, we'll create two categorical features for use in the model, both binary. For the first of these, we'll convert the `ocean_proximity` feature into a binary value. This requires a few steps: first we split the districts into two groups with a conditional statement (INLAND vs. all other locations); then we convert the result to a categorical Series and extract the numerical codes (0/1) using `.cat.codes`.

In [50]:
ocean_cats = housing.ocean_proximity != "INLAND" ## Conditional to make two groups inland vs all others
ocean_cats = ocean_cats.astype('category') ## Convert to categorical
ocean_cats = ocean_cats.cat.codes ## Extract the code numerical labels (0/1)
housing['ocean_new'] = ocean_cats ## Store as a new column (ocean_proximity is left unchanged)
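To see how the codes are assigned, here is a minimal sketch on a made-up Series. The category names other than `INLAND` are assumed for illustration. It also shows `astype(int)` as a possible shortcut that gives the same 0/1 values:

```python
import pandas as pd

# Toy proximity values (illustrative only)
prox = pd.Series(["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND"])

# True for every district that is not inland
not_inland = prox != "INLAND"

# Route used in the lab: categorical, then integer codes
codes = not_inland.astype("category").cat.codes
print(codes.tolist())

# Alternative: cast the Boolean directly (False -> 0, True -> 1)
as_int = not_inland.astype(int)
print(as_int.tolist())
```

One difference worth knowing: `.cat.codes` numbers the categories that are actually present, so a Series containing only `True` would be coded as all 0, whereas `astype(int)` would always map `True` to 1.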

Next we convert `median_house_value` to a binary outcome of low vs. high house values. Here we use the pandas `cut()` function. For $k$ groups, this requires a vector of bin edges of length $k+1$ and, optionally, a vector of $k$ labels for the new groups.

In [51]:
bins = [0, 2.5e5, np.inf]
labels = ['low', 'high']
housing['mhv_new'] = pd.cut(housing.median_house_value, bins, labels = labels)
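By default `cut()` builds intervals that are closed on the right, so with these bins the groups are (0, 250000] and (250000, inf). The sketch below applies the same bins and labels to a few hand-picked values (the Series `values` is made up for illustration) to show what happens at the boundary:

```python
import numpy as np
import pandas as pd

# A few test values, including one exactly on the boundary
values = pd.Series([14999.0, 250000.0, 250001.0, 500001.0])

bins = [0, 2.5e5, np.inf]
labels = ["low", "high"]

# right=True is the default: each bin includes its upper edge
groups = pd.cut(values, bins, labels=labels)
print(groups.tolist())
```

A value of exactly 250000 therefore lands in `low`, which matches the definition of high value as *over* $250K. A value of exactly 0 would fall outside the first bin and become missing unless `include_lowest=True` is passed.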

Now let's look at the new data:

In [52]:
housing.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | avg_rooms | bedroom_ratio | ocean_new | mhv_new |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | 1 | 6.984127 | 0.146591 | 0 | high |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | 1 | 6.238137 | 0.155797 | 0 | high |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | 1 | 8.288136 | 0.129516 | 0 | high |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | 1 | 5.817352 | 0.184458 | 0 | high |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | 1 | 6.281853 | 0.172096 | 0 | high |


## Features and labels

Let's make two new objects: a DataFrame with a subset of variables (features) for building our initial model, and a Series with the outcome (labels):

In [53]:
X = housing[['avg_rooms', 'bedroom_ratio', 'housing_median_age', 'median_income', 
             'population', 'ocean_new']]
y = housing['mhv_new']
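Before modeling, it can help to confirm that the features and labels have the expected types and sizes, and to look at how many districts fall in each class. The sketch below does this on a small made-up DataFrame (`toy`, `X_toy` and `y_toy` are illustrative names, not part of the lab data):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "median_income": [8.3, 3.8, 2.1, 5.6],
    "housing_median_age": [41.0, 52.0, 18.0, 29.0],
    "mhv_new": pd.Categorical(["high", "low", "low", "high"],
                              categories=["low", "high"]),
})

X_toy = toy[["median_income", "housing_median_age"]]  # list of names -> DataFrame
y_toy = toy["mhv_new"]                                # single name -> Series

# Features and labels should have the same number of rows
print(type(X_toy).__name__, X_toy.shape)
print(type(y_toy).__name__, y_toy.shape)

# Number of observations in each class
class_counts = y_toy.value_counts()
print(class_counts)
```

The same calls on `X` and `y` (for example `y.value_counts()`) would show how balanced the low and high classes are in the real data.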