In [1]:
%run 1.fetch_data_from_github.ipynb
%run 6.stratified_sampling.ipynb
%run 9.transform_data.ipynb

First, let’s revert to a clean training set (by copying strat_train_set once again)
and separate the predictors from the labels, since we don’t necessarily want to apply
the same transformations to the predictors and the target values.

In [2]:
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

#### Data Cleaning - Fill NA with median

The total_bedrooms attribute has some missing values, so let’s fix this. We have three options:
- Get rid of the corresponding districts.
- Get rid of the whole attribute.
- Set the values to some value (zero, the mean, the median, etc.).
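
As a rough sketch, the three options map onto pandas calls as shown below. The `toy` DataFrame and its values are made up for illustration and stand in for the real `housing` data; only the column name `total_bedrooms` comes from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data (values are invented for illustration).
toy = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 500.0],
    "population": [800, 1200, 950, 400],
})

# Option 1: drop the districts (rows) that have no total_bedrooms value.
option1 = toy.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole attribute (column).
option2 = toy.drop("total_bedrooms", axis=1)

# Option 3: fill the gaps with a fixed value, here the column median.
median = toy["total_bedrooms"].median()
option3 = toy.assign(total_bedrooms=toy["total_bedrooms"].fillna(median))
```

With option 3, the median should be computed on the training set and kept, so that the same value can later fill missing values in the test set and in new data.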

Scikit-Learn provides a handy class to take care of missing values: SimpleImputer. Here is how to use it. First, you need to create a SimpleImputer instance, specifying that you want to replace each attribute’s missing values with the median of that attribute:

In [3]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

Since the median can only be computed on numerical attributes, we need to create a
copy of the data without the text attribute ocean_proximity:

In [4]:
housing_num = housing.drop("ocean_proximity", axis=1)
imputer.fit(housing_num)
# the median learned for each attribute is stored in imputer.statistics_

In [5]:
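# transform() returns a plain NumPy array with the missing values filled in
X = imputer.transform(housing_num)

In [6]:
# wrap the array back into a Pandas DataFrame, keeping the original column names and row index
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)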
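
To see what the imputer learns and returns, here is a small self-contained sketch on invented data. The names `toy_num`, `toy_imputer`, and `toy_tr` are hypothetical and not part of the notebook; the check simply confirms that `statistics_` matches the per-column medians pandas computes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented numerical data with one gap per column and a non-default index.
toy_num = pd.DataFrame(
    {"total_bedrooms": [100.0, np.nan, 300.0, 500.0],
     "households": [90.0, 210.0, np.nan, 450.0]},
    index=[10, 11, 12, 13],
)

toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(toy_num)

# statistics_ holds one learned median per column, in column order.
medians = toy_imputer.statistics_

# transform() returns a NumPy array, so column names and index must be restored by hand.
toy_tr = pd.DataFrame(toy_imputer.transform(toy_num),
                      columns=toy_num.columns, index=toy_num.index)
```

Passing the original index matters when the training set was sampled from a larger DataFrame: without it, the new DataFrame gets a fresh 0..n-1 index that no longer lines up with the labels.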

#### Handling text and categorical values

Most Machine Learning algorithms prefer to work with numbers, so let’s convert
these categories from text to numbers. For this, we can use Scikit-Learn’s
OrdinalEncoder class:

In [7]:
housing_cat = housing[["ocean_proximity"]]

In [8]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()

In [9]:
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

You can get the list of categories using the categories_ instance variable. It is a list
containing a 1D array of categories for each categorical attribute (here, a single array,
since there is only one categorical attribute):

In [13]:
ordinal_encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]
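
The link between the encoded numbers and `categories_` can be illustrated on a tiny invented sample. The names `toy_cat` and `toy_encoder` are hypothetical; the category strings are taken from the output above.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Four made-up districts with a single categorical attribute (2D, as the encoder expects).
toy_cat = np.array([["INLAND"], ["NEAR BAY"], ["<1H OCEAN"], ["INLAND"]])

toy_encoder = OrdinalEncoder()
codes = toy_encoder.fit_transform(toy_cat)

# Categories are stored in sorted order; each code is a position in that array.
categories = toy_encoder.categories_[0]

# inverse_transform maps the codes back to the original strings.
decoded = toy_encoder.inverse_transform(codes)
```

Here `'<1H OCEAN'` becomes 0, `'INLAND'` becomes 1, and `'NEAR BAY'` becomes 2. The numbers reflect sort order only, so a model may wrongly treat nearby codes as similar categories.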

#### Dummy variables (one-hot encoding)

In [15]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

In [17]:
housing_cat_1hot.toarray()

array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

Once again, you can get the list of categories using the encoder’s categories_
instance variable:

In [18]:
cat_encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]
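
The columns of the one-hot matrix follow the order of `categories_`, which the following sketch checks on a small invented sample. The names `toy_cat` and `toy_1hot_encoder` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Four made-up districts with a single categorical attribute.
toy_cat = np.array([["INLAND"], ["NEAR BAY"], ["<1H OCEAN"], ["INLAND"]])

toy_1hot_encoder = OneHotEncoder()
toy_1hot = toy_1hot_encoder.fit_transform(toy_cat)  # SciPy sparse matrix by default

# Convert to a dense array only for inspection; sparse storage saves memory on real data.
dense = toy_1hot.toarray()

# Column i corresponds to categories_[0][i], so argmax recovers each row's category.
recovered = toy_1hot_encoder.categories_[0][dense.argmax(axis=1)]
```

Each row contains exactly one 1, which is why the sparse matrix in the notebook output stores only 16512 elements for 16512 rows.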