# INFO T780: Applied Machine Learning

## Week 1: Introduction to ML
### Prof. Y. An, PhD
### College of Computing and Informatics, Drexel University

For a Machine Learning Project, we usually need to go through these main steps:

- Understanding the Objective
- Data Acquisition
- Data Pre-processing
- Gain insights from Exploratory Data Analysis
- Data preparation for Machine Learning
- Select proper model(s)
    * supervised or unsupervised?
    * regression or classification?
    * if labeled, univariate or multivariate regression/classification?
    * what performance measure(s) to use?
- Train model with hyperparameter tuning
- Prediction & performance evaluation
- Present Solution
- Launch, monitor, and maintain the system

# Setup

In [None]:
# Common imports
import numpy as np
import pandas as pd

# To plot pretty figures
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# to make this notebook's output identical at every run
np.random.seed(42)

# Ignore useless warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

# Get the data

We will use the California Housing data as an example. It contains data drawn from the 1990 U.S. Census. Related literature: Pace, R. Kelley, and Ronald Barry, "Sparse Spatial Autoregressions," Statistics and Probability Letters, Volume 33, Number 3, May 5 1997, p. 291-297.
>We collected information on the variables using all the block groups in California from the 1990 Census. In this sample a block group on average includes 1425.5 individuals living in a geographically compact area. Naturally, the geographical area included varies inversely with the population density. We computed distances among the centroids of each block group as measured in latitude and longitude. We excluded all the block groups reporting zero entries for the independent and dependent variables. The final data contained 20,640 observations on 9 characteristics.

In [None]:
housing = pd.read_csv('../input/hands-on-machine-learning-housing-dataset/housing.csv')
housing.head()


In [None]:
housing.info()

Column `total_bedrooms` seems to have about 200 missing values, and `ocean_proximity` is not numerical.

In [None]:
# take a look how many districts belong to each category
housing["ocean_proximity"].value_counts()

In [None]:
housing.describe()

In [None]:
# plot a histogram for each numerical attribute to get a feel of data
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()

Observation:

- These attributes have very different scales.
- The `housing_median_age` and the `median_house_value` were capped. The cap on `median_house_value` may be a serious problem since it is the label to predict: the Machine Learning algorithms may learn that prices never go beyond that limit. You need to check whether this is a problem or not. If precise predictions even beyond 500,000 are needed, then you have two options:
    * Option 1: Collect proper labels for the districts whose labels were capped.
    * Option 2: Remove those districts from the dataset.
    
- Many attributes are right skewed. This may make it a bit harder for some Machine Learning algorithms to detect patterns. We will try transforming these attributes to have more bell-shaped distributions.
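
One common way to make a right-skewed attribute more bell-shaped is a log transform. The sketch below is only an illustration: it uses synthetic log-normal data (the `skewed` Series is made up and is not a housing column) and compares the sample skewness before and after applying `np.log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (hypothetical stand-in for an attribute such as population)
rng = np.random.default_rng(42)
skewed = pd.Series(rng.lognormal(mean=6.0, sigma=1.0, size=5000))

# log1p computes log(1 + x), which stays defined when a value is zero
transformed = np.log1p(skewed)

skew_before = skewed.skew()
skew_after = transformed.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness close to zero indicates a roughly symmetric distribution. Whether a log transform suits a given housing attribute should be checked on its histogram.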

### Split the data
Scikit-Learn provides a few functions to split datasets into multiple subsets in various ways. The simplest function is `train_test_split()`, which provides a couple of additional features. 
- First, there is a `random_state` parameter that allows you to set the random generator seed. 
- Second, you can pass it multiple datasets with an identical number of rows, and it will split them on the same indices (this is very useful, for example, if you have a separate DataFrame for labels).
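
The second feature can be sketched on toy data. The names `X` and `y` below are made up for illustration; the point is that the features and the labels end up split on the same row indices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and labels held in separate objects with the same number of rows
X = pd.DataFrame({"feature": np.arange(10)})
y = pd.Series(np.arange(10) * 10, name="label")

# Passing both objects splits them on the same row indices
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The row labels of features and labels stay aligned in each subset
print(X_test.index.tolist(), y_test.index.tolist())
```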

In [None]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

In [None]:
test_set.head()

In [None]:
train_set.head()

So far we have considered purely random sampling methods. This is generally fine if the dataset is large enough (especially relative to the number of attributes), but if it is not, we run the risk of introducing a significant sampling bias. 

When a survey company decides to call 1,000 people to ask them a few questions, they don’t just pick 1,000 people randomly in a phone book. They try to ensure that these 1,000 people are representative of the whole population. 

For example, the US population is 51.3% female and 48.7% male, so a well-conducted survey in the US would try to maintain this ratio in the sample: 513 females and 487 males. This is called **stratified sampling**: the population is divided into homogeneous subgroups called **strata**, and the right number of instances are sampled from each stratum to guarantee that the test set is representative of the overall population.

Suppose `median_income` is a very important attribute to predict median housing prices. We want to ensure that the test set is representative of the various categories of incomes in the whole dataset. 

Since the `median_income` is a continuous numerical attribute, we first need to create an income category attribute. It is important to have a sufficient number of instances in each stratum, or else the estimate of a stratum’s importance may be biased. This means that we should not have too many strata, and each stratum should be large enough.

In [None]:
housing["median_income"].hist()

In [None]:
pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf])

In [None]:
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

In [None]:
housing["income_cat"].value_counts().sort_index()

In [None]:
housing["income_cat"].hist()

The `stratify` parameter of `train_test_split` offers an option for stratified sampling.

In [None]:
strat_train_set, strat_test_set = train_test_split(housing, test_size=0.2, random_state=42, 
                                         stratify = housing["income_cat"])

We can also use Scikit-Learn’s `StratifiedShuffleSplit` to realize stratified sampling.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    # split() yields positional indices, so select rows by position
    strat_train_set1 = housing.iloc[train_index]
    strat_test_set1 = housing.iloc[test_index]

We can now compare stratified sampling with random sampling.

In [None]:
def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

In [None]:
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set), # train_test_split
    "Stratified1": income_cat_proportions(strat_test_set1), #StratifiedShuffleSplit
    "Random": income_cat_proportions(test_set),
}).sort_index()

In [None]:
compare_props.head()

In [None]:
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100

compare_props

In [None]:
housing["income_cat"].value_counts() / len(housing)

In [None]:
# remove the income_cat attribute so the data is back to its original
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

# Gain insights

In [None]:
train = strat_train_set.copy()

In [None]:
train.head()

In [None]:
ax = train.plot(kind="scatter", x="longitude", y="latitude", figsize=(10,7),
                       s=train['population']/100, label="Population",
                       c="median_house_value", cmap=plt.get_cmap("jet"),
                       colorbar=False, alpha=0.4,
                      )

In [None]:
california_img = mpimg.imread("../input/california-housing-feature-engineering/california.png")

In [None]:
ax = train.plot(kind="scatter", x="longitude", y="latitude", figsize=(10,7),
                       s=train['population']/100, label="Population",
                       c="median_house_value", cmap=plt.get_cmap("jet"),
                       colorbar=False, alpha=0.4,
                      )
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5,
           cmap=plt.get_cmap("jet"))
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)

prices = train["median_house_value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
cbar = plt.colorbar()
cbar.ax.set_yticklabels(["$%dk"%(round(v/1000)) for v in tick_values], fontsize=14)
cbar.set_label('Median House Value', fontsize=16)

plt.legend(fontsize=16)
plt.show()

### Looking for correlations

Since the dataset is not too large, we can easily compute the standard correlation coefficient (also called Pearson’s r) between every pair of attributes using the `corr()` method.

In [None]:
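# restrict to numeric columns; ocean_proximity is text and cannot be correlated
corr_matrix = train.select_dtypes(include="number").corr()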
corr_matrix["median_house_value"].sort_values(ascending=False)

The correlation coefficient ranges from –1 to 1. 
- When it is close to 1, it means that there is a strong positive correlation; for example, the median house value tends to go up when the median income goes up. 
- When the coefficient is close to –1, it means that there is a strong negative correlation; we can see a small negative correlation between the latitude and the median house value (i.e., prices have a slight tendency to go down as you go north in California). 
- Finally, coefficients close to 0 mean that there is no linear correlation.
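
A coefficient near 0 rules out only a *linear* relationship. As a minimal illustration on made-up data, the sketch below builds a `y` that depends entirely on `x` (through a square) and still has a Pearson's r of essentially zero.

```python
import numpy as np

# x is symmetric around zero and y depends on x exactly, but not linearly
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

# Pearson's r between x and y
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.3f}")
```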

<img src="https://i.imgur.com/8McsYNO.png" width="600">

Another way to check for correlation between attributes is to use the pandas `scatter_matrix()` function, which plots every numerical attribute against every other numerical attribute. 

We will just focus on a few promising attributes that seem most correlated with the `median_house_value`.

The most promising attribute to predict the `median_house_value` is the median income.

In [None]:
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(train[attributes], figsize=(12, 12))

In [None]:
train.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)
plt.axis([0, 16, 0, 550000])

This plot reveals a few things. 
- First, the correlation is indeed very strong: the upward trend can be clearly seen, and the points are not too dispersed. 
- Second, the price cap that we noticed earlier is clearly visible as a horizontal line at 500,000. But this plot reveals other less obvious straight lines: a horizontal line around 450,000, another around 350,000, perhaps one around 280,000, and a few more below that. 

We may want to try removing the corresponding districts to prevent algorithms from learning to reproduce these data quirks.
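
Removing such districts can be sketched with a boolean filter. The toy DataFrame and the `cap_value` below are assumptions for illustration only; on the real data, check the actual maximum of `median_house_value` before choosing the threshold.

```python
import pandas as pd

# Toy data; the cap value below is an assumption for illustration only
df = pd.DataFrame({
    "median_income": [2.5, 8.1, 4.0, 9.3, 3.2],
    "median_house_value": [150000, 500000, 220000, 500000, 180000],
})
cap_value = 500000  # hypothetical cap; check the actual maximum in your data

# Keep only the districts whose label is below the cap
uncapped = df[df["median_house_value"] < cap_value]
print(len(df), "->", len(uncapped))
```

The same filter would have to be applied to the test set, since otherwise the evaluation would include districts the model never saw the like of during training.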

In [None]:
predictors = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income', 'ocean_proximity']

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
for i in range(0, 3):
    for j in range(0, 3):
        axes[i, j].scatter(train[predictors[i*3+j]], train['median_house_value'])
        axes[i, j].set_xlabel(predictors[i*3+j])
        axes[i, j].set_ylabel('median_house_value')

### Attribute Combinations

One more thing to do before preparing the data for Machine Learning algorithms is to try out various attribute combinations. 

- For example, the total number of rooms in a district is not very useful if we don’t know how many households there are. What we really want is the number of rooms per household. 
- Similarly, the total number of bedrooms by itself is not very useful: we probably want to compare it to the number of rooms. 
- The population per household also seems like an interesting attribute combination to look at. 

In [None]:
train["rooms_per_household"] = train["total_rooms"]/train["households"]
train["bedrooms_per_room"] = train["total_bedrooms"]/train["total_rooms"]
train["population_per_household"]=train["population"]/train["households"]

# let’s look at the correlation matrix again (numeric columns only)
corr_matrix = train.select_dtypes(include="number").corr()
corr_matrix["median_house_value"].sort_values(ascending=False)

The new `bedrooms_per_room` attribute is much more correlated with the median house value than the `total_rooms` or `total_bedrooms`. Apparently houses with a lower bedroom/room ratio tend to be more expensive. The `rooms_per_household` is also more informative than `total_rooms` in a district—obviously the larger the houses, the more expensive they are.

In [None]:
train.plot(kind="scatter", x="rooms_per_household", y="median_house_value",
             alpha=0.2)
plt.axis([0, 5, 0, 520000])
plt.show()

In [None]:
train.describe()

# Data preparation for Machine Learning algorithms

It’s time to prepare the data for Machine Learning algorithms. Instead of doing this manually, we should write functions for this purpose, for several good reasons:

- This will allow us to reproduce these transformations easily on any dataset (e.g., the next time we get a fresh dataset).
- We can gradually build a library of transformation functions that we can reuse in future projects.
- We can use these functions in our live system to transform the new data before feeding it to ML algorithms.

In [None]:
# revert to a clean training set 
# separate the predictors and the labels
train = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
train_labels = strat_train_set["median_house_value"].copy()

### Data Cleaning

Most Machine Learning algorithms cannot work with missing features, so let’s create a few functions to take care of them. We saw earlier that the `total_bedrooms` attribute has some missing values, so let’s fix this with three options:
1. Get rid of the corresponding districts.
2. Get rid of the whole attribute.
3. Set the values to some value (zero, the mean, the median, etc.).

We can accomplish these easily using DataFrame’s `dropna()`, `drop()`, and `fillna()`.

In [None]:
def option_for_NA(df, col_name="total_bedrooms", option=3):
    if option == 1:
        # drop the districts (rows) with a missing value
        return df.dropna(subset=[col_name])
    elif option == 2:
        # drop the whole attribute (column)
        return df.drop(col_name, axis=1)
    elif option == 3:
        # fill missing values with the column median
        median = df[col_name].median()
        df[col_name] = df[col_name].fillna(median)
        return df

If you choose option 3, DO NOT forget to save the median value computed. We will need it later to replace missing values in the test set when we evaluate the system, and also once the system goes live to replace missing values in new data.
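
As a minimal sketch of that advice, the example below computes the median on a toy training set, keeps it in a variable, and reuses it for a toy test set. The DataFrames and the name `saved_median` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy training and test sets with missing values (made-up numbers)
train_df = pd.DataFrame({"total_bedrooms": [100.0, 200.0, np.nan, 400.0]})
test_df = pd.DataFrame({"total_bedrooms": [np.nan, 250.0]})

# Compute the median on the training set only and keep it
saved_median = train_df["total_bedrooms"].median()

# Reuse the saved training median for both sets
train_filled = train_df.fillna({"total_bedrooms": saved_median})
test_filled = test_df.fillna({"total_bedrooms": saved_median})
print(saved_median)
```

The test set is filled with the training median, not with its own, so that no information from the test set leaks into the preparation step.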

Scikit-Learn provides a handy class to take care of missing values: `SimpleImputer`. 

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

# Remove the text attribute because median can only be calculated on numerical attributes
train_num = train.drop("ocean_proximity", axis=1)

# fit the imputer instance to the training data
imputer.fit(train_num)

The imputer has simply computed the median of each attribute and stored the result in its `statistics_` instance variable. It is usually safer to apply the imputer to all the numerical attributes.

In [None]:
# Check this is the same as manually computing the median of each attribute
imputer.statistics_ == train_num.median().values

Transform the training set with imputer.

In [None]:
imputer.strategy

In [None]:
X = imputer.transform(train_num)

train_tr = pd.DataFrame(X, columns=train_num.columns,
                          index=train_num.index)
train_tr.info()

### Categorical attributes

So far we have only dealt with numerical attributes. Now let's preprocess the categorical input feature, `ocean_proximity`.

Most Machine Learning algorithms prefer to work with numbers, so let’s convert these categories from text to numbers. For this, we can use Scikit-Learn’s `OrdinalEncoder` class.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
train_cat = train[["ocean_proximity"]]
train_cat_encoded = ordinal_encoder.fit_transform(train_cat)
train_cat_encoded[:5]

We can get the list of categories using the `categories_` instance variable. It is a list containing a 1D array of categories for each categorical attribute.

In [None]:
ordinal_encoder.categories_

What is the problem here?

One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. This may be fine in some cases (e.g., for ordered categories such as “bad,” “average,” “good,” and “excellent”), but it is obviously not the case for the `ocean_proximity` column (for example, categories 0 and 4 are clearly more similar than categories 0 and 1). To fix this issue, a common solution is to create one binary attribute per category. The new attributes are sometimes called *dummy attributes*. Scikit-Learn provides a `OneHotEncoder` class to convert categorical values into one-hot vectors.

By default, the `OneHotEncoder` class returns a sparse matrix, but we can convert it to a dense array if needed by calling the `toarray()` method.

In [None]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
train_cat_1hot = cat_encoder.fit_transform(train_cat)
train_cat_1hot.toarray()

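# alternatively, request a dense array directly
# (the parameter is sparse_output=False in scikit-learn 1.2+, sparse=False in older versions)
# cat_encoder = OneHotEncoder(sparse_output=False)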

In [None]:
# get the list of categories
cat_encoder.categories_

### Custom Transformer

Although Scikit-Learn provides many useful transformers, we will need to write our own for tasks such as custom cleanup operations or combining specific attributes. We will want our transformer to work seamlessly with Scikit-Learn functionalities (such as pipelines), and since Scikit-Learn relies on duck typing (not inheritance), all we need to do is create a class and implement three methods: `fit()` (returning self), `transform()`, and `fit_transform()`.

We can get the last one for free by simply adding `TransformerMixin` as a base class. If we add `BaseEstimator` as a base class (and avoid `*args` and `**kwargs` in the constructor), we will also get two extra methods (`get_params()` and `set_params()`) that will be useful for automatic hyperparameter tuning.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

# column index
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

        

In [None]:
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
train_extra_attribs = attr_adder.transform(train.values)
train_extra_attribs = pd.DataFrame(
    train_extra_attribs,
    columns=list(train.columns)+["rooms_per_household", "population_per_household"],
    index=train.index)
train_extra_attribs.head()

In the above example, the transformer has one hyperparameter, `add_bedrooms_per_room`, set to `True` by default. This hyperparameter will allow us to easily find out whether adding this attribute helps the Machine Learning algorithms or not. More generally, we can add a hyperparameter to gate any data preparation step that we are not 100% sure about. 
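
To see what the two base classes contribute, here is a small self-contained sketch. `RatioAdder` is a hypothetical transformer written only for this illustration; it gates one extra column behind an `add_ratio` hyperparameter, much like `add_bedrooms_per_room` above.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: optionally appends column 0 / column 1."""
    def __init__(self, add_ratio=True):
        self.add_ratio = add_ratio
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        if not self.add_ratio:
            return X
        return np.c_[X, X[:, 0] / X[:, 1]]

X = np.array([[4.0, 2.0], [9.0, 3.0]])
adder = RatioAdder()
print(adder.get_params())            # provided by BaseEstimator
with_ratio = adder.fit_transform(X)  # provided by TransformerMixin

adder.set_params(add_ratio=False)    # the same mechanism a grid search uses
without_ratio = adder.fit_transform(X)
print(with_ratio.shape, without_ratio.shape)
```

Because the flag is an ordinary constructor argument, a hyperparameter search can switch the step on and off without any change to the transformer code.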

### Feature Scaling

One of the most important transformations you need to apply to your data is **feature scaling**. With few exceptions, Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales. *Note that scaling the target values is generally not required*.

There are two common ways to get all attributes to have the same scale: **min-max scaling** and **standardization**.
- Min-max scaling (many people call this normalization) is the simplest: values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the minimum value and dividing by (maximum - minimum). Scikit-Learn provides a transformer called `MinMaxScaler` for this. It has a `feature_range` hyperparameter that lets you change the range.
- Standardization is different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance. Unlike min-max scaling, standardization does not bound values to a specific range, which may be a problem for some algorithms (e.g., neural networks often expect an input value ranging from 0 to 1). However, standardization is much less affected by outliers. For example, suppose a district had a median income equal to 100 (by mistake). Min-max scaling would then crush all the other values from 0–15 down to 0–0.15, whereas standardization would not be much affected. Scikit-Learn provides a transformer called `StandardScaler` for standardization.
- Other scalers: 
    * `MaxAbsScaler`: divides each attribute by its maximum absolute value, so that the absolute values are mapped into the range [0, 1]. On positive-only data, this scaler behaves similarly to `MinMaxScaler` and therefore also suffers from the presence of large outliers.
    * `RobustScaler`: subtracts the median and divides by the interquartile range rather than using the minimum and maximum, so it is robust to outliers. After robust scaling, the distributions are brought onto the same scale and overlap, but the outliers remain outside the bulk of the new distributions.
    * `Normalizer`: rescales each sample (row) rather than each attribute so that it has unit norm, which brings all points onto a sphere of radius 1 around the origin.
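
The outlier example above can be reproduced on a handful of made-up incomes. This is only a sketch: the six values below are invented, with one erroneous entry of 100.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Incomes between 0 and 15, plus one erroneous value of 100
incomes = np.array([0.0, 3.0, 5.0, 8.0, 15.0, 100.0]).reshape(-1, 1)

minmax = MinMaxScaler().fit_transform(incomes)
standard = StandardScaler().fit_transform(incomes)

# After min-max scaling, every legitimate value is squeezed into [0, 0.15]
print(minmax.ravel())
# Standardized values have zero mean and unit variance, with no fixed range
print(standard.ravel())
```

With only six points, the outlier also pulls the mean and standard deviation noticeably, so the relative robustness of standardization is easier to see on larger samples.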

### Transformation Pipelines

As you can see, there are many data transformation steps that need to be executed in the right order. Fortunately, Scikit-Learn provides the `Pipeline` class to help with such sequences of transformations.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

train_num_tr = num_pipeline.fit_transform(train_num)

So far, we have handled the categorical columns and the numerical columns separately. It would be more convenient to have a single transformer able to handle all columns, applying the appropriate transformations to each column. 

In version 0.20, Scikit-Learn introduced the `ColumnTransformer` for this purpose.

In [None]:
from sklearn.compose import ColumnTransformer

num_attribs = list(train_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

train_prepared = full_pipeline.fit_transform(train)

In [None]:
train_prepared.shape

In [None]:
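# the custom attribute adder defines no output names, so inspect the one-hot categories instead
full_pipeline.named_transformers_["cat"].categories_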

Note that the `OneHotEncoder` returns a sparse matrix, while the `num_pipeline` returns a dense matrix. When there is such a mix of sparse and dense matrices, the `ColumnTransformer` estimates the density of the final matrix (i.e., the ratio of nonzero cells), and it returns a sparse matrix if the density is lower than a given threshold (by default, sparse_threshold=0.3). 
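
The effect of `sparse_threshold` can be checked on a toy DataFrame. The frame and the `build` helper below are hypothetical and exist only for this illustration; the categorical column has ten distinct values so that the combined output is mostly zeros.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame: one numeric column and one categorical column with 10 distinct values
df = pd.DataFrame({
    "num": [float(i) for i in range(10)],
    "cat": [f"c{i}" for i in range(10)],
})

def build(threshold):
    # Hypothetical helper that only varies sparse_threshold
    return ColumnTransformer(
        [("num", StandardScaler(), ["num"]),
         ("cat", OneHotEncoder(), ["cat"])],
        sparse_threshold=threshold,
    )

# Density = (10 dense cells + 10 nonzero one-hot cells) / 110 total cells, about 0.18
out_default = build(0.3).fit_transform(df)   # 0.18 < 0.3, so the result is sparse
out_dense = build(0.0).fit_transform(df)     # a threshold of 0 always gives a dense array
print(sparse.issparse(out_default), sparse.issparse(out_dense))
```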

# Select and train a model 

We are now ready to select and train a Machine Learning model!

### Linear Regression

Let's first try Linear Regression.

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(train_prepared, train_labels)

To compare the predictions against the actual values, we need a performance measure.

A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight for large errors.

$$RMSE(X,h) = \sqrt{\frac{1}{m}\sum_{i=1}^m\left(h(x^{(i)})-y^{(i)}\right)^2}$$

Even though the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer to use another function. For example, suppose that there are many outlier districts. In that case, you may consider using the mean absolute error (MAE).

$$MAE(X,h)=\frac{1}{m}\sum_{i=1}^m|h(x^{(i)})-y^{(i)}|$$

Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values. RMSE corresponds to the Euclidean distance, and MAE corresponds to the Manhattan distance.
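
The link to vector norms can be verified numerically. The targets and predictions below are made up; the sketch computes both measures from their definitions and again from the Euclidean (`ord=2`) and Manhattan (`ord=1`) norms of the error vector, scaled by the number of instances `m`.

```python
import numpy as np

# Made-up targets and predictions; the last prediction is a large miss
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 300.0])
errors = y_pred - y_true
m = len(errors)

rmse = np.sqrt(np.mean(errors ** 2))
mae = np.mean(np.abs(errors))

# The same quantities expressed through vector norms of the error vector
rmse_from_norm = np.linalg.norm(errors, ord=2) / np.sqrt(m)
mae_from_norm = np.linalg.norm(errors, ord=1) / m
print(rmse, mae)
```

The single large miss raises the RMSE well above the MAE, which is the higher weight for large errors mentioned earlier.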

In [None]:
from sklearn.metrics import mean_squared_error

train_predictions = lin_reg.predict(train_prepared)
lin_mse = mean_squared_error(train_labels, train_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

In [None]:
from sklearn.metrics import mean_absolute_error

lin_mae = mean_absolute_error(train_labels, train_predictions)
lin_mae

### Decision Tree
A typical prediction error of 68,628 is not very satisfying when most districts' `median_house_value` ranges between 120,000 and 265,000. Let's try a more powerful model.

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(train_prepared, train_labels)

In [None]:
train_predictions = tree_reg.predict(train_prepared)
tree_mse = mean_squared_error(train_labels, train_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

No error at all? Could this model really be absolutely perfect? 