In [1]:
# List your imports here
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from scipy.stats import linregress
import numpy as np
%matplotlib inline

- High concentrations of certain bacteria in beach lagoons, such as those in Waikiki or Ko'olina, can be a serious health concern for beachgoers and tourists frequenting those lagoons. Monitoring and predicting high levels of bacteria is essential to assuring their safety and mitigating the hazards that cause them. Current methods for assessing the levels of bacteria are time-consuming and require growing the bacteria for 24 hours before testing the samples.

Here, you will build a model that uses chemical monitoring, an inexpensive and easily automated approach, to
predict the concentrations of 4 different types of bacteria (B1, B2, B3, and B4) in beach lagoons. Specifically, your model will provide a faster way to assess the levels of the four harmful bacteria based on chemical and physical measurements that are fast (run in less than 2 hours) and easy to carry out.

Your data consists of several lagoon water samples collected in one year. These measurements include chemical and physical measurements, as well as the levels of the four bacteria we are interested in predicting for out-of-sample data. The data is stored in the file `data/bacteria_final.csv`.


The dataset contains the following data:  
- Season of the year when the water samples were collected (summer, spring, autumn, winter)
- Wave action outside of the lagoon (small, medium or large)
- Water current condition (low, medium or high)
- Maximum pH value
- Water oxygen level
- Chloride Concentration
- Nitrates Concentration
- CO2 Concentration
- Abundances of Bacteria B1-B4 (the last four columns). Those are the target values you will be modeling.



- How many entries does that table contain?
- How many features does this dataset have?


In [3]:
# Write your code here
bacteria = pd.read_csv('data/bacteria_final.csv')
bacteria.shape

(340, 12)
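A minimal sketch of how the two answers follow from `shape`, using a small made-up table (the `toy` frame and `target_columns` list are illustrative names, not part of the assignment data). The number of entries is the row count, and the number of features is the column count minus the target columns. Applied to the output above, that gives 340 entries and 12 - 4 = 8 features.

```python
import pandas as pd

# Hypothetical miniature table: 3 feature columns followed by 2 target columns.
toy = pd.DataFrame({
    "season": ["summer", "winter", "spring"],
    "pH": [8.0, 7.9, 8.1],
    "oxygen": [9.8, 8.4, 10.1],
    "B1": [0.0, 1.4, 3.3],
    "B2": [2.1, 0.0, 5.5],
})
target_columns = ["B1", "B2"]

n_entries = toy.shape[0]                         # one row per water sample
n_features = toy.shape[1] - len(target_columns)  # targets are not features
print(n_entries, n_features)
```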

- The main goal of this study is to predict the frequencies of the four bacteria for out-of-sample data. Which type of learning task is this?
  1. Supervised or unsupervised?
  2. Regression or classification?


# Write your answer here
Supervised, regression

### Fixing Errors in the Data

Inspect your data to make sure it does not contain any inconsistencies.

Hint: Often, categorical attributes are entered manually and are, therefore, subject to human error or inconsistencies. Use your judgment to fix inconsistencies, if any.
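One way to spot such inconsistencies is to list the distinct values of each categorical column and compare them with the categories the data description allows. The sketch below uses an invented `toy` frame with deliberately inconsistent spellings; the specific errors shown are assumptions for illustration and may differ from those in the real file.

```python
import pandas as pd

# Hypothetical categorical columns with hand-entry inconsistencies.
toy = pd.DataFrame({
    "wave": ["small", "medium", "large_", "small", "medium_"],
    "current": ["low", "high", "low_", "medium", "high"],
})

# Listing the distinct values makes stray variants easy to see.
for column in ["wave", "current"]:
    print(column, sorted(toy[column].unique()))

# Remove the stray character from every string column (literal match, not a regex).
cleaned = toy.apply(lambda s: s.str.replace("_", "", regex=False))
print(sorted(cleaned["wave"].unique()))
```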
 


In [195]:
# Write your answer here
bacteria.replace(regex='_', value='', inplace=True)

Write a Python expression to show that your resulting data frame does not contain the inconsistencies identified in the original data.


In [196]:
# Write your answer here
not bacteria['wave'].str.contains('_').any() and not bacteria['current'].str.contains('_').any()

True

### Encoding the categorical features

Encode the categorical variables using a proper encoding. You can do this in Python using `sklearn`.
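As one possible reading of "proper encoding", ordered categories such as wave action and water current can be mapped to integers that preserve their order. The sketch below uses `sklearn.preprocessing.OrdinalEncoder` with an explicit category order on an invented `toy` frame; treating these columns as ordinal is an assumption. A nominal column such as season would more commonly be one-hot encoded.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample of the two ordered categorical columns.
toy = pd.DataFrame({
    "wave": ["small", "large", "medium", "small"],
    "current": ["low", "high", "medium", "high"],
})

# Spell out the order so that small < medium < large and low < medium < high.
encoder = OrdinalEncoder(categories=[
    ["small", "medium", "large"],
    ["low", "medium", "high"],
])

encoded = toy.copy()
encoded[["wave", "current"]] = encoder.fit_transform(toy[["wave", "current"]])
print(encoded)
```

Without an explicit order, an encoder assigns codes alphabetically, so `large` would receive a smaller code than `small`.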




In [197]:
# Write your code here
label_encoder = LabelEncoder()
bacteria['season'] = label_encoder.fit_transform(bacteria['season'])
bacteria['wave'] = label_encoder.fit_transform(bacteria['wave'])
bacteria['current'] = label_encoder.fit_transform(bacteria['current'])

#### Handling missing features

-  Many machine learning algorithms cannot handle missing features. While it's acceptable to discard a few instances with missing values when you are working with large datasets, it's preferable to infer (impute) missing values when working with relatively small datasets. Here you will use regression to impute missing values using the approach described below. Consider the following dataset, which has a missing value in column `C`:

<img src="images/ex_missing_values.png" alt="drawing" style="width:400px;"/>


Computing the correlation between columns C and A shows that the two columns are very highly correlated ($r \approx 0.99$). We can use that information to predict the missing values in C with a linear regression of C on A. For instance, according to the graph below, when A is 1.9, C is ~3.7.

<img src="images/imputation.png" alt="drawing" style="width:800px;"/>


You will use an approach similar to that used above to impute missing values. For each missing value, identify the feature that is most correlated with the one containing the missing value. To avoid any spurious imputations, we will only use this method for pairs of features that have a correlation larger than 0.25. For example, if the table above did not contain `A`, then we would not be able to apply this method, given that the correlation between B and C is less than 0.25.

Use linear regression to impute the missing values in each feature from the feature with which it is most highly correlated, provided that $r>0.25$.
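The procedure described above can be sketched as a small helper. This is an illustration under assumptions rather than the expected solution: the function name `impute_by_regression`, the `toy` frame, and the choice to compare the absolute value of $r$ against the threshold are all hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

def impute_by_regression(df, target, min_r=0.25):
    """Fill NaNs in df[target] from its most correlated column, if |r| > min_r."""
    corr = df.corr()[target].drop(target)   # pairwise-complete correlations
    predictor = corr.abs().idxmax()
    if corr.abs()[predictor] <= min_r:
        return df                           # no sufficiently correlated column
    known = df[[predictor, target]].dropna()
    fit = linregress(known[predictor], known[target])
    filled = df.copy()
    filled[target] = df[target].fillna(fit.intercept + fit.slope * df[predictor])
    return filled

# Hypothetical data: C is exactly 2 * A, with one value missing.
toy = pd.DataFrame({
    "A": [1.0, 1.5, 1.9, 2.5, 3.0],
    "B": [5.0, 1.0, 4.0, 2.0, 3.0],
    "C": [2.0, 3.0, np.nan, 5.0, 6.0],
})
filled = impute_by_regression(toy, "C")
print(filled)
```

The regression is fitted only on rows where both columns are known, and `fillna` leaves the observed values of the target untouched.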

In [198]:
# Write your code here
# bacteria.corr()
# pH, oxygen, chloride, nitrates, CO2
# pH: B1
# oxygen: chloride
# chloride: B1
# nitrates: CO2
# bacteria[bacteria.isna().any(axis=1)]

# For each feature, fit the regression on the rows where it is known,
# then fill the missing entries from the predictor column.
known = bacteria['pH'].dropna().index
linreg = linregress(bacteria.loc[known, 'B1'], bacteria.loc[known, 'pH'])
bacteria['pH'] = bacteria['pH'].fillna(linreg.intercept + linreg.slope * bacteria['B1'])

known = bacteria['chloride'].dropna().index
linreg = linregress(bacteria.loc[known, 'B1'], bacteria.loc[known, 'chloride'])
bacteria['chloride'] = bacteria['chloride'].fillna(linreg.intercept + linreg.slope * bacteria['B1'])

# chloride is filled first so that it can serve as the predictor for oxygen
known = bacteria['oxygen'].dropna().index
linreg = linregress(bacteria.loc[known, 'chloride'], bacteria.loc[known, 'oxygen'])
bacteria['oxygen'] = bacteria['oxygen'].fillna(linreg.intercept + linreg.slope * bacteria['chloride'])

Unfortunately, some features may not be sufficiently correlated with other features to use the imputation method above. For those, we will use another approach, which leverages the four nearest neighbors to fill in the missing values. Take for instance the following graph, which represents an instance in red (`e`) with a missing value and its three nearest neighbors (the blue vertices `a`, `b` and `c`, which share an edge with `e`).

We can impute the missing value in `e` by taking the average of its three nearest neighbors (`a`, `b` and `c`).


<img src="images/graph_imputation.png" alt="drawing" style="width:300px;"/>



- Use the Euclidean distance to compute the distance between two instances.

$$
    d(s_1, s_2) = \sqrt{\sum_{i=1}^{p}(s_{1,i} - s_{2,i})^2}
$$
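The distance formula and the neighbor average can be sketched with NumPy as follows. The helper name `knn_impute` and the toy points are hypothetical, and the distance is computed only over the columns that are observed in the incomplete row. Because the Euclidean distance is dominated by features with large numeric ranges, features are often standardized first; that step is left out here.

```python
import numpy as np

def knn_impute(complete, query, missing_idx, k=3):
    """Average column `missing_idx` over the k rows of `complete` nearest to `query`."""
    observed = [i for i in range(complete.shape[1]) if i != missing_idx]
    diffs = complete[:, observed] - query[observed]
    distances = np.sqrt((diffs ** 2).sum(axis=1))   # Euclidean distance per row
    nearest = np.argsort(distances)[:k]             # indices of the k closest rows
    return complete[nearest, missing_idx].mean()

# Hypothetical complete instances; the last column is the one missing in `e`.
complete = np.array([
    [1.0, 1.0, 10.0],   # a
    [1.0, 2.0, 20.0],   # b
    [2.0, 1.0, 30.0],   # c
    [9.0, 9.0, 99.0],   # d, far from e
])
e = np.array([1.5, 1.5, np.nan])

value = knn_impute(complete, e, missing_idx=2, k=3)
print(value)
```

Here `a`, `b` and `c` are equally close to `e` and `d` is far away, so the imputed value is the mean of the first three rows' last column.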



In [199]:
# Write your code here
for na_index, na_row in bacteria[bacteria.isna().any(axis=1)].iterrows():
    # Indices and distances of the (up to) 4 nearest complete rows found so far
    distances = {}
    for index, row in bacteria[bacteria.notna().all(axis=1)].iterrows():
        # Euclidean distance over every column except nitrates and CO2
        distance = ((pd.concat([na_row['season':'chloride'], na_row['B1':'B4']]) - pd.concat([row['season':'chloride'], row['B1':'B4']])) ** 2).sum() ** 0.5
        # Keep the candidate if there is room or it is closer than the current farthest neighbor
        if len(distances) < 4 or distance < max(distances.values()):
            if len(distances) == 4:
                distances.pop(max(distances, key=distances.get))
            distances[index] = distance
    # Fill only the values that are actually missing with the neighbors' mean
    for column in ['nitrates', 'CO2']:
        if pd.isna(bacteria.loc[na_index, column]):
            bacteria.loc[na_index, column] = bacteria.loc[list(distances), column].mean()

Write a Python expression to show that your resulting data frame does not contain any missing values.


In [200]:
# Write the expression here
bacteria[bacteria.isna().any(axis=1)].shape[0] == 0

True

### Building the Model: Train-Test Datasets

* Split the original dataset into training and testing sets.
  * You can either do it manually or by using an appropriate `sklearn` function

* Fit your data using the most adequate linear model; i.e., which degree polynomial provides the best results?
  * Again, use your best judgment and your understanding of how the error is evaluated to assess the error obtained with each model.



In [241]:
# Write your code here
X_train, X_test, y_train, y_test = train_test_split(bacteria.loc[:, 'season':'CO2'], bacteria.loc[:, 'B1':'B4'])
best_poly = 1
min_error = float('inf')

for i in range(0, 5):
    poly = PolynomialFeatures(i)
    train_transformed = poly.fit_transform(X_train)
    test_transformed = poly.transform(X_test)
    
    lin = LinearRegression()
    lin.fit(train_transformed, y_train)
    y = lin.predict(test_transformed)
    
    error = np.sqrt(mean_squared_error(y_test, y))
    if error < min_error:
        best_poly = i
        min_error = error
        
best_poly, min_error

(1, 11.785269073141588)

In 2-3 sentences, how well does your model perform in generalizing to new instances?


# Write your interpretation here
It is not very precise. The result varies from degree 0 to degree 2, or from a constant function to a quadratic function, which is quite extreme. The error also varies quite a bit and is quite high, so it does not perform very well in generalizing to new instances.

### Building the Model: k-fold Cross-Validation

* Repeat the above using k-fold Cross-Validation.
  * I.e., find the best model and interpret your results
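A minimal sketch of degree selection with k-fold cross-validation, run on synthetic data rather than the bacteria table. Wrapping `PolynomialFeatures` and `LinearRegression` in a pipeline is a design choice made here so that each fold fits its own transformation. The data-generating function, the noise level, and the range of degrees are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a truly quadratic relationship plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.5, size=200)

rmse_by_degree = {}
for degree in range(1, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # scikit-learn reports negated MSE so that larger is always better
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=10)
    rmse_by_degree[degree] = np.sqrt(-scores.mean())

best_degree = min(rmse_by_degree, key=rmse_by_degree.get)
print(rmse_by_degree, best_degree)
```

On data like this, the straight-line model should show a clearly higher error than the quadratic one, while higher degrees usually differ only marginally from degree 2.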



In [242]:
# Write your code here
X_train, X_test, y_train, y_test = train_test_split(bacteria.loc[:, 'season':'CO2'], bacteria.loc[:, 'B1':'B4'])
best_poly = 1
min_error = float('inf')

for i in range(0, 5):
    poly = PolynomialFeatures(i)
    train_transformed = poly.fit_transform(X_train)
    test_transformed = poly.transform(X_test)
    
    lin = LinearRegression()
    
    error = np.sqrt(np.mean(-cross_val_score(lin, train_transformed, y_train, scoring='neg_mean_squared_error', cv=10)))
    if error < min_error:
        best_poly = i
        min_error = error
        
best_poly, min_error

(1, 12.520672197828542)

# Write your interpretation here
This method is much more precise. The result is consistently degree 1 and the error does not vary by much. However, the error is still quite high, so it is still quite inaccurate.

### Comparing the Models

* Why do the models differ in their accuracy?


# Write your answer here
The methods differ in accuracy because the first method depends heavily on a single split between the training and testing datasets: the model is trained once on the training dataset and then evaluated once on the testing dataset. The second method instead repeats the process for k iterations. During each iteration, the model is trained on k-1 folds of the training data before being used to predict the remaining validation fold. The overall error is the average error across all iterations, so the effect of any single unlucky split is largely averaged out.
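The dependence on the split can be checked empirically. The sketch below, on synthetic data with assumed coefficients and noise, repeats a single hold-out split and a shuffled 10-fold cross-validation over 20 random seeds and prints how much each error estimate varies. One would typically expect the hold-out estimate to vary more, although the exact numbers depend on the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic linear data; coefficients and noise level are arbitrary.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 1.0, size=200)

holdout_rmse, kfold_rmse = [], []
for seed in range(20):
    # Single train/test split: one model, one error estimate
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    holdout_rmse.append(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))

    # 10-fold cross-validation: error averaged over ten validation folds
    folds = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(LinearRegression(), X, y,
                             scoring="neg_mean_squared_error", cv=folds)
    kfold_rmse.append(np.sqrt(-scores.mean()))

print("hold-out RMSE spread:", np.std(holdout_rmse))
print("10-fold RMSE spread:", np.std(kfold_rmse))
```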