# ___

# [ Machine Learning in Geosciences ]

**Department of Applied Geoinformatics and Cartography, Charles University** 

*Lukas Brodsky lukas.brodsky@natur.cuni.cz*

    
___


# Machine Learning Project!

Step 4 

Goal: This notebook demonstrates the **data preparation** steps for machine learning algorithms.  

Content: **Prepare the data**

    4.1/ Data Cleaning    
    4.2/ Transformation Pipelines (imputation, adding a new feature, feature scaling) 
___    

## Setup environment

In [None]:
# Common imports
import numpy as np
import os

# add more based on the topic of the lab

# to make this notebook's output stable across runs
np.random.seed(42)

# plotting 
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# path to the current lab directory - set individually!!!
# TODO HERE! 
PROJECT_DIR = "./"
if os.path.isdir(PROJECT_DIR): 
    print('Ok continue.')
else: 
    print('Nok, set correct path to your project directory!')


In [None]:
import pandas as pd

# check the data set dir 
forest_path = os.path.join(PROJECT_DIR, "forest_fires")
# print(os.listdir(forest_path))

# function to read the csv file 
def load_local_data(data_path, csv_file):
    csv_path = os.path.join(data_path, csv_file)
    return pd.read_csv(csv_path)

# load data 
fires = load_local_data(forest_path, "forestfires.csv")

# check header and some values 
fires.head()

In [None]:
# drop labels for training set
fires_features = fires.drop("area", axis=1) 
fires_area = fires["area"].copy()
fires.head()

In [None]:
fires_features.drop("month", inplace=True, axis=1)
fires_features.drop("day", inplace=True, axis=1)

In [None]:
fires_features.head(10)

In [None]:
# Here I simulate missing values 
# df.replace('?', np.nan, inplace=True)
# use .loc instead of chained indexing so the assignment hits the DataFrame itself
fires_features.loc[[4, 104, 240], 'temp'] = np.nan

In [None]:
fires_features[fires_features.isnull().any(axis=1)].head()

In [None]:
# Which records have None / null?

sample_incomplete_rows = fires_features[fires_features.isnull().any(axis=1)].head()
sample_incomplete_rows

### Rejecting records with missing values

In [None]:
# sample_incomplete_rows.dropna(subset=["temp"])    # option 1
# sample_incomplete_rows.drop("temp", axis=1)       # option 2

### Fill in the median value 

In [None]:
median = fires_features["temp"].median()
# option 3: fill on a new object instead of an in-place fill on a column slice
sample_incomplete_rows = sample_incomplete_rows.fillna({"temp": median})
sample_incomplete_rows
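The three options above can be compared side by side on a small table. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from `forestfires.csv`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing temperature
toy = pd.DataFrame({"temp": [18.0, np.nan, 22.0, 30.0],
                    "wind": [2.7, 4.5, 1.3, 3.1]})

option1 = toy.dropna(subset=["temp"])    # option 1: drop rows where temp is missing
option2 = toy.drop("temp", axis=1)       # option 2: drop the whole temp column
toy_median = toy["temp"].median()        # the median ignores NaN values
option3 = toy.fillna({"temp": toy_median})   # option 3: fill with the median

print(option1.shape, option2.shape)
print(option3)
```

None of the three calls modifies `toy` itself; each returns a new DataFrame, so the options can be tried without reloading the data.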

In [None]:
# check the original value 
print(fires['temp'][4]) 
print(fires['temp'][104]) 
print(fires['temp'][240]) 

**Scikit-Learn Imputer**

Since Scikit-Learn 0.20, the `sklearn.impute.SimpleImputer` class replaces the `sklearn.preprocessing.Imputer` class.

In [None]:
try:
    from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+
except ImportError:
    from sklearn.preprocessing import Imputer as SimpleImputer

# np.nan is the canonical spelling (the np.NaN alias was removed in NumPy 2.0)
imputer = SimpleImputer(strategy="median", missing_values=np.nan)

In [None]:
fires_features.head()

In [None]:
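# Fit the imputer (learn the median of each column) and use it to transform 
# the training set by replacing missing values with the learned medians.
# Passing columns and index keeps the original labels in the result.
fires_features_imputed = pd.DataFrame(imputer.fit_transform(fires_features),
                                      columns=fires_features.columns,
                                      index=fires_features.index)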

In [None]:
fires_features_imputed.head()
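To see what the imputer actually learns, the fit and transform steps can be separated. The sketch below uses a made-up `toy` table (an assumption for illustration, not the fires data); after `fit`, the learned medians are stored in the `statistics_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values) with one gap in each column
toy = pd.DataFrame({"temp": [10.0, np.nan, 20.0, 40.0],
                    "RH": [30.0, 50.0, np.nan, 70.0]})

toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(toy)                     # learn one median per column
print(toy_imputer.statistics_)           # the learned medians, in column order

# transform returns a NumPy array, so the labels are restored explicitly
toy_filled = pd.DataFrame(toy_imputer.transform(toy),
                          columns=toy.columns, index=toy.index)
print(toy_filled)
```

Keeping `fit` and `transform` apart is what later allows the same learned medians to be applied to new data.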

### Transformation Pipelines

One of the most important transformations you need to apply to your data is feature scaling. Many machine learning algorithms don’t perform well when the input numerical attributes have very different scales. 

**Feature Scaling**: 

    * min-max scaling (normalization): values are shifted and rescaled so that they end up ranging from 0 to 1. 
    * standardization: subtracts the mean value and then divides by the standard deviation, so that the resulting distribution has zero mean and unit variance.

Scikit-learn provides `MinMaxScaler` for min-max scaling and `StandardScaler` for standardization. 

The MinMax scaler transformation is given by (where `min` and `max` are the bounds of the target feature range, 0 and 1 by default): 

    X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    X_scaled = X_std * (max - min) + min
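The formula can be checked against `MinMaxScaler` on a small array. The sketch below uses made-up values in `X_toy`, and the names `feature_min` and `feature_max` are illustrative stand-ins for `min` and `max` in the formula. It also compares `StandardScaler` with the manual computation, which divides by the standard deviation.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data (hypothetical values): two features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# Min-max scaling by hand, following the formula above
feature_min, feature_max = 0.0, 1.0
X_std = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))
X_manual = X_std * (feature_max - feature_min) + feature_min

# The same with scikit-learn
X_minmax = MinMaxScaler(feature_range=(feature_min, feature_max)).fit_transform(X_toy)

# Standardization: subtract the column mean, divide by the column standard deviation
X_standard = StandardScaler().fit_transform(X_toy)
X_standard_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_manual, X_minmax))
print(np.allclose(X_standard, X_standard_manual))
```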

In [None]:
fires.head()

In [None]:
fires_sel = fires[['X', 'Y', 'FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain']]

In [None]:
X = np.array(fires_sel)

In [None]:
X.shape

In [None]:
# define function for adding extra features 

# get the right column indices: safer than hard-coding indices 7, 8
RH_ix, wind_ix = [
    list(fires_sel.columns).index(col)
    for col in ("RH", "wind")]

print(RH_ix, wind_ix)

In [None]:
# define the function for the transformation 

def add_extra_features(X):
    RH_per_wind = X[:, RH_ix] / X[:, wind_ix]
    return np.c_[X, RH_per_wind]
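A function of this kind can be wrapped in a `FunctionTransformer` so that it behaves like any other scikit-learn transformer. The sketch below is a standalone illustration: `add_ratio_feature`, `TOY_RH_IX`, `TOY_WIND_IX` and the values in `X_toy` are hypothetical names and data, not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical column positions in a two-column toy array
TOY_RH_IX, TOY_WIND_IX = 0, 1

def add_ratio_feature(X):
    """Append RH / wind as a new last column."""
    ratio = X[:, TOY_RH_IX] / X[:, TOY_WIND_IX]
    return np.c_[X, ratio]

# Toy data (hypothetical values): columns are RH and wind
X_toy = np.array([[40.0, 2.0],
                  [90.0, 4.5]])

adder = FunctionTransformer(add_ratio_feature)
X_extended = adder.fit_transform(X_toy)
print(X_extended.shape)
print(X_extended)
```

Note that the ratio is undefined for a wind value of zero, so a real data set should be checked for zeros in the denominator column first.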

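We can build a pipeline for preprocessing the numerical attributes. This notebook wraps `add_extra_features` in `FunctionTransformer(...)`; a custom transformer class (for example a `CombinedAttributesAdder()`, which is not defined in this notebook) could be used instead: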

In [None]:
fires_num = fires_features

In [None]:
from sklearn.pipeline import Pipeline

from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median", missing_values=np.nan)),
        ('attribs_adder', FunctionTransformer(add_extra_features)),
        ('std_scaler', StandardScaler()),
    ])

# fires_features
fires_num_tr = num_pipeline.fit_transform(fires_num)
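The behaviour of such a pipeline is easier to inspect on a tiny array. The sketch below chains only an imputer and a scaler; `toy_pipeline` and the values in `X_toy` are illustrative assumptions. Each fitted step stays accessible through `named_steps`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values) with one missing entry in the first column
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, 60.0]])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),   # step 1: fill gaps
    ("std_scaler", StandardScaler()),                # step 2: standardize
])
X_toy_tr = toy_pipeline.fit_transform(X_toy)

# Medians learned by the first step, then the column means after scaling
print(toy_pipeline.named_steps["imputer"].statistics_)
print(X_toy_tr.mean(axis=0))
```

The steps run in the listed order, so the scaler only ever sees the already imputed values.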

In [None]:
fires_num.head()

In [None]:
print(fires_num.shape)

In [None]:
print(fires_num_tr.shape)

In [4]:
fires_num.columns


In [None]:
pd.DataFrame(fires_num_tr, columns=['X', 'Y', 'FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain',
        'RHw'], index=fires_num.index).head()

You can also test other transformations. 

In [None]:
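from sklearn.compose import ColumnTransformer  
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

In [None]:
# Sketch: combine the numerical pipeline with an encoder for the categorical columns
# num_attribs = list(fires_num.columns)
# cat_attribs = ["month", "day"]

# full_pipeline = ColumnTransformer([
#         ("num", num_pipeline, num_attribs),
#         ("cat", OneHotEncoder(), cat_attribs),
#     ])

# fires_prepared = full_pipeline.fit_transform(fires)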
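A `ColumnTransformer` applies a different transformer to each group of columns and concatenates the results. The sketch below is a minimal standalone illustration on a made-up table; `toy`, `num_cols`, `cat_cols` and `toy_full` are hypothetical names, and a plain `StandardScaler` stands in for the numerical pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values): two numerical columns and one categorical column
toy = pd.DataFrame({
    "temp": [18.0, 22.0, 30.0, 25.0],
    "wind": [2.7, 4.5, 1.3, 3.1],
    "month": ["aug", "sep", "aug", "mar"],
})
num_cols = ["temp", "wind"]
cat_cols = ["month"]

toy_full = ColumnTransformer([
    ("num", StandardScaler(), num_cols),                        # scale numbers
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),  # one column per category
])
toy_prepared = toy_full.fit_transform(toy)

# 2 scaled numerical columns + 3 one-hot columns (aug, mar, sep)
print(toy_prepared.shape)
```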

### Splitting data into train and test sets

In [3]:
# random split 
from sklearn.model_selection import train_test_split

# split the full table: the 'area' label is needed below and was dropped from fires_num
train_set, test_set = train_test_split(fires, test_size=0.3, random_state=42)


In [2]:
# convert the selected attributes to Numpy ndarray 
X_train = np.array(train_set[['X', 'Y', 'FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain']], 
                   dtype=np.float64)
y_train = np.array(train_set[['area']].values.ravel(), dtype=np.float64) 

# .values will give the values in a numpy array (shape: (n,1))
# .ravel will convert that array shape to (n, ) (i.e. flatten it)
#.values.ravel()

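The split and the effect of `.values.ravel()` on the label shape can be checked on a small table. This is a sketch on made-up data: the `toy` DataFrame and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical values): one feature and one label column
toy = pd.DataFrame({"temp": np.arange(10, dtype=float),
                    "area": np.arange(10, dtype=float) * 0.5})

# random_state fixes the shuffle, so the split is reproducible across runs
toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=42)

X_toy_train = toy_train[["temp"]].to_numpy(dtype=np.float64)
y_2d = toy_train[["area"]].values    # 2-D array, shape (n, 1)
y_toy_train = y_2d.ravel()           # flattened to 1-D, shape (n,)

print(len(toy_train), len(toy_test))
print(y_2d.shape, y_toy_train.shape)
```

Most scikit-learn estimators expect the label vector in the flattened `(n,)` form, which is why the `ravel()` step is there.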