This lecture is about data quality, so we focus on the data themselves rather than on the prediction task. For this reason, the prediction functions are kept as simple as possible and rely on basic algorithms from the scikit-learn library.

To understand how a simple supervised learning task is carried out, refer to the introductory part (link) __TODO: link to the introduction about supervised learning__.

You do not need this section to understand the rest of the course, but if you are curious about the functions used for prediction, we go through them here.

There are two kinds of supervised prediction tasks: classification and regression. Here, both are tackled the same way; only the variant of the algorithm (classifier or regressor) and the way of measuring performance change. We use a simple K-Nearest-Neighbors algorithm for both types of prediction tasks.

## The K-Nearest-Neighbors algorithm

__TODO__
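As a rough illustration of the idea behind K-Nearest-Neighbors, the sketch below predicts a value for one query point by looking at the `k` closest training points: their targets are averaged for regression, and the most frequent label is returned for classification. The function name `knn_predict` and its parameters are hypothetical and only serve this example; the functions in this section rely on scikit-learn's estimators instead, which by default also use the Euclidean distance with equally weighted neighbors.

```python
import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, x_query, k=3, task="regression"):
    """Hypothetical helper: predict the target of one query point from its k nearest neighbors."""
    x_train = np.asarray(x_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)

    # Euclidean distance between the query point and every training point
    distances = np.linalg.norm(x_train - x_query, axis=1)

    # Indices of the k training points closest to the query
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]

    if task == "regression":
        # Regression: average the targets of the neighbors
        return float(np.mean(neighbor_targets))
    # Classification: majority vote among the neighbors
    return Counter(neighbor_targets).most_common(1)[0][0]

# Toy data, made up for the illustration
points = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]]
values = [1.0, 2.0, 3.0, 10.0, 12.0]
labels = ["red", "red", "blue", "blue", "blue"]

regression_result = knn_predict(points, values, [0.2, 0.1], k=3, task="regression")
classification_result = knn_predict(points, labels, [5.5, 5.0], k=3, task="classification")
```

With a small `k` the prediction follows the closest points very tightly, while a larger `k` smooths it over a wider neighborhood.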

## Regression function

__TODO__

In [None]:
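# KNN regression
def knn_regression(df, x, y):
    '''
    Fit a K-Nearest-Neighbors regressor and predict on a held-out test set
    in:
    df: pandas DataFrame
    x: list of the names of the columns used as features
    y: the name of the column to predict
    out:
    the predictions for the test set and the true test values
    '''
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsRegressor

    # Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
    x_train, x_test, y_train, y_test = train_test_split(df[x], df[y], test_size = 0.2, random_state = 0)

    # scikit-learn's KNN does not accept missing values: replace them with 0
    x_train = x_train.fillna(value = 0)
    x_test = x_test.fillna(value = 0)
    y_train = y_train.fillna(value = 0)
    y_test = y_test.fillna(value = 0)

    # Each prediction is the mean target of the 25 nearest training rows
    knn = KNeighborsRegressor(n_neighbors = 25)
    knn.fit(x_train, y_train)
    predictions = knn.predict(x_test)

    return predictions, y_test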
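Since the regression function returns the predictions together with the true test values, any regression metric can be applied to that pair. The sketch below is one possible way to do it, using made-up arrays in place of the function's output; the choice of mean squared error and mean absolute error is an assumption, and the course may rely on other measures.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical values standing in for the (predictions, y_test) pair
# returned by the regression function
y_test = np.array([3.0, 5.0, 2.5, 7.0])
predictions = np.array([2.5, 5.0, 4.0, 8.0])

# Both metrics take the true values first, then the predicted values
mse = mean_squared_error(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)

print('Mean squared error = %.3f' % mse)
print('Mean absolute error = %.3f' % mae)
```

Lower values mean better predictions for both metrics; the squared error penalizes large mistakes more heavily than the absolute error.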

## Classification function

__TODO__

In [None]:
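# KNN classification
def knn_classification(df, x, y):
    '''
    Fit a K-Nearest-Neighbors classifier and predict on a held-out test set
    in:
    df: pandas DataFrame
    x: list of the names of the columns used as features
    y: the name of the column containing the class to predict
    out:
    the predicted classes for the test set and the true test classes
    '''
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
    x_train, x_test, y_train, y_test = train_test_split(df[x], df[y], test_size = 0.2, random_state = 0)

    # Missing values are not replaced here (the lines below are disabled),
    # so df[x] and df[y] must not contain any
    #x_train = x_train.fillna(value = 0)
    #x_test = x_test.fillna(value = 0)
    #y_train = y_train.fillna(value = 0)
    #y_test = y_test.fillna(value = 0)

    # Each prediction is the most frequent class among the 25 nearest training rows
    knn = KNeighborsClassifier(n_neighbors = 25)
    knn.fit(x_train, y_train)
    predictions = knn.predict(x_test)

    return predictions, y_test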
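For classification, the returned pair can be scored in the same spirit with a classification metric. The sketch below uses accuracy (the fraction of correctly predicted classes) on made-up labels standing in for the function's output; accuracy is only an assumed choice here, not necessarily the measure used in the rest of the course.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels standing in for the (predictions, y_test) pair
# returned by the classification function
y_test = ['a', 'b', 'a', 'c', 'b']
predictions = ['a', 'b', 'c', 'c', 'a']

# Fraction of test rows whose predicted class equals the true class
accuracy = accuracy_score(y_test, predictions)

print('Accuracy = %.2f' % accuracy)
```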

## Trajectory prediction with Random forest

__TODO: remove if not used (very probable)__

In [None]:
def rf_trajectory(df, split, clas):
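    '''
    Do time series prediction on df
    in:
    df: pandas DataFrame with 'TripID' and 'BaseDateTime' columns
    split: float, between 0 and 1, fraction of the rows used for training
    clas: string, the name of the attribute to predict (do it separately for latitude and longitude)
    out:
    train, test: the training and testing DataFrames
    predictions: the random forest predictions for the test set
    baseline, error: mean absolute error of the baseline and of the random forest
    '''
    import pandas as pd
    from sklearn.metrics import mean_absolute_error
    from sklearn.ensemble import RandomForestRegressor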
    
    # First, we create a new dataset with 2 additional columns for prediction:
    # - previous_clas: contains the value of the class attribute of the previous row (example: previous latitude value)
    # - diff_clas: the difference between the previous value and the current value for the class attribute
    
    df2 = pd.DataFrame(columns = ['TripID', 'BaseDateTime', clas, 'previous_clas', 'diff_clas'])
    
    df2['TripID'] = df['TripID']
    df2['BaseDateTime'] = df['BaseDateTime']
    df2[clas] = df[clas]
    
    previous_tripid = 'first'
    
    for index, row in df2.sort_values(['TripID', 'BaseDateTime']).iterrows():
        if previous_tripid != row['TripID']: # new trip
            previous_tripid = row['TripID']
            previous_clas = row[clas]
        
        df2.loc[index, 'previous_clas'] = previous_clas
        previous_clas = row[clas]
        
    df2['diff_clas'] = df2[clas] - df2['previous_clas']
    
    df2['previous_clas'] = pd.to_numeric(df2['previous_clas'])
    df2['diff_clas'] = pd.to_numeric(df2['diff_clas'])
    
    # Split training and testing sets: the training set is the first part of the rows
    # (this is the chronological order only if df is already sorted by time)
    nb_train = int(split * len(df2))
    nb_test = int(len(df2) - nb_train)
    train = df2.head(nb_train)
    test = df2.tail(nb_test)
    
    # Baseline prediction: always predicts the last position
    baseline = mean_absolute_error(test[clas], test['previous_clas'])
    #print('Baseline: mean absolute error = %.5f' % baseline)
    
    # Random Forest prediction
    # Drop the target and the timestamp column (the random forest cannot process timestamps)
    x_train = train.drop(columns = [clas, 'BaseDateTime'])
    x_test = test.drop(columns = [clas, 'BaseDateTime'])
    y_train, y_test = train[clas].values, test[clas].values

    rf = RandomForestRegressor(n_estimators = 1000, n_jobs = -1, random_state = 0)
    rf.fit(x_train, y_train)
    predictions = rf.predict(x_test)
    error = mean_absolute_error(y_test, predictions)
    
    #print('Random Forest: mean absolute error = %.5f' % error)
    
    return train, test, predictions, baseline, error
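The row-by-row loop that fills `previous_clas` can also be written with pandas' `groupby` and `shift`, which is usually much faster on large tables. The sketch below is an alternative under stated assumptions, not the function used above: the name `add_previous_value` and the `LAT` column are hypothetical, and the DataFrame index is assumed to be unique so that the values can be aligned back to the original row order.

```python
import pandas as pd

def add_previous_value(df, clas):
    """Hypothetical helper: add 'previous_clas' and 'diff_clas' columns per trip."""
    out = df[['TripID', 'BaseDateTime', clas]].copy()

    # Order the rows by trip, then by time inside each trip
    ordered = out.sort_values(['TripID', 'BaseDateTime'])

    # Value of the previous row of the same trip (missing for the first row of a trip)
    previous = ordered.groupby('TripID')[clas].shift()

    # The first row of a trip has no predecessor: use its own value
    previous = previous.fillna(ordered[clas])

    # Assignment aligns on the index, so the original row order is kept
    out['previous_clas'] = previous
    out['diff_clas'] = out[clas] - out['previous_clas']
    return out

# Toy data, made up for the illustration
toy = pd.DataFrame({
    'TripID': [1, 2, 1, 2, 1],
    'BaseDateTime': pd.to_datetime([
        '2017-01-01 00:02', '2017-01-01 00:00', '2017-01-01 00:00',
        '2017-01-01 00:01', '2017-01-01 00:01',
    ]),
    'LAT': [10.2, 20.0, 10.0, 20.5, 10.1],
})
result = add_previous_value(toy, 'LAT')
```

Unlike the loop, this version returns numeric columns directly, so no later conversion with `pd.to_numeric` is needed.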