# Example answer

The code below is an example answer. It shows how you can streamline some of your processing using `ColumnTransformer` and change the order of some steps in the workflow.

In [None]:
# Import libraries
import numpy as np
import pandas as pd

%matplotlib inline

In [None]:
# Set file path
student_performance_filepath = '../data/student_performance.csv'

# Import data into a data frame.
raw_data = pd.read_csv(filepath_or_buffer=student_performance_filepath, delimiter=",")

column_names = list(raw_data.columns)
print(column_names)
raw_data.sample(n=10)

## Missing values

Before we fill in the missing values in our data, we should understand the distributions of the attributes that contain them.


In [None]:
raw_data.info()

In [None]:
raw_data.isna().sum()

In [None]:
raw_data['lunch'].value_counts().plot.bar()

# Convert False and True to 0 and 1 respectively.
lunch_map = {
    False: 0,
    True: 1
}

raw_data["lunch"] = raw_data["lunch"].replace(lunch_map)

The majority class of the "lunch" data is "False". We can either try to impute the missing values based on the other feature values, or replace them with the mode.
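As a minimal sketch of the mode option, `SimpleImputer` with `strategy="most_frequent"` fills each gap with the most common value in the column. The toy column below is invented for illustration and is not taken from the student performance data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical binary column with one missing value (0 is the majority class).
toy_lunch = pd.DataFrame({"lunch": [0.0, 0.0, 1.0, np.nan, 0.0]})

# Replace missing values with the most frequent value in the column.
mode_imputer = SimpleImputer(strategy="most_frequent")
filled_lunch = mode_imputer.fit_transform(toy_lunch)

print(filled_lunch.ravel())
```

Mode imputation ignores the other features, so it is cheap but cannot recover any relationship between "lunch" and the rest of the data.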

In [None]:
raw_data['preparation_course'].value_counts().plot.bar();

# Convert "none" and "completed" to 0 and 1 respectively.
preparation_map = {
    "none": 0,
    "completed": 1
}

raw_data["preparation_course"] = raw_data["preparation_course"].replace(preparation_map)


The majority class of the "preparation_course" data is "none". Again, we can either try to impute the missing values based on the other feature values, or replace them with the mode.

In [None]:
raw_data["math_score"].plot.density();

We can see that the "math_score" attribute is quite symmetrically distributed.
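A density plot gives a visual impression of symmetry; the sample skewness is one way to put a number on it. The sketch below uses synthetic data (not the real "math_score" column) to show that `Series.skew()` is close to zero for a symmetric distribution and clearly positive for a right-skewed one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: a roughly symmetric sample and a right-skewed sample.
symmetric_scores = pd.Series(rng.normal(loc=65, scale=15, size=5000))
skewed_scores = pd.Series(rng.exponential(scale=15, size=5000))

symmetric_skew = symmetric_scores.skew()
skewed_skew = skewed_scores.skew()

print("Skewness of symmetric sample:", round(symmetric_skew, 2))
print("Skewness of skewed sample:", round(skewed_skew, 2))
```

On the real data the equivalent check would be `raw_data["math_score"].skew()`.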

### Encoding Data

Some of our data is not yet in an appropriate format to be passed to a model, so we need to encode it. The "gender" feature is binary, so it can be one hot encoded, label encoded or mapped directly; each option produces the same effect.



In [None]:
gender_map = {
    "female": 0,
    "male": 1
}

raw_data["gender"] = raw_data["gender"].replace(gender_map)

The "parent_education" feature is ordinal, as each level of education follows on from the previous one. We therefore have to decide whether to one hot encode it (treating the categories as independent) or to map the values to integers (maintaining the order, but artificially creating a distance between values).

We are going to map each category to its position in the order so we don't lose structural information.
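To make the trade-off concrete, here is a small sketch that encodes the same toy column both ways. The sample values and the variable names are illustrative only; the integer codes reuse a subset of the mapping used in this notebook.

```python
import pandas as pd

# Hypothetical sample of education levels.
levels = pd.Series(
    ["high school", "some college", "bachelor's degree", "high school"],
    name="parent_education",
)

# Ordinal encoding: a single column that preserves the order of the levels.
order = {"high school": 1, "some college": 2, "bachelor's degree": 4}
ordinal_encoded = levels.map(order)

# One hot encoding: one column per level, with no order between them.
one_hot_encoded = pd.get_dummies(levels)

print(ordinal_encoded.tolist())
print(one_hot_encoded.shape)
```

The ordinal version stays as one feature, while the one hot version grows by one column per category and discards the ordering.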

In [None]:
print(raw_data["parent_education"].unique())

parent_education_map = {
    "some high school": 0,
    "high school": 1,
    "some college": 2,
    "associate's degree": 3,
    "bachelor's degree": 4,
    "master's degree": 5,
}

raw_data["parent_education"] = raw_data["parent_education"].replace(parent_education_map)

# Make a copy of the encoded data frame for later use.
encoded_data_frame = raw_data.copy()

In [None]:
encoded_data_frame.head()

### Scaling

We will now scale the data appropriately. The binary data doesn't need to be scaled, and we have not actually one hot encoded any of the data. We are going to apply `MinMaxScaler` to the ordinal data and, as the "math_score" data is roughly normally distributed, `StandardScaler` to that column.
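The difference between the two scalers can be seen on a pair of toy columns. This is only a sketch with made-up values: `MinMaxScaler` squeezes the ordinal codes into the range 0 to 1, while `StandardScaler` centres the scores on zero with unit variance.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical ordinal codes (0 to 5) and exam scores.
toy_ordinal = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
toy_scores = np.array([[40.0], [55.0], [60.0], [70.0], [75.0]])

# Rescale the ordinal codes to the interval [0, 1].
ordinal_scaled = MinMaxScaler().fit_transform(toy_ordinal)

# Standardise the scores to zero mean and unit variance.
scores_scaled = StandardScaler().fit_transform(toy_scores)

print(ordinal_scaled.ravel())
print(scores_scaled.ravel())
```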

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.impute import KNNImputer

In [None]:
# Impute and scale each listed column with its own transformer. Columns that
# are not listed are passed through unchanged and placed after the transformed ones.
clean_scale_data_transform = ColumnTransformer(
    [
        ('KNN impute lunch', KNNImputer(missing_values=np.nan, n_neighbors=1), ["lunch"]),
        ('KNN impute preparation_course', KNNImputer(missing_values=np.nan, n_neighbors=1), ["preparation_course"]),
        ('MinMax scale', MinMaxScaler(), ["parent_education"]),
        ('StandardScaler', StandardScaler(), ["math_score"]),
    ],
    remainder="passthrough",
)

# This creates a single array of the cleaned and scaled features.
clean_data = clean_scale_data_transform.fit_transform(raw_data)
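One caveat worth checking: `KNNImputer` measures distances using the columns it is given. When it only receives the column being imputed, a row with a missing value has nothing left to compare on, and scikit-learn falls back to the column mean. The sketch below, on invented toy arrays, contrasts that with passing a second, informative column.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Single column: the row with NaN has no other feature to measure distance on.
single_column = np.array([[0.0], [0.0], [1.0], [np.nan]])
single_filled = KNNImputer(n_neighbors=1).fit_transform(single_column)

# Two columns: the second column lets the imputer find the nearest neighbour.
two_columns = np.array([[0.0, 10.0], [0.0, 11.0], [1.0, 50.0], [np.nan, 49.0]])
two_filled = KNNImputer(n_neighbors=1).fit_transform(two_columns)

print("Single column fill:", single_filled[3, 0])
print("Two column fill:", two_filled[3, 0])
```

With a single column the gap is filled with the column mean (1/3 here), which is not a valid binary value. With two columns the nearest neighbour's value (1.0) is used. If neighbour-based imputation is the goal, it may be worth passing several columns to one `KNNImputer` step.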

Let's look at how our features are correlated with the target to determine whether we should remove any. We are going to use the pre-transformed data (encoded, but not yet imputed or scaled) for ease of visualisation. Why could this give a misleading result?

In [None]:

# Generate the correlation matrix with pandas.
correlation_matrix = encoded_data_frame.corr()

# Show the matrix as a heatmap; matplotlib can also be used instead.
# Styler.set_precision was removed in pandas 2.0, so use format(precision=...).
correlation_matrix.style.background_gradient(cmap='coolwarm').format(precision=2)

We can see that all the features are noticeably correlated with the "writing_score" attribute, and therefore carry some information to pass to our model.

We now need to separate our X and y data in order to produce a train/test split.

In [None]:
# The target is assumed to be the last column of the transformed array.
y = clean_data[:, -1]
X = np.delete(clean_data, -1, axis=1)

print("Shape of target: ", y.shape)
print("Shape of features: ", X.shape)

In [None]:
from sklearn.model_selection import train_test_split
# Split the data set into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

I have chosen the Lasso regression model here, as it generally performs as well as linear regression but may be able to outperform it when we tune alpha.
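The link between the two models can be sketched on synthetic data (the data and coefficients below are invented for illustration). With a very small alpha, Lasso behaves almost like ordinary linear regression; with a larger alpha it shrinks the coefficients and sets the weakest ones to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: the third feature has no effect on the target.
X_toy = rng.normal(size=(200, 3))
y_toy = 3.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

linear_model = LinearRegression().fit(X_toy, y_toy)
weak_lasso = Lasso(alpha=1e-4).fit(X_toy, y_toy)
strong_lasso = Lasso(alpha=1.0).fit(X_toy, y_toy)

print("Linear regression:", linear_model.coef_.round(3))
print("Lasso, alpha=1e-4:", weak_lasso.coef_.round(3))
print("Lasso, alpha=1.0: ", strong_lasso.coef_.round(3))
```

Tuning alpha is therefore a way of choosing how strongly to penalise, and effectively drop, uninformative features.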

In [None]:
from sklearn.linear_model import Lasso 
from sklearn.metrics import mean_squared_error

In [None]:
# Create model object
lasso_model = Lasso()

# Train model
lasso_model.fit(X_train, y_train)

# Predict values
y_pred = lasso_model.predict(X_test)

# Get the MSE score from the test and prediction data.
MSE_value = mean_squared_error(y_test, y_pred)

print(MSE_value)

We get a value for the MSE, but how do we know that it is good? We can try a range of other hyperparameters and check what values they produce.
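Another way to judge an MSE value is to compare it with a trivial baseline. The sketch below uses synthetic data and a `DummyRegressor` that always predicts the training mean; the data, the alpha value and the variable names are assumptions for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1234)

# Synthetic regression problem standing in for the real features and target.
X_toy = rng.normal(size=(300, 4))
y_toy = X_toy @ np.array([5.0, -3.0, 0.0, 2.0]) + rng.normal(scale=1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1234)

# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = Lasso(alpha=0.1).fit(X_tr, y_tr)

baseline_mse = mean_squared_error(y_te, baseline.predict(X_te))
model_mse = mean_squared_error(y_te, model.predict(X_te))

print("Baseline MSE:", baseline_mse)
print("Lasso MSE:", model_mse)
```

A model whose MSE is not clearly below the baseline has learned little beyond the average of the target.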

If we look at the documentation for the Lasso model [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) we can see that along with alpha, there are other hyperparameters which may impact the performance of our model.

We are going to search through the parameters:
* alpha
* max_iter (the maximum numbers of iterations the algorithm goes through)
* tol (the tolerance of the stopping condition)

I am going to use the randomised search method.
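Randomised search draws each candidate value from a distribution. For a hyperparameter whose plausible values span several orders of magnitude, a log uniform distribution spreads the draws evenly across the decades instead of concentrating them at the top of the range. A small sketch using `scipy.stats.loguniform`, with bounds chosen purely for illustration:

```python
import numpy as np
from scipy.stats import loguniform

# Draw samples between 1e-7 and 1e-1 from a log uniform distribution.
tol_samples = loguniform(1e-7, 1e-1).rvs(size=10000, random_state=0)

# Count how many samples fall in each of the six decades.
decade_counts, _ = np.histogram(np.log10(tol_samples), bins=6, range=(-7, -1))

print(decade_counts)
```

Each decade receives a similar share of the samples, which is what we want when we do not know the right order of magnitude in advance.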

In [None]:
from sklearn.model_selection import RandomizedSearchCV
# loguniform is no longer available from sklearn.utils.fixes; import it from SciPy.
from scipy.stats import loguniform

In [None]:
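# The range of our potential parameter values is large, therefore we sample
# alpha and tol from a log uniform distribution. max_iter must be an integer,
# so we sample it from a list of log-spaced integer values instead.
param_dist = {'alpha': loguniform(0.1, 1e4),
              'max_iter': [10**k for k in range(2, 9)],
              'tol': loguniform(1e-7, 1e-1)}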

# Create a new model
lasso_model_with_search = Lasso()

# Create a searcher; we can define how many iterations of searching we
# want with n_iter.
random_searcher = RandomizedSearchCV(lasso_model_with_search,
                                     param_dist,
                                     random_state=123,
                                     n_iter=2000,
                                     scoring='neg_mean_squared_error')

# Fit the search on the training set only, so that the test set stays unseen.
search = random_searcher.fit(X_train, y_train)
print("The best parameters are:\n", search.best_params_)

In [None]:
best_lasso = search.best_estimator_

# Train model
best_lasso.fit(X_train, y_train)

# Predict values
y_pred = best_lasso.predict(X_test)

# Get the MSE score from the test and prediction data.
MSE_value = mean_squared_error(y_test, y_pred)

print(MSE_value)

We have significantly improved our MSE value, making our model better. We were able to use a high value of "n_iter" when searching for optimal parameters because our model is computationally cheap. With more expensive models we may need to search more carefully over a smaller range.