# What is logistic regression?

Unlike linear regression, which predicts a numeric variable, logistic regression predicts the probability (between 0 and 1) of an event.

To use this as a classification algorithm, all you have to do is assign a cutoff point to these probabilities. If the predicted probability is above the cutoff point, you classify that observation as a 'yes' (in this case, the flight being late); if it's below, you classify it as a 'no'!
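
As a minimal sketch of the cutoff idea, the snippet below turns a raw score into a probability with the logistic function and then applies a cutoff. The function names, the example scores, and the 0.5 cutoff are illustrative assumptions, not part of the course code.

```python
import math

def logistic(score):
    """Map any real-valued score to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))

def classify(probability, cutoff=0.5):
    """Return 'yes' when the probability is above the cutoff, else 'no'."""
    return "yes" if probability > cutoff else "no"

# Hypothetical linear scores for three flights
scores = [-2.0, 0.0, 3.0]
probabilities = [logistic(s) for s in scores]
labels = [classify(p) for p in probabilities]
```

Raising the cutoff makes the classifier more reluctant to say 'yes'; lowering it does the opposite.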

You'll tune this model by testing different values for several hyperparameters. A hyperparameter is just a value in the model that's not estimated from the data, but rather is supplied by the user to maximize performance. For this course it's not necessary to understand the mathematics behind all of these values - what's important is that you'll try out a few different choices and pick the best one.

Why do you supply hyperparameters?
- They improve model performance.

- Import the LogisticRegression class from pyspark.ml.classification.
- Create a LogisticRegression called lr by calling LogisticRegression() with no arguments.

In [1]:
# # Import LogisticRegression
# from pyspark.ml.classification import LogisticRegression

# # Create a LogisticRegression Estimator
# lr = LogisticRegression()

# Cross validation

This is a method of estimating the model's performance on unseen data (like your test DataFrame).

It works by splitting the training data into a few different partitions. The exact number is up to you. Once the data is split up, one of the partitions is set aside, and the model is fit to the others. Then the error is measured against the held out partition. This is repeated for each of the partitions, so that every block of data is held out and used as a test set exactly once. Then the error on each of the partitions is averaged. This is called the cross validation error of the model, and is a good estimate of the actual error on the held out data.
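
The procedure described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the folds are contiguous rather than randomly assigned, and the "model" is a toy that predicts the mean of its training values.

```python
def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k contiguous folds of near-equal size."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validation_error(data, k, fit, error):
    """Hold out each fold once, fit on the rest, and average the held-out errors."""
    fold_errors = []
    for held_out in k_fold_indices(len(data), k):
        held_out_set = set(held_out)
        train = [row for i, row in enumerate(data) if i not in held_out_set]
        test = [data[i] for i in held_out]
        model = fit(train)
        fold_errors.append(error(model, test))
    return sum(fold_errors) / k

# Toy stand-ins: the "model" is the mean of the training values,
# and the error is the mean squared difference from that mean.
def fit_mean(train):
    return sum(train) / len(train)

def mean_squared_error(model, test):
    return sum((y - model) ** 2 for y in test) / len(test)

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cv_error = cross_validation_error(data, k=3, fit=fit_mean, error=mean_squared_error)
```

Every row lands in exactly one held-out fold, so each observation is used for testing once and for training k - 1 times.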


What does cross validation allow you to estimate?
- The model's error on held out data (test set).

- Import the submodule pyspark.ml.evaluation as evals.
- Create evaluator by calling evals.BinaryClassificationEvaluator() with the argument metricName="areaUnderROC".

In [1]:
# # Import the evaluation submodule
# import pyspark.ml.evaluation as evals

# # Create a BinaryClassificationEvaluator
# evaluator = evals.BinaryClassificationEvaluator(metricName="areaUnderROC")

- Import the submodule pyspark.ml.tuning under the alias tune.
- Call the class constructor ParamGridBuilder() with no arguments. Save this as grid.
- Call the .addGrid() method on grid with lr.regParam as the first argument and np.arange(0, .1, .01) as the second argument. This second argument is a call to a function from the numpy module (imported as np) that creates an array of numbers from 0 up to, but not including, .1, incrementing by .01. Overwrite grid with the result.
- Update grid again by calling the .addGrid() method a second time to create a grid for lr.elasticNetParam that includes only the values [0, 1].
- Call the .build() method on grid and overwrite it with the output.

In [2]:
# # Import the tuning submodule
# import pyspark.ml.tuning as tune

# # Create the parameter grid
# grid = tune.ParamGridBuilder()

# # Add the hyperparameters
# grid = grid.addGrid(lr.regParam, np.arange(0, .1, .01))
# grid = grid.addGrid(lr.elasticNetParam, [0, 1])

# # Build the grid
# grid = grid.build()
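
To make the size of the search concrete, here is a small sketch of what a grid over these two hyperparameters amounts to: every combination of the candidate values. It uses plain Python lists and dictionaries as stand-ins, so the variable names and the dictionary layout are assumptions for illustration and not the objects that ParamGridBuilder actually returns.

```python
from itertools import product

# Stand-in for np.arange(0, .1, .01): ten values 0.00, 0.01, ..., 0.09
# (the end point .1 is excluded).
reg_param_values = [round(i * 0.01, 2) for i in range(10)]
elastic_net_values = [0, 1]

# Every (regParam, elasticNetParam) pair becomes one candidate setting.
param_grid = [
    {"regParam": reg, "elasticNetParam": en}
    for reg, en in product(reg_param_values, elastic_net_values)
]

print(len(param_grid))  # 20
```

With 20 candidate settings, a k-fold cross validation has to fit roughly 20 × k models, which is why grid searches become expensive quickly.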

- Create a CrossValidator by calling tune.CrossValidator() with the arguments:
  - estimator=lr
  - estimatorParamMaps=grid
  - evaluator=evaluator
- Name this object cv.

In [3]:
# # Create the CrossValidator
# cv = tune.CrossValidator(estimator=lr,
#                estimatorParamMaps=grid,
#                evaluator=evaluator
#                )

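- Create best_lr by calling lr.fit() on the training data. Note that this fits lr with its current (default) hyperparameters; it does not run the cross-validated grid search defined in cv.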
- Print best_lr to verify that it's an object of the LogisticRegressionModel class.

In [4]:
# # Call lr.fit()
# best_lr = lr.fit(training)

# # Print best_lr
# print(best_lr)

# Evaluating binary classifiers

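A common metric for binary classifiers is the AUC, the area under the receiver operating characteristic (ROC) curve, which is what metricName="areaUnderROC" selects. The closer the AUC is to one (1), the better the model is!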

If you've created a perfect binary classification model, what would the AUC be?
- 1
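
One way to build intuition for the AUC is through an equivalent definition: it is the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes it that way on made-up labels and scores. The function name and data are hypothetical, and this is not how the Spark evaluator computes the value internally.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive is scored higher.

    Ties count as half a correct ordering.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
# Every positive outranks every negative
perfect_auc = auc(labels, [0.1, 0.2, 0.8, 0.9])
# One of the four (positive, negative) pairs is ordered wrongly
mixed_auc = auc(labels, [0.1, 0.8, 0.2, 0.9])
```

Here perfect_auc is 1.0 and mixed_auc is 0.75; a model that scores at random would land near 0.5.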

- Use your model to generate predictions by applying best_lr.transform() to the test data. Save this as test_results.
- Call evaluator.evaluate() on test_results to compute the AUC. Print the output.

In [5]:
# # Use the model to predict the test set
# test_results = best_lr.transform(test)

# # Evaluate the predictions
# print(evaluator.evaluate(test_results))