Exercise 5 - Logistic Regression
=====
Logistic regression predicts binary (yes/no) events. For example, we may want to predict if someone will arrive at work on time, or if a person shopping will buy a product. 

This exercise will demonstrate simple logistic regression: predicting an outcome from only one feature.

Step 1
-----

We want to place a bet on the outcome of the next football (soccer) match. It is the final of a competition, so there will not be a draw. We have historical data about our favourite team playing in matches such as this. Complete the exercise below to see this data.

In [None]:
# Sets up the graphing configuration
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as graph
%matplotlib inline
graph.rcParams['figure.figsize'] = (15,5)
graph.rcParams["font.family"] = 'DejaVu Sans'
graph.rcParams["font.size"] = '12'
graph.rcParams['image.cmap'] = 'rainbow'

In [None]:
import pandas as pd

###--- REPLACE ??? BELOW WITH 'Data/football data.txt' (INCLUDING THE QUOTES) TO LOAD THE DATA FROM THAT FILE ---###
data = pd.read_csv(???, index_col = False, sep = '\t', header = 0)
###

###--- WRITE print(data.head()) BELOW TO PREVIEW THE DATA ---###

###

Each row of this data describes one season. The left column, `average_goals_per_match`, is our team's average goals per match in that season. The right column, `won_competition`, is 1 if our team won the competition that season and 0 if they did not.
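
To make the file layout concrete, here is a minimal sketch that reads a tab-separated table with the same two column names. The four rows are invented for illustration and are not taken from `Data/football data.txt`.

```python
import io
import pandas as pd

# Hypothetical rows, invented for illustration only (not the real data file)
sample_text = (
    "average_goals_per_match\twon_competition\n"
    "0.8\t0\n"
    "1.4\t0\n"
    "2.1\t1\n"
    "2.6\t1\n"
)

# Same read_csv options as the exercise, but reading from an in-memory string
sample = pd.read_csv(io.StringIO(sample_text), index_col=False, sep="\t", header=0)

print(sample.head())
# Mean goal average for seasons that were lost (0) and won (1)
print(sample.groupby("won_competition")["average_goals_per_match"].mean())
```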

Step 2
----
Let's graph the data so we have a better idea of what's going on here. Complete the exercise below to make an x-y scatter plot.

In [None]:
###--- REPLACE ??? BELOW WITH 'won_competition' (INCLUDING THE QUOTES) ---###
y = data[???]
###

###--- REPLACE ??? BELOW WITH 'average_goals_per_match' (INCLUDING THE QUOTES) ---###
x = data[???]
###

# The 'won_competition' will be displayed on the vertical axis (y axis)
# The 'average_goals_per_match' will be displayed on the horizontal axis (x axis)

graph.scatter(x, y, c = y, marker = 'D')

graph.yticks([0, 1], ['No', 'Yes'])
graph.ylabel("Competition Win")
graph.ylim([-0.5, 1.5])
graph.xlabel("Average number of goals scored per match")

graph.show()

We can see from this graph that, in general, our team tends to win the competition in seasons where its average goals per match is high.

Step 3
----
So, let's apply AI to this problem. We'll make a logistic regression model using this data and then graph it. This will tell us whether we are likely to win this season.

Complete the exercise below to make the logistic regression model.

In [None]:
import numpy as np
from sklearn import linear_model

# Here we build a logistic regression model

###--- REPLACE ??? BELOW WITH linear_model.LogisticRegression() TO BUILD A LOGISTIC REGRESSION MODEL ---###
clf = ???
###

# This step fits (calculates) the model
# We are using our feature (x - number of goals scored) and our outcome/label (y - won/lost)
# scikit-learn expects a 2D feature array, so we convert x to NumPy and add a column axis
clf.fit(x.values[:, np.newaxis], y)

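# This works out the model's predicted probability of winning along the curve
# (the variable is called 'loss', but it holds probabilities, not a loss value)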
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
X_test = np.linspace(0, 3, 300)
loss = sigmoid(X_test * clf.coef_ + clf.intercept_).ravel()

Alright, that's the model done. Now run the code below to graph it.

In [None]:
# This makes the graph
# The data points
graph.scatter(x, y, c = y, marker = 'D')
# The curve
graph.plot(X_test, loss, color = 'gold', linewidth = 3)
# Define the y-axis
graph.yticks([0, 1], ['No = 0.0', 'Yes = 1.0'])
graph.ylabel("Competition Win Likelihood")
graph.xlabel("Average number of goals per match")
graph.show()

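We now have a curve fitted to our data. This yellow curve is our logistic regression model: for each goal average on the x-axis, it gives the estimated probability of winning on the y-axis.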
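
The curve is a sigmoid applied to a straight line. As a rough sketch, assuming a hypothetical coefficient of 3.0 and intercept of -6.0 (made-up values, not the ones fitted above), the win probability for a given goal average can be computed by hand:

```python
import math

# Hypothetical parameters for illustration; a fitted model stores its own
# values in clf.coef_ and clf.intercept_
coef = 3.0
intercept = -6.0

def win_probability(avg_goals):
    """Sigmoid of the linear score coef * avg_goals + intercept."""
    score = coef * avg_goals + intercept
    return 1 / (1 + math.exp(-score))

for goals in (1.0, 2.0, 3.0):
    print(goals, round(win_probability(goals), 3))
```

With these made-up values the probability is exactly 0.5 where the score is zero, at 2.0 goals per match. Below that the model leans towards a loss, above it towards a win.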

Step 4
------

We can read the model above like so:
* Take the average number of goals per match for the current year. Let's say it is 2.5.
* Find 2.5 on the x-axis. 
* What value (on the y axis) does the line have at x=2.5?
* If this value is above 0.5, then the model thinks our team will win this year. If it is less than 0.5, it thinks our team will lose.

Because this line is just a mathematical function (equation) we don't have to do this visually.
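
As a small sketch of doing this in code, the example below fits a model on ten invented seasons (not the exercise data) and checks that thresholding the probability at 0.5 gives the same answer as the model's own `predict` method.

```python
import numpy as np
from sklearn import linear_model

# Invented toy data for illustration only
toy_goals = np.array([0.5, 0.8, 1.0, 1.2, 1.5, 1.8, 2.0, 2.2, 2.5, 2.8])
toy_won = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

toy_clf = linear_model.LogisticRegression()
# scikit-learn expects a 2D feature array: one row per season, one column per feature
toy_clf.fit(toy_goals.reshape(-1, 1), toy_won)

query = [[2.5]]  # one season with an average of 2.5 goals per match
prob_win = toy_clf.predict_proba(query)[0][1]  # column 1 is the probability of class 1 (a win)
label_from_threshold = int(prob_win > 0.5)
label_from_predict = int(toy_clf.predict(query)[0])

print("Probability of a win:", prob_win)
print("Thresholded label:", label_from_threshold, "predict() label:", label_from_predict)
```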

In the exercise below, __choose the average number of goals per match you want to evaluate__.

The code will calculate the probability that our team will win the competition with your chosen goal average.

In [None]:
###--- REPLACE ??? BELOW WITH THE AVERAGE NUMBER OF GOALS PER MATCH THIS YEAR. USE ANY NUMBER BETWEEN 0 AND 3 ---###
p = ???
###

# Next we're going to use our model again - clf is the name of our model.
# We'll use a method to predict the probability of a positive result
# Use the variable p which we just made in this method.

###--- REPLACE ??? BELOW WITH p TO PREDICT USING THIS VALUE ---###
# predict_proba expects a 2D array (samples x features), hence the double brackets
probOfWinning = clf.predict_proba([[???]])[0][1]
###

# This prints out the result
print("Probability of winning this year")
print(str(probOfWinning * 100) + "%")

# This plots the result
graph.scatter(x, y, c = y, marker = 'D')
graph.yticks([0, probOfWinning, 1], ['No = 0.0', round(probOfWinning,3), 'Yes = 1.0'])
graph.plot(X_test, loss, color = 'gold', linewidth = 3)

graph.plot(p, probOfWinning, 'ko') # result point
graph.plot(np.linspace(0, p, 2), np.full([2],probOfWinning), dashes = [6, 3], color = 'black') # dashed lines (to y-axis)
graph.plot(np.full([2],p), np.linspace(0, probOfWinning, 2), dashes = [6, 3], color = 'black') # dashed lines (to x-axis)

graph.ylabel("Competition Win Likelihood")
graph.xlabel("Average number of goals per match")
graph.show()

Well done! We have calculated the likelihood that our team will win this year's competition.

Optional: Step 5
-----
Of course, these predictions come from only one model.

Let's return to what we did in Step 3, but we'll replace `linear_model.LogisticRegression()` with `linear_model.LogisticRegression(C=200)`. `C` is the inverse of the regularization strength, so a larger value regularizes the model less and lets it fit a steeper curve around the decision boundary. Then repeat Step 4 with this model. Did your results change?

There are methods we can use to choose sensible parameters for many models. This is currently outside the scope of this course, but it is important to remember that a model is only as good as the data we give it, the parameters we choose, and the assumptions we make.
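
To see the effect of `C` in isolation, here is a hedged sketch on ten invented seasons (not the exercise data) that fits the model with the default `C` and with `C=200`, then compares the fitted coefficients, which set how steep the curve is.

```python
import numpy as np
from sklearn import linear_model

# Invented toy data for illustration only
toy_goals = np.array([0.5, 0.8, 1.0, 1.2, 1.5, 1.8, 2.0, 2.2, 2.5, 2.8]).reshape(-1, 1)
toy_won = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# Default settings use C=1.0
default_clf = linear_model.LogisticRegression().fit(toy_goals, toy_won)
# A larger C means weaker regularization
weak_reg_clf = linear_model.LogisticRegression(C=200).fit(toy_goals, toy_won)

# The coefficient controls how steep the sigmoid curve is
coef_default = default_clf.coef_[0][0]
coef_weak_reg = weak_reg_clf.coef_[0][0]
print("Coefficient with C=1:  ", coef_default)
print("Coefficient with C=200:", coef_weak_reg)
```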

In [None]:
# Repeating Step 3 and Step 4 using linear_model.LogisticRegression(C=200)

import numpy as np
from sklearn import linear_model

###--- REPLACE THE ??? WITH THE AVERAGE NUMBER OF GOALS PER MATCH YOU WANT TO EVALUATE ---###
p = ???
###

# Here we build the new logistic regression model.

###--- REPLACE ??? BELOW WITH linear_model.LogisticRegression(C=200) TO BUILD A LOGISTIC REGRESSION MODEL ---###
clf = ???
###

# This step fits (calculates) the model
# We are using our feature (x - number of goals scored) and our outcome/label (y - won/lost)
# scikit-learn expects a 2D feature array, so we convert x to NumPy and add a column axis
clf.fit(x.values[:, np.newaxis], y)

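# This works out the model's predicted probability of winning along the curve
# (the variable is called 'loss', but it holds probabilities, not a loss value)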
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
X_test = np.linspace(0, 3, 300)
loss = sigmoid(X_test * clf.coef_ + clf.intercept_).ravel()

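# This makes the prediction for your chosen average number of goals per match.
# predict_proba expects a 2D array (samples x features), hence the double brackets
probOfWinning = clf.predict_proba([[p]])[0][1]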

# This prints out the result.
print("Probability of winning this year")
print(str(probOfWinning * 100) + "%")

# This plots the result.
graph.scatter(x, y, c = y, marker = 'D')
graph.yticks([0, probOfWinning, 1], ['No = 0.0', round(probOfWinning,3), 'Yes = 1.0'])
graph.plot(X_test, loss, color = 'gold', linewidth = 3)

graph.plot(p, probOfWinning, 'ko') # result point
graph.plot(np.linspace(0, p, 2), np.full([2],probOfWinning), dashes = [6, 3], color = 'black') # dashed lines (to y-axis)
graph.plot(np.full([2],p), np.linspace(0, probOfWinning, 2), dashes = [6, 3], color = 'black') # dashed lines (to x-axis)

graph.ylabel("Competition Win Likelihood")
graph.xlabel("Average number of goals per match")
graph.show()