#### Introduction and Base Table Structure

predictive analytics aims to predict an event using historical data
that data is gathered in the analytical basetable
in Python it's usually stored in a pandas dataframe
the 3 important concepts in the analytical basetable are:
- the population, the group of people or objects you want to make a prediction for (the donors that are in scope of receiving a letter), the basetable has one row for each object in the population
- the candidate predictors, which describe the objects in the population, info that can be used to predict the event (variables like age, gender, or previous gift could be used to predict whether someone will donate for a future project)
- and the target, information about the event to predict, it's 1 if the event occurs and 0 otherwise
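
to make the three concepts concrete, here is a minimal sketch of a basetable built by hand. the donors, the column names (`donor_id`, `age`, `gender_F`, `previous_gift`, `target`) and all values are made up for illustration, a real basetable would be loaded from your own data

```python
import pandas as pd

# Made-up example data: one row per donor in the population.
basetable = pd.DataFrame(
    {
        "donor_id": [1, 2, 3, 4, 5],           # identifies each object in the population
        "age": [34, 72, 51, 45, 66],           # candidate predictor
        "gender_F": [1, 1, 0, 0, 1],           # candidate predictor (1 = female)
        "previous_gift": [0, 120, 35, 0, 80],  # candidate predictor
        "target": [0, 1, 0, 0, 1],             # 1 if the donor donated, 0 otherwise
    }
)

# The population is the set of rows.
population_size = len(basetable)

# The candidate predictors are the descriptive columns.
candidate_predictors = ["age", "gender_F", "previous_gift"]

# The target is a 0/1 column, so its mean is the share of donors who donated.
target_incidence = basetable["target"].mean()

print(population_size, target_incidence)
```

because the target only holds 0s and 1s, taking its mean is a shortcut for dividing the number of targets by the population size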

predictive analytics can, for example, help you determine which donors are most likely to donate (instead of sending letters to all donors, you could target those ones specifically)

the basetable is built from historical data
you look at a similar event in the past, like an earlier fundraising campaign, and construct the basetable from the data that was available at that time: the target is whether the donor donated for the historical campaign, and the candidate predictors are derived from the information that was available before it
a predictive model is then constructed that links the candidate predictors to the target in the basetable
this model can then be used to predict the current event: the candidate predictors are known now and are used as the input for the model
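
a rough sketch of how such a historical basetable could be put together. the `gifts` table, the campaign dates and the `number_gifts` column are assumptions made up for this example, the point is only that predictors come from before the campaign and the target from during it

```python
import pandas as pd

# Hypothetical gift history; in practice this would come from your own donation records.
gifts = pd.DataFrame(
    {
        "donor_id": [1, 1, 2, 3, 3, 3],
        "date": pd.to_datetime(
            ["2016-03-01", "2017-06-10", "2016-11-20",
             "2015-01-05", "2016-12-24", "2017-05-30"]
        ),
        "amount": [20, 50, 100, 10, 15, 25],
    }
)

# Assumed time window of the historical campaign.
campaign_start = pd.Timestamp("2017-05-01")
campaign_end = pd.Timestamp("2017-08-01")

# Candidate predictors: only use gifts made before the campaign started.
before = gifts[gifts["date"] < campaign_start]
basetable = before.groupby("donor_id")["amount"].agg(
    max_gift="max", number_gifts="count"
)

# Target: 1 if the donor gave during the campaign window, 0 otherwise.
during = gifts[(gifts["date"] >= campaign_start) & (gifts["date"] < campaign_end)]
basetable["target"] = basetable.index.isin(during["donor_id"]).astype(int)

print(basetable)
```

keeping the predictors strictly before the campaign start matters: a predictor that already contains information from the campaign itself would not be available when you predict the current event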

In [None]:
# the analytical basetable
import pandas as pd

# read the csv file into a dataframe (pd.DataFrame() does not load files)
basetable = pd.read_csv("import_basetable.csv")

# check the size of the population, how many rows?
population_size = len(basetable)

# count the number of targets (rows where the target column is 1)
targets = sum(basetable["target"])

# target incidence, the ratio of targets and population
print(targets / population_size)

# Count and print the number of females.
print(sum(basetable["gender"] == "F"))

#### Logistic Regression

logistic regression is a widely used modeling technique

here we'll learn how logistic regression predicts the target from candidate predictors and how to use logistic regression

the output of a linear regression can be any real number, but here you want a probability as the output: a value between 0 (won't donate) and 1 (will donate) that expresses how likely it is that someone will donate

logistic regression takes a regression formula as input and calculates a probability from it, this is a mathematical trick that lets you use linear regression for binary classification problems

logistic regression, the logistic function (the inverse of the logit):
- the output of a*age+b is a real number
- we want to predict a 0 or 1
- the logistic function 1/(1+exp(-z)) transforms z = a*age+b to a probability between 0 and 1
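
the transformation in the list above is usually written as the logistic (sigmoid) function, 1/(1+exp(-z)), which is the inverse of the logit. a minimal sketch of it, where the coefficients `a` and `b` are hypothetical values picked only for illustration and not taken from a fitted model

```python
import math


def logistic(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, chosen only for illustration.
a = 0.05
b = -4.0

for age in (20, 50, 80):
    z = a * age + b            # linear part: can be any real number
    print(age, z, logistic(z))  # logistic part: always between 0 and 1
```

with a positive `a`, a higher age gives a higher z and therefore a higher probability, and z = 0 corresponds to a probability of exactly 0.5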


In [None]:
# build a logistic regression model
from sklearn import linear_model 

# create a logistic regression model object 
logreg = linear_model.LogisticRegression()

# feed data to the logistic regression model so it can be fit
X = basetable[["age"]] # predictor
y = basetable["target"] # target, selected as a 1-d Series because that is the shape fit() expects
logreg.fit(X, y)

# the model is fit, so now look at the coefficient that corresponds with the predictor age by checking the coef_ attribute of the fitted model
print(logreg.coef_)

# to derive the entire formula from the fitted model, retrieve the intercept value as well
print(logreg.intercept_)

# so far we've assumed that there was only one predictor but there are many candidate predictors available in the basetable 
# extending univariate logistic regression to multivariate logistic regression is pretty easy, instead of ax+b you can add
# multiple predictors in the formula, for python nothing will change except you'll select multiple variables in the X object

# multivariate
X = basetable[["age", "max_gift", "income_low"]] # predictors
y = basetable["target"] # target, selected as a 1-d Series because that is the shape fit() expects
logreg.fit(X, y)
# outputting the coefficients will show that for each predictor used, a coefficient has been calculated
print(logreg.coef_)
# a positive coefficient means the predicted probability goes up as that predictor increases, a negative coefficient means it goes down
# assign the intercept to the variable intercept
intercept = logreg.intercept_
print(intercept)

#### Using the Logistic Regression Model

the above is how to build a logistic regression model and here we'll learn how to make predictions with it 

a logistic regression model is a linear regression formula wrapped in a logistic function (the inverse of the logit), so all you need to do is fill in the predictor values you want (female, age 72, 120 days since the last donation) and put the result through that function, but luckily we don't have to calculate all this by hand :)
if you collect these values in a list (gender, age, days), in the same order as the predictors the model was fit on, you can calculate the prediction by passing that list, wrapped in another list so it forms a single row, to the predict_proba method of the logreg object
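
to see that predict_proba really does nothing more than the formula described above, here is a small sketch that does the calculation by hand and compares it with scikit-learn. the data is synthetic and the rule used to generate the target is made up, so only the comparison matters, not the numbers themselves

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic data, generated only so that the example runs on its own.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame(
    {
        "gender_F": rng.integers(0, 2, n),
        "age": rng.integers(18, 90, n),
        "time_since_last_gift": rng.integers(1, 1000, n),
    }
)
# Made-up rule so that the target contains both 0s and 1s.
y = (X["age"] + rng.normal(0, 15, n) > 60).astype(int)

logreg = linear_model.LogisticRegression(max_iter=1000)
logreg.fit(X, y)

# One new donor: female, age 72, 120 days since the last gift (same column order as X).
donor = pd.DataFrame([[1, 72, 120]], columns=X.columns)

# By hand: the linear formula first, then the logistic function.
z = logreg.intercept_[0] + np.dot(logreg.coef_[0], donor.iloc[0])
by_hand = 1 / (1 + np.exp(-z))

# With scikit-learn: column 1 holds the probability of target 1.
by_model = logreg.predict_proba(donor)[0, 1]

print(by_hand, by_model)
```

the two printed values should agree up to floating point rounding, which is why the order of the values in the row has to match the order of the columns used in fit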

In [None]:
# making predictions
# predict_proba expects a 2-d input with one row per donor, hence the double brackets
logreg.predict_proba([[1, 72, 120]])
# the output will be an array with 2 numbers: the first one is the probability that the donor will not donate, target 0
# the second one is the probability that the donor will donate, target 1
# you could then compare this to the probability that someone in the overall population donates (about 5%), a prediction well above that marks this 72-year-old donor as a strong candidate

In [None]:
# making predictions for a large group of people
# to decide which donors to send a letter to you could make predictions for the entire population and then send letters
# to the donors with the highest probabilities only
new_data = current_data[["gender_F", "age", "time_since_last_gift"]]
predictions = logreg.predict_proba(new_data)

In [None]:
# example exercise
# Fit a logistic regression model
from sklearn import linear_model
X = basetable[["age","gender_F","time_since_last_gift"]]
y = basetable["target"]
logreg = linear_model.LogisticRegression()
logreg.fit(X, y)

# Create a dataframe new_data from current_data that has only the relevant predictors 
new_data = current_data[["age", "gender_F", "time_since_last_gift"]]

# Make a prediction for each observation in new_data and assign it to predictions
predictions = logreg.predict_proba(new_data)
print(predictions[0:5])

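# predict_proba returns a NumPy array, so put the probability of target 1 (column 1) in a dataframe, then sort by that column
predictions_sorted = pd.DataFrame({"probability": predictions[:, 1]}, index=new_data.index).sort_values("probability")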

# Print the row of predictions_sorted that has the donor that is most likely to donate
print(predictions_sorted.tail(1))
# head() would give the donor that is least likely to donate 