# Learning and fitting
Data Mining for Business Analytics

Robert Moakler, Spring 2016
***

We start off as we usually do. Let's import some things that will be useful.

In [None]:
# Import pandas to read in data
import pandas as pd

# Import matplotlib for plotting
import matplotlib.pyplot as plt
%matplotlib inline

# Import decision trees and logistic regression
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Import train, test, and evaluation functions
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

### Data
We're going to use a mail response data set from a real direct marketing campaign. It is stored in `data/mailing.csv`. Each record represents an individual who was targeted with a direct marketing offer: a solicitation to make a charitable donation.

The columns (features) are:

```
Col.  Name      Description
----- --------- ----------------------------------------------------------------
1     income    household income
2     Firstdate date associated with the first gift by this individual
3     Lastdate  date associated with the most recent gift
4     Amount    average amount by this individual over all periods (incl. zeros)
5     rfaf2     frequency code
6     rfaa2     donation amount code
7     pepstrfl  flag indicating a star donor
8     glast     amount of last gift
9     gavr      amount of average gift
10    class     one if they gave in this campaign and zero otherwise.
```

Our goal is to build a model to predict if people will give during the current campaign (this is the attribute called `"class"`).

Let's read our data in and put the target variable in `Y` and all the other features in `X`.

In [None]:
# Read data using pandas
data = pd.read_csv("data/mailing.csv")

# Split into X and Y
X = data.drop(columns=['class'])
Y = data['class']
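Before modeling, it can help to confirm that the split into features and target did what we expect, and to see how common the positive class is. Here is a minimal sketch on a tiny made-up frame. The names `toy`, `toy_X`, and `toy_Y` and all of the values are invented for illustration and are not taken from `data/mailing.csv`.

```python
import pandas as pd

# A tiny made-up frame standing in for the mailing data (values are invented)
toy = pd.DataFrame({
    "income": [3, 5, 2, 4],
    "glast": [10.0, 25.0, 5.0, 15.0],
    "class": [0, 1, 0, 0],
})

# Same split as above: everything except "class" is a feature
toy_X = toy.drop(columns=["class"])
toy_Y = toy["class"]

# Number of rows and feature columns
print(toy_X.shape)

# Share of each class in the target
print(toy_Y.value_counts(normalize=True))
```

Running the same two `print` calls on `X` and `Y` shows how many rows and features the real data has and what fraction of people actually gave.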

### Learning curve

Let's create a decision tree (using entropy as the split criterion), fit it on all of our data, and then use it to predict that same data.

In [None]:
# Create an empty, unlearned tree
tree = DecisionTreeClassifier(criterion="entropy")

# Fit/train the tree
tree.fit(X, Y)

# Get a prediction
Y_predicted = tree.predict(X)

# Get the accuracy of this prediction (true labels first, then predictions)
accuracy = accuracy_score(Y, Y_predicted)

# Print the accuracy
print("The accuracy is " + str(accuracy))

That's a pretty high accuracy, but we measured it on the same data we trained on. We might be overfitting: the model may have "memorized" where all the points are instead of learning patterns. Models like that do not generalize well to new data.
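To see what "memorizing" looks like, here is a minimal sketch on synthetic data (not the mailing data). The labels are coin flips, so there is nothing real to learn. The names `rng`, `X_noise`, `Y_noise`, and `noise_tree` are made up for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic data: 1000 rows, 5 random features, labels that are coin flips
rng = np.random.RandomState(0)
X_noise = rng.rand(1000, 5)
Y_noise = rng.randint(0, 2, size=1000)

# Train on the first 700 rows and hold out the last 300
X_noise_train, X_noise_test = X_noise[:700], X_noise[700:]
Y_noise_train, Y_noise_test = Y_noise[:700], Y_noise[700:]

# With no depth limit, the tree keeps splitting until every leaf is pure
noise_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
noise_tree.fit(X_noise_train, Y_noise_train)

train_accuracy = accuracy_score(Y_noise_train, noise_tree.predict(X_noise_train))
test_accuracy = accuracy_score(Y_noise_test, noise_tree.predict(X_noise_test))

print("Accuracy on the data the tree was trained on: " + str(train_accuracy))
print("Accuracy on held-out data: " + str(test_accuracy))
```

The tree should fit the training rows perfectly even though the labels are random, while the held-out accuracy should sit near a coin flip. A high accuracy on training data alone tells us very little.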

We can create training and testing sets very easily. Here we will create train and test sets of `X` and `Y` where we assign 70% of our data to training.

In [None]:
# Split X and Y into training and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.70)
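`train_test_split` shuffles the rows at random, so each run gives a different split and a slightly different accuracy. Here is a minimal sketch of two optional arguments: `random_state` makes the split repeatable, and `stratify` keeps the share of each class the same in both parts. The arrays `X_toy` and `Y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20 of which are positive
X_toy = np.arange(200).reshape(100, 2)
Y_toy = np.array([1] * 20 + [0] * 80)

# random_state fixes the shuffle; stratify keeps the share of positives equal
X_toy_train, X_toy_test, Y_toy_train, Y_toy_test = train_test_split(
    X_toy, Y_toy, train_size=0.70, random_state=42, stratify=Y_toy)

print("Training rows: " + str(len(X_toy_train)))
print("Test rows: " + str(len(X_toy_test)))
print("Share of positives in train: " + str(Y_toy_train.mean()))
print("Share of positives in test: " + str(Y_toy_test.mean()))
```

Whether to fix the seed is a judgment call: a fixed seed makes results repeatable, while leaving it out lets you see how much the accuracy moves from split to split.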

Now, let's look at the same decision tree but fit it with our training data and test it on our testing data.

In [None]:
# Create an empty, unlearned tree
tree = DecisionTreeClassifier(criterion="entropy")

# Fit/train the tree on the training data
tree.fit(X_train, Y_train)

# Get a prediction from the tree on the test data
Y_test_predicted = tree.predict(X_test)

# Get the accuracy of this prediction (true labels first, then predictions)
accuracy = accuracy_score(Y_test, Y_test_predicted)

# Print the accuracy
print("The accuracy is " + str(accuracy))

That's a pretty big difference! Which accuracy do you "trust" more? Why?

### Team work
I would like you to try a few different training data set sizes and a logistic regression model. Try assigning a few different percentages and check what the accuracy is. To show these results, generate a plot that has percentages on the x-axis and accuracies on the y-axis. Here is some code to get you started.

Please work with your teams or neighbors to finish this!

In [None]:
# Here are some percentages to get you started. Feel free to try more!
training_percentages = [0.10, 0.20, 0.60, 0.80]
accuracies = []

for training_percentage in training_percentages:
    # Here I am training on 70%. What should I change this to so that I can try many percentages?
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.70)

    # This will create an empty logistic regression
    logistic = LogisticRegression()
    
    # This will fit/train your logistic regression
    logistic.fit(...)
    
    # This will get predictions
    Y_test_predicted = logistic.predict(...)
    
    # With these predictions we can get an accuracy. Where should we store this accuracy?
    accuracy_score(..., ...)

# We want to plot our results. What list should we use for the x-axis? What about the y-axis?
plt.plot(what-are-your-xs?, what-are-your-ys?)
plt.show()

### Fitting curve

In [None]:
# Let's fix our training data size at 80%
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80)

# Let's try different max depths for a decision tree
max_depths = range(1, 10)
accuracies = []

for max_depth in max_depths:
    # This will create an empty decision tree at a specified max depth
    tree = DecisionTreeClassifier(max_depth=max_depth)
    
    # This will fit/train your tree
    tree.fit(X_train, Y_train)
    
    # This will get the test accuracy and keep track of it
    Y_test_predicted = tree.predict(X_test)
    accuracies.append(accuracy_score(Y_test, Y_test_predicted))

# We want to plot our results
plt.plot(max_depths, accuracies)
plt.ylabel("Accuracy")
plt.xlabel("Max depth (model complexity)")
plt.show()
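The plot above tracks test accuracy only. One way to make overfitting visible is to record the training accuracy at each depth as well. Here is a minimal sketch of that idea on synthetic data from `make_classification`, which is an assumption standing in for the mailing data. All names ending in `_syn` and the list names are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data; flip_y=0.1 assigns random labels to about 10% of rows
X_syn, Y_syn = make_classification(n_samples=2000, n_features=10, n_informative=4,
                                   flip_y=0.1, random_state=0)
X_syn_train, X_syn_test, Y_syn_train, Y_syn_test = train_test_split(
    X_syn, Y_syn, train_size=0.80, random_state=0)

depths = list(range(1, 10))
train_accuracies = []
test_accuracies = []

for depth in depths:
    # Fit a tree with this depth limit on the training data only
    syn_tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    syn_tree.fit(X_syn_train, Y_syn_train)

    # Record accuracy on both the training data and the held-out data
    train_accuracies.append(accuracy_score(Y_syn_train, syn_tree.predict(X_syn_train)))
    test_accuracies.append(accuracy_score(Y_syn_test, syn_tree.predict(X_syn_test)))

for depth, train_acc, test_acc in zip(depths, train_accuracies, test_accuracies):
    print("depth " + str(depth) + ": train " + str(train_acc) + ", test " + str(test_acc))
```

Both lists can be drawn on the same axes with two `plt.plot` calls. A training curve that keeps climbing while the test curve flattens or drops is the usual sign that the extra depth is fitting noise.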