# Are you mad enough to sell more clothes?

<img src="http://www.tabletmag.com/wp-content/files_mf/menken620.jpg"/>

In [5]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd

import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

In [22]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
def cv_optimize(clf, parameters, X, y, n_jobs=1, n_folds=5, score_func=None):
    if score_func:
        gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds, n_jobs=n_jobs, scoring=score_func)
    else:
        gs = GridSearchCV(clf, param_grid=parameters, n_jobs=n_jobs, cv=n_folds)
    gs.fit(X, y)
    print("BEST", gs.best_params_, gs.best_score_)
    best = gs.best_estimator_
    return best
def do_classify(clf, parameters, indf, featurenames, targetname, target1val,mode="mask", reuse_split=None, score_func=None, n_folds=5, n_jobs=1):
    """
    Classification made simple (or is it more complex?)
    THIS WORKS FOR 2-class classification problems only
    parameters: parameter grid in the sklearn style
    indf: dataframe you feed in
    featurenames: list of column names corresponding to features you want in your model
    targetname: the column you want to use as target
    target1val: the value of the "targetname" column that is coded as the positive class (1)
    mode: mask or split. mask: a boolean mask to choose train/test, or
        split: a dictionary with keys Xtrain/Xtest/ytrain/ytest and values existing
        training and test sets in the canonical form
    reuse_split: the actual mask above or the actual dictionary, depending upon which
        mode you chose
    score_func: this is passed to GridSearchCV as scoring
    n_folds: cross-validation folds
    n_jobs: number of processes to use in cross-validation

    We return the classifier, and the train and test sets. We print accuracies
    and the confusion matrix
    """
    subdf=indf[featurenames]
    X=subdf.values
    y=(indf[targetname].values==target1val)*1
    if mode=="mask":
        print("using mask")
        mask=reuse_split
        Xtrain, Xtest, ytrain, ytest = X[mask], X[~mask], y[mask], y[~mask]
    else:
        print("using reuse split")
        Xtrain, Xtest, ytrain, ytest = reuse_split['Xtrain'], reuse_split['Xtest'], reuse_split['ytrain'], reuse_split['ytest']
    if parameters:
        clf = cv_optimize(clf, parameters, Xtrain, ytrain, n_jobs=n_jobs, n_folds=n_folds, score_func=score_func)
    clf=clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print("############# based on standard predict ################")
    print("Accuracy on training data: %0.2f" % (training_accuracy))
    print("Accuracy on test data:     %0.2f" % (test_accuracy))
    print(confusion_matrix(ytest, clf.predict(Xtest)))
    print("########################################################")
    return clf, Xtrain, ytrain, Xtest, ytest

The data set is from a fairly high end clothing chain store in the North East.

You are a data analyst for this store. Your job is to write a report to the pointy-haired boss in which you show how you can increase the store's profit by being targeted about whom to send a catalog in the mail. Yes, you are in direct marketing. You are a quant amongst the "mad men". 

You need to explore and lay out, in simple terms, what the business needs to spend to increase its profit. In other words, you need a budget, and it's your job to figure out how much as well.

We'll guide you through the process. There is much more you can explore, of course, but this homework will walk you through an entire real world classification and analysis process with a finite amount of work and computer runtime.

You will

1. clean the data, and create some features, learning about how to standardize the data
2. write a classifier on this data, including cross validation, and learn to write a function to encapsulate this process
3. learn how to compare this classifier to baseline classifiers that you had better beat, using a profit metric rather than an accuracy metric
4. understand and use prediction thresholds
5. understand the use of a ROC curve, especially in the situation where probability thresholds are not possible
6. learn to use a profit curve to pick a model, thus directly reflecting the metric of importance
7. learn a bit about feature selection, and why we need to pipeline feature selection and classification together, in an attempt to improve a classifier
8. balance a data set to improve the performance of a SVM classifier
9. implement a kernelized SVM to improve performance further.
10. make a final comparison of classifiers and make a presentation for your boss.

This homework is long because we are walking you through the entire process, so start early! There are some parts of it that can be solved by understanding, copying, and slightly modifying code from the lab. Feel free to do that. Pay attention to any differences in signature between the lab functions and the ones here.

The idea for this homework, and the attendant data set is taken from the book "Data Mining Methods and Models" by [Larose](http://www.dataminingconsultant.com/DMMM.htm). Henceforth we refer to this book as DMMM. There is an analysis of the data set there as well (ch7, the book is available online through our library), which you might be interested in. It is far more detailed than this homework, talking about log-normal data transformations, amongst other things.

(Image credit: www.tabletmag.com)

(This documentation of the fields is taken verbatim from DMMM).

The clothing-store data set contains information about 28,799 customers in the following 51 fields:

- Customer ID: unique, encrypted customer identification `HHKEY`
- Zip code `ZIP_CODE`
- Number of purchase visits `FRE`
- Total net sales `MON`
- Average amount spent per visit `AVRG`
- Amount spent at each of four different franchises (four variables) `AMSPEND`, `PSSPEND`, `CCSPEND`, `AXSPEND`
- Amount spent in the past month, the past three months, and the past six months `OMONSPEND`, `TMONSPEND`, `SMONSPEND`
- Amount spent the same period last year `PREVPD`
- Gross margin percentage `GMP`
- Number of marketing promotions on file `PROMOS`
- Number of days the customer has been on file `DAYS`
- Number of days between purchases `FREDAYS`
- Markdown percentage on customer purchases `MARKDOWN`
- Number of different product classes purchased `CLASSES`
- Number of coupons used by the customer `COUPONS`
- Total number of individual items purchased by the customer `STYLES`
- Number of stores the customer shopped at `STORES`
- Number of promotions mailed in the past year `MAILED`
- Number of promotions responded to in the past year `RESPONDED`
- Promotion response rate for the past year `RESPONSERATE`
- Product uniformity (low score = diverse spending patterns) `HI`
- Lifetime average time between visits `LTFREDAYS`
- Microvision lifestyle cluster type `CLUSTYPE`
- Percent of returns `PERCRET`
- Flag: credit card user `CC_CARD`
- Flag: valid phone number on file `VALPHON`
- Flag: Web shopper `WEB`
- 15 variables providing the percentages spent by the customer on specific classes of clothing, including sweaters, knit tops, knit dresses, blouses, jackets, career pants, casual pants, shirts, dresses, suits, outerwear, jewelry, fashion, legwear, and the collectibles line; (`P*`, `PJACKETS` for example) also a variable showing the brand of choice (encrypted)
- **Target variable**: response to promotion `RESP`...this is our **response** or **y**.


These data are based on a direct mail marketing campaign conducted last year. We want to use this information to develop classification models for this year’s marketing campaign.

## Features and a simple classifier

### Get, check, and clean the data

In [7]:
df=pd.read_csv("./data/Clothing_Store.csv")
df.head()

In [8]:
df.shape

We'll delete some columns we don't intend to use, and some whose meaning I couldn't quite figure out from the original data set and documentation.

In [9]:
del df['CLUSTYPE']
del df['HHKEY'], df['ZIP_CODE'], df['REC'], df['PC_CALC20'] 
del df['STORELOY']

In [10]:
df.columns

Make a copy of the dataframe to make transformations to.

In [11]:
dftouse=df.copy()

### Feature Engineering

Feature Engineering is one of the most important "human inputs" that go into machine learning. Machines can run algorithms, but if you feed in garbage, you will get out garbage. The features that are important, or the feature combinations that might be useful in a problem, are inputs that humans can use to help the machine along. Domain knowledge is particularly useful. 

We first list the columns that are percentages:

In [12]:
PERCENT_VARS=[ u'PSWEATERS', u'PKNIT_TOPS', u'PKNIT_DRES', u'PBLOUSES', u'PJACKETS', u'PCAR_PNTS', u'PCAS_PNTS', u'PSHIRTS', 
              u'PDRESSES', u'PSUITS', u'POUTERWEAR', u'PJEWELRY', u'PFASHION', u'PLEGWEAR', u'PCOLLSPND']
len(PERCENT_VARS)

Next, we look for columns where the existence or lack thereof of a zero may be important in a classifier. We used our intuition to make these choices, believing that there is additional information encoded in say, `PERCRET`: if you never returned anything you might not be a budget shopper and thus someone who might have the money to shop quite a bit...

In [13]:
ZERO_IMPORTANT_VARS = [u'PREVPD', u'AMSPEND', u'PSSPEND', u'CCSPEND', u'AXSPEND', u'RESPONDED', u'PERCRET']

We also list the columns with floating-point or integer variables that are amenable to standardization:

In [14]:
STANDARDIZABLE = PERCENT_VARS + ZERO_IMPORTANT_VARS + [u'FRE', u'MON',  u'AVRG', u'GMP', u'PROMOS', u'DAYS', u'FREDAYS', u'MARKDOWN', u'CLASSES', u'COUPONS', u'STYLES',  u'MAILED',  u'RESPONSERATE', u'HI', u'LTFREDAY']

Now, an **indicator variable** is one which takes a few, usually 2 values (1/0, True/False) to code the existence or lack thereof of a property or feature. We look for existing indicators:

In [15]:
for v in df.columns:
    l=df[v].unique()
    if len(l) <= 10:
        print(v, l)

#### Zero important indicators

We encode VALPHON, PERCENT_VARS, and ZERO_IMPORTANT_VARS as indicators. By doing this we are saying: the fact that these features are non-zero carries additional importance as compared to their values.

We maintain a global list INDICATORS in which the names of these columns are stored, prepending an `i_` to each of these variables to denote that they are indicators.

Note that all changes are now being made to the `dftouse` dataframe.

In [16]:
# rename some indicators and make them all 1-0
dftouse['i_VALPHON']=(df.VALPHON=='Y')*1
del dftouse['VALPHON']
dftouse.rename(columns={'WEB':'i_WEB', 'CC_CARD':'i_CC_CARD'}, inplace=True)
INDICATORS=['i_VALPHON','i_WEB','i_CC_CARD']

We then take the `ZERO_IMPORTANT_VARS`, the ones for which we thought presence or absence was important, and create indicators from them:

In [17]:
for p in ZERO_IMPORTANT_VARS:
    dftouse['i_'+p]=(df[p] > 0.0)*1
    INDICATORS.append('i_'+p)

And then we create indicators for each of the percent variables `PERCENT_VARS` (following Larose's ch7), in the hope that the presence or absence of buying a particular clothing style such as blouses makes a difference...

In [18]:
for p in PERCENT_VARS:
    dftouse['i_'+p]=(df[p] > 0.0)*1
    INDICATORS.append('i_'+p)

#### Combine some features 

We do this to communicate information about recency and savings clearly.

We add two more indicators corresponding to recent spending, and recent use of a savings mechanism.

In [19]:
#create recent usage (1 month and 3 month),  sale-shopper (markdown+coupon)
dftouse["i_RECENT"]=1*((df.OMONSPEND > 0) | (df.TMONSPEND > 0))
dftouse["i_SAVER"]=1*((df.MARKDOWN > 0) | (df.COUPONS > 0))
INDICATORS.append("i_RECENT")
INDICATORS.append("i_SAVER")

Let's see what we now have...

In [20]:
dftouse.head()

In [21]:
dftouse.shape

Clearly we have now expanded the number of features, in an attempt to add indicators that (in our opinion) carry additional distinguishing information.

### Test and Training Sets, and Standardization

We standardize test and training sets separately. Specifically, we wish to standardize the non-indicator columns on both the test and training sets, by subtracting out the mean of the training set from the value, and dividing by the standard deviation of the training set. This helps us put all the continuous variables on the same scale.

(There is another reason this might be useful. One optimization which we don't do in this homework, but which is useful, is to take the log of all positive continuous variables. This makes the data look more "normal", which can be useful in some algorithms, and then such standardization can basically be thought of as measuring in units of standard deviations of the normal distribution.)

#### Why do we do this standardization on the two sets separately?

Hint: what happens to the purity of the training data if we standardize using the entire dataset?

*your answer here*


We'll split the dataset and create a training and test mask.

In [24]:
from sklearn.model_selection import train_test_split
itrain, itest = train_test_split(range(dftouse.shape[0]), train_size=0.7)

In [25]:
mask=np.ones(dftouse.shape[0], dtype='int')
mask[itrain]=1
mask[itest]=0
mask = (mask==1)

In [26]:
mask

In [27]:
mask.shape, mask.sum()

In [28]:
dftrain = df[mask]
dftest = df[~mask]

#### Standardize the data

Use `StandardScaler` from `sklearn.preprocessing` to "fit" the columns in `STANDARDIZABLE` on the training set. Then use the resultant estimator to transform both the training and the test parts of each of the columns in the dataframe, replacing the old unstandardized values in the `STANDARDIZABLE` columns of `dftouse` by the new standardized ones.
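
As a rough illustration of the fit-on-train, transform-both pattern described above, here is a minimal sketch on a made-up array. The names `X`, `toy_mask`, `toy_train` and `toy_test` and all the values are hypothetical and are not part of the homework data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the STANDARDIZABLE columns: 6 rows, 2 features (hypothetical values).
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0],
              [10.0, 50.0],
              [20.0, 60.0]])
toy_mask = np.array([True, True, True, True, False, False])  # True marks training rows

# Learn the mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X[toy_mask])

# Apply the same training-set statistics to both parts.
toy_train = scaler.transform(X[toy_mask])
toy_test = scaler.transform(X[~toy_mask])

print(toy_train.mean(axis=0))  # approximately zero in each column
print(toy_test.mean(axis=0))   # generally not zero: test rows use training statistics
```

The point of the sketch is only that the test rows never influence the mean and standard deviation that are used for scaling.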

In [29]:
from sklearn.preprocessing import StandardScaler


In [31]:
#your code here


In [32]:
dftouse.loc[mask, STANDARDIZABLE] = outtrain
dftouse.loc[~mask, STANDARDIZABLE] = outtest

We create a list `lcols` of the columns we will use in our classifier. This list should not contain the response `RESP`. How many features do we have?

In [33]:
lcols=list(dftouse.columns)
lcols.remove(u'RESP')
len(lcols)

### Writing code for a classifier

We will now take this data and write a classifier to predict the response, which is in the `RESP` column of `dftouse`. This response corresponds to asking the question: will a user targeted with our advertisement respond or not?

#### 1.4 Train a Logistic Regression on this data.

In [34]:
from sklearn.linear_model import LogisticRegression

In [39]:
params = {"C": [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}
clf, Xtrain, ytrain, Xtest, ytest = do_classify(LogisticRegression(solver="liblinear"), params, dftouse,lcols, u'RESP',1,  mode="mask", reuse_split=mask)

In [40]:
reuse_split=dict(Xtrain=Xtrain, Xtest=Xtest, ytrain=ytrain, ytest=ytest)

In [43]:
import pickle
with open('reuse_split.pkl', 'wb') as handle:
    pickle.dump(reuse_split, handle)

## Estimate costs and benefits from assumptions and data

### Our data is highly asymmetric

First notice that our data set is very highly asymmetric, with positive `RESP`onses only making up 16-17% of the samples.

In [63]:
print("whole data set", dftouse['RESP'].mean())#Highly asymmetric
print("training set", dftouse['RESP'][mask].mean(), "test set", dftouse['RESP'][~mask].mean())

This means that a classifier which predicts that EVERY customer is a negative has an accuracy rate of 83-84%. By this we mean that **a classifier that predicts that no customer will respond to our mailing** has an accuracy of 83-84%!
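
To see where a number like this comes from, here is a small sketch with made-up labels (17 responders out of 100 customers; these values are hypothetical, not the real data). The accuracy of the "nobody responds" prediction is simply one minus the positive rate.

```python
import numpy as np

# Hypothetical toy labels: 17 responders out of 100 customers.
y_toy = np.array([1] * 17 + [0] * 83)

# The "nobody responds" classifier predicts 0 for every customer.
y_pred_none = np.zeros_like(y_toy)

baseline_accuracy = (y_pred_none == y_toy).mean()
print(baseline_accuracy)   # 0.83, i.e. 1 minus the positive rate
```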

#### Compare the accuracy of the Logistic Regression to the no-customer-responds baseline

Based on your comparison, and using accuracy as a metric, does the classifier seem worth pursuing?

*your answer here*


But we haven't asked the most important question. Is accuracy really the relevant metric?

### Costs and the Confusion Matrix

Our classifier above had, as one of its printed outputs, a confusion matrix. It looked like this:

In [64]:
ypred=clf.predict(Xtest)
confusion_matrix(ytest, ypred)

The matrix above is of this form:

![hwimages](./images/confusionmatrix.png)


**Important note**: In sklearn, to obtain the confusion matrix in the form above, always have the observed `y` first, i.e.: use as `confusion_matrix(y_true, y_pred)`

In our example, +ives (those with a 1 `RESP`onse) are people who respond to the mailing by going into the store and buying goods. These are also called observed positives (OP). And -ives (those with a 0 `RESP`onse) are those who do not respond to the mailing. These are also called observed negatives (ON). On our test set, we can print the observed positives and observed negatives respectively:

In [65]:
print("OP=", ytest.sum(), ", ON=",ytest.shape[0] - ytest.sum())

We can make a similar calculation on the predictions of our LR classifier made on the test set. This gives us the predicted negatives (PN): those customers who we predict will not respond to our mailing; and the predicted positives (PP), the customers who we predict will respond to our mailing by coming into the store to buy stuff.

In [66]:
print("PP=", ypred.sum(), ", PN=",ytest.shape[0] - ypred.sum())

In addition to these four quantities, the confusion matrix gives us more details on proper classifications and mis-classifications from our classifier:

- the samples that are +ive and the classifier predicts as +ive are called True Positives (TP). These are folks we correctly identified as responders, and thus sending them a mailing would result in a sale for us. True Positives are great. We do incur the cost of mailing them, but we like to because they will come into the store to buy.
- the samples that are -ive and the classifier predicts (wrongly) as +ive are called False Positives (FP). False Positives incur us the cost of mailing them as well, but are not very costly. These are people who wouldn't have responded, but we sent them a mailing because our classifier mispredicted them as buyers. Thus, for them, we only incur the cost of preparing the mailing and mailing it to them.
- the samples that are -ive and the classifier predicts as -ive are called True Negatives (TN). These are folks we correctly identified as not-responding, and thus we don't waste any money on sending them a mailing. This is a great classification for us.
- the samples that are +ive and the classifier predicts as -ive are called False Negatives (FN). False negatives are VERY costly: these are folks who would have responded to us had we mailed them, but we didn't target them, leading to huge lost sales per person. Notice that our logistic regression classifier has tons of False Negatives.
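
The four categories above can be counted directly from the observed and predicted labels. The sketch below does this on ten made-up customers (the arrays `y_obs` and `y_hat` are hypothetical) and arranges the counts in the same layout that `confusion_matrix(y_true, y_pred)` uses.

```python
import numpy as np

# Hypothetical observed and predicted labels for 10 customers.
y_obs = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 1])
y_hat = np.array([1, 0, 0, 0, 0, 1, 0, 0, 0, 1])

tp = int(np.sum((y_obs == 1) & (y_hat == 1)))  # responders we mailed
fp = int(np.sum((y_obs == 0) & (y_hat == 1)))  # non-responders we mailed
tn = int(np.sum((y_obs == 0) & (y_hat == 0)))  # non-responders we skipped
fn = int(np.sum((y_obs == 1) & (y_hat == 0)))  # responders we missed

# Rows are observed classes, columns are predicted classes.
toy_confusion = np.array([[tn, fp],
                          [fn, tp]])
print(toy_confusion)
```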

It is not enough to simply identify these categories from the confusion matrix. Rather, we want to sit down with our business team and identify the costs associated with each of the 4 classification situations above. Keep in mind that these costs might change from year to year or even more suddenly: this is why it is important to have marketing and sales people on your data science teams. (See Patil, D. J., *Building Data Science Teams*, O'Reilly Media, Inc., 2011, for more details.)

Fortunately you have talked to your domain experts and done just that!

#### Costs for True Positives, False Negatives, False Positives, and True Negatives

Let's categorize the costs for each one of these alternatives.

Let's assume the amortized cost of preparing a mailing and mailing it is \$3. Let's assume additionally that the profit margin on a sale is 30% (we are a high end clothing chain).

True Negatives cost us nothing but gain us nothing either.

In [67]:
tnc=0.0 #tnc stands for "true negative cost"

From the average amount of a sale, and the 30% profit assumption, we calculate `tpc`, the cost of a true positive. Note: `tpc` should come out negative, since a true positive earns us money and we are expressing everything as costs.

The `tpc` takes into account the cost of mailing to the respondent, and since our mailing works, we subtract out the profit. We use the average of the `AVRG` column, which is the average money spent by a customer on each visit.

In [68]:
prep_and_mail=3
profit_margin=0.3
tpc=prep_and_mail - df.AVRG.mean()*profit_margin 
tpc

The false negative is a lost sale for us! We didn't mail them, and they didn't spend the money. They would have if we had mailed them. So we lost a certain profit per such false negative! Thus the false-negative cost, given by `fnc`, is:

In [69]:
fnc = df.AVRG.mean()*profit_margin
fnc

This leaves us with false positives. A false positive is a person who would not have responded but on whom you wasted \$3. So the false positive cost (`fpc`) is:

In [70]:
fpc=prep_and_mail
fpc

#### Cost and Utility Matrix

We then use these costs to write a **risk or cost matrix** in the same form as the confusion matrix above. 

![cost matrix](images/costmatrix.png)

In [71]:
risk_matrix=np.array([[tnc, fpc],[fnc, tpc]])
risk_matrix

Notice that a false positive costs about 11 times less than a false negative. As is often the case when one class dominates the other, the two kinds of misclassification are not equally expensive: here false negatives cost far more than false positives. Similar situations arise in cancer prediction, for example, where a FP only means that you diagnosed a healthy person with cancer, but a FN means that you misdiagnosed a cancer patient as healthy, possibly killing them in the process!
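
To make the arithmetic behind these four costs concrete, here is a worked sketch that assumes a made-up average sale of \$100 per visit. That figure, and every name ending in `_toy`, is hypothetical; the notebook itself uses `df.AVRG.mean()`.

```python
import numpy as np

# Hypothetical average sale per visit; the notebook uses df.AVRG.mean() instead.
avg_sale_toy = 100.0
prep_and_mail_toy = 3.0      # cost of preparing and sending one mailing
profit_margin_toy = 0.3      # 30% margin on a sale

tnc_toy = 0.0                                                   # no mailing, no sale
fpc_toy = prep_and_mail_toy                                     # mailing wasted
fnc_toy = avg_sale_toy * profit_margin_toy                      # profit we failed to capture
tpc_toy = prep_and_mail_toy - avg_sale_toy * profit_margin_toy  # negative cost, i.e. a gain

# Same layout as the confusion matrix: [[TN, FP], [FN, TP]].
risk_toy = np.array([[tnc_toy, fpc_toy],
                     [fnc_toy, tpc_toy]])
print(risk_toy)
print(fnc_toy / fpc_toy)   # how many times costlier a false negative is than a false positive
```

With these assumed numbers a false negative costs \$30 and a false positive \$3, so the ratio is 10; the real data gives a somewhat different ratio.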

The negative of the cost matrix is called the **utility matrix or profit matrix** `u`. Here we calculate this utility matrix, which we shall use in the next part of the homework.

In [72]:
u = - risk_matrix
u

Ok! Now we can use this profit matrix to calculate the profit that our logistic regression classifier can land us.

#### Average Profit Per Person

We can compute the average profit per person using the following formula, which calculates the "expected value" of the per-customer profit (the $P$ below stands for "predicted" and $O$ for observed):



\begin{eqnarray}
Profit &=& u(+P,+O) \times p(+P,+O) \\
       &+& u(+P,-O) \times p(+P,-O) \\
       &+& u(-P,+O) \times p(-P,+O) \\
       &+& u(-P,-O) \times p(-P,-O) 
\end{eqnarray}


which gives


$$ Profit =  \frac{( TP \times -TPC )+ ( FP \times -FPC ) + ( FN \times -FNC ) + ( TN \times -TNC )}{N}$$

where N is the total size of the test set, +P means predicted positive, -O is observed negative, and so on and so forth. The formula above just weighs the profit of a combination of observed and predicted with the out-of-sample probability of the combination occurring. The probabilities are "estimated" by the corresponding confusion matrix on the **test set**, which leads to the second formula. $-TPC$ is just the 'true positive' utility (similar for the others...).

The profit can thus be found by multiplying the utility matrix by the confusion matrix elementwise, and dividing by the sum of the elements in the confusion matrix, or the test set size.

We implement this process of finding the average profit per person in the `average_profit_pp` function below:

In [74]:
def average_profit_pp(y, ypred, u):
    c=confusion_matrix(y,ypred)
    score=np.sum(c*u)/np.sum(c)
    return score
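
As a quick check of the formula, the sketch below applies the same elementwise multiplication to a made-up confusion matrix and a made-up utility matrix. Both `c_toy` and `u_toy` are hypothetical and are not taken from the data.

```python
import numpy as np

# Hypothetical test-set confusion matrix, laid out as [[TN, FP], [FN, TP]].
c_toy = np.array([[70, 10],
                  [ 5, 15]])

# Hypothetical utility matrix in the same layout (negative of a cost matrix).
u_toy = np.array([[  0.0,  -3.0],
                  [-30.0,  27.0]])

# Elementwise product, summed, divided by the number of test customers.
profit_pp_toy = np.sum(c_toy * u_toy) / np.sum(c_toy)
print(profit_pp_toy)   # (10*-3 + 5*-30 + 15*27) / 100 = 2.25
```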

But before we make this calculation for our logistic classifier, we need to first check what profit or cost our baseline classifier which assumes that no customer will respond, incurs.

### Establishing Baseline Classifiers via profit

The simplest classifiers you can think of are the "send to everyone" and "don't send to anyone" classifiers. We explain these below. If we are going to write any more complex classifiers we should at least outperform these.

#### Don't Send to Anyone Baseline Classifier

This is the "majority" classifier we talked about earlier. We don't send mailings to anyone because we believe that **no-one will respond**. Thus this classifier predicts everyone to be a 0 or -ive, a non-respondent. Remember, this classifier has an 83-84% accuracy.

We print the confusion matrix for the "don't send to everyone" (`dste`) model (not the best acronym, I know!), and calculate the average profit per person as `dsteval`.

In [75]:
testsize = dftouse[~mask].shape[0]
ypred_dste = np.zeros(testsize, dtype="int")
print(confusion_matrix(ytest, ypred_dste))
dsteval=average_profit_pp(ytest, ypred_dste, u)
dsteval

#### For this baseline classifier question, explain the value we got.

*your answer here*


#### Send to Everyone Baseline Classifier

This is the other extreme. In this case we **predict everyone as responders** and send the mailing to everyone. In other words, we predict everyone on the test set to be a 1. Print out both the confusion matrix and `steval`, the average profit per person, for this case. Based on this result, which one of these two classifiers is the one to beat? Why?
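
The two extremes can be compared on made-up numbers before you run them on the real test set. In the sketch below the labels, the utility matrix, and the helper `profit_pp` are all hypothetical; the helper is a plain-numpy stand-in for the idea behind `average_profit_pp`, not a replacement for it.

```python
import numpy as np

y_toy = np.array([1] * 17 + [0] * 83)           # hypothetical labels, 17% responders
u_toy = np.array([[0.0, -3.0], [-30.0, 27.0]])  # hypothetical utilities, [[TN, FP], [FN, TP]]

def profit_pp(y_obs, y_hat, u):
    """Average profit per person from 0/1 observed and predicted labels."""
    counts = np.zeros((2, 2))
    for obs, hat in zip(y_obs, y_hat):
        counts[obs, hat] += 1          # rows observed, columns predicted
    return np.sum(counts * u) / np.sum(counts)

send_none = profit_pp(y_toy, np.zeros_like(y_toy), u_toy)   # everyone predicted 0
send_all = profit_pp(y_toy, np.ones_like(y_toy), u_toy)     # everyone predicted 1
print(send_none, send_all)
```

With these assumed utilities, mailing nobody loses the profit of every responder, while mailing everybody pays \$3 per non-responder but captures every sale.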

In [76]:
#your code here


*your answer here*


#### Compare the Logistic Regression classifier with these baselines

Using the classifier we trained above, `clf`, calculate the profit we can make.

In [77]:
#your code here


If you did this correctly, at this point you might be a bit dejected....

### Implement logistic regression with Lasso based feature selection

We run another classifier, a logistic regression with L1 regularization, using the `do_classify` function we defined above. L1 or Lasso regularization automatically does feature selection for us!

Name the returned estimator `clflog`, and the training and test sets `Xtrain`, `ytrain`, `Xtest`, and `ytest`. Let the regularization hyperparameter `C` range in powers of 10 from 0.001 to 100. Use the `reuse_split` dictionary we calculated earlier. Remember that we want to use "L1" or Lasso regularization: you can do this by passing `penalty="l1"` to the Logistic Regression, together with a solver that supports it, as in `LogisticRegression(penalty="l1", solver="liblinear")`.
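
To get a feel for what the L1 penalty does before running the grid search, here is a small sketch on synthetic data. The data, the name `lasso_toy`, and the choice `C=0.05` are all made up for illustration; only the first feature carries any signal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 5))
# Only the first feature carries signal in this hypothetical data set.
y_toy = (X_toy[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# liblinear supports the L1 penalty; smaller C means stronger regularization.
lasso_toy = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
lasso_toy.fit(X_toy, y_toy)

print(lasso_toy.coef_)                        # one row of 5 coefficients
print(int(np.sum(lasso_toy.coef_ != 0.0)))    # number of features the penalty kept
```

Coefficients that come out exactly zero correspond to features the penalty has dropped, which is the sense in which the Lasso performs feature selection.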

In [78]:
#your code here


Calculate the profit that this classifier gives us:

In [79]:
#your code here


### Prediction and Re-prediction

In the case of such asymmetric costs, the `sklearn` API function `predict` is useless, as it assumes a threshold probability of having a +ive sample to be 0.5; that is, if a sample has a greater than 0.5 chance of being a 1, assume it is so. Clearly, when FN are more expensive than FP, you want to lower this threshold: you are ok with falsely classifying -ive examples as +ive. See Lab 5 for how this can be done.

You can think about this very starkly from the perspective of the cancer doctor. Do you really want to be setting a threshold of 0.5 probability to predict if a patient has cancer or not? With such a threshold the false negative problem, i.e. the chance that you predict someone doesn't have cancer when they do, is much worse. You could kill someone by telling them not to get a biopsy. Why not play it safe and assume a much lower threshold: e.g., if the probability of 1 (cancer) is greater than 0.05, we'll call it a 1.
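
One way to re-predict at a chosen threshold is sketched below. The signature `t_repredict(est, t, xtest)` is an assumption inferred from how the function is called in later cells, and `ToyEstimator` is a hypothetical stand-in for a fitted classifier that exposes `predict_proba`.

```python
import numpy as np

def t_repredict(est, t, xtest):
    """Predict 1 wherever the estimated probability of class 1 exceeds t."""
    p1 = est.predict_proba(xtest)[:, 1]   # column 1 holds P(y = 1)
    return (p1 > t) * 1

class ToyEstimator:
    """Hypothetical stand-in that treats its single input column as P(y = 1)."""
    def predict_proba(self, x):
        p1 = np.asarray(x, dtype=float).ravel()
        return np.column_stack([1.0 - p1, p1])

x_toy = np.array([[0.02], [0.10], [0.40], [0.60], [0.97]])
print(t_repredict(ToyEstimator(), 0.5, x_toy))    # [0 0 0 1 1]
print(t_repredict(ToyEstimator(), 0.05, x_toy))   # [0 1 1 1 1]
```

Lowering the threshold from 0.5 to 0.05 turns two of the five toy customers from predicted non-responders into predicted responders, which is exactly the trade of false negatives for false positives described above.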

Let us do this for our logistic regression example

#### Start with an arbitrary threshold t, and see how we fare at different thresholds for logistic regression

In [80]:
# your code here


We see average profits for multiple thresholds for the logistic regression classifier `clflog`. 

**(a) Average profit per person for t=0.5 (the usual case)**

In [81]:
average_profit_pp(ytest,clflog.predict(Xtest), u)

**(b) Confusion Matrix and average profit per person for t=0.05**

In [82]:
confusion_matrix(ytest,t_repredict(clflog, 0.05, Xtest))

In [83]:
average_profit_pp(ytest, t_repredict(clflog, 0.05, Xtest), u)

**(c) average profit per person for t=0.95**

In [84]:
average_profit_pp(ytest, t_repredict(clflog, 0.95, Xtest), u)

Voila, at a 0.05 threshold we have a nice positive profit! (if you did this right...)

We see that in this situation, where we have asymmetric costs (a false negative costs roughly 11 times as much as a false positive, per the cost matrix above), we do need to change the threshold at which we make our positive and negative predictions. Because we strongly dislike false negatives (just as in the cancer case), we set a low threshold and accept many more false positives as the price.

For otherwise, we let too many people slip through our hands who would have otherwise shopped at our store. Once we change the threshold, we can make a profit. And indeed, at $t=0.05$, our profit is higher than in the "Send to Everyone" case, which makes doing the classifier worth it! But how do we pick this threshold?

### Feature Selection

The Lasso, for example, internally implements a form of feature selection by setting many coefficients to zero. Let us find the coefficients that are non-zero.

#### Non zero lasso features

We write a function `nonzero_lasso` which takes a fitted L1-regularized classifier (here `clflog`) as an argument, and spits out a dataframe of the non-zero coefficients, sorted by their absolute magnitude. This way we can see which features dominated the logistic regression.

In [85]:
def nonzero_lasso(clf):
    featuremask=(clf.coef_ !=0.0)[0]
    return pd.DataFrame(dict(feature=lcols, coef=clf.coef_[0], abscoef=np.abs(clf.coef_[0])))[featuremask].sort_values('abscoef', ascending=False)

In [86]:
lasso_importances=nonzero_lasso(clflog)
lasso_importances.set_index("feature", inplace=True)
lasso_importances

#### Feature importance using correlations

We can also get a notion of which features are important in the classification process by seeing how they correlate with the response. Implement some code to obtain the Pearson correlation coefficient between each of our features and the response. Do this on the training set only! Create a dataframe indexed by the features, which has columns `abscorr` the absolute value of the correlation and `corr` the value of the correlation. Sort the dataframe by `abscorr`, highest first, and show the top 25 features with the highest absolute correlation. Is there much overlap with the feature selection performed by the LASSO?

In [88]:
from scipy.stats import pearsonr
correlations=[]
dftousetrain=dftouse[mask]
for col in lcols:
    r=pearsonr(dftousetrain[col], dftousetrain['RESP'])[0]
    correlations.append(dict(feature=col,corr=r, abscorr=np.abs(r)))

bpdf=pd.DataFrame(correlations).sort_values('abscorr', ascending=False)
bpdf.set_index(['feature'], inplace=True)
bpdf.head(25)
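
The same table can also be built without an explicit loop, using pandas' `DataFrame.corrwith`. The sketch below shows the idea on a made-up frame; the column names `f_same` and `f_flip` and all the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy training frame: two features and a 0/1 response.
toy = pd.DataFrame({
    "f_same": [0, 1, 0, 1, 1, 0],
    "f_flip": [1, 0, 1, 0, 0, 1],
    "RESP":   [0, 1, 0, 1, 1, 0],
})

corr = toy[["f_same", "f_flip"]].corrwith(toy["RESP"])   # Pearson by default
toy_corr = pd.DataFrame({"corr": corr, "abscorr": corr.abs()})
toy_corr = toy_corr.sort_values("abscorr", ascending=False)
print(toy_corr)
```

Here `f_same` copies the response and `f_flip` inverts it, so their correlations are 1 and -1 and both have an absolute correlation of 1.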

*your answer here*


#### Why Feature Select?

One of the reasons feature selection is done, automatically or otherwise, is that there might be strong correlations between features. Also recall polynomial regression: a large number of features can lead to overfitting. Feature selection helps curb the curse of dimensionality, where centrality measures often used in statistics go wonky in higher dimensions. A lot of smarts and domain knowledge go into feature engineering, which we did some of earlier, and into feature selection. You will gain this with experience.

## Exercises

1. Redo the logistic regression using the F1 score as the scoring metric rather than accuracy (for example by passing `score_func="f1"` to `do_classify`). Call the model "logistic-f1".
2. Make ROC curves for both models.