# Journal club: credit card defaults

"Journal clubs" are a common type of meeting in both academic settings and, for data scientists at least, professional settings as well. Journal clubs are useful for:

- Finding new, useful statistical methods to tackle relevant problems
- Reviewing the work of peers and competitors who publish their work
- Discussing what is good and bad about statistical and experimental methodologies
- Building a collaborative environment with your fellow data scientists

---

In this notebook you will be reviewing a paper on **credit card defaults**.

The paper is available here:

[Credit card paper on website](http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)

[Credit card paper local file](./DefaultCreditCardClients_yeh_2009.pdf)
    
The dataset associated with the paper:

[Dataset: credit card defaults](./default_of_credit_card_clients.xls)

**[HINT]** You can use pandas' `pd.read_excel()` to load Excel files.

---

This lesson/lab is broken up into 4 sections:

**Section 1**: Read the paper!

**Section 2**: Discuss and respond to the questions about the paper.

**Section 3**: [BONUS] Load the data and perform your own logistic regression OR kNN with cross-validation.

**Section 4**: Briefly present to the class your review and opinions of the paper.

___

## 1. Read the paper, focusing on the goals, methodology, and conclusions.

___

## 2. Discuss and respond to the following questions.

### 2.1. What is the goal of the study? (hint: this is usually described in the abstract)

A: To determine which data mining method most accurately estimates the probability that a client will default on a credit card payment.

### 2.2. What is the target variable? (hint: look at the website and dataset)

A: Whether or not the client defaulted, stored in the binary column `default payment next month` (1 = default, 0 = no default).

### 2.3. What models do they compare? (hint: although you have not yet seen all of them, try to grasp the differences)

A:

- k-nearest neighbors (kNN)
- Logistic regression
- Discriminant analysis
- Naïve Bayes classifier
- Artificial neural networks
- Classification trees

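### 2.4. How do they judge the "goodness" of a model? Do they use accuracy? If not, what do they use?

A: They use the area ratio from the lift chart rather than accuracy alone: the area between the tested model's curve and the baseline curve, divided by the area between the theoretically best curve and the baseline curve.

$$\frac{A(Model)}{A(Total)}$$

Here $A(Model)$ is the area between the tested model's curve and the baseline, and $A(Total)$ is the area between the best curve and the baseline.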
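The area ratio can be computed directly from predicted scores. The following is a minimal sketch, not the paper's code. It assumes the lift chart plots the cumulative number of defaulters captured against the number of clients taken in descending score order. The function names `gains_curve`, `area_between`, and `area_ratio` are hypothetical, and the labels are toy values rather than rows from the dataset.

```python
import numpy as np

def gains_curve(y_true, scores):
    """Cumulative defaulters captured after taking the top-k scored clients, k = 0..n."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    captured = np.cumsum(np.asarray(y_true)[order])
    return np.concatenate(([0], captured))

def area_between(curve, baseline):
    """Trapezoid-rule area between two curves sampled at unit spacing."""
    gap = curve - baseline
    return float(np.sum((gap[:-1] + gap[1:]) / 2.0))

def area_ratio(y_true, scores):
    """Area(model vs. baseline) divided by area(best vs. baseline)."""
    y_true = np.asarray(y_true)
    baseline = np.linspace(0, y_true.sum(), len(y_true) + 1)  # random ordering
    model = gains_curve(y_true, scores)
    best = gains_curve(y_true, y_true)  # perfect ordering: all defaulters first
    return area_between(model, baseline) / area_between(best, baseline)

y_demo = np.array([1, 1, 0, 0])                   # toy labels (assumption)
print(area_ratio(y_demo, [0.9, 0.8, 0.3, 0.1]))   # perfect ranking  -> 1.0
print(area_ratio(y_demo, [0.1, 0.3, 0.8, 0.9]))   # reversed ranking -> -1.0
```

A ratio near 1 means the model ranks defaulters almost as well as a perfect ordering, and a ratio near 0 means it does no better than random ordering.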

### 2.5. What validation method do they use? Simple train/test split? Cross-validation?

A: Simple train and test split

### 2.6. Which model performs best?

A: Artificial Neural Networks

---

## 3. [BONUS] Load the data and perform your own classification using logistic regression or kNN.

Validate with cross-validation and a scoring metric of your choice.

    pd.read_excel('./default_of_credit_card_clients.xls', header=1)

In [3]:
import pandas as pd
import numpy as np

In [4]:
default = pd.read_excel('../credit-cards/default_of_credit_card_clients.xls')

In [5]:
default.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
1,20000,2,2,1,24,2,2,-1,-1,-2,...,0,0,0,0,689,0,0,0,0,1
2,120000,2,2,2,26,-1,2,0,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,90000,2,2,2,34,0,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
4,50000,2,2,1,37,0,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0


In [6]:
from sklearn.neighbors import KNeighborsClassifier

In [7]:
model = KNeighborsClassifier(n_neighbors = 3)

In [8]:
import patsy

In [22]:
# .ix has been removed from pandas; .loc selects the row labelled 'ID'
headers = default.loc['ID', :]

In [24]:
default.columns=headers
default.head()

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
1,20000,2,2,1,24,2,2,-1,-1,-2,...,0,0,0,0,689,0,0,0,0,1
2,120000,2,2,2,26,-1,2,0,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,90000,2,2,2,34,0,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
4,50000,2,2,1,37,0,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0


In [47]:
# .ix has been removed from pandas; .iloc drops the first row (the header row) by position
data = default.iloc[1:, :]
data.head()
data.shape

(30000, 24)
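Promoting a data row to the header, as the cells above do, leaves every column with `object` dtype, because each column held a string (the name) alongside the numbers. The sketch below shows the effect on a toy frame and one way to restore numeric dtypes with `pd.to_numeric`. The values and the three-column layout are illustrative assumptions, not the real file.

```python
import pandas as pd

# Toy frame mimicking the layout above: placeholder labels, real names in the first row.
raw = pd.DataFrame(
    [["LIMIT_BAL", "AGE", "default"],
     [20000, 24, 1],
     [120000, 26, 1],
     [90000, 34, 0]],
    columns=["X1", "X5", "Y"],
)

# Promote the first row to the header and drop it from the data.
promoted = raw.iloc[1:].copy()
promoted.columns = raw.iloc[0].tolist()
print(promoted.dtypes)   # every column is still object

# Convert each column to a numeric dtype before modelling.
clean = promoted.apply(pd.to_numeric)
print(clean.dtypes)
```

Formula-based tools such as patsy treat `object` columns as categorical, so skipping the conversion can silently turn a numeric feature like `AGE` into dozens of dummy columns.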

In [48]:
# rename() returns a new DataFrame, which avoids assigning into a slice of `default`
data = data.rename(columns={'default payment next month': 'default'})


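In [49]:
formula = 'default ~ C(SEX) + AGE + C(MARRIAGE) - 1'

In [50]:
# The header row was read in as data, so every column is still object dtype;
# cast to int so patsy treats AGE and default as numeric rather than categorical.
# patsy.dmatrices returns the outcome matrix first, then the design matrix.
Y, X = patsy.dmatrices(formula, data=data.astype(int))

In [40]:
Y = np.ravel(Y)

In [51]:
Y.shape

(30000,)

In [52]:
X.shape

(30000, 6)

In [53]:
# Accuracy on the training data itself, so this is an optimistic estimate
model.fit(X,Y)
model.score(X,Y)

In [58]:
# train_test_split now lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)

In [59]:
model.fit(X_train, Y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [60]:
model.score(X_test, Y_test)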
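Section 3 asks for cross-validation, whereas the cells above score a single train/test split. The sketch below shows one way to cross-validate a kNN classifier with scikit-learn. It runs on synthetic data from `make_classification` as a stand-in for the credit data. The sample size, feature count, class weights, and the choice of ROC AUC as the metric are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real features and target (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.78, 0.22], random_state=0
)

# kNN is distance-based, so scale the features inside the pipeline;
# this way the scaler is refit on each training fold only.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# Stratified folds keep the class ratio roughly constant across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(knn, X_demo, y_demo, cv=cv, scoring="roc_auc")

print(scores)
print(scores.mean(), scores.std())
```

To apply this to the notebook's data, pass the `X` and `Y` arrays built with patsy in place of `X_demo` and `y_demo`. With an imbalanced target, a ranking metric such as ROC AUC is usually more informative than raw accuracy.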

---

## 4. Discuss and prepare to briefly present your review to the class.