# Classification with Scikit Learn Long

Classification is a supervised learning technique used to predict discrete variables, such as a binary outcome or membership in a specific class.

- Pandas Documentation: http://pandas.pydata.org/
- Scikit Learn Documentation: http://scikit-learn.org/stable/documentation.html
- Seaborn Documentation: http://seaborn.pydata.org/

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Classification with 1 feature and a binary target

In [None]:
df = pd.read_csv('../data/user_visit_duration.csv')

In [None]:
df.head()

In [None]:
df.plot(kind='scatter', x='Time (min)', y='Buy',
        title='Purchase VS time spent on page');

### Features

Let's skip the train/test split for now, since we have very little data.

In [None]:
X = df[['Time (min)']].values
y = df['Buy'].values

### Linear Regression fail

Let's try to fit this with a linear regression first. It won't give useful results: a straight line predicts unbounded continuous values, while the target is binary.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression()
model.fit(X, y)

#### Visual comparison of predictions

In [None]:
y_pred = model.predict(X)

In [None]:
df.plot(kind='scatter', x='Time (min)', y='Buy',
        title='Linear Regression Fail')
plt.plot(X, y_pred, '.r');
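
To make the failure concrete without relying on the CSV file, here is a minimal sketch on a hypothetical toy dataset (the values in `X_toy` and `y_toy` are made up for illustration and are not taken from `user_visit_duration.csv`). It shows that the fitted line returns values below 0 and above 1, which cannot be read as class labels or probabilities.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the visit-duration data:
# minutes spent on the page and whether a purchase happened (0 or 1).
X_toy = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lin = LinearRegression().fit(X_toy, y_toy)

# Predict for a very short, a medium and a very long visit.
X_new = np.array([[0.0], [2.5], [10.0]])
lin_pred = lin.predict(X_new)
print(lin_pred)

# A straight line is unbounded, so some predictions leave the [0, 1] interval.
out_of_range = (lin_pred < 0) | (lin_pred > 1)
print(out_of_range)
```

Only the medium visit gets a prediction inside the interval; the two extreme visits fall outside it.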

### Exercise: Logistic Regression

1. Replace the above model with a `LogisticRegression` and repeat the process. What results do you get?
2. Use the method `model.predict_proba` to also predict the probability of each class.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model = LogisticRegression()
model.fit(X, y)

In [None]:
y_pred = model.predict(X)

In [None]:
df.plot(kind='scatter', x='Time (min)', y='Buy',
        title='Logistic Regression Success')
plt.plot(X, y_pred, '.r');

In [None]:
y_pred_prob = model.predict_proba(X)

In [None]:
df.plot(kind='scatter', x='Time (min)', y='Buy',
        title='Logistic Regression Predicted Probability')
plt.plot(X, y_pred_prob[:, 1], '.r');
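
The relationship between `predict` and `predict_proba` can be checked directly. The sketch below uses the same kind of hypothetical toy data (not the real CSV) and assumes a binary target: each row of `predict_proba` sums to 1, the second column is the sigmoid of the model's decision function, and `predict` amounts to thresholding that column at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: minutes on page and purchase flag.
X_toy = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)

# One column per class, in the order given by clf.classes_.
proba = clf.predict_proba(X_toy)
print(proba.shape)

# For a binary problem the second column is the sigmoid of the linear score.
manual_proba = 1.0 / (1.0 + np.exp(-clf.decision_function(X_toy)))

# Thresholding the positive-class probability at 0.5 reproduces predict().
labels_from_proba = clf.classes_[(proba[:, 1] > 0.5).astype(int)]
print(labels_from_proba)
```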

## Classification with 2 features and a binary target

In [None]:
df = pd.read_csv('../data/isp_data.csv')

In [None]:
df.head()

In [None]:
df.label.unique()

In [None]:
import seaborn as sns

In [None]:
grid = sns.pairplot(df, hue='label', vars=['download', 'upload'])
grid.fig.suptitle('Internet Service Providers');

In [None]:
X = df[['download', 'upload']].values
y = df['label'].values

### Logistic Regression

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

In [None]:
model = LogisticRegression()

In [None]:
model.fit(X_train, y_train)

### Performance evaluation

In [None]:
model.score(X_train, y_train)

In [None]:
model.score(X_test, y_test)

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
y_pred = model.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
print(classification_report(y_test, y_pred))
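
The numbers in the classification report follow from the confusion matrix. As a rough illustration on made-up labels (the arrays `y_true` and `y_hat` below are hypothetical, not the ISP predictions), precision and recall for the positive class can be computed by hand and compared with scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions for a binary problem.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)    # share of predicted positives that are correct
recall = tp / (tp + fn)       # share of true positives that were found
accuracy = (tp + tn) / cm.sum()

print(cm)
print(precision, recall, accuracy)
```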

## Classification with more features and more targets

In [None]:
df = pd.read_csv('../data/car.csv', dtype='category')

In [None]:
df.head()

In [None]:
df.info()

### 1-hot encoding of features

In [None]:
features = df.drop('class', axis=1)

In [None]:
X = pd.get_dummies(features)

In [None]:
X.head()
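
What `pd.get_dummies` does is easier to see on a tiny table. The sketch below uses a hypothetical two-column frame (the column names `size` and `color` are invented and do not come from `car.csv`): every category becomes its own indicator column, and each row has exactly one active indicator per original column.

```python
import pandas as pd

# Hypothetical categorical table with two features.
toy = pd.DataFrame({
    'size': ['small', 'big', 'small'],
    'color': ['red', 'red', 'blue'],
}).astype('category')

# One indicator column per (feature, category) pair.
encoded = pd.get_dummies(toy)
print(encoded)

# Each row activates one indicator per original feature.
active_per_row = encoded.sum(axis=1)
print(active_per_row)
```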

### Label encoding

In [None]:
df['class'].value_counts()

In [None]:
from sklearn.preprocessing import LabelEncoder 

In [None]:
le = LabelEncoder()
y = le.fit_transform(df['class'])
le.classes_
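
`LabelEncoder` assigns integer codes in sorted order of the class names, and `inverse_transform` maps codes back to names. The sketch below assumes four class names; only `unacc` is mentioned in this notebook, the other three are assumed here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed class names for illustration; only 'unacc' appears in the notebook text.
labels = ['unacc', 'acc', 'vgood', 'unacc', 'good']

enc = LabelEncoder()
codes = enc.fit_transform(labels)

# Classes are stored sorted, so the code is the index in this array.
print(enc.classes_)
print(codes)

# Recover the original names from the integer codes.
decoded = enc.inverse_transform(codes)
print(decoded)
```

With these four names, `unacc` is third in sorted order and therefore receives the code 2.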

### Train test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

### Fit a Decision Tree model

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
model = DecisionTreeClassifier()

In [None]:
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

In [None]:
model.score(X_train, y_train)

In [None]:
model.score(X_test, y_test)

# Exercises

### Exercise 1

- Benchmark your prediction. As you may have noticed, the labels are highly imbalanced, with most of the data falling in the `{"unacc": 2}` category. What score would you have gotten if you had predicted 2 for all of your test data?
- Print a confusion matrix of the test predictions. Which classes get confused?
- Repeat the classification with the Logistic Regression model. Does it improve the accuracy?

In [None]:
df['class'].value_counts()/len(df['class'])
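
The class proportions give the benchmark: always predicting the majority class scores its share of the data. One way to compute the same baseline as a model is scikit-learn's `DummyClassifier`. The sketch below uses made-up imbalanced labels (`y_toy`, with 70% of the rows in class 2), not the car dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical imbalanced labels: 7 rows of class 2, 2 of class 0, 1 of class 1.
y_toy = np.array([2] * 7 + [0] * 2 + [1] * 1)

# The dummy model ignores the features, so a placeholder column is enough.
X_toy = np.zeros((len(y_toy), 1))

baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)

# Every prediction is the majority class, so accuracy equals its frequency.
baseline_pred = baseline.predict(X_toy)
baseline_acc = baseline.score(X_toy, y_toy)
print(baseline_pred)
print(baseline_acc)
```

A real classifier is only useful if its test accuracy clearly exceeds this number.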

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)

In [None]:
model.score(X_train, y_train)

In [None]:
model.score(X_test, y_test)

### Exercise 2

- load the churn dataset `../data/churn.csv`
- assign the `Churn` column to a variable called `y`
- assign the other columns to a variable called `features`
- separate numerical columns with `features.select_dtypes`
- split data into train/test with test_size=0.3 and random_state=42
- classify the resulting data using Decision Tree classifier
- try to improve the score changing any of the default initialization parameters of the classifier:
    - max_depth
    - min_samples_split
    - min_samples_leaf
    - max_features
- try to improve the score using a `LogisticRegression`
- try to improve the score using any other of the classifiers used [here](http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html)


In [None]:
df = pd.read_csv('../data/churn.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
y = df['Churn'] == 'Yes'
features = df.drop('Churn', axis=1)

In [None]:
X = features.select_dtypes(include=['number'])

In [None]:
X.info()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

In [None]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
model.score(X_test, y_test)

In [None]:
model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train, y_train)
model.score(X_test, y_test)

In [None]:
model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=5)
model.fit(X_train, y_train)
model.score(X_test, y_test)

In [None]:
depths = range(1, 20)
scores = []
for d in depths:
    model = DecisionTreeClassifier(max_depth=d)
    model.fit(X_train, y_train)
    s = model.score(X_test, y_test)
    scores.append(s)

pd.Series(scores, index=depths).plot()
plt.ylabel('Test Score')
plt.xlabel('Max Depth')
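
The loop above picks `max_depth` by looking at the test score, which makes the final test score optimistic. A common alternative is to tune on the training set with cross-validation and touch the test set only once. The sketch below shows this with `GridSearchCV` on synthetic data from `make_classification` (an assumption, since the churn file is not reproduced here); the grid values are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the numerical churn features.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42)

# Example grid over two of the parameters listed in the exercise.
param_grid = {'max_depth': [2, 4, 6, 8], 'min_samples_leaf': [1, 5, 10]}

# 5-fold cross-validation on the training data only.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_tr, y_tr)
print(search.best_params_)

# The test set is used once, after the parameters are chosen.
test_acc = search.score(X_te, y_te)
print(test_acc)
```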

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)
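
Logistic regression can be sensitive to features on very different scales, and the default solver may warn about convergence in that case. One option to try, sketched here on synthetic data (an assumption, not the churn dataset), is to standardize the features inside a pipeline so that the scaler is fitted on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with one feature blown up to a much larger scale.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
X_syn[:, 0] *= 1000.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42)

# The scaler learns mean and variance from the training data inside fit().
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression())
scaled_clf.fit(X_tr, y_tr)

scaled_acc = scaled_clf.score(X_te, y_te)
print(scaled_acc)
```

Whether scaling helps on the churn data has to be checked by running it; this sketch only shows the mechanics.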