# Scikit-Learn Classification

- Pandas Documentation: http://pandas.pydata.org/
- Scikit Learn Documentation: http://scikit-learn.org/stable/documentation.html
- Seaborn Documentation: http://seaborn.pydata.org/


In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

## 1. Read data from Files

In [None]:
df = pd.read_csv('../data/geoloc_elev.csv')

## 2. Quick Look at the data

In [None]:
type(df)

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.describe()

In [None]:
df['source'].value_counts()

In [None]:
df['target'].value_counts()

## 3. Visual exploration

In [None]:
import seaborn as sns

In [None]:
sns.pairplot(df, hue='target')

## 4. Define target

In [None]:
y = df['target']
y.head()

## 5. Feature engineering

In [None]:
raw_features = df.drop('target', axis='columns')
raw_features.head()

### 1-hot encoding

In [None]:
X = pd.get_dummies(raw_features)
X.head()
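
To see what one-hot encoding does, here is a minimal sketch on a small made-up frame. The column names and values are hypothetical and are not taken from `geoloc_elev.csv`. `pd.get_dummies` leaves numeric columns untouched and replaces each string column with one indicator column per category.

```python
import pandas as pd

# Hypothetical toy data, not the contents of geoloc_elev.csv
toy = pd.DataFrame({
    'lat': [0.1, -0.4, 0.7],
    'source': ['A', 'B', 'A'],
})

# Numeric columns pass through; 'source' becomes one column per category
toy_encoded = pd.get_dummies(toy)
print(toy_encoded.columns.tolist())
```

Each new column is named `<original column>_<category>`, so the encoded frame can still be traced back to the raw feature.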

## 6. Train/Test split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
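
If the target classes are imbalanced, a plain random split can leave the two parts with different class proportions. The sketch below uses synthetic stand-in data (not the notebook's `X` and `y`) to show the optional `stratify` argument of `train_test_split`, which the split above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 100 rows, 20% positive labels
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 20 + [0] * 80)

# stratify=y_demo asks for the same class balance in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

print(X_tr.shape, X_te.shape)    # roughly a 70/30 split of the rows
print(y_tr.mean(), y_te.mean())  # share of positives in each part
```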

## 7. Fit a Decision Tree model

In [None]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

## 8. Accuracy score on benchmark, train and test sets

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

y_pred = model.predict(X_test)
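
The three accuracy figures named in the heading could be computed along these lines. This is a sketch on synthetic data so that it runs on its own. It assumes the benchmark is "always predict the most frequent training class", which the notebook does not spell out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not geoloc_elev.csv
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# Benchmark (an assumption here): always predict the majority training class
majority_class = np.bincount(y_tr).argmax()
benchmark_acc = (y_te == majority_class).mean()

train_acc = tree.score(X_tr, y_tr)  # accuracy on the data the tree was fit on
test_acc = tree.score(X_te, y_te)   # accuracy on held-out data

print(f'benchmark: {benchmark_acc:.3f}')
print(f'train:     {train_acc:.3f}')
print(f'test:      {test_acc:.3f}')
```

In this notebook the same calls would apply to `model`, `X_train`, `X_test`, `y_train` and `y_test`. A train score far above the test score suggests overfitting, and a test score close to the benchmark suggests the model has learned little.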

## 9. Confusion matrix and classification report

In [None]:
cm = confusion_matrix(y_test, y_pred)

pd.DataFrame(cm,
             index=["Miss", "Hit"],
             columns=['pred_Miss', 'pred_Hit'])

In [None]:
print(classification_report(y_test, y_pred))
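
To read the confusion matrix, recall that scikit-learn puts the true classes on the rows and the predicted classes on the columns, in sorted label order. The hand-made labels below are a small illustration; the mapping 0 = Miss, 1 = Hit is an assumption based on the table labels used above.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration: 0 = Miss, 1 = Hit (assumed mapping)
y_true_demo = [0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)

# For a binary problem, ravel() yields the four cells row by row
tn, fp, fn, tp = cm_demo.ravel()
print(tn, fp, fn, tp)  # 2 1 1 3
```

Here two Misses and three Hits are classified correctly, one Miss is predicted as a Hit (false positive), and one Hit is predicted as a Miss (false negative).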

## 10. Feature Importances

In [None]:
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.plot(kind='barh')
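
Sorting the importances before plotting makes the bar chart easier to read. The sketch below uses synthetic data with hypothetical feature names (`f0` to `f4`) so that it runs on its own. It also shows that the importances of a fitted tree are normalized.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical column names
X_arr, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(5)])

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Ascending order, so a horizontal bar plot shows the largest bar on top
imp = pd.Series(tree.feature_importances_, index=X_demo.columns).sort_values()
print(imp)
print(imp.sum())  # normalized, so the sum is 1 up to rounding
```

Features the tree never splits on get an importance of exactly zero, which is common for a shallow tree.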

## 11. Display the decision boundary

In [None]:
hticks = np.linspace(-2, 2, 101)
vticks = np.linspace(-2, 2, 101)
aa, bb = np.meshgrid(hticks, vticks)
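# Set every feature to zero, then vary only lat and lon, so the grid
# has the same columns (and column names) the model was trained on
grid = pd.DataFrame(0.0, index=range(aa.size), columns=X.columns)
grid['lat'] = aa.ravel()
grid['lon'] = bb.ravel()

c = model.predict(grid)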
cc = c.reshape(aa.shape)

ax = df.plot(kind='scatter', c='target', x='lat', y='lon', cmap='bwr')
ax.contourf(aa, bb, cc, cmap='bwr', alpha=0.2)

## Exercise 


You now have a basic pipeline example. Iterate on the decision tree model and try to improve the score. Some things to try:

1. Change some of the initialization parameters of the decision tree and re-run the code.
    - Does the score change?
    - Does the decision boundary change?
2. Try another model, such as Logistic Regression, Random Forest, SVM or Naive Bayes, or any other classifier from [here](http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html).
3. What is the highest score you can get?
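
One possible way to organize these experiments is to keep the candidate models in a dictionary and score them in a loop. The sketch below is only a starting point: it uses synthetic data, and the choice of models and parameters is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the notebook, reuse X_train, X_test, y_train, y_test
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

# Arbitrary selection of candidates, for illustration only
candidates = {
    'tree_depth_3': DecisionTreeClassifier(max_depth=3, random_state=0),
    'tree_depth_6': DecisionTreeClassifier(max_depth=6, random_state=0),
    'logistic_regression': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each candidate on the training part and score it on the test part
scores = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_te, y_te)

# Print the candidates from best to worst test accuracy
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {acc:.3f}')
```

Comparing many models on the same test set makes the best test score slightly optimistic, so treat small differences with caution.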