```
Created: 2019-09-22
Author: Roy Wilds

Updates
2019-10-02: Added RF classifier
2019-11-17: Cleaned up for push to github
```

# About this notebook
This notebook captures the typical steps involved in building a classifier using pandas and sklearn.

It includes some data manipulation to create the classes to be used (the chosen dataset didn't have explicit labels).

# Data Loading
This uses the Amazon forest fires CSV file from Kaggle: https://www.kaggle.com/gustavomodelli/forest-fires-in-brazil

It's a nice dataset that has timestamps, categorical, and numerical features. Not overly complicated, but a nice starting point.


In [None]:
import pandas as pd

In [None]:
csvfile = '~/data/amazon.csv'
df = pd.read_csv(csvfile, quotechar='"', encoding = "ISO-8859-1") #, parse_dates=[4]
df.count()

**Note** the explicit `encoding` argument. The initial attempt to load the data file failed with a Unicode error (the data is from Brazil).

Running a file command points us to the correct encoding:
```
$ file data/amazon.csv 
data/amazon.csv: ISO-8859 text, with CRLF line terminators
```
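A minimal sketch of why the default decode fails. The word `Março` is only an illustrative piece of accented Portuguese text, not a row quoted from the file: bytes written as ISO-8859-1 are not valid UTF-8, so decoding them as UTF-8 raises an error.

```python
# Illustrative only: "Março" stands in for any accented text in the CSV.
raw = 'Março'.encode('ISO-8859-1')   # bytes as they would sit in a Latin-1 file

try:
    raw.decode('utf-8')              # the decode that fails without an encoding argument
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

decoded = raw.decode('ISO-8859-1')   # the encoding reported by `file`
print(utf8_ok, decoded)
```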

In [None]:
df.sample(5)

# Data Manipulation

In [None]:
df.dtypes

Let's properly dtype the various columns, and provide the option to translate the "month" values to English.
Note we could have handled the date column during the `read_csv()` step by adding the `parse_dates` arg.
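A minimal sketch of the `parse_dates` route, using a made-up two-row sample read from memory. The column names and values in `sample_csv` are assumptions for illustration; only the `parse_dates` argument itself is the point.

```python
import io
import pandas as pd

# Hypothetical sample: column names mirror the notebook's df, values are made up.
sample_csv = io.StringIO(
    "year,state,month,number,date\n"
    "1998,Acre,Janeiro,0.0,1998-01-01\n"
    "1999,Acre,Janeiro,0.0,1999-01-01\n"
)

# Ask read_csv to parse the 'date' column while loading.
sample = pd.read_csv(sample_csv, parse_dates=['date'])
print(sample.dtypes)
```

Columns can be given by name, as here, or by position.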

In [None]:
df['date'] = pd.to_datetime(df['date'])
df['state'] = df['state'].astype('category')
df['month'] = df['month'].astype('category')
df.dtypes

In [None]:
portugese_months = list(df['month'].cat.categories)
portugese_months.sort()
# We sort so that the explicit ordering of english months here is correct!
english_months = ['April','August','December','February','January','July','June','May','March','November','October','September']

In [None]:
translate_months = dict(zip(portugese_months,english_months))
translate_months

In [None]:
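# Rename the categories directly and assign the result back.
# Calling replace(..., inplace=True) on a column selection may not update df itself.
df['month'] = df['month'].cat.rename_categories(translate_months)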
df.sample(5)

## Create Classes
You may have noticed that we don't actually have any obvious labels! 

We could try predicting some of the categorical variables... For example, maybe you can predict the month based on the other columns (ignoring the `date` feature obviously).

But here I'm going to keep it simple with a 2-class problem: "Lots of Fires" (`high`) vs "Fewer Fires" (`low`). This will be determined by whether or not the number is more than 1 standard deviation above the mean for the particular `state, month` combination in the data.
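The rule above can be sketched without explicit loops using `groupby(...).transform(...)`, which returns the group statistic aligned to each row. This is a sketch on a made-up frame (`toy` and its values are assumptions, not rows from the dataset), not necessarily how the labels in this notebook were produced.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration: two (state, month) groups.
toy = pd.DataFrame({
    'state':  ['A'] * 5 + ['B'] * 5,
    'month':  ['January'] * 10,
    'number': [1, 1, 1, 1, 10, 5, 5, 5, 5, 5],
})

nstd = 1  # Number of standard deviations above the mean to count as 'high'
grouped = toy.groupby(['state', 'month'])['number']
mu = grouped.transform('mean')     # mean of each row's own group
sigma = grouped.transform('std')   # sample std (ddof=1) of each row's own group
toy['class'] = np.where(toy['number'] > mu + nstd * sigma, 'high', 'low')
print(toy)
```

When the grouping columns are categoricals, as in this notebook, passing `observed=True` to `groupby` may be needed to skip unused category combinations.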

In [None]:
# There's probably a cleverer pandas way to do this using groupby,
# but a loop over the (state, month) groups is easy to follow.
import numpy as np
states = list(df['state'].cat.categories)
months = list(df['month'].cat.categories)
df['class'] = 'low' # Start with everything 'low'
nstd = 1 # Number of standard deviations to be considered 'high'
for s in states:
    for m in months:
        in_group = (df['state'] == s) & (df['month'] == m)
        mu = df.loc[in_group, 'number'].mean()
        sigma = df.loc[in_group, 'number'].std()
        # Assign by boolean mask and column name, so this does not rely on the
        # column position of 'class' or on the index running from 0 to n-1.
        df.loc[in_group & (df['number'] > mu + nstd*sigma), 'class'] = 'high'

# Note: df[mask]['class'] = 'high' does not work, because it assigns to a temporary copy.

# Data Exploration
Always good to understand the raw data before jumping into modeling.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df.describe(include = 'all')

In [None]:
#Plot the number of entries per state
df.groupby(['state'])['number'].agg('count').plot(kind='bar')

In [None]:
#Plot the total number of fires per state
df.groupby(['state'])['number'].agg('sum').plot(kind='bar')

In [None]:
#Plot the total number of fires per state, colouring the numbers that were determined to be class="high"
#Not a terribly informative plot, but a useful plotting technique in general.
df.groupby(['state','class'])['number'].agg('sum').unstack().plot(kind='bar')

# Modeling
Going to build a model to predict the `class` from the `state` and `number` features.

Need to convert the `state` categorical feature into features that can be consumed by logistic regression (LR) or a random forest (RF).
An ordinal encoding doesn't make sense (there's no simple ordering of the states... maybe by latitude since that could be a sensible ordering for climate/weather, but skipping that for now).

Will use one-hot encoding.
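A minimal sketch of what one-hot encoding produces, on a made-up frame (the `toy` name and its values are placeholders for illustration): each distinct state becomes its own indicator column.

```python
import pandas as pd

# Made-up frame; the state names and numbers are placeholders.
toy = pd.DataFrame({'state': ['Acre', 'Bahia', 'Acre'], 'number': [3.0, 7.0, 1.0]})

# One indicator column per distinct state, named 'state_<value>'.
onehot = pd.get_dummies(toy['state'], prefix='state')
encoded = pd.concat([toy.drop(columns=['state']), onehot], axis=1)
print(encoded)
```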

In [None]:
# Build a new frame with one-hot encoded columns for the 'state' categorical feature.
# pd.concat returns a new DataFrame, so df itself is left untouched and no explicit copy is needed.
lrdf = pd.concat([df,pd.get_dummies(df['state'], prefix='state')],axis=1)

In [None]:
# Drop the columns we don't need.
lrdf = lrdf.drop(columns=['year', 'state', 'date', 'month'])
lrdf.sample(5)

In [None]:
# Separate the labels from the features without mutating lrdf.
labels = lrdf['class']
data = lrdf.drop(columns=['class'])

In [None]:
from sklearn.model_selection import train_test_split 

# Make train/test sets with a 30% test size.
data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.3)

In [None]:
labels_test.describe()

## Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logreg = LogisticRegression()

In [None]:
logreg.fit(data_train, labels_train)

In [None]:
pred_test = logreg.predict(data_test)

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
# Fix the label order so that class=='high' is the Positive Case (i.e. what we care about classifying)
m = confusion_matrix(labels_test, pred_test, labels=['high', 'low'])
tp, fn, fp, tn = m.ravel()
print(m)
print('Using class="high" as the positive prediction (i.e. a true prediction).')
precision = tp/(tp+fp)
recall = tp/(tp+fn)
print('Precision = {:.2f} and Recall = {:.2f}'.format(precision,recall))

### Varying Threshold
Rather than using the default 0.5 threshold for determining if a prediction is `high` or not, we can vary a threshold from 0 to 1 to control the precision/recall tradeoff of the classifier.
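A minimal sketch of thresholding on a toy problem. The data `X`, `y` and the helper `predict_with_threshold` are made up for illustration. It looks up the `high` column from `classes_` rather than assuming its position.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature data, made up for illustration: larger x tends to be 'high'.
X = np.array([[0.], [1.], [2.], [3.], [4.], [5.], [6.], [7.]])
y = np.array(['low', 'low', 'low', 'low', 'low', 'high', 'high', 'high'])

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)

# Look the column up by name instead of assuming 'high' is column 0.
high_col = list(clf.classes_).index('high')

def predict_with_threshold(probs, theta):
    """Label a row 'high' when its P(high) is at least theta (hypothetical helper)."""
    return np.where(probs[:, high_col] >= theta, 'high', 'low')

# A loose threshold flags at least as many rows as a strict one.
n_high_loose = (predict_with_threshold(probs, 0.1) == 'high').sum()
n_high_strict = (predict_with_threshold(probs, 0.9) == 'high').sum()
print(high_col, n_high_loose, n_high_strict)
```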

In [None]:
thetas = np.linspace(0.1,0.9,101)

In [None]:
pred_test_probs = logreg.predict_proba(data_test)

In [None]:
print(logreg.classes_)

In [None]:
#So 1st col is probability of class='high' and 2nd col is probability of class='low'
pred_test_probs[0:10,:]

In [None]:
precision, recall = [], []
for theta in thetas:
    pred_test = np.where(pred_test_probs[:,0] >= theta, 'high','low')
    # Fix the label order so that class=='high' is the Positive Case (i.e. what we care about classifying)
    m = confusion_matrix(labels_test, pred_test, labels=['high', 'low'])
    tp, fn, fp, tn = m.ravel()
    precision.append(tp/(tp+fp))
    recall.append(tp/(tp+fn))
logreg_thetas = pd.DataFrame({'threshold': thetas, 'precision': precision, 'recall': recall})

In [None]:
logreg_thetas.plot(x='threshold')

The above plot is typical of the precision/recall tradeoff as the threshold varies. Raising the threshold gives better precision (i.e. fewer false `high` predictions) at the cost of missing more of the true `high` cases (lower recall).
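As a cross-check on a hand-rolled sweep, sklearn's `precision_recall_curve` computes precision and recall at every distinct score. A minimal sketch on made-up labels and scores (`y_true` and `p_high` are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and P(high) scores for illustration.
y_true = np.array(['low', 'low', 'high', 'low', 'high', 'high'])
p_high = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.70])

# pos_label tells sklearn which string label is the positive class.
precision, recall, thresholds = precision_recall_curve(y_true, p_high, pos_label='high')

# precision and recall carry one extra trailing point (precision=1, recall=0).
for p, r, t in zip(precision, recall, thresholds):
    print('threshold >= {:.2f}: precision = {:.2f}, recall = {:.2f}'.format(t, p, r))
```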

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf = RandomForestClassifier(n_estimators=100, random_state=0)

In [None]:
rf.fit(data_train,labels_train)

In [None]:
pred_test = rf.predict(data_test)

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [None]:
# Fix the label order so that class=='high' is the Positive Case (i.e. what we care about classifying)
m = confusion_matrix(labels_test, pred_test, labels=['high', 'low'])
tp, fn, fp, tn = m.ravel()
print(m)
print('Using class="high" as the positive prediction (i.e. a true prediction).')
precision = tp/(tp+fp)
recall = tp/(tp+fn)
accuracy = accuracy_score(labels_test, pred_test)
print('Precision = {:.2f} and Recall = {:.2f}'.format(precision,recall))
print('Accuracy = {:.2f}'.format(accuracy))

This is a great example of why accuracy isn't a good metric when there's class imbalance.
In our case we have roughly a 10 to 1 class imbalance: the model gets class='low' right most of the time, but does much worse on class='high'.
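A worked illustration of the point, using made-up labels with a 10 to 1 imbalance and a "model" that always predicts `low`. Nothing here comes from the notebook's data. Accuracy looks respectable while recall on `high` is zero.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels with a 10-to-1 imbalance, and a "model" that always says 'low'.
y_true = ['low'] * 100 + ['high'] * 10
y_pred = ['low'] * 110

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, pos_label='high')
# zero_division=0 avoids a warning when there are no 'high' predictions at all.
precision = precision_score(y_true, y_pred, pos_label='high', zero_division=0)
print('Accuracy = {:.2f}, Precision = {:.2f}, Recall = {:.2f}'.format(accuracy, precision, recall))
```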

### Varying Threshold

In [None]:
thetas = np.linspace(0.1,0.9,101)

In [None]:
pred_test_probs = rf.predict_proba(data_test)

In [None]:
print(rf.classes_)

In [None]:
#If classes_ is not ['high', 'low'] then ensure you change [:,0] to the correct column slice to use!
precision, recall = [], []
for theta in thetas:
    pred_test = np.where(pred_test_probs[:,0] >= theta, 'high','low')
    # Fix the label order so that class=='high' is the Positive Case (i.e. what we care about classifying)
    m = confusion_matrix(labels_test, pred_test, labels=['high', 'low'])
    tp, fn, fp, tn = m.ravel()
    precision.append(tp/(tp+fp))
    recall.append(tp/(tp+fn))
rf_thetas = pd.DataFrame({'threshold': thetas, 'precision': precision, 'recall': recall})

In [None]:
rf_thetas.plot(x='threshold')

## Repeat RF but with k-fold Cross Validation
So far we have used a single train/test split with 30% held out for testing.
This section uses k-fold Cross Validation to get an estimate of the error on the model's performance metrics.

Also an opportunity to have some error bars on our precision/recall plots!
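With imbalanced classes, one option worth considering is `StratifiedKFold`, which keeps the class ratio roughly equal across folds. Plain `KFold` without shuffling takes contiguous blocks of rows, so any ordering in the file carries into the folds. A minimal sketch on made-up labels (`X` and `y` are placeholders, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels; the features are placeholders.
y = np.array(['low'] * 40 + ['high'] * 10)
X = np.arange(len(y)).reshape(-1, 1)

# Stratified folds need the labels as well as the data to compute the split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
high_per_fold = [int((y[test_idx] == 'high').sum()) for _, test_idx in skf.split(X, y)]
print(high_per_fold)
```

Each test fold receives the same share of `high` rows, which keeps per-fold precision and recall comparable.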

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
import numpy as np

In [None]:
# KFOLD just provides indexes, so we can just do it on the data (not labels) since they're the same size and share the same indices
nfolds = 5
kf = KFold(n_splits=nfolds)
kf.get_n_splits(data)

In [None]:
# We are going to loop thru the KFOLDS and also through the different thresholds.
# Yields a NTHRESHOLD rows x KFOLDS cols
nthresholds = 21
thetas = np.linspace(0.1, 0.9, nthresholds)
precision, recall = np.zeros(shape=(nthresholds, nfolds)), np.zeros(shape=(nthresholds, nfolds))

for ifold, (train_index, test_index) in enumerate(kf.split(data)):
    data_train, data_test = data.iloc[train_index], data.iloc[test_index]
    labels_train, labels_test = labels.iloc[train_index], labels.iloc[test_index]

    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(data_train,labels_train)
    pred_test_probs = rf.predict_proba(data_test)

    for itheta, theta in enumerate(thetas):
        pred_test = np.where(pred_test_probs[:,0] >= theta, 'high','low')
        # Fix the label order so that class=='high' is the Positive Case (i.e. what we care about classifying)
        m = confusion_matrix(labels_test, pred_test, labels=['high', 'low'])
        tp, fn, fp, tn = m.ravel()
        precision[itheta, ifold] = tp/(tp+fp)
        recall[itheta, ifold] = tp/(tp+fn)


In [None]:
precision_errors = np.std(precision, axis=1)
precision_line = np.mean(precision, axis=1)

recall_errors = np.std(recall, axis=1)
recall_line = np.mean(recall, axis=1)

In [None]:
plt.title('Precision and Recall - 1 Std Dev shown')
plt.xlabel('threshold')
plt.ylabel('Precision/Recall')
plt.errorbar(thetas, recall_line, yerr=recall_errors, c='red', capsize=3, label='recall')
plt.errorbar(thetas, precision_line, yerr=precision_errors, c='blue', capsize=3, label='precision')
plt.legend()