In [145]:
import pandas as pd
import numpy as np

In [74]:
#import processed data
df = pd.read_csv(r'E:\Projects\canadacharities\data\interim\charity_data.csv')

#import data dictionary
data_dic = pd.read_csv(r'E:\Projects\canadacharities\data\external\data_dictionary.csv')

## Dealing with Class Imbalance

For every charity that had its status revoked since 2018, there were approx. 57 charities that retained their status (1:57 ratio), representing a slight-to-moderate imbalance between classes. While the data appears to be biased in favour of charities that retained their status, this may not necessarily be a problem if this data represents what happens in reality.
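As a rough illustration of how the ratio above can be checked, the sketch below counts labels in a toy series. The label strings `'Registered'` and `'Revoked'` and the toy counts are assumptions standing in for `df['Status']`, not values read from the real data.

```python
import pandas as pd

# Toy stand-in for df['Status']; label strings and counts are assumptions
status = pd.Series(['Registered'] * 57 + ['Revoked'] * 1, name='Status')

counts = status.value_counts()

# Registered charities per revoked charity (the "1:57" figure)
ratio = counts['Registered'] / counts['Revoked']

# Share of all charities that were revoked
revoked_share = counts['Revoked'] / counts.sum()

print(f"registered per revoked: {ratio:.0f}")
print(f"revoked share: {revoked_share:.1%}")
```

A 1:57 ratio corresponds to 1 revoked charity in every 58, which is where the "fewer than 2 in 100" figure comes from.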

Addressing the following questions can help us better understand whether this class imbalance is at all problematic.
- <b>The most recent revocation date is Jan 22, 2020. On what date was this data shared by the government? Could an updated table of revoked charities be requested to ensure we're working with all revoked labels possible?</b> (Otherwise, potential mis-labeling of registered charities)
- <b>How do charities have their status revoked? Is this dataset complete or could some entries be missing due to under-reporting?</b> (Potentially causing mis-labeling of registered charities)
- <b>Do charities have to file their info each year? Could charities forget to do so without any repercussions?</b> (Potentially causing an undersampling of both registered and revoked charities)
- <b>Could charities that planned on withdrawing their charity status in 2018 be discouraged from filing charity information in the same year?</b> (Potentially causing an undersampling of revoked charities)

And one more question that can be answered with existing data:
- <b>Is there any 2018 filing information missing for charities that were revoked 2018-2020?</b> (Potentially causing undersampling of revoked charities)
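One possible way to answer that last question is a left join between the list of revoked charities and the 2018 filings. This is only a sketch: the two toy tables, the identifier column `BN`, and the `revocation_date` column are hypothetical names, since the real schema is not shown here.

```python
import pandas as pd

# Hypothetical tables; 'BN' and 'revocation_date' are assumed column names
revoked = pd.DataFrame({
    'BN': ['A1', 'B2', 'C3'],
    'revocation_date': pd.to_datetime(['2018-06-01', '2019-03-15', '2020-01-22']),
})
filings_2018 = pd.DataFrame({'BN': ['A1', 'C3', 'D4']})

# Left join keeps every revoked charity; the indicator column records
# whether a matching 2018 filing was found
merged = revoked.merge(filings_2018, on='BN', how='left', indicator=True)

# Revoked charities with no 2018 filing
missing = merged.loc[merged['_merge'] == 'left_only', 'BN'].tolist()
print(missing)
```

A long `missing` list would suggest that revoked charities are under-represented in the filing data.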

For now, let's assume that the data is representative and that, in reality, fewer than 2 in 100 registered charities will have their status revoked in the next 2 years. Under that assumption, we can go ahead and use the data for modelling.

We can always come back to these considerations later.

## Choosing our metrics

[Towards Data Science](https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28) provides a great visual of the confusion matrix and the metrics that can be derived from its values.

Below, we refer to <b>class 1</b> as <b>registered</b> and <b>class 2</b> as <b>revoked</b> charity status.
<img src='https://miro.medium.com/max/1400/1*Yslau43QN1pEU4jkGiq-pw.png' width='60%'>

Let's now determine what each box represents here and what we want to happen to it.

Maximize Orange <br />
True positive for "registered" class. Accurately predicting registered charities is always good.

Minimize Green <br />
False positive for "revoked" class. Erroneously predicting that a charity will "churn" (have its charity status revoked) could have consequences. Resources could be spent unnecessarily on helping that charity retain its status, or the charity could experience undue stress if it is notified of a risk that does not really exist.

Minimize Yellow <br />
False positive for "registered" class. One could argue that the consequences are greater when a charity has its status revoked without this having been predicted. A missed prediction could create a false sense of security and affect the charity's decision-making and, in turn, its charity status.

Maximize Blue <br />
True positive for "revoked" class. Accurately predicting revoked charities is what we ultimately want to do the most.
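The four boxes can be read off a confusion matrix directly. The sketch below uses scikit-learn's `confusion_matrix` on made-up predictions; the label strings and the toy `y_true` / `y_pred` values are assumptions for illustration, and the colour names follow the descriptions above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions, for illustration only
y_true = ['registered'] * 6 + ['revoked'] * 4
y_pred = ['registered'] * 5 + ['revoked'] + ['revoked'] * 2 + ['registered'] * 2

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=['registered', 'revoked'])

orange = cm[0, 0]  # actual registered, predicted registered (maximize)
green = cm[0, 1]   # actual registered, predicted revoked (minimize)
yellow = cm[1, 0]  # actual revoked, predicted registered (minimize)
blue = cm[1, 1]    # actual revoked, predicted revoked (maximize)

# Recall for "revoked": share of truly revoked charities that were caught
revoked_recall = recall_score(y_true, y_pred, pos_label='revoked')
# Precision for "revoked": share of "revoked" predictions that were correct
revoked_precision = precision_score(y_true, y_pred, pos_label='revoked')
print(cm, revoked_recall, revoked_precision)
```

Shrinking yellow raises recall for the revoked class, while shrinking green raises its precision, so these two metrics track the two errors discussed above.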

## Modelling

## Data setup

In [156]:
#extract only the columns we want
x_vars = data_dic.loc[(data_dic['predict']=='yes')&(data_dic['data_type']=='bool'),'column_name'].to_list()


In [159]:
#which columns are object? (need to be coded)
coding_cols = data_dic.loc[(data_dic['predict']=='yes')&(data_dic['data_type']=='object'), 'column_name'].to_list()

#since 4050 is a Y/N question, let's make it boolean
df.loc[df['4050']=='Y', '4050'] = 1
df.loc[df['4050']=='N', '4050'] = 0
coding_cols.remove('4050')

#for other object columns, create dummy variables
#for col in coding_cols:
#    new_cols = pd.get_dummies(df[col], prefix=col)
#    df = pd.concat([df, new_cols], axis=1)
    
#if flagged in data dictionary as predictive attribute OR generated using encoding, add to column list
#x_vars = x_vars + df.columns[df.columns.str.contains('1200 Program Area Code_')].to_list()
#x_vars = x_vars + df.columns[df.columns.str.contains('4020_')].to_list()

#extract datasets for modelling
x = df[x_vars]
y = df['Status']

In [161]:
#removing columns with more than 50% of data missing
columns = x.columns[x.isnull().mean() <= 0.5]
x = x[columns]

# Further reading

## Imbalanced datasets

[Handling imbalanced datasets in machine learning (Towards Data Science)](https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28)

## Random Forest Regressor

In [164]:
x.isna().sum()

1570     884
1600     828
2000     378
2100     390
2400     899
2700    1529
3200     978
3400     467
3900     742
4000     837
5800     750
5810     790
5820     773
5830    1420
dtype: int64

In [None]:
#imputing missing data
#see https://medium.com/airbnb-engineering/overcoming-missing-values-in-a-random-forest-classifier-7b1fc1fc03ba

In [None]:
#how does this deal with categorical columns? need to code?
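On the categorical question: scikit-learn's tree ensembles expect numeric input, so object columns generally need to be encoded first. A minimal sketch of one-hot encoding with `pd.get_dummies` follows; the toy frame and the category values `'A'`, `'B'`, `'C'` are assumptions, not real codes from column `4020`.

```python
import pandas as pd

# Toy frame; the category values in '4020' are made up for illustration
toy = pd.DataFrame({'4020': ['A', 'B', 'A', 'C'], '1570': [1, 0, 1, 1]})

# One indicator column per category; untouched columns are kept as-is
encoded = pd.get_dummies(toy, columns=['4020'], prefix='4020')
print(encoded.columns.tolist())
```

The generated columns share the `4020_` prefix, which makes them easy to collect into the feature list afterwards.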

In [163]:
from sklearn.ensemble import RandomForestRegressor

# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)

# Train the model on training data
rf.fit(x, y)

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
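The error above comes from the missing values counted earlier. One possible way around it, sketched below on toy data, is to impute before fitting. Two things here are assumptions rather than the notebook's method: the sketch uses simple most-frequent imputation (not the approaches in the Airbnb article), and it swaps in `RandomForestClassifier`, assuming `Status` is a categorical label. The toy frame, label strings, and `class_weight='balanced'` are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)

# Toy 0/1 features with roughly 20% of values knocked out
x_toy = pd.DataFrame(
    rng.integers(0, 2, size=(200, 3)).astype(float),
    columns=['1570', '1600', '2000'],
)
x_toy = x_toy.mask(rng.random(x_toy.shape) < 0.2)

# Toy imbalanced labels (label strings are assumptions)
y_toy = pd.Series(np.where(rng.random(200) < 0.1, 'Revoked', 'Registered'))

# Impute each column with its most frequent value, then fit the forest
model = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42),
)
model.fit(x_toy, y_toy)
pred = model.predict(x_toy)
```

Keeping the imputer inside the pipeline means the fill values are learned only from the data passed to `fit`, which matters once a train/test split is introduced.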

## Linear SVC

[scikit-learn user guide](https://scikit-learn.org/stable/modules/svm.html#svm-classification)

[sklearn.svm.LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)

In [None]:
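from sklearn import svm

#reuse the feature matrix x from the data setup above
#(x still contains NaNs, which must be imputed before fitting)
y = df['Status']

#LinearSVC, to match the section heading and the linked docs
clf = svm.LinearSVC()
clf.fit(x, y)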

In [13]:
df['5750'].value_counts()

 0.0           110
 3000000.0       2
 245000.0        1
 650.0           1
 472017.0        1
 29668.0         1
 475000.0        1
 60000.0         1
 30000.0         1
 87500.0         1
 6566.0          1
 1572536.0       1
 16368.0         1
 37540.0         1
-9454.0          1
 426.0           1
 92766.0         1
 153135.0        1
 20000000.0      1
 223405.0        1
 10000.0         1
 3328.0          1
Name: 5750, dtype: int64
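Dollar-amount fields like `5750` range from 0 to 20,000,000, and linear SVMs are sensitive to feature scale. A hedged sketch of one way to handle this is to impute, standardize, and then fit `LinearSVC` in a single pipeline. The toy data, the label strings, the median imputation, and the `class_weight` and `max_iter` settings are all assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Toy stand-in: one wide-ranging amount column and one 0/1 flag
amounts = rng.choice([0.0, 650.0, 60000.0, 3000000.0, 20000000.0], size=100)
flags = rng.integers(0, 2, size=100).astype(float)
x_toy = np.column_stack([amounts, flags])
x_toy[::10, 0] = np.nan  # simulate missing amounts

# Toy labels tied to the flag, purely so there is something to learn
y_toy = np.where(flags == 1, 'Revoked', 'Registered')

clf = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),
    LinearSVC(class_weight='balanced', max_iter=5000),
)
clf.fit(x_toy, y_toy)

# Output of the preprocessing steps only (everything except the classifier)
scaled = clf[:-1].transform(x_toy)
```

After scaling, both columns have mean 0 and unit variance, so the amount column no longer dominates the flag.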

## KNeighbours Classifier

[scikit-learn user guide](https://scikit-learn.org/stable/modules/neighbors.html#classification)

[sklearn.neighbors.KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
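As a starting point, a k-nearest-neighbours model could be evaluated with a stratified split, so that the rare "revoked" class appears in both the training and test sets. The sketch below runs on synthetic data; the label strings, feature values, `n_neighbors=5`, and the 25% test size are assumptions, not settings taken from this project.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic imbalanced data: 20 revoked, 180 registered
y_toy = np.array(['Revoked'] * 20 + ['Registered'] * 180)
x_toy = rng.normal(size=(200, 2))
x_toy[:20] += 2.0  # shift the revoked rows so the classes are separable

# stratify keeps the class proportions the same in both splits
x_train, x_test, y_train, y_test = train_test_split(
    x_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

# KNN is distance-based, so features are standardized first
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(x_train, y_train)

# Share of truly revoked test charities that the model catches
revoked_recall = recall_score(y_test, knn.predict(x_test), pos_label='Revoked')
print(revoked_recall)
```

Recall on the revoked class is reported here because catching revoked charities is the outcome the metrics section prioritizes.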