# Your Title Here

**Prabina Pokharel, Atharva Kulkarni**

## Summary of Findings

### Introduction
We've decided to create a binary classification model for this project. Specifically, we are going to try to predict whether a given stock transaction resulted in capital gains over $200 or not.

Response variable: `cap_gains_over_200_usd`
Validation parameter: Precision

### Baseline Model
TODO

### Final Model
TODO

### Fairness Analysis
TODO

## Code

In [106]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

### Framing the Problem

Before we frame our problem, here's an overview of the cleaning steps and feature engineering from Project 3:

* Converted `disclosure_date` and `transaction_date` to pandas datetime objects
  * This let us subtract the two columns to create a new `non_disclosure_period(days)` column.
* Removed 'Hon.' from representatives' names
  * This makes the names easier to read.
* Added a `state` column, derived from `district`
  * This was necessary to answer our main question.
  * It is also useful for computing aggregates by state.
* Added an `amount_cleaned` column, derived from `amount`
  * This was needed to do any arithmetic on the transaction value.
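
The first three cleaning steps can be sketched roughly as follows. This is a minimal illustration on two made-up rows, not the Project 3 code itself: the `raw` frame and the assumption that the state is the first two characters of `district` are ours.

```python
import pandas as pd

# Toy rows standing in for the raw transactions (illustrative values only).
raw = pd.DataFrame({
    "disclosure_date": ["2021-10-04", "2020-06-10"],
    "transaction_date": ["2021-09-27", "2020-04-09"],
    "representative": ["Hon. Virginia Foxx", "Hon. Ed Perlmutter"],
    "district": ["NC05", "CO07"],
})

# Parse both date columns, then subtract them to get the delay in days.
raw["disclosure_date"] = pd.to_datetime(raw["disclosure_date"], errors="coerce")
raw["transaction_date"] = pd.to_datetime(raw["transaction_date"], errors="coerce")
raw["non_disclosure_period(days)"] = (
    raw["disclosure_date"] - raw["transaction_date"]
).dt.days

# Drop the 'Hon.' prefix from names.
raw["representative"] = raw["representative"].str.replace(r"^Hon\.\s*", "", regex=True)

# Assumption: the state is the first two characters of the district code.
raw["state"] = raw["district"].str[:2]

print(raw[["representative", "state", "non_disclosure_period(days)"]])
```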


We've decided to create a binary classification model for this project. Specifically, we are going to try to predict whether a given stock transaction resulted in capital gains over $200 or not.

Response variable: `cap_gains_over_200_usd`

Figuring out which metric to use is a bit trickier. First, we need to check whether our response variable is imbalanced:

In [56]:
df = pd.read_csv('data/cleaned_transactions.csv')
df['cap_gains_over_200_usd'].value_counts()

False    13238
True       964
Name: cap_gains_over_200_usd, dtype: int64

So unfortunately our data is far from a 50/50 split: only about 7% of transactions are `True`. With imbalanced data, accuracy is misleading, so we choose between recall and precision. Since we care about false positives in our predictions (i.e., we don't want to classify every trade as positive and end up with many false positives), we choose **precision** as our validation parameter.

Here's why we're *not* choosing...

- Accuracy: since `False` values account for roughly 93% of the response variable, we could achieve 93% accuracy by predicting `False` for every transaction.
- Recall: although it seems similar to precision, recall only takes into account true positives and false negatives, so it ignores false positives. We could also have chosen recall as our validation parameter, but decided against it because we place more importance on positive predictions actually being positive.
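
To make the accuracy argument concrete, here is a small sketch that scores a hypothetical "always predict `False`" classifier using the class counts from the `value_counts()` output above. The variable names are ours.

```python
# Class counts taken from the value_counts() output above.
n_false, n_true = 13238, 964

# Confusion-matrix cells for a classifier that predicts False for every trade.
tp, fp = 0, 0
tn, fn = n_false, n_true

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
# Precision is TP / (TP + FP); with no positive predictions it is undefined (0 / 0).
precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}, precision={precision}")
```

The high accuracy comes entirely from the majority class, which is why accuracy alone says little about how well positives are identified.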

We won't have any 'time of prediction' issues because every column in our dataset, and every extra feature we engineer, can be known before `cap_gains_over_200_usd` is found out.

### Baseline Model

For our baseline model, we're using two variables, `state` and `type`, to predict the response variable.

- `state`: we believe state matters in predicting capital gains over $200 because our Project 3 permutation test indicated that representatives from some states are more likely than others to have `cap_gains_over_200_usd` equal to `True`.
- `type`: if a transaction is not some sort of sale, then `cap_gains_over_200_usd` cannot be `True`.

Let's split up the data into X and y, and then conduct a train-test-split.

In [92]:
X = df[['state', 'type']]
y = df['cap_gains_over_200_usd']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
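
Because the response is so imbalanced, a plain random split can leave the test set with noticeably more or fewer positives than the training set, and the reported precision will change from run to run. One option, sketched here on synthetic data rather than the transactions dataset, is to pass `stratify` and `random_state` to `train_test_split`. The `*_demo` names and the 7% positive rate are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same imbalance as the real response (7% True).
rng = np.random.default_rng(0)
y_demo = np.array([True] * 70 + [False] * 930)
X_demo = rng.normal(size=(1000, 2))

# stratify keeps the True/False ratio the same in both splits;
# random_state makes the split repeatable across runs.
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(f"train positive rate: {ytr.mean():.3f}, test positive rate: {yte.mean():.3f}")
```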

Our `base_preproc` is a column transformer that one-hot encodes the `state` and `type` columns.

In [93]:
base_preproc = ColumnTransformer(
    transformers = [
        ('ohe', OneHotEncoder(), ['state', 'type'])
    ],
    remainder='drop'
)
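
One caveat with a default `OneHotEncoder()`: if the test split contains a category that never appeared in the training split (for example, a state with very few trades), `transform` raises an error. The sketch below, on made-up toy frames, shows the difference that `handle_unknown='ignore'` makes. Whether this situation occurs in our data depends on the random split.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames (illustrative only); 'WY' appears only in the test frame.
train = pd.DataFrame({"state": ["CA", "NC", "TX"],
                      "type": ["purchase", "sale_full", "purchase"]})
test = pd.DataFrame({"state": ["WY"], "type": ["purchase"]})

# Default behaviour: unseen categories raise a ValueError at transform time.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    strict_failed = False
except ValueError:
    strict_failed = True

# handle_unknown='ignore' encodes an unseen category as all zeros for that feature.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = lenient.transform(test).toarray()

print(strict_failed, encoded)
```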

We then put the column transformer into a pipeline, paired with a `RandomForestClassifier` object. Note that the `RandomForestClassifier` has no parameters passed in, so it will use default values. We will conduct GridSearchCV for the best combination of hyperparameters in the next section.

In [138]:
base_pl = Pipeline([
    ('preprocessor', base_preproc),
    ('dec-tree', RandomForestClassifier())
])

# fit the pipeline
base_pl.fit(X_train, y_train);

In [139]:
# predict response variable values with the fitted pipeline
y_pred_base = base_pl.predict(X_test)

# print(f'Testing accuracy: {base_pl.score(X_test, y_test):.2f}')
print(f'Precision: {metrics.precision_score(y_test, y_pred_base):.2f}')
# print(f'Recall: {metrics.recall_score(y_test, y_pred_base):.2f}')

Precision: 0.80


### Final Model

For our final model, we're using 5 variables to predict the response variable:
1. `state` (from baseline)
2. `type` (from baseline)
3. `owner` 
4. `amount_cleaned` (a feature we created)
5. `non_disclosure_period(days)` (a feature we created)

How we feature engineered the 2 new columns:

- `amount_cleaned`
  - We created a dictionary mapping each range in `amount` to its average value, and then replaced each range with that average.
- `non_disclosure_period(days)`
  - We converted `disclosure_date` and `transaction_date` to pandas datetime objects, then subtracted the two columns to create the new feature.
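
As an illustration of the `amount_cleaned` idea, the sketch below builds such a range-to-average dictionary by parsing the range strings. This is an assumption about how the mapping could be produced, not our actual dictionary, and the helper `range_midpoint` is a hypothetical name. Note that the `df` output in the Fairness Analysis section shows 375000.5 for `$500,001 - $1,000,000`, so the original mapping may not be a pure midpoint for every range.

```python
def range_midpoint(amount: str) -> float:
    """Return the midpoint of a '$low - $high' string such as '$1,001 - $15,000'."""
    low, high = (
        float(part.replace("$", "").replace(",", ""))
        for part in amount.split("-")
    )
    return (low + high) / 2


# Build a lookup dictionary for the ranges that appear in the data preview.
ranges = ["$1,001 - $15,000", "$15,001 - $50,000", "$100,001 - $250,000"]
amount_midpoints = {r: range_midpoint(r) for r in ranges}

print(amount_midpoints)
```

With pandas, a dictionary like this could then be applied with `df['amount'].map(amount_midpoints)`.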


Now let's deal with missing values in the 5 variables we're using for prediction, and make sure all column types are the types that we want.

In [75]:
# '--' does not signify anything so replace with nan
df['owner'] = df['owner'].replace({'--': np.nan})

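# missing values (np.nan) are not accepted by our sklearn pipeline steps, so we
# replace them with their own categorical value
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df = df.replace({np.nan: 'missing'})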

# since we imported from csv, make sure 'disclosure_date' and 'transaction_date'
# are of type date time
df['disclosure_date'] = pd.to_datetime(df['disclosure_date'], errors = 'coerce')
df['transaction_date'] = pd.to_datetime(df['transaction_date'], errors = 'coerce')

# we are only going to include rows in which the 'non_disclosure_period(days)'
# values are not missing
df = df[df['non_disclosure_period(days)']!='missing']
df.head()

| Unnamed: 0 | disclosure_year | disclosure_date | transaction_date | owner | ticker | asset_description | type | amount | representative | district | ptr_link | cap_gains_over_200_usd | state | amount_cleaned | non_disclosure_period(days) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2021 | 2021-10-04 | 2021-09-27 | joint | BP | BP plc | purchase | $1,001 - $15,000 | Virginia Foxx | NC05 | https://disclosures-clerk.house.gov/public_dis... | False | NC | 8000.5 | 7.0 |
| 1 | 2021 | 2021-10-04 | 2021-09-13 | joint | XOM | Exxon Mobil Corporation | purchase | $1,001 - $15,000 | Virginia Foxx | NC05 | https://disclosures-clerk.house.gov/public_dis... | False | NC | 8000.5 | 21.0 |
| 2 | 2021 | 2021-10-04 | 2021-09-10 | joint | ILPT | Industrial Logistics Properties Trust - Common... | purchase | $15,001 - $50,000 | Virginia Foxx | NC05 | https://disclosures-clerk.house.gov/public_dis... | False | NC | 32500.5 | 24.0 |
| 3 | 2021 | 2021-10-04 | 2021-09-28 | joint | PM | Phillip Morris International Inc | purchase | $15,001 - $50,000 | Virginia Foxx | NC05 | https://disclosures-clerk.house.gov/public_dis... | False | NC | 32500.5 | 6.0 |
| 4 | 2021 | 2021-10-04 | 2021-09-17 | self | BLK | BlackRock Inc | sale_partial | $1,001 - $15,000 | Alan S. Lowenthal | CA47 | https://disclosures-clerk.house.gov/public_dis... | False | CA | 8000.5 | 17.0 |


Let's split up the data into X and y, and then conduct a train-test-split.

In [147]:
X = df[['owner', 'type', 'state', 'amount_cleaned', 'non_disclosure_period(days)']]
y = df['cap_gains_over_200_usd']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [148]:
final_preproc = ColumnTransformer(
    transformers = [
        ('ohe', OneHotEncoder(drop='first'), ['owner', 'type', 'state']),
        ('std', StandardScaler(), ['amount_cleaned', 'non_disclosure_period(days)'])
    ],
    remainder='drop'
)

In [149]:
final_pl = Pipeline([
    ('preprocessor', final_preproc),
    # these hyperparameter values are from conducting GridSearchCV (see below)
    ('dec-tree', RandomForestClassifier(
        criterion='entropy', max_depth=5, min_samples_split=5, n_estimators=70
    ))
])
final_pl.fit(X_train, y_train);

In [150]:
y_pred_final = final_pl.predict(X_test)
# print(f'Testing accuracy: {final_pl.score(X_test, y_test):.2f}')
print(f'Precision: {metrics.precision_score(y_test, y_pred_final):.2f}')
# print(f'Recall: {metrics.recall_score(y_test, y_pred_final):.2f}')

Precision: 0.94


Let's conduct GridSearchCV to see which combination of hyperparameters is best.
This section appears after instantiating and fitting the pipeline, but we ran
GridSearchCV before fitting `final_pl`.

In [111]:
hyperparameters = {
    'dec-tree__n_estimators': np.arange(10, 100, 20),
    'dec-tree__criterion': ['gini', 'entropy'],
    'dec-tree__max_depth': [2,3,4,5],
    'dec-tree__min_samples_split': [2, 5]  # integer values must be at least 2
}

In [112]:
searcher = GridSearchCV(final_pl, param_grid = hyperparameters, cv=5, scoring='precision')
searcher.fit(X_train, y_train)
searcher.best_params_

{'dec-tree__criterion': 'entropy',
 'dec-tree__max_depth': 5,
 'dec-tree__min_samples_split': 5,
 'dec-tree__n_estimators': 30}
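
Copying the best parameters into the pipeline by hand is easy to get out of sync: the pipeline above uses `n_estimators=70` while this search output reports 30, which may simply reflect a different run. A possible alternative, sketched below on synthetic data with a deliberately tiny grid, is to reuse `best_estimator_`, which `GridSearchCV` refits on the full training data when `refit=True` (the default). All `demo_*` names and the data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the transactions dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, weights=[0.8], random_state=0
)

demo_pl = Pipeline([
    ("std", StandardScaler()),
    ("forest", RandomForestClassifier(random_state=0)),
])
demo_grid = {"forest__n_estimators": [10, 30], "forest__max_depth": [2, 5]}

demo_search = GridSearchCV(demo_pl, param_grid=demo_grid, cv=3, scoring="precision")
demo_search.fit(X_demo, y_demo)

# best_estimator_ is the whole pipeline, already refit with the winning parameters,
# so there is no need to retype them into a new RandomForestClassifier.
best_pl = demo_search.best_estimator_
best_params = demo_search.best_params_
demo_preds = best_pl.predict(X_demo)

print(best_params)
```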

### Fairness Analysis

We're going to split our dataset into representatives from California and representatives not from California.

In [37]:
N = 1_000
ca_df = df.loc[df['state'] == 'CA']
not_ca_df = df.loc[df['state'] != 'CA']
# candidate observed test statistic: difference in precision between the 2 groups
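
One way the comparison above could be carried out is a permutation test on the absolute difference in precision between the two groups, with `N = 1_000` shuffles of the group labels. The sketch below is only an outline under our own assumptions: the function names are hypothetical, and the random arrays stand in for the true labels, the predictions of `final_pl`, and the "is from California" mask.

```python
import numpy as np


def precision(y_true, y_pred):
    """Precision = TP / (TP + FP); nan when nothing is predicted positive."""
    predicted_pos = y_pred.sum()
    if predicted_pos == 0:
        return np.nan
    return (y_true & y_pred).sum() / predicted_pos


def precision_diff_pvalue(y_true, y_pred, in_group, n_reps=1_000, seed=0):
    """Permutation test on the absolute difference in precision between groups."""
    rng = np.random.default_rng(seed)

    def abs_diff(mask):
        return abs(precision(y_true[mask], y_pred[mask])
                   - precision(y_true[~mask], y_pred[~mask]))

    observed = abs_diff(in_group)
    # Shuffle group membership to simulate "group does not matter".
    simulated = np.array([abs_diff(rng.permutation(in_group)) for _ in range(n_reps)])
    # Shuffles with an undefined (nan) statistic count as not extreme.
    return observed, np.mean(simulated >= observed)


# Random stand-in data (boolean arrays), not real model output.
demo_rng = np.random.default_rng(1)
y_true_demo = demo_rng.random(500) < 0.3
y_pred_demo = demo_rng.random(500) < 0.3
is_ca_demo = demo_rng.random(500) < 0.4

observed, p_value = precision_diff_pvalue(y_true_demo, y_pred_demo, is_ca_demo)
print(f"observed difference: {observed:.3f}, p-value: {p_value:.3f}")
```

With the real data, the mask would come from something like `X_test['state'] == 'CA'`, and a small p-value would suggest that precision differs between the two groups by more than chance alone.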

In [34]:
df

| Unnamed: 0.1 | Unnamed: 0 | disclosure_year | disclosure_date | transaction_date | owner | ticker | asset_description | type | amount | representative | district | ptr_link | cap_gains_over_200_usd | state | amount_cleaned |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 2021 | 2021-10-04 | 2021-09-27 | joint | BP | BP plc | purchase | $1,001 - $15,000 | Virginia Foxx | NC05 | https://disclosures-clerk.house.gov/public_dis... | False | NC | 8000.5 |
| 1 | 1 | 2021 | 2021-10-04 | 2021-09-13 | joint | XOM | Exxon Mobil Corporation | purchase | $1,001 - $15,000 | Virginia Foxx | NC05 | https://disclosures-clerk.house.gov/public_dis... | False | NC | 8000.5 |
| 2 | 2 | 2021 | 2021-10-04 | 2021-09-10 | joint | ILPT | Industrial Logistics Properties Trust - Common... | purchase | $15,001 - $50,000 | Virginia Foxx | NC05 | https://disclosures-clerk.house.gov/public_dis... | False | NC | 32500.5 |
| 3 | 3 | 2021 | 2021-10-04 | 2021-09-28 | joint | PM | Phillip Morris International Inc | purchase | $15,001 - $50,000 | Virginia Foxx | NC05 | https://disclosures-clerk.house.gov/public_dis... | False | NC | 32500.5 |
| 4 | 4 | 2021 | 2021-10-04 | 2021-09-17 | self | BLK | BlackRock Inc | sale_partial | $1,001 - $15,000 | Alan S. Lowenthal | CA47 | https://disclosures-clerk.house.gov/public_dis... | False | CA | 8000.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14197 | 14197 | 2020 | 2020-06-10 | 2020-04-09 | -- | SWK | Stanley Black & Decker, Inc. | sale_partial | $1,001 - $15,000 | Ed Perlmutter | CO07 | https://disclosures-clerk.house.gov/public_dis... | False | CO | 8000.5 |
| 14198 | 14198 | 2020 | 2020-06-10 | 2020-04-09 | -- | USB | U.S. Bancorp | sale_partial | $1,001 - $15,000 | Ed Perlmutter | CO07 | https://disclosures-clerk.house.gov/public_dis... | False | CO | 8000.5 |
| 14199 | 14199 | 2020 | 2020-06-10 | 2020-03-13 |  | BMY | Bristol-Myers Squibb Company | sale_full | $100,001 - $250,000 | Nicholas Van Taylor | TX03 | https://disclosures-clerk.house.gov/public_dis... | False | TX | 175000.5 |
| 14200 | 14200 | 2020 | 2020-06-10 | 2020-03-13 |  | LLY | Eli Lilly and Company | sale_full | $500,001 - $1,000,000 | Nicholas Van Taylor | TX03 | https://disclosures-clerk.house.gov/public_dis... | False | TX | 375000.5 |


In [73]:
df.dtypes

disclosure_year                  int64
disclosure_date                 object
transaction_date                object
owner                           object
ticker                          object
asset_description               object
type                            object
amount                          object
representative                  object
district                        object
ptr_link                        object
cap_gains_over_200_usd            bool
state                           object
amount_cleaned                 float64
non_disclosure_period(days)    float64
dtype: object