# Stock Trades by Members of the US House of Representatives

## Problem: Can we predict what type of trade the transaction is?
- This is a Classification problem
- It will be evaluated using f1-score

# Summary of Findings


### Introduction
The goal is to predict the type of each transaction (purchase, sale_full, sale_partial, or exchange). Because the target is a categorical variable, this is a ***CLASSIFICATION*** problem, and the target variable is 'type'. I chose the F1-score as the evaluation metric because it holds up better than accuracy on class-imbalanced data: a model can score well on accuracy while missing the rare classes entirely. Here the majority of transactions were purchases, so accuracy would not have been the best choice, and the F1-score offers a good trade-off between precision and recall.
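
To make the accuracy-versus-F1 argument concrete, here is a minimal sketch in plain Python. The ten labels are made up for illustration (they are not drawn from the data set), and the helper names `accuracy` and `f1_for_class` are hypothetical. A "model" that always predicts the majority class looks fine on accuracy but poor on macro-averaged F1.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 for one class, treating that class as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives, so F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up toy labels: 8 purchases and 2 full sales
y_true = ['purchase'] * 8 + ['sale_full'] * 2
# A "model" that always predicts the majority class
y_pred = ['purchase'] * 10

labels = sorted(set(y_true))
per_class_f1 = {label: f1_for_class(y_true, y_pred, label) for label in labels}
macro_f1 = sum(per_class_f1.values()) / len(labels)

print(accuracy(y_true, y_pred))   # 0.8
print(per_class_f1['sale_full'])  # 0.0
print(round(macro_f1, 3))         # 0.444
```

Note that the notebook itself uses `average='weighted'`, which weights each class by its support and is therefore less harsh on rare classes than the macro average shown here.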

### Baseline Model
The Baseline Model used 4 features, all categorical, since this data set is entirely categorical. 'amount' is ordinal, while 'owner', 'ticker', and 'disclosure_year' are nominal, so 'amount' was ordinally encoded and the rest were one-hot encoded. The F1-score was ~52% in the run recorded in the Code section (0.518). This is pretty low, and a likely reason is the choice of features: a more informative set of features could have produced a higher score.
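
One detail worth checking with ordinal encoding is the order of the categories. scikit-learn's `OrdinalEncoder`, with its default `categories='auto'`, orders categories by sorting the unique values, which for strings means alphabetical order. The sketch below, in plain Python, shows how that differs from the dollar order for three of the bin labels that appear in this notebook. The helper `lower_bound` is a hypothetical name, and the real data set may contain more bins than the three used here.

```python
def lower_bound(amount_label):
    """Parse the dollar lower bound from a label such as '$1,001 - $15,000'."""
    first = amount_label.split(' - ')[0]
    return int(first.replace('$', '').replace(',', ''))


# Three bin labels that appear in the notebook (the full data set may have more)
bins = ['$15,001 - $50,000', '$1,000,001 - $5,000,000', '$1,001 - $15,000']

alphabetical = sorted(bins)             # string order
by_value = sorted(bins, key=lower_bound)  # dollar order

# Integer codes that respect the dollar order
amount_rank = {label: rank for rank, label in enumerate(by_value)}

print(alphabetical[0])  # $1,000,001 - $5,000,000
print(by_value[0])      # $1,001 - $15,000
print(amount_rank['$1,000,001 - $5,000,000'])  # 2
```

Passing a list like `by_value` through the encoder's `categories` argument (as `categories=[by_value]`) is one way to make the integer codes follow the dollar amounts rather than the alphabet.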

### Final Model
I added a feature for the state each representative represents (engineered from the 'district' column) and features for the transaction year and month (engineered from the 'transaction_date' column). State is useful because representatives from certain states might be more likely to buy than to sell. A raw date column isn't very useful on its own because nearly every value is distinct, so it forms no groups; extracting the year and month gives coarser granularity in which trends can show up. Model selection used GridSearchCV with 5-fold cross-validation over the decision tree's maximum depth; the best value was a depth of 202.
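
The feature engineering described above can be sketched for a single row in plain Python. This is an illustrative variant, not the notebook's code: the function name `engineer_features` is hypothetical, and returning `None` for missing or malformed values is an assumption about how one might want to handle bad rows.

```python
from datetime import datetime


def engineer_features(district, transaction_date):
    """Derive state, year and month from one row's raw string values."""
    # The first two characters of a district code such as 'NC05' are the state
    if isinstance(district, str) and len(district) >= 2:
        state = district[:2]
    else:
        state = None

    # Parse ISO-style dates; anything else (missing or malformed) becomes None
    try:
        parsed = datetime.strptime(transaction_date, '%Y-%m-%d')
        year, month = parsed.year, parsed.month
    except (TypeError, ValueError):
        year, month = None, None

    return {'state': state, 'transaction_year': year, 'transaction_month': month}


print(engineer_features('NC05', '2021-09-27'))
# {'state': 'NC', 'transaction_year': 2021, 'transaction_month': 9}
print(engineer_features(None, 'not a date'))
# {'state': None, 'transaction_year': None, 'transaction_month': None}
```

Parsing the date instead of slicing the string means that a malformed value is flagged rather than silently turned into a spurious category.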

### Fairness Evaluation
With a p-value of 0.8, I ***fail to reject*** the null hypothesis that the model has equal accuracy for transactions made in 2020 and 2021. The alternative hypothesis was that the model performs better for transactions in 2020, but the evidence did not support this. I used accuracy parity as my fairness criterion because it tests whether the proportion of correctly classified predictions is equal across groups, which matches my fairness question.
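
For comparison, here is a minimal sketch of a permutation test for accuracy parity in plain Python. It is not the notebook's implementation: it assumes a single already-trained model whose test predictions have been marked correct (1) or incorrect (0), and it shuffles only the group labels, whereas the notebook's `test_stat` retrains a model on each group. The data are made up, and the function names are hypothetical.

```python
import random


def accuracy_gap(correct, groups, a, b):
    """Accuracy within group b minus accuracy within group a."""
    def group_accuracy(g):
        hits = [c for c, grp in zip(correct, groups) if grp == g]
        return sum(hits) / len(hits)
    return group_accuracy(b) - group_accuracy(a)


def permutation_pvalue(correct, groups, a, b, n_permutations=1000, seed=0):
    """One-sided p-value: how often a shuffled gap is at least the observed gap."""
    rng = random.Random(seed)
    observed = accuracy_gap(correct, groups, a, b)
    shuffled = list(groups)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any link between group and correctness
        if accuracy_gap(correct, shuffled, a, b) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_permutations


# Made-up example: 50 test rows per year, 30 of them classified correctly in each
groups = ['2020'] * 50 + ['2021'] * 50
correct = [1] * 30 + [0] * 20 + [1] * 30 + [0] * 20

observed, p_value = permutation_pvalue(correct, groups, a='2021', b='2020')
print(observed)  # 0.0
print(p_value)   # a large p-value, so no evidence against equal accuracy
```

Shuffling labels against fixed predictions keeps the model constant across permutations, so the test statistic reflects only how accuracy varies with the group label.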

# Code

In [145]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns

from sklearn.preprocessing import Binarizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

In [135]:
df = pd.read_csv('all_transactions.csv')
df.head()

Unnamed: 0,disclosure_year,disclosure_date,transaction_date,owner,ticker,asset_description,type,amount,representative,district,ptr_link,cap_gains_over_200_usd
0,2021,10/04/2021,2021-09-27,joint,BP,BP plc,purchase,"$1,001 - $15,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False
1,2021,10/04/2021,2021-09-13,joint,XOM,Exxon Mobil Corporation,purchase,"$1,001 - $15,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False
2,2021,10/04/2021,2021-09-10,joint,ILPT,Industrial Logistics Properties Trust - Common...,purchase,"$15,001 - $50,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False
3,2021,10/04/2021,2021-09-28,joint,PM,Phillip Morris International Inc,purchase,"$15,001 - $50,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False
4,2021,10/04/2021,2021-09-17,self,BLK,BlackRock Inc,sale_partial,"$1,001 - $15,000",Hon. Alan S. Lowenthal,CA47,https://disclosures-clerk.house.gov/public_dis...,False


### Cleaning

In [137]:
def convert_to_null(cell):
    if cell == '--':
        return np.nan
    else:
        return cell

# loop through columns that are str and apply helper function
for column in df.columns:
    if df[column].dtype == object:
        df[column] = df[column].apply(convert_to_null)

In [138]:
# Clean and restructure binning for ordinal encoding
data = df.copy()
data.replace(to_replace='$1,001 -', value='$1,001 - $15,000', inplace=True)
data.replace(to_replace='$1,000 - $15,000', value='$1,001 - $15,000', inplace=True)
data.replace(to_replace='$1,000,000 +', value='$1,000,001 - $5,000,000', inplace=True)
data.replace(to_replace='$15,000 - $50,000', value='$15,001 - $50,000', inplace=True)

### Baseline Model

In [139]:
# split target feature from training features
X = data.drop('type', axis=1)
# X = data[['owner', 'ticker', 'amount']]
y = data.type

# split data set into training vs test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # 70% training and 30% test

# ordinal encoding
enc = Pipeline(steps=[('ordinal', OrdinalEncoder())])

# nominal encoding
ohe = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])


# preprocessor
preproc = ColumnTransformer(
    transformers = [
        ('owner_ohe', ohe, ['owner']),
        ('ticker_ohe', ohe, ['ticker']),
        ('disclosure_ohe', ohe, ['disclosure_year']),
        ('amount_enc', enc, ['amount'])
    ])

pl = Pipeline(steps=[('preprocessor', preproc), ('classifier', DecisionTreeClassifier())])

pl.fit(X_train, y_train)
preds = pl.predict(X_test)
preds

array(['purchase', 'purchase', 'sale_partial', ..., 'purchase',
       'sale_full', 'sale_full'], dtype=object)

#### Evaluation of Baseline Model

In [140]:
recall = metrics.recall_score(y_test, preds, average='weighted')
precision = metrics.precision_score(y_test, preds, average='weighted')

f1 = 2 * ((precision * recall)/(precision + recall))
f1

0.5182883290793954
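
The cell above combines weighted precision and weighted recall by hand. That value is generally not the same as the support-weighted average of per-class F1 scores (which is what scikit-learn's `f1_score` computes with `average='weighted'`). The sketch below shows the difference in plain Python on made-up labels; `per_class_scores` and `weighted` are hypothetical helper names.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for every true label."""
    support = Counter(y_true)
    scores = {}
    for label in support:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support[label]
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        scores[label] = (precision, recall, f1, support[label])
    return scores


# Made-up toy labels, not taken from the data set
y_true = ['purchase'] * 6 + ['sale_full'] * 3 + ['sale_partial']
y_pred = ['purchase'] * 5 + ['sale_full'] * 3 + ['purchase'] * 2

scores = per_class_scores(y_true, y_pred)
n = len(y_true)


def weighted(index):
    """Support-weighted average of one per-class score."""
    return sum(s[index] * s[3] for s in scores.values()) / n


w_precision, w_recall, w_f1 = weighted(0), weighted(1), weighted(2)
harmonic = 2 * w_precision * w_recall / (w_precision + w_recall)

print(round(w_recall, 4))  # 0.7 (weighted recall always equals accuracy)
print(round(harmonic, 4))  # 0.6624 (harmonic mean of the weighted averages)
print(round(w_f1, 4))      # 0.6615 (weighted average of per-class F1)
```

The two numbers are close here but not identical, so it is worth stating which one is being reported when comparing against other results.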

### Final Model

In [141]:
# engineer the new features on `data` (not `df`) so the amount-bin cleaning above is kept
data['state'] = data['district'].str[:2]
data['transaction_month'] = data['transaction_date'].str[5:7]
data['transaction_year'] = data['transaction_date'].str[0:4]

data.head()

Unnamed: 0,disclosure_year,disclosure_date,transaction_date,owner,ticker,asset_description,type,amount,representative,district,ptr_link,cap_gains_over_200_usd,state,transaction_month,transaction_year
0,2021,10/04/2021,2021-09-27,joint,BP,BP plc,purchase,"$1,001 - $15,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,NC,9,2021
1,2021,10/04/2021,2021-09-13,joint,XOM,Exxon Mobil Corporation,purchase,"$1,001 - $15,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,NC,9,2021
2,2021,10/04/2021,2021-09-10,joint,ILPT,Industrial Logistics Properties Trust - Common...,purchase,"$15,001 - $50,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,NC,9,2021
3,2021,10/04/2021,2021-09-28,joint,PM,Phillip Morris International Inc,purchase,"$15,001 - $50,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,NC,9,2021
4,2021,10/04/2021,2021-09-17,self,BLK,BlackRock Inc,sale_partial,"$1,001 - $15,000",Hon. Alan S. Lowenthal,CA47,https://disclosures-clerk.house.gov/public_dis...,False,CA,9,2021


In [155]:
# split target feature from training features
X_final = data.drop('type', axis=1)
# X = data[['owner', 'ticker', 'amount']]
y_final = data.type

# split data set into training vs test
Xf_train, Xf_test, yf_train, yf_test = train_test_split(X_final, y_final, test_size=0.3) # 70% training and 30% test

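# ordinal encoding ('ignore' is not a valid handle_unknown option for OrdinalEncoder; unseen bins are encoded as -1)
enc = Pipeline(steps=[('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))])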

# nominal encoding
ohe = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])


# preprocessor
preproc = ColumnTransformer(
    transformers = [
        ('owner_ohe', ohe, ['owner']),
        ('ticker_ohe', ohe, ['ticker']),
        ('disclosure_ohe', ohe, ['disclosure_year']),
        ('state_ohe', ohe, ['state']),
        ('tmonth_ohe', ohe, ['transaction_month']),
        ('tyear_ohe', ohe, ['transaction_year']),
        ('amount_enc', enc, ['amount'])
    ])

pl_final = Pipeline(steps=[('preprocessor', preproc), ('classifier', DecisionTreeClassifier())])

pl_final.fit(Xf_train, yf_train)
preds_final = pl_final.predict(Xf_test)
preds_final

array(['sale_full', 'purchase', 'sale_full', ..., 'purchase', 'sale_full',
       'purchase'], dtype=object)

In [170]:
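# candidate tree depths: 2, 22, 42, ..., 482
params = {'classifier__max_depth': np.arange(2, 500, 20)}

# 5-fold cross-validated grid search; with no `scoring` argument it uses the classifier's default score (accuracy)
grids = GridSearchCV(pl_final, param_grid=params, cv=5)
grids.fit(Xf_train, yf_train)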

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('owner_ohe',
                                                                         Pipeline(steps=[('onehot',
                                                                                          OneHotEncoder(handle_unknown='ignore'))]),
                                                                         ['owner']),
                                                                        ('ticker_ohe',
                                                                         Pipeline(steps=[('onehot',
                                                                                          OneHotEncoder(handle_unknown='ignore'))]),
                                                                         ['ticker']),
                                                                        ('disclosure_ohe',
                        

In [172]:
grids.best_params_

{'classifier__max_depth': 202}

In [171]:
grids.best_score_

0.6633419205593964

#### Evaluation of Final Model

In [143]:
# note: preds_final come from the untuned pl_final fitted above, not from grids.best_estimator_
recall_final = metrics.recall_score(yf_test, preds_final, average='weighted')
precision_final = metrics.precision_score(yf_test, preds_final, average='weighted')

f1_final = 2 * ((precision_final * recall_final)/(precision_final + recall_final))
f1_final

0.6661775263317862

### Fairness Evaluation

In [183]:
# A = data[data['transaction_year'] == '2021']
# B = data[data['transaction_year'] == '2020']

def test_stat(A, B):
    def helper(data):
        X_final = data.drop('type', axis=1)
        # X = data[['owner', 'ticker', 'amount']]
        y_final = data.type

        # split data set into training vs test, 70% training and 30% test
        Xf_train, Xf_test, yf_train, yf_test = train_test_split(X_final, y_final, test_size=0.3)

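        # ordinal encoding ('ignore' is not a valid handle_unknown option for OrdinalEncoder; unseen bins are encoded as -1)
        enc = Pipeline(steps=[('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))])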

        # nominal encoding
        ohe = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])


        # preprocessor
        preproc = ColumnTransformer(
            transformers = [
                ('owner_ohe', ohe, ['owner']),
                ('ticker_ohe', ohe, ['ticker']),
                ('disclosure_ohe', ohe, ['disclosure_year']),
                ('state_ohe', ohe, ['state']),
                ('tmonth_ohe', ohe, ['transaction_month']),
                ('tyear_ohe', ohe, ['transaction_year']),
                ('amount_enc', enc, ['amount'])
            ])

        pl_final = Pipeline(steps=[('preprocessor', preproc), ('classifier', DecisionTreeClassifier())])

        pl_final.fit(Xf_train, yf_train)
        preds_final = pl_final.predict(Xf_test)
    
        return metrics.accuracy_score(yf_test, preds_final)
    return helper(B) - helper(A)



def simulate_null(data):
    # shuffle the year labels so that group membership (A vs B) is assigned at random
    df = data.copy()
    shuffled_data = df['transaction_year'].sample(replace=False, frac=1).reset_index(drop=True)
    df['transaction_year'] = shuffled_data
    A = df[df['transaction_year'] == '2021']
    B = df[df['transaction_year'] == '2020']
    return test_stat(A, B)


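def pval(data):
    permutations = []
    n_repetitions = 10  # only 10 shuffles, so the p-value is resolved in steps of 0.1
    df = data.copy()
    
    A = df[df['transaction_year'] == '2021']
    B = df[df['transaction_year'] == '2020']
    
    # observed statistic: accuracy on 2020 minus accuracy on 2021
    obs = test_stat(A, B)

    for i in range(n_repetitions):
        permutations.append(simulate_null(df))

    # proportion of shuffled statistics that exceed the observed one
    return (np.array(permutations) > obs).mean()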

pval(data.copy())


0.8