In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from pandas_profiling import ProfileReport

In [13]:
%matplotlib inline

In [14]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# HW1 - Classification models in sklearn

You'll be building a few classifier models and using some of the tech tools we learned about in Modules 1 and 2. 

## The Data

The data is a relatively small and simple dataset of taxpayer data. I got it from:

https://www.kaggle.com/dmaillie/sample-us-taxpayer-dataset

As you'll see if you visit that page, this dataset was used in a series of YouTube tutorials on using R to build random forest models. 

I read it into a pandas dataframe and used `info()` to get:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1004 entries, 0 to 1003
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   HHI             1004 non-null   int64 
 1   HHDL            1004 non-null   int64 
 2   Married         1004 non-null   int64 
 3   CollegGrads     1004 non-null   int64 
 4   AHHAge          1004 non-null   int64 
 5   Cars            1004 non-null   int64 
 6   Filed_2017      1004 non-null   int64 
 7   Filed_2016      1004 non-null   int64 
 8   Filed_2015      1004 non-null   int64 
 9   PoliticalParty  1004 non-null   object
dtypes: int64(9), object(1)
memory usage: 78.6+ KB
```

Some information about the fields:

* `HHI` - household income
* `HHDL` - household debt level
* `Married` - categorical with a few levels
* `CollegGrads` - number of college grads in the household
* `AHHAge` - average age of people in the household
* `Cars` - number of cars in the household
* `Filed_2017` - 1 means they filed a tax return with the IRS for 2017
* `Filed_2016` - 1 means they filed a tax return with the IRS for 2016
* `Filed_2015` - 1 means they filed a tax return with the IRS for 2015
* `PoliticalParty` - categorical with 3 levels

## The Problem

Our overall goal is to build classifier models to predict `PoliticalParty` using the other variables. You must use sklearn Pipelines that contain your preprocessing steps and your model estimation step.

You can do your work in Jupyter notebooks, in Python scripts (i.e., ``.py`` files), or both. It's up to you.

### Task 1

Start by creating a new project folder structure with the cookiecutter-datascience-simple template that I covered in Module 1. Put the data file into its appropriate folder and put this notebook in the main project folder. Any additional notebooks and/or Python files you end up creating should go in the main project folder. 

### Task 2

Put your new project folder under version control using git. You should **NOT** track the data file. You must track any notebooks, Python scripts or additional text files you end up creating.

### Task 3

Build at least one logistic regression model (with regularization) and one random forest model to predict `PoliticalParty`. Yes, this is very similar to what we did for the Pump it Up project in Module 2. Some detailed requirements and additional information:

* I suggest you start by reading the csv file into a pandas dataframe. My dataframe is called ``tax_df``.
* Then start with some basic EDA. You can certainly use automated tools such as pandas-profiling or SweetViz as I showed in the class notes. Remember, when you run either of those, you **must** have your notebook open in the classic Jupyter Notebook interface (and **NOT** in Jupyter Lab). Once you've created the EDA reports you can close your notebook and reopen in Jupyter Lab if you wish. As we've seen, the reports get created as HTML documents. These should go in your output folder within your project.
* Since we are using regularization, all of the numeric variables should be rescaled using the `StandardScaler`. Be careful: just because a variable has a numeric datatype in the pandas dataframe does not mean it is a numeric variable in the context of the classification models. Think about each column, look at your EDA reports, and decide whether it's truly numeric or needs to be treated as categorical data in the models.
* For any variables that you decide should be treated as categorical in your models, use the `OneHotEncoder` on them in the preprocessing stage.
* Even though our target variable, `PoliticalParty`, is categorical, you do **NOT** need to do any preprocessing on it. As I mentioned in our class notes, scikit-learn will automatically detect that and will do any encoding needed on its own (it uses the `LabelEncoder`).
* I broke up the ``tax_df`` into two separate dataframes that I called ``X`` and ``y``, to use in the models. Here's my code for that:

```
X = tax_df.iloc[:, 0:9]
y = tax_df.iloc[:, 9]
```

* Please use the following code for your data partitioning so that we all end up with the same training and test split:

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
```

* For each model you fit, you should compute its ``score`` and create a confusion matrix on both train and test data. I did all of this repeatedly in the class notes.
* For each model (the logistic regression model and the random forest) you should make some summary comments about how well the model fits and predicts and if there is evidence of overfitting. 

**IMPORTANT** You should always put summary comments in a markdown cell. Do **NOT** write them as comments in a code cell. The whole point of Jupyter notebooks is to be able to mix markdown cells with code cells. If you choose to do all of your Python work in ``.py`` files, then simply create a Jupyter notebook in which you include your summary comments.
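To make the Task 3 requirements concrete, here is a minimal sketch of a preprocessing-plus-model Pipeline for the regularized logistic regression. It runs on a small synthetic stand-in dataframe (`demo_df`) that only mimics the column names of ``tax_df``; the data values, the `demo_` names, and the choice of which columns to treat as categorical are all assumptions for illustration, not the required answer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in with the same column names as tax_df (NOT the real data).
rng = np.random.default_rng(0)
n = 200
demo_df = pd.DataFrame({
    "HHI": rng.integers(20_000, 200_000, n),
    "HHDL": rng.integers(0, 500_000, n),
    "Married": rng.integers(0, 3, n),
    "CollegGrads": rng.integers(0, 5, n),
    "AHHAge": rng.integers(18, 90, n),
    "Cars": rng.integers(0, 5, n),
    "Filed_2017": rng.integers(0, 2, n),
    "Filed_2016": rng.integers(0, 2, n),
    "Filed_2015": rng.integers(0, 2, n),
    "PoliticalParty": rng.choice(["Democrat", "Independent", "Republican"], n),
})

demo_X = demo_df.iloc[:, 0:9]
demo_y = demo_df.iloc[:, 9]
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=21
)

# Assumed column roles; decide these yourself from your EDA reports.
categorical_cols = ["Married", "Filed_2017", "Filed_2016", "Filed_2015"]
numeric_cols = [c for c in demo_X.columns if c not in categorical_cols]

# Scale the numeric columns and one-hot encode the categorical ones.
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# LogisticRegression applies L2 regularization by default; C is the inverse strength.
clf_logreg = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(C=1.0, max_iter=1000)),
])

clf_logreg.fit(demo_X_train, demo_y_train)
train_score = clf_logreg.score(demo_X_train, demo_y_train)
test_score = clf_logreg.score(demo_X_test, demo_y_test)
print(f"train accuracy: {train_score:.3f}, test accuracy: {test_score:.3f}")
```

Because the stand-in labels are random, the accuracies here mean nothing; the point is only the shape of the Pipeline. Keeping the scaler and encoder inside the Pipeline means they are fit on the training data only and then reused unchanged on the test data.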

#### Read in raw data

In [16]:
tax_df = pd.read_csv("./data/TaxInfo.csv")

#### Initial EDA

In [17]:
tax_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1004 entries, 0 to 1003
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   HHI             1004 non-null   int64 
 1   HHDL            1004 non-null   int64 
 2   Married         1004 non-null   int64 
 3   CollegGrads     1004 non-null   int64 
 4   AHHAge          1004 non-null   int64 
 5   Cars            1004 non-null   int64 
 6   Filed_2017      1004 non-null   int64 
 7   Filed_2016      1004 non-null   int64 
 8   Filed_2015      1004 non-null   int64 
 9   PoliticalParty  1004 non-null   object
dtypes: int64(9), object(1)
memory usage: 78.6+ KB


#### Pandas Profiling

In [18]:
#from pandas_profiling import ProfileReport

In [20]:
#profile = ProfileReport(tax_df, title="Pandas Profiling Report")

In [11]:
#profile.to_file("output/hw1_pandas_profiling_report.html")

#### Sweetviz Report

In [8]:
import sweetviz

In [9]:
report = sweetviz.analyze(tax_df)

In [10]:
report.show_html("output/hw1_sweetviz_report.html")

Report output/hw1_sweetviz_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


In [21]:
tax_df.select_dtypes(include=np.number).columns.tolist()

['HHI',
 'HHDL',
 'Married',
 'CollegGrads',
 'AHHAge',
 'Cars',
 'Filed_2017',
 'Filed_2016',
 'Filed_2015']

In [22]:
tax_df.groupby(['PoliticalParty']).size()

PoliticalParty
Democrat       336
Independent    337
Republican     331
dtype: int64

In [24]:
tax_df['PoliticalParty'].value_counts(normalize=True)

Independent    0.335657
Democrat       0.334661
Republican     0.329681
Name: PoliticalParty, dtype: float64
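The three classes are almost perfectly balanced, which gives a useful reference point: a model that always predicts the most common class would be right only about a third of the time. A short sketch of that arithmetic, using the counts printed above (the variable names are just illustrative):

```python
# Class counts copied from the groupby output above.
counts = {"Democrat": 336, "Independent": 337, "Republican": 331}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of always guessing the most frequent class.
baseline_accuracy = counts[majority_class] / total
print(f"{majority_class}: {baseline_accuracy:.4f}")
```

Any classifier score reported later is easier to interpret when compared against this majority-class baseline.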

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

#### Creating X and y

In [None]:
X = tax_df.iloc[:, 0:9]

In [None]:
y = tax_df.iloc[:, 9]
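From here, the remaining Task 3 steps are the seeded split, the model fits, and a score plus confusion matrix on both train and test data. The following is a hedged sketch of those steps for a random forest. It uses synthetic stand-in data with the same column names (the `demo_` names and value ranges are hypothetical), and it assumes that `Married` and the `Filed_*` columns are the categorical ones and that the remaining columns can pass through unscaled, since tree-based models do not depend on feature scale.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the nine feature columns and the target (NOT the real data).
rng = np.random.default_rng(1)
n = 200
ranges = {
    "HHI": (20_000, 200_000), "HHDL": (0, 500_000), "Married": (0, 3),
    "CollegGrads": (0, 5), "AHHAge": (18, 90), "Cars": (0, 5),
    "Filed_2017": (0, 2), "Filed_2016": (0, 2), "Filed_2015": (0, 2),
}
parties = ["Democrat", "Independent", "Republican"]
demo_X = pd.DataFrame({name: rng.integers(lo, hi, n) for name, (lo, hi) in ranges.items()})
demo_y = pd.Series(rng.choice(parties, n), name="PoliticalParty")

# Same split settings as the assignment asks for.
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=21
)

# Assumed categorical columns; the other columns pass through unchanged.
categorical_cols = ["Married", "Filed_2017", "Filed_2016", "Filed_2015"]
preprocessor = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
)

clf_rf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=0)),
])
clf_rf.fit(demo_X_train, demo_y_train)

# Rows are true classes and columns are predicted classes, in the order of `parties`.
cm_train = confusion_matrix(demo_y_train, clf_rf.predict(demo_X_train), labels=parties)
cm_test = confusion_matrix(demo_y_test, clf_rf.predict(demo_X_test), labels=parties)

print("train accuracy:", clf_rf.score(demo_X_train, demo_y_train))
print(cm_train)
print("test accuracy:", clf_rf.score(demo_X_test, demo_y_test))
print(cm_test)
```

A large gap between train and test accuracy is the usual sign of overfitting, and comparing the two confusion matrices side by side shows which classes account for the gap.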

## Optional Hacker Extra tasks
I always like to include some extra credit tasks for those who want to push themselves a little further. For this problem, consider doing one or more of the following:

* Try out the Histogram based Gradient Boosting Classifier shown in the optional materials at the end of Module 2. Compare its performance to logistic regression and the random forest.
* Create a second set of models in which you treat ``Filed_2017`` as a binary target variable and use ``PoliticalParty`` as a categorical feature variable. Is it any easier to predict ``Filed_2017`` than it was to predict ``PoliticalParty``?
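For the first optional task, a rough sketch of how a histogram-based gradient boosting model could slot into the same kind of Pipeline is shown below. It again uses synthetic stand-in data with hypothetical `demo_` names and an assumed set of categorical columns. The guarded import is there because older scikit-learn releases required an experimental import first, and `sparse_threshold=0` is set because this estimator expects dense input.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

try:
    from sklearn.ensemble import HistGradientBoostingClassifier
except ImportError:
    # Older scikit-learn versions need the experimental flag enabled first.
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
    from sklearn.ensemble import HistGradientBoostingClassifier

# Synthetic stand-in data (NOT the real taxpayer file).
rng = np.random.default_rng(2)
n = 200
ranges = {
    "HHI": (20_000, 200_000), "HHDL": (0, 500_000), "Married": (0, 3),
    "CollegGrads": (0, 5), "AHHAge": (18, 90), "Cars": (0, 5),
    "Filed_2017": (0, 2), "Filed_2016": (0, 2), "Filed_2015": (0, 2),
}
demo_X = pd.DataFrame({name: rng.integers(lo, hi, n) for name, (lo, hi) in ranges.items()})
demo_y = pd.Series(rng.choice(["Democrat", "Independent", "Republican"], n))

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=21
)

# Assumed categorical columns; sparse_threshold=0 forces a dense output array.
categorical_cols = ["Married", "Filed_2017", "Filed_2016", "Filed_2015"]
preprocessor = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
    sparse_threshold=0,
)

clf_hgb = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", HistGradientBoostingClassifier(random_state=0)),
])
clf_hgb.fit(demo_X_train, demo_y_train)

hgb_train_score = clf_hgb.score(demo_X_train, demo_y_train)
hgb_test_score = clf_hgb.score(demo_X_test, demo_y_test)
print(f"train accuracy: {hgb_train_score:.3f}, test accuracy: {hgb_test_score:.3f}")
```

The resulting scores can be compared directly with those of the logistic regression and random forest Pipelines, as long as all three are evaluated on the same train and test split.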

## Deliverables
You should simply compress your entire project folder as either a zip file or a tar.gz file (do **NOT** ever use WinRAR to create rar files in this class). Note that when you do this, your "hidden" ``.git`` folder will get included. So, I'll be able to tell that you put the project under version control and I'll be able to look at your project folder structure. Before compressing the project folder to submit it:

* make sure all of your notebooks and .py files are in the main project folder and have good filenames,
* make sure you've committed all of your changes (git),
* upload your compressed folder in Moodle.