# 3-Hour Cram Course

We will use the Iowa housing data to formulate and implement two different problems:

1. Regression modeling on Sale Price
2. Classification modeling on Housing Type

In [1]:
!wget https://s3.amazonaws.com/ink2019share/iowa_housing_train.csv
!wget https://s3.amazonaws.com/ink2019share/iowa_housing_data_dict.txt

--2019-05-01 12:46:41--  https://s3.amazonaws.com/ink2019share/iowa_housing_train.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.136.238
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.136.238|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 460676 (450K) [text/csv]
Saving to: ‘iowa_housing_train.csv’


2019-05-01 12:46:42 (331 KB/s) - ‘iowa_housing_train.csv’ saved [460676/460676]

--2019-05-01 12:46:42--  https://s3.amazonaws.com/ink2019share/iowa_housing_data_dict.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.136.238
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.136.238|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13369 (13K) [text/plain]
Saving to: ‘iowa_housing_data_dict.txt’


2019-05-01 12:46:43 (21.4 KB/s) - ‘iowa_housing_data_dict.txt’ saved [13369/13369]



### Library Imports

In [8]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression, LassoCV, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler

# for plotting
import matplotlib.pyplot as plt
%matplotlib inline

### Step 1: review the data dictionary

- Find fields that will be useless (does the house have pictures of Serbian clowns?)
- Find fields that are categorical (these will need to be transformed)
- Find fields that are numeric but will probably have to be treated as categories
- Note the units of the continuous fields (will we have to transform the data?)

### Step 2: load the data into pandas

In [6]:
!ls

Bulldozer_Example.ipynb     auto-dataset.ipynb
LICENSE                     auto-mpg-train.csv
README.md                   crime_dataset.ipynb
SuperConductorExample.ipynb iowa_housing_data_dict.txt
Untitled.ipynb              iowa_housing_train.csv
Untitled1.ipynb             word2vec_100d.parquet


In [7]:
# read in the CSV data
housing_df = pd.read_csv("iowa_housing_train.csv")

### Step 3: Check the data for the following:

- Missing data (blanks)
- Outliers (a few points far too high or far too low)
    - `.hist()` is useful here; it plots a histogram
- Imbalanced features (e.g. 1,000 zeros and only 5 ones)
- Unnecessary rows to drop
- Unnecessary columns to drop
- Blanks to fill in with defaults, if necessary
- Relationships that already look linear
    - plot two variables against each other and see whether the relationship is linear
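The checks above can be sketched on a small stand-in frame. This is only an illustration: `toy_df`, its column names, the 1.5 × IQR outlier rule, and the median fill are assumptions, not the course's prescribed method. On the real data you would run the same calls on `housing_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for housing_df; the column names are illustrative assumptions.
toy_df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 215245, 14115],
    "GarageCars": [2, 2, np.nan, 3, 2, 2],
    "HasPool": [0, 0, 0, 0, 0, 1],
})

# Missing data: count the blanks in each column.
missing_counts = toy_df.isna().sum()

# Outliers: flag values more than 1.5 * IQR outside the quartiles (one common rule of thumb).
q1, q3 = toy_df["LotArea"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (toy_df["LotArea"] < q1 - 1.5 * iqr) | (toy_df["LotArea"] > q3 + 1.5 * iqr)

# Imbalance: share of each value in a binary column.
pool_share = toy_df["HasPool"].value_counts(normalize=True)

# Fill blanks with a default (here the column median), then drop the outlier rows.
clean_df = toy_df.fillna({"GarageCars": toy_df["GarageCars"].median()})
clean_df = clean_df[~outlier_mask]

print(missing_counts)
print(pool_share)
print(clean_df)
```

Whether to drop or keep an outlier is a judgment call; the mask only tells you which rows to look at.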

### Step 4: check for duplicate features

An example of this is "has garage" and "number of garage spaces". Such pairs can be found either from the data dictionary or by inspecting the correlation matrix.

```python
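import matplotlib.pyplot as plt

# Correlation is only defined for numeric columns, so select those first.
corr = housing_df.select_dtypes(include="number").corr()
plt.matshow(corr)
plt.show()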
```

Look for pairs of fields with high correlation (close to 1). These are probably features with a very close cousin that represents the same type of quantity. Keep only one of them.
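Reading a correlation heatmap by eye gets hard with many columns, so here is a minimal sketch that lists the highly correlated pairs instead. The toy columns and the `0.9` threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; the column names are illustrative assumptions.
toy_df = pd.DataFrame({
    "GarageCars": [0, 1, 2, 2, 3, 1],
    "GarageArea": [0, 280, 520, 560, 840, 300],
    "YearBuilt": [1995, 1960, 2003, 1978, 2010, 1985],
})

corr = toy_df.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cutoff; tune it for your data
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold]
print(high_pairs)

# Drop the second member of each highly correlated pair.
to_drop = sorted({col_b for _, col_b in high_pairs.index})
reduced_df = toy_df.drop(columns=to_drop)
print(reduced_df.columns.tolist())
```

Which member of a pair to keep is up to you; the data dictionary usually makes it clear which one carries more information.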

### Step 5: Split the data into Continuous Features and Categorical Features

- Make two data frames for continuous and categorical
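One possible way to make the split is by dtype, after first converting numeric codes that are really categories (as flagged in Step 1). The frame and column names below are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for housing_df; the column names are illustrative assumptions.
toy_df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "MSSubClass": [60, 20, 60],  # numeric code that really names a category
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
})

# Convert coded-as-number fields to strings so they land on the categorical side.
toy_df = toy_df.astype({"MSSubClass": str})

continuous_df = toy_df.select_dtypes(include="number")
categorical_df = toy_df.select_dtypes(exclude="number")

print(continuous_df.columns.tolist())
print(categorical_df.columns.tolist())
```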

#### 5.1 for continuous

- normalize the continuous data (subtract the mean and divide by the standard deviation)
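A minimal sketch of the normalization using the `StandardScaler` imported earlier. The toy values and column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy continuous features; the column names are illustrative assumptions.
continuous_df = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0],
    "GrLivArea": [1710.0, 1262.0, 1786.0, 1717.0],
})

scaler = StandardScaler()

# fit_transform returns a NumPy array, so rebuild the frame to keep headings and index.
scaled_df = pd.DataFrame(
    scaler.fit_transform(continuous_df),
    columns=continuous_df.columns,
    index=continuous_df.index,
)
print(scaled_df)
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), so the result differs slightly from `(df - df.mean()) / df.std()`, which uses `ddof=1`. To avoid leaking test information, it is common to fit the scaler on the training rows only and then apply `transform` to the test rows.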

#### 5.2 for categorical

- look at the distribution of categories (per field) and consolidate rare categories if necessary
- then convert each field to a collection of binary columns (keep track of your headings!)
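The two steps above might look like the following sketch. The `Neighborhood` values, the `min_count` cutoff, and the `"Other"` label are assumptions for illustration.

```python
import pandas as pd

# Toy categorical field; the values are illustrative assumptions.
categorical_df = pd.DataFrame({
    "Neighborhood": ["NAmes", "CollgCr", "NAmes", "Blueste", "CollgCr", "NAmes", "Veenker"],
})

# Look at the distribution, then fold categories seen fewer than min_count times into "Other".
min_count = 2  # assumed cutoff
counts = categorical_df["Neighborhood"].value_counts()
rare = counts[counts < min_count].index
categorical_df["Neighborhood"] = categorical_df["Neighborhood"].where(
    ~categorical_df["Neighborhood"].isin(rare), "Other"
)

# Convert to binary columns; each heading is "<field>_<category>".
dummies_df = pd.get_dummies(categorical_df, columns=["Neighborhood"], dtype=int)
print(dummies_df.columns.tolist())
```

The prefixed headings make it easy to trace each binary column back to its original field later, when you inspect model coefficients.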


#### 5.3 glue the two dataframes back together

- use `pd.concat` (with `axis=1`) or `pd.merge` to accomplish this task
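A short sketch of the gluing step with `pd.concat`. The two small frames stand in for the outputs of 5.1 and 5.2, and their column names are assumptions.

```python
import pandas as pd

# Stand-ins for the scaled continuous frame and the binary-column frame.
continuous_df = pd.DataFrame({"LotArea": [-0.5, 0.1, 1.2]}, index=[0, 1, 2])
dummies_df = pd.DataFrame(
    {"Neighborhood_NAmes": [1, 0, 1], "Neighborhood_Other": [0, 1, 0]},
    index=[0, 1, 2],
)

# axis=1 places the frames side by side, aligning rows on the index.
features_df = pd.concat([continuous_df, dummies_df], axis=1)
print(features_df)
```

Because `pd.concat` aligns on the index, both frames need to share the same index; if one was rebuilt without it, the result gains extra rows filled with `NaN`. Comparing `len(features_df)` with `len(continuous_df)` is a cheap sanity check.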

### Step LINREG1: Split the data into TRAIN and TEST, X, Y

We will save 20% of the data to score our algorithm. 

1. Randomly select 20% of the rows and save them into a test dataframe
2. Split out the **sale price (Y)** from each dataset; the remaining features are considered X
3. You should end up with four arrays: `X_train`, `X_test`, `y_train`, and `y_test`
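One common way to do this in a single call is scikit-learn's `train_test_split`. The sketch below uses random stand-in data; the `SalePrice` column name and the `random_state` value are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in for the prepared feature frame; column names are assumptions.
rng = np.random.default_rng(0)
features_df = pd.DataFrame({
    "GrLivArea": rng.normal(size=100),
    "OverallQual": rng.normal(size=100),
    "SalePrice": rng.normal(loc=180000, scale=50000, size=100),
})

# Y is the target column; X is everything else.
X = features_df.drop(columns=["SalePrice"])
y = features_df["SalePrice"]

# test_size=0.2 holds out a random 20% of the rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```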



### Step LINREG2: Create and fit a Linear Regression Model


1. create a blank model
2. fit it to the training data
3. find the top features via the `model.coef_`
4. show the top features, and calculate the MSE between the predictions and the actuals on the test set

```python
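from sklearn.metrics import mean_squared_error

model = LinearRegression()
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # .score() returns R^2, not MSE
mse = mean_squared_error(y_test, model.predict(X_test))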
```
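To find the top features from `model.coef_`, one option is to pair each coefficient with its column heading and sort by absolute value. The sketch below uses synthetic data with known weights, so the feature names and values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with known weights; the column names are illustrative assumptions.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["GrLivArea", "OverallQual", "Noise"])
y = 50.0 * X["GrLivArea"] + 20.0 * X["OverallQual"] + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Pair each coefficient with its heading, then order by magnitude (largest first).
coefs = pd.Series(model.coef_, index=X_train.columns)
top_features = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(top_features)
```

Coefficient magnitudes are only comparable when the features share a scale, which is one reason for the normalization in Step 5.1.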



### Step LINREG3: Repeat and compare to LassoCV(cv=3)
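A sketch of the comparison on synthetic data. The feature names, weights, and `random_state` are assumptions; on the real data you would reuse the `X_train`/`X_test` split from LINREG1.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: only x0 and x1 carry signal, the rest is noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"x{i}" for i in range(5)])
y = 3.0 * X["x0"] - 2.0 * X["x1"] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

linear = LinearRegression().fit(X_train, y_train)
# LassoCV picks the regularization strength alpha by 3-fold cross-validation.
lasso = LassoCV(cv=3, random_state=0).fit(X_train, y_train)

# Score both models on the same held-out test set.
results = {
    name: mean_squared_error(y_test, fitted.predict(X_test))
    for name, fitted in [("linear", linear), ("lasso", lasso)]
}
n_zeroed = int((lasso.coef_ == 0).sum())

print(results)
print("chosen alpha:", lasso.alpha_, "coefficients set to zero:", n_zeroed)
```

Lasso can shrink some coefficients exactly to zero, so comparing its nonzero coefficients with the plain linear model's top features is a useful cross-check.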

### Step CLASSIFY1: Split the data into TRAIN and TEST, X, Y

We will save 20% of the data to score our algorithm. 

1. Randomly select 20% of the rows and save them into a test dataframe
2. Split out the **HouseStyle (Y)** from each dataset; the remaining features are considered X
3. You should end up with four arrays: `X_train`, `X_test`, `y_train`, and `y_test`



### Step CLASSIFY2: Create and fit a Logistic Regression Model


1. create a blank model
2. fit it to the training data
3. find the top features via the `model.coef_`
4. show the top features, and calculate the confusion matrix between the predictions and the actuals on the test set

```python
model = LogisticRegression()
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
y_test_pred_proba = model.predict_proba(X_test)
```
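The confusion matrix step can be sketched as follows, on synthetic data with three made-up classes. The feature names, class labels, `max_iter=1000`, and the stratified split are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic three-class data; f1 separates the classes, f2 is noise.
rng = np.random.default_rng(0)
n = 300
codes = rng.integers(0, 3, size=n)
X = pd.DataFrame({"f1": 3.0 * codes + rng.normal(size=n), "f2": rng.normal(size=n)})
y = pd.Series(np.array(["1Story", "2Story", "SFoyer"])[codes], name="HouseStyle")

# stratify=y keeps the class proportions the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# Rows are the actual classes, columns are the predicted classes.
cm = pd.DataFrame(
    confusion_matrix(y_test, y_test_pred, labels=model.classes_),
    index=model.classes_,
    columns=model.classes_,
)
print(cm)
print(model.coef_.shape)
```

With more than two classes, `model.coef_` has one row per class, so the "top features" are read per class rather than from a single list.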



### Step CLASSIFY3: Repeat and compare to DecisionTreeClassifier (3-fold cross-validation)
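`DecisionTreeClassifier` does not itself accept a `cv` argument, so this sketch assumes the intent is 3-fold cross-validation and uses `cross_val_score(..., cv=3)` to compare the two classifiers. The synthetic data, `max_depth=3`, and `random_state` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data; f1 separates the classes, f2 is noise.
rng = np.random.default_rng(0)
n = 300
codes = rng.integers(0, 3, size=n)
X = pd.DataFrame({"f1": 3.0 * codes + rng.normal(size=n), "f2": rng.normal(size=n)})
y = np.array(["1Story", "2Story", "SFoyer"])[codes]

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# Mean accuracy over 3 folds for each model.
cv_accuracy = {
    name: cross_val_score(estimator, X, y, cv=3).mean()
    for name, estimator in candidates.items()
}
print(cv_accuracy)
```

A tree exposes `feature_importances_` rather than `coef_`, so the "top features" step reads from that attribute after fitting.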