# Imbalanced Data
# Vivian Zeng

In [25]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression, LogisticRegression, LogisticRegressionCV
from sklearn.metrics import mean_squared_error, accuracy_score, recall_score, f1_score, precision_score, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
pd.options.display.max_columns = None # display all columns


1. Load the Ames housing data set. Create a feature matrix using the following variables: Lot_Area, Year_Built, Gr_Liv_Area, Total_Bsmt_SF, and Full_Bath. Print the first few rows. (5 pts)

In [5]:
ames = pd.read_csv('ames.csv')
cols = ['Sale_Price', 'Lot_Area', 'Year_Built', 'Gr_Liv_Area', 'Total_Bsmt_SF', 'Full_Bath']
df = ames[cols].copy()
df = df.dropna()
df.head()

   Sale_Price  Lot_Area  Year_Built  Gr_Liv_Area  Total_Bsmt_SF  Full_Bath
0      215000     31770        1960         1656           1080          1
1      105000     11622        1961          896            882          1
2      172000     14267        1958         1329           1329          1
3      244000     11160        1968         2110           2110          2
4      189900     13830        1997         1629            928          2


In [6]:
X = df.drop('Sale_Price', axis=1).values
X[:5, :]

array([[31770,  1960,  1656,  1080,     1],
       [11622,  1961,   896,   882,     1],
       [14267,  1958,  1329,  1329,     1],
       [11160,  1968,  2110,  2110,     2],
       [13830,  1997,  1629,   928,     2]])

2. Create a vector for the response. This will be 1 if the Sale_Price is greater than $300,000, and 0 otherwise. What is the proportion of homes that have a sale price greater than 300,000? (10 pts)

In [7]:
y=pd.DataFrame(df.Sale_Price.copy())

In [8]:
y[y.Sale_Price <= 300000]=0

In [9]:
y[y.Sale_Price >300000]=1

In [10]:
y=y.values.reshape(-1, 1)

In [11]:
print('the proportion of sale price greater than 300,000:', y.mean())

the proportion of sale price greater than 300,000: 0.07849829351535836
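The same 0/1 response can be built in one step by comparing the column to the threshold. The following is a minimal sketch on made-up prices (the `prices` series and `y_demo` name are illustrative, not part of the notebook); it also shows why the mean of the vector is the proportion of expensive homes.

```python
import pandas as pd

# Hypothetical sale prices standing in for df.Sale_Price
prices = pd.Series([215000, 105000, 320000, 300000, 450000])

# The comparison yields booleans; astype(int) maps True/False to 1/0.
# A price of exactly 300,000 is not "greater than" the threshold, so it maps to 0.
y_demo = (prices > 300000).astype(int)

# The mean of a 0/1 vector is the share of ones
proportion = y_demo.mean()
print(y_demo.tolist(), proportion)
```

Applied to the notebook's data, the equivalent expression would be `(df.Sale_Price > 300000).astype(int)`.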


3. Split the data into training and testing sets using a 60/40 split, making sure to stratify based on the response variable. Print the dimensions of each set. (10 pts)

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, stratify=y, random_state=2020)

In [14]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(1758, 5) (1172, 5) (1758, 1) (1172, 1)
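A quick way to confirm that stratification worked is to compare the proportion of positives in each split. This sketch uses synthetic data with 8% positives (the `X_demo` and `y_demo` arrays are assumptions for illustration only); both splits should show roughly the same proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 8% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:80] = 1

# stratify=y_demo keeps the class proportions (approximately) equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=0
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```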


4. Standardize the features from the training set and apply the transformation to the test set. Print the first few rows of the standardized features from the training set. (10 pts)

In [17]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [18]:
print(X_train[:5, :])

[[ 1.10865434e-01 -2.54708797e-01 -9.38079761e-01 -4.71576501e-02
  -1.04550661e+00]
 [ 3.03869304e+00 -2.88449892e-01  1.50742740e-01  1.26851503e+00
   7.58223144e-01]
 [-5.60914773e-01  6.90041850e-01 -3.41437173e-01  6.21943011e-01
   7.58223144e-01]
 [-6.64488121e-01  8.25006228e-01 -6.38754018e-01  2.88519112e-01
   7.58223144e-01]
 [ 2.20021201e-03  9.26229512e-01  5.72611236e-01 -2.81455525e-01
   7.58223144e-01]]
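The point of fitting the scaler on the training set only is that the test set is transformed with the training means and standard deviations. A small sketch on synthetic data (the `train` and `test` arrays are made up for illustration) checks this against a manual calculation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test features
rng = np.random.default_rng(1)
train = rng.normal(loc=50, scale=10, size=(200, 2))
test = rng.normal(loc=50, scale=10, size=(100, 2))

scaler_demo = StandardScaler()
train_s = scaler_demo.fit_transform(train)  # learns mean and std from train
test_s = scaler_demo.transform(test)        # reuses the training statistics

# Manual equivalent: subtract the training mean, divide by the training std
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(test_s, manual))
print(train_s.mean(axis=0).round(6), train_s.std(axis=0).round(6))
```

The standardized training features have mean 0 and standard deviation 1 by construction; the test features generally do not, because they are scaled with statistics from a different sample.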


5. Model 1: Fit a model without any correction for the data being imbalanced. (10 pts) Fit a Logistic Regression model to the training set (using regularization is optional). Calculate and print the precision, recall, and F1 score for the test set.

In [22]:
model1 = LogisticRegression()
# ravel() passes the labels as a 1-D array, which avoids the DataConversionWarning for column-vector y
model1.fit(X_train, y_train.ravel())

LogisticRegression()

In [26]:
preds1 = model1.predict(X_test)
probs1 = model1.predict_proba(X_test)

In [27]:
print('Recall: ', recall_score(y_test, preds1))
print('Precision: ', precision_score(y_test, preds1))
print('F1 score: ', f1_score(y_test, preds1))

Recall:  0.5652173913043478
Precision:  0.8253968253968254
F1 score:  0.6709677419354838
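To make these scores concrete, the sketch below recomputes them from confusion-matrix counts. The counts are not printed in the notebook; they are inferred on the assumption that the test set holds 92 expensive homes (about 7.85% of 1172), and they are the values consistent with the reported recall and precision.

```python
# Counts inferred from the reported scores (an assumption, not notebook output):
# 92 expensive homes in the test set, 52 of them predicted correctly
tp = 52  # expensive homes predicted as expensive
fn = 40  # expensive homes predicted as inexpensive
fp = 11  # inexpensive homes predicted as expensive

precision = tp / (tp + fp)  # share of predicted-expensive homes that are expensive
recall = tp / (tp + fn)     # share of expensive homes that the model finds
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print('Recall: ', recall)
print('Precision: ', precision)
print('F1 score: ', f1)
```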


6. Model 2: Fit a model using weights to balance the classes. (10 pts)
Fit a Logistic Regression model to the training set again, but this time use different weights for the expensive and inexpensive houses. This is done by setting the class_weight parameter to 'balanced'.
Calculate and print the precision, recall, and F1 score for the test set.


In [28]:
model2 = LogisticRegression(class_weight='balanced')
# ravel() passes the labels as a 1-D array, which avoids the DataConversionWarning for column-vector y
model2.fit(X_train, y_train.ravel())

LogisticRegression(class_weight='balanced')

In [29]:
preds2 = model2.predict(X_test)
probs2 = model2.predict_proba(X_test)

In [30]:
print('Recall: ', recall_score(y_test, preds2))
print('Precision: ', precision_score(y_test, preds2))
print('F1 score: ', f1_score(y_test, preds2))

Recall:  0.9456521739130435
Precision:  0.5
F1 score:  0.6541353383458647
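With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * count)`. The sketch below computes these weights by hand and with `compute_class_weight`. The class counts (1620 inexpensive, 138 expensive) are inferred from the shapes printed in this notebook rather than taken from an explicit output, so treat them as an assumption.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label vector with the class counts inferred for the training set (assumption):
# 1758 rows in total, 138 of them expensive
y_counts = np.array([0] * 1620 + [1] * 138)

# Manual formula: n_samples / (n_classes * count of each class)
manual_weights = len(y_counts) / (2 * np.bincount(y_counts))

# The same weights via scikit-learn's helper
sk_weights = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=y_counts
)

print(manual_weights)
print(sk_weights)
```

Under these counts each expensive home counts roughly 6.4 times in the loss and each inexpensive home roughly 0.54 times, so the two classes carry equal total weight.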


7. Model 3: Fit a model after oversampling the minority class. (20 pts)

We first need to create a new training dataset with resampled data from the minority class. Determine how many additional observations you need from the expensive homes to make the classes balanced.

Sample from the minority class with replacement so that the classes are balanced. 

There are many ways to do this using functions from numpy or pandas, but you may find the np.random.choice function useful to sample the indices of the expensive homes.

Create a new set of features and a new response vector. Confirm that the resampling worked by printing the proportion of expensive homes in the new dataset.

Print the dimensions of this new dataset.

Fit a Logistic Regression model to this new training data.

Calculate and print the precision, recall, and F1 score for the test set.

In [31]:
# define oversampling strategy
oversample = RandomOverSampler(sampling_strategy='minority')

In [32]:
# fit and apply the transform
X_over_train, y_over_train = oversample.fit_resample(X_train, y_train)

In [33]:
y_over_train.mean()

0.5

In [34]:
print(X_over_train.shape, X_test.shape, y_over_train.shape, y_test.shape)

(3240, 5) (1172, 5) (3240,) (1172, 1)


In [35]:
model3 = LogisticRegression() 
model3.fit(X_over_train, y_over_train)

LogisticRegression()

In [36]:
preds3 = model3.predict(X_test)
probs3 = model3.predict_proba(X_test)

In [37]:
print('Recall: ', recall_score(y_test, preds3))
print('Precision: ', precision_score(y_test, preds3))
print('F1 score: ', f1_score(y_test, preds3))

Recall:  0.9456521739130435
Precision:  0.5087719298245614
F1 score:  0.6615969581749049
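The question suggests `np.random.choice` rather than `RandomOverSampler`. A minimal sketch of that manual approach follows, using a small made-up dataset (`X_demo`, `y_demo`, and the seed are illustrative assumptions): sample minority indices with replacement until the classes are the same size, then append the sampled rows.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the sketch is reproducible

# Toy stand-in for the training data: 3 expensive homes, 17 inexpensive
X_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = np.array([1] * 3 + [0] * 17)

minority_idx = np.where(y_demo == 1)[0]

# Number of extra minority rows needed to match the majority class
n_extra = (y_demo == 0).sum() - (y_demo == 1).sum()

# Sample minority row indices with replacement
extra_idx = np.random.choice(minority_idx, size=n_extra, replace=True)

# Append the resampled rows to the original data
X_over = np.vstack([X_demo, X_demo[extra_idx]])
y_over = np.concatenate([y_demo, y_demo[extra_idx]])

print(X_over.shape, y_over.shape)
print(y_over.mean())
```

Only the training set is resampled; the test set keeps its original class proportions so that the reported scores reflect the real imbalance.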


8. Model 4: Fit a model after undersampling the majority class. (20 pts)

We first need to create a new training dataset after undersampling the majority class. Determine how many of the original observations you need to keep from the inexpensive homes to make the classes balanced.

Sample from the majority class without replacement so that the classes are balanced.
Create a new set of features and a new response vector. Confirm that the resampling worked by printing the proportion of expensive homes in the new dataset.

Print the dimensions of this new dataset.

Fit a Logistic Regression model to this new training data.

Calculate and print the precision, recall, and F1 score for the test set.

In [38]:
# define undersampling strategy
undersample = RandomUnderSampler(sampling_strategy='majority')

In [39]:
# fit and apply the transform
X_u_train, y_u_train = undersample.fit_resample(X_train, y_train)

In [40]:
y_u_train.mean()

0.5

In [41]:
print(X_u_train.shape, X_test.shape, y_u_train.shape, y_test.shape)

(276, 5) (1172, 5) (276,) (1172, 1)


In [42]:
model4 = LogisticRegression() 
model4.fit(X_u_train, y_u_train)

LogisticRegression()

In [43]:
preds4 = model4.predict(X_test)
probs4 = model4.predict_proba(X_test)

In [44]:
print('Recall: ', recall_score(y_test, preds4))
print('Precision: ', precision_score(y_test, preds4))
print('F1 score: ', f1_score(y_test, preds4))

Recall:  0.9565217391304348
Precision:  0.45595854922279794
F1 score:  0.6175438596491227
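Undersampling can also be done by hand: keep every minority row and draw an equal number of majority rows without replacement. This is a sketch on a small made-up dataset (`X_demo`, `y_demo`, and the seed are illustrative assumptions), not the `RandomUnderSampler` implementation.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the sketch is reproducible

# Toy stand-in for the training data: 3 expensive homes, 17 inexpensive
X_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = np.array([1] * 3 + [0] * 17)

minority_idx = np.where(y_demo == 1)[0]
majority_idx = np.where(y_demo == 0)[0]

# Keep as many majority rows as there are minority rows, without replacement
keep_majority = np.random.choice(majority_idx, size=len(minority_idx), replace=False)

# Combine the kept indices and subset the data
keep_idx = np.concatenate([minority_idx, keep_majority])
X_under = X_demo[keep_idx]
y_under = y_demo[keep_idx]

print(X_under.shape, y_under.shape)
print(y_under.mean())
```

The balanced dataset has twice the minority count, which is why the notebook's undersampled training set shrinks to 276 rows.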


9. Write a few sentences discussing the difference in performance of the various models. (5 pts)

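Model 1, fit on the original imbalanced data, has the highest precision (0.83) and the highest F1 score (0.67), but its recall is only 0.57, so it misses more than 40% of the expensive homes. The three balancing approaches (class weights in Model 2, oversampling in Model 3, undersampling in Model 4) all raise recall to about 0.95, at the cost of precision falling to between 0.46 and 0.51; their F1 scores (0.62 to 0.66) are slightly below Model 1's. Models 2 and 3 perform almost identically. Model 4 has the lowest precision and F1 score, which could be due to the small training set (276 observations) left after undersampling.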