# Instructions

This notebook contains two questions with multiple parts. Please refer to the provided *problem statement* for these questions.

A count-up timer is located on the lower lefthand corner of this page. After 48 hours, this assessment will be automatically submitted and made read-only.

To submit your notebooks before the 48 hours have elapsed, return to https://modeling.hddatascience.us and click "Complete Course..." next to where you launched this server.

For support, please contact tech@hddatascience.us
    
    
### Supplemental documents
    
There are three reference documents that will be used in the questions.
    
<ol>
    <li> <a href="Data Dictionary.csv">Data Dictionary.csv </a> - a data dictionary that describes the fields of the following datasets</li>
    <li> <a type="text/csv" href="Property Level Data.csv" target="_blank">Property Level Data.csv </a>  - a dataset containing property level data</li>
    <li> <a href="Census Level Data.csv">Census Level Data.csv </a> - a dataset containing census level data</li>
    
</ol>

Review the documents.  The files are found within the root notebook folder and can be loaded from code as needed.





### Installing packages
<p>  
     You may import the packages of your choosing. Most common packages are already installed on the server. If you are unable to import a package, you may install it within a cell using <code>!{sys.executable} -m pip install &lt;package&gt;</code> for Python or <code>install.packages("forecast")</code> for R.</p>
    

In [1]:
import sys
# !{sys.executable} -m pip install shap==0.38.1

### n_jobs Hyperparameter
<p>
If a model you choose requires the <code>n_jobs</code> hyperparameter, please set <code>n_jobs=1</code> or <code>n_jobs=4</code>.
</p>

# Setup

If you are new to Jupyter Notebooks, please see this <a href="https://nbviewer.jupyter.org/github/jupyter/notebook/blob/master/docs/source/examples/Notebook/Running%20Code.ipynb"> documentation </a> for more information on how to run code and use the environment. Alternatively, click on the "Help" dropdown menu at the top of this page.

You may complete this notebook in either Python or R. To change the kernel from Python to R, go to the Kernel menu and select "Change Kernel".

Run the cell below to import your Python packages. You may add additional packages to import as needed.

In [2]:
## add imports as needed
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")



In [None]:
# If using python
df_main = pd.read_csv(r"Property Level Data.csv") # Load main data
df_census = pd.read_csv(r"Census Level Data.csv") # Load census data

#If using R
#df_main = read.csv('Property Level Data.csv') # Load main data
#df_census = read.csv('Census Level Data.csv') # Load census data

<div class="alert alert-block pheader">
        <span class="heading">Question 1</span> 
</div>

Accuracy: (TP + TN) / (TP + TN + FP + FN) = (5387 + 2011) / (5387 + 2011 + 2105 + 497) = 0.74. This means the model correctly predicted the outcome for 74% of the properties.

Precision: TP / (TP + FP) = 5387 / (5387 + 2105) = 0.72. This means that when the model predicted a property would succeed, it was correct 72% of the time.

Recall (Sensitivity): TP / (TP + FN) = 5387 / (5387 + 497) = 0.92. This means the model correctly identified 92% of the successful properties.

Specificity: TN / (TN + FP) = 2011 / (2011 + 2105) = 0.49. This means the model correctly identified 49% of the unsuccessful properties.
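
The four metrics above can be reproduced directly from the confusion-matrix counts. The following is a minimal sketch; the helper name `classification_metrics` is introduced here for illustration and is not part of the provided notebook.

```python
# Confusion-matrix counts quoted in the text above
TP, TN, FP, FN = 5387, 2011, 2105, 497


def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and specificity from raw counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


metrics = classification_metrics(TP, TN, FP, FN)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```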

In [4]:
# Define the costs and benefits
benefit = 1000000
cost = 3000000

# Calculate the expected net benefit
expected_net_benefit = 5387 * benefit - 2105 * cost

print(f'Expected Net Benefit: {expected_net_benefit}')


Expected Net Benefit: -928000000


The negative value indicates that the costs associated with investing in unsuccessful properties (as predicted by the model) outweigh the benefits from successful properties.

Given that the cost of an unsuccessful property is three times higher than the benefit of a successful property, the model needs to have a high precision (i.e., a low false positive rate) to achieve a positive net benefit. However, based on the confusion matrix, the model has a relatively high number of false positives (2105), which leads to a high cost that outweighs the benefits from true positives.
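
The precision requirement can be made concrete. If every predicted success is invested in, the net benefit is positive only when precision × benefit exceeds (1 − precision) × cost, that is, when precision exceeds cost / (benefit + cost). The sketch below checks this break-even point against the counts from the confusion matrix; the variable names are chosen here for illustration.

```python
benefit = 1_000_000   # average gain from a successful property
cost = 3_000_000      # average loss from an unsuccessful property
tp, fp = 5387, 2105   # counts from the confusion matrix above

# Precision at which gains from true positives exactly offset losses from false positives
break_even_precision = cost / (benefit + cost)

precision = tp / (tp + fp)
net_benefit = tp * benefit - fp * cost

print(f"Break-even precision: {break_even_precision:.2f}")
print(f"Model precision:      {precision:.2f}")
print(f"Net benefit:          {net_benefit}")
```

Because the model's precision sits just below the break-even value, the net benefit comes out negative.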

<div class="alert alert-block alert-warning">
 <span class="heading">Interview Candidate Problem Statement </span>
</div>

<p>
The Real Estate team has approached you to help predict which properties they should invest in. They have compiled market data, containing thousands of properties. It is assumed that if a property is successful, on average it will yield a \$1,000,000 benefit. Conversely, it is assumed that if a property is unsuccessful, on average it will cost the business \$3,000,000.  
</p>

Please insert Markdown cells for discussion and Code cells for your codes as necessary. 

<div class="alert alert-block pheader">
        <span class="heading">Question 2</span> 
</div>

Please insert Markdown cells for discussion and Code cells for your codes as necessary. 

In [5]:
# Answer to Question: 2

# Examine the data 
# Handle missing values using imputations
# Encode Categorical and normalize numerical variables
# Build Classification models
# Perform cost-benefit analysis

In [None]:
df_main.head(2).T

In [None]:
df_census.head(2).T

In [None]:
# Handling missing values

# Impute rent and property value with the column median
for column in ['AnnualAverageRent', 'PropertyValue']:
    df_main[column] = df_main[column].fillna(df_main[column].median())

# For expense columns, record missingness in an indicator column, then fill with 0
for column in ['ExpenseTax', 'ExpenseRepairs', 'ExpenseInsurance', 'ExpensePayroll', 'ExpenseGeneralFees']:
    df_main[column + '_is_missing'] = df_main[column].isna().astype(int)
    df_main[column] = df_main[column].fillna(0)


In [None]:
df_main.head()

In [None]:
# One-hot encoding
df_main = pd.get_dummies(df_main, columns=['PropertyType', 'PropertySubType'])

# Ordinal encoding for 'ParkingRatio'
from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder()
df_main['ParkingRatio'] = ord_enc.fit_transform(df_main[['ParkingRatio']])

In [None]:
# Feature engineering (done before scaling, while the year columns are still raw calendar years)
current_year = pd.Timestamp.now().year
df_main['AgeOfProperty'] = current_year - df_main['YearBuilt']
df_main['YearsSinceRenovated'] = current_year - df_main['YearLastRenovated']


In [None]:
# Scaling of numerical values
from sklearn.preprocessing import StandardScaler

num_cols = ['BuildingCount', 'StoryCount', 'YearBuilt', 'UnitCount', 'NetRentableSF',
            'YearLastRenovated', 'GrossLandArea', 'OccupancyPercentage']
scaler = StandardScaler()
df_main[num_cols] = scaler.fit_transform(df_main[num_cols])


In [None]:
#merge both datasets on StateCode
df = pd.merge(df_main, df_census, left_on='StateCode', right_on='STATECODE')

In [None]:
df.head(3)

#### Visualizations for various columns in the df

In [None]:
df[['BuildingCount', 'StoryCount', 'YearBuilt', 'UnitCount', 'NetRentableSF', 'YearLastRenovated', 'GrossLandArea', 'OccupancyPercentage']].hist(bins=30, figsize=(15, 10))
plt.tight_layout()
plt.show()


In [None]:
plt.figure(figsize=(15, 10))
sns.boxplot(data=df[['BuildingCount', 'StoryCount', 'YearBuilt', 'UnitCount', 'NetRentableSF', 'YearLastRenovated', 'GrossLandArea', 'OccupancyPercentage']])
plt.xticks(rotation=90)
plt.show()

In [None]:
plt.figure(figsize=(10, 5))
df['PropertyType_Multifamily'].value_counts().plot(kind='bar')
plt.title('Distribution of PropertyType_Multifamily')
plt.show()

In [None]:
plt.figure(figsize=(10, 5))
df['PropertySubType_COOP'].value_counts().plot(kind='bar')
plt.title('Distribution of PropertySubType_COOP')
plt.show()

In [None]:
plt.figure(figsize=(10, 5))
df['PropertySubType_Co-op'].value_counts().plot(kind='bar')
plt.title('Distribution of PropertySubType_Co-op')
plt.show()

In [None]:
plt.figure(figsize=(10, 5))
df['PropertySubType_student_housing'].value_counts().plot(kind='bar')
plt.title('Distribution of PropertySubType_student_housing')
plt.show()

In [None]:
plt.figure(figsize=(30, 22))
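# Restrict to numeric and boolean columns; string columns cannot be correlated
sns.heatmap(df.select_dtypes(include=['number', 'bool']).corr(), annot=True, cmap='coolwarm')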
plt.title('Correlation Matrix Heatmap')
plt.show()


In [None]:
plt.figure(figsize=(10, 5))
sns.scatterplot(x='NetRentableSF', y='UnitCount', data=df)
plt.title('NetRentableSF vs. UnitCount')
plt.show()

# Model building -- Logistic Regression

In [None]:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Define the features, assuming 'SuccesssProb' is the target variable
X = df.drop('SuccesssProb', axis=1)

# Convert 'SuccesssProb' into a binary target (1 if above 0.5, else 0)
y = (df['SuccesssProb'] > 0.5).astype(int)

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a logistic regression model
model = LogisticRegression()

# Train the model on the training data
model.fit(X_train, y_train)



In [None]:
from sklearn.linear_model import LogisticRegression

# Create a logistic regression model
model_lr = LogisticRegression()

# Train the model on the training data
model_lr.fit(X_train, y_train)


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Make predictions on the test set
y_pred = model_lr.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
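# ROC AUC is computed from predicted probabilities rather than hard class labels
roc_auc = roc_auc_score(y_test, model_lr.predict_proba(X_test)[:, 1])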

# Print metrics
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')
print(f'ROC AUC Score: {roc_auc}')


In [None]:
# Cost-benefit analysis
import numpy as np

# Define the costs and benefits
benefit = 1000000
cost = 3000000

# Calculate the probabilities of the positive class
y_prob = model.predict_proba(X_test)[:, 1]

# Calculate the expected cost for different decision thresholds
thresholds = np.linspace(0, 1, 100)
expected_costs = []
for threshold in thresholds:
    y_pred = (y_prob > threshold).astype(int)
    fp = np.sum((y_test == 0) & (y_pred == 1))
    tp = np.sum((y_test == 1) & (y_pred == 1))
    expected_cost = fp * cost - tp * benefit
    expected_costs.append(expected_cost)

# Find the threshold with the minimum expected cost
min_cost = min(expected_costs)
min_threshold = thresholds[expected_costs.index(min_cost)]

print(f'Minimum Expected Cost: {min_cost}')
print(f'Optimal Decision Threshold: {min_threshold}')


In [None]:
from sklearn.metrics import confusion_matrix

# Classify properties as successful or unsuccessful using the optimal
# decision threshold found in the cost-benefit cell above
y_pred = (y_prob > min_threshold).astype(int)

# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Create a DataFrame from the confusion matrix for better visualization
cm_df = pd.DataFrame(cm, columns=['Predicted Negative', 'Predicted Positive'], index=['Actual Negative', 'Actual Positive'])

plt.figure(figsize=(10, 7))

sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.show()


A threshold of approximately 0.76 is the optimal decision threshold for classifying a property as successful or unsuccessful, based on minimizing the expected cost.

This means that if the model predicts a success probability greater than 0.76 for a property, 
we should classify it as successful; otherwise, we should classify it as unsuccessful.

The default classification threshold is 0.5, but with the 1:3 ratio of benefit to loss,
the cost-minimizing threshold rises to about 0.76. This is close to the break-even probability of 3 / (1 + 3) = 0.75, above which an investment has a positive expected value.
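
The threshold search can also be written as a small reusable function. The sketch below is illustrative only: `best_threshold` is a hypothetical helper that is not part of the notebook, and the data are synthetic (randomly drawn, well-calibrated probabilities), so the printed numbers say nothing about the property dataset. It maximizes net benefit, which is equivalent to minimizing the expected cost used in this notebook.

```python
import numpy as np


def best_threshold(y_true, y_prob, benefit=1_000_000, cost=3_000_000, n_grid=100):
    """Return (threshold, net_benefit) maximizing tp * benefit - fp * cost."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    thresholds = np.linspace(0, 1, n_grid)

    # Row i holds the predictions made at thresholds[i]
    pred = y_prob[None, :] > thresholds[:, None]
    tp = (pred & (y_true == 1)).sum(axis=1)
    fp = (pred & (y_true == 0)).sum(axis=1)

    net = tp * benefit - fp * cost
    best = int(np.argmax(net))
    return thresholds[best], int(net[best])


# Synthetic data for illustration: outcomes are drawn so that y_prob is calibrated
rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, size=20_000)
y_true = (rng.uniform(0, 1, size=20_000) < y_prob).astype(int)

threshold, net = best_threshold(y_true, y_prob)
print(f"Best threshold: {threshold:.2f}")
print(f"Net benefit at best threshold: {net}")
```

With calibrated probabilities, the selected threshold should land near the break-even value of cost / (benefit + cost); a poorly calibrated model can shift it.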

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test)


from sklearn.tree import DecisionTreeClassifier

# Initialize the model
dt_model = DecisionTreeClassifier(random_state=42)

# Train the model
dt_model.fit(X_train, y_train)

# Make predictions
y_pred_dt = dt_model.predict(X_test)


import xgboost as xgb

# Initialize the model
xgb_model = xgb.XGBClassifier(random_state=42)

# Train the model
xgb_model.fit(X_train, y_train)

# Make predictions
y_pred_xgb = xgb_model.predict(X_test)


In [None]:
from sklearn.metrics import confusion_matrix

# Make probability predictions on the test set
y_prob_rf = rf_model.predict_proba(X_test)[:, 1]
y_prob_dt = dt_model.predict_proba(X_test)[:, 1]
y_prob_xgb = xgb_model.predict_proba(X_test)[:, 1]

# Classify properties as successful or unsuccessful based on the optimal decision threshold
y_pred_rf = (y_prob_rf > min_threshold).astype(int)
y_pred_dt = (y_prob_dt > min_threshold).astype(int)
y_pred_xgb = (y_prob_xgb > min_threshold).astype(int)

# Create confusion matrices
cm_rf = confusion_matrix(y_test, y_pred_rf)
cm_dt = confusion_matrix(y_test, y_pred_dt)
cm_xgb = confusion_matrix(y_test, y_pred_xgb)

import seaborn as sns
import matplotlib.pyplot as plt

# Create a list of models and their confusion matrices
models = ['Random Forest', 'Decision Tree', 'XGBoost']
cms = [cm_rf, cm_dt, cm_xgb]

# Plot the confusion matrix for each model
# (the loop variable is `name` so it does not overwrite the fitted `model` object)
for name, cm in zip(models, cms):
    plt.figure(figsize=(6, 5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'Confusion Matrix for {name}')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Calculate metrics for Random Forest
tn_rf, fp_rf, fn_rf, tp_rf = cm_rf.ravel()
specificity_rf = tn_rf / (tn_rf + fp_rf)

# Calculate metrics for Decision Tree
tn_dt, fp_dt, fn_dt, tp_dt = cm_dt.ravel()
specificity_dt = tn_dt / (tn_dt + fp_dt)

# Calculate metrics for XGBoost
tn_xgb, fp_xgb, fn_xgb, tp_xgb = cm_xgb.ravel()
specificity_xgb = tn_xgb / (tn_xgb + fp_xgb)



# Calculate metrics for Random Forest
accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)

# Calculate metrics for Decision Tree
accuracy_dt = accuracy_score(y_test, y_pred_dt)
precision_dt = precision_score(y_test, y_pred_dt)
recall_dt = recall_score(y_test, y_pred_dt)

# Calculate metrics for XGBoost
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
precision_xgb = precision_score(y_test, y_pred_xgb)
recall_xgb = recall_score(y_test, y_pred_xgb)



# Print the metrics for each model
print('\nMetrics for Random Forest:')
print(f'Accuracy: {accuracy_rf:.4f}')
print(f'Precision: {precision_rf:.4f}')
print(f'Recall (Sensitivity): {recall_rf:.4f}')
print(f'Specificity: {specificity_rf:.4f}')

print('\nMetrics for Decision Tree:')
print(f'Accuracy: {accuracy_dt:.4f}')
print(f'Precision: {precision_dt:.4f}')
print(f'Recall (Sensitivity): {recall_dt:.4f}')
print(f'Specificity: {specificity_dt:.4f}')

print('\nMetrics for XGBoost:')
print(f'Accuracy: {accuracy_xgb:.4f}')
print(f'Precision: {precision_xgb:.4f}')
print(f'Recall (Sensitivity): {recall_xgb:.4f}')
print(f'Specificity: {specificity_xgb:.4f}')


In [None]:
model_rf = rf_model
model_dt = dt_model
model_xgb = xgb_model

In [None]:
# Probability thresholds for various models:

import numpy as np

# Define the costs and benefits
benefit = 1000000
cost = 3000000

# Calculate the probabilities of the positive class for each model

y_prob_rf = model_rf.predict_proba(X_test)[:, 1]
y_prob_dt = model_dt.predict_proba(X_test)[:, 1]
y_prob_xgb = model_xgb.predict_proba(X_test)[:, 1]

# Calculate the expected cost for different decision thresholds for each model
thresholds = np.linspace(0, 1, 100)



# Random Forest
expected_costs_rf = []
for threshold in thresholds:
    y_pred_rf = (y_prob_rf > threshold).astype(int)
    fp_rf = np.sum((y_test == 0) & (y_pred_rf == 1))
    tp_rf = np.sum((y_test == 1) & (y_pred_rf == 1))
    expected_cost_rf = fp_rf * cost - tp_rf * benefit
    expected_costs_rf.append(expected_cost_rf)

# Decision Tree
expected_costs_dt = []
for threshold in thresholds:
    y_pred_dt = (y_prob_dt > threshold).astype(int)
    fp_dt = np.sum((y_test == 0) & (y_pred_dt == 1))
    tp_dt = np.sum((y_test == 1) & (y_pred_dt == 1))
    expected_cost_dt = fp_dt * cost - tp_dt * benefit
    expected_costs_dt.append(expected_cost_dt)

# XGBoost
expected_costs_xgb = []
for threshold in thresholds:
    y_pred_xgb = (y_prob_xgb > threshold).astype(int)
    fp_xgb = np.sum((y_test == 0) & (y_pred_xgb == 1))
    tp_xgb = np.sum((y_test == 1) & (y_pred_xgb == 1))
    expected_cost_xgb = fp_xgb * cost - tp_xgb * benefit
    expected_costs_xgb.append(expected_cost_xgb)

# Find the threshold with the minimum expected cost for each model


min_cost_rf = min(expected_costs_rf)
min_threshold_rf = thresholds[expected_costs_rf.index(min_cost_rf)]

min_cost_dt = min(expected_costs_dt)
min_threshold_dt = thresholds[expected_costs_dt.index(min_cost_dt)]

min_cost_xgb = min(expected_costs_xgb)
min_threshold_xgb = thresholds[expected_costs_xgb.index(min_cost_xgb)]

# Print the minimum expected costs and optimal decision thresholds for each model

print('Random Forest:')
print(f'Minimum Expected Cost: {min_cost_rf}')
print(f'Optimal Decision Threshold: {min_threshold_rf}\n')

print('Decision Tree:')
print(f'Minimum Expected Cost: {min_cost_dt}')
print(f'Optimal Decision Threshold: {min_threshold_dt}\n')

print('XGBoost:')
print(f'Minimum Expected Cost: {min_cost_xgb}')
print(f'Optimal Decision Threshold: {min_threshold_xgb}\n')



In [None]:
# Conclusion

In [None]:
import pandas as pd

# Create a dictionary with the model metrics
model_metrics = {
    'Model': ['Logistic Regression', 'Random Forest', 'Decision Tree', 'XGBoost'],
    'Accuracy': [accuracy, accuracy_rf, accuracy_dt, accuracy_xgb],
    'Optimal Threshold': [min_threshold, min_threshold_rf, min_threshold_dt, min_threshold_xgb]
}

# Create a DataFrame from the dictionary
metrics_table = pd.DataFrame(model_metrics)

# Print the table
print(metrics_table)


Based on the EDA and on modeling under the cost-benefit constraints given in the question,
here are the best models:

Logistic Regression: 77% accuracy, threshold is 75%

XGBoost: 72% accuracy, threshold is 77%
    