Hey Suweatha! Here is the code for the Random Forest model. Esther had a tutoring session on Thursday where she reduced *q_score_tier* to 2 bins and used that as her y (output) value, which increased the model's accuracy to 58%!

So, to fit the purpose of our model, I did the same thing with *accepted_answer_duration* and used that as our output.

*accepted_answer_duration* was reduced to 2 bins, and the bin edges were changed from [less than an hour, less than a day, more than a day] to [less than a day, more than a day] to better distribute the data across the bins.

This made the model run way faster and increased the accuracy to about the same as Esther's.

So I think our next steps are:

<b>Feature selection</b>
- I dropped the *q_hour_min* column as it was redundant with the *q_hour* column.
- To fit our story, we have to decide whether to keep the *q_score_tier* and *q_view_count* columns. Removing them drops our accuracy by about 3 or 4%, but if we're using the model as a response time predictor, then at the time a user posts a question the *q_score_tier* and *q_view_count* will both be 0.
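
To make the keep-or-drop decision easier to test, here is a minimal sketch of a helper that strips the columns not known at posting time from a one-hot-encoded feature matrix. The helper name `drop_post_hoc`, the prefix tuple, and the toy frame are assumptions for illustration only; they are not part of the notebook.

```python
import pandas as pd

# Assumed prefixes of columns whose values are unknown when a question is posted
POST_HOC_PREFIXES = ("q_score_tier", "q_view_count")

def drop_post_hoc(X, prefixes=POST_HOC_PREFIXES):
    """Return X without any column whose name starts with one of the prefixes."""
    keep = [col for col in X.columns if not col.startswith(prefixes)]
    return X[keep]

# Hypothetical one-hot-encoded features, shaped like the output of pd.get_dummies
X_toy = pd.DataFrame({
    "q_body_word_count": [116, 58],
    "q_hour": [0, 0],
    "q_score_tier_Positive Score (>0)": [1, 1],
    "q_view_count_bin_40-50": [0, 1],
})

X_at_post_time = drop_post_hoc(X_toy)
print(X_at_post_time.columns.tolist())
```

Fitting the same model once on the full matrix and once on the reduced one would show how much accuracy really depends on those columns.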




## Import dependencies and read csv

I saved the df that Nisha and Suweatha queried into a CSV and imported it into this notebook so our analysis can be a little cleaner.

In [191]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from pathlib import Path
from collections import Counter
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import confusion_matrix
from imblearn.metrics import classification_report_imbalanced
from sklearn.ensemble import RandomForestClassifier

In [192]:
file_path='Resources/ML_Input_Jan2021.csv'
df=pd.read_csv(file_path)

In [193]:
df.head()

Unnamed: 0,q_id,accepted_answer_id,q_score,q_score_tier,q_view_count,q_view_count_bin,q_title_char_count,q_title_char_count_bin,q_title_word_count,q_title_word_count_bin,q_body_word_count,q_body_len_bin,q_tags_count,q_day,q_hour,q_hour_min,accepted_answer_duration
0,65526420,65526457,2,Positive Score (>0),62,50-16000,72,Medium (50-100),13,Medium (10-20),116,100-250,3,Friday,0,00:05,0.122066
1,65526423,65526533,2,Positive Score (>0),48,40-50,48,Short (0 - 50),8,Short (0 - 10),58,50-100,2,Friday,0,00:06,0.475172
2,65526490,65526541,2,Positive Score (>0),35,30-40,81,Medium (50-100),13,Medium (10-20),117,100-250,2,Friday,0,00:20,0.287423
3,65526419,65526554,3,Positive Score (>0),351,50-16000,76,Medium (50-100),9,Short (0 - 10),50,<50,4,Friday,0,00:05,0.575997
4,65526523,65526577,2,Positive Score (>0),117,50-16000,82,Medium (50-100),14,Medium (10-20),305,250-500,3,Friday,0,00:30,0.253412


## Data Preprocessing

- Dropped null values and dropped unnecessary columns
- Binned data in accepted_answer_duration

In [194]:
# Drop the identification columns, plus q_hour_min, q_body_len_bin, q_score and q_view_count, which are redundant with other columns

df=df.drop(['q_id','accepted_answer_id','q_hour_min','q_body_len_bin','q_score','q_view_count'], axis=1)
df.head()

Unnamed: 0,q_score_tier,q_view_count_bin,q_title_char_count,q_title_char_count_bin,q_title_word_count,q_title_word_count_bin,q_body_word_count,q_tags_count,q_day,q_hour,accepted_answer_duration
0,Positive Score (>0),50-16000,72,Medium (50-100),13,Medium (10-20),116,3,Friday,0,0.122066
1,Positive Score (>0),40-50,48,Short (0 - 50),8,Short (0 - 10),58,2,Friday,0,0.475172
2,Positive Score (>0),30-40,81,Medium (50-100),13,Medium (10-20),117,2,Friday,0,0.287423
3,Positive Score (>0),50-16000,76,Medium (50-100),9,Short (0 - 10),50,4,Friday,0,0.575997
4,Positive Score (>0),50-16000,82,Medium (50-100),14,Medium (10-20),305,3,Friday,0,0.253412


In [195]:
df=df.dropna()

In [196]:
#bin accepted_answer_duration

answer_bins = [0, 24, 6000]
answer_bins_group_names = ["<1D", ">1D"]

# Categorize the answer duration (in hours) based on the bins.
df['accepted_answer_duration_bin'] = pd.cut(df['accepted_answer_duration'], answer_bins, labels=answer_bins_group_names)
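
One thing worth checking about the binning cell above: `pd.cut` uses right-closed intervals by default, so with edges `[0, 24, 6000]` a duration of exactly 0 or anything above 6000 hours becomes `NaN`. The sketch below uses made-up durations to show that behavior, along with one possible alternative (an open-ended upper edge and `include_lowest=True`). The alternative is only an assumption about what we might want, not what the notebook currently does.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in hours, chosen to hit the bin edges
durations = pd.Series([0.0, 0.5, 24.0, 24.1, 7000.0])
labels = ["<1D", ">1D"]

# Same edges as the notebook: intervals are (0, 24] and (24, 6000]
fixed = pd.cut(durations, [0, 24, 6000], labels=labels)

# Assumed alternative: keep 0 in the first bin and leave the last bin open-ended
open_ended = pd.cut(durations, [0, 24, np.inf], labels=labels, include_lowest=True)

print(pd.DataFrame({"hours": durations, "fixed": fixed, "open_ended": open_ended}))
```

If the real data has no values at 0 or above 6000, both versions give the same labels.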

In [197]:
df.head()

Unnamed: 0,q_score_tier,q_view_count_bin,q_title_char_count,q_title_char_count_bin,q_title_word_count,q_title_word_count_bin,q_body_word_count,q_tags_count,q_day,q_hour,accepted_answer_duration,accepted_answer_duration_bin
0,Positive Score (>0),50-16000,72,Medium (50-100),13,Medium (10-20),116,3,Friday,0,0.122066,<1D
1,Positive Score (>0),40-50,48,Short (0 - 50),8,Short (0 - 10),58,2,Friday,0,0.475172,<1D
2,Positive Score (>0),30-40,81,Medium (50-100),13,Medium (10-20),117,2,Friday,0,0.287423,<1D
3,Positive Score (>0),50-16000,76,Medium (50-100),9,Short (0 - 10),50,4,Friday,0,0.575997,<1D
4,Positive Score (>0),50-16000,82,Medium (50-100),14,Medium (10-20),305,3,Friday,0,0.253412,<1D


In [222]:
df.dtypes

q_score_tier                      object
q_view_count_bin                  object
q_title_char_count                 int64
q_title_char_count_bin            object
q_title_word_count                 int64
q_title_word_count_bin            object
q_body_word_count                  int64
q_tags_count                       int64
q_day                             object
q_hour                             int64
accepted_answer_duration         float64
accepted_answer_duration_bin    category
dtype: object

## Create features and encode our features using pd.get_dummies

In [224]:
# Create our features
X = df.drop(['accepted_answer_duration','accepted_answer_duration_bin','q_title_char_count_bin'], axis=1)
X = pd.get_dummies(X)

# Create our target

y = df["accepted_answer_duration_bin"]

X.head()

Unnamed: 0,q_title_char_count,q_title_word_count,q_body_word_count,q_tags_count,q_hour,q_score_tier_Negative Score (<0),q_score_tier_Positive Score (>0),q_score_tier_Zero Score (0),q_view_count_bin_10-20,q_view_count_bin_20-30,...,q_title_word_count_bin_Medium (10-20),q_title_word_count_bin_Short (0 - 10),q_title_word_count_bin_XL (30+),q_day_Friday,q_day_Monday,q_day_Saturday,q_day_Sunday,q_day_Thursday,q_day_Tuesday,q_day_Wednesday
0,72,13,116,3,0,0,1,0,0,0,...,1,0,0,1,0,0,0,0,0,0
1,48,8,58,2,0,0,1,0,0,0,...,0,1,0,1,0,0,0,0,0,0
2,81,13,117,2,0,0,1,0,0,0,...,1,0,0,1,0,0,0,0,0,0
3,76,9,50,4,0,0,1,0,0,0,...,0,1,0,1,0,0,0,0,0,0
4,82,14,305,3,0,0,1,0,0,0,...,1,0,0,1,0,0,0,0,0,0
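
For anyone unfamiliar with the encoding step, here is a small sketch of what `pd.get_dummies` does to a mixed frame. The toy values are made up. Passing `dtype=int` is an optional choice, shown here because pandas 2.x returns boolean columns by default, while the output above shows 0/1.

```python
import pandas as pd

# Hypothetical rows with one numeric and two categorical columns
toy = pd.DataFrame({
    "q_tags_count": [3, 2, 4],
    "q_day": ["Friday", "Monday", "Friday"],
    "q_score_tier": ["Positive Score (>0)", "Zero Score (0)", "Positive Score (>0)"],
})

# Numeric columns pass through unchanged; each category becomes its own 0/1 column
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```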


In [225]:
# Check the balance of our target values
y.value_counts()

<1D    386440
>1D     60979
Name: accepted_answer_duration_bin, dtype: int64
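
The counts above are quite imbalanced, which is why plain accuracy would be misleading here. As a rough illustration, the sketch below takes the two counts from the output and works out what a model that always predicts `<1D` would score. The variable names are just for this example.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"<1D": 386440, ">1D": 60979})
shares = counts / counts.sum()

# A classifier that always predicts the majority class gets this plain accuracy
majority_baseline = shares.max()

# Its balanced accuracy is the mean of per-class recall: 1.0 on "<1D", 0.0 on ">1D"
balanced_baseline = (1.0 + 0.0) / 2

print(shares.round(3))
print(f"plain accuracy: {majority_baseline:.3f}, balanced accuracy: {balanced_baseline}")
```

Any balanced accuracy we report should be compared against that 0.5 baseline rather than against the majority share.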

## Split data to training and testing sets

In [226]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=1,stratify=y)
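
The `stratify=y` argument in the split above keeps the class proportions the same in the training and testing sets. A minimal sketch on synthetic labels (the 86/14 ratio is only an approximation of our data, and the arrays are made up) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels, roughly mirroring the notebook's class ratio
y_toy = np.array(["<1D"] * 860 + [">1D"] * 140)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1, stratify=y_toy)

def minority_share(labels):
    """Fraction of labels that belong to the '>1D' class."""
    return float(np.mean(labels == ">1D"))

# With no test_size given, train_test_split holds out 25% of the rows
print(len(y_te) / len(y_toy))
print(minority_share(y_tr), minority_share(y_te))
```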

## Fit model: Balanced Random Forest Classifier

In [227]:
# Fit a BalancedRandomForestClassifier, which undersamples the majority class for each tree
from imblearn.ensemble import BalancedRandomForestClassifier

brfc = BalancedRandomForestClassifier(n_estimators=100,random_state=1)
rf = brfc.fit(X_train,y_train)

## Calculate Accuracy

In [228]:
# Calculate the balanced accuracy score
y_pred=rf.predict(X_test)
ba_balanced_forest=balanced_accuracy_score(y_test,y_pred)
ba_balanced_forest

0.6093991256022591

## Display Confusion Matrix


In [203]:
# Display the confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test,y_pred)

cm_df=pd.DataFrame(cm,
                  index=["Actual <1D", "Actual >1D"],
                  columns=["Predicted <1D", "Predicted >1D"])
cm_df

Unnamed: 0,Predicted <1D,Predicted >1D
Actual <1D,59006,37604
Actual >1D,5971,9274
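
To connect the matrix above to the other scores, here is a short sketch that recomputes per-class recall, precision, and balanced accuracy directly from the four counts. It only uses the numbers shown in the table; the variable names are illustrative.

```python
import numpy as np

# Rows are actual classes (<1D, >1D), columns are predicted classes (<1D, >1D)
cm_counts = np.array([[59006, 37604],
                      [5971, 9274]])

# Recall: correct predictions divided by the number of actual members of each class
recall = np.diag(cm_counts) / cm_counts.sum(axis=1)

# Precision: correct predictions divided by the number of predictions of each class
precision = np.diag(cm_counts) / cm_counts.sum(axis=0)

# Balanced accuracy is the unweighted mean of the per-class recalls
balanced_accuracy = recall.mean()

print("recall:", recall.round(2))
print("precision:", precision.round(2))
print("balanced accuracy:", round(balanced_accuracy, 4))
```

The low precision on `>1D` reflects how many `<1D` questions get pushed into the `>1D` prediction, even though recall is similar for both classes.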


## Print additional scores for analysis: precision, recall, and f1

In [204]:
#imbalanced classification report
icr_balanced_forest=classification_report_imbalanced(y_test,y_pred)

In [205]:
#Summary of findings

print(f'For the Balanced Random Forest Classifier algorithm, the balanced accuracy score is {ba_balanced_forest}'
      f'\n\nand the imbalanced classification report is:\n\n{icr_balanced_forest}')

For the Balanced Random Forest Classifier algorithm, the balanced accuracy score is 0.6095477656816659

and the imbalanced classification report is:

                   pre       rec       spe        f1       geo       iba       sup

        <1D       0.91      0.61      0.61      0.73      0.61      0.37     96610
        >1D       0.20      0.61      0.61      0.30      0.61      0.37     15245

avg / total       0.81      0.61      0.61      0.67      0.61      0.37    111855



## Feature Importance
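
Because `pd.get_dummies` spreads each categorical column over several one-hot columns, the per-column importances in this section split a single feature like `q_day` into seven small entries. One possible way to compare features on equal footing is to sum the one-hot importances back onto their source column. The helper `group_importances`, the prefix list, and the toy numbers below are assumptions for illustration, not results from our model.

```python
import pandas as pd

def group_importances(importances, columns, categorical_prefixes):
    """Sum one-hot importances back onto the column they were encoded from."""
    scores = pd.Series(importances, index=columns)

    def source_column(col):
        # A dummy column is named "<source>_<category>"; numeric columns map to themselves
        for prefix in categorical_prefixes:
            if col.startswith(prefix + "_"):
                return prefix
        return col

    return scores.groupby(source_column).sum().sort_values(ascending=False)

# Hypothetical importances for a tiny encoded feature set
toy_columns = ["q_hour", "q_day_Friday", "q_day_Monday", "q_score_tier_Zero Score (0)"]
toy_importances = [0.5, 0.2, 0.2, 0.1]

grouped = group_importances(toy_importances, toy_columns, ["q_day", "q_score_tier"])
print(grouped)
```

With the real model this could be called as `group_importances(rf.feature_importances_, X.columns, [...])`, passing the names of the categorical columns that were encoded.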

In [206]:
# List the features sorted in descending order by feature importance
sorted(zip(rf.feature_importances_, X.columns), reverse=True)

[(0.2604004290132404, 'q_body_word_count'),
 (0.20090699970916734, 'q_title_char_count'),
 (0.18722323353459056, 'q_hour'),
 (0.10957955976280016, 'q_title_word_count'),
 (0.06166580680479658, 'q_tags_count'),
 (0.034553418156016695, 'q_view_count_bin_50-16000'),
 (0.011748948382487483, 'q_day_Thursday'),
 (0.011362179957643171, 'q_day_Wednesday'),
 (0.011200806516883125, 'q_day_Monday'),
 (0.011196568927768429, 'q_day_Tuesday'),
 (0.009913550507342432, 'q_day_Saturday'),
 (0.009549311123316995, 'q_score_tier_Zero Score (0)'),
 (0.00929931340128493, 'q_view_count_bin_30-40'),
 (0.009218759118217682, 'q_day_Sunday'),
 (0.009125899808521925, 'q_view_count_bin_20-30'),
 (0.0087868730630419, 'q_day_Friday'),
 (0.008286014622191301, 'q_score_tier_Positive Score (>0)'),
 (0.0066174062996257905, 'q_score_tier_Negative Score (<0)'),
 (0.005494437961565036, 'q_view_count_bin_40-50'),
 (0.004495789774098216, 'q_title_word_count_bin_Medium (10-20)'),
 (0.004318664164736397, 'q_title_char_count_bi