Hey Suweatha! Here is the code for the Random Forest model. Esther had a tutoring session on Thursday, where she reduced *q_score_tier* to 2 bins and used that as her y (output) value. This increased her model's accuracy to 58%!

So, to fit the purpose of our model, I did the same thing with *accepted_answer_duration* and used that as our output.

*accepted_answer_duration* was reduced to 2 bins, and the bin edges were changed from [less than an hour, less than a day, more than a day] to [less than a day, more than a day] to better distribute the data across the bins.

This made the model run way faster and increased accuracy to the same level as Esther's.

So I think our next steps are:

<b>Feature selection</b>
- I dropped the *q_hour_min* column as it was redundant with the *q_hour* column.
- To fit our story, we have to decide whether to keep the *q_score_tier* and *q_view_count* columns. Removing them drops our accuracy by about 3 or 4%, but if we're using the model as a response time predictor, then at the time a user posts a question the *q_score_tier* and *q_view_count* will be 0.
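
To make the keep-or-drop comparison easy to rerun, here is a minimal sketch of how the score and view-count dummies could be removed in one step. The helper name `drop_post_hoc_features` and the toy frame are made up for illustration; the prefixes are taken from the encoded column names shown further down in this notebook.

```python
import pandas as pd

def drop_post_hoc_features(X, prefixes=("q_score_tier", "q_view_count")):
    """Return X without columns whose names start with any of the prefixes.

    Hypothetical helper: these are the features that are not known yet
    at the moment a question is posted.
    """
    keep = [col for col in X.columns if not col.startswith(tuple(prefixes))]
    return X[keep]

# Toy stand-in for the encoded feature matrix (values are illustrative only)
X_demo = pd.DataFrame({
    "q_body_word_count": [217, 47],
    "q_hour": [0, 0],
    "q_score_tier_Zero Score (0)": [1, 1],
    "q_view_count_bin_30-40": [1, 0],
})

X_post_time = drop_post_hoc_features(X_demo)
print(list(X_post_time.columns))
```

Fitting the same classifier once on `X` and once on the reduced frame would give the two accuracy numbers side by side.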

<b>Increase Data</b>
- We can try running the model with a larger data set to see if accuracy improves further
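
Before pulling more data, one way to guess whether it would help is a learning curve on the data we already have. The sketch below is only an illustration under some assumptions: it uses synthetic data from `make_classification` as a stand-in for our table, and scikit-learn's plain `RandomForestClassifier` instead of the balanced variant, so that it runs with scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic, imbalanced stand-in data (not the Stack Overflow table)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=1
)

# Score the model on growing fractions of the training folds
train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=20, random_state=1),
    X_demo,
    y_demo,
    train_sizes=[0.2, 0.5, 1.0],
    cv=3,
    scoring="balanced_accuracy",
)

mean_test = test_scores.mean(axis=1)
for n_rows, score in zip(train_sizes, mean_test):
    print(f"{int(n_rows)} training rows -> balanced accuracy {score:.3f}")
```

If the validation score is still climbing at the largest size, more rows are likely worth the query time; if it has flattened, better features are probably the bigger win.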



## Import dependencies and read csv

I saved the df that Nisha and Suweatha queried into a CSV and imported it into this notebook so our analysis can be a little cleaner.
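
A side note on the CSV round trip: an `Unnamed: 0` column usually appears when a DataFrame is saved with its index and then read back. The sketch below shows that effect on a tiny made-up frame using in-memory buffers; it is an illustration, not how the original file was written.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"q_id": [67341742, 67341817], "q_hour": [0, 0]})

# Default to_csv writes the index as an extra, unnamed first column
buf_with_index = io.StringIO()
df_demo.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with_index = pd.read_csv(buf_with_index).columns.tolist()

# index=False keeps the round trip clean
buf_no_index = io.StringIO()
df_demo.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
round_trip = pd.read_csv(buf_no_index)

print(cols_with_index)
print(round_trip.columns.tolist())
```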

In [170]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from pathlib import Path
from collections import Counter
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import confusion_matrix
from imblearn.metrics import classification_report_imbalanced
from sklearn.ensemble import RandomForestClassifier

In [171]:
file_path='Resources/stack_overflow.csv'
df=pd.read_csv(file_path)

In [172]:
df.head()

| | Unnamed: 0 | q_id | q_score_tier | q_title_char_count_bin | q_title_word_count_bin | q_view_count_bin | q_body_word_count | q_body_len_bin | q_tags_count | accepted_answer_id | q_day | q_hour | q_hour_min | accepted_answer_duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 67341742 | Zero Score (0) | Medium (50-100) | Medium (10-20) | 30-40 | 217 | 100-250 | 4 | 67341801 | Saturday | 0 | 00:03 | 0.192894 |
| 1 | 1 | 67341817 | Zero Score (0) | Short (0 - 50) | Short (0 - 10) | 40-50 | 47 | <50 | 4 | 67341857 | Saturday | 0 | 00:17 | 0.149909 |
| 2 | 2 | 67341895 | Positive Score (>0) | Medium (50-100) | Medium (10-20) | 40-50 | 423 | 250-500 | 2 | 67341911 | Saturday | 0 | 00:32 | 0.076868 |
| 3 | 3 | 67341936 | Positive Score (>0) | Medium (50-100) | Medium (10-20) | 50-16000 | 243 | 100-250 | 1 | 67341961 | Saturday | 0 | 00:41 | 0.08895 |
| 4 | 4 | 67341921 | Positive Score (>0) | Long (100-150) | Medium (10-20) | 40-50 | 57 | 50-100 | 2 | 67341974 | Saturday | 0 | 00:38 | 0.166569 |


## Data Preprocessing

- Dropped null values and dropped unnecessary columns
- Binned data in accepted_answer_duration

In [173]:
df=df.dropna()

In [174]:
#drop identification columns; also drop q_hour_min and q_body_len_bin, as they are redundant with other columns

df=df.drop(['q_id','accepted_answer_id','Unnamed: 0','q_hour_min','q_body_len_bin'], axis=1)
df.head()

| | q_score_tier | q_title_char_count_bin | q_title_word_count_bin | q_view_count_bin | q_body_word_count | q_tags_count | q_day | q_hour | accepted_answer_duration |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Zero Score (0) | Medium (50-100) | Medium (10-20) | 30-40 | 217 | 4 | Saturday | 0 | 0.192894 |
| 1 | Zero Score (0) | Short (0 - 50) | Short (0 - 10) | 40-50 | 47 | 4 | Saturday | 0 | 0.149909 |
| 2 | Positive Score (>0) | Medium (50-100) | Medium (10-20) | 40-50 | 423 | 2 | Saturday | 0 | 0.076868 |
| 3 | Positive Score (>0) | Medium (50-100) | Medium (10-20) | 50-16000 | 243 | 1 | Saturday | 0 | 0.08895 |
| 4 | Positive Score (>0) | Long (100-150) | Medium (10-20) | 40-50 | 57 | 2 | Saturday | 0 | 0.166569 |


In [181]:
#bin accepted_answer_duration

answer_bins = [0, 24, 3000]
answer_bins_group_names = ["<1D", ">1D"]

# Categorize answer duration (in hours) based on the bins.
df['accepted_answer_duration_bin'] = pd.cut(df['accepted_answer_duration'], answer_bins, labels=answer_bins_group_names)
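
One thing to watch with `pd.cut`: by default the bins are open on the left and closed on the right, so with edges `[0, 24, 3000]` a duration of exactly 0 or anything above 3000 hours would come out as NaN. I have not checked whether our data contains such values, so treat the following as a sketch on made-up durations. It compares the current edges with a variant that uses `include_lowest=True` and an open-ended upper edge.

```python
import numpy as np
import pandas as pd

# Illustrative durations in hours, chosen to hit the bin edges
durations = pd.Series([0.0, 0.19, 24.0, 24.5, 5000.0])
labels = ["<1D", ">1D"]

# Same edges as in the cell above: (0, 24] and (24, 3000]
narrow = pd.cut(durations, [0, 24, 3000], labels=labels)

# Variant: [0, 24] and (24, inf), so no value falls outside the bins
wide = pd.cut(durations, [0, 24, np.inf], labels=labels, include_lowest=True)

print(pd.DataFrame({"hours": durations, "narrow": narrow, "wide": wide}))
print("unbinned with narrow edges:", int(narrow.isna().sum()))
```

If `df['accepted_answer_duration_bin'].isna().sum()` is 0 on the real data, the current edges are fine as they are.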

In [180]:
df.head()

| | q_score_tier | q_title_char_count_bin | q_title_word_count_bin | q_view_count_bin | q_body_word_count | q_tags_count | q_day | q_hour | accepted_answer_duration | accepted_answer_duration_bin |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Zero Score (0) | Medium (50-100) | Medium (10-20) | 30-40 | 217 | 4 | Saturday | 0 | 0.192894 | <1D |
| 1 | Zero Score (0) | Short (0 - 50) | Short (0 - 10) | 40-50 | 47 | 4 | Saturday | 0 | 0.149909 | <1D |
| 2 | Positive Score (>0) | Medium (50-100) | Medium (10-20) | 40-50 | 423 | 2 | Saturday | 0 | 0.076868 | <1D |
| 3 | Positive Score (>0) | Medium (50-100) | Medium (10-20) | 50-16000 | 243 | 1 | Saturday | 0 | 0.08895 | <1D |
| 4 | Positive Score (>0) | Long (100-150) | Medium (10-20) | 40-50 | 57 | 2 | Saturday | 0 | 0.166569 | <1D |


## Create features and encode our features using pd.get_dummies

In [158]:
# Create our features
X = df.drop(['accepted_answer_duration','accepted_answer_duration_bin'], axis=1)
X = pd.get_dummies(X)

# Create our target
y = df["accepted_answer_duration_bin"]

X.head()

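| | q_body_word_count | q_tags_count | q_hour | q_score_tier_Negative Score (<0) | q_score_tier_Positive Score (>0) | q_score_tier_Zero Score (0) | q_title_char_count_bin_Long (100-150) | q_title_char_count_bin_Medium (50-100) | q_title_char_count_bin_Short (0 - 50) | q_title_word_count_bin_Long (20-30) | ... | q_view_count_bin_40-50 | q_view_count_bin_50-16000 | q_view_count_bin_<10 | q_day_Friday | q_day_Monday | q_day_Saturday | q_day_Sunday | q_day_Thursday | q_day_Tuesday | q_day_Wednesday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 217 | 4 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 47 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 423 | 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 243 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 57 | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |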


In [178]:
# Check the balance of our target values
y.value_counts()

<1D    182709
>1D     26711
Name: accepted_answer_duration_bin, dtype: int64
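
These counts are quite imbalanced, which is why the balanced accuracy score is the number to watch later on. As a quick worked check using only the counts printed above, the sketch below computes each class's share and the plain accuracy a model would get by always predicting the majority class.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"<1D": 182709, ">1D": 26711})

shares = counts / counts.sum()
majority_baseline = shares.max()  # accuracy of always predicting "<1D"

print(shares.round(4))
print(f"always-predict-majority accuracy: {majority_baseline:.4f}")
```

So a plain accuracy of roughly 87% would say nothing about how well the ">1D" questions are identified.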

In [169]:
X.dtypes

q_body_word_count                         int64
q_tags_count                              int64
q_hour                                    int64
q_score_tier_Negative Score (<0)          uint8
q_score_tier_Positive Score (>0)          uint8
q_score_tier_Zero Score (0)               uint8
q_title_char_count_bin_Long (100-150)     uint8
q_title_char_count_bin_Medium (50-100)    uint8
q_title_char_count_bin_Short (0 - 50)     uint8
q_title_word_count_bin_Long (20-30)       uint8
q_title_word_count_bin_Medium (10-20)     uint8
q_title_word_count_bin_Short (0 - 10)     uint8
q_title_word_count_bin_XL (30+)           uint8
q_view_count_bin_10-20                    uint8
q_view_count_bin_20-30                    uint8
q_view_count_bin_30-40                    uint8
q_view_count_bin_40-50                    uint8
q_view_count_bin_50-16000                 uint8
q_view_count_bin_<10                      uint8
q_day_Friday                              uint8
q_day_Monday                            

## Split data into training and testing sets

In [160]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=1,stratify=y)

## Fit model: Balanced Random Forest Classifier

In [161]:
# Resample the training data with the BalancedRandomForestClassifier
from imblearn.ensemble import BalancedRandomForestClassifier

brfc = BalancedRandomForestClassifier(n_estimators=100,random_state=1)
rf = brfc.fit(X_train,y_train)

## Calculate Accuracy

In [162]:
# Calculate the balanced accuracy score
y_pred=rf.predict(X_test)
ba_balanced_forest=balanced_accuracy_score(y_test,y_pred)
ba_balanced_forest

0.5820849127055627

## Display Confusion Matrix


In [163]:
# Display the confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test,y_pred)

cm_df=pd.DataFrame(cm,
                  index=["Actual <1D", "Actual >1D"],
                  columns=["Predicted <1D", "Predicted >1D"])
cm_df

| | Predicted <1D | Predicted >1D |
|---|---|---|
| Actual <1D | 27273 | 18404 |
| Actual >1D | 2891 | 3787 |
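
As a sanity check, the balanced accuracy can be recomputed by hand from this matrix: it is the mean of the per-class recalls. The sketch below uses only the four counts shown above (rows are actual classes, columns are predicted classes) and also derives plain accuracy for comparison.

```python
import numpy as np

# Confusion matrix copied from the output above
# rows: actual <1D, actual >1D; columns: predicted <1D, predicted >1D
cm = np.array([[27273, 18404],
               [2891, 3787]])

# Recall per class = correct predictions / number of actual samples
recall_per_class = cm.diagonal() / cm.sum(axis=1)
balanced_accuracy = recall_per_class.mean()

# Plain accuracy = all correct predictions / all samples
plain_accuracy = cm.diagonal().sum() / cm.sum()

print("recall per class:", recall_per_class.round(4))
print(f"balanced accuracy: {balanced_accuracy:.4f}")
print(f"plain accuracy:    {plain_accuracy:.4f}")
```

The hand-computed value agrees with the `balanced_accuracy_score` result, and the per-class recalls show the model is only slightly better than a coin flip on both classes.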


## Print additional scores for analysis: precision, recall, and f1

In [164]:
#imbalanced classification report
icr_balanced_forest=classification_report_imbalanced(y_test,y_pred)

In [165]:
#Summary of findings

print(f'For the Balanced Random Forest Classifier algorithm, the balanced accuracy score is {ba_balanced_forest}'
      f'\n\nand the imbalanced classification report is:\n\n{icr_balanced_forest}')

For the Balanced Random Forest Classifier algorithm, the balanced accuracy score is 0.5820849127055627

and the imbalanced classification report is:

                   pre       rec       spe        f1       geo       iba       sup

        <1D       0.90      0.60      0.57      0.72      0.58      0.34     45677
        >1D       0.17      0.57      0.60      0.26      0.58      0.34      6678

avg / total       0.81      0.59      0.57      0.66      0.58      0.34     52355



## Feature Importance

In [166]:
# List the features sorted in descending order by feature importance
sorted(zip(rf.feature_importances_, X.columns), reverse=True)

[(0.4285449857805372, 'q_body_word_count'),
 (0.2794148492143895, 'q_hour'),
 (0.08147420331883974, 'q_tags_count'),
 (0.031159933834605927, 'q_view_count_bin_50-16000'),
 (0.012315110916460115, 'q_day_Thursday'),
 (0.012081080222742967, 'q_day_Wednesday'),
 (0.012069963956658549, 'q_day_Tuesday'),
 (0.011358344039055977, 'q_day_Monday'),
 (0.010682326522940176, 'q_day_Saturday'),
 (0.010539010477978826, 'q_score_tier_Zero Score (0)'),
 (0.010185732758723863, 'q_day_Friday'),
 (0.010184626515897244, 'q_view_count_bin_20-30'),
 (0.009862921645079188, 'q_day_Sunday'),
 (0.009589839182411899, 'q_title_word_count_bin_Medium (10-20)'),
 (0.009213119941429863, 'q_title_word_count_bin_Short (0 - 10)'),
 (0.008910597743588577, 'q_score_tier_Positive Score (>0)'),
 (0.008851994686586302, 'q_view_count_bin_30-40'),
 (0.008540291706560288, 'q_title_char_count_bin_Medium (50-100)'),
 (0.007461209575381232, 'q_title_char_count_bin_Short (0 - 50)'),
 (0.007452117722385542, 'q_score_tier_Negative Sco