# Modeling
## Predicting If the Question Was Answered

To better understand the impact of how the question was asked, I built a model to see how well I could predict if the question was answered. In this notebook, I grid-search over two sampled dataframes with several models.

In the unsampled model from the previous notebook, I grid-searched over the whole dataframe using `questions_body` and `questions_score` as features. Because the classes were highly unbalanced (a 98% baseline accuracy), I decided to sample the data to even out the classes and test whether I could still predict if a question was answered. The first dataframe in this notebook is that sampled data set with both features. The second is the same sampled data set with only `questions_body` as a feature, since I hypothesized that `questions_score` was a strong indicator of whether the question was answered.

For the first dataframe, I grid-searched over Logistic Regression, K-Nearest Neighbors, and Random Forest, with Random Forest providing the best results.

For the second dataframe, I only used Logistic Regression due to time constraints.

For the first model, I grid-searched over Logistic Regression and began a K-Nearest Neighbors search, but after about 8 hours I terminated the kernel, especially because Logistic Regression was already providing 99% test accuracy (against a baseline of 98%).

In [None]:
# General
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Sklearn packages
from sklearn.feature_extraction.text import CountVectorizer
# from sklearn.feature_extraction import stop_words, text
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, classification_report
# from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

%config InlineBackend.figure_format = 'retina'

#### Reading in DataFrame

In [2]:
df = pd.read_csv('./Datasets/cleaned_4_modeling.csv')

In [3]:
df.isnull().sum()

questions_id              0
questions_author_id       0
questions_date_added      0
questions_title           0
questions_body            0
questions_score           8
was_answered              0
answers_score           837
dtype: int64

For this classification model, we're not going to use the `answers_score` column, so I will drop it. Since only 8 rows out of about 52k have nulls in `questions_score`, dropping those rows shouldn't affect the predictions.

In [4]:
df.drop(columns='answers_score', inplace=True)

Dropping rows where there are nulls in question score

In [5]:
df.dropna(axis = 0, inplace=True)
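Because `answers_score` has already been dropped, `dropna` on the whole frame only removes the rows with a missing `questions_score`. If other columns could contain nulls that should be kept, one option is to restrict the check with `subset`. A small sketch on made-up data (the `toy` frame and its values are illustrative assumptions, not the notebook's data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is missing a score, row 2 is missing a body
toy = pd.DataFrame({
    "questions_body": ["a", "b", None],
    "questions_score": [1.0, np.nan, 2.0],
})

# Drop only the rows where questions_score is missing; the null body in row 2 is kept
cleaned = toy.dropna(subset=["questions_score"])
print(len(toy), "->", len(cleaned))
```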

## Baseline Accuracy

Since most of the questions are answered, the classes are highly unbalanced, with a baseline accuracy of 98%.

In [6]:
print("baseline:", df['was_answered'].mean())
df['was_answered'].value_counts()

baseline: 0.9841920825631546


1.0    51115
0.0      821
Name: was_answered, dtype: int64
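The baseline is the accuracy of always predicting the majority class. A minimal sketch, using a small made-up target rather than the real data, suggests that the class mean and scikit-learn's `DummyClassifier` (not used elsewhere in this notebook) give the same number when the majority class is 1:

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Toy target (hypothetical values): 8 answered, 2 unanswered
toy_y = pd.Series([1.0] * 8 + [0.0] * 2)

# Proportion of the positive class; equals the baseline only when 1 is the majority
baseline_from_mean = toy_y.mean()

# DummyClassifier ignores the features and always predicts the most frequent class
toy_X = np.zeros((len(toy_y), 1))  # placeholder features
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(toy_X, toy_y)
baseline_from_dummy = dummy.score(toy_X, toy_y)

print(baseline_from_mean, baseline_from_dummy)
```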

## Preprocessing

#### Transforming Data With FunctionTransformer

To format the data for modeling, I'm using `FunctionTransformer` to select the text column and the numeric column separately.

In [7]:
get_text_data = FunctionTransformer(lambda x: x['questions_body'], validate = False)
get_numeric_data = FunctionTransformer(lambda x: x[['questions_score',]], validate = False)
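To make the role of these selectors concrete, here is a minimal sketch on a toy frame. The frame, its values, and the named helper functions `select_text` and `select_numeric` are assumptions for illustration. The notebook itself uses lambdas, which behave the same way here.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Toy frame with the same column names as the notebook (values are made up)
toy = pd.DataFrame({
    "questions_body": ["how do I become a lawyer", "tips for premed"],
    "questions_score": [3.0, 1.0],
})

def select_text(frame):
    """Return the text column as a 1-D Series, the input CountVectorizer expects."""
    return frame["questions_body"]

def select_numeric(frame):
    """Return the score as a 2-D DataFrame, the input StandardScaler expects."""
    return frame[["questions_score"]]

text_selector = FunctionTransformer(select_text, validate=False)
numeric_selector = FunctionTransformer(select_numeric, validate=False)

text_out = text_selector.fit_transform(toy)
numeric_out = numeric_selector.fit_transform(toy)

print(text_out.shape)     # 1-D: one entry per row
print(numeric_out.shape)  # 2-D: rows x 1 column
```

Named functions are used in the sketch mainly because, unlike lambdas, they can be pickled if the fitted pipeline ever needs to be saved.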

## Modeling

### Sampling the Data In Order to Create Even Classes

Since the classes above are so unbalanced, I'm only taking a sample of the data where the question was answered. This creates a new baseline accuracy of about 52%, so we can actually model and test how much impact our features have on predicting whether the question is answered. (If we don't balance the classes, the model could just predict "was answered" every time and be 98% correct.)

Creating portioned dataframes with roughly even classes: the first 900 rows where the question was answered and every row where it was not.

In [8]:
df_was = df[df['was_answered']==1].head(900)
df_wasnt = df[df['was_answered']==0]

Concatenating the portioned dataframes above back into one dataframe to be used in modeling.

In [9]:
sample_df = pd.concat([df_was, df_wasnt])

Getting the new baseline accuracy score so we can compare how the models performed.

In [10]:
sample_df['was_answered'].mean()

0.5229517722254503
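Note that `head(900)` takes the first 900 answered questions in file order, which may not be representative if the CSV happens to be sorted (for example by date). One possible alternative, sketched here on a made-up frame, is to draw the rows at random with `DataFrame.sample`. The toy frame and the sample size of 4 are illustrative assumptions:

```python
import pandas as pd

# Hypothetical data: 9 answered questions, 3 unanswered
toy = pd.DataFrame({
    "questions_body": [f"question {i}" for i in range(12)],
    "was_answered": [1.0] * 9 + [0.0] * 3,
})

answered = toy[toy["was_answered"] == 1]
unanswered = toy[toy["was_answered"] == 0]

# Randomly draw answered rows instead of taking the first n;
# random_state makes the draw repeatable
answered_sample = answered.sample(n=4, random_state=42)

balanced = pd.concat([answered_sample, unanswered])
new_baseline = balanced["was_answered"].mean()
print(new_baseline)
```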

#### Instantiating X and y variables

In [11]:
y = sample_df['was_answered']
X = sample_df[['questions_body','questions_score']]

#### Train Test Split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   random_state=42)
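With a sample this small, a plain random split can leave the train and test sets with slightly different class proportions. A hedged sketch of one option, the `stratify` argument of `train_test_split`, on made-up data (the notebook's own split above does not use it):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 12 answered, 8 unanswered
toy_X = pd.DataFrame({"questions_score": range(20)})
toy_y = pd.Series([1.0] * 12 + [0.0] * 8)

# stratify keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, stratify=toy_y, random_state=42
)

print(y_tr.mean(), y_te.mean())
```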

## Logistic Regression Gridsearch

#### Building a Pipeline to Grid Search Using Standard Scaler, Count Vectorizer

In [13]:
pipe_logreg = Pipeline([
    ('features', FeatureUnion([
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('ss', StandardScaler())
            ])),
            ('text_features', Pipeline([
                ('selector', get_text_data),
                ('cvec', CountVectorizer())
            ]))
    ])),
    ('logreg', LogisticRegression(solver='liblinear'))
])

params = {
           'logreg__penalty' : ['l1', 'l2']
}

gs = GridSearchCV(pipe_logreg, params, cv=5)

gs.fit(X_train, y_train)
print("train score", gs.score(X_train, y_train))
print("test score", gs.score(X_test, y_test))
print("best params:", gs.best_params_)

train score 0.9418604651162791
test score 0.8097447795823666
best params: {'logreg__penalty': 'l1'}
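The train score above comes from scoring the refit best model on data it has already seen, so it tends to be optimistic. The mean cross-validated score of the best parameter setting is available as `best_score_`. A minimal sketch on synthetic data (the dataset from `make_classification` is a stand-in, not the notebook's data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

toy_gs = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    {"penalty": ["l1", "l2"]},
    cv=5,
)
toy_gs.fit(X_tr, y_tr)

# best_score_ is averaged over held-out folds, so it is usually closer
# to the test score than the train score is
print("mean CV score of best params:", toy_gs.best_score_)
print("train score:", toy_gs.score(X_tr, y_tr))
print("test score:", toy_gs.score(X_te, y_te))
```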


## KNN Gridsearch

In [14]:
pipe_knn = Pipeline([
    ('features', FeatureUnion([
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('ss', StandardScaler())
            ])),
            ('text_features', Pipeline([
                ('selector', get_text_data),
                ('cvec', CountVectorizer())
            ]))
    ])),
    ('knn', KNeighborsClassifier())

])

params = {
    'knn__n_neighbors' : [3, 5, 10, 15, 20],
#     'knn__metric': ['euclidean', 'manhattan']  #Takes a while to run

}

gs = GridSearchCV(pipe_knn, params, cv=5)

gs.fit(X_train, y_train)
print("train score:", gs.score(X_train, y_train))
print("test score", gs.score(X_test, y_test))
print("best params:", gs.best_params_)

train score: 0.8992248062015504
test score 0.7587006960556845
best params: {'knn__n_neighbors': 3}


## Random Forest Gridsearch

In [15]:
pipe_rf = Pipeline([
    ('features', FeatureUnion([
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('ss', StandardScaler())
            ])),
            ('text_features', Pipeline([
                ('selector', get_text_data),
                ('cvec', CountVectorizer())
            ]))
    ])),
    ('rf', RandomForestClassifier(random_state=42))
])

params = {
    'rf__n_estimators': [100, 125],
    'rf__max_depth': [None, 4, 5, 6],
    # 'auto' means sqrt(n_features) for classifiers; newer scikit-learn versions
    # removed it, so 'sqrt' may be needed there instead
    'rf__max_features': [None, "auto"]
}

gs = GridSearchCV(pipe_rf, params, cv=5)

gs.fit(X_train, y_train)
print("train score:", gs.score(X_train, y_train))
print("test score:", gs.score(X_test, y_test))
print("best params:", gs.best_params_)

train score: 1.0
test score: 0.8793503480278422
best params: {'rf__max_depth': None, 'rf__max_features': 'auto', 'rf__n_estimators': 125}


#### Defining a Function that Returns a Confusion Matrix as a DataFrame

A confusion matrix breaks the predictions down into correct and incorrect calls for each class, showing where the model is accurate and where it errs. The confusion matrix below shows results from the Random Forest model, which had the best prediction results.

In [16]:
def make_confusion(y_test, preds, classes):
    """Print accuracy, precision, and recall, and return the confusion matrix as a DataFrame."""
    conmat = confusion_matrix(y_test, preds)
    print(f'Accuracy Score: {accuracy_score(y_test, preds)}')
    print(f'Precision Score: {precision_score(y_test, preds)}')
    print(f'Recall Score: {recall_score(y_test, preds)}')
    # Rows are actual classes, columns are predicted classes
    return pd.DataFrame(conmat,
                        columns=['Predicted ' + class_ for class_ in classes],
                        index=['Actual ' + class_ for class_ in classes])

#### Calling `make_confusion` to Get Accuracy, Precision, and Recall Scores and the Confusion Matrix

In [17]:
# Predict on the test set with the best Random Forest pipeline, then print the scores and confusion matrix
preds = gs.best_estimator_.predict(X_test)

make_confusion(y_test, preds, ["wasn't answered", "was answered"])

Accuracy Score: 0.8793503480278422
Precision Score: 0.9013452914798207
Recall Score: 0.8701298701298701


| | Predicted wasn't answered | Predicted was answered |
|---|---|---|
| Actual wasn't answered | 178 | 22 |
| Actual was answered | 30 | 201 |
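The three scores printed above follow directly from the four counts in the confusion matrix. A short worked sketch, with the counts copied from the output above (the variable names are just for illustration):

```python
# Counts read off the confusion matrix above (rows = actual, columns = predicted)
tn, fp = 178, 22   # actual "wasn't answered"
fn, tp = 30, 201   # actual "was answered"

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the questions predicted answered, how many were
recall = tp / (tp + fn)      # of the questions actually answered, how many were found

print(accuracy, precision, recall)
```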


---

After successfully modeling whether a question will be answered based on question body and score, I wanted to know how well we could predict using the question body alone, since score was a likely tell. Below I model with a pipeline, grid-searching over logistic regression. Keep in mind that this still uses the sampled data with balanced classes.

Setting up the X and y variables, with only `questions_body` used as a predictor for y.

In [18]:
y = sample_df['was_answered']
X = sample_df['questions_body']
X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   random_state=42)

## Logistic Regression Gridsearch

In [29]:
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('logreg', LogisticRegression(solver = 'lbfgs'))
])

params = {
    'cvec__stop_words' : [None, 'english'],
    'logreg__penalty' : ['none','l2'],  # newer scikit-learn versions expect None instead of the string 'none'
    'cvec__max_features': [2000, 3000, 4000, 5000],
    'cvec__ngram_range': [(1,1), (1,2)]
}


gs = GridSearchCV(pipe, # what object are we optimizing?
                  params, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.
gs.fit(X_train, y_train)
preds = gs.best_estimator_.predict(X_test)
gs_model = gs.best_estimator_
print("train score:", gs_model.score(X_train, y_train))
print("test score:", gs_model.score(X_test, y_test))



train score: 0.9674418604651163
test score: 0.8283062645011601


#### Words Most Indicative of the Question Being Answered
Below I set up and output a dataframe of the coefficients (the words most indicative of questions being answered or not answered).

In [30]:
coefs = gs.best_estimator_.named_steps['logreg'].coef_[0]
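# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions
features = gs.best_estimator_.named_steps['cvec'].get_feature_names()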
coef_df = pd.DataFrame({'features' : features,
             'coefficients': coefs})
coef_df.sort_values('coefficients', ascending = False)


| | features | coefficients |
|---|---|---|
| 656 | csusmfreshman | 1.916582 |
| 952 | entrepreneurship | 1.404943 |
| 1291 | hard | 1.318574 |
| 19 | accounting | 1.316048 |
| 1619 | lawyer | 1.252850 |
| ... | ... | ... |
| 2002 | physical | -0.945298 |
| 48 | admissions | -1.022460 |
| 73 | aerospace | -1.123089 |
| 2720 | tips | -1.138122 |


Words most indicative of the question being answered:

In [27]:
coef_df.sort_values('coefficients', ascending = False).head(20)

| | features | coefficients |
|---|---|---|
| 656 | csusmfreshman | 1.916582 |
| 952 | entrepreneurship | 1.404943 |
| 1291 | hard | 1.318574 |
| 19 | accounting | 1.316048 |
| 1619 | lawyer | 1.25285 |
| 2460 | social | 1.214599 |
| 1756 | military | 1.176548 |
| 1521 | interviews | 1.148334 |
| 566 | computer | 1.103817 |
| 1730 | media | 1.083464 |


Words most indicative of the question NOT being answered:

In [28]:
coef_df.sort_values('coefficients', ascending = False).tail(20)

| | features | coefficients |
|---|---|---|
| 643 | criminaljustice | -0.757701 |
| 2068 | premed | -0.76554 |
| 1951 | pathology | -0.780744 |
| 455 | choice | -0.793528 |
| 1759 | minor | -0.802987 |
| 567 | computers | -0.806896 |
| 1959 | paying | -0.830597 |
| 1323 | help | -0.833407 |
| 334 | biochemistry | -0.86759 |
| 512 | colleges | -0.898431 |
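One way to read these coefficients: in logistic regression, `exp(coefficient)` is the factor by which the odds of the positive class (here, being answered) are multiplied for each additional occurrence of the word, holding the other word counts fixed. A short sketch using three values copied from the output above; the `example_coefs` frame and the `odds_ratio` column are names introduced here for illustration:

```python
import numpy as np
import pandas as pd

# Coefficients copied from the output above
example_coefs = pd.DataFrame({
    "features": ["csusmfreshman", "lawyer", "colleges"],
    "coefficients": [1.916582, 1.252850, -0.898431],
})

# exp(coef) > 1 raises the odds of "was answered"; exp(coef) < 1 lowers them
example_coefs["odds_ratio"] = np.exp(example_coefs["coefficients"])
print(example_coefs)
```

Since the model was fit on a sample of roughly 1,700 questions, these values are probably best treated as rough signals rather than precise effects.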


### Thank You!!