# Classification Modeling
# A.) Data Prep

In [1]:
# To hide warnings in notebooks

import warnings
warnings.filterwarnings('ignore')

### Loading the dataset into a Pandas dataframe

In [2]:
# Read in manually classified sentences CSV

import pandas as pd
pd.set_option('display.max_colwidth', 200)

man_class = pd.read_csv("CleanSentenceSentiment.csv", sep=",")
man_class

Unnamed: 0,ID,sentence,subjectivity,length,sentiment
0,266448,Giori agreed that surgeons should only proceed with surgical cases they feel comfortable with.,0.900000,94,0
1,35146,"If you're the doctor who reported the misbehavior, you're potentially opening a can of worms.",1.000000,93,-1
2,88447,"""This study is unique in including a very large number of patients, which enabled important subgroup analyses.""",0.852381,111,1
3,134291,"""The fact that we can't even do that suggests that we've gotten the balance wrong.""",0.900000,83,-1
4,291037,Her screams and the speeding truck prompted neighbors to report a possible abduction.,1.000000,85,-1
...,...,...,...,...,...
1995,156383,"Forrest concluded that: ""These results provide patients and heart teams important data to aid in the shared decision-making process.""",1.000000,133,0
1996,30453,"The Washington Post: ""School shootings rose to highest number in 20 years, federal data says.""",0.950000,94,-1
1997,74172,The research team found a significant survival benefit for patients receiving a living-donor liver transplant based on mortality risk and survival scores.,0.875000,154,1
1998,267749,"One reason for this imbalance is that the mRNA vaccines that have been so successful in wealthy nations are novel, expensive and technologically challenging to produce.",0.912500,168,-1


In [3]:
# Extract only sentence and polarity sentiment data from created dataframe

man_class = man_class[['sentence', 'sentiment']]
man_class

Unnamed: 0,sentence,sentiment
0,Giori agreed that surgeons should only proceed with surgical cases they feel comfortable with.,0
1,"If you're the doctor who reported the misbehavior, you're potentially opening a can of worms.",-1
2,"""This study is unique in including a very large number of patients, which enabled important subgroup analyses.""",1
3,"""The fact that we can't even do that suggests that we've gotten the balance wrong.""",-1
4,Her screams and the speeding truck prompted neighbors to report a possible abduction.,-1
...,...,...
1995,"Forrest concluded that: ""These results provide patients and heart teams important data to aid in the shared decision-making process.""",0
1996,"The Washington Post: ""School shootings rose to highest number in 20 years, federal data says.""",-1
1997,The research team found a significant survival benefit for patients receiving a living-donor liver transplant based on mortality risk and survival scores.,1
1998,"One reason for this imbalance is that the mRNA vaccines that have been so successful in wealthy nations are novel, expensive and technologically challenging to produce.",-1


In [4]:
# Check structure of dataframe

man_class.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentence   2000 non-null   object
 1   sentiment  2000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 31.4+ KB


### Reduce class imbalance in the data

In [5]:
# Obtain counts for categorical polarity sentiment values

man_class.sentiment.value_counts()

 0    1174
-1     504
 1     322
Name: sentiment, dtype: int64

In [6]:
# Cap the neutral class at the size of the negative class (504 sentences)

N = 504

In [7]:
# Randomly sample 504 records from the sentences classified as neutral for the unbiased dataset

man_class_neutral = man_class[man_class.sentiment == 0].sample(n=N, random_state=0)

In [8]:
# Pull all records from sentences classified as negative and positive for the unbiased dataset

man_class_positive = man_class[man_class.sentiment == 1]
man_class_negative = man_class[man_class.sentiment == -1]

In [9]:
# Join records to form unbiased dataset

man_class_unbiased = pd.concat([man_class_neutral, man_class_positive, man_class_negative], axis=0)
man_class_unbiased.sentiment.value_counts()

 0    504
-1    504
 1    322
Name: sentiment, dtype: int64
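The downsampling in cells 6 to 9 can also be written as one reusable helper. The sketch below is only an illustration: the function name `downsample_to_cap` and the toy dataframe are assumptions, not part of the original notebook. It reproduces the class counts shown above and makes it visible that the positive class (322) stays smaller than the other two, so the result is less imbalanced rather than fully balanced.

```python
import pandas as pd


def downsample_to_cap(df, label_col, cap, random_state=0):
    """Keep at most `cap` rows per class; classes smaller than `cap` are kept whole.

    Hypothetical helper for illustration only.
    """
    parts = []
    for _, group in df.groupby(label_col):
        n = min(len(group), cap)
        parts.append(group.sample(n=n, random_state=random_state))
    return pd.concat(parts, axis=0)


# Toy dataframe with the same class counts as man_class (placeholder sentences)
toy = pd.DataFrame({
    "sentence": [f"s{i}" for i in range(2000)],
    "sentiment": [0] * 1174 + [-1] * 504 + [1] * 322,
})

balanced = downsample_to_cap(toy, "sentiment", cap=504)
counts = balanced["sentiment"].value_counts().to_dict()
print(counts)
```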

### Cleaning the data

In [10]:
# Check for any sentences found in the dataframe more than one time

man_class_unbiased.sentence.value_counts()

Among the key findings: - 49.9% of patients on maintenance mirikizumab achieved clinical remission at one year compared with 25.1% of patients on placebo (P<0.001).            1
Katzer confirmed that MSF is still finding access to the northwest part of Syria extremely difficult, since there is only one open border crossing between Turkey and Syria.    1
Finding the Energy  Time was not the only factor that physicians said stole from their ability to maintain friendships.                                                         1
When he approached pharmaceutical companies, well before the pandemic, showing them his "beautiful vaccine data," he was told there would be no use for them, he said.          1
"We can now fully understand the devastating impact the virus had on communities of color across generations."                                                                  1
                                                                                                              

In [11]:
# Drop any duplicate sentences from the dataframe (compare on the sentence text only)
# No duplicates were found in the unbiased dataset, but the code is kept in case the data changes

man_class_unbiased = man_class_unbiased.drop_duplicates(subset="sentence", keep="first")
man_class_unbiased

Unnamed: 0,sentence,sentiment
667,Among the key findings: - 49.9% of patients on maintenance mirikizumab achieved clinical remission at one year compared with 25.1% of patients on placebo (P<0.001).,0
1343,"Since the law passed, clinics statewide have experienced its chilling effect, reporting that they have performed fewer abortions.",0
1818,Key Takeaway Esophagectomy for esophageal squamous cell carcinoma (ESCC) should be performed 6–8 weeks after the end of neoadjuvant chemotherapy.,0
1376,"There also wasn't a significant difference in distribution regarding adenomatous polyp location, size, or morphology.",0
1475,The absolute number of potentially preventable ED visits among cancer patients increased from about 1.8 million in 2012 to 3.2 million in 2019.,0
...,...,...
1983,The court has so far rejected the company's petitions to hear Roundup lawsuits.,-1
1986,The results showed that the median time on the waiting list was 6 months and that only 25% of patients eventually received CAR T-cell therapy.,-1
1996,"The Washington Post: ""School shootings rose to highest number in 20 years, federal data says.""",-1
1998,"One reason for this imbalance is that the mRNA vaccines that have been so successful in wealthy nations are novel, expensive and technologically challenging to produce.",-1


### Preparing data for modeling

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(use_idf=True, norm="l2", stop_words="english", max_df=0.7)
X = vectorizer.fit_transform(man_class_unbiased.sentence)
X

<1330x5081 sparse matrix of type '<class 'numpy.float64'>'
	with 13352 stored elements in Compressed Sparse Row format>

In [13]:
y = man_class_unbiased.sentiment
y

667     0
1343    0
1818    0
1376    0
1475    0
       ..
1983   -1
1986   -1
1996   -1
1998   -1
1999   -1
Name: sentiment, Length: 1330, dtype: int64

In [14]:
X.shape, y.shape

((1330, 5081), (1330,))

In [15]:
# Obtain test and train sets from data

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [16]:
# Training set: 1064 sentences (80% of data)
# Testing set: 266 sentences (20% of data)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((1064, 5081), (1064,), (266, 5081), (266,))
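In the cells above, the TF-IDF vectorizer is fitted on all 1330 sentences before the split, so the vocabulary and IDF weights also see the test sentences. One possible refinement, which is not what this notebook does, is to split the raw sentences first and wrap the vectorizer and classifier in a scikit-learn pipeline so that the vectorizer is fitted on the training portion only. The sketch below uses made-up toy sentences and adds `stratify` to keep the class proportions equal in both sets; both choices are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for man_class_unbiased (hypothetical sentences, 10 per class)
sentences = (
    [f"treatment improved survival in cohort {i + 10}" for i in range(10)]
    + [f"complications worsened outcomes in cohort {i + 10}" for i in range(10)]
    + [f"researchers enrolled adults in cohort {i + 10}" for i in range(10)]
)
labels = [1] * 10 + [-1] * 10 + [0] * 10

# Split the raw text first; stratify keeps 2 test sentences per class here
X_tr, X_te, y_tr, y_te = train_test_split(
    sentences, labels, test_size=0.2, random_state=0, stratify=labels
)

# The vectorizer is fitted inside the pipeline, on the training sentences only
pipe = make_pipeline(
    TfidfVectorizer(use_idf=True, norm="l2", stop_words="english", max_df=0.7),
    LinearSVC(C=0.3, random_state=0),
)
pipe.fit(X_tr, y_tr)
preds = pipe.predict(X_te)
print(len(X_tr), len(X_te), list(preds))
```

Passing a pipeline like this to `cross_val_score` would also refit the vectorizer inside every fold.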

# B.) Classification Models

## 1.) Modeling with k-Nearest Neighbors (k-NNs)

### I.) Initialize a model object with initial parameters

In [17]:
from sklearn.neighbors import KNeighborsClassifier 

knc = KNeighborsClassifier(n_neighbors=1)
knc

KNeighborsClassifier(n_neighbors=1)

### II.) Fit model on the training data

In [18]:
knc.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=1)

### III.) Evaluate the model performance

#### Prediction Accuracy

In [19]:
knc.score(X_train, y_train)                   # Get the training set accuracy of the model 

1.0

In [20]:
knc.score(X_test, y_test)                     # Get the test set accuracy of the model 

0.40977443609022557

#### Confusion Matrix

In [21]:
y_true = y_test
y_true

1815   -1
16      0
453     0
1786    0
551    -1
       ..
1159    0
889    -1
605     1
1601    0
1545    0
Name: sentiment, Length: 266, dtype: int64

In [22]:
y_pred = knc.predict(X_test)
y_pred

array([ 0,  1,  0,  0,  0,  0, -1, -1, -1,  0, -1,  0, -1,  0, -1,  0, -1,
        0, -1,  1,  0,  1,  1,  0,  1,  0, -1, -1,  1,  1,  1,  0,  0,  0,
        1,  0, -1, -1,  1,  1,  1,  1,  1,  1,  0,  0, -1,  0,  0,  1, -1,
       -1,  0,  0,  0,  1,  1, -1, -1,  0, -1,  0,  0,  1, -1,  1, -1, -1,
        0, -1,  0,  1,  0,  0,  0,  0,  1,  0,  0,  0,  0, -1, -1,  0,  1,
       -1, -1,  0, -1, -1, -1, -1, -1, -1, -1,  0,  1, -1, -1,  0,  0,  1,
        0,  0, -1,  0,  0, -1,  0, -1,  0, -1,  0,  1,  0, -1, -1, -1, -1,
        0, -1,  0, -1, -1,  0,  1, -1,  0,  0, -1,  0,  0, -1, -1,  1,  0,
       -1,  0,  0, -1, -1, -1, -1,  1, -1, -1,  1, -1,  0,  0,  0, -1,  1,
        0,  1,  1,  0, -1,  0,  1,  0,  1,  0,  0,  0,  0,  0,  0, -1,  0,
        0,  0,  0,  1, -1,  0,  1,  1, -1,  1, -1,  1,  0, -1, -1,  1, -1,
        1, -1,  1, -1,  0, -1,  0,  0, -1, -1,  1,  1, -1, -1,  0,  1,  1,
       -1, -1, -1, -1,  1,  0, -1,  0,  1,  1,  0,  0,  0, -1, -1, -1, -1,
        1,  1, -1,  1,  1

In [23]:
from sklearn.metrics import confusion_matrix, classification_report

In [24]:
print(confusion_matrix(y_true, y_pred))

[[43 38 17]
 [34 43 26]
 [21 21 23]]


In [25]:
cm = pd.DataFrame(data=confusion_matrix(y_true, y_pred), index=knc.classes_, columns=knc.classes_)
cm

Unnamed: 0,-1,0,1
-1,43,38,17
0,34,43,26
1,21,21,23


#### Classification Report

In [26]:
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

          -1       0.44      0.44      0.44        98
           0       0.42      0.42      0.42       103
           1       0.35      0.35      0.35        65

    accuracy                           0.41       266
   macro avg       0.40      0.40      0.40       266
weighted avg       0.41      0.41      0.41       266
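
The numbers in the classification report follow directly from the confusion matrix: rows are true labels, columns are predicted labels, recall divides the diagonal by the row totals, and precision divides it by the column totals. A minimal sketch using the k = 1 matrix printed above (variable names are illustrative):

```python
import numpy as np

labels = [-1, 0, 1]
# Confusion matrix from the k = 1 model above (rows: true, columns: predicted)
cm = np.array([[43, 38, 17],
               [34, 43, 26],
               [21, 21, 23]])

true_pos = np.diag(cm)
recall = true_pos / cm.sum(axis=1)      # correct predictions / true count per class
precision = true_pos / cm.sum(axis=0)   # correct predictions / predicted count per class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

for label, p, r, f in zip(labels, precision, recall, f1):
    print(f"{label:>2}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```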



### IV.) Perform cross-validation and tune parameters where there are any to optimize

In [27]:
from sklearn.model_selection import cross_val_score

In [28]:
# Cross validation using 5 folds

knc = KNeighborsClassifier(n_neighbors=1)
scores_k1 = cross_val_score(knc, X_train, y_train, cv=5)
scores_k1

array([0.42723005, 0.40375587, 0.53051643, 0.43192488, 0.44811321])

In [29]:
scores_k1.max(), scores_k1.min(), scores_k1.mean(), scores_k1.std()

(0.5305164319248826,
 0.40375586854460094,
 0.4483080875188237,
 0.04348756898347918)
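
For a classifier, `cross_val_score(..., cv=5)` splits the training data into five stratified folds without shuffling, fits a fresh copy of the model on four of them, and scores it on the fifth. The sketch below spells that loop out on synthetic data (the dataset from `make_classification` is an assumption for illustration, not the sentence data) and compares it with the library call.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data standing in for X_train / y_train
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, n_classes=3, random_state=0
)
model = KNeighborsClassifier(n_neighbors=1)

# Manual 5-fold loop: fit on four folds, score on the held-out fold
manual_scores = []
for train_idx, valid_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(model)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[valid_idx], y_demo[valid_idx]))
manual_scores = np.array(manual_scores)

# The one-line equivalent used in this notebook
library_scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(manual_scores, library_scores)
```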

#### Find the value of k that yields the best cross-validated accuracy:

In [30]:
score_max = 0                      # score_max is a temporary variable that stores the highest mean score so far

for param in [1, 3, 10, 20, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 40]:
    model = KNeighborsClassifier(n_neighbors=param)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"k = {param}: {scores}\n{round(scores.mean(), 3)}, {round(scores.std(), 3)}\n")
    
    if scores.mean() > score_max:
        score_max = scores.mean()
        param_best = param         # param_best is a temporary variable that stores the best parameter so far
        
print(f"Highest score : {round(score_max, 3)} when k = {param_best}")

k = 1: [0.42723005 0.40375587 0.53051643 0.43192488 0.44811321]
0.448, 0.043

k = 3: [0.45539906 0.4741784  0.50234742 0.46478873 0.48113208]
0.476, 0.016

k = 10: [0.45539906 0.52112676 0.49295775 0.46478873 0.49528302]
0.486, 0.023

k = 20: [0.53521127 0.48826291 0.50704225 0.46948357 0.48113208]
0.496, 0.023

k = 25: [0.54929577 0.48826291 0.49765258 0.46478873 0.48584906]
0.497, 0.028

k = 26: [0.54929577 0.4741784  0.4741784  0.46478873 0.49056604]
0.491, 0.03

k = 27: [0.55868545 0.4741784  0.44600939 0.48356808 0.48584906]
0.49, 0.037

k = 28: [0.56338028 0.46948357 0.47887324 0.48826291 0.48113208]
0.496, 0.034

k = 29: [0.54460094 0.47887324 0.48826291 0.46948357 0.47169811]
0.491, 0.028

k = 30: [0.51643192 0.48826291 0.4600939  0.47887324 0.47169811]
0.483, 0.019

k = 31: [0.54929577 0.45539906 0.4741784  0.46948357 0.47169811]
0.484, 0.033

k = 32: [0.55399061 0.4741784  0.4600939  0.50234742 0.45754717]
0.49, 0.036

k = 33: [0.5399061  0.44600939 0.46948357 0.50704225 0.49
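
The manual loop above can also be expressed with scikit-learn's `GridSearchCV`, which runs the same 5-fold cross-validation for each candidate and keeps track of the best mean score. This is an alternative sketch, not the code the notebook ran; it uses synthetic data and a shortened list of candidate values for k as assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data standing in for X_train / y_train
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 10, 20, 40]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
best_score = search.best_score_
print(f"Highest score : {round(best_score, 3)} when k = {best_k}")
```

After fitting, `search.best_estimator_` is already refitted on all the data passed to `fit`, so it can be evaluated on the test set directly.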

### V.) Build the final model with the optimal parameter(s) found

In [31]:
def train_test(X_train, X_test, y_train, y_test, cls):
    cls.fit(X_train, y_train)
    
    y_true = y_test
    y_pred = cls.predict(X_test)
    
    print(f"Train accuracy score: {round(cls.score(X_train, y_train), 3)}")
    print(f"Test accuracy score: {round(cls.score(X_test, y_test), 3)}\n")
    print(confusion_matrix(y_true, y_pred))
    print()
    print(classification_report(y_true, y_pred, zero_division=0))

In [32]:
print(f"k = {param_best}")
knc = KNeighborsClassifier(n_neighbors=param_best)
%time train_test(X_train, X_test, y_train, y_test, knc)

k = 40
Train accuracy score: 0.536
Test accuracy score: 0.526

[[60 36  2]
 [32 62  9]
 [21 26 18]]

              precision    recall  f1-score   support

          -1       0.53      0.61      0.57        98
           0       0.50      0.60      0.55       103
           1       0.62      0.28      0.38        65

    accuracy                           0.53       266
   macro avg       0.55      0.50      0.50       266
weighted avg       0.54      0.53      0.51       266

CPU times: user 68.6 ms, sys: 9.02 ms, total: 77.6 ms
Wall time: 76.8 ms


## 2.) Modeling with Logistic Regression

In [33]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=0)
lr

LogisticRegression(random_state=0)

In [34]:
# Cross validation with 5 folds

scores = cross_val_score(lr, X_train, y_train, cv=5)
print(f"{scores}\n{round(scores.mean(), 3)}, {round(scores.std(), 3)}")

[0.5258216  0.50704225 0.52112676 0.53051643 0.51886792]
0.521, 0.008


In [35]:
%time train_test(X_train, X_test, y_train, y_test, lr)

Train accuracy score: 0.962
Test accuracy score: 0.549

[[72 25  1]
 [32 66  5]
 [28 29  8]]

              precision    recall  f1-score   support

          -1       0.55      0.73      0.63        98
           0       0.55      0.64      0.59       103
           1       0.57      0.12      0.20        65

    accuracy                           0.55       266
   macro avg       0.56      0.50      0.47       266
weighted avg       0.55      0.55      0.51       266

CPU times: user 6.38 s, sys: 38.6 s, total: 44.9 s
Wall time: 5.69 s


## 3.) Modeling with Multinomial Naive Bayes

In [36]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb

MultinomialNB()

In [37]:
# Cross validation with 5 folds

scores = cross_val_score(mnb, X_train, y_train, cv=5)
print(f"{scores}\n{round(scores.mean(), 3)}, {round(scores.std(), 3)}")

[0.53051643 0.53051643 0.49765258 0.5258216  0.51886792]
0.521, 0.012


In [38]:
%time train_test(X_train, X_test, y_train, y_test, mnb)

Train accuracy score: 0.927
Test accuracy score: 0.538

[[69 29  0]
 [33 69  1]
 [29 31  5]]

              precision    recall  f1-score   support

          -1       0.53      0.70      0.60        98
           0       0.53      0.67      0.59       103
           1       0.83      0.08      0.14        65

    accuracy                           0.54       266
   macro avg       0.63      0.48      0.45       266
weighted avg       0.60      0.54      0.49       266

CPU times: user 3.96 ms, sys: 2.54 ms, total: 6.5 ms
Wall time: 5.97 ms


## 4.) Modeling with Decision Trees

In [39]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(random_state=0)
dtc

DecisionTreeClassifier(random_state=0)

In [40]:
# Cross validation with 5 folds

scores = cross_val_score(dtc, X_train, y_train, cv=5)
print(f"{scores}\n{round(scores.mean(), 3)}, {round(scores.std(), 3)}")

[0.3943662  0.3943662  0.4741784  0.42723005 0.39622642]
0.417, 0.031


In [41]:
%time train_test(X_train, X_test, y_train, y_test, dtc)

Train accuracy score: 1.0
Test accuracy score: 0.421

[[56 30 12]
 [43 42 18]
 [34 17 14]]

              precision    recall  f1-score   support

          -1       0.42      0.57      0.48        98
           0       0.47      0.41      0.44       103
           1       0.32      0.22      0.26        65

    accuracy                           0.42       266
   macro avg       0.40      0.40      0.39       266
weighted avg       0.42      0.42      0.41       266

CPU times: user 95.5 ms, sys: 13.8 ms, total: 109 ms
Wall time: 108 ms


## 5.) Modeling with Random Forest

In [42]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=0)
rfc

RandomForestClassifier(random_state=0)

In [43]:
# Cross validation with 5 folds

scores = cross_val_score(rfc, X_train, y_train, cv=5)
print(f"{scores}\n{round(scores.mean(), 3)}, {round(scores.std(), 3)}")

[0.49765258 0.49765258 0.51643192 0.52112676 0.46226415]
0.499, 0.021


In [44]:
%time train_test(X_train, X_test, y_train, y_test, rfc)

Train accuracy score: 1.0
Test accuracy score: 0.485

[[70 27  1]
 [45 51  7]
 [32 25  8]]

              precision    recall  f1-score   support

          -1       0.48      0.71      0.57        98
           0       0.50      0.50      0.50       103
           1       0.50      0.12      0.20        65

    accuracy                           0.48       266
   macro avg       0.49      0.44      0.42       266
weighted avg       0.49      0.48      0.45       266

CPU times: user 894 ms, sys: 103 ms, total: 997 ms
Wall time: 996 ms


## 6.) Modeling with Linear Support Vector Machines

In [45]:
from sklearn.svm import LinearSVC

lsvc = LinearSVC(random_state=0)
lsvc

LinearSVC(random_state=0)

In [46]:
%time train_test(X_train, X_test, y_train, y_test, lsvc)

Train accuracy score: 1.0
Test accuracy score: 0.549

[[61 29  8]
 [24 61 18]
 [16 25 24]]

              precision    recall  f1-score   support

          -1       0.60      0.62      0.61        98
           0       0.53      0.59      0.56       103
           1       0.48      0.37      0.42        65

    accuracy                           0.55       266
   macro avg       0.54      0.53      0.53       266
weighted avg       0.55      0.55      0.54       266

CPU times: user 9.69 ms, sys: 0 ns, total: 9.69 ms
Wall time: 9.63 ms


In [47]:
# find the optimal value for C

score_max = 0

for param in [0.01, 0.03, 0.1, 0.3, 1, 3, 10]:
    model = LinearSVC(C=param, random_state=0)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"C = {param}: {scores}\n{round(scores.mean(), 3)}, {round(scores.std(), 3)}\n")
    
    if scores.mean() > score_max:
        score_max = scores.mean()
        param_best = param
        
print(f"Highest score : {round(score_max, 3)} when C = {param_best}")

C = 0.01: [0.51643192 0.47887324 0.5258216  0.51643192 0.51415094]
0.51, 0.016

C = 0.03: [0.51173709 0.48356808 0.52112676 0.50704225 0.52830189]
0.51, 0.015

C = 0.1: [0.52112676 0.50704225 0.51173709 0.5258216  0.50943396]
0.515, 0.007

C = 0.3: [0.53521127 0.52112676 0.5258216  0.53521127 0.50471698]
0.524, 0.011

C = 1: [0.50234742 0.49765258 0.52112676 0.51643192 0.51415094]
0.51, 0.009

C = 3: [0.50234742 0.47887324 0.50704225 0.49295775 0.49528302]
0.495, 0.01

C = 10: [0.49765258 0.4741784  0.49765258 0.49295775 0.48584906]
0.49, 0.009

Highest score : 0.524 when C = 0.3


In [48]:
print(f"C = {param_best}")
lsvc = LinearSVC(C=param_best)
%time train_test(X_train, X_test, y_train, y_test, lsvc)

C = 0.3
Train accuracy score: 0.993
Test accuracy score: 0.571

[[70 26  2]
 [29 67  7]
 [23 27 15]]

              precision    recall  f1-score   support

          -1       0.57      0.71      0.64        98
           0       0.56      0.65      0.60       103
           1       0.62      0.23      0.34        65

    accuracy                           0.57       266
   macro avg       0.59      0.53      0.52       266
weighted avg       0.58      0.57      0.55       266

CPU times: user 7.26 ms, sys: 1.17 ms, total: 8.44 ms
Wall time: 8.08 ms


## 7.) Modeling with Kernelized Support Vector Machines

In [49]:
from sklearn.svm import SVC

svc = SVC(C=1, kernel="rbf", gamma="scale", random_state=0)
svc

SVC(C=1, random_state=0)

In [50]:
%time train_test(X_train, X_test, y_train, y_test, svc)

Train accuracy score: 0.999
Test accuracy score: 0.545

[[76 22  0]
 [35 68  0]
 [33 31  1]]

              precision    recall  f1-score   support

          -1       0.53      0.78      0.63        98
           0       0.56      0.66      0.61       103
           1       1.00      0.02      0.03        65

    accuracy                           0.55       266
   macro avg       0.70      0.48      0.42       266
weighted avg       0.66      0.55      0.47       266

CPU times: user 306 ms, sys: 35.8 ms, total: 342 ms
Wall time: 341 ms


In [51]:
# find the optimal value for C

score_max = 0

for param in [0.01, 0.03, 0.1, 0.3, 1, 3, 10]:
    model = SVC(C=param, kernel="rbf", gamma="scale", random_state=0)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"C = {param}: {scores}\n{round(scores.mean(), 3)}, {round(scores.std(), 3)}\n")
    
    if scores.mean() > score_max:
        score_max = scores.mean()
        param_best = param
        
print(f"Highest score : {round(score_max, 3)} when C = {param_best}")

C = 0.01: [0.38497653 0.38028169 0.38028169 0.38028169 0.38207547]
0.382, 0.002

C = 0.03: [0.38497653 0.38028169 0.38028169 0.38028169 0.38207547]
0.382, 0.002

C = 0.1: [0.38497653 0.38028169 0.38028169 0.38028169 0.38207547]
0.382, 0.002

C = 0.3: [0.38497653 0.38028169 0.38028169 0.38028169 0.38207547]
0.382, 0.002

C = 1: [0.51173709 0.48826291 0.51643192 0.5258216  0.51886792]
0.512, 0.013

C = 3: [0.53521127 0.51643192 0.53521127 0.5258216  0.51415094]
0.525, 0.009

C = 10: [0.53521127 0.51643192 0.53521127 0.5258216  0.51415094]
0.525, 0.009

Highest score : 0.525 when C = 3


In [52]:
print(f"C = {param_best}")
svc = SVC(C=param_best)
%time train_test(X_train, X_test, y_train, y_test, svc)

C = 3
Train accuracy score: 1.0
Test accuracy score: 0.541

[[70 27  1]
 [33 65  5]
 [25 31  9]]

              precision    recall  f1-score   support

          -1       0.55      0.71      0.62        98
           0       0.53      0.63      0.58       103
           1       0.60      0.14      0.22        65

    accuracy                           0.54       266
   macro avg       0.56      0.49      0.47       266
weighted avg       0.55      0.54      0.51       266

CPU times: user 318 ms, sys: 36.1 ms, total: 355 ms
Wall time: 353 ms


## 8.) Modeling with Neural Networks

In [53]:
from sklearn.neural_network import MLPClassifier

mlpc = MLPClassifier(hidden_layer_sizes=(10, ), random_state=0)
mlpc

MLPClassifier(hidden_layer_sizes=(10,), random_state=0)

In [54]:
%time train_test(X_train, X_test, y_train, y_test, mlpc)

Train accuracy score: 1.0
Test accuracy score: 0.538

[[52 35 11]
 [26 60 17]
 [ 9 25 31]]

              precision    recall  f1-score   support

          -1       0.60      0.53      0.56        98
           0       0.50      0.58      0.54       103
           1       0.53      0.48      0.50        65

    accuracy                           0.54       266
   macro avg       0.54      0.53      0.53       266
weighted avg       0.54      0.54      0.54       266

CPU times: user 22.1 s, sys: 1min 55s, total: 2min 17s
Wall time: 17.1 s


In [55]:
# find the optimal hidden layer size

score_max = 0

for param in [10, 30, 100]:
    model = MLPClassifier(hidden_layer_sizes=(param, ), random_state=0)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"hidden_layer_size = {param}: {scores}\n{round(scores.mean(), 3)}, {round(scores.std(), 3)}\n")
    
    if scores.mean() > score_max:
        score_max = scores.mean()
        param_best = param
        
print(f"Highest score : {round(score_max, 3)} when hidden_layer_sizes = {param_best}")

hidden_layer_size = 10: [0.50234742 0.4741784  0.49295775 0.51643192 0.48584906]
0.494, 0.014

hidden_layer_size = 30: [0.49295775 0.46948357 0.48826291 0.48826291 0.48113208]
0.484, 0.008

hidden_layer_size = 100: [0.51173709 0.46948357 0.49295775 0.48356808 0.47641509]
0.487, 0.015

Highest score : 0.494 when hidden_layer_sizes = 10


In [56]:
print(f"hidden_layer_size = {param_best}")
mlpc = MLPClassifier(hidden_layer_sizes=(param_best, ), random_state=0)
%time train_test(X_train, X_test, y_train, y_test, mlpc)

hidden_layer_size = 10
Train accuracy score: 1.0
Test accuracy score: 0.538

[[52 35 11]
 [26 60 17]
 [ 9 25 31]]

              precision    recall  f1-score   support

          -1       0.60      0.53      0.56        98
           0       0.50      0.58      0.54       103
           1       0.53      0.48      0.50        65

    accuracy                           0.54       266
   macro avg       0.54      0.53      0.53       266
weighted avg       0.54      0.54      0.54       266

CPU times: user 20.3 s, sys: 1min 50s, total: 2min 10s
Wall time: 16.3 s


# C.) Choosing the Best Model

In [57]:
# Add final results from each model type to the summary dictionary

summary = {}
summary["k-NNs"] = round(knc.score(X_test, y_test), 3)
summary["Logistic Regression"] = round(lr.score(X_test, y_test), 3)
summary["Multinomial Naive Bayes"] = round(mnb.score(X_test, y_test), 3)
summary["Decision Trees"] = round(dtc.score(X_test, y_test), 3)
summary["Random Forest"] = round(rfc.score(X_test, y_test), 3)
summary["Linear SVMs"] = round(lsvc.score(X_test, y_test), 3)
summary["Kernelized SVMs"] = round(svc.score(X_test, y_test), 3)
summary["Neural Networks"] = round(mlpc.score(X_test, y_test), 3)

summary

{'k-NNs': 0.519,
 'Logistic Regression': 0.549,
 'Multinomial Naive Bayes': 0.538,
 'Decision Trees': 0.421,
 'Random Forest': 0.485,
 'Linear SVMs': 0.571,
 'Kernelized SVMs': 0.541,
 'Neural Networks': 0.538}
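
Accuracy is only one view of these results. As a sketch, the table below puts the test accuracies from the summary next to the recall on the positive class (label 1) read off each model's final classification report above; the dataframe and column names are illustrative. Sorting by each column shows that the ranking changes depending on which metric matters.

```python
import pandas as pd

# Values copied from the summary dictionary and the final classification reports above
results = pd.DataFrame(
    {
        "test_accuracy": [0.519, 0.549, 0.538, 0.421, 0.485, 0.571, 0.541, 0.538],
        "positive_recall": [0.28, 0.12, 0.08, 0.22, 0.12, 0.23, 0.14, 0.48],
    },
    index=[
        "k-NNs",
        "Logistic Regression",
        "Multinomial Naive Bayes",
        "Decision Trees",
        "Random Forest",
        "Linear SVMs",
        "Kernelized SVMs",
        "Neural Networks",
    ],
)

by_accuracy = results.sort_values("test_accuracy", ascending=False)
by_positive_recall = results.sort_values("positive_recall", ascending=False)
print(by_accuracy)
print(by_positive_recall)
```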

### Best model: Linear SVM (C = 0.3), with a test accuracy of 0.571

# D.) Testing the Best Model on New Data

### Randomly sample 20 sentences from unclassified data scraped from the website

In [58]:
# Read in non-classified sentences CSV

unclass = pd.read_csv("non2000Data.csv", sep="\t")
unclass

Unnamed: 0.1,Unnamed: 0,sentence,subjectivity,length
0,0,"PANAMA CITY (Reuters) - Panama registered its first case of monkeypox in a resident who was infected after being in contact with tourists from Europe, Panama's health ministry said Tuesday.",0.333333,189
1,1,"""Yesterday in the afternoon, the first case of monkeypox in our country was confirmed,"" Health Minister Luis Sucre said during a press conference, adding the patient ""is completely stable"" and the...",0.458333,227
2,2,"The patient, whose nationality and sex were not revealed, was isolating at home after being notified of a possible monkeypox infection, Sucre said.",1.000000,147
3,3,The person was later transferred to a medical facility.,0.000000,55
4,4,"Three lesions typical of monkeypox were found on the patient's body, though the person is ""practically asymptomatic,"" authorities said.",0.500000,135
...,...,...,...,...
290553,292553,The study was funded by grants to multiple researchers from the National Science and Technology Major Project of the Ministry of Science and Technology of China and other government sources.,0.291667,190
290554,292554,The researchers have disclosed no relevant financial relationships.,0.450000,67
290555,292555,Chest.,0.000000,6
290556,292556,"Published online July 18, 2022.",0.000000,31


In [59]:
# Randomly select 20 rows of non-classified data

test_records = unclass.sample(n=20, replace=False, random_state=1)   

In [60]:
# Get the sampled sentences as a plain list
# (Series.iteritems() was removed in pandas 2.0; tolist() works in all versions)

sentences = test_records.sentence.tolist()

In [61]:
# transform sentences to apply model

sentences_new = vectorizer.transform(sentences)

In [62]:
# get list of predicted values for the randomly sampled sentences

lsvc.predict(sentences_new)

array([-1,  1, -1, -1,  1,  0, -1,  0,  0,  0,  0,  0, -1, -1,  0, -1, -1,
        0,  0,  0])
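
New sentences are passed through `vectorizer.transform`, not `fit_transform`, so that they are mapped onto the vocabulary learned earlier and the feature matrix has the number of columns the fitted model expects. Words that were never seen during fitting are silently dropped. A small sketch with made-up sentences (all text here is hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_sentences = [
    "the trial showed a clear survival benefit",
    "patients reported severe side effects",
    "the study enrolled adults in three countries",
]
new_sentences = ["a novel biomarker showed benefit in adults"]

vec = TfidfVectorizer(stop_words="english")
X_fit = vec.fit_transform(train_sentences)   # learns the vocabulary and IDF weights
X_new = vec.transform(new_sentences)         # reuses them; no refitting

print(X_fit.shape, X_new.shape)
print("biomarker" in vec.vocabulary_)        # unseen word is not part of the features
```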

In [63]:
# create dataframe to list sampled sentences and polarity scores from Linear SVMs model

test_performance = pd.DataFrame()

test_performance['sentence'] = sentences
test_performance['polarity'] = lsvc.predict(sentences_new)

test_performance

Unnamed: 0,sentence,polarity
0,"Butler was in the fifth month of her pregnancy, one day past 21 weeks gestation.",-1
1,"This article originally appeared on MDedge.com, part of the Medscape Professional Network.",1
2,"Although the technique hasn't been consistently successful, its overall trial experience has intrigued and impressed the field enough for it to explore the treatment in increasingly more instructi...",-1
3,But she acknowledged that the optimal blood pressure management strategy in these patients remains uncertain and should be the focus of future research.,-1
4,"""But I think it's also important to note that the data that they use in this study isn't perfect for measuring things like readmissions.""",1
5,"Among both completers and non-completers there was an over-representation of individuals aged 18-34 years and women compared with the general population, and fewer participants aged at least 65 ye...",0
6,"""There are abbreviations that are really hard to understand even after you expand them such as MI for myocardial infarction, which is really a tough term all around.",-1
7,"Also weighing in on the study, Amit Singal, MD, chief of hepatology at UT Southwestern Medical Center in Dallas, Texas, said this study highlights that underlying cirrhosis is ""the strongest risk ...",0
8,"To change that, bioengineers have created a new hydrogel formula that dissolves rapidly from wound sites, melting off in 6 minutes or less.",0
9,"The procedure duration was reduced by almost 21 minutes, cutting the time from nearly 39 minutes to less than 18 minutes (P < .001).",0
