<center> <h2> Text Classifiers</h2></center>

## Outline
1. <a href='#1'>Logistic Regression</a>
2. <a href='#2'>Multilayer Perceptron Classifier</a>




## 1. Logistic Regression
* Despite its name, logistic regression is a classification algorithm, not a regression algorithm
* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
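
As a rough illustration of why this counts as classification, the binary case can be sketched in plain Python: a linear score is passed through the sigmoid function to get a probability, and the class is chosen by comparing that probability with 0.5. The weights, bias, and the helper name `predict_proba_positive` are made up for this sketch and are not taken from the fitted model.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_positive(weights, bias, x):
    """Hypothetical helper: probability of the positive class for one feature vector."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(score)

# Illustrative values only, not learned from game_review.csv
weights = [1.5, -2.0]
bias = 0.25
x = [1.0, 0.5]

p = predict_proba_positive(weights, bias, x)
label = 1 if p >= 0.5 else 0  # threshold the probability to get a class
print(p, label)
```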

In [1]:
import pandas as pd
data = pd.read_csv("game_review.csv")
features = data["comment"]
target = data["sentiment"]

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)


# Build the vocabulary from the training data only
vect = TfidfVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)

# Encode X_train and X_test as TF-IDF vectors using that vocabulary
X_train_vectorized = vect.transform(X_train)
X_test_vectorized = vect.transform(X_test)

# Train the classifier
model = LogisticRegression().fit(X=X_train_vectorized, y=y_train)


print("Classification accuracy on training set: ", model.score(X_train_vectorized, y_train))
print("Classification accuracy on testing set: ", model.score(X_test_vectorized, y_test))


Classification accuracy on training set:  0.9012009495880463
Classification accuracy on testing set:  0.8217801047120419
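
To see what `min_df` and `ngram_range` do, here is a small sketch on a made-up corpus (the sentences and the `toy_` names are assumptions for illustration, not part of the review data). With `min_df=2`, only unigrams and bigrams that occur in at least two documents are kept, and words outside the fitted vocabulary are ignored when new text is transformed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus, for illustration only
toy_docs = ["great game", "great game fun", "bad game", "bad controls"]

# Keep unigrams and bigrams that appear in at least 2 documents
toy_vect = TfidfVectorizer(min_df=2, ngram_range=(1, 2)).fit(toy_docs)
toy_vocab = sorted(toy_vect.vocabulary_)
print(toy_vocab)

# Unseen text is encoded with the fitted vocabulary; "new" is not in it
toy_row = toy_vect.transform(["great new game"])
print(toy_row.shape, toy_row.nnz)
```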


### 1.1. Probability Estimates
* In addition to class labels, the model can return a probability estimate for each class
* Use `model.predict_proba()`
    * The returned columns are ordered by class label, in the same order as `model.classes_`.


* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba
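
The column ordering can be checked against `classes_`. The sketch below uses a tiny made-up dataset (the `toy_` names and values are assumptions for illustration) to show that each row of probabilities sums to 1 and that the predicted label is the class whose column has the largest probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (illustrative only): one feature, labels 0 and 1
X_toy = np.array([[0.0], [0.5], [1.0], [2.0], [2.5], [3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression().fit(X_toy, y_toy)

toy_proba = toy_model.predict_proba(X_toy)
toy_pred = toy_model.predict(X_toy)

# Column i of toy_proba corresponds to toy_model.classes_[i]
print(toy_model.classes_)

# Picking the most probable column recovers the predicted labels
pred_from_proba = toy_model.classes_[toy_proba.argmax(axis=1)]
print(pred_from_proba, toy_pred)
```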



In [3]:
model.predict(X_test_vectorized)

array([0, 1, 1, ..., 1, 0, 0], dtype=int64)

In [4]:
model.predict_proba(X_test_vectorized)

array([[0.74943183, 0.25056817],
       [0.07551562, 0.92448438],
       [0.27700778, 0.72299222],
       ...,
       [0.49807486, 0.50192514],
       [0.58939542, 0.41060458],
       [0.65454571, 0.34545429]])

In [5]:
# Column names follow the order of model.classes_ (here 0, then 1)
proba = pd.DataFrame(model.predict_proba(X_test_vectorized), columns = ['Not Recommended', 'Recommended'])
proba['prediction'] = model.predict(X_test_vectorized)

In [6]:
proba

| | Not Recommended | Recommended | prediction |
|---|---|---|---|
| 0 | 0.749432 | 0.250568 | 0 |
| 1 | 0.075516 | 0.924484 | 1 |
| 2 | 0.277008 | 0.722992 | 1 |
| 3 | 0.590125 | 0.409875 | 0 |
| 4 | 0.111302 | 0.888698 | 1 |
| ... | ... | ... | ... |
| 4770 | 0.271429 | 0.728571 | 1 |
| 4771 | 0.633448 | 0.366552 | 0 |
| 4772 | 0.498075 | 0.501925 | 1 |
| 4773 | 0.589395 | 0.410605 | 0 |


## 2. Multilayer Perceptron Classifier
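* A simple feed-forward neural network
* `hidden_layer_sizes = (10,)` adds a single hidden layer with 10 hidden units; each entry of the tuple is the size of one hidden layer (Python reads `(10)` without the trailing comma as the plain integer 10)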

* https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier
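
The tuple syntax is easy to get wrong, so here is a short sketch of the difference between `(10)` and `(10,)`, together with a hypothetical helper `count_layers` (not part of scikit-learn) that counts the input layer, the hidden layers, and the output layer. For one hidden layer it gives 3, which matches the `n_layers_` value reported by the notebook.

```python
def count_layers(hidden_layer_sizes):
    """Hypothetical helper: total layers = input + hidden layers + output."""
    if isinstance(hidden_layer_sizes, int):
        # Treat a bare integer as a single hidden layer
        hidden_layer_sizes = (hidden_layer_sizes,)
    return len(hidden_layer_sizes) + 2

print(type((10)))    # parentheses alone do not make a tuple
print(type((10,)))   # the trailing comma does

print(count_layers((10,)))     # one hidden layer of 10 units
print(count_layers((10, 5)))   # two hidden layers: 10 units, then 5
```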


In [15]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)


# Build the vocabulary from the training data only
vect = TfidfVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)

# Encode X_train and X_test as TF-IDF vectors using that vocabulary
X_train_vectorized = vect.transform(X_train)
X_test_vectorized = vect.transform(X_test)

# Train the classifier
# Available solvers:
#   lbfgs: an optimizer in the family of quasi-Newton methods
#   sgd: stochastic gradient descent
#   adam: a stochastic gradient-based optimizer

# (10,) is a one-element tuple: a single hidden layer with 10 units
model = MLPClassifier(solver = 'lbfgs', hidden_layer_sizes = (10,), activation = 'logistic',
                      random_state = 3000).fit(X=X_train_vectorized, y=y_train)


print("Classification accuracy on training set: ", model.score(X_train_vectorized, y_train))
print("Classification accuracy on testing set: ", model.score(X_test_vectorized, y_test))


Classification accuracy on training set:  0.9973467392822232
Classification accuracy on testing set:  0.8098429319371728


In [8]:
model.n_layers_

3

In [9]:
model.coefs_

[array([[ 0.19149116, -0.08667255, -0.11229894, ..., -0.12139146,
          0.33095417,  0.20810238],
        [ 0.08775012, -0.09972273, -0.14690432, ..., -0.11743543,
          0.26695724,  0.15056969],
        [ 0.0098364 , -0.00665959, -0.01558428, ..., -0.02525612,
          0.02521357,  0.01154272],
        ...,
        [ 0.03256535,  0.01780136,  0.03735015, ...,  0.03451751,
          0.03644553,  0.00166909],
        [-0.0778984 ,  0.07554666,  0.09826078, ...,  0.09402758,
         -0.16705507, -0.13957366],
        [ 0.08251011, -0.04873931, -0.07672365, ..., -0.05703895,
          0.14424234,  0.09660233]]),
 array([[-0.63120187],
        [ 1.27958601],
        [ 1.92305251],
        [ 1.6750819 ],
        [-0.87286191],
        [ 1.85710689],
        [ 1.85462438],
        [ 1.61090875],
        [-0.52948209],
        [-6.01310613]])]
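
The two arrays in `coefs_` are the weight matrices between consecutive layers: input to hidden, then hidden to output. Printing shapes is usually easier to read than printing the weights themselves. The sketch below uses a small random dataset with 4 features (the data and the `toy_` names are assumptions for illustration), so the shapes differ from those of the review model, whose first matrix has one row per vocabulary term.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Random toy data (illustrative only): 40 samples, 4 features, binary labels
rng = np.random.RandomState(0)
X_toy = rng.rand(40, 4)
y_toy = (X_toy[:, 0] > 0.5).astype(int)

toy_mlp = MLPClassifier(solver="lbfgs", hidden_layer_sizes=(10,),
                        activation="logistic", random_state=0,
                        max_iter=500).fit(X_toy, y_toy)

# One weight matrix and one bias vector per pair of consecutive layers
coef_shapes = [w.shape for w in toy_mlp.coefs_]
intercept_shapes = [b.shape for b in toy_mlp.intercepts_]

print(toy_mlp.n_layers_)   # input + 1 hidden + output
print(coef_shapes)         # (n_features, 10), then (10, 1) for binary output
print(intercept_shapes)
```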