In [1]:
import pickle
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

%matplotlib inline

In [2]:
mbti_df = pd.read_csv('./processed_mbti.csv', header = 0)
mbti_df

Unnamed: 0,Type,Posts
0,INFJ,"['moment', 'sportscenter', 'top', 'ten', 'play..."
1,ENTP,"['finding', 'lack', 'these', 'post', 'very', '..."
2,INTP,"['good', 'one', 'course', 'which', 'say', 'kno..."
3,INTJ,"['dear', 'enjoyed', 'our', 'conversation', 'ot..."
4,ENTJ,"['fired', 'another', 'silly', 'misconception',..."
...,...,...
8670,ISFP,"['because', 'always', 'think', 'cat', 'fi', 'd..."
8671,ENFP,"['thread', 'already', 'exists', 'someplace', '..."
8672,INTP,"['many', 'question', 'when', 'these', 'thing',..."
8673,INFP,"['very', 'conflicted', 'right', 'now', 'when',..."


In [3]:
X_data = mbti_df['Posts']
y_data = mbti_df['Type']
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=42)

In [4]:
from sklearn.tree import export_graphviz
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

<font color='lightblue'> <h2>Vectorization</h2> </font>

In [5]:
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
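To make the `fit_transform` / `transform` split above concrete, here is a minimal sketch on toy documents. The `train_docs`, `test_docs`, and `toy_vectorizer` names are made up for illustration and are not part of this notebook's data. The vocabulary is learned from the training documents only, so a word that appears only in the test documents gets no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (assumption: stand-ins for the MBTI posts)
train_docs = ["cat sat mat", "dog sat log"]
test_docs = ["cat sat", "zebra"]

toy_vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training docs
train_matrix = toy_vectorizer.fit_transform(train_docs)
# transform reuses that vocabulary; unseen words such as "zebra" are dropped
test_matrix = toy_vectorizer.transform(test_docs)

vocabulary = sorted(toy_vectorizer.vocabulary_)
test_dense = test_matrix.toarray()

print(vocabulary)
print(train_matrix.shape, test_matrix.shape)
print(test_dense[1])  # row for "zebra"
```

Both matrices end up with the same number of columns, which is what lets a model fitted on `X_train_vectorized` score `X_test_vectorized`.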

<font color='lightblue'> <h2>L1 Logistic Regression</h2> </font>

In [6]:
l1_clf1 = LogisticRegression(random_state=0, solver='liblinear', penalty='l1', C=1.0).fit(X_train_vectorized, y_train)
l1_clf1.score(X_test_vectorized, y_test)

0.4917402996542451

In [7]:
l1_clf2 = LogisticRegression(random_state=0, solver='liblinear', penalty='l1', C=2.0).fit(X_train_vectorized, y_train)
l1_clf2.score(X_test_vectorized, y_test)

0.5197848636189013

In [8]:
l1_clf3 = LogisticRegression(random_state=0, solver='liblinear', penalty='l1', C=1.9).fit(X_train_vectorized, y_train)
l1_clf3.score(X_test_vectorized, y_test)

0.5194006915097964

In [9]:
l1_clf7 = LogisticRegression(random_state=0, solver='liblinear', penalty='l1', C=1.8).fit(X_train_vectorized, y_train)
l1_clf7.score(X_test_vectorized, y_test)

0.5209373799462159

In [26]:
l1_clf8 = LogisticRegression(random_state=0, solver='liblinear', penalty='l1', C=2.5).fit(X_train_vectorized, y_train)
l1_clf8.score(X_test_vectorized, y_test)

0.5224740683826354

<font color='lightblue'> <h3>Evaluation</h3></font>

- For the L1 penalty with the 'liblinear' solver, the highest test score among the runs above is 0.5225 at C = 2.5.
- Between C = 1.8 and C = 2.5 the score barely moves (0.5194 to 0.5225), and the ordering is not monotonic: C = 1.8 (0.5209) beats both C = 1.9 (0.5194) and C = 2.0 (0.5198). Only C = 1.0 is clearly worse (0.4917).
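Every score above is measured on the same test split, so picking C from those scores risks tuning to that split. A minimal sketch of choosing C by cross-validation on the training data instead, assuming synthetic two-class data from `make_classification` in place of the MBTI features (`X_toy`, `y_toy`, and `param_grid` are names made up for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the MBTI dataset)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

# Candidate C values, similar to the ones tried by hand above
param_grid = {"C": [0.5, 1.0, 1.8, 1.9, 2.0, 2.5]}

search = GridSearchCV(
    LogisticRegression(solver="liblinear", penalty="l1", random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training split only
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_c = search.best_params_["C"]
# The test split is touched once, after C has been chosen
test_accuracy = search.score(X_te, y_te)
print(best_c, test_accuracy)
```

With this setup the test score is an estimate for the selected model rather than the maximum over several tries.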


<font color='lightblue'> <h3>Let's switch the solver to 'saga' (the only other solver that supports the L1 penalty):</h3> </font>

In [10]:
l1_clf4 = LogisticRegression(random_state=0, solver='saga', penalty='l1', C=2.0, max_iter=10000).fit(X_train_vectorized, y_train)
l1_clf4.score(X_test_vectorized, y_test)

0.5113330772185939

In [11]:
l1_clf9 = LogisticRegression(random_state=0, solver='saga', penalty='l1', C=1.5, max_iter=10000).fit(X_train_vectorized, y_train)
l1_clf9.score(X_test_vectorized, y_test)

0.5132539377641183

In [12]:
l1_clf5 = LogisticRegression(random_state=0, solver='saga', penalty='l1', C=1.0, max_iter=10000).fit(X_train_vectorized, y_train)
l1_clf5.score(X_test_vectorized, y_test)

0.5021129466000769

In [13]:
l1_clf6 = LogisticRegression(random_state=0, solver='saga', penalty='l1', C=0.9, max_iter=10000).fit(X_train_vectorized, y_train)
l1_clf6.score(X_test_vectorized, y_test)

0.4994237418363427

<font color='lightblue'> <h3>Evaluation</h3></font>

- For the L1 penalty with the 'saga' solver, the highest test score among the runs above is 0.5133 at C = 1.5.
- The score drops on either side: 0.5113 at C = 2.0, 0.5021 at C = 1.0, and 0.4994 at C = 0.9.

Compared with 'liblinear' at C = 2.5 (score = 0.5225), 'saga' at C = 1.5 is slightly worse, so 'liblinear' with C = 2.5 is the best L1 combination I have tried so far.
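The one-cell-per-model pattern above can also be written as a loop that collects every (solver, C) score in one table, which makes the comparison easier to read off. This is only a sketch on synthetic two-class data. The `X_toy`, `y_toy`, `rows`, and `results` names are assumptions for illustration, and the scores it prints say nothing about the MBTI data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the MBTI dataset)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

rows = []
for solver in ("liblinear", "saga"):      # the two solvers that support L1
    for c_value in (1.0, 1.5, 2.0, 2.5):
        clf = LogisticRegression(
            random_state=0, solver=solver, penalty="l1",
            C=c_value, max_iter=10000,
        ).fit(X_tr, y_tr)
        rows.append(
            {"solver": solver, "C": c_value, "score": clf.score(X_te, y_te)}
        )

# One row per combination, best score first
results = pd.DataFrame(rows).sort_values("score", ascending=False)
print(results)
```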

<font color='lightblue'> <h2>L2 Logistic Regression</h2> </font>

In [14]:
l2_clf1 = LogisticRegression(random_state=0, solver='liblinear', penalty='l2', C=1.0).fit(X_train_vectorized, y_train)
l2_clf1.score(X_test_vectorized, y_test)

0.46753745678063774

In [15]:
l2_clf2 = LogisticRegression(random_state=0, solver='liblinear', penalty='l2', C=2.0).fit(X_train_vectorized, y_train)
l2_clf2.score(X_test_vectorized, y_test)

0.4932769880906646

In [16]:
l2_clf3 = LogisticRegression(random_state=0, solver='liblinear', penalty='l2', C=3.0).fit(X_train_vectorized, y_train)
l2_clf3.score(X_test_vectorized, y_test)

0.5009604302727622

In [17]:
l2_clf4 = LogisticRegression(random_state=0, solver='liblinear', penalty='l2', C=5.0).fit(X_train_vectorized, y_train)
l2_clf4.score(X_test_vectorized, y_test)

0.5086438724548598

In [18]:
l2_clf5 = LogisticRegression(random_state=0, solver='liblinear', penalty='l2', C=5.1).fit(X_train_vectorized, y_train)
l2_clf5.score(X_test_vectorized, y_test)

0.5082597003457549

<font color='lightblue'> <h3>Evaluation</h3></font>

- For the L2 penalty with the 'liblinear' solver, the highest test score among the runs above is 0.5086 at C = 5.0; C = 5.1 is marginally lower (0.5083).

<font color='lightblue'> <h3>Try Other Solvers (only the best-performing model for each solver is shown)</h3></font>

In [19]:
l2_clf6 = LogisticRegression(random_state=0, solver='lbfgs', penalty='l2', C=3.7, max_iter = 1000).fit(X_train_vectorized, y_train)
l2_clf6.score(X_test_vectorized, y_test)

0.50787552823665

For the 'lbfgs' solver, the best score is 0.5079 at C = 3.7.

In [20]:
l2_clf7 = LogisticRegression(random_state=0, solver='newton-cg', penalty='l2', C=3.0, max_iter = 1000).fit(X_train_vectorized, y_train)
l2_clf7.score(X_test_vectorized, y_test)

0.504802151363811

For the 'newton-cg' solver, the best score is 0.5048 at C = 3.0.

In [21]:
l2_clf8 = LogisticRegression(random_state=0, solver='saga', penalty='l2', C=2.4, max_iter = 1000).fit(X_train_vectorized, y_train)
l2_clf8.score(X_test_vectorized, y_test)

0.49980791394544755

For the 'saga' solver, the best score is 0.4998 at C = 2.4.

In [22]:
l2_clf9 = LogisticRegression(random_state=0, solver='sag', penalty='l2', C=3.0, max_iter = 1000).fit(X_train_vectorized, y_train)
l2_clf9.score(X_test_vectorized, y_test)

0.504802151363811

For the 'sag' solver, the best score is 0.5048 at C = 3.0.

In [23]:
# l2_clf10 = LogisticRegression(random_state=0, solver='newton-cholesky', penalty='l2', C=3.0, max_iter = 1000).fit(X_train_vectorized, y_train)
# l2_clf10.score(X_test_vectorized, y_test)

For the 'newton-cholesky' solver, the model took too long to fit, so there is no score for this one.

<font color='lightblue'> <h3>Evaluation</h3></font>

- All in all, for L2, the best score shown is 0.5086 with (solver='liblinear', C = 5.0), just ahead of (solver='lbfgs', C = 3.7) at 0.5079; (solver='newton-cg', C = 3.0) and (solver='sag', C = 3.0) tie at 0.5048.
- Compared with the best L2 model, the best L1 model (solver='liblinear', C = 2.5) performs better, with a score of 0.5225.
- As the coefficient check below shows, the model with L1 regularization has many coefficients equal to exactly 0: the L1 penalty can drive coefficients to 0 and so performs feature selection, which makes the model easier to interpret. L2 only shrinks coefficients toward 0 without zeroing them, so every coefficient stays non-zero and the model is more complex and harder to interpret.
- While experimenting with different C values and solvers, I hit convergence issues with both L1 and L2, so it is hard to say which regularization is more prone to them.
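One way to tell whether a particular fit ran into the convergence issue is to record warnings during `fit` and look for scikit-learn's `ConvergenceWarning`. The sketch below assumes synthetic two-class data and a hypothetical helper named `fit_and_check`. It forces the warning by giving 'saga' a budget of a single iteration, then repeats the fit with the larger budget used in the cells above.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: not the MBTI dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)


def fit_and_check(solver, max_iter):
    """Fit an L2 model and report whether it finished without a ConvergenceWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure no warning is suppressed
        model = LogisticRegression(
            solver=solver, C=3.0, max_iter=max_iter, random_state=0
        ).fit(X_toy, y_toy)
    converged = not any(
        issubclass(w.category, ConvergenceWarning) for w in caught
    )
    return model, converged


# A single pass over the data is not enough for 'saga' to meet its tolerance
short_model, converged_short = fit_and_check("saga", max_iter=1)
# Much larger budget, as in the notebook cells above
long_model, converged_long = fit_and_check("saga", max_iter=10000)

print(converged_short, short_model.n_iter_)
print(converged_long, long_model.n_iter_)
```

Logging this flag next to each score would show which (penalty, solver, C) combinations actually failed to converge, instead of relying on memory of which cells printed a warning.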

In [24]:
l1_clf5.coef_

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [25]:
l2_clf9.coef_

array([[-8.39857862e-02,  1.12159855e-01, -2.98069890e-03, ...,
        -1.17238823e-02, -5.07653395e-04, -3.83207798e-03],
       [-1.10433832e-01,  4.51155639e-01, -7.91610365e-03, ...,
         1.62640620e-01, -3.36453371e-03, -2.25928212e-02],
       [ 1.99060476e-01,  3.46716726e-01, -4.98020077e-03, ...,
        -2.97072827e-03, -7.29894239e-04, -8.68032675e-03],
       ...,
       [ 7.58923816e-01, -3.08842907e-01,  2.06115233e-01, ...,
        -1.63131356e-02, -2.55231794e-03, -5.82909784e-03],
       [-2.40358148e-01, -2.01293536e-01, -8.14025976e-03, ...,
        -4.88366130e-03, -1.44182025e-03, -6.77584810e-03],
       [-1.90118801e-01, -4.79191086e-01, -1.25699046e-02, ...,
        -1.18530056e-02, -4.39622404e-03, -6.27113388e-03]])
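The printed arrays show only the corners of each coefficient matrix, so the sparsity difference is easier to judge by counting exact zeros. A minimal sketch, assuming synthetic two-class data with mostly uninformative features and a hypothetical helper `zero_fraction`. The same helper could be applied to `l1_clf5` and `l2_clf9`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 5 informative features out of 50 (assumption)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=50, n_informative=5, n_redundant=0, random_state=0
)

l1_toy = LogisticRegression(
    solver="liblinear", penalty="l1", C=0.1, random_state=0
).fit(X_toy, y_toy)
l2_toy = LogisticRegression(
    solver="liblinear", penalty="l2", C=0.1, random_state=0
).fit(X_toy, y_toy)


def zero_fraction(model):
    """Share of coefficients that are exactly 0 in a fitted linear model."""
    return float(np.mean(model.coef_ == 0))


print("L1 zero fraction:", zero_fraction(l1_toy))
print("L2 zero fraction:", zero_fraction(l2_toy))
```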