# NLP Authorship Identification - Your Social Media Posts vs Bad CEOs'

![Image: image from https://trekkingwithdennis.com/tag/star-trek-voyager/](banner.png "image from trekkingwithdennis.com")

In this article, we will entertian ourselves by comparing our's, and, our galatic leaders' posts on the professional network BeamedIn, againts those of past earthling CEOs who were convicted frauds or unnanimously declared as unpleasant folk. 

We will do this with a technique called **Authorship Identification** in NLP.

This enables us to identify the most likely author of articles, news or messages. Authorship identification. This will be our aegies in navigating the multitude of Star Trek level of villinous propaganda in this age of social media

# Building The Author Learning Pipeline

Here are the steps we will undertake:
1. Clean these articles: stop words, lematize, and normalize.
2. Extract features through bag of word (BoW).
3. Tokenize words
4. Downscale to frequencies from word occurences.
5. Train classifier - for this we will find the best from a group of classifiers.

Let's prepare our notebook before the above work:

In [45]:
# Pip below for Kaggle and online notebooks.
# !pip install ipywidgets
# !pip install sklearn
# !pip install spacy

# General-purpose Libraries
import os
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import re
from collections import Counter
%matplotlib inline

# Remove anywarning texts from notebooks.
import warnings
warnings.filterwarnings("ignore")

# Discover files in kaggle if any.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

data = pd.read_csv('./communications.csv')
data.head(1)

Unnamed: 0,publication,author,content
0,messages,Elizabeth Holmes,"‘I want to be a billionaire.’\n‘No, the presid..."


We will be using `scikit-learn` to create pipelines. A Pipeline will create a compound classifier through these steps:
1. vectorizer
2. transformer
3. classifier

In [2]:
# 1
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import StackingClassifier, ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from nltk.stem.snowball import SnowballStemmer
from sklearn.linear_model import LogisticRegression
from nltk.stem.porter import *
import nltk
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
nltk.download()

stemmer = SnowballStemmer("english", ignore_stopwords=True)
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])

stemmed_count_vect = StemmedCountVectorizer(stop_words='english')

# 2
RANDOM_STATE = 42

nbc = MultinomialNB(fit_prior=False)
lgc = LogisticRegression(n_jobs=-1)
knc = KNeighborsClassifier(n_neighbors=1)
sgdc = SGDClassifier(loss='hinge',
                      penalty='l2',
                      alpha=1e-3,
                      random_state=42,
                      max_iter=5,
                      tol=None,
                      n_jobs=-1)
estimators = [('nbc', nbc), ('sgdc', sgdc), ('lgc', lgc)]
etc = ExtraTreesClassifier(random_state=RANDOM_STATE)

# 3
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
sclf = StackingClassifier(estimators=estimators, 
                          final_estimator=etc, 
                          passthrough=True, 
                          cv=kfold)
tfidf = TfidfTransformer(use_idf=False)

# Pipeline
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

data = data.dropna()

y = data['author']
X = data['content']
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.24,
                                                    random_state=0,
                                                    stratify=y)

parameters = {
  'vect__ngram_range': [(1, 1), (1, 2)],
  'tfidf__use_idf': (True, False),
  'clf__sgdc__alpha': (1e-2, 1e-3),
  'clf__sgdc__max_iter': (5, 10),
  'clf__sgdc__tol': (0.0, 1e-3),
  'clf__lgc__solver': ['newton-cg', 'lbfgs', 'sag'],
  'clf__lgc__C': [0.3, 0.5, 0.7, 1],
  'clf__lgc__penalty': ['none', 'l2']
}
# To find out for the param grid.
# print(text_clf.get_params())
text_clf = Pipeline([
    ('vect', stemmed_count_vect),
    ('tfidf', tfidf),
    ('clf', sclf),
])
gs_clf = GridSearchCV(estimator=text_clf,
                      param_grid=parameters,
                      cv=5, n_jobs=-1)
gs_clf = gs_clf.fit(X_train, y_train)


showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


Let's vet the models we are stacking:

In [4]:
predicted = gs_clf.predict(X_test)

from sklearn import metrics
print(metrics.classification_report(y_test, predicted,
                                     target_names=y.unique()))

np.mean(predicted == y_test)

                  precision    recall  f1-score   support

Elizabeth Holmes       0.00      0.00      0.00         5
   Sunny Balwani       1.00      0.33      0.50         3
   Trevor Milton       0.48      1.00      0.65        11
    Adam Neumann       0.00      0.00      0.00         4
Jeffrey Skilling       0.00      0.00      0.00         1
    Donald Trump       0.80      0.80      0.80         5

        accuracy                           0.55        29
       macro avg       0.38      0.36      0.32        29
    weighted avg       0.42      0.55      0.44        29



0.5517241379310345

50% no perfect. We need more work here or specific models. Here is what SKTLearn recommends:

ml_map.png

Here we try some text to see to who it is

In [62]:
test_data = {'author': [''], 'publication': [''], 'content': [
    'The indictment, in a lot of ways, that was the turning point.']}
text = pd.DataFrame(test_data)

text_predicted = gs_clf.predict(text['content'])
text_predicted_prob = gs_clf.predict_proba(text['content'])

y = np.array(data['author'].unique())
txt = text_predicted[0]
txt_idx = np.where(y == txt)
text_predicted_prob[0][txt_idx]

array([0.58])

# Concluding our Analysis



# References

- https://spacy.io

## Github and Kaggle

Article here is also available on [Github]() and [Kaggle]()

#
<div align="right">Made with :heartpulse: by <b>Adam</b></div>