# Exploratory Data Analysis

# Imports, Read-in

In [11]:
# Data manip.
import pandas as pd
import numpy as np

# Vizz
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# scikit-learn
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# NLTK
import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.stem import WordNetLemmatizer
# nltk.download('stopwords')
sw = stopwords.words('english')

# etc.
import sys
sys.path.append( '../src' )
from parse_it import get_wordnet_pos, parse_doc
from pretty_results import *

The data is stored in two separate `.csv` files.

In [None]:
df_good = pd.read_csv('../../data/good.csv')
df_promo = pd.read_csv('../../data/promotional.csv')

Taking a look at a sample article from the `good` articles dataset.

In [None]:
df_good.iloc[138].text

Testing out a custom function, `parse_doc`, which takes care of several NLP preprocessing steps: lowercasing, punctuation and character stripping, lemmatizing, and removing stopwords. It returns a string of *non-unique lemmas*, but can also return a list by setting the argument `as_list = True`. If stemming is preferred to lemmatizing, this can also be done within the function: `stem = 'stem'`.

In [None]:
parse_doc( df_good.iloc[138].text )
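
The source of `parse_doc` lives in `../src/parse_it` and is not reproduced in this notebook. As a rough illustration of the steps described above, here is a simplified stand-in. The name `simple_parse` and the tiny stopword set are hypothetical, and lemmatizing is deliberately left out so that the sketch runs without NLTK data.

```python
import re

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's English list.
DEMO_STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "was"}


def simple_parse(doc, as_list=False, stopwords=DEMO_STOPWORDS):
    """Hypothetical, simplified stand-in for parse_doc (no lemmatizing)."""
    # Lowercase, then replace anything that is not a letter or whitespace.
    cleaned = re.sub(r"[^a-z\s]", " ", doc.lower())
    # Split on whitespace and drop stopwords; repeated tokens are kept.
    tokens = [tok for tok in cleaned.split() if tok not in stopwords]
    return tokens if as_list else " ".join(tokens)


parsed = simple_parse("The ships of 1941 were British merchant ships.")
print(parsed)  # ships were british merchant ships
```

Passing `as_list=True` returns the same tokens as a list rather than a single string.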

And, just to get the lay of the land, a look at one of the `promotional` articles.

In [None]:
df_promo.iloc[138].text

# Sample Size, Scope, and Content

In [None]:
print(df_good.shape)
print(df_promo.shape)

In [None]:
df_good.shape[0] + df_promo.shape[0]

In terms of **rows/records**,
- The dataframe containing **"good"** articles has 30,279 entries.
- The dataframe containing **"promotional"** articles has 23,837 entries.

Combined, we have **54,116** articles for examination.

Next, let's discuss features/columns.

In [None]:
print(df_good.columns)
print(df_promo.columns)

- The dataframe containing **"good"** articles has 2 columns - `text` and `url`.
- The dataframe containing **"promotional"** articles has 7 columns - in addition to `text` and `url`, there are five subtypes of "promotional tone":

    - `advert`: The article reads like an advertisement.
    - `coi`: The article appears to have been written by someone with a close connection to the subject.
    - `fanpov`: The article appears to have been written from a fan's point of view, rather than a neutral point of view.
    - `pr`: The article reads like a press release/news article.
    - `resume`: The (biographical) article reads like a résumé, i.e. it is neither neutral nor encyclopedic in nature.
    
    The values contained in these columns are one-hot encoded binary values. See the dataframe heads below for a tabular representation of the data.

In [None]:
df_good.head(3)

In [None]:
df_promo.head(3)

### Side Investigation: Average Length of Articles?

While reading in the data, I noticed that the `promotional` article I selected semi-randomly was considerably shorter than the semi-random `good` article. I wondered whether this observation held across the rest of the dataset, so I decided to compare the average article length for each class.

`len()` can give us a count of characters - `split()` must be used to get a word count.

In [None]:
print(len(df_good.iloc[0].text))
print(len(df_good.iloc[0].text.split()))

We can use string methods on the dataframes by calling `.str` - this makes calculations a lot easier.

In [None]:
df_good.text.str.len()

In [None]:
df_promo.text.str.len()

In [None]:
avg_char_good = df_good.text.str.len().mean()
avg_char_promo = df_promo.text.str.len().mean()

In [None]:
split_words_good = df_good.text.str.split()
split_words_promo = df_promo.text.str.split()

In [None]:
word_count_good = 0

for article in split_words_good:
    word_count_good += len(article)
    
avg_words_good = word_count_good / len(split_words_good)

print(avg_words_good)

In [None]:
word_count_promo = 0

for article in split_words_promo:
    word_count_promo += len(article)
    
avg_words_promo = word_count_promo / len(split_words_promo)

print(avg_words_promo)

In [None]:
print(f"Average 'good' article length: {avg_char_good:.0f} characters, {avg_words_good:.0f} words")
print(f"Average 'promotional' article length: {avg_char_promo:.0f} characters, {avg_words_promo:.0f} words")
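
The per-article loops above can also be written as chained `.str` calls. This is a minimal sketch on a toy dataframe (the `demo` frame is made up for illustration), not a replacement for the cells above.

```python
import pandas as pd

# Toy stand-in for df_good / df_promo.
demo = pd.DataFrame({"text": ["one two three", "four five"]})

# Character count per article, then the mean across articles.
avg_chars = demo["text"].str.len().mean()

# Split each article on whitespace, count the tokens, then average.
avg_words = demo["text"].str.split().str.len().mean()

print(avg_chars, avg_words)  # 11.0 2.5
```

On a column of lists, `.str.len()` returns the length of each list, which is what makes the second chain work.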

In [None]:
# # Visualization of average lengths

# fig, ax = plt.subplots(nrows=2,
#                        ncols=1,
#                        figsize=(12, 12))

# ax[0].bar(x=['Good', 'Promo'],
#            height=(avg_words_good, avg_words_promo))

# ax[1].bar(x=['Good', 'Promo'],
#           height=(avg_char_good, avg_char_promo))

## Checking for Nulls / Data Types

In [None]:
df_good.info()

In [None]:
df_promo.info()

## Value Counts for Subclasses

Knowing there are five different subtypes of promotional article indicated within the dataset raises a further question: *how are those subtypes distributed?*

In [None]:
df_promo.select_dtypes(include='number').columns

In [None]:
# class_cols = df_promo.select_dtypes(include='number').columns
class_cols = df_promo.select_dtypes(include='number').columns.tolist()

class_cols

In [None]:
for col in class_cols:
    print(f"{df_promo[[col]].value_counts(normalize=True)}\n")
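
Because the subtype columns hold only 0s and 1s, the share of flagged articles in each column is simply the column mean, and a row-wise sum shows how many flags each article carries. The sketch below uses a made-up frame with three of the five columns; whether the real subtypes overlap is something to check on `df_promo` itself.

```python
import pandas as pd

# Toy stand-in for the binary subtype columns of df_promo.
demo_flags = pd.DataFrame({
    "advert": [1, 1, 0, 1],
    "coi":    [0, 1, 1, 0],
    "pr":     [0, 0, 0, 1],
})

# Mean of a 0/1 column = proportion of rows flagged with that subtype.
share_flagged = demo_flags.mean()

# Number of subtype flags per article (values above 1 mean overlapping subtypes).
flags_per_article = demo_flags.sum(axis=1)

print(share_flagged)
print(flags_per_article.tolist())  # [1, 2, 1, 2]
```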

In [None]:
df_promo.columns

In [None]:
fig, ax = plt.subplots(nrows=1,
                       ncols=5,
                       figsize=(30, 6),
                       sharey='all')

titles = {'advert': "Advertisement-like",
          'coi': "Conflict of interest",
          'fanpov': "Written from fan's point of view",
          'pr': "Written like a news article/press release",
          'resume': "Reads like a résumé"}

for axis, (col, title) in zip(ax, titles.items()):
    # Order the shares as (0, 1) so the tick labels always match the bars,
    # regardless of which value is more frequent in a given column
    shares = df_promo[col].value_counts(normalize=True).reindex([0, 1], fill_value=0)
    axis.bar(x=shares.index,
             height=shares.values,
             tick_label=['False', 'True'])
    axis.set_title(title)

# Set-up for Simple Binary Classification

First, we drop all columns but `text`, which will be our primary feature.

In [None]:
df_good = df_good[['text']]
df_promo = df_promo[['text']]

Before concatenating the simplified dataframes, I create a new column `label` in each dataframe and give it the same value in every row. In `df_good`, each row is given the label `0` to indicate `False`, i.e. the article does ***not*** have a promotional tone. Conversely, each row in `df_promo` is given the label `1` to indicate `True`, i.e. the article ***does*** contain content that is promotional in tone.

Multi-label classification of the promotional subtypes is explored in a [separate notebook](Multi-label.ipynb).

In [None]:
df_good['label'] = 0
df_good.head(3)

In [None]:
df_promo['label'] = 1
df_promo.head(3)

Next, we concatenate the dataframes using `pd.concat`; `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0. Setting `ignore_index` to `True` means that the index values from `df_promo` are not carried over when it is concatenated onto `df_good`; instead, the index continues where `df_good`'s index leaves off.

In [None]:
df = pd.concat([df_good, df_promo],
               ignore_index=True)

df
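
The effect of `ignore_index=True` is easiest to see on two tiny frames. The sketch below uses `pd.concat`, which accepts the same `ignore_index` argument; the `demo_*` frames are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the labelled df_good and df_promo.
demo_good = pd.DataFrame({"text": ["a", "b"], "label": 0})
demo_promo = pd.DataFrame({"text": ["c", "d"], "label": 1})

# Default behaviour: each frame keeps its own index, so values repeat.
kept = pd.concat([demo_good, demo_promo])

# ignore_index=True: a fresh 0..n-1 index is built for the combined frame.
renumbered = pd.concat([demo_good, demo_promo], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```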

In [None]:
# df = df.sample(frac = 1).reset_index(drop=True)

# df

In [None]:
# freq_out(df, 'text', 10)

In [None]:
# df['text_lem'] = df['text'].apply(parse_doc)

In [None]:
# df.to_csv('lemmed_combined.csv')

## Reading in Newly Created csv file (with lemmas)

In [2]:
lemmed_df = pd.read_csv('../../data/lemmed_combined.csv', index_col=0)

In [3]:
lemmed_df.head(3)

|   | text | label | text_lem |
|---|------|-------|----------|
| 0 | Ryan Steven Lochte lkti LOK tee born August 3,... | 0 | ryan steven lochte lkti lok tee bear august am... |
| 1 | CAM ships were World War II era British mercha... | 0 | cam ship world war ii era british merchant shi... |
| 2 | The politics of Vietnam are defined by a singl... | 0 | politics vietnam define single party socialist... |


In [None]:
additional_sw = ['january',
                 'february',
                 'april', # 'march' and 'may' are English verbs and
                          #  are thus excluded
                 'june',
                 'july',
                 'august',
                 'september',
                 'october',
                 'november',
                 'december']
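
This notebook does not show how `additional_sw` is used later. One plausible use, shown here purely as an assumption, is to append it to the base stopword list and hand the result to the vectorizer. The short `base_sw` and `month_sw` lists below are stand-ins for `sw` and `additional_sw`.

```python
from sklearn.feature_extraction.text import CountVectorizer

base_sw = ["the", "in", "of"]     # stand-in for NLTK's English stopwords
month_sw = ["january", "july"]    # stand-in for additional_sw

# stop_words accepts a plain list of tokens to drop during tokenization.
cvec = CountVectorizer(stop_words=base_sw + month_sw)
cvec.fit(["bear july colombo sri lanka",
          "elmer harrison flick january january american"])

# vocabulary_ maps each kept token to its column index.
vocab = sorted(cvec.vocabulary_)
print(vocab)
```

The month names no longer appear in the fitted vocabulary, while the remaining tokens are kept.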

# Modeling Setup/Brainstorming

In [4]:
X = lemmed_df['text_lem']
y = lemmed_df['label']

In [5]:
X

0        ryan steven lochte lkti lok tee bear august am...
1        cam ship world war ii era british merchant shi...
2        politics vietnam define single party socialist...
3        pennsylvania route pa state highway locate mon...
4        clubland tv british free air dance music chann...
                               ...                        
54111    guatemala send delegation compete summer paral...
54112    charles augustus ollivierre july march vincent...
54113    dhanushka jayakody bear july colombo sri lanka...
54114    elmer harrison flick january january american ...
54115    safdarjung tomb sandstone marble mausoleum del...
Name: text_lem, Length: 54116, dtype: object

In [6]:
y

0        0
1        0
2        0
3        0
4        1
        ..
54111    0
54112    0
54113    1
54114    0
54115    0
Name: label, Length: 54116, dtype: int64

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=42,
                                                    stratify=y)

In [8]:
y_train.value_counts(normalize=True)

0    0.559533
1    0.440467
Name: label, dtype: float64

In [9]:
y_test.value_counts(normalize=True)

0    0.559494
1    0.440506
Name: label, dtype: float64
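
The near-identical class shares in the two outputs above are what `stratify=y` is meant to produce. A minimal sketch on synthetic labels (60 of class 0, 40 of class 1, all made up) shows the same behaviour.

```python
from sklearn.model_selection import train_test_split

# Synthetic data: 60 documents of class 0 and 40 of class 1.
labels = [0] * 60 + [1] * 40
docs = [f"doc {i}" for i in range(100)]

X_tr, X_te, y_tr, y_te = train_test_split(docs, labels,
                                          test_size=0.25,
                                          random_state=42,
                                          stratify=labels)

# Share of class 1 in each split; stratification keeps both close to 0.40.
share_train = sum(y_tr) / len(y_tr)
share_test = sum(y_te) / len(y_te)
print(share_train, share_test)
```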

## `DummyClassifier`

In [10]:
dum_pipe = Pipeline(steps=[
    ('cvec', CountVectorizer()),
    ('dum', DummyClassifier(strategy='most_frequent'))
])

dum_cv_res = cross_validate(dum_pipe,
                            X_train,
                            y_train,
                            scoring=('accuracy', 'f1_macro'),
                            cv=5,
                            verbose=1,
                            n_jobs=-2,
                            return_train_score=True)

pretty_cv(dum_cv_res)

[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.


CV Results
Accuracy
--------------------------------
Training accuracy: 0.560
Test accuracy:     0.560
F-1 Score
--------------------------------
Training F1 score: 0.359
Test F1 score:     0.359


[Parallel(n_jobs=-2)]: Done   5 out of   5 | elapsed:  1.5min finished


Results from cross-validation with `DummyClassifier`:
- Validation accuracy = 0.56 - **baseline performance**
- Validation F1 = 0.36
- **Execution time:** 1 m, 30 s
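
Both baseline numbers follow directly from the class balance. A classifier that always predicts label `0` is right exactly as often as label `0` occurs, and its macro F1 averages a moderate F1 for the majority class with an F1 of zero for the class it never predicts. The arithmetic below is a worked check that uses the training-split share printed earlier.

```python
# Share of label 0 in the training split, taken from the value_counts output.
p_majority = 0.559533

# Always predicting label 0: recall is 1 and precision equals the class share.
precision, recall = p_majority, 1.0
f1_majority = 2 * precision * recall / (precision + recall)

# Label 1 is never predicted, so its F1 is scored as 0.
f1_minority = 0.0

macro_f1 = (f1_majority + f1_minority) / 2
print(f"accuracy = {p_majority:.3f}, macro F1 = {macro_f1:.3f}")
```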

## `DecisionTreeClassifier`

In [12]:
dtc_pipe = Pipeline(steps=[
    ('cvec', CountVectorizer()),
    ('dtc', DecisionTreeClassifier())
])

dtc_cv_res = cross_validate(dtc_pipe,
                            X_train,
                            y_train,
                            scoring=('accuracy', 'f1_macro'),
                            cv=5,
                            verbose=1,
                            n_jobs=-2,
                            return_train_score=True)

pretty_cv(dtc_cv_res)

[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.


CV Results
Accuracy
--------------------------------
Training accuracy: 1.000
Test accuracy:     0.873
F-1 Score
--------------------------------
Training F1 score: 1.000
Test F1 score:     0.871


[Parallel(n_jobs=-2)]: Done   5 out of   5 | elapsed:  4.0min finished


Results from cross-validation with `DecisionTreeClassifier`:
- Validation accuracy = 0.87
- Validation F1 = 0.87
- **Execution time:** 3 m, 57 s
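
`pretty_cv` is a custom helper from `pretty_results`, and its source is not shown in this notebook. When `cross_validate` is given several scorers and `return_train_score=True`, it returns a dict with keys such as `train_accuracy` and `test_f1_macro`, one value per fold. A hypothetical helper that averages those folds (the name `summarize_cv` is my own) could look like this:

```python
from statistics import mean


def summarize_cv(cv_results, metrics=("accuracy", "f1_macro")):
    """Hypothetical helper: average train/test scores across CV folds."""
    summary = {}
    for metric in metrics:
        summary[metric] = {
            "train": mean(cv_results[f"train_{metric}"]),
            "test": mean(cv_results[f"test_{metric}"]),
        }
    return summary


# Made-up two-fold results; cross_validate returns arrays, lists stand in here.
fake_results = {
    "train_accuracy": [1.0, 1.0],
    "test_accuracy": [0.87, 0.88],
    "train_f1_macro": [1.0, 1.0],
    "test_f1_macro": [0.86, 0.88],
}

summary = summarize_cv(fake_results)
print(summary)
```

A large gap between the `train` and `test` averages, as in the decision tree results above, is the usual sign of overfitting.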

## `MultinomialNB`

## `RandomForestClassifier`

## `GradientBoostingClassifier`

## `XGBRFClassifier`