# Classification

Classification, a method popular in machine learning, determines whether and how a model can distinguish between two sets of text.

It works like this. Everyone with email relies on classification to separate spam from legitimate emails. Email providers train their computational models to recognize the difference by giving them emails they have labeled “spam” and “not spam.” They then ask the model to learn the features that most reliably distinguish the two types, which could include a preponderance of all caps or phrases like “free money” or “get paid.” They test the model by giving it unlabeled emails and asking it to classify them. If the model can do it accurately a high percentage of the time, that’s a good spam filter.
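The spam-filter workflow can be sketched in a few lines. This is a toy illustration, not how any email provider actually builds its filter: the four labeled emails and the two unlabeled ones are made up, and the choice of word counts plus logistic regression is an assumption that simply mirrors the tools used later in this lesson.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: four labeled emails (1 = spam, 0 = not spam).
train_emails = [
    "FREE MONEY get paid today",
    "get paid fast free money now",
    "meeting notes attached for tomorrow",
    "lunch tomorrow with the team",
]
train_labels = [1, 1, 0, 0]

# Turn each email into word counts, then fit a model on the labeled examples.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_emails)
model = LogisticRegression()
model.fit(X_train, train_labels)

# Ask the model to classify emails it has not seen before.
new_emails = ["free money free money", "notes from the team meeting"]
predictions = model.predict(vectorizer.transform(new_emails))
print(predictions)
```

With so little data the model only learns which words appeared under which label, but the steps (label, train, test on unseen text) are the same ones used on a real corpus.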

We can take the underlying idea and apply it to many experiments.

## Imports

As always, we begin with some imports.

In [None]:
import pandas as pd
import glob
from pathlib import Path
import re
from pandas import Series, DataFrame
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from scipy.stats import pearsonr, norm

## Corpus 

In [None]:
directory = "../docs/NYT-Obituaries/"
files = glob.glob(f"{directory}/*.txt")
obit_titles = [Path(file).stem for file in files]
obit_titles

In [None]:
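# create the document-term matrix (dtm)
# note: new_stopwords is built in the "load stopwords" cell, so run that cell first
corpus_path = '../docs/NYT-Obituaries/'
# recent scikit-learn versions expect stop_words as a list, not a frozenset
vectorizer = CountVectorizer(input='filename', encoding='utf8', stop_words = list(new_stopwords), min_df=40, dtype='float64')
corpus = []
for title in obit_titles:
    filename = title + ".txt"
    corpus.append(corpus_path + filename)
dtm = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab = vectorizer.get_feature_names_out()
matrix = dtm.toarray()
df = DataFrame(matrix, columns=vocab)
print('df shape is: ' + str(df.shape))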

In [None]:
meta = pd.read_csv("../docs/NYT-Obituaries.csv", encoding = 'utf-8')
meta = meta[["title", "gender", "date"]]
meta

In [None]:
# load stopwords
from sklearn.feature_extraction import text
# read the custom stopword list; the with-block closes the file afterwards
with open('../docs/jockers_stopwords.txt') as text_file:
    jockers_words = text_file.read().split()
new_stopwords = text.ENGLISH_STOP_WORDS.union(jockers_words)

In [None]:
df_final = pd.concat([meta, df], axis = 1)

In [None]:
df_final.head()

In [None]:
# PIPELINE FOR ONE VS ALL (LEAVE-ONE-OUT) CLASSIFICATION
# hold out one obituary, train on all the others, then predict the held-out one

meta['PROBS'] = ''
meta['PREDICTED'] = ''

# the l1 penalty needs a solver that supports it, such as liblinear
model = LogisticRegression(penalty = 'l1', C = 1.0, solver = 'liblinear')

for this_index in df_final.index.tolist():
    print(this_index)
    title = meta.loc[meta.index[this_index], 'title'] 
    CLASS = meta.loc[meta.index[this_index], 'gender']
    print(title, CLASS)
    train_index_list = [index_ for index_ in df.index.tolist() if index_ != this_index]

    X = df.loc[train_index_list]
    y = meta.loc[train_index_list, 'gender']
    TEST_CASE = df.loc[[this_index]]

    model.fit(X,y)
    prediction = model.predict_proba(TEST_CASE)
    # predict returns an array; take its single label so one value is stored
    predicted = model.predict(TEST_CASE)[0]
    meta.at[this_index, 'PREDICTED'] = predicted
    meta.at[this_index, 'PROBS'] = str(prediction)
    print('Class is: ' + str(CLASS) + '\n' + 'Prediction is: ' + str(predicted) + ' ' + str(prediction) + '\n')
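The loop above holds out one obituary at a time, trains on the rest, and records a prediction for the held-out text. One way to summarize the result is to compare the `PREDICTED` column with the true labels. The sketch below uses a small made-up frame named `results` as a stand-in for `meta` after the loop has run; the values are hypothetical and say nothing about how the model performs on the real corpus.

```python
import pandas as pd

# Hypothetical stand-in for `meta` once PREDICTED has been filled in.
results = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "gender": [0, 1, 1, 0],
    "PREDICTED": [0, 1, 0, 0],
})

# A prediction counts as correct when it matches the true label.
correct = results["gender"] == results["PREDICTED"]

# The mean of a True/False column is the share of correct predictions.
accuracy = correct.mean()
print(f"Accuracy: {accuracy:.2f}")
```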

In [None]:
canonic_c = 1.0

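def Ztest(vec1, vec2):
    # two-sample z-test: returns the z statistic and the two-sided p-value

    X1, X2 = np.mean(vec1), np.mean(vec2)
    sd1, sd2 = np.std(vec1), np.std(vec2)
    n1, n2 = len(vec1), len(vec2)

    # standard error of the difference between the two means
    pooledSE = np.sqrt(sd1**2/n1 + sd2**2/n2)
    z = (X1 - X2)/pooledSE
    pval = 2*(norm.sf(abs(z)))

    return z, pval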

def feat_pval_weight(meta_df_, dtm_df_):
    
    #dtm_df_ = dtm_df_.loc[meta_df_.index.tolist()]
    #dtm_df_ = normalize_model(dtm_df_, dtm_df_)[0]
    #dtm_df_ = dtm_df_.dropna(axis = 1, how='any')

    dtm0 = dtm_df_.loc[meta_df_[meta_df_['gender']==0].index.tolist()].to_numpy()
    dtm1 = dtm_df_.loc[meta_df_[meta_df_['gender']==1].index.tolist()].to_numpy()

    pvals = [Ztest(dtm0[ : ,i], dtm1[ : ,i])[1] for i in range(dtm_df_.shape[1])]
    # the l1 penalty needs a solver that supports it, such as liblinear
    clf = LogisticRegression(penalty = 'l1', C = canonic_c, class_weight = 'balanced', solver = 'liblinear')
    clf.fit(dtm_df_, meta_df_['gender']==0)
    weights = clf.coef_[0]

    feature_df = pd.DataFrame()

    feature_df['FEAT'] = dtm_df_.columns
    feature_df['P_VALUE'] = pvals
    feature_df['LR_WEIGHT'] = weights

    return feature_df

# Bonferroni correction: divide the 0.05 threshold by the number of words tested
sig_thresh = 0.05 / len(df.columns)

feat_df = feat_pval_weight(meta, df)

feat_df.to_csv('../docs/features_obits.csv')
out = feat_df[(feat_df['P_VALUE'] <= sig_thresh)].sort_values('LR_WEIGHT', ascending = True)
out = out[out['LR_WEIGHT'] != 0]
outM = out[out['LR_WEIGHT'] >= 0]
outW = out[out['LR_WEIGHT'] <= 0]

outM = outM['FEAT'].tolist()
print("Here are significant words that distinguish men: " + str(outM))
outW = outW['FEAT'].tolist()
print("Here are significant words that distinguish women: " + str(outW))
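The cell above tests each word with a two-sample z-test and keeps only the words whose p-value falls at or below 0.05 divided by the number of words tested (a Bonferroni correction). Here is a worked example of that arithmetic for a single word. The counts in `group_a` and `group_b` and the vocabulary size `n_words_tested` are hypothetical, chosen only to make the calculation easy to follow.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical counts of one word in two groups of texts.
group_a = np.array([4.0, 5.0, 6.0, 5.0, 4.0, 6.0])
group_b = np.array([1.0, 2.0, 1.0, 0.0, 2.0, 1.0])

# z statistic: the difference in means divided by its standard error.
standard_error = np.sqrt(group_a.var() / len(group_a) + group_b.var() / len(group_b))
z = (group_a.mean() - group_b.mean()) / standard_error

# Two-sided p-value from the standard normal distribution.
p_value = 2 * norm.sf(abs(z))

# Bonferroni correction for testing many words at once.
n_words_tested = 1000  # hypothetical vocabulary size
threshold = 0.05 / n_words_tested

print(f"z = {z:.2f}, p = {p_value:.2e}, significant: {p_value <= threshold}")
```

Dividing the threshold by the number of tests makes it harder for any single word to count as significant, which guards against false positives when thousands of words are tested.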

## Read in dataframe

For classification, we need two kinds of things: text and classes (e.g. gender, race, publisher). Pandas dataframes are useful for classification because they can hold a complete text and its metadata in a single row.

For this lesson, we're going to use our _New York Times_ obituaries corpus, which I have supplemented with the gender of the person who died and the date of publication.
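As a minimal sketch of that layout, the frame below uses made-up titles, dates, and text rather than rows from the actual corpus. The column names `title`, `gender`, and `date` follow the metadata file; the `text` column and the `0`/`1` class codes are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows: each one pairs a text with its metadata.
obits = pd.DataFrame({
    "title": ["Example-Person-One", "Example-Person-Two"],
    "gender": [0, 1],
    "date": [1901, 1950],
    "text": ["A placeholder obituary text.", "Another placeholder obituary text."],
})

# A single row carries one text together with everything known about it.
first_row = obits.loc[0]
print(first_row["title"], first_row["gender"], first_row["date"])

# For classification, the texts become the inputs and one metadata column the classes.
texts = obits["text"]
classes = obits["gender"]
```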

## Load stopwords

Many words are uninteresting or unhelpful for classification, so we treat them as stopwords and remove them from the corpus.
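To see what stopword removal does, here is a small sketch using scikit-learn's built-in English list plus two extra words. The `extra_stopwords` list and the example sentence are hypothetical; they stand in for a custom list read from a file and for a text from the corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical extra stopwords, standing in for a custom list read from a file.
extra_stopwords = ["mr", "mrs"]
stopwords = list(ENGLISH_STOP_WORDS.union(extra_stopwords))

# Words on the stopword list are dropped before the vocabulary is built.
vectorizer = CountVectorizer(stop_words=stopwords)
vectorizer.fit(["Mr. Smith was the author of novels"])

vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```

Only the words that are not on either list survive into the vocabulary, so the classifier never sees the removed words as features.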