# Classification

Classification, a method popular in machine learning, determines whether and how a model can distinguish between two sets of text.

It works like this. Everyone with email relies on classification to separate spam from legitimate emails. Email providers train their computational models to recognize the difference by giving them emails they have labeled “spam” and “not spam.” They then ask the model to learn the features that most reliably distinguish the two types, which could include a preponderance of all caps or phrases like “free money” or “get paid.” They test the model by giving it unlabeled emails and asking it to classify them. If the model can do it accurately a high percentage of the time, that’s a good spam filter.
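
The spam-filter workflow described above can be sketched in a few lines of plain Python. This is only a toy illustration, not how any email provider works: the training emails are invented, and the model is a tiny naive Bayes classifier with add-one smoothing, chosen because it fits in a short example.

```python
import math
from collections import Counter

# Hypothetical labeled training emails (invented for illustration).
train = [
    ("FREE MONEY get paid today", "spam"),
    ("get paid fast free money now", "spam"),
    ("meeting notes for the seminar today", "not spam"),
    ("draft of the paper for the seminar", "not spam"),
]

def tokenize(text):
    return text.lower().split()

# "Training": count how often each word appears in each class.
word_counts = {"spam": Counter(), "not spam": Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(tokenize(text))
    doc_counts[label] += 1

vocab = set(word_counts["spam"]) | set(word_counts["not spam"])

def log_score(text, label):
    """Log prior plus log likelihood of the words, with add-one smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in tokenize(text):
        if word in vocab:  # ignore words never seen in training
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
    return score

def classify(text):
    """Return the label with the higher score for an unlabeled email."""
    return max(word_counts, key=lambda label: log_score(text, label))

print(classify("free money for you"))
print(classify("notes for the paper"))
```

Words such as "free" and "money" occur only in the spam examples, so they pull an unlabeled email toward the "spam" label, which is the sense in which the model "learns the features" that separate the two classes.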

We can take the underlying idea and apply it to many experiments.

## Imports

As always, we begin with some imports.

In [137]:
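import glob
import re
from pathlib import Path

import numpy as np
import pandas as pd
from pandas import DataFrame, Series
from scipy.stats import norm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression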

## Corpus 

In [138]:
directory = "../docs/NYT-Obituaries/"
files = glob.glob(f"{directory}/*.txt")
obit_titles = [Path(file).stem for file in files]
obit_titles

['1945-Adolf-Hitler',
 '1915-F-W-Taylor',
 '1975-Chiang-Kai-shek',
 '1984-Ethel-Merman',
 '1953-Jim-Thorpe',
 '1964-Nella-Larsen',
 '1955-Margaret-Abbott',
 '1984-Lillian-Hellman',
 '1959-Cecil-De-Mille',
 '1928-Mabel-Craty',
 '1973-Eddie-Rickenbacker',
 '1989-Ferdinand-Marcos',
 '1991-Martha-Graham',
 '1997-Deng-Xiaoping',
 '1938-George-E-Hale',
 '1885-Ulysses-Grant',
 '1909-Sarah-Orne-Jewett',
 '1957-Christian-Dior',
 '1987-Clare-Boothe-Luce',
 '1976-Jacques-Monod',
 '1954-Getulio-Vargas',
 '1979-Stan-Kenton',
 '1990-Leonard-Bernstein',
 '1972-Jackie-Robinson',
 '1998-Fred-W-Friendly',
 '1991-Leo-Durocher',
 '1915-B-T-Washington',
 '1997-James-Stewart',
 '1981-Joe-Louis',
 '1983-Muddy-Waters',
 '1942-George-M-Cohan',
 '1989-Samuel-Beckett',
 '1962-Marilyn-Monroe',
 '2000-Charles-M-Schulz',
 '1967-Gregory-Pincus',
 '1894-R-L-Stevenson',
 '1978-Bruce-Catton',
 '1982-Arthur-Rubinstein',
 '1875-Andrew-Johnson',
 '1974-Charles-Lindbergh',
 '1964-Rachel-Carson',
 '1953-Marjorie-Rawlings',


In [139]:
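# Create a document-term matrix (DTM) from the obituary files.
corpus_path = '../docs/NYT-Obituaries/'
# min_df=40 keeps only words that appear in at least 40 documents.
# stop_words must be a list (not a frozenset) in recent scikit-learn versions.
vectorizer = CountVectorizer(input='filename', encoding='utf8', stop_words=list(new_stopwords), min_df=40, dtype='float64')
corpus = [corpus_path + title + ".txt" for title in obit_titles]
dtm = vectorizer.fit_transform(corpus)
# get_feature_names() was removed from scikit-learn; use get_feature_names_out().
vocab = vectorizer.get_feature_names_out()
matrix = dtm.toarray()
df = DataFrame(matrix, columns=vocab)
print('df shape is: ' + str(df.shape))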

df shape is: (378, 1391)
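
To make the document-term matrix less of a black box, here is a minimal plain-Python sketch of the same idea, including a `min_df` filter. The three-sentence corpus is hypothetical, and the tokenizer is a bare `split()`, which is simpler than what `CountVectorizer` actually does.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the obituary files.
docs = [
    "the general led the army",
    "the actress led the cast",
    "the general retired",
]
min_df = 2  # keep only words found in at least two documents

tokenized = [doc.split() for doc in docs]

# Document frequency: the number of documents each word appears in.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
vocab = sorted(word for word, n in doc_freq.items() if n >= min_df)

# One row per document, one column per vocabulary word.
matrix = [[Counter(tokens)[word] for word in vocab] for tokens in tokenized]

print(vocab)
for row in matrix:
    print(row)
```

Each row is a document and each column a word, so the `(378, 1391)` shape reported above means 378 obituaries described by 1,391 words that passed the filters.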


### Read in dataframe

For classification, we need two kinds of things: text and classes (e.g. gender, race, publisher). Pandas dataframes are useful for classification because they can hold a complete text and its metadata in a single row.

For this lesson, we're going to use our _New York Times_ obituaries corpus, which I have supplemented with the gender of the person who died and the date of publication.

In [127]:
meta = pd.read_csv("../docs/NYT-Obituaries.csv", encoding = 'utf-8')
meta = meta[["title", "gender", "date"]]
df

| Unnamed: 0 | gender | obit | date |
|---|---|---|---|
| 0 | 0 | Hitler Fought Way to Power Unique in Modern Hi... | 1945.0 |
| 1 | 0 | F. W. Taylor, Expert in Efficiency, Dies BY TH... | 1915.0 |
| 2 | 0 | The Life of Chiang Kai-shek: A Leader Who Was ... | 1975.0 |
| 3 | 1 | Ethel Merman, Queen of Musicals, Dies at 76 By... | 1984.0 |
| 4 | 0 | Jim Thorpe Is Dead On West Coast at 64 Special... | 1953.0 |
| ... | ... | ... | ... |
| 373 | 0 | Andres Segovie Is Dead at 94; His Crusade Elev... | 1987.0 |
| 374 | 1 | Rita Hayworth, Movie Legend, Dies By ALBIN KRE... | 1987.0 |
| 375 | 0 | June 20, 1993 William Golding Is Dead at 81; ... | 1993.0 |
| 376 | 1 | Florenz Ziegfeld Dies in Hollywood After Long ... | 1932.0 |


## Load stopwords

Many words are uninteresting or unhelpful for classification, so we treat them as stopwords and remove them from the corpus.

In [128]:
# Load stopwords: scikit-learn's English list plus the words in jockers_stopwords.txt.
from sklearn.feature_extraction import text
with open('../docs/jockers_stopwords.txt') as text_file:
    jockers_words = text_file.read().split()
new_stopwords = text.ENGLISH_STOP_WORDS.union(jockers_words)
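
As a small illustration of what stopword removal does to a text, the sketch below filters one sentence against a short list. Both the sentence and the word lists are hypothetical stand-ins; the lists loaded above are far longer.

```python
# Hypothetical stopword list, standing in for the much longer real ones.
base_stopwords = frozenset(["the", "of", "a", "was", "in"])

# Extra words can be merged in with union(), as in the cell above.
extra_words = ["queens"]
all_stopwords = base_stopwords.union(extra_words)

sentence = "The queen of musicals was born in Queens"
tokens = sentence.lower().split()
kept = [token for token in tokens if token not in all_stopwords]

print(kept)
```

Only the words that survive the filter become candidate features for the classifier.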

In [None]:
# Set classes for logistic regression classification.
# Requires an 'imprint' column in meta. .ix was removed from pandas, so use .loc.
meta = meta.reset_index()
meta['CLASS'] = ''
meta.loc[meta['imprint'].str.contains('NAL', na=False), 'CLASS'] = 0
meta.loc[meta['imprint'].str.contains('RH', na=False), 'CLASS'] = 1

In [None]:
df_final = pd.concat([meta, df], axis = 1)

In [None]:
# Leave-one-out pipeline: hold out each text in turn, train on all the
# others, then predict the class of the held-out text.

# .copy() lets us add columns without a SettingWithCopyWarning.
meta_df = df_final.iloc[:,:7].copy()
feat_df = df_final.iloc[:,7:]

meta_df['PROBS'] = ''
meta_df['PREDICTED'] = np.nan

# The L1 penalty needs a solver that supports it, such as liblinear.
model = LogisticRegression(penalty = 'l1', C = 1.0, solver = 'liblinear')

for this_index in df_final.index.tolist():
    title = meta_df.loc[meta_df.index[this_index], 'title']
    CLASS = meta_df.loc[meta_df.index[this_index], 'CLASS']
    print(title)
    train_index_list = [index_ for index_ in feat_df.index.tolist() if index_ != this_index]

    X = feat_df.loc[train_index_list]
    # Cast labels to int (assumes every row was assigned class 0 or 1 above).
    y = meta_df.loc[train_index_list, 'CLASS'].astype(int)
    TEST_CASE = feat_df.loc[[this_index]]

    model.fit(X,y)
    prediction = model.predict_proba(TEST_CASE)
    # predict() returns a one-element array; keep the scalar label.
    predicted = model.predict(TEST_CASE)[0]

    meta_df.loc[meta_df.index[this_index], 'PREDICTED'] = predicted
    meta_df.loc[meta_df.index[this_index], 'PROBS'] = str(prediction)
    print('Class is: ' + str(CLASS) + ' ' + 'prediction is: ' + str(predicted) + ' ' + str(prediction))
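
The loop above holds out one text at a time, trains on the rest, and predicts the held-out text. The sketch below shows that same leave-one-out pattern in plain Python on invented data. To stay self-contained it swaps logistic regression for a much simpler nearest-centroid classifier, so it illustrates the evaluation scheme rather than the model used in this notebook.

```python
# Hypothetical feature vectors (e.g. counts of two words) and class labels.
features = [[5, 0], [4, 1], [6, 1], [0, 5], [1, 4], [1, 6]]
labels = [0, 0, 0, 1, 1, 1]

def centroid(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(train_X, train_y, test_row):
    """Assign the class whose training centroid is closest to test_row."""
    centroids = {
        label: centroid([x for x, y in zip(train_X, train_y) if y == label])
        for label in set(train_y)
    }
    return min(centroids, key=lambda label: squared_distance(centroids[label], test_row))

predictions = []
for i in range(len(features)):
    # Train on every row except row i, then test on row i alone.
    train_X = features[:i] + features[i + 1:]
    train_y = labels[:i] + labels[i + 1:]
    predictions.append(predict(train_X, train_y, features[i]))

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(predictions, accuracy)
```

Because every text is predicted by a model that never saw it, the resulting accuracy is a fairer estimate than testing on the training data, at the cost of refitting the model once per text.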

In [None]:
import csv
meta2.to_csv('/media/secure_volume/CLASSIFIER_OUTPUT_MASS_TRADE_1980_2007_5.csv', sep='\t')


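canonic_c = 1.0  # regularization strength C used in feat_pval_weight below


def Ztest(vec1, vec2):
    """Two-sample z-test: return the z statistic and the two-sided p-value."""
    X1, X2 = np.mean(vec1), np.mean(vec2)
    sd1, sd2 = np.std(vec1), np.std(vec2)
    n1, n2 = len(vec1), len(vec2)

    # Standard error of the difference in means (the variances are not pooled).
    se = np.sqrt(sd1**2/n1 + sd2**2/n2)
    z = (X1 - X2)/se
    pval = 2*(norm.sf(abs(z)))

    return z, pval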

def feat_pval_weight(meta_df_, dtm_df_):
    """Return each feature's z-test p-value and logistic regression weight."""
    # normalize_model is not defined in this notebook and must be supplied separately.
    dtm_df_ = normalize_model(dtm_df_, dtm_df_)[0]

    # Split the rows of the matrix by class.
    dtm0 = dtm_df_.loc[meta_df_[meta_df_['CLASS']==0].index.tolist()].to_numpy()
    dtm1 = dtm_df_.loc[meta_df_[meta_df_['CLASS']==1].index.tolist()].to_numpy()

    print(dtm0.shape)
    print(dtm1.shape)
    # One z-test per feature (column), comparing class 0 with class 1.
    pvals = [Ztest(dtm0[ : ,i], dtm1[ : ,i])[1] for i in range(dtm_df_.shape[1])]
    # liblinear supports the L1 penalty. The target is True for class 0,
    # so positive weights mark features associated with class 0.
    clf = LogisticRegression(penalty = 'l1', C = canonic_c, class_weight = 'balanced', solver = 'liblinear')
    clf.fit(dtm_df_, meta_df_['CLASS']==0)
    weights = clf.coef_[0]

    feature_df = pd.DataFrame()

    feature_df['FEAT'] = dtm_df_.columns
    feature_df['P_VALUE'] = pvals
    feature_df['LR_WEIGHT'] = weights

    return feature_df

#meta_df2 = df_final2.iloc[:,:7]
#feat_df2 = df_final2.iloc[:,7:]
# Bonferroni correction: divide the 0.05 level by the number of features tested.
sig_thresh = 0.05 / len(df2.columns)

feat_df = feat_pval_weight(meta2, df2)
print(feat_df.shape)

feat_df.to_csv('/media/secure_volume/FEATURES_MASS_TRADE_1980_2007_5.csv')
# Keep the significant features, sorted from lowest to highest weight.
out = feat_df[(feat_df['P_VALUE'] <= sig_thresh)].sort_values('LR_WEIGHT', ascending = True)
out1 = out['FEAT'].tolist()
for o in out1[0:30]:
    print(o)
# Write the full list, one feature per line; the with-block closes the file.
with open('/media/secure_volume/FEATURE_LIST_MASS_TRADE_1980_2007_5.txt', 'w') as featuresFile:
    featuresFile.write('\n'.join(out1))
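
For readers who want to see the statistics without NumPy or SciPy, the following sketch repeats the two-sample z-test and the Bonferroni threshold using only the standard library. The word frequencies are invented, and the feature count of 1,391 is simply borrowed from the matrix shape printed earlier as an example value.

```python
import math
from statistics import fmean, pstdev

def z_test(sample1, sample2):
    """Two-sample z-test on the difference of means (unpooled standard error)."""
    se = math.sqrt(pstdev(sample1) ** 2 / len(sample1)
                   + pstdev(sample2) ** 2 / len(sample2))
    z = (fmean(sample1) - fmean(sample2)) / se
    # Two-sided p-value for a standard normal: 2 * sf(|z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical relative frequencies of one word in each class.
class0 = [0.010, 0.012, 0.011, 0.013, 0.009]
class1 = [0.002, 0.003, 0.001, 0.004, 0.002]

z, p = z_test(class0, class1)

# Bonferroni correction: divide the 0.05 level by the number of features tested.
n_features = 1391  # example value only
sig_thresh = 0.05 / n_features

print(round(z, 2), p, p <= sig_thresh)
```

Testing every word separately makes false positives likely by chance alone, which is why the threshold is divided by the number of features before a word is treated as distinctive.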


In [141]:
# Build the obituary dataframe: one row per file.
# [0-9] (not [1-9]) so that years and days containing a zero also match.
year_pattern = re.compile(r"[0-9]{4}")

rows = []
for title, file in zip(obit_titles, files):
    # Named obit rather than text, to avoid shadowing sklearn's text module.
    obit = Path(file).read_text(encoding='utf-8')
    m = year_pattern.search(obit)
    date = m.group(0) if m is not None else "n/a"
    # Strip the dateline (e.g. "June 20, 1993"), runs of eight capitals, and newlines.
    obit = re.sub(r"[A-Z][a-z]*[ ][0-9]*[,][ ][0-9]*[\n][\n]", "", obit)
    obit = re.sub(r"[A-Z]{8}", "", obit)
    obit = re.sub(r"[\n]", "", obit)
    rows.append({'title': title, 'gender': '', 'obit': obit, 'date': date})

df = pd.DataFrame(rows, columns=['title', 'gender', 'obit', 'date'])
df.to_csv("../docs/NYT-Obituaries.csv", encoding='utf-8', index=False)
df
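
When a regular expression is used to pull a year out of a text, the character class matters: `[1-9]` excludes the digit 0, while `[0-9]` (or `\d`) includes it. A quick check on a hypothetical dateline shows the difference.

```python
import re

# Hypothetical opening lines of an obituary file.
header = "June 20, 1909\n\nExample headline"

# "[1-9]{4}" requires four consecutive nonzero digits, so "1909" is skipped.
strict = re.search(r"[1-9]{4}", header)
print(strict)

# "[0-9]{4}" accepts zeros as well and finds the year.
loose = re.search(r"[0-9]{4}", header)
print(loose.group(0))
```

The same applies to the day of the month: a dateline such as "June 20" contains a zero, so a pattern built on `[1-9]` will not recognize it.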