# Classification

Classification, a popular method in machine learning, determines whether and how well a model can distinguish between two sets of texts.

It works like this. Everyone with email relies on classification to separate spam from legitimate emails. Email providers train their computational models to recognize the difference by giving them emails they have labeled “spam” and “not spam.” They then ask the model to learn the features that most reliably distinguish the two types, which could include a preponderance of all caps or phrases like “free money” or “get paid.” They test the model by giving it unlabeled emails and asking it to classify them. If the model can do it accurately a high percentage of the time, that’s a good spam filter.
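
The spam-filter workflow described above can be sketched in a few lines with scikit-learn. This is a minimal illustration, not a production filter: the six training emails and their labels are invented for the example, and pairing `CountVectorizer` with `LogisticRegression` is an assumption that mirrors the tools used later in this lesson.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled emails (1 = spam, 0 = not spam), invented for illustration
train_emails = [
    "FREE MONEY click now",
    "get paid free money today",
    "win free money now",
    "meeting agenda attached for tomorrow",
    "lunch tomorrow with the team",
    "project report attached",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Turn each email into word counts (text is lowercased by default)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_emails)

# Learn which words distinguish the two classes
model = LogisticRegression()
model.fit(X_train, train_labels)

# Classify unlabeled emails the model has not seen
new_emails = ["free money", "team meeting tomorrow"]
X_new = vectorizer.transform(new_emails)
predictions = model.predict(X_new)
print(list(zip(new_emails, predictions)))
```

The same three steps (vectorize, fit, predict) reappear below, with obituaries in place of emails and gender in place of the spam label.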

We can take the underlying idea and apply it to many experiments.

## Imports

As always, we begin with some imports.

In [164]:
import pandas as pd
import glob
from pathlib import Path
import re
from pandas import Series, DataFrame
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


## Corpus 

In [165]:
directory = "../docs/NYT-Obituaries/"
files = glob.glob(f"{directory}/*.txt")
obit_titles = [Path(file).stem for file in files]
obit_titles

['1945-Adolf-Hitler',
 '1915-F-W-Taylor',
 '1975-Chiang-Kai-shek',
 '1984-Ethel-Merman',
 '1953-Jim-Thorpe',
 '1964-Nella-Larsen',
 '1955-Margaret-Abbott',
 '1984-Lillian-Hellman',
 '1959-Cecil-De-Mille',
 '1928-Mabel-Craty',
 '1973-Eddie-Rickenbacker',
 '1989-Ferdinand-Marcos',
 '1991-Martha-Graham',
 '1997-Deng-Xiaoping',
 '1938-George-E-Hale',
 '1885-Ulysses-Grant',
 '1909-Sarah-Orne-Jewett',
 '1957-Christian-Dior',
 '1987-Clare-Boothe-Luce',
 '1976-Jacques-Monod',
 '1954-Getulio-Vargas',
 '1979-Stan-Kenton',
 '1990-Leonard-Bernstein',
 '1972-Jackie-Robinson',
 '1998-Fred-W-Friendly',
 '1991-Leo-Durocher',
 '1915-B-T-Washington',
 '1997-James-Stewart',
 '1981-Joe-Louis',
 '1983-Muddy-Waters',
 '1942-George-M-Cohan',
 '1989-Samuel-Beckett',
 '1962-Marilyn-Monroe',
 '2000-Charles-M-Schulz',
 '1967-Gregory-Pincus',
 '1894-R-L-Stevenson',
 '1978-Bruce-Catton',
 '1982-Arthur-Rubinstein',
 '1875-Andrew-Johnson',
 '1974-Charles-Lindbergh',
 '1964-Rachel-Carson',
 '1953-Marjorie-Rawlings',


In [166]:
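# Create a document-term matrix (DTM) from the obituary files.
# Note: new_stopwords is defined in the "load stopwords" cell (In [168]); run that cell first.
corpus_path = '../docs/NYT-Obituaries/'
# min_df=40 keeps only words that appear in at least 40 obituaries
vectorizer = CountVectorizer(input='filename', encoding='utf8', stop_words=list(new_stopwords), min_df=40, dtype='float64')
# Build the list of file paths in the same order as obit_titles
corpus = [corpus_path + title + ".txt" for title in obit_titles]
dtm = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = vectorizer.get_feature_names_out()
matrix = dtm.toarray()
df = DataFrame(matrix, columns=vocab)
print('df shape is: ' + str(df.shape))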

df shape is: (378, 1391)


In [167]:
meta = pd.read_csv("../docs/NYT-Obituaries.csv", encoding = 'utf-8')
meta = meta[["title", "gender", "date"]]
meta

Unnamed: 0,title,gender,date
0,1945-Adolf-Hitler,0,1945.0
1,1915-F-W-Taylor,0,1915.0
2,1975-Chiang-Kai-shek,0,1975.0
3,1984-Ethel-Merman,1,1984.0
4,1953-Jim-Thorpe,0,1953.0
...,...,...,...
373,1987-Andres-Segovie,0,1987.0
374,1987-Rita-Hayworth,1,1987.0
375,1993-William-Golding,0,1993.0
376,1932-Florenz-Ziegfeld,1,1932.0


In [168]:
# load stopwords
from sklearn.feature_extraction import text
# Read the Jockers stopword list and merge it with scikit-learn's built-in English list
with open('../docs/jockers_stopwords.txt') as text_file:
    jockers_words = text_file.read().split()
new_stopwords = text.ENGLISH_STOP_WORDS.union(jockers_words)

In [169]:
df_final = pd.concat([meta, df], axis = 1)

In [170]:
df_final.head()

Unnamed: 0,title,gender,date,000,10,100,11,12,13,14,...,written,wrong,wrote,yale,year,years,yesterday,york,younger,youth
0,1945-Adolf-Hitler,0,1945.0,21.0,1.0,0.0,2.0,3.0,4.0,3.0,...,0.0,0.0,3.0,0.0,11.0,19.0,0.0,1.0,1.0,9.0
1,1915-F-W-Taylor,0,1915.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
2,1975-Chiang-Kai-shek,0,1975.0,3.0,3.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,6.0,0.0,3.0,14.0,0.0,1.0,2.0,1.0
3,1984-Ethel-Merman,1,1984.0,0.0,1.0,0.0,1.0,0.0,1.0,2.0,...,0.0,0.0,0.0,0.0,3.0,5.0,2.0,5.0,0.0,0.0
4,1953-Jim-Thorpe,0,1953.0,2.0,3.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,6.0,1.0,0.0,4.0,0.0,0.0


In [181]:
# PIPELINE FOR ONE VS ALL (LEAVE-ONE-OUT) CLASSIFICATION
# For each obituary: train on all the others, then predict the held-out one.

meta['PROBS'] = ''
meta['PREDICTED'] = ''

# The L1 penalty needs a solver that supports it; liblinear does
# (recent scikit-learn versions default to lbfgs, which does not).
model = LogisticRegression(penalty = 'l1', C = 1.0, solver = 'liblinear')

for this_index in df_final.index.tolist():
    print(this_index)
    title = meta.loc[meta.index[this_index], 'title'] 
    CLASS = meta.loc[meta.index[this_index], 'gender']
    print(title, CLASS)
    # Training set: every row except the held-out one
    train_index_list = [index_ for index_ in df.index.tolist() if index_ != this_index]

    X = df.loc[train_index_list]
    y = meta.loc[train_index_list, 'gender']
    TEST_CASE = df.loc[[this_index]]

    model.fit(X,y)
    prediction = model.predict_proba(TEST_CASE)  # [[P(class 0), P(class 1)]]
    predicted = model.predict(TEST_CASE)         # one-element array holding the predicted class
    meta.at[this_index, 'PREDICTED'] = predicted[0]
    meta.at[this_index, 'PROBS'] = str(prediction)
    print('Class is: ' + str(CLASS) + '\n' + 'Prediction is: ' + str(predicted) + ' ' + str(prediction) + '\n')

0
1945-Adolf-Hitler 0
Class is: 0
Prediction is: [0] [[9.99999976e-01 2.37062221e-08]]

1
1915-F-W-Taylor 0
Class is: 0
Prediction is: [0] [[0.93486545 0.06513455]]

2
1975-Chiang-Kai-shek 0
Class is: 0
Prediction is: [0] [[9.99999994e-01 6.08918640e-09]]

3
1984-Ethel-Merman 1
Class is: 1
Prediction is: [0] [[0.62145621 0.37854379]]

4
1953-Jim-Thorpe 0
Class is: 0
Prediction is: [0] [[0.99547045 0.00452955]]

5
1964-Nella-Larsen 1
Class is: 1
Prediction is: [1] [[0.00499555 0.99500445]]

6
1955-Margaret-Abbott 1
Class is: 1
Prediction is: [1] [[1.64310898e-09 9.99999998e-01]]

7
1984-Lillian-Hellman 1
Class is: 1
Prediction is: [1] [[0.05506694 0.94493306]]

8
1959-Cecil-De-Mille 0
Class is: 0
Prediction is: [0] [[0.99613283 0.00386717]]

9
1928-Mabel-Craty 1
Class is: 1
Prediction is: [1] [[0.21935596 0.78064404]]

10
1973-Eddie-Rickenbacker 0
Class is: 0
Prediction is: [0] [[9.99657659e-01 3.42340927e-04]]

11
1989-Ferdinand-Marcos 0
Class is: 0
Prediction is: [1] [[0.15051496 0.84



Class is: 1
Prediction is: [0] [[9.99595984e-01 4.04016349e-04]]

13
1997-Deng-Xiaoping 0
Class is: 0
Prediction is: [0] [[9.99999011e-01 9.88617256e-07]]

14
1938-George-E-Hale 0
Class is: 0
Prediction is: [0] [[0.99670907 0.00329093]]

15
1885-Ulysses-Grant 0
Class is: 0
Prediction is: [0] [[1.00000000e+00 1.78966579e-37]]

16
1909-Sarah-Orne-Jewett 1
Class is: 1
Prediction is: [0] [[0.63190351 0.36809649]]

17
1957-Christian-Dior 0
Class is: 0
Prediction is: [0] [[0.74512041 0.25487959]]

18
1987-Clare-Boothe-Luce 1
Class is: 1
Prediction is: [1] [[0.18760727 0.81239273]]

19
1976-Jacques-Monod 0
Class is: 0
Prediction is: [0] [[0.99861025 0.00138975]]

20
1954-Getulio-Vargas 0
Class is: 0
Prediction is: [0] [[0.99784181 0.00215819]]

21
1979-Stan-Kenton 0
Class is: 0
Prediction is: [0] [[0.99404698 0.00595302]]

22
1990-Leonard-Bernstein 0
Class is: 0
Prediction is: [0] [[9.99999832e-01 1.67897958e-07]]

23
1972-Jackie-Robinson 0
Class is: 0
Prediction is: [0] [[9.99977761e-01 2.22



Class is: 0
Prediction is: [0] [[0.99773478 0.00226522]]

27
1997-James-Stewart 0
Class is: 0
Prediction is: [0] [[9.99989373e-01 1.06267990e-05]]

28
1981-Joe-Louis 0
Class is: 0
Prediction is: [0] [[0.99798289 0.00201711]]

29
1983-Muddy-Waters 0
Class is: 0
Prediction is: [0] [[0.99818708 0.00181292]]

30
1942-George-M-Cohan 0
Class is: 0
Prediction is: [0] [[9.99996421e-01 3.57935784e-06]]

31
1989-Samuel-Beckett 0
Class is: 0
Prediction is: [0] [[0.68329962 0.31670038]]

32
1962-Marilyn-Monroe 1
Class is: 1
Prediction is: [1] [[4.22158250e-04 9.99577842e-01]]

33
2000-Charles-M-Schulz 0
Class is: 0
Prediction is: [0] [[0.99878157 0.00121843]]

34
1967-Gregory-Pincus 0
Class is: 0
Prediction is: [0] [[0.62424122 0.37575878]]

35
1894-R-L-Stevenson 0
Class is: 0
Prediction is: [0] [[0.85608416 0.14391584]]

36
1978-Bruce-Catton 0
Class is: 0
Prediction is: [0] [[0.87365071 0.12634929]]

37
1982-Arthur-Rubinstein 0
Class is: 0
Prediction is: [0] [[0.97461365 0.02538635]]

38
1875-And



Class is: 0
Prediction is: [0] [[9.99948073e-01 5.19271581e-05]]

40
1964-Rachel-Carson 1
Class is: 1
Prediction is: [0] [[0.8709938 0.1290062]]

41
1953-Marjorie-Rawlings 1
Class is: 1
Prediction is: [0] [[0.76319702 0.23680298]]

42
1973-Otto-Klemperer 0
Class is: 0
Prediction is: [0] [[0.99727648 0.00272352]]

43
1963-Sylvia-Plath 1
Class is: 1
Prediction is: [1] [[0.00266516 0.99733484]]

44
1956-Charles-Merrill 0
Class is: 0
Prediction is: [0] [[0.84099456 0.15900544]]

45
1966-Lenny-Bruce 0
Class is: 0
Prediction is: [0] [[0.58633248 0.41366752]]

46
1908-Cleveland 0
Class is: 0
Prediction is: [0] [[9.99999811e-01 1.88778122e-07]]

47
1982-Anna-Freud 1
Class is: 1
Prediction is: [0] [[0.90659466 0.09340534]]

48
1941-Frank-Conrad 0
Class is: 0
Prediction is: [0] [[0.98508994 0.01491006]]

49
1966-Alfred-P-Sloan-Jr 0
Class is: 0
Prediction is: [0] [[9.99987585e-01 1.24149408e-05]]

50
1960-Beno-Gutenberg 0
Class is: 0
Prediction is: [0] [[0.85600578 0.14399422]]

51
1976-J-Paul-Ge



Class is: 1
Prediction is: [0] [[1.00000000e+00 1.95493299e-13]]

54
1970-Erich-Maria-Remarque 0
Class is: 0
Prediction is: [0] [[9.99988279e-01 1.17210564e-05]]

55
1989-August-A-Busch-Jr 0
Class is: 0
Prediction is: [0] [[0.70289495 0.29710505]]

56
1992-John-Cage 0
Class is: 0
Prediction is: [0] [[9.99974436e-01 2.55641745e-05]]

57
1994-Erik-Erikson 0
Class is: 0
Prediction is: [1] [[0.08344786 0.91655214]]

58
1990-Erte 0
Class is: 0
Prediction is: [0] [[0.93563017 0.06436983]]

59
1966-Chester-Nimitz 0
Class is: 0
Prediction is: [0] [[0.99845862 0.00154138]]

60
1954-Enrico-Fermi 0
Class is: 0
Prediction is: [0] [[9.99990566e-01 9.43442392e-06]]

61
1961-Primitive-Artist 1
Class is: 1
Prediction is: [1] [[0.34778316 0.65221684]]

62
1945-Bela-Bartok 0
Class is: 0
Prediction is: [0] [[9.99638649e-01 3.61350503e-04]]

63
1978-Pope-Paul-VI 0
Class is: 0
Prediction is: [1] [[0.01800115 0.98199885]]

64
1965-Martin-Buber 0
Class is: 0
Prediction is: [0] [[9.99981737e-01 1.82630784e-05



Class is: 0
Prediction is: [0] [[9.99811757e-01 1.88243190e-04]]

67
1998-Bob-Kane 0
Class is: 0
Prediction is: [0] [[0.9377862 0.0622138]]

68
1969-Judy-Garland 1
Class is: 1
Prediction is: [1] [[0.08703755 0.91296245]]

69
1959-Frank-Lloyd-Wright 0
Class is: 0
Prediction is: [0] [[0.98934243 0.01065757]]

70
1995-Ginger-Rogers 1
Class is: 1
Prediction is: [1] [[0.00265528 0.99734472]]

71
1920-Marlene-Dietrich 1
Class is: 1
Prediction is: [1] [[6.37983406e-06 9.99993620e-01]]

72
1998-Alan-B-Shepard-Jr 0
Class is: 0
Prediction is: [0] [[9.99908556e-01 9.14436368e-05]]

73
1971-Khrushchev 0
Class is: 0
Prediction is: [0] [[9.99974924e-01 2.50762295e-05]]

74
1935-Justice-Holmes 0
Class is: 0
Prediction is: [0] [[0.99729787 0.00270213]]

75
1969-David-Eisenhower 0
Class is: 0
Prediction is: [0] [[9.99999704e-01 2.95987261e-07]]

76
1992-Marsha-P-Johnson 1
Class is: 1
Prediction is: [0] [[0.89984079 0.10015921]]

77
1914-Alfred-Thayer-Mahan 0
Class is: 0
Prediction is: [0] [[0.99836874 



Class is: 0
Prediction is: [0] [[0.99595839 0.00404161]]

81
1989-William-B-Shockley 0
Class is: 0
Prediction is: [0] [[9.99421059e-01 5.78940796e-04]]

82
1910-William-James 0
Class is: 0
Prediction is: [0] [[9.99546195e-01 4.53805491e-04]]

83
1951-Fanny-Brice 1
Class is: 1
Prediction is: [1] [[2.31520122e-04 9.99768480e-01]]

84
1916-Jack-London 0
Class is: 0
Prediction is: [1] [[0.39173252 0.60826748]]

85
1947-Al-Capone 0
Class is: 0
Prediction is: [0] [[0.99615098 0.00384902]]

86
1989-Lucille-Ball 1
Class is: 1
Prediction is: [1] [[2.19577548e-05 9.99978042e-01]]

87
1980-Jean-Paul-Sartre 0
Class is: 0
Prediction is: [0] [[0.93925256 0.06074744]]

88
1969-Sonja-Henie 1
Class is: 1
Prediction is: [1] [[0.25384065 0.74615935]]

89
1976-Adolph-Zukor 0
Class is: 0
Prediction is: [0] [[0.99436609 0.00563391]]

90
1959-John-Dulles 0
Class is: 0
Prediction is: [0] [[9.99999999e-01 5.59418918e-10]]

91
1980-Alfred-Hitchcock 0
Class is: 0
Prediction is: [0] [[9.99897269e-01 1.02730899e-0



Class is: 0
Prediction is: [0] [[9.99933298e-01 6.67015658e-05]]

96
1941-Virginia-Woolf 1
Class is: 1
Prediction is: [1] [[0.02065321 0.97934679]]

97
1922-Alexander-Graham-Bell 0
Class is: 0
Prediction is: [0] [[0.81801787 0.18198213]]

98
1907-Qiu-Jin 1
Class is: 1
Prediction is: [1] [[4.73800332e-10 1.00000000e+00]]

99
1895-Fred-Douglass 0
Class is: 0
Prediction is: [1] [[0.00365029 0.99634971]]

100
1971-Florence-Blanchfield 1
Class is: 1
Prediction is: [1] [[0.08880136 0.91119864]]

101
1984-Ray-A-Kroc 0
Class is: 0
Prediction is: [0] [[9.99772814e-01 2.27186372e-04]]

102
1947-Fiorello-La-Guardia 0
Class is: 0
Prediction is: [0] [[1.00000000e+00 6.13153678e-11]]

103
1988-Louis-L-Amour 0
Class is: 0
Prediction is: [0] [[9.99504489e-01 4.95510921e-04]]

104
1902-Elizabeth-Cady-Stanton 1
Class is: 1
Prediction is: [1] [[2.93265412e-11 1.00000000e+00]]

105
1965-Shirley-Jackson 1
Class is: 1
Prediction is: [1] [[0.00809148 0.99190852]]

106
1979-John-Wayne 0
Class is: 0
Prediction



Class is: 1
Prediction is: [1] [[0.2998338 0.7001662]]

110
1954-Liberty-H-Bailey 0
Class is: 0
Prediction is: [0] [[0.96058245 0.03941755]]

111
1977-Dash-Ended 0
Class is: 0
Prediction is: [0] [[9.99990420e-01 9.58024158e-06]]

112
1936-Maxim-Gorky 0
Class is: 0
Prediction is: [0] [[0.99696159 0.00303841]]

113
2000-Pierre-Trudeau 0
Class is: 0
Prediction is: [0] [[0.96495129 0.03504871]]

114
1976-Mao-Tse-Tung 0
Class is: 0
Prediction is: [0] [[0.79051173 0.20948827]]

115
1955-Walter-White 0
Class is: 0
Prediction is: [0] [[9.99973835e-01 2.61653336e-05]]

116
1930-Conan-Doyle 0
Class is: 0
Prediction is: [0] [[9.99867863e-01 1.32136867e-04]]

117
1995-Jonas-Salk 0
Class is: 0
Prediction is: [0] [[9.99998618e-01 1.38182284e-06]]

118
1949-Mitchell 1
Class is: 1
Prediction is: [1] [[0.16308502 0.83691498]]

119
1951-Henrietta-Lacks 1
Class is: 1
Prediction is: [1] [[6.05240821e-04 9.99394759e-01]]

120
1903-James-M-N-Whistler 0
Class is: 0
Prediction is: [0] [[0.98536158 0.01463842]



Class is: 1
Prediction is: [1] [[0. 1.]]

124
1941-James-Joyce 0
Class is: 0
Prediction is: [0] [[9.99750890e-01 2.49109933e-04]]

125
1952-John-Dewey 0
Class is: 0
Prediction is: [0] [[9.99999998e-01 2.18636631e-09]]

126
1940-Marcus-Garvey 0
Class is: 0
Prediction is: [0] [[0.79693842 0.20306158]]

127
1971-Louis-Armstrong 0
Class is: 0
Prediction is: [0] [[9.99817662e-01 1.82337782e-04]]

128
1923-Warren-Harding 0
Class is: 0
Prediction is: [0] [[1.00000000e+00 1.22560776e-11]]

129
1937-John-Rockefeller 0
Class is: 0
Prediction is: [0] [[9.99886500e-01 1.13499837e-04]]

130
2000-Elliot-Richardson 0
Class is: 0
Prediction is: [0] [[9.99698378e-01 3.01621775e-04]]

131
1972-J-Edgar-Hoover 0
Class is: 0
Prediction is: [0] [[9.99999998e-01 2.05758890e-09]]

132
1975-Franco 0
Class is: 0
Prediction is: [0] [[9.99992825e-01 7.17521537e-06]]

133
1973-Jeanette-Rankin 1
Class is: 1
Prediction is: [1] [[9.30944000e-08 9.99999907e-01]]

134
1888-Louisa-M-Alcott 1
Class is: 1
Prediction is: [



Class is: 0
Prediction is: [0] [[9.99989658e-01 1.03419525e-05]]

137
1946-Lord-Keynes 0
Class is: 0
Prediction is: [0] [[0.90787626 0.09212374]]

138
1996-Gene-Kelly 0
Class is: 0
Prediction is: [0] [[0.99656124 0.00343876]]

139
1973-Roberto-Clemente 0
Class is: 0
Prediction is: [0] [[0.87322338 0.12677662]]

140
1986-Georgia-O-Keeffe 1
Class is: 1
Prediction is: [1] [[8.32704727e-08 9.99999917e-01]]

141
1990-Rex-Harrison 0
Class is: 0
Prediction is: [0] [[0.87427784 0.12572216]]

142
1916-Martian-Theory 0
Class is: 0
Prediction is: [0] [[0.99445904 0.00554096]]

143
1961-Ernest-Hemingway 0
Class is: 0
Prediction is: [0] [[0.99845777 0.00154223]]

144
1978-Margaret-Mead 1
Class is: 1
Prediction is: [0] [[9.99463924e-01 5.36075597e-04]]

145
1982-Ingrid-Bergman 1
Class is: 1
Prediction is: [1] [[8.44714882e-06 9.99991553e-01]]

146
1946-C-E-M-Clung 0
Class is: 0
Prediction is: [0] [[0.99407062 0.00592938]]

147
1977-Maria-Callas 1
Class is: 1
Prediction is: [1] [[0.0567967 0.9432033]



Class is: 1
Prediction is: [1] [[1.87432680e-04 9.99812567e-01]]

152
1989-Andrei-Sakharov 0
Class is: 0
Prediction is: [0] [[1.00000000e+00 9.72616875e-12]]

153
1939-W-B-Yeats 0
Class is: 0
Prediction is: [0] [[0.87453322 0.12546678]]

154
1961-Hammarskjold 0
Class is: 0
Prediction is: [0] [[9.99953569e-01 4.64305786e-05]]

155
1969-Madhubala 1
Class is: 1
Prediction is: [1] [[0.0010527 0.9989473]]

156
1970-Edouard-Daladier 0
Class is: 0
Prediction is: [0] [[9.99978526e-01 2.14741353e-05]]

157
1988-John-Houseman 0
Class is: 0
Prediction is: [0] [[0.9275304 0.0724696]]

158
1992-William-Shawn 0
Class is: 0
Prediction is: [0] [[0.98984597 0.01015403]]

159
1950-Edna-St-V-Millay 1
Class is: 1
Prediction is: [1] [[0.49262276 0.50737724]]

160
1989-Claude-Pepper 0
Class is: 0
Prediction is: [0] [[0.998681 0.001319]]

161
1929-Marie-Curie 1
Class is: 1
Prediction is: [1] [[1.79118241e-07 9.99999821e-01]]

162
1972-The-Duke-of-Windsor 0
Class is: 0
Prediction is: [1] [[0.02405376 0.975946



Class is: 1
Prediction is: [0] [[0.78009793 0.21990207]]

166
1900-Nietzsche 0
Class is: 0
Prediction is: [0] [[0.71143663 0.28856337]]

167
1880-Lucretia-Mott 1
Class is: 1
Prediction is: [1] [[0.00138211 0.99861789]]

168
1967-J-Robert-Oppenheimer 0
Class is: 0
Prediction is: [0] [[1.00000000e+00 1.05295046e-14]]

169
1984-Richard-Burton 0
Class is: 0
Prediction is: [0] [[0.99153543 0.00846457]]

170
1964-Sean-O-Casey 0
Class is: 0
Prediction is: [0] [[0.98136257 0.01863743]]

171
1931-Thomas-Edison 0
Class is: 0
Prediction is: [0] [[0.99354276 0.00645724]]

172
1971-Bobby-Jones 0
Class is: 0
Prediction is: [0] [[9.99793779e-01 2.06220945e-04]]

173
1914-John-P-Holland 0
Class is: 0
Prediction is: [0] [[0.8473702 0.1526298]]

174
1995-George-Abbott 0
Class is: 0
Prediction is: [1] [[0.36829537 0.63170463]]

175
1950-Henry-L-Stimson 0
Class is: 0
Prediction is: [0] [[1.0000000e+00 4.0052554e-11]]

176
1955-Dale-Carnegie 0
Class is: 0
Prediction is: [1] [[0.07150112 0.92849888]]

177
1



Class is: 0
Prediction is: [0] [[0.7670591 0.2329409]]

180
1909-Geronimo 0
Class is: 0
Prediction is: [0] [[0.99263458 0.00736542]]

181
1952-Charles-Spaulding 0
Class is: 0
Prediction is: [0] [[0.91158373 0.08841627]]

182
1922-Nellie-Bly 1
Class is: 1
Prediction is: [0] [[0.64713174 0.35286826]]

183
1955-Cy-Young 0
Class is: 0
Prediction is: [0] [[0.95644082 0.04355918]]

184
1993-Arthur-Ashe 0
Class is: 0
Prediction is: [0] [[0.92014159 0.07985841]]

185
1986-Bernard-Malamud 0
Class is: 0
Prediction is: [0] [[0.97425411 0.02574589]]

186
1939-Howard-Carter 0
Class is: 0
Prediction is: [0] [[0.93225564 0.06774436]]

187
1994-Linus-C-Pauling 0
Class is: 0
Prediction is: [0] [[9.99999993e-01 7.02555108e-09]]

188
1967-Langston-Hughes 0
Class is: 0
Prediction is: [0] [[0.88541303 0.11458697]]

189
1948-Mohandas-K-Gandhi 0
Class is: 0
Prediction is: [0] [[9.99529075e-01 4.70924973e-04]]

190
1931-Melvil-Dewey 0
Class is: 0
Prediction is: [0] [[9.99062219e-01 9.37780714e-04]]

191
1969-



Class is: 0
Prediction is: [0] [[0.9939347 0.0060653]]

195
1981-Anwar-el-Sadat 0
Class is: 0
Prediction is: [0] [[9.99999991e-01 9.41120745e-09]]

196
1937-Maurice-Ravel 0
Class is: 0
Prediction is: [0] [[0.98653366 0.01346634]]

197
1966-Margaret-Sanger 1
Class is: 1
Prediction is: [1] [[3.67481570e-07 9.99999633e-01]]

198
1989-I-F-Stone 0
Class is: 0
Prediction is: [0] [[0.99897474 0.00102526]]

199
1943-J-H-Kellogg 0
Class is: 0
Prediction is: [0] [[0.86811697 0.13188303]]

200
1985-Roger-Maris 0
Class is: 0
Prediction is: [0] [[0.99515504 0.00484496]]

201
1972-Mahalia-Jackson 1
Class is: 1
Prediction is: [1] [[1.24686517e-04 9.99875313e-01]]

202
1990-Greta-Garbo 1
Class is: 1
Prediction is: [1] [[2.23652874e-10 1.00000000e+00]]

203
1998-Benjamin-Spock 0
Class is: 0
Prediction is: [0] [[9.99632413e-01 3.67587241e-04]]

204
1994-Jessica-Tandy 1
Class is: 1
Prediction is: [1] [[0.04161796 0.95838204]]

205
1979-Richard-Rodgers 0
Class is: 0
Prediction is: [0] [[0.83796115 0.16203



Class is: 0
Prediction is: [0] [[0.92523886 0.07476114]]

208
1989-Hirohito 0
Class is: 0
Prediction is: [1] [[0.49434997 0.50565003]]

209
1933-Calvin-Coolidge 0
Class is: 0
Prediction is: [0] [[1.00000000e+00 7.17091655e-11]]

210
1969-Ho-Chi-Minh 0
Class is: 0
Prediction is: [0] [[9.99994822e-01 5.17842546e-06]]

211
1987-John-Huston 0
Class is: 0
Prediction is: [0] [[0.94679041 0.05320959]]

212
1984-Johnny-Weissmuller 0
Class is: 0
Prediction is: [0] [[0.94570143 0.05429857]]

213
1969-Maureen-Connolly 1
Class is: 1
Prediction is: [1] [[2.90980134e-04 9.99709020e-01]]

214
1968-Yuri-Gagarin 0
Class is: 0
Prediction is: [0] [[0.99788122 0.00211878]]

215
1991-Peggy-Ashcroft 1
Class is: 1
Prediction is: [0] [[0.79017464 0.20982536]]

216
1979-A-Philip-Randolph 0
Class is: 0
Prediction is: [0] [[9.99999728e-01 2.71978913e-07]]

217
1946-Gertrude-Stein 1
Class is: 1
Prediction is: [0] [[0.82855788 0.17144212]]

218
1993-Thurgood-Marshall 0
Class is: 0
Prediction is: [0] [[9.99999625e-



Class is: 0
Prediction is: [0] [[9.99914554e-01 8.54460593e-05]]

222
1977-Joan-Crawford 1
Class is: 1
Prediction is: [1] [[3.26094707e-12 1.00000000e+00]]

223
1971-Hugo-Black 0
Class is: 0
Prediction is: [0] [[9.99998988e-01 1.01204627e-06]]

224
1954-Henri-Matisse 0
Class is: 0
Prediction is: [0] [[0.83525568 0.16474432]]

225
1906-Susan-B-Anthony 1
Class is: 1
Prediction is: [1] [[3.2107117e-10 1.0000000e+00]]

226
1990-Sammy-Davis-Jr 0
Class is: 0
Prediction is: [0] [[0.81260244 0.18739756]]

227
1914-John-Muir 0
Class is: 0
Prediction is: [0] [[0.93833791 0.06166209]]

228
1954-Anne-O-Hare-McCormick 1
Class is: 1
Prediction is: [1] [[4.78266384e-06 9.99995217e-01]]

229
1986-Kate-Smith 1
Class is: 1
Prediction is: [0] [[0.99642938 0.00357062]]

230
1919-Anna-H-Shaw 1
Class is: 1
Prediction is: [1] [[3.38675088e-10 1.00000000e+00]]

231
1980-Jean-Piaget 0
Class is: 0
Prediction is: [1] [[0.38742066 0.61257934]]

232
1966-Buster-Keaton 0
Class is: 0
Prediction is: [0] [[0.89164795 



Class is: 0
Prediction is: [0] [[0.99162938 0.00837062]]

236
1931-Ida-B-Wells 1
Class is: 1
Prediction is: [1] [[0.09106043 0.90893957]]

237
1967-Henry-R-Luce 0
Class is: 0
Prediction is: [0] [[9.99601599e-01 3.98401235e-04]]

238
1984-Ansel-Adams 0
Class is: 0
Prediction is: [0] [[0.99842072 0.00157928]]

239
1945-Ernie-Pyle 0
Class is: 0
Prediction is: [0] [[0.95782913 0.04217087]]

240
1952-Chaim-Weizmann 0
Class is: 0
Prediction is: [0] [[9.99999495e-01 5.05017793e-07]]

241
1973-Nancy-Mitford 1
Class is: 1
Prediction is: [0] [[0.75462978 0.24537022]]

242
1852-Ada-Lovelace 1
Class is: 1
Prediction is: [1] [[0.01623727 0.98376273]]

243
1996-Timothy-Leary 0
Class is: 0
Prediction is: [0] [[9.99825842e-01 1.74157850e-04]]

244
1919-Carnegie-Started 0
Class is: 0
Prediction is: [0] [[9.99949505e-01 5.04948142e-05]]

245
1989-Andrei-A-Gromyko 0
Class is: 0
Prediction is: [0] [[9.99957672e-01 4.23284637e-05]]

246
1965-David-O-Selznick 0
Class is: 0
Prediction is: [0] [[9.99490284e-0



Class is: 0
Prediction is: [0] [[0.85055291 0.14944709]]

250
1936-John-W-Heisman 0
Class is: 0
Prediction is: [0] [[0.83551799 0.16448201]]

251
1998-Maureen-O-Sullivan 1
Class is: 1
Prediction is: [1] [[0.0059423 0.9940577]]

252
1953-Joseph-Stalin 0
Class is: 0
Prediction is: [0] [[9.99547901e-01 4.52098753e-04]]

253
1944-Alfred-E-Smith 0
Class is: 0
Prediction is: [1] [[9.94196225e-06 9.99990058e-01]]

254
1910-Tolstoy 0
Class is: 0
Prediction is: [0] [[0.90828197 0.09171803]]

255
1964-Cole-Porter 0
Class is: 0
Prediction is: [0] [[0.98340878 0.01659122]]

256
1983-Jack-Dempsey 0
Class is: 0
Prediction is: [0] [[0.97195693 0.02804307]]

257
1975-Haile-Selassie 1
Class is: 1
Prediction is: [0] [[9.99969117e-01 3.08834119e-05]]

258
1882-Charles-Darwin 0
Class is: 0
Prediction is: [0] [[0.91705052 0.08294948]]

259
1903-Emily-Warren-Roebling 1
Class is: 1
Prediction is: [1] [[4.96309564e-06 9.99995037e-01]]

260
1962-Eleanor-Roosevelt 1
Class is: 1
Prediction is: [1] [[1.45165019e-



Class is: 1
Prediction is: [0] [[0.84956919 0.15043081]]

264
1959-Billie-Holiday 1
Class is: 1
Prediction is: [1] [[0.34000871 0.65999129]]

265
1969-Coleman-Hawkins 0
Class is: 0
Prediction is: [0] [[0.86416629 0.13583371]]

266
1954-Frida-Kahlo 1
Class is: 1
Prediction is: [1] [[0.04948078 0.95051922]]

267
1911-Joseph-Pulitzer 0
Class is: 0
Prediction is: [0] [[9.99999640e-01 3.60160722e-07]]

268
1993-Carlos-Montoya 0
Class is: 0
Prediction is: [0] [[0.78055432 0.21944568]]

269
1947-Max-Planck 0
Class is: 0
Prediction is: [0] [[0.9902041 0.0097959]]

270
1985-Orson-Welles 0
Class is: 0
Prediction is: [0] [[0.99774191 0.00225809]]

271
1974-Earl-Warren 0
Class is: 0
Prediction is: [0] [[9.99999892e-01 1.07766276e-07]]

272
1971-Ralph-Bunche 0
Class is: 0
Prediction is: [0] [[9.99974601e-01 2.53992213e-05]]

273
1999-Hassan-II 0
Class is: 0
Prediction is: [0] [[0.998955 0.001045]]

274
1931-Knute-Rocke 0
Class is: 0
Prediction is: [0] [[0.80511089 0.19488911]]

275
1998-Theodore-Sc



Class is: 0
Prediction is: [0] [[0.99867783 0.00132217]]

277
1927-Victoria-Martin 1
Class is: 1
Prediction is: [1] [[6.68222073e-08 9.99999933e-01]]

278
1943-George-Washington-Carver 0
Class is: 0
Prediction is: [0] [[9.99325586e-01 6.74414066e-04]]

279
1956-Thomas-J-Watson-Sr 0
Class is: 0
Prediction is: [0] [[9.99956756e-01 4.32435956e-05]]

280
1947-Henry-Ford 0
Class is: 0
Prediction is: [0] [[9.99999984e-01 1.58126765e-08]]

281
1953-Fred-Vinson 0
Class is: 0
Prediction is: [0] [[9.99999991e-01 9.01985167e-09]]

282
1982-Leonid-Brezhnev 0
Class is: 0
Prediction is: [0] [[9.99999335e-01 6.65359350e-07]]

283
1999-King-Hussein 0
Class is: 0
Prediction is: [0] [[9.99996828e-01 3.17232016e-06]]

284
1930-Elmer-Sperry 0
Class is: 0
Prediction is: [0] [[0.90329228 0.09670772]]

285
1985-E-B-White 0
Class is: 0
Prediction is: [0] [[9.99999893e-01 1.06587095e-07]]

286
1959-Ethel-Barrymore 1
Class is: 1
Prediction is: [0] [[0.94991993 0.05008007]]

287
1986-Benny-Goodman 0
Class is: 0




Class is: 0
Prediction is: [0] [[0.95905513 0.04094487]]

291
1993-Federico-Fellini 0
Class is: 0
Prediction is: [0] [[0.88226618 0.11773382]]

292
1945-Harry-S-Truman 0
Class is: 0
Prediction is: [0] [[1.00000000e+00 2.60459076e-19]]

293
1970-De-Gaulle-Rallied 0
Class is: 0
Prediction is: [0] [[9.99999958e-01 4.20102676e-08]]

294
1995-Alfred-Eisenstaedt 0
Class is: 0
Prediction is: [0] [[0.98512347 0.01487653]]

295
1999-Iris-Murdoch 1
Class is: 1
Prediction is: [0] [[0.59469511 0.40530489]]

296
1987-Alf-Landon 0
Class is: 0
Prediction is: [0] [[0.99844834 0.00155166]]

297
1998-Bella-Abzug 1
Class is: 1
Prediction is: [1] [[1.29720679e-11 1.00000000e+00]]

298
1940-Scott-Fitzgerald 0
Class is: 0
Prediction is: [0] [[0.82615423 0.17384577]]

299
1989-Vladimir-Horowitz 0
Class is: 0
Prediction is: [0] [[9.99766207e-01 2.33792681e-04]]

300
1969-Mies-van-der-Rohe 0
Class is: 0
Prediction is: [0] [[0.93824741 0.06175259]]

301
1944-Ida-M-Tarbell 1
Class is: 1
Prediction is: [1] [[0.41



Class is: 0
Prediction is: [0] [[0.82351905 0.17648095]]

305
1971-Igor-Stravinsky 0
Class is: 0
Prediction is: [0] [[9.99998797e-01 1.20325676e-06]]

306
1945-George-Patton 0
Class is: 0
Prediction is: [0] [[0.93873109 0.06126891]]

307
1930-William-Howard-Taft 0
Class is: 0
Prediction is: [0] [[1.00000000e+00 4.01254546e-14]]

308
1935-Will-Rogers 0
Class is: 0
Prediction is: [0] [[0.99156317 0.00843683]]

309
1992-Shirley-Booth 1
Class is: 1
Prediction is: [1] [[6.27130966e-04 9.99372869e-01]]

310
1994-Jacqueline-Kennedy 1
Class is: 1
Prediction is: [0] [[9.99871166e-01 1.28833501e-04]]

311
1933-Ring-Lardner 0
Class is: 0
Prediction is: [0] [[0.98800302 0.01199698]]

312
1974-Sylvia-Plath 1
Class is: 1
Prediction is: [1] [[1.46240253e-06 9.99998538e-01]]

313
1945-FDR 0
Class is: 0
Prediction is: [0] [[0.98180569 0.01819431]]

314
1995-Yitzhak-Rabin 0
Class is: 0
Prediction is: [0] [[9.99997420e-01 2.58048123e-06]]

315
1960-Emily-Post 1
Class is: 1
Prediction is: [1] [[0.16429667



Class is: 0
Prediction is: [0] [[0.88224404 0.11775596]]

319
1957-Gerard-Swope 0
Class is: 0
Prediction is: [0] [[9.99545293e-01 4.54706829e-04]]

320
1993-Albert-Sabin 0
Class is: 0
Prediction is: [0] [[9.99998333e-01 1.66727435e-06]]

321
1955-Thomas-Mann 0
Class is: 0
Prediction is: [0] [[9.99867673e-01 1.32327338e-04]]

322
1991-Dr-Seuss 0
Class is: 0
Prediction is: [1] [[0.05651314 0.94348686]]

323
1877-Bedford-Forrest 0
Class is: 0
Prediction is: [0] [[0.94323231 0.05676769]]

324
1964-Douglas-MacArthur 0
Class is: 0
Prediction is: [0] [[1.00000000e+00 1.05552591e-15]]

325
1965-Churchill 0
Class is: 0
Prediction is: [0] [[1.0000000e+00 1.0187854e-11]]

326
1962-William-Faulkner 0
Class is: 0
Prediction is: [0] [[0.97283116 0.02716884]]

327
1956-Babe-Zaharias 1
Class is: 1
Prediction is: [1] [[8.42795581e-07 9.99999157e-01]]

328
1932-John-Philip-Sousa 0
Class is: 0
Prediction is: [0] [[0.99800378 0.00199622]]

329
1964-Herbert-Hoover 0
Class is: 0
Prediction is: [0] [[0.98670



Class is: 0
Prediction is: [0] [[9.99999200e-01 8.00346607e-07]]

333
1948-Babe-Ruth 0
Class is: 0
Prediction is: [0] [[9.99998684e-01 1.31573852e-06]]

334
1947-Willa-Cather 1
Class is: 1
Prediction is: [0] [[0.91793663 0.08206337]]

335
1963-John-F-Kennedy 0
Class is: 0
Prediction is: [0] [[9.99714881e-01 2.85119340e-04]]

336
1975-Walker-Evans 0
Class is: 0
Prediction is: [0] [[0.88855491 0.11144509]]

337
1916-J-J-Hill 0
Class is: 0
Prediction is: [0] [[0.9798686 0.0201314]]

338
1980-Jesse-Owens 0
Class is: 0
Prediction is: [0] [[0.99860095 0.00139905]]

339
1948-Sergei-Eisenstein 0
Class is: 0
Prediction is: [0] [[0.84400836 0.15599164]]

340
1981-Robert-Moses 0
Class is: 0
Prediction is: [0] [[9.99895037e-01 1.04963312e-04]]

341
1989-Robert-Penn-Warren 0
Class is: 0
Prediction is: [0] [[9.99653295e-01 3.46704729e-04]]

342
1901-William-McKinley 0
Class is: 0
Prediction is: [0] [[0.99235975 0.00764025]]

343
1970-Walter-Reuther 0
Class is: 0
Prediction is: [0] [[0.99842943 0.001



Class is: 1
Prediction is: [1] [[6.11604411e-06 9.99993884e-01]]

346
1978-Golda-Meir 1
Class is: 1
Prediction is: [1] [[1.16689284e-04 9.99883311e-01]]

347
1983-Earl-Hines 0
Class is: 0
Prediction is: [0] [[0.94570462 0.05429538]]

348
1974-Katharine-Cornell 1
Class is: 1
Prediction is: [1] [[7.08803903e-06 9.99992912e-01]]

349
1982-Lee-Strasberg 0
Class is: 0
Prediction is: [0] [[9.99092119e-01 9.07881095e-04]]

350
1939-Pope-Pius-XI 0
Class is: 0
Prediction is: [0] [[9.99975297e-01 2.47033922e-05]]

351
1886-Mary-Ewing-Outerbridge 1
Class is: 1
Prediction is: [1] [[0.01471231 0.98528769]]

352
1993-Dizzy-Gillespie 0
Class is: 0
Prediction is: [0] [[9.99943765e-01 5.62347054e-05]]

353
1910-Florence-Nightingale 1
Class is: 1
Prediction is: [1] [[6.17936124e-08 9.99999938e-01]]

354
1960-Richard-Wright 0
Class is: 0
Prediction is: [0] [[0.99722111 0.00277889]]

355
1986-The-Challenger 1
Class is: 1
Prediction is: [1] [[0.00186657 0.99813343]]

356
1992-Menachem-Begin 0
Class is: 0
P



Class is: 1
Prediction is: [1] [[0.45904446 0.54095554]]

358
1976-Max-Ernst 0
Class is: 0
Prediction is: [0] [[9.99994599e-01 5.40121499e-06]]

359
1993-Cesar-Chavez 0
Class is: 0
Prediction is: [0] [[0.76815719 0.23184281]]

360
1965-Adlai-Ewing-Stevenson 0
Class is: 0
Prediction is: [0] [[9.99999897e-01 1.03402666e-07]]

361
1935-Adolph-S-Ochs 0
Class is: 0
Prediction is: [0] [[1.00000000e+00 4.68616197e-14]]

362
1941-Lou-Gehrig 0
Class is: 0
Prediction is: [0] [[0.88889929 0.11110071]]

363
1961-Carl-G-Jung 0
Class is: 0
Prediction is: [0] [[9.99998256e-01 1.74421046e-06]]

364
1963-Robert-Frost 0
Class is: 0
Prediction is: [0] [[0.54857866 0.45142134]]

365
1965-Edward-R-Murrow 0
Class is: 0
Prediction is: [0] [[9.99999280e-01 7.19969826e-07]]

366
1971-Dean-Acheson 0
Class is: 0
Prediction is: [0] [[9.99990027e-01 9.97307892e-06]]

367
1986-Jorge-Luis-Borges 0
Class is: 0
Prediction is: [0] [[0.99666127 0.00333873]]

368
1966-Walt-Disney 0




Class is: 0
Prediction is: [0] [[9.99978640e-01 2.13599626e-05]]

369
1996-Carl-Sagan 0
Class is: 0
Prediction is: [0] [[9.99991566e-01 8.43366435e-06]]

370
1959-Ross-G-Harrison 0
Class is: 0
Prediction is: [0] [[0.98497165 0.01502835]]

371
1945-Jerome-Kern 0
Class is: 0
Prediction is: [0] [[0.98802519 0.01197481]]

372
1991-Frank-Capra 0
Class is: 0
Prediction is: [0] [[0.88338017 0.11661983]]

373
1987-Andres-Segovie 0
Class is: 0
Prediction is: [0] [[9.99902002e-01 9.79975627e-05]]

374
1987-Rita-Hayworth 1
Class is: 1
Prediction is: [1] [[0.28722634 0.71277366]]

375
1993-William-Golding 0
Class is: 0
Prediction is: [0] [[0.92519959 0.07480041]]

376
1932-Florenz-Ziegfeld 1
Class is: 1
Prediction is: [1] [[0.02275304 0.97724696]]

377
1938-Constantin-Stanislavsky 0
Class is: 0
Prediction is: [0] [[0.99667875 0.00332125]]
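
One way to summarize printed results like these is overall accuracy plus a count of each kind of error. The following is a minimal sketch in plain Python, using only the true and predicted labels of the first eleven obituaries in the output (indices 0 through 10). In the notebook itself, the same comparison could presumably be run over `meta['gender']` and `meta['PREDICTED']`.

```python
from collections import Counter

# True and predicted gender labels for indices 0-10, copied from the printed output
actual    = [0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0]

# Accuracy: share of held-out obituaries classified correctly
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Tally (actual, predicted) pairs to see which kind of error occurs
confusion = Counter(zip(actual, predicted))

print(f"accuracy: {accuracy:.3f}")
print("actual 1, predicted 0:", confusion[(1, 0)])
print("actual 0, predicted 1:", confusion[(0, 1)])
```

On this small slice the only error is 1984-Ethel-Merman (index 3), whose class is 1 but whose prediction is 0.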





In [None]:
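from scipy.stats import norm

canonic_c = 1.0

def Ztest(vec1, vec2):
    """Two-sample z-test: returns the z statistic and the two-sided p-value."""

    X1, X2 = np.mean(vec1), np.mean(vec2)
    sd1, sd2 = np.std(vec1), np.std(vec2)
    n1, n2 = len(vec1), len(vec2)

    # Standard error of the difference between the two means
    pooledSE = np.sqrt(sd1**2/n1 + sd2**2/n2)
    z = (X1 - X2)/pooledSE
    pval = 2*(norm.sf(abs(z)))

    return z, pval

def feat_pval_weight(meta_df_, dtm_df_):
    """Pair each feature's z-test p-value with its logistic regression weight."""
    
    #dtm_df_ = dtm_df_.loc[meta_df_.index.tolist()]
    # normalize_model is not defined in this notebook; it must be supplied elsewhere
    dtm_df_ = normalize_model(dtm_df_, dtm_df_)[0]
    #dtm_df_ = dtm_df_.dropna(axis = 1, how='any')

    # Split the rows of the DTM by class
    dtm0 = dtm_df_.loc[meta_df_[meta_df_['CLASS']==0].index.tolist()].to_numpy()
    dtm1 = dtm_df_.loc[meta_df_[meta_df_['CLASS']==1].index.tolist()].to_numpy()

    print(dtm0.shape)
    print(dtm1.shape)
    # One z-test per column (feature), comparing class 0 against class 1
    pvals = [Ztest(dtm0[ : ,i], dtm1[ : ,i])[1] for i in range(dtm_df_.shape[1])]
    # liblinear supports the L1 penalty
    clf = LogisticRegression(penalty = 'l1', C = canonic_c, class_weight = 'balanced', solver = 'liblinear')
    # The target is True for class 0, so positive weights point toward class 0
    clf.fit(dtm_df_, meta_df_['CLASS']==0)
    weights = clf.coef_[0]

    feature_df = pd.DataFrame()

    feature_df['FEAT'] = dtm_df_.columns
    feature_df['P_VALUE'] = pvals
    feature_df['LR_WEIGHT'] = weights

    return feature_df

# Bonferroni-corrected significance threshold: one test per feature.
# df2 and meta2 are not defined in this notebook; they come from elsewhere.
sig_thresh = 0.05 / len(df2.columns)

feat_df = feat_pval_weight(meta2, df2)
print(feat_df.shape)

feat_df.to_csv('/media/secure_volume/FEATURES_MASS_TRADE_1980_2007_5.csv')
# Keep only the significant features, sorted by weight, and print the first 30
out = feat_df[(feat_df['P_VALUE'] <= sig_thresh)].sort_values('LR_WEIGHT', ascending = True)
out1 = out['FEAT'].tolist()
for o in out1[0:30]:
    print(o)
# Write the full feature list, one feature per line
with open('/media/secure_volume/FEATURE_LIST_MASS_TRADE_1980_2007_5.txt', 'w') as featuresFile:
    featuresFile.write('\n'.join(out1))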
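
To see what a two-sample z-test of this kind computes, here is a minimal standard-library sketch of the z statistic and two-sided p-value, followed by a Bonferroni-style threshold (the significance level divided by the number of tests). The toy vectors are invented, and the feature count of 1391 is borrowed from the shape of the document-term matrix printed earlier purely for illustration; `df2` itself is not shown in this lesson.

```python
import math
import statistics

def z_test(vec1, vec2):
    """Two-sample z-test using population standard deviations."""
    mean1, mean2 = statistics.fmean(vec1), statistics.fmean(vec2)
    sd1, sd2 = statistics.pstdev(vec1), statistics.pstdev(vec2)
    n1, n2 = len(vec1), len(vec2)
    # Standard error of the difference between the two means
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = (mean1 - mean2) / se
    # Two-sided p-value: 2 * P(Z > |z|) for a standard normal, computed via erfc
    pval = math.erfc(abs(z) / math.sqrt(2))
    return z, pval

# Toy values of one feature in two classes (invented numbers)
class0 = [0.0, 0.0, 1.0, 1.0]
class1 = [4.0, 4.0, 5.0, 5.0]
z, pval = z_test(class0, class1)

# Bonferroni correction: divide the significance level by the number of tests
n_features = 1391  # assumed for illustration: column count of the DTM above
sig_thresh = 0.05 / n_features

print(z, pval, pval <= sig_thresh)
```

A feature passes only if its p-value falls below the corrected threshold, which is far stricter than 0.05 because every word in the vocabulary is tested.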

## Read in dataframe

For classification, we need two kinds of things: text and classes (e.g. gender, race, publisher). Pandas dataframes are useful for classification because they can hold a complete text and its metadata in a single row.
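
As a small sketch of that layout, the dataframe below keeps a text and its metadata together in each row. The titles, gender codes, and years are taken from the metadata table shown in this lesson; the `obit` strings are placeholders rather than real obituary text, and the variable names are hypothetical.

```python
import pandas as pd

# Titles, gender codes, and years come from the lesson's metadata table;
# the 'obit' strings are stand-ins, not the real obituary text
obits = pd.DataFrame({
    "title": ["1945-Adolf-Hitler", "1984-Ethel-Merman"],
    "gender": [0, 1],
    "date": [1945, 1984],
    "obit": ["(full text of the first obituary)", "(full text of the second obituary)"],
})

# Each row carries both the text and the class labels we may want to predict
row = obits.iloc[1]
print(row["title"], row["gender"], row["date"])

# Selecting every text in one class is a one-line filter
gender_one = obits[obits["gender"] == 1]
print(gender_one["title"].tolist())
```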

For this lesson, we're going to use our _New York Times_ obituaries corpus, which I have supplemented with the gender of the person who died and the date of publication.

## Load stopwords

Many words are uninteresting or unhelpful for classification, so we treat them as stopwords and remove them from the corpus.
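
The idea can be shown in plain Python before handing the job to a vectorizer. This is a minimal sketch: the stopword set and the sentence are invented for illustration, and the lesson's real stopword list is much longer.

```python
# A tiny hypothetical stopword set; the lesson's real list is much longer
stopwords = {"the", "of", "and", "a", "in", "was", "he", "she"}

# An invented sentence, not taken from the corpus
sentence = "She was the star of the Broadway stage and a singer"

# Lowercase, split on whitespace, and keep only the non-stopwords
tokens = sentence.lower().split()
kept = [token for token in tokens if token not in stopwords]
print(kept)
```

Only the content words survive, and those are the words a classifier is left to learn from.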

In [None]:
# Save the classifier output as a tab-separated file (meta2 is not defined in this notebook)
meta2.to_csv('/media/secure_volume/CLASSIFIER_OUTPUT_MASS_TRADE_1980_2007_5.csv', sep='\t')

In [141]:
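# Start from a placeholder dataframe; rows are filled in one obituary at a time
d = {'title':["",""], 'gender': ['', ''], 'obit': ['', ''], 'date':["",""]}

df = pd.DataFrame(data=d)

# Match any four consecutive digits; [0-9] rather than [1-9] so that
# years containing a zero (such as 1900) are found too
p = re.compile("[0-9]{4}")
for count, file in enumerate(files):
    with open(file, encoding='utf-8') as f:
        text = f.read()
    # The first four-digit number in the file is taken as the date
    m = p.search(text)
    if m is not None:
        df.at[count, 'date'] = m.group(0)
    else:
        df.at[count, 'date'] = "n/a"
    # Strip the dateline that is followed by a blank line, 8-letter all-caps runs, and newlines
    text = re.sub("[A-Z][a-z]*[ ][0-9]*[,][ ][0-9]*[\n][\n]","", text)
    text = re.sub("[A-Z]{8}", "", text)
    text = re.sub("[\n]","", text)
    df.at[count, 'obit'] = text
    df.at[count, 'title'] = obit_titles[count]
df.to_csv("../docs/NYT-Obituaries.csv", encoding='utf-8', index=False)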
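
When a regular expression pulls a year out of text, the choice of character class matters: `[1-9]` excludes zero, so a four-digit pattern built from it skips any year that contains a zero, whereas `[0-9]` accepts every digit. A small sketch on invented strings (not taken from the corpus):

```python
import re

# Hypothetical snippets standing in for the opening of an obituary file
samples = ["May 2, 1945 Berlin", "March 3, 1900 Boston"]

for sample in samples:
    no_zero = re.search("[1-9]{4}", sample)    # four digits drawn from 1-9 only
    any_digit = re.search("[0-9]{4}", sample)  # four digits drawn from 0-9
    print(
        sample,
        "->",
        no_zero.group(0) if no_zero else None,
        any_digit.group(0) if any_digit else None,
    )
```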