# Classification

Classification, a method popular in machine learning, trains a model to sort texts into categories. We use it to test whether, and how, a model can distinguish between sets of texts.

It works like this. Everyone with email relies on classification to separate spam from legitimate emails. Email providers train classification models to recognize the difference by giving them emails they have labeled “spam” and “not spam.” They then ask the model to learn the features that most reliably distinguish the two types, which could include a preponderance of all caps or phrases like “free money” or “get paid.” They test the model by giving it unlabeled emails and asking it to classify them. If the model can do it accurately a high percentage of the time, that’s a good spam filter.
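The spam example can be sketched in a few lines. This is only a toy illustration: the six "emails" and their labels are made up, and the variable names (`train_emails`, `toy_model`, and so on) are ours, not part of any real spam filter.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training emails with hand-assigned labels: 1 = spam, 0 = not spam
train_emails = [
    "free money now",
    "get paid free money",
    "win free cash now",
    "meeting agenda for monday",
    "lunch with the team tomorrow",
    "project report attached",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Turn each email into word counts, then learn which words mark each class
toy_vectorizer = CountVectorizer()
X_train = toy_vectorizer.fit_transform(train_emails)
toy_model = LogisticRegression()
toy_model.fit(X_train, train_labels)

# Classify an email the model has not seen before
new_email = ["free money"]
X_new = toy_vectorizer.transform(new_email)
toy_prediction = toy_model.predict(X_new)           # predicted class
toy_probabilities = toy_model.predict_proba(X_new)  # probability of each class
print(toy_prediction, toy_probabilities)
```

The steps are the same ones we follow below with obituaries: label, vectorize, fit, predict.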

We can take the underlying idea and apply it to many experiments.

## Today's model

We are going to use our _New York Times_ obituaries corpus to test whether our model can learn to distinguish between obituaries about men and women.

## Imports

As always, we begin with some imports.

In [28]:
import pandas as pd
import glob
from pathlib import Path
from pandas import Series, DataFrame
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from scipy.stats import pearsonr, norm

## Corpus 

For this notebook, we'll return to our corpus of _New York Times_ obituaries.

In [29]:
# collect filepaths as files
directory = "../docs/NYT-Obituaries/"
files = glob.glob(f"{directory}/*.txt")

In [30]:
# and collect obit titles, which are also the final section of the filepaths
obit_titles = [Path(file).stem for file in files]
obit_titles

['1945-Adolf-Hitler',
 '1915-F-W-Taylor',
 '1975-Chiang-Kai-shek',
 '1984-Ethel-Merman',
 '1953-Jim-Thorpe',
 '1964-Nella-Larsen',
 '1955-Margaret-Abbott',
 '1984-Lillian-Hellman',
 '1959-Cecil-De-Mille',
 '1928-Mabel-Craty',
 '1973-Eddie-Rickenbacker',
 '1989-Ferdinand-Marcos',
 '1991-Martha-Graham',
 '1997-Deng-Xiaoping',
 '1938-George-E-Hale',
 '1885-Ulysses-Grant',
 '1909-Sarah-Orne-Jewett',
 '1957-Christian-Dior',
 '1987-Clare-Boothe-Luce',
 '1976-Jacques-Monod',
 '1954-Getulio-Vargas',
 '1979-Stan-Kenton',
 '1990-Leonard-Bernstein',
 '1972-Jackie-Robinson',
 '1998-Fred-W-Friendly',
 '1991-Leo-Durocher',
 '1915-B-T-Washington',
 '1997-James-Stewart',
 '1981-Joe-Louis',
 '1983-Muddy-Waters',
 '1942-George-M-Cohan',
 '1989-Samuel-Beckett',
 '1962-Marilyn-Monroe',
 '2000-Charles-M-Schulz',
 '1967-Gregory-Pincus',
 '1894-R-L-Stevenson',
 '1978-Bruce-Catton',
 '1982-Arthur-Rubinstein',
 '1875-Andrew-Johnson',
 '1974-Charles-Lindbergh',
 '1964-Rachel-Carson',
 '1953-Marjorie-Rawlings',
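To see what `Path(file).stem` does to a single filepath, here is a small sketch. The path is a hypothetical example in the same shape as the ones `glob` returns, and splitting the title into year and name is our own addition for illustration.

```python
from pathlib import Path

# A hypothetical filepath in the same shape as the ones glob returns
example_path = "../docs/NYT-Obituaries/1945-Adolf-Hitler.txt"

# .stem drops the directories and the .txt extension
example_title = Path(example_path).stem

# The filenames begin with a year, so splitting on the first hyphen separates year and name
example_year, example_name = example_title.split("-", 1)
print(example_title, example_year, example_name)
```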


## Create document-term matrix

### Initiate CountVectorizer as vectorizer

Remember document-term matrices, aka doc-term matrices, aka dtms? We learned about them in notebooks 10 and 11. Our classifier uses a dtm as its input. We build it with scikit-learn's CountVectorizer, which we imported at the start of the lesson. 

When we create our vectorizer, we tell it to read the files as utf-8 and we pass in our stopwords. We can also set `min_df`, the minimum number of documents a word must appear in to be included in the dtm. In this case, I've set it at 20.

In [31]:
# load stopwords
from sklearn.feature_extraction import text
with open('../docs/jockers_stopwords.txt') as text_file:
    jockers_words = text_file.read().split()
# recent scikit-learn versions expect stop_words to be a list
new_stopwords = list(text.ENGLISH_STOP_WORDS.union(jockers_words))

# create dtm
corpus_path = '../docs/NYT-Obituaries/'
vectorizer = CountVectorizer(input='filename', encoding='utf8', stop_words = new_stopwords, min_df=20, dtype='float64')

### Make list of filepaths

Because we set `input='filename'`, CountVectorizer builds the dtm from a list of filepaths.

In [32]:
corpus = []
for title in obit_titles:
    filename = title + ".txt"
    corpus.append(corpus_path + filename)
dtm = vectorizer.fit_transform(corpus)

### Get feature names and set as column titles

The columns store word counts. We want to name the columns with the words stored in each, and to transform the dtm into a pandas dataframe, as follows:

In [33]:
# scikit-learn 1.0 and later; older versions used get_feature_names()
vocab = vectorizer.get_feature_names_out()
matrix = dtm.toarray()
df = DataFrame(matrix, columns=vocab)
print('df shape is: ' + str(df.shape))

df shape is: (378, 2985)


Our dataframe has 378 rows, one for each document, or obituary, and 2985 columns, one for each word that's not in stopwords and appears in at least 20 documents in the corpus.

## Import metadata

In [34]:
meta = pd.read_csv("../docs/NYT-Obituaries.csv", encoding = 'utf-8')
meta = meta.rename(columns={'title': 'obit_title'})
meta = meta[["obit_title", "gender", "date"]]
meta

Unnamed: 0,obit_title,gender,date
0,1945-Adolf-Hitler,0,1945.0
1,1915-F-W-Taylor,0,1915.0
2,1975-Chiang-Kai-shek,0,1975.0
3,1984-Ethel-Merman,1,1984.0
4,1953-Jim-Thorpe,0,1953.0
...,...,...,...
373,1987-Andres-Segovie,0,1987.0
374,1987-Rita-Hayworth,1,1987.0
375,1993-William-Golding,0,1993.0
376,1932-Florenz-Ziegfeld,1,1932.0


Our metadata is stored as a pandas dataframe with a row for each obituary and three columns: title, gender, and year.

## Concatenate metadata and doc-term dataframe

In [35]:
df_concat = pd.concat([meta, df], axis = 1)

In [36]:
df_concat.head()

Unnamed: 0,obit_title,gender,date,000,10,100,11,12,13,14,...,wrote,yale,year,years,yellow,yesterday,york,younger,youngest,youth
0,1945-Adolf-Hitler,0,1945.0,21.0,1.0,0.0,2.0,3.0,4.0,3.0,...,3.0,0.0,11.0,19.0,0.0,0.0,1.0,1.0,0.0,9.0
1,1915-F-W-Taylor,0,1915.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,1975-Chiang-Kai-shek,0,1975.0,3.0,3.0,1.0,1.0,0.0,0.0,0.0,...,6.0,0.0,3.0,14.0,0.0,0.0,1.0,2.0,1.0,1.0
3,1984-Ethel-Merman,1,1984.0,0.0,1.0,0.0,1.0,0.0,1.0,2.0,...,0.0,0.0,3.0,5.0,0.0,2.0,5.0,0.0,0.0,0.0
4,1953-Jim-Thorpe,0,1953.0,2.0,3.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,6.0,1.0,0.0,0.0,4.0,0.0,0.0,0.0
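One caution: `pd.concat(..., axis=1)` lines rows up by index position, not by title, so the metadata rows need to be in the same order as the files. A minimal sketch of how one might check and repair the order, using made-up titles and counts (all names here are hypothetical):

```python
import pandas as pd

# Made-up stand-ins: titles in file order, and metadata that lists them in a different order
toy_titles = ["1945-A", "1915-B", "1975-C"]
toy_meta = pd.DataFrame({"obit_title": ["1915-B", "1945-A", "1975-C"], "gender": [0, 0, 1]})
toy_counts = pd.DataFrame({"wrote": [3.0, 0.0, 6.0]})  # rows follow toy_titles

# Check whether the two orders agree before concatenating
orders_match = toy_meta["obit_title"].tolist() == toy_titles
print(orders_match)

# If they disagree, reorder the metadata to follow the file order
toy_meta_aligned = toy_meta.set_index("obit_title").loc[toy_titles].reset_index()
toy_concat = pd.concat([toy_meta_aligned, toy_counts], axis=1)
print(toy_concat)
```

In our notebook the first rows of the two dataframes agree, which suggests the orders already match, but a check like this is cheap insurance.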


## Equalize numbers of men and women

We want our dataframe to have equal numbers of men and women. How many women are there? Women are counted as 1 and men as 0, so if we sum the gender column, we'll have the number of women:

In [37]:
meta['gender'].sum()

93

Then we separate men and women into two dataframes and take a random sample of 93 obituaries about men.

In [38]:
df_men = df_concat[df_concat['gender'] == 0]
df_women = df_concat[df_concat['gender'] == 1]
df_men = df_men.sample(n=93)
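Because `.sample()` draws at random, each run of the cell above picks a different set of men, so your results may differ from the ones printed in this notebook. If you want a repeatable draw, pandas accepts a `random_state` argument. Here is a sketch on a made-up frame; the seed 42 is an arbitrary choice of ours.

```python
import pandas as pd

# A made-up, unbalanced frame: six men (0) and two women (1)
toy = pd.DataFrame({"gender": [0, 0, 0, 0, 0, 0, 1, 1],
                    "years": [3, 8, 3, 5, 7, 1, 7, 8]})

toy_women = toy[toy["gender"] == 1]
toy_men = toy[toy["gender"] == 0]

# Sample as many men as there are women; random_state fixes the draw
toy_men_sample = toy_men.sample(n=len(toy_women), random_state=42)

toy_balanced = pd.concat([toy_men_sample, toy_women]).reset_index(drop=True)
print(toy_balanced["gender"].value_counts())  # both classes now have the same count
```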

We then concatenate the sampled men dataframe with the women dataframe and reset the index.

In [39]:
# drop=True discards the old index instead of keeping it as a column
df_final = pd.concat([df_men, df_women]).reset_index(drop=True)
df_final

Unnamed: 0,obit_title,gender,date,000,10,100,11,12,13,14,...,wrote,yale,year,years,yellow,yesterday,york,younger,youngest,youth
0,1936-John-W-Heisman,0,1936.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.0,3.0,0.0,1.0,1.0,0.0,0.0,0.0
1,1971-Dean-Acheson,0,1971.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,7.0,2.0,1.0,8.0,0.0,0.0,1.0,0.0,0.0,0.0
2,1970-Edouard-Daladier,0,1946.0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,...,0.0,0.0,1.0,3.0,0.0,1.0,1.0,0.0,0.0,0.0
3,1995-Jonas-Salk,0,1995.0,4.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,7.0,5.0,0.0,1.0,3.0,1.0,0.0,0.0
4,1988-John-Houseman,0,1988.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,1.0,7.0,0.0,1.0,4.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181,1910-Florence-Nightingale,1,1854.0,3.0,0.0,0.0,0.0,1.0,0.0,1.0,...,2.0,0.0,2.0,7.0,0.0,1.0,1.0,0.0,0.0,1.0
182,1986-The-Challenger,1,1986.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,4.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
183,1998-Galina-Ulanova,1,1998.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,1.0,1.0,0.0,0.0,4.0,0.0,0.0,0.0
184,1987-Rita-Hayworth,1,1987.0,0.0,2.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,8.0,0.0,1.0,2.0,0.0,0.0,0.0


We now have 186 rows: 93 men, 93 women.

### Match meta and data dataframes with subset of df_final

We'll continue to use meta and df, so we need to ensure they match our subsetted df_final.

In [40]:
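# .copy() gives meta its own data, so adding columns later does not raise a SettingWithCopyWarning
meta = df_final[["obit_title", "gender", "date"]].copy()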
meta

Unnamed: 0,obit_title,gender,date
0,1936-John-W-Heisman,0,1936.0
1,1971-Dean-Acheson,0,1971.0
2,1970-Edouard-Daladier,0,1946.0
3,1995-Jonas-Salk,0,1995.0
4,1988-John-Houseman,0,1988.0
...,...,...,...
181,1910-Florence-Nightingale,1,1854.0
182,1986-The-Challenger,1,1986.0
183,1998-Galina-Ulanova,1,1998.0
184,1987-Rita-Hayworth,1,1987.0


In [41]:
df = df_final.loc[:,'000':]
df

Unnamed: 0,000,10,100,11,12,13,14,15,150,16,...,wrote,yale,year,years,yellow,yesterday,york,younger,youngest,youth
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.0,3.0,0.0,1.0,1.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,7.0,2.0,1.0,8.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,2.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,3.0,0.0,1.0,1.0,0.0,0.0,0.0
3,4.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,7.0,5.0,0.0,1.0,3.0,1.0,0.0,0.0
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,1.0,7.0,0.0,1.0,4.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181,3.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,2.0,0.0,2.0,7.0,0.0,1.0,1.0,0.0,0.0,1.0
182,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,4.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
183,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,1.0,1.0,0.0,0.0,4.0,0.0,0.0,0.0
184,0.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,1.0,8.0,0.0,1.0,2.0,0.0,0.0,0.0


## Let's run our classifier!

Once we have a dataframe with metadata and vocab counts, we're ready to run our classifier!

### We add columns for probabilities and predicted class to our metadata

As we run the model, we are going to store its output with our metadata. This will allow us to easily examine the model's output.

In [42]:
meta['PROBS'] = ''
meta['PREDICTED'] = ''



### Load model

We will use scikit-learn's `LogisticRegression` model. There are many other options for classifier models. Some are better for some tasks, others for others. LogisticRegression is standard for classifying literature. We set the penalty as l1 and the 'C' value as 1.0. We also name the `liblinear` solver, because scikit-learn's current default solver does not support the l1 penalty. If you decide to specialize in classification, you can explore further the implications of these arguments.

In [43]:
model = LogisticRegression(penalty = 'l1', C = 1.0, solver = 'liblinear')
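To get a feel for what these two arguments do, here is a rough sketch on made-up data. The l1 penalty pushes many word weights to exactly zero, and smaller values of `C` mean stronger regularization. The data, the seed, and the three `C` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up count data: 40 "documents", 10 "words"
rng = np.random.default_rng(0)
X_toy = rng.poisson(lam=2.0, size=(40, 10)).astype(float)
y_toy = np.array([0] * 20 + [1] * 20)
X_toy[y_toy == 1, 0] += 4  # word 0 is more frequent in class 1

# Fit the same model with weak, moderate, and strong values of C
nonzero_by_C = {}
for C in (0.001, 1.0, 100.0):
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    clf.fit(X_toy, y_toy)
    nonzero_by_C[C] = int(np.count_nonzero(clf.coef_))  # words with a nonzero weight

print(nonzero_by_C)
```

With a very small `C`, the penalty dominates and few or no words keep a weight. With a large `C`, more words are allowed to contribute.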

### Run the model!

We run the model in the following for-loop.

Classification models need classes: they need the texts grouped into different sets. Our metadata has built-in classes: gender. Men are stored as 0; women as 1. We could, if we wanted, create a new 0/1 class based on year.

Each iteration trains on all the titles except one, then predicts which class the excluded title belongs to. This approach is known as leave-one-out cross-validation. There are other ways of dividing training and testing sets, which we won't explore today.

The first four indented lines simply track our progress by printing index, title, and class. The next four lines exclude a single title, and set the training data and the test data.

The final six lines fit the model, calculate the probabilities and predicted class of the test case, and add that information to our metadata dataframe.

In [44]:
for this_index in df_final.index.tolist():
    print(this_index) # keep track of where we are in the corpus
    title = meta.loc[meta.index[this_index], 'obit_title'] 
    CLASS = meta.loc[meta.index[this_index], 'gender']
    print(title, CLASS) 
    
    train_index_list = [index_ for index_ in df.index.tolist() if index_ != this_index] # exclude the title to be predicted
    X = df.loc[train_index_list] # the model trains on all the data except the excluded title row
    y = meta.loc[train_index_list, 'gender'] # the y row tells the model which class each title belongs to
    TEST_CASE = df.loc[[this_index]]

    model.fit(X,y) # fit the model
    prediction = model.predict_proba(TEST_CASE) # calculate probability of test case
    predicted = model.predict(TEST_CASE) # calculate predicted class of test case
    meta.at[this_index, 'PREDICTED'] = predicted # add predicted class to metadata
    meta.at[this_index, 'PROBS'] = str(prediction) # add probabilities to metadata
    print('Class is: ' + str(CLASS) + '\n' + 'Prediction is: ' + str(predicted) + ' ' + str(prediction) + '\n')

0
1936-John-W-Heisman 0
Class is: 0
Prediction is: [0] [[0.79355842 0.20644158]]

1
1971-Dean-Acheson 0
Class is: 0
Prediction is: [0] [[9.99999779e-01 2.21229604e-07]]

2
1970-Edouard-Daladier 0
Class is: 0
Prediction is: [0] [[0.98035948 0.01964052]]

3
1995-Jonas-Salk 0
Class is: 0
Prediction is: [0] [[9.99988000e-01 1.19995998e-05]]

4
1988-John-Houseman 0
Class is: 0
Prediction is: [1] [[0.39230848 0.60769152]]

5
1968-Yuri-Gagarin 0
Class is: 0
Prediction is: [0] [[0.80787114 0.19212886]]

6
1957-Gerard-Swope 0
Class is: 0
Prediction is: [0] [[0.98994386 0.01005614]]

7
1994-Linus-C-Pauling 0
Class is: 0
Prediction is: [0] [[9.99996248e-01 3.75212485e-06]]

8
1985-Orson-Welles 0
Class is: 0
Prediction is: [0] [[0.93995894 0.06004106]]

9
1984-Count-Basie 0
Class is: 0
Prediction is: [0] [[0.94674981 0.05325019]]

10
1943-J-H-Kellogg 0
Class is: 0
Prediction is: [0] [[0.67547514 0.32452486]]

11
1992-Alex-Haley 0
Class is: 0
Prediction is: [1] [[0.3289677 0.6710323]]

12
1931-Melv



Class is: 0
Prediction is: [0] [[0.97883526 0.02116474]]

15
1973-Lyndon-Johnson 0
Class is: 0
Prediction is: [0] [[9.99404996e-01 5.95003619e-04]]

16
1941-Frank-Conrad 0
Class is: 0
Prediction is: [0] [[0.96332145 0.03667855]]

17
1984-Ansel-Adams 0
Class is: 0
Prediction is: [0] [[0.87330515 0.12669485]]

18
1990-Ralph-David-Abernathy 0
Class is: 0
Prediction is: [0] [[9.99896178e-01 1.03822097e-04]]

19
1935-Will-Rogers 0
Class is: 0
Prediction is: [0] [[0.81817072 0.18182928]]

20
1994-Jan-Tinbergen 0
Class is: 0
Prediction is: [0] [[0.8062079 0.1937921]]

21
1994-Thomas-P-O-Neill-Jr 0
Class is: 0
Prediction is: [0] [[9.99788638e-01 2.11362492e-04]]

22
1989-Ferdinand-Marcos 0
Class is: 0
Prediction is: [1] [[0.00812173 0.99187827]]

23
1910-William-James 0
Class is: 0
Prediction is: [0] [[0.96898195 0.03101805]]

24
1937-John-Rockefeller 0
Class is: 0
Prediction is: [0] [[0.9889836 0.0110164]]

25
1973-Otto-Klemperer 0
Class is: 0
Prediction is: [1] [[0.27508426 0.72491574]]

26




Class is: 0
Prediction is: [0] [[0.99077256 0.00922744]]

30
1969-Coleman-Hawkins 0
Class is: 0
Prediction is: [0] [[0.81938776 0.18061224]]

31
1945-Adolf-Hitler 0
Class is: 0
Prediction is: [0] [[9.99994629e-01 5.37147531e-06]]

32
1969-Mies-van-der-Rohe 0
Class is: 0
Prediction is: [0] [[0.99684422 0.00315578]]

33
1989-Andrei-Sakharov 0
Class is: 0
Prediction is: [0] [[9.99999697e-01 3.02605569e-07]]

34
1965-Churchill 0
Class is: 0
Prediction is: [0] [[9.99999918e-01 8.19405231e-08]]

35
1977-Dash-Ended 0
Class is: 0
Prediction is: [0] [[9.99867667e-01 1.32333120e-04]]

36
1989-Andrei-A-Gromyko 0
Class is: 0
Prediction is: [0] [[9.99998011e-01 1.98931056e-06]]

37
1978-Pope-Paul-VI 0
Class is: 0
Prediction is: [1] [[0.03282008 0.96717992]]

38
1942-George-M-Cohan 0
Class is: 0
Prediction is: [0] [[0.98257334 0.01742666]]

39
1986-Benny-Goodman 0
Class is: 0
Prediction is: [0] [[9.99998627e-01 1.37261716e-06]]

40
1916-J-J-Hill 0
Class is: 0
Prediction is: [0] [[0.99679591 0.003204



Class is: 0
Prediction is: [0] [[0.98254746 0.01745254]]

45
1914-John-Muir 0
Class is: 0
Prediction is: [0] [[0.95648481 0.04351519]]

46
1875-Andrew-Johnson 0
Class is: 0
Prediction is: [0] [[9.99458369e-01 5.41630857e-04]]

47
1997-James-Stewart 0
Class is: 0
Prediction is: [0] [[0.99295745 0.00704255]]

48
1952-Chaim-Weizmann 0
Class is: 0
Prediction is: [0] [[9.99971582e-01 2.84175616e-05]]

49
1954-Enrico-Fermi 0
Class is: 0
Prediction is: [0] [[9.99743935e-01 2.56065358e-04]]

50
1939-Pope-Pius-XI 0
Class is: 0
Prediction is: [0] [[9.99965417e-01 3.45833823e-05]]

51
1935-Justice-Holmes 0
Class is: 0
Prediction is: [0] [[0.82364084 0.17635916]]

52
1954-Getulio-Vargas 0
Class is: 0
Prediction is: [0] [[0.99114278 0.00885722]]

53
1984-Johnny-Weissmuller 0
Class is: 0
Prediction is: [1] [[0.08180441 0.91819559]]

54
1993-Albert-Sabin 0
Class is: 0
Prediction is: [0] [[9.99998052e-01 1.94773051e-06]]

55
1939-W-B-Yeats 0
Class is: 0
Prediction is: [1] [[0.28880672 0.71119328]]

56



Class is: 0
Prediction is: [0] [[9.99988364e-01 1.16364537e-05]]

62
1945-Ernie-Pyle 0
Class is: 0
Prediction is: [1] [[0.03144808 0.96855192]]

63
1938-Constantin-Stanislavsky 0
Class is: 0
Prediction is: [0] [[0.88579174 0.11420826]]

64
1967-Gregory-Pincus 0
Class is: 0
Prediction is: [1] [[0.06895972 0.93104028]]

65
1978-Bruce-Catton 0
Class is: 0
Prediction is: [0] [[0.50774572 0.49225428]]

66
1970-De-Gaulle-Rallied 0
Class is: 0
Prediction is: [0] [[9.99999743e-01 2.56568998e-07]]

67
1967-Henry-R-Luce 0
Class is: 0
Prediction is: [0] [[0.98893138 0.01106862]]

68
1986-Jorge-Luis-Borges 0
Class is: 0
Prediction is: [0] [[0.90518532 0.09481468]]

69
1926-Harry-Houdini 0
Class is: 0
Prediction is: [0] [[0.84385102 0.15614898]]

70
1998-Frank-Sinatra 0
Class is: 0
Prediction is: [1] [[0.02597769 0.97402231]]

71
1932-John-Philip-Sousa 0
Class is: 0
Prediction is: [0] [[9.99588225e-01 4.11775384e-04]]

72
1979-A-Philip-Randolph 0
Class is: 0
Prediction is: [0] [[9.99989730e-01 1.02



Class is: 0
Prediction is: [0] [[0.73784122 0.26215878]]

79
1980-Jesse-Owens 0
Class is: 0
Prediction is: [0] [[0.9988516 0.0011484]]

80
1961-Carl-G-Jung 0
Class is: 0
Prediction is: [0] [[9.99992031e-01 7.96886352e-06]]

81
1955-Thomas-Mann 0
Class is: 0
Prediction is: [0] [[0.97563727 0.02436273]]

82
1986-James-Cagney 0
Class is: 0
Prediction is: [0] [[0.99868581 0.00131419]]

83
1945-FDR 0
Class is: 0
Prediction is: [0] [[0.99436868 0.00563132]]

84
1923-Warren-Harding 0
Class is: 0
Prediction is: [0] [[9.99999999e-01 5.24098358e-10]]

85
1963-Robert-Frost 0
Class is: 0
Prediction is: [1] [[0.43609737 0.56390263]]

86
1966-Walt-Disney 0
Class is: 0
Prediction is: [0] [[0.93470129 0.06529871]]

87
1954-Henri-Matisse 0
Class is: 0
Prediction is: [0] [[0.72523268 0.27476732]]

88
1993-Dizzy-Gillespie 0
Class is: 0
Prediction is: [0] [[0.99873891 0.00126109]]

89
1985-Roger-Maris 0
Class is: 0
Prediction is: [0] [[0.83784886 0.16215114]]

90
1952-Charles-Spaulding 0
Class is: 0
Predi



Class is: 1
Prediction is: [1] [[1.35854012e-04 9.99864146e-01]]

95
1955-Margaret-Abbott 1
Class is: 1
Prediction is: [1] [[6.81516377e-09 9.99999993e-01]]

96
1984-Lillian-Hellman 1
Class is: 1
Prediction is: [1] [[0.00163692 0.99836308]]

97
1928-Mabel-Craty 1
Class is: 1
Prediction is: [1] [[0.25208215 0.74791785]]

98
1991-Martha-Graham 1
Class is: 1
Prediction is: [1] [[0.3773696 0.6226304]]

99
1909-Sarah-Orne-Jewett 1
Class is: 1
Prediction is: [1] [[0.15088728 0.84911272]]

100
1987-Clare-Boothe-Luce 1
Class is: 1
Prediction is: [1] [[6.19514596e-04 9.99380485e-01]]

101
1962-Marilyn-Monroe 1
Class is: 1
Prediction is: [1] [[0.00106837 0.99893163]]

102
1964-Rachel-Carson 1
Class is: 1
Prediction is: [1] [[0.16626539 0.83373461]]

103
1953-Marjorie-Rawlings 1
Class is: 1
Prediction is: [0] [[0.56210026 0.43789974]]

104
1963-Sylvia-Plath 1
Class is: 1
Prediction is: [1] [[0.00333813 0.99666187]]

105
1982-Anna-Freud 1
Class is: 1
Prediction is: [1] [[0.14210091 0.85789909]]

1



Class is: 1
Prediction is: [1] [[1.86619781e-06 9.99998134e-01]]

112
1992-Marsha-P-Johnson 1
Class is: 1
Prediction is: [0] [[0.6290889 0.3709111]]

113
1951-Fanny-Brice 1
Class is: 1
Prediction is: [1] [[0.00346189 0.99653811]]

114
1989-Lucille-Ball 1
Class is: 1
Prediction is: [1] [[0.00987049 0.99012951]]

115
1969-Sonja-Henie 1
Class is: 1
Prediction is: [1] [[0.15626216 0.84373784]]

116
1941-Virginia-Woolf 1
Class is: 1
Prediction is: [1] [[0.07297939 0.92702061]]

117
1907-Qiu-Jin 1
Class is: 1
Prediction is: [1] [[1.37716616e-09 9.99999999e-01]]

118
1971-Florence-Blanchfield 1
Class is: 1
Prediction is: [1] [[0.20381989 0.79618011]]

119
1902-Elizabeth-Cady-Stanton 1
Class is: 1
Prediction is: [1] [[8.35831404e-11 1.00000000e+00]]

120
1965-Shirley-Jackson 1
Class is: 1
Prediction is: [1] [[0.00159949 0.99840051]]

121
1919-C-J-Walker 1
Class is: 1
Prediction is: [1] [[0.21866083 0.78133917]]

122
1949-Mitchell 1
Class is: 1
Prediction is: [1] [[0.36897068 0.63102932]]

123




Class is: 1
Prediction is: [1] [[1.26879872e-08 9.99999987e-01]]

127
1888-Louisa-M-Alcott 1
Class is: 1
Prediction is: [1] [[0.0012887 0.9987113]]

128
1986-Georgia-O-Keeffe 1
Class is: 1
Prediction is: [1] [[2.8957060e-07 9.9999971e-01]]

129
1978-Margaret-Mead 1
Class is: 1
Prediction is: [0] [[0.98317437 0.01682563]]

130
1982-Ingrid-Bergman 1
Class is: 1
Prediction is: [1] [[0.00357452 0.99642548]]

131
1977-Maria-Callas 1
Class is: 1
Prediction is: [0] [[0.6441851 0.3558149]]

132
1952-Eva-Peron 1
Class is: 1
Prediction is: [1] [[0.00140041 0.99859959]]

133
1961-Emily-Balch 1
Class is: 1
Prediction is: [1] [[0.00347989 0.99652011]]

134
1896-Harriet-Beecher-Stowe 1
Class is: 1
Prediction is: [1] [[5.24164089e-06 9.99994758e-01]]

135
1969-Madhubala 1
Class is: 1
Prediction is: [1] [[0.00246769 0.99753231]]

136
1950-Edna-St-V-Millay 1
Class is: 1
Prediction is: [0] [[0.60598603 0.39401397]]

137
1929-Marie-Curie 1
Class is: 1
Prediction is: [1] [[2.03009810e-05 9.99979699e-01]]




Class is: 1
Prediction is: [1] [[0.02882269 0.97117731]]

143
1966-Margaret-Sanger 1
Class is: 1
Prediction is: [1] [[2.41345323e-08 9.99999976e-01]]

144
1972-Mahalia-Jackson 1
Class is: 1
Prediction is: [1] [[7.18332261e-05 9.99928167e-01]]

145
1990-Greta-Garbo 1
Class is: 1
Prediction is: [1] [[5.52589308e-10 9.99999999e-01]]

146
1994-Jessica-Tandy 1
Class is: 1
Prediction is: [1] [[0.01936142 0.98063858]]

147
1969-Maureen-Connolly 1
Class is: 1
Prediction is: [1] [[8.12054050e-04 9.99187946e-01]]

148
1991-Peggy-Ashcroft 1
Class is: 1
Prediction is: [1] [[0.3352565 0.6647435]]

149
1946-Gertrude-Stein 1
Class is: 1
Prediction is: [1] [[0.41409909 0.58590091]]

150
1887-Emma-Lazarus 1
Class is: 1
Prediction is: [1] [[0.34720682 0.65279318]]

151
1977-Joan-Crawford 1
Class is: 1
Prediction is: [1] [[2.83018053e-12 1.00000000e+00]]

152
1906-Susan-B-Anthony 1
Class is: 1
Prediction is: [1] [[6.86398050e-10 9.99999999e-01]]

153
1954-Anne-O-Hare-McCormick 1
Class is: 1
Prediction is



Class is: 1
Prediction is: [1] [[0.22576056 0.77423944]]

160
1975-Haile-Selassie 1
Class is: 1
Prediction is: [0] [[0.9986789 0.0013211]]

161
1903-Emily-Warren-Roebling 1
Class is: 1
Prediction is: [1] [[1.36168726e-04 9.99863831e-01]]

162
1962-Eleanor-Roosevelt 1
Class is: 1
Prediction is: [1] [[2.30928175e-05 9.99976907e-01]]

163
1971-Coco-Chanel 1
Class is: 1
Prediction is: [0] [[0.63999875 0.36000125]]

164
1959-Billie-Holiday 1
Class is: 1
Prediction is: [1] [[0.11151153 0.88848847]]

165
1954-Frida-Kahlo 1
Class is: 1
Prediction is: [1] [[0.01190255 0.98809745]]

166
1927-Victoria-Martin 1
Class is: 1
Prediction is: [1] [[1.72254091e-05 9.99982775e-01]]

167
1959-Ethel-Barrymore 1
Class is: 1
Prediction is: [0] [[0.8974851 0.1025149]]

168
1999-Iris-Murdoch 1
Class is: 1
Prediction is: [1] [[0.00704369 0.99295631]]

169
1998-Bella-Abzug 1
Class is: 1
Prediction is: [1] [[4.48665549e-11 1.00000000e+00]]

170
1944-Ida-M-Tarbell 1
Class is: 1
Prediction is: [1] [[0.23693805 0.76



Class is: 1
Prediction is: [0] [[0.82612156 0.17387844]]

177
1984-Indira-Gandhi 1
Class is: 1
Prediction is: [1] [[2.05817611e-04 9.99794182e-01]]

178
1978-Golda-Meir 1
Class is: 1
Prediction is: [1] [[7.55112062e-06 9.99992449e-01]]

179
1974-Katharine-Cornell 1
Class is: 1
Prediction is: [1] [[0.00315108 0.99684892]]

180
1886-Mary-Ewing-Outerbridge 1
Class is: 1
Prediction is: [1] [[0.47597005 0.52402995]]

181
1910-Florence-Nightingale 1
Class is: 1
Prediction is: [1] [[3.12202620e-08 9.99999969e-01]]

182
1986-The-Challenger 1
Class is: 1
Prediction is: [1] [[0.07456909 0.92543091]]

183
1998-Galina-Ulanova 1
Class is: 1
Prediction is: [1] [[0.26420843 0.73579157]]

184
1987-Rita-Hayworth 1
Class is: 1
Prediction is: [1] [[0.2891711 0.7108289]]

185
1932-Florenz-Ziegfeld 1
Class is: 1
Prediction is: [1] [[0.06412618 0.93587382]]





How cool is this! For each obituary, we see who it's about, that person's gender (0 or 1), which gender the model thinks it's about, and with what probabilities.

What can you glean by glancing through?

ANSWER HERE
* *
* *
* *
* *
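The hand-written loop above makes each step visible. scikit-learn also offers helpers that perform the same kind of leave-one-out procedure in a couple of lines. Here is a sketch on made-up data; it is not the code this notebook ran, and for simplicity it uses `LogisticRegression` with its default settings rather than the l1 penalty.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Made-up data: one "word count" per document, low for class 0 and high for class 1
X_toy = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression()

# Each document is predicted by a model trained on the other five
toy_probs = cross_val_predict(toy_model, X_toy, y_toy,
                              cv=LeaveOneOut(), method="predict_proba")
toy_predicted = toy_probs.argmax(axis=1)  # column 0 is class 0, column 1 is class 1

print(toy_predicted)
print((toy_predicted == y_toy).mean())  # share of held-out documents classified correctly
```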

## Results

Remember, we've stored our results in our metadata dataframe. Let's take a look!

In [45]:
meta

Unnamed: 0,obit_title,gender,date,PROBS,PREDICTED
0,1936-John-W-Heisman,0,1936.0,[[0.79355842 0.20644158]],[0]
1,1971-Dean-Acheson,0,1971.0,[[9.99999779e-01 2.21229604e-07]],[0]
2,1970-Edouard-Daladier,0,1946.0,[[0.98035948 0.01964052]],[0]
3,1995-Jonas-Salk,0,1995.0,[[9.99988000e-01 1.19995998e-05]],[0]
4,1988-John-Houseman,0,1988.0,[[0.39230848 0.60769152]],[1]
...,...,...,...,...,...
181,1910-Florence-Nightingale,1,1854.0,[[3.12202620e-08 9.99999969e-01]],[1]
182,1986-The-Challenger,1,1986.0,[[0.07456909 0.92543091]],[1]
183,1998-Galina-Ulanova,1,1998.0,[[0.26420843 0.73579157]],[1]
184,1987-Rita-Hayworth,1,1987.0,[[0.2891711 0.7108289]],[1]


There's lots to look at here. We could explore probabilities: which obituaries is the model most sure about? Which are closest to 50-50? Which does it get most right and most wrong? Is there a pattern to misclassified obituaries?
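One way to start on those questions, sketched with made-up numbers: store the probability of class 1 as a plain number and measure how far it sits from 0.5. The column names `prob_woman`, `confidence`, and `predicted` are hypothetical. In our loop, `prediction[0][1]` would be the corresponding probability.

```python
import pandas as pd

# Made-up results: true class and the model's probability of class 1
toy_results = pd.DataFrame({
    "obit_title": ["A", "B", "C", "D"],
    "gender": [0, 0, 1, 1],
    "prob_woman": [0.02, 0.61, 0.52, 0.99],
})

# Distance from 0.5: small values are near coin flips, large values are confident calls
toy_results["confidence"] = (toy_results["prob_woman"] - 0.5).abs()
least_sure = toy_results.sort_values("confidence").head(2)

# Misclassified rows: the predicted class differs from the label
toy_results["predicted"] = (toy_results["prob_woman"] >= 0.5).astype(int)
wrong = toy_results[toy_results["predicted"] != toy_results["gender"]]

print(least_sure[["obit_title", "prob_woman"]])
print(wrong[["obit_title", "prob_woman"]])
```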

For now, we just want to calculate its accuracy. Let's get rid of those brackets in the PREDICTED column.

In [46]:
# unwrap the one-element arrays so PREDICTED holds plain integers
meta['PREDICTED'] = meta['PREDICTED'].apply(lambda p: int(np.ravel(p)[0]))
meta

Unnamed: 0,obit_title,gender,date,PROBS,PREDICTED
0,1936-John-W-Heisman,0,1936.0,[[0.79355842 0.20644158]],0
1,1971-Dean-Acheson,0,1971.0,[[9.99999779e-01 2.21229604e-07]],0
2,1970-Edouard-Daladier,0,1946.0,[[0.98035948 0.01964052]],0
3,1995-Jonas-Salk,0,1995.0,[[9.99988000e-01 1.19995998e-05]],0
4,1988-John-Houseman,0,1988.0,[[0.39230848 0.60769152]],1
...,...,...,...,...,...
181,1910-Florence-Nightingale,1,1854.0,[[3.12202620e-08 9.99999969e-01]],1
182,1986-The-Challenger,1,1986.0,[[0.07456909 0.92543091]],1
183,1998-Galina-Ulanova,1,1998.0,[[0.26420843 0.73579157]],1
184,1987-Rita-Hayworth,1,1987.0,[[0.2891711 0.7108289]],1


### Result column

Now we can add a 'RESULT' column that is the result of subtracting the predicted gender from the actual gender.

* 0 means the model was correct.
* -1 means the model mistook a man for a woman.
* 1 means the model mistook a woman for a man.

In [47]:
# actual class minus predicted class
meta['RESULT'] = meta['gender'] - meta['PREDICTED']
meta

Unnamed: 0,obit_title,gender,date,PROBS,PREDICTED,RESULT
0,1936-John-W-Heisman,0,1936.0,[[0.79355842 0.20644158]],0,0
1,1971-Dean-Acheson,0,1971.0,[[9.99999779e-01 2.21229604e-07]],0,0
2,1970-Edouard-Daladier,0,1946.0,[[0.98035948 0.01964052]],0,0
3,1995-Jonas-Salk,0,1995.0,[[9.99988000e-01 1.19995998e-05]],0,0
4,1988-John-Houseman,0,1988.0,[[0.39230848 0.60769152]],1,-1
...,...,...,...,...,...,...
181,1910-Florence-Nightingale,1,1854.0,[[3.12202620e-08 9.99999969e-01]],1,0
182,1986-The-Challenger,1,1986.0,[[0.07456909 0.92543091]],1,0
183,1998-Galina-Ulanova,1,1998.0,[[0.26420843 0.73579157]],1,0
184,1987-Rita-Hayworth,1,1987.0,[[0.2891711 0.7108289]],1,0


Let's look at the accurate guesses.

In [48]:
meta_correct = meta[meta['RESULT'] == 0]
meta_correct

Unnamed: 0,obit_title,gender,date,PROBS,PREDICTED,RESULT
0,1936-John-W-Heisman,0,1936.0,[[0.79355842 0.20644158]],0,0
1,1971-Dean-Acheson,0,1971.0,[[9.99999779e-01 2.21229604e-07]],0,0
2,1970-Edouard-Daladier,0,1946.0,[[0.98035948 0.01964052]],0,0
3,1995-Jonas-Salk,0,1995.0,[[9.99988000e-01 1.19995998e-05]],0,0
5,1968-Yuri-Gagarin,0,1968.0,[[0.80787114 0.19212886]],0,0
...,...,...,...,...,...,...
181,1910-Florence-Nightingale,1,1854.0,[[3.12202620e-08 9.99999969e-01]],1,0
182,1986-The-Challenger,1,1986.0,[[0.07456909 0.92543091]],1,0
183,1998-Galina-Ulanova,1,1998.0,[[0.26420843 0.73579157]],1,0
184,1987-Rita-Hayworth,1,1987.0,[[0.2891711 0.7108289]],1,0


How many did the model get correct?

We can calculate its accuracy by dividing the correct number by the total.

In [49]:
# divide here

Pretty good rate! Because our two classes are the same size, a model guessing at random would be correct about 50% of the time. It does **much** better than that!
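If you'd like to check your arithmetic, here is the same calculation on a made-up RESULT column (not our real results). `value_counts()` also shows how the errors split between the two directions.

```python
import pandas as pd

# Made-up RESULT values: 0 = correct, -1 = man mistaken for a woman, 1 = woman mistaken for a man
toy_meta = pd.DataFrame({"RESULT": [0, 0, -1, 0, 1, 0, 0, 0]})

toy_correct = (toy_meta["RESULT"] == 0).sum()  # number of correct predictions
toy_total = len(toy_meta)                      # number of predictions overall
toy_accuracy = toy_correct / toy_total

print(toy_accuracy)
print(toy_meta["RESULT"].value_counts())
```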

## P-values and weights

In [56]:
canonic_c = 1.0  # the C value for the logistic regression

def Ztest(vec1, vec2):
    # two-sample z-test: returns the z statistic and the two-sided p-value

    X1, X2 = np.mean(vec1), np.mean(vec2)
    sd1, sd2 = np.std(vec1), np.std(vec2)
    n1, n2 = len(vec1), len(vec2)

    pooledSE = np.sqrt(sd1**2/n1 + sd2**2/n2)
    z = (X1 - X2)/pooledSE
    pval = 2*(norm.sf(abs(z)))

    return z, pval

def feat_pval_weight(meta_df_, dtm_df_):
    # returns a dataframe with a p-value and a model weight for each word

    # split the dtm into rows about men (0) and rows about women (1)
    dtm0 = dtm_df_.loc[meta_df_[meta_df_['gender']==0].index.tolist()].to_numpy()
    dtm1 = dtm_df_.loc[meta_df_[meta_df_['gender']==1].index.tolist()].to_numpy()

    # one p-value per word: do its counts differ between the two groups?
    pvals = [Ztest(dtm0[ : ,i], dtm1[ : ,i])[1] for i in range(dtm_df_.shape[1])]
    # liblinear is named explicitly because it supports the l1 penalty
    clf = LogisticRegression(penalty = 'l1', C = canonic_c, class_weight = 'balanced', solver = 'liblinear')
    # the target is True for men, so positive weights point toward men
    clf.fit(dtm_df_, meta_df_['gender']==0)
    weights = clf.coef_[0]

    feature_df = pd.DataFrame()

    feature_df['FEAT'] = dtm_df_.columns
    feature_df['P_VALUE'] = pvals
    feature_df['LR_WEIGHT'] = weights

    return feature_df

# Bonferroni correction: divide 0.05 by the number of words tested
sig_thresh = 0.05 / len(df.columns)

feat_df = feat_pval_weight(meta, df)

feat_df.to_csv('../docs/features_obits.csv')
# keep significant words, sort by weight, and drop words the model weighted at zero
out = feat_df[(feat_df['P_VALUE'] <= sig_thresh)].sort_values('LR_WEIGHT', ascending = True)
out = out[out['LR_WEIGHT'] != 0]
outM = out[out['LR_WEIGHT'] > 0]  # positive weights: men
outW = out[out['LR_WEIGHT'] < 0]  # negative weights: women

outM = outM['FEAT'].tolist()
print("Here are significant words that distinguish men: " + str(outM))
outW = outW['FEAT'].tolist()
print("Here are significant words that distinguish women: " + str(outW))

Here are significant words that distinguish men: []
Here are significant words that distinguish women: ['woman', 'women', 'husband']
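As we read it, the cell above does two things for every word: it runs a z-test on the word's counts in the two groups, and it compares the resulting p-value to 0.05 divided by the number of words (a Bonferroni-style correction for testing thousands of words at once). Here is that arithmetic for a single made-up word; the counts are invented, and 2985 is the vocabulary size reported earlier in this notebook.

```python
import numpy as np
from scipy.stats import norm

# Made-up counts of one word in five obituaries from each group
counts_group0 = np.array([0, 1, 0, 1, 0])
counts_group1 = np.array([3, 4, 3, 5, 4])

# Same arithmetic as Ztest: difference in means over the combined standard error
se = np.sqrt(counts_group0.std() ** 2 / len(counts_group0)
             + counts_group1.std() ** 2 / len(counts_group1))
z = (counts_group0.mean() - counts_group1.mean()) / se
pval = 2 * norm.sf(abs(z))  # two-sided p-value

# Corrected threshold for a vocabulary of 2985 words
n_words = 2985
threshold = 0.05 / n_words

print(z, pval, pval <= threshold)
```

A word only makes the final lists if it clears this much stricter threshold and also receives a nonzero weight from the model, which helps explain why the lists are so short.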


