In [1]:
# read in some helpful libraries
import nltk # the natural language toolkit, open-source NLP
import pandas as pd # dataframes

### Read in the data

# read our data into a dataframe
texts = pd.read_csv(r"C:\Users\piush\Desktop\chatbot\sentences.csv")

# look at the first few rows
texts.head()

Unnamed: 0,SENTENCE,CLASS
0,I took my medicine,Yes
1,I have not taken my remedy,No
2,No never I will not have it,No
3,I was never given my medicine for my stomach ache,No
4,I had my medicine,Yes


###### Find out how often each word shows up in each class
A lot of NLP applications rely on counting how often certain words are used. (The fancy term for this is "word frequency".) Let's look at the word frequency for each class (the "Yes" and "No" labels in the CLASS column) in our dataset. NLTK has lots of nice built-in functions and data structures for this that we can make use of.
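To make "word frequency" concrete, here's a minimal sketch using only the standard library. It computes a relative frequency (count of a word divided by the total number of tokens), which is the same quantity `FreqDist.freq()` reports. The token list and the helper `relative_freq` are made up for illustration and are not taken from sentences.csv.

```python
from collections import Counter

# Toy token list (made up for illustration, not from sentences.csv)
tokens = ["i", "took", "my", "medicine", "i", "had", "my", "medicine"]

# count how many times each token occurs
counts = Counter(tokens)
total = sum(counts.values())

def relative_freq(word):
    """Share of all tokens that are `word` (0.0 if it never occurs)."""
    return counts[word] / total

print(relative_freq("medicine"))  # 2 of the 8 tokens
print(relative_freq("remedy"))    # never seen in the toy list
```

Note that a word that never occurs gets a frequency of exactly 0, which matters later when we multiply frequencies together.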

In [3]:
### Split data

# split the data by class (the "Yes"/"No" label)
byAuthor = texts.groupby("CLASS")

### Tokenize (split into individual words) our text

# word frequency by class
wordFreqByAuthor = nltk.probability.ConditionalFreqDist()

# for each class...
for name, group in byAuthor:
    # get all of the sentences with that label and collapse them into a
    # single long string
    sentences = group['SENTENCE'].str.cat(sep = ' ')
    
    # convert everything to lower case (so "The" and "the" get counted as 
    # the same word rather than two different words)
    sentences = sentences.lower()
    
    # split the text into individual tokens    
    tokens = nltk.tokenize.word_tokenize(sentences)
    
    # calculate the frequency of each token
    frequency = nltk.FreqDist(tokens)

    # add the frequencies for each class to our dictionary
    wordFreqByAuthor[name] = (frequency)
    
# now we have a dictionary where each entry is the frequency distribution
# of words for a specific class.

Now we can look at how often specific words show up in each class. Since our sentences are about taking medicine, how about "medication", "medicine" and "remedy"?

In [4]:
# see how often each class uses "medication"
for i in wordFreqByAuthor.keys():
    print("medication: " + i)
    print(wordFreqByAuthor[i].freq('medication'))

# print a blank line
print()

# see how often each class uses "medicine"
for i in wordFreqByAuthor.keys():
    print("medicine: " + i)
    print(wordFreqByAuthor[i].freq('medicine'))
    
# print a blank line
print()

# see how often each class uses "remedy"
for i in wordFreqByAuthor.keys():
    print("remedy: " + i)
    print(wordFreqByAuthor[i].freq('remedy'))

medication: No
0.020202020202020204
medication: Yes
0.0189873417721519

medicine: No
0.06060606060606061
medicine: Yes
0.056962025316455694

remedy: No
0.015151515151515152
remedy: Yes
0.012658227848101266


###### Use word frequency to guess which class a sentence belongs to
The general idea is that different people tend to use different words more or less often. (I had a beloved college professor who was especially fond of "gestalt".) If you're not sure who said something but it has a lot of words one person uses a lot in it, then you might guess that they were the one who wrote it. The same logic works for our classes: if a sentence is full of words that show up a lot in "No" sentences, then "No" is a reasonable guess.

Let's use this general principle to guess which class the sentence "I want to take my medication." most likely belongs to.
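Written out, the score that the next cell computes for each class looks roughly like this. The symbols are my own notation rather than anything from NLTK: $f_c(w)$ is the relative frequency of word $w$ among the tokens of class $c$, and $\varepsilon$ is the small smoothing constant.

```latex
\mathrm{score}(c) = \prod_{w \in \text{sentence}} \bigl( f_c(w) + \varepsilon \bigr),
\qquad \varepsilon = 10^{-6},
\qquad \hat{c} = \arg\max_{c} \, \mathrm{score}(c)
```

Without the $\varepsilon$ term, a single word that never appears in a class would make that class's whole product zero.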

In [5]:
# One way to guess the class is to use the joint probability that each
# class used each word in a given sentence.

# first, let's start with a test sentence
testSentence = "I want to take my medication."

# and then lowercase & tokenize our test sentence
preProcessedTestSentence = nltk.tokenize.word_tokenize(testSentence.lower())

# collect one row per (class, word) pair; DataFrame.append was removed in
# pandas 2.0, so we build the dataframe from a list of rows instead
rows = []

# For each class...
for i in wordFreqByAuthor.keys():
    # for each word in our test sentence...
    for j in preProcessedTestSentence:
        # find out how frequently that word shows up in this class
        wordFreq = wordFreqByAuthor[i].freq(j)
        # and add a very small amount to every prob. so none of them are 0
        smoothedWordFreq = wordFreq + 0.000001
        # add the class, word and smoothed freq. to our rows
        rows.append([i, j, smoothedWordFreq])

testProbailities = pd.DataFrame(rows, columns = ['response','word','probability'])

# now get the joint probability for each class by multiplying the
# smoothed frequencies of all the words in the test sentence
jointRows = []
for i in wordFreqByAuthor.keys():
    oneAuthor = testProbailities[testProbailities['response'] == i]
    jointProbability = oneAuthor['probability'].prod()
    jointRows.append([i, jointProbability])

testProbailitiesByAuthor = pd.DataFrame(jointRows, columns = ['response','jointProbability'])

# and our winner is...
testProbailitiesByAuthor.loc[testProbailitiesByAuthor['jointProbability'].idxmax(),'response']

'No'

So based on what we've seen in our training data, it looks like, of our two classes, "No" is the more likely label for the sentence "I want to take my medication."
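One practical wrinkle: multiplying many small probabilities can underflow to zero on longer sentences. A common workaround is to add up log-probabilities instead, which picks the same winner because the logarithm preserves ordering. Here's a minimal sketch in plain Python. The counts in `toy_counts` are made up for illustration, and the helper `log_score` is hypothetical, not part of NLTK or of the cells above.

```python
import math

# Hypothetical per-class token counts (made up for illustration)
toy_counts = {
    "Yes": {"i": 3, "took": 2, "my": 3, "medicine": 2},
    "No": {"i": 2, "not": 2, "my": 2, "medicine": 1, "never": 1},
}

SMOOTHING = 0.000001  # same small constant used in the joint-probability cell

def log_score(tokens, counts):
    """Sum of log(smoothed relative frequency) over the given tokens."""
    total = sum(counts.values())
    return sum(math.log(counts.get(t, 0) / total + SMOOTHING) for t in tokens)

test_tokens = ["i", "took", "my", "medicine"]

# one log-score per class; the highest score is the best guess
scores = {label: log_score(test_tokens, c) for label, c in toy_counts.items()}
best = max(scores, key=scores.get)
print(best)
```

In this toy data "took" never appears under "No", so that class's score is dragged far down by the smoothing constant, but it stays finite instead of collapsing to zero.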