# NLP Modeling

The "big idea" of modeling is to determine what a document is about: which words are important and which are not. The main task is to assign each word a weight relative to the document.


## What
- Introducing our _Dramatis Personae_, the characters in our play:
    - `Term frequency` is a direct way to measure what a document is about, but it over-emphasizes common terms. Consider term frequency the baseline, much like how mean, median, or mode can be baselines. It's at least somewhere to start, even if it's a blunt tool with some issues.
    - `TF` = the number of times a word occurs divided by the total number of words in the document.
    - `Bag of words` is a representation of a document as a vector, where the values indicate word frequency.
        ```
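        import pandas as pd

        string = "Mary had a little lamb, little lamb, little lamb."
        # strip punctuation so "lamb," and "lamb." both count as "lamb"
        string = string.replace(",", "").replace(".", "")
        words = string.split()
        bag_of_words = pd.Series(words).value_counts()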
        ```
    - Word clouds are a visual bag of words, with larger font sizes representing higher term frequency.
    - Inverse Document Frequency, `IDF`, tells us how much information a word provides. 
        - A higher IDF means that a word provides more information. That is, it is more relevant within a single document.
        - As the number of documents that a word appears in increases, the IDF value decreases.
        - Example: if "Codeup" appears frequently in every document in a list of documents, then the word doesn't add much new information about any individual document.
        - Example: if "scholarship" shows up a whole bunch in one or two documents, but not frequently across the corpus of documents, then we can conclude that the word conveys more meaning.

    - `TF-IDF` is the product `tf * idf`: a word scores high when it is frequent in a document but rare across the corpus.
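The definitions above can be sketched in plain Python. This is a minimal illustration, assuming a made-up three-document corpus and the textbook form `idf = log(N / df)`. Libraries such as scikit-learn use a smoothed variant, so their numbers will differ. The helper names `tf`, `idf`, and `tf_idf` are hypothetical.

```python
import math

# Hypothetical toy corpus, invented for illustration
documents = {
    "doc1": "codeup teaches data science",
    "doc2": "codeup offers a scholarship scholarship scholarship",
    "doc3": "codeup students study python",
}

def tf(word, document):
    """Term frequency: occurrences of the word / total words in the document."""
    words = document.split()
    return words.count(word) / len(words)

def idf(word, documents):
    """Textbook IDF: log(number of documents / documents containing the word).

    Assumes the word appears in at least one document.
    """
    n_containing = sum(1 for doc in documents.values() if word in doc.split())
    return math.log(len(documents) / n_containing)

def tf_idf(word, document, documents):
    return tf(word, document) * idf(word, documents)

for word in ["codeup", "scholarship"]:
    score = tf_idf(word, documents["doc2"], documents)
    print(f"{word}: {score:.3f}")
# codeup: 0.000
# scholarship: 0.549
```

"codeup" appears in every document, so its IDF is `log(3/3) = 0` and it contributes nothing. "scholarship" appears in only one document, so it gets a high weight there.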


## So What?
- Determining what a document is about is both valuable and challenging.
- Term frequency is super sensitive to noise.
- TF-IDF is super common and has been used in the majority of text-based recommendation systems. See [tf-idf in Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

## Now What?


- tf-idf is the product of tf * idf

In [1]:
# re, unicodedata, and nltk are needed by the clean() function below
import re
import unicodedata

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

In [2]:
# Grab the 15 characters with the most lines
df = pd.read_csv("tng.txt")

top_15_characters = df.character.value_counts().index[0:15]

top_15 = df[df.character.isin(top_15_characters)]
top_15

| Unnamed: 0 | episode_name | line | character |
|---|---|---|---|
| 0 | Encounter at Farpoint | Difficult? Simply solve the mystery of Farpoi... | DATA |
| 1 | Encounter at Farpoint | As simple as that. | PICARD |
| 2 | Encounter at Farpoint | Farpoint Station. Even the name sounds myster... | TROI |
| 3 | Encounter at Farpoint | It's hardly simple, Data, to negotiate a frie... | PICARD |
| 4 | Encounter at Farpoint | Inquiry. The word snoop? | DATA |
| ... | ... | ... | ... |
| 51983 | All Good Things | Of course. Have a seat. | RIKER |
| 51984 | All Good Things | Would you care to deal, sir? | DATA |
| 51985 | All Good Things | Oh, er, thank you, Mister Data. Actually, I u... | PICARD |
| 51986 | All Good Things | You were always welcome. | TROI |


In [4]:
ADDITIONAL_STOPWORDS = ['r', 'u', '2', 'ltgt']  # ltgt is an HTML artifact

def clean(text):
    """A simple function to clean up text data."""
    # Requires the NLTK 'stopwords' and 'wordnet' corpora (see nltk.download)
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = set(nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS)
    # Normalize unicode to plain ASCII and lowercase everything
    text = (unicodedata.normalize('NFKD', text)
             .encode('ascii', 'ignore')
             .decode('utf-8', 'ignore')
             .lower())
    # Drop punctuation, then lemmatize every word that isn't a stopword
    words = re.sub(r'[^\w\s]', '', text).split()
    return " ".join([wnl.lemmatize(word) for word in words if word not in stopwords])

In [5]:
# We'll use this split function later to create in-sample and out-of-sample datasets for modeling
def split(df, stratify_by=None):
    """
    3-way split into train, validate, and test datasets.
    To stratify, pass in a column name.
    """
    # df[None] would raise a KeyError, so only index when a column name is given
    stratify = df[stratify_by] if stratify_by else None
    train, test = train_test_split(df, test_size=.2, random_state=123, stratify=stratify)

    stratify = train[stratify_by] if stratify_by else None
    train, validate = train_test_split(train, test_size=.3, random_state=123, stratify=stratify)

    return train, validate, test
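To see what proportions this two-step split produces, here is a small sketch on a made-up DataFrame (the `toy` frame and its `label` column are hypothetical). Holding out 20% for test and then 30% of the remainder for validate leaves roughly 56% / 24% / 20%.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, two balanced classes
toy = pd.DataFrame({"x": range(100), "label": ["a", "b"] * 50})

# Same two-step split as above: 20% test, then 30% of the rest for validate
train_and_validate, test = train_test_split(
    toy, test_size=.2, random_state=123, stratify=toy["label"]
)
train, validate = train_test_split(
    train_and_validate, test_size=.3, random_state=123,
    stratify=train_and_validate["label"],
)

print(len(train), len(validate), len(test))
# 56 24 20
```

Stratifying on the label keeps the class balance the same in every split, which matters when some characters have far fewer lines than others.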

- End goal: predict which character said a given line.

In [6]:
train, validate, test = split(top_15, 'character')
train.head()

| Unnamed: 0 | episode_name | line | character |
|---|---|---|---|
| 43382 | Second Chances | The transporters are considerably more effici... | DATA |
| 25169 | The Loss | I look around me and all I see are surfaces w... | TROI |
| 5207 | Coming Of Age | All right. \nCaptain's log, stardate 41416.2.... | WESLEY |
| 40494 | Ship in a Bottle | My ship is in danger. It is imperative that I... | PICARD |
| 7290 | Conspiracy | Three, sir. All gathered inside what appears ... | DATA |


In [None]:
# Set up our X variables
X_train = train.line
X_validate = validate.line
X_test = test.line

In [None]:
# Set up our y variables
y_train = train.character
y_validate = validate.character
y_test = test.character

In [None]:
# At this point X is still all raw text
X_train.head()

In [None]:
# Similar to one-hot encoding: produces a matrix with one row per line
# Not a scaler, but it works basically like an encoder

# Create the tfidf vectorizer object
tfidf = TfidfVectorizer()

# Fit on the training data only
tfidf.fit(X_train)

# Use the fitted object to transform each split
X_train_vectorized = tfidf.transform(X_train)
X_validate_vectorized = tfidf.transform(X_validate)
X_test_vectorized = tfidf.transform(X_test)

In [None]:
# Sparse vectors/matrices are arrays that are mostly zeros
# Every single word gets a column, which is why there are so many zeros
# Convert to dense to see what it looks like (this can use a lot of memory on a large corpus)
X_train_vectorized.todense()

In [None]:
# Something new, but actually something old:
# now that we have a vectorized dataset, we can use our classification tools
# Remember, we are trying to predict a discrete outcome,
# so we had to turn our words into numbers
# We want logistic regression because it gives us a probability for each class
lm = LogisticRegression()
# Fit the classification model on our vectorized train data
lm.fit(X_train_vectorized, y_train)

In [None]:
# Reuse the names train/validate/test for frames that hold actual vs. predicted labels
train = pd.DataFrame(dict(actual=y_train))
validate = pd.DataFrame(dict(actual=y_validate))
test = pd.DataFrame(dict(actual=y_test))

In [None]:
# Use the trained model to predict y given the vectorized X inputs
train['predicted'] = lm.predict(X_train_vectorized)
validate["predicted"] = lm.predict(X_validate_vectorized)
test['predicted'] = lm.predict(X_test_vectorized)

In [None]:
# Train Accuracy
(train.actual == train.predicted).mean()

In [None]:
# Out-of-sample accuracy
(validate.actual == validate.predicted).mean()

In [None]:
# Now that we have a trained model,
# let's use our model to predict the character of any given line

lines = pd.Series([
    "we have a responsibility", 
    "set phasers to stun", 
    "the warp drive is about to go critical", 
    "What does it mean to be human? I cannot calculate feelings", 
    "Romulan bird of prey decloaking off the port bow"
])

# Clean and lemmatize our inputs
# (note: the vectorizer above was fit on uncleaned lines)
lines = lines.apply(clean)
X = tfidf.transform(lines)
X

In [None]:
lm.predict(X)
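Since logistic regression was chosen for its probabilities, `predict_proba` can show how confident the model is in each class rather than just the top pick. A minimal sketch on made-up lines and speakers (the training data here is hypothetical and far too small to mean anything; `toy_tfidf` and `toy_lm` are names used only for this sketch):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training lines, invented for illustration
toy_lines = pd.Series([
    "make it so number one",
    "engage",
    "i am an android i do not feel emotion",
    "inquiry what is the meaning of that phrase",
])
toy_speakers = pd.Series(["PICARD", "PICARD", "DATA", "DATA"])

toy_tfidf = TfidfVectorizer()
toy_lm = LogisticRegression()
toy_lm.fit(toy_tfidf.fit_transform(toy_lines), toy_speakers)

# One row per new line, one column per class; each row sums to 1
new_lines = pd.Series(["inquiry what is emotion", "engage number one"])
probabilities = pd.DataFrame(
    toy_lm.predict_proba(toy_tfidf.transform(new_lines)),
    columns=toy_lm.classes_,
    index=new_lines,
)
print(probabilities.round(2))
```

The column order comes from `classes_`, so the frame is labeled by character name.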

In [None]:
wesley = train[train.actual == "WESLEY"]

In [None]:
# Accuracy on WESLEY's lines only (train)
(wesley.actual == wesley.predicted).mean()

In [None]:
data = train[train.actual == "DATA"]

In [None]:
(data.actual == data.predicted).mean()

In [None]:
#pull in classification report
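The per-character checks above can be had in one call with `classification_report`, which was imported at the top. A small sketch on made-up labels (the `results` frame is hypothetical; in the notebook the `actual` and `predicted` columns of `train` or `validate` would take its place):

```python
import pandas as pd
from sklearn.metrics import classification_report, accuracy_score

# Hypothetical results frame with the same two columns used above
results = pd.DataFrame({
    "actual":    ["DATA", "DATA", "PICARD", "PICARD", "WESLEY", "WESLEY"],
    "predicted": ["DATA", "PICARD", "PICARD", "PICARD", "WESLEY", "DATA"],
})

# Precision, recall, and f1 for every character in one table
print(classification_report(results.actual, results.predicted))

# Same number as (results.actual == results.predicted).mean()
overall = accuracy_score(results.actual, results.predicted)

# Per-character recall is the quantity computed by hand for WESLEY and DATA
report = classification_report(results.actual, results.predicted, output_dict=True)
data_recall = report["DATA"]["recall"]
```

Filtering to one character's actual lines and taking the mean of correct predictions, as done for WESLEY and DATA, is that character's recall. The report adds precision, which shows how often a prediction of that character is right.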
