# Intro to Natural Language Processing

So far we've covered three ways to extract features from text data:

- Count Vectorizer
- Hashing Vectorizer
- TF-IDF

Now we'll go over the implementation in code.

Each method has two steps: a fit and a transform. The **fit** teaches the vectorizer the vocabulary; the **transform** applies that vocabulary to the selected text. (The hashing vectorizer is the exception: it stores no vocabulary, so its fit step does nothing.)

These models are all 'bag of words' models. What does this mean?
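
One way to see what 'bag of words' means is to build the counts by hand. The sketch below is a simplified stand-in written with the standard library only, not scikit-learn's implementation; the helper names `tokenize`, `fit` and `transform` are made up for illustration. It shows that two sentences with opposite meanings get identical feature rows, because word order is thrown away.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def fit(docs):
    # "Fit": learn the vocabulary, a sorted list of the unique tokens.
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def transform(docs, vocabulary):
    # "Transform": count how often each vocabulary word occurs per document.
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[word] for word in vocabulary])
    return rows

docs = ["the dog bit the man", "the man bit the dog"]
vocab = fit(docs)
matrix = transform(docs, vocab)
print(vocab)   # ['bit', 'dog', 'man', 'the']
print(matrix)  # [[1, 1, 1, 2], [1, 1, 1, 2]]
```

Both rows are the same, so a model trained on these features cannot tell who bit whom.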


In [1]:
#let's load some sample data:

spam = 'Hello,\nI saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.'

ham = 'Hello,\nI am writing in regards to your application to the position of Data Scientist at Hooli X. We are pleased to inform you that you passed the first round of interviews and we would like to invite you for an on-site interview with our Senior Data Scientist Mr. John Smith. You will find attached to this message further information on date, time and location of the interview. Please let me know if I can be of any further assistance. Best Regards.'


In [2]:
# Function-call form of print runs on both Python 2 and Python 3
print(spam)
print('')
print(ham)

Hello,
I saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.

Hello,
I am writing in regards to your application to the position of Data Scientist at Hooli X. We are pleased to inform you that you passed the first round of interviews and we would like to invite you for an on-site interview with our Senior Data Scientist Mr. John Smith. You will find attached to this message further information on date, time and location of the interview. Please let me know if I can be of any further assistance. Best Regards.


In [3]:
# Let's apply a count vectorizer:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

#instantiate our model
cvec = CountVectorizer()

#fit the count vectorizer to the data. This 'teaches' the count vectorizer the dictionary.
cvec.fit([spam])

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [5]:
# To generate features we have to transform the data after we fit the count vectorizer.
cvecdata = cvec.transform([spam])

In [6]:
# Now we can turn our features into a dataframe.
# On scikit-learn 1.0 and later, use cvec.get_feature_names_out() instead.

df = pd.DataFrame(cvecdata.todense(),
                  columns=cvec.get_feature_names())

In [7]:
df

Unnamed: 0,00,000,750,86,ago,am,an,and,be,board,...,to,valery,was,week,why,will,with,years,you,your
0,1,1,1,1,1,2,2,3,1,1,...,2,1,1,1,1,2,2,2,2,2


## Hashing Vectorizer

In [8]:
hvec = HashingVectorizer()

In [9]:
#your turn

In [45]:
df  = pd.DataFrame(hvec.transform([spam]).todense())
df.transpose().sort_values(0, ascending=False).head(10).transpose()

Unnamed: 0,479532,144749,174171,832412,828689,994433,1005907,170062,675997,959146
0,0.338062,0.169031,0.169031,0.169031,0.169031,0.169031,0.169031,0.169031,0.169031,0.084515
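
The column labels above are hash buckets rather than words. To make the idea concrete, here is a minimal sketch of the hashing trick using only the standard library. It is an illustration under assumptions, not what scikit-learn does internally: `hash_vectorize` is a hypothetical helper, it uses MD5 where scikit-learn uses a different hash function (MurmurHash3), and it keeps raw counts in 16 columns, whereas `HashingVectorizer` by default uses 2**20 columns and L2-normalises each row (which is why the values above are fractions, not counts).

```python
import hashlib

def hash_vectorize(tokens, n_features=16):
    # Map each token to a column by hashing it; no vocabulary is stored.
    vector = [0] * n_features
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).hexdigest()
        column = int(digest, 16) % n_features
        vector[column] += 1
    return vector

tokens = "i am in contact with you i am 86 years old".split()
vec = hash_vectorize(tokens)
print(vec)
```

Because nothing is learned, there is no fit step to speak of. The price is that we cannot map a column back to a word, and two different words can land in the same column.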


## TF-IDF (Term Frequency-Inverse Document Frequency)

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

tvec = TfidfVectorizer(stop_words='english')
tvec.fit([spam, ham])

TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [12]:
df  = pd.DataFrame(tvec.transform([spam, ham]).todense(),
                   columns=tvec.get_feature_names(),
                   index=['spam', 'ham'])

In [13]:
df

Unnamed: 0,00,000,750,86,ago,application,assistance,attached,best,board,...,seven,site,smith,sum,thousand,time,valery,week,writing,years
spam,0.145067,0.145067,0.145067,0.145067,0.145067,0.0,0.0,0.0,0.0,0.145067,...,0.145067,0.0,0.0,0.145067,0.145067,0.0,0.145067,0.145067,0.0,0.290133
ham,0.0,0.0,0.0,0.0,0.0,0.155195,0.155195,0.155195,0.155195,0.0,...,0.0,0.155195,0.155195,0.0,0.0,0.155195,0.0,0.0,0.155195,0.0
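
To see where numbers like these come from, the sketch below computes tf-idf by hand on a tiny made-up corpus. It assumes the smoothed formula that matches the `smooth_idf=True` and `norm=u'l2'` settings shown in the output above, namely idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row. The function name `tfidf` and the token lists are invented for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf, as if one extra document contained every term.
    idf = {t: math.log((1.0 + n_docs) / (1.0 + d)) + 1.0 for t, d in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * idf[t] for t in tf}
        # L2-normalise so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = [["years", "cancer", "years", "information"],
        ["interview", "information", "interview", "data"]]
rows = tfidf(docs)
for row in rows:
    print(sorted(row.items()))
```

In the first document "years" occurs twice and "cancer" once, and both appear only in that document, so "years" gets exactly twice the weight. That is the same pattern as `years` (0.290133) versus the other spam-only words (0.145067) in the table above. "information" appears in both documents, so it is weighted lower than "cancer".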


### What are some shortcomings of the way we've created features from words so far?

## Lemmatization


### Question: How would we adjust to accommodate different words with the same or similar meanings? For example:

- is, are, am
- wolf, wolves
- dance, dancing

Our current process will count those words separately, when we might not necessarily want it to.
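
To make the problem concrete, here is a toy sketch of what collapsing word forms does to the counts. The lookup table `LEMMAS` and the helper `toy_lemmatize` are hand-written and hypothetical, purely for illustration; a real lemmatizer uses a full dictionary.

```python
from collections import Counter

# Hypothetical, hand-written lookup table for illustration only.
LEMMAS = {"is": "be", "are": "be", "am": "be",
          "wolves": "wolf", "dancing": "dance"}

def toy_lemmatize(token):
    # Fall back to the token itself when it is not in the table.
    return LEMMAS.get(token, token)

tokens = ["wolf", "wolves", "dance", "dancing", "is", "are", "am"]
raw_counts = Counter(tokens)
lemma_counts = Counter(toy_lemmatize(t) for t in tokens)
print(sorted(raw_counts.items()))
print(sorted(lemma_counts.items()))
```

Without the mapping each form is its own feature with a count of one; with it, the seven tokens collapse into three features.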

We can use a library called NLTK, the Natural Language Toolkit, to preprocess our data.


In [17]:
from nltk.stem import WordNetLemmatizer
import nltk
lemmatizer = WordNetLemmatizer()
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [36]:
lemmatizer.lemmatize("cats")

u'cat'

In [37]:
lemmatizer.lemmatize('geese')

u'goose'

In [38]:
lemmatizer.lemmatize('interviews')

u'interview'

In [39]:
# Before we can lemmatize our spam string we need to tokenize it.

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
#What is our tokenizer doing? Is anyone familiar with regex?
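
The pattern `\w+` matches runs of letters, digits and underscores, so punctuation and whitespace act as separators. As a rough equivalent, assuming the tokenizer's default settings, the same split can be reproduced with the standard `re` module on a fragment of the spam text:

```python
import re

text = "I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million"
# Every maximal run of word characters becomes one token.
tokens = re.findall(r"\w+", text.lower())
print(tokens)
# ['i', 'decided', 'to', 'will', 'donate', 'the', 'sum', 'of',
#  '8', '750', '000', '00', 'euros', 'eight', 'million']
```

Note how the amount `8,750,000.00` is broken into four separate tokens, because the commas and the decimal point are not word characters.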

In [40]:
spam_tokens = tokenizer.tokenize(spam.lower())


In [41]:
spam_tokens

['hello',
 'i',
 'saw',
 'your',
 'contact',
 'information',
 'on',
 'linkedin',
 'i',
 'have',
 'carefully',
 'read',
 'through',
 'your',
 'profile',
 'and',
 'you',
 'seem',
 'to',
 'have',
 'an',
 'outstanding',
 'personality',
 'this',
 'is',
 'one',
 'major',
 'reason',
 'why',
 'i',
 'am',
 'in',
 'contact',
 'with',
 'you',
 'my',
 'name',
 'is',
 'mr',
 'valery',
 'grayfer',
 'chairman',
 'of',
 'the',
 'board',
 'of',
 'directors',
 'of',
 'pjsc',
 'lukoil',
 'i',
 'am',
 '86',
 'years',
 'old',
 'and',
 'i',
 'was',
 'diagnosed',
 'with',
 'cancer',
 '2',
 'years',
 'ago',
 'i',
 'will',
 'be',
 'going',
 'in',
 'for',
 'an',
 'operation',
 'later',
 'this',
 'week',
 'i',
 'decided',
 'to',
 'will',
 'donate',
 'the',
 'sum',
 'of',
 '8',
 '750',
 '000',
 '00',
 'euros',
 'eight',
 'million',
 'seven',
 'hundred',
 'and',
 'fifty',
 'thousand',
 'euros',
 'only',
 'etc',
 'etc']

In [42]:
tokens_lem = [lemmatizer.lemmatize(i) for i in spam_tokens]

In [43]:
tokens_lem

['hello',
 'i',
 'saw',
 'your',
 'contact',
 'information',
 'on',
 'linkedin',
 'i',
 'have',
 'carefully',
 'read',
 'through',
 'your',
 'profile',
 'and',
 'you',
 'seem',
 'to',
 'have',
 'an',
 'outstanding',
 'personality',
 'this',
 'is',
 'one',
 'major',
 'reason',
 'why',
 'i',
 'am',
 'in',
 'contact',
 'with',
 'you',
 'my',
 'name',
 'is',
 'mr',
 'valery',
 'grayfer',
 'chairman',
 'of',
 'the',
 'board',
 'of',
 u'director',
 'of',
 'pjsc',
 'lukoil',
 'i',
 'am',
 '86',
 u'year',
 'old',
 'and',
 'i',
 u'wa',
 'diagnosed',
 'with',
 'cancer',
 '2',
 u'year',
 'ago',
 'i',
 'will',
 'be',
 'going',
 'in',
 'for',
 'an',
 'operation',
 'later',
 'this',
 'week',
 'i',
 'decided',
 'to',
 'will',
 'donate',
 'the',
 'sum',
 'of',
 '8',
 '750',
 '000',
 '00',
 u'euro',
 'eight',
 'million',
 'seven',
 'hundred',
 'and',
 'fifty',
 'thousand',
 u'euro',
 'only',
 'etc',
 'etc']

In [44]:
paired = zip(spam_tokens, tokens_lem)

In [None]:
paired

### How much of our text actually changed?
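
One way to answer this is to keep only the pairs where the token and its lemma differ. The sketch below uses a short hand-copied sample of the two lists (the `_sample` names are just for illustration); the same comprehension should work on the full `spam_tokens` and `tokens_lem`.

```python
# A short excerpt of the original tokens and their lemmatized versions.
spam_tokens_sample = ["directors", "of", "pjsc", "years", "old", "was", "euros"]
tokens_lem_sample = ["director", "of", "pjsc", "year", "old", "wa", "euro"]

# Keep only the (before, after) pairs that actually changed.
changed = [(before, after)
           for before, after in zip(spam_tokens_sample, tokens_lem_sample)
           if before != after]

print(changed)
print("%d of %d tokens changed" % (len(changed), len(spam_tokens_sample)))
```

In this sample only plural-looking words change, and `was` becomes `wa` because the lemmatizer treats every token as a noun unless told otherwise.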

## Stemming

Stemming is similar to lemmatization, but cruder: it strips suffixes by rule to reduce related words to a common "root," which may not be a real word.

In [None]:
from nltk.stem.porter import PorterStemmer

In [None]:
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

In [None]:
stem_spam = [p_stemmer.stem(i) for i in spam_tokens]

In [None]:
paired_stem = zip(spam_tokens, stem_spam)

In [None]:
paired_stem
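
To get a feel for rule-based stemming without running NLTK, here is a deliberately crude suffix stripper. It is not the Porter algorithm, which applies several ordered phases of rules with extra conditions; `SUFFIXES` and `crude_stem` are hypothetical names and the rule list is an assumption chosen for illustration.

```python
# Suffixes are tried in this order; the first match wins.
SUFFIXES = ["ing", "ed", "ly", "s"]

def crude_stem(token):
    for suffix in SUFFIXES:
        # Only strip if at least two characters of stem remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 2:
            return token[:-len(suffix)]
    return token

for word in ["diagnosed", "carefully", "going", "directors"]:
    print("%s -> %s" % (word, crude_stem(word)))
```

Even this toy version shows the main trade-off: `diagnosed` becomes `diagnos`, which groups related forms together but is no longer a dictionary word.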

## What can we do with our results?
The result is now a list, so we can't plug it directly back into the CountVectorizer, but we can turn it back into a string if we want.
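
Turning the list back into a string is a one-liner with `str.join`. The sketch below uses a short made-up sample list rather than the full output; the resulting string could then be passed to a vectorizer like any other document.

```python
from collections import Counter

# A short sample of lemmatized tokens (illustrative, not the full list).
tokens_lem_sample = ["i", "am", "86", "year", "old", "year", "ago"]

# Join the tokens with single spaces to get one document string again.
text_lem = " ".join(tokens_lem_sample)
print(text_lem)  # i am 86 year old year ago

# Quick sanity check: counting words in the string matches the list.
counts = Counter(text_lem.split())
print(counts["year"])  # 2
```

An alternative worth exploring is the `tokenizer` or `analyzer` argument of the vectorizer, both visible in the printed parameters earlier, which should let the vectorizer call the preprocessing itself.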

We can use this information for other purposes, though.



## Follow-up

How would we apply this information to a classification model?