### Many machine learning problems require learning from categorical variables, text or images. 

### Extracting features from categorical variables

### Many problems have explanatory variables that are categorical or nominal. A categorical variable can take one of a fixed set of values. For example, an application that predicts the salary for a job might use categorical variables such as the city in which the position is located. Categorical variables are commonly encoded using 'ONE-OF-K ENCODING' or 'ONE-HOT ENCODING', in which the explanatory variable is represented using one binary feature for each of its possible values.

### For example, let's assume our model has a 'city' variable that can take one of three values: New York, San Francisco or Chapel Hill. One-hot encoding represents the variable using one binary feature for each of the three possible cities. scikit-learn's 'DictVectorizer' class is a transformer that can be used to one-hot encode categorical features:

In [12]:
from sklearn.feature_extraction import DictVectorizer
onehot_encoder = DictVectorizer()
X = [
    {'city': 'New York'},
    {'city': 'San Francisco'},
    {'city': 'Chapel Hill'}
]

print(onehot_encoder.fit_transform(X).toarray())

[[0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]


### Note that the order of the features in the resulting vectors is arbitrary. In the first training example, the value of 'city' is 'New York'. The second element in the feature vector corresponds to the 'New York' value, and it is equal to one for the first instance. It may seem intuitive to represent the values of a categorical explanatory variable with a single integer feature. 'New York' could be represented by zero, 'San Francisco' by one, and 'Chapel Hill' by two, for example. The problem is that this representation encodes artificial information. Representing cities with integers encodes an order for cities that does not exist in the real world, and facilitates comparisons of them that do not make sense. There is no natural order of cities by which 'Chapel Hill' is one more than 'San Francisco'. One-hot encoding avoids this problem and only represents the value of the variable.
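
### To make the encoding concrete, here is a minimal plain-Python sketch of one-hot encoding. It is only an illustration and not how 'DictVectorizer' is implemented; the names 'categories', 'column_of' and 'one_hot' are my own, and I sort the values so that the column order is reproducible.

```python
# Plain-Python one-hot encoding sketch (illustrative only).
cities = ['New York', 'San Francisco', 'Chapel Hill']

# Give each distinct value its own column; sorting fixes the column order.
categories = sorted(set(cities))
column_of = {value: index for index, value in enumerate(categories)}

def one_hot(value):
    """Return a vector with a single 1 in the column assigned to `value`."""
    vector = [0] * len(categories)
    vector[column_of[value]] = 1
    return vector

encoded = [one_hot(city) for city in cities]
for city, vector in zip(cities, encoded):
    print(city, vector)
```

### Every vector contains exactly one non-zero element, so no city is "greater than" another, which is the property that the integer encoding lacks.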

### Standardizing Features
### We learned in the previous chapter that many learning algorithms perform better when they are trained on standardized data. Recall that standardized data has zero mean and unit variance. An explanatory variable with zero mean is centered about the origin; its average value is zero. A feature has unit variance when its variance is equal to one, so standardizing every feature puts all of their variances on the same scale. If one feature's variance is orders of magnitude greater than the variances of the other features, that feature may dominate the learning algorithm and prevent it from learning from the other variables. Some learning algorithms also converge to the optimal parameter values more slowly when data is not standardized. In addition to the 'StandardScaler' transformer we used in the previous chapter, the 'scale' function from the 'preprocessing' module can be used to standardize a dataset along either axis:

In [13]:
from sklearn import preprocessing
import numpy as np
X = np.array([
    [0., 0., 5., 13., 9., 1.],
    [0., 0., 13., 15., 10., 15.],
    [0., 3., 15., 2., 0., 11.]
])

print(preprocessing.scale(X))

[[ 0.         -0.70710678 -1.38873015  0.52489066  0.59299945 -1.35873244]
 [ 0.         -0.70710678  0.46291005  0.87481777  0.81537425  1.01904933]
 [ 0.          1.41421356  0.9258201  -1.39970842 -1.4083737   0.33968311]]
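
### The arithmetic behind these numbers can be sketched in plain Python. This is a hand-rolled illustration, not scikit-learn's code; 'standardize_columns' is a name I made up, and I am assuming the population standard deviation (dividing by n) and that a constant column is simply left at zero.

```python
import math

X = [
    [0., 0., 5., 13., 9., 1.],
    [0., 0., 13., 15., 10., 15.],
    [0., 3., 15., 2., 0., 11.],
]

def standardize_columns(rows):
    """Subtract each column's mean and divide by its population std."""
    n_rows = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n_rows for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n_rows)
            for col, m in zip(columns, means)]
    # A constant column has zero spread; keep it centered instead of dividing by 0.
    stds = [s if s > 0 else 1.0 for s in stds]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

scaled = standardize_columns(X)
for row in scaled:
    print([round(v, 4) for v in row])
```

### For the third column (5, 13, 15) the mean is 11 and the population standard deviation is about 4.32, so the first value becomes (5 - 11) / 4.32, or roughly -1.39.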


### Finally, 'RobustScaler' is an alternative to 'StandardScaler' that is robust to outliers. 'StandardScaler' subtracts the mean of a feature from each instance's value, and divides by the feature's standard deviation. To mitigate the effect of large outliers, 'RobustScaler' subtracts the median and divides by the interquartile range. QUARTILES are calculated by splitting the sorted dataset into four parts of equal size. The MEDIAN is the SECOND QUARTILE; the IQR (INTER QUARTILE RANGE) is the difference of the THIRD and FIRST QUARTILES.
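
### The difference between the two scalers is easiest to see on a feature with one large outlier. The following is a small standard-library sketch of the two formulas, not a call to scikit-learn. The data is invented, and the 'inclusive' quantile method is my own choice; libraries differ in how they interpolate quartiles, so treat the exact quartile values as an assumption.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is a deliberate outlier

# Standard scaling: center on the mean, divide by the population std.
mean = statistics.mean(values)
std = statistics.pstdev(values)
standard = [(v - mean) / std for v in values]

# Robust scaling: center on the median, divide by the interquartile range.
q1, median, q3 = statistics.quantiles(values, n=4, method='inclusive')
iqr = q3 - q1
robust = [(v - median) / iqr for v in values]

print('standard:', [round(v, 3) for v in standard])
print('robust:  ', [round(v, 3) for v in robust])
```

### The outlier drags the mean up to 22 and inflates the standard deviation, so the four ordinary values are squeezed into a narrow band. The median (3) and the interquartile range (2) are unaffected by the outlier, so the ordinary values keep a sensible spread.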

## Extracting Features From Text
### The most common representation of text is the BAG-OF-WORDS model. This representation uses a multiset, or bag, that encodes the words that appear in a text; bag-of-words does not encode any of the text's syntax, ignores the order of words, and disregards all grammar. Bag-of-words can be thought of as an extension to one-hot encoding. It creates one feature for each word of interest in the text. The bag-of-words model is motivated by the intuition that documents containing similar words often have similar meanings. The bag-of-words model can be used effectively for document classification and retrieval despite the limited information that it encodes. A collection of documents is called a CORPUS. 

### I will create a CORPUS that contains eight unique words. The corpus's unique words comprise its vocabulary. The bag-of-words model uses a feature vector with an element for each of the words in the corpus's vocabulary to represent each document. Our corpus has eight unique words, so each document will be represented by a vector with eight elements. The number of elements that comprise a feature vector is called the vector's dimension. A dictionary maps the vocabulary to indices in the feature vector.

### NOTE: The dictionary for a bag-of-words model could be implemented using a Python 'dict', but the Python data structure and the representation's mapping are distinct concepts.

### In the most basic bag-of-words representation, each element in the feature vector is a binary value that represents whether or not the corresponding word appeared in the document. 'CountVectorizer' assigns indices to the words in alphabetical order. For example, the first word in the first document is UNC. After lowercasing, 'unc' is the last word in the dictionary (index 7), so the eighth element in the first document's vector is equal to one. The word 'game' is at index 2. The first document does not contain the word 'game', so the third element in its vector is set to zero. The 'CountVectorizer' transformer can produce a bag-of-words representation from a string or file. By default, 'CountVectorizer' converts the characters in the documents to lowercase and tokenizes the documents. Tokenization is the process of splitting a string into tokens, or meaningful sequences of characters. Tokens are often words, but they may also be shorter sequences, including punctuation characters and affixes. By default, 'CountVectorizer' tokenizes using a regular expression that extracts sequences of two or more word characters, treating whitespace and punctuation as separators. The documents in our corpus are represented by the following feature vectors:

### NOTE: 'CountVectorizer.vocabulary_' returns a dictionary (key/value pairs).

### 'vocabulary_ : dict'
### A mapping of terms to feature indices.

### 'vocabulary : Mapping or iterable, optional'
### Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['UNC played Duke in basketball',
         'Duke lost the basketball game']

vectorizer = CountVectorizer()
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 0 1 0]]
{'unc': 7, 'played': 5, 'duke': 1, 'in': 3, 'basketball': 0, 'lost': 4, 'the': 6, 'game': 2}
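
### To see what happens under the hood, here is a plain-Python sketch that rebuilds the same vocabulary and binary vectors by hand. It is an approximation for illustration only. The regular expression is, as far as I know, equivalent to the default token pattern, but treat that as an assumption; 'tokenize' and 'binary_vector' are names I made up.

```python
import re

corpus = ['UNC played Duke in basketball',
          'Duke lost the basketball game']

def tokenize(document):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r'\b\w\w+\b', document.lower())

tokenized = [tokenize(document) for document in corpus]

# The dictionary maps each unique term to an index, in alphabetical order.
unique_terms = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {term: index for index, term in enumerate(unique_terms)}

def binary_vector(tokens):
    """Set the element of every term that occurs in the document to 1."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        vector[vocabulary[token]] = 1
    return vector

vectors = [binary_vector(tokens) for tokens in tokenized]
print(vocabulary)
print(vectors)
```

### Because the pattern requires at least two characters, a call such as tokenize('I ate a sandwich') keeps only 'ate' and 'sandwich'.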


### Now let's add a third document to our corpus and inspect the dictionary and feature vectors. After refitting, the corpus's dictionary contains ten unique words. Note that 'I' and 'a' are not extracted because they do not match the regular expression, which requires at least two characters:

In [15]:
corpus.append('I ate a sandwich')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[0 1 1 0 1 0 1 0 0 1]
 [0 1 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0]]
{'unc': 9, 'played': 6, 'duke': 2, 'in': 4, 'basketball': 1, 'lost': 5, 'the': 8, 'game': 3, 'ate': 0, 'sandwich': 7}


### The meanings of the first two documents are more similar to each other than they are to the third document, and their corresponding feature vectors are more similar to each other than they are to the third document's feature vector when using a metric such as 'Euclidean distance'. The 'Euclidean distance' between two vectors is equal to the Euclidean norm, or L2 norm, of the difference between the two vectors, as given by this equation:

### $d = \lVert \mathbf{x}_0 - \mathbf{x}_1 \rVert$

### A norm is a function that assigns a non-negative size to a vector; only the zero vector has a size of zero. The Euclidean norm of a vector is equal to the vector's magnitude, which is given by the following equation:

### $\lVert \mathbf{x} \rVert = \sqrt{x_1^2 + x_2^2 + x_3^2 + \dots + x_n^2}$


### scikit-learn's 'euclidean_distances' function can be used to calculate the distance between two or more vectors, and confirms that the most semantically similar documents are also the closest to each other in the vector space. In the following example, we will use the 'euclidean_distances' function to compare the feature vectors for our documents:

In [16]:
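from sklearn.metrics.pairwise import euclidean_distances

# Keep the sparse matrix returned by fit_transform. Indexing a row (X[0]) yields
# the 2-D input that euclidean_distances expects, whereas the np.matrix returned
# by .todense() may be rejected by recent scikit-learn releases.
X = vectorizer.fit_transform(corpus)
print('Distance between 1st and 2nd documents:',
      euclidean_distances(X[0], X[1]))
print('Distance between 1st and 3rd documents:',
      euclidean_distances(X[0], X[2]))
print('Distance between 2nd and 3rd documents:',
      euclidean_distances(X[1], X[2]))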


Distance between 1st and 2nd documents: [[2.44948974]]
Distance between 1st and 3rd documents: [[2.64575131]]
Distance between 2nd and 3rd documents: [[2.64575131]]
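
### These distances are easy to check by hand. The sketch below applies the Euclidean norm to the difference of two vectors using only the standard library; the three lists are the feature vectors printed earlier for our corpus, and 'euclidean_distance' is my own helper, not the scikit-learn function.

```python
import math

def euclidean_distance(x0, x1):
    """L2 norm of the element-wise difference between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x0, x1)))

doc1 = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]  # 'UNC played Duke in basketball'
doc2 = [0, 1, 1, 1, 0, 1, 0, 0, 1, 0]  # 'Duke lost the basketball game'
doc3 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]  # 'I ate a sandwich'

d12 = euclidean_distance(doc1, doc2)
d13 = euclidean_distance(doc1, doc3)
d23 = euclidean_distance(doc2, doc3)
print(d12, d13, d23)
```

### The first two vectors differ in six positions, so their distance is the square root of 6. Each of them differs from the third vector in seven positions, which gives the square root of 7.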


### High-dimensional vectors that have many zeroed elements are called 'SPARSE VECTORS'.

### Using high-dimensional data creates several problems for all machine learning tasks, including those that do not involve text. Collectively, these problems are known as the CURSE OF DIMENSIONALITY. The first problem is that high-dimensional vectors require more memory and computation than low-dimensional vectors. SciPy provides some data types that mitigate this problem by efficiently representing only the non-zero elements of sparse vectors. The second problem is that as the feature space's dimensionality increases, more training data is required to ensure that there are enough training instances with each combination of the feature's values. 
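
### The idea of storing only the non-zero elements can be sketched with a plain Python 'dict'. This is a toy illustration of the principle and is not how SciPy's sparse types are implemented; 'to_dense' and 'sparse_dot' are hypothetical helper names.

```python
# Dense feature vector for 'UNC played Duke in basketball' from the earlier example.
dense = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]

# Sparse form: remember only the positions that hold a non-zero value.
sparse = {index: value for index, value in enumerate(dense) if value != 0}

def to_dense(sparse_vector, dimension):
    """Rebuild the full vector; missing indices are implicitly zero."""
    return [sparse_vector.get(index, 0) for index in range(dimension)]

def sparse_dot(a, b):
    """Dot product that only visits indices stored in the smaller vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b.get(index, 0) for index, value in a.items())

print(sparse)
print(to_dense(sparse, len(dense)) == dense)
print(sparse_dot(sparse, sparse))
```

### With ten dimensions the saving is negligible, but for a vocabulary of hundreds of thousands of words, where a document uses only a few hundred, storing only the non-zero entries saves most of the memory.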

### If there are insufficient training instances for a feature, the algorithm may overfit noise in the training data and fail to generalize. In the following sections, we will review several strategies for reducing the dimensionality of text features.

## STOP WORD FILTERING

### A basic strategy for reducing the dimensions of the feature space is to convert all of the text to lowercase. This is motivated by the insight that the letter case does not contribute to the meanings of most words; sandwich and Sandwich have the same meaning in most contexts. Capitalization may indicate that a word is beginning a sentence, but the bag-of-words model has already discarded all information from word order and grammar.


### A second strategy is to remove words that are common to most of the documents in the corpus. These words, called STOP WORDS, frequently include determiners such as "the", "a", and "an"; auxiliary verbs such as "do", "be", and "will"; and prepositions such as "on", "around", and "beneath". Stop words are often functional words that contribute to the document's meaning through grammar rather than their denotations. 'CountVectorizer' can filter stop words provided as the 'stop_words' keyword parameter, and also includes a basic list of English stop words.

In [24]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['UNC played Duke in basketball',
         'Duke lost the basketball game',
         'I ate a sandwich']

# CountVectorizer with and without 'stop_words'

vectorizer = CountVectorizer() # no stop_words
print(vectorizer.fit_transform(corpus).todense())
print('No Stop_Words %s' % vectorizer.vocabulary_)

vectorizer = CountVectorizer(stop_words = 'english')
print(vectorizer.fit_transform(corpus).todense())
print('With Stop_Words %s' % vectorizer.vocabulary_)

[[0 1 1 0 1 0 1 0 0 1]
 [0 1 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0]]
No Stop_Words {'unc': 9, 'played': 6, 'duke': 2, 'in': 4, 'basketball': 1, 'lost': 5, 'the': 8, 'game': 3, 'ate': 0, 'sandwich': 7}
[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
With Stop_Words {'unc': 7, 'played': 5, 'duke': 2, 'basketball': 1, 'lost': 4, 'game': 3, 'ate': 0, 'sandwich': 6}
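
### Stop word filtering itself is a simple set-membership test. Here is a minimal plain-Python sketch. The tiny 'STOP_WORDS' set is hypothetical and only covers the words that matter for this corpus; it is not scikit-learn's English list, which is much longer.

```python
import re

corpus = ['UNC played Duke in basketball',
          'Duke lost the basketball game',
          'I ate a sandwich']

# Hypothetical, deliberately tiny stop list for this example.
STOP_WORDS = {'in', 'the'}

def tokenize(document, stop_words=frozenset()):
    """Lowercase, keep tokens of 2+ word characters, drop stop words."""
    tokens = re.findall(r'\b\w\w+\b', document.lower())
    return [token for token in tokens if token not in stop_words]

def build_vocabulary(documents, stop_words=frozenset()):
    terms = sorted({t for d in documents for t in tokenize(d, stop_words)})
    return {term: index for index, term in enumerate(terms)}

unfiltered = build_vocabulary(corpus)
filtered = build_vocabulary(corpus, STOP_WORDS)
print(len(unfiltered), unfiltered)
print(len(filtered), filtered)
```

### Removing 'in' and 'the' shrinks the vocabulary from ten terms to eight, and the remaining indices shift down accordingly, which matches the change seen in the output above.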


### Stemming and Lemmatization

### While stop word filtering is an easy strategy for dimensionality reduction, most stop word lists contain only a few hundred words. A large corpus may still have hundreds of thousands of unique words after filtering. Two similar strategies for further reducing dimensionality are called stemming and lemmatization.

### A high-dimensional document vector may separately encode several derived or inflected forms of the same word. For example, "jumping" and "jumps" are both forms of the word "jump"; a document vector in a corpus of long-jumping articles may encode each inflected form with a separate element in the feature vector. Stemming and lemmatization are two strategies for condensing inflected and derived forms of a word into a single feature.

In [26]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['He ate the sandwiches',
         'Every sandwich was eaten by him']
vectorizer = CountVectorizer(binary=True, stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[1 0 0 1]
 [0 1 1 0]]
{'ate': 0, 'sandwiches': 3, 'sandwich': 2, 'eaten': 1}


### The documents have similar meanings, but their feature vectors have no elements in common! Both documents contain a conjugation of 'eat' and a form of 'sandwich'. Ideally, these similarities should be reflected in the feature vectors. 


### LEMMATIZATION is the process of determining the lemma, or the morphological root, of an inflected word based on its context. LEMMAS are the base forms of words that are used to key the word in a dictionary.

### STEMMING has a similar goal to lemmatization, but it does not attempt to produce the morphological roots of words. Instead, stemming removes all patterns of characters that appear to be affixes, resulting in a token that is not necessarily a valid word. 
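
### To illustrate the rule-based flavour of stemming, here is a deliberately naive suffix stripper. It is a toy of my own and not the Porter algorithm; the 'SUFFIXES' tuple and the minimum stem length of three characters are arbitrary assumptions.

```python
# Hypothetical rule list, checked in order; real stemmers use many more rules.
SUFFIXES = ('ing', 'es', 's')

def toy_stem(token):
    """Strip the first matching suffix if at least 3 characters remain."""
    token = token.lower()
    for suffix in SUFFIXES:
        stem = token[:-len(suffix)]
        if token.endswith(suffix) and len(stem) >= 3:
            return stem
    return token

for word in ['jumping', 'jumps', 'sandwiches', 'games', 'ate']:
    print(word, '->', toy_stem(word))
```

### The rules collapse 'jumping' and 'jumps' into 'jump', but they also turn 'games' into 'gam', which is not a valid word, and they cannot relate 'ate' to 'eat' because no suffix is involved.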

### Lemmatization frequently requires a lexical resource, like WordNet, and the word's part-of-speech.

### Stemming algorithms frequently use rules instead of lexical resources to produce stems and can operate on any token, even without its context. Let's consider lemmatization of the word 'gathering' in two documents: 'I am gathering ingredients for the sandwich' and 'There were many wizards at the gathering'.

### In the first sentence, 'gathering' is a verb, and its lemma is 'gather'.  In the second sentence, 'gathering' is a noun, and its lemma is 'gathering'. We will use the Natural Language Tool Kit (NLTK) to stem and lemmatize the corpus. 

### First I must install the NLTK package as shown in the following cell.

In [36]:
!pip3 install --user -U nltk
import nltk
nltk.download('wordnet')

Requirement already up-to-date: nltk in /Users/squeeko/Library/Python/3.7/lib/python/site-packages (3.4.5)


[nltk_data] Downloading package wordnet to /Users/squeeko/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [37]:
from nltk.stem.wordnet import WordNetLemmatizer
corpus = ['I am gathering ingredients for the sandwich',
         'There were many wizards at the gathering']

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('gathering', 'v')) # 'v' = verb
print(lemmatizer.lemmatize('gathering', 'n')) # 'n' = noun

gather
gathering


### Now I will compare lemmatization with stemming. 'PorterStemmer' cannot consider the inflected form's part-of-speech and returns 'gather' for both documents.

In [1]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('gathering'))

gather


In [6]:
# Now I will stem and lemmatize the toy corpus

import nltk
from nltk import pos_tag, word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

# Tokenizer and part-of-speech tagger models. Newer NLTK releases may ask for
# 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# WordNet part-of-speech codes that we lemmatize: nouns and verbs
wordnet_tags = ['n', 'v']
corpus = [
    'He ate the sandwiches',
    'Every sandwich was eaten by him'
]

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print('Stemmed: ', [[stemmer.stem(token) for token in
                     word_tokenize(document)] for document in corpus])

def lemmatize(token, tag):
    # Penn Treebank tags start with 'N' for nouns and 'V' for verbs;
    # tokens with any other tag are returned unchanged.
    if tag[0].lower() in wordnet_tags:
        return lemmatizer.lemmatize(token, tag[0].lower())
    return token

tagged_corpus = [pos_tag(word_tokenize(document)) for document in
                 corpus]
print('Lemmatized: ', [[lemmatize(token, tag) for token, tag in
                        document] for document in tagged_corpus])

[nltk_data] Downloading package punkt to /Users/squeeko/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/squeeko/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Stemmed:  [['He', 'ate', 'the', 'sandwich'], ['everi', 'sandwich', 'wa', 'eaten', 'by', 'him']]
Lemmatized:  [['He', 'eat', 'the', 'sandwich'], ['Every', 'sandwich', 'be', 'eat', 'by', 'him']]


## Extending bag-of-words with tf-idf weights

### In the previous section, we used the bag-of-words representation to create feature vectors that encode whether or not a word from the corpus's dictionary appears in a document.

### These feature vectors do not encode grammar, word order, or the frequencies of words. It is intuitive that the frequency with which a word appears in a document could indicate the extent to which a document pertains to that word. A long document that contains one occurrence of a word may discuss an entirely different topic than a document that contains many occurrences of the same word. In this section, we will create feature vectors that encode the frequencies of words, and discuss strategies for mitigating two problems caused by encoding term frequencies. Instead of using a binary value for each element in the feature vector, we will now use an integer that represents the number of times that the corresponding word appeared in the document. With stop word filtering, the corpus is represented by the following feature vector:

In [15]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['The dog ate a sandwich, the wizard transfigured a sandwich, and I ate a sandwich']
vectorizer = CountVectorizer(stop_words='english')
frequencies = np.array(vectorizer.fit_transform(corpus).todense())[0]
print(frequencies)
print('Token indices %s' % vectorizer.vocabulary_)  # indices follow alphabetical order
for token, index in vectorizer.vocabulary_.items():
    print('The token "%s" appears %s times' % (token, frequencies[index]))



[2 1 3 1 1]
Token indices {'dog': 1, 'ate': 0, 'sandwich': 2, 'wizard': 4, 'transfigured': 3}
The token "dog" appears 1 times
The token "ate" appears 2 times
The token "sandwich" appears 3 times
The token "wizard" appears 1 times
The token "transfigured" appears 1 times


### The element for 'dog' (at index 1) is now set to one, and the element for sandwich (at index 2) is set to three to indicate that the corresponding words occurred one and three times, respectively. Note that the 'CountVectorizer' binary parameter is omitted; its default value is 'False', which causes it to return raw term frequencies rather than binary frequencies.

### Encoding the terms' raw frequencies in the feature vector provides additional information about the meanings of the documents, but assumes that all the documents are of similar lengths. Many words might appear with the same frequency in two documents, but the documents could still be dissimilar if one document is many times larger than the other.

### scikit-learn's 'TfidfTransformer' can mitigate this problem by transforming a matrix of term frequency vectors into a matrix of normalized term frequency weights. By default, 'TfidfTransformer' smooths the document frequencies (its 'smooth_idf' parameter) and applies L2 normalization. The normalized term frequencies are given by this equation:

## $tf(t,d) = \frac{f(t,d)}{\lVert x \rVert}$

### The numerator is the frequency of the term in the document. The denominator is the L2 norm of the term count vector. In addition to normalizing raw term counts, we can improve our feature vectors by calculating logarithmically scaled term frequencies, which scale the counts to a more limited range. Logarithmically scaled term frequencies are given by the following equation:

## $tf(t,d) = 1 + \log f(t,d)$
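
### Both term frequency variants can be worked through by hand on the counts from the single-document example above (ate 2, dog 1, sandwich 3, transfigured 1, wizard 1). This is a plain-Python sketch of the two formulas and not a call to scikit-learn; I am assuming the natural logarithm.

```python
import math

# Raw term counts f(t, d) from the example document.
counts = {'ate': 2, 'dog': 1, 'sandwich': 3, 'transfigured': 1, 'wizard': 1}

# L2-normalized term frequencies: divide each count by the norm of the count vector.
l2_norm = math.sqrt(sum(count ** 2 for count in counts.values()))
normalized_tf = {term: count / l2_norm for term, count in counts.items()}

# Logarithmically scaled term frequencies: 1 + log f(t, d) for terms that occur.
log_tf = {term: 1 + math.log(count) for term, count in counts.items() if count > 0}

print(l2_norm)
print(normalized_tf)
print({term: round(weight, 4) for term, weight in log_tf.items()})
```

### The squared counts sum to 16, so the norm is 4 and 'sandwich' gets a weight of 3 / 4 = 0.75. With logarithmic scaling, a term that occurs three times weighs only about 2.1 times as much as a term that occurs once.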

### 'TfidfTransformer' calculates logarithmically scaled term frequencies when its 'sublinear_tf' keyword parameter is set to 'True'. Normalization and logarithmically scaled term frequencies can represent the frequencies of terms in a document while mitigating the effects of different document sizes. However, another problem remains with these representations. The feature vectors contain large weights for terms that occur frequently in a document, even if those terms occur frequently in most documents in the corpus. These terms do not help to represent the meaning of a particular document relative to the rest of the corpus. For example, most of the documents in a corpus of articles about Duke's basketball team could include the words 'Coach K', 'trip', and 'flop'. These words can be thought of as corpus-specific stop words, and may not be useful for calculating the similarity of documents. The Inverse Document Frequency (IDF) is a measure of how rare or common a word is in a corpus. The INVERSE DOCUMENT FREQUENCY is given by this equation:

## $idf(t,D) = \log \frac{N}{1 + |\{d \in D : t \in d\}|}$

### Here, the numerator is the total number of documents in the corpus, and the denominator is one plus the number of documents in the corpus that contain the term; adding one avoids division by zero for a term that appears in no document. A term's tf-idf value is the product of its term frequency and inverse document frequency. 'TfidfTransformer' returns tf-idf weights when its 'use_idf' keyword argument is set to its default value, 'True'.
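
### The inverse document frequency formula can be tried on a tiny invented corpus. The sketch below follows the equation as written here, log(N / (1 + document count)), with the natural logarithm assumed. The four token sets are made up for illustration, and 'idf' is my own helper.

```python
import math

# Invented corpus; each document is reduced to its set of unique tokens.
documents = [
    {'duke', 'played', 'basketball'},
    {'duke', 'lost', 'game'},
    {'duke', 'won', 'game'},
    {'ate', 'sandwich'},
]

def idf(term, docs):
    """log(N / (1 + number of documents that contain the term))."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (1 + n_containing))

for term in ['duke', 'game', 'sandwich']:
    print(term, round(idf(term, documents), 4))
```

### 'duke' occurs in three of the four documents and receives a weight of zero, while 'sandwich' occurs in only one and receives the largest weight, about 0.69. Note that with this form of the formula a term found in every document gets a slightly negative weight.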

### Since tf-idf weighted feature vectors are commonly used to represent text, scikit-learn provides a 'TfidfVectorizer' transformer class that wraps 'CountVectorizer' and 'TfidfTransformer'. Let's use 'TfidfVectorizer' to create tf-idf weighted feature vectors for our corpus:

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich'
]

vectorizer = TfidfVectorizer(stop_words='english')
print(vectorizer.fit_transform(corpus).todense())

[[0.75458397 0.37729199 0.53689271 0.         0.        ]
 [0.         0.         0.44943642 0.6316672  0.6316672 ]]
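
### These weights can be reproduced by hand, which helps to check my understanding. As far as I can tell from the scikit-learn documentation, the default settings use a smoothed idf of ln((1 + N) / (1 + df)) + 1 rather than the textbook formula, followed by L2 normalization; treat that as an assumption. The sketch below is plain Python, and the two-word 'STOP_WORDS' set is a hypothetical stand-in for the English stop list.

```python
import math
import re
from collections import Counter

corpus = [
    'The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich',
]
STOP_WORDS = {'the', 'and'}  # hypothetical minimal list for this corpus

def tokenize(document):
    tokens = re.findall(r'\b\w\w+\b', document.lower())
    return [token for token in tokens if token not in STOP_WORDS]

counts = [Counter(tokenize(document)) for document in corpus]
vocabulary = sorted(set().union(*counts))
n_documents = len(corpus)

def smoothed_idf(term):
    """Assumed default: ln((1 + N) / (1 + df)) + 1."""
    df = sum(1 for term_counts in counts if term in term_counts)
    return math.log((1 + n_documents) / (1 + df)) + 1

def tfidf_vector(term_counts):
    """Raw count times idf for each term, then L2-normalize the vector."""
    weights = [term_counts[term] * smoothed_idf(term) for term in vocabulary]
    norm = math.sqrt(sum(weight ** 2 for weight in weights))
    return [weight / norm for weight in weights]

tfidf = [tfidf_vector(term_counts) for term_counts in counts]
print(vocabulary)
for vector in tfidf:
    print([round(weight, 8) for weight in vector])
```

### 'sandwich' appears in both documents, so its idf is the minimum value of 1, and it ends up with a smaller weight in the first document than 'ate' even though both words occur twice there.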


# Will complete at a later date (hashes and word embeddings)