<a href="https://colab.research.google.com/github/sergiomora03/AdvancedTopicsAnalytics/blob/main/notebooks/L5-BasicVectorizationApproaches.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Basic Vectorization Approaches

This notebook is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/deed.en_US). Special thanks go to [Kevin Markham](https://github.com/justmarkham).

# Data

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline

In [3]:
df = pd.read_csv('https://github.com/sergiomora03/AdvancedTopicsAnalytics/raw/main/datasets/mashable_texts.csv', index_col=0)

In [4]:
df.head()

Unnamed: 0,author,author_web,shares,text,title,facebo,google,linked,twitte,twitter_followers
0,Seth Fiegerman,http://mashable.com/people/seth-fiegerman/,4900,\nApple's long and controversial ebook case ha...,The Supreme Court smacked down Apple today,http://www.facebook.com/sfiegerman,,http://www.linkedin.com/in/sfiegerman,https://twitter.com/sfiegerman,14300
1,Rebecca Ruiz,http://mashable.com/people/rebecca-ruiz/,1900,Analysis\n\n\n\n\n\nThere is a reason that Don...,Every woman has met a man like Donald Trump,,,,https://twitter.com/rebecca_ruiz,3738
2,Davina Merchant,http://mashable.com/people/568bdab351984019310...,7000,LONDON - Last month we reported on a dog-sized...,Adorable dog-sized rabbit finally finds his fo...,,https://plus.google.com/105525238342980116477?...,,,0
3,Scott Gerber,[],5000,Today's digital marketing experts must have a ...,15 essential skills all digital marketing hire...,,,,,0
4,Josh Dickey,http://mashable.com/people/joshdickey/,1600,"LOS ANGELES — For big, fun, populist popcorn m...",Mashable top 10: 'The Force Awakens' is the be...,,https://plus.google.com/109213469090692520544?...,,https://twitter.com/JLDlite,11200


# Tokenization

- **What:** Separate text into units such as sentences or words
- **Why:** Gives structure to previously unstructured text
- **Notes:** Relatively easy with English language text, not easy with some languages
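As a minimal sketch of word-level tokenization, the snippet below uses the tokenizer and analyzer that `CountVectorizer` builds internally. The sample sentence is an illustrative string modeled on the first article, not a row read from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# illustrative sentence (an assumption, not loaded from the Mashable data)
sample = "Apple's long and controversial ebook case has reached its final chapter."

# build_tokenizer() returns the function that splits text using token_pattern;
# the default pattern keeps only tokens of two or more word characters
tokenize = CountVectorizer().build_tokenizer()
tokens = tokenize(sample)
print(tokens)

# build_analyzer() adds preprocessing (lowercasing by default) before tokenizing
analyze = CountVectorizer().build_analyzer()
lowered_tokens = analyze(sample)
print(lowered_tokens)
```

With the default pattern, the possessive `'s` is dropped because single-character tokens do not match.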

### Create the target feature (number of shares)

In [6]:
y = df.shares
y.describe()

count       82.000000
mean      3090.487805
std       8782.031594
min        437.000000
25%        893.500000
50%       1200.000000
75%       2275.000000
max      63100.000000
Name: shares, dtype: float64

In [7]:
y = pd.cut(y, [0, 893, 1200, 2275, 63200], labels=[0, 1, 2, 3])

In [8]:
y.value_counts()

1    22
0    21
3    21
2    18
Name: shares, dtype: int64

In [9]:
df['y'] = y

### Create document-term matrices

In [10]:
X = df.text

In [11]:
# use CountVectorizer to create document-term matrices from X
vect = CountVectorizer()
X_dtm = vect.fit_transform(X)

In [16]:
X_dtm

<82x7969 sparse matrix of type '<class 'numpy.int64'>'
	with 25929 stored elements in Compressed Sparse Row format>

In [15]:
X_dtm.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 3, 0, ..., 0, 0, 0]])

In [12]:
temp=X_dtm.todense()

In [13]:
vect.vocabulary_

{'apple': 682,
 'long': 4303,
 'and': 617,
 'controversial': 1747,
 'ebook': 2401,
 'case': 1307,
 'has': 3367,
 'reached': 5734,
 'its': 3884,
 'final': 2893,
 'chapter': 1383,
 'it': 3878,
 'not': 4883,
 'the': 7054,
 'happy': 3352,
 'ending': 2527,
 'company': 1612,
 'wanted': 7620,
 'supreme': 6865,
 'court': 1809,
 'on': 4969,
 'monday': 4687,
 'rejected': 5841,
 'an': 603,
 'appeal': 673,
 'filed': 2882,
 'by': 1224,
 'to': 7150,
 'overturn': 5075,
 'stinging': 6723,
 'ruling': 6087,
 'that': 7051,
 'led': 4181,
 'broad': 1147,
 'conspiracy': 1706,
 'with': 7748,
 'several': 6303,
 'major': 4374,
 'publishers': 5610,
 'fix': 2927,
 'price': 5483,
 'of': 4935,
 'books': 1088,
 'sold': 6528,
 'through': 7106,
 'online': 4979,
 'bookstore': 1089,
 'decision': 2009,
 'means': 4496,
 'now': 4895,
 'no': 4858,
 'choice': 1437,
 'but': 1215,
 'pay': 5178,
 'out': 5037,
 '400': 223,
 'million': 4611,
 'consumers': 1714,
 'additional': 446,
 '50': 252,
 'in': 3664,
 'legal': 4187,
 'fees'

In [17]:
# rows are documents, columns are terms (aka "tokens" or "features")
X_dtm.shape

(82, 7969)

In [28]:
# 50 terms from near the end of vocabulary_ (keys are in first-seen order, not column order)
print(list(vect.vocabulary_.keys())[-150:-100])

['fiber', 'keen', 'accelerates', 'tuner', 'inlfux', 'rab', 'streamed', 'hungary', 'poverty', 'syrians', 'austria', 'slamming', 'doors', 'migrants', 'arriving', 'closes', 'border', 'croatian', 'exceeded', 'capacities', 'centers', 'thru', 'lines', 'patience', 'channel4news', 'mjizfew4wj', 'millerc4', 'tovarnik', '5_news', 'e5is8r68us', 'lane', 'peterlane5news', 'separated', 'collapsed', 'fields', 'rpffh9n9b8', 'cpr', 'uhyoyxtika', 'ranko', 'ostojić', 'visited', 'acknowledged', 'looming', 'impromptu', 'nebojša', 'stefanović', 'urged', 'affairs', 'vesna', 'pusić']


In [32]:
# show vectorizer options
vect

In [31]:
# show all vectorizer attributes, including the fitted vocabulary
vect.__dict__

{'input': 'content',
 'encoding': 'utf-8',
 'decode_error': 'strict',
 'strip_accents': None,
 'preprocessor': None,
 'tokenizer': None,
 'analyzer': 'word',
 'lowercase': True,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'stop_words': None,
 'max_df': 1.0,
 'min_df': 1,
 'max_features': None,
 'ngram_range': (1, 1),
 'vocabulary': None,
 'binary': False,
 'dtype': numpy.int64,
 'fixed_vocabulary_': False,
 '_stop_words_id': 94930521564128,
 'stop_words_': set(),
 'vocabulary_': {'apple': 682,
  'long': 4303,
  'and': 617,
  'controversial': 1747,
  'ebook': 2401,
  'case': 1307,
  'has': 3367,
  'reached': 5734,
  'its': 3884,
  'final': 2893,
  'chapter': 1383,
  'it': 3878,
  'not': 4883,
  'the': 7054,
  'happy': 3352,
  'ending': 2527,
  'company': 1612,
  'wanted': 7620,
  'supreme': 6865,
  'court': 1809,
  'on': 4969,
  'monday': 4687,
  'rejected': 5841,
  'an': 603,
  'appeal': 673,
  'filed': 2882,
  'by': 1224,
  'to': 7150,
  'overturn': 5075,
  'stinging': 6723,
  'ruling': 6

- **lowercase:** boolean, True by default
- Convert all characters to lowercase before tokenizing.

In [33]:
vect = CountVectorizer(lowercase=False)
X_dtm = vect.fit_transform(X)
X_dtm.shape

(82, 8759)

In [34]:
X_dtm.todense()[0].argmax()

8097

In [36]:
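# NOTE: vocabulary_.keys() is in first-seen order, not column order, so this positional
# lookup does not name column 8097; vect.get_feature_names_out()[8097] would give that term
list(vect.vocabulary_.keys())[8097]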

'nightmare'

- **ngram_range:** tuple (min_n, max_n)
- The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

In [37]:
# include 1-grams through 4-grams
vect = CountVectorizer(ngram_range=(1, 4))
X_dtm = vect.fit_transform(X)
X_dtm.shape

(82, 115172)

In [38]:
# 50 n-grams from near the end of vocabulary_ (first-seen order)
print(list(vect.vocabulary_.keys())[-1000:-950])

['to go to', 'go to cuba', 'to cuba even', 'cuba even as', 'even as we', 'as we wait', 'we wait for', 'wait for the', 'for the embargo', 'the embargo to', 'embargo to be', 'to be lifted', 'be lifted congress', 'lifted congress has', 'congress has to', 'has to do', 'to do that', 'do that airlines', 'that airlines are', 'airlines are announcing', 'are announcing new', 'announcing new charters', 'new charters and', 'charters and the', 'and the historically', 'the historically isolated', 'historically isolated island', 'isolated island nation', 'island nation is', 'nation is preparing', 'is preparing for', 'preparing for continuing', 'for continuing influx', 'continuing influx of', 'influx of travelers', 'of travelers at', 'travelers at the', 'same time separately', 'time separately there', 'separately there seems', 'there seems to', 'to be new', 'be new smartphone', 'new smartphone friendly', 'smartphone friendly private', 'friendly private jet', 'private jet service', 'jet service announ
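To see why the feature count grows so quickly with `ngram_range`, here is a small sketch that counts the n-grams produced from a single short phrase. The phrase is illustrative; the analyzer is the one `CountVectorizer` builds for the given settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = 'congress has to do that'  # illustrative five-word phrase

# count how many n-grams the analyzer emits as the upper bound grows
gram_counts = {}
for n_max in (1, 2, 3):
    analyzer = CountVectorizer(ngram_range=(1, n_max)).build_analyzer()
    grams = analyzer(sentence)
    gram_counts[n_max] = len(grams)
    print(n_max, len(grams), grams)
```

A phrase of k tokens yields k unigrams, k-1 bigrams, k-2 trigrams, and so on, and across a corpus most of the longer n-grams are unique, which is what inflates the vocabulary.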

### Predict shares

In [39]:
# Default CountVectorizer
vect = CountVectorizer()
X_dtm = vect.fit_transform(X)

# use Naive Bayes to predict the share-count class (y)
nb = MultinomialNB()
pd.Series(cross_val_score(nb, X_dtm, y, cv=10)).describe()

count    10.000000
mean      0.369444
std       0.158925
min       0.111111
25%       0.250000
50%       0.354167
75%       0.500000
max       0.625000
dtype: float64

In [40]:
# define a function that accepts a vectorizer and calculates the accuracy
def tokenize_test(vect):
    X_dtm = vect.fit_transform(X)
    print('Features: ', X_dtm.shape[1])
    nb = MultinomialNB()
    print(pd.Series(cross_val_score(nb, X_dtm, y, cv=10)).describe())

In [41]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect)

Features:  37905
count    10.000000
mean      0.379167
std       0.139512
min       0.125000
25%       0.333333
50%       0.375000
75%       0.468750
max       0.625000
dtype: float64
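Note that `tokenize_test` fits the vectorizer on every document before cross-validating, so each test fold contributes to the vocabulary. A possible alternative, sketched here on toy data, is to wrap the vectorizer and classifier in a pipeline so the vocabulary is learned from the training folds only. The `texts` and `labels` lists are stand-ins, not the notebook's `X` and `y`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy stand-ins for the texts and class labels (assumptions for illustration)
texts = ['good movie', 'great film', 'fine show', 'nice plot',
         'bad movie', 'awful film', 'poor show', 'dull plot']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# the pipeline refits the vectorizer inside each training fold
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(pipe, texts, labels, cv=2)
print(scores.mean())
```

With only 82 articles the difference may be small, but the pipeline form keeps the evaluation honest when comparing vectorizer settings.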


# Stopword Removal

- **What:** Remove common words that will likely appear in any text
- **Why:** They don't tell you much about your text



- **stop_words:** string {'english'}, list, or None (default)
- If 'english', a built-in stop word list for English is used.
- If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
- If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.

In [42]:
# remove English stop words
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)

Features:  7710
count    10.000000
mean      0.331944
std       0.146614
min       0.111111
25%       0.250000
50%       0.354167
75%       0.468750
max       0.500000
dtype: float64


In [43]:
# set of stop words
print(vect.get_stop_words())

frozenset({'ten', 'were', 'do', 'all', 'afterwards', 'becomes', 'herself', 'so', 'whether', 'somewhere', 'amongst', 'often', 'out', 'there', 'should', 'towards', 'seems', 'whose', 'any', 'myself', 'whoever', 'on', 'then', 'someone', 'back', 'none', 'without', 'whither', 'others', 'had', 'from', 'every', 'me', 'most', 'hereby', 'thereafter', 'ours', 'even', 'put', 'besides', 'de', 'once', 'full', 'behind', 'upon', 'mill', 'her', 'whatever', 'together', 'hasnt', 'we', 'which', 'yourself', 'am', 'him', 'get', 'each', 'done', 'during', 'noone', 'himself', 'amoungst', 'meanwhile', 'much', 'found', 'describe', 'fifteen', 'too', 'or', 'detail', 'thereupon', 'ever', 'except', 're', 'being', 'mine', 'whereby', 'four', 'some', 'same', 'further', 'while', 'see', 'fire', 'still', 'such', 'nothing', 'bill', 'whenever', 'off', 'whereupon', 'however', 'indeed', 'well', 'yet', 'through', 'un', 'thereby', 'either', 'forty', 'move', 'after', 'something', 'us', 'alone', 'few', 'by', 'eg', 'only', 'fifty'
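The effect of each `stop_words` setting can be checked on a single sentence. This is a small sketch with an illustrative sentence and a hypothetical custom list; `ENGLISH_STOP_WORDS` is the built-in list that `stop_words='english'` refers to.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

sentence = 'it is not the happy ending the company wanted'  # illustrative

# default: no stop word filtering
with_stops = CountVectorizer().build_analyzer()(sentence)

# built-in English list
without_stops = CountVectorizer(stop_words='english').build_analyzer()(sentence)

# hypothetical custom list: only these two words are removed
custom = CountVectorizer(stop_words=['company', 'wanted']).build_analyzer()(sentence)

print(with_stops)
print(without_stops)
print(custom)
print(len(ENGLISH_STOP_WORDS))
```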

# Other CountVectorizer Options

- **max_features:** int or None, default=None
- If not None, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.

In [44]:
# remove English stop words and only keep 100 features
vect = CountVectorizer(stop_words='english', max_features=100)
tokenize_test(vect)

Features:  100
count    10.000000
mean      0.356944
std       0.130298
min       0.125000
25%       0.260417
50%       0.375000
75%       0.468750
max       0.500000
dtype: float64


In [49]:
# all 100 features
print(vect.vocabulary_.keys())

dict_keys(['company', 'day', 'just', 'big', 'world', 'new', 'image', 'said', 'mashable', 'says', 'time', 've', 'trump', 'know', 'business', 'man', 'like', 'work', 'media', '2016', 'police', 'campaign', 'pic', 'twitter', 'com', 'way', 'people', 'don', 'make', 'facebook', 'year', 'old', 'digital', 'marketing', '15', 'best', 'platform', 'years', 'conversion', '11', 'good', '10', 'author', 'topics', 'window', 'article', 'null', 'internal', 'true', 'uncategorized', 'http', 'rack', 'mshcdn', 'jpg', 'og', 'url', 'title', 'description', 'https', 'photo', 'sailthru', 'false', 'short_url', 'hot', 'rising', 'function', 'oct', 'js', 'timer', 'twttr', 'return', 'initpage', 'entertainment', 'rights', 'movies', '2015', 'movie', 'life', 'jr', '01', 'state', 'paris', 'robert', 'posted', 'australian', 'open', '28', 'season', 'travel', 'instagram', 'pu', 'watercooler', 'premiere', 'daniel', 'cystic', 'fibrosis', 'iron', 'downey', 'rdj', '1cd'])


In [50]:
# include 1-grams and 2-grams, and limit the number of features
vect = CountVectorizer(ngram_range=(1, 2), max_features=1000)
tokenize_test(vect)

Features:  1000
count    10.000000
mean      0.355556
std       0.127496
min       0.222222
25%       0.250000
50%       0.354167
75%       0.375000
max       0.625000
dtype: float64


- **min_df:** float in range [0.0, 1.0] or int, default=1
- When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts.

In [None]:
# include 1-grams and 2-grams, and only include terms that appear in at least 2 documents
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
tokenize_test(vect)

Features:  7620
count    10.000000
mean      0.407594
std       0.141763
min       0.125000
25%       0.366477
50%       0.409722
75%       0.500000
max       0.571429
dtype: float64


# Stemming and Lemmatization

**Stemming:**

- **What:** Reduce a word to its base/stem/root form
- **Why:** Often makes sense to treat related words the same way
- **Notes:**
    - Uses a "simple" and fast rule-based approach
    - Stemmed words are usually not shown to users (used for analysis/indexing)
    - Some search engines treat words with the same stem as synonyms

In [51]:
# initialize stemmer
stemmer = SnowballStemmer('english')

In [52]:
vect = CountVectorizer()
vect.fit(X)

In [53]:
words = list(vect.vocabulary_.keys())[:100]

In [54]:
# stem each word
print([stemmer.stem(word) for word in words])

['appl', 'long', 'and', 'controversi', 'ebook', 'case', 'has', 'reach', 'it', 'final', 'chapter', 'it', 'not', 'the', 'happi', 'end', 'compani', 'want', 'suprem', 'court', 'on', 'monday', 'reject', 'an', 'appeal', 'file', 'by', 'to', 'overturn', 'sting', 'rule', 'that', 'led', 'broad', 'conspiraci', 'with', 'sever', 'major', 'publish', 'fix', 'price', 'of', 'book', 'sold', 'through', 'onlin', 'bookstor', 'decis', 'mean', 'now', 'no', 'choic', 'but', 'pay', 'out', '400', 'million', 'consum', 'addit', '50', 'in', 'legal', 'fee', 'accord', 'origin', 'settlement', '2014', 'see', 'also', 'here', 'how', 'marshal', 'entir', 'tech', 'industri', 'fight', 'fbi', 'for', 'verdict', 'is', 'more', 'damag', 'reput', 'as', 'consum', 'friend', 'brand', 'mention', 'legaci', 'belov', 'founder', 'steve', 'job', 'than', 'actual', 'bottom', 'line', 'put', 'fine', 'context']


**Lemmatization**

- **What:** Derive the canonical form ('lemma') of a word
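- **Why:** Returns real dictionary words rather than truncated stems, and can be more accurate than stemming when the part of speech is known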
- **Notes:** Uses a dictionary-based approach (slower than stemming)

In [55]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

In [56]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [57]:
# assume every word is a noun
print([wordnet_lemmatizer.lemmatize(word) for word in words])

['apple', 'long', 'and', 'controversial', 'ebook', 'case', 'ha', 'reached', 'it', 'final', 'chapter', 'it', 'not', 'the', 'happy', 'ending', 'company', 'wanted', 'supreme', 'court', 'on', 'monday', 'rejected', 'an', 'appeal', 'filed', 'by', 'to', 'overturn', 'stinging', 'ruling', 'that', 'led', 'broad', 'conspiracy', 'with', 'several', 'major', 'publisher', 'fix', 'price', 'of', 'book', 'sold', 'through', 'online', 'bookstore', 'decision', 'mean', 'now', 'no', 'choice', 'but', 'pay', 'out', '400', 'million', 'consumer', 'additional', '50', 'in', 'legal', 'fee', 'according', 'original', 'settlement', '2014', 'see', 'also', 'here', 'how', 'marshalled', 'entire', 'tech', 'industry', 'fight', 'fbi', 'for', 'verdict', 'is', 'more', 'damaging', 'reputation', 'a', 'consumer', 'friendly', 'brand', 'mention', 'legacy', 'beloved', 'founder', 'steve', 'job', 'than', 'actual', 'bottom', 'line', 'put', 'fine', 'context']


In [58]:
# assume every word is a verb
print([wordnet_lemmatizer.lemmatize(word,pos='v') for word in words])

['apple', 'long', 'and', 'controversial', 'ebook', 'case', 'have', 'reach', 'its', 'final', 'chapter', 'it', 'not', 'the', 'happy', 'end', 'company', 'want', 'supreme', 'court', 'on', 'monday', 'reject', 'an', 'appeal', 'file', 'by', 'to', 'overturn', 'sting', 'rule', 'that', 'lead', 'broad', 'conspiracy', 'with', 'several', 'major', 'publishers', 'fix', 'price', 'of', 'book', 'sell', 'through', 'online', 'bookstore', 'decision', 'mean', 'now', 'no', 'choice', 'but', 'pay', 'out', '400', 'million', 'consumers', 'additional', '50', 'in', 'legal', 'fee', 'accord', 'original', 'settlement', '2014', 'see', 'also', 'here', 'how', 'marshal', 'entire', 'tech', 'industry', 'fight', 'fbi', 'for', 'verdict', 'be', 'more', 'damage', 'reputation', 'as', 'consumer', 'friendly', 'brand', 'mention', 'legacy', 'beloved', 'founder', 'steve', 'job', 'than', 'actual', 'bottom', 'line', 'put', 'fine', 'context']


In [59]:
# define a function that accepts text and returns a list of lemmas
def split_into_lemmas(text):
    text = text.lower()
    words = text.split()
    return [wordnet_lemmatizer.lemmatize(word) for word in words]

In [60]:
# use split_into_lemmas as the feature extraction function (WARNING: SLOW!)
vect = CountVectorizer(analyzer=split_into_lemmas)
tokenize_test(vect)

Features:  10208
count    10.000000
mean      0.379167
std       0.108584
min       0.222222
25%       0.281250
50%       0.375000
75%       0.486111
max       0.500000
dtype: float64


# Term Frequency-Inverse Document Frequency (TF-IDF)

- **What:** Computes "relative frequency" that a word appears in a document compared to its frequency across all documents
- **Why:** More useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents)
- **Notes:** Used for search engine scoring, text summarization, document clustering

In [61]:
# example documents
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

In [64]:
# Term Frequency
vect = CountVectorizer()
# get_feature_names_out() lists terms in column order; vocabulary_.keys() does not
tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names_out())
tf

Unnamed: 0,cab,call,me,please,tonight,you
0,0,1,0,0,1,1
1,1,1,1,0,0,0
2,0,1,1,2,0,0


In [65]:
# Document Frequency
vect = CountVectorizer(binary=True)
df_ = vect.fit_transform(simple_train).toarray().sum(axis=0)
pd.DataFrame(df_.reshape(1, 6), columns=vect.get_feature_names_out())

Unnamed: 0,cab,call,me,please,tonight,you
0,1,3,2,1,1,1


In [66]:
# Term Frequency-Inverse Document Frequency (simple version: term count divided by document frequency)
tf/df_

Unnamed: 0,cab,call,me,please,tonight,you
0,0.0,0.333333,0.0,0.0,1.0,1.0
1,1.0,0.333333,0.5,0.0,0.0,0.0
2,0.0,0.333333,0.5,2.0,0.0,0.0


In [68]:
# TfidfVectorizer
vect = TfidfVectorizer()
pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names_out())

Unnamed: 0,cab,call,me,please,tonight,you
0,0.0,0.385372,0.0,0.0,0.652491,0.652491
1,0.720333,0.425441,0.547832,0.0,0.0,0.0
2,0.0,0.266075,0.34262,0.901008,0.0,0.0


**More details:** [TF-IDF is about what matters](http://planspace.org/20150524-tfidf_is_about_what_matters/)
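The `TfidfVectorizer` numbers differ from the simple `tf/df_` ratio because, with default settings, scikit-learn uses a smoothed logarithmic IDF, `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below reproduces that computation by hand on the same three example documents and compares it with the library output; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

# raw term counts and per-term document frequency
counts = CountVectorizer().fit_transform(simple_train).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# smoothed IDF (the TfidfVectorizer default, smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# weight the counts, then normalize each row to unit L2 norm (norm='l2' default)
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(simple_train).toarray()
print(np.allclose(manual, reference))
```

A term that appears in every document, such as "call", gets the minimum IDF of 1, so it is down-weighted relative to rarer terms but never zeroed out.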

# Using TF-IDF to Summarize a Text


In [70]:
# create a document-term matrix using TF-IDF
vect = TfidfVectorizer(stop_words='english')
dtm = vect.fit_transform(X)
features = vect.vocabulary_.keys()
dtm.shape

(82, 7710)

In [71]:
# choose a text to summarize (index 40 is an arbitrary pick, not a random draw)
review_id = 40
review_text = X[review_id]
review_length = len(review_text)

In [75]:
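# create a dictionary of words and their TF-IDF scores
# NOTE: list(features).index(word) is the word's first-seen position, not its column in dtm,
# so the scores read here belong to other terms; dtm[review_id, vect.vocabulary_[word]] reads the right column
word_scores = {}
for word in vect.vocabulary_.keys():
    word = word.lower()
    if word in features:
        word_scores[word] = dtm[review_id, list(features).index(word)]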

In [76]:
# print words with the top 5 TF-IDF scores
print('TOP SCORING WORDS:')
top_scores = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:5]
for word, score in top_scores:
    print(word)

TOP SCORING WORDS:
nurtured
delicious
virustotal
therapy
auditioned


In [77]:
# print 5 random words
print('\n' + 'RANDOM WORDS:')
random_words = np.random.choice(list(word_scores.keys()), size=5, replace=False)
for word in random_words:
    print(word)


RANDOM WORDS:
subtle
reporters
tops
essentially
reacted


# Conclusion

- NLP is a gigantic field
- Understanding the basics broadens the types of data you can work with
- Simple techniques go a long way
- Use scikit-learn for NLP whenever possible