# Hist 3368

## Word Vectors With Ngrams

### Play Around With CountVectorizer

You may be thinking that this was a lot of work for a wordcount dataframe that you could have made in simpler ways. But you've just learned to use an incredibly powerful tool.

In [559]:
vectorizer2 = CountVectorizer(max_features=100000, 
                              lowercase=True, 
                              stop_words = 'english',
                              ngram_range=(1, 1),  # <-- to see how this works, try changing (1, 1) to (1, 2) or (3, 4). The two numbers are the lower and upper bounds on n-gram length.
                              analyzer = "word")

In [560]:
vectorized2 = vectorizer2.fit_transform(top_speakers['speech'])

In [561]:
all_words = vectorizer2.get_feature_names_out()  # returns a NumPy array; replaces get_feature_names(), which was removed in scikit-learn 1.2

vectors_dataframe2 = pd.DataFrame(vectorized2.toarray(),  # the sparse matrix we saw above is turned into a dataframe
                                  columns=all_words,
                                  index=speaker_names)

In [562]:
vectorized2.shape

(9, 32892)

In [563]:
vectors_dataframe2

Unnamed: 0,_concerned,_gasoline,_n,_percent,_tinue,aa,aaa,aap,aaron,aauw,...,zonethe,zoning,zoomed,zooming,zurich,zurichwhich,zweibrucken,zwich,zwick,zwicks
Mr. JAVITS,0,0,0,0,0,0,2,0,0,0,...,0,0,0,0,1,1,0,0,0,0
Mr. LONG of Louisiana,0,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,5,2
Mr. MANSFIELD,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
Mr. BYRD of West Virginia,0,0,1,0,0,0,0,0,2,0,...,0,1,0,0,0,0,0,0,0,0
Mr. WILLIAMS of Delaware,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,33,0
Mr. PROXMIRE,0,1,0,0,0,1,0,4,0,4,...,0,0,1,2,0,0,0,0,2,0
Mr. DODD,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Mr. HOLLAND,1,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
Mr. TYDINGS,0,0,0,0,0,0,1,0,2,0,...,0,1,0,0,0,0,0,0,2,0


Pretty cool, eh? We're not going to do anything with this -- although you certainly could use it for the exercise that follows.  

You can actually use the features of CountVectorizer to do all sorts of things -- to clean your data, remove stopwords, lowercase it, and even implement a controlled vocabulary.

For instance, if I had a controlled vocabulary called my_dict and a stopwords list called stop_words_list, I could do this:

    CountVectorizer(lowercase=True,
                    stop_words=stop_words_list,
                    vocabulary=my_dict,
                    ngram_range=(3, 4),
                    analyzer="word")


Read up more here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html