# Intro to scikit-learn (sklearn)

Much of what we'll do the rest of the semester entails turning words into numbers: tf-idf, topic modeling, BERT, similarity, classification, clustering. Python's machine learning library, scikit-learn, will be crucial to many of these methods. Today we'll just introduce ourselves to the library, setting ourselves up for what's to come.

## Install scikit-learn

We begin by installing scikit-learn. The package is named `scikit-learn` on PyPI; `sklearn` is only the name we use when importing it in Python.

In [1]:
!pip install scikit-learn


## Import CountVectorizer

Now we import `CountVectorizer`, which [converts a collection of text documents to a matrix of token counts](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Turning words into numbers in this way allows us to perform analyses according to a philosophy of language called distributional semantics, which underlies much of data science with text. Learn more by referring to [today's reading](https://web.stanford.edu/~jurafsky/slp3/6.pdf).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

#instantiate CountVectorizer()
cv=CountVectorizer()

## Set Directory Path

Below we're setting the directory filepath that contains all the lyrics text files that we want to analyze.

In [13]:
directory_path = "../notebooks/lyrics/"

Then we're going to use `glob` and `Path` to make a list of all the lyrics filepaths in that directory.

In [14]:
import glob
from pathlib import Path

# Path joins the folder and the pattern cleanly, even with a trailing slash
text_files = sorted(glob.glob(str(Path(directory_path) / "*.txt")))
text_files
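
As a minimal sketch of how `glob` and `Path` work together, the example below builds a temporary folder with two made-up files so that it runs anywhere. The file names, their contents, and the `demo_` variables are all invented for illustration. `Path(...).stem` gives a title-like name for each song, and `read_text` loads its lyrics.

```python
import glob
import tempfile
from pathlib import Path

# build a throwaway folder with two made-up lyrics files (hypothetical names and text)
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "song_b.txt").write_text("la la la", encoding="utf-8")
(demo_dir / "song_a.txt").write_text("na na hey", encoding="utf-8")

# glob returns matches in arbitrary order, so sort for a stable list
demo_files = sorted(glob.glob(str(demo_dir / "*.txt")))

# Path(...).stem is the filename without its folder or extension
demo_titles = [Path(f).stem for f in demo_files]

# read each file's contents into one string per song
demo_docs = [Path(f).read_text(encoding="utf-8") for f in demo_files]

print(demo_titles)
print(demo_docs)
```

If the list of files comes back empty, the directory path is the first thing to check.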

In [None]:
# read each lyrics file into a string (UTF-8 encoding is an assumption)
all_docs = [Path(file).read_text(encoding="utf-8") for file in text_files]

# this step generates word counts for the words in your docs
word_count_vector = cv.fit_transform(all_docs)

# check shape: (number of documents, number of unique words)
print(word_count_vector.shape)

# cv.vocabulary_ maps each word to its column index in the matrix (not to its count)
cv.vocabulary_

# to get counts, sum each column of the word_count_vector matrix:
# sum_words holds the total occurrences of each word across all texts in the corpus
sum_words = word_count_vector.sum(axis=0)

# pair each word with its total, then sort from most to least frequent
words_freq = [(word, sum_words[0, idx]) for word, idx in cv.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

# display the top 10
words_freq[:10]
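
The column-summing and sorting steps can be tried without any files on disk. The sketch below runs them on three made-up strings; `demo_lyrics` and the other `demo_` names are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# three made-up "songs" (hypothetical text, not from the lyrics folder)
demo_lyrics = ["sun sun moon", "moon star", "sun star star sun"]

demo_cv = CountVectorizer()
demo_matrix = demo_cv.fit_transform(demo_lyrics)

# summing down each column gives one total per word across the whole corpus
demo_sums = demo_matrix.sum(axis=0)

# vocabulary_ maps word -> column index, which we use to look up each total
demo_freq = [(word, int(demo_sums[0, idx])) for word, idx in demo_cv.vocabulary_.items()]
demo_freq = sorted(demo_freq, key=lambda pair: pair[1], reverse=True)

print(demo_freq)
```

Wrapping each total in `int()` is a cosmetic choice so that the tuples print as plain Python integers.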