# Lab 07

This week, we'll learn a number of smaller skills:

1. Getting Extracted Features Files
2. The Scipy Stack
3. Pandas: Combining DataFrames
4. Classification with Scikit Learn

## Getting Extracted Features Files

As we learned in the datasets class, there are 14 million books in the HathiTrust Extracted Features Dataset.

To download the EF dataset file for a book, you need its HathiTrust ID. You can see this in the URL when you find books in the HathiTrust; e.g. for the Tom Sawyer book at https://babel.hathitrust.org/cgi/pt?id=nyp.33433042068894, the id is nyp.33433042068894.

With the ID you can download the file at the following URL:

   > `https://bedrock.resnet.cms.waikato.ac.nz/vol-checker/VolumeCheck?download-id={{VOLUME ID}}`

For example:

   > https://bedrock.resnet.cms.waikato.ac.nz/vol-checker/VolumeCheck?download-id=nyp.33433042068894
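
If you have more than a couple of books, the two steps above (find the ID in the page URL, then fill it into the download URL) can be scripted. This is only a minimal sketch: the helper names `volume_id_from_url` and `ef_download_url` are made up for illustration, and the only thing taken from the lab is the URL template.

```python
from urllib.parse import urlparse, parse_qs

# URL template from this lab; the helper names below are hypothetical.
EF_TEMPLATE = "https://bedrock.resnet.cms.waikato.ac.nz/vol-checker/VolumeCheck?download-id={}"

def volume_id_from_url(page_url):
    """Pull the HathiTrust ID out of the `id` query parameter of a page URL."""
    query = parse_qs(urlparse(page_url).query)
    return query["id"][0]

def ef_download_url(volume_id):
    """Fill a volume ID into the download URL template."""
    return EF_TEMPLATE.format(volume_id)

page = "https://babel.hathitrust.org/cgi/pt?id=nyp.33433042068894"
vol_id = volume_id_from_url(page)
print(vol_id)
print(ef_download_url(vol_id))
```

Pasting the printed URL into a browser should behave the same as building it by hand.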

Two things that we aren't focusing on, but which you can explore if you want large numbers of files for your final projects:
- There are many ways to programmatically choose _many_ books at once, rather than looking up the books in the online interface. The easiest is to download the bibliographic metadata (called [Hathifiles](https://www.hathitrust.org/hathifiles)). This is a CSV file of all the available books.
- The main way to download files or lists of files is using a command line application called `rsync`, after converting the IDs to a file path. The reference for doing so is here: [Syncing a list of files](https://github.com/htrc/htrc-feature-reader/blob/master/examples/ID_to_Rsync_Link.ipynb).

## The Scipy Stack

You guys are becoming Pandas pros! Pandas is the foundation of much data science work by professionals today. As we continue, increasingly we'll be learning Pandas-specific code and conventions rather than all of Python.

Pandas is part of what is called the SciPy Stack: a selection of scientific tools that all work together. Here are its core members:

 - **Numpy**: A mathematical library, offering ways to represent multidimensional arrays.
 - **Scipy**: Foundational scientific code.
 - **Pandas**: A nicer way of structuring and working with data, through DataFrame and Series objects. You can think of Pandas as a more flexible version of Numpy's arrays, where you can name the columns and work with the data with more semantics.
 - **Matplotlib**: A visualization library. We've seen this!
 - **IPython**: This is the special interactive version of Python that you use in Jupyter! So you don't have to write a script, run it, edit it, run it again, and so on.
 
There are many tools that run very well with the SciPy stack and round out our data science environment:

- **Scikit Learn**: Where Scipy provides foundational tools, Scikit Learn gives you many advanced scientific algorithms for data science. Great documentation too: next week's clustering reading is from their documentation.
- **Jupyter**: The web browser notebook-style way of using IPython.
- **Seaborn**: Higher-level visualization tools. Matplotlib is like buying IKEA furniture: some assembly required. Seaborn is like buying pre-assembled furniture: much easier! Plus, applying Seaborn's default style makes your Matplotlib plots look nicer! (Older versions did this automatically on import; since version 0.8 you call `seaborn.set()`.)
- **Statsmodels**: Similar to Scipy, statsmodels offers additional statistics models and tools.

There are a few benefits to how standard these tools are. First, they tend to play nice together. Second, if you move into advanced libraries for really large scale analysis, those libraries tend to have the SciPy stack in mind too. Finally, they were installed by default when you installed Anaconda in the first week, so you already have them!

## Pandas: Combining DataFrames

To combine multiple DataFrames, you can use `pd.concat()` on a list of DataFrames (i.e. [dataframe1, dataframe2, etc.] ). For example:

In [1]:
import pandas as pd
test = pd.DataFrame([(1,'a'), (2,'b')])
test

Unnamed: 0,0,1
0,1,a
1,2,b


In [2]:
# combining two of the same dataframe:
list_of_dataframes = [test, test]
combined = pd.concat(list_of_dataframes)
combined

Unnamed: 0,0,1
0,1,a
1,2,b
0,1,a
1,2,b
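
Notice that the combined DataFrame repeats the row labels 0 and 1, because each input keeps its own index. If you would rather have fresh row numbers, `pd.concat` accepts `ignore_index=True`. A small sketch using the same toy DataFrame:

```python
import pandas as pd

test = pd.DataFrame([(1, 'a'), (2, 'b')])

# Default behaviour: each DataFrame keeps its own row labels, so 0 and 1 repeat.
combined = pd.concat([test, test])
print(combined.index.tolist())

# ignore_index=True discards the old labels and numbers the rows from 0.
renumbered = pd.concat([test, test], ignore_index=True)
print(renumbered.index.tolist())
```

Which one you want depends on whether the old labels carry meaning, such as book ids, or are just row numbers.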


## Naive Bayes Classification with Scikit Learn

I've prepared a set of training documents and testing documents for a French/English classifier in [english_french_class.csv](https://raw.githubusercontent.com/organisciak/Text-Mining-Course/master/data/classification/english_french_class.csv).

Load a CSV to a dataframe as follows:

In [3]:
url = 'https://raw.githubusercontent.com/organisciak/Text-Mining-Course/master/data/classification/english_french_class.csv'
data = pd.read_csv(url, encoding='utf-8').set_index('book')
data.head(2)

book,!,!—,!—the,"""","""""","""because","""if","""it","""only","""or",...,ﬂight,ﬂights,ﬂoor,ﬂown,ﬂuid,ﬂung,ﬂush,ﬂushed,ﬂy,ﬂying
hvd.32044014292023,868.0,0.0,0.0,4582.0,2.0,6.0,10.0,22.0,7.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
hvd.32044102860673,1354.0,0.0,0.0,139.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


This loaded a 'wide' DataFrame, where each row is a book, each column is a word, and the cell $value_{row, column}$ is the count for that $word_{column}$ in $book_{row}$. The book ids were also a column, but we converted those to an index after loading with `set_index('book')`. I'll detail later how this information was collected.
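
To make the 'wide' shape concrete, here is a sketch that builds a tiny wide table from made-up counts in 'long' form (one row per book and word pair). The books, words, and counts are invented, and this is not necessarily how the real file was prepared; it only illustrates the layout.

```python
import pandas as pd

# Made-up token counts in 'long' form: one row per (book, word) pair.
long_counts = pd.DataFrame({
    'book':  ['book_a', 'book_a', 'book_b', 'book_b'],
    'word':  ['the', 'cat', 'le', 'chat'],
    'count': [10, 2, 8, 3],
})

# Pivot to 'wide' form: one row per book, one column per word.
# Words that never occur in a book get a count of 0 rather than NaN.
wide = long_counts.pivot_table(index='book', columns='word', values='count',
                               aggfunc='sum', fill_value=0)
print(wide)
```

Every book ends up with a value for every word, which is why the real table has so many zeros.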

There is also a CSV with the [truth labels](https://raw.githubusercontent.com/organisciak/Text-Mining-Course/master/data/classification/english_french_class_labels.csv) for each book:

In [4]:
url = 'https://raw.githubusercontent.com/organisciak/Text-Mining-Course/master/data/classification/english_french_class_labels.csv'
labels = pd.read_csv(url, encoding='utf-8')
labels.head(2)

Unnamed: 0,book,title,language
0,hvd.32044014292023,"Alice's adventures in Wonderland ; and, Throug...",eng
1,hvd.32044102860673,"Notre Dame de Paris. Abridged and edited, with...",fre


For train/test, we'll use half of the documents to build a classifier and the other half to test it.

(`iloc` allows you to slice dataframes by number; e.g. `iloc[0:6]`.)

In [21]:
train_data = data.iloc[0:6]
train_labels = labels.iloc[0:6]
# print(train_data)
# print(train_labels)

test_data = data.iloc[6:]
test_labels = labels.iloc[6:]
# print(test_data)
# print(test_labels)
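
A short sketch of what `iloc` slicing does, using made-up stand-ins for `data` and `labels` (the names `toy_data` and `toy_labels` are invented for this example). It also checks something the split above quietly relies on: the data rows and the label rows must be in the same order.

```python
import pandas as pd

# Toy stand-ins for the lab's `data` (indexed by book) and `labels` (numbered rows)
toy_data = pd.DataFrame({'the': [5, 7, 0, 1], 'le': [0, 0, 6, 4]},
                        index=['b1', 'b2', 'b3', 'b4'])
toy_labels = pd.DataFrame({'book': ['b1', 'b2', 'b3', 'b4'],
                           'language': ['eng', 'eng', 'fre', 'fre']})

# iloc slices by position: the start is included, the end is not.
first_half = toy_data.iloc[0:2]   # rows at positions 0 and 1
second_half = toy_data.iloc[2:]   # everything from position 2 onward

# Slicing both frames by position only works if their rows line up.
same_order = (toy_data.index == toy_labels['book'].values).all()
print(first_half.index.tolist(), second_half.index.tolist(), same_order)
```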

Naive Bayes classification is straightforward to use with Scikit Learn. To train, you need data and the correct classes:

```python 
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(training_data, training_labels)
```

To predict the class for unknown books, format their word frequencies in the same order and pass the information to the classifier:

```python
classifier.predict(new_data)
```

Let's try it for real:

In [22]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()

# Train our model!
classifier.fit(train_data, train_labels['language'])

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

And now load a new set of books and predict them. The books of the test dataset are:

1. Les caves du Vatican
2. Madame Bovary
3. Jean Barois
4. Catch-22
5. The Catcher in the Rye
6. The Lord of the Rings

Let's predict their languages:

In [7]:
# Predict the language of another book
classifier.predict(test_data)

array(['fre', 'fre', 'fre', 'eng', 'eng', 'eng'], 
      dtype='<U3')

Perfect classification!

For most classification tasks, accuracy is lower; languages, however, are very distinct. You can see the underlying information from the classifier with `classifier.predict_log_proba(test_data)`, which returns the log probability of each class for each book. A value of zero (a probability of 1) marks the chosen class, and the closer to zero the other values are, the closer their class probability was to that of the selected class.
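
Here is a sketch of reading log probabilities on a made-up problem. The four tiny 'books', their word counts, and the name `toy_classifier` are all invented so that the example runs without the lab data.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

# Made-up word counts for four tiny 'books'
X_train = pd.DataFrame([[9, 4, 0, 0], [7, 5, 1, 0], [0, 0, 8, 6], [1, 0, 9, 5]],
                       columns=['the', 'and', 'le', 'et'])
y_train = ['eng', 'eng', 'fre', 'fre']
X_new = pd.DataFrame([[6, 3, 0, 1], [0, 1, 7, 4]], columns=X_train.columns)

toy_classifier = MultinomialNB().fit(X_train, y_train)

# One row per book, one column per class (in the order of classes_)
log_proba = toy_classifier.predict_log_proba(X_new)
proba = np.exp(log_proba)  # back to ordinary probabilities; each row sums to 1

# The predicted class is the column with the largest (closest to zero) value.
best = toy_classifier.classes_[log_proba.argmax(axis=1)]
print(toy_classifier.classes_)
print(log_proba.round(3))
print(best)
```

Taking `np.exp` of the log probabilities is often the easiest way to judge how confident the classifier was.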

These books are 'test' documents because we know the real answer. The true labels can be given to `classifier.score`, to count what proportion of classifications are correct.

In [8]:
classifier.score(test_data.values, test_labels['language'])

1.0
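
The score is simply the share of predictions that match the truth labels. A sketch with invented counts, where one truth label deliberately disagrees with the prediction so the score is not a perfect 1.0:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up counts for two 'words'; all values here are invented.
X_train = np.array([[9, 0], [8, 1], [0, 7], [1, 9]])
y_train = np.array(['eng', 'eng', 'fre', 'fre'])
X_test = np.array([[5, 1], [0, 4], [6, 0]])
y_test = np.array(['eng', 'fre', 'fre'])  # the last label disagrees with the counts on purpose

toy_classifier = MultinomialNB().fit(X_train, y_train)
predictions = toy_classifier.predict(X_test)

# Proportion of predictions that equal the truth label
manual_accuracy = (predictions == y_test).mean()
print(predictions)
print(manual_accuracy, toy_classifier.score(X_test, y_test))
```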

There isn't much complexity to this code. The tricky parts are in getting the data structured properly. Scikit Learn doesn't rely on the column names from Pandas; it just pulls out the values. So if your training data looks like:

```
document 1: [word X count, word Y count, ... word Z]
document 2: [word X count, word Y count, ... word Z]
```

Then you need to make sure that your future documents order their counts as X, Y, ... Z.
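
One way to enforce that ordering in Pandas is `reindex`. This is a sketch under made-up assumptions: `train_columns` stands in for the column list of your training DataFrame, and `new_doc` is an invented document with its words in a different order.

```python
import pandas as pd

# Column order that a (hypothetical) classifier was trained on
train_columns = ['the', 'and', 'le']

# A new document whose counts arrive in a different order, with one extra
# word ('chat') and one missing word ('and').
new_doc = pd.DataFrame([[4, 2, 7]], columns=['le', 'chat', 'the'], index=['new_book'])

# reindex puts the columns in training order, drops words the model never saw,
# and fills words absent from the new document with a count of 0.
aligned = new_doc.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

With real data you would pass the training DataFrame's own `.columns` rather than typing the list by hand.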

To better see the information Scikit Learn is using, consider this test DataFrame:

In [9]:
test_df = pd.DataFrame([[1,2,3], [4,5,6]], columns=['A', 'B', 'C'], index=['i', 'ii'])
test_df

Unnamed: 0,A,B,C
i,1,2,3
ii,4,5,6


This is the information that the classifier actually uses:

In [10]:
test_df.values

array([[1, 2, 3],
       [4, 5, 6]])

It looks like a list of lists, doesn't it? It's a Numpy array, which for now you can think of as a smarter, faster version of a list of lists. What matters is remembering that it is just numbers arranged in two dimensions, so don't expect the classifier to know what word each number refers to.
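
A quick sketch of poking at that array, reusing the same small DataFrame, to show that only positions survive and the names are gone:

```python
import pandas as pd

test_df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'], index=['i', 'ii'])
values = test_df.values

print(type(values).__name__)  # the Numpy array type
print(values.shape)           # (number of rows, number of columns)
print(values[1, 2])           # row position 1, column position 2: no 'ii' or 'C' here
print(values.tolist())        # the same numbers as a plain list of lists
```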

The truth labels are just one dimension:

In [11]:
test_labels['language'].values

array(['fre', 'fre', 'fre', 'eng', 'eng', 'eng'], dtype=object)

# Questions

Load the following data:

In [12]:
path = 'https://raw.githubusercontent.com/organisciak/Text-Mining-Course/master/data/contemporary_books/'
author_data = pd.read_csv(path + 'contemporary.csv', encoding='utf-8').set_index('book')
author_labels = pd.read_csv(path + 'contemporary_labels.csv', encoding='utf-8')

You'll be building a classifier for author.

**Q1**: What are the 3 classes?

In [13]:
author_labels.head()

#Atwood, King and Grisham?

Unnamed: 0,book,author,title
0,mdp.39015005028686,King,The stand
1,mdp.39015010763418,Atwood,Lady oracle;
2,mdp.39015027242315,Atwood,The robber bride
3,mdp.39015029244657,Grisham,The pelican brief
4,mdp.39015031703609,Grisham,The rainmaker


**Q2**: Show the code to split the data and truth labels into a test and train dataset, using the first fifteen books for training.

In [14]:
train_author_data = author_data.iloc[0:15]
train_author_labels = author_labels.iloc[0:15]

# The remaining books form the test set
test_author_data = author_data.iloc[15:]
test_author_labels = author_labels.iloc[15:]

train_author_labels.head(20)

Unnamed: 0,book,author,title
0,mdp.39015005028686,King,The stand
1,mdp.39015010763418,Atwood,Lady oracle;
2,mdp.39015027242315,Atwood,The robber bride
3,mdp.39015029244657,Grisham,The pelican brief
4,mdp.39015031703609,Grisham,The rainmaker
5,mdp.39015038148048,King,Desperation
6,mdp.39015040702071,Atwood,Alias Grace
7,mdp.39015043780249,King,The girl who loved Tom Gordon
8,mdp.39015043798936,King,Bag of bones
9,mdp.39015046381565,Grisham,A time to kill


**Q3**: Create and train a Multinomial Naive Bayes classifier. Paste your code.

In [15]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()

# Train
classifier.fit(train_author_data, train_author_labels['author'])

# Run classifier
classifier.predict(test_author_data)

**Q4**: What are the classifier's predictions on the test data? Fill in the numbered values below:

```
array(['Grisham', '[[1]]', 'Atwood', 'King', 'Grisham', '[[2]]', 'King',
       'Grisham', 'King', '[[3]]', 'Atwood', 'Atwood', 'Grisham', 'King',
       '[[4]]', 'King'], 
```

## Answers:

1. Atwood
2. Grisham
3. Atwood
4. Atwood

**Q5**: What is the classifier's accuracy to four decimal places? i.e. X.XXXX

In [None]:
classifier.score(test_author_data.values, test_author_labels['author'])

**Q6**: Which books were classified correctly or incorrectly:

- Cell by Stephen King: [Correct, Incorrect]
- The Handmaid's Tale by Margaret Atwood: [Correct, Incorrect]
- Danse macabre by Stephen King: [Correct, Incorrect]
- Cat's eye by Margaret Atwood: [Correct, Incorrect]

_Make sure you're getting the correct answer before continuing._

In [None]:
# Q6
pred_author = classifier.predict(test_author_data)
print(pred_author)

test_author_labels.head(50)

**Q7**: Build a classification report with Scikit Learn as [shown here](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#evaluation-of-the-performance-on-the-test-set). The 'names' of classes are in `classifier.classes_`. Fill in the missing precision and recall values below:

```
             precision    recall    f1-score   support

     Atwood       [[1]]     1.00       0.91         5
    Grisham       1.00      [[2]]      1.00         5
       King       1.00      0.83       0.91         6

avg / total       0.95      [[3]]      0.94        16
```

In [None]:
from sklearn import metrics

# ?metrics.classification_report

names = classifier.classes_

print(metrics.classification_report(test_author_labels['author'], pred_author, target_names=names))
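
If the columns of the report are unfamiliar, this sketch computes precision and recall on six invented documents, small enough to check by hand. The labels below are made up and have nothing to do with the author data.

```python
from sklearn import metrics

# Made-up truth and predictions for six documents
truth = ['eng', 'eng', 'eng', 'fre', 'fre', 'fre']
predicted = ['eng', 'eng', 'fre', 'fre', 'fre', 'fre']

# Precision: of the documents predicted as a class, how many really are that class?
#   'fre' was predicted 4 times and 3 of those are right.
# Recall: of the documents that really are a class, how many were found?
#   3 documents are truly 'eng' and 2 of them were predicted 'eng'.
precision = metrics.precision_score(truth, predicted, average=None, labels=['eng', 'fre'])
recall = metrics.recall_score(truth, predicted, average=None, labels=['eng', 'fre'])
print(precision, recall)
print(metrics.classification_report(truth, predicted))
```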

**Q8**: Try to train a new classifier, without smoothing. In other words, set classifier.alpha to 0 before training. What happens to the predictions and what do the underlying probability patterns point to as the reason?

In [None]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.alpha = 0

# Train
classifier.fit(train_author_data, train_author_labels['author'])

# Run classifier
pred_auth2 = classifier.predict(test_author_data)
pred_auth_train = classifier.predict(train_author_data)

print(metrics.classification_report(test_author_labels['author'], pred_auth2, target_names=names))
print(metrics.classification_report(train_author_labels['author'], pred_auth_train, target_names=names))

**Q9**: [2 marks] [This printing of Sense and Sensibility](https://catalog.hathitrust.org/Record/008663968) has two volumes. I'm interested in looking at word patterns for the combined set of volumes all at once. How would you download the Extracted Features files for both volumes, and what code would you use to read and join the token count DataFrames into one long DataFrame?

In [None]:
from htrc_features import FeatureReader
import os

# First, use the above URL (https://bedrock.resnet.cms.waikato.ac.nz/vol-checker/VolumeCheck?download-id=<vol ID>) twice to download EF files,
#  changing out volume ID with 'nyp.33433074943592' and 'nyp.33433074943600' to get both volumes of Sense & Sensibility.

path1 = [os.path.join('/Users/rdubnic2/Documents/lis590txl/Data', 'nyp.33433074943592.json.bz2')]
fr1 = FeatureReader(path1)

vol1 = fr1.first()
print(vol1)

vol1_tokens = vol1.tokenlist()
print(len(vol1_tokens))

path2 = [os.path.join('/Users/rdubnic2/Documents/lis590txl/Data', 'nyp.33433074943600.json.bz2')]
fr2 = FeatureReader(path2)

vol2 = fr2.first()
print(vol2)

vol2_tokens = vol2.tokenlist()
print(len(vol2_tokens))

sense_sensibility_vols = [vol1_tokens, vol2_tokens]

s_and_s = pd.concat(sense_sensibility_vols)
s_and_s

## Extra Notes

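I mentioned that Seaborn makes your matplotlib graphics prettier. In older versions simply importing it was enough; since Seaborn 0.8 you also need to call `seaborn.set()`. Here's an example:

In [None]:
%matplotlib inline
made_up_data = pd.Series([1,2,1,2,3,0,1])
made_up_data.plot()

In [None]:
import seaborn
seaborn.set()  # apply Seaborn's default style (automatic on import before version 0.8)
made_up_data.plot()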