<font color='orange' size=6>Standard Text Mining Pipeline</font>  
Working through a scikit-learn tutorial, as recommended by the **LiveProject on Authorship analysis**
<hr>
Mar 7, 2021

# Scratch and Getting Started


In [1]:
conda list

# packages in environment at /Users/bradgreenwald/miniconda3/envs/nlp:
#
# Name                    Version                   Build  Channel
appnope                   0.1.2           py38hecd8cb5_1001    defaults
argon2-cffi               20.1.0           py38h9ed2024_1    defaults
async_generator           1.10               pyhd3eb1b0_0    defaults
attrs                     20.3.0             pyhd3eb1b0_0    defaults
backcall                  0.2.0              pyhd3eb1b0_0    defaults
beautifulsoup4            4.9.3              pyha847dfd_0    defaults
blas                      1.0                         mkl    defaults
bleach                    3.3.0              pyhd3eb1b0_0    defaults
brotlipy                  0.7.0           py38h9ed2024_1003    defaults
ca-certificates           2021.1.19            hecd8cb5_0    defaults
catalogue                 1.0.0                    py38_1    defaults
certifi                   2020.12.5        py38hecd8cb5_0    defaults
c


Note: you may need to restart the kernel to use updated packages.


In [6]:
os.getcwd()

'/Users/bradgreenwald/projects/manning/Author_ID_Live_Project'

# Imports

In [233]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Perceptron
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn import metrics
import os
import sys
import numpy as np
from collections import Counter
from pprint import pprint as pp

# Config

In [166]:
%config Completer.use_jedi=False

# Helper classes and functions

In [65]:
dirp = lambda x: [d for d in dir(x) if not d.startswith('_')]

Where is the tutorial?  
> Supposedly there is a `doc` dir under sklearn module, but I can't find it.  

# The Main Tutorial
Working through the **sklearn tutorial** as an intro (and as a refresher on my earlier but interrupted work on NLP) to the Manning Live Project Authorship ID with Text Mining and ML  

## Load data and create dataset
- The data is a subset of the sklearn dataset called 20 Newsgroups, restricted to four (4) particular categories within that dataset.  According to sklearn, "To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering."  

The process is to use the built-in dataset loader for 20 newsgroups from scikit-learn. Alternatively, it is possible to download the dataset manually from the website and use the sklearn.datasets.load_files function by pointing it to the 20news-bydate-train sub-folder of the uncompressed archive folder.

In [7]:
# setup limited list of categories to create smaller dataset to be faster
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

In [8]:
from sklearn.datasets import fetch_20newsgroups

In [9]:
# subset will be records from each of the four categories, shuffled
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

In [11]:
# twenty_train is a sklearn "dataset" which is like a dict with the following keys
dir(twenty_train)

['DESCR', 'data', 'filenames', 'target', 'target_names']

In [12]:
# this is the subset of categories
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [13]:
len(twenty_train.data)

2257

In [14]:
len(twenty_train.filenames)

2257

In [16]:
print('\n'.join(twenty_train.data[0].split('\n')[:3]))

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton


In [22]:
# labels encoded as ints
twenty_train.target[:10]

array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])

## BG practice - refamiliarize myself by building a BOW vectorizer and related functions from scratch

### Functions developed

In [151]:
def preprocess(doc):
    """Create a bag of words from doc, which is a single string"""
    # replace linebreaks with spaces (not empty string because sometimes line break separates words)
    temp = doc.replace('\n', ' ')
    
    # split on white space, which may leave many empty strings needing to be removed from the list
    temp = temp.split(' ')
    temp = [tok for tok in temp if tok != '']
    
    # remove punct at start or end of words
    punct1 = '.:,?)(]['
    temp = mod_punct(temp, punct1)
    
    # Convert to lowercase (could just do initial caps now that punct is removed from beginning of words)
    temp = [word.lower() for word in temp]
    
    return temp


def create_wordmap_to_vectorize(words):
    """Create an int->word dict and a word->int dict, returned in that order"""
    word_set = set(words)   # remove dupes
    int_word_map = dict()   # index -> word
    word_int_map = dict()   # word -> index
    
    # create dict and rev dict
    for idx, word in enumerate(word_set):
        int_word_map[idx] = word
        word_int_map[word] = idx
        
    return int_word_map, word_int_map


def vectorize_wordlist(wordmap, wordlist):
    """Create and return a word vector which has frequencies of words in wordlist at the positional index. """
    word_vector = [0] * len(wordmap)
    for word in wordlist:        
        word_vector[wordmap[word]] += 1
    return word_vector


def mod_punct(word_list, punct):
    """Modify words in word list by removing punctuation at start or end of word"""
    new_words = []
    for word in word_list:
        if word[-1] in punct:
            if word[0] in punct:
                new_words.append(word[1:-1])
            else:
                new_words.append(word[:-1])
        elif word[0] in punct:
            new_words.append(word[1:])

        else:
            new_words.append(word)
    return new_words

Test of the preprocessing so far: splitting on whitespace, removing punctuation, lowercasing, creating the word map, then creating a word vector.   The functions developed in the sections below are hoisted into the cell above and tested here.
- Retested with the second data item and it seems to work.  It's quite basic, and as noted below there are many limitations (the output below still shows leading tabs attached to tokens such as `'\thi'`), but it's a very good start, I think.  

In [159]:
doc2 = twenty_train.data[1]
print(doc2)

From: ani@ms.uky.edu (Aniruddha B. Deglurkar)
Subject: help: Splitting a trimming region along a mesh 
Organization: University Of Kentucky, Dept. of Math Sciences
Lines: 28



	Hi,

	I have a problem, I hope some of the 'gurus' can help me solve.

	Background of the problem:
	I have a rectangular mesh in the uv domain, i.e  the mesh is a 
	mapping of a 3d Bezier patch into 2d. The area in this domain
	which is inside a trimming loop had to be rendered. The trimming
	loop is a set of 2d Bezier curve segments.
	For the sake of notation: the mesh is made up of cells.

	My problem is this :
	The trimming area has to be split up into individual smaller
	cells bounded by the trimming curve segments. If a cell
	is wholly inside the area...then it is output as a whole ,
	else it is trivially rejected. 

	Does any body know how thiss can be done, or is there any algo. 
	somewhere for doing this.

	Any help would be appreciated.

	Thanks, 
	Ani.
-- 
To get irritated is human, to stay cool, divi

In [160]:
pp_doc = preprocess(doc2)

In [161]:
len(pp_doc), pp_doc

(181,
 ['from',
  'ani@ms.uky.edu',
  'aniruddha',
  'b',
  'deglurkar',
  'subject',
  'help',
  'splitting',
  'a',
  'trimming',
  'region',
  'along',
  'a',
  'mesh',
  'organization',
  'university',
  'of',
  'kentucky',
  'dept',
  'of',
  'math',
  'sciences',
  'lines',
  '28',
  '\thi',
  '\ti',
  'have',
  'a',
  'problem',
  'i',
  'hope',
  'some',
  'of',
  'the',
  "'gurus'",
  'can',
  'help',
  'me',
  'solve',
  '\tbackground',
  'of',
  'the',
  'problem',
  '\ti',
  'have',
  'a',
  'rectangular',
  'mesh',
  'in',
  'the',
  'uv',
  'domain',
  'i.e',
  'the',
  'mesh',
  'is',
  'a',
  '\tmapping',
  'of',
  'a',
  '3d',
  'bezier',
  'patch',
  'into',
  '2d',
  'the',
  'area',
  'in',
  'this',
  'domain',
  '\twhich',
  'is',
  'inside',
  'a',
  'trimming',
  'loop',
  'had',
  'to',
  'be',
  'rendered',
  'the',
  'trimming',
  '\tloop',
  'is',
  'a',
  'set',
  'of',
  '2d',
  'bezier',
  'curve',
  'segments',
  '\tfor',
  'the',
  'sake',
  'of',
  'no

In [162]:
int_idx, word_idx = create_wordmap_to_vectorize(pp_doc)

In [163]:
w_vec = vectorize_wordlist(word_idx, pp_doc)
len(pp_doc), len(w_vec), sum(w_vec)

(181, 113, 181)

### Bag of words

The most intuitive way to turn text into numerical feature vectors is a bag-of-words representation:

- Assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).
- For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j where j is the index of word w in the dictionary.
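
As a minimal sketch of those two steps, assuming the documents are already tokenized into word lists (the toy `docs`, `vocab`, and `X` names are illustrative only, not part of the tutorial):

```python
# Toy corpus: each document is already a list of tokens (an assumption for this sketch)
docs = [
    ["the", "mesh", "is", "a", "mesh"],
    ["the", "loop", "is", "closed"],
]

# Step 1: assign a fixed integer id to every word seen in any document
vocab = {word: j for j, word in enumerate(sorted({w for d in docs for w in d}))}

# Step 2: X[i][j] counts how often word j occurs in document i
X = [[0] * len(vocab) for _ in docs]
for i, doc in enumerate(docs):
    for word in doc:
        X[i][vocab[word]] += 1

print(vocab)  # {'a': 0, 'closed': 1, 'is': 2, 'loop': 3, 'mesh': 4, 'the': 5}
print(X)      # [[1, 0, 1, 0, 2, 1], [0, 1, 1, 1, 0, 1]]
```

Sorting the vocabulary makes the column order reproducible, which a plain `set` does not guarantee.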

#### Practice

BG - let's start with a toy example just to have the simple idea worked out  
1. For a given document, we need a list of each word in that document.  How do we do that? We could break on spaces?  Remove punctuation?  
    - Let's see what we get by splitting on white space; try the first doc in the dataset
        1. First problem is formatting codes, e.g. line breaks '\n'.  We shouldn't simply delete them, because they are word breaks, so probably replace them with a space and then split on white space.
        2. Second problem is runs of multiple spaces, which we can handle with a list comprehension after the split, by removing all empty strings.  
        3. Removing punctuation is difficult when dealing with things like email addresses or URLs.  We'd need to identify those first.  Or maybe, for periods only, substitute whitespace if there is at least one white space after?  Another idea is to remove punctuation ONLY AFTER the word list is created, and then only if it is at the start or end of a word.  
        4. Re punctuation: it may also indicate the presence of something like a fax number, address, etc.  
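
One possible way to combine these ideas, sketched here as an alternative rather than as the approach taken in this notebook: `str.split()` with no argument splits on any run of whitespace (spaces, tabs, line breaks), and `str.strip(string.punctuation)` removes punctuation only at the edges of a token, so email addresses survive intact. The `tokenize` helper name is hypothetical.

```python
import string

def tokenize(doc):
    """Split on any whitespace, strip punctuation from token edges, lowercase."""
    tokens = []
    for raw in doc.split():  # no argument: splits on runs of spaces, tabs and newlines
        tok = raw.strip(string.punctuation).lower()
        if tok:              # tokens made only of punctuation (e.g. '--') become empty
            tokens.append(tok)
    return tokens

sample = "From: sd345@city.ac.uk (Michael Collier)\n\tHi,  any  help? --"
print(tokenize(sample))
# ['from', 'sd345@city.ac.uk', 'michael', 'collier', 'hi', 'any', 'help']
```

Interior punctuation such as the `/` in `tif/img/tga` is left alone, so this still does not split compound tokens.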

#### A mini-function with preprocessing and tokenizing

In [52]:
def preprocess(doc):
    """Create a bag of words from doc, which is a single string"""
    # replace linebreaks with spaces (not empty string because sometimes line break separates words)
    temp = doc.replace('\n', ' ')
    
    # TODO: replace punctuation with white space; eventually consider regex, for now use a list
    punct = ':/?'   # not used yet
    
    # split on white space, which may leave many empty strings needing to be removed from the list
    temp = temp.split(' ')
    temp = [tok for tok in temp if tok != '']   
    
    return temp

In [29]:
doc1 = twenty_train.data[0]
doc1

'From: sd345@city.ac.uk (Michael Collier)\nSubject: Converting images to HP LaserJet III?\nNntp-Posting-Host: hampton\nOrganization: The City University\nLines: 14\n\nDoes anyone know of a good way (standard PC application/PD utility) to\nconvert tif/img/tga files into LaserJet III format.  We would also like to\ndo the same, converting to HPGL (HP plotter) files.\n\nPlease email any response.\n\nIs this the correct group?\n\nThanks in advance.  Michael.\n-- \nMichael Collier (Programmer)                 The Computer Unit,\nEmail: M.P.Collier@uk.ac.city                The City University,\nTel: 071 477-8000 x3769                      London,\nFax: 071 477-8565                            EC1V 0HB.\n'

In [36]:
print(doc1)

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.



In [31]:
# splitting on spaces with no preprocessing gives 154 words
w1 = doc1.split(' ')
len(w1)

154

In [34]:
print(w1, sep='\n')

['From:', 'sd345@city.ac.uk', '(Michael', 'Collier)\nSubject:', 'Converting', 'images', 'to', 'HP', 'LaserJet', 'III?\nNntp-Posting-Host:', 'hampton\nOrganization:', 'The', 'City', 'University\nLines:', '14\n\nDoes', 'anyone', 'know', 'of', 'a', 'good', 'way', '(standard', 'PC', 'application/PD', 'utility)', 'to\nconvert', 'tif/img/tga', 'files', 'into', 'LaserJet', 'III', 'format.', '', 'We', 'would', 'also', 'like', 'to\ndo', 'the', 'same,', 'converting', 'to', 'HPGL', '(HP', 'plotter)', 'files.\n\nPlease', 'email', 'any', 'response.\n\nIs', 'this', 'the', 'correct', 'group?\n\nThanks', 'in', 'advance.', '', 'Michael.\n--', '\nMichael', 'Collier', '(Programmer)', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'The', 'Computer', 'Unit,\nEmail:', 'M.P.Collier@uk.ac.city', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'The', 'City', 'University,\nTel:', '071', '477-8000', 'x3769', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 

What are all these empty strings??  They come from runs of consecutive spaces used for formatting (the signature block is aligned in columns): `split(' ')` emits an empty string between each pair of adjacent spaces.  Once we have the list of tokens, we can just remove any of them that are empty strings.  

In [37]:
# use list comprehension to create new list with no empty strings
w1a = [w for w in w1 if w != '']
len(w1a)

73

So over half of the tokens (81 of 154) were empty strings

In [38]:
w1a

['From:',
 'sd345@city.ac.uk',
 '(Michael',
 'Collier)\nSubject:',
 'Converting',
 'images',
 'to',
 'HP',
 'LaserJet',
 'III?\nNntp-Posting-Host:',
 'hampton\nOrganization:',
 'The',
 'City',
 'University\nLines:',
 '14\n\nDoes',
 'anyone',
 'know',
 'of',
 'a',
 'good',
 'way',
 '(standard',
 'PC',
 'application/PD',
 'utility)',
 'to\nconvert',
 'tif/img/tga',
 'files',
 'into',
 'LaserJet',
 'III',
 'format.',
 'We',
 'would',
 'also',
 'like',
 'to\ndo',
 'the',
 'same,',
 'converting',
 'to',
 'HPGL',
 '(HP',
 'plotter)',
 'files.\n\nPlease',
 'email',
 'any',
 'response.\n\nIs',
 'this',
 'the',
 'correct',
 'group?\n\nThanks',
 'in',
 'advance.',
 'Michael.\n--',
 '\nMichael',
 'Collier',
 '(Programmer)',
 'The',
 'Computer',
 'Unit,\nEmail:',
 'M.P.Collier@uk.ac.city',
 'The',
 'City',
 'University,\nTel:',
 '071',
 '477-8000',
 'x3769',
 'London,\nFax:',
 '071',
 '477-8565',
 'EC1V',
 '0HB.\n']

In [40]:
# replace linebreaks
w2 = doc1.replace('\n', ' ')

In [41]:
# use string split to create word list
w2 = w2.split(' ')
len(w2)

174

IDEA:  we can incrementally build a function by adding each of these steps as we work them out
- See above for implementation

In [42]:
# remove empty strings
w2a = [w for w in w2 if w != '']
len(w2a)

87

#### Punctuation

In [111]:
punct1 = '.:,?)]['

In [119]:
#  to retain some phrases that include punct chars
def mod_punct(word_list, punct):
    """Modify words in word list by removing punctuation at start or end of word"""
    new_words = []
    for word in word_list:
        if word[-1] in punct:   # use the punct argument, not the global punct1
            if word[0] in punct:
                new_words.append(word[1:-1])
            else:
                new_words.append(word[:-1])
        elif word[0] in punct:
            new_words.append(word[1:])

        else:
            new_words.append(word)
    return new_words

In [120]:
mod_punct(w2a, punct1)

['From',
 'sd345@city.ac.uk',
 '(Michael',
 'Collier',
 'Subject',
 'Converting',
 'images',
 'to',
 'HP',
 'LaserJet',
 'III',
 'Nntp-Posting-Host',
 'hampton',
 'Organization',
 'The',
 'City',
 'University',
 'Lines',
 '14',
 'Does',
 'anyone',
 'know',
 'of',
 'a',
 'good',
 'way',
 '(standard',
 'PC',
 'application/PD',
 'utility',
 'to',
 'convert',
 'tif/img/tga',
 'files',
 'into',
 'LaserJet',
 'III',
 'format',
 'We',
 'would',
 'also',
 'like',
 'to',
 'do',
 'the',
 'same',
 'converting',
 'to',
 'HPGL',
 '(HP',
 'plotter',
 'files',
 'Please',
 'email',
 'any',
 'response',
 'Is',
 'this',
 'the',
 'correct',
 'group',
 'Thanks',
 'in',
 'advance',
 'Michael',
 '--',
 'Michael',
 'Collier',
 '(Programmer',
 'The',
 'Computer',
 'Unit',
 'Email',
 'M.P.Collier@uk.ac.city',
 'The',
 'City',
 'University',
 'Tel',
 '071',
 '477-8000',
 'x3769',
 'London',
 'Fax',
 '071',
 '477-8565',
 'EC1V',
 '0HB']

In [109]:
w2a

['From:',
 'sd345@city.ac.uk',
 '(Michael',
 'Collier)',
 'Subject:',
 'Converting',
 'images',
 'to',
 'HP',
 'LaserJet',
 'III?',
 'Nntp-Posting-Host:',
 'hampton',
 'Organization:',
 'The',
 'City',
 'University',
 'Lines:',
 '14',
 'Does',
 'anyone',
 'know',
 'of',
 'a',
 'good',
 'way',
 '(standard',
 'PC',
 'application/PD',
 'utility)',
 'to',
 'convert',
 'tif/img/tga',
 'files',
 'into',
 'LaserJet',
 'III',
 'format.',
 'We',
 'would',
 'also',
 'like',
 'to',
 'do',
 'the',
 'same,',
 'converting',
 'to',
 'HPGL',
 '(HP',
 'plotter)',
 'files.',
 'Please',
 'email',
 'any',
 'response.',
 'Is',
 'this',
 'the',
 'correct',
 'group?',
 'Thanks',
 'in',
 'advance.',
 'Michael.',
 '--',
 'Michael',
 'Collier',
 '(Programmer)',
 'The',
 'Computer',
 'Unit,',
 'Email:',
 'M.P.Collier@uk.ac.city',
 'The',
 'City',
 'University,',
 'Tel:',
 '071',
 '477-8000',
 'x3769',
 'London,',
 'Fax:',
 '071',
 '477-8565',
 'EC1V',
 '0HB.']

In [45]:
c = Counter(w2a)

In [108]:
words1 = c.most_common(10)
words1

[('to', 4),
 ('The', 3),
 ('LaserJet', 2),
 ('City', 2),
 ('the', 2),
 ('071', 2),
 ('From:', 1),
 ('sd345@city.ac.uk', 1),
 ('(Michael', 1),
 ('Collier)', 1)]

Definitely starting to look better
- Still to do: remove punctuation


#### How to build an integer indexed dictionary of words that can be used to convert a bag-of-words to a bag of numbers

In [58]:
# build a dictionary
word_list = [w[0] for w in words1]
word_list

['to', 'The', 'LaserJet', 'City', 'the']

In [102]:
def create_wordmap_to_vectorize(words):
    """Create an int->word dict and a word->int dict, returned in that order"""
    word_set = set(words)   # remove dupes
    int_word_map = dict()   # index -> word
    word_int_map = dict()   # word -> index
    
    # create dict and rev dict
    for idx, word in enumerate(word_set):
        int_word_map[idx] = word
        word_int_map[word] = idx
        
    return int_word_map, word_int_map

In [103]:
word_map, rev_word_map = create_wordmap_to_vectorize(word_list)

In [104]:
word_map

{0: 'to', 1: 'The', 2: 'LaserJet', 3: 'the', 4: 'City'}

Maybe instead of a dict for the int-to-word direction, just use a list where each word sits at its index position.  
- Access time is probably the same because indexing a list positionally is O(1).  The word-to-int direction still needs a dict, since finding a word's position in a list is O(n).

Another issue is that these vectors will be long, as long as the entire vocabulary. A dict representation using word-ints as keys and frequencies as values will probably be more compact, assuming that a missing entry means zero. 

Ultimately, we may want the function to take in the document instead of a word list.  
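
A minimal sketch of that sparse idea, assuming a word-to-int dict plus a `Counter`; the `build_vocab` and `sparse_vectorize` helpers are hypothetical names, not functions defined elsewhere in this notebook:

```python
from collections import Counter

def build_vocab(words):
    """Map each distinct word to an int id, in order of first appearance."""
    vocab = {}
    for word in words:
        vocab.setdefault(word, len(vocab))
    return vocab

def sparse_vectorize(vocab, words):
    """Return {word_id: count}; ids that are absent are implicitly zero."""
    return dict(Counter(vocab[w] for w in words if w in vocab))

words = ["to", "the", "to", "city", "to"]
vocab = build_vocab(words)
index_to_word = list(vocab)   # a list is enough for the id -> word direction

print(vocab)                                         # {'to': 0, 'the': 1, 'city': 2}
print(sparse_vectorize(vocab, words))                # {0: 3, 1: 1, 2: 1}
print(sparse_vectorize(vocab, ["city", "unknown"]))  # {2: 1}
```

Words missing from the vocabulary are skipped here; that is a design choice for the sketch, and raising an error would be equally defensible.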

In [95]:
# apply wordmap to wordlist
def vectorize_wordlist(wordmap, wordlist):
    """Create a word vector: for each word in wordlist, increment the vector
    element at that word's int index in wordmap"""
    word_vector = [0] * len(wordmap)
    for word in wordlist:
        word_vector[wordmap[word]] += 1
    return word_vector

In [96]:
vectorize_wordlist(rev_word_map, word_list)

[1, 1, 1, 1, 1]

In [90]:
len(w2a), len(word_keys)

(87, 78)

In [105]:
# try with wordlist w2a from above - nice, a 1 x 78 vector.
int_keys, word_keys = create_wordmap_to_vectorize(w2a)
vectorize_wordlist(word_keys, w2a)

[1,
 3,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 2,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 2,
 2,
 1,
 1,
 2,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 4,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1]

## Feature Extraction - back to tutorial

### Tokenizing
- This includes text preprocessing (as above, there are several things to deal with here, such as formatting, punctuation, case, and whitespace), tokenizing, and filtering stopwords.  For the tutorial we use CountVectorizer for this.

In [165]:
from sklearn.feature_extraction.text import CountVectorizer

In [172]:
# instantiate
count_vect = CountVectorizer()

In [173]:
# learn vocab and return document-term matrix
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35788)

In [174]:
# We get a sparse matrix, in particular csr
type(X_train_counts)

scipy.sparse.csr.csr_matrix

In [175]:
# This is the integer index (column) of the word in the vocabulary, not its frequency in the corpus
count_vect.vocabulary_.get('algorithm')

4690

Note on term frequencies as a feature:  
When dealing with a corpus, we need to account for longer vs. shorter documents: a longer document will have higher raw counts for a word even when it is about the same topic.  Dividing each count by the total number of words in the document (term frequency, tf) offsets this.  

Another idea is that a word common to many documents in a corpus provides less differentiation.  To the extent that is an issue, we can downscale the weight of a term by the number of documents that the term occurs in (inverse document frequency, idf).  It is 'inverse' because the more documents a term appears in, the less important it is in this regard.  Combining the two gives tf-idf.  
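
To make the weighting concrete, here is a small pure-Python sketch. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is my understanding of the `TfidfTransformer` defaults and should be checked against the scikit-learn docs. The `smoothed_idf` and `tf_idf` helpers are hypothetical.

```python
import math

def smoothed_idf(counts):
    """idf per term: ln((1 + n_docs) / (1 + doc_freq)) + 1 (assumed smoothing)."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    return [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def tf_idf(counts):
    """Weight raw counts by idf, then scale each document row to unit L2 length."""
    idf = smoothed_idf(counts)
    rows = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0  # guard empty documents
        rows.append([v / norm for v in weighted])
    return rows

# Two toy documents over three terms; the middle term occurs in both documents
counts = [[2, 1, 0],
          [0, 1, 3]]
print(smoothed_idf(counts))
for row in tf_idf(counts):
    print([round(v, 3) for v in row])
```

The term shared by both documents gets the smallest idf (exactly 1.0 under this formula), so it contributes least to either row after normalization.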

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

In [179]:
# This builds on the previous step - we are using the counts from CountVectorizer as argument
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(2257, 35788)

In [184]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2257, 35788)

## Training a classifier

Recall from the Kadenze work on Naive Bayes, and from rereading sklearn on Naive Bayes: for each potential class of the classifier (here we have 4), we get a model derived from the feature frequencies of the training data given that class.  To classify, we compute the probability of the new observation under each class (the class prior times the likelihood of its features), and predict whichever class scores highest.  With the class priors included this is the maximum a posteriori (MAP) decision rule; it reduces to maximum likelihood prediction when the priors are equal.  
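
A from-scratch sketch of that decision rule on a toy corpus, assuming Laplace (add-one) smoothing and log probabilities to avoid underflow. The `train_nb` and `predict_nb` helpers and the toy data are hypothetical; this is not how `MultinomialNB` is implemented internally.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Return {label: (log_prior, {word: log_likelihood})} with add-alpha smoothing."""
    vocab = {w for doc in docs for w in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n_label in class_counts.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_label / len(labels))
        log_lik = {w: math.log((word_counts[label][w] + alpha) /
                               (total + alpha * len(vocab)))
                   for w in vocab}
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label, (log_prior, log_lik) in model.items():
        # words never seen in training are ignored
        scores[label] = log_prior + sum(log_lik[w] for w in doc if w in log_lik)
    return max(scores, key=scores.get)

toy_docs = [["god", "love"], ["church", "god"], ["opengl", "gpu"], ["render", "gpu"]]
toy_labels = ["religion", "religion", "graphics", "graphics"]
model = train_nb(toy_docs, toy_labels)

print(predict_nb(model, ["god", "is", "love"]))   # religion
print(predict_nb(model, ["gpu", "is", "fast"]))   # graphics
```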

In [189]:
# Start with Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In [188]:
twenty_train.target.shape

(2257,)

In [192]:
# Some toy prediction; use the trained count_vect and tfidf transformer
docs_new = ['God is love', 'OpenGL on the GPU is fast']

# these transformers convert the document to a token vector based on the transforms as trained by the training data
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)   #  Notice we use the counts as argument
predicted = clf.predict(X_new_tfidf) # X_new_tfidf contains the new docs transformed into tfidf vectors
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics


## Building a Pipeline
- Put the steps together:  vectorizer => transformer => classifier
- use Pipeline from sklearn.pipeline

In [193]:
from sklearn.pipeline import Pipeline

In [197]:
text_clf = Pipeline([
    ('vect', CountVectorizer()), 
    ('tfidf', TfidfTransformer()), 
    ('clf', MultinomialNB()),
])

In [198]:
# With this set up we can train the model with a single command
text_clf.fit(twenty_train.data, twenty_train.target)

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf', MultinomialNB())])

### Evaluate performance on test set

In [200]:
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

0.8348868175765646

### Try an SVM (Support Vector Machine) - sub into the pipeline

In [201]:
from sklearn.linear_model import SGDClassifier

In [205]:
text_clf = Pipeline([
    ('vect', CountVectorizer()), 
    ('tfidf', TfidfTransformer()), 
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                         alpha=1e-3, random_state=42,
                         max_iter=5, tol=None)),
])

In [206]:
text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

0.9101198402130493

In [207]:
from sklearn import metrics

In [208]:
print(metrics.classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.95      0.80      0.87       319
         comp.graphics       0.87      0.98      0.92       389
               sci.med       0.94      0.89      0.91       396
soc.religion.christian       0.90      0.95      0.93       398

              accuracy                           0.91      1502
             macro avg       0.91      0.91      0.91      1502
          weighted avg       0.91      0.91      0.91      1502



In [211]:
metrics.confusion_matrix(twenty_test.target, predicted )

array([[256,  11,  16,  36],
       [  4, 380,   3,   2],
       [  5,  35, 353,   3],
       [  5,  11,   4, 378]])
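
As a cross-check, the figures in the classification report can be recomputed by hand from this confusion matrix (rows are true classes, columns are predicted classes). The sketch below copies the matrix values from the output above; the variable names are illustrative.

```python
# Confusion matrix copied from the output above: rows = true class, columns = predicted
cm = [
    [256,  11,  16,  36],
    [  4, 380,   3,   2],
    [  5,  35, 353,   3],
    [  5,  11,   4, 378],
]
names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

n = len(cm)
# recall: correct predictions divided by all true members of the class (row sum)
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]
# precision: correct predictions divided by everything predicted as the class (column sum)
precision = [cm[i][i] / sum(cm[r][i] for r in range(n)) for i in range(n)]
accuracy = sum(cm[i][i] for i in range(n)) / sum(sum(row) for row in cm)

for name, p, r in zip(names, precision, recall):
    print(f"{name:>24}  precision={p:.2f}  recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")   # 0.9101
```

Reading the first row this way shows where alt.atheism loses recall: 36 of its 319 test documents are predicted as soc.religion.christian.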

### Parameter Tuning

In [212]:
from sklearn.model_selection import GridSearchCV

In [221]:
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

In [222]:
# can be computationally expensive
gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1)

In [223]:
# apply to portion of data
gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])

In [227]:
twenty_train.target_names[gs_clf.predict([docs_new[0]])[0]]

'soc.religion.christian'
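
To see why the search can be computationally expensive, this sketch enumerates the same parameter grid with `itertools.product`. It only counts the combinations and does not fit any model; `cv_folds`, `combos`, and `n_fits` are illustrative names.

```python
from itertools import product

# Same grid as in the cell above
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}
cv_folds = 5   # matches cv=5 above

names = sorted(parameters)
combos = [dict(zip(names, values))
          for values in product(*(parameters[name] for name in names))]
n_fits = len(combos) * cv_folds   # one fit per combination per fold

print(len(combos), "combinations,", n_fits, "cross-validation fits")   # 8 combinations, 40 cross-validation fits
for combo in combos:
    print(combo)
```

After fitting, the `best_score_` and `best_params_` attributes of the `GridSearchCV` object report which combination won.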

# First Exercise - Language Identification
- Write a text classification pipeline using a custom preprocessor and CharNGramAnalyzer using data from Wikipedia articles as the training set.
- Evaluate the performance on a held-out test set.


In [None]:
"""Build a language detector model

The goal of this exercise is to train a linear classifier on text features
that represent sequences of up to 3 consecutive characters so as to
recognize natural languages by using the frequencies of short character
sequences as 'fingerprints'.

"""
# Author: Olivier Grisel <olivier.grisel@ensta.org>
# License: Simplified BSD

import sys

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [236]:
# get data folder from how I imported the tutorial
data_folder = './code/data/languages/paragraphs'

In [237]:
# The training data folder must be passed as first argument
languages_data_folder = data_folder
dataset = load_files(languages_data_folder)

In [240]:
# Another sklearn dataset
dir(dataset)

['DESCR', 'data', 'filenames', 'target', 'target_names']

In [242]:
dataset.data[0]

b'Alguns observadores afirmam que a Wikip\xc3\xa9dia \xc3\xa9 uma amea\xc3\xa7a econ\xc3\xb4mica para os editores de enciclop\xc3\xa9dias tradicionais, muitos dos quais podem ser incapazes de competir com um produto que \xc3\xa9 essencialmente livre. Nicholas Carr escreveu no ensaio "A amoralidade da Web 2.0", falando sobre a chamada "Web 2.0" como um todo: "impl\xc3\xadcita nas vis\xc3\xb5es de \xc3\xaaxtase de Web 2.0 \xc3\xa9 a hegemonia do amador. Eu, pelo menos, n\xc3\xa3o posso imaginar nada mais assustador."[159]\n'

In [241]:
# Split the dataset in training and test set:
docs_train, docs_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.5)

In [None]:
# TASK: Build a vectorizer that splits strings into sequence of 1 to 3
# characters instead of word tokens
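
To get a feel for what "sequences of 1 to 3 characters" means before reaching for a vectorizer, here is a pure-Python sketch. The `char_ngrams` helper is hypothetical; in scikit-learn the corresponding switch would be a vectorizer created with `analyzer='char'` and `ngram_range=(1, 3)`.

```python
from collections import Counter

def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max, shortest first."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

print(char_ngrams("abc"))   # ['a', 'b', 'c', 'ab', 'bc', 'abc']

# Frequent short sequences act as a rough 'fingerprint' of a language
fingerprint = Counter(char_ngrams("the cat and the hat"))
print(fingerprint.most_common(5))
```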

In [None]:
# TASK: Build a vectorizer / classifier pipeline using the previous analyzer
# the pipeline instance should be stored in a variable named clf

In [None]:
# TASK: Fit the pipeline on the training set

In [None]:
# TASK: Predict the outcome on the testing set in a variable named y_predicted

In [None]:
# Print the classification report
print(metrics.classification_report(y_test, y_predicted,
                                    target_names=dataset.target_names))

In [None]:
# Plot the confusion matrix
cm = metrics.confusion_matrix(y_test, y_predicted)
print(cm)

In [None]:
#import matplotlib.pyplot as plt
#plt.matshow(cm, cmap=plt.cm.jet)
#plt.show()

# Predict the result on some short new sentences:
sentences = [
    'This is a language detection test.',
    'Ceci est un test de d\xe9tection de la langue.',
    'Dies ist ein Test, um die Sprache zu erkennen.',
]
predicted = clf.predict(sentences)

for s, p in zip(sentences, predicted):
    print('The language of "%s" is "%s"' % (s, dataset.target_names[p]))