## Lecture 3: Modular Programming and scikit-learn

In the last two lectures, you've learned how to collect raw data from the web, process it, and visualize it by plotting. Along the way, we've worked with libraries like `BeautifulSoup`, `spacy`, `numpy`, `pandas`, and `seaborn`, which should give you a broad idea of what the typical "Language Technology Toolkit" might look like. In this lecture, we will tie most of these libraries and concepts together in an attempt to write a proper piece of software: a POS-tagger that can be trained by the user and used to tag sentences. To do this, we'll make use of concepts from a programming style called "modular programming", as well as the popular machine learning library `scikit-learn` (which you've probably seen before).

### What is modular programming?

In Wikipedia, modular programming is described as follows:

"_Modular programming is a software design technique that emphasizes separating the functionality of a program into independent, interchangeable modules, such that each contains everything necessary to execute only one aspect of the desired functionality._"

In other words, modular programming is a way of organizing and consolidating your Python code by the respective functions of its subparts. Imagine writing a POS-tagger (like we'll do in this lecture). What kind of components do you envision our code having? We'll likely have a part of the code that's devoted to reading our input and preprocessing it. Another part could include the model definition and training procedure. Lastly, we should include a part of the code that ties everything together: something that takes a set of arguments and directs them to each constituent part in order to produce the desired results. 

Before embarking on writing the tagger, let's look at how modules work in Python. 

You're probably used to importing modules that live somewhere on Python's module search path (`sys.path`). This includes any directories listed in your `PYTHONPATH` environment variable, as well as everything you've installed using `pip`, e.g. `scikit-learn`. However, you can also import modules that exist in the same directory as your active Python script. Let's look at a file called `module_1.py`.

```
raw_text = [
    "He worked for the BBC for a decade.",
    "She spoke to CNN style about the experience.",
    "Global warming has caused a change in the pattern of the rainy seasons.",
    "I also wonder whether the Davis Cup played a part.",
    "The scheme makes money through sponsorship and advertising.",
    "If a Turkish employee quits, then the Turkish work councils come.",
    "A witness told police that the victim had attacked the suspect in April.",
    "Mr Osborne signed up with a US speakers agency after being sacked in July.",
    "The RHS collected comments sent in by schoolchildren and teachers involved in the experiment.",
    "National reaction to the events in Kansas demonstrated how deeply divided the country had become."
    ]

def print_raw_text():
    for sent in raw_text:
        print(sent)
```

Using `import`, we can access objects defined in `module_1.py`, like variables and functions.

In [3]:
import module_1

print(module_1.raw_text)

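ModuleNotFoundError: No module named 'module_1'

This error means that Python could not find `module_1.py` on its search path: the import only succeeds when the notebook is run from the directory that contains the file. Once it does, the cell prints the `raw_text` list.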

We can also import individual objects using the `import` notation you're familiar with:

In [253]:
from module_1 import print_raw_text

# print_raw_text() function imported from module_1.py
print_raw_text()

He worked for the BBC for a decade.
She spoke to CNN style about the experience.
Global warming has caused a change in the pattern of the rainy seasons.
I also wonder whether the Davis Cup played a part.
The scheme makes money through sponsorship and advertising.
If a Turkish employee quits, then the Turkish work councils come.
A witness told police that the victim had attacked the suspect in April.
Mr Osborne signed up with a US speakers agency after being sacked in July.
The RHS collected comments sent in by schoolchildren and teachers involved in the experiment.
National reaction to the events in Kansas demonstrated how deeply divided the country had become.


If our code lives inside a "package" (a folder of modules), we can still import it, using dot notation. Here, we have a folder called `pkg`, where two dummy modules live: `module_2.py` and `module_3.py`. We can import them like so:

In [255]:
from pkg.module_2 import print_range
from pkg.module_3 import get_redundant_list as grl

# Adds N strings to a list
placeholder = grl("This is a placeholder", 100)

# Prints numbers up to N
print_range(5)

0
1
2
3
4


A package can also have an `__init__.py` file that will automatically execute when either the entire package or one of its constituent parts is imported. This can be used for defining variables that need to be shared across modules, or simply for printing a message. Here, importing `module_2` triggered the `print` statement in `__init__.py`. We can also import entire packages. (The trailing `None` in the output of the next cell is the return value of `print_range`, which we pass on to `print`.)

In [3]:
import pkg

print(pkg.module_2.print_range(10))

0
1
2
3
4
5
6
7
8
9
None
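
To see the `__init__.py` mechanism in isolation, here is a minimal sketch that builds a throwaway package in a temporary directory and imports it. The package name `demo_pkg`, its `helpers` module, and the `SHARED_GREETING` variable are made up for this illustration and are not part of the lecture's `pkg` folder.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk (hypothetical name: demo_pkg).
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_pkg"
pkg_dir.mkdir()

# __init__.py runs once, the first time the package is imported.
(pkg_dir / "__init__.py").write_text(
    'print("demo_pkg/__init__.py was executed")\n'
    'SHARED_GREETING = "hello"\n'
    "from . import helpers\n"
)
(pkg_dir / "helpers.py").write_text(
    "def shout(text):\n"
    "    return text.upper() + '!'\n"
)

# Make the parent directory importable, then import the package.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
demo_pkg = importlib.import_module("demo_pkg")

print(demo_pkg.SHARED_GREETING)
print(demo_pkg.helpers.shout("it works"))
```

Without the `from . import helpers` line in `__init__.py`, `demo_pkg.helpers` would only become available after an explicit `import demo_pkg.helpers`.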


Now that we've learned how to work with our own modules and packages, let's start thinking about how to write a Part-of-Speech Tagger. 

### Part-of-speech Tagging

Part-of-speech (POS) tagging is a classic NLP task that is concerned with classifying words in a sentence by their linguistic category, e.g. `NOUN` or `VERB`. Despite its simplicity, POS-tagging is a crucial component of major commercial NLP products, like search engines, speech recognition systems, and digital assistants. POS-tagging is a **sequence prediction** problem, which means that, for every word in a sentence, a classifier must assign a corresponding tag. This is distinct from something like sentiment analysis, which is a **classification** problem: it assigns a single tag (or number) to the entire sentence. Though numerous tagsets exist for POS-tagging, a popular one is the [_Universal POS tags_](https://universaldependencies.org/u/pos/index.html) (UPOS) tagset, introduced by the Universal Dependencies project. The UPOS tagset contains 17 tags, split across open class words (e.g. nouns, verbs), closed class words (e.g. prepositions, conjunctions, pronouns), and other tokens (e.g. punctuation or symbols). A simple POS-tagged sentence might look something like this:

```
The       DET
staff     NOUN
leaves    VERB
a         DET
lot       NOUN
to        PART
be        AUX
desired   VERB
.         PUNCT

```
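
The difference between the two problem types shows up directly in the shape of the labels. A small sketch, using the sentence above and a made-up sentiment label (`"negative"` is an assumption for illustration only):

```python
sentence = ["The", "staff", "leaves", "a", "lot", "to", "be", "desired", "."]

# Sequence prediction: one label per token.
pos_tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "PART", "AUX", "VERB", "PUNCT"]

# Classification: one label for the whole sentence (hypothetical value).
sentiment = "negative"

for word, tag in zip(sentence, pos_tags):
    print(f"{word}\t{tag}".expandtabs(10))
print("sentence-level label:", sentiment)
```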

### Where to begin?

When writing machine learning code, it is sometimes difficult to figure out how to get started. In such situations, it might be helpful to take a step back and decide exactly what we want our code to do. For this lecture, we know that we want to write a package that allows us to A) train a POS-tagger and B) use it to tag unseen data. If we think about this in modular programming terms, we might decide to separate our processes into two separate pieces of code: a data-loading component and a training / prediction component. We can place these two modules in a package called `tagger`, where we'll also put an empty `__init__.py` file to indicate that we're working with a package. We'll start with the data-loading component. 

### Loading our data

One of the most crucial (and challenging) aspects of working with machine learning toolkits is figuring out how to load data correctly. In the POS-tagging case, we can often expect our training data to look like the example above, where words and their tags appear on individual lines, separated by a tab, and individual sentences are separated by empty lines (i.e. two consecutive `\n` characters). Let's create a new file called `tagger/pre_process.py` and write a function to process this format, where we store words and tags in a list of sentence lists.

In [4]:
def load_txt(file):
    X, Y = [], []
    with open(file, "r") as infile:
        sents = infile.read().split("\n\n")
        if sents[-1] == "":
            sents = sents[:-1]
        for sent in sents:
            words, tags = [], []
            lines = sent.split("\n")
            for line in lines:
                line = line.strip().split("\t")
                if len(line) != 2:
                    raise TabError("Tried to read .txt file, but did not find two columns.")
                else:
                    words.append(line[0])
                    tags.append(line[1])
            X.append(words)
            Y.append(tags)

    return X, Y
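
To make the expected return value concrete, the sketch below applies the same split-on-blank-line, split-on-tab logic to a small in-memory string instead of a file. The sample text is hand-written for illustration.

```python
# Two sentences in the two-column format, separated by an empty line.
sample = "The\tDET\nstaff\tNOUN\nleaves\tVERB\n\nIt\tPRON\nworks\tVERB\n"

X, Y = [], []
for block in sample.strip().split("\n\n"):
    # Each line of a block is "word<TAB>tag".
    pairs = [line.split("\t") for line in block.split("\n")]
    X.append([word for word, tag in pairs])
    Y.append([tag for word, tag in pairs])

print(X)  # [['The', 'staff', 'leaves'], ['It', 'works']]
print(Y)  # [['DET', 'NOUN', 'VERB'], ['PRON', 'VERB']]
```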

Since we're working with the UPOS tagset, there's a chance that we might be interested in parsing Universal Dependencies files as well. This would make our package a lot more versatile: we would be able to train POS-taggers for 90 languages if we wanted to. However, since Universal Dependencies (`.conllu`) files come in a slightly more complicated format, we'll have to write a separate function for loading them. Let's see what a `.conllu` file looks like before we do so:

```
# text = The staff leaves a lot to be desired.
1	The	the	DET	DT	Definite=Def|PronType=Art	2	det	2:det	_
2	staff	staff	NOUN	NN	Number=Sing	3	nsubj	3:nsubj	_
3	leaves	leave	VERB	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	0:root	_
4	a	a	DET	DT	Definite=Ind|PronType=Art	5	det	5:det	_
5	lot	lot	NOUN	NN	Number=Sing	3	obj	3:obj	_
6	to	to	PART	TO	_	8	mark	8:mark	_
7	be	be	AUX	VB	VerbForm=Inf	8	aux:pass	8:aux:pass	_
8	desired	desire	VERB	VBN	Tense=Past|VerbForm=Part	5	acl	5:acl:to	SpaceAfter=No
9	.	.	PUNCT	.	_	3	punct	3:punct	_

# sent_id = reviews-055976-0002
# text = The front staff has seen quite a bit of turnover and changed from professional to rude.
1	The	the	DET	DT	Definite=Def|PronType=Art	3	det	3:det	_
2	front	front	NOUN	NN	Number=Sing	3	compound	3:compound	_
3	staff	staff	NOUN	NN	Number=Sing	5	nsubj	5:nsubj|12:nsubj	_
4	has	have	AUX	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	5	aux	5:aux	_
5	seen	see	VERB	VBN	Tense=Past|VerbForm=Part	0	root	0:root	_
6	quite	quite	DET	PDT	_	8	det:predet	8:det:predet	_
7	a	a	DET	DT	Definite=Ind|PronType=Art	8	det	8:det	_
8	bit	bit	NOUN	NN	Number=Sing	5	obj	5:obj	_
9	of	of	ADP	IN	_	10	case	10:case	_
10	turnover	turnover	NOUN	NN	Number=Sing	8	nmod	8:nmod:of	_
11	and	and	CCONJ	CC	_	12	cc	12:cc	_
12	changed	change	VERB	VBN	Tense=Past|VerbForm=Part	5	conj	5:conj:and	_
13	from	from	ADP	IN	_	14	case	14:case	_
14	professional	professional	ADJ	JJ	Degree=Pos	12	obl	12:obl:from	_
15	to	to	ADP	IN	_	16	case	16:case	_
16	rude	rude	ADJ	JJ	Degree=Pos	12	obl	12:obl:to	SpaceAfter=No
17	.	.	PUNCT	.	_	5	punct	5:punct	_
```

Here, the sentence text appears as a commented line, followed by the annotated properties of each individual token, separated by tabs. To extract the token form and POS-tag, we'll need to take the 2nd and 4th columns.

In [15]:
def load_conllu(file):
    X, Y = [], []
    with open(file, "r") as infile:
        sents = infile.read().split("\n\n")
        if sents[-1] == "":
            sents = sents[:-1]
        for sent in sents:
            words, tags = [], []
            lines = sent.split("\n")
            for line in lines:
                if line.startswith("#"):
                    continue
                line = line.strip().split("\t")
                if len(line) != 10:
                    raise TabError("Tried to read .conllu file, but did not find ten columns.")
                else:
                    words.append(line[1])
                    tags.append(line[3])
            X.append(words)
            Y.append(tags)
            
    return X, Y
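
One caveat, hedged because it depends on the treebank: the CoNLL-U format also allows multiword-token lines, whose ID is a range such as `1-2`, and empty-node lines, whose ID contains a dot such as `8.1`. These lines also have ten columns, so the loader above would read them as ordinary tokens (with `_` as the tag for range lines). One possible way to skip them is to keep only rows whose ID is a plain integer. The sample below is hand-written, not taken from the EWT treebank.

```python
# Hand-written CoNLL-U fragment with one multiword-token line ("1-2").
sample = (
    "# text = Don't go.\n"
    "1-2\tDon't\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tDo\tdo\tAUX\tVB\t_\t3\taux\t_\t_\n"
    "2\tn't\tnot\tPART\tRB\t_\t3\tadvmod\t_\t_\n"
    "3\tgo\tgo\tVERB\tVB\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t.\t_\t3\tpunct\t_\t_\n"
)

words, tags = [], []
for line in sample.splitlines():
    if not line or line.startswith("#"):
        continue  # skip blank lines and comments
    cols = line.split("\t")
    if not cols[0].isdigit():
        continue  # skip ranges like "1-2" and empty nodes like "8.1"
    words.append(cols[1])
    tags.append(cols[3])

print(words)  # ['Do', "n't", 'go', '.']
print(tags)   # ['AUX', 'PART', 'VERB', 'PUNCT']
```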

And now a final wrapper around these functions that will process the provided file, output the data in the desired format, and complain if it fails.

In [16]:
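def load_dataset(file):
    # Pick the loader based on the file extension.
    if file.endswith(".conllu"):
        try:
            X, Y = load_conllu(file)
        except TabError:
            print("Tried to read .conllu file, but did not find ten columns.")
            raise  # re-raise: X and Y are undefined at this point
        return X, Y
    else:
        try:
            X, Y = load_txt(file)
        except TabError:
            print("Tried to read .txt file, but did not find two columns.")
            raise  # re-raise: X and Y are undefined at this point
        return X, Y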

In [17]:
X, Y = load_dataset("/Users/yonwu/NLP_Course/UD_English-EWT/en_ewt-ud-train.conllu")

We've just written a data loader. Now, let's think about how to format this data so that it's compatible with our model. As you've probably observed before, no type of classifier can learn from token or sentence strings alone. These must first be properly _vectorized_. In other words, each token or sentence needs to be converted to a vector representation, with each dimension representing some sort of feature. 

We've fit a `TfidfVectorizer` in `scikit-learn` before. However, though this is a great feature representation for sentence or document strings, our data here is inherently different: we are working at a token level now, where we have to output a single tag per token. Computing tf-idf scores for bigrams or trigrams when we're working directly with words doesn't make much sense. In this case, we will need to tell the classifier what features to look for. This process is called _feature engineering_ (or _featurization_). `scikit-learn` has a handy tool for turning user-specified features into vectors called `DictVectorizer`. However, in order to make use of it, we will first need to write a function that returns a dictionary of features per token. This can contain things like the suffix of the word, whether the first letter of the word is capitalized, the preceding word, etc.

In [18]:
def token_to_features(sent, i):
    word = sent[i]

    features = {
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
    }
    if i > 0:
        word1 = sent[i-1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
        })
    else:
        features['EOS'] = True

    return features

This function will take a token list and a word index as arguments and calculate various features for that word, depending on its position in the sentence. After all, it might be useful to know what other words occur before and after our word in question. Let's try it:

In [21]:
token_to_features(X[0], 4)

{'word.lower()': 'american',
 'word[-3:]': 'can',
 'word[-2:]': 'an',
 'word.isupper()': False,
 'word.istitle()': True,
 'word.isdigit()': False,
 '-1:word.lower()': ':',
 '-1:word.istitle()': False,
 '-1:word.isupper()': False,
 '+1:word.lower()': 'forces',
 '+1:word.istitle()': False,
 '+1:word.isupper()': False}

This is a simple featurizer, but it should be enough for our likewise simple model. Now let's write another function that takes in sentence and tag lists and collapses them into one list of tokens and tags. 

In [22]:
def prepare_data_for_training(X, Y):
    
    X_out, Y_out = [], []
    for i, sent in enumerate(X):
        for j, token in enumerate(sent):
            features = token_to_features(sent, j)
            X_out.append(features)
            Y_out.append(Y[i][j])
            
    return X_out, Y_out

Let's apply it to our dataset, overwriting `X` and `Y` with their flattened, featurized versions.

In [23]:
X, Y = prepare_data_for_training(X, Y)

Now we can make use of the `DictVectorizer` to transform our list of dictionaries into a feature matrix.

In [24]:
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
X = vec.fit_transform(X)

print(X.shape)

(204609, 54283)
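
The second dimension is so large because string-valued features are one-hot encoded: each distinct `feature=value` pair gets its own column, while boolean and numeric features keep a single column. The following pure-Python sketch mimics that idea on two toy feature dictionaries. It illustrates the encoding scheme only, it is not `DictVectorizer`'s actual implementation, and the helper names are made up.

```python
def column_name(key, value):
    # Strings become "key=value" indicator columns; other values keep their key.
    return f"{key}={value}" if isinstance(value, str) else key

def fit_columns(feature_dicts):
    # Collect every column name seen in the data, in sorted order.
    names = {column_name(k, v) for feats in feature_dicts for k, v in feats.items()}
    return sorted(names)

def transform(feature_dicts, columns):
    index = {name: i for i, name in enumerate(columns)}
    rows = []
    for feats in feature_dicts:
        row = [0.0] * len(columns)
        for key, value in feats.items():
            col = index.get(column_name(key, value))
            if col is None:
                continue  # feature not seen during fitting: ignored
            row[col] = 1.0 if isinstance(value, str) else float(value)
        rows.append(row)
    return rows

toy = [
    {"word.lower()": "the", "word.istitle()": True},
    {"word.lower()": "staff", "word.istitle()": False},
]
columns = fit_columns(toy)
matrix = transform(toy, columns)

print(columns)  # ['word.istitle()', 'word.lower()=staff', 'word.lower()=the']
for row in matrix:
    print(row)
```

The real `DictVectorizer` returns a sparse matrix by default, which matters here because almost every entry of a matrix this wide is zero.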


We can wrap this into its own helper function, called `vectorize`. 

In [25]:
def vectorize(X):
    vec = DictVectorizer()
    X_out = vec.fit_transform(X)
    return X_out, vec

This concludes our pre-processing module. We've named it `pre_process.py` and stored it in the `tagger` folder. This is what the full code of the module should look like:

In [27]:
from sklearn.feature_extraction import DictVectorizer

def load_txt(file):
    X, Y = [], []
    with open(file, "r") as infile:
        sents = infile.read().split("\n\n")
        if sents[-1] == "":
            sents = sents[:-1]
        for sent in sents:
            words, tags = [], []
            lines = sent.split("\n")
            for line in lines:
                line = line.strip().split("\t")
                if len(line) != 2:
                    raise TabError("Tried to read .txt file, but did not find two columns.")
                else:
                    words.append(line[0])
                    tags.append(line[1])
            X.append(words)
            Y.append(tags)

    return X, Y

def load_conllu(file):
    X, Y = [], []
    with open(file, "r") as infile:
        sents = infile.read().split("\n\n")
        if sents[-1] == "":
            sents = sents[:-1]
        for sent in sents:
            words, tags = [], []
            lines = sent.split("\n")
            for line in lines:
                if line.startswith("#"):
                    continue
                line = line.strip().split("\t")
                if len(line) != 10:
                    raise TabError("Tried to read .conllu file, but did not find ten columns.")
                else:
                    words.append(line[1])
                    tags.append(line[3])
            X.append(words)
            Y.append(tags)

    return X, Y

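def load_dataset(file):
    # Pick the loader based on the file extension.
    if file.endswith(".conllu"):
        try:
            X, Y = load_conllu(file)
            return X, Y
        except TabError:
            print("Tried to read .conllu file, but did not find ten columns.")
            raise  # re-raise: returning None would break the caller's unpacking
    else:
        try:
            X, Y = load_txt(file)
            return X, Y
        except TabError:
            print("Tried to read .txt file, but did not find two columns.")
            raise  # re-raise: returning None would break the caller's unpacking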

def token_to_features(sent, i):
    word = sent[i]

    features = {
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
    }
    if i > 0:
        word1 = sent[i-1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
        })
    else:
        features['EOS'] = True

    return features

def prepare_data_for_training(X, Y):    
    X_out, Y_out = [], []
    for i, sent in enumerate(X):
        for j, token in enumerate(sent):
            features = token_to_features(sent, j)
            X_out.append(features)
            Y_out.append(Y[i][j])

    return X_out, Y_out

def vectorize(X):
    vec = DictVectorizer()
    X_out = vec.fit_transform(X)
    return X_out, vec

### Training a model

As with everything in `scikit-learn`, training a model is a simple process. Let's load a Support Vector Machine classifier.

In [28]:
from sklearn.svm import LinearSVC

svm = LinearSVC()

We've successfully loaded our model. So how do we know how well it does on held-out data? One thing we can do is report its cross-validation accuracy. To do so, we first pick a number of "folds" and split the data into that many equally sized parts. If we choose to cross-validate over 5 folds, we train a model on 4/5 of the data and reserve the remaining 1/5 for testing. We evaluate the model on that held-out fold, record its accuracy, and then repeat the process with a different fold held out, until every fold has served as the test set exactly once. The reported accuracy is then the average of the classifier's performance across all folds. Here is a nice visualization of what the process looks like:

![crossval](./grid_search_cross_validation.png)
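
The splitting step can be sketched in a few lines of plain Python. This is a simplified illustration with contiguous, unshuffled folds; the function name `k_fold_indices` is made up, and `scikit-learn`'s own splitters do more (for classifiers, for example, the default splitter tries to preserve the class proportions in each fold).

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each fold."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop

for train, test in k_fold_indices(n_samples=10, n_folds=5):
    print("train:", train, "test:", test)
```

Every sample lands in a test fold exactly once, which is what makes the averaged score an estimate over the whole dataset.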

To perform cross-validation on our model, we can use `cross_val_score`:

In [29]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(svm, X, Y, cv=5)

In [30]:
print(scores)

[0.93343434 0.92686086 0.92952446 0.92932897 0.93287065]


Looks like we have a pretty good model on our hands. It's up to you to make it better, either by including more/better features, or simply using a different model. Since `cross_val_score` only gives us one score per fold, not a trained model, let's finally fit our model on the full training data.

In [31]:
svm.fit(X, Y)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

Let's write a function that first reports the cross-validation score over the training set (if preferred), and then returns our fit model.

In [32]:
import numpy as np

def fit_and_report(X, Y, cross_val = True, n_folds=5):
    
    svm = LinearSVC()
    
    if cross_val:
        from sklearn.model_selection import cross_val_score

        scores = cross_val_score(svm, X, Y, cv=n_folds)
        print(f"{n_folds}-fold cross-validation results over training set:\n")
        print("Fold\tScore".expandtabs(15))
        for i in range(n_folds):
            print(f"{i+1}\t{scores[i]}".expandtabs(15))
        print(f"Average\t{np.mean(scores)}".expandtabs(15))
    
    svm.fit(X, Y)
        
    return svm        

In [33]:
model = fit_and_report(X, Y, cross_val=True, n_folds=10)

10-fold cross-validation results over training set:

Fold           Score
1              0.9362201260935438
2              0.9418894482185621
3              0.9373930892918235
4              0.9322613752993499
5              0.9282048775719661
6              0.9361712526269488
7              0.9300620693025756
8              0.9380284443575583
9              0.9374908362250134
10             0.9383675464320625
Average        0.9356089065419404


Now, the model is loaded in our memory for interactive use. However, what if we wanted to save the model on disk, so that we can load it anytime and use it to tag unseen sentences? We can do so using `pickle`, a serialization module from Python's standard library. Unlike `json`, which writes human-readable text, `pickle` writes a binary format and can store almost any Python object, including trained `scikit-learn` models. In addition to the model, we also need to save our vectorizer, which remembers the feature set of our training data. Here we'll add a function to serialize the model and vectorizer and save them to disk. One word of caution: only unpickle files you trust, since loading a pickle can execute arbitrary code.

In [34]:
import pickle

def save_model(model_and_vec, output_file):
    with open(output_file, "wb") as outfile:
        pickle.dump(model_and_vec, outfile)

In [35]:
save_model((svm, vec), "svm_pos_tagger.pickle")

Here is a complementary function for loading a model from disk:

In [36]:
def load_model(output_file):
    with open(output_file, "rb") as infile:
        model, vec = pickle.load(infile)
        
    return model, vec

In [37]:
loaded_svm, loaded_vec = load_model("svm_pos_tagger.pickle")
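
A quick way to convince yourself that the round trip is lossless is to pickle a small stand-in object and compare it with what comes back. The sketch below uses a plain tuple in place of the `(model, vectorizer)` pair and a temporary file, so it does not touch `svm_pos_tagger.pickle`.

```python
import os
import pickle
import tempfile

# Stand-in for the (model, vectorizer) tuple; the contents are made up.
stand_in = ({"weights": [0.1, 0.2, 0.3]}, ["feature_a", "feature_b"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stand_in.pickle")
    # Write in binary mode ("wb"), read back in binary mode ("rb").
    with open(path, "wb") as outfile:
        pickle.dump(stand_in, outfile)
    with open(path, "rb") as infile:
        restored = pickle.load(infile)

print(restored == stand_in)  # True
```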

### Working with unseen data

Now that we've trained a model, we'll naturally need to use it to tag unseen data. For now, let's focus on tagging single sentences. We'll use `spaCy` for tokenization.

In [38]:
import spacy

nlp = spacy.load("en_core_web_sm")

def tag_sentence(sentence, model, vec):
    doc = nlp(sentence)
    tokenized_sent = [token.text for token in doc]
    featurized_sent = []
    for i, token in enumerate(tokenized_sent):
        featurized_sent.append(token_to_features(tokenized_sent, i))

    featurized_sent = vec.transform(featurized_sent)
    labels = model.predict(featurized_sent)
    
    tagged_sent = list(zip(tokenized_sent, labels))
    
    return tagged_sent

In [39]:
tagged_sent = tag_sentence("I walked to the store.", svm, vec)

Now let's write a simple function that prints this output in a nice format. 

In [55]:
def print_tagged_sent(tagged_sent):
    # Print one "token<padding>tag" line per token, with tags aligned at column 15.
    for token in tagged_sent:
        print(f"{token[0]}\t{token[1]}".expandtabs(15))

In [56]:
print_tagged_sent(tagged_sent)

I              PRON
walked         VERB
to             ADP
the            DET
store          NOUN
.              PUNCT



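This should be enough to get us started. The full model module can be found in `tagger/model.py`. Here is the full code (it will not run inside the notebook, because the relative import of `pre_process` only works when the file is imported as part of the `tagger` package):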

In [None]:
from sklearn.svm import LinearSVC
from .pre_process import token_to_features
import numpy as np
import pickle
import spacy
import time

nlp = spacy.load("en_core_web_sm")

def fit_and_report(X, Y, cross_val = True, n_folds=5):

    svm = LinearSVC()

    if cross_val:
        from sklearn.model_selection import cross_val_score
        print(f"Doing {n_folds}-fold cross-validation.")
        scores = cross_val_score(svm, X, Y, cv=n_folds)
        print(f"{n_folds}-fold cross-validation results over training set:\n")
        print("Fold\tScore".expandtabs(15))
        for i in range(n_folds):
            print(f"{i+1}\t{scores[i]:.3f}".expandtabs(15))
        print(f"Average\t{np.mean(scores):.3f}".expandtabs(15))

    print("Fitting model.")
    start_time = time.time()
    svm.fit(X, Y)
    end_time = time.time()
    print(f"Took {int(end_time - start_time)} seconds.")

    return svm

def save_model(model_and_vec, output_file):
    print(f"Saving model to {output_file}.")
    with open(output_file, "wb") as outfile:
        pickle.dump(model_and_vec, outfile)

def load_model(output_file):
    print(f"Loading model from {output_file}.")
    with open(output_file, "rb") as infile:
        model, vec = pickle.load(infile)

    return model, vec

def tag_sentence(sentence, model, vec):
    doc = nlp(sentence)
    tokenized_sent = [token.text for token in doc]
    featurized_sent = []
    for i, token in enumerate(tokenized_sent):
        featurized_sent.append(token_to_features(tokenized_sent, i))

    featurized_sent = vec.transform(featurized_sent)
    labels = model.predict(featurized_sent)
    tagged_sent = list(zip(tokenized_sent, labels))

    return tagged_sent

def print_tagged_sent(tagged_sent):
    for token in tagged_sent:
        print(f"{token[0]}\t{token[1]}".expandtabs(15))

### Putting it all together

Now that we've written the base code for the `tagger` package, we can write a `run.py` script that will put all the components together and parse the user-input arguments. Before doing so, let's think about the functionality we'd like to enable. Basically, we can see two options so far: training a model and tagging a short sentence using the model. We can incorporate this in an argument called `--mode`, which expects the user to pass either `train` or `tag`. If a user provides a string that does not correspond to either of these modes, the script will complain and exit.

In [234]:
import argparse

PARSER = argparse.ArgumentParser(description=
                                 """
                                 A basic SVM-based POS-tagger.
                                 Accepts either .conllu or tab-delineated
                                 .txt files for training.
                                 """)

PARSER.add_argument('--mode', metavar='M', type=str, help=
                    """
                    Specifies the tagger mode: {train, tag}.
                    """)

_StoreAction(option_strings=['--mode'], dest='mode', nargs=1, const=None, default=None, type=<class 'str'>, choices=None, help='Specifies the tagger mode: {train, tag}.', metavar='M')

We also need to add an argument `--text`, which should be supplied whenever `--mode tag` is specified. It accepts a string that will be passed to the `tag_sentence()` function in `tagger/model.py`.

In [235]:
PARSER.add_argument('--text', metavar='T', type=str, help=
                    """
                    Tags a sentence string.
                    Can only be called if '--mode tag' is specified.
                    """)

_StoreAction(option_strings=['--text'], dest='text', nargs=1, const=None, default=None, type=<class 'str'>, choices=None, help='Tags a sentence string.                     Can only be called if "--mode tag" is specified.', metavar='T')
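
As written, nothing stops a user from passing `--mode tag` without `--text`. One possible way to have `argparse` enforce this, and to reject unknown modes at parse time, is sketched below using its `choices` parameter and `parser.error()`. The helper names `build_parser` and `parse` are made up, and this is an alternative to, not a copy of, the parser used in this lecture.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="A basic SVM-based POS-tagger.")
    # `choices` makes argparse reject anything other than the listed modes.
    parser.add_argument("--mode", choices=["train", "tag"], required=True,
                        help="Specifies the tagger mode.")
    parser.add_argument("--text", type=str,
                        help="Sentence to tag; required with '--mode tag'.")
    return parser

def parse(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Conditional requirement: --text must accompany --mode tag.
    if args.mode == "tag" and args.text is None:
        parser.error("--text is required when --mode is 'tag'")
    return args

args = parse(["--mode", "tag", "--text", "I walked to the store."])
print(args.mode, "|", args.text)
```

Calling `parse(["--mode", "tag"])` prints a usage message and exits with an error, which is the behavior of `parser.error()`.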

Let's think about other parameters at play here. First, if we're calling `--mode train`, then we need to point our script to a file that can be used for training. Also, we need to tell the script whether or not we'd like to output our cross-validation scores, and, if so, for how many folds. If our training set is large, then this could be an important decision if we're interested in saving time. Lastly, we need to specify where to save our model after training, and where to load it from if we're tagging.

Expanding the list of arguments to `ArgumentParser` for all these options could be messy when calling `run.py`. This is especially true if we'd like to modify our code later. A cleaner option could be to include a configuration file, where these remaining parameters are specified. We can do so using YAML, a human-readable data serialization format, which we can read in Python with the `yaml` module (installed via the `PyYAML` package). Let's make a `config.yaml` file for our tagger, specifying these remaining parameters.

```
# tagger config
train_file: "../../files/en_ewt-ud-train.conllu"
model_file: "./svm_tagger.pickle"
crossval: True
n_folds: 5
```
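
Because a typo in the config file would otherwise only surface halfway through training, it can be worth checking the loaded dictionary up front. A minimal sketch, assuming the config has already been parsed into a Python `dict` with the four keys shown above; `validate_config` is a hypothetical helper, not part of the `tagger` package.

```python
# Expected keys and their types, mirroring the config file above.
REQUIRED_KEYS = {
    "train_file": str,
    "model_file": str,
    "crossval": bool,
    "n_folds": int,
}

def validate_config(config):
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            raise KeyError(f"Missing config entry: {key}")
        value = config[key]
        # bool is a subclass of int, so reject True/False where a number is expected.
        if expected_type is int and isinstance(value, bool):
            raise TypeError(f"{key} must be an integer, got a boolean")
        if not isinstance(value, expected_type):
            raise TypeError(f"{key} must be of type {expected_type.__name__}")
    if config["crossval"] and config["n_folds"] < 2:
        raise ValueError("n_folds must be at least 2 for cross-validation")
    return config

config = validate_config({
    "train_file": "../../files/en_ewt-ud-train.conllu",
    "model_file": "./svm_tagger.pickle",
    "crossval": True,
    "n_folds": 5,
})
print(config["n_folds"])  # 5
```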

If we'd like to train a tagger using any other option, we simply have to change this configuration file, or provide a new one. Let's add a final argument to `run.py` for specifying configurations. 

In [236]:
PARSER.add_argument('--config', metavar='C', type=str, help=
                    """
                    A config .yaml file that specifies the train data,
                    model output file, and number of folds for cross-validation.
                    """)

_StoreAction(option_strings=['--config'], dest='config', nargs=1, const=None, default=None, type=<class 'str'>, choices=None, help='A config .yaml file that specifies the train data,                     model output file, and number of folds for cross-validation.', metavar='C')

It seems we're done with specifying arguments for our script (for now). Now let's write a `run()` function that takes our arguments as input and interprets them for the tagger.

In [248]:
def run(args):
    mode = args.mode
    with open(args.config, "r") as yamlin:
        config = yaml.safe_load(yamlin)  # safe_load: recent PyYAML requires a Loader for yaml.load
    if mode == "train":
        print(f"Training a model on {config['train_file']}.")
        X, Y = pre_process.load_dataset(config["train_file"])
        X, Y  = pre_process.prepare_data_for_training(X, Y)
        X, vec = pre_process.vectorize(X)
        svm = model.fit_and_report(X, Y, config["crossval"], config["n_folds"])
        model.save_model((svm, vec), config["model_file"])
    elif mode == "tag":
        print(f"Tagging text using pretrained model: {config['model_file']}.")
        svm, vec = model.load_model(config["model_file"])
        tagged_sent = model.tag_sentence(args.text, svm, vec)
        model.print_tagged_sent(tagged_sent)
    else:
        print(f"{args.mode} is an incompatible mode. Must be either 'train' or 'tag'.")

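Now we have a `run.py` file that can interpret the arguments provided by the user and decide how to deploy the modules we wrote above. Here is the full file. Note that it also includes a third mode, `eval`, together with a `--gold` argument; this mode relies on a `model.evaluate()` function that is not covered in this lecture.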

In [None]:
from tagger import pre_process, model
import argparse
import yaml

def run(args):
    mode = args.mode
    with open(args.config, "r") as yamlin:
        config = yaml.safe_load(yamlin)  # safe_load: recent PyYAML requires a Loader for yaml.load
    if mode == "train":
        print(f"Training a model on {config['train_file']}.")
        X, Y = pre_process.load_dataset(config["train_file"])
        X, Y  = pre_process.prepare_data_for_training(X, Y)
        X, vec = pre_process.vectorize(X)
        svm = model.fit_and_report(X, Y, config["crossval"], config["n_folds"])
        model.save_model((svm, vec), config["model_file"])
    elif mode == "tag":
        print(f"Tagging text using pretrained model: {config['model_file']}.")
        svm, vec = model.load_model(config["model_file"])
        tagged_sent = model.tag_sentence(args.text, svm, vec)
        model.print_tagged_sent(tagged_sent)
    elif mode == "eval":
        X, Y = pre_process.load_dataset(args.gold)
        X, Y  = pre_process.prepare_data_for_training(X, Y)
        svm, vec = model.load_model(config["model_file"])
        model.evaluate(X, Y, svm, vec)
    else:
        print(f"{args.mode} is an incompatible mode. Must be one of 'train', 'tag', or 'eval'.")

if __name__ == '__main__':


    PARSER = argparse.ArgumentParser(description=
                                     """
                                     A basic SVM-based POS-tagger.
                                     Accepts either .conllu or tab-delineated
                                     .txt files for training.
                                     """)

    PARSER.add_argument('--mode', metavar='M', type=str, help=
                        """
                        Specifies the tagger mode: {train, tag, eval}.
                        """)
    PARSER.add_argument('--text', metavar='T', type=str, help=
                        """
                        Tags a sentence string.
                        Can only be called if '--mode tag' is specified.
                        """)
    PARSER.add_argument('--gold', metavar='G', type=str, help=
                        """
                        Path to a gold-standard POS-tagging file.
                        Can only be called if '--mode eval' is specified.
                        """)
    PARSER.add_argument('--config', metavar='C', type=str, help=
                        """
                        A config .yaml file that specifies the train data,
                        model output file, and number of folds for cross-validation.
                        """)

    ARGS = PARSER.parse_args()

    run(ARGS)
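
The `eval` branch above calls `model.evaluate(X, Y, svm, vec)`, which does not appear in the `model.py` listing. What follows is a guess at what such a function could look like, computing plain token-level accuracy; it is an assumption, not the course's implementation. The two stub classes only stand in for a fitted vectorizer and classifier so that the example runs on its own.

```python
def evaluate(X, Y, model, vec):
    """Return token-level accuracy of `model` on feature dicts X with gold tags Y."""
    predictions = model.predict(vec.transform(X))
    correct = sum(1 for pred, gold in zip(predictions, Y) if pred == gold)
    accuracy = correct / len(Y)
    print(f"Accuracy: {accuracy:.3f}")
    return accuracy

class StubVectorizer:
    """Stand-in for a fitted DictVectorizer: passes feature dicts through."""
    def transform(self, feature_dicts):
        return feature_dicts

class StubModel:
    """Stand-in for a fitted classifier: tags title-cased words as NOUN."""
    def predict(self, rows):
        return ["NOUN" if row["word.istitle()"] else "VERB" for row in rows]

X_gold = [{"word.istitle()": True}, {"word.istitle()": False}, {"word.istitle()": False}]
Y_gold = ["NOUN", "VERB", "NOUN"]

score = evaluate(X_gold, Y_gold, StubModel(), StubVectorizer())
```

With the stubs above, two of the three predictions match the gold tags.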

We've just written a package for part-of-speech tagging in Python! Using the principles introduced here, you should now be well-equipped to write your own machine learning code, on top of `scikit-learn` or any other library. 