# Problem 1: Text mining case study

Study Chapters 1 through 4 of the NLP Book:  

http://www.nltk.org/book/   

Then answer the questions from these chapters listed below. Use NLTK and Python as needed.

**CHAPTER 1:**	Getting Started with NLTK				    	

Following the example on pages 5-6, pick a pair of words and compare their usage in two different texts, using the similar() and common_contexts() functions. Explain your results.


**CHAPTER 2:**  2.8 Exercises   (Page 74)

Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of *men*, *women*, and *people* in each document. What has happened to the usage of these words over time?


**CHAPTER 3:**  3.12 Exercises    (Page 124)

Save some text into a file corpus.txt. Define a function load(f) that reads from the file named in its sole argument, and returns a string containing the text of the file.
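
As a rough illustration of what such a function could look like, here is a minimal sketch. It assumes the file is UTF-8 encoded and writes a throwaway `corpus.txt` into a temporary directory so that it runs on its own; the sample sentence is made up.

```python
import os
import tempfile


def load(f):
    """Return the full contents of the text file named f as one string."""
    with open(f, encoding="utf-8") as handle:
        return handle.read()


# Write a small sample file, then read it back with load().
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "corpus.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("NLTK makes text processing easier.\n")
    corpus_text = load(path)

print(repr(corpus_text))
```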

Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the following kinds of expressions: monetary amounts; dates; names of people and organizations.
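
One possible starting point for the pattern is sketched below with Python's built-in `re` module, so that it runs without NLTK. The pattern and the sample sentence are illustrative assumptions: they cover only dollar amounts, numeric dates and capitalized multi-word names, and a real tokenizer would need more cases.

```python
import re

# Illustrative pattern only; alternatives are tried from top to bottom.
TOKEN_PATTERN = re.compile(r"""
    \$\d+(?:,\d{3})*(?:\.\d+)?        # monetary amounts such as $1,250.50
  | \d{1,2}/\d{1,2}/\d{2,4}           # numeric dates such as 12/31/2019
  | (?:[A-Z][a-z]+\s)+[A-Z][a-z]+     # capitalized multi-word names
  | \w+                               # any other word
""", re.VERBOSE)

sample = "Acme Corp paid John Smith $1,250.50 on 12/31/2019."
tokens = TOKEN_PATTERN.findall(sample)
print(tokens)
```

Only non-capturing groups `(?:...)` are used, so `findall` returns whole matches. A verbose pattern string of this kind, prefixed with `(?x)` as in the NLTK book's examples, should be usable as the pattern argument of `nltk.regexp_tokenize()`.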



# Problem 2: Working with text data

In this assignment you will learn how to predict tags for posts from [StackOverflow](https://stackoverflow.com) using a multilabel classification approach.


### Text preprocessing

In [None]:
from nltk.corpus import stopwords

In this assignment, we use a dataset of post titles from StackOverflow. You are provided with three sets of files: *train*, *validation* and *test*. All corpora (except for *test*) contain the post titles and the corresponding tags (100 tags are available).

In [None]:
#Import relevant libraries
import pandas as pd
import numpy as np
from ast import literal_eval
from nltk.tokenize import word_tokenize

In [None]:
def read_data(filename):
    data = pd.read_csv(filename, sep='\t')
    data['tags'] = data['tags'].apply(literal_eval)
    return data
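
The call to `literal_eval` is there because a list stored in a text file comes back as a plain string. The sketch below shows the conversion on one made-up cell value; it assumes the *tags* column stores Python list literals, which the use of `literal_eval` suggests.

```python
from ast import literal_eval

raw_tags = "['php', 'mysql']"          # how a tags cell looks when read as plain text
parsed_tags = literal_eval(raw_tags)   # safely evaluates the literal into a real list

print(type(raw_tags).__name__, "->", type(parsed_tags).__name__, parsed_tags)
```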

In [None]:
train = read_data('train.tsv')
validation = read_data('validation.tsv')
test = pd.read_csv('test.tsv', sep='\t')

In [None]:
train.head()

In [None]:
#Extract the titles and tags from the train/val/test sets
X_train, y_train = train['title'].values, train['tags'].values
X_val, y_val = validation['title'].values, validation['tags'].values
X_test = test['title'].values

One of the major hurdles when working with text data is that it's unstructured and contains many unnecessary/weird tokens. To address this problem, it's usually useful to preprocess and clean the data. In this task you'll write a function, which will be used later. 

**Implement the function *text_processing* following the instructions. Run the function *text_processing_test* afterwards to test it on selected cases.**

In [None]:
#We'll be working with regular expressions to clean the text data
import re

In [None]:
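# Requires the NLTK 'stopwords' corpus (and the tokenizer models used by word_tokenize); install them once with nltk.download
replace_re_by_space = re.compile(r'[/(){}\[\]\|@,;]')
delete_re_symbols = re.compile(r'[^0-9a-z #+_]')
stop_words = set(stopwords.words('english'))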


def text_processing(text):
    """
        Input text: string
        
        Output: modified text based on RE
    """
    text = # convert text to lowercase
    text = # replace every symbol matched by replace_re_by_space with a space
    text = # delete every symbol matched by delete_re_symbols
    token_word = word_tokenize(text)
    # keep only the tokens that are not in the stop word set
    filtered_sentence = [w for w in token_word if w not in stop_words]
    # join the remaining tokens with single spaces; nothing to do here
    text = " ".join(filtered_sentence)
    return text

In [None]:
def text_processing_test():
    examples = ["SQL Server - any equivalent of Excel's CHOOSE function?",
                "How to free c++ memory vector<int> * arr?"]
    answers = ["sql server equivalent excels choose function", 
               "free c++ memory vectorint arr"]
    for ex, ans in zip(examples, answers):
        if text_processing(ex) != ans:
            return "Wrong answer for the case: '%s'" % ex
    return "CONGRATS! ALL TESTS PASSED!"

In [None]:
#This should not throw an exception
print(text_processing_test())
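
To see what the three cleaning steps do to a single title, here is a self-contained sketch. It is not the reference solution: `clean_title` and `toy_stop_words` are hypothetical names, the stop word list is a tiny made-up subset, and `str.split()` stands in for `word_tokenize` so that no NLTK data is needed.

```python
import re

# Same character classes as in the notebook, repeated so the sketch is self-contained.
replace_by_space = re.compile(r'[/(){}\[\]\|@,;]')
delete_symbols = re.compile(r'[^0-9a-z #+_]')
# Tiny hypothetical stop word list; the notebook uses NLTK's English list instead.
toy_stop_words = {"how", "to", "any", "of"}


def clean_title(title):
    """Lowercase, normalize punctuation and drop stop words (illustrative helper)."""
    title = title.lower()
    title = replace_by_space.sub(" ", title)   # separators become spaces
    title = delete_symbols.sub("", title)      # everything else unusual is removed
    # str.split() stands in for word_tokenize here
    return " ".join(word for word in title.split() if word not in toy_stop_words)


cleaned = clean_title("How to free c++ memory vector<int> * arr?")
print(cleaned)
```

Note that `+` and `#` survive the cleaning, which keeps tokens such as `c++` and `c#` intact.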

We can now use our function *text_processing* on the data to clean the titles.

In [None]:
X_train = [text_processing(x) for x in X_train]
X_val = [text_processing(x) for x in X_val]
X_test = [text_processing(x) for x in X_test]

In [None]:
X_train[:3]

### Convert text to word count vectors with CountVectorizer.

Machine Learning algorithms work with numeric data. There are many ways to transform text data to numeric vectors. In this task you will try to use two of them.

#### Word Counts with CountVectorizer

Create three matrices, *X_train_vectorizer*, *X_val_vectorizer* and *X_test_vectorizer*, which are the bag-of-words representations of *X_train*, *X_val* and *X_test*.

You can use sklearn.feature_extraction.text.CountVectorizer as follows:

 - Create an instance of the CountVectorizer class.
 - Call the fit_transform() function on the training set to learn a vocabulary and encode each title as a count vector.
 - Call the transform() function on the validation and test sets so that they are encoded with the same vocabulary.
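
The steps above can be sketched on a toy corpus as follows. The titles and the `toy_` names are made up for illustration; only the `token_pattern` value is taken from the assignment.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ["free c++ memory", "c++ vector memory leak"]
toy_val = ["memory leak java"]   # "java" never appears in toy_train

toy_vectorizer = CountVectorizer(token_pattern=r'(\S+)')
toy_train_counts = toy_vectorizer.fit_transform(toy_train)  # learn vocabulary, then encode
toy_val_counts = toy_vectorizer.transform(toy_val)          # reuse the learned vocabulary

toy_vocabulary = sorted(toy_vectorizer.vocabulary_)
print(toy_vocabulary)
print(toy_train_counts.toarray())
print(toy_val_counts.toarray())
```

Words that were not seen during fitting (here `java`) are simply ignored when transforming new text, which is why the vectorizer is fitted on the training set only.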

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

def count_vectorizer_features(X_train, X_val, X_test):
    """
        X_train, X_val, X_test: samples        
        return bag-of-words (count) representation of each sample
    """
    # Create a CountVectorizer with a proper choice of parameters,
    # add token_pattern='(\S+)' to the list of parameters; '(\S+)' matches any run of non-whitespace characters
    # Fit the vectorizer on the train set
    # Transform the train, val, and test sets and return the result
    
    '''
    YOUR CODE HERE
    '''
    
    return X_train, X_val, X_test

In [None]:
#Run this cell
X_train_vectorizer, X_val_vectorizer, X_test_vectorizer = count_vectorizer_features(X_train, X_val, X_test)

#### TF-IDF

The second approach extends the bag-of-words framework by taking into account how frequently each word occurs across the whole corpus. It penalizes words that occur too frequently and provides a better feature space.

Implement the function *tfidf_features* using the [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class from *scikit-learn*. Use the *train* corpus to fit the vectorizer. Don't forget to look at the arguments that you can pass to it. We suggest that you filter out words that are too rare (occurring in fewer than 5 titles) and too frequent (occurring in more than 90% of the titles). Also, use bigrams along with unigrams in your vocabulary.
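
The effect of the frequency filters and of bigrams can be seen on a toy corpus. The sketch below is an illustration under assumptions: the four titles are made up, and `min_df=2` is used instead of the suggested 5 because the toy corpus is so small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_titles = [
    "python list sort",
    "python dict sort",
    "java list sort",
    "sort python string",
]

toy_tfidf = TfidfVectorizer(
    min_df=2,              # keep terms found in at least 2 titles (5 is suggested for the real data)
    max_df=0.9,            # drop terms found in more than 90% of the titles
    ngram_range=(1, 2),    # unigrams and bigrams
    token_pattern=r'(\S+)',
)
toy_tfidf_matrix = toy_tfidf.fit_transform(toy_titles)

kept_terms = sorted(toy_tfidf.vocabulary_)
print(kept_terms)
print(toy_tfidf_matrix.shape)
```

Here `sort` is dropped because it occurs in every title, terms such as `dict` or `java` are dropped because they occur only once, and the bigram `list sort` survives because it occurs in two titles.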

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
def tfidf_features(X_train, X_val, X_test):
    """
        X_train, X_val, X_test: samples        
        return TF-IDF vectorized representation of each sample
    """
    # Create a TF-IDF vectorizer with a proper choice of parameters,
    # add token_pattern='(\S+)' to the list of parameters; '(\S+)' matches any run of non-whitespace characters
    # Fit the vectorizer on the train set
    # Transform the train, val, and test sets and return the result
    
    '''
    YOUR CODE HERE
    '''
    
    return X_train, X_val, X_test

In [None]:
#Run this cell
X_train_tfidf, X_val_tfidf, X_test_tfidf = tfidf_features(X_train, X_val, X_test)

In [None]:
print('X_test_tfidf ', X_test_tfidf.shape) 
print('X_val_tfidf ',X_val_tfidf.shape)
print('X_val_vectorizer ',X_val_vectorizer.shape)

### MultiLabel classifier

As noted before, each example in this task can have multiple tags. To handle this kind of prediction, we need to transform the labels into a binary form, so that each prediction is a mask of 0s and 1s. For this purpose it is convenient to use [MultiLabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) from *sklearn*.

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

In [None]:
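# tags_count is not defined earlier in this notebook, so count tag frequencies on the training set
from collections import Counter
tags_count = Counter(tag for tags in y_train for tag in tags)
mlb = MultiLabelBinarizer(classes=sorted(tags_count.keys()))
y_train = mlb.fit_transform(y_train)
y_val = mlb.transform(y_val)  # reuse the mapping fitted on the training labels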
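
To make the binary form concrete, here is a small sketch on made-up tag lists. The tag names and the `toy_` variables are illustrative only.

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_tags = [["python", "pandas"], ["java"], ["python"]]

# Fixing the class order makes the column layout explicit: java, pandas, python.
toy_mlb = MultiLabelBinarizer(classes=["java", "pandas", "python"])
toy_label_matrix = toy_mlb.fit_transform(toy_tags)
print(toy_label_matrix)

# inverse_transform maps each 0/1 row back to a tuple of tag names.
toy_recovered = toy_mlb.inverse_transform(toy_label_matrix)
print(toy_recovered)
```

Each row is one example and each column is one tag, so a classifier's 0/1 prediction mask can be turned back into tag names with `inverse_transform`.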

Implement the function *train_classifier* for training a classifier. In this task we suggest using the One-vs-Rest approach, which is implemented in the [OneVsRestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html) class. In this approach *k* classifiers (one per tag) are trained. As the base classifier, use [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time, because the number of classifiers to train is large.

In [None]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier

In [None]:
def train_classifier(X_train, y_train):
    """
      X_train, y_train — training data
      
      return: trained classifier
    """
    
    # Create and fit LogisticRegression wrapped into OneVsRestClassifier.
        
    '''
    YOUR ONE LINE OF CODE HERE
    '''   
    return model

   

Train the classifiers for different data transformations: **CountVectorizer** and **tf-idf**.

In [None]:
classifier_vectorizer = train_classifier(X_train_vectorizer, y_train)
classifier_tfidf = train_classifier(X_train_tfidf, y_train)

Now you can create predictions for the data. You will need two types of predictions: labels and scores.

In [None]:
y_val_predicted_labels_vectorizer = classifier_vectorizer.predict(X_val_vectorizer)
y_val_predicted_scores_vectorizer = classifier_vectorizer.decision_function(X_val_vectorizer)

y_val_predicted_labels_tfidf = classifier_tfidf.predict(X_val_tfidf)
y_val_predicted_scores_tfidf = classifier_tfidf.decision_function(X_val_tfidf)
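
The difference between the two kinds of predictions can be illustrated on a toy problem. The data below is made up (label *j* is on exactly when feature *j* is 1), and the default LogisticRegression settings are an assumption made for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data: 4 samples, 2 features, 2 labels.
toy_X = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
toy_Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

# One binary LogisticRegression is trained per label column.
toy_model = OneVsRestClassifier(LogisticRegression()).fit(toy_X, toy_Y)

toy_labels = toy_model.predict(toy_X)             # 0/1 mask, one column per label
toy_scores = toy_model.decision_function(toy_X)   # real-valued margins, one column per label

print(toy_labels)
print(toy_scores.round(2))
```

Labels are what you compare against the true tags, while scores are needed for ranking-based metrics such as the area under the ROC curve. In this sketch the label mask coincides with thresholding the scores at zero.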

Now take a look at how the classifier that uses TF-IDF performs on a few examples:

In [None]:
y_val_pred_inversed = mlb.inverse_transform(y_val_predicted_labels_tfidf)
y_val_inversed = mlb.inverse_transform(y_val)
for i in range(5):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_val[i],
        ','.join(y_val_inversed[i]),
        ','.join(y_val_pred_inversed[i])
    ))

Next, we need to compare the results of different predictions, e.g. to see whether the TF-IDF transformation helps or to try different regularization techniques in logistic regression. For all these experiments, we need to set up an evaluation procedure.

### Evaluation

To evaluate the results we will use several classification metrics:
 - [Accuracy](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)
 - [F1-score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
 - [Area under ROC-curve](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html)
 - [Area under precision-recall curve](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score) 
 

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score 
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_score, recall_score

Implement the function *print_evaluation_scores* which calculates and prints to stdout:
 - *accuracy*
 - *F1-score macro/micro/weighted*
 - *Precision macro/micro/weighted*
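
To get a feel for how these metrics behave on multilabel data, here is a small worked sketch. The two 0/1 matrices are made up; they differ only in the third label column.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score

toy_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
toy_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1]])

# For multilabel input, accuracy_score is subset accuracy: a row counts only if it matches exactly.
toy_accuracy = accuracy_score(toy_true, toy_pred)

# The averaging mode decides how per-label scores are combined.
toy_f1 = {avg: f1_score(toy_true, toy_pred, average=avg)
          for avg in ("macro", "micro", "weighted")}
toy_precision_weighted = precision_score(toy_true, toy_pred, average="weighted")

print(toy_accuracy, toy_f1, toy_precision_weighted)
```

Only one of the three rows is predicted exactly, so accuracy is 1/3. The first two labels are predicted perfectly and the third not at all, so macro F1 is 2/3, while micro and weighted F1 are both 0.8.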

In [None]:
def print_evaluation_scores(y_val, predicted):
    
    accuracy= #your code here
    f1_score_macro= #your code here
    f1_score_micro= #your code here
    f1_score_weighted= #your code here
    precision_weighted= #your code here
    print(accuracy,f1_score_macro,f1_score_micro,f1_score_weighted,precision_weighted)

In [None]:
print('CountVectorizer')
print_evaluation_scores(y_val, y_val_predicted_labels_vectorizer)
print('Tfidf')
print_evaluation_scores(y_val, y_val_predicted_labels_tfidf)

You might also want to plot a generalization of the [ROC curve](http://scikit-learn.org/stable/modules/model_evaluation.html#receiver-operating-characteristic-roc) for the case of multi-label classification. The provided function *roc_auc* can do this for you. The input parameters of this function are:
 - true labels
 - decision function scores
 - number of classes

**Task 4 (MultilabelClassification).** Once we have the evaluation set up, we suggest that you experiment a bit with training your classifiers. 
- compare the quality of the CountVectorizer and TF-IDF approaches and choose one of them.
- for the one you choose, try *L1* and *L2*-regularization techniques in Logistic Regression with different coefficients (e.g. C equal to 0.1, 1, 10, 100).

You could also try other improvements to the preprocessing or the model, if you want.

Print the evaluation scores. Did you make any improvement?
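
One way to organize such an experiment is a small loop over penalties and values of C, sketched below on synthetic data. Everything here is an assumption made for illustration: the random features, the two derived labels, the `liblinear` solver (chosen because it supports both penalties), and the availability of the `penalty` argument in the installed scikit-learn version. The scores are computed on the training data only to keep the sketch short; in the assignment you would score on the validation set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.RandomState(0)
toy_X = rng.rand(40, 5)
# Two hypothetical labels derived from the features, so there is signal to learn.
toy_Y = np.column_stack([toy_X[:, 0] > 0.5,
                         toy_X[:, 1] + toy_X[:, 2] > 1.0]).astype(int)

grid_results = {}
for penalty in ("l1", "l2"):
    for C in (0.1, 1, 10, 100):
        base = LogisticRegression(penalty=penalty, C=C, solver="liblinear")
        model = OneVsRestClassifier(base).fit(toy_X, toy_Y)
        # Weighted F1 on the training data, for illustration only.
        grid_results[(penalty, C)] = f1_score(toy_Y, model.predict(toy_X), average="weighted")

for setting, score in sorted(grid_results.items()):
    print(setting, round(score, 3))
```

Smaller values of C mean stronger regularization, so very small values can push the model towards predicting no tags at all.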

In [None]:
from sklearn.pipeline import make_pipeline

def train_classifier(X_train, y_train):
    """
      X_train, y_train — training data
      
      return: trained classifier
    """
    
    # Create and fit LogisticRegression wrapped into OneVsRestClassifier.
        
    '''
    YOUR CODE HERE
    '''
    
    return model

'''
Call the train_classifier model and create predictions for the data. 
You will need two types of predictions: labels and scores.
'''

In [None]:
print('CountVectorizer model')
print_evaluation_scores(y_val, y_val_predicted_labels_vectorizer)
print('TF-IDF model')
print_evaluation_scores(y_val, y_val_predicted_labels_tfidf)    
