# Text Classification

There are three main types of classification:

- Binary: Two mutually exclusive categories (e.g., Spam detection)
- Multiclass: More than 2 mutually exclusive categories (e.g., Language detection)
- Multilabel: Non-mutually exclusive categories (e.g., movie genres)


# Binary text classification problem

We will address the binary problem of detecting sports-related documents versus any other type of document. To do this, we will create an artificial (and very small) collection and follow these steps:

- Define a set of labelled documents that will be our *training dataset*. These are the documents the classifier will learn from in order to categorise future _unseen_ documents

- Define a set of labelled documents that will be our *testing dataset*. These will be the "unseen" documents whose labels the classifier will predict (without having been trained on them)

- Represent our training and testing documents

- Train the classifier based on the training data

- Predict the labels for the testing documents


In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Train and test data. Both the full documents and their labels ("Sports" vs "Non Sports")
train_data = ['Football: a great sport', 'The referee has been very bad this season', 'Our team scored 5 goals', 'I love tenis',
              'Politics is in decline in the UK', 'Brexit means Brexit', 'The parlament wants to create new legislation',
              'I so want to travel the world']
train_labels = ["Sports","Sports","Sports","Sports", "Non Sports", "Non Sports", "Non Sports", "Non Sports"]

test_data = ['Swimming is a great sport', 
             'A lot of policy changes will happen after Brexit', 
             'The table tenis team will travel to the UK soon for the European Championship']
test_labels = ["Sports","Non Sports","Sports"]

# Representation of the data using TF-IDF
vectorizer = TfidfVectorizer()
vectorised_train_data = vectorizer.fit_transform(train_data)
vectorised_test_data = vectorizer.transform(test_data)

# Train the classifier given the training data
classifier = LinearSVC()
classifier.fit(vectorised_train_data, train_labels)

# Predict the labels for the test documents (not used for training)
print(classifier.predict(vectorised_test_data))

['Sports' 'Non Sports' 'Non Sports']
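
The `test_labels` list defined in the cell is never compared with the predictions. As a minimal sketch, the output printed above can be scored with scikit-learn's `accuracy_score`. The `predicted_labels` list below is copied by hand from that output (an assumption made to keep the example self-contained), not recomputed.

```python
from sklearn.metrics import accuracy_score

# Gold labels for the three test documents (same values as test_labels above)
test_labels = ["Sports", "Non Sports", "Sports"]

# Assumption: predictions copied from the printed output, not recomputed here
predicted_labels = ["Sports", "Non Sports", "Non Sports"]

# Fraction of test documents whose predicted label matches the gold label
accuracy = accuracy_score(test_labels, predicted_labels)
print(f"Accuracy: {accuracy:.2f}")
```

In the notebook itself, `classifier.predict(vectorised_test_data)` could be passed in place of the hand-copied list.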


# Congratulations, you have built your first text classifier!!

However, the third case is wrongly classified. Why do you think that might be?

- Matching problems (e.g., "car" is different from "Cars")
- Cases never seen before (e.g., the classifier has never seen the word "table")
- "Spurious" correlations and bias ("car" appears only in the positive category)
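
The first two points can be checked directly on the vectoriser. The sketch below uses a hypothetical two-sentence corpus (not the notebook's data). With its default settings, `TfidfVectorizer` lowercases the text, so "Cars" and "cars" match, but "car" and "cars" remain separate features. A word that was not seen when fitting is simply dropped when transforming.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for this illustration
toy_corpus = ["I love my car", "Cars are fast"]

toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_corpus)

# "car" and "cars" are two different features (single-letter "I" is ignored)
vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)

# A document made only of unseen words ends up with no non-zero features
unseen = toy_vectorizer.transform(["table"])
print(unseen.nnz)
```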

Let's look at how we are representing our documents.


In [2]:
from pprint import pprint

# Function to show the feature weights of a document (to be explained later)
def feature_values(doc, representer):
    doc_representation = representer.transform([doc])
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    features = representer.get_feature_names_out()
    return [(features[index], doc_representation[0, index]) for index in doc_representation.nonzero()[1]]

pprint([feature_values(doc, vectorizer) for doc in test_data])

[[('sport', 0.57735026918962584),
  ('is', 0.57735026918962584),
  ('great', 0.57735026918962584)],
 [('brexit', 1.0)],
 [('uk', 0.34666892278432909),
  ('travel', 0.34666892278432909),
  ('to', 0.29053561299308733),
  ('the', 0.6594480187891556),
  ('tenis', 0.34666892278432909),
  ('team', 0.34666892278432909)]]


# Let's try again, with stop-word removal this time

In [3]:
from nltk.corpus import stopwords

# Load the list of English stop words from nltk (requires a one-off nltk.download("stopwords"))
stop_words = stopwords.words("english")

# Represent, train, predict
vectorizer = TfidfVectorizer(stop_words=stop_words)
vectorised_train_data = vectorizer.fit_transform(train_data)
vectorised_test_data = vectorizer.transform(test_data)
classifier = LinearSVC()
classifier.fit(vectorised_train_data, train_labels)

print(classifier.predict(vectorised_test_data))
# Expected: [Sports, Non Sports, Sports]

['Sports' 'Non Sports' 'Sports']
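
The cell above uses NLTK's stop-word list. As a hedged alternative (not what this notebook uses), scikit-learn ships its own English list, selected with `stop_words="english"`. The two lists are not identical, so predictions may differ. The sketch below only checks which words disappear from the vocabulary and does not retrain the classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_data = ['Football: a great sport', 'The referee has been very bad this season',
              'Our team scored 5 goals', 'I love tenis',
              'Politics is in decline in the UK', 'Brexit means Brexit',
              'The parlament wants to create new legislation',
              'I so want to travel the world']

# Vocabulary without and with scikit-learn's built-in English stop-word list
default_vocabulary = TfidfVectorizer().fit(train_data).vocabulary_
filtered_vocabulary = TfidfVectorizer(stop_words="english").fit(train_data).vocabulary_

print("the" in default_vocabulary)
print("the" in filtered_vocabulary)

# Words dropped by the stop-word filter
removed = sorted(set(default_vocabulary) - set(filtered_vocabulary))
print(removed)
```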


# Great!! 

# Multi-Class classification problem

We will address the multi-class problem of detecting the language of a sentence, choosing among 3 mutually exclusive languages (Spanish, English and French). For the sake of this example, we assume those are the only 3 languages the documents can be written in. As before, we will create an artificial (and very small) collection and follow similar steps.

In [4]:
# Artificial (and small) dataset: Spanish, English and French texts
train_data = ['PyCon es una gran conferencia', 'Aprendizaje automatico esta listo para dominar el mundo dentro de poco',
             'This is a great conference with a lot of amazing talks', 'AI will dominate the world in the near future',
             'Dix chiffres por resumer le feuilleton de la loi travail']
train_labels = ["SP", "SP", "EN", "EN", "FR"]

test_data = ['Estoy preparandome para dominar las olimpiadas', 'Me gustaria mucho aprender el lenguage de programacion Scala',
             'Machine Learning is amazing','Hola a todos']
test_labels = ["SP", "SP", "EN", "SP"]

# Represent
vectorizer = TfidfVectorizer()  # Note: no stop-word removal here. Stop words could be beneficial in this problem
vectorised_train_data = vectorizer.fit_transform(train_data)
vectorised_test_data = vectorizer.transform(test_data)

# Train
classifier = LinearSVC()
classifier.fit(vectorised_train_data, train_labels)

# Predict
predictions = classifier.predict(vectorised_test_data)
pprint(predictions)
# Expected: [SP, SP, EN, SP]

array(['SP', 'SP', 'EN', 'EN'],
      dtype='<U2')


# mmm, the last case is wrong. Can you guess why?

- Can we learn from cases never seen before?
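
A quick check makes the problem concrete: none of the words in "Hola a todos" occur in the training sentences, so its word-level vector is all zeros and the prediction cannot depend on the text itself. The second half of the sketch uses character n-grams (`analyzer="char_wb"`), which is an assumption about one possible alternative representation and not the method used in this notebook. It only shows that the same sentence then shares some features with the training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_data = ['PyCon es una gran conferencia',
              'Aprendizaje automatico esta listo para dominar el mundo dentro de poco',
              'This is a great conference with a lot of amazing talks',
              'AI will dominate the world in the near future',
              'Dix chiffres por resumer le feuilleton de la loi travail']

# Word-level features, as in the notebook
word_vectorizer = TfidfVectorizer()
word_vectorizer.fit(train_data)
word_vector = word_vectorizer.transform(['Hola a todos'])
print(word_vector.nnz)  # number of non-zero features

# Assumption: character n-grams (1 to 3 characters) as an alternative representation
char_vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
char_vectorizer.fit(train_data)
char_vector = char_vectorizer.transform(['Hola a todos'])
print(char_vector.nnz)
```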

# You have just built a nice language detection system!!

# Multi-label Problem

We will address the multi-label problem of labelling documents as being relevant to Sports, to Politics, to both or to neither. As before, we will create an artificial (and very small) collection, with similar initial steps.

There are two modifications for our example to run in a multi-label way:

- Change the representation of the labels, viewing every document as a list of bits, each one indicating whether or not the document belongs to a category. (*MultiLabelBinarizer*)
- Train one classifier per category (N in total), where the negative cases are the documents that do not belong to that category. (*OneVsRestClassifier*)

In [5]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier

# Artificial (and small) dataset. Sports and Politics
train_data = ['Football: a great sport', 'The referee has been very bad this season', 'Our team scored 5 goals', 'I love tenis',
              'Politics is in decline in the UK', 'Brexit means Brexit', 'The parlament wants to create new legislation',
              'I so want to travel the world', 
              'The goverment will increase the budget for sports in the UK after the victories in the Olimpic Games',
              "O'Reilly has a great conference this year"]
train_labels = [["Sports"], ["Sports"], ["Sports"], ["Sports"],["Politics"],["Politics"],["Politics"],[],["Politics", "Sports"],[]]

test_data = ['Swimming is a great sport', 
             'A lot of policy changes will happen after Brexit', 
             'The table tenis team will travel to the UK soon for the European Championship',
             'The goverment will increase the budget for sports in the UK after the victories in the Olimpic Games',
             'PyCon is my favourite conference']
test_labels = [["Sports"], ["Politics"], ["Sports"], ["Politics","Sports"],[]]

# Change the representation of our data as a list of bit lists 
mlb = MultiLabelBinarizer()
binary_train_labels = mlb.fit_transform(train_labels)
binary_test_labels = mlb.transform(test_labels)

print(binary_train_labels)

[[0 1]
 [0 1]
 [0 1]
 [0 1]
 [1 0]
 [1 0]
 [1 0]
 [0 0]
 [1 1]
 [0 0]]
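
To read the matrix above, it helps to know which column stands for which category. As a small sketch on a hypothetical three-document label list, `MultiLabelBinarizer` stores the categories in sorted order in its `classes_` attribute, so the first column is "Politics" and the second is "Sports". `inverse_transform` maps the bits back to label tuples.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label lists: one label, two labels, and no label at all
sample_labels = [["Sports"], ["Politics", "Sports"], []]

sample_mlb = MultiLabelBinarizer()
binary_labels = sample_mlb.fit_transform(sample_labels)

# Column order of the binary matrix
print(sample_mlb.classes_)
print(binary_labels)

# Map the bit lists back to tuples of category names
recovered = sample_mlb.inverse_transform(binary_labels)
print(recovered)
```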


In [6]:
# Represent 
vectorizer = TfidfVectorizer(stop_words=stop_words)
vectorised_train_data = vectorizer.fit_transform(train_data)
vectorised_test_data = vectorizer.transform(test_data)

# One classifier built per category using a one-vs-rest approach
classifier = OneVsRestClassifier(LinearSVC())
classifier.fit(vectorised_train_data, binary_train_labels)

# Predict
predictions = classifier.predict(vectorised_test_data)

print(predictions)
print()

print(mlb.inverse_transform(predictions))

[[0 1]
 [1 0]
 [0 1]
 [1 1]
 [0 0]]

[('Sports',), ('Politics',), ('Sports',), ('Politics', 'Sports'), ()]


# This concludes this notebook!!!