# Machine Learning with Python

In [1]:
import numpy as np
import pandas as pd

## 3.1 Text Data

In this section we will explore some techniques for making use of *unstructured* text data in supervised and unsupervised learning. The techniques introduced come from the fields of *information retrieval* (IR) and *natural language processing* (NLP).

Each data point consists of a single text, called a *document*.

The set of all documents in the analysis is called a *corpus*.


### The corpus

We will look at a set of user-contributed movie reviews retrieved from IMDb (The Internet Movie Database). Each document is the text of one review, together with a label indicating whether the review is broadly "positive" or "negative".

The data provided here are derived from the dataset available at http://ai.stanford.edu/~amaas/data/sentiment


Firstly, you will need to unpack the archived dataset `imdb.zip`.

The unpacked data contains a `train` and a `test` directory, each of which contains positive and negative examples.

`scikit-learn` can directly load the labelled corpus from this directory structure:

In [2]:
!pwd

/home/snowztail/Repositories/learn-python/machine-learning-with-python/notebooks


In [3]:
from sklearn.datasets import load_files

reviews_train = load_files("../assets/imdb/train/")
# load_files returns a Bunch: .data holds the raw review texts, .target the integer class labels
text_train, y_train = reviews_train.data, reviews_train.target

In [15]:
reviews_test = load_files("../assets/imdb/test/")
text_test, y_test = reviews_test.data, reviews_test.target
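To make the expected directory layout concrete, here is a minimal sketch that builds a throwaway corpus with one sub-directory per class and loads it with `load_files`. The folder names, file names and texts are hypothetical and chosen only for illustration; they are not taken from the IMDb archive.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical toy corpus: one sub-directory per class, one file per document
toy_corpus = {"neg": ["dull and slow"], "pos": ["a wonderful film"]}

with tempfile.TemporaryDirectory() as root:
    for label, texts in toy_corpus.items():
        class_dir = Path(root) / label
        class_dir.mkdir()
        for i, text in enumerate(texts):
            (class_dir / "{}.txt".format(i)).write_text(text, encoding="utf-8")

    # File contents are read into memory here, before the directory is removed
    toy = load_files(root)

# Class names are the sub-directory names in sorted order;
# each entry of target indexes into target_names
print(toy.target_names)
print(len(toy.data), type(toy.data[0]))
```

With the default `encoding=None`, the documents come back as undecoded `bytes` objects.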

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary from the training texts, then encode each document
# as a row of token counts in a sparse matrix
vect = CountVectorizer().fit(text_train)
X_train = vect.transform(text_train)

print("X_train:\n{}".format(repr(X_train)))

X_train:
<25000x74849 sparse matrix of type '<class 'numpy.int64'>'
	with 3445861 stored elements in Compressed Sparse Row format>
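The bag-of-words encoding is easier to follow on a corpus small enough to print in full. The sketch below uses two made-up sentences (an assumption for illustration, not part of the IMDb data) and shows the learned vocabulary alongside the dense form of the count matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus
toy_docs = [
    "The fool doth think he is wise,",
    "but the wise man knows himself to be a fool",
]

toy_vect = CountVectorizer().fit(toy_docs)
bag = toy_vect.transform(toy_docs)

# vocabulary_ maps each token to its column index in the matrix
print(toy_vect.vocabulary_)
# One row per document, one column per vocabulary entry
print(bag.toarray())
```

Note that the default tokenizer lowercases the text and drops single-character tokens such as "a", so the matrix has 13 columns rather than 14.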


In [5]:
feature_names = vect.get_feature_names_out()

print("Number of features: {}".format(len(feature_names)))
print("First 20 features:\n{}".format(feature_names[:20]))
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
print("Every 2000th feature:\n{}".format(feature_names[::2000]))

Number of features: 74849
First 20 features:
['00' '000' '0000000000001' '00001' '00015' '000s' '001' '003830' '006'
 '007' '0079' '0080' '0083' '0093638' '00am' '00pm' '00s' '01' '01pm' '02']
Features 20010 to 20030:
['dratted' 'draub' 'draught' 'draughts' 'draughtswoman' 'draw' 'drawback'
 'drawbacks' 'drawer' 'drawers' 'drawing' 'drawings' 'drawl' 'drawled'
 'drawling' 'drawn' 'draws' 'draza' 'dre' 'drea']
Every 2000th feature:
['00' 'aesir' 'aquarian' 'barking' 'blustering' 'bête' 'chicanery'
 'condensing' 'cunning' 'detox' 'draper' 'enshrined' 'favorit' 'freezer'
 'goldman' 'hasan' 'huitieme' 'intelligible' 'kantrowitz' 'lawful' 'maars'
 'megalunged' 'mostey' 'norrland' 'padilla' 'pincher' 'promisingly'
 'receptionist' 'rivals' 'schnaas' 'shunning' 'sparse' 'subset'
 'temptations' 'treatises' 'unproven' 'walkman' 'xylophonist']


In [10]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

scores = cross_val_score(LogisticRegression(solver='lbfgs', max_iter=1000), X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.88


In [13]:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}

# Exhaustive search over specified parameter values for an estimator
grid = GridSearchCV(LogisticRegression(solver='newton-cg', max_iter=100), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)

Best cross-validation score: 0.89
Best parameters:  {'C': 0.1}
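One caveat: in the cells above the vocabulary is built from the full training set before cross-validation, so each validation fold has already influenced the feature set. A possible alternative, sketched here on a hypothetical toy corpus, is to wrap the vectorizer and the classifier in a pipeline so that the vocabulary is re-learned inside every fold. This is a sketch of the general technique, not the notebook's own procedure.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 6 positive and 6 negative snippets
pos = ["great film", "wonderful acting", "loved the story",
       "great story", "wonderful film", "loved it"]
neg = ["terrible film", "awful acting", "hated the story",
       "terrible story", "awful film", "hated it"]
docs = pos + neg
labels = [1] * len(pos) + [0] * len(neg)

# The vectorizer is fitted only on the training part of each fold
pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

# Pipeline parameters are addressed as <step name>__<parameter>
toy_grid = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1, 10]}, cv=3)
toy_grid.fit(docs, labels)

print("Best parameters: ", toy_grid.best_params_)
```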


In [16]:
X_test = vect.transform(text_test)

print("Test score: {:.2f}".format(grid.score(X_test, y_test)))

Test score: 0.88


Next, we rebuild the vocabulary with the `min_df` parameter: terms whose document frequency (the number of documents in which they appear) is strictly lower than the given threshold are ignored. This threshold is also called the cut-off in the literature.

In [23]:
vect = CountVectorizer(min_df=5).fit(text_train)
X_train = vect.transform(text_train)

print("X_train with min_df: {}".format(repr(X_train)))

X_train with min_df: <25000x27272 sparse matrix of type '<class 'numpy.int64'>'
	with 3368680 stored elements in Compressed Sparse Row format>


In [21]:
# Compared to the first 20 features above, tokens that appear in at least 5 documents seem more informative
feature_names = vect.get_feature_names_out()

print("First 50 features:\n{}".format(feature_names[:50]))
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
print("Every 700th feature:\n{}".format(feature_names[::700]))

First 50 features:
['00' '000' '007' '00s' '01' '02' '03' '04' '05' '06' '07' '08' '09' '10'
 '100' '1000' '100th' '101' '102' '103' '104' '105' '107' '108' '10s'
 '10th' '11' '110' '112' '116' '117' '11th' '12' '120' '12th' '13' '135'
 '13th' '14' '140' '14th' '15' '150' '15th' '16' '160' '1600' '16mm' '16s'
 '16th']
Features 20010 to 20030:
['repent' 'repentance' 'repercussions' 'repertoire' 'repetition'
 'repetitions' 'repetitious' 'repetitive' 'rephrase' 'replace' 'replaced'
 'replacement' 'replaces' 'replacing' 'replay' 'replayable' 'replayed'
 'replaying' 'replays' 'replete']
Every 700th feature:
['00' 'affections' 'appropriately' 'barbra' 'blurbs' 'butcher' 'cheery'
 'commit' 'courtroom' 'deconstruct' 'disgraced' 'dvd' 'escapist' 'felix'
 'freeze' 'gorier' 'haunts' 'hungarian' 'insincere' 'juggernaut' 'leer'
 'mae' 'messes' 'mushy' 'occasion' 'parker' 'pleasantly' 'pronto' 'recipe'
 'reviewing' 'saruman' 'she' 'sneering' 'stefano' 'swashbuckling' 'thrust'
 'tvm' 'vampirism' 'wes

In [24]:
grid = GridSearchCV(LogisticRegression(solver='newton-cg', max_iter=100), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)

Best cross-validation score: 0.89
Best parameters:  {'C': 0.1}


In [25]:
# text_train is a list, where individual elements are immutable "bytes"
print("type of text_train: {}".format(type(text_train)))
print("type of elements of text_train: {}".format(type(text_train[0])))
print("length of text_train: {}".format(len(text_train)))
print("text_train[6]:\n{}".format(text_train[6]))

type of text_train: <class 'list'>
type of elements of text_train: <class 'bytes'>
length of text_train: 25000
text_train[6]:
b"This movie has a special way of telling the story, at first i found it rather odd as it jumped through time and I had no idea whats happening.<br /><br />Anyway the story line was although simple, but still very real and touching. You met someone the first time, you fell in love completely, but broke up at last and promoted a deadly agony. Who hasn't go through this? but we will never forget this kind of pain in our life. <br /><br />I would say i am rather touched as two actor has shown great performance in showing the love between the characters. I just wish that the story could be a happy ending."


`text_train` is a `list` of length 25000.

The individual documents are stored as type `bytes`, i.e. immutable sequences of raw bytes. They only become Unicode text once they are decoded (for example as UTF-8), which `CountVectorizer` does internally. See [docs](https://docs.python.org/3/library/stdtypes.html#bytes-objects) and [explanation](https://stackoverflow.com/questions/6269765/what-does-the-b-character-do-in-front-of-a-string-literal).

A `bytes` literal looks like a string literal with a `b` prepended.
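The difference between `bytes` and `str` shows up as soon as the text contains a non-ASCII character. In this short sketch the sample value is made up for illustration and UTF-8 is assumed as the encoding:

```python
# A bytes literal: "é" is stored as the two UTF-8 bytes 0xC3 0xA9
raw = b"Caf\xc3\xa9 society"

# Decoding turns the raw bytes into a Unicode string
text = raw.decode("utf-8")

print(type(raw), len(raw))    # length counted in bytes
print(type(text), len(text))  # length counted in characters
print(text)
```

The byte string is one element longer than the decoded string because "é" takes two bytes in UTF-8.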


The training dataset is balanced, with equal numbers of positive and negative reviews:

In [26]:
print("Samples per class (training): {}".format(np.bincount(y_train)))

Samples per class (training): [12500 12500]


### Cleaning up the data

Firstly, we should remove the `<br />` tags, which just represent line breaks.

In [27]:
text_train = [doc.replace(b"<br />", b" ") for doc in text_train]

We inspect and clean up the test data in the same way:

In [28]:
print("Number of documents in test data: {}".format(len(text_test)))
print("Samples per class (test): {}".format(np.bincount(y_test)))
text_test = [doc.replace(b"<br />", b" ") for doc in text_test]

Number of documents in test data: 25000
Samples per class (test): [12500 12500]
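Since the same cleaning step is applied to both splits, it can be convenient to put it in a small helper. The sketch below is a slightly more general variant that uses a regular expression to catch other spellings of the tag as well, such as `<br>` or `<BR/>`. Whether such variants occur in this corpus is an assumption, and `strip_line_breaks` is a hypothetical helper name.

```python
import re

# Matches <br>, <br/>, <br /> in any letter case; works on bytes, like the corpus
BR_TAG = re.compile(rb"<br\s*/?>", flags=re.IGNORECASE)


def strip_line_breaks(docs):
    """Return a new list of documents with HTML line-break tags replaced by spaces."""
    return [BR_TAG.sub(b" ", doc) for doc in docs]


# Hypothetical sample documents
sample = [b"Great.<br /><br />Loved it.", b"Slow<BR>and dull<br/>overall"]
cleaned_sample = strip_line_breaks(sample)
print(cleaned_sample)
```

The same function could then be called once for `text_train` and once for `text_test`.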


### Exercise

The file `reviews.json` contains expert reviews for papers submitted to an international conference on computing and informatics.

*Appel, Orestes & Chiclana, Francisco & Carter, Jenny & Fujita, Hamido, 2016. A hybrid approach to sentiment analysis.*

The data is held in JSON format, which is a flexible format for structured data. Here's how we can extract the reviews into a pandas DataFrame:

In [30]:
import json

# load data using Python JSON module
with open('../assets/reviews.json','r') as f:
    data = json.loads(f.read())

reviews = pd.json_normalize(data, record_path=['review'])
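To see what `record_path` does, consider a small nested structure in which every paper carries its own list of reviews. The layout and field values below are assumptions made for illustration and are not copied from `reviews.json`.

```python
import pandas as pd

# Hypothetical nested data: each paper holds a list of review records
papers = [
    {"id": 1, "review": [{"id": 1, "lan": "es", "evaluation": 1},
                         {"id": 2, "lan": "en", "evaluation": -1}]},
    {"id": 2, "review": [{"id": 1, "lan": "es", "evaluation": 2}]},
]

# record_path selects the nested list that becomes the rows;
# meta copies parent-level fields onto each row, and meta_prefix
# avoids a clash between the paper "id" and the review "id"
flat = pd.json_normalize(papers, record_path=["review"],
                         meta=["id"], meta_prefix="paper_")
print(flat)
```

Each review becomes one row, and the `paper_id` column records which paper it belongs to.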

In [32]:
reviews.keys()

Index(['confidence', 'evaluation', 'id', 'lan', 'orientation', 'remarks',
       'text', 'timespan'],
      dtype='object')

In [31]:
reviews.head()

| | confidence | evaluation | id | lan | orientation | remarks | text | timespan |
|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1 | 1 | es | 0 | | - El artículo aborda un problema contingente y... | 2010-07-05 |
| 1 | 4 | 1 | 2 | es | 1 | | El artículo presenta recomendaciones prácticas... | 2010-07-05 |
| 2 | 5 | 1 | 3 | es | 1 | | - El tema es muy interesante y puede ser de mu... | 2010-07-05 |
| 3 | 4 | 2 | 1 | es | 1 | | Se explica en forma ordenada y didáctica una e... | 2010-07-05 |
| 4 | 4 | 2 | 2 | es | 0 | | | 2010-07-05 |


The `text` column contains the text of each review, whilst the `evaluation` column is a numerical score for each paper.

Prepare a training and testing dataset containing only the Spanish-language (`lan == "es"`) reviews. Later, we will use these documents to attempt regression analysis to predict the `evaluation` score, which you should also extract as the target values.

*Notes*

It's fine to use strings for the documents rather than the `bytes` datatype we saw earlier. Remember that scikit-learn can handle pandas `Series` data without needing to unpack it.

You will need the DataFrame method `query()`.

Consider any basic cleaning operations you can sensibly apply to the text.
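As a hedged illustration of these notes, the sketch below filters a small made-up DataFrame with `query()` and applies two basic cleaning steps: collapsing runs of whitespace and dropping empty texts. The column names mirror the exercise, but the rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows with the same column names as the exercise data
toy = pd.DataFrame({
    "lan": ["es", "en", "es"],
    "text": ["  Muy   bueno. ", "Good.", ""],
    "evaluation": [2, 1, 0],
})

# Inside query(), @name refers to a local Python variable
lang = "es"
subset = toy.query("lan == @lang")

# Collapse repeated whitespace, trim the ends, then drop empty documents
cleaned = subset.assign(
    text=subset["text"].str.replace(r"\s+", " ", regex=True).str.strip()
)
cleaned = cleaned[cleaned["text"] != ""]

print(cleaned)
```

Dropping empty documents matters here because a review with no text gives the vectorizer nothing to work with.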

In [34]:
reviews[["lan"]]

| | lan |
|---|---|
| 0 | es |
| 1 | es |
| 2 | es |
| 3 | es |
| 4 | es |
| ... | ... |
| 400 | es |
| 401 | es |
| 402 | es |
| 403 | es |


In [36]:
reviews_es = reviews.query('lan == "es"')

In [39]:
reviews_es.text

0      - El artículo aborda un problema contingente y...
1      El artículo presenta recomendaciones prácticas...
2      - El tema es muy interesante y puede ser de mu...
3      Se explica en forma ordenada y didáctica una e...
4                                                       
                             ...                        
400    El trabajo pretende ofrecer una visión del uso...
401    El paper está bien escrito y de fácil lectura....
402    Observación de fondo:  No se presenta un ejemp...
403    Se propone un procedimiento para elaborar máqu...
404    El artículo describe básicamente los component...
Name: text, Length: 388, dtype: object

In [77]:
from sklearn.model_selection import train_test_split

text_train, text_test, y_train, y_test = train_test_split(reviews_es.text, reviews_es.evaluation, random_state=1)

In [85]:
from sklearn.feature_extraction.text import CountVectorizer

# vectorizer = CountVectorizer(min_df=2, token_pattern=r'(?u)\b\w\w+\b').fit(text_train)
vectorizer = CountVectorizer(min_df=1, token_pattern=r'(?u)\b[^_\d\W][^_\d\W]+\b').fit(text_train)
X_train = vectorizer.transform(text_train)
feature_names = vectorizer.get_feature_names_out()

In [86]:
feature_names

array(['aanalizar', 'abarca', 'abarcando', ..., 'único', 'útil', 'útiles'],
      dtype=object)
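The custom `token_pattern` can be checked in isolation with Python's `re` module. The character class `[^_\d\W]` matches a word character that is neither an underscore nor a digit, in other words a letter, so the pattern keeps only tokens made of two or more letters. The sample sentence is made up for illustration.

```python
import re

# Default CountVectorizer pattern: two or more word characters
default_pattern = re.compile(r"(?u)\b\w\w+\b")
# Pattern used above: two or more letters (no digits, no underscores)
letters_pattern = re.compile(r"(?u)\b[^_\d\W][^_\d\W]+\b")

# Hypothetical sample text
sample = "el artículo de 2010 usa utf_8 y 3d"

default_tokens = default_pattern.findall(sample)
letter_tokens = letters_pattern.findall(sample)

print(default_tokens)
print(letter_tokens)
```

Tokens that mix letters with digits or underscores, such as `utf_8` and `3d`, are rejected entirely by the second pattern rather than being split into their letter-only parts.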

In [87]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

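# LogisticRegression treats each evaluation score as a class label,
# so this is multi-class classification rather than regression
scores_train = cross_val_score(LogisticRegression(solver='lbfgs', max_iter=1000), X_train, y_train, cv=5)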
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores_train)))

Mean cross-validation accuracy: 0.34


In [89]:
X_test = vectorizer.transform(text_test)

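# Note: this cross-validates fresh models inside the held-out split; it does not
# score a model fitted on X_train, so it is not a true test-set evaluation
scores_test = cross_val_score(LogisticRegression(solver='lbfgs', max_iter=1000), X_test, y_test, cv=5)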
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores_test)))

Mean cross-validation accuracy: 0.36
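The exercise text asks for regression on `evaluation`, whereas the cells above treat the scores as class labels. A minimal regression sketch could instead fit once on the training split and score once on the held-out split. It assumes a `Ridge` model, which is a swapped-in choice and not the notebook's method, and uses a hypothetical toy corpus standing in for `reviews_es`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews and scores (stand-ins for reviews_es.text / .evaluation)
docs = [
    "muy buen artículo", "excelente trabajo", "buen aporte", "trabajo excelente",
    "artículo aceptable", "aporte aceptable", "trabajo regular", "artículo regular",
    "artículo débil", "trabajo débil", "muy mal escrito", "aporte muy pobre",
]
scores = [2, 2, 1, 2, 0, 0, 0, 0, -1, -1, -2, -2]

docs_train, docs_test, s_train, s_test = train_test_split(
    docs, scores, random_state=1
)

# The pipeline learns the vocabulary and the regression weights from the training split only
model = make_pipeline(CountVectorizer(), Ridge(alpha=1.0))
model.fit(docs_train, s_train)

# Evaluate the already-fitted model on documents it has never seen
pred = model.predict(docs_test)
mae = mean_absolute_error(s_test, pred)
print("Held-out mean absolute error: {:.2f}".format(mae))
```

Mean absolute error is used here because it is expressed in the same units as the score; other regression metrics would work equally well.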
