# Text Analysis.

Author: Jesús Cid Sueiro
Date: 2016/04/03


In this notebook we will explore some tools for text analysis in Python. To do so, we will first import the required Python libraries.

In [33]:
# Required imports
from wikitools import wiki
from wikitools import category

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from gensim import corpora

# Garbage
#import numpy as np
#import nltk.stem
#from gensim import corpora, models, matutils

## 1. Corpus acquisition.

There are many text collections available for testing topic modelling algorithms. In particular, the NLTK library includes many examples. You can explore them using the `nltk.download()` tool.

    import nltk
    nltk.download()
    Mycorpus = nltk.corpus.gutenberg
    text_name = Mycorpus.fileids()[0]
    raw = Mycorpus.raw(text_name)
    Words = Mycorpus.words(text_name)

In this notebook, we will explore and analyze collections of Wikipedia articles from a given category. Thus, we will not use the corpora available in `nltk`, but the `wikitools` library, which makes it easy to capture content from Wikimedia sites.

We will explore the category ...

In [2]:
site = wiki.Wiki("https://en.wikipedia.org/w/api.php")
cat = "Pseudoscience"
print cat

Pseudoscience


... though you can try any other category. Take into account that the behavior of topic modelling algorithms may depend on the number of documents available for the analysis.

We start by downloading the text collection.

In [3]:
# Loading category data. This may take a while
print "Loading category data. This may take a while..."
cat_data = category.Category(site, cat)

corpus_titles = []
corpus_text = []
n = 0
print "Loading article ...",
for page in cat_data.getAllMembersGen():
    n += 1
    print n,
    corpus_titles.append(page.title)
    corpus_text.append(page.getWikiText())

n_art = len(corpus_titles)
print "\nLoaded " + str(n_art) + " articles from category " + cat

Loading category data. This may take a while...
Loading article ... 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 

Now, we have stored the whole text collection in two lists:

* `corpus_titles`, which contains the titles of the selected articles
* `corpus_text`, with the text content of the selected Wikipedia articles

## 2. Corpus Processing

Topic modelling algorithms process vectorized data. In order to apply them, we need to transform the raw input text into a vector representation. To do so, we process the text to remove irrelevant information while preserving as much as possible of the information that captures the semantic content of the document collection.

Thus, we will proceed with the following steps:

1. Tokenization
2. Homogenization
3. Cleaning
4. Vectorization

### 2.1. Tokenization

In order to use the `word_tokenize` method from `nltk`, you might need to download the appropriate resources using

    nltk.download()
    # select option "d) Download", and identifier "punkt"

In [13]:
# You can comment this out if the package is already available.
# select option "d) Download", and identifier "punkt"
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q


True

In [75]:
corpus_tokens = []
n = 0

for art in corpus_text: 
    n += 1
    print "\rTokenizing article {0} out of {1}".format(n, n_art),

    # Decode the UTF-8 byte string to unicode so that non-ASCII characters are handled correctly
    art = art.decode('utf-8')    

    # Tokenize each text entry
    tokens = word_tokenize(art)

    corpus_tokens.append(tokens)
    
print "\n The corpus has been tokenized. Let's check some portion of the first article:"
print corpus_tokens[0][0:30]

Tokenizing article 337 out of 337 
 The corpus has been tokenized. Let's check some portion of the first article:
[u'{', u'{', u'npov|date=March', u'2016', u'}', u'}', u'{', u'{', u'synthesis|date=March', u'2016', u'}', u'}', u"'", u"''", u'Cryptozoology', u"''", u"'", u'is', u'a', u'[', u'[', u'pseudoscience', u']', u']', u'involving', u'the', u'search', u'for', u'creatures', u'whose']


### 2.2. Homogenization

By looking at the tokenized corpus you may verify that there are many tokens that correspond to punctuation signs and other symbols that are not relevant for analyzing the semantic content. They can be filtered out with a simple alphanumeric test, and the remaining tokens can be normalized using the stemming tool from `nltk`.

The homogenization process will consist of:

1. Removing capitalization: capital alphabetic characters will be transformed to their corresponding lowercase characters.
2. Removing non-alphanumeric tokens (e.g. punctuation signs)
3. Stemming: removing word terminations to preserve the root of the words and ignore grammatical information.

In [78]:
# Select stemmer.
s = nltk.stem.SnowballStemmer('english')

corpus_stemmed = []
n = 0
for tokens in corpus_tokens:
    n +=1
    print "\rStemming article {0} out of {1}".format(n, n_art),

    # Set to lowercase, remove non-alphanumeric tokens and stem.
    clean_tokens = [s.stem(token.lower()) for token in tokens if token.isalnum()]
    corpus_stemmed.append(clean_tokens)

print "\nLet's check the first tokens after stemming:"
print corpus_stemmed[0][0:30]

Stemming article 337 out of 337 
Let's check the first tokens after stemming:
[u'2016', u'2016', u'cryptozoolog', u'is', u'a', u'pseudosci', u'involv', u'the', u'search', u'for', u'creatur', u'whose', u'exist', u'has', u'not', u'been', u'proven', u'due', u'to', u'lack', u'of', u'evid', u'the', u'anim', u'cryptozoologist', u'studi', u'are', u'refer', u'to', u'as']

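The effect of the `isalnum()` filter can be checked in isolation. The following is a minimal sketch in plain Python (no `nltk` needed) applied to some of the raw tokens printed in Section 2.1. Stemming is left out, and the names `raw_tokens` and `homogenized` are only illustrative.

```python
# Some of the raw tokens printed after tokenization in Section 2.1.
raw_tokens = ['{', '{', 'npov|date=March', '2016', '}', '}',
              "''", 'Cryptozoology', "''", 'is', 'a',
              '[', '[', 'pseudoscience', ']', ']']

# Keep purely alphanumeric tokens and lowercase them (no stemming here).
homogenized = [tok.lower() for tok in raw_tokens if tok.isalnum()]

print(homogenized)
```

Wiki markup such as `{`, `[` and template arguments like `npov|date=March` is dropped because it contains non-alphanumeric characters, while the bare number `2016` survives, which is why it shows up among the stemmed tokens.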

### 2.3. Cleaning

The third step consists of removing those words that are very common in the language and do not carry useful semantic content (articles, pronouns, etc.).

Once again, we might need to load the stopword files using the download tools from `nltk`.

In [27]:
# You can comment this out if the package is already available.
# select option "d) Download", and identifier "stopwords"
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q


True




In [79]:
corpus_clean = []

# Load the stopword list once, as a set, instead of once per token.
eng_stopwords = set(stopwords.words('english'))

n = 0
for tokens in corpus_stemmed:
    n += 1
    print "\rRemoving stopwords from article {0} out of {1}".format(n, n_art),

    clean_tokens = [token for token in tokens if token not in eng_stopwords]
    corpus_clean.append(clean_tokens)
    # corpus_clean.append(' '.join(clean_tokens)) (not used)
    
print "\n Let's check tokens after cleaning:"
print corpus_clean[0][0:30]

Removing stopwords from article 337 out of 337 
 Let's check tokens after cleaning:
[u'2016', u'2016', u'cryptozoolog', u'pseudosci', u'involv', u'search', u'creatur', u'whose', u'exist', u'proven', u'due', u'lack', u'evid', u'anim', u'cryptozoologist', u'studi', u'refer', u'cryptid', u'cryptozoologist', u'includ', u'live', u'exampl', u'creatur', u'otherwis', u'consid', u'extinct', u'live', u'dinosaur', u'cryptozoolog', u'dinosaur']

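Stopword removal boils down to a membership test against a fixed word list. The sketch below uses a tiny hand-written stopword set (an assumption for illustration, not the `nltk` list) on the first stemmed tokens printed in Section 2.2. Storing the list in a `set` keeps each lookup cheap.

```python
# Hypothetical mini stopword list; the real one comes from stopwords.words('english').
toy_stopwords = {'is', 'a', 'the', 'for', 'has', 'not', 'been', 'to', 'of'}

# First stemmed tokens of the first article, as printed in Section 2.2
# (the two leading '2016' tokens are omitted).
stemmed = ['cryptozoolog', 'is', 'a', 'pseudosci', 'involv', 'the',
           'search', 'for', 'creatur', 'whose', 'exist', 'has', 'not',
           'been', 'proven']

# Keep only the tokens that are not in the stopword set.
cleaned = [tok for tok in stemmed if tok not in toy_stopwords]
print(cleaned)
```

On these tokens the toy list leaves the same words as the `nltk` run shown above, but that is only because the toy list was chosen to cover them.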

### 2.4. Vectorization

Up to this point, we have transformed the raw text collection into a list of articles, where each article is a list of the word roots that are most relevant for semantic analysis. Now, we need to convert this text data into a numerical representation. To do so, we will start using the tools provided by the `gensim` library.



In [80]:
# Create dictionary of tokens
D = corpora.Dictionary(corpus_clean)

# Transform token lists into sparse vectors on the D-space
bow = [D.doc2bow(doc) for doc in corpus_clean]

At this point, it is worth making sure we understand what has happened. In `corpus_clean` we had a list of token lists. From it, we have constructed a Dictionary, `D`, which assigns an integer identifier to each token in the corpus:

In [81]:
print "First tokens in the dictionary: "
for n in range(10):
    print str(n) + ": " + D[n]

First tokens in the dictionary: 
0: lack
1: focus
2: slick
3: four
4: stumbl
5: francesco
6: follow
7: whose
8: privat
9: tv


After that, we have transformed each article (in `corpus_clean`) into a sparse vector. These sparse vectors will be the inputs to the topic modeling algorithms.

In [82]:
print "Original article (after cleaning): "
print corpus_clean[0][0:30]
print "Sparse vector representation (first 30 components):"
print bow[0][0:30]
print "The first component, " +  str(bow[0][0]) + ", states that in article number 0,",
print "token 0 (" + D[0] + ") appears " + str(bow[0][0][1]) + " times" 


Original article (after cleaning): 
[u'2016', u'2016', u'cryptozoolog', u'pseudosci', u'involv', u'search', u'creatur', u'whose', u'exist', u'proven', u'due', u'lack', u'evid', u'anim', u'cryptozoologist', u'studi', u'refer', u'cryptid', u'cryptozoologist', u'includ', u'live', u'exampl', u'creatur', u'otherwis', u'consid', u'extinct', u'live', u'dinosaur', u'cryptozoolog', u'dinosaur']
Sparse vector representation (first 30 components):
[(0, 3), (1, 2), (2, 1), (3, 1), (4, 1), (5, 5), (6, 2), (7, 3), (8, 1), (9, 1), (10, 1), (11, 1), (12, 2), (13, 2), (14, 2), (15, 1), (16, 1), (17, 1), (18, 5), (19, 1), (20, 1), (21, 1), (22, 1), (23, 2), (24, 2), (25, 1), (26, 1), (27, 1), (28, 1), (29, 2)]
The first component, (0, 3), states that in article number 0, token 0 (lack) appears 3 times

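To make the dictionary and bag-of-words steps concrete, here is a rough pure-Python sketch of the same idea. It is not how `gensim` is implemented, and the integer ids it assigns will generally differ from those in `D`. The helper names `build_dictionary` and `to_bow` are made up for this example.

```python
from collections import Counter


def build_dictionary(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            if tok not in token2id:
                token2id[tok] = len(token2id)
    return token2id


def to_bow(doc, token2id):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


# Two toy "articles", already tokenized, stemmed and cleaned.
docs = [['live', 'dinosaur', 'live', 'creatur'],
        ['creatur', 'extinct']]

token2id = build_dictionary(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bows)
```

Each pair `(i, c)` reads as "token `i` appears `c` times in this document". Tokens that do not occur in a document are simply absent from its list, which is what makes the representation sparse.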

## 3. Semantic Analysis

In [None]:
# Additional imports required by the code in this section.
# (nltk, word_tokenize and stopwords were already imported at the top.)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np
import urllib
import codecs
import wikipedia
import pickle
from scipy.sparse import vstack
from sklearn import preprocessing
from sklearn.ensemble import ExtraTreesClassifier, BaggingClassifier
from sklearn import svm
import sys

"""
Function to check whether all the characters of a given word are ASCII.
Taken from:

http://stackoverflow.com/questions/196345/how-to-check-if-a-string-in-python-is-in-ascii
"""
def is_ascii(s):
    return all(ord(c) < 128 for c in s)

"""
Function to print the topics of an LDA model. Taken from:

http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html
"""

def print_top_words(model, feature_names, n_top_words):
    print "\n"
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        # Indices of the n_top_words largest weights, in decreasing order.
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    # A bare print statement emits an empty line in Python 2.
    print


# Class that sets a Mozilla user agent for HTTP requests.
class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11"

"""
Function to automatically download all the articles of a given
category from Wikipedia. References:

http://stackoverflow.com/questions/7958191/download-articles-from-wikipedia-using-special-export
https://wikipedia.readthedocs.org/en/latest/code.html#api
"""
   
def descargaArticulos(nombreCategoria):
    # Create a file in which to dump the XML.
    f =  codecs.open('workfile2.xml', 'w',"utf-8" )
    
    # Use Mozilla as the client for the request.
    urllib._urlopener = AppURLopener()
    # Build the HTTP request that looks up the articles in the category.
    query = "http://en.wikipedia.org/w/index.php?title=Special:Export&action=submit"
    data = { 'catname':nombreCategoria,'addcat':'', 'wpDownload':1 }
    data = urllib.urlencode(data)
    # Send the request and store the result.
    f = urllib.urlopen(query, data)
    s = f.read()
    
    # Parse the downloaded XML.
    """
    The article titles appear after the string name="pages". The document
    is split at that string. The titles end at the string </textarea>, so
    the text is split again there to isolate the block of titles. Finally,
    the block is split at \n to obtain the individual titles.
    """
    div1 = s.split('name="pages">')
    div2 = div1[1].split("</textarea>")[0]
    titulos = div2.split("\n")

    articulos = []
    # Download the articles through the Wikipedia API and store them.
    for titulo in titulos:
        if not titulo.startswith("Category:"):
            print titulo
            try:
                articulos.append(wikipedia.page(titulo).content)
            # Skip pages that cannot be downloaded so that the loop does not fail.
            except (wikipedia.exceptions.DisambiguationError, wikipedia.exceptions.PageError):
                pass

    return articulos

"""
MAIN PROGRAM
"""

# Ask the user whether to download the categories automatically or to use
# the default ones (American Social Sciences Writers and Fictional Families).
desAuto = raw_input("Enter S to download the categories or N to use the default ones: ")
while (desAuto != "S" and desAuto != "N"):
    desAuto = raw_input("\nPlease enter a valid option [S/N]:  ")
if(desAuto == "S"): # Category download requested.
    # Ask the user for the category names and download them.
    cat1 = raw_input('Enter the name of the first category: ')
    cat2 = raw_input('\nEnter the name of the second category: ')
    print "Downloading category: " + cat1 + "\n"
    print "Trying to download the following articles: \n"
    articulos1 = descargaArticulos(cat1)
    print "Downloading category: " + cat2 + "\n"
    print "Trying to download the following articles: \n"
    articulos2 = descargaArticulos(cat2)
    print "\nArticles downloaded from category 1: " + str(len(articulos1))
    print "\nArticles downloaded from category 2: " + str(len(articulos2))  
else: # Use the default categories.
    print("\nUSING DEFAULT CATEGORIES") 
    sys.stdout.flush()
    f = open('articulosAmericanSocial.txt', 'rb')
    articulos1 = pickle.load(f)
    f = open('articulosFictionalFamilies.txt', 'rb')
    articulos2 = pickle.load(f)

"""
ARTICLE PREPROCESSING
"""

print("\nPREPROCESSING THE ARTICLES")
sys.stdout.flush()
s = nltk.stem.SnowballStemmer('english')
eng_stopwords = stopwords.words('english')

# Processing of the articles in the first category.
textos1 = list()

for articulo in articulos1:
    # Split the article into tokens.
    tokens = word_tokenize(articulo)
    # Keep the stem of the tokens that are alphanumeric.
    tokens = [s.stem(token.lower()) for token in tokens if token.isalnum()]
    # Remove the tokens that are stopwords.
    tokens = [token for token in tokens if token not in eng_stopwords]
    # Remove the tokens that are not ASCII or that are "http" or "ref".
    tokens = [token for token in tokens if (is_ascii(token) and token!="http" and token!="ref")]
    textos1.append(' '.join(tokens))

# Processing of the articles in the second category. Same as for the first,
# except that the words of the category names are also filtered out.
textos2 = list()

for articulo in articulos2:
    tokens = word_tokenize(articulo)
    tokens = [s.stem(token.lower()) for token in tokens if token.isalnum()]
    tokens = [token for token in tokens if (token not in eng_stopwords  and token!="american" and token!="social" and token!="sciences" and token!="writers" and token!="fictional" and token!="families")]
    tokens = [token for token in tokens if (is_ascii(token) and token!="http" and token!="ref")]
    textos2.append(' '.join(tokens))

"""
Convert the texts into a matrix in which each word is a column whose
value is the number of times that word appears in each document.
"""

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=1000,stop_words='english')
tf = tf_vectorizer.fit_transform(textos1+textos2)

# Split the texts into training and test sets.
# Fraction of the texts used for training.
division = 0.7

# Number of training documents per category (slice indices must be integers).
n_train1 = int(round(len(textos1)*division))
n_train2 = int(round(len(textos2)*division))

tftrain1 = tf[0:n_train1]
tftrain2 = tf[len(textos1):(len(textos1)+n_train2)]
tftrain = vstack((tftrain1,tftrain2))
tftest1 = tf[n_train1:len(textos1)]
tftest2 = tf[(len(textos1)+n_train2):len(textos1+textos2)]
tftest = vstack((tftest1,tftest2))

"""
LDA ALGORITHM: Topic extraction.
"""

print("\nRUNNING LDA AND CLASSIFICATION")
sys.stdout.flush()
# Define the LDA model, which is fitted on the training set.
ldaBagging = LatentDirichletAllocation(n_topics=21, max_iter=3, learning_method='online', learning_offset=50., random_state=0)
ldaBagging.fit(tftrain)
# Transform the training and test data into their topic representations.
X_trainBagging = ldaBagging.transform(tftrain)
X_testBagging = ldaBagging.transform(tftest)

# Same process for the other classifiers:
ldaExtraTrees = LatentDirichletAllocation(n_topics=52, max_iter=9, learning_method='online', learning_offset=50., random_state=0)
ldaExtraTrees.fit(tftrain)
X_trainExtraTrees = ldaExtraTrees.transform(tftrain)
X_testExtraTrees = ldaExtraTrees.transform(tftest)

ldaSVC = LatentDirichletAllocation(n_topics=6, max_iter=8, learning_method='online', learning_offset=50., random_state=0)
ldaSVC.fit(tftrain)
X_trainSVC = ldaSVC.transform(tftrain)
X_testSVC = ldaSVC.transform(tftest)

# Build the labels, assigning 1 to the articles of category 1 and -1 to those of category 2.
Y_train1 = [1]*tftrain1.shape[0]
Y_train2 = [-1]*tftrain2.shape[0]
Y_test1 = [1]*tftest1.shape[0]
Y_test2 = [-1]*tftest2.shape[0]

Y_train = np.concatenate((Y_train1,Y_train2),axis=0)
Y_test = np.concatenate((Y_test1,Y_test2),axis=0)


"""
CLASSIFICATION OF THE TEST SET
"""

# ExtraTrees classifier:
# Standardize X_train and X_test.
scaler = preprocessing.StandardScaler().fit(X_trainExtraTrees)
X_trainExtraTrees = scaler.transform(X_trainExtraTrees)
X_testExtraTrees = scaler.transform(X_testExtraTrees)
extraTrees = ExtraTreesClassifier(n_estimators=600,max_features='log2',random_state=8)
extraTrees.fit(X_trainExtraTrees,Y_train.ravel())
salida1 = extraTrees.predict(X_testExtraTrees)
acierto1 = extraTrees.score(X_testExtraTrees,Y_test)

# Bagging classifier:
# Standardize X_train and X_test.
scaler = preprocessing.StandardScaler().fit(X_trainBagging)
X_trainBagging = scaler.transform(X_trainBagging)
X_testBagging = scaler.transform(X_testBagging)
bagging = BaggingClassifier(n_estimators=600,random_state=8)
bagging.fit(X_trainBagging,Y_train.ravel())
salida2 = bagging.predict(X_testBagging)
acierto2 = bagging.score(X_testBagging,Y_test)

# SVC classifier:
scaler = preprocessing.StandardScaler().fit(X_trainSVC)
X_trainSVC = scaler.transform(X_trainSVC)
X_testSVC = scaler.transform(X_testSVC)
clasSVC = svm.SVC(C=41)
clasSVC.fit(X_trainSVC,Y_train.ravel())
salida3 = clasSVC.predict(X_testSVC)  
acierto3 = clasSVC.score(X_testSVC,Y_test)

# Majority vote (mode) of the 3 classifiers:
moda=np.zeros(len(X_testSVC))
ErrorModa=0
for i in range(len(X_testSVC)):
    if(salida1[i]+salida2[i]+salida3[i]>0):
        moda[i]=1
    else:
        moda[i]=-1
    if(moda[i]!=Y_test[i]):
        ErrorModa=ErrorModa+1   
        
aciertoModa=(len(Y_test)-ErrorModa)*1.0/len(Y_test)

# AREA UNDER THE CURVE
from sklearn.metrics import roc_curve, auc
false_positive1, true_positive1, thresholds1 = roc_curve(Y_test, salida1)
auc1=auc(false_positive1,true_positive1)
false_positive2, true_positive2, thresholds2 = roc_curve(Y_test, salida2)
auc2=auc(false_positive2,true_positive2)
false_positive3, true_positive3, thresholds3 = roc_curve(Y_test, salida3)
auc3=auc(false_positive3,true_positive3)
false_positivemoda, true_positivemoda, thresholdsmoda = roc_curve(Y_test, moda)
aucmoda=auc(false_positivemoda,true_positivemoda)

print "\nThe error of the ExtraTrees classifier is " + str(1-acierto1) + ", with an area under the curve of " + str(auc1)
print "\nThe error of the Bagging classifier is " + str(1-acierto2) + ", with an area under the curve of " + str(auc2)
print "\nThe error of the SVC classifier is " + str(1-acierto3) + ", with an area under the curve of " + str(auc3)
print "\nThe error of the majority-vote classifier is " + str(1-aciertoModa) + ", with an area under the curve of " + str(aucmoda)


# Print the topics of the 3 models:
print("\n\n TOPICS FOR EXTRATREES\n\n")
print_top_words(ldaExtraTrees,tf_vectorizer.get_feature_names(),5)
print("\n\n TOPICS FOR BAGGING\n\n")
print_top_words(ldaBagging,tf_vectorizer.get_feature_names(),5)
print("\n\n TOPICS FOR SVC\n\n")
print_top_words(ldaSVC,tf_vectorizer.get_feature_names(),8)
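
The majority vote over the three classifiers can be illustrated on its own. The following is a small sketch with made-up prediction lists (not results from the models above), assuming labels in {1, -1} as in `Y_train`. The helper name `majority_vote` is hypothetical.

```python
def majority_vote(*predictions):
    """Combine an odd number of {1, -1} prediction lists by majority."""
    return [1 if sum(votes) > 0 else -1 for votes in zip(*predictions)]


# Made-up outputs of three classifiers on four test samples.
pred_a = [1, -1, 1, -1]
pred_b = [1, 1, -1, -1]
pred_c = [-1, 1, 1, -1]
y_true = [1, 1, -1, -1]

voted = majority_vote(pred_a, pred_b, pred_c)

# Fraction of samples on which the combined prediction matches the label.
accuracy = sum(v == y for v, y in zip(voted, y_true)) / float(len(y_true))
print(voted, accuracy)
```

With an odd number of voters and labels in {1, -1}, the sum of the votes is never zero, so the sign of the sum is always the majority label.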