# Natural Language Processing Exercise
## NLP - www.AprendeMachineLearning.com

## We will analyze the short stories of the writer Hernán Casciari

### His content is in Spanish and freely available (you can also buy his books).

We will download the texts from his blog, which holds humorous stories written from 2004 to 2015.

We will analyze his work to understand what he writes about and how it evolved over time.

You can visit his blog and read his stories at hernancasciari.com

# Our Agenda
<ul><li>1 - Get the data</li>
    <li>2 - Load the data</li>
    <li>3 - Clean the data</li>
    <li>4 - Exploratory Analysis</li>
    <li>5 - Sentiment Analysis</li>
    <li>6 - Topic Modeling</li></ul>

In [None]:
# imports
import requests
from bs4 import BeautifulSoup
import pickle
from time import sleep

# 1 - Get the Texts

In [None]:
def url_to_transcript(url):
    '''Get the story links from one archive page of Hernán Casciari's blog.'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    print('URL', url)
    enlaces = []
    # Each story title is an element with class "entry-title" that wraps the link
    for title in soup.find_all(class_="entry-title"):
        for a in title.find_all('a', href=True):
            print("Found link:", a['href'])
            enlaces.append(a['href'])
    sleep(0.75)  # pause so that a firewall does not penalize us
    return enlaces
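
To see what the link extraction does without touching the network, here is a minimal offline sketch. It swaps BeautifulSoup for the standard library's `html.parser`, and the HTML snippet, the `EntryTitleLinks` class, and the example.com URLs are illustrative assumptions rather than the blog's real markup.

```python
from html.parser import HTMLParser


class EntryTitleLinks(HTMLParser):
    """Hypothetical helper: collect hrefs of <a> tags nested inside
    any element whose class list contains 'entry-title'."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while we are inside an entry-title element
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            self.depth += 1
            if tag == "a" and attrs.get("href"):
                self.links.append(attrs["href"])
        elif "entry-title" in (attrs.get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1


# Assumed sample markup (no void tags such as <br> inside the titles)
sample_html = """
<h2 class="entry-title"><a href="https://example.com/cuento-1/">Cuento 1</a></h2>
<p><a href="https://example.com/otro/">Not a title</a></p>
<h2 class="entry-title"><a href="https://example.com/cuento-2/">Cuento 2</a></h2>
"""

parser = EntryTitleLinks()
parser.feed(sample_html)
print(parser.links)
```

Only the two links wrapped by an `entry-title` element are kept; the link inside the plain paragraph is ignored, which is the same filtering the scraping function relies on.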

In [None]:
base = 'https://editorialorsai.com/category/epocas/'
urls = []
anios = ['2004','2005','2006','2007','2008','2009','2010','2011','2012','2013','2014','2015']
for anio in anios:
    urls.append(base + anio + "/")
urls

In [None]:
# Loop over the URLs and collect the links
enlaces = [url_to_transcript(u) for u in urls]
print(enlaces)

In [None]:
def url_get_text(url):
    '''Get the paragraphs of one story by Hernán Casciari.'''
    print('URL', url)
    try:
        page = requests.get(url).text
        soup = BeautifulSoup(page, "lxml")
        text = [p.text for p in soup.find(class_="entry-content").find_all('p')]
    except Exception:
        print('ERROR: a firewall may be blocking us.')
        return []  # empty list, the same type as the success case
    sleep(0.75)  # pause so that a firewall does not penalize us
    return text

In [None]:
# Loop over the URLs and download the texts
MAX_POR_ANIO = 50  # cap per year, to avoid overloading the server
textos=[]
for i in range(len(anios)):
    arts = enlaces[i]
    arts = arts[0:MAX_POR_ANIO]
    textos.append([url_get_text(u) for u in arts])
print(len(textos))

In [None]:
# Take a look at some of the texts
print(len(textos[0]))
print(textos[0])

In [None]:
# Pickle the texts for later use
import os

# Create a directory and name the files by year
os.makedirs("blog", exist_ok=True)

for i, c in enumerate(anios):
    with open("blog/" + c + ".txt", "wb") as file:
        # Concatenate every paragraph of every story of the year into one string
        cad = "".join(texto0 for texto in textos[i] for texto0 in texto)
        pickle.dump(cad, file)

# 2 - Load the Data

In [None]:
import pickle

# Load the pickled files
anios = ['2004','2005','2006','2007','2008','2009','2010','2011','2012','2013','2014','2015']
data = {}
for i, c in enumerate(anios):
    with open("blog/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)
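
The yearly files carry a `.txt` extension but hold pickled data, not plain text. A minimal round-trip sketch, assuming a temporary directory and a made-up sample string, shows the difference:

```python
import os
import pickle
import tempfile

text = "Había una vez un cuento. "  # hypothetical sample text

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "2004.txt")
    with open(path, "wb") as f:
        pickle.dump(text, f)          # write the string as a pickle
    with open(path, "rb") as f:
        raw_bytes = f.read()          # what is really stored on disk
    with open(path, "rb") as f:
        restored = pickle.load(f)     # the correct way to read it back

print(restored == text)
print(raw_bytes[:2])
```

The restored string matches the original, while the raw bytes start with a pickle header, so opening these files in a text editor or a plain-text reader will show extra binary characters.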

In [None]:
# Check that it was saved correctly
data.keys()

In [None]:
# Look at a slice of text
data['2008'][1000:1222]

In [None]:
# Check the first key
next(iter(data.keys()))

In [None]:
# Our dictionary has the form key: year, value: text
next(iter(data.values()))

In [None]:
# Combine it: wrap each text in a list so it can become a dataframe row
data_combined = {key: [value] for (key, value) in data.items()}

In [None]:
# Put it into a pandas dataframe
import pandas as pd
pd.set_option('max_colwidth',150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()
data_df

In [None]:
# Look at one of the contents
data_df.transcript.loc['2007']

# 3 - Clean the Data

In [None]:
# Apply several rounds of cleaning
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', ' ', text)
    # string.punctuation is ASCII-only, so the Spanish opening marks are added explicitly
    text = re.sub('[%s]' % re.escape(string.punctuation + '¿¡'), ' ', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = lambda x: clean_text_round1(x)
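
As a quick check of what the first cleaning round is meant to do, here is a small sketch on a made-up sentence. `clean_demo` and the sample text are hypothetical; the function mirrors the four steps named in the docstring (lowercase, bracketed text, punctuation, tokens with digits) using only ASCII punctuation.

```python
import re
import string


def clean_demo(text):
    """Hypothetical helper mirroring the four cleaning steps."""
    text = text.lower()                                               # 1. lowercase
    text = re.sub(r'\[.*?\]', ' ', text)                              # 2. drop [bracketed] text
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)  # 3. ASCII punctuation
    text = re.sub(r'\w*\d\w*', '', text)                              # 4. tokens containing digits
    return text


sample = "Hola [risas] Mundo, en 2004 tuve 3 gatos!"
cleaned = clean_demo(sample)
print(cleaned.split())
```

Note that `string.punctuation` covers ASCII characters only, so Spanish marks such as `¿` and `¡` are not part of it and need separate handling.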

In [None]:
# Look at the result of the first cleaning round
data_clean = pd.DataFrame(data_df.transcript.apply(round1))
data_clean

In [None]:
# Second round
def clean_text_round2(text):
    '''Get rid of some additional punctuation and non-sensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…«»]', '', text)
    text = re.sub('\n', ' ', text)
    return text

round2 = lambda x: clean_text_round2(x)

In [None]:
# See how it looks
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean

In [None]:
# Let's take a look at our dataframe
#data_df

In [None]:
# Since we do not have a Spanish lemmatizer at hand, we convert a few plurals manually
# NOTE: in practice this is never done by hand!

def detectadas(palabra):
    eliminar_s = ('libreros','textos','papelitos','monedas','páginas','anécdotas','perros','cuadernos','blogs',
                  'revistas','caballos','vecinos','madres','puntos','ricos','libros')
    if palabra in eliminar_s :
        return palabra[:-1]
    eliminar_es = ('mundiales','lectores','campeones','maníes','ustedes','autores')
    if palabra in eliminar_es:
        return palabra[:-2]
    return palabra

def clean_text_round3(text):
    '''Replace each word found in the manual plural lists with its singular form.'''
    return " ".join([detectadas(word) for word in text.split()])

round3 = lambda x: clean_text_round3(x)

In [None]:
# See how it looks
data_clean = pd.DataFrame(data_clean.transcript.apply(round3))
data_clean

In [None]:
# This is a new field in case we want to attach extra information to each year
# In our case we simply repeat the years; it will be useful for some visualizations
full_names = ['2004','2005','2006','2007','2008','2009','2010','2011','2012','2013','2014','2015']

data_df['full_name'] = full_names
data_df

In [None]:
# Pickle it for later use
data_df.to_pickle("corpus.pkl")

In [None]:
data_clean.transcript[0:255]

In [None]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common Spanish stop words
from sklearn.feature_extraction.text import CountVectorizer

with open('spanish.txt') as f:
    lines = f.read().splitlines()

cv = CountVectorizer(stop_words=lines)
data_cv = cv.fit_transform(data_clean.transcript)
# get_feature_names_out() requires scikit-learn >= 1.0
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_clean.index
data_dtm
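
To make the document-term matrix concrete, here is a tiny standard-library sketch that builds one by hand. The two sample documents, the stop-word set, and the helper name `count_terms` are assumptions for illustration; the token pattern matches CountVectorizer's default of keeping tokens with two or more word characters.

```python
import re
from collections import Counter

# Hypothetical mini corpus: one text per year
docs = {
    "2004": "el perro come y el perro duerme",
    "2005": "la casa del perro",
}
stop = {"el", "la", "y", "del"}  # assumed stop words


def count_terms(text):
    """Count tokens of 2+ word characters that are not stop words."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(t for t in tokens if t not in stop)


counts = {year: count_terms(text) for year, text in docs.items()}
vocab = sorted(set().union(*counts.values()))            # matrix columns
matrix = [[counts[year][w] for w in vocab] for year in docs]  # one row per year

print(vocab)
print(matrix)
```

Each row is a year and each column a vocabulary term, which is the same layout `data_dtm` has after the index is set to the years.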

In [None]:
# Save it as a pickle
data_dtm.to_pickle("dtm.pkl")

In [None]:
# Save these as pickles too
data_clean.to_pickle('data_clean.pkl')
with open("cv.pkl", "wb") as f:
    pickle.dump(cv, f)

# 4 - Exploratory Analysis

In [None]:
# Read in the document-term matrix
import pandas as pd

data = pd.read_pickle('dtm.pkl')
data = data.transpose()
data.head()

In [None]:
# Find the top 30 words (per Year)
top_dict = {}
for c in data.columns:
    top = data[c].sort_values(ascending=False).head(30)
    top_dict[c]= list(zip(top.index, top.values))

top_dict

In [None]:
# Print the top 15 words per year
for anio, top_words in top_dict.items():
    print(anio)
    print(', '.join([word for word, count in top_words[:15]]))
    print('---')

In [None]:
# Look at the most common top words --> add them to the stop word list
from collections import Counter

# Let's first pull out the top 30 words for each anio
words = []
for anio in data.columns:
    top = [word for (word, count) in top_dict[anio]]
    for t in top:
        words.append(t)
        
words

In [None]:
# Let's aggregate this list and identify the most common words along with how many years they occur in
Counter(words).most_common()

In [None]:
# Discard the words that appear in the top 30 of more than 6 years
add_stop_words = [word for word, count in Counter(words).most_common() if count > 6]
add_stop_words
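
The idea behind this filter is that a word ranking among the top words in most years says little about any single year. A minimal sketch with made-up top-word lists and an assumed threshold of 2 (the notebook uses "more than 6" out of 12 years):

```python
from collections import Counter

# Hypothetical top words per year
top_by_year = {
    "2004": ["casa", "vida", "perro"],
    "2005": ["casa", "mundo", "vida"],
    "2006": ["casa", "libro", "radio"],
}

# Count in how many yearly top lists each word appears
appearances = Counter(w for top in top_by_year.values() for w in top)

threshold = 2  # assumed cut-off for this toy example
extra_stop_words = [w for w, n in appearances.most_common() if n >= threshold]
print(extra_stop_words)
```

Words that clear the threshold are shared across years and become candidates for the stop-word list.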

In [None]:
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'blog'  # folder written in step 1; the author's local path was '/Users/jbagnato/python_projects/blog'
wordlists = PlaintextCorpusReader(corpus_root, '.*', encoding='latin-1')
#wordlists.fileids()
#pals = wordlists.words('2004.txt')

cfd = nltk.ConditionalFreqDist(
        (word,genre)
        for genre in anios
        for w in wordlists.words(genre + '.txt')
        for word in ['casa','mundo','tiempo','vida']
        if w.lower().startswith(word) )
cfd.plot()
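
`nltk.ConditionalFreqDist` takes (condition, sample) pairs; here the condition is a word prefix and the sample is the year. A rough standard-library equivalent of the counting step, using hypothetical token lists, looks like this:

```python
from collections import Counter, defaultdict

# Hypothetical tokens per year
tokens_by_year = {
    "2004": ["casa", "casas", "mundo", "vida"],
    "2005": ["vidas", "vida", "tiempo", "casero"],
}
prefixes = ["casa", "mundo", "tiempo", "vida"]

# counts[prefix][year] = number of tokens in that year starting with the prefix
counts = defaultdict(Counter)
for year, tokens in tokens_by_year.items():
    for w in tokens:
        for p in prefixes:
            if w.lower().startswith(p):
                counts[p][year] += 1

for p in prefixes:
    print(p, dict(counts[p]))
```

Matching by prefix folds simple plurals ("casas", "vidas") into the base word, but it does not catch related words that change the stem ("casero").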

In [None]:
# Let's update our document-term matrix with the new list of stop words
from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer

# Read in cleaned data
data_clean = pd.read_pickle('data_clean.pkl')

# Add new stop words
with open('spanish.txt') as f:
    stop_words = f.read().splitlines()
for pal in add_stop_words:
    stop_words.append(pal)
more_stop_words=['alex','andrés','asi','andres','así','año','alejandro','alfonso','allí','alguien',
                 'basdala','bernardo','bien',
                 'cosa','cosas','costoya','costa','cinco','celoni','cuatro','cómo','casi','colo','caprio','českomoravský','české','costa','canoso','carla','comequechu',
                 'dos','dice','decir','días','dije','digo','diez',
                 'ésa', 'ésas', 'ése', 'ésos', 'ésta', 'éstas', 'éste', 'ésto', 'éstos',
                 'fernando','fenwick',
                 'gelós','gente',
                 'hornby','hernan','hernán','hoy','horacio','horas','hará','hans','hacía','haber',
                 'iveta',
                 'jesús','jorge','juan',
                 'karen',
                 'lucas','luego', 'luis',
                 'mirta','mientras','menos','mónica','medio','mil','moncho','momento','mañana','mejor',
                 'narcís','número','noche','nadie',
                 'ojos',
                 'primer','primera','pase','pablo','pepe','pack','peter', 'pues','prieto','politto','pol','paola','puede','próximo','podrán','podía',
                 'quizá','quizás','quince','quién','quiero',
                 'rato',
                 'sólo','solamente','sakhan','šeredova','seis','šeredovà','seselovsky','solo','salas','sant','sino','se','sé','sabés','semana','soto','sido','solamente',
                 'tres','tan','todas','trece','toda','todavía','tarde','tener',
                 'uno','usted',
                 'veces','ver','ve','vos','va','voy',
                 'waiser','woung'
                ]
for pal in more_stop_words:
    stop_words.append(pal)

# Recreate document-term matrix
cv = CountVectorizer(stop_words=stop_words)
data_cv = cv.fit_transform(data_clean.transcript)
data_stop = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_stop.index = data_clean.index

# Pickle it for later use
import pickle
pickle.dump(cv, open("cv_stop.pkl", "wb"))
data_stop.to_pickle("dtm_stop.pkl")

In [None]:
# Let's make some word clouds!
# Terminal / Anaconda Prompt: conda install -c conda-forge wordcloud
from wordcloud import WordCloud

wc = WordCloud(stopwords=stop_words, background_color="white", colormap="Dark2",
               max_font_size=150, random_state=42)

In [None]:
# Reset the output dimensions
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [16,12]

anios = ['2004','2005','2006','2007','2008','2009','2010','2011','2012','2013','2014','2015']

# Create subplots for each anio
for index, anio in enumerate(data.columns):
    wc.generate(data_clean.transcript[anio])
    plt.subplot(4, 3, index+1)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(anios[index])
    
plt.show()

In [None]:
# Find the number of unique words per Year

# Identify the non-zero items in the document-term matrix, meaning that the word occurs at least once
unique_list = []
for anio in data.columns:
    uniques = int((data[anio] > 0).sum())  # Series.nonzero() no longer exists in pandas >= 1.0
    unique_list.append(uniques)

# Create a new dataframe that contains this unique word count
data_words = pd.DataFrame(list(zip(anios, unique_list)), columns=['Anio', 'unique_words'])
#data_unique_sort = data_words.sort_values(by='unique_words')
data_unique_sort = data_words  # unsorted
data_unique_sort

In [None]:
# Run this cell whether we did the web scraping or do not have the values in the variable
posts_per_year = []
try:
    enlaces
except NameError:
    # No scraping in this session, so use the hardcoded values
    posts_per_year = [50, 27, 18, 50, 42, 22, 50, 33, 31, 17, 33, 13]
else:
    for i in range(len(anios)):
        arts = enlaces[i]
        #arts = arts[0:10]  # limit to at most 10 per year
        print(anios[i], len(arts))
        posts_per_year.append(min(len(arts), MAX_POR_ANIO))

In [None]:
# Calculate the words per post of each Year

# Find the total number of words per Year
total_list = []
for anio in data.columns:
    totals = sum(data[anio])
    total_list.append(totals)
    
# Let's add some columns to our dataframe
data_words['total_words'] = total_list
data_words['posts_per_year'] = posts_per_year
data_words['words_per_posts'] = data_words['total_words'] / data_words['posts_per_year']

# Optionally sort the dataframe by words per post; left unsorted here to keep the years in order
#data_wpm_sort = data_words.sort_values(by='words_per_posts')
data_wpm_sort = data_words  # unsorted
data_wpm_sort
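
The three yearly statistics are simple arithmetic over the term counts. A minimal sketch with invented counts and post totals (all values are assumptions) makes the definitions explicit:

```python
# Hypothetical term counts per year and number of posts per year
counts = {
    "2004": {"casa": 3, "perro": 0, "vida": 2},
    "2005": {"casa": 0, "perro": 4, "vida": 4},
}
posts = {"2004": 5, "2005": 2}

stats = {}
for year, row in counts.items():
    unique_words = sum(1 for c in row.values() if c > 0)  # terms used at least once
    total_words = sum(row.values())                       # all counted occurrences
    stats[year] = {
        "unique_words": unique_words,
        "total_words": total_words,
        "words_per_post": total_words / posts[year],
    }

print(stats)
```

Because stop words are removed before counting, `total_words` and `words_per_post` describe counted terms, not the raw length of the posts.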

In [None]:
# Let's plot our findings
import numpy as np
plt.rcParams['figure.figsize'] = [16, 6]

y_pos = np.arange(len(data_words))

plt.subplot(1, 3, 1)
plt.barh(y_pos,posts_per_year, align='center')
plt.yticks(y_pos, anios)
plt.title('Number of Posts', fontsize=20)


plt.subplot(1, 3, 2)
plt.barh(y_pos, data_unique_sort.unique_words, align='center')
plt.yticks(y_pos, data_unique_sort.Anio)
plt.title('Number of Unique Words', fontsize=20)

plt.subplot(1, 3, 3)
plt.barh(y_pos, data_wpm_sort.words_per_posts, align='center')
plt.yticks(y_pos, data_wpm_sort.Anio)
plt.title('Number of Words Per Posts', fontsize=20)

plt.tight_layout()
plt.show()

# 5 - Sentiment Analysis

In [None]:
# Read the corpus, which still preserves word order
import pandas as pd

data = pd.read_pickle('corpus.pkl')
data

In [None]:
# Create quick lambda functions to find the polarity and subjectivity of each year's text
# Terminal / Anaconda Navigator: conda install -c conda-forge textblob
from textblob import TextBlob
    
pol = lambda x: TextBlob(x).sentiment.polarity
pol2 = lambda x: x.sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity
sub2 = lambda x: x.sentiment.subjectivity

# We translate to English first, because TextBlob's default sentiment analysis does not work for Spanish :(
# Note: translate() calls an online service and was removed in recent TextBlob releases (0.18 and later),
# so this cell needs an older version
traducir = lambda x: TextBlob(x).translate(to="en")

data['blob_en'] = data['transcript'].apply(traducir)
data['polarity'] = data['blob_en'].apply(pol2)
data['subjectivity'] = data['blob_en'].apply(sub2)
data
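
TextBlob reports polarity on a scale from -1 (negative) to 1 (positive) and subjectivity from 0 (factual) to 1 (opinionated). To give an intuition for a polarity score, here is a deliberately simplified toy scorer. It is not TextBlob's algorithm; `TOY_LEXICON`, its values, and `toy_polarity` are invented for illustration.

```python
# Hypothetical word scores in the range [-1, 1]
TOY_LEXICON = {"great": 1.0, "good": 0.5, "bad": -0.5, "awful": -1.0}


def toy_polarity(text):
    """Average the scores of the words found in the toy lexicon (0.0 if none)."""
    scores = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


print(toy_polarity("A great day and a bad night"))
print(toy_polarity("Nothing to score here"))
```

Averaging explains why long texts with a mix of positive and negative words, such as a whole year of stories, tend to land close to zero.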

In [None]:
# Let's plot the results
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 8]

for index, anio in enumerate(data.index):
    x = data.polarity.loc[anio]
    y = data.subjectivity.loc[anio]
    plt.scatter(x, y, color='blue')
    plt.text(x+.001, y+.001, data['full_name'].iloc[index], fontsize=10)

plt.xlim(-0.051, 0.152)

plt.title('Sentiment Analysis', fontsize=20)
plt.xlabel('<-- Negative -------- Positive -->', fontsize=15)
plt.ylabel('<-- Facts -------- Opinions -->', fontsize=15)

plt.show()

## Sentiment Over Time

In [None]:
# Split each year's text into 12 parts
import numpy as np
import math

def split_text(text, n=12):
    '''Takes in a string of text and splits into n equal parts, with a default of 12 equal parts.'''

    # Calculate length of text, the size of each chunk of text and the starting points of each chunk of text
    length = len(text)
    size = math.floor(length / n)
    start = np.arange(0, length, size)
    
    # Pull out equally sized pieces of text and put it into a list
    split_list = []
    for piece in range(n):
        split_list.append(text[start[piece]:start[piece]+size])
    return split_list
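
One detail of this splitting scheme is easy to miss: the chunk size is rounded down, so any leftover characters at the end are dropped. A small sketch of the same arithmetic without NumPy (`split_demo` is a hypothetical stand-in for the function above):

```python
def split_demo(text, n):
    """Split text into n pieces of floor(len(text) / n) characters each."""
    size = len(text) // n
    return [text[i * size:(i + 1) * size] for i in range(n)]


alphabet = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
pieces = split_demo(alphabet, 4)         # size = 26 // 4 = 6
leftover = alphabet[4 * 6:]              # characters not covered by any piece

print(pieces)
print(leftover)
```

At most `n - 1` characters are lost this way, which is negligible for a year's worth of text but worth knowing when the input is short.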

In [None]:
# Let's take a look at our data again
data

In [None]:
# Let's create a list to hold all of the pieces of text
list_pieces = []
for t in data.blob_en:#transcript:
    split = split_text(t,12)
    list_pieces.append(split)   
#list_pieces

In [None]:
# The list has n elements, one for each transcript
len(list_pieces)

In [None]:
# Each transcript has been split into 12 pieces of text
len(list_pieces[0])

In [None]:
# Calculate the polarity for each piece of text

polarity_transcript = []
for lp in list_pieces:
    polarity_piece = []
    for p in lp:
        #polarity_piece.append(TextBlob(p).translate(to="en").sentiment.polarity)
        polarity_piece.append(p.sentiment.polarity)
    polarity_transcript.append(polarity_piece)
    
polarity_transcript

In [None]:
# Show the plot for one anio
plt.plot(polarity_transcript[0])
plt.title(data['full_name'].index[0])
plt.show()

In [None]:
# Show the plot for all anios
plt.rcParams['figure.figsize'] = [16, 12]

for index, anio in enumerate(data.index):    
    plt.subplot(3, 4, index+1)
    plt.plot(polarity_transcript[index])
    plt.plot(np.arange(0,12), np.zeros(12))
    plt.title(data['full_name'].iloc[index])
    plt.ylim(bottom=-.45, top=.45)
    
plt.show()

# 6 - Topic Modeling

We will make several attempts to find the topics that dominate the stories.

In [None]:
# Let's read in our document-term matrix
import pandas as pd
import pickle

data = pd.read_pickle('dtm_stop.pkl')
data

In [None]:
# Import the necessary modules for LDA with gensim
# Terminal / Anaconda Navigator: conda install -c conda-forge gensim
from gensim import matutils, models
import scipy.sparse

# import logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [None]:
# One of the required inputs is a term-document matrix
tdm = data.transpose()
tdm.head()

In [None]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)

In [None]:
# Gensim also requires dictionary of the all terms and their respective location in the term-document matrix
cv = pickle.load(open("cv_stop.pkl", "rb"))
id2word = dict((v, k) for k, v in cv.vocabulary_.items())
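
gensim works with a bag-of-words corpus: each document is a list of `(term_id, count)` pairs, and `id2word` maps each id back to its term. A hand-built sketch with an assumed three-word vocabulary and made-up counts shows the shape of both objects:

```python
# Hypothetical vocabulary in the style of CountVectorizer.vocabulary_ (term -> column index)
vocabulary = {"casa": 0, "perro": 1, "vida": 2}

# Invert it to get id -> term, as done for id2word
id2word = {idx: term for term, idx in vocabulary.items()}

# Hypothetical document-term matrix: one row per document
dtm = [
    [2, 0, 1],
    [0, 3, 0],
]

# Keep only the non-zero entries of each row as (term_id, count) pairs
corpus = [[(j, c) for j, c in enumerate(row) if c] for row in dtm]

print(id2word)
print(corpus)
```

Zero counts are simply omitted, which is what makes the sparse representation compact for large vocabularies.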

In [None]:
# Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term),
# we need to specify two other parameters as well - the number of topics and the number of passes
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda.print_topics()

In [None]:
# LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

In [None]:
# LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

## Attempt 2: Nouns Only

In [None]:
# Let's create a function to pull out nouns from a string of text
from nltk import word_tokenize, pos_tag

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text,language='spanish')
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)
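
The filter keeps tokens whose part-of-speech tag starts with `NN`, the Penn Treebank prefix for nouns. A sketch on hand-tagged tokens (the tags below are assumed, not produced by a tagger) isolates that step:

```python
# Hypothetical pre-tagged tokens in Penn Treebank style
tagged = [
    ("dog", "NN"),       # singular noun
    ("runs", "VBZ"),     # verb
    ("houses", "NNS"),   # plural noun
    ("big", "JJ"),       # adjective
    ("Madrid", "NNP"),   # proper noun
]


def is_noun(pos):
    """True for every noun tag: NN, NNS, NNP, NNPS."""
    return pos[:2] == "NN"


only_nouns = " ".join(word for word, pos in tagged if is_noun(pos))
print(only_nouns)
```

Keep in mind that NLTK's default `pos_tag` model is trained on English, so the tags it assigns to Spanish text should be treated with caution.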

In [None]:
data_clean = pd.read_pickle('data_clean.pkl')
data_clean

In [None]:
colname=[]
list_pieces = []
contador=0
for t in data_clean.transcript:
    split = split_text(t,posts_per_year[contador]-7)
    subcont=0
    for p in split:
        list_pieces.append(p)
        colname.append(str(2004+contador)+ "-" + str(subcont))
        subcont=subcont+1
    contador=contador+1
len(list_pieces)

In [None]:
data_split = pd.DataFrame(data=list_pieces).transpose()
data_split.columns=colname
data_split2=data_split.transpose()
data_split2.columns = ['transcript']
data_split2

In [None]:
# Apply the nouns function to the transcripts to filter only on nouns
data_nouns = pd.DataFrame(data_split2.transcript.apply(nouns))
data_nouns

In [None]:
# Create a new document-term matrix using only nouns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

with open('spanish.txt') as f:
    stop_words = f.read().splitlines()
for pal in add_stop_words:
    stop_words.append(pal)
for pal in more_stop_words:
    stop_words.append(pal)

# Recreate a document-term matrix with only nouns
cvn = CountVectorizer(stop_words=stop_words)
data_cvn = cvn.fit_transform(data_nouns.transcript)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())
data_dtmn.index = data_nouns.index
data_dtmn

In [None]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [None]:
# Let's start with 2 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
ldan.print_topics()

In [None]:
# Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

In [None]:
# Let's try topics = 4
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

## Attempt 3: Nouns and Adjectives

In [None]:
# Let's create a function to pull out nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text,language='spanish')
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)] 
    return ' '.join(nouns_adj)

In [None]:
# Apply the nouns_adj function to the transcripts to keep only nouns and adjectives
data_nouns_adj = pd.DataFrame(data_split2.transcript.apply(nouns_adj)) #data_clean
data_nouns_adj

In [None]:
# Create a new document-term matrix using only nouns and adjectives, also remove common words with max_df
cvna = CountVectorizer(stop_words=stop_words, max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj.transcript)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adj.index
data_dtmna

In [None]:
#data_dtmna['escritor']
print(data_dtmna.shape)
#print(cvna.get_feature_names())

In [None]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [None]:
# Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)
ldana.print_topics()

In [None]:
# Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

In [None]:
# Let's try modeling with 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

## Identify the Topics

In [None]:
# Our final LDA model
QTY_TOPICS=4
ldana = models.LdaModel(corpus=corpusna, num_topics=QTY_TOPICS, id2word=id2wordna, passes=40,
                        random_state=15)
ldana.print_topics(QTY_TOPICS,5)

In [None]:
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'blog'  # folder written in step 1; the author's local path was '/Users/jbagnato/python_projects/blog'
wordlists = PlaintextCorpusReader(corpus_root, '.*', encoding='latin-1')
#wordlists.fileids()
#pals = wordlists.words('2004.txt')
for i in range(QTY_TOPICS):
    theList=ldana.get_topic_terms(i)

    cfd = nltk.ConditionalFreqDist(
        (word,genre)
        for genre in anios
        for w in wordlists.words(genre + '.txt')
        for word in [id2wordna.get(a) for (a,b) in theList]
        if w.lower().startswith(word) )
    cfd.plot()

In [None]:
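# Let's take a look at which topic dominates each transcript
corpus_transformed = ldana[corpusna]
# A document can be assigned several topics, so keep the most probable one
list(zip([max(doc, key=lambda t: t[1])[0] for doc in corpus_transformed], data_dtmna.index))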
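
For each document the model returns a list of `(topic_id, probability)` pairs, and that list can hold more than one topic. A minimal sketch of picking the dominant topic per document, using invented probabilities and labels:

```python
# Hypothetical model output: one list of (topic_id, probability) pairs per document
doc_topics = [
    [(0, 0.9), (1, 0.1)],
    [(2, 0.6), (3, 0.4)],
    [(1, 0.99)],
]
labels = ["2004-0", "2004-1", "2005-0"]  # assumed document labels

# Keep the topic with the highest probability in each document
dominant = [max(pairs, key=lambda t: t[1])[0] for pairs in doc_topics]
print(list(zip(dominant, labels)))
```

Selecting by highest probability works whether a document has one topic or several, so it does not depend on every list having exactly one pair.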

This is what we found: <br>
TOPIC 0 - People [2004,2009]<br>
TOPIC 1 - Media [2008,2010,2011,2012,2015]<br>
TOPIC 2 - Casciari [2005,2007]<br> 
TOPIC 3 - Childhood [2006,2013]<br>

In [None]:
# Information from Wikipedia
casciariTL = {2004:'The "gorda" blog, in Spain. His daughter Nina is born.',
             2005:'Deutsche Welle award (Germany), best blog in the world, for the blog Más respeto, que soy tu madre',
             2006:'Editorial Sudamericana publishes Diario de una mujer gorda in Argentina',
             2007:'Publishes his second book, España deci alpiste. Writes for El País and La Nación',
             2008:'Gasalla takes an interest in the stage play.',
             2009:'The play premieres. It brings him fame and financial improvement. Book: El pibe que arruinaba las fotos',
             2010:'Quits the newspapers and founds Revista Orsai with Chiri, a childhood friend',
             2011:'First issue of Orsai appears. Publishes Charlas con mi hemisferio derecho',
             2012:'Starts reading stories on radio Vorterix, for 2 years',
             2013:'First run of Orsai ends',
             2014:'Also edited a magazine for children, Bonsai',
             2015:'Publishes El nuevo paraíso de los tontos. Separates from his wife. Suffers a heart attack and returns to Argentina'}


In [None]:
casciariTL

# Conclusions (Final?)

And now, with all the information gathered, the plots, and the real timeline of Hernán Casciari's life, it is time to draw conclusions! See the NLP article on the blog: www.aprendemachinelearning.com