
## Environment

Libraries used:

- **NLTK**, for generating *tokens*.
- **Matplotlib** and **Seaborn**, for plotting.
- **Pandas** and **NumPy**, for statistics.
- **re**, for filtering with regular expressions.


In [2]:
import re
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import nltk
from nltk.corpus import stopwords
from nltk.stem import RSLPStemmer
from nltk.tokenize import RegexpTokenizer

%matplotlib inline

nltk.download('rslp')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package rslp to
[nltk_data]     C:\Users\tclem\AppData\Roaming\nltk_data...
[nltk_data]   Package rslp is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tclem\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tclem\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
#import re
#from unidecode import unidecode
#import unicodedata as ud

# Raw string so the backslashes in the Windows path are not read as escape sequences
RESULTS_CSV = r'..\data\stanford\in.csv'
data = pd.read_csv(RESULTS_CSV).replace(np.nan, "", regex = True)

data.head()

Unnamed: 0,Text,Opinion,Question,Answer,Sentiment,Confusion,Urgency,CourseType,forumpostid,coursedisplayname,forumuid,createdat,posttype,anonymous,anonymoustopeers,upcount,commentthreadid,reads
0,Interesting! How often we say those things to ...,1,0,0,65,2,15,Education,5225177f2c501f0a00000015,Education/EDUC115N/How_to_Learn_Math,30CADB93E6DE4711193D7BD05F2AE95C,02/09/2013 22:55,Comment,FALSO,FALSO,0,5221a8262cfae31200000001,41
1,What is \Algebra as a Math Game\'''' or are yo...,0,1,0,4,5,35,Education,5207d0e9935dfc0e0000005e,Education/EDUC115N/How_to_Learn_Math,37D8FAEE7D0B94B6CFC57D98FD3D0BA5,11/08/2013 17:59,Comment,FALSO,FALSO,0,520663839df35b0a00000043,55
2,I like the idea of my kids principal who says ...,1,0,0,55,3,25,Education,52052c82d01fec0a00000071,Education/EDUC115N/How_to_Learn_Math,CC11480215042B3EB6E5905EAB13B733,09/08/2013 17:53,Comment,FALSO,FALSO,0,51e59415e339d716000001a6,25
3,"From their responses, it seems the students re...",1,0,0,6,3,25,Education,5240a45e067ebf1200000008,Education/EDUC115N/How_to_Learn_Math,C717F838D10E8256D7C88B33C43623F1,23/09/2013 20:28,CommentThread,FALSO,FALSO,0,,0
4,"The boys loved math, because \there is freedom...",1,0,0,7,2,3,Education,5212c5e2dd10251500000062,Education/EDUC115N/How_to_Learn_Math,F83887D68EA48964687C6441782CDD0E,20/08/2013 01:26,CommentThread,FALSO,FALSO,0,,3


In [4]:
postings = pd.DataFrame(data={'Opinion': data['Opinion'], 
                              'Question': data['Question'], 
                              'Answer': data['Answer'], 
                              'Sentiment': data['Sentiment'], 
                              'Confusion': data['Confusion'], 
                              'Urgency': data['Urgency'], 
                              'Text': data['Text'], 
                              'forumpostid': data['forumpostid'], 
                              'CourseType': data['CourseType'], 
                              'coursedisplayname': data['coursedisplayname'], 
                              'commentthreadid': [row['forumpostid'] if row['commentthreadid'] == 'None' or row['commentthreadid'] == '' else row['commentthreadid'] for index, row in data.iterrows()]})

postings.set_index(['commentthreadid'],inplace=True,drop=True)
postings.sort_values(by=['commentthreadid'],inplace=True)

postings_aux = postings.groupby('commentthreadid')

discussions = pd.DataFrame(data={'discussion': postings_aux['Text'].sum(),
                                 'CourseType': postings_aux['CourseType'].first(),
                                 'coursedisplayname': postings_aux['coursedisplayname'].first()})

#data_cleanned.reset_index(inplace=True)
#data_cleanned.drop(['index'], axis=1, inplace=True)
#print(re.sub(rb'[^\x00-\x7f]',rb' ',discussions.loc['51b513008e8d330d00000001','discussion'].encode('utf-8')))

#print(re.sub(r'[^\x00-\x7f]',r' ',ud.normalize('NFD',discussions.loc['51b513008e8d330d00000001','discussion'])))

#print(discussions.loc['51b513008e8d330d00000001','discussion']) 
postings['Text'].head()

commentthreadid
5,26331E+23                 I am after second review. Although I had appli...
5,26331E+23                 Also got that. Are you going to resubmit and s...
5,26331E+23                 I don't think the system is fair - some review...
5,30483E+23                 Thanks. It was useful to think about the answe...
51b513008e8d330d00000001    Hello Kristin Thank you for this very nice and...
Name: Text, dtype: object

In [5]:
postings.reset_index(inplace=False)['Text'].head()

0    I am after second review. Although I had appli...
1    Also got that. Are you going to resubmit and s...
2    I don't think the system is fair - some review...
3    Thanks. It was useful to think about the answe...
4    Hello Kristin Thank you for this very nice and...
Name: Text, dtype: object

## Tokenization

First, the tokenization strategies were defined, and the tokens used in the analyses were generated from them. The first strategy was **_stopword_ removal**, using the word list provided by the NLTK library. It is motivated by the fact that _stopwords_, despite their high frequency, carry little meaning in the texts.

The words adopted as *tokens* keep their original **letter case** (upper or lower), in order to preserve possible syntactic differences. All words with **two or more letters** were kept, and **hyphens** and **apostrophes** were treated as part of the word.

Integers and decimal numbers with at least **two digits** were also used as *tokens*, as were dates in the **dd/mm/yy** or **dd/mm/yyyy** format, given the importance these values may have in the text, even though a later vectorization step might discard them.


In [6]:
# Raw strings keep the regex backslashes from being read as string escapes
regex_patterns = { "words": r'''\w+[-']*\w+''',
                  "discrete": r'''\d{2,}''',
                  "dates": r'''\d{2}\/\d{2}\/\d{2,4}''',
                  "continuous": r'''\d*[\.]*\d+[\.|\,]*\d+''' }

regex = '|'.join(regex_patterns.values())
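
The combined pattern can be checked on a small string before it is applied to the whole corpus. The sketch below uses only the standard `re` module and assumes that `re.findall` behaves like `RegexpTokenizer` for patterns without capturing groups. The sample text and the reordered key list are hypothetical and only illustrate that alternation order matters: `\w` also matches digits, so the `words` alternative is tried before `dates`.

```python
import re

# Same patterns as in the cell above, written as raw strings.
patterns = {
    "words": r"\w+[-']*\w+",
    "discrete": r"\d{2,}",
    "dates": r"\d{2}\/\d{2}\/\d{2,4}",
    "continuous": r"\d*[\.]*\d+[\.|\,]*\d+",
}

sample = "Due 12/05/2013, score 3.75"  # hypothetical example text

# Original order: "words" comes first in the alternation.
original = re.findall("|".join(patterns.values()), sample)

# Hypothetical reordering that tries the date pattern first.
reordered_keys = ["dates", "words", "discrete", "continuous"]
reordered = re.findall("|".join(patterns[k] for k in reordered_keys), sample)

print(original)
print(reordered)
```

With the original order the date is split into `12`, `05`, and `2013`, while the reordered pattern keeps `12/05/2013` as a single token. In both cases `3.75` is matched by the `continuous` alternative.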

In [7]:
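def tokenize_data(data, regex):
    """Tokenize each text with `regex` and drop English stopwords.

    Returns the per-document token lists and a flat list of all tokens.
    """
    token_list = []
    token_bag = []
    tokenizer = RegexpTokenizer(regex)
    # A set gives fast membership tests; the comparison stays case-sensitive
    stopwords_en = set(stopwords.words("english"))

    for row in data:
        tokens = tokenizer.tokenize(row)
        tokens = [token for token in tokens if token not in stopwords_en]
        token_bag.extend(tokens)
        token_list.append(tokens)

    return token_list, token_bag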

postings['Text'], corpus_vocabulary = tokenize_data(postings['Text'], regex)
corpus_vocabulary_set = set(corpus_vocabulary)
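
Because the tokens keep their original case and are compared with the stopword list as-is, capitalized forms can survive the filter. This is consistent with `The` and `It` appearing among the most frequent words in the frequency table. The sketch below uses a hypothetical three-word stopword set as a stand-in for NLTK's lowercase English list. The case-insensitive variant is shown only for comparison and is not what the notebook does.

```python
# Hypothetical miniature stopword set; NLTK's English list is likewise lowercase.
stopwords_demo = {"the", "it", "is"}

tokens = ["The", "course", "is", "great", "It", "helps"]

# Case-sensitive filter, as in tokenize_data: "The" and "It" are kept.
case_sensitive = [t for t in tokens if t not in stopwords_demo]

# Case-insensitive alternative: compare the lowercased token, keep the original form.
case_insensitive = [t for t in tokens if t.lower() not in stopwords_demo]

print(case_sensitive)
print(case_insensitive)
```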

In [8]:
postings.head()

commentthreadid,Opinion,Question,Answer,Sentiment,Confusion,Urgency,Text,forumpostid,CourseType,coursedisplayname
"5,26331E+23",1,0,1,4,4,3,"[second, review, Although, applied, comments, ...",526563a587050a90ef000006,Medicine,Medicine/SciWrite/Fall2013
"5,26331E+23",0,1,1,4,4,45,"[Also, got, Are, going, resubmit, see, get, be...",5264e2152cc6095e83000015,Medicine,Medicine/SciWrite/Fall2013
"5,26331E+23",1,0,0,3,4,55,"[think, system, fair, reviewers, rush, process...",5263a0de9ec9282178000010,Medicine,Medicine/SciWrite/Fall2013
"5,30483E+23",0,0,0,5,4,2,"[Thanks, It, useful, think, answers, Corey, ge...","5,30483E+23",Humanities,HumanitiesSciences/EP101/Environmental_Physiology
51b513008e8d330d00000001,0,1,0,35,4,5,"[Hello, Kristin, Thank, nice, usefull, Mooc, U...",53e1dfb0fac7aaea13000007,Medicine,Medicine/HRP258/Statistics_in_Medicine


Once the *tokens* and the vocabulary of the *corpus* are known, the remaining metrics needed to analyze word frequency in the *corpus* can be computed. This step produces a *DataFrame* of word frequencies that is reused and extended later.

In [9]:
def get_frequency_series(tokens):
    """Return a DataFrame with one row per distinct token and its absolute frequency."""
    token_frequency = pd.Series(tokens).value_counts().reset_index()
    token_frequency.columns = ["Word", "Frequency"]
    return token_frequency

total_word_occurrences = len(corpus_vocabulary)
words_frequency = get_frequency_series(corpus_vocabulary)

Now that the characteristics of the *corpus* are well defined, the next step is to examine how the frequency distribution of its words behaves. To do so, the *DataFrame* created earlier, which holds the frequency of every word in the vocabulary, is extended with new variables of interest. In addition to the absolute frequency of each word, the following are analyzed:

- **r:** The *rank* of each word by frequency
- **Pr:** The probability of occurrence of each word
- **r.Pr:** The product of rank and probability, which Zipf's law predicts to be roughly constant

In [10]:
words_frequency["Ranking (r)"] = words_frequency["Frequency"].rank(ascending=False, method='first')
pr = (words_frequency["Frequency"] / total_word_occurrences)
r_pr = (words_frequency["Ranking (r)"] * pr)
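
The same three quantities can be followed by hand on a toy example. The sketch below uses only the standard library and a hypothetical six-token list. It is not the notebook's data and only shows how `r`, `Pr`, and `r.Pr` relate to each other.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized.
tokens = ["math", "students", "math", "course", "students", "math"]

counts = Counter(tokens)
total = sum(counts.values())  # total number of word occurrences

rows = []
# most_common() sorts by descending frequency, so the position is the rank r.
for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    pr = freq / total  # probability of occurrence
    rows.append((word, freq, rank, round(pr, 3), round(rank * pr, 3)))

for row in rows:
    print(row)
```

Here `math` occurs 3 times out of 6, so its `Pr` is 0.5 and, at rank 1, its `r.Pr` is also 0.5.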

For readability, a few presentation adjustments are applied to the computed values.

In [11]:
words_frequency["Ranking (r)"] = words_frequency["Ranking (r)"].astype(int)
words_frequency["Pr (%)"] = round(pr * 100, 2)
words_frequency["r.Pr"] = round(r_pr, 3)

Finally, because the vocabulary of the *corpus* is large, the table below shows only the results for the 50 most frequent words in the document collection.

In [12]:
df_words_frequency = pd.DataFrame(words_frequency)
df_words_frequency.head(50)

Unnamed: 0,Word,Frequency,Ranking (r),Pr (%),r.Pr
0,students,8709,1,0.9,0.009
1,would,6188,2,0.64,0.013
2,The,6161,3,0.64,0.019
3,math,5954,4,0.62,0.025
4,think,5260,5,0.55,0.027
5,one,5036,6,0.52,0.031
6,course,4279,7,0.44,0.031
7,like,4012,8,0.42,0.033
8,It,3926,9,0.41,0.037
9,also,3867,10,0.4,0.04


In [13]:
postings.reset_index().to_json(r'..\data\stanford\postings.json')
discussions.to_json(r'..\data\stanford\discussions.json')