## Data preprocessing

In [1]:
import os

cwd = os.getcwd()

In [2]:
import pandas as pd


# Build the path portably instead of concatenating strings
df = pd.read_csv(os.path.join(cwd, 'data', 'data.csv'))

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Post Link         10000 non-null  int64 
 1   Title             10000 non-null  object
 2   Body              10000 non-null  object
 3   Tags              10000 non-null  object
 4   CreationDate      10000 non-null  object
 5   Answer Date       10000 non-null  object
 6   AcceptedAnswerId  10000 non-null  int64 
 7   id                10000 non-null  int64 
 8   body              10000 non-null  object
 9   score             10000 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 781.4+ KB


In [4]:
df.head()

Unnamed: 0,Post Link,Title,Body,Tags,CreationDate,Answer Date,AcceptedAnswerId,id,body,score
0,11227809,Why is processing a sorted array faster than p...,<p>Here is a piece of C++ code that shows some...,<java><c++><performance><cpu-architecture><bra...,2012-06-27 13:51:36,2012-06-27 13:56:42,11227902,11227902,"<p><strong>You are a victim of <a href=""https:...",26621
1,927358,How do I undo the most recent local commits in...,<p>I accidentally committed the wrong files to...,<git><version-control><git-commit><undo>,2009-05-29 18:09:14,2009-05-29 18:13:42,927386,927386,<h1>Undo a commit &amp; redo</h1>\n<pre class=...,24809
2,2003505,How do I delete a Git branch locally and remot...,<h4>Failed Attempts to Delete a Remote Branch:...,<git><version-control><git-branch><git-push><g...,2010-01-05 01:12:15,2010-01-05 01:13:55,2003515,2003515,<h1>Executive Summary</h1>\n<pre><code>git pus...,19556
3,292357,What is the difference between 'git pull' and ...,"<p>What are the differences between <a href=""h...",<git><version-control><git-pull><git-fetch>,2008-11-15 09:51:09,2008-11-15 09:52:40,292359,292359,"<p>In the simplest terms, <a href=""http://git-...",13368
4,231767,"What does the ""yield"" keyword do?",<p>What is the use of the <code>yield</code> k...,<python><iterator><generator>,2008-10-23 22:21:11,2008-10-23 22:48:44,231855,231855,"<p>To understand what <code>yield</code> does,...",12259


In [5]:
!pip install beautifulsoup4



In [6]:
from bs4 import BeautifulSoup
import unicodedata
import re


def remove_html_tags_func(text):
    '''
    Removes HTML-Tags from a string, if present
    
    Args:
        text (str): String to which the function is applied
    
    Returns:
        Clean string without HTML-Tags
    ''' 
    return BeautifulSoup(text, 'html.parser').get_text()


def remove_url_func(text):
    '''
    Removes URL addresses from a string, if present
    
    Args:
        text (str): String to which the function is applied
    
    Returns:
        Clean string without URL addresses
    ''' 
    return re.sub(r'https?://\S+|www\.\S+', '', text)


def remove_accented_chars_func(text):
    '''
    Removes all accented characters from a string, if present
    
    Args:
        text (str): String to which the function is applied
    
    Returns:
        Clean string without accented characters
    '''
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')


def remove_punctuation_func(text):
    '''
    Removes all punctuation from a string, if present
    
    Args:
        text (str): String to which the function is applied
    
    Returns:
        Clean string without punctuations
    '''
    return re.sub(r'[^a-zA-Z0-9]', ' ', text)


def remove_irr_char_func(text):
    '''
    Removes all irrelevant characters (numbers and punctuation) from a string, if present
    
    Args:
        text (str): String to which the function is applied
    
    Returns:
        Clean string without irrelevant characters
    '''
    return re.sub(r'[^a-zA-Z]', ' ', text)


def remove_extra_whitespaces_func(text):
    '''
    Collapses runs of whitespace into a single space and trims both ends
    
    Args:
        text (str): String to which the function is applied
    
    Returns:
        Clean string without extra whitespaces
    ''' 
    return re.sub(r'\s+', ' ', text).strip()


def word_count_func(text):
    '''
    Counts words within a string
    
    Args:
        text (str): String to which the function is applied
    
    Returns:
        Number of words within a string, integer
    ''' 
    return len(text.split())
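
As a quick sanity check, the regex-based helpers can be tried on a short string. The sketch below is a minimal, standard-library-only illustration: the names `strip_urls`, `strip_accents` and `collapse_whitespace` are hypothetical stand-ins for the functions in the cell above, the sample sentence is made up, and the whitespace step assumes a plain `\s+` pattern.

```python
import re
import unicodedata


def strip_urls(text):
    # Same URL pattern as remove_url_func
    return re.sub(r'https?://\S+|www\.\S+', '', text)


def strip_accents(text):
    # NFKD splits "é" into "e" plus a combining accent;
    # encoding to ASCII with 'ignore' then drops the accent
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')


def collapse_whitespace(text):
    # Assumed simplification: one pattern for any run of whitespace
    return re.sub(r'\s+', ' ', text).strip()


# Made-up sample, not taken from the dataset
sample = "Café  docs at https://example.com/page   are  naïve"
cleaned = collapse_whitespace(strip_accents(strip_urls(sample)))
print(cleaned)
```

Running the URL step before the punctuation step matters: once `:` and `/` are replaced by spaces, the URL pattern can no longer match.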

In [7]:
!pip install nltk



In [21]:
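import nltk

# Fetch only the resources used below: tokenizer models and the stop-word list.
# Newer NLTK releases may also require nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')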


In [22]:
import nltk

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Build these once instead of on every call (the function runs per row)
STOP_WORDS = set(stopwords.words('english'))
PORTER = PorterStemmer()

def text_normaliser(text):
    # Tokenize, drop English stop words, then stem the remaining tokens
    words = word_tokenize(text)
    filtered_words = [word for word in words if word not in STOP_WORDS]
    stemmed = [PORTER.stem(word) for word in filtered_words]
    return ' '.join(stemmed)

In [23]:
def pre_process_text(text):
    text = remove_html_tags_func(text)
    text = remove_url_func(text)
    text = remove_accented_chars_func(text)
    text = remove_punctuation_func(text)
    text = remove_irr_char_func(text)
    text = remove_extra_whitespaces_func(text)
    text = text.lower()
    
    text = text_normaliser(text)
    
    return text
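
One detail of this pipeline may be worth checking: since the letters-only filter also replaces punctuation with spaces, the separate punctuation step appears to have no effect on the final result. The sketch below tests that on a made-up sentence, using hypothetical helper names that copy the two regexes from the cleaning functions.

```python
import re


def remove_punctuation(text):
    # Same pattern as remove_punctuation_func: keep letters and digits
    return re.sub(r'[^a-zA-Z0-9]', ' ', text)


def remove_irrelevant(text):
    # Same pattern as remove_irr_char_func: keep letters only
    return re.sub(r'[^a-zA-Z]', ' ', text)


# Made-up sample, not taken from the dataset
sample = "C++11 runs 6x faster, really!"

two_steps = remove_irrelevant(remove_punctuation(sample))
one_step = remove_irrelevant(sample)

# Both routes map every non-letter character to a single space
print(two_steps == one_step)
```

If digits should be kept (for example version numbers in tags), the letters-only step is the one to drop instead.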

In [24]:
df['complete_text'] = df.Title + " " + df.Body

In [25]:
df.complete_text.iloc[0]

'Why is processing a sorted array faster than processing an unsorted array? <p>Here is a piece of C++ code that shows some very peculiar behavior. For some strange reason, sorting the data (<em>before</em> the timed region) miraculously makes the loop almost six times faster.</p>\n<pre class="lang-cpp prettyprint-override"><code>#include &lt;algorithm&gt;\n#include &lt;ctime&gt;\n#include &lt;iostream&gt;\n\nint main()\n{\n    // Generate data\n    const unsigned arraySize = 32768;\n    int data[arraySize];\n\n    for (unsigned c = 0; c &lt; arraySize; ++c)\n        data[c] = std::rand() % 256;\n\n    // !!! With this, the next loop runs faster.\n    std::sort(data, data + arraySize);\n\n    // Test\n    clock_t start = clock();\n    long long sum = 0;\n    for (unsigned i = 0; i &lt; 100000; ++i)\n    {\n        for (unsigned c = 0; c &lt; arraySize; ++c)\n        {   // Primary loop\n            if (data[c] &gt;= 128)\n                sum += data[c];\n        }\n    }\n\n    double e

In [26]:
df['complete_text'] = df['complete_text'].apply(pre_process_text)

In [27]:
df.complete_text.iloc[0]

'process sort array faster process unsort array piec c code show peculiar behavior strang reason sort data time region miracul make loop almost six time faster includ algorithm includ ctime includ iostream int main gener data const unsign arrays int data arrays unsign c c arrays c data c std rand next loop run faster std sort data data arrays test clock start clock long long sum unsign unsign c c arrays c primari loop data c sum data c doubl elapsedtim static cast doubl clock start clock per sec std cout elapsedtim n std cout sum sum n without std sort data data arrays code run second sort data code run second sort take time one pass array actual worth need calcul unknown array initi thought might languag compil anomali tri java import java util array import java util random public class main public static void main string arg gener data int arrays int data new int arrays random rnd new random int c c arrays c data c rnd nextint next loop run faster array sort data test long start syst

### Building the bag-of-words (BoW)

In [28]:
words_post = df.complete_text

In [29]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(words_post)

CountVectorizer()

In [30]:
print("Vocabulary: ", vectorizer.vocabulary_)



In [33]:
len(vectorizer.vocabulary_)

19571

In [31]:
# Encode the Document
vector = vectorizer.transform(words_post)

In [32]:
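# Summarizing the Encoded Texts
# Note: toarray() materializes a dense 10000 x 19571 matrix; fine for a quick look,
# but keep the sparse `vector` for any downstream modeling
print("Encoded Document is:")
print(vector.toarray())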

Encoded Document is:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
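
To make the matrix above easier to read, here is a minimal standard-library sketch of what a bag-of-words encoding does on two tiny made-up documents. It only approximates `CountVectorizer` with default settings (lowercasing, tokens of at least two word characters, vocabulary indexed in sorted order); the variable names are illustrative and nothing here replaces the scikit-learn implementation.

```python
import re
from collections import Counter

# Two made-up, already pre-processed documents
docs = ["process sort array faster", "sort data c sort"]

# Approximation of the default tokenizer: words with 2+ characters
token_re = re.compile(r"\b\w\w+\b")
tokenized = [token_re.findall(doc.lower()) for doc in docs]

# Vocabulary: each distinct term gets a column index, in sorted order
terms = sorted({tok for toks in tokenized for tok in toks})
vocabulary = {term: idx for idx, term in enumerate(terms)}

# One row per document, one column per term, cell = term count
matrix = []
for toks in tokenized:
    counts = Counter(toks)
    matrix.append([counts.get(term, 0) for term in terms])

print(vocabulary)
print(matrix)
```

The single-letter token `c` is dropped by the token pattern, which is also why stand-alone `c` tokens from the pre-processed posts would not appear in the fitted vocabulary under default settings. Most cells are zero because each document uses only a small share of the vocabulary.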


## Building the LDA model