# Text Analysis Pre-processing

We will be using spaCy for data pre-processing and computational linguistics, gensim for topic modelling, scikit-learn for classification, and Keras for text generation.

Based on ```https://github.com/bhargavvader/personal/tree/master/notebooks/text_analysis_tutorial```

In [1]:
import matplotlib.pyplot as plt
import gensim
import numpy as np
import spacy

from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel
from gensim.models.wrappers import LdaMallet
from gensim.corpora import Dictionary
import pyLDAvis.gensim

import os, re, operator, warnings, sys
warnings.filterwarnings('ignore')  # suppress library warnings for now
%matplotlib inline
print('Python Version: %s' % (sys.version))

Python Version: 2.7.15 | packaged by conda-forge | (default, Feb 28 2019, 04:00:11) 
[GCC 7.3.0]


## Getting all extracted text files 

In [2]:
path = os.getcwd() #get current directory

In [3]:
files = []
for root, _, filenames in os.walk(path):
    for filename in filenames:
        # match on the extension, not on any '.txt' substring in the name
        if filename.endswith('.txt'):
            files.append(os.path.join(root, filename))

for filename in files:
    print(filename)
print('Total Files:', len(files))

/home/storopoli/Documents/jupyter_notebooks/textract/INEP 2017d.txt
/home/storopoli/Documents/jupyter_notebooks/textract/Trends 2018.txt
/home/storopoli/Documents/jupyter_notebooks/textract/INEP 2017c.txt
/home/storopoli/Documents/jupyter_notebooks/textract/Becker et al 2017.txt
/home/storopoli/Documents/jupyter_notebooks/textract/OECD 2014.txt
/home/storopoli/Documents/jupyter_notebooks/textract/Freeman, Becker _ Hall 2015.txt
('Total Files:', 6)
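
The directory walk above can be wrapped in a small reusable helper. This is only a sketch: `find_text_files` is a hypothetical name, and matching the extension case-insensitively and sorting the result are assumptions made here so that the file order is reproducible between runs.

```python
import os


def find_text_files(root):
    """Return a sorted list of paths under `root` whose names end in .txt."""
    matches = []
    for dirpath, _, filenames in os.walk(root):
        for filename in filenames:
            # compare the lowercased name so 'REPORT.TXT' is also found
            if filename.lower().endswith('.txt'):
                matches.append(os.path.join(dirpath, filename))
    return sorted(matches)
```

Calling `find_text_files(os.getcwd())` should collect the same kind of list as the loop above.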


## Function to clean text
Non-ASCII characters can cause encoding errors in Python 2, so we replace them with spaces and convert the result to `unicode`.

In [4]:
# only if python 2 (not for python 3)
def clean(text):
    return unicode(''.join([i if ord(i) < 128 else ' ' for i in text]))
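
For readers on Python 3, where `unicode` does not exist, a rough equivalent might look like the sketch below. The name `clean_ascii` is hypothetical, and decoding bytes as UTF-8 is an assumption about how the text files were saved.

```python
def clean_ascii(text):
    """Replace every non-ASCII character in `text` with a space.

    Accepts either bytes or text. Bytes are decoded as UTF-8 first
    (an assumption about the input files); undecodable bytes become
    the replacement character, which is then turned into a space.
    """
    if isinstance(text, bytes):
        text = text.decode('utf-8', errors='replace')
    return ''.join(ch if ord(ch) < 128 else ' ' for ch in text)
```

Unlike the byte-wise Python 2 version, this replaces each non-ASCII character with a single space, regardless of how many bytes it occupied.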

In [5]:
# read the whole file and close it when done
with open('Trends 2018.txt') as f:
    text = f.read()

## Pre-processing data

It is often said of machine learning and NLP algorithms: garbage in, garbage out. We can't expect state-of-the-art results without good data. Let's spend this section cleaning and understanding our data set. NLTK is a popular choice for pre-processing, but it is rather outdated, so we will use ```spaCy```, an industry-grade text-processing package.

### English
First run ```python -m spacy download en_core_web_sm``` in your environment,
then load the model with ```nlp = spacy.load('en_core_web_sm')```

### Portuguese
First run ```python -m spacy download pt_core_news_sm``` in your environment,
then load the model with ```nlp = spacy.load('pt_core_news_sm')```

### Multi-language
First run ```python -m spacy download xx_ent_wiki_sm``` in your environment,
then load the model with ```nlp = spacy.load('xx_ent_wiki_sm')```
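
To switch between languages without editing the `spacy.load` call by hand, the three package names downloaded above could be kept in a lookup table. This is a sketch only: `SPACY_MODELS`, `model_name`, and `load_model` are hypothetical names, and `load_model` assumes the corresponding package has already been downloaded.

```python
# Language code -> spaCy model package, as downloaded in the steps above.
SPACY_MODELS = {
    'en': 'en_core_web_sm',
    'pt': 'pt_core_news_sm',
    'xx': 'xx_ent_wiki_sm',
}


def model_name(lang):
    """Return the model package name for a language code."""
    try:
        return SPACY_MODELS[lang]
    except KeyError:
        raise ValueError('No model configured for language %r' % lang)


def load_model(lang):
    """Load the spaCy model for `lang` (the package must be installed)."""
    import spacy  # imported lazily so the lookup works without spaCy
    return spacy.load(model_name(lang))
```

With this in place, `nlp = load_model('en')` would be equivalent to the explicit call used in the next cell.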

In [6]:
nlp = spacy.load('en_core_web_sm')
print('spaCy Version: %s' % (spacy.__version__))

spaCy Version: 2.1.4


## Stop Words
```spaCy``` ships with a built-in list of English stop words.

In [7]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
print('Number of stop words: %d' % len(spacy_stopwords))
print('First ten stop words: %s' % list(spacy_stopwords)[:10])

Number of stop words: 326
First ten stop words: [u'all', u'six', u'just', u'less', u'being', u'indeed', u'over', u'move', u'anyway', u'fifty']
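
A stop word list is typically used as a set to filter tokens, and it is often extended with corpus-specific words. The sketch below illustrates the idea in plain Python: `BASE_STOP_WORDS` is a tiny hypothetical stand-in for spaCy's `STOP_WORDS`, and `remove_stop_words` is a helper invented for this example.

```python
# Hypothetical stand-in for spacy.lang.en.stop_words.STOP_WORDS
BASE_STOP_WORDS = {'the', 'a', 'of', 'and', 'in'}


def remove_stop_words(words, stop_words, extra=()):
    """Drop words found in `stop_words` or `extra`, ignoring case."""
    blocked = set(stop_words) | {w.lower() for w in extra}
    return [w for w in words if w.lower() not in blocked]


words = 'The trends of higher education in Brazil'.split()
# 'education' is added as a corpus-specific stop word for this call only
kept = remove_stop_words(words, BASE_STOP_WORDS, extra=['education'])
print(kept)
```

Passing `spacy_stopwords` instead of `BASE_STOP_WORDS` should apply the same filter with the full spaCy list.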


In [9]:
doc = nlp(clean(text))

In [10]:
# split the text into documents, one per line, keeping only informative lemmas
texts, article = [], []
for w in doc:
    # if it's not a stop word, punctuation mark, or number, add it to our article
    if w.text != '\n' and not w.is_stop and not w.is_punct and not w.like_num and w.text != 'I':
        # we add the lemmatized version of the word
        article.append(w.lemma_)
    # if it's a new line, it means we're onto our next document
    if w.text == '\n':
        texts.append(article)
        article = []
# keep the last document when the text does not end with a newline
if article:
    texts.append(article)
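
The grouping logic in the loop above can be seen in isolation on plain strings. In this sketch, `split_documents` is a hypothetical helper that works on a list of token strings instead of spaCy tokens; it also flushes the final group, so text that does not end with a newline still yields its last document.

```python
def split_documents(tokens, separator='\n'):
    """Group a flat token list into documents, splitting at `separator`."""
    documents, current = [], []
    for token in tokens:
        if token == separator:
            # a separator closes the current document
            documents.append(current)
            current = []
        else:
            current.append(token)
    # flush the trailing document if the input did not end with a separator
    if current:
        documents.append(current)
    return documents
```

Consecutive separators produce empty documents, which matches the behaviour of the notebook loop.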

In [11]:
tokens = [token.lemma_ for token in doc if not token.is_stop]

### Function to clean up text
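This function pre-processes the text: it drops stop words, non-alphabetic and very short tokens, and uninformative parts of speech, then returns the remaining lemmas.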

In [12]:
def clean_up(text):
    """
    Clean up the text and return a list
    of lemmas for the document.

    It also works around unicode problems
    in Python 2.
    """
    if sys.version_info.major == 2:
        text = unicode(''.join([i if ord(i) < 128 else ' ' for i in text]))
    # parts of speech that carry little topical information
    removal = ['ADV', 'PRON', 'CCONJ', 'PUNCT', 'PART', 'DET', 'ADP', 'SPACE']
    text_out = []
    doc = nlp(text)
    for token in doc:
        # keep alphabetic, non-stop-word tokens longer than two characters
        if not token.is_stop and token.is_alpha and len(token) > 2 and token.pos_ not in removal:
            text_out.append(token.lemma_)
    return text_out

In [13]:
def clean_up_spacy(text):
    """
    Clean up the text and return a list of
    unique terms for the document.

    It also works around unicode problems
    in Python 2.
    """
    if sys.version_info.major == 2:
        text = unicode(''.join([i if ord(i) < 128 else ' ' for i in text]))
    text_out = set()
    # normalize separators: runs of two or more whitespace characters,
    # colons, and backslashes all end up as commas
    cleaned = re.sub(r"\s\s+", ',', text)
    cleaned = re.sub(r"'|•|<br/>", "", cleaned)
    cleaned = re.sub(r'\w*(?=\\|:)', '', cleaned)
    text = re.sub(r"xa0|\\|:", ',', cleaned)
    text = re.sub(r"(?<=\w)\s(?=\w+,)", '_', text)
    text = re.sub(r"(?<=\w)\s(?=\w+)", '', text)
    text = re.sub(r"\[|]", ' ', text)
    text = ' '.join(text.split(','))

    removal1 = ['ADV', 'PRON', 'CCONJ', 'PUNCT', 'PART', 'DET', 'ADP', 'VERB', 'ADJ', 'SYM', 'NOUN', 'X', 'NUM', 'SPACE']
    doc = nlp(text)
    for token in doc:
        # keep short, all-caps alphabetic tokens (likely acronyms)
        if token.text == token.text.upper() and len(token) < 15 and not token.is_punct and token.is_alpha:
            text_out.add(token.lemma_.strip())
        # keep remaining short tokens whose part of speech is not in the removal list
        if token.pos_ not in removal1 and len(token) < 15 and not token.is_punct:
            lemma = token.lemma_.strip()
            if lemma != '':
                text_out.add(lemma)
    return list(text_out)
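
To make the regular expressions in `clean_up_spacy` easier to follow, the sketch below isolates the first two substitutions and runs them on a short sample. `normalize_separators` is a hypothetical name used only for this illustration, and the sample string is invented.

```python
import re


def normalize_separators(text):
    """Apply the first two substitutions used in clean_up_spacy."""
    # 1. collapse runs of two or more whitespace characters into a comma
    text = re.sub(r"\s\s+", ',', text)
    # 2. delete apostrophes, bullet characters, and <br/> tags
    text = re.sub(r"'|•|<br/>", '', text)
    return text


sample = "Python  SQL<br/>don't"
print(normalize_separators(sample))
```

Note that a single space is left untouched by the first substitution; only runs of two or more whitespace characters become commas.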

In [14]:
# Define function to clean up text by removing personal pronouns, stopwords, and punctuation
import pandas as pd

def cleanup_text(text, logging=False):
    punctuations = '!"#$%&\'()*+,-/:;<=>?@[\\]^_`{|}~©'
    stopwords = spacy.lang.en.stop_words.STOP_WORDS
    if sys.version_info.major == 2:
        text = unicode(''.join([i if ord(i) < 128 else ' ' for i in text]))
    texts = []
    doc = nlp(text, disable=['parser', 'ner'])
    # lemmatize, lowercase, and drop pronouns (spaCy 2 lemmatizes them to '-PRON-')
    tokens = [token.lemma_.lower().strip() for token in doc if token.lemma_ != '-PRON-']
    # tokens are plain strings at this point, so filter them directly
    tokens = [token for token in tokens if token not in stopwords and token not in punctuations]
    tokens = ' '.join(tokens)
    texts.append(tokens)
    return pd.Series(texts)
#reviews['Description_Cleaned'] = reviews['description_Cleaned_1'].apply(lambda x: cleanup_text(x, False))

In [17]:
temp = clean_up(text)
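
The cell above cleans a single file. To run the same step over every path collected in `files`, something like the following sketch could be used. `build_corpus` is a hypothetical helper, and reading the files as UTF-8 with undecodable bytes replaced is an assumption about the extracted text.

```python
import io


def build_corpus(paths, tokenizer):
    """Read each file and map its path to the tokenizer's output."""
    corpus = {}
    for path in paths:
        # assumption: files are UTF-8; bad bytes are replaced, not fatal
        with io.open(path, encoding='utf-8', errors='replace') as handle:
            corpus[path] = tokenizer(handle.read())
    return corpus
```

For example, `corpus = build_corpus(files, clean_up)` would give one list of lemmas per text file.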

In [21]:
# the with block closes the file automatically
with open('temp.txt', 'w') as f:
    for line in temp:
        f.write(line)
        f.write('\n')
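
Writing one token per line makes the file easy to read back later. The sketch below pairs the write step with a matching reader; `save_tokens` and `load_tokens` are hypothetical names, and the UTF-8 encoding is an assumption chosen so that the same code runs on Python 2 and Python 3.

```python
import io


def save_tokens(tokens, path):
    """Write one token per line to `path`, encoded as UTF-8."""
    with io.open(path, 'w', encoding='utf-8') as handle:
        for token in tokens:
            handle.write(u'%s\n' % token)


def load_tokens(path):
    """Read tokens back from a file written by save_tokens."""
    with io.open(path, encoding='utf-8') as handle:
        return [line.rstrip(u'\n') for line in handle]
```

A round trip such as `save_tokens(temp, 'temp.txt')` followed by `load_tokens('temp.txt')` should return the original list, provided no token itself contains a newline.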