# T2 - Text Analysis using Natural Language Processing 

## Problem
The problem selected was the first option: "Problema 1 - Determinação de Valência em Manchetes de Jornais Brasileiros no 1° Semestre de 2017" (Problem 1: Determining Valence in Brazilian Newspaper Headlines in the First Half of 2017).

### Dataset
The dataset used in this experiment is available at [Manchetes de Jornais Brasileiros](https://github.com/pdpcosta/manchetesBrasildatabase).
The dataset has 5 columns:
* Day 
* Month 
* Year
* Newspaper names
* Headlines

## Approach

The classifier selected was [Naive Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html). I chose it for my first implementation because of its simplicity and the amount of documentation, tutorials, and examples available.
In addition, I found an interesting example that uses a Portuguese training dataset labeled 'Positive', 'Neutral', and 'Negative'. The example is available in [Md Repo](https://github.com/minerandodados/mdrepo).

The Naive Bayes classifier uses Bayes' theorem to estimate the probability that a given feature set belongs to a particular label. The implementation presented here extracts features as word counts (bag of words) with scikit-learn's `CountVectorizer`.
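
To make the idea concrete, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is not the scikit-learn implementation used in the experiment; the function names and the four training sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Collect per-label document counts and per-label word counts."""
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(documents, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def predict(text, label_counts, word_counts, vocabulary):
    """Return the label with the highest log posterior (up to a constant)."""
    total_docs = sum(label_counts.values())
    scores = {}
    for label, doc_count in label_counts.items():
        # log P(label)
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # unseen words carry no evidence in this sketch
            # log P(word | label) with add-one (Laplace) smoothing
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training sentences, for illustration only
toy_docs = [
    "crise aumenta desemprego",
    "economia cresce bem",
    "inflação cai e economia cresce",
    "desemprego e crise pioram",
]
toy_labels = ["Negativo", "Positivo", "Positivo", "Negativo"]
trained = train_naive_bayes(toy_docs, toy_labels)
print(predict("economia cresce", *trained))
```

The "naive" part is the assumption that words are independent given the label, which is why the score is a plain sum of per-word log probabilities.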

### Language and Libraries
The experiments were run with the following environment, language, and libraries:

1. Environment: [Anaconda3 4.3.1](https://repo.continuum.io/archive/index.html)
2. Programming Language: [Python 3.3](https://www.python.org/) 
3. Dataframe Library: [pandas 0.19.2](http://pandas.pydata.org/)
4. Natural Language: [NLTK 3.2.4](http://www.nltk.org/)
5. Machine Learning: [Scikit-learn](http://scikit-learn.org/stable/index.html)


## Experiment Implementation

In [None]:
import nltk
#nltk.download()
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.model_selection import cross_val_predict

### Preprocessing
The preprocessing steps were:
1. Replace the comma column separator, because commas inside headlines produced rows with the wrong number of columns
2. Clean special characters
3. Tokenize the sentences
4. Remove the stop words
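
Steps 2 to 4 can be sketched on a single headline using only the standard library. This is an illustrative simplification, not the notebook's code: the `preprocess` helper and the tiny `TOY_STOP_WORDS` set are assumptions, and whitespace splitting stands in for NLTK's tokenizer.

```python
import re

# Hypothetical, tiny stop-word list; the notebook uses NLTK's Portuguese list instead
TOY_STOP_WORDS = {"de", "a", "o", "e", "do", "da", "em", "que"}

def preprocess(headline, stop_words=TOY_STOP_WORDS):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    text = headline.lower()
    # Keep word characters (including accented letters), whitespace and hyphens
    text = re.sub(r"[^\w\s-]", "", text)
    tokens = text.split()
    return [token for token in tokens if token not in stop_words]

print(preprocess("Drama do desemprego está longe de diminuir."))
```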

#### Opening the dataset and replacing the separator

In [None]:
# Rewrite the CSV with "|" as the column separator: headlines may contain
# commas, which would otherwise produce rows with more than 5 columns.
with open("manchetesBrasildatabase.csv", mode="r", encoding="utf-8") as datasetNews, \
     open("manchetesBrasildatabaseProcessada.csv", mode="w", encoding="utf-8") as datasetNewsPreprocessed:
    for row in datasetNews:
        # Split on the first 4 commas only; everything after them is the headline
        separatedRow = row.split(",", 4)
        # The headline keeps its trailing newline, so each row stays on its own line
        datasetNewsPreprocessed.write(" | ".join(separatedRow))
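
The cell above keeps commas inside the headline while switching the column separator. The same idea can be checked on a single line with `str.split` and its `maxsplit` argument. The helper name and the sample row below are made up, and the sketch assumes that only the last column (the headline) can contain commas.

```python
def to_pipe_separated(line, n_columns=5):
    """Split on the first n_columns - 1 commas and rejoin with ' | '."""
    fields = line.split(",", n_columns - 1)
    return " | ".join(fields)

# Made-up row in the day, month, year, newspaper, headline layout
sample = "1,2,2017,Jornal Exemplo,Com vendas em queda, empresa muda lojas"
print(to_pipe_separated(sample))
```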

#### Opening the preprocessed dataset

In [None]:
datasetInFrame = pd.read_csv("manchetesBrasildatabaseProcessada.csv",header=None, sep="|")

In [None]:
datasetInFrame.columns =['Dia', 'Mês', 'Ano', 'Jornal', 'Título']

In [None]:
datasetInFrame.describe()

#### Cleaning special characters

In [None]:
sentences = [ str(sentence).lower().replace("'","").replace(".","").replace(",","") for sentence in datasetInFrame["Título"]]

#### Tokenizing the sentences

In [None]:
# Tokenize with NLTK's Portuguese Punkt model (the sentences are already lowercase)
sentencesWithTokens = [nltk.word_tokenize(sentence, language='portuguese') for sentence in sentences]

#### Deleting Stop Words

In [None]:
# A set makes the membership test below fast
stop_words = set(stopwords.words('portuguese'))

# Drop the stop words from each tokenized headline
sentencesWithoutStopWords = [
    [word for word in row if word not in stop_words]
    for row in sentencesWithTokens
]

In [None]:
#sentencesWithoutStopWords
#nltk.corpus.mac_morpho.words()
#tags = nltk.pos_tag(sentencesWithTokens[1])

In [None]:
nltk.corpus.mac_morpho.words()

tsents = nltk.corpus.mac_morpho.tagged_sents()
train = tsents[100:]
test = tsents[:100]


## Classification - Using Naive Bayes
Source: [Md Repo](https://github.com/minerandodados/mdrepo)

The dataset used to train the classifier is composed of 9817 tweets labeled "Positivo", "Neutro", and "Negativo".

In [211]:
datasetLabeled = pd.read_csv('Tweets_Mg.csv',encoding='utf-8')

In [219]:
datasetLabeled[datasetLabeled.Classificacao=='Neutro'].count()

Unnamed: 0                   2453
Created At                   2453
Text                         2453
Geo Coordinates.latitude      102
Geo Coordinates.longitude     102
User Location                1712
Username                     2453
User Screen Name             2453
Retweet Count                2453
Classificacao                2453
Observação                      0
Unnamed: 10                     0
Unnamed: 11                     0
Unnamed: 12                     0
Unnamed: 13                     0
Unnamed: 14                     0
Unnamed: 15                     0
Unnamed: 16                     0
Unnamed: 17                     0
Unnamed: 18                     0
Unnamed: 19                     0
Unnamed: 20                     0
Unnamed: 21                     0
Unnamed: 22                     0
Unnamed: 23                     0
Unnamed: 24                     0
dtype: int64

In [215]:
datasetLabeled[datasetLabeled.Classificacao=='Positivo'].count()

Unnamed: 0                   3300
Created At                   3300
Text                         3300
Geo Coordinates.latitude        1
Geo Coordinates.longitude       1
User Location                2118
Username                     3300
User Screen Name             3300
Retweet Count                3300
Classificacao                3300
Observação                      1
Unnamed: 10                     0
Unnamed: 11                     0
Unnamed: 12                     0
Unnamed: 13                     0
Unnamed: 14                     0
Unnamed: 15                     0
Unnamed: 16                     0
Unnamed: 17                     0
Unnamed: 18                     0
Unnamed: 19                     0
Unnamed: 20                     0
Unnamed: 21                     0
Unnamed: 22                     0
Unnamed: 23                     0
Unnamed: 24                     0
dtype: int64

In [216]:
datasetLabeled[datasetLabeled.Classificacao=='Negativo'].count()

Unnamed: 0                   2446
Created At                   2446
Text                         2446
Geo Coordinates.latitude        1
Geo Coordinates.longitude       1
User Location                1659
Username                     2446
User Screen Name             2446
Retweet Count                2446
Classificacao                2446
Observação                      0
Unnamed: 10                     0
Unnamed: 11                     0
Unnamed: 12                     0
Unnamed: 13                     0
Unnamed: 14                     0
Unnamed: 15                     0
Unnamed: 16                     0
Unnamed: 17                     0
Unnamed: 18                     0
Unnamed: 19                     0
Unnamed: 20                     0
Unnamed: 21                     0
Unnamed: 22                     0
Unnamed: 23                     0
Unnamed: 24                     0
dtype: int64
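
The three outputs above can be condensed into a single class distribution. In pandas this would typically be a single call such as `datasetLabeled['Classificacao'].value_counts()`. The sketch below reproduces the calculation with the standard library, using the row counts copied from the outputs above and considering only these three labels.

```python
from collections import Counter

# Row counts per label, copied from the three outputs above
label_counts = Counter({"Neutro": 2453, "Positivo": 3300, "Negativo": 2446})

total = sum(label_counts.values())
for label, count in label_counts.most_common():
    # Share of each label among the rows carrying one of the three labels
    print(f"{label}: {count} ({count / total:.1%})")
```

"Positivo" is the largest class among these rows, which is worth keeping in mind when reading the predictions.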

In [220]:
tweets = datasetLabeled['Text'].values
classes = datasetLabeled['Classificacao'].values

### Multinomial Naive Bayes

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts.
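
As a rough illustration of what "integer feature counts" means, the sketch below builds a small document-term count matrix by hand. It is a simplified stand-in for `CountVectorizer`, not its actual implementation: the `count_matrix` helper is hypothetical and it splits on whitespace, whereas `CountVectorizer`'s default token pattern also drops one-character tokens.

```python
from collections import Counter

def count_matrix(documents):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in documents[i]."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, rows = count_matrix([
    "bc prevê queda",
    "queda da inflação e queda do desemprego",
])
print(vocabulary)
print(rows)
```

Each row is one document and each column one vocabulary word, so the second document gets a 2 in the "queda" column.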

In [234]:
vectorizer = CountVectorizer(analyzer="word")
freq_tweets = vectorizer.fit_transform(tweets)
model = MultinomialNB()
model.fit(freq_tweets,classes)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [230]:
# From News Dataset
# sentences

In [231]:
freq_tests = vectorizer.transform(sentences)

In [232]:
prediction = model.predict(freq_tests)

### Results

In [233]:
for i in range(500):
    print("'" + sentences[i] + "' - valence: " + prediction[i])

' bndes encolhe e volta ao nível de 20 anos atrás' - valence: Neutro
' bc cria novo instrumento de política monetária' - valence: Neutro
' câmbio gera bate-boca entre ua e ue' - valence: Neutro
' indenização a transmissoras de energia já chega à tarifa' - valence: Neutro
' políticos esperam que relator separe "joio do trigo"' - valence: Neutro
' philips quer administrar hospitais públicos no brasil' - valence: Neutro
' com vendas em queda  c&amp;c muda lojas e troca diretoria' - valence: Neutro
' fachin poderá ir para turma que julga lava-jato' - valence: Neutro
' eike tem multas que superam fundo para prisões' - valence: Positivo
' operador pagou decoração de luxo de imóveis de cabral' - valence: Neutro
' drama do desemprego está longe de diminuir' - valence: Negativo
' bc prevê queda da inflação a 3%' - valence: Positivo
' boicote atrasa governo trump' - valence: Negativo
' crivella negocia vila dos atletas' - valence: Positivo
' crise amplia diferença de salário entre clt e servidor

In [None]:
datasetInFrame['Valência'] = pd.Series(prediction, index=datasetInFrame.index)


In [240]:
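# Show each headline next to its predicted valence
for _, row in datasetInFrame.iterrows():
    print(row['Título'], '-', row['Valência'])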

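## Discussion and Conclusion
The model selected was Multinomial Naive Bayes, trained on the labeled tweets and applied to the news headlines.
I spent a lot of time working with NLTK: I started from examples written for English and tried to adapt them to Portuguese.

* Problems with the implementation.
* Vocabularies with labels.
* Tweets do not have the same formality as newspaper headlines.
* Apparently there are more correct predictions for the positive class.
* The model does not consider relationships between words.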
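
One of the limitations noted above is that the model ignores relationships between words. A common partial remedy is to add word n-grams as features. In scikit-learn, `CountVectorizer` exposes an `ngram_range` parameter for this, for example `ngram_range=(1, 2)`. The sketch below shows what bigrams look like for one headline. The `word_ngrams` helper is hypothetical, and whether bigrams would improve the results here was not tested.

```python
def word_ngrams(tokens, n=2):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "desemprego está longe de diminuir".split()
print(word_ngrams(tokens))
```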

## References

* Python Text Processing with NLTK 2.0 Cookbook
* http://blog.chapagain.com.np/machine-learning-sentiment-analysis-text-classification-using-python-nltk/
