# T2 - Text Analysis using Natural Language Processing 

## Problem
The problem selected was the first option: "Problema 1 - Determinação de Valência em Manchetes de Jornais Brasileiros no 1° Semestre de 2017" (Problem 1 - Determining Valence in Brazilian Newspaper Headlines in the First Half of 2017).

### Dataset
The dataset used in this experiment is available at [Manchetes de Jornais Brasileiros](https://github.com/pdpcosta/manchetesBrasildatabase).
It has 5 columns:
* Day 
* Month 
* Year
* Newspaper names
* Headlines

## Approach

The classifier selected was [Naive Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html). I chose it for my first implementation because of its simplicity and the documentation available (documentation, tutorials, and examples are easy to find).
In addition, I found an interesting example that uses a Portuguese training dataset labeled 'Positive', 'Neutral', and 'Negative'. The example is available in [Md Repo](https://github.com/minerandodados/mdrepo).

The Naive Bayes classifier uses Bayes' theorem to predict the probability that a given feature set belongs to a particular label. The implementation presented here uses word counts (bag of words) as features, extracted with `CountVectorizer`.
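
How Bayes' theorem turns word counts into label probabilities can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: the function names, the toy documents, and the whitespace tokenization are all assumptions made for the example, and add-one (Laplace) smoothing is assumed for unseen word/label pairs.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Collect the counts a multinomial Naive Bayes model needs."""
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocabulary = set()
    for text, label in zip(documents, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def predict_proba(text, label_counts, word_counts, vocabulary):
    """Return P(label | text) for every label, using add-one smoothing."""
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior P(label)
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training carry no information
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)  # log P(word | label)
        log_scores[label] = score
    # Normalize the log scores into probabilities that sum to 1
    highest = max(log_scores.values())
    exp_scores = {label: math.exp(s - highest) for label, s in log_scores.items()}
    normalizer = sum(exp_scores.values())
    return {label: value / normalizer for label, value in exp_scores.items()}

# Hypothetical toy training data (not taken from the tweet dataset)
toy_docs = ["ótimo governo", "governo ruim", "dia normal"]
toy_labels = ["Positivo", "Negativo", "Neutro"]
model_counts = train_naive_bayes(toy_docs, toy_labels)
probabilities = predict_proba("governo ruim", *model_counts)
predicted_label = max(probabilities, key=probabilities.get)  # "Negativo"
```

Each word contributes independently to the score, which is the "naive" assumption: word order and word combinations play no role.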

### Language and Libraries
The experiments were run with the following environment, language, and libraries:

1. Environment: [Anaconda3 4.3.1](https://repo.continuum.io/archive/index.html)
2. Programming Language: [Python 3.3](https://www.python.org/)
3. Dataframe Library: [pandas 0.19.2](http://pandas.pydata.org/)
4. Natural Language: [NLTK 3.2.4](http://www.nltk.org/)
5. Machine Learning: [Scikit-learn](http://scikit-learn.org/stable/index.html)


## Experiment Implementation

In [None]:
import nltk
#nltk.download()
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.model_selection import cross_val_predict

### Preprocessing
The preprocessing steps were:
1. Replace the comma separator with `|` (commas inside headlines produced rows with too many columns)
2. Clean special characters
3. Tokenize the sentences
4. Delete the stop words

#### Opening the dataset and replacing the separator

In [None]:
# Re-delimit the CSV with "|" so that commas inside headlines no longer split columns.
with open("manchetesBrasildatabase.csv", mode="r", encoding="utf-8") as datasetNews, \
     open("manchetesBrasildatabaseProcessada.csv", mode="w", encoding="utf-8") as datasetNewsPreprocessed:

    for row in datasetNews:
        separatedRow = row.replace(",", "|").split("|")

        # More than 5 fields means the headline itself contained commas:
        # glue the extra fields back into the headline column.
        if len(separatedRow) > 5:
            separatedRow[4] = ", ".join(separatedRow[4:])

        # The last field still carries the original line break.
        datasetNewsPreprocessed.write(" | ".join(separatedRow[:5]))
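
The effect of the cell above on a single line can be shown with a small helper. This is a sketch under assumptions: `repair_row` and the sample row are hypothetical, and unlike the notebook cell it rejoins the headline with a bare `,` so that the space already following each comma is not doubled.

```python
def repair_row(row, n_columns=5):
    """Re-delimit one CSV line with '|' and keep commas that belong to the headline."""
    fields = row.rstrip("\n").split(",")
    # Everything past the first four fields is part of the headline
    head = fields[:n_columns - 1]
    headline_parts = fields[n_columns - 1:]
    return " | ".join(head + [",".join(headline_parts)])

# Hypothetical row: day, month, year, newspaper, headline (with a comma inside)
sample_row = "2,1,2017,Valor,Cortes, déficit e inflação\n"
repaired = repair_row(sample_row)
# repaired == "2 | 1 | 2017 | Valor | Cortes, déficit e inflação"
```

The repaired line has exactly four `|` separators, so `pd.read_csv(..., sep="|")` sees five columns.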

#### Opening the preprocessed dataset

In [None]:
datasetInFrame = pd.read_csv("manchetesBrasildatabaseProcessada.csv",header=None, sep="|")

In [None]:
datasetInFrame.columns =['Dia', 'Mês', 'Ano', 'Jornal', 'Título']

In [None]:
#datasetInFrame.describe()

#### Cleaning special characters

In [None]:
sentences = [ str(sentence).lower().replace("'","").replace(".","").replace(",","") for sentence in datasetInFrame["Título"]]
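
The chained `replace` calls above only handle three characters. A broader alternative, sketched here as an assumption rather than as what the notebook does, is a regular expression that turns every punctuation character into a space; `clean_headline` is a hypothetical helper name.

```python
import re

def clean_headline(text):
    """Lowercase a headline and replace punctuation with spaces."""
    text = text.lower()
    # \w is Unicode-aware for str patterns in Python 3, so accented letters survive
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitution
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_headline("Violência no Rio: assassinatos sobem 41,37%")
# cleaned == "violência no rio assassinatos sobem 41 37"
```

Replacing punctuation with a space instead of deleting it keeps tokens such as `41` and `37` apart, at the cost of splitting decimal numbers.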

#### Tokenizing the sentences

In [None]:
#portuguese_tokenizer = nltk.data.load('tokenizers/punkt/portuguese.pickle')

sentencesWithTokens = [nltk.word_tokenize(sentence.lower()) for sentence in sentences]

#### Deleting Stop Words

In [None]:
stop_words = stopwords.words('portuguese')

sentencesWithoutStopWords = []

# Walk through the tokenized headlines and rebuild each one without stop words
for row in sentencesWithTokens:
    sentence = ""
    for word in row:
        if word not in stop_words:
            sentence = sentence + " " + str(word)
    sentencesWithoutStopWords.append(sentence)
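
The filtering loop above can be tried on one headline. In this sketch the stop list is a small hand-picked stand-in for `stopwords.words('portuguese')` (the real list is much longer), and `remove_stop_words` is a hypothetical helper.

```python
# Hypothetical stand-in for stopwords.words('portuguese')
sample_stop_words = {"de", "a", "o", "e", "em", "com", "no"}

def remove_stop_words(tokens, stop_words):
    """Drop stop words and rebuild the sentence as a single string."""
    return " ".join(token for token in tokens if token not in stop_words)

sample_tokens = ["jovens", "estudam", "com", "medo"]
filtered = remove_stop_words(sample_tokens, sample_stop_words)
# filtered == "jovens estudam medo"
```

Holding the stop words in a `set` makes each membership test constant time, and `" ".join` avoids the leading space that string concatenation leaves on every sentence.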

In [None]:
#sentencesWithoutStopWords
#nltk.corpus.mac_morpho.words()
#tags = nltk.pos_tag(sentencesWithTokens[1])

In [None]:
#nltk.corpus.mac_morpho.words()

#tsents = nltk.corpus.mac_morpho.tagged_sents()
#train = tsents[100:]
#test = tsents[:100]


## Classification - Using Naive Bayes
Source: [Md Repo](https://github.com/minerandodados/mdrepo)

The dataset used to train the classifier consists of 9817 tweets labeled "Positivo", "Neutro", or "Negativo".

In [None]:
datasetLabeled = pd.read_csv('Tweets_Mg.csv',encoding='utf-8')

In [None]:
datasetLabeled[datasetLabeled.Classificacao=='Neutro'].count()

In [None]:
datasetLabeled[datasetLabeled.Classificacao=='Positivo'].count()

In [None]:
datasetLabeled[datasetLabeled.Classificacao=='Negativo'].count()

In [None]:
tweets = datasetLabeled['Text'].values
classes = datasetLabeled['Classificacao'].values

### Multinomial Naive Bayes

> The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

#### Sentences

In [None]:
vectorizer = CountVectorizer(analyzer="word")
freq_tweets = vectorizer.fit_transform(tweets)
model = MultinomialNB()
model.fit(freq_tweets,classes)
freq_tests = vectorizer.transform(sentences)
prediction1 = model.predict(freq_tests)
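
What `fit_transform` and `transform` do conceptually can be sketched without scikit-learn. The helper names and toy texts below are hypothetical, and the whitespace tokenization is a simplification: the real `CountVectorizer` lowercases by default and, with its default token pattern, keeps only tokens of two or more alphanumeric characters.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Map every word seen in the training documents to a column index."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def to_count_vector(document, vocabulary):
    """Count vocabulary words in a document; out-of-vocabulary words are dropped."""
    counts = Counter(document.lower().split())
    # dicts preserve insertion order, so positions match the column indices
    return [counts[word] for word in vocabulary]

# Hypothetical training tweets and test headline
train_texts = ["governo bom bom", "governo ruim"]
vocabulary = fit_vocabulary(train_texts)   # {"bom": 0, "governo": 1, "ruim": 2}
headline_vector = to_count_vector("governo corta verba", vocabulary)
# headline_vector == [0, 1, 0]
```

Only words already seen in the tweets reach the classifier: `corta` and `verba` are silently ignored, which is one way a mismatch between the training and test domains can weaken the predictions.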

#### Sentences Without Stopwords

In [None]:
vectorizer = CountVectorizer(analyzer="word")
freq_tweets = vectorizer.fit_transform(tweets)
model = MultinomialNB()
model.fit(freq_tweets,classes)
freq_tests = vectorizer.transform(sentencesWithoutStopWords)
prediction2 = model.predict(freq_tests)

### Results: Sentence Valence

In [None]:
for i in range(500):
    print("'" + sentences[i] + "' - valence: " + prediction1[i])

In [None]:
datasetInFrame['Valência'] = pd.Series(prediction1, index=datasetInFrame.index)
#datasetInFrame

### Results: Sentences Without Stop Words

In [None]:
for i in range(500):
    print("'" + sentencesWithoutStopWords[i] + "' - valence: " + prediction2[i])

### Differences With and Without Stop Words

In [None]:
#print(prediction1 == prediction2)
for i in range(500):
    if (prediction1[i] != prediction2[i]) : 
        print("Description: Sentence and Sentence2 (WithoutStopWords)")
        print("Sentence1:"  + sentences[i] + ": "+ prediction1[i])
        print("Sentence2:"  + sentencesWithoutStopWords[i] + ": "+ prediction2[i])
        print("")
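
Beyond printing each differing pair, the disagreement can be summarized numerically. A minimal sketch, assuming two equally long label sequences; `summarize_disagreements` and the sample labels are hypothetical and stand in for `prediction1` and `prediction2`.

```python
from collections import Counter

def summarize_disagreements(labels_a, labels_b):
    """Return the disagreement rate and a count of each (label_a, label_b) change."""
    changes = Counter((a, b) for a, b in zip(labels_a, labels_b) if a != b)
    rate = sum(changes.values()) / len(labels_a)
    return rate, changes

# Hypothetical predictions for four headlines
with_stop_words = ["Neutro", "Positivo", "Neutro", "Negativo"]
without_stop_words = ["Neutro", "Neutro", "Negativo", "Negativo"]
rate, changes = summarize_disagreements(with_stop_words, without_stop_words)
# rate == 0.5
# changes == {("Positivo", "Neutro"): 1, ("Neutro", "Negativo"): 1}
```

The counter shows the direction in which labels move when stop words are removed, which is harder to see in a long printed list.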

# Discussion and Conclusion

Evaluating the results, I disagree with most of the labels. The classifier mislabeled even sentences containing words that clearly express some "sentiment":

* "cortes e déficit" -> neutro
* "violência no rio: assassinatos sobem 41 37%" -> neutro
* "jovens estudam com medo" -> positivo


Some aspects to consider:
* The classifier does not consider the relationship between words, which is probably a limitation.
* The training dataset from the example I selected does not reflect the same context as the test data (Twitter vs. newspaper headlines).
* Comparing the results for Sentences and Sentences Without StopWords, some predictions improve when stop words are removed, so removing them from the training dataset as well may be an option.

Future improvements (in my view):
* Improve the preprocessing steps: some special characters were not handled, and some sentences are incomplete;
* Consider other corpora and extend them; for this implementation, the test dataset was not enough;
* Explore and analyze the training dataset more thoroughly;
* Use POS tagging;
* Explore n-grams to capture the relationship among words in a sentence;
* Explore other classifiers ([Nlpnet](http://nilc.icmc.usp.br/nlpnet/) looks interesting);
* There are some interesting labeled training datasets in English; one option is to translate and explore them.
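
As an illustration of the n-gram idea in the list above, the following sketch extracts word bigrams from a headline. `word_ngrams` is a hypothetical helper; in scikit-learn the same effect is available through the `ngram_range` parameter of `CountVectorizer`, for example `ngram_range=(1, 2)` for unigrams plus bigrams.

```python
def word_ngrams(text, n=2):
    """Return the list of n-word sequences in a sentence."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = word_ngrams("jovens estudam com medo")
# bigrams == ["jovens estudam", "estudam com", "com medo"]
```

A feature such as `com medo` lets the classifier weigh the word pair as a unit, which single-word counts cannot do.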


# References

[Python Text Processing with NLTK 2.0 Cookbook](https://www.amazon.com.br/Python-Text-Processing-Nltk-Cookbook/dp/1849513600)

[Blog Chapagain](http://blog.chapagain.com.np/machine-learning-sentiment-analysis-text-classification-using-python-nltk/)

[Blog iMasters](https://imasters.com.br/desenvolvimento/analise-de-sentimentos-aprenda-de-uma-vez-por-todas-como-funciona-utilizando-dados-do-twitter/?trace=1519021197&source=single)
