# Sentiment analysis using sklearn and spaCy

**Author:** Roberto Muñoz <br />
**E-mail:** <rmunoz@uc.cl> <br />
**Github:** <https://github.com/rpmunoz> <br />

Install the spaCy library:
```
pip install spacy
```

Download the pretrained English and Spanish models from the spaCy website, <https://spacy.io/usage/models>:

```
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
```
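spaCy models downloaded this way are installed as regular Python packages, so one way to confirm the download worked is to check whether each package can be found. The following is a minimal sketch using only the standard library; the helper name `missing_models` is hypothetical and not part of spaCy.

```python
import importlib.util


def missing_models(model_names):
    """Return the names in model_names that are not installed as packages."""
    return [name for name in model_names
            if importlib.util.find_spec(name) is None]


# An empty list means both models were found in the current environment
print(missing_models(["en_core_web_sm", "es_core_news_sm"]))
```

If the list is not empty, rerun the corresponding `python -m spacy download` command in the same environment that the notebook kernel uses.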

In [1]:
import os
import spacy
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams.update({'font.size': 16})
pd.set_option('display.max_columns', None)

In [2]:
dataDir='data'
resultsDir='results'

# Create the results directory if it does not exist yet
os.makedirs(resultsDir, exist_ok=True)

In [34]:
dataFile='Reviews IMDB-Movie-Data.csv'
dataFile=os.path.join(dataDir, dataFile)

dataDF=pd.read_csv(dataFile, header=0)
dataDF.head()

Unnamed: 0,review,valoracion
0,films adapted from comic books have had plent...,Positiva
1,every now and then a movie comes along from a...,Positiva
2,you ve got mail works alot better than it des...,Positiva
3,jaws is a rare film that grabs your atte...,Positiva
4,moviemaking is a lot like being the general m...,Positiva


In [35]:
dataDF.groupby('valoracion').count()

valoracion,review
Negativa,1000
Positiva,1000


In [21]:
# Print the first three reviews of each rating group
for name, group in dataDF.groupby('valoracion'):
    print("Group with rating: {}\n".format(name))
    for idx, row in group.iloc[0:3].iterrows():
        print("*** {}\n".format(row['review']))

    print("-"*20)

Group with rating: Negativa

***  plot   two teen couples go to a church party   drink and then drive    they get into an accident    one of the guys dies   but his girlfriend continues to see him in her life   and has nightmares    what s the deal    watch the movie and   sorta   find out        critique   a mind fuck movie for the teen generation that touches on a very cool idea   but presents it in a very bad package    which is what makes this review an even harder one to write   since i generally applaud films which attempt to break the mold   mess with your head and such   lost highway & memento     but there are good and bad ways of making all types of films   and these folks just didn t snag this one correctly    they seem to have taken this pretty neat concept   but executed it terribly    so what are the problems with the movie    well   its main problem is that it s simply too jumbled    it starts off   normal   but then downshifts into this   fantasy   world in which you

In [9]:
# Utility for splitting the data set into training and test subsets
from sklearn.model_selection import train_test_split

In [None]:
# Features (review text) and labels (rating), using the columns shown by dataDF.head()
X = dataDF['review']
ylabels = dataDF['valoracion']

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.2, random_state=42)
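The split above holds out 20% of the reviews for testing and fixes `random_state` so the result is reproducible. Because the two rating classes are balanced, it may be worth keeping that balance in both subsets. The sketch below shows one way to do that with the `stratify` argument of `train_test_split`. The toy `texts` and `labels` lists are hypothetical stand-ins for the real columns, and this is a variant of the notebook's call, not the call it actually uses.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the review texts and their ratings
texts = ["review {}".format(i) for i in range(10)]
labels = ["Positiva"] * 5 + ["Negativa"] * 5

# stratify=labels keeps the class proportions equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("train:", dict(train_counts))
print("test:", dict(test_counts))
```

With ten balanced examples and `test_size=0.2`, the test subset gets one review of each class and the training subset gets four of each.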

## Preprocessing the data with spaCy and training a machine learning model with sklearn

In this stage, the spaCy package for Python is used to lemmatize the review text and remove stop words from it.

In [7]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
nlp = spacy.load('en_core_web_sm')

# Build a list of stop words for filtering
stopwords = list(STOP_WORDS)
print("List of stop words\n")
print(stopwords)

List of stop words

['perhaps', 'whereas', 'towards', 'amount', 'least', 'say', 'we', 'herself', 'and', 'sixty', 'ever', 'may', 'often', 'along', 'cannot', 'nobody', 'must', 'whereby', 'n’t', 'latter', 'neither', 'own', 'see', 'nevertheless', 'herein', 'why', 'our', 'n‘t', 'whose', 'with', 'four', 'against', 'latterly', 'seem', 'across', "'ll", 'go', '’d', 'ca', 'two', 'already', 'her', 'side', 'seems', 'please', 'since', 'anywhere', 'between', 'beside', 'its', 'from', 'hereby', 'otherwise', 'thus', 'they', 'fifteen', 'enough', 'became', 'there', '‘d', 'somewhere', 'three', 'him', 'might', 'no', 'on', 'unless', 'mostly', 'those', 'sometime', 'what', 'anything', 'an', 'nothing', 'call', 'was', 'never', 're', '’m', 'thereafter', 'done', 'elsewhere', 'still', 'fifty', 'however', 'forty', 'through', 'per', 'eleven', 'former', 'part', 'six', 'while', 'something', 'around', 'can', 'to', 'his', 'ten', 'doing', 'else', 'then', 'much', 'rather', 'now', 'further', 'are', '

In [8]:
import string
from spacy.lang.en import English

punctuations = string.punctuation
# Create a spaCy English parser (tokenizer only, no trained pipeline components)
parser = English()
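
With the parser and the punctuation string in place, a tokenizer function can filter the reviews. The block below is a minimal sketch, not the notebook's own implementation. The function name `spacy_tokenizer` is hypothetical, and the block only lowercases tokens and drops stop words and punctuation. Lemmatization would additionally need the trained `en_core_web_sm` pipeline and is left out here.

```python
import string

from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

# Tokenizer-only English pipeline; no model download is needed for this step
parser = English()
punctuation_chars = set(string.punctuation)


def spacy_tokenizer(sentence):
    """Lowercase the tokens and drop stop words, punctuation and empty strings."""
    words = [token.text.lower().strip() for token in parser(sentence)]
    return [
        word for word in words
        if word and word not in STOP_WORDS and word not in punctuation_chars
    ]


print(spacy_tokenizer("This movie is a great film!"))
```

A function of this shape could later be passed as the `tokenizer` argument of an sklearn text vectorizer such as `CountVectorizer`.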