# Exploratory Data Analysis - Spanish Movie Reviews
**By Zach Friedman**, zacheryfriedman@my.unt.edu

- Get most frequent sentiment words
- Visualize sentiment words in English translations (need sentiment word dictionary)
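
The first goal could be approached roughly as follows. This is a minimal sketch, not the notebook's implementation: `SENTIMENT_WORDS` is a hypothetical mini-lexicon and `top_sentiment_words` is an illustrative helper name. It assumes each review has already been tokenized into lowercase words.

```python
from collections import Counter

# Hypothetical mini-lexicon for illustration only; a real run would need a
# full sentiment word dictionary.
SENTIMENT_WORDS = {"wonderful", "masterful", "great", "bad", "idiotic"}

def top_sentiment_words(token_lists, lexicon, n=2):
    """Count lexicon words across tokenized reviews and return the n most common."""
    counts = Counter(
        tok for tokens in token_lists for tok in tokens if tok in lexicon
    )
    return counts.most_common(n)

# Toy tokenized reviews standing in for the real data.
reviews = [
    ["wonderful", "little", "production", "masterful", "production"],
    ["bad", "plot", "bad", "dialogue", "bad", "acting", "idiotic"],
    ["wonderful", "way", "spend", "time"],
]
top = top_sentiment_words(reviews, SENTIMENT_WORDS)
print(top)
```

One possible source for the dictionary is the `lexicon` attribute of NLTK's `SentimentIntensityAnalyzer`, though whether it suits the Spanish side of the data is an open question.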

In [135]:
import pandas as pd
import regex as re
import spacy
import nltk

from nltk.tokenize import WordPunctTokenizer, sent_tokenize
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('sentiwordnet')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('vader_lexicon')

[nltk_data] Downloading package punkt to /home/zach/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/zach/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package sentiwordnet to
[nltk_data]     /home/zach/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!
[nltk_data] Downloading package wordnet to /home/zach/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/zach/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/zach/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [136]:
df = pd.read_csv('./IMDB_Dataset_SPANISH.csv')
df.shape

(50000, 5)

In [37]:
df.head()

| | Unnamed: 0 | review_en | review_es | sentiment | sentimiento |
|---|---|---|---|---|---|
| 0 | 0 | One of the other reviewers has mentioned that ... | Uno de los otros críticos ha mencionado que de... | positive | positivo |
| 1 | 1 | A wonderful little production. The filming tec... | Una pequeña pequeña producción.La técnica de f... | positive | positivo |
| 2 | 2 | I thought this was a wonderful way to spend ti... | Pensé que esta era una manera maravillosa de p... | positive | positivo |
| 3 | 3 | Basically there's a family where a little boy ... | Básicamente, hay una familia donde un niño peq... | negative | negativo |
| 4 | 4 | Petter Mattei's "Love in the Time of Money" is... | El "amor en el tiempo" de Petter Mattei es una... | positive | positivo |


In [38]:
# Drop the leftover index column; the keyword form avoids the
# deprecated positional `axis` argument.
df = df.drop(columns=['Unnamed: 0'])


In [39]:
df.isnull().sum()

review_en      0
review_es      0
sentiment      0
sentimiento    0
dtype: int64

A first look at the data shows a punctuation error in the second sample (row 1):
- The space between consecutive sentences is missing (e.g. `producción.La`).
- A regex can be used to find out how many reviews contain this kind of error.
- A tokenizer can probably handle this easily, but the count may still be useful to know.
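
The counting idea above can be sketched on toy strings. This is an illustration under stated assumptions: `MISSING_SPACE` is an assumed pattern (lowercase letter, sentence-ending punctuation, then an uppercase letter or an opening `¡`/`¿`), and the three sample strings are made up to stand in for `review_es`.

```python
import re

# Assumed pattern, not taken from the notebook: a lowercase letter, then
# '.', '!' or '?', then an uppercase letter or an opening ¡/¿, with no
# whitespace in between.
MISSING_SPACE = re.compile(r"[a-záéíóúüñ][.!?][A-ZÁÉÍÓÚÜÑ¡¿]")

# Toy stand-ins for entries of df['review_es'].
sample = [
    "Una pequeña producción.La técnica es antigua.Los actores son buenos.",
    "Esta es tu comedia típica. Casi no hay chistes.",
    "¡Tiene todas las voces!Realmente puede ver la edición.",
]

# Number of missing-space instances in each review.
per_review = [len(MISSING_SPACE.findall(text)) for text in sample]
# Reviews with at least one instance, and instances overall.
n_reviews = sum(1 for count in per_review if count > 0)
n_instances = sum(per_review)
print(per_review, n_reviews, n_instances)
```

On the real DataFrame, `df['review_es'].str.count(...)` with the same pattern string should give the per-review counts without an explicit loop.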

In [41]:
es = df.iloc[1]['review_es']
en = df.iloc[1]['review_en'] 
print(f"ES:\n{es}\n")
print(f"EN:\n{en}\n")

ES:
Una pequeña pequeña producción.La técnica de filmación es muy incuestionable, muy antigua, la moda de la BBC y le da una sensación de realismo reconfortante, y, a veces, incómodo, y, a veces, a la pieza.Los actores son extremadamente bien elegidos, Michael Sheen, no solo "tiene todo el polari", ¡pero tiene todas las voces por palmaditas!Realmente puede ver la edición perfecta guiada por las referencias a las entradas del diario de Williams, no solo vale la pena la observación, pero es una pieza imperrementemente escrita y realizada.Una producción magistral sobre uno de los grandes maestros de la comedia y su vida.El realismo realmente llega a casa con las pequeñas cosas: la fantasía del guardia que, en lugar de usar las técnicas de "sueño" tradicionales permanece sólido, entonces desaparece.Se desempeña nuestro conocimiento y nuestros sentidos, particularmente con las escenas relacionadas con Orton y Halliwell y los conjuntos (particularmente de su apartamento con murales de Halliw

In [42]:
# Lowercase letter, any non-whitespace character, uppercase letter.
# No capture group: str.contains warns when the pattern contains groups.
pattern = r'[a-z]\S[A-Z]'
mask = df['review_es'].str.contains(pattern)

filtered_df = df[mask]
filtered_df


| | review_en | review_es | sentiment | sentimiento |
|---|---|---|---|---|
| 1 | A wonderful little production. The filming tec... | Una pequeña pequeña producción.La técnica de f... | positive | positivo |
| 2 | I thought this was a wonderful way to spend ti... | Pensé que esta era una manera maravillosa de p... | positive | positivo |
| 3 | Basically there's a family where a little boy ... | Básicamente, hay una familia donde un niño peq... | negative | negativo |
| 5 | Probably my all-time favorite movie, a story o... | Probablemente mi película favorita de todos lo... | positive | positivo |
| 6 | I sure would like to see a resurrection of a u... | Seguro que me gustaría ver una resurrección de... | positive | positivo |
| ... | ... | ... | ... | ... |
| 49993 | Robert Colomb has two full-time jobs. He's kno... | Robert Colomb tiene dos trabajos a tiempo comp... | negative | negativo |
| 49994 | This is your typical junk comedy.There are alm... | Esta es tu comedia de chatarra típica. Casi no... | negative | negativo |
| 49995 | I thought this movie did a down right good job... | Pensé que esta película hizo un buen trabajo a... | positive | positivo |
| 49996 | Bad plot, bad dialogue, bad acting, idiotic di... | Mala parcela, mal diálogo, mala actuación, dir... | negative | negativo |
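
Because `\S` matches any non-whitespace character, the `[a-z]\S[A-Z]` expression also fires on things like hyphenated names, which may explain why so many rows pass the filter. The sketch below compares it with a stricter alternative; `strict` is an assumption for illustration, and the test strings are made up.

```python
import re

broad = re.compile(r"[a-z]\S[A-Z]")  # the expression used in the filter
# Assumed stricter variant: only sentence-ending punctuation may sit
# between the lowercase and the uppercase letter.
strict = re.compile(r"[a-záéíóúüñ][.!?][A-ZÁÉÍÓÚÜÑ¡¿]")

examples = [
    "producción.La técnica",   # real missing space
    "Spider-Man regresa",      # hyphenated name, not a punctuation error
    "típica. Casi no",         # correctly spaced
]

# For each string: (matched by broad, matched by strict).
results = [
    (bool(broad.search(text)), bool(strict.search(text))) for text in examples
]
for text, flags in zip(examples, results):
    print(f"{text!r}: broad={flags[0]}, strict={flags[1]}")
```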


In [121]:
x = filtered_df.iloc[0]['review_en']
print(x)

A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.


In [130]:
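# Build the stopword set once; calling stopwords.words() for every token is slow.
STOPWORDS = set(stopwords.words('english'))
# Reuse a single VADER analyzer instead of creating one per call.
sia = SentimentIntensityAnalyzer()

def preprocess(doc):
    """Tokenize a review into lowercase alphabetic tokens without stopwords."""
    tokenizer = WordPunctTokenizer()
    porter = PorterStemmer()

    tokens = tokenizer.tokenize(doc)
    tokens = [word.lower() for word in tokens if word.isalpha()]
    tokens = [word for word in tokens if word not in STOPWORDS]
    # Stemming is currently disabled.
    #tokens = [porter.stem(token) for token in tokens]

    return tokens

def analyze(doc):
    """Return VADER polarity scores (neg, neu, pos, compound) for a string."""
    return sia.polarity_scores(doc)

def sentence_tokenizer(doc):
    """Split a review into sentences, drop stopwords, and lowercase the rest."""
    sentences = sent_tokenize(doc)

    # The stopword check runs before lowercasing, so capitalized stopwords
    # such as "The" or "A" are kept (and then lowercased).
    sentences = [[word.lower() for word in sent.split() if word not in STOPWORDS] for sent in sentences]

    return [" ".join(sent) for sent in sentences]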


In [132]:
y = sentence_tokenizer(x)
print(y)

['a wonderful little production.', 'the filming technique unassuming- old-time-bbc fashion gives comforting, sometimes discomforting, sense realism entire piece.', 'the actors extremely well chosen- michael sheen "has got polari" voices pat too!', "you truly see seamless editing guided references williams' diary entries, well worth watching terrificly written performed piece.", "a masterful production one great master's comedy life.", "the realism really comes home little things: fantasy guard which, rather use traditional 'dream' techniques remains solid disappears.", "it plays knowledge senses, particularly scenes concerning orton halliwell sets (particularly flat halliwell's murals decorating every surface) terribly well done."]


In [134]:
for sent in y:
    print(analyze(sent))

{'neg': 0.0, 'neu': 0.351, 'pos': 0.649, 'compound': 0.5719}
{'neg': 0.15, 'neu': 0.694, 'pos': 0.156, 'compound': 0.0258}
{'neg': 0.0, 'neu': 0.817, 'pos': 0.183, 'compound': 0.3989}
{'neg': 0.0, 'neu': 0.67, 'pos': 0.33, 'compound': 0.7096}
{'neg': 0.0, 'neu': 0.431, 'pos': 0.569, 'compound': 0.765}
{'neg': 0.12, 'neu': 0.8, 'pos': 0.08, 'compound': -0.2023}
{'neg': 0.146, 'neu': 0.688, 'pos': 0.166, 'compound': -0.128}
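
One way to turn these per-sentence scores into a single review-level score is to average the `compound` values. This is a sketch of one possible aggregation, not something the notebook does; `label_from_compound` is an illustrative helper, and the 0.05 cutoff is an assumed convention for separating positive, neutral, and negative.

```python
from statistics import mean

# Compound scores copied from the per-sentence output above.
compound_scores = [0.5719, 0.0258, 0.3989, 0.7096, 0.765, -0.2023, -0.128]

def label_from_compound(score, threshold=0.05):
    """Map a compound score to a label; the cutoff is an assumed convention."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

review_score = mean(compound_scores)
review_label = label_from_compound(review_score)
print(round(review_score, 4), review_label)
```

For this review the averaged score lands on the same side as the dataset's `positive` label, even though two of the seven sentences score slightly negative.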
