# 1. Prepare the Data

In [7]:
import pandas as pd
import numpy as np
from sklearn import linear_model

## 1.1 Read the Dataset

- Input (x) -> Review text (review)
- Output (y) -> Sentiment (sentiment)

In [8]:
df_review = pd.read_csv('IMDB Dataset.csv')
df_review

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [9]:
df_review.value_counts('sentiment')

sentiment
negative    25000
positive    25000
Name: count, dtype: int64

### 1.1.1 Reduce the number of rows to keep training simple

In [10]:
df_positivo = df_review[df_review['sentiment']=='positive'][:9000]
df_negativo = df_review[df_review['sentiment']=='negative'][:1000]

df_review_des = pd.concat([df_positivo, df_negativo])
df_review_des

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
...,...,...
2000,Stranded in Space (1972) MST3K version - a ver...,negative
2005,"I happened to catch this supposed ""horror"" fli...",negative
2007,waste of 1h45 this nasty little film is one to...,negative
2010,Warning: This could spoil your movie. Watch it...,negative


In [11]:
df_review_des.value_counts('sentiment')

sentiment
positive    9000
negative    1000
Name: count, dtype: int64

## 1.2 Imbalanced Dataset

In [42]:
from imblearn.under_sampling import RandomUnderSampler

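# Undersample the majority class so both classes end up with 1000 reviews.
# No random_state is set, so the sampled rows change from run to run.
rus = RandomUnderSampler()
df_review_bal, y_bal = rus.fit_resample(df_review_des[['review']], df_review_des['sentiment'])
df_review_bal['sentiment'] = y_bal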

df_review_bal

Unnamed: 0,review,sentiment
3,Basically there's a family where a little boy ...,negative
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
10,Phil the Alien is one of those quirky films wh...,negative
11,I saw this movie when I was about 12 when it c...,negative
...,...,...
3254,Sergeant Ryker is accused of being a traitor d...,positive
1067,I've never laughed and giggled so much in my l...,positive
15985,I saw it in Europe-plex. Great movie!! <br /><...,positive
6913,"Shamefully, before I saw this film, I was unfa...",positive


In [22]:
df_review_bal.value_counts(['sentiment'])

sentiment
negative     1000
positive     1000
Name: count, dtype: int64

- The data is now balanced: we have 2000 reviews, 1000 with negative sentiment and 1000 with positive sentiment
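The same balancing step can also be sketched with pandas alone. This is a minimal illustration on a hypothetical toy frame (`toy` and `toy_bal` are not names from the notebook), assuming a pandas version that provides `GroupBy.sample`; it is not the method used above.

```python
import pandas as pd

# Toy imbalanced frame standing in for df_review_des (9 positive, 3 negative).
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(12)],
    "sentiment": ["positive"] * 9 + ["negative"] * 3,
})

# Size of the smallest class; every class is cut down to this many rows.
n_min = int(toy["sentiment"].value_counts().min())

# Draw n_min rows per class; random_state makes the draw repeatable.
toy_bal = toy.groupby("sentiment").sample(n=n_min, random_state=42)

print(toy_bal["sentiment"].value_counts())
```

Fixing `random_state` is the main design choice here: it keeps the balanced subset identical across runs, which makes later scores comparable.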

## 1.3 Splitting the data into training (train) and test (test) sets

In [23]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df_review_bal, test_size=0.33, random_state=42)

In [24]:
train_x, train_y = train['review'], train['sentiment']
test_x, test_y = test['review'], test['sentiment']
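A plain random split does not guarantee equal class proportions in both parts. If that matters, `train_test_split` accepts a `stratify` argument. The sketch below uses a hypothetical toy frame rather than `df_review_bal`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy balanced frame standing in for df_review_bal (50 rows per class).
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(100)],
    "sentiment": ["positive"] * 50 + ["negative"] * 50,
})

# stratify keeps the positive/negative ratio identical in train and test.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["sentiment"]
)

print(toy_test["sentiment"].value_counts())
```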

# 2. Transform text data into numeric data

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
train_x_vector = tfidf.fit_transform(train_x)

test_x_vector = tfidf.transform(test_x)

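- fit learns the vocabulary and the IDF weights from the training reviews; transform applies them to turn text into TF-IDF vectors. The test set is only transformed, so it is encoded with the vocabulary learned from the training data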
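A tiny illustration of the fit/transform split, using made-up sentences and the default vectorizer settings (no stop-word list, unlike the cell above). Words that appear only in the test text are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["a wonderful film", "a terrible film"]
test_docs = ["a wonderful surprise"]

toy_tfidf = TfidfVectorizer()
toy_train_vec = toy_tfidf.fit_transform(train_docs)  # learns vocabulary and IDF weights
toy_test_vec = toy_tfidf.transform(test_docs)        # reuses them, learns nothing new

# The default tokenizer drops one-letter tokens such as "a".
print(sorted(toy_tfidf.vocabulary_))
# Same number of columns as the training matrix.
print(toy_test_vec.shape)
```

Because "surprise" was never seen during fit, it gets no column, and the test vector only has a non-zero weight for "wonderful".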

In [26]:
train_x_vector

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 113861 stored elements and shape (1340, 19985)>

- the matrix has 1340 reviews (rows) and 19985 distinct words (columns)
- sparse matrix = a matrix in which most entries are 0
- of the roughly 26.8 million cells in the matrix, only 113861 hold non-zero values
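The sparsity can be checked with simple arithmetic, using the shape and stored-element count printed for `train_x_vector`:

```python
n_rows, n_cols = 1340, 19985   # shape reported for train_x_vector
n_stored = 113861              # stored (non-zero) elements reported

n_cells = n_rows * n_cols      # total number of cells in the matrix
density = n_stored / n_cells   # fraction of cells that hold a value

print(f"cells: {n_cells:,}")
print(f"density: {density:.4%}")
```

Well under 1% of the cells are filled, which is why scikit-learn keeps the result in a sparse format instead of a dense array.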

# 3. Model Selection

Machine learning algorithms

1. Supervised learning: regression (numeric output), classification (discrete output)

- Input: Review
- Output: Sentiment (discrete)

2. Unsupervised learning

## 3.1 Classification Models

### 3.1.1 Support Vector Machines (SVM)

In [27]:
from sklearn.svm import SVC

svc = SVC(kernel='linear')
svc.fit(train_x_vector, train_y)

#### 3.1.1.1 Testing the SVM

In [28]:
print(svc.predict(tfidf.transform(['A good movie'])))  # clearly positive wording
print(svc.predict(tfidf.transform(['An excellent movie'])))  # strongly positive wording
print(svc.predict(tfidf.transform(['"I did not like this movie at all I gave this movie away"'])))  # the reviewer did not like it

['positive']
['positive']
['negative']


- this classification model gives reasonable results on these three sentences
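Calling `tfidf.transform` before every prediction is easy to forget. One possible alternative, sketched here on a hypothetical toy corpus (not the IMDB reviews), is to chain the vectorizer and the classifier with `make_pipeline`, so that raw strings can be passed straight to `predict`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus; the notebook trains on train_x / train_y instead.
toy_reviews = [
    "a wonderful and moving film",
    "excellent acting and a great story",
    "a terrible and boring film",
    "awful acting and a dull story",
]
toy_labels = ["positive", "positive", "negative", "negative"]

# The pipeline vectorizes the text and then feeds the vectors to the SVM.
text_clf = make_pipeline(TfidfVectorizer(stop_words="english"), SVC(kernel="linear"))
text_clf.fit(toy_reviews, toy_labels)

prediction = text_clf.predict(["a great film"])
print(prediction)
```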

### 3.1.2 Decision Tree

In [29]:
from sklearn.tree import DecisionTreeClassifier

dec_tree = DecisionTreeClassifier()
dec_tree.fit(train_x_vector, train_y)

### 3.1.3 Naive Bayes

In [30]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(train_x_vector.toarray(), train_y)

### 3.1.4 Logistic Regression

In [31]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(train_x_vector, train_y)

# 4. Model Evaluation

## 4.1 Score (Accuracy)

In [32]:
print(svc.score(test_x_vector, test_y))
print(dec_tree.score(test_x_vector, test_y))
print(gnb.score(test_x_vector.toarray(), test_y))
print(lr.score(test_x_vector, test_y))

0.8333333333333334
0.6272727272727273
0.646969696969697
0.8287878787878787


- SVC and LR reach almost the same accuracy (about 0.83), well ahead of the decision tree (0.63) and Gaussian Naive Bayes (0.65). For a new sentence, either of the first two is therefore more likely to predict the correct sentiment

## 4.2 F1 Score

F1 Score = 2(Recall * Precision) / (Recall + Precision)

In [33]:
from sklearn.metrics import f1_score

f1_score(test_y, svc.predict(test_x_vector), labels=['positive', 'negative'], average=None)
    

array([0.83870968, 0.82758621])

We use the SVC model.
- The F1 score combines precision and recall for each class, so unlike accuracy (the previous method) it is not inflated by a dominant class. That makes it the safer metric when the data is imbalanced. Our data was balanced by undersampling, so here both metrics agree closely (about 0.83)
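To see why F1 is more informative than accuracy under class imbalance, consider a hypothetical test set (not this notebook's data) and a classifier that always answers "positive". The helper `f1_for` is written here only for illustration.

```python
# Hypothetical imbalanced test set: 90 positive and 10 negative reviews.
y_true = ["positive"] * 90 + ["negative"] * 10
# A useless classifier that always answers "positive".
y_pred = ["positive"] * 100

def f1_for(label, y_true, y_pred):
    """F1 score of one class, computed from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no correct prediction for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_negative = f1_for("negative", y_true, y_pred)
f1_positive = f1_for("positive", y_true, y_pred)

print(accuracy)      # 0.9
print(f1_negative)   # 0.0
```

Accuracy looks good at 0.9, while the F1 score of the negative class is 0 because not a single negative review is detected.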

## 4.3 Classification Report

In [34]:
from sklearn.metrics import classification_report

print(classification_report(test_y, svc.predict(test_x_vector), labels=['positive', 'negative']) )

              precision    recall  f1-score   support

    positive       0.82      0.85      0.84       335
    negative       0.84      0.81      0.83       325

    accuracy                           0.83       660
   macro avg       0.83      0.83      0.83       660
weighted avg       0.83      0.83      0.83       660



## 4.4 Confusion Matrix

In [35]:
from sklearn.metrics import confusion_matrix

confusion_matrix(test_y, svc.predict(test_x_vector), labels=['positive', 'negative'])

array([[286,  49],
       [ 61, 264]], dtype=int64)

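 The four entries of the matrix sum to 660, the size of the test set. Rows are true labels and columns are predicted labels, both in the order given by labels:
- 286 = true positives (positive reviews predicted as positive)
- 49 = false negatives (positive reviews predicted as negative)
- 61 = false positives (negative reviews predicted as positive)
- 264 = true negatives (negative reviews predicted as negative)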
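As a cross-check, the metrics for the positive class can be recomputed by hand from the matrix, reading rows as true labels and columns as predicted labels (scikit-learn's convention):

```python
# Counts from confusion_matrix(..., labels=['positive', 'negative']).
matrix = [[286, 49],
          [61, 264]]

tp, fn = matrix[0]   # truly positive reviews: predicted positive / negative
fp, tn = matrix[1]   # truly negative reviews: predicted positive / negative

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# prints: 0.83 0.82 0.85 0.84
```

These values match the "positive" row and the accuracy line of the classification report.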

# 5. Model Optimization

## 5.1 GridSearchCV

We optimize the SVC model.

In [36]:
from sklearn.model_selection import GridSearchCV

parametros = {'C':[1,4,8,16,32,64,128], 'kernel':['linear','rbf']}
svc = SVC()
svc_grid = GridSearchCV(svc, parametros, cv=10)
svc_grid.fit(train_x_vector, train_y)

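- 'C' is the regularization (penalty) parameter: it sets how much misclassification error is tolerated. A small 'C' tolerates more errors in exchange for a wider margin; a large 'C' penalizes errors more heavily
- the kernel is the function used to map the data into the space where the classes are separated; we have to specify which type to use (linear, polynomial, RBF)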

In [37]:
print(svc_grid.best_estimator_)
print(svc_grid.best_params_)

SVC(C=1, kernel='linear')
{'C': 1, 'kernel': 'linear'}


- the best parameters are 'C' = 1 with the linear kernel

In [38]:
svc_grid.best_score_

0.8305970149253732

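The best mean cross-validated accuracy is 0.83, practically the same as the 0.833 test accuracy from section 4.1. Grid search selected C=1 with a linear kernel, which is the configuration already trained in section 3.1.1 (C defaults to 1), so the optimization did not change the model.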
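Note that `best_score_` is a mean over the cross-validation folds of the training data, so it is not measured on the same rows as a test score. The sketch below runs a small grid search on synthetic data from `make_classification` (an assumption made only to keep the example self-contained, not the IMDB features) and reads both numbers.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF features; not the IMDB data.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

grid = GridSearchCV(SVC(), {"C": [1, 4], "kernel": ["linear", "rbf"]}, cv=5)
grid.fit(X_train, y_train)

cv_accuracy = grid.best_score_               # mean accuracy over the CV folds
test_accuracy = grid.score(X_test, y_test)   # refit best model on held-out data

print(grid.best_params_)
print(cv_accuracy, test_accuracy)
```

In the notebook, the equivalent held-out check would be `svc_grid.score(test_x_vector, test_y)`.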