# Let's Predict the Sentiment
We employ machine learning to predict the sentiment of a review based on the words used in the review. We use logistic regression and evaluate its performance in a few different ways. These are some solid first models!

## Let's predict the sentiment!
<video controls src="video/video4_1.mp4" width=720></video>

## Logistic regression of movie reviews
### Exercise
In the video we learned that logistic regression is a common way to model a classification task, such as classifying the sentiment as positive or negative.

In this exercise, you will work with the `movies` reviews dataset. The `label` column stores the sentiment, which is `1` when the review is positive, and `0` when negative. The text review has been transformed, using BOW, to numeric columns.
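
To make the BOW transformation concrete, here is a minimal sketch using only the standard library. The two toy reviews are made up for illustration and are not taken from the `movies` dataset, and no stop words are removed here; it only shows how each distinct token becomes a numeric column of counts.

```python
from collections import Counter
import re

# Toy reviews (made up for illustration; not from the movies dataset)
docs = ["A great great movie", "Not a great plot"]

# Tokenize with a letters-only pattern, after lowercasing
tokenized = [re.findall(r"[A-Za-z]+", doc.lower()) for doc in docs]

# The vocabulary is the sorted set of all tokens; each token becomes one column
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Each review becomes a row of token counts, in vocabulary order
bow_rows = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in bow_rows:
    print(row)
```

Each row has one entry per vocabulary word, which is the same shape of input that the logistic regression receives as `X`.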

Your task is to build a logistic regression model using the `movies` dataset and calculate its accuracy.

### Instructions
+ Import the logistic regression function.
+ Create and fit a logistic regression on the labels `y` and the features `X`.
+ Calculate the accuracy of the logistic regression model, using the default `.score()` method.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

df_movies = pd.read_csv("IMDB_sample.csv")
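# Stop words are passed as a list, which recent scikit-learn releases expect
vect = CountVectorizer(max_features=200, token_pattern=r'[A-Za-z]+', stop_words=list(ENGLISH_STOP_WORDS))
vect.fit(df_movies.review)
vect_trans = vect.transform(df_movies.review)
# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names()
movies = pd.DataFrame(vect_trans.toarray(), columns=vect.get_feature_names_out())
movies["label"] = df_movies.label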
movies.head()

In [None]:
# Import the logistic regression
from sklearn.linear_model import LogisticRegression

# Define the vector of targets and matrix of features
y = movies.label
X = movies.drop('label', axis=1)

# Build a logistic regression model and calculate the accuracy
log_reg = LogisticRegression().fit(X, y)
print('Accuracy of logistic regression: ', log_reg.score(X, y))

Excellent work! You have built your first logistic regression model and checked its accuracy! Let's practice some more!

## Logistic regression using Twitter data
### Exercise
In this exercise, you will build a logistic regression model using the `tweets` dataset. The target is given by the `airline_sentiment`, which is `0` for negative tweets, `1` for neutral, and `2` for positive ones. So, in this case, you are given a multi-class classification task. Everything we learned about binary problems applies to multi-class classification problems as well.

You will evaluate the accuracy of the model using the two different approaches from the slides.

The logistic regression function and accuracy score have been imported for you.

### Instructions
+ Build and fit a logistic regression model using the defined `X` and `y` as arguments.
+ Calculate the accuracy of the logistic regression model.
+ Predict the labels.
+ Calculate the *accuracy score* using the predicted and true labels.

In [None]:
df_tweets = pd.read_csv("Tweets.csv")
df_tweets = df_tweets.loc[:, ["airline_sentiment", "airline_sentiment_confidence", "retweet_count", "airline", "text", "negativereason"]]
df_tweets = pd.get_dummies(df_tweets, columns = ["airline", "negativereason"])
#df_tweets.columns
df_tweets.rename(columns={'negativereason_Bad Flight': "Bad Flight", "negativereason_Can't Tell": "Can't Tell",
       'negativereason_Cancelled Flight': "Cancelled Flight",
       'negativereason_Customer Service Issue': "Customer Service Issue",
       'negativereason_Damaged Luggage': "Damaged Luggage",
       'negativereason_Flight Attendant Complaints': "Flight Attendant Complains",
       'negativereason_Flight Booking Problems': "Flight Booking Problems", 'negativereason_Late Flight': "Late Flight",
       'negativereason_Lost Luggage': "Lost Luggage", 'negativereason_longlines': "longlines"}, inplace=True)

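# Stop words are passed as a list, which recent scikit-learn releases expect
vect = CountVectorizer(max_features=100, token_pattern=r'[A-Za-z]+', stop_words=list(ENGLISH_STOP_WORDS))
vect.fit(df_tweets.text)
vect_trans = vect.transform(df_tweets.text)
# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names()
tweets = pd.DataFrame(vect_trans.toarray(), columns=vect.get_feature_names_out())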
tweets = pd.concat([df_tweets, tweets], axis=1)
tweets["airline_sentiment"] = tweets["airline_sentiment"].replace(["negative", "neutral", "positive"], [0, 1, 2])
tweets.drop("text", axis=1, inplace=True)
tweets.head()

In [None]:
from sklearn.metrics import accuracy_score

# Define the vector of targets and matrix of features
y = tweets.airline_sentiment
X = tweets.drop('airline_sentiment', axis=1)

# Build a logistic regression model and calculate the accuracy
log_reg = LogisticRegression().fit(X, y)
print('Accuracy of logistic regression: ', log_reg.score(X, y))

# Create an array of prediction
y_predict = log_reg.predict(X)

# Print the accuracy using accuracy score
print('Accuracy of logistic regression: ', accuracy_score(y, y_predict))

Great work! You have built another logistic regression model and calculated its accuracy in two different ways. Did you notice that the two accuracy scores are the same? This will not always be the case for other models, because the default metric returned by `.score()` depends on the estimator: for classifiers it is the accuracy, but for regressors it is R². So, use `accuracy_score()` to be certain that you are calculating the accuracy when you are training a different supervised learning model.
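
As a rough illustration of what the accuracy measures, the sketch below computes it by hand as the proportion of matching labels. The helper `accuracy` and the label lists are hypothetical stand-ins, not scikit-learn's implementation.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Hypothetical labels: 0 = negative, 1 = neutral, 2 = positive
y_true = [0, 0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 1, 2, 0, 1, 0, 2]

acc = accuracy(y_true, y_pred)
print("Accuracy:", acc)
```

Six of the eight hypothetical predictions match, so the accuracy is 0.75, regardless of how many classes there are.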

## Did we really predict the sentiment well?
<video controls src="video/video4_2.mp4" width=720></video>

## Build and assess a model: movies reviews
### Exercise
In this problem, you will build a logistic regression model using the `movies` dataset. The sentiment is stored in the `label` column and is `1` when the review is positive, and `0` when negative. The text review has been transformed, using BOW, to numeric columns.

You have already built a classifier but evaluated it using the same data employed in the training step. Make sure you now assess the model using an unseen test dataset. How does the performance of the model change when evaluated on the test set?
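
To see why accuracy on the training data can be overly optimistic, consider a deliberately extreme sketch: a hypothetical `Memorizer` class that stores every training row exactly and falls back to the majority label for anything it has not seen. This is not logistic regression, and the tiny datasets are made up; it only illustrates the gap between training and test performance.

```python
from collections import Counter

# Hypothetical tiny datasets of (features, label) pairs
train = [((1, 0), 1), ((0, 1), 0), ((1, 1), 1), ((0, 0), 0), ((2, 0), 1)]
test = [((1, 0), 1), ((2, 1), 1), ((0, 2), 0), ((2, 2), 0)]

class Memorizer:
    """Remembers training rows exactly; predicts the majority label otherwise."""

    def fit(self, rows):
        self.lookup = dict(rows)
        self.default = Counter(label for _, label in rows).most_common(1)[0][0]
        return self

    def predict(self, features):
        return self.lookup.get(features, self.default)

    def score(self, rows):
        # Accuracy: share of rows whose predicted label equals the true label
        return sum(self.predict(x) == y for x, y in rows) / len(rows)

model = Memorizer().fit(train)
train_acc = model.score(train)
test_acc = model.score(test)
print("Accuracy on train set:", train_acc)
print("Accuracy on test set:", test_acc)
```

The memorizer is perfect on the rows it has seen but only right half of the time on the unseen ones, which is why the test set gives the more honest estimate.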

### Instructions
+ Import the function required for a train/test split.
+ Perform the train/test split, specifying that 20% of the data should be used as a test set.
+ Train a logistic regression model.
+ Print out the accuracy of the model on the training and on the testing data.

In [None]:
# Import the required packages
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Define the vector of labels and matrix of features
y = movies.label
X = movies.drop('label', axis=1)

# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build a logistic regression model and print out the accuracy
log_reg = LogisticRegression().fit(X_train,y_train)
print('Accuracy on train set: ', log_reg.score(X_train, y_train))
print('Accuracy on test set: ', log_reg.score(X_test, y_test))

Excellent work! Did you notice how the logistic regression's accuracy decreases when we evaluate it on the test set instead of on the training set? It's normal to observe a small drop but if the decrease is large, this could be a signal that your model will not generalize well and will do poorly when evaluating new movie reviews.

## Performance metrics of Twitter data
### Exercise
You will train a logistic regression model that predicts the sentiment of tweets and evaluate its performance on the test set using different metrics.

A matrix `X` has been created for you. It contains features created with a BOW on the `text` column.

The labels are stored in a vector called `y`. Vector `y` is `0` for negative tweets, `1` for neutral, and `2` for positive ones.
Note that although we have 3 classes, it is still a classification problem. The accuracy still measures the proportion of correctly predicted instances. The confusion matrix will now be of size 3x3. With scikit-learn's `confusion_matrix`, each row corresponds to a true class and each column to a predicted class, both in the order 0, 1, 2.

All required packages have been imported for you.

### Instructions
+ Perform the train/test split, and stratify by `y`.
+ Train a logistic regression classifier.
+ Predict the labels of the test set.
+ Print the accuracy score and confusion matrix obtained on the test set.
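
Stratifying the split means that each class keeps roughly the same proportion in the training and test sets. The sketch below is a hand-rolled stand-in for that idea; the function `stratified_split` and the label counts are assumptions for illustration, and it does not reproduce scikit-learn's exact shuffling.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.3, seed=123):
    """Return (train_idx, test_idx) with class proportions roughly preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Hold out the same fraction of every class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical imbalanced labels: 60 negative, 20 neutral, 20 positive
labels = [0] * 60 + [1] * 20 + [2] * 20
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

With a 30% test size, the test set receives 18, 6, and 6 examples of the three classes, mirroring the 60/20/20 mix of the full data.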

In [None]:
from sklearn.metrics import confusion_matrix

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123, stratify=y)

# Train a logistic regression
log_reg = LogisticRegression().fit(X_train, y_train)

# Make predictions on the test set
y_predicted = log_reg.predict(X_test)

# Print the performance metrics
print('Accuracy score test set: ', accuracy_score(y_test, y_predicted))
print('Confusion matrix test set: \n', confusion_matrix(y_test, y_predicted)/len(y_test))

Good job! Although the sentiment category here has 3 classes instead of 2, the way we trained and evaluated the model is the same as with 2 classes. The accuracy on the test data was good and the confusion matrix can also show us which category we are bad at predicting.
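
One way to read off which category is predicted badly is to compare the diagonal of the confusion matrix with the row totals. The sketch below builds the matrix by hand following scikit-learn's convention (rows are true classes, columns are predicted classes); the helper `confusion` and the labels are hypothetical.

```python
def confusion(y_true, y_pred, labels):
    """Count matrix: rows index the true class, columns the predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Hypothetical labels: 0 = negative, 1 = neutral, 2 = positive
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 0]

counts = confusion(y_true, y_pred, labels=[0, 1, 2])

# Share of each true class that was predicted correctly (per-class recall)
recalls = [row[i] / sum(row) for i, row in enumerate(counts)]

for row in counts:
    print(row)
print("Per-class recall:", recalls)
```

In this made-up example the neutral class has the lowest recall (0.5), so it is the category the predictions handle worst.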

## Build and assess a model: product review data
### Exercise
In this exercise, you will build a logistic regression using the `reviews` dataset, containing customers' reviews of Amazon products. The array `y` contains the sentiment: `1` if positive and `0` otherwise. The array `X` contains all numeric features created using a BOW approach. Feel free to explore them in the IPython Shell.

Your task is to build a logistic regression model and calculate the accuracy and confusion matrix using the test data set.

The logistic regression and train/test splitting functions have been imported for you.

### Instructions
+ Import the accuracy score and confusion matrix functions.
+ Split the data into training and testing, using 30% of it as a test set and set the random seed to `42`.
+ Train a logistic regression model.
+ Print out the accuracy score and confusion matrix using the test data.

In [None]:
# Import the accuracy and confusion matrix
from sklearn.metrics import accuracy_score, confusion_matrix

# Split the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Build a logistic regression
log_reg = LogisticRegression().fit(X_train, y_train)

# Predict the labels 
y_predict = log_reg.predict(X_test)

# Print the performance metrics
print('Accuracy score of test data: ', accuracy_score(y_test, y_predict))
print('Confusion matrix of test data: \n', confusion_matrix(y_test, y_predict)/len(y_test))

Great work! You have successfully built another logistic regression model and evaluated its performance on the test set. Is there any way we can improve the performance of the model? We will discuss that in our next video!

## Logistic regression: revisited
<video controls src="video/video4_3.mp4" width=720></video>

## Predict probabilities of movie reviews
### Exercise
In this problem, you will build a logistic regression using the `movies` dataset. The labels are stored in the array `y` and the features in `X`.

Train the model on the training data. Instead of predicting classes, predict the probabilities that each instance in the test set belongs to each of the two classes.

The logistic regression and train/test splitting functions have been imported for you.

### Instructions
+ Split the data into training and testing set.
+ Train a logistic regression model.
+ Predict the probabilities for class 0 and for class 1 of the testing data. Class 0 is in the first column of the predicted probabilities, and class 1 is in the second.

In [None]:
# Split into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=321)

# Train a logistic regression
log_reg = LogisticRegression().fit(X_train, y_train)

# Predict the probability of the 0 class
prob_0 = log_reg.predict_proba(X_test)[:, 0]
# Predict the probability of the 1 class
prob_1 = log_reg.predict_proba(X_test)[:, 1]

print("First 10 predicted probabilities of class 0: ", prob_0[:10])
print("First 10 predicted probabilities of class 1: ", prob_1[:10])

Great job! Did you notice how the probabilities of class 0 and class 1 add up to 1 for each instance? In problems where the proportion of one class is larger than the other, we might want to work with predicted probabilities instead of predicted classes.
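
The sketch below shows, under simplifying assumptions, where these probabilities come from in a binary logistic regression: a linear score is passed through the sigmoid function to give the probability of class 1, and class 0 gets the remainder. The scores and the stricter threshold of 0.7 are hypothetical values chosen for illustration.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical linear scores (intercept + weighted features) for four reviews
scores = [-2.0, -0.3, 0.4, 3.0]

prob_1 = [sigmoid(z) for z in scores]  # P(class 1)
prob_0 = [1.0 - p for p in prob_1]     # P(class 0)

# Usual rule: predict class 1 when P(class 1) is above 0.5
default_labels = [int(p > 0.5) for p in prob_1]
# A stricter, assumed threshold of 0.7 labels fewer reviews as positive
strict_labels = [int(p > 0.7) for p in prob_1]

print("P(class 1):", [round(p, 3) for p in prob_1])
print("Labels at 0.5:", default_labels)
print("Labels at 0.7:", strict_labels)
```

Working with the probabilities rather than the predicted classes is what makes it possible to move the threshold, for example when one class is much rarer than the other.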

## Product reviews with regularization
### Exercise
In this exercise, you will work once more with the `reviews` dataset of Amazon product reviews. A vector of labels `y` contains the sentiment: `1` if positive and `0` otherwise. The matrix `X` contains all numeric features created using a BOW approach.

You will need to train two logistic regression models with different levels of regularization and compare how they perform on the test data. Remember that regularization is a way to control the complexity of the model. The more regularized a model is, the less flexible it is, but the better it can generalize. In scikit-learn's `LogisticRegression`, the parameter `C` is the inverse of the regularization strength, so a smaller `C` means stronger regularization. Models with a higher level of regularization are often somewhat less accurate than less regularized ones.
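
To illustrate how the strength of the penalty limits the model's flexibility, here is a minimal sketch of a one-feature logistic regression fitted with plain gradient descent. It minimizes the summed log-loss plus `w**2 / (2 * C)`, which has the same minimizer as an L2-penalized objective where `C` multiplies the loss. The function `fit_l2_logistic`, the toy data, and the step-size rule are assumptions for illustration, not scikit-learn's solver.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_l2_logistic(xs, ys, C, n_iter=2000):
    """Minimize sum(log-loss) + w**2 / (2 * C); the intercept b is not penalized."""
    w, b = 0.0, 0.0
    # Step size from a conservative curvature bound, so updates stay stable
    lr = 1.0 / (0.25 * sum(x * x + 1.0 for x in xs) + 1.0 / C)
    for _ in range(n_iter):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) + w / C
        grad_b = sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy, non-separable data: larger x makes label 1 more likely
xs = [-2, -1, -1, 0, 0, 1, 1, 2]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

w_weak, _ = fit_l2_logistic(xs, ys, C=1000)     # little penalization
w_strong, _ = fit_l2_logistic(xs, ys, C=0.001)  # heavy penalization
print("Weight with C=1000: ", round(w_weak, 4))
print("Weight with C=0.001:", round(w_strong, 4))
```

With the small `C`, the weight is pulled almost to zero, so the model can barely react to the feature; with the large `C`, the weight is close to the unpenalized fit.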

### Instructions
+ Split the data into train and test sets.
+ Train a logistic regression with the regularization parameter `C` set to `1000`. Train a second logistic regression with `C` equal to `0.001`.
+ Print the accuracy scores of both models on the test set.

In [None]:
# Split data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Train a logistic regression with C=1000 (weak regularization)
log_reg1 = LogisticRegression(C=1000).fit(X_train, y_train)
# Train a logistic regression with C=0.001 (strong regularization)
log_reg2 = LogisticRegression(C=0.001).fit(X_train, y_train)

# Print the accuracies
print('Accuracy of model 1: ', log_reg1.score(X_test, y_test))
print('Accuracy of model 2: ', log_reg2.score(X_test, y_test))

Great work! Did you notice how the model with higher degree of penalization (low C) has lower accuracy than the one with very little penalization (high C)? We often sacrifice some accuracy when we regularize a model but the benefit is lower complexity and lower chance of overfitting.

## Regularizing models with Twitter data
### Exercise
You will work with the Twitter data expressing customers' sentiment about airline companies. The `X` matrix of features and `y` vector of labels have been created for you. In addition, the training and testing split has been performed. You can work with the `X_train`, `X_test`, `y_train` and `y_test` arrays directly.

You will train a regularized model and a more flexible one, and evaluate them using different model performance metrics.

All required packages have been imported for you.

### Instructions
+ Train two logistic regressions: one with regularization parameter of 100 and a second of 0.1.
+ Print the accuracy scores of both models.
+ Print the confusion matrix of each model.

In [None]:
# Build a logistic regression with regularization parameter of 100
log_reg1 = LogisticRegression(C=100).fit(X_train, y_train)
# Build a logistic regression with regularization parameter of 0.1
log_reg2 = LogisticRegression(C=0.1).fit(X_train, y_train)

# Predict the labels for each model
y_predict1 = log_reg1.predict(X_test)
y_predict2 = log_reg2.predict(X_test)

# Print performance metrics for each model
print('Accuracy of model 1: ', log_reg1.score(X_test, y_test))
print('Accuracy of model 2: ', log_reg2.score(X_test, y_test))
print('Confusion matrix of model 1: \n' , confusion_matrix(y_test, y_predict1)/len(y_test))
print('Confusion matrix of model 2: \n', confusion_matrix(y_test, y_predict2)/len(y_test))

Excellent effort! You have trained a more flexible and a less flexible logistic regression to predict the sentiment of tweets and evaluated them using different performance metrics. In this case, we again sacrificed some accuracy when we imposed regularization.

## Bringing it all together
<video controls src="video/video4_4.mp4" width=720></video>

## Step 1: Word cloud and feature creation
### Exercise
You will work with a sample of the `reviews` dataset throughout this exercise. It contains the `review` and `score` columns. Feel free to explore it in the IPython Shell.

In the first step, you will build a word cloud using only positive reviews. The string `positive_reviews` has been created for you by concatenating the top 100 positive reviews.

In the second step, you will create a new feature for the length of each review and add that new feature to the dataset.

All the functions needed to plot a word cloud have been imported for you, as well as the `word_tokenize` function from the `nltk` module.

### Instructions
1. 
	+ Call and create a word cloud image using the `positive_reviews`.
	+ Display the generated image.
2. 
	+ Tokenize each item in the `review` column, using the word tokenizing function we have been working with.
	+ Iterate over the created `word_tokens` list and find the length of each item in the list. Append that length to the empty `len_tokens` list.

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Create and generate a word cloud image
cloud_positives = WordCloud(background_color='white').generate(positive_reviews)
 
# Display the generated wordcloud image
plt.imshow(cloud_positives, interpolation='bilinear') 
plt.axis("off")

# Don't forget to show the final image
plt.show()

In [None]:
# Tokenize each item in the review column
word_tokens = [word_tokenize(review) for review in reviews.review]

# Create an empty list to store the length of the reviews
len_tokens = []

# Iterate over the word_tokens list and determine the length of each item
for tokens in word_tokens:
    len_tokens.append(len(tokens))

# Create a new feature for the length of each review
reviews['n_words'] = len_tokens 

Great job! Which words stood out in the word cloud image? After you have successfully created a feature about the number of tokens in each review, it is time to transform the text of the review.

## Step 2: Building a vectorizer
### Exercise
In this exercise, you are asked to build a TfIdf transformation of the `review` column in the `reviews` dataset. You are asked to specify the n-grams, stop words, the pattern of tokens and the size of the vocabulary arguments.

This is the last step before we train a classifier to predict the sentiment of a review.
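
As a rough sketch of what a TF-IDF transformation computes, the example below weights raw term counts by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit length. The toy tokenized reviews are made up, and the hand-rolled code is an illustration under those assumptions, not the vectorizer's actual implementation.

```python
import math

# Toy tokenized reviews (made up for illustration)
docs = [["good", "phone", "good"], ["bad", "phone"], ["good", "battery"]]
n_docs = len(docs)
vocabulary = sorted({token for doc in docs for token in doc})

# Document frequency: in how many reviews each term appears
doc_freq = {t: sum(t in doc for doc in docs) for t in vocabulary}
# Smoothed inverse document frequency: rarer terms get larger weights
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_row(doc):
    """Raw term counts times idf, scaled to unit Euclidean length."""
    weights = [doc.count(t) * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

rows = [tfidf_row(doc) for doc in docs]

print(vocabulary)
for row in rows:
    print([round(w, 3) for w in row])
```

In the second toy review, the rarer word "bad" ends up with a larger weight than the more common "phone", which is the effect TF-IDF is designed to produce.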

### Instructions
+ Import the Tfidf vectorizer and the default list of English stop words.
+ Build the Tfidf vectorizer, specifying - in this order - the following arguments: use as stop words the default list of English stop words; as n-grams use uni- and bi-grams; the maximum number of features should be 200; capture only words using the specified pattern.
+ Create a DataFrame using the Tfidf vectorizer.
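
The token pattern and n-gram range named in the instructions can be sketched with the standard library alone. The pattern `\b[^\d\W][^\d\W]+\b` keeps runs of two or more word characters that are not digits, and the n-gram range `(1, 2)` produces single tokens plus pairs of adjacent tokens. The helper `ngrams` and the sample sentence are hypothetical, and stop-word removal is left out to keep the sketch short.

```python
import re

TOKEN_PATTERN = r"\b[^\d\W][^\d\W]+\b"

def ngrams(text, n_min=1, n_max=2):
    """Return all n-grams (joined by spaces) for n in [n_min, n_max]."""
    tokens = re.findall(TOKEN_PATTERN, text.lower())
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

features = ngrams("I bought 2 great cables in 2019")
print(features)
```

The single-letter "I" and the numbers are dropped by the pattern, and the remaining tokens yield four unigrams and three bigrams.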

In [None]:
# Import the TfidfVectorizer and default list of English stop words
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

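# Build the vectorizer (stop words are passed as a list, which recent scikit-learn releases expect)
vect = TfidfVectorizer(stop_words=list(ENGLISH_STOP_WORDS), ngram_range=(1, 2), max_features=200, token_pattern=r'\b[^\d\W][^\d\W]+\b').fit(reviews.review)
# Create sparse matrix from the vectorizer
X = vect.transform(reviews.review)

# Create a DataFrame (get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names())
reviews_transformed = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())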
print('Top 5 rows of the DataFrame: \n', reviews_transformed.head())

Excellent! You have transformed the text column using the TfidfVectorizer and created 200 numeric columns from the review. You are now ready to build a binary classifier predicting the sentiment of a review.

## Step 3: Building a classifier
### Exercise
This is the last step in the sentiment analysis prediction. We have explored and enriched our dataset with features related to the sentiment, and created numeric vectors from it.

You will use the dataset that you built in the previous steps. Namely, it contains a feature for the length of reviews, and 200 features created with the Tfidf vectorizer.

Your task is to train a logistic regression to predict the sentiment. The data has been imported for you and is called `reviews_transformed`. The target is called `score` and is binary: `1` when the product review is positive and `0` otherwise.

Train a logistic regression model and evaluate its performance on the test data. How well does the model do?

All the required packages have been imported for you.

### Instructions
+ Perform the train/test split, allocating 20% of the data to testing and setting the random seed to `456`.
+ Train a logistic regression model.
+ Predict the class.
+ Print out the accuracy score and the confusion matrix on the test set.

In [None]:
# Define X and y
y = reviews_transformed.score
X = reviews_transformed.drop('score', axis=1)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=456)

# Train a logistic regression
log_reg = LogisticRegression().fit(X_train, y_train)
# Predict the labels
y_predicted = log_reg.predict(X_test)

# Print accuracy score and confusion matrix on test set
print('Accuracy on the test set: ', accuracy_score(y_test, y_predicted))
print(confusion_matrix(y_test, y_predicted)/len(y_test))


## Wrap up
<video controls src="video/video4_5.mp4" width=720></video>