# Predicting Movie Likes and Dislikes

In [None]:
## All important imports go here
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, roc_auc_score
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

In this project we compare how well different models predict whether a movie is liked or disliked. Below we load a dataframe downloaded from Kaggle that contains information about movies released from 1986 to 2016, such as genre, company, movie name, and runtime. The column that concerns us most is score: each movie was rated on a scale of 1 to 10, with 10 being the best and 1 the worst. We want to use these scores to build and train models that accurately predict whether a movie scores above or below average.

In [None]:
movies = pd.read_csv('movies.csv', encoding = 'latin-1') #Read in this data

In [None]:
movies.head() #Display dataframe

In [None]:
#Dropping unnecessary columns 
movies = movies.drop(columns =['country', 'rating', 'released', 'votes', 'writer', 'director', 'star'], axis =1)

In [None]:
avg_score = movies['score'].mean()

In [None]:
##Binarizing the movie ratings into zeros and ones for easy classification later on
#Scores above the average become 1 (liked), all others become 0 (disliked)
movies['score'] = (movies['score'] > avg_score).astype(int)

In [None]:
##Doing a linear regression to find values to replace 0.0 in the budget column 
x = movies['gross']
y = movies['budget']

x_constant = sm.add_constant(x)
gross_budget_model = sm.OLS(y, x_constant)
results = gross_budget_model.fit()
print("Intercept and slope are:", results.params)

In [None]:
b = results.params['const'] #intercept of the fitted line
m = results.params['gross'] #slope of the fitted line
#Replacing budgets of 0 with the budget values calculated in linear regression model
zero_budget = movies['budget'] == 0.0
movies.loc[zero_budget, 'budget'] = m*movies.loc[zero_budget, 'gross'] + b
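The imputation step above can also be sketched so that the line is fit only on rows with a known budget, which keeps the placeholder zeros from influencing the fit. This is a minimal sketch on a made-up toy frame (the name `toy` and its values are illustrative assumptions, not the Kaggle data), and it swaps in `np.polyfit` for `sm.OLS` to stay short.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): where known, budget is exactly 0.5 * gross + 10
toy = pd.DataFrame({
    "gross":  [20.0, 40.0, 60.0, 80.0, 100.0],
    "budget": [20.0, 30.0,  0.0, 50.0,   0.0],
})

known = toy["budget"] != 0  # rows with a reported budget

# A degree-1 polyfit returns [slope, intercept]
slope, intercept = np.polyfit(toy.loc[known, "gross"], toy.loc[known, "budget"], deg=1)

# Fill only the missing budgets, using .loc to avoid chained assignment
toy.loc[~known, "budget"] = slope * toy.loc[~known, "gross"] + intercept
print(toy)
```

Whether excluding the zero rows from the fit is appropriate for the real data is a judgment call; the sketch only shows the mechanics.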

In [None]:
movies.head()

In [None]:
#Doing OneHotEncoder for genre labels, this allows genres to be in zeros and ones so we can use them as features
#since they are no longer strings
encoder = OneHotEncoder()
genre = movies['genre']
genre_np = genre.to_numpy()
genre_ary = encoder.fit_transform(genre_np.reshape(-1,1)).toarray()

In [None]:
genre_df = pd.DataFrame(genre_ary)

In [None]:
genre_df = genre_df.rename({0:'Action', 1:'Adventure', 2:'Animation', 3:'Biography', 4:'Comedy', 5:'Crime', 
                            6:'Drama', 7:'Family', 8:'Fantasy', 9:'Horror', 10:'Musical', 11:'Mystery', 12:'Romance', 
                            13:'Sci-Fi', 14:'Thriller', 15:'War', 16:'Western'}, axis = 1)

In [None]:
movies_encoded = pd.concat([movies, genre_df], axis =1)
movies_encoded.head() #OneHotEncoder worked
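The column names above are typed in by hand, which assumes the encoder found exactly those 17 genres in that order. As a hedged alternative, the labels can be read back from the fitted encoder's `categories_` attribute. The sketch below uses a small made-up genre column (`toy_genre` is an illustrative name, not part of the notebook).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy genre column (hypothetical values standing in for movies['genre'])
toy_genre = pd.Series(["Drama", "Action", "Comedy", "Action"], name="genre")

enc = OneHotEncoder()
encoded = enc.fit_transform(toy_genre.to_numpy().reshape(-1, 1)).toarray()

# categories_ holds the sorted labels the encoder found, one array per input column
genre_cols = list(enc.categories_[0])
toy_genre_df = pd.DataFrame(encoded, columns=genre_cols, index=toy_genre.index)
print(toy_genre_df)
```

Passing the original index keeps the encoded rows aligned with the source rows when the two frames are concatenated.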

In [None]:
cleaned_df = movies_encoded.drop(columns=['genre', 'company', 'name'], axis=1)
cleaned_df.head() #Cleaned dataframe ready for modeling

## Visualizing feature spaces

In [None]:
##Visualizing features to see if there are any patterns present before we start modeling
features = cleaned_df.drop(columns=['score'], axis =1)
labels = cleaned_df['score']

plt.figure(figsize=(10,7))
plt.subplot(321)
plt.scatter(features['budget'], features['gross'], c=labels)
plt.xlabel('Movie Budget')
plt.ylabel('Movie Gross')
plt.title('Budget vs. Gross')

plt.subplot(322)
plt.scatter(features['runtime'], features['gross'], c=labels)
plt.xlabel('Movie Runtime')
plt.ylabel('Movie Gross')
plt.title('Runtime vs. Gross')

plt.subplot(323)
plt.scatter(features['year'], features['gross'], c=labels)
plt.xlabel('Release Year')
plt.ylabel('Movie Gross')
plt.title('Release Year vs. Gross')

plt.tight_layout()

This is pretty hard to interpret. We can try scaling the features next to see if the visualization will be a little better than what is shown above. In previous in-class assignments, we scaled features to make their visualizations easier to interpret and work with. We will test this with our data next and see if it changes anything.

In [None]:
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

In [None]:
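#Wrap the scaled array (not the original features) back into a dataframe with the same column names
features_scaled = pd.DataFrame(features_scaled, columns=features.columns, index=features.index)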
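A quick way to confirm that scaling actually took effect is to check that every scaled column has a mean near 0 and a standard deviation near 1. The following is a minimal sketch on made-up values (`toy_features` is a hypothetical stand-in for our features).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy features on very different scales (hypothetical values)
toy_features = pd.DataFrame({
    "budget":  [1e6, 5e6, 2e7, 8e7],
    "runtime": [90.0, 100.0, 120.0, 150.0],
})

scaled_array = StandardScaler().fit_transform(toy_features)  # returns a NumPy array
toy_scaled = pd.DataFrame(scaled_array, columns=toy_features.columns,
                          index=toy_features.index)

# Each scaled column should have mean ~0 and population standard deviation ~1
print(toy_scaled.mean().round(6))
print(toy_scaled.std(ddof=0).round(6))
```

`ddof=0` is used because `StandardScaler` divides by the population standard deviation, while pandas defaults to the sample version.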

In [None]:
##Attempting the same visualization with scaled features
plt.figure(figsize=(10,7))
plt.subplot(321)
plt.scatter(features_scaled.iloc[:, 0], features_scaled.iloc[:, 1], c=labels)
plt.xlabel('Movie Budget')
plt.ylabel('Movie Gross')
plt.title('Budget vs. Gross')

plt.subplot(322)
plt.scatter(features_scaled.iloc[:, 2], features_scaled.iloc[:, 1], c=labels)
plt.xlabel('Movie Runtime')
plt.ylabel('Movie Gross')
plt.title('Runtime vs. Gross')

plt.subplot(323)
plt.scatter(features_scaled['year'], features_scaled.iloc[:, 1], c=labels) #select year by name so a genre column is not plotted by mistake
plt.xlabel('Release Year')
plt.ylabel('Movie Gross')
plt.title('Release Year vs. Gross')

plt.tight_layout()

As we can see above, scaling the features did little to help us visualize the relationships between them: standardizing only shifts and rescales each axis, so the visualization is just as hard to interpret as before. Next, to see more visualizations, we can plot the OneHotEncoder columns that we created for the different movie genres.

In [None]:
##Attempting to visualize OneHotEncoder columns 
#For this visualization we chose genres that might seem to overlap, perhaps this might influence
# their ratings/score
plt.figure(figsize=(10,7))
plt.subplot(321)
plt.scatter(features['Action'], features['Adventure'], c=labels)
plt.xlabel('Action')
plt.ylabel('Adventure')
plt.title('Action vs. Adventure')

plt.subplot(322)
plt.scatter(features['Comedy'], features['Romance'], c=labels)
plt.xlabel('Comedy')
plt.ylabel('Romance')
plt.title('Comedy vs. Romance')

plt.subplot(323)
plt.scatter(features['Horror'], features['Mystery'], c=labels)
plt.xlabel('Horror')
plt.ylabel('Mystery')
plt.title('Horror vs. Mystery')

plt.tight_layout()

Because each movie has only one genre, each pair of genre columns can only take the values (0, 0), (1, 0), or (0, 1), which makes it a lot harder to visualize the data in this manner. A heatmap of the correlations might make it easier to identify whether any columns in the dataset are correlated with one another.

In [None]:
sns.heatmap(features.corr())

Based on the color bar on the right-hand side of the figure, we can see at a glance that no pair of features has a positive correlation higher than about 0.2. Some of the higher correlations in the heatmap appear to be between runtime and gross and between gross and year. But, as we saw in the feature space visualizations above, it is very difficult to see any correlations or relationships between the features in the plots.
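The heatmap compares features with each other, but for classification it can also help to look at how each feature relates to the label. One hedged way to do that is `DataFrame.corrwith`, sketched here on synthetic data (the columns `signal` and `noise` are invented for illustration and are not columns of our dataset).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data: 'signal' drives the label, 'noise' does not
signal = rng.normal(size=n)
noise = rng.normal(size=n)
toy_features = pd.DataFrame({"signal": signal, "noise": noise})
toy_labels = pd.Series((signal + 0.5 * rng.normal(size=n) > 0).astype(int))

# Pearson correlation of each feature column with the 0/1 label
label_corr = toy_features.corrwith(toy_labels).sort_values(ascending=False)
print(label_corr)
```

A feature with a correlation near zero may still matter in a nonlinear model, so this is only a first look.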

## Logistic Regression

In [None]:
# splitting the data, using 75% for training the model, random state is set for reproducibility
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, 
                                                                            train_size = 0.75, random_state = 1)

With the training and testing sets of data, a logistic model is made to try to predict whether a movie was liked or disliked. First, the logistic model is created by passing the training labels and features into the Logit function from the statsmodels package. A constant is added to the training features for this model. Next, the results of the model can be viewed once it is fitted. Then the test features can be passed in to get the predictions from the fitted model. Finally, these predictions are matched up with the actual likes and dislikes to evaluate the performance of the model. Predicted values above 0.5 were classified as likes; otherwise, they were classified as dislikes.

In [None]:
# creating the logistic regression model, adding a constant variable to the features
logit_model = sm.Logit(train_labels, sm.add_constant(train_features))

# fitting and evaluating the trained model
result = logit_model.fit()
print(result.summary())

temp = []
b = result.predict(sm.add_constant(test_features)) # have to loop through the results and sort the classes
for i in b:
    if i > 0.5: # if the prediction is higher than 0.5 then it is a like
        temp.append(1)
    else:
        temp.append(0)

print("The accuracy of the model is", accuracy_score(y_pred = temp,y_true= test_labels))
# accuracy of my model
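The thresholding loop above can also be written as a single vectorized comparison. This is a small sketch with made-up probabilities (`probs` stands in for the output of `result.predict(...)`).

```python
import numpy as np

# Hypothetical predicted probabilities, standing in for result.predict(...)
probs = np.array([0.10, 0.50, 0.51, 0.93])

# Strictly greater than 0.5 counts as a like (1), everything else as a dislike (0)
pred_classes = (probs > 0.5).astype(int)
print(pred_classes)  # [0 0 1 1]
```

A probability of exactly 0.5 falls on the dislike side, matching the `if i > 0.5` test in the loop.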

From the results, it is evident that the accuracy of the model was not the best but it is significantly better than if the model was just left to chance. About 66% of movies were correctly guessed to have been liked or disliked. From the summary of the model, it seems that gross, runtime, budget, and year are significant in predicting movie likes. Also, one can see by the p-values that the genre categories and the constant were not very significant in the model. A simplified model can be made to try and increase accuracy without these insignificant features. After reducing the feature set and splitting the data into training and testing sets again, one can follow the previous methods to create the reduced logistic model.
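To judge whether an accuracy is better than chance, it can help to compare it with a baseline that always predicts the most common class. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; the arrays are illustrative assumptions, not our movie data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 6 likes and 4 dislikes in training
y_train = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_test = np.array([1, 1, 1, 0, 0])
X_train = np.zeros((len(y_train), 1))  # features are ignored by this baseline
X_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
print(baseline_acc)  # 0.6
```

If the classes are close to balanced, this baseline sits near 50%, and a model's accuracy can be read against it directly.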

In [None]:
# dropping high p value features to make a reduced model
features2 = features.drop(columns = ['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 
                            'Drama', 'Family', 'Fantasy', 'Horror', 'Musical', 'Mystery', 'Romance', 
                            'Sci-Fi', 'Thriller', 'War', 'Western'])


# splitting data again, sets of data and classes for training and testing
train_features2, test_features2, train_labels2, test_labels2 = train_test_split(features2, labels, 
                                                                                train_size = 0.75, random_state = 1)


logit_model2 = sm.Logit(train_labels2, train_features2) # my logistic model with the new training data, no constant


results2 = logit_model2.fit() # fitting the new model

temp2 = []
b = results2.predict(test_features2) # have to loop through the results and sort the classes
for i in b:
    if i > 0.5: # same threshold as before
        temp2.append(1)
    else:
        temp2.append(0)

        
print(results2.summary()) # printing the new results
print("The accuracy of the new model is", accuracy_score(y_pred = temp2,y_true= test_labels2))
# accuracy of my model

After reducing the model, it can be seen that the accuracy dropped a little, although it is nearly the same value as the full model.

## KNN Classifier

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(features,labels, random_state = 1,
                                                                            train_size = .75)

In [None]:
knn = KNeighborsClassifier(n_neighbors = 3) #creates the classifier with the neighbors #
fit = knn.fit(train_features,train_labels) #fits the training data
pred = fit.predict(test_features) #predicts the test data

print(confusion_matrix(test_labels, pred)) #evaluate the model with the confusion matrix, now using the test
print(accuracy_score(test_labels,pred)) #evaluate the model with the accuracy score, now using the test


fpr, tpr, thresholds = roc_curve(test_labels,pred)
plt.plot(fpr,tpr, "x-")
plt.xlabel("FPR") #false positive rate
plt.ylabel("TPR") #true positive rate

auc = roc_auc_score(test_labels, pred) #area under the curve; closer to 1 the better
print("auc:",auc)

The KNN classifier shows a fairly low accuracy in predicting movie likes and dislikes from the scoring data. The confusion matrix shows about 379 false positives, predictions that were falsely categorized as a 'liked' movie, and 369 false negatives, predictions falsely categorized as dislikes. Additionally, the ROC curve lies only slightly above the chance line, with an AUC only slightly greater than .5, meaning the predictions are just a little better than a guess. Based on these results, KNN may not be the best classifier for the scored data.
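Two things that may be worth trying with KNN are scaling the features, since the distances are otherwise dominated by large-valued columns, and building the ROC curve from predicted probabilities rather than hard 0/1 predictions. The following is a minimal sketch on synthetic data from `make_classification` (not the movie data), so the numbers it prints say nothing about our results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           n_redundant=0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.75, random_state=1)

# The scaler inside the pipeline is fit on the training split only
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn_pipe.fit(X_tr, y_tr)

# Column 1 of predict_proba is the estimated probability of class 1
scores = knn_pipe.predict_proba(X_te)[:, 1]
knn_auc = roc_auc_score(y_te, scores)
print("auc:", knn_auc)
```

With hard predictions the ROC curve has a single interior point; probabilities give one point per distinct score.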

## SVM with RBF kernel

In [None]:
svc_rbf = SVC(C = 10, kernel = 'rbf', gamma = .1) 
svc_rbf.fit(train_features, train_labels) 
pred2 = svc_rbf.predict(test_features) 


print(accuracy_score(test_labels,pred2))
print(confusion_matrix(test_labels,pred2))


fpr, tpr, thresholds = roc_curve(test_labels,pred2) 
plt.plot(fpr,tpr, "x-")
plt.plot([0, 1], [0, 1],'c--')
plt.xlabel("FPR")
plt.ylabel("TPR")

auc = roc_auc_score(test_labels, pred2)
print("auc:",auc)

As shown above, our ROC curve coincides with the chance line and our AUC is .5. So for an SVC with an rbf kernel, C = 10, and gamma = .1 (these are not scikit-learn's defaults, which are C = 1.0 and gamma = 'scale'), the classification of movie likes and dislikes is no more accurate than guessing.
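One plausible explanation for an AUC of exactly .5, offered here as an assumption rather than a confirmed diagnosis, is that the features are unscaled: with dollar-sized values and gamma = .1, the rbf kernel between any two distinct points is effectively zero, so the SVC predicts a single class for every test point. The sketch below reproduces that effect on synthetic data that has been multiplied up to large magnitudes.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, class_sep=2.0, random_state=0)
X_big = X * 1e6  # blow the features up to budget/gross-like magnitudes
X_tr, X_te, y_tr, y_te = train_test_split(X_big, y, train_size=0.75, random_state=1)

# Same SVC settings as above, once on raw features and once after scaling
raw_svc = SVC(C=10, kernel="rbf", gamma=0.1).fit(X_tr, y_tr)
scaled_svc = make_pipeline(StandardScaler(),
                           SVC(C=10, kernel="rbf", gamma=0.1)).fit(X_tr, y_tr)

acc_raw = accuracy_score(y_te, raw_svc.predict(X_te))
acc_scaled = accuracy_score(y_te, scaled_svc.predict(X_te))
print("raw:", acc_raw, "scaled:", acc_scaled)
```

Whether scaling helps on the movie data would need to be checked by rerunning the cell above with scaled features.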

## Grid Search CV

In [None]:
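#Create features and labels
#The score column is the label, so it must not also appear among the features
features = cleaned_df.drop(columns=['score'])
labels = cleaned_df['score']
features

In [None]:
#Split data!
#train_test_split returns the feature splits first, then the label splits
ftrain, ftest, ltrain, ltest = train_test_split(features, labels, random_state=1, train_size = .75)
print(len(ftrain),len(ltrain),len(ftest),len(ltest))
print(ltest.shape)
print(ltrain)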

In [None]:
params = {'C': [1,10,100,1000], 'gamma': [.1,.01,.001,.0001], 'kernel':['rbf']}
grid = GridSearchCV(SVC(),param_grid = params)
grid.fit(ftrain,ltrain)
grid_predictions = grid.predict(ftest)
print(grid.best_params_)
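A grid search can also wrap a whole pipeline, so that scaling is refit inside every cross-validation fold. This is a hedged sketch on synthetic data with a deliberately small grid; the step names `scale` and `svc` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.75, random_state=1)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
# The "svc__" prefix routes each parameter to the SVC step of the pipeline
param_grid = {"svc__C": [1, 10], "svc__gamma": [0.1, 0.01]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)       # best combination by mean cross-validated accuracy
print(search.best_score_)        # that mean cross-validated accuracy
print(search.score(X_te, y_te))  # accuracy of the refit best model on held-out data
```

`best_score_` is computed on the training folds only, so the held-out score on the last line is the fairer estimate.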

In [None]:
params = {'C': [1], 'gamma': [.1], 'kernel':['rbf']}
grid = GridSearchCV(SVC(),param_grid = params)
grid.fit(ftrain,ltrain)
grid_prediction = grid.predict(ftest)
print(confusion_matrix(ltest,grid_prediction),classification_report(ltest,grid_prediction))

## PCA

In [None]:
features = cleaned_df.drop(columns=['score'], axis =1)
labels = cleaned_df['score']

In [None]:
train_vectors, test_vectors, train_labels, test_labels = train_test_split(features, labels, train_size =.75,
                                                                         test_size = .25, random_state =1)

We will first reduce the features with PCA and then fit an SVC with an rbf kernel on the principal components, using the same C = 10 and gamma = 0.1 as in the SVM section. We will start with 10 components, which is fewer than the number of features in our dataset.

In [None]:
##Creating and fiting PCA model, first using 10 components
pca = PCA(n_components=10, whiten=True, random_state =1)
pca = pca.fit(train_vectors)

##Now transforming train and test vectors into PCA train and test vectors
pca_train_vectors = pca.transform(train_vectors)
pca_test_vectors = pca.transform(test_vectors)
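When choosing the number of components, it can be useful to look at `explained_variance_ratio_`, which reports the share of variance each component captures. On unscaled data a single large-valued column can absorb almost all of it. The sketch below illustrates this on synthetic columns whose magnitudes are only loosely modeled on ours (all values are made up).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-in: one dollar-scale column next to three small-scale columns
X = np.column_stack([
    rng.normal(5e7, 2e7, size=300),  # gross-like magnitude (hypothetical)
    rng.normal(100, 15, size=300),   # runtime-like magnitude
    rng.integers(0, 2, size=300),    # one-hot style 0/1 column
    rng.integers(0, 2, size=300),
])

raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_
scaled_ratio = PCA(n_components=2).fit(
    StandardScaler().fit_transform(X)).explained_variance_ratio_

print("unscaled:", raw_ratio)    # first component carries almost all the variance
print("scaled:  ", scaled_ratio) # variance is spread more evenly
```

If the first ratio is close to 1 on the real data, adding more components is unlikely to change what the SVC sees.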

In [None]:
##Now fitting model using SVC with kernel rbf, C = 10 and gamma = 0.1
pca_svm = SVC(kernel ='rbf', C=10, gamma = 0.1)
pca_model = pca_svm.fit(pca_train_vectors, train_labels)
pca_ypred = pca_model.predict(pca_test_vectors)

In [None]:
#Now printing metrics to look at accuracy of model
print('The confusion matrix is \n', confusion_matrix(test_labels, pca_ypred))
print('The classification report is \n', classification_report(test_labels, pca_ypred))
print('The accuracy score is \n', accuracy_score(test_labels, pca_ypred))

Now, we will attempt to make a PCA that has the same number of components as features in our dataset. We will then see if the accuracy scores change at all with this change in number of components. We will keep the rbf kernel and the same C and gamma values (C = 10, gamma = 0.1) as the first PCA.

In [None]:
##Creating and fiting PCA model, now using 21 components, which is equal to the number of features in our dataset
pca = PCA(n_components=21, whiten=True, random_state =1)
pca = pca.fit(train_vectors)

##Now transforming train and test vectors into PCA train and test vectors
pca_train_vectors = pca.transform(train_vectors)
pca_test_vectors = pca.transform(test_vectors)

In [None]:
##Now fitting model using SVC with kernel rbf, C = 10 and gamma = 0.1
pca_svm = SVC(kernel ='rbf', C=10, gamma = 0.1)
pca_model = pca_svm.fit(pca_train_vectors, train_labels)
pca_ypred21 = pca_model.predict(pca_test_vectors)

In [None]:
#Now printing metrics to look at accuracy of model
print('The confusion matrix is \n', confusion_matrix(test_labels, pca_ypred21))
print('The classification report is \n', classification_report(test_labels, pca_ypred21))
print('The accuracy score is \n', accuracy_score(test_labels, pca_ypred21))

In [None]:
print('The accuracy score of the PCA with 10 components is \n', accuracy_score(test_labels, pca_ypred))
print('The accuracy score of the PCA with 21 components is \n', accuracy_score(test_labels, pca_ypred21))

As we can see in the cell above, the accuracy scores for the two PCAs, one with 10 components and one with 21, are nearly equal, with 21 components slightly ahead. Since going from 10 to 21 components barely changes the accuracy, the extra components do not seem to add much information, and past 10 components we might even be overfitting the data. It may therefore be worth trying fewer components, for example 4, which matches the number of features that are not OneHotEncoder genre columns.
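Comparing several component counts can be done in one loop instead of repeating the cells. This is a sketch under stated assumptions: it runs on synthetic 21-feature data rather than ours, and it adds a `StandardScaler` step that the cells above do not have.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with 21 features to mirror the feature count above
X, y = make_classification(n_samples=400, n_features=21, n_informative=5,
                           n_redundant=0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.75, random_state=1)

accuracy_by_k = {}
for k in (4, 10, 21):
    # Scale, project onto k principal components, then classify
    model = make_pipeline(StandardScaler(),
                          PCA(n_components=k, whiten=True, random_state=1),
                          SVC(kernel="rbf", C=10, gamma=0.1))
    model.fit(X_tr, y_tr)
    accuracy_by_k[k] = accuracy_score(y_te, model.predict(X_te))

for k, acc in accuracy_by_k.items():
    print(f"{k:>2} components: accuracy = {acc:.3f}")
```

Keeping the results in a dictionary keyed by component count makes it harder to mix up which prediction array belongs to which model.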

In [None]:
#Trying a PCA with 4 components to see if we were overfitting with 10 components
pca = PCA(n_components=4, whiten=True, random_state = 1)
pca = pca.fit(train_vectors)

##Now transforming train and test vectors into PCA train and test vectors
pca_train_vectors = pca.transform(train_vectors)
pca_test_vectors = pca.transform(test_vectors)

In [None]:
##Now fitting model using SVC with kernel rbf, C = 10 and gamma = 0.1
pca_svm = SVC(kernel ='rbf', C=10, gamma = 0.1)
pca_model = pca_svm.fit(pca_train_vectors, train_labels)
pca_ypred4 = pca_model.predict(pca_test_vectors)

In [None]:
#Now printing metrics to look at accuracy of model
print('The confusion matrix is \n', confusion_matrix(test_labels, pca_ypred4))
print('The classification report is \n', classification_report(test_labels, pca_ypred4))
print('The accuracy score is \n', accuracy_score(test_labels, pca_ypred4))

In [None]:
print('The accuracy score for a PCA with 4 components is \n', accuracy_score(test_labels, pca_ypred4))
print('The accuracy score of the PCA with 10 components is \n', accuracy_score(test_labels, pca_ypred))
print('The accuracy score of the PCA with 21 components is \n', accuracy_score(test_labels, pca_ypred21))

This didn't work either. The accuracy with 4 components seems to be the same as the accuracy with 10 components. However, one interesting thing to look at is the confusion matrix for all three models. 

In [None]:
print('The confusion matrix for a PCA with 4 components is \n', confusion_matrix(test_labels, pca_ypred4))
print('The confusion matrix for a PCA with 21 components is \n', confusion_matrix(test_labels, pca_ypred21))
print('The confusion matrix for a PCA with 10 components is \n', confusion_matrix(test_labels, pca_ypred))

As we can see above, the PCA with 4 components has the highest number of true positives and the PCA with 10 components has the lowest. Overall, there are still a lot of false positives and false negatives in all of the confusion matrices regardless of the number of components, so accuracy remains low for every version of the PCA. One possible reason for the low accuracy is that our features are only weakly correlated with one another, so they do not have strong relationships for the models to exploit. Without strong relationships between the columns, it is much harder to build an accurate model from the features that we currently have.

#### PCA using best_params_ results:

From a previous section of code, we found the best parameters with GridSearchCV: C=1, gamma=.1, and kernel='rbf'. Now we can use these parameters for the SVC that follows our PCA. In this case we will use the PCA with 21 components because, compared to the other PCAs, it had the highest accuracy score and the second highest number of true positives in the confusion matrix. This is also the number of features that are present in our dataset.

In [None]:
#Trying a PCA with 21 components to test best parameters
pca = PCA(n_components=21, whiten=True)
pca = pca.fit(train_vectors)

##Now transforming train and test vectors into PCA train and test vectors
pca_train_vectors = pca.transform(train_vectors)
pca_test_vectors = pca.transform(test_vectors)

In [None]:
#Adjusting parameters to match calculated best parameters from GridSearch
pca_svm = SVC(kernel ='rbf', C=1, gamma = 0.1)
pca_model = pca_svm.fit(pca_train_vectors, train_labels)
pca_ypred21_best = pca_model.predict(pca_test_vectors)

In [None]:
#Now printing metrics to look at accuracy of model
print('The confusion matrix is \n', confusion_matrix(test_labels, pca_ypred21_best))
print('The classification report is \n', classification_report(test_labels, pca_ypred21_best))
print('The accuracy score is \n', accuracy_score(test_labels, pca_ypred21_best))

As we can see above, the accuracy for the model took a slight downturn in comparison to the previous SVC parameters. One thing that contrasts with the lower accuracy is that the number of true positives went up in the confusion matrix. Let's compare these metrics to those of our earlier PCA with 21 components, which did not use the best parameters.

In [None]:
print('The confusion matrix with best parameters and 21 components is \n',
      confusion_matrix(test_labels, pca_ypred21_best))
print('The confusion matrix with 21 components is \n', confusion_matrix(test_labels, pca_ypred21))
print('The accuracy score with best parameters and 21 components is \n',
      accuracy_score(test_labels, pca_ypred21_best))
print('The accuracy score with 21 components is \n', accuracy_score(test_labels, pca_ypred21))

In [None]:
drop_accuracy = accuracy_score(test_labels, pca_ypred21) - accuracy_score(test_labels, pca_ypred21_best)
print('The drop in accuracy between the two PCAs is \n', round(drop_accuracy, 6))
diff_confusion = confusion_matrix(test_labels, pca_ypred21_best) - confusion_matrix(test_labels, pca_ypred21)
print('The difference in confusion matrices between PCAs is \n', diff_confusion)

As we can see here, there was a slight drop in accuracy. The drop was around .009384, which is pretty close to 0. This means our best-parameters model did slightly worse than the earlier model with C = 10, keeping the same number of components (21) each time. Looking at the difference between the confusion matrices, we can see a small increase in true positives but a larger increase in misclassifications. This increase in misclassifications might be why our accuracy score dipped in comparison to the earlier model. Overall, these metrics do not show that the best parameters found via GridSearchCV helped us build a better model.
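When reading these matrices, it helps to remember scikit-learn's layout for 0/1 labels: rows are true classes and columns are predicted classes, so the matrix is [[TN, FP], [FN, TP]]. The sketch below unpacks the four counts on a small made-up example.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# ravel() flattens [[TN, FP], [FN, TP]] row by row
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=1 FP=1 FN=1 TP=2

accuracy = (tn + tp) / (tn + fp + fn + tp)
print(accuracy)  # 0.6
```

Because TP + FN is fixed by the number of true likes in the test set, a rise in one of them always comes with an equal fall in the other.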

### PCA Conclusion: 

Overall, we were able to look at how different numbers of components influenced our model metrics and accuracy. In comparing PCAs with 4, 10, and 21 components, all with the same SVC parameters (C = 10, gamma = 0.1), we found that the PCA with 21 components had the highest accuracy score and the second highest number of true positives in the confusion matrix. Testing our PCA model with 21 components and the best parameters calculated previously using GridSearchCV, we found a slight dip in the accuracy score. Altogether the model did not improve, but it also did not become much worse in comparison to the model with the earlier parameters.

# Possible Limitations: Netflix Comparison

Predicting a movie someone may like or dislike is similar to what the streaming site Netflix does for each viewer's profile. However, Netflix has access to much more personalized data, such as all the other shows and movies someone watches and sometimes what rating they give them. Additionally, Netflix has information about other viewers who watch similar things and can then recommend a movie based on what these viewers have in common. It also factors in things such as how long and at what times someone uses the site, as well as the devices they watch on. All of these inputs are used to create more specific and more accurate recommendations.

Additionally, within each category on the home page, such as "Trending now" or "Comedies", Netflix arranges the titles in order of which movie is most likely to be enjoyed by that specific viewer. So each row starts with the movie Netflix recommends most in that category, even though it is not in the "Recommended" category. Furthermore, the most strongly recommended rows go at the top of the home screen, so both the rows and the columns of each profile are laid out in a way that puts Netflix's highest recommendations in front. All this sorting is why it is said that there are "33 million different versions of Netflix", one personalized for each viewer. To create so many predictions and sort movies in such a strongly personalized way, Netflix needs all these specific factors and needs to compare viewers with similar taste. Without this information, predicting a movie someone may like is much more difficult and less accurate.

#### Sources:

https://help.netflix.com/en/node/100639

https://neilpatel.com/blog/how-netflix-uses-analytics/