# Predicting Movie Likes and Dislikes

In [None]:
## All important imports go here
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In this project we compare how well different models predict whether a movie is liked or disliked. Below we load a dataframe downloaded from Kaggle that contains information about movies from 1986-2016, including genre, company, the name of the movie, runtime, etc. What concerns us most is the score column. Each movie was given a score out of 10, with 10 being the best and 1 being the worst. We want to use these scores to build and train a model that predicts movie scores accurately.

In [None]:
movies = pd.read_csv('movies.csv', encoding = 'latin-1') #Read in this data

In [None]:
movies.head() #Display dataframe

In [None]:
#Dropping unnecessary columns 
movies = movies.drop(columns =['country', 'rating', 'released', 'votes', 'writer', 'director', 'star'], axis =1)

In [None]:
avg_score = movies['score'].mean()

In [None]:
##Binarizing the movie scores into zeros and ones for easy classification later on
#A single vectorized assignment avoids pandas chained-assignment warnings;
#a score exactly equal to the average is labeled 0
movies['score'] = (movies['score'] > avg_score).astype(int)

In [None]:
##Doing a linear regression to find values to replace 0.0 in the budget column 
x = movies['gross']
y = movies['budget']

x_constant = sm.add_constant(x)
gross_budget_model = sm.OLS(y, x_constant)
results = gross_budget_model.fit()
print("Intercept and slope are:", results.params)

In [None]:
#add_constant puts the intercept first, so look the parameters up by name
intercept = results.params['const']
slope = results.params['gross']
#Replacing budgets of 0 with the budget values predicted by the linear regression model
zero_budget = movies['budget'] == 0.0
movies.loc[zero_budget, 'budget'] = slope * movies.loc[zero_budget, 'gross'] + intercept
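
One caveat with the imputation above is that the regression is fit on every row, including the rows whose budget is recorded as 0, which pulls the fitted line toward zero. A minimal sketch of an alternative, assuming a small made-up dataframe (`toy` and its values are illustrative and not taken from the Kaggle data), fits the line on the known budgets only and then fills in the missing ones. It uses `np.polyfit` instead of statsmodels only to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: where known, budget is exactly 0.5 * gross + 10
toy = pd.DataFrame({
    'gross':  [100.0, 200.0, 300.0, 400.0, 500.0],
    'budget': [60.0, 110.0, 0.0, 210.0, 0.0],   # 0.0 marks a missing budget
})

known = toy['budget'] != 0.0

# Fit budget = slope * gross + intercept on the rows with a known budget only
slope, intercept = np.polyfit(toy.loc[known, 'gross'], toy.loc[known, 'budget'], deg=1)

# Fill the missing budgets with the fitted values
toy.loc[~known, 'budget'] = slope * toy.loc[~known, 'gross'] + intercept
print(toy)
```

Whether this changes the downstream models for the movie data is something we would have to test.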

In [None]:
movies.head()

In [None]:
#Doing OneHotEncoder for genre labels, this allows genres to be in zeros and ones so we can use them as features
#since they are no longer strings
encoder = OneHotEncoder()
genre = movies['genre']
genre_np = genre.to_numpy()
genre_ary = encoder.fit_transform(genre_np.reshape(-1,1)).toarray()

In [None]:
genre_df = pd.DataFrame(genre_ary)

In [None]:
genre_df = genre_df.rename({0:'Action', 1:'Adventure', 2:'Animation', 3:'Biography', 4:'Comedy', 5:'Crime', 
                            6:'Drama', 7:'Family', 8:'Fantasy', 9:'Horror', 10:'Musical', 11:'Mystery', 12:'Romance', 
                            13:'Sci-Fi', 14:'Thriller', 15:'War', 16:'Western'}, axis = 1)
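
The hard-coded mapping above relies on the genres appearing in exactly that alphabetical order. As a sketch of a less fragile option, the fitted encoder exposes the categories it found through its `categories_` attribute, which can be used directly as column names. The `toy_genres` array below is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical genre column
toy_genres = np.array(['Drama', 'Action', 'Comedy', 'Action'])

toy_encoder = OneHotEncoder()
toy_ary = toy_encoder.fit_transform(toy_genres.reshape(-1, 1)).toarray()

# categories_ holds one array of sorted category names per encoded input column
toy_genre_df = pd.DataFrame(toy_ary, columns=toy_encoder.categories_[0])
print(toy_genre_df)
```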

In [None]:
movies_encoded = pd.concat([movies, genre_df], axis =1)
movies_encoded.head() #OneHotEncoder worked

In [None]:
cleaned_df = movies_encoded.drop(columns=['genre', 'company', 'name'], axis=1)
cleaned_df.head() #Cleaned dataframe ready for modeling

## Visualizing feature spaces

In [None]:
##Visualizing features to see if there are any patterns present before we start modeling
features = cleaned_df.drop(columns=['score'], axis =1)
labels = cleaned_df['score']

plt.figure(figsize=(10,7))
plt.subplot(321)
plt.scatter(features['budget'], features['gross'], c=labels)
plt.xlabel('Movie Budget')
plt.ylabel('Movie Gross')
plt.title('Budget vs. Gross')

plt.subplot(322)
plt.scatter(features['runtime'], features['gross'], c=labels)
plt.xlabel('Movie Runtime')
plt.ylabel('Movie Gross')
plt.title('Runtime vs. Gross')

plt.subplot(323)
plt.scatter(features['year'], features['gross'], c=labels)
plt.xlabel('Release Year')
plt.ylabel('Movie Gross')
plt.title('Release Year vs. Gross')

plt.tight_layout()

This is pretty hard to interpret. We can try scaling the features next to see if the visualization will be a little better than what is shown above. In previous in-class assignments, we scaled features to make their visualizations easier to interpret and work with. We will test this with our data next and see if it changes anything.

In [None]:
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

In [None]:
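#Wrap the scaled array (not the original features) in a dataframe, keeping the column names
features_scaled = pd.DataFrame(features_scaled, columns=features.columns)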

In [None]:
##Attempting the same visualization with scaled features
plt.figure(figsize=(10,7))
plt.subplot(321)
plt.scatter(features_scaled.iloc[:, 0], features_scaled.iloc[:, 1], c=labels)
plt.xlabel('Movie Budget')
plt.ylabel('Movie Gross')
plt.title('Budget vs. Gross')

plt.subplot(322)
plt.scatter(features_scaled.iloc[:, 2], features_scaled.iloc[:, 1], c=labels)
plt.xlabel('Movie Runtime')
plt.ylabel('Movie Gross')
plt.title('Runtime vs. Gross')

plt.subplot(323)
#Column 3 is year (columns are budget, gross, runtime, year, then the genre columns)
plt.scatter(features_scaled.iloc[:, 3], features_scaled.iloc[:, 1], c=labels)
plt.xlabel('Release Year')
plt.ylabel('Movie Gross')
plt.title('Release Year vs. Gross')

plt.tight_layout()

As we can see above, scaling the features did little to help us visualize the relationship between the features. This is expected: standardizing shifts and rescales each axis but does not change the shape of the point cloud, so the plots are just as hard to interpret as before. To look at the data another way, we next plot the OneHotEncoder columns that we created for the different movie genres.
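
A small check on made-up numbers illustrates why standardization alone cannot reveal new structure: each column ends up with mean 0 and standard deviation 1, but the correlation between columns stays the same. The array `toy` below is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-column data on very different scales
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = np.column_stack([1e6 * x + 5e6, 100 + 10 * (x + rng.normal(size=200))])

toy_scaled = StandardScaler().fit_transform(toy)

# Each scaled column has mean 0 and standard deviation 1
print(toy_scaled.mean(axis=0), toy_scaled.std(axis=0))

# The correlation between the two columns is the same before and after scaling
corr_before = np.corrcoef(toy, rowvar=False)[0, 1]
corr_after = np.corrcoef(toy_scaled, rowvar=False)[0, 1]
print(corr_before, corr_after)
```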

In [None]:
##Attempting to visualize OneHotEncoder columns 
#For this visualization we chose genres that might seem to overlap, perhaps this might influence
# their ratings/score
plt.figure(figsize=(10,7))
plt.subplot(321)
plt.scatter(features['Action'], features['Adventure'], c=labels)
plt.xlabel('Action')
plt.ylabel('Adventure')
plt.title('Action vs. Adventure')

plt.subplot(322)
plt.scatter(features['Comedy'], features['Romance'], c=labels)
plt.xlabel('Comedy')
plt.ylabel('Romance')
plt.title('Comedy vs. Romance')

plt.subplot(323)
plt.scatter(features['Horror'], features['Mystery'], c=labels)
plt.xlabel('Horror')
plt.ylabel('Mystery')
plt.title('Horror vs. Mystery')

plt.tight_layout()

Because each movie has only one genre, every point falls on (1, 0), (0, 1), or (0, 0), which makes the data hard to visualize in this manner. A heatmap of the correlations may make it easier to see whether any columns in the dataset are correlated with one another.

In [None]:
sns.heatmap(features.corr())

Based on the color bar on the right-hand side of the figure, we can see at a glance that no pair of distinct features has a positive correlation higher than about 0.2. The highest correlations in the heatmap appear to be between runtime and gross and between gross and year. This matches the feature space visualizations above, where it was very difficult to see any relationship between the features.
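
The heatmap compares features with each other, but for prediction it also matters how each feature relates to the label. A sketch of that check, assuming a hypothetical dataframe `toy_df` with a binary `score` column shaped like `cleaned_df` (the data and the dependence on runtime are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
toy_df = pd.DataFrame({
    'runtime': rng.normal(100, 15, n),
    'budget': rng.normal(5e7, 1e7, n),
})
# Made-up label that depends on runtime only
toy_df['score'] = (toy_df['runtime'] + rng.normal(0, 10, n) > 100).astype(int)

# Correlation of every feature with the label, strongest (by absolute value) first
label_corr = toy_df.corr()['score'].drop('score').sort_values(key=abs, ascending=False)
print(label_corr)
```

Running the same two lines on `cleaned_df` would show which of our features, if any, track the score.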

## PCA

In [None]:
features = cleaned_df.drop(columns=['score'], axis =1)
labels = cleaned_df['score']

In [None]:
train_vectors, test_vectors, train_labels, test_labels = train_test_split(features, labels, train_size =.75,
                                                                         test_size = .25, random_state =1)

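We will first reduce the features with PCA and then fit an SVC with an rbf kernel, using C=10 and gamma=0.1 (note that these are not scikit-learn's defaults, which are C=1.0 and gamma='scale'). We will use 10 components for this first PCA, which is fewer than the number of features in our dataset.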

In [None]:
##Creating and fitting PCA model, first using 10 components
pca = PCA(n_components=10, whiten=True)
pca = pca.fit(train_vectors)

##Now transforming train and test vectors into PCA train and test vectors
pca_train_vectors = pca.transform(train_vectors)
pca_test_vectors = pca.transform(test_vectors)

In [None]:
##Now fitting model using SVC with kernel rbf, C=10 and gamma=0.1
pca_svm = SVC(kernel ='rbf', C=10, gamma = 0.1)
pca_model = pca_svm.fit(pca_train_vectors, train_labels)
pca_ypred = pca_model.predict(pca_test_vectors)

In [None]:
#Now printing metrics to look at accuracy of model
print('The confusion matrix is \n', confusion_matrix(test_labels, pca_ypred))
print('The classification report is \n', classification_report(test_labels, pca_ypred))
print('The accuracy score is \n', accuracy_score(test_labels, pca_ypred))

Now, we will attempt a PCA that has the same number of components as features in our dataset. We will then see if the accuracy scores change at all with this change in the number of components. We will keep the rbf kernel and the same C and gamma values as in the first model.

In [None]:
##Creating and fitting PCA model, now using 21 components, which is equal to the number of features in our dataset
pca = PCA(n_components=21, whiten=True)
pca = pca.fit(train_vectors)

##Now transforming train and test vectors into PCA train and test vectors
pca_train_vectors = pca.transform(train_vectors)
pca_test_vectors = pca.transform(test_vectors)

In [None]:
##Now fitting model using SVC with kernel rbf, C=10 and gamma=0.1
pca_svm = SVC(kernel ='rbf', C=10, gamma = 0.1)
pca_model = pca_svm.fit(pca_train_vectors, train_labels)
pca_ypred21 = pca_model.predict(pca_test_vectors)

In [None]:
#Now printing metrics to look at accuracy of model
print('The confusion matrix is \n', confusion_matrix(test_labels, pca_ypred21))
print('The classification report is \n', classification_report(test_labels, pca_ypred21))
print('The accuracy score is \n', accuracy_score(test_labels, pca_ypred21))

In [None]:
print('The accuracy score of the PCA with 10 components is \n', accuracy_score(test_labels, pca_ypred))
print('The accuracy score of the PCA with 21 components is \n', accuracy_score(test_labels, pca_ypred21))

As we can see in the cell above, the accuracy scores for the two models, one with 10 components and one with 21, are nearly equal, with the 21-component model slightly lower. Since adding components beyond 10 does not improve accuracy, the extra components may be contributing noise or encouraging overfitting, so it is worth trying fewer. Four is a natural choice: our dataset has four features (budget, gross, runtime, and year) that are not OneHotEncoder genre columns.
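
Rather than guessing at a component count, one option is to look at how much variance each component explains. The sketch below uses scikit-learn's `explained_variance_ratio_` attribute on made-up data (`toy_X` is hypothetical) and picks the smallest number of components that reaches a 95% threshold, which is itself an arbitrary choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Hypothetical data: 6 columns built from only 2 underlying signals plus small noise
signals = rng.normal(size=(300, 2))
toy_X = signals @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(300, 6))

# Fit PCA with all components on standardized data
toy_pca = PCA().fit(StandardScaler().fit_transform(toy_X))

# Cumulative share of variance explained by the first k components
cumulative = np.cumsum(toy_pca.explained_variance_ratio_)

# Smallest k whose cumulative share reaches the threshold
n_keep = int(np.argmax(cumulative >= 0.95)) + 1
print(cumulative.round(3), n_keep)
```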

In [None]:
#Trying a PCA with 4 components to see if we were overfitting with 10 components
pca = PCA(n_components=4, whiten=True)
pca = pca.fit(train_vectors)

##Now transforming train and test vectors into PCA train and test vectors
pca_train_vectors = pca.transform(train_vectors)
pca_test_vectors = pca.transform(test_vectors)

In [None]:
##Now fitting model using SVC with kernel rbf, C=10 and gamma=0.1
pca_svm = SVC(kernel ='rbf', C=10, gamma = 0.1)
pca_model = pca_svm.fit(pca_train_vectors, train_labels)
pca_ypred4 = pca_model.predict(pca_test_vectors)

In [None]:
#Now printing metrics to look at accuracy of model
print('The confusion matrix is \n', confusion_matrix(test_labels, pca_ypred4))
print('The classification report is \n', classification_report(test_labels, pca_ypred4))
print('The accuracy score is \n', accuracy_score(test_labels, pca_ypred4))

In [None]:
print('The accuracy score for a PCA with 4 components is \n', accuracy_score(test_labels, pca_ypred4))
print('The accuracy score of the PCA with 10 components is \n', accuracy_score(test_labels, pca_ypred))
print('The accuracy score of the PCA with 21 components is \n', accuracy_score(test_labels, pca_ypred21))

This didn't help either: the accuracy with 4 components is about the same as the accuracy with 21 components. However, it is interesting to compare the confusion matrices for all three models.

In [None]:
print('The confusion matrix for a PCA with 4 components is \n', confusion_matrix(test_labels, pca_ypred4))
print('The confusion matrix for a PCA with 21 components is \n', confusion_matrix(test_labels, pca_ypred21))
print('The confusion matrix for a PCA with 10 components is \n', confusion_matrix(test_labels, pca_ypred))

As we can see above, the model with 4 components has the highest number of true positives and the model with 10 components has the lowest. Overall, all three confusion matrices still contain many false positives and false negatives regardless of the number of components, so accuracy remains low for every version of the PCA. One possible reason is that our features show only weak correlations with one another, as the heatmap suggested, so they do not have strong relationships for the model to exploit. Without strong relationships between the columns of our dataset, it is much harder to build an accurate model using the features that we currently have.
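
Two things might be worth trying next. The PCA above is fit on unscaled features, so columns with large values such as budget and gross can dominate the components, and `GridSearchCV` is imported but never used to tune `C` and `gamma`. The sketch below chains scaling, PCA, and the SVC in a `Pipeline` and searches a small grid. The data comes from `make_classification` and the grid values are arbitrary, so this illustrates the mechanics only and says nothing about results on the movie data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the movie features and binary labels
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Scaling and PCA are refit on the training part of each cross-validation fold
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA()),
    ('svc', SVC(kernel='rbf')),
])

# Illustrative grid; the step name followed by two underscores prefixes each parameter
param_grid = {
    'pca__n_components': [2, 4, 8],
    'svc__C': [1, 10],
    'svc__gamma': [0.01, 0.1],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.score(X_test, y_test))
```

For our data, `X_train` and `y_train` would be replaced by `train_vectors` and `train_labels`.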