In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

### Make sure you apply the same transformations on your X_test and y_test data sets that you applied on the X_train and y_train data sets. Make sure that your X_train, y_train, X_test and y_test data sets only contain columns of numeric and non-null values. Explain and justify how you decide to deal with data issues.

In [2]:
import pickle

# read pickle file
with open('train.pkl', 'rb') as file:
    train_pkl = pickle.load(file)
    
with open('test.pkl', 'rb') as file:
    test_pkl = pickle.load(file)

In [3]:
# drop columns that are not used as model inputs (same list for train and test)
drop_cols = ['movie_info', 'directors', 'writers', 'cast', 'in_theaters_date',
             'on_streaming_date', 'studio_name', 'release_season', 'streaming_lag']
train = train_pkl.drop(columns=drop_cols)
test = test_pkl.drop(columns=drop_cols)
test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5202 entries, 0 to 14087
Data columns (total 29 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   movie_title                      5202 non-null   object 
 1   rating                           5202 non-null   object 
 2   genre                            5202 non-null   object 
 3   runtime_in_minutes               5202 non-null   float64
 4   critic_rating                    5202 non-null   int64  
 5   in_theatre_year                  5202 non-null   int64  
 6   kid_friendly                     5202 non-null   int64  
 7   count                            5202 non-null   int64  
 8   genre_Action & Adventure         5202 non-null   int64  
 9   genre_Animation                  5202 non-null   int64  
 10  genre_Art House & International  5202 non-null   int64  
 11  genre_Classics                   5202 non-null   int64  
 12  genre_Comedy            

Since the data set is large enough and the missing values are hard to infer from other rows, we decided to remove every row that contains a missing value. We then split the cleaned data into a training set and a test set, so that neither contains null values and the columns used for modelling are all numeric.
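
The row-dropping step described above is not shown in this notebook, so the following is only a minimal sketch of what it could look like. The frame `raw` and its values are hypothetical stand-ins for the real data, not the actual cleaning code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the raw movie data (assumption).
raw = pd.DataFrame({
    "runtime_in_minutes": [95.0, np.nan, 120.0, 88.0],
    "kid_friendly": [1, 0, 0, 1],
    "critic_rating": [70, 55, np.nan, 90],
})

# Drop every row that has at least one missing value.
clean = raw.dropna().reset_index(drop=True)

# Sanity checks: no nulls remain and every column is numeric.
assert clean.isna().sum().sum() == 0
assert all(pd.api.types.is_numeric_dtype(t) for t in clean.dtypes)

print(f"kept {len(clean)} of {len(raw)} rows")
```

Running the same checks on both `train` and `test` is a cheap way to confirm that the two sets went through identical cleaning.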

### Fit 3 linear regression models on the training data:


### Model 1: Use only runtime_in_minutes

In [4]:
X_train1 = train[['runtime_in_minutes']]

y_train = train['critic_rating']

import warnings

# Suppress FutureWarnings related to is_sparse deprecation
warnings.simplefilter(action='ignore', category=FutureWarning)

model1 = LinearRegression()
model1.fit(X_train1, y_train)

### Model 2: Use runtime_in_minutes and kid_friendly

In [5]:
X_train2 = train[['runtime_in_minutes','kid_friendly']]

y_train = train['critic_rating']

model2 = LinearRegression()
model2.fit(X_train2, y_train)

### Model 3: Use runtime_in_minutes, kid_friendly and the dummy columns for the genres

In [6]:
X_train3 = train[['runtime_in_minutes', 'kid_friendly',
        'genre_Action & Adventure', 'genre_Animation', 'genre_Art House & International',
        'genre_Classics', 'genre_Comedy', 'genre_Cult Movies', 'genre_Documentary',
        'genre_Drama', 'genre_Horror', 'genre_Kids & Family', 'genre_Musical & Performing Arts',
        'genre_Mystery & Suspense', 'genre_Romance', 'genre_Science Fiction & Fantasy', 'genre_Western']]


y_train = train['critic_rating']

model3 = LinearRegression()
model3.fit(X_train3, y_train)

### Score the linear regression models on the test data by writing a function where you can input the y_test and y_pred values (y_pred = predicted values after you apply the fitted model to your X_test data), and it outputs the following metrics: R2, MAE and RMSE. Apply the function to the three models that you’ve fit so far.

In [7]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

In [8]:
def evaluate_model(y_true, y_pred):
    r2 = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    
    return r2, mae, rmse
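
As a sanity check on what `evaluate_model` returns, the three metrics can be worked out by hand on a tiny example. The four values below are made up for illustration; the formulas are the standard definitions of R2, MAE and RMSE.

```python
import numpy as np

# Made-up values, chosen so the arithmetic is easy to follow.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true - y_pred                       # [1, 0, -1, -2]
mae = np.mean(np.abs(residuals))                  # (1 + 0 + 1 + 2) / 4
rmse = np.sqrt(np.mean(residuals ** 2))           # sqrt((1 + 0 + 1 + 4) / 4)
ss_res = np.sum(residuals ** 2)                   # 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 20
r2 = 1 - ss_res / ss_tot                          # 1 - 6 / 20

print(round(r2, 2), round(mae, 2), round(rmse, 2))  # 0.7 1.0 1.22
```

Because RMSE squares the residuals before averaging, the single error of 2 pulls it above the MAE.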

In [14]:
# build the test-set feature matrices and predict with each fitted model
X_test1 = test[['runtime_in_minutes']]
y_pred1 = model1.predict(X_test1)

X_test2 = test[['runtime_in_minutes','kid_friendly']]
y_pred2 = model2.predict(X_test2)

X_test3 = test[['runtime_in_minutes', 'kid_friendly',
        'genre_Action & Adventure', 'genre_Animation', 'genre_Art House & International',
        'genre_Classics', 'genre_Comedy', 'genre_Cult Movies', 'genre_Documentary',
        'genre_Drama', 'genre_Horror', 'genre_Kids & Family', 'genre_Musical & Performing Arts',
        'genre_Mystery & Suspense', 'genre_Romance', 'genre_Science Fiction & Fantasy', 'genre_Western']]
y_pred3 = model3.predict(X_test3)

In [15]:
# test set
y_test = test['critic_rating']

r2_1, mae_1, rmse_1 = evaluate_model(y_test, y_pred1)
r2_2, mae_2, rmse_2 = evaluate_model(y_test, y_pred2)
r2_3, mae_3, rmse_3 = evaluate_model(y_test, y_pred3)

print("Model 1 \n R2:", round(r2_1,2), "MAE:", round(mae_1,2), "RMSE:", round(rmse_1,2))
print("Model 2 \n R2:", round(r2_2,2), "MAE:", round(mae_2,2), "RMSE:", round(rmse_2,2))
print("Model 3 \n R2:", round(r2_3,2), "MAE:", round(mae_3,2), "RMSE:", round(rmse_3,2))

Model 1 
 R2: 0.01 MAE: 24.28 RMSE: 28.18
Model 2 
 R2: 0.01 MAE: 24.33 RMSE: 28.19
Model 3 
 R2: 0.1 MAE: 22.77 RMSE: 26.86
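
The fit, predict and score steps are repeated for every model with only the feature list changing, so they could be wrapped in one helper. The sketch below is one possible way to do that; `fit_and_score`, `make_frame` and the synthetic data are hypothetical and only illustrate the pattern.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def fit_and_score(train_df, test_df, features, target="critic_rating"):
    """Fit a linear regression on train_df and score it on test_df."""
    model = LinearRegression().fit(train_df[features], train_df[target])
    y_pred = model.predict(test_df[features])
    y_true = test_df[target]
    return {
        "R2": r2_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
    }


# Synthetic stand-in data (assumption), not the real movie data.
rng = np.random.default_rng(0)


def make_frame(n):
    runtime = rng.uniform(70, 180, n)
    kid = rng.integers(0, 2, n)
    rating = 40 + 0.2 * runtime - 5 * kid + rng.normal(0, 5, n)
    return pd.DataFrame({
        "runtime_in_minutes": runtime,
        "kid_friendly": kid,
        "critic_rating": rating,
    })


toy_train, toy_test = make_frame(200), make_frame(100)

feature_sets = {
    "runtime only": ["runtime_in_minutes"],
    "runtime + kid": ["runtime_in_minutes", "kid_friendly"],
}

# One row per model, one column per metric.
results = pd.DataFrame(
    {name: fit_and_score(toy_train, toy_test, cols) for name, cols in feature_sets.items()}
).T
print(results.round(2))
```

Defining each feature list once and reusing it for both sets also removes the risk of the train and test column lists drifting apart.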


### Which model performs the best so far? Which features seem to do a good job predicting the critic rating (hint: you can check p-values using statsmodels)?


In terms of R2, Model 1 and Model 2 both score 0.01, so they explain almost none of the variance in critic_rating.
Model 3 has a higher R2 (0.10), which is a better fit than Models 1 and 2 but still a weak one.

In terms of MAE, Model 3 has the lowest value (22.77), meaning that its predictions are, on average, closer to the actual ratings than those of Models 1 and 2.

RMSE is similar to MAE but gives more weight to larger errors. Model 3 also has the lowest RMSE (26.86), so it makes fewer large errors than Models 1 and 2.

Overall, Model 3 performs better than Models 1 and 2 on R2, MAE, and RMSE. Since the only difference between Model 2 and Model 3 is the set of genre dummies, genre appears to carry most of the predictive signal so far, while runtime_in_minutes and kid_friendly on their own add very little.

### Try fitting 3 more linear regression models on your own using a combination of the columns so far (runtime_in_minutes, kid-friendly and the dummy columns for genre) and your newly engineered features.
• Each subsequent model should attempt to do a better job at prediction than the previous model (even if the metrics don’t end up being better, your choices should make sense)

• With each new model, explain why are you making your decisions of which feature(s) to include / remove


In [16]:
train.columns

Index(['movie_title', 'rating', 'genre', 'runtime_in_minutes', 'critic_rating',
       'in_theatre_year', 'kid_friendly', 'count', 'genre_Action & Adventure',
       'genre_Animation', 'genre_Art House & International', 'genre_Classics',
       'genre_Comedy', 'genre_Cult Movies', 'genre_Documentary', 'genre_Drama',
       'genre_Horror', 'genre_Kids & Family',
       'genre_Musical & Performing Arts', 'genre_Mystery & Suspense',
       'genre_Romance', 'genre_Science Fiction & Fantasy', 'genre_Western',
       'prior_to_1970', 'release_season_Fall', 'release_season_Spring',
       'release_season_Summer', 'release_season_Winter',
       'streaming_lag_years'],
      dtype='object')

### Model 4: Does release season affect the critic rating? 
For example, movies released near or during award seasons (e.g., late fall and early winter) might receive more attention from critics due to considerations for prestigious awards.

In [37]:
X_train4 = train[['runtime_in_minutes','kid_friendly','release_season_Fall', 'release_season_Spring',
       'release_season_Summer', 'release_season_Winter']]

y_train = train['critic_rating']

model4 = LinearRegression()
model4.fit(X_train4, y_train)

X_test4 = test[['runtime_in_minutes','kid_friendly','release_season_Fall', 'release_season_Spring',
       'release_season_Summer', 'release_season_Winter']]
y_pred4 = model4.predict(X_test4)
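
One thing to watch in Model 4: if every movie belongs to exactly one season, the four season dummies sum to 1 in every row and are perfectly collinear with the intercept (the dummy-variable trap). As far as we know, scikit-learn's least-squares solver still returns predictions in that case, but the individual season coefficients are not uniquely determined. The sketch below uses made-up season labels to show the rank deficiency and one common fix, dropping a reference category.

```python
import numpy as np
import pandas as pd

# Made-up season labels for illustration (assumption).
seasons = pd.Series(
    ["Fall", "Spring", "Summer", "Winter", "Fall", "Summer"],
    name="release_season",
)

# All four dummies: each row has exactly one 1, so the columns sum to 1.
all_dummies = pd.get_dummies(seasons, prefix="release_season", dtype=int)
assert (all_dummies.sum(axis=1) == 1).all()

# Intercept column plus four dummies: 5 columns, but only rank 4.
with_intercept = np.column_stack([np.ones(len(all_dummies)), all_dummies.to_numpy()])
rank = np.linalg.matrix_rank(with_intercept)
print(with_intercept.shape[1], "columns, rank", rank)

# Dropping one reference season removes the redundancy.
reduced = pd.get_dummies(seasons, prefix="release_season", drop_first=True, dtype=int)
print(list(reduced.columns))
```

With a reference season dropped, each remaining coefficient reads as the difference in expected rating relative to that reference season.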

### Model 5: Does the lag between theatrical release and streaming release affect the critic rating across genres?
Movies available for streaming shortly after their theatrical release might attract more viewers and potentially receive more reviews, which could influence critic ratings.

Viewer expectations may vary depending on the type of movie. For example, certain genres or types of films may be expected to perform better in theatres, while others may find success through streaming.

In [51]:
X_train5 = train[['runtime_in_minutes','kid_friendly','streaming_lag_years','genre_Action & Adventure', 'genre_Animation', 'genre_Art House & International',
        'genre_Classics', 'genre_Comedy', 'genre_Cult Movies', 'genre_Documentary',
        'genre_Drama', 'genre_Horror', 'genre_Kids & Family', 'genre_Musical & Performing Arts',
        'genre_Mystery & Suspense', 'genre_Romance', 'genre_Science Fiction & Fantasy', 'genre_Western']]

y_train = train['critic_rating']

model5 = LinearRegression()
model5.fit(X_train5, y_train)

X_test5 = test[['runtime_in_minutes','kid_friendly','streaming_lag_years','genre_Action & Adventure', 'genre_Animation', 'genre_Art House & International',
        'genre_Classics', 'genre_Comedy', 'genre_Cult Movies', 'genre_Documentary',
        'genre_Drama', 'genre_Horror', 'genre_Kids & Family', 'genre_Musical & Performing Arts',
        'genre_Mystery & Suspense', 'genre_Romance', 'genre_Science Fiction & Fantasy', 'genre_Western']]
y_pred5 = model5.predict(X_test5)

### Model 6: Do a pre-1970 release and genre affect the critic rating?
Movies released before 1970 may have historical significance and cultural impact, potentially affecting how critics perceive and evaluate them. Some movies released prior to 1970 are considered classics and may receive higher ratings due to their enduring popularity and influence on subsequent films.

Critic ratings might be influenced by trends in genre popularity. For example, certain genres may experience periods of resurgence in critical acclaim.

In [59]:
X_train6 = train[['runtime_in_minutes','prior_to_1970','genre_Action & Adventure', 'genre_Animation', 'genre_Art House & International',
        'genre_Classics', 'genre_Comedy', 'genre_Cult Movies', 'genre_Documentary',
        'genre_Drama', 'genre_Horror', 'genre_Kids & Family', 'genre_Musical & Performing Arts',
        'genre_Mystery & Suspense', 'genre_Romance', 'genre_Science Fiction & Fantasy', 'genre_Western']]

y_train = train['critic_rating']

model6 = LinearRegression()
model6.fit(X_train6, y_train)

X_test6 = test[['runtime_in_minutes','prior_to_1970','genre_Action & Adventure', 'genre_Animation', 'genre_Art House & International',
        'genre_Classics', 'genre_Comedy', 'genre_Cult Movies', 'genre_Documentary',
        'genre_Drama', 'genre_Horror', 'genre_Kids & Family', 'genre_Musical & Performing Arts',
        'genre_Mystery & Suspense', 'genre_Romance', 'genre_Science Fiction & Fantasy', 'genre_Western']]
y_pred6 = model6.predict(X_test6)

In [60]:
# test set
y_test = test['critic_rating']

r2_4, mae_4, rmse_4 = evaluate_model(y_test, y_pred4)
r2_5, mae_5, rmse_5 = evaluate_model(y_test, y_pred5)
r2_6, mae_6, rmse_6 = evaluate_model(y_test, y_pred6)

print("Model 4 \n R2:", round(r2_4,2), "MAE:", round(mae_4,2), "RMSE:", round(rmse_4,2))
print("Model 5 \n R2:", round(r2_5,2), "MAE:", round(mae_5,2), "RMSE:", round(rmse_5,2))
print("Model 6 \n R2:", round(r2_6,2), "MAE:", round(mae_6,2), "RMSE:", round(rmse_6,2))

Model 4 
 R2: 0.0 MAE: 24.42 RMSE: 28.27
Model 5 
 R2: 0.04 MAE: 23.6 RMSE: 27.71
Model 6 
 R2: 0.09 MAE: 22.94 RMSE: 26.92


### Out of the 6 models you created, which model performs the best? Which features seem to do a good job predicting the critic rating?


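Across all six models, Model 3 performs best on every metric (R2 0.10, MAE 22.77, RMSE 26.86). Model 6 is a close second (R2 0.09, MAE 22.94, RMSE 26.92) and beats Model 4 and Model 5 on all three metrics.

The R2 values are still low, so none of the models explains a large portion of the variance. Genre appears to be the most useful predictor: every model that includes the genre dummies (Models 3, 5 and 6) beats the models without them. Release season (Model 4) adds nothing, adding streaming_lag_years (Model 5) lowers test performance relative to Model 3, and replacing kid_friendly with prior_to_1970 (Model 6) roughly matches Model 3.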
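
To compare all six models side by side, the test-set metrics printed earlier can be collected in one table and sorted. The numbers below are copied from the outputs above; only the table layout is new.

```python
import pandas as pd

# Test-set metrics as printed in the two scoring cells above.
reported = pd.DataFrame(
    {
        "R2": [0.01, 0.01, 0.10, 0.00, 0.04, 0.09],
        "MAE": [24.28, 24.33, 22.77, 24.42, 23.60, 22.94],
        "RMSE": [28.18, 28.19, 26.86, 28.27, 27.71, 26.92],
    },
    index=[f"Model {i}" for i in range(1, 7)],
)

# Lowest RMSE first.
ranked = reported.sort_values("RMSE")
print(ranked)
```

Sorted this way, Model 3 comes first and Model 6 second, with the remaining models within about 0.6 RMSE points of one another.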

### List 3 other things you could do at this point to try and improve your model.

1. Even an improved model may not capture all the nuances and complexities of the underlying data. Because we only fit linear regressions, any **nonlinear relationships** between the features and critic_rating are missed.

2. The results above suggest that additional features may be needed, since Models 4-6 show no apparent improvement over Models 1-3. The predictive power of the models may be enhanced by incorporating more relevant variables that help explain the variability in critic ratings. Adjusting the existing models through **fine-tuning hyperparameters, feature engineering**, or alternative modeling techniques may also improve performance.

3. During **data cleansing**, we dropped every row with an NA value, and those rows may carry information that is useful for predicting critic_rating. For example, if most of the data from before 2010 was dropped, the training set may under-represent older movies and the model may fit them poorly. We could add those rows back and impute the missing values with an appropriate method.
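
For the third point, one possible approach (a sketch, not the method used in this notebook) is to impute missing feature values inside a scikit-learn pipeline, so that the fill values are learned from the training set only and then reused on the test set. The toy frames below are hypothetical, and median imputation is just one of several reasonable strategies.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data with missing runtimes (assumption).
toy_train = pd.DataFrame({
    "runtime_in_minutes": [90.0, np.nan, 120.0, 105.0, 150.0],
    "kid_friendly": [1, 0, 0, 1, 0],
    "critic_rating": [60, 55, 75, 68, 80],
})
toy_test = pd.DataFrame({
    "runtime_in_minutes": [np.nan, 110.0],
    "kid_friendly": [0, 1],
})
features = ["runtime_in_minutes", "kid_friendly"]

# The imputer learns the training medians during fit and applies the
# same medians to the test set during predict.
pipe = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
pipe.fit(toy_train[features], toy_train["critic_rating"])
preds = pipe.predict(toy_test[features])

print(pipe.named_steps["simpleimputer"].statistics_)  # training medians per feature
print(preds)
```

Keeping the imputer in the pipeline applies the same transformation to train and test, which is the requirement stated at the top of this notebook. Rows with a missing target (critic_rating) would still need to be dropped.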