<a href="https://colab.research.google.com/github/quentindubois-epitech/-Applied-Predictive-Analytics-Assessment/blob/main/Applied_Predictive_Analytics_Assessment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting the success of a movie from its characteristics with machine learning

## 1. <a name="1">Reading the dataset</a>
Get the data by reading the file at https://github.com/quentindubois-epitech/-Applied-Predictive-Analytics-Assessment/blob/main/movies.csv.


In [90]:
!pip install transformers torch scikit-learn



In [91]:
import pandas as pd
url='https://raw.githubusercontent.com/quentindubois-epitech/-Applied-Predictive-Analytics-Assessment/main/movies.csv'

df = pd.read_csv(url)

We create a new boolean column called `success`, which is True when the movie's score is greater than 6.5.

In [92]:
df['success'] = df['score'] > 6.5
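
One detail of this rule is worth checking: a comparison against a missing value evaluates to False in pandas, so a movie with no score would be labeled as not successful instead of being left unlabeled. The sketch below illustrates this on a toy frame; the `toy` DataFrame and its values are made up for illustration and are not taken from movies.csv.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the movies data (illustrative values only).
toy = pd.DataFrame({"score": [8.4, 5.8, 6.5, np.nan]})

# Same rule as in the notebook: strictly greater than 6.5 counts as a success.
toy["success"] = toy["score"] > 6.5

# A missing score compares as False, so it is silently labeled "not successful".
n_missing = int(toy["score"].isna().sum())
print(toy)
print("rows with missing score:", n_missing)
```

If the real dataset contains missing scores, running `df['score'].isna().sum()` before building the label would show how many rows are affected.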

Let's look at the first ten rows of the dataset.

In [93]:
df.head(10)

Unnamed: 0,name,rating,genre,year,released,score,votes,director,writer,star,country,budget,gross,company,runtime,success
0,The Shining,R,Drama,1980,"June 13, 1980 (United States)",8.4,927000.0,Stanley Kubrick,Stephen King,Jack Nicholson,United Kingdom,19000000.0,46998772.0,Warner Bros.,146.0,True
1,The Blue Lagoon,R,Adventure,1980,"July 2, 1980 (United States)",5.8,65000.0,Randal Kleiser,Henry De Vere Stacpoole,Brooke Shields,United States,4500000.0,58853106.0,Columbia Pictures,104.0,False
2,Star Wars: Episode V - The Empire Strikes Back,PG,Action,1980,"June 20, 1980 (United States)",8.7,1200000.0,Irvin Kershner,Leigh Brackett,Mark Hamill,United States,18000000.0,538375067.0,Lucasfilm,124.0,True
3,Airplane!,PG,Comedy,1980,"July 2, 1980 (United States)",7.7,221000.0,Jim Abrahams,Jim Abrahams,Robert Hays,United States,3500000.0,83453539.0,Paramount Pictures,88.0,True
4,Caddyshack,R,Comedy,1980,"July 25, 1980 (United States)",7.3,108000.0,Harold Ramis,Brian Doyle-Murray,Chevy Chase,United States,6000000.0,39846344.0,Orion Pictures,98.0,True
5,Friday the 13th,R,Horror,1980,"May 9, 1980 (United States)",6.4,123000.0,Sean S. Cunningham,Victor Miller,Betsy Palmer,United States,550000.0,39754601.0,Paramount Pictures,95.0,False
6,The Blues Brothers,R,Action,1980,"June 20, 1980 (United States)",7.9,188000.0,John Landis,Dan Aykroyd,John Belushi,United States,27000000.0,115229890.0,Universal Pictures,133.0,True
7,Raging Bull,R,Biography,1980,"December 19, 1980 (United States)",8.2,330000.0,Martin Scorsese,Jake LaMotta,Robert De Niro,United States,18000000.0,23402427.0,Chartoff-Winkler Productions,129.0,True
8,Superman II,PG,Action,1980,"June 19, 1981 (United States)",6.8,101000.0,Richard Lester,Jerry Siegel,Gene Hackman,United States,54000000.0,108185706.0,Dovemead Films,127.0,True
9,The Long Riders,R,Biography,1980,"May 16, 1980 (United States)",7.0,10000.0,Walter Hill,Bill Bryden,David Carradine,United States,10000000.0,15795189.0,United Artists,100.0,True


We count the number of movies in each `success` class.

In [94]:
df["success"].value_counts()

False    4138
True     3530
Name: success, dtype: int64
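
To judge how imbalanced the two classes are, it helps to turn these counts into shares. The short sketch below only reuses the two counts printed above; the variable names are ours.

```python
# Counts copied from the value_counts() output above.
counts = {"False": 4138, "True": 3530}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.3f}")
```

On the dataframe itself, `df["success"].value_counts(normalize=True)` should give the same shares directly.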

## 4. <a name="4">Train - Validation Split</a>

Let's split our dataset into training (80%) and validation (20%).

In [95]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

We first try to predict the success of the film using only the budget.

In [96]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df[["budget"]],
                                                  df["success"],
                                                  test_size=0.20,
                                                  shuffle=True,
                                                  random_state=324
                                                 )
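
The split above is shuffled but not stratified, so the share of successful movies can differ slightly between the training and validation sets. As a possible variant (not what the notebook does), `train_test_split` accepts a `stratify` argument that keeps the class shares equal in both parts. The sketch below uses synthetic data; `X_toy` and `y_toy` are hypothetical stand-ins for the budget column and the `success` label.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 46% positives (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 1))
y_toy = np.array([False] * 540 + [True] * 460)

# Same settings as the notebook, plus stratify to preserve class shares.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy,
    y_toy,
    test_size=0.20,
    shuffle=True,
    random_state=324,
    stratify=y_toy,
)

print("positive share (train):", y_tr.mean())
print("positive share (validation):", y_va.mean())
```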

We impute missing budget values with the training-set mean so that the models can be trained.

In [99]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_val_imputed = imputer.transform(X_val)
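
The important point in this step is that the mean is learned from the training set only and then reused on the validation set, so no information from the validation data leaks into training. A minimal sketch on made-up numbers (the arrays below are illustrative, not budgets from the dataset):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny made-up column with one missing value in each split.
train = np.array([[10.0], [20.0], [np.nan], [30.0]])
val = np.array([[np.nan], [100.0]])

imp = SimpleImputer(strategy="mean")
train_filled = imp.fit_transform(train)  # learns the mean of 10, 20, 30
val_filled = imp.transform(val)          # reuses the training mean

print("learned mean:", imp.statistics_[0])
print("validation after imputation:", val_filled.ravel())
```

The missing validation value is filled with the training mean, not with the validation mean.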

We use RandomOverSampler to handle class imbalance.

In [100]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=42)
X_train_resampled, y_train_resampled = ros.fit_resample(X_train_imputed, y_train)
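
Random oversampling duplicates randomly chosen rows of the minority class until both classes have the same size. The sketch below reproduces that idea by hand with `sklearn.utils.resample` on toy data. It is an illustration of the principle under our own assumptions, not imbalanced-learn's implementation, and the toy arrays are invented.

```python
import numpy as np
from sklearn.utils import resample

# Toy data: 7 negatives and 3 positives (illustrative only).
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = np.array([False] * 7 + [True] * 3)

X_min, X_maj = X_toy[y_toy], X_toy[~y_toy]

# Draw minority rows with replacement until they match the majority count.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([False] * len(X_maj) + [True] * len(X_min_up))

print("class counts after oversampling:", int((~y_bal).sum()), int(y_bal.sum()))
```

Only the training set is resampled in the notebook; the validation set keeps its original class shares, which is what we want for an honest evaluation.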

Creating a RandomForestClassifier model.

In [101]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state=42)

Define a hyperparameter grid to search.

In [102]:
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
}

Use GridSearchCV to find the best hyperparameters.

In [104]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, make_scorer

grid_search = GridSearchCV(rf_model, param_grid, scoring=make_scorer(accuracy_score), cv=5)
grid_search.fit(X_train_resampled, y_train_resampled)

Display the best hyperparameters.

In [105]:
print("Best Hyperparameters:", grid_search.best_params_)

Best Hyperparameters: {'max_depth': 20, 'n_estimators': 50}


Use the model with the best hyperparameters to make predictions on the validation set.

In [106]:
rf_model_best = grid_search.best_estimator_
val_predictions_best = rf_model_best.predict(X_val_imputed)

Print the results.

In [107]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

print(confusion_matrix(y_val.values, val_predictions_best))
print(classification_report(y_val.values, val_predictions_best))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions_best))

[[341 466]
 [281 446]]
              precision    recall  f1-score   support

       False       0.55      0.42      0.48       807
        True       0.49      0.61      0.54       727

    accuracy                           0.51      1534
   macro avg       0.52      0.52      0.51      1534
weighted avg       0.52      0.51      0.51      1534

Accuracy (validation): 0.5130378096479792
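
To see where the numbers in the report come from, we can recompute them from the confusion matrix. In scikit-learn's layout, rows are the true classes and columns are the predicted classes, in the order False, True. The sketch below only reuses the four counts printed above.

```python
import numpy as np

# Confusion matrix copied from the random forest output above.
cm = np.array([[341, 466],
               [281, 446]])

# Row = true class, column = predicted class (order: False, True).
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
precision_true = tp / (tp + fp)  # of the movies predicted successful, how many were
recall_true = tp / (tp + fn)     # of the successful movies, how many were found

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision_true:.2f}")
print(f"recall:    {recall_true:.2f}")
```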


Creating a Linear Regression model.

In [109]:
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()

lr_model.fit(X_train_imputed, y_train)

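Making predictions using the Linear Regression model. The continuous outputs are rounded to the nearest integer so that they can be compared with the boolean labels.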

In [110]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions_lr = lr_model.predict(X_val_imputed)

print("\nLinear Regression Results:")
print(confusion_matrix(y_val.values, val_predictions_lr.round()))
print(classification_report(y_val.values, val_predictions_lr.round()))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions_lr.round()))


Linear Regression Results:
[[794  13]
 [687  40]]
              precision    recall  f1-score   support

       False       0.54      0.98      0.69       807
        True       0.75      0.06      0.10       727

    accuracy                           0.54      1534
   macro avg       0.65      0.52      0.40      1534
weighted avg       0.64      0.54      0.41      1534

Accuracy (validation): 0.5436766623207301
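
A useful reference point for both models is the accuracy of a trivial classifier that always predicts the majority class. The sketch below computes it from the support column of the reports above and compares it with the two validation accuracies; the variable names are ours.

```python
# Validation support per class, copied from the classification reports above.
support = {"False": 807, "True": 727}

# Accuracy of always predicting the most frequent class.
baseline = max(support.values()) / sum(support.values())

# Validation accuracies printed above, rounded to four decimals.
rf_accuracy = 0.5130
lr_accuracy = 0.5437

print(f"majority-class baseline: {baseline:.4f}")
print(f"random forest vs baseline:     {rf_accuracy - baseline:+.4f}")
print(f"linear regression vs baseline: {lr_accuracy - baseline:+.4f}")
```

With the budget as the only feature, the random forest falls slightly below this baseline and the linear regression is only marginally above it, which suggests that budget alone carries very little information about the score.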
