# `DATA MODELING`

## **TOPIC: FILMS ANALYSIS**

`Group ID`: 17

`Group Members`:
- 22127404_Tạ Minh Thư
- 22127359_Chu Thúy Quỳnh
- 22127302_Nguyễn Đăng Nhân

### Import

In [20]:
from collections import Counter

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

### Read data from file

In [21]:
data = pd.read_csv('cleaned_data.csv', sep=",")
data



Unnamed: 0,Rank,Title,Foreign %,Domestic %,Year,Genre,Director,Writer,Cast
0,1,Avatar,73.1,26.9,2009,"['Action', 'Adventure', 'Fantasy', 'Sci-Fi']",['James Cameron'],['James Cameron'],"['Sam Worthington', 'Zoe Saldana', 'Sigourney ..."
1,2,Avengers: Endgame,69.3,30.7,2019,"['Action', 'Adventure', 'Drama', 'Sci-Fi']","['Anthony Russo', 'Joe Russo']","['Christopher Markus', 'Stephen McFeely', 'Sta...","['Robert Downey Jr.', 'Chris Evans', 'Mark Ruf..."
2,3,Avatar: The Way of Water,70.5,29.5,2022,"['Action', 'Adventure', 'Fantasy', 'Sci-Fi']",['James Cameron'],"['James Cameron', 'Rick Jaffa', 'Amanda Silver...","['Sam Worthington', 'Zoe Saldana', 'Sigourney ..."
3,4,Titanic,70.2,29.8,1997,"['Drama', 'Romance']",['James Cameron'],['James Cameron'],"['Leonardo DiCaprio', 'Kate Winslet', 'Billy Z..."
4,5,Star Wars: Episode VII - The Force Awakens,54.8,45.2,2015,"['Action', 'Adventure', 'Sci-Fi']",['J.J. Abrams'],"['Lawrence Kasdan', 'J.J. Abrams', 'Michael Ar...","['Daisy Ridley', 'John Boyega', 'Oscar Isaac',..."
...,...,...,...,...,...,...,...,...,...
995,996,The Final Destination,64.3,35.7,2009,"['Horror', 'Thriller']",['David R. Ellis'],"['Eric Bress', 'Jeffrey Reddick']","['Nick Zano', 'Krista Allen', 'Andrew Fiscella..."
996,997,Atlantis: The Lost Empire,54.8,45.2,2001,"['Action', 'Adventure', 'Animation', 'Family',...","['Gary Trousdale', 'Kirk Wise']","['Tab Murphy', 'Kirk Wise', 'Gary Trousdale', ...","['Michael J. Fox', 'Jim Varney', 'Corey Burton..."
997,998,Inside Man,52.4,47.6,2006,"['Crime', 'Drama', 'Mystery', 'Thriller']",['Spike Lee'],['Russell Gewirtz'],"['Denzel Washington', 'Clive Owen', 'Jodie Fos..."
998,999,The Waterboy,13.2,86.8,1998,"['Comedy', 'Sport']",['Frank Coraci'],"['Tim Herlihy', 'Adam Sandler']","['Adam Sandler', 'Kathy Bates', 'Henry Winkler..."


### Feature engineering

#### Genre

In the analysis phase, we identified the top 5 genres with the highest revenue: Adventure, Action, Comedy, Drama, and Fantasy. The feature built here extends that ranking to all genres as a weighted score:
- First, assign every genre to one of four tiers: "Very High" (score 3), "High" (score 2), "Medium" (score 1), or "Low" (score 0).
- For each movie, look up the tier of each of its genres and sum the tier scores.
- Store the sum in a new column, `genres_score`.

In [22]:
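from ast import literal_eval

# The 'Genre' column is stored as a stringified list, so parse it into a real list first.
data['Genre'] = data['Genre'].apply(literal_eval)

# Genre tiers; each tier maps to a score in `group_scores` below.
genre_groups = {
    "Very High": ["Adventure", "Action", "Comedy"],
    "High": ["Drama", "Fantasy", "Thriller", "Sci-Fi", "Family"],
    "Medium": ["Animation", "Romance", "Crime", "Mystery"],
    "Low": ["Musical", "Horror", "Biography", "War", "Music", "History", "Sport", "Western", "Documentary"]
}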

group_scores = {
    "Very High": 3,
    "High": 2,
    "Medium": 1,
    "Low": 0
}

def calculate_group_score(genres):
    score = 0
    for genre in genres:
        for group, genre_list in genre_groups.items():
            if genre in genre_list:
                score += group_scores[group]
                break 
    return score

engineered_data = data.copy()
engineered_data["genres_score"] = data["Genre"].apply(calculate_group_score)
engineered_data = engineered_data[['genres_score']]
engineered_data


Unnamed: 0,genres_score
0,10
1,10
2,10
3,3
4,8
...,...
995,2
996,13
997,6
998,3
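As a quick check of the scoring scheme, the sketch below recomputes the scores of three of the rows shown above from their genre lists. It uses a flat `genre_to_score` lookup and a `genres_score` helper, both hypothetical names that are not part of the notebook, and assumes that a genre missing from every tier contributes 0.

```python
# Tier definitions copied from the notebook cell above.
genre_groups = {
    "Very High": ["Adventure", "Action", "Comedy"],
    "High": ["Drama", "Fantasy", "Thriller", "Sci-Fi", "Family"],
    "Medium": ["Animation", "Romance", "Crime", "Mystery"],
    "Low": ["Musical", "Horror", "Biography", "War", "Music",
            "History", "Sport", "Western", "Documentary"],
}
group_scores = {"Very High": 3, "High": 2, "Medium": 1, "Low": 0}

# Flatten the tiers into a single genre -> score lookup (hypothetical helper).
genre_to_score = {
    genre: group_scores[group]
    for group, genres in genre_groups.items()
    for genre in genres
}

def genres_score(genres):
    """Sum the tier scores of a movie's genres; unlisted genres count as 0."""
    return sum(genre_to_score.get(genre, 0) for genre in genres)

examples = {
    "Avatar": ["Action", "Adventure", "Fantasy", "Sci-Fi"],   # 3 + 3 + 2 + 2
    "Titanic": ["Drama", "Romance"],                          # 2 + 1
    "Star Wars: Episode VII - The Force Awakens": ["Action", "Adventure", "Sci-Fi"],  # 3 + 3 + 2
}
for title, genres in examples.items():
    print(title, genres_score(genres))
```

The results (10, 3, and 8) agree with rows 0, 3, and 4 of the output table.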


#### Main cast popularity

1. Count how many times each actor appears across all movies in the dataset. This count serves as a proxy for the actor's popularity within the dataset.

2. For each movie, sum the appearance counts of the actors in its main cast. The total reflects how well known the cast is, on the assumption that movies with more famous actors tend to have greater appeal.

In [23]:
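from ast import literal_eval

# Parse each cast string into a list once, then count every actor's appearances.
cast_lists = data['Cast'].apply(literal_eval)
cast_counter = Counter(actor for cast_list in cast_lists for actor in cast_list)
engineered_data['main_cast_popularity'] = cast_lists.apply(lambda cast: sum(cast_counter[actor] for actor in cast))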
engineered_data

Unnamed: 0,genres_score,main_cast_popularity
0,10,28
1,10,43
2,10,18
3,3,20
4,8,14
...,...,...
995,2,4
996,13,10
997,6,21
998,3,21
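To make the two steps concrete, here is a minimal sketch on a toy dataset. The actor names and casts are hypothetical and are not taken from `cleaned_data.csv`.

```python
from collections import Counter

# Hypothetical toy casts, one list per movie.
casts = [
    ["Actor A", "Actor B"],
    ["Actor A", "Actor C"],
    ["Actor A", "Actor B", "Actor D"],
]

# Step 1: number of appearances of each actor across all movies.
appearances = Counter(actor for cast in casts for actor in cast)

# Step 2: a movie's popularity is the sum of its actors' appearance counts.
popularity = [sum(appearances[actor] for actor in cast) for cast in casts]

print(appearances)
print(popularity)  # [5, 4, 6]
```

Note that each actor's count includes the movie being scored, so a cast of otherwise unseen actors still scores one point per actor.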


#### Director

In [24]:
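from ast import literal_eval

data['Director'] = data['Director'].apply(literal_eval)

# Number of movies in the dataset credited to each director.
director_counter = Counter(director for director_list in data['Director'] for director in director_list)

# For co-directed movies, the counts of all credited directors are summed.
engineered_data['director_movie_count'] = data['Director'].apply(
    lambda directors: sum(director_counter[director] for director in directors)
)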

engineered_data


Unnamed: 0,genres_score,main_cast_popularity,director_movie_count
0,10,28,5
1,10,43,8
2,10,18,5
3,3,20,5
4,8,14,6
...,...,...,...
995,2,4,1
996,13,10,6
997,6,21,1
998,3,21,2


#### Writer

In [25]:
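from ast import literal_eval

data['Writer'] = data['Writer'].apply(literal_eval)

# Number of movies in the dataset credited to each writer.
writer_counter = Counter(writer for writer_list in data['Writer'] for writer in writer_list)

# Movies whose writer is recorded as 'unknown' get a count of 0.
engineered_data['writer_movie_count'] = data['Writer'].apply(
    lambda writers: 0 if 'unknown' in writers else sum(writer_counter[writer] for writer in writers)
)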
engineered_data

Unnamed: 0,genres_score,main_cast_popularity,director_movie_count,writer_movie_count
0,10,28,5,11
1,10,43,8,295
2,10,18,5,75
3,3,20,5,11
4,8,14,6,38
...,...,...,...,...
995,2,4,1,2
996,13,10,6,31
997,6,21,1,1
998,3,21,2,13


#### Having a top 25% writer in each genre

In [26]:
flat_writer_data = data.explode('Genre').explode('Writer')

writer_counts = flat_writer_data.groupby(['Genre', 'Writer']).size().reset_index(name='Count')

def get_top_25_percent_writers_for_genre(genre):
    """Return the writers whose movie count in `genre` is strictly above the 75th percentile."""
    genre_data = writer_counts[writer_counts['Genre'] == genre]
    Q3 = genre_data['Count'].quantile(0.75)
    top_25_writers = genre_data[genre_data['Count'] > Q3]['Writer'].tolist()
    return top_25_writers


def calculate_top_25_percent_writer_for_row(row):
    """Return 1 if any writer of the movie is a top writer in at least one of its genres, else 0."""
    top_25_percent_writer = 0

    for genre in row['Genre']:
        top_25_writers = get_top_25_percent_writers_for_genre(genre)

        if any(writer in top_25_writers for writer in row['Writer']):
            top_25_percent_writer = 1
            break

    return top_25_percent_writer


engineered_data['has_top_25%_writer/genre'] = data.apply(calculate_top_25_percent_writer_for_row, axis=1)
engineered_data['has_top_25%_writer/genre'].value_counts()

1    768
0    232
Name: has_top_25%_writer/genre, dtype: int64
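The per-genre threshold can also be computed once for all genres instead of being recomputed for every row. The sketch below shows this on hypothetical toy data; the `toy` frame, the `top_pairs` set, and the `has_top_writer` helper are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data with list-valued columns, as in the notebook after parsing.
toy = pd.DataFrame({
    "Genre": [["Action", "Drama"], ["Action"], ["Action"], ["Drama"], ["Action"]],
    "Writer": [["W1"], ["W1", "W2"], ["W1"], ["W3"], ["W4"]],
})

# One row per (movie, genre, writer) combination, then count movies per pair.
flat = toy.explode("Genre").explode("Writer")
counts = flat.groupby(["Genre", "Writer"]).size().reset_index(name="Count")

# 75th percentile of the counts within each genre, aligned with `counts`.
q3 = counts.groupby("Genre")["Count"].transform(lambda s: s.quantile(0.75))

# Keep only the (genre, writer) pairs strictly above their genre's threshold.
top = counts[counts["Count"] > q3]
top_pairs = set(zip(top["Genre"], top["Writer"]))

def has_top_writer(row):
    """1 if any (genre, writer) pair of the movie is in the top set, else 0."""
    return int(any((g, w) in top_pairs for g in row["Genre"] for w in row["Writer"]))

toy["has_top_writer"] = toy.apply(has_top_writer, axis=1)
print(top_pairs)
print(toy["has_top_writer"].tolist())  # [1, 1, 1, 0, 0]
```

Because the comparison is strict, a genre in which all writers have the same count (Drama in this toy example) has no top writers at all.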

#### Having a top 25% director in each genre

In [8]:
flat_director_data = data.explode('Genre').explode('Director')

director_counts = flat_director_data.groupby(['Genre', 'Director']).size().reset_index(name='Count')

def get_top_25_percent_directors_for_genre(genre):
    """Return the directors whose movie count in `genre` is strictly above the 75th percentile."""
    genre_data = director_counts[director_counts['Genre'] == genre]
    Q3 = genre_data['Count'].quantile(0.75)
    top_25_directors = genre_data[genre_data['Count'] > Q3]['Director'].tolist()
    return top_25_directors


def calculate_top_25_percent_director_for_row(row):
    """Return 1 if any director of the movie is a top director in at least one of its genres, else 0."""
    top_25_percent_director = 0

    for genre in row['Genre']:
        top_25_directors = get_top_25_percent_directors_for_genre(genre)

        if any(director in top_25_directors for director in row['Director']):
            top_25_percent_director = 1
            break

    return top_25_percent_director


engineered_data['has_top_25%_director/genre'] = data.apply(calculate_top_25_percent_director_for_row, axis=1)
engineered_data['has_top_25%_director/genre'].value_counts()

1    601
0    399
Name: has_top_25%_director/genre, dtype: int64

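#### Director-cast collaborations

In [9]:
director_cast_collaborations = Counter()

# Note: data['Cast'] was never parsed into lists (unlike 'Genre', 'Director' and 'Writer'),
# so `for actor in cast` iterates over the characters of the cast string.
# Parsing the column first (e.g. with ast.literal_eval) is needed to count real director-actor pairs.
for _, row in data.iterrows():
    directors = row['Director']
    cast = row['Cast']

    for director in directors:
        for actor in cast:
            director_cast_collaborations[(director, actor)] += 1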

engineered_data['director_cast_collaborations'] = data.apply(
    lambda row: sum(director_cast_collaborations[(director, actor)] 
                    for director in row['Director']
                    for actor in row['Cast']), axis=1
)
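The pair-counting idea is easier to follow on a small example. The sketch below uses hypothetical toy rows and assumes the cast is stored as the string representation of a list (as it is when read from a CSV file), so it is parsed with `ast.literal_eval` before iterating over actors.

```python
from ast import literal_eval
from collections import Counter

# Hypothetical toy rows: directors already parsed, cast still a string.
rows = [
    {"Director": ["D1"], "Cast": "['A', 'B']"},
    {"Director": ["D1"], "Cast": "['A', 'C']"},
    {"Director": ["D2"], "Cast": "['A']"},
]

# Count how often each (director, actor) pair appears across all movies.
pair_counts = Counter()
for row in rows:
    for director in row["Director"]:
        for actor in literal_eval(row["Cast"]):
            pair_counts[(director, actor)] += 1

# A movie's score is the sum of the counts of all its (director, actor) pairs.
scores = [
    sum(pair_counts[(director, actor)]
        for director in row["Director"]
        for actor in literal_eval(row["Cast"]))
    for row in rows
]
print(scores)  # [3, 3, 1]
```

Here the pair (D1, A) occurs twice, so both D1 movies score 2 for that pair plus 1 for their second actor.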


#### Foreign-to-domestic revenue ratio

In [10]:
engineered_data['foreign_to_domestic_ratio'] = data.apply(
    lambda row: row['Foreign %'] / row['Domestic %'] if row['Domestic %'] != 0 else 0,
    axis=1
)
engineered_data

Unnamed: 0,genres_score,main_cast_popularity,director_movie_count,writer_movie_count,has_top_25%_writer/genre,has_top_25%_director/genre,director_cast_collaborations,foreign_to_domestic_ratio
0,10,28,5,11,1,1,1492,2.717472
1,10,43,8,295,1,1,2196,2.257329
2,10,18,5,75,1,1,1436,2.389831
3,3,20,5,11,0,0,1305,2.355705
4,8,14,6,38,1,1,1450,1.212389
...,...,...,...,...,...,...,...,...
995,2,4,1,2,0,0,225,1.801120
996,13,10,6,31,1,1,1518,1.212389
997,6,21,1,1,0,0,310,1.100840
998,3,21,2,13,1,0,527,0.152074
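The same ratio can be computed without a row-wise `apply`. The sketch below is one possible vectorized form; the first two rows reuse the shares of Avatar and The Waterboy from the data table, while the third row (a film with no domestic share) is hypothetical and only there to exercise the zero-division guard.

```python
import pandas as pd

shares = pd.DataFrame({
    "Foreign %": [73.1, 13.2, 100.0],
    "Domestic %": [26.9, 86.8, 0.0],   # last row is a hypothetical edge case
})

domestic = shares["Domestic %"]

# Mask zero denominators as NaN, divide, then fall back to 0 as the notebook does.
ratio = (shares["Foreign %"] / domestic.where(domestic != 0)).fillna(0)

print(ratio.round(6).tolist())  # [2.717472, 0.152074, 0.0]
```

The first two values match rows 0 and 998 of the output table.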


## Data preparation

### Categorize rank

In [11]:
def categorize_rank(rank):
    if rank <= 300:
        return 2  # High
    elif 301 <= rank <= 700:
        return 1  # Medium
    else:
        return 0  # Low
    
engineered_data["Rank Category"] = data["Rank"].apply(categorize_rank)
engineered_data

Unnamed: 0,genres_score,main_cast_popularity,director_movie_count,writer_movie_count,has_top_25%_writer/genre,has_top_25%_director/genre,director_cast_collaborations,foreign_to_domestic_ratio,Rank Category
0,10,28,5,11,1,1,1492,2.717472,2
1,10,43,8,295,1,1,2196,2.257329,2
2,10,18,5,75,1,1,1436,2.389831,2
3,3,20,5,11,0,0,1305,2.355705,2
4,8,14,6,38,1,1,1450,1.212389,2
...,...,...,...,...,...,...,...,...,...
995,2,4,1,2,0,0,225,1.801120,0
996,13,10,6,31,1,1,1518,1.212389,0
997,6,21,1,1,0,0,310,1.100840,0
998,3,21,2,13,1,0,527,0.152074,0
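An equivalent way to bin the ranks is `pd.cut`, which applies the same boundaries to the whole column at once. This is a sketch on a few hand-picked boundary ranks, not a replacement for the cell above.

```python
import pandas as pd

# Boundary values around the two cut points (300 and 700).
ranks = pd.Series([1, 300, 301, 700, 701, 1000])

# Right-closed bins: (0, 300] -> 2 (High), (300, 700] -> 1 (Medium), (700, inf) -> 0 (Low).
rank_category = pd.cut(
    ranks,
    bins=[0, 300, 700, float("inf")],
    labels=[2, 1, 0],
).astype(int)

print(rank_category.tolist())  # [2, 2, 1, 1, 0, 0]
```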


### Separating the data into X and y

In [12]:
X = engineered_data.drop(columns=['Rank Category'])
scaler = MinMaxScaler()
columns_to_scale = X.columns
X[columns_to_scale] = scaler.fit_transform(X[columns_to_scale])
X

Unnamed: 0,genres_score,main_cast_popularity,director_movie_count,writer_movie_count,has_top_25%_writer/genre,has_top_25%_director/genre,director_cast_collaborations,foreign_to_domestic_ratio
0,0.555556,0.452830,0.190476,0.036066,1.0,1.0,0.237064,0.002720
1,0.555556,0.735849,0.333333,0.967213,1.0,1.0,0.363115,0.002260
2,0.555556,0.264151,0.190476,0.245902,1.0,1.0,0.227037,0.002392
3,0.166667,0.301887,0.190476,0.036066,0.0,0.0,0.203581,0.002358
4,0.444444,0.188679,0.238095,0.124590,1.0,1.0,0.229543,0.001214
...,...,...,...,...,...,...,...,...
995,0.111111,0.000000,0.000000,0.006557,0.0,0.0,0.010206,0.001803
996,0.722222,0.113208,0.238095,0.101639,1.0,1.0,0.241719,0.001214
997,0.333333,0.320755,0.000000,0.003279,0.0,0.0,0.025425,0.001102
998,0.166667,0.320755,0.047619,0.042623,1.0,0.0,0.064279,0.000152


In [13]:
y = engineered_data['Rank Category']
y

0      2
1      2
2      2
3      2
4      2
      ..
995    0
996    0
997    0
998    0
999    0
Name: Rank Category, Length: 1000, dtype: int64

### Data splitting

The data is split into 80% for training and 20% for testing.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=38)
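In the cells above, the `MinMaxScaler` is fitted on all of `X` before the split, so the scaling ranges are computed with the test rows included. A common alternative is to split first and wrap the scaler and the model in a pipeline, so that the scaler is fitted on the training rows only. The sketch below illustrates this on synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins for the notebook's features and target, and the `stratify` argument is an optional addition that keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 200 rows, 4 features, 3 classes.
rng = np.random.default_rng(38)
X_demo = rng.normal(size=(200, 4))
y_demo = rng.integers(0, 3, size=200)

# Split first, keeping class proportions similar in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=38, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only,
# then reuses those ranges when transforming the test rows.
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

print("Test accuracy:", model.score(X_te, y_te))
```

A pipeline like this can also be passed directly to `GridSearchCV`, in which case the scaler is refitted inside every cross-validation fold.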

## Building predictive models

### Logistic Regression

In [15]:

log_model = LogisticRegression()
log_model.fit(X_train, y_train)

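# Not every solver supports every penalty (e.g. 'lbfgs' only accepts 'l2' or 'none'),
# so the invalid combinations in this grid fail to fit and are scored as NaN.
# 'lbfgs' appears twice in the solver list, so its combinations are fitted twice.
param_grid = [
    {
        'penalty': ['l1', 'l2', 'elasticnet', 'none'],
        'C': [1, 10, 100, 1000],
        'solver': ['lbfgs', 'newton-cg', 'lbfgs', 'sag', 'saga'],
        'max_iter': [100, 1000, 2500, 5000],
    }
]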

grid_search = GridSearchCV(log_model, param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)


432 fits failed out of a total of 960.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
96 fits failed with the following error:
Traceback (most recent call last):
  File "c:\ProgramData\anaconda3\envs\min_ds-env\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\ProgramData\anaconda3\envs\min_ds-env\lib\site-packages\sklearn\linear_model\_logistic.py", line 1162, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "c:\ProgramData\anaconda3\envs\min_ds-env\lib\site-packages\sklearn\linear_model\_logistic.py", line 54, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalti
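The failed fits come from solver and penalty combinations that scikit-learn rejects. One way to avoid them is to pass `GridSearchCV` a list of dictionaries, each containing only compatible values. The sketch below illustrates this on synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins, the grid is deliberately small, and the unpenalized option is left out because its spelling (`'none'` versus `None`) differs between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class problem standing in for the notebook's training data.
X_demo, y_demo = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_classes=3, random_state=38
)

# Each dictionary lists only solver/penalty pairs that are valid together.
param_grid = [
    {"solver": ["lbfgs", "newton-cg", "sag"], "penalty": ["l2"], "C": [1, 10]},
    {"solver": ["saga"], "penalty": ["l1", "l2"], "C": [1, 10]},
    {"solver": ["saga"], "penalty": ["elasticnet"], "l1_ratio": [0.5], "C": [1, 10]},
]

search = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, cv=3)
search.fit(X_demo, y_demo)

# Failed fits are scored as NaN, so counting NaNs reveals invalid candidates.
failed = int(np.isnan(search.cv_results_["mean_test_score"]).sum())
print("Candidates:", len(search.cv_results_["params"]))
print("Candidates with failed fits:", failed)
print("Best parameters:", search.best_params_)
```

With this layout the grid has 12 candidates instead of a full Cartesian product, and no computation is spent on combinations that can never fit.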

In [16]:
# Baseline: logistic regression with default hyperparameters.
y_pred = log_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

report = classification_report(y_test, y_pred)
print("\nClassification Report:")
print(report)

# Tuned: best estimator found by the grid search.
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.4950

Classification Report:
              precision    recall  f1-score   support

           0       0.53      0.25      0.34        63
           1       0.38      0.63      0.47        67
           2       0.68      0.59      0.63        70

    accuracy                           0.49       200
   macro avg       0.53      0.49      0.48       200
weighted avg       0.54      0.49      0.49       200

Best parameters: {'C': 1, 'max_iter': 100, 'penalty': 'l2', 'solver': 'lbfgs'}
Best cross-validation score: 0.46999990613237214
Accuracy: 0.4950
Classification Report:
               precision    recall  f1-score   support

           0       0.53      0.25      0.34        63
           1       0.38      0.63      0.47        67
           2       0.68      0.59      0.63        70

    accuracy                           0.49       200
   macro avg       0.53      0.49      0.48       200
weighted avg       0.54      0.49      0.49       200



### Random Forest

In [17]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state=38)
rf_model.fit(X_train, y_train)

param_grid = {
    'n_estimators': [100, 200, 300],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['auto', 'sqrt', 'log2']
}

grid_search = GridSearchCV(rf_model, param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)




In [18]:

# Baseline: random forest with default hyperparameters.
y_pred = rf_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

report = classification_report(y_test, y_pred)
print("\nClassification Report:")
print(report)

# Tuned: best estimator found by the grid search.
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.5000

Classification Report:
              precision    recall  f1-score   support

           0       0.56      0.48      0.51        63
           1       0.37      0.51      0.43        67
           2       0.65      0.51      0.58        70

    accuracy                           0.50       200
   macro avg       0.53      0.50      0.51       200
weighted avg       0.53      0.50      0.51       200

Best parameters: {'criterion': 'entropy', 'max_depth': None, 'max_features': 'auto', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}
Best cross-validation score: 0.47501243746069294
Accuracy: 0.5050
Classification Report:
               precision    recall  f1-score   support

           0       0.55      0.38      0.45        63
           1       0.38      0.54      0.44        67
           2       0.67      0.59      0.63        70

    accuracy                           0.51       200
   macro avg       0.53      0.50      0.51       200
weighted 

### Decision Tree

In [19]:
# Baseline: decision tree with default hyperparameters.
dt_model = DecisionTreeClassifier(random_state=38)
dt_model.fit(X_train, y_train)
y_pred = dt_model.predict(X_test)

print("Accuracy (Baseline):", accuracy_score(y_test, y_pred))
print("\nClassification Report (Baseline):\n", classification_report(y_test, y_pred))

# Hyperparameter grid for tuning the tree.
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5]
}

grid_search = GridSearchCV(dt_model, param_grid, cv=3, n_jobs=-1)

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
print("Best parameters:", best_params)

best_dt_model = grid_search.best_estimator_
y_pred_tuned = best_dt_model.predict(X_test)

print("Accuracy (Tuned):", accuracy_score(y_test, y_pred_tuned))
print("\nClassification Report (Tuned):\n", classification_report(y_test, y_pred_tuned))


Accuracy (Baseline): 0.4

Classification Report (Baseline):
               precision    recall  f1-score   support

           0       0.45      0.38      0.41        63
           1       0.33      0.42      0.37        67
           2       0.46      0.40      0.43        70

    accuracy                           0.40       200
   macro avg       0.41      0.40      0.40       200
weighted avg       0.41      0.40      0.40       200

Best parameters: {'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 5, 'min_samples_split': 2}
Accuracy (Tuned): 0.545

Classification Report (Tuned):
               precision    recall  f1-score   support

           0       0.58      0.52      0.55        63
           1       0.45      0.46      0.46        67
           2       0.61      0.64      0.62        70

    accuracy                           0.55       200
   macro avg       0.55      0.54      0.54       200
weighted avg       0.55      0.55      0.54       200
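To put these accuracies in context, it helps to compare them with a trivial classifier that always predicts the most frequent training class. The sketch below derives that baseline from numbers already shown in this notebook: the per-class test support comes from the classification reports, and the per-class totals assume that ranks 1 to 1000 each occur exactly once (so the rank thresholds give 300, 400, and 300 films).

```python
# Per-class support in the test set, read from the classification reports above.
test_support = {0: 63, 1: 67, 2: 70}

# Assumption: ranks 1..1000 occur once each, so the thresholds at 300 and 700
# yield 300 Low (0), 400 Medium (1), and 300 High (2) films overall.
total_per_class = {0: 300, 1: 400, 2: 300}

# Whatever is not in the test set is in the training set.
train_support = {c: total_per_class[c] - test_support[c] for c in test_support}

# A majority-class classifier always predicts the most frequent training class.
majority_class = max(train_support, key=train_support.get)
baseline_accuracy = test_support[majority_class] / sum(test_support.values())

print(train_support)                      # {0: 237, 1: 333, 2: 230}
print(majority_class, baseline_accuracy)  # 1 0.335
```

Under these assumptions, the tuned decision tree's accuracy of 0.545 is about 0.21 above the majority-class baseline of 0.335.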

