# `DATA MODELING`

## **TOPIC: FILMS ANALYSIS**

`Group ID`: 17

`Group Member`:
- 22127404_Tạ Minh Thư
- 22127359_Chu Thúy Quỳnh
- 22127302_Nguyễn Đăng Nhân

### Import

In [1]:
import pandas as pd
from collections import Counter
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import pearsonr, chi2_contingency

### Read data from file

In [2]:
data = pd.read_csv('cleaned_data.csv', sep=",")
data

Unnamed: 0,Rank,Title,Foreign %,Domestic %,Year,Genre,Director,Writer,Cast
0,1,Avatar,73.1,26.9,2009,"['Action', 'Adventure', 'Fantasy', 'Sci-Fi']",['James Cameron'],['James Cameron'],"['Sam Worthington', 'Zoe Saldana', 'Sigourney ..."
1,2,Avengers: Endgame,69.3,30.7,2019,"['Action', 'Adventure', 'Drama', 'Sci-Fi']","['Anthony Russo', 'Joe Russo']","['Christopher Markus', 'Stephen McFeely', 'Sta...","['Robert Downey Jr.', 'Chris Evans', 'Mark Ruf..."
2,3,Avatar: The Way of Water,70.5,29.5,2022,"['Action', 'Adventure', 'Fantasy', 'Sci-Fi']",['James Cameron'],"['James Cameron', 'Rick Jaffa', 'Amanda Silver...","['Sam Worthington', 'Zoe Saldana', 'Sigourney ..."
3,4,Titanic,70.2,29.8,1997,"['Drama', 'Romance']",['James Cameron'],['James Cameron'],"['Leonardo DiCaprio', 'Kate Winslet', 'Billy Z..."
4,5,Star Wars: Episode VII - The Force Awakens,54.8,45.2,2015,"['Action', 'Adventure', 'Sci-Fi']",['J.J. Abrams'],"['Lawrence Kasdan', 'J.J. Abrams', 'Michael Ar...","['Daisy Ridley', 'John Boyega', 'Oscar Isaac',..."
...,...,...,...,...,...,...,...,...,...
995,996,The Final Destination,64.3,35.7,2009,"['Horror', 'Thriller']",['David R. Ellis'],"['Eric Bress', 'Jeffrey Reddick']","['Nick Zano', 'Krista Allen', 'Andrew Fiscella..."
996,997,Atlantis: The Lost Empire,54.8,45.2,2001,"['Action', 'Adventure', 'Animation', 'Family',...","['Gary Trousdale', 'Kirk Wise']","['Tab Murphy', 'Kirk Wise', 'Gary Trousdale', ...","['Michael J. Fox', 'Jim Varney', 'Corey Burton..."
997,998,Inside Man,52.4,47.6,2006,"['Crime', 'Drama', 'Mystery', 'Thriller']",['Spike Lee'],['Russell Gewirtz'],"['Denzel Washington', 'Clive Owen', 'Jodie Fos..."
998,999,The Waterboy,13.2,86.8,1998,"['Comedy', 'Sport']",['Frank Coraci'],"['Tim Herlihy', 'Adam Sandler']","['Adam Sandler', 'Kathy Bates', 'Henry Winkler..."


### Feature engineering

#### Genre

- The genres are divided into four groups(Very High, High, Medium, Low) based on their frequency of appearance in the dataset.
- Assigning scores to each group: Very High: 3 points, High: 2 points, Medium: 1 point., Low: 0 points.
- For each list of genres in a movie, determine the group to which each genre belongs and add the corresponding points.

In [3]:
data['Genre'] = data['Genre'].apply(eval)
genre_groups = {
    "Very High": ["Adventure", "Action", "Comedy"], 
    "High": ["Drama", "Fantasy", "Thriller", "Sci-Fi", "Family", ],      
    "Medium": ["Animation", "Romance", "Crime", "Mystery"],
    "Low": [ "Musical", "Horror", "Biography", "War", "Music", "History", "Sport", "Western", "Documentary"]
}

group_scores = {
    "Very High": 3,
    "High": 2,
    "Medium": 1,
    "Low": 0
}

def calculate_group_score(genres):
    score = 0
    for genre in genres:
        for group, genre_list in genre_groups.items():
            if genre in genre_list:
                score += group_scores[group]
                break 
    return score

engineered_data = data.copy()
engineered_data["genres_score"] = data["Genre"].apply(calculate_group_score)
engineered_data = engineered_data[['genres_score']]
engineered_data


Unnamed: 0,genres_score
0,10
1,10
2,10
3,3
4,8
...,...
995,2
996,13
997,6
998,3


#### Main cast popularity

- A count is calculated for how often each actor appears in the dataset. This provides a measure of each actor's prevalence or "popularity" within the dataset.
- For each movie, the appearances of all actors in its cast are summed. This aggregated score represents the overall popularity of the movie's main cast.

In [4]:
cast_counter = Counter(actor for cast_list in data['Cast'].apply(eval) for actor in cast_list)

engineered_data['main_cast_popularity'] = data['Cast'].apply(lambda x: sum(cast_counter[actor] for actor in eval(x)))
engineered_data

Unnamed: 0,genres_score,main_cast_popularity
0,10,28
1,10,43
2,10,18
3,3,20
4,8,14
...,...,...
995,2,4
996,13,10
997,6,21
998,3,21


#### Director

- A count is created to track how many movies each director is credited for within the dataset. This provides a measure of their activity or prominence.
- For each movie, the contributions of all directors involved are summed up. This score reflects the cumulative movie count of all directors associated with the film.

In [5]:
data['Director'] = data['Director'].apply(eval)

director_counter = Counter(director for director_list in data['Director'] for director in director_list)

engineered_data['director_movie_count'] = data['Director'].apply(
    lambda directors: sum(director_counter[director] for director in directors)
)

engineered_data


Unnamed: 0,genres_score,main_cast_popularity,director_movie_count
0,10,28,5
1,10,43,8
2,10,18,5
3,3,20,5
4,8,14,6
...,...,...,...
995,2,4,1
996,13,10,6
997,6,21,1
998,3,21,2


#### Writer

- A tally is generated to track how many movies each writer has been credited for in the dataset. This provides insight into the frequency and prominence of each writer.
- For each movie, the total number of contributions from all credited writers is calculated by summing up their respective counts. This score represents the collective influence of the writers involved in the movie.

In [6]:
data['Writer'] = data['Writer'].apply(eval)
 
writer_counter = Counter(writer for writer_list in data['Writer'] for writer in writer_list)

engineered_data['writer_movie_count'] = data['Writer'].apply(
    lambda writers:sum(writer_counter[writer] for writer in writers)
)

engineered_data

Unnamed: 0,genres_score,main_cast_popularity,director_movie_count,writer_movie_count
0,10,28,5,11
1,10,43,8,295
2,10,18,5,75
3,3,20,5,11
4,8,14,6,38
...,...,...,...,...
995,2,4,1,2
996,13,10,6,31
997,6,21,1,1
998,3,21,2,13


#### Having top 25% writer in each genre

- The dataset is expanded to associate each writer with every genre they have worked on, creating a detailed mapping of writer contributions across genres.
- For each genre, the number of contributions (movies written) is calculated for every writer.
- Writers whose contribution counts fall in the top 25% (above the third quartile) for a specific genre are identified as "top writers" for that genre.
- For each movie, the associated genres and writers are analyzed. If any of the writers for the movie belong to the top 25% writers in at least one of its genres, the movie is flagged.

In [7]:
flat_writer_data = data.explode('Genre').explode('Writer')

writer_counts = flat_writer_data.groupby(['Genre', 'Writer']).size().reset_index(name='Count')

def get_top_25_percent_writers_for_genre(genre):
    genre_data = writer_counts[writer_counts['Genre'] == genre]
    
    Q3 = genre_data['Count'].quantile(0.75)
    
    top_25_writers = genre_data[genre_data['Count'] > Q3]['Writer'].tolist()
    return top_25_writers
def calculate_top_25_percent_writer_for_row(row):
    top_25_percent_writer = 0
    
    for genre in row['Genre']:
        top_25_writers = get_top_25_percent_writers_for_genre(genre)
        
        if any(writer in top_25_writers for writer in row['Writer']):
            top_25_percent_writer = 1
            break
    
    return top_25_percent_writer

engineered_data['has_top_25%_writer/genre'] = data.apply(calculate_top_25_percent_writer_for_row, axis=1)
engineered_data

Unnamed: 0,genres_score,main_cast_popularity,director_movie_count,writer_movie_count,has_top_25%_writer/genre
0,10,28,5,11,1
1,10,43,8,295,1
2,10,18,5,75,1
3,3,20,5,11,0
4,8,14,6,38,1
...,...,...,...,...,...
995,2,4,1,2,0
996,13,10,6,31,1
997,6,21,1,1,0
998,3,21,2,13,1


#### Having top 25% directors in each genre

- The dataset is expanded to associate each director with every genre they have directed movies for, providing a detailed mapping of director contributions across genres.
-For each genre, the number of movies directed by each director is calculated.
- Directors whose contribution counts are in the top 25% for a specific genre (above the third quartile) are classified as "top directors" for that genre.
- For each movie, the associated genres and directors are examined. If any of the directors of the movie belong to the top 25% of directors for at least one of its genres, the movie is flagged accordingly.

In [8]:
flat_director_data = data.explode('Genre').explode('Director')

director_counts = flat_director_data.groupby(['Genre', 'Director']).size().reset_index(name='Count')

def get_top_25_percent_directors_for_genre(genre):
    genre_data = director_counts[director_counts['Genre'] == genre]
    
    Q3 = genre_data['Count'].quantile(0.75)
    
    top_25_directors = genre_data[genre_data['Count'] > Q3]['Director'].tolist()
    return top_25_directors

def calculate_top_25_percent_director_for_row(row):
    top_25_percent_director = 0
    
    for genre in row['Genre']:
        top_25_directors = get_top_25_percent_directors_for_genre(genre)
        
        if any(director in top_25_directors for director in row['Director']):
            top_25_percent_director = 1
            break
    
    return top_25_percent_director

engineered_data['has_top_25%_director/genre'] = data.apply(calculate_top_25_percent_director_for_row, axis=1)

engineered_data

1    601
0    399
Name: has_top_25%_director/genre, dtype: int64

#### Director and cast collaborations

- For each movie, the number of collaborations between each director and each actor is tracked. A collaboration is counted each time a director and actor work together on a movie.
- For each movie, the collaboration score is calculated by summing the collaboration counts for each director-actor pair involved in the movie.
- This score reflects the number of times the director has worked with each actor across all movies in the dataset.

In [9]:
director_cast_collaborations = Counter()

for _, row in data.iterrows():
    directors = row['Director']  
    cast = row['Cast'] 

    for director in directors:
        for actor in cast:
            director_cast_collaborations[(director, actor)] += 1

engineered_data['director_cast_collaborations'] = data.apply(
    lambda row: sum(director_cast_collaborations[(director, actor)] 
                    for director in row['Director']
                    for actor in row['Cast']), axis=1
)
engineered_data

#### Ratio revenue

- For each movie, the ratio of foreign revenue percentage to domestic revenue percentage is computed. The ratio is calculated by dividing the "Foreign %" by the "Domestic %".
- If the domestic revenue percentage is zero, the ratio is set to zero to avoid division by zero errors.

In [10]:
engineered_data['foreign_to_domestic_ratio'] = data.apply(
    lambda row: row['Foreign %'] / row['Domestic %'] if row['Domestic %'] != 0 else 0,
    axis=1
)
engineered_data

Unnamed: 0,genres_score,main_cast_popularity,director_movie_count,writer_movie_count,has_top_25%_writer/genre,has_top_25%_director/genre,director_cast_collaborations,foreign_to_domestic_ratio
0,10,28,5,11,1,1,1492,2.717472
1,10,43,8,295,1,1,2196,2.257329
2,10,18,5,75,1,1,1436,2.389831
3,3,20,5,11,0,0,1305,2.355705
4,8,14,6,38,1,1,1450,1.212389
...,...,...,...,...,...,...,...,...
995,2,4,1,2,0,0,225,1.801120
996,13,10,6,31,1,1,1518,1.212389
997,6,21,1,1,0,0,310,1.100840
998,3,21,2,13,1,0,527,0.152074


## Data preparation

### Categorize rank

Movies are classified into three categories based on their rank:
- High: Rank less than or equal to 300.
- Medium: Rank between 301 and 700 (inclusive).
- Low: Rank above 700.

In [11]:
def categorize_rank(rank):
    if rank <= 300:
        return 2  # High
    elif 301 <= rank <= 700:
        return 1  # Medium
    else:
        return 0  # Low
    
engineered_data["Rank Category"] = data["Rank"].apply(categorize_rank)
engineered_data

Unnamed: 0,genres_score,main_cast_popularity,director_movie_count,writer_movie_count,has_top_25%_writer/genre,has_top_25%_director/genre,director_cast_collaborations,foreign_to_domestic_ratio,Rank Category
0,10,28,5,11,1,1,1492,2.717472,2
1,10,43,8,295,1,1,2196,2.257329,2
2,10,18,5,75,1,1,1436,2.389831,2
3,3,20,5,11,0,0,1305,2.355705,2
4,8,14,6,38,1,1,1450,1.212389,2
...,...,...,...,...,...,...,...,...,...
995,2,4,1,2,0,0,225,1.801120,0
996,13,10,6,31,1,1,1518,1.212389,0
997,6,21,1,1,0,0,310,1.100840,0
998,3,21,2,13,1,0,527,0.152074,0


### HypthesisHypthesis

We use two hypothesis testing methods to calculate the p-value for each attribute in relation to the "Rank Category":
1. Pearson Correlation Test:
- Used for continuous variables to assess the linear relationship between the attribute and "Rank Category".
- The p-value indicates the strength and significance of the correlation.
2. Chi-Square Test:
- Applied to categorical variables to evaluate the association with "Rank Category".
- The p-value indicates whether the observed differences are statistically significant.

In [15]:
attributes = [
    "genres_score",
    "main_cast_popularity",
    "director_movie_count",
    "writer_movie_count",
    "has_top_25%_writer/genre",
    "has_top_25%_director/genre",
    "director_cast_collaborations",
    "foreign_to_domestic_ratio"
]

p_values = {}

for attr in attributes:
    if engineered_data[attr].nunique() > 2:  # Continuous attribute
        corr, p_val = pearsonr(engineered_data[attr], engineered_data["Rank Category"])
    else:  # Binary/categorical attribute
        contingency_table = pd.crosstab(engineered_data[attr], engineered_data["Rank Category"])
        _, p_val, _, _ = chi2_contingency(contingency_table)
    
    p_values[attr] = p_val

for attr, p_val in p_values.items():
    print(f"P-value for {attr}: {p_val}")

P-value for genres_score: 3.183052292833676e-15
P-value for main_cast_popularity: 3.9809870085336134e-07
P-value for director_movie_count: 3.3456813613442225e-10
P-value for writer_movie_count: 1.901876898863242e-25
P-value for has_top_25%_writer/genre: 4.1287182509207585e-15
P-value for has_top_25%_director/genre: 4.4022404542948555e-10
P-value for director_cast_collaborations: 1.0340767933834285e-10
P-value for foreign_to_domestic_ratio: 0.10588355137374442


#### Conclusion
- Most attributes have very small p-values, indicating statistically significant relationships with Rank Category.
- foreign_to_domestic_ratio is the only exception, with a p-value above 0.05, suggesting no significant association with Rank Category.

### Data separation as X and y

We use MinMaxScaler - this normalization technique transforms features into a specific range, typically from 0 to 1. It ensures that each feature has the same scale, making it easier for machine learning algorithms to process the data more effectively. This is important because features with different scales can introduce bias into the model.

In [16]:
X = engineered_data.drop(columns=['Rank Category'])
scaler = MinMaxScaler()
columns_to_scale = X.columns
X[columns_to_scale] = scaler.fit_transform(X[columns_to_scale])
X

Unnamed: 0,genres_score,main_cast_popularity,director_movie_count,writer_movie_count,has_top_25%_writer/genre,has_top_25%_director/genre,director_cast_collaborations,foreign_to_domestic_ratio
0,0.555556,0.452830,0.190476,0.032895,1.0,1.0,0.237064,0.002720
1,0.555556,0.735849,0.333333,0.967105,1.0,1.0,0.363115,0.002260
2,0.555556,0.264151,0.190476,0.243421,1.0,1.0,0.227037,0.002392
3,0.166667,0.301887,0.190476,0.032895,0.0,0.0,0.203581,0.002358
4,0.444444,0.188679,0.238095,0.121711,1.0,1.0,0.229543,0.001214
...,...,...,...,...,...,...,...,...
995,0.111111,0.000000,0.000000,0.003289,0.0,0.0,0.010206,0.001803
996,0.722222,0.113208,0.238095,0.098684,1.0,1.0,0.241719,0.001214
997,0.333333,0.320755,0.000000,0.000000,0.0,0.0,0.025425,0.001102
998,0.166667,0.320755,0.047619,0.039474,1.0,0.0,0.064279,0.000152


In [13]:
y = engineered_data['Rank Category']
y

0      2
1      2
2      2
3      2
4      2
      ..
995    0
996    0
997    0
998    0
999    0
Name: Rank Category, Length: 1000, dtype: int64

### Data splitting

80% train - 20% test

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=38)

## Building predicting models

### Logistic Regression

In [15]:

log_model = LogisticRegression()
log_model.fit(X_train, y_train)

param_grid = [
    {'penalty':['l1','l2','elasticnet','none'],
    'C' : [1, 10, 100, 1000],
    'solver': ['lbfgs','newton-cg','lbfgs','sag','saga'],
    'max_iter'  : [100,1000,2500,5000]
}
]

grid_search = GridSearchCV(log_model, param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)


432 fits failed out of a total of 960.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
96 fits failed with the following error:
Traceback (most recent call last):
  File "c:\ProgramData\anaconda3\envs\min_ds-env\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\ProgramData\anaconda3\envs\min_ds-env\lib\site-packages\sklearn\linear_model\_logistic.py", line 1162, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "c:\ProgramData\anaconda3\envs\min_ds-env\lib\site-packages\sklearn\linear_model\_logistic.py", line 54, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalti

In [16]:
y_pred = log_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

report = classification_report(y_test, y_pred)
print("\nClassification Report:")
print(report)


print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.4950

Classification Report:
              precision    recall  f1-score   support

           0       0.53      0.25      0.34        63
           1       0.38      0.63      0.47        67
           2       0.68      0.59      0.63        70

    accuracy                           0.49       200
   macro avg       0.53      0.49      0.48       200
weighted avg       0.54      0.49      0.49       200

Best parameters: {'C': 1, 'max_iter': 100, 'penalty': 'l2', 'solver': 'lbfgs'}
Best cross-validation score: 0.46999990613237214
Accuracy: 0.4950
Classification Report:
               precision    recall  f1-score   support

           0       0.53      0.25      0.34        63
           1       0.38      0.63      0.47        67
           2       0.68      0.59      0.63        70

    accuracy                           0.49       200
   macro avg       0.53      0.49      0.48       200
weighted avg       0.54      0.49      0.49       200



In [17]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state=38)
rf_model.fit(X_train, y_train)

param_grid ={ 'n_estimators': [100, 200, 300], 
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['auto', 'sqrt', 'log2']
}

grid_search = GridSearchCV(rf_model, param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)


  warn(


In [18]:

y_pred = rf_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

report = classification_report(y_test, y_pred)
print("\nClassification Report:")
print(report)


print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.5000

Classification Report:
              precision    recall  f1-score   support

           0       0.56      0.48      0.51        63
           1       0.37      0.51      0.43        67
           2       0.65      0.51      0.58        70

    accuracy                           0.50       200
   macro avg       0.53      0.50      0.51       200
weighted avg       0.53      0.50      0.51       200

Best parameters: {'criterion': 'entropy', 'max_depth': None, 'max_features': 'auto', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}
Best cross-validation score: 0.47501243746069294
Accuracy: 0.5050
Classification Report:
               precision    recall  f1-score   support

           0       0.55      0.38      0.45        63
           1       0.38      0.54      0.44        67
           2       0.67      0.59      0.63        70

    accuracy                           0.51       200
   macro avg       0.53      0.50      0.51       200
weighted 

In [19]:
dt_model = DecisionTreeClassifier(random_state=38)

dt_model.fit(X_train, y_train)

y_pred = dt_model.predict(X_test)

print("Accuracy (Baseline):", accuracy_score(y_test, y_pred))
print("\nClassification Report (Baseline):\n", classification_report(y_test, y_pred))
param_grid = {
    'criterion': ['gini', 'entropy'],       
    'max_depth': [None, 5, 10, 15, 20],    
    'min_samples_split': [2, 5, 10],         
    'min_samples_leaf': [1, 2, 5]           
}

grid_search = GridSearchCV(dt_model, param_grid, cv=3, n_jobs=-1)

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
print("Best parameters:", best_params)

best_dt_model = grid_search.best_estimator_
y_pred_tuned = best_dt_model.predict(X_test)

print("Accuracy (Tuned):", accuracy_score(y_test, y_pred_tuned))
print("\nClassification Report (Tuned):\n", classification_report(y_test, y_pred_tuned))


Accuracy (Baseline): 0.4

Classification Report (Baseline):
               precision    recall  f1-score   support

           0       0.45      0.38      0.41        63
           1       0.33      0.42      0.37        67
           2       0.46      0.40      0.43        70

    accuracy                           0.40       200
   macro avg       0.41      0.40      0.40       200
weighted avg       0.41      0.40      0.40       200

Best parameters: {'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 5, 'min_samples_split': 2}
Accuracy (Tuned): 0.545

Classification Report (Tuned):
               precision    recall  f1-score   support

           0       0.58      0.52      0.55        63
           1       0.45      0.46      0.46        67
           2       0.61      0.64      0.62        70

    accuracy                           0.55       200
   macro avg       0.55      0.54      0.54       200
weighted avg       0.55      0.55      0.54       200



## Reflection

### 22127359 - Chu Thuy Quynh

Difficulties:
 - When I started building the model, I faced difficulties because most of my data was non-numeric, making it challenging to apply existing libraries

Learned:
- I learned that feature engineering could help transform my raw data into numerical features, making it more suitable for modeling. By creating these engineered features, I was able to convert categorical data into numeric representations that machine learning models can better understand and utilize.