# `DATA MODELING`

## **TOPIC: FILMS ANALYSIS**

`Group ID`: 17

`Group Member`:
- 22127404_Tạ Minh Thư
- 22127359_Chu Thúy Quỳnh
- 22127302_Nguyễn Đăng Nhân

## **OBJECTIVES**

In this phase, we build models that predict a movie's revenue rank category, using Random Forest and Decision Tree classifiers.

The phase ends with each member's individual reflection and a team reflection.

## **IMPLEMENTATION WITH EXPLANATION**

### **SETUP AND IMPORTS**

In [31]:
import pandas as pd
from collections import Counter
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import pearsonr, chi2_contingency

### **READ DATA**

The `cleaned_data.csv` file is read for data modeling.

In [26]:
data = pd.read_csv('cleaned_data.csv', sep=",")
data

Unnamed: 0,Rank,Title,Foreign %,Domestic %,Year,Genre,Director,Writer,Cast
0,1,Avatar,73.1,26.9,2009,"['Action', 'Adventure', 'Fantasy', 'Sci-Fi']",['James Cameron'],['James Cameron'],"['Sam Worthington', 'Zoe Saldana', 'Sigourney ..."
1,2,Avengers: Endgame,69.3,30.7,2019,"['Action', 'Adventure', 'Drama', 'Sci-Fi']","['Anthony Russo', 'Joe Russo']","['Christopher Markus', 'Stephen McFeely', 'Sta...","['Robert Downey Jr.', 'Chris Evans', 'Mark Ruf..."
2,3,Avatar: The Way of Water,70.5,29.5,2022,"['Action', 'Adventure', 'Fantasy', 'Sci-Fi']",['James Cameron'],"['James Cameron', 'Rick Jaffa', 'Amanda Silver...","['Sam Worthington', 'Zoe Saldana', 'Sigourney ..."
3,4,Titanic,70.2,29.8,1997,"['Drama', 'Romance']",['James Cameron'],['James Cameron'],"['Leonardo DiCaprio', 'Kate Winslet', 'Billy Z..."
4,5,Star Wars: Episode VII - The Force Awakens,54.8,45.2,2015,"['Action', 'Adventure', 'Sci-Fi']",['J.J. Abrams'],"['Lawrence Kasdan', 'J.J. Abrams', 'Michael Ar...","['Daisy Ridley', 'John Boyega', 'Oscar Isaac',..."
...,...,...,...,...,...,...,...,...,...
995,996,The Final Destination,64.3,35.7,2009,"['Horror', 'Thriller']",['David R. Ellis'],"['Eric Bress', 'Jeffrey Reddick']","['Nick Zano', 'Krista Allen', 'Andrew Fiscella..."
996,997,Atlantis: The Lost Empire,54.8,45.2,2001,"['Action', 'Adventure', 'Animation', 'Family',...","['Gary Trousdale', 'Kirk Wise']","['Tab Murphy', 'Kirk Wise', 'Gary Trousdale', ...","['Michael J. Fox', 'Jim Varney', 'Corey Burton..."
997,998,Inside Man,52.4,47.6,2006,"['Crime', 'Drama', 'Mystery', 'Thriller']",['Spike Lee'],['Russell Gewirtz'],"['Denzel Washington', 'Clive Owen', 'Jodie Fos..."
998,999,The Waterboy,13.2,86.8,1998,"['Comedy', 'Sport']",['Frank Coraci'],"['Tim Herlihy', 'Adam Sandler']","['Adam Sandler', 'Kathy Bates', 'Henry Winkler..."


### **FEATURE ENGINEERING**

In this step, we will extract valuable features for data modeling.

#### **Genre**

- The genres are divided into four groups (Very High, High, Medium, Low) based on their frequency of appearance in the dataset.
- Each group is assigned a score: Very High (3 points), High (2 points), Medium (1 point), Low (0 points).
- For each list of genres in a movie, determine the group to which each genre belongs and add the corresponding points.
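
As a quick check of the scoring rule, the sketch below flattens the group table into a single genre-to-points lookup and scores three movies from the table above. The names `GENRE_POINTS` and `genre_score` are illustrative and not part of the notebook, and genres missing from every group are assumed to contribute 0 points.

```python
# Illustrative sketch: flat genre -> points lookup (names are hypothetical).
genre_groups = {
    "Very High": ["Adventure", "Action", "Comedy"],
    "High": ["Drama", "Fantasy", "Thriller", "Sci-Fi", "Family"],
    "Medium": ["Animation", "Romance", "Crime", "Mystery"],
    "Low": ["Musical", "Horror", "Biography", "War", "Music",
            "History", "Sport", "Western", "Documentary"],
}
group_scores = {"Very High": 3, "High": 2, "Medium": 1, "Low": 0}

# Invert the group table once so each genre maps directly to its points.
GENRE_POINTS = {
    genre: group_scores[group]
    for group, genres in genre_groups.items()
    for genre in genres
}

def genre_score(genres):
    """Sum the points of every genre; unknown genres count as 0 (assumption)."""
    return sum(GENRE_POINTS.get(genre, 0) for genre in genres)

print(genre_score(["Action", "Adventure", "Fantasy", "Sci-Fi"]))  # Avatar -> 10
print(genre_score(["Drama", "Romance"]))                          # Titanic -> 3
print(genre_score(["Comedy", "Sport"]))                           # The Waterboy -> 3
```

These values agree with the `genres_score` output shown for rows 0, 3, and 998.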

In [27]:
data['Genre'] = data['Genre'].apply(eval)
genre_groups = {
    "Very High": ["Adventure", "Action", "Comedy"],
    "High": ["Drama", "Fantasy", "Thriller", "Sci-Fi", "Family"],
    "Medium": ["Animation", "Romance", "Crime", "Mystery"],
    "Low": ["Musical", "Horror", "Biography", "War", "Music", "History", "Sport", "Western", "Documentary"]
}

group_scores = {
    "Very High": 3,
    "High": 2,
    "Medium": 1,
    "Low": 0
}

def calculate_group_score(genres):
    score = 0
    for genre in genres:
        for group, genre_list in genre_groups.items():
            if genre in genre_list:
                score += group_scores[group]
                break 
    return score

engineered_data = data.copy()
engineered_data["genres_score"] = data["Genre"].apply(calculate_group_score)
engineered_data = engineered_data[['genres_score']]
engineered_data


Unnamed: 0,genres_score
0,10
1,10
2,10
3,3
4,8
...,...
995,2
996,13
997,6
998,3


#### **Main cast popularity**

- A count is calculated for how often each actor appears in the dataset. This provides a measure of each actor's prevalence or "popularity" within the dataset.
- For each movie, the appearances of all actors in its cast are summed. This aggregated score represents the overall popularity of the movie's main cast.
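
To make the two steps concrete, here is a minimal sketch on made-up casts (the actor names and the `casts` list are hypothetical, not taken from the dataset). Note that each actor's count includes the movie being scored, so an actor who appears only once still contributes 1.

```python
from collections import Counter

# Hypothetical toy casts, one list per movie.
casts = [
    ["Actor A", "Actor B"],
    ["Actor A", "Actor C"],
    ["Actor A"],
]

# Step 1: how many movies each actor appears in across the toy dataset.
appearances = Counter(actor for cast in casts for actor in cast)

# Step 2: a movie's score is the sum of its actors' appearance counts.
popularity = [sum(appearances[actor] for actor in cast) for cast in casts]
print(popularity)  # [4, 4, 3]
```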

In [28]:
cast_counter = Counter(actor for cast_list in data['Cast'].apply(eval) for actor in cast_list)

engineered_data['main_cast_popularity'] = data['Cast'].apply(lambda x: sum(cast_counter[actor] for actor in eval(x)))
engineered_data

Unnamed: 0,genres_score,main_cast_popularity
0,10,28
1,10,43
2,10,18
3,3,20
4,8,14
...,...,...
995,2,4
996,13,10
997,6,21
998,3,21


#### **Director**

- A count is created to track how many movies each director is credited for within the dataset. This provides a measure of their activity or prominence.
- For each movie, the contributions of all directors involved are summed up. This score reflects the cumulative movie count of all directors associated with the film.

In [29]:
data['Director'] = data['Director'].apply(eval)

director_counter = Counter(director for director_list in data['Director'] for director in director_list)

engineered_data['director_movie_count'] = data['Director'].apply(
    lambda directors: sum(director_counter[director] for director in directors)
)

engineered_data


Unnamed: 0,genres_score,main_cast_popularity,director_movie_count
0,10,28,5
1,10,43,8
2,10,18,5
3,3,20,5
4,8,14,6
...,...,...,...
995,2,4,1
996,13,10,6
997,6,21,1
998,3,21,2


#### **Writer**

- A tally is generated to track how many movies each writer has been credited for in the dataset. This provides insight into the frequency and prominence of each writer.
- For each movie, the total number of contributions from all credited writers is calculated by summing up their respective counts. This score represents the collective influence of the writers involved in the movie.

In [30]:
data['Writer'] = data['Writer'].apply(eval)
 
writer_counter = Counter(writer for writer_list in data['Writer'] for writer in writer_list)

engineered_data['writer_movie_count'] = data['Writer'].apply(
    lambda writers:sum(writer_counter[writer] for writer in writers)
)

engineered_data

SyntaxError: invalid syntax. Perhaps you forgot a comma? (<string>, line 1)

#### **Having top 25% writers in each genre**

- The dataset is expanded to associate each writer with every genre they have worked on, creating a detailed mapping of writer contributions across genres.
- For each genre, the number of contributions (movies written) is calculated for every writer.
- Writers whose contribution counts fall in the top 25% (above the third quartile) for a specific genre are identified as "top writers" for that genre.
- For each movie, the associated genres and writers are analyzed. If any of the writers for the movie belong to the top 25% writers in at least one of its genres, the movie is flagged.

In [None]:
flat_writer_data = data.explode('Genre').explode('Writer')

writer_counts = flat_writer_data.groupby(['Genre', 'Writer']).size().reset_index(name='Count')

def get_top_25_percent_writers_for_genre(genre):
    genre_data = writer_counts[writer_counts['Genre'] == genre]
    
    Q3 = genre_data['Count'].quantile(0.75)
    
    top_25_writers = genre_data[genre_data['Count'] > Q3]['Writer'].tolist()
    return top_25_writers

def calculate_top_25_percent_writer_for_row(row):
    top_25_percent_writer = 0
    
    for genre in row['Genre']:
        top_25_writers = get_top_25_percent_writers_for_genre(genre)
        
        if any(writer in top_25_writers for writer in row['Writer']):
            top_25_percent_writer = 1
            break
    
    return top_25_percent_writer

engineered_data['has_top_25%_writer/genre'] = data.apply(calculate_top_25_percent_writer_for_row, axis=1)
engineered_data
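
The helper above recomputes a genre's third quartile for every row. One possible variation, sketched here on hypothetical toy rows, computes the set of top writers per genre once and then looks it up per movie. The names `toy`, `top_by_genre`, and `has_top_writer` are illustrative assumptions; the thresholding rule (count strictly above Q3 within the genre) is the same one described above.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real data (already parsed lists).
toy = pd.DataFrame({
    "Genre": [["Action", "Drama"], ["Action"], ["Drama"], ["Action"]],
    "Writer": [["A"], ["A", "B"], ["C"], ["A"]],
})

# One row per (movie, genre, writer) combination, then count per pair.
flat = toy.explode("Genre").explode("Writer")
counts = flat.groupby(["Genre", "Writer"]).size().reset_index(name="Count")

# Third quartile of the counts within each genre, broadcast to every row.
q3 = counts.groupby("Genre")["Count"].transform(lambda s: s.quantile(0.75))
top = counts[counts["Count"] > q3]

# genre -> set of top writers, computed a single time.
top_by_genre = top.groupby("Genre")["Writer"].apply(set).to_dict()

def has_top_writer(row):
    """Return 1 if any writer is a top writer in any of the movie's genres."""
    return int(any(writer in top_by_genre.get(genre, set())
                   for genre in row["Genre"]
                   for writer in row["Writer"]))

flags = toy.apply(has_top_writer, axis=1)
print(top_by_genre)
print(flags.tolist())
```

In the toy data, every Drama writer has a count of 1, so Q3 is 1 and no Drama writer passes the strict `>` comparison. The same effect can occur in the real data for genres where most writers are credited only once.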

#### **Having top 25% directors in each genre**

- The dataset is expanded to associate each director with every genre they have directed movies for, providing a detailed mapping of director contributions across genres.
- For each genre, the number of movies directed by each director is calculated.
- Directors whose contribution counts are in the top 25% for a specific genre (above the third quartile) are classified as "top directors" for that genre.
- For each movie, the associated genres and directors are examined. If any of the directors of the movie belong to the top 25% of directors for at least one of its genres, the movie is flagged accordingly.

In [None]:
flat_director_data = data.explode('Genre').explode('Director')

director_counts = flat_director_data.groupby(['Genre', 'Director']).size().reset_index(name='Count')

def get_top_25_percent_directors_for_genre(genre):
    genre_data = director_counts[director_counts['Genre'] == genre]
    
    Q3 = genre_data['Count'].quantile(0.75)
    
    top_25_directors = genre_data[genre_data['Count'] > Q3]['Director'].tolist()
    return top_25_directors

def calculate_top_25_percent_director_for_row(row):
    top_25_percent_director = 0
    
    for genre in row['Genre']:
        top_25_directors = get_top_25_percent_directors_for_genre(genre)
        
        if any(director in top_25_directors for director in row['Director']):
            top_25_percent_director = 1
            break
    
    return top_25_percent_director

engineered_data['has_top_25%_director/genre'] = data.apply(calculate_top_25_percent_director_for_row, axis=1)

engineered_data

#### **Director and cast collaborations**

- For each movie, the number of collaborations between each director and each actor is tracked. A collaboration is counted each time a director and actor work together on a movie.
- For each movie, the collaboration score is calculated by summing the collaboration counts for each director-actor pair involved in the movie.
- This score reflects the number of times the director has worked with each actor across all movies in the dataset.
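
A minimal sketch of the pair counting on made-up movies follows (the `movies` list and all names in it are hypothetical). It assumes the directors and the cast are already lists of names; iterating over an unparsed string would count characters rather than actors.

```python
from collections import Counter

# Hypothetical toy movies: (directors, cast), both already parsed into lists.
movies = [
    (["Director X"], ["Actor A", "Actor B"]),
    (["Director X"], ["Actor A"]),
    (["Director Y"], ["Actor A"]),
]

# Count every (director, actor) pair across all movies.
pair_counts = Counter(
    (director, actor)
    for directors, cast in movies
    for director in directors
    for actor in cast
)

# A movie's score sums the counts of all of its director-actor pairs.
scores = [
    sum(pair_counts[(director, actor)] for director in directors for actor in cast)
    for directors, cast in movies
]
print(scores)  # [3, 2, 1]
```

Because every pair is counted at least once (for the movie itself), the lowest possible score equals the number of director-actor pairs in the movie.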

In [None]:
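# `Cast` is still stored as a string in `data`, so parse it into lists first;
# iterating over the raw string would count characters instead of actors.
cast_lists = data['Cast'].apply(lambda c: eval(c) if isinstance(c, str) else c)

director_cast_collaborations = Counter()

# Count how often each (director, actor) pair appears across all movies
for directors, cast in zip(data['Director'], cast_lists):
    for director in directors:
        for actor in cast:
            director_cast_collaborations[(director, actor)] += 1

# Sum the pair counts for every director-actor pair in each movie
engineered_data['director_cast_collaborations'] = [
    sum(director_cast_collaborations[(director, actor)]
        for director in directors
        for actor in cast)
    for directors, cast in zip(data['Director'], cast_lists)
]
engineered_data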

#### **Ratio revenue**

- For each movie, the ratio of foreign revenue percentage to domestic revenue percentage is computed. The ratio is calculated by dividing the "Foreign %" by the "Domestic %".
- If the domestic revenue percentage is zero, the ratio is set to zero to avoid division by zero errors.
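
As a worked example, the rule can be written as a small standalone function and applied to two rows from the table above. The function name is illustrative; the percentages are the ones shown for Avatar and The Waterboy.

```python
def foreign_to_domestic_ratio(foreign_pct, domestic_pct):
    """Foreign % divided by Domestic %, or 0 when Domestic % is zero."""
    return foreign_pct / domestic_pct if domestic_pct != 0 else 0

print(round(foreign_to_domestic_ratio(73.1, 26.9), 2))  # Avatar -> 2.72
print(round(foreign_to_domestic_ratio(13.2, 86.8), 2))  # The Waterboy -> 0.15
print(foreign_to_domestic_ratio(100.0, 0.0))            # guard case -> 0
```

A ratio above 1 means the movie earned more abroad than domestically, and a ratio below 1 means the opposite.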

In [None]:
engineered_data['foreign_to_domestic_ratio'] = data.apply(
    lambda row: row['Foreign %'] / row['Domestic %'] if row['Domestic %'] != 0 else 0,
    axis=1
)
engineered_data

### **DATA PREPARATION**

#### **Categorize rank**

Movies are classified into three categories based on their rank:
- High: Rank less than or equal to 300.
- Medium: Rank between 301 and 700 (inclusive).
- Low: Rank above 700.

In [None]:
def categorize_rank(rank):
    if rank <= 300:
        return 2  # High
    elif 301 <= rank <= 700:
        return 1  # Medium
    else:
        return 0  # Low
    
engineered_data["Rank Category"] = data["Rank"].apply(categorize_rank)
engineered_data

#### **Hypothesis testing**

We use two hypothesis testing methods to calculate the p-value for each attribute in relation to the "Rank Category":
1. Pearson Correlation Test:
    - Used for continuous variables to assess the linear relationship between the attribute and "Rank Category".
    - The correlation coefficient measures the strength and direction of the relationship; the p-value indicates whether it is statistically significant.
2. Chi-Square Test:
    - Applied to categorical variables to evaluate the association with "Rank Category".
    - The p-value indicates whether the observed differences are statistically significant.
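
As a small illustration on made-up numbers (not taken from the dataset), the sketch below shows what each test returns. `pearsonr` gives the correlation coefficient together with its p-value, and `chi2_contingency` takes a table of counts and gives the statistic, the p-value, the degrees of freedom, and the expected counts. The variables `x`, `y`, and `flag` are hypothetical.

```python
import pandas as pd
from scipy.stats import pearsonr, chi2_contingency

# Hypothetical toy data: a numeric feature, a 0/1/2 category, and a binary flag.
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 1, 1, 2, 2]
flag = [0, 0, 1, 0, 1, 1]

# Pearson: r is the strength and direction, p_corr is the significance.
r, p_corr = pearsonr(x, y)

# Chi-square: build the contingency table of counts, then test it.
table = pd.crosstab(pd.Series(flag, name="flag"), pd.Series(y, name="category"))
chi2, p_chi2, dof, expected = chi2_contingency(table)

print(f"Pearson: r={r:.3f}, p={p_corr:.4f}")
print(f"Chi-square: chi2={chi2:.3f}, p={p_chi2:.4f}, dof={dof}")
```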

In [None]:
attributes = [
    "genres_score",
    "main_cast_popularity",
    "director_movie_count",
    "writer_movie_count",
    "has_top_25%_writer/genre",
    "has_top_25%_director/genre",
    "director_cast_collaborations",
    "foreign_to_domestic_ratio"
]

p_values = {}

for attr in attributes:
    if engineered_data[attr].nunique() > 2:  # Continuous attribute
        corr, p_val = pearsonr(engineered_data[attr], engineered_data["Rank Category"])
    else:  # Binary/categorical attribute
        contingency_table = pd.crosstab(engineered_data[attr], engineered_data["Rank Category"])
        _, p_val, _, _ = chi2_contingency(contingency_table)
    
    p_values[attr] = p_val

for attr, p_val in p_values.items():
    print(f"P-value for {attr}: {p_val}")

#### **Conclusion**
- Most attributes have very small p-values, indicating statistically significant relationships with `Rank Category`.
- `foreign_to_domestic_ratio` is the only exception, with a p-value above 0.05, suggesting no significant association with `Rank Category`.

#### **Data separation**

We use `MinMaxScaler`, a normalization technique that rescales each feature to a fixed range, 0 to 1 by default, so that all features share the same scale. This matters most for scale-sensitive algorithms, where features with larger numeric ranges can otherwise dominate the result; tree-based models are much less affected by feature scale.
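
The transformation itself is `(x - min) / (max - min)` applied per feature. Below is a minimal pure-Python sketch of that formula, using the first five `genres_score` values shown earlier as input. The helper name `min_max_scale` is illustrative, and the real scaler takes the minimum and maximum over the whole column rather than over five rows.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: map everything to 0.0 (assumption for this sketch).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# First five genres_score values from the output above.
scaled = min_max_scale([10, 10, 10, 3, 8])
print([round(v, 3) for v in scaled])  # [1.0, 1.0, 1.0, 0.0, 0.714]
```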

In [None]:
X = engineered_data.drop(columns=['Rank Category'])
scaler = MinMaxScaler()
columns_to_scale = X.columns
X[columns_to_scale] = scaler.fit_transform(X[columns_to_scale])
X

In [None]:
y = engineered_data['Rank Category']
y

#### **Data splitting**

The data is split into 80% for training and 20% for testing.

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=38)
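
One possible variation, which is not what the notebook does, is to split first and fit the scaler on the training part only, so that the test rows do not influence the scaling range. The sketch below uses randomly generated stand-in data; `X_demo`, `y_demo`, and the `stratify` argument (which keeps class proportions similar in both parts) are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in data: 100 rows, 3 features, 3 classes.
rng = np.random.default_rng(38)
X_demo = rng.uniform(0, 50, size=(100, 3))
y_demo = rng.integers(0, 3, size=100)

# Split first, keeping class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=38, stratify=y_demo
)

# Fit the scaler on the training part only, then reuse it on the test part.
scaler_demo = MinMaxScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)  # (80, 3) (20, 3)
```

With this ordering, scaled test values can fall slightly outside the 0 to 1 range, which is expected.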

### **BUILDING MODELS**

#### **Random Forest**

First, we train a `RandomForestClassifier` on the training data, make predictions on the test data, and evaluate the model by calculating its accuracy and generating a classification report, which includes precision, recall, and F1-score for each class. This untuned model serves as a baseline for comparison.

In [None]:
rf_model = RandomForestClassifier(random_state=38)
rf_model.fit(X_train, y_train)

y_pred = rf_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

report = classification_report(y_test, y_pred)
print("\nClassification Report:")
print(report)
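
To show how the per-class numbers in a classification report are derived, here is a minimal pure-Python sketch on made-up labels. The function `precision_recall_f1` and the two label lists are hypothetical; the labels use the same 0/1/2 encoding as `Rank Category`.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall and F1 computed from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical true and predicted categories for eight movies.
y_true_demo = [2, 2, 1, 1, 1, 0, 0, 0]
y_pred_demo = [2, 1, 1, 1, 0, 0, 0, 2]

print(precision_recall_f1(y_true_demo, y_pred_demo, label=1))
```

For class 1 in this toy example there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all come out at 2/3.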


Next, we tune the hyperparameters of the `RandomForestClassifier` using `GridSearchCV` with a defined `param_grid`. The search evaluates every combination of hyperparameters using 3-fold cross-validation and selects the best model, which we then evaluate on the test data by calculating accuracy and generating a classification report.

In [None]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    # 'auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
    'max_features': ['sqrt', 'log2']
}

grid_search = GridSearchCV(rf_model, param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred))

#### **Decision Tree**

First, we train a `DecisionTreeClassifier` on the training data, make predictions on the test data, and evaluate the model by calculating accuracy and generating a classification report. This untuned model serves as a baseline for comparison.

In [None]:
dt_model = DecisionTreeClassifier(random_state=38)

dt_model.fit(X_train, y_train)

y_pred = dt_model.predict(X_test)

print("Accuracy (Baseline):", accuracy_score(y_test, y_pred))
print("\nClassification Report (Baseline):\n", classification_report(y_test, y_pred))

Next, we tune the hyperparameters of the `DecisionTreeClassifier` using `GridSearchCV` to find the best combination of `criterion`, `max_depth`, `min_samples_split`, and `min_samples_leaf` through 3-fold cross-validation. The best model found is then used to make predictions on the test data, and we report its accuracy and classification report.

In [None]:
param_grid = {
    'criterion': ['gini', 'entropy'],       
    'max_depth': [None, 5, 10, 15, 20],    
    'min_samples_split': [2, 5, 10],         
    'min_samples_leaf': [1, 2, 5]           
}

grid_search = GridSearchCV(dt_model, param_grid, cv=3, n_jobs=-1)

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
print("Best parameters:", best_params)

best_dt_model = grid_search.best_estimator_
y_pred_tuned = best_dt_model.predict(X_test)

print("Accuracy (Tuned):", accuracy_score(y_test, y_pred_tuned))
print("\nClassification Report (Tuned):\n", classification_report(y_test, y_pred_tuned))
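
To get a feel for the cost of the search, the sketch below counts how many models a grid search trains for the Decision Tree grid above. The helper `grid_size` is an illustrative name, not a scikit-learn function; it assumes one fit per parameter combination per fold, plus the single final refit that `GridSearchCV` performs by default (`refit=True`).

```python
from math import prod

def grid_size(param_grid, cv):
    """Return (number of parameter combinations, total number of model fits)."""
    combos = prod(len(values) for values in param_grid.values())
    # One fit per combination per fold, plus one final refit of the best model.
    return combos, combos * cv + 1

dt_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 5, 10, 15, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}
print(grid_size(dt_grid, cv=3))  # (90, 271)
```

The count grows multiplicatively with every added hyperparameter, which is why larger grids such as the Random Forest one take noticeably longer to search.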


## **REFLECTION**

### **22127359 - Chu Thúy Quỳnh**

Difficulties:
- When I started building the model, most of my data was non-numeric, which made it difficult to apply existing libraries directly.

Learned:
- I learned that feature engineering could help transform my raw data into numerical features, making it more suitable for modeling. By creating these engineered features, I was able to convert categorical data into numeric representations that machine learning models can better understand and utilize.

### **22127302 - Nguyễn Đăng Nhân**

Difficulties: 
- I had trouble with the data preprocessing part, especially handling missing values: I had to compute the ratio and got it wrong at first.
- I also had trouble with visualization: the bar plot in question 5 has many Director-Cast pairs, which makes the legend take up too much space.

Learned:
- I now understand how to handle missing values and how to visualize data more effectively by reducing the size of the legend or the number of Director-Cast pairs shown. I can also use different visualization techniques to make the data easier to understand.

### **22127404 - Tạ Minh Thư**

Difficulties: 
- I had difficulty improving the models' accuracy because, at first, I had not considered that the size of the dataset can also affect accuracy.

Learned:
- I learned more ways to improve model accuracy and how to choose suitable models for each problem and dataset.

### **TEAM REFLECTION**

The accuracy of our models is still not good enough. With more time, we would crawl more data for model training and improve the preprocessing for data modeling.