# Introduction To Data Science - Final Project

## Group members:

| Name                   | ID       |
|------------------------|----------|
| Pham Dang Son Ha       | 21127206 |
| Tran Dai Nien          | 21127664 |
| Nguyen Cao Khoi        | 21127632 |
| Nguyen Phan Minh Triet | 21126007 |

## Table of Contents

1. [Data Collection](#data-collection)

2. [Data Preprocessing and Exploration](#data-preprocessing-and-exploration)

3. [Data Modeling](#data-modeling)

4. [References](#references)

## Data Collection

### 1. Set-up environment

#### Import Required Libraries: Import the necessary Python libraries - requests, BeautifulSoup, pandas, and time.

In [1]:
# ignore warning
import warnings
warnings.filterwarnings('ignore')

#Necessary Packages
!pip install bs4
!pip install requests
!pip install pandas
!pip install numpy
!pip install xgboost

import time
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import os
import seaborn as sns

from sklearn.model_selection import train_test_split # Split dataset into train set and test set
from sklearn.model_selection import KFold
from sklearn.model_selection import RandomizedSearchCV # Hyperparameter tuning
from sklearn.model_selection import cross_val_score # Evaluate model

# Regression Models
from sklearn.svm import SVR
from xgboost import XGBRegressor
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import BayesianRidge
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor

# Metrics
from sklearn.metrics import mean_squared_error 
from sklearn.metrics import mean_absolute_error

# TF-IDF: relative frequency of a word in a document (used for suggesting related films)
from sklearn.feature_extraction.text import TfidfVectorizer

# calculate the similarity between two vectors in the feature space
from sklearn.metrics.pairwise import linear_kernel




### 2. Collect data from a website by parsing HTML

#### List of collected information

Information related to the movie, including:

- `names`: Movie titles.
- `years`: Release years of the movies.
- `genres`: Categories or genres the movies belong to.
- `lengths`: Duration or length of the movies.
- `rating_stars`: Ratings received by the movies.
- `metascores`: Metascores assigned to the movies (if available).
- `votes`: Total votes accumulated by the movies.
- `grosses`: Box office gross earnings of the movies (if available).
- `directors`: Directors of the movies.
- `stars`: Lead actors/actresses in the movies.
- `descriptions`: Synopsis or descriptions of the movies.

#### Data Collection Process:

- Identify the URL of the webpage containing the list of movies to be scraped.
- Use the requests library to send GET requests to each page of the IMDb website.
- Parse the HTML of the webpage using BeautifulSoup to extract information about the movies.
- Iterate through each movie to collect details such as title, release year, genre, rating, Metascore, votes, earnings, director, main cast, and description.
- Store the collected information in a DataFrame using the pandas library.
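
As a rough sketch of the paging step, the helper below builds the per-page URLs that the GET requests would target. The name `page_urls` is hypothetical and not part of the notebook. It assumes, as the notebook does, that the list URL accepts a `page` query parameter, and it rounds the page count up so that a partial last page is not skipped.

```python
import math

def page_urls(base_url, num_movies, movies_per_page=100):
    """Return one URL per results page (hypothetical helper for illustration)."""
    # Round up so that e.g. 250 movies at 100 per page yields 3 pages, not 2
    num_pages = math.ceil(num_movies / movies_per_page)
    return [f"{base_url}&page={page}" for page in range(1, num_pages + 1)]

base = "https://www.imdb.com/list/ls051785783/?st_dt=&mode=detail&sort=list_order,asc"
urls = page_urls(base, 250)
print(len(urls))
print(urls[-1])
```

Each URL can then be fetched with `requests.get` and parsed with BeautifulSoup, with a short pause between requests.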

In [None]:
def collect_data(base_url, num_movies, movies_per_page=100):
    # Initialize lists for storing data
    names = []
    years = []
    genres = []
    lengths = []
    rating_stars = []
    metascores = []
    votes = []
    grosses = []
    directors = []
    stars = []
    descriptions = []

    # Iterate over the specified number of pages
    for page in range(1, int(num_movies / movies_per_page) + 1):
        try:
            # Construct the URL for the current page
            url = f"{base_url}&page={page}"
            
            # Send a GET request to the URL
            response = requests.get(url)
            time.sleep(2)  # Respectful crawling by adding delay

            # Check if the response status code is 200 (OK)
            if response.status_code == 200:
                # Parse the HTML content of the page
                soup = BeautifulSoup(response.text, 'html.parser')

                # Find all movie containers on the page
                movies = soup.find_all('div', class_='lister-item-content')

                # Process each movie
                for movie in movies:
                    # Extract movie details
                    name = movie.find('h3').find('a').text.strip()
                    year = movie.find('span', class_='lister-item-year').text.strip('()')
                    genre = movie.find('span', class_='genre').text.strip()
                    length = movie.find('span', class_='runtime').text.strip().split()[0]
                    rating = movie.find('span', class_='ipl-rating-star__rating').text.strip()

                    # Some movies might not have a metascore
                    metascore_tag = movie.find('span', class_='metascore')
                    metascore = metascore_tag.text.strip() if metascore_tag else 'N/A'

                    # Extract votes and gross, if available
                    nv_tags = movie.find_all('span', attrs={'name': 'nv'})
                    vote = nv_tags[0].text if nv_tags else 'N/A'
                    gross = nv_tags[1].text if len(nv_tags) > 1 else 'N/A'

                    # Extract director and stars
                    director, *star_list = movie.find_all('a', href=lambda href: href and 'name/nm' in href)
                    director = director.text
                    stars_str = ', '.join(star.text for star in star_list)

                    # Extract description
                    description = movie.find_all('p', class_='')[-1].text.strip()

                    # Append the extracted data to respective lists
                    names.append(name)
                    years.append(year)
                    genres.append(genre)
                    lengths.append(length)
                    rating_stars.append(rating)
                    metascores.append(metascore)
                    votes.append(vote)
                    grosses.append(gross)
                    directors.append(director)
                    stars.append(stars_str)
                    descriptions.append(description)

            else:
                print(f"Failed to process page {page}: Status code {response.status_code}")

        except requests.exceptions.RequestException as e:
            print(f"Request error on page {page}: {e}")
        except Exception as e:
            print(f"Error on page {page}: {e}")

    # Create a DataFrame with the collected data
    data = pd.DataFrame({
        'Name': names,
        'Year': years,
        'Genre': genres,
        'Length': lengths,
        'Rating': rating_stars,
        'Metascore': metascores,
        'Votes': votes,
        'Gross': grosses,
        'Director': directors,
        'Stars': stars,
        'Description': descriptions
    })

    return data

#### Collecting Movie Data from IMDb

- Identify the URL of the webpage containing the list of movies to be scraped.

- Use the collect_data function to gather information from the webpage based on the desired number of pages and movies.


In [None]:
# Specify the URL containing the list of movies
url = "https://www.imdb.com/list/ls051785783/?st_dt=&mode=detail&sort=list_order,asc"

# Scrape the data

if os.path.isfile('data_film.csv'):
  data_film = pd.read_csv('data_film.csv')
else: 
  data_film = collect_data(url, 1500, 100)

#### Data Storage

- Store the collected data in a CSV file named data_film.csv using data_film.to_csv().
- Read the data from the CSV file into a new DataFrame (data_film) using pd.read_csv().

In [None]:
#Save to csv file with name data_film.csv
# Save the DataFrame to a CSV file without including the index
data_film.to_csv("data_film.csv", index=False)

# Read the CSV file into a new DataFrame called data_film
data_film = pd.read_csv("data_film.csv")

# Display the 'data_film' DataFrame
data_film

## Data Preprocessing And Exploration

### 1) How many rows and columns are there?

- There are 1500 rows and 11 columns.

In [None]:
data_film.shape

### 2) What is the meaning of each row?

- Each row represents one film and its information (name, year, genres, ...).

In [None]:
data_film.sample(5)

### 3) Are there duplicated rows?

There are no duplicated rows.

In [None]:
data_film.duplicated().sum()

### 4) What is the meaning of each column?

- `Name`: Movie title.
- `Year`: Release year of the movie.
- `Genre`: Categories or genres the movie belongs to.
- `Length`: Duration or length of the movie.
- `Rating`: Rating received by the movie.
- `Metascore`: Metascore assigned to the movie (if available).
- `Votes`: Total votes accumulated by the movie.
- `Gross`: Box office gross earnings of the movie (if available).
- `Director`: Director of the movie.
- `Stars`: Lead actors/actresses in the movie.
- `Description`: Synopsis or description of the movie.

### 5) What is the current data type of each column? Are there any columns having inappropriate data types?

In [None]:
data_film.dtypes

In [None]:
data_film.sample(1)

- Some columns have inappropriate types: `Year`, `Votes` and `Gross`.

- `Year` and `Votes` can simply be converted to **numeric**. For `Gross`, we first strip the `$` and `M` characters and rename the column to `Gross(M$)` to indicate that the unit is millions of dollars.
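
To illustrate the conversion on a small scale, here is a minimal sketch with made-up values in the same shape as the scraped strings. It assumes missing entries appear as the text `N/A`, as in the scraper above. Passing `regex=False` matters for `$`, which is otherwise a regular-expression anchor in pandas versions where `str.replace` defaults to regex mode.

```python
import pandas as pd

# Toy values in the same shape as the scraped strings (made up for illustration)
raw = pd.DataFrame({
    "Votes": ["2,345,678", "12,004", "N/A"],
    "Gross": ["$28.34M", "N/A", "$0.05M"],
})

# regex=False treats '$' and ',' as literal characters rather than regex syntax
votes_clean = raw["Votes"].str.replace(",", "", regex=False)
gross_clean = (
    raw["Gross"].str.replace("$", "", regex=False).str.replace("M", "", regex=False)
)

# errors='coerce' turns anything that is still not a number (here 'N/A') into NaN
cleaned = pd.DataFrame({
    "Votes": pd.to_numeric(votes_clean, errors="coerce"),
    "Gross(M$)": pd.to_numeric(gross_clean, errors="coerce"),
})
print(cleaned)
```

Values that cannot be parsed end up as `NaN`, so they show up later in the missing-value analysis instead of raising an error here.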

In [None]:
# remove ',' in Votes
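data_film.Votes = data_film.Votes.str.replace(',', '', regex=False)

# remove $ and M from Gross and store the result in a new column `Gross(M$)`
# (regex=False so that '$' is treated as a literal character, not a regex anchor)
data_film['Gross(M$)'] = data_film.Gross.str.replace('M', '', regex=False).str.replace('$', '', regex=False)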

# drop Gross
data_film.drop(columns='Gross', inplace=True)

In [None]:
data_film.sample(1)

- Convert them into **numeric**

In [None]:
to_numeric_cols = ['Year', 'Votes', 'Gross(M$)']

for col in to_numeric_cols:
  data_film[col] = pd.to_numeric(data_film[col], errors='coerce')

data_film.dtypes

### 6) With each numerical column, how are values distributed?

- All numerical columns:

In [None]:
numerical_cols = data_film.columns[(data_film.dtypes != 'object')]
numerical_cols

#### 6.1) What is the percentage of missing values?

- Number of non-null values in each numerical column (the missing count is the row total minus this number)

In [None]:
data_film[numerical_cols].info()

- The percentage of missing values for each column

In [None]:
sorted_data_film_missing_percentage = (data_film.isnull().mean() * 100).sort_values()
plt.barh(sorted_data_film_missing_percentage.index, sorted_data_film_missing_percentage.values)
plt.xlabel('Missing values (%)')
plt.show()


> `Gross(M$)` and `Metascore` have a high percentage of missing values.

- The number of missing `Year` values is not significant, so we can drop the observations with a missing `Year`.

In [None]:
data_film.dropna(subset=['Year'], inplace=True)

- For `Gross(M$)` and `Metascore`, we impute missing values with the column median.

In [None]:
cols = ['Gross(M$)', 'Metascore']

for col in cols:
    # Assign back instead of calling fillna(inplace=True) on a column selection
    data_film[col] = data_film[col].fillna(data_film[col].median())

- Check whether any missing values are left

In [None]:
data_film.isna().sum()

- The distribution of numerical attributes

In [None]:
data_film[numerical_cols].hist(figsize=(15, 10))
plt.tight_layout()
plt.show()

- General statistics of numerical attributes

In [None]:
data_film[numerical_cols].describe()

### 7) With each categorical column, how are values distributed?

- Quick glance at categorical columns

In [None]:
categorical_cols = data_film.columns[data_film.dtypes == 'object']

categorical_cols

- Convert object to category

In [None]:
new_data_film = data_film.copy()
for label, content in data_film.items():
    if pd.api.types.is_string_dtype(content):
        new_data_film[label] = pd.Categorical(content).codes + 1  # missing values get code -1, so +1 maps them to 0

In [None]:
new_data_film.head(1)

In [None]:
data_film.head(1)
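
To see what the integer encoding does, here is a small sketch on made-up director names, where `None` stands in for a missing entry. It only illustrates how `pd.Categorical(...).codes` behaves.

```python
import pandas as pd

# Made-up values; None stands in for a missing entry
directors = pd.Series(["Nolan", "Kubrick", None, "Nolan"])

# Categories are sorted, so 'Kubrick' -> 0 and 'Nolan' -> 1; missing values get -1
codes = pd.Categorical(directors).codes
# Shifting by one reserves 0 for "missing" and keeps real categories positive
shifted = codes + 1
print(list(codes))    # [1, 0, -1, 1]
print(list(shifted))  # [2, 1, 0, 2]
```

Note that the resulting integers carry an arbitrary alphabetical order, which some models may interpret as a meaningful ranking.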

- Missing values = 0

In [None]:
# Check the original columns: the encoded copy stores missing values as 0, never as NaN
for col in categorical_cols:
    print(col, data_film[col].isna().sum())

- The distribution of `Name`: the counts do not differ significantly, so we won't plot this

In [None]:
data_film.Name.value_counts()

- The distribution of `Genre`

In [None]:
data_film.Genre.value_counts()

- The distribution of the top 20 `Genre` values

In [None]:
genre_counts = data_film.Genre.value_counts()[:20]
plt.barh(genre_counts.index, genre_counts.values)
plt.show()

- The distribution of `Director`

In [None]:
data_film.Director.value_counts()

- The distribution of top 20 `Director`

In [None]:
director_counts = data_film.Director.value_counts()[:20]
plt.barh(director_counts.index, director_counts.values);
plt.show()

- The distribution of `Stars`: the 10 most frequent lead-cast combinations and the number of movies each appears in

In [None]:
star_counts = data_film.Stars.value_counts()[:10].sort_values()
plt.barh(star_counts.index, star_counts.values);
plt.xticks(rotation=90)
plt.show();

- The distribution of `Description`: each description is different so we won't plot this

In [None]:
data_film.Description.value_counts()

#### 7.1) What is the percentage of missing values?

- As we can see, none of the categorical columns has missing values, so the percentage is 0%.

### 8) Are they abnormal?

- After considering the distribution of each attribute, we conclude that the values are good enough to use for model training and to draw insights from.

### 9) Questions for exploration

#### 9.1) Question 1: Which genres should a director choose to make?

- Purposes: By analyzing historical data on movie ratings and box office performance across different genres, we can identify which genres tend to be more well-received by audiences and/or more profitable. This can help the director choose a genre that aligns with their goals, whether it’s to create a critically acclaimed film, a box office hit, or both.

- Approaches: 

    + `Analyzing on Rating`: We can group the data by genres and calculate the average Rating (or Metascore) for each genre. This can give us an idea of which genres are generally more well-received.

    + `Analyzing on Gross`: We can group the data by genres and calculate the average Gross for each genre. This can give us an idea of which genres are generally more profitable.
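
The two approaches above can be sketched on a toy table as follows. The movies and numbers are made up. The column names mirror the notebook's `Genre`, `Rating` and `Gross(M$)`, and the sketch assumes genres are stored as a comma-separated string.

```python
import pandas as pd

# Made-up movies for illustration only
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Genre": ["Crime, Drama", "Drama", "Comedy, Drama"],
    "Rating": [9.0, 8.0, 7.0],
    "Gross(M$)": [100.0, 20.0, 60.0],
})

# One row per (movie, genre): split the comma-separated string, then explode
per_genre = toy.assign(Genre_Split=toy["Genre"].str.split(",")).explode("Genre_Split")
per_genre["Genre_Split"] = per_genre["Genre_Split"].str.strip()

# Average both measures within each genre
summary = per_genre.groupby("Genre_Split")[["Rating", "Gross(M$)"]].mean()
print(summary.sort_values("Rating", ascending=False))
```

A movie with several genres contributes to each of them, so the genre averages are not independent of one another.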

##### Exploring about the genre of the movies

In [None]:
# We'll split the genres column into separate rows
df_genres = data_film.assign(Genre_Split=data_film['Genre'].str.split(',')).explode('Genre_Split')

# Remove leading and trailing spaces
df_genres['Genre_Split'] = df_genres['Genre_Split'].str.strip()

# Get unique genres
unique_genres = df_genres['Genre_Split'].unique()

# Print the number of unique genres
print("Number of unique genres:", len(unique_genres))

# Print all unique genres
print("Unique genres:", unique_genres)


- Explanation:

    + `Crime`: These films revolve around the sinister actions of criminals, mobsters, bank robbers, underworld figures, and ruthless hoodlums who operate outside the law, stealing and murdering their way through life.
    
    + `Drama`: Drama films are serious presentations or stories with settings or life situations that portray realistic characters in conflict with either themselves, others, or forces of nature.

    + `Romance`: Romance films are love stories that focus on passion, emotion, and the affectionate romantic involvement of the main characters, and the journey that their love takes them through dating, courtship or marriage.
    
    + `War`: War films acknowledge the horror and heartbreak of war, letting the actual combat fighting or conflict (against nations or humankind) provide the primary plot or background for the action of the film.

    + `Comedy`: Comedies are light-hearted plots consistently and deliberately designed to amuse and provoke laughter (with one-liners, jokes, etc.) by exaggerating the situation, the language, action, relationships and characters.

    + `Mystery`: These are types of films that make us think and keep us guessing. They deal with our sense of unease and anxiety.

    + `Action`: Action films usually include high energy, big-budget physical stunts and chases, possibly with rescues, battles, fights, escapes, destructive crises (floods, explosions, natural disasters, fires, etc.), non-stop motion, spectacular rhythm and pacing, and adventurous, often two-dimensional ‘good-guy’ heroes (or recently, heroines) battling ‘bad guys’ - all designed for pure audience escapism.
    
    + `Western`: Westerns are the major defining genre of the American film industry, a nostalgic eulogy to the early days of the expansive, untamed American frontier (the borderline between civilization and the wilderness).
    
    + `Thriller`: Thrillers are tension-laden, complex, mysterious, and often involve crime (solution of a murder, disappearance, theft, etc.).
    
    + `Adventure`: Adventure films are exciting stories, with new experiences or exotic locales, very similar to or often paired with the action film genre.

    + `Family`: These are films that are designed to be suitable for all ages.

    + `Fantasy`: Fantasy films are films with fantastic themes, usually involving magic, supernatural events, mythology, folklore, or exotic fantasy worlds.
    
    + `Film-Noir`: Film noir is a cinematic term used primarily to describe stylish Hollywood crime dramas, particularly those that emphasize cynical attitudes and sexual motivations.
    
    + `Biography`: These films depict and dramatize the life of an important historical personage (or group) from the past or present era.
    
    + `History`: Films in this genre focus on recreating a specific and important period or event in history.
    
    + `Sci-Fi`: Science fiction films are often quasi-scientific, visionary and imaginative - complete with heroes, aliens, distant planets, impossible quests, improbable settings, fantastic places, great dark and shadowy villains, futuristic technology, unknown and unknowable forces, and extraordinary monsters (‘things or creatures from space’), either created by mad scientists or by nuclear havoc.
    
    + `Sport`: Sports films are those that have a sports setting (football or baseball stadium, arena, or the Olympics, etc.), competitive event (the ‘big game,’ ‘fight,’ or ‘race’), athletes (boxers, racers, etc.), or coach in the storyline.
    
    + `Horror`: Horror films are designed to frighten and to invoke our hidden worst fears, often in a terrifying, shocking finale, while captivating and entertaining us at the same time in a cathartic experience.
    
    + `Music`: These are films that are centered around music and dance.
    
    + `Musical`: Musicals/Dance films are cinematic forms that emphasize and showcase full-scale song and dance routines in a significant way (usually with a musical or dance performance integrated as part of the film narrative, or as an unrealistic “eruption” within the film).

    + `Animation`: Animated films are ones in which individual drawings, paintings, or illustrations are photographed frame by frame.

##### Analyzing on Rating

In [None]:
# Group by genre and calculate the average Rating
average_ratings = df_genres.groupby('Genre_Split')['Rating'].mean()

# Finally, we sort the result in descending order so the genres with the highest average ratings are on top
average_ratings = average_ratings.sort_values(ascending=False)

# Get the top 5 and bottom 5 genres
top_5 = average_ratings.head(5)
bottom_5 = average_ratings.tail(5)

# Combine them into one Series
combined = pd.concat([top_5, bottom_5])

# Create a bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x=combined.index, y=combined.values)
plt.title('Average Ratings of Top 5 Best and Worst Genres')
plt.xlabel('Genre')
plt.ylabel('Average Rating')
plt.xticks(rotation=45)
plt.show()

=> Conclusion: Based on the data analysis, the top five genres with the highest average rating scores are Horror, Animation, Film-Noir, War, and Mystery. This suggests that movies in these genres tend to be more well-received by audiences, as indicated by their higher average ratings.

On the other hand, the genres with the lowest average rating scores are Musical, Comedy, Sport, Romance, Music. This suggests that movies in these genres tend to receive lower ratings from audiences.

This analysis provides a general trend and can be a useful guide for filmmakers when choosing a genre for their next project.

##### Analyzing on Gross

In [None]:
# Group by genres and calculate the average gross
average_gross = df_genres.groupby('Genre_Split')['Gross(M$)'].mean()

# Sort the result in descending order so the genres with the highest average gross are on top
average_gross = average_gross.sort_values(ascending=False)

# Get the top 5 and bottom 5 genres
top_5_gross = average_gross.head(5)
bottom_5_gross = average_gross.tail(5)

# Combine them into one Series
combined_gross = pd.concat([top_5_gross, bottom_5_gross])

# Create a bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x=combined_gross.index, y=combined_gross.values)
plt.title('Average Gross of Top 5 Best and Worst Genres')
plt.xlabel('Genre')
plt.ylabel('Average Gross')
plt.xticks(rotation=45)
plt.show()

=> Conclusion: Based on the data analysis, the top five genres with the highest average gross are Fantasy, Animation, Adventure, Action and Sci-Fi. This suggests that movies in these genres tend to be more profitable, as indicated by their higher gross.

On the other hand, the genres with the lowest average gross are Film-Noir, Horror, Musical, Western and Music. This suggests that movies in these genres tend to earn less and are less popular with audiences.

This analysis provides a general trend and can be a useful guide for filmmakers when choosing a genre for their next project.

##### Final conclusion:

+ Animation tends to be both favoured by audiences and profitable for filmmakers, because it ranks high on both Rating and Gross.

+ Music and Musical generally bring in low gross and are less favoured by audiences.

#### 9.2) Question 2: How have movie lengths and genres evolved over the years?

- Purpose: 

    + To understand the trends and changes in movie lengths and genres over time. This can provide insights into how the film industry has evolved and changed, reflecting shifts in cultural tastes, technological advancements, and other factors.

    + By analyzing this, we can gain a deeper understanding of the film industry’s history and potentially predict future trends. For filmmakers, this information could be useful in making decisions about what type of films to produce. For film enthusiasts or researchers, it could provide interesting insights into the evolution of cinema. 

- Approaches:

    + `Yearly Average Movie Length`: Group the data by year and calculate the average movie length for each year. Plotting these averages over time can show how movie lengths have changed.

    + `Genre Popularity Over Time`: For each year, calculate the number of movies produced in each genre. Plotting these numbers can show how the popularity of different genres has evolved.
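
A minimal sketch of both approaches on made-up rows (already split to one genre per row) might look like this. Binning years with `pd.cut` is one possible way to form time periods, and the choice of two bins here is only for the toy data.

```python
import pandas as pd

# Made-up rows, already split to one genre per row
toy = pd.DataFrame({
    "Year": [1950, 1950, 1990, 1990, 2020, 2020],
    "Length": [80, 100, 120, 110, 95, 105],
    "Genre_Split": ["Drama", "Comedy", "Drama", "Action", "Action", "Drama"],
})

# Average length per release year
avg_length = toy.groupby("Year")["Length"].mean()

# Cut the year range into equal-width periods, then count genres within each
toy["Time_Period"] = pd.cut(toy["Year"], bins=2)
counts = (
    toy.groupby(["Time_Period", "Genre_Split"], observed=True)
    .size()
    .unstack(fill_value=0)
)
print(avg_length)
print(counts)
```

Raw counts depend on how many movies the list contains for each period, so comparing shares within a period is usually more informative than comparing counts across periods.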

##### Yearly Average Movie Length

In [None]:
# Group by year and calculate the average length
average_lengths = data_film.groupby('Year')['Length'].mean()

# Plot the data
plt.figure(figsize=(10, 6))
average_lengths.plot(kind='line')
plt.title('Average Movie Length Over Time')
plt.xlabel('Year')
plt.ylabel('Average Length')
plt.show()

=> Conclusion: Over the whole period, the average movie length decreases from roughly 200 minutes to around 90 minutes.
Around 1920, it drops sharply from about 200 to 75 minutes. After that, it fluctuates between 60 and 120 minutes. In recent years the average length has been falling, and it is likely to keep falling in the future.


##### Genre Popularity Over Time

In [None]:
#Make a copy of df_genres
copy_df_genres = df_genres.copy()

# Get the 6 most frequent genres
top_genres = copy_df_genres['Genre_Split'].value_counts().index[:6]

# Replace all other genres with 'Other'
copy_df_genres['Genre_Split'] = copy_df_genres['Genre_Split'].where(copy_df_genres['Genre_Split'].isin(top_genres), 'Other')

# Convert the 'Year' column to numeric, coercing errors to NaN
copy_df_genres['Year'] = pd.to_numeric(copy_df_genres['Year'], errors='coerce')

# Drop rows with NaN in the 'Year' column
copy_df_genres = copy_df_genres.dropna(subset=['Year'])

# Divide the data into six time periods
copy_df_genres['Time_Period'] = pd.cut(copy_df_genres['Year'], bins=6)

# Count the number of movies in each genre for each time period
genre_counts = copy_df_genres.groupby(['Time_Period', 'Genre_Split']).size()

# Reset the index to make 'Time_Period' and 'Genre_Split' regular columns
genre_counts = genre_counts.reset_index(name='Count')

# Get the unique time periods
time_periods = genre_counts['Time_Period'].unique()

# Create a 2x3 grid of subplots
fig, axs = plt.subplots(2, 3, figsize=(15, 10))

# Loop over the time periods and plot the data in a separate subplot
for i, period in enumerate(time_periods):
    # Get the data for this time period
    data = genre_counts[genre_counts['Time_Period'] == period]
    
    # Calculate the row and column indices for the subplot
    row = i // 3
    col = i % 3
    
    # Create the pie chart in the subplot
    axs[row, col].pie(data['Count'], labels=data['Genre_Split'], autopct='%1.1f%%')
    axs[row, col].set_title('Genre Popularity in ' + str(int(period.left)) + '-' + str(int(period.right)))

# Add a legend
fig.legend(genre_counts['Genre_Split'].unique(), loc='upper left')

# Adjust the layout
plt.tight_layout()
plt.show()

=> Conclusion: In general, Drama is the most popular genre, accounting for 20%-25% of movies in each of the six periods.

Over time, the proportions of Crime, Adventure and Action movies increase. In contrast, the proportions of Comedy and Romance movies fall.

##### Final conclusion:

+ The average movie length decreases over time, and this trend seems likely to continue.

+ Drama is the most popular genre. Over time, the proportions of Crime, Adventure and Action increase, while those of Comedy and Romance decrease.

#### 9.3) Question 3: Does a good movie come with certain actors/actresses?

- Purposes:

    + To understand if the presence of certain actors or actresses in a movie can be a predictor of the movie’s quality or success. It can help in understanding trends in the film industry and can potentially guide decisions about casting for future films.

    + Help viewers find good movies to watch based on the cast alone.

- Approaches:

    + Step 1 - Define Success: We need to define what a “good” movie is. It could be based on Rating, Gross, or a combination of factors.

    + Step 2 - Analyze Movie Success: For each actor/actress, calculate the average Rating and Gross of the movies they’ve starred in. Compare these averages to the overall averages to see if movies featuring these actors/actresses tend to be more successful.

    + Step 3 - Visualize the Results: Create bar plots or other visualizations to compare the success of movies with different actors/actresses. This could help visually identify any trends or patterns.
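
One possible way to combine Rating and Gross into a single score, sketched here on made-up numbers, is to rescale Gross to the same 0-10 range as Rating and then average the two. The helper name `min_max_scale` and the equal weighting are assumptions made for illustration.

```python
def min_max_scale(values, upper=10.0):
    """Rescale values linearly so the minimum maps to 0 and the maximum to `upper`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) * upper for v in values]

# Made-up numbers for illustration
ratings = [8.0, 6.0, 7.0]        # already on a 0-10 scale
grosses = [50.0, 250.0, 150.0]   # in million dollars

gross_points = min_max_scale(grosses)
# One possible "success" definition: the plain average of the two scores
success = [(r + g) / 2 for r, g in zip(ratings, gross_points)]
print(gross_points)  # [0.0, 10.0, 5.0]
print(success)       # [4.0, 8.0, 6.0]
```

Min-max scaling is sensitive to outliers: a single very high gross compresses every other movie toward 0, so the score then depends mostly on Rating for most movies.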

##### Exploring about the Stars in the movie data

In [None]:
# We'll split the Stars column into separate rows
df_stars = data_film.assign(Stars_Split=data_film['Stars'].str.split(',')).explode('Stars_Split')

# Remove leading and trailing spaces
df_stars['Stars_Split'] = df_stars['Stars_Split'].str.strip()

# Get unique stars
unique_stars = df_stars['Stars_Split'].unique()

# Print the number of unique stars
print("Number of unique stars:", len(unique_stars))

# Print all unique stars
print("Unique stars:", unique_stars)

##### Exploring the success of each star

In [None]:
# Calculate the median of 'Gross(M$)'
gross_median = df_stars['Gross(M$)'].median()

# Fill in missing values in 'Gross(M$)' with the median
# (assign back instead of using inplace=True on a column selection)
df_stars['Gross(M$)'] = df_stars['Gross(M$)'].fillna(gross_median)

# Normalize 'Gross(M$)' to a 0-10 scale to create 'Gross_Point'
gross_min, gross_max = df_stars['Gross(M$)'].min(), df_stars['Gross(M$)'].max()
df_stars['Gross_Point'] = (df_stars['Gross(M$)'] - gross_min) / (gross_max - gross_min) * 10

# Create 'Success' column as the average of 'Rating' and 'Gross_Point'
df_stars['Success'] = df_stars[['Rating', 'Gross_Point']].mean(axis=1)

# Group the data by 'Stars_Split' and calculate the average 'Success' for each star
average_success= df_stars.groupby('Stars_Split')['Success'].mean()

# Sort the average success scores in descending order
average_success = average_success.sort_values(ascending=False)

# Get the top 5 stars with the highest and lowest average success scores
top_5 = average_success.head(5)
bottom_5 = average_success.tail(5)

# Combine the top 5 and bottom 5 stars into one Series
combined = pd.concat([top_5, bottom_5])

# Create a bar plot of the average success scores of the top 5 and bottom 5 stars
plt.figure(figsize=(10, 6))
sns.barplot(x=combined.index, y=combined.values)
plt.title('Average Success Score of Top 5 Best and Worst Stars')
plt.xlabel('Star')
plt.ylabel('Average Success Score')
plt.xticks(rotation=45)
plt.show()

##### Final conclusion:

+ The top 5 stars by Success Score are Zoe Saldana, Billy Zane, Rob Minkoff, Henry Thomas and Dee Wallace. Movies in which they play a role tend to be successful, so filmmakers should prioritize these cast members.

+ In contrast, Ken Hughes, Robert Parrish, Richard Talmadge, Joseph McGrath and Anne Twomey have the 5 lowest Success Scores, which suggests that they do not contribute to a movie's success. Filmmakers should avoid these cast members.

#### 9.4) Question 4: Are there any correlations between Rating and Length, Votes, Gross(M$)?

- Purpose: To understand whether there are relationships between a movie’s rating and its length, the number of votes it received, and its gross earnings. This can provide insights into which factors might influence a movie’s rating, which could be useful for filmmakers, critics, and audiences alike. For example, if a factor is positively correlated with rating, filmmakers can focus on that factor to improve movie quality. Conversely, if a factor is negatively correlated with rating, filmmakers can avoid it.

- Approaches: Use a heat map to display the correlations between the attributes Rating, Length, Votes and Gross(M$).
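
For reference, each cell of such a heat map is a correlation coefficient. pandas' `DataFrame.corr()` computes the Pearson coefficient by default, and the sketch below spells that formula out in plain Python on made-up numbers. The function name `pearson` is only illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration
rating = [6.0, 7.0, 8.0, 9.0]
votes = [10, 20, 30, 40]      # rises in step with rating
length = [120, 90, 120, 90]   # no monotone pattern

print(round(pearson(rating, votes), 2))
print(round(pearson(rating, length), 2))
```

The coefficient only measures linear association: a value near 0 does not rule out a non-linear relationship, and a high value does not by itself show that one attribute causes the other.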

In [None]:
# Creating a smaller DataFrame with only the columns we're interested in
df_small = data_film[['Rating', 'Length', 'Votes', 'Gross(M$)']]

# Then, we calculate the correlation matrix
corr_matrix = df_small.corr()

# Finally, we create a heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', cbar=True, square=True)
plt.show()

##### Final conclusion:

- There is a moderate positive correlation between Rating and Votes. This indicates that movies with a high number of Votes tend to have a high Rating.

- Length and Gross have little effect on Rating. In other words, a high-grossing movie is not necessarily a good movie, and vice versa.

#### 9.5) Question 5: Which director - actor pair often works together?

- Purposes: To identify frequent collaborations between directors and actors in the film industry. This can provide insights into professional relationships and recurring partnerships in filmmaking. Certain director-actor pairs often work together because they share a common vision, have a strong working relationship, or have had success together in the past. Identifying these pairs can give us a better understanding of patterns and trends in the film industry.

- Approaches: For each movie, create pairs of the director and each actor. This could involve splitting the ‘Stars’ field if it contains multiple actors, and pairing each actor with the director. Count the occurrence of each director-actor pair to see which pairs occur most frequently. Visualize the most frequent director-actor pairs using a bar plot. This can make it easier to see which pairs work together most often.
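
The pairing-and-counting idea can be sketched without pandas using `collections.Counter`. The records below are made up, and the sketch assumes `Stars` is a comma-separated string, as in the collected data.

```python
from collections import Counter

# Made-up records: (director, comma-separated stars)
movies = [
    ("Director A", "Actor X, Actor Y"),
    ("Director A", "Actor X, Actor Z"),
    ("Director B", "Director B, Actor Y"),
]

pair_counts = Counter()
for director, stars in movies:
    for actor in (s.strip() for s in stars.split(",")):
        # Skip the case where a director is also credited as a star
        if actor != director:
            pair_counts[(director, actor)] += 1

print(pair_counts.most_common(2))
```

Here the pair `("Director A", "Actor X")` is counted twice, once for each movie the two share.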

In [None]:
# Create director-actor pairs for each movie, excluding pairs where the director and actor are the same
director_actor_pairs = df_stars.apply(lambda row: (row['Director'], row['Stars_Split']) if row['Director'] != row['Stars_Split'] else None, axis=1)

# Remove None values
director_actor_pairs = director_actor_pairs.dropna()

# Count the occurrence of each director-actor pair
pair_counts = director_actor_pairs.value_counts()

# Visualize the 10 most frequent pairs as a horizontal bar chart
# (for barh, the x-axis holds the counts and the y-axis holds the pairs)
pair_counts.head(10).plot(kind='barh')
plt.xlabel('Number of Movies')
plt.ylabel('Director-Actor Pair')
plt.title('Most Frequent Director-Actor Pairs')
plt.show()

##### Final Conclusion:

The most frequent director-actor pairs are:

1. Joel Coen - Ethan Coen
2. Woody Allen - Mia Farrow
3. John Ford - Henry Fonda
4. George B. Seitz - Mickey Rooney
5. George B. Seitz - Cecilia Parker

These pairs have worked together on numerous projects, suggesting a strong professional relationship and a shared creative vision. Their repeated collaborations could also indicate that these pairs have found a successful formula that resonates with audiences and critics alike.

#### 9.6) Question 6: What are the related movies that people may have interest in?

- **Purpose:**

  When users finish watching a movie in our application, we want to keep them engaged by recommending similar films that captivate their interest and entice them to continue using the app.
- **Approaches:**
  1. Use the `Description` feature of each movie.
  2. Compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each description.
  3. Compute a pairwise similarity score between the TF-IDF vectors.
  4. Return the 10 movies most similar to a given movie.

- Some descriptions:

In [None]:
data_film['Description'].head(5)

- TF-IDF matrix: each row represents a film and each column represents a word from the descriptions.

There are **6764 distinct words** in the vocabulary built from the descriptions.

In [None]:
TF_IDF = TfidfVectorizer(stop_words='english') # remove 'the', 'a',...

TF_IDF_matrix = TF_IDF.fit_transform(data_film['Description'])

TF_IDF_matrix.shape

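- Compute the **cosine similarity score** with a dot product. `TfidfVectorizer` L2-normalizes each row by default, so the dot product of two rows equals their cosine similarity.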

In [None]:
similarities = linear_kernel(TF_IDF_matrix, TF_IDF_matrix)
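
Using `linear_kernel` for cosine similarity relies on the TF-IDF rows having unit length, which is the default behaviour of `TfidfVectorizer` (`norm='l2'`). The sketch below checks this on three toy descriptions; the sentences are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy descriptions, made up for illustration
docs = [
    "a jury debates the fate of a young man",
    "a young girl travels to a magical land",
    "a detective investigates a murder in the city",
]

tfidf = TfidfVectorizer(stop_words="english")   # norm='l2' is the default
matrix = tfidf.fit_transform(docs)

dot = linear_kernel(matrix, matrix)        # plain dot products
cos = cosine_similarity(matrix, matrix)    # explicit cosine similarity

print(np.allclose(dot, cos))               # both matrices agree
print(np.allclose(np.diag(dot), 1.0))      # every film has similarity 1 with itself
```

The dot product skips the normalization step that `cosine_similarity` repeats, which is why it is the cheaper choice once the rows are already normalized.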

- Function used to return related movies

In [None]:
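# map each movie name to its row position in data_film
# (positions match the rows of `similarities`; for duplicated names keep the first row)
indexes = pd.Series(np.arange(len(data_film)), index=data_film['Name'])
indexes = indexes[~indexes.index.duplicated(keep='first')]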

In [None]:
indexes.head(5)

In [None]:
def get_related_movie(name, similarities=similarities):
    # get index of 'name' film
    idx = indexes[name]

    # get the similarities of all the films with `name` film
    similarity_scores = list(enumerate(similarities[idx]))

    # sort similarity score and get top 10 films with highest similarity score with `name` film
    similarity_scores = sorted(similarity_scores, key= lambda x: x[1], reverse=True)[1:11]

    # get the index of top 10 films
    movie_indexes = [score[0] for score in similarity_scores]

    # get the name of top 10 films
    return data_film['Name'].iloc[movie_indexes]
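
The slice `[1:11]` above assumes that the queried film itself is ranked first. As a hedged alternative, the query position can be excluded explicitly before taking the top scores. The helper `top_k_similar` below is hypothetical and is shown on a toy similarity row.

```python
import numpy as np

def top_k_similar(similarity_row, query_idx, k=10):
    """Return positions of the k highest scores, excluding the query itself."""
    scores = np.asarray(similarity_row, dtype=float).copy()
    scores[query_idx] = -np.inf              # never recommend the query movie
    k = min(k, len(scores) - 1)              # cannot return more films than exist
    order = np.argsort(-scores, kind="stable")
    return order[:k]

row = np.array([1.0, 0.2, 0.9, 0.0, 0.5])    # toy similarities to movie 0
print(top_k_similar(row, query_idx=0, k=3))  # [2 4 1]
```

The returned positions can be passed to `.iloc` in the same way as `movie_indexes` in the function above.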

- Suggest related movies

In [None]:
get_related_movie('12 Người Đàn Ông Giận Dữ')

In [None]:
get_related_movie('Phù Thủy Xứ Oz')

> **Conclusion**: based on the importance of words in the **description**, we can suggest related movies to capture users' attention, which in turn benefits the application

## Data Modeling

### Problem Statement 1

- **The question our team wants to answer is:** How much can a new film be expected to gross?
- **Purpose:**
  When making a film, we naturally want to know how much money it could earn. The success or failure of a movie depends on many factors: the release date, budget, star power, marketing, and so on. With the data our team has collected, we will answer the question above by predicting the gross from a single feature: Rating.
  By predicting the gross, filmmakers can judge whether the film is feasible, and the prediction gives them a clearer view of the production plan and the distribution stage.
- **How:** To solve this problem, we will use regression models to predict the gross of a new movie.

### Target Variable

In [None]:
plt.hist(new_data_film['Gross(M$)']);

### Correlation

In [None]:
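# restrict the correlation to numeric columns (text columns such as Name and Description cannot be correlated)
sns.heatmap(new_data_film.corr(numeric_only=True))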

### Data preparation

- We take the preprocessed data from `data_film`.

In [None]:
filmdata_df = data_film.copy()
filmdata_df.head()

### Data Preprocessing

- We fill missing `Rating` and `Gross(M$)` values with 0, then remove every row whose Gross equals 0.

In [None]:
feature_cols = ['Rating', 'Gross(M$)']

modeling_data = filmdata_df[feature_cols].copy()
modeling_data['Gross(M$)'] = modeling_data['Gross(M$)'].fillna(0)
modeling_data.Rating = modeling_data.Rating.fillna(0)
modeling_data = modeling_data[~(modeling_data['Gross(M$)'] == 0)]
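
Filling a missing `Rating` with 0 keeps the row but leaves an artificial rating in the feature column. One possible alternative, shown here only as a sketch on a made-up toy frame, is to drop rows where either value is missing and then remove the zero-gross rows.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN marks a missing entry
toy = pd.DataFrame({
    "Rating": [8.1, np.nan, 7.4, 6.9],
    "Gross(M$)": [120.5, 30.0, np.nan, 0.0],
})

# Keep only rows where both columns are present and the gross is non-zero
clean = toy.dropna(subset=["Rating", "Gross(M$)"])
clean = clean[clean["Gross(M$)"] != 0]
print(clean)
```

Whether dropping or filling is preferable depends on how many ratings are actually missing in the collected data.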

- Split the dataset into train and test sets (80% / 20%) and build the `train_data` and `test_data` DataFrames.

In [None]:
X, y = modeling_data.drop(['Gross(M$)'], axis= 1), modeling_data['Gross(M$)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2)
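
Without a fixed seed, the split above changes on every run, so the reported errors change as well. A minimal sketch of a reproducible split on toy arrays follows; the seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)   # toy features
y_toy = np.arange(10)                  # toy target

# The same seed gives the same 80/20 split on every run
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(len(X_tr1), len(X_te1))          # 8 2
print(np.array_equal(y_te1, y_te2))    # True
```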

In [None]:
train_data = X_train.join(y_train)
train_data

In [None]:
test_data = X_test.join(y_test)
test_data

### Creating Model

- For the regression model, our team chooses Linear Regression and Ridge Regression, both implemented from scratch and trained with gradient descent.

In [None]:
class LinearRegression:
    '''
    lr: the learning rate
    n_iters: the max iterations for the fit function to run
    '''
    def __init__(self, lr, n_iters):
        self.lr = lr
        self.n_iters = n_iters
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.n_iters):
            y_pred = self.predict(X)

            # Update the Weights and Bias
            dW = - ( 2 * X.T.dot( y - y_pred ) ) / n_samples
            db = - 2 * np.sum( y - y_pred ) / n_samples
            self.weights = self.weights - self.lr *dW
            self.bias = self.bias - self.lr * db

    def predict(self, X):
        y_pred = np.dot(X, self.weights) + self.bias
        return y_pred
    
class RidgeRegression:
    '''
    lr: the learning rate
    n_iters: the number of gradient descent iterations run by the fit function
    lamda: the L2 regularization strength applied to the weights
    '''
    def __init__(self, lr, n_iters, lamda):
        self.lr = lr
        self.n_iters = n_iters
        self.weights = None
        self.bias = None
        self.lamda = lamda 
        
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.n_iters):
            y_pred = self.predict(X)
            
            # Update the Weights and Bias
            dW = ( - ( 2 * ( X.T ).dot( y - y_pred ) ) +  ( 2 * self.lamda * self.weights ) ) / n_samples   
            db = - 2 * np.sum( y - y_pred ) / n_samples
            self.weights = self.weights - self.lr *dW
            self.bias = self.bias - self.lr * db

    def predict(self, X):
        y_pred = np.dot(X, self.weights) + self.bias
        return y_pred
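
The update rules in both classes are gradients of a squared-error loss. The sketch below assumes the loss is the sum of squared residuals plus `lamda` times the squared weight norm, all divided by the number of samples (this form is inferred from the gradient expressions, not stated in the notebook), and compares the analytic gradients against finite differences on random data. The function names `ridge_loss` and `ridge_grad` are hypothetical.

```python
import numpy as np

def ridge_loss(w, b, X, y, lam):
    """Squared error plus an L2 penalty, both divided by n_samples (assumed form)."""
    resid = y - (X @ w + b)
    return (resid @ resid + lam * (w @ w)) / len(y)

def ridge_grad(w, b, X, y, lam):
    """Analytic gradients, written the same way as in RidgeRegression.fit."""
    resid = y - (X @ w + b)
    dW = (-2 * X.T @ resid + 2 * lam * w) / len(y)
    db = -2 * np.sum(resid) / len(y)
    return dW, db

rng = np.random.default_rng(0)
X_chk = rng.normal(size=(50, 3))
y_chk = rng.normal(size=50)
w, b, lam, eps = rng.normal(size=3), 0.3, 1.5, 1e-6

dW, db = ridge_grad(w, b, X_chk, y_chk, lam)

# Central finite differences as an independent check
num_dW = np.array([
    (ridge_loss(w + eps * e, b, X_chk, y_chk, lam)
     - ridge_loss(w - eps * e, b, X_chk, y_chk, lam)) / (2 * eps)
    for e in np.eye(3)
])
num_db = (ridge_loss(w, b + eps, X_chk, y_chk, lam)
          - ridge_loss(w, b - eps, X_chk, y_chk, lam)) / (2 * eps)

print(np.allclose(dW, num_dW, atol=1e-5), np.isclose(db, num_db, atol=1e-5))
```

Setting `lam` to 0 gives the gradients used by the `LinearRegression` class, so the same check covers both models.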

### Choosing Metrics

- As for the metrics, we use the Mean Absolute Error (MAE) here, which is expressed in the same unit as the target (million dollars); the Root Mean Squared Error (RMSE) is reported later during tuning.

In [None]:
def mae(y_test, predictions):
    return np.mean(abs(y_test-predictions))
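
For comparison, a root mean squared error can be written in the same style as `mae`. The sketch below uses made-up numbers to show how a single large miss affects the two metrics differently; the names `mae_sketch` and `rmse_sketch` are hypothetical.

```python
import numpy as np

def mae_sketch(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def rmse_sketch(y_true, y_pred):
    """Root mean squared error; penalizes large misses more strongly."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

y_true_toy = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_toy = np.array([12.0, 18.0, 30.0, 60.0])   # one large miss on the last film

print(mae_sketch(y_true_toy, y_pred_toy))    # 6.0
print(rmse_sketch(y_true_toy, y_pred_toy))   # about 10.1
```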


### Training and Validating Model

- To train and validate the models, we use K-Fold cross validation with 10 folds.

In [None]:
# Initializing variables
X_train, y_train = train_data.drop(['Gross(M$)'], axis= 1), train_data['Gross(M$)']
X_test, y_test = test_data.drop(['Gross(M$)'], axis= 1), test_data['Gross(M$)']

learning_rate = 0.0001
n_iters = 1000
lamda = 1000

model_1 = LinearRegression(lr= learning_rate, n_iters =n_iters)
model_2 = RidgeRegression(lr= learning_rate, n_iters =n_iters, lamda = lamda)

In [None]:
def k_fold_cross_validation(n_folds, X, y, model):
    kf = KFold(n_splits= n_folds, shuffle= True)
    fold = kf.split(X, y)
    mae_score = []
    history = {}
    count = 0
    
    for train_idx, val_idx in fold:
        X_tr = X.iloc[train_idx]
        y_tr = y.iloc[train_idx]
        
        X_val = X.iloc[val_idx]
        y_val = y.iloc[val_idx]
        
        model.fit(X_tr, y_tr)
        y_pred = model.predict(X_val)
        
        new_mae = mae(y_val, y_pred)
        mae_score.append(new_mae)
        print("===== Fold",count,"=====")
        print("MAE=", new_mae)
        history[count] = new_mae
        count+= 1

    # average the validation MAE over all folds
    avg_mae = np.mean(mae_score)
    return avg_mae, history
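
The same 10-fold procedure can also be expressed with scikit-learn's `cross_val_score`. This is a sketch on synthetic one-feature data using scikit-learn's own `linear_model.LinearRegression` rather than the hand-written class; the data and the seed are made up for illustration.

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import KFold, cross_val_score

# Synthetic one-feature data standing in for Rating -> Gross; values are made up
rng = np.random.default_rng(0)
X_cv = rng.uniform(5, 10, size=(200, 1))
y_cv = 20 * X_cv[:, 0] + rng.normal(scale=10, size=200)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
# scikit-learn maximizes scores, so MAE is exposed as a negative value
scores = cross_val_score(linear_model.LinearRegression(), X_cv, y_cv, cv=cv,
                         scoring="neg_mean_absolute_error")
fold_mae = -scores

print(fold_mae.round(2))
print("Avg. MAE =", fold_mae.mean())
```

Note that `cross_val_score` clones the estimator for every fold, so no state carries over from one fold to the next.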

In [None]:
number_of_fold = 10
print("======= Linear Regression =======")
k_fold_linear, linear_history = k_fold_cross_validation(number_of_fold, X_train, y_train, model_1)
print("===========================")
print("Avg. MAE= ", k_fold_linear)
print()

print("======= Ridge Regression =======")
k_fold_ridge, ridge_history= k_fold_cross_validation(number_of_fold, X_train, y_train, model_2)
print("===========================")
print("Avg. MAE= ", k_fold_ridge)
print()

### Re-Train Model

- Re-train the model on the whole Train dataset

In [None]:
model_1.fit(X_train, y_train)
model_2.fit(X_train, y_train)

### Testing Model

- Use the model to predict the Test dataset


In [None]:
y_pred = model_1.predict(X_test)
test_mae = mae(y_test, y_pred)
print("Linear Regression Testing MAE= ", test_mae)

y_pred = model_2.predict(X_test)
test_mae = mae(y_test, y_pred)
print("Ridge Regression Testing MAE= ", test_mae)

### Evaluation

- The Linear Regression and Ridge Regression models have nearly the same error.
- The Mean Absolute Error of both models is quite high, about 54. Since Gross is measured in millions of dollars, the predictions are off by about 54 million dollars on average, which we consider acceptable at this scale.
- Even so, neither model is the most suitable choice for predicting the gross.
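
One way to put an MAE value into context is to compare it with a baseline that ignores the features entirely. The sketch below uses scikit-learn's `DummyRegressor` with the median strategy on a synthetic, right-skewed target; the data are made up and the resulting number says nothing about the real film data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic, right-skewed target standing in for Gross(M$); values are made up
rng = np.random.default_rng(0)
X_b = rng.uniform(5, 10, size=(300, 1))
y_b = rng.exponential(scale=60, size=300)

Xb_train, Xb_test, yb_train, yb_test = train_test_split(X_b, y_b, test_size=0.2, random_state=0)

# Baseline that ignores X and always predicts the training median
baseline = DummyRegressor(strategy="median").fit(Xb_train, yb_train)
baseline_mae = mean_absolute_error(yb_test, baseline.predict(Xb_test))
print("Baseline MAE =", baseline_mae)
```

A regression model is only informative if its test MAE is clearly below this kind of baseline on the same split.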

### Visualizing

-  Running Process

In [None]:
plt.figure(figsize=(15, 5))

history = []
for i in range(len(linear_history)):
  history.append(linear_history[i])

plt.subplot(1, 2, 1)
plt.plot(history, label=["K-Fold Validation MAE"])
plt.title("linear model")
plt.xlabel("Fold")
plt.ylabel("MAE")
plt.grid()
plt.xticks(np.arange(0, 10, step=1))
plt.legend(loc="lower right")

history = []
for i in range(len(ridge_history)):
  history.append(ridge_history[i])

plt.subplot(1, 2, 2)
plt.plot(history, label=["K-Fold Validation MAE"])
plt.title("ridge model")
plt.xlabel("Fold")
plt.ylabel("MAE")
plt.grid()
plt.xticks(np.arange(0, 10, step=1))
plt.legend(loc="lower right")

- Result

In [None]:
plt.figure(figsize=(15, 5))

y_pred_line_1 = model_1.predict(X)
y_pred_line_2 = model_2.predict(X)

plt.subplot(1, 2, 1)
plt.scatter(X, y, s=10)
plt.plot(X, y_pred_line_1, color='red', linewidth=2)
plt.title('Linear Regression')
plt.xticks(np.arange(5, 10, 0.5))
plt.grid()

plt.subplot(1, 2, 2)
plt.scatter(X, y, s=10)
plt.plot(X, y_pred_line_2, color='red', linewidth=2)
plt.title('Ridge Regression')
plt.xticks(np.arange(5, 10, 0.5))
plt.grid()
plt.show()

> After fitting with the single `Rating` feature, we found the result acceptable. However, our team wants to improve it further, so we will choose a model and tune its hyperparameters

### Best model for tuning?

- First, we will remove the `Description` and `Name` features

In [None]:
new_data_film = new_data_film.drop(columns=['Description', 'Name'])

- Prepare data

In [None]:
X = new_data_film.drop('Gross(M$)', axis=1)
y = new_data_film['Gross(M$)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train.head()

In [None]:
y_train.head()

- Create model

In [None]:
models = {
    'SVR':SVR(),
    'XGBRegressor': XGBRegressor(),
    'Ridge': linear_model.Ridge(),
    'ElasticNet': ElasticNet(),
    'SGDRegressor': SGDRegressor(),
    'BayesianRidge': BayesianRidge(),
    'LinearRegression': linear_model.LinearRegression(),
    'RandomForestRegressor': RandomForestRegressor()
}

- Fit model

In [None]:
model_results = []
model_names = []

for name, model in models.items():
    # fit on the training split and score on the held-out test split
    fitted = model.fit(X_train, y_train)
    predicted = fitted.predict(X_test)
    score = mean_absolute_error(y_test, predicted)
    model_results.append(score)
    model_names.append(name)

# collect the scores in a DataFrame once, best (lowest MAE) first
df_results = pd.DataFrame({'Model': model_names, 'MAE': model_results}).sort_values(by='MAE')

print(df_results)

> So far, `RandomForestRegressor` gives us the best result, so we will use this model for tuning
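
Models such as `SVR` and `SGDRegressor` are sensitive to feature scale, so a comparison on raw features may understate them when the features differ widely in magnitude. The sketch below wraps `SGDRegressor` in a pipeline with `StandardScaler` on synthetic data; the two features and their ranges are assumptions for illustration and do not reflect the columns of `new_data_film`.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (e.g. rating vs. vote counts); made up
rng = np.random.default_rng(0)
rating = rng.uniform(5, 10, size=400)
votes = rng.uniform(1e3, 1e6, size=400)
X_s = np.column_stack([rating, votes])
y_s = 10 * rating + 1e-4 * votes + rng.normal(scale=5, size=400)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_s, y_s, test_size=0.2, random_state=0)

# Standardize inside the pipeline so the scaler is fitted on the training split only
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
scaled_sgd.fit(Xs_train, ys_train)
scaled_mae = mean_absolute_error(ys_test, scaled_sgd.predict(Xs_test))
print("Scaled SGDRegressor MAE =", scaled_mae)
```

Tree-based models such as `RandomForestRegressor` do not need this step, which may partly explain why they compare well without any preprocessing.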

#### Hyperparameter Tuning

In [None]:
# %%time

# import optuna

# def objective(trial):
 
#     param = {
#         'n_estimators': trial.suggest_categorical('n_estimators', [100, 500, 1000, 1250, 1500, 2000]),
#         'max_depth': trial.suggest_categorical('max_depth', [10, 25, 50, 100, 200]),
#         'min_samples_split': trial.suggest_categorical('min_samples_split', [10, 25, 50, 100, 200]),
#         'min_samples_leaf': trial.suggest_categorical('min_samples_leaf', [10, 25, 50, 100, 200]),
#         'max_features': trial.suggest_categorical('max_features', [0.5, 1, 'sqrt'])
#     }
    
#     model = RandomForestRegressor(**param)  
    
#     model.fit(X_train, y_train)
#     preds_valid = model.predict(X_test)
#     rmse = mean_squared_error(y_test, preds_valid, squared=False)
#     return rmse

    
# study = optuna.create_study(direction="minimize")
# study.optimize(objective, n_trials=5000)

# study.best_params

- Best parameters for `RandomForestRegressor`

In [None]:
best_params = {'n_estimators': 100,
 'max_depth': 100,
 'min_samples_split': 10,
 'min_samples_leaf': 10,
 'max_features': 0.5}

In [None]:
model = RandomForestRegressor(**best_params)

In [None]:
model.fit(X_train, y_train)
predicted = model.predict(X_test)
# take the square root explicitly; the `squared` argument is deprecated in recent scikit-learn releases
print(f'Root Mean Square Error test = {np.sqrt(mean_squared_error(y_test, predicted))}')

In [None]:
y_preds = model.predict(X_test)
print("After Tuning, the MAE is", mean_absolute_error(y_test, y_preds))

In [None]:
# true values on the x-axis, predictions on the y-axis (matching the axis labels)
plt.scatter(y_test, y_preds)
plt.xlabel('True Values', fontsize=15)
plt.ylabel('Predictions', fontsize=15)
plt.title('The correctness of model prediction')
plt.show()

> **Conclusion**: After training multiple models and tuning hyperparameters, the model does not improve; the default parameters give the best result so far. Our group thinks that, because of the small number of observations, the model does not have enough data to learn from. The tuning process took around 30 minutes to find the best combination of hyperparameters for RandomForestRegressor.
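
A tuning run can also be set up so that the test split is never used while searching. The following is a minimal sketch with `RandomizedSearchCV` on synthetic data; the search space, the number of iterations, and the data are small illustrative choices, not the settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic regression data; made up for illustration
rng = np.random.default_rng(0)
X_t = rng.normal(size=(200, 4))
y_t = 3 * X_t[:, 0] - 2 * X_t[:, 1] + rng.normal(scale=0.5, size=200)
Xt_train, Xt_test, yt_train, yt_test = train_test_split(X_t, y_t, test_size=0.2, random_state=0)

# Small, illustrative search space (not the grid used in the notebook)
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(Xt_train, yt_train)   # the test split is never seen during the search

print(search.best_params_)
print("CV MAE =", -search.best_score_)
```

Because candidates are compared by cross-validation on the training split, the test split stays available for one final, unbiased evaluation.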

- Save predictions and true values to .csv file

In [None]:
result = {
    'predict_value': y_preds,
    'true_value': y_test,
}


pd.DataFrame(result).to_csv('result.csv', index=False)

## References

1. https://www.kaggle.com/code/heyrobin/house-price-prediction-beginner-s-notebook/notebook
2. https://www.kaggle.com/code/ibtesama/getting-started-with-a-movie-recommendation-system