# <p style="background-color:#000000; font-family:Arial, sans-serif; color:#F6C800; font-size:200%; text-align:center; border-radius:30px; padding:40px;"><b>IMDB Movie Dataset Analysis</b></p>

<div align="center" style="margin-top: 20px;">
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/69/IMDB_Logo_2016.svg/1920px-IMDB_Logo_2016.svg.png" 
         alt="IMDb Logo" 
         style="max-width: 100%; height: auto; border-radius: 100px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);" />
</div>

<div align="left" style="font-size:25px; font-weight:bold;">
    📊 Yusuf Delikkaya 
</div>

<br>

<p align="left" style="font-size:18px; font-weight:bold;">
    🚀 Discover My Projects and Stay Connected! 
</p>

<p align="left" style="font-size:15px;">
🌍 If you're interested in my other projects, don't forget to follow me on these platforms:
</p>

<p align="left">
    <a href="https://linktr.ee/yusufdelikkaya" target="_blank" style="text-decoration:none;">
        <img src="https://img.shields.io/badge/Linktree-39E09B?style=for-the-badge&logo=linktree&logoColor=white" alt="Linktree" />
    </a>
    <a href="https://www.linkedin.com/in/yusufdelikkaya" target="_blank" style="text-decoration:none;">
        <img src="https://img.shields.io/badge/LinkedIn-0A66C2?style=for-the-badge&logo=linkedin&logoColor=white" alt="LinkedIn" />
    </a>
    <a href="https://www.kaggle.com/yusufdelikkaya" target="_blank" style="text-decoration:none;">
        <img src="https://img.shields.io/badge/Kaggle-20BEFF?style=for-the-badge&logo=kaggle&logoColor=white" alt="Kaggle" />
    </a>
    <a href="https://github.com/yusufdelikkaya" target="_blank" style="text-decoration:none;">
        <img src="https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white" alt="GitHub" />
    </a>
    <a href="https://public.tableau.com/app/profile/yusuf.delikkaya/vizzes" target="_blank" style="text-decoration:none;">
        <img src="https://img.shields.io/badge/Tableau-E97627?style=for-the-badge&logo=tableau&logoColor=white" alt="Tableau" />
    </a>
</p>


# 📌 Project Summary

- The purpose of this project is to perform exploratory data analysis (EDA) on movie data.
- Various visualization libraries will be used to uncover relationships between different features.

<br>

### Data Set

- The dataset covers 1,000 movies released between 2006 and 2016.
- Each of the 1,000 rows represents one movie.
- There are 12 columns in total, as outlined below:

<br>

| Column        | Description                                         |
|---------------|-----------------------------------------------------|
| **Rank**    | Film ranking                                        |
| **Title**   | Movie title                                         |
| **Genre**   | Genre(s) of the movie                               |
| **Description** | Short summary of the movie                     |
| **Director** | Director of the movie                             |
| **Actors**  | Leading actors in the movie                        |
| **Year**    | Release year of the movie                          |
| **Runtime** | Duration of the movie (in minutes)                 |
| **Rating**  | IMDb rating                                        |
| **Votes**   | Number of votes                                    |
| **Revenue (Millions)** | Box office revenue (in millions of USD) |
| **Metascore** | Metascore rating                                 |
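
The column list above can double as a quick schema check after loading. The following is a minimal sketch, not part of the original analysis: the helper `check_schema` is hypothetical, the two rows are made-up placeholders, and the header names are taken from the table, so the real CSV headers may differ slightly.

```python
import io
import pandas as pd

# Column names as listed in the table above (assumed to match the file headers)
EXPECTED_COLUMNS = [
    "Rank", "Title", "Genre", "Description", "Director", "Actors",
    "Year", "Runtime", "Rating", "Votes", "Revenue (Millions)", "Metascore",
]


def check_schema(frame, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names relative to `expected`."""
    actual = list(frame.columns)
    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    return missing, unexpected


# Two made-up rows standing in for the real file
sample_csv = io.StringIO('''Rank,Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue (Millions),Metascore
1,Movie A,"Action,Drama",A placeholder plot,Director A,"Actor A, Actor B",2010,120,7.5,1000,50.5,70
2,Movie B,Comedy,Another placeholder plot,Director B,"Actor C, Actor D",2012,95,6.1,500,,55
''')

sample = pd.read_csv(sample_csv)
missing, unexpected = check_schema(sample)

print(sample.shape)          # rows and columns actually read
print(missing, unexpected)   # both lists are empty when the schema matches
```

Running the same check on the real DataFrame before renaming columns would flag any header that differs from the table.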



<b>📋 TABLE OF CONTENTS 📋</b>
<ul>
<li><a href="#step-1">1| Import Libraries and Loading the Dataset</a></li>
<li><a href="#step-2">2| Initial Exploration and Analyzing Values</a></li>
<li><a href="#step-3">3| Organizing and Manipulating Data</a></li>
<li><a href="#step-4">4| Data Visualization</a></li>
<li><a href="#step-5">5| Conclusion</a></li>
</ul>

<a id='step-1'></a>
# <p style="background-color:#000000; font-family:Verdana, sans-serif; color:#F6C800; font-size:100%; text-align:center; border-radius:10px; padding:20px;"><b>1| Import Libraries and Loading the Dataset</b></p>

<p style="background-color:#F6C800; font-family:Verdana, sans-serif; color:#000000; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>1.1.| Import Libraries & Configure Settings</b></p>

In [None]:
# Import Libraries
import numpy as np  # Used for numerical calculations and data manipulation.

import pandas as pd  # Used for data analysis and handling data structures (DataFrame, Series).

import matplotlib.pyplot as plt  # Basic library for 2D plotting and data visualization.

import seaborn as sns  # Used for statistical data visualization. It is built on top of Matplotlib.

import missingno as msno  # Used for visualizing and analyzing missing data.

import plotly.express as px  # Used for interactive charts and data visualization.

from skimpy import skim  # Used for quick summarization and inspection of data.

from collections import Counter # Used for frequency analysis, tallying results, and simplifying counting tasks.

import plotly.io as pio
pio.renderers.default = 'notebook'

# Modify Pandas Display Settings
pd.set_option('display.float_format', '{:.2f}'.format)  # Shows floats with two decimals so values such as ratings are not rounded to integers.
pd.set_option('display.max_columns', None)  # Ensures all columns are displayed.
pd.set_option('display.max_rows', None)  # Ensures all rows are displayed.

# Warning Settings
import warnings  # Used to control and manage warning messages.
warnings.filterwarnings("ignore")  # Ignores warning messages.
warnings.warn("this will not show")  # An example warning message, but it will be suppressed by the ignore filter.

<p style="background-color:#F6C800; font-family:Verdana, sans-serif; color:#000000; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>1.2.| Loading & Reading Dataset</b></p>

In [None]:
# Loading & Reading Dataset
df_original = pd.read_csv("imdb_movie_dataset.csv", index_col="Title")

# Making a copy
df = df_original.copy()

<p style="background-color:#F6C800; font-family:Verdana, sans-serif; color:#000000; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>1.3.| Data Sample</b></p>

In [None]:
# Checking Dataset
df.sample(5)

<a id='step-2'></a>
# <p style="background-color:#000000; font-family:Verdana, sans-serif; color:#F6C800; font-size:100%; text-align:center; border-radius:10px; padding:20px;"><b>2| Initial Exploration and Analyzing Values</b></p>

In [None]:
# Dataset Initial Summary by Skimpy Library
skim(df)

In [None]:
df.info()

In [None]:
df.describe().T

In [None]:
df.describe(include= "object").T

In [None]:
# Viewing Null, Unique and Duplicated Values

pd.DataFrame({
                'Count':df.shape[0],
                'Column':df.shape[1],
                'Size':df.size,
                'Null':df.isnull().sum(),
                'Null %':df.isnull().mean() * 100,
                'Not-Null':df.notnull().sum(),
                'Unique':df.nunique(),
                'Duplicated':df.duplicated().sum()
})

<a id='step-3'></a>
# <p style="background-color:#000000; font-family:Verdana, sans-serif; color:#F6C800; font-size:100%; text-align:center; border-radius:10px; padding:20px;"><b>3| Organizing and Manipulating Data</b></p>

<p style="background-color:#F6C800; font-family:Verdana, sans-serif; color:#000000; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>3.1.| Column Names</b></p>

In [None]:
# Viewing column names with a list comprehension
[i for i in df.columns]

In [None]:
# Method 1

df.columns = [  'rank', 
                'genre', 
                'description', 
                'director', 
                'actors', 
                'year',
                'runtime', 
                'rating', 
                'votes', 
                'revenue',
                'metascore'  ]

[i for i in df.columns]

In [None]:
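# Method 2: lower-case the existing column names.
# After Method 1 this is a no-op. Applied to the raw headers alone, it would keep
# names such as 'revenue (millions)' instead of 'revenue'.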

df.columns = [col.lower() for col in df]

[i for i in df.columns]

<p style="background-color:#F6C800; font-family:Verdana, sans-serif; color:#000000; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>3.2.| Null Values</b></p>

In [None]:
# Null Values in Columns
df.isnull().sum()

In [None]:
# Total of Null Values in Columns
print(f"Total of Null Values in Columns: {df.isnull().sum().sum()}")

In [None]:
msno.matrix(df);

<p style="background-color:#000000; font-family:Verdana, sans-serif; color:#F6C800; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>3.2.1.| revenue Column</b></p>

In [None]:
# The column with the most null values is revenue
df.isnull().sum()

In [None]:
df.revenue.sample(5)

In [None]:
print(f"revenue mean: {df.revenue.mean()}\nrevenue median: {df.revenue.median()}")

In [None]:
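# Filling null values with the median
# (assigning back avoids the chained inplace call, which newer pandas versions warn about)
df['revenue'] = df['revenue'].fillna(df['revenue'].median())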
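
One common reason to fill with the median rather than the mean is that revenue tends to be right-skewed, so a few very large values pull the mean upward. The sketch below illustrates this on made-up numbers; the figures are placeholders, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up revenue figures (millions of USD): mostly small, one very large, two missing
revenue = pd.Series([5.0, 10.0, 20.0, 30.0, 935.0, np.nan, np.nan])

mean_value = revenue.mean()      # pulled upward by the single large value
median_value = revenue.median()  # middle of the observed (non-null) values

# Assign the result back rather than relying on inplace=True
filled = revenue.fillna(median_value)

print(f"mean: {mean_value}, median: {median_value}")
print(f"nulls after filling: {filled.isnull().sum()}")
```

Here the mean is ten times the median, so filling with the mean would assign the missing movies a revenue larger than four of the five observed ones.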

<p style="background-color:#000000; font-family:Verdana, sans-serif; color:#F6C800; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>3.2.2.| metascore Column</b></p>

In [None]:
# After handling revenue, the column with the most null values is metascore
df.isnull().sum()

In [None]:
# Dropping the rows that have null values in the metascore column
df = df.dropna(subset=['metascore'])
print("Dropped the rows that have null values in the metascore column")

In [None]:
# There are no Null values left
df.isnull().sum()

In [None]:
msno.matrix(df);

<p style="background-color:#F6C800; font-family:Verdana, sans-serif; color:#000000; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>3.3.| Generating Top 250 Movies Dataset</b></p>

- **Select the 250 highest-rated movies and store them in a new dataset named top250. Also save this dataset to your computer as top250.csv.**

In [None]:
top250 = df.sort_values('rating', ascending=False)
top250 = top250[:250]
top250.head()

In [None]:
# Save the dataset as a CSV file
top250.to_csv('top250.csv')
print("Saved the top250 dataset to 'top250.csv'")

<p style="background-color:#F6C800; font-family:Verdana, sans-serif; color:#000000; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>3.4.| Data Analysis With Top 250 Movies Dataset</b></p>

In [None]:
# Top 250 Movies Dataset Initial Summary by Skimpy Library
skim(top250)

In [None]:
# Viewing Null, Unique and Duplicated Values

pd.DataFrame({
                'Count':top250.shape[0],
                'Column':top250.shape[1],
                'Size':top250.size,
                'Null':top250.isnull().sum(),
                'Null %':top250.isnull().mean() * 100,
                'Not-Null':top250.notnull().sum(),
                'Unique':top250.nunique(),
                'Duplicated':top250.duplicated().sum()
})

In [None]:
top250.sample(5)

In [None]:
top250.info()

In [None]:
top250.describe().T

<a id='step-4'></a>
# <p style="background-color:#000000; font-family:Verdana, sans-serif; color:#F6C800; font-size:100%; text-align:center; border-radius:10px; padding:20px;"><b>4| Data Visualization</b></p>

<p style="background-color:#F6C800; font-family:Verdana, sans-serif; color:#000000; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>4.1.| Number of Movies Released by Years</b></p>

<p style="background-color:#000000; font-family:Verdana, sans-serif; color:#F6C800; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>4.1.1.| Line Chart: Number of Movies Released by Years</b></p>

In [None]:
# Line Chart: Number of Movies Released by Years
movies_per_year = top250['year'].value_counts().sort_index()

plt.plot(movies_per_year.index, movies_per_year.values, color='blue')

plt.xlabel('Year')
plt.ylabel('Number of Movies')

plt.title('Number of Movies Released by Years')

for x, y in zip(movies_per_year.index, movies_per_year.values):
    plt.text(x, y + 0.15, str(y), ha='center', va='bottom', fontsize=9)

plt.show()

<p style="background-color:#000000; font-family:Verdana, sans-serif; color:#F6C800; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>4.1.2.| Bar Chart: Number of Movies Released by Years</b></p>

In [None]:
# Bar Chart: Number of Movies Released by Years
# The upper bin edge is max + 2 so that the final year gets its own bin
df['year'].plot(kind='hist', bins=range(df['year'].min(), df['year'].max() + 2), edgecolor='black')

plt.title('Number of Movies Released by Years')

plt.xlabel('Year')
plt.ylabel('Number of Movies')

plt.show()

<p style="background-color:#F6C800; font-family:Verdana, sans-serif; color:#000000; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>4.2.| The 10 Most Filmed Genres</b></p>

In [None]:
# Stacking genre
top250["genre"].str.split(",", expand=True).stack().value_counts()
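
The genre column holds several comma-separated genres per movie, which is why it is split before counting. As a minimal sketch of the same idea on made-up rows, `Series.explode` is used here as an alternative to `expand=True` plus `stack()`; the titles and genre strings are placeholders.

```python
import pandas as pd

# Made-up titles and genre strings in the same comma-separated form
toy = pd.DataFrame(
    {"genre": ["Action,Drama", "Drama", "Comedy,Drama,Action"]},
    index=["Movie A", "Movie B", "Movie C"],
)

# split   -> one list of genres per movie
# explode -> one row per (movie, genre) pair
# strip   -> guards against stray spaces around the commas
genre_counts = (
    toy["genre"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)

print(genre_counts)
```

Each movie contributes once to every genre it lists, so the counts can add up to more than the number of movies.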

<p style="background-color:#000000; font-family:Verdana, sans-serif; color:#F6C800; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>4.2.1.| Bar Chart: The 10 Most Filmed Genres</b></p>

In [None]:
top10_genres = top250["genre"].str.split(",", expand=True).stack().value_counts()[:10]

sns.set_style("whitegrid")

plt.figure(figsize=(10,6))

ax = sns.barplot(   x=top10_genres.values, 
                    y=top10_genres.index, 
                    palette="colorblind")

for container in ax.containers:
    ax.bar_label(   container, 
                    fmt='%d', 
                    label_type='edge', 
                    fontsize=12, 
                    color='black', 
                    padding=3)
    
ax.set(xlabel='Number of Movies', ylabel='Genres')

plt.title("Top 10 Movie Genres")

plt.show()

<p style="background-color:#000000; font-family:Verdana, sans-serif; color:#F6C800; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>4.2.2.| Donut Chart: The 10 Most Filmed Genres</b></p>

In [None]:
top_genres = top250["genre"].str.split(",", expand=True).stack().value_counts()[:10]

fig = px.pie(   values=top_genres.values, names=top_genres.index,
                title='The 10 most filmed genres',
                labels={'value': 'Number', 'names': 'Genres'},
                hover_data={'value': top_genres.values, 'names': top_genres.index},
                hole=0.5)

fig.update_traces(textposition='inside', textinfo='percent+value', textfont_size=12)

fig.update_traces(marker=dict(line=dict(color='white', width=2)))

fig.show()

<p style="background-color:#000000; font-family:Verdana, sans-serif; color:#F6C800; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>4.2.3.| Pie Chart: The 10 Most Filmed Genres</b></p>

In [None]:
top_genres = top250["genre"].str.split(",", expand=True).stack().value_counts()[:10]

fig, ax = plt.subplots(figsize=(8, 8))

ax.pie(top_genres.values, 
       labels=top_genres.index, 
       autopct='%1.1f%%', 
       startangle=90,
       explode=(0.3, 0.0, 0, 0, 0, 0.0, 0, 0, 0, 0), 
       shadow=True, 
       textprops={'fontsize': 10})


ax.set_title('The 10 Most Filmed Genres', fontsize=16, fontweight='bold')

ax.axis('equal')

plt.show()

<p style="background-color:#F6C800; font-family:Verdana, sans-serif; color:#000000; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>4.3.| Top 10 Directors Who Directed the Most Movies</b></p>

In [None]:
# Top 10 directors by number of movies (the slice starts at 0 so the top director is included)
director_counts = df['director'].str.split(', ', expand=True).stack().value_counts()[:10]
director_counts

In [None]:
# Top 10 Directors Who Directed the Most Movies
sns.set_style("whitegrid")

plt.figure(figsize=(10,6))

ax = sns.barplot(x=director_counts.values, y=director_counts.index, palette="turbo")

for container in ax.containers:
    ax.bar_label(container, fmt='%d', label_type='edge', fontsize=12, color='black', padding=3)
    
ax.set(xlabel='Number of Movies', ylabel='Directors')

plt.title("Top 10 Directors Who Directed the Most Movies")

plt.show()

<p style="background-color:#F6C800; font-family:Verdana, sans-serif; color:#000000; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>4.4.| Top 20 Actors/Actresses Who Appeared in the Most Movies</b></p>

In [None]:
# Counting actor appearances with Counter from the collections library
actors_counter = Counter()

for actors in df['actors']:
    actors_list = [actor.strip() for actor in actors.split(',')]
    actors_counter.update(actors_list)

top_actors = actors_counter.most_common(20)
top_actors

In [None]:
top_actors_df = pd.DataFrame(top_actors, columns=["Actor", "Movie Count"])

sns.set_style('whitegrid')

plt.figure(figsize=(10,8))

ax = sns.barplot(x='Movie Count', y='Actor', data=top_actors_df, palette='plasma')

for container in ax.containers:
    ax.bar_label(container, fmt='%d', label_type='edge', fontsize=12, color='black', padding=3)
    
sns.despine(left=True)

plt.title('Top 20 Actors/Actresses Who Appeared in the Most Movies', fontsize=16)
plt.xlabel('Number of Movie Roles', fontsize=14)
plt.ylabel('Actor/Actress', fontsize=14)

plt.xticks(fontsize=12, rotation=0)

plt.yticks(fontsize=12)

plt.tight_layout()

plt.show()

<p style="background-color:#F6C800; font-family:Verdana, sans-serif; color:#000000; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>4.5.| Revenue of 250 Popular Movies According to Rating Scores</b></p>

In [None]:
# Note: df is rebound to the top-250 subset here, so the remaining cells operate on top250
df = top250.sort_values("rating", ascending=False)

fig = px.scatter(   df, 
                    x=df.index, 
                    y="revenue", 
                    size="rating", 
                    color="rating")

fig.update_layout(  title="Revenue of 250 Popular Movies According to Rating Scores", 
                    height=800, 
                    width=1000, 
                    title_x = 0.5, 
                    xaxis_tickangle=-45)

fig.update_yaxes(title="Revenue (Million $)", range=[-10, 500], dtick=50)

fig.update_traces(mode='markers', marker=dict(sizemode='diameter', sizeref=0.5))

fig.show()

<p style="background-color:#F6C800; font-family:Verdana, sans-serif; color:#000000; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>4.6.| Relationship Between Ratings and Number of Votes</b></p>

In [None]:
# Scatter Plot: Relationship Between Ratings and Number of Votes
plt.scatter(df['rating'], df['votes'], alpha=0.5)
plt.title('Relationship Between Ratings and Number of Votes')
plt.xlabel('Rating')
plt.ylabel('Votes')
plt.show()

<p style="background-color:#F6C800; font-family:Verdana, sans-serif; color:#000000; font-size:150%; text-align:left; border-radius:50px; padding:10px;"><b>4.7.| IMDB Correlation Between Columns</b></p>

In [None]:
df.select_dtypes(exclude="object").sample(10)

In [None]:
plt.figure(figsize=(10, 8))

numeric_df = df.select_dtypes(exclude="object")

sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.title('IMDB Correlation Between Columns', fontsize=18)
plt.show()

<a id='step-5'></a>
# <p style="background-color:#000000; font-family:Verdana, sans-serif; color:#F6C800; font-size:100%; text-align:center; border-radius:10px; padding:20px;"><b>5| Conclusion</b></p>

The project explores a movie dataset from data cleaning and missing-value handling through exploratory analysis and visualization. The outcomes offer insights into:

- The revenue trends over the years.
- Genre-based and director-specific patterns.
- Relationships between movie runtime, ratings, and revenue.

The visualizations clarify movie attributes and highlight key trends within the dataset, supporting conclusions about popular genres, prolific directors, and other significant factors in the movie industry.
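
As a rough sketch of how the revenue trend and the runtime, rating, and revenue relationships listed above could be quantified, the example below groups by year and computes pairwise correlations. The rows are made-up placeholders that reuse the lower-cased column names from section 3.1; the results say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows using the lower-cased column names from section 3.1
toy = pd.DataFrame({
    "year":    [2006, 2006, 2007, 2007, 2008],
    "runtime": [100, 120, 90, 110, 130],
    "rating":  [6.0, 7.0, 6.5, 7.5, 8.0],
    "revenue": [10.0, 30.0, 20.0, 60.0, 100.0],
})

# Revenue trend: median revenue per release year
revenue_by_year = toy.groupby("year")["revenue"].median()

# Pairwise Pearson correlations between runtime, rating, and revenue
corr = toy[["runtime", "rating", "revenue"]].corr()

print(revenue_by_year)
print(corr.round(2))
```

On the real data, the same two calls could be applied to `df` or `top250` in place of `toy`.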

<p style="background-color:#000000; font-family:Arial, sans-serif; color:#F6C800; font-size:200%; text-align:center; border-radius:30px; padding:20px;"><b>🌟 THANK YOU for reviewing my project! 🌟</b></p>