# Project: Investigate the TMDb Movie Dataset (this dataset contains more than 10,000 movies)



## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

### Questions to be answered:
<ul>
<li><a>1-Which actor appears the most in this list of movies?</a></li>
<li><a>2-How many movies were produced in each genre?</a></li>
<li><a>3-Which director has made the most movies?</a></li>
</ul>

<a id='intro'></a>
## Introduction
This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.
<li>● Certain columns, like ‘cast’ and ‘genres’, contain multiple values separated by pipe (|) characters.</li>
<li>● There are some odd characters in the ‘cast’ column. Don’t worry about cleaning them. You can leave them as is.</li>
<li>● The final two columns, ending with “_adj”, show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time.</li>


In [None]:
# import pandas
import pandas as pd
# import numpy
import numpy as np
# import seaborn
import seaborn as sns
# import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
#import sklearn to replace zero values with mean
from sklearn.impute import SimpleImputer
#import datetime
import datetime

pd.set_option('display.precision',3)

<a id='wrangling'></a>
## Data Wrangling


### General Properties

In [None]:
# Load the data and print the first few rows to inspect it
Tmdb_dataset = pd.read_csv("../input/tmdb-movies-dataset/tmdb_movies_data.csv")
Tmdb_dataset.head()

In [None]:
# Inspect the column data types and look for missing or possibly errant data.
Tmdb_dataset.info()

### Data Cleaning

In [None]:
# List the column names to decide which columns are needed for the analysis.

Tmdb_dataset.columns

In [None]:
# Drop the columns that are not needed to answer the research questions.
Tmdb_dataset.drop(['popularity','budget','revenue','overview','imdb_id','homepage','tagline','keywords','production_companies'], axis = 1, inplace = True)
Tmdb_dataset.head()

In [None]:
## change release_date to datetime format
Tmdb_dataset["release_date"]=pd.to_datetime(Tmdb_dataset["release_date"])
Tmdb_dataset.head(10)
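
One thing worth checking after the conversion is the century of older films. The sketch below assumes, without this having been confirmed in the notebook, that `release_date` is stored with two-digit years (for example `11/15/66`) and that a four-digit `release_year` column is available. The `sample` frame is made up for illustration. With the `%y` directive, two-digit years from 00 to 68 are read as 2000 to 2068, so a 1966 film would land in 2066 unless it is shifted back.

```python
import pandas as pd

# Hypothetical rows mimicking the assumed m/d/yy date format.
sample = pd.DataFrame({
    "release_date": ["6/9/15", "11/15/66", "3/12/99"],
    "release_year": [2015, 1966, 1999],
})

# Parse with an explicit format so the two-digit year rule is predictable.
parsed = pd.to_datetime(sample["release_date"], format="%m/%d/%y")

# Any parsed year later than the recorded release year is in the wrong century.
wrong_century = parsed.dt.year > sample["release_year"]

# Keep correct dates as they are and move the others back 100 years.
parsed = parsed.where(~wrong_century, parsed - pd.DateOffset(years=100))
sample["release_date"] = parsed

print(sample)
```

If the real column already carries four-digit years, this correction is unnecessary.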

In [None]:
## add a new column for profit (inflation-adjusted revenue minus inflation-adjusted budget)
Tmdb_dataset['profit'] = Tmdb_dataset["revenue_adj"] - Tmdb_dataset["budget_adj"]

In [None]:
## convert all currency columns to whole millions (astype(int) truncates, so amounts under 1 million become 0)
Tmdb_dataset[['budget_adj', 'revenue_adj',"profit"]]=(Tmdb_dataset[['budget_adj', 'revenue_adj',"profit"]]/1000000).astype(int)

In [None]:
## replace zero values with mean 
imputer = SimpleImputer(missing_values=0, strategy='mean')
imputer = imputer.fit(Tmdb_dataset[['budget_adj', 'revenue_adj','runtime']])
Tmdb_dataset[['budget_adj', 'revenue_adj','runtime']] = imputer.transform(Tmdb_dataset[['budget_adj', 'revenue_adj','runtime']])
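
A minimal pandas-only sketch of what the zero-replacement step above does. The small `sample` frame and its values are made up for illustration. With `missing_values=0` and `strategy='mean'`, the imputer computes each column's mean over the non-zero entries and writes that mean into the zero cells; the `replace` and `fillna` pair below reproduces the same idea.

```python
import numpy as np
import pandas as pd

# Hypothetical data: zeros stand for missing budget or runtime values.
sample = pd.DataFrame({
    "budget": [10.0, 0.0, 30.0],
    "runtime": [90.0, 120.0, 0.0],
})

# Mark zeros as missing so they are excluded from the column means.
with_nan = sample.replace(0, np.nan)

# Fill each missing cell with the mean of the remaining values in its column.
filled = with_nan.fillna(with_nan.mean())

print(filled)
```

Filling with the mean keeps the column average unchanged but narrows its spread, which is worth keeping in mind when reading any plots built on these columns.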

In [None]:
## drop all rows that contain NaN values
Tmdb_dataset.dropna(inplace = True)

In [None]:
## rename columns
Tmdb_dataset.rename(columns = {'budget_adj': 'budget',
                               'revenue_adj': 'revenue'}, inplace = True)
Tmdb_dataset.head()

In [None]:
## drop duplicates in dataset
Tmdb_dataset.drop_duplicates(inplace = True)

In [None]:
Tmdb_dataset.info()

>The dataset is now cleaned and ready for exploratory data analysis.

>Dataset after cleaning:

|# of Columns  | # of Rows     |
| -------------|:-------------:| 
| 13           | 10,731        |


<a id='eda'></a>
## Exploratory Data Analysis


### Research Question 1: Which actor appears the most in this list of movies?

In [None]:
# With the data cleaned, extract the information needed to answer the question.

# First, create a dictionary that counts how many movies each actor was cast in.
cast_dict={}
actors = Tmdb_dataset['cast'].str.split("|")
actors = np.array(actors)

In [None]:
for actorlist in actors:
    for actor in actorlist:
        if actor not in cast_dict:
            cast_dict[actor] = 1
        else:
            cast_dict[actor] += 1
            
sorted_cast=sorted(cast_dict.items(), key=lambda item: item[1], reverse = True)

In [None]:
x = list()
y = list()

for item in sorted_cast[0:20]:
    x.append(item[0])
    y.append(item[1])


sns.set(rc={'figure.figsize':(12,10)}, font_scale=1.4)
ax = sns.barplot(x=x, y=y, palette="Paired")
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'center', xytext = (0, 9), textcoords = 'offset points')
    
#rotate x-axis' text
for item in ax.get_xticklabels():
    item.set_rotation(85)
    

ax.set(xlabel='actor names', ylabel='number of appearances', title = 'Top 20 actors based on the number of the appearances in movies')

plt.show()

>Robert De Niro has appeared in the most movies. I initially expected Samuel L. Jackson, aka Nick Fury, to be the actor with the most movies, but the data always wins.
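
The counting loop used above can also be sketched with pandas alone. This is an alternative under stated assumptions, not the method used in this notebook: the `sample` frame and the actor names in it are made up, and only the pipe-separated layout of the `cast` column is taken from the dataset description.

```python
import pandas as pd

# Hypothetical movies with pipe-separated cast lists.
sample = pd.DataFrame({
    "cast": ["Actor A|Actor B", "Actor B|Actor C", "Actor B"],
})

# Split each string into a list, put one actor per row, then count the rows.
appearances = sample["cast"].str.split("|").explode().value_counts()

# value_counts sorts in descending order, so head(n) gives the top n actors.
print(appearances.head(20))
```

The same three calls work for the `genres` column, since it uses the same separator.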

### Research Question 2: How many movies were produced in each genre?

In [None]:
# Now do the same for genres.
# First, create a dictionary that counts how many movies were produced in each genre.
genre_dict={}
genres_df = Tmdb_dataset['genres'].str.split("|")
genres_array = np.array(genres_df)

In [None]:
for genrelist in genres_array:
    for genre in genrelist:
        if genre not in genre_dict:
            genre_dict[genre] = 1
        else:
            genre_dict[genre] += 1
            
sorted_genre=sorted(genre_dict.items(), key=lambda item: item[1], reverse = True)

In [None]:
x = list()
y = list()

for item in sorted_genre[0:20]:
    x.append(item[0])
    y.append(item[1])


sns.set(rc={'figure.figsize':(12,10)}, font_scale=1.4)
ax = sns.barplot(x=x, y=y, palette="Paired")
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'center', xytext = (0, 9), textcoords = 'offset points')
    
#rotate x-axis' text
for item in ax.get_xticklabels():
    item.set_rotation(85)
    

ax.set(xlabel='genre names', ylabel='frequency', title = 'Top 20 genres')

plt.show()

>Drama is the most frequent genre in this dataset, followed by Comedy and Thriller.

### Research Question 3: Which director has made the most movies?

In [None]:
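# Count the movies per director and keep the ten directors with the most titles
directors = Tmdb_dataset[["director", "original_title"]]

# [0:10] selects ten rows; the earlier [0:9] slice returned only nine
Top10_directors = directors.groupby("director")["original_title"].count().sort_values(ascending=False)[0:10]

# Show each director's share of the movies made by these ten directors
Top10_directors.plot.pie(autopct="%.1f%%");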


>Comparing directors by the number of movies they made in this dataset shows that Woody Allen is the winner. Woody Allen is an American film director, writer, actor, and comedian whose career spans more than six decades and multiple Academy Award-winning films.

In [None]:
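# Pairwise scatterplots with regression lines; fill=True replaces the deprecated shade=True
f = sns.pairplot(Tmdb_dataset, kind="reg", diag_kind="kde", diag_kws=dict(fill=True))
f.fig.suptitle('Scatterplots for all numeric columns in the dataset')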
f.fig.tight_layout(rect=[0, 0.03, 1, 0.95])

<a id='conclusions'></a>
## Conclusions

According to the above analysis, within this period of time (1960 - 2015):
<li><a>1-Robert De Niro is the actor who appears the most in this period of time.</a></li>
<li><a>2-Drama, Comedy and Thriller are the most frequently produced genres.</a></li>
<li><a>3-Woody Allen is the director who made the most movies in this period, such as "Manhattan" & "Annie Hall".</a></li>

## Limitations

We need to highlight that this analysis has several limitations to consider:
<li><a>1-This dataset does not guarantee that it includes all movies produced in this period of time.</a></li>
<li><a>2-To clean and analyze this dataset we dropped rows with NaN values and replaced zero values with mean values, which can affect the relationships between numerical values especially.</a></li>


**Finally:**
<li><a>I hope this analysis is helpful for anyone interested in the movie production field.</a></li>



