
# Investigate a Dataset: TMDb Movies

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction
> This notebook analyzes information about roughly 10,000 movies collected from The Movie Database (TMDb) to answer questions such as:
>
> **Which genres are most popular? Which genre has the highest revenue?**
>
> The dataset includes each movie's genres, adjusted budget, adjusted revenue, and other attributes. Let's dive into the code.

In [None]:
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

> In this section we will:
>
> 1. Load and read the data
> 2. Assess the data
> 3. Clean the data

### General Properties

In [None]:
# load the data
df = pd.read_csv("../input/tmdbmovies/tmdb-movies.csv")

#### Reading data

In [None]:
df.head()

In [None]:
#last 10 rows
df.tail(10)

In [None]:
#shape of the dataframe
df.shape

#### Assessing Data
We first inspect the structure of the data.

In [None]:
df.info()

In [None]:
#check for null values
df.isnull().sum()

In [None]:
#name of columns
df.columns


### Data Cleaning of the TMDb Movie Dataset

> We only need the columns that help answer our questions: popularity, runtime, genres, vote_average, release_year, budget_adj, and revenue_adj.


In [None]:
# remove the columns that we don't need in the analysis
df.drop(['id', 'imdb_id', 'cast', 'homepage', 'tagline', 'keywords', 'overview', 'original_title',
         'director', 'production_companies', 'release_date', 'vote_count', 'budget', 'revenue'],
        axis=1, inplace=True)
df.head()

In [None]:
# the new shape of the dataframe
df.shape

In [None]:
#check for duplicated rows
sum(df.duplicated())

In [None]:
# the duplicated row
df[df.duplicated()]

In [None]:
#remove the duplicated row
df.drop_duplicates(inplace=True)

In [None]:
#check for duplicated rows again
sum(df.duplicated())
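
The check, drop, and re-check sequence above can be tried on a tiny frame first. This is a minimal sketch; `toy` and its values are made up for illustration and are not taken from the TMDb data.

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the first one exactly.
toy = pd.DataFrame({
    "genres": ["Action", "Drama", "Action"],
    "runtime": [100, 90, 100],
})

# duplicated() flags every repeat of an earlier row; sum() counts the flags.
n_before = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates()
n_after = int(deduped.duplicated().sum())

print(n_before, n_after, len(deduped))
```

Returning a new frame instead of using `inplace=True` keeps the original available for comparison while experimenting.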

In [None]:
#check for null values
df.isnull().sum()

In [None]:
#remove rows with null values
df.dropna(inplace=True)

In [None]:
#check for null values again
df.isnull().sum()

In [None]:
#check if the dataframe has zero elements
df.eq(0).any().any()

In [None]:
# creates a boolean dataframe which is True where df is nonzero
df != 0

In [None]:
# keep only the rows in which every column is nonzero
df = df[(df != 0).all(axis=1)]
df.head()
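
To see what the row filter does, here is a minimal sketch on a made-up frame. The name `toy` and its values are hypothetical; only the `(frame != 0).all(axis=1)` pattern comes from the cell above.

```python
import pandas as pd

# Hypothetical toy frame; a zero stands in for an unreported budget or revenue.
toy = pd.DataFrame({
    "genres": ["Action", "Drama", "Comedy"],
    "budget_adj": [1.0e7, 0.0, 5.0e6],
    "revenue_adj": [3.0e7, 2.0e6, 0.0],
})

# Row-wise mask: True only where every column in the row is nonzero.
# String values compare as "not equal to 0", so text columns never drop a row.
nonzero_rows = (toy != 0).all(axis=1)
toy_clean = toy[nonzero_rows]

print(toy_clean)
```

Note that a single zero in any column removes the whole row, so this filter can discard a large share of the data when zeros are common.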

In [None]:
#check again if the dataframe has zero elements
df.eq(0).any().any()

In [None]:
#the new data shape
df.shape

The `genres` column holds many hybrid entries that combine several genres separated by `|`. We split them so that each genre gets its own row.


In [None]:
# split the pipe-separated genres so that each genre gets its own row
data = df.join(df.genres.str.strip('|').str.split('|', expand=True).stack()
               .reset_index(level=1, drop=True).rename('genre')).reset_index(drop=True)

data.head(8)
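
The same one-row-per-genre reshaping can also be sketched with `DataFrame.explode`, which is a swapped-in alternative and not what the cell above uses. It assumes pandas 0.25 or newer; `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with pipe-separated genres, as in the TMDb file.
toy = pd.DataFrame({
    "genres": ["Action|Adventure", "Drama", "Comedy|Romance|Drama"],
    "runtime": [120, 95, 105],
})

# Turn each genres string into a list, then emit one row per list element.
exploded = (
    toy.assign(genre=toy["genres"].str.split("|"))
       .explode("genre")
       .reset_index(drop=True)
)

print(exploded)
```

Either way, a movie with three genres contributes three rows, so any statistic computed on the reshaped frame counts that movie three times.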

In [None]:
#check the reshaped data for null values before exploring
data.isnull().sum()

In [None]:
#some statistics
data.describe()

From the summary statistics, the mean adjusted **budget** (4.953497e+07) is lower than the mean adjusted **revenue** (1.517441e+08), and the mean **runtime** is about 110 min. The table also reports other statistics for each column.

The main column we work with is `genre`: we compute the mean of every other column for each genre.

In [None]:
# the mean adjusted budget by genre
mean_genre_budget = data.groupby('genre').budget_adj.mean()

# the mean popularity by genre
mean_genre_popularity = data.groupby('genre').popularity.mean()

# the mean runtime by genre
mean_genre_runtime = data.groupby('genre').runtime.mean()

# the mean release year by genre
mean_genre_release_year = data.groupby('genre').release_year.mean()

# the mean vote average by genre
mean_genre_vote = data.groupby('genre').vote_average.mean()

#  the mean adjusted revenue by genre
mean_genre_revenue = data.groupby('genre').revenue_adj.mean()


Collect all the per-genre means in one DataFrame:

In [None]:
mean_genre_data = pd.concat([ mean_genre_budget,mean_genre_popularity, mean_genre_revenue, mean_genre_runtime, mean_genre_release_year, mean_genre_vote],axis=1)
mean_genre_data
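
As a side note, the per-column groupby plus `pd.concat` can be collapsed into a single groupby over a list of columns. The sketch below checks on a made-up frame that the two routes agree; `toy` and `cols` are hypothetical names and values.

```python
import pandas as pd

# Hypothetical toy frame, already one row per movie and genre.
toy = pd.DataFrame({
    "genre": ["Action", "Action", "Drama", "Drama"],
    "popularity": [1.0, 3.0, 0.5, 1.5],
    "budget_adj": [2.0e7, 4.0e7, 1.0e7, 3.0e7],
})
cols = ["popularity", "budget_adj"]

# Route 1: one groupby per column, then concatenate side by side.
by_column = pd.concat([toy.groupby("genre")[c].mean() for c in cols], axis=1)

# Route 2: one groupby over all selected columns at once.
in_one_call = toy.groupby("genre")[cols].mean()

print(in_one_call)
```

Route 2 scans the groups once and keeps the column list in one place, which makes it easier to add or remove a column later.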

All columns are now averaged and grouped by genre.

In [None]:
#turn the genre index of mean_genre_data into a regular column
all_mean = mean_genre_data.reset_index()
all_mean

The previous cell holds the per-genre mean of every column, with `genre` as a regular column.

<a id='eda'></a>
## Exploratory Data Analysis of TMDb Movie Data



# Question 1: What is the most popular genre?

> The analysis below uses three DataFrames:
>
> - **data**: the cleaned data, with one row per movie and genre
> - **mean_genre_data**: the cleaned data averaged by genre, with genre as the index
> - **all_mean**: the same means as mean_genre_data, with genre as a regular column


### Genres vs. popularity

In [None]:
# distribution of the mean popularity across genres
mean_genre_data["popularity"].hist(figsize=(8,5));
#set the title of the figure
plt.title('Distribution of mean popularity across genres')
#set the x-label and y-label of the figure
plt.xlabel('Popularity')
plt.ylabel('Frequency of Occurrence');

It looks like the mean popularity score of most genres falls approximately between 0.9 and 1.25.

In [None]:
#sort the genres by mean popularity, highest first
mean_genre_data.sort_values('popularity', ascending=False).popularity


From the sorted values, Science Fiction has the highest mean popularity (1.873294) and Foreign has the lowest (0.179608).
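
When only the top and bottom genres are needed, `idxmax` and `idxmin` return their labels directly without sorting. A minimal sketch follows; the Series holds just the two values quoted above, and the name `mean_popularity` is a stand-in for `mean_genre_data["popularity"]`.

```python
import pandas as pd

# Only the two values reported in the text; the real Series has one entry per genre.
mean_popularity = pd.Series(
    {"Science Fiction": 1.873294, "Foreign": 0.179608},
    name="popularity",
)

# idxmax/idxmin return the index label (the genre) of the extreme value.
top_genre = mean_popularity.idxmax()
bottom_genre = mean_popularity.idxmin()

print(top_genre, bottom_genre)
```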

In [None]:
# visualize the mean popularity of each genre

x=all_mean["genre"]
y=all_mean["popularity"]
fig,ax = plt.subplots(figsize=(10,8))
ax.bar(x, y)
plt.xticks(rotation=90)
plt.title("popular genre",fontsize=12)
plt.xlabel('genre Of Movies',fontsize=12)
plt.ylabel("popularity",fontsize= 12)


The bar chart confirms the sorted values: **the Science Fiction genre** has the highest mean popularity (1.873294)
and **the Foreign genre** has the lowest (0.179608).

# Question 2: Which genre has the highest budget?

In [None]:
# visualize the relations between the genres and budget
x=all_mean["genre"]
y=all_mean["budget_adj"]
fig,ax = plt.subplots(figsize=(10,8))
ax.bar(x, y)
plt.xticks(rotation=90)
#setup the title of the figure
plt.title("Highest budget vs genres",fontsize=12)
#setup the xlabel and ylabel of the figure
plt.xlabel('genre Of Movies',fontsize=12)
plt.ylabel("budget",fontsize= 12)

The bars show that the **Animation genre** has the highest mean budget and the **Documentary genre** has the lowest. We can confirm that with the numbers in the next cell.

In [None]:
#sorting the highest budget by genre

mean_genre_data.sort_values('budget_adj', ascending=False).budget_adj


The sorted values show that the **Animation genre** has the highest mean budget (8.347215e+07)
and the **Documentary genre** has the lowest (5.379702e+06), which agrees with the bar chart.

# Question 3: Which genre has the highest revenue?

In [None]:
#plot the mean adjusted revenue of each genre
x=all_mean["genre"]
y=all_mean["revenue_adj"]
fig,ax = plt.subplots(figsize=(10,8))
ax.bar(x, y)
plt.xticks(rotation=90)
#setup the title of the figure
plt.title("Highest revenue vs genres",fontsize=12)
#setup the xlabel and ylabel of the figure
plt.xlabel('genre Of Movies',fontsize=12)
plt.ylabel("revenue",fontsize= 12)

From the bar chart, the **Animation genre** has the highest mean revenue and the **Foreign genre** appears to have the lowest.
The next cell checks this against the numbers.

In [None]:
#sorting the highest revenue by genre
mean_genre_data.sort_values('revenue_adj', ascending=False).revenue_adj


The sorted values show that the **Animation genre** has the highest mean revenue (2.909574e+08), which matches the bar chart.
At the low end, the sorted values give **Documentary** the lowest mean revenue (1.273378e+07), not Foreign as the bars suggested.

# Question 4: Does movie runtime increase or decrease from year to year?

In [None]:
#average runtime
data["runtime"].mean()

The average runtime of the movies is approximately 110 min.

In [None]:
#group by release_year and take the mean runtime
#the cleaned dataframe 'data' is used to build the group
mean_rel_run= data.groupby('release_year').runtime.mean()
mean_rel_run

In [None]:
#plot the mean runtime per release year (1960 to 2015)
mean_rel_run.plot(xticks = np.arange(1960,2016,5),figsize=(10,8));
#set the title of the figure
plt.title("Runtime vs. year",fontsize = 14)
#set the x-label and y-label of the plot
plt.xlabel('Year',fontsize = 13)
plt.ylabel('Runtime',fontsize = 13);

As the plot shows, the mean runtime trends downward over the years, which is consistent with more recent movies being shorter on average. The overall average **runtime is about 110 min**.
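
One way to put a number on the trend is to fit a straight line to the yearly means and read off its slope. This is only a sketch: it uses `np.polyfit` on made-up yearly values, not on the dataset, and `toy_runtime` is a hypothetical stand-in for `mean_rel_run`.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly mean runtimes in minutes, NOT values from the dataset.
toy_runtime = pd.Series(
    [130.0, 125.0, 118.0, 112.0, 108.0],
    index=[1960, 1975, 1990, 2005, 2015],
    name="runtime",
)

# Least-squares line: runtime ~ slope * year + intercept.
slope, intercept = np.polyfit(
    toy_runtime.index.to_numpy(), toy_runtime.to_numpy(), deg=1
)

# A negative slope means the mean runtime falls as the year increases.
print(f"{slope:.3f} minutes per year")
```

A single slope summarizes the overall direction only; it says nothing about year-to-year fluctuations, which the line plot shows better.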

# Question 5: Which genre has the highest vote average?

In [None]:
# visualize the mean vote average of each genre
x=all_mean["genre"]
y=all_mean["vote_average"]
fig,ax = plt.subplots(figsize=(10,8))
ax.bar(x, y)
plt.xticks(rotation=90)
plt.title("Highest vote of the genre",fontsize=12)
plt.xlabel('genre Of Movies',fontsize=12)
plt.ylabel("vote_average",fontsize= 12)

Most genres have a similarly high mean vote, so the bars are close together. About four of them are the closest contenders for the top:
**Documentary**, **War**, **History**, and **Western**. The next cell sorts the values to decide which of them ranks highest.

In [None]:
#sorting the highest vote average by genre

mean_genre_data.sort_values('vote_average', ascending=False).vote_average


The four genres mentioned above now appear in sorted order. The genre with the highest mean vote is the **Documentary genre** (6.66), and the one with the lowest is the **TV Movie genre** (5.6).

<a id='conclusions'></a>
## Conclusions

- **The Science Fiction genre** has the highest popularity.
- **The Foreign genre** has the lowest popularity.
- **The Animation genre** has the highest budget.
- **The Documentary genre** has the lowest budget.
- **The Animation genre** has the highest revenue.
- **The Documentary genre** has the lowest revenue.
- **The runtime decreases from year to year, and the average runtime is 110 min.**
- The genre with the highest average vote is **Documentary**.
- The genre with the lowest average vote is **TV Movie**.

## Limitations

- **The genre labels of the movies may not be accurate.**