
# Project: Investigate a Dataset (TMDb movie data)

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
# Introduction

    This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.

    ● Certain columns, like ‘cast’ and ‘genres’, contain multiple values separated by pipe (|) characters.
    
    ● There are some odd characters in the ‘cast’ column.
    
    ● The final two columns, ending with “_adj”, show the budget and revenue of the associated movie in terms of 2010 dollars,
    accounting for inflation over time.
    
    The questions I will discuss are:
    Does budget affect popularity?
    Does budget affect revenue?
    Does more popularity mean more revenue?
    Which genres are most popular from year to year?
    Which production companies are most popular?
    What kinds of properties are associated with movies that have high revenues?
    What is the most popular release year?
    What is the most popular release month?


<a id='wrangling'></a>
## Data Wrangling


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df=pd.read_csv('../input/imdb-data/tmdb-movies.csv')

In [None]:
df.head()

In [None]:
df.info()

###### drop unnecessary columns
    

In [None]:
# drop identifier and free-text columns that are not used in this analysis
df.drop(columns=['id','imdb_id','cast','homepage','director','overview','keywords','tagline','original_title'],inplace=True)

In [None]:
#test
df.info()

In [None]:
df.describe()

##### it seems we have some data entry errors

In [None]:
df[df['budget']==0]
# a movie cannot realistically have no budget, so a budget of 0 is treated as a data entry error;
# I'll drop the 0-budget films so they don't affect my analysis

In [None]:
# store the index labels of the 0-budget rows (a pandas Index) so I can drop them
ind=df[df['budget']==0].index
df.drop(ind,axis=0,inplace=True)

In [None]:
#test
df[df['budget']==0]

In [None]:
#another problem is runtime = 0
df[df['runtime']==0]

In [None]:
#I'll solve it the same way as the previous one
ind=df[df['runtime']==0].index
df.drop(ind,axis=0,inplace=True)

In [None]:
#test
df[df['runtime']==0]
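
The two clean-up steps above can also be written as one boolean filter. This is only a sketch on a made-up toy table (the names `toy`, `valid` and `toy_clean` and all values are assumptions for illustration, not part of the notebook's data):

```python
import pandas as pd

# Toy stand-in for the movie table; the values are made up for illustration.
toy = pd.DataFrame({
    "budget":  [0, 150_000_000, 20_000_000, 5_000_000],
    "runtime": [95, 124, 0, 101],
})

# Keep only the rows where both budget and runtime are non-zero.
valid = (toy["budget"] != 0) & (toy["runtime"] != 0)
toy_clean = toy[valid].copy()

# Report how many rows the filter removed.
print(f"dropped {len(toy) - len(toy_clean)} of {len(toy)} rows")
```

Printing the number of dropped rows makes it easier to judge how much the clean-up shrinks the data set.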

In [None]:
#last check
df.describe()

###### turning release_date from string to datetime 

In [None]:
df['release_date']=pd.to_datetime(df['release_date'])

In [None]:
#test
df.info()
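
One thing that may be worth checking after this conversion: if the raw `release_date` strings use two-digit years (for example `6/9/15`, which is an assumption about the file and not something shown above), older movies can end up in the wrong century. The sketch below uses a made-up toy table and the existing `release_year` column to shift such dates back by 100 years. The month-based question later on would not be affected either way.

```python
import pandas as pd

# Toy rows; the m/d/yy string format is an assumption about the raw file.
toy = pd.DataFrame({
    "release_date": ["6/9/15", "11/15/66"],
    "release_year": [2015, 1966],
})

# With an explicit two-digit-year format, "66" is read as 2066.
toy["release_date"] = pd.to_datetime(toy["release_date"], format="%m/%d/%y")

# Where the parsed year is later than release_year, shift back one century.
wrong_century = toy["release_date"].dt.year > toy["release_year"]
toy.loc[wrong_century, "release_date"] = (
    toy.loc[wrong_century, "release_date"] - pd.DateOffset(years=100)
)

print(toy)
```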

In [None]:
df.sample(5)

<a id='eda'></a>
## Exploratory Data Analysis

### Research Question 1
### Does budget affect popularity?
### Does budget affect revenue?
### Does more popularity mean more revenue?

In [None]:
sns.pairplot(df,diag_kind='kde');

###### scatter pair plot of all numeric columns

In [None]:
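# correlate only the numeric columns (genres, production_companies and release_date are not numeric)
df.select_dtypes(include='number').corr()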
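
To answer the three questions directly, it can help to read off just the three coefficients of interest instead of scanning the whole matrix. A minimal sketch on made-up numbers (the table `toy` and its values are assumptions for illustration only):

```python
import pandas as pd

# Toy numbers, made up for illustration; not taken from the real data set.
toy = pd.DataFrame({
    "budget":     [10, 20, 30, 40, 50],
    "popularity": [1.0, 1.5, 1.2, 2.5, 3.0],
    "revenue":    [15, 45, 40, 120, 160],
})

# Pearson correlation matrix of the three columns.
corr = toy[["budget", "popularity", "revenue"]].corr()

# The three pairs asked about in Research Question 1.
pairs = [("budget", "popularity"), ("budget", "revenue"), ("popularity", "revenue")]
for a, b in pairs:
    print(f"{a} vs {b}: r = {corr.loc[a, b]:.2f}")
```

A coefficient close to 1 means the two columns rise together. It does not show by itself that one causes the other.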

### Research Question 2 (Which genres are most popular from year to year?)

In [None]:
df.genres.mode()

     Below, I visualize the movie categories by building a data frame that holds each category and its frequency.

In [None]:
c=df.genres.str.split('|')
lists=[]
for i in c:
    lists.append(i)

In [None]:
len(lists)

In [None]:
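# flatten the per-movie genre lists into one list of genre labels
flat_list=[]
for genres in lists:
    # skip movies whose genres value is missing (NaN is a float, not a list)
    if isinstance(genres, list):
        flat_list.extend(genres)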

In [None]:
# count how often each category occurs and store the result as a two-column data frame
s=pd.Series(flat_list).value_counts()
s=s.rename_axis('categories').reset_index(name='frequency')

In [None]:
s.plot(kind='bar',x='categories',y='frequency',title='movies categories',ylabel='frequency',grid=True,figsize=(15,10));
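
The cells above count genres over all years, and `df.genres.mode()` returns the most common combined string rather than a single genre. The sketch below is one possible way to look at the "from year to year" part of the question. It takes "most popular" to mean "most frequent", as the cells above do, and runs on a made-up toy table (`toy`, `long`, `genre_counts` and `top_per_year` are illustrative names):

```python
import pandas as pd

# Toy rows, made up for illustration.
toy = pd.DataFrame({
    "release_year": [2010, 2010, 2011, 2011, 2011],
    "genres": ["Drama|Comedy", "Drama", "Action|Adventure", "Action|Drama", None],
})

# One row per (movie, genre); rows with missing genres are dropped first.
long = (
    toy.dropna(subset=["genres"])
       .assign(genre=lambda d: d["genres"].str.split("|"))
       .explode("genre")
)

# Overall frequency of each genre.
genre_counts = long["genre"].value_counts()

# Most frequent genre within each release year.
top_per_year = (
    long.groupby("release_year")["genre"]
        .agg(lambda g: g.value_counts().idxmax())
)

print(genre_counts)
print(top_per_year)
```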

### Research Question 3 (Which production companies are most popular?)

In [None]:
df.production_companies.mode()
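
Note that `mode()` here returns the most common full string in the column. If `production_companies` is pipe-separated like `genres` (an assumption, since only `cast` and `genres` are named in the introduction), individual companies could be counted with a small helper. `count_pipe_values` is a hypothetical name, and the company strings below are made-up toy data:

```python
import pandas as pd

def count_pipe_values(column: pd.Series, sep: str = "|") -> pd.Series:
    """Count how often each individual value occurs in a separator-joined column."""
    return column.dropna().str.split(sep).explode().str.strip().value_counts()

# Toy column, made up for illustration.
companies = pd.Series([
    "Paramount Pictures|Lucasfilm",
    "Paramount Pictures",
    "Universal Pictures|Amblin Entertainment",
    None,
    "Universal Pictures|Paramount Pictures",
])

company_counts = count_pipe_values(companies)
print(company_counts.head())
```

The same helper would work for any other column that uses the same separator.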

### Research Question 4 (What kinds of properties are associated with movies that have high revenues?)

In [None]:
high_revenue=df.sort_values(by=['revenue'],ascending=False).head(100)

In [None]:
high_revenue.budget.mean()

In [None]:
high_revenue.runtime.mean()

In [None]:
high_revenue.genres.mode()
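
The averages above are easier to interpret next to the same averages for all movies. A minimal sketch on made-up numbers (`toy`, `top` and `comparison` are illustrative names, and the top 3 stand in for the top 100):

```python
import pandas as pd

# Toy numbers, made up for illustration.
toy = pd.DataFrame({
    "revenue": [900, 50, 700, 20, 10, 800],
    "budget":  [200, 10, 150, 5, 8, 180],
    "runtime": [140, 95, 130, 90, 100, 150],
})

# The highest-revenue movies (top 3 here; the notebook uses the top 100).
top = toy.nlargest(3, "revenue")

# Mean budget and runtime of the top group next to the overall means.
comparison = pd.DataFrame({
    "top_revenue": top[["budget", "runtime"]].mean(),
    "all_movies":  toy[["budget", "runtime"]].mean(),
})
print(comparison)
```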

### Research Question 5 (What is the most popular release year?)

In [None]:
df.release_year.mode()

### Research Question 6 (What is the most popular release month?)

In [None]:
df.release_date.dt.month.mode()
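
`mode()` gives only the single most common month. Counting every month shows how far ahead it is. A small sketch on made-up dates (`dates` and `month_counts` are illustrative names):

```python
import pandas as pd

# Toy release dates, made up for illustration.
dates = pd.to_datetime(pd.Series([
    "2011-09-02", "2011-09-16", "2010-12-17", "2012-06-08", "2009-09-25",
]))

# Number of releases per month name, most common first.
month_counts = dates.dt.month_name().value_counts()
print(month_counts)
```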

In [None]:
df.release_year.plot(kind='hist',xlabel='release_year',title='release_year distribution',grid=True);

##### it seems that the number of movies released increases over time

In [None]:
df['runtime'].plot(kind='hist',xlabel='runtime',title='runtime distribution',grid=True);

###### most movies have a run time between 50 and 100 minutes
###### run times between 150 and 200 minutes are rare

<a id='conclusions'></a>
## Conclusions


##### Sources
    I searched Stack Overflow for help with pandas syntax.
    I got visualization help from the book Python for Data Analysis.

 ### Summary of findings and results

   
    - I found a positive correlation between budget and popularity. It is not very strong, but it suggests that
    popularity tends to increase as the budget increases.
    
    - I found a strong correlation between budget and revenue: movies with bigger budgets tend to earn more revenue.
    A correlation alone does not show that the budget causes the revenue.
    
    - There is a strong correlation between popularity and revenue, so a movie that becomes popular can be expected to earn more.
    
    - Drama is the most popular movie genre.
    
    - Paramount Pictures is the most popular production company.
    
    - The movies with the highest revenue usually have high budgets.
    They also have relatively long run times.
    The most common genre combination among them is Adventure|Fantasy|Action.
    
    - The most popular release year is 2011.
    
    - The most popular release month is September.
    
    - Drama movies make up the biggest share of all movies.
    
    - Most movies have a run time between 50 and 100 minutes.
    Run times between 150 and 200 minutes are rare.
    
    - The number of movies released increases over time.

   ### Limitations
    Dealing with strings separated by '|' was hard: I had to write extra code to split them before I could visualize them, and
    this adversely affected my analysis.
    
    Also, there was illogical data (values of 0) in the budget and runtime columns, which forced me to drop some rows. This of course
    affected the statistical results.