<h1>Assignment TMDB</h1>
    

    - Part 1: Exploring and Preparing the data for analysis
    - Part 2: Analyses of different genres
    - Part 3: 

<h2> Part 1: Exploring and preparing the data for analysis </h2>
We start by importing the packages we will need. We then load the datasets and begin exploring and preparing the data for further analysis.

In [1]:
import json

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
from sklearn.impute import SimpleImputer as Imputer # Imputer was removed from sklearn.preprocessing in scikit-learn 0.22
from sklearn.decomposition import PCA # Principal Component Analysis module
from sklearn.cluster import KMeans # KMeans clustering 


import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
%matplotlib inline
# Input data files are available in the "../input/" directory.

movies = pd.read_csv('../input/tmdb_5000_movies.csv')
credits = pd.read_csv('../input/tmdb_5000_credits.csv')



Let's just start with some easy questions to get familiar with the data. So what does the data look like? We'll start with taking a look at the movies data frame.

In [2]:
movies.head()

The first thing we notice is that the column order is awkward for inspecting the data. A preferable first column of this data frame would, for example, be the title of the movie and not the movie's budget.

We also notice that the columns 'genres', 'keywords', 'production_companies', 'production_countries' and 'spoken_languages' contain JSON-encoded lists of dictionaries stored as strings, so right now they are quite hard to read, but later on we will find a way to work with them.

Amongst the numerical columns, there's a movie budget, a movie ID, popularity, revenue, runtime, a vote average and the number of votes a movie has received.

A good description of what the popularity variable should be telling us is nowhere to be found, so it will be hard to use this column for our predictions later on. Although the ID column is numerical, it is not of interest for making predictions about, for example, the revenue of a movie. For now, we leave this data frame as it is and we'll take a quick look at the other one.

In [3]:
credits.head()

So this data frame has far fewer columns. The cast and crew might be interesting later on. Since this data frame contains only two extra columns, we'll try to merge it with the movies data frame. If they are in the same order, we can just concatenate the data frames, so let's see if every row in both data frames is about the same movie:

In [4]:
(credits['title']==movies['title']).describe()

This tells us that every row in the credits data frame has the same movie title as the corresponding row in the movies data frame. To prevent getting duplicate columns, we'll remove the movie_id and title columns from the credits data frame and concatenate the two data frames.

In [5]:
del credits['title']
del credits['movie_id']
movie_df = pd.concat([movies, credits], axis=1)
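
The concatenation above relies on both data frames being in exactly the same row order. As a more defensive alternative, the two frames could be joined on the movie identifier instead. The following is only a minimal sketch on toy data: `movies_toy` and `credits_toy` are hypothetical stand-ins, and it assumes that `id` in the movies frame and `movie_id` in the credits frame refer to the same movie.

```python
import pandas as pd

# Hypothetical stand-ins for the two TMDB data frames (values are made up).
movies_toy = pd.DataFrame(
    {"id": [1, 2, 3], "title": ["A", "B", "C"], "budget": [10, 20, 30]}
)
credits_toy = pd.DataFrame(
    {"movie_id": [3, 1, 2], "title": ["C", "A", "B"], "cast": ["c3", "c1", "c2"]}
)

# Join on the movie identifier instead of on row position.
# validate="one_to_one" makes pandas raise an error if a key is duplicated.
merged = movies_toy.merge(
    credits_toy.drop(columns="title"),  # avoid a duplicate title column
    left_on="id",
    right_on="movie_id",
    how="inner",
    validate="one_to_one",
).drop(columns="movie_id")

print(merged)
```

With a key-based merge, the row order of the credits frame no longer matters.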

In [6]:
movie_df.head()

The concatenation worked. However, the column order is still awkward and columns like homepage aren't that interesting for us. We choose the interesting columns, put them in a sensible order and create a new data frame.

In [7]:
newCols = ['id','title','release_date','popularity','vote_average','vote_count',
           'budget','revenue','genres','keywords','cast','crew','tagline', 'runtime', 'production_companies', 
           'production_countries', 'status']

df2 = movie_df[newCols].copy() # copy, so that later column assignments do not act on a view of movie_df
df2.head()

Let's explore our data frame a bit more in depth, starting with the numerical columns.

In [8]:
df2.describe().round()


Note that runtime contains a few missing values. Before we can really work with our data frame, we need to solve this. We use an imputer for this:

In [9]:
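my_imputer = Imputer() # default strategy: replace missing values with the column mean

# fit_transform returns a 2-D array, so flatten it before assigning it back
df2['runtime'] = my_imputer.fit_transform(df2[['runtime']]).ravel()
df2.describe().round()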
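
To make the effect of the imputer concrete, here is a small sketch on a hand-made column. The data frame `toy` is hypothetical; the sketch uses `SimpleImputer` from `sklearn.impute` with the mean strategy, which is an assumption about the setup rather than a copy of the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical runtime column with one missing value.
toy = pd.DataFrame({"runtime": [90.0, np.nan, 120.0, 150.0]})

# Replace missing values by the mean of the observed values (90, 120, 150).
imputer = SimpleImputer(strategy="mean")
toy["runtime"] = imputer.fit_transform(toy[["runtime"]]).ravel()

print(toy)
```

The missing entry is filled with 120, the mean of the three observed runtimes, and the other values are left untouched.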

In [10]:
df2.loc[df2['budget'].idxmax()]

In [11]:
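# Drop the three highest-budget movies one at a time. Only the last expression
# of the cell is displayed: the movie with the highest remaining budget.
df3=df2[df2.title != 'Pirates of the Caribbean: On Stranger Tides']
df3.loc[df3['budget'].idxmax()]
df4 = df3[df3.title != "Pirates of the Caribbean: At World's End"]
df4.loc[df4['budget'].idxmax()]
df5 = df4[df4.title != "Avengers: Age of Ultron"]
df5.loc[df5['budget'].idxmax()]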


So now at least all the numerical columns are complete. Let's take a quick look at how all the variables are distributed.

In [12]:
del df2['id']

In [13]:
#df2['vote_classes'] = pd.cut(df2['vote_average'],10, labels=["1", "2","3","4","5","6","7","8","9","10"])
df2['vote_classes'] = pd.cut(df2['vote_average'],4, labels=["low", "medium-low","medium-high","high"])

In [14]:
fig = plt.figure(figsize = (10,15))
ax = fig.gca()

# One histogram per numerical column, all drawn on the same figure
df2.hist(ax=ax)

Note that most of the distributions are quite skewed. We'll look into this in more depth later.
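
One common way to deal with right-skewed variables such as budgets is a log transform. The sketch below is not part of the original analysis: it uses made-up values in `budget_toy` and only illustrates that `np.log1p` (which also handles zeros) tends to reduce the skewness reported by `Series.skew()`.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values, loosely shaped like movie budgets.
budget_toy = pd.Series([1e5, 5e5, 1e6, 2e6, 5e6, 2e7, 3e8])

skew_raw = budget_toy.skew()            # skewness of the raw values
skew_log = np.log1p(budget_toy).skew()  # skewness after log(1 + x)

print(skew_raw, skew_log)
```

On this toy series, the skewness after the transform is clearly smaller than before.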

<h2> Part 2: Analyze genres </h2>

Now that we've got a good overview of the distribution of our numerical variables, let's take a closer look at our non-numerical variables. We choose to start with the genres, since this variable has the least variability and should therefore be the easiest target for analysis.

The genres column is stored as strings, although each value is actually a JSON-encoded list of dictionaries. To analyse and understand the data, it is necessary to parse these strings and extract the genre names.
Although we already loaded our data for the exploration, we'll reload it here and make sure to load the JSON columns correctly. To do this, we made use of a few tricks found in another Kernel*

In [15]:
def load_tmdb_movies(path):
    df = pd.read_csv(path)
    df['release_date'] = pd.to_datetime(df['release_date']).apply(lambda x: x.date())
    json_columns = ['genres', 'keywords', 'production_countries', 'production_companies', 'spoken_languages']
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
    return df

def load_tmdb_credits(path):
    df = pd.read_csv(path)
    json_columns = ['cast', 'crew']
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
    return df

def pipe_flatten_names(keywords):
    return '|'.join([x['name'] for x in keywords])

credits = load_tmdb_credits("../input/tmdb_5000_credits.csv")
movies = load_tmdb_movies("../input/tmdb_5000_movies.csv")

del credits['title']
df = pd.concat([movies, credits], axis=1)

df['genres'] = df['genres'].apply(pipe_flatten_names)

liste_genres = set()
for s in df['genres'].str.split('|'):
    liste_genres = set().union(s, liste_genres)
liste_genres = list(liste_genres)
liste_genres.remove('')


So what happened here is the following: first, we parsed the JSON strings in the genres column into lists of dictionaries. Afterwards, `pipe_flatten_names` joined the value stored under the key *name* in each dictionary into one string, separated by `|`. Using the *split()* function on that separator, we were then able to create a list of every genre that occurs in the genres column. A movie without any genre yields an empty string, which is why we remove `''` from the list.
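
The parsing steps can be followed on a single hand-written value. This is only an illustrative sketch: the string `raw` is a made-up example in the same shape as the genres column, not an actual row of the dataset.

```python
import json

# A made-up value in the same shape as one cell of the genres column.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = json.loads(raw)                             # list of dictionaries
flat = "|".join(entry["name"] for entry in parsed)   # pipe-separated names
genres = set(flat.split("|"))                        # unique genre names

# A movie without genres gives an empty list, hence an empty string,
# and splitting that string yields [''].
empty_flat = "|".join(entry["name"] for entry in json.loads("[]"))
empty_split = empty_flat.split("|")

print(flat, genres, empty_split)
```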

Now, let's reduce our data frame. To get more insight into the influence of a movie's genre, the title, vote_average, release_date, runtime, budget and revenue are the most important variables. We also add a column for every genre, containing a 1 if a movie is of that specific genre and a 0 otherwise.

In [16]:
df_reduced = df[['title','vote_average','release_date','runtime','budget','revenue']].reset_index(drop=True)

for genre in liste_genres:
    df_reduced[genre] = df['genres'].str.contains(genre).apply(lambda x:1 if x else 0)

df_reduced.head()
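
The 0/1 genre columns can also be built in one step with `Series.str.get_dummies`, which splits on a separator and matches whole labels rather than substrings. The following is a sketch on a hypothetical frame `toy`, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical movies with pipe-separated genres; the last one has none.
toy = pd.DataFrame(
    {"title": ["A", "B", "C"], "genres": ["Action|Adventure", "Drama", ""]}
)

# One indicator column per genre label; empty strings produce no column.
indicators = toy["genres"].str.get_dummies(sep="|")
toy_reduced = pd.concat([toy[["title"]], indicators], axis=1)

print(toy_reduced)
```

Because whole labels are matched, this approach would stay correct even if one genre name happened to be contained in another.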

Now that we've got an easy-to-work-with data frame for the movie genres, we can take a look at the distribution of the genres by creating a pie chart*.

In [17]:
plt.rc('font', weight='bold')
f, ax = plt.subplots(figsize=(5,5))
genre_count = []
for genre in liste_genres:
    genre_count.append([genre, df_reduced[genre].values.sum()])
genre_count.sort(key = lambda x:x[1], reverse = True)
labels, sizes = zip(*genre_count)
labels_selected = [n if v > sum(sizes) * 0.01 else '' for n, v in genre_count]
ax.pie(sizes, labels=labels_selected,
      autopct = lambda x:'{:2.0f}%'.format(x) if x>1 else '',
      shadow = False, startangle=0)
ax.axis('equal')
plt.tight_layout()

This pie chart shows which genres are most common in the movies dataset. We find that drama is the most common genre, followed by comedy, thriller and action. Interestingly, the top 5 genres together account for about half (51%) of the chart. Because a movie can carry several genres, these percentages are shares of all genre labels rather than of movies. This suggests that the main genre of most movies is drama, comedy, thriller or action. However, the most common genres could also be seen as more general descriptions. For example, movies with the genre war might also be tagged as action movies or drama movies.

Now let's try to get a more in-depth view of the genres. In this cell we calculate the average votes, budget and revenue for the different genres. We create a new data frame consisting of every genre and the calculated averages. **

In [18]:
mean_per_genre = pd.DataFrame(liste_genres)

# For every metric: the mean over the movies tagged with each genre (indicator == 1)
metrics = {'mean_votes_average': 'vote_average',
           'mean_budget': 'budget',
           'mean_revenue': 'revenue'}

for new_column, source_column in metrics.items():
    mean_per_genre[new_column] = [
        df_reduced.groupby(genre, as_index=True)[source_column].mean().loc[1]
        for genre in liste_genres
    ]

# Difference between the mean revenue and the mean budget per genre
mean_per_genre['profit'] = mean_per_genre['mean_revenue']-mean_per_genre['mean_budget']

mean_per_genre
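
The same kind of per-genre averages can be obtained without indicator columns by turning the data into a long format with one row per movie and genre. This is a sketch under assumptions: the frame `toy` and its values are made up, and `DataFrame.explode` requires pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical movies with pipe-separated genres and made-up numbers.
toy = pd.DataFrame(
    {
        "genres": ["Action|Adventure", "Drama", "Action|Drama"],
        "vote_average": [6.0, 8.0, 7.0],
        "budget": [100.0, 10.0, 50.0],
        "revenue": [300.0, 20.0, 60.0],
    }
)

# One row per (movie, genre) pair.
long = toy.assign(genre=toy["genres"].str.split("|")).explode("genre")

# Mean of each metric per genre, plus mean revenue minus mean budget.
means = long.groupby("genre")[["vote_average", "budget", "revenue"]].mean()
means["profit"] = means["revenue"] - means["budget"]

print(means)
```

Note that a movie with several genres contributes to the average of each of them, just as in the indicator-based approach.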

Let's see which genres are the best scoring ones in each category:

In [19]:
mean_per_genre.sort_values('mean_votes_average', ascending=False).head()


In [20]:
mean_per_genre.sort_values('mean_budget', ascending=False).head()

In [21]:
mean_per_genre.sort_values('mean_revenue', ascending=False).head()

In [22]:
mean_per_genre.sort_values('profit', ascending=False).head()

It's very interesting to see that the top 5 by vote average consists of *History, War, Drama, Music* and *Foreign*, while none of these genres appears in the top 5 of any of the other three categories, which all share the same top 3: *Animation, Adventure, Fantasy*. On the one hand, this is easily explained, since budget and revenue should be closely related and profit is directly derived from budget and revenue. On the other hand, we would have expected a higher correlation between the budget and the quality of a movie.

To go even more in depth, we want to analyse the averages per genre per year. Therefore, we first extend the data frame with the year of release of each movie. Afterwards, we create a new data frame which contains the average votes, average runtime and average budget per release year and per genre.

In the last step in the cell below, only the rows that contain a 1 for genre are kept, so we create a data frame with only the specific genres. 

In [23]:
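# Extract the release year from the release date
df_reduced['release_year'] = pd.to_datetime(df_reduced['release_date']).dt.year

# Per genre: mean of every numerical column, grouped by genre indicator and release year
# (numeric_only=True skips text columns such as the title)
df_list = []
for genre in liste_genres:
    df_list.append(df_reduced.groupby([genre,'release_year']).mean(numeric_only=True).reset_index())

# Keep only the rows where the genre indicator (the first column) equals 1;
# .iloc replaces the removed .ix indexer
df_per_genre = []
for i in range(len(df_list)):
    df_per_genre.append(df_list[i][df_list[i].iloc[:,0] == 1])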


Now we create tables which contain the average budget, average revenue and average votes per year per genre. We start by creating a new table with the columns 1988 till 2017. Afterwards, the table is filled with the data for the different variables. **

In [24]:
def mean_per_year_table(value_column, years=range(1988, 2018)):
    """Build a genre x year table with the yearly mean of value_column."""
    table = pd.DataFrame(columns=years)
    for i, genre in enumerate(liste_genres):
        temp = df_per_genre[i].pivot_table(index=genre, columns='release_year',
                                           values=value_column, aggfunc='mean')
        # Keep the last 30 release years of the row where the genre indicator is 1
        table.loc[i] = temp[temp.columns[-30:]].loc[1]
    table['genre'] = liste_genres
    return table

# Budget
budget_genre = mean_per_year_table('budget')

# Revenue
revenue_genre = mean_per_year_table('revenue')

# Vote average
vote_avg_genre = mean_per_year_table('vote_average')
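
As an illustration of the idea behind these tables, the sketch below builds a small genre-by-year table directly with `pivot_table` and then aligns it to a fixed range of years with `reindex`. The frame `toy`, its values and the year range are all hypothetical.

```python
import pandas as pd

# Hypothetical movies with pipe-separated genres, release year and budget.
toy = pd.DataFrame(
    {
        "genres": ["Action|Drama", "Drama", "Action", "Drama"],
        "release_year": [2000, 2000, 2001, 2001],
        "budget": [10.0, 30.0, 50.0, 70.0],
    }
)

# One row per (movie, genre) pair, then the mean budget per genre and year.
long = toy.assign(genre=toy["genres"].str.split("|")).explode("genre")
budget_table = long.pivot_table(
    index="genre", columns="release_year", values="budget", aggfunc="mean"
)

# Align to a fixed range of years; years without any movie become NaN.
budget_table = budget_table.reindex(columns=range(2000, 2003))

print(budget_table)
```

Reindexing by year label keeps every value under its own year, even when a genre has no movies in some of the years.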

Let's take a look at the data frames we generated.

### Mean budget per genre per year:

In [25]:
budget_genre.index = budget_genre['genre']
budget_genre

### Mean revenue per genre per year:

In [26]:
revenue_genre.index = revenue_genre['genre']
revenue_genre


### Mean vote average per genre per year:

In [27]:
vote_avg_genre.index = vote_avg_genre['genre']
vote_avg_genre

We can gain more insight into these tables by making heatmaps**.

### Budget:

In [28]:
fig, ax = plt.subplots(figsize=(9,9))
cmap = sns.cubehelix_palette(start = 1.5, rot = 1.5, as_cmap = True)
sns.heatmap(budget_genre.iloc[:,0:30].astype(float), xticklabels=3, cmap=cmap, linewidths=0.05)

The heatmap shows that, in general, movie budgets increased over the years, especially in the genres fantasy, adventure, family, action, science fiction and animation. The heatmap also shows that Western movies had an extremely high budget in 2013. This could mean that a single costly movie was produced in 2013, which has a great influence on the average. We might remove this possible outlier later on, to get a better overview of the distribution of the rest of the movies.

### Revenue:

In [29]:

fig, ax = plt.subplots(figsize=(9,9))
cmap = sns.cubehelix_palette(start = 1.5, rot = 1.5, as_cmap = True)
sns.heatmap(revenue_genre.iloc[:,0:30].astype(float), xticklabels=3, cmap=cmap, linewidths=0.05)

This heatmap shows the average revenue per genre from 1988 till 2017. The clearest increase in average revenue is in the genres fantasy, adventure, family, action and science fiction. Interestingly, the cell for the genre animation in 1994 is colored black. This is surprising because no other cell in the graph is black, and in general revenues in 1994 are lower than those of movies produced in later years. A reason for this could be that there were only a few animation movies in 1994 and that those movies did extremely well. The previous heatmap does not show an above-average budget for animation movies in 1994.


### Vote average:

In [30]:
fig, ax = plt.subplots(figsize=(9,9))
cmap = sns.cubehelix_palette(start = 1.5, rot = 1.5, as_cmap = True)
sns.heatmap(vote_avg_genre.iloc[:,0:30].astype(float), xticklabels=3, cmap=cmap, linewidths=0.05)

This heatmap is much darker than the previous two, which suggests that the averages sit relatively high on their scale compared with the other two categories. Most of the genres seem to score somewhere around a 6 out of 10. Especially notable is the fact that there are very few green or orange cells, which should mean that most movies are, on average, just a nice watch.

As said before, we would like to remove the very high budget input from the Western genre, to make the heatmap less skewed. Let's see what happens:

In [31]:
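temp = budget_genre.copy() # work on a copy, so that budget_genre itself stays unchanged
# Set the extreme 2013 Western average to 0 so that it no longer dominates the color scale
temp[2013]=temp[2013].replace(2.550000e+08, 0)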

In [32]:
fig, ax = plt.subplots(figsize=(9,9))
cmap = sns.cubehelix_palette(start = 1.5, rot = 1.5, as_cmap = True)
sns.heatmap(temp.iloc[:,0:30].astype(float), xticklabels=3, cmap=cmap, linewidths=0.05)

This heatmap clearly shows that Fantasy, Adventure, Science Fiction and Animation have, on average, the highest budgets. It is also clear that movie budgets increased over the years. However, there are a few exceptions. For example, Western movies had an above-average budget in 2004 and History movies in 2000. This might be the effect of individual movies with a high budget.
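
Replacing the outlier by matching its exact floating-point value is fragile, and a 0 is still drawn as a very low budget. A possible alternative, sketched here on a hypothetical table `budget_toy` that is indexed by genre, is to select the cell by its labels and set it to `NaN`; seaborn's `heatmap` masks missing values, so such a cell is left blank.

```python
import numpy as np
import pandas as pd

# Hypothetical genre-by-year budget table, indexed by genre.
budget_toy = pd.DataFrame(
    {2012: [50.0, 40.0], 2013: [60.0, 255.0]},
    index=["Action", "Western"],
)

# Work on a copy and blank out the outlier by row and column label.
masked = budget_toy.copy()
masked.loc["Western", 2013] = np.nan

print(masked)
```

The original table keeps its value, and only the copy used for plotting is changed.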

It might also be nice to create a visualisation of how often different genres are connected with each other, that is, which genres occur together most often in the same movie. But this is something for later on.

# Numerical Analysis

So let's take a closer look at the numerical columns in our data frame. Let's start by creating a data frame containing only the numerical columns.

In [33]:
num_list = ['budget','popularity','revenue','runtime','vote_average','vote_count']
movie_num = df2[num_list]
movie_num.head()


Let's take a look at how everything is correlated:

In [34]:
f, ax = plt.subplots(figsize=(12,10))
plt.title('Pearson Correlation of Movie Features')
sns.heatmap(movie_num.astype(float).corr(), linewidths=0.25, vmax=1.0, square=True,
           cmap="YlGnBu", linecolor='black', annot=True)

We see quite a few dark blue squares. These mark the more strongly correlated pairs of variables. This knowledge might be important when we make predictions about certain movies later on.
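
Besides reading the heatmap, the strongest correlations can be listed explicitly. The sketch below does this for a made-up frame `toy`: it keeps only the upper triangle of the correlation matrix, so that every pair of variables appears once, and sorts the pairs by their Pearson correlation.

```python
import numpy as np
import pandas as pd

# Hypothetical numerical columns with made-up values.
toy = pd.DataFrame(
    {
        "budget": [1.0, 2.0, 3.0, 4.0, 5.0],
        "revenue": [2.0, 4.1, 6.2, 7.9, 10.0],
        "runtime": [100.0, 90.0, 120.0, 95.0, 110.0],
    }
)

corr = toy.corr()  # Pearson correlation by default

# Keep the strict upper triangle so each pair is listed once, without the diagonal.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)

print(pairs)
```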

# Comparing different regression techniques

We want to compare a few regression techniques to help us in making predictions. We'll use linear regression and random forest, as treated in the lectures.
We start by recreating our numerical data frame.

In [64]:
num_list = ['budget','popularity','revenue','runtime','vote_average','vote_count']
movie_num = df2[num_list]
movie_num.head()

We want vote_average to be our target value; budget, popularity, revenue, runtime and vote_count are the training features.

In [65]:
training_list = ['budget','popularity','revenue','runtime','vote_count']
training = movie_num[training_list]
target = movie_num['vote_average']


In [66]:
X = training.values
y = target.values

We split our data into a training set and a test set.

In [67]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

Now let's train a linear regression model and plot the results: \***

In [68]:
from sklearn import linear_model
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

# Make predictions using the testing set
y_pred_lr = regr.predict(X_test)

In [69]:
f = plt.figure(figsize=(10,5))
# Column 1 of X_test is the popularity feature
plt.scatter(X_test[:,1], y_test, s=50,label="Real vote_average");
plt.scatter(X_test[:,1], y_pred_lr,s=100, c='r',label="Predicted vote_average");
plt.xlabel("popularity");
plt.ylabel("vote_average");
plt.legend(loc=2);


Now let's see what happens if we use a random forest regression model:

In [70]:
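from sklearn.ensemble import RandomForestRegressor
# Create random forest regression object; the first argument is n_estimators,
# so this forest consists of a single tree
rf = RandomForestRegressor(n_estimators=1)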

# Train the model using the training sets
rf.fit(X_train, y_train)

# Make predictions using the testing set
y_pred_rf = rf.predict(X_test)
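
The first argument of `RandomForestRegressor` is `n_estimators`, the number of trees in the forest. The following sketch compares a forest with one tree to a forest with 100 trees. It runs on synthetic data generated in the block itself, so the numbers say nothing about the TMDB data; they only illustrate the effect of the parameter.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (not the movie data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Same model class, different number of trees.
single_tree = RandomForestRegressor(n_estimators=1, random_state=0).fit(X_train, y_train)
many_trees = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

mse_single = mean_squared_error(y_test, single_tree.predict(X_test))
mse_many = mean_squared_error(y_test, many_trees.predict(X_test))

print(mse_single, mse_many)
```

Averaging over many trees reduces the variance of the predictions, which is why the larger forest usually has the lower test error.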

In [71]:
f = plt.figure(figsize=(10,5))
# Column 1 of X_test is the popularity feature
plt.scatter(X_test[:,1], y_test, s=50,label="Real vote_average");
plt.scatter(X_test[:,1], y_pred_rf,s=100, c='r',label="Predicted vote_average");
plt.xlabel("popularity");
plt.ylabel("vote_average");
plt.legend(loc=2);

And let's compare them:

In [72]:
from sklearn.metrics import mean_squared_error

error_lr = mean_squared_error(y_test,y_pred_lr)
error_rf = mean_squared_error(y_test,y_pred_rf)

print(error_lr)
print(error_rf)

In [73]:
f = plt.figure(figsize=(10,5))
plt.bar(range(2),[error_lr,error_rf])
plt.xlabel("Regressors");
plt.ylabel("Mean Squared Error of the vote_average");
plt.xticks(range(2),['Linear Regression','Random Forest']);

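So the mean squared error of the random forest regression is a little higher than that of the linear regression, but both estimators seem to perform reasonably well. Keep in mind that the random forest was built with a single tree (`RandomForestRegressor(1)`), which may explain why it does not beat the linear model.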
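
A mean squared error is easier to judge next to a baseline. One possible check, sketched here on synthetic data rather than on the movie data, is to compare a model with scikit-learn's `DummyRegressor`, which always predicts the mean of the training targets, and to report the R² score as well.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (not the movie data).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Baseline that ignores the features, and the model to be judged.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

mse_baseline = mean_squared_error(y_test, baseline.predict(X_test))
mse_model = mean_squared_error(y_test, model.predict(X_test))
r2_model = r2_score(y_test, model.predict(X_test))

print(mse_baseline, mse_model, r2_model)
```

If a model's error is close to the baseline error, the features add little, however small the error may look in absolute terms.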

\* https://www.kaggle.com/fabiendaniel/categorizing-actors-hands-on-plotly <br>
\** https://www.kaggle.com/diegoinacio/imdb-genre-based-analysis <br>
\*** introduction to data science, week 4, Comparison of Regression Techniques on House prediction prices.ipynb

In [45]:
training_list = ['budget', 'popularity']
training = movie_num[training_list]
target = movie_num['vote_average']


In [46]:
X = training.values
y = target.values

In [47]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [48]:
from sklearn import linear_model
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

# Make predictions using the testing set
y_pred_lr = regr.predict(X_test)

In [49]:
f = plt.figure(figsize=(10,5))
plt.scatter(X_test[:,1], y_test, s=50,label="Real vote_average");
plt.scatter(X_test[:,1], y_pred_lr,s=100, c='r',label="Predicted vote_average");
plt.ylabel("vote_average");
plt.legend(loc=2);

In [57]:
df_prediction=df2[df2.title != "Pirates of the Caribbean: On Stranger Tides"]
df_prediction.loc[df_prediction['budget'].idxmax()]
df_prediction = df_prediction[df_prediction.title != "Pirates of the Caribbean: At World's End"]
df_prediction.loc[df_prediction['budget'].idxmax()]
df_prediction = df_prediction[df_prediction.title != "Avengers: Age of Ultron"]
df_prediction.loc[df_prediction['budget'].idxmax()]

In [None]:


## Predict vote_average on the data set without the three highest-budget movies

num_list = ['budget','popularity','revenue','runtime','vote_average','vote_count']
movie_num = df_prediction[num_list]

#training_list = ['budget','popularity','revenue','runtime','vote_count']
training_list = ["budget",'runtime',"vote_count", 'popularity', 'revenue' ]
training = movie_num[training_list]
target = movie_num['vote_average']

X = training.values
y = target.values

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

from sklearn import linear_model
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

# Make predictions using the testing set
y_pred_lr = regr.predict(X_test)

f = plt.figure(figsize=(15,10))
# Column 1 of X_test is the runtime feature; the target is vote_average
plt.scatter(X_test[:,1], y_test, s=50,label="Real vote_average");
plt.scatter(X_test[:,1], y_pred_lr,s=100, c='r',label="Predicted vote_average");
plt.xlabel("runtime");
plt.ylabel("vote_average");
plt.legend(loc=2);

#---------Random forest ------------------------------------------------------
from sklearn.ensemble import RandomForestRegressor
# Create random forest regression object; the first argument is n_estimators,
# so this forest consists of a single tree
rf = RandomForestRegressor(n_estimators=1)

# Train the model using the training sets
rf.fit(X_train, y_train)

# Make predictions using the testing set
y_pred_rf = rf.predict(X_test)
                       
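f = plt.figure(figsize=(15,10))
# Column 1 of X_test is the runtime feature; the target is vote_average
plt.scatter(X_test[:,1], y_test, s=50,label="Real vote_average");
plt.scatter(X_test[:,1], y_pred_rf,s=100, c='r',label="Predicted vote_average");
plt.xlabel("runtime");
plt.ylabel("vote_average");
plt.legend(loc=2);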

from sklearn.metrics import mean_squared_error

error_lr = mean_squared_error(y_test,y_pred_lr)
error_rf = mean_squared_error(y_test,y_pred_rf)

print(error_lr)
print(error_rf)

f = plt.figure(figsize=(10,5))
plt.bar(range(2),[error_lr,error_rf])
plt.xlabel("Regressors");
plt.ylabel("Mean Squared Error of the vote_average");
plt.xticks(range(2),['Linear Regression','Random Forest']);