## Please enter your team name

In [None]:
team_name = input("Please enter your team name: ")

 


# Can you predict a movie's success?
![](https://media.giphy.com/media/3ohhwDMC187JqL69DG/giphy.gif)

In this tutorial, we will go through the basic steps of building a machine learning model, using a real-life dataset to get familiar with the topic.


By the end of this tutorial, you will know:

1- How to explore and manipulate a dataset

2- How to work with popular machine learning libraries like Scikit-learn

3- How the feature engineering process works

4- How to model a real-life problem with a machine learning algorithm


Keep in mind that **YOUR** results will help improve marketing efforts in the film industry by predicting whether a film will flop or become a hit in cinemas after its release.



## The Movies Dataset

The dataset comes from the IMDB API and consists of 26 million ratings on around 45,000 movies from 27,000 users.
The aim is to create a regression model that predicts the average rating of each movie. The dataset has exactly 45,466 rows and 20 columns, which are divided into two parts.


###  The features are
* **belongs_to_collection:** A stringified dictionary that gives information on the movie series the particular film belongs to.
* **budget:** The budget of the movie in dollars.
* **genres:** A stringified list of dictionaries that list out all the genres associated with the movie.
* **homepage:** The official homepage of the movie.
* **id:** The ID of the movie.
* **original_language:** The language in which the movie was originally shot.
* **original_title:** The original title of the movie.
* **overview:** A brief blurb of the movie.
* **popularity:** The Popularity Score assigned by IMDB.
* **production_companies:** A stringified list of production companies involved with the making of the movie.
* **production_countries:** A stringified list of countries where the movie was shot/produced in.
* **revenue:** The total revenue of the movie in dollars.
* **spoken_languages:** A stringified list of spoken languages in the film.
* **title:** The Official Title of the movie.
* **vote_average:** The average rating of the movie.
* **vote_count:** The number of votes by users, as counted by IMDB.
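
Several of the features above are described as "stringified" lists or dictionaries, meaning the cell holds text that merely looks like a Python object. The sketch below shows one way to turn such text back into real Python objects with `ast.literal_eval`. The two raw values are made up for illustration and are not taken from the dataset.

```python
import ast

# Made-up cell values, shaped like the stringified columns described above.
raw_genres = "[{'id': 1, 'name': 'Comedy'}, {'id': 2, 'name': 'Drama'}]"
raw_collection = "{'id': 100, 'name': 'Example Collection'}"

# ast.literal_eval parses Python literals (lists, dicts, strings, numbers)
# without executing arbitrary code.
genres = ast.literal_eval(raw_genres)
collection = ast.literal_eval(raw_collection)

# Once parsed, the values behave like normal Python objects.
genre_names = [genre["name"] for genre in genres]
print(type(raw_genres).__name__, "->", type(genres).__name__)
print(genre_names, "|", collection["name"])
```

Until a column is parsed this way, operations such as `len()` or `in` act on the characters of the string rather than on the items of the list.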



## Machine learning pipeline

![title](img/pipeLineML2.png)

**1. Steps to create a machine learning pipeline:**
* Data preparation:
  * Data Cleaning
    * Irrelevant data
    * Duplicated rows
    * Missing values
    
  * Feature Engineering
    * Create new features
    * Convert features to appropriate format
* Training the model
  * Choosing a model
  * Tuning its parameters
* Predicting
  * Predict on unseen data
  
**2. Regression Models:** Predict a numerical value given a set of features
![title](img/Reg.jpg)
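
The steps above can be sketched end to end on a tiny synthetic table. This is only an illustration under stated assumptions: the data is invented (the target is exactly `2 * x + 1`), and a plain linear regression stands in for the model, which is not the model used later in this tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic toy data (not from the movies dataset): target = 2 * x + 1.
# It contains one duplicated row and one row with a missing value.
raw = pd.DataFrame({
    "x": [0.0, 1.0, 1.0, 2.0, 3.0, np.nan],
    "target": [1.0, 3.0, 3.0, 5.0, 7.0, 4.0],
})

# Data preparation: remove duplicated rows, then rows with missing values.
clean = raw.drop_duplicates().dropna()

# Training: fit a regression model on the cleaned data.
model = LinearRegression().fit(clean[["x"]], clean["target"])

# Predicting: ask the model for a numerical value on unseen data.
unseen = pd.DataFrame({"x": [4.0]})
prediction = model.predict(unseen)
print(prediction)
```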

## A good fitting tradeoff

![title](img/biais.png)

## Use a validation set to avoid over/underfitting 

![title](img/train-test-split.png)
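
A minimal sketch of how a validation set exposes under- and overfitting, assuming synthetic data (a noisy sine curve) and single decision trees of different depths. None of the names or values below come from the movies dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve plus random noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)

# Hold out 25% of the rows as a validation set.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

results = {}
for depth in (1, 4, None):  # None lets the tree grow without a depth limit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_rmse = np.sqrt(mean_squared_error(y_tr, tree.predict(X_tr)))
    valid_rmse = np.sqrt(mean_squared_error(y_val, tree.predict(X_val)))
    results[depth] = (train_rmse, valid_rmse)
    print(f"max_depth={depth}: train RMSE={train_rmse:.3f}, validation RMSE={valid_rmse:.3f}")
```

A very shallow tree tends to have a high error on both sets (underfitting). An unlimited tree drives the training error close to zero while its validation error stays clearly higher (overfitting). The validation error is the number to watch when comparing settings.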

### **Note:** If you have to pre-process the data, don't forget to apply the same pre-processing to all the partitions: Train, Validation and Test.

## Decision trees

**Suppose you need a loan. How will the bank know whether you will pay it back? The bank has many profiles of people who borrowed money before, with data about their age, education, occupation and salary and, most importantly, whether they paid the money back. Using this data, we can teach the machine to find the patterns and get the answer.**

![input](img/loan.png)
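
The loan example can be sketched with a very small decision tree. The six applicant profiles below are invented purely for illustration, and the feature names `age` and `salary_k` are hypothetical. A classifier is used here because the answer is yes or no.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented toy profiles: [age, yearly salary in thousands].
X = [[22, 18], [48, 22], [30, 25], [26, 60], [45, 80], [33, 95]]
y = [0, 0, 0, 1, 1, 1]  # 1 = paid the loan back, 0 = did not

# A tree with a single split: it looks for the one question that best
# separates the people who paid back from those who did not.
tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# Print the learned rule in readable form.
print(export_text(tree, feature_names=["age", "salary_k"]))

# Ask the tree about a new, unseen applicant.
new_applicant = [[28, 70]]
decision = tree.predict(new_applicant)
print("Predicted to pay back:", bool(decision[0]))
```

In this toy data only the salary separates the two groups cleanly, so the learned rule is a salary threshold. Real data would need deeper trees and more features.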


## From trees to forest

![title](img/RForest.png)

A random forest (RF) consists of a large number of individual randomized decision trees.
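
For regression, scikit-learn's forest combines its trees by averaging their predictions. The sketch below checks this on synthetic data by comparing the forest's output with a manual average over the fitted trees, which are available in `estimators_`. The data is randomly generated and has nothing to do with the movies dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Predict with every individual tree, then average the results by hand.
X_new = rng.normal(size=(5, 3))
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction should match the manual average.
forest_prediction = forest.predict(X_new)
print(np.allclose(manual_average, forest_prediction))
```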

Now we are going to predict the average rating of each film using a random forest.


# Let's start

In [None]:
#Import all libraries

import pandas as pd                         #  easy-to-use data structures 
import numpy as np                          #  For multidimensional array objects
from math import sqrt

import matplotlib.pyplot as plt             #  To create plots
import seaborn as sns

from wordcloud import WordCloud, STOPWORDS  # word cloud generator
from IPython.display import HTML
from IPython.display import display

#Scikit learn library
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics

from sklearn.metrics import mean_squared_error, mean_absolute_error



# To run a cell, press the "Run" button or Shift + Enter

In [None]:
#Read the data
#The path to datasets
PATH = "data"


feature = pd.read_csv(PATH + '/Train.csv')    
target = pd.read_csv(PATH + '/Y_train.csv')   

test = pd.read_csv(PATH + '/Test.csv')      # The set for which we will predict the target

#For exploring text data
text = pd.read_csv(PATH + '/text.csv')


### Let's see what the data can tell us. This important step is known as exploratory data analysis (EDA).

The first 3 rows of the dataset are:

In [None]:
feature.head(3)

In [None]:
test.head(3)

In [None]:
plt.figure(figsize=(16,8))
plt.scatter(feature.revenue, target.vote_average)
plt.xlabel("Revenue")
plt.ylabel("Vote average")

## **PART 1: Preprocessing**


In [None]:
feature.info() #test.info()

### Data Cleaning:

**Irrelevant data:**
Data that is not actually needed and does not fit the context of the problem we are trying to solve.

In [None]:
feature = feature.drop(['belongs_to_collection'], axis=1)
test = test.drop(['belongs_to_collection'], axis=1)

### Handling missing values

**CASE 1**
If missing values in a column are rare and occur at random, as in "production_countries" or "popularity", the easiest and most straightforward solution is to drop the observations (rows) that have missing values. If most of a column's values are missing and occur at random, a typical decision is to drop the whole column.
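
Both decisions can be sketched on a small invented table. The 50% cut-off used to decide that a column is "mostly missing" is an assumption made for this example, not a rule from the tutorial.

```python
import numpy as np
import pandas as pd

# Invented toy table: "homepage" is mostly missing, "popularity" rarely.
toy = pd.DataFrame({
    "popularity": [1.2, np.nan, 3.4, 0.8, 2.1],
    "homepage": [np.nan, np.nan, np.nan, "http://example.com", np.nan],
    "revenue": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Fraction of missing values per column.
missing_share = toy.isnull().mean()

# Drop whole columns where more than half of the values are missing
# (the 0.5 threshold is an assumption for this sketch).
mostly_missing = missing_share[missing_share > 0.5].index
toy = toy.drop(columns=mostly_missing)

# Drop the few rows that still have a missing "popularity".
toy = toy.dropna(subset=["popularity"])
print(toy)
```

In this notebook `feature` and `target` are separate tables, so dropping rows from `feature` would require dropping the same rows from `target` to keep them aligned.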

In [None]:
feature.isnull().sum().sort_values(ascending = False)

In [None]:
test.isnull().sum().sort_values(ascending = False)

In [None]:
test['release_date']

In [None]:
#Percentage of missing values in "revenue"
missing_value = feature["revenue"].isnull().sum()
print(f'The percentage of missing values in "revenue" is {round((missing_value/feature.shape[0]) * 100)}%')  # rounded to the nearest integer


In [None]:
#Visualize the missing value
sns.heatmap(feature.isnull(), yticklabels = False, cbar = False, cmap = 'viridis')

**CASE 2**
Imputation means calculating the missing value from other observations, using statistics such as the mean or the median.
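
As an alternative sketch, scikit-learn's `SimpleImputer` can learn the fill value on the training partition and reuse it on the others. The toy values below are invented. The median is used here instead of the mean because it is less affected by a single very large value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented toy partitions with missing revenues and one extreme value.
train_toy = pd.DataFrame({"revenue": [10.0, np.nan, 30.0, 1000.0]})
test_toy = pd.DataFrame({"revenue": [np.nan, 50.0]})

imputer = SimpleImputer(strategy="median")

# Learn the median on the training data only, then fill the training data.
train_filled = imputer.fit_transform(train_toy[["revenue"]])

# Reuse the same learned median for the other partition.
test_filled = imputer.transform(test_toy[["revenue"]])

print("Learned fill value:", imputer.statistics_[0])
print(train_filled.ravel(), test_filled.ravel())
```

Learning the statistic on the training data and applying it everywhere keeps all partitions pre-processed in the same way.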


In [None]:
feature['revenue'] = feature['revenue'].fillna(feature['revenue'].mean())
test['revenue'] = test['revenue'].fillna(feature['revenue'].mean())

In [None]:
feature['genres'] = feature['genres'].replace([], np.nan)
test['genres'] = test['genres'].replace([], np.nan)
# The same can be done for the other features

## Feature engineering

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models. 

Can you turn some of the features into things that the algorithm can understand?
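
As one possible illustration, a date stored as text can be turned into numeric features. The sketch assumes a `release_date` column holding strings in `YYYY-MM-DD` format. The toy rows are invented, and the actual format in the dataset should be checked before reusing this.

```python
import pandas as pd

# Invented toy rows; the YYYY-MM-DD format is an assumption.
toy = pd.DataFrame({"release_date": ["1995-10-30", "2008-07-18", "not a date", None]})

# Parse the text into datetimes; values that cannot be parsed become NaT.
dates = pd.to_datetime(toy["release_date"], format="%Y-%m-%d", errors="coerce")

# Numeric features that a model can use directly.
toy["release_year"] = dates.dt.year
toy["release_month"] = dates.dt.month
print(toy)
```

Rows that could not be parsed end up with missing values in the new columns, so they still need one of the missing-value treatments before training.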

In [None]:
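#converted into number of genres
import ast  # needed here because genres is stored as a string

# Parse each stringified list first; len() on the raw string would count characters
feature['len_genres'] = feature['genres'].apply(lambda x: len(ast.literal_eval(x)))
test['len_genres'] = test['genres'].apply(lambda x: len(ast.literal_eval(x)))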

In [None]:
#Print the different genres

import ast
list2 = [ast.literal_eval(item) for item in feature['genres'].unique()]
def flatten(lst):
    for el in lst:
        if isinstance(el, list):
            yield from el
        else:
            yield el

list3 = flatten(list2)
list(set(list3))


In [None]:
# Parse the stringified genres column into Python lists
feature['genres_list'] = feature['genres'].apply(ast.literal_eval)
test['genres_list'] = test['genres'].apply(ast.literal_eval)

In [None]:
def is_comedy_in_the_list(x):
    if 'Comedy' in x :
        return 1
    return 0

In [None]:
# Let's add a binary column that indicates whether the movie is a comedy
feature['is_comedy'] = feature['genres_list'].apply(is_comedy_in_the_list)
test['is_comedy'] = test['genres_list'].apply(is_comedy_in_the_list)
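
The same idea extends to every genre at once. The sketch below uses scikit-learn's `MultiLabelBinarizer` on an invented toy column shaped like `genres_list`. The `is_<genre>` column names are a naming choice made for this example.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Invented toy rows shaped like the parsed genres_list column.
toy = pd.DataFrame({"genres_list": [["Comedy", "Drama"], ["Action"], []]})

# Learn the set of genres and build one 0/1 column per genre.
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(toy["genres_list"])

genre_flags = pd.DataFrame(
    encoded,
    columns=[f"is_{genre.lower()}" for genre in mlb.classes_],
    index=toy.index,
)
toy = pd.concat([toy, genre_flags], axis=1)
print(toy)
```

To keep the columns identical across partitions, one option is to call `fit_transform` on the training data and only `transform` on the validation and test data.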

In [None]:
#
#
# TODO: You are now ready to add your own features or to modify the existing ones 
#
#









### Select features that will be used to train the model

In [None]:
selected_columns = ['len_genres','revenue','popularity','is_comedy']

### Split the data randomly into training and validation sets

In [None]:
X_train, X_valid, Y_train, y_valid = train_test_split(feature[selected_columns], target,
                                                      test_size=0.25, random_state=42)

## Training the model

### Regression problem: what is the average rating of each movie?

Is it possible to predict the average vote of upcoming movies (unseen data)?

To model the problem, we are going to use a Random Forest regression model.

In [None]:
# Feel free to tweak these parameters to find a good trade-off between underfitting and overfitting
#
#
#
RF = RandomForestRegressor(bootstrap=True, #  whether each tree is trained on a bootstrap sample (drawn with replacement)
                           max_depth=None, #  max number of levels in each decision tree (None = no limit)
                           max_features=1.0, #  fraction of features considered for splitting a node (1.0 = all; replaces 'auto', which newer scikit-learn versions reject)
                           min_samples_leaf=1,  #  min number of data points allowed in a leaf node
                           min_samples_split=2, #  min number of data points a node must hold before it can be split
                           n_estimators=10 # number of trees in the forest
                           )

In [None]:
#Build a forest of trees from the training set
RF.fit(X_train,Y_train['vote_average']) 

In [None]:
prediction =  RF.predict(X_valid)

In [None]:
print('Root Mean Square Error:', np.round(sqrt(mean_squared_error(y_valid['vote_average'], prediction)),3))
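
The root mean square error is $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, where $y_i$ is the true rating and $\hat{y}_i$ the predicted one. The sketch below computes it by hand on four invented ratings and compares it with a naive baseline that always predicts the mean. In practice the baseline mean would be taken from the training target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented true and predicted ratings.
y_true = np.array([6.0, 7.5, 5.0, 8.0])
y_pred = np.array([6.5, 7.0, 5.5, 7.0])

# RMSE written out step by step: square the errors, average, take the root.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# The same value through scikit-learn.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

# Naive baseline: always predict the mean rating.
baseline = np.full_like(y_true, y_true.mean())
rmse_baseline = np.sqrt(mean_squared_error(y_true, baseline))

print(round(rmse_manual, 3), round(rmse_sklearn, 3), round(rmse_baseline, 3))
```

A model is only useful if its RMSE is clearly lower than that of such a baseline.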

## Congratulations  !
 
You have now created a machine learning regression model using sklearn.

### Please save and submit your results:

In [None]:
submission_prediction = RF.predict(test[selected_columns])

In [None]:
submission = pd.DataFrame({
        "id": test.id,
        "predictions": np.round(submission_prediction, 3) 
})
submission.to_csv(team_name +'.csv', index=False)


### Please submit your predictions file (team_name + '.csv', located in your working directory) at the following URL:

https://drive.google.com/drive/folders/1-doveGLiaPDGpAIzDM_JQ90TUXgZ5nX7?usp=sharing

## Appendix

### Are certain words more common in titles or overviews?
Some features, like "title" and "overview", are text based. A word cloud (also called a text cloud or tag cloud) helps highlight the important terms in text data: the more often a word appears in the text, the bigger and bolder it appears in the word cloud.
The only required input for generating a word cloud is the text; all other arguments are optional.
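
The counting idea behind a word cloud can be sketched with the standard library alone. The titles and the stop-word set below are invented for illustration, and the `wordcloud` package applies its own tokenization, so its counts may differ from this simplified version.

```python
from collections import Counter

# Invented toy titles and a tiny, hypothetical stop-word set.
toy_titles = ["The Silver Night", "Night of the Lantern", "Silver Man Standing"]
toy_stopwords = {"the", "of"}

# Lowercase, split on whitespace and skip the stop words.
words = [
    word
    for title in toy_titles
    for word in title.lower().split()
    if word not in toy_stopwords
]

# Words with higher counts would be drawn larger in the cloud.
counts = Counter(words)
print(counts.most_common(3))
```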


In [None]:
# Convert the "title" and "overview" features from object to string type.
# Keep in mind that all values in a given column should share one datatype,
# e.g., boolean, numeric, date, etc.

text['title'] = text['title'].astype('str')
text['overview'] = text['overview'].astype('str')


In [None]:
#create a corpus of words
title_corpus = ' '.join(text['title'])
overview_corpus = ' '.join(text['overview'])

In [None]:
overview_wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', height=2000, width=4000).generate(overview_corpus)
plt.figure(figsize=(16,8))
plt.imshow(overview_wordcloud)
plt.axis('off')
plt.show()

In [None]:
title_wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', height=2000, width=4000).generate(title_corpus)
plt.figure(figsize=(16,8))
plt.imshow(title_wordcloud)
plt.axis('off')
plt.show()