   # Movies Dataset Analysis

   ### Section 1: data cleaning and manipulation
   ### Section 2: data exploration
   ### Section 3: data pre-processing for classification
   ### Section 4: classification
   ---

 ## About the datasets:
 We are going to analyze two datasets, *tmdb_5000_credits* and *tmdb_5000_movies*.

 The first one contains **4803 observations** with the following columns:
  * movie_id
  * title
  * cast
  * crew

The second one contains 4803 observations with the following columns:
  * budget, genres, homepage, id, keywords, original_language
  * original_title, overview, popularity, production_companies
  * production_countries, release_date, revenue, runtime
  * spoken_languages, status, tagline, title, vote_average, vote_count

We will merge the two datasets to get the full information about the actors and directors of each movie.

We will try to find out which key factors affect the success of a movie.

 **In the end we will use them to predict if a movie is successful or not**.

In [None]:
from collections import defaultdict, Counter
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor
from sklearn.metrics import make_scorer, accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
import matplotlib.pyplot as plt
import seaborn as sns
import re
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

parameters = {'n_estimators': [4, 6, 9],
              # 'auto' was an alias of 'sqrt' for classifiers and is rejected by scikit-learn >= 1.3
              'max_features': ['log2', 'sqrt'],
              'criterion': ['entropy', 'gini'],
              'max_depth': [2, 3, 5, 10],
              'min_samples_split': [2, 3, 5],
              'min_samples_leaf': [1, 5, 8]
              }

def Checker(x):
    if type(x) is pd.DataFrame:
        return 0
    elif type(x) is pd.Series:
        return 1
    else:
        return -1


def dtype(data):
    what = Checker(data)
    if what == 0:
        dtypes = data.dtypes.astype('str')
        dtypes = dtypes.str.split(r'\d').str[0]
    else:
        dtypes = str(data.dtypes)
        dtypes = re.split(r'\d', dtypes)[0]
    return dtypes


def split(x, pattern):
    r'''Regex pattern finds in data and returns match. Then, it is split accordingly.
        \d = digits
        \l = lowercase alphabet
        \p = uppercase alphabet
        \a = all alphabet
        \s = symbols and punctuation
        \e = end of sentence
        '''
    # Raw strings matter here: in a normal string '\a' is the BEL character
    pattern2 = (pattern.replace(r'\d', '[0-9]').replace(r'\l', '[a-z]')
                .replace(r'\p', '[A-Z]').replace(r'\a', '[a-zA-Z]')
                .replace(r'\s', '[^0-9a-zA-Z]').replace(r'\e', r'(?:\s|$)'))

    if dtype(x) != 'object':
        print('Data is not string. Convert first')
        return False

    if pattern == pattern2:
        # No custom shorthand was used: let pandas split on the pattern directly
        return x.str.split(pattern)
    # Otherwise split on the expanded pattern
    regex = re.compile(pattern2)
    return x.apply(lambda i: re.split(regex, i))


def replace(x, pattern, with_=None):
    r'''Regex pattern finds in data and returns match. Then, it is replaced accordingly.
        \d = digits
        \l = lowercase alphabet
        \p = uppercase alphabet
        \a = all alphabet
        \s = symbols and punctuation
        \e = end of sentence
        '''
    if type(pattern) is list:
        # A list of (old, new) pairs is applied as a plain value mapping
        d = {}
        for l in pattern:
            d[l[0]] = l[1]
        try:
            return x.replace(d)
        except Exception:
            return x.astype('str').replace(d)

    # Raw strings matter here: in a normal string '\a' is the BEL character
    pattern2 = (pattern.replace(r'\d', '[0-9]').replace(r'\l', '[a-z]')
                .replace(r'\p', '[A-Z]').replace(r'\a', '[a-zA-Z]')
                .replace(r'\s', '[^0-9a-zA-Z]').replace(r'\e', r'(?:\s|$)'))

    if dtype(x) != 'object':
        print('Data is not string. Convert first')
        return False

    if pattern == pattern2:
        # No custom shorthand was used: let pandas apply the regex directly
        return x.str.replace(pattern, with_, regex=True)
    # Otherwise substitute with the expanded pattern
    regex = re.compile(pattern2)
    return x.apply(lambda i: re.sub(regex, with_, i))


def hcat(*columns):
    cols = []
    for c in columns:
        if c is None:
            continue
        if type(c) in (list, tuple):
            for i in c:
                if type(i) not in (pd.DataFrame, pd.Series):
                    cols.append(pd.Series(i))
                else:
                    cols.append(i)
        elif type(c) not in (pd.DataFrame, pd.Series):
            cols.append(pd.Series(c))
        else:
            cols.append(c)
    # axis must be passed by keyword in recent pandas versions
    return pd.concat(cols, axis=1)


def parse_col_json(df, column, key, nested):
    """
    Args:
        column: string
            name of the column to be processed.
        key: string
            name of the dictionary key which needs to be extracted
    """
    import json
    for index, i in zip(df.index, df[column].apply(json.loads)):
        list1 = []
        males = []
        females = []

        for j in range(len(i)):
            if nested:
                if not(((i[j]["department"] == "Directing") and (i[j]["job"] == "Director"))):
                    continue
            name = i[j][key]
            if "," in name:
                name = name.replace(",", " ")
            if " " in name:
                name = name.replace(" ", "_")
            list1.append(name)
            if column=="cast":
                if i[j]["gender"] == 1:
                    females.append(name)
                elif i[j]["gender"] == 2:
                    males.append(name)
        df.loc[index, column] = str(list1)
        if column=="cast":
            df.loc[index, "actors"] = str(males)
            df.loc[index, "actress"] = str(females)


def counts_elements(df, columns):
    import ast
    d = defaultdict(Counter)
    for column in columns:
        for el in df[column]:
            # literal_eval parses the stringified list without executing arbitrary code
            l = ast.literal_eval(str(el))
            for x in l:
                d[column][x] += 1
    return d


def counts_vectorized(df, col, min=1, vocabulary=None):
    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer(tokenizer=lambda x: x.split(
        ","), min_df=min, vocabulary=vocabulary)
    #analyze = vectorizer.build_analyzer()
    #f = analyze(succ_movies.cast.iloc[0].strip("[]"))
    data = [x.strip("[]") for x in df[col]]
    #analyze(["ciao, mamma, come, stai_oggi", "ehi, mamma, stai_oggi, cane"])
    vectorizer.fit(data)
    counts = pd.DataFrame(vectorizer.transform(data).toarray())
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    counts.columns = [x.replace("'", "")
                      for x in vectorizer.get_feature_names_out()]
    return counts


def simplify(df, col, bins, group_names):
    df[col] = df[col].fillna(-0.5)
    categories = pd.cut(df[col], bins, labels=group_names)
    df[col] = categories


def encode_features(df):
    features = ['year', 'runtime']

    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df[feature])
        df[feature] = le.transform(df[feature])
    return df


def testClassifier(clf, name, dict):
    y_pred = []
    if name == "Gradient Boosting":
        y_pred = clf.fit(X_train, y_train, early_stopping_rounds=5,
             eval_set=[(X_test, y_test)], verbose=False).predict(X_test)
        y_pred = [round(value) for value in y_pred]
    else:
        clf = clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        
    # Compute confusion matrix
    CM = confusion_matrix(y_test, y_pred)

    TN = CM[0][0]
    FN = CM[1][0]
    TP = CM[1][1]
    FP = CM[0][1]

    # Sensitivity, hit rate, recall, or true positive rate
    TPR = TP/(TP+FN)
    # Specificity or true negative rate
    TNR = TN/(TN+FP) 
    # Precision or positive predictive value
    PPV = TP/(TP+FP)
    # Negative predictive value
    NPV = TN/(TN+FN)
    # Fall out or false positive rate
    FPR = FP/(FP+TN)
    # False negative rate
    FNR = FN/(TP+FN)
    # False discovery rate
    FDR = FP/(TP+FP)
    # Overall accuracy
    ACC = (TP+TN)/(TP+FP+FN+TN)
    print("{} Scores:\n".format(name))
    # Fall out is the false positive rate, FPR = FP/(FP+TN)
    print("Accuracy: {0:.2f} %\nPrecision: {1:.2f} %\nRecall: {2:.2f} %\nFall out: {3:.2f} %\nFalse Negative Rate: {4:.2f} %\n\n"
        .format(ACC.round(4)*100.0,PPV.round(4)*100.0,TPR.round(4)*100.0,FPR.round(4)*100.0,FNR.round(4)*100.0))

    dict["classifier"].append(name)
    dict["accuracy"].append(ACC.round(4)*100.0)
    dict["fallout"].append(FPR.round(4)*100.0)
    dict["fnr"].append(FNR.round(4)*100.0)
    return clf

   # Section 1: data cleaning and manipulation
   ### Removing duplicates, useless columns, filtering records, and parsing json

In [None]:
movies = pd.read_csv(
    '../input/tmdb_5000_movies.csv', index_col=3)
credits = pd.read_csv(
    '../input/tmdb_5000_credits.csv', index_col=0)

movies.drop_duplicates(keep="first", inplace=True)
useless_col = ['homepage', 'original_title', 'original_language', 'overview',
               'spoken_languages', 'keywords', 'status', 'tagline']
movies.drop(useless_col, axis=1, inplace=True)

credits.drop(["title"], axis=1, inplace=True)

# Split the year from the date
movies.release_date = pd.to_datetime(movies['release_date'])
movies["year"] = movies.release_date.dt.year

# Columns in which a zero means "missing" and which are later cast to integers
change_cols = ['budget', 'revenue',"popularity", "runtime", "vote_average", "year"]

# replace all the zeros in these columns with NaN
movies[change_cols] = movies[change_cols].replace(0, np.nan)
# drop all the rows with a missing value in any of these columns
movies.dropna(subset=change_cols, inplace=True)
# keep only records with budget and revenue above one million
budget_filter = movies['budget'] > 1e6
revenue_filter = movies['revenue'] > 1e6
movies = movies[budget_filter & revenue_filter]

movies = movies.join(credits)

parse_col_json(movies, "crew", "name", True)
movies.rename(columns={'crew': 'directors'}, inplace=True)
parse_col_json(movies, 'genres', 'name', False)
parse_col_json(movies, 'production_companies', 'name', False)
parse_col_json(movies, 'cast', 'name', False)



# changing data type
movies[change_cols] = movies[change_cols].astype(np.int64)

#movies['profit'] = movies['revenue'] - movies['budget']

# list-valued columns, useful for later manipulation
list_columns = ["directors", "genres","production_companies",
                "cast", "actors", "actress"]

movies = movies[['title', 'genres', 'directors', 'production_companies',
                 'year', 'cast', 'actors','actress', 'runtime', 'budget',
                 'revenue', 'popularity','vote_average']]



In [None]:
movies = movies.set_index("title")
movies = movies.reset_index()
movies.head(3)



In [None]:
summary = counts_elements(movies, list_columns)
directors = pd.Series(summary["directors"], name="movies")
genres = pd.Series(summary["genres"], name="movies")
p_companies = pd.Series(summary["production_companies"], name="movies")
cast = pd.Series(summary["cast"], name="movies")

actors = pd.Series(summary["actors"], name="movies")
actress = pd.Series(summary["actress"], name="movies")


In [None]:
df1 = pd.DataFrame({ 'gender' : ["male" for _ in range(actors.size)]})
df1["name"] = actors.index
df1["movies"] = actors.values
df1.drop(df1.tail(1).index,inplace=True) # drop the last row

df2 = pd.DataFrame({ 'gender' : ["female" for _ in range(actress.size)]})
df2["name"] = actress.index
df2["movies"] = actress.values
df2.drop(df2.tail(1).index,inplace=True)

castDf = pd.concat([df1, df2], ignore_index=True)
castDf = castDf.set_index("name")
castDf["movies"] = castDf["movies"].map(np.int64)

 ## [How much money does a movie need to make to be profitable?](https://io9.gizmodo.com/how-much-money-does-a-movie-need-to-make-to-be-profitab-5747305)

 >  a rule of thumb seems to be that the film needs to make twice its production budget globally

In [None]:
# create new df based on a definition of successful movie:
# "it is successful when it earns more than double what it cost"
succ_movies = movies[movies["revenue"] > 2 * movies["budget"]]
succ_summary = counts_elements(succ_movies, list_columns)

succ_directors = pd.Series(succ_summary["directors"], name="succ_movies")
succ_genres = pd.Series(succ_summary["genres"], name="succ_movies")
succ_p_companies = pd.Series(
    succ_summary["production_companies"], name="succ_movies")
succ_cast = pd.Series(succ_summary["cast"], name="succ_movies")

succ_actors = pd.Series(succ_summary["actors"], name="succ_movies")
succ_actress = pd.Series(succ_summary["actress"], name="succ_movies")


In [None]:
df1 = pd.DataFrame(index = succ_actors.index)
df1["succ_movies"] = succ_actors.values
df1.drop(df1.tail(1).index,inplace=True)

df2 = pd.DataFrame(index = succ_actress.index)
df2["succ_movies"] = succ_actress.values
df2.drop(df2.tail(1).index,inplace=True)

succ_castDf = pd.concat([df1, df2], ignore_index=False)
succ_castDf["succ_movies"] = succ_castDf["succ_movies"].map(np.int64)
castDf = castDf.join(succ_castDf)

In [None]:
# defining target variable
movies["success"] = movies.title.isin(succ_movies.title)


   ## Our question is:
   **Which are the most important factors that make a movie successful?**

   And, of course, can we predict in advance whether a movie will be successful?

   ### Let's start the investigation!

   # Section 2: data exploration
   ### looking for correlations between variables, finding new *successful* movies and comparing them

   To investigate the pairwise correlation between two variables $X$ and $Y$, we use the Pearson correlation. Let $\sigma_X,\sigma_Y$ be the standard deviations of $X,Y$ and $cov(X,Y)=E[(X-E[X])(Y-E[Y])]$.

   Then the Pearson correlation is defined as:


   $$\rho_{X,Y} = \frac {cov(X,Y)}{\sigma_X\sigma_Y}$$


   It takes values between $-1$ and $+1$, where $+1$ is total positive linear correlation, $0$ is no linear correlation, and $-1$ is total negative linear correlation.
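
As a small illustration of the definition above, the sketch below computes the Pearson correlation by hand and compares it with the value pandas returns. The `pearson` helper and the toy budget/revenue numbers are made up for this example and are not taken from the TMDB data.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

# Made-up toy values (in millions), not taken from the TMDB data
toy = pd.DataFrame({"budget": [10, 20, 30, 40, 50],
                    "revenue": [25, 35, 80, 90, 130]})

rho_manual = pearson(toy["budget"], toy["revenue"])
rho_pandas = toy["budget"].corr(toy["revenue"])  # pandas uses Pearson by default
print(rho_manual, rho_pandas)
```

Both numbers should agree up to floating point error, since the normalization of the covariance and of the standard deviations cancels out in the ratio.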

In [None]:
sns.pairplot(movies,hue="success",vars=["year","runtime","budget","revenue","popularity","vote_average"])
plt.show()



In [None]:
# let's see some correlations
sns.set(rc={'figure.figsize': (12, 10)})
# numeric_only=True skips the string columns (required with pandas >= 2.0)
sns.heatmap(movies.corr(numeric_only=True), annot=True)
plt.show()



In [None]:
sns.heatmap(succ_movies.corr(numeric_only=True), annot=True)
# older movies had lower runtime
# budget slightly increased across the years
# longer movies could have higher votes
# higher budget often means higher revenue (and perhaps popularity)
plt.show()



In [None]:
fig, axes = plt.subplots(nrows=2, figsize=(5, 5))
directors.nlargest(5).plot.barh(color="#99e699", legend=True, ax=axes[0])
succ_directors.nlargest(5).plot.barh(color="#1a8cff", legend=True, ax=axes[1])
plt.show()



In [None]:
genres.plot.barh(color="#b3ecff", legend=True)
succ_genres.plot.barh(color="#1a8cff", legend=True)
plt.show()



In [None]:
p_companies.nlargest(5).plot.barh(color="#b3ecff", legend=True)
succ_p_companies.nlargest(5).plot.barh(color="#1a8cff", legend=True)
plt.show()


In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(10, 5),constrained_layout=True)
castDf.movies.nlargest(10).plot.barh(color="#99e699", legend=True, ax=axes[0])
castDf.succ_movies.nlargest(10).plot.barh(color="#1a8cff", legend=True, ax=axes[1])
axes[0].set_xlim([20, 55])
axes[1].set_xlim([20, 55])
plt.show()

### Women only:

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(10, 5),constrained_layout=True)
castDf[castDf.gender=="female"].movies.nlargest(10).plot.barh(color="tomato", legend=True, ax=axes[0])
castDf[castDf.gender=="female"].succ_movies.nlargest(10).plot.barh(color="purple", legend=True, ax=axes[1])
axes[0].set_xlim([10, 35])
axes[1].set_xlim([10, 35])
plt.show()

   # Section 3: data pre-processing for classification
   ### applying "bag of words" technique to list columns, binning and encoding categorical labels

   ## Bag of Words
   Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.

   In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:

   *     **tokenizing** strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.
   *     **counting** the occurrences of tokens in each document.
   *     **normalizing** and weighting with diminishing importance tokens that occur in the majority of samples / documents.

   A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

   We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors.
   This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or “Bag of n-grams” representation.
   Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
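
The idea can be sketched on a tiny example. Here each "document" is a comma-separated cast list, similar in spirit to how this notebook treats its list columns; the three cast lists are hypothetical and only serve as an illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical "documents": comma-separated cast lists, one per movie
docs = ["Tom_Hanks,Meg_Ryan",
        "Tom_Hanks,Tim_Allen",
        "Meg_Ryan,Billy_Crystal"]

# Split on commas so that each name is one token; lowercase=False keeps names as written
vectorizer = CountVectorizer(tokenizer=lambda s: s.split(","), lowercase=False)
counts = vectorizer.fit_transform(docs).toarray()

print(sorted(vectorizer.vocabulary_))  # the tokens, one per column
print(counts)                          # one row per document
```

Each row counts how often every name occurs in that movie, and the order of the names inside a cast list plays no role.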

In [None]:
cast_counts = counts_vectorized(movies, "cast", 20)
dir_counts = counts_vectorized(movies, "directors", 10)
gen_counts = counts_vectorized(movies, "genres", 100)
companies_counts = counts_vectorized(movies, "production_companies", 50)



In [None]:
X=[]
# concatenate vectors count
X = hcat(movies, cast_counts, dir_counts,
         gen_counts, companies_counts)



In [None]:
X = X.set_index("title")
X = X.drop(columns=["cast", "directors", "genres", "budget","revenue","popularity","vote_average",
                    "production_companies", "actors","actress"])
X.head()

   ## Binning
   Data binning (also called discrete binning or bucketing) is a data pre-processing technique used to reduce the effects of minor observation errors. The original data values that fall in a given small interval, a bin, are replaced by a value representative of that interval, often the central value. It is a form of quantization: the process of transforming numerical variables into categorical counterparts.
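
A minimal sketch of binning with `pd.cut`, using the same runtime bin edges and labels that this notebook applies later; the runtime values themselves are made up.

```python
import pandas as pd

runtimes = pd.Series([65, 88, 95, 130, 200])  # minutes, made-up values
bins = (1, 72, 89, 120, 180, 250)             # bin edges
labels = ["XS", "S", "M", "L", "XL"]

# pd.cut uses right-closed intervals by default: (1, 72], (72, 89], ...
binned = pd.cut(runtimes, bins, labels=labels)
print(binned.tolist())  # ['XS', 'S', 'M', 'L', 'XL']
```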

   ## Label Encoder
   It is used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.
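
A short sketch of what the label encoder does, on made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

sizes = ["M", "XL", "S", "M", "L"]   # made-up categorical labels
le = LabelEncoder()
codes = le.fit_transform(sizes)

print(list(le.classes_))                       # classes are stored in sorted order
print(codes.tolist())                          # integer code of each label
print(le.inverse_transform(codes).tolist())    # back to the original labels
```

Note that the integer codes follow the sorted (alphabetical) order of the labels, not any natural order such as size, so they should not be read as an ordinal scale.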

In [None]:
simplify(X, "year", (1, 1998, 2005, 2016), ["old", "contemporary", "modern"])
simplify(X, "runtime", (1, 72, 89, 120, 180, 250), ["XS", "S", "M", "L", "XL"])
#simplify(X, "popularity", (-1, 11, 21, 876), ["Low", "Medium", "High"])
#simplify(X, "vote_average", (0, 4, 6, 10), ["low", "normal", "high"])

y = X.success
X = X.drop(columns="success")
# for one hot encoding
X_h = pd.get_dummies(X)
# label encoder
X_l = encode_features(X)


   # Section 4: classification
   ## Tested Classifiers:
    * SVM with radial kernel
    * K-Nearest Neighbors
    * Logistic Regression
    * Decision Tree
    * Random Forest
    * Gradient Boost Regressor

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_l, y, test_size=0.2, random_state=1)
mc_l = defaultdict(list)


   ## Support Vector Classifier:
   ### using grid search cross validation in order to find the best parameters

  Basic ideas behind support vector machines:

  – **Hard margin classifier**: find the optimal hyperplane that maximizes the margin (the distance between the two classes) for linearly separable patterns

  – **Soft margin classifier**: introduce some tolerance in the boundary in order to improve performance

  – **The Kernel trick**: extend to patterns that are not linearly separable by transforming the original data into a new space

  ---
  ### Hard margin classifier

   In order to define the optimal separating hyperplane we define two planes:

   $H_1: w\cdot x_i+b=+1$


   $H_2: w\cdot x_i+b=-1$


   Furthermore, there is a plane in between:
   $H_0: w\cdot x_i+b=0$

   We want a linear separator with the biggest margin possible.

   The distance between $H_0$ and $H_1$ is $\frac{1}{\|w\|}$, so we want to

   $$\underset{w,b}{\text{maximize}}\ {1\over{\|w\|}} \text{ subject to } y_i[\langle{x_i,w}\rangle + b]\ge 1 \quad \forall i$$

   Maximizing $1/\|w\|$ is equivalent to minimizing ${1\over 2}\|w\|^2$, so this becomes a quadratic programming problem that can be solved with optimization techniques (Lagrange multipliers are used to bring the problem into a more tractable form).

   Because the objective is quadratic, its surface is a paraboloid with just a single global minimum (thus avoiding the local minima problem we had with neural nets!)
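
The margin can be inspected numerically. The sketch below is only an approximation of the hard margin problem: scikit-learn's `SVC` always solves the soft margin version, so a very large `C` is assumed here to leave almost no room for violations. The six points are made up.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (made-up points)
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C leaves almost no room for margin violations,
# which approximates the hard margin problem
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
margin = 1.0 / np.linalg.norm(w)     # distance between H_0 and H_1
constraints = y * (X @ w + b)        # should all be >= 1 (up to solver tolerance)

print("w =", w, "b =", b)
print("margin =", margin)
print("y_i(w.x_i + b) =", constraints.round(3))
```

The points whose constraint value is (numerically) equal to 1 lie exactly on $H_1$ or $H_2$; these are the support vectors.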

  ---
  ### Soft margin classifier

  Sometimes the data contains errors or is simply not linearly separable, and we want to be less strict when separating the classes in order to obtain a better solution. So we are permissive for certain examples, allowing their classification by the separator to diverge from the real class.

  This can be obtained by adding to the problem so-called slack variables $\xi_i$, which represent the deviation of the examples from the margin.

  Doing this we obtain a *relaxed* **soft margin**.

  Introducing these slack variables into the original problem, we now have:

   $$ \text{minimize }{1\over{2}}\|w\|^2+C\sum\limits_{i=1}^{n}\xi_i \text{ subject to } y_i[\langle{x_i,w}\rangle + b]\ge 1-\xi_i, \quad \xi_i\ge 0 \quad \forall i $$

   Now we have an additional parameter $C$ that controls the tradeoff between the error and the margin.
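
To make the slack variables concrete, the following sketch fits a linear soft margin classifier on synthetic, overlapping data and recovers each $\xi_i$ from the fitted decision function. The data and the choice `C=1.0` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs, so no hyperplane separates them perfectly (synthetic data)
X, y01 = make_blobs(n_samples=200, centers=[[0, 0], [2, 2]],
                    cluster_std=1.5, random_state=0)
y = 2 * y01 - 1                      # relabel the classes as -1 / +1

clf = SVC(kernel="linear", C=1.0).fit(X, y)
f = clf.decision_function(X)         # f(x_i) = <x_i, w> + b

# Slack of each example: xi_i = max(0, 1 - y_i f(x_i))
slack = np.maximum(0.0, 1.0 - y * f)

print("inside the margin or misclassified:", int((slack > 0).sum()))
print("misclassified (xi_i > 1):", int((slack > 1).sum()))
print("sum of slacks:", slack.sum().round(2))
```

Examples with $\xi_i = 0$ respect the margin, those with $0 < \xi_i \le 1$ fall inside the margin but on the correct side, and those with $\xi_i > 1$ are misclassified.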

  ---
  ### The Kernel trick

  To obtain better performance we need non-linear boundaries. The idea is to transform the data from the input space (the original attributes of the examples) to a higher dimensional space, called the **feature space**, using a function $\phi(x)$. The advantage of the transformation is that linear operations in the feature space are equivalent to non-linear operations in the input space.

  Working directly in the feature space can be costly because we have to explicitly create the feature space and operate in it. Furthermore, we may need infinitely many features to make a problem linearly separable.

  So we use what is called the kernel trick:

  In the problem that defines an SVM only the inner products of the examples are needed, so if we can define how to compute this product in the feature space, it is not necessary to build that space explicitly. We just need to define a kernel function $$K \text{  such that  } K(x_i,x_j)=\phi(x_i)\cdot\phi(x_j)$$

  Then we do not need to know or compute $\phi$ at all! The kernel function defines inner products in the transformed space; in practice it defines a measure of similarity in the transformed space.

   **Radial Kernel** $$ K(x_i,x_{i'}) = \exp(-\gamma \sum\limits_{j=1}^{p}(x_{ij}-x_{i'j})^2) $$
   Thanks to kernels we can implicitly map data that is not linearly separable into a space where it becomes separable.
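
The radial kernel formula can be checked directly. This sketch computes the kernel matrix entry by entry from the formula above and compares it with scikit-learn's `rbf_kernel`; the helper `radial_kernel`, the random data and `gamma = 0.5` are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def radial_kernel(a, b, gamma):
    """K(a, b) = exp(-gamma * sum_j (a_j - b_j)^2) for two vectors."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))    # 4 made-up examples with p = 3 features
gamma = 0.5

# Kernel (Gram) matrix computed entry by entry from the formula
K_manual = np.array([[radial_kernel(xi, xj, gamma) for xj in X] for xi in X])
# Same matrix from scikit-learn
K_sklearn = rbf_kernel(X, gamma=gamma)

print(np.allclose(K_manual, K_sklearn))
```

Every diagonal entry equals 1 (each point is maximally similar to itself), and the similarity decays towards 0 as two points move apart, faster for larger `gamma`.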

   The hyperparameters of an SVM include the following ones, which can be passed to the SVC of sklearn.svm:

   * C: the inverse of the regularization strength
   * kernel: the kernel used (default='rbf')
   * gamma: the kernel coefficient; the higher the gamma value, the more closely the model tries to fit the training data set exactly

  ## Grid Search:

  Now that we have seen the meaning of the hyperparameters and the different ways to build our hyperplane, the big question is **"How do we choose the hyperparameters?"**

  We cannot choose the optimal values based on our test set; in real problems we won't have access to it!

  The proper way to evaluate a method is to split the data into 3 parts (training, validation and test set):

  * We choose some parameters and train our model on the training set.
  * We then evaluate its performance on the validation set.
  * Once we find the parameters which work best on the validation set, we apply the same model to the test set.

  This gives an honest estimate of the accuracy we will get
  on unseen data.
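
The three steps above can be sketched as follows. The synthetic data, the 60/20/20 split and the candidate values of `C` are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data standing in for the real features
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 60% training, 20% validation, 20% test
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Pick the C that scores best on the validation set
best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 1, 10, 100]:
    acc = SVC(kernel="rbf", C=C).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_C, best_acc = C, acc

# Only now touch the test set, once, with the chosen parameter
final = SVC(kernel="rbf", C=best_C).fit(X_train, y_train)
test_acc = final.score(X_test, y_test)
print("best C:", best_C, "validation acc:", best_acc, "test acc:", test_acc)
```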

  If there are two (or more) parameters to optimize, it can be tricky to find the best values, so we must evaluate all the possible combinations of those parameters. In this case, we put all the results in a table (the grid), with $\gamma$ in the columns and $C$ in the rows, and select the pair of $C$ and $\gamma$ with the best score.

  Luckily, **GridSearchCV** from the sklearn package handles this for us (with additional features such as stratified cross-validation)!

  ## K-Fold Cross Validation:

  In this method we randomly split the training set into k folds of approximately the same size. The first fold is used as the validation set and the model is fit on the remaining k-1 folds. The procedure is repeated k times; each time a different fold is selected as the validation set, producing a different value of the MSE. The final estimate is the weighted average of these values:

  $$CV_{(K)} = \sum\limits_{k=1}^K {n_k\over{n}}MSE_k $$

  where $n_k$ is the number of examples in fold $k$ and $n$ is the total number of examples.

  This approach is very useful for two main reasons (compared with other evaluation techniques such as LOOCV):
  - It is computationally feasible.
  - Its estimate has lower variance.

  ### Mean square error
  MSE is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:
  $$MSE = {1\over{N}}\sum\limits_{(x,y)\in D}(y-\text{prediction}(x))^2$$
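
The two formulas above can be combined in a few lines. This is a sketch on a synthetic regression problem (a noisy line), using a plain linear model as an arbitrary stand-in for the estimator being evaluated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def mse(y_true, y_pred):
    """Average squared loss over the examples."""
    return np.mean((y_true - y_pred) ** 2)

# Synthetic regression data: y = 3x + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + rng.normal(scale=1.0, size=100)

K = 5
fold_mse, fold_size = [], []
for train_idx, val_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mse.append(mse(y[val_idx], model.predict(X[val_idx])))
    fold_size.append(len(val_idx))

# CV_(K) = sum_k (n_k / n) * MSE_k
cv_k = sum(n_k / len(X) * m for n_k, m in zip(fold_size, fold_mse))
print("per-fold MSE:", np.round(fold_mse, 3))
print("CV estimate:", round(cv_k, 3))
```

When all folds have the same size, as here, the weighted sum reduces to the plain average of the per-fold MSE values.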

In [None]:
# generate lists of parameters for GridSearch
C_list = [1e-4 * 10**p for p in range(1, 8)]
gamma_l = [0.1**n for n in range(1, 8)]
param_grid = {'C': C_list, 'gamma': gamma_l, 'kernel': ['rbf']}
# Make grid search classifier
clf = GridSearchCV(SVC(), param_grid, verbose=0, cv=5).fit(X_train, y_train)
C = clf.best_params_['C']
gamma = clf.best_params_['gamma']
testClassifier(SVC(kernel='rbf', C=C, gamma=gamma), "Optimized RBF SVM",mc_l)


   ## K-Nearest Neighbors Classifier
   The k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression.

   In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small).

   If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

   The hyperparameters of KNN include the following ones, which can be passed to the KNeighborsClassifier of sklearn.neighbors:

   * n_neighbors: corresponds to K, the number of nearest neighbors considered for the prediction (default=5)
   * weights:
       * if uniform, then all neighbors have the same weight for the voting (default)
       * if distance, then the votes of the neighbors are weighted by the inverse of the distance for the voting
   * p: the power parameter for the Minkowski metric (default=2)
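
The voting rule described above can be written from scratch in a few lines. This is a minimal sketch with uniform weights and Euclidean distance, not scikit-learn's implementation; the helper `knn_predict` and the six training points are made up.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by a plurality vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)     # Euclidean distance (Minkowski, p=2)
    nearest = np.argsort(dists)[:k]                 # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)    # uniform weights: one vote each
    return votes.most_common(1)[0][0]

# Made-up training points: class 0 near the origin, class 1 near (5, 5)
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.2, 0.2])))  # 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # 1
```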

In [None]:
testClassifier(KNeighborsClassifier(), "K-Nearest Neighbors",mc_l)


   ## Logistic Regression
   Logistic regression allows us to use regression as a classifier.
   It models the probabilities for classification problems with two possible outcomes.
   It’s an extension of the linear regression model for classification problems.
   Instead of fitting a straight line or hyperplane, the logistic regression model uses the logistic (sigmoid) function to squeeze the output of a linear equation between 0 and 1.

   **Sigmoid**: $g(h) = \frac{1}{1+e^{-h}} $

   We define $x_i$ as the n-dimensional feature vector of a given sample and $\beta_0,\beta=(\beta_1,...,\beta_n)^T$ as the model parameters. Then the logistic regression model is defined as:

   $$P(Y=1|x_i)={{\exp(\beta_0+x^T_i\beta)}\over{1+\exp(\beta_0+x^{T}_i\beta)}}={1\over{1+\exp(-(\beta_0+x^T_i\beta))}}$$

   The interpretation of the weights in logistic regression differs from that in linear regression, since the outcome in logistic regression is a probability between 0 and 1. The weights no longer influence the probability linearly: the weighted sum is transformed by the logistic function into a probability. So
  the logistic regression model is not only a classification model, but also gives you probabilities.
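
The model equation can be verified against a fitted scikit-learn model. In this sketch the data is synthetic and `sigmoid` is a helper defined only for the example; it recomputes $P(Y=1|x_i)$ from the fitted intercept and coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(h):
    """Logistic function g(h) = 1 / (1 + exp(-h))."""
    return 1.0 / (1.0 + np.exp(-h))

# Synthetic two-class data standing in for the real features
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

beta_0 = model.intercept_[0]
beta = model.coef_[0]

# P(Y=1 | x_i) = sigmoid(beta_0 + x_i^T beta)
p_manual = sigmoid(beta_0 + X @ beta)
p_sklearn = model.predict_proba(X)[:, 1]
print(np.allclose(p_manual, p_sklearn))

# Thresholding the probability at 0.5 turns the model into a classifier
y_pred = (p_manual > 0.5).astype(int)
print((y_pred == model.predict(X)).all())
```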

  Note that both linear regression and logistic regression give you a straight line (or a higher order polynomial), but those lines have different meanings:
  - h(x) for linear regression interpolates, or extrapolates, the output and predicts the value for an x we haven't seen. It is like plugging in a new x and getting a raw number, which suits tasks such as predicting a car price based on {car size, car age}.

  - h(x) for **logistic regression** tells you the probability that x belongs to the "positive" class. This is why it is called a regression algorithm: it estimates a continuous quantity, the probability. However, if you set a threshold on the probability, such as $h(x)>0.5$, you obtain a classifier, and in many cases this is what is done with the output of a logistic regression model. This is equivalent to drawing a line on the plot: all points on one side of the line belong to one class, while the points on the other side belong to the other class.


   The hyperparameters of a logistic regression include the following ones, which can be passed to the LogisticRegression of sklearn.linear_model:

   * penalty: the norm used for penalization (default='l2')
   * C: the inverse of the regularization strength (default=1.0)

In [None]:
testClassifier(LogisticRegression(), "Logistic Regression",mc_l)


   ## Decision Tree Classifier
   A decision tree for classification consists of several splits that determine, for an input sample, the predicted class, which is a leaf node in the tree. Decision trees are built with a greedy algorithm: an optimal tree exists in theory, but finding it is NP-hard because the number of possible partitions grows factorially. Specifically, a greedy top-down approach is used that chooses, at each step, the variable that best splits the set of items. Different metrics can be used to measure the "best" split; they generally measure the homogeneity of the target variable within the subsets.

   For this analysis we consider the following two metrics:

   * Gini impurity: Let $j$ be the number of classes and $p_i$ the fraction of items of class $i$ in a subset $p$, for $i\in\{1,2,...,j\}$. Then the Gini impurity is defined as follows:
   $$I_G(p)=1-\sum\limits_{i=1}^{j}p_i^2$$
   * Information gain: It measures the reduction in entropy when applying the split. The entropy is defined as $$H(p)=-\sum\limits_{i=1}^{j}p_i\log_2 p_i$$

   Then we define the information gain of splitting the $n$ samples of parent node $p$ into $k$ partitions, where $n_i$ is the number of samples in partition $i$, as $$IG=H(p)-\sum\limits_{i=1}^{k}{n_i\over{n}}H(i)$$
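
The two metrics are easy to compute by hand. The helper functions below are written for this illustration only (they are not part of scikit-learn), and the label arrays are made up.

```python
import numpy as np

def class_fractions(labels):
    """Fraction p_i of items belonging to each class present in the subset."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini(labels):
    p = class_fractions(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    p = class_fractions(labels)
    return -np.sum(p * np.log2(p))

def information_gain(parent, partitions):
    """Entropy of the parent minus the size-weighted entropy of its partitions."""
    n = len(parent)
    return entropy(parent) - sum(len(part) / n * entropy(part) for part in partitions)

parent = np.array([0, 0, 1, 1])   # made-up labels: two classes, perfectly mixed
print(gini(parent))                                            # 0.5
print(entropy(parent))                                         # 1.0
print(information_gain(parent, [parent[:2], parent[2:]]))      # 1.0 (pure children)
print(information_gain(parent, [parent[::2], parent[1::2]]))   # 0.0 (children as mixed as the parent)
```

A split that produces pure children recovers the full entropy of the parent, while a split whose children are as mixed as the parent gains nothing.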

   The hyperparameters of a Decision Tree include the following ones, which can be passed to the DecisionTreeClassifier of sklearn.tree:

   * criterion: the criterion which decides the feature and the value at the split (default='gini')
   * max_depth: the maximum depth of each tree (default=None)
   * min_samples_split: the minimum number of samples in a node to be considered for further splitting (default=2)

In [None]:
testClassifier(DecisionTreeClassifier(), "Decision Tree",mc_l)


   ## Random Forest
   A random forest is an ensemble model that fits a number of decision tree classifiers on various sub-samples of the dataset which are created by the use of bootstrapping. In the inference stage it uses a majority vote over all trees to obtain the prediction. This improves the predictive accuracy and controls over-fitting.

   The hyperparameters of a random forest include the following ones, which can be passed to the RandomForestClassifier of sklearn.ensemble:

   * n_estimators: the number of trees
   * criterion: the criterion which decides the feature and the value at the split (default='gini')
   * max_depth: the maximum depth of each tree (default=None)
   * min_samples_split: the minimum number of samples in a node to be considered for further splitting (default=2)
   * max_features: the number of features which are considered for a split (default='sqrt')
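
   The two mechanisms described above, bootstrapping and majority voting, can be sketched with NumPy alone. The helper names and the hard-coded tree predictions are hypothetical; no actual trees are trained here.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return rng.integers(0, n_samples, size=n_samples)

def majority_vote(tree_predictions):
    """Combine per-tree class predictions (n_trees x n_samples) by majority vote."""
    tree_predictions = np.asarray(tree_predictions)
    n_classes = int(tree_predictions.max()) + 1
    # Row c holds, for every sample, how many trees voted for class c
    votes = np.stack([(tree_predictions == c).sum(axis=0) for c in range(n_classes)])
    return votes.argmax(axis=0)

idx = bootstrap_indices(10, rng)   # rows that one tree would be trained on

# Hypothetical predictions of three trees for four samples
preds = [[1, 0, 1, 0],
         [1, 1, 0, 0],
         [0, 1, 1, 0]]
print(majority_vote(preds))        # [1 1 1 0]
```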

In [None]:
clf = RandomForestClassifier()
# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(accuracy_score)
# Run the grid search
grid_obj = GridSearchCV(
    clf, parameters, scoring=acc_scorer).fit(X_train, y_train)
# Set the clf to the best combination of parameters
clf = testClassifier(grid_obj.best_estimator_, "Random Forest",mc_l)

# Rank the features by the importance assigned by the fitted forest
feature_importances = pd.DataFrame(
    clf.feature_importances_,
    index=X_train.columns,
    columns=['importance']).sort_values('importance', ascending=False)


In [None]:
feature_importances.importance.nlargest(40).plot.barh()

   ## Gradient Boosting Classifier
   XGBoost is an implementation of gradient boosting, an ensemble learning method. Ensemble learning offers a systematic way to combine the predictive power of multiple learners. The result is a single model that aggregates the output of several models.

   #### Bagging

   While decision trees are one of the most easily interpretable models, they exhibit highly variable behavior. Consider a single training dataset that we randomly split into two parts. Now, let’s use each part to train a decision tree in order to obtain two models.

   When we fit both these models, they would yield different results. Decision trees are said to have high variance because of this behavior. Bagging, or bootstrap aggregation, helps to reduce the variance of a learner. The base learners of the bagging technique are several decision trees generated in parallel. Each learner is trained on data sampled with replacement, and the final prediction is the averaged output of all the learners.
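
   The variance reduction obtained by averaging can be illustrated with a small simulation. It assumes, as a simplification, that the learners make independent errors; bagged trees are correlated in practice, so the real reduction is smaller. All numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0
n_repeats, n_learners = 2000, 25

# Each row holds the noisy predictions of 25 hypothetical learners for the same input
predictions = true_value + rng.normal(scale=1.0, size=(n_repeats, n_learners))

single_var = predictions[:, 0].var()          # variance of a single learner
bagged_var = predictions.mean(axis=1).var()   # variance of the averaged prediction

print(single_var, bagged_var)
```

   With independent errors the variance of the average shrinks roughly in proportion to the number of learners.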

   #### Boosting
   It is an ensemble technique where new models are added sequentially to correct the errors made by existing models, until no further improvements can be made.

   The base learners in boosting are weak learners: their bias is high and their predictive power is only slightly better than random guessing. Each of them nevertheless contributes some vital information for prediction, which lets the boosting technique combine them into a strong learner with lower bias and lower variance.

   In contrast to bagging techniques like Random Forest, in which trees are grown to their maximum extent, boosting uses shallow trees with few splits, each of which is highly interpretable.

   Parameters like the number of trees or iterations, the rate at which the gradient boosting learns, and the depth of the tree can be selected through validation techniques like k-fold cross validation.
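
   As a rough sketch of what k-fold cross validation does, the hypothetical helper below partitions the sample indices into k folds and uses each fold once for validation. A candidate parameter setting would be trained on each `train_idx`, scored on the matching `valid_idx`, and the scores averaged. `GridSearchCV`, used elsewhere in this notebook, performs a cross-validated search of this kind internally.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k folds of roughly equal size
    for i, valid in enumerate(folds):
        # Training indices are all folds except the i-th one
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, valid

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, valid_idx in splits:
    print(sorted(valid_idx), len(train_idx))
```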

   Having a large number of trees might lead to overfitting, so it is necessary to carefully choose the stopping criteria for boosting.

   Boosting consists of three simple steps:

   1. An initial model $F_0$ is defined to predict the target variable $y$. This model will be associated with a residual $(y - F_0)$
   2. A new model $h_1$ is fit to the residuals from the previous step
   3. Now, $F_0$ and $h_1$ are combined to give $F_1$, the boosted version of $F_0$. The mean squared error from $F_1$ will be lower than that from $F_0$:

   $$F_1(x)\leftarrow F_0(x)+h_1(x)$$

   To improve the performance of $F_1$, we could fit a model $h_2$ to the residuals of $F_1$ and create a new model $F_2$:

   $$F_2(x)\leftarrow F_1(x)+h_2(x)$$

   This can be repeated for $m$ iterations, until the residuals have been minimized as much as possible:
   $$F_m(x)\leftarrow F_{m-1}(x)+h_{m}(x)$$

   Here, the additive learners do not disturb the functions created in the previous steps. Instead, they impart information of their own to bring down the errors.
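
   The update $F_m(x)\leftarrow F_{m-1}(x)+h_m(x)$ can be sketched on a made-up one-dimensional regression problem. The weak learner here is a one-split regression stump written in NumPy; this is an illustrative assumption, not the learner used by XGBoost.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a one-split regression stump to the residuals r over a 1-D feature x."""
    best = None
    for t in np.unique(x)[:-1]:               # candidate thresholds
        left = x <= t
        left_mean, right_mean = r[left].mean(), r[~left].mean()
        sse = np.sum((r - np.where(left, left_mean, right_mean)) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda z: np.where(z <= t, left_mean, right_mean)

rng = np.random.default_rng(0)
x = np.linspace(0, 6, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)   # made-up target

F = np.full_like(y, y.mean())       # F_0: a constant model
mse = [np.mean((y - F) ** 2)]
for m in range(1, 6):
    h = fit_stump(x, y - F)         # h_m is fit to the residuals of F_{m-1}
    F = F + h(x)                    # F_m = F_{m-1} + h_m
    mse.append(np.mean((y - F) ** 2))

print(mse)                          # training MSE after each boosting step
```

   Each stump predicts the mean residual on its side of the split, so the training error cannot increase from one step to the next.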


   **Gradient boosting** is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction. It is called gradient boosting because **it uses a gradient descent algorithm** to minimize the loss when adding new models.

   This approach supports both regression and classification predictive modeling problems.
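
   For the squared-error loss, the link between fitting residuals and gradient descent can be made explicit. The symbols $L$ and $\eta$ are introduced here for illustration; $\eta$ plays the role of a learning rate, such as the `learning_rate` argument passed to the model in the code.

   $$L(y,F(x))=\frac{1}{2}\left(y-F(x)\right)^2 \quad\Rightarrow\quad -\frac{\partial L}{\partial F(x)}=y-F(x)$$

   The residual is therefore the negative gradient of the loss with respect to the current prediction, and fitting $h_m$ to it is a gradient descent step in function space:

   $$F_m(x)\leftarrow F_{m-1}(x)+\eta\,h_m(x)$$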

In [None]:
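# Note: XGBRegressor returns continuous scores, not class labels; XGBClassifier is xgboost's classification variant
testClassifier(XGBRegressor(n_estimators=1000, learning_rate=0.05), "Gradient Boosting", mc_l)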



In [None]:
comparison_l = pd.DataFrame(mc_l, columns=['classifier','accuracy','fallout'])
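
   The two metrics collected in the comparison table can be sketched as follows. The definition of `testClassifier` is not shown in this part of the notebook, so this assumes that fallout means the false positive rate, FP / (FP + TN); the helper name and the labels are made up.

```python
def accuracy_and_fallout(y_true, y_pred):
    """Accuracy and fallout (false positive rate) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    accuracy = (tp + tn) / len(pairs)
    fallout = fp / (fp + tn)        # share of negatives predicted as positive
    return accuracy, fallout

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(accuracy_and_fallout(y_true, y_pred))   # (0.75, 0.25)
```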


  ## Principal Component Analysis
  PCA is used to find an orthonormal basis for the data. It sorts the dimensions in order of "importance" and enables you to discard the dimensions of lower significance.

  We have to find a lower-dimensional linear space that:

  1. Maximizes the variance of the projected data
  2. Minimizes the mean squared distance between the data points and their projections

  That lower-dimensional space must preserve as much information as possible; in particular, it minimizes the squared error in reconstructing the original data.

  Each PCA vector originates from the center of mass. The first one points in the direction of the largest variance. Each subsequent principal component is orthogonal to the previous ones and points in the direction of the largest variance of the residual subspace.

  **PCA steps:**
  * calculate the eigenvectors and eigenvalues of the covariance matrix or correlation matrix, or perform a Singular Value Decomposition
  * sort the eigenvectors according to their eigenvalues in decreasing order
  * build the $n\times k$-dimensional projection matrix $W$ by putting the top $k$ eigenvectors into the columns of $W$
  * transform the dataset $X$ by multiplying it with $W$: $X_t=XW$
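
  The steps above can be sketched with NumPy on made-up data. Choosing $k$ as the smallest number of components that explains at least 85% of the variance is an assumption meant to mirror the `PCA(.85)` call; the data is centered before projecting, which the formula $X_t=XW$ leaves implicit.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 200 samples, 5 features with very different variances
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

Xc = X - X.mean(axis=0)                  # center the data
cov = np.cov(Xc, rowvar=False)           # n x n covariance matrix (n = features)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh handles symmetric matrices

order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Smallest k whose components explain at least 85% of the total variance
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.85) + 1)

W = eigvecs[:, :k]                       # n x k projection matrix
X_t = Xc @ W                             # projected (centered) data
print(k, X_t.shape)
```

  The variance of each projected column equals the corresponding eigenvalue, which is why sorting by eigenvalue ranks the directions by "importance".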

In [None]:
from sklearn.decomposition import PCA
pca = PCA(.85) # retain 85% of the variance
X_r = pca.fit_transform(X_l)



In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_r, y, test_size=0.2, random_state=1)
mc_r = defaultdict(list)

### SVM
testClassifier(SVC(kernel='rbf', C=C, gamma=gamma), "Optimized RBF SVM",mc_r)

### Random Forest
clf = RandomForestClassifier()
acc_scorer = make_scorer(accuracy_score)
grid_obj = GridSearchCV(clf, parameters, scoring=acc_scorer).fit(X_train, y_train)
testClassifier(grid_obj.best_estimator_, "Random Forest",mc_r)

### XGBR
testClassifier(XGBRegressor(n_estimators=1000, learning_rate=0.05), "Gradient Boosting",mc_r)

comparison_r = pd.DataFrame(mc_r, columns=['classifier','accuracy','fallout'])


 ## One hot encoding
 One-hot encoding is a process by which each categorical variable is converted into binary indicator columns, one per category, so that it can be provided to ML algorithms without implying an order among the categories.
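
 A minimal pure-Python sketch of the idea follows; the `one_hot` helper and the genre values are made up for illustration and are not how `X_h` was built.

```python
def one_hot(values):
    """Return the sorted categories and one 0/1 indicator row per input value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1        # exactly one indicator is set per row
        rows.append(row)
    return categories, rows

genres = ["Drama", "Comedy", "Action", "Drama"]
categories, rows = one_hot(genres)
print(categories)   # ['Action', 'Comedy', 'Drama']
print(rows)         # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

 In practice, `pandas.get_dummies` or scikit-learn's `OneHotEncoder` provide a comparable transformation.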

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_h, y, test_size=0.2, random_state=1)
mc_h = defaultdict(list)

### SVM
testClassifier(SVC(kernel='rbf', C=C, gamma=gamma), "Optimized RBF SVM",mc_h)
### Random Forest
clf = RandomForestClassifier()
acc_scorer = make_scorer(accuracy_score)
grid_obj = GridSearchCV(clf, parameters, scoring=acc_scorer).fit(X_train, y_train)
testClassifier(grid_obj.best_estimator_, "Random Forest",mc_h)
### XGBR
testClassifier(XGBRegressor(n_estimators=1000, learning_rate=0.05), "Gradient Boosting",mc_h)

comparison_h = pd.DataFrame(mc_h, columns=['classifier','accuracy','fallout'])



In [None]:
comparison_h = comparison_h.set_index("classifier")
comparison_r = comparison_r.set_index("classifier")
comparison_l = comparison_l.set_index("classifier")



In [None]:
comparison = comparison_h.join(comparison_r,rsuffix = "_pca",lsuffix="_1hot")
comparison = comparison.join(comparison_l,rsuffix = "_label")



In [None]:
comparison.plot.bar()
plt.show()


   # Conclusion:

  We started our analysis with this aim: to find out the key success factors in the film industry, and to try to use those factors to predict whether a movie will be successful or not.

  We found out some interesting facts, for example:

  - older movies had lower runtime
  - budget slightly increased across the years
  - longer movies tend to have higher votes
  - higher budget often means higher revenue and popularity

  **Results**:

  Our **Top Director** is **Steven Spielberg**, with almost all of his productions considered a success.

  The most used **genre** is **Drama**, although **Comedy** movies have a remarkable record: **100%** of them are successful.

  **Universal Pictures** is the most prolific company, and also the most successful.

  As we could expect, **USA** is the most prolific and successful country, with practically no rivals.

  The Top Actor is **Samuel L. Jackson** with 30+ successful movies. Robert De Niro represents an interesting case, dropping from 2nd position, when counting the number of appearances, to outside the top 5, if we consider appearances in successful movies only.

  **Classifiers**:

  The best classifier overall seems to be the Gradient Boosting classifier, in terms of both accuracy and false positive rate (which was arbitrarily selected as the critical measure).

  But it's a pretty fair competition between XGB, RF and **SVM**, which outperforms the two ensemble methods in a reduced space of approximately 20 features (against the 40+ of the initial space).