## Data Imputation

### Missing categorical values were imputed by creating a new category, "none". Missing numeric values were imputed with the column median.
### As a next step, imputation will be repeated with model-based methods such as KNN and SVM, noting whether this improves the accuracy of the model.
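The two imputation rules can be sketched on a toy frame. This is a minimal illustration, not the contents of `data_clean`: the column names and values are made up, and plain pandas `fillna` is assumed as the mechanism.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the movie data (illustrative values only).
df = pd.DataFrame({
    "color": ["Color", None, "Black and White", None],
    "budget": [100.0, np.nan, 300.0, 50.0],
})

cat_cols = ["color"]    # categorical columns (assumed)
num_cols = ["budget"]   # numeric columns (assumed)

# Categorical: treat "missing" as its own category.
df[cat_cols] = df[cat_cols].fillna("none")

# Numeric: replace missing values with the column median (100.0 here).
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

print(df)
```

The median is preferred over the mean here because it is less sensitive to extreme values, which matters for heavily skewed columns.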

## Data Transformation

### Numeric variables were scaled with a min-max scaler because most of them are on different scales, for example budget and num_votes.
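As a small worked example, min-max scaling maps each column to the range [0, 1] via (x - min) / (max - min). The numbers below are invented to mimic the scale gap between a budget column and a vote-count column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative values only: column 0 ~ budget, column 1 ~ num_votes.
X = np.array([
    [1_000_000.0, 500.0],
    [50_000_000.0, 20_000.0],
    [200_000_000.0, 1_000_000.0],
])

scaler = MinMaxScaler()            # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)

# The same result from the formula, applied column by column.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_scaled)
```

After scaling, both columns span exactly 0 to 1, so neither dominates a model purely because of its units.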

## Scoring

### Model accuracy is measured by mean squared error (MSE), one of the most widely used metrics for evaluating regression models.

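$$ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2} $$

### Here $y_{i}$ is the observed score, $\hat{y}_{i}$ is the model's prediction, and $n$ is the number of samples.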
 
### The lower the MSE, the better the model.
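To make the metric concrete, the sketch below computes MSE by hand and with `sklearn.metrics.mean_squared_error`. The four observed and predicted scores are made-up values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([7.0, 6.5, 8.0, 5.0])   # observed scores (illustrative)
y_pred = np.array([6.5, 6.5, 7.0, 6.0])   # predicted scores (illustrative)

# Errors are 0.5, 0, 1, -1; squared errors are 0.25, 0, 1, 1; mean is 0.5625.
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```

Because errors are squared, one prediction that is off by 1 contributes four times as much as one that is off by 0.5.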
 
## Modelling Techniques
 
### A group of models was evaluated as baselines using an sklearn pipeline. The best performers were the ensemble models, so these were evaluated further with hyperparameter optimization. The model selected was XGBoost, and it is recommended that the final XGBoost model be deployed, as its final parameters reduced MSE by a factor of about 1/6.

In [1]:
import pandas as pd
path="movie_metadata.csv"

In [2]:
#path="https://raw.githubusercontent.com/sundeepblue/movie_rating_prediction/master/movie_metadata.csv"
from preprocessing import data_clean

data1=data_clean(path)
labels=data1.imdb_score.values
all_data=pd.concat([data1.drop(columns=["genres"]),data1.genres.str.get_dummies().add_prefix('Part_')],axis=1)
all_data=all_data.drop(columns=["imdb_score"])


In [42]:
all_data.shape

(4150, 35)

In [43]:
all_data.imdb_score

AttributeError: 'DataFrame' object has no attribute 'imdb_score'

In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder,MinMaxScaler
from sklearn.model_selection import train_test_split
# Numeric columns are min-max scaled; all remaining columns are one-hot encoded.
numeric_features = all_data.select_dtypes(include="number").columns
numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())])

categorical_features = all_data.select_dtypes(exclude="number").columns
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Route each column group through its own transformer.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

In [6]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoLars
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble  import  AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import metrics
from xgboost import XGBRegressor

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor

Using TensorFlow backend.


In [7]:
from sklearn.metrics import r2_score, mean_squared_error, make_scorer
from sklearn.model_selection import cross_val_score

# (estimator, display name) pairs evaluated as baselines
regressors = [
    (LinearRegression(), "Linear Regression"),
    (Lasso(alpha=.5), "Lasso"),
    (Ridge(alpha=.1), "Ridge"),
    (DecisionTreeRegressor(), "Decision Trees"),
    (RandomForestRegressor(), "Random Forest"),
    (AdaBoostRegressor(), "Ada-Boost"),
    (GradientBoostingRegressor(), "GBM"),
    (XGBRegressor(), "xgboost"),
]

# make_scorer defaults to greater_is_better=True, so this scorer returns the
# raw (positive) MSE. That is fine for reporting, but a search that maximizes
# this score would favour the highest MSE.
mse = make_scorer(mean_squared_error)

for regressor, name in regressors:
    model = Pipeline(steps=[('preprocessor', preprocessor),
                            ('regressor', regressor)])
    # cross_val_score clones and refits the pipeline on each fold,
    # so no separate fit call is needed.
    scores = cross_val_score(model, all_data, labels, cv=5, scoring=mse)
    print("Model MSE: %s %0.2f (+/- %0.2f)" % (name, scores.mean(), scores.std() * 2))

Model MSE: Linear Regression 2288137692706663301120.00 (+/- 9152550770826653204480.00)
Model MSE: Lasso 1.11 (+/- 0.26)
Model MSE: Ridge 0.79 (+/- 0.21)
Model MSE: Decision Trees 1.40 (+/- 0.30)
Model MSE: Random Forest 0.70 (+/- 0.20)
Model MSE: Ada-Boost 1.01 (+/- 0.33)
Model MSE: GBM 0.70 (+/- 0.22)
Model MSE: xgboost 0.77 (+/- 0.19)


In [11]:
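from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: the cell that created X_train/y_train is not shown in this
# notebook, so a default 75/25 split stands in for it here.
X_train, X_test, y_train, y_test = train_test_split(all_data, labels, random_state=0)

xgb_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('xgbrg', XGBRegressor())])
parameters = {
    'xgbrg__max_depth': range(2, 10, 1),
    'xgbrg__n_estimators': range(60, 220, 40),
    'xgbrg__learning_rate': [0.1, 0.01, 0.05]
}

# fit_params = {"xgbrg__eval_set": [(X_test, y_test)], 
#               "xgbrg__early_stopping_rounds": 10, 
#               "xgbrg__verbose": False} 

# Caution: `mse` is the raw MSE with greater_is_better=True, so this search
# ranks the highest-MSE parameter set as "best". Pass
# scoring="neg_mean_squared_error" to select the lowest-MSE parameters.
searchCV = GridSearchCV(xgb_pipeline, cv=5, param_grid=parameters, scoring=mse)
searchCV.fit(X_train, y_train)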

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('preprocessor',
                                        ColumnTransformer(n_jobs=None,
                                                          remainder='drop',
                                                          sparse_threshold=0.3,
                                                          transformer_weights=None,
                                                          transformers=[('num',
                                                                         Pipeline(memory=None,
                                                                                  steps=[('scaler',
                                                                                          MinMaxScaler(copy=True,
                                                                                                       feature_range=(0,
                                          

In [12]:
searchCV.best_score_

11.672789338328254

In [13]:
searchCV.best_params_

{'xgbrg__learning_rate': 0.01,
 'xgbrg__max_depth': 9,
 'xgbrg__n_estimators': 60}
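A note on reading these results: `make_scorer` defaults to `greater_is_better=True`, and `GridSearchCV` always maximizes its score, so a search scored with a raw MSE scorer reports the parameter set with the highest error. The sketch below shows one way to search for the lowest MSE instead, using scikit-learn's built-in `"neg_mean_squared_error"` scorer. The synthetic data and `GradientBoostingRegressor` are stand-ins chosen to keep the example self-contained; they are not the notebook's data or its XGBoost pipeline.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (stand-in for the movie features and labels).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

param_grid = {"n_estimators": [20, 60], "learning_rate": [0.01, 0.1]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring="neg_mean_squared_error",  # higher (closer to 0) is better
)
search.fit(X, y)

# best_score_ is the negated MSE, so flip the sign to report the MSE itself.
best_mse = -search.best_score_
print(search.best_params_, best_mse)
```

With this scorer, `best_score_` is never positive, and `best_params_` corresponds to the smallest cross-validated MSE in the grid.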