# Linear Regression & Variable Selection on TMDB movies

The file tmdb_5000_movies.csv contains movie metadata from The Movie Database (TMDb).


The purpose of this assignment is to do the following:
    - Perform feature selection on the given dataset.
    - Use linear regression to predict the target, vote_average.
    - Use the model score to identify the most predictive features.

# 1) Data Preparation

In [1]:
#Import all required libraries for reading data, analysing and visualizing data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import json

In [2]:
movies = pd.read_csv('tmdb_5000_movies.csv')

In [3]:
movies.shape

(4803, 20)

In [4]:
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [5]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title                   4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null 

Movie info has the following 20 features for 4803 movies:   
    - budget - movie budget                
    - genres - json data having the following info for the specific movie genre  
        * id - genre id  
        * name - genre name for the specific movie  
    - homepage - URL of the movie website               
    - id - movie id                     
    - keywords - json data having the following info for the specific movie keywords                
        * id - keyword id  
        * name - keyword name for the specific movie      
    - original_language - language in which original movie was released      
    - original_title - original title of the movie         
    - overview - movie description              
    - popularity - popularity rating of the movie              
    - production_companies - json data having the following info for the production companies for the movie  
        * id - production company id  
        * name - production company name for the specific movie          
    - production_countries  - json data having the following info for the production countries of the movie  
        * iso_3166_1 -  ISO code for the country   
        * name - name of the country where the movie was produced  
    - release_date - release date           
    - revenue - movie revenue                 
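    - runtime - movie runtime in minutes                 
    - spoken_languages - json data having the following info for the languages spoken in the movie        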
        * iso_639_1 - Code for the language     
        * name - language name  
    - status - Movie Status - Released, Rumored, Post production                
    - tagline - Movie Tagline                
    - title - movie title                  
    - vote_average - average vote           
    - vote_count - vote count              

# 2) Data Processing

These are the steps I'm going to follow in order to process the data:  
    - Work on the JSON columns: genres, keywords, production_companies, production_countries and spoken_languages.  
    - Transform categorical features to numerical
    - Create new features release year and month based on the release date  

## 2.1) Analysis of JSON Objects
The following are the json objects
    - Genres
    - keywords
    - production_companies
    - production_countries
    - spoken_languages

In [6]:
#parse json input
json_columns = ['genres', 'keywords', 'production_companies', 'production_countries', 'spoken_languages']

In [7]:
for column in json_columns:
    # each cell holds a JSON string; parse it into a list of dicts
    movies[column] = movies[column].apply(json.loads)

###  2.1.1) Function to process the JSON objects Genres, Keywords, Production Countries, Companies, Spoken languages.
In each of these columns the structure is flat rather than nested: every entry is a small record with an identifier (an id or an ISO code) and a name. I'm only fetching the value of the key name for each entry.

In [9]:
def process_jsoncols(records):
    # records is a list of dicts; keep only the value of the 'name' key
    return [record['name'] for record in records]

In [10]:
for colname in json_columns:
    movies[colname] = movies[colname].apply(process_jsoncols)
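
As a quick sanity check of the two steps above, the following sketch runs the same parse-then-extract logic on a single hand-written JSON string. The sample string and the helper name `extract_names` are made up for illustration; only `json.loads` and the `name` key come from the notebook.

```python
import json

import pandas as pd

# Hypothetical one-row column holding a JSON-encoded list of {"id", "name"} records.
raw = pd.Series(['[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'])


def extract_names(records):
    """Return the 'name' value of every record in a list of dicts."""
    return [record["name"] for record in records]


parsed = raw.apply(json.loads)       # JSON string -> list of dicts
names = parsed.apply(extract_names)  # list of dicts -> list of names
print(names[0])
```

Keeping parsing and extraction as two separate `apply` calls mirrors the notebook and makes it easy to inspect the intermediate list of dicts.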

In [11]:
movies[['genres', 'keywords', 'production_companies', 'production_countries', 'spoken_languages']].head(2)

Unnamed: 0,genres,keywords,production_companies,production_countries,spoken_languages
0,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Ingenious Film Partners, Twentieth Century Fo...","[United States of America, United Kingdom]","[English, Español]"
1,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Walt Disney Pictures, Jerry Bruckheimer Films...",[United States of America],[English]


### 2.1.2) Convert Pandas Dataframe Column of Lists to string. 
The impacted columns are genres, keywords, production_countries, spoken_languages and production_companies. NOTE: crew and cast are not columns of lists.

In [12]:
listcols = ['genres', 'keywords', 'production_countries', 'production_companies', 'spoken_languages']

In [13]:
for colname in listcols:
    movies[colname] = movies[colname].apply(lambda x: ','.join(map(str, x))) #Map applies a function to all the items in an input_list. 

In [14]:
movies[['genres', 'keywords', 'production_companies', 'production_countries', 'spoken_languages']].head(2)

Unnamed: 0,genres,keywords,production_companies,production_countries,spoken_languages
0,"Action,Adventure,Fantasy,Science Fiction","culture clash,future,space war,space colony,so...","Ingenious Film Partners,Twentieth Century Fox ...","United States of America,United Kingdom","English,Español"
1,"Adventure,Fantasy,Action","ocean,drug abuse,exotic island,east india trad...","Walt Disney Pictures,Jerry Bruckheimer Films,S...",United States of America,English


## 2.2) Identify Missing values

In [15]:
movies.isnull().sum()[movies.isnull().sum()>0]

homepage        3091
overview           3
release_date       1
runtime            2
tagline          844
dtype: int64

The missing values for homepage, overview, tagline and runtime are not really important.
There is one movie with no release date. Let's remove that entry from the dataframe.

In [16]:
len(movies)

4803

In [17]:
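# .copy() makes the filtered frame independent, so later column assignments do not hit a view
movies = movies[movies.release_date.notnull()].copy()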

In [18]:
# assign the result back instead of calling fillna(inplace=True) on a column accessor
movies['runtime'] = movies['runtime'].fillna(0)

## 2.2) Transform 3 Categorical features to numerical

The categorical features in the movies data are as follows:
    - genres, keywords, original_language, production_companies, production_countries, spoken_languages, status.
Pick 3 features and transform them to numerical as follows:
    - the features to be picked are genres, production_companies and status
    - get the unique list of values from each picked column
    - for each unique value, create a new column that is 1 if the movie has that value and 0 otherwise. E.g. if there are 30 genres, we will have 30 columns, and a movie with genres Action and Thriller will have 1 in the Action and Thriller columns and 0 in the other 28. The same holds for all the categorical features
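
The encoding described above can be sketched on a toy frame. This example uses `Series.str.get_dummies(sep=',')` as a compact alternative to the per-value loop used in the cells below; the three sample rows are invented for illustration.

```python
import pandas as pd

# Toy comma-separated genre strings; the last movie has no genre at all.
toy = pd.DataFrame({"genres": ["Action,Thriller", "Drama", ""]})

# One 0/1 indicator column per distinct value found between the separators.
dummies = toy["genres"].str.get_dummies(sep=",")
print(dummies)
```

Unlike a substring search, this splits on the separator first, so a value only matches a whole entry of the list.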

### 2.2.1) Genres

In [19]:
genres_list = set()
for sstr in movies['genres'].str.split(','):
    genres_list = set().union(sstr, genres_list)
genres_list = list(genres_list)
genres_list.remove('')
genres_list

['Music',
 'Drama',
 'Fantasy',
 'Horror',
 'TV Movie',
 'Thriller',
 'War',
 'Mystery',
 'Foreign',
 'History',
 'Crime',
 'Comedy',
 'Documentary',
 'Adventure',
 'Family',
 'Action',
 'Animation',
 'Romance',
 'Western',
 'Science Fiction']

In [20]:
#Transforming categorical to one hot encoding
for genres in genres_list:
    movies[genres] = movies['genres'].str.contains(genres).apply(lambda x:1 if x else 0)

In [21]:
movies.head(2).T

Unnamed: 0,0,1
budget,237000000,300000000
genres,"Action,Adventure,Fantasy,Science Fiction","Adventure,Fantasy,Action"
homepage,http://www.avatarmovie.com/,http://disney.go.com/disneypictures/pirates/
id,19995,285
keywords,"culture clash,future,space war,space colony,so...","ocean,drug abuse,exotic island,east india trad..."
original_language,en,en
original_title,Avatar,Pirates of the Caribbean: At World's End
overview,"In the 22nd century, a paraplegic Marine is di...","Captain Barbossa, long believed to be dead, ha..."
popularity,150.438,139.083
production_companies,"Ingenious Film Partners,Twentieth Century Fox ...","Walt Disney Pictures,Jerry Bruckheimer Films,S..."


### 2.2.2) Production Companies

- get the list of production companies
- pick only the top 30 companies and proceed with one hot encoding

In [22]:
pc_list = []
for sstr in movies['production_companies'].str.split(','):
    for substr in sstr:
        pc_list.append(substr)

In [23]:
len(pc_list)

14101

##### Since there are 5025 unique companies and 14101 occurrences, I'm only going to one-hot encode the most frequent production companies

In [24]:
def count_elements(lst):
    elements = {}
    for elem in lst:
        if elem in elements.keys():
            elements[elem] +=1
        else:
            elements[elem] = 1
    return elements

In [25]:
pc_count = count_elements(pc_list)

In [27]:
top30_pc = sorted(pc_count, key=pc_count.get, reverse=True)[1:30]
top30_pc[:10]

['Warner Bros.',
 'Universal Pictures',
 'Paramount Pictures',
 'Twentieth Century Fox Film Corporation',
 'Columbia Pictures',
 'New Line Cinema',
 'Metro-Goldwyn-Mayer (MGM)',
 'Touchstone Pictures',
 'Walt Disney Pictures',
 'Relativity Media']
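
The counting and ranking steps above can also be written with `collections.Counter`. The sketch below uses a small made-up list instead of `pc_list`. Note that the slice `[1:30]` used above skips the first entry and keeps 29 names; a plausible reason, assumed here rather than confirmed by the notebook, is that the most frequent "company" is the empty string produced by movies with no company listed, so the sketch drops it explicitly.

```python
from collections import Counter

# Made-up sample standing in for pc_list; '' represents a movie with no company.
sample = ["Warner Bros.", "Universal Pictures", "Warner Bros.", "", "", ""]

counts = Counter(sample)   # same result as count_elements(sample)
counts.pop("", None)       # drop the empty name explicitly instead of slicing it away

# most_common(k) returns the k most frequent (name, count) pairs in descending order.
top = [name for name, _ in counts.most_common(1)]
print(top)
```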

In [28]:
for pc in top30_pc:
    movies[pc] = movies['production_companies'].str.contains(pc).apply(lambda x:1 if x else 0)
movies.head(2)    



Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,Lionsgate,The,Fox 2000 Pictures,TriStar Pictures,Dimension Films,Summit Entertainment,Working Title Films,Amblin Entertainment,The Weinstein Company,StudioCanal
0,237000000,"Action,Adventure,Fantasy,Science Fiction",http://www.avatarmovie.com/,19995,"culture clash,future,space war,space colony,so...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"Ingenious Film Partners,Twentieth Century Fox ...",...,0,0,0,0,0,0,0,0,0,0
1,300000000,"Adventure,Fantasy,Action",http://disney.go.com/disneypictures/pirates/,285,"ocean,drug abuse,exotic island,east india trad...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"Walt Disney Pictures,Jerry Bruckheimer Films,S...",...,0,0,0,0,0,0,0,0,0,0


In [29]:
pc_count = []
for pc in top30_pc:
    pc_count.append([pc, movies[pc].values.sum()])
pc_count[:10]

[['Warner Bros.', 334],
 ['Universal Pictures', 314],
 ['Paramount Pictures', 285],
 ['Twentieth Century Fox Film Corporation', 222],
 ['Columbia Pictures', 299],
 ['New Line Cinema', 165],
 ['Metro-Goldwyn-Mayer (MGM)', 0],
 ['Touchstone Pictures', 118],
 ['Walt Disney Pictures', 114],
 ['Relativity Media', 102]]
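
In the counts above, 'Metro-Goldwyn-Mayer (MGM)' shows 0 matches even though it is among the most frequent names. A likely explanation is that `str.contains` treats its pattern as a regular expression by default, so the parentheses form a group instead of matching literal brackets. The sketch below reproduces the effect on two invented rows and shows two ways to match the name literally.

```python
import re
import warnings

import pandas as pd

# Two invented rows; the first one contains the company name with parentheses.
companies = pd.Series(["Metro-Goldwyn-Mayer (MGM),United Artists", "Warner Bros."])
name = "Metro-Goldwyn-Mayer (MGM)"

with warnings.catch_warnings():
    # pandas warns that the pattern contains match groups; silence it for the demo.
    warnings.simplefilter("ignore", UserWarning)
    as_regex = companies.str.contains(name)             # pattern read as a regex

literal = companies.str.contains(name, regex=False)     # plain substring search
escaped = companies.str.contains(re.escape(name))       # regex with metacharacters escaped

print(as_regex.sum(), literal.sum(), escaped.sum())
```

The same default also makes the dot in 'Warner Bros.' match any character, which is harmless here but worth knowing.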

### 2.2.3) status

In [30]:
stat_list = movies.status.value_counts().index.tolist()
stat_list

['Released', 'Rumored', 'Post Production']

In [31]:
for sl in stat_list:
    movies[sl] = movies['status'].str.contains(sl).apply(lambda x:1 if x else 0)

In [32]:
movies.head(1).T

Unnamed: 0,0
budget,237000000
genres,"Action,Adventure,Fantasy,Science Fiction"
homepage,http://www.avatarmovie.com/
id,19995
keywords,"culture clash,future,space war,space colony,so..."
original_language,en
original_title,Avatar
overview,"In the 22nd century, a paraplegic Marine is di..."
popularity,150.438
production_companies,"Ingenious Film Partners,Twentieth Century Fox ..."


## 2.3) Create new features release year and month based on Release date

In [33]:
from datetime import datetime
movies['release_date'] = pd.to_datetime(movies['release_date'])

In [34]:
movies['release_year'] = movies['release_date'].dt.year
movies['release_month'] = movies['release_date'].dt.month
movies.head(1).T

Unnamed: 0,0
budget,237000000
genres,"Action,Adventure,Fantasy,Science Fiction"
homepage,http://www.avatarmovie.com/
id,19995
keywords,"culture clash,future,space war,space colony,so..."
original_language,en
original_title,Avatar
overview,"In the 22nd century, a paraplegic Marine is di..."
popularity,150.438
production_companies,"Ingenious Film Partners,Twentieth Century Fox ..."


# 3) Linear Regression

    - Identify the features for Linear Regression
    - Split for training & test data 70-30
    - Perform linear regression to predict 'vote average'
    - Get the score of the model

## 3.1) Prepare the data for Linear Regression 

In [35]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4802 entries, 0 to 4802
Data columns (total 74 columns):
budget                                    4802 non-null int64
genres                                    4802 non-null object
homepage                                  1712 non-null object
id                                        4802 non-null int64
keywords                                  4802 non-null object
original_language                         4802 non-null object
original_title                            4802 non-null object
overview                                  4799 non-null object
popularity                                4802 non-null float64
production_companies                      4802 non-null object
production_countries                      4802 non-null object
release_date                              4802 non-null datetime64[ns]
revenue                                   4802 non-null int64
runtime                                   4802 non-null float64
spok

Drop genres, homepage, original_language, original_title and overview: they are either descriptive text about the movie or have already been converted to numeric columns.

In [36]:
movies.drop(['genres', 'homepage', 'original_language', 'original_title', 'overview'], axis=1, inplace=True)

Also drop keywords, production_companies, production_countries, spoken_languages, status, tagline, title, id and release_date.

In [37]:
movies.drop(['keywords', 'production_companies', 'production_countries', 'spoken_languages', 'status', 'tagline', 'title', 'id', 'release_date'], axis=1, inplace=True)

In [38]:
movies.head(1).T

Unnamed: 0,0
budget,237000000.0
popularity,150.4376
revenue,2787965000.0
runtime,162.0
vote_average,7.2
vote_count,11800.0
Music,0.0
Drama,0.0
Fantasy,1.0
Horror,0.0


## 3.2) Identify the features for Linear Regression

- The outcome/dependent variable is vote_average
- We are trying to predict vote_average based on the features or input variables.

In [39]:
# Importing modules
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn import linear_model

#### Ignoring vote_count as it comes from the same voting data as the target vote_average

In [40]:
X = movies.drop(['vote_average', 'vote_count'], axis=1)
y = movies['vote_average']

In [41]:
X.head()

Unnamed: 0,budget,popularity,revenue,runtime,Music,Drama,Fantasy,Horror,TV Movie,Thriller,...,Summit Entertainment,Working Title Films,Amblin Entertainment,The Weinstein Company,StudioCanal,Released,Rumored,Post Production,release_year,release_month
0,237000000,150.437577,2787965087,162.0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,2009,12
1,300000000,139.082615,961000000,169.0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,2007,5
2,245000000,107.376788,880674609,148.0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,2015,10
3,250000000,112.31295,1084939099,165.0,0,1,0,0,0,1,...,0,0,0,0,0,1,0,0,2012,7
4,260000000,43.926995,284139100,132.0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,2012,3


In [42]:
movies_corr = movies.corr()
movies_corr['vote_average'][(movies_corr['vote_average']>0.15) | (movies_corr['vote_average'] <-0.15)]

popularity      0.273990
revenue         0.197153
runtime         0.373141
vote_average    1.000000
vote_count      0.313263
Drama           0.237234
release_year   -0.198499
Name: vote_average, dtype: float64

### Correlation indicates that the following features have the strongest positive or negative linear association with vote_average 
popularity  
revenue  
runtime  
Drama  
release_year - negative  
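
The filter used above can be written with a single absolute-value comparison. The sketch below applies it to synthetic data; the column names and the data are invented, and only the 0.15 cut-off comes from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)

# Synthetic frame: one positively related, one negatively related, one unrelated column.
toy = pd.DataFrame({
    "target": signal + rng.normal(scale=0.5, size=n),
    "related": signal,
    "inverse": -signal + rng.normal(scale=0.5, size=n),
    "noise": rng.normal(size=n),
})

# Correlation of every column with the target, excluding the target itself.
corr = toy.corr()["target"].drop("target")

# Keep features whose correlation exceeds 0.15 in absolute value.
selected = corr[corr.abs() > 0.15]
print(selected)
```

Dropping the target before filtering avoids listing its trivial correlation of 1.0 with itself.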

In [43]:
from sklearn.utils import shuffle
# shuffle the dataset as there is no time related dependencies
X, y = shuffle(X, y, random_state=0)

In [44]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
print (X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(3361, 58) (3361,) (1441, 58) (1441,)
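
The split above is not seeded, so the scores that follow will change slightly from run to run. A minimal sketch of a reproducible 70/30 split is shown below on placeholder arrays with the same shape as the notebook's data; the `random_state` value is an arbitrary choice. `train_test_split` shuffles by default, so a separate shuffle step is not strictly required.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as X and y in the notebook.
X_demo = np.zeros((4802, 58))
y_demo = np.zeros(4802)

# Fixing random_state makes the split, and therefore the scores, repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)
print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)
```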


In [45]:
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(X_train, y_train)
#Predict Output
lin_predicted = linear.predict(X_test)

linear_score = round(linear.score(X_train, y_train) * 100, 2)
linear_score_test = round(linear.score(X_test, y_test) * 100, 2)
#Equation coefficient and Intercept
print('Linear Regression Score: \n', linear_score)
print('Linear Regression Test Score: \n', linear_score_test)
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

Linear Regression Score: 
 28.98
Linear Regression Test Score: 
 23.61
Coefficient: 
 [-3.05542643e-09  7.92030356e-03  7.47966935e-10  1.34736195e-02
  3.77573686e-02  4.66360059e-01  2.94320727e-02 -7.15678212e-02
 -2.89600197e-01 -5.79958335e-02  7.93726533e-02  6.40812190e-02
  2.05757959e-01  5.38313493e-02  2.42301731e-01  6.24918013e-02
  9.01391999e-01 -8.00000654e-03 -1.06401395e-01 -1.18129704e-01
  6.51665916e-01  1.51639988e-02 -9.21654975e-02  2.30189950e-02
  3.83524561e-02  5.66298179e-02 -6.57909697e-03 -2.42635831e-03
  2.69805414e-02  8.46122329e-02  8.32667268e-17  7.79103575e-02
  1.77742055e-01  1.77442040e-01  3.08872026e-02  2.49216609e-01
  6.01901057e-02  3.70065987e-01  2.66024585e-01  2.68116686e-01
 -2.29653788e-01  3.35666394e-01  9.77924350e-02  2.07653368e-01
 -1.00570199e-01  1.38548659e-01  1.44468577e-01  1.00591839e-01
  1.59914286e-01  2.13116135e-01 -1.68632783e-02  4.49292223e-01
  6.80322962e-02 -1.30661764e+00  2.26160155e+00 -9.54983911e-01
 -1.

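#### Not a great score: the model explains only about 29% of the variance in vote_average on the training set and about 24% on the test set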

In [46]:
print(np.sqrt(mean_squared_error(y_test, lin_predicted)))

1.0398434527041722
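
For reference, the score printed above is the coefficient of determination R² returned by `LinearRegression.score` (multiplied by 100 in the notebook), and the last value is the root mean squared error. The sketch below recomputes both by hand on synthetic data to show what they measure; the data and variable names are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE is the square root of the mean squared error, in the units of the target.
rmse = np.sqrt(mean_squared_error(y_demo, pred))

print(r2_manual, model.score(X_demo, y_demo), rmse)
```

Since vote_average is on a 0 to 10 scale, an RMSE of about 1.04 means a typical prediction is off by roughly one rating point.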


In [47]:
# pair the feature names with the coefficients
list(zip(X.columns, linear.coef_))

[('budget', -3.0554264325211527e-09),
 ('popularity', 0.007920303558048685),
 ('revenue', 7.479669353500917e-10),
 ('runtime', 0.013473619505127907),
 ('Music', 0.03775736861842914),
 ('Drama', 0.4663600594263453),
 ('Fantasy', 0.029432072693834382),
 ('Horror', -0.07156782121801814),
 ('TV Movie', -0.28960019705270273),
 ('Thriller', -0.05799583353439085),
 ('War', 0.07937265328706694),
 ('Mystery', 0.06408121904124732),
 ('Foreign', 0.20575795879655848),
 ('History', 0.053831349329254354),
 ('Crime', 0.24230173120504156),
 ('Comedy', 0.06249180127604961),
 ('Documentary', 0.9013919988371412),
 ('Adventure', -0.008000006543902205),
 ('Family', -0.10640139523539449),
 ('Action', -0.11812970413707757),
 ('Animation', 0.6516659161723337),
 ('Romance', 0.015163998820847957),
 ('Western', -0.09216549747484803),
 ('Science Fiction', 0.023018994971331017),
 ('Warner Bros.', 0.038352456114293554),
 ('Universal Pictures', 0.05662981789910618),
 ('Paramount Pictures', -0.006579096973525128),
 (

In [48]:
pd.DataFrame(list(zip(X.columns, linear.coef_)), columns = ['features', 'coefficients']).sort_values(by='coefficients')

Unnamed: 0,features,coefficients
53,Released,-1.306618
55,Post Production,-0.9549839
8,TV Movie,-0.2896002
40,Regency Enterprises,-0.2296538
19,Action,-0.1181297
18,Family,-0.1064014
44,The,-0.1005702
22,Western,-0.0921655
7,Horror,-0.07156782
9,Thriller,-0.05799583


# 4) Variance Threshold for feature selection
- Normalize the data
- define the threshold
- Calculate variance of each feature
- drop the features with variance below the threshold

## 4.1) Normalize the data - Standard Scaling

In [49]:
bmovies = movies.copy() # Let's take a backup

##### The following features should be normalized
- budget
- popularity
- revenue
- runtime
- vote_count
- release_year
- release_month

In [50]:
from sklearn.preprocessing import StandardScaler
# Standardize each numeric column to zero mean and unit variance (columns are scaled independently)
num_cols = ['budget', 'popularity', 'revenue', 'runtime', 'vote_count', 'release_year', 'release_month']
bmovies[num_cols] = StandardScaler().fit_transform(bmovies[num_cols])

In [51]:
bmovies.head(1)

Unnamed: 0,budget,popularity,revenue,runtime,vote_average,vote_count,Music,Drama,Fantasy,Horror,...,Summit Entertainment,Working Title Films,Amblin Entertainment,The Weinstein Company,StudioCanal,Released,Rumored,Post Production,release_year,release_month
0,5.106771,4.052813,16.614315,2.433671,7.2,8.998969,0,0,1,0,...,0,0,0,0,0,1,0,0,0.526158,1.519959


In [52]:
X = bmovies.drop(['vote_average', 'vote_count'], axis=1)
y = bmovies['vote_average']

## 4.2) Variance Threshold 

In [53]:
from sklearn.feature_selection import VarianceThreshold 

In [54]:
# Create a VarianceThreshold object with threshold 0.8*(1-0.8) = 0.16,
# which drops binary features that take the same value in more than 80% of the rows
selector = VarianceThreshold(threshold = 0.8*(1-0.8))
X_high_variance = selector.fit_transform(X)
X_high_variance

array([[ 5.10677107,  4.0528129 , 16.61431524, ...,  1.        ,
         0.52615844,  1.51995888],
       [ 6.65391347,  3.69590852,  5.39580758, ...,  1.        ,
         0.36503783, -0.52453483],
       [ 5.3032336 ,  2.69934433,  4.90256827, ...,  1.        ,
         1.00952026,  0.93581782],
       ...,
       [-0.71343128, -0.63027545, -0.50522793, ...,  0.        ,
         0.84839965,  0.93581782],
       [-0.71343128, -0.6487405 , -0.50522793, ...,  0.        ,
         0.76783935, -0.52453483],
       [-0.71343128, -0.61501834, -0.50522793, ...,  0.        ,
         0.20391722,  0.35167676]])
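
The threshold 0.8*(1-0.8) comes from the variance of a 0/1 feature, which is p(1-p) when a fraction p of the rows are 1. A feature that is constant in more than 80% of the rows therefore has variance below 0.16 and is removed. The sketch below checks this on a hand-built array; the data is invented for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Ten rows, three binary columns whose share of ones is 50%, 10% and 0%.
X_demo = np.array([[1, 0, 0]] * 5 + [[0, 1, 0]] + [[0, 0, 0]] * 4, dtype=float)

selector_demo = VarianceThreshold(threshold=0.8 * (1 - 0.8))
X_kept = selector_demo.fit_transform(X_demo)

print(selector_demo.variances_)     # per-column variance: p * (1 - p)
print(selector_demo.get_support())  # True for columns with variance above the threshold
```

Standardized columns have variance 1 by construction, so they always pass this threshold; it only ever filters the 0/1 indicator columns.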

In [55]:
X_high_variance[0:4]

array([[ 5.10677107,  4.0528129 , 16.61431524,  2.43367061,  0.        ,
         0.        ,  0.        ,  1.        ,  0.52615844,  1.51995888],
       [ 6.65391347,  3.69590852,  5.39580758,  2.74258824,  0.        ,
         0.        ,  0.        ,  1.        ,  0.36503783, -0.52453483],
       [ 5.3032336 ,  2.69934433,  4.90256827,  1.81583535,  0.        ,
         0.        ,  0.        ,  1.        ,  1.00952026,  0.93581782],
       [ 5.42602268,  2.8544957 ,  6.15685756,  2.56606388,  1.        ,
         1.        ,  0.        ,  1.        ,  0.76783935,  0.05960623]])

In [56]:
columns = X.columns
selector.get_support(indices=True)

array([ 0,  1,  2,  3,  5,  9, 15, 19, 56, 57], dtype=int64)

In [57]:
labels = [columns[x] for x in selector.get_support(indices=True)]
labels

['budget',
 'popularity',
 'revenue',
 'runtime',
 'Drama',
 'Thriller',
 'Comedy',
 'Action',
 'release_year',
 'release_month']

In [59]:
features = pd.DataFrame(X_high_variance, columns = labels)
features.head()

Unnamed: 0,budget,popularity,revenue,runtime,Drama,Thriller,Comedy,Action,release_year,release_month
0,5.106771,4.052813,16.614315,2.433671,0.0,0.0,0.0,1.0,0.526158,1.519959
1,6.653913,3.695909,5.395808,2.742588,0.0,0.0,0.0,1.0,0.365038,-0.524535
2,5.303234,2.699344,4.902568,1.815835,0.0,0.0,0.0,1.0,1.00952,0.935818
3,5.426023,2.854496,6.156858,2.566064,1.0,1.0,0.0,1.0,0.767839,0.059606
4,5.671601,0.705017,1.239533,1.109738,0.0,0.0,0.0,1.0,0.767839,-1.108676


In [60]:
X_high_variance.shape

(4802, 10)

In [61]:
# Subset features
X_new = selector.transform(X)

In [64]:
X_new.shape

(4802, 10)

In [65]:
X_new, y = shuffle(X_new, y, random_state=0)

In [66]:
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.30)
print (X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(3361, 10) (3361,) (1441, 10) (1441,)


In [67]:
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(X_train, y_train)
#Predict Output
lin_predicted = linear.predict(X_test)

linear_score = round(linear.score(X_train, y_train) * 100, 2)
linear_score_test = round(linear.score(X_test, y_test) * 100, 2)
#Equation coefficient and Intercept
print('Linear Regression Score: \n', linear_score)
print('Linear Regression Test Score: \n', linear_score_test)
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

Linear Regression Score: 
 26.3
Linear Regression Test Score: 
 22.1
Coefficient: 
 [-0.11078172  0.32307679  0.12765247  0.23214799  0.50425164 -0.01610419
  0.06014994 -0.16886379 -0.20863147  0.06138356]
Intercept: 
 5.885966231531161


## 4.3) Variance Threshold without normalization

In [68]:
X = movies.drop(['vote_average', 'vote_count'], axis=1)
y = movies['vote_average']

In [71]:
# Create a VarianceThreshold object with threshold 0.8*(1-0.8) = 0.16,
# which drops binary features that take the same value in more than 80% of the rows
selector = VarianceThreshold(threshold = 0.8*(1-0.8))
X_high_variance = selector.fit_transform(X)
X_high_variance[0:4]

array([[2.37000000e+08, 1.50437577e+02, 2.78796509e+09, 1.62000000e+02,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
        2.00900000e+03, 1.20000000e+01],
       [3.00000000e+08, 1.39082615e+02, 9.61000000e+08, 1.69000000e+02,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
        2.00700000e+03, 5.00000000e+00],
       [2.45000000e+08, 1.07376788e+02, 8.80674609e+08, 1.48000000e+02,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
        2.01500000e+03, 1.00000000e+01],
       [2.50000000e+08, 1.12312950e+02, 1.08493910e+09, 1.65000000e+02,
        1.00000000e+00, 1.00000000e+00, 0.00000000e+00, 1.00000000e+00,
        2.01200000e+03, 7.00000000e+00]])

In [72]:
columns = X.columns
selector.get_support(indices=True)

array([ 0,  1,  2,  3,  5,  9, 15, 19, 56, 57], dtype=int64)

In [73]:
labels = [columns[x] for x in selector.get_support(indices=True)]
labels

['budget',
 'popularity',
 'revenue',
 'runtime',
 'Drama',
 'Thriller',
 'Comedy',
 'Action',
 'release_year',
 'release_month']

In [75]:
features = pd.DataFrame(X_high_variance, columns = labels)
features.head()

Unnamed: 0,budget,popularity,revenue,runtime,Drama,Thriller,Comedy,Action,release_year,release_month
0,237000000.0,150.437577,2787965000.0,162.0,0.0,0.0,0.0,1.0,2009.0,12.0
1,300000000.0,139.082615,961000000.0,169.0,0.0,0.0,0.0,1.0,2007.0,5.0
2,245000000.0,107.376788,880674600.0,148.0,0.0,0.0,0.0,1.0,2015.0,10.0
3,250000000.0,112.31295,1084939000.0,165.0,1.0,1.0,0.0,1.0,2012.0,7.0
4,260000000.0,43.926995,284139100.0,132.0,0.0,0.0,0.0,1.0,2012.0,3.0


In [76]:
X_high_variance.shape

(4802, 10)

In [77]:
# Subset features
X_new = selector.transform(X)

In [78]:
X_new, y = shuffle(X_new, y)

In [79]:
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.30)
print (X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(3361, 10) (3361,) (1441, 10) (1441,)


In [80]:
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(X_train, y_train)
#Predict Output
lin_predicted = linear.predict(X_test)

linear_score = round(linear.score(X_train, y_train) * 100, 2)
linear_score_test = round(linear.score(X_test, y_test) * 100, 2)
#Equation coefficient and Intercept
print('Linear Regression Score: \n', linear_score)
print('Linear Regression Test Score: \n', linear_score_test)
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

Linear Regression Score: 
 25.56
Linear Regression Test Score: 
 24.08
Coefficient: 
 [-2.12194634e-09  1.02921762e-02  5.87777891e-10  1.30374261e-02
  4.28508583e-01 -2.33802868e-02  7.31857643e-02 -1.79865353e-01
 -1.62149532e-02  1.10456201e-02]
Intercept: 
 36.69939612012242


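### There seems to be little difference between normalized/scaled data and non-scaled data: both runs select the same 10 features, and the test scores (22.1 vs 24.08) come from different random splits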
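
This is expected for ordinary least squares: rescaling a feature rescales its coefficient, but the fitted predictions and R² stay the same. The sketch below checks this on synthetic data with one large-scale and one small-scale column; the column names and values are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
budget_like = rng.normal(loc=3e7, scale=4e7, size=n)   # large-scale column
runtime_like = rng.normal(loc=105, scale=20, size=n)   # small-scale column
X_raw = np.column_stack([budget_like, runtime_like])
y_demo = 1e-8 * budget_like + 0.02 * runtime_like + rng.normal(scale=0.5, size=n)

X_scaled = StandardScaler().fit_transform(X_raw)

# Fit the same model on raw and on standardized features and compare R^2.
r2_raw = LinearRegression().fit(X_raw, y_demo).score(X_raw, y_demo)
r2_scaled = LinearRegression().fit(X_scaled, y_demo).score(X_scaled, y_demo)

print(r2_raw, r2_scaled)
```

Scaling does matter for methods that compare coefficient sizes across features, such as the recursive feature elimination used later.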

# 5) Recursive Feature elimination with Linear Regression
Recursive feature elimination (RFE) repeatedly fits a model (for example a linear or SVM regression model), sets aside either the best or the worst performing feature (for example based on coefficients), and then repeats the process with the remaining features. This continues until all features in the dataset are exhausted. Features are then ranked according to when they were eliminated. As such, it is a greedy optimization for finding the best performing subset of features.
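
A minimal sketch of the procedure on synthetic data, where only two of five features carry signal, is shown below. The data and the `_demo` names are invented; `RFE`, `ranking_` and `support_` are the scikit-learn API used in the cells that follow.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only columns 0 and 3 influence the target.
y_demo = 5.0 * X_demo[:, 0] - 4.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# Remove one feature per iteration (step=1) until two remain.
rfe_demo = RFE(estimator=LinearRegression(), n_features_to_select=2, step=1)
rfe_demo.fit(X_demo, y_demo)

print(rfe_demo.support_)   # boolean mask of the retained features
print(rfe_demo.ranking_)   # 1 = retained; larger numbers were eliminated earlier
```

Setting `n_features_to_select=1`, as the notebook does, yields a full ranking from 1 to the number of features.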

In [81]:
X = movies.drop(['vote_average', 'vote_count'], axis=1)
y = movies.vote_average

In [82]:
from sklearn.feature_selection import RFE

In [83]:
linear = linear_model.LinearRegression()
rfe = RFE(estimator=linear, n_features_to_select=1)
rfe.fit(X, y)

RFE(estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
  n_features_to_select=1, step=1, verbose=0)

In [85]:
rfe.ranking_

array([56, 46, 57, 41, 22,  4, 50, 25,  6, 37,  9, 29, 28, 13, 20, 36,  8,
       23, 24, 35,  7, 49, 40, 55, 51, 18, 53, 19, 52, 17, 58, 26, 16, 43,
       45, 12, 54, 11, 21,  2, 31, 14, 42, 34, 39, 27, 32, 47, 15,  5,  3,
       10, 30,  1, 38, 33, 44, 48])

## 5.1) Ranking of the features based on recursive feature elimination

In [86]:
feature_ranking = pd.DataFrame(data=X.columns.values,  index=rfe.ranking_, columns=['Feature'])
feature_ranking.sort_index(inplace=True)
feature_ranking

Unnamed: 0,Feature
1,Released
2,United Artists
3,Amblin Entertainment
4,Drama
5,Working Title Films
6,TV Movie
7,Animation
8,Documentary
9,War
10,The Weinstein Company


### Observations:
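The features displayed in the dataframe above are the ones RFE eliminated last, i.e. the ones it ranks as most important for predicting vote_average. Note that budget, popularity and revenue rank near the bottom (56, 46 and 57 in rfe.ranking_).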
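
One caveat when reading this ranking: with a linear model, RFE eliminates the feature with the smallest squared coefficient at each step, and coefficient size depends on the feature's units. The sketch below, on invented data, shows a strong large-scale predictor being eliminated before a weak 0/1 indicator on raw data, and the order reversing after standardization.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
big = rng.normal(scale=1e6, size=n)                # strong predictor, large scale
dummy = rng.integers(0, 2, size=n).astype(float)   # weak binary predictor
y_demo = 3.0 * big / 1e6 + 0.5 * dummy + rng.normal(scale=0.5, size=n)
X_raw = np.column_stack([big, dummy])


def rank(X):
    """Return the RFE ranking (1 = kept last) for the columns of X."""
    rfe = RFE(LinearRegression(), n_features_to_select=1).fit(X, y_demo)
    return rfe.ranking_.tolist()


raw_rank = rank(X_raw)
scaled_rank = rank(StandardScaler().fit_transform(X_raw))
print(raw_rank, scaled_rank)
```

This may explain why the indicator columns dominate the top of the ranking while budget and revenue sit at the bottom.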

## 5.2) Run linear regression against top 10 ranking features

In [99]:
X1 = movies[feature_ranking.Feature[:10].values.tolist()]
y1 = movies['vote_average']

In [100]:
X1, y1 = shuffle(X1, y1)

In [101]:
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.30)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(3361, 10) (3361,) (1441, 10) (1441,)


In [102]:
linear = linear_model.LinearRegression()
# Fit the model on the training split
linear.fit(X_train, y_train)
# Predict on the held-out test split
lin_predicted = linear.predict(X_test)

# score() returns R^2; scale it to a percentage for display
linear_score = round(linear.score(X_train, y_train) * 100, 2)
linear_score_test = round(linear.score(X_test, y_test) * 100, 2)
# Report the scores, the coefficients and the intercept
print('Linear Regression Score: \n', linear_score)
print('Linear Regression Test Score: \n', linear_score_test)
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

Linear Regression Score: 
 9.14
Linear Regression Test Score: 
 5.51
Coefficient: 
 [1.58063197 0.61581653 0.50276488 0.64122135 0.49884275 0.04259413
 0.57449226 0.5217057  0.39644427 0.32207732]
Intercept: 
 4.152234815008736


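### Observations:
The results are very poor: the model reaches an R² of only 9.14% on the training set and 5.51% on the test set. This indicates that the top 10 features are not enough and that the model needs more features to reach a decent score.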
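
One way to see how the score responds to the number of features is to sweep over several subset sizes. The following is a sketch on synthetic data from `make_regression` (the data, the subset sizes, and the names `X_syn`, `scores_by_k` are assumptions for illustration); it ranks features with RFE and then scores a linear model on the top `k` of them.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression problem: 20 features, 10 of them informative.
X_syn, y_syn = make_regression(n_samples=600, n_features=20, n_informative=10,
                               noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.30, random_state=0)

# Rank on the training split only, so the test split stays unseen.
ranking = RFE(LinearRegression(), n_features_to_select=1).fit(X_tr, y_tr).ranking_

scores_by_k = {}
for k in (1, 5, 10, 20):
    cols = ranking <= k                      # boolean mask of the k best-ranked features
    model = LinearRegression().fit(X_tr[:, cols], y_tr)
    scores_by_k[k] = model.score(X_te[:, cols], y_te)   # test R^2
    print(k, round(scores_by_k[k], 3))
```

Ranking on the training split only is a choice made in this sketch; the notebook ranks on the full dataset before splitting.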

## 5.3) Recursive feature elimination with cross-validation
RFECV runs RFE inside a cross-validation loop to find the best number of features. Below, recursive feature elimination is applied to linear regression, with the number of selected features tuned automatically by cross-validation.
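
As a small self-contained illustration of the same idea, the sketch below runs `RFECV` on synthetic data with the fold count and scoring spelled out (the data, the 5-fold `KFold`, and `scoring="r2"` are assumptions chosen for the example, not the settings used in this notebook).

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression problem: 12 features, 4 of them informative.
X_syn, y_syn = make_regression(n_samples=300, n_features=12, n_informative=4,
                               noise=5.0, random_state=1)

selector = RFECV(
    estimator=LinearRegression(),
    step=1,                                              # drop one feature per round
    cv=KFold(n_splits=5, shuffle=True, random_state=1),  # explicit, reproducible folds
    scoring="r2",                                        # cross-validated R^2
)
selector.fit(X_syn, y_syn)

print(selector.n_features_)   # number of features kept
print(selector.support_)      # boolean mask of kept features
print(selector.ranking_)      # kept features have rank 1
```

Every kept feature receives rank 1, so `ranking_` does not order the selected features among themselves.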

In [103]:
X = movies.drop(['vote_average', 'vote_count'], axis=1)
y = movies.vote_average

In [104]:
from sklearn.feature_selection import RFECV
# Create the RFECV object and compute cross-validated scores.
# With scoring=None, RFECV uses the estimator's own score method, which is R^2 for LinearRegression.
# cv=None falls back to scikit-learn's default K-fold split.
rfecv = RFECV(estimator=linear, step=1, cv=None)
rfecv.fit(X, y)

RFECV(cv=None,
   estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
   n_jobs=1, scoring=None, step=1, verbose=0)

In [105]:
print("Number of features: %d" % rfecv.n_features_)
print('Selected features: %s' % list(X.columns[rfecv.support_]))

Number of features: 48
Selected features: ['popularity', 'runtime', 'Music', 'Drama', 'Horror', 'TV Movie', 'Thriller', 'War', 'Mystery', 'Foreign', 'History', 'Crime', 'Comedy', 'Documentary', 'Adventure', 'Family', 'Action', 'Animation', 'Western', 'Universal Pictures', 'Twentieth Century Fox Film Corporation', 'New Line Cinema', 'Touchstone Pictures', 'Walt Disney Pictures', 'Relativity Media', 'Columbia Pictures Corporation', 'Miramax Films', 'DreamWorks SKG', 'Canal+', 'United Artists', 'Regency Enterprises', 'Fox Searchlight Pictures', 'Dune Entertainment', 'Lionsgate', ' The', 'Fox 2000 Pictures', 'TriStar Pictures', 'Dimension Films', 'Summit Entertainment', 'Working Title Films', 'Amblin Entertainment', 'The Weinstein Company', 'StudioCanal', 'Released', 'Rumored', 'Post Production', 'release_year', 'release_month']


In [106]:
rfecv.ranking_

array([ 9,  1, 10,  1,  1,  1,  3,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  2,  1,  8,  4,  1,  6,  1,  5,  1, 11,  1,  1,  1,
        1,  1,  7,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1])

In [107]:
# Print the names of the features RFECV kept (rank 1)
for name in X.columns[rfecv.ranking_ == 1]:
    print(name)

popularity
runtime
Music
Drama
Horror
TV Movie
Thriller
War
Mystery
Foreign
History
Crime
Comedy
Documentary
Adventure
Family
Action
Animation
Western
Universal Pictures
Twentieth Century Fox Film Corporation
New Line Cinema
Touchstone Pictures
Walt Disney Pictures
Relativity Media
Columbia Pictures Corporation
Miramax Films
DreamWorks SKG
Canal+
United Artists
Regency Enterprises
Fox Searchlight Pictures
Dune Entertainment
Lionsgate
 The
Fox 2000 Pictures
TriStar Pictures
Dimension Films
Summit Entertainment
Working Title Films
Amblin Entertainment
The Weinstein Company
StudioCanal
Released
Rumored
Post Production
release_year
release_month


In [108]:
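# All 48 selected features share rank 1, so [:30] keeps the first 30 in column order, not the 30 strongest
X1 = movies[list(X.columns[rfecv.support_])[:30]]
y1 = movies['vote_average']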

In [109]:
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.30)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(3361, 30) (3361,) (1441, 30) (1441,)


In [110]:
linear = linear_model.LinearRegression()
# Fit the model on the training split
linear.fit(X_train, y_train)
# Predict on the held-out test split
lin_predicted = linear.predict(X_test)

# score() returns R^2; scale it to a percentage for display
linear_score = round(linear.score(X_train, y_train) * 100, 2)
linear_score_test = round(linear.score(X_test, y_test) * 100, 2)
# Report the scores, the coefficients and the intercept
print('Linear Regression Score: \n', linear_score)
print('Linear Regression Test Score: \n', linear_score_test)
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

Linear Regression Score: 
 24.81
Linear Regression Test Score: 
 22.47
Coefficient: 
 [ 0.01052922  0.01454402  0.29127499  0.48645024 -0.03635568 -0.51891091
 -0.04659197  0.12388023  0.06826012  0.39056471  0.02099693  0.16277431
  0.06937506  0.72140612  0.00466596 -0.06585815 -0.1825031   0.58797897
 -0.14306288  0.11905353  0.05181367  0.10179242  0.09687233  0.07331992
 -0.03578154  0.05793024  0.32745721  0.16727662  0.34931422  0.58438213]
Intercept: 
 3.9809781303709992
