# **Practical Assignment: Predicting Movie Success**

## Problem Description

* A movie is considered a success if it earned more money than it cost to make.
* The goal of this project is to predict, from the other attributes, whether a movie actually earned more than it cost.
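
The success criterion above can be written as a binary label. The following is a minimal sketch, assuming the dataset exposes `gross` and `budget` columns (both are used later in this notebook). The `success` column name and the toy values are illustrative only and are not part of the original code.

```python
import pandas as pd

# Toy rows standing in for movies.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "budget": [19_000_000, 4_500_000, 18_000_000],
    "gross": [46_000_000, 3_000_000, 18_000_000],
})

# A movie counts as a success only if it grossed strictly more than its budget.
toy["success"] = (toy["gross"] > toy["budget"]).astype(int)
print(toy["success"].tolist())
```

A label built this way could serve as the target of a classifier, whereas the models below predict `gross` directly as a regression target.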

## Program Setup

### Libraries and Functions

In [None]:

import sklearn as skl
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from datetime import datetime

from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score

# Models to use
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC

# Quality metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import f1_score
from sklearn.metrics import fbeta_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, Conv2D, MaxPooling2D, BatchNormalization
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.preprocessing import MinMaxScaler



### Loading the dataset

In [None]:
df = pd.read_csv('movies.csv')

## Data Exploration

This stage also relied on the information published on the Kaggle website, such as:
* Attribute types
* Attribute contents
* Missing values

### Inspecting the data

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.tail()

### Information about the attributes

In [None]:
df.info()

In [None]:
df.describe()

### Missing Values

In [None]:
df.isna().any()

In [None]:
df.isna().sum()

### Correlation Matrix

In [None]:
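# Restrict to numeric columns: recent pandas versions raise an error when corr() meets string columns
corr_matrix = df.select_dtypes(include='number').corr()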
f, ax = plt.subplots(figsize=(20,20))
sns.heatmap(corr_matrix, vmin=-1, vmax=1, square=True, annot=True)

### Pairplot

### Histograms

## Data Preparation

### Handling missing values

In [None]:
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)

### Dropping attributes

In [None]:
df.drop(['name'], axis=1, inplace=True)

#### String to Dates

Remove the rows whose release date does not match the expected `Month day, Year` format.

In [187]:
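dropped = 0
for i, value in enumerate(df['released']):
    # Entries have the form "<Month> <day>, <year> (<country>)"; keep only the date part
    date_part = str(value).split(' (')[0]
    try:
        datetime.strptime(date_part, '%B %d, %Y')
    except ValueError:
        # Mark unparseable dates as missing so that dropna() removes the row
        df.at[i, 'released'] = pd.NA
        dropped += 1

df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)

print("Dropped ", dropped, "rows")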

df['released'] = df['released'].apply(lambda x: x.split(' ('))

df['released_date'] =  df['released'].apply(lambda x: x[0])
df['released_date'] = pd.to_datetime(df['released_date'] , format='%B %d, %Y')
df['released_country'] =  df['released'].apply(lambda x: x[1][:-1])

df['released_year']  = df['released_date'].dt.year
df['released_month'] = df['released_date'].dt.month
df['released_day']   = df['released_date'].dt.day
df['released_dayOfWeek']   = df['released_date'].dt.dayofweek

df = df.drop(["released"], axis=1)
df = df.drop(["released_date"], axis=1)

Dropped  14 rows
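
The date clean-up above can also be sketched without an explicit loop. This is an alternative, not the notebook's own method, and it assumes that `released` values follow the pattern `<Month> <day>, <year> (<country>)`. The toy rows below are made up; `errors="coerce"` turns dates that do not match the format into `NaT` so they can be filtered out.

```python
import pandas as pd

# Made-up rows standing in for the `released` column.
toy = pd.DataFrame({"released": [
    "June 13, 1980 (United States)",
    "1981 (France)",  # no month or day, so it does not match the format
    "July 2, 1980 (United States)",
]})

# Keep the text before " (" and parse it; dates that do not match become NaT.
date_part = toy["released"].str.partition(" (")[0]
parsed = pd.to_datetime(date_part, format="%B %d, %Y", errors="coerce")

# Keep only the rows whose date could be parsed.
valid = toy[parsed.notna()].reset_index(drop=True)
print("Dropped", int(parsed.isna().sum()), "rows")
```

Parsing the whole column at once avoids relying on the DataFrame index matching a manual counter.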


### Converting attribute types

#### Frequency Encoding

In [188]:
# Frequency encoding: add, for each category, the number of rows in which it appears

df['directorFrequency'] = df['director'].map(df['director'].value_counts())
df['writerFrequency'] = df['writer'].map(df['writer'].value_counts())
df['starFrequency'] = df['star'].map(df['star'].value_counts())
df['companyFrequency'] = df['company'].map(df['company'].value_counts())

#### Label Encoding

In [189]:

label_encoder = preprocessing.LabelEncoder()

# Label encoding: replace each category with an integer code
df['director'] = label_encoder.fit_transform(df['director'])
df['writer'] = label_encoder.fit_transform(df['writer'])
df['star'] = label_encoder.fit_transform(df['star'])
df['company'] = label_encoder.fit_transform(df['company'])
df['country'] = label_encoder.fit_transform(df['country'])
df['genre'] = label_encoder.fit_transform(df['genre'])
df['rating'] = label_encoder.fit_transform(df['rating'])
df['released_country'] = label_encoder.fit_transform(df['released_country'])

df.head()

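| | rating | genre | year | score | votes | director | writer | star | country | budget | ... | runtime | released_country | released_year | released_month | released_day | released_dayOfWeek | directorFrequency | writerFrequency | starFrequency | companyFrequency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 6 | 1980 | 8.4 | 927000.0 | 1791 | 2828 | 694 | 46 | 19000000.0 | ... | 146.0 | 47 | 1980 | 6 | 13 | 4 | 2 | 29 | 18 | 298 |
| 1 | 6 | 1 | 1980 | 5.8 | 65000.0 | 1574 | 1155 | 213 | 47 | 4500000.0 | ... | 104.0 | 47 | 1980 | 7 | 2 | 2 | 4 | 2 | 4 | 302 |
| 2 | 4 | 0 | 1980 | 8.7 | 1200000.0 | 754 | 1815 | 1151 | 47 | 18000000.0 | ... | 124.0 | 47 | 1980 | 6 | 20 | 4 | 3 | 1 | 2 | 10 |
| 3 | 4 | 4 | 1980 | 7.7 | 221000.0 | 885 | 1410 | 1467 | 47 | 3500000.0 | ... | 88.0 | 47 | 1980 | 7 | 2 | 2 | 6 | 6 | 3 | 279 |
| 4 | 6 | 4 | 1980 | 7.3 | 108000.0 | 716 | 349 | 270 | 47 | 6000000.0 | ... | 98.0 | 47 | 1980 | 7 | 25 | 4 | 9 | 2 | 16 | 46 |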


#### Type of the dependent attribute

## Models

### Preliminary setup

#### Separating the dependent variable

In [190]:
X_train = df.drop(['gross'], axis=1)
Y_train = df['gross'].to_frame()

#### Splitting into training and test dataframes

##### Training mode

In [191]:
X_train, X_test, Y_train, Y_test = train_test_split(X_train, Y_train, test_size=0.25, random_state=42)

### Training the models

#### Decision Tree

#### Linear Regression

#### MLP


In [192]:
X = df.drop(['gross'], axis=1)
y = df[['gross']]

In [193]:
scaler_x = MinMaxScaler(feature_range=(0, 1)).fit(X)
scaler_y = MinMaxScaler(feature_range=(0, 1)).fit(y)
x_scaled = pd.DataFrame(scaler_x.transform(X[X.columns]), columns=X.columns)
y_scaled = pd.DataFrame(scaler_y.transform(y[y.columns]), columns=y.columns)

In [194]:
X_train, X_test, Y_train, Y_test = train_test_split(x_scaled, y_scaled, test_size=0.25, random_state=42)

In [195]:
def build_model(activation = 'relu', optimizer = 'adam', dropout = 0.2, neurons = 32, layers = 1):
    model = Sequential()
    model.add(Dense(neurons, activation=activation, input_shape=(X_train.shape[1],)))
    model.add(Dropout(dropout))
    for i in range(layers):
        model.add(Dense(neurons, activation=activation))
        model.add(Dropout(dropout))
    model.add(Dense(1))
    model.compile(optimizer=optimizer, loss='mse', metrics=['mae'])
    return model

In [196]:
model = build_model()
model.summary()


### Predictions

#### Decision Tree

#### Linear Regression

#### MLP

In [None]:
predictions =  model.predict(X_test)
predictions = predictions.reshape(predictions.shape[0],1)

In [None]:
y_test_unscaled = scaler_y.inverse_transform(Y_test)
y_test_unscaled[:5]

In [None]:
predictions_unscaled = scaler_y.inverse_transform(predictions)
predictions_unscaled[:5]

## Quality Metrics

### Decision Tree

#### Accuracy

#### Metric averaging modes

* __micro__: Calculate metrics globally by counting the total true positives, false negatives and false positives. 
* __macro__: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
* __weighted__: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
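
The three averaging modes listed above can be compared on a small example. This is a sketch on made-up labels, not on the movie data; it uses `precision_score` from scikit-learn with its `average` parameter.

```python
from sklearn.metrics import precision_score

# Made-up binary labels: class 0 has four true instances, class 1 has two.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]

# Compute precision under each averaging mode.
scores = {
    avg: precision_score(y_true, y_pred, average=avg)
    for avg in ("micro", "macro", "weighted")
}
for avg, value in scores.items():
    print(f"{avg:>8}: {value:.3f}")
```

Here the per-label precisions are 1.0 for class 0 and 0.5 for class 1, so the three modes give different values because the classes have different supports and different numbers of predictions.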

#### Per-label metrics

#### Cross Validation

Cross-validation splits the data into several folds and evaluates the model on each of them. A large standard deviation between the fold scores suggests that the model is unstable and may be overfitting.
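
The check described above can be sketched with `cross_val_score`. The example below runs on synthetic data from `make_classification` rather than on the movie dataset, and the tree depth and number of folds are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the movie features and a binary success label.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

clf = DecisionTreeClassifier(max_depth=4, random_state=42)

# One accuracy score per fold; the spread across folds is what matters here.
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print(f"Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")
```

Comparing the standard deviation with the mean gives a rough idea of how much the score depends on the particular split.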

### Linear Regression

### MLP

In [None]:
from livelossplot import PlotLossesKeras

In [None]:
print('MAE:', mean_absolute_error(Y_test, predictions))
print('MSE:', mean_squared_error(Y_test, predictions))
print('RMSE:', np.sqrt(mean_squared_error(Y_test, predictions)))

In [None]:
plt.scatter(y_test_unscaled, predictions_unscaled)

In [None]:
def real_vs_predicted(limit):
    plt.figure(figsize=(10,10))
    plt.plot(y_test_unscaled[:limit], label='Real', color='blue')
    plt.plot(predictions_unscaled[:limit], label='Predicted', color='red')
    plt.grid()
    plt.xlabel('Index')
    plt.ylabel('Gross')
    plt.title('Real vs Predicted')
    plt.legend()
    plt.show()
    
    
real_vs_predicted(100)