<a href="https://colab.research.google.com/github/rajgit-123/MyProject/blob/master/Artificial_Neural_Networks_IMDBsynopsis_Kaggle_Submitted.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Artificial Neural Networks Competition
# **Description**
Participants will be provided with a dataset containing various input features and corresponding multi-label classifications. The objective is to build a neural network model that can accurately predict the appropriate labels for each data point. However, the unique aspect of this challenge lies in optimizing the hyperparameters and architecture of the neural network to achieve the best possible multi-label classification performance.

# **Key Tasks:**

# Data Preparation:
Contestants will need to preprocess the dataset, including tasks like data cleaning, feature normalization, and encoding of categorical variables. Proper data preparation is crucial for building a reliable multi-label classification model.

# Model Building:
Participants will create neural network models for multi-label classification. They can experiment with various architectures, such as the number of layers, types of layers, activation functions, and regularization techniques. Model design is a creative and vital aspect of this challenge.
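
To make the multi-label setting concrete, the sketch below shows in plain Python what a multi-label output layer computes: one independent sigmoid per label, scored with binary cross-entropy averaged over the labels. It is an illustration only, not the competition model. The logits, targets, and the 0.5 decision threshold are made-up assumptions.

```python
import math

def sigmoid(z):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over the labels of one example."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical logits and targets for three labels of a single example.
logits = [2.0, -1.0, 0.0]
targets = [1, 0, 1]

probs = [sigmoid(z) for z in logits]
predicted = [1 if p >= 0.5 else 0 for p in probs]  # 0.5 threshold is an assumption
loss = binary_cross_entropy(targets, probs)
print(probs, predicted, loss)
```

Because each label has its own sigmoid, several labels can be switched on for the same example, which a softmax output would not allow.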
# Hyperparameter Tuning:
An integral part of this challenge is optimizing the hyperparameters. Contestants will explore hyperparameters such as the learning rate, batch size, number of epochs, and dropout rate. Techniques such as grid search and random search can be used to find the best combination.
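
The difference between grid search and random search can be sketched in plain Python as follows. The search space and the `toy_score` function are hypothetical stand-ins; a real run would train a model and return a validation score instead.

```python
import itertools
import random

# Hypothetical search space; the values are illustrative, not tuned.
search_space = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [32, 64, 128],
    "dropout_rate": [0.2, 0.3, 0.5],
}

def toy_score(params):
    """Stand-in for a validation score (higher is better)."""
    return -abs(params["learning_rate"] - 1e-3) - abs(params["dropout_rate"] - 0.3)

def grid_search(space, score_fn):
    """Evaluate every combination in the space and return the best one."""
    keys = list(space)
    candidates = [dict(zip(keys, values)) for values in itertools.product(*space.values())]
    return max(candidates, key=score_fn), len(candidates)

def random_search(space, score_fn, n_iter, seed=0):
    """Evaluate n_iter randomly sampled combinations and return the best one."""
    rng = random.Random(seed)
    candidates = [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_iter)]
    return max(candidates, key=score_fn)

best_grid, n_grid = grid_search(search_space, toy_score)
best_random = random_search(search_space, toy_score, n_iter=5)
print(n_grid, best_grid, best_random)
```

Grid search evaluates all 27 combinations here, while random search evaluates only the 5 it samples, so it is cheaper but may miss the best combination.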

# Validation and Evaluation:
Participants will need to implement a robust validation strategy to assess their model's performance. Common metrics for multi-label classification, such as F1-score, precision, recall, and Hamming loss, can be used.
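
As a rough illustration of what these metrics measure, the sketch below computes Hamming loss and micro-averaged precision, recall, and F1 in plain Python on made-up label matrices. In practice, scikit-learn's `hamming_loss`, `precision_score`, `recall_score`, and `f1_score` (with `average='micro'`) cover the same calculations.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(t != p for row_t, row_p in zip(y_true, y_pred) for t, p in zip(row_t, row_p))
    total = sum(len(row) for row in y_true)
    return wrong / total

def micro_precision_recall_f1(y_true, y_pred):
    """Micro-averaged scores: pool true/false positives over all labels first."""
    tp = fp = fn = 0
    for row_t, row_p in zip(y_true, y_pred):
        for t, p in zip(row_t, row_p):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up data: 3 examples with 4 labels each.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 1, 0], [1, 1, 0, 1]]

h_loss = hamming_loss(y_true, y_pred)
precision, recall, f1 = micro_precision_recall_f1(y_true, y_pred)
print(h_loss, precision, recall, f1)
```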

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten, Dropout, LSTM
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint
#from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import RandomizedSearchCV

### Loading csv file and creating a dataframe
- The cell below mounts Google Drive and reads the training, test, and validation CSV files into pandas DataFrames (`train_df`, `test_df`, `val_df`) with `pd.read_csv`.
- It then prints a preview and the shape of the training data, prints the shape of the validation data, and renames the test set's `imdb_id` column to `ID` for the submission file.

In [None]:
from google.colab import drive
import pandas as pd
drive.mount('/content/drive', force_remount=True)
train_file_path='/content/drive/MyDrive/ADSA010/ANN-IMDB-Assign/train_mpst.csv'
test_file_path='/content/drive/MyDrive/ADSA010/ANN-IMDB-Assign/test.csv'
val_file_path='/content/drive/MyDrive/ADSA010/ANN-IMDB-Assign/val.csv'
# Load each CSV into its own DataFrame
train_df = pd.read_csv(train_file_path)
test_df = pd.read_csv(test_file_path)
val_df = pd.read_csv(val_file_path)
print(train_df.head())  # preview the first five rows
print(train_df.shape)
print(val_df.shape)

# Work on a renamed copy so that test_df keeps its original 'imdb_id' column
df_testcopy = test_df.rename(columns={'imdb_id': 'ID'})

Mounted at /content/drive
<bound method NDFrame.head of         imdb_id                                          title  \
0     tt0057603                        I tre volti della paura   
1     tt1733125  Dungeons & Dragons: The Book of Vile Darkness   
2     tt0113862                             Mr. Holland's Opus   
3     tt0249380                                      Baise-moi   
4     tt0408790                                     Flightplan   
...         ...                                            ...   
9484  tt0045053                          The Prisoner of Zenda   
9485  tt0074646                                     Hot Potato   
9486  tt0102592                                 One False Move   
9487  tt1371159                                     Iron Man 2   
9488  tt0063443                                     Play Dirty   

                                          plot_synopsis synopsis_source  \
0     Note: this synopsis is for the orginal Italian...            imdb   
1

### Finding percentage of null values
- The function percent_null is designed to compute and display the percentage of null or missing values in each column of a given pandas DataFrame.

In [None]:
def percent_null(df):
    """Return the percentage of missing values in each column of df."""
    return df.isnull().sum() * 100 / len(df)

print(percent_null(train_df))

In [None]:
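import re

def format_data(text):
    """Normalize a synopsis: keep letters only, collapse whitespace, lowercase."""
    text = re.sub(r"'", "", text)            # drop apostrophes
    text = re.sub(r"[^a-zA-Z]", " ", text)   # replace non-letters with spaces
    text = ' '.join(text.split())            # collapse runs of whitespace
    return text.lower()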

In [None]:
print(train_df.shape)

(9489, 75)


### Preprocess data
- The cell below applies `format_data` to the `plot_synopsis` column of the training, test, and validation DataFrames, so that every synopsis is reduced to lowercase letters separated by single spaces. It then prints each DataFrame's column names.
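
As a small illustration of the cleaning applied to `plot_synopsis`, the sketch below runs the same steps as `format_data` on a made-up sample string. `clean_text` is a hypothetical variant that also returns an empty string for non-string input, on the assumption that an empty CSV field may be read as NaN.

```python
import re

def clean_text(value):
    """Hypothetical variant of format_data that tolerates missing values."""
    if not isinstance(value, str):  # e.g. NaN read from an empty CSV field
        return ""
    text = value.replace("'", "")            # drop apostrophes: "Holland's" -> "Hollands"
    text = re.sub(r"[^a-zA-Z]", " ", text)   # replace every non-letter with a space
    return " ".join(text.split()).lower()    # collapse whitespace and lowercase

sample = "Note: Mr. Holland's Opus (1995) -- a 2-hour drama!"
cleaned = clean_text(sample)
missing = clean_text(float("nan"))
print(cleaned)
print(repr(missing))
```

Note that digits are removed along with punctuation, so "2-hour" is reduced to "hour".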

In [None]:
# Clean each DataFrame's own synopsis column
train_df['plot_synopsis'] = train_df['plot_synopsis'].apply(format_data)
test_df['plot_synopsis'] = test_df['plot_synopsis'].apply(format_data)
val_df['plot_synopsis'] = val_df['plot_synopsis'].apply(format_data)
print(train_df.columns)
print(test_df.columns)
print(val_df.columns)


Index(['imdb_id', 'title', 'plot_synopsis', 'synopsis_source', 'absurd',
       'action', 'adult comedy', 'allegory', 'alternate history',
       'alternate reality', 'anti war', 'atmospheric', 'autobiographical',
       'avant garde', 'blaxploitation', 'bleak', 'boring', 'brainwashing',
       'christian film', 'claustrophobic', 'clever', 'comedy', 'comic',
       'cruelty', 'cult', 'cute', 'dark', 'depressing', 'dramatic',
       'entertaining', 'fantasy', 'feel-good', 'flashback', 'good versus evil',
       'gothic', 'grindhouse film', 'haunting', 'historical',
       'historical fiction', 'home movie', 'horror', 'humor', 'insanity',
       'inspiring', 'intrigue', 'magical realism', 'melodrama', 'murder',
       'mystery', 'neo noir', 'non fiction', 'paranormal', 'philosophical',
       'plot twist', 'pornographic', 'prank', 'psychedelic', 'psychological',
       'queer', 'realism', 'revenge', 'romantic', 'sadist', 'satire', 'sci-fi',
       'sentimental', 'storytelling', 'stupid',

### Vectorizing the synopsis text
- The synopses are converted into numeric features with scikit-learn's `TfidfVectorizer`, which ignores terms that appear in more than 80% of the documents (`max_df=0.8`) and keeps at most 10,000 terms (`max_features=10000`).
- The vectorizer is fitted on the training synopses only and then reused to transform the validation and test synopses. The label matrix is taken from every column after the first four.
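
The text features in the cells below come from TF-IDF weighting. As a rough sketch of what that weighting does, the plain-Python function below applies the smoothed IDF formula and L2 normalization that scikit-learn documents for `TfidfVectorizer` defaults. It is a simplification: it splits on whitespace and does not reproduce scikit-learn's tokenizer, `max_df`, or `max_features` pruning. The three documents are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Smoothed, L2-normalized TF-IDF for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(df)
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1.0 for term in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

docs = ["a detective hunts a killer", "a killer escapes", "a love story"]
vocab, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

A term that occurs in every document (here "a") gets the minimum IDF of 1, so rarer terms such as "love" receive a higher weight within the same document.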

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000)

In [None]:
print(test_df.shape)
X_train = train_df['plot_synopsis'].values
X_test = test_df['plot_synopsis'].values
X_val = val_df['plot_synopsis'].values
y_train = train_df.iloc[:, 4:].values


y_val = val_df.iloc[:, 4:].values

(2966, 4)


In [None]:
xtrain_TfidfVect = tfidf_vectorizer.fit_transform(X_train)
xval_TfidfVect = tfidf_vectorizer.transform(X_val)
xtest_TfidfVect = tfidf_vectorizer.transform(X_test)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

Logreg = LogisticRegression()
OnevsRestclf = OneVsRestClassifier(Logreg)
OnevsRestclf.fit(xtrain_TfidfVect, y_train)


y_pred = OnevsRestclf.predict(xval_TfidfVect)
y_pred_test = OnevsRestclf.predict(xtest_TfidfVect)

print(y_pred_test.shape)
print(df_testcopy.shape)

(2966, 71)
(2966, 4)


In [None]:
# Label columns are every column after the first four metadata columns,
# matching how y_train was built from train_df.iloc[:, 4:].
label_cols = train_df.columns[4:].tolist()

# Attach the predicted labels to a copy of the test frame
df_op_cols = df_testcopy.copy()
df_op_cols[label_cols] = y_pred_test

# Keep only the ID and the label columns for the submission file
submission_df = df_op_cols.filter(['ID'] + label_cols)
submission_df.to_csv('submission-imdb.csv', index=False)


In [None]:
# print(train_df.shape)
# from sklearn.feature_extraction.text import CountVectorizer
# cols_to_convert=train_df[['imdb_id','title','plot_synopsis','synopsis_source']]
# ext_data = cols_to_convert.apply(lambda x: ' '.join(x), axis=1)
# # Initialize CountVectorizer
# count_vectorizer = CountVectorizer()

# # Fit and transform the text data
# bow_matrix = count_vectorizer.fit_transform(ext_data)

# # Get the feature names (vocabulary)
# feature_names = count_vectorizer.get_feature_names_out()

# # Convert the BoW matrix to a DataFrame for better visualization (optional)
# bow_df = pd.DataFrame(bow_matrix.toarray(), columns=feature_names)

# # Display the BoW DataFrame
# print(bow_df)

# X = bow_df.values  # Features
# y=train_df.iloc[:, [0] + list(range(4, 74))]
# print(y.shape)

# X_train, X_test, y_train, y_test = train_test_split(bow_matrix, y, test_size=0.2, random_state=42)

# Initialize the RandomForestClassifier
# rf_clf = RandomForestClassifier()

# # Training the classifier
# rf_clf.fit(X_train, y_train)

# # Evaluating the classifier
# accuracy = rf_clf.score(X_test, y_test)
# print("Test Accuracy:", accuracy)

# non_numeric_columns = train_df.iloc[:, [1,3]].columns.tolist()  # Assuming non-numeric columns are at index 1
# for column in non_numeric_columns:
#     train_df[column] = train_df[column].astype(str)

# def process_categorical(train_df):
#   categorical_features=['imdb_id','title','plot_synopsis','synopsis_source']
#   one_hot_df = pd.get_dummies(train_df,columns=categorical_features)
#   return one_hot_df

# df_processed=process_categorical(train_df)
# X_train=train_df[['title','plot_synopsis','synopsis_source']]
# print(X_train.shape)



# transformers = []
# for column in non_numeric_columns:
#     transformers.append((column, OneHotEncoder(), [column]))

# col_transformer = ColumnTransformer(transformers=transformers, remainder='passthrough')
# X_transformed = col_transformer.fit_transform(X_train)

# print(X_transformed.head)
# print(train_df['plot_synopsis'].unique())
# scaler = StandardScaler()
# X_scaled = scaler.fit_transform(X_train)

# print(test_df.shape)
# X_t=test_df[['title','plot_synopsis','synopsis_source']]
# print(X_t.head)

# y_t=test_df.iloc[:, [0] + list(range(4, 74))]
# print(y_t.head)



(9489, 75)
      00  000  0000  00000006  0008  000bc  000th  001  0014  002  ...  風林館高校  \
0      0    0     0         0     0      0      0    0     0    0  ...      0   
1      0    0     0         0     0      0      0    0     0    0  ...      0   
2      0    0     0         0     0      0      0    0     0    0  ...      0   
3      0    0     0         0     0      0      0    0     0    0  ...      0   
4      0    2     0         0     0      0      0    0     0    0  ...      0   
...   ..  ...   ...       ...   ...    ...    ...  ...   ...  ...  ...    ...   
9484   0    0     0         0     0      0      0    0     0    0  ...      0   
9485   0    0     0         0     0      0      0    0     0    0  ...      0   
9486   0    0     0         0     0      0      0    0     0    0  ...      0   
9487   0    0     0         0     0      0      0    0     0    0  ...      0   
9488   0    0     0         0     0      0      0    0     0    0  ...      0   

      馮婉瑜  駒王学園 

TypeError: '<' not supported between instances of 'int' and 'str'

### Splitting dataset to train and test
- The cell below holds out 20% of the training data as a local test split with `train_test_split` (fixed `random_state=42`), then defines `create_model`, which builds an Embedding + LSTM network compiled with the Adam optimizer and binary cross-entropy loss.

In [None]:
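# Hold out 20% of the TF-IDF training features and labels as a local test split.
# This rebinds X_train and X_test; the Kaggle test features remain in xtest_TfidfVect.
#X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(xtrain_TfidfVect, y_train, test_size=0.2, random_state=42)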


# Define the neural network architecture
def create_model(vocab_size, num_labels, learning_rate=0.001, embedding_dim=128, lstm_units=64, dropout_rate=0.2):
    # vocab_size: size of the token vocabulary; num_labels: number of tag columns (71 here)
    model = Sequential()
    model.add(Embedding(vocab_size, embedding_dim))
    model.add(LSTM(lstm_units, dropout=dropout_rate))
    model.add(Dense(64, activation='relu'))
    # One sigmoid unit per label, so each tag is predicted independently
    model.add(Dense(num_labels, activation='sigmoid'))

    optimizer = Adam(learning_rate=learning_rate)
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model


### Model fitting
- The cell below creates a `RandomForestClassifier` and defines the hyperparameter search space used to tune it: the number of trees (`n_estimators`), the maximum tree depth (`max_depth`), the minimum number of samples needed to split a node (`min_samples_split`) or to form a leaf (`min_samples_leaf`), and whether bootstrap sampling is used (`bootstrap`).
- The search space contains 3 × 4 × 3 × 3 × 2 = 216 combinations.

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

### Find the model performance in test data
The cell below tunes the random forest with randomized search and reports its performance on the held-out split.

It works in the following steps:

1. `RandomizedSearchCV` samples 10 hyperparameter combinations from `param_grid` and scores each with 3-fold cross-validation.
2. The best combination is read from `best_params_` and printed.
3. A new `RandomForestClassifier` is trained with those parameters on the training split.
4. Its accuracy on the held-out split is printed.
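
The cell below calls `score` on a classifier with multi-label targets. In scikit-learn, that call returns subset accuracy, where an example counts as correct only if every one of its labels is predicted correctly. A minimal plain-Python sketch of that calculation, on made-up labels, looks like this:

```python
def subset_accuracy(y_true, y_pred):
    """Share of examples whose full label vector is predicted exactly."""
    exact = sum(list(t) == list(p) for t, p in zip(y_true, y_pred))
    return exact / len(y_true)

# Made-up data: 4 examples with 3 labels each.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 0]]

acc = subset_accuracy(y_true, y_pred)
print(acc)  # 2 of 4 rows match exactly
```

With many labels per example, subset accuracy tends to be much lower than per-label measures, so a low value does not by itself mean that most individual labels are wrong.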

In [None]:
# Performing Randomized Search Cross Validation
random_search = RandomizedSearchCV(estimator=classifier, param_distributions=param_grid, n_iter=10, cv=3, random_state=42)
random_search.fit(X_train, y_train)

# Getting the best parameters
best_params = random_search.best_params_
print("Best Parameters:", best_params)

# Training the model with the best parameters
best_rf_clf = RandomForestClassifier(**best_params)
best_rf_clf.fit(X_train, y_train)

# Evaluating the model
accuracy = best_rf_clf.score(X_test, y_test)
print("Test Accuracy:", accuracy)

TypeError: '<' not supported between instances of 'int' and 'str'